Published as a conference paper at ICLR 2022

MODEL-BASED OFFLINE META-REINFORCEMENT LEARNING WITH REGULARIZATION

Sen Lin1, Jialin Wan1, Tengyu Xu2, Yingbin Liang2, Junshan Zhang1,3
1School of ECEE, Arizona State University
2Department of ECE, The Ohio State University
3Department of ECE, University of California, Davis
{slin70, jwan20}@asu.edu, {xu.3260, liang.889}@osu.edu, jazh@ucdavis.edu

Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good-quality datasets, indicating that a right balance has to be delicately calibrated between exploring the out-of-distribution state-actions by following the meta-policy and exploiting the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we propose model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using both conservative policy evaluation and regularized policy improvement; the intrinsic tradeoff therein is achieved via striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring performance improvement on new tasks via offline Meta-RL. Experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.

1 INTRODUCTION

Offline reinforcement learning (a.k.a. batch RL) has recently attracted extensive attention by learning from offline datasets previously collected via some behavior policy (Kumar et al., 2020). However, the performance of existing offline RL methods could degrade significantly due to the following issues: 1) the possibly poor quality of offline datasets (Levine et al., 2020) and 2) the inability to generalize to different environments (Li et al., 2020b). To tackle these challenges, offline Meta-RL (Li et al., 2020a; Dorfman & Tamar, 2020; Mitchell et al., 2020; Li et al., 2020b) has emerged very recently by leveraging the knowledge of similar offline RL tasks (Yu et al., 2021a). The main aim of these studies is to enable quick policy adaptation for new offline tasks, by learning a meta-policy with robust task structure inference that captures the structural properties across training tasks.

Figure 1: FOCAL vs. COMBO.

Because tasks are trained on offline datasets, value overestimation (Fujimoto et al., 2019) inevitably occurs in dynamic-programming-based offline Meta-RL, resulting from the distribution shift between the behavior policy and the learnt task-specific policy.
To guarantee the learning performance on new offline tasks, a right balance has to be carefully calibrated between exploring the out-of-distribution state-actions by following the meta-policy, and exploiting the offline dataset by staying close to the behavior policy. However, such a unique exploration-exploitation tradeoff has not been considered in existing offline Meta-RL approaches, which would likely limit their ability to handle diverse offline datasets, particularly those with good behavior policies. To illustrate this issue more concretely, we compare the performance of a state-of-the-art offline Meta-RL algorithm, FOCAL (Li et al., 2020b), and an offline single-task RL method, COMBO (Yu et al., 2021b), on two new offline tasks. As illustrated in Figure 1, while FOCAL performs better than COMBO on the task with a bad-quality dataset (left plot in Figure 1), it is outperformed by COMBO on the task with a good-quality dataset (right plot in Figure 1). Clearly, existing offline Meta-RL fails in several standard environments (see Figure 1 and Figure 11) to generalize universally well over datasets with varied quality.

In order to fill such a substantial gap, we seek to answer the following key question in offline Meta-RL: How to design an efficient offline Meta-RL algorithm to strike the right balance between exploring with the meta-policy and exploiting the offline dataset?

To this end, we propose MerPO, a model-based offline Meta-RL approach with regularized Policy Optimization, which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. Compared to existing approaches, MerPO achieves: (1) safe policy improvement: performance improvement can be guaranteed for offline tasks regardless of the quality of the dataset, by striking the right balance between exploring with the meta-policy and exploiting the offline dataset; and (2) better generalization capability: through a conservative utilization of the learnt model to generate synthetic data, MerPO aligns well with a recently emerging trend in supervised meta-learning to improve the generalization ability by augmenting the tasks with more data (Rajendran et al., 2020; Yao et al., 2021). Our main contributions can be summarized as follows:

(1) Learnt dynamics models not only serve as a natural remedy for task structure inference in offline Meta-RL, but also facilitate better exploration of out-of-distribution state-actions by generating synthetic rollouts. With this insight, we develop a model-based approach, where an offline meta-model is learnt to enable efficient task model learning for each offline task. More importantly, we propose a meta-regularized model-based actor-critic method (RAC) for within-task policy optimization, where a novel regularized policy improvement module is devised to calibrate the unique exploration-exploitation tradeoff by using an interpolation between two regularizers, one based on the behavior policy and the other on the meta-policy. Intuitively, RAC generalizes COMBO to the multi-task setting, by introducing a novel regularized policy improvement module to strike a right balance between the impacts of the meta-policy and the behavior policy.

(2) We theoretically show that under mild conditions, the learnt task-specific policy based on MerPO offers safe performance improvement over both the behavior policy and the meta-policy with high probability.
Our results also provide guidance for the algorithm design in terms of how to appropriately select the weights in the interpolation, such that the performance improvement can be guaranteed for new offline RL tasks.

(3) We conduct extensive experiments to evaluate the performance of MerPO. More specifically, the experiments clearly show the safe policy improvement offered by MerPO, corroborating our theoretical results. Further, the superior performance of MerPO over existing offline Meta-RL methods suggests that model-based approaches can be more beneficial in offline Meta-RL.

2 RELATED WORK

Offline single-task RL. Many existing model-free offline RL methods regularize the learnt policy to be close to the behavior policy by, e.g., distributional matching (Fujimoto et al., 2019), support matching (Kumar et al., 2019), importance sampling (Nachum et al., 2019; Liu et al., 2020), or learning lower bounds of true Q-values (Kumar et al., 2020). Along a different avenue, model-based algorithms learn policies by leveraging a dynamics model obtained with the offline dataset. (Matsushima et al., 2020) directly constrains the learnt policy to the behavior policy as in model-free algorithms. To penalize the policy for visiting states where the learnt model is likely to be incorrect, MOPO (Yu et al., 2020) and MOReL (Kidambi et al., 2020) modify the learnt dynamics such that the value estimates are conservative when the model uncertainty is above a threshold. To remove the need for uncertainty quantification, COMBO (Yu et al., 2021b) is proposed by combining model-based policy optimization (Janner et al., 2019) and conservative policy evaluation (Kumar et al., 2020).

Offline Meta-RL. A few very recent studies have explored offline Meta-RL. Particularly, (Li et al., 2020a) considers a special scenario where the task identity is spuriously inferred due to biased datasets, and applies the triplet loss to robustify the task inference with reward relabelling. (Dorfman & Tamar, 2020) extends an online Meta-RL method, VariBAD (Zintgraf et al., 2019), to the offline setup, and assumes known reward functions and shared dynamics across tasks. Based on MAML (Finn et al., 2017), (Mitchell et al., 2020) proposes an offline Meta-RL algorithm with an advantage weighting loss, and learns initializations for both the value function and the policy, where the offline dataset is assumed to consist of full trajectories in order to evaluate the advantage. Based on the off-policy Meta-RL method PEARL (Rakelly et al., 2019), (Li et al., 2020b) combines the idea of a deterministic context encoder and behavior regularization, under the assumption of deterministic MDPs. Different from the above works, we study a more general offline Meta-RL problem. More importantly, MerPO strikes a right balance between exploring with the meta-policy and exploiting the offline dataset, which guarantees safe performance improvement for new offline tasks.

3 PRELIMINARIES

Consider a Markov decision process (MDP) $M = (S, A, T, r, \mu_0, \gamma)$ with state space $S$, action space $A$, environment dynamics $T(s'|s, a)$, reward function $r(s, a)$, initial state distribution $\mu_0$, and discount factor $\gamma \in (0, 1)$. Without loss of generality, we assume that $|r(s, a)| \leq R_{\max}$. Given a policy $\pi$, let $d^\pi_M(s) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t P_M(s_t = s|\pi)$ denote the discounted marginal state distribution, where $P_M(s_t = s|\pi)$ denotes the probability of being in state $s$ at time $t$ by rolling out $\pi$ in $M$.
Accordingly, let $d^\pi_M(s, a) := d^\pi_M(s)\pi(a|s)$ denote the discounted marginal state-action distribution, and $J(M, \pi) := \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim d^\pi_M(s,a)}[r(s, a)]$ denote the expected discounted return. The goal of RL is to find the optimal policy that maximizes $J(M, \pi)$.

In offline RL, no interactions with the environment are allowed, and we only have access to a fixed dataset $D = \{(s, a, r, s')\}$ collected by some unknown behavior policy $\pi_\beta$. Let $d^{\pi_\beta}_M(s)$ be the discounted marginal state distribution of $\pi_\beta$. The dataset $D$ is indeed sampled from $d^{\pi_\beta}_M(s, a) = d^{\pi_\beta}_M(s)\pi_\beta(a|s)$. Denote $\overline{M}$ as the empirical MDP induced by $D$ and $d(s, a)$ as a sample-based version of $d^{\pi_\beta}_M(s, a)$.

In offline Meta-RL, consider a distribution of RL tasks $p(M)$ as in standard Meta-RL (Finn et al., 2017; Rakelly et al., 2019), where each task $M_n$ is an MDP, i.e., $M_n = (S, A, T_n, r_n, \mu_{0,n}, \gamma)$, with task-shared state and action spaces, and unknown task-specific dynamics and reward function. For each task $M_n$, no interactions with the environment are allowed and we only have access to an offline dataset $D_n$, collected by some unknown behavior policy $\pi_{\beta,n}$. The main objective is to learn a meta-policy based on a set of offline training tasks $\{M_n\}_{n=1}^N$.

Conservative Offline Model-Based Policy Optimization (COMBO). Recent model-based offline RL algorithms, e.g., COMBO (Yu et al., 2021b), have demonstrated promising performance on a single offline RL task by combining model-based policy optimization (Janner et al., 2019) and conservative policy evaluation (CQL (Kumar et al., 2020)). Simply put, COMBO first trains a dynamics model $\widehat{T}_\theta(s'|s, a)$, parameterized by $\theta$, via supervised learning on the offline dataset $D$. The learnt MDP is constructed as $\widehat{M} := (S, A, \widehat{T}, r, \mu_0, \gamma)$. Then, the policy is learnt using $D$ and model-generated rollouts. Specifically, define the action-value function (Q-function) as $Q^\pi(s, a) := \mathbb{E}[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)\,|\,s_0 = s, a_0 = a]$, and the empirical Bellman operator as $\widehat{\mathcal{B}}^\pi Q(s, a) = r(s, a) + \gamma\mathbb{E}_{(s,a,s')\sim D}[Q(s', a')]$ for $a' \sim \pi(\cdot|s')$. To penalize the Q-functions on out-of-distribution state-action tuples, COMBO employs conservative policy evaluation based on CQL:

$\widehat{Q}^{k+1} \leftarrow \arg\min_Q\ \beta\big(\mathbb{E}_{s,a\sim\rho}[Q(s, a)] - \mathbb{E}_{s,a\sim D}[Q(s, a)]\big) + \frac{1}{2}\mathbb{E}_{s,a,s'\sim d_f}\big[(Q(s, a) - \widehat{\mathcal{B}}^\pi\widehat{Q}^k(s, a))^2\big]$ (1)

where $\rho(s, a) := d^\pi_{\widehat{M}}(s)\pi(a|s)$ is the discounted marginal distribution when rolling out $\pi$ in $\widehat{M}$, and $d_f(s, a) = f\, d^{\pi_\beta}_M(s, a) + (1-f)\rho(s, a)$ for $f \in [0, 1]$. The Bellman backup $\widehat{\mathcal{B}}^\pi$ over $d_f$ can be interpreted as an $f$-interpolation of the backup operators under the empirical MDP (denoted by $\mathcal{B}^\pi_{\overline{M}}$) and the learnt MDP (denoted by $\mathcal{B}^\pi_{\widehat{M}}$). Given the Q-estimate $\widehat{Q}^\pi$, the policy can be learnt by:

$\pi \leftarrow \arg\max_\pi\ \mathbb{E}_{s\sim\rho(s), a\sim\pi(\cdot|s)}\big[\widehat{Q}^\pi(s, a)\big].$ (2)

4 MERPO: MODEL-BASED OFFLINE META-RL WITH REGULARIZED POLICY OPTIMIZATION

Figure 2: Model-based offline Meta-RL with learning of offline meta-model and offline meta-policy.

Learnt dynamics models not only serve as a natural remedy for task structure inference in offline Meta-RL, but also facilitate better exploration of out-of-distribution state-actions by generating synthetic rollouts (Yu et al., 2021b). Thus motivated, we propose a general framework of model-based offline Meta-RL, as depicted in Figure 2. More specifically, the offline meta-model is first learnt by using supervised meta-learning, based on which the task-specific model can be quickly adapted.
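As a concrete illustration of this first step, the sketch below shows one way the meta-model learning and task-model adaptation (formalized as Eqs. (3)-(4) in Section 4.1, and as the outer averaging step of Algorithm 2 in the appendix) could be implemented, assuming a Gaussian dynamics network trained by maximum likelihood and a first-order, Reptile-style outer update. The class and function names are illustrative placeholders, not the authors' code.

```python
import copy
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts a Gaussian over (next_state - state, reward) given (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        out_dim = 2 * (state_dim + 1)  # mean and log-std for next-state delta and reward
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def nll(self, state, action, next_state, reward):
        mean, log_std = self.net(torch.cat([state, action], -1)).chunk(2, -1)
        target = torch.cat([next_state - state, reward], -1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        return -dist.log_prob(target).sum(-1).mean()  # negative log-likelihood

def adapt_task_model(meta_model, task_batches, eta=1.0, lr=1e-3, steps=25):
    """Inner step (cf. Eq. (4)): fit a task model with a proximal penalty to the meta-model."""
    task_model = copy.deepcopy(meta_model)
    opt = torch.optim.Adam(task_model.parameters(), lr=lr)
    for _, (s, a, r, s2) in zip(range(steps), task_batches):
        prox = sum(((p - q.detach()) ** 2).sum()
                   for p, q in zip(task_model.parameters(), meta_model.parameters()))
        loss = task_model.nll(s, a, s2, r) + eta * prox
        opt.zero_grad(); loss.backward(); opt.step()
    return task_model

def meta_model_update(meta_model, task_models, meta_lr=5e-2):
    """Outer step: move the meta-model toward the average of the adapted task models
    (a first-order approximation of the proximal meta-objective in Eq. (3))."""
    with torch.no_grad():
        for i, p in enumerate(meta_model.parameters()):
            avg = torch.stack([list(m.parameters())[i] for m in task_models]).mean(0)
            p.add_(meta_lr * (avg - p))
```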
Then, the main attention of this study is devoted to the learning of an informative meta-policy via bi-level optimization, where 1) a model-based policy optimization approach is leveraged in the inner loop for each task to learn a task-specific policy; and 2) the meta-policy is then updated in the outer loop based on the learnt task-specific policies.

4.1 OFFLINE META-MODEL LEARNING

Learning a meta-model based on the set of offline datasets $\{D_n\}_{n=1}^N$ can be carried out via supervised meta-learning. Many gradient-based meta-learning techniques can be applied here, e.g., MAML (Finn et al., 2017) and Reptile (Nichol et al., 2018). In what follows, we outline the basic idea to leverage the higher-order information of the meta-objective function. Specifically, we consider a proximal meta-learning approach, following the same line as in (Zhou et al., 2019):

$\min_{\phi_m}\ L_{model}(\phi_m) = \mathbb{E}_{M_n}\Big[\mathbb{E}_{(s,a,s')\sim D_n}\big[-\log\widehat{T}_{\theta_n}(s'|s, a)\big] + \eta\|\theta_n - \phi_m\|_2^2\Big]$ (3)

where the learnt dynamics for each task $M_n$ is parameterized by $\theta_n$ and the meta-model is parameterized by $\phi_m$. Solving Eq. (3) leads to an offline meta-model. Given the learnt meta-model $\widehat{T}_{\phi_m^*}$, the dynamics model for an individual offline task $j$ can be found by solving the following problem via gradient descent with initialization $\widehat{T}_{\phi_m^*}$ using $D_j$, i.e.,

$\min_{\theta_j}\ \mathbb{E}_{(s,a,s')\sim D_j}\big[-\log\widehat{T}_{\theta_j}(s'|s, a)\big] + \eta\|\theta_j - \phi_m^*\|_2^2.$ (4)

Compared to learning the dynamics model from scratch, adapting from $\widehat{T}_{\phi_m^*}$ can quickly generate a dynamics model for task identity inference by leveraging the knowledge from similar tasks, and hence improve the sample efficiency (Finn et al., 2017; Zhou et al., 2019).

4.2 OFFLINE META-POLICY LEARNING

In this section, we turn attention to one main challenge in this study: How to learn an informative offline meta-policy in order to achieve the optimal tradeoff between exploring the out-of-distribution state-actions by following the meta-policy and exploiting the offline dataset by staying close to the behavior policy? Clearly, it is highly desirable for the meta-policy to safely explore out-of-distribution state-action pairs, and for each task to utilize the meta-policy to mitigate the issue of value overestimation.

4.2.1 HOW DO EXISTING PROXIMAL META-RL APPROACHES PERFORM?

Proximal Meta-RL approaches have demonstrated remarkable performance in the online setting (e.g., (Wang et al., 2020)) by explicitly regularizing the task-specific policy to be close to the meta-policy. We first consider the approach that applies the online proximal Meta-RL method directly to offline Meta-RL, which would lead to the within-task objective

$\mathbb{E}_{s\sim\rho_n, a\sim\pi_n(\cdot|s)}\big[\widehat{Q}^\pi_n(s, a)\big] - \lambda D(\pi_n, \pi_c)$ (5)

where $\pi_c$ is the offline meta-policy, $\pi_n$ is the task-specific policy, $\rho_n$ is the state marginal of $\rho_n(s, a)$ for task $n$, and $D(\cdot, \cdot)$ is some distance measure between two probability distributions. To alleviate value overestimation, conservative policy evaluation can be applied to learn $\widehat{Q}^\pi_n$ by using Eq. (1). Intuitively, Eq. (5) corresponds to generalizing COMBO to the multi-task setting, where a meta-policy $\pi_c$ is learned to regularize the within-task policy optimization.

Figure 3: Performance of proximal Meta-RL, Eq. (5).

To get a sense of how the meta-policy learnt using Eq. (5) performs, we evaluate it on an offline variant of the standard Meta-RL benchmark Walker-2D-Params with good-quality datasets, and evaluate the testing performance of the task-specific policy after fine-tuning based on the learnt meta-policy, with respect to the meta-training steps.
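Before turning to the empirical results in Figure 3 (discussed right after this sketch), it may help to make the conservative policy evaluation step concrete. The following is a minimal PyTorch-style rendering of Eq. (1): the Q-function is pushed down on state-actions generated by rolling out the current policy in the learnt model, pushed up on state-actions from the offline dataset, and trained with a Bellman error over the f-mixture $d_f$. The batch containers and the `policy.sample` interface are assumptions made for illustration, and the backup is simplified relative to COMBO's exact f-interpolated operator.

```python
import torch
import torch.nn.functional as F

def conservative_critic_loss(q_net, target_q_net, policy,
                             mixed_batch, rollout_batch, offline_batch,
                             beta=1.0, gamma=0.99):
    """One step of conservative policy evaluation in the spirit of Eq. (1).

    mixed_batch   : (s, a, r, s2) drawn from the f-mixture d_f of offline data and model rollouts
    rollout_batch : (s, a) drawn from rho, i.e., rolling out the current policy in the learnt model
    offline_batch : (s, a) drawn from the offline dataset D
    """
    s, a, r, s2 = mixed_batch

    # Bellman error over d_f; the backup bootstraps with actions from the current policy.
    # (In COMBO the backup is itself an f-interpolation of empirical and model backups;
    #  a single target network over the mixed batch keeps this sketch short.)
    with torch.no_grad():
        a2 = policy.sample(s2)
        td_target = r + gamma * target_q_net(s2, a2)
    bellman_error = 0.5 * F.mse_loss(q_net(s, a), td_target)

    # Conservative regularizer: push Q down on model-rollout state-actions,
    # push Q up on dataset state-actions.
    q_rollout = q_net(*rollout_batch).mean()
    q_data = q_net(*offline_batch).mean()
    return beta * (q_rollout - q_data) + bellman_error
```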
As can be seen in Figure 3, the proximal Meta-RL algorithm in Eq. (5) performs surprisingly poorly and fails to learn an informative meta-policy, despite conservative policy evaluation being applied in within-task policy optimization to deal with value overestimation. In particular, the testing performance degrades along with the meta-training process, implying that the quality of the learnt meta-policy is in fact decreasing.

Why does the proximal Meta-RL method in Eq. (5) perform poorly in offline Meta-RL, even with conservative policy evaluation? To answer this, it is worth taking a closer look at the within-task policy optimization in Eq. (5), which is given as follows:

$\pi_n \leftarrow \arg\max_{\pi_n}\ \mathbb{E}_{s\sim\rho_n, a\sim\pi_n(\cdot|s)}\big[\widehat{Q}^\pi_n(s, a)\big] - \lambda D(\pi_n, \pi_c).$ (6)

Clearly, the performance of Eq. (6) depends heavily on the quality of the meta-policy $\pi_c$. A poor meta-policy may have a negative impact on the performance and result in a task-specific policy $\pi_n$ that is even outperformed by the behavior policy $\pi_{\beta,n}$. Without online exploration, the quality of $\pi_n$ could not be improved, which in turn leads to a worse meta-policy $\pi_c$ through Eq. (5). The iterative meta-training process would eventually result in the performance degradation seen in Figure 3. In a nutshell, simply following the meta-policy may lead to worse performance on offline tasks when $\pi_\beta$ is a better policy than $\pi_c$. Since it is infeasible to guarantee the superiority of the meta-policy a priori, it is necessary to balance the tradeoff between exploring with the meta-policy and exploiting the offline dataset, in order to guarantee the performance improvement on new offline tasks.

4.2.2 SAFE POLICY IMPROVEMENT WITH META-REGULARIZATION

To tackle the above challenge, we next devise a novel regularized policy improvement for within-task policy optimization of task $n$, through a weighted interpolation of two different regularizers based on the behavior policy $\pi_{\beta,n}$ and the meta-policy $\pi_c$, given as follows:

$\pi_n \leftarrow \arg\max_{\pi_n}\ \mathbb{E}_{s\sim\rho_n, a\sim\pi_n(\cdot|s)}\big[\widehat{Q}^\pi_n(s, a)\big] - \lambda\alpha D(\pi_n, \pi_{\beta,n}) - \lambda(1-\alpha)D(\pi_n, \pi_c),$ (7)

for some $\alpha \in [0, 1]$. Here, $\alpha$ controls the trade-off between staying close to the behavior policy and following the meta-policy to explore out-of-distribution state-actions. Intuitively, as $\alpha$ gets closer to 0, the policy improvement is less conservative and tends to improve the task-specific policy $\pi_n$ towards the actions in $\pi_c$ that have the highest estimated Q-values. Compared to Eq. (6), the exploration penalty induced by $D(\pi_n, \pi_{\beta,n})$ serves as a safeguard and stops $\pi_n$ from following $\pi_c$ over-optimistically.

Algorithm 1 RAC
1: Train dynamics model $\widehat{T}_{\theta_n}$ using $D_n$;
2: for k = 1, 2, ... do
3:   Perform model rollouts starting from states in $D_n$ and add them into $D_{model,n}$;
4:   Policy evaluation by recursively solving Eq. (1) using $D_n \cup D_{model,n}$;
5:   Improve the policy by solving Eq. (7);
6: end for

Safe Policy Improvement Guarantee. Based on the conservative policy evaluation in Eq. (1) and the regularized policy improvement in Eq. (7), we obtain the meta-regularized model-based actor-critic method (RAC), as outlined in Algorithm 1. Note that different distribution distance measures can be used in Eq. (7). In this work, we theoretically show that the policy $\pi_n(a|s)$ learnt by RAC is a safe improvement over both the behavior policy $\pi_{\beta,n}$ and the meta-policy $\pi_c$ on the underlying MDP $M_n$, when using the maximum total-variation distance for $D(\pi_1, \pi_2)$, i.e., $D(\pi_1, \pi_2) := \max_s D_{TV}(\pi_1\|\pi_2)$. For convenience, define $\nu_n(\rho, f) = \mathbb{E}_\rho\big[(\rho(s, a) - d_n(s, a))/d_{f,n}(s, a)\big]$, and let $\delta \in (0, 1/2)$.
We have the following important result on the safe policy improvement achieved by $\pi_n(a|s)$.

Theorem 1. (a) Let $\epsilon = \frac{\beta[\nu_n(\rho^{\pi_n}, f) - \nu_n(\rho^{\pi_{\beta,n}}, f)]}{2\lambda(1-\gamma)D(\pi_n, \pi_{\beta,n})}$. If $\nu_n(\rho^{\pi_n}, f) - \nu_n(\rho^{\pi_{\beta,n}}, f) > 0$ and $\alpha \geq \max\{\tfrac{1}{2} - \epsilon, 0\}$, then $J(M_n, \pi_n) \geq \max\{J(M_n, \pi_c) + \xi_1,\ J(M_n, \pi_{\beta,n}) + \xi_2\}$ holds with probability at least $1 - 2\delta$, where both $\xi_1$ and $\xi_2$ are positive for large enough $\beta$ and $\lambda$;

(b) More generally, we have that $J(M_n, \pi_n) \geq \max\{J(M_n, \pi_c) + \xi_1,\ J(M_n, \pi_{\beta,n}) + \xi_2\}$ holds with probability at least $1 - 2\delta$ when $\alpha \in (0, 1/2)$, where $\xi_1$ is positive for large enough $\lambda$.

Remark 1. The expressions of $\xi_1$ and $\xi_2$ are involved and can be found in Eq. (14) and Eq. (15) in the appendix. In part (a) of Theorem 1, both $\xi_1$ and $\xi_2$ are positive for large enough $\beta$ and $\lambda$, pointing to guaranteed improvements over $\pi_c$ and $\pi_{\beta,n}$. Because the dynamics $T_{\widehat{M}_n}$ learnt via supervised learning is close to the true dynamics $T_{M_n}$ on the states visited by the behavior policy $\pi_{\beta,n}$, $d^{\pi_{\beta,n}}_{\widehat{M}_n}(s, a)$ is close to $d^{\pi_{\beta,n}}_{M_n}(s, a)$ and $\rho^{\pi_{\beta,n}}$ is close to $d_n(s, a)$, indicating that the condition $\nu_n(\rho^{\pi_n}, f) - \nu_n(\rho^{\pi_{\beta,n}}, f) > 0$ is expected to hold in practical scenarios (Yu et al., 2021b). For more general cases, a slightly weaker result is obtained in part (b) of Theorem 1, where $\xi_1$ is positive for large enough $\lambda$ and $\xi_2$ can be negative.

Remark 2. Intuitively, the selection of $\alpha$ balances the impact of $\pi_{\beta,n}$ and $\pi_c$, while delicately leaning toward the meta-policy $\pi_c$ because $\pi_{\beta,n}$ has already played an important role in policy evaluation to find a lower bound of the Q-value. As a result, Eq. (7) maximizes the true Q-value while being implicitly regularized by a weighted combination, instead of an $\alpha$-interpolation, between $D(\pi_n, \pi_{\beta,n})$ and $D(\pi_n, \pi_c)$, where the weights are carefully balanced through $\alpha$. In particular, in the tabular setting, the conservative policy evaluation in Eq. (1) corresponds to penalizing the Q-estimate (Yu et al., 2021b):

$\widehat{Q}^{k+1}_n(s, a) = \widehat{\mathcal{B}}^\pi\widehat{Q}^k_n(s, a) - \beta\,\frac{\rho(s, a) - d_n(s, a)}{d_{f,n}(s, a)}.$ (8)

Clearly, $\epsilon$ increases with the value of the penalty term in Eq. (8). As a result, when the policy evaluation in Eq. (1) is overly conservative, the lower bound of $\alpha$ will be close to 0, and hence the regularizer based on the meta-policy $\pi_c$ can play a bigger role so as to encourage the exploration of out-of-distribution state-actions following the guidance of $\pi_c$. On the other hand, when the policy evaluation in Eq. (1) is less conservative, the lower bound of $\alpha$ will be close to $\frac{1}{2}$, and the regularizer based on $\pi_{\beta,n}$ will have more impact, leaning towards exploiting the offline dataset. In fact, the introduction of 1) the behavior policy-based regularizer and 2) the interpolation for modeling the interaction between the behavior policy and the meta-policy is the key to proving Theorem 1.

Practical Implementation. In practice, we can use the KL divergence to replace the total variation distance between policies, based on Pinsker's inequality: $\|\pi_1 - \pi_2\| \leq \sqrt{2D_{KL}(\pi_1\|\pi_2)}$. Moreover, since the behavior policy $\pi_{\beta,n}$ is typically unknown, we can use the reverse KL-divergence between $\pi_n$ and $\pi_{\beta,n}$ to circumvent the estimation of $\pi_{\beta,n}$, following the same line as in (Fakoor et al., 2021): $D_{KL}(\pi_{\beta,n}\|\pi_n) = \mathbb{E}_{a\sim\pi_{\beta,n}}[\log\pi_{\beta,n}(a|s)] - \mathbb{E}_{a\sim\pi_{\beta,n}}[\log\pi_n(a|s)]$, where the first term does not depend on $\pi_n$ and $\mathbb{E}_{a\sim\pi_{\beta,n}}[\log\pi_n(a|s)] \approx \mathbb{E}_{(s,a)\sim D_n}[\log\pi_n(a|s)]$. Then, the task-specific policy can be learnt by solving the following problem:

$\max_{\pi_n}\ \mathbb{E}_{s\sim\rho_n, a\sim\pi_n(\cdot|s)}\big[\widehat{Q}^\pi_n(s, a)\big] + \lambda\alpha\,\mathbb{E}_{(s,a)\sim D_n}[\log\pi_n(a|s)] - \lambda(1-\alpha)D_{KL}(\pi_n\|\pi_c).$ (9)
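Pairing with the critic sketch above, the following is a minimal sketch of the practical within-task policy improvement in Eq. (9). It assumes Gaussian policies that expose `log_prob` and reparameterized sampling, and its defaults mirror the reported hyperparameters ($\alpha = 0.4$, $\lambda = 1$); the interface names are placeholders rather than the authors' implementation.

```python
import torch

def rac_actor_loss(q_net, policy, meta_policy, rollout_states, offline_batch,
                   alpha=0.4, lam=1.0):
    """One step of the regularized policy improvement in Eq. (9); returns a loss to minimize.

    rollout_states : states sampled from rho_n (model rollouts)
    offline_batch  : (s, a) pairs from the offline dataset D_n
    """
    # Q-term: reparameterized actions from the current task policy at rollout states.
    actions, logp_pi = policy.rsample_with_log_prob(rollout_states)
    q_term = q_net(rollout_states, actions).mean()

    # Behavior regularizer (reverse-KL surrogate): log-likelihood of dataset actions.
    s_off, a_off = offline_batch
    behavior_term = policy.log_prob(s_off, a_off).mean()

    # Meta-policy regularizer: Monte-Carlo estimate of KL(pi_n || pi_c).
    # (Only the task-policy optimizer is stepped, so meta-policy parameters stay fixed.)
    logp_meta = meta_policy.log_prob(rollout_states, actions)
    kl_to_meta = (logp_pi - logp_meta).mean()

    objective = q_term + lam * alpha * behavior_term - lam * (1.0 - alpha) * kl_to_meta
    return -objective
```

In Algorithm 1, this actor step alternates with the conservative policy evaluation step; in the outer loop described next (Eq. (10)), essentially the same objective is maximized over the meta-policy $\pi_c$ with the task-specific policies held fixed.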
4.2.3 OFFLINE META-POLICY UPDATE

Building on RAC, the offline meta-policy $\pi_c$ is updated by taking the following two steps in an iterative manner: 1) (inner loop) given the meta-policy $\pi_c$, RAC is run for each training task to obtain the task-specific policy $\pi_n$; 2) (outer loop) based on $\{\pi_n\}_n$, $\pi_c$ is updated by solving:

$\max_{\pi_c}\ \mathbb{E}_{s\sim\rho_n, a\sim\pi_n(\cdot|s)}\big[\widehat{Q}^\pi_n(s, a)\big] + \lambda\alpha\,\mathbb{E}_{(s,a)\sim D_n}[\log\pi_n(a|s)] - \lambda(1-\alpha)D_{KL}(\pi_n\|\pi_c)$ (10)

where both $\rho_n$ and $\widehat{Q}^\pi_n$ are from the last iteration of the inner loop for each training task. By using RAC in the inner loop for within-task policy optimization, the learnt task-specific policy $\pi_n$ and the meta-policy $\pi_c$ work in concert to regularize the policy search for each other, and improve in a manner akin to positive feedback. Here the regularizer based on the behavior policy provides an important initial force to get the policy optimization off the ground: RAC in the inner loop aims to improve the task-specific policy over the behavior policy at the outset, and the improved task-specific policy consequently regularizes the meta-policy search as in Eq. (10), leading to a better meta-policy eventually. Note that a meta-Q network is learnt using first-order meta-learning to initialize the task-specific Q networks. It is also worth noting that different tasks can have different values of $\alpha$ to capture the heterogeneity of dataset qualities across tasks.

Figure 4: Performance comparison among COMBO, COMBO-3 and RAC, with good-quality meta-policy (two figures on the left) and poor-quality meta-policy (two figures on the right).

In a nutshell, the proposed model-based offline Meta-RL approach with regularized Policy Optimization (MerPO) is built on two key steps: 1) learning the offline meta-model via Eq. (3) and 2) learning the offline meta-policy via Eq. (10). The details are presented in Algorithm 2 in the appendix.

4.3 MERPO-BASED POLICY OPTIMIZATION FOR NEW OFFLINE RL TASK

Let $\widehat{T}_{\phi_m^*}$ and $\pi_c^*$ be the offline meta-model and the offline meta-policy learnt by MerPO. For a new offline RL task, the task model can be quickly adapted based on Eq. (4), and the task-specific policy can be obtained based on $\pi_c^*$ using the within-task policy optimization module RAC. Appealing to Theorem 1, we have the following result on MerPO-based policy learning on a new task.

Proposition 1. Consider a new offline RL task with the true MDP $M$. Suppose $\pi_o$ is the MerPO-based task-specific policy, learnt by running RAC with the meta-policy $\pi_c^*$. If $\epsilon = \frac{\beta[\nu(\rho^{\pi_o}, f) - \nu(\rho^{\pi_\beta}, f)]}{2\lambda(1-\gamma)D(\pi_o, \pi_\beta)} \geq 0$ and $\alpha \geq \max\{\tfrac{1}{2} - \epsilon, 0\}$, then $\pi_o$ achieves safe performance improvement over both $\pi_c^*$ and $\pi_\beta$, i.e., $J(M, \pi_o) > \max\{J(M, \pi_c^*), J(M, \pi_\beta)\}$ holds with probability at least $1 - 2\delta$, for large enough $\beta$ and $\lambda$.

Proposition 1 indicates that MerPO-based policy optimization for learning the task-specific policy guarantees a policy with higher rewards than both the behavior policy and the meta-policy. This is particularly useful in the following two scenarios: 1) the offline dataset is collected by some poor behavior policy, but the meta-policy is a good policy; and 2) the meta-policy is inferior to a good behavior policy.

5 EXPERIMENTS

In what follows, we first evaluate the performance of RAC for within-task policy optimization on an offline RL task to validate the safe policy improvement, and then examine how MerPO performs when compared to state-of-the-art offline Meta-RL algorithms. Due to the space limit, we relegate additional experiments to the appendix.

5.1 PERFORMANCE EVALUATION OF RAC

Setup.
We evaluate RAC on several continuous control tasks from the D4RL benchmark (Fu et al., 2020) based on OpenAI Gym (Brockman et al., 2016), and compare its performance to 1) COMBO (where no meta-policy is leveraged) and 2) COMBO with the policy improvement in Eq. (6) (namely, COMBO-3), under different qualities of offline datasets and different qualities of the meta-policy (good and poor). For illustrative purposes, we use a random policy as a poor-quality meta-policy, and choose the learnt policy after 200 episodes as a better-quality meta-policy. We evaluate the average return over 4 random seeds after each episode with 1000 gradient steps.

Results. As shown in Figure 4, RAC can achieve comparable performance with COMBO-3 given a good-quality meta-policy, and both clearly outperform COMBO. Besides, the training procedure is also more stable and converges more quickly, as expected when regularized with the meta-policy. When regularized by a poor-quality meta-policy that is significantly worse than the behavior policy in all environments, the performance of COMBO-3 degrades dramatically. However, RAC outperforms COMBO even when the meta-policy is a random policy. In a nutshell, RAC consistently achieves the best performance in various setups and demonstrates compelling robustness against the quality of the meta-policy, for suitable parameter selections ($\alpha = 0.4$ in Figure 4).

Figure 5: Impact of $\alpha$ on the performance of RAC under different qualities of offline datasets.

Impact of $\alpha$. As shown in Theorem 1, the selection of $\alpha$ is important to guarantee the safe policy improvement property of RAC. Therefore, we next examine the impact of $\alpha$ on the performance of RAC under different qualities of datasets and meta-policies. More specifically, we consider four choices of $\alpha$: $\alpha = 0, 0.4, 0.7, 1$. Here, $\alpha = 0$ corresponds to COMBO-3, i.e., regularization by the meta-policy only, whereas the policy improvement step is regularized by the behavior policy only when $\alpha = 1$. Figure 5 shows the average return of RAC over different qualities of meta-policies under different qualities of the offline datasets. It is clear that RAC achieves the best performance with $\alpha = 0.4$ among the four selections of $\alpha$, corroborating the result in Theorem 1. In general, the performance of RAC is stable for $\alpha \in [0.3, 0.5]$ in our experiments.

5.2 PERFORMANCE EVALUATION OF MERPO

Setup. To evaluate the performance of MerPO, we follow the setups in the literature (Rakelly et al., 2019; Li et al., 2020b) and consider continuous control meta-environments of robotic locomotion. More specifically, tasks have different transition dynamics in Walker-2D-Params and Point-Robot-Wind, and different reward functions in Half-Cheetah-Fwd-Back and Ant-Fwd-Back. We collect the offline dataset for each task following the same line as in (Li et al., 2020b). We consider the following baselines: (1) FOCAL (Li et al., 2020b), a model-free offline Meta-RL approach based on a deterministic context encoder that achieves state-of-the-art performance; (2) MBML (Li et al., 2020a), an offline multi-task RL approach with metric learning; (3) Batch PEARL, which modifies PEARL (Rakelly et al., 2019) to train and test from offline datasets without exploration; (4) Contextual BCQ (CBCQ), a task-augmented variant of the offline RL algorithm BCQ (Fujimoto et al., 2019) that integrates a task latent variable into the state information.
We train on a set of offline RL tasks, and evaluate the performance of the learnt meta-policy during the training process on a set of unseen testing offline RL tasks.

Fixed α vs. Adaptive α. We consider two implementations of MerPO based on the selection of $\alpha$: 1) MerPO, where $\alpha$ is fixed as 0.4 for all tasks; and 2) MerPO-Adp, where at each iteration $k$, given the task-policy $\pi^k_n$ for task $n$ and the meta-policy $\pi^k_c$, we update $\alpha^k_n$ using one-step gradient descent on the following problem:

$\min_{\alpha^k_n}\ (1-\alpha^k_n)\big(D(\pi^k_n, \pi_{\beta,n}) - D(\pi^k_n, \pi^k_c)\big), \quad \text{s.t. } \alpha^k_n \in [0.1, 0.5].$ (11)

The idea is to adapt $\alpha^k_n$ in order to balance between $D(\pi^k_n, \pi_{\beta,n})$ and $D(\pi^k_n, \pi^k_c)$, because Theorem 1 implies that safe policy improvement can be achieved when the impacts of the meta-policy and the behavior policy are well balanced. Specifically, at iteration $k$ for each task $n$, $\alpha^k_n$ is increased when the task-policy $\pi^k_n$ is closer to the meta-policy $\pi^k_c$, and is decreased when $\pi^k_n$ is closer to the behavior policy. Note that $\alpha^k_n$ is constrained to the range $[0.1, 0.5]$, as suggested by Theorem 1.

Results. As illustrated in Figure 6, MerPO-Adp yields the best performance, and both MerPO-Adp and MerPO achieve better or comparable performance in contrast to existing offline Meta-RL approaches. Since the meta-policy changes during the learning process and the qualities of the behavior policies vary across different tasks, MerPO-Adp adapts $\alpha$ across different iterations and tasks so as to achieve a local balance between the impacts of the meta-policy and the behavior policy. As expected, MerPO-Adp can perform better than MerPO with a fixed $\alpha$. Here the best testing performance for the baseline algorithms is selected over different qualities of offline datasets.

Ablation Study. We next provide ablation studies by answering the following questions.

(1) Is RAC important for within-task policy optimization? To answer this question, we compare MerPO with the approach in Eq. (5), where the within-task policy optimization is regularized by the meta-policy only.

Figure 6: Performance comparison in terms of the average return in different environments. Clearly, MerPO-Adp and MerPO achieve better or comparable performance than the baselines.
Figure 7: Ablation study of MerPO in Walker-2D-Params. (a) Impact of the RAC module. (b) Impact of model utilization. (c) Performance under different data qualities. (d) Testing performance for the expert dataset.

As shown in Figure 7(a), with the regularization based on the behavior policy in RAC, MerPO performs significantly better than Eq. (5), implying that the safe policy improvement property of RAC enables MerPO to continuously improve the meta-policy.

(2) Is learning the dynamics model important? Without the utilization of models, the within-task policy optimization degenerates to CQL (Kumar et al., 2020) and the Meta-RL algorithm becomes a model-free approach. Figure 7(b) shows the performance comparison between the cases with and without model utilization. It can be seen that the performance without model utilization is much worse than that of MerPO. This indeed makes sense because task identity inference (Dorfman & Tamar, 2020; Li et al., 2020a;b) is a critical problem in Meta-RL. Such a result also aligns well with a recently emerging trend in supervised meta-learning to improve the generalization ability by augmenting the tasks with more data (Rajendran et al., 2020; Yao et al., 2021).
(3) How does Mer PO perform in unseen offline tasks under different data qualities? We evaluate the average return in unseen offline tasks with different data qualities, and compare the performance between (1) Mer PO with α = 0.4 ( With meta ) and (2) Run a variant of COMBO with behavior-regularized policy improvement, i.e., α = 1 ( With beha only ). For a fair comparison, we initialize the policy network with the meta-policy in both cases. As shown in Figure 7(c), the average performance of With meta over different data qualities is much better than that of With beha only . More importantly, for a new task with expert data, Mer PO ( With meta ) clearly outperforms COMBO as illustrated in Figure 7(d), whereas the performance of FOCAL is worse than COMBO. 6 CONCLUSION In this work, we study offline Meta-RL aiming to strike a right balance between exploring the out-of-distribution state-actions by following the meta-policy and exploiting the offline dataset by staying close to the behavior policy. To this end, we propose a model-based offline Meta-RL approach, namely Mer PO, which learns a meta-model to enable efficient task model learning and a meta-policy to facilitate safe exploration of out-of-distribution state-actions. Particularly, we devise RAC, a meta-regularized model-based actor-critic method for within-task policy optimization, by using a weighted interpolation between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt task-policy via Mer PO offers safe policy improvement over both the behavior policy and the meta-policy. Compared to existing offline Meta-RL methods, Mer PO demonstrates superior performance on several benchmarks, which suggests a more prominent role of model-based approaches in offline Meta-RL. Published as a conference paper at ICLR 2022 ACKNOWLEDGEMENT This work is supported in part by NSF Grants CNS-2003081, CNS-2203239, CPS-1739344, and CCSS-2121222. REPRODUCIBILITY STATEMENT For the theoretical results presented in the main text, we state the full set of assumptions of all theoretical results in Appendix B, and include the complete proofs of all theoretical results in Appendix C. For the experimental results presented in the main text, we include the code in the supplemental material, and specify all the training details in Appendix A. For the datasets used in the main text, we also give a clear explanation in Appendix A. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. ar Xiv preprint ar Xiv:1606.01540, 2016. Ron Dorfman and Aviv Tamar. Offline meta reinforcement learning. ar Xiv preprint ar Xiv:2008.02598, 2020. Rasool Fakoor, Jonas Mueller, Pratik Chaudhari, and Alexander J Smola. Continuous doubly constrained batch reinforcement learning. ar Xiv preprint ar Xiv:2102.09225, 2021. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning Volume 70, pp. 1126 1135. JMLR. org, 2017. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. ar Xiv preprint ar Xiv:2004.07219, 2020. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actorcritic methods. In International Conference on Machine Learning, pp. 1587 1596. PMLR, 2018. Scott Fujimoto, David Meger, and Doina Precup. 
Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052 2062. PMLR, 2019. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861 1870. PMLR, 2018. Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Modelbased policy optimization. ar Xiv preprint ar Xiv:1906.08253, 2019. Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Modelbased offline reinforcement learning. ar Xiv preprint ar Xiv:2005.05951, 2020. Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. ar Xiv preprint ar Xiv:1906.00949, 2019. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. ar Xiv preprint ar Xiv:2006.04779, 2020. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Jiachen Li, Quan Vuong, Shuang Liu, Minghua Liu, Kamil Ciosek, Henrik Christensen, and Hao Su. Multi-task batch reinforcement learning with metric learning. Advances in Neural Information Processing Systems, 33, 2020a. Lanqing Li, Rui Yang, and Dijun Luo. Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization. ar Xiv preprint ar Xiv:2010.01112, 2020b. Published as a conference paper at ICLR 2022 Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with stationary distribution correction. In Uncertainty in Artificial Intelligence, pp. 1180 1190. PMLR, 2020. Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deploymentefficient reinforcement learning via model-based offline optimization. ar Xiv preprint ar Xiv:2006.03647, 2020. Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, and Chelsea Finn. Offline metareinforcement learning with advantage weighting. ar Xiv preprint ar Xiv:2008.06043, 2020. Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. ar Xiv preprint ar Xiv:1912.02074, 2019. Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. ar Xiv preprint ar Xiv:1803.02999, 2018. Janarthanan Rajendran, Alex Irpan, and Eric Jang. Meta-learning requires meta-augmentation. ar Xiv preprint ar Xiv:2007.05549, 2020. Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. ar Xiv preprint ar Xiv:1903.08254, 2019. Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. On the global optimality of modelagnostic meta-learning. In International Conference on Machine Learning, pp. 9837 9846. PMLR, 2020. Huaxiu Yao, Long-Kai Huang, Linjun Zhang, Ying Wei, Li Tian, James Zou, Junzhou Huang, et al. Improving generalization in meta-learning via task augmentation. In International Conference on Machine Learning, pp. 11887 11897. PMLR, 2021. Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. ar Xiv preprint ar Xiv:2005.13239, 2020. 
Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Sergey Levine, and Chelsea Finn. Conservative data sharing for multi-task offline reinforcement learning. ar Xiv preprint ar Xiv:2109.08128, 2021a. Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: Conservative offline model-based policy optimization. ar Xiv preprint ar Xiv:2102.08363, 2021b. Pan Zhou, Xiaotong Yuan, Huan Xu, Shuicheng Yan, and Jiashi Feng. Efficient meta learning via minibatch proximal update. Advances in Neural Information Processing Systems, 32:1534 1544, 2019. Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, and Shimon Whiteson. Varibad: A very good method for bayes-adaptive deep rl via metalearning. ar Xiv preprint ar Xiv:1910.08348, 2019. Published as a conference paper at ICLR 2022 Table 1: Hyperparameters for RAC. Hyperparameters Halfcheetah Hopper Walker2d Discount factor 0.99 0.99 0.99 Sample batch size 256 256 256 Real data ratio 0.5 0.5 0.5 Model rollout length 5 5 1 Critic lr 3e-4 3e-4 1e-4 Actor lr 1e-4 1e-4 1e-5 Model lr 1e-3 1e-3 1e-3 Optimizer Adam Adam Adam β 1 1 10 Max entropy True True True λ 1 1 1 A EXPERIMENTAL DETAILS A.1 META ENVIRONMENT DESCRIPTION Walker-2D-Params: Train an agent to move forward. Different tasks correspond to different randomized dynamcis parameters. Half-Cheeta-Fwd-Back: Train a Cheetah robot to move forward or backward, and the reward function depends on the moving direction. All tasks have the same dynamics model but different reward functions. Ant-Fwd-Back: Train an Ant robot to move forward or backward, and the reward function depends on the moving direction. All tasks have the same dynamics model but different reward functions. Poing-Robot-Wind: Point-Robot-Wind is a variant of Sparse-Point-Robot (Li et al., 2020b), a 2D navigation problem introduced in (Rakelly et al., 2019), where each task is to guide a point robot to navigate to a specific goal location on the edge of a semi-circle from the origin. In Point-Robot-Wind, each task is affected by a distinct wind uniformly sampled from [ 0.05, 0.05]2, and hence differs in the transition dynamics. A.2 IMPLEMENTATION DETAILS AND MORE EXPERIMENTS A.2.1 EVALUATION OF RAC Model learning. Following the same line as in (Janner et al., 2019; Yu et al., 2020; 2021b), the dynamics model for each task is represented as a probabilistic neural network that takes the current state-action as input and outputs a Gaussian distribution over the next state and reward: b Tθ(st+1, r|s, a) = N(µθ(st, at), Σθ(st, at)). An ensemble of 7 models is trained independently using maximum likelihood estimation, and the best 5 models are picked based on the validation prediction error using a held-our set of the offline dataset. Each model is represented by a 4-layer feedforward neural network with 256 hidden units. And one model will be randomly selected from the best 5 models for model rollout. Policy optimization. We represent both Q-network and policy network as a 4-layer feedforward neural network with 256 hidden units, and use clipped double Q-learning (Fujimoto et al., 2018) for Q backup update. A max entropy term is also included to the value function for computing the target Q value as in SAC (Haarnoja et al., 2018). The hyperparameters used for evaluating the performance of RAC are described in Table 1. Additional experiments. We also evaluate the performance of RAC in Walker2d under different qualities of the meta-policy. 
As shown in Figure 8(a) and 8(b), RAC achieves the best performance in both scenarios, compared to COMBO and COMBO-3. Particularly, the performance of COMBO3 in Figure 8(a) degrades in the later stage of training because the meta-policy is not superior over the behavior policy in this case. In stark contrast, the performance of RAC is consistently better, as it provides a safe policy improvement guarantee over both the behavior policy and the meta-policy. Published as a conference paper at ICLR 2022 (a) Performance in Walker2d with a good meta-policy. (b) Performance in Walker2d with a random meta-policy. (c) Performance in Half Cheetah with expert offline dataset. (d) Average return over different qualities of meta-policies under expert dataset for different choices of α. Figure 8: Performance evaluation of RAC. Beside, we also compare the performance of these three algorithms under an expert behavior policy in Figure 8(c), where a meta-policy usually interferes the policy optimization and drags the learnt policy away from the expert policy. As expected, RAC can still achieve comparable performance with COMBO, as a result of safe policy improvement over the behavior policy for suitable parameter selections. We examine the impact of α on the performance of RAC under different qualities of the meta-policy for Half Cheetah with expert data. In this case, the meta-policy is a worse policy compared to the behavior policy. As shown in Figure 8(d), the performance α = 0.4 is comparable to the case of α = 1 where the policy improvement step is only regularized based on the behavior policy, and clearly better than the other two cases. A.2.2 EVALUATION OF MERPO Data collection. We collect the offline dataset for each task by training a stochastic policy network using SAC (Haarnoja et al., 2018) for that task and rolling out the policies saved at each checkpoint to collect trajectories. Different checkpoints correspond to different qualities of the offline datasets. When training with Mer PO, we break the trajectories into independent tuples {si, ai, ri, s i} and store in a replay buffer. Therefore, the offline dataset for each task does not contain full trajectories over entire episodes, but merely individual transitions collected by some behavior policy. Setup. For each testing task, we obtain the task-specific policy through quick adaptation using the within-task policy optimization method RAC, based on its own offline dataset and the learnt metapolicy, and evaluate the average return of the adapted policy over 4 random seeds. As shown earlier, we take α = 0.4 for all experiments about Mer PO. In Mer PO-Adp, we initialize α with 0.4 and update with a learning rate of 1e 4. Meta-model learning. Similar as in section A.2.1, for each task we quickly adapt from the metamodel to obtain an ensemble of 7 models and pick the best 5 models based on the validation error. The neural network used for representing the dynamics model is same with that in section A.2.1. Meta-policy learning. As in RAC, we represent the task Q-network, the task policy network and the meta-policy network as a 5-layer feedforward neural network with 300 hidden units, and use clipped double Q-learning (Fujimoto et al., 2018) for within task Q backup update. For each task, we also use dual gradient descent to automatically tune both the parameter β for conservative policy evaluation and the parameter λ for regularized policy improvement: Tune β. 
Before optimizing the Q-network in policy evaluation, we first optimize β by solving the following problem: min Q max β 0 β(Es,a ρ[Q(s, a)] Es,a D[Q(s, a)] τ) + 1 2Es,a,s df [(Q(s, a) b Bπ b Qk(s, a))2]. Intuitively, the value of β will be increased to penalty the Q-values for out-of-distribution state-actions if the difference Es,a ρ[Q(s, a)] Es,a D[Q(s, a)] is larger than some threshold value τ. Published as a conference paper at ICLR 2022 Table 2: Hyperparameters for Mer PO. Hyperparameters Walker-2D-Params Half-Cheetah-Fwd-Back Ant-Fwd-Back Point-Robot-Wind Discount factor 0.99 0.99 0.99 0.9 Sample batch size 256 256 256 256 Task batch size 8 2 2 8 Real data ratio 0.5 0.5 0.5 0.5 Model rollout length 1 1 1 1 Inner critic lr 1e-3 1e-3 8e-4 1e-3 Inner actor lr 1e-3 5e-4 5e-4 1e-3 Inner steps 10 10 10 10 Outer critic lr 1e-3 1e-3 1e-3 1e-3 Outer actor lr 1e-3 1e-3 1e-3 1e-3 Meta-q lr 1e-3 1e-3 1e-3 1e-3 Task model lr 1e-4 1e-4 1e-4 1e-4 Meta-model lr 5e-2 5e-2 5e-2 5e-2 Model adaptation steps 25 25 25 25 Optimizer Adam Adam Adam Adam Auto-tune λ True True True True λ lr 1 1 1 1 λ initial 5 100 100 5 Target divergence 0.05 0.05 0.05 0.05 Auto-tune β True True True True log β lr 1e-3 1e-3 1e-3 1e-3 log β initial 0 0 0 0 Q difference threshold 5 10 10 10 Max entropy True True True True α 0.4 0.4 0.4 0.4 Testing adaptation steps 100 100 100 100 # training tasks 20 2 2 40 # testing tasks 5 2 2 10 Tune λ. Similarly, we can optimize λ in policy improvement by solving the following problem: max π min λ 0 Es ρ(s),a π( |s)[ b Qπ(s, a)] λα[D(π, πβ) Dtarget] λ(1 α)[D(π, πc) Dtarget]. Intuitively, the value of λ will be increased so as to have stronger regularizations if the divergence is larger than some threshold Dtarget, and decreased if the divergence is smaller than Dtarget. Besides, we also build a meta-Q network Qmeta over the training process as an initialization of the task Q networks to facilitate the within task policy optimization. At the k-th meta-iteration for meta-policy update, the meta-Q network is also updated using the average Q-values of current batch B of training tasks with meta-q learning rate ξq, i.e., Qk+1 meta = Qk meta ξq[Qk meta 1 Therefore, we initialize the task Q networks and the task policy with the meta-Q network and the meta-policy, respectively, for within task policy optimization during both meta-training and metatesting. The hyperparameters used in evaluation of Mer PO are listed in Table 2. For evaluating the performance improvement in a single new offline task, we use a smaller learning rate of 8e 5 for the Q network and the policy network update. A.2.3 MORE EXPERIMENTS. We also evaluate the impact of the utilization extent of the learnt model, by comparing the performance of Mer PO under different cases of real data ratio, i.e., the ratio of the data from the offline dataset in the data batch for training. As shown in Figure 9(a), the performance of Mer PO can be further boosted with a more conservative utilization of the model. 
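As a small illustration of the "real data ratio" knob evaluated above, i.e., the fraction of each training batch drawn from the offline dataset versus model rollouts (the $d_f$ mixture in Eq. (1)), a minimal sketch of the batch construction is given below; the buffer layout and names are assumptions made for illustration only.

```python
import numpy as np

def sample_mixed_batch(offline_buffer, model_buffer, batch_size=256, real_ratio=0.5, rng=None):
    """Draw a training batch as an f-mixture of offline data and model rollouts;
    real_ratio plays the role of the 'real data ratio' hyperparameter (f in d_f)."""
    rng = rng or np.random.default_rng()
    n_real = int(batch_size * real_ratio)
    real_idx = rng.integers(len(offline_buffer["s"]), size=n_real)
    model_idx = rng.integers(len(model_buffer["s"]), size=batch_size - n_real)
    batch = {}
    for key in ("s", "a", "r", "s2"):
        batch[key] = np.concatenate(
            [offline_buffer[key][real_idx], model_buffer[key][model_idx]], axis=0)
    return batch
```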
To understand how much benefit Mer PO can bring for policy learning in unseen offline RL tasks, we compare the performance of the following cases with respect to the gradient steps taken for learning in unseen offline RL tasks: (1) Initialize the task policy network with the meta-policy and run RAC ( With meta ); (2) Run RAC using the meta-policy without network initialization ( With meta (no init) ); (3) Run RAC with a single regularization based on behavior policy without network initialization, i.e., α = 1 ( With beha only ); (4) Run COMBO ( No regul. ). As shown in Figure 9(b), With meta achieves the best performance and improves significantly over No regul. and With beha only , i.e., learning alone without any guidance of meta-policy, which implies that the Published as a conference paper at ICLR 2022 (a) Impact of real data ratio. (b) Performance comparison for unseen tasks. (c) Training sample efficiency in Point-Robot-Wind. Figure 9: Ablation study of Mer PO. (a) Training efficiency. (b) Testing efficiency. Figure 10: Sample efficiency. learnt meta-policy can efficiently guide the exploration of out-of-distribution state-actions. Without network initialization, With meta (no init) and With beha only achieve similar performance because good offline dataset is considered here. Such a result is also consistent with Figure 8(d). We evaluate the testing performance of Mer PO, by changing sample size of all tasks. Figure 10(a) shows that the performance of Mer PO is stable even if we decrease the number of trajectories for each task to be around 200. In contrast, the number of trajectories collected in other baselines is of the order 103. Figure 10(b) illustrates the testing sample efficiency of Mer PO, by evaluating the performance at new offline tasks under different sample sizes. Clearly, a good task-specific policy can be quickly adapted at a new task even with 5 trajectories (1000 samples) of offline data. We also evaluate the training sample efficiency of Mer PO in Point-Robot-Wind. As shown in Figure 9(c) the performance of Mer PO is stable even if we decrease the number of trajectories for each task to be around 200. A.2.4 MORE COMPARISON BETWEEN FOCAL AND COMBO Following the setup as in Figure 1, we compare the performance between FOCAL and COMBO in two more environments: Half-Cheetah-Fwd-Back and Ant-Fwd-Back. As shown in Figure 11, although FOCAL performs better than COMBO on the task with a bad-quality dataset, it is outperformed by COMBO on the task with a good-quality dataset. This further confirms the observation made in Figure 1. A.3 ALGORITHMS We include the details of Mer PO in Algorithm 2. Published as a conference paper at ICLR 2022 (a) Performance comparison in Half-Cheetah-Fwd Back. (b) Performance comparison in Ant-Fwd-Back. Figure 11: FOCAL vs. COMBO. Algorithm 2 Regularized policy optimization for model-based offline Meta-RL (Mer PO) 1: Initialize the dynamics, actor and critic for each task, and initialize the meta-model and the meta-policy; 2: for k = 1, 2, ... do 3: for each training task Mn do 4: Solve the following problem with gradient descent for h steps to compute the dynamics model b Tθk n based on the offline dataset Di: min θn E(s,a,s ) Dn[log b Tθn(s |s, a)] + η θn φm(k) 2 2; 5: end for 6: Update φm(k + 1) = φm(k) ξ1[φm(k) 1 N PN n=1 θk n]; 7: end for 8: Quickly obtain the estimated dynamics model b Tn for each training task by solving Eq. (4) with t steps gradient descent; 9: for k = 1, 2, ... 
do 10: for each training task Mn do 11: for j = 1, ..., J do 12: Perform model rollouts with b Tn starting from states in Dn and add model rollouts to Dn model; 13: Policy evaluation by recursively solving Eq. (1) using data from Dn Dn model; 14: Given the meta-policy πk c , improve policy πk n by solving Eq. (7); 15: end for 16: end for 17: Given the learnt policy πk n for each task, update the meta-policy πk+1 c by solving Eq. (10) with one step gradient descent; 18: end for B PRELIMINARIES For ease of exposition, let TM and r M denote the dynamics and reward function of the underlying MDP M, TM and r M denote the dynamics and reward function of the empirical MDP M induced by the dataset D, and T c M and r c M denote the dynamics and reward function of the learnt MDP c M. To prevent any trivial bound with values, we assume that the cardinality of a state-action pair in the dataset D, i.e., |D(s, a)|, in the denominator, is non-zero, by setting |D(s, a)| to be a small value less than 1 when (s, a) / D. Following the same line as in (Kumar et al., 2020; Yu et al., 2021b), we make the following standard assumption on the concentration properties of the reward and dynamics for the empirical MDP M to characterize the sampling error. Assumption 1. For any (s, a) M, the following inequalities hold with probability 1 δ: TM(s |s, a) TM(s |s, a) 1 CT,δ p |D(s, a)| ; |r M(s, a) r M| Cr,δ p |D(s, a)| where CT,δ and Cr,δ are some constants depending on δ via a p log(1/δ) dependency. Published as a conference paper at ICLR 2022 Based on Assumption 1, we can bound the estimation error induced by the empirical Bellman backup operator for any (s, a) M: Bπ M b Qk(s, a) Bπ M b Qk(s, a) r M(s, a) r M(s, a) + γ X s (TM(s |s, a) TM(s |s, a))Eπ(a |s )[ b Qk(s , a )] |r M(s, a) r M(s, a)| + γ s (TM(s |s, a) TM(s |s, a))Eπ(a |s )[ b Qk(s , a )] |D(s, a)| + γ TM(s |s, a) TM(s |s, a) 1 Eπ(a |s )[ b Qk(s , a )] Cr,δ + γCT,δRmax/(1 γ) p =((1 γ)Cr,δ/Rmax + γCT,δ)Rmax (Cr,δ/Rmax + CT,δ)Rmax |D(s, a)| Cr,T,δRmax (1 γ) p |D(s, a)| . Similarly, we can bound the difference between the Bellman backup induced by the learnt MDP c M and the underlying Bellman backup: Bπ c M b Qk(s, a) Bπ M b Qk(s, a) |r c M(s, a) r M(s, a)| + γRmax 1 γ Dtv(T c M, TM) where Dtv(T c M, TM) is the total-variation distance between T c M and TM. For any two MDPs, M1 and M2, with the same state space, action space and discount factor γ, and a given fraction f (0, 1), define the f-interpolant MDP Mf as the MDP with dynamics: TMf = f TM1 + (1 f)TM2 and reward function: r Mf = fr M1 + (1 f)r M2, which has the same state space, action space and discount factor with M1 and M2. Let T π be the transition matrix on state-action pairs induced by a stationary policy π, i.e., T π = T(s |s, a)π(a |s ). To prove the main result, we first restate the following lemma from (Yu et al., 2021b) to be used later. Lemma 1. For any policy π, its returns in any MDP M, denoted by J(M, π), and in Mf, denoted by J(M1, M2, f, π), satisfy the following: J(M, π) η J(M1, M2, f, π) J(M, π) + η where (1 γ)2 Rmax Dtv(TM2, TM) + γf 1 γ |Edπ M[(T π M T π M1)Qπ M]| + f 1 γ Es,a dπ M[|r M1(s, a) r M(s, a)|] + 1 f 1 γ Es,a dπ M [|r M2(s, a) r M(s, a)|]. Lemma 1 characterizes the relationship between policy returns in different MDPs in terms of the corresponding reward difference and dynamics difference. C PROOF OF THEOREM 1 Let d(s, a) := dπβ M(s, a). In the setting without function approximation, by setting the derivation of Equation Eq. 
C PROOF OF THEOREM 1

Let $d(s,a) := d^{\pi_\beta}_{M}(s,a)$. In the setting without function approximation, by setting the derivative of Eq. (1) to 0, we have that
$$\hat{Q}^{k+1}(s,a) = \hat{B}^{\pi}\hat{Q}^k(s,a) - \beta\,\frac{\rho(s,a)-d(s,a)}{d_f(s,a)}.$$
Denote by $\nu(\rho,f) = \mathbb{E}_{\rho}\big[\frac{\rho(s,a)-d(s,a)}{d_f(s,a)}\big]$ the expected penalty on the Q-value, where $d_f(s,a) = fd(s,a) + (1-f)\rho(s,a)$. It can be shown (Yu et al., 2021b) that $\nu(\rho,f) \ge 0$ and that $\nu(\rho,f)$ increases with $f$, for any $\rho$ and any $f\in(0,1)$. Then, RAC optimizes the return of a policy in an f-interpolant MDP induced by the empirical MDP $\overline{M}$ and the learnt MDP $\widehat{M}$, regularized by both the behavior policy $\pi_\beta$ and the meta-policy $\pi_c$:
$$\max_{\pi}\; J(\overline{M},\widehat{M},f,\pi) - \beta\,\frac{\nu(\rho_\pi,f)}{1-\gamma} - \lambda\alpha D(\pi,\pi_\beta) - \lambda(1-\alpha)D(\pi,\pi_c). \qquad (12)$$
Denote by $\pi_o$ the solution to the above optimization problem. Based on Lemma 1, we can first characterize the return of the learnt policy $\pi_o$ in the underlying MDP $M$ in terms of its return in the f-interpolant MDP:
$$J(M,\pi_o) + \eta_1 \ge J(\overline{M},\widehat{M},f,\pi_o), \qquad (13)$$
where
$$\begin{aligned}
\eta_1 &= \frac{2\gamma(1-f)}{(1-\gamma)^2}R_{\max}D_{tv}(T_{\widehat{M}},T_M) + \frac{\gamma f}{1-\gamma}\Big|\mathbb{E}_{d^{\pi_o}_{M}}\big[(T^{\pi_o}_{M} - T^{\pi_o}_{\overline{M}})Q^{\pi_o}_{M}\big]\Big| + \frac{f}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_o}_{M}}\big[|r_{\overline{M}}(s,a)-r_M(s,a)|\big] + \frac{1-f}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_o}_{M}}\big[|r_{\widehat{M}}(s,a)-r_M(s,a)|\big]\\
&\le \frac{2\gamma(1-f)}{(1-\gamma)^2}R_{\max}D_{tv}(T_{\widehat{M}},T_M) + \frac{\gamma^2 f C_{T,\delta}R_{\max}}{(1-\gamma)^2}\mathbb{E}_{s\sim d^{\pi_o}_{M}(s)}\Big[\frac{\sqrt{D_{CQL}(\pi_o,\pi_\beta)(s)+1}}{\sqrt{|D(s)|}}\Big] + \frac{1}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_o}_{M}}\Big[\frac{C_{r,\delta}}{\sqrt{|D(s,a)|}}\Big] + \frac{1}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_o}_{M}}\big[|r_{\widehat{M}}(s,a)-r_M(s,a)|\big] \triangleq \eta^c_1.
\end{aligned}$$
Note that the inequality above holds because the following is true for the empirical MDP $\overline{M}$ (Kumar et al., 2020):
$$\Big|\mathbb{E}_{d^{\pi}_{M}}\big[(T^{\pi}_{M} - T^{\pi}_{\overline{M}})Q^{\pi}_{M}\big]\Big| \le \frac{\gamma C_{T,\delta}R_{\max}}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}_{M}(s)}\Big[\frac{\sqrt{D_{CQL}(\pi,\pi_\beta)(s)+1}}{\sqrt{|D(s)|}}\Big],$$
where $D_{CQL}(\pi_1,\pi_2)(s) := \sum_{a}\pi_1(a|s)\Big(\frac{\pi_1(a|s)}{\pi_2(a|s)} - 1\Big)$.

C.1 SAFE IMPROVEMENT OVER πc

We first show that the learnt policy offers safe improvement over the meta-policy $\pi_c$. Following the same line as in Eq. (13), we next bound the return of the meta-policy $\pi_c$ in the underlying MDP $M$ from above, in terms of its return in the f-interpolant MDP:
$$J(\overline{M},\widehat{M},f,\pi_c) \ge J(M,\pi_c) - \eta_2,$$
where, following the same steps as for $\eta_1$, $\eta_2$ can be upper bounded with probability at least $1-\delta$ by
$$\eta^c_2 := \frac{2\gamma(1-f)}{(1-\gamma)^2}R_{\max}D_{tv}(T_{\widehat{M}},T_M) + \frac{\gamma^2 f C_{T,\delta}R_{\max}}{(1-\gamma)^2}\mathbb{E}_{s\sim d^{\pi_c}_{M}(s)}\Big[\frac{\sqrt{D_{CQL}(\pi_c,\pi_\beta)(s)+1}}{\sqrt{|D(s)|}}\Big] + \frac{1}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_c}_{M}}\Big[\frac{C_{r,\delta}}{\sqrt{|D(s,a)|}}\Big] + \frac{1}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_c}_{M}}\big[|r_{\widehat{M}}(s,a)-r_M(s,a)|\big].$$
It follows that
$$\begin{aligned}
&J(M,\pi_o) + \eta^c_1 - \beta\,\frac{\nu(\rho_{\pi_o},f)}{1-\gamma} - \lambda\alpha D(\pi_o,\pi_\beta) - \lambda(1-\alpha)D(\pi_o,\pi_c)\\
&\ge J(\overline{M},\widehat{M},f,\pi_o) - \beta\,\frac{\nu(\rho_{\pi_o},f)}{1-\gamma} - \lambda\alpha D(\pi_o,\pi_\beta) - \lambda(1-\alpha)D(\pi_o,\pi_c)\\
&\ge J(\overline{M},\widehat{M},f,\pi_c) - \beta\,\frac{\nu(\rho_{\pi_c},f)}{1-\gamma} - \lambda\alpha D(\pi_c,\pi_\beta)\\
&\ge J(M,\pi_c) - \eta^c_2 - \beta\,\frac{\nu(\rho_{\pi_c},f)}{1-\gamma} - \lambda\alpha D(\pi_c,\pi_\beta),
\end{aligned}$$
where the second inequality is true because $\pi_o$ is the solution to Eq. (12) (note that $D(\pi_c,\pi_c) = 0$). This gives us a lower bound on $J(M,\pi_o)$ in terms of $J(M,\pi_c)$:
$$J(M,\pi_o) \ge J(M,\pi_c) - \eta^c_1 - \eta^c_2 + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_c},f)\big] + \lambda\alpha D(\pi_o,\pi_\beta) + \lambda(1-\alpha)D(\pi_o,\pi_c) - \lambda\alpha D(\pi_c,\pi_\beta).$$
It is clear that $\eta^c_1$ and $\eta^c_2$ are independent of $\beta$ and $\lambda$. To show the performance improvement of $\pi_o$ over the meta-policy $\pi_c$, it suffices to guarantee that, for appropriate choices of $\beta$ and $\lambda$,
$$\Delta_c = \lambda\alpha D(\pi_o,\pi_\beta) + \lambda(1-\alpha)D(\pi_o,\pi_c) - \lambda\alpha D(\pi_c,\pi_\beta) + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_c},f)\big] > 0.$$
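As a side check on the property of ν(ρ, f) quoted above from (Yu et al., 2021b), namely that the expected penalty is non-negative and non-decreasing in f, the following small numerical sketch evaluates ν on random tabular distributions. It is an illustration only, not part of the proof; the function name `nu` is an assumption.

```python
import numpy as np

def nu(rho, d, f):
    """nu(rho, f) = E_rho[(rho - d) / (f*d + (1-f)*rho)] for tabular rho, d."""
    mask = (rho > 0)                                   # terms outside supp(rho) contribute 0
    denom = f * d[mask] + (1 - f) * rho[mask]
    return np.sum(rho[mask] * (rho[mask] - d[mask]) / denom)

rng = np.random.default_rng(0)
for trial in range(5):
    rho = rng.dirichlet(np.ones(10))                   # rollout distribution rho_pi
    d = rng.dirichlet(np.ones(10))                     # offline data distribution d
    vals = [nu(rho, d, f) for f in np.linspace(0.05, 0.95, 10)]
    assert min(vals) >= -1e-12                         # nu >= 0
    assert all(b >= a - 1e-12 for a, b in zip(vals, vals[1:]))  # non-decreasing in f
print("nu(rho, f) >= 0 and non-decreasing in f on all random trials")
```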
To this end, the following lemma first provides an upper bound on $|\nu(\rho_{\pi_o},f) - \nu(\rho_{\pi_c},f)|$.

Lemma 2. There exist some positive constants $L_1$ and $L_2$ such that
$$\big|\nu(\rho_{\pi_o},f) - \nu(\rho_{\pi_c},f)\big| \le 2(L_1+L_2)\,D_{tv}\big(\rho_{\pi_o}(s,a)\,\|\,\rho_{\pi_c}(s,a)\big).$$

Proof. First, we have that
$$\begin{aligned}
&\big|\nu(\rho_{\pi_o},f) - \nu(\rho_{\pi_c},f)\big|\\
&= \left|\mathbb{E}_{\rho_{\pi_o}}\!\left[\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)}\right] - \mathbb{E}_{\rho_{\pi_c}}\!\left[\frac{\rho_{\pi_c}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_c}(s,a)}\right]\right|\\
&= \left|\sum_{(s,a)}\big[\rho_{\pi_o}(s,a)-\rho_{\pi_c}(s,a)\big]\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} + \sum_{(s,a)}\rho_{\pi_c}(s,a)\left[\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} - \frac{\rho_{\pi_c}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_c}(s,a)}\right]\right|\\
&\le \sum_{(s,a)}\big|\rho_{\pi_o}(s,a)-\rho_{\pi_c}(s,a)\big|\left|\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)}\right| + \sum_{(s,a)}\rho_{\pi_c}(s,a)\left|\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} - \frac{\rho_{\pi_c}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_c}(s,a)}\right|.
\end{aligned}$$
First, observe the following for the term $\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)}$.
If $\rho_{\pi_o}(s,a) \ge d(s,a)$, then
$$0 \le \frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} \le \frac{\rho_{\pi_o}(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} \le \frac{\rho_{\pi_o}(s,a)}{(1-f)\rho_{\pi_o}(s,a)} = \frac{1}{1-f}.$$
If $\rho_{\pi_o}(s,a) < d(s,a)$, then
$$\left|\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)}\right| \le \frac{d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} \le \frac{d(s,a)}{fd(s,a)} = \frac{1}{f}.$$
Therefore,
$$\left|\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)}\right| \le \max\Big\{\frac{1}{f},\,\frac{1}{1-f}\Big\} \triangleq L_1.$$
Next, for the term $\left|\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} - \frac{\rho_{\pi_c}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_c}(s,a)}\right|$, consider the function $g(x) = \frac{x-d}{fd+(1-f)x}$ for $x\in[0,1]$. Clearly, when $d(s,a) = 0$,
$$\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} - \frac{\rho_{\pi_c}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_c}(s,a)} = 0.$$
For any $(s,a)$ with $d(s,a) > 0$, it can be shown that $g(x)$ is continuous and has a bounded gradient, i.e., $|\nabla g(x)| \le \frac{1}{f^2 d(s,a)} \le L_2$. Hence, it follows that
$$\left|\frac{\rho_{\pi_o}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_o}(s,a)} - \frac{\rho_{\pi_c}(s,a)-d(s,a)}{fd(s,a)+(1-f)\rho_{\pi_c}(s,a)}\right| \le L_2\big|\rho_{\pi_o}(s,a)-\rho_{\pi_c}(s,a)\big|.$$
Therefore, we can conclude that
$$\big|\nu(\rho_{\pi_o},f) - \nu(\rho_{\pi_c},f)\big| \le L_1\sum_{(s,a)}\big|\rho_{\pi_o}(s,a)-\rho_{\pi_c}(s,a)\big| + L_2\sum_{(s,a)}\rho_{\pi_c}(s,a)\big|\rho_{\pi_o}(s,a)-\rho_{\pi_c}(s,a)\big| \le (L_1+L_2)\sum_{s,a}\big|\rho_{\pi_o}(s,a)-\rho_{\pi_c}(s,a)\big| = 2(L_1+L_2)\,D_{tv}\big(\rho_{\pi_o}(s,a)\,\|\,\rho_{\pi_c}(s,a)\big).$$

Recall that $\rho_{\pi_o}(s,a) = d^{\pi_o}_{\widehat{M}}(s)\pi_o(a|s)$ and $\rho_{\pi_c}(s,a) = d^{\pi_c}_{\widehat{M}}(s)\pi_c(a|s)$, which denote the marginal state-action distributions obtained by rolling out $\pi_o$ and $\pi_c$ in the learnt model $\widehat{M}$, respectively. Lemma 2 gives an upper bound on the difference between the expected penalties induced under $\pi_o$ and $\pi_c$ in terms of the difference between the corresponding marginal state-action distributions. Next, we characterize the relationship between the marginal state-action distribution difference and the corresponding policy distance, which is captured in the following lemma.

Lemma 3. Let $D(\pi_1\|\pi_2) = \max_s D_{tv}(\pi_1\|\pi_2)$ denote the maximum total-variation distance between two policies $\pi_1$ and $\pi_2$. Then,
$$D_{tv}\big(\rho_{\pi_o}(s,a)\,\|\,\rho_{\pi_c}(s,a)\big) \le \frac{1}{1-\gamma}\max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big).$$

Proof. Note that
$$D_{tv}\big(\rho_{\pi_o}(s,a)\,\|\,\rho_{\pi_c}(s,a)\big) \le (1-\gamma)\sum_{t=0}^{\infty}\gamma^t D_{tv}\big(\rho^t_{\pi_o}(s,a)\,\|\,\rho^t_{\pi_c}(s,a)\big).$$
It then suffices to bound the state-action marginal difference at time $t$. Since both state-action marginals correspond to rolling out $\pi_o$ and $\pi_c$ in the same MDP $\widehat{M}$, based on Lemmas B.1 and B.2 in (Janner et al., 2019), we can obtain
$$D_{tv}\big(\rho^t_{\pi_o}(s,a)\,\|\,\rho^t_{\pi_c}(s,a)\big) \le D_{tv}\big(\rho^t_{\pi_o}(s)\,\|\,\rho^t_{\pi_c}(s)\big) + \max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big) \le t\max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big) + \max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big) = (t+1)\max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big),$$
which indicates that
$$D_{tv}\big(\rho_{\pi_o}(s,a)\,\|\,\rho_{\pi_c}(s,a)\big) \le (1-\gamma)\sum_{t=0}^{\infty}\gamma^t(t+1)\max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big) = \frac{1}{1-\gamma}\max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big).$$
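Lemma 3 can be checked numerically in a small tabular MDP by computing the discounted state-action occupancies of two policies under shared dynamics and comparing their total-variation distance against the bound $\frac{1}{1-\gamma}\max_s D_{tv}(\pi_o(\cdot|s)\,\|\,\pi_c(\cdot|s))$. The sketch below does this; it is illustrative only, with assumed helper names `occupancy` and `tv`, and the random MDP standing in for the learnt model $\widehat{M}$.

```python
import numpy as np

def occupancy(T, pi, gamma, rho0):
    """Discounted occupancy rho_pi(s,a) = (1-gamma) * sum_t gamma^t P(s_t=s, a_t=a)."""
    S, A, _ = T.shape
    P_pi = np.einsum('sap,sa->sp', T, pi)                                  # state kernel under pi
    d_s = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)  # state occupancy
    return d_s[:, None] * pi                                               # rho(s,a) = d(s) * pi(a|s)

def tv(p, q):
    """Total-variation distance between two (possibly multi-dimensional) distributions."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9
T = rng.dirichlet(np.ones(S), size=(S, A))     # shared dynamics (the learnt model)
rho0 = rng.dirichlet(np.ones(S))               # initial state distribution
pi_o = rng.dirichlet(np.ones(A), size=S)
pi_c = rng.dirichlet(np.ones(A), size=S)

lhs = tv(occupancy(T, pi_o, gamma, rho0), occupancy(T, pi_c, gamma, rho0))
rhs = max(tv(pi_o[s], pi_c[s]) for s in range(S)) / (1 - gamma)
print(f"TV(rho_pi_o, rho_pi_c) = {lhs:.3f}  <=  bound = {rhs:.3f}")
```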
Building on Lemma 2 and Lemma 3, we can show that
$$\big|\nu(\rho_{\pi_o},f) - \nu(\rho_{\pi_c},f)\big| \le \frac{2(L_1+L_2)}{1-\gamma}\max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big) \triangleq C\max_s D_{tv}\big(\pi_o(a|s)\,\|\,\pi_c(a|s)\big).$$
Let $D(\cdot,\cdot) = \max_s D_{tv}(\cdot\,\|\,\cdot)$. It is clear that for $\lambda \ge \lambda_0$, where $\lambda_0 > \frac{C\beta}{(1-\gamma)(1-2\alpha)}$ and $\alpha < \frac{1}{2}$,
$$\begin{aligned}
\Delta_c &= \lambda\alpha D(\pi_o,\pi_\beta) + \lambda(1-\alpha)D(\pi_o,\pi_c) - \lambda\alpha D(\pi_c,\pi_\beta) + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_c},f)\big]\\
&= \lambda\alpha D(\pi_o,\pi_\beta) + \lambda\alpha D(\pi_o,\pi_c) - \lambda\alpha D(\pi_c,\pi_\beta) + \lambda(1-2\alpha)D(\pi_o,\pi_c) + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_c},f)\big]\\
&\ge \lambda(1-2\alpha)D(\pi_o,\pi_c) + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_c},f)\big]\\
&\ge \lambda(1-2\alpha)D(\pi_o,\pi_c) - \frac{C\beta}{1-\gamma}D(\pi_o,\pi_c)\\
&= (\lambda-\lambda_0)(1-2\alpha)D(\pi_o,\pi_c) + \Big[\lambda_0(1-2\alpha) - \frac{C\beta}{1-\gamma}\Big]D(\pi_o,\pi_c) > 0,
\end{aligned}$$
where the first inequality follows from the triangle inequality $D(\pi_o,\pi_\beta) + D(\pi_o,\pi_c) \ge D(\pi_c,\pi_\beta)$.

In a nutshell, we can conclude that with probability at least $1-\delta$,
$$J(M,\pi_o) \ge J(M,\pi_c) \underbrace{- \eta^c_1 - \eta^c_2}_{(a)} + \underbrace{(\lambda-\lambda_0)(1-2\alpha)D(\pi_o,\pi_c)}_{(b)} + \underbrace{\Big[\lambda_0(1-2\alpha) - \frac{C\beta}{1-\gamma}\Big]D(\pi_o,\pi_c)}_{(c)},$$
where (a) depends on $\delta$ but is independent of $\lambda$, (b) is non-negative and increases with $\lambda$, and (c) is positive. This implies that an appropriate choice of $\lambda$ will make term (b) large enough to counteract term (a) and lead to performance improvement over the meta-policy $\pi_c$: $J(M,\pi_o) \ge J(M,\pi_c) + \xi_1$, where $\xi_1 \ge 0$.

C.2 SAFE IMPROVEMENT OVER πβ

Next, we show that the learnt policy $\pi_o$ achieves safe improvement over the behavior policy $\pi_\beta$. Based on Lemma 1, we have
$$J(\overline{M},\widehat{M},f,\pi_\beta) \ge J(M,\pi_\beta) - \eta_3,$$
where
$$\begin{aligned}
\eta_3 &= \frac{2\gamma(1-f)}{(1-\gamma)^2}R_{\max}D_{tv}(T_{\widehat{M}},T_M) + \frac{\gamma f}{1-\gamma}\Big|\mathbb{E}_{d^{\pi_\beta}_{M}}\big[(T^{\pi_\beta}_{M} - T^{\pi_\beta}_{\overline{M}})Q^{\pi_\beta}_{M}\big]\Big| + \frac{f}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_\beta}_{M}}\big[|r_{\overline{M}}(s,a)-r_M(s,a)|\big] + \frac{1-f}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_\beta}_{M}}\big[|r_{\widehat{M}}(s,a)-r_M(s,a)|\big]\\
&\le \frac{2\gamma(1-f)}{(1-\gamma)^2}R_{\max}D_{tv}(T_{\widehat{M}},T_M) + \frac{\gamma^2 f C_{T,\delta}R_{\max}}{(1-\gamma)^2}\mathbb{E}_{s\sim d^{\pi_\beta}_{M}(s)}\Big[\frac{1}{\sqrt{|D(s)|}}\Big] + \frac{1}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_\beta}_{M}}\Big[\frac{C_{r,\delta}}{\sqrt{|D(s,a)|}}\Big] + \frac{1}{1-\gamma}\mathbb{E}_{s,a\sim d^{\pi_\beta}_{M}}\big[|r_{\widehat{M}}(s,a)-r_M(s,a)|\big] \triangleq \eta^\beta_3,
\end{aligned}$$
where the second line uses Assumption 1 and the fact that $D_{CQL}(\pi_\beta,\pi_\beta)(s) = 0$. Therefore, it follows that
$$\begin{aligned}
&J(M,\pi_o) + \eta^c_1 - \beta\,\frac{\nu(\rho_{\pi_o},f)}{1-\gamma} - \lambda\alpha D(\pi_o,\pi_\beta) - \lambda(1-\alpha)D(\pi_o,\pi_c)\\
&\ge J(\overline{M},\widehat{M},f,\pi_o) - \beta\,\frac{\nu(\rho_{\pi_o},f)}{1-\gamma} - \lambda\alpha D(\pi_o,\pi_\beta) - \lambda(1-\alpha)D(\pi_o,\pi_c)\\
&\ge J(\overline{M},\widehat{M},f,\pi_\beta) - \beta\,\frac{\nu(\rho_{\pi_\beta},f)}{1-\gamma} - \lambda(1-\alpha)D(\pi_\beta,\pi_c)\\
&\ge J(M,\pi_\beta) - \eta^\beta_3 - \beta\,\frac{\nu(\rho_{\pi_\beta},f)}{1-\gamma} - \lambda(1-\alpha)D(\pi_\beta,\pi_c),
\end{aligned}$$
which indicates that with probability at least $1-\delta$,
$$J(M,\pi_o) \ge J(M,\pi_\beta) - \eta^c_1 - \eta^\beta_3 + \lambda\alpha D(\pi_o,\pi_\beta) + \lambda(1-\alpha)D(\pi_o,\pi_c) - \lambda(1-\alpha)D(\pi_\beta,\pi_c) + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_\beta},f)\big],$$
where $\eta^\beta_3$ is a constant that depends on $\delta$ but is independent of $\beta$ and $\lambda$.

To conclude, we have that with probability at least $1-2\delta$,
$$J(M,\pi_o) \ge \max\big\{J(M,\pi_c) + \xi_1,\; J(M,\pi_\beta) + \xi_2\big\},$$
where
$$\xi_1 = -\eta^c_1 - \eta^c_2 + (\lambda-\lambda_0)(1-2\alpha)D(\pi_o,\pi_c) + \Big[\lambda_0(1-2\alpha) - \frac{C\beta}{1-\gamma}\Big]D(\pi_o,\pi_c) \qquad (14)$$
and
$$\xi_2 = -\eta^c_1 - \eta^\beta_3 + \lambda\alpha D(\pi_o,\pi_\beta) + \lambda(1-\alpha)D(\pi_o,\pi_c) - \lambda(1-\alpha)D(\pi_\beta,\pi_c) \qquad (15)$$
$$\qquad\quad + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_\beta},f)\big]. \qquad (16)$$
Moreover, as noted earlier, $\xi_1 > 0$ for a suitably selected $\lambda$ and $\alpha < \frac{1}{2}$. For the term $\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_\beta},f)$ in $\xi_2$, where $\nu(\rho_\pi,f)$ is defined as $\mathbb{E}_{\rho_\pi}\big[\frac{\rho_\pi(s,a)-d(s,a)}{d_f(s,a)}\big]$, as noted in (Yu et al., 2021b), $\nu(\rho_{\pi_\beta},f)$ is expected to be smaller than $\nu(\rho_{\pi_o},f)$ in practical scenarios, because the dynamics $T_{\widehat{M}}$ learnt via supervised learning is close to the underlying dynamics $T_M$ on the states visited by the behavior policy $\pi_\beta$. This directly indicates that $d^{\pi_\beta}_{\widehat{M}}(s,a)$ is close to $d^{\pi_\beta}_{M}(s,a)$, and hence $\rho_{\pi_\beta}$ is close to $d(s,a)$. In this case, let
$$\epsilon = \frac{\beta\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_\beta},f)\big]}{2\lambda(1-\gamma)D(\pi_o,\pi_\beta)}.$$
We can show that for $\alpha > \frac{1}{2} - \epsilon$,
$$\begin{aligned}
\Delta_\beta &= \lambda\alpha D(\pi_o,\pi_\beta) + \lambda(1-\alpha)D(\pi_o,\pi_c) - \lambda(1-\alpha)D(\pi_c,\pi_\beta) + \frac{\beta}{1-\gamma}\big[\nu(\rho_{\pi_o},f)-\nu(\rho_{\pi_\beta},f)\big]\\
&= \lambda\alpha D(\pi_o,\pi_\beta) + \lambda(1-\alpha)D(\pi_o,\pi_c) - \lambda(1-\alpha)D(\pi_c,\pi_\beta) + 2\epsilon\lambda D(\pi_o,\pi_\beta)\\
&= \lambda\big[(2\epsilon+\alpha)D(\pi_o,\pi_\beta) + (1-\alpha)D(\pi_o,\pi_c) - (1-\alpha)D(\pi_c,\pi_\beta)\big]\\
&> \lambda(1-\alpha)\big[D(\pi_o,\pi_\beta) + D(\pi_o,\pi_c) - D(\pi_c,\pi_\beta)\big] \ge 0,
\end{aligned}$$
and $\Delta_\beta$ increases with $\lambda$, which implies that
$$J(M,\pi_o) \ge J(M,\pi_\beta) + \xi_2 = J(M,\pi_\beta) - \eta^c_1 - \eta^\beta_3 + \Delta_\beta > J(M,\pi_\beta)$$
for an appropriate choice of $\lambda$.