# Offline Meta-Reinforcement Learning with Online Self-Supervision

Vitchyr H. Pong 1 Ashvin Nair 1 Laura Smith 1 Catherine Huang 1 Sergey Levine 1

Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. By not requiring reward labels for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.

1University of California, Berkeley. Correspondence to: Vitchyr H. Pong . Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

Reinforcement learning (RL) agents are often described as learning from reward and punishment analogously to animals: in the same way that a person might train a dog by providing treats, we might train RL agents by providing rewards. However, in reality, modern deep RL agents require so many trials to learn a task that providing rewards by hand is often impractical. Meta-reinforcement learning can in principle mitigate this, by learning to learn using a set of meta-training tasks, and then acquiring new behaviors in just a few trials at meta-test time. Current meta-RL methods are so efficient that meta-trained policies require only a handful of trajectories (Rakelly et al., 2019), which is reasonable for a human to provide by hand. However, the meta-training phase in these algorithms still requires a large number of online samples, often even more than standard RL, due to the multi-task nature of the meta-learning problem. A potential solution to this issue is to use offline reinforcement learning methods, which use only prior experience without active data collection. These methods are promising because a user must only annotate multi-task data with rewards once in the offline dataset, rather than doing so in the inner loop of RL training, and the same offline multi-task data can be reused repeatedly for many training runs.
While a few recent works have proposed offline meta-RL algorithms (Dorfman & Tamar, 2020; Mitchell et al., 2021), we identify a specific problem when an agent trained with offline meta-RL is tested on a new task: the distributional shift between the behavior policy from the offline data and the meta-test time exploration policy means that adaptation procedures learned from offline data might not perform well on the (differently distributed) data collected by the exploration policy at meta-test time. This mismatch in training distribution occurs because offline meta-RL never trains on data generated by the meta-learned exploration policy. In practice, we find that this mismatch leads to a large degradation in performance when adapting to new tasks. Moreover, we do not want to remove this distributional shift by simply adopting a conservative exploration strategy, because learning an exploration strategy enables an agent to collect better data for faster adaptation.

We propose to address this challenge by collecting additional online data without any reward supervision, leading to a semi-supervised offline meta-RL algorithm, as illustrated in Figure 1. Online data can be cheaper to collect when it does not require reward labels, but it can still help mitigate the distributional shift issue. To make it feasible to use this data for meta-training, we can generate synthetic reward labels for it based on the labeled offline data. Based on this principle, we propose semi-supervised meta actor-critic (SMAC), which uses reward-labeled offline data to bootstrap a semi-supervised meta-reinforcement learning procedure, in which an offline meta-RL agent collects additional online experience without any reward labels. SMAC uses the reward supervision from the offline dataset to learn to generate new reward functions, which it uses to autonomously annotate rewards in these reward-free interactions and meta-train on this new data.

Our paper contains two novel contributions. First, we identify and provide evidence for the aforementioned distributional shift issue specific to offline meta-RL. Second, we propose a new method that mitigates this distribution shift by performing offline meta-RL with self-supervised online fine-tuning, without ground-truth rewards. Our method is based on PEARL (Rakelly et al., 2019), an efficient amortized-inference method for meta-RL, and AWAC (Nair et al., 2020), an approach for offline RL with online fine-tuning. We evaluate our method and prior offline meta-RL methods on a number of benchmarks (Dorfman & Tamar, 2020; Mitchell et al., 2021), as well as a challenging robot manipulation domain that requires generalization to new tasks, with just a few reward-labeled trials at meta-test time. We find that, while standard meta-RL methods perform well at adapting to training tasks, they suffer from data-distribution shifts when adapting to new tasks. In contrast, our method attains significantly better performance, on par with an online meta-RL method that receives fully labeled online interaction data.

## 2. Related Work

Many prior meta-RL algorithms assume that reward labels are provided with each episode of online interaction (Duan et al., 2016; Finn et al., 2017; Gupta et al., 2018b; Xu et al., 2018; Hausman et al., 2018; Rakelly et al., 2019; Humplik et al., 2019; Kirsch et al., 2019; Zintgraf et al., 2020; Xu et al., 2020; Zhao et al., 2020; Kamienny et al., 2020).
In contrast to these prior methods, our method only requires offline prior data with rewards, and additional online interaction does not require any ground-truth reward signal.

Prior works have also studied other formulations that combine unlabeled and labeled trials. For example, imitation and inverse reinforcement learning methods use offline demonstrations to either learn a reward function (Abbeel & Ng, 2004; Finn et al., 2016; Ho & Ermon, 2016; Fu et al., 2017) or to directly learn a policy (Schaal, 1999; Ross & Bagnell, 2010; Ho & Ermon, 2016; Reddy et al., 2019; Peng et al., 2020). Semi-supervised and positive-unlabeled reward learning methods (Xu & Denil, 2019; Zolna et al., 2020; Konyushkova et al., 2020) use reward labels provided for some interactions to train a reward function for RL. However, all of these methods have been studied in the context of a single task. In contrast, we focus on meta-learning an RL procedure that can adapt to new reward functions. In other words, we do not focus on recovering a single reward function, because there is no single test-time reward or task.

SMAC uses a context-based adaptation procedure similar to that proposed by Rakelly et al. (2019), which is related to contextual policies, such as goal-conditioned RL (Kaelbling, 1993; Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018; Colas et al., 2018; Warde-Farley et al., 2018; Péré et al., 2018; Nair et al., 2018) or successor features (Kulkarni et al., 2016; Barreto et al., 2017; 2019; Grimm et al., 2019). In contrast, SMAC applies to any RL problem, does not assume that the reward is defined by a single goal state or fixed basis function, and only requires reward labels for static offline data.

Our method addresses a similar problem to prior offline meta-RL methods (Mitchell et al., 2021; Dorfman & Tamar, 2020). In our comparisons, we find that after offline meta-training, our method is competitive with these prior approaches, but that with additional self-supervised online fine-tuning, our method significantly outperforms these methods by mitigating the aforementioned distributional shift issue. Our method addresses the distribution shift problem by using online interactions without reward supervision. In our experiments, we found that SMAC greatly improves performance on both training and held-out tasks. Lastly, SMAC is also related to unsupervised meta-learning methods (Gupta et al., 2018a; Jabri et al., 2019), which annotate data with their own rewards. In contrast to these methods, we assume that there exists an offline dataset with reward labels that we can use to learn to generate similar rewards.

## 3. Preliminaries

Meta-reinforcement learning. In meta-RL, we assume there is a distribution of tasks $p(\mathcal{T})$. A task $\mathcal{T}$ is a Markov decision process (MDP), defined by a tuple $\mathcal{T} = (\mathcal{S}, \mathcal{A}, r, \gamma, p_0, p_d)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r$ is a reward function, $\gamma$ is a discount factor, $p_0(s_0)$ is the initial state distribution, and $p_d(s_{t+1} \mid s_t, a_t)$ is the state transition distribution. A replay buffer $\mathcal{D}$ is a set of state, action, reward, next-state tuples, $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N_{\text{size}}}$, where all rewards come from the same task. We use the letter $h$ to denote a mini-batch or history, and the notation $h \sim \mathcal{D}$ to denote that mini-batch $h$ is sampled from a replay buffer $\mathcal{D}$. We use the letter $\tau$ to represent a trajectory $\tau = (s_1, a_1, s_2, \ldots)$ without reward labels.
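To fix this notation in code, the sketch below shows one possible way to represent the per-task replay buffers $\mathcal{D}_i$ and sampled histories $h$; the class and variable names are hypothetical and not taken from the paper's implementation.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical containers mirroring the notation above: one replay buffer D_i per
# task, each storing (s, a, r, s') tuples, and histories h sampled as mini-batches.
Transition = Tuple[list, list, float, list]  # (state, action, reward, next_state)

@dataclass
class TaskReplayBuffer:
    transitions: List[Transition] = field(default_factory=list)

    def add(self, s, a, r, s_next):
        self.transitions.append((s, a, r, s_next))

    def sample_history(self, batch_size: int) -> List[Transition]:
        # A "history" h is just a mini-batch of labeled transitions from one task.
        return random.sample(self.transitions, min(batch_size, len(self.transitions)))

# Offline meta-RL starts from a fixed collection {D_i} of such buffers, one per task.
offline_buffers = {task_id: TaskReplayBuffer() for task_id in range(10)}
```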
Figure 1: (left) In offline meta-RL, an agent uses offline data from multiple tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots$, each with reward labels that must only be provided once. (middle) In online meta-RL, new reward supervision must be provided with every environment interaction. (right) In semi-supervised meta-RL, an agent uses an offline dataset collected once to learn to generate its own reward labels for new, online interactions. Similar to offline meta-RL, reward labels must only be provided once for the offline training, and unlike online meta-RL, the additional environment interactions require neither external reward supervision nor additional task sampling.

A meta-episode consists of sampling a task $\mathcal{T} \sim p(\mathcal{T})$, collecting $T$ trajectories with a policy $\pi_\theta$ with parameters $\theta$, adapting the policy to the task between trajectories, and measuring the performance on the last trajectory. Between trajectories, the adaptation procedure transforms the states and actions $h$ from the current meta-episode into a context $z = A_\phi(h)$, which is then given to the policy $\pi_\theta(a \mid s, z)$ for adaptation. The exact representation of $\pi_\theta$, $A_\phi$, and $z$ depends on the specific meta-RL method used. For example, the context $z$ can be weights of a neural network output by a gradient update (Finn et al., 2017), hidden activations output by a recurrent neural network (Duan et al., 2016), or latent variables output by a stochastic encoder (Rakelly et al., 2019). Using this notation, the objective in meta-RL is to learn the adaptation parameters $\phi$ and policy parameters $\theta$ to maximize performance on a meta-episode given a new task $\mathcal{T}$ sampled from $p(\mathcal{T})$.

PEARL. Since we require an off-policy meta-RL procedure for offline meta-training, we build on probabilistic embeddings for actor-critic RL (PEARL) (Rakelly et al., 2019), an online off-policy meta-RL algorithm. In PEARL, $z$ is a vector and the adaptation procedure $A_\phi$ consists of sampling $z$ from a distribution $z \sim q_{\phi_e}(z \mid h)$. The distribution $q_{\phi_e}$ is generated by an encoder with parameters $\phi_e$. This encoder is a neural network that processes the tuples in $h = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N_{\text{enc}}}$ in a permutation-invariant manner to produce the mean and variance of a diagonal multivariate Gaussian. The policy is a contextual policy $\pi_\theta(a \mid s, z)$ conditioned on $z$ by concatenating $z$ to the state $s$. The policy parameters $\theta$ are trained using soft actor-critic (Haarnoja et al., 2018), which involves learning a Q-function, $Q_w(s, a, z)$, with parameters $w$ that estimates the sum of future discounted rewards conditioned on the current state, action, and context. The encoder parameters are trained by back-propagating the critic loss into the encoder. The actor, critic, and encoder losses are minimized via gradient descent with mini-batches sampled from separate replay buffers for each task.

Offline reinforcement learning. In offline reinforcement learning, we assume that we have access to a dataset $\mathcal{D}$ collected by some behavior policy $\pi_\beta$. An RL agent must train on this fixed dataset and cannot interact with the environment. One challenge that offline RL poses is that the distribution of states and actions that an agent will see when deployed will likely be different from those seen in the offline dataset, as they are generated by the agent, and a number of recent methods have tackled this distribution shift issue (Fujimoto et al., 2019b;a; Kumar et al., 2019; Wu et al., 2019; Nair et al., 2020; Levine et al., 2020).
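As a concrete reference for the PEARL-style components described above, here is a minimal PyTorch sketch of a permutation-invariant context encoder producing a diagonal Gaussian over $z$. This is our own illustrative rendering, not the authors' code; it uses mean-pooling rather than PEARL's exact posterior aggregation, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Permutation-invariant encoder q_phi_e(z | h): each (s, a, r, s') tuple is
    embedded independently, embeddings are averaged across the history, and the
    pooled vector is mapped to the mean and log-variance of a diagonal Gaussian."""

    def __init__(self, transition_dim: int, latent_dim: int, hidden_dim: int = 200):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(transition_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.to_mean = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, history: torch.Tensor):
        # history: (batch, N_enc, transition_dim), one row per (s, a, r, s') tuple.
        pooled = self.embed(history).mean(dim=1)  # mean-pooling => permutation invariance
        mean, logvar = self.to_mean(pooled), self.to_logvar(pooled)
        std = torch.exp(0.5 * logvar)
        return torch.distributions.Normal(mean, std)  # diagonal Gaussian posterior over z

# Example: sample z for a contextual policy pi_theta(a | s, z).
encoder = ContextEncoder(transition_dim=10, latent_dim=5)
posterior = encoder(torch.randn(4, 64, 10))
z = posterior.rsample()  # reparameterized sample, shape (4, 5)
```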
Moreover, one can combine offline RL with meta-RL by training meta-RL on multiple datasets $\mathcal{D}_1, \ldots, \mathcal{D}_{N_{\text{buff}}}$ (Dorfman & Tamar, 2020; Mitchell et al., 2021), but in the next section we describe some limitations of this combination.

## 4. The Problem with Naïve Offline Meta-Reinforcement Learning

Offline meta-RL is the composition of meta-RL and offline RL: the objective is to maximize the standard meta-RL objective using only a fixed set of replay buffers, $\mathcal{D} = \{\mathcal{D}_i\}_{i=1}^{N_{\text{buff}}}$, where each buffer corresponds to data for one task. Offline meta-RL methods can in principle utilize the same constraint-based approaches that standard offline RL algorithms have used to mitigate distributional shift. However, they must also contend with an additional distribution shift challenge that is specific to the meta-RL scenario: distribution shift in $z$-space.

Distribution shift in $z$-space occurs because meta-learning requires learning an exploration policy $\pi_\theta$ that generates data for adaptation. However, offline meta-RL only trains the adaptation procedure $A_\phi(h)$ using offline data generated by a previous behavior policy, which we denote as $\pi_\beta$. After offline training, there will be a mismatch between this learned exploration policy $\pi_\theta$ and the behavior policy $\pi_\beta$, leading to a difference in the history $h$ and, in turn, in the context variables $z = A_\phi(h)$.

Figure 2: Left: The distribution of the KL-divergence between the posterior $q_{\phi_e}(z \mid h)$ and the prior $p(z)$ over the course of meta-training, when $h$ is sampled from offline data (blue) or online data generated by the learned policy (orange). Adapting to online data results in posteriors that are substantially farther from the prior, suggesting a significant difference in distribution over $z$. Right: The performance of the policy after adapting to data from the offline dataset (blue) or the learned policy (orange). Since the same policy is evaluated, the performance drop when conditioned on the online data is likely due to the change in $z$-distribution.

For example, in a robot manipulation setting, the offline dataset may contain smooth trajectories that were collected by a human teleoperator (Kofman et al., 2005). In contrast, the learned exploration trajectories may contain jittering due to learning artifacts or the use of a stochastic policy. This jittering may not impede the robot from exploring the environment, but may result in a trajectory distribution shift that degrades the adaptation process, which only learned to adapt to smooth, human-generated trajectories in the offline dataset. More formally, if $p(z \mid h_{\text{offline}})$ and $p(z \mid h_{\text{online}})$ denote the marginal distributions of $z$ given histories $h$ sampled using offline and online data, respectively, the differences between $\pi_\theta$ and $\pi_\beta$ will lead to differences between $p(z \mid h_{\text{offline}})$ during offline training and $p(z \mid h_{\text{online}})$ at meta-test time.

To illustrate this difference, we compare $p(z \mid h_{\text{offline}})$ and $p(z \mid h_{\text{online}})$ on the Ant Direction task (see Section 6). We approximate these distributions using the PEARL encoder discussed in Section 3, with $p(z \mid h) \approx q_{\phi_e}(z \mid h)$, where $h$ is sampled either from the offline dataset or using the learned policy.
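Because the posterior is a diagonal Gaussian and the prior is a standard Gaussian, this comparison can be computed in closed form; the sketch below (with illustrative names, not the authors' code) shows the diagnostic applied to batches of offline versus online histories.

```python
import torch

def kl_to_standard_normal(posterior: torch.distributions.Normal) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian posterior over z."""
    prior = torch.distributions.Normal(
        torch.zeros_like(posterior.loc), torch.ones_like(posterior.scale))
    # Sum the per-dimension KL terms to get one scalar per history in the batch.
    return torch.distributions.kl_divergence(posterior, prior).sum(dim=-1)

# Hypothetical usage with the ContextEncoder sketched earlier: compare histories
# drawn from the offline buffers against histories collected by the learned policy.
# kl_offline = kl_to_standard_normal(encoder(h_offline))  # h_offline: offline mini-batches
# kl_online = kl_to_standard_normal(encoder(h_online))    # h_online: exploration rollouts
# A systematic gap between the two KL distributions indicates z-space distribution shift.
```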
We measure the KL-divergence observed at the end of offline training between the posterior $p(z \mid h)$ and a fixed prior $p_z(z)$, for different samples of $h$. If the two distributions were the same, then we would expect the distributions of KL divergences to also be similar. However, Figure 2 shows that the two distributions are markedly different after the offline training phase of SMAC.

We also observe that this distribution shift negatively impacts the resulting policy. In Figure 2, we plot the performance of the learned policy when conditioned on $z$ sampled from $q_{\phi_e}(z \mid h_{\text{offline}})$ compared to $q_{\phi_e}(z \mid h_{\text{online}})$. We see that the policy that uses offline data, $h_{\text{offline}}$, leads to improvement, while the same policy that uses data from the learned policy, $h_{\text{online}}$, drops in performance. Since we evaluate the same policy $\pi_\theta$ and only change how $z$ is sampled, this degradation in performance suggests that the policy suffers from distributional shift between $p(z \mid h_{\text{offline}})$ and $p(z \mid h_{\text{online}})$. In other words, the encoder produces $z$ vectors that are too unfamiliar to the policy when conditioned on these exploration trajectories.

Offline meta-RL with self-supervised online training. In complex environments, the learned policy is likely to deviate from the behavior policy and induce the distribution shift in $z$-space. To address this issue, we assume that the agent can autonomously interact with the environment without observing additional reward supervision. This assumption is increasingly applicable, as more effort is made to enable unsupervised robot interactions with automatic safeguards and resets (Levine et al., 2018; Yang et al., 2019; Bloesch et al., 2022). Unsupervised interactions are also cheap in many domains outside of robotics, such as internet navigation (Nogueira & Cho, 2016; Nakano et al., 2021) and video games (Brockman et al., 2016; Torrado et al., 2018), but designing rewards may be difficult (Christiano et al., 2017), expensive (Nakano et al., 2021), or potentially harmful (Hadfield-Menell et al., 2017; Leike et al., 2018). In our case, we use the additional unsupervised interactions to construct $h_{\text{online}}$, allowing an agent to train on the online context distribution, $p(z \mid h_{\text{online}})$, and mitigating distribution shift. However, meta-training requires that the history $h_{\text{online}}$ contain not just states and actions, but also rewards. Our approach enables meta-training on the additional data by autonomously labeling it with synthetic rewards.

## 5. Semi-Supervised Meta Actor-Critic

We now present our method, semi-supervised meta actor-critic (SMAC). SMAC consists of offline meta-training followed by self-supervised online meta-training to mitigate the distribution shift in $z$-space. The SMAC adaptation procedure consists of passing a history through the encoder described in Section 3, resulting in a posterior $q_{\phi_e}(z \mid h)$. Below, we describe both phases.

### 5.1. Offline Meta-Training

To learn from offline data, we adapt PEARL (Rakelly et al., 2019) to the offline setting.

Figure 3: (Left) In the offline phase, we sample a history $h$ to compute the posterior $q_{\phi_e}(z \mid h)$. We then use a sample from this encoder and another history batch $h'$ to train the networks. In red, we then update the networks with $h'$ and the $z$ sample. (Right) During the self-supervised phase, we explore by sampling $z \sim p(z)$ and conditioning our policy on these samples.
We label rewards using our learned reward decoder, and append the resulting data to the training data. The training procedure is equivalent to the offline phase, except that we do not train the reward decoder or encoder, since no additional ground-truth rewards are observed.

Similar to PEARL, we update the critic by minimizing the Bellman error:

$$\mathcal{L}_{\text{critic}}(w) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}_i,\; z \sim q_{\phi_e}(z \mid h),\; a' \sim \pi_\theta(a' \mid s', z)} \left[ \left( Q_w(s, a, z) - \left( r + \gamma Q_{\bar{w}}(s', a', z) \right) \right)^2 \right], \tag{1}$$

where $\bar{w}$ are target network weights updated using soft target updates (Lillicrap et al., 2016). PEARL updates its policy based on soft actor-critic (SAC) (Haarnoja et al., 2018), but when naïvely applied to the offline setting, these policy updates suffer from off-policy bootstrapping error accumulation (Kumar et al., 2019), which occurs when the target Q-function for bootstrapping, $Q(s', a')$, is evaluated at actions $a'$ outside of the training data. To avoid error accumulation during offline training, we implicitly constrain the policy to stay close to the actions observed in the replay buffer, following the approach in a single-task offline RL algorithm called AWAC (Nair et al., 2020). AWAC uses the following loss to approximate a constrained optimization problem, where the policy is constrained to stay close to the data observed in $\mathcal{D}$:

$$\mathcal{L}_{\text{actor}}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D},\; z \sim q_{\phi_e}(z \mid h)} \left[ \log \pi_\theta(a \mid s, z) \exp\left( \frac{Q(s, a, z) - V(s, z)}{\lambda} \right) \right]. \tag{2}$$

We estimate the value function $V(s, z) = \mathbb{E}_{a \sim \pi_\theta(a \mid s, z)} Q(s, a, z)$ with a single sample, and $\lambda$ is the Lagrange multiplier of the underlying constrained optimization problem. See Nair et al. (2020) for a full derivation. With this modified actor update, we can train the encoder, actor, and critic on the offline data without the overestimation issues that afflict conventional actor-critic algorithms (Kumar et al., 2019). However, it does not address the $z$-space distributional shift issue discussed in Section 4, because the exploration policy learned via this offline procedure will still deviate significantly from the behavior policy $\pi_\beta$. As discussed previously, we aim to address this issue by collecting additional online data without reward labels and learning to generate reward labels for self-supervised meta-training.

Learning to generate rewards. To continue meta-training online without ground-truth reward labels, we propose to use the offline dataset to learn a generative model over meta-training task reward functions that we can use to label the transitions collected online. Recall that during offline learning, we learn an encoder $q_{\phi_e}$ that maps experience $h$ to a latent context $z$ that encodes the task. In the same way that we train our policy $\pi_\theta(a \mid s, z)$ to conditionally decode $z$ into actions, we additionally train a reward decoder $r_{\phi_d}(s, a, z)$ with parameters $\phi_d$ that conditionally decodes $z$ into rewards (with this notation, the meta-parameters are the encoder and decoder parameters, i.e., $\phi = \{\phi_e, \phi_d\}$). We train the reward decoder $r_{\phi_d}$ to reconstruct the observed reward in the offline dataset through a mean squared error loss. Because we use the latent space $z$ for reward decoding, we back-propagate the reward decoder loss into $q_{\phi_e}$. As visualized in Figure 3, we also regularize the posteriors $q_{\phi_e}(z \mid h)$ against a prior $p_z(z)$ to provide an information bottleneck in the latent space $z$ and to ensure that samples from $p_z(z)$ represent meaningful latent variables. Prior work such as PEARL back-propagates the critic loss into the encoder, and we found that SMAC performs well regardless of whether or not we back-propagate the critic loss into the encoder (see Appendix B).
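To make the preceding updates concrete, below is a minimal PyTorch-style sketch of the critic loss in Equation (1) and the advantage-weighted actor loss in Equation (2). It is a simplified illustration under stated assumptions (a single critic, no weight clipping or normalization); `q_net`, `q_target_net`, and `policy` are hypothetical callables, and `policy(s, z)` is assumed to return a joint action distribution (e.g., a `torch.distributions.Independent`).

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, q_target_net, policy, batch, z, gamma=0.99):
    """Equation (1): TD error for the context-conditioned Q-function."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        a_next = policy(s_next, z).rsample()                  # a' ~ pi_theta(. | s', z)
        target = r + gamma * q_target_net(s_next, a_next, z)  # r + gamma * Q_wbar(s', a', z)
    return F.mse_loss(q_net(s, a, z), target)

def awac_actor_loss(q_net, policy, batch, z, lam=1.0):
    """Equation (2): advantage-weighted regression toward actions in the buffer."""
    s, a = batch["s"], batch["a"]
    dist = policy(s, z)
    with torch.no_grad():
        v = q_net(s, dist.sample(), z)   # one-sample estimate of V(s, z)
        adv = q_net(s, a, z) - v         # A(s, a, z) = Q(s, a, z) - V(s, z)
        weights = torch.exp(adv / lam)   # exponentiated advantage, temperature lambda
    return -(dist.log_prob(a) * weights).mean()
```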
For simplicity, we train the encoder and reward decoder by minimizing the following loss, in which we assume that $z \sim q_{\phi_e}(z \mid h)$:

$$\mathcal{L}_{\text{reward}}(\phi_d, \phi_e, h, z) = \sum_{(s,a,r) \in h} \left\| r - r_{\phi_d}(s, a, z) \right\|_2^2 + D_{\text{KL}}\!\left( q_{\phi_e}(\cdot \mid h) \,\|\, p_z(\cdot) \right). \tag{3}$$

Figure 4: We propose a new meta-learning evaluation domain based on the environment from Khazatsky et al. (2021), in which a simulated Sawyer gripper can perform various manipulation tasks such as pushing a button, opening drawers, and picking and placing objects. We show a subset of meta-training (blue) and meta-test (orange) tasks. Each task contains a unique object configuration, and we test the agent on held-out tasks.

### 5.2. Self-Supervised Online Meta-Training

We now describe the self-supervised online training procedure, during which we use the reward decoder to provide supervision. First, we collect a trajectory $\tau$ by rolling out our exploration policy $\pi_\theta$ conditioned on a context sampled from the prior $p(z)$. To emulate the offline meta-training supervision, we would like to label $\tau$ with rewards that are in the distribution of meta-training tasks. As such, we sample a replay buffer $\mathcal{D}_i$ uniformly from $\mathcal{D}$ to get a history $h_{\text{offline}} \sim \mathcal{D}_i$ from the offline data. We then sample from the posterior $z \sim q_{\phi_e}(z \mid h_{\text{offline}})$ and label the reward $r_{\text{generated}}$ of a new state and action, $(s, a)$, using the reward decoder:

$$r_{\text{generated}} = r_{\phi_d}(s, a, z), \quad \text{where } z \sim q_{\phi_e}(z \mid h_{\text{offline}}). \tag{4}$$

We then add the labeled trajectory to the buffer and perform actor and critic updates as in offline meta-training. Lastly, since we do not observe additional ground-truth rewards, we do not update the reward decoder $r_{\phi_d}$ or encoder $q_{\phi_e}$ during the self-supervised phase and only train the policy and Q-function. We visualize this procedure in Figure 3.

We note that the distribution shift discussed in Section 4 is not an issue for the reward decoder. The distribution shift only occurs when we sample $z$ from the encoder using online data, i.e., $z \sim q_{\phi_e}(z \mid h_{\text{online}})$, but we only sample from the reward decoder using $z$ sampled from the encoder using offline data, i.e., $z \sim q_{\phi_e}(z \mid h_{\text{offline}})$. The reward decoder does need to generalize to new states and actions, and we hypothesize that reward-prediction generalization is easier than policy generalization. If true, then we would expect that using the reward decoder to label rewards and training on those labels, as in SMAC, will outperform directly using the policy on new tasks (as in Figure 2).

### 5.3. Algorithm Summary and Details

We visualize SMAC in Figure 3. For offline training, we assume access to offline datasets $\mathcal{D} = \{\mathcal{D}_i\}_{i=1}^{N_{\text{buff}}}$, where each buffer corresponds to data generated for one task. Each iteration, we sample a buffer $\mathcal{D}_i \sim \mathcal{D}$ and a history from this buffer, $h \sim \mathcal{D}_i$. We condition the stochastic encoder $q_{\phi_e}$ on this history to obtain a sample $z \sim q_{\phi_e}(z \mid h)$, and sample a second history $h' \sim \mathcal{D}_i$ to update the Q-function, the policy, the encoder, and the decoder by minimizing Equation (1), Equation (2), and Equation (3), respectively. During the self-supervised phase, we found it beneficial to train the actor with a combination of the loss in Equation (2) and the original PEARL actor loss, weighted by a hyperparameter $\lambda_{\text{pearl}}$. We provide pseudo-code for SMAC in Appendix A.
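To make the relabeling step concrete, the sketch below shows one way the self-supervised phase could generate rewards for an unlabeled trajectory using the frozen encoder and reward decoder (Equation 4). It reuses the hypothetical `TaskReplayBuffer` from the earlier sketch, and assumes the encoder accepts a sampled history in its expected batched format; none of the names are from the paper's implementation.

```python
import random
import torch

def relabel_trajectory(trajectory, offline_buffers, encoder, reward_decoder,
                       history_size=64):
    """Label an unlabeled online trajectory with synthetic rewards (Equation 4).

    trajectory: list of (s, a, s_next) tensors collected by the exploration policy.
    offline_buffers: dict of per-task offline buffers with ground-truth rewards.
    encoder / reward_decoder: frozen q_phi_e and r_phi_d from offline training.
    """
    # Sample a task buffer uniformly and a reward-labeled offline history from it.
    buffer = random.choice(list(offline_buffers.values()))
    h_offline = buffer.sample_history(history_size)         # offline, not online, data
    with torch.no_grad():
        z = encoder(h_offline).rsample()                     # z ~ q_phi_e(z | h_offline)
        labeled = [(s, a, reward_decoder(s, a, z), s_next)   # r_generated = r_phi_d(s, a, z)
                   for (s, a, s_next) in trajectory]
    # The labeled transitions are appended to buffer D_i and used for actor/critic updates.
    return labeled
```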
## 6. Experiments

We presented a method that uses self-supervised data to mitigate the $z$-space distribution shift that occurs in offline meta-RL. In this section, we evaluate how well the self-supervised phase of SMAC mitigates the resulting drop in performance, and we compare SMAC to prior offline meta-RL methods on a range of meta-RL benchmark tasks that require generalization to unseen tasks at meta-test time.

Meta-RL tasks. We first evaluate our method on multiple simulated MuJoCo (Todorov et al., 2012) meta-learning tasks that have been used in past online and offline meta-RL papers (Finn et al., 2017; Rakelly et al., 2019; Dorfman & Tamar, 2020; Mitchell et al., 2021) (see Figure 10). Although there are standard benchmarks for offline RL (Fu et al., 2020) and meta-RL (Yu et al., 2020), there are no standard benchmarks for offline meta-RL. We use a combination of tasks from prior work (Dorfman & Tamar, 2020; Mitchell et al., 2021), and introduce a more complex robotic manipulation domain to stress-test task generalization. We first evaluate variants of Cheetah Velocity, Ant Direction, Humanoid, Walker Param, and Hopper Param, based on prior online meta-RL work (Rothfuss et al., 2018; Rakelly et al., 2019; Dorfman & Tamar, 2020). In the first three domains, the reward is based on a target velocity, and a meta-episode consists of sampling a desired velocity. The last task, Humanoid, is particularly challenging due to its 376-dimensional state space. The last two domains, Walker Param and Hopper Param, require adapting to different physics parameters (friction, joint mass, and inertia). The agent must adapt within $T = 3$ episodes of 200 steps each.

Figure 5: We report the final return of meta-test adaptation on unseen test tasks versus the amount of self-supervised meta-training following offline meta-training. Our method SMAC, shown in red, consistently trains to a reasonable performance from offline meta-RL (shown at step 0) and then steadily improves with online self-supervised experience. The offline meta-RL methods, MACAW (Mitchell et al., 2021) and BOReL, are competitive with the offline performance of SMAC but have no mechanism to improve via self-supervision. We also compare to SMAC (SAC ablation), which uses SAC instead of AWAC as the underlying RL algorithm. This ablation struggles to train a value function offline, and so struggles to improve on more difficult tasks.

We also evaluated SMAC on a significantly more diverse robot manipulation meta-learning task called Sawyer Manipulation, based on the goal-conditioned environment introduced by Khazatsky et al. (2021). This is a simulated PyBullet environment (Coumans & Bai, 2016–2021) in which a Sawyer robot arm can manipulate various objects. Sampling a task $\mathcal{T} \sim p(\mathcal{T})$ involves sampling a new configuration of the environment and the desired behavior to achieve, such as pushing a button, opening a drawer, or lifting an object (see Figure 4).
The sparse reward is 0 when the desired behavior is achieved and -1 otherwise, and we detail the state and action space in Appendix C. The need to handle sparse rewards, precise movements, and diverse object positions for each task makes this domain especially difficult. In all of the environments, we test the meta-RL procedure's ability to generalize to new tasks by evaluating the policies on held-out tasks sampled from the same distribution as in the offline datasets. We give a complete description of the environments and task distributions in Appendix C.

Offline data collection. For the MuJoCo tasks, we use the replay buffer from a single PEARL run with ground-truth reward. We limit the data collection to 1200 transitions (6 trajectories) per task and terminate the PEARL run early, forcing the meta-RL agent to learn from highly suboptimal data. For Sawyer Manipulation, we collect data for 50 training tasks using a scripted policy that randomly performs one of the tasks in the environment. For each task, we collected 50 trajectories of length 75. The task performed by the scripted policy (e.g., open a drawer) may differ from the task that is rewarded (e.g., push a button). As a result, in the offline dataset, the robot succeeds on the task 46% of the time, indicating that this data is highly suboptimal. In contrast to prior work (Dorfman & Tamar, 2020; Mitchell et al., 2021), which uses nearly complete online RL training runs, we use this suboptimal data because it enables us to test how well the different methods can improve over suboptimal offline data. See Appendix C for further details.

Comparisons and ablations. As an upper bound, we include the performance of PEARL with online training using oracle ground-truth rewards, rather than self-generated rewards, which we label Online Oracle. To understand the importance of using the actor loss in Equation (2), we include an ablation that replaces Equation (2) with the actor loss from PEARL, which we label SMAC (actor ablation). We also include a meta-imitation baseline, labeled meta behavior cloning, which replaces the actor update in Equation (2) with simply maximizing $\log \pi_\theta(a \mid s, z)$. A gap between SMAC and this imitation method would help us understand whether or not our method improves over the (possibly sub-optimal) behavior policy.

Figure 6: Example XY-coordinates visited by a learned policy on the Ant Direction task. Left: Immediately after offline training, the post-adaptation policy moves in many different directions when conditioned on $h_{\text{offline}}$ (blue). However, when conditioned on $h_{\text{online}}$ (orange), the policy only moves up and to the left, suggesting that the post-adaptation policy is sensitive to the data distribution used to collect $h$. Right: After the self-supervised phase, the post-adaptation policy moves in many directions regardless of the data source, suggesting that the self-supervised phase mitigates the distribution shift between conditioning on offline and online data.

We also compare to the two previously proposed offline meta-RL methods: meta actor-critic with advantage weighting, labeled MACAW (Mitchell et al., 2021), and Bayesian offline RL, labeled BOReL (Dorfman & Tamar, 2020).
MACAW and BOReL assume that rewards are provided during training and cannot make use of the self-supervised online interaction, and so we cannot compare directly to these past works. Therefore, we report their performance only after offline training. For these prior works, we used the open-sourced code directly from each paper and train these methods using the same offline dataset. To ensure a fair comparison, we ran both the original hyperparameters and the same hyperparameters as SMAC (matching network size, learning rate, and batch size), taking the better of the two as the result for each prior method. We describe all hyperparameters in Appendix C.3.

Comparison results. We plot the mean post-adaptation returns and standard deviation across 4 seeds in Figure 5. We see that across all six environments, SMAC consistently improves during the self-supervised phase, and often achieves a similar performance to the oracle that uses ground-truth reward during the online phase of learning. SMAC also greatly improves over meta behavior cloning, confirming that the offline dataset is far from optimal. Even before the online phase, our method is competitive with BOReL and MACAW as a stand-alone offline meta-RL method, outperforming them after just the offline phase in 4 of the 6 domains. This large performance difference may be because these prior methods were developed assuming orders of magnitude more offline data than in our experiments (see Appendix C.1 for further discussion). When self-supervised training is included, we see that SMAC significantly improves over the performance of BOReL on all 6 tasks and significantly improves over MACAW on 5 of the 6 tasks, including the Sawyer Manipulation environment, which is by far the most challenging and exhibits the most variability between tasks. In this domain, we also see the largest gains from the AWAC actor update, in contrast to the actor ablation (blue), which uses a PEARL-style update, indicating that properly handling the offline phase is important for good performance. In conclusion, only our method attains generalization performance close to the oracle upper bound on all six tasks, indicating that self-supervised online fine-tuning is highly effective for mitigating distributional shift in meta-RL, whereas without it, offline meta-RL generally does not exceed the performance of a meta-imitation learning baseline.

Visualizing the distribution shift. Lastly, we further study whether self-supervised training helps specifically because it mitigates a distribution shift caused by the exploration policy. We visualize the trajectories of the learned policy both before and after the self-supervised phase for the Ant Direction task in Figure 6. For each plot, we show trajectories from the policy $\pi_\theta(a \mid s, z)$ when the encoder $q_{\phi_e}(z \mid h)$ is conditioned on histories from either the offline dataset ($h_{\text{offline}}$) or the learned exploration policy ($h_{\text{online}}$). Since the same policy is evaluated, differences between the resulting trajectories represent the distribution shift caused by using history from the learned exploration policy rather than from the offline dataset. We see that before the self-supervised phase, there is a large difference between the two modes that can only be attributed to the difference in $h$. When using $h_{\text{online}}$, the post-adaptation policy only explores one mode, but when using $h_{\text{offline}}$, the policy moves in all directions.
This qualitative difference explains the large performance gap observed in Figure 2 and highlights that the adaptation procedure is sensitive to the history $h$ used to adapt. In contrast, after the self-supervised phase, the policy moves in all directions regardless of the source of the history. In Appendix B, we visualize the exploration trajectories and find that they are qualitatively similar both before and after the self-supervised phase, providing further evidence that the issue is specifically with the adaptation procedure and not exploration. Together, these results illustrate that the SMAC meta-policy learns to adapt to the exploration trajectories by using the self-supervised interactions to mitigate the $z$-distribution shift in naïve offline meta-RL.

## 7. Conclusion

We studied a problem specific to offline meta-RL: distribution shift in the context parameters $z$. This distribution shift occurs because the data collected by the meta-learned exploration policy differs from the offline dataset. To address this problem, we assumed that an agent can sample new trajectories without additional reward labels and presented SMAC, a method that uses these additional interactions and synthetically generated rewards to alleviate this distribution shift. Experimentally, we found that SMAC generally outperforms prior offline meta-RL methods, especially after the self-supervised online phase. A limitation of SMAC is that it assumes that the agent can gather additional unlabeled online samples. However, this assumption may not always be satisfied, and so an important direction for future work is to develop more systems with safe, autonomous environment interactions. Lastly, using self-supervised interaction to mitigate distribution shift in offline meta-RL is orthogonal to the underlying meta-RL algorithm. In this paper, we built on PEARL (Rakelly et al., 2019) due to its simplicity, and future work can apply this insight to improve existing (Mitchell et al., 2021; Dorfman & Tamar, 2020) or new offline meta-RL algorithms.

## Acknowledgements

This research was supported by the Office of Naval Research, ARL DCIST CRA W911NF-17-2-0181, Google, and Berkeley Deep Drive, with compute support from Microsoft Azure.

## References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, pp. 1, 2004.

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., Mcgrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight Experience Replay. In Advances in Neural Information Processing Systems (NIPS), 2017. URL http://arxiv.org/abs/1707.01495.

Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055-4065, 2017.

Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Žídek, A., and Munos, R. Transfer in deep reinforcement learning using successor features and generalised policy improvement. arXiv preprint arXiv:1901.10964, 2019.

Bloesch, M., Humplik, J., Patraucean, V., Hafner, R., Haarnoja, T., Byravan, A., Siegel, N. Y., Tunyasuvunakool, S., Casarini, F., Batchelor, N., et al. Towards real robot learning in the wild: A case study in bipedal locomotion. In Conference on Robot Learning, pp.
1502 1511. PMLR, 2022. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. ar Xiv preprint ar Xiv:1606.01540, 2016. Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. ar Xiv preprint ar Xiv:1706.03741, 2017. Colas, C., Sigaud, O., and Oudeyer, P.-Y. Gep-pg: Decoupling exploration and exploitation in deep reinforcement learning algorithms. International Conference on Machine Learning (ICML), 2018. Coumans, E. and Bai, Y. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016 2021. Dorfman, R. and Tamar, A. Offline meta reinforcement learning. ar Xiv preprint ar Xiv:2008.02598, 2020. Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl2: Fast reinforcement learning via slow reinforcement learning. ar Xiv preprint ar Xiv:1611.02779, 2016. Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International conference on machine learning, pp. 49 58. PMLR, 2016. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126 1135. PMLR, 2017. Offline Meta-Reinforcement Learning with Online Self-Supervision Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. ar Xiv preprint ar Xiv:1710.11248, 2017. Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4rl: Datasets for deep data-driven reinforcement learning. ar Xiv preprint ar Xiv:2004.07219, 2020. Fujimoto, S., Conti, E., Ghavamzadeh, M., and Pineau, J. Benchmarking batch deep reinforcement learning algorithms. ar Xiv preprint ar Xiv:1910.01708, 2019a. Fujimoto, S., Meger, D., and Precup, D. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052 2062. PMLR, 2019b. Grimm, C., Higgins, I., Barreto, A., Teplyashin, D., Wulfmeier, M., Hertweck, T., Hadsell, R., and Singh, S. Disentangled cumulants help successor representations transfer to new tasks. ar Xiv preprint ar Xiv:1911.10866, 2019. Gupta, A., Eysenbach, B., Finn, C., and Levine, S. Unsupervised meta-learning for reinforcement learning. ar Xiv preprint ar Xiv:1806.04640, 2018a. Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-reinforcement learning of structured exploration strategies. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018b. Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018. Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S., and Dragan, A. Inverse reward design. ar Xiv preprint ar Xiv:1711.02827, 2017. Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018. Ho, J. and Ermon, S. Generative adversarial imitation learning. ar Xiv preprint ar Xiv:1606.03476, 2016. Humplik, J., Galashov, A., Hasenclever, L., Ortega, P. A., Teh, Y. W., and Heess, N. Meta reinforcement learning as task inference. ar Xiv preprint ar Xiv:1905.06424, 2019. Jabri, A., Hsu, K., Eysenbach, B., Gupta, A., Levine, S., and Finn, C. 
Unsupervised curricula for visual metareinforcement learning. ar Xiv preprint ar Xiv:1912.04226, 2019. Kaelbling, L. P. Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), volume vol.2, pp. 1094 8, 1993. Kamienny, P.-A., Pirotta, M., Lazaric, A., Lavril, T., Usunier, N., and Denoyer, L. Learning adaptive exploration strategies in dynamic environments through informed policy regularization. ar Xiv preprint ar Xiv:2005.02934, 2020. Khazatsky, A., Nair, A., Jing, D., and Levine, S. What can i do here? learning new skills by imagining visual affordances. In International Conference on Robotics and Automation. IEEE, 2021. Kirsch, L., van Steenkiste, S., and Schmidhuber, J. Improving generalization in meta reinforcement learning using learned objectives. ar Xiv preprint ar Xiv:1910.04098, 2019. Kofman, J., Wu, X., Luu, T. J., and Verma, S. Teleoperation of a robot manipulator using a vision-based human-robot interface. IEEE transactions on industrial electronics, 52 (5):1206 1219, 2005. Konyushkova, K., Zolna, K., Aytar, Y., Novikov, A., Reed, S., Cabi, S., and de Freitas, N. Semi-supervised reward learning for offline reinforcement learning. ar Xiv preprint ar Xiv:2012.06899, 2020. Kulkarni, T. D., Saeedi, A., Gautam, S., and Gershman, S. J. Deep successor reinforcement learning. ar Xiv preprint ar Xiv:1606.02396, 2016. Kumar, A., Fu, J., Tucker, G., and Levine, S. Stabilizing off-policy q-learning via bootstrapping error reduction. ar Xiv preprint ar Xiv:1906.00949, 2019. Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. Scalable agent alignment via reward modeling: a research direction. ar Xiv preprint ar Xiv:1811.07871, 2018. Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., and Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421 436, 2018. Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2016. ISSN 10769757. doi: 10.1613/jair.301. URL https://arxiv.org/pdf/ 1509.02971.pdf. Offline Meta-Reinforcement Learning with Online Self-Supervision Mitchell, E., Rafailov, R., Peng, X. B., Levine, S., and Finn, C. Offline meta-reinforcement learning with advantage weighting. In International Conference on Machine Learning, pp. 7780 7791. PMLR, 2021. Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. Visual Reinforcement Learning with Imagined Goals. In Advances in Neural Information Processing Systems (Neur IPS), 2018. Nair, A., Gupta, A., Dalal, M., and Levine, S. Awac: Accelerating online reinforcement learning with offline datasets. ar Xiv preprint ar Xiv:2006.09359, 2020. Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. Webgpt: Browser-assisted question-answering with human feedback. ar Xiv preprint ar Xiv:2112.09332, 2021. Nogueira, R. and Cho, K. End-to-end goal-driven web navigation. Advances in neural information processing systems, 29:1903 1911, 2016. Peng, X. B., Coumans, E., Zhang, T., Lee, T.-W., Tan, J., and Levine, S. Learning agile robotic locomotion skills by imitating animals. 
In Robotics: Science and Systems, 2020. P er e, A., Forestier, S., Sigaud, O., and Oudeyer, P.-Y. Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration. In International Conference on Learning Representations (ICLR), 2018. URL https://arxiv.org/pdf/1803.00781.pdf. Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal Difference Models: Model-Free Deep RL For Model Based Control. In International Conference on Learning Representations (ICLR), 2018. URL https://arxiv. org/pdf/1802.09081.pdf. Rakelly, K., Zhou, A., Finn, C., Levine, S., and Quillen, D. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International conference on machine learning, pp. 5331 5340. PMLR, 2019. Reddy, S., Dragan, A. D., and Levine, S. Sqil: Imitation learning via reinforcement learning with sparse rewards. ar Xiv preprint ar Xiv:1905.11108, 2019. Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 661 668. JMLR Workshop and Conference Proceedings, 2010. Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. Promp: Proximal meta-policy search. In International Conference on Learning Representations, 2018. Schaal, S. Is imitation learning the route to humanoid robots? Trends in cognitive sciences, 3(6):233 242, 1999. Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal Value Function Approximators. In International Conference on Machine Learning (ICML), 2015. Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., and Hardt, M. Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. In International Conference on Machine Learning (ICML), 2020. URL https: //test-time-training.github.io/. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012. Torrado, R. R., Bontrager, P., Togelius, J., Liu, J., and Perez Liebana, D. Deep reinforcement learning for general video game ai. In 2018 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1 8. IEEE, 2018. Warde-Farley, D., de Wiele, T. V., Kulkarni, T., Ionescu, C., Hansen, S., and Mnih, V. Unsupervised control through non-parametric discriminative rewards. Co RR, abs/1811.11359, 2018. Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. ar Xiv preprint ar Xiv:1911.11361, 2019. Xu, D. and Denil, M. Positive-unlabeled reward learning. ar Xiv preprint ar Xiv:1911.00459, 2019. Xu, Z., van Hasselt, H. P., and Silver, D. Meta-gradient reinforcement learning. Advances in neural information processing systems, 31:2396 2407, 2018. Xu, Z., van Hasselt, H., Hessel, M., Oh, J., Singh, S., and Silver, D. Meta-gradient reinforcement learning with an objective discovered online. ar Xiv preprint ar Xiv:2007.08433, 2020. Yang, B., Zhang, J., Pong, V., Levine, S., and Jayaraman, D. Replab: A reproducible low-cost arm benchmark platform for robotic learning. ar Xiv preprint ar Xiv:1905.07447, 2019. Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, pp. 1094 1100. PMLR, 2020. Zhao, T. Z., Nagabandi, A., Rakelly, K., Finn, C., and Levine, S. 
Meld: Meta-reinforcement learning from images via latent state models. arXiv preprint arXiv:2010.13957, 2020.

Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. Proceedings of ICLR 2020, 2020.

Zolna, K., Novikov, A., Konyushkova, K., Gulcehre, C., Wang, Z., Aytar, Y., Denil, M., de Freitas, N., and Reed, S. Offline learning from demonstrations and unlabeled experience. arXiv preprint arXiv:2011.13885, 2020.

## Supplementary Material

## A. Method Pseudo-code

We present the pseudo-code for SMAC in Algorithm 1.

Algorithm 1: Semi-Supervised Meta Actor-Critic
1: Input: datasets $\mathcal{D} = \{\mathcal{D}_i\}_{i=1}^{N_{\text{buff}}}$, policy $\pi_\theta$, Q-function $Q_w$, encoder $q_{\phi_e}$, and decoder $r_{\phi_d}$.
2: for iteration $n = 1, 2, \ldots, N_{\text{offline}}$ do (offline phase)
3:   Sample buffer $\mathcal{D}_i \sim \mathcal{D}$ and two histories from the buffer, $h, h' \sim \mathcal{D}_i$.
4:   Use the first history sample $h$ to infer the context: $z \sim q_{\phi_e}(z \mid h)$.
5:   Update $\pi_\theta$, $Q_w$, $q_{\phi_e}$, $r_{\phi_d}$ by minimizing $\mathcal{L}_{\text{actor}}$, $\mathcal{L}_{\text{critic}}$, $\mathcal{L}_{\text{reward}}$ with samples $z, h'$.
6: for iteration $n = 1, 2, \ldots, N_{\text{online}}$ do (self-supervised phase)
7:   Collect trajectory $\tau$ with $\pi_\theta(a \mid s, z)$, with $z \sim p(z)$.
8:   Sample buffer $\mathcal{D}_i \sim \mathcal{D}$ and offline history $h_{\text{offline}} \sim \mathcal{D}_i$.
9:   Use $h_{\text{offline}}$ to label the rewards in $\tau$, as in Equation (4), and add the resulting data to $\mathcal{D}_i$.
10:  Sample buffer $\mathcal{D}_i \sim \mathcal{D}$ and two histories from the buffer, $h, h' \sim \mathcal{D}_i$.
11:  Encode the first history: $z \sim q_{\phi_e}(z \mid h)$.
12:  Update $\pi_\theta$, $Q_w$ by minimizing $\mathcal{L}_{\text{actor}}$, $\mathcal{L}_{\text{critic}}$ with samples $z, h'$.

## B. Additional Experimental Results and Discussion

Sensitivity to encoder loss. Prior work such as Rakelly et al. (2019) trains the encoder only on the Q-loss, whereas SMAC trains the encoder with the reward loss. In this section we study the sensitivity of our method to the loss used for the encoder. On the Walker Param, Hopper Param, and Humanoid tasks, we ran the following ablations. In our first ablation, called only Q-loss (online & offline), we updated the encoder using only the Q-loss rather than the reward loss, to mimic prior work (Rakelly et al., 2019). In our second ablation, called only Q-loss (online), we implemented a variant that combines SMAC with training the encoder on the Q-loss: we train the encoder on the reward and Q-loss in the offline phase, but in the online phase, we train the encoder only on the Q-loss. For all ablations, we keep the KL loss for the encoder.

The results are shown in Figure 7. We see that the first ablation, shown in black, performs poorly, especially in the environments that require adapting to new dynamics (Walker and Hopper). This is most likely because the encoder does not encode enough information to generate realistic rewards when trained only on the Q-loss. We see that the second ablation, shown in blue, is on par with or slightly outperforms SMAC. This improved performance may be because the Q-function must learn features about the dynamics, and so the Q-loss would encourage the encoder to learn features that summarize the dynamics. In the Walker Param and Hopper Param tasks, the dynamics vary between the tasks, and so learning dynamics-aware features in the encoder would likely improve the agent's ability to adapt. Because the benefits of this variant are not too drastic, we decided not to add the complexity of the second ablation to SMAC, and simply train the encoder on only the reward loss during both the online and offline phases.
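Mechanically, these ablations amount to choosing which losses are allowed to back-propagate through the reparameterized $z$ sample into the encoder. The sketch below is an illustrative rendering of that choice (hypothetical names and flags, not the paper's implementation); detaching $z$ before a loss term removes that term's influence on the encoder.

```python
import torch

def encoder_training_signal(posterior, compute_reward_loss, compute_critic_loss,
                            kl_weight, variant="smac", phase="offline"):
    """Illustrative control of which losses back-propagate into the encoder.

    posterior: q_phi_e(z | h), a reparameterizable distribution over z.
    compute_*_loss: callables that take z and return a scalar loss tensor.
    variant: "smac" (reward loss trains the encoder), "q_only_both" (only the
    Q-loss does), or "q_only_online" (reward + Q-loss offline, Q-loss online).
    The KL regularizer against the standard-normal prior is kept in all variants.
    """
    z = posterior.rsample()
    prior = torch.distributions.Normal(
        torch.zeros_like(posterior.loc), torch.ones_like(posterior.scale))
    kl = torch.distributions.kl_divergence(posterior, prior).sum(-1).mean()

    if variant == "smac":
        reward_loss = compute_reward_loss(z)             # gradients reach the encoder
        critic_loss = compute_critic_loss(z.detach())    # critic trained, encoder untouched
    elif variant == "q_only_both":
        reward_loss = compute_reward_loss(z.detach())
        critic_loss = compute_critic_loss(z)
    elif variant == "q_only_online":
        reward_loss = compute_reward_loss(z if phase == "offline" else z.detach())
        critic_loss = compute_critic_loss(z)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return reward_loss + critic_loss + kl_weight * kl
```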
Exploration and offline dataset visualization. In Figure 8, we visualize the post-adaptation trajectories generated when conditioning the encoder on the online exploration trajectories $h_{\text{online}}$ and on the offline trajectories $h_{\text{offline}}$. Similar to Figure 6, we also visualize the online and offline trajectories themselves. We see that the exploration trajectories $h_{\text{online}}$ and the offline trajectories $h_{\text{offline}}$ are very different (green vs. red, respectively), but the self-supervised phase mitigates the negative impact that this distribution shift has on offline meta-RL. In particular, the post-adaptation trajectories conditioned on these two data sources (blue and orange) are similar after the self-supervised training, whereas before the self-supervised training, only the trajectories conditioned on the offline data (blue) move in multiple directions.

Figure 7: Ablations of SMAC that vary how the encoder is trained during the online and offline phases. In red, we plot the performance of SMAC. In black, we train the encoder with only the Q-loss during both the offline and online phases. In blue, we train the encoder with the Q-loss and reward loss during the offline phase, and then train the encoder with only the Q-loss during the online phase. We see that this last ablation in blue performs on par with or even better than SMAC, whereas the first ablation does not perform well. These results suggest that back-propagating the Q-loss into the encoder can also help the encoder, but it is necessary to train the encoder on the reward loss.

Addressing state-space distribution shift by self-supervised meta-training on test tasks. Another source of distribution shift that can negatively impact a meta-policy is a distribution shift in state space. While this distribution shift occurs in standard offline RL, we expect this issue to be more prominent in meta-RL, where there is a focus on generalizing to completely novel tasks. In many real-world scenarios, experiencing the state distribution of a novel task is possible, but it is the supervision (i.e., the reward signal) that is expensive to obtain. Can we mitigate state distribution shift by allowing the agent to meta-train in the test-task environments, but without rewards? In this experiment, we evaluate our method, SMAC, when training online on the test tasks instead of on the meta-training tasks as in the experiments in Section 6. Prior work has explored this idea of self-supervision with test tasks in supervised learning (Sun et al., 2020) and goal-conditioned RL (Khazatsky et al., 2021). We use the Sawyer Manipulation environment to study how self-supervised training can mitigate state distribution shifts, as these environments contain significant variation between tasks. To further increase the complexity of the environment, we use a version of the environment which samples from a set of eight potential desired behaviors instead of three. We compare self-supervised training on test tasks to self-supervised training on the set of meta-training tasks, which are also the tasks contained in the offline dataset. A large gap in performance indicates that interacting with the test tasks can mitigate the resulting distribution shift even when no reward labels are provided.
We show the results in Figure 9 and find that there is indeed a large performance gap between the two training modes: self-supervision on test tasks improves post-adaptation returns, while self-supervision on the meta-training tasks does not. We also compare to an oracle method that performs online training on the test tasks with the ground-truth reward signal. We see that SMAC is competitive with the oracle, demonstrating that we do not need access to rewards in order to improve on test tasks. Instead, the entire performance gain comes from experiencing the new state distribution of the test tasks. Overall, these results suggest that SMAC is effective at mitigating distribution shifts in both z-space and state space, even when the agent can only interact with the environment without reward supervision.

[Figure 8: two panels, "After Offline Training" and "After Self-Supervised Training"; x-axis: COM X-position, y-axis: COM Y-position; legend: post-adaptation policy with z ∼ q(z | h_offline), post-adaptation policy with z ∼ q(z | h_explore), exploration policy with z ∼ p(z), offline dataset.]
Figure 8: We duplicate Figure 6 but include the exploration trajectories (green) and example trajectories from the offline dataset (red). The exploration policy both before and after self-supervised training primarily moves up and to the left, whereas the offline data moves in all directions. Before the self-supervised phase, conditioning the encoder on online data (orange) rather than offline data (blue) results in very different policies, with the online data causing the post-adaptation policy to only move up and to the left. However, the self-supervised phase of SMAC mitigates the impact of this distribution shift and results in qualitatively similar post-adaptation trajectories, despite the large difference between the exploration trajectories and the offline dataset trajectories.

[Figure 9: learning curves on Sawyer Manipulation; x-axis: Number of Environment Steps (x1000), y-axis: Post-Adaptation Returns; legend: Self-Supervision on Test Tasks, Oracle, Self-Supervision on Train Tasks.]
Figure 9: Learning curves when performing self-supervised training on the test environments (red) or the meta-training environments (blue). We also compare to an oracle that trains on the test environments with ground-truth rewards (black). Interacting with the test environments without rewards allows for steady improvement in post-adaptation test performance and obtains performance similar to meta-training on those environments with ground-truth rewards.

Reward accuracy dependence. Our method involves training a reward decoder, and so a natural question is: how good must the reward decoder be for SMAC to work? We found that the reward loss does not need to be particularly low for our method to work. Specifically, in the Sawyer Manipulation environment, the reward is either 0 or 1, so the maximum possible reward is 1. We observed that the reward decoder loss is around 0.2 on the training tasks and around 0.25 on the test tasks, indicating that the method performs well even without a particularly low reward decoder loss.
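To make the reward-relabeling step concrete, below is a minimal sketch of how self-generated rewards could be produced with the learned encoder and reward decoder. The interfaces and tensor layout are hypothetical placeholders, and the sketch is an illustration of the idea behind Equation (4) rather than our exact implementation.

```python
# Sketch: label a reward-free trajectory with the learned reward decoder.
# Hypothetical interfaces; illustrative of Equation (4), not the exact implementation.
import torch

@torch.no_grad()
def relabel_rewards(encoder, reward_decoder, h_offline, trajectory):
    """h_offline: reward-labeled offline history used to infer the task.
    trajectory: dict with 'states' and 'actions' tensors from a reward-free rollout."""
    # Infer the task variable from the reward-labeled offline history.
    z_dist = encoder(h_offline)          # posterior q(z | h_offline)
    z = z_dist.sample()                  # z ~ q(z | h_offline)

    states, actions = trajectory["states"], trajectory["actions"]
    z_tiled = z.expand(states.shape[0], -1)

    # Predict a reward for every (s, a) in the rollout, conditioned on z.
    synthetic_rewards = reward_decoder(states, actions, z_tiled)

    labeled = dict(trajectory)
    labeled["rewards"] = synthetic_rewards
    return labeled
```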
Why does the performance not degrade over time? During the online phase, the amount of online data with self-generated rewards continues to grow, while the amount of offline data with ground-truth rewards stays constant, so one might wonder whether the online performance of SMAC degrades over time as most of the replay buffers fill with generated rewards. Our intuition for why we did not observe this is that the purpose of meta-RL is to learn to adapt to any task, and so there is no incorrect training data. Of course, any learning method, including SMAC, requires some generalization between the meta-train and meta-test tasks, so a drop in performance is possible, particularly if the reward model does not fit the offline data well.

Figure 10: Illustrations of two evaluation domains, each of which has a set of meta-train tasks (examples shown in blue) and held-out test tasks (orange). The domains include (left) a half-cheetah tasked with running at different speeds and (right) a quadruped ant locomoting to different points on a circle.

C. Experimental Details

C.1. Data Collection Differences from Prior Work

BOReL and MACAW were both developed assuming several orders of magnitude more data than the regime that we tested. For example, in the BOReL paper (Dorfman & Tamar, 2020), the Cheetah Velocity task was trained with an offline dataset of 400 million transitions, and BOReL performs additional reward relabeling using ground-truth information about the transitions. In contrast, our offline dataset contains only 240 thousand transitions, roughly three orders of magnitude fewer. Similarly, MACAW uses 100M transitions for Cheetah Velocity, over 400 times more transitions than used in our experiments. These prior methods also collect offline datasets by training task-specific policies, which converge to near-optimal policies within the first million time steps (Haarnoja et al., 2018), meaning that they utilize very high-quality data. In contrast, our experiments focus on a regime with far fewer and much lower-quality trajectories, which likely explains why BOReL and MACAW perform worse here than in their original papers.

C.2. Environment Details

In this section, we describe the state and action spaces of each environment, how the reward functions were generated, and how the offline data was collected.

Ant Direction. The Ant Direction task consists of controlling a quadruped ant robot that can move in a plane. Following prior work (Rakelly et al., 2019; Dorfman & Tamar, 2020), the reward function is the dot product between the agent's velocity and a direction uniformly sampled from the unit circle. The state space is R^20, comprising the orientation of the ant (as a quaternion) as well as the angle and angular velocity of all 8 joints. The action space is [-1, 1]^8, with each dimension corresponding to the torque applied to the respective joint. The offline data is collected by running PEARL (Rakelly et al., 2019) on this meta-RL task with 100 pre-sampled tasks.² We terminate PEARL after 100 iterations, with each iteration consisting of collecting trajectories until at least 1000 new transitions have been observed. As discussed in Appendix C.1, this results in highly suboptimal data, enabling us to test how well the different methods can improve over the offline data. In PEARL, there are two replay buffers saved for each task: one for sampling data to train the encoder and another for training the policy and Q-function.
We will call the former the encoder replay buffer and the latter the RL replay buffer. The encoder replay buffer contains only data generated by the exploration policy, for which z ∼ p(z). The RL replay buffer contains all data generated, including both exploration and post-adaptation data, for which z ∼ q_φe(z | h). To make the offline dataset, we load the last 1200 samples of the RL replay buffer and the last 400 transitions of the encoder replay buffer into corresponding RL and encoder replay buffers for SMAC.

²To mitigate variance coming from this sampling procedure, we use the same pre-sampled tasks across all experiments and comparisons. We similarly use a pre-sampled set of tasks for the other environments.

Cheetah Velocity. The Cheetah Velocity task consists of controlling a two-legged half-cheetah that can move forwards or backwards along the x-axis. Following prior work (Rakelly et al., 2019; Dorfman & Tamar, 2020), the reward function is the negative absolute difference between the agent's x-velocity and a target velocity uniformly sampled from [0, 3]. The state space is R^20, comprising the z-position; the cheetah's x- and z-velocities; the angles and angular velocities of each joint and of the half-cheetah's y-angle; and the XYZ position of the center of mass. The action space is [-1, 1]^6, with each dimension corresponding to the torque applied to the respective joint. The offline data is collected in the same way as in the Ant Direction task, using a run of PEARL with 100 pre-sampled target velocities. For the offline dataset, we use the first 1200 samples from the RL replay buffer and the last 400 samples from the encoder replay buffer after 50 PEARL iterations, with each iteration containing at least 1000 new transitions. For this environment only, we found it beneficial to freeze the encoder buffer during the self-supervised phase.

Sawyer Manipulation. The state space, action space, and reward are described in Section 6. Tasks are generated by sampling the initial configuration and then the desired behavior. There are five objects: a drawer opened by a handle, a drawer opened by a button, a button, a tray, and a graspable object. The state is a 13-dimensional vector comprising the 3D position of the end-effector and the positions of the objects in the scene: 3 dimensions for the graspable object and 1 dimension for each articulated joint in the scene, including the robot's grippers. If an object is not present, the corresponding element of the state is set to 0. First, the presence or absence of each of the five objects is randomized. Next, the position of the drawers (from 2 sides), the initial position of the tray (from 4 positions), and the position of the graspable object (from 4 positions) are randomized. Finally, the desired behavior is chosen at random from the following list, restricted to the behaviors that are possible in the scene: "move hand", "open top drawer with handle", or "open bottom drawer with button". The offline data is collected using a scripted controller that does not know the desired behavior and randomly performs potential tasks in the scene, choosing another task if it finishes one before the trajectory ends. This data is loaded into a single replay buffer used for both the encoder and RL.
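To make the task-sampling procedure above concrete, here is a minimal Python sketch of how a scene configuration and desired behavior could be sampled. The object names and position counts follow the description above, but the presence probability and the feasibility rules in the sketch are illustrative assumptions, and the function is a placeholder rather than the actual environment code.

```python
# Illustrative sketch of Sawyer Manipulation task sampling (not the actual env code).
import random

OBJECTS = ["drawer_with_handle", "drawer_with_button", "button", "tray", "graspable_object"]

def sample_task():
    # 1. Randomize the presence or absence of each of the five objects
    #    (50% presence is an assumption for illustration).
    present = {obj: random.random() < 0.5 for obj in OBJECTS}

    # 2. Randomize initial positions: drawer side, tray position, object position.
    config = {
        "drawer_side": random.choice([0, 1]),      # drawers placed on one of 2 sides
        "tray_position": random.choice(range(4)),  # one of 4 tray positions
        "object_position": random.choice(range(4)),# one of 4 graspable-object positions
    }

    # 3. Choose a desired behavior among those feasible in the sampled scene
    #    (feasibility rules below are assumptions, not the exact environment logic).
    behaviors = ["move hand"]                      # always possible
    if present["drawer_with_handle"]:
        behaviors.append("open top drawer with handle")
    if present["drawer_with_button"] and present["button"]:
        behaviors.append("open bottom drawer with button")
    desired_behavior = random.choice(behaviors)

    return present, config, desired_behavior
```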
Walker Param. This environment involves controlling a bipedal robot that can move along the Z- and Y-axes. The source code is taken from the rand_param_envs repository,³ which has been used in prior work (Rakelly et al., 2019). A task is generated by randomly sampling the body mass, the joint damping coefficients, the body inertia, and the friction parameters from a log-uniform distribution on the range [1.5^-3, 1.5^3]. The reward is the velocity, plus a bonus of 1 for staying alive, minus a control penalty of 10^-3 ||a||^2, where ||a|| is the ℓ2 norm of the action. The state space is 17-dimensional and the action space is 6-dimensional (one dimension per joint).

³https://github.com/dennisl88/rand_param_envs

Hopper Param. This environment involves controlling a one-legged robot that can move along the Z- and Y-axes. This environment is also taken from the rand_param_envs repository. The reward is the same as in Walker Param, and tasks are sampled in the same way as for Walker Param. The state space is 11-dimensional and the action space is 3-dimensional (one dimension per joint).

Humanoid. This environment is based on the Humanoid environment from OpenAI Gym (Brockman et al., 2016). We reuse the standard reward function but replace the forward-velocity reward with a velocity reward based on a target direction. Each task consists of sampling a target direction uniformly at random. The state space is 376-dimensional and the action space is 17-dimensional (one dimension per joint).

C.3. Hyperparameters

We list the hyperparameters for training the policy, encoder, decoder, and Q-network in Table 1. Hyperparameters that differ across environments are listed in Table 2. For pretraining, we use the same hyperparameters and train for 50,000 gradient steps. Below, we give details on non-standard hyperparameters and architectures.

Hyperparameter | Value
RL batch size | 256
encoder batch size | 64
meta batch size | 4
Q-network hidden sizes | [300, 300, 300]
policy network hidden sizes | [300, 300, 300]
decoder network hidden sizes | [64, 64]
encoder network hidden sizes | [200, 200, 200]
z dimensionality (d_z) | 5
hidden activation (all networks) | ReLU
Q-network, encoder, and decoder output activation | identity
policy output activation | tanh
discount factor γ | 0.99
target network soft target η | 0.005
policy, Q-network, encoder, and decoder learning rate | 3 × 10^-4
policy, Q-network, encoder, and decoder optimizer | Adam
# of gradient steps per environment transition | 4
Table 1: SMAC hyperparameters for the self-supervised phase.

Hyperparameter | Cheetah | Ant | Sawyer | Walker | Hopper | Humanoid
horizon (max # of transitions per trajectory) | 200 | 200 | 50 | 200 | 200 | 200
AWR β | 100 | 100 | 0.3 | 100 | 100 | 100
reward scale | 5 | 5 | 1 | 5 | 5 | 5
# of training tasks | 100 | 100 | 50 | 50 | 50 | 50
# of test tasks | 30 | 20 | 10 | 5 | 5 | 5
# of transitions per training task in offline dataset | 1600 | 1600 | 3750 | 1200 | 1200 | 1200
λ_pearl | 1 | 1 | 0 | 1 | 1 | 1
Table 2: Environment-specific SMAC hyperparameters.

Batch sizes. The RL batch size is the batch size per task when sampling (s, a, r, s′) tuples to update the policy and Q-network. The encoder batch size is the number of transitions in the history h per task used to condition the encoder q_φe(z | h). The meta batch size is the number of tasks whose batches are sampled and concatenated for both the RL and encoder batches. In other words, for each gradient update, the policy and Q-network observe (RL batch size) × (meta batch size) transitions and the encoder observes (encoder batch size) × (meta batch size) transitions.
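As a concrete illustration of this batch arithmetic, the following sketch assembles one meta-batch using the default values from Table 1; the replay-buffer interface is a hypothetical placeholder.

```python
# Sketch of meta-batch assembly with the default sizes from Table 1.
# The sample_* methods are hypothetical placeholders for the replay-buffer interface.
import random

RL_BATCH_SIZE = 256      # (s, a, r, s') tuples per task for the policy/Q update
ENCODER_BATCH_SIZE = 64  # history length per task for conditioning the encoder
META_BATCH_SIZE = 4      # number of tasks per gradient step

def assemble_meta_batch(task_buffers):
    rl_batches, encoder_histories = [], []
    for task_buffer in random.sample(task_buffers, META_BATCH_SIZE):
        rl_batches.append(task_buffer.sample_transitions(RL_BATCH_SIZE))
        encoder_histories.append(task_buffer.sample_history(ENCODER_BATCH_SIZE))
    # Per gradient step, the policy and Q-network see 256 * 4 = 1024 transitions,
    # while the encoder sees 64 * 4 = 256 transitions (one history per task).
    return rl_batches, encoder_histories
```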
Encoder architecture. The encoder uses the same architecture as in Rakelly et al. (2019). The posterior is given by a product of independent factors,
$$q_{\phi_e}(z \mid h) \propto \prod_{(s, a, r) \in h} \Phi_{\phi_e}(z \mid s, a, r),$$
where each factor is a multivariate Gaussian over $\mathbb{R}^{d_z}$ with learned mean and diagonal variance, i.e., $\Phi_{\phi_e}(z \mid s, a, r) = \mathcal{N}\big(\mu_{\phi_e}(s, a, r), \sigma_{\phi_e}(s, a, r)\big)$. The mean and standard deviation are the outputs of a single MLP with output dimensionality $2 d_z$: the output is split into two halves, with the first half giving the mean and the second half passed through a softplus activation to give the standard deviation.

Self-supervised actor update. The parameter $\lambda_{\text{pearl}}$ controls the actor loss during the self-supervised phase, which is
$$\mathcal{L}^{\text{self-supervised}}_{\text{actor}}(\theta) = \mathcal{L}_{\text{actor}}(\theta) + \lambda_{\text{pearl}} \, \mathcal{L}^{\text{PEARL}}_{\text{actor}}(\theta),$$
where $\mathcal{L}^{\text{PEARL}}_{\text{actor}}$ is the actor loss from PEARL (Rakelly et al., 2019). For reference, the PEARL actor loss is
$$\mathcal{L}^{\text{PEARL}}_{\text{actor}}(\theta) = \mathbb{E}_{s \sim \mathcal{D}_i,\, z \sim q_{\phi_e}(z \mid h)} \left[ D_{\text{KL}}\!\left( \pi_\theta(a \mid s, z) \,\Big\|\, \frac{\exp\big(Q_w(s, a, z)\big)}{Z(s)} \right) \right],$$
where $Z(s)$ normalizes the exponentiated Q-values. When $\lambda_{\text{pearl}}$ is zero, the actor update is equivalent to the actor update in AWAC (Nair et al., 2020).

Comparisons. As discussed in Section 6, we used the authors' code for PEARL (Rakelly et al., 2019),⁴ BOReL (Dorfman & Tamar, 2020),⁵ and MACAW (Mitchell et al., 2021).⁶ To ensure a fair comparison, we ran each prior method both with its original hyperparameters and with hyperparameters matched to SMAC (matching network size, learning rate, and batch size), and report the better of the two results. For all comparisons, we evaluated the final policy using the same evaluation protocol as SMAC, i.e., collecting exploration trajectories with the learned policy and performing adaptation using this newly collected data.

⁴https://github.com/katerakelly/oyster
⁵https://github.com/Rondorf/BOReL
⁶https://github.com/eric-mitchell/macaw
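To make the product-of-Gaussians posterior described in the encoder architecture above concrete, the following is a minimal PyTorch-style sketch of how per-transition Gaussian factors could be combined into a single Gaussian over z. The MLP interface and tensor shapes are hypothetical placeholders; this illustrates the construction rather than the actual SMAC code.

```python
# Sketch: permutation-invariant product-of-Gaussians posterior q(z | h).
# factor_mlp is a hypothetical MLP mapping (s, a, r) to 2 * d_z outputs.
import torch
import torch.nn.functional as F

def posterior_from_history(factor_mlp, states, actions, rewards, d_z):
    """states: (T, s_dim), actions: (T, a_dim), rewards: (T, 1).
    Returns the mean and standard deviation of q(z | h)."""
    # Per-transition Gaussian factors Phi(z | s, a, r).
    inputs = torch.cat([states, actions, rewards], dim=-1)      # (T, s_dim + a_dim + 1)
    outputs = factor_mlp(inputs)                                # (T, 2 * d_z)
    mu = outputs[:, :d_z]                                       # first half: means
    sigma = F.softplus(outputs[:, d_z:])                        # second half: softplus -> stds

    # Product of Gaussians: sum precisions, precision-weight the means.
    precisions = sigma.pow(-2)                                  # (T, d_z)
    var_z = 1.0 / precisions.sum(dim=0)                         # (d_z,)
    mu_z = var_z * (precisions * mu).sum(dim=0)                 # (d_z,)
    return mu_z, var_z.sqrt()
```

A sample z ∼ q(z | h) can then be drawn from torch.distributions.Normal(mu_z, std_z); because the factors are combined by summing precisions, the result is invariant to the ordering of transitions in the history.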