# Distributionally Adaptive Meta Reinforcement Learning

Anurag Ajay, Abhishek Gupta*, Dibya Ghosh, Sergey Levine, Pulkit Agrawal
Improbable AI Lab; MIT-IBM Watson AI Lab; University of California, Berkeley; Massachusetts Institute of Technology
* denotes equal contribution. Correspondence to aajay@mit.edu and abhgupta@cs.washington.edu
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Meta-reinforcement learning algorithms provide a data-driven way to acquire policies that quickly adapt to many tasks with varying rewards or dynamics functions. However, learned meta-policies are often effective only on the exact task distribution on which they were trained and struggle in the presence of distribution shift of test-time rewards or transition dynamics. In this work, we develop a framework for meta-RL algorithms that are able to behave appropriately under test-time distribution shifts in the space of tasks. Our framework centers on an adaptive approach to distributional robustness that trains a population of meta-policies to be robust to varying levels of distribution shift. When evaluated on a potentially shifted test-time distribution of tasks, this allows us to choose the meta-policy with the most appropriate level of robustness, and use it to perform fast adaptation. We formally show how our framework allows for improved regret under distribution shift, and empirically show its efficacy on simulated robotics problems under a wide range of distribution shifts.

1 Introduction

The diversity and dynamism of the real world require reinforcement learning (RL) agents that can quickly adapt and learn new behaviors when placed in novel situations. Meta reinforcement learning provides a framework for conferring this ability to RL agents, by learning a meta-policy trained to adapt as quickly as possible to tasks from a provided training distribution [40, 11, 34, 49]. Unfortunately, meta-RL agents assume tasks to be always drawn from the training task distribution and often behave erratically when asked to adapt to tasks beyond the training distribution [5, 8]. As an example of this negative transfer, consider using meta-learning to teach a robot to navigate to goals quickly (illustrated in Figure 1). The resulting meta-policy learns to quickly adapt and walk to any target location specified in the training distribution, but explores poorly and fails to adapt to any location not in that distribution. This is particularly problematic for the meta-learning setting, since the scenarios where we need the ability to learn quickly are usually exactly those where the agent experiences distribution shift. This type of meta-distribution shift afflicts a number of real-world problems including autonomous vehicle driving [10], in-hand manipulation [18, 1, 9], and quadruped locomotion [25, 23, 19], where the training task distribution may not encompass all real-world scenarios.

In this work, we study algorithms that learn meta-policies resilient to task distribution shift at test time. We assume the test-time distribution shift to be unknown but within a fixed range. One approach to enable this resiliency is leveraging the distributional robustness framework [38] to learn a meta-policy robust to a wide range of distribution shifts. However, the resulting meta-policy can be slow to adapt during test time.
In contrast, learning a meta-policy robust to a small range of distribution shifts enables faster test-time adaptation but leaves the meta-policy brittle to task distribution shifts. Can we do better if we knew the degree of distribution shift a priori? Yes, but since the test-time distribution shift is unknown, we would first want our algorithm to infer the degree of distribution shift and then deploy the appropriate meta-policy robust to the inferred degree of distribution shift. To enable our approach, we use the distributional robustness framework to train meta-policies that prepare for distribution shifts by optimizing the worst-case empirical risk against a set of task distributions which lie within a bounded distance from the original training task distribution (known as an uncertainty set). This allows meta-policies to deal with potential test-time task distribution shift, bounding their worst-case test-time regret for distributional shifts within the chosen uncertainty set.

Figure 1: Failure of Typical Meta-RL. On meta-training tasks, π_meta explores effectively and quickly learns the optimal behavior (top row). When test tasks come from a slightly larger task distribution, exploration fails catastrophically, resulting in poor adaptation behavior (bottom row).

Our key insight is that we can prepare for a variety of potential test-time distribution shifts by constructing and training against different uncertainty sets at training time. By preparing for adaptation against each of these uncertainty sets, an agent is able to adapt to a variety of potential test-time distribution shifts by adaptively choosing the most appropriate level of distributional robustness for the test distribution at hand. We introduce a conceptual framework called distributionally adaptive meta reinforcement learning, formalizing this idea. At train time, the agent learns robust meta-policies with widening uncertainty sets, preemptively accounting for different levels of test-time distribution shift that may be encountered. At test time, the agent infers the level of distribution shift it is faced with, and then uses the corresponding meta-policy to adapt to the new task (Figure 2). In doing so, the agent can adaptively choose the best level of robustness for the test-time task distribution, preserving the fast adaptation benefits of meta-RL, while also ensuring good asymptotic performance under distribution shift. We instantiate a practical algorithm in this framework called DiAMetR.

Our main contributions are twofold. First, we show how to leverage the distributional robustness framework to make meta-reinforcement learning robust to a given level of distribution shift. Second, we propose a framework for making meta-reinforcement learning resilient to a variety of task distribution shifts, and DiAMetR, a practical algorithm instantiating the framework. Our experiments verify the utility of adaptive distributional robustness under test-time task distribution shift in a number of simulated robotics domains.

2 Related Work

Meta-reinforcement learning algorithms aim to leverage a distribution of training tasks to learn a "reinforcement learning algorithm" that is able to learn quickly on new tasks drawn from the same distribution. A variety of algorithms have been proposed for meta-RL, including memory-based [7, 26], gradient-based [11, 36, 14] and latent-variable based [34, 49, 48, 12] schemes.
These algorithms show the ability to generalize to new tasks drawn from the same distribution, and have been applied to problems ranging from robotics [28, 48, 19] to computer science education [44]. This line of work has been extended to operate in scenarios without requiring any pre-specified task distribution [13, 17], in offline settings [6, 29, 27] or in hard (meta-)exploration settings [50, 47], making them more broadly applicable to a wider class of problems. However, most meta-RL algorithms assume source and target tasks are drawn from the same distribution, an assumption rarely met in practice. Our work shows how the machinery of meta-RL can be made compatible with distribution shift at test time, using ideas from distributional robustness. Some recent work shows that model-based meta-reinforcement learning can be made robust to a particular level of distribution shift [24, 21] by learning a shared dynamics model against adversarially chosen task distributions. We show that we can build model-free meta-reinforcement learning algorithms which are not just robust to a particular level of distribution shift, but can adapt to various levels of shift.

Distributional robustness methods have been studied in the context of building supervised learning systems that are robust to the test distribution being different than the training one. The key idea is to train a model to not just minimize empirical risk, but instead learn a model that has the lowest worst-case empirical risk among an "uncertainty set" of distributions that are boundedly close to the empirical training distribution [38, 22, 2, 16]. If the uncertainty set and optimization are chosen carefully, these methods have been shown to obtain models that are robust to small amounts of distribution shift at test time [38, 22, 2, 16], finding applications in problems like federated learning [16] and image classification [22]. This has been extended to the min-max robustness setting for specific algorithms like model-agnostic meta-learning [3], but such approaches are critically dependent on correct specification of the appropriate uncertainty set and applicable primarily in supervised learning settings. Alternatively, several RL techniques aim to directly tackle the robustness problem, aiming to learn policies robust to adversarial perturbations [42, 46, 33, 32]. [45] conditions the policy on uncertainty sets to make it robust to different perturbation sets. While these methods are able to learn conservative, robust policies, they are unable to adapt to new tasks as DiAMetR does in the meta-reinforcement learning setting. In our work, rather than choosing a single uncertainty set, we learn many meta-policies for widening uncertainty sets, thereby accounting for different levels of test-time distribution shift.

Figure 2: During meta-train, DiAMetR learns a meta-policy π^{ε_1}_meta and a task distribution model T_ω(s, a, z) on the train task distribution. Then, it uses the task distribution model to imagine different shifted test task distributions on which it learns different meta-policies {π^{ε_i}_meta}_{i=2}^{M}, each corresponding to a different level of robustness. During meta-test, it chooses an appropriate meta-policy based on the inferred test-task distribution shift with Thompson sampling and then quickly adapts the selected meta-policy to individual tasks.
3 Preliminaries

Meta-Reinforcement Learning aims to learn a fast reinforcement learning algorithm or "meta-policy" that can quickly maximize performance on tasks T from some distribution p(T). Formally, each task T is a Markov decision process (MDP) M = (S, A, P, R, γ, µ0); the goal is to exploit regularities in the structure of rewards and environment dynamics across tasks in p(T) to acquire effective exploration and adaptation mechanisms that enable learning on new tasks much faster than learning the task naively from scratch. A meta-policy (or fast learning algorithm) π_meta maps a history of environment experience h ∈ (S × A × R)* in a new task to an action a, and is trained to acquire optimal behaviors on tasks from p(T) within k episodes:

min_{π_meta} E_{T∼p(T)} [Regret(π_meta, T)],
Regret(π_meta, T) = J(π*_T) − E_{a^{(i)}_t ∼ π_meta(·|h^{(i)}_t), T} [ Σ_t r_T(s^{(k)}_t, a^{(k)}_t) ],
J(π*_T) = max_π E_{π,T} [ Σ_t r_T(s_t, a_t) ],
where h^{(i)}_t = (s^{(i)}_{1:t}, r^{(i)}_{1:t}, a^{(i)}_{1:t−1}) ∪ (s^{(j)}_{1:T}, r^{(j)}_{1:T}, a^{(j)}_{1:T})_{j=1}^{i−1}.    (1)

Intuitively, the meta-policy has two components: an exploration mechanism that ensures that appropriate reward signal is found for all tasks in the training distribution, and an adaptation mechanism that uses the collected exploratory data to generate optimal actions for the current task. In practice, the meta-policy may be represented explicitly as an exploration policy conjoined with a policy update [11, 34], or implicitly as a black-box RNN [7, 49]. We use the terminology "meta-policies" interchangeably with that of "fast-adaptation" algorithms, since our practical implementation builds on [31] (which represents the adaptation mechanism using a black-box RNN). Our work focuses on the setting where there is potential drift between p_train(T), the task distribution we have access to during training, and p_test(T), the task distribution of interest during evaluation.

Distributional robustness [38] learns models that do not minimize empirical risk against the training distribution, but instead prepare for distribution shift by optimizing the worst-case empirical risk against a set of data distributions close to the training distribution (called an uncertainty set):

min_θ max_φ E_{x∼q_φ(x)} [ℓ(x; θ)]   s.t.   D(p_train(x) ‖ q_φ(x)) ≤ ε    (2)

This optimization finds the model parameters θ that minimize the worst-case risk ℓ over distributions q_φ(x) in an ε-ball (measured by an f-divergence) from the training distribution p_train(x).
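To make the objective in Eq 2 concrete, the toy sketch below (not from the paper; the quadratic loss, step sizes, and the Lagrangian treatment of the constraint are illustrative assumptions) lets an adversary re-weight a finite training set inside a KL ball while the model minimizes the re-weighted risk:

```python
import numpy as np

# Toy distributionally robust optimization (Eq 2) on a finite training set:
# the adversary re-weights samples, the learner minimizes the re-weighted risk,
# and a Lagrange multiplier enforces KL(p_train || q_phi) <= eps by dual ascent.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(64, 2))   # stand-in for training samples x ~ p_train
n = len(x_train)
theta = np.zeros(2)                  # model parameters (outer minimization)
log_w = np.zeros(n)                  # adversary's unnormalized log-weights (phi)
lam, eps = 0.0, 0.1                  # multiplier and KL radius
lr_theta, lr_w, lr_lam = 1e-2, 5e-2, 1e-1

for step in range(2000):
    w = np.exp(log_w - log_w.max())
    q = w / w.sum()                                        # adversarial q_phi(x)
    losses = 0.5 * np.sum((x_train - theta) ** 2, axis=-1)  # l(x; theta), quadratic
    # learner: gradient step on the q-weighted risk
    theta -= lr_theta * (q[:, None] * (theta - x_train)).sum(axis=0)
    # adversary: ascend E_q[l] - lam * KL(p_train || q) with respect to log-weights
    kl = np.mean(np.log(1.0 / (n * q)))                    # KL(uniform || q)
    grad_logw = q * (losses - q @ losses) - lam * (q - 1.0 / n)
    log_w += lr_w * grad_logw
    # dual ascent keeps the adversary (approximately) inside the KL ball
    lam = max(0.0, lam + lr_lam * (kl - eps))

print("robust theta:", theta, "final KL:", kl)
```

The same Lagrangian handling of the divergence constraint reappears later in Algorithm 1, where the adversary is a task distribution rather than a re-weighting of individual samples.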
Figure 3: During the meta-train phase, DiAMetR learns a family of meta-policies robust to varying levels of distribution shift (as characterized by ε_i). During the meta-test phase, given a potentially shifted test-time distribution of tasks, DiAMetR chooses the meta-policy with the most appropriate level of robustness and uses it to perform fast adaptation for new tasks sampled from the same shifted test task distribution.

4 Distributionally Adaptive Meta-Reinforcement Learning

In this section, we develop a framework for learning meta-policies that, given access to a training distribution of tasks p_train(T), are still able to adapt to tasks from a test-time distribution p_test(T) that is similar but not identical to the training distribution. We introduce the framework for distributionally adaptive meta-RL below and instantiate it as a practical method in Section 5.

4.1 Known Level of Test-Time Distribution Shift

We begin by studying a simplified problem where we can exactly quantify the degree to which the test distribution deviates from the training distribution. Suppose we know that p_test satisfies D(p_test(T) ‖ p_train(T)) < ε for some ε > 0, where D(·‖·) is a probability divergence on the set of task distributions (e.g. an f-divergence [35] or a Wasserstein distance [41]). A natural learning objective to learn a meta-policy under this assumption is to minimize the worst-case test-time regret across any test task distribution q(T) that is within some ε divergence of the train distribution:

min_{π_meta} R(π_meta, p_train(T), ε),
R(π_meta, p_train(T), ε) = max_{q(T)} E_{T∼q(T)} [Regret(π_meta, T)]   s.t.   D(p_train(T) ‖ q(T)) ≤ ε    (3)

Solving this optimization problem results in a meta-policy that has been trained to adapt to tasks from a wider task distribution than the original training distribution. It is worthwhile distinguishing this robust meta-objective, which incentivizes a robust adaptation mechanism for a wider set of tasks, from robust objectives in standard RL, which produce base policies robust to a wider set of dynamics conditions. The objective in Eq 3 incentivizes an agent to explore and adapt more broadly, not act more conservatively as standard robust RL methods [33] would encourage. Naturally, the quality of the robust meta-policy depends on the size of the uncertainty set. If ε is large, or the geometry of the divergence poorly reflects natural task variations, then the robust policy will have to adapt to an overly large set of tasks, potentially degrading the speed of adaptation.

4.2 Handling Arbitrary Levels of Distribution Shift

In practice, it is not known how the test distribution p_test deviates from the training distribution, and consequently it is challenging to determine what ε to use in the meta-robustness objective. We propose to overcome this via an adaptive strategy: train meta-policies for varying degrees of distribution shift, and at test time, infer which distribution shift is most appropriate through experience. We train a population of meta-policies {π^{(i)}_meta}_{i=1}^{M}, each solving the distributionally robust meta-RL objective (Eq 3) for a different level of robustness ε_i:

{ π^{ε_i}_meta := argmin_{π_meta} R(π_meta, p_train(T), ε_i) }_{i=1}^{M}   where   ε_M > ε_{M−1} > ... > ε_1 = 0    (4)

In choosing a spectrum of ε_i, we learn a set of meta-policies that have been trained on increasingly large sets of tasks: at one end (i = 1), the meta-policy is trained only on the original training distribution, and at the other (i = M), the meta-policy is trained to adapt to any possible task within the parametric family of tasks. These policies span a tradeoff between being robust to a wider set of task distributions with larger ε (allowing for larger distribution shifts), and being able to adapt quickly to any given task with smaller ε (allowing for better per-task regret minimization).
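As a minimal illustration of how the population in Eq 4 might be constructed, the skeleton below uses ε_1 = 0 and geometrically spaced larger levels, so that the relative spacing ε_i/ε_{i−1} discussed in Section 4.3 stays bounded; train_robust_meta_policy is a hypothetical stand-in for solving Eq 3 at a given radius (e.g., with Algorithm 1), and the particular schedule is an assumption rather than the paper's choice:

```python
# Skeleton of Eq 4: one meta-policy per robustness level eps_i.
def train_robust_meta_policy(train_tasks, eps):
    # placeholder: a real implementation would run the adversarial training loop
    return {"eps": eps, "params": None}

M, eps_max = 5, 0.8
# eps_1 = 0, then geometric spacing: [0.0, 0.1, 0.2, 0.4, 0.8]
epsilons = [0.0] + [eps_max * 0.5 ** (M - 1 - i) for i in range(1, M)]
train_tasks = ["placeholder task"]
meta_policies = [train_robust_meta_policy(train_tasks, eps) for eps in epsilons]
print(epsilons)
```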
With a set of meta-policies in hand, we must now decide how to leverage test-time experience to discover the right one to use for the actual test distribution p_test. We recognize that the problem of policy selection can be treated as a stochastic multi-armed bandit problem (precise formulation in Appendix C), where pulling arm i corresponds to running the meta-policy π^{ε_i}_meta for an entire meta-episode (k task episodes). If a zero-regret bandit algorithm (e.g., Thompson sampling [43]) is used, then after a certain number of test-time meta-episodes, we can guarantee that the meta-policy selection mechanism will converge to the meta-policy that best balances the tradeoff between adapting quickly while still being able to adapt to all the tasks from p_test(T).

To summarize our framework for distributionally adaptive meta-RL: we train a population of meta-policies at varying levels of robustness on a distributionally robust objective that forces the learned adaptation mechanism to also be robust to tasks not in the training task distribution. At test time, we use a bandit algorithm to select the meta-policy whose adaptation mechanism has the best tradeoff between robustness and speed of adaptation specifically on the test task distribution. Combining distributional robustness with test-time adaptation allows the adaptation mechanism to work even if distribution shift is present, while obviating the decreased performance that usually accompanies overly conservative, distributionally robust solutions.

4.3 Analysis

To provide some intuition on the properties of this algorithm, we formally analyze adaptive distributional robustness in a simplified meta-RL problem involving tasks T_g corresponding to reaching some unknown goal g in a deterministic MDP M, exactly at the final timestep of an episode. We assume that all goals are reachable, and use the family of meta-policies that run a stochastic exploratory policy π until the goal is discovered and then return to the discovered goal in all future episodes. The performance of a meta-policy on a task T_g under this model can be expressed in terms of the state distribution of the exploratory policy: Regret(π_meta, T_g) = 1 / d^T_π(g). This particular framework has been studied in [13, 20], and is a simple, interpretable framework for analysis.

We seek to understand performance under distribution shift when the original training task distribution is relatively concentrated on a subset of possible tasks. We choose the training distribution p_train(T_g) = (1 − β)·Uniform(S_0) + β·Uniform(S \ S_0), so that p_train is concentrated on tasks involving a subset of the state space S_0 ⊂ S, with β a parameter dictating the level of concentration, and consider test distributions that perturb it under the TV metric. Our main result compares the performance of a meta-policy trained to an ε_2-level of robustness when the true test distribution deviates by ε_1.

Proposition 4.1. Let ε̄_i = min{ε_i + β, 1 − |S_0|/|S|}. There exists q(T) satisfying D_TV(p_train, q) ≤ ε_1 such that an ε_2-robust meta-policy incurs excess regret over the optimal ε_1-robust meta-policy:

E_{q(T)} [Regret(π^{ε_2}_meta, T) − Regret(π^{ε_1}_meta, T)] ≥ ( c(ε_1, ε_2) + 1/c(ε_1, ε_2) − 2 ) · √( ε̄_1 (1 − ε̄_1) |S_0| (|S| − |S_0|) )    (5)

c(ε_1, ε_2) = √( ε̄_2 (1 − ε̄_1) / ( ε̄_1 (1 − ε̄_2) ) )    (6)

The scale of the excess regret depends on c(ε_1, ε_2), a measure of the mismatch between ε_1 and ε_2.

We first compare robust and non-robust solutions by analyzing the bound when ε_2 = 0. In the regime of β ≪ 1, excess regret scales as O(ε̄_1 · √(1/β)), meaning that the robust solution is most necessary when the training distribution is highly concentrated in a subset of the task space. At one extreme, if the training distribution contains no examples of tasks outside S_0 (β = 0), the non-robust solution incurs infinite excess regret; at the other extreme, if the training distribution is uniform on the set of all possible tasks (β = 1 − |S_0|/|S|), robustness provides no benefit.

We next quantify the effect of mis-specifying the level of robustness in the meta-robustness objective, and what benefits adaptive distributional robustness can confer. For small β and fixed ε_1, the excess regret of an ε_2-robust policy scales as O( max{ √(ε̄_2/ε̄_1), √(ε̄_1/ε̄_2) } ), meaning that excess regret gets incurred if the meta-policy is trained either to be too robust (ε_2 ≫ ε_1) or not robust enough (ε_1 ≫ ε_2). Compared to a fixed robustness level, our strategy of training meta-policies for a sequence of robustness levels {ε_i}_{i=1}^{M} ensures that this misspecification constant is at most the relative spacing between robustness levels, max_i ε_i/ε_{i−1}. This enables the distributionally adaptive approach to control the amount of excess regret by making the sequence more fine-grained, while a fixed choice of robustness incurs larger regret (as we verify empirically in our experiments as well).

Algorithm 1 DiAMetR: Meta-training phase
1: Given: p_train(T); Return: {π^{ε_i}_meta,θ}_{i=1}^{M}
2: π^{ε_1}_meta,θ, D_ReplayBuffer ← Solve equation 1 with off-policy RL2
3: Prior p_train(T) ← Solve Eq 8 using D_ReplayBuffer
4: for ε in {ε_2, ..., ε_M} do
5:   Initialize q_φ, π^{ε}_meta,θ and λ ← 0
6:   for iteration n = 1, 2, ... do
7:     Meta-policy: Update π^{ε}_meta,θ using off-policy RL2 [31]:
         θ := θ + α ∇_θ E_{T∼q_φ(T)} [ E_{π^{ε}_meta,θ, P_T} [ Σ_{i,t} r_T(s^{(i)}_t, a^{(i)}_t) ] ]
8:     Adversarial task distribution: Update q_φ using REINFORCE [39]:
         φ := φ − α ∇_φ ( E_{T∼q_φ(T)} [ E_{π^{ε}_meta,θ, P_T} [ Σ_{i,t} r_T(s^{(i)}_t, a^{(i)}_t) ] ] + λ D_KL(p_train(T) ‖ q_φ(T)) )
9:     Lagrange multiplier: Update λ to enforce D_KL(p_train ‖ q_φ) ≤ ε:
         λ := max(0, λ + α (D_KL(p_train(T) ‖ q_φ(T)) − ε))
10:   end for
11: end for
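To make lines 7-9 of Algorithm 1 concrete, the sketch below (illustrative, not the paper's implementation) treats q_φ(z) as a diagonal Gaussian over the latent task parameter introduced in Section 5.1 with prior p_train(z) = N(0, I), uses a toy black-box return in place of an off-policy RL2 rollout, and updates the Lagrange multiplier by dual ascent:

```python
import numpy as np

# Illustrative version of the adversarial task-distribution update (Algorithm 1,
# lines 8-9): REINFORCE on a Gaussian q_phi(z) plus dual ascent on lambda.
rng = np.random.default_rng(0)
dim, eps = 2, 0.5
mu, log_std = np.zeros(dim), np.zeros(dim)   # phi = (mu, log_std)
lam, lr, lr_lam = 0.0, 1e-2, 5e-2

def meta_return(z):
    # toy stand-in: the current meta-policy adapts well to tasks near the origin
    return -np.sum(z ** 2, axis=-1)

def kl_prior_to_q(mu, log_std):
    # closed-form KL(N(0, I) || N(mu, diag(exp(log_std)^2)))
    var = np.exp(2 * log_std)
    return np.sum(log_std + (1.0 + mu ** 2) / (2.0 * var) - 0.5)

for it in range(500):
    # (line 7) the meta-policy pi^eps_meta,theta would be updated here with RL2
    z = mu + np.exp(log_std) * rng.normal(size=(512, dim))  # sampled "tasks"
    R = meta_return(z)
    adv = R - R.mean()                                      # baseline for variance
    var = np.exp(2 * log_std)
    # (line 8) REINFORCE gradient of E_q[return] plus gradient of lam * KL,
    # descended so q_phi moves toward tasks the meta-policy handles poorly
    g_mu = np.mean(adv[:, None] * (z - mu) / var, axis=0) + lam * mu / var
    g_ls = (np.mean(adv[:, None] * (((z - mu) ** 2) / var - 1.0), axis=0)
            + lam * (1.0 - (1.0 + mu ** 2) / var))
    mu, log_std = mu - lr * g_mu, log_std - lr * g_ls
    # (line 9) dual ascent keeps KL(p_train || q_phi) close to the radius eps
    lam = max(0.0, lam + lr_lam * (kl_prior_to_q(mu, log_std) - eps))

print("q_phi mean:", mu, "std:", np.exp(log_std), "lambda:", lam)
```

In the full algorithm, the return estimates come from rolling out the current meta-policy on imagined tasks, and line 7's meta-policy update runs in the same loop.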
5 DiAMetR: A Practical Algorithm for Meta-Distribution Shift

In order to instantiate our distributionally adaptive framework into a practical algorithm, we must address how task distributions should be parameterized and optimized over. We must also address how the robust meta-RL problem can be solved with stochastic gradient methods. We first introduce the individual components of task parameterization and robust optimization, describe the overall algorithm in Algorithms 1 and 2, and visualize the components of DiAMetR in Fig 3.

5.1 Parameterizing Task Distributions

Handling in-support distribution shifts: For handling in-support task distribution shifts, we propose to represent new task distributions as a re-weighted training task distribution q(T) ∝ w_T · p_train(T), where w_T > 0 is a parameter. Since we have a finite set of training tasks, say {T_i}_{i=1}^{n_tr}, new task distributions become q(T_i) = w_{T_i} / Σ_{j=1}^{n_tr} w_{T_j}. With a slight abuse of notation, we can write the empirical training task distribution as p_train(T_i) = 1/n_tr. We can use the KL divergence to measure the divergence between the training task distribution and a test task distribution:

D(p_train(T) ‖ q_φ(T)) = Σ_{i=1}^{n_tr} (1/n_tr) · log( Σ_{j=1}^{n_tr} w_{T_j} / (n_tr · w_{T_i}) )

We collectively represent the parameters as φ = {w_{T_i}}_{i=1}^{n_tr}. Using this parameterization, the training objective (equation 3) becomes

max_θ min_φ E_{T∼p_train(T)} [ ( n_tr · w_T / Σ_{i=1}^{n_tr} w_{T_i} ) · E_{π^{ε}_meta,θ, P_T} [ Σ_t r_T(s_t, a_t) ] ]   s.t.   D_KL(p_train(T) ‖ q_φ(T)) ≤ ε    (7)
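As a small worked example of this parameterization (with arbitrary illustrative weights rather than learned ones), the snippet below computes q_φ, the importance weights appearing inside Eq 7, and the closed-form KL divergence above:

```python
import numpy as np

n_tr = 5
w = np.array([0.5, 2.0, 1.0, 0.2, 3.0])    # phi = {w_Ti}, arbitrary example values
q = w / w.sum()                            # q_phi(T_i) = w_Ti / sum_j w_Tj
imp = n_tr * w / w.sum()                   # importance weight on task T_i inside Eq 7
kl = np.mean(np.log(w.sum() / (n_tr * w))) # D_KL(p_train || q_phi) from the formula above
print(q, imp, kl)
```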
Handling out-of-support distribution shifts: For handling out-of-support task distribution shifts, we propose to learn a probabilistic model of the training task distribution, and use the learned latent representation as a space on which to parameterize uncertainty sets over new task distributions. Specifically, we jointly train a task encoder q_ψ(z|h) that encodes an environment history into the latent space, and a decoder T_ω(s, a, z) mapping a latent vector z to a property of the task, using a dataset of trajectories collected from the training tasks. To describe the exact form of T_ω, we consider how tasks can differ and list two scenarios: (1) tasks differ in reward functions: T_ω takes the form of a reward function r_ω(s, a, z) that maps a latent vector z to a reward function, and (2) tasks differ in dynamics: T_ω takes the form of a dynamics function p_ω(s, a, z) that maps a latent vector z to a dynamics function. This generative model can be trained as a standard latent variable model by maximizing an evidence lower bound (ELBO), trading off reward (or dynamics) prediction and matching a prior p_train(z) (chosen to be the unit Gaussian):

min_{ω,ψ} E_{h∼D, z∼q_ψ(z|h)} [ Σ_{t=1}^{H} ℓ(T_ω(s_t, a_t, z), h, t) + D_KL(q_ψ(z|h) ‖ N(0, I)) ],
ℓ(T_ω(s_t, a_t, z), h, t) = (r_ω(s_t, a_t, z) − r_t)²   when rewards differ,
ℓ(T_ω(s_t, a_t, z), h, t) = ‖p_ω(s_t, a_t, z) − s_{t+1}‖²   when dynamics differ.    (8)

Having learned a latent space, we can parameterize new task distributions q(T) as distributions q_φ(z) (the original training distribution corresponds to p_train(z) = N(0, I)), and measure the divergence between task distributions using the KL divergence in this latent space, D(p_train(z) ‖ q_φ(z)). Using this parameterization, the training objective (equation 3) becomes

max_θ min_φ E_{z∼q_φ(z)} [ E_{π^{ε}_meta,θ, P} [ Σ_t r_ω(s_t, a_t, z) ] ]   when rewards differ,
max_θ min_φ E_{z∼q_φ(z)} [ E_{π^{ε}_meta,θ, p_ω(·,·,z)} [ Σ_t r_T(s_t, a_t) ] ]   when dynamics differ,
s.t.   D_KL(p_train(z) ‖ q_φ(z)) ≤ ε.    (9)
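A compact sketch of this latent task model for the rewards-differ case is shown below; the GRU encoder, MLP decoder, layer sizes, and the randomly generated toy histories are assumptions for illustration, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

# Sketch of Eq 8 (rewards-differ case): encoder q_psi(z|h) over (s, a, r) histories,
# decoder r_omega(s, a, z), trained with reward reconstruction + KL to N(0, I).
S, A, Z, H = 4, 2, 8, 50                      # state/action/latent dims, horizon

class TaskEncoder(nn.Module):                 # q_psi(z | h)
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(S + A + 1, 64, batch_first=True)
        self.mu, self.logstd = nn.Linear(64, Z), nn.Linear(64, Z)
    def forward(self, h):
        out, _ = self.gru(h)
        feat = out[:, -1]                     # summary of the whole history
        return self.mu(feat), self.logstd(feat).clamp(-5, 2)

class RewardDecoder(nn.Module):               # r_omega(s, a, z)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(S + A + Z, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1)).squeeze(-1)

enc, dec = TaskEncoder(), RewardDecoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

# toy replay data: a batch of histories h = (s_t, a_t, r_t)
s, a, r = torch.randn(32, H, S), torch.randn(32, H, A), torch.randn(32, H)
h = torch.cat([s, a, r.unsqueeze(-1)], dim=-1)

for step in range(100):
    mu, logstd = enc(h)
    z = mu + logstd.exp() * torch.randn_like(mu)              # reparameterization
    z_rep = z.unsqueeze(1).expand(-1, H, -1)
    recon = ((dec(s, a, z_rep) - r) ** 2).sum(dim=-1).mean()  # reward reconstruction
    kl = 0.5 * (mu ** 2 + (2 * logstd).exp() - 2 * logstd - 1).sum(dim=-1).mean()
    loss = recon + kl                                         # negative ELBO (Eq 8)
    opt.zero_grad()
    loss.backward()
    opt.step()
```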
5.2 Training and test-time selection of meta-policies

Learning Robust Meta-Policies: Given this task parameterization, the next question becomes how to actually perform the robust optimization laid out in Eq 3. The distributional meta-robustness objective can be modelled as an adversarial game between a meta-policy π^{ε}_meta and a task proposal distribution q(T). As described above, this task proposal distribution is parameterized as a distribution over the latent space, q_φ(z), while π^{ε}_meta is parameterized as a typical recurrent neural network policy as in [31]. We parameterize {π^{ε_i}_meta}_{i=1}^{M} as a discrete set of meta-policies, with one for each chosen value of ε. This leads to a simple alternating optimization scheme (see Algorithm 1), where the meta-policy is trained using a standard meta-RL algorithm (we use off-policy RL2 [31] as a base learner), and the task proposal distribution with a constrained optimization method (we use dual gradient descent [30]). Each iteration n, three updates are performed: 1) the meta-policy π_meta,θ is updated to improve performance on the current task distribution, 2) the task distribution q_φ(T) is updated to increase weight on tasks where the current meta-policy adapts poorly and decrease weight on tasks that the current meta-policy can learn, while staying close to the original training distribution, and 3) the penalty coefficient λ is updated to ensure that q_φ(T) satisfies the divergence constraint.

Figure 4: (a) Ant navigation, (b) Wind navigation, (c) Object localization, (d) Block push. The agent needs to either navigate in the absence of wind, navigate in the presence of wind, use its gripper to localize an object at an unobserved target location, or push a block to an unobserved target location, indicated by a green sphere (or cube), by exploring its environment and experiencing reward. While tasks vary in reward functions for Ant navigation, Object localization and Block push, they vary in dynamics function for Wind navigation.

Algorithm 2 DiAMetR: Meta-test phase
1: Given: p_test(T), Π = {π^{ε_i}_meta,θ}_{i=1}^{M}
2: Initialize TS = Thompson-Sampler()
3: for meta-episode n = 1, 2, ... do
4:   Choose meta-policy i = TS.sample()
5:   Run π^{ε_i}_meta,θ for a meta-episode
6:   TS.update(arm = i, reward = meta-episode return)
7: end for

Test-time meta-policy selection: Since test-time meta-policy selection can be framed as a multi-armed bandit problem, we use a generic Thompson sampling [43] algorithm (see Algorithm 2). Each meta-episode n, we sample a meta-policy π^{ε}_meta with probability proportional to its estimated average episodic reward, run the sampled meta-policy π^{ε}_meta for a meta-episode (k environment episodes), and then update the estimate of its average episodic reward. Since Thompson sampling is a zero-regret bandit algorithm, it will converge to the meta-policy that achieves the highest average episodic reward and lowest regret on the test task distribution.
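The selection loop of Algorithm 2 can be sketched generically as below; the independent Gaussian posteriors over each arm's mean return and the toy run_meta_episode stand-in (which would really roll out π^{ε_i}_meta,θ for k episodes on a task from p_test(T)) are illustrative assumptions:

```python
import numpy as np

# Generic Thompson-sampling selection over M robustness levels (Algorithm 2).
rng = np.random.default_rng(0)
M = 4                                          # number of meta-policies
true_returns = np.array([0.2, 0.8, 0.6, 0.3])  # unknown to the agent (toy values)

def run_meta_episode(i):                       # stand-in for a real rollout
    return true_returns[i] + 0.1 * rng.normal()

counts, means = np.zeros(M), np.zeros(M)
for n in range(250):                           # 250 test-time meta-episodes
    # sample a plausible mean return for each arm from its posterior
    sampled = rng.normal(means, 1.0 / np.sqrt(counts + 1.0))
    i = int(np.argmax(sampled))                # pick the most promising meta-policy
    ret = run_meta_episode(i)
    counts[i] += 1
    means[i] += (ret - means[i]) / counts[i]   # running average of its returns

print("meta-policy selected most often:", int(np.argmax(counts)))
```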
6 Experimental Evaluation

We aim to comprehensively evaluate DiAMetR and answer the following questions: (1) Do meta-policies learned via DiAMetR allow for quick adaptation under different distribution shifts in the test-time task distribution? (2) Does learning for multiple levels of robustness actually help the algorithm adapt more effectively than a particular chosen uncertainty level? (3) Does proposing uncertainty sets via generative modeling provide useful distributions of tasks for robustness?

Setup. We use DiAMetR on four continuous control environments: Ant navigation (controlling a four-legged robotic quadruped), Wind navigation [6] (controlling a linear system robot in the presence of wind), Object localization (controlling a Fetch robot to localize an object through its gripper) and Block push (controlling a robot arm to push an object) [14] (Figures 4a to 4d); see Appendix D for details about reward functions and dynamics. We design various meta-RL tasks from these environments. Each meta-RL task has a train task distribution T_i ∼ p_train(T) such that each task T_i either parameterizes a reward function r_i(s, a) := r(s, a, T_i) or a dynamics function p_i(s, a) := p(s, a, T_i). T_i itself remains unobserved; the agent only has access to reward values and to executing actions in the environment. The learned meta-policies are evaluated on different distributionally shifted test task distributions {p^i_test(T)}_{i=1}^{K} which are either in-support or out-of-support of the training task distribution. In all meta-RL tasks, the train and test task distributions are determined by the distribution of an underlying task parameter (i.e. wind velocity w_T for Wind navigation and target location s_T for the other environments), which either determines the reward function or the dynamics function. While tasks vary in reward functions for Ant navigation, Object localization and Block push, they vary in dynamics function for Wind navigation (exact task distributions in Table 2). We use 4 random seeds for all our experiments and include standard error bars in our plots.

Figure 5: We evaluate DiAMetR and meta-RL algorithms (RL2, VariBAD and HyperX) on different in-support (top row) and out-of-support (bottom row) shifted test task distributions. DiAMetR either matches or outperforms RL2, VariBAD and HyperX on these task distributions. The first point p1 on the horizontal axis indicates the task parameter distribution U(0, p1) and the subsequent points p_i indicate the task parameter distribution U(p_{i−1}, p_i). While the task parameter for Wind navigation is wind velocity w_T, it is target location s_T for the other environments. Table 2 details the task distributions used in this evaluation.

6.1 Adaptation to Varying Levels of Distribution Shift

During meta-test, given a test task distribution p_test(T), DiAMetR uses Thompson sampling to select the appropriate meta-policy π^{ε}_meta,θ within N = 250 meta-episodes. π^{ε}_meta,θ can then solve any new task T ∼ p_test(T) within 1 meta-episode (k environment episodes). To test DiAMetR's ability to adapt to varying levels of distribution shift, we evaluate it on different test task distributions, as detailed in Table 2. We compare DiAMetR with meta-RL algorithms such as (off-policy) RL2 [31], VariBAD [49] and HyperX [50]. Since DiAMetR uses 250 meta-episodes to adaptively choose a meta-policy during test time, we finetune RL2, VariBAD and HyperX with 250 meta-episodes of the test task distribution to make the comparisons fair (see Appendix G for the finetuning curves). Figure 5 shows that DiAMetR outperforms RL2, VariBAD and HyperX on out-of-support and in-support shifted test task distributions. Furthermore, the performance gap between DiAMetR and the other baselines increases as the distribution shift between the test task distribution and the train task distribution increases. Naturally, the performance of DiAMetR also deteriorates as the distribution shift is increased, but as shown in Fig 5, it does so much more slowly than for the other algorithms. We also evaluate DiAMetR on the train task distribution to see if it incurs any performance loss. Figure 5 shows that DiAMetR either matches or outperforms RL2, VariBAD, and HyperX on the train task distribution.

6.2 Analysis of Tasks Proposed by Latent Conditional Uncertainty Sets

We visualize the imagined test reward distributions through their heatmaps for various distribution shifts. To generate an imagined reward function, we sample z ∼ q_φ(z) and pass the z into r_ω(s, a, z) = r_ω(s, z) (given the learned reward is only dependent on state, as mentioned in Appendix A.1). We then take the ant agent and reset its (x, y) location to different points in the (discretized) grid [−1, 1]² and calculate r_ω(s, z) at all those points. This gives us a reward map for a single imagined reward function. We sample 10000 of these reward functions and plot them together. Figure 6 visualizes the imagined test reward distribution in the Ant-navigation environment in increasing order of distribution shifts with respect to the train reward distribution (with distribution shift parameter ε increasing from left to right). The train distribution of rewards has uniformly distributed target locations within the red circle. As seen in Figure 6, the learned reward distribution model imagines more target locations outside the red circle as we increase the distribution shift.
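A sketch of this visualization procedure is given below; the Gaussian stand-in for the shifted latent distribution q_φ(z), the goal-distance stand-in for r_ω(s, z), and the averaging of the sampled reward maps into a single heatmap (the paper instead overlays the 10000 individual maps) are simplifying assumptions:

```python
import numpy as np

# Sketch of the Section 6.2 visualization: sample z ~ q_phi(z), evaluate the
# learned reward on a discretized [-1, 1]^2 grid, and aggregate the reward maps.
rng = np.random.default_rng(0)
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50)), -1)

def sample_z(eps_shift, size):                 # stand-in for q_phi(z) at shift eps
    return rng.normal(0.0, 1.0 + eps_shift, size=(size, 2))

def r_omega(xy, z):                            # stand-in reward: peaks at goal z
    return -np.linalg.norm(xy - z, axis=-1)

heatmap = np.zeros(grid.shape[:2])
for z in sample_z(eps_shift=0.4, size=10000):  # 10000 imagined reward functions
    heatmap += r_omega(grid, z)
heatmap /= 10000.0                             # averaged reward map across samples
```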
6.3 Analysis of Importance of Multiple Uncertainty Sets

DiAMetR meta-learns a family of adaptation policies, each conditioned on a different uncertainty set. As discussed in Section 4, selecting a policy conditioned on a large uncertainty set would lead to overly conservative behavior, while selecting a policy conditioned on a small uncertainty set would result in failure if the test-time distribution shift is high. Therefore, we need to adaptively select an uncertainty set during test time. To validate this phenomenon empirically, we performed an ablation study in Figure 7. As clearly visible, adaptively choosing an uncertainty set during test time allows for better adaptation to test-time distribution shift when compared to selecting an uncertainty set beforehand or selecting a large uncertainty set. These results suggest that a combination of training robust meta-learners and constructing various uncertainty sets allows for effective test-time adaptation under distribution shift. DiAMetR is able to avoid both overly conservative behavior and under-exploration at test time.

Figure 6: (a) ε = 0.1, (b) ε = 0.2, (c) ε = 0.4, (d) ε = 0.8. Imagined test reward distributions in the Ant-navigation environment in increasing order of distribution shifts. The train reward distribution is uniform within the red circle.

Figure 7: Adaptively choosing an uncertainty set for the DiAMetR policy (Adapt) during test time allows it to better adapt to test-time distribution shift than choosing an uncertainty set beforehand (Mid). Choosing a large uncertainty set for the DiAMetR policy (Conservative) leads to conservative behavior and hurts its performance when the test-time distribution shift is low. While the top row contains in-support task distribution shifts, the bottom row contains out-of-support task distribution shifts.

7 Discussion

In this work, we introduce a "distributionally adaptive" meta-RL framework for tackling task distribution shifts and argue for adaptation in the face of these distribution shifts. There are several avenues for future work we are keen on exploring, for instance extending adaptive distributional robustness to more complex meta-RL tasks, including those with image observations. Another interesting direction would be to develop a more formal theory providing adaptive robustness guarantees in meta-RL problems under these inherent distribution shifts.

Acknowledgements

The authors thank the members of the Improbable AI Lab and RAIL for discussions and helpful feedback. We thank MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing compute resources. This research was supported by an NSF graduate fellowship, a DARPA Machine Common Sense grant, a MURI grant from the Army Research Office under the Cooperative Agreement Number W911NF-21-1-0097, and an MIT-IBM grant. This research was also partly sponsored by the United States Air Force Research Laboratory and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force, the United States Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.

References

[1] T. Chen, J. Xu, and P. Agrawal. A system for general in-hand object re-orientation. In A. Faust, D. Hsu, and G. Neumann, editors, Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 297–307. PMLR, 08–11 Nov 2022. URL https://proceedings.mlr.press/v164/chen22a.html.

[2] J. Cohen, E. Rosenfeld, and Z. Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.

[3] L. Collins, A. Mokhtari, and S. Shakkottai. Distribution-agnostic model-agnostic meta-learning. CoRR, abs/2002.04766, 2020. URL https://arxiv.org/abs/2002.04766.
[4] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.

[5] T. Deleu and Y. Bengio. The effects of negative adaptation in model-agnostic meta-learning. arXiv preprint arXiv:1812.02159, 2018.

[6] R. Dorfman, I. Shenfeld, and A. Tamar. Offline meta learning of exploration. arXiv preprint arXiv:2008.02598, 2020.

[7] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[8] A. Fallah, A. Mokhtari, and A. Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34, 2021.

[9] N. Fazeli, A. Ajay, and A. Rodriguez. Long-horizon prediction and uncertainty propagation with residual point contact learners. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7898–7904, 2020. doi: 10.1109/ICRA40945.2020.9196511.

[10] A. Filos, P. Tigkas, R. McAllister, N. Rhinehart, S. Levine, and Y. Gal. Can autonomous vehicles identify, recover from, and adapt to distribution shifts? In International Conference on Machine Learning, pages 3145–3153. PMLR, 2020.

[11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

[12] H. Fu, H. Tang, J. Hao, C. Chen, X. Feng, D. Li, and W. Liu. Towards effective context for meta-reinforcement learning: an approach based on contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7457–7465, 2021.

[13] A. Gupta, B. Eysenbach, C. Finn, and S. Levine. Unsupervised meta-learning for reinforcement learning. arXiv preprint arXiv:1806.04640, 2018.

[14] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine. Meta-reinforcement learning of structured exploration strategies. Advances in Neural Information Processing Systems, 31, 2018.

[15] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[16] J. Hong, H. Wang, Z. Wang, and J. Zhou. Federated robustness propagation: Sharing adversarial robustness in federated learning. arXiv preprint arXiv:2106.10196, 2021.

[17] A. Jabri, K. Hsu, A. Gupta, B. Eysenbach, S. Levine, and C. Finn. Unsupervised curricula for visual meta-reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.

[18] L. Ke, J. Wang, T. Bhattacharjee, B. Boots, and S. Srinivasa. Grasping with chopsticks: Combating covariate shift in model-free imitation learning for fine manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6185–6191. IEEE, 2021.

[19] A. Kumar, Z. Fu, D. Pathak, and J. Malik. RMA: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021.

[20] L. Lee, B. Eysenbach, E. Parisotto, E. P. Xing, S. Levine, and R. Salakhutdinov. Efficient exploration via state marginal matching. CoRR, abs/1906.05274, 2019. URL http://arxiv.org/abs/1906.05274.

[21] Z. Lin, G. Thomas, G. Yang, and T. Ma. Model-based adversarial meta-reinforcement learning. In H. Larochelle, M.
Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/73634c1dcbe056c1f7dcf5969da406c8-Abstract.html.

[22] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[23] G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal. Rapid locomotion via reinforcement learning. arXiv preprint arXiv:2205.02824, 2022.

[24] R. Mendonca, X. Geng, C. Finn, and S. Levine. Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling. CoRR, abs/2006.07178, 2020. URL https://arxiv.org/abs/2006.07178.

[25] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62):eabk2822, 2022.

[26] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.

[27] E. Mitchell, R. Rafailov, X. B. Peng, S. Levine, and C. Finn. Offline meta-reinforcement learning with advantage weighting. In International Conference on Machine Learning, pages 7780–7791. PMLR, 2021.

[28] A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018.

[29] A. Nair, A. Gupta, M. Dalal, and S. Levine. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

[30] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

[31] T. Ni, B. Eysenbach, S. Levine, and R. Salakhutdinov. Recurrent model-free RL is a strong baseline for many POMDPs, 2022. URL https://openreview.net/forum?id=E0zOKxQsZhN.

[32] T. P. Oikarinen, W. Zhang, A. Megretski, L. Daniel, and T. Weng. Robust deep reinforcement learning through adversarial loss. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 26156–26167, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/dbb422937d7ff56e049d61da730b3e11-Abstract.html.

[33] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 2817–2826. PMLR, 2017. URL http://proceedings.mlr.press/v70/pinto17a.html.

[34] K. Rakelly, A. Zhou, C. Finn, S. Levine, and D. Quillen. Efficient off-policy meta-reinforcement learning via probabilistic context variables. In International Conference on Machine Learning, pages 5331–5340. PMLR, 2019.

[35] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547–562. University of California Press, 1961.

[36] J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel.
ProMP: Proximal meta-policy search. arXiv preprint arXiv:1810.06784, 2018.

[37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://dblp.uni-trier.de/db/journals/corr/corr1707.html#SchulmanWDRK17.

[38] A. Sinha, H. Namkoong, R. Volpi, and J. Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

[39] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.

[40] S. Thrun and L. Y. Pratt, editors. Learning to Learn. Springer, 1998. ISBN 978-1-4613-7527-2. doi: 10.1007/978-1-4615-5529-2. URL https://doi.org/10.1007/978-1-4615-5529-2.

[41] L. N. Vaserstein. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72, 1969.

[42] E. Vinitsky, Y. Du, K. Parvate, K. Jang, P. Abbeel, and A. M. Bayen. Robust reinforcement learning using adversarial populations. CoRR, abs/2008.01825, 2020. URL https://arxiv.org/abs/2008.01825.

[43] D. Wolpert and W. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997. doi: 10.1109/4235.585893.

[44] M. Wu, N. Goodman, C. Piech, and C. Finn. ProtoTransformer: A meta-learning approach to providing student feedback. arXiv preprint arXiv:2107.14035, 2021.

[45] A. Xie, S. Sodhani, C. Finn, J. Pineau, and A. Zhang. Robust policy learning over multiple uncertainty sets. arXiv preprint arXiv:2202.07013, 2022.

[46] H. Zhang, H. Chen, D. S. Boning, and C. Hsieh. Robust reinforcement learning on state observations with learned optimal adversary. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=sCZbhBvqQaU.

[47] J. Zhang, J. Wang, H. Hu, T. Chen, Y. Chen, C. Fan, and C. Zhang. MetaCURE: Meta reinforcement learning with empowerment-driven exploration. In International Conference on Machine Learning, pages 12600–12610. PMLR, 2021.

[48] T. Z. Zhao, A. Nagabandi, K. Rakelly, C. Finn, and S. Levine. MELD: Meta-reinforcement learning from images via latent state models. arXiv preprint arXiv:2010.13957, 2020.

[49] L. Zintgraf, K. Shiarlis, M. Igl, S. Schulze, Y. Gal, K. Hofmann, and S. Whiteson. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. arXiv preprint arXiv:1910.08348, 2019.

[50] L. M. Zintgraf, L. Feng, C. Lu, M. Igl, K. Hartikainen, K. Hofmann, and S. Whiteson. Exploration in approximate hyper-state space for meta reinforcement learning. In International Conference on Machine Learning, pages 12991–13001. PMLR, 2021.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 7.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work is done in simulation and won't have any negative societal impact.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] This work does not use human subjects and is done in simulation. We have reviewed the ethics guidelines and ensured that our paper conforms to them.

2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A] Math is used as a theory/formalism, but we don't make any provable claims about it.
   (b) Did you include complete proofs of all theoretical results? [N/A]

3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We have included the code along with a README in the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix L.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] All plots were created with 4 random seeds with standard error bars.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix L.

4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Environments we used are cited in Section 6. Codebases used are cited in Appendix L.
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We published the code and included all environments and assets as part of this submission.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] Environments and codebases we used are open-source.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]