# Transformers are Meta-Reinforcement Learners

Luckeciano C. Melo 1 2

1Microsoft, USA. 2Center of Excellence in Artificial Intelligence (Deep Learning Brazil), Brazil. Correspondence to: Luckeciano C. Melo.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

The transformer architecture and variants presented a remarkable success across many machine learning tasks in recent years. This success is intrinsically related to the capability of handling long sequences and the presence of context-dependent weights from the attention mechanism. We argue that these capabilities suit the central role of a Meta-Reinforcement Learning algorithm. Indeed, a meta-RL agent needs to infer the task from a sequence of trajectories. Furthermore, it requires a fast adaptation strategy to adapt its policy for a new task, which can be achieved using the self-attention mechanism. In this work, we present TrMRL (Transformers for Meta-Reinforcement Learning), a meta-RL agent that mimics the memory reinstatement mechanism using the transformer architecture. It associates the recent past of working memories to build an episodic memory recursively through the transformer layers. We show that the self-attention computes a consensus representation that minimizes the Bayes Risk at each layer and provides meaningful features to compute the best actions. We conducted experiments in high-dimensional continuous control environments for locomotion and dexterous manipulation. Results show that TrMRL presents comparable or superior asymptotic performance, sample efficiency, and out-of-distribution generalization compared to the baselines in these environments.

1. Introduction

In recent years, the Transformer architecture (Vaswani et al., 2017) achieved exceptional performance on many machine learning applications, especially for text (Devlin et al., 2019; Raffel et al., 2020) and image processing (Dosovitskiy et al., 2021b; Caron et al., 2021; Yuan et al., 2021). This intrinsically relates to its few-shot learning nature (Brown et al., 2020b): the attention weights work as context-dependent parameters, inducing better generalization. Furthermore, this architecture parallelizes token processing by design. This property avoids backpropagation through time, making it less prone to vanishing/exploding gradients, a very common problem for recurrent models. As a result, transformers can handle longer sequences more efficiently.

This work argues that these two capabilities are essential for a Meta-Reinforcement Learning (meta-RL) agent. We propose TrMRL (Transformers for Meta-Reinforcement Learning), a memory-based meta-Reinforcement Learner which uses the transformer architecture to formulate the learning process. It works as a memory reinstatement mechanism (Rovee-Collier, 2012) during learning, associating recent working memories to create an episodic memory which is used to contextualize the policy. Figure 1 illustrates the process. We formulated each task as a distribution over working memories. TrMRL associates these memories using self-attention blocks to create a task representation in each head. These task representations are combined in the position-wise MLP to create an episodic output (which we identify as episodic memory). We recursively apply this procedure through layers to refine the episodic memory.
In the end, we select the memory associated with the current timestep and feed it into the policy head. Nonetheless, transformer optimization is often unstable, especially in the RL setting. Past attempts either failed to stabilize (Mishra et al., 2018) or required architectural additions (Parisotto et al., 2019) or restrictions on the observation space (Loynd et al., 2020). We hypothesize that this challenge arises because the instability of the early stages of transformer optimization harms initial exploration, which is crucial for environments where the learned behaviors must guide exploration to prevent poor policies. We argue that this challenge can be mitigated through a proper weight initialization scheme. For this matter, we applied T-Fixup initialization (Huang et al., 2020).

Figure 1. Illustration of the TrMRL agent (working memory embedding, transformer encoder with causal multi-head attention and position-wise MLP, attention maps, and policy head). At each timestep, it associates the recent past of working memories to build an episodic memory through transformer layers recursively. We argue that the self-attention works as a fast adaptation strategy since it provides context-dependent parameters.

We conducted a series of experiments to evaluate meta-training, fast adaptation, and out-of-distribution generalization in continuous control environments for locomotion and robotic manipulation. Results show that TrMRL presents comparable or superior performance and sample efficiency compared to the meta-RL baselines. We also conducted an experiment to validate the episodic memory refinement process. Finally, we conducted an ablation study to show the effectiveness of the T-Fixup initialization and the sensitivity to network depth, sequence size, and the number of attention heads.

2. Related Work

Meta-Learning is an established Machine Learning (ML) principle to learn inductive biases from the distribution of tasks to produce a data-efficient learning system (Bengio et al., 1991; Schmidhuber et al., 1996; Thrun & Pratt, 1998). This principle spanned a variety of methods in recent years, learning different components of an ML system, such as the optimizer (Andrychowicz et al., 2016; Li & Malik, 2016; Chen et al., 2017), neural architectures (Hutter et al., 2019; Zoph & Le, 2017), metric spaces (Vinyals et al., 2016), weight initializations (Finn et al., 2017; Nichol et al., 2018; Finn et al., 2018), and conditional distributions (Zintgraf et al., 2019; Melo et al., 2019). Another branch of methods learns the entire system using memory-based architectures (Ortega et al., 2019; Wang et al., 2017; Duan et al., 2016; Ritter et al., 2018a) or generating update rules by discovery (Oh et al., 2020) or evolution (Co-Reyes et al., 2021).

Memory-Based Meta-Learning is the particular class of methods we focus on in this work. In this context, Wang et al. (2017); Duan et al. (2016) concurrently proposed the RL2 framework, which formulates the learning process as a Recurrent Neural Network (RNN) where the hidden state works as the memory mechanism. Given the recent rise of attention-based architectures, one natural idea is to use them as a replacement for RNNs. Mishra et al. (2018) proposed an architecture composed of causal convolutions (to aggregate information from past experience) and soft attention (to pinpoint specific pieces of information).
In contrast, our work applies causal, multi-head self-attention by stabilizing the complete transformer architecture with an arbitrarily large context window. Finally, Ritter et al. (2021) also applied multi-head self-attention for rapid task solving in RL environments, but with a different dynamic: their work applies the attention mechanism iteratively over a predefined episodic memory, while ours applies it recursively through transformer layers to build an episodic memory from the association of recent working memories.

Our work has intersections with Cognitive Neuroscience research on memory for learning systems (Hoskin et al., 2018; Rovee-Collier, 2012; Wang et al., 2018). In this context, Ritter et al. (2018c) extended the RL2 framework incorporating a differentiable neural dictionary as the inductive bias for episodic memory recall. In the same line, Ritter et al. (2018b) also extended RL2 but integrating a different episodic memory system inspired by the reinstatement mechanism. In our work, we also mimic reinstatement to retrieve episodic memories from working memories, but using self-attention. Lastly, Fortunato et al. (2019) studied the association between working and episodic memories for RL agents, specifically for memory tasks, proposing separate inductive biases for these memories based on LSTMs and auxiliary unsupervised losses. In contrast, our work studies this association for the meta-RL problem, using memory as a task proxy implemented by the transformer architecture.

Meta-Reinforcement Learning is a branch of Meta-Learning for RL agents. Some of the algorithms described in the past paragraphs extend to the meta-RL setting by design (Finn et al., 2017; Mishra et al., 2018; Wang et al., 2017; Duan et al., 2016). Others were explicitly designed for RL and often aim to create a task representation to condition the policy. PEARL (Rakelly et al., 2019) is an off-policy method that learns a latent representation of the task and explores via posterior sampling. MAESN (Gupta et al., 2018) also creates task variables but optimizes them with on-policy gradient descent and explores by sampling from the prior. MQL (Fakoor et al., 2020) is also an off-policy method, but it uses a deterministic context that is not permutation invariant, implemented by an RNN. Lastly, VariBAD (Zintgraf et al., 2020) formulates the problem as a Bayes-Adaptive MDP and extends the RL2 framework by incorporating a stochastic latent representation of the task trained with a VAE objective. Our work contrasts with all the previous methods in this task representation: we condition the policy on the episodic memory generated by the transformer architecture from the association of past working memories. We show that this episodic memory works as a proxy to the task representation.

Transformers for RL. The application of the transformer architecture in the RL setting is still an open challenge. Mishra et al. (2018) tried to apply this architecture to simple bandit tasks and tabular MDPs and reported unstable training and random performance. Parisotto et al. (2019) then proposed some architectural changes to the vanilla transformer, reordering layer normalization modules and replacing residual connections with expressive gating mechanisms, improving state-of-the-art performance for a set of memory environments. Loynd et al. (2020) also studied how transformer-based models can improve the performance of sequential decision-making agents.
They stabilized the architecture using factored observations and an intense hyperparameter tuning procedure, resulting in improved sample efficiency. In contrast to these methods, our work stabilizes the transformer model by improving optimization through a better weight initialization. In this way, we could use the vanilla transformer without architectural additions or imposing restrictions on the observations. Finally, recent work studied how to replace RL algorithms with transformer-based language models (Janner et al., 2021; Chen et al., 2021). Using a supervised prediction loss in the offline RL setting, they modeled the agent as a sequence problem. Our work, on the other hand, considers the standard RL formulation in the meta-RL setting. This formulation presents more challenges, since the agent needs to explore the environment and apply RL gradients to maximize rewards, which is acknowledged to be much noisier (Norouzi et al., 2016).

3. Preliminaries

We define a Markov decision process (MDP) by a tuple $M = (S, A, P, R, P_0, \gamma, H)$, where $S$ is a state space, $A$ is an action space, $P: S \times A \times S \rightarrow [0, \infty)$ is a transition dynamics, $R: S \times A \rightarrow [-R_{\max}, R_{\max}]$ is a bounded reward function, $P_0: S \rightarrow [0, \infty)$ is an initial state distribution, $\gamma \in [0, 1]$ is a discount factor, and $H$ is the horizon. The standard RL objective is to maximize the cumulative reward, i.e., $\max \mathbb{E}[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)]$, with $a_t \sim \pi_\theta(a_t \mid s_t)$ and $s_t \sim P(s_t \mid s_{t-1}, a_{t-1})$, where $\pi_\theta: S \times A \rightarrow [0, \infty)$ is a policy parameterized by $\theta$.

3.1. Problem Setup: Meta-Reinforcement Learning

In the meta-RL setting, we define $p(M): \mathcal{M} \rightarrow [0, \infty)$ as a distribution over a set of MDPs $\mathcal{M}$. During meta-training, we sample $M_i \sim p(M)$ from this distribution, where $M_i = (S, A, P_i, R_i, P_{0,i}, \gamma, H)$. Therefore, the tasks (we use the terms task and MDP interchangeably) share a similar structure in this setting, but reward function and transition dynamics vary. The goal is to learn a policy that, during meta-testing, can adapt to a new task sampled from the same distribution $p(M)$. In this context, adaptation means maximizing the reward under the task in the most efficient way. To achieve this, the meta-RL agent should learn the prior knowledge shared across the distribution of tasks. Simultaneously, it should learn how to differentiate and identify these tasks using only a few episodes.

3.2. Transformer Architecture

The transformer architecture (Vaswani et al., 2017) was first proposed as an encoder-decoder architecture for neural machine translation. Since then, many variants have emerged, proposing simplifications or architectural changes across many ML problems (Dosovitskiy et al., 2021a; Brown et al., 2020a; Parisotto et al., 2019). Here, we describe the encoder architecture as it composes our memory-based meta-learner.

The transformer encoder is a stack of multiple equivalent layers. There are two main components in each layer: a multi-head self-attention block, followed by a position-wise feed-forward network. Each component contains a residual connection (He et al., 2015) around it, followed by layer normalization (Ba et al., 2016). The multi-head self-attention (MHSA) block computes the self-attention operation across many different heads, whose outputs are concatenated to serve as input to a linear projection module, as in Equation 1:

$$\mathrm{MHSA}(K, Q, V) = \mathrm{Concat}(h_1, h_2, \dots, h_\omega) W_o, \quad h_i = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}} + M\right) V, \tag{1}$$

where $K$, $Q$, $V$ are the keys, queries, and values for the sequence input, respectively. Additionally, $d$ represents the dimension size of the keys and queries representation and $\omega$ the number of attention heads. $M$ represents the attention masking operation, and $W_o$ represents a linear projection operation.
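As a concrete reference for Equation 1, the snippet below sketches a causal multi-head self-attention block in PyTorch. It is a minimal illustration under assumed names and sizes (the class name `CausalMHSA` and the dimensions are placeholders), not the TrMRL implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalMHSA(nn.Module):
    """Minimal multi-head self-attention with a causal mask (cf. Equation 1)."""

    def __init__(self, embed_dim: int, n_heads: int):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads = n_heads
        self.d = embed_dim // n_heads  # per-head key/query dimension
        # Linear maps producing queries, keys, and values for every head.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)  # W_o

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim) -- the sequence of working memories.
        B, T, E = x.shape
        def split(t):  # (B, T, E) -> (B, n_heads, T, d)
            return t.view(B, T, self.n_heads, self.d).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product scores plus the causal mask M (-inf above the diagonal).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)  # rows hold the coefficients alpha_{i,j}
        heads = attn @ v  # (B, n_heads, T, d)
        # Concatenate the heads and apply the output projection W_o.
        return self.out_proj(heads.transpose(1, 2).reshape(B, T, E))
```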
The position-wise feed-forward block is a 2-layer dense network with a ReLU activation between these layers. All positions in the sequence input share the parameters of this network, equivalent to a 1×1 temporal convolution over every step in the sequence. Finally, we describe the positional encoding. It injects the relative position information among the elements in the sequence input, since the transformer architecture fully parallelizes the input processing. The standard positional encoding is a sinusoidal function added to the sequence input (Vaswani et al., 2017).

3.3. T-Fixup Initialization

The training of transformer models is notoriously difficult, especially in the RL setting (Parisotto et al., 2019). Indeed, gradient optimization with attention layers often requires complex learning rate warmup schedules to prevent divergence (Huang et al., 2020). Recent work suggests two main reasons for this requirement. First, the Adam optimizer (Kingma & Ba, 2017) presents high variance in the inverse second moment for initial updates, proportional to a divergent integral (Liu et al., 2020). It leads to problematic updates and significantly affects optimization. Second, the backpropagation through layer normalization can also destabilize optimization because the associated error depends on the magnitude of the input (Xiong et al., 2020).

Given these challenges, Huang et al. (2020) proposed a weight initialization scheme (T-Fixup) to eliminate the need for learning rate warmup and layer normalization. This is particularly important to the RL setting, since current RL algorithms are very sensitive to the learning rate for learning and exploration. T-Fixup appropriately bounds the original Adam update to make the variance finite and reduce instability, regardless of model depth. We refer to Huang et al. (2020) for the mathematical derivation.

Figure 2. The illustration of two tasks (T1 and T2) as distributions over working memories. The intersection of both densities represents the ambiguity between T1 and T2.

We apply T-Fixup for the transformer encoder as follows:

- Apply Xavier initialization (Glorot & Bengio, 2010) for all parameters excluding input embeddings. Use Gaussian initialization $\mathcal{N}(0, d^{-\frac{1}{2}})$ for input embeddings, where $d$ is the embedding dimension;
- Scale the linear projection matrices in each encoder attention block and position-wise feed-forward block by $0.67 N^{-\frac{1}{4}}$, where $N$ is the number of transformer layers.
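The following is a minimal sketch of how this initialization could be applied to a stack of PyTorch `nn.TransformerEncoderLayer` modules. The helper name `t_fixup_init` and the module layout are our own assumptions; the exact set of matrices to rescale (and the removal of warmup and layer normalization that T-Fixup enables) follows Huang et al. (2020), and the released TrMRL code may differ in details.

```python
import torch
import torch.nn as nn

def t_fixup_init(layers, embedding: nn.Embedding) -> None:
    """T-Fixup-style initialization for a stack of nn.TransformerEncoderLayer."""
    n = len(layers)
    d = embedding.embedding_dim
    scale = 0.67 * n ** (-0.25)

    # Step 1: Xavier initialization for all weight matrices except input embeddings.
    for layer in layers:
        for name, p in layer.named_parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
            elif "bias" in name:
                nn.init.zeros_(p)  # LayerNorm weights keep their default of 1

    # Step 2: Gaussian N(0, d^{-1/2}) for the input embeddings.
    nn.init.normal_(embedding.weight, mean=0.0, std=d ** -0.5)

    # Step 3: scale the value/output projections of each attention block and the
    # position-wise feed-forward matrices by 0.67 * N^{-1/4}.
    with torch.no_grad():
        for layer in layers:
            d_model = layer.self_attn.embed_dim
            layer.self_attn.in_proj_weight[2 * d_model:].mul_(scale)  # value projection
            layer.self_attn.out_proj.weight.mul_(scale)
            layer.linear1.weight.mul_(scale)
            layer.linear2.weight.mul_(scale)

# Usage sketch with hypothetical sizes:
embedding = nn.Embedding(1000, 64)
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(4)
)
t_fixup_init(layers, embedding)
```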
4. Transformers are Meta-Reinforcement Learners

In this work, we argue that two critical capabilities of transformers compose the central role of a Meta-Reinforcement Learner. First, transformers can handle long sequences and reason over long-term dependencies, which is essential for the meta-RL agent to identify the MDP from a sequence of trajectories. Second, transformers present context-dependent weights from self-attention. This mechanism serves as a fast adaptation strategy and provides the necessary adaptability to the meta-RL agent for new tasks.

4.1. Task Representation

We represent a working memory at timestep $t$ as a parameterized function $\phi_t(s_t, a_t, r_t, \eta_t)$, where $s_t$ is the MDP state, $a_t \sim \pi(a_t \mid s_t)$ is an action, $r_t \sim R(s_t, a_t)$ is the reward, and $\eta_t$ is a boolean flag identifying whether this is a terminal state. Our first hypothesis is that we can define a task $\mathcal{T}$ as a distribution over working memories, as in Equation 2:

$$\mathcal{T}(\phi): \Phi \rightarrow [0, \infty), \tag{2}$$

where $\Phi$ is the working memory embedding space. In this context, one goal of a meta-RL agent is to learn $\phi$ to make a distinction among the tasks in the embedding space $\Phi$.

Figure 3. Illustration of causal self-attention as a fast adaptation strategy. In this simplified scenario (2 working memories), the attention weights $\alpha_{i,j}$ drive the association between the current working memory and the past ones to compute a task representation $\mu_t$. Self-attention computes this association by relative similarity.

Furthermore, the learned embedding space should also approximate the distributions of similar tasks so that they can share knowledge. Figure 2 illustrates this concept for a one-dimensional representation. We aim to find a representation for the task given its distribution to contextualize our policy. Intuitively, we can represent each task as a linear combination of working memories sampled by the policy interacting with it:

$$\mathcal{T} \approx \sum_{t=0}^{N} \alpha_t W(\phi_t(s_t, a_t, r_t, \eta_t)), \quad \sum_{t=0}^{N} \alpha_t = 1, \tag{3}$$

where $N$ represents the length of a segment of sampled trajectories during the policy and task interaction, and $W$ represents an arbitrary linear transformation. Furthermore, $\alpha_t$ is a coefficient that computes how relevant a particular working memory $t$ is to the task representation, given the set of sampled working memories. Next, we show how the self-attention computes these coefficients, which we use to output an episodic memory from the transformer architecture.

4.2. Self-Attention as a Fast Adaptation Strategy

In this work, our central argument is that self-attention works as a fast adaptation strategy. The context-dependent weights dynamically compute the working memory coefficients to implement Equation 3. We now derive how we compute these coefficients. Figure 3 illustrates this mechanism.

Let us define $\phi^k_t$, $\phi^q_t$, and $\phi^v_t$ as representations of the working memory at timestep $t$ in the keys, queries, and values spaces, respectively (we slightly abuse the notation by omitting the function arguments $s_t$, $a_t$, $r_t$, and $\eta_t$ for the sake of conciseness). The dimension of the queries and keys spaces is $d$. We aim to compute the attention operation in Equation 1 for a sequence of $T$ timesteps, resulting in Equation 4:

$$\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}} + M\right) V = \begin{bmatrix} \alpha_{1,1} & \cdots & \alpha_{1,T} \\ \vdots & \ddots & \vdots \\ \alpha_{T,1} & \cdots & \alpha_{T,T} \end{bmatrix} \begin{bmatrix} \phi^v_1 \\ \vdots \\ \phi^v_T \end{bmatrix}, \quad \alpha_{i,j} = \begin{cases} \dfrac{\exp \langle \phi^q_i, \phi^k_j \rangle}{\sum_{n=1}^{i} \exp \langle \phi^q_i, \phi^k_n \rangle} & \text{if } i \geq j, \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$

where $\langle a_i, b_j \rangle = \sum_{n=1}^{d} a_{i,n} b_{j,n}$ is the dot product between the vectors $a_i$ and $b_j$. Therefore, for a particular timestep $t$, the self-attention output is:

$$\frac{\phi^v_1 \exp \langle \phi^q_t, \phi^k_1 \rangle + \dots + \phi^v_t \exp \langle \phi^q_t, \phi^k_t \rangle}{\sum_{n=1}^{t} \exp \langle \phi^q_t, \phi^k_n \rangle} = \sum_{n=1}^{t} \alpha_{t,n} W_v(\phi_n). \tag{5}$$

Equation 5 shows that the self-attention mechanism implements the task representation in Equation 3 by associating past working memories given that the current one is $\phi_t$. It computes this association with relative similarity through the dot product normalized by the softmax operation. This inductive bias helps the working memory representation learning to approximate the density of the task distribution $\mathcal{T}(\phi)$.
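To make Equations 4 and 5 concrete, the snippet below computes the attention coefficients $\alpha_{t,n}$ for the current timestep of a random sequence of working-memory embeddings and forms the corresponding convex combination of value projections. All tensors are placeholder values for illustration, and the $1/\sqrt{d}$ scaling from Equation 1 is omitted to mirror the definition of $\alpha_{i,j}$ above.

```python
import torch

torch.manual_seed(0)
T, d = 6, 8                       # sequence length and key/query dimension
phi = torch.randn(T, d)           # working-memory embeddings (placeholder values)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = phi @ Wq, phi @ Wk, phi @ Wv

t = T - 1                                      # current timestep
scores = q[t] @ k[: t + 1].T                   # <phi^q_t, phi^k_n> for n <= t
alpha = torch.softmax(scores, dim=-1)          # coefficients alpha_{t,n} (Equation 4)
print(alpha.sum())                             # tensor(1.) -- a convex combination
task_repr = alpha @ v[: t + 1]                 # sum_n alpha_{t,n} W_v(phi_n) (Equation 5)
print(task_repr.shape)                         # torch.Size([8])
```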
4.3. Transformers and Memory Reinstatement

We now argue that the transformer model implements a memory reinstatement mechanism for episodic memory retrieval. An episodic memory system is a long-lasting memory that allows an agent to recall and re-experience personal events (Tulving, 2002). It complements the working memory system, which is active and relevant for short periods (Baddeley, 2010) and works as a buffer for episodic memory retrieval (Zilli & Hasselmo, 2008).

Adopting this memory interaction model, we model an episodic memory as a transformation over a collection of past memories. More concretely, we consider that a transformer layer implements this transformation:

$$e^l_t = f(e^{l-1}_0, \dots, e^{l-1}_t), \tag{6}$$

where $e^l_t$ represents the episodic memory retrieved from layer $l$ for timestep $t$ and $f$ represents the transformer layer. Equation 6 provides a recursive definition, and $e^0_t$ (the base case) corresponds to the working memory $\phi_t$. In this way, the transformer architecture recursively refines the episodic memory by associating memories retrieved from the previous layer. We show the pseudocode for this process in Algorithm 1 (Appendix G). This refinement is guaranteed by a crucial property of the self-attention mechanism: it computes a consensus representation across the input memories associated with the sub-trajectory, as stated by Theorem 4.1 (proof in Appendix F). Here, we define the consensus representation as the memory representation that is closest on average to all likely representations (Kumar & Byrne, 2004), i.e., it minimizes the Bayes risk considering the set of episodic memories.

Theorem 4.1. Let $S^l = (e^l_0, \dots, e^l_N) \sim p(e \mid S^l, \theta_l)$ be a set of normalized episodic memory representations sampled from the posterior distribution $p(e \mid S^l, \theta_l)$ induced by the transformer layer $l$, parameterized by $\theta_l$. Let $K$, $Q$, $V$ be the Key, Query, and Value vector spaces in the self-attention mechanism. Then, the self-attention in layer $l+1$ computes a consensus representation $\frac{\sum_{t=1}^{N} e^{l,V}_t \exp\langle e^{l,Q}_t, e^{l,K}_i\rangle}{\sum_{t=1}^{N} \exp\langle e^{l,Q}_t, e^{l,K}_i\rangle}$ whose associated Bayes risk (in terms of negative cosine similarity) lower bounds the Minimum Bayes Risk (MBR) predicted from the set of candidate samples $S^l$ projected onto the $V$ space.

Lastly, we condition the policy head on the episodic memory from the current timestep to sample an action. This complete process resembles a memory reinstatement operation: a reminder procedure that reintroduces past elements in advance of a long-term retention test (Rovee-Collier, 2012). In our context, this long-term retention test is to identify the task and act accordingly to maximize rewards.
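The recursion in Equation 6 can be sketched as a stack of causally masked encoder layers whose output at the current timestep conditions the policy head. This is a simplified illustration under assumed names and sizes (`EpisodicRefiner`, `d_model=64`), not the released TrMRL implementation, which additionally uses positional encodings and the T-Fixup initialization described above.

```python
import torch
import torch.nn as nn

class EpisodicRefiner(nn.Module):
    """Recursively refine episodic memories through L transformer layers (Equation 6)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, working_memories: torch.Tensor) -> torch.Tensor:
        # working_memories: (batch, T, d_model) -- the base case e^0_t = phi_t.
        T = working_memories.size(1)
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=working_memories.device), diagonal=1
        )
        e = working_memories
        for layer in self.layers:
            # e^l_t = f(e^{l-1}_0, ..., e^{l-1}_t): each layer re-associates the
            # memories produced by the previous layer under a causal mask.
            e = layer(e, src_mask=causal_mask)
        # The episodic memory of the current timestep conditions the policy head.
        return e[:, -1]

# Usage sketch with placeholder shapes:
refiner = EpisodicRefiner()
phi = torch.randn(2, 10, 64)          # batch of 2 sub-trajectories, 10 working memories
episodic_memory = refiner(phi)        # shape (2, 64)
```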
5. Experiments and Discussion

In this section, we present an empirical validation of our method, comparing it with current state-of-the-art methods. We considered high-dimensional, continuous control tasks for locomotion (MuJoCo) and dexterous manipulation (Meta-World). We describe them in Appendix B. (We highlight that both the MuJoCo locomotion tasks and Meta-World are built on the MuJoCo physics engine; we identify the set of locomotion tasks solely as MuJoCo to ensure simpler and more concise writing during the analysis of the results.) For reproducibility (source code and hyperparameters), we refer to the released source code at https://github.com/luckeciano/transformers-metarl.

5.1. Experimental Setup

Meta-Training. During meta-training, we repeatedly sampled a batch of tasks to collect experience with the goal of learning to learn. For each task, we ran a sequence of E episodes. During the interaction, the agent conducted exploration with a Gaussian policy. During optimization, we concatenate these episodes to form a single trajectory and maximize the discounted cumulative reward of this trajectory. This is equivalent to the training setup for other on-policy meta-RL algorithms (Duan et al., 2016; Zintgraf et al., 2020). For these experiments, we considered E = 2. We performed this training via Proximal Policy Optimization (PPO) (Schulman et al., 2017), and the data batches mixed different tasks. Therefore, we present here an on-policy version of the TrMRL algorithm. To stabilize transformer training, we used T-Fixup as the weight initialization scheme.

Meta-Testing. During meta-testing, we sampled new tasks. These are different from the tasks in meta-training, but they come from the same distribution, except during the Out-of-Distribution (OOD) evaluation. For TrMRL, in this stage, we froze all network parameters. For each task, we ran a few episodes, performing the adaptation strategy. The goal is to identify the current MDP and maximize the cumulative reward across the episodes.

Memory Write Logic. At each timestep, we fed the network with the sequence of working memories. This process works as follows: at the beginning of an episode (when the memory sequence is empty), we start writing the first positions of the sequence until we fill all the slots. Then, for each new memory, we removed the oldest memory in the sequence (at the back of this "queue") and added the most recent one (at the front).
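This write logic is essentially a fixed-size FIFO buffer over working-memory inputs. A minimal sketch, assuming a hypothetical `WorkingMemoryBuffer` class and padding with placeholder transitions while an episode is still shorter than the sequence length (as in Algorithm 1, Appendix G):

```python
from collections import deque
from typing import NamedTuple, List

class Transition(NamedTuple):
    state: list
    action: list
    reward: float
    done: bool

class WorkingMemoryBuffer:
    """Fixed-size FIFO buffer holding the N most recent working-memory inputs."""

    def __init__(self, size: int, pad: Transition):
        self.size = size
        self.pad = pad
        self.queue = deque(maxlen=size)  # oldest entries are dropped automatically

    def reset(self) -> None:
        self.queue.clear()

    def write(self, transition: Transition) -> None:
        # Appending beyond `size` evicts the oldest memory (the back of the queue).
        self.queue.append(transition)

    def read(self) -> List[Transition]:
        # Left-pad with placeholder transitions while the episode is still short.
        padding = [self.pad] * (self.size - len(self.queue))
        return padding + list(self.queue)
```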
Figure 4. Meta-Training results for Meta-World (success rate) and MuJoCo (average return) benchmarks. All subplots represent performance on test tasks over the training timesteps.

Figure 5. Fast adaptation results on MuJoCo locomotion tasks. Each curve represents the average performance over 20 test tasks. TrMRL presented high performance since the first episode due to the online adaptation nature of the attention weights.

Comparison Methods. For comparison, we evaluated four different meta-RL baselines: RL2 (Duan et al., 2016), optimized using PPO (Schulman et al., 2017); PEARL (Rakelly et al., 2019); MAML (Finn et al., 2017), whose outer loop used TRPO (Schulman et al., 2015); and VariBAD (Zintgraf et al., 2020).

5.2. Results and Analysis

We compared TrMRL with baseline methods in terms of meta-training, episode adaptation, and OOD performance. We also present the latent visualization for TrMRL working memories and ablation studies. All the curves presented are averaged across three random seeds, with 95% bootstrapped confidence intervals.

Meta-Training Evaluation. Figure 4 shows the meta-training results for all the methods in the Meta-World (success rates) and MuJoCo (average returns) environments. All subplots represent performance on test tasks over the training timesteps. TrMRL presented comparable or superior performance to the baseline methods. In scenarios where the task ambiguity is high, VariBAD presented stronger results, especially for Push-v2. In this context, ambiguity relates to the same working memories belonging to multiple different MDPs (as illustrated in Figure 2). These results support the effectiveness of the VariBAD objective that incorporates task uncertainty directly during action selection (Zintgraf et al., 2020). Reach-v2 and AntDir also presented some level of ambiguity, but TrMRL is on par in these scenarios. For scenarios with less ambiguity, such as HalfCheetahVel, TrMRL is considerably more sample efficient than VariBAD. It is worth mentioning that VariBAD's training objective can be leveraged by other encoding methods (such as TrMRL) (Zintgraf et al., 2020), and we leave this as a future line of work. PEARL failed to explore and learn these robotic manipulation tasks: prior work already pointed out the difficulty of training PEARL's task encoder for dexterous manipulation (Yu et al., 2021). On the other side, it presented better results on locomotion tasks, especially in sample efficiency. This is because of its off-policy nature inherited from the Soft Actor-Critic framework (Haarnoja et al., 2018). Compared with other on-policy methods (such as RL2 and MAML), TrMRL significantly improved performance or sample efficiency.

Finally, for Meta-World-ML45-Train (a simplified version of the ML45 benchmark where we evaluate on the training tasks), the most complex environment in this work, TrMRL achieved the best learning performance among the methods, supporting that the proposed method can generalize better across very different tasks in comparison with other methods. Additionally, VariBAD's performance suggests that its proposed objective does not scale well for this scenario, harming the final performance when compared with RL2. We hypothesize that the supervision from the dynamics and rewards in the training objective provides conflicting learning signals to the RNN.

Fast Adaptation Evaluation. A critical skill for meta-RL agents is the capability of adapting to new tasks given a few episodes. We evaluate this by running meta-testing on 20 test tasks over six sequential episodes. Each agent runs its adaptation strategy to identify the task and maximize the reward across episodes. Figure 5 presents the results for the locomotion tasks. TrMRL again presents a comparable or superior performance in comparison to the baselines. We highlight that TrMRL presented high performance since the first episode. It only requires a few timesteps to achieve high performance in test tasks. In HalfCheetahVel, for example, it only requires around 20 timesteps to achieve the best performance (Figure 9 in Appendix D). Therefore, it presents a nice property for online adaptation. Other methods, such as PEARL and MAML, do not present such a property, as they need a few episodes before executing adaptation efficiently.

Figure 6. OOD Evaluation in the HalfCheetahVel environment. TrMRL surpasses all the baseline methods by a good margin, suggesting that the context-dependent weights learned a robust adaptation strategy, while other methods memorized some aspects of the standard distribution of tasks.

OOD Evaluation. Another critical scenario is how the fast adaptation strategies perform for out-of-distribution tasks. For this case, we change the HalfCheetahVel environment to sample OOD tasks during the meta-test.
In the standard setting, both training and testing target velocities are sampled from a uniform distribution in the interval [0.0, 3.0]. In the OOD setting, we sampled 20 tasks in the interval [3.0, 4.0] and assessed adaptation throughout the episodes. Figure 6 presents the results. TrMRL surpasses all the baseline methods by a good margin, suggesting that the context-dependent weights learned a robust adaptation strategy, while other methods memorized some aspects of the standard distribution of tasks. We especially highlight PEARL, which achieved the best performance for locomotion tasks among the methods but performed poorly in this setting. VariBAD also presented unstable performance across the episodes. These results suggest that their task encoding mechanisms do not generate effective latent representations for OOD tasks.

Figure 7. Episodic Memory Refinement Evaluation. We sampled 30 tasks from HalfCheetahVel and ran one of the trained policies to collect the episodic memories from each layer for 3 episodes per task. For the representation error, we computed the dissimilarity as 1.0 minus the cosine similarity between the episodic memory and the final representation from the last layer. For the linear regression models, we trained using a set of 20 tasks and evaluated the mean squared error over a test set of 10 tasks.

Episodic Memory Refinement. We evaluated Theorem 4.1 empirically in Figure 7. In these two experiments, we interacted with one of the trained policies within the HalfCheetahVel environment and collected the episodic memories computed from each layer throughout a few episodes and across different tasks. Firstly, we computed the dissimilarity between the memories from each layer and the final representation from the last layer. We refer to this metric as the representation error. Next, we computed linear regression models over the expert actions sampled from the trained policy. We used the episodic memories from each layer as features and reported the mean squared error on a test set. We aim to evaluate how meaningful the representations from each layer are by predicting the best action given a very lightweight model. Figure 7 shows that the representation error gradually decreases throughout the layers, suggesting the refinement of the episodic memory and the convergence to a consensus representation, as supported by Theorem 4.1. It also shows a correlated behavior in the regression error over the expert actions. This evidence suggests that the refinement of an episodic memory induces a consensus representation that is more meaningful for predicting the best actions from the expert agent.

6. Conclusion and Future Work

In this work, we presented TrMRL, a memory-based meta-RL algorithm built upon a transformer, where the multi-head self-attention mechanism works as a fast adaptation strategy. We designed this network to resemble a memory reinstatement mechanism, associating past working memories to dynamically represent a task and recursively build an episodic memory through layers. TrMRL demonstrated a valuable capability of learning from reward signals. On the other side, recent Language Models presented substantial improvements by designing self-supervised tasks (Devlin et al., 2019; Raffel et al., 2020) or even automating their generation (Shin et al., 2020).
As future work, we aim to investigate how to enable these forms of self-supervision to leverage off-policy data collected during the training and further improve sample efficiency in transformers for the meta-RL scenario. Andrychowicz, M., Denil, M., G omez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings. neurips.cc/paper/2016/file/ fb87582825f9d28a8d42c5e5e5e8b23d-Paper. pdf. Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization, 2016. Baddeley, A. Working memory. Current Biology, 20(4):R136 R140, 2010. ISSN 0960-9822. doi: https://doi.org/10.1016/j.cub.2009.12.014. URL https://www.sciencedirect.com/ science/article/pii/S0960982209021332. Bengio, Y., Bengio, S., and Cloutier, J. Learning a synaptic learning rule. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume ii, pp. 969 vol.2 , 1991. doi: 10.1109/IJCNN.1991.155621. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., Mc Candlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. Co RR, abs/2005.14165, 2020a. URL https://arxiv.org/ abs/2005.14165. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., Mc Candlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020b. Caron, M., Touvron, H., Misra, I., J egou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers, 2021. Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling, 2021. Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and de Freitas, N. Learning to learn without gradient descent by gradient descent. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 748 756. PMLR, 06 11 Aug 2017. URL https://proceedings.mlr.press/v70/ chen17e.html. Co-Reyes, J. D., Miao, Y., Peng, D., Real, E., Levine, S., Le, Q. V., Lee, H., and Faust, A. Evolving reinforcement learning algorithms, 2021. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021a. URL https:// openreview.net/forum?id=Yicb Fd NTTy. 
Transformers are Meta-Reinforcement Learners Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2021b. Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. Rl2: Fast reinforcement learning via slow reinforcement learning, 2016. Fakoor, R., Chaudhari, P., Soatto, S., and Smola, A. J. Metaq-learning, 2020. Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126 1135. PMLR, 06 11 Aug 2017. URL https://proceedings.mlr.press/v70/ finn17a.html. Finn, C., Xu, K., and Levine, S. Probabilistic modelagnostic meta-learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, pp. 9537 9548, Red Hook, NY, USA, 2018. Curran Associates Inc. Fortunato, M., Tan, M., Faulkner, R., Hansen, S., Puigdom enech Badia, A., Buttimore, G., Deck, C., Leibo, J. Z., and Blundell, C. Generalization of reinforcement learners with working and episodic memory. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings. neurips.cc/paper/2019/file/ 02ed812220b0705fabb868ddbf17ea20-Paper. pdf. Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 10). Society for Artificial Intelligence and Statistics, 2010. Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-reinforcement learning of structured exploration strategies. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, pp. 5307 5316, Red Hook, NY, USA, 2018. Curran Associates Inc. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actorcritic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1861 1870. PMLR, 10 15 Jul 2018. URL https://proceedings.mlr. press/v80/haarnoja18b.html. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015. Hoskin, A. N., Bornstein, A. M., Norman, K. A., and Cohen, J. D. Refresh my memory: Episodic memory reinstatements intrude on working memory maintenance. bio Rxiv, 2018. doi: 10. 1101/170720. URL https://www.biorxiv.org/ content/early/2018/05/30/170720. Huang, X. S., Perez, F., Ba, J., and Volkovs, M. Improving transformer optimization through better initialization. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 4475 4483. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/ v119/huang20f.html. Hutter, F., Kotthoff, L., and Vanschoren, J. Automated Machine Learning: Methods, Systems, Challenges. Springer Publishing Company, Incorporated, 1st edition, 2019. ISBN 3030053172. Janner, M., Li, Q., and Levine, S. 
Reinforcement learning as one big sequence modeling problem, 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. Kumar, S. and Byrne, W. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pp. 169 176, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics. URL https: //aclanthology.org/N04-1022. Li, K. and Malik, J. Learning to optimize, 2016. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond, 2020. Loynd, R., Fernandez, R., Celikyilmaz, A., Swaminathan, A., and Hausknecht, M. Working memory graphs, 2020. Melo, L. C., Maximo, M. R. O. A., and da Cunha, A. M. Bottom-up meta-policy search, 2019. Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner, 2018. Transformers are Meta-Reinforcement Learners Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms, 2018. Norouzi, M., Bengio, S., Chen, z., Jaitly, N., Schuster, M., Wu, Y., and Schuurmans, D. Reward augmented maximum likelihood for neural structured prediction. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings. neurips.cc/paper/2016/file/ 2f885d0fbe2e131bfc9d98363e55d1d4-Paper. pdf. Oh, J., Hessel, M., Czarnecki, W. M., Xu, Z., van Hasselt, H. P., Singh, S., and Silver, D. Discovering reinforcement learning algorithms. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1060 1070. Curran Associates, Inc., 2020. URL https://proceedings. neurips.cc/paper/2020/file/ 0b96d81f0494fde5428c7aea243c9157-Paper. pdf. Ortega, P. A., Wang, J. X., Rowland, M., Genewein, T., Kurth-Nelson, Z., Pascanu, R., Heess, N., Veness, J., Pritzel, A., Sprechmann, P., Jayakumar, S. M., Mc Grath, T., Miller, K., Azar, M., Osband, I., Rabinowitz, N., Gy orgy, A., Chiappa, S., Osindero, S., Teh, Y. W., van Hasselt, H., de Freitas, N., Botvinick, M., and Legg, S. Meta-learning of sequential strategies, 2019. Parisotto, E., Song, H. F., Rae, J. W., Pascanu, R., Gulcehre, C., Jayakumar, S. M., Jaderberg, M., Kaufman, R. L., Clark, A., Noury, S., Botvinick, M. M., Heess, N., and Hadsell, R. Stabilizing transformers for reinforcement learning, 2019. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. Efficient off-policy meta-reinforcement learning via probabilistic context variables, 2019. Ritter, S., Wang, J., Kurth-Nelson, Z., Jayakumar, S., Blundell, C., Pascanu, R., and Botvinick, M. Been there, done that: Meta-learning with episodic recall. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4354 4363. PMLR, 10 15 Jul 2018a. URL https://proceedings.mlr.press/v80/ ritter18a.html. Ritter, S., Wang, J. X., Kurth-Nelson, Z., and Botvinick, M. Episodic control through meta-reinforcement learning. In Cog Sci, 2018b. URL https://mindmodeling. 
org/cogsci2018/papers/0190/index.html. Ritter, S., Wang, J. X., Kurth-Nelson, Z., Jayakumar, S. M., Blundell, C., Pascanu, R., and Botvinick, M. Been there, done that: Meta-learning with episodic recall, 2018c. Ritter, S., Faulkner, R., Sartran, L., Santoro, A., Botvinick, M., and Raposo, D. Rapid task-solving in novel environments, 2021. Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. Promp: Proximal meta-policy search, 2018. Rovee-Collier, C. Reinstatement of Learning, pp. 2803 2805. Springer US, Boston, MA, 2012. ISBN 9781-4419-1428-6. doi: 10.1007/978-1-4419-1428-6 346. URL https://doi.org/10.1007/ 978-1-4419-1428-6_346. Schmidhuber, J., Zhao, J., and Wiering, M. Simple principles of metalearning. Technical report, 1996. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1889 1897, Lille, France, 07 09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/ schulman15.html. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms, 2017. Shin, T., Razeghi, Y., IV, R. L. L., Wallace, E., and Singh, S. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. Co RR, abs/2010.15980, 2020. URL https://arxiv.org/ abs/2010.15980. Thrun, S. and Pratt, L. Learning to Learn: Introduction and Overview, pp. 3 17. Kluwer Academic Publishers, USA, 1998. ISBN 0792380479. Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033, 2012. doi: 10.1109/IROS.2012.6386109. Tulving, E. Episodic memory: From mind to brain. Annual review of psychology, 53:1 25, 02 2002. doi: 10.1146/ annurev.psych.53.100901.135114. Transformers are Meta-Reinforcement Learners Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, pp. 6000 6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. Vinyals, O., Blundell, C., Lillicrap, T., kavukcuoglu, k., and Wierstra, D. Matching networks for one shot learning. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings. neurips.cc/paper/2016/file/ 90e1357833654983612fb05e3ec9148c-Paper. pdf. Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn, 2017. Wang, J. X., Kurth-Nelson, Z., Kumaran, D., Tirumala, D., Soyer, H., Leibo, J. Z., Hassabis, D., and Botvinick, M. Prefrontal cortex as a metareinforcement learning system. bio Rxiv, 2018. doi: 10. 1101/295964. URL https://www.biorxiv.org/ content/early/2018/04/13/295964. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture, 2020. Yu, T., Quillen, D., He, Z., Julian, R., Narayan, A., Shively, H., Bellathur, A., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021. 
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F. E., Feng, J., and Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet, 2021.

Zilli, E. A. and Hasselmo, M. E. Modeling the role of working memory and episodic memory in behavioral tasks. Hippocampus, 18(2):193-209, 2008.

Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. Varibad: A very good method for bayes-adaptive deep rl via meta-learning, 2020.

Zintgraf, L. M., Shiarlis, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning, 2019.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. 2017. URL https://arxiv.org/abs/1611.01578.

A. Reproducibility Statement

Code Release. To ensure the reproducibility of our research, we released all the source code associated with our models and experimental pipeline. We refer to the supplementary material of this submission. It also includes the hyperparameters and the scripts to execute all the scenarios presented in this paper.

Baselines Reproducibility. We strictly reproduced the results from prior work implementations for the baselines, and we provide their open-source repositories for reference.

Proof of Theoretical Results and Pseudocode. We provide a detailed proof of Theorem 4.1 in Appendix F, containing all assumptions considered. We also provide pseudocode for TrMRL's agent to improve clarity on the proposed method.

Availability of all simulation environments. All environments used in this work are freely available and open-source.

B. Meta-RL Environments Description

In this Appendix, we detail the environments considered in this work.

B.1. MuJoCo Locomotion Tasks

This benchmark is a set of locomotion tasks on the MuJoCo (Todorov et al., 2012) environment. It comprises different bodies, and each environment provides different tasks with different learning goals. These locomotion tasks were previously introduced by Finn et al. (2017) and Rothfuss et al. (2018). We considered 2 different environments.

AntDir: This environment has an ant body, and the goal is to move forward or backward. Hence, it presents these 2 tasks.

HalfCheetahVel: This environment has a half-cheetah body, and the goal is to achieve a target velocity running forward. This target velocity comes from a continuous uniform distribution.

These locomotion task families require adaptation across reward functions.

B.2. Meta-World

The Meta-World (Yu et al., 2021) benchmark contains a diverse set of manipulation tasks designed for multi-task RL and meta-RL settings. Meta-World presents a variety of evaluation modes. Here, we describe the modes used in this work. For a more detailed description of the benchmark, we refer to Yu et al. (2021).

ML1: This scenario considers a single robotic manipulation task but varies the goal. The meta-training tasks correspond to 50 random initial object and goal positions, and meta-testing is on 50 held-out positions.

ML45: With the objective of testing generalization to new manipulation tasks, the benchmark provides 45 training tasks and holds out 5 meta-testing tasks. Given the difficulty of this benchmark, in this work we adopted a simplified version, ML45-Train, where we also evaluate the performance of the methods on the 45 training tasks.

These robotic manipulation task families require adaptation across reward functions and dynamics.
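To make "adaptation across reward functions" concrete, the sketch below shows one way a HalfCheetahVel-style task distribution could be represented: each task is a target velocity drawn from a uniform interval, and only the reward changes across tasks. The class name and the exact reward shaping are illustrative assumptions, not the benchmark's implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class VelocityTask:
    """A task from a HalfCheetahVel-style distribution: only the reward changes."""
    target_velocity: float

    def reward(self, forward_velocity: float, control_cost: float = 0.0) -> float:
        # Illustrative shaping: penalize the gap to the target velocity.
        return -abs(forward_velocity - self.target_velocity) - control_cost

def sample_tasks(n: int, low: float = 0.0, high: float = 3.0) -> list:
    """Sample n tasks; the OOD evaluation uses a disjoint interval such as [3.0, 4.0]."""
    return [VelocityTask(random.uniform(low, high)) for _ in range(n)]

train_tasks = sample_tasks(20)              # in-distribution, as in Section 5
ood_tasks = sample_tasks(20, 3.0, 4.0)      # out-of-distribution evaluation
```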
C. Working Memories Latent Visualization

Figure 8 presents a 3-D view of the working memories from the HalfCheetahVel environment. We sampled some tasks (target velocities) and collected working memories during the meta-test setting. We observe that this embedding space learns a representation of each MDP as a distribution over the working memories, as suggested in Section 4. In this visualization, we can draw planes that approximately distinguish these tasks. Working memories that cross this boundary represent the ambiguity between two tasks. Furthermore, this representation also learns the similarity of tasks: for example, the cluster of working memories for target velocity v = 1.0 is between the clusters for v = 0.5 and v = 1.5. This property induces knowledge sharing among all the tasks, which suggests the sample efficiency behind TrMRL meta-training.

Figure 8. 3-D latent visualization of the working memories for the HalfCheetahVel environment. We plotted the 3 most relevant components from PCA. TrMRL learns a representation of each MDP as a distribution over the working memories. This representation distinguishes the tasks and approximates similar tasks, which helps knowledge sharing among them.
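A minimal sketch of how such a visualization could be produced from collected working-memory embeddings; the `memories` and `target_velocities` arrays below are random placeholders standing in for data collected at meta-test time, and scikit-learn/matplotlib are assumed for the projection and plotting.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Placeholder data standing in for working-memory embeddings collected at meta-test
# time; in practice these would come from the trained working-memory encoder.
rng = np.random.default_rng(0)
memories = rng.normal(size=(500, 64))
target_velocities = rng.uniform(0.0, 3.0, size=500)

# Project onto the 3 most relevant principal components, as in Figure 8.
components = PCA(n_components=3).fit_transform(memories)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
scatter = ax.scatter(components[:, 0], components[:, 1], components[:, 2],
                     c=target_velocities, cmap="viridis", s=5)
fig.colorbar(scatter, label="target velocity")
plt.show()
```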
D. Online Adaptation

In this Appendix, we highlight that TrMRL presented high performance since the first episode. In fact, it only requires a few timesteps to achieve high performance in test tasks. Figure 9 shows that it only requires around 20 timesteps to achieve the best performance in the HalfCheetahVel environment. Therefore, it presents a nice property for online adaptation. This is because the self-attention mechanism is lightweight and only requires a few working memories to achieve good performance. Hence, we can run it efficiently at each timestep. Other methods, such as PEARL and MAML, do not present such a property, and they need a few episodes before executing adaptation efficiently.

Figure 9. TrMRL's adaptation for the HalfCheetahVel environment.

E. Ablation Study

In this section, we present an ablation study regarding the main components of TrMRL to identify how they affect the performance of the learned agents. For all the scenarios, we considered one environment for each benchmark to represent both locomotion (HalfCheetahVel) and dexterous manipulation (Meta-World-ML1-Reach-v2). We evaluated the meta-training phase so that we could analyze both sample efficiency and asymptotic performance.

E.1. T-Fixup

In this work, we employed T-Fixup to address the instability of the early stages of transformer training, given the reasons described in Section 3.3. In RL, the early stages of training are also the moment when the learning policies are more exploratory to cover the state and action spaces better and discover rewards, preventing the convergence to sub-optimal policies. Hence, it is crucial for RL that the transformer policy learns appropriately from the beginning to drive exploration.

This section evaluated how T-Fixup becomes essential for environments where the learned behaviors must guide exploration to prevent poor policies. For this, we present the T-Fixup ablation (Figure 10) for two settings: Meta-World-ML1-Reach-v2 and HalfCheetahVel. For the reach environment, we compute the reward distribution using the distance between the gripper and the target location. Hence, it is always a dense and informative signal: even a random policy can easily explore the environment, and T-Fixup does not interfere with the learning curve. On the other side, HalfCheetahVel requires a functional locomotion gait to drive exploration; otherwise, it can get stuck with low rewards (e.g., the cheetah keeps exploring while fallen). In this scenario, T-Fixup becomes crucial to prevent unstable learning updates that could collapse the learning policy to poor behaviors.

Figure 10. Ablation results for the T-Fixup component.

E.2. Working Memory Sequence Length

A meta-RL agent requires a sequence of interactions to identify the running task and act accordingly. The length of this sequence N should be large enough to address the ambiguity associated with the set of tasks, but not too long to make the transformer optimization harder and less sample efficient. In this ablation, we study two environments that present different levels of ambiguity and show that they also require different lengths to achieve optimal sample efficiency.

We first analyze Meta-World-ML1-Reach-v2. The environment defines each target location in the 3D space as a task. The associated reward is the distance between the gripper and this target. Hence, at each timestep, the reward is ambiguous for all the tasks located on the surface of the sphere centered at the gripper position. This suggests that the agent will benefit from long sequences. Figure 11 (left) confirms this hypothesis, as the sample efficiency improves up to sequences of several timesteps (N = 50).

The HalfCheetahVel environment defines each forward velocity as a different task. The associated reward depends on the difference between the current cheetah velocity and this target. Hence, at each timestep, the emitted reward is ambiguous only for two possible tasks. To identify the current task, the agent needs to estimate its velocity (which requires a few timesteps) and then disambiguate between these two tasks. This suggests that the agent will not benefit from very long sequences. Figure 11 (right) confirms this hypothesis: there is an improvement in sample efficiency from N = 1 to N = 5, but it decreases for longer sequences as the training becomes harder (it is worth mentioning that all evaluated policies achieved the best asymptotic performance).

Figure 11. Ablation results for the working memory sequence length.

E.3. Number of Layers

Another important component is the network depth. In Section 4, we hypothesized that more layers would help to recursively build a more meaningful version of the episodic memory, since we interact with output memories from the past layer, mitigating the bias effect from the task representations. Figure 12 shows how TrMRL behaves according to the number of layers. We observe a similar pattern to the previous ablation case. For Reach-v2, more layers improved the performance by reducing the effect of ambiguity and biased task representations. For HalfCheetahVel, we can see an improvement from a single layer to 4 or 8 layers, but for 12 layers, sample efficiency starts to decrease.
On the other hand, we highlight that even for a deep network with 12 layers, we have a stable optimization procedure, showing the effectiveness of the T-Fixup initialization.

E.4. Number of Attention Heads

The last ablation case relates to the number of attention heads in each MHSA block. We hypothesized that multiple heads would diversify the working memory representation and improve network expressivity. Nevertheless, Figure 13 shows that more heads only slightly increased the performance in HalfCheetahVel and did not interfere significantly in Reach-v2.

Figure 12. Ablation study for the number of transformer layers.

Figure 13. Ablation study for the number of attention heads.

F. Proof of Theorem 4.1

Theorem 4.1. Let $S^l = (e^l_0, \dots, e^l_N) \sim p(e \mid S^l, \theta_l)$ be a set of normalized episodic memory representations sampled from the posterior distribution $p(e \mid S^l, \theta_l)$ induced by the transformer layer $l$, parameterized by $\theta_l$. Let $K$, $Q$, $V$ be the Key, Query, and Value vector spaces in the self-attention mechanism. Then, the self-attention in layer $l+1$ computes a consensus representation $e^{l+1}_N = \frac{\sum_{t=1}^{N} e^{l,V}_t \exp\langle e^{l,Q}_t, e^{l,K}_i\rangle}{\sum_{t=1}^{N} \exp\langle e^{l,Q}_t, e^{l,K}_i\rangle}$ whose associated Bayes risk (in terms of negative cosine similarity) lower bounds the Minimum Bayes Risk (MBR) predicted from the set of candidate samples $S^l$ projected onto the $V$ space.

Proof. Let us define $S^l_V$ as the set containing the projection of the elements in $S^l$ onto the $V$ space: $S^l_V = (e^{l,V}_0, \dots, e^{l,V}_N)$, where $e^{l,V} = W_V e^l$ ($W_V$ is the projection matrix). The Bayes risk of selecting $\hat{e}^{l,V}$ as representation, $\mathrm{BR}(\hat{e}^{l,V})$, under a loss function $L$, is defined by:

$$\mathrm{BR}(\hat{e}^{l,V}) = \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[L(e, \hat{e})]. \tag{7}$$

The MBR predictor selects the episodic memory $\hat{e}^{l,V} \in S^l_V$ that minimizes the Bayes risk among the set of candidates: $e^{l,V}_{\mathrm{MBR}} = \arg\min_{\hat{e} \in S^l_V} \mathrm{BR}(\hat{e})$. Employing negative cosine similarity as the loss function, we can represent the MBR prediction as:

$$e^{l,V}_{\mathrm{MBR}} = \arg\max_{\hat{e} \in S^l_V} \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[\langle e, \hat{e} \rangle]. \tag{8}$$

The memory representation output from a self-attention operation in layer $l+1$ is given by:

$$e^{l+1}_N = \frac{\sum_{t=1}^{N} e^{l,V}_t \exp\langle e^{l,Q}_t, e^{l,K}_i\rangle}{\sum_{t=1}^{N} \exp\langle e^{l,Q}_t, e^{l,K}_i\rangle} = \sum_{t=1}^{N} \alpha_{N,t}\, e^{l,V}_t. \tag{9}$$

The attention weights $\alpha_{N,t}$ define a probability distribution over the samples in $S^l_V$, which approximates the posterior distribution $p(e \mid S^l_V, \theta_l)$. Hence, we can represent Equation 9 as an expectation: $e^{l+1}_N = \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[e]$. Finally, we compute the Bayes risk for it:

$$\mathrm{BR}(e^{l+1}_N) = \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[-\langle e, e^{l+1}_N \rangle] = \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[-\langle e, \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[e] \rangle] = -\langle \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[e], \mathbb{E}_{p(e \mid S^l_V, \theta_l)}[e] \rangle \leq -\mathbb{E}_{p(e \mid S^l_V, \theta_l)}[\langle e, \hat{e} \rangle] = \mathrm{BR}(\hat{e}), \quad \forall \hat{e} \in S^l_V. \tag{10}$$
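As a numerical illustration of the proof (not part of the paper's experiments), the snippet below draws a random set of unit-norm candidate memories, treats softmax weights as the distribution $p(e \mid S^l_V, \theta_l)$ of Equation 9, and checks that the Bayes risk (negative cosine similarity) of the attention-style weighted mean does not exceed the minimum Bayes risk over the candidates.

```python
import torch

torch.manual_seed(0)
N, d = 16, 8
candidates = torch.nn.functional.normalize(torch.randn(N, d), dim=-1)  # S^l_V, unit norm
weights = torch.softmax(torch.randn(N), dim=0)        # attention weights ~ p(e | S^l_V)

def bayes_risk(candidate: torch.Tensor) -> torch.Tensor:
    # Expected negative cosine similarity under the attention distribution.
    cos = torch.nn.functional.cosine_similarity(candidates, candidate.unsqueeze(0), dim=-1)
    return -(weights * cos).sum()

consensus = weights @ candidates                      # attention output (weighted mean)
mbr = min(bayes_risk(c) for c in candidates)          # minimum Bayes risk over candidates
print(float(bayes_risk(consensus)), float(mbr))       # the consensus risk is no larger
```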
G. Pseudocode

In this section, we present pseudocode for TrMRL's agent during its interaction with an arbitrary MDP.

Algorithm 1: TrMRL Forward Pass
Require: MDP $M \sim p(M)$
Require: Working memory sequence length $N$
Require: Parameterized function $\phi(s, a, r, \eta)$
Require: Transformer network with $L$ layers $\{f_1, \dots, f_L\}$
Require: Policy head $\pi$
  Initialize buffer with $N-1$ PAD transitions: $B = \{(s_{PAD}, a_{PAD}, r_{PAD}, \eta_{PAD})_i\}$, $i \in \{1, \dots, N-1\}$
  $t \leftarrow 0$; $s_{next} \leftarrow s_0$
  while episode not done do
    Retrieve the $N-1$ most recent transitions $(s, a, r, \eta)$ from $B$ to create the ordered subset $D$
    $D \leftarrow D \cup \{(s_{next}, a_{PAD}, r_{PAD}, \eta_{PAD})\}$
    Compute working memories: $\phi_i = \phi(s_i, a_i, r_i, \eta_i)$, $\forall \{s_i, a_i, r_i, \eta_i\} \in D$
    Set $e^0_1, \dots, e^0_N \leftarrow \phi_1, \dots, \phi_N$
    for each $l \in 1, \dots, L$ do
      Refine episodic memories: $e^l_1, \dots, e^l_N \leftarrow f_l(e^{l-1}_1, \dots, e^{l-1}_N)$
    end for
    Sample $a_t \sim \pi(\cdot \mid e^L_N)$
    Collect $(s_{t+1}, r_t, \eta_t)$ by interacting with $M$, applying action $a_t$
    $s_{next} \leftarrow s_{t+1}$
    $B \leftarrow B \cup \{(s_t, a_t, r_t, \eta_t)\}$
    $t \leftarrow t + 1$
  end while
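For readers who prefer code, a minimal Python rendering of Algorithm 1 is sketched below. The `env`, `phi`, `transformer_layers`, and `policy` objects are assumed interfaces (a Gym-style environment with the classic step API, the working-memory encoder, a list of transformer layers, and a policy head returning a distribution); the released repository remains the authoritative implementation.

```python
from collections import deque
import torch

def trmrl_episode(env, phi, transformer_layers, policy, seq_len: int):
    """Sketch of Algorithm 1: one episode of interaction with an MDP."""
    pad = (torch.zeros(env.observation_space.shape[0]),
           torch.zeros(env.action_space.shape[0]),
           0.0, False)
    buffer = deque([pad] * (seq_len - 1), maxlen=seq_len - 1)  # N-1 PAD transitions

    s_next, done, t = torch.as_tensor(env.reset(), dtype=torch.float32), False, 0
    while not done:
        # Ordered subset D: the N-1 most recent transitions plus the current state.
        D = list(buffer) + [(s_next, pad[1], pad[2], pad[3])]
        memories = torch.stack([phi(s, a, r, eta) for (s, a, r, eta) in D])  # e^0

        # Recursively refine the episodic memories through the L layers
        # (positional encoding and causal masking omitted for brevity).
        e = memories.unsqueeze(0)            # add batch dimension
        for layer in transformer_layers:
            e = layer(e)
        episodic_memory = e[0, -1]           # memory associated with the current timestep

        action = policy(episodic_memory).sample()
        obs, reward, done, _ = env.step(action.numpy())

        buffer.append((s_next, action, float(reward), bool(done)))
        s_next = torch.as_tensor(obs, dtype=torch.float32)
        t += 1
```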