# contrabar_contrastive_bayesadaptive_deep_rl__cfb48292.pdf

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Era Choshen 1 Aviv Tamar 1

Abstract In meta reinforcement learning (meta RL), an agent seeks a Bayes-optimal policy the optimal policy when facing an unknown task that is sampled from some known task distribution. Previous approaches tackled this problem by inferring a belief over task parameters, using variational inference methods. Motivated by recent successes of contrastive learning approaches in RL, such as contrastive predictive coding (CPC), we investigate whether contrastive methods can be used for learning Bayes-optimal behavior. We begin by proving that representations learned by CPC are indeed sufficient for Bayes optimality. Based on this observation, we propose a simple meta RL algorithm that uses CPC in lieu of variational belief inference. Our method, Contra BAR, achieves comparable performance to state-of-the-art in domains with state-based observation and circumvents the computational toll of future observation reconstruction, enabling learning in domains with image-based observations. It can also be combined with image augmentations for domain randomization and used seamlessly in both online and offline meta RL settings.

1. Introduction

In meta reinforcement learning (meta RL), an agent learns from a set of training tasks how to quickly solve a new task, sampled from a similar distribution as the training set (Finn et al., 2017; Duan et al., 2016). A formal setting for meta RL is based on the Bayesian RL formulation, where a task corresponds to a particular Markov decision process (MDP), and there exists some prior distribution over MDPs (Humplik et al., 2019; Zintgraf et al., 2020; Rakelly et al., 2019). Under this setting, the optimal meta RL policy is well defined, and is often referred to as a Bayes-optimal policy (Ghavamzadeh et al., 2016).

1Technion, Haifa, Israel. Correspondence to: Era Choshen <erachoshen@campus.technion.ac.il>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In contrast to the single MDP setting, where an optimal policy can be Markovian taking as input the current state and outputting the next action, the Bayes-optimal policy must take as input the whole history of past states, actions, and rewards, or some sufficient statistic of it (Bertsekas, 1995). A popular sufficient statistic is the belief the posterior probability of the MDP parameters given the observed history. For small MDPs, the belief may be inferred by directly applying Bayes rule, and approximate dynamic programming can be used to calculate an approximately Bayes-optimal policy (Ghavamzadeh et al., 2016). However, this approach quickly becomes intractable for large or continuous MDPs. Recently, several studies proposed to scale up belief inference using deep learning, where the key idea is to leverage a variational autoencoder (VAE, Kingma & Welling 2013) formulation of the problem, in which the posterior is approximated using a recurrent neural network (Zintgraf et al., 2020; Humplik et al., 2019). While this approach has demonstrated impressive results on continuous control benchmarks (Zintgraf et al., 2020; Humplik et al., 2019; Dorfman et al., 2021), it also has some limitations. Training a VAE is based on a reconstruction loss, in this case, predicting the future observations given the current history, which can be difficult to optimize for visually rich observations such as images. Furthermore, variational algorithms such as Vari BAD (Zintgraf et al., 2020) reconstruct entire trajectories, restricting application to image-based domains due to memory limitations.

As an alternative to VAEs, contrastive learning has shown remarkable success in learning representations for various domains, including image recognition and speech processing (Chen et al., 2020; Fu et al., 2021b; Han et al., 2021). Rather than using a reconstruction loss, these approaches learn features that discriminate between similar observations and dissimilar ones, using a contrastive loss such as the Info NCE in contrastive predictive coding (CPC, Oord et al. 2018). Indeed, several recent studies showed that contrastive learning can learn useful representations for image based RL (Laskin et al., 2020; 2022), outperforming representations learned using VAEs. Furthermore, Guo et al. (2018) showed empirically that in partially observed MDPs, representations learned using CPC (Oord et al., 2018) are correlated with the belief. In this work, we further investigate contrastive learning for meta RL, henceforth termed

Contra BAR: Contrastive Bayes-Adaptive Deep RL

CL meta RL, and aim to establish it as a principled and advantageous alternative to the variational approach.

Our first contribution is a proof that, given certain assumptions on data collection and the optimization process of CPC, representations learned using a variant of CPC are indeed a sufficient statistic for control, and therefore suffice as input for a Bayes-optimal policy. Our second contribution is a bound on the suboptimality of a policy that uses an approximate sufficient statistic, learned by CPC, in an iterative policy improvement scheme where policies between iterations are constrained to be similar. This result relaxes the assumptions on the optimization and data collection in the first proof. Building on this result, we propose a simple meta RL algorithm that uses a CPC based representation to learn a sufficient statistic. Our third contribution is an empirical evaluation of our method that exposes several advantages of the contrastive learning approach. In particular, we show that: (1) For state-based observations, CL meta RL is on par with the state-of-the-art Vari BAD (Zintgraf et al., 2020) (2) For image-based observations, CL meta RL significantly outperforms the variational approach, and is competitive with RNN based methods (Duan et al., 2016) (3) In contrast to the variational approach, CL Meta RL is compatible with image augmentations and domain randomization. (4) Our method works well in the online and offline meta RL setting. Overall, our results establish CL meta RL as a versatile and competitive approach to meta RL.

2. Background and Problem Formulation

In this section we present our problem formulation and relevant background material.

2.1. Meta RL and POMDPs

We define a Markov Decision Process (MDP) (Bertsekas, 1995) as a tuple M = (S, A, P, R), where S is the state space, A is the action space, P is the transition kernel and R is the reward function. In meta RL, we assume a distribution over tasks, where each task is an MDP Mi = (S, A, Pi, Ri), where the state and action spaces are shared across tasks, and Pi, Ri are task specific and drawn from a task distribution, which we denote D(P, R). At a given time t, we denote by (s0, a0, r0, s1, a1, r1, . . . , st) = ht Ωt the current history, where Ωt is the space of all state-actionreward histories until time t. Our aim in meta RL is to find a policy π = {π0, π1, . . . }, where πt : Ωt A, which maximizes the following objective:

where the expectation Eπ is taken over the transitions st+1 P( |st, at), the reward rt = R(st, at), the actions

at π( |ht) and the uncertainty over the MDP parameters P, R D(P, R). We assume a bounded reward rt [ Rmax, Rmax], Rmax > 0 with probability one.

Meta RL is a special case of the more general Partially Observed Markov-Decision Process (POMDP), which is an extension of MDPs to partially observed states. In the POMDP for meta RL, the unobserved variables are P, R, and they do not change over time. We define Ωt for POMDPs as above, except that states are replaced by observations according to the distribution ot+1 U(ot+1|st+1, at). As shown in Bertsekas (1995), the optimal policy for a POMDP can be calculated using backwards dynamic programming for every possible ht Ωt. However, as explained in Bertsekas (1995) this method is computationally intractable in most cases as Ωt grows exponentially with t.

2.2. Information States and BAMDPs

Instead of the intractable space of histories, sufficient statistics can succinctly summarize all the necessary information for optimal control. One popular sufficient statistic is the posterior state distribution or belief P(st|ht). Conditions for a function to be a sufficient statistic, also termed information state, were presented by Subramanian et al. (2020) and are reiterated here for completeness: Definition 2.1 (Information State Generator). Let {Zt}T t=1 be a pre-specified collection of Banach spaces. A collection {σt : Ωt Zt}T t=1 of history compression functions is called an information generator if the process {Zt}T t=1 satisfies the following properties, where ht Ωt, and σt(ht) = Zt Zt:

P1 For any time t and for any ht Ωt, at A we have:

E [rt|ht, at] = E [rt|Zt = σt(ht), at] .

P2 For any time t, and for any ht Ωt, at A, and any Borel subset B of Zt+1 we have:

P (B Zt+1|ht, at) = P (B Zt+1|Zt = σt(ht), at) .

Intuitively, information states compress the history without losing predictive power about the next reward, or the next information state.

To solve a POMDP, one can define a Bayes-Adaptive MDP (BAMDP) an MDP over the augmented state space of S B, where B = {Zt}T t=1 is the space of the information state. This idea was introduced by Duff (2002) for the belief. Here, we use the term BAMDP more generally, referring to any information state. The optimal policies for BAMDPs are termed Bayes-optimal and optimally trade-off between exploration and exploitation, which is essential for maximizing online return during learning. Unfortunately, in most cases computing the Bayes-optimal policy is intractable because the augmented space is continuous and

Contra BAR: Contrastive Bayes-Adaptive Deep RL

high-dimensional. Zintgraf et al. (2020) proposed to approximate the Bayes-optimal policy by using deep neural networks to learn an information state (belief), and conditioning an RL agent on the learned augmented space; here we follow this approach.

3. Related Work

Our focus in this work is learning a Bayes-optimal policy for meta RL. We recapitulate the current approaches to meta RL with a focus on approaches that potentially yield Bayesoptimal policies.

The methods in (Finn et al., 2017; Grant et al., 2018; Nichol et al., 2018; Rothfuss et al., 2018; Clavera et al., 2018) learn neural network policies that can quickly be fine-tuned to new tasks at test time via gradient updates. These methods do not optimize for Bayes-optimal behavior, and typically exhibit significantly suboptimal test-time adaptation.

A different approach is to learn an agent that directly infers the task at test time, and conditions the policy based on the inferred task. Typically, past interactions of the agent with the environment are aggregated to a latent representation of the task. Rakelly et al. (2019); Fakoor et al. (2019) follow a posterior-sampling approach, which is not Bayes-optimal (Zintgraf et al., 2020); in this work we focus on methods that can achieve Bayes-optimality. Duan et al. (2016); Wang et al. (2016) propose memory-based approaches, which Ortega et al. (2019) proves to approximate Bayes-optimal agents. Zintgraf et al. (2020; 2021); Dorfman et al. (2021) also approximate Bayes-optimal agents with a history-based representation, using a variational approach. Humplik et al. (2019) learn an approximately Bayes-optimal agent, where privileged information a task descriptor is used to learn a sufficient statistic. We explore an alternative approach that lies at the intersection of meta RL and contrastive learning. Different from memory-based methods such as RL2 (Duan et al., 2016), and similarly to Vari BAD, we learn a history based embedding separately from the policy. However, unlike variational methods, we learn the task representation using contrastive learning.

Contrastive learning has been used to learn representations for input to a meta RL policy. FOCAL (Li et al., 2020b) uses distance metric learning to learn a deterministic encoder of transition tuples to perform offline RL. They operate under the relatively restrictive assumption that each transition tuple (s, a, s , r) is uniquely identified by a task. The authors followed up with FOCAL++, in which batches of transition tuples (not necessarily from the same trajectory) are encoded to a representation that is optimized with Mo Co (He et al., 2020), a variant of CPC, alongside an intra-task attention mechanism meant to robustify task inference (Li et al., 2021). The MBML method in (Li et al., 2020a) proposes

an offline meta RL method that uses the triplet loss to learn embeddings of batches of transition tuples from the same task, with the same probabilistic and permutation-invariant architecture of Rakelly et al. (2019). Wang et al. (2021) propose embedding windows of transition tuples as probabilistic latent variables, where the windows are cropped from different trajectories. The embeddings are learned with Mo CO (He et al., 2020) by contrasting them in probabilistic metric space, where positive pairs are transition windows that come from the same batch. The algorithm is presented as a general method to learn representations for contextbased meta RL algorithms, but in practice all results are shown with PEARL (Rakelly et al., 2019). In a similar line of work, Fu et al. (2021a) encode batches of transitions as a product of Gaussian factors and contrast the embeddings with Mo CO (He et al., 2020), with positive pairs being embedded transition batches from the same task, as opposed to the same trajectory as in Wang et al. (2021). As in Wang et al. (2021), results are shown with a posterior sampling meta RL algorithm. While we also investigate contrastive learning for meta RL, we make an important distinction: all of the works above embed transition tuples and not histories, and therefore cannot represent information states, and cannot obtain Bayes-optimal behavior. In contrast, in our work, we draw inspiration from Guo et al. (2018), who used a glass-box approach to empirically show that contrastive learning can be used to learn the belief in a POMDP. We cast this idea in the Bayesian-RL formalism, and show both theoretically and empirically, that contrastive learning can be used to learn Bayes-optimal meta RL policies.

In this section we show how to use contrastive learning to learn an information state representation of the history, and use it as input to an RL agent. We give a brief description of CPC (Oord et al., 2018) followed by our meta RL algorithm. We then prove that our method does indeed learn an information state.

4.1. Contrastive Predictive Coding

CPC (Oord et al., 2018) is a contrastive learning method that uses noise contrastive estimation (Gutmann & Hyv arinen, 2010) to discriminate between positive future observations o+ t+k, where t is the current time step, and negative observations o t+k. First, an encoder g generates an embedding for each observation in a sequence of observations from a trajectory τ until time t, {zi = g(oi)}t i=1. Second, an autoregressive model g AR summarizes z t, the past t observations in latent space, and outputs a latent ct. The model is trained to discriminate between future observations o+ t+k

and K negative observations n o ,i t+k o K

i=1 given ct. Given

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Figure 1: Contra BAR architecture. Red box: similarly to CPC, g AR is autoregressively applied to past observation embeddings (zt, zt 1, . . . ) to yield the current representation ct. Green box: unlike CPC, the MLP f receives as input, in addition to a positive/negative future embedding, the output of an action GRU that process ct and the future action sequence. Blue box: the representation ct is the information state, which is concatenated with the state st and input to the RL policy.

a set X = {o+ t+k, o ,1 t+k, . . . , o ,K t+k } containing one positive future observation sampled according to P(o+ t+k|ct) and K negative observations sampled from a proposal distribution P(o t+k), the Info NCE loss is:

LInfo NCE =

log exp f ct, o+ t+k

exp f ct, o+ t+k + PK i=1 exp f ct, o ,i t+k

where f is a learnable function that outputs a similarity score. The model components f, g, g AR are learned by optimizing the loss LInfo NCE.

4.2. Contra BAR Algorithm

We will now introduce our CPC based meta RL algorithm, which is depicted in Figure 1, and explain how CPC is used to learn a latent representation of the history.

We begin by noting that we use the term observation throughout the text, in line with Oord et al. (2018), however in our case the meaning is state, reward and action ot = {st, rt 1, at 1} when talking about observation history , and state and reward ot+k = {st+k, rt+k 1} when talking about future observations . We would like to learn an embedding of the observation history, ct, that will contain relevant information for decision making. The CPC formulation seems like a natural algorithm to do this for a given trajectory τ of length T collected from some unknown MDP M, we use an embedding of its observation history until time t < T, ct, to learn to discriminate between future

observations from the trajectory τ and random observations from other trajectories τj = τ. This means that ct encodes relevant information for predicting the future system states, and consequently information regarding the MDP M from which τ was collected.

The CPC formulation described above is based on predicting future states in an uncontrolled system without rewards. We now modify it to learn a sufficient statistic for meta RL. We assume that data is collected at each training iteration m by some data collection policy πm and added to a replay buffer D = {τi}N i=1 containing trajectories from previous data collection policies {π1, . . . , πm 1}; we note that length of the trajectories may vary. At each learning iteration, a batch of M trajectories is sampled, and for each trajectory and time t the negative observations are sampled from the remaining M 1 trajectories in the batch. As in CPC, we define ct to be a function of the observation history until time t, but we add to f as input the future k 1 actions, as in a controlled system the future observation ot+k = (st+k, rt+k 1) depends on the controls. Our f can therefore now be written as f(ct, ot+k, at:t+k 1). We implement this modification as in Guo et al. (2018) by means of an additional autoregressive component, a GRU gaction that receives actions as input and takes ct as its initial hidden state.

Given the adjustments described above, each batch B used as input to our algorithm contains the following: (1) The observation history until time t in some trajectory τ (2) Future observations from time t + k (3) Observations o t+k from the remaining M 1 trajectories sampled from D.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

We rewrite the Info NCE loss for the meta RL setting explicitly; For ease of notation we mark f(ct, o+ t+k, at:t+k 1) as f + and f(ct, o ,i t+k, at:t+k 1) as f ,i:

log exp(f +)

exp(f +) + PK i=1 exp (f ,i)

where the expectation is over the batches of positive and negative observations sampled from D, as described above.

4.3. Learning Information States with CPC

We now show that integrating contrastive learning with meta RL is a fundamentally sound idea. We shall prove that our algorithm presented in Section 4.2 learns a representation of the history that is an information state, by showing that the latent encoding satisfies the properties of an information state P1, P2 as defined by Subramanian et al. (2020) and reiterated above in Section 2.

We first define the notion of a possible history, which we use in Assumption 4.2.

Definition 4.1 (Possible history). Let PM denote a probability distribution over MDPs and let Pm,π(ht) be the probability of observing history ht under policy π in MDP m. We say that ht is a possible history if there exists a policy π and an MDP m such that PM(m) > 0 and Pm,π(ht) > 0.

Next, we make the following assumption, which states that the policy collecting the data covers the state, reward and action space.

Assumption 4.2. Let the length of the longest possible history be T. Let ht, where t T, be a history and let PD(ht) denote the probability of observing a history in the data D. If ht is a possible history, then PD(ht) > 0.

Assumption 4.2 is necessary to claim that the learned CPC representation is a sufficient statistic for every possible history. In Section 4.4 we discuss a relaxation of this assumption, using approximate information states.

Theorem 4.3. Let Assumption 4.2 hold. Let g , g AR, f

jointly minimize LM(g, g AR, f). Then the context latent representation ct = g AR(z t) satisfies conditions P1, P2 and is therefore an information state.

The full proof is provided in Appendix A; we next provide a sketch. The main challenge in our proof lies in proving the following equality:

P(st+1, rt|ht, at) = P(st+1, rt|ct, at). (2)

Given the equality in Equation (2), proving P1,P2 is relatively straightforward. We prove Equation (2) by expanding the proof in (Oord et al., 2018), which shows that the Info NCE loss upper bounds the negative mutual information

between o+ t+k and ct (in the CPC setting). In our case, we show that

LM log(M 1) I(st+1, rt; ct|at), (3)

where I( ; ) denotes mutual information. Thus, by minimizing the loss in Equation (1), we maximize the mutual information I(st+1, rt; ct|at). Due to the Markov property of the process, the mutual information in (3) cannot be greater than I(st+1, rt; ht|at), which leads to Equation 2.

4.4. Learning Approximate Information States with CPC

We next investigate a more practical setting, where there may be errors in the CPC learning, and the data does not necessarily satisfy Assumption 4.2. We aim to relate the CPC error to a bound on the suboptimality of the resulting policy. In this section, we consider an iterative policy improvement algorithm with a similarity constraint on consecutive policies, similar to the PPO algorithm we use in practice (Schulman et al., 2017). We shall bound the suboptimality of policy improvement, when data for training CPC is collected using the previous policy, denoted πk.

In light of Eq. 3, we assume the following error due to an imperfect CPC representation:

Assumption 4.4. There exists an ϵ such that for every t T, I(st+1, rt; ct|at) I(st+1, rt; ht|at) ϵ, where the histories are distributed according to policy πk.

The next theorem provides our main result.

Theorem 4.5. Let Assumption 4.4 hold for some representation ct. Consider the distance function between two distributions D(P1(x), P2(x)) = maxx |P1(x)/P2(x)|. We let ˆr(ct, at) = E[rt|ct, at] and ˆP(c |ct, at) = E[1(ct+1 = c )|ct, at] denote an approximate reward and transition kernel, respectively. Define the value functions

ˆQt(ct, at) = ˆr(ct, at) + X

ˆP(ct+1|ct, at) ˆVt+1(ct+1)

ˆVt(ct) = max π:D(π(ct),πk(ct)) β

a π(a) ˆQt(ct, a), (4)

for t T, and ˆVT (c T ) = 0, and the approximate optimal policy

ˆπ(ct) argmax π:D(π,πk(ct)) β

a π(a) ˆQt(ct, a). (5)

Let the optimal policy π (ht) be defined similarly, but with ht replacing ct in (4) and (5). Then we have that

t=0 r(st, at)

t=0 r(st, at)

ϵ1/3Rmax T 2(

Contra BAR: Contrastive Bayes-Adaptive Deep RL

The dynamic programming recurrence in Equation (4) defines the optimal policy that is conditioned on ct (and not ht), and is restricted to be β-similar to the previous policy πk. The theorem bounds the loss in performance of such a policy compared to a policy that is conditioned on the full history (yet still restricted to be β-similar to πk). The proof of Theorem 4.5 builds on the idea of an approximate information state (Subramanian et al., 2020) and is detailed in Appendix A.

4.5. Contra BAR Architecture

We now describe several design choices in our Contra BAR implementation.

History Embedding We now describe the specific architecture used to implement our algorithm, also depicted in Figure 1. We use a non-linear encoder to embed a history of actions, rewards and states and run it through a GRU to generate the hidden state for the current time-step ct. The latent ct is then used to initialize the action-gru gaction, which is fed future actions as input the resulting hidden state is then concatenated with either a positive observation-reward pair, or a negative one and used as input to a projection head that outputs a score used in Equation (1).

We note that given a random sampling of negative observations, the probability of sampling a positive and negative observation that share the same state is low. Consequently, for environments where st+k can be estimated via st, at, . . . , at+k without ht, ct need only encode information regarding st to allow the action-gru to learn to distinguish between positive and negative observations. This renders ct uninformative about the reward and transition functions and thus unhelpful for optimal control. An example of this is a set of deterministic environments that differ only in reward functions. The action-gru can learn to predict st+k via st, at, . . . , at+k, only requiring ct to encode information regarding st and not the reward function. One way to circumvent this is hard negative mining, i.e using negative samples that are difficult to distinguish from the positive ones. Another solution, relevant for the case of varying reward functions, is to generate a negative observation by taking the state and action from the positive observation and recalculating the reward with a reward function sampled from the prior. In practice, we found that a simple alternative is to omit the action-gru. This prevents the easy estimation of st+k and requires ct to encode information regarding the reward and transition function. We found this worked well in practice for the environments we ran experiments on, including those with varying transitions. We expand on these considerations in Appendix D.

RL Policy The history embedding portion of the algorithm described above is learned separately from the policy

and can be done online or offline. The policy, which can be trained with an RL algorithm of the user s choice, is now conditioned on the current state st as well as ct the learned embedding of ht. We chose to use PPO (Schulman et al., 2017) for the online experiments and SAC (Haarnoja et al., 2018) for the offline experiment in line with Vari BAD and BORe L (Dorfman et al., 2021).

5. Experiments

In our experiments, we shall demonstrate that (1) Contra BAR learns approximately Bayes-optimal policies (2) Contra BAR is on par with SOTA for environments with state inputs (3) Contra BAR scales to image-based environments (4) Augmentations can be naturally incorporated into Contra BAR and (5) Contra BAR can work in the offline setting.

We compare Contra BAR to state-of-the-art approximately Bayes-optimal meta RL methods. In the online setting, we compare against Vari BAD (Zintgraf et al., 2020), RL2 (Duan et al., 2016), and the recent modification of RL2 by Ni et al. (2022) which we refer to as RMF (recurrent model-free). In the offline setting, we compare with BORe L (Dorfman et al., 2021). Zintgraf et al. (2020) and Dorfman et al. (2021) already outperform posterior sampling based methods such as PEARL (Rakelly et al., 2019), therefore we do not include such methods in our comparison. Finally, we note that using Vari BAD (Zintgraf et al., 2020) with image-based inputs is currently computationally infeasible due to memory constraints, and as such we did not use it as a baseline we explain this issue further in Appendix E. Other variational approaches, which require a reconstruction of the future observations, are subject to similar memory constraints. Instead, we compared our algorithm against RL2 (Duan et al., 2016), which works with images. We evaluate performance similarly to Zintgraf et al. (2020), by evaluating per episode return for 5 consecutive episodes with the exception of the offline setting where we adapted our evaluation to that of BORe L.

5.1. Qualitative Near Bayes-Optimal Behavior

We begin with a qualitative demonstration that Contra BAR can learn near Bayes-optimal policies. As calculating the exact Bayes-optimal policy is mostly intractable, we adopt the approach of Dorfman et al. (2021): for deterministic domains with a single sparse reward, the Bayes-optimal solution is essentially to search all possible reward locations so as to maximally reduce uncertainty, and then go directly to the goal in subsequent episodes. Thus, we can identify whether a policy is approximately Bayes-optimal by inspecting its trajectory. Figure 2 displays rollouts from a trained policy in the Gridworld and Semi-Circle domains, demonstrating near Bayes-optimal behavior similar to Vari BAD (Zintgraf et al., 2020).

Contra BAR: Contrastive Bayes-Adaptive Deep RL

t = 0 t = 4 t = 7 t = 15

Episode 2 Episode 3

Sparse Pointrobot

Episode 0 Episode 1 Episode 2

Figure 2: Left: Qualitatively Bayes-Optimal behavior on gridworld. The agent methodically searches the grid, reducing uncertainty about unexplored cells in the second episode and updating to a shorter path in the subsequent episode. Right: Qualitatively Bayes-Optimal behaviour on Sparse Semi-Circle. The agent scans the semi-circle until the goal has been found and all uncertainty reduced. In subsequent episodes the agent goes straight to the goal.

5.2. Results for Problems with State Observations

We compare Contra BAR with Vari BAD and RMF, the current state-of-the-art on Mu Jo Co locomotion tasks (Todorov et al., 2012), commonly used in meta RL literature. We use the environments considered in Zintgraf et al. (2020), namely the Ant-Dir, Ant Goal, Half Cheetah Dir, Half Cheetah Vel, Humanoid and Walker environments. Figure 3 shows competitive performance with the current SOTA on all domains. Note that rewards in these environments are dense, so in principle, the agent only needs a few exploratory actions to infer the task by observing the rewards it receives. Indeed, we see that Contra BAR is able to quickly adapt within the first episode, with similar performance in subsequent episodes.

5.3. Scaling Belief to Image-Based Inputs

We show that Contra BAR can scale to image domains, which are computationally expensive, by running our algorithm on three image-based domains with varying levels of difficulty and sources of uncertainty: (1) Reacher-Image a two-link robot reaching an unseen target located somewhere on the diagonal of a rectangle, with sparse rewards (2) Panda Reacher a Franka Panda robot tasked with placing the end effector at a goal on a 2d semi-circle, where the vertical position of the goal (z coordinate) is fixed; adapted from the Reacher task in Panda Gym (Gallou edec et al., 2021) (3) Panda Wind The same environment as Panda Reacher, except that the transitions are perturbed with Gaussian noise sampled separately for each task. For a more detailed description of each environment see Appendix B.

Image-Based Reward: For our image-based experiments, we found that learning in image-based domains with sparse reward was difficult when the reward was embedded separately (as in the state observation domains), and concate-

nated with the image embedding. We hypothesized that this might be an issue of differing scales between the scalar rewards and image inputs, but we observed that standard normalization techniques such as layer norm (Ba et al., 2016) did not help. Instead, we opted for a different approach that embeds the reward as an explicit part of the image. To implement this idea, we exploited the fact that in all our domains, the reward is sparse and binary, and we add a colored strip to a fixed place in the image when non-zero reward is received. Extending this idea to non-binary reward is possible, for example, by controlling the color of the strip.

Our results are displayed in Figure 4. For the Reacher environment, Contra BAR is slightly outperformed by RL2, whereas in Panda Reacher and Panda Reacher Wind Contra BAR outperform RL2 by a large margin. Notice that in contrast to the dense reward domains of Section 5.2, in these sparse reward tasks the agent gains by exploring for the goal in the first iteration. Evidently, the plots show significantly higher reward in the second episode onward.

Glass-box Approach To further validate that our algorithm learns a sound belief representation, we follow a glassbox approach similar to that of Guo et al. (2018). First, we used Contra BAR to learn an information state for the Panda Reacher environment. Second, we use the trained agent to create a dataset of trajectories, including the agent s belief at each time step of every trajectory. We then trained an MLP-based binary classifier, which takes (x, y) and the information state ct as input and predicts whether the goal in the trajectory is indeed (x, y). In Figure 5 we see the visualization of the classifier s prediction at different points along the trajectory; We see that the predictions coincide with the belief we expect the agent to hold at each step, thus validating the soundness of our belief representation.

5.4. Contra BAR with Domain Randomization

Despite the high fidelity of modern simulators, when deployed in the real-world, image-based algorithms learned in simulation can only be accurate up to the differences between simulation and reality the sim-to-real gap. This motivates us to learn a belief representation that is robust to such differences, and in the following we will show that our algorithm can indeed learn such an information state. Robustification to irrelevant visual properties via random modifications is termed domain randomization (Tobin et al., 2017). We employ domain randomization in a similar fashion to Rabinovitz et al. (2021) wherein we modify the past and future observations (without the rewards) in the trajectories with a mapping T : S S that randomly shifts the RGB channels of the images. These modified trajectories are used to learn the history embedding ct, with the hope that it will be invariant to different color schemes in the environment. We show the strength of such modifications by

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Cheetah-Dir

Vari BAD Contra BAR RMF

1 2 3 4 5 60

0 Cheetah-Vel

Humanoid-Dir

1 2 3 4 5 Episodes

Figure 3: Average test performance on six different Mu Jo Co environments, trained separately with 10 seeds per Mu Jo Co environment per method - as in (Zintgraf et al., 2020). The meta-trained policies are rolled out for 5 episodes to show how they adapt to the task. Values shown are averages across tasks (95% confidence intervals shaded). We show competitive performance, surpassing Vari Bad and RMF (Ni et al., 2022) on certain environments. The results shown for RMF are the average performance on the first 2 episodes, as taken from their repository, which did not contain results for more than 2 episodes and did not contain results for the Walker and Ant-Goal environments.

1 2 3 4 5 0

Contra BAR RL2

1 2 3 4 5 Episodes

50 Panda Reacher

1 2 3 4 5 40

50 Panda Wind

Figure 4: Average test performance on three different image-based environments, trained separately with different seeds per environment per method. The meta-trained policies are rolled out for 5 episodes to show how they adapt to the task. Values shown are averages across tasks (95% confidence intervals shaded). All methods were run with 5 seeds for each environment. We show competitive performance with RL2, surpassing it in the Panda environments.

training two agents with Contra BAR on the Panda Reacher environment one receives images modified by T and the other does not. We then evaluate each agent s performance on different color schemes, which are kept static for evaluation. The results as well as the environments can be seen in Figure 7. Note that while the belief may be robustified separately with augmentations, the policy must be robust to such changes as well. To do so, we used the data-regularized actor-critic method from Raileanu et al. (2021) where the policy πθ and value function Vϕ are regularized via two additional loss terms,

Gπ = KL [πθ(a|s)|πθ(a|T(s))] ,

GV = (Vϕ(s) Vϕ(T(s)))2 ,

where T : S S randomly modifies the image.

We emphasize that domain randomization, as applied here, is not naturally compatible with variational belief inference methods. The reason is that when the loss targets reconstruction of the modified observation, the learned embedding cannot be trained to be invariant to the modification T .

5.5. Offline Contra BAR

We show that as in Vari BAD (Zintgraf et al., 2020), the disentanglement of belief and control allows us to reframe the algorithm within the context of offline meta RL, as was done in Dorfman et al. (2021). First, we use Contra BAR to learn a history embedding ct from an offline dataset. Note that no specific change is required to our algorithm we simply treat the offline dataset as the replay buffer for Contra BAR. Second, we perform state relabeling as described

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Figure 5: Belief visualization on the Panda Reacher environment. At each step, we visualise the predictions of the MLP-classifier for the entire map. Each frame shows an interesting milestone along the trajectory. From left to right along the rows: (1) Initially, the agent s belief over the location of the reward is uniform over the semi-circle, as in the prior. (2) As the agent travels along the semi-circle, the uncertainty regarding the location of the goal collapses in locations where the agent has visited and not found a goal. (3) Once the agent reaches the goal, all uncertainty collapses, even for locations the agent has yet to visit. (4) On the second episode, the agent follows the maintained belief and goes directly to the goal without exploration.

in Dorfman et al. (2021): for each trajectory τi of length T, i.e (si 0, ai o, ri 0, . . . , si T ), we embed each partial t-length history ht as ct, and transform each si t to s+,i t = (si t, ci t) as in the BAMDP formulation. We then learn a policy with SAC (Haarnoja et al., 2018) on the transformed dataset. We show competitive results with BORe L (Dorfman et al., 2021) in Figure 6. Unfortunately we were not able to find an offline adaptation of RMF to use as an additional baseline.

6. Conclusions

We proved that Contra BAR learns a representation that is a sufficient static of the history. Following on this, we presented what is to the best of our knowledge the first approximately Bayes-optimal CL meta RL algorithm. We demonstrated results competitive with previous approaches on several challenging state-input domains. Furthermore, by using contrastive learning we were able to scale meta-RL to image-based domains; We displayed results on par with RL2 which was also able to scale to image inputs. Finally, we showed that our method is naturally amenable to domain randomization, which may be important for applications such as robotics.

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.6 Steps 1e5

Average returns

Offline Semi-Circle

offline Contra BAR BORel

Figure 6: Results for the sparse Semi-Circle environment in the offline setting. Values shown are the average returns over two episodes, averaged over tasks. The shaded areas indicate 95% confidence intervals over different seeds of the method. Contra BAR and BORe L were both run with 10 seeds with reward relabeling.

1 2 3 Episode #

Average return over domains

Results for different domains

Augmentation No augmentation

Figure 7: Left: Average test performance on three Panda Reacher environments with different color schemes from the training data. We show two meta-trained policies, one with augmentations and one without, each rolled out for 3 episodes to show how they adapt to the task. Values shown are averages across tasks (95% confidence intervals shaded). The agent trained with augmentations is invariant of the color schemes. Right: Top left image is the training environment; others are different color schemes for evaluation.

7. Acknowledgements

We thank Tom Jurgenson, Ev Zisselman, Orr Krupnik and Gal Avineri for useful discussions and feedback, and Luisa Zintgraf for invaluable help with reproducing the graphs from the Vari BAD paper. This work received funding from the European Union (ERC, Bayes-RL, Project Number 101041250). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. ar Xiv preprint ar Xiv:1607.06450, 2016.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, 1995.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020.

Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., and Abbeel, P. Model-based reinforcement learning via meta-policy optimization. In Conference on Robot Learning, pp. 617 629, 2018.

Cover, T. M. and Thomas, J. A. Elements of information theory.

Dorfman, R., Shenfeld, I., and Tamar, A. Offline meta reinforcement learning identifiability challenges and effective data collection strategies. Advances in Neural Information Processing Systems, 34, 2021.

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL2: Fast reinforcement learning via slow reinforcement learning. ar Xiv preprint ar Xiv:1611.02779, 2016.

Duff, M. O. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. Ph D thesis, University of Massachusetts at Amherst, 2002.

Fakoor, R., Chaudhari, P., Soatto, S., and Smola, A. J. Metaq-learning. ar Xiv preprint ar Xiv:1910.00125, 2019.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126 1135. JMLR. org, 2017.

Fu, H., Tang, H., Hao, J., Chen, C., Feng, X., Li, D., and Liu, W. Towards effective context for meta-reinforcement learning: an approach based on contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7457 7465, 2021a.

Fu, L., Li, X., Wang, R., Zhang, Z., Wu, Y., He, X., and Zhou, B. Scala: Supervised contrastive learning for end-to-end automatic speech recognition. ar Xiv preprint ar Xiv:2110.04187, 2021b.

Gallou edec, Q., Cazin, N., Dellandr ea, E., and Chen, L. panda-gym: Open-Source Goal-Conditioned Environments for Robotic Learning. 4th Robot Learning Workshop: Self-Supervised and Lifelong Learning at Neur IPS, 2021.

Ghavamzadeh, M., Mannor, S., Pineau, J., and Tamar, A. Bayesian reinforcement learning: A survey. ar Xiv preprint ar Xiv:1609.04436, 2016.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical bayes. ar Xiv preprint ar Xiv:1801.08930, 2018.

Guo, Z. D., Azar, M. G., Piot, B., Pires, B. A., and Munos, R. Neural predictive belief representations. ar Xiv preprint ar Xiv:1811.06407, 2018.

Gutmann, M. and Hyv arinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 297 304. JMLR Workshop and Conference Proceedings, 2010.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ar Xiv preprint ar Xiv:1801.01290, 2018.

Han, T., Huang, H., Yang, Z., and Han, W. Supervised contrastive learning for accented speech recognition. ar Xiv preprint ar Xiv:2107.00921, 2021.

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729 9738, 2020.

Humplik, J., Galashov, A., Hasenclever, L., Ortega, P. A., Teh, Y. W., and Heess, N. Meta reinforcement learning as task inference. ar Xiv preprint ar Xiv:1905.06424, 2019.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

Laskin, M., Srinivas, A., and Abbeel, P. Curl: Contrastive unsupervised representations for reinforcement learning. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. ar Xiv:2004.04136.

Laskin, M., Liu, H., Peng, X. B., Yarats, D., Rajeswaran, A., and Abbeel, P. Cic: Contrastive intrinsic control for unsupervised skill discovery. ar Xiv preprint ar Xiv:2202.00161, 2022.

Li, J., Vuong, Q., Liu, S., Liu, M., Ciosek, K., Christensen, H. I., and Su, H. Multi-task batch reinforcement learning with metric learning. ar Xiv preprint arxiv:1909.11373, 2020a.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Li, L., Yang, R., and Luo, D. Focal: Efficient fullyoffline meta-reinforcement learning via distance metric learning and behavior regularization. ar Xiv preprint ar Xiv:2010.01112, 2020b.

Li, L., Huang, Y., Chen, M., Luo, S., Luo, D., and Huang, J. Provably improved context-based offline meta-rl with attention and contrastive learning. ar Xiv preprint ar Xiv:2102.10774, 2021.

Ni, T., Eysenbach, B., and Salakhutdinov, R. Recurrent model-free rl can be a strong baseline for many pomdps. In International Conference on Machine Learning, pp. 16691 16723. PMLR, 2022.

Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in rl. ar Xiv preprint ar Xiv:1804.03720, 2018.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018.

Ortega, P. A., Wang, J. X., Rowland, M., Genewein, T., Kurth-Nelson, Z., Pascanu, R., Heess, N., Veness, J., Pritzel, A., Sprechmann, P., et al. Meta-learning of sequential strategies. ar Xiv preprint ar Xiv:1905.03030, 2019.

Rabinovitz, C., Grupen, N., and Tamar, A. Unsupervised feature learning for manipulation with contrastive domain randomization. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 10153 10159. IEEE, 2021.

Raileanu, R., Goldstein, M., Yarats, D., Kostrikov, I., and Fergus, R. Automatic data augmentation for generalization in reinforcement learning. Advances in Neural Information Processing Systems, 34:5402 5415, 2021.

Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. Efficient off-policy meta-reinforcement learning via probabilistic context variables. ar Xiv preprint ar Xiv:1903.08254, 2019.

Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. Promp: Proximal meta-policy search. ar Xiv preprint ar Xiv:1810.06784, 2018.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017.

Subramanian, J., Sinha, A., Seraj, R., and Mahajan, A. Approximate information state for approximate planning and reinforcement learning in partially observed systems. 2020.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23 30. IEEE, 2017.

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026 5033. IEEE, 2012.

Wang, B., Xu, S., Keutzer, K., Gao, Y., and Wu, B. Improving context-based meta-reinforcement learning with selfsupervised trajectory contrastive learning. ar Xiv preprint ar Xiv:2103.06386, 2021.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. ar Xiv preprint ar Xiv:1611.05763, 2016.

Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. Varibad: A very good method for bayes-adaptive deep rl via meta-learning. In International Conference on Learning Representation (ICLR), 2020.

Zintgraf, L. M., Feng, L., Lu, C., Igl, M., Hartikainen, K., Hofmann, K., and Whiteson, S. Exploration in approximate hyper-state space for meta reinforcement learning. In International Conference on Machine Learning, pp. 12991 13001. PMLR, 2021.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

A. Theorem proofs

Theorem 4.3. Let Assumption 4.2 hold. Let g , g AR, f jointly minimize LM(g, g AR, f). Then the context latent representation ct = g AR(z t) satisfies conditions P1, P2 and is therefore an information state.

We begin our proof by presenting the causal model for the variant of CPC used by Contra BAR, shown in Figure 8.

Figure 8: The causal model for ct

From the causal model, we can infer that P(ct+1|ct, st+1, rt, at) =

P(ct+1|ct, st+1, rt, at, ht)

and P(st+1, rt|ht, at) = P(st+1, rt|ht, at, ct). We shall also assume that ct is a deterministic function of ht, and therefore P(ct+1|ct, st+1, rt, at, ht) = P(ct+1|st+1, rt, at, ht), and from the above, we have P(ct+1|st+1, rt, at, ht) = P(ct+1|st+1, rt, at, ct).

We now prove a mutual information bound similar to that of Oord et al. (2018), we show that by optimizing the meta RL Info NCE loss defined in Equation (1) we maximize the mutual information between ct and st+1, rt given at.

We begin with a lemma similar to that of Section 2.3 in Oord et al. (2018):

Definition A.1 (Possible sufficient statistic transition). Let ct be a function of ht, i.e ct = σt(ht). st+1, rt, ct, at is a possible sufficient statistic transition if ht, at, rt, st+1 is a possible history as in Definition 4.1 and ct = σt(ht).

Lemma A.2. Let Assumption 4.2 and the loss in Equation (1) be jointly minimized by f, g, g AR, then for any possible sufficient statistic transition st+1, rt, ct, at as in Definition A.1, where ct = g AR(ht), we have that

f(st+1, rt, ct, at) P(st+1, rt|ct, at)

P(st+1, rt|at) .

Proof. The loss in Eq. 1 is the categorical cross-entropy of classifying the positive example correctly, with f P

B f being the prediction of the model. We denote the j-th example in the batch B as sj, rj, where the subscript does not refer to time here. As in (Oord et al., 2018), the optimal probability for this loss is P(d = i|B, ct, at) (with [d = i] indicating the i-th example in B is the positive example) and can be derived as follows:

P(d = i|B, ct, at) = P(si, ri|ct, at)Πl =i P(sl, rl|at) PM j=1 P(sj, rj|ct, at)Πl =j P(sl, rl|at)

P (si,ri|ct,at)

P (si,ri|at) PM j=1 P (sj,rj|ct,at)

P (sj,rj|at) .

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Eq. 6 means that for any st+1, rt, ct, at that are part of a batch B in the data, we have that f(st+1, rt, ct, at) P (st+1,rt|ct,at)

P (st+1,rt|at) . From Assumption 4.2, for any sufficient statistic transition tuple st+1, rt, ct, at there exists a batch it is a part of.

Lemma A.3. Let Assumption 4.2, and let the loss in Equation (1) be jointly minimized by f, g, g AR. Then

I(st+1, rt; ct|at) log(M 1) Lopt.

Proof. Given the optimal value shown in Lemma A.2 for f(st+1, rt, ct, at), by inserting back into the loss we get:

Lopt = E log

P (st+1,rt|ct,at)

P (st+1,rt|at) P (st+1,rt|ct,at)

P (st+1,rt|at) + P

(s ,r ) {o j }M 1 j=1 P (s ,r |ct,at)

P (s ,r |at)

1 + ( P(st+1, rt|at)

P(st+1, rt|ct, at)

(s ,r ) {o j }M 1 j=1

P(s , r |ct, at)

P(s , r |at)

E log 1 + ( P(st+1, rt|at)

P(st+1, rt|ct, at)(M 1) ED(s ,r |at) P(s , r |ct, at)

P(s , r |at)

= E log 1 + ( P(st+1, rt|at)

P(st+1, rt|ct, at)(M 1)

E log P(st+1, rt|at)

P(st+1, rt|ct, at)(M 1)

= I(st+1, rt; ct|at) + log(M 1).

We therefore get that I(st+1, rt; ct|at) log(M 1) Lopt.

We conclude that the objective maximizes the mutual information between ct and st+1, rt given at.

Corollary A.4. Let Assumption 4.2, and let the loss in Equation (1) be jointly minimized by f, g, g AR, then I(ct; st+1, rt|at) = I(ht; st+1, rt|at) where I( ; ) denotes mutual information.

Proof. Since st+1, rt depend only on ht (conditioned on at), and since ct is a deterministic function of ht, I(st+1, rt; ct|at) cannot be greater than I(st+1, rt; ht|at). From Lemma A.3 , we therefore have that

I(ct; st+1, rt|at) = I(ht; st+1, rt|at)

Note that Corollary A.4 states that given the causal model above, ct is maximally informative about st+1, rt (conditioned on at). We use this result to prove a short lemma that will help us prove that that ct is an information state.

Lemma A.5. Let the assumptions of Corollary A.4 hold, then for every a, P(st+1, rt|ht, at) = P(st+1, rt|ct, at).

Proof. We start with a result similar to the data processing inequality. Consider I(st+1, rt; ht, ct|at). We have that

I(st+1, rt; ht, ct|at) =

I(st+1, rt; ct|ht, at) + I(st+1, rt; ht|at), (7)

and on the other hand,

I(st+1, rt; ht, ct|at) =

I(st+1, rt; ht|ct, at) + I(st+1, rt; ct|at). (8)

From the causal graph above, we have that I(st+1, rt; ct|ht, at) = 0. Therefore, from Eq. (7) and (8) we have

Contra BAR: Contrastive Bayes-Adaptive Deep RL

I(st+1, rt; ht|at) = I(st+1, rt; ct|at) + I(st+1, rt; ht|ct, at)

I(st+1, rt; ct|at)

with equality only if I(st+1, rt; ht|ct, at) = 0, since the mutual information is positive. From Corollary A.4, we therefore must have I(st+1, rt; ht|ct, at) = 0. This implies that st+1, rt and ht are independent conditioned on ct, at (Cover & Thomas), and therefore P(st+1, rt|ht, at) = P(st+1, rt|ct, at)

Proposition A.6. Let Assumption 4.2, and let the loss in Equation (1) be jointly minimized by f, g, g AR, then ct satisfies P1, i.e., E[rt|ht, at] = E[rt|ct, at].

E[rt|ht, at] = Z rt

Z P(st+1, rt|ht, at)dst+1drt

Z P(st+1, rt|ct, at)dst+1drt

= Z rt P(rt|ct, at)drt

= E[rt|ct, at].

Proposition A.7. Let Assumption 4.2 and let the loss in Equation (1) be jointly minimized by f, g, g AR, then ct satisfies P2, i.e., P(ct+1|ht) = P(ct+1|ct).

P(ct+1|ht, at)

= Z Z P(st+1, rt|ht, at)P(ct+1|ht, st+1, rt, at)dst+1drt

= Z Z P(st+1, rt|ct, at)P(ct+1|ht, ct, st+1, rt, at)dst+1drt

= Z Z P(st+1, rt|ct, at)P(ct+1|ct, st+1, rt, at)dst+1drt

= P(ct+1|ct, at).

where the second equality is due to lemma A.5 and the penultimate equality is due to ct+1 being a deterministic function of ct, st+1, rt and at

We now provide the proofs for the setting described in Section 4.4, where there may be errors in the CPC learning, and the data does not necessarily satisfy Assumption 4.2.

We recapitulate that we consider an iterative policy improvement algorithm with a similarity constraint on consecutive policies, similar to the PPO algorithm we use in practice (Schulman et al., 2017). We shall bound the suboptimality of policy improvement, when data for training CPC is collected using the previous policy, denoted πk. We will show optimal policy bounds when the information state is approximate, similar in spirit to Subramanian et al. (2020), but with additional technicalities. Under the setting above, we will bound the suboptimality in policy improvement in terms of an error in CPC training, which we denote ϵ.

In light of the bound from A.3, we assume the following:

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Assumption 4.4. There exists an ϵ such that for every t T, I(st+1, rt; ct|at) I(st+1, rt; ht|at) ϵ, where the histories are distributed according to policy πk.

We now define Pπ(ht) as the probability of seeing a history under a policy π. For the sake of simplicity, for the subsequent section we will refer to Pπ(ht) as P(ht). Furthermore, when the information state is approximate, we denote the information state generator ˆσt.

We begin with the following bound.

Proposition A.8. Let Assumption 4.4 hold, then Eht P (ht) [DKL(P(st+1, rt|ht, at)||P(st+1, rt|ˆσt(ht), at)] ϵ

Proposition A.9. Let Assumption 4.4 hold, then I(st+1, rt; ht|ct, at) ϵ

Proof. We start with a result similar to the data processing inequality. We have that

I(st+1, rt; ht, ct|at) =

I(st+1, rt; ct|ht, at) + I(st+1, rt; ht|at)

I(st+1, rt; ht, ct|at) =

I(st+1, rt; ht|ct, at) + I(st+1, rt; ct|at)

From the causal graph we have that I(st+1, rt; ct|ht, at) = 0, yielding

I(st+1, rt; ht|at) = I(st+1, rt; ht|ct, at) + I(st+1, rt; ct|at)

I(st+1, rt; ht|at) I(st+1, rt; ct|at) = I(st+1, rt; ht|ct, at)

Combined with 4.4 we get that I(st+1, rt; ht|ct, at) ϵ

We note that from here on out everything is conditioned on at, and omit it to avoid overly cumbersome notation.

For ease of notation we define: z = st+1, rt. We note that given a specific ht, we have:

DKL Pz|ht||Pz|ˆσt(ht) = Z

z P(z|ht) log P(z|ht)

Proposition A.10. I(z; ht|ct) = Eht P (ht) DKL Pz|ht||Pz|ˆσt(ht)

Contra BAR: Contrastive Bayes-Adaptive Deep RL

I(z; ht|ct) = EPσt(ht)=ct DKL Pz,ht|ct||Pz|ct Pht|ct

= EPσt(ht)=ct

z P(z, ht|ct) log P(z, ht|ct) P(z|ct) P(ht|ct)

= EPσt(ht)=ct

z P(z, ht|ct) log P(z, ht, ct) P(ct)

P(z, ct) P(ht, ct)

= EPσt(ht)=ct

P(z, ht, ct)

P(ct) log P(z|ht)

z P(z, ht, ct) log P(z|ht)

z P(z|ht)P(ht, ct) log P(z|ht)

ht P(ht, ct) Z

z P(z|ht) log P(z|ht)

ct δct=σt(ht)

z P(z|ht) log P(z|ht)

= Eht P (ht) [DKL(P(st+1, rt|ht)||P(st+1, rt|ˆσt(ht))]

We now complete the proof. Combining Proposition A.9 with Proposition A.10 we get that

Eht P (ht) DKL Pst+1,rt|ht||Pst+1,rt|ˆσt(ht) ϵ

as required.

Let πk(ˆσt(ht)) denote the policy at iteration k, and note that it is defined on the information state. At iteration k + 1, we first collect data using πk. We denote Pπk(ht) the probability of observing a history in this data collection process. We then use CPC to learn an approximate information state. Let D(ht) = DKL Pst+1,rt|ht,at||Pst+1,rt|ˆσt(ht),at . Proposition A.11. Let Assumption 4.4 hold, then X

ht Pπk(ht)D(ht) ϵ. (9)

Proof. Assumption 4.4 holds, therefore the result is an immediate corollary from A.8 for every t 0, 1, . . . , T 1.

For some distance measure D, let Πβ = {π : D(π(ht), πk(ˆσt(ht))) β ht} denote the set of policies that are β-similar to πk.

We next define the optimal next policy π

π argmax π Πβ Eπ "T 1 X

t=0 r(st, at)

Note that the value of this policy satisfies the following Bellman optimality equations:

Qt(ht, at) = r(ht, at) + E [Vt+1(ht+1)]

Vt(ht) = max π:D(π(ht),πk(ht)) β

a π(a)Qt(ht, a), (11)

for t T, and VT (h T ) = 0.

We now present our main result, where we consider an iterative policy improvement scheme based on the approximate information state of Contra BAR and provide policy improvement bounds.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Theorem 4.5. Let Assumption 4.4 hold for some representation ct. Consider the distance function between two distributions D(P1(x), P2(x)) = maxx |P1(x)/P2(x)|. We let ˆr(ct, at) = E[rt|ct, at] and ˆP(c |ct, at) = E[1(ct+1 = c )|ct, at] denote an approximate reward and transition kernel, respectively. Define the value functions

ˆQt(ct, at) = ˆr(ct, at) + X

ˆP(ct+1|ct, at) ˆVt+1(ct+1)

ˆVt(ct) = max π:D(π(ct),πk(ct)) β

a π(a) ˆQt(ct, a), (4)

for t T, and ˆVT (c T ) = 0, and the approximate optimal policy

ˆπ(ct) argmax π:D(π,πk(ct)) β

a π(a) ˆQt(ct, a). (5)

Let the optimal policy π (ht) be defined similarly, but with ht replacing ct in (4) and (5). Then we have that

t=0 r(st, at)

t=0 r(st, at)

ϵ1/3Rmax T 2(

Proof. Since Assumption 4.4 holds, Proposition A.11 does as well.

From the Markov inequality, we have Pπk(D(ht) nϵ) ϵ nϵ = 1

We now the define the Good Set HG = {ht : D(ht) < nϵ} and the Bad Set HB = {ht : D(ht) nϵ}.

Next, we define an auxiliary policy π(ht) =

( π (ht), if ht HG worst behavior, if ht HB .

We will assume that after observing ht HB, the policy performs as bad as possible for the rest of the episode.

Next, we bound the performance of π.

Proposition A.12. We have that Eπ h PT 1 t=0 r(st, at) i E π h PT 1 t=0 r(st, at) i 2T 2RmaxβT /n.

Proof. We will denote by rt(ht) the reward at the last state-action pair. That is, for ht = s0, a0, r0, . . . , st 1, at 1, rt 1, st we set rt(ht) = rt 1. We will denote R(ht) the sum of rewards, that is, R(ht) = Pt 1 t =0 rt .

We also denote by Pπ(ht) the probability of observing history ht under policy π. Note that by definition PT 1 t=0 P

ht Pπ(ht) = 1. Also, note that by the definition of the set Πβ, for any two policies π1, π2 Πβ we have Pπ1(ht)/Pπ2(ht) βt.

We now claim that

2T 2RmaxβT /n.

We first estimate the probability that policy π encounters a history in HB. Consider some t 0, . . . , T 1. We have that under Pπk, with probability at most 1/n, ht HB. Under P π, with probability at most βt/n, ht HB. From the union bound, with probability at most TβT /n the policy visits at least one history in HB.

Let HB denote the set of T-length histories that visit a history in HB, and let HG be its complement set.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Now, note that

h T P π(h T )R(h T )

h T HG P π(h T )R(h T ) + X

h T HB P π(h T )R(h T )

h T HG Pπ (h T )R(h T ) + X

h T HB P π(h T )R(h T )

h T HG Pπ (h T )R(h T ) + (TβT /n)T( Rmax)

h T Pπ (h T )R(h T ) X

h T HB Pπ (h T )R(h T ) + (TβT /n)T( Rmax)

2T 2RmaxβT /n

The third equality is from the definition of π . The fourth inequality relies on the reward function being bounded, i.e R(h T ) T( Rmax). This alongside the fact that P

h T HB P π(h T ) (TBT /n) gives us the inequality. Note that the last inequality follows from the definition of π , wherein the probability of visiting at least one history in HB is the same for π and π.

Next, we note that using Pinsker s inequality, we have d T V (Pst+1,rt|ht,at, Pst+1,rt|ct,at) q

2d KL(Pst+1,rt|ht,at, Pst+1,rt|ct,at), and that

|E[rt|ht, at] E[rt|ct, at]| Rmaxd T V (Pst+1,rt|ht,at, Pst+1,rt|ct,at)

|E[Vt+1|ht, at] E[Vt+1|ct, at]| Rmax(T t)d T V (Pst+1,rt|ht,at, Pst+1,rt|ct,at)

We next prove the following result.

Proposition A.13. We have that

ˆQt(ˆσt(ht), a) Q π(ht, a) αt, ˆVt(ˆσt(ht)) V π(ht) αt, (13)

where αt satisfies the following recursion: αT = 0, and αt =

2nϵRmax(T t + 1) + αt+1.

Proof. We prove by backward induction. The argument holds for T by definition. Assume that Equation (13) holds at time t + 1, and consider time t. If ht HB, then by definition ˆQt(ˆσt(ht), a) Q π(ht, a), since π will take the worst possible actions after observing ht. Otherwise, ht HG and we have

Q π(ht, a) ˆQt(ˆσt(ht), a)

=E[rt|ht, a] + E V π t+1(ht+1)|ht, a ˆr(ˆσt(ht), a) X

ˆP(ct+1|ˆσt(ht), a) ˆVt+1(ct+1)

=E[rt|ht, a] E[rt|ct, at]

+ E V π t+1(ht+1)|ht, a E h ˆVt+1(ˆσt+1(ht+1))|ht, a i

+ E h ˆVt+1(ˆσt+1(ht+1))|ht, a i X

ˆP(ct+1|ˆσt(ht), a) ˆVt+1(ct+1)

2nϵRmax + αt+1 +

2nϵRmax(T t).

We note that for ht HG, D(ht) nϵ, yielding the d T V bounds.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

For the second part, If ht HB, then by definition ˆVt(ˆσt(ht)) V π(ht). Otherwise, ht HG and we have

V π t (ht) ˆVt(ˆσt(ht)) = max π:D(π(ht),πk(ht)) β

a π(a)Q π t (ht, a) max π:D(π(ht),πk(ht)) β

a π(a) ˆQt(ˆσt(ht), a)

max π:D(π(ht),πk(ht)) β

a π(a)( ˆQt(ˆσt(ht), a) + αt) max π:D(π(ht),πk(ht)) β

a π(a) ˆQt(ˆσt(ht), a)

We next define another auxiliary policy ˆπ(ht) =

( ˆπ(ht), if ht HG optimal behavior, if ht HB . We will assume that after observing

ht HB, the policy perform optimally for the rest of the episode. Therefore, V ˆπ t (ht HB) = Vt(ht). We have the following results, analogous to Propositions A.12 and A.13.

Proposition A.14. We have that E ˆπ h PT 1 t=0 r(st, at) i Eˆπ h PT 1 t=0 r(st, at) i 2T 2RmaxβT /n.

Proof. Analogous to the proof of Proposition A.12.

Proposition A.15. We have that

ˆQt(ˆσt(ht), a) Q ˆπ(ht, a) + αt,

ˆVt(ˆσt(ht)) V ˆπ(ht) + αt, (14)

where αt satisfies the following recursion: αT = 0, and αt =

2nϵRmax(T t + 1) + αt+1.

Proof. Similarly to the proof of Proposition A.13. The argument hold for T by definition. Assume that Equation (14) holds at time t + 1, and consider time t. If ht HB, then by definition ˆQt(ˆσt(ht), a) Q ˆπ(ht, a), since ˆπ will take the best possible actions after observing ht. Otherwise, ht HG and we have

ˆQt(ˆσt(ht), a) Q ˆπ(ht, a)

=ˆr(ˆσt(ht), a) + X

ˆP(ct+1|ˆσt(ht), a) ˆVt+1(ct+1) E[rt|ht, a] E h V ˆπ t+1(ht+1)|ht, a i

=E[rt|ˆσt(ht), at] E[rt|ht, a]

+ E h ˆVt+1(ˆσt+1(ht+1))|ht, a i E h V ˆπ t+1(ht+1)|ht, a i

ˆP(ct+1|ˆσt(ht), a) ˆVt+1(ct+1) E h ˆVt+1(ˆσt+1(ht+1))|ht, a i

2nϵRmax + αt+1 +

2nϵRmax(T t).

For the second part, If ht HB, then by definition ˆVt(ˆσt(ht)) V ˆπ(ht). Otherwise, ht HG and we have

ˆVt(ˆσt(ht)) V ˆπ t (ht) = max π:D(π(ht),πk(ht)) β

a π(a) ˆQt(ˆσt(ht), a) max π:D(π(ht),πk(ht)) β

a π(a)Q ˆπ t (ht, a)

max π:D(π(ht),πk(ht)) β

a π(a)(Q ˆπ t (ht, a) + αt) max π:D(π(ht),πk(ht)) β

a π(a)Q ˆπ t (ht, a)

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Corollary A.16. We have

t=0 r(st, at)

t=0 r(st, at)

t=0 r(st, at)

t=0 r(st, at)

+ 2T 2RmaxβT /n

h0 P(h0) V π 0 (h0) V ˆπ 0 (h0) + 2T 2RmaxβT /n

h0 P(h0) V π 0 (h0) ˆV0(ˆσ0(h0)) + ˆV0(ˆσ0(h0)) V ˆπ 0 (h0) + 2T 2RmaxβT /n

h0 P(h0) V π 0 (h0) ˆV0(ˆσ0(h0)) + ˆV0(ˆσ0(h0)) V ˆπ 0 (h0) + 4T 2RmaxβT /n

2α0 + 4T 2RmaxβT /n

We note that the first inequality follows from Proposition A.12. The second equality stems from the fact that P(h0) = P(s0), which is not affected by the choice of policy. The fourth transition follows from the addition and subtraction of V ˆπ 0 (h0) and the use of Proposition A.14. The final inequality follows from Propositions A.13 and A.15.

Let us bound α0. By the recursion αt =

2nϵRmax(T t + 1) + αt+1 we have that α0 = T 2

2nϵRmax. Setting n = ϵ 1/3 we obtain the desired result:

t=0 r(st, at)

t=0 r(st, at)

2ϵ1/3Rmax + 4T 2RmaxβT ϵ1/3

= ϵ1/3Rmax T 2(

B. Environments

Reacher Image: In this environment, a two-link planar robot needs to reach an unknown goal as in (Dorfman et al., 2021), except that the goal is randomly chosen along a horizontal section of 0.48. For each task, the agent receives a reward of +1 if it is within a small radius r = 0.05 of the goal, and 0 otherwise.

( 1, if xt xgoal 2 0.05 0, otherwise

where xt is the location of the robot s end effector. The agent observes single-channel images of size 64 64 of the environment]. The horizon is set to 50 and we aggregate k = 2 consecutive episodes to form a trajectory of length 100.

Panda Reacher: A Franka Panda robot tasked with placing the end effector at a goal on a 2d semi-circle of radius 0.15 with fixed z = 0.15/2 in 3d-space. The task is adapted from the Reacher task in Panda Gym (Gallou edec et al., 2021), with the goal occluded. For each task, the agent receives a reward of +1 if it is within a small radius r = 0.05 of the goal, and 0 otherwise.

( 1, if xt xgoal 2 0.05 0, otherwise

where xt is the current location of the end effector. The action space is 3-dimensional and bounded [ 1, 1]3. The agent observes a 3-channel image of size 84 84 of the environment. We set the horizon to 50 and aggregate k = 3 consecutive episodes to form a trajectory of length 150.

Panda Wind: This environment is identical to Panda Reacher, except that the goal is fixed and for each task the agent experiences different wind with shifts the transition function, such that for an MDP M the transition function becomes

st+1 = st + at + w M

where w M is task specific and drawn randomly from a circle of radius 0.1. To get to the goal and stay there, the agent must learn to quickly adapt in a way that cancels the effect of the wind.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

C. Implementation Details

In this section we outline our training process and implementation details, exact hyperparameters can be found in our code at https://github.com/ec2604/Contra BAR

The CPC component termed g AR consists of a recurrent encoder, which at time step t takes as input the tuple (at, rt+1, st+1). The state, reward and action are passed each through a different fc layer (or a cnn feature-extractor for the states in image-based inputs). Our CPC projection head takes in (ca t , zt+k) and passes it through one hidden layer of half the input size, with an ELU activation function.

D. Architecture Details

In this section we detail practical considerations regarding the CPC architecture.

In Section 4.5 we described situations where ct does not need to encode belief regarding the task in order to distinguish between positive and negative observations. This is detrimental to learning a sound sufficient statistic as we would like ct to encode information regarding the reward and transition functions, as they are what set apart each task. In order to prevent this shortcut from being used, we can perform hard negative mining. We do this by using negative observations that cannot be distinguished from the positive observation without belief regarding the transition and reward functions. In the case where only the reward functions vary, we can do this by taking the state and action of the positive observation and sampling a new reward function. We then calculate the respective reward and embed it as a negative observation alongside the original state and action. By having the positive and negative observations share the same state and action, we ensure that ct must be informative regarding the reward function in order to distinguish between positive and negative observations. We note that in this modified setup we use st as the initial hidden state for the action-gru and include the original ct as input to the CPC projection head. This ensures that the gradient of the loss with respect to the action-gru does not affect ct, which should encode information regarding the reward function. For the case where the environments only vary in reward functions, we propose a simpler solution which is to omit the action-gru, as the future actions except for at+k 1 do not affect rt+k. We can simply use (ct, zt+k) as input to the CPC projection head we note that in this case zt+k is an embedding of the reward, state and action. We found in practice that this simplification also worked well for the environments we used where the transitions varied.

The modified architecture where the action-gru is omitted can be seen in Figure 9. In Figure 10 we demonstrate on the Ant-Goal environment that omitting the action-gru and reward-relabeling with the action-gru yield similar results. Finally, we note that hard-negative mining can be done for varying transitions by sampling a random transition from the prior and simulating the transition to some st+k given st, at, . . . , at+k.

E. Image-based inputs are computationally restrictive for Vari BAD

To understand the computational restriction in Vari BAD (Zintgraf et al., 2020), we look to the formulation of the VAE objective. For every timestep t, the past trajectory τ:t is encoded to infer the posterior q(m|τ:t), and used by the decoder to reconstruct the entire trajectory including the future. In our analysis we restrict ourselves to the memory required for reconstruction of the reward trajectory, in an image-based domain, under the following assumptions: (1) images of dimension d d 3 are embedded to a representation of size 32 via 3 convolutions with 32 channels each and kernels of size, with strides 2, 2, 1 respectively. (2) actions are of size 2 and embedded with a linear layer of size 16 (3) The trajectory is of length 120, which is average for the domains in meta RL. We draw attention to the fact that reward decoder in Vari BAD receives st, st+1 as input, requiring us to take into consideration the memory required for embedding the image trajectory. On top of this, we also consider three times the size of the parameters of the image encoder (parameters, gradients and gradient moments). We present the memory consumption as a function of the image dimensions in Figure 11. We note that in practice we often wish to decode multiple trajectories at once, and we also need to take into account encoder portion of the model as well as its gradients.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

Figure 9: Contra BAR architecture where the action-gru is omitted. The history encoding is the same, future observations however, do not include the action.

1 2 3 4 5 600

relabel omit gru original action-gru

Figure 10: The relabel and omit-gru architectures are run on the Ant-goal environment, with 10 seeds (evaluated over 10 environment samples each). The meta-trained policies are rolled out for 5 episodes to show how they adapt to the task. Values shown are averages across tasks (95% confidence intervals shaded). The results for both methods are similar, with slightly better results when the action-gru is omitted. We additionally show a single seed of the original action-gru architecture, evaluated on 10 environment samples we see that the sample mean is out of the confidence interval for both the relabel and omit-gru architectures.

Contra BAR: Contrastive Bayes-Adaptive Deep RL

60 80 100 120 140 Image dimension

Memory consumption (Mi B)

Vari BAD Memory Consumption

Figure 11: Vari BAD memory consumption for decoding the rewards for image-based input, as a function of the image dimension given a trajectory length of 200