Published as a conference paper at ICLR 2023

MINIMUM DESCRIPTION LENGTH CONTROL

Ted Moskovitz1,*, Calvin Kao2,3, Maneesh Sahani1, Matthew M. Botvinick1,2
1. Gatsby Unit, University College London 2. DeepMind 3. Facebook Reality Labs
*Correspondence: ted@gatsby.ucl.ac.uk

ABSTRACT

We propose a novel framework for multitask reinforcement learning based on the minimum description length (MDL) principle. In this approach, which we term MDL-control (MDL-C), the agent learns the common structure among the tasks with which it is faced and then distills it into a simpler representation which facilitates faster convergence and generalization to new tasks. In doing so, MDL-C naturally balances adaptation to each task with epistemic uncertainty about the task distribution. We motivate MDL-C via formal connections between the MDL principle and Bayesian inference, derive theoretical performance guarantees, and demonstrate MDL-C's empirical effectiveness on both discrete and high-dimensional continuous control tasks.

1 INTRODUCTION

In order to learn efficiently in a complex world with multiple rapidly changing objectives, both animals and machines must leverage past experience. This is a challenging task, as processing and storing all relevant information is computationally infeasible. How can an intelligent agent address this problem? We hypothesize that one route may lie in the dual process theory of cognition, a longstanding framework in cognitive psychology introduced by William James (James, 1890) which lies at the heart of many dichotomies in both cognitive science and machine learning. Examples include goal-directed versus habitual behavior (Graybiel, 2008), model-based versus model-free reinforcement learning (Daw et al., 2011; Sutton and Barto, 2018), and System 1 versus System 2 thinking (Kahneman, 2011). In each of these paradigms, a complex control process trades off with a simple default process to guide actions. Why has this been such a successful and enduring conceptual motif? Our hypothesis is that default processes often serve to distill common structure from the tasks consistently faced by animals and agents, facilitating generalization and rapid learning on new objectives. For example, drivers can automatically traverse commonly traveled roads en route to new destinations, and chefs quickly learn new dishes on the back of well-honed fundamental techniques. Importantly, even intricate tasks can become automatic if repeated often enough (e.g., the combination of fine motor commands required to swing a tennis racket): the default process must be sufficiently expressive to learn common behaviors, regardless of their complexity. In reality, most processes likely lie on a continuum between simplicity and complexity.

In reinforcement learning (RL; Sutton and Barto, 2018), improving sample efficiency on new tasks is crucial to the development of general agents which can learn effectively in the real world (Botvinick et al., 2015; Kirk et al., 2021). Intriguingly, one family of approaches which has shown promise in this regard is regularized policy optimization (RPO) algorithms, in which a goal-specific control policy is paired with a simple yet general default policy to facilitate learning across multiple tasks (Teh et al., 2017; Galashov et al., 2019; Goyal et al., 2020; 2019; Moskovitz et al., 2022a). One difficulty in algorithm design, however, is how much or how little to constrain the default policy, and in what way.
An overly simple default policy will fail to identify and exploit commonalities among tasks, while a complex model may overfit to a single task and fail to generalize. Most approaches manually specify an asymmetry between the control and default policies, such as hiding input information (Galashov et al., 2019) or constraining the model class (Lai and Gershman, 2021). Ideally, we'd like an adaptive approach that learns the appropriate degree of complexity via experience.

The minimum description length principle (MDL; Rissanen, 1978), which in general holds that one should prefer the simplest model that accurately fits the data, offers a guiding framework for algorithm design that does just that, enabling the default policy to optimally trade off between adapting to information from new tasks and maintaining simplicity. Inspired by dual process theory and the MDL principle, we propose MDL-control (MDL-C, pronounced "middle-cee"), a principled RPO framework for multitask RL. In Section 2, we formally introduce multitask RL and describe RPO approaches within this setting. In Section 3, we describe MDL and the variational coding framework, from which we extract MDL-C and derive its formal performance characteristics (Section 4). In Section 5, we demonstrate its empirical effectiveness in both discrete and continuous control settings. Finally, we discuss related ideas from the literature (Section 6) and conclude (Section 7).

2 REINFORCEMENT LEARNING PRELIMINARIES

The single-task setting We model a task as a Markov decision process (MDP; Puterman, 2010) $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \rho)$, where $\mathcal{S}, \mathcal{A}$ are state and action spaces, respectively, $P : \mathcal{S} \times \mathcal{A} \to \mathscr{P}(\mathcal{S})$ is the state transition distribution, $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is a reward function, $\gamma \in [0, 1)$ is a discount factor, and $\rho \in \mathscr{P}(\mathcal{S})$ is the starting state distribution. $\mathscr{P}(\cdot)$ is the space of probability distributions defined over a given space. The agent takes actions using a policy $\pi : \mathcal{S} \to \mathscr{P}(\mathcal{A})$. In large or continuous domains, the policy is often parameterized: $\pi \approx \pi_\theta$, $\theta \in \Theta$, where $\Theta \subseteq \mathbb{R}^d$ represents a particular model class with $d$ parameters. In conjunction with the transition dynamics, the policy induces a distribution over trajectories $\tau = (s_h, a_h)_{h=0}^{\infty}$, denoted $P^{\pi_\theta}(\tau)$. In a single task, the agent seeks to maximize its value $V^{\pi_\theta} = \mathbb{E}_{\tau \sim P^{\pi_\theta}} R(\tau)$, where $R(\tau) := \sum_{h \ge 0} \gamma^h r(s_h, a_h)$ is called the return. We denote by $d^{\pi}_{\rho}$ the state-occupancy distribution induced by policy $\pi$ with starting state distribution $\rho$: $d^{\pi}_{\rho}(s) = \mathbb{E}_{s_0 \sim \rho}\,(1 - \gamma) \sum_{h \ge 0} \gamma^h \Pr(s_h = s \mid s_0)$.

Multiple tasks There are a number of frameworks for multitask RL in the literature (Yu et al., 2019; Zahavy et al., 2021; Finn et al., 2017; Brunskill and Li, 2013). For a more detailed discussion, see Appendix Section B. In this paper, we focus primarily on what we term the sequential and parallel task settings. The objective in each case is to maximize average reward across tasks, equivalent to minimizing cumulative regret over the agent's lifetime. More specifically, we assume a (possibly infinite) set of tasks (MDPs) $\mathcal{M} = \{M\}$ presented to the agent by sampling from some task distribution $P_{\mathcal{M}} \in \mathscr{P}(\mathcal{M})$. In the sequential task setting (Moskovitz et al., 2022a; Pacchiano et al., 2022), tasks (MDPs) are sampled one at a time from $P_{\mathcal{M}}$, with the agent training on each until convergence. In the parallel task setting (Yu et al., 2019), a new MDP is sampled from $P_{\mathcal{M}}$ at the start of every episode and is associated with a particular input feature $g \in \mathcal{G}$ that indicates to the agent which task has been sampled.
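To make the two settings concrete, the sketch below contrasts the sequential and parallel sampling protocols and the cumulative-regret bookkeeping described above. It is a schematic only: `train_to_convergence`, `run_episode`, `optimal_return`, and `task_feature` are hypothetical placeholders, not part of the paper's implementation.

```python
def sequential_setting(task_dist, agent, num_tasks):
    """Sequential multitask RL: one MDP at a time, trained to convergence."""
    cumulative_regret = 0.0
    for k in range(num_tasks):
        mdp = task_dist.sample()                           # M_k ~ P_M
        episode_returns = agent.train_to_convergence(mdp)  # hypothetical per-task training loop
        # Regret on task k: total shortfall relative to the optimal return for M_k.
        cumulative_regret += sum(mdp.optimal_return - r for r in episode_returns)
    return cumulative_regret

def parallel_setting(task_dist, agent, num_episodes):
    """Parallel multitask RL: a new MDP every episode, signalled by a task feature g."""
    for _ in range(num_episodes):
        mdp = task_dist.sample()                           # M ~ P_M
        agent.run_episode(mdp, goal_feature=mdp.task_feature)
```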
Regularized Policy Optimization One common approach to improving learning performance is regularized policy optimization (RPO; Schulman et al., 2017; 2018; Levine, 2018; Agarwal et al., 2020; Pacchiano et al., 2020; Tirumala et al., 2020; Abdolmaleki et al., 2018). In RPO, a convex regularization term $\Omega(\theta)$ is added to the objective:
$$J^{\mathrm{RPO}}_{\lambda}(\theta) = V^{\pi_\theta} - \lambda\,\Omega(\theta).$$
In the single-task setting, the regularization term is often used to approximate trust region (Schulman et al., 2015), proximal point (Schulman et al., 2017), or natural gradient (Kakade, 2002; Pacchiano et al., 2020; Moskovitz et al., 2020) optimization, or to prevent premature convergence to local maxima (Haarnoja et al., 2018; Lee et al., 2018). In multitask settings, the regularization term for RPO typically takes the form of a divergence measure penalizing the policy responsible for taking actions, $\pi_\theta$, which we'll refer to as the control policy, for deviating from some default policy $\pi_w$, which is intended to encode generally useful behavior for some family of tasks (Teh et al., 2017; Galashov et al., 2019; Goyal et al., 2019; 2020; Moskovitz et al., 2022a). By capturing behavior which is on average useful across tasks, $\pi_w$ can provide a form of beneficial supervision to $\pi_\theta$ when obtaining reward is challenging, either because $\pi_\theta$ has been insufficiently trained or because rewards are sparse.

3 THE MINIMUM DESCRIPTION LENGTH PRINCIPLE

General principle Storing all environment interactions across multiple tasks is computationally infeasible, so multitask RPO algorithms offer a compressed representation in the form of a default policy. However, the type of information which is compressed (and that which is lost) is often hard-coded a priori. Preferably, we'd like an approach which can distill structural regularities among tasks without needing to know what they are beforehand. The minimum description length (MDL) framework offers a principled approach to this problem. So-called ideal MDL seeks to find the shortest solution written in a general-purpose programming language[1] which accurately reproduces the data, an idea rooted in the concept of Kolmogorov complexity (Li and Vitányi, 2008). Given the known impossibility of computing Kolmogorov complexity for all but the simplest cases, a more practical MDL approach instead prescribes selecting the hypothesis $H$ from some hypothesis class $\mathcal{H}$ which minimizes the two-part code
$$H^{\star} = \operatorname*{argmin}_{H \in \mathcal{H}} \; L(D \mid H) + L(H),$$
where $L(D \mid H)$ is the number of bits required to encode the data given the hypothesis and $L(H)$ is the number of bits needed to encode the hypothesis itself. There are a variety of so-called universal coding schemes which can be used to model these quantities.

Variational code One popular encoding scheme is the variational code (Blier and Ollivier, 2018; Hinton and Van Camp, 1993; Honkela and Valpola, 2004):
$$L^{\mathrm{var}}_{\nu}(D) = \underbrace{\mathbb{E}_{\theta \sim \nu}\left[-\log p_\theta(D)\right]}_{L^{\mathrm{var}}(D \mid H)} + \underbrace{\mathrm{KL}[\nu(\cdot), p(\cdot)]}_{L^{\mathrm{var}}(H)}, \qquad (3.1)$$
where the hypothesis class is a set of parametric models $\mathcal{H} = \{p_\theta(D) : \theta \in \Theta\}$, the model parameters are random variables with prior distribution $p(\theta)$, and $\nu(\theta)$ is any distribution over $\Theta$. Minimizing $L^{\mathrm{var}}_{\nu}(D)$ with respect to $\nu$ is equivalent to performing variational inference, maximizing a lower bound on the data log-likelihood: $\log p(D) = \log \int p(\theta) p_\theta(D)\, d\theta \ge -L^{\mathrm{var}}_{\nu}(D)$. Roughly speaking, MDL encourages the choice of simple models when limited data are available (Grunwald, 2004). In the variational coding scheme, simplicity is enforced via the choice of prior.
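As a concrete toy instance of this code, the sketch below scores a one-parameter Gaussian model of scalar data under a mean-field Gaussian posterior $\nu$ and a standard-normal prior $p$, and minimizes it by gradient descent. The dataset, model, and hyperparameters are illustrative choices, not anything from the paper.

```python
import torch
from torch.distributions import Normal, kl_divergence

def variational_code(data, q_mu, q_logstd, num_samples=16):
    """L_var(D) = E_{theta ~ nu}[-log p_theta(D)] + KL[nu, p], estimated by Monte Carlo."""
    q = Normal(q_mu, q_logstd.exp())                  # variational posterior nu(theta)
    prior = Normal(torch.zeros(()), torch.ones(()))   # prior p(theta) = N(0, 1)
    theta = q.rsample((num_samples,))                 # reparameterized parameter samples
    # Model: each datum ~ N(theta, 1); average negative log-likelihood over posterior samples.
    nll = -Normal(theta.unsqueeze(-1), 1.0).log_prob(data).sum(dim=-1).mean()
    return nll + kl_divergence(q, prior)

data = torch.randn(100) + 2.0                         # toy dataset with true mean 2
q_mu = torch.tensor(0.0, requires_grad=True)
q_logstd = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.Adam([q_mu, q_logstd], lr=0.05)
for _ in range(200):                                  # minimizing the code = variational inference
    opt.zero_grad()
    loss = variational_code(data, q_mu, q_logstd)
    loss.backward()
    opt.step()
```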
Sparsity-inducing priors and variational dropout Sparsity-inducing priors can be used to improve the compression rate within the variational coding scheme, as they encourage the model to prune out parameters that do not contribute to reducing $L^{\mathrm{var}}(D \mid H)$. Many sparsity-inducing priors belong to the family of scale mixtures of normal distributions (Andrews and Mallows, 1974):
$$z \sim p(z), \qquad w \sim p(w \mid z) = \mathcal{N}(w; 0, z^2),$$
where $p(z)$ defines a distribution over the variance $z^2$. Common choices of $p(z)$ include the Jeffreys prior $p(z) \propto |z|^{-1}$ (Jeffreys, 1946), the inverse-Gamma distribution, and the half-Cauchy distribution (Polson and Scott, 2012; Gelman, 2006). Such priors have deep connections to MDL theory. For example, the Jeffreys prior in conjunction with an exponential family likelihood is asymptotically identical to the normalized maximum likelihood estimator, perhaps the most fundamental MDL estimator (Grünwald and Roos, 2019). Variational dropout (VDO) is an effective algorithm for minimizing Equation (3.1) for these sparsity-inducing priors (Louizos et al., 2017; Kingma et al., 2015; Molchanov et al., 2017). Briefly, this involves choosing an approximate posterior distribution of the form
$$p(w, z \mid D) \approx \nu(w, z) = \mathcal{N}(z; \mu_z, \alpha\sigma_z^2)\,\mathcal{N}(w; z\mu, z^2\sigma^2 I_d) \qquad (3.2)$$
and optimizing Equation (3.1) via stochastic gradient descent on the variational parameters $\{\alpha, \mu_z, \sigma_z^2, \mu, \sigma^2\}$. As its name suggests, and importantly for its ease of application to large models, VDO can be implemented as a form of dropout (Srivastava et al., 2014) by reparameterizing the noise on the weights as activation noise (Kingma et al., 2015). Application of VDO to Bayesian neural networks has achieved impressive compression rates, sparsifying deep neural networks while maintaining prediction performance on supervised learning problems (Molchanov et al., 2017; Louizos et al., 2017). Equipped with a powerful approach for MDL-grounded posterior inference, we can now integrate these ideas with multitask RPO.

4 MINIMUM DESCRIPTION LENGTH CONTROL

As part of its underlying philosophy, the MDL principle holds that 1) learning is the process of discovering regularity in data, and 2) any regularity in the data can be used to compress it (Grunwald, 2004). Applying this perspective to RL is non-obvious: from the agent's perspective, what data is it trying to compress? Our hypothesis, which forms the basis for the framework we propose in this paper (Algorithm 1), is that an agent faced with a set of tasks in the world should seek to elucidate structural regularity from the environment interactions generated by the optimal policies for those tasks.

[1] The invariance theorem (Kolmogorov, 1965) ensures that, given a sufficiently long sequence, Kolmogorov complexity is invariant to the choice of general-purpose language.

Algorithm 1: MDL-C for Sequential Multitask Learning with Persistent Replay
1: Require: task distribution $P_{\mathcal{M}}$, policy class $\Theta$, non-increasing coefficients $\{\eta_k\}_{k=1}^{K}$
2: Initialize: default policy distribution $\nu_1 \in \mathcal{N} \subseteq \mathscr{P}(\Theta)$, default policy dataset $D_0 \leftarrow \emptyset$
3: for tasks $k = 1, 2, \dots, K$ do
4:   Sample a task $M_k = (\mathcal{S}, \mathcal{A}, P_k, r_k, \gamma_k, \rho_k) \sim P_{\mathcal{M}}(\cdot)$
5:   Optimize control policy:
$$\theta_k \leftarrow \operatorname*{argmax}_{\theta \in \Theta} \; V^{\pi_\theta}_{M_k} - \alpha\, \mathbb{E}_{s \sim d^{\pi}_{\rho_k}} \mathbb{E}_{w \sim \nu_k}\, \mathrm{KL}[\pi_w(\cdot \mid s), \pi_\theta(\cdot \mid s)] \qquad (4.2)$$
6:   Add data to default policy replay ($M = |\mathcal{S}|$ for finite/small state spaces):
$$D_k \leftarrow D_{k-1} \cup \{(s_m, \hat{\pi}_{\theta_k}(s_m))\}_{m=1}^{M} \qquad (4.3)$$
7:   Update default policy distribution:
$$\nu_{k+1} \leftarrow \operatorname*{argmin}_{\nu \in \mathcal{N}} \; \frac{1}{\eta_{k-1}}\, \mathrm{KL}[\nu(\cdot), p(\cdot)] + \sum_{i=1}^{k} \sum_{m=1}^{M} \mathbb{E}_{w \sim \nu}\, \mathrm{KL}[\hat{\pi}_{\theta_i}(\cdot \mid s_m), \pi_w(\cdot \mid s_m)] \qquad (4.4)$$
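A schematic rendering of Algorithm 1 in code may help fix ideas. This is a sketch only: the helper methods (`train`, `representative_states`, `action_distribution`, `fit`) are hypothetical stand-ins for a full RPO learner and a variational-dropout posterior, not the paper's released implementation.

```python
def mdl_c_sequential(task_dist, prior, posterior, control_policy, num_tasks, alpha, etas):
    """Schematic of Algorithm 1: alternate RPO on the current task with variational
    distillation of the converged control policy into the default-policy distribution."""
    replay = []                                      # D_k: states paired with converged control policies
    for k in range(num_tasks):
        mdp = task_dist.sample()                     # line 4: M_k ~ P_M
        # Line 5 / Eq. (4.2): maximize value minus alpha * E_{w~nu_k} KL[pi_w, pi_theta].
        control_policy.train(mdp, default_dist=posterior, kl_weight=alpha)
        # Line 6 / Eq. (4.3): persistent replay of (s, pi_theta_k(.|s)) pairs.
        replay += [(s, control_policy.action_distribution(s))
                   for s in mdp.representative_states()]
        # Line 7 / Eq. (4.4): distill the replay into nu_{k+1}, trading the distillation
        # KL against KL[nu, p] weighted by 1/eta_{k-1} (an FTRL-style update).
        posterior.fit(replay, prior=prior, prior_weight=1.0 / etas[k])
    return posterior
```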
This hypothesis makes intuitive sense: the agent ought to compress information which indicates how to correctly perform the tasks with which it is faced. That is, we propose that the data in multitask RL are the state-action interactions generated by the optimal policies for a set of tasks:
$$\mathcal{D} = \{\mathcal{D}_M\}_{M \in \mathcal{M}} = \{(s, a) : s \in \mathcal{S},\ a \sim \pi^{\star}_{M}(\cdot \mid s)\}_{M \in \mathcal{M}}.$$
This interpretation is in line with work suggesting that a useful operational definition of a task can be derived directly from the set of optimal (or near-optimal) policies it induces (Abel et al., 2021). It also suggests a natural mapping to the multitask RPO framework. In this view, the control policy is responsible for learning and the default policy for compression: by converging to the optimal policy for a given task, the control policy discovers regularity which is then distilled into a low-complexity representation by the default policy. In our approach, the default policy is encouraged to learn a compressed representation not by artificially constraining the network architecture or via hand-designed information asymmetry, but rather through a prior distribution $p(w)$ over its parameters which biases a variational posterior $\nu(w)$ towards simplicity. The default policy is therefore trained to minimize the variational code:
$$\operatorname*{argmin}_{\nu \in \mathcal{N}} \; \mathbb{E}_{\substack{(s, a) \sim \mathcal{D} \\ w \sim \nu}}\left[-\log \pi_w(a \mid s)\right] + \mathrm{KL}[\nu(\cdot), p(\cdot)] \;=\; \operatorname*{argmin}_{\nu \in \mathcal{N}} \; \mathbb{E}_{M \sim P_{\mathcal{M}}}\, \mathbb{E}_{\substack{s \sim d^{\pi^{\star}_M} \\ w \sim \nu}}\, \mathrm{KL}[\pi^{\star}_{M}(\cdot \mid s), \pi_w(\cdot \mid s)] + \mathrm{KL}[\nu(\cdot), p(\cdot)], \qquad (4.1)$$
where $\mathcal{N}$ is the distribution family for the posterior. This suggests the approach presented in Algorithm 1, in which for each task $M_k$, the control policy $\pi_\theta$ is trained to approximate the optimal policy $\pi^{\star}_{k}$ via RPO, and the result is compressed into a new default policy distribution $\nu_{k+1}$. We now further motivate sparsity-inducing priors for the default policy in multitask settings, derive formal performance guarantees for MDL-C, and demonstrate its empirical effectiveness.

4.1 MOTIVATING THE CHOICE OF SPARSITY-INDUCING PRIORS

In Section 3, compression (via pruning extraneous parameters) is the primary motivation for using sparsity-inducing priors that belong to the family of scale mixtures of normal distributions. Intuitively, placing a distribution over the default policy's parameters reflects the agent's epistemic uncertainty about the task distribution: when few tasks have been sampled, a sparse prior prevents the default policy from overfitting to spurious correlations in the limited data that the agent has collected. Here, we make this motivation more precise, describing an example generative model of optimal policy parameters which provides a principled interpretation for the choice of prior $p(z)$ in multitask RL.

Figure 4.1: (A) Illustration of a generative model of optimal policy parameters. $\hat{w}_1 = (1 - \beta)w_{11}$ shrinks towards the origin, growing closer to $w_1$ than $w_{11}$. (B) Sparsity-inducing priors over $\beta$ (Jeffreys, inverse-Gamma, half-Cauchy).

Generative model of optimal policy parameters Consider a set of tasks $\mathcal{M} = \{M_{ik}\}_{i=1,k=1}^{I, K_i}$ that are clustered into $I$ groups, such that the MDPs in each group are more similar to one another than to members of other groups. As an example, the overall family $\mathcal{M}$ could be all sports, while clusters $\mathcal{M}_i \subset \mathcal{M}$ could consist of, say, ball sports or endurance competitions. To make this precise, we assume that the optimal policies of every MDP belong to a parametric family $\Pi = \{\pi_w(\cdot \mid s) : w \in \mathbb{R}^d, s \in \mathcal{S}\}$ (e.g., softmax policies with parameters $w$), and that the optimal policies for each group are randomly distributed within parameter space.
In particular, we assume that the parameters of the optimal policies of $\mathcal{M}$ have the following generative model:
$$w_i \mid \beta, \sigma^2 \sim \mathcal{N}\left(w_i;\, 0,\ (1 - \beta)\beta^{-1}\sigma^2 I_d\right), \qquad w_{ik} \mid w_i, \sigma^2 \sim \mathcal{N}\left(w_{ik};\, w_i,\ \sigma^2 I_d\right),$$
where $I_d$ is the $d$-dimensional identity matrix. If we marginalize out $w_i$, we get the marginal distribution $p(w_{ik} \mid \beta, \sigma^2) = \mathcal{N}(w_{ik}; 0, \sigma^2\beta^{-1} I_d)$. We can then visualize the parameter distribution of the optimal policies for $\mathcal{M}$ as a $d$-dimensional Gaussian within which lie clusters of optimal policies for related tasks which are themselves normally distributed (see Fig. 4.1A for $d = 2$).

Interpretation of β The parameter $\beta \in (0, 1]$ can be interpreted as encoding the ratio of the squared distance between optimal policy parameters within a group to the squared distance between optimal policies in $\mathcal{M}$. Intuitively, $\beta$ determines how much information one gains about the optimal parameters of a task in a group, given knowledge about the optimal parameters of another task in the same group. To see this, we compute our posterior belief about the value $w_i$ given observation of $w_{ik}$:
$$p(w_i \mid w_{ik}, \beta, \sigma^2) = \mathcal{N}\left(w_i;\ (1 - \beta)w_{ik},\ (1 - \beta)\sigma^2 I_d\right).$$
When $\beta = 1$ (the inner circle in Figure 4.1A has the same radius as the outer circle), our posterior mean estimate of $w_i$ is simply $0$, suggesting we have learned nothing new about the mean of the optimal parameters in group $i$ by observing $w_{ik}$. In the other extreme, when $\beta \to 0$, the posterior mean approaches the maximum-likelihood estimator $w_{ik}$, suggesting that observation of $w_{ik}$ provides maximal information about the optimal parameters in group $i$. Any $\beta$ in between the two extremes results in an estimator that shrinks $w_{ik}$ towards $0$. The value of $\beta$ thus has important implications for multitask learning. Suppose an RL agent learns the optimal parameters $w_{11}$ (group 1, task 1) and proceeds to learn task 2 in group 1. The value of $\beta$ determines whether $w_{11}$ can be used to inform the agent's learning of $w_{12}$. In this way, $\beta$ determines the effective degree of epistemic uncertainty the agent has about the task distribution.

Choice of p(β) and connection to p(z) Given its importance, it's natural to ask what value $\beta$ should take. Instead of treating $\beta$ as a parameter, we can choose a prior $p(\beta)$ and perform Bayesian inference. Ideally, $p(\beta)$ should (i) encode our prior belief about the extent to which the optimal parameters cluster into groups and (ii) result in a posterior mean estimator $\hat{w}^{(p(\beta))}(x) = (1 - \mathbb{E}[\beta \mid x])\,x$ that is close to $w$ for $x \mid w \sim \mathcal{N}(x; w, \sigma^2 I_d)$. This condition encourages the expected default policy (under the posterior $\nu$; Equation (4.1)) to be close to the optimal policies in the same MDP group (centered at $w$). One prior choice that satisfies both conditions is $p(\beta) \propto \beta^{-1}$. It places high probability on small $\beta$ and low probability on large $\beta$, thus encoding the prior belief that the optimal task parameters are clustered (see Figure 4.1B; blue). It is instructive to compare $p(\beta) \propto \beta^{-1}$ with two extreme choices of $p(\beta)$. When $p(\beta) = \delta(\beta - 1)$, $p(z) = \delta(z - \sigma)$ and the marginal $p(w)$ is the often-used Gaussian prior over the parameters $w$ with fixed variance $\sigma^2$. This corresponds to the prior belief that knowing $w_{i1}$ provides no information about $w_{i2}$. On the other hand, $p(\beta) = \delta(\beta)$ recovers a uniform prior over the parameters $w$ and reflects the prior belief that the MDP groups are infinitely far apart. In relation to (ii), one can show that $\hat{w}^{(p(\beta))}$ strictly dominates the maximum-likelihood estimator $\hat{w}^{(\mathrm{ML})}(x) = x$ for $p(\beta) \propto \beta^{-1}$ (Efron and Morris, 1973; Section D).
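A small numerical illustration of the shrinkage at work here is given below. All values are arbitrary choices for the example, and $\beta$ is held fixed and known rather than integrated over its prior, so this shows the mechanism rather than the full Bayes estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, beta, num_groups = 10, 1.0, 0.3, 500

# Group-level optimal-policy parameters: w_i ~ N(0, (1 - beta)/beta * sigma^2 * I_d)
w_group = rng.normal(0.0, np.sqrt((1 - beta) / beta) * sigma, size=(num_groups, d))
# One observed task per group: w_ik ~ N(w_i, sigma^2 * I_d)
w_task = w_group + rng.normal(0.0, sigma, size=(num_groups, d))

ml_estimate = w_task                    # maximum-likelihood estimate of the group mean w_i
shrunk_estimate = (1 - beta) * w_task   # posterior mean E[w_i | w_ik] = (1 - beta) * w_ik

mse = lambda est: np.mean(np.sum((est - w_group) ** 2, axis=1))
print(f"MSE(ML) = {mse(ml_estimate):.2f},  MSE(shrinkage) = {mse(shrunk_estimate):.2f}")
# Expected: roughly d*sigma^2 = 10 for ML vs. d*(1 - beta)*sigma^2 = 7 for the shrinkage estimator.
```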
Domination here means that $\mathrm{MSE}(w, \hat{w}^{(p(\beta))}) \le \mathrm{MSE}(w, \hat{w}^{(\mathrm{ML})})$ for all $w$, where $\mathrm{MSE}(w, \hat{w}) = \mathbb{E}_{x \sim \mathcal{N}(x; w, \sigma^2 I_d)}\,\|w - \hat{w}(x)\|^2$.

Connection to p(z) and application of VDO Defining $z^2 = \sigma^2\beta^{-1}$ and applying the change-of-variable formula to $p(\beta) \propto \beta^{-1}$ gives $p(z) \propto |z|^{-1}$, and thus the normal-Jeffreys prior of Section 3. VDO (see Section 3) can then be applied to obtain an approximate posterior $\nu(w, z)$ which minimizes the variational code in Equation (4.1). Similar correspondences may also be derived for the inverse-Gamma distribution and the half-Cauchy distribution (Figure 4.1B; Section D).

4.2 PERFORMANCE ANALYSIS

At a fundamental level, we'd like assurance (i) that MDL-C's default policy will be able to effectively distill the optimal policies for previously observed tasks, and (ii) that regularization using this default policy gives strong performance guarantees for the control policy on future tasks.

Default policy performance One way we can verify (i) is to obtain an upper bound on the average KL between default policies sampled from the default policy distribution and an optimal policy for a task sampled from the task distribution. This enables us to perform analysis using online convex optimization (OCO). In OCO, the learner observes a series of convex loss functions $\ell_k : \mathcal{N} \to \mathbb{R}$, $k = 1, \dots, K$, where $\mathcal{N} \subseteq \mathbb{R}^d$ is a convex set. At each round, the learner produces an output $x_k \in \mathcal{N}$, for which it then incurs a loss $\ell_k(x_k)$ (Orabona, 2019). At round $k$, the learner is usually assumed to have knowledge of $\ell_1, \dots, \ell_{k-1}$, but no other assumptions are made about the sequence of loss functions. The learner's goal is to minimize its average regret. For further background, see Section F. Crucially, the MDL-C learning procedure for the default policy distribution is equivalent to follow-the-regularized-leader (FTRL), an OCO algorithm which enjoys sublinear regret. In each round of FTRL, the learner selects the solution $x \in \mathcal{N}$ according to the following general objective: $x_{k+1} = \operatorname*{argmin}_{x \in \mathcal{N}} \psi_k(x) + \sum_{i=1}^{k} \ell_i(x)$, where $\psi_k : \mathcal{N} \to \mathbb{R}$ is a convex regularization function. Using standard results, this connection allows us to bound MDL-C's regret in learning the default policy distribution. All proofs are provided in Section G.

Proposition 4.1 (Persistent Replay FTRL Regret). Let tasks $M_k$ be independently drawn from $P_{\mathcal{M}}$ at every round, and let them each be associated with a deterministic optimal policy $\pi^{\star}_k : \mathcal{S} \to \mathcal{A}$. We make the following mild assumptions: i) $\pi_w(a^{\star} \mid s) \ge \epsilon > 0$ for all $s \in \mathcal{S}$, where $a^{\star} = \pi^{\star}_k(s)$ and $\epsilon$ is a constant; ii) $\min_{\nu} \mathrm{KL}[\nu(\cdot), p(\cdot)] = 0$ asymptotically as $\mathrm{Var}[\nu] \to \infty$. Then with $\eta_{k-1} = \sqrt{\log(1/\epsilon)/k}$, Algorithm 1 guarantees
$$\frac{1}{K}\sum_{k=1}^{K} \ell_k(\nu_k) - \frac{1}{K}\sum_{k=1}^{K} \ell_k(\bar{\nu}_K) \;\le\; \left(\mathrm{KL}[\bar{\nu}_K, p] + 1\right)\sqrt{\frac{\log(1/\epsilon)}{K}},$$
where $\bar{\nu}_K = \operatorname*{argmin}_{\nu \in \mathcal{N}} \sum_{k=1}^{K} \ell_k(\nu)$.

Control policy performance Intuitively, this result shows that the average regret is upper-bounded by factors which depend on the divergence of the barycenter distribution from the prior and the worst-case prediction of the default policy. Importantly, the KL between the default policy distribution and the barycenter distribution goes to zero as $K \to \infty$. We can also now be assured of point (ii) above, in that this result can be used to obtain a sample-complexity bound for the control policy. Specifically, we can use Proposition G.1 to place an upper bound on the total variation distance between default policies sampled from $\nu$ and the optimal policies, via the KL between the maximum-likelihood solution and a sparsity-inducing prior $p$. This is useful, as it allows us to translate low regret for the default policy into a sample complexity result for the control policy using Lemma 5.2 of Moskovitz et al. (2022a).
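Proposition 4.1 rests on the FTRL equivalence just described. Before turning to the control-policy guarantee, the toy sketch below runs FTRL on simple quadratic losses with a quadratic regularizer, an illustrative setting chosen for its closed-form update rather than MDL-C's actual KL losses, and exhibits the characteristic decay of the average regret.

```python
import numpy as np

rng = np.random.default_rng(1)
K, dim = 1000, 5
targets = rng.normal(size=(K, dim))        # round-k loss: l_k(x) = 0.5 * ||x - t_k||^2

x, cum_t, losses = np.zeros(dim), np.zeros(dim), []
for k in range(K):
    losses.append(0.5 * np.sum((x - targets[k]) ** 2))   # incur l_k(x_k)
    cum_t += targets[k]
    # FTRL: x_{k+1} = argmin_x (sqrt(k+1)/2) * ||x||^2 + sum_{i<=k} l_i(x)
    #               = (sum_{i<=k} t_i) / (k + 1 + sqrt(k + 1))
    x = cum_t / (k + 1 + np.sqrt(k + 1))

x_star = targets.mean(axis=0)              # best fixed comparator in hindsight
best = 0.5 * np.sum((targets - x_star) ** 2, axis=1)
print(f"average regret: {(np.sum(losses) - np.sum(best)) / K:.4f}")   # decays roughly as O(1/sqrt(K))
```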
Proposition 4.2 (Control Policy Sample Complexity). Under the setting described in Proposition G.1, denote by $T_k$ the number of iterations needed to reach $\epsilon$-error for $M_k$, in the sense that $V^{\pi^{\star}_k} - V^{(t)} \le \epsilon$ whenever $t > T_k$. Further, denote the upper bound in Eq. (G.1) by $G(K)$. In a finite MDP, from any initial $\theta^{(0)}$, and following gradient ascent, $\mathbb{E}_{M_k \sim P_{\mathcal{M}}}[T_k]$ satisfies
$$\mathbb{E}_{M_k \sim P_{\mathcal{M}_i}}[T_k] \;\le\; \frac{80|\mathcal{A}|^2|\mathcal{S}|^2}{\epsilon^2(1 - \gamma)^6}\, \mathbb{E}_{M_k \sim P_{\mathcal{M}_i}}\!\left[\mathbb{E}_{s \sim \mathrm{Unif}_{\mathcal{S}}}\!\left[\kappa^{\alpha_k}_{\mathcal{A}}(s)\right]^2 \left\|\frac{d^{\pi^{\star}_k}_{\rho}}{\mu}\right\|_{\infty}^{2}\right],$$
where $\alpha_k(s) := d_{\mathrm{TV}}(\pi^{\star}_k(\cdot \mid s), \hat{\pi}_0(\cdot \mid s)) \le \sqrt{G(K)}$, $\kappa^{\alpha_k}_{\mathcal{A}}(s) = \frac{2|\mathcal{A}|(1 - \alpha_k(s))}{2|\mathcal{A}|(1 - \alpha_k(s)) - 1}$, and $\mu$ is a measure over $\mathcal{S}$ such that $\mu(s) > 0$ for all $s \in \mathcal{S}$. Intuitively, this means that when the average number of samples is sufficiently large, the control policy is guaranteed to have reached $\epsilon$ error. Therefore, as the agent is trained on more tasks, the default policy distribution regret, upper-bounded by $G(K)$, decreases asymptotically to zero, and as the default policy regret decreases, the control policy learns more rapidly, with the required number of samples shrinking as $\mathrm{poly}(G(K))$.

5 EXPERIMENTS

We tested MDL-C applied to discrete and continuous control in both the sequential and parallel task settings. To quantify performance, in addition to measuring per-task reward, we also report the cumulative regret for each method in each experimental setting in Section I.1.

5.1 2D NAVIGATION

We first test MDL-C in the classic FOURROOMS environment (Fig. 5.1a; Sutton et al., 1999). The baselines in this case are entropy-regularized policy optimization (PO), regularized policy optimization with no constraint on the default policy (RPO), an agent with VDO applied to the control policy and no default policy (VDO-PO), and MANUALIA (Galashov et al., 2019), in which the goal feature is manually withheld from the default policy. Details can be found in Section H.

Figure 5.1: MDL-C rapidly adapts to new goal locations (top row) and rule changes (bottom row). All curves represent averages taken over 10 random seeds, with the shading indicating standard error. (Panels: (a) the goal-change and contingency-change FOURROOMS setups; (b) default policy feature weights for $s_h$, $a_{h-1}$, $r_{h-1}$, and $g_h$ across training phases 1 and 2; (c) performance of PO, RPO, VDO-PO, MANUALIA, and MDL-C over episodes.)

Generalization Across Goals In the first setting, we test MDL-C's ability to facilitate rapid learning on previously unseen goals. In the first phase of training, a single goal location is randomly sampled at the start of each episode, and may be placed anywhere in two of the four rooms in the environment (Fig. 5.1a, top left). In the second phase, the goal location is again randomly sampled at the start of each episode, but in this case only in the rooms which were held out in the first phase. Additionally, the agent is limited to 25 rather than 100 steps per episode. Importantly, VDO induces the MDL-C default policy to ignore input features which are, on average, less predictive of the control policy's behavior. In this case, the default policy learns to ignore the goal feature and the reward obtained on the previous timestep. This is because, when averaging across goal locations, the agent's current position ($s_h$) and its previous direction ($a_{h-1}$) are more informative of its next action, typically heading towards the nearest door. In contrast, the un-regularized default policy of the RPO agent does not drop these features (see Section I for a visualization and Section H for more details).
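One way to inspect which inputs the default policy has effectively dropped (the feature-weight analysis referenced above) is to read off the per-weight dropout rates of its first variational-dropout layer. The sketch below assumes the layer exposes per-weight variational parameters `mu` and `log_sigma2` as in Molchanov et al. (2017); the threshold and the interpretation of specific feature indices are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def input_relevance(mu, log_sigma2, log_alpha_threshold=3.0):
    """Fraction of first-layer weights retained per input feature of a VDO layer
    with per-weight posterior N(mu, sigma^2). log alpha = log(sigma^2 / mu^2);
    weights with large log alpha are effectively pruned."""
    log_alpha = log_sigma2 - torch.log(mu ** 2 + 1e-8)      # shape: (out_features, in_features)
    kept = (log_alpha < log_alpha_threshold).float()        # 1 if the weight survives pruning
    return kept.mean(dim=0)                                 # per-input-feature retention score

# Hypothetical usage: inputs whose retention score is near zero (e.g. the goal and
# previous-reward features in FOURROOMS) are the ones the default policy ignores.
mu = torch.randn(64, 4)            # stand-in for a trained layer over (s_h, a_{h-1}, r_{h-1}, g_h)
log_sigma2 = torch.randn(64, 4)
print(input_relevance(mu, log_sigma2))
```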
By learning to ignore the goals present in phase 1 and encoding useful behavior regardless of goal location, MDL-C develops more effective regularization in phase 2, enabling it to adapt more quickly than other methods (Fig. 5.1c, top), particularly RPO, which overfits to phase 1's goals. MANUALIA also adapts quickly, as its default policy is hard-coded to ignore the goal feature.

Robustness to Rule Changes In this setting, there are only two possible goal locations, one at the top left of the environment and the other at the bottom right. In training phase 1, the agent receives a goal feature as input which indicates the state index of the rewarded location for that episode. In phase 2, the goal feature switches from marking the rewarded location to marking the unrewarded location. That is, if the reward is in the top left, the goal feature will point to the bottom right. Here, the danger for the agent isn't overfitting to a particular goal or goals, but rather overfitting to the reward-based rules associated with a given feature. As we saw in Fig. 5.1c (top), an un-regularized default policy will copy the control policy and overfit to a particular setting. Fortunately, the MDL-C default policy learns to ignore features which are, on average, less useful for predicting the control policy's behavior: the goal and previous-reward features. This renders the agent more robust to contingency switches like the one described, as we can see in Fig. 5.1c (bottom). These examples illustrate that MDL-C enables agents to effectively learn the consistent structure of a group of tasks, regardless of its semantics, and compress out information which is less informative on average.

5.2 CONTINUOUS CONTROL

A more challenging application area is that of high-dimensional continuous control. In this setting, we presented agents with multitask learning problems using environments from the DeepMind Control Suite (DMC; Tassa et al., 2018). We used soft actor-critic (SAC; Haarnoja et al., 2018) as the base agent. We tested MDL-C in both the sequential and parallel settings on two domains from DMC: walker and cartpole (Fig. 5.2a). Additional details can be found in Section H.

Figure 5.2: MDL-C improves both sequential and parallel learning in continuous control tasks. All curves represent averages taken over 8 random seeds, with the shading indicating standard error. In (b), insets show the improvement of MDL-C as k increases, and in (d), solid curves represent averages over each feature within a category. (Panels: (a) the walker and cartpole domains; (b) test reward on walker run and cartpole swingup-sparse in the sequential setting for SAC, VDO, Distral, TVPO, RPO, and MDL-C; (c) test reward in the parallel setting for MT-SAC, Distral, MANUALIA, RPO, and MDL-C; (d) default policy feature weights (orientation, height, velocity, position, task ID) in the parallel setting.)

Sequential Tasks In the sequential setting, tasks are sampled one at a time uniformly without replacement from the available tasks within each domain, with the default policy distribution conserved across tasks. For walker, these tasks are stand, walk, and run.
In stand, the agent is rewarded for increasing the height of its center of mass, and in the latter two tasks, an additional reward is given for forward velocity. For cartpole, there are four tasks: balance, balance-sparse, swingup, and swingup-sparse. In the balance tasks, the agent must keep a rotating pole upright, and in the swingup tasks, it must additionally learn to swing the pole upwards from an initial downward orientation. Performance results for the hardest task within each domain (run in walker and swingup-sparse in cartpole) for each method are plotted in Fig. 5.2b, where k indicates the task round at which the task was sampled. We can see that as k increases (as more tasks have been seen previously), MDL-C's performance improves. Importantly, the RPO agent's default policy, which is un-regularized, overfits to the previous task, essentially copying the optimal policy's behavior. This can severely hinder the agent's performance when the subsequent task requires different behavior. For example, on swingup-sparse, if the previous task is swingup, the RPO agent performs well, as the goal is identical. However, if the previous task is balance or balance-sparse, the agent never learns to swing the pole upwards, significantly reducing its average performance. Another important point to note is that because the agent is not given an explicit goal feature in this setting, methods like MANUALIA which rely on prior knowledge about the agent's inputs cannot be applied.

Parallel Tasks We also tested parallel-task versions of SAC, MANUALIA, and MDL-C based on the model of Yu et al. (2019). In this framework, a task within each domain is randomly sampled at the start of each episode, and the agent learns a single control policy for all tasks. Performance is plotted in Fig. 5.2c, where we can again see that MDL-C accelerates convergence relative to the baseline methods. This marks a difference compared to the easier FOURROOMS environment, in which MDL-C and MANUALIA performed roughly the same. As before, one clue to the difference can be found in the input features that the MDL-C default policy chooses to ignore (Fig. 5.2d). For walker, inputs are 24-dimensional, with 14 features related to the joint orientations, 1 feature indicating the height of the agent's center of mass, and 9 features indicating velocity components. For cartpole, there are 5 input dimensions, with 3 pertaining to position and 2 to velocity. In the walker domain, where the performance difference is greatest, the MDL-C agent not only ignores the added task-ID feature, but also several features related to velocity. In contrast, in the cartpole domain, MDL-C only ignores the task-ID feature, just as MANUALIA does, and the performance gap is smaller. This illustrates that MDL-C learns to compress out spurious information even in settings for which it is difficult to identify a priori. In order to test the effect of the learned asymmetry on performance more directly, we implemented a variant of MANUALIA in which all of the features which MDL-C learned to ignore were manually hidden from the default policy (Fig. I.4); a minimal sketch of this kind of manual input masking is given below. Interestingly, while this method improved over standard MANUALIA, it didn't completely close the gap with MDL-C, indicating that there are downstream effects within the network beyond input processing which are important for the default policy's effectiveness. We hope to explore these effects in more detail in future work.
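The manual-masking variant just described can be emulated by zeroing selected observation dimensions before they reach the default policy. The sketch below is illustrative only: the network shape and the choice of masked indices are placeholders, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MaskedDefaultPolicy(nn.Module):
    """Default policy whose selected input features are hidden (zeroed out),
    mimicking a ManualIA-style information asymmetry."""
    def __init__(self, obs_dim, act_dim, masked_idx=(), hidden_dim=128):
        super().__init__()
        mask = torch.ones(obs_dim)
        mask[list(masked_idx)] = 0.0           # e.g. task-ID and velocity dimensions (placeholders)
        self.register_buffer("mask", mask)     # stored with the module, not trained
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, act_dim),
        )

    def forward(self, obs):
        return self.net(obs * self.mask)       # action parameters computed from masked inputs

# Hypothetical usage: hide the features MDL-C learned to ignore in the walker domain.
policy = MaskedDefaultPolicy(obs_dim=25, act_dim=6, masked_idx=(24, 15, 16, 17))
print(policy(torch.randn(2, 25)).shape)        # -> torch.Size([2, 6])
```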
6 RELATED WORK

MDL-C can be viewed as an extension of recent approaches to learning default policies ("behavioral priors") from the optimal policies of related tasks (Teh et al., 2017; Tirumala et al., 2020). For a default policy to be useful for transfer learning, it is crucial to balance the default policy's ability to copy the control policies against its expressiveness: if the default policy is too expressive, it is likely to overfit on past tasks and fail to generalize to unseen tasks. Whereas prior work primarily handcrafts structural constraints into the default policies to avoid overfitting (e.g., by hiding certain state information from the default policy; Galashov et al., 2019), MDL-C learns such a balance from data with sparsity-inducing priors via variational inference. MDL-C may also be derived from the RL-as-inference framework (Levine, 2018; Section A). MDL-C thus has close connections with algorithms such as MPO (Abdolmaleki et al., 2018) and VIREL (Fellows et al., 2020), discussed in Section A. As a general framework, MDL-C is also connected to the long and well-established literature on choosing appropriate Bayesian priors (Jeffreys, 1946; Bernardo, 2005; Casella, 1985), and to more recent work that focuses on learning such priors for large-scale machine learning models (Nalisnick and Smyth, 2017; Nalisnick et al., 2021; Atanov et al., 2018). For a further discussion of related work, particularly concerning the application of MDL to the RL setting, see Section C.

7 CONCLUSION

Inspired by dual process theories and the MDL principle, we propose a regularized policy optimization framework for multitask RL which aims to learn a simple default policy encoding a low-complexity distillation of the optimal behavior for some family of tasks. By encouraging the default policy to maintain a low effective description length, MDL-C ensures that it does not overfit to spurious correlations among the (approximately) optimal policies learned by the agent. We described MDL-C's formal properties and demonstrated its empirical effectiveness in discrete and continuous control tasks. There are, of course, limitations of MDL-C, which we believe represent opportunities for future work (see Section E). Promising research directions include integrating MDL-C with multitask RL approaches which balance a larger set of policies (Barreto et al., 2020; Moskovitz et al., 2022b; Thakoor et al., 2022) as well as considering nonstationary environments (Parker-Holder et al., 2022). We hope MDL-C inspires further advances in multitask RL.

Acknowledgements The authors would like to thank Kevin Miller, DJ Strouse, Marcel Binz, and Alexander Galashov for useful discussions and suggested improvements to the manuscript. Work funded by the Gatsby Charitable Foundation.

REFERENCES

William James. The Principles of Psychology, volume 1. Henry Holt, New York, 1890.

Ann M. Graybiel. Habits, rituals, and the evaluative brain. Annual Review of Neuroscience, 31(1):359-387, 2008. doi: 10.1146/annurev.neuro.29.051605.112851. URL https://doi.org/10.1146/annurev.neuro.29.051605.112851. PMID: 18558860.

Nathaniel D Daw, Samuel J Gershman, Ben Seymour, Peter Dayan, and Raymond J Dolan. Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6):1204-1215, 03 2011.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. URL http://incompleteideas.net/book/the-book-2nd.
html. Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011. Matthew Botvinick, Ari Weinstein, Alec Solway, and Andrew Barto. Reinforcement learning, efficient coding, and the statistics of natural tasks. Current Opinion in Behavioral Sciences, 5:71 77, 08 2015. doi: 10.1016/j.cobeha.2015.08.009. Robert Kirk, Amy Zhang, Edward Grefenstette, and Tim Rockt aschel. A survey of generalisation in deep reinforcement learning, 2021. URL https://arxiv.org/abs/2111.09794. Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 17, page 4499 4509, 2017. Alexandre Galashov, Siddhant M. Jayakumar, Leonard Hasenclever, Dhruva Tirumala, Jonathan Schwarz, Guillaume Desjardins, Wojciech M. Czarnecki, Yee Whye Teh, Razvan Pascanu, and Nicolas Heess. Information asymmetry in kl-regularized RL. Co RR, abs/1905.01240, 2019. Anirudh Goyal, Yoshua Bengio, Matthew Botvinick, and Sergey Levine. The variational bandwidth bottleneck: Stochastic evaluation on an information budget, 2020. Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Yoshua Bengio, and Sergey Levine. Infobot: Transfer and exploration via the information bottleneck, 2019. Ted Moskovitz, Michael Arbel, Jack Parker-Holder, and Aldo Pacchiano. Towards an understanding of default policies in multitask policy optimization. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 10661 10686. PMLR, 28 30 Mar 2022a. URL https://proceedings.mlr.press/ v151/moskovitz22a.html. Lucy Lai and Samuel Gershman. Policy compression: An information bottleneck in action selection. Psychology of Learning and Motivation, 74:195 232, 01 2021. doi: 10.1016/bs.plm.2021.02.004. Jorma Rissanen. Modelling by shortest data description. Automatica, 14, 01 1978. Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley and Sons, 2010. Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning, 2019. URL https://arxiv.org/ abs/1910.10897. Tom Zahavy, Andre Barreto, Daniel J Mankowitz, Shaobo Hou, Brendan O Donoghue, Iurii Kemaev, and Satinder Singh. Discovering a set of policies for the worst case reward. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=PUkh Wz65dy5. Published as a conference paper at ICLR 2023 Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126 1135. PMLR, 2017. Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning, 2013. URL https://arxiv.org/abs/1309.6821. Aldo Pacchiano, Ofir Nachum, Nilseh Tripuraneni, and Peter Bartlett. Joint representation training in sequential tasks with shared structure, 2022. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. 
John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft qlearning, 2018. Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018. Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 64 66. PMLR, 2020. Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Anna Choromanska, Krzysztof Choromanski, and Michael I Jordan. Learning to score behaviors for guided policy optimization. In The International Conference on Machine Learning. 2020. Dhruva Tirumala, Alexandre Galashov, Hyeonwoo Noh, Leonard Hasenclever, Razvan Pascanu, Jonathan Schwarz, Guillaume Desjardins, Wojciech Marian Czarnecki, Arun Ahuja, Yee Whye Teh, and Nicolas Heess. Behavior priors for efficient reinforcement learning. ar Xiv preprint ar Xiv:2010.14274, 2020. Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation, 2018. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. Co RR, abs/1502.05477, 2015. Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531 1538, 2002. Ted Moskovitz, Michael Arbel, Ferenc Huszar, and Arthur Gretton. Efficient wasserstein natural gradients for reinforcement learning, 2020. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. Kyungjae Lee, Sungjoon Choi, and Songhwai Oh. Sparse markov decision processes with causal sparse tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3(3):1466 1473, 2018. A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:1 7, 1965. Ming Li and Paul M.B. Vitnyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Publishing Company, Incorporated, 3 edition, 2008. L eonard Blier and Yann Ollivier. The description length of deep learning models. Advances in Neural Information Processing Systems, 31, 2018. Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory, pages 5 13, 1993. Published as a conference paper at ICLR 2023 Antti Honkela and Harri Valpola. Variational learning and bits-back coding: an information-theoretic view to bayesian learning. IEEE transactions on Neural Networks, 15(4):800 810, 2004. Peter Grunwald. A tutorial introduction to the minimum description length principle, 2004. David F Andrews and Colin L Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Methodological), 36(1):99 102, 1974. Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453 461, 1946. URL https://royalsocietypublishing.org/doi/abs/10.1098/rspa. 1946.0056. Nicholas G Polson and James G Scott. On the half-cauchy prior for a global scale parameter. 
Bayesian Analysis, 7(4):887 902, 2012. Andrew Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper). Bayesian analysis, 1(3):515 534, 2006. Peter Gr unwald and Teemu Roos. Minimum description length revisited. International Journal of Mathematics for Industry, 11(01), 2019. Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. Advances in neural information processing systems, 30, 2017. Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28, 2015. Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498 2507. PMLR, 2017. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929 1958, 2014. URL http://jmlr.org/papers/v15/ srivastava14a.html. David Abel, Will Dabney, Anna Harutyunyan, Mark K Ho, Michael Littman, Doina Precup, and Satinder Singh. On the expressivity of markov reward. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 7799 7812. Curran Associates, Inc., 2021. Bradley Efron and Carl Morris. Stein s estimation rule and its competitors an empirical bayes approach. Journal of the American Statistical Association, 68(341):117 130, 1973. Francesco Orabona. A modern introduction to online learning, 2019. Richard S. Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1): 181 211, 1999. URL https://www.sciencedirect.com/science/article/pii/ S0004370299000521. Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. Deepmind control suite, 2018. Matthew Fellows, Anuj Mahajan, Tim G. J. Rudner, and Shimon Whiteson. Virel: A variational inference framework for reinforcement learning, 2020. Jos e M Bernardo. Reference analysis. Handbook of statistics, 25:17 90, 2005. George Casella. An introduction to empirical bayes data analysis. The American Statistician, 39(2): 83 87, 1985. Eric Nalisnick and Padhraic Smyth. Learning approximately objective priors. ar Xiv preprint ar Xiv:1704.01168, 2017. Published as a conference paper at ICLR 2023 Eric Nalisnick, Jonathan Gordon, and Jos e Miguel Hern andez-Lobato. Predictive complexity priors. In International Conference on Artificial Intelligence and Statistics, pages 694 702. PMLR, 2021. Andrei Atanov, Arsenii Ashukha, Kirill Struminsky, Dmitry Vetrov, and Max Welling. The deep weight prior. ar Xiv preprint ar Xiv:1810.06943, 2018. Andre Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117 (48):30079 30087, 2020. ISSN 0027-8424. doi: 10.1073/pnas.1907370117. URL https: //www.pnas.org/content/117/48/30079. Ted Moskovitz, Spencer R Wilson, and Maneesh Sahani. A first-occupancy representation for reinforcement learning. In International Conference on Learning Representations, 2022b. 
URL https://openreview.net/forum?id=JBAZe2y N6Ub. Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, R emi Munos, and Andr e Barreto. Generalised policy improvement with geometric policy composition, 2022. URL https:// arxiv.org/abs/2206.08736. Jack Parker-Holder, Minqi Jiang, Michael Dennis, Mikayel Samvelyan, Jakob Foerster, Edward Grefenstette, and Tim Rockt aschel. Evolving curricula with regret-based environment design, 2022. URL https://arxiv.org/abs/2203.01302. Samuel Kessler, Jack Parker-Holder, Philip Ball, Stefan Zohren, and Stephen J. Roberts. Same state, different task: Continual reinforcement learning without interference, 2021. URL https: //arxiv.org/abs/2106.02940. Jesse Zhang, Karl Pertsch, Jiefan Yang, and Joseph J Lim. Minimum description length skills for accelerated reinforcement learning. In Self-Supervision for Reinforcement Learning Workshop - ICLR 2021, 2021. URL https://openreview.net/forum?id=r4Xxtr Io1m9. Sebastian Thrun and Anton Schwartz. Finding structure in reinforcement learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7. MIT Press, 1994. URL https://proceedings.neurips.cc/paper/1994/file/ 7ce3284b743aefde80ffd9aec500e085-Paper.pdf. Edward I George, Feng Liang, and Xinyi Xu. Improved minimax predictive densities under kullbackleibler loss. The Annals of Statistics, pages 78 91, 2006. Dominique Fourdrinier, William E Strawderman, and Martin T Wells. On the construction of bayes minimax estimators. Annals of Statistics, pages 660 671, 1998. Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy P. Lillicrap. Distributed distributional deterministic policy gradients. Co RR, abs/1804.08617, 2018. URL http://arxiv.org/abs/1804. 08617. James Melbourne. Strongly convex divergences. Entropy (Basel, Switzerland), 22(11):1327, 11 2020. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, March 2004. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928 1937, New York, New York, USA, 20 22 Jun 2016. PMLR. Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735 1780, 1997. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014. URL https://arxiv.org/abs/1412.6980. Published as a conference paper at ICLR 2023 Minimum Description Length Control Supplementary Information A REINFORCEMENT LEARNING AS INFERENCE The control as inference framework (Levine, 2018) associates every time step h with a binary optimality random variable Oh {0, 1} that indicates whether ah is optimal at state sh (Oh = 1 for optimal, and Oh = 0 for not). The optimality variable has the conditional distribution P(Oh = 1|sh, ah) = exp(r(sh, ah)), which scales exponentially with the reward received taking action ah in state sh. Denote OH as the event that Os = 1 for s = 0, . . . , H 1. Then the log-likelihood that a policy πw(a|s) is optimal over a horizon H is given by: P(OH) = Z P(OH|τ)Pπw(τ|w)p(w)dτdw. 
By performing variational inference, we can lower-bound the log-likelihood with the ELBO: log P(OH) Eνπ(τ) r(sh, ah) Eνθ(w)KL[πθ(ah|sh), πw(ah|sh)] KL[νϕ(w), p(w)], where νθ,ϕ(τ, w) = νθ(τ)νϕ(w) is the variational posterior, νθ(τ) = ρ(s0) h=0 P(sh+1|sh, ah)πθ(ah, sh) and {θ, ϕ} are the variational parameters. We can maximize this objective iteratively by performing coordinate ascent on {θ, ϕ}: r(sh, ah) Eνθ(w)KL[πθ(ah|sh), πw(ah|sh)] ! h=0 Eνθ(w)KL[πθ(ah|sh), πw(ah|sh)] + KL[νϕ(w), p(w)] where η is a learning rate parameter. Note that Equation (A.3) is equivalent to Equation (4.1) and Equation (G.8), and Equation (A.2) is equivalent to Equation (G.7) with the KL reversed. Connection to Maximum a Posteriori Policy Optimization (MPO) MDL-C is closely related to MPO (Abdolmaleki et al., 2018), with three key differences. First, MDL-C performs variational inference on the parameters of the default policy with an approximate posterior νϕ(w), whereas MPO performs MAP inference. Second, MPO places a normal prior on w, which in effect penalizes the L2 norm of w. In contrast, MDL-C uses sparsity-inducing priors such as the normal-Jeffreys prior. Third, MDL-C uses a parametric πθ, whereas MPO uses a non-parametric one2. While there is also a parametric variant of MPO, this variant does not maintain θ and ϕ separately. Instead, this variant directly sets θ to ϕ in Equation (A.2). This illustrates the key conceptual difference between MDL-C and MPO. MDL-C makes a clear distinction between the control policy πθ and the default policy πw, with the two policies serving two distinct purposes: the control policy for performing on the current task, the default policy for distilling optimal policies across tasks and generalizing to new ones. MPO, on the other hand, treats πθ and πw as fundamentally the same object. Like MPO, VIREL (Fellows et al., 2020) can be derived from the control as inference framework. In fact, Fellows et al. showed that a parametric variant of MPO can be derived from VIREL (Fellows et al., 2020). The key novelty that sets VIREL apart from both MPO and MDL-C is an adaptive temperature parameter that dynamically updates the influence of the KL term in Equation (A.2). 2In practice, MPO parametrizes πθ implicitly with a parameterized action-value function and the default policy. Published as a conference paper at ICLR 2023 B MULTITASK RL FRAMEWORKS We believe the objective which best captures naturalistic settings is the average reward obtained over the agent s lifetime : lim T 1 T E PT t=1 r(st, at). Typical objectives include finding either a single policy or a set of policies which maximize worstor average-case value: maxπ min M M V π M (Zahavy et al., 2021) or maxπ EPMV π M (Moskovitz et al., 2022a). When the emphasis is on decreasing the required sample complexity of learning new tasks, a useful metric is cumulative regret: the agent s total shortfall across training compared to an optimal agent. In practice, it s often simplest to consider the task distribution PM to be a categorical distribution defined over a discrete set of tasks M := {Mk}K k=1, though continuous densities over MDPs are also possible. Two multitask settings which we consider here are parallel task RL and sequential task RL. In typical parallel task training (Yu et al., 2019), a new MDP is sampled from PM at the start of every episode and is associated with a particular input feature g G that indicates to the agent which task has been sampled. The agent s performance is evaluated on all tasks M M together. 
In the sequential task setting (Moskovitz et al., 2022a; Pacchiano et al., 2022), tasks (MDPs) are sampled one at a time from PM, with the agent training on each until convergence. In contrast to continual learning (Kessler et al., 2021), the agent s goal is simply to learn a new policy for each task more quickly as more are sampled, rather than learning a single policy which maintains its performance across tasks. Another important setting is meta-RL, which we do not consider here. In the meta-RL setting, the agent trains on each sampled task for only a few episodes each with the goal of improving few-shot performance and is meta-tested on a set of held-out tasks (Yu et al., 2019; Finn et al., 2017). Another strain of work in multitask RL assumes some form of shared structure in the transition dynamics (Pacchiano et al., 2022; ?; ?). Specifically, the core assumption made by these works is that the transition dynamics are linearly decodable from a set of features which is shared across tasks or in which the transition matrix admits a low-rank decomposition. This is very different from our own structural assumption that is, in its simplest form, that the optimal policies of the tasks with which our agents are faced take similar actions in at least some part of the state space. Beyond this, the MDPs in M need only share the same state and action space, with no direct assumptions about transitions or rewards. This is important, because the assumed structures in the transition distribution made by Pacchiano et al. (2022); ?); ? act as the starting points for algorithm development. MDL-C/RPO/TVPO however, can leverage similarity among optimal policies when it exists, but are not dependent on it as a prerequisite. (E.g., TVPO (and RPO/MDL-C) is guaranteed to perform no worse than log-barrier regularization, which has a polynomial sample complexity guarantee.) Ideally, we d like a generalist method which can identify on its own and exploit different types of structure in the environment. C ADDITIONAL RELATED WORK Previous work has also applied the MDL principle in an RL context, though primarily in the context of unsupervised skill learning (Zhang et al., 2021; Thrun and Schwartz, 1994). For example, Thrun and Schwartz (1994) are concerned with a set of skills which are policies defined only over a subset of the state space that are reused across tasks. They consider tabular methods, measuring a pseudo-description length as M M P M(s) + X n N |Sn|, (C.1) where P M(s) is the probability that no skill selects an action in state s for task M and the agent must compute the optimal Q-values in state s for M, N is the number of skills, and |Sn| is the number states for which skill n is defined. They then trade off this description length term with performance across a series of tabular environments. One other related method is DISTRAL (Teh et al., 2017), which uses the following objective in the parallel task setting: J Distral(θ, ϕ) = V πθ Es dπθ [αKL[πθ( |s), πϕ( |s)] + βH[πθ( |s)]] . (C.2) That is, like the un-regularized RPO method, DISTRAL can be seen as performing maximumlikelihood estimation to learn the (unconstrained) default policy, while adding an entropy bonus to the control policy. Published as a conference paper at ICLR 2023 Another important method in the sequential setting is TVPO (Moskovitz et al., 2022a), in which (in the tabular case) the default policy is defined as a softmax over the average action frequencies of the optimal policies for the tasks that the agent has seen so far. 
That is, if the average optimal action in a state s is given by ˆξk(s, a) = 1 i=1 1(π i (s) = a), then the TVPO default policy is πw(a|s) = softmax ˆξk(s, a)/β(k) , where β(k) is a temperature which decays as k . In high-dimensional state and action spaces, this tabular solution can be approximated by training a default policy to predict the converged control policy s actions in each task. Importantly, this is equivalent to using KL distillation in that the default policies will converge to the same barycenter policy (Moskovitz et al., 2022a), as long as the distillation is only performed once the control policy has converged in each task. Using KL distillation in this way is exactly the RPO baseline that we use in this paper. Crucially, the use of the softmax with decaying temperature was introduced by Moskovitz et al. (2022a) as a useful hack to prevent the default policy from overfitting to early tasks, as the optimal default policy is the barycenter policy (approximated as the number of draws from the task distribution grows). Thus, MDL-C can itself be seen as a scalable advancement of TVPO which models the agent s epistemic uncertainty about the task distribution by placing a sparse prior over the default policy parameters (and uses a distillation loss rather than action prediction). In other words, MDL-C represents a principled approach to reducing the risk of default policy overfitting in the low-data regime. Finally, Brunskill and Li (2013) consider a similar training and task structure to our own, but use a model-based approach to learn the underlying MDPs. D MOTIVATING THE CHOICE OF SPARSITY-INDUCING PRIORS As a reminder, the generative model of optimal parameters in Section 4.1 is given by: wi|β, σ2 N(0, 1 β β σ2Id), (D.1) wik|wi, σ2, β N(w, σ2Id) (D.2) with marginal and posterior densities p(wik|σ2, β) = N(0, σ2β 1Id), (D.3) p(wi|wik, σ2, β) = N (1 β)wik, (1 β)σ2Id . (D.4) In the rest of this section, we set σ2 = 1 for simplicity and drop the indices on w and w to remove clutter. D.1 CORRESPONDENCE BETWEEN p(z) AND p(β) In Section 4.1, we draw a connection between p(β) β 1 and the normal-Jeffreys prior, which is commonly used for compressing deep neural networks (Louizos et al., 2017). In Table 1, we expand on this connection and list p(β) for two other commonly-used priors for scale mixture of normal distributions: Jeffreys, Inverse-gamma, and Inverse-beta. Note that the half-Cauchy distribution p(z) (1 + z2) 1 is a special case of the inverse-beta distribution for s = t = 1/2. Half-cauchy prior is another commonly used prior for compressing Bayesian neural networks (Louizos et al., 2017). D.2 MSE RISK In this section, we prove that the Bayes estimators for the Jeffreys, inverse-gamma, and the inversebeta (by extension the half-Cauchy) distributions dominate the maximum-likelihood estimator with respect to the mean-squared error. Published as a conference paper at ICLR 2023 Define the mean-squared error of an estimator ˆw(x) of w as MSE(w, ˆw) = Ex ˆw(x) w 2, (D.5) where the expectation is taken over N(x; w, α2). Immediately, we have R(w, ˆw(ML)) = d, where ˆw(ML)(x) = x is the maximum-likelihood estimator. An estimator ˆw(a)(x) is said to dominate another estimator ˆw(b)(x) if MSE(w, ˆwa) MSE(w, ˆwb) for all w and the inequality is strict for a set of positive Lesbesgue measure. It is well-known that the maximum-likelihood estimator is minimax (George et al., 2006), and thus any estimator that dominates the maximum-likelihood estimator is also minimax. 
To compute the mean-squared error risk for an estimator ˆw(x), observe that ˆw(x) w 2 = x ˆw(x) 2 x w 2 + 2( ˆw(x) w) (x w). (D.6) Taking expectations on both sides gives MSE(w, ˆw) = Ex x ˆw(x) 2 d + 2 i=1 Cov( ˆwi(x), xi) (D.7) = Ex x ˆw(x) 2 d + 2Ex ˆw(x) (D.8) where = ( / x1, . . . , / xd) and we apply Stein s lemma cov( ˆwi(x), xi) = Ex ˆwi/ xi in the last line. If the estimator takes the form ˆw(x) = x + γ(x), the expression simplifies as: MSE(w, ˆw) = d + Ex γ(x) 2 + 2Ex γ(x). (D.9) Therefore, an estimator ˆw(x) = x + γ(x) dominates ˆw(ML)(x) if MSE(w, ˆw) MSE(w, ˆw(ML)) = Ex γ(x) 2 + 2 γ(x) 0 (D.10) for all w and the inequality is strict on a set of positive Lesbesgue measure. D.2.1 JAMES-STEIN ESTIMATOR The famous Jame-Stein estimator is defined as ˆw(JS)(x) = x + γ(JS)(x), γ(JS)(x) = (d 2)x/ x 2, (D.11) x 2 + 2 d 2 ( x 2)2 x2 i x 2 , (D.12) γ(JS)(x) 2 = (d 2)2 x 2 . (D.13) Substituting γ(JS)(x) and γ(JS)(x) 2 into Equation (D.10), we have MSE(w, ˆw(JS)) MSE(w, ˆw(ML)) = Ex (d 2)2 x 2 . (D.14) Thus, the James-Stein estimator dominates the maximum-likelihood estimator for d > 2. Prior name p(z2) p(β) Jeffreys p(z2) z 2 p(β) β 1 Inverse-gamma p(z2) z 2(s+1)e t/(2z2) p(β) βs 1e tβ/2 Inverse-beta p(z2) (z2)t 1(1 + z2) (s+t) p(β) β (s+2t+1)(1 + β) (s+t) Table 1: Correspondence between p(z2) and p(β). Published as a conference paper at ICLR 2023 D.2.2 BAYES ESTIMATORS The Bayes estimator for a prior choice p(β) is given by (?): ˆw(p(β))(x) = x + γ(p(β))(x), γ(p(β))(x) = log m(x), (D.15) m(x) = Z N(x; 0, β 1Id)p(β)dβ (D.16) 2 βd/2 exp βx2/2 p(β)dβ. (D.17) Substituting γ(p(β))(x) into Equation (D.10), we find that the condition for the Bayes estimator to be minimax is given by (George et al., 2006): MSE(w, ˆw(B)) MSE(w, ˆw(ML)) = Ex log m(x) 2 + 2 2m(x) where 2 = P i 2/ x2 i is the Laplace operator. This condition holds when p m(x) is superharmonic (i.e., p m(x) 0, x Rd), suggesting a recipe for constructing Bayes estimators that dominate the maximum likelihood estimator, summarized in the following proposition. Proposition D.1 (Extension of Theorem 1 in Fourdrinier et al., 1998). Let p(β) be a positive function such that f(β) = βp (β)/p(β) can be decomposed as f1(β) + f2(β) where f1 is non-decreasing, f1 A, 0 < f2 B, and A/2 + B (d 6)/4. Assume also that limβ 0 βd/2+2p(β) = 0. Then, 2p m(x) 0 and the Bayes estimator is minimax. If A/2 + B < (d 6)/4, then the Bayes estimator dominates ˆw(ML)(x). Proof. This proof largely follows the proof of Theorem 1 in (Fourdrinier et al., 1998). Note that Equation (D.18) holds if 0 x Rd, (D.20) or equivalently 2m(x) m(x) 1 m(x) 0 x Rd. (D.21) Computing the derivatives, we get the condition R 1 0 β x 2 d βd/2+1e β x 2/2p(β)dβ x R 1 0 βd/2+1e β x 2/2p(β)dβ 1 2 x R 1 0 βd/2+1e β x 2/2p(β)dβ R 1 0 βd/2e β x 2/2p(β)dβ 0. (D.22) Divide both sides by x and rearrange to get R 1 0 βd/2+2e β x 2/2p(β)dβ R 1 0 βd/2+1e β x 2/2p(β)dβ 1 R 1 0 βd/2+1e β x 2/2p(β)dβ R βd/2e β x 2/2p(β)dβ d x 2 . (D.23) Next, we integrate by parts the numerator of the first term on the left-hand side to get: Z 1 0 βd/2+2e β x 2/2p(β)dβ = 2 x 2 h βd/2+2e β x 2/2p(β) i1 0 βd/2+1e β x 2/2p(β)dβ 0 βd/2+2e β x 2/2p (β)dβ, Published as a conference paper at ICLR 2023 where the middle term is the same as the denominator of the first term in Equation (D.23). Integrating by parts the second term gives the same expression as that of the first term, but with d 2 in place of d everywhere. 
Substituting these expressions back into Equation (D.23), collecting like terms, and dividing both sides by 2/ x 2, gives: R 1 0 βd/2+2e β x 2/2p (β)dβ R 1 0 βd/2+1e β x 2/2p(β)dβ 1 R 1 0 βd/2+1e β x 2/2p (β)dβ R 1 0 βd/2e β x 2/2p(β)dβ + κ0 + κ1 (D.25) κ1 = limβ 1 βd/2+2e β x 2/2p(β) R 1 0 βd/2+1e β x 2/2p(β)dβ + 1 2 limβ 1 βd/2+1e β x 2/2p(β) R 1 0 βd/2e β x 2/2p(β)dβ , (D.26) κ0 = limβ 0 βd/2+2e β x 2/2p(β) R 1 0 βd/2+1e β x 2/2p(β)dβ 1 2 limβ 0 βd/2+1e β x 2/2p(β) R 1 0 βd/2e β x 2/2p(β)dβ . (D.27) Here, both κ0 and κ1 are nonpositive: (i) κ0 is nonpositive because the first term vanishes due to the boundary conditions and the second term is nonpositive, and (ii) κ1 is nonpositive because the limits of the numerators of the two terms are equal while the denominator of the second term is larger than that of the first. We can thus drop κ0 and κ1 to get the sufficient condition: 2Ed 2 (f) d 6 where Ed denotes expectation with respect to the density gd(β) = βd/2+1e β x 2/2p(β) R 1 0 βd/2+1e β x 2/2p(β)dβ (D.29) and where f(β) = βp (β)/p(β). Because gd(β) is a family of monotone increasing likelihood ratio in d and f1 is nonincreasing and bounded by A, we have Ed(f1) Ed 2(f1)/2 A/2. We have Ed(f2) Ed 2(f2)/2 B because 0 < f2 B. Taken together, we have Ed(f) Ed 2(f)/2 A/2 + B (k 6)/4. (D.30) When the inequality is strict (i.e., A/2 + B < (k 6)/4), then 2p m(x) < 0 and the Bayes estimator dominates the maximum-likelihood estimator. Checking whether a given p(β) satisfy the conditions in Proposition D.1 may be tedious. The following corollary is useful for construction p(β) that satisfies the conditions in Proposition D.1. Corollary D.1 (Extension of Corollary 1 in Fourdrinier et al., 1998). Let ψ be a continuous function that can be decomposed as ψ1 + ψ2, with ψ1 C, ψ1 non-decreasing, 0 < ψ2 D, and C/2 + D 0. Let 2ψ(u) + d 6 β0 0, (D.31) such that limβ 0 βd/2+2p(β) = 0 and β0 (0, 1) is a constant. Then, p(β) results in a minimax Bayes estimator, which dominates the maximum likelihood estimator when C/2 + D < 0. Proof. The proof is the same as that of Corollary 1 in Fourdrinier et al., 1998, with Proposition D.1 in place of Theorem 1 in Fourdrinier et al., 1998. Using Corollary D.1, we now check that the three priors listed in Table 1 and referenced in Section 4.1 lead to Bayes estimators that dominate the maximum-likelihood estimator. Published as a conference paper at ICLR 2023 Jeffreys prior Let ψ1(u) = a for a 0 and ψ2(u) = 0. We have βa+(d 6)/2. (D.32) To satisfy limβ 0 βd/2+2p(β) = 0, we require 1 d < a 0. We recover the improper normal Jeffreys prior p(β) β 1, for a = 2 d/2. The corresponding Bayes estimator dominates the maximum likelihood estimator when d > 4. Inverse-gamma prior Let ψ1(u) = a and ψ2(u) = b(1 u)/2 for a 0 and b 0. We have a + b(1 u)/2 + (d 6)/2 βa+(b+d 6)/2e bβ/2. (D.33) Setting C = a and D = b/2, we get the followings conditions: a + b 0 and 1 d a + b/2. Note that when these conditions are met with s = a + (b + d 4)/2 and t = b, we recover the inverse-gamma prior in Table 1. Inverse-beta (half-Cauchy) prior Let ψ1(u) = a and ψ2(u) = b/(u + 1) for a 0 and b 0. We have a + b/(1 + u) + (d 6)/2 βa+b+(d 6)/2(1 + β) b. (D.34) Setting C = a and D = b, we get the condition a/2 + b 0. To satisfy limβ 0 βd/2+2p(β) = 0, we require 1 d < a + b 0. Note that this corresponds to the inverse-beta prior in Table 1 with t = a + (d 8)/2 and s = b t. To recover the half-Cauchy prior, we set b = 1 and a = (5 d)/2. All conditions in Corollary D.1 are satisfied when d > 9. 
E LIMITATIONS One weakness of the current theoretical analysis regarding the choice of sparsity-inducing priors is the assumption of Gaussian (and in particular, isotropic Gaussian) structure in the parameter space of optimal policies for clusters of tasks. In reality, there is likely a nontrivial degree of covariance among task parameterizations. Extending our analysis to more realistic forms of task structure is an important direction for future work. In a similar vein, the assumption that tasks are drawn iid from a fixed distribution is also unrealistic in naturalistic settings. It would be interesting to introduce some form of sequential structure (e.g., tasks are drawn from a Markov process). Another direction for future work is expanding beyond the one control policy, one default policy setup having, for example, one default policy per task cluster and the ability to reuse and select (for example, using successor feature-like representations (Barreto et al., 2020; Barth-Maron et al., 2018; Moskovitz et al., 2022b)) among an actively-maintained set of control policies across tasks and task clusters would be useful. F OCO BACKGROUND In online convex optimization (OCO), the learner observes a series of convex loss functions ℓk : N R, k = 1, . . . , K, where N Rd is a convex set. After each round, the learner produces an output xk N for which it will then incur a loss ℓk(xk) (Orabona, 2019). At round k, the learner is usually assumed to have knowledge of ℓ1, . . . , ℓk 1, but no other assumptions are made about the sequence of loss functions. The learner s goal is to minimize its average regret: k=1 ℓk(xk) min x N 1 K k=1 ℓk(x). (F.1) Published as a conference paper at ICLR 2023 One OCO algorithm which enjoys sublinear regret is follow the regularized leader (FTRL). In each round of FTRL, the learner selects the solution x N according to the following objective: xk+1 = argmin x N ψk(x) + i=1 ℓi(x), (F.2) where ψk : N R is a convex regularization function. G PROOFS OF PERFORMANCE BOUNDS AND ADDITIONAL THEORETICAL RESULTS The following result is useful. Lemma G.1. The function ℓ(ν) = Ew νf(w) is L-Lipschitz as long as f : W R lies within [0, L] w W, where W Rd is a Hilbert space and L < . Proof. We have |ℓ(ν1) ℓ(ν2)| = |Ew ν1f(w) Ew ν2f(w)| W (ν1(w) ν2(w))f(w) dw = | f, ν1 ν2 W| f W ν1 ν2 W L ν1 ν2 W, where the first inequality is due to Cauchy-Schwarz and the second is by assumption on f. Proposition G.1 (Default Policy Distribution Regret). Let tasks Mk be independently drawn from PM at every round, and let them each be associated with a deterministic optimal policy π k : S A. We make the following mild assumptions: i) πw(a |s) ϵ > 0 s S, where a = π k(s) and ϵ is a constant. ii) minν KL[ν( ), p( )] 0 as Var[ν] for an appropriate choice of sparsity-inducing prior p. Then Algorithm 2 guarantees EPM[ℓK(νK) ℓK( νK)] (EPMKL[ νK, p] + 1) log(1/ϵ) where νK = argminν N PK k=1 ℓk(ν). Proof. The first part of the proof sets up an application of Orabona (2019), Corollary 7.9. To establish grounds for its application, we first note the standard result that the regularization functional ψ(ν) = KL[ν(w), p(w)] for probability measures ν, p P(W) is 1-strongly convex in ν (Melbourne, 2020). Finally, assumption (i) implies that the KL between the default policy and the optimal policy is upper-bounded: KL[π k, πw] log 1/ϵ. Then by Lemma G.1, ℓk(ν) is L-Lipschitz wrt the TV distance, where L = log 1/ϵ. 
Note also that under a Gaussian parameterization for ν, the distribution space N is the Gaussian parameter space N = {(µ, Σ) : µ Rd, Σ Rd d, Σ 0}, which is convex (Boyd and Vandenberghe, 2004). Then Orabona (2019), Corollary 7.9 gives k=1 ℓk(νk) 1 k=1 ℓk( νK) 1 αKL[ νK, p] + α L where νK = argminν PK k=1 ℓk(ν). The constant α R+ is a hyperparameter, so we are free to set it to 1 (Orabona, 2019). Finally, we observe that EPMi 1 K PK k=1 ℓ(νk) = EPMiℓK(νK) and take the Published as a conference paper at ICLR 2023 expectation with respect to PMi of both sides of Eq. (G.2) to get the desired result: EPMi[ℓK(νK) ℓK( νK)] EPMi KL[ νK, p] + 1 L Proposition 4.2 (Control Policy Sample Complexity). Under the setting described in Proposition G.1, denote by Tk the number of iterations to reach ϵ-error for Mk in the sense that mint Tk{V π k V (t)} ϵ. whenever t > Tk. Further, denote the upper-bound in Eq. (G.1) by G(K). In a finite MDP, from any initial θ(0), and following gradient ascent, EMk PM [Tk] satisfies: EMk PMi [Tk] 80|A|2|S|2 ϵ2(1 γ)6 EMk PMi s Unif S where αk(s) := d TV(π k( |s), ˆπ0( |s)) p G(K), καk A (s) = 2|A|(1 α(s)) 2|A|(1 α(s)) 1, and µ is a measure over S such that µ(s) > 0 s S. Note: In the above, there is a small error it should be αk(s) := Ew νd TV(π k( |s), πw( |s)) q 1 2G(K). dπ ρ refers to the discounted state-occupancy distribution under π with initial state distribution ρ: dπ ρ(s) = Es0 ρ(1 γ) X h 0 γh Pπ(sh = s|s0). (G.4) Division between probability mass functions is assumed to be element-wise. Proof. Without loss of generality, we prove the bound for a fixed state s S, noting that the bound applies independently of our choice of s. We use the shorthand KL[π( |s), πw( |s)] KL[π, πw] for brevity. We start by multiplying both sides of the bound from Proposition G.1 by 1/2 and rearranging: EPMiℓK( νK) + L EPMi KL[ νK, p] + 1 EPMi 1 2ℓK(νK) = EPMi EνK 1 2KL[π K, πw] 1 2KL[π K, πw] 1 2KL[π K, πw] 1 2KL[π K, πw] where (i) follows from the definition of the variance, and (ii) follows from its non-negativity. We can rearrange to get EPMi KL[ νK, p] + 1 EPMi EνK 1 2KL[π K, πw] (ii) EPMi EνK [d TV(π K, πw)]2 where (ii) follows from Pinsker s inequality. Letting αK(s) = q 1 2G(K) and applying Moskovitz et al. (2022a), Lemma 5.2 gives the desired result. This upper-bound is signficant, as it shows that, all else being equal, a high complexity barycenter default policy distribution νK (where complexity is measured by KL[ νK, p]) leads to a slower convergence rate in the control policy. Published as a conference paper at ICLR 2023 Algorithm 2: Idealized MDL-C for Multitask Learning 1: require: task distribution PM, policy class Π, coefficients {ηk} 2: initialize: default policy distribution ν1 N 3: for tasks k = 1, 2, . . . , K do 4: Sample a task Mk PM( ) 5: Optimize control policy: ˆπ k = argmax π Π V π Mk λEs dπEw νk KL[πw(a|s), π(a|s)] (G.7) 6: Update default policy distribution: νk+1 = argmin ν N KL[ν, p] + Ew νKL[ˆπ k, πw] (G.8) G.1 MDL-C WITH PERSISTENT REPLAY Rather than rely on iid task draws to yield a bound on the expected regret under the task distribution, a more general formulation of MDL-C for sequential task learning is described in Algorithm 1. In this setting, the dataset of optimal agent-environment interactions is explicitly constructed by way of a replay buffer which persists across tasks and is used to train the default policy distribution. This is much more directly in line with standard FTRL, and we can obtain the standard FTRL bound. 
Proposition G.2 (Persistent Replay FTRL Regret; (Orabona, 2019), Corollary 7.9). Let tasks Mk be independently drawn from PM at every round, and let them each be associated with a deterministic optimal policy π k : S A. We make the following mild assumptions: i) πw(a |s) ϵ > 0 s S, where a = π k(s) and ϵ is a constant. ii) minν KL[ν( ), p( )] = 0 asymptotically as Var[ν] . Then with ηk 1 = L k, Algorithm 1 guarantees k=1 ℓk(νk) 1 k=1 ℓk( νK) (KL[ νK, p] + 1) L where νK = argminν N PK k=1 ℓk(ν). Proof. This follows directly from the arguments made in the proof of Proposition G.1. As before, this result can be used to obtain a performance bound for the control policy. Proposition G.3 (Control Policy Sample Complexity for MDL-C with Persistent Replay). Under the setting described in Proposition G.2, denote by Tk the number of iterations to reach ϵ-error for Mk in the sense that mint Tk{V π k V (t)} ϵ. In a finite MDP, from any initial θ(0), and following gradient ascent, EMk PM [Tk] satisfies: EMk PMi [Tk] 80|A|2|S|2 ϵ2(1 γ)6 EMk PMis Unif S where αk(s) := Ew νd TV(π k( |s), πw( |s)) q G(K) := ℓK( νK) + k=1 (ℓk( νK) ℓk(νk)) + (KL[ ν, p] + 1) L καk A (s) = 2|A|(1 α(s)) 2|A|(1 α(s)) 1, and µ is a probability measure over S such that µ(s) > 0 s S. Proof. Without loss of generality, we select a single state s S, observing that the same analysis applies s S. For simplicity, we denote π( |s) by π. We start by multiplying each side of Eq. (G.2) Published as a conference paper at ICLR 2023 by K and rearranging: k=1 ℓk( νK) (KL[ ν, p] + 1) L k=1 ℓk( νK) k=1 ℓk(νk) + (KL[ ν, p] + 1) L = ℓK( νK) + k=1 (ℓk( νK) ℓk(νk)) + (KL[ ν, p] + 1) L | {z } :=G(K) We can multiply both sides by 1/2 and expand ℓK(νK): 1 2G(K) Ew νK 1 2KL[π K, πw] (i) = VarνK 1 2KL[π K, πw] 1 2KL[π K, πw] 1 2KL[π K, πw] (iii) (EνKd TV(π K, πw))2 where (i) follows from the definition of the variance, (ii) follows from its non-negativity, and (iii) follows from Pinsker s inequality. We then have EνKd TV(π K, πw) 1 2G(K). (G.12) Letting αK(s) = q 1 2G(K) and applying Moskovitz et al. (2022a), Lemma 5.2 gives the desired result. G.2 COMMENT ON IMPROVEMENT ACROSS TASKS To gain intuition for these bounds, there are several important values of α(s) that we can consider. First, as α(s) 1 1/|A|, which is the TV distance between a uniform default policy and a deterministic optimal policy, κα A(s) 2. This is an important value because it s the coefficient obtained for log-barrier regularization that is, when the default policy is uniform and encodes no information about the task distribution. Next, as α(s) 0 (that is, as the TV distance between the default policy and the optimal policy decreases), κα A(s) 2|A|/(2|A| 1) < 2 for |A| > 1. So, we get faster as the distance between the default policy and the optimal policy decreases, as we would hope. Another crucial point to note is that as |A| in this case, κα A(s) 1. Finally, and importantly for MDL-C, as α(s) 1 1/2|A| from below, κα A(s) . In other words, a sufficiently bad default policy can preclude convergence entirely if it puts too much mass on a suboptimal action. For an illustration of this phenomenon, see Moskovitz et al. (2022a) Figure 4.1. 
Indeed, this is why our Proposition 4.1 is so useful by effectively placing an upper bound on α(s) which shrinks as the number of tasks K increases, MDL-C s default policy is guaranteed to a) avoid putting too much mass on a suboptimal action and thereby preclude or delay convergence for the control policy, and b) improve the rate as the default policy regret drops. G.3 PARALLEL TASK SETTING An overview of MDL-C as applied in the parallel task setting is presented in Algorithm 3. One important feature to note is the return threshold R . As a proxy for the control policy converging to π k, data are only added to the default policy replay buffer when a trajectory return is above this threshold performance (on DM control suite tasks, R corresponded to a test reward of at least 700). Published as a conference paper at ICLR 2023 Algorithm 3: Off-Policy MDL-C for Parallel Multitask Learning 1: require: task distribution PM, policy class Π 2: initialize: default policy distribution ν1 N, control replay D0 , default replay Dϕ 0 3: initialize control policy parameters θ and default policy distribution parameters ϕ. 4: while not done do 5: for episodes k = 1, 2, . . . , K do 6: Sample a task Mk PM( ) with goal ID feature gk 7: Collect trajectory τ = ( s0, a0, r0, . . . , s H 1, a H 1, r H 1) Pπθ( ), store experience Dk Dk 1 {( sh, ah, rh, sh+1)}H 1 h=0 (G.13) where sh := (sh, gk). 8: if R(τ) R (i.e., πθ π k) then 9: Add to default policy replay: Dϕ k Dϕ k 1 {( sh, πθ( | sh)}H 1 h=0 (G.14) Note that, e.g., when πθ(a| s) = N(a; µ( s, gk), Σ( s, gk)) is a Gaussian policy, µ( sh, gk), Σ( sh, gk) are added to the replay with sh. 10: end if 11: end for 12: Update Q-function(s) as in Haarnoja et al. (2018). 13: Update control policy: θ argmin θ EUnif Dk V πθ αEw νϕKL[πθ ( | sh), πw( | sh)] (G.15) 14: Update default policy distribution: ϕ argmin ϕ KL[νϕ ( ), p( )] + EUnif Dϕ k Ew νKL[πθ( | sh), πw( | sh)] (G.16) 15: end while We leave more in-depth theoretical analysis of this setting to future work, but note that as the task experience is interleaved, πw = Eνπw will converge to the prior-weighted KL barycenter. If, in expectation, this distribution is a TV distance of less than 1 1/|A| from π k, then the control policy will converge faster than for log-barrier regularization (Moskovitz et al., 2022a). H ADDITIONAL EXPERIMENTAL DETAILS Below, we describe experimental details for the two environment domains in the paper. H.1 FOURROOMS As input, the agent receives a 16-dimensional vector containing the index of the current state, a flattened 3 3 local view of its surrounding environment, its previous action taken encoded as a 4-dimensional one-hot vector, the reward on the previous timestep, and a feature indicating the goal state index. The base learning algorithm in all cases is advantage actor critic (A2C; (Mnih et al., 2016)). Environment The FOURROOMS experiments are set in an 11 11 gridworld. The actions available to the agent are the four cardinal directions, up, down, left, and right, and transitions are deterministic. In both FOURROOMS experiments, the agent can begin an episode anywhere in the environment (sampled uniformly at random), and a single location with reward r = 50 is sampled at the beginning of each episode from a set of possible goal states which varies depending on the experiment and the current phase. A reward of r = 1 is given if the agent contacts the walls. All other states give a reward of zero. 
Episodes end when either a time (number of timesteps) limit is reached or the agent reaches the goal state. Observations were 16-dimensional vectors consisting of Published as a conference paper at ICLR 2023 the current state index (1d), flattened 3 3 local window surrounding the agent (includes walls, but not goals), a one-hot encoding of the action on the previous timestep (4d), the reward on the previous timestep (1d), and the state index of the current goal (1d). In the goal generalization experiment, goals may be sampled anywhere in either the top left or bottom right rooms in the first phase and either the top right or bottom left rooms in the second phase. Each phase comprises 20,000 episodes, and in each phase, the agent may start each episode anywhere in the environment. In the first phase, the agent was allowed 100 steps per episode, and in the second phase 25 steps. In the contingency change experiment, the possible reward states in each phase were the top left state and bottom right state. In the second phase of training, however, the semantics of the goal feature change from indicating the location of the reward to the location where it is absent. Each phase consisted of 8,000 episodes with maximum length 100 timesteps. Results are averaged over 10 random seeds. Agents All agents were trained on-policy with advantage actor-critic (Mnih et al., 2016). The architecture was a single-layer LSTM (Hochreiter and Schmidhuber, 1997) with 128 hidden units. To produce the feature sensitivity plots in Fig. 5.1c, a gating function was added to the input layer of the network: xh = σ(bκ) oh, (H.1) where oh is the current observation, σ( ) was the sigmoid funcion, b R is a constant (set to b = 150 in all experiments), xh Rd is the filter layer output, and κ Rd is a parameter trained using backpropagation. In this way, as κd , σ(bκd) 1, allowing input feature oh, d through the gate. As κd , the gate is shut. The plots in Fig. 5.1c track σ(bκd) over the course of training. The baseline agent objective functions are as follows: J PO(θ) = V πθ + αEs dπθ H[πθ( |s)] J RPO(θ, ϕ) = V πθ αEs dπθ KL[πθ( |s), πϕ( |s)] J VDO PO(θ) = Ew νθV πw βKL[νθ( ), p( )] J Manual IA(θ, ϕ) = V πθ αEs dπθ KL[πθ( |s), πϕ( |sd)]; sd = s \ g. In all cases α = 0.1, β = 1.0, and learning rates for all agents were set to 0.0007. Agents were optimized with Adam (Kingma and Ba, 2014). Agent control policies were reset after phase 1. H.2 DEEPMIND CONTROL SUITE Environments/Task Settings We use the walker and cartpole environments from the Deep Mind Control Suite (Tassa et al., 2018). We consider two multitask settings: sequential tasks and parallel tasks. All results are averaged over 10 random seeds, and agents are trained for 500k timesteps. In the sequential task setting, tasks are sampled one at a time without replacement and solved by the agent. The control policy is reset after each task, but the default policy is preserved. For methods which have a default policy which can be preserved, performance on task k is averaged over runs with all possible previous tasks in all possible orders. For example, when walker-run is the third task, performance is averaged over previous tasks being stand then walk and walk then stand. In the parallel task setting, a different task is sampled randomly at the start of each episode, and a one-hot task ID vector is appended to the state observation. Learning was done directly from states, not from pixels. 
Agents The base agent in all cases was SAC with automatic temperature tuning, following Haarnoja et al. (2018). Standard SAC seeks to optimize the maximum-entropy RL objective: J max ent(π) = V π + αEs dπH[π( |s)] = V π + αEs dπKL[π( |s), Unif A] (H.3) Effectively, then, SAC uses a uniform default policy. The RPO algorithms with learned default policies replace KL[π( |s), Unif A] with KL[π( |s), πw( |s)] (or KL[πw( |s), π( |s)]). As MDL-C, RPO, and TVPO require that the control policy approximate the optimal policy before being used to generated the a learning signal for the default policy, in the sequential setting, the default policy is updated only after halfway through training. Because variational dropout can cause the network to over-sparsify (and not learn the learn adequately) if turned on too early in training, we follow the strategy of Molchanov et al. (2017), linearly ramping up a coefficient β on the variational dropout KL from 0 to 1 starting from 70% through training to 80% through training. Note that MANUALIA is not Published as a conference paper at ICLR 2023 applicable to the sequential task setting, as there is no explicit goal feature. In the sequential task setting, we took inspiration from Haarnoja et al. (2018) and Abdolmaleki et al. (2018) and reframed the soft KL penalty for methods with learned default policies as a constraint, i.e., max π V π αEKL[πw, π] max π V π s.t. EKL[πw, π] ε, where ε > 0 was a target KL divergence. Under this formulation, α is treated as a dual variable via Lagrangian relaxation and optimized with the following objective: max α 0 J(α) := EαKL[πw, π] αε. In the parallel task setting, we convert the base SAC agent into the multitask variant used by Yu et al. (2019), in which the agent learns a vector of temperature parameters [α1, . . . , αK], one for each task. In this setting, we found it more effective to set α to a constant value. Test performance was computed by averaging performance across all K tasks presented to the agent. The baseline agent objectives are as in Eq. (H.2), and the Distral objective is given by J Distral(θ, ϕ) = V πθ αEs dπθ KL[πθ( |s), πϕ( |s)] + λEs dπθ H[πθ( |s)]. TVPO is trained in the same way as RPO, with the difference being that the default policy objective is to predict the control policy action, rather than a distillation objective. Hyperparameters shared by all agents can be viewed in Table 2. As a note on performance, Distral performs very strongly in the parallel task setting, with overall performance slightly worse than MDL-C on Walker and virtually the same on Cartpole. However, the gap is significantly greater in the sequential setting, particularly on Walker. We hypothesize that this is due to the fact that by regularizing the control policy to be close to the default policy, but also encouraging the control policy to have high entropy (rather than regularizing the default policy as MDL-C does), Distral can in effect provide a conflicting objective to the control policy when strong structure is present. In particular, on Walker, the optimal policies for each task have significant overlap, and so by encouraging high entropy in the control policy even on the third task, Distral negates the effect of an informative default policy. As evidence, both RPO and TVPO, which only regularize the control policy to be close to the default policy, perform significantly more strongly on Walker in the sequential setting. 
Hyperparameter Value Collection Steps 1000 Random Action Steps 10000 Network Hidden Layers 256:256 Learning Rate 3 10 4 Optimizer Adam Replay Buffer Size 1 106 Action Limit [ 1, 1] Exponential Moving Avg. Parameters 5 10 3 (Critic Update:Environment Step) Ratio 1 (Policy Update:Environment Step) Ratio 1 Expected KL/Entropy Target 0.2/ dim(A) Policy Log-Variance Limits [ 20, 2] Table 2: DM control suite hyperparameters, used for all experiments. In the parallel setting, α was simply set to 0.1 for methods with learned default policies. I ADDITIONAL EXPERIMENTAL RESULTS I.1 FOURROOMS I.2 DEEPMIND CONTROL SUITE Published as a conference paper at ICLR 2023 Method Goal Change Contingency Change PO 1.25e5 1.76e4 8.80e4 1.64e4 RPO 1.77e5 1.11e4 1.04e5 2.20e4 VDO-PO 1.48e5 1.91e4 8.23e4 1.98e4 Manual IA 1.23e5 2.51e4 7.69e4 2.89e4 MDL-C 1.08e5 2.44e4 5.11e4 1.70e4 Table 3: Four Rooms: Average cumulative regret across 8 random seeds in phase 2 of the goal change and contingency change experiments for each method. values are standard error. Method Cartpole Walker SAC 1.25e5 1.76e 3.42e5 6.10e4 RPO-SAC (k = 3) 1.77e5 1.11e4 1.04e5 2.20e4 VDO-SAC 1.48e5 1.91e4 8.23e4 1.98e4 MDL-C (k = 1) 1.23e5 2.51e4 7.69e4 2.89e4 MDL-C (k = 2) 1.08e5 2.44e4 5.11e4 1.70e4 MDL-C (k = 3) 1.08e5 2.44e4 5.11e4 1.70e4 Table 4: DM Control Suite, Sequential: Average cumulative regret across 8 random seeds in the sequential setting. values are standard error. Method Cartpole Walker SAC 1.01e5 2.01e3 1.46e5 5.11e3 Manual IA 9.90e4 1.87e3 1.50e5 3.86e3 MDL-C 9.47e4 8.36e2 1.31e5 1.35e3 Table 5: DM Control Suite, Parallel: Average cumulative regret across 8 random seeds in the parallel task setting. values are standard error. Figure I.1: Heatmaps of KL[πθ( |s), πw( |s)] s S for RPO and KL[πθ( |s), π w( |s)] s S, where w = Eνw for MDL-C, averaged over all possible goal states. The RPO default policy nearly perfectly matches the control policy, while the MDL-C default policy diverges most strongly from the control policy at the doorways. This is because the direction chosen by the policy in the doorways is highly goal-dependent. Because the MDL-C default policy learns to ignore the goal feature, it s roughly uniform in the doorways, whereas the control policy is highly deterministic, having access to the goal feature. Published as a conference paper at ICLR 2023 0 48000 96000 144000 192000 240000 timesteps feature weight walker (parallel) orientation height velocity task id 0 28000 56000 84000 112000 140000 timesteps 1.0 cartpole (parallel) position velocity task id Figure I.2: Without a sparse prior, RPO does not learn to ignore spurious input features. 0 50000 100000 150000 200000 250000 timesteps walker (sequential) 0 50000 100000 150000 200000 250000 timesteps cartpole (sequential) Figure I.3: MDL-C s learned αs in the DMC sequential setting. Because α tends to decay, the control policy is able to specialize to the current task later in training. Results averaged over eight random seeds; error shading denotes standard error. 0 50000 100000 150000 200000 250000 timesteps test reward walker (parallel) Manual IA Manual IA+ MDL-C 0 20000 40000 60000 80000 100000 120000 140000 timesteps cartpole (parallel) Figure I.4: To test the effect of information asymmetry on its on performance, we trained a variant of MANUALIA in which we withheld the input features that MDL-C learned to gate out (Fig. 5.2) in addition to the task ID feature. We call this modified method MANUALIA+. 
Average performance is plotted above over 10 seeds, with the shading representing one unit of standard error. We can see that while MANUALIA+ narrowly outperforms MANUALIA, the performance gains of MDL-C can t solely be ascribed to effective information asymmetry. Published as a conference paper at ICLR 2023 0 100000 200000 300000 400000 500000 timesteps test reward stand (parallel) MT-SAC Distral Manual IA RPO MDL-C 0 100000 200000 300000 400000 500000 timesteps 1000 walk (parallel) 0 100000 200000 300000 400000 500000 timesteps run (parallel) Figure I.5: Test reward on each individual task in the walker domain over the course of parallel task training. Average performance is plotted above over 10 seeds, with the shading representing one unit of standard error. We can see the biggest performance difference on walker, run, the most challenging task. 0 50000 100000 150000 200000 250000 timesteps test reward balance (parallel) SAC Distral Manual IA RPO MDL-C 0 50000 100000 150000 200000 250000 timesteps swingup (parallel) 0 50000 100000 150000 200000 250000 timesteps balance_sparse (parallel) 0 50000 100000 150000 200000 250000 timesteps swingup_sparse (parallel) Figure I.6: Test reward on each individual task in the cartpole domain over the course of parallel task training. Average performance is plotted above over 10 seeds, with the shading representing one unit of standard error. Interestingly, unlike in the sequential learning setting, joint training seems to impede performance on swingup sparse, with no method succeeding.