# MetaCURL: Non-stationary Concave Utility Reinforcement Learning

Bianca Marin Moreno (Inria), Margaux Brégère (Sorbonne Université), Pierre Gaillard (Inria), Nadia Oudjane (EDF R&D)

38th Conference on Neural Information Processing Systems (NeurIPS 2024). Affiliations: Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France; Sorbonne Université, LPSM, Paris, France; EDF Lab, 7 bd Gaspard Monge, 91120 Palaiseau, France.

Abstract. We explore online learning in episodic Markov decision processes in non-stationary environments (changing losses and probability transitions). Our focus is the Concave Utility Reinforcement Learning problem (CURL), an extension of classical RL that handles convex performance criteria expressed in terms of the state-action distributions induced by agent policies. While various machine learning problems can be written as CURL, its non-linearity invalidates the traditional Bellman equations. Despite recent solutions to classical CURL, none address non-stationary MDPs. This paper introduces MetaCURL, the first CURL algorithm for non-stationary MDPs. It employs a meta-algorithm that runs multiple instances of a black-box algorithm over different intervals and aggregates their outputs via a sleeping-expert framework. The key hurdle is partial information due to MDP uncertainty. Under partial information on the probability transitions (uncertainty and non-stationarity coming only from external noise, independent of the agent's state-action pairs), the algorithm achieves optimal dynamic regret without prior knowledge of MDP changes. Unlike approaches for RL, MetaCURL handles adversarial losses. We believe our approach for managing non-stationarity with experts can be of interest to the RL community.

1 Introduction

We consider the task of learning in an episodic Markov decision process (MDP) with a finite state space $\mathcal{X}$, a finite action space $\mathcal{A}$, episodes of length $N$, and a probability transition kernel $p := (p_n)_{n \in [N]}$ such that, for all $(x,a) \in \mathcal{X} \times \mathcal{A}$, $p_n(\cdot|x,a) \in \Delta_{\mathcal{X}}$. For any finite set $B$, we denote by $\Delta_B$ the simplex induced by this set, and by $|B|$ its cardinality. For all $d \in \mathbb{N}$ we let $[d] := \{1, \ldots, d\}$. At each time step $n$, an agent in state $x_n$ chooses an action $a_n \sim \pi_n(\cdot|x_n)$ by means of a policy, and moves to the next state $x_{n+1} \sim p_{n+1}(\cdot|x_n, a_n)$, inducing a state-action distribution sequence $\mu^{\pi,p} := (\mu^{\pi,p}_n)_{n \in [N]}$, where $\mu^{\pi,p}_n \in \Delta_{\mathcal{X} \times \mathcal{A}}$ for all $n \in [N]$. In many applications of learning in episodic MDPs, an agent aims at finding an optimal policy $\pi$ maximizing/minimizing a concave/convex function $F$ of its state-action distribution, known as the Concave Utility Reinforcement Learning (CURL) problem:

$$\min_{\pi \in (\Delta_{\mathcal{A}})^{\mathcal{X} \times N}} F(\mu^{\pi,p}). \qquad (1)$$

CURL extends reinforcement learning (RL) from linear to convex losses. Many machine learning problems can be written in the CURL setting, including: RL, where for a loss function $\ell$, $F(\mu^{\pi,p}) = \langle \ell, \mu^{\pi,p} \rangle$; pure RL exploration [28], where $F(\mu^{\pi,p}) = \langle \mu^{\pi,p}, \log(\mu^{\pi,p}) \rangle$; imitation learning [26, 35] and apprenticeship learning [55, 1], where $F(\mu^{\pi,p}) = D_g(\mu^{\pi,p}, \mu^*)$, with $D_g$ a Bregman divergence induced by a function $g$ and $\mu^*$ a behavior to be imitated; certain instances of mean-field control [7], where $F(\mu^{\pi,p}) = \langle \ell(\mu^{\pi,p}), \mu^{\pi,p} \rangle$; mean-field games with potential rewards [34]; among others. The CURL problem breaks the additive structure inherent in standard RL, invalidating the classical Bellman equations and requiring the development of new algorithms.
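As a concrete illustration of these objects (a minimal NumPy sketch of ours, not the paper's code; array shapes and helper names are assumptions), the state-action distributions $\mu^{\pi,p}$ can be computed by the forward recursion formalized later in Eq. (2), and different CURL instances simply plug a different $F$ into the same pipeline:

```python
import numpy as np

def state_action_distributions(pi, p, mu0):
    """Forward recursion for mu^{pi,p} (formalized later in Eq. (2)).

    pi : array (N, X, A)     pi[n, x, a]    = policy prob. of action a in state x at step n
    p  : array (N, X, A, X)  p[n, x, a, y]  = transition prob. to state y from (x, a)
    mu0: array (X, A)        initial state-action distribution
    Returns mu with mu[n] a distribution over X x A for each step n.
    """
    N = pi.shape[0]
    mu = np.zeros((N + 1,) + mu0.shape)
    mu[0] = mu0
    for n in range(N):
        rho_next = np.einsum("xa,xay->y", mu[n], p[n])  # marginal over next states
        mu[n + 1] = rho_next[:, None] * pi[n]           # multiply by the policy
    return mu[1:]

# Two of the CURL objectives listed above:
def rl_objective(mu, losses):          # F(mu) = <loss, mu>   (classical RL)
    return float(np.sum(losses * mu))

def entropy_objective(mu, eps=1e-12):  # F(mu) = <mu, log mu> (pure exploration)
    return float(np.sum(mu * np.log(mu + eps)))
```

For classical RL the objective is linear in $\mu$, which is exactly what the Bellman equations exploit; the entropy objective above is already non-linear in $\mu$, which is why CURL requires different machinery.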
Most of existing works on CURL focus on stationary environments [28, 57, 58, 5, 56, 25, 12, 11], where both the objective function F and the probability transition kernel p remain the same across episodes. However, in practical scenarios, environments are rarely stationary. The work of [39] is the first to address online CURL with objective functions that can change arbitrarily between episodes, also known as adversarial losses [19]. However, their work assumes stationary probability kernels and presents results in terms of static regret (performance comparable to an optimal policy). In non-stationary scenarios, it is more relevant to minimize dynamic regret the gap between the learner s total loss and that of any policy sequence (see Eq. (5) for formal definition). In this work we address this problem by introducing the first algorithm for CURL handling adversarial objective functions and non-stationary probability transitions, achieving near-optimal dynamic regret. High-level idea. Our approach, called Meta CURL, draws inspiration from the online learning literature. In online learning [9], non-stationarity is often managed by running multiple black-box algorithm instances from various starting points and dynamically selecting the best performer using an "expert" algorithm. This strategy has demonstrated effectiveness in settings with complete information [29, 59, 47, 33]. With Meta CURL, we extend this concept to decision-making in MDPs. Unlike classical online learning, the main challenge faced is uncertainty. We assume that the probability transition kernel in each episode has a known deterministic structure but is affected by an external noise with unknown distribution, placing us in a setting with only partial information (see Section 2 for more details). The learner is then unable to observe the agent s loss under policies other than the one played. Meta CURL is a general algorithm that can be applied with any black-box algorithm with low dynamic regret in near-stationary environments. CURL approaches suitable as black-boxes rely on parametric algorithms that would require prior knowledge of the MDP changes to tune their learning rate. Meta CURL also addresses this challenge by simultaneously running multiple learning rates and weighting them in direct proportion to their empirical performance. Meta CURL achieves optimal regret of order O π T + min { p T, T 2/3( p)1/3} , where p and p represent the frequency and magnitude of changes of the probability transition kernel respectively, and π is the magnitude of changes of the policy sequence we compare ourselves with in dynamic regret (see Eqs. (6) and (7) for formal definitions). Meta CURL does not require previous knowledge of the degree of non-stationarity of the environment, and can handle adversarial losses. To ensure completeness, we show that Greedy MD-CURL from [39] fulfills the requirements to serve as a black-box algorithm. This is the first dynamic regret analysis for a CURL approach. Comparisons. Without literature on non-stationary CURL, we review non-stationary RL approaches. Most methods [24, 13, 45, 17, 20, 40, 21] typically rely on prior knowledge of the MDP s nonstationarity degree, while Meta CURL does not. Let l and l represent the frequency and magnitude of change in the RL loss function, respectively1. Recently, [54] achieved a regret of O min { p ( p + l )T, T 2/3( p + l)1/3} , a near-optimal result as demonstrated by [40], without requiring prior knowledge of the environment s variation. 
However, this regret bound is tied to changes in the loss functions, making it ineffective against adversarial losses. In contrast, rather than depending on the magnitude of variation of the loss function, MetaCURL's bound depends on the magnitude of variation of the policy sequence used for comparison in the dynamic regret. This allows it to handle adversarial losses and to compare against policies with a more favorable bias-variance trade-off, which may not align with the optimal policies for each loss. In addition, we improve this dependency by paying it as $\Delta^{\pi}_* \sqrt{T}$ instead of $(\Delta^{\pi}_*)^{1/3} T^{2/3}$. We summarize the comparisons in Table 1.

Table 1: Comparison of our results with the state of the art in non-stationary RL. Here, $\Lambda_p$, $\Delta_p$ and $\Delta^{\pi}_*$ are defined in (6) and (7); $\Lambda_l$ and $\Delta_l$ measure the variations of the RL loss function¹. We introduce $D_T(\Lambda, \Delta) := \min\{\sqrt{\Lambda T},\, T^{2/3}\Delta^{1/3}\}$.

| Algorithm | Dynamic regret in $\mathcal{O}(\cdot)$ | RL | CURL | Adv. losses | No prior knowledge | Exploration |
|---|---|---|---|---|---|---|
| MetaCURL (ours) | $D_T(\Lambda_p, \Delta_p) + \Delta^{\pi}_* \sqrt{T}$ | ✓ | ✓ | ✓ | ✓ | ✗ |
| SoTA in RL [54] | $D_T(\Lambda_p + \Lambda_l, \Delta_p + \Delta_l)$ | ✓ | ✗ | ✗ | ✓ | ✓ |

¹ $\Delta_l := 1 + \sum_{t=1}^{T-1} \Delta^t_l$ and $\Lambda_l := 1 + \sum_{t=1}^{T-1} \mathbb{1}\{\ell^t \neq \ell^{t+1}\}$, where $\Delta^t_l := \sum_{n=1}^{N} \max_{x,a} |\ell^t_n(x,a) - \ell^{t+1}_n(x,a)|$ and $\ell^t_n(x,a)$ is the expected loss suffered by choosing action $a$ in state $x$ at step $n$ of round $t$.

Other related works. The studies by [43, 42] examine the difference between optimizing the objective over infinite trials and the expectation of the objective over a single trial, challenging the traditional CURL formulation in Eq. (1). Here, we retain the classic formulation to align with existing CURL research. Other works on RL with nonlinear objective functions are [46, 16], which focus on rewards over trajectories rather than individual states. In addition to non-stationarity, there is a series of works on RL with adversarial losses but stationary probability transitions, with results only on static regret [48, 30, 18, 50, 32, 14]. Another line of research is known as corruption-robust RL. It differs from non-stationary MDPs in that it assumes a ground-truth MDP and measures the adversary's malice by the degree of ground-truth corruption [31, 38, 10, 60, 53].

Contributions. We summarize our main contributions below:

- We introduce MetaCURL, the first algorithm for non-stationary CURL. Under the framework described in Section 2, MetaCURL achieves the optimal dynamic regret bound of order $\mathcal{O}\big(\Delta^{\pi}_* \sqrt{T} + \min\{\sqrt{\Lambda_p T},\, T^{2/3}\Delta_p^{1/3}\}\big)$, without requiring previous knowledge of the degree of non-stationarity of the MDP. MetaCURL handles fully adversarial losses and improves the dependency of the regret on the total variation of policies.
- MetaCURL is the first adaptation of Learning with Expert Advice (LEA) to deal with uncertainty in non-stationary MDPs.
- We also establish the first dynamic regret upper bound for an online CURL algorithm in a nearly stationary environment, which can serve as a black-box routine for MetaCURL.

Notations. Let $\|\cdot\|_1$ be the $L_1$ norm, and for all $v := (v_n)_{n \in [N]}$ such that $v_n \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, define $\|v\|_{\infty,1} := \sup_{1 \le n \le N} \|v_n\|_1$.

2 General framework: non-stationary CURL

When an agent plays a policy $\pi := (\pi_n)_{n \in [N]}$ in an episodic MDP with probability transition $p$, it induces a state-action distribution sequence (also called the occupancy measure [61]), which we denote by $\mu^{\pi,p} := (\mu^{\pi,p}_n)_{n \in [N]}$, with $\mu^{\pi,p}_n \in \Delta_{\mathcal{X} \times \mathcal{A}}$. It can be computed recursively for all $(x,a) \in \mathcal{X} \times \mathcal{A}$ and $n \in [N]$ by taking $\mu^{\pi,p}_0(x,a) = \mu_0(x,a)$ fixed, and

$$\mu^{\pi,p}_n(x,a) = \sum_{(x',a') \in \mathcal{X} \times \mathcal{A}} \mu^{\pi,p}_{n-1}(x',a')\, p_n(x|x',a')\, \pi_n(a|x). \qquad (2)$$

Offline CURL. The classic CURL optimization problem in Eq.
(1) considers minimizing a function $F : (\Delta_{\mathcal{X} \times \mathcal{A}})^N \to \mathbb{R}$, here defined as $F(\mu) := \sum_{n=1}^N f_n(\mu_n)$ with $f_n$ a convex function of $\mu_n$ taking values in $[0,1]$, across all policies that induce $\mu^{\pi,p}$. Note that $F$ is not convex in the policy $\pi$. To convexify the problem, we define the set of state-action distributions satisfying the Bellman flow of an MDP with transition kernel $p$ as

$$M^p_{\mu_0} := \Big\{\mu : \sum_{a' \in \mathcal{A}} \mu_n(x',a') = \sum_{x \in \mathcal{X}, a \in \mathcal{A}} p_n(x'|x,a)\, \mu_{n-1}(x,a),\ \ \forall x' \in \mathcal{X},\ n \in [N]\Big\}. \qquad (3)$$

For any $\mu \in M^p_{\mu_0}$, there exists a strategy $\pi$ such that $\mu^{\pi,p} = \mu$. It suffices to take $\pi_n(a|x) \propto \mu_n(x,a)$ when the normalization factor is non-zero, and arbitrarily defined otherwise. There is thus an equivalence between the CURL problem (an optimization over policies) and a convex optimization problem over state-action distributions satisfying the Bellman flow:

$$\min_{\pi \in (\Delta_{\mathcal{A}})^{\mathcal{X} \times N}} F(\mu^{\pi,p}) \;\equiv\; \min_{\mu \in M^p_{\mu_0}} F(\mu). \qquad (4)$$

Online CURL. In this paper we consider the online CURL problem in a non-stationary setting. We assume a finite-horizon scenario with $T$ episodes. An oblivious adversary generates a sequence of changing objective functions $(F^t)_{t \in [T]}$, with $F^t$ fully communicated to the learner only at the end of episode $t$. We assume $F^t$ is $L_F$-Lipschitz with respect to the $\|\cdot\|_{\infty,1}$ norm for all $t$. The probability transition kernel is also allowed to evolve over time and is denoted by $p^t$ at episode $t$. The learner's objective is then to compute a sequence of strategies $(\pi^t)_{t \in [T]}$ minimizing the total loss $L_T := \sum_{t=1}^T F^t(\mu^{\pi^t, p^t})$, while facing adversarial objective functions $F^t$ and changing probability transition kernels $p^t$. To measure the learner's performance, we use the notion of dynamic regret (the difference between the learner's total loss and that of any policy sequence $(\pi^{t,*})_{t \in [T]}$):

$$R_{[T]}\big((\pi^{t,*})_{t \in [T]}\big) := \sum_{t \in [T]} F^t(\mu^{\pi^t, p^t}) - F^t(\mu^{\pi^{t,*}, p^t}). \qquad (5)$$

Non-stationarity measures. We consider the following two non-stationarity measures $\Lambda_p$ and $\Delta_p$ on the probability transition kernels, which respectively measure abrupt and smooth variations:

$$\Lambda_p := 1 + \sum_{t=1}^{T-1} \mathbb{1}\{p^t \neq p^{t+1}\}, \qquad \Delta_p := 1 + \sum_{t=1}^{T-1} \Delta^t_p, \qquad \Delta^t_p := \max_{n,x,a} \big\|p^t_n(\cdot|x,a) - p^{t+1}_n(\cdot|x,a)\big\|_1. \qquad (6)$$

Regarding dynamic regret, we define for any sequence of policies $(\pi^{t,*})_{t \in [T]}$ its non-stationarity measure as

$$\Delta^{\pi}_* := \sum_{t=1}^{T-1} \Delta^t_{\pi^*}, \qquad \Delta^t_{\pi^*} := \max_{n \in [N], x \in \mathcal{X}} \big\|\pi^{t,*}_n(\cdot|x) - \pi^{t+1,*}_n(\cdot|x)\big\|_1. \qquad (7)$$

Moreover, for any interval $I \subseteq [T]$, we write $\Delta^I_p := \sum_{t \in I} \Delta^t_p$ and $\Delta^I_{\pi^*} := \sum_{t \in I} \Delta^t_{\pi^*}$.

Dynamics hypothesis. For each episode $t$, let $(x^t_0, a^t_0) \sim \mu_0(\cdot)$, and for all time steps $n \in [N]$,

$$x^t_{n+1} = g_n(x^t_n, a^t_n, \epsilon^t_n), \qquad (8)$$

where $g_n$ represents the deterministic part of the dynamics, and $(\epsilon^t_n)_{n \in [N]}$ is a sequence of independent external noises such that $\epsilon^t_n \sim h^t_n(\cdot)$, where $h^t_n$ is any centered distribution. Note that these dynamics imply that the probability transition kernel can be written as $p^t_{n+1}(x'|x,a) = \mathbb{P}\big(g_n(x,a,\epsilon^t_n) = x'\big)$. Different variants of this problem can be considered, depending on the prior information available about the dynamics in Eq. (8). In this article we consider the case where $g_n$ is fixed and known by the learner, but $h^t_n$ is unknown and can change (hence the source of uncertainty and non-stationarity of the transitions). To the best of our knowledge, there are no black-box algorithms in the literature that achieve sublinear regret for online CURL with adversarial losses without relying on model assumptions. In applying RL methods to CURL, we believe model-optimistic approaches like UCRL (Upper Confidence RL [4]) could be adapted. However, these methods are computationally expensive, as they require solving an additional optimization problem in every episode.
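To fix ideas about Eq. (8), the toy sketch below (ours; the chain dynamics, noise laws, and function names are invented for illustration) shows how a known deterministic map $g_n$ combined with an unknown, changing noise distribution $h^t_n$ induces exactly the kind of non-stationary kernel considered here:

```python
import numpy as np

# Hypothetical example: a chain of |X| states where the action pushes the agent
# left/right and an external noise (whose law is unknown to the learner) perturbs it.
X, A = 5, 2

def g(n, x, a, eps):
    """Known deterministic part of the dynamics, Eq. (8)."""
    step = 1 if a == 1 else -1
    return int(np.clip(x + step + eps, 0, X - 1))

def induced_kernel(n, noise_support, noise_probs):
    """p^t_{n+1}(x'|x,a) = P(g_n(x,a,eps) = x') for a given noise law h^t_n."""
    p = np.zeros((X, A, X))
    for x in range(X):
        for a in range(A):
            for eps, w in zip(noise_support, noise_probs):
                p[x, a, g(n, x, a, eps)] += w
    return p

# Two episodes with different (centered) noise distributions: the kernel changes
# even though g is fixed.
p_t  = induced_kernel(0, noise_support=[-1, 0, 1], noise_probs=[0.1, 0.8, 0.1])
p_t1 = induced_kernel(0, noise_support=[-1, 0, 1], noise_probs=[0.3, 0.4, 0.3])
delta_t = np.abs(p_t - p_t1).sum(axis=-1).max()   # Delta_p^t from Eq. (6)
```

Only the noise law changes between the two kernels built above, yet $\Delta^t_p$ from Eq. (6) is non-zero; under the present hypothesis this is the sole source of uncertainty and non-stationarity the learner has to track.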
The black-box algorithm for CURL we consider from [39] provides closed-form solutions, which is more computationally efficient, but requires the same dynamic assumption as in Eq. (8). Another class of RL methods is policy optimization (PO), which directly optimizes the policy and often yields closed-form solutions, leading to faster performance. Recent theoretical work [37] has shown that PO methods can achieve near-optimal regret without model assumptions. However, these methods rely on RL s Bellman equations, which do not apply to CURL due to its non-linear nature. We believe that the Meta CURL analysis could potentially be extended to the case where gn is unknown but belongs to a parametric family. We leave this extension for future work. This particular dynamic is also motivated by many real-world applications: Controlling a fleet of drones in a known environment, subject to external influences due to weather conditions or human interventions. Addressing data center power management aiming to cut energy expenses while maintaining service quality. Workload fluctuations cause dynamic job queue transitions, and volatile electricity prices lead to varying operational costs. The probabilities of task processing by each server are predetermined, but the probabilities of task arrival are uncertain [6]. As renewable energy use increases and energy demand grows, balancing production and consumption becomes harder. Certain devices, like electric vehicle batteries and water heaters, can serve as flexible energy storage options. However, this requires electric grids to establish policies regulating when these devices turn on or off to match a desired consumption profile. These profiles can fluctuate daily due to changes in energy production levels. Despite knowing the devices physical dynamics, household consumption habits remain unpredictable and constantly changing [51, 41]. Outline. In this paper, we propose a novel approach to handle non-stationarity in MDPs, being the first to propose a solution to CURL within this context. We begin in Section 3 by discussing the idea behind our algorithm s construction and the key challenges within our framework. Section 4 introduces Meta CURL, while Section 5 presents the main results of our regret analysis. The proofs specifics are provided in the appendix. 3 Main idea A hypothetical learner who achieves optimal regret. Let m > 1. Assume a hypothetical learner that could compute a sequence of restart times 1 = t1 < . . . < tm+1 = T + 1, where for each i [m] we let Ii := [ti, ti+1 1], such that p Ii p/m. (9) Consider any algorithm that, when computing (πt)t I with learning rate λ for any interval I [T], attains a dynamic regret relative to any sequence of policies (πt, )t I upper bounded by RI (πt, )t I c1λ|I| + λ 1(c2 π I + c3) + |I| p I, (10) where (cj)j [3] are constants that may depend on the MDP parameters, and on the interval size only in logarithmic terms. This kind of regret bound holds for Greedy MD-CURL from [39] as we show in Appendix G. Suppose the hypothetical learner could also access π to calculate the optimal learning rate. Hence, playing such an algorithm for all horizon T with the optimal learning rate, the learner would have a dynamic regret upper bounded by R[T ] (πt, )t [T ] 2 p c1(c2 π + c3m)T + T pm 1. Optimizing over m, the learner would obtain the optimal regret of order O π T + ( p)1/3T 2/3 . 
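For completeness, the two optimization steps behind the last display can be spelled out as follows (a sketch under the stated assumption $\Delta_p^{I_i} \le \Delta_p/m$, dropping logarithmic factors and using $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$):

```latex
% Summing the per-interval bound of Eq. (10) over I_1, ..., I_m, then tuning lambda and m.
\begin{align*}
R_{[T]} &\le c_1 \lambda T + \lambda^{-1}\big(c_2 \Delta^{\pi}_* + c_3 m\big)
           + \sum_{i=1}^{m} |I_i| \,\Delta_p^{I_i}
         \;\le\; c_1 \lambda T + \lambda^{-1}\big(c_2 \Delta^{\pi}_* + c_3 m\big) + T \Delta_p\, m^{-1} \\
&\le 2\sqrt{c_1 \big(c_2 \Delta^{\pi}_* + c_3 m\big) T} + T \Delta_p\, m^{-1}
  && \text{with } \lambda = \sqrt{\tfrac{c_2 \Delta^{\pi}_* + c_3 m}{c_1 T}} \\
&\le 2\sqrt{c_1 c_2 \Delta^{\pi}_*\, T} + 3\,(c_1 c_3)^{1/3}\, T^{2/3} \Delta_p^{1/3}
  && \text{with } m = \big(T \Delta_p^2 / (c_1 c_3)\big)^{1/3},
\end{align*}
% which is of order \Delta^{\pi}_* \sqrt{T} + T^{2/3} \Delta_p^{1/3},
% since \sqrt{\Delta^{\pi}_* T} \le (1 + \Delta^{\pi}_*)\sqrt{T}.
```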
In the case where the MDP is piece-wise stationary, if the learner takes Ii such that p Ii = 0, it obtains a regret of order O( p T), where p is the number of times the probability transitions of the MDP change over [T]. A meta algorithm to learn restart times. In reality, the restart times of Eq. (9), and the optimal learning rate, are unknown to the learner. Hence, we propose to build a meta aggregation algorithm to learn both. Let E represent a parametric black-box algorithm with dynamic regret as in Eq. (10). We introduce a meta algorithm M that, takes as input a finite set of learning rates Λ, and at each episode t, initializes |Λ| instances of E, denoted as Et,λ for each λ Λ. Each Et,λ operates independently within the interval [t, T]. At time t, M combines the decisions from the active runs {Es,λ}s t,λ Λ by weighted average. The idea is that at time t, some of the outputs of {Es,λ}s t,λ Λ are not based on data prior to t < t, so if the environment changes at time t , these outputs can be given a greater weight by the meta algorithm, enabling it to adapt more quickly to the change. At the same time, we expect a larger weight will be given to the empirically best learning rate. Let M(E, Λ) be the complete algorithm. Remark 3.1. The meta-algorithm increases the computational complexity of the parametric blackbox algorithm by a factor of T |Λ|, as it requires updating t |Λ| instances at each episode t. By strategically designing intervals to run the black-box algorithms, previous works on online learning have reduced computational complexity to O(log(T)) [15, 29, 27]. Extending our analysis to these intervals is straightforward, but it would complicate the presentation of the paper. Thus, we decided to present our results using the naive choice of intervals. Also, in Section 5, we show that a learning rate grid with |Λ| = log(T) is sufficient to obtain the optimal regret. Regret decomposition. Denote by πt,s,λ the policy output from Es,λ at episode t, for learning rate λ, for all s t, and by πt the policy output by the meta algorithm M(E, Λ) to be played by the learner. The regret of M(E, Λ) can be decomposed as the sum of the regret suffered by the meta algorithm aggregation scheme, M, and the regret from the black-box algorithm, E, played with any learning rate λ Λ. The dynamic regret, defined in Eq. (5), can be decomposed, for any set of intervals Ii = [ti, ti+1 1], with 1 = t1 < . . . < tm+1 = T + 1, and for any learning rate λ Λ, as R[T ] (πt, )t [T ] = t Ii F t(µπt,pt) F t(µπt,ti,λ,pt) | {z } Meta algorithm regret t Ii F t(µπt,ti,λ,pt) F t(µπt, ,pt) | {z } Black-box regret on Ii := Rmeta [T ] + Rblack-box [T ] (πt, )t [T ] . (11) The black-box regret on Ii is exactly the standard regret for T = |Ii| with a learning rate of λ. Hence, in order to prove low dynamic regret for M(E, Λ) we have to: show that M incurs a low dynamic regret in each interval Ii; find a black-box algorithm E for CURL that has dynamic regret as in Eq. (10), and build a learning rate grid Λ allowing us to perform nearly as well as the optimal learning rate. 4 Meta CURL Algorithm We call our meta-algorithm M Meta CURL. It is based on sleeping experts, is parameter-free, and achieves optimal regret. Its construction is described below. 4.1 Learning with expert advice General setting. In Learning with Expert Advice (LEA), a learner makes sequential online predictions u1, . . . , u T in a decision space U, over a series of T episodes, with the help of K experts [22, 36, 9]. 
For each round t, each expert k makes a prediction ut,k, and the learner combines the experts predictions by computing a vector vt := (vt,1, . . . , vt,K) SK, and predicting the convex combination of experts prediction ut := PK k=1 vt,kut,k. The environment then reveals a convex loss function ℓt : U R. Each expert suffers a loss ℓt,k := ℓt(ut,k), and the learner suffers a loss ˆℓt := ℓt(ut). The learner s objective is to keep the cumulative regret with respect to each expert as low as possible. For each expert k, this quantity is defined as Reg[T ](k) := PT t=1 ˆℓt ℓt,k. Sleeping experts. In our case, each black-box algorithm is an expert that does not produce solutions outside its active interval. This problem can be reduced to the sleeping expert problem [8, 23], where experts are not required to provide solutions at every time step. Let It,k {0, 1} define a signal equal to 1 if expert k is active at episode t and 0 otherwise. The algorithm knows (It,k)k [K] and assigns a zero weight to sleeping experts (It,k = 0 implies vt,k = 0). We would like to have a guarantee with respect to expert k [K] but only when it is active. Hence, we now aim to bound a cumulative regret that depends on the signal It,k: Regsleep [T ] (k) := PT t=1 It,k(ˆℓt ℓt,k). There is a generic reduction from the sleeping expert framework to the general LEA setting [3, 2] (see Appendix A.1). 4.2 Meta-aggregation scheme In every episode t, for every learning rate λ Λ and s t, an instance Es,λ of the black-box algorithm acts as an expert computing a policy πt,s,λ. The meta algorithm aims to aggregate these predictions using a sleeping expert approach based on the expert s losses. However, within CURL s framework, the meta algorithm faces two challenges: Uncertainty. At the episode s end, the learner has full information about the objective function F t. If the learner also knew pt, they could recursively calculate the corresponding state-action distribution µπt,s,λ,pt using Eq. (2) and observe the actual loss of each expert, denoted as F t(µπt,s,λ,pt). However, given that pt is unknown to the learner, the true loss remains unobservable. Consequently, the metaalgorithm needs to create an estimator ˆpt for pt and utilize it to estimate the losses. We propose a method to compute an estimator ˆpt in Subsection 4.3. Convexity. As discussed in Section 2, the objective functions F t are not convex over the space of polices. However, CURL is equivalent to a convex problem over the state-action distributions satisfying the Bellman s flow as shown in Eq. (4). Therefore, instead of aggregating policies, the meta algorithm aggregates the associated state-action distributions using the probability estimator ˆpt and the recursive scheme at Eq. (2). We detail Meta CURL in Alg. 1 when employed with the Exponentially Weighted Average forecaster (EWA) as the sleeping expert subroutine (we detail EWA in Appendix A.2). 4.3 Building an estimator of pt As discussed earlier, applying the learning with experts framework requires estimating the loss of non-played expert policies, which depends on estimating the non-stationary transition probabilities ˆpt. Standard RL techniques for bounding the L1 norm between the empirical estimator ˆpt and the true Algorithm 1 Meta CURL with EWA 1: Input: number of episodes T, finite set of learning rates Λ, black-box algo. E, EWA learning rate η = p 8 log(T)T 2: Initialization: ˆp1 n( |x, a) := 1 |X| for all n [N], (x, a) X A 3: for t = 1, . . . 
, T do 4: Start |Λ| new instances of E denoted by Et,λ for all λ Λ, assign each of them a new weight vt,t,λ = 1 |Λ|t, and normalize weight vectors vt,s,λ for s [t 1] such that vt := (vt,s,λ)s t,λ Λ is a probability vector in Rt Λ 5: For s t and λ Λ, Es,λ outputs πt,s,λ 6: Compute recursively µπt,s,λ,ˆpt using Eq. (2) for all s t and λ Λ 7: Aggregate the state-action distributions: µt := Pt s=1 P λ Λ µπt,s,λ,ˆptvt,s,λ 8: Retrieve πt from µt: for all n, (x, a), πt n(a|x) = ( µt n(x,a) P a A µn(x,a ), if µt n(x, a) = 0 1 |A|, if µt n(x, a) = 0 9: Learner plays πt: Agent starts at (xt 0, at 0) µ0( ) 10: for n = 1, . . . , N do 11: Environment draws new state xt n pt n( |xt n 1, at n 1) 12: Learner observes agent s external noise εt n 13: Agent chooses an action at n πt n( |xt n) 14: end for 15: Objective function F t is exposed 16: Compute experts losses ℓt,s,λ := F t(µπt,s,λ,ˆpt), for all s t and λ Λ 17: Compute the new weight vector vt+1: for all s t and λ Λ, vt+1,s,λ = vt,s,λ exp ( ηℓt,s,λ) Pt s =1 P λ Λ vt,s ,λ exp ( ηℓt,s ,λ ) (EWA update) 18: Use agent s external noise trajectory (εt n)n [N] to compute ˆpt+1 as in Subsection 4.3 19: end for dynamics pt [44, 49] are not applicable here due to non-stationarity. To address this, we introduce a second layer of sleeping experts for each (n, x, a) [N] X A, where each expert provides an empirical estimate of pt based on different intervals. We then propose a new loss function in Eq. (12) and conduct a novel regret analysis in Prop. 5.2 to achieve the optimal regret rate. In each episode t, the learner calculates independent samples xt n,x,a pt n( |x, a) utilizing the external noise sequence (εt n)n [N] observed (just let xt n,x,a = gn 1(x, a, εt n 1), see Eq. (8)). Each expert outputs an empirical estimator of pt n( |x, a) using samples across different intervals. We assume T experts, with expert s active in interval [s, T]. Expert s at episode t > s outputs: ˆpt,s n (x |x, a) = N s:t 1 n,x,a (x ) (t s) , with N s:t 1 n,x,a (x ) := q=s 1{xq n,x,a=x }. We let ˆpt n( |x, a) be the result of employing sleeping EWA with experts ˆpt,s n ( |x, a), for s < t. Typically, in density estimation with EWA, a logarithmic loss log( ) is used. However, in this case log( ) can be unbounded, so we opt here for a smoothed logarithmic loss, given by, for all q SX , x X log q(x) + 1 1{ xtn,x,a=x}, where xt n,x,a pt n( |x, a) + 1 The definition of this non-standard loss is further clarified during the regret analysis in Section 5. This loss function is 1-exp concave (see Lemma 4 of [52]), hence the cumulative regret of EWA with respect to each expert s [T], for all episodes τ [s, T], satisfies Regsleep [s,τ](s) = Pτ t=s ℓt(ˆpt n( |x, a)) ℓt(ˆpt,s n ( |, x, a)) log(T) (for more information on the regret bounds of EWA with exp-concave losses, see Appendix A.2). We describe the complete online scheme to compute ˆpt in Alg. 3 at Appendix B. 5 Regret analysis This section presents the main result concerning Meta CURL s regret analysis. Subsection 5.1 shows an upper bound for Rmeta when Meta CURL is played with EWA and ˆpt is computed as in Subsection. 4.3. Subsection 5.2 introduces a learning rate grid for Meta CURL when the black-box algorithm meets the dynamic regret criteria in Eq. (10), providing an upper bound for Rblack-box. Given the dynamic regret decomposition of Eq. (11), we see that the combination of these results leads to our main result, the full proof of which can be found in appendix (F) : Theorem 5.1 (Main result). Let δ (0, 1). 
Playing Meta CURL, with a parametric black-box algorithm E with dynamic regret as in Eq. (10), with a learning rate grid Λ := 2 j|j = 0, 1, 2, . . . , log2(T)/2 , and with EWA as the sleeping expert subroutine, we obtain, with probability at least 1 2δ, for any sequence of policies (πt, )t [T ], R[T ] (πt, )t [T ] O π T + min p T p , T 2/3( p)1/3 . 5.1 Meta-algorithm analysis Given the uncertainty in the probability transition, the meta regret term can be decomposed as follows: Rmeta [T ] = t=1 F t(µπt,pt) F t(µπt,ˆpt) Rˆ p [T ](πt) (ˆpt estimation) t Ii F t(µπt,ˆpt) F t(µπt,ti,λ,ˆpt) | {z } sleeping LEA regret t Ii F t(µπt,ti,λ,ˆpt) F t(µπt,ti,λ,pt) | {z } Pm i=1 P t Ii Rˆ p Ii(πt,ti,λ) (ˆpt estimation) Sleeping LEA regret. Referring to Thm. A.1 in Appendix A, using sleeping EWA as the sleeping expert subroutine of Meta CURL, with signals It,s = 1 for active experts (s t), experts convex losses ℓt,s,λ := F t(µπt,s,λ,ˆpt), and learner loss ˆℓt := F t(µπt,ˆpt), yields, for any m [T] and for any set of intervals Ii = [ti, ti+1 1], with 1 = t1 < . . . < tm+1 = T + 1, t Ii F t(µπt,ˆpt) F t(µπt,ti,λ,ˆpt) = i=1 Regsleep Ii (ti) 2 log(T|Λ|) 2 log(T|Λ|). ˆpt Estimation regret. In a scenario without uncertainty in the MDP s probability transitions, the meta-algorithm s regret would simply be bounded by Eq. (14), the sleeping expert regret used as a subroutine. However, given the presence of uncertainty, the main challenge in analyzing the meta-regret comes from the regret terms associated with the estimator ˆpt. We outline this analysis in Prop. 5.2. Proposition 5.2. Let δ (0, 1), C := q 1 2 log N|X||A|2|X|T δ , and LF be the Lipschitz constant of F t, with respect to the norm ,1, for all t [T]. With a probability of at least 1 δ, Meta CURL obtains Rˆp [T ](πt) := t=1 F t(µπt,pt) F t(µπt,ˆpt) 2LF N 2|X| p 3|A|C2/3 log(T)1/3T 2/3( p)1/3. For any m [T] and for any set of intervals Ii = [ti, ti+1 1], with 1 = t1 < . . . < tm+1 = T + 1, the same bound is valid for Pm i=1 P t Ii Rˆp Ii(πt,ti,λ). Proof. The proof idea is based mainly on the formulation of ˆpt described in Subsection 4.3. We start by using the convexity of F t to linearize the expression, then we apply Holder s inequality and exploit the LF -Lipschitz property of F t to establish an upper bound based on the L1 norm difference of the state-action distributions induced by the true probability transition and the estimator. Using Lemma C.5 in Appendix C, we then obtain that Rˆp [T ] LF j 1 (x, a) pt j( |x, a) ˆpt j( |x, a) 1. To use the results from Subsection 4.3, we first regularize pt and ˆpt, for each (n, x, a), by averaging each with the uniform distribution over X, that we denote by p0 := 1/|X|. As both probabilities are now lower bounded, we can employ Pinsker s inequality to convert the L1 norm into a KL divergence. The sum over t [T] of the KL divergence can then be decomposed as follows: t=1 KL pt j( |x, a) + p0 ˆpt j( |x, a) + p0 t Ii KL pt j( |x, a) + p0 ˆpt,ti j ( |x, a) + p0 t Ii E xt j,x,a log ˆpt,ti j ( xt j,x,a|x, a) + p0 log ˆpt j( xt j,x,a|x, a) + p0 , where ˆpt,ti j ( |x, a) is the empirical estimate of pt j( |x, a) calculated with the observed data from ti to t 1, and the expectation is over xt j,x,a (pt j( |x, a) + p0)/2. The second term is the cumulative regret of computing ˆpt using EWA with loss as in Eq. (12), and is bounded by m log(T). We finish and give more details of the proof in Appendix D. Prop. 5.2 together with Eq. (14) yields the main result of this subsection: Proposition 5.3 (Meta regret bound). 
With the same assumptions as Prop. 5.2, for any m [T], with probability at least 1 2δ, Rmeta [T ] 4LF N 2|X| p 3|A|C2/3 log(T)1/3T 2/3( p)1/3 + 2 log(T|Λ|). 5.2 Black-box algorithm analysis Assuming E is a parametric black-box algorithm with dynamic regret satisfying Eq. (10) for any learning rate λ > 0, we only need to address the selection of the λs grid and optimize across λ to achieve our final bound on Rblack-box [T ] . Learning rate grid. The dynamic regret of Eq. (10) implies that any two λ that are a constant factor of each other will guarantee the same upper-bound up to essentially the same constant factor. We therefore choose an exponentially spaced grid Λ := 2 j|j = 0, 1, 2, . . . , log2(T)/2 . (15) The meta-algorithm aggregation scheme guarantees that the learner performs as well as the best empirical learning rate. We thus obtain a bound on Rblack-box [T ] , with its proof in Appendix E: Proposition 5.4 (Black-box regret bound). Assume Meta CURL is played with a black-box algorithm satisfying dynamic regret as in Eq. (10), with learning rate grid as in Eq. (15). Hence, for any sequence of policies (πt, )t [T ], Rblack-box [T ] (πt, )t [T ] N c2 π + c3 c1(c2 π + c3m)T + T p Greedy MD-CURL. Greedy MD-CURL, developed by [39], is a computationally efficient policyoptimization algorithm known for achieving sublinear static regret in online CURL with adversarial objective functions within a stationary MDP. In Thm. G.3 of Appendix G, we extend this analysis showing that Greedy MD-CURL also achieves dynamic regret as in Eq. (10). To our knowledge, this is the first dynamic regret result for a CURL algorithm. Hence, Greedy MD-CURL can be used as a black-box for Meta CURL. We detail Greedy MD-CURL in Alg. 4 in Appendix G. 6 Conclusion, discussion, and future work In this paper, we present Meta CURL, the first algorithm for dealing with non-stationarity in CURL, a setting covering many problems in the literature that modifies the standard linear RL configuration, making typical RL techniques difficult to use. We also employ a novel approach to deal with nonstationarity in MDPs using the learning with expert advice framework from the online learning literature. The main difficulty in analyzing this method arises from uncertainty about probability transitions. We overcome this problem by employing a second expert scheme, and show that Meta CURL achieves near-optimal regret. Compared to the RL literature, our approach is more efficient, deals with adversarial losses, and has a better regret dependency concerning the varying losses, but to do so, we need to simplify the assumptions about the dynamics (all uncertainty comes only from the external noise, that is independent of the agent s state-action). There seems to be a trade-off in RL: all algorithms dealing with both non-stationarity and full exploration use UCRL-type approaches, and are thus computationally expensive. We thus leave a question for future work: How can we effectively manage non-stationarity and adversarial losses using efficient algorithms, all while addressing full exploration? [1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004. [2] D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. Journal of Machine Learning Research, 17(23):1 21, 2016. [3] D. Adamskiy, M. K. K. Warmuth, and W. M. Koolen. Putting bayes to sleep. 
In Advances in Neural Information Processing Systems (Neur IPS), volume 25, 2012. [4] P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems (Neur IPS), volume 21, 2008. [5] A. Barakat, I. Fatkhullin, and N. He. Reinforcement learning with general utilities: simpler variance reduction and large state-action space. In International Conference on Machine Learning (ICML), pages 1753 1800, 2023. [6] M. Bayati. Power management policy for heterogeneous data center based on histogram and discrete-time mdp. Electronic Notes in Theoretical Computer Science, 337:5 22, 2018. Proceedings of the Ninth International Workshop on the Practical Application of Stochastic Modelling (PASM). [7] A. Bensoussan, P. Yam, and J. Frehse. Mean Field Games and Mean Field Type Control Theory. Springer Briefs in Mathematics. Springer, 2013. [8] A. Blum. Empirical support for winnow and weighted-majority based algorithms: results on a calendar scheduling domain. In A. Prieditis and S. Russell, editors, Machine Learning Proceedings 1995, pages 64 72, 1995. [9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. [10] Y. Chen, S. S. Du, and K. Jamieson. Improved corruption robust algorithms for episodic reinforcement learning. In M. Meila and T. Zhang, editors, International Conference on Machine Learning (ICML), volume 139, pages 1561 1570, 2021. [11] W. C. Cheung. Exploration-exploitation trade-off in reinforcement learning on online markov decision processes with global concave rewards. Ar Xiv, abs/1905.06466, 2019. [12] W. C. Cheung. Regret minimization for reinforcement learning with vectorial feedback and complex objectives. In Advances in Neural Information Processing Systems (Neur IPS), volume 32, 2019. [13] W. C. Cheung, D. Simchi-Levi, and R. Zhu. Reinforcement learning for non-stationary Markov decision processes: The blessing of (More) optimism. In International Conference on Machine Learning (ICML), volume 119, pages 1843 1854, 2020. [14] A. Cohen, Y. Efroni, Y. Mansour, and A. Rosenberg. Minimax regret for stochastic shortest path. In Advances in Neural Information Processing Systems (Neur IPS), volume 34, pages 28350 28361, 2021. [15] A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 1405 1411, 2015. [16] R. De Santi, M. Prajapat, and A. Krause. Global reinforcement learning : Beyond linear and convex rewards via submodular semi-gradient methods. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 10235 10266, 2024. [17] O. D. Domingues, P. M enard, M. Pirotta, E. Kaufmann, and M. Valko. A kernel-based approach to non-stationary reinforcement learning in metric spaces. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130, pages 3538 3546, 2021. [18] Y. Efroni, L. Shani, A. Rosenberg, and S. Mannor. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning (ICML), volume 119, pages 8604 8613, 2020. [19] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online markov decision processes. Mathematics of Operations Research, 34(3):726 736, 2009. [20] Y. Fei, Z. Yang, Z. Wang, and Q. Xie. Dynamic regret of policy optimization in non-stationary environments. 
In Advances in Neural Information Processing Systems (Neur IPS), volume 33, pages 6743 6754, 2020. [21] S. Feng, M. Yin, R. Huang, Y.-X. Wang, J. Yang, and Y. Liang. Non-stationary reinforcement learning under general function approximation. In International Conference on Machine Learning (ICML), volume 202, pages 9976 10007, 2023. [22] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 139, 1997. [23] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Annual ACM Symposium on Theory of Computing (STOC), page 334 343, 1997. [24] P. Gajane, R. Ortner, and P. Auer. A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. Ar Xiv, abs/1805.10066, 2018. [25] M. Geist, J. Pérolat, M. Laurière, R. Elie, S. Perrin, O. Bachem, R. Munos, and O. Pietquin. Concave utility reinforcement learning: The mean-field game viewpoint. In International Conference on Autonomous Agents and Multiagent Systems, page 489 497, 2022. [26] S. K. S. Ghasemipour, R. Zemel, and S. Gu. A divergence minimization perspective on imitation learning methods. In Proceedings of the Conference on Robot Learning, volume 100, pages 1259 1277, 2020. [27] A. György, T. Linder, and G. Lugosi. Efficient tracking of large classes of experts. In 2012 IEEE International Symposium on Information Theory Proceedings, pages 885 889, 2012. [28] E. Hazan, S. Kakade, K. Singh, and A. Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning (ICML), volume 97, pages 2681 2691, 2019. [29] E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity (ECCC), 14, 01 2007. [30] C. Jin, T. Jin, H. Luo, S. Sra, and T. Yu. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning (ICML), volume 119, pages 4860 4869, 2020. [31] T. Jin, J. Liu, C. Rouyer, W. Chang, C.-Y. Wei, and H. Luo. No-regret online reinforcement learning with adversarial losses and transitions. In Advances in Neural Information Processing Systems (Neur IPS), volume 36, pages 38520 38585, 2023. [32] T. Jin and H. Luo. Simultaneously learning stochastic and adversarial episodic mdps with known transition. In Advances in Neural Information Processing Systems (Neur IPS), volume 33, pages 16557 16566, 2020. [33] K.-S. Jun, F. Orabona, S. Wright, and R. Willett. Improved Strongly Adaptive Online Learning using Coin Betting. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 943 951, 2017. [34] P. Lavigne and L. Pfeiffer. Generalized conditional gradient and learning in potential mean field games. Applied Mathematics & Optimization, 88(3):89, Oct 2023. [35] J. W. Lavington, S. Vaswani, and M. Schmidt. Improved policy optimization for online imitation learning. In Proceedings of The 1st Conference on Lifelong Learning Agents, volume 199, pages 1146 1173, 2022. [36] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212 261, 1994. [37] H. Luo, C.-Y. Wei, and C.-W. Lee. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. In Advances in Neural Information Processing Systems (Neur IPS), volume 34, pages 22931 22942, 2021. [38] T. Lykouris, M. Simchowitz, A. 
Slivkins, and W. Sun. Corruption-robust exploration in episodic reinforcement learning. In Conference on Learning Theory (COLT), volume 134, pages 3242 3245, 2021. [39] B. M Moreno, M. Bregere, P. Gaillard, and N. Oudjane. Efficient model-based concave utility reinforcement learning through greedy mirror descent. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 238, pages 2206 2214, 2024. [40] W. Mao, K. Zhang, R. Zhu, D. Simchi-Levi, and T. Basar. Near-optimal model-free reinforcement learning in non-stationary episodic mdps. In International Conference on Machine Learning (ICML), volume 139, pages 7447 7458, 2021. [41] B. Marin Moreno, M. Brégère, P. Gaillard, and N. Oudjane. (Online) Convex Optimization for Demand-Side Management: Application to Thermostatically Controlled Loads, Jan. 2023. [42] M. Mutti, R. De Santi, P. De Bartolomeis, and M. Restelli. Challenging common assumptions in convex reinforcement learning. In Advances in Neural Information Processing Systems (Neur IPS), volume 35, pages 4489 4502, 2022. [43] M. Mutti, R. D. Santi, P. D. Bartolomeis, and M. Restelli. Convex reinforcement learning in finite trials. Journal of Machine Learning Research, 24(250):1 42, 2023. [44] G. Neu, A. Gyorgy, and C. Szepesvari. The adversarial stochastic shortest path problem with unknown transition probabilities. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 805 813. PMLR, 2012. [45] R. Ortner, P. Gajane, and P. Auer. Variational regret bounds for reinforcement learning. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115, pages 81 90, 2020. [46] M. Prajapat, M. Mutny, M. Zeilinger, and A. Krause. Submodular reinforcement learning. In International Conference on Learning Representations (ICLR), 2024. [47] A. Raj, P. Gaillard, and C. Saad. Non-stationary online regression, 2020. [48] A. Rosenberg and Y. Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning (ICML), volume 97, pages 5478 5486, 2019. [49] A. Rosenberg and Y. Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning (ICML), pages 5478 5486, 2019. [50] A. Rosenberg and Y. Mansour. Stochastic shortest path with adversarially changing costs. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2936 2942, 2021. [51] A. Séguret, C. Wan, and C. Alasseur. A mean field control approach for smart charging with aggregate power demand constraints. In 2021 IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), pages 01 05, 2021. [52] D. van der Hoeven, N. Zhivotovskiy, and N. Cesa-Bianchi. High-probability risk bounds via sequential predictors, 2023. [53] C.-Y. Wei, C. Dann, and J. Zimmert. A model selection approach for corruption robust reinforcement learning. In International Conference on Algorithmic Learning Theory (ALT), volume 167, pages 1043 1096, 2022. [54] C.-Y. Wei and H. Luo. Non-stationary reinforcement learning without prior knowledge: an optimal black-box approach. In Conference on Learning Theory (COLT), volume 134, pages 4300 4354, 2021. [55] T. Zahavy, A. Cohen, H. Kaplan, and Y. Mansour. Apprenticeship learning via frank-wolfe. In AAAI Conference on Artificial Intelligence, 2019. [56] T. Zahavy, B. O' Donoghue, G. Desjardins, and S. Singh. Reward is enough for convex mdps. 
In Advances in Neural Information Processing Systems (Neur IPS), volume 34, pages 25746 25759, 2021. [57] J. Zhang, A. Koppel, A. S. Bedi, C. Szepesvari, and M. Wang. Variational policy gradient method for reinforcement learning with general utilities. In Advances in Neural Information Processing Systems (Neur IPS), volume 33, pages 4572 4583, 2020. [58] J. Zhang, C. Ni, z. Yu, C. Szepesvari, and M. Wang. On the convergence and sample efficiency of variance-reduced policy gradient method. In Advances in Neural Information Processing Systems (Neur IPS), volume 34, pages 2228 2240, 2021. [59] L. Zhang, T. Yang, rong jin, and Z.-H. Zhou. Dynamic regret of strongly adaptive methods. In International Conference on Machine Learning (ICML), volume 80, pages 5882 5891, 2018. [60] X. Zhang, Y. Chen, X. Zhu, and W. Sun. Robust policy gradient against strong data corruption. In International Conference on Machine Learning (ICML), volume 139, pages 12391 12401, 2021. [61] A. Zimin and G. Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems (Neur IPS), volume 26, pages 1583 1591, 2013. A Learning with expert advice In this section, we take a closer look at the Learning with Expert Advice (LEA) framework. We start by presenting, in Subsection A.1, a general reduction of the sleeping experts framework to the standard framework. Thus, any LEA algorithm can be used as a sub-routine for Meta CURL. In Section 5 of the main paper, we show a regret bound for Meta CURL using the Exponentially Weighted Average Forecaster (EWA) algorithm [9], also known as Hedge. In Subsection A.2 we present the main results of playing EWA with convex and exp-concave losses. Setting. We recall the general setting of learning with expert advice (LEA) as presented in the main paper: a learner makes sequential online predictions u1, . . . , u T in a decision space U, over a series of T episodes, with the help of K experts. For each round t, each expert k makes a prediction ut,k, and the learner combines the experts predictions by computing a vector vt := (vt,1, . . . , vt,K) SK, and predicting ut := PK k=1 vt,kut,k. The environment then reveals a convex loss function ℓt : U R. Each expert suffers a loss ℓt,k := ℓt(ut,k), and the learner suffers a loss ˆℓt := ℓt(ut). The learner s objective is to keep the cumulative regret with respect to each expert as low as possible. For each expert k, this quantity is defined as Reg[T ](k) := PT t=1 ˆℓt ℓt,k. A.1 Sleeping experts The sleeping expert problem [8, 23] is a LEA framework where experts are not required to provide solutions at every time step. Let It,k {0, 1} define a binary signal that equals 1 if expert k is active at episode t and 0 otherwise. The algorithm knows (It,k)k [K] and assigns a zero weight to sleeping experts. We would like to have a guarantee with respect to expert k [K] but only when it is active. Hence, we now aim to bound a cumulative regret that depends on the signal It,k: Regsleep [T ] (k) := PT t=1 It,k(ˆℓt ℓt,k). We present a generic reduction from the sleeping expert framework to the standard LEA framework [3, 2]: Let, for all episodes t [T], ˆut := PK k=1 It,kvt,kut,k PK k=1 It,kvt,k . We play a standard LEA algorithm with modified outputs where, at episode t, expert k outputs ut,k := ut,k, if k is active at episode t ˆut, if not. A standard LEA algorithm gives an upper bound on the regret Reg T (k) with respect to each expert k. 
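To make this reduction concrete, here is a minimal sketch (ours; the function name, array layout, uniform prior, and fixed learning rate are assumptions, with EWA as the underlying LEA algorithm); the regret computation that follows applies to the modified outputs it constructs:

```python
import numpy as np

def sleeping_ewa(predictions, active, losses, eta):
    """Sleeping-expert reduction: asleep experts are credited with the learner's prediction.

    predictions: array (T, K, d)  expert outputs u_{t,k} (arbitrary when asleep)
    active:      array (T, K)     binary activity signals I_{t,k}
    losses:      list of T convex functions ell_t : R^d -> [0, 1]
    Returns the learner's predictions hat{u}_t for t = 1, ..., T.
    """
    T, K, _ = predictions.shape
    v = np.ones(K) / K                               # uniform prior over experts
    learner_preds = []
    for t in range(T):
        w = active[t] * v
        u_hat = (w @ predictions[t]) / w.sum()       # aggregate awake experts only
        learner_preds.append(u_hat)
        # modified outputs: sleeping experts output the learner's own prediction
        u_tilde = np.where(active[t][:, None] == 1, predictions[t], u_hat)
        ell = np.array([losses[t](u_tilde[k]) for k in range(K)])
        v = v * np.exp(-eta * ell)                   # standard EWA update
        v /= v.sum()
    return learner_preds
```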
Using that PK k=1 ut,kvt,k = ˆut, we obtain that Reg[T ](k) := k=1 ut,kvt,k ℓt( ut,k) t=1 ℓt(ˆut) ℓt( ut,k) t=1 It,k ℓt(ˆut) ℓt(ut,k) =: Regsleep [T ] (k). Consequently, the cumulative regret with respect to each expert during the times it is active is upper bounded by the standard regret of playing a LEA algorithm with the modified outputs. A.2 Exponentially Weighted Average forecaster The exponentially weighted average forecaster (EWA), also called Hedge, is a LEA algorithm that chooses a weight that decreases exponentially fast with past errors. We present EWA in Alg. 2. Algorithm 2 EWA (Exponentially Weighted Average) Input: [K] := {1, . . . , K} a finite set of experts, v0 a prior over [K], a learning rate η > 0 for t {1, . . . , T} do Observe loss function ℓt, compute the loss suffered by each expert ℓt,k := ℓt(ut,k) and suffer loss ˆℓt := ℓ PK k=1 vt,kut,k Update for all k [K] : vt+1,k = vt,k exp ( ηℓt,k) PK k =1 vt,k exp ( ηℓt,k ) end for We recall two results of playing EWA with convex losses, and with exp-concave losses, used in the main paper: Theorem A.1 (EWA with convex losses: Corollary 2.2 from [9]). If the ℓt losses are convex and take value in [0, 1], then the regret of the learner playing EWA with any η > 0 satisfies, for any k [K], Reg[T ](k) log(K) In particular, with η = p 8 log(K)/T, the upper bound becomes p (T/2) log(K). Theorem A.2 (EWA with exp-concave losses: Thm. 3.2 from [9]). If the ℓt losses are η-exp concave, then the regret of the learner playing EWA (with the same value of η) satisfies, for any k [K], Reg[T ](k) log(K) B Algorithm for the online estimation of the probability kernel (ˆpt estimator) Algorithm 3 Online estimation of the probability kernel (ˆpt estimator) 1: for t {1, . . . , T} do 2: Get agent s external noise trajectory (εt n)n [N] from Meta CURL 3: for (n, x, a) [N] X A do 4: Compute xt n,x,a = gn 1(x, a, εt n 1) 5: Update the empirical estimations for s < t and x X: ˆpt,s n (x |x, a) = 1{xtn,x,a=x } t s + (t 1 s) t s ˆpt 1,s n (x |x, a) 6: Initialize a new estimator, ˆpt,t n (x |x, a) = 1{xtn,x,a=x } for all x X, assign it a new weight vector vt,t n,x,a = 1 t , and normalize weight vectors vt,s n,x,a for s [t 1] such that vt n,x,a := (vt,s n,x,a)s t,λ Λ is a probability vector in Rt 7: Simulate a sample xt n,x,a from distribution pt n( |x, a) + 1 |X| /2: xt n,x,a = xt n,x,a, with probability 1/2, Uniformly over X, with probability 1/2, and use it to build the loss function x X log q(x) + 1 1{ xtn,x,a=x} 8: Update weights using EWA with loss ℓt: for all s t, vt+1,s n,x,a = ˆvt,s n,x,a exp ℓt(ˆpt,s n ( |x, a)) Pt s =1 ˆvt,s n,x,a exp ℓt(ˆpt,s n ( |x, a)) (EWA update) 9: Compute ˆpt+1 n ( |x, a) = Pt s=1 vt+1,s n,x,a ˆpt,s n ( |x, a) 10: end for 11: Issue ˆpt+1 to Meta CURL (line 18 of Alg. 1) 12: end for C Auxiliary lemmas We start with some auxiliary results. For t I := [ts +1, te] [T], we define the average probability distribution for all n and (x, a) as pt(x |x, a) = 1 t ts s=ts ps n(x |x, a). Lemma C.1. Let ˆpt,ts be the empirical probability transition kernel computed with data from episodes [ts, t 1]. For any δ (0, 1), with probability 1 δ, ˆpt,ts n ( |x, a) pt n( |x, a) 1 1 2(t ts) log N|X||A|2|X|T simultaneously for all n [N], (x, a) X A, ts [T 1], and t [ts + 1, T]. Proof. For a fixed n [N], (x, a) X A, and θ { 1, 1}|X|, we define for all s I, Y s n,x,a,θ := X x X θ(x )1{gn(x,a,εsn)=x }, a Bernoulli random variable with mean value given by P x X θ(x )ps n(x |x, a). 
The sequence of random variables given by Y s n,x,a,θ s I is independent, as we assume that the external noises observed at each episode are all independent. Hence, by Hoeffding s inequality we get that, for all ξ > 0, s=ts Y s n,x,a,θ E h t 1 X s=ts Y s n,x,a,θ i ξ exp 2ξ2 Therefore, we have that x X θ(x ) ˆpt,ts n (x |x, a) pt n(x |x, a) ξ x X θ(x )1{gn(x,a,εsn)=x } s=ts θ(x )ps n(x |x, a) ξ s=ts Y s n,x,a,θ E h t 1 X s=ts Y s n,x,a,θ i ξ exp 2ξ2(t ts) . By applying an union bound on all n [N], (x, a) X A, and θ { 1, 1}|X| and noting that, for any two probability distributions p, q X , we have that p q 1 = max θ { 1,1}|X| X x X θ(x)(p(x) q(x)), we arrive at the final result. Lemma C.2. Let t I := [ts + 1, te] [T]. For all n [N], and (x, a) X A, pt n( |x, a) pt n( |x, a) 1 Proof. For t I, and for all n and (x, a) we have that pt n( |x, a) pt n( |x, a) 1 = X pt n(x |x, a) 1 t ts s=ts ps n(x |x, a) pt n(x |x, a) ps n(x |x, a) j=s+1 pj n( |x, a) pj 1 n ( |x, a) 1 j=ts (j ts) pj n( |x, a) pj 1( |x, a) 1 where recall that we define p j := maxn,s,a pj+1 n ( |x, a) pj n( |x, a) 1. Lemma C.3. Let ˆpt,ts be the empirical probability transition kernel computed with data from episodes [ts, t 1]. For any δ (0, 1), with probability 1 δ, pt n( |x, a) ˆpt,ts n ( |x, a) 1 1 2(t ts) log N|X||A|2|X|T simultaneously for all n [N], (x, a) X A, ts [T 1], and t [ts + 1, T]. Proof. The result follows immediately from the triangular inequality and Lemmas C.1 and C.2. Lemma C.4 (A version of the inverse of Pinsker s inequality). Let p , q be any distributions over SX . Define p := p + 1 |X| 2 , and q := q + 1 |X| 2 . KL(p | q) 2|X| p q 2 1. Proof. First, note that q is lower bounded by 1 2|X|, hence KL(p | q) is well defined. Also, by convexity of the simplex, p, q SX , therefore KL(p | q) = X x X p(x) log p(x) x X p(x) log 1 + p(x) x X p(x) p(x) p(x) q(x) p(x) q(x) + X p(x) q(x) 2 1 minx X q(x) p q 2 1 2|X| p q 2 1. Lemma C.5. For any strategy π (SA)X N, for any two probability kernels p = (pn)n [N] and q = (qn)n [N] such that pn, qn : X A X [0, 1], and for all n [N], µπ,p n µπ,q n 1 x,a µπ,p i (x, a) pi+1( |x, a) qi+1( |x, a) 1. Proof. From the definition of a state-action distribution sequence induced by a policy π in a MDP with probability kernel p in Eq. (2), we have that for all (x, a) X A and n [N], µπ,p n (x, a) = X x ,a µπ,p n 1(x , a )pn(x|x , a )πn(a|x). µπ,p n µπ,q n 1 = X µπ,p n (x, a) µπ,q n (x, a) µπ,p n 1(x , a )pn(x|x , a ) µπ,q n 1(x , a )qn(x|x , a ) πn(a|x) µπ,p n 1(x , a )pn(x|x , a ) µπ,q n 1(x , a )qn(x|x , a ) µπ,p n 1(x , a )pn(x|x , a ) µπ,p n 1(x , a )qn(x|x , a ) + µπ,p n 1(x , a )qn(x|x , a ) µπ,q n 1(x , a )qn(x|x , a ) x ,a µπ,p n 1(x , a ) pn( |x , a ) qn( |x , a ) 1 + X µπ,p n 1(x , a ) µπ,q n 1(x , a ) x ,a µπ,p n 1(x , a ) pn( |x , a ) qn( |x , a ) 1 + µπ,p n 1 µπ,q n 1 1. Since for n = 0, µπ,p 0 µπ,q 0 1 = 0, by induction we get that µπ,p n µπ,q n 1 x ,a µπ,p i (x , a ) pi+1( |x , a ) qi+1( |x , a ) 1. Lemma C.6. For any pair of strategies π, π ( A)X N, for any probability kernel p = (pn)n [N] such that pn : X A X [0, 1], and for all n [N], µπ,p n µπ ,p n 1 x X ρπ,p i (x) πi( |x) π i( |x) 1, where ρπ,p i (x) := P a A µπ,p i (x, a) for all x X and i [N]. Proof. Using the recursive relation from Eq. 
(2) of a state-action distribution induced by a policy π in a MDP with probability transition p we have that µπ,p n µπ ,p n 1 = X µπ,p n (x, a) µπ ,p n (x, a) µπ,p n 1(x , a )πn(a|x) µπ ,p n 1(x , a )π n(a|x) pn(x|x , a ) x ,a µπ,p n 1(x , a )pn(x|x , a ) πn(a|x) π n(a|x) x ,a π n(a|x)pn(x|x , a ) µπ,p n 1(x , a ) µπ ,p n 1(x , a ) x ρπ,p n (x) πn( |x) π n( |x) 1 + µπ,p n 1 µπ ,p n 1 1. Since µ0 is fixed for each state-action distribution sequence, by induction we obtain that µπ,p n µπ ,p n 1 x ρπ,q i (x) πt i( |x) πt 1 i ( |x) 1, completing the proof. D Proof of Prop. 5.2: Rˆp [T] regret analysis Proof. Here, we set an upper bound on the term Rˆp [T ] where we pay for errors in estimating pt by ˆpt. Rˆp [T ] := t=1 F t(µπt,pt) F t(µπt,ˆpt) t=1 F t(µπt,pt), µπt,pt µπt,ˆpt n=1 µπt,pt n µπt,ˆpt n 1 j 1 (x, a) pt j( |x, a) ˆpt j( |x, a) 1. To obtain the first inequality, we use the convexity of F t for all t [T], then we use Holder s inequality and the fact that F t is LF -Lipschitz, and for the last inequality we use Lemma C.5. The difficulty in analyzing the L1 difference between pt and ˆpt arises from the non-stationarity of pt. To overcome this we want to use the scheme presented in Subsection 4.3. By Cauchy-Schwartz, we get that Rˆp [T ] LF v u u u u u t x,a (µπt,pt j 1 (x, a))2 | {z } T N2 x,a pt j( |x, a) ˆpt j( |x, a) 2 1 Analysis of the L1 norm. We start by analysing the sum over t [T] of the L1 norm in term ( ). For each n [N], j [n] and (x, a) X A, t=1 pt j( |x, a) ˆpt j( |x, a) 2 1 = 4 pt j( |x, a) + 1 |X| 2 ˆpt j( |x, a) + 1 |X| 2 pt j( |x, a) + 1 |X| 2 ˆpt j( |x, a) + 1 |X| 2 where we apply Pinsker s inequality. Consider a sequence of episodes 1 = t1 < t2 < . . . < tm+1 = T + 1, with Ii := [ti, ti+1 1], such that p Ii p/m for all i [m]. Decomposing the KL sum over t [T] as a sum on the intervals Ii, we obtain that pt j( |x, a) + 1 |X| 2 ˆpt j( |x, a) + 1 |X| 2 pt j( |x, a) + 1 |X| 2 ˆpt,ti j ( |x, a) + 1 |X| 2 t Ii E xt j,x,a log ˆpt,ti j ( xt j,x,a|x, a) + 1 |X| 2 log ˆpt j( xt j,x,a|x, a) + 1 |X| 2 | {z } (ii) where the expectation of the second term is with respect to xt j,x,a pt j( |x, a) + 1 |X| /2. We analyse each term separately: First, note that (ii) is the expectation over xt j,x,a of the cumulative regret of sleeping EWA on interval Ii with respect to the expert ti using the loss function ℓt defined in Eq. (12). This term is upper bounded by log(T) (see Subection 4.3 of the main paper). From it we deduce that i=1 log(T) = m log(T). Regarding term (i), we start by using the inverse of Pinsker s inequality presented in Lemma C.4, pt j( |x, a) + 1 |X| 2 ˆpt,ti j ( |x, a) + 1 |X| 2 2 pt j( |x, a) ˆpt,ti j ( |x, a) 2 1. To simplify notations, from now on we let C := 1 2 log N|X||A|2|X|T . Applying Lemma C.3, we obtain that 2 pt j( |x, a) ˆpt,ti j ( |x, a) 2 1 t ti + 1 + 2C t ti + 1 k=ti p k + t 1 X C2 log(|Ii|) + 2C p |Ii| p Ii + |Ii|( p Ii)2 C2m log(T) + 2C m + T ( p)2 Joining the upper bounds of (i) and (ii) we conclude that t=1 pt j( |x, a) ˆpt j( |x, a) 2 1 8 (i) + (ii) 2 + 1)m log(T) + 2|X| Thus, for m = 2T p C2 log(T ) 1/3 , t=1 pt j( |x, a) ˆpt j( |x, a) 2 1 12|X|C4/3 log(T)2/3T 1/3( p)2/3. (17) Back to the analysis of Rˆp [T ]. Using the inequality in Eq. (17) to bound the L1 norm of ( ) on Eq. (16), we obtain that Rˆp [T ] LF N x,a 12|X|C4/3 log(T)2/3T 1/3( p)2/3 2LF N 2|X| p 3|A|C2/3 log(T)1/3T 2/3( p)1/3, concluding the proof. Note that for Pm i=1 P t Ii Rˆp Ii(πt,ti,λ) (the third term of the meta-regret decomposition of Eq. 
E Proof of Prop. 5.4: $R^{\text{black-box}}_{[T]}$ regret analysis

Proof. Assume a black-box algorithm satisfying the dynamic regret bound of Eq. (10), i.e., for any interval $I\subseteq[T]$, with respect to any sequence of policies $(\pi^{t,*})_{t\in I}$, and for any learning rate $\lambda$,
$$R_I\big((\pi^{t,*})_{t\in I}\big) \le c_1\lambda|I| + \frac{c_2\Delta^\pi_I + c_3}{\lambda} + |I|\,\Delta^p_I. \quad (18)$$
Consider any sequence of episodes $1 = t_1 < t_2 < \ldots < t_{m+1} = T+1$, forming intervals $I_i := [t_i, t_{i+1}-1]$ for all $i\in[m]$. We can decompose the black-box regret over $[T]$ as
$$R^{\text{black-box}}_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) = \sum_{i=1}^m\sum_{t\in I_i}\Big(F^t(\mu^{\pi^{t,t_i,\lambda},p^t}) - F^t(\mu^{\pi^{t,*},p^t})\Big) \le \sum_{i=1}^m\Big(c_1\lambda|I_i| + \frac{c_2\Delta^\pi_{I_i} + c_3}{\lambda} + |I_i|\,\Delta^p_{I_i}\Big) \le c_1\lambda T + \frac{c_2\Delta^\pi + c_3 m}{\lambda} + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i}.$$
In principle, we would like to select the $\lambda$ that optimizes this dynamic regret bound. However, as $\Delta^\pi$ may be unknown in advance, this is not possible. We show here that running Meta CURL with the learning rate set $\Lambda := \{2^{-j}\mid j = 0, 1, 2, \ldots, \lceil\log_2(T)/2\rceil\}$ ensures that the best empirical learning rate in $\Lambda$ is within a factor of $2$ of the true optimal one, and that the learner always plays as well as the best empirical learning rate. Denote by $\lambda^*$ the optimal learning rate and by $\hat\lambda$ the best empirical learning rate in $\Lambda$. Note that $\lambda^* = \sqrt{(c_2\Delta^\pi + c_3m)/(c_1T)}$. We consider three different cases.

If $\lambda^* \ge 1$: this implies that $\frac{c_2\Delta^\pi + c_3m}{c_1T} \ge 1$, and therefore $T \le \frac{c_2\Delta^\pi + c_3m}{c_1}$. As we assume $f^t_n \in [0,1]$ for all time steps $n\in[N]$ and episodes $t\in[T]$, then
$$R^{\text{black-box}}_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) \le NT + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i} \le \frac{N(c_2\Delta^\pi + c_3m)}{c_1} + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i}.$$
If $\lambda^* \le 1/\sqrt{T}$: this implies that $\frac{c_2\Delta^\pi + c_3m}{c_1} \le 1$. Therefore, taking $\hat\lambda = 1/\sqrt{T}\in\Lambda$, we have that
$$R^{\text{black-box}}_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) \le \hat\lambda c_1T + (c_2\Delta^\pi + c_3m)\sqrt{T} + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i} \le 2c_1\sqrt{T} + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i}.$$
If $\lambda^* \in [1/\sqrt{T}, 1]$: in this case, given the construction of $\Lambda$, there is a $\hat\lambda\in\Lambda$ such that $\lambda^* \le \hat\lambda \le 2\lambda^*$. Hence,
$$R^{\text{black-box}}_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) \le c_1\hat\lambda T + \frac{c_2\Delta^\pi + c_3m}{\hat\lambda} + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i} \le 3\sqrt{c_1(c_2\Delta^\pi + c_3m)T} + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i}.$$
Therefore, by taking $\lambda = \hat\lambda$ in the analysis, we can ensure that, in all three cases,
$$R^{\text{black-box}}_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) = \sum_{i=1}^m\sum_{t\in I_i}\Big(F^t(\mu^{\pi^{t,t_i,\lambda},p^t}) - F^t(\mu^{\pi^{t,*},p^t})\Big) \le \frac{N(c_2\Delta^\pi + c_3m)}{c_1} + 3\sqrt{c_1(c_2\Delta^\pi + c_3m)T} + \sum_{i=1}^m|I_i|\,\Delta^p_{I_i}.$$
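The factor-2 covering property of the grid $\Lambda$ used in the case $\lambda^* \in [1/\sqrt{T}, 1]$ above is easy to verify directly. The snippet below is an illustrative check written by us; rounding the largest exponent to $\lceil\log_2(T)/2\rceil$ is our reading of the definition of $\Lambda$.

```python
import math

def learning_rate_grid(T):
    # Lambda = {2^{-j} : j = 0, 1, ..., ceil(log2(T) / 2)}
    return [2.0 ** (-j) for j in range(math.ceil(math.log2(T) / 2) + 1)]

def covering_rate(lambda_star, T):
    # Smallest grid point above lambda_star; for lambda_star in [1/sqrt(T), 1]
    # it is guaranteed to be at most 2 * lambda_star.
    return min(lam for lam in learning_rate_grid(T) if lam >= lambda_star)

T = 4096
for lambda_star in (0.02, 0.1, 0.37, 0.9):
    lam_hat = covering_rate(lambda_star, T)
    assert lambda_star <= lam_hat <= 2 * lambda_star
    print(f"{lambda_star:.3f} -> {lam_hat:.5f}")
```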
F Proof of Thm. 5.1: Main result

Joining the results from the meta-regret bound and the black-box regret bound, we get the main result of the paper.

Theorem (Main result). Let $\delta\in(0,1)$. Playing Meta CURL with a black-box algorithm whose dynamic regret is as in Eq. (10), with the learning rate grid $\Lambda := \{2^{-j}\mid j = 0, 1, 2, \ldots, \lceil\log_2(T)/2\rceil\}$, and with EWA as the sleeping expert subroutine, we obtain, with probability at least $1-2\delta$, for any sequence of policies $(\pi^{t,*})_{t\in[T]}$,
$$R_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) \le O\Big(\sqrt{\Delta^\pi\,T} + \min\big\{\sqrt{\Delta^p_\#\,T},\ T^{2/3}(\Delta^p)^{1/3}\big\}\Big),$$
where $\Delta^p_\#$ and $\Delta^p$ denote the frequency and the magnitude of the changes of the probability transitions, and $\Delta^\pi$ the magnitude of the changes of the comparator policy sequence (see Eqs. (6) and (7)).

Proof. Define a sequence of episodes $1 = t_1 < t_2 < \ldots < t_{m+1} = T+1$, with $I_i := [t_i, t_{i+1}-1]$, such that $\Delta^p_{I_i} \le \Delta^p/m$ for all $i\in[m]$, so that $\sum_{i=1}^m|I_i|\,\Delta^p_{I_i} \le T\Delta^p/m$. The dynamic regret of $\mathcal{M}(\mathcal{E},\Lambda)$ with respect to any sequence of policies $(\pi^{t,*})_{t\in[T]}$, and any $\lambda\in\Lambda$, can be decomposed as
$$R_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) = \underbrace{\sum_{i=1}^m\sum_{t\in I_i}\Big(F^t(\mu^{\pi^t,p^t}) - F^t(\mu^{\pi^{t,t_i,\lambda},p^t})\Big)}_{\text{Meta-algorithm regret}} + \underbrace{\sum_{i=1}^m\sum_{t\in I_i}\Big(F^t(\mu^{\pi^{t,t_i,\lambda},p^t}) - F^t(\mu^{\pi^{t,*},p^t})\Big)}_{\text{Black-box regret on }I_i} =: R^{\text{meta}}_{[T]} + R^{\text{black-box}}_{[T]}\big((\pi^{t,*})_{t\in[T]}\big).$$
From Prop. 5.3, we have that, with probability at least $1-2\delta$, and with $C := \sqrt{\frac{1}{2}\log\big(\frac{N|X||A|2^{|X|}T}{\delta}\big)}$,
$$R^{\text{meta}}_{[T]} \le 4L_F N^2|X|\sqrt{3|A|}\,C^{2/3}\log(T)^{1/3}\,T^{2/3}(\Delta^p)^{1/3} + 2\log(T|\Lambda|).$$
In addition, for $\Lambda := \{2^{-j}\mid j = 0, 1, 2, \ldots, \lceil\log_2(T)/2\rceil\}$ and $\lambda$ equal to the best empirical learning rate in $\Lambda$, Prop. 5.4 yields that, if the black-box algorithm has dynamic regret as in Eq. (10) for any interval of $[T]$, then
$$R^{\text{black-box}}_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) \le \frac{N(c_2\Delta^\pi + c_3 m)}{c_1} + 3\sqrt{c_1(c_2\Delta^\pi + c_3 m)T} + \frac{T\Delta^p}{m}.$$
Therefore, joining both results, we get that
$$R_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) \le 4L_F N^2|X|\sqrt{3|A|}\,C^{2/3}\log(T)^{1/3}\,T^{2/3}(\Delta^p)^{1/3} + 2\log(T\log_2(T)) + \frac{N(c_2\Delta^\pi + c_3 m)}{c_1} + 3\sqrt{c_1(c_2\Delta^\pi + c_3 m)T} + \frac{T\Delta^p}{m}.$$
Thus, for $m = \big\lceil(T\Delta^p/\gamma)^{2/3}\big\rceil$, with $\gamma := \sqrt{\log(T\log_2(T))/2} + 3\sqrt{c_1c_3}$, we have that
$$R_{[T]}\big((\pi^{t,*})_{t\in[T]}\big) \le T^{2/3}(\Delta^p)^{1/3}\Big(4L_F N^2|X|\sqrt{3|A|}\,C^{2/3}\log(T)^{1/3} + 2\gamma^{2/3}\Big) + \frac{N(c_2\Delta^\pi + c_3)}{c_1} + 3\sqrt{c_1(c_2\Delta^\pi + c_3)T} \le O\Big(\sqrt{\Delta^\pi\,T} + \min\big\{\sqrt{\Delta^p_\#\,T},\ T^{2/3}(\Delta^p)^{1/3}\big\}\Big).$$

G Greedy MD-CURL dynamic regret analysis

Here we introduce Greedy MD-CURL, developed by [39], a computationally efficient policy-optimization algorithm known for achieving sublinear static regret in online CURL with adversarial objective functions within a stationary MDP. We begin by detailing Greedy MD-CURL as presented in [39] in Alg. 4. We then provide a new analysis upper bounding the dynamic regret of Greedy MD-CURL on a quasi-stationary interval, valid for any learning rate $\lambda$. Hence, Greedy MD-CURL can be used as a black-box for Meta CURL. This is the first dynamic regret analysis for a CURL approach.

Let $\mathcal{M}^{p,+}_{\mu_0}$ denote the subset of $\mathcal{M}^{p}_{\mu_0}$ whose corresponding policies $\pi$ are such that $\pi_n(a|x) \neq 0$ for all $(x,a)\in X\times A$, $n\in[N]$. For any probability transition $p$, define $\Gamma : \mathcal{M}^{p}_{\mu_0}\times\mathcal{M}^{p,+}_{\mu_0}\to\mathbb{R}$ such that, for all $\mu\in\mathcal{M}^{p}_{\mu_0}$ with associated policy $\pi$, and $\mu'\in\mathcal{M}^{p,+}_{\mu_0}$ with associated policy $\pi'$,
$$\Gamma(\mu^\pi, \mu^{\pi'}) := \sum_{n=1}^N \mathbb{E}_{(x,a)\sim\mu^\pi_n(\cdot)}\Big[\log\frac{\pi_n(a|x)}{\pi'_n(a|x)}\Big]. \quad (19)$$
This divergence $\Gamma$ is a Bregman divergence (see Proposition 4.3 of [39]). Problem (20), implemented with this Bregman divergence $\Gamma$, has a closed-form solution, as shown in [39].

Algorithm 4 Greedy MD-CURL
1: Input: number of episodes $T$, initial sequence of policies $\pi^1\in(S_A)^{X\times N}$, initial state-action distribution $\mu_0$, learning rate $\lambda>0$, sequence of parameters $(\alpha_t)_{t\in[T]}$.
2: Initialization: for all $(x,a)$, $\hat p^1(\cdot|x,a) = \frac{1}{|X|}$ and $\mu^1 = \bar\mu^1 := \mu^{\pi^1,\hat p^1}$
3: for $t = 1, \ldots, T$ do
4:   Agent starts at $(x^t_0, a^t_0)\sim\mu_0(\cdot)$
5:   for $n = 1, \ldots, N$ do
6:     Environment draws new state $x^t_n\sim p_n(\cdot|x^t_{n-1}, a^t_{n-1})$
7:     Learner observes agent's external noise $\varepsilon^t_n$
8:     Agent chooses an action $a^t_n\sim\pi^t_n(\cdot|x^t_n)$
9:   end for
10:  Update probability kernel estimate for all $(x,a)$:
11:    $\hat p^{t+1}_n(\cdot|x,a) := \frac{1}{t}\,\delta_{g_n(x,a,\varepsilon^t_n)} + \frac{t-1}{t}\,\hat p^t_n(\cdot|x,a)$
12:  Compute policy for the next episode:
13:    $\mu^{t+1}\in\arg\min_{\mu\in\mathcal{M}^{\hat p^{t+1}}_{\mu_0}}\big\{\lambda\langle\nabla F^t(\bar\mu^t), \mu\rangle + \Gamma(\mu, \bar\mu^t)\big\}$   (20)
14:    for all $n\in[N]$, $(x,a)\in X\times A$: $\pi^{t+1}_n(a|x) = \mu^{t+1}_n(x,a)\,\big/\sum_{a'\in A}\mu^{t+1}_n(x,a')$
15:  Compute $\bar\pi^{t+1} := (1-\alpha_{t+1})\,\pi^{t+1} + \alpha_{t+1}/|A|$
16:  Compute $\bar\mu^{t+1} := \mu^{\bar\pi^{t+1},\hat p^{t+1}}$ as in Eq. (2)
17: end for
18: return $(\pi^t)_{t\in[T]}$
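For intuition on how the quantities of Algorithm 4 are maintained, here is a schematic Python sketch of one episode of bookkeeping. It is our own simplification: the kernel-estimate update mirrors line 11 and the uniform mixing mirrors line 15, but the policy update below is a generic KL-regularized mirror-descent step used only as a stand-in for the closed-form solution of problem (20) given in [39], which is not reproduced in this appendix. The function `g` and its signature are assumptions.

```python
import numpy as np

def update_kernel_estimate(p_hat, noises, g, t):
    # Line 11: p_hat^{t+1}_n(. | x, a) = (1/t) delta_{g_n(x, a, eps^t_n)} + ((t-1)/t) p_hat^t_n(. | x, a)
    # g(n, x, a, eps) is assumed to return the index of the next state; noises[n] = eps^t_n.
    N, X, A, _ = p_hat.shape
    new = (t - 1) / t * p_hat
    for n in range(N):
        for x in range(X):
            for a in range(A):
                new[n, x, a, g(n, x, a, noises[n])] += 1.0 / t
    return new

def policy_step(pi_bar, grad, lam, alpha):
    # Stand-in for lines 13-15: a KL-to-previous-policy mirror-descent step on the policy
    # (not the exact closed form of (20)), followed by mixing with the uniform policy.
    # pi_bar[n, x, :] is the mixed policy of the previous episode (strictly positive);
    # grad[n, x, :] is an estimate of the gradient of F^t in the state-action distribution.
    logits = np.log(pi_bar) - lam * grad
    pi = np.exp(logits - logits.max(axis=-1, keepdims=True))
    pi /= pi.sum(axis=-1, keepdims=True)
    return (1 - alpha) * pi + alpha / pi.shape[-1]   # line 15: uniform mixing
```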
G.1 Dynamic regret analysis of Greedy MD-CURL

Let us assume we analyze our regret on an interval $I := [t_s, t_e]\subseteq[T]$. We denote by $R_I$ the dynamic regret of an instance of Greedy MD-CURL started at episode $t_s$ and run until the end of interval $I$ at episode $t_e$. We denote by $\pi^t$ the policy produced by this instance of Greedy MD-CURL at episode $t\in I$, by $p^t$ the true probability transition kernel, and by
$$\hat p^t_n(x'|x,a) = \frac{1}{t-t_s}\sum_{s=t_s}^{t-1}\mathbb{1}_{\{g_n(x,a,\varepsilon^s_n)=x'\}}$$
the empirical estimate of the probability kernel at episode $t$, computed with data from the beginning of the interval $I$.

We define and decompose the dynamic regret $R_I$ with respect to any sequence of policies $(\pi^{t,*})_{t\in I}$ into three terms as follows:
$$R_I\big((\pi^{t,*})_{t\in I}\big) := \sum_{t\in I}\Big(F^t(\mu^{\pi^t,p^t}) - F^t(\mu^{\pi^{t,*},p^t})\Big) = \underbrace{\sum_{t\in I}\Big(F^t(\mu^{\pi^t,p^t}) - F^t(\mu^{\pi^t,\hat p^t})\Big)}_{R^{\text{MDP}}_I(\pi^t)} + \underbrace{\sum_{t\in I}\Big(F^t(\mu^{\pi^t,\hat p^t}) - F^t(\mu^{\pi^{t,*},\hat p^t})\Big)}_{R^{\text{policy}}_I((\pi^{t,*})_{t\in I})} + \underbrace{\sum_{t\in I}\Big(F^t(\mu^{\pi^{t,*},\hat p^t}) - F^t(\mu^{\pi^{t,*},p^t})\Big)}_{R^{\text{MDP}}_I(\pi^{t,*})}. \quad (21)$$
The terms $R^{\text{MDP}}_I(\pi^t)$ and $R^{\text{MDP}}_I(\pi^{t,*})$ pay for our lack of knowledge of the true MDP, which forces us to use its empirical estimate. The term $R^{\text{policy}}_I$ corresponds to the loss incurred in computing the policy by solving the optimization problem given in Eq. (20). Below, we present the first analysis of the dynamic regret for a CURL algorithm. We consider each term separately.

G.1.1 $R^{\text{MDP}}_I$ analysis

In Section 2 we assume that the deterministic part of the dynamics, given by $g_n$ in Eq. (8) for each time step $n$, is known in advance. The source of uncertainty and non-stationarity in the MDP comes only from the external noise dynamics, which are independent of the agent's state-action pairs. Therefore, we do not need to explore in this setting, so the analyses of the two terms $R^{\text{MDP}}_I(\pi^t)$ and $R^{\text{MDP}}_I(\pi^{t,*})$ are the same.

Proposition G.1. With probability at least $1-\delta$,
$$R^{\text{MDP}}_I(\pi^t) \le L_F N^2\Big(\sqrt{\tfrac{1}{2}\log\big(\tfrac{N|X||A|2^{|X|}T}{\delta}\big)\,|I|} + |I|\,\Delta^p_I\Big),$$
for all intervals $I\subseteq[T]$. The same result is valid for $R^{\text{MDP}}_I(\pi^{t,*})$.

Proof. We start by using the convexity of $F^t$, Hölder's inequality, the fact that $F^t$ is $L_F$-Lipschitz, and Lemma C.5 to obtain that
$$R^{\text{MDP}}_I(\pi^t) \le L_F\sum_{t\in I}\sum_{n=1}^N\sum_{j=1}^n\sum_{x,a}\mu^{\pi^t,p^t}_{j-1}(x,a)\,\|p^t_j(\cdot|x,a) - \hat p^t_j(\cdot|x,a)\|_1. \quad (22)$$
Applying Lemmas C.1 and C.2, we have that, for any $\delta\in(0,1)$, with probability $1-\delta$,
$$\|p^t_j(\cdot|x,a) - \hat p^t_j(\cdot|x,a)\|_1 \le \|p^t_j(\cdot|x,a) - \bar p^t_j(\cdot|x,a)\|_1 + \|\bar p^t_j(\cdot|x,a) - \hat p^t_j(\cdot|x,a)\|_1 \le \sqrt{\frac{1}{2(t-t_s)}\log\Big(\frac{N|X||A|2^{|X|}T}{\delta}\Big)} + \sum_{k=t_s}^{t-1}\Delta^p_k.$$
Using this to continue the upper bound of Eq. (22), we conclude our proof:
$$R^{\text{MDP}}_I(\pi^t) \le L_F N^2\sum_{t=t_s}^{t_e}\Big(\sqrt{\frac{1}{2(t-t_s)}\log\Big(\frac{N|X||A|2^{|X|}T}{\delta}\Big)} + \sum_{k=t_s}^{t-1}\Delta^p_k\Big) \le L_F N^2\Big(\sqrt{\tfrac{1}{2}\log\big(\tfrac{N|X||A|2^{|X|}T}{\delta}\big)\,|I|} + |I|\,\Delta^p_I\Big).$$
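The two error sources controlled in Prop. G.1, concentration of the windowed empirical estimate and accumulated drift of the true kernel, can be visualized with a short simulation. The experiment below is ours and only illustrative: a single slowly drifting next-state distribution stands in for $p^t_n(\cdot|x,a)$, and the confidence level and drift size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X, T, drift_scale = 5, 2000, 2e-4   # states, episodes, per-episode drift (our toy choices)

p = rng.dirichlet(np.ones(X))       # current next-state distribution
counts = np.zeros(X)
errs, bounds, total_drift = [], [], 0.0

for t in range(1, T + 1):
    counts[rng.choice(X, p=p)] += 1                 # one observed transition at episode t
    p_hat = counts / t                              # windowed empirical estimate over [1, t]
    errs.append(np.abs(p_hat - p).sum())
    # concentration term + accumulated drift, mirroring Lemmas C.1-C.3 (delta = 0.05)
    bounds.append(np.sqrt(np.log(2 ** X * T / 0.05) / (2 * t)) + total_drift)
    # drift the true distribution slightly before the next episode
    step = rng.normal(size=X) * drift_scale
    p_new = np.clip(p + step - step.mean(), 1e-6, None)
    p_new /= p_new.sum()
    total_drift += np.abs(p_new - p).sum()          # accumulates the Delta^p_k terms
    p = p_new

violations = np.mean(np.array(errs) > np.array(bounds))
print(f"episodes where the L1 error exceeds the bound: {violations:.3%}")
```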
G.1.2 $R^{\text{policy}}_I$ analysis

Proposition G.2. Let $b$ be the constant defined as $b := 8N^2 + 2N^2\log(|A|)\log(|I|)\big(1 + \log(|I|)\big) + 2N\log(|I|) + N\log(|A|)$. Then, Greedy MD-CURL obtains, for any sequence of policies $(\pi^{t,*})_{t\in I}$, and for any learning rate $\lambda > 0$,
$$R^{\text{policy}}_I\big((\pi^{t,*})_{t\in I}\big) \le \lambda L_F^2|I| + \frac{N^2\Delta^\pi_I + b}{\lambda}.$$

Proof. We adapt the proof of Prop. 5.7 of [39], which upper bounds the static regret incurred when solving the optimization problem (20), to obtain an upper bound on the dynamic regret. The main difference is that, in the case of static regret, we compare ourselves to the same policy throughout the interval, whereas in the case of dynamic regret, at each episode we compare ourselves to a different policy given by $\pi^{t,*}$. Consequently, the analysis remains the same as in [39] for all terms that do not depend on $\pi^{t,*}$, but requires a new analysis for the terms that do depend on it.

To simplify notation, we take $\ell^t := \nabla F^t(\mu^{\bar\pi^t,\hat p^t})$ and $\bar\mu^t := \mu^{\bar\pi^t,\hat p^t}$. We can use the same reasoning as in Appendix D.5 of [39] to show that
$$R^{\text{policy}}_I \le \underbrace{\frac{1}{\lambda}\sum_{t\in I}\Big(\lambda\langle\ell^t, \bar\mu^t - \mu^{t+1}\rangle - \Gamma(\mu^{t+1},\bar\mu^t)\Big)}_{A} + \underbrace{\frac{1}{\lambda}\sum_{t\in I}\Big(\Gamma(\mu^{\pi^{t,*},\hat p^{t+1}},\bar\mu^t) - \Gamma(\mu^{\pi^{t,*},\hat p^{t+1}},\mu^{t+1})\Big)}_{B}.$$
Since the $A$ term does not depend on $\pi^{t,*}$, its analysis follows directly from [39], and is given by
$$A \le \lambda L_F^2|I| + \frac{1}{\lambda}\sum_{t\in I}\Big(\frac{1}{t-t_s} + 2N\alpha_t\Big), \quad (23)$$
where $L_F$ is the Lipschitz constant of $F^t$ and $\alpha_t$ is an input parameter of Greedy MD-CURL. We then proceed to analyze term $B$. Again following the procedure of Appendix D.5 of [39], we obtain that
$$\sum_{t\in I}\Big(\Gamma(\mu^{\pi^{t,*},\hat p^{t+1}},\bar\mu^t) - \Gamma(\mu^{\pi^{t,*},\hat p^{t+1}},\mu^{t+1})\Big) = \underbrace{\sum_{t\in I}\Big(\Gamma(\mu^{\pi^{t,*},\hat p^{t+1}},\bar\mu^t) - \Gamma(\mu^{\pi^{t-1,*},\hat p^{t}},\bar\mu^t)\Big)}_{(i)} + \underbrace{\sum_{t\in I}\Big(\Gamma(\mu^{\pi^{t-1,*},\hat p^{t}},\bar\mu^t) - \Gamma(\mu^{\pi^{t-1,*},\hat p^{t}},\mu^t)\Big)}_{(ii)} + \underbrace{\sum_{t\in I}\Big(\Gamma(\mu^{\pi^{t-1,*},\hat p^{t}},\mu^t) - \Gamma(\mu^{\pi^{t,*},\hat p^{t+1}},\mu^{t+1})\Big)}_{(iii)}.$$
Let $\psi : (S_{X\times A})^N\to\mathbb{R}$ denote the function inducing the Bregman divergence $\Gamma$ of Eq. (19). [39] further shows that
$$(i) \le \psi(\mu^{\pi^{t_s-1,*},\hat p^{t_s}}) + N\log(|A|)\sum_{t\in I}\|\mu^{\pi^{t-1,*},\hat p^t} - \mu^{\pi^{t,*},\hat p^{t+1}}\|_{\infty,1}, \qquad (ii) \le 2N\sum_{t\in I}\alpha_t,$$
where the latter upper bound is obtained independently of $\pi^{t,*}$, and that $(iii)$ telescopes and is bounded by $\Gamma(\mu^{\pi^{t_s-1,*},\hat p^{t_s}},\mu^{t_s})$; Lemma D.6 of [39] then shows that $\psi(\mu^{\pi^{t_s-1,*},\hat p^{t_s}}) + \Gamma(\mu^{\pi^{t_s-1,*},\hat p^{t_s}},\mu^{t_s}) \le N\log(|A|)$. Only term $(i)$, which involves $\|\mu^{\pi^{t-1,*},\hat p^t} - \mu^{\pi^{t,*},\hat p^{t+1}}\|_{\infty,1}$, depends on the sequence $(\pi^{t,*})_{t\in[T]}$, and therefore requires a new analysis. For this purpose, we rely on the following two results. From Lemma 5.6 of [39], we have that, for all strategies $\pi$,
$$\|\mu^{\pi,\hat p^t} - \mu^{\pi,\hat p^{t+1}}\|_{\infty,1} \le \frac{2N}{t-t_s}.$$
From the auxiliary Lemma C.6 proved in Appendix C, we have that
$$\|\mu^{\pi^{t-1,*},\hat p^{t+1}}_n - \mu^{\pi^{t,*},\hat p^{t+1}}_n\|_1 \le \sum_{i=1}^n\sum_{x\in X}\rho^{\pi^{t-1,*}}_i(x)\,\|\pi^{t,*}_i(\cdot|x) - \pi^{t-1,*}_i(\cdot|x)\|_1 \le N\,\Delta^\pi_t.$$
Therefore, using the triangle inequality and the two results above, we obtain that
$$\|\mu^{\pi^{t-1,*},\hat p^t} - \mu^{\pi^{t,*},\hat p^{t+1}}\|_{\infty,1} \le \|\mu^{\pi^{t-1,*},\hat p^t} - \mu^{\pi^{t-1,*},\hat p^{t+1}}\|_{\infty,1} + \|\mu^{\pi^{t-1,*},\hat p^{t+1}} - \mu^{\pi^{t,*},\hat p^{t+1}}\|_{\infty,1} \le \frac{2N}{t-t_s} + N\,\Delta^\pi_t.$$
Therefore, the bound on term $B$ is given by
$$B \le \frac{1}{\lambda}\Big(N\log(|A|)\sum_{t\in I}\Big(\frac{2N}{t-t_s} + N\,\Delta^\pi_t\Big) + 2N\sum_{t\in I}\alpha_t + N\log(|A|)\Big). \quad (24)$$

Final step: joining all results. Joining the upper bounds on term $A$ from Eq. (23) and on term $B$ from Eq. (24), and taking the learning rate $\alpha_t = 1/t$, we have that, for all $\lambda > 0$,
$$R^{\text{policy}}_I\big((\pi^{t,*})_{t\in I}\big) \le A + B \le \lambda L_F^2|I| + \frac{N^2\Delta^\pi_I + 8N^2 + 2N^2\log(|A|)\log(|I|)\big(1+\log(|I|)\big) + 2N\log(|I|) + N\log(|A|)}{\lambda} = \lambda L_F^2|I| + \frac{N^2\Delta^\pi_I + b}{\lambda},$$
where $b := 8N^2 + 2N^2\log(|A|)\log(|I|)\big(1 + \log(|I|)\big) + 2N\log(|I|) + N\log(|A|)$.

G.2 Final Greedy MD-CURL regret analysis

Replacing the bounds of Prop. G.1 and Prop. G.2 in Eq. (21) yields the final upper bound on Greedy MD-CURL's dynamic regret for any interval $I\subseteq[T]$ with respect to any sequence of policies $(\pi^{t,*})_{t\in I}$:

Theorem G.3 (Dynamic regret of Greedy MD-CURL). Let $b$ be the constant defined as $b := 8N^2 + 2N^2\log(|A|)\log(|I|)\big(1 + \log(|I|)\big) + 2N\log(|I|) + N\log(|A|)$. Let $\delta\in(0,1)$. With probability at least $1-2\delta$, for any interval $I\subseteq[T]$, for any sequence of policies $(\pi^{t,*})_{t\in I}$, for any learning rate $\lambda>0$, and for $\alpha_t := 1/t$, Greedy MD-CURL obtains
$$R_I\big((\pi^{t,*})_{t\in I}\big) \le \lambda L_F^2|I| + \frac{N^2\Delta^\pi_I + b}{\lambda} + 2L_F N^2\sqrt{\tfrac{1}{2}\log\big(\tfrac{N|X||A|2^{|X|}T}{\delta}\big)\,|I|} + 2|I|\,\Delta^p_I.$$
Hence, Greedy MD-CURL meets the requisite dynamic regret bound of Eq. (10) to serve as a black-box algorithm for Meta CURL achieving optimal dynamic regret.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We clearly state the contributions of our work and our setting in the Introduction. We formally detail our hypotheses in Section 2. We present the new algorithm in Section 4. All regret results related to the new algorithm claimed in the abstract and introduction are stated in Section 5 and proved in the Appendix.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: In Section 2, we formally detail our assumptions and discuss the strong assumption made about probability transitions, justifying it as a means of providing low-complexity methods. We also discuss real-world applications that satisfy these assumptions. In Section 6, we discuss the aspirational goal of addressing this limitation as future work. In Remark 3.1, we discuss the complexity of the algorithm.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All the main results are in the paper, and all proofs are carefully stated in the Appendix. All auxiliary results used to obtain the main results are also included in the Appendix. A proof sketch is provided for the main results in the main paper. All results are properly referenced.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: This is a theoretical paper, with no experiments.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: This is a theoretical paper, with no experiments.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: This is a theoretical paper, with no experiments.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: This is a theoretical paper, with no experiments.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: This is a theoretical paper, with no experiments.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: This work conforms with the NeurIPS Code of Ethics.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: The results presented in this paper are largely theoretical. The framework provided in this paper is very general and could be applied to any reinforcement learning or concave utility reinforcement learning problem in a tabular MDP. Therefore, as with any reinforcement learning algorithm, it is possible that algorithms developed from the ideas presented in this paper could be applied in contexts that have negative societal impacts, or in contexts where the reward function itself has negative societal impacts.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This paper poses no such risks.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: This paper does not use existing assets.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This paper does not release new assets.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This paper does not involve crowdsourcing nor research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.