Dynamic Regret of Online Markov Decision Processes

Peng Zhao 1, Long-Fei Li 1, Zhi-Hua Zhou 1

1 National Key Laboratory for Novel Software Technology, Nanjing University. Correspondence to: Zhi-Hua Zhou. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. This measure is strictly stronger than the standard static regret that benchmarks the learner's performance against a fixed compared policy. We consider three foundational models of online MDPs: episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For the three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees respectively, in which the results for episodic (loop-free) SSP are provably minimax optimal in terms of the time horizon and a certain non-stationarity measure.

1. Introduction

Markov Decision Processes (MDPs) are widely used to model decision-making problems, where a learner interacts with the environment sequentially and aims to improve the learned strategy over time. The MDP model is very general and encompasses a variety of applications, including games (Silver et al., 2016), robotic control (Schulman et al., 2015), autonomous driving (Kendall et al., 2019), etc. In this paper, we focus on the online MDP framework with adversarially changing loss functions and known transitions, which has attracted increasing attention in recent years due to its generality (Even-Dar et al., 2009; Zimin & Neu, 2013; Rosenberg & Mansour, 2019a; Jin et al., 2020a; Chen et al., 2021a).

Let T be the total time horizon. The general protocol of online MDPs is as follows: at each round t ∈ [T], the learner observes the current state x_t and decides a policy π_t : X × A → [0, 1], where π_t(a|x) is the probability of taking action a ∈ A at state x ∈ X. Then, the learner draws and executes an action a_t from π_t(·|x_t) and suffers a loss ℓ_t(x_t, a_t). The environment subsequently transits to the next state x_{t+1} according to the transition kernel P(·|x_t, a_t). We focus on the full-information setting, where the entire loss function is revealed to the learner. The standard measure for online MDPs is the regret, defined as the performance difference between the learner's policy and that of the best fixed policy in hindsight, namely,

REG_T = \sum_{t=1}^{T} \ell_t(x_t, \pi_t(x_t)) - \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t(x_t, \pi(x_t)),   (1)

where Π is a certain policy class. Many efforts have been devoted to optimizing this measure, yielding fruitful results (Even-Dar et al., 2009; Neu et al., 2012; Zimin & Neu, 2013; Neu et al., 2014; Rosenberg & Mansour, 2019a; 2021; Chen et al., 2021a). However, one caveat of the performance measure in Eq. (1) is that it only benchmarks the learner's performance against a fixed strategy, so it is usually called the static regret in the literature. This makes the static regret metric ill-suited to guide algorithm design for online decision making in non-stationary environments, which is often the case in many real-world decision-making applications such as online recommendation and autonomous driving.
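To make the protocol and the static regret in Eq. (1) concrete, the following Python sketch simulates a tiny tabular instance. It is purely illustrative and not part of the paper's algorithms: the MDP, the horizon, and the naive uniform-policy learner are hypothetical placeholders, and the comparator in Eq. (1) is evaluated on the learner's visited states, which lets the minimization decompose state by state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy online MDP with a known transition kernel (all sizes are hypothetical).
n_states, n_actions, T = 4, 3, 200
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a] = distribution over next states
losses = rng.random((T, n_states, n_actions))                     # adversarial losses, revealed after each round

# Protocol: at round t the learner picks pi_t(.|x_t), draws a_t, suffers the loss, and the state transits.
uniform = np.full(n_actions, 1.0 / n_actions)
x, cum_loss, visited = 0, 0.0, []
for t in range(T):
    pi_t = uniform                        # a naive learner: play the uniform policy at every round
    a = rng.choice(n_actions, p=pi_t)
    cum_loss += losses[t, x, a]
    visited.append((x, a))
    x = rng.choice(n_states, p=P[x, a])

# Static regret, Eq. (1): the best fixed deterministic policy evaluated on the visited states
# decomposes state by state, so the hindsight minimization is cheap.
per_state_action = np.zeros((n_states, n_actions))
for t, (x_t, _) in enumerate(visited):
    per_state_action[x_t] += losses[t, x_t]
best_fixed = per_state_action.min(axis=1).sum()
print("cumulative loss:", cum_loss, " static regret (Eq. 1):", cum_loss - best_fixed)
```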
In particular, in online MDPs model the loss functions encountered by the learner can be adversarially changing, it is thus unrealistic to assume the existence of a single fixed strategy that can perform well over the horizon. To this end, in this paper we introduce the dynamic regret as the metric to guide the algorithm design for online MDPs, which competes the learner s performance against a sequence of changing policies, defined as D-REGT (πc 1:T ) = t=1 ℓt(xt, πt(xt)) t=1 ℓt(xt, πc t (xt)), (2) where πc 1, . . . , πc T is any sequence of compared policies in the policy class Π, which can be chosen with the complete foreknowledge of the online loss functions. We use πc 1:T as a shorthand of the compared policies. An upper bound of dynamic regret usually scales with a certain variation quantity of the compared policies denoted by PT (πc 1, . . . , πc T ) that can reflect the non-stationarity of environments. Dynamic Regret of Online Markov Decision Processes Table 1: Summary of our main results. For three models of online MDPs (episodic loop-free SSP, episodic SSP, and infinitehorizon MDPs), we establish their dynamic regret guarantees. Our obtained dynamic regret bounds immediately recover the best known static regret presented in the last column, when choosing a fixed compared policy and the non-stationarity measure PT or PK then equals to zero. Note that all our results are achieved by parameter-free algorithms in the sense that they do not require the knowledge of unknown quantities related to the environmental non-stationarity. MDP Model Ours Result (dynamic regret) Previous Work (static regret) Episodic loop-free SSP (Section 2) e O(H p K(1 + PT )) [Theorem 1] e O(H K) (Zimin & Neu, 2013) Episodic SSP (Section 3) e O( p BK(H + PK) + PK) [Theorem 3] e O( Hπ DK) (Chen et al., 2021a) Infinite-horizon MDPs (Section 4) e O( p τT(1 + τPT ) + τ 2PT ) [Theorem 6] e O( τT) (Zimin & Neu, 2013) The dynamic regret measure in Eq. (2) is in fact very general due to the flexibility of compared policies. For example, it immediately recovers the standard regret notion defined in Eq. (1) when choosing the single best compared policy in hindsight, namely, choosing πc 1:T = π arg minπ Π PT t=1 ℓt(xt, π(xt)). Hence, any dynamic regret upper bound directly implies a static regret upper bound by substituting a fixed compared policy. Another typical choice is setting the compared policies as the sequence of the best policy of each round, namely, choosing πc t = π t arg minπ Π ℓt(xt, π(xt)), and the resulting dynamic regret measure is sometimes referred to as the worst-case dynamic regret in the literature (Zhang et al., 2018). It is noteworthy to emphasize that the dynamic regret measure in Eq. (2) does not assume prior information of the compared policies, which is certainly also unknown to the online algorithms. As a result, the measure is also called universal dynamic regret (or general dynamic regret) in the sense that the regret bound holds for any feasible compared policies. Both static regret and the worst-case dynamic regret are two special cases of the universal dynamic regret by configuring different choices of compared policies. In this paper, focusing on the dynamic regret measure presented in Eq. (2), we investigate three foundational and well-studied models of online MDPs: (i) episodic loopfree Stochastic Shortest Path (SSP) (Zimin & Neu, 2013), (ii) episodic SSP (Rosenberg & Mansour, 2021; Chen et al., 2021a), and (iii) infinite-horizon MDPs (Even-Dar et al., 2009). 
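Before turning to the three models, the relation among the regret notions above can be illustrated numerically. The sketch below is a hedged toy example, not the paper's setting: it collapses policies to per-round action choices at the learner's visited states (the arrays `eval_loss` and `learner_loss` are hypothetical data), and it shows that static regret and worst-case dynamic regret are obtained from the universal dynamic regret in Eq. (2) by particular comparator choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-round losses evaluated at the learner's visited state x_t, for each action (hypothetical data):
# eval_loss[t, a] = l_t(x_t, a); learner_loss[t] is the loss the learner actually incurred at round t.
T, n_actions = 100, 4
eval_loss = rng.random((T, n_actions))
learner_loss = eval_loss[np.arange(T), rng.integers(n_actions, size=T)]

def dynamic_regret(comparators):
    """Universal dynamic regret, Eq. (2), against an arbitrary sequence of deterministic comparators."""
    return learner_loss.sum() - eval_loss[np.arange(T), comparators].sum()

# Special case 1: a single best fixed comparator in hindsight -> static regret, Eq. (1).
best_fixed = np.argmin(eval_loss.sum(axis=0))
print("static regret            :", dynamic_regret(np.full(T, best_fixed)))

# Special case 2: the per-round minimizer -> worst-case dynamic regret (Zhang et al., 2018).
print("worst-case dynamic regret:", dynamic_regret(eval_loss.argmin(axis=1)))

# General case: any comparator sequence, e.g., one that drifts occasionally; an upper bound on this
# regret is expected to grow with how much the comparator sequence varies over time (P_T).
drifting = np.repeat(rng.integers(n_actions, size=T // 10), 10)
print("universal dynamic regret vs. a drifting comparator:", dynamic_regret(drifting))
```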
The first two SSP models belong to episodic MDPs, in which the learner interacts with environments in episodes and the goal is to reach a goal state with minimum total loss. The distinction lies in that the learner is guaranteed to reach the goal state within a fixed number of steps in the loop-free SSP model; by contrast, the horizon length in general SSP model depends on the learner s policies, which could potentially be infinite (if the goal is not reached). In infinite-horizon MDPs, there is no goal state and the horizon can be never end and the goal of the learner is to minimize the average loss over time. For all those three models, we propose novel online algorithms and provide the corresponding expected dynamic regret guarantees. We also establish several lower bound results and show that the obtained upper bounds for episodic loop-free SSP and general SSP are minimax optimal in terms of time horizon and non-stationarity measure. Notably, all our algorithms are parameter-free in the sense that they do not require knowing the non-stationarity quantity ahead of time. Table 1 summarizes our main results. Technical contributions. Similar to prior studies of non-stationary online learning (Hazan & Seshadhri, 2009; Daniely et al., 2015; Zhang et al., 2018; Zhao et al., 2020b), our proposed algorithms fall into the online ensemble framework with a meta-base two-layer structure. While the framework is standard in modern online learning, several important new ingredients are required to achieve minimax dynamic regret guarantees for online MDPs. We highlight the main technical challenges and contributions as follows. For all three models, algorithms are performed over the occupancy measure space, so dynamic regret inevitably scales with the variation of occupancy measures induced by compared policies, making it necessary to establish relationships between the variation of occupancy measures and that of compared policies. Achieving the minimax dynamic regret bound for episodic (non-loop-free) SSP is one of the most challenging parts of this paper due to the complicated structure of this model and also the requirement of handling dual uncertainties of unknown horizon length and unknown non-stationarity. This motivates a novel groupwise scheduling for base-learners and a new weighted entropy regularizer for the meta-algorithm. Additionally, appropriate correction terms in the feedback loss and carefully designed step sizes for both base-algorithm and meta-algorithm are also important. For learning in infinite-horizon MDPs, we present a reduction to the problem of minimizing dynamic regret of the switching-cost expert problem, which is new to the best of our knowledge. Dynamic Regret of Online Markov Decision Processes Notations. We present several general notations used throughout the paper. We use ℓ Rd [a,b] to denote a vector whose each element satisfies ℓi [a, b] for i [d]. For a vector a Rd, a2 denotes the vector (a2 1, . . . , a2 d) Rd. Besides, ei Rd denotes the i-th standard basis vector. For a convex function ψ, its induced Bregman divergence is defined as Dψ(u, w) = ψ(u) ψ(w) ψ(w), u w . Given two policies π and π , π π 1, = maxx π( |x) π ( |x) 1. e O( ) omits the logarithmic factors on horizon T. Organization. The rest of the paper is organized as follows. In Section 2 and Section 3, we establish the minimax dynamic regret for episodic loop-free and general SSP respectively. In Section 4, we provide dynamic regret bound for infinite-horizon MDPs. 
Section 5 concludes the paper and discusses the future work. We defer the related works to Appendix A and proofs to remaining appendices. 2. Episodic Loop-free SSP This section presents our results for episodic loop-free SSP, a foundational and conceptually simple model of online MDPs. We first introduce the problem setup, and then establish the minimax dynamic regret bound. 2.1. Problem Setup An episodic online MDP is specified by a tuple M = (X, g, A, P, {ℓk}K k=1), where X and A are the finite state and action spaces, g / X is the goal state, P : X A X {g} [0, 1] is the transition kernel, K is the number of episodes and ℓk R|X||A| [0,1] is the loss function in episode k [K]. An episodic loop-free SSP is an instance of episodic online MDP and further satisfies the following conditions: state space X {g} can be decomposed into H +1 non-intersecting layers denoted by X0, . . . , XH 1, g such that X0 = {x0} and g are singletons, and transitions are only possible between the consecutive layers. Notice that the total horizon is T = KH. The learning protocol of episodic loop-free SSP proceeds in K episodes. In each episode k [K], environments decide a loss ℓk : X A [0, 1], and simultaneously the learner starts from state x0 and moves forward across consecutive layers until reaching the goal state g. We focus on the fullinformation setting, i.e., the loss is revealed to the learner after the episode ends. Notably, no statistical assumption is imposed on the loss sequence, which means the online loss functions can be chosen in an adversarial manner. Occupancy measure. Existing studies reveal the importance of the concept occupancy measure in handling online MDPs (Zimin & Neu, 2013; Rosenberg & Mansour, 2019a), which deeply connects the problem of online MDPs with online convex optimization. Given a policy π and transition kernel P, the occupancy measure qπ R|X||A| [0,1] is defined as the probability of visiting state-action pair (x, a) by executing the policy π, i.e., qπ(x, a) = Pr xl(x) = x, al(x) = a|P, π , where l(x) is the index of the layer that x belongs to. For an episode loop-free SSP instance M, its occupancy measure space is defined as (M) = {q R|X||A| [0,1] | P a A q(x, a) = 1, l = {0} [H 1] and P x Xl(x) 1 P a A P(x|x , a )q(x , a ) = P a A q(x, a), x X \ {x0}}, For any occupancy measure q (M), it induces a policy π such that π(a|x) q(x, a), x X, a A. Existing study shows that there exists a unique induced policy for all occupancy measures in (M) and vice versa (Zimin & Neu, 2013). Then, the expected loss of any policy π at episode k can be written as E h PH 1 l=0 ℓk(xl, al | P, π) i = PH 1 l=0 P x Xl P a A qπ(x, a)ℓt(x, a) = qπ, ℓk , where the expectation is taken over the randomness of the policy and transition kernel. Note the total horizon T of episodic loop-free SSP can be divided into K episodes, each with horizon length H, i.e., T = KH. Denote by πk,l the policy at layer l {0} [H 1] in episode k [K], the policy sequence π1, . . . , πT in Eq. (1) can be represented by π1,0, . . . , π1,H 1, π2,0, . . . , πK,H 1. We use the notation πk as a shorthand of πk,0:H 1 for notational simplicity. Then we can rewrite the expected static regret in Eq. (1) as E [REGK] = PK k=1 qπk, ℓk minq (M) PK k=1 q, ℓk . Dynamic regret. Similar to the derivation for static regret, we can also rewrite the expected dynamic regret in Eq. 
(2) into a form with respect to the occupancy measure as

E[D-REG_K(\pi^c_{1:K})] = \sum_{k=1}^{K} \langle q^{\pi_k}, \ell_k \rangle - \sum_{k=1}^{K} \langle q^{\pi^c_k}, \ell_k \rangle,   (3)

where q^{π^c_k} is the occupancy measure of the compared policy π^c_k for all k ∈ [K]. The non-stationarity measure is naturally defined as P_T = Σ_{k=2}^K Σ_{l=0}^{H−1} ‖π^c_{k,l} − π^c_{k−1,l}‖_{1,∞}.

2.2. Dynamic Regret

Before presenting our algorithm for dynamic regret, we first briefly review the O-REPS algorithm of Zimin & Neu (2013) developed for minimizing the static regret. The key idea of O-REPS is to perform online mirror descent over the occupancy measure space Δ(M): specifically, at episode k + 1, the learner updates the prediction by

q_{k+1} = \arg\min_{q \in \Delta(M)} \ \eta \langle q, \ell_k \rangle + D_\psi(q, q_k),

where η > 0 is the step size and ψ(q) = Σ_{x,a} q(x, a) log q(x, a) is the standard negative entropy. Zimin & Neu (2013) prove that O-REPS enjoys an O(H√(K log(|X||A|))) static regret.

By slightly modifying the algorithm, in the following lemma we show that O-REPS over a clipped occupancy measure space can achieve dynamic regret guarantees. Specifically, define the clipped space as Δ(M, α) = {q | q ∈ Δ(M), and q(x, a) ≥ α, ∀x, a} with 0 < α < 1 being the clipping parameter. We prove that performing O-REPS over Δ(M, α) ensures the following dynamic regret.

Lemma 1. Set q_1 = arg min_{q ∈ Δ(M,α)} ψ(q). For any compared policies π^c_1, . . . , π^c_K ∈ {π | q^π ∈ Δ(M, α)}, O-REPS over the clipped space Δ(M, α) ensures

\sum_{k=1}^{K} \langle q_k - q^{\pi^c_k}, \ell_k \rangle \le \eta T + \frac{1}{\eta} \Big( H \log\frac{|X||A|}{H} + \bar{P}_T \log\frac{1}{\alpha} \Big),

where P̄_T = P̄_T(π^c_1, . . . , π^c_K) = Σ_{k=2}^K ‖q^{π^c_k} − q^{π^c_{k−1}}‖_1 is the path-length of occupancy measures.

To achieve a favorable dynamic regret, we need to set the step size η optimally to balance the time horizon T and the path-length of occupancy measures P̄_T. However, we do not have prior knowledge of P̄_T even after the horizon ends, since the compared policies can be arbitrarily chosen in the feasible set. Thus, we cannot apply standard adaptive step size tuning techniques such as the doubling trick or self-confident tuning (Auer et al., 2002) to remove the dependence on P̄_T.

To address the issue, we employ a meta-base two-layer structure to handle the uncertainty (Zhang et al., 2018; Zhao et al., 2020b). Specifically, we first construct a step size pool H = {η_1, . . . , η_N} (N is the number of candidate step sizes, is of order O(log T), and its configuration will be specified later) to discretize the value range of the optimal step size; we then initialize multiple base-learners simultaneously, denoted by B_1, . . . , B_N, where B_i returns her prediction q_{k,i} by performing O-REPS with step size η_i ∈ H; finally, a meta-algorithm is used to combine the predictions of all base-learners and yield the final output {q_k}_{k=1}^K. Below, we specify the details.

At episode k ∈ [K], the learner receives the decision q_{k,i} from each base-learner B_i, i ∈ [N], and the weight vector p_k ∈ Δ_N from the meta-algorithm. Then the learner outputs the decision q_k = Σ_{i=1}^N p_{k,i} q_{k,i}, plays the induced policy π_k(a|x) ∝ q_k(x, a), ∀x, a, and observes the loss function ℓ_k. After that, the base-learner B_i updates by performing O-REPS over the clipped space Δ(M, α) with step size η_i ∈ H, namely,

q_{k+1,i} = \arg\min_{q \in \Delta(M,\alpha)} \ \eta_i \langle q, \ell_k \rangle + D_\psi(q, q_{k,i}),

where η_i ∈ H is the step size of the base-learner B_i. The meta-algorithm aims to track the unknown best base-learner. We employ the Hedge algorithm (Freund & Schapire, 1997), which updates the weight p_{k+1} ∈ Δ_N by p_{k+1,i} ∝ exp(−ε Σ_{s=1}^k h_{s,i}), where ε > 0 is the learning rate and h_k ∈ R^N evaluates the performance of the base-learners and is set as h_{k,i} = ⟨q_{k,i}, ℓ_k⟩ for i ∈ [N].
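Putting these pieces together, the meta-base structure can be summarized by the following skeleton. It is a minimal sketch rather than the actual DO-REPS: the true base update is O-REPS, i.e., entropic mirror descent projected onto the clipped occupancy-measure space Δ(M, α), whereas the stand-in `base_update` below only renormalizes onto a simplex, and all dimensions, losses, and rate constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, K = 12, 300                                     # dim stands in for |X||A|, K is the number of episodes
step_size_pool = [2 ** i * np.sqrt(np.log(dim) / K) for i in range(5)]   # geometric step-size pool
N = len(step_size_pool)

def base_update(q, loss, eta):
    """Stand-in for the O-REPS step: multiplicative update, then renormalize.  The real algorithm
    instead performs a Bregman projection onto the clipped occupancy-measure space Delta(M, alpha)."""
    q_new = q * np.exp(-eta * loss)
    return q_new / q_new.sum()

q_base = np.full((N, dim), 1.0 / dim)                # each base-learner's occupancy-measure iterate
p = np.full(N, 1.0 / N)                              # meta weights over the base-learners
eps = np.sqrt(np.log(N) / K)                         # Hedge learning rate (order chosen for the sketch)
cum_h, total_loss = np.zeros(N), 0.0

for k in range(K):
    loss = rng.random(dim)                           # adversarial loss l_k, revealed after playing
    q_final = p @ q_base                             # q_k = sum_i p_{k,i} q_{k,i}; its induced policy is played
    total_loss += q_final @ loss
    h = q_base @ loss                                # h_{k,i} = <q_{k,i}, l_k> evaluates each base-learner
    cum_h += h
    p = np.exp(-eps * (cum_h - cum_h.min()))         # Hedge: p_{k+1,i} proportional to exp(-eps * sum_s h_{s,i})
    p /= p.sum()
    q_base = np.array([base_update(q_base[i], loss, step_size_pool[i]) for i in range(N)])

print("cumulative (expected) loss of the combined decisions:", total_loss)
```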
Algorithm 1 DO-REPS
Input: step size pool H, learning rate ε, clipping parameter α.
1: Define ψ(q) = Σ_{x,a} q(x, a) log q(x, a).
2: Initialization: set q_{1,i} = arg min_{q ∈ Δ(M,α)} ψ(q) and p_{1,i} = 1/N, ∀i ∈ [N].
3: for k = 1 to K do
4:   Receive q_{k,i} from base-learner B_i for i ∈ [N].
5:   Compute the occupancy measure q_k = Σ_{i=1}^N p_{k,i} q_{k,i}.
6:   Play the induced policy π_k(a|x) ∝ q_k(x, a), ∀x, a.
7:   Update the weight by p_{k+1,i} ∝ exp(−ε Σ_{s=1}^k h_{s,i}), where h_{k,i} = ⟨q_{k,i}, ℓ_k⟩, ∀i ∈ [N].
8:   Each base-learner B_i updates her prediction by q_{k+1,i} = arg min_{q ∈ Δ(M,α)} η_i ⟨q, ℓ_k⟩ + D_ψ(q, q_{k,i}).
9: end for

Algorithm 1 summarizes our proposed Dynamic O-REPS (DO-REPS) algorithm, and the guarantee is as follows.

Theorem 1. Set the clipping parameter α = 1/T², the step size pool H = {η_i = 2^{i−1} √(log(|X||A|/H)/K) | i ∈ [N]} with N = ⌈(1/2) log₂(1 + 4K log T / log(|X||A|/H))⌉ + 1, and the learning rate of the meta-algorithm as ε = √((log N)/(HT)). DO-REPS (Algorithm 1) satisfies

E[D-REG_K(\pi^c_{1:K})] \le O\Big(\sqrt{T\big(H \log |X||A| + \bar{P}_T \log T\big)}\Big) \le O\Big(H\sqrt{K\big(\log |X||A| + P_T \log T\big)}\Big),

where P̄_T = Σ_{k=2}^K ‖q^{π^c_k} − q^{π^c_{k−1}}‖_1 is the path-length of occupancy measures and P_T = Σ_{k=2}^K Σ_{l=0}^{H−1} ‖π^c_{k,l} − π^c_{k−1,l}‖_{1,∞} is the path-length of the compared policies.

Remark 1. Setting the compared policies π^c_{1:K} = π* (then P_T = 0), Theorem 1 recovers the O(H√(K log |X||A|)) minimax optimal static regret of Zimin & Neu (2013).

The proof can be found in Appendix C.3. Note that Theorem 1 presents two dynamic regret bounds, in terms of either the path-length of occupancy measures P̄_T or the path-length of compared policies P_T (see the definition at the end of Section 2.1). To achieve the latter, we establish the relationship between the path-length variations of compared policies and of their induced occupancy measures. Indeed, we prove that P̄_T ≤ HP_T in Lemma 6 of Appendix C.1.

We finally establish the lower bound in Theorem 2, which indicates the minimax optimality of our attained upper bound in terms of T and P̄_T (up to logarithmic factors).

Theorem 2. For any online algorithm and any γ ∈ [0, 2T], there exists an episodic loop-free SSP instance with H layers, |X| states, and |A| actions, and a sequence of compared policies π^c_1, . . . , π^c_K such that P̄_T ≤ γ and E[D-REG_K] ≥ Ω(√(T(H + γ) log |X||A|)) under the full-information and known-transition setting.

3. Episodic SSP

In this section, we consider the episodic SSP, which does not necessarily satisfy the loop-free structure and is thus more general and difficult than the loop-free SSP studied in Section 2. For this model, we first introduce the formal problem setup and then establish the minimax dynamic regret.

3.1. Problem Setup

An episodic SSP instance is defined by a tuple M = (X, g, A, P, {ℓ_k}_{k=1}^K), the same as introduced in Section 2.1, where x_0 ∈ X is the initial state and g ∉ X is the goal state. The learning protocol proceeds in K episodes. In each episode k ∈ [K], the environment decides a loss ℓ_k : X × A → [0, 1], and simultaneously the learner starts from state x_0 and moves to the next state until reaching the goal state g. Thus, the horizon of each episode depends on the learner's policy, is not fixed, and can even be infinite, leading to inherent difficulties compared with episodic loop-free SSP. The goal of the learner is to reach the goal state with the smallest cumulative loss. Again, we focus on the full-information setting, namely, the entire loss is revealed to the learner after the episode ends. Below we introduce several key concepts, and we refer the reader to the recent work (Chen et al., 2021a) for more details.

Proper policy.
A policy is called proper if playing it ensures that the goal state is reached within a finite number of steps with probability 1 starting from any state, otherwise it is called improper. The set of all proper policies is denoted by Πproper. Following earlier studies (Rosenberg & Mansour, 2021; Chen et al., 2021a), we assume Πproper = . Hitting time. Denote by Hπ(x) the expected hitting time of g when executing policy π and starting from state x. If π is proper, Hπ(x) is finite for any x X. Let Hπ Hπ(x0) be the hitting time of policy π from the initial state x0 to simplify notation. Another useful concept in SSP is the fast policy πf, defined as the (deterministic) policy that achieves the minimum expected hitting time starting from any state. The diameter of the SSP is defined as D = maxx X minπ Πproper Hπ(x) = maxx X Hπf (x). Note both πf and D can be computed ahead of time as the transition kernel is known (Bertsekas & Tsitsiklis, 1991). Cost-to-go function. Given a loss function ℓand a policy π, the induced cost-to-go function Jπ : X [0, ) is defined as Jπ(x) = E[PI i=1 ℓ(xi, ai) P, π], where I denotes the number of steps before reaching g of policy π and the expectation is over the randomness of the stochastic policy and transition kernel. Denote by Jπ k the cost-to-go function for policy π with respect to loss ℓk from the initial state x0. Occupancy measure. For the episodic SSP, the occupancy measure qπ R|X||A| is defined as the expected num- ber of visits to (x, a) from x0 to g when executing π, i.e., qπ(s, a) = E[PI t=1 1{xt = x, at = a} | P, π, x1 = x0]. Similar to the case in loop-free SSP, the inducted policy of a given occupancy measure q : X A [0, ) can be calculated by π(a|x) q(x, a), x, a. It holds that Hπ = P x,a qπ(x, a). Based on the occupancy measure, we can rewrite the cost-to-go function Jπ k as Jπ k = E[PIk i=1 ℓk(xi, ai) | P, πk] = P x,a qπ(x, a)ℓk(x, a) = qπ, ℓk , where Ik denotes the number of steps before reaching g of policy π in episode k. Then the expected static regret in Eq. (1) for episodic SSP can be written as E [REGK] = E PK k=1(Jπk k Jπ k ) = E PK k=1 qπk qπ , ℓk , where π = arg minπ Πproper PK k=1 Jπ k . Two important quantities related to π are commonly used in the analysis: (i) its hitting time Hπ from initial state x0; and (ii) the cumulative loss PK k=1 Jπ k during K episodes. The cumulative loss of the best policy is smaller than the fast policy, i.e., PK k=1 Jπ k PK k=1 Jπf k DK, where the last inequality holds due to the definition of the fast policy and the boundedness of the loss range in [0, 1]. Dynamic regret. Similar to the derivation for static regret, we can also rewrite the expected dynamic regret in Eq. (2) into a form with respect to the occupancy measure as E[D-REGK] E K X k=1 (Jπk k J πc k k ) = E K X k=1 qπk qπc k, ℓk . Similarly, we generalize the two crucial quantities to accommodate changing comparators: the largest hitting time starting from the initial state H = maxk [K] Hπc k and the cumulative loss of compared policies BK = PK k=1 Jπc k k = PK k=1 qπc k, ℓk . It is clear that BK H K. Notably, both quantities H and BK are unknown to the learner due to involving the unknown compared policies. For the episodic (non-loop-free) SSP, the non-stationarity measure is naturally defined as PK = PK k=2 πc k πc k 1 1, . 3.2. Dynamic Regret Before introducing our approach, we first review existing works studying static regret and then show that several crucial ingredients are required to achieve dynamic regret. 
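Since the transition kernel is known, the quantities introduced above (hitting times, the fast policy π_f, and the diameter D) can indeed be computed ahead of time. The sketch below does so on a hypothetical toy SSP in which every action reaches the goal with positive probability, so that all policies are proper; it solves a linear system for the hitting times of a given policy and runs value iteration for the fast policy. It is an illustration under these assumptions rather than code from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical SSP: states 0..n-1 plus a goal state; P[x, a] is a distribution over the n states
# and the goal (last coordinate).
n, n_actions = 5, 3
P = rng.dirichlet(np.ones(n + 1), size=(n, n_actions))

def hitting_times(policy):
    """Expected hitting time H_pi(x) of the goal from each state, for a stochastic policy pi(a|x).
    Solves h = 1 + P_pi[:, :n] h, where the goal contributes nothing (H_pi(goal) = 0)."""
    P_pi = np.einsum('xa,xay->xy', policy, P)          # state-to-state kernel under pi (goal included)
    return np.linalg.solve(np.eye(n) - P_pi[:, :n], np.ones(n))

def fast_policy(iters=1000):
    """Value iteration on h(x) = 1 + min_a sum_y P(y|x,a) h(y): returns the fast policy pi_f and its
    hitting times, whose maximum over starting states is the diameter D."""
    h = np.zeros(n)
    for _ in range(iters):
        q = 1.0 + P[:, :, :n] @ h                      # q[x, a] = 1 + E[h(next state)], goal counts as 0
        h = q.min(axis=1)
    greedy = np.eye(n_actions)[q.argmin(axis=1)]       # deterministic greedy policy pi_f
    return greedy, h

pi_f, h_f = fast_policy()
print("diameter D =", h_f.max())
print("hitting times of the uniform policy:", hitting_times(np.full((n, n_actions), 1 / n_actions)))
```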
To resolve episodic (non-loop-free) SSP, Rosenberg & Mansour (2021) propose to deploy Online Mirror Descent (OMD) over the parametrized occupancy measure space. For an MDP instance M and a given horizon length H, the parameterized space is defined as (M, H) = {q R|X||A| 0 | P x,a q(x, a) H and P a q(x, a) = P x ,a P(x|x , a )q(x , a ), x X}. The authors prove that OMD enjoys an e O(H K) static regret as long as qπ (M, H). Therefore, if the largest hitting time Hπ were known ahead of time, a simple choice of H = Hπ would attain the favorable static regret. However, such in- Dynamic Regret of Online Markov Decision Processes formation is in fact unavailable in advance, which motivates a two-layer approach deal with this uncertainty. Specifically, Chen et al. (2021a) maintain multiple baselearners B1, . . . , BN, where Bi works with an occupancy measure space (M, Hi) and a step size ηi and returns her individual occupancy measure qi k; and then a certain meta-algorithm is employed to combine predictions of baselearners to produce final decisions qk. Let Bi be the base-learner whose space size Hi well approximates the unknown Hπ . Denote by LK = PK k=1 qk, ℓk , Li K = PK k=1 qi k , ℓk , Lc K = PK k=1 qπc k, ℓk 1 the cumulative loss of final decisions, base-learner Bi and the compared policy, respectively. The overall regret can be decomposed as E[REGK] = E[(LK Li K)] + E[(Li K Lc K)], (4) where the two terms are called meta-regret (that captures the regret overhead due to the two-layer ensemble) and base-regret (that measures the regret of the unknown best base-learner). To achieve a favorable regret, they propose two mechanisms to control base-regret and metaregret respectively. First, they pick the base-algorithm with an e O(Hi /ηi + ηi Lc K) small-loss static regret, which ensures an e O( Hπ DK) base-regret by setting ηi = O( p Hi/DK) as the cumulative loss of the best policy in hindsight satisfies Lc K DK. Second, they design a small-loss type multi-scale online algorithm (roughly, OMD with weighted entropy ψ(p) = PN i=1 1 εi pi log pi) as the meta-algorithm to make meta-regret adaptive to the individual loss range of experts, so that meta-regret is at most e O(1/εi + εi Hi Li K). Combining the baseregret we further have Li K Lc K + e O( Hπ DK) DK + e O( Hπ DK) = e O(DK) as Hπ DK. So an e O( Hπ DK) meta-regret is achievable by setting εi = e O(1/ Hi DK), which in conjunction with the base-regret yields an e O( Hπ DK) static regret. However, it becomes more involved for dynamic regret. First of all, in addition to the uncertainty of unknown horizon length H , the base level also needs to deal with the unknown environmental non-stationarity PK. Conceptually, this can be handled by maintaining more base-learners, which will be specified later. Second and more importantly, it is challenging to design a compatible meta-algorithm. To see this, suppose we already have an e O( p BK(PK + H )) small-loss dynamic regret for the base-algorithm, where BK is the cumulative loss of compared policies as defined at the end of Section 3.1, we then continue the above recipe and see the issue in meta-regret. Indeed, the metaregret is at most e O(1/εi + εi Hi Li K), and by the baseregret bound we have Li K Lc K + base-regret 1Here we define Lc K in a general way to accommodate changing comparators, which will be later used in the dynamic regret analysis. For static regret, it becomes Lc K = PK k=1 qπ , ℓk . BK + e O( p BK(PK + H )). 
The natural upper bound of B_K depends on H∗ (recall that B_K ≤ H∗K) due to the arbitrary choice of compared policies. An important technical caveat is that here we cannot simply assume the cost-to-go functions of the compared policies {J^{π^c_k}_k}_{k=1,...,K} are bounded by that of the fast policy J^{π_f}_k, in contrast to the static regret analysis where we have Σ_{k=1}^K J^{π*}_k ≤ Σ_{k=1}^K J^{π_f}_k due to the optimality of the compared offline policy. Hence, even with a multi-scale meta-algorithm, the meta-regret will be Õ(H∗√K) and become the dominating term, making the final dynamic regret linear in H∗ and thus suboptimal.

To address the above issues at both the base and meta levels, building upon the structure of Chen et al. (2021a), we propose a novel two-layer approach to deal with the dual uncertainty of the unknown horizon length and the unknown non-stationarity. To achieve this, we introduce three crucial ingredients: groupwise scheduling for base-learners, injecting corrections into the feedback loss of both the base- and meta-algorithm, and a new multi-scale meta-algorithm. Below, we first describe the base-algorithm, then introduce the scheduling method that instantiates a group of base-learners with different parameter configurations, and finally design the meta-algorithm to adaptively combine all the base-learners.

Base-algorithm. The base-algorithm performs OMD over a clipped occupancy measure space. At each episode k ∈ [K], the base-algorithm receives the loss ℓ_k and performs

q_{k+1} = \arg\min_{q \in \Delta(M,H,\alpha)} \ \eta \langle q, \ell_k + a_k \rangle + D_\psi(q, q_k),   (5)

where η > 0 is the step size, Δ(M, H, α) = {q ∈ Δ(M, H) | q(x, a) ≥ α, ∀x, a} is the clipped space with α ∈ (0, 1), and ψ is the standard negative-entropy regularizer. Notably, we inject a correction term a_k ∈ R^{|X||A|} into the loss, set as a_k = 32ηℓ_k², ∀k ∈ [K]. The purpose is to ensure a small-loss dynamic regret and simultaneously introduce a negative term that will be crucial to address the difficulty occurring in controlling the meta-regret (as mentioned earlier). The base-algorithm enjoys the following guarantee.

Lemma 2. Set q_1 = arg min_{q ∈ Δ(M,H,α)} ψ(q) and η ≤ 1/64. For any compared policies π^c_{1:K} ⊆ {π | q^π ∈ Δ(M, H, α)}, Eq. (5) ensures that Σ_{k=1}^K ⟨q_k − q^{π^c_k}, ℓ_k⟩ is upper bounded by

\frac{1}{\eta}\Big( \bar{P}_K \log\frac{1}{\alpha} + H\big(1 + \log(|X||A|H)\big) \Big) + 32\eta B_K - 16\eta S_K,

where S_K = Σ_{k=1}^K ⟨q_k, ℓ_k²⟩ and P̄_K = Σ_{k=2}^K ‖q^{π^c_k} − q^{π^c_{k−1}}‖_1 is the path-length of occupancy measures.

Scheduling. Lemma 2 indicates that, given a horizon length H, it is crucial to set the step size properly to achieve a tight dynamic regret. Since H affects the base-learner's feasible domain (i.e., the parametrized occupancy measure space), we propose a groupwise scheduling scheme to simultaneously adapt to the unknown non-stationarity P̄_K and horizon length H∗. Specifically, since H^{π_f} ≤ H∗ ≤ K, we first construct a horizon length pool H = {H_i = 2^{i−1} H^{π_f} | i ∈ [G]} with G = ⌈1 + log₂((K + 1)/H^{π_f})⌉ to exponentially discretize the possible range; and for each H_i in the pool, we further design a step size grid E_i = {η_{i,j} = 1/(32 · 2^j) | j ∈ [N_i]} with N_i = ⌈(1/2) log₂(4K/(1 + log(|X||A|H_i)))⌉ to search for the optimal step size associated with H_i. Overall, we maintain N = Σ_{i=1}^G N_i base-learners, each of which is associated with a specific space size and step size. More precisely, let B_{i,1:N_i} be a shorthand for the i-th group of base-learners B_{i,1}, . . . , B_{i,N_i}, which use the same space size H_i yet different step sizes (see the configuration of E_i). Thus, the set of all base-learners can be denoted as {B_{1,1:N_1}, . . . , B_{i,1:N_i}, . . . , B_{G,1:N_G}}.
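The groupwise schedule can be enumerated explicitly; the short sketch below does so for hypothetical values of K, H^{π_f}, and |X||A|, using the grid constants as reconstructed in the statement above (so the exact constants should be taken as indicative rather than authoritative). It also makes apparent that only O(log² K) base-learners are maintained in total.

```python
import math

# Groupwise scheduling: a geometric pool of candidate horizon lengths H_i, and for each H_i a
# geometric grid of candidate step sizes eta_{i,j}.  The numbers below are hypothetical.
K, H_pi_f, XA = 10_000, 4, 60          # episodes, hitting time of the fast policy, |X| * |A|

G = math.ceil(1 + math.log2((K + 1) / H_pi_f))
horizon_pool = [2 ** (i - 1) * H_pi_f for i in range(1, G + 1)]        # H_i = 2^{i-1} H^{pi_f}

schedule = []
for i, H_i in enumerate(horizon_pool, start=1):
    N_i = math.ceil(0.5 * math.log2(4 * K / (1 + math.log(XA * H_i))))
    for j in range(1, N_i + 1):
        schedule.append((i, j, H_i, 1.0 / (32 * 2 ** j)))              # eta_{i,j} = 1 / (32 * 2^j)

print("groups G =", G, ", total base-learners N =", len(schedule))
for i, j, H_i, eta in schedule[:5]:
    print(f"B_{i},{j}: occupancy space size H_i = {H_i}, step size eta = {eta:.5f}")
```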
The decision of base-learner B_{i,j} in episode k is denoted by q^{i,j}_k.

Meta-algorithm. The meta-algorithm requires a careful design to achieve a favorable regret. We propose a new meta-algorithm under the standard OMD framework, where additional designs are required, including a novel weighted entropy regularizer and an appropriate correction term. Specifically, the meta-algorithm updates p_{k+1} ∈ Δ_N by

p_{k+1} = \arg\min_{p \in \Delta_N} \ \langle p, h_k + b_k \rangle + D_{\bar{\psi}}(p, p_k),   (6)

where h_k ∈ R^N is the loss of the meta-algorithm, defined as h^{i,j}_k = ⟨q^{i,j}_k, ℓ_k⟩ for i ∈ [G], j ∈ [N_i]. Moreover, there are two important features in the design: (i) an injected correction term b_k ∈ R^N; and (ii) a weighted entropy regularizer ψ̄(p) = Σ_{i=1}^N (1/ε_i) p_i log p_i to realize multi-scale online learning, where ε_i > 0 is a multi-scale learning rate for i ∈ [N]. Below we specify the details and the motivation behind such designs.

First, at the meta level we inject a correction term b_k ∈ R^N as

b^{i,j}_k = 32 \varepsilon_{i,j} (h^{i,j}_k)^2, \quad \forall i \in [G], \ j \in [N_i].   (7)

Let B_{i∗,j∗} be the base-learner whose space size H_{i∗} well approximates the unknown H∗ and whose step size η_{i∗,j∗} well approximates the unknown optimal step size. Although injecting a correction term into the meta-algorithm was also used in (Chen et al., 2021a) to ensure a small-loss type meta-regret of the form Õ(1/ε_{i∗,j∗} + ε_{i∗,j∗} H_{i∗} L^{i∗,j∗}_K), as aforementioned, this will not lead to an optimal meta-regret in our case due to the undesired upper bound of L^{i∗,j∗}_K. Aside from that, our key novelty is to simultaneously exploit the correction term at the base level, which contributes an additional negative term to the base-regret Õ((P̄_K + H_{i∗})/η_{i∗,j∗} + η_{i∗,j∗} B_K − η_{i∗,j∗} Σ_{k=1}^K ⟨q^{i∗,j∗}_k, ℓ_k²⟩). By a careful design of the step size η_{i,j} and the learning rate ε_{i,j}, we can successfully cancel the positive term ε_{i∗,j∗} H_{i∗} L^{i∗,j∗}_K in the meta-regret by the negative term in the base-regret.

Algorithm 2 CODO-REPS
Input: horizon length pool H, step size grids E_i for i ∈ [G], and clipping parameter α.
1: Define the weighted entropy ψ̄(p) as in Eq. (8).
2: Initialize q^{i,j}_1 = arg min_{q ∈ Δ(M,H_i,α)} ψ(q) and p^{i,j}_1 ∝ ε²_{i,j}, ∀i ∈ [G], j ∈ [N_i].
3: for k = 1, . . . , K do
4:   Receive q^{i,j}_k from base-learner B_{i,j} by Eq. (5).
5:   Sample (i_k, j_k) ∼ p_k, play the induced policy π_k(a|x) ∝ q^{i_k,j_k}_k(x, a), ∀x, a.
6:   Define h^{i,j}_k = ⟨q^{i,j}_k, ℓ_k⟩ and b^{i,j}_k = 32ε_{i,j}(h^{i,j}_k)², ∀i, j.
7:   Update the weight p_{k+1} ∈ Δ_N by Eq. (6).
8: end for

Second, it is known that OMD with a weighted entropy regularizer leads to a multi-scale expert-tracking algorithm (Bubeck et al., 2019). In our case, we set the weighted entropy regularizer ψ̄ : Δ_N → R as

\bar{\psi}(p) = \sum_{i=1}^{G} \sum_{j=1}^{N_i} \frac{1}{\varepsilon_{i,j}} p_{i,j} \log p_{i,j}, \quad \text{with } \varepsilon_{i,j} = \frac{\eta_{i,j}}{H_i}.   (8)

In the above, η_{i,j} is the step size employed by the base-learner B_{i,j} as specified earlier. Note that the weighted entropy regularizer depends on both the space size and the step size, so that the final meta-algorithm can successfully handle the groupwise scheduling over the base-learners. Combining all the above ingredients yields our COrrected DO-REPS (CODO-REPS) algorithm, as summarized in Algorithm 2. We have the following dynamic regret guarantee.

Theorem 3. Set the clipping parameter α = 1/K³, the horizon length pool H = {H_i = 2^{i−1} H^{π_f} | i ∈ [G]} with G = ⌈1 + log₂((K + 1)/H^{π_f})⌉, and the step size grids E_i = {η_{i,j} = 1/(32 · 2^j) | j ∈ [N_i]} with N_i = ⌈(1/2) log₂(4K/(1 + log(|X||A|H_i)))⌉. CODO-REPS ensures

E[D-REG_K] \le \tilde{O}\Big(\sqrt{(H^* + \bar{P}_K)(H^* + \bar{P}_K + B_K)}\Big).

Remark 2.
Setting compared policies πc 1:K = π (then PT = 0 and BK = PK k=1 Jπ k ), Theorem 3 implies an e O( H BK) static regret, which gives a small-loss type bound for the episodic SSP and is new to the literature to the best of our knowledge. The bound is no worse the minimax rate e O( H DK) of Chen et al. (2021a) as BK = PK k=1 Jπ k DK in the static case, and can be much better than theirs when best policy behaves well. We remark that the upper bound in Theorem 3 depends on the path-length of occupancy measures PK rather than that of compared policies PK. A natural question is how to upper bound PK by PK (up to multiplicative dependence Dynamic Regret of Online Markov Decision Processes on H ). However, we show this is generally impossible for episode (non-loop-free) SSP in Theorem 7 of Appendix D.1. Finally we show that the result in Theorem 3 is actually minimax in terms of BK and PK up to logarithmic factors. Theorem 4. For any online algorithm and any γ [0, 2T], there exists an episodic SSP instance with diameter D and a sequence of compared policies πc 1, . . . , πc K with the largest hitting time H such that PK γ and E[D-REGK] Ω( p DH K(1 + γ/H )) under the full-information and known transition setting. 4. Infinite-horizon MDPs This section studies infinite-horizon MDPs. We begin with the problem setup, then present a reduction to the switchingcost expert problem and establish the dynamic regret bound. 4.1. Problem Setup An infinite-horizon MDP instance is specified by a tuple M = (X, A, P, {ℓt} t=1), where X, A, P are the same as introduced in Section 2, ℓt R|X||A| [0,1] is the loss function at time t [T]. Unlike episodic MDPs studied in previous two sections, infinite-horizon MDPs have no goal state. The learner aims to minimize the cumulative loss over a T-step horizon in the MDP. We investigate the uniform mixing MDPs (Even-Dar et al., 2009; Neu et al., 2010b). Definition 1 (Uniform Mixing). There exists a constant τ 0 such that for any policy π and any pair of distributions µ and µ over X, we have (µ µ )P π 1 e 1/τ µ µ 1. The smallest τ is called the mixing time. The uniform mixing assumption is standard and widely adopted in online MDPs studies (Even-Dar et al., 2009; Neu et al., 2010b; 2014). Nevertheless, the assumption could be strong in some sense, and recent study trying to relax the assumption by considering a larger class of communicating MDPs (Chandrasekaran & Tewari, 2021). It would be interesting to see whether our results can be extended to the communicating MDPs, and we leave this as the future work. Occupancy measure. For an infinite-horizon MDP, the occupancy measure qπ R|X||A| [0,1] is defined as the stationary distribution when executing policy π, i.e., qπ(x, a) = lim T 1 T PT t=1 1{xt = x, at = a}. For an infinitehorizon MDP instance M, its occupancy measure space is defined as (M) = {q R|X||A| [0,1] | P x,a q(x, a) = 1 and P a q(x, a) = P x ,a P(x | x , a )q(x , a ), x X}. For any occupancy measure q (M), its induced policy π can be obtained by π(a|x) q(x, a), x, a. Dynamic regret. As defined in Eq. (2), the dynamic regret benchmarks the learner s performance against a sequence of compared policies πc 1:T , namely, E[D-REGT ] = E T X t=1 ℓt xt, πt(xt) t=1 ℓt xt, πc t (xt) . The non-stationarity measure for infinite-horizon MDPs is naturally defined as PT = PT t=2 πc t πc t 1 1, . 4.2. Reduction to Switching-cost Expert Problem In this part, we present a reduction to the switching-cost expert problem for infinite-horizon MDPs. In fact, we have the following theorem. 
Theorem 5. For infinite-horizon MDPs, the expected dynamic regret against any compared policies π^c_{1:T} satisfies

E[D\text{-}REG_T(\pi^c_{1:T})] \le \sum_{t=1}^{T} \langle q_t - q^{\pi^c_t}, \ell_t \rangle + \tilde{\tau} \sum_{t=2}^{T} \| q_t - q_{t-1} \|_1 + \tilde{\tau}^2 P_T + 4\tilde{\tau},   (9)

where τ̃ = τ + 1 is introduced to simplify the notation.

Therefore, it suffices to design an algorithm to minimize the first two terms on the right-hand side of (9), as the last two terms τ̃²P_T + 4τ̃ are not related to the algorithm. This essentially provides a generic regret reduction from infinite-horizon MDPs to the switching-cost expert problem (Merhav et al., 2002). Specifically, in the expert problem, at each round t ∈ [T] the learner chooses a decision q_t ∈ Δ_N as a weight over the N experts, receives the loss ℓ_t ∈ R^N, and suffers the loss ⟨q_t, ℓ_t⟩. In addition to the cumulative loss Σ_{t=1}^T ⟨q_t, ℓ_t⟩, the switching-cost expert problem further takes the switching of actions into account by adding λ Σ_{t=2}^T ‖q_t − q_{t−1}‖_1 as a penalty, where λ > 0 is the coefficient. Our reduction also holds for the static regret (simply choosing all compared policies as a fixed one). Perhaps surprisingly, there is no explicit reduction of this kind in the literature to the best of our knowledge, though the proof of Theorem 5 is simple and all the ingredients are already in the pioneering work (Even-Dar et al., 2009). As another note, Agarwal et al. (2019) study online non-stochastic control and give a reduction to the switching-cost online learning problem (also called online convex optimization with memory), but their reduction does not apply to infinite-horizon MDPs.

4.3. Dynamic Regret

With the reduction at hand, we now consider the design of a two-layer approach to optimize the dynamic regret of the switching-cost expert problem. It turns out that a recent result (Zhao et al., 2022) has resolved that expert problem, building upon which we propose the REgularized DO-REPS (REDO-REPS) algorithm for infinite-horizon MDPs.

As discussed before, it suffices to design an algorithm to minimize the first two terms in (9), namely, the dynamic regret in terms of the occupancy measures and a switching-cost term. Notice that the first term also appears in optimizing the dynamic regret of the episodic loop-free SSP (see Eq. (3)). Thus, a natural idea is to run DO-REPS (Algorithm 1) over the occupancy measure space Δ(M, α) = {q ∈ Δ(M) | q(x, a) ≥ α, ∀x, a}. Specifically, we maintain N base-learners denoted by B_1, . . . , B_N, where B_i generates the prediction q_{t,i} by performing O-REPS with a particular step size η_i in the step size pool H; then the meta-algorithm combines all the predictions to produce the final decision q_t = Σ_{i=1}^N p_{t,i} q_{t,i} and updates the weight p_t. However, DO-REPS does not take the switching cost into account, leading to undesired behavior in this problem. To see this, it can be verified that the one-step switching cost can be decomposed as

\| q_t - q_{t-1} \|_1 \le \sum_{i=1}^{N} p_{t,i} \| q_{t,i} - q_{t-1,i} \|_1 + \| p_t - p_{t-1} \|_1.

Summing over T, the second term on the right-hand side is the meta-algorithm's switching cost, which can be easily bounded by O(√T) for common expert-tracking algorithms. However, the first term is the weighted switching cost of all base-learners, which could be very large and even grow linearly with the number of iterations due to the base-learners with large step sizes. For example, when employing OMD as the base-algorithm, the switching cost of B_i is of order O(η_i T). Then, the construction of the step size pool requires that η_N = O(1), leading to an O(T) switching cost of the base-learner B_N, which ruins the overall regret bound.
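The one-step decomposition above is easy to verify numerically. The following sketch checks it on random stand-ins for the meta weights and the base-learners' occupancy measures (each of total mass one, as in the infinite-horizon case); the data are hypothetical, and the point is only to illustrate why the weighted base part, driven by large-step-size base-learners, is the term that must be controlled.

```python
import numpy as np

rng = np.random.default_rng(4)

# Numerical check of the one-step switching-cost decomposition used above:
#   ||q_t - q_{t-1}||_1 <= sum_i p_{t,i} ||q_{t,i} - q_{t-1,i}||_1 + ||p_t - p_{t-1}||_1,
# which holds because each base iterate q_{t,i} has total mass one here.
N, dim = 4, 10

def random_simplex(size):
    v = rng.random(size)
    return v / v.sum(axis=-1, keepdims=True)

q_prev, q_curr = random_simplex((N, dim)), random_simplex((N, dim))   # base-learners' iterates
p_prev, p_curr = random_simplex(N), random_simplex(N)                 # meta weights

lhs = np.abs(p_curr @ q_curr - p_prev @ q_prev).sum()                 # switching cost of the combined decision
rhs = p_curr @ np.abs(q_curr - q_prev).sum(axis=1) + np.abs(p_curr - p_prev).sum()
print(f"combined switch {lhs:.3f} <= weighted base part + meta part {rhs:.3f}")

# If one base-learner uses a large step size, its own per-round switching cost (and hence the
# weighted base part) can stay Omega(1), i.e., Omega(T) over the horizon; this is exactly what the
# switching-cost regularization introduced next is designed to penalize.
```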
To address this, inspired by the recent progress on OCO with memory (Zhao et al., 2022), we add a switching-cost regularization in evaluating each base-learner, i.e., the feedback loss of the meta-algorithm ht RN is constructed as ht,i = qt,i, ℓt + λ qt,i qt 1,i 1. (10) Set λ = τ , it can be verified the first two terms PT t=1 qt qπc t, ℓt + τ PT t=2 qt qt 1 1 in (9) can be written as t=1( pt, ht ht,i) + τ T X t=2 pt pt 1 1 t=1 qt,i qπc t, ℓt + τ XT t=2 qt,i qt 1,i 1. We have decomposed the switching-cost dynamic regret into two parts the first part is the meta-regret over the regularized loss ht that measures the regret overhead of the meta-algorithm penalized by the switching cost, and the second part is the base-regret of a specific base-learner Bi taking her switching cost into account. By slightly modifying DO-REPS (Algorithm 1), we get REgularized DO-REPS (REDO-REPS) algorithm as shown in Algorithm 3. The key difference is the designed switching-cost-regularized loss for meta-algorithm s updates in Lines 7 8, such that the overall two-layer approach enjoys nice following guarantee. Algorithm 3 REDO-REPS Input: step size pool H, learning rate ε, clipping param α. 1: Define: ψ(q) = P x,a q(x, a) log q(x, a). 2: Initialization: set q1,i = arg minq (M,α) ψ(q) and p1,i = 1/N for i [N]. 3: for t = 1 to T do 4: Receive qt,i from base-learner Bi, i [N]. 5: Compute qt = PN i=1 pt,iqt,i, play the induced policy πt(a|xt) qt(xt, a), a A. 6: Suffer loss ℓt(xt, at) and observe loss function ℓt. 7: Construct switching-cost-regularized loss by Eq. (10) 8: Update weight by pt+1,i exp( ε Pt s=1 hs,i). 9: Each base-learner Bi updates prediction by qt+1,i = arg minq (M,α) ηi q, ℓt + Dψ(q, qt,i). 10: end for Theorem 6. Set the clipping parameter α = 1/T 2, the step size pool H = {2i 1p T 1 log |X||A| | i [N]} where N = 1 2 log(1 + 4T log T log |X||A|) + 1 and the learning rate ε = (2τ + 3) 1p (log N)/2T. REDO-REPS ensures E[D-REGT ] O p τT(log |X||A| + τPT log T) + τ 2PT . Remark 3. Set πc 1:T = π arg minπ PT t=1 ℓt(xt, π(xt)) (then PT = 0), we recover the best known static regret O( p τT log |X||A|) (Zimin & Neu, 2013). Our dynamic regret scales with the path-length of compared policies rather than that of occupancy measures. We achieve so by establishing their relationships in Lemma 11 of Appendix E.1. 5. Conclusion In this paper we investigate learning in three foundational online MDPs with adversarially changing loss functions and known transition kernel. We propose novel online ensemble algorithms and establish their dynamic regret guarantees for the first time. In particular, the results for episodic (loopfree) SSP are provably minimax optimal in terms of time horizon and certain non-stationarity measure. Our results present an initial resolution for dynamic regret of online MDPs. There are plenty of future works to investigate. For example, this paper focuses on the full-information feedback and known transition, and how to extend the results to the bandit feedback and unknown transition setting is important and challenging. Moreover, it is interesting to further consider function approximation in those models. Acknowledgements This research was supported by NSFC (61921006) and Collaborative Innovation Center of Novel Software Technology and Industrialization. Dynamic Regret of Online Markov Decision Processes Agarwal, N., Bullins, B., Hazan, E., Kakade, S. M., and Singh, K. Online control with adversarial disturbances. 
In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 111 119, 2019. Auer, P., Cesa-Bianchi, N., and Gentile, C. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, pp. 48 75, 2002. Baby, D. and Wang, Y.-X. Online forecasting of totalvariation-bounded sequences. In Advances in Neural Information Processing Systems 32 (Neur IPS), pp. 11071 11081, 2019. Baby, D. and Wang, Y.-X. Optimal dynamic regret in expconcave online learning. In Proceedings of the 34th Conference on Learning Theory (COLT), pp. 359 409, 2021. Baby, D. and Wang, Y.-X. Optimal dynamic regret in proper online learning with strongly convex losses and beyond. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1805 1845, 2022. Bertsekas, D. P. and Tsitsiklis, J. N. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580 595, 1991. Besbes, O., Gur, Y., and Zeevi, A. J. Non-stationary stochastic optimization. Operations Research, pp. 1227 1244, 2015. Bubeck, S., Devanur, N. R., Huang, Z., and Niazadeh, R. Multi-scale online learning: Theory and applications to online auctions and pricing. Journal of Machine Learning Research, 20:62:1 62:37, 2019. Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 1283 1294, 2020. Cesa-Bianchi, N., Gaillard, P., Lugosi, G., and Stoltz, G. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems 25 (NIPS), pp. 989 997, 2012. Chandrasekaran, G. and Tewari, A. Online learning in adversarial MDPs: Is the communicating case harder than ergodic? Ar Xiv preprint, 2111.02024, 2021. Chen, G. and Teboulle, M. Convergence analysis of a proximal-like minimization algorithm using bregman functions. SIAM Journal on Optimization, pp. 538 543, 1993. Chen, L., Luo, H., and Wei, C.-Y. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Proceedings of the 34th Conference on Learning Theory (COLT), pp. 1180 1215, 2021a. Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and its applications. In Proceedings of 34th Conference on Learning Theory (COLT), pp. 1216 1259, 2021b. Chen, X., Wang, Y., and Wang, Y.-X. Non-stationary stochastic optimization under Lp,q-variation measures. Operations Research, 67(6):1752 1765, 2019. Cheung, W. C., Simchi-Levi, D., and Zhu, R. Reinforcement learning for non-stationary Markov decision processes: The blessing of (more) optimism. Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 1843 1854, 2020. Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. In Proceedings of the 25th Conference On Learning Theory (COLT), pp. 6.1 6.20, 2012. Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1405 1411, 2015. Domingues, O. D., Ménard, P., Pirotta, M., Kaufmann, E., and Valko, M. A kernel-based approach to non-stationary reinforcement learning in metric spaces. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 3538 3546, 2021. Even-Dar, E., Kakade, S. M., and Mansour, Y. 
Online Markov decision processes. Mathematics of Operations Research, pp. 726 736, 2009. Fei, Y., Yang, Z., Wang, Z., and Xie, Q. Dynamic regret of policy optimization in non-stationary environments. In Advances in Neural Information Processing Systems 33 (Neur IPS), 2020. Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 139, 1997. Gajane, P., Ortner, R., and Auer, P. A sliding-window algorithm for Markov decision processes with arbitrarily changing rewards and transitions. Ar Xiv preprint, ar Xiv:1805.10066, 2018. György, A. and Szepesvári, C. Shifting regret, mirror descent, and matrices. In Proceedings of the 33nd International Conference on Machine Learning (ICML), pp. 2943 2951, 2016. Dynamic Regret of Online Markov Decision Processes Hazan, E. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, pp. 157 325, 2016. Hazan, E. and Seshadhri, C. Efficient learning algorithms for changing environments. In Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 393 400, 2009. Jadbabaie, A., Rakhlin, A., Shahrampour, S., and Sridharan, K. Online optimization : Competing with dynamic comparators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 398 406, 2015. Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, pp. 1563 1600, 2010. Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 4860 4869, 2020a. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Proceedings of the 33rd Conference on Learning Theory (COLT), pp. 2137 2143, 2020b. Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J., Lam, V., Bewley, A., and Shah, A. Learning to drive in a day. In Proceedings of the 2019 IEEE International Conference on Robotics and Automation (ICRA), pp. 8248 8254, 2019. Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. Corralling a larger band of bandits: A case study on switching regret for linear bandits. In Proceedings of the 35th Conference on Learning Theory (COLT), 2022. Mao, W., Zhang, K., Zhu, R., Simchi-Levi, D., and Basar, T. Near-optimal model-free reinforcement learning in nonstationary episodic MDPs. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 7447 7458, 2021. Merhav, N., Ordentlich, E., Seroussi, G., and Weinberger, M. J. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48 (7):1947 1958, 2002. Neu, G., György, A., and Szepesvári, C. The online loopfree stochastic shortest-path problem. In Proceedings of 23rd Conference on Learning Theory (COLT), pp. 231 243, 2010a. Neu, G., György, A., Szepesvári, C., and Antos, A. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 24 (NIPS), pp. 1804 1812, 2010b. Neu, G., György, A., and Szepesvári, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 805 813, 2012. 
Neu, G., György, A., Szepesvári, C., and Antos, A. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, pp. 676 691, 2014. Ortner, R., Gajane, P., and Auer, P. Variational regret bounds for reinforcement learning. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 81 90, 2019. Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 5478 5486, 2019a. Rosenberg, A. and Mansour, Y. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems 32 (NIPS), pp. 2209 2218, 2019b. Rosenberg, A. and Mansour, Y. Stochastic shortest path with adversarially changing costs. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2936 2942, 2021. Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1889 1897, 2015. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of go with deep neural networks and tree search. Nature, pp. 484 489, 2016. Touati, A. and Vincent, P. Efficient learning in nonstationary linear Markov decision processes. Ar Xiv preprint, ar Xiv:2010.12870, 2020. Wei, C.-Y. and Luo, H. Non-stationary reinforcement learning without prior knowledge: an optimal black-box approach. In Proceedings of the 34th Conference on Learning Theory (COLT), pp. 4300 4354, 2021. Dynamic Regret of Online Markov Decision Processes Yang, L. and Wang, M. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 10746 10756, 2020. Yang, T., Zhang, L., Jin, R., and Yi, J. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 449 457, 2016. Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, pp. 737 757, 2009. Zhang, L., Lu, S., and Zhou, Z.-H. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31 (Neur IPS), pp. 1330 1340, 2018. Zhang, Y.-J., Zhao, P., and Zhou, Z.-H. A simple online algorithm for competing with dynamic comparators. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 390 399, 2020. Zhao, P. and Zhang, L. Improved analysis for dynamic regret of strongly convex and smooth functions. In Proceedings of the 3rd Conference on Learning for Dynamics and Control (L4DC), pp. 48 59, 2021. Zhao, P., Zhang, L., Jiang, Y., and Zhou, Z.-H. A simple approach for non-stationary linear bandits. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 746 755, 2020a. Zhao, P., Zhang, Y.-J., Zhang, L., and Zhou, Z.-H. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33 (Neur IPS), pp. 
12510 12520, 2020b. Zhao, P., Wang, G., Zhang, L., and Zhou, Z.-H. Bandit convex optimization in non-stationary environments. Journal of Machine Learning Research, 22(125):1 45, 2021a. Zhao, P., Zhang, Y.-J., Zhang, L., and Zhou, Z.-H. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Ar Xiv preprint, ar Xiv:2112.14368, 2021b. Zhao, P., Wang, Y.-X., and Zhou, Z.-H. Non-stationary online learning with memory and non-stochastic control. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. to appear, 2022. Zhou, H., Chen, J., Varshney, L. R., and Jagmohan, A. Nonstationary reinforcement learning with linear function approximation. Ar Xiv preprint, ar Xiv:2010.04244, 2020. Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26 (NIPS), pp. 1583 1591, 2013. Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 928 936, 2003. Dynamic Regret of Online Markov Decision Processes A. Related Work We presents discussions on several topics related to this work. The first part is about the development of static regret for online adversarial MDPs, and the second part reviews related advance of dynamic regret in non-stationary online learning. A.1. Online Adversarial MDPs Learning adversarial MDPs has attracted much attention in recent years. We briefly discuss related works on three models of online MDPs studied in this paper, including episodic loop-free SSP, episodic SSP, and infinite-horizon MDPs. Episodic loop-free SSP. Neu et al. (2010a) first study learning in the episodic SSP with a loop-free structure and known transition, where an e O(H2 K) regret is achieved in the full information setting and K is the number of the episodes and H is the horizon length in each episode. Later Zimin & Neu (2013) propose the O-REPS algorithm which applies mirror descent over occupancy measure space and achieves the optimal regret of order e O(H K). Neu et al. (2010a); Zimin & Neu (2013) also consider the bandit feedback setting. Neu et al. (2012); Rosenberg & Mansour (2019a) investigate the unknown transition kernel and full-information setting. Rosenberg & Mansour (2019b) and Jin et al. (2020a) further consider the harder unknown transition kernel and bandit-feedback setting. The linear function approximation setting is also studied (Cai et al., 2020). Notably, our results for episodic loop-free SSP (see Section 2) focus on known transition and full-information feedback setting. Different from all mentioned results minimizing static regret, our proposed algorithm is equipped with dynamic regret guarantee, which can recover the e O(H K) minimax optimal static regret when choosing compared policies as the best fixed policy in hindsight. Furthermore, when the environments are predictable, we enhance the algorithm to capture such adaptivity and hence enjoy better dynamic regret guarantees than the minimax rate. Episodic SSP. Rosenberg & Mansour (2021) first consider learning in episodic (non-loop-free) SSP with full-information loss feedback. Their algorithm achieves an e O( D K) regret for the known transition setting, where cmin (0, 1] is the lower bound of the loss function and D is the diameter of the MDP. They also study the zero costs case and unknown transition setting. Chen et al. 
Episodic SSP. Rosenberg & Mansour (2021) first consider learning in episodic (non-loop-free) SSP with full-information loss feedback. Their algorithm achieves an $\tilde{\mathcal{O}}(\frac{D}{c_{\min}}\sqrt{K})$ regret in the known-transition setting, where $c_{\min} \in (0, 1]$ is a lower bound on the loss functions and $D$ is the diameter of the MDP. They also study the case where losses can be zero as well as the unknown-transition setting. Chen et al. (2021a) develop algorithms that significantly improve these results and achieve the minimax regret $\tilde{\mathcal{O}}(\sqrt{H_{\pi^\star} D K})$ in the full-information and known-transition setting, where $H_{\pi^\star}$ is the hitting time of the optimal policy. They also investigate the unknown-transition setting. Our results for episodic SSP (see Section 3) focus on the known-transition and full-information setting. We develop an algorithm with optimal dynamic regret guarantees. Our result immediately recovers the optimal $\tilde{\mathcal{O}}(\sqrt{H_{\pi^\star} D K})$ static regret when setting the comparators as the best fixed policy in hindsight. We further enhance our algorithm to achieve a more adaptive bound when the environments are predictable.

Infinite-horizon MDPs. Even-Dar et al. (2009) consider learning in unichain MDPs with known transitions and full-information feedback; they propose the MDP-E algorithm, which enjoys $\tilde{\mathcal{O}}(\sqrt{\tau^3 T})$ regret, where $\tau$ is the mixing time. Another work (Yu et al., 2009) achieves $\tilde{\mathcal{O}}(T^{2/3})$ regret in a similar setting. The O-REPS algorithm of Zimin & Neu (2013) achieves an $\tilde{\mathcal{O}}(\sqrt{\tau T})$ regret. Neu et al. (2010b; 2014) consider the known-transition and bandit-feedback setting. These studies focus on MDPs with uniform mixing properties, which can be a strong assumption. A recent study relaxes this assumption by considering the larger class of communicating MDPs (Chandrasekaran & Tewari, 2021). Our results for infinite-horizon MDPs (see Section 4) focus on the known-transition and full-information feedback setting, and we propose an algorithm whose dynamic regret guarantee recovers the best-known $\tilde{\mathcal{O}}(\sqrt{\tau T})$ static regret.

Discussion. We note that all the works above focus on static regret minimization, whereas our work establishes dynamic regret guarantees for all three online MDP models. In the setting most similar to ours, Fei et al. (2020) investigate the dynamic regret of episodic loop-free SSP (with function approximation). They propose two model-free algorithms and prove dynamic regret bounds that scale with the non-stationarity of the environments. However, their algorithms require prior knowledge of the non-stationarity measure $P_T$ as input, which is generally unavailable to the learner in practice. By contrast, our proposed algorithms are parameter-free with respect to the unknown quantities related to the underlying environments (including the non-stationarity measure $P_T$ and the adaptivity quantity $V_T$); a generic sketch of the meta-base ensemble structure behind this property is given below. More importantly, we also consider the dynamic regret of two more challenging settings of online MDPs: episodic (non-loop-free) SSP and infinite-horizon MDPs.
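To make the parameter-freeness concrete, here is a generic sketch of the meta-base (online ensemble) structure: a grid of base-learners, each running mirror descent with its own candidate step size, is combined by a Hedge-type meta-learner that tracks whichever base-learner best suits the unknown level of non-stationarity. The simplex decisions, the uniform random losses, and the particular grid below are simplifying assumptions for illustration; this is not the paper's DO-REPS or CODO-REPS instantiation.

```python
import numpy as np

def step_size_grid(T, n):
    """Geometric grid of candidate step sizes; the paper's pools are built analogously,
    with endpoints depending on quantities such as H, |X|, |A| and the range of P_T."""
    return [2.0 ** i / np.sqrt(T) for i in range(n)]

def run_meta_base(losses, eta_grid, eps):
    """Hedge meta-learner over base-learners, each running entropic mirror descent
    on the simplex with its own step size from eta_grid."""
    T, d = losses.shape
    N = len(eta_grid)
    bases = np.full((N, d), 1.0 / d)        # iterate maintained by each base-learner
    p = np.full(N, 1.0 / N)                 # meta weights over the base-learners
    total = 0.0
    for t in range(T):
        decision = p @ bases                # final decision: weighted combination of the bases
        total += decision @ losses[t]
        h = bases @ losses[t]               # feedback to the meta-learner: each base's loss
        p *= np.exp(-eps * h)
        p /= p.sum()                        # Hedge update on the meta level
        for i, eta in enumerate(eta_grid):  # each base-learner updates independently
            bases[i] *= np.exp(-eta * losses[t])
            bases[i] /= bases[i].sum()
    return total

rng = np.random.default_rng(1)
T, d, N = 500, 8, 6
losses = rng.uniform(size=(T, d))
eps = np.sqrt(np.log(N) / T)                # learning rate of the meta-learner
print(f"cumulative loss of the ensemble: {run_meta_base(losses, step_size_grid(T, N), eps):.2f}")
```

No single step size needs to be tuned in advance: the ensemble's loss stays within an additive meta-regret term of the best base-learner's loss, which mirrors the base-regret plus meta-regret decompositions used in the proofs of Theorems 1, 3 and 6 below.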
A.2. Non-stationary Online Learning

In this part, we first discuss related works on online non-stationary MDPs (whose loss functions are stochastic, whereas we study the adversarial setting) and then discuss the dynamic regret of online convex optimization, whose techniques are closely related to ours.

Online Non-stationary MDPs. Another related line of research is online non-stationary MDPs. In contrast to learning with adversarial MDPs, where the online loss functions are generated in an adversarial way, online non-stationary MDPs consider the setting where the reward (loss) functions are generated stochastically according to a reward distribution that may be non-stationary over time. For infinite-horizon MDPs, Jaksch et al. (2010) consider the piecewise-stationary setting, where the losses and transition kernels are allowed to change a fixed number of times, and propose UCRL2 with a restarting mechanism to handle the non-stationarity. Later, Gajane et al. (2018) propose an alternative approach based on a sliding-window update for the same setting, which is subsequently generalized to the more general non-stationary setting with gradual drift (Ortner et al., 2019). However, all the above approaches require prior knowledge of the degree of non-stationarity, either the number of piecewise changes or the intensity of the gradual drift. Recently, Cheung et al. (2020) propose the Bandit-over-RL algorithm to remove the requirement of knowing the non-stationarity measure, but it only attains a suboptimal result. Other results for non-stationary MDPs include episodic non-stationary MDPs (Mao et al., 2021; Domingues et al., 2021) and episodic non-stationary linear MDPs (Touati & Vincent, 2020; Zhou et al., 2020). The techniques in those studies are related to the thread of stochastic linear bandits (Jin et al., 2020b; Yang & Wang, 2020; Zhao et al., 2020a). A recent breakthrough is made by Wei & Luo (2021), who propose a black-box approach that turns an algorithm with optimal static regret in a stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment; more importantly, the overall approach does not require any prior knowledge of the degree of non-stationarity. They achieve optimal dynamic regret for episodic tabular MDPs (Mao et al., 2021; Zhou et al., 2020; Touati & Vincent, 2020). For infinite-horizon MDPs, they achieve optimal dynamic regret when either the maximum diameter of the MDP or the degree of non-stationarity is known (Gajane et al., 2018; Cheung et al., 2020); when neither is known, they attain a suboptimal regret bound, which is nevertheless still the best-known result.

Non-stationary Online Convex Optimization. Online convex optimization (OCO) is a fundamental and versatile framework for modeling online prediction problems (Hazan, 2016). The dynamic regret of OCO has drawn increasing attention in recent years, and its techniques are highly related to ours. We here briefly review some related results and refer the reader to the latest paper (Zhao et al., 2021b) for a more thorough treatment. Dynamic regret requires the online learner to be competitive with a sequence of changing comparators, and is sometimes called tracking regret or switching regret in the study of prediction with expert advice (Cesa-Bianchi et al., 2012). As mentioned in Section 1, this paper focuses on the general dynamic regret that allows any feasible comparator sequence in the decision set, which is also called universal dynamic regret. A special variant is the worst-case dynamic regret, which only competes with the sequence of minimizers of the online functions and has gained much attention in the literature (Besbes et al., 2015; Jadbabaie et al., 2015; Yang et al., 2016; György & Szepesvári, 2016; Chen et al., 2019; Baby & Wang, 2019; Zhang et al., 2020; Zhao & Zhang, 2021). However, the worst-case dynamic regret can be problematic or even misleading in many cases; for example, approaching the minimizer of each round's online function may lead to overfitting when the environments admit some noise (Zhang et al., 2018). Thus, the universal dynamic regret is generally the more desirable performance measure for algorithm design in non-stationary online learning. We now introduce the results in this regard.
Zinkevich (2003) first considers the universal dynamic regret of OCO and shows that Online Gradient Descent (OGD) enjoys O( T(1 + PT )) dynamic regret, where PT is the path-length of the comparators reflecting the non-stationarity of the environments. Later, Zhang et al. (2018) propose a novel algorithm and prove a minimax optimal O( p T(1 + PT )) dynamic regret guarantee without requiring the knowledge of unknown PT . Their proposed algorithm employs the meta-base structure, which turns out to be a key component to handle unknown non-stationarity measure PT . When the environments are predictable and the loss functions are convex and smooth, Zhao et al. (2020b; 2021b) develop an algorithm, achieving problem-dependent dynamic regret which could be much smaller than the minimax rate. Baby & Wang (2021; 2022) consider OCO with exp-concave or strongly convex loss functions. Dynamic regret of bandit online learning is studied for adversarial linear bandits (Luo et al., 2022) and bandit convex optimization (Zhao et al., 2021a). More discussions can be found in the latest advance (Zhao et al., 2021b). B. Useful Lemmas Related to Online Mirror Descent In this section, we present several important lemmas used frequently in the analysis of (optimistic) online mirror descent. Lemma 3 (Lemma 3.2 of Chen & Teboulle (1993)). Define q = arg minq K η q, ℓ + Dψ(q, bq) for some compact set K Rd, convex function ψ, an arbitrary point ℓ Rd, and a point bq K. Then for any u K, η (Dψ(u, bq) Dψ(u, q ) Dψ(q , bq)). Dynamic Regret of Online Markov Decision Processes Lemma 4 (Lemma 5 and Proposition 7 of Chiang et al. (2012)). Let qt = arg minq K η q, mt + Dψ(q, bqt) and bqt+1 = arg minq K η q, ℓt + Dψ(q, bqt) for some compact convex set K Rd, convex function ψ, arbitrary points ℓt, mt Rd, and a point bq1 K. Then, for any u K, qt u, ℓt qt bqt+1, ℓt mt + 1 η (Dψ(u, bqt) Dψ(u, bqt+1) Dψ(bqt+1, qt) Dψ(qt, bqt)), and when ψ is λ-strongly convex function w.r.t. the norm , we have Lemma 5 (Lemma 1 of Chen et al. (2021b)). Define ψ(q) = Pd i=1 1 ηi qi log qi, where d is the dimension of q. Let at Rd with at,i = 32ηi(ℓt,i mt,i)2, qt = arg minq K q, mt + Dψ(q, bqt) and bqt+1 = arg minq K q, ℓt + at + Dψ(q, bqt) for some compact convex set K Rd, arbitrary points ℓt, mt Rd, and a point bqt K. Suppose 32ηi|ℓt,i mt,i| 1 holds for all i [d]. Then, for any u K, qt u, ℓt Dψ(u, bqt) Dψ(u, bqt+1) + 32 i=1 ηiui(ℓt,i mt,i)2 16 i=1 ηiqt,i(ℓt,i mt,i)2. Proof. This lemma is originally proven by Chen et al. (2021b). For self-containedness, we present their proof and adapt to our notations. By Lemma 4, we have qt u, ℓt + at Dψ(u, bqt) Dψ(u, bqt+1) + qt bqt+1, ℓt mt + at Dψ(bqt+1, qt). For the last two terms, define q = arg maxq Rd >0 qt q, ℓt mt + at + Dψ(q, qt), by the optimality of q , we have: ℓt mt + at = ψ(qt) ψ(q ) and thus q i = qt,ie ηi(ℓt,i mt,i+at,i). Therefore, we have qt bqt+1, ℓt mt + at Dψ(bqt+1, qt) qt q , ℓt mt + at Dψ(q , qt) = qt q , ψ(qt) ψ(q ) Dψ(q , qt) = Dψ(qt, q ) = qt,i log qt,i q i qt,i + q i ηi(ℓt,i mt,i + at,i) 1 + e ηi(ℓt,i mt,i+at,i) i=1 ηiqt,i(ℓt,i mt,i + at,i)2, where the last inequality holds due to the fact e x 1 + x x2 for x 1 and the condition that ηi|ℓt,i mt,i| 1 32 such that ηi|ℓt,i mt,i + at,i| ηi|ℓt,i mt,i| + 32η2 i (ℓt,i mt,i)2 1 16. Using the definition of at and the condition ηi|ℓt,i mt,i| 1 32, we have qt bqt+1, ℓt mt + at Dψ(bqt+1, qt) i=1 ηiqt,i(ℓt,i mt,i + 32ηi(ℓt,i mt,i)2)2 4 i=1 ηiqt,i(ℓt,i mt,i)2. To sum up, we have qt u, ℓt + at Dψ(u, bqt) Dψ(u, bqt+1) + 4 i=1 ηiqt,i(ℓt,i mt,i)2. 
Finally, moving qt u, at to the right-hand side of the inequality and using the definition of at finishes the proof. Dynamic Regret of Online Markov Decision Processes C. Proofs for Section 2 (Episodic Loop-free SSP) In this section, we provide the proofs for Section 2. First, we introduce the relationship between the path-length of policies and the path-length of occupancy measures, and then provide proof of the dynamic regret of DO-REPS algorithm in Section 2.2. Finally we present the proof of the dynamic regret lower bound. C.1. Path-length of Policies and Occupancy Measures This part introduces the relationship between the path-length of policies and path-length of occupancy measures. Lemma 6. For any occupancy measure sequence qπ1, . . . , qπK induced by the policy sequence π1, . . . , πK, it holds that k=2 qπk qπk 1 1 H l=0 πk,l πk 1,l 1, . Proof. Let dπk l (x) P a qπk(x, a), x Xl, k [K], we have X x,a |qπk(x, a) qπk 1(x, a)| a |qπk(x, a) qπk 1(x, a)| a |dπk l (x)πk(a|x) dπk 1 l (x)πk 1(a|x)| a |dπk l (x)πk(a|x) dπk 1 l (x)πk(a|x)| + |dπk 1 l (x)πk(a|x) dπk 1 l (x)πk 1(a|x)| x Xl |dπk l (x) dπk 1 l (x)| X a πk(a|x) + x Xl dπk 1 l (x) X a |πk(a|x) πk 1(a|x)| l=0 dπk l dπk 1 l 1 + l=0 πk,l πk 1,l 1, , (11) where the first inequality due to the triangle inequality. Next, we bound the term dπk l dπk 1 l 1. Since X0 = {x0}, we have dπk 0 dπk 1 0 1 = 0. For l 1, we have dπk l dπk 1 l 1 = dπk l 1P πk dπk 1 l 1 P πk 1 1 dπk l 1P πk dπk l 1P πk 1 1 + dπk l 1P πk 1 dπk 1 l 1 P πk 1 1 πk,l 1 πk 1,l 1 1, + dπk l 1 dπk 1 l 1 1, where the last inequality holds due to Lemma 7 and Lemma 8. Summing the above inequality from 1 to l, we have dπk l dπk 1 l 1 i=0 πk,i πk 1,i 1, . (12) Combining (11) and (12), we obtain x,a |qπk(x, a) qπk 1(x, a)| i=0 πk,i πk 1,i 1, + l=0 πk,l πk 1,l 1, i=0 πk,i πk 1,i 1, H l=0 πk,l πk 1,l 1, . We complete the proof by summing the inequality over all iterations. Dynamic Regret of Online Markov Decision Processes C.2. Proof of Lemma 1 Proof. Denote by q k+1 = arg minq η q, ℓk + Dψ(q, qk), or equivalently, q k+1(x, a) = qk(x, a) exp( ηℓk(x, a)). By standard analysis of online mirror descent, we have k=1 qk qπc k, ℓk = k=1 qk q k+1, ℓk + q k+1 qπc k, ℓk k=1 qk q k+1, ℓk + 1 Dψ(qπc k, qk) Dψ(qπc k, q k+1) k=1 qk q k+1, ℓk + 1 Dψ(qπc k, qk) Dψ(qπc k, qk+1) , where the first inequality holds due to Lemma 3 and the last inequality holds due to Pythagoras theorem. For the first term, applying the inequality 1 e x x, we obtain k=1 qk q k+1, ℓk η x,a qk(x, a)ℓ2 k(x, a) η x,a qk(x, a) ηHK = ηT. (14) For the last term, we have Dψ(qπc k, qk) Dψ(qπc k, qk+1) = Dψ(qπc 1, q1) + Dψ(qπc k, qk) Dψ(qπc k 1, qk) = Dψ(qπc 1, q1) + qπc k(x, a) log qπc k(x, a) qk(x, a) qπc k 1(x, a) log qπc k 1(x, a) qk(x, a) = Dψ(qπc 1, q1) + qπc k(x, a) qπc k 1(x, a) log 1 qk(x, a) + ψ(qπc K) ψ(qπc 1) k=2 qπc k qπc k 1 1 log 1 α + Dψ(qπc 1, q1) + ψ(qπc K) ψ(qπc 1), where the last inequality holds due to qk(x, a) α, since qk (M, α) for all k and x, a. It remains to bound the last two terms. Since q1 minimize ψ over (M, α), we have ψ(q1), qπc 1 q1 0, and thus Dψ(qπc 1, q1) + ψ(qπc K) ψ(qπc 1) ψ(qπc K) ψ(q1) X x,a q1(x, a) log 1 q1(x, a) H log |X||A| Combining(14), (15) and (16), we obtain k=1 qk qπc k, ℓk ηT + 1 H log |X||A| H + PT log 1 where PT = PK k=2 qπc k qπc k 1 1. This completes the proof. C.3. Proof of Theorem 1 Proof. 
Without loss of generality, we assume that all states are reachable with positive probability under the uniform policy πu(a|x) = 1/|A|, x X, a A (otherwise remove the unreachable states since they are unreachable by any Dynamic Regret of Online Markov Decision Processes policy). Assume T is large enough such that the occupancy measure of πu satisfies qπu (M, 1 T ), then define uk = (1 1 T )qπc k + 1 T qπu (M, 1 T 2 ), we have k=1 qk qπc k, ℓk = k=1 qk uk, ℓk + 1 k=1 qπu qπc k, ℓk k=1 qk uk, ℓk + 2 k=1 qk qk,i, ℓk | {z } meta-regret k=1 qk,i uk, ℓk | {z } base-regret where the first inequality follows from the definition uk = (1 1 T )qπc k + 1 T qπu and the last inequality holds for any index i. Upper bound of base-regret. Since uk (M, 1 T 2 ), k [K], from Lemma 1 we have base-regret ηT + H log |X||A| H + 2 PK k=2 uk uk 1 1 log T η ηT + H log |X||A| H + 2 PT log T η , where the last inequality holds due to PK k=2 uk uk 1 1 PK k=2 qπc k qπc k 1 1 = PT . It is clear that the optimal step size is η = p (H log (|X||A|/H) + 2 PT log T)/T. From the definition of PT = PK k=2 qπc k qπc k 1 1, we have 0 PT 2KH = 2T. Thus, the possible range of the optimal step size is H log(|X||A|/H) T , and ηmax = H log(|X||A|/H) T + 4 log T. By the construction of the candidate step size pool H = {ηi = 2i 1p K 1 log(|X||A|/H) | i [N]}, where N = 1 2 log(1 + 4K log T log(|X||A|/H)) + 1, we know that the step size therein is monotonically increasing, in particular H log(|X||A|/H) T = ηmin, and ηN H log(|X||A|/H) T + 4 log T = ηmax. Thus, we ensure there exists a base-learner i whose step size satisfies ηi η ηi +1 = 2ηi . Since the regret decomposition in (17) holds for any i [N], we choose the base-learner i to analysis to obtain a sharp bound. base-regret ηi T + H log(|X||A|/H) + 2 PT log T η T + 2(H log(|X||A|/H) + 2 PT log T) T H log(|X||A|/H) + 2 PT log T , where the second inequality holds due to ηi η ηi +1 = 2ηi and the last equality holds by substituting the optimal step size η = p (H log(|X||A|/H) + 2 PT log T)/T. Upper bound of meta-regret. From the construction that hk,i = qk,i, ℓt , k [K], i [N], the meta-regret can be written in the following way: meta-regret = k=1 qk qk,i, ℓk = i=1 pk,iqk,i qk,i, ℓk = k=1 pk ei, hk It is known that the update pk+1,i exp( ε Pk s=1 hs,i), i [N] is equal to the update pk+1 = arg minp N ε p, hk + Dψ(p, pk), where ψ(p) = PN i=1 pi log pi is the standard negative entropy. This can be further reformulated solving the Dynamic Regret of Online Markov Decision Processes unconstrained optimization problem p k+1 = arg minp ε p, hk + Dψ(p, pk) at first and then projecting p k+1 to the simplex N as pk+1 = arg minp N Dψ(p, p k+1). By standard analysis of OMD, we have k=1 pk ei, hk k=1 pk p k+1, hk + k=1 p k+1 ei, hk k=1 pk p k+1, hk + 1 k=1 (Dψ(ei, pk) Dψ(ei, p k+1)) k=1 pk p k+1, hk + 1 k=1 (Dψ(ei, pk) Dψ(ei, pk+1)) k=1 pk p k+1, hk + 1 εDψ(ei, p1), where the second inequality holds due to Lemma 3 and the third inequality holds due to Pythagoras theorem. Using the fact that 1 e x x and the definition that p1,i = 1/N, hk,i H, k [K], i [N], we have k=1 pk p k+1, hk + 1 εDψ(ei, p1) ε i=1 pk,ih2 k,i + ln N ε εHT + ln N Therefore, for any base-learner i [N], we have k=1 qk qk,i, ℓk = k=1 pk ei, hk εHT + log N In particular, for the best base-learner i [N], we have meta-regret = k=1 qk qk,i , ℓk εHT + log N HT log N, (19) where the last equality holds due to the setting ε = p (log N)/(HT). Upper bound of overall dynamic regret. 
Combining (17), (18) and (19), we obtain k=1 qk qπc k, ℓk base-regret + meta-regret T H log(|X||A|/H) + 2 PT log T + p HT log N + 2 HT (log(|X||A|/H) + PT log T) , where the last equality is due to PT HPT by Lemma 6, N = 1 2 log(1 + 4K log T log(|X||A|/H)) + 1. This finishes the proof. C.4. Proof of Theorem 2 Proof. The proof is similar to the proof of the minimax lower bound of dynamic regret for online convex optimization (Zhang et al., 2018). For any γ [0, 2T], we first construct a piecewise-stationary comparator sequence, whose path-length is smaller than γ, then we split the whole time horizon into several pieces, where the comparator is fixed in each piece. By this construction, we can apply the existed minimax static regret lower bound of episodic loop-free SSP (Zimin & Neu, 2013) in each piece, and finally sum over all pieces to obtain the lower bound for the dynamic regret. Denote by RK(Π, F, γ) the minimax dynamic regret, which is defined as RK(Π, F, γ) = inf π1 Π sup ℓ1 F . . . inf πK Π sup ℓK F k=1 qπk, ℓk min (πc 1,...,πc K) U(γ) k=1 qπc k, ℓk Dynamic Regret of Online Markov Decision Processes where Π denotes the set of all policies, F denotes the set of loss functions ℓ R|X||A| [0,1] , and U(γ) = {(πc 1, . . . , πc K) | k [K], πc k Π, and PT = PK k=2 qπc k qπc k 1 1 γ} is the set of feasible policy sequences with the path-length PT of the occupancy measures less than γ. We first consider the case of γ 2H. Then we can directly utilize the established lower bound of the static regret for learning in episodic loop-free SSP (Zimin & Neu, 2013) as a natural lower bound of dynamic regret, RK(Π, F, γ) C1H p K log(|X||A|), (20) where C1 = 0.03 is the constant appeared in the lower bound of static regret. We next deal with the case that γ 2H. Without loss of generality, we assume L = γ/2H divides K and split the whole time horizon into L pieces equally. Next, we construct a special policy sequence in U(γ) such that the policy sequence is fixed within each piece and only changes in the split point. Since the sequence changes at most L 1 γ/2H times and the variation of the occupancy measure at each change point is at most 2H, the path-length PT of the occupancy measures does not exceed γ. As a result, we have RK(Π, F, γ) LC1H L log(|X||A|) HKγ log(|X||A|). (21) Combining (20) and (21), we obtain the final lower bound RK(Π, F, γ) HK log(|X||A|) max( 2H, γ) Ω( p HK(H + γ) log(|X||A|)), which finishes the proof. C.5. Useful Lemmas In this part, we present some basic lemmas in episodic loop-free SSP. For any policy π(a|x), we define P π to be the transition matrix induced by π, where the component [P π]x,x is the transition probability from x to x , i.e., [P π]x,x = P a π(a|x)P a x,x . Then, we have the following useful lemmas. Lemma 7 (Lemma 6.3 of Even-Dar et al. (2009)). For any policies π and π and any state distribution d, we have d P π d P π 1 π π 1, . Proof. Consider the case when d is a delta function on x. The difference in the next state distributions, d P π d P π 1, is X [P π]x,x [P π ]x,x = X a |P(x |x, a) (π(a|x) π (a|x))| x ,a P(x |x, a)|π(a|x) π (a|x)| = X a |π(a|x) π (a|x)|. Linearity of expectation leads to the result for arbitrary d. Lemma 8. For any state distribution d and d , and any policy π, we have d P π d P π 1 d d 1. (22) Proof. Note that the relationship that d(x ) = P x d(x)P π x,x , therefore, we have d P π d P π 1 = X x d(x)P π x,x d (x)P π x,x | X x |d(x)P π x,x d (x)P π x,x | x |d(x) d (x)|P π x,x = X x |d(x) d (x)| X x |d(x) d (x)| = d d 1. 
This finishes the proof. Dynamic Regret of Online Markov Decision Processes Finally, we introduce the following lemma, which shows the strongly convexity of the regularizer. Lemma 9. ψ(w) = Pd i=1 wi log wi is 1 H -strongly convex w.r.t. 1 for {w Rd 0 | Pd i=1 wi = H}. Proof. For any y, z {w Rd 0 | PN i=1 wi = H}, we have y H {w Rd 0 | Pd i=1 wi = 1}. Then, it holds that ψ(y) ψ(z) ψ(y), y z = i=1 yi log yi yi H log yi/H zi/H 1 2H y z 2 1, where the last inequality holds due to Pinsker s Inequality. This finishes the proof. D. Proofs for Section 3 (Episodic SSP) In this section, we first give the impossible result to bound the path-length of occupancy measures by the path-length of policies. Next we provide proofs of the dynamic regret of CODO-REPS algorithm and the lower bound in Section 3.2. D.1. Path-length of Policies and Occupancy Measures In the following, we give the impossible result to bound the path-length of the occupancy measures by the path-length of the corresponding policies. Theorem 7. For any H > 1 and any positive integer c > 0, there exists an SSP instance with |X| = 2c + 1 states, |A| = 2 actions and a policy sequence πc 1, . . . , πc K with largest expected hitting time H such that PK c PK. Proof. For any H > 1 and any positive integer c > 0, we construct an episodic SSP with n + 1 states X = {x0, . . . , xn} with n = 2c and two actions A = {a1, a2} as in Figure 1. Let the transition kernel be deterministic and the corresponding transitions are shown in Figure 1. Specifically, taking a1 and a2 in initial state x0 leads to the state g and x1 respectively. Figure 1: State transitions for Theorem 7. Taking any action in state xi leads to state xi+1, i [n 1] and taking any action in state xn leads to the goal state state g. Then, we consider two policies π and π with π(a1|xi) = 1, i {0} [n] and π (a1|x0) = 1 ε, π (a1|xi) = 1, i [n]. It is clear that π π 1, = 2ε and Hπ(x0) = 1, Hπ (x0) = 1 + εn. For any H > 1 and c > 0, let ε = (H 1)/n, we have Hπ = 1 + εn = H , i.e., the largest hitting time of π and π is H . Then we consider the occupancy measure discrepancy of π and π . It is easy to verify X x,a |qπ(x, a) qπ (x, a)| = ε + ε(n + 1) = ε(n + 2) = 2ε(c + 1) = (c + 1) π π 1, . Therefore, we have qπ qπ 1 c π π 1, . Thus, the policy sequence π, π , π, π , . . . satisfies PK = K qπ qπ 1 c K π π 1, = c PK, which completes the proof. D.2. Proof of Lemma 2 Proof. Since η 1 64, we ensure that 32η|ℓk,i| 1, k [K], i [|X||A|]. Taking mk = 0 in Lemma 5, we obtain k=1 qk qπc k, ℓk Dψ(qπc k, bqk) Dψ(qπc k, bqk+1) + 32η k=1 qk, ℓ2 k 16η k=1 qπc k, ℓ2 k . (23) Dynamic Regret of Online Markov Decision Processes For the first term, from the definition that Dψ(q, q ) = P x,a q(x, a) log q(x,a) x,a(q(x, a) q (x, a)), we have Dψ(qπc k, bqk) Dψ(qπc k, bqk+1) (24) = Dψ(qπc 1, bq1) + Dψ(qπc k, bqk) Dψ(qπc k 1, bqk) = Dψ(qπc 1, bq1) + 1 qπc k(x, a) log qπc k(x, a) bqk(x, a) qπc k 1(x, a) log qπc k 1(x, a) bqk(x, a) qπc k 1(x, a) qπc k(x, a) = Dψ(qπc 1, bq1) + 1 qπc k(x, a) qπc k 1(x, a) log 1 bqk(x, a) + ψ(qπc K) ψ(qπc 1) P x,a(qπc K(x, a) qπc 1(x, a)) k=2 qπc k qπc k 1 1 + Dψ(qπc 1, bq1) + ψ(qπc K) ψ(qπc 1) P x,a(qπc K(x, a) qπc 1(x, a)) where the last inequality holds due to |log bqk(x, a)| log H α since α bqk(x, a) H for bqk (M, H, α). 
For the last two term, since bq1 minimize ψ over (M, H, α), we have ψ(bq1), qπc 1 bq1 0, thus Dψ(qπc 1, bq1) + ψ(qπc K) ψ(qπc 1) P x,a(qπc K(x, a) qπc 1(x, a)) ψ(qπc 1) ψ(bq1) η + ψ(qπc K) ψ(qπc 1) P x,a qπc K(x, a) qπc 1(x, a) ψ(qπc K) ψ(bq1) P x,a qπc K(x, a) qπc 1(x, a) H log(|X||A|) + H log H + H η = H(1 + log(|X||A|H)) where the last inequality holds due to H log(|X||A|) ψ(q) H log H and 0 P x,a q(x, a) H for any q (M, H, α) from Lemma 10. Combining (23), (24) and (25), we have k=1 qk qπc k, ℓk H(1 + log(|X||A|H)) + PK log(H/α) k=1 qk, ℓ2 k 16η k=1 qπc k, ℓ2 k , where PK = PK k=2 qπc k qπc k 1 1. This finishes the proof. D.3. Proof of Theorem 3 Proof. We only need to consider the case H K (otherwise the claimed regret bound is vacuous). Since all the compared policies are proper, they will not visit the states from which the goal state g is not accessible (otherwise the hitting time will be infinite) and the states which are not accessible from initial state x0. We can remove them from the SSP since we consider the known transition setting. Then, suppose K is large enough such that these exists at least a policy πu whose occupancy measure qπu satisfies qπu (M, K, 1 K ). Then, we define uk = (1 1 K2 )qπc k + 1 K2 qπu and the corresponding policy πuk. For any k [K], we ensure that the hitting time Hπuk (1 1 K2 )H + K K2 H + 1 and the occupancy measure uk(x, a) 1 K3 , x, a, i.e., uk (M, H + 1, 1 K3 ). Thus, we have E[D-REGK(πc 1:K)] = E i,j pi,j k qi,j k , ℓk k=1 qπc k, ℓk i,j pi,j k qi,j k , ℓk k=1 qπu qπc k, ℓk Dynamic Regret of Online Markov Decision Processes i,j pi,j k qi,j k , ℓk k=1 pk ei,j, hk | {z } meta-regret k=1 qi,j k uk, ℓk | {z } base-regret where the first inequality holds due to P x,a qu(x, a) K and P x,a qπc k(x, a) H K, the last inequality holds due to the definition that hi,j k = qi,j k , ℓk , i [G], j [Ni] and the decomposition holds for any index i [G], j [Ni]. Upper bound of base-regret. Since the possible range of H is Hπf H K. From the construction of horizon length pool H = {Hi = 2i 1 Hπf |i [G]} where G = 1 + log((K + 1)/Hπf ) , we ensure H1 = Hπf H + 1 and HG = K + 1 H + 1. So for any unknown H , there exist an index i for the space pool that satisfies Hi 1 = Hi 2 H + 1 Hi . Then, we analysis the base-regret of the base learners in group i . From the construction of each step size pool, we ensure ηi,j 1 64, i.e., 32ηi,j|ℓk,r| 1, i [G], j [Ni], k [K], r [|X||A|]. Since qi ,j k , uk (M, Hi , 1/K3), j [N i ], k [K], from Lemma 2, we have base-regret 4 PK k=2 uk uk 1 1 log K + Hi (1 + log(|X||A|Hi )) ηi ,j + 16ηi ,j k=1 (2 uk, ℓk qi ,j k , ℓ2 k ) 4 PK log K + Hi (1 + log(|X||A|Hi )) ηi ,j + 32ηi ,j BK 16ηi ,j k=1 qi ,j k , ℓ2 k + 1, (27) where the last inequality holds due to PK k=2 uk uk 1 1 PK k=2 qπc k qπc k 1 1 = PK and BK = PK k=1 qπc k, ℓk . Upper bound of meta-regret. Then, we consider the meta-regret with respect to base-learner Bi ,j, j Ni . From the construction of the regularizer ψ(p) in meta-algorithm, we have 32εi,j|hi,j k | = 32 ηi,j 2Hi | qi,j k , ℓk | 1, i [G], j [Ni], k [K]. From the analysis of OMD in Lemma 5, we have meta-regret D ψ(ei ,j, p1) + 32εi ,j k=1 (hi ,j k )2 = 1 εi ,j log 1 pr,s 1 εr,s + 32εi ,j k=1 (hi ,j k )2 = 1 εi ,j log PG r=1 PNi s=1 ε2 r,s ε2 i ,j + PG r=1 PNi s=1 εr,s PG r=1 PNi s=1 ε2r,s + 32εi ,j k=1 qi ,j k , ℓk 2, where the first equality holds due to D ψ(p, p ) = P i,j 1 εi,j (pi,j log pi,j p i,j pi,j + p i,j) and the last equality is due to pi,j 1 ε2 i,j, hi,j k = qi,j k , ℓk , i [G], j [Ni]. 
From the definition of the horizon length pool H = {Hi = 2i 1 Hπf | i [G]} where G = 1 + log((K + 1)/Hπf ) , the step size pools Ei = 1 32 2j | j [Ni] , i [G], where Ni = 1 2 log ( 4K 1+log (|X||A|Hi)) and learning rate εi,j = ηi,j 2Hi , i [G], j [Ni], we ensure that PG r=1 PNi s=1 εr,s = Θ(1/H1) and PG r=1 PNi s=1 ε2 r,s = Θ(1/H2 1). Thus, meta-regret Θ Hi ηi ,j log Hi H1ηi ,j k=1 qi ,j k , ℓk 2 + Θ(H1) ηi ,j log Hi H1ηi ,j k=1 qi ,j k , ℓ2 k + Θ(H1), where the last inequality holds due to Cauchy Schwarz inequality. Dynamic Regret of Online Markov Decision Processes Upper bound of over all dynamic regret. Combining (26), (27) and (28), we obtain E[D-REGK] 4 PT log K + Hi (1 + log(|X||A|Hi )) ηi ,j + 32ηi ,j BK + Θ( Hi ηi ,j log Hi H1ηi ,j ), (29) holds for any index j [Ni ]. Omit the last term, it is clear that the optimal step size is η = p (Hi (1 + log(|X||A|Hi )) + 4 PK log K)/(32BK). Meanwhile, since P x,a qk(x, a) H , k [K], we have 0 PK = PK k=2 qπc k qπc k 1 1 2H K 2Hi K and BK H K Hi K. Therefore, we ensure that 1 + log(|X||A|Hi ) From the construction of the candidate step size pool Hi , we know that the step size therein is monotonically decreasing with respect to the index, in particular, 64, and ηN = 1 + log(|X||A|Hi ) Let j be the index of base learner in group i with step size closest to the η . Then, we consider the base regret of the base learner Bi ,j . We consider the following two cases: when η 1 64, then ηi ,j η 2ηi ,j = ηi ,j 1, we have R.H.S of (29) 8 PK log K + 2Hi (1 + log(|X||A|Hi )) η + 32η BK + Θ Hi e O q PK + H BK when η > 1 64, then ηi ,j = 1 64, we have R.H.S of (29) 256 PK log K + Hi (1 + log(|X||A|Hi )) + 1 2BK + Θ(H ) e O PK + H , where the last inequality holds due to p (Hi (1 + log(|X||A|Hi )) + 4 PK log K)/(32BK) 1 64. As a result, taking both cases into account yields k=1 qk qπc k, ℓk e O q H + PK (H + PK + BK) . This finishes the proof. D.4. Proof of Theorem 4 Proof. The proof is similar to that of Theorem 2. For any γ [0, 2T], we first construct a piecewise-stationary comparator sequence, whose path-length is smaller than γ, then we split the whole time horizon into several pieces, where the comparator is fixed in each piece. By this construction, we can apply the existed minimax static regret lower bound of episodic SSP (Chen et al., 2021a) in each piece, and finally sum over all pieces to obtain the lower bound for the dynamic regret. Denote by RK(Π, F, γ) the minimax dynamic regret, which is defined as RK(Π, F, γ) = inf π1 Π sup ℓ1 F . . . inf πK Π sup ℓK F k=1 qπk, ℓk min (πc 1,...,πc K) U(γ) k=1 qπc k, ℓk where Π denotes the set of all policies, F denotes the set of loss functions ℓ R|X||A| [0,1 and U(γ) = {(πc 1, . . . , πc K) | k [K], πc k Π, and PK = PK k=2 qπc k qπc k 1 1 γ} is the set of feasible policy sequences with path-length PK of the occupancy measures less than γ. Dynamic Regret of Online Markov Decision Processes We first consider the case of γ 2(H + 1). From Theorem 3 of Chen et al. (2021a), we ensure for any D, H , K with K D + 1, there exists an SSP instance such that its diameter is D + 2, the hitting time of the best fixed policy is H + 1 and the expected regret of any policy after K episodes is at least Ω( DH K). Then we can set all compared policies as the best fixed policy and directly utilize this lower bound of the static regret as a natural lower bound of dynamic regret, RK(Π, F, γ) Ω( p DH K). (30) We next deal with the case that γ 2(H + 1). 
Without loss of generality, we assume L = γ/2(H + 1) devides K and split the whole time horizon into L pieces equally. Next, we construct an SSP instance such that its diameter is D + 2, the hitting time of the best fixed policy is H + 1 and the expected regret of any policy after K episodes is at least Ω( DH K) in each piece. Then, we choose the best fixed policy in each piece as the comparator sequence, whose hitting time are all H + 1. Since the sequence changes at most L 1 γ/2(H + 1) times and the variation of the policy sequence at each change point is at most 2(H + 1) (Note that qπc k qπc k 1 qπc k 1 + qπc k 1 1 = 2(H + 1), πc k = πc k 1), the path PK does not exceed γ. As a result, RK(Π, F, γ) LΩ( p DH K/L) Ω( p Combining (30) and (31), we have RK(Π, F, γ) Ω( p DH K) + Ω( p DH K(1 + γ/H )), which finishes the proof. D.5. Useful Lemmas we introduce the following lemma which shows the boundedness of the regularizer. Lemma 10. Let H 1, it holds that H log(|X||A|) P x,a q(x, a) log q(x, a) H log H for every q (M, H). Proof. First, we prove the right-hand side of the inequality. x,a q(x, a) log q(x, a) = X x,a q(x, a) log q(x, a) x,a q(x, a) log H X s,a q(x, a) log H H log H. Then, we prove the left-hand side of the inequality. x,a q(x, a) log q(x, a) = X x,a q(x, a) log q(x, a) x,a q(x, a) log H H X H log q(x, a) H H log |X||A|. This finishes the proof. E. Proofs for Section 4 (Infinite-horizon MDPs) In this section, we first show the relationship between the path-length of policies and the path-length of occupancy measures. Next, we show the proofs of the reduction to switching-cost expert problem in Section 4.2. Finally, we give the proofs of the dynamic regret of our algorithm in Section 4.3. E.1. Path-length of Policies and Occupancy Measures We introduce the relationship between the path-length of policies and the path-length of occupancy measures as follows. Lemma 11. For any occupancy measure sequence q1, . . . , q T induced by the policy sequence π1, . . . , πT , it holds that t=2 qπt qπt 1 1 (τ + 2) t=2 πt πt 1 1, . Dynamic Regret of Online Markov Decision Processes Proof. Consider any two policies π and π with occupancy measure qπ and qπ , let dπ(x) P x,a qπ(x, a), dπ (x) P x,a qπ (x, a), x X, we have qπ qπ 1 = X x,a |qπ(x, a) qπ (x, a)| x,a |dπ(x)π(a|x) dπ (x)π (a|x)| x,a |dπ(x)π(a|x) dπ(x)π (a|x)| + X x,a |dπ(x)π (a|x) dπ (x)π (a|x)| a |π(a|x) π (a|x)| + X x |dπ(x) d (x)| X π π 1, + dπ dπ 1 (τ + 2) π π 1, , where the first inequality holds due to the triangle inequality and the last inequality holds due to Lemma 14. We finish the proof by summing the inequality over T. E.2. Proof of Theorem 5 To prove Theorem 5, we first introduce two lemmas which measure the difference between the sum of average losses and the actual losses of the learner and compared policies. Denote by ρπ t the average loss per step corresponding π: ρπ t lim T 1 T PT t=1 E[ℓt(xt, at)|P, π] = qπ, ℓt and the actual cumulative loss suffered by the learner LT E[ℓt(xt, πt(xt))|P, π], where the randomness is over the transition kernel and policy sequence π1:T . Similarly, the actual cumulative loss suffered by the compared policy sequence πc 1:T is Lc T E[ℓt(xt, πc t (xt))|P, π]. Let dπ be the stationary state distribution, i.e., dπ(x) P a qπ(x, a), x X. Denote by µt = µ1P π1 P πt 1 the state distribution after executing π1, . . . , πt 1, where µ1 is the initial distribution, similarly, µc t = µ1P πc 1 P πc t 1. Lemma 12. For any compared policy sequence πc 1, . . . 
, πc T , it holds that PT t=1 ρπc t t Lc T (τ + 1)2PT + 2(τ + 1). Proof. From the definition that µc t = µ1P πc 1 P πc t 1, we have t=1 ρπc t t Lc T = dπc t(x) µc t(x) X a πc t (a|x)ℓt(x, a) t=1 dπc t µc t 1 2(τ + 1) + (τ + 1) t=2 dπc t dπc t 1 1 2(τ + 1) + (τ + 1)2 T X t=2 πc t πc t 1 1, , where the second inequality holds due to Lemma 15 and the last inequality holds due to Lemma 14. Lemma 13. For any occupancy measure sequence qπ1, . . . , qπT returned by the learner, it holds that LT PT t=1 ρπt t (τ + 1) PT t=2 qπt qπt 1 1 + 2(τ + 1). Proof. From the definition that µt = µ1P π1 P πt 1, we have t=1 ρπt t = x (µt(x) dπt(x)) X a πt(a|x)ℓt(x, a) Dynamic Regret of Online Markov Decision Processes t=1 µt dπt 1 2(τ + 1) + (τ + 1) t=2 dπt dπt 1 1 = 2(τ + 1) + (τ + 1) a qπt(x, a) qπt 1(x, a)| 2(τ + 1) + (τ + 1) t=2 qπt qπt 1 1, where the second inequality holds due to Lemma 15. Then, we present the proof of Theorem 5. Proof of Theorem 5. Note that the dynamic regret for infinite-horizon MDPs is defined as E[D-REG(πc 1:T )] = E[PT t=1 ℓt(xt, πt(xt)) ℓt(xt, πc t (xt))]. Then it can be written as E[D-REGT (πc 1:T )] = E t=1 ℓt(xt, πt(xt)) ℓt(xt, πc t (xt)) t=1 ρπt t + t=1 (ρπt t ρπc t t ) + t=1 ρπc t t Lc T t=1 qt qπc t, ℓt + t=2 qt qt 1 1 + (τ + 1)2PT + 4(τ + 1), where the last inequality holds due to Lemma 12 and Lemma 13 and the definition that PT = PT t=2 πt πc t 1, . E.3. Proof of Theorem 6 Proof. Similar to the proof in Appendix C.3, since the MDP is ergodic according to Definition 1, we assume T is large enough such that there at least exists a policy πu whose occupancy measure qu satisfies qπu (M, 1 T ), then define ut = (1 1 T )qπc t + 1 T qπu (M, 1 T 2 ), from the dynamic regret decomposition in (9), we have E[D-REGT (πc 1:T )] (32) t=1 qt qπc t, ℓt + (τ + 1) t=2 qt qt 1 1 + (τ + 1)2PT + 4(τ + 1) t=1 qt ut, ℓt + 1 t=1 qπu qπc t, ℓt + (τ + 1) t=2 qt qt 1 1 + (τ + 1)2PT + 4(τ + 1) t=1 qt ut, ℓt + (τ + 1) t=2 qt qt 1 1 | {z } term (a) +(τ + 1)2PT + 4(τ + 1) + 2, (33) where the first equality follows from the definition that ut = (1 1 T )qπc t+ 1 T qπu. We only need to consider term (a) since the remaining terms are not related to the algorithm. From the definition that ht,i = qt,i, ℓt +(τ +1) qt,i qt 1,i 1, i [N], it can be verified that term (a) can be written as | {z } meta-regret t=2 pt pt 1 1 | {z } meta-switching-cost t=1 qt,i ut, ℓt | {z } base-regret t=2 qt,i qt 1,i 1 | {z } base-switching-cost Dynamic Regret of Online Markov Decision Processes which hold for any index i. Next, we bound these terms separately. Upper bound of base-regret. From the standard analysis of OMD similar to that in (13) and (15), we have t=1 qt,i ut, ℓt ηi x,a qt,i(x, a)ℓ2 t(x, a) + 1 t=1 (Dψ(ut, qt,i) Dψ(ut, qt+1,i)) ηi T + log |X||A| ηi + 2 log T t=2 ut ut 1 1 ηi T + log |X||A| + 2 PT log T which the last inequality holds due to PT t=2 ut ut 1 1 PT t=2 qπc t qπc t 1 1 = PT . Upper bound of meta-regret. From the definition that ht,i = qt,i, ℓt + (τ + 1) qt,i qt 1,i 1, i [N], we have 0 ht,i 1 + 2(τ + 1) = 2τ + 3, i [N]. By the standard analysis of Hedge similar to the analysis of meta-regret in Appendix C.3, we have i=1 pt,ih2 t,i + log N ε ε(2τ + 3)2T + log N Upper bound of switching-cost. From Lemma 3, we have qt,i qt 1,i 1 ηi ℓt ηi, and pt pt 1 1 ε ht ε(2τ + 3), t 2. (36) Upper bound of overall dynamic regret. 
Combining (34), (35) and (36), we obtain term (a) ηi(τ + 2)T + log (|X||A|) + 2 PT log T ηi + ε(2τ + 3)2T + log N ε + ε(2τ + 3)(τ + 1)T ηi(τ + 2)T + log (|X||A|) + 2 PT log T ηi + 2ε(2τ + 3)2T + log N From the configuration that ε = q log N 2T (2τ+3)2 , we obtain term (a) ηi(τ + 2)T + log (|X||A|) + 2 PT log T ηi + (4τ + 6) p It is clear that the the optimal step size is η = q log |X||A|+2 PT log T (τ+2)T . From the definition of PT , we have 0 PT = PT t=2 qπc t qπc t 1 1 2T, we ensure the possible range of η is s log (|X||A|) + 4T log T Set the step size pool as H = n 2i 1 q T | i [N] o where N = 1 2 log(1 + 4T log T log |X||A|) + 1. We can verify that (τ + 2)T η , and ηN log (|X||A|) + 4T log T (τ + 2)T = η . Thus, we confirm that there exists a base-learner whose step size satisfies ηi η ηi +1 = 2ηi . Then, we choose i to analysis to obtain a sharp bound. Thus term (a) is bounded by term (a) ηi (τ + 2)T + log (|X||A|) + 2 PT log T ηi + (4τ + 6) p Dynamic Regret of Online Markov Decision Processes η (τ + 2)T + 2(log (|X||A|) + 2 PT log T) η + (4τ + 6) p (τ + 2)T(log |X||A| + 2 PT log T) + (4τ + 6) p 2T log N, (37) where the second inequality holds due to the condition ηi η ηi +1 = 2ηi and the last inequality holds by substituting the optimal step size η = q log |X||A|+2 PT log T (τ+2)T . Therefore, combining (33) and (37), we obtain E[D-REGT (πc 1:T )] term (a) + (τ + 1)2PT + 4τ + 6 (τ + 2)T(log |X||A| + 2 PT log T) + (4τ + 6) p 2T log N + (τ + 1)2PT + 4τ + 6 (τ + 2)T(log |X||A| + 2(τ + 2)PT log T) + (4τ + 6) p 2T log N + (τ + 1)2PT + 4τ + 6 τT(log |X||A| + τPT log T) + τ 2PT ), where the third inequality uses PT (τ + 2)PT from Lemma 11. This finishes the proof. E.4. Useful Lemmas In this part, we present some useful lemmas in infinite-horizon MDPs. Denote by dπ the stationary state distribution induced by policy π under transition kernel P, i.e., dπ(x) P a qπ(x, a), x X. Then we have the following useful lemmas which show the relationships between policy discrepancy and distribution discrepancy. Lemma 14 (Lemma 4 of Neu et al. (2014)). For any two policies π and π , it holds that dπ dπ 1 (τ + 1) π π 1, . Proof. The statement follows from solving dπ dπ 1 dπP π dπ P π 1 + dπ P π dπ P π 1 e 1/τ dπ dπ 1 + π π 1, for dπ dπ 1 and using 1 1 e 1/τ τ + 1. Lemma 15. Consider the distribution µt = µ1P π1 P πt 1, where µ1 is any distribution over X and π1, . . . , πt is any policy sequence, it holds that t=1 µt dπt 1 2(τ + 1) + (τ + 1) t=2 dπt dπt 1 1. Proof. It is trivial for t = 1 since µ1 dπ1 1 2. Thus, in what follows we only consider the case that t 2.By the triangle inequality, we have µt dπt 1 µt dπt 1 1 + dπt 1 dπt 1 = µt 1P πt 1 dπt 1P πt 1 1 + dπt 1 dπt 1 e 1/τ µt 1 dπt 1 1 + dπt 1 dπt 1 e 1/τ e 1/τ µt 2 dπt 2 1 + dπt 2 dπt 1 1 + dπt 1 dπt 1 e (t 1)/τ µ1 dπ1 1 + n=0 e n/τ dπt n dπt n 1 1 2e (t 1)/τ + n=0 e n/τ dπt n dπt n 1 1. Dynamic Regret of Online Markov Decision Processes where the first equality holds since dπt 1 is the stationary distribution of πt 1, i.e., dπt 1 = dπt 1P πt 1 and the second inequality holds due to Definition 1. Summing above inequality over t, we have t=1 µt dπt 1 2 + 2 t=2 e (t 1)/τ + n=0 e n/τ dπt n dπt n 1 1 n=0 e n/τ) dπt dπt 1 1 2(τ + 1) + (τ + 1) t=2 dπt dπt 1 1. This finishes the proof.
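As a small numerical sanity check (not part of the paper's analysis), the snippet below verifies the two scalar facts used in the proofs of Lemma 14 and Lemma 15: the geometric series $\sum_{n \ge 0} e^{-n/\tau} = 1/(1 - e^{-1/\tau})$ is at most $\tau + 1$, and for a chain that contracts toward its stationary distribution at rate $e^{-1/\tau}$ the accumulated deviation $\sum_t \|\mu_t - d\|_1$ stays below $(\tau + 1)\|\mu_1 - d\|_1$. The two-state chain with an explicit contraction factor is a hypothetical construction chosen purely for illustration.

```python
import numpy as np

# Fact 1: 1 / (1 - e^{-1/tau}) <= tau + 1 for every tau > 0 (used at the end of Lemma 14).
for tau in [0.5, 1.0, 2.0, 5.0, 20.0, 100.0]:
    series = 1.0 / (1.0 - np.exp(-1.0 / tau))
    assert series <= tau + 1.0 + 1e-12
    print(f"tau={tau:6.1f}   1/(1-e^(-1/tau))={series:8.3f}   tau+1={tau + 1.0:8.3f}")

# Fact 2: with a fixed policy whose chain contracts at rate rho = e^{-1/tau}, the deviations
# ||mu_t - d||_1 decay geometrically, so their sum is at most (tau + 1) * ||mu_1 - d||_1,
# mirroring the telescoping argument of Lemma 15 in the special case of a single policy.
tau = 5.0
rho = np.exp(-1.0 / tau)
d = np.array([0.3, 0.7])       # stationary distribution of the hypothetical chain
mu0 = np.array([1.0, 0.0])     # arbitrary initial state distribution
mu, accumulated = mu0.copy(), 0.0
for _ in range(500):
    accumulated += np.abs(mu - d).sum()
    mu = rho * mu + (1.0 - rho) * d   # one step: move a (1 - rho)-fraction toward d
print(f"sum_t ||mu_t - d||_1 = {accumulated:.3f} "
      f"<= (tau+1) * ||mu_1 - d||_1 = {(tau + 1) * np.abs(mu0 - d).sum():.3f}")
```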