# Dynamic Regret of Adversarial Linear Mixture MDPs

Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, China
School of Artificial Intelligence, Nanjing University, China
{lilf, zhaop, zhouzh}@lamda.nju.edu.cn

Abstract

We study reinforcement learning in episodic inhomogeneous MDPs with adversarial full-information rewards and an unknown transition kernel. We consider linear mixture MDPs, whose transition kernel is a linear mixture model, and choose dynamic regret as the performance measure. Denoting by $d$ the dimension of the feature mapping, $H$ the length of each episode, $K$ the number of episodes, and $P_T$ the non-stationarity measure, we propose a novel algorithm that enjoys an $\widetilde{\mathcal{O}}\big(\sqrt{H^4(K+P_T)(1+P_T)}\big)$ dynamic regret under the condition that $P_T$ is known, which improves the previously best-known dynamic regret for adversarial linear mixture MDPs and adversarial tabular MDPs. We also establish an $\Omega\big(\sqrt{HK(H+P_T)}\big)$ lower bound, indicating our algorithm is optimal in $K$ and $P_T$. Furthermore, when the non-stationarity measure $P_T$ is unknown, we design an online ensemble algorithm with a meta-base structure, which is proved to achieve an $\widetilde{\mathcal{O}}\big(\sqrt{H^4(K+P_T)(1+P_T) + H^2 S_T^2}\big)$ dynamic regret, where $S_T$ is the expected switching number of the best base-learner. The result can be optimal under certain regimes.

1 Introduction

Reinforcement Learning (RL) aims to learn a policy that maximizes the cumulative reward through interacting with the environment, and has achieved tremendous success in various fields, including games [1, 2], robotic control [3, 4], and dialogue generation [5, 6]. In reinforcement learning, the Markov Decision Process (MDP) [7] is the most widely used model to describe the environment. Traditional MDPs assume that the reward functions are stochastic and that the number of actions and states is small. However, in many real-world applications, the reward functions may be adversarially changing and the state and action spaces are large or even infinite. Previous work studies these two problems separately.

To deal with adversarial reward functions, Even-Dar et al. [8] first consider learning adversarial MDPs with known transition and full-information reward feedback. They propose the MDP-E algorithm that enjoys an $\widetilde{\mathcal{O}}(\sqrt{\tau^3 T})$ regret, where $\tau$ is the mixing time and $T$ is the number of steps. There is a line of subsequent work studying adversarial MDPs [9, 10, 11, 12, 13, 14, 15, 16], covering various settings depending on whether the transition kernel is known and whether the feedback is full-information or bandit.

To overcome the large state and action space issue, a widely used method is linear function approximation, which reparameterizes the value function as a linear function over some feature mapping that maps the state and action to a low-dimensional space. Among these works, linear MDPs [17, 18, 19, 20, 21, 22, 23] and linear mixture MDPs [24, 25, 26, 27, 28, 29, 30, 31] are two of the most popular MDP models with linear function approximation. In particular, He et al. [23] and Zhou et al. [31] attain the minimax optimal $\widetilde{\mathcal{O}}(\sqrt{d^2H^3K})$ regret for linear MDPs and linear mixture MDPs with stochastic rewards, respectively.

Recent studies try to combine the two lines of work to establish the theoretical foundation of adversarial MDPs with large state and action spaces. In particular, Cai et al.
[32] study adversarial linear mixture MDPs and propose the OPPO algorithm that enjoys an $\widetilde{\mathcal{O}}(\sqrt{d^2H^4K})$ regret. He et al. [33] improve the result to $\widetilde{\mathcal{O}}(\sqrt{d^2H^3K})$ and show it is (nearly) optimal by presenting a matching lower bound. However, these studies choose static regret as the performance measure, defined as the performance difference between the learner's policy and that of the best fixed policy in hindsight, namely,
$$\mathrm{Regret}(K) = \max_{\pi \in \Pi} \sum_{k=1}^{K} V^{\pi}_{k,1}(s_{k,1}) - \sum_{k=1}^{K} V^{\pi_k}_{k,1}(s_{k,1}), \qquad (1)$$
where $V$ is the value function and $\Pi$ is the set of all stochastic policies. One caveat in (1) is that the best fixed policy may behave poorly in non-stationary environments. To this end, we introduce dynamic regret, which benchmarks the learner's performance against a sequence of changing policies, defined as
$$\mathrm{D\text{-}Regret}(K) = \sum_{k=1}^{K} V^{\pi^c_k}_{k,1}(s_{k,1}) - \sum_{k=1}^{K} V^{\pi_k}_{k,1}(s_{k,1}), \qquad (2)$$
where $\pi^c_1, \ldots, \pi^c_K \in \Pi$ are the compared policies. Define $P_T = \sum_{k=2}^{K} d(\pi^c_k, \pi^c_{k-1})$ with a certain distance measure $d(\cdot, \cdot)$ as the non-stationarity measure. A favorable dynamic regret should scale with $P_T$.

Dynamic regret is a more appropriate measure in non-stationary environments, but it is more challenging to optimize, so few studies focus on it in the literature. Zhao et al. [34] investigate the dynamic regret of adversarial tabular MDPs with a known transition kernel and present an algorithm with optimal dynamic regret. Importantly, their algorithm does not require the non-stationarity measure $P_T$ as algorithmic input ahead of time. For the unknown transition setting, Fei et al. [35] study adversarial tabular MDPs and propose an algorithm with a dynamic regret guarantee. Zhong et al. [36] further extend the algorithm of Fei et al. [35] to accommodate non-stationary transition kernels with linear function approximation. Both algorithms of Fei et al. [35] and Zhong et al. [36] require the quantity $P_T$ as input. Moreover, their dynamic regret bounds are suboptimal in terms of $K$ and $P_T$, as demonstrated by the lower bound established in our work (see Theorem 4).

This work investigates the dynamic regret of adversarial linear mixture MDPs, with a focus on full-information feedback and unknown transition. We first propose the POWERS-FixShare algorithm for the case when $P_T$ is known, which combines optimistic policy optimization with a Bernstein bonus and the fixed-share mechanism. We show it enjoys an $\widetilde{\mathcal{O}}\big(\sqrt{H^4(K+P_T)(1+P_T)}\big)$ dynamic regret, where $d$ is the dimension of the feature mapping, $H$ is the length of each episode, $K$ is the number of episodes, and $P_T$ is the non-stationarity measure. We also establish a dynamic regret lower bound of $\Omega\big(\sqrt{HK(H+P_T)}\big)$. We stress four remarks regarding our results: (1) Our result can recover the $\widetilde{\mathcal{O}}(\sqrt{d^2H^3K})$ minimax optimal static regret of He et al. [33]. (2) Our result improves upon the previously best-known $\widetilde{\mathcal{O}}\big(d H^{7/4} K^{3/4} + H^2 K^{2/3} P_T^{1/3}\big)$ dynamic regret for the fixed-transition case of Zhong et al. [36, Theorem 4.6] in terms of $H$, $K$ and $P_T$. (3) Our result implies an $\widetilde{\mathcal{O}}\big(\sqrt{H^4S^2AK} + H^2\sqrt{(K+P_T)(1+P_T)}\big)$ dynamic regret for adversarial tabular MDPs, strictly improving upon the previously best-known $\widetilde{\mathcal{O}}\big(\sqrt{H^4S^2AK} + H^2K^{2/3}P_T^{1/3}\big)$ dynamic regret of Fei et al. [35, Theorem 1] in terms of both $K$ and $P_T$. (4) As the lower bound suggests, our result is the first to be optimal in the dependence on $d$, $K$ and $P_T$, and it can be optimal in terms of $H$ under certain regimes ($H \ge d$ and $P_T \le d^2/H$).
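To make the quantities in (2) concrete, the following is a minimal numerical sketch (not part of the paper's method) that computes the path-length measure $P_T$, using the $\ell_{1,\infty}$ distance adopted later in Section 3, and the dynamic regret from per-episode values; all array shapes and variable names are illustrative assumptions.

```python
import numpy as np

def path_length(policies):
    """P_T = sum_{k>=2} sum_h max_s || pi^c_{k,h}(.|s) - pi^c_{k-1,h}(.|s) ||_1 (l_{1,inf} path length)."""
    total = 0.0
    for k in range(1, policies.shape[0]):
        diff = np.abs(policies[k] - policies[k - 1]).sum(axis=-1)   # l1 distance per (h, s), shape (H, S)
        total += diff.max(axis=-1).sum()                            # max over states, sum over stages
    return total

def dynamic_regret(V_comparator, V_learner):
    """D-Regret(K) = sum_k V^{pi^c_k}_{k,1}(s_{k,1}) - sum_k V^{pi_k}_{k,1}(s_{k,1})."""
    return float(np.sum(V_comparator) - np.sum(V_learner))

# toy usage: K=5 episodes, H=3 stages, S=4 states, A=2 actions
rng = np.random.default_rng(0)
pi_c = rng.dirichlet(np.ones(2), size=(5, 3, 4))    # pi_c[k, h, s] is a distribution over actions
print(path_length(pi_c))
print(dynamic_regret(V_comparator=rng.random(5), V_learner=rng.random(5)))
```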
Furthermore, we study the case when PT is unknown and design a novel algorithm equipped with the dynamic regret guarantee by the meta-base two-layer structure. Our algorithm enjoys an e O( H4(K + PT )(1 + PT ) + H2S2 T ) dynamic regret, where ST is the expected switching number of the best base-learner. Though ST is a data-dependent quantity, it also reflects the degree of environmental non-stationarity to some extent. Moreover, under specific regimes, the magnitude of ST may be relatively negligible, resulting in our results still being optimal. Indeed, given that ST is a data-dependent quantity, its inclusion in the regret bound is not ideal. Deriving bounds that rely exclusively on problem-dependent quantities, like PT , remains an open challenge. We discuss the technical difficulty of removing ST in Section 5 and take this issue for future work. Finally, we also highlight the main technical challenges and our solutions as follows. We first show that the dynamic regret depends on the inverse of the minimal probability over the action space of our policies, which can be arbitrarily small. To this end, we propose a novel algorithm with the fixed-share mechanism [37]. While this mechanism is proved to enjoy favorable dynamic regret in online convex optimization [38], it suffers an additional term that can be regarded as the weighted sum of the difference between the occupancy measure of the compared policies in online MDPs. To overcome the difficulty, we exploit the multiplicative stability to bound this term, eliminating the need for a restart strategy to handle the environmental non-stationarity as in previous studies [35, 36] and allows us to attain the dynamic regret optimal in terms of K and PT . We show the dynamic regret of online MDPs can be written as the weighted average of multiarmed bandits problems over all states, where the weight for each state is the unknown and changing probability of being visited by πc 1, . . . , πc K. For the unknown PT case, we first show the standard two-layer structure used in non-stationary online learning studies [39, 40, 41] fails to achieve a favorable dynamic regret guarantee, which characterizes the unique difficulty of online MDPs. Then, we present an initial attempt to address this issue by a specifically designed two-layer structure. We prove our algorithm enjoys nice dynamic regret guarantees under certain regimes. Notations. We denote by (A) the set of probability distributions on a set A and denote the KLdivergence by DKL(p p ) = P a A p(a) log p(a) p (a) for any p, p (A). We define (A | S, H) = {{πh( | )}H h=1 | πh( | x) (A), s S, h [H]} for any set S and H Z+. Further, for any π, π , π (A | S, H), we define Eπ[ e DKL(π π )] = Eπ[PH h=1 DKL(π h( | sh) π h( | sh))]. For any policy pair πh, π h, we define πh π h 1, = maxs S πh( | s) π h( | s) 1. For any a, b, x R with a b, let [x][a,b] denote min{max(x, a), b}. e O( ) omits the logarithmic factors. 2 Related Work RL with adversarial rewards. There are many studies on learning adversarial MDPs where the reward functions are adversarially chosen, yielding fruitful results that can be categorized into three lines [8, 9, 10, 11, 12, 13, 14, 15, 16]. In particular, the first line of work considers the infinite-horizon MDPs with uniform mixing assumption. In the known transition and full-information setting, the seminal work of Even-Dar et al. [8] proposes MDP-E algorithm that achieves the e O( τ 3T) regret, where τ is the mixing time and T is the number of steps. 
Another concurrent work of Yu et al. [9] achieves e O(T 2/3) in the same setting. In the known transition and bandit-feedback setting, Neu et al. [10] propose MDP-EXP3 algorithm that attains e O(T 2/3) regret. The second line of work considers the episodic loop-free MDPs. Neu et al. [11] first study this problem under the known transition setting and propose algorithms that achieve e O( T) and e O( p T/α) for full-information and bandit-feedback respectively, where α is the lower bound of the probability of all states under any policy. Zimin and Neu [12] propose O-REPS algorithm that enjoys e O( T) regret in both full-information and bandit-feedback setting without any additional assumption. Rosenberg and Mansour [13] and Jin et al. [14] further consider the harder unknown transition and bandit-feedback setting. The last line of work studies the episodic Stochastic Shortest Path (SSP) setting [15, 16]. In this paper, we focus on episodic MDPs with the unknown transition and full-information setting. RL with linear function approximation. To design RL algorithms in large state and action space scenarios, recent works focus on solving MDPs with linear function approximation. In general, these works can be divided into three lines based on the specific assumption of the underlying MDP. The first line of work considers the low Bellman-rank assumption [42, 43, 44, 45], which assumes the Bellman error matrix has a low-rank factorization. The second line of work is based on the linear MDP assumption [17, 18, 19, 20, 21, 22, 23], where both the transition kernel and reward functions can be parameterized as linear functions of given state-action feature mappings ϕ : S A Rd. The last line of work studies the linear mixture MDP [24, 25, 26, 28, 29, 30, 31], where the transition kernel can be parameterized as a linear function of a feature mapping ϕ : S A S Rd but without the linear reward functions assumption. Amongst these works, He et al. [23] and Zhou et al. [31] attain the minimax optimal e O( d2H3K) regret for both episodic linear MDPs and linear mixture MDPs respectively. However, all the above studies consider the stochastic reward setting. In this paper, we study the episodic linear mixture MDP setting but with adversarial reward functions. Non-stationary RL. Another related line of research is online non-stationary MDPs. In contrast to adversarial MDPs where the reward functions are generated in an adversarial manner, online non-stationary MDPs consider the setting where the reward functions are generated stochastically according to some distributions that may vary over time. Jaksch et al. [46] study the piecewise stationary setting where the transitions and rewards are allowed to change certain times and propose UCRL2 with restart technique to deal with the non-stationarity. Later, Ortner et al. [47] consider the generalized setting where the changes are allowed to take place every step. However, the above studies need prior knowledge about the magnitude of non-stationary. To address this issue, Cheung et al. [48] propose the Bandit-over-RL algorithm to remove this requirement. A recent breakthrough by Wei and Luo [49] introduces a black-box method that can convert any algorithm satisfying specific conditions and enjoying optimal static regret in stationary environments into another with optimal dynamic regret in non-stationary environments, without requiring prior knowledge about the degree of non-stationarity. However, this approach does not apply to the adversarial setting. 
Specifically, their reduction requires the base algorithm to satisfy a certain property enjoyed by typical UCB-type algorithms. When a new instance of the base algorithm surpasses the optimistic estimator, it can be inferred that the environment has changed, prompting a restart of the algorithm to disregard prior information. However, this approach of constructing an optimistic estimator by a UCB-type algorithm can only be applied to a stochastic setting. In the adversarial setting, where no model assumptions are made and comparators can be arbitrary, this approach encounters significant difficulties. Dynamic Regret. Dynamic regret of RL with adversarial rewards is only recently studied in the literature [34, 35, 36]. Zhao et al. [34] investigate the dynamic regret of adversarial tabular MDPs with the known transition kernel and present an algorithm with optimal dynamic regret. Importantly, their algorithm does not require the non-stationarity measure PT as the algorithmic input ahead of time. For the unknown transition setting, Fei et al. [35] study adversarial tabular MDPs and propose an algorithm with dynamic regret guarantees. Zhong et al. [36] further extend the algorithm of Fei et al. [35] to accommodate non-stationary transition kernels with linear function approximation. Both algorithms [35, 36] require the quantity of PT as the input. Moreover, their dynamic regret bounds are suboptimal in K and PT as shown by the lower bound established in our work. In this work, we first design an optimal algorithm in terms of K and PT when PT is known. Further, we develop the first algorithm to handle the unknown PT issue in adversarial MDPs with unknown transition. 3 Problem Setup We focus on episodic inhomogeneous MDPs with full-information reward functions and the unknown transition kernel. Denote by M = {S, A, H, {rk,h}k [K],h [H], {Ph}h [H]} an episodic inhomogeneous MDP, where S is the state space, A is the action space, K is the number of episodes, H is the horizon, rk,h : S A [0, 1] is the reward function, Ph( | , ) : S A S [0, 1] is the transition kernel. We assume the rewards are deterministic without loss of generality and extending our results to stochastic rewards is straightforward. Let S = |S| and A = |A|. The learner interacts with the MDP through K episodes without knowledge of transition kernel {Ph}h [H]. In each episode k, the environment chooses the reward function {rk,h}h [H] and decides the initial state sk,1, where the reward function may be chosen in an adversarial manner and depend on the history of the past (k 1) episodes. Simultaneously, the learner decides a policy πk = {πk,h}h [H] where each πk,h : S (A) is a function that maps a state s to a distribution over action space A. In the h stage in episode k, the learner observes current state sk,h, chooses an action ak,h πk,h( | sk,h), and transits to the next state sk,h+1 Ph( | sk,h, ak,h). Then the learner obtains the reward rk,h(sk,h, ak,h) and observes the reward function rk,h as we consider the full-information setting. At stage H + 1, the learner observes the final state sk,H+1 but does not take any action, and the episode k terminates. Denote by T = KH the total steps throughout K episodes. Linear Mixture MDPs. In this work, we focus on a special class of MDPs called linear mixture MDPs, a setting initiated by Ayoub et al. [24] and further studied in the subsequent work [32, 31, 33]. In this setup, the transition kernel can be parameterized as a linear function of a feature mapping ϕ : S A S Rd. 
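To make this parameterization concrete before the formal definition below, here is a minimal sketch with toy dimensions. The construction (stacking $d$ basis transition kernels and taking $\theta_h$ in the simplex) is one simple way of our own choosing to guarantee that $P_h(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta_h \rangle$ is a valid distribution; it is not claimed to be the paper's construction. It also illustrates the regression feature $\phi_V(s, a) = \sum_{s'} \phi(s' \mid s, a) V(s')$ used throughout the analysis.

```python
import numpy as np

# Hypothetical sizes: S states, A actions, feature dimension d, horizon H.
S, A, d, H = 4, 3, 5, 2
rng = np.random.default_rng(1)

# Stack d "basis" transition kernels along the last axis and let theta_h be a convex
# combination, so that P_h(.|s,a) = <phi(.|s,a), theta_h> is a valid distribution.
basis = rng.dirichlet(np.ones(S), size=(d, S, A))   # basis[i, s, a, :] is a distribution over s'
phi = np.moveaxis(basis, 0, -1)                     # shape (S, A, S_next, d): phi[s, a, s', :]
theta = rng.dirichlet(np.ones(d), size=H)           # theta[h] in the simplex, so ||theta_h||_2 <= 1

P = np.einsum("saxd,hd->hsax", phi, theta)          # P[h, s, a, s'] = <phi(s'|s,a), theta_h>
assert np.allclose(P.sum(axis=-1), 1.0)

# Regression feature used by the algorithm: phi_V(s,a) = sum_{s'} phi(s'|s,a) V(s'),
# so that the one-step backup satisfies [P_h V](s,a) = <phi_V(s,a), theta_h>.
V = rng.random(S)
phi_V = np.einsum("saxd,x->sad", phi, V)            # shape (S, A, d)
assert np.allclose(phi_V @ theta[0], P[0] @ V)
```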
The formal definition of linear mixture MDPs is as follows. Definition 1 (Linear Mixture MDPs). An MDP M = {S, A, H, {rk,h}k [K],h [H], {Ph}h [H]} is called an inhomogeneous, episode B-bounded linear mixture MDP, if there exist a known feature mapping ϕ(s | s, a) : S A S Rd and an unknown vector θ h Rd with θ h 2 B, h [H] such that (i) Ph(s | s, a) = ϕ(s | s, a) θ h for all (s, a, s ) S A S and h [H], (ii) ϕV (s, a) 2 P s S ϕ(s | s, a)V (s ) 2 1 for any (s, a) S A and function V : S [0, 1]. Algorithm 1 POWERS-Fix Share Input: step size η, exploration parameter γ and regularization parameter λ. 1: Initialize {π0,h( |s)}H h=1, s S as uniform distribution, and set {Q0,h( , )}H h=1 as zero function. 2: for k = 1, 2, , K do 3: Receive the initial state sk,1. 4: for h = 1, 2, , H do 5: For all h [H], s S, updates policy by π k,h( | s) πk 1,h( | s) exp ηQk 1,h(s, ) , πk,h( | s) = (1 γ)π k,h( | s) + γπu( | s). 6: Take the action following ak,h πk,h( | sk,h) and transit to the next state sk,h+1. 7: Obtain reward rk,h(sk,h, ak,h) and observe the reward function rk,h( , ). 8: end for 9: Initialize Vk,H+1( ) as a zero function. 10: for h = H, H 1, , 1 do 11: Set Qk,h( , ) rk,h( , ) + bθk,h, ϕVk,h+1( , ) + bβk bΣ 1/2 k,h ϕVk,h+1( , ) 2 12: Set Vk,h( ) Ea πk,h( | ) [Qk,h( , )]. 13: Set the estimated variance Vk,h Vk,h+1 (sk,h, ak,h) as in (11), bonus Ek,h as in (21). max H2/d, Vk,h Vk,h+1 (sk,h, ak,h) + Ek,h . 15: bΣk+1,h bΣk,h + σ 2 k,hϕVk,h+1 (sk,h, ak,h) ϕVk,h+1 (sk,h, ak,h) . 16: bbk+1,h bbk,h + σ 2 k,hϕVk,h+1 (sk,h, ak,h) Vk,h+1 sk h+1 . 17: eΣk+1,h eΣk,h + ϕV 2 k,h+1 (sk,h, ak,h) ϕV 2 k,h+1 (sk,h, ak,h) . 18: ebk+1,h ebk,h + ϕV 2 k,h+1 (sk,h, ak,h) V 2 k,h+1 sk h+1 . 19: bθk+1,h bΣ 1 k+1,hbbk+1,h, eθk+1,h eΣ 1 k+1,hebk+1,h. 20: end for 21: end for Dynamic Regret. For any policy π = {πh}h [H] and any (k, h, s, a) [K] [H] S A, we define the action-value function Qπ k,h and value function V π k,h as Qπ k,h(s, a) = Eπ h =h rk,h (sh , ah ) sh = s, ah = a , V π k,h(s) = Eπ h =h rk,h (sh , ah ) sh = s The Bellman equation is given by Qπ k,h = rk,h + Ph V π k,h+1, and V π k,h(s) = Ea πh( | s)[Qπ k,h(s, a)] with V π k,H+1 = 0. For simplicity, for any function V : S R, we define the operator [Ph V ](s, a) = Es Ph( | s,a)V (s ), [Vh V ](s, a) = [Ph V 2](s, a) ([Ph V ](s, a))2. (3) As stated in Section 1, dynamic regret is a more appropriate measure compared with static regret for the adversarial environments, which is defined in (2) and we rewrite it below for clarity: D-Regret(K) = k=1 V πc k k,1(sk,1) k=1 V πk k,1(sk,1), (4) where πc 1, . . . , πc K is any sequence of compared policies. We define πc 0 = πc 1 to simplify the notation. The non-stationarity measure is defined as PT = PK k=1 PH h=1 πc k,h πc k 1,h 1, . 4 Optimal Dynamic Regret with Known PT We present our proposed algorithm in Algorithm 1. Similar to previous studies, the algorithm consists of (i) policy improvement phase, and (ii) policy evaluation phase. We introduce the details below. In Sections 4.1, we first consider the case when the transition is known to highlight the challenges even under the ideal setting. Then, we extend the results to the unknown transition setting in Section 4.2. 4.1 Policy Improvement Phase In the policy improvement phase, the algorithm updates πk based on πk 1 using the proximal policy optimization (PPO) method [50]. 
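As a reference point for the derivation that follows, here is a minimal sketch of the resulting policy-improvement step (Line 5 of Algorithm 1): an exponential-weights update on the estimated Q-function followed by mixing with the uniform policy. The tabular array representation and the parameter values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fixed_share_update(pi_prev, Q, eta, gamma):
    """One step of Line 5 in Algorithm 1 (sketch).

    pi_prev: (S, A) array for pi_{k-1,h}(.|s); Q: (S, A) array for Q_{k-1,h}(s, a).
    """
    logits = np.log(pi_prev) + eta * Q
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    pi_tilde = np.exp(logits)
    pi_tilde /= pi_tilde.sum(axis=1, keepdims=True)    # pi'_{k,h}(.|s) propto pi_{k-1,h}(.|s) exp(eta Q)
    A = pi_prev.shape[1]
    return (1.0 - gamma) * pi_tilde + gamma / A        # fixed-share: mix with the uniform policy pi_u

# toy usage: S=3 states, A=4 actions, parameters as suggested in Theorem 3 (illustrative values)
rng = np.random.default_rng(2)
S, A, H, K = 3, 4, 5, 100
pi = np.full((S, A), 1.0 / A)                          # uniform initialization
Q = rng.random((S, A)) * H
pi = fixed_share_update(pi, Q, eta=1.0 / H, gamma=1.0 / (K * H))
assert pi.min() >= (1.0 / (K * H)) / A                 # every action keeps probability at least gamma/A
```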
Specifically, at episode k, we define the following linear function: Lk 1(π) = V πk 1 k,1 (sk,1) + Eπk 1 h=1 Qπk 1 k 1,h, πh( | sh) πk 1,h( | sh) s1 = sk,1 which is the first-order Taylor approximation of V π k 1,1(sk,1) around πk 1. Then, we update πk by πk = arg max π (A | S,H) Lk 1(π) 1 h=1 DKL πh( | sh) πk 1,h( | sh) # where η > 0 is the stepsize and the KL-divergence encourages πk to be close to πk 1 so that Lk 1(π) is a good approximation of V π k 1,1(sk,1). The update rule in (5) takes the following closed form, πk,h( | s) πk 1,h( | s) exp η Qπk 1 k 1,h(s, ) , h [H], s S. (6) We show the update rule in (6) ensures the following guarantee and the proof is in Appendix C.1. Lemma 1. The update rule in (6) ensures the following dynamic regret guarantee: D-Regret(K) ηKH3 h e DKL (πc k πk) e DKL (πc k πk+1) i . (7) Note that the expectation in the last term in (7) is taken over πc k which may change over episode k. For static regret, i.e., πc 1 = . . . = πc K = π , we can control this term by a standard telescoping argument, which is not viable for dynamic regret analysis. Fei et al. [35] propose a restart strategy to handle this term. Specifically, they restart the algorithm every certain number of steps and decompose the above expectation into Eπc k[ ] = Eπc k0[ ] + Eπc k πc k0[ ] where k0 < k is the episode in which restart takes place most recently before episode k. For the first expectation Eπc k0[ ], they apply a customized telescoping argument to each period as the expectation is taken over the fixed policy. The second expectation Eπc k πc k0[ ] involves the difference πc k πc k0 and can be bounded by PT . However, as we will show in Theorem 4, their regret bound is suboptimal in terms of K and PT . We introduce our approach below. Let us first consider taking expectations over any fixed policy π. Denote by δ the minimal probability over any action at any state for policies π1, . . . , πK, i.e., δ = mink [K] πk(a | s), a A, s S, the last term in (7) can be upper bounded by PK k=1 Eπ[ e DKL (πc k πk) e DKL (πc k πk+1)] H log A+PT log 1 δ , showing that we need to control the minimal value of δ to obtain a favorable dynamic regret bound. To this end, we slightly modify the update rule in (6) and add a uniform distribution πu( | s) = 1 A1, s S. That is, the policy πu chooses each action with equal probability at any state. Thus, the update rule in (6) is modified as: π k,h( | s) πk 1,h( | s) exp(η Qπk 1 k 1,h(s, )), πk,h( | s) = (1 γ)π k,h( | s) + γπu( | s) (8) for any s S, h [H], where γ 0 is the exploration parameter. This update is called the fixed-share mechanism in online learning literature [37]. While the fixed-share mechanism is standard to obtain dynamic regret in modern online learning [38], several important new challenges arise in online MDPs due to taking expectations over the policy sequence of changing policies πc 1, . . . , πc K. In particular, we prove that performing (8) ensures the following dynamic regret. Lemma 2. Set π1 as uniform distribution on A for any state s S. The update rule in (8) ensures D-Regret(K) ηKH3 γ + KH log 1 1 γ πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) The proof can be found in Appendix C.2. In the dynamic regret analysis in online learning, the last term in (9) is usually canceled out through telescoping since we do not need to take expectations [38]. However, this is not the case in online MDPs. Since the expectation is taken over the policy sequence of changing policies πc 1, . . . 
, πc K, this term cannot be canceled out, which requires a more refined analysis. To address this issue, we decompose one step of the expectation in (9) as follows.1 a A πc k 1,h log 1 π k,h a A πc k,h log 1 π k+1,h + Eπc k πc k 1 a A πc k 1,h log 1 π k,h With this decomposition, the first term can be canceled out through telescoping, yet it remains to control the second term the weighted difference between the state-action occupancy measures of policy πc k and πc k 1 with weight πc k 1,h(a | sh) log π k,h(a | sh) for state-action (sh, a). To control it, we need to (i) ensure the weight is upper bounded by some universal constant, and (ii) bound the unweighted difference between the state-action occupancy measures, which are new challenges that arose in online MDPs compared with standard online learning. For the first challenge, note that the weight πc k 1,h(a | sh) log π k,h(a | sh) can be large or even infinite since π k,h is the policy before uniform exploration and π k,h(a | sh) can be arbitrarily small. Fortunately, π k,h is obtained by one-step descent from πk 1,h, which is the policy after uniform exploration and can be lower bounded. We provide the following multiplicative stability lemma for the one-step update, which shows π k,h is not far from πk 1,h and thus is also lower bounded. Lemma 3 (Multiplicative Stability). For any distributions p (A) with p(a) > 0, for all a A, and any function Q : S A [0, H], it holds for p (A) with p (a) p(a) exp(η Q(s, a)) and η 1/H that p (a) [p(a)/4, 4p(a)], for all a A. For the second challenge, we show the unweighted difference between the state-action occupancy measures can be bounded by the path length of policies. In particular, we have the following lemma. Lemma 4. For any policy sequence πc 1, . . . , πc K, it holds that Eπc k Eπc k 1 h=1 1(sh) | s1 = sk,1 i=1 πc k,i πc k 1,i 1, = HPT . Remark 1. We note that a similar argument is also used in Fei et al. [35, Appendix B.2.2]. However, they prove this lemma by imposing an additional smooth visitation measures assumption [35, Assumption 1], which is not required in our analysis. The proofs for Lemmas 3 and 4 can be found in Appendices C.3 and C.4 respectively. Combining Lemmas 2, 3 and 4, we can prove the guarantee for update rule (8). The proof is in Appendix C.5. Theorem 1. Set π1 as uniform distribution on A for any state s S. The update rule in (8) ensures D-Regret(K) ηKH3 H log A + (1 + H) log 4A γ PT + KH log 1 1 γ Remark 2. Considering the static regret where πc 1 = . . . = πc K = π , we can recover the O( p H4K log A) static regret in Cai et al. [32] under the stationary scenario by setting γ = 0, that is, without uniform exploration. However, when γ = 0, the dynamic regret is not bounded as there lacks an upper bound for log γ, showing the necessity of the fixed-share mechanism. 4.2 Policy Evaluation Phase Sections 4.1 focus on the simplified scenario where the transition is known. In this subsection, we further consider the unknown transition setting such that it is necessary to evaluate the policy πk based on the (k 1) historical trajectories. To see how the model estimation error enters the dynamic regret, we decompose the dynamic regret in the following lemma. Lemma 5 (Fei et al. [35, Lemma 1]). 
Define the model prediction error as ιk,h = rk,h +Ph Vk,h+1 Qk,h, the dynamic regret D-Regret(K) = PK k=1 V πc k k,1(sk,1) PK k=1 V πk k,1(sk,1) can be written as X k,h Eπc k Qk,h(sh, ), πc k,h( | sh) πk,h( | sh) + MK,H + X Eπc k[ιk,h(sh, ah)] ιk,h(sk,h, ak,h) , where MK,H = PK k=1 PH h=1 Mk,h is a martingale that satisfies Mk,h 4H, k [k], h [H]. 1With a slight abuse of notations, we omit ( | sh) for simplicity. Remark 3. Lemma 5 is independent of the structure of MDPs. The first term in Lemma 5 is the dynamic regret over the estimated action-value function Qk,h, which can be upper bounded by Theorem 5. The second term is a martingale, which can be bounded by Azuma-Hoeffding inequality. The third term is the model estimation error, which is the main focus of this section. Note the model prediction error ιk,h(sh, ah) can be large for the state-action pairs that are less visited or even unseen. The general approach is incorporating the bonus function into the estimated Q-function such that ιk,h(sh, ah) 0 for all s S, a A (i.e., Eπc k[ιk,h(sh, ah)] 0) and we only need to control ιk,h(sk,h, ak,h), which is the model estimation error at the visited state-action pair (sk,h, ak,h). When applied to linear mixture MDPs, the key idea is learning the unknown parameter θ h of the linear mixture MDP and using the learned parameter θk,h to build an optimistic estimator Qk,h( , ) such that the model prediction error is non-positive, which is more or less standard. From the definition of linear mixture MDP, for the learned value function Vk,h( ), we have [Ph Vk,h+1](s, a) = P s ϕ(s | s, a)Vk,h+1(s ), θ h = ϕVk,h+1(s, a), θ h . Inspired by recent advances in policy evaluation for linear mixture MDPs [31], we adopt the weighted ridge regression to estimate the parameter θ h, that is, we construct the estimator bθk,h by solving the following weighted ridge regression problem: bθk,h = arg min θ Rd j=1 ϕVj,h+1 (sj,h, aj,h) , θ Vj,h+1 (sj,h+1) 2 / σ2 j,h + λ θ 2 2. Here, σ2 j,h is the upper confidence bound of the variance [Vh Vj,h+1](sj,h, aj,h), and we set it as σk,h = q max{H2/d, [ Vk,h Vk,h+1] (sk,h, ak,h) + Ek,h}, where [ Vk,h Vk,h+1](sk,h, ak,h) is a scalar-valued empirical estimate for the variance of the value function Vk,h+1 under the transition probability Ph( | sk, ak), and Ek,h is the bonus term to guarantee that the true variance [Vk,h Vk,h+1](sk,h, ak,h) is upper bounded by [ Vk,h Vk,h+1](sk,h, ak,h) + Ek,h with high probability. Then, the confidence set b Ck,h is constructed as follows: b Ck,h = θ | bΣ1/2 k,h(θ bθk,h) 2 bβk . (10) where bΣk,h is a covariance matrix based on the observed data, and bβk is a radius of the confidence set. Given b Ck,h, we estimate the Q-function following the principle of optimism in the face of uncertainty [51] and set it as Qk,h( , ) = [rk,h( , ) + maxθ b Ck,h θ, ϕVk,h+1( , ) ][0,H h+1]. It remains to estimate the variance [Vh Vk,h+1](sk,h, ak,h). By the definition of linear mixture MDPs, we have [Vh Vk,h+1](sk,h, ak,h) = ϕV 2 k,h+1(sk,h, ak,h), θ h [ ϕVk,h+1(sk,h, ak,h), θ h ]2. Therefore, we estimate [ Vk,h Vk,h+1] (sk,h, ak,h) by the expression below ϕV 2 k,h+1 (sk,h, ak,h) , eθk,h [0,H2] ϕVk,h+1 (sk,h, ak,h) , bθk,h 2 [0,H], (11) where eθk,h = arg minθ Rd Pk 1 j=1[ ϕV 2 j,h+1(sj,h, aj,h), θ V 2 j,h+1(sj,h+1)]2 + λ θ 2 2. The details are summarized in Lines 10-20 of Algorithm 1 and we provide the following guarantee. Theorem 2. 
Set the parameters as in Lemma 8, with probability at least 1 δ, we have Vk,1 sk 1 V πk k,1 sk 1 = MK,H h=1 ιk,h(sk,h, ak,h) e O p d H4K + d2H3K . The proof is given in Appendix C.6. Theorem 2 shows the model estimation error can be bounded. Combining Theorems 1, 2 and Lemma 5, we present the dynamic regret bound in the next section. 4.3 Regret Guarantee: Upper and Lower Bounds In this section, we provide the regret bound for our algorithm and present a lower bound of the dynamic regret for any algorithm for adversarial linear mixture MDPs with the unknown transition. Theorem 3. Set η = min{ p (PT + log A)/K, 1}/H, γ = 1/(KH) and bβk as in Lemma 8, then with probability at least 1 δ, it holds D-Regret(K) e O p d H4K + d2H3K + p H4(K + PT )(1 + PT ) , (12) where PT = PK k=1 PH h=1 πc k,h πc k 1,h 1, is the path length of the compared policies. Remark 4 (recovering static regret). Since static regret is a special case with πc k = π , k, our result can recover the optimal e O( d2H3K) static regret when H d, same as the result in He et al. [33]. Remark 5 (improving linear mixture case). Our result improves upon the previously best-known e O(d H7/4K3/4 + H2K2/3PT 1/3) dynamic regret for adversarial linear mixture MDPs of Zhong et al. [36, Theorem 4.6] in terms of the dependence on H, K and PT . Remark 6 (improving tabular case). For the adversarial tabular MDPs, our result implies an e O( H4S2AK + H2p (K + PT )(1 + PT )) dynamic regret. This improves upon the best-known e O( H4S2AK +H2K2/3PT 1/3) result of Fei et al. [35, Theorem 1]. The details are in Appendix B. We finally establish the lower bound of this problem. The proof can be found in Appendix C.8. Theorem 4. Suppose B 2, d 4, H 3, K (d 1)2H/2, for any algorithm and any constant Γ [0, 2KH], there exists an episodic B-bounded adversarial linear mixture MDP and compared policies πc 1, . . . , πc K such that PT Γ, and D-Regret(K) Ω( HK(H + Γ)). When H d, the upper bound is e O( H4(K + PT )(1 + PT )). Combining it with Theorem 4, we discuss the optimality of our result. We consider the following three regimes. Small PT : when 0 PT d2/H, the upper bound (12) can be simplified as e O( d2H3K), and the lower bound is e O( d2H3K), hence our result is optimal in terms of d, H and K. Moderate PT : when d2/H PT K, the upper bound (12) can be simplified as e O( H4K(1 + PT )), and it is minimax optimal in d, K and PT but looses a factor of H 3 2 . Large PT : when PT K, any algorithm suffers at most O(HK) dynamic regret, while the lower bound is Ω(K H). So our result is minimax optimal in K but looses a factor of 5 Towards Optimal Dynamic Regret with Unknown PT This section further considers the case when the non-stationarity measure PT is unknown. By Theorem 1, we need to tune the step size η optimally to balance the number of episodes K and PT to achieve a favorable dynamic regret. To address the difficulty of not knowing PT ahead of time, we develop an online ensemble method to handle this uncertainty, in which a two-layer meta-base structure is maintained. While the methodology can be standard in recent non-stationary online learning [52, 39, 40, 41], new challenges arise in online MDPs. We introduce the details below. By the performance difference lemma in Cai et al. [32, Lemma 3.2] (as restate in Lemma 13), we can rewrite the dynamic regret as h V πc k k,1 sk 1 V πk k,1 sk 1 i = D Qπk k,h(sh, ), πc k,h( | sh) πk,h( | sh) E# where the expectation is taken over the randomness of the state trajectory sampled according to πc k. 
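The rewriting above can be checked numerically. The sketch below builds a small random finite-horizon MDP and verifies the performance difference lemma: the value gap between a comparator policy and the learner's policy equals the expectation, under the comparator's state distribution, of $\langle Q^{\pi_k}_{k,h}(s_h,\cdot),\, \pi^c_{k,h}(\cdot \mid s_h) - \pi_{k,h}(\cdot \mid s_h) \rangle$. All sizes and random instances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, H, s1 = 4, 3, 5, 0
P = rng.dirichlet(np.ones(S), size=(H, S, A))       # P[h, s, a, :] transition kernel
r = rng.random((H, S, A))                           # rewards in [0, 1]
pi = rng.dirichlet(np.ones(A), size=(H, S))         # learner's policy pi_{k,h}(.|s)
pi_c = rng.dirichlet(np.ones(A), size=(H, S))       # comparator policy pi^c_{k,h}(.|s)

def q_values(policy):
    """Backward induction for Q^pi_h(s, a) and V^pi_h(s)."""
    Q, V = np.zeros((H + 1, S, A)), np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q[h] = r[h] + P[h] @ V[h + 1]
        V[h] = (policy[h] * Q[h]).sum(axis=1)
    return Q, V

Q_pi, V_pi = q_values(pi)
_, V_c = q_values(pi_c)

# Forward pass: state distribution of the comparator started from s1.
d = np.zeros((H, S)); d[0, s1] = 1.0
for h in range(H - 1):
    d[h + 1] = np.einsum("s,sa,sax->x", d[h], pi_c[h], P[h])

lhs = V_c[0, s1] - V_pi[0, s1]
rhs = sum((d[h] * ((pi_c[h] - pi[h]) * Q_pi[h]).sum(axis=1)).sum() for h in range(H))
assert np.isclose(lhs, rhs)                          # performance difference lemma holds
```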
The dynamic regret of online MDPs can be written as the weighted average of some multi-armed bandits problems over all states, where the weight for each state is the unknown and changing probability of being visited by πc 1, . . . , πc K. As the optimal step size depends on the unknown nonstationarity measure PT as shown in Section 4, a natural idea is to the two-layer structure to learn the optimal step size as in recent online convex optimization literature [52, 39, 40, 41]. The general idea is constructing a step size pool H = {η1, . . . , ηN} to discretize the value range of the optimal step size; and then maintaining multiple base-learners B1, . . . , BN, each of which works with a specific step size ηi. Finally, a meta-algorithm is used to track the best base-learner and yield the final policy. Then, the dynamic regret can be decomposed as follows (omit ( | sh) for simplicity): D Qπk k,h, πc k,h πi k,h E# D Qπk k,h, πi k,h πk,h E# Since the above decomposition holds for any index i [N], we can always choose the base-learner with optimal step size to analyze and the first term is easy to control. The challenge is to control the second term, which is the regret of the meta-algorithm. Different from the standard Prediction with Expert Advice problem, it involves an additional expectation over the randomness of states sampled according to πc k. This poses a grand challenge compared to conventional online convex optimization where the expectation is not required. Although we can bound this term by PT again, optimal tuning of the meta-algorithm is hindered as PT is unknown. Consequently, we opt to upper bound it by the worst-case dynamic regret [53], that is, benchmarking the performance with the best choice of each round, which in turn introduces the dependence on the switching number of the best base-learner. We introduce our approach as follows. We maintain multiple base-learners, each of which works with a specific step size ηi. All base-learners update their policies according to the same action-value function Qk 1,h(sh, ) of the combined policy πk 1, that is, the base-learner Bi updates policy by πi, k,h( | s) πi k 1,h( | s) exp(ηi Qk 1,h(s, )), πi k,h( | s) = (1 γ)πi, k,h( | s) + γπu( | s), (13) Then, the meta-algorithm chooses the base-learner by measuring the quality of each base-learner. In our approach, we choose the best base-learner at the last episode, that is, πk,h( | s) = π i k 1,h k,h ( | s) with i k 1,h(s) = arg maxi [N] Qk 1,h(s, ), πi k 1,h( | s) . (14) The details are summarized in Algorithm 2 of Appendix A and the guarantee is as follows. Theorem 5. Set γ = 1/(KH), step size pool H = {ηi = (2i/H) p (log A)/K | i [N]} with N = 1 2 log( K log A) . Algorithm 2 ensures D-Regret(K) e O p d H4K + d2H3K + q H4(K + PT )(1 + PT ) + H2S2 T , where PT = PK k=1 PH h=1 πc k,h πc k 1,h 1, is the path length of the compared policies, ST = PK k=1 PH h=1 Eπc k 1[i k,h(sh) = i k 1,h(sh)] is the expected switching number of best base-learner. Combining it with Theorem 3, we discuss the optimality of our result. We consider two regimes. Small ST : when ST max{d (K + PT )(1 + PT )}, the term ST can be subsumed by other terms. In this case, the upper bound in Theorem 5 is entirely the same as that in Theorem 3. This implies we maintain the same guarantees without PT as algorithmic input. Large ST : when ST > max{d (K + PT )(1 + PT )}, our result looses a factor of HST compared with the result in Theorem 3 for the known PT setting. 
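Before moving on, here is a minimal sketch of this two-layer structure: a geometric step-size pool following the form in Theorem 5 (with the pool size rounded, as an assumption on the garbled constant), base-learners that each run the fixed-share update (13) with their own step size, and the meta-rule (14) that follows, per state, the base policy with the largest estimated value. Array shapes and parameters are illustrative.

```python
import numpy as np

def step_size_pool(K, A, H):
    """Candidate step sizes eta_i = (2^i / H) * sqrt(log A / K), i = 1, ..., N."""
    N = max(1, int(np.ceil(0.5 * np.log2(K / np.log(A)))))   # pool size (rounding is an assumption)
    return [(2.0 ** i / H) * np.sqrt(np.log(A) / K) for i in range(1, N + 1)]

def base_updates(pis_prev, Q, etas, gamma):
    """Each base-learner B_i performs the fixed-share update (13) with its own eta_i."""
    A = Q.shape[1]
    updated = []
    for pi_prev, eta in zip(pis_prev, etas):
        logits = np.log(pi_prev) + eta * Q
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        updated.append((1.0 - gamma) * p + gamma / A)
    return updated

def meta_select(pis, Q):
    """Meta-rule (14): per state, follow the base policy maximizing <Q(s,.), pi_i(.|s)>."""
    scores = np.stack([(pi * Q).sum(axis=1) for pi in pis])   # shape (N, S)
    best = scores.argmax(axis=0)                              # i^*(s)
    return np.stack([pis[best[s]][s] for s in range(Q.shape[0])])

# toy usage: S=3 states, A=4 actions, H=5, K=100
rng = np.random.default_rng(3)
S, A, H, K = 3, 4, 5, 100
etas = step_size_pool(K, A, H)
pis = [np.full((S, A), 1.0 / A) for _ in etas]
Q = rng.random((S, A)) * H
pis = base_updates(pis, Q, etas, gamma=1.0 / (K * H))
pi_combined = meta_select(pis, Q)
```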
By the above discussion, our result can be optimal in terms of K and PT under certain regimes when PT is unknown. In comparison, the regret bounds achieved via the restart mechanism [35, 36] remain sub-optimal across all regimes even PT is known. Note that we introduce the notation ST in the regret analysis, which also reflects the degree of environmental non-stationarity to some extent. Consider the following two examples: (i) in the stationary environment, ST could be relatively small as the best base-learner would seldom change, and (ii) in the piecewise-stationary environment, ST would align with the frequency of environmental changes. Indeed, given that ST is a data-dependent quantity, its inclusion in the regret analysis is not ideal. Deriving bounds that rely exclusively on problem-dependent quantities, like PT , remains a significant open challenge. 6 Conclusion and Future Work In this work, we study the dynamic regret of adversarial linear mixture MDPs with the unknown transition. For the case when PT is known, we propose a novel policy optimization algorithm that incorporates a fixed-share mechanism without the need for restarts. We show it enjoys a dynamic regret of e O H4(K + PT )(1 + PT ) , strictly improving the previously best-known result of Zhong et al. [36] for the same setting and Fei et al. [35] when specialized to tabular MDPs. We also establish an Ω HK(H + PT ) lower bound, indicating that our algorithm is optimal regarding d, K and PT and can be optimal in terms of H under certain regimes. Moreover, we explore the more complex scenario where PT is unknown. We show this setting presents unique challenges that distinguish online MDPs from conventional online convex optimization. We introduce a novel two-layer algorithm and show its dynamic regret guarantee is attractive under certain regimes. There are several important future works to investigate. First, how to remove the dependence on the switching number ST is an important open question. Moreover, we focus on the full-information feedback in this work, it remains an open problem to extend the results to the bandit feedback. Acknowledgements This research was supported by National Key R&D Program of China (2022ZD0114800) and NSFC (62206125, 61921006). Peng Zhao was supported in part by the Xiaomi Foundation. [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. Ar Xiv preprint, 1312.5602, 2013. [2] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, pages 484 489, 2016. [3] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1889 1897, 2015. [4] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33nd International Conference on Machine Learning (ICML), pages 1329 1338, 2016. [5] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 
Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1192 1202, 2016. [6] William Yang Wang, Jiwei Li, and Xiaodong He. Deep reinforcement learning for NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 19 21, 2018. [7] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994. [8] Eyal Even-Dar, Sham. M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, pages 726 736, 2009. [9] Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, pages 737 757, 2009. [10] Gergely Neu, András György, Csaba Szepesvári, and András Antos. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 24 (NIPS), pages 1804 1812, 2010. [11] Gergely Neu, András György, and Csaba Szepesvári. The online loop-free stochastic shortestpath problem. In Proceedings of 23rd Conference on Learning Theory (COLT), pages 231 243, 2010. [12] Alexander Zimin and Gergely Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26 (NIPS), pages 1583 1591, 2013. [13] Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5478 5486, 2019. [14] Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 4860 4869, 2020. [15] Aviv Rosenberg and Yishay Mansour. Stochastic shortest path with adversarially changing costs. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI), pages 2936 2942, 2021. [16] Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. In Proceedings of the 34th Conference on Learning Theory (COLT), pages 1180 1215, 2021. [17] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 6995 7004, 2019. [18] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Proceedings of the 33rd Conference on Learning Theory (COLT), pages 2137 2143, 2020. [19] Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020. [20] Andrea Zanette, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1954 1964, 2020. [21] Yining Wang, Ruosong Wang, Simon Shaolei Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. 
In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021. [22] Jiafan He, Dongruo Zhou, and Quanquan Gu. Logarithmic regret for reinforcement learning with linear function approximation. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 4171 4180, 2021. [23] Jiafan He, Heyang Zhao, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for linear Markov decision processes. Ar Xiv preprint, 2212.06132, 2022. [24] Alex Ayoub, Zeyu Jia, Csaba Szepesvári, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 463 474, 2020. [25] Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In Proceedings of the 23th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2010 2020, 2020. [26] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 10746 10756, 2020. [27] Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 12793 12802, 2021. [28] Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S. Du. Improved variance-aware confidence sets for linear bandits and linear mixture MDP. In Advances in Neural Information Processing Systems 34 (Neur IPS), pages 4342 4355, 2021. [29] Dongruo Zhou and Quanquan Gu. Computationally efficient horizon-free reinforcement learning for linear mixture MDPs. Ar Xiv preprint, 2205.11507, 2022. [30] Yue Wu, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal regret for learning infinite-horizon average-reward MDPs with linear function approximation. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3883 3913, 2022. [31] Dongruo Zhou, Quanquan Gu, and Csaba Szepesvári. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Proceedings of the 34th Conference on Learning Theory (COLT), pages 4532 4576, 2021. [32] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 1283 1294, 2020. [33] Jiafan He, Dongruo Zhou, and Quanquan Gu. Near-optimal policy optimization algorithms for learning adversarial linear mixture MDPs. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4259 4280, 2022. [34] Peng Zhao, Long-Fei Li, and Zhi-Hua Zhou. Dynamic regret of online Markov decision processes. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 26865 26894, 2022. [35] Yingjie Fei, Zhuoran Yang, Zhaoran Wang, and Qiaomin Xie. Dynamic regret of policy optimization in non-stationary environments. In Advances in Neural Information Processing Systems 33 (Neur IPS), pages 6743 6754, 2020. [36] Han Zhong, Zhuoran Yang, Zhaoran Wang, and Csaba Szepesvári. Optimistic policy optimization is provably efficient in non-stationary MDPs. Ar Xiv preprint, ar Xiv:2110.08984, 2021. [37] Mark Herbster and Manfred K. Warmuth. Tracking the best expert. 
Machine Learning, 32(2): 151 178, 1998. [38] Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems 25 (NIPS), pages 989 997, 2012. [39] Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Dynamic regret of convex and smooth functions. In Advances in Neural Information Processing Systems 33 (Neur IPS), pages 12510 12520, 2020. [40] Dheeraj Baby and Yu-Xiang Wang. Optimal dynamic regret in exp-concave online learning. In Proceedings of the 34th Conference on Learning Theory (COLT), pages 359 409, 2021. [41] Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Ar Xiv preprint, 2112.14368, 2021. [42] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1704 1713, 2017. [43] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over modelfree approaches. In Proceedings of the 32nd Conference on Learning Theory (COLT), pages 2898 2933, 2019. [44] Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient RL with rich observations via latent state decoding. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 1665 1674, 2019. [45] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. In Advances in Neural Information Processing Systems 34 (Neur IPS), pages 13406 13418, 2021. [46] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, pages 1563 1600, 2010. [47] Ronald Ortner, Pratik Gajane, and Peter Auer. Variational regret bounds for reinforcement learning. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), pages 81 90, 2019. [48] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Reinforcement learning for nonstationary Markov decision processes: The blessing of (more) optimism. Proceedings of the 37th International Conference on Machine Learning (ICML), pages 1843 1854, 2020. [49] Chen-Yu Wei and Haipeng Luo. Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. In Proceedings of the 34th Conference on Learning Theory (COLT), pages 4300 4354, 2021. [50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. Ar Xiv preprint, 1707.06347, 2017. [51] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24 (NIPS), pages 2312 2320, 2011. [52] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems 31 (Neur IPS), pages 1330 1340, 2018. [53] Peng Zhao and Lijun Zhang. Improved analysis for dynamic regret of strongly convex and smooth functions. In Proceedings of the 3rd Conference on Learning for Dynamics and Control (L4DC), pages 48 59, 2021. [54] Liyu Chen, Haipeng Luo, and Chen-Yu Wei. 
Impossible tuning made possible: A new expert algorithm and its applications. In Proceedings of the 34th Conference On Learning Theory (COLT), pages 1216 1259, 2021. [55] Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems 31 (Neur IPS), pages 4868 4878, 2018. A Algorithm with Unknown PT In this part, we present the POWERS-Fix Share-On E algorithm in Algorithm 2 for the case when PT is unknown. The algorithm is based on the POWERS-Fix Share algorithm and we further employ an online ensemble structure to eliminate the algorithmic dependence on PT . In Line 5, each base-learner updates her policy with a specific step size ηi and the meta-learner selects the best policy among the base-learners in Line 6. In the policy evaluation phase (Line 11 - 21), we use the same estimator as in Algorithm 1 to estimate the parameter θ h and construct the confidence set. Algorithm 2 POWERS-Fix Share-On E Input: step size pool H, exploration parameter γ and regularization parameter λ. 1: Initialize {πi 0,h( | s)}H h=1, i [N], s S as uniform distribution on A, {p0,h(s)}H h=1, s S as uniform distribution on N and set {Q0,h( , )}H h=1 as zero function. 2: for k = 1, 2, , K do 3: Receive the initial state sk,1. 4: for h = 1, 2, , H do 5: For all h [H], s S, each base-learner Bi, i [N] updates policy by πi, k,h( | s) πi k 1,h( | s) exp ηi Qk 1,h(s, ) , πi k,h( | s) = (1 γ)πi, k,h( | s) + γπu( | s). 6: Set πk,h( | s) = π i k 1,h k,h ( | s) with i k 1,h(s) = arg maxi [N] Qk 1,h(s, ), πi k 1,h( | s) . 7: Take the action following ak,h πk,h( | sk,h) and transit to the next state sk,h+1. 8: Obtain reward rk,h(sk,h, ak,h) and observe the reward function rk,h( , ). 9: end for 10: Initialize Vk,H+1( ) as a zero function. 11: for h = H, H 1, , 1 do 12: Set Qk,h( , ) rk,h( , ) + bθk,h, ϕVk,h+1( , ) + bβk bΣ 1/2 k,h ϕVk,h+1( , ) 2 13: Set Vk,h( ) Ea πk,h( | ) [Qk,h( , )]. 14: Set the estimated variance Vk,h Vk,h+1 (sk,h, ak,h) as in (11), bonus Ek,h as in (21). max H2/d, Vk,h Vk,h+1 (sk,h, ak,h) + Ek,h . 16: bΣk+1,h bΣk,h + σ 2 k,hϕVk,h+1 (sk,h, ak,h) ϕVk,h+1 (sk,h, ak,h) . 17: bbk+1,h bbk,h + σ 2 k,hϕVk,h+1 (sk,h, ak,h) Vk,h+1 sk h+1 . 18: eΣk+1,h eΣk,h + ϕV 2 k,h+1 (sk,h, ak,h) ϕV 2 k,h+1 (sk,h, ak,h) . 19: ebk+1,h ebk,h + ϕV 2 k,h+1 (sk,h, ak,h) V 2 k,h+1 sk h+1 . 20: bθk+1,h bΣ 1 k+1,hbbk+1,h, eθk+1,h eΣ 1 k+1,hebk+1,h. 21: end for 22: end for B Recovering Tabular Case In this part, we show the result of our algorithm when specialized to the tabular case. Note in the linear case, we adopt the weighted ridge regression to estimate the parameter θ h, that is, we construct the estimator bθk,h by solving the following weighted ridge regression problem: bθk,h = arg min θ Rd j=1 ϕVj,h+1 (sj,h, aj,h) , θ Vj,h+1 (sj,h+1) 2 / σ2 j,h + λ θ 2 2. In the tabular case, we simply set σj,h = 1, and compute ϕVk,h+1 by taking the sample mean of {Vk,h+1(sj,h+1)}j [k 1]. That is, we set it as ϕVj,h+1 (sj,h, aj,h) = X Nk(sj,h, aj,h, s ) Nk(sj,h, aj,h) + λ Vj,h+1(s ), of each (s, a), where Nk counts the number of times each tuple (s, a, s) or (s, a) has been visited up to episode k. Then, we construct the estimator bθk,h by solving the linear regression problem: bθk,h = arg min θ Rd j=1 ϕVj,h+1 (sj,h, aj,h) , θ Vj,h+1 (sj,h+1) 2 + λ θ 2 2. Then, the confidence set b Ck,h is constructed as follows: b Ck,h = θ | bΣ1/2 k,h(θ bθk,h) 2 β . The remaining part of the algorithm is the same as the linear case. 
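A minimal sketch of the count-based estimate described above is given below: it accumulates visit counts $N_k(s, a, s')$ and forms the sample-mean surrogate $\sum_{s'} N_k(s,a,s') / (N_k(s,a) + \lambda) \cdot V(s')$ for $[\mathbb{P}_h V](s, a)$. Names, sizes, and the toy trajectory are illustrative assumptions.

```python
import numpy as np

def tabular_phi_V(counts, V, lam=1.0):
    """Sample-mean surrogate for [P_h V](s, a) in the tabular specialization.

    counts[s, a, s'] stores N_k(s, a, s'); returns an (S, A) array.
    """
    N_sa = counts.sum(axis=-1, keepdims=True)   # N_k(s, a)
    P_hat = counts / (N_sa + lam)               # regularized empirical kernel
    return P_hat @ V

# toy usage: accumulate counts from a few visited transitions (s, a, s')
S, A = 4, 3
counts = np.zeros((S, A, S))
for (s, a, s_next) in [(0, 1, 2), (0, 1, 2), (2, 0, 3), (3, 2, 0)]:
    counts[s, a, s_next] += 1
V = np.linspace(0.0, 1.0, S)
print(tabular_phi_V(counts, V))
```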
The following theorem shows the result of our algorithm in the tabular case. Theorem 6. Set γ = 1/(KH), step size pool H = {ηi = (2i/H) p (log A)/K | i [N]} with N = 1 2 log( K log A) and β = H p S log(d KH/δ), with probability at least 1 δ, it holds D-Regret(K) e O H4(K + PT )(1 + PT ) , where PT = PK k=1 PH h=1 πc k,h πc k 1,h 1, is the path length of the compared policy sequence. Proof. By Lemma 5, we decompose the dynamic regret as follows. D-Regret(K) h=1 Eπc k Qk,h(sh, ), πc k,h( | sh) πk,h( | sh) + MK,H Eπc k[ιk,h(sh, ah)] ιk,h(sk,h, ak,h) , Note that the policy evaluation phase is the same as the one in Fei et al. [35, Algorithm 3], by the estimator bound in Fei et al. [35, Lemmas 7 and 9], we have h=1 Eπc k[ιk,h(sh, ah)] 0, h=1 ιk,h(sk,h, ak,h) O q H4S2AT log2(SAKH/δ) . The remaining proof is the same as the proof of Theorem 3 in Appendix C.7. In this section, we provide the proof of the results in this paper. C.1 Proof of Lemma 1 Proof. By the definition of dynamic regret, we have D-Regret(K) k=1 V πc k k,1(sk,1) k=1 V πk k,1(sk,1) h Qπk k,h(sh, ), πc k,h( | sh) πk,h( | sh) s1 = sk,1 i DKL πc k,h( | sh) πk,h( | sh) DKL πc k,h( | sh) πk+1,h( | sh) , where the equality holds by the performance difference lemma in Lemma 13 and the inequality holds by the one-step descent guarantee in Lemma 14. This finishes the proof. C.2 Proof of Lemma 2 Proof. Note the update rule is π k,h( | s) πk 1,h( | s) exp(η Qπk 1 k 1,h(s, )), πk,h( | s) = (1 γ)π k,h( | s) + γπu( | s) By Lemma 14, for any sh S, h [H], k [K], we have Qπk k,h(sh, ), πc k,h( | sh) πk,h( | sh) DKL πc k,h( | sh) πk,h( | sh) DKL πc k,h( | sh) π k+1,h( | sh) πc k,h(a | sh) log 1 πk,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) where the equality holds by the definition of KL divergence. We decompose the last term as follows: X πc k,h(a | sh) log 1 πk,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) πc k,h(a | sh) log 1 πk,h(a | sh) πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) For the first term, we have X πc k,h(a | sh) log 1 πk,h(a | sh) πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k,h(a | sh) πc k 1,h(a | sh) log 1 πk,h(a | sh) + πc k 1,h(a | sh) log π k,h(a | sh) πk,h(a | sh) By the update rule, we have 1 1/πk,h(a | sh) A/γ and π k,h(a | sh)/πk,h(a | sh) 1/(1 γ). Therefore, we have πc k,h(a | sh) πc k 1,h(a | sh) log 1 πk,h(a | sh) + πc k 1,h(a | sh) log π k,h(a | sh) πk,h(a | sh) πc k,h( | sh) πc k 1,h( | sh) 1 log A γ + log 1 1 γ . (18) Then, the dynamic regret is bounded as follows: D-Regret(K) k=1 V πc k k,1(sk,1) k=1 V πk k,1(sk,1) h Qπk k,h(sh, ), πc k,h( | sh) πk,h( | sh) s1 = sk,1 i πc k,h(a | sh) log 1 πk,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) πc k,h( | sh) πc k 1,h( | sh) 1 log A γ + log 1 1 γ πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) γ + KH log 1 1 γ πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) where the first inequality holds by (16), the second inequality is due to (17) and (18), and the last inequality holds by the definition of PT = PK k=1 PH h=1 πc k,h πc k 1,h 1, . It ends the proof. C.3 Proof of Lemma 3 Proof. Lemma 3 is a simplified version of Chen et al. [54, Lemma 17] and we prove it in our notations below for self-containedness. It is easy to verify that the update rule p (a) p(a) exp(η Q(s, a)) is equivalent to the update p = arg max p (A) η p , Q(s, ) + DKL(p p). 
By the KKT conditions of this constrained problem, there exist $\lambda \in \mathbb{R}$ and $\mu(a) \ge 0$ such that
$$\eta\big(Q(s,a) + \lambda + \mu(a)\big) - \log\frac{p'(a)}{p(a)} = 0 \quad \text{and} \quad \mu(a)p'(a) = 0, \quad \forall a \in \mathcal{A}.$$
The above equations give the closed-form solution $p'(a) = p(a)\exp\big(\eta(Q(s,a) + \lambda + \mu(a))\big)$.

First, we prove that $\mu(a) = 0$ for all $a \in \mathcal{A}$. Indeed, if $\mu(a) > 0$ for some $a$, then by complementary slackness $\mu(a)p'(a) = 0$ we must have $p'(a) = 0$, whereas the closed-form solution gives $p'(a) = p(a)\exp(\eta(Q(s,a) + \lambda + \mu(a))) > 0$ since $p(a) > 0$, a contradiction. We now separately discuss two cases.

Case 1: $\min_{a \in \mathcal{A}} Q(s,a) \neq \max_{a \in \mathcal{A}} Q(s,a)$. In this case, we first show that $\min_{a \in \mathcal{A}}(-Q(s,a)) \le \lambda \le \max_{a \in \mathcal{A}}(-Q(s,a))$. We prove it by contradiction: if $\lambda > \max_{a \in \mathcal{A}}(-Q(s,a))$, then $Q(s,a) + \lambda > 0$ for all $a$, and hence
$$\sum_{a \in \mathcal{A}} p'(a) = \sum_{a \in \mathcal{A}} p(a)\exp\big(\eta(Q(s,a) + \lambda + \mu(a))\big) > \sum_{a \in \mathcal{A}} p(a) = 1,$$
contradicting $p' \in \Delta(\mathcal{A})$. A similar argument holds for the case $\lambda < \min_{a \in \mathcal{A}}(-Q(s,a))$. Thus, we have $\lambda \in [\min_{a \in \mathcal{A}}(-Q(s,a)), \max_{a \in \mathcal{A}}(-Q(s,a))]$. Then, we have $|Q(s,a) + \lambda + \mu(a)| \le \max_{a \in \mathcal{A}} Q(s,a) - \min_{a \in \mathcal{A}} Q(s,a) \le H$. By the condition $\eta H \le 1$, we have $p'(a) \in [\exp(-1)p(a), \exp(1)p(a)] \subseteq [p(a)/4, 4p(a)]$.

Case 2: $\min_{a \in \mathcal{A}} Q(s,a) = \max_{a \in \mathcal{A}} Q(s,a)$. In this case, it is clear that $\lambda = -Q(s,a)$ must hold for all $a \in \mathcal{A}$ to make $p'$ a valid probability distribution. Thus $p'(a) = p(a)$ for all $a \in \mathcal{A}$.

Combining the above two cases finishes the proof.

C.4 Proof of Lemma 4

To prove Lemma 4, we first introduce the following two lemmas. Denote by $P^{\pi_h}_h(s' \mid s) = \sum_{a \in \mathcal{A}} P_h(s' \mid s,a)\pi_h(a \mid s)$ the transition kernel of the MDP in step $h$ under policy $\pi_h$, and recall that $\|\pi - \pi'\|_{1,\infty} = \max_{s \in \mathcal{S}} \|\pi(\cdot \mid s) - \pi'(\cdot \mid s)\|_1$. The first lemma shows that the difference between the state distributions induced by two policies can be bounded by the path length of the policies.

Lemma 6 (Zhao et al. [34, Lemma 7]). For any state distribution $d$, policy pair $\pi$ and $\pi'$, and transition kernel $P$, we have
$$\big\| d^\top P^{\pi_h}_h - d^\top P^{\pi'_h}_h \big\|_1 \le \|\pi_h - \pi'_h\|_{1,\infty}, \quad \forall h \in [H].$$
Proof. Consider the case when $d$ is a delta function on $s$. The difference in the next-state distributions is
$$\big\|P^{\pi_h}_h(\cdot \mid s) - P^{\pi'_h}_h(\cdot \mid s)\big\|_1 = \sum_{s'}\Big|\sum_{a \in \mathcal{A}} P_h(s' \mid s,a)\big(\pi_h(a \mid s) - \pi'_h(a \mid s)\big)\Big| \le \sum_{s'}\sum_{a \in \mathcal{A}} P_h(s' \mid s,a)\big|\pi_h(a \mid s) - \pi'_h(a \mid s)\big| = \sum_{a \in \mathcal{A}} \big|\pi_h(a \mid s) - \pi'_h(a \mid s)\big| \le \|\pi_h - \pi'_h\|_{1,\infty}.$$
Linearity of expectation leads to the result for arbitrary distributions. This finishes the proof.

The second lemma shows that the difference between the state distributions of the same policy starting from different initial distributions can be bounded by the difference between the initial distributions.

Lemma 7 (Zhao et al. [34, Lemma 8]). For any two initial distributions $d$ and $d'$, transition kernel $P$ and policy $\pi$, we have
$$\big\| d^\top P^{\pi_h}_h - d'^\top P^{\pi_h}_h \big\|_1 \le \|d - d'\|_1, \quad \forall h \in [H].$$
Proof. Note the relationship $(d^\top P^{\pi_h}_h)(s') = \sum_{s \in \mathcal{S}} d(s)P^{\pi_h}_h(s' \mid s)$; we have
$$\big\| d^\top P^{\pi_h}_h - d'^\top P^{\pi_h}_h \big\|_1 = \sum_{s'}\Big|\sum_{s} \big(d(s) - d'(s)\big)P^{\pi_h}_h(s' \mid s)\Big| \le \sum_{s}|d(s) - d'(s)|\sum_{s'} P^{\pi_h}_h(s' \mid s) = \sum_{s}|d(s) - d'(s)| = \|d - d'\|_1.$$
This finishes the proof.

Now, we are ready to prove Lemma 4.

Proof of Lemma 4. For any policies $\pi$ and $\pi'$ and any initial state $s_1$, denote by $d_h$ the state distribution of the MDP in step $h$ under policy $\pi$ and by $d'_h$ that under policy $\pi'$, that is, $d_h(s_h) = \mathbb{E}_\pi[\mathbb{1}(s_h) \mid s_1]$ and $d'_h(s_h) = \mathbb{E}_{\pi'}[\mathbb{1}(s_h) \mid s_1]$. We have
$$\|d_h - d'_h\|_1 = \big\|d_{h-1}^\top P^{\pi_h}_h - d_{h-1}'^\top P^{\pi'_h}_h\big\|_1 \le \big\|d_{h-1}^\top P^{\pi_h}_h - d_{h-1}'^\top P^{\pi_h}_h\big\|_1 + \big\|d_{h-1}'^\top P^{\pi_h}_h - d_{h-1}'^\top P^{\pi'_h}_h\big\|_1 \le \|d_{h-1} - d'_{h-1}\|_1 + \|\pi_h - \pi'_h\|_{1,\infty} \le \sum_{i=1}^{h} \|\pi_i - \pi'_i\|_{1,\infty},$$
where the second inequality holds by Lemmas 6 and 7 and the last inequality holds by a recursive calculation. Thus, we have
$$\sum_{k=1}^K\sum_{h=1}^H \big\|\mathbb{E}_{\pi^c_k}[\mathbb{1}(s_h) \mid s_1 = s_{k,1}] - \mathbb{E}_{\pi^c_{k-1}}[\mathbb{1}(s_h) \mid s_1 = s_{k,1}]\big\|_1 \le \sum_{k=1}^K\sum_{h=1}^H\sum_{i=1}^{h} \|\pi^c_{k,i} - \pi^c_{k-1,i}\|_{1,\infty} \le H P_T.$$
This finishes the proof.

C.5 Proof of Theorem 1

Proof.
By Lemma 2, we have D-Regret(K) ηKH3 γ + KH log 1 1 γ πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) Then, the last term can be bounded as follows: πc k 1,h(a | sh) log 1 π k,h(a | sh) πc k,h(a | sh) log 1 π k+1,h(a | sh) a A πc k 1,h(a | sh) log 1 π k,h(a | sh) a A πc k,h(a | sh) log 1 π k+1,h(a | sh) h=1 Eπc k πc k 1 a A πc k 1,h(a | sh) log 1 π k,h(a | sh) a A πc 1,h(a | sh) log 1 π 1,h(a | sh) h=1 Eπc k πc k 1 a A πc k 1,h(a | sh) log 1 π k,h(a | sh) H log A + log 4A h=1 Eπc k πc k 1 [1(sh) | s1 = sk,1] H log A + log 4A γ HPT , (20) where the second inequality holds by the telescoping sum and the third inequality holds because π 1,h = π1,h is the uniform distribution and π k,h(a | s) γ 4A by the multiplicative stability lemma in Lemma 3 and the last inequality holds by Lemma 4. Combining (19) and (20), we have D-Regret(K) ηKH3 H log A + (1 + H) log 4A γ PT + KH log 1 1 γ which finishes the proof. C.6 Proof of Theorem 2 To prove Theorem 2, we first introduce the following key lemma omitted in the main paper due to the space limit, which shows the guarantee of the confidence set. Lemma 8 (Zhou et al. [31, Lemma 5]). Let b Ck,h be defined in (10) and set bβk as d log(1 + k/λ) log (4k2H/δ) + 4 d log 4k2H/δ + Then, with probability at least 1 3δ, we have that simultaneously for all h [H] and k [K], θ h n θ bΣ1/2 k,h(θ bθk,h) 2 bβk o , Vk,h Vk,h+1 (sk,h, ak,h) [Vh Vk,h+1] (sk,h, ak,h) Ek,h, where Ek,h is defined as follows: min n H2, 2H ˇβk bΣ 1/2 k,h ϕVk,h+1 (sk,h, ak,h) 2 o + min n H2, eβk eΣ 1/2 k,h ϕV 2 k,h+1 (sk,h, ak,h) 2 with ˇβk = 8d p log(1 + k/λ) log (4k2H/δ) + 4 d log 4k2H/δ + λB, eβk = 8 p d H4 log (1 + k H4/(dλ)) log (4k2H/δ) + 4H2 log 4k2H/δ + We denote by E the event when the result of Lemma 8 holds, we have Pr(E) 1 3δ. Then we present the following lemma that shows the value function difference can be decomposed into two parts, one is a martingale sequence and the other is the estimation error. Lemma 9. For all k [K], h [H], it holds that Vk,h(sk,h) V πk k,h(sk,h) = Mk,h ,1 + Mk,h ,2 ιk,h (sk,h , ak,h ) , Mk,h,1 = Ea πk( | sk,h)[Qk,h(sk,h, a) Qπk k,h(sk,h, a)] (Qk,h(sk,h, ak,h) Qπk k,h(sk,h, ak,h)), Mk,h,2 = [Ph(Vk,h+1 V πk k,h+1)](sk,h, ak,h) (Vk,h+1(sk,h+1) V πk k,h+1(sk,h+1)). Proof. By the definition Vk,h(sk,h) = Ea πk( | sk,h)[Qk,h(sk,h, a)], we have Vk,h(sk,h) V πk k,h(sk,h) = Ea πk( | sk,h) Qk,h(sk,h, a) Qπk k,h(sk,h, a) = Ea πk( | sk,h) Qk,h(sk,h, a) Qπk k,h(sk,h, a) Qk,h(sk,h, ak,h) Qπk k,h(sk,h, ak,h) + Qk,h(sk,h, ak,h) Qπk k,h(sk,h, ak,h) = Ea πk( | sk,h) Qk,h(sk,h, a) Qπk k,h(sk,h, a) Qk,h(sk,h, ak,h) Qπk k,h(sk,h, ak,h) + [Ph(Vk,h+1 V πk k,h+1)](sk,h, ak,h) ιk,h(sk,h, ak,h) = Ea πk( | sk,h) Qk,h(sk,h, a) Qπk k,h(sk,h, a) Qk,h(sk,h, ak,h) Qπk k,h(sk,h, ak,h) | {z } Mk,h,1 + [Ph(Vk,h+1 V πk k,h+1)](sk,h, ak,h) Vk,h+1(sk,h+1) V πk k,h+1(sk,h+1) | {z } Mk,h,2 + Vk,h+1(sk,h+1) V πk k,h+1(sk,h+1) ιk,h(sk,h, ak,h) where the third equality holds by the fact Qk,h = rk,h+Ph Vk,h+1 ιk,h and Qπk k,h = rk,h+Ph V πk k,h+1. Summing up the above equation from h to H recursively finishes the proof. Note Mk,h,1 is the noise from the stochastic policy and Mk,h,2 is the noise from the state transition, Let Mk,h = Mk,h,1 + Mk,h,2, k [K], h [H], we define two following high probability events: E1 = h [H], h =h Mk,h 4 According to the Azuma-Hoeffding inequality, we have Pr(E1) 1 δ and Pr(E2) 1 δ. Then, we show the model prediction error ιk,h can be upper and lower bounded both. Lemma 10. 
Define prediction error ιk,h = rk,h + Ph Vk,h+1 Qk,h, on the event E, it holds that 2bβk bΣ 1/2 k,h ϕVk,h+1( , ) 2 ιk,h( , ) 0, k [K], h [H]. Proof. First, we prove the left-hand side inequality. By the definition of ιk,h, we have ιk,h(s, a) = Qk,h(s, a) (rk,h + Ph Vk,h+1)(s, a) rk,h(s, a) + bθk,h, ϕVk,h+1(s, a) + bβk bΣ 1/2 k,h ϕVk,h+1(s, a) 2 (rk,h + Ph Vk,h+1)(s, a) = bθk,h θ h, ϕVk,h+1(s, a) + bβk bΣ 1/2 k,h ϕVk,h+1(s, a) 2 2bβk bΣ 1/2 k,h ϕVk,h+1(s, a) 2, where the first inequality holds by the configuration of Qk,h in Algorithm 1, the second equality holds by the definition of linear mixture MDP such that [Ph Vk,h+1](s, a) = ϕVk,h+1(s, a), θ h and the last inequality holds by the configuration of confidence set in Lemma 8. Next, we prove the right-hand side inequality. By the definition of ιk,h, we have ιk,h(s, a) = (rk,h + Ph Vk,h+1)(s, a) Qk,h(s, a) (rk,h + Ph Vk,h+1)(s, a) h rk,h(s, a) + bθk,h, ϕVk,h+1(s, a) + bβk bΣ 1/2 k,h ϕVk,h+1(s, a) 2 = max n bθk,h θ h, ϕVk,h+1(s, a) bβk bΣ 1/2 k,h ϕVk,h+1(s, a) 2, (rk,h + Ph Vk,h+1)(s, a) (H h + 1) o where the first inequality holds by the configuration of Qk,h in Algorithm 1, the second equality holds by the definition of linear mixture MDP such that [Ph Vk,h+1](s, a) = ϕVk,h+1(s, a), θ h and the last inequality holds by the configuration of confidence set in Lemma 8. Next, we show the prediction error can be bounded by the cumulative estimate variance. Lemma 11. Define prediction error ιk,h = rk,h + Ph Vk,h+1 Qk,h, on the event E, it holds that h=1 ιk,h(sk,h, ak,h) 2bβK h=1 σ2 k,h p 2Hd log(1 + K/λ). Proof. By Lemma 10 and the definition of ιk,h = rk,h + Ph Vk,h+1 Qk,h, we have h=1 ιk,h(sk,h, ak,h) h=1 2 min{bβk bΣ 1/2 k,h ϕVk,h+1(sk,h, ak,h) 2, H} h=1 2bβk σk,h min bΣ 1/2 k,h ϕVk,h+1 (sk,h, ak,h) / σk,h 2, 1 h=1 min n bΣ 1/2 k,h ϕVk,h+1 (sk,h, ak,h) / σk,h 2 , 1 o h=1 σ2 k,h p 2Hd log(1 + K/λ), where the second inequality holds by 2bβk σk,h d = H, the third inequality is by Cauchy Schwarz inequality and the last inequality holds by the elliptical potential lemma in Lemma 16. Define the event h=1 [Vh V πk h+1](sk h, ak h) 3(HK + H3 log(1/δ)) , by the law of total variance in Lemma 15, we have Pr(E3) 1 δ. Then, we have the following lemma which bounds the estimated variance of the value function. Lemma 12 (He et al. [33, Lemma 6.5]). On the events E E1 E2 E3, it holds that h=1 σ2 k,h 2H3K/d + 179H2K + 165d3H4 log2 4K2H/δ log2 1 + KH4/λ + 2062d2H5 log2 4K2H/δ log2(1 + K/λ). Now, we are ready to prove Theorem 2. Proof of Theorem 2. On the events E E1 E2 E3, for any h [H], it holds that k=1 Vk,h(sk,h) k=1 V πk k,h(sk,h) h=1 (Mk,h,1 + Mk,h,2 ιk,h(sk,h, ak,h)) H3K log(H/δ) + 2bβK h=1 σ2 k,h p 2Hd log(1 + K/λ) d H4K + d2H3K , where the first equality holds by Lemma 9, the second inequality holds by Lemma 11, and the last inequality holds by Lemma 12. This finishes the proof. C.7 Proof of Theorem 3 Proof. By Lemma 5, we can rewrite the dynamic regret as follows. D-Regret(K) = k=1 V πc k k,1(sk,1) k=1 V πk k,1(sk,1) h=1 Eπc k[ Qk,h(sh, ), πc k,h( | sh) πk,h( | sh) ] h=1 (Eπc k[ιk,h(sh, ah)] ιk,h(sk,h, ak,h)) + MK,H. (22) By Lemma 10, we have ιk,h(s, a) 0 for any k [K], h [H], s S, a A. Thus, we have h=1 Eπc k[ιk,h(sh, ah)] 0. (23) By Theorem 2, we have h=1 ( ιk,h(sk,h, ak,h)) + MK,H e O p d H4K + d2H3K . (24) It remains to bound the first term. Note our algorithm is indeed updated based on the estimated action-value function Qk,h. By Theorem 1, set γ = 1/KH and note that log(1/(1 γ)) γ/(1 γ) for all γ > 0. 
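For completeness, we spell out this elementary inequality (a short derivation added for readability, assuming $0 < \gamma < 1$):
$$\log\frac{1}{1-\gamma} = \log\Big(1 + \frac{\gamma}{1-\gamma}\Big) \le \frac{\gamma}{1-\gamma},$$
using $\log(1+x) \le x$ for $x > -1$. In particular, with $\gamma = 1/(KH)$ and $KH \ge 2$, this yields $KH\log\frac{1}{1-\gamma} \le \frac{KH\gamma}{1-\gamma} = \frac{1}{1-1/(KH)} \le 2$, which is the constant absorbed in the bounds below.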
Then, we have
$$\sum_{k=1}^K\sum_{h=1}^H \mathbb{E}_{\pi^c_k}\big[\langle Q_{k,h}(s_h,\cdot), \pi^c_{k,h}(\cdot \mid s_h) - \pi_{k,h}(\cdot \mid s_h)\rangle\big] \le \eta K H^3 + \frac{H\log A + (1+H)P_T\log(4AKH)}{\eta} + 2.$$
It is clear that the optimal step size is $\eta^* = \sqrt{(P_T + \log A)/(KH^2)}$. Note that our step size is set as $\eta = \min\{\sqrt{(P_T + \log A)/K}, 1\}/H$ to ensure $\eta \le 1/H$. We consider the following two cases:

Case 1: $\eta^* \le 1/H$. In this case, our step size is set as $\eta = \sqrt{(P_T + \log A)/(KH^2)}$. We have
$$\eta K H^3 + \frac{H\log A + (1+H)P_T\log(4AKH)}{\eta} + 2 \le \tilde{\mathcal{O}}\big(\sqrt{KH^4(1+P_T)}\big).$$

Case 2: $\eta^* > 1/H$. In this case, our step size is set as $\eta = 1/H$. Therefore, we have
$$\eta K H^3 + \frac{H\log A + (1+H)P_T\log(4AKH)}{\eta} + 2 \le \tilde{\mathcal{O}}(H^2 P_T),$$
where the inequality holds because $\eta^* > 1/H$ implies $P_T + \log A > K$, so $P_T \gtrsim K$ in this case.

Combining these two cases, we have
$$\sum_{k=1}^K\sum_{h=1}^H \mathbb{E}_{\pi^c_k}\big[\langle Q_{k,h}(s_h,\cdot), \pi^c_{k,h}(\cdot \mid s_h) - \pi_{k,h}(\cdot \mid s_h)\rangle\big] \le \tilde{\mathcal{O}}\big(\sqrt{H^4(K+P_T)(1+P_T)}\big). \quad (25)$$
Combining (22), (23), (24) and (25), we obtain
$$\text{D-Regret}(K) \le \tilde{\mathcal{O}}\Big(\sqrt{dH^4K + d^2H^3K} + \sqrt{H^4(K+P_T)(1+P_T)}\Big).$$
This finishes the proof.

C.8 Proof of Theorem 4

Proof. At a high level, we prove this lower bound by noting that optimizing the dynamic regret of linear mixture MDPs is harder than both (i) optimizing the static regret of linear mixture MDPs with the unknown transition kernel, and (ii) optimizing the dynamic regret of linear mixture MDPs with the known transition kernel. Thus, we can consider the lower bounds of these two problems separately and combine them to obtain the lower bound of the dynamic regret of linear mixture MDPs with the unknown transition kernel. We present the details below.

First, we consider the lower bound for the static regret of adversarial linear mixture MDPs with the unknown transition kernel. From the lower bound in He et al. [33, Theorem 5.3], we have the following lower bound in this case, since the dynamic regret is no smaller than the static regret:
$$\text{D-Regret}(K) \ge \Omega\big(\sqrt{d^2H^3K}\big). \quad (26)$$
Then, we consider the lower bound for the dynamic regret of adversarial linear mixture MDPs with the known transition kernel. We note that Zimin and Neu [12] show that the lower bound of the static regret for adversarial episodic loop-free SSP with the known transition kernel is $\Omega(H\sqrt{K\log(SA)})$. We utilize this lower bound to establish our result, as the episodic loop-free SSP is a special case of linear mixture MDPs with $d = S^2A$. We consider two cases:

Case 1: $\Gamma \le 2H$. In this case, we can directly utilize the established lower bound of static regret as a natural lower bound of dynamic regret,
$$\text{D-Regret}(K) \ge \Omega\big(H\sqrt{K\log(SA)}\big). \quad (27)$$

Case 2: $\Gamma > 2H$. Without loss of generality, we assume $L = \Gamma/2H$ divides $K$ and split the whole horizon of $K$ episodes into $L$ pieces equally. Next, we construct a special policy sequence such that the policy sequence is fixed within each piece and only changes at the split points. Since the sequence changes at most $L - 1 \le \Gamma/2H$ times and the path length of the policy sequence at each change point is at most $2H$, the total path length over the $K$ episodes does not exceed $\Gamma$. As a result, we have
$$\text{D-Regret}(K) \ge L \cdot H\sqrt{(K/L)\log(SA)} = H\sqrt{KL\log(SA)} \ge \Omega\big(\sqrt{KH\Gamma\log(SA)}\big). \quad (28)$$
Combining (27) and (28), we have the following lower bound for the dynamic regret of adversarial linear mixture MDPs with the known transition kernel,
$$\text{D-Regret}(K) \ge \Omega\big(\max\{H\sqrt{K\log(SA)}, \sqrt{KH\Gamma\log(SA)}\}\big) \ge \Omega\big(\sqrt{KH(H+\Gamma)\log(SA)}\big), \quad (29)$$
where the last inequality holds by $\max\{a,b\} \ge (a+b)/2$.

Combining the two lower bounds (26) and (29), we have the lower bound of the dynamic regret of adversarial linear mixture MDPs with the unknown transition kernel,
$$\text{D-Regret}(K) \ge \Omega\big(\max\{\sqrt{d^2H^3K}, \sqrt{KH(H+\Gamma)\log(SA)}\}\big) \ge \Omega\big(\sqrt{KH(H+\Gamma)\log(SA)}\big).$$
This finishes the proof.

C.9 Proof of Theorem 5

Proof. By Lemma 5, we can rewrite the dynamic regret as follows.
D-Regret(K) = k=1 V πc k k,1(sk,1) k=1 V πk k,1(sk,1) h=1 Eπc k[ Qk,h(sh, ), πc k,h( | sh) πk,h( | sh) ] h=1 (Eπc k[ιk,h(sh, ah)] ιk,h(sk,h, ak,h)) + MK,H. By Lemma 10, we have ιk,h(s, a) 0 for any k [K], h [H], s S, a A. Thus, we have h=1 Eπc k[ιk,h(sh, ah)] 0. By Theorem 2, we have h=1 ( ιk,h(sk,h, ak,h)) + MK,H e O p d H4K + d2H3K . It remains to bound the first term. We decompose this term as follows. h Qπk k,h(sh, ), πc k,h( | sh) πk,h( | sh) i h Qπk k,h(sh, ), πc k,h( | sh) πi k,h( | sh) i | {z } base-regret h Qπk k,h(sh, ), πi k,h( | sh) πk,h( | sh) i | {z } meta-regret where the decomposition holds for any base-learner i N. Next, we bound the two terms separately. Upper bound of base regret. From Theorem 1, we have base-regret ηi KH3 H log A + (1 + H) log A γ PT + KH log 1 1 γ Set γ = 1/KH and note that log(1/(1 γ)) γ/(1 γ) for all γ > 0. Then, we have base-regret ηi KH3 ηi (H log A + (1 + H)PT log(KHA) + 2) ηi (log A + PT log(KHA)) . It is clear that the optimal learning rate η = p 4(log A + PT log(KHA))/(KH2). By the definition PT = PK k=1 PH h=1 πc k,h πc k 1,h 1, , it holds that 0 PT 2KH. Therefore, the range of the optimal learning rate is KH2 , and ηmax = 4 log A + 8KH log(KHA) From the construction of the step size pool H = {ηi = (2i/H) p (log A)/K | i [N]} with N = 1 2 log( K log A) , we know that the step size therein is monotonically increasing, in particular KH2 , and ηN = 1 In the following, we consider two cases: Case 1: η [η1, ηN]. In this case, we ensure there exists i N such that ηi η 2ηi . Note the decomposition in (30) holds for any base-learner. Therefore, we choose the base-learner whose step size is ηi and have base-regret ηi KH3 ηi (log A + PT log(KHA)) η (log A + PT log(KHA)) KH4 (log A + PT log(KHA)), where the second inequality holds by the condition that ηi η 2ηi and the last equality holds by substituting η = p 4(log A + PT log(KHA))/(KH2). Case 2: η > ηN. In this case, we know that 4(log A + PT log(KHA)) > K. Therefore, we choose the base-learner whose step size is ηN and have base-regret ηNKH3 ηN (log A + PT log(KHA)) 2 + 2H2 (log A + PT log(KHA)) 4H2 (log A + PT log(KHA)) , where the last inequality holds by the condition that 4(log A + PT log(KHA)) > K. Summing over the two upper bounds yields base-regret 3 p KH4 (log A + PT log(KHA)) + 4H2 (log A + PT log(KHA)) H4(K + log A + PT log(KHA))(log A + PT log(KHA)) H4(K + PT )(1 + PT )), (31) where the inequality holds by a + 2(a + b), a, b 0. Upper bound of meta regret. For meta-regret, we have meta-regret = h=1 Eπc k Qk,h(sh, ), πi k,h( | sh) πk,h( | sh) h=1 Eπc k ei(sh) pi k,h(sh), Qk,h(sh, ) πi k,h( | sh) h ei k,h(sh) pi k,h(sh), Qk,h(sh, ) πi k,h( | sh) i h ei k,h(sh) ei k 1,h(sh), Qk,h(sh, ) πi k,h( | sh) i h ei k,h(sh) ei k 1,h(sh) 1 i 2HST , (32) where the first inequality holds by the definition that i k,h = arg maxi [N] Qk,h(sh, ), πi k,h( | sh) , the second inequality holds due to pi k,h(sh) = ei k 1,h(sh), and the last equality holds by the definition ST = PK k=1 PH h=1 Eπc k 1[i k,h(sh) = i k 1,h(sh)]. Combining (31) and (32), by a + 2(a + b), a, b 0, we have D-Regret(K) e O p d H4K + d2H3K + q H4(K + PT )(1 + PT ) + H2S2 T . This finishes the proof. D Supporting Lemmas In this section, we introduce the supporting lemmas used in the proofs. First, we introduce the performance difference lemma which connects the difference between two policies to the difference between their expected total rewards through the Q-function. Lemma 13 (Cai et al. [32, Lemma 3.2]). 
For any policies $\pi, \pi' \in \Delta(\mathcal{A} \mid \mathcal{S}, H)$, it holds that
$$V^{\pi'}_{k,1}(s_{k,1}) - V^{\pi}_{k,1}(s_{k,1}) = \mathbb{E}_{\pi'}\Big[\sum_{h=1}^H \big\langle Q^{\pi}_{k,h}(s_h,\cdot),\, \pi'_h(\cdot \mid s_h) - \pi_h(\cdot \mid s_h)\big\rangle \,\Big|\, s_1 = s_{k,1}\Big].$$
Then, we introduce the following lemma, which shows the one-step descent guarantee.

Lemma 14 (Cai et al. [32, Lemma 3.3]). For any distributions $p^*, p \in \Delta(\mathcal{A})$, state $s \in \mathcal{S}$, and function $Q: \mathcal{S} \times \mathcal{A} \rightarrow [0, H]$, it holds for $p' \in \Delta(\mathcal{A})$ with $p'(\cdot) \propto p(\cdot)\exp(\eta Q(s,\cdot))$ that
$$\big\langle Q(s,\cdot), p^*(\cdot) - p(\cdot)\big\rangle \le \eta H^2/2 + \eta^{-1}\big(D_{\mathrm{KL}}(p^*(\cdot)\,\|\,p(\cdot)) - D_{\mathrm{KL}}(p^*(\cdot)\,\|\,p'(\cdot))\big).$$
Next, we introduce the law of total variance, which bounds the variance of the value function.

Lemma 15 (Jin et al. [55, Lemma C.5]). With probability at least $1 - \delta$, it holds that
$$\sum_{k=1}^K\sum_{h=1}^H \big[\mathbb{V}_h V^{\pi_k}_{h+1}\big](s_{k,h}, a_{k,h}) \le 3\big(HK + H^3\log(1/\delta)\big).$$
Finally, we introduce the elliptical potential lemma, which is a key lemma in online linear regression.

Lemma 16 (Abbasi-Yadkori et al. [51, Lemma 11]). Let $\{x_t\}_{t \ge 1}$ be a sequence in $\mathbb{R}^d$, let $V_0 = \lambda I$ and define $V_t = V_0 + \sum_{i=1}^t x_i x_i^\top$. If $\|x_i\|_2 \le L$ for all $i \in \mathbb{Z}_+$, then for each $t \in \mathbb{Z}_+$,
$$\sum_{i=1}^t \min\big\{1, \|x_i\|^2_{V_{i-1}^{-1}}\big\} \le 2d\log\frac{d\lambda + tL^2}{d\lambda}.$$
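As an illustrative sanity check (ours, not from the paper), the following minimal NumPy sketch compares the left-hand side of Lemma 16 with its logarithmic bound on a toy random sequence; the dimensions, sequence length, and variable names are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam, L, T = 5, 1.0, 1.0, 2000

V = lam * np.eye(d)   # V_0 = lambda * I
lhs = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x = L * x / np.linalg.norm(x)            # enforce ||x||_2 <= L
    quad = x @ np.linalg.solve(V, x)         # ||x||^2 in the V^{-1} norm (pre-update V)
    lhs += min(1.0, quad)
    V += np.outer(x, x)                      # V_t = V_{t-1} + x x^T

rhs = 2 * d * np.log((d * lam + T * L**2) / (d * lam))
print(f"LHS = {lhs:.2f} <= RHS = {rhs:.2f}")
assert lhs <= rhs
```

The bound holds deterministically for any sequence satisfying the norm constraint, so the assertion passes regardless of the random draw.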