# Best of Both Worlds Policy Optimization

Christoph Dann¹, Chen-Yu Wei², Julian Zimmert¹

¹Google Research. ²MIT Institute for Data, Systems, and Society. Correspondence to: Chen-Yu Wei. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract: Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built a theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus, and the learning rates, one can achieve a more favorable $\mathrm{polylog}(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. We then show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.

1. Introduction

Policy optimization methods have seen great empirical success in various domains (Schulman et al., 2017; Levine & Koltun, 2013). An appealing property of policy optimization methods is their local-search nature, which lends itself to an efficient implementation because a search over the whole MDP is avoided. However, this property also makes it difficult to obtain global optimality guarantees for these algorithms, and a large portion of the literature postulates strong and often unrealistic assumptions to ensure global exploration (see e.g., Abbasi-Yadkori et al., 2019; Agarwal et al., 2020b; Neu & Olkhovskaya, 2021; Wei et al., 2021). Recently, the need for extra assumptions has been overcome by adding exploration bonuses to the update (Cai et al., 2020; Shani et al., 2020; Agarwal et al., 2020a; Zanette et al., 2020; Luo et al., 2021). These works demonstrate an additional robustness property of policy optimization, which is able to handle adversarial losses or some level of corruption. Luo et al. (2021) and Chen et al. (2022) even managed to obtain the optimal $\sqrt{T}$ regret.

However, when the losses are in fact stochastic, the $\sqrt{T}$ minimax regret is often overly pessimistic, and $\log(T)$ regret with problem-dependent factors is the optimal rate (Lai et al., 1985). Recently, Jin et al. (2021) obtained a best-of-both-worlds algorithm that automatically adapts to the nature of the environment, a method which relies on FTRL with a global regularizer over the occupancy measure. In this work, we show that by properly assigning the bonus and tuning the learning rates, policy optimization can also achieve the best of both worlds, which gives a more computationally favorable solution than Jin et al. (2021) for the same setting. Specifically, we show that policy optimization with a Tsallis entropy or Shannon entropy regularizer achieves $\sqrt{T}$ regret in the adversarial regime and $\mathrm{polylog}(T)$ regret in the stochastic regime. The $\sqrt{T}$ can further be improved to $\sqrt{L^\star}$ if the transition is known and a log-barrier regularizer is used, where $L^\star$ is the cumulative loss of the best policy.
Though the corresponding results in multi-armed bandits have been well studied, new challenges arise in the MDP setting which require a non-trivial design of the exploration bonus and the learning rate schedule. The techniques we develop to address these issues constitute the main contribution of this work.

2. Related Work

For multi-armed bandits, the question of whether a single algorithm can achieve near-optimal regret bounds in both the adversarial and the stochastic regimes was first asked by Bubeck & Slivkins (2012). A series of follow-up works refined the bounds through different techniques (Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019; Ito, 2021). One of the most successful approaches was developed by Wei & Luo (2018); Zimmert & Seldin (2019); Ito (2021), who demonstrated that a simple Online Mirror Descent (OMD) or Follow the Regularized Leader (FTRL) algorithm, originally designed only for the adversarial case, is able to achieve the best of both worlds. This approach has been adopted in a wide range of problems including semi-bandits (Zimmert et al., 2019), graph bandits (Erez & Koren, 2021; Ito et al., 2022), partial monitoring (Tsuchiya et al., 2022), multi-armed bandits with switching costs (Rouyer et al., 2021; Amir et al., 2022), tabular MDPs (Jin & Luo, 2020; Jin et al., 2021), and others. Though developed under a similar framework, each of these works addresses new challenges that arise in its specific setting.

Previous works that achieve the best of both worlds in tabular MDPs (Jin & Luo, 2020; Jin et al., 2021) are based on FTRL over the occupancy measure space. This approach has several shortcomings, making it less favorable in practice. First, the feasible set of occupancy measures depends on the transition kernel, so the extension to a model-free version is difficult. Second, since the occupancy measure space is a general convex set that may change over time as the learner gains more knowledge about transitions, it requires solving a different convex program in each round. In contrast, policy optimization is easier to extend to settings where transitions are hard to learn, and it is computationally simple: in tabular MDPs, it is equivalent to running an individual multi-armed bandit algorithm on each state.

Due to its local-search nature, exploration under policy optimization is non-trivial, especially when coupled with bandit feedback and adversarial losses. In a simpler setting where the learner receives full-information loss feedback, He et al. (2022); Cai et al. (2020) showed $\sqrt{T}$ regret for linear mixture MDPs using policy optimization. In another simpler setting where the loss is stochastic, Agarwal et al. (2020a); Zanette et al. (2021) showed $\mathrm{poly}(1/\epsilon)$ sample complexity for linear MDPs. The work by Shani et al. (2020) first studied policy optimization with bandit feedback and adversarial losses, and obtained a $T^{2/3}$ regret bound for tabular MDPs. Luo et al. (2021) improved it to the optimal $\sqrt{T}$, and provided extensions to linear-Q and linear MDPs. In this work, we demonstrate another strength of policy optimization by showing a best-of-both-worlds regret bound in tabular MDPs. To our knowledge, this is also the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization. We also note that a first-order bound has been shown for adversarial MDPs by Lee et al. (2020).
Their algorithm is based on regularization over the occupancy measure, and does not rely on knowledge of the transition kernel. On the other hand, our first-order bound currently relies on the learner knowing the transitions. Whether it can be achieved under unknown transitions is an open question.

3. Notation and Setting

Notation. For $f \in \mathbb{R}$ and $g \in \mathbb{R}_+$, we use $f \lesssim g$ or $f = O(g)$ to mean that $f \leq c\,g$ for some absolute constant $c > 0$. $[x]_+ = \max\{x, 0\}$. $\Delta(X)$ denotes the probability simplex over the set $X$.

3.1. MDP setting

We consider episodic fixed-horizon MDPs. Let $T$ be the total number of episodes. The MDP is described by a tuple $(\mathcal{S}, \mathcal{A}, H, P, \{\ell_t\}_{t=1}^T)$, where $\mathcal{S}$ is the state set, $\mathcal{A}$ is the action set, $H$ is the horizon length, $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition kernel so that $P(s'|s,a)$ is the probability of moving to state $s'$ after taking action $a$ in state $s$, and $\ell_t: \mathcal{S} \times \mathcal{A} \to [0,1]$ is the loss function in episode $t$. We define $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$, which are both assumed to be finite. Without loss of generality, we assume $A \leq T$.

A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ describes how the player interacts with the MDP, with $\pi(\cdot|s) \in \Delta(\mathcal{A})$ being the action distribution the player uses to select actions in state $s$. If for all $s$, $\pi(\cdot|s)$ is only supported on one action, we call $\pi$ a deterministic policy, and we abuse the notation $\pi(s) \in \mathcal{A}$ to denote the action $\pi$ chooses in state $s$. Without loss of generality, we assume that the state space can be partitioned into $H+1$ disjoint layers $\mathcal{S} = \mathcal{S}_0 \cup \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_H$, and transitions are only possible from one layer to the next (i.e., $P(\cdot|s,a)$ is only supported on $\mathcal{S}_{h+1}$ if $s \in \mathcal{S}_h$).¹ Without loss of generality, we assume that $\mathcal{S}_0 = \{s_0\}$ (initial state) and $\mathcal{S}_H = \{s_H\}$ (terminal state). Also, since there is at least one state on each layer, it holds that $H \leq S$. Let $h(s)$ denote the layer where state $s$ lies.

The environment decides $P$ and $\{\ell_t\}_{t=1}^T$ ahead of time. In episode $t$, the learner decides on a policy $\pi_t$. Starting from the initial state $s_{t,0} = s_0$, the learner repeatedly draws action $a_{t,h}$ from $\pi_t(\cdot|s_{t,h})$ and transitions to the next state $s_{t,h+1} \in \mathcal{S}_{h+1}$ following $s_{t,h+1} \sim P(\cdot|s_{t,h}, a_{t,h})$, until it reaches the terminal state $s_{t,H} = s_H$. The learner receives $\{\ell_t(s_{t,h}, a_{t,h})\}_{h=0}^{H-1}$ at the end of episode $t$.

For a policy $\pi$ and a loss function $\ell$, we define $V^\pi(s_H; \ell) = 0$ and recursively define
$$Q^\pi(s,a;\ell) = \ell(s,a) + \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[V^\pi(s';\ell)\right], \qquad V^\pi(s;\ell) = \sum_{a \in \mathcal{A}} \pi(a|s)\, Q^\pi(s,a;\ell), \qquad (1)$$
which are the standard state-action value function and state value function under policy $\pi$ and loss function $\ell$. The learner's regret with respect to a policy $\pi$ is defined as
$$\mathrm{Reg}(\pi) = \sum_{t=1}^T \left(V^{\pi_t}(s_0;\ell_t) - V^{\pi}(s_0;\ell_t)\right).$$

¹This setting follows previous work on adversarial MDPs (Jin et al., 2020; Luo et al., 2021).

3.2. Known and unknown transition

Following Jin & Luo (2020); Jin et al. (2021), we consider both the scenario where the learner knows the transition kernel $P$ and the scenario where it is unknown. The empirical transition is defined by
$$\widehat{P}_t(s'|s,a) = \frac{n_t(s,a,s')}{n_t(s,a)},$$
where $n_t(s,a)$ is the number of visits to $(s,a)$ prior to episode $t$, and $n_t(s,a,s')$ is the number of visits to $s'$ right after visiting $(s,a)$, prior to episode $t$. If $n_t(s,a) = 0$, we define $\widehat{P}_t(\cdot|s,a)$ to be uniform over the states on layer $h(s)+1$. In the unknown transition case, we define the confidence set of the transition:
$$\mathcal{P}_t = \left\{ \widetilde{P} \,:\, \forall h,\ (s,a) \in \mathcal{S}_h \times \mathcal{A},\ \ \widetilde{P}(\cdot|s,a) \in \Delta(\mathcal{S}_{h+1}),\ \ \left|\widetilde{P}(s'|s,a) - \widehat{P}_t(s'|s,a)\right| \leq 2\sqrt{\frac{\widehat{P}_t(s'|s,a)\,\iota}{n_t(s,a)}} + \frac{14\iota}{3\, n_t(s,a)} \right\} \qquad (2)$$
where $\iota = \ln(SAT/\delta)$. As shown in Jin & Luo (2020), $P \in \bigcap_{t=1}^T \mathcal{P}_t$ with probability at least $1 - 4\delta$.
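To make the construction of $\mathcal{P}_t$ in (2) concrete, the following is a minimal sketch of the empirical transition estimate and the per-entry confidence width, assuming a dictionary-of-counters representation; the class and method names (`TransitionConfidence`, `width`, and so on) are illustrative and not from the paper.

```python
import math
from collections import defaultdict

class TransitionConfidence:
    """Empirical transition estimate and confidence widths, following (2)."""

    def __init__(self, S, A, T, delta):
        self.iota = math.log(S * A * T / delta)   # iota = ln(SAT / delta)
        self.n_sa = defaultdict(int)              # n_t(s, a): visits to (s, a)
        self.n_sas = defaultdict(int)             # n_t(s, a, s'): visits to s' right after (s, a)

    def update(self, s, a, s_next):
        """Record one observed transition (s, a) -> s_next."""
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def p_hat(self, s, a, s_next, next_layer):
        """Empirical probability; uniform over the next layer if (s, a) has never been visited."""
        n = self.n_sa[(s, a)]
        if n == 0:
            return 1.0 / len(next_layer)
        return self.n_sas[(s, a, s_next)] / n

    def width(self, s, a, s_next, next_layer):
        """Half-width of the confidence interval around p_hat used to define P_t in (2)."""
        n = self.n_sa[(s, a)]
        if n == 0:
            return 1.0   # unvisited (s, a): treat the constraint as vacuous in this sketch
        p = self.p_hat(s, a, s_next, next_layer)
        return 2.0 * math.sqrt(p * self.iota / n) + 14.0 * self.iota / (3.0 * n)
```

A kernel $\widetilde{P}$ then belongs to $\mathcal{P}_t$ if, for every $(s,a)$ on layer $h$, $\widetilde{P}(\cdot|s,a)$ is supported on $\mathcal{S}_{h+1}$ and each entry deviates from `p_hat` by at most `width`.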
Throughout the paper, we use $\delta = \frac{1}{T^3}$.

For an arbitrary transition kernel $\widetilde{P}$, define $\mu^{\widetilde{P},\pi}(s,a) = \sum_{h=0}^{H-1} \Pr(s_h = s, a_h = a \mid \pi, \widetilde{P})$, where $\Pr(\cdot \mid \pi, \widetilde{P})$ denotes the probability measure induced by policy $\pi$ and transition kernel $\widetilde{P}$. Furthermore, define $\mu^{\widetilde{P},\pi}(s) = \sum_a \mu^{\widetilde{P},\pi}(s,a)$. We write $\mu^\pi(s) = \mu^{P,\pi}(s)$ and $\mu^\pi(s,a) = \mu^{P,\pi}(s,a)$ where $P$ is the true transition. Define the upper and lower confidence measures as $\overline{\mu}^\pi_t(s) = \max_{\widetilde{P} \in \mathcal{P}_t} \mu^{\widetilde{P},\pi}(s)$ and $\underline{\mu}^\pi_t(s) = \min_{\widetilde{P} \in \mathcal{P}_t} \mu^{\widetilde{P},\pi}(s)$. Finally, define $V^{\widetilde{P},\pi}(s;\ell)$ and $Q^{\widetilde{P},\pi}(s,a;\ell)$ analogously to (1), with the transition kernel replaced by $\widetilde{P}$.

3.3. Adversarial versus stochastic regimes

We analyze our algorithm in two regimes: the adversarial regime and the stochastic regime. In both regimes, the transition $P$ is fixed throughout all episodes. In the adversarial regime, the loss functions $\{\ell_t\}_{t=1}^T$ are determined arbitrarily ahead of time. In the stochastic regime, the $\ell_t$ are generated randomly, and there exist a deterministic policy $\pi^\star$, a gap function $\Delta: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{\geq 0}$, and $\{\lambda_t(\pi)\}_{t,\pi} \subset \mathbb{R}$ such that for any policy $\pi$ and any $t$,
$$\mathbb{E}\left[V^\pi(s_0;\ell_t) - V^{\pi^\star}(s_0;\ell_t)\right] \geq \sum_{s} \sum_{a \neq \pi^\star(s)} \mu^\pi(s,a)\,\Delta(s,a) - \lambda_t(\pi).$$
If $\lambda_t(\pi) \leq 0$ for all $\pi$, the condition above certifies that $\pi^\star$ is the optimal policy in episode $t$, and every time $\pi$ visits state $s$ and chooses an action $a \neq \pi^\star(s)$, the incurred regret against $\pi^\star$ is at least $\Delta(s,a)$. The amount $[\lambda_t(\pi)]_+$ thus quantifies how much the condition above is violated. The stochastic regime captures the standard RL setting (i.e., $\{\ell_t\}$ are i.i.d.) with $\lambda_t(\pi) = 0$ and $\Delta(s,a) = \mathbb{E}\left[Q^{\pi^\star}(s,a;\ell_t) - V^{\pi^\star}(s;\ell_t)\right]$. Define $\Delta_{\min} = \min_s \min_{a \neq \pi^\star(s)} \Delta(s,a)$. Also, define $C = \mathbb{E}\left[\sum_{t=1}^T \lambda_t(\pi_t)\right]_+$ and $C(\pi) = \sum_{t=1}^T \lambda_t(\pi)$.

4. Main Results and Techniques Overview

Our main results with Tsallis entropy and log-barrier regularizers are the following (see Section 6 and Appendix H for results with Shannon entropy):

Theorem 4.1. Under known transitions, Algorithm 1 with the Tsallis entropy regularizer ensures for any $\pi$,
$$\mathrm{Reg}(\pi) \lesssim \sqrt{H^3 S A T \ln(T)} + \mathrm{poly}(H,S,A)\ln(T)$$
in the adversarial regime, and
$$\mathrm{Reg}(\pi) \lesssim U + \sqrt{U C} + \mathrm{poly}(H,S,A)\ln(T) \qquad (3)$$
in the stochastic regime, where $U = \sum_s \sum_{a \neq \pi^\star(s)} \frac{H^2 \ln(T)}{\Delta(s,a)}$.

Our bounds in both regimes are similar to those of Jin et al. (2021) up to the definition of $U$, under their parameter choice $\gamma = \frac{1}{H}$ (tuning $\gamma$ trades their bounds between the two regimes; see their Appendix A.3). Compared with our definition of $U$, theirs involves an additional additive term $\frac{\mathrm{poly}(H)\,S \ln(T)}{\Delta_{\min}}$ even under the assumption that the optimal action is unique on all states.²

Theorem 4.2. Under unknown transitions, Algorithm 1 with the Tsallis entropy regularizer ensures for any $\pi$,
$$\mathrm{Reg}(\pi) \lesssim \sqrt{H^4 S^2 A T \ln(T)\,\iota} + \mathrm{poly}(H,S,A)\ln(T)\,\iota$$
in the adversarial regime, and
$$\mathrm{Reg}(\pi) \lesssim U + \sqrt{U\,(C + C(\pi))} + \mathrm{poly}(H,S,A)\ln(T)\,\iota \qquad (4)$$
in the stochastic regime, where $U = \frac{H^4 S^2 A \ln(T)\,\iota}{\Delta_{\min}}$ and $\iota = \ln(SAT)$.

²A lower bound in Xu et al. (2021) shows that an $S/\Delta_{\min}$ dependence is inevitable even when the transition is known. However, this lower bound only holds when there exist multiple optimal actions on $\Omega(S)$ of the states, while our gap-dependent bound is finite only when the optimal action is unique on all states. Therefore, our upper bound does not violate their lower bound.

In Jin et al. (2021), for the stochastic case under unknown transitions, a guarantee similar to (4) is proven only for $\pi = \pi^\star$, with the case of general $\pi$ left open. We generalize their result by resolving some technical difficulties in their analysis. Overall, our bound in the stochastic regime improves that of Jin et al.
(2021), and the bound in the adversarial regime matches that of Luo et al. (2021). Notice that comparing (4) with (3), the bound under unknown transition involves an additional term C(π). It remains open whether it can be removed. Finally, we provide a first-order best-of-both-world result under known transition. Theorem 4.3. Under known transitions, Algorithm 1 with log barrier regularizer ensures for any π, Reg(π) v u u t H2SA t=1 V π(s0; ℓt) ln2(T) + poly(H, S, A) ln2(T) in the adversarial regime, and UC + poly(H, S, A) ln2(T) in the stochastic regime, where U = P a =π (s) H2 ln2(T ) In the next two subsections, we overview the techniques we used and challenges we faced in obtaining our results. 4.1. Exploration bonus for policy optimization In the tabular case, a policy optimization algorithm can be viewed as running an individual bandit algorithm on each state. Our algorithm is built upon the policy optimization framework developed by Luo et al. (2021), who achieve near-optimal worst-case regret in adversarial MDPs. Their key idea is summarized in the next lemma. Lemma 4.4 (Lemma B.1 of Luo et al. (2021)). Suppose that for some {bt}T t=1 and {Pt}T t=1, where each bt : S R 0 is a non-negative bonus function and each Pt is a set of transitions, it holds that Bt(s, a) = bt(s)+ 1 + 1 max e P Pt Es e P ( |s,a),a πt( |s ) [Bt(s , a )]. (5) Also, suppose that the following holds for a policy π and a function Xπ : S R: a (πt(a|s) π(a|s)) (Qπt(s, a; ℓt) Bt(s, a)) t=1 bt(s) + 1 a πt(a|s)Bt(s, a) Then Reg(π) is upper bounded by s µπ(s)Xπ(s) + 3E t=1 V e Pt,πt(s0; bt) where e Pt is the e P that attains the maximum in (5), and F = HTI{ t [T], P / Pt}. The intuition about Lemma 4.4 is explained below (the reader may also refer to Section 3 of Luo et al. (2021) for a more complete explanation. We start by analyzing a vanilla policy optimization algorithm without adding bonus (i.e., feeding b Qt(s, a), an estimator of Qπt(s, a; ℓt), to the bandit algorithm on state s). By the value difference lemma (Kakade & Langford, 2002) and standard analysis for the mirror descent algorithm, we run into the following form of regret: t (V πt(s0; ℓt) V π(s0; ℓt)) t,a (πt(a|s) π(a|s)) Qπt(s, a; ℓt) s µπ(s)Xπ(s) + X t V π(s0; bt) where Xπ(s)+P t bt(s) is the regret bound of the bandit algorithm on state s, with Xπ(s) related to the regularization penalty, and bt(s) related to the stability of the algorithm. Specifically, in the known transition case, the standard choice is to use b Qt(s, a) = Lt(s,a)I{(s,a) is visited in episode t} µπt(s,a) as an unbiased loss estimator for Qπt(s, a; ℓt), where Lt(s, a) is the cumulative loss starting from state-action (s, a) in episode t. Using exponential weights with learning rate η on every state, one can derive a regret bound of (8) with Xπ(s) = ln A η and bt(s) = O ηH2 µπt(s) . This makes the quantity V π(s0; bt) involve the distribution mismatch coefficient P s µπ(s) µπt(s) (Agarwal et al., 2020b; Wei et al., 2020) that can be prohibitively large. Best of Both Worlds Policy Optimization On the other hand, an observation is that the problematic quantity V π(s0; bt) is nicely bounded if π is πt. This motivates (Luo et al., 2021) to use ℓt(s, a) bt(s) as the loss, where bt(s) can be viewed as a bonus term that encourages the learner to visit states that have been seldom visited before. To see why this works, assume for a moment that the regret bound still roughly holds when we replace the loss ℓt by ℓt bt. 
Then similar to (8), we get t=1 (V πt(s0; ℓt bt) V π(s0; ℓt bt)) s µπ(s)Xπ(s) + t=1 V π(s0; bt) which implies t=1 (V πt(s0; ℓt) V π(s0; ℓt)) s,a µπ(s)Xπ(s) + t=1 V πt(s0; bt) by rearranging and the linearity of the value function V π(s0; ℓt bt) = V π(s0; ℓt) V π(s0; bt) for any π. Now since the regret only involves V πt(s0; bt), there will be no distribution mismatch coefficient in the regret bound. The caveat of the discussion above is the assumption of (9). After adding the bonus bt, the original regret bound can be affected, that is, the bt on the right-hand side of (9) can be something larger, breaking the desired cancellation effect to achieve (10). To resolve this issue, Luo et al. (2021) proposed to use the dilated bonus defined in (5) in the policy optimization update. In (5), the bonus-to-go function is not constructed through a standard Bellman equation, but through a dilated version that includes an additional 1 + 1 H factor for future steps. The additional amount of bonus-togo can be used to cancel the additional regret due to the inclusion of bt. Lemma 4.4 gives a general recipe to design the exploration bonus for policy optimization algorithms. Roughly speaking, the bonus function bt(s) is chosen to be the instantaneous regret of the bandit algorithm on state s, which scales inversely with the probability of visiting state s (i.e., 1 µπt(s)). Lemma 4.4 suggests that the bandit algorithm on state s should update itself using Qπt(s, a; ℓt) Bt(s, a) as the loss, where Bt(s, a) is the dilated bonus-to-go. The bonus function bt(s) we use is slightly different from that in Luo et al. (2021) though. We notice that the bt(s) defined in Luo et al. (2021) has two parts: the first part is FTRL regret overhead, which comes from the regret bound of the FTRL algorithm under the given loss estimator, and the second part comes from the estimation error in estimating the transition kernel. In order to apply the self-bounding technique to obtain the best-of-both-worlds result, the second term in (7) can only involve the the first part (FTRL regret overhead) but not the second part (estimation error). Therefore, we split their bonus into two: our bt(s) only includes the first part, and ct(s) only includes the second part. This allows us to use self-bounding on the second term in (7). Our ct(s) goes to the first term in (7) instead and is handled differently from Luo et al. (2021). More details are given in Section 5 and Section 6. 4.2. Adaptive learning rate tuning and bonus design Our algorithm heavily relies on carefully tuning the learning rates and assigning a proper amount of bonus. These two tasks are intertwined with each other and introduce new challenges that are not seen in the global regularization approach (Jin et al., 2021) or policy optimization approach that only aims at a worst-case bound (Luo et al., 2021). Below we give a high-level overview for the challenges. In the FTRL analysis, a major challenge is to handle losses that are overly negative3. Typically, if the learning rate is η and the negative loss of action a has a magnitude of R, we need ηp(a)βR 1 in order to keep the algorithm stable, where p(a) is the probability of choosing action a, and β [0, 1] is a parameter related to the choice of the regularizer ( 1 2 for Tsallis entropy, 0 for Shannon entropy, and 1 for log barrier). In our case, there are two places we potentially encounter overly negative losses. One is when applying the standard loss-shifting technique for best-ofboth-world bounds (see Jin et al. (2021)). 
The loss-shifting effectively creates a negative loss in the analysis. The other overly negative loss is the bonus we use to obtain the firstorder bound. For the first case, we develop a simple trick that only performs loss-shifting when the introduced negative loss is not too large, and further show that the extra penalty due to not performing loss-shifting is well-controlled. This is explained in Section 5.1. For the second case, we develop an even more general technique (which can also cover the first case). This technique can be succinctly described as inserting virtual episodes when ηp(a)βR is potentially too large. In virtual episodes, the losses are assumed to be all-zero (because the learner actually does not interact with the environment in these episodes) and the algorithm only updates over some bonus term. The goal of the virtual episodes is solely to tune down the learning rate η and prevent ηp(a)βR from being to large in real episodes. Similarly, we are able to show that the extra penalty due to virtual episodes is well-controlled. This is explained in Section 5.3. 3Losses here refer not only to the loss from the environment, but also loss estimators or bonuses constructed by the algorithm. Best of Both Worlds Policy Optimization Algorithm 1 Policy Optimization Define: ψt(π; s), bt(s) are defined according to Figure 1, γt min n 106H4A2 for t = 1, 2, . . . do πt( |s) = argmin π (A) a π(a) b Qτ(s, a) Bτ(s, a) Cτ(s, a) + ψt(π; s) Add a virtual round if needed (only when aiming to get a first-order bound with log barrier see Section 5.3 (29)). Execute πt in episode t, and receive {ℓt(st,h, at,h)}H 1 h=0 . Define Pt: Under known transition, define Pt = {P}. Under unknown transition, define Pt by (2). Define b Qt: For s Sh, let It(s, a) = I{(st,h, at,h) = (s, a)}, Lt,h = PH 1 h =h ℓt(st,h , at,h ), and b Qt(s, a) = It(s, a)Lt,h µt(s)πt(a|s), where µt(s) = µπt t (s) + γt. (12) Define Ct: Let ct(s) = µt(s) µπt t (s) µt(s) H, and compute Ct(s, a) by Ct(s, a) = max e P Pt Es e P ( |s,a),a πt( |s ) h ct(s ) + Ct(s , a ) i , (13) Define Bt: Compute Bt(s, a) by (5) using the bt(s) defined in Figure 1. 5. Algorithm The template of our algorithm is Algorithm 1, in which we can plug different regularizers. The template applies to both known transition and unknown transition cases the only difference is in the definition of the confidence set Pt. The policy update (11) is equivalent to running individual FTRL on each state with an adaptive learning rate. The loss estimator b Qt(s, a) defined in (12) is similar to that in Luo et al. (2021): if (s, a) is visited, it is the cumulative loss starting from (s, a) divided by the upper occupancy measure (Jin et al., 2020) of (s, a); otherwise it is zero. One difference is that the implicit exploration factor γt added to the denominator is of order 1 t in our case, while it is of order 1 t in Luo et al. (2021). This smaller γt allows us to achieve logarithmic regret in the stochastic regime. There are two bonus functions ct(s) and bt(s) defined in (13) and Figure 1, respectively. As discussed in Section 4.1, the bonus functions are defined to be the instantaneous regret of the bandit algorithm on state s. The first bonus function ct(s) comes from the bias of the loss estimator. Our choice of ct(s) is such that a, Qπt(s, a; ℓt) E[ b Qt(s, a)] ct(s). The second bonus function bt(s) is related to the regret of the FTRL algorithm under the given loss estimator, which is regularizer dependent. 
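As a concrete illustration of the two recursions, the following is a minimal sketch of the backward dynamic programming that produces $C_t(s,a)$ from (13) and the dilated bonus $B_t(s,a)$ from (5), given the per-state bonuses $c_t$ and $b_t$ (whose regularizer-specific form is given in Figure 1) and specialized to the known-transition case $\mathcal{P}_t = \{P\}$, so that the maximum over the confidence set disappears; the data-structure conventions and the function name are illustrative assumptions.

```python
def dilated_bonuses(layers, P, pi, c, b, H):
    """Backward DP for C_t(s, a) as in (13) and the dilated bonus B_t(s, a) as in (5),
    specialized to a known transition kernel P (the max over the confidence set is dropped).

    layers: layers[h] is the list of states in S_h, for h = 0, ..., H
    P[s][a]: dict mapping next state s' -> P(s'|s, a)
    pi[s]:   dict mapping action a -> pi_t(a|s)   (terminal states are absent from pi)
    c, b:    dicts mapping state s -> c_t(s), b_t(s)
    """
    C, B = {}, {}

    def next_expectation(s, a, table, extra=None):
        # E_{s' ~ P(.|s,a), a' ~ pi_t(.|s')} [ extra(s') + table(s', a') ], zero beyond the terminal layer
        total = 0.0
        for s2, p in P[s][a].items():
            base = extra.get(s2, 0.0) if extra is not None else 0.0
            if s2 in pi:
                total += p * sum(pi[s2][a2] * (base + table[(s2, a2)]) for a2 in pi[s2])
            else:
                total += p * base   # terminal state: no actions, bonus-to-go is 0
        return total

    for h in reversed(range(H)):    # process layers S_{H-1}, ..., S_0 backwards
        for s in layers[h]:
            for a in pi[s]:
                C[(s, a)] = next_expectation(s, a, C, extra=c)                   # recursion (13)
                B[(s, a)] = b[s] + (1.0 + 1.0 / H) * next_expectation(s, a, B)   # recursion (5)
    return C, B
```

In the unknown-transition case, the expectations above would be replaced by maxima over $\widetilde{P} \in \mathcal{P}_t$, as written in (5) and (13).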
We will elaborate how to choose bt(s) for different regularizers later in this section. Finally, dynamic programming are used to obtain Ct(s, a) and Bt(s, a), which are trajectory sums of ct(s) and bt(s), with an (1 + 1 H ) dilation on Bt(s, a). They are then used in the policy update (11). In the following subsections, we discuss how we choose bt(s) and tune the learning rate for each regularizer. 5.1. Tsallis entropy bt(s) corresponds to the instantaneous regret of the bandit algorithm on state s under the given loss estimator. To obtain its form, we first analyze the regret assuming Bt(s, a) is not included, i.e., only update on b Qt(s, a) Ct(s, a) (Bt(s, a) will be added back for analysis after the form of bt(s) is decided). Inspired by Zimmert & Seldin (2019) for multi-armed bandits, our target is to show that the instantaneous regret (see Appendix D for details) on state s is upper bounded by 1 ηt(s) 1 ηt 1(s) ξt(s) | {z } penalty term + H2ηt(s)ξt(s) µt(s) | {z } stability term +νt(s) (26) where ξt(s) = P πt(a|s)(1 πt(a|s)) A, and νt(s) is some overhead due to the inclusion of Ct(s, a). The factor ξt(s) allows us to use the self-bounding technique that leads to best-of-both-worlds bounds, which cannot be relaxed to A in general. Compared to the bound for multiarmed bandits in (Zimmert & Seldin, 2019), the extra 1 µt(s) scaling in the stability term comes from importance weighting because state s is visited with probability roughly µt(s). Best of Both Worlds Policy Optimization Figure 1. Definitions of ψt(π; s) and bt(s) for different regularizers (to be used in Algorithm 1). Tsallis entropy: ψt(π; s) = 2 ηt(s) bt(s) = 4 1 ηt(s) 1 ηt 1(s) µt(s) > 1 8H + νt(s), (15) A + 4H q Pt τ=1 1 µτ (s) , ξt(s) = X πt(a|s)(1 πt(a|s)), νt(s) = 8ηt(s) X a πt(a|s)Ct(s, a)2. (16) Shannon entropy: ψt(π; s) = X 1 ηt(s, a)π(a) ln π(a), (17) bt(s) = 8 X 1 ηt(s, a) 1 ηt 1(s, a) ξt(s, a) + 1 minτ [t] µτ(s) minτ [t 1] µτ(s) + νt(s), (18) 1 ηt(s, a) = 1 ηt 1(s, a) + 4 µt(s) q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) + H ln T, with 1 η0(s, a) = 1600H4A ξt(s, a) = min{πt(a|s) ln(T), 1 πt(a|s)}, νt(s) = 8 X a ηt(s, a)πt(a|s)Ct(s, a)2. (20) Log barrier (for first-order bound under known transition): ψt(π; s) = X 1 ηt(s, a) ln 1 π(a), (21) bt(s) = 8 X 1 ηt+1(s, a) 1 ηt(s, a) log(T) + νt(s), (22) (s t, a t) = argmax s,a ηt(s, a) µt(s) (break tie arbitrarily) 1 ηt+1(s, a) = 1 ηt(s,a) + 4ηt(s,a)ζt(s,a) µt(s)2 log(T ) if t is a real episode 1 ηt(s,a) 1 + 1 24H log T if t is a virtual episode and (s t, a t) = (s, a) 1 ηt(s,a) if t is a virtual episode and (s t, a t) = (s, a) 1 η1(s, a) = 4H4, (24) ζt(s, a) = It(s, a) πt(a|s)It(s) 2L2 t,h where It(s) = X a It(s, a), (suppose that s Sh) νt(s) = 8 X a ηt(s, a)πt(a|s)Ct(s, a)2. (25) Best of Both Worlds Policy Optimization This desired bound suggests a learning rate scheduling of H q Pt τ=1 1 µτ (s) (27) to balance the penalty and the stability terms. This is exactly how we tune ηt(s) in (16). However, to obtain the ξt(s) factor in the stability term in (26), we need to perform lossshifting in the analysis, which necessitates the condition ηt(s)H µt(s) 1 as discussed in Section 4.2. From the choice of ηt(s) in (27), this condition may not always hold, but every time it is violated, ηt(s) is decreased by a relatively large factor in the next episode. Our strategy is that whenever the condition Hηt(s) µt(s) 1 is violated, we do not perform loss-shifting. This still allows us to prove a stability term of H2ηt(s) A µt(s) for that episode. 
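As a small illustration, the schedule (27) and the loss-shifting check just described can be sketched as follows; the actual schedule (16) carries additional regularizer-dependent constants, so this should be read as an assumption-laden sketch of the leading shape rather than the paper's exact tuning.

```python
import math

def tsallis_learning_rate(mu_history, H):
    """Leading shape of the schedule (27): eta_t(s) = 1 / (H * sqrt(sum_{tau <= t} 1 / mu_tau(s))),
    where mu_tau(s) is the occupancy-plus-gamma quantity used in (12).
    Assumes t >= 1 so that mu_history (the values mu_1(s), ..., mu_t(s)) is non-empty."""
    inv_sum = sum(1.0 / m for m in mu_history)
    return 1.0 / (H * math.sqrt(inv_sum))

def loss_shifting_allowed(eta_t_s, mu_t_s, H):
    """Loss shifting is performed in the analysis only when H * eta_t(s) / mu_t(s) <= 1;
    when the condition fails, loss shifting is skipped for that episode."""
    return H * eta_t_s / mu_t_s <= 1.0
```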
The key in the analysis is to show that the extra cost due to not performing loss-shifting is only logarithmic in T (see the proof of Lemma 6.3). Combining this idea with the instantaneous regret bound in (26) and the choice of ηt(s) in (27), we are able to derive the form of bt(s) in (15). After figuring out the form of bt(s) assuming Bt(s, a) is not incorporated in the updates, we incorporate it back and re-analyze the stability term. The extra stability term due to bt(s) leads to a separate quantity 1 H πt(a|s)Bt(s, a), which is an overhead allowed by (6). 5.2. Shannon entropy The design of bt(s) under Shannon entropy follows similar procedures as in Section 5.1, except that the tuning of the learning rate is inspired by Ito et al. (2022). One improvement over theirs is that we adopt coordinate-dependent learning rates that can give us a refined gap-dependent bound in the stochastic regime (in multi-armed bandits, this improves their maxa A (a) dependence to P a 1 (a)). With Shannon entropy, there is less learning rate tuning issue because its optimal learning rate decreases faster than other regularizers, and there is no need to perform loss-shifting (Ito et al., 2022). The regret bound under Shannon entropy is overall worse than that of Tsallis entropy by a ln2(T) factor. 5.3. Log barrier As shown by Wei & Luo (2018); Ito (2021), FTRL with a log barrier regularizer is also able to achieve the best of both worlds, with the additional benefit of having data-dependent bounds. In this subsection, we demonstrate the possibility of this by showing that under known transition, Algorithm 1 is able to achieve a first-order bound in the adversarial regime, while achieving polylog(T) regret in the stochastic regime. To get a first-order best-of-both-world bound with log barrier, inspired by Ito (2021), we need to prove the following instantaneous regret for the bandit algorithm on s: 1 ηt(s, a) 1 ηt 1(s, a) | {z } penalty term ηt(s, a)ζt(s, a) µt(s)2 | {z } stability term where ζt(s, a) = (It(s, a) π(a|s)It(s))2 L2 t,h for s Sh. This suggests a learning rate scheduling of 1/ηt+1(s, a) = 1/ηt(s, a) + ηt(s, a)ζt(s, a)/µt(s)2. Similar to the Tsallis entropy case, obtaining the desired stability term in (28) requires loss-shifting, so we encounter the same issue as before and can resolve it in the same way. With this choice of ηt(s, a), we can derive the desired form of bt(s) from (28). However, the magnitude of this bonus is larger than in the Tsallis entropy case because of the 1 µt(s)2 scaling here. Therefore, an additional problem arises: the Bt(s, a) derived from this bt(s) can be large that makes ηt(s, a)πt(a|s)Bt(s, a) > 1 H happen, which violates the condition under which we can bound the extra stability term (due to the inclusion of bt(s)) by 1 H πt(a|s)Bt(s, a). Notice that this was not an issue under Tsallis entropy. To resolve this, we note that ηt(s, a)πt(a|s)Bt(s, a) can be as large as poly(H, S) maxs ,a ηt(s ,a )2 µt(s )2 (Lemma E.3), so all we need is to make ηt(s,a) µt(s) 1 poly(H,S) for all s, a. Our solution is to insert virtual episodes when ηt(s,a) µt(s) is too large on some (s, a). In virtual episodes, the learner does not actually interact with the environment; instead, the goal is purely to tune down ηt(s, a). 
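The following sketch, under assumed data structures and with the triggering threshold left as a placeholder parameter, summarizes this virtual-episode mechanism; the precise check and the tune-down factor it uses are spelled out in the next paragraph and in (23).

```python
import math

def maybe_virtual_episode(eta, mu, H, T, threshold):
    """Sketch of the virtual-episode mechanism for the log-barrier regularizer.

    eta[(s, a)]: current per-coordinate learning rates; mu[s]: current mu_t(s).
    If max_{s,a} eta_t(s,a) / mu_t(s) exceeds the threshold (a poly(H, S) quantity in the paper,
    passed in here as a placeholder), the episode is declared virtual: the learner does not act,
    all losses are treated as zero, and only the learning rate at the arg-max pair is tuned down,
    by multiplying 1/eta by (1 + 1/(24 H log T)) as in (23).
    """
    s_star, a_star = max(eta, key=lambda sa: eta[sa] / mu[sa[0]])
    if eta[(s_star, a_star)] / mu[s_star] <= threshold:
        return False                                      # real episode: interact with the MDP as usual
    inv_eta = 1.0 / eta[(s_star, a_star)]
    inv_eta *= 1.0 + 1.0 / (24.0 * H * math.log(T))       # tune down eta(s*, a*) only
    eta[(s_star, a_star)] = 1.0 / inv_eta
    return True                                           # virtual episode: skip interaction this round
```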
To decide whether to insert a virtual episode, in episode t, after the learner computes πt( |s) on all states, he checks if max s,a ηt(s, a) If so, then episode t is made a virtual episode in which the losses are assumed to be zero everywhere.4 In a virtual episode, let (s t, a t) = argmaxs,a ηt(s,a) µt(s) , and we tune down ηt(s t, a t) by a factor of (1+ 1 24H log T ). Also, a bonus bt(s) is assigned to s t to reflect the increased penalty term on state s t due to the decrease in learning rate (by (28)). Combinig all the above, we get the bonus and learning rate specified in (22) and (23). Again, since every time a virtual episode happens, there exists some ηt(s, a) decreased by a significant factor, it cannot happen too many times. 4Inserting a virtual episode shifts the index of future real episodes. Since there are only O(HSA log2 T) virtual episodes, we still use T to denote the total number of episodes. Best of Both Worlds Policy Optimization 6. Sketch of Regret Analysis Our goal is to show (6) and bound the right-hand side of (7) (for all regularizers and known/unknown transitions). To show (6), for a fixed π, we do the following decomposition: t,a (πt(a|s) π(a|s)) (Qπt(s, a; ℓt) Bt(s, a)) (30) t,a (πt(a|s) π(a|s)) b Qt(s, a) Bt(s, a) Ct(s, a) | {z } ftrl-regπ(s) t,a (πt(a|s) π(a|s)) Qπt(s, a; ℓt) b Qt(s, a) + Ct(s, a) | {z } biasπ(s) The next lemma bounds the expectation of ftrl-regπ(s). Lemma 6.1. E [ftrl-regπ(s)] is upper bounded by O(H4SA ln(T)) + E t=1 bt(s) + 1 a πt(a|s)Bt(s, a) The proof of Lemma 6.1 is in Appendix E. Notice that depending on the regularizers and whether the transition is known/unknown, the definitions of bt(s) are different, so we prove it individually for each case. Combining (30) with Lemma 6.1, we see that the condition in Lemma 4.4 is satisfied with Xπ(s) = O(H4SA ln(T))+ E[biasπ(s)]. By Lemma 4.4, we can upper bound E[Reg(π)] by the order of H5SA ln(T) + E s µπ(s)biasπ(s) + t=1 V e Pt,πt(s0; bt) The next lemma bounds the bias part in (31). See Appendix F for the proof. Lemma 6.2. With known transitions, E [P s µπ(s)biasπ(s)] H5SA2 ln(T), and with unknown transitions, s µπ(s)biasπ(s) H2S4A2 ln(T)ι+ v u u t H3S2AE s,a [µπt(s, a) µπ(s, a)]+ Next, we bound the bonus part in (31) for all regularizers we consider. The proofs are in Appendix G. Lemma 6.3. Using Tsallis entropy as the regularizer, with known transitions, t=1 V e Pt,πt(s0; bt) H4SA2 ln(T)+ t=1 µt(s)πt(a|s)(1 πt(a|s)) With unknown transitions, the right-hand side above further has an additional term O(HS4A2 ln(T)ι). Lemma 6.4. Using Shannon entropy as the regularizer, With known transitions, t=1 V e Pt,πt(s0; bt) t=1 µt(s)πt(a|s)(1 πt(a|s)) With unknown transitions, the right-hand side above further has an additional term O(HS4A2 ln(T)ι). Lemma 6.5. Using log barrier as the regularizer, with known transitions, t=1 V πt(s0; bt) H3S2A2 ln(T) ln(SAT)+ t=1 (It(s, a) πt(a|s)It(s))2L2 t,h(s) Final regret bounds To obtain the final regret bounds, we combine Lemma 6.2 with each of Lemma 6.3, Lemma 6.4, and Lemma 6.5 based on (31). Then we use the standard self-bounding technique to derive the bounds for each regime. The details are provided in Appendix H. 7. Conclusion In this work, we develop policy optimization algorithms for tabular MDPs that achieves the best of both worlds. 
Compared to previous solutions with a similar guarantee (Jin & Luo, 2020; Jin et al., 2021), our algorithm is computationally much simpler; compared to most existing RL algorithms, our algorithm is more robust (handling adversarial losses) and more adaptive (achieving fast rate in stochastic environments) simultaneously. Built upon the flexible policy optimization framework, our work paves a way towards developing more robust and adaptive algorithms for more general settings. Future directions include obtaining data-dependent bounds under unknown transitions, and incorporating function approximation. Best of Both Worlds Policy Optimization Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pp. 3692 3702. PMLR, 2019. Agarwal, A., Henaff, M., Kakade, S., and Sun, W. Pc-pg: Policy cover directed exploration for provable policy gradient learning. ar Xiv preprint ar Xiv:2007.08459, 2020a. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, pp. 64 66. PMLR, 2020b. Amir, I., Azov, G., Koren, T., and Livni, R. Better best of both worlds bounds for bandits with switching costs. ar Xiv preprint ar Xiv:2206.03098, 2022. Auer, P. and Chiang, C.-K. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pp. 116 120. PMLR, 2016. Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pp. 42 1. JMLR Workshop and Conference Proceedings, 2012. Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp. 1283 1294. PMLR, 2020. Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and its applications. In Conference on Learning Theory, pp. 1216 1259. PMLR, 2021. Chen, L., Luo, H., and Rosenberg, A. Policy optimization for stochastic shortest path. ar Xiv preprint ar Xiv:2202.03334, 2022. Erez, L. and Koren, T. Best-of-all-worlds bounds for online learning with feedback graphs. ar Xiv preprint ar Xiv:2107.09572, 2021. He, J., Zhou, D., and Gu, Q. Near-optimal policy optimization algorithms for learning adversarial linear mixture mdps. In International Conference on Artificial Intelligence and Statistics, pp. 4259 4280. PMLR, 2022. Ito, S. Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds. In Conference on Learning Theory, pp. 2552 2583. PMLR, 2021. Ito, S., Tsuchiya, T., and Honda, J. Nearly optimal best-ofboth-worlds algorithms for online learning with feedback graphs. ar Xiv preprint ar Xiv:2206.00873, 2022. Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning, 2020. Jin, T. and Luo, H. Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in neural information processing systems, 33: 16557 16566, 2020. Jin, T., Huang, L., and Luo, H. The best of both worlds: stochastic and adversarial episodic mdps with unknown transition. Advances in Neural Information Processing Systems, 34:20491 20502, 2021. Kakade, S. and Langford, J. 
Approximately optimal approximate reinforcement learning. In In Proc. 19th International Conference on Machine Learning. Citeseer, 2002. Lai, T. L., Robbins, H., et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6 (1):4 22, 1985. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press (preprint), 2018. Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in Neural Information Processing Systems, 2020. Levine, S. and Koltun, V. Guided policy search. In International conference on machine learning, pp. 1 9. PMLR, 2013. Luo, H. Homework 3 solution, introduction to online optimization/learning. http://haipeng-luo.net/ courses/CSCI659/2022_fall/homework/ HW3_solutions.pdf, November 2022. Luo, H., Wei, C.-Y., and Lee, C.-W. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. Advances in Neural Information Processing Systems, 34:22931 22942, 2021. Neu, G. and Olkhovskaya, J. Online learning in mdps with linear function approximation and bandit feedback. ar Xiv preprint ar Xiv:2007.01612v2, 2021. Rouyer, C., Seldin, Y., and Cesa-Bianchi, N. An algorithm for stochastic and adversarial bandits with switching costs. In International Conference on Machine Learning, pp. 9127 9135. PMLR, 2021. Best of Both Worlds Policy Optimization Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Seldin, Y. and Lugosi, G. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pp. 1743 1759. PMLR, 2017. Seldin, Y. and Slivkins, A. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pp. 1287 1295. PMLR, 2014. Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pp. 8604 8613. PMLR, 2020. Tsuchiya, T., Ito, S., and Honda, J. Best-of-bothworlds algorithms for partial monitoring. ar Xiv preprint ar Xiv:2207.14550, 2022. Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, 2018. Wei, C.-Y., Jahromi, M. J., Luo, H., Sharma, H., and Jain, R. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pp. 10170 10180. PMLR, 2020. Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3007 3015. PMLR, 2021. Xu, H., Ma, T., and Du, S. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. In Conference on Learning Theory, pp. 4438 4472. PMLR, 2021. Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pp. 10978 10989. PMLR, 2020. Zanette, A., Cheng, C.-A., and Agarwal, A. Cautiously optimistic policy optimization and exploration with linear function approximation. ar Xiv preprint ar Xiv:2103.12923, 2021. Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. 
In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 467 475. PMLR, 2019. Zimmert, J., Luo, H., and Wei, C.-Y. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pp. 7683 7692. PMLR, 2019. Best of Both Worlds Policy Optimization A. Additional Definitions Define µ e P ,π(s |s, a) as the probability of visiting s conditioned on that (s, a) is already visited, under transition kernel e P and policy π. In other words, µ e P ,π(s |s, a) is defined as 0 if h(s ) < h(s), 0 if h(s) = h(s ), s = s , 1 if s = s, Pr{sh(s ) = s | (sh, ah) = (s, a)} if h(s ) > h(s). Further define µ e P ,π(s |s) = P a µ e P ,π(s |s, a)π(a|s). We write µπ(s |s, a) = µP,π(s |s, a) and µπ(s |s) = µP,π(s |s) where P is the true transition. B. Concentration Bounds Lemma B.1. If P Pt, then for all e P Pt, e P(s |s, a) P(s |s, a) min P(s |s, a)ι nt(s, a) + 40ι 3nt(s, a), 1 Lemma B.2 (Lemma D.3.7 of Jin et al. (2021)). With probability at least 1 δ, for any h, nt(s, a) |Sh|A ln T + ln(1/δ) Definition B.3. Define E to be the event that P Pt for all t and the bound in Lemma B.2 holds. By (2) and Lemma B.2, Pr{E} 1 5Hδ. C. Difference Lemmas Lemma C.1 (Performance difference). For any policies π1 and π2, and any loss function ℓ: S A R, V π1(s0; ℓ) V π2(s0; ℓ) = X s µπ2(s)(π1(a|s) π2(a|s))Qπ1(s, a; ℓ). Lemma C.2. For any policies π1 and π2 and any function L : S A R, X s µπ2(s)(π1(a|s) π2(a|s))L(s, a) = V π1(s0; ℓ) V π2(s0; ℓ) ℓ(s, a) L(s, a) Es P ( |s,a),a π1( |s )[L(s , a )]. Proof. This is simply a different way to write the performance difference lemma (Lemma C.1). One only needs to verify that Qπ1(s, a; ℓ) = L(s, a). This can be shown straightforwardly by backward induction from s SH to s S0 and using the definition of ℓ(s, a). Lemma C.3 (Occupancy measure difference, Lemma D.3.1 of (Jin et al., 2021)). µP1,π(s) µP2,π(s) = X (u,v,w) S A S µP1,π(u, v) [P1(w|u, v) P2(w|u, v)] µP2,π(s|w) (u,v,w) S A S µP2,π(u, v) [P1(w|u, v) P2(w|u, v)] µP1,π(s|w) Best of Both Worlds Policy Optimization Lemma C.4 (Generalized version of Lemma 4 in (Jin et al., 2020)). Suppose the high probability event E defined in Definition B.3 holds. Let e P s t be a transition kernel in Pt which may depend on s, and let gt(s) [0, G]. Then µπt(s) µ e P s t ,πt(s) gt(s) v u u t HS2A ln(T)ι s µπt(s)gt(s)2 + HS4AG ln(T)ι. Proof. We first show that for any t, s, µπ(s) µ e P s t ,π(s) X (u,v,w) S A S µπ(u, v) nt(u, v) µπ(s|w) + HS2 X nt(u, v) . (32) Below, the summation range of (u, w, v) and (x, y, z) are both SH 1 h=0 (Sh A Sh+1) if without specifying. 
µπ(s) µ e P s t ,π(s) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µ e P s t ,π(s|w) (by Lemma C.3) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µπ(s|w) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µ e P s t ,π(s|w) µπ(s|w) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µπ(s|w) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) X x,y,z µπ(x, y|w) e P s t (z|x, y) P(z|x, y) µ e P s t ,π(s|z) (by Lemma C.3) u,v,w µπ(u, v) nt(u, v) + ι nt(u, v) x,y,z µπ(u, v) nt(u, v) + ι nt(u, v) µπ(x, y|w) min nt(x, y) + ι nt(x, y), 1 (by Lemma B.1 and the assumption that E holds) u,v,w µπ(u, v) nt(u, v) µπ(s|w) u,v,w µπ(u, v) ι nt(u, v)µπ(s|w) (=: term1) x,y,z µπ(u, v) nt(u, v) µπ(x, y|w) nt(x, y) (=: term2) x,y,z µπ(u, v) nt(u, v) µπ(x, y|w) min ι nt(x, y), 1 (=: term3) x,y,z µπ(u, v) ι nt(u, v)µπ(x, y|w) (=: term4) We bound term1 to term4 separately as below: nt(u, v) S X Best of Both Worlds Policy Optimization µπ(u, v)P(z|x, y)µπ(x, y|w)ι µπ(u, v)P(w|u, v)µπ(x, y|w)ι µπ(u, v)P(z|x, y)µπ(x, y|w)ι µπ(u, v)P(w|u, v)µπ(x, y|w)ι nt(x, y) (AM-GM) v u u t H X v u u t H X x,y,z µπ(u, v) P(w|u, v) + ι nt(u, v) µπ(x, y|w) min ι nt(x, y), 1 x,y,z µπ(u, v)P(w|u, v)µπ(x, y|w) ι nt(x, y) + X x,y,z µπ(u, v) ι nt(u, v)µπ(x, y|w) x,y,z µπ(x, y) ι nt(x, y) + HS X u,v,w µπ(u, v) ι nt(u, v) nt(x, y) + HS2 X u,v,w µπ(u, v) ι nt(u, v) HS2 X Collecting all terms we obtain (32). Thus, µπt(s) µ e P s t ,πt(s) gt(s) u,v,w µπt(u, v) nt(u, v) µπt(s|w) + HS2 X u,v,w µπt(u, v) nt(u, v) µπt(s|w) nt(u, v) (33) Fix an h, we consider the summation ( ) restricted to (u, v, w) Th Sh A Sh+1. That is, (u,v,w) Th µπt(u, v) nt(u, v) µπt(s|w) (u,v,w) Th µπt(u, v) αP(w|u, v)gt(s)2 + ι αnt(u, v) (holds for any α > 0 by AM-GM) (u,v,w) Th µπt(u, v)P(w|u, v)µπt(s|w)gt(s)2 + 1 nt(u, v) µπt(s|w) Best of Both Worlds Policy Optimization s µπt(s)gt(s)2 + H|Sh+1| s µπt(s)gt(s)2 + H|Sh+1||Sh|A ln(T)ι α + H|Sh+1| ln(1/δ)ι (by Lemma B.2 and the assumption that E holds) v u u t H|Sh||Sh+1|A ln(T)ι s µπt(s)gt(s)2 (picking the optimal α and using our choice of δ = 1 T 3 ) (|Sh| + |Sh+1) v u u t HA ln(T)ι s µπt(s)gt(s)2. Continue from (33): µπt(s) µ e P s t ,πt(s) gt(s) h (|Sh| + |Sh+1) v u u t HA ln(T)ι s µπt(s)gt(s)2 + HS4AG ln(T)ι (by Lemma B.2 and the assumption that E holds) v u u t HA ln(T)ι s µπt(s)gt(s)2 + HS4AG ln(T)ι. Lemma C.5. For any π1, π2, X s,a |µπ1(s, a) µπ2(s, a)| H X s,a µπ1(s) |π1(a|s) π2(a|s)| Proof. For any s, a, we can view µπ(s, a) as V π(s0; 1s,a) where 1s,a is the loss function that takes the value of 1 on (s, a) and 0 on other state-actions. By the performance difference lemma (Lemma C.1), |µπ1(s, a) µπ2(s, a)| X s ,a µπ1(s ) |π1(a |s ) π2(a |s )| Qπ2(s , a ; 1s,a). Therefore, X s,a |µπ1(s, a) µπ2(s, a)| X s ,a µπ1(s ) |π1(a |s ) π2(a |s )| X s,a Qπ2(s , a ; 1s,a) s ,a µπ1(s ) |π1(a |s ) π2(a |s )| Qπ2(s , a ; 1) (1 is the loss function that takes a constant value 1) s ,a µπ1(s ) |π1(a |s ) π2(a |s )| . D. FTRL Regret Bounds The lemmas in this section are standard results for FTRL, which can be found in e.g. Lattimore & Szepesv ari (2018); Zimmert & Seldin (2019); Ito (2021); Luo (2022). We list the results here for completeness. Best of Both Worlds Policy Optimization Lemma D.1. The FTRL algorithm: pt = argmin p Ω guarantees the following: t=1 pt u, ℓt ψ0(u) min p Ωψ0(p) + t=1 (ψt(u) ψt(pt) ψt 1(u) + ψt 1(pt)) | {z } penalty term t=1 max p Ω( pt p, ℓt Dψt(p, pt)) | {z } stability term Proof. Let Lt Pt τ=1 ℓτ. Define Ft(p) = p, Lt 1 + ψt(p) and Gt(p) = p, Lt + ψt(p). Therefore, pt is the minimizer of Ft. Let p t+1 be minimizer of Gt. 
Then by the first-order optimality condition, we have Ft(pt) Gt(p t+1) Ft(p t+1) Gt(p t+1) Dψt(p t+1, pt) = p t+1, ℓt Dψt(p t+1, pt). (34) By definition, we also have Gt(p t+1) Ft+1(pt+1) Gt(pt+1) Ft+1(pt+1) = ψt(pt+1) ψt+1(pt+1). (35) pt p t+1, ℓt Dψt(p t+1, pt) + Gt(p t+1) Ft(pt) (by (34)) pt p t+1, ℓt Dψt(p t+1, pt) + Gt 1(p t) Ft(pt) + GT (p T +1) G0(p 1) n pt p, ℓt Dψt(p, pt) o ψt(pt) + ψt 1(pt) + GT (u) min p ψ0(p) (by (35), using that p T +1 is the minimizer of GT ) n pt p, ℓt Dψt(p, pt) o ψt(pt) + ψt 1(pt) + t=1 u, ℓt + ψT (u) min p ψ0(p) n pt p, ℓt Dψt(p, pt) o + ψt(u) ψt(pt) ψt 1(u) + ψt 1(pt) + t=1 u, ℓt + ψ0(u) min p ψ0(p). Re-arranging finishes the proof. Lemma D.2 (Stability under Tsallis entropy). Let ψt(p) = 2 p(a), and let ℓt RA be such that ηt p p(a)ℓt(a) 1 max p (A) { pt p, ℓt Dψt(p, pt)} 2ηt X a pt(a) 3 2 ℓt(a)2. Proof. The proof can be found in the Problem 1 of Luo (2022). Best of Both Worlds Policy Optimization Lemma D.3 (Stability under Shannon entropy). Let ψt(p) = P a 1 ηt(a)p(a) ln p(a), and let ℓt RA be such that η(a)ℓt(a) 1. Then max p (A) { pt p, ℓt Dψt(p, pt)} X a ηt(a)pt(a)ℓt(a)2. Proof. The proof can be found in the Proof of Lemma 1 in Chen et al. (2021). Lemma D.4 (Stability under log barrier). Let ψt(p) = P a 1 ηt(a) ln 1 p(a), and let ℓt RA be such that ηt(a)p(a)ℓt(a) 1 max p (A) { pt p, ℓt Dψt(p, pt)} X a ηt(a)pt(a)2ℓt(a)2. max p (A) { pt p, ℓt Dψt(p, pt)} max q RA + { pt q, ℓt Dψt(q, pt)} Define f(q) = pt q, ℓt Dψt(q, pt). Let q be the solution in the last expression. Next, we verify that under the specified conditions, we have f(q ) = 0. It suffices to show that there exists q RA + such that f(q) = 0 since if such q exists, then it must the maximizer of f and thus q = q. [ f(q)]a = ℓt(a) [ ψt(q)]a + [ ψt(pt)]a = ℓt(a) + 1 ηt(a)q(a) 1 ηt(a)pt(a) By the condition, we have 1 ηt(a)pt(a) ℓt(a) < 0 for all a. and so f(q) = 0 has solution in R+, which is q(a) = 1 pt(a) + ηt(a)ℓt(a) 1 . Therefore, f(q ) = ℓt ψt(q ) + ψt(pt) = 0, and we have max q RA + { pt q, ℓt Dψt(q, pt)} = pt q , ψt(pt) ψt(q ) Dψt(q , pt) = Dψt(pt, q ). It remains to bound Dψt(pt, q ), which by definition can be written as Dψt(pt, q ) = X 1 ηt(a)h pt(a) where h(x) = x 1 ln(x). By the relation between q (a) and pt(a) we just derived, it holds that pt(a) q (a) = 1 + ηt(a)pt(a)ℓt(a). By the fact that ln(1 + x) x x2 for all x 1 = ηt(a)pt(a)ℓt(a) ln(1 + ηt(a)pt(a)ℓt(a)) ηt(a)2pt(a)2ℓt(a)2 which gives the desired bound. Lemma D.5 (FTRL with Tsallis entropy). Let ψt(p) = 2 p(a) for non-increasing ηt, and let xt be such that pt(a)(ℓt(a) + xt) 1 2 for all t, a. Then the FTRL algorithm in Lemma D.1 ensures for any u (A), t=1 pt u, ℓt 2 a pt(a) 3 2 (ℓt(a) + xt)2 , where ξt = P pt(a)(1 pt(a)). Best of Both Worlds Policy Optimization Proof. We use Lemma D.1, and bound the penalty term and stability individually. penalty term = 2 η0 max p (A) Bounding the stability term: stability term = t=1 max p (A) n pt p, ℓt + xt1 Dψt(p, pt) o 2 a pt(a) 3 2 (ℓt(a) + xt)2 where the first equality is because pt p, 1 = 0 for pt, p (A), and the last inequality is by Lemma D.2. Lemma D.6 (FTRL with Shannon entropy). Let ψt(p) = P a 1 ηt(a)p(a) ln p(a), for non-increasing ηt(a) such that η0(a) = η0 for all a. Assume that ηt(a)ℓt(a) 1 for all t, a, and assume A T. Then for any u (A), t=1 pt u, ℓt ln A 1 ηt(a) 1 ηt 1(a) a ηt(a)pt(a)ℓt(a)2 + 1 where ξt(a) = min {pt(a) ln(T), 1 pt(a)}. Proof. Let u = 1 1 T 2 u + 1 AT 2 1. We use Lemma D.1, and bound the penalty term and stability individually (with respect to u ). 
penalty term p(a) ln 1 p(a) u (a) ln 1 u (a) 1 ηt(a) 1 ηt 1(a) pt(a) ln 1 pt(a) u (a) ln 1 u (a) 1 ηt(a) 1 ηt 1(a) pt(a) ln 1 pt(a) u (a) ln 1 u (a) To bound pt(a) ln 1 pt(a) u (a) ln 1 u (a), first observe that pt(a) ln 1 pt(a) = pt(a) ln 1 + 1 pt(a) pt(a) pt(a) 1 pt(a) 1 pt(a) because ln(1 + x) x. By the definition of u , we have u (a) ln 1 u (a) min 1 AT 2 ln(AT 2), 1 1 ln 1 1 1 T 2 min 1 AT 2 , 1 1 If pt(a) 1 A2T 4 , then pt(a) ln 1 pt(a) u (a) ln 1 u (a) 1 A2T 4 ln(A2T 4) 1 AT 2 = 2 ln(AT 2) AT 2 where the first inequality is because x ln(x) is increasing for x e 1, and last inequality is because 2 ln(x) x < 0 for all x R. If pt(a) > 1 A2T 4 , then pt(a) ln 1 pt(a) pt(a) ln(A2T 4) 6pt(a) ln(T) by the assumption A T. Combining all arguments above, we get penalty term ln A 1 ηt(a) 1 ηt(a) min {1 pt(a), pt(a) ln(T)} . Best of Both Worlds Policy Optimization Bounding the stability term: stability term = t=1 max p (A) n pt p, ℓt Dψt(p, pt) o a ηt(a)pt(a)ℓt(a)2 where the last inequality is by Lemma D.3. t=1 pt u , ℓt ln A 1 ηt(a) 1 ηt 1(a) a ηt(a)pt(a)ℓt(a)2 Then noticing that t=1 pt u, ℓt = t=1 pt u , ℓt + t=1 u u, ℓt t=1 pt u , ℓt + 1 finishes the proof. Lemma D.7 (FTRL with log barrier). Let ψt(p) = P a 1 ηt(a) ln 1 p(a) for non-increasing ηt(a) with η0(a) = η0 for all a, and let xt be such that ηt(a)pt(a)(ℓt(a) + xt) 1 2 for all t, a. Then for any u (A), t=1 pt u, ℓt 3A ln T 1 ηt(a) 1 ηt 1(a) a ηt(a)pt(a)ℓt(a)2 + 1 Proof. Let u = 1 1 T 3 u + 1 AT 3 1. We use Lemma D.1, and bound the penalty term and stability individually (with respect to u ). penalty term A ln(T 3) 1 ηt(a) 1 ηt 1(a) ln 1 u (a) ln 1 pt(a) 1 ηt(a) 1 ηt 1(a) 1 ηt(a) 1 ηt 1(a) ln(T) (because A T) Bounding the stability term: stability term = t=1 max p (A) n pt p, ℓt + xt1 Dψt(p, pt) o a ηt(a)pt(a)2 (ℓt(a) + xt)2 where the first equality is because pt p, 1 = 0, and the last inequality is by Lemma D.4. Then noticing that t=1 pt u, ℓt = t=1 pt u , ℓt + t=1 u u, ℓt t=1 pt u , ℓt + 1 finishes the proof. Best of Both Worlds Policy Optimization E. Analysis for FTRL Regret Bound (Lemma 6.1) E.1. Tsallis entropy Proof of Lemma 6.1 (Tsallis entropy). We focus on a particular s, and use πt(a), b Qt(a), Bt(a), Ct(a), ηt, µt, ξt, bt to denote πt(a|s), b Qt(s, a), Bt(s, a), Ct(s, a), ηt(s), µt(s), ξt(s), bt(s), respectively. By Lemma D.5, we have for any π t=1 πt π, b Qt Bt Ct a ηtπt(a) 3 2 b Qt(a) Bt(a) Ct(a) + xt 2 # for arbitrary xt R such that ηt p πt(a|s)( b Qt(a) Bt(a) Ct(a) + xt) 1 2 for all t, a. Our choice of xt is the following: xt = D πt, b Qt E Yt. (38) with Yt I h ηt µt 1 8H i . Below, we verify that ηt p πt(a) b Qt(a) Bt(a) Ct(a) + xt 1 πt(a) b Qt(a) Bt(a) Ct(a) + xt πt(a) Bt(a) Ct(a) D πt, b Qt E Yt (using (38) and b Qt(a) 0) ηt Bt(a) ηt Ct(a) ηt X a πt(a )HIt(s, a ) µtπt(a ) Yt (by the definition of b Qt(a)) 8H 1 4H2 Hηt µt Yt (using Lemma E.1, Ct(a) H2 and ηt 1 4H4 ) 2. (by the definition of Yt = I h ηt µt 1 8H i ) Continued from (37) with the choice of xt: t=1 πt π, b Qt Bt Ct a ηtπt(a) 3 2 b Qt(a) πt, b Qt Yt Bt(a) Ct(a) 2 # a ηtπt(a) 3 2 b Qt(a) πt, b Qt 2 + b Qt(a)2Y t + Bt(a)2 + Ct(a)2 # (define Y t = 1 Yt) a ηtπt(a) 3 2 b Qt(a) πt, b Qt 2 + b Qt(a)2Y t | {z } term1 a πt(a)Bt(a) a ηtπt(a)Ct(a)2 # . (using Lemma E.1) To bound term1, notice that b Qt(a) πt, b Qt 2 = Et " It(s, a)Lt,h µtπt(a) It(s)Lt,h (assume s Sh) Best of Both Worlds Policy Optimization µtπt(a) H µtπt(a) H 2 + µt(1 πt(a)) H = 1 µtπt(a)(1 πt(a))2H2 + 1 µt (1 πt(a))H2 Et h b Qt(a)2Y t i = Et " It(s, a)Lt,h µtπt(a)Y t . 
a ηtπt(a) 3 2 1 πt(a) µtπt(a) + 1 µtπt(a)Y t πt(a)(1 πt(a)) + p Notice that 1 µt q Pt τ=1 1 µτ and continuing from (39) we have t=1 πt π, b Qt Bt Ct O(H4A) + 3E a πt(a)Bt(a) a ηtπt(a)Ct(a)2 # a πt(a)Bt(a) with bt defined in (15). This finishes the proof. Lemma E.1 (Tsallis entropy). ηt(s)Bt(s, a) 1 8H . Proof. By the definition of bt(s) in (15), we have A 1 ηt(s) 1 ηt 1(s) + 8ηt(s)H4 (Ct(s, a) H2) Best of Both Worlds Policy Optimization 1 µt(s) q Pt τ=1 1 µτ (s) + 8 1 4H4 H4 A µt(s) + 2 34H ηt(s)Bt(s, a) ηt(s) 1 + 1 H H max s bt(s ) min 1 1600H4 100H2 min 1 1600H4 At 106H4A2 , (by the definition of γt) E.2. Shannon entropy Proof of Lemma 6.1 (Shannon entropy). We focus on a particular s, and use πt(a), b Qt(a), Bt(a), ηt(a), µt, bt to denote πt(a|s), b Qt(s, a), Bt(s, a), ηt(s, a), µt(s), bt(s), respectively. Notice that for any t, a, since b Qt(a) 0, ηt(a)Bt(a) 1 4H (by Lemma E.2), and ηt(a)Ct(a) 1 4H4 H2 = 1 4H2 (because ηt(a) η0(a) = 1 4H4 and Ct(a) H2), we have ηt(a)( b Qt(a) Bt(a) Ct(a)) 1 4H 1 4H2 1. Besides, for any a, E t=1 b Qt(a) H γt HT 2 (by the definition of γt) log T (by Lemma E.2) H2T (Ct(a) H2) With these inequalities, by Lemma D.6, the following holds for any π: t=1 πt π, b Qt Bt Ct ln A η0(a) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a) b Qt(a) Bt(a) Ct(a) 2 # b Qt(a) Bt(a) Ct(a) # Best of Both Worlds Policy Optimization O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a) b Qt(a)2 + Bt(a)2 + Ct(a)2 # O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) µt + πt(a)Bt(a)2 + πt(a)Ct(a)2 # O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) a πt(a)Bt(a) + 3 a ηt(a)πt(a)Ct(a)2 # (by Lemma E.2) By the update ηt(a), 1 ηt(a) 4H p 1 µτ 1 q PT τ=1 ξτ (a) µτ + maxτ [T ] 1 µτ µτ + max τ [T ] 1 µτ 1 µt Pt τ=1 1 µτ µτ + max τ [T ] 1 µτ µτ + max τ [t] 1 µτ µτ + max τ [t 1] 1 µτ µt + maxτ [t] 1 µτ maxτ [t 1] 1 µτ q Pt τ=1 ξτ (a) µτ + maxτ [t] 1 µτ 1 minτ [t] µτ minτ [t 1] µτ q Pt τ=1 ξτ (a) µτ + maxτ [t] 1 µτ 1 ηt(a) 1 ηt 1(a) ξt(a) + 1 minτ [t] µτ minτ [t 1] µτ where we use (19) in the last inequality. Using this in (41), we get t=1 πt π, b Qt Bt Ct O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) ξt(a) + 1 minτ [t] µτ minτ [t 1] µτ a πt(a)Bt(a) + 3 a ηt(a)πt(a)Ct(a)2 # O(H4A ln(T)) + E a πt(a)Bt(a) where we use the definition of bt in (18). This finishes the proof. Best of Both Worlds Policy Optimization Lemma E.2 (Shannon entropy). ηt(s, a)Bt(s, a) 1 4H and Bt(s, a) 400 T log T. Proof. By the definition of bt(s) in (18), we have 1 ηt(s, a) 1 ηt 1(s, a) a ηt(s, a)πt(a|s)H4 (Ct(s, a) H2) µt(s) q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) + H log T + 2 (using (19) and ηt(s, a) 1 4H4 ) log T + 2 132HA log T γt . Further notice that 1 ηt(s, a) 4 H log T τ 4H p Bt(s, a) H 1 + 1 H max s bt(s) 396H2A log T γt 400H2A p ηt(s, a)Bt(s, a) min 1 1600H4A log T , 1 4H t log T 396H2A log T γt min 1 1600H4A log T , 1 4H t log T max 396H2A t log T 106H4A2 , 396H2A p (by the definition of γt) by the definition of γt. E.3. Log barrier Proof of Lemma 6.1 (log barrier). We focus on a particular s, and use πt(a), b Qt(a), Bt(a), Ct(a), ηt, µt, ζt(a), to denote πt(a|s), b Qt(s, a), Bt(s, a), Ct(s, a), ηt(s), µt(s), ζt(s, a), respectively. By Lemma D.7, t=1 πt π, b Qt Bt Ct O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a)2 b Qt(a) Bt(a) Ct(a) + xt 2 # t=1 b Qt(a) Bt(a) Ct(a) for arbitrary xt R such that ηt(a)πt(a)( b Qt(a) Bt(a) Ct(a) + xt) 1. Recall that with log barrier, there are real episodes and virtual episodes in which ℓt(s, a) = 0 for all (s, a). Let Yt = 0 if t is a virtual episode, and Yt = 1 otherwise. xt = D πt, b Qt E . 
(43) Best of Both Worlds Policy Optimization Below, we verify that ηt(a)πt(a) b Qt(a) Bt(a) Ct(a) + xt 1 ηt(a)πt(a) b Qt(a) Bt(a) Ct(a) + xt ηt(a)πt(a) Bt(a) Ct(a) D πt, b Qt E (using (43) and b Qt(a) 0) ηt(a)πt(a|s)Bt(a) ηt(a)Ct(a) ηt(a) X a πt(a )HIt(s, a ) µtπt(a ) Yt (when Yt = 0, b Qt(a) = 0) 8H 1 4H2 Hηt µt Yt (by Lemma E.3 and that Ct(a) H2 and ηt(a) 1 4H4 ) 2. (when Yt = 1 (real episode), ηt(a) Besides, for any a, E t=1 b Qt(a) H γt HT 2 (by the definition of γt) 15ST 2 (by Lemma E.3) H2T (Ct(a) H2) Below, we continue from (42) with our choice of xt: t=1 πt π, b Qt Bt Ct O(H4SA ln(T)) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a)2 b Qt(a) πt, b Qt 2 + Bt(a)2 + Ct(a)2 # O(H4SA ln(T)) + E 1 ηt(a) 1 ηt 1(a) | {z } term1 a ηt(a)πt(a)2 b Qt(a) πt, b Qt 2 | {z } term2 a πt(a)Bt(a) a ηt(a)πt(a)Ct(a)2 | {z } term3 . (by Lemma E.3) We further manipulate term2 (suppose that s Sh). In virtual episodes, term2 = 0, and in real episodes, ηt(a)πt(a)2 b Qt(a) πt, b Qt 2 = ηt(a)πt(a|s)2 It(s, a)Lt,h µtπt(a) It(s)Lt,h = ηt(a) It(s, a)Lt,h µt πt(a|s)It(s)Lt,h µ2 t (It(s, a) πt(a|s)It(s)) L2 t,h = ηt(a)ζt(a) Best of Both Worlds Policy Optimization 1 ηt+1(a) 1 ηt(a) (by Eq. (23)) By the definition of bt in (22), we have E[term1 + term2 + term3] E h PT t=1 bt i , which finishes the proof. Lemma E.3 (log barrier). ηt(s, a)πt(a|s)Bt(s, a) 1 8H and Bt(s, a) 15ST. Proof. If t is a real episode, bt(s) = 8 X 1 ηt+1(s, a) 1 ηt(s, a) ηt(s, a)It(s, a)Lt(s, a)2 µt(s)2 32H2 max a ηt(s, a) µt(s) 1 µt(s) 1 µt(s) 32H2 max s ,a Bt(s, a) bt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a)bt(s ) 1 µt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a) 1 µt(s ) 32H2 max s ,a 1 µt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a) 1 µπt t (s)πt(a|s)µ e Pt,πt(s |s, a) + γt 32H2 max s ,a 1 µπt(s)πt(a|s) + γt 32H2 max s ,a S µt(s)πt(a|s) 96H2 max s ,a ηt(s, a)πt(a|s)Bt(s, a) 96H2S max s ,a 2 1 8H (46) where the last inequality is because ηt(s ,a ) µt(s ) 1 60 H3S in real episodes. From the second-to-last step in (45), we also have Bt(s, a) 3S γt 32H2 max s ,a In virtual episodes, 1 ηt+1(s, a) 1 ηt(s, a) I{(s t, a t) = (s, a)} 24ηt(s, a)H log T log T I{(s t, a t) = (s, a)} 24µt(s)H 1 maxs ,a ηt(s ,a ) µt(s ) (by the definition of (s t, a t)) Best of Both Worlds Policy Optimization I{s t = s} 24µt(s)H 1 maxs ,a ηt(s ,a ) I{s t = s} µt(s) 1 24HMt where we define Mt = maxs ,a ηt(s ,a ) µt(s ) . Similar to (45): Bt(s, a) bt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a)bt(s ) I{s t = s} µt(s) + 3 X s µ e Pt,πt(s |s, a)I{s t = s } µt(s ) I{s t = s} µt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a) I{s t = s } µπt t (s)πt(a|s)µ e Pt,πt(s |s, a) + γt I{s t = s } µπt(s)πt(a|s) + γt 1 24HMt 1 µt(s)πt(a|s) 1 8HMt (47) ηt(s, a)πt(a|s)Bt(s, a) ηt(s, a) µt(s) 1 8HMt 1 8H where the last step uses the definition of Mt. From the second-to-last step in (47) , we also have Bt(s, a) 1 8γt HMt 15 where we use that Mt 1 60 H3S in vitrual episodes. F. Analysis for the Bias (Lemma 6.2) Proof of Lemma 6.2. s µπ(s)biasπ(s) s,a µπ(s) (πt(a|s) π(a|s)) Qπt(s, a; ℓt) b Qt(s, a) + Ct(s, a) # s,a µπ(s) (πt(a|s) π(a|s)) Qπt(s, a; ℓt) µπt(s) µt(s) Qπt(s, a; ℓt) + Ct(s, a) # s,a µπ(s) (πt(a|s) π(a|s)) µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) + Ct(s, a) # s,a (µπt(s, a) µπ(s, a))zt(s, a) Best of Both Worlds Policy Optimization with zt(s, a) defined as the following based on Lemma C.2: zt(s, a) µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) + Ct(s, a) Es P ( |s,a),a πt( |s ) µt(s ) µπt(s ) µt(s ) Qπt(s , a ; ℓt) + Ct(s , a ) Recall the high probability event E defined in Definition B.3. 
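Both the bias quantities manipulated below (c_t(s) and C_t(s, a), where C_t(s, a) aggregates c_t over strictly later layers reachable from (s, a)) and the bonus totals V^{π_t}(s_0; b_t) bounded in Appendix G are policy-evaluation objects: expectations of a per-state function under a plugged-in transition model and the current policy. A minimal sketch of the backward-induction routine that computes such objects in a layered MDP follows; the array shapes, the `layers` argument, and the treatment of the last layer as terminal are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def backward_bonus_evaluation(P_hat, pi, c, layers):
    """Evaluate a per-state bonus c under transition model P_hat and policy pi.

    Returns
      C[s, a] : expected bonus collected at strictly later layers after taking
                action a in state s (the role played by C_t(s, a) in the text);
      J[s]    : c[s] plus the expected future bonus from s under pi, so J[s0]
                plays the role of the bonus value function V^{pi}(s0; c).

    Assumed shapes (illustrative): P_hat is (S, A, S), pi is (S, A), c is (S,).
    `layers` lists state indices layer by layer; transitions are assumed to move
    only to later layers, and states in the final layer are treated as terminal.
    """
    S, A, _ = P_hat.shape
    C = np.zeros((S, A))
    J = np.zeros(S)
    for layer in reversed(layers):       # process the last layer first
        for s in layer:
            C[s] = P_hat[s] @ J          # expected future bonus for each action
            J[s] = c[s] + pi[s] @ C[s]   # average over the policy at state s
    return C, J
```

With c set to the per-state bonus b_t and P_hat set to the transition used in the analysis, J[s_0] gives the quantity V^{π_t}(s_0; b_t) whose sum over t is controlled in Appendix G; the proof of Lemma 6.2 continues below.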
Notice that s,a (µπt(s, a) µπ(s, a))zt(s, a) s,a (µπt(s, a) µπ(s, a))zt(s, a) E s,a (µπt(s, a) µπ(s, a))zt(s, a) E s,a (µπt(s, a) µπ(s, a))zt(s, a) E + O(Hδ) O(TH TH2) (because |zt(s, a)| O(TH2) almost surely) s,a (µπt(s, a) µπ(s, a))zt(s, a) E . (δ = 1 T 3 ) From now on, it suffices to bound PT t=1 P s,a(µπt(s, a) µπ(s, a))zt(s, a) assuming E holds (i.e., P Pt for all t). By the definition of Ct(s, a), we have zt(s, a) = µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) + max e P Pt Es e P ( |s,a),a πt( |s ) " µt(s ) µπt t (s ) µt(s ) H + Ct(s , a ) Es P ( |s,a),a πt( |s ) µt(s ) µπt(s ) µt(s ) Qπt(s , a ; ℓt) + Ct(s , a ) µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) 0. (51) On the other hand, zt(s, a) µt(s) µπt(s) µt(s) H + Ct(s, a) Es P ( |s,a),a πt( |s ) [Ct(s , a )] ct(s) + Es P t( |s,a),a πt( |s ) [ct(s ) + Ct(s , a )] Es P ( |s,a),a πt( |s ) [Ct(s , a )] (let P t be the transition that attains the maximum in (13)) ct(s) + Es P ( |s,a)[ct(s )] + X P t(s |s, a) P(s |s, a) πt(a |s ) (ct(s ) + Ct(s , a )) ct(s) + Es P ( |s,a)[ct(s )] + X s ,a et(s |s, a)πt(a |s )(ct(s ) + Ct(s , a )) (52) where we define et(s |s, a) = P t(s |s, a) P(s |s, a) . Observe that by the definition of Ct(s, a), it holds that Ct(s, a) = X s :h(s )>h(s) µP t,πt(s |s, a)ct(s ), and therefore, ct(s) + Ct(s, a) = X s µP t,πt(s |s, a)ct(s ) Best of Both Worlds Policy Optimization a πt(a|s) (ct(s) + Ct(s, a)) = X s µP t,πt(s |s)ct(s ). Thus we can thus rewrite (52) as zt(s, a) ct(s) + Es P ( |s,a)[ct(s )] + X s et(s |s, a) X s µP t,πt(s |s )ct(s ). (53) Continue from the previous calculation in (50): s,a (µπt(s, a) µπ(s, a)) zt(s, a) s,a [µπt(s, a) µπ(s, a)]+ zt(s, a) (by (51)) s,a [µπt(s, a) µπ(s, a)]+ ct(s) | {z } term1 s,a [µπt(s, a) µπ(s, a)]+ Es P ( |s,a)[ct(s )] | {z } term2 s,a [µπt(s, a) µπ(s, a)]+ X s et(s |s, a) X s µP t,πt(s |s )ct(s ) | {z } term3 . (by (53)) Known transition case For the known transition case, we have ct(s) µπt(s) + γt µπt(s) µt(s) H = γt µt(s)H and et(s |s, a) = 0. Thus, s µπ(s)biasπ(s) s µπt(s) γt µt(s)H HS t=1 γt = O H5SA2 ln(T) . Unknown transition case Upper bounding term1. By the definition of ct(s), s,a [µπt(s, a) µπ(s, a)]+ µπt t (s) µπt t (s) + γt µt(s) s,a [µπt(s, a) µπ(s, a)]+ µπt t (s) µπt(s) | {z } term1a s,a [µπt(s, a) µπ(s, a)]+ µπt(s) µπt t (s) | {z } term1b s,a µπt(s, a) Hγt | {z } term1c To bound term1a, we apply Lemma C.4 with [µπt(s, a) µπ(s, a)]+ Best of Both Worlds Policy Optimization which gives v u u t H3S2A s,a µπt(s)[µπt(s, a) µπ(s, a)]+ µt(s) ln(T)ι + H2S4A ln(T)ι v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ ln(T)ι + H2S4A ln(T)ι. term1b can be bound in the same way and admits the same upper bound. term1c HS PT t=1 γt = O H5SA2 ln(T) . Combining term1, term2, term3, we get v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ ln(T)ι + H2S4A2 ln(T)ι. Upper bounding term2. This is very similar to the procedure of bounding term1. We perform a similar decomposition: s,a [µπt(s, a) µπ(s, a)]+ Es P ( |s,a) µπt t (s ) µπt(s ) | {z } term2a s,a [µπt(s, a) µπ(s, a)]+ Es P ( |s,a) " µπt(s ) µπt t (s ) | {z } term2b s,a µπt(s, a)Es P ( |s,a) | {z } term2c To bound term2a, we apply Lemma C.4 with [µπt(s, a) µπ(s, a)]+ µt(s ) P(s |s, a) s,a µπt(s, a)P(s |s, a) µt(s ) µπt(s ) which gives v u u t H3S2A s µπt(s ) X [µπt(s, a) µπ(s, a)]+ µt(s ) P(s |s, a) ln(T)ι + H2S4A ln(T)ι v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ P(s |s, a) ln(T)ι + H2S4A ln(T)ι v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ ln(T)ι + H2S4A ln(T)ι. which is same as the bound for term1a. 
Also, term2b can be handled in the same way as term2a, and term2c PT t=1 P s µπt(s ) Hγt µt(s ) HS PT t=1 γt. Overall, term2 can be bounded by the same order as term1. Upper bounding term3. s,a,s [µπt(s, a) µπ(s, a)]+ P t(s |s, a)ι nt(s, a) + ι nt(s, a) s µP t,πt(s |s )ct(s ) s,a,s [µπt(s, a) µπ(s, a)]+ P t(s |s, a)α + ι nt(s, a)α s µP t,πt(s |s )ct(s ) (for any α (0, 1]) Best of Both Worlds Policy Optimization s,a,s [µπt(s, a) µπ(s, a)]+ P t(s |s, a) X s µP t,πt(s |s )ct(s ) s,a,s [µπt(s, a) µπ(s, a)]+ ι nt(s, a) s µP t,πt(s |s )ct(s ) s,a [µπt(s, a) µπ(s, a)]+ X s µP t,πt(s |s, a)ct(s ) + H2 s,a,s [µπt(s, a) µπ(s, a))]+ ι nt(s, a) s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a) µt(s ) µt(s ) µt(s ) + H2S s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a)µπt t (s ) µπt(s ) µt(s ) | {z } term3a s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a) µπt(s ) µπt t (s ) µt(s ) | {z } term3b s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a) γt µt(s ) | {z } term3c +H2S2A ln(T)ι (by Lemma B.2 and the assumption that E holds.) For term3a we apply Lemma C.4 with s,a[µπt(s, a) µπ(s, a)]+µP t,πt(s |s, a) s,a µπt(s, a)µP t,πt(s |s, a) v u u t H2S2A ln(T)ι s,a[µπt(s, a) µπ(s, a)]+µP t,πt(s |s, a) µt(s ) + αH H2S4A ln(T)ι v u u t H3S2A ln(T)ι s,a [µπt(s, a) µπ(s, a)]+ + αH3S4A ln(T)ι The same bound applies to term3b, too. s,a,s µπt(s, a)µP t,πt(s |s, a) γt µt(s ) αH2 X s γt αH6SA2. Picking α = 1 H , combining term3a and term3b, and using H S, we get v u u t H3S2A ln(T)ι s,a [µπt(s, a) µπ(s, a)]+ + H2S4A2 ln(T)ι which is also of the same order as term1. Combining term1, term2, term3, we get that if E holds, then s,a (µπt(s, a) µπ(s, a)) zt(s, a) v u u t H3S2A s,a [µπt(s, a) µπ(s, a))]+ ln(T)ι + H2S4A2 ln(T)ι Best of Both Worlds Policy Optimization Using this in (50) finishes the proof. G. Bounding P s V πt(s0; bt) (Lemma 6.3, Lemma 6.4, Lemma 6.5) We first show Lemma G.1 and Lemma G.2 which are common among different regularizers. Lemma G.1. t=1 µt(s)πt(a|s)(1 πt(a|s)) t=1 µπt(s)πt(s)(1 πt(a|s)) H4S2A3 ln(T) + I{unknown transition} p HS5A3 ln(T)ι. Proof. Define φ(s, a) = πt(a|s)(1 πt(a|s)). t=1 µt(s)φ(s, a) X t=1 µπt(s)φ(s, a) + X t=1 γtφt(s, a) | {z } term1 t=1 |µπt t (s) µπt(s)| φt(s, a) | {z } term2 t=1 γtφt(s, a) s,a φt(s, a) S H4S2A3 ln(T). term2 is zero in the known transition case, and in the unknown transition case, if E defined in Definition B.3 holds, then t=1 (µπt t (s) µπt(s))φt(s, a) + 1 (for any α > 0) v u u t HS2A ln(T)ι + HS4A ln(T)ι (by Lemma C.4 with gt(s) = P a φt(s, a)) v u u t HS2A ln(T)ι s,a µπt(s)φt(s, a) + HS4A ln(T)ι s,a µπt(s)φt(s, a) + p HS5A3 ln(T)ι (choosing α = 1 HS3A ln(T )ι) t=1 µπt(s)φt(s, a) + p HS5A3 ln(T)ι. If E does not hold (which happens with probability O(H/T 3)), then term2 O(SA T). Overall, t=1 µπt(s)φt(s, a) HS5A3 ln(T)ι + HSA Collecting terms and using H S finishes the proof. Best of Both Worlds Policy Optimization Lemma G.2. With known transition, s µπt(s)νt(s) H4SA2 ln(T). For Tsallis entropy or Shannon entropy with unknown transition, s µ e Pt,πt(s)νt(s) HS4A2 ln(T)ι. Proof. With known transition, we have s µπt(s)νt(s) s,a µπt(s)πt(a|s)Ct(s, a)2 (ηt(s, a) 1 H4 or ηt(s) 1 H4 ) s,a µπt(s)πt(a|s)Ct(s, a) (Ct(s, a) H2) s,a µπt(s)πt(a|s) X s µπt(s |s, a)µt(s ) µπt(s ) µt(s ) (by the definition of Ct(s, a)) s µπt(s ) µt(s ) µπt(s ) H4SA2 ln(T). With unknown transitions, notice that for Tsallis entropy we have ηt(s) min n 1 H4 , 1 H o and for Shannon entropy we have ηt(s, a) min n 1 H4 , 1 H o . 
Therefore, in both cases, suppose that E holds, s µ e Pt,πt(s)νt(s) s,a µ e Pt,πt(s)πt(a|s)Ct(s, a)2 s,a µ e Pt,πt(s)πt(a|s)Ct(s, a) s,a µ e Pt,πt(s)πt(a|s) X s µP t,πt(s |s, a) µt(s ) µπt t (s ) (let P t be the e P attaining maximum in (13)) t=1 min 1, H3 µπt t (s ) µt(s ) µπt t (s ) t=1 min 1, H3 µπt t (s ) µπt t (s ) + Best of Both Worlds Policy Optimization t=1 min 1, H3 µπt t (s ) µπt t (s ) + H4SA2 ln(T) By Lemma C.4, the first part above can be upper bounded by v u u t HS2A ln(T)ι s µπt(s) min 1, H6 + HS4A ln(T)ι H8S2Aι ln(T) + HS4A ln(T)ι HS4A ln(T)ι where we use H S. Suppose that E does not hold (happens with probability O(H/T 3)), we still have s µ e Pt,πt(s)νt(s) s,a µ e Pt,πt(s)πt(a|s)Ct(s, a)2 H4 H H2 2 O(HT) because |Ct(s, a)| H2 with probability 1. Combining all terms and taking expectation, we conclude that s µ e Pt,πt(s)νt(s) HS4A2 ln(T)ι. G.1. Tsallis entropy Proof of Lemma 6.3. t=1 V πt, e Pt(s0; bt) s µ e Pt,πt(s)bt(s) s µ e Pt,πt(s) νt(s) + 1 ηt(s) 1 ηt 1(s) µt(s) > 1 8H s µ e Pt,πt(s)νt(s) + H s µ e Pt,πt(s) 1 µt(s) q Pt τ=1 1 µτ (s) s µ e Pt,πt(s)νt(s) | {z } term1 ξt(s) q Pt τ=1 1 µτ (s) | {z } term2 1 µt(s) ηt(s) q Pt τ=1 1 µτ (s) | {z } term3 Bounding term1. term1 can be bounded using Lemma G.2, which gives E[term1] H4SA2 ln(T) + I{unknown transition}HS4A2 ln(T)ι. Best of Both Worlds Policy Optimization Bounding term2. 1 µt(s) Pt τ=1 1 µτ (s) µt(s)πt(a|s)(1 πt(a|s)) 1 µt(s) Pt τ=1 1 µτ (s) t=1 µt(s)πt(s, a)(1 πt(a|s)) 1 µt(s) Pt τ=1 1 µτ (s) t=1 µt(s)πt(a|s)(1 πt(a|s)). By Lemma G.1, we can bound the last expression by v u u tln(T) t=1 µπt(s)πt(s, a)(1 πt(s, a)) H6S2A3 ln(T) + I{unknown transition} H3S5A3 ln(T)ι. Bounding term3. By (16), 1 µt(s) Pt τ=1 1 µτ (s) HS Combining term1, term2, term3 finishes the proof. G.2. Shannon entropy Proof of Lemma 6.4. t=1 V e Pt,πt(s0; bt) t,s,a µ e Pt,πt(s) µt(s) q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) + 1 ξt(s, a) + 1 minτ [t] µτ(s) minτ [t 1] µτ(s) t,s µ e Pt,πt(s)νt(s) ln T q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) ξt(s, a) + X ln T q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) 1 minτ [t] µτ(s) minτ [t 1] µτ(s) t ξt(s, a) + H t,s,a µ e Pt,πt(s)1 t,s,a µ e Pt,πt(s) 1 minτ [t] µτ(s) minτ [t 1] µτ(s) t,s µ e Pt,πt(s)νt(s) µt(s) Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) t µt(s)ξt(s, a) + H t,s,a ln minτ [t 1] µτ(s) minτ [t] µτ(s) Best of Both Worlds Policy Optimization µt(s)ξt(s, a) t µt(s)ξt(s, a) v u u t A X t,s ln minτ [t 1] µτ(s) minτ [t] µτ(s) t,s µ e Pt,πt(s)νt(s) t µt(s)ξt(s, a) + X t,s µ e Pt,πt(s)νt(s) + H2SA ln 3 2 (T) v u u tln3(T) t=1 µt(s)πt(a|s)(1 πt(a|s)) + X s,t µ e Pt,πt(s)νt(s) + H2SA ln 3 2 (T) By Lemma G.1 and Lemma G.2, the expectation of this can be upper bounded by v u u tln3(T) t=1 µπt(s)πt(a|s)(1 πt(a|s)) H6S2A3 ln3(T) + I{unknown transition} q H3S5A3 ln3(T)ι + H4SA2 ln 3 2 (T) + I{unknown transition}HS4A2 ln(T)ι v u u tln3(T) t=1 µπt(s)πt(a|s)(1 πt(a|s)) ln3(T) + I{unknown transition}HS4A2 ln(T)ι. (using H S and log(T) ι) G.3. Log barrier Lemma G.3. Let η1 > 0, η2, η3, . . . be updated by ηt + ηtφt t 1 with 0 φt η 2 t . Then Proof. By the update rule, η2 t = 1 ηt+1 + 1 = 1 ηt+1 + 1 which implies By the condition on φt, we also have Combining the two inequalities finishes the proof. Best of Both Worlds Policy Optimization Proof of Lemma 6.5. In this proof we only focus on the know transition case. We use Tr and Tv to denote the set of real and virtual episodes, respectively. Let φt(s, a) = 4ζt(s,a) µt(s)2 log(T ) in real episodes and φt(s, a) = I{(s t,a t)=(s,a)} 24ηt(s,a)2H log T in virtual episodes. 
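Lemma G.3 and the proof of Lemma 6.5 below concern the log-barrier regularizer, whose per-state FTRL update was analyzed in Lemma D.7. For concreteness, here is a minimal sketch of that update: the optimality conditions give p_a = 1/(eta_a (L_a + lambda)) for a scalar Lagrange multiplier lambda, located by bisection. The real/virtual-episode mechanism, the loss shift x_t, and the self-referential learning-rate recursion 1/eta_{t+1}(s, a) = 1/eta_t(s, a) + eta_t(s, a) phi_t(s, a) controlled by Lemma G.3 are deliberately not reproduced; `cum_loss` and `eta` are placeholders.

```python
import numpy as np

def log_barrier_ftrl_step(cum_loss, eta, iters=80):
    """One FTRL step over the simplex with per-action log-barrier regularization.

    Solves p = argmin_p <p, L> + sum_a (1/eta_a) * ln(1/p_a) over the simplex.
    The optimality conditions give p_a = 1 / (eta_a * (L_a + lam)) for the unique
    lam > -min_a L_a with sum_a p_a = 1, found here by bisection.
    Illustrative sketch only; inputs are placeholders, not the paper's exact
    bonus-adjusted losses or learning-rate schedule.
    """
    L = np.asarray(cum_loss, dtype=float)
    eta = np.asarray(eta, dtype=float)
    lo = -L.min()                        # probabilities blow up as lam approaches lo
    hi = -L.min() + np.sum(1.0 / eta)    # at lam = hi the probabilities sum to <= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = 1.0 / (eta * (L + lam))
        if p.sum() > 1.0:
            lo = lam                     # multiplier must be larger
        else:
            hi = lam                     # multiplier can be smaller
    p = 1.0 / (eta * (L + hi))
    return p / p.sum()                   # tiny renormalization for numerical safety

# e.g. log_barrier_ftrl_step(np.array([0.2, 1.0, 0.5]), eta=np.full(3, 0.05))
```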
We first show that φt(s, a) 1 ηt(s,a)2 , which allows us to apply Lemma G.3 because 1 ηt+1(s,a) = 1 ηt(s,a) + ηt(s, a)φt(s, a) by our update rule. This is clear for virtual episodes. For real episodes, φt(s, a)ηt(s, a)2 = 4ηt(s, a)2ζt(s, a) µt(s)2 log T H2 log T 1 H3S 1 because ηt(s,a) H3S in real episodes. t=1 V πt(s0; bt) ηt(s, a)ζt(s, a) µt(s)2 log(T) log(T) + X t Tv µπt(s t) 1 ηt(s t, a t)H log T log T + s µπt(s)νt(s) s,a ηt(s, a)ζt(s, a) H |Tv| + H4SA ln(T) (in virtual episodes, ηt(s t,a t) µt(s t) 1 H3S , and we use Lemma G.2 to bound the last term) µt(s) q P τ t:τ Tr ζτ (s,a) HS|Tv| + +H4SA2 ln(T) (by Lemma G.3 and the condition verified at the beginning of the proof) µt(s)2 P τ t:τ Tr ζτ (s,a) t Tr ζt(s, a) + HS|Tv| + +H4SA2 ln(T) t Tr ζt(s, a) + HS|Tv| + H4SA2 ln(T). Now we bound the number of virtual episodes. Notice that each time a virtual episode happens, there exist s, a such that ηt(s,a) H3S , and ηt(s, a) will shrink by a factor of (1 + 1 24H log T ) after the virtual episode. Since µt(s) γt, this event cannot happen if ηt(s, a) γt 60 H3S . Thus, the number of virtual episodes is upper bounded by |Tv| SA log 60 H3S γt log 1 + 1 24H log T HSA ln(T) ln(SAT). Applying this bound in the last expression and using H S finishes the proof. H. Final Regret Bounds through Self-Bounding (Theorem 4.1, Theorem 4.2, Theorem 4.3) Proof of Theorem 4.1. Let π = argmaxπ Reg(π). By (31), Lemma 6.2, and Lemma 6.3, under known transition and Tsallis entropy, we have Reg( π) H X t=1 µπt(s)πt(a|s)(1 πt(a|s)) ln(T) + H5SA2 ln(T) Best of Both Worlds Policy Optimization For the adversarial regime, we bound the above by v u u t SAE s,a µπt(s)πt(a|s)) ln(T) + H5SA ln(T) = H3SAT + H5SA2 ln(T). For the stochastic regime, notice that Reg( π) Reg(π ) E h PT t=1 µπt(s)πt(a|s) (s, a) i C, and we have Reg( π) c1H X t=1 µπt(s)πt(a|s) ln(T) + c2H5SA2 ln(T) (for some universal constants c1, c2) t=1 µπt(s)πt(a|s) (s, a) + c2 1H ln(T) α (s, a) + c2H5SA2 ln(T) (for arbitrary α > 0) t=1 µπt(s)πt(a|s) (s, a) α (s, a) + H5SA2 ln(T) α(Reg( π) + C) + O α (s, a) + H5SA2 ln(T) Picking α = min 1 2, C 1 2 H2 ln(T ) 2 leads to the bound Reg( π) U + UC + H5SA2 ln(T) where U = P a =π (s) H2 ln(T ) (s,a) . Finally, using that Reg(π) Reg( π) for all π finishes the proof. Proof of Theorem 4.2. By (31), Lemma 6.2, and Lemma 6.3, under unknown transition and Tsallis entropy, we have v u u t H3S2AE s,a [µπt(s, a) µπ(s, a)]+ | {z } term1 t=1 µπt(s)πt(a|s)(1 πt(a|s)) | {z } term2 +c3H2S4A2 ln(T)ι (for universal constants c1, c2, c3) In the adversarial regime, we can bound it by the order of p H4S2AT ln(T)ι + H2S4A2 ln(T)ι To get a bound in the stochastic regime, we first argue that it suffices to show the desired bound for all π that satisfies Reg(π) Reg(π ). This is because we can then bound Reg(π) for π such that Reg(π) < Reg(π ) by Reg(π) < Reg(π ) U + p U(C + C(π )) + poly(H, S, A) ln(T)ι = U + UC + poly(H, S, A) ln(T)ι because C(π ) = 0 by definition. Below we assume that Reg(π) Reg(π ). Note that by Lemma C.5, for any π, X µπ(s, a) µπ (s, a) H X s,a µπ(s) |π(a|s) π (a|s)| Best of Both Worlds Policy Optimization a =π (s) µπ(s)π(a|s) + H X s µπ(s)(1 π( π (s) |s)) a =π (s) µπ(s)π(a|s). 
v u u t H3S2AE s,a |µπt(s, a) µπ(s, a)| v u u t H3S2AE s,a |µπt(s, a) µπ (s, a)| ln(T)ι + c1 v u u t H3S2A s,a |µπ(s, a) µπ (s, a)| ln(T)ι v u u u t2H4S2AE a =π (s) µπt(s, a) ln(T)ι + c1 v u u t2H4S2A a =π (s) µπ(s, a) ln(T)ιι a =π (s) µπt(s, a) min a =π (s) µπ(s, a) min + O H4S2A ln(T)ι α(Reg(π ) + C) + α(Reg(π ) Reg(π) + C(π)) + O H4S2A ln(T)ι (see explanation below) αReg(π) + α(C + C(π)) + O H4S2A ln(T)ι (by the assumption Reg(π ) Reg(π)) where in the second-to-last inequality we use the property: Reg(π ) Reg(π) = E t=1 V π(s0; ℓt) V π (s0; ℓt) a =π (s) µπ(s, a) (s, a) a =π (s) µπ(s, a) min C(π) For term2, similar to before, term2 c2H X t=1 µπt(s)πt(a|s)(1 πt(a|s)) α(Reg(π) + C) + O α (s, a) + H5SA2 ln(T) α(Reg(π) + C) + O α (s, a) + H5SA2 ln(T) α(Reg(π) + C) + O H4S2A ln(T) α min + H2S4A2 ln(T) Combining term1 and term2, we get Reg(π) 2αReg(π) + 2α(C + C(π)) + O H4S2A ln(T)ι α min + H2S4A2 ln(T)ι Best of Both Worlds Policy Optimization Picking α = min 1 4, (C + C(π)) 1 2 H4S2A ln(T )ι 2 leads to the desired bound. Proof of Theorem 4.3. v u u tln2(T)E t=1 (It(s, a) πt(a|s)It(s))2L2 t,h(s) + H3S2A2 ln(T) ln(SAT) In the adversarial regime, v u u t HSA ln2(T)E s,a (It(s, a) πt(a|s)It(s))2Lt,h(s) + H3S2A2 ln(T) ln(SAT) v u u t HSA ln2(T)E s,a It(s, a)Lt,h(s) + H3S2A2 ln(T) ln(SAT) v u u t H2SA ln2(T)E t=1 V πt(s0; ℓt) + H3S2A2 ln(T) ln(SAT) On the other hand, Reg(π) = E h PT t=1 V πt(s0; ℓt) PT t=1 V π(s0; ℓt) i . Solving the inequality, we get v u u t H2SA ln2(T) t=1 V π(s0; ℓt) + H3S2A2 ln(T) ln(SAT). In the stochastic regime, v u u tln2(T)E t=1 (It(s, a) πt(a|s)It(s))2L2 t,h(s) + H3S2A2 ln(T) ln(SAT) v u u t H2 ln2(T)E t=1 µπt(s)πt(a|s)(1 πt(a|s)) + H3S2A2 ln(T) ln(SAT), which is similar to the stochastic bound in Theorem 4.1. Following the same self-bounding analysis in the proof of Theorem 4.1 we can get the desired bound. To get regret bounds for the Shannon entropy version under known and unknown transitions, we use Lemma 6.2 and Lemma 6.4 and follow exactly the same procedure as in the proofs of Theorem 4.1 and Theorem 4.2. This leads to the following guarantees: Theorem H.1. Under known transitions, Algorithm 1 with Shannon entropy regularizer ensures for any π H3SAT ln3(T) + poly(H, S, A) ln2(T) in the adversarial case, and UC + poly(H, S, A) ln2(T) in the stochastic case, where U = P s P a =π (s) H2 ln3(T ) Best of Both Worlds Policy Optimization Theorem H.2. Under unknown transitions, Algorithm 1 with Shannon entropy regularizer ensures for any π H4S2AT ln2(T)ι + poly(H, S, A) ln(T)ι in the adversarial case, and Reg(π) U + p U(C + C(π)) + poly(H, S, A) ln(T)ι in the stochastic case, where U = H4S2A ln2(T )ι
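All of the stochastic-regime statements above end with the same self-bounding step: the regret is shown to satisfy an inequality of the form Reg ≤ 2α·Reg + 2α(C + C(π)) + O(V/α + W) for every admissible α, after which α is tuned against the unknown corruption level. The sketch below carries out only this last tuning step numerically; `V`, `C`, and `W` are placeholders for the gap-dependent term, the corruption, and the lower-order term, not the paper's exact constants.

```python
import numpy as np

def tune_self_bounding(V, C, W, alphas=None):
    """Numerically tune alpha in a self-bounding inequality of the shape
        Reg <= 2*alpha*Reg + 2*alpha*C + V/alpha + W,   alpha in (0, 1/4],
    i.e.  Reg <= (2*alpha*C + V/alpha + W) / (1 - 2*alpha).
    Returns the best bound over a grid of alpha values; it scales as
    O(V + sqrt(V*C) + W).  Placeholder constants only; this illustrates the
    tuning step, not the paper's exact inequality.
    """
    if alphas is None:
        alphas = np.geomspace(1e-8, 0.25, 4000)
    bounds = (2 * alphas * C + V / alphas + W) / (1 - 2 * alphas)
    return float(bounds.min())

# With V = 1e3 (gap-dependent log term), C = 1e5 (corruption), W = 1e2,
# tune_self_bounding(1e3, 1e5, 1e2) is on the order of V + sqrt(V * C) + W.
```

The minimizer reproduces the U + sqrt(U(C + C(π))) + poly(H, S, A) · polylog(T) shape of the stochastic-regime bounds stated above.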