# Best of Both Worlds Policy Optimization

Christoph Dann¹, Chen-Yu Wei², Julian Zimmert¹

¹Google Research. ²MIT Institute for Data, Systems, and Society. Correspondence to: Chen-Yu Wei. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract: Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built a theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus, and the learning rates, one can achieve a more favorable $\mathrm{polylog}(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. We then show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.

1. Introduction

Policy optimization methods have seen great empirical success in various domains (Schulman et al., 2017; Levine & Koltun, 2013). An appealing property of policy optimization methods is their local-search nature, which lends itself to an efficient implementation because a search over the whole MDP is avoided. However, this property also makes it difficult to obtain global optimality guarantees for these algorithms, and a large portion of the literature postulates strong and often unrealistic assumptions to ensure global exploration (see e.g., Abbasi-Yadkori et al., 2019; Agarwal et al., 2020b; Neu & Olkhovskaya, 2021; Wei et al., 2021). Recently, the need for extra assumptions has been overcome by adding exploration bonuses to the update (Cai et al., 2020; Shani et al., 2020; Agarwal et al., 2020a; Zanette et al., 2020; Luo et al., 2021). These works demonstrate an additional robustness property of policy optimization, which is able to handle adversarial losses or some level of corruption. Luo et al. (2021) and Chen et al. (2022) even managed to obtain the optimal $\sqrt{T}$ regret.

However, when the losses are in fact stochastic, the $\sqrt{T}$ minimax regret is often overly pessimistic, and $\log(T)$ regret with problem-dependent factors is the optimal rate (Lai et al., 1985). Recently, Jin et al. (2021) obtained a best-of-both-worlds algorithm that automatically adapts to the nature of the environment, a method which relies on FTRL with a global regularizer over the occupancy measure. In this work, we show that by properly assigning the bonus and tuning the learning rates, policy optimization can also achieve the best of both worlds, which gives a more computationally favorable solution than Jin et al. (2021) for the same setting. Specifically, we show that policy optimization with a Tsallis entropy or Shannon entropy regularizer achieves $\sqrt{T}$ regret in the adversarial regime and $\mathrm{polylog}(T)$ regret in the stochastic regime. The $\sqrt{T}$ can further be improved to $\sqrt{L^\star}$ if the transition is known and a log-barrier regularizer is used, where $L^\star$ is the cumulative loss of the best policy.
Though the corresponding results in multi-armed bandits have been well studied, new challenges arise in the MDP setting which require a non-trivial design of the exploration bonus and the learning rate schedule. The techniques we develop to address these issues constitute the main contribution of this work.

2. Related Work

For multi-armed bandits, the question of whether a single algorithm can achieve near-optimal regret bounds in both the adversarial and the stochastic regimes was first asked by Bubeck & Slivkins (2012). A series of follow-up works refined the bounds through different techniques (Seldin & Slivkins, 2014; Auer & Chiang, 2016; Seldin & Lugosi, 2017; Wei & Luo, 2018; Zimmert & Seldin, 2019; Ito, 2021). One of the most successful approaches was developed by Wei & Luo (2018); Zimmert & Seldin (2019); Ito (2021), who demonstrated that a simple Online Mirror Descent (OMD) or Follow the Regularized Leader (FTRL) algorithm, originally designed only for the adversarial case, is able to achieve the best of both worlds. This approach has been adopted in a wide range of problems including semi-bandits (Zimmert et al., 2019), graph bandits (Erez & Koren, 2021; Ito et al., 2022), partial monitoring (Tsuchiya et al., 2022), multi-armed bandits with switching costs (Rouyer et al., 2021; Amir et al., 2022), tabular MDPs (Jin & Luo, 2020; Jin et al., 2021), and others. Though developed under a similar framework, each of these works addresses new challenges that arise in its specific setting.

Previous works that achieve the best of both worlds in tabular MDPs (Jin & Luo, 2020; Jin et al., 2021) are based on FTRL over the occupancy measure space. This approach has several shortcomings, making it less favorable in practice. First, the feasible set of occupancy measures depends on the transition kernel, so the extension to a model-free version is difficult. Second, since the occupancy measure space is a general convex set that may change over time as the learner gains more knowledge about transitions, it requires solving a different convex program in each round. In contrast, policy optimization is easier to extend to settings where transitions are hard to learn, and it is computationally simple: in tabular MDPs, it is equivalent to running an individual multi-armed bandit algorithm on each state.

Due to its local-search nature, exploration under policy optimization is non-trivial, especially when coupled with bandit feedback and adversarial losses. In a simpler setting where the learner receives full-information loss feedback, He et al. (2022); Cai et al. (2020) showed $\sqrt{T}$ regret for linear mixture MDPs using policy optimization. In another simpler setting where the loss is stochastic, Agarwal et al. (2020a); Zanette et al. (2021) showed $\mathrm{poly}(1/\epsilon)$ sample complexity for linear MDPs. The work by Shani et al. (2020) first studied policy optimization with bandit feedback and adversarial losses, and obtained a $T^{2/3}$ regret bound for tabular MDPs. Luo et al. (2021) improved it to the optimal $\sqrt{T}$, and provided extensions to linear-Q and linear MDPs. In this work, we demonstrate another strength of policy optimization by showing a best-of-both-worlds regret bound in tabular MDPs. To our knowledge, this is also the first time a gap-dependent $\mathrm{polylog}(T)$ regret bound is shown for policy optimization. We also note that a first-order bound has been shown for adversarial MDPs by Lee et al. (2020).
Their algorithm is based on regularization over the occupancy measure, and does not rely on knowledge of the transition kernel. On the other hand, our first-order bound currently relies on the learner knowing the transitions. Whether it can be achieved under unknown transitions is an open question.

3. Notation and Setting

Notation. For $f \in \mathbb{R}$ and $g \in \mathbb{R}_+$, we use $f \lesssim g$ or $f = O(g)$ to mean that $f \leq c\,g$ for some absolute constant $c > 0$. $[x]_+ = \max\{x, 0\}$. $\Delta(X)$ denotes the probability simplex over the set $X$.

3.1. MDP setting

We consider episodic fixed-horizon MDPs. Let $T$ be the total number of episodes. The MDP is described by a tuple $(\mathcal{S}, \mathcal{A}, H, P, \{\ell_t\}_{t=1}^T)$, where $\mathcal{S}$ is the state set, $\mathcal{A}$ is the action set, $H$ is the horizon length, $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition kernel so that $P(s'|s,a)$ is the probability of moving to state $s'$ after taking action $a$ in state $s$, and $\ell_t: \mathcal{S} \times \mathcal{A} \to [0,1]$ is the loss function in episode $t$. We define $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$, which are both assumed to be finite. Without loss of generality, we assume $A \leq T$.

A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ describes how the player interacts with the MDP, with $\pi(\cdot|s) \in \Delta(\mathcal{A})$ being the action distribution the player uses to select actions in state $s$. If for all $s$, $\pi(\cdot|s)$ is only supported on one action, we call $\pi$ a deterministic policy, and we abuse the notation $\pi(s) \in \mathcal{A}$ to denote the action $\pi$ chooses in state $s$. Without loss of generality, we assume that the state space can be partitioned into $H+1$ disjoint layers $\mathcal{S} = \mathcal{S}_0 \cup \mathcal{S}_1 \cup \cdots \cup \mathcal{S}_H$, and transitions are only possible from one layer to the next (i.e., $P(\cdot|s,a)$ is only supported on $\mathcal{S}_{h+1}$ if $s \in \mathcal{S}_h$).¹ Without loss of generality, we assume that $\mathcal{S}_0 = \{s_0\}$ (initial state) and $\mathcal{S}_H = \{s_H\}$ (terminal state). Also, since there is at least one state on each layer, it holds that $H \leq S$. Let $h(s)$ denote the layer where state $s$ lies.

The environment decides $P$ and $\{\ell_t\}_{t=1}^T$ ahead of time. In episode $t$, the learner decides on a policy $\pi_t$. Starting from the initial state $s_{t,0} = s_0$, the learner repeatedly draws action $a_{t,h}$ from $\pi_t(\cdot|s_{t,h})$ and transitions to the next state $s_{t,h+1} \in \mathcal{S}_{h+1}$ following $s_{t,h+1} \sim P(\cdot|s_{t,h}, a_{t,h})$, until it reaches the terminal state $s_{t,H} = s_H$. The learner receives $\{\ell_t(s_{t,h}, a_{t,h})\}_{h=0}^{H-1}$ at the end of episode $t$.

For a policy $\pi$ and a loss function $\ell$, we define $V^\pi(s_H; \ell) = 0$ and recursively define
$$Q^\pi(s,a;\ell) = \ell(s,a) + \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[V^\pi(s';\ell)\right], \qquad V^\pi(s;\ell) = \sum_{a \in \mathcal{A}} \pi(a|s)\, Q^\pi(s,a;\ell), \qquad (1)$$
which are the standard state-action value function and state value function under policy $\pi$ and loss function $\ell$. The learner's regret with respect to a policy $\pi$ is defined as
$$\mathrm{Reg}(\pi) = \sum_{t=1}^T \left(V^{\pi_t}(s_0;\ell_t) - V^{\pi}(s_0;\ell_t)\right).$$

¹This setting follows previous work on adversarial MDPs (Jin et al., 2020; Luo et al., 2021).

3.2. Known and unknown transition

Following Jin & Luo (2020); Jin et al. (2021), we consider both the scenario where the learner knows the transition kernel $P$ and the scenario where it is unknown. The empirical transition is defined by
$$\widehat{P}_t(s'|s,a) = \frac{n_t(s,a,s')}{n_t(s,a)},$$
where $n_t(s,a)$ is the number of visits to $(s,a)$ prior to episode $t$, and $n_t(s,a,s')$ is the number of visits to $s'$ right after visiting $(s,a)$, prior to episode $t$. If $n_t(s,a) = 0$, we define $\widehat{P}_t(\cdot|s,a)$ to be uniform over the states on layer $h(s)+1$. In the unknown transition case, we define the confidence set of the transition:
$$\mathcal{P}_t = \left\{ \widetilde{P} \,:\, \forall h,\ (s,a) \in \mathcal{S}_h \times \mathcal{A},\ \ \widetilde{P}(\cdot|s,a) \in \Delta(\mathcal{S}_{h+1}),\ \ \left|\widetilde{P}(s'|s,a) - \widehat{P}_t(s'|s,a)\right| \leq 2\sqrt{\frac{\widehat{P}_t(s'|s,a)\,\iota}{n_t(s,a)}} + \frac{14\iota}{3\, n_t(s,a)} \right\} \qquad (2)$$
where $\iota = \ln(SAT/\delta)$. As shown in Jin & Luo (2020), $P \in \bigcap_{t=1}^T \mathcal{P}_t$ with probability at least $1 - 4\delta$.
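To make the construction of $\mathcal{P}_t$ in (2) concrete, the following is a minimal sketch of the empirical transition estimate and the per-entry confidence width, assuming a dictionary-of-counters representation; the class and method names (`TransitionConfidence`, `width`, and so on) are illustrative and not from the paper.

```python
import math
from collections import defaultdict

class TransitionConfidence:
    """Empirical transition estimate and confidence widths, following (2)."""

    def __init__(self, S, A, T, delta):
        self.iota = math.log(S * A * T / delta)   # iota = ln(SAT / delta)
        self.n_sa = defaultdict(int)              # n_t(s, a): visits to (s, a)
        self.n_sas = defaultdict(int)             # n_t(s, a, s'): visits to s' right after (s, a)

    def update(self, s, a, s_next):
        """Record one observed transition (s, a) -> s_next."""
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def p_hat(self, s, a, s_next, next_layer):
        """Empirical probability; uniform over the next layer if (s, a) has never been visited."""
        n = self.n_sa[(s, a)]
        if n == 0:
            return 1.0 / len(next_layer)
        return self.n_sas[(s, a, s_next)] / n

    def width(self, s, a, s_next, next_layer):
        """Half-width of the confidence interval around p_hat used to define P_t in (2)."""
        n = self.n_sa[(s, a)]
        if n == 0:
            return 1.0   # unvisited (s, a): treat the constraint as vacuous in this sketch
        p = self.p_hat(s, a, s_next, next_layer)
        return 2.0 * math.sqrt(p * self.iota / n) + 14.0 * self.iota / (3.0 * n)
```

A kernel $\widetilde{P}$ then belongs to $\mathcal{P}_t$ if, for every $(s,a)$ on layer $h$, $\widetilde{P}(\cdot|s,a)$ is supported on $\mathcal{S}_{h+1}$ and each entry deviates from `p_hat` by at most `width`.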
Throughout the paper, we use $\delta = \frac{1}{T^3}$.

For an arbitrary transition kernel $\widetilde{P}$, define $\mu^{\widetilde{P},\pi}(s,a) = \sum_{h=0}^{H-1} \Pr(s_h = s, a_h = a \mid \pi, \widetilde{P})$, where $\Pr(\cdot \mid \pi, \widetilde{P})$ denotes the probability measure induced by policy $\pi$ and transition kernel $\widetilde{P}$. Furthermore, define $\mu^{\widetilde{P},\pi}(s) = \sum_a \mu^{\widetilde{P},\pi}(s,a)$. We write $\mu^\pi(s) = \mu^{P,\pi}(s)$ and $\mu^\pi(s,a) = \mu^{P,\pi}(s,a)$ where $P$ is the true transition. Define the upper and lower confidence measures as $\overline{\mu}^\pi_t(s) = \max_{\widetilde{P} \in \mathcal{P}_t} \mu^{\widetilde{P},\pi}(s)$ and $\underline{\mu}^\pi_t(s) = \min_{\widetilde{P} \in \mathcal{P}_t} \mu^{\widetilde{P},\pi}(s)$. Finally, define $V^{\widetilde{P},\pi}(s;\ell)$ and $Q^{\widetilde{P},\pi}(s,a;\ell)$ analogously to (1), with the transition kernel replaced by $\widetilde{P}$.

3.3. Adversarial versus stochastic regimes

We analyze our algorithm in two regimes: the adversarial regime and the stochastic regime. In both regimes, the transition $P$ is fixed throughout all episodes. In the adversarial regime, the loss functions $\{\ell_t\}_{t=1}^T$ are determined arbitrarily ahead of time. In the stochastic regime, the $\ell_t$ are generated randomly, and there exist a deterministic policy $\pi^\star$, a gap function $\Delta: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{\geq 0}$, and $\{\lambda_t(\pi)\}_{t,\pi} \subset \mathbb{R}$ such that for any policy $\pi$ and any $t$,
$$\mathbb{E}\left[V^\pi(s_0;\ell_t) - V^{\pi^\star}(s_0;\ell_t)\right] \geq \sum_{s} \sum_{a \neq \pi^\star(s)} \mu^\pi(s,a)\,\Delta(s,a) - \lambda_t(\pi).$$
If $\lambda_t(\pi) \leq 0$ for all $\pi$, the condition above certifies that $\pi^\star$ is the optimal policy in episode $t$, and every time $\pi$ visits state $s$ and chooses an action $a \neq \pi^\star(s)$, the incurred regret against $\pi^\star$ is at least $\Delta(s,a)$. The amount $[\lambda_t(\pi)]_+$ thus quantifies how much the condition above is violated. The stochastic regime captures the standard RL setting (i.e., $\{\ell_t\}$ are i.i.d.) with $\lambda_t(\pi) = 0$ and $\Delta(s,a) = \mathbb{E}\left[Q^{\pi^\star}(s,a;\ell_t) - V^{\pi^\star}(s;\ell_t)\right]$. Define $\Delta_{\min} = \min_s \min_{a \neq \pi^\star(s)} \Delta(s,a)$. Also, define $C = \mathbb{E}\left[\sum_{t=1}^T \lambda_t(\pi_t)\right]_+$ and $C(\pi) = \sum_{t=1}^T \lambda_t(\pi)$.

4. Main Results and Techniques Overview

Our main results with Tsallis entropy and log-barrier regularizers are the following (see Section 6 and Appendix H for results with Shannon entropy):

Theorem 4.1. Under known transitions, Algorithm 1 with the Tsallis entropy regularizer ensures for any $\pi$,
$$\mathrm{Reg}(\pi) \lesssim \sqrt{H^3 S A T \ln(T)} + \mathrm{poly}(H,S,A)\ln(T)$$
in the adversarial regime, and
$$\mathrm{Reg}(\pi) \lesssim U + \sqrt{U C} + \mathrm{poly}(H,S,A)\ln(T) \qquad (3)$$
in the stochastic regime, where $U = \sum_s \sum_{a \neq \pi^\star(s)} \frac{H^2 \ln(T)}{\Delta(s,a)}$.

Our bounds in both regimes are similar to those of Jin et al. (2021) up to the definition of $U$, under their parameter choice $\gamma = \frac{1}{H}$ (tuning $\gamma$ trades their bounds between the two regimes; see their Appendix A.3). Compared with our definition of $U$, theirs involves an additional additive term $\frac{\mathrm{poly}(H)\,S \ln(T)}{\Delta_{\min}}$ even under the assumption that the optimal action is unique on all states.²

Theorem 4.2. Under unknown transitions, Algorithm 1 with the Tsallis entropy regularizer ensures for any $\pi$,
$$\mathrm{Reg}(\pi) \lesssim \sqrt{H^4 S^2 A T \ln(T)\,\iota} + \mathrm{poly}(H,S,A)\ln(T)\,\iota$$
in the adversarial regime, and
$$\mathrm{Reg}(\pi) \lesssim U + \sqrt{U\,(C + C(\pi))} + \mathrm{poly}(H,S,A)\ln(T)\,\iota \qquad (4)$$
in the stochastic regime, where $U = \frac{H^4 S^2 A \ln(T)\,\iota}{\Delta_{\min}}$ and $\iota = \ln(SAT)$.

²A lower bound in Xu et al. (2021) shows that an $S/\Delta_{\min}$ dependence is inevitable even when the transition is known. However, this lower bound only holds when there exist multiple optimal actions on $\Omega(S)$ of the states, while our gap-dependent bound is finite only when the optimal action is unique on all states. Therefore, our upper bound does not violate their lower bound.

In Jin et al. (2021), for the stochastic case under unknown transitions, a guarantee similar to (4) is proven only for $\pi = \pi^\star$, with the case of general $\pi$ left open. We generalize their result by resolving some technical difficulties in their analysis. Overall, our bound in the stochastic regime improves that of Jin et al.
(2021), and the bound in the adversarial regime matches that of Luo et al. (2021). Notice that comparing (4) with (3), the bound under unknown transition involves an additional term C(π). It remains open whether it can be removed. Finally, we provide a first-order best-of-both-world result under known transition. Theorem 4.3. Under known transitions, Algorithm 1 with log barrier regularizer ensures for any π, Reg(π) v u u t H2SA t=1 V π(s0; ℓt) ln2(T) + poly(H, S, A) ln2(T) in the adversarial regime, and UC + poly(H, S, A) ln2(T) in the stochastic regime, where U = P a =π (s) H2 ln2(T ) In the next two subsections, we overview the techniques we used and challenges we faced in obtaining our results. 4.1. Exploration bonus for policy optimization In the tabular case, a policy optimization algorithm can be viewed as running an individual bandit algorithm on each state. Our algorithm is built upon the policy optimization framework developed by Luo et al. (2021), who achieve near-optimal worst-case regret in adversarial MDPs. Their key idea is summarized in the next lemma. Lemma 4.4 (Lemma B.1 of Luo et al. (2021)). Suppose that for some {bt}T t=1 and {Pt}T t=1, where each bt : S R 0 is a non-negative bonus function and each Pt is a set of transitions, it holds that Bt(s, a) = bt(s)+ 1 + 1 max e P Pt Es e P ( |s,a),a πt( |s ) [Bt(s , a )]. (5) Also, suppose that the following holds for a policy π and a function Xπ : S R: a (πt(a|s) π(a|s)) (Qπt(s, a; ℓt) Bt(s, a)) t=1 bt(s) + 1 a πt(a|s)Bt(s, a) Then Reg(π) is upper bounded by s µπ(s)Xπ(s) + 3E t=1 V e Pt,πt(s0; bt) where e Pt is the e P that attains the maximum in (5), and F = HTI{ t [T], P / Pt}. The intuition about Lemma 4.4 is explained below (the reader may also refer to Section 3 of Luo et al. (2021) for a more complete explanation. We start by analyzing a vanilla policy optimization algorithm without adding bonus (i.e., feeding b Qt(s, a), an estimator of Qπt(s, a; ℓt), to the bandit algorithm on state s). By the value difference lemma (Kakade & Langford, 2002) and standard analysis for the mirror descent algorithm, we run into the following form of regret: t (V πt(s0; ℓt) V π(s0; ℓt)) t,a (πt(a|s) π(a|s)) Qπt(s, a; ℓt) s µπ(s)Xπ(s) + X t V π(s0; bt) where Xπ(s)+P t bt(s) is the regret bound of the bandit algorithm on state s, with Xπ(s) related to the regularization penalty, and bt(s) related to the stability of the algorithm. Specifically, in the known transition case, the standard choice is to use b Qt(s, a) = Lt(s,a)I{(s,a) is visited in episode t} µπt(s,a) as an unbiased loss estimator for Qπt(s, a; ℓt), where Lt(s, a) is the cumulative loss starting from state-action (s, a) in episode t. Using exponential weights with learning rate η on every state, one can derive a regret bound of (8) with Xπ(s) = ln A η and bt(s) = O ηH2 µπt(s) . This makes the quantity V π(s0; bt) involve the distribution mismatch coefficient P s µπ(s) µπt(s) (Agarwal et al., 2020b; Wei et al., 2020) that can be prohibitively large. Best of Both Worlds Policy Optimization On the other hand, an observation is that the problematic quantity V π(s0; bt) is nicely bounded if π is πt. This motivates (Luo et al., 2021) to use ℓt(s, a) bt(s) as the loss, where bt(s) can be viewed as a bonus term that encourages the learner to visit states that have been seldom visited before. To see why this works, assume for a moment that the regret bound still roughly holds when we replace the loss ℓt by ℓt bt. 
Then similar to (8), we get t=1 (V πt(s0; ℓt bt) V π(s0; ℓt bt)) s µπ(s)Xπ(s) + t=1 V π(s0; bt) which implies t=1 (V πt(s0; ℓt) V π(s0; ℓt)) s,a µπ(s)Xπ(s) + t=1 V πt(s0; bt) by rearranging and the linearity of the value function V π(s0; ℓt bt) = V π(s0; ℓt) V π(s0; bt) for any π. Now since the regret only involves V πt(s0; bt), there will be no distribution mismatch coefficient in the regret bound. The caveat of the discussion above is the assumption of (9). After adding the bonus bt, the original regret bound can be affected, that is, the bt on the right-hand side of (9) can be something larger, breaking the desired cancellation effect to achieve (10). To resolve this issue, Luo et al. (2021) proposed to use the dilated bonus defined in (5) in the policy optimization update. In (5), the bonus-to-go function is not constructed through a standard Bellman equation, but through a dilated version that includes an additional 1 + 1 H factor for future steps. The additional amount of bonus-togo can be used to cancel the additional regret due to the inclusion of bt. Lemma 4.4 gives a general recipe to design the exploration bonus for policy optimization algorithms. Roughly speaking, the bonus function bt(s) is chosen to be the instantaneous regret of the bandit algorithm on state s, which scales inversely with the probability of visiting state s (i.e., 1 µπt(s)). Lemma 4.4 suggests that the bandit algorithm on state s should update itself using Qπt(s, a; ℓt) Bt(s, a) as the loss, where Bt(s, a) is the dilated bonus-to-go. The bonus function bt(s) we use is slightly different from that in Luo et al. (2021) though. We notice that the bt(s) defined in Luo et al. (2021) has two parts: the first part is FTRL regret overhead, which comes from the regret bound of the FTRL algorithm under the given loss estimator, and the second part comes from the estimation error in estimating the transition kernel. In order to apply the self-bounding technique to obtain the best-of-both-worlds result, the second term in (7) can only involve the the first part (FTRL regret overhead) but not the second part (estimation error). Therefore, we split their bonus into two: our bt(s) only includes the first part, and ct(s) only includes the second part. This allows us to use self-bounding on the second term in (7). Our ct(s) goes to the first term in (7) instead and is handled differently from Luo et al. (2021). More details are given in Section 5 and Section 6. 4.2. Adaptive learning rate tuning and bonus design Our algorithm heavily relies on carefully tuning the learning rates and assigning a proper amount of bonus. These two tasks are intertwined with each other and introduce new challenges that are not seen in the global regularization approach (Jin et al., 2021) or policy optimization approach that only aims at a worst-case bound (Luo et al., 2021). Below we give a high-level overview for the challenges. In the FTRL analysis, a major challenge is to handle losses that are overly negative3. Typically, if the learning rate is η and the negative loss of action a has a magnitude of R, we need ηp(a)βR 1 in order to keep the algorithm stable, where p(a) is the probability of choosing action a, and β [0, 1] is a parameter related to the choice of the regularizer ( 1 2 for Tsallis entropy, 0 for Shannon entropy, and 1 for log barrier). In our case, there are two places we potentially encounter overly negative losses. One is when applying the standard loss-shifting technique for best-ofboth-world bounds (see Jin et al. (2021)). 
The loss-shifting effectively creates a negative loss in the analysis. The other overly negative loss is the bonus we use to obtain the firstorder bound. For the first case, we develop a simple trick that only performs loss-shifting when the introduced negative loss is not too large, and further show that the extra penalty due to not performing loss-shifting is well-controlled. This is explained in Section 5.1. For the second case, we develop an even more general technique (which can also cover the first case). This technique can be succinctly described as inserting virtual episodes when ηp(a)βR is potentially too large. In virtual episodes, the losses are assumed to be all-zero (because the learner actually does not interact with the environment in these episodes) and the algorithm only updates over some bonus term. The goal of the virtual episodes is solely to tune down the learning rate η and prevent ηp(a)βR from being to large in real episodes. Similarly, we are able to show that the extra penalty due to virtual episodes is well-controlled. This is explained in Section 5.3. 3Losses here refer not only to the loss from the environment, but also loss estimators or bonuses constructed by the algorithm. Best of Both Worlds Policy Optimization Algorithm 1 Policy Optimization Define: ψt(π; s), bt(s) are defined according to Figure 1, γt min n 106H4A2 for t = 1, 2, . . . do πt( |s) = argmin π (A) a π(a) b Qτ(s, a) Bτ(s, a) Cτ(s, a) + ψt(π; s) Add a virtual round if needed (only when aiming to get a first-order bound with log barrier see Section 5.3 (29)). Execute πt in episode t, and receive {ℓt(st,h, at,h)}H 1 h=0 . Define Pt: Under known transition, define Pt = {P}. Under unknown transition, define Pt by (2). Define b Qt: For s Sh, let It(s, a) = I{(st,h, at,h) = (s, a)}, Lt,h = PH 1 h =h ℓt(st,h , at,h ), and b Qt(s, a) = It(s, a)Lt,h µt(s)πt(a|s), where µt(s) = µπt t (s) + γt. (12) Define Ct: Let ct(s) = µt(s) µπt t (s) µt(s) H, and compute Ct(s, a) by Ct(s, a) = max e P Pt Es e P ( |s,a),a πt( |s ) h ct(s ) + Ct(s , a ) i , (13) Define Bt: Compute Bt(s, a) by (5) using the bt(s) defined in Figure 1. 5. Algorithm The template of our algorithm is Algorithm 1, in which we can plug different regularizers. The template applies to both known transition and unknown transition cases the only difference is in the definition of the confidence set Pt. The policy update (11) is equivalent to running individual FTRL on each state with an adaptive learning rate. The loss estimator b Qt(s, a) defined in (12) is similar to that in Luo et al. (2021): if (s, a) is visited, it is the cumulative loss starting from (s, a) divided by the upper occupancy measure (Jin et al., 2020) of (s, a); otherwise it is zero. One difference is that the implicit exploration factor γt added to the denominator is of order 1 t in our case, while it is of order 1 t in Luo et al. (2021). This smaller γt allows us to achieve logarithmic regret in the stochastic regime. There are two bonus functions ct(s) and bt(s) defined in (13) and Figure 1, respectively. As discussed in Section 4.1, the bonus functions are defined to be the instantaneous regret of the bandit algorithm on state s. The first bonus function ct(s) comes from the bias of the loss estimator. Our choice of ct(s) is such that a, Qπt(s, a; ℓt) E[ b Qt(s, a)] ct(s). The second bonus function bt(s) is related to the regret of the FTRL algorithm under the given loss estimator, which is regularizer dependent. 
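As a concrete illustration of the two recursions, the following is a minimal sketch of the backward dynamic programming that produces $C_t(s,a)$ from (13) and the dilated bonus $B_t(s,a)$ from (5), given the per-state bonuses $c_t$ and $b_t$ (whose regularizer-specific form is given in Figure 1) and specialized to the known-transition case $\mathcal{P}_t = \{P\}$, so that the maximum over the confidence set disappears; the data-structure conventions and the function name are illustrative assumptions.

```python
def dilated_bonuses(layers, P, pi, c, b, H):
    """Backward DP for C_t(s, a) as in (13) and the dilated bonus B_t(s, a) as in (5),
    specialized to a known transition kernel P (the max over the confidence set is dropped).

    layers: layers[h] is the list of states in S_h, for h = 0, ..., H
    P[s][a]: dict mapping next state s' -> P(s'|s, a)
    pi[s]:   dict mapping action a -> pi_t(a|s)   (terminal states are absent from pi)
    c, b:    dicts mapping state s -> c_t(s), b_t(s)
    """
    C, B = {}, {}

    def next_expectation(s, a, table, extra=None):
        # E_{s' ~ P(.|s,a), a' ~ pi_t(.|s')} [ extra(s') + table(s', a') ], zero beyond the terminal layer
        total = 0.0
        for s2, p in P[s][a].items():
            base = extra.get(s2, 0.0) if extra is not None else 0.0
            if s2 in pi:
                total += p * sum(pi[s2][a2] * (base + table[(s2, a2)]) for a2 in pi[s2])
            else:
                total += p * base   # terminal state: no actions, bonus-to-go is 0
        return total

    for h in reversed(range(H)):    # process layers S_{H-1}, ..., S_0 backwards
        for s in layers[h]:
            for a in pi[s]:
                C[(s, a)] = next_expectation(s, a, C, extra=c)                   # recursion (13)
                B[(s, a)] = b[s] + (1.0 + 1.0 / H) * next_expectation(s, a, B)   # recursion (5)
    return C, B
```

In the unknown-transition case, the expectations above would be replaced by maxima over $\widetilde{P} \in \mathcal{P}_t$, as written in (5) and (13).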
We will elaborate how to choose bt(s) for different regularizers later in this section. Finally, dynamic programming are used to obtain Ct(s, a) and Bt(s, a), which are trajectory sums of ct(s) and bt(s), with an (1 + 1 H ) dilation on Bt(s, a). They are then used in the policy update (11). In the following subsections, we discuss how we choose bt(s) and tune the learning rate for each regularizer. 5.1. Tsallis entropy bt(s) corresponds to the instantaneous regret of the bandit algorithm on state s under the given loss estimator. To obtain its form, we first analyze the regret assuming Bt(s, a) is not included, i.e., only update on b Qt(s, a) Ct(s, a) (Bt(s, a) will be added back for analysis after the form of bt(s) is decided). Inspired by Zimmert & Seldin (2019) for multi-armed bandits, our target is to show that the instantaneous regret (see Appendix D for details) on state s is upper bounded by 1 ηt(s) 1 ηt 1(s) ξt(s) | {z } penalty term + H2ηt(s)ξt(s) µt(s) | {z } stability term +νt(s) (26) where ξt(s) = P πt(a|s)(1 πt(a|s)) A, and νt(s) is some overhead due to the inclusion of Ct(s, a). The factor ξt(s) allows us to use the self-bounding technique that leads to best-of-both-worlds bounds, which cannot be relaxed to A in general. Compared to the bound for multiarmed bandits in (Zimmert & Seldin, 2019), the extra 1 µt(s) scaling in the stability term comes from importance weighting because state s is visited with probability roughly µt(s). Best of Both Worlds Policy Optimization Figure 1. Definitions of ψt(π; s) and bt(s) for different regularizers (to be used in Algorithm 1). Tsallis entropy: ψt(π; s) = 2 ηt(s) bt(s) = 4 1 ηt(s) 1 ηt 1(s) µt(s) > 1 8H + νt(s), (15) A + 4H q Pt τ=1 1 µτ (s) , ξt(s) = X πt(a|s)(1 πt(a|s)), νt(s) = 8ηt(s) X a πt(a|s)Ct(s, a)2. (16) Shannon entropy: ψt(π; s) = X 1 ηt(s, a)π(a) ln π(a), (17) bt(s) = 8 X 1 ηt(s, a) 1 ηt 1(s, a) ξt(s, a) + 1 minτ [t] µτ(s) minτ [t 1] µτ(s) + νt(s), (18) 1 ηt(s, a) = 1 ηt 1(s, a) + 4 µt(s) q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) + H ln T, with 1 η0(s, a) = 1600H4A ξt(s, a) = min{πt(a|s) ln(T), 1 πt(a|s)}, νt(s) = 8 X a ηt(s, a)πt(a|s)Ct(s, a)2. (20) Log barrier (for first-order bound under known transition): ψt(π; s) = X 1 ηt(s, a) ln 1 π(a), (21) bt(s) = 8 X 1 ηt+1(s, a) 1 ηt(s, a) log(T) + νt(s), (22) (s t, a t) = argmax s,a ηt(s, a) µt(s) (break tie arbitrarily) 1 ηt+1(s, a) = 1 ηt(s,a) + 4ηt(s,a)ζt(s,a) µt(s)2 log(T ) if t is a real episode 1 ηt(s,a) 1 + 1 24H log T if t is a virtual episode and (s t, a t) = (s, a) 1 ηt(s,a) if t is a virtual episode and (s t, a t) = (s, a) 1 η1(s, a) = 4H4, (24) ζt(s, a) = It(s, a) πt(a|s)It(s) 2L2 t,h where It(s) = X a It(s, a), (suppose that s Sh) νt(s) = 8 X a ηt(s, a)πt(a|s)Ct(s, a)2. (25) Best of Both Worlds Policy Optimization This desired bound suggests a learning rate scheduling of H q Pt τ=1 1 µτ (s) (27) to balance the penalty and the stability terms. This is exactly how we tune ηt(s) in (16). However, to obtain the ξt(s) factor in the stability term in (26), we need to perform lossshifting in the analysis, which necessitates the condition ηt(s)H µt(s) 1 as discussed in Section 4.2. From the choice of ηt(s) in (27), this condition may not always hold, but every time it is violated, ηt(s) is decreased by a relatively large factor in the next episode. Our strategy is that whenever the condition Hηt(s) µt(s) 1 is violated, we do not perform loss-shifting. This still allows us to prove a stability term of H2ηt(s) A µt(s) for that episode. 
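As a small illustration, the schedule (27) and the loss-shifting check just described can be sketched as follows; the actual schedule (16) carries additional regularizer-dependent constants, so this should be read as an assumption-laden sketch of the leading shape rather than the paper's exact tuning.

```python
import math

def tsallis_learning_rate(mu_history, H):
    """Leading shape of the schedule (27): eta_t(s) = 1 / (H * sqrt(sum_{tau <= t} 1 / mu_tau(s))),
    where mu_tau(s) is the occupancy-plus-gamma quantity used in (12).
    Assumes t >= 1 so that mu_history (the values mu_1(s), ..., mu_t(s)) is non-empty."""
    inv_sum = sum(1.0 / m for m in mu_history)
    return 1.0 / (H * math.sqrt(inv_sum))

def loss_shifting_allowed(eta_t_s, mu_t_s, H):
    """Loss shifting is performed in the analysis only when H * eta_t(s) / mu_t(s) <= 1;
    when the condition fails, loss shifting is skipped for that episode."""
    return H * eta_t_s / mu_t_s <= 1.0
```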
The key in the analysis is to show that the extra cost due to not performing loss-shifting is only logarithmic in T (see the proof of Lemma 6.3). Combining this idea with the instantaneous regret bound in (26) and the choice of ηt(s) in (27), we are able to derive the form of bt(s) in (15). After figuring out the form of bt(s) assuming Bt(s, a) is not incorporated in the updates, we incorporate it back and re-analyze the stability term. The extra stability term due to bt(s) leads to a separate quantity 1 H πt(a|s)Bt(s, a), which is an overhead allowed by (6). 5.2. Shannon entropy The design of bt(s) under Shannon entropy follows similar procedures as in Section 5.1, except that the tuning of the learning rate is inspired by Ito et al. (2022). One improvement over theirs is that we adopt coordinate-dependent learning rates that can give us a refined gap-dependent bound in the stochastic regime (in multi-armed bandits, this improves their maxa A (a) dependence to P a 1 (a)). With Shannon entropy, there is less learning rate tuning issue because its optimal learning rate decreases faster than other regularizers, and there is no need to perform loss-shifting (Ito et al., 2022). The regret bound under Shannon entropy is overall worse than that of Tsallis entropy by a ln2(T) factor. 5.3. Log barrier As shown by Wei & Luo (2018); Ito (2021), FTRL with a log barrier regularizer is also able to achieve the best of both worlds, with the additional benefit of having data-dependent bounds. In this subsection, we demonstrate the possibility of this by showing that under known transition, Algorithm 1 is able to achieve a first-order bound in the adversarial regime, while achieving polylog(T) regret in the stochastic regime. To get a first-order best-of-both-world bound with log barrier, inspired by Ito (2021), we need to prove the following instantaneous regret for the bandit algorithm on s: 1 ηt(s, a) 1 ηt 1(s, a) | {z } penalty term ηt(s, a)ζt(s, a) µt(s)2 | {z } stability term where ζt(s, a) = (It(s, a) π(a|s)It(s))2 L2 t,h for s Sh. This suggests a learning rate scheduling of 1/ηt+1(s, a) = 1/ηt(s, a) + ηt(s, a)ζt(s, a)/µt(s)2. Similar to the Tsallis entropy case, obtaining the desired stability term in (28) requires loss-shifting, so we encounter the same issue as before and can resolve it in the same way. With this choice of ηt(s, a), we can derive the desired form of bt(s) from (28). However, the magnitude of this bonus is larger than in the Tsallis entropy case because of the 1 µt(s)2 scaling here. Therefore, an additional problem arises: the Bt(s, a) derived from this bt(s) can be large that makes ηt(s, a)πt(a|s)Bt(s, a) > 1 H happen, which violates the condition under which we can bound the extra stability term (due to the inclusion of bt(s)) by 1 H πt(a|s)Bt(s, a). Notice that this was not an issue under Tsallis entropy. To resolve this, we note that ηt(s, a)πt(a|s)Bt(s, a) can be as large as poly(H, S) maxs ,a ηt(s ,a )2 µt(s )2 (Lemma E.3), so all we need is to make ηt(s,a) µt(s) 1 poly(H,S) for all s, a. Our solution is to insert virtual episodes when ηt(s,a) µt(s) is too large on some (s, a). In virtual episodes, the learner does not actually interact with the environment; instead, the goal is purely to tune down ηt(s, a). 
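The following sketch, under assumed data structures and with the triggering threshold left as a placeholder parameter, summarizes this virtual-episode mechanism; the precise check and the tune-down factor it uses are spelled out in the next paragraph and in (23).

```python
import math

def maybe_virtual_episode(eta, mu, H, T, threshold):
    """Sketch of the virtual-episode mechanism for the log-barrier regularizer.

    eta[(s, a)]: current per-coordinate learning rates; mu[s]: current mu_t(s).
    If max_{s,a} eta_t(s,a) / mu_t(s) exceeds the threshold (a poly(H, S) quantity in the paper,
    passed in here as a placeholder), the episode is declared virtual: the learner does not act,
    all losses are treated as zero, and only the learning rate at the arg-max pair is tuned down,
    by multiplying 1/eta by (1 + 1/(24 H log T)) as in (23).
    """
    s_star, a_star = max(eta, key=lambda sa: eta[sa] / mu[sa[0]])
    if eta[(s_star, a_star)] / mu[s_star] <= threshold:
        return False                                      # real episode: interact with the MDP as usual
    inv_eta = 1.0 / eta[(s_star, a_star)]
    inv_eta *= 1.0 + 1.0 / (24.0 * H * math.log(T))       # tune down eta(s*, a*) only
    eta[(s_star, a_star)] = 1.0 / inv_eta
    return True                                           # virtual episode: skip interaction this round
```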
To decide whether to insert a virtual episode, in episode t, after the learner computes πt( |s) on all states, he checks if max s,a ηt(s, a) If so, then episode t is made a virtual episode in which the losses are assumed to be zero everywhere.4 In a virtual episode, let (s t, a t) = argmaxs,a ηt(s,a) µt(s) , and we tune down ηt(s t, a t) by a factor of (1+ 1 24H log T ). Also, a bonus bt(s) is assigned to s t to reflect the increased penalty term on state s t due to the decrease in learning rate (by (28)). Combinig all the above, we get the bonus and learning rate specified in (22) and (23). Again, since every time a virtual episode happens, there exists some ηt(s, a) decreased by a significant factor, it cannot happen too many times. 4Inserting a virtual episode shifts the index of future real episodes. Since there are only O(HSA log2 T) virtual episodes, we still use T to denote the total number of episodes. Best of Both Worlds Policy Optimization 6. Sketch of Regret Analysis Our goal is to show (6) and bound the right-hand side of (7) (for all regularizers and known/unknown transitions). To show (6), for a fixed π, we do the following decomposition: t,a (πt(a|s) π(a|s)) (Qπt(s, a; ℓt) Bt(s, a)) (30) t,a (πt(a|s) π(a|s)) b Qt(s, a) Bt(s, a) Ct(s, a) | {z } ftrl-regπ(s) t,a (πt(a|s) π(a|s)) Qπt(s, a; ℓt) b Qt(s, a) + Ct(s, a) | {z } biasπ(s) The next lemma bounds the expectation of ftrl-regπ(s). Lemma 6.1. E [ftrl-regπ(s)] is upper bounded by O(H4SA ln(T)) + E t=1 bt(s) + 1 a πt(a|s)Bt(s, a) The proof of Lemma 6.1 is in Appendix E. Notice that depending on the regularizers and whether the transition is known/unknown, the definitions of bt(s) are different, so we prove it individually for each case. Combining (30) with Lemma 6.1, we see that the condition in Lemma 4.4 is satisfied with Xπ(s) = O(H4SA ln(T))+ E[biasπ(s)]. By Lemma 4.4, we can upper bound E[Reg(π)] by the order of H5SA ln(T) + E s µπ(s)biasπ(s) + t=1 V e Pt,πt(s0; bt) The next lemma bounds the bias part in (31). See Appendix F for the proof. Lemma 6.2. With known transitions, E [P s µπ(s)biasπ(s)] H5SA2 ln(T), and with unknown transitions, s µπ(s)biasπ(s) H2S4A2 ln(T)ι+ v u u t H3S2AE s,a [µπt(s, a) µπ(s, a)]+ Next, we bound the bonus part in (31) for all regularizers we consider. The proofs are in Appendix G. Lemma 6.3. Using Tsallis entropy as the regularizer, with known transitions, t=1 V e Pt,πt(s0; bt) H4SA2 ln(T)+ t=1 µt(s)πt(a|s)(1 πt(a|s)) With unknown transitions, the right-hand side above further has an additional term O(HS4A2 ln(T)ι). Lemma 6.4. Using Shannon entropy as the regularizer, With known transitions, t=1 V e Pt,πt(s0; bt) t=1 µt(s)πt(a|s)(1 πt(a|s)) With unknown transitions, the right-hand side above further has an additional term O(HS4A2 ln(T)ι). Lemma 6.5. Using log barrier as the regularizer, with known transitions, t=1 V πt(s0; bt) H3S2A2 ln(T) ln(SAT)+ t=1 (It(s, a) πt(a|s)It(s))2L2 t,h(s) Final regret bounds To obtain the final regret bounds, we combine Lemma 6.2 with each of Lemma 6.3, Lemma 6.4, and Lemma 6.5 based on (31). Then we use the standard self-bounding technique to derive the bounds for each regime. The details are provided in Appendix H. 7. Conclusion In this work, we develop policy optimization algorithms for tabular MDPs that achieves the best of both worlds. 
Compared to previous solutions with a similar guarantee (Jin & Luo, 2020; Jin et al., 2021), our algorithm is computationally much simpler; compared to most existing RL algorithms, our algorithm is more robust (handling adversarial losses) and more adaptive (achieving fast rate in stochastic environments) simultaneously. Built upon the flexible policy optimization framework, our work paves a way towards developing more robust and adaptive algorithms for more general settings. Future directions include obtaining data-dependent bounds under unknown transitions, and incorporating function approximation. Best of Both Worlds Policy Optimization Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pp. 3692 3702. PMLR, 2019. Agarwal, A., Henaff, M., Kakade, S., and Sun, W. Pc-pg: Policy cover directed exploration for provable policy gradient learning. ar Xiv preprint ar Xiv:2007.08459, 2020a. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, pp. 64 66. PMLR, 2020b. Amir, I., Azov, G., Koren, T., and Livni, R. Better best of both worlds bounds for bandits with switching costs. ar Xiv preprint ar Xiv:2206.03098, 2022. Auer, P. and Chiang, C.-K. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pp. 116 120. PMLR, 2016. Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pp. 42 1. JMLR Workshop and Conference Proceedings, 2012. Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp. 1283 1294. PMLR, 2020. Chen, L., Luo, H., and Wei, C.-Y. Impossible tuning made possible: A new expert algorithm and its applications. In Conference on Learning Theory, pp. 1216 1259. PMLR, 2021. Chen, L., Luo, H., and Rosenberg, A. Policy optimization for stochastic shortest path. ar Xiv preprint ar Xiv:2202.03334, 2022. Erez, L. and Koren, T. Best-of-all-worlds bounds for online learning with feedback graphs. ar Xiv preprint ar Xiv:2107.09572, 2021. He, J., Zhou, D., and Gu, Q. Near-optimal policy optimization algorithms for learning adversarial linear mixture mdps. In International Conference on Artificial Intelligence and Statistics, pp. 4259 4280. PMLR, 2022. Ito, S. Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds. In Conference on Learning Theory, pp. 2552 2583. PMLR, 2021. Ito, S., Tsuchiya, T., and Honda, J. Nearly optimal best-ofboth-worlds algorithms for online learning with feedback graphs. ar Xiv preprint ar Xiv:2206.00873, 2022. Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning, 2020. Jin, T. and Luo, H. Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in neural information processing systems, 33: 16557 16566, 2020. Jin, T., Huang, L., and Luo, H. The best of both worlds: stochastic and adversarial episodic mdps with unknown transition. Advances in Neural Information Processing Systems, 34:20491 20502, 2021. Kakade, S. and Langford, J. 
Approximately optimal approximate reinforcement learning. In In Proc. 19th International Conference on Machine Learning. Citeseer, 2002. Lai, T. L., Robbins, H., et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6 (1):4 22, 1985. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press (preprint), 2018. Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in Neural Information Processing Systems, 2020. Levine, S. and Koltun, V. Guided policy search. In International conference on machine learning, pp. 1 9. PMLR, 2013. Luo, H. Homework 3 solution, introduction to online optimization/learning. http://haipeng-luo.net/ courses/CSCI659/2022_fall/homework/ HW3_solutions.pdf, November 2022. Luo, H., Wei, C.-Y., and Lee, C.-W. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. Advances in Neural Information Processing Systems, 34:22931 22942, 2021. Neu, G. and Olkhovskaya, J. Online learning in mdps with linear function approximation and bandit feedback. ar Xiv preprint ar Xiv:2007.01612v2, 2021. Rouyer, C., Seldin, Y., and Cesa-Bianchi, N. An algorithm for stochastic and adversarial bandits with switching costs. In International Conference on Machine Learning, pp. 9127 9135. PMLR, 2021. Best of Both Worlds Policy Optimization Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. Seldin, Y. and Lugosi, G. An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pp. 1743 1759. PMLR, 2017. Seldin, Y. and Slivkins, A. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pp. 1287 1295. PMLR, 2014. Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pp. 8604 8613. PMLR, 2020. Tsuchiya, T., Ito, S., and Honda, J. Best-of-bothworlds algorithms for partial monitoring. ar Xiv preprint ar Xiv:2207.14550, 2022. Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, 2018. Wei, C.-Y., Jahromi, M. J., Luo, H., Sharma, H., and Jain, R. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pp. 10170 10180. PMLR, 2020. Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3007 3015. PMLR, 2021. Xu, H., Ma, T., and Du, S. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. In Conference on Learning Theory, pp. 4438 4472. PMLR, 2021. Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pp. 10978 10989. PMLR, 2020. Zanette, A., Cheng, C.-A., and Agarwal, A. Cautiously optimistic policy optimization and exploration with linear function approximation. ar Xiv preprint ar Xiv:2103.12923, 2021. Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. 
In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 467 475. PMLR, 2019. Zimmert, J., Luo, H., and Wei, C.-Y. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pp. 7683 7692. PMLR, 2019. Best of Both Worlds Policy Optimization A. Additional Definitions Define µ e P ,π(s |s, a) as the probability of visiting s conditioned on that (s, a) is already visited, under transition kernel e P and policy π. In other words, µ e P ,π(s |s, a) is defined as 0 if h(s ) < h(s), 0 if h(s) = h(s ), s = s , 1 if s = s, Pr{sh(s ) = s | (sh, ah) = (s, a)} if h(s ) > h(s). Further define µ e P ,π(s |s) = P a µ e P ,π(s |s, a)π(a|s). We write µπ(s |s, a) = µP,π(s |s, a) and µπ(s |s) = µP,π(s |s) where P is the true transition. B. Concentration Bounds Lemma B.1. If P Pt, then for all e P Pt, e P(s |s, a) P(s |s, a) min P(s |s, a)ι nt(s, a) + 40ι 3nt(s, a), 1 Lemma B.2 (Lemma D.3.7 of Jin et al. (2021)). With probability at least 1 δ, for any h, nt(s, a) |Sh|A ln T + ln(1/δ) Definition B.3. Define E to be the event that P Pt for all t and the bound in Lemma B.2 holds. By (2) and Lemma B.2, Pr{E} 1 5Hδ. C. Difference Lemmas Lemma C.1 (Performance difference). For any policies π1 and π2, and any loss function ℓ: S A R, V π1(s0; ℓ) V π2(s0; ℓ) = X s µπ2(s)(π1(a|s) π2(a|s))Qπ1(s, a; ℓ). Lemma C.2. For any policies π1 and π2 and any function L : S A R, X s µπ2(s)(π1(a|s) π2(a|s))L(s, a) = V π1(s0; ℓ) V π2(s0; ℓ) ℓ(s, a) L(s, a) Es P ( |s,a),a π1( |s )[L(s , a )]. Proof. This is simply a different way to write the performance difference lemma (Lemma C.1). One only needs to verify that Qπ1(s, a; ℓ) = L(s, a). This can be shown straightforwardly by backward induction from s SH to s S0 and using the definition of ℓ(s, a). Lemma C.3 (Occupancy measure difference, Lemma D.3.1 of (Jin et al., 2021)). µP1,π(s) µP2,π(s) = X (u,v,w) S A S µP1,π(u, v) [P1(w|u, v) P2(w|u, v)] µP2,π(s|w) (u,v,w) S A S µP2,π(u, v) [P1(w|u, v) P2(w|u, v)] µP1,π(s|w) Best of Both Worlds Policy Optimization Lemma C.4 (Generalized version of Lemma 4 in (Jin et al., 2020)). Suppose the high probability event E defined in Definition B.3 holds. Let e P s t be a transition kernel in Pt which may depend on s, and let gt(s) [0, G]. Then µπt(s) µ e P s t ,πt(s) gt(s) v u u t HS2A ln(T)ι s µπt(s)gt(s)2 + HS4AG ln(T)ι. Proof. We first show that for any t, s, µπ(s) µ e P s t ,π(s) X (u,v,w) S A S µπ(u, v) nt(u, v) µπ(s|w) + HS2 X nt(u, v) . (32) Below, the summation range of (u, w, v) and (x, y, z) are both SH 1 h=0 (Sh A Sh+1) if without specifying. 
µπ(s) µ e P s t ,π(s) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µ e P s t ,π(s|w) (by Lemma C.3) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µπ(s|w) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µ e P s t ,π(s|w) µπ(s|w) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) µπ(s|w) u,v,w µπ(u, v) P(w|u, v) e P s t (w|u, v) X x,y,z µπ(x, y|w) e P s t (z|x, y) P(z|x, y) µ e P s t ,π(s|z) (by Lemma C.3) u,v,w µπ(u, v) nt(u, v) + ι nt(u, v) x,y,z µπ(u, v) nt(u, v) + ι nt(u, v) µπ(x, y|w) min nt(x, y) + ι nt(x, y), 1 (by Lemma B.1 and the assumption that E holds) u,v,w µπ(u, v) nt(u, v) µπ(s|w) u,v,w µπ(u, v) ι nt(u, v)µπ(s|w) (=: term1) x,y,z µπ(u, v) nt(u, v) µπ(x, y|w) nt(x, y) (=: term2) x,y,z µπ(u, v) nt(u, v) µπ(x, y|w) min ι nt(x, y), 1 (=: term3) x,y,z µπ(u, v) ι nt(u, v)µπ(x, y|w) (=: term4) We bound term1 to term4 separately as below: nt(u, v) S X Best of Both Worlds Policy Optimization µπ(u, v)P(z|x, y)µπ(x, y|w)ι µπ(u, v)P(w|u, v)µπ(x, y|w)ι µπ(u, v)P(z|x, y)µπ(x, y|w)ι µπ(u, v)P(w|u, v)µπ(x, y|w)ι nt(x, y) (AM-GM) v u u t H X v u u t H X x,y,z µπ(u, v) P(w|u, v) + ι nt(u, v) µπ(x, y|w) min ι nt(x, y), 1 x,y,z µπ(u, v)P(w|u, v)µπ(x, y|w) ι nt(x, y) + X x,y,z µπ(u, v) ι nt(u, v)µπ(x, y|w) x,y,z µπ(x, y) ι nt(x, y) + HS X u,v,w µπ(u, v) ι nt(u, v) nt(x, y) + HS2 X u,v,w µπ(u, v) ι nt(u, v) HS2 X Collecting all terms we obtain (32). Thus, µπt(s) µ e P s t ,πt(s) gt(s) u,v,w µπt(u, v) nt(u, v) µπt(s|w) + HS2 X u,v,w µπt(u, v) nt(u, v) µπt(s|w) nt(u, v) (33) Fix an h, we consider the summation ( ) restricted to (u, v, w) Th Sh A Sh+1. That is, (u,v,w) Th µπt(u, v) nt(u, v) µπt(s|w) (u,v,w) Th µπt(u, v) αP(w|u, v)gt(s)2 + ι αnt(u, v) (holds for any α > 0 by AM-GM) (u,v,w) Th µπt(u, v)P(w|u, v)µπt(s|w)gt(s)2 + 1 nt(u, v) µπt(s|w) Best of Both Worlds Policy Optimization s µπt(s)gt(s)2 + H|Sh+1| s µπt(s)gt(s)2 + H|Sh+1||Sh|A ln(T)ι α + H|Sh+1| ln(1/δ)ι (by Lemma B.2 and the assumption that E holds) v u u t H|Sh||Sh+1|A ln(T)ι s µπt(s)gt(s)2 (picking the optimal α and using our choice of δ = 1 T 3 ) (|Sh| + |Sh+1) v u u t HA ln(T)ι s µπt(s)gt(s)2. Continue from (33): µπt(s) µ e P s t ,πt(s) gt(s) h (|Sh| + |Sh+1) v u u t HA ln(T)ι s µπt(s)gt(s)2 + HS4AG ln(T)ι (by Lemma B.2 and the assumption that E holds) v u u t HA ln(T)ι s µπt(s)gt(s)2 + HS4AG ln(T)ι. Lemma C.5. For any π1, π2, X s,a |µπ1(s, a) µπ2(s, a)| H X s,a µπ1(s) |π1(a|s) π2(a|s)| Proof. For any s, a, we can view µπ(s, a) as V π(s0; 1s,a) where 1s,a is the loss function that takes the value of 1 on (s, a) and 0 on other state-actions. By the performance difference lemma (Lemma C.1), |µπ1(s, a) µπ2(s, a)| X s ,a µπ1(s ) |π1(a |s ) π2(a |s )| Qπ2(s , a ; 1s,a). Therefore, X s,a |µπ1(s, a) µπ2(s, a)| X s ,a µπ1(s ) |π1(a |s ) π2(a |s )| X s,a Qπ2(s , a ; 1s,a) s ,a µπ1(s ) |π1(a |s ) π2(a |s )| Qπ2(s , a ; 1) (1 is the loss function that takes a constant value 1) s ,a µπ1(s ) |π1(a |s ) π2(a |s )| . D. FTRL Regret Bounds The lemmas in this section are standard results for FTRL, which can be found in e.g. Lattimore & Szepesv ari (2018); Zimmert & Seldin (2019); Ito (2021); Luo (2022). We list the results here for completeness. Best of Both Worlds Policy Optimization Lemma D.1. The FTRL algorithm: pt = argmin p Ω guarantees the following: t=1 pt u, ℓt ψ0(u) min p Ωψ0(p) + t=1 (ψt(u) ψt(pt) ψt 1(u) + ψt 1(pt)) | {z } penalty term t=1 max p Ω( pt p, ℓt Dψt(p, pt)) | {z } stability term Proof. Let Lt Pt τ=1 ℓτ. Define Ft(p) = p, Lt 1 + ψt(p) and Gt(p) = p, Lt + ψt(p). Therefore, pt is the minimizer of Ft. Let p t+1 be minimizer of Gt. 
Then by the first-order optimality condition, we have Ft(pt) Gt(p t+1) Ft(p t+1) Gt(p t+1) Dψt(p t+1, pt) = p t+1, ℓt Dψt(p t+1, pt). (34) By definition, we also have Gt(p t+1) Ft+1(pt+1) Gt(pt+1) Ft+1(pt+1) = ψt(pt+1) ψt+1(pt+1). (35) pt p t+1, ℓt Dψt(p t+1, pt) + Gt(p t+1) Ft(pt) (by (34)) pt p t+1, ℓt Dψt(p t+1, pt) + Gt 1(p t) Ft(pt) + GT (p T +1) G0(p 1) n pt p, ℓt Dψt(p, pt) o ψt(pt) + ψt 1(pt) + GT (u) min p ψ0(p) (by (35), using that p T +1 is the minimizer of GT ) n pt p, ℓt Dψt(p, pt) o ψt(pt) + ψt 1(pt) + t=1 u, ℓt + ψT (u) min p ψ0(p) n pt p, ℓt Dψt(p, pt) o + ψt(u) ψt(pt) ψt 1(u) + ψt 1(pt) + t=1 u, ℓt + ψ0(u) min p ψ0(p). Re-arranging finishes the proof. Lemma D.2 (Stability under Tsallis entropy). Let ψt(p) = 2 p(a), and let ℓt RA be such that ηt p p(a)ℓt(a) 1 max p (A) { pt p, ℓt Dψt(p, pt)} 2ηt X a pt(a) 3 2 ℓt(a)2. Proof. The proof can be found in the Problem 1 of Luo (2022). Best of Both Worlds Policy Optimization Lemma D.3 (Stability under Shannon entropy). Let ψt(p) = P a 1 ηt(a)p(a) ln p(a), and let ℓt RA be such that η(a)ℓt(a) 1. Then max p (A) { pt p, ℓt Dψt(p, pt)} X a ηt(a)pt(a)ℓt(a)2. Proof. The proof can be found in the Proof of Lemma 1 in Chen et al. (2021). Lemma D.4 (Stability under log barrier). Let ψt(p) = P a 1 ηt(a) ln 1 p(a), and let ℓt RA be such that ηt(a)p(a)ℓt(a) 1 max p (A) { pt p, ℓt Dψt(p, pt)} X a ηt(a)pt(a)2ℓt(a)2. max p (A) { pt p, ℓt Dψt(p, pt)} max q RA + { pt q, ℓt Dψt(q, pt)} Define f(q) = pt q, ℓt Dψt(q, pt). Let q be the solution in the last expression. Next, we verify that under the specified conditions, we have f(q ) = 0. It suffices to show that there exists q RA + such that f(q) = 0 since if such q exists, then it must the maximizer of f and thus q = q. [ f(q)]a = ℓt(a) [ ψt(q)]a + [ ψt(pt)]a = ℓt(a) + 1 ηt(a)q(a) 1 ηt(a)pt(a) By the condition, we have 1 ηt(a)pt(a) ℓt(a) < 0 for all a. and so f(q) = 0 has solution in R+, which is q(a) = 1 pt(a) + ηt(a)ℓt(a) 1 . Therefore, f(q ) = ℓt ψt(q ) + ψt(pt) = 0, and we have max q RA + { pt q, ℓt Dψt(q, pt)} = pt q , ψt(pt) ψt(q ) Dψt(q , pt) = Dψt(pt, q ). It remains to bound Dψt(pt, q ), which by definition can be written as Dψt(pt, q ) = X 1 ηt(a)h pt(a) where h(x) = x 1 ln(x). By the relation between q (a) and pt(a) we just derived, it holds that pt(a) q (a) = 1 + ηt(a)pt(a)ℓt(a). By the fact that ln(1 + x) x x2 for all x 1 = ηt(a)pt(a)ℓt(a) ln(1 + ηt(a)pt(a)ℓt(a)) ηt(a)2pt(a)2ℓt(a)2 which gives the desired bound. Lemma D.5 (FTRL with Tsallis entropy). Let ψt(p) = 2 p(a) for non-increasing ηt, and let xt be such that pt(a)(ℓt(a) + xt) 1 2 for all t, a. Then the FTRL algorithm in Lemma D.1 ensures for any u (A), t=1 pt u, ℓt 2 a pt(a) 3 2 (ℓt(a) + xt)2 , where ξt = P pt(a)(1 pt(a)). Best of Both Worlds Policy Optimization Proof. We use Lemma D.1, and bound the penalty term and stability individually. penalty term = 2 η0 max p (A) Bounding the stability term: stability term = t=1 max p (A) n pt p, ℓt + xt1 Dψt(p, pt) o 2 a pt(a) 3 2 (ℓt(a) + xt)2 where the first equality is because pt p, 1 = 0 for pt, p (A), and the last inequality is by Lemma D.2. Lemma D.6 (FTRL with Shannon entropy). Let ψt(p) = P a 1 ηt(a)p(a) ln p(a), for non-increasing ηt(a) such that η0(a) = η0 for all a. Assume that ηt(a)ℓt(a) 1 for all t, a, and assume A T. Then for any u (A), t=1 pt u, ℓt ln A 1 ηt(a) 1 ηt 1(a) a ηt(a)pt(a)ℓt(a)2 + 1 where ξt(a) = min {pt(a) ln(T), 1 pt(a)}. Proof. Let u = 1 1 T 2 u + 1 AT 2 1. We use Lemma D.1, and bound the penalty term and stability individually (with respect to u ). 
penalty term p(a) ln 1 p(a) u (a) ln 1 u (a) 1 ηt(a) 1 ηt 1(a) pt(a) ln 1 pt(a) u (a) ln 1 u (a) 1 ηt(a) 1 ηt 1(a) pt(a) ln 1 pt(a) u (a) ln 1 u (a) To bound pt(a) ln 1 pt(a) u (a) ln 1 u (a), first observe that pt(a) ln 1 pt(a) = pt(a) ln 1 + 1 pt(a) pt(a) pt(a) 1 pt(a) 1 pt(a) because ln(1 + x) x. By the definition of u , we have u (a) ln 1 u (a) min 1 AT 2 ln(AT 2), 1 1 ln 1 1 1 T 2 min 1 AT 2 , 1 1 If pt(a) 1 A2T 4 , then pt(a) ln 1 pt(a) u (a) ln 1 u (a) 1 A2T 4 ln(A2T 4) 1 AT 2 = 2 ln(AT 2) AT 2 where the first inequality is because x ln(x) is increasing for x e 1, and last inequality is because 2 ln(x) x < 0 for all x R. If pt(a) > 1 A2T 4 , then pt(a) ln 1 pt(a) pt(a) ln(A2T 4) 6pt(a) ln(T) by the assumption A T. Combining all arguments above, we get penalty term ln A 1 ηt(a) 1 ηt(a) min {1 pt(a), pt(a) ln(T)} . Best of Both Worlds Policy Optimization Bounding the stability term: stability term = t=1 max p (A) n pt p, ℓt Dψt(p, pt) o a ηt(a)pt(a)ℓt(a)2 where the last inequality is by Lemma D.3. t=1 pt u , ℓt ln A 1 ηt(a) 1 ηt 1(a) a ηt(a)pt(a)ℓt(a)2 Then noticing that t=1 pt u, ℓt = t=1 pt u , ℓt + t=1 u u, ℓt t=1 pt u , ℓt + 1 finishes the proof. Lemma D.7 (FTRL with log barrier). Let ψt(p) = P a 1 ηt(a) ln 1 p(a) for non-increasing ηt(a) with η0(a) = η0 for all a, and let xt be such that ηt(a)pt(a)(ℓt(a) + xt) 1 2 for all t, a. Then for any u (A), t=1 pt u, ℓt 3A ln T 1 ηt(a) 1 ηt 1(a) a ηt(a)pt(a)ℓt(a)2 + 1 Proof. Let u = 1 1 T 3 u + 1 AT 3 1. We use Lemma D.1, and bound the penalty term and stability individually (with respect to u ). penalty term A ln(T 3) 1 ηt(a) 1 ηt 1(a) ln 1 u (a) ln 1 pt(a) 1 ηt(a) 1 ηt 1(a) 1 ηt(a) 1 ηt 1(a) ln(T) (because A T) Bounding the stability term: stability term = t=1 max p (A) n pt p, ℓt + xt1 Dψt(p, pt) o a ηt(a)pt(a)2 (ℓt(a) + xt)2 where the first equality is because pt p, 1 = 0, and the last inequality is by Lemma D.4. Then noticing that t=1 pt u, ℓt = t=1 pt u , ℓt + t=1 u u, ℓt t=1 pt u , ℓt + 1 finishes the proof. Best of Both Worlds Policy Optimization E. Analysis for FTRL Regret Bound (Lemma 6.1) E.1. Tsallis entropy Proof of Lemma 6.1 (Tsallis entropy). We focus on a particular s, and use πt(a), b Qt(a), Bt(a), Ct(a), ηt, µt, ξt, bt to denote πt(a|s), b Qt(s, a), Bt(s, a), Ct(s, a), ηt(s), µt(s), ξt(s), bt(s), respectively. By Lemma D.5, we have for any π t=1 πt π, b Qt Bt Ct a ηtπt(a) 3 2 b Qt(a) Bt(a) Ct(a) + xt 2 # for arbitrary xt R such that ηt p πt(a|s)( b Qt(a) Bt(a) Ct(a) + xt) 1 2 for all t, a. Our choice of xt is the following: xt = D πt, b Qt E Yt. (38) with Yt I h ηt µt 1 8H i . Below, we verify that ηt p πt(a) b Qt(a) Bt(a) Ct(a) + xt 1 πt(a) b Qt(a) Bt(a) Ct(a) + xt πt(a) Bt(a) Ct(a) D πt, b Qt E Yt (using (38) and b Qt(a) 0) ηt Bt(a) ηt Ct(a) ηt X a πt(a )HIt(s, a ) µtπt(a ) Yt (by the definition of b Qt(a)) 8H 1 4H2 Hηt µt Yt (using Lemma E.1, Ct(a) H2 and ηt 1 4H4 ) 2. (by the definition of Yt = I h ηt µt 1 8H i ) Continued from (37) with the choice of xt: t=1 πt π, b Qt Bt Ct a ηtπt(a) 3 2 b Qt(a) πt, b Qt Yt Bt(a) Ct(a) 2 # a ηtπt(a) 3 2 b Qt(a) πt, b Qt 2 + b Qt(a)2Y t + Bt(a)2 + Ct(a)2 # (define Y t = 1 Yt) a ηtπt(a) 3 2 b Qt(a) πt, b Qt 2 + b Qt(a)2Y t | {z } term1 a πt(a)Bt(a) a ηtπt(a)Ct(a)2 # . (using Lemma E.1) To bound term1, notice that b Qt(a) πt, b Qt 2 = Et " It(s, a)Lt,h µtπt(a) It(s)Lt,h (assume s Sh) Best of Both Worlds Policy Optimization µtπt(a) H µtπt(a) H 2 + µt(1 πt(a)) H = 1 µtπt(a)(1 πt(a))2H2 + 1 µt (1 πt(a))H2 Et h b Qt(a)2Y t i = Et " It(s, a)Lt,h µtπt(a)Y t . 
a ηtπt(a) 3 2 1 πt(a) µtπt(a) + 1 µtπt(a)Y t πt(a)(1 πt(a)) + p Notice that 1 µt q Pt τ=1 1 µτ and continuing from (39) we have t=1 πt π, b Qt Bt Ct O(H4A) + 3E a πt(a)Bt(a) a ηtπt(a)Ct(a)2 # a πt(a)Bt(a) with bt defined in (15). This finishes the proof. Lemma E.1 (Tsallis entropy). ηt(s)Bt(s, a) 1 8H . Proof. By the definition of bt(s) in (15), we have A 1 ηt(s) 1 ηt 1(s) + 8ηt(s)H4 (Ct(s, a) H2) Best of Both Worlds Policy Optimization 1 µt(s) q Pt τ=1 1 µτ (s) + 8 1 4H4 H4 A µt(s) + 2 34H ηt(s)Bt(s, a) ηt(s) 1 + 1 H H max s bt(s ) min 1 1600H4 100H2 min 1 1600H4 At 106H4A2 , (by the definition of γt) E.2. Shannon entropy Proof of Lemma 6.1 (Shannon entropy). We focus on a particular s, and use πt(a), b Qt(a), Bt(a), ηt(a), µt, bt to denote πt(a|s), b Qt(s, a), Bt(s, a), ηt(s, a), µt(s), bt(s), respectively. Notice that for any t, a, since b Qt(a) 0, ηt(a)Bt(a) 1 4H (by Lemma E.2), and ηt(a)Ct(a) 1 4H4 H2 = 1 4H2 (because ηt(a) η0(a) = 1 4H4 and Ct(a) H2), we have ηt(a)( b Qt(a) Bt(a) Ct(a)) 1 4H 1 4H2 1. Besides, for any a, E t=1 b Qt(a) H γt HT 2 (by the definition of γt) log T (by Lemma E.2) H2T (Ct(a) H2) With these inequalities, by Lemma D.6, the following holds for any π: t=1 πt π, b Qt Bt Ct ln A η0(a) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a) b Qt(a) Bt(a) Ct(a) 2 # b Qt(a) Bt(a) Ct(a) # Best of Both Worlds Policy Optimization O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a) b Qt(a)2 + Bt(a)2 + Ct(a)2 # O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) µt + πt(a)Bt(a)2 + πt(a)Ct(a)2 # O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) a πt(a)Bt(a) + 3 a ηt(a)πt(a)Ct(a)2 # (by Lemma E.2) By the update ηt(a), 1 ηt(a) 4H p 1 µτ 1 q PT τ=1 ξτ (a) µτ + maxτ [T ] 1 µτ µτ + max τ [T ] 1 µτ 1 µt Pt τ=1 1 µτ µτ + max τ [T ] 1 µτ µτ + max τ [t] 1 µτ µτ + max τ [t 1] 1 µτ µt + maxτ [t] 1 µτ maxτ [t 1] 1 µτ q Pt τ=1 ξτ (a) µτ + maxτ [t] 1 µτ 1 minτ [t] µτ minτ [t 1] µτ q Pt τ=1 ξτ (a) µτ + maxτ [t] 1 µτ 1 ηt(a) 1 ηt 1(a) ξt(a) + 1 minτ [t] µτ minτ [t 1] µτ where we use (19) in the last inequality. Using this in (41), we get t=1 πt π, b Qt Bt Ct O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) ξt(a) + 1 minτ [t] µτ minτ [t 1] µτ a πt(a)Bt(a) + 3 a ηt(a)πt(a)Ct(a)2 # O(H4A ln(T)) + E a πt(a)Bt(a) where we use the definition of bt in (18). This finishes the proof. Best of Both Worlds Policy Optimization Lemma E.2 (Shannon entropy). ηt(s, a)Bt(s, a) 1 4H and Bt(s, a) 400 T log T. Proof. By the definition of bt(s) in (18), we have 1 ηt(s, a) 1 ηt 1(s, a) a ηt(s, a)πt(a|s)H4 (Ct(s, a) H2) µt(s) q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) + H log T + 2 (using (19) and ηt(s, a) 1 4H4 ) log T + 2 132HA log T γt . Further notice that 1 ηt(s, a) 4 H log T τ 4H p Bt(s, a) H 1 + 1 H max s bt(s) 396H2A log T γt 400H2A p ηt(s, a)Bt(s, a) min 1 1600H4A log T , 1 4H t log T 396H2A log T γt min 1 1600H4A log T , 1 4H t log T max 396H2A t log T 106H4A2 , 396H2A p (by the definition of γt) by the definition of γt. E.3. Log barrier Proof of Lemma 6.1 (log barrier). We focus on a particular s, and use πt(a), b Qt(a), Bt(a), Ct(a), ηt, µt, ζt(a), to denote πt(a|s), b Qt(s, a), Bt(s, a), Ct(s, a), ηt(s), µt(s), ζt(s, a), respectively. By Lemma D.7, t=1 πt π, b Qt Bt Ct O(H4A ln(T)) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a)2 b Qt(a) Bt(a) Ct(a) + xt 2 # t=1 b Qt(a) Bt(a) Ct(a) for arbitrary xt R such that ηt(a)πt(a)( b Qt(a) Bt(a) Ct(a) + xt) 1. Recall that with log barrier, there are real episodes and virtual episodes in which ℓt(s, a) = 0 for all (s, a). Let Yt = 0 if t is a virtual episode, and Yt = 1 otherwise. xt = D πt, b Qt E . 
(43) Best of Both Worlds Policy Optimization Below, we verify that ηt(a)πt(a) b Qt(a) Bt(a) Ct(a) + xt 1 ηt(a)πt(a) b Qt(a) Bt(a) Ct(a) + xt ηt(a)πt(a) Bt(a) Ct(a) D πt, b Qt E (using (43) and b Qt(a) 0) ηt(a)πt(a|s)Bt(a) ηt(a)Ct(a) ηt(a) X a πt(a )HIt(s, a ) µtπt(a ) Yt (when Yt = 0, b Qt(a) = 0) 8H 1 4H2 Hηt µt Yt (by Lemma E.3 and that Ct(a) H2 and ηt(a) 1 4H4 ) 2. (when Yt = 1 (real episode), ηt(a) Besides, for any a, E t=1 b Qt(a) H γt HT 2 (by the definition of γt) 15ST 2 (by Lemma E.3) H2T (Ct(a) H2) Below, we continue from (42) with our choice of xt: t=1 πt π, b Qt Bt Ct O(H4SA ln(T)) + E 1 ηt(a) 1 ηt 1(a) a ηt(a)πt(a)2 b Qt(a) πt, b Qt 2 + Bt(a)2 + Ct(a)2 # O(H4SA ln(T)) + E 1 ηt(a) 1 ηt 1(a) | {z } term1 a ηt(a)πt(a)2 b Qt(a) πt, b Qt 2 | {z } term2 a πt(a)Bt(a) a ηt(a)πt(a)Ct(a)2 | {z } term3 . (by Lemma E.3) We further manipulate term2 (suppose that s Sh). In virtual episodes, term2 = 0, and in real episodes, ηt(a)πt(a)2 b Qt(a) πt, b Qt 2 = ηt(a)πt(a|s)2 It(s, a)Lt,h µtπt(a) It(s)Lt,h = ηt(a) It(s, a)Lt,h µt πt(a|s)It(s)Lt,h µ2 t (It(s, a) πt(a|s)It(s)) L2 t,h = ηt(a)ζt(a) Best of Both Worlds Policy Optimization 1 ηt+1(a) 1 ηt(a) (by Eq. (23)) By the definition of bt in (22), we have E[term1 + term2 + term3] E h PT t=1 bt i , which finishes the proof. Lemma E.3 (log barrier). ηt(s, a)πt(a|s)Bt(s, a) 1 8H and Bt(s, a) 15ST. Proof. If t is a real episode, bt(s) = 8 X 1 ηt+1(s, a) 1 ηt(s, a) ηt(s, a)It(s, a)Lt(s, a)2 µt(s)2 32H2 max a ηt(s, a) µt(s) 1 µt(s) 1 µt(s) 32H2 max s ,a Bt(s, a) bt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a)bt(s ) 1 µt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a) 1 µt(s ) 32H2 max s ,a 1 µt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a) 1 µπt t (s)πt(a|s)µ e Pt,πt(s |s, a) + γt 32H2 max s ,a 1 µπt(s)πt(a|s) + γt 32H2 max s ,a S µt(s)πt(a|s) 96H2 max s ,a ηt(s, a)πt(a|s)Bt(s, a) 96H2S max s ,a 2 1 8H (46) where the last inequality is because ηt(s ,a ) µt(s ) 1 60 H3S in real episodes. From the second-to-last step in (45), we also have Bt(s, a) 3S γt 32H2 max s ,a In virtual episodes, 1 ηt+1(s, a) 1 ηt(s, a) I{(s t, a t) = (s, a)} 24ηt(s, a)H log T log T I{(s t, a t) = (s, a)} 24µt(s)H 1 maxs ,a ηt(s ,a ) µt(s ) (by the definition of (s t, a t)) Best of Both Worlds Policy Optimization I{s t = s} 24µt(s)H 1 maxs ,a ηt(s ,a ) I{s t = s} µt(s) 1 24HMt where we define Mt = maxs ,a ηt(s ,a ) µt(s ) . Similar to (45): Bt(s, a) bt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a)bt(s ) I{s t = s} µt(s) + 3 X s µ e Pt,πt(s |s, a)I{s t = s } µt(s ) I{s t = s} µt(s) + 3 X s :h(s )>h(s) µ e Pt,πt(s |s, a) I{s t = s } µπt t (s)πt(a|s)µ e Pt,πt(s |s, a) + γt I{s t = s } µπt(s)πt(a|s) + γt 1 24HMt 1 µt(s)πt(a|s) 1 8HMt (47) ηt(s, a)πt(a|s)Bt(s, a) ηt(s, a) µt(s) 1 8HMt 1 8H where the last step uses the definition of Mt. From the second-to-last step in (47) , we also have Bt(s, a) 1 8γt HMt 15 where we use that Mt 1 60 H3S in vitrual episodes. F. Analysis for the Bias (Lemma 6.2) Proof of Lemma 6.2. s µπ(s)biasπ(s) s,a µπ(s) (πt(a|s) π(a|s)) Qπt(s, a; ℓt) b Qt(s, a) + Ct(s, a) # s,a µπ(s) (πt(a|s) π(a|s)) Qπt(s, a; ℓt) µπt(s) µt(s) Qπt(s, a; ℓt) + Ct(s, a) # s,a µπ(s) (πt(a|s) π(a|s)) µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) + Ct(s, a) # s,a (µπt(s, a) µπ(s, a))zt(s, a) Best of Both Worlds Policy Optimization with zt(s, a) defined as the following based on Lemma C.2: zt(s, a) µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) + Ct(s, a) Es P ( |s,a),a πt( |s ) µt(s ) µπt(s ) µt(s ) Qπt(s , a ; ℓt) + Ct(s , a ) Recall the high probability event E defined in Definition B.3. 
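Both the bias quantities manipulated below (c_t(s) and C_t(s, a), where C_t(s, a) aggregates c_t over strictly later layers reachable from (s, a)) and the bonus totals V^{π_t}(s_0; b_t) bounded in Appendix G are policy-evaluation objects: expectations of a per-state function under a plugged-in transition model and the current policy. A minimal sketch of the backward-induction routine that computes such objects in a layered MDP follows; the array shapes, the `layers` argument, and the treatment of the last layer as terminal are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def backward_bonus_evaluation(P_hat, pi, c, layers):
    """Evaluate a per-state bonus c under transition model P_hat and policy pi.

    Returns
      C[s, a] : expected bonus collected at strictly later layers after taking
                action a in state s (the role played by C_t(s, a) in the text);
      J[s]    : c[s] plus the expected future bonus from s under pi, so J[s0]
                plays the role of the bonus value function V^{pi}(s0; c).

    Assumed shapes (illustrative): P_hat is (S, A, S), pi is (S, A), c is (S,).
    `layers` lists state indices layer by layer; transitions are assumed to move
    only to later layers, and states in the final layer are treated as terminal.
    """
    S, A, _ = P_hat.shape
    C = np.zeros((S, A))
    J = np.zeros(S)
    for layer in reversed(layers):       # process the last layer first
        for s in layer:
            C[s] = P_hat[s] @ J          # expected future bonus for each action
            J[s] = c[s] + pi[s] @ C[s]   # average over the policy at state s
    return C, J
```

With c set to the per-state bonus b_t and P_hat set to the transition used in the analysis, J[s_0] gives the quantity V^{π_t}(s_0; b_t) whose sum over t is controlled in Appendix G; the proof of Lemma 6.2 continues below.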
Notice that s,a (µπt(s, a) µπ(s, a))zt(s, a) s,a (µπt(s, a) µπ(s, a))zt(s, a) E s,a (µπt(s, a) µπ(s, a))zt(s, a) E s,a (µπt(s, a) µπ(s, a))zt(s, a) E + O(Hδ) O(TH TH2) (because |zt(s, a)| O(TH2) almost surely) s,a (µπt(s, a) µπ(s, a))zt(s, a) E . (δ = 1 T 3 ) From now on, it suffices to bound PT t=1 P s,a(µπt(s, a) µπ(s, a))zt(s, a) assuming E holds (i.e., P Pt for all t). By the definition of Ct(s, a), we have zt(s, a) = µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) + max e P Pt Es e P ( |s,a),a πt( |s ) " µt(s ) µπt t (s ) µt(s ) H + Ct(s , a ) Es P ( |s,a),a πt( |s ) µt(s ) µπt(s ) µt(s ) Qπt(s , a ; ℓt) + Ct(s , a ) µt(s) µπt(s) µt(s) Qπt(s, a; ℓt) 0. (51) On the other hand, zt(s, a) µt(s) µπt(s) µt(s) H + Ct(s, a) Es P ( |s,a),a πt( |s ) [Ct(s , a )] ct(s) + Es P t( |s,a),a πt( |s ) [ct(s ) + Ct(s , a )] Es P ( |s,a),a πt( |s ) [Ct(s , a )] (let P t be the transition that attains the maximum in (13)) ct(s) + Es P ( |s,a)[ct(s )] + X P t(s |s, a) P(s |s, a) πt(a |s ) (ct(s ) + Ct(s , a )) ct(s) + Es P ( |s,a)[ct(s )] + X s ,a et(s |s, a)πt(a |s )(ct(s ) + Ct(s , a )) (52) where we define et(s |s, a) = P t(s |s, a) P(s |s, a) . Observe that by the definition of Ct(s, a), it holds that Ct(s, a) = X s :h(s )>h(s) µP t,πt(s |s, a)ct(s ), and therefore, ct(s) + Ct(s, a) = X s µP t,πt(s |s, a)ct(s ) Best of Both Worlds Policy Optimization a πt(a|s) (ct(s) + Ct(s, a)) = X s µP t,πt(s |s)ct(s ). Thus we can thus rewrite (52) as zt(s, a) ct(s) + Es P ( |s,a)[ct(s )] + X s et(s |s, a) X s µP t,πt(s |s )ct(s ). (53) Continue from the previous calculation in (50): s,a (µπt(s, a) µπ(s, a)) zt(s, a) s,a [µπt(s, a) µπ(s, a)]+ zt(s, a) (by (51)) s,a [µπt(s, a) µπ(s, a)]+ ct(s) | {z } term1 s,a [µπt(s, a) µπ(s, a)]+ Es P ( |s,a)[ct(s )] | {z } term2 s,a [µπt(s, a) µπ(s, a)]+ X s et(s |s, a) X s µP t,πt(s |s )ct(s ) | {z } term3 . (by (53)) Known transition case For the known transition case, we have ct(s) µπt(s) + γt µπt(s) µt(s) H = γt µt(s)H and et(s |s, a) = 0. Thus, s µπ(s)biasπ(s) s µπt(s) γt µt(s)H HS t=1 γt = O H5SA2 ln(T) . Unknown transition case Upper bounding term1. By the definition of ct(s), s,a [µπt(s, a) µπ(s, a)]+ µπt t (s) µπt t (s) + γt µt(s) s,a [µπt(s, a) µπ(s, a)]+ µπt t (s) µπt(s) | {z } term1a s,a [µπt(s, a) µπ(s, a)]+ µπt(s) µπt t (s) | {z } term1b s,a µπt(s, a) Hγt | {z } term1c To bound term1a, we apply Lemma C.4 with [µπt(s, a) µπ(s, a)]+ Best of Both Worlds Policy Optimization which gives v u u t H3S2A s,a µπt(s)[µπt(s, a) µπ(s, a)]+ µt(s) ln(T)ι + H2S4A ln(T)ι v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ ln(T)ι + H2S4A ln(T)ι. term1b can be bound in the same way and admits the same upper bound. term1c HS PT t=1 γt = O H5SA2 ln(T) . Combining term1, term2, term3, we get v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ ln(T)ι + H2S4A2 ln(T)ι. Upper bounding term2. This is very similar to the procedure of bounding term1. We perform a similar decomposition: s,a [µπt(s, a) µπ(s, a)]+ Es P ( |s,a) µπt t (s ) µπt(s ) | {z } term2a s,a [µπt(s, a) µπ(s, a)]+ Es P ( |s,a) " µπt(s ) µπt t (s ) | {z } term2b s,a µπt(s, a)Es P ( |s,a) | {z } term2c To bound term2a, we apply Lemma C.4 with [µπt(s, a) µπ(s, a)]+ µt(s ) P(s |s, a) s,a µπt(s, a)P(s |s, a) µt(s ) µπt(s ) which gives v u u t H3S2A s µπt(s ) X [µπt(s, a) µπ(s, a)]+ µt(s ) P(s |s, a) ln(T)ι + H2S4A ln(T)ι v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ P(s |s, a) ln(T)ι + H2S4A ln(T)ι v u u t H3S2A s,a [µπt(s, a) µπ(s, a)]+ ln(T)ι + H2S4A ln(T)ι. which is same as the bound for term1a. 
Also, term2b can be handled in the same way as term2a, and term2c PT t=1 P s µπt(s ) Hγt µt(s ) HS PT t=1 γt. Overall, term2 can be bounded by the same order as term1. Upper bounding term3. s,a,s [µπt(s, a) µπ(s, a)]+ P t(s |s, a)ι nt(s, a) + ι nt(s, a) s µP t,πt(s |s )ct(s ) s,a,s [µπt(s, a) µπ(s, a)]+ P t(s |s, a)α + ι nt(s, a)α s µP t,πt(s |s )ct(s ) (for any α (0, 1]) Best of Both Worlds Policy Optimization s,a,s [µπt(s, a) µπ(s, a)]+ P t(s |s, a) X s µP t,πt(s |s )ct(s ) s,a,s [µπt(s, a) µπ(s, a)]+ ι nt(s, a) s µP t,πt(s |s )ct(s ) s,a [µπt(s, a) µπ(s, a)]+ X s µP t,πt(s |s, a)ct(s ) + H2 s,a,s [µπt(s, a) µπ(s, a))]+ ι nt(s, a) s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a) µt(s ) µt(s ) µt(s ) + H2S s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a)µπt t (s ) µπt(s ) µt(s ) | {z } term3a s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a) µπt(s ) µπt t (s ) µt(s ) | {z } term3b s,a,s [µπt(s, a) µπ(s, a)]+ µP t,πt(s |s, a) γt µt(s ) | {z } term3c +H2S2A ln(T)ι (by Lemma B.2 and the assumption that E holds.) For term3a we apply Lemma C.4 with s,a[µπt(s, a) µπ(s, a)]+µP t,πt(s |s, a) s,a µπt(s, a)µP t,πt(s |s, a) v u u t H2S2A ln(T)ι s,a[µπt(s, a) µπ(s, a)]+µP t,πt(s |s, a) µt(s ) + αH H2S4A ln(T)ι v u u t H3S2A ln(T)ι s,a [µπt(s, a) µπ(s, a)]+ + αH3S4A ln(T)ι The same bound applies to term3b, too. s,a,s µπt(s, a)µP t,πt(s |s, a) γt µt(s ) αH2 X s γt αH6SA2. Picking α = 1 H , combining term3a and term3b, and using H S, we get v u u t H3S2A ln(T)ι s,a [µπt(s, a) µπ(s, a)]+ + H2S4A2 ln(T)ι which is also of the same order as term1. Combining term1, term2, term3, we get that if E holds, then s,a (µπt(s, a) µπ(s, a)) zt(s, a) v u u t H3S2A s,a [µπt(s, a) µπ(s, a))]+ ln(T)ι + H2S4A2 ln(T)ι Best of Both Worlds Policy Optimization Using this in (50) finishes the proof. G. Bounding P s V πt(s0; bt) (Lemma 6.3, Lemma 6.4, Lemma 6.5) We first show Lemma G.1 and Lemma G.2 which are common among different regularizers. Lemma G.1. t=1 µt(s)πt(a|s)(1 πt(a|s)) t=1 µπt(s)πt(s)(1 πt(a|s)) H4S2A3 ln(T) + I{unknown transition} p HS5A3 ln(T)ι. Proof. Define φ(s, a) = πt(a|s)(1 πt(a|s)). t=1 µt(s)φ(s, a) X t=1 µπt(s)φ(s, a) + X t=1 γtφt(s, a) | {z } term1 t=1 |µπt t (s) µπt(s)| φt(s, a) | {z } term2 t=1 γtφt(s, a) s,a φt(s, a) S H4S2A3 ln(T). term2 is zero in the known transition case, and in the unknown transition case, if E defined in Definition B.3 holds, then t=1 (µπt t (s) µπt(s))φt(s, a) + 1 (for any α > 0) v u u t HS2A ln(T)ι + HS4A ln(T)ι (by Lemma C.4 with gt(s) = P a φt(s, a)) v u u t HS2A ln(T)ι s,a µπt(s)φt(s, a) + HS4A ln(T)ι s,a µπt(s)φt(s, a) + p HS5A3 ln(T)ι (choosing α = 1 HS3A ln(T )ι) t=1 µπt(s)φt(s, a) + p HS5A3 ln(T)ι. If E does not hold (which happens with probability O(H/T 3)), then term2 O(SA T). Overall, t=1 µπt(s)φt(s, a) HS5A3 ln(T)ι + HSA Collecting terms and using H S finishes the proof. Best of Both Worlds Policy Optimization Lemma G.2. With known transition, s µπt(s)νt(s) H4SA2 ln(T). For Tsallis entropy or Shannon entropy with unknown transition, s µ e Pt,πt(s)νt(s) HS4A2 ln(T)ι. Proof. With known transition, we have s µπt(s)νt(s) s,a µπt(s)πt(a|s)Ct(s, a)2 (ηt(s, a) 1 H4 or ηt(s) 1 H4 ) s,a µπt(s)πt(a|s)Ct(s, a) (Ct(s, a) H2) s,a µπt(s)πt(a|s) X s µπt(s |s, a)µt(s ) µπt(s ) µt(s ) (by the definition of Ct(s, a)) s µπt(s ) µt(s ) µπt(s ) H4SA2 ln(T). With unknown transitions, notice that for Tsallis entropy we have ηt(s) min n 1 H4 , 1 H o and for Shannon entropy we have ηt(s, a) min n 1 H4 , 1 H o . 
Therefore, in both cases, suppose that E holds, s µ e Pt,πt(s)νt(s) s,a µ e Pt,πt(s)πt(a|s)Ct(s, a)2 s,a µ e Pt,πt(s)πt(a|s)Ct(s, a) s,a µ e Pt,πt(s)πt(a|s) X s µP t,πt(s |s, a) µt(s ) µπt t (s ) (let P t be the e P attaining maximum in (13)) t=1 min 1, H3 µπt t (s ) µt(s ) µπt t (s ) t=1 min 1, H3 µπt t (s ) µπt t (s ) + Best of Both Worlds Policy Optimization t=1 min 1, H3 µπt t (s ) µπt t (s ) + H4SA2 ln(T) By Lemma C.4, the first part above can be upper bounded by v u u t HS2A ln(T)ι s µπt(s) min 1, H6 + HS4A ln(T)ι H8S2Aι ln(T) + HS4A ln(T)ι HS4A ln(T)ι where we use H S. Suppose that E does not hold (happens with probability O(H/T 3)), we still have s µ e Pt,πt(s)νt(s) s,a µ e Pt,πt(s)πt(a|s)Ct(s, a)2 H4 H H2 2 O(HT) because |Ct(s, a)| H2 with probability 1. Combining all terms and taking expectation, we conclude that s µ e Pt,πt(s)νt(s) HS4A2 ln(T)ι. G.1. Tsallis entropy Proof of Lemma 6.3. t=1 V πt, e Pt(s0; bt) s µ e Pt,πt(s)bt(s) s µ e Pt,πt(s) νt(s) + 1 ηt(s) 1 ηt 1(s) µt(s) > 1 8H s µ e Pt,πt(s)νt(s) + H s µ e Pt,πt(s) 1 µt(s) q Pt τ=1 1 µτ (s) s µ e Pt,πt(s)νt(s) | {z } term1 ξt(s) q Pt τ=1 1 µτ (s) | {z } term2 1 µt(s) ηt(s) q Pt τ=1 1 µτ (s) | {z } term3 Bounding term1. term1 can be bounded using Lemma G.2, which gives E[term1] H4SA2 ln(T) + I{unknown transition}HS4A2 ln(T)ι. Best of Both Worlds Policy Optimization Bounding term2. 1 µt(s) Pt τ=1 1 µτ (s) µt(s)πt(a|s)(1 πt(a|s)) 1 µt(s) Pt τ=1 1 µτ (s) t=1 µt(s)πt(s, a)(1 πt(a|s)) 1 µt(s) Pt τ=1 1 µτ (s) t=1 µt(s)πt(a|s)(1 πt(a|s)). By Lemma G.1, we can bound the last expression by v u u tln(T) t=1 µπt(s)πt(s, a)(1 πt(s, a)) H6S2A3 ln(T) + I{unknown transition} H3S5A3 ln(T)ι. Bounding term3. By (16), 1 µt(s) Pt τ=1 1 µτ (s) HS Combining term1, term2, term3 finishes the proof. G.2. Shannon entropy Proof of Lemma 6.4. t=1 V e Pt,πt(s0; bt) t,s,a µ e Pt,πt(s) µt(s) q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) + 1 ξt(s, a) + 1 minτ [t] µτ(s) minτ [t 1] µτ(s) t,s µ e Pt,πt(s)νt(s) ln T q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) ξt(s, a) + X ln T q Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) 1 minτ [t] µτ(s) minτ [t 1] µτ(s) t ξt(s, a) + H t,s,a µ e Pt,πt(s)1 t,s,a µ e Pt,πt(s) 1 minτ [t] µτ(s) minτ [t 1] µτ(s) t,s µ e Pt,πt(s)νt(s) µt(s) Pt 1 τ=1 ξτ (s,a) µτ (s) + 1 µt(s) t µt(s)ξt(s, a) + H t,s,a ln minτ [t 1] µτ(s) minτ [t] µτ(s) Best of Both Worlds Policy Optimization µt(s)ξt(s, a) t µt(s)ξt(s, a) v u u t A X t,s ln minτ [t 1] µτ(s) minτ [t] µτ(s) t,s µ e Pt,πt(s)νt(s) t µt(s)ξt(s, a) + X t,s µ e Pt,πt(s)νt(s) + H2SA ln 3 2 (T) v u u tln3(T) t=1 µt(s)πt(a|s)(1 πt(a|s)) + X s,t µ e Pt,πt(s)νt(s) + H2SA ln 3 2 (T) By Lemma G.1 and Lemma G.2, the expectation of this can be upper bounded by v u u tln3(T) t=1 µπt(s)πt(a|s)(1 πt(a|s)) H6S2A3 ln3(T) + I{unknown transition} q H3S5A3 ln3(T)ι + H4SA2 ln 3 2 (T) + I{unknown transition}HS4A2 ln(T)ι v u u tln3(T) t=1 µπt(s)πt(a|s)(1 πt(a|s)) ln3(T) + I{unknown transition}HS4A2 ln(T)ι. (using H S and log(T) ι) G.3. Log barrier Lemma G.3. Let η1 > 0, η2, η3, . . . be updated by ηt + ηtφt t 1 with 0 φt η 2 t . Then Proof. By the update rule, η2 t = 1 ηt+1 + 1 = 1 ηt+1 + 1 which implies By the condition on φt, we also have Combining the two inequalities finishes the proof. Best of Both Worlds Policy Optimization Proof of Lemma 6.5. In this proof we only focus on the know transition case. We use Tr and Tv to denote the set of real and virtual episodes, respectively. Let φt(s, a) = 4ζt(s,a) µt(s)2 log(T ) in real episodes and φt(s, a) = I{(s t,a t)=(s,a)} 24ηt(s,a)2H log T in virtual episodes. 
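Lemma G.3 and the proof of Lemma 6.5 below concern the log-barrier regularizer, whose per-state FTRL update was analyzed in Lemma D.7. For concreteness, here is a minimal sketch of that update: the optimality conditions give p_a = 1/(eta_a (L_a + lambda)) for a scalar Lagrange multiplier lambda, located by bisection. The real/virtual-episode mechanism, the loss shift x_t, and the self-referential learning-rate recursion 1/eta_{t+1}(s, a) = 1/eta_t(s, a) + eta_t(s, a) phi_t(s, a) controlled by Lemma G.3 are deliberately not reproduced; `cum_loss` and `eta` are placeholders.

```python
import numpy as np

def log_barrier_ftrl_step(cum_loss, eta, iters=80):
    """One FTRL step over the simplex with per-action log-barrier regularization.

    Solves p = argmin_p <p, L> + sum_a (1/eta_a) * ln(1/p_a) over the simplex.
    The optimality conditions give p_a = 1 / (eta_a * (L_a + lam)) for the unique
    lam > -min_a L_a with sum_a p_a = 1, found here by bisection.
    Illustrative sketch only; inputs are placeholders, not the paper's exact
    bonus-adjusted losses or learning-rate schedule.
    """
    L = np.asarray(cum_loss, dtype=float)
    eta = np.asarray(eta, dtype=float)
    lo = -L.min()                        # probabilities blow up as lam approaches lo
    hi = -L.min() + np.sum(1.0 / eta)    # at lam = hi the probabilities sum to <= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = 1.0 / (eta * (L + lam))
        if p.sum() > 1.0:
            lo = lam                     # multiplier must be larger
        else:
            hi = lam                     # multiplier can be smaller
    p = 1.0 / (eta * (L + hi))
    return p / p.sum()                   # tiny renormalization for numerical safety

# e.g. log_barrier_ftrl_step(np.array([0.2, 1.0, 0.5]), eta=np.full(3, 0.05))
```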
We first show that φt(s, a) 1 ηt(s,a)2 , which allows us to apply Lemma G.3 because 1 ηt+1(s,a) = 1 ηt(s,a) + ηt(s, a)φt(s, a) by our update rule. This is clear for virtual episodes. For real episodes, φt(s, a)ηt(s, a)2 = 4ηt(s, a)2ζt(s, a) µt(s)2 log T H2 log T 1 H3S 1 because ηt(s,a) H3S in real episodes. t=1 V πt(s0; bt) ηt(s, a)ζt(s, a) µt(s)2 log(T) log(T) + X t Tv µπt(s t) 1 ηt(s t, a t)H log T log T + s µπt(s)νt(s) s,a ηt(s, a)ζt(s, a) H |Tv| + H4SA ln(T) (in virtual episodes, ηt(s t,a t) µt(s t) 1 H3S , and we use Lemma G.2 to bound the last term) µt(s) q P τ t:τ Tr ζτ (s,a) HS|Tv| + +H4SA2 ln(T) (by Lemma G.3 and the condition verified at the beginning of the proof) µt(s)2 P τ t:τ Tr ζτ (s,a) t Tr ζt(s, a) + HS|Tv| + +H4SA2 ln(T) t Tr ζt(s, a) + HS|Tv| + H4SA2 ln(T). Now we bound the number of virtual episodes. Notice that each time a virtual episode happens, there exist s, a such that ηt(s,a) H3S , and ηt(s, a) will shrink by a factor of (1 + 1 24H log T ) after the virtual episode. Since µt(s) γt, this event cannot happen if ηt(s, a) γt 60 H3S . Thus, the number of virtual episodes is upper bounded by |Tv| SA log 60 H3S γt log 1 + 1 24H log T HSA ln(T) ln(SAT). Applying this bound in the last expression and using H S finishes the proof. H. Final Regret Bounds through Self-Bounding (Theorem 4.1, Theorem 4.2, Theorem 4.3) Proof of Theorem 4.1. Let π = argmaxπ Reg(π). By (31), Lemma 6.2, and Lemma 6.3, under known transition and Tsallis entropy, we have Reg( π) H X t=1 µπt(s)πt(a|s)(1 πt(a|s)) ln(T) + H5SA2 ln(T) Best of Both Worlds Policy Optimization For the adversarial regime, we bound the above by v u u t SAE s,a µπt(s)πt(a|s)) ln(T) + H5SA ln(T) = H3SAT + H5SA2 ln(T). For the stochastic regime, notice that Reg( π) Reg(π ) E h PT t=1 µπt(s)πt(a|s) (s, a) i C, and we have Reg( π) c1H X t=1 µπt(s)πt(a|s) ln(T) + c2H5SA2 ln(T) (for some universal constants c1, c2) t=1 µπt(s)πt(a|s) (s, a) + c2 1H ln(T) α (s, a) + c2H5SA2 ln(T) (for arbitrary α > 0) t=1 µπt(s)πt(a|s) (s, a) α (s, a) + H5SA2 ln(T) α(Reg( π) + C) + O α (s, a) + H5SA2 ln(T) Picking α = min 1 2, C 1 2 H2 ln(T ) 2 leads to the bound Reg( π) U + UC + H5SA2 ln(T) where U = P a =π (s) H2 ln(T ) (s,a) . Finally, using that Reg(π) Reg( π) for all π finishes the proof. Proof of Theorem 4.2. By (31), Lemma 6.2, and Lemma 6.3, under unknown transition and Tsallis entropy, we have v u u t H3S2AE s,a [µπt(s, a) µπ(s, a)]+ | {z } term1 t=1 µπt(s)πt(a|s)(1 πt(a|s)) | {z } term2 +c3H2S4A2 ln(T)ι (for universal constants c1, c2, c3) In the adversarial regime, we can bound it by the order of p H4S2AT ln(T)ι + H2S4A2 ln(T)ι To get a bound in the stochastic regime, we first argue that it suffices to show the desired bound for all π that satisfies Reg(π) Reg(π ). This is because we can then bound Reg(π) for π such that Reg(π) < Reg(π ) by Reg(π) < Reg(π ) U + p U(C + C(π )) + poly(H, S, A) ln(T)ι = U + UC + poly(H, S, A) ln(T)ι because C(π ) = 0 by definition. Below we assume that Reg(π) Reg(π ). Note that by Lemma C.5, for any π, X µπ(s, a) µπ (s, a) H X s,a µπ(s) |π(a|s) π (a|s)| Best of Both Worlds Policy Optimization a =π (s) µπ(s)π(a|s) + H X s µπ(s)(1 π( π (s) |s)) a =π (s) µπ(s)π(a|s). 
v u u t H3S2AE s,a |µπt(s, a) µπ(s, a)| v u u t H3S2AE s,a |µπt(s, a) µπ (s, a)| ln(T)ι + c1 v u u t H3S2A s,a |µπ(s, a) µπ (s, a)| ln(T)ι v u u u t2H4S2AE a =π (s) µπt(s, a) ln(T)ι + c1 v u u t2H4S2A a =π (s) µπ(s, a) ln(T)ιι a =π (s) µπt(s, a) min a =π (s) µπ(s, a) min + O H4S2A ln(T)ι α(Reg(π ) + C) + α(Reg(π ) Reg(π) + C(π)) + O H4S2A ln(T)ι (see explanation below) αReg(π) + α(C + C(π)) + O H4S2A ln(T)ι (by the assumption Reg(π ) Reg(π)) where in the second-to-last inequality we use the property: Reg(π ) Reg(π) = E t=1 V π(s0; ℓt) V π (s0; ℓt) a =π (s) µπ(s, a) (s, a) a =π (s) µπ(s, a) min C(π) For term2, similar to before, term2 c2H X t=1 µπt(s)πt(a|s)(1 πt(a|s)) α(Reg(π) + C) + O α (s, a) + H5SA2 ln(T) α(Reg(π) + C) + O α (s, a) + H5SA2 ln(T) α(Reg(π) + C) + O H4S2A ln(T) α min + H2S4A2 ln(T) Combining term1 and term2, we get Reg(π) 2αReg(π) + 2α(C + C(π)) + O H4S2A ln(T)ι α min + H2S4A2 ln(T)ι Best of Both Worlds Policy Optimization Picking α = min 1 4, (C + C(π)) 1 2 H4S2A ln(T )ι 2 leads to the desired bound. Proof of Theorem 4.3. v u u tln2(T)E t=1 (It(s, a) πt(a|s)It(s))2L2 t,h(s) + H3S2A2 ln(T) ln(SAT) In the adversarial regime, v u u t HSA ln2(T)E s,a (It(s, a) πt(a|s)It(s))2Lt,h(s) + H3S2A2 ln(T) ln(SAT) v u u t HSA ln2(T)E s,a It(s, a)Lt,h(s) + H3S2A2 ln(T) ln(SAT) v u u t H2SA ln2(T)E t=1 V πt(s0; ℓt) + H3S2A2 ln(T) ln(SAT) On the other hand, Reg(π) = E h PT t=1 V πt(s0; ℓt) PT t=1 V π(s0; ℓt) i . Solving the inequality, we get v u u t H2SA ln2(T) t=1 V π(s0; ℓt) + H3S2A2 ln(T) ln(SAT). In the stochastic regime, v u u tln2(T)E t=1 (It(s, a) πt(a|s)It(s))2L2 t,h(s) + H3S2A2 ln(T) ln(SAT) v u u t H2 ln2(T)E t=1 µπt(s)πt(a|s)(1 πt(a|s)) + H3S2A2 ln(T) ln(SAT), which is similar to the stochastic bound in Theorem 4.1. Following the same self-bounding analysis in the proof of Theorem 4.1 we can get the desired bound. To get regret bounds for the Shannon entropy version under known and unknown transitions, we use Lemma 6.2 and Lemma 6.4 and follow exactly the same procedure as in the proofs of Theorem 4.1 and Theorem 4.2. This leads to the following guarantees: Theorem H.1. Under known transitions, Algorithm 1 with Shannon entropy regularizer ensures for any π H3SAT ln3(T) + poly(H, S, A) ln2(T) in the adversarial case, and UC + poly(H, S, A) ln2(T) in the stochastic case, where U = P s P a =π (s) H2 ln3(T ) Best of Both Worlds Policy Optimization Theorem H.2. Under unknown transitions, Algorithm 1 with Shannon entropy regularizer ensures for any π H4S2AT ln2(T)ι + poly(H, S, A) ln(T)ι in the adversarial case, and Reg(π) U + p U(C + C(π)) + poly(H, S, A) ln(T)ι in the stochastic case, where U = H4S2A ln2(T )ι
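All of the stochastic-regime statements above end with the same self-bounding step: the regret is shown to satisfy an inequality of the form Reg ≤ 2α·Reg + 2α(C + C(π)) + O(V/α + W) for every admissible α, after which α is tuned against the unknown corruption level. The sketch below carries out only this last tuning step numerically; `V`, `C`, and `W` are placeholders for the gap-dependent term, the corruption, and the lower-order term, not the paper's exact constants.

```python
import numpy as np

def tune_self_bounding(V, C, W, alphas=None):
    """Numerically tune alpha in a self-bounding inequality of the shape
        Reg <= 2*alpha*Reg + 2*alpha*C + V/alpha + W,   alpha in (0, 1/4],
    i.e.  Reg <= (2*alpha*C + V/alpha + W) / (1 - 2*alpha).
    Returns the best bound over a grid of alpha values; it scales as
    O(V + sqrt(V*C) + W).  Placeholder constants only; this illustrates the
    tuning step, not the paper's exact inequality.
    """
    if alphas is None:
        alphas = np.geomspace(1e-8, 0.25, 4000)
    bounds = (2 * alphas * C + V / alphas + W) / (1 - 2 * alphas)
    return float(bounds.min())

# With V = 1e3 (gap-dependent log term), C = 1e5 (corruption), W = 1e2,
# tune_self_bounding(1e3, 1e5, 1e2) is on the order of V + sqrt(V * C) + W.
```

The minimizer reproduces the U + sqrt(U(C + C(π))) + poly(H, S, A) · polylog(T) shape of the stochastic-regime bounds stated above.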