Branching Reinforcement Learning

Yihan Du 1   Wei Chen 2

Abstract

In this paper, we propose a novel Branching Reinforcement Learning (Branching RL) model, and investigate both Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model. Unlike standard RL, where the trajectory of each episode is a single H-step path, branching RL allows an agent to take multiple base actions in a state such that transitions branch out to multiple successor states correspondingly, and thus it generates a tree-structured trajectory. This model finds important applications in hierarchical recommendation systems and online advertising. For branching RL, we establish new Bellman equations and key lemmas, i.e., a branching value difference lemma and a branching law of total variance, and also bound the total variance by only O(H^2) under an exponentially-large trajectory. For the RM and RFE metrics, we propose computationally efficient algorithms Branch VI and Branch RFE, respectively, and derive nearly matching upper and lower bounds. Our regret and sample complexity results are polynomial in all problem parameters despite exponentially-large trajectories.

1 IIIS, Tsinghua University, Beijing, China. 2 Microsoft Research. Correspondence to: Yihan Du, Wei Chen.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Reinforcement Learning (RL) (Burnetas & Katehakis, 1997; Sutton & Barto, 2018) models a fundamental sequential decision making problem, where an agent interacts with the environment over time in order to maximize the obtained rewards. Standard RL (Jaksch et al., 2010; Agrawal & Jia, 2017; Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019) considers taking only a single action in a state and formulates a single H-step path model. However, in many real-world applications such as recommendation systems (Fu et al., 2021) and online advertising (Kang et al., 2020), we often need to select multiple options at a time, and each option can trigger a corresponding successor state.

For example, in category-based shopping recommendation (Fu et al., 2021), the recommendation system often displays a list of main categories at the first step, where each one has a probability to be clicked. If a main category is clicked, at the second step, the system further provides a list of sub-categories according to the clicked main category. By analogy, at the last step, the system provides a list of items according to the chosen category path. In this process, users can select (trigger) more than one category-item path, e.g., one may buy IT accessories-printers-laser printers and IT accessories-scanners-document scanners at once.

To handle such scenarios involving multiple actions and successor states, we propose a novel Branching Reinforcement Learning (Branching RL) framework, which is an episodic tree-structured forward model. In each episode, an agent starts from an initial state and takes a super action that contains multiple base actions, where each base action in this state has a probability to be triggered. For each state-base action pair, if triggered successfully, the agent receives a reward and transitions to a next state; otherwise, if it is not triggered, the agent receives zero reward and transitions to an absorbing state associated with zero reward. Thus, the transitions branch out to multiple successor states.
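To make the tree-structured episode dynamics above concrete, the following minimal Python sketch rolls out one episode. The dictionary-based encoding of q, p, r and of the policy, as well as the function name sample_episode, are our own illustrative assumptions rather than notation from the paper; each node of the trajectory tree is indexed by a tuple, playing the role of the string indices formalized in Section 3, and any node whose parent edge is not triggered is set to the absorbing ending state.

```python
import numpy as np

def sample_episode(policy, q, p, r, H, m, s_init, s_end, rng):
    """Roll out one branching-RL episode as a tree of tuple-indexed nodes.

    policy[h][s] is a tuple of m base actions (a super action); q[(s, a)] is
    the trigger probability, r[(s, a)] the deterministic reward, and p[(s, a)]
    a dict mapping successor states to transition probabilities.
    """
    nodes = {(): s_init}                 # the root node, at step 1
    total_reward = 0.0
    for h in range(1, H + 1):
        for sigma in [idx for idx in nodes if len(idx) == h - 1]:
            s = nodes[sigma]
            if s == s_end:               # ending state: zero reward, self-loop
                for i in range(1, m + 1):
                    nodes[sigma + (i,)] = s_end
                continue
            for i, a in enumerate(policy[h][s], start=1):
                if rng.random() < q[(s, a)]:         # pair (s, a) is triggered
                    total_reward += r[(s, a)]
                    succ, probs = zip(*p[(s, a)].items())
                    nodes[sigma + (i,)] = succ[rng.choice(len(succ), p=probs)]
                else:                                # not triggered
                    nodes[sigma + (i,)] = s_end
    return nodes, total_reward
```

For instance, with rng = np.random.default_rng(0) and tabular dictionaries for q, p and r, repeated calls estimate a policy's expected total triggered reward by Monte Carlo, which is the quantity whose expectation defines the value functions introduced in Section 3.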
At the second step, for each branched-out state, the agent also selects a super action that contains multiple base actions with trigger probabilities. She only obtains rewards from the triggered state-base action pairs, and each state-base action pair transitions to a corresponding next state. Then, the transitions at the second step branch out to more successor states. By analogy, till the last step, she traverses an H-layer tree-structured trajectory, and only collects rewards at the triggered state-base action pairs. Different from standard episodic RL (Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019) where each trajectory is a single H-step path, the trajectory of branching RL is an H-layer triggered tree with exponentially increasing states and actions in each layer. This model allows an agent to take multiple base actions at once and handle multiple successor states. It can be applied to many hierarchical decision making scenarios, such as category-based recommendation systems (Fu et al., 2021) and online advertising (Kang et al., 2020). Branching Reinforcement Learning Under the branching RL model, we investigate two popular metrics in the RL literature, i.e., Regret Minimization (RM) and Reward-Free Exploration (RFE). In regret minimization (Jaksch et al., 2010; Azar et al., 2017; Zanette & Brunskill, 2019), the agent aims to minimize the gap between the obtained reward and the reward that can obtained by always taking the optimal policy. In reward-free exploration (Jin et al., 2020a; Kaufmann et al., 2021; M enard et al., 2021), the agent explores the unknown environment (model) without observation of rewards, in order to estimate the model accurately such that for any given reward function, she can plan a near-optimal policy using the estimated model. The performance in RFE is measured by the number of episodes used during exploration (i.e., sample complexity). Our work faces several unique challenges: (i) Since branching RL is a tree-structured forward model which greatly differs from standard RL, existing analytical tools for standard RL, e.g., Bellman equations, value difference lemma and law of total variance, cannot be directly applied to our problem. (ii) With exponentially-large trajectories, it is challenging to analyze the total variance and derive tight (polynomial) regret and sample complexity guarantees. (iii) Since the number of possible super actions can be combinatorially large, how to design a computationally efficient algorithm that avoids naive enumeration over all super actions is another challenge. To tackle the above challenges, we establish novel analytical tools, including branching Bellman equations, branching value difference lemma and branching law of total variance, and bound the total variance by only O(H2) under exponentially-large trajectories. We also propose computationally efficient algorithms for both RM and RFE metrics, and provide nearly matching upper and lower bounds, which are polynomial in all problem parameters despite exponentially-large trajectories. To sum up, our contributions in this paper are as follows: We propose a novel Branching Reinforcement Learning (Branching RL) framework, which is an episodic H-layer tree-structured forward model and finds important applications in hierarchical recommendation systems and online advertising. Under branching RL, we investigate two popular metrics, i.e., Regret Minimization (RM) and Reward-Free Exploration (RFE). 
We establish new techniques for branching RL, including branching Bellman equations, a branching value difference lemma and a branching law of total variance, and bound the total variance by only O(H^2) despite exponentially-large trajectories.

For both the RM and RFE metrics, we design computationally efficient algorithms Branch VI and Branch RFE, respectively, and build near-optimal upper and lower bounds, which are polynomial in all problem parameters even with exponentially-large trajectories. When our problem reduces to standard RL, our results match the state of the art.

Due to the space limit, we defer all proofs to the Appendix.

2. Related Work

Below we review the literature on standard (episodic and tabular) RL with the regret minimization (RM) and reward-free exploration (RFE) metrics.

Standard RL-RM. For the regret minimization (RM) metric, Jaksch et al. (2010) propose an algorithm that adds optimistic bonuses on transition probabilities, and achieves a regret bound with a gap in factors H, S compared to the lower bound (Jaksch et al., 2010; Osband & Van Roy, 2016). Here H is the length of an episode, and S is the number of states. Agrawal & Jia (2017) use posterior sampling and obtain an improved regret bound. Azar et al. (2017) build confidence intervals directly for value functions rather than transition probabilities, and provide the first optimal regret. Zanette & Brunskill (2019) design an algorithm based on both optimistic and pessimistic value functions, and achieve tighter problem-dependent regret bounds without requiring domain knowledge. The above works focus on model-based RL algorithms. There are also other works (Jin et al., 2018; Zhang et al., 2020) studying model-free algorithms based on Q-learning with exploration bonuses or advantage functions.

Standard RL-RFE. Jin et al. (2020a) introduce the reward-free exploration (RFE) metric and design an algorithm that runs multiple instances of an existing RM algorithm (Zanette & Brunskill, 2019); their sample complexity has a gap to the lower bound (Jin et al., 2020a; Domingues et al., 2021) in factors H, S. Kaufmann et al. (2021) propose an algorithm which builds upper confidence bounds for the estimation error of value functions, and improve the sample complexity of (Jin et al., 2020a). Ménard et al. (2021) achieve a near-optimal sample complexity by applying an empirical Bernstein inequality and upper bounding the overall estimation error.

There are substantial differences between standard RL and our branching RL. The exponentially-large trajectory of branching RL brings unique challenges in developing Bellman equations and key lemmas, designing computationally efficient algorithms, and deriving optimal (polynomial) bounds. Existing RL algorithms and analyses cannot be directly applied to address these challenges.

3. Problem Formulation

In this section, we present the formal formulation of Branching Reinforcement Learning (Branching RL).

Branching Markov Decision Process (Branching MDP). We consider an episodic branching MDP defined by a tuple M = (S, Auniv, A, m, H, q, p, r). Here S is the state space with cardinality S; it consists of the set Sreg of regular states together with an ending state, which is an absorbing state with zero reward. Auniv is the set of base actions, which represents the set of all feasible items in recommendation. Let N := |Auniv| denote the number of base actions. A super action A ⊆ Auniv consists of m (m ≤ N) base actions, which stands for a recommended list.
A is the collection of all feasible super actions and can be combinatorially large. H is the length of an episode. Throughout the paper, we call a super action an action for short, call (s, a) S Auniv a state-base action pair, and call (s, a) S \ {s } Auniv a regular state-base action pair. q(s, a) is the trigger probability of state-base action pair (s, a) S Auniv. p(s |s, a) is the probability of transitioning to state s on state-base action pair (s, a), for any (s , s, a) S S Auniv. r(s, a) [0, 1] is the reward of pair (s, a) S Auniv. We assume that reward function r is deterministic as many prior RL works (Azar et al., 2017; Jin et al., 2018; Zhang et al., 2020), and our work can generalize to stochastic rewards easily. Parameters q, p, r are time-homogeneous, i.e., have the same definitions for different step h [H]. The ending state s has zero reward and always transitions back to itself, i.e., q(s , a) = 0, p(s |s , a) = 1 and r(s , a) = 0 for all a Auniv. We define a policy π as a collection of H functions {πh :S 7 A}h [H], and K as the number of episodes. String-based Notations. As shown in Figure 1, in branching RL, the trajectory of each episode is an m-ary tree, where there are H layers (steps), and each layer h [H] has mh 1 states (nodes) and mh state-base action pairs (edges). We use the following string-based notations to denote a trajectory: Each tree node in layer h has a string index i1, . . . , ih 1 , with the root node for layer 1 having the empty string , and i1, . . . , ih 1 {1, 2, . . . , m}. The m children of this node have indices that concatenate ih [m] to the string, making it i1, . . . , ih 1, ih , where ih stands for that this node is the ih-th child of the node i1, . . . , ih 1 . Operator is the concatenation operation for strings, and i h denotes the concatenation of h strings i for any i [m], h [H]. For any string σ, |σ| denotes its length, and thus state sσ is at step |σ| + 1. Online Game. In each episode k [K], an agent selects a policy πk at the beginning, and starts from an initial state s . At step 1, she chooses an action A = {a 1 , . . . , a m } ac- , = , , , = , , = , , = , , = , , = Figure 1. Illustrating example with m = 2 for branching RL. cording to πk 1. Each state-base action pair (s , a i ) for i [m] has probability q(s , a i ) to be triggered. If triggered successfully, the agent obtains reward r(s , a i ) and this state-base action pair transitions to a next state s i p( |s , a i ); Otherwise, if not triggered successfully, she obtains zero reward and this state-base action pair transitions to the ending state s . Hence, the transitions at step 1 branch out to m successor states s 1 , . . . , s m . At step 2, for each state s i (i [m]), she chooses an action A i = {a i,1 , . . . , a i,m } according to πk 2. Then, there are m2 state-base action pairs {(s i , a i,j )}i,j [m] at step 2, and each of them is triggered with probability q(s i , a i,j ). If triggered successfully, the agent receives reward r(s i , a i,j ) and this pair transitions to a next state s i,j p( |s i , a i,j ); Otherwise, she receives zero reward and this pair transitions to s . Then, the transitions at step 2 branch out to m2 successor states {s i,j }i,j [m]. The episode proceeds by analogy at the following steps 3, . . . , H. In the trajectory tree, once the agent reaches s at some node, she obtains no reward throughout this branch (This is so-called ending state ). Branching Value functions and Bellman Equations. 
For any policy π, we define value function V π h : S 7 R, so that V π h (s) =Eq,p,π h m (H h) X ℓ=1 q(sσ σ , aσ σ ℓ) r(sσ σ , aσ σ ℓ)|sσ = s i (1) gives the expected cumulative reward starting from some state s at step h till the end of this branch, under policy π. Here σ is the index string for an arbitrary state at step h, and thus σ {1 (h 1), . . . , m (h 1)}. Pm (H h) σ = denotes the summation over strings σ = , 1 , . . . , m , 1, 1 , . . . , m, m , . . . , m (H h), which effectively enumerates all tree nodes of H h+1 layers. The expectation is taken with respect to the trajectory, which is dependent on trigger distribution q, transition distribution p and policy π. Accordingly, we also define Q-value function Qπ h : S A 7 Branching Reinforcement Learning Qπ h(s, A) =Eq,p,π h m (H h) X ℓ=1 q(sσ σ , aσ σ ℓ) r(sσ σ , aσ σ ℓ)|sσ = s, Aσ = A i denotes the expected cumulative reward starting from some state-action pair (s, A) at step h till the end of this branch, under policy π. From the definitions of r, p, q for ending state s , we have V π h (s ) = Qπ h(s , A) = 0 for any A A, h [H], π. Since S, A and H are all finite, there exists a deterministic optimal policy π which has the optimal value V h (s) = supπ V π h (s) for any s S and h [H]. Then, we can establish the Bellman (optimality) equations as follows: Qπ h(s, A) = X a A q(s, a) r(s, a) + p( |s, a) V π h+1 V π h (s) =Qπ h(s, πh(s)) V π H+1(s) =0, s S, Q h(s, A) = X a A q(s, a) r(s, a) + p( |s, a) V h+1 V h (s) = max A A Q h(s, A) V H+1(s) =0, s S. Under the framework of branching RL, we consider two important RL settings, i.e., regret minimization (branching RL-RM) and reward-free exploration (branching RL-RFE). Regret Minimization (RM). In branching RL-RM, the agent plays the branching RL game for K episodes, and the goal is to minimize the following regret Regret(K) = V 1 (sk ) V πk 1 (sk ) . Reward-Free Exploration (RFE). Branching RL-RFE consists of two phases, i.e., exploration and planning. (i) In the exploration phase, given a fixed initial state s , the agent plays the branching RL game without the observation of reward function r, and estimates a trigger and transition model (ˆq, ˆp). (ii) In the planning phase, the agent is given reward function r, and computes the optimal policy ˆπ under her estimated model (ˆq, ˆp) with respect to r. Given an accuracy parameter ε and a confidence parameter δ, she needs to guarantee that for any given reward function r, the policy ˆπ with respect to r is ε-optimal, i.e., V ˆπ 1 (s ; r) V π 1 (s ; r) ε, with probability at least 1 δ. We measure the performance by sample complexity, i.e., the number of episodes used in the exploration phase to guarantee an ε-optimal policy for any given r. In order to ensure that the number of triggered state-base action pairs will not increase exponentially and designing sample efficient algorithms is possible in branching RL, we introduce the following assumption. Assumption 1 (Bounded Trigger Probability). For any (s, a) S Auniv, we have q(s, a) 1 To justify the necessity of Assumption 1, we provide a rigorous lower bound to show that once relaxing the threshold of trigger probability, any branching RL algorithm must suffer an exponential regret. Theorem 2. Suppose that for any (s, a) S Auniv, q(s, a) q for some threshold parameter q > 1 m. 
Then, there exists an instance of branching RL with H > 1, where the regret of any algorithm is bounded by Ω( m q((m q)H 1 1) We describe the intuition behind this lower bound, and defer the full proof to Appendix C.2.2. Consider a branching MDP, where at an early step the agent has to distinguish the optimal action that has trigger probability q, from the sub-optimal actions that have trigger probabilities only q η. Once the agent takes a sub-optimal action at the early step, such trigger sub-optimality will impact exponentially many states in the following steps, and she will suffer a regret of mη m q + m2 q2 + + m H 1 q H 1 in this episode, which is the sum of a geometric progression with common ratio m q. If q > 1 m, summing over all episodes, the total regret is exponentially large with respect to H. Besides this lower bound, Assumption 1 is also mild in practice, since in real-world applications such as recommendation systems (Fu et al., 2021) and online advertising (Kang et al., 2020), it is often the case that users are only attracted to and click on a few items in a recommended list. In addition, in multi-step (e.g., category-based) recommendation, the interests of users usually converge to a single branch in the end (in expectation). When m = 1, our branching RL reduces to standard episodic RL (Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019) with transition probability paug, such that paug(s |s, a) = 1 q(s, a) and paug(s |s, a) = q(s, a)p(s |s, a) for any s = s . In this case, our results match the state-of-the-art results for standard RL in both RM (Azar et al., 2017; Zanette & Brunskill, 2019) and RFE (M enard et al., 2021; Zhang et al., 2021) settings. Branching Reinforcement Learning 4. Properties of the Branching Markov Decision Process Before introducing our algorithms for branching RL, in this section, we first investigate special structural properties of branching MDP, which are critical to deriving tight (polynomial) regret and sample complexity guarantees. 4.1. Branching Value Difference Lemma and Law of Total Variance Different from standard episodic MDP (Azar et al., 2017; Zanette & Brunskill, 2019) where a trajectory is an H-step path, the trajectory of branching MDP is an m-ary tree with each node a state and each edge a state-base action pair. Thus, many analytical tools in standard MDP, e.g., value difference lemma (Dann et al., 2017) and law of total variance (Jin et al., 2018; Zanette & Brunskill, 2019), cannot be directly applied to branching MDP. To handle this problem, we establish new fundamental techniques for branching MDP, including branching value difference lemma and branching law of total variance. First, we present a branching value difference lemma. Lemma 3 (Branching Value Difference Lemma). For any two branching MDP M (S, Auniv, A, m, H, q , p , r) and M (S, Auniv, A, m, H, q , p , r), the difference in values under the same policy π satisfies that V π h (s) V π h (s) = ℓ=1 Eq ,p ,π h q (sτ, aτ ℓ) q (sτ, aτ ℓ) r(sτ, aτ ℓ) + q (sτ, aτ ℓ)p (sτ, aτ ℓ) q (sτ, aτ ℓ)p (sτ, aτ ℓ) V π |τ ℓ|+1 sσ = s i , where τ := σ σ . Using Lemma 3 with M and M being the optimistic and true models, respectively, we can bound the difference between optimistic and true values by the deviations between optimistic and true trigger and transition probabilities, in expectation with respect to the true model. Next, we provide a branching law of total variance, which is critical to analyzing the estimation error of transition. 
Lemma 4 (Branching Law of Total Variance). For any policy π, Eq,p,π h m (H 1) X ℓ=1 Varq,p V π |σ ℓ|+1(sσ ℓ)|sσ, aσ ℓ i =Eq,p,π h m (H 1) X ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) V π 1 (s ) 2i (2) Eq,p,π h m (H 1) X σ= 1 {sσ = s } 2i . (3) Here Varq,p V π |σ ℓ|+1(sσ ℓ)|sσ, aσ ℓ denotes the variance of value V π |σ ℓ|+1(sσ ℓ) with respect to sσ ℓ, which depends on trigger probability q(sσ, aσ ℓ) and transition probability p( |sσ, aσ ℓ), conditioning on (sσ, aσ ℓ). Remark 1. Lemma 4 exhibits that under branching MDP, the sum of conditional variances over all state-base action pairs is equal to the overall variance considering the whole trajectory, shown by Eq. (2). Furthermore, the overall variance can be bounded by the total number of regular (triggered) states, revealed by Eq. (3). From Lemma 4, we have that to bound the estimation error of transition, which is related to the sum of conditional variances, it suffices to bound the total number of triggered states in a trajectory tree (discussed in the following). 4.2. The Number of Triggered States In this subsection, we show that with Assumption 1 that only constrains the first moment of trigger distribution, we can bound both the first and second moments of the number of triggered states in a trajectory tree. Lemma 5 (The Number of Triggered States). For any policy π, Eq,p,π h m (H 1) X σ= 1 {sσ = s } i H, (4) Eq,p,π h m (H 1) X σ= 1 {sσ = s } 2i 3H2. (5) Remark 2. Eq. (4) gives a universal upper bound of value function as V π h (s) h Pm (H 1) σ= 1 {sσ = s } i H for any s S, h [H], π. Moreover, Eq. (5) provides a sharp upper bound for overall variance, as well as the sum of conditional variances of transition (by plugging Eq. (5) into Lemma 4). To our best knowledge, this second moment result is novel. Lemma 5 shows that despite the exponentially increasing nodes in branching MDP, its value and overall variance (estimation error) will not explode. This critical property enables us to avoid an exponentially-large regret or sample complexity. Novel Analysis for Triggered States. The analysis of Lemma 5 is highly non-trivial. We first relax all regular trigger probabilities to 1 m, and then investigate the number of triggered states for each step individually. While we can Branching Reinforcement Learning show that the number of triggered states at each step is a conditional Binomial random variable, the distribution of their sum is too complex to express. This incurs a nontrivial challenge on analyzing the second moment of the total number of triggered states. To tackle this challenge, we investigate the correlation of triggered states between any two steps, by exploiting the structure of branching MDP. Proof sketch. Under Assumption 1, to bound the total number of triggered (regular) states for any branching MDP and policy π, it suffices to bound it under a relaxed model M with q(s, a) = q := 1 m for all (s, a) S \ {s } Auniv. Let ωh denote the number of triggered states at each step h under M , and ω := PH h=1 ωh. Below we prove that E[ω] H and E[ω2] 3H2. For h = 1, ωh = 1 deterministically. For h 2, ωh|ωh 1 Binomial(mωh 1, q ). According to the properties of Binomial distribution and q := 1 m, for h 2, E [ωh] = mq E [ωh 1] = 1, E (ωh)2 =mq (1 q )E [ωh 1]+m2(q )2E (ωh 1)2 = (1 q ) + E (ωh 1)2 h. Hence, we have that E[ω] = PH h=1 E[ωh] = H, and h=1 E[(ωh)2] + 2 X 1 0, the regret of algorithm Branch VI is bounded by O(H SNK log( SNH(m H K) δ )). In particular, when K m H, the regret is bounded by SNK log SNHK Remark 3. 
Theorem 6 shows that, despite the exponentially-large trajectory of branching RL, Branch VI enjoys a regret only polynomial in problem parameters. For large enough K such that K m H, Theorem 6 matches the lower bound (presented in Section 5.3) up to logarithmic factors. In addition, when branching RL reduces to standard RL (i.e., m = 1), our result also matches the state-of-thearts (Azar et al., 2017; Zanette & Brunskill, 2019). Branching Regret Analysis. In contrast to standard RL (Azar et al., 2017; Zanette & Brunskill, 2019), we derive a tree-structured regret analysis based on special structural properties of branching RL in Section 4. Using the branching value difference lemma (Lemma 3), we can decompose the regret into the estimation error from each regular state-base action pair as follows: (s,a),s =s wk,σ,ℓ(s, a) h qk(s, a) q(s, a) r(s, a) | {z } Term 1: Triggered reward + qk(s, a) pk( |s, a) ˆqk(s, a)ˆpk( |s, a) V k |σ ℓ|+1 | {z } Term 2: Triggered transition optimism + ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) V |σ ℓ|+1 | {z } Term 3: Triggered transition estimation + ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) V k |σ ℓ|+1 V |σ ℓ|+1 | {z } Term 4: Lower order term Here wk,σ,ℓ(s, a) denotes the probability that (sσ, aσ ℓ) = (s, a) in episode k, and qk and pk represent optimistic trigger and transition probabilities, respectively. In this decomposition, we address the estimation error for triggered reward (Term 1) and triggered transition (Terms 2, 3) separately, and consider trigger and transition probabilities as a whole distribution (in Terms 2, 3). The dominant terms are Terms 2, 3, which stand for the estimation error for triggered transition and depend on the sum of conditional variances. Using branching law of total variance (Lemma 4) and the second moment bound of triggered states (Lemma 5), we can bound Terms 2, 3 by only O(H2) despite exponential state-base action pairs. We note that it is necessary to separately address triggered reward and triggered transition, and consider trigger and transition as a whole distribution in bonus design (Line 1) and regret decomposition. A naive adaption of standard RL algorithm (Zanette & Brunskill, 2019), which adds bonuses on trigger and transition probabilities respectively, i.e., replacing Line 1 with f k h(s, a) ˆqk(s, a) + bk,q(s, a) (r(s, a)+ ˆpk( |s, a) V k h+1 +bk,p V (s, a)), will suffer an extra factor H in the regret bound. Please see Appendix A for more discussion. 5.3. Regret Lower Bound In this subsection, we provide a lower bound for branching RL-RM which is polynomial in problem parameters and Branching Reinforcement Learning Algorithm 2 Branch RFE 1: Input: s , ε, δ, β(t, κ) := log(SN/κ) + S log(8e(t + 1)) for any t N and κ (0, 1). Initialize Bk h(s )= 0, h [H], k and Bk H+1(s)=0, s S, k. 2: for k = 1, 2, . . . do 3: for h = H, H 1, . . . , 1 do 4: for s S \ {s } do 5: for a Auniv do 6: ˆqk(s, a) Jk sum(s,a) nk(s,a) ; 7: ˆpk(s |s, a) P k sum(s |s,a) Jk sum(s,a) , s S; 8: gk h(s, a) 12H2 β(nk(s,a),δ) nk(s,a) + 1 + 1 H ˆqk(s, a)ˆpk( |s, a) Bk h+1(s); 9: end for 10: πk h(s) argmax A A P a A gk h(s, a); 11: Bk h(s) min{max A A P a A gk h(s, a), H}; 12: end for 13: end for 14: if 4e p Bk 1(s ) + Bk 1(s ) ε 2, return (ˆqk, ˆpk); 15: else Take policy πk and observe the trajectory; 16: end for demonstrates the optimality of Branch VI. Theorem 7 (Regret Lower Bound). There exists an instance of branching RL-RM, where any algorithm must have Ω(H SNK) regret. 
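To complement the pseudocode of Algorithm 2 above, the following Python sketch shows one backward pass of Branch RFE. The bookkeeping structure counts, the max(n, 1) guards for unvisited pairs, and the function name are our own assumptions rather than details from the paper; the point to note is that the maximization over super actions in Lines 10-11 reduces to selecting the m base actions with the largest per-action error terms, which is why the per-episode computation is O(S^2 N) rather than O(S^2 |A|).

```python
import numpy as np

def branch_rfe_backward_pass(S_reg, A_univ, m, H, counts, delta, s_end):
    """One backward pass of Branch RFE (Algorithm 2), sketched with our own
    bookkeeping: counts[(s, a)] = (n, J, P), where n is the number of visits
    to (s, a), J the number of successful triggers, and P a dict of
    successor-state counts among the triggered visits.
    """
    S, N = len(S_reg) + 1, len(A_univ)
    def beta(t):                                   # beta(t, delta) of Algorithm 2
        return np.log(S * N / delta) + S * np.log(8 * np.e * (t + 1))
    B = {s: 0.0 for s in list(S_reg) + [s_end]}    # B_{H+1} = 0
    policy = {}
    for h in range(H, 0, -1):
        B_h, policy[h] = {s_end: 0.0}, {}
        for s in S_reg:
            g = {}
            for a in A_univ:
                n, J, P = counts[(s, a)]
                q_hat = J / max(n, 1)
                # hat q * hat p(.|s,a)^T B_{h+1}: estimated downstream error if triggered
                next_err = q_hat * sum(c / max(J, 1) * B[s2] for s2, c in P.items())
                g[a] = 12 * H**2 * beta(n) / max(n, 1) + (1 + 1 / H) * next_err
            top_m = sorted(g, key=g.get, reverse=True)[:m]   # maximization oracle
            policy[h][s] = tuple(top_m)
            B_h[s] = min(sum(g[a] for a in top_m), H)
        B = B_h
    # keep exploring with `policy` until 4e * sqrt(B[s_init]) + B[s_init] <= eps / 2
    return B, policy
```

The same kind of top-m maximization oracle underlies Branch VI, with the optimistic state-base action values in place of the error terms g, so neither algorithm needs to enumerate the combinatorially many super actions.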
Remark 4.Theorem 7 validates that the regret of Branch VI (Theorem 6) is near-optimal for large enough K, and reveals that, a polynomial regret is achievable and tight even with exponentially-large trajectories in branching RL. Branching Regret Lower Bound Analysis. Unlike prior standard episodic RL works (Azar et al., 2017; Jin et al., 2018) which directly adapt the diameter-based lower bound (Jaksch et al., 2010) to the episodic setting, we derive a new lower bound analysis for branching RL-RM. We construct a hard instance, where an agent uniformly enters one of Θ(S) bandit states , i.e., states with an optimal action and sub-optimal actions, and hereafter always transitions to a homogeneous state , i.e., a state with homogeneous actions. Then, if the agent makes a mistake in a bandit state, she will suffer Ω(H) regret in this episode. By bounding the KL-divergence between this hard instance and uniform instance, we can derive a desired lower bound. 6. Branching Reinforcement Learning with Reward-free Exploration In this section, we investigate branching RL-RFE, and develop an efficient algorithm Branch RFE and nearly matching upper and lower bounds of sample complexity. 6.1. Algorithm Branch RFE In each episode, Branch RFE estimates the trigger and transition probabilities, and computes the estimation error Bk h(s), which stands for the cumulative variance of trigger and transition from step h to H. Once the total estimation error Bk 1(s ) is shrunk below the required accuracy, Branch RFE returns the estimated model (ˆqk, ˆpk). Given any reward function r, the optimal policy ˆπ under (ˆqk, ˆpk) with respect to r is ε-optimal, i.e., V ˆπ 1 (s ; r) V 1 (s ; r) ε, with probability at least 1 δ. We describe Branch RFE (Algorithm 2) as follows: In each episode k, for each step h, Branch RFE first estimates the trigger and transition probabilities ˆqk, ˆpk (Lines 2,2), and calculates the component estimation error gk h(s, a) for each state-base action pair (Line 2). Then, similar to Branch VI, we utilize a maximization oracle to efficiently find the action with the maximum estimation error (the most necessary action for exploration) to be πk h(s), and assign such maximum error to Bk h(s) (Lines 2,2). If the square root of total estimation error p Bk 1(s ), which represents the standard deviation of trigger and transition for whole trajectory, is smaller than ε/2, Branch RFE stops and outputs the estimated model (ˆqk, ˆpk) (Line 2); Otherwise, it continues to explore according to the computed policy πk (Line 2). The computation complexity of Branch RFE is also O(S2N) instead of O(S2|A|), since it only computes the component estimation error for state-base action pairs rather than enumerating super actions, and utilizes a maximization oracle to calculate πk h(s) and Bk h(s). 6.2. Sample Complexity Upper Bound for Branch RFE Now we present the sample complexity for Branch RFE. We say an algorithm for branching RL-RFE is (δ, ε)-correct, if it returns an estimated model (ˆq, ˆp) such that given any reward function r, the optimal policy under (ˆq, ˆp) with respect to r is ε-optimal with probability at least 1 δ. Theorem 8 (Sample Complexity Upper Bound). For any ε > 0 and δ (0, 1), algorithm Branch RFE is (δ, ε)- correct. Moreover, with probability 1 δ, the number of episodes used in Branch RFE is bounded by + S log e m H C2 , where C = log log SN δ + S log e m H HSN Remark 5. 
Theorem 8 shows that even with exponentially-large trajectories in branching RL, Branch RFE only needs polynomially many episodes to ensure an ε-optimal policy for any reward function. This sample complexity is optimal for small enough δ within logarithmic factors (compared to the lower bound presented in Section 6.3). In addition, when degenerating to standard RL (i.e., m = 1), our result also matches the state of the art (Ménard et al., 2021; Zhang et al., 2021).

Branching Sample Complexity Analysis. Unlike standard RL (Ménard et al., 2021; Zhang et al., 2021), which only bounds the estimation error in a single H-step path, in branching RL we need to unfold and analyze the estimation error for all state-base action pairs in a trajectory tree. In our analysis, we utilize special structural properties of branching MDP, e.g., the branching law of total variance (Lemma 4) and the second moment bound of triggered states (Lemma 5), to carefully bound the total estimation error throughout the trajectory tree. Despite exponentially-large trajectories, we obtain a sample complexity that is only polynomial in the problem parameters.

6.3. Sample Complexity Lower Bound

Theorem 9 (Sample Complexity Lower Bound). There exists an instance of branching RL-RFE, where any (δ, ε)-correct algorithm requires Ω((H^2 SN / ε^2) log(1/δ)) trajectories.

Remark 6. Theorem 9 demonstrates that Branch RFE (Theorem 8) achieves a near-optimal sample complexity for small enough δ, and also that a polynomial sample complexity is achievable and sharp for branching RL-RFE.

7. Experiments

In this section, we conduct experiments for branching RL. We set K = 5000, δ = 0.005, H = 6, m = 2, N ∈ {10, 15}, and let S contain the ending state together with the regular states s1, . . . , s5. A is the collection of all m-cardinality subsets of Auniv = {a1, . . . , aN}, and thus |A| = C(N, m) ∈ {45, 105}. The reward function is r(s, a) = 1 for any (s, a) ∈ S × Auniv. The trigger probability is q(s, a) = 1/m for any (s, a) ∈ S × {a_{N-1}, a_N}, and q(s, a) = 1/(2m) for any (s, a) ∈ S × (Auniv \ {a_{N-1}, a_N}). We set s1 as the initial state for each episode. Under all actions a ∈ Auniv, the transition probability is p(s'|s1, a) = 0.5 for any s' ∈ {s2, s3}, and p(s'|s, a) = 0.5 for any (s, s') ∈ {s2, s3} × {s4, s5} or (s, s') ∈ {s4, s5} × {s2, s3}. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs.

Since we study a new problem and there is no existing algorithm for branching RL, we compare our algorithm Branch VI with two adaptations from standard RL, i.e., Euler-Adaptation (Zanette & Brunskill, 2019) and ε-Greedy (ε = 0.01). The former uses individual exploration bonuses for trigger and transition, the latter uses ε-greedy exploration, and both of them explicitly maintain Q-functions. As shown in Figure 2, Branch VI enjoys a lower regret and a faster running time than the baselines, which demonstrates the effectiveness of our exploration strategy and the computational efficiency of our algorithm.

[Figure 2. Experiments for branching RL: cumulative regret over 5000 episodes for Branch VI, Euler-Adaptation and ε-Greedy (ε = 0.01); the running times reported in the legends are 26.827s, 46.342s and 45.677s in the first setting, and 35.199s, 87.958s and 88.890s in the second.]

8. Conclusion and Future Work

In this paper, we formulate a novel branching RL model, and consider both regret minimization and reward-free exploration metrics.
Different from standard RL where each episode is a single H-step path, branching RL is a treestructured forward model which allows multiple base actions in a state and multiple successor states. For branching RL, we build novel fundamental analytical tools and carefully bound the overall variance. We design efficient algorithms and provide near-optimal upper and lower bounds. There are many interesting directions to explore. One direction is to improve the dependency on H in our regret upper bound (Theorem 6) for small K and close the gap on factors H, S between sample complexity upper and lower bounds (Theorems 8,9). Another direction is to extend branching RL from the tabular setting to function approximation, e.g., representing the value function in a linear form with respect to the feature vectors of state-action pairs (Jin et al., 2020b; Zhou et al., 2021). Agrawal, S. and Jia, R. Posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184 1194, 2017. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002. Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263 272. PMLR, 2017. Burnetas, A. N. and Katehakis, M. N. Optimal adaptive policies for markov decision processes. Mathematics of Operations Research, 22(1):222 255, 1997. Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Branching Reinforcement Learning Neural Information Processing Systems, pp. 2818 2826, 2015. Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5717 5727, 2017. Domingues, O. D., M enard, P., Kaufmann, E., and Valko, M. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Algorithmic Learning Theory, pp. 578 598. PMLR, 2021. Fu, M., Agrawal, A., Irissappane, A. A., Zhang, J., Huang, L., and Qu, H. Deep reinforcement learning framework for category-based item recommendation. IEEE Transactions on Cybernetics, 2021. Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010. Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4868 4878, 2018. Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870 4879. PMLR, 2020a. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020b. Kang, S., Jeong, C., and Chung, K. Tree-based real-time advertisement recommendation system in online broadcasting. IEEE Access, 8:192693 192702, 2020. Kaufmann, E., M enard, P., Domingues, O. D., Jonsson, A., Leurent, E., and Valko, M. Adaptive reward-free exploration. In International Conference on Algorithmic Learning Theory, pp. 865 891. PMLR, 2021. Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623 648, 2004. M enard, P., Domingues, O. 
D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599 7608. PMLR, 2021. Osband, I. and Van Roy, B. On lower bounds for regret in reinforcement learning. ar Xiv preprint ar Xiv:1608.02732, 2016. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304 7312. PMLR, 2019. Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 2020. Zhang, Z., Du, S. S., and Ji, X. Nearly minimax optimal reward-free reinforcement learning. International Conference on Machine Learning, 2021. Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pp. 12793 12802. PMLR, 2021. Branching Reinforcement Learning A. Discussion for Algorithm Branch VI We remark that it is necessary to separately address triggered reward and triggered transition, and consider trigger and transition as a whole distribution in bonus design (Line 1 in Algorithm 1) and regret analysis (Eq. (13)). A counter example is discussed below to support this statement. If one naively adapts standard RL algorithms (Azar et al., 2017; Zanette & Brunskill, 2019), she/he may directly add bonuses on trigger and transition probabilities, respectively, without separating triggered reward and triggered transition. In this case, f k h(s, a) (Line 1) becomes: f k h(s, a) ˆqk(s, a) + bk,q(s, a) r(s, a) + ˆpk( |s, a) V k h+1 + bk,p V (s, a) Then, in regret decomposition, we will obtain Regret(K) (a) O(1) (s,a),s =s wkσℓ(s, a) h bk,q(s, a) r(s, a) + p( |s, a) V k |σ ℓ|+1 | {z } Term Λ + bk,p V (s, a)q(s, a) i , where (a) omits second order terms. Since Term Λ already reaches Θ(h) order, further summing over all episodes k and steps h, we will suffer an extra H factor in the final result. One can see from this counter example that, our bonus design and analysis, which separately handle triggered reward and triggered transition and consider trigger and transition as a whole distribution, are sharp and enable a near-optimal regret. B. Proofs for Properties of Branching MDP In this section, we present the proofs for structural properties of branching MDP, including branching value difference lemma (Lemma 3), branching law of total variance (Lemma 4) and the upper bounds of the number of triggered states (Lemma 5). B.1. Proof of Lemma 3 Proof of Lemma 3. This proof adapts the analysis of Lemma E.15 in (Dann et al., 2017) to branching RL. 
According to branching Bellman equations, V π h (sσ) V π h (sσ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) + q (sσ, aσ ℓ)p (sσ, aσ ℓ) V π h+1 q (sσ, aσ ℓ)p (sσ, aσ ℓ) V π h+1 q (sσ, aσ ℓ)r(sσ, aσ ℓ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) + (q (sσ, aσ ℓ)p (sσ, aσ ℓ) q (sσ, aσ ℓ)p (sσ, aσ ℓ)) V π h+1 + q (sσ, aσ ℓ)p (sσ, aσ ℓ) V π h+1 V π h+1 q (sσ, aσ ℓ)r(sσ, aσ ℓ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) + (q (sσ, aσ ℓ)p (sσ, aσ ℓ) q (sσ, aσ ℓ)p (sσ, aσ ℓ)) V π h+1 ℓ=1 Eq ,p ,π V π h+1(sσ ℓ) V π h+1(sσ ℓ)|sh = s ℓ=1 Eq ,p ,π h q (sσ σ , aσ σ ℓ)r(stℓ, aσ σ ℓ) q (stℓ, aσ σ ℓ)r(stℓ, aσ σ ℓ) Branching Reinforcement Learning + (q (stℓ, aσ σ ℓ)p (stℓ, aσ σ ℓ) q (stℓ, aσ σ ℓ)p (stℓ, aσ σ ℓ)) V π h+1 i σ =1 2 Eq ,p ,π V π h+2,ℓ(sσ ) V π h+2,ℓ(sσ )|sh ℓ=1 Eq ,p ,π h q (sσ σ , aσ σ ℓ)r(sσ σ , aσ σ ℓ) q (sσ σ , aσ σ ℓ)r(sσ σ , aσ σ ℓ) + (q (sσ σ , aσ σ ℓ)p (sσ σ , aσ σ ℓ) q (sσ σ , aσ σ ℓ)p (sσ σ , aσ σ ℓ)) V π |σ σ ℓ|+1 i B.2. Proof of Lemma 4 Proof of Lemma 4. First, we prove the equality. This proof adapts the analysis of standard law of total variance in (Jin et al., 2018; Zanette & Brunskill, 2019) to branching RL. ℓ=1 Varsσ ℓ q,p V π |σ ℓ|+1(sσ ℓ)|sσ, aσ ℓ V π |σ ℓ|+1(sσ ℓ) q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 2 " m (H 1) X q(sσ, aσ ℓ)r(sσ, aσ ℓ) + V π |σ ℓ|+1(sσ ℓ) q(sσ, aσ ℓ)r(sσ, aσ ℓ) + q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 !2# " m (H 1) X q(sσ, aσ ℓ)r(sσ, aσ ℓ) + V π |σ ℓ|+1(sσ ℓ) q(sσ, aσ ℓ)r(sσ, aσ ℓ) + q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 !2# ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) + ℓ=1 V π |σ ℓ|+1(sσ ℓ) σ= V π |σ|+1(sσ) ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) V π 1 (s ) Here (a) comes from the Markov property and that conditioning on the filtration of step h, the triggers and transitions of state-base action pairs at step h + 1 are independent. (b) is due to that V π |σ|+1(sσ) = Pm ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) + q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 and we can merge the m terms of state-base ac- tion value into V π |σ|+1(sσ). Now, we prove the inequality. ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) V π 1 (s ) Branching Reinforcement Learning ℓ=1 q(sσ, aσ ℓ) q(sσ, aσ ℓ)1 {sσ = s } + q(sσ, aσ ℓ)1 {sσ = s } 1 m1 {sσ = s } σ= 1 {sσ = s } where (a) is due to Assumption 1 and q(s , a) = 0 for any a Auniv. B.3. Proof of Lemma 5 Proof of Lemma 5. Under Assumption 1, to bound the total number of triggered (regular) states for any branching MDP and policy π, it suffices to bound it under a relaxed model M with q(s, a) = q := 1 m for all (s, a) S \ {s } Auniv. Let ωh denote the number of triggered states at each step h under M , and ω := PH h=1 ωh. Below we prove that E[ω] H and E[ω2] 3H2. For h = 1, ωh = 1 deterministically. For h 2, ωh|ωh 1 Binomial(mωh 1, q ). According to the properties of Binomial distribution and q := 1 m, for h 2, E [ωh] = mq E [ωh 1] = 1, E (ωh)2 =mq (1 q )E [ωh 1] + m2(q )2E (ωh 1)2 = (1 q ) + E (ωh 1)2 h. Hence, we have that E[ω] = PH h=1 E[ωh] = H, and h=1 E[(ωh)2] + 2 X 1 1 Then, Eq. (23) becomes E[Regret A] =E[Rewardπ ] E[Reward A] V 1 (s1) V πk 1 (s1) # j=1 E h K Txi,Aj(xi) m q mη + m2 q2 mη + + m H 1 q H 1 mη i j=1 E h K Txi,Aj(xi) m q (m q)H 1 1 =m q (m q)H 1 1 j=1 E Txi,Aj(xi) and Eq (25) becomes E[Regret A] m q (m q)H 1 1 j=1 E Txi,Aj(xi) Branching Reinforcement Learning m q (m q)H 1 1 m K 2c1d(S 3) Let η = c2 q m K for some small enough constant c2, we have E[Regret A] =Ω m q (m q)H 1 1 m q (m q)H 1 1 Therefore, when relaxing the trigger probability threshold in Assumption 1 to some q > 1 m, any algorithm for branching RL-RM must suffer an exponential regret. D. 
Proofs for Branching RL with Reward-Free Exploration In this section, we prove sample complexity upper and lower bounds (Theorems 8,9) for branching RL-RFE. D.1. Proof for Sample Complexity Upper Bound D.1.1. AUGMENTED TRANSITION DISTRIBUTION First, we introduce an augmented transition distribution paug( |s, a) and connect it with trigger distribution q(s, a) and transition distribution p( |s, a). For any (s, a) S Auniv, let paug( |s, a) denote the augmented transition distribution on S, which satisfies that paug(s |s, a) = 1 q(s, a), paug(s |s, a) = q(s, a)p(s |s, a), s S \ {s }. For any episode k, we can also define the empirical augmented transition distribution as ˆpaug,k(s |s, a) = 1 ˆqk(s, a), ˆpaug,k(s |s, a) = ˆqk(s, a)ˆpk(s |s, a), s S \ {s }. Lemma 18. For any function f( ) defined on S such that f(s ) = 0 (e.g., f can be the value function V π h ( ), h [H], π), it holds that paug( |s, a) f = q(s, a)p( |s, a) f, (29) ˆpaug,k( |s, a) Vh+1 = ˆqk(s, a)ˆpk( |s, a) f. (30) Proof of Lemma 18. We prove Eq. (29) as follows: paug( |s, a) f = X s S paug(s |s, a)f(s ) s S\{s } q(s, a)p(s |s, a)f(s ) = q(s, a)p( |s, a) f Eq. (30) can be proved in a similar manner. Branching Reinforcement Learning D.1.2. CONCENTRATION In the following, we introduce a concentration lemma. Lemma 19 (KL-divergence Based Concentration of Triggered Transition). Defining event KL ˆpaug,k( |s, a) paug( |s, a) log SN δ + S log 8e(nk(s, a) + 1) nk(s, a) , (s, a) S Auniv, k it holds that Pr [G] 1 δ. Proof of Lemma 19. Using Theorem 3 and Lemma 3 in (M enard et al., 2021), we can obtain this lemma. D.1.3. KL DIVERGENCE-BASED TECHNICAL TOOLS Below, we present several useful KL divergence-based technical tools. Lemma 20 (Lemma 10 in (M enard et al., 2021)). Let p1 and p2 be two distributions on S such that KL(p1, p2) α. Let f be a function defined on S such that for any s S, 0 f(s) b. Then, |p1f p2f| q 2Varp2(f)α + 2 Lemma 21 (Lemma 11 in (M enard et al., 2021)). Let p1 and p2 be two distributions on S such that KL(p1, p2) α. Let f be a function defined on S such that for any s S, 0 f(s) b. Then, Varp2(f) 2Varp1(f) + 4b2α Varp1(f) 2Varp2(f) + 4b2α Lemma 22 (Lemma 12 in (M enard et al., 2021)). Let p1 and p2 be two distributions on S such that KL(p1, p2) α. Let f, g be two functions defined on S such that for any s S, 0 f(s), g(s) b. Then, Varp1(f) 2Varp1(g) + 2bp1|f g| Varp2(f) Varp1(f) + 3b2α p1 p2 1 D.1.4. ESTIMATION ERROR Next, we state an important lemma on estimation error. Lemma 23 (Estimation Error). Suppose that the concentration event G holds. Then, for any episode k, policy π and reward function r, ˆV k,π 1 (s; r) V π 1 (s; r) 4e q Bk 1(s) + Bk 1(s). Proof of Lemma 23. For any t N, κ (0, 1), let β(t, κ) := log(SN/κ) + S log(8e(t + 1)). 
Then, for any π, r and (s, A) S \ {s } A, ˆQk,π h (s, A; r) Qπ h(s, A; r) ˆqk(s, a) q(s, a) r(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 q(s, a)p( |s, a) V π h+1 ˆqk(s, a) q(s, a) r(s, a) + ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) V π h+1 + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 Branching Reinforcement Learning β(nk(s, a), δ ) nk(s, a) r(s, a) + 2Vars q,p V h+1(s ) β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + 4Vars ˆq,ˆp V h+1(s ) β(nk(s, a), δ ) nk(s, a) + 8H2 β(nk(s, a), δ ) 3H β(nk(s, a), δ ) nk(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) 8Vars ( ˆV k,π h+1(s ))β(nk(s, a), δ ) nk(s, a) +8H ˆqk(s, a)ˆpk( |s, a) V π h+1 ˆV k,π h+1 β(nk(s, a), δ ) nk(s, a) +8H2 β(nk(s, a), δ ) 3H β(nk(s, a), δ ) nk(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(s, a), δ ) 8H2 β(nk(s, a), δ ) 1 H ˆqk(s, a)ˆpk( |s, a) V π h+1 ˆV k,π h+1 8H2 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(s, a), δ ) H ˆqk(s, a)ˆpk( |s, a) V π h+1 ˆV k,π h+1 + 12H2 β(nk(s, a), δ ) nk(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(s, a), δ ) nk(s, a) + 12H2 β(nk(s, a), δ ) ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 Here (a)(b)(c) use Lemmas 20,21,22, respectively. (d) is due to that x + y + z x + y + z for x, y, z 0. (e) comes from that xy x + y for x, y 0. Then, unfolding ˆQk,π 1 (s, a; r) Qπ 1(s, a; r) , we have ˆQk,π 1 (s, A; r) Qπ 1(s, A; r) h 1 Eˆq,ˆp,π β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) r(sσ, aσ ℓ) Branching Reinforcement Learning v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) + 12H2 β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) 1 {sσ = s } (s,a),s =s ˆwk,π σℓ(s, a) v u u t Vars ˆq,ˆp ˆV k,π h+1(s ) H2 H2β(nk(s, a), δ ) nk(s, a) + 12H2 β(nk(s, a), δ ) (s,a),s =s ˆwk,π σℓ(s, a) Vars ˆq,ˆp ˆV k,π h+1(s ) v u u u t H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) + 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) v u u u t 1 H2 Eˆq,ˆp,π ℓ=1 Vars ˆq,ˆp ˆV k,π h+1(s ) v u u u t H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) + 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) v u u u t H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) nk(s, a) + 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) where (a) uses branching law of total variance (Lemma 4), which also holds for the estimated model (ˆqk, ˆpk) if adding a clip operation ˆqk(s, a) min{ˆqk(s, a), 1 m} in algorithm Branch RFE to guarantee ˆqk 1 Bk,π h (s, A) := min 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 ˆqk(s, a)ˆpk( |s, a) Bk,π h+1 Bk,π h (s) :=Bk,π h (s, π(s)). Bk h(s, A) := min 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 ˆqk(s, a)ˆpk( |s, a) Bk h(s) , H Bk h(s) := max A A Bk h(s, A) In the following, we show ˆQk,π 1 (s, A; r) Qπ 1(s, A; r) 4e q Bk,π 1 (s, A) + Bk,π 1 (s, A) (32) If Bk,π 1 (s, A) = H, Eq. (32) holds trivially. Otherwise, unfolding Bk,π 1 (s, A), we have Bk,π 1 (s, A) = 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) and using Eq. (31), we obtain ˆQk,π 1 (s, A; r) Qπ 1(s, A; r) 4e q Bk,π 1 (s, A) + Bk,π 1 (s, A). Branching Reinforcement Learning Thus, by the definitions of Bk h(s, A) and Bk h(s), we have ˆV k,π 1 (s; r) V π 1 (s; r) 4e q Bk,π 1 (s) + Bk,π 1 (s) 4e q Bk 1(s) + Bk 1(s). D.1.5. 
PROOF OF THEOREM 8 Now, we prove the sample complexity upper bound for algorithm Branch RFE (Theorem 8). Proof of Theorem 8. First, we prove the correctness. Let K denote the number of episodes that algorithm Branch RFE costs. According to the stopping rule (Line 2 in Algorithm Branch RFE) and Lemma 23, when algorithm Branch RFE stops in episode K, we have that for any π, r, ˆV K,π 1 (s1; r) V π 1 (s1; r) 4e q BK 1 (s) + BK 1 (s) ε Then, we have that for any r, V 1 (s1; r) V ˆπ 1 (s1; r) =V 1 (s1; r) ˆV K,π 1 (s1; r) + ˆV K,π 1 (s1; r) ˆV K,ˆπ 1 (s1; r) + ˆV K,ˆπ 1 (s1; r) V ˆπ 1 (s1; r) V 1 (s1; r) ˆV K,π 1 (s1; r) + ˆV K,ˆπ 1 (s1; r) V ˆπ 1 (s1; r) Now, we prove the sample complexity. Bk h(s, A) X 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 ˆqk(s, a)ˆpk( |s, a) Bk h+1 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) Bk h+1 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 2Vars q,p Bk h+1(s ) β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 2Hq(s, a)p( |s, a) Bk h+1 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 1 H q(s, a)p( |s, a) Bk h+12H2 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) Branching Reinforcement Learning 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 1 H q(s, a)p( |s, a) Bk h+1 + 2H2 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 + 2 H q(s, a)p( |s, a) Bk h+1 + 6H2 β(nk(s, a), δ ) 18H2 β(nk(s, a), δ ) nk(s, a) + 1 + 3 q(s, a)p( |s, a) Bk h+1 where (a) uses Lemma 20. Then, unfolding Bk 1(s) = Bk 1(s, πk(s)) and summing over k = 1, . . . , K 1, we have k=1 Bk 1(s) ℓ=1 Eq,p,πk 18e3H2 β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) 1 {sσ = s } =18e3H2Eq,p,πk (s,a),s =s Xk(s, a)β(nk(s, a), δ ) 18e3H2Eq,p,πk k=1 Xk(s, a)β(nk(s, a), δ ) 18e3H2 β (K 1)m H+1 m m 1 , δ Eq,p,πk (s,a),s =s log n K 1(s, a) 18e3H2 β (K 1)m H+1 m (s,a),s =s log Eq,p,πk n K 1(s, a) 18e3H2SN β (K 1)m H+1 m m 1 , δ log (H(K 1)) (a) 18e3H2SN log SN + S log 8e Hm H+1(K 1) log (H(K 1)) , (33) where (a) comes from β(t, κ) := log(SN/κ) + S log(8e(t + 1)) and m H+1 m m 1 Hm H+1. According to the stopping rule, we have ε 4e p Bk 1(s) + Bk 1(s) for k = 1, . . . , K 1. Then, summing over k = 1, . . . , K 1 for both sides, we obtain k=1 Bk 1(s) v u u t(K 1) k=1 Bk 1(s) + k=1 Bk 1(s) v u u t(K 1) k=1 Bk 1(s) + 1 k=1 Bk 1(s) Branching Reinforcement Learning (K 1) 18e3H2SN log SN + S log (8e Hm H+1(K 1)) log (H(K 1)) + S log 8e Hm H+1(K 1) log (H(K 1)) (K 1) log SN log (H(K 1)) + S log (8em H+1) log2 (H(K 1)) log (H(K 1)) + S log 8em H+1 log2 (H(K 1)) Using Lemma 13 in (M enard et al., 2021) with τ = K 1, C = 4e ε , A = log SN δ , α = H, B = E = S log 8em H+1 and D = 18e3H2SN ε , we obtain K 1 = O C2(A + B)C2 1 + S log e m H C2 1 where C1 = log(α(A + E)(C + D)). Thus, we have + S log e m H C2 1 C1 = log log SN + S log e m H HSN D.2. Sample Complexity Lower Bound In this subsection, we prove the sample complexity lower bound (Theorem 9) for branching RL-RFE. Proof of Theorem 9. This lower bound analysis follows the proof procedure of Theorem 2 in (Dann & Brunskill, 2015). We consider the same instance as the proof of regret minimization lower bound in Section C.2. The optimal policy π is to take action Aj(xi) at state xi, and we have V 1 (s1) = m(α + η) + m2(α + η)2 + + m H(α + η)H = H (34) Fix a policy π. 
For each i [S 3], let Gi := {π(xi) = Aj(xi)} denotes the event that policy π chooses the optimal action Aj(xi) in state xi. Then, we have V π 1 (s1) =m(α + η) + 1 (S 3) i=1 E h 1 {Gi} m2(α + η)2 + + m H(α + η)H + (1 1 {Gi}) m2(α + η)α + m3(α + η)α(α + η) + + m H(α + η)α(α + η)H 2 i =m(α + η) + 1 (S 3) i=1 E h 1 {Gi} m2(α + η)2 + + m H(α + η)H + (1 1 {Gi}) m2(α + η)α + m3(α + η)2α + + m H(α + η)H 1α i (35) Branching Reinforcement Learning Subtracting Eq. (35) by Eq. (34), we have V 1 (s1) V π 1 (s1) = 1 (S 3) i=1 E h (1 1 {Gi}) m(α + η)mη + m2(α + η)2mη + + m H 1(α + η)H 1mη i i=1 E h (1 1 {Gi}) mη(H 1) i i=1 E h 1 {Gi} i! The following analysis follows the proof procedure of Theorem 2 in (Dann & Brunskill, 2015). For π to be ε-optimal, we need i=1 E h 1 {Gi} i! Let η = 8e2ε cm(H 1), where c is an absolute constant that we specify later. Then, we have i=1 E [1 {Gi}] 1 c 8e4 Using Markov s inequality, we have i=1 E [1 {Gi}] 1 c 8e4 1 (S 3) 1 c 8e4 i=1 Pr [Gi] Since all Gi (i [S 3]) are independent of each other, there exist {δi}i [S 3] such that Pr Gi δi and 1 (S 3) 1 c 8e4 i=1 (1 δi) 1 δ, which is equivalent to i=1 δi (S 3) 1 + δ 1 c 8e4 Let ε be small enough such that η = 8e2ε cm(H 1) 1 4, and let δ to be small enough δ c 8e4 . Since all Gi are independent, we can use Theorem 1 in (Mannor & Tsitsiklis, 2004) to obtain 1 + δ 1 c 8e4 1 + δ 1 c 8e4 Let ni denote the number of observations on state xi. The KL-divergence of the trigger distribution on state xi between our constructed instance and the uniform instance is m KL (Bernoulli(α)||Bernoulli(α + η)) m η2 (α+η)(1 (α+η)) = O(mη2) with (α + η)(1 (α + η)) c1 for some absolute positive constant c1. Then, to ensure Pr Gi δi, we need 1 n cδi 1 + δ 1 c 8e4 Branching Reinforcement Learning where c1 and c2 are appropriate absolute constant, e.g., c1 = 400 and c2 = 4. In the following, we compute the worst bound over all δ1, . . . , δS 3 to ensure that PS 3 i=1 δi (S 3) 1 + δ 1 c 8e4 1 c 8e4 . min δ1,...,δS 3 1 n cδi 1 + δ 1 c 8e4 i=1 δi (S 3) 1 + δ 1 c 8e4 Using Lemma D.1 in (Dann & Brunskill, 2015), the optimal solution of this optimization is δ1 = = δS 3 = z, if c (1 log z) 1 with z = 1 + δ 1 c 8e4 1 c 8e4 . Since z 1 1 c 8e4 = c 8e4 and c (1 log z) is decreasing wi respect to z, we can obtain a sufficient condition for c (1 log z) 1 as Let c = 1 10, which satisfies this condition. Thus, δ1 = = δS 3 = z is the optimal solution to Eq. (37). Since in each episode, we only observe a single state xi, the number of required episodes is at least c2 1 + δ 1 c 8e4 1 c 8e4 c1c2d(S 3)m(H 1)2 c2 δ 1 c 8e4 + c 8e4