Branching Reinforcement Learning

Yihan Du 1   Wei Chen 2

Abstract

In this paper, we propose a novel Branching Reinforcement Learning (Branching RL) model, and investigate both Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model. Unlike standard RL, where the trajectory of each episode is a single H-step path, branching RL allows an agent to take multiple base actions in a state such that transitions branch out to multiple successor states correspondingly, and thus it generates a tree-structured trajectory. This model finds important applications in hierarchical recommendation systems and online advertising. For branching RL, we establish new Bellman equations and key lemmas, i.e., a branching value difference lemma and a branching law of total variance, and also bound the total variance by only O(H^2) under an exponentially-large trajectory. For the RM and RFE metrics, we propose computationally efficient algorithms Branch VI and Branch RFE, respectively, and derive nearly matching upper and lower bounds. Our regret and sample complexity results are polynomial in all problem parameters despite exponentially-large trajectories.

1 IIIS, Tsinghua University, Beijing, China. 2 Microsoft Research. Correspondence to: Yihan Du, Wei Chen.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Reinforcement Learning (RL) (Burnetas & Katehakis, 1997; Sutton & Barto, 2018) models a fundamental sequential decision making problem, where an agent interacts with the environment over time in order to maximize the obtained rewards. Standard RL (Jaksch et al., 2010; Agrawal & Jia, 2017; Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019) considers taking only a single action in a state and formulates a single H-step path model. However, in many real-world applications such as recommendation systems (Fu et al., 2021) and online advertising (Kang et al., 2020), we often need to select multiple options at a time, and each option can trigger a corresponding successor state.

For example, in category-based shopping recommendation (Fu et al., 2021), the recommendation system often displays a list of main categories at the first step, where each one has a probability to be clicked. If a main category is clicked, at the second step, the system further provides a list of sub-categories according to the clicked main category. By analogy, at the last step, the system provides a list of items according to the chosen category path. In this process, users can select (trigger) more than one category-item path, e.g., one may buy IT accessories-printers-laser printers and IT accessories-scanners-document scanners at once.

To handle such scenarios involving multiple actions and successor states, we propose a novel Branching Reinforcement Learning (Branching RL) framework, which is an episodic tree-structured forward model. In each episode, an agent starts from an initial state and takes a super action that contains multiple base actions, where each base action in this state has a probability to be triggered. For each state-base action pair, if triggered successfully, the agent receives a reward and transitions to a next state; otherwise, if it is not triggered, the agent receives zero reward and transitions to an absorbing state associated with zero reward. Thus, the transitions branch out to multiple successor states.
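To make the tree-structured episode dynamics above concrete, the following minimal Python sketch rolls out one episode. The dictionary-based encoding of q, p, r and of the policy, as well as the function name sample_episode, are our own illustrative assumptions rather than notation from the paper; each node of the trajectory tree is indexed by a tuple, playing the role of the string indices formalized in Section 3, and any node whose parent edge is not triggered is set to the absorbing ending state.

```python
import numpy as np

def sample_episode(policy, q, p, r, H, m, s_init, s_end, rng):
    """Roll out one branching-RL episode as a tree of tuple-indexed nodes.

    policy[h][s] is a tuple of m base actions (a super action); q[(s, a)] is
    the trigger probability, r[(s, a)] the deterministic reward, and p[(s, a)]
    a dict mapping successor states to transition probabilities.
    """
    nodes = {(): s_init}                 # the root node, at step 1
    total_reward = 0.0
    for h in range(1, H + 1):
        for sigma in [idx for idx in nodes if len(idx) == h - 1]:
            s = nodes[sigma]
            if s == s_end:               # ending state: zero reward, self-loop
                for i in range(1, m + 1):
                    nodes[sigma + (i,)] = s_end
                continue
            for i, a in enumerate(policy[h][s], start=1):
                if rng.random() < q[(s, a)]:         # pair (s, a) is triggered
                    total_reward += r[(s, a)]
                    succ, probs = zip(*p[(s, a)].items())
                    nodes[sigma + (i,)] = succ[rng.choice(len(succ), p=probs)]
                else:                                # not triggered
                    nodes[sigma + (i,)] = s_end
    return nodes, total_reward
```

For instance, with rng = np.random.default_rng(0) and tabular dictionaries for q, p and r, repeated calls estimate a policy's expected total triggered reward by Monte Carlo, which is the quantity whose expectation defines the value functions introduced in Section 3.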
At the second step, for each branched-out state, the agent also selects a super action that contains multiple base actions with trigger probabilities. She only obtains rewards from the triggered state-base action pairs, and each state-base action pair transitions to a corresponding next state. Then, the transitions at the second step branch out to more successor states. By analogy, till the last step, she traverses an H-layer tree-structured trajectory, and only collects rewards at the triggered state-base action pairs. Different from standard episodic RL (Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019) where each trajectory is a single H-step path, the trajectory of branching RL is an H-layer triggered tree with exponentially increasing states and actions in each layer. This model allows an agent to take multiple base actions at once and handle multiple successor states. It can be applied to many hierarchical decision making scenarios, such as category-based recommendation systems (Fu et al., 2021) and online advertising (Kang et al., 2020). Branching Reinforcement Learning Under the branching RL model, we investigate two popular metrics in the RL literature, i.e., Regret Minimization (RM) and Reward-Free Exploration (RFE). In regret minimization (Jaksch et al., 2010; Azar et al., 2017; Zanette & Brunskill, 2019), the agent aims to minimize the gap between the obtained reward and the reward that can obtained by always taking the optimal policy. In reward-free exploration (Jin et al., 2020a; Kaufmann et al., 2021; M enard et al., 2021), the agent explores the unknown environment (model) without observation of rewards, in order to estimate the model accurately such that for any given reward function, she can plan a near-optimal policy using the estimated model. The performance in RFE is measured by the number of episodes used during exploration (i.e., sample complexity). Our work faces several unique challenges: (i) Since branching RL is a tree-structured forward model which greatly differs from standard RL, existing analytical tools for standard RL, e.g., Bellman equations, value difference lemma and law of total variance, cannot be directly applied to our problem. (ii) With exponentially-large trajectories, it is challenging to analyze the total variance and derive tight (polynomial) regret and sample complexity guarantees. (iii) Since the number of possible super actions can be combinatorially large, how to design a computationally efficient algorithm that avoids naive enumeration over all super actions is another challenge. To tackle the above challenges, we establish novel analytical tools, including branching Bellman equations, branching value difference lemma and branching law of total variance, and bound the total variance by only O(H2) under exponentially-large trajectories. We also propose computationally efficient algorithms for both RM and RFE metrics, and provide nearly matching upper and lower bounds, which are polynomial in all problem parameters despite exponentially-large trajectories. To sum up, our contributions in this paper are as follows: We propose a novel Branching Reinforcement Learning (Branching RL) framework, which is an episodic H-layer tree-structured forward model and finds important applications in hierarchical recommendation systems and online advertising. Under branching RL, we investigate two popular metrics, i.e., Regret Minimization (RM) and Reward-Free Exploration (RFE). 
We establish new techniques for branching RL, including branching Bellman equations, a branching value difference lemma and a branching law of total variance, and bound the total variance by only O(H^2) despite exponentially-large trajectories.

For both the RM and RFE metrics, we design computationally efficient algorithms Branch VI and Branch RFE, respectively, and build near-optimal upper and lower bounds, which are polynomial in all problem parameters even with exponentially-large trajectories. When our problem reduces to standard RL, our results match the state of the art.

Due to the space limit, we defer all proofs to the Appendix.

2. Related Work

Below we review the literature on standard (episodic and tabular) RL with the regret minimization (RM) and reward-free exploration (RFE) metrics.

Standard RL-RM. For the regret minimization (RM) metric, Jaksch et al. (2010) propose an algorithm that adds optimistic bonuses on transition probabilities, and achieves a regret bound with a gap in factors H, S compared to the lower bound (Jaksch et al., 2010; Osband & Van Roy, 2016). Here H is the length of an episode, and S is the number of states. Agrawal & Jia (2017) use posterior sampling and obtain an improved regret bound. Azar et al. (2017) build confidence intervals directly for value functions rather than transition probabilities, and provide the first optimal regret. Zanette & Brunskill (2019) design an algorithm based on both optimistic and pessimistic value functions, and achieve tighter problem-dependent regret bounds without requiring domain knowledge. The above works focus on model-based RL algorithms. There are also other works (Jin et al., 2018; Zhang et al., 2020) studying model-free algorithms based on Q-learning with exploration bonuses or advantage functions.

Standard RL-RFE. Jin et al. (2020a) introduce the reward-free exploration (RFE) metric and design an algorithm that runs multiple instances of an existing RM algorithm (Zanette & Brunskill, 2019); their sample complexity has a gap to the lower bound (Jin et al., 2020a; Domingues et al., 2021) in factors H, S. Kaufmann et al. (2021) propose an algorithm which builds upper confidence bounds for the estimation error of value functions, and improve the sample complexity of (Jin et al., 2020a). Ménard et al. (2021) achieve a near-optimal sample complexity by applying an empirical Bernstein inequality and upper bounding the overall estimation error.

There are substantial differences between standard RL and our branching RL. The exponentially-large trajectory of branching RL brings unique challenges in developing Bellman equations and key lemmas, designing computationally efficient algorithms, and deriving optimal (polynomial) bounds. Existing RL algorithms and analyses cannot be directly applied to address these challenges.

3. Problem Formulation

In this section, we present the formal formulation of Branching Reinforcement Learning (Branching RL).

Branching Markov Decision Process (Branching MDP). We consider an episodic branching MDP defined by a tuple M = (S, Auniv, A, m, H, q, p, r). Here S is the state space with cardinality S; it consists of the set Sreg of regular states together with an ending state, which is an absorbing state with zero reward. Auniv is the set of base actions, which represents the set of all feasible items in recommendation. Let N := |Auniv| denote the number of base actions. A super action A ⊆ Auniv consists of m (m ≤ N) base actions, which stands for a recommended list.
A is the collection of all feasible super actions and can be combinatorially large. H is the length of an episode. Throughout the paper, we call a super action an action for short, call (s, a) S Auniv a state-base action pair, and call (s, a) S \ {s } Auniv a regular state-base action pair. q(s, a) is the trigger probability of state-base action pair (s, a) S Auniv. p(s |s, a) is the probability of transitioning to state s on state-base action pair (s, a), for any (s , s, a) S S Auniv. r(s, a) [0, 1] is the reward of pair (s, a) S Auniv. We assume that reward function r is deterministic as many prior RL works (Azar et al., 2017; Jin et al., 2018; Zhang et al., 2020), and our work can generalize to stochastic rewards easily. Parameters q, p, r are time-homogeneous, i.e., have the same definitions for different step h [H]. The ending state s has zero reward and always transitions back to itself, i.e., q(s , a) = 0, p(s |s , a) = 1 and r(s , a) = 0 for all a Auniv. We define a policy π as a collection of H functions {πh :S 7 A}h [H], and K as the number of episodes. String-based Notations. As shown in Figure 1, in branching RL, the trajectory of each episode is an m-ary tree, where there are H layers (steps), and each layer h [H] has mh 1 states (nodes) and mh state-base action pairs (edges). We use the following string-based notations to denote a trajectory: Each tree node in layer h has a string index i1, . . . , ih 1 , with the root node for layer 1 having the empty string , and i1, . . . , ih 1 {1, 2, . . . , m}. The m children of this node have indices that concatenate ih [m] to the string, making it i1, . . . , ih 1, ih , where ih stands for that this node is the ih-th child of the node i1, . . . , ih 1 . Operator is the concatenation operation for strings, and i h denotes the concatenation of h strings i for any i [m], h [H]. For any string σ, |σ| denotes its length, and thus state sσ is at step |σ| + 1. Online Game. In each episode k [K], an agent selects a policy πk at the beginning, and starts from an initial state s . At step 1, she chooses an action A = {a 1 , . . . , a m } ac- , = , , , = , , = , , = , , = , , = Figure 1. Illustrating example with m = 2 for branching RL. cording to πk 1. Each state-base action pair (s , a i ) for i [m] has probability q(s , a i ) to be triggered. If triggered successfully, the agent obtains reward r(s , a i ) and this state-base action pair transitions to a next state s i p( |s , a i ); Otherwise, if not triggered successfully, she obtains zero reward and this state-base action pair transitions to the ending state s . Hence, the transitions at step 1 branch out to m successor states s 1 , . . . , s m . At step 2, for each state s i (i [m]), she chooses an action A i = {a i,1 , . . . , a i,m } according to πk 2. Then, there are m2 state-base action pairs {(s i , a i,j )}i,j [m] at step 2, and each of them is triggered with probability q(s i , a i,j ). If triggered successfully, the agent receives reward r(s i , a i,j ) and this pair transitions to a next state s i,j p( |s i , a i,j ); Otherwise, she receives zero reward and this pair transitions to s . Then, the transitions at step 2 branch out to m2 successor states {s i,j }i,j [m]. The episode proceeds by analogy at the following steps 3, . . . , H. In the trajectory tree, once the agent reaches s at some node, she obtains no reward throughout this branch (This is so-called ending state ). Branching Value functions and Bellman Equations. 
For any policy π, we define value function V π h : S 7 R, so that V π h (s) =Eq,p,π h m (H h) X ℓ=1 q(sσ σ , aσ σ ℓ) r(sσ σ , aσ σ ℓ)|sσ = s i (1) gives the expected cumulative reward starting from some state s at step h till the end of this branch, under policy π. Here σ is the index string for an arbitrary state at step h, and thus σ {1 (h 1), . . . , m (h 1)}. Pm (H h) σ = denotes the summation over strings σ = , 1 , . . . , m , 1, 1 , . . . , m, m , . . . , m (H h), which effectively enumerates all tree nodes of H h+1 layers. The expectation is taken with respect to the trajectory, which is dependent on trigger distribution q, transition distribution p and policy π. Accordingly, we also define Q-value function Qπ h : S A 7 Branching Reinforcement Learning Qπ h(s, A) =Eq,p,π h m (H h) X ℓ=1 q(sσ σ , aσ σ ℓ) r(sσ σ , aσ σ ℓ)|sσ = s, Aσ = A i denotes the expected cumulative reward starting from some state-action pair (s, A) at step h till the end of this branch, under policy π. From the definitions of r, p, q for ending state s , we have V π h (s ) = Qπ h(s , A) = 0 for any A A, h [H], π. Since S, A and H are all finite, there exists a deterministic optimal policy π which has the optimal value V h (s) = supπ V π h (s) for any s S and h [H]. Then, we can establish the Bellman (optimality) equations as follows: Qπ h(s, A) = X a A q(s, a) r(s, a) + p( |s, a) V π h+1 V π h (s) =Qπ h(s, πh(s)) V π H+1(s) =0, s S, Q h(s, A) = X a A q(s, a) r(s, a) + p( |s, a) V h+1 V h (s) = max A A Q h(s, A) V H+1(s) =0, s S. Under the framework of branching RL, we consider two important RL settings, i.e., regret minimization (branching RL-RM) and reward-free exploration (branching RL-RFE). Regret Minimization (RM). In branching RL-RM, the agent plays the branching RL game for K episodes, and the goal is to minimize the following regret Regret(K) = V 1 (sk ) V πk 1 (sk ) . Reward-Free Exploration (RFE). Branching RL-RFE consists of two phases, i.e., exploration and planning. (i) In the exploration phase, given a fixed initial state s , the agent plays the branching RL game without the observation of reward function r, and estimates a trigger and transition model (ˆq, ˆp). (ii) In the planning phase, the agent is given reward function r, and computes the optimal policy ˆπ under her estimated model (ˆq, ˆp) with respect to r. Given an accuracy parameter ε and a confidence parameter δ, she needs to guarantee that for any given reward function r, the policy ˆπ with respect to r is ε-optimal, i.e., V ˆπ 1 (s ; r) V π 1 (s ; r) ε, with probability at least 1 δ. We measure the performance by sample complexity, i.e., the number of episodes used in the exploration phase to guarantee an ε-optimal policy for any given r. In order to ensure that the number of triggered state-base action pairs will not increase exponentially and designing sample efficient algorithms is possible in branching RL, we introduce the following assumption. Assumption 1 (Bounded Trigger Probability). For any (s, a) S Auniv, we have q(s, a) 1 To justify the necessity of Assumption 1, we provide a rigorous lower bound to show that once relaxing the threshold of trigger probability, any branching RL algorithm must suffer an exponential regret. Theorem 2. Suppose that for any (s, a) S Auniv, q(s, a) q for some threshold parameter q > 1 m. 
Then, there exists an instance of branching RL with H > 1, where the regret of any algorithm is bounded by Ω( m q((m q)H 1 1) We describe the intuition behind this lower bound, and defer the full proof to Appendix C.2.2. Consider a branching MDP, where at an early step the agent has to distinguish the optimal action that has trigger probability q, from the sub-optimal actions that have trigger probabilities only q η. Once the agent takes a sub-optimal action at the early step, such trigger sub-optimality will impact exponentially many states in the following steps, and she will suffer a regret of mη m q + m2 q2 + + m H 1 q H 1 in this episode, which is the sum of a geometric progression with common ratio m q. If q > 1 m, summing over all episodes, the total regret is exponentially large with respect to H. Besides this lower bound, Assumption 1 is also mild in practice, since in real-world applications such as recommendation systems (Fu et al., 2021) and online advertising (Kang et al., 2020), it is often the case that users are only attracted to and click on a few items in a recommended list. In addition, in multi-step (e.g., category-based) recommendation, the interests of users usually converge to a single branch in the end (in expectation). When m = 1, our branching RL reduces to standard episodic RL (Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019) with transition probability paug, such that paug(s |s, a) = 1 q(s, a) and paug(s |s, a) = q(s, a)p(s |s, a) for any s = s . In this case, our results match the state-of-the-art results for standard RL in both RM (Azar et al., 2017; Zanette & Brunskill, 2019) and RFE (M enard et al., 2021; Zhang et al., 2021) settings. Branching Reinforcement Learning 4. Properties of the Branching Markov Decision Process Before introducing our algorithms for branching RL, in this section, we first investigate special structural properties of branching MDP, which are critical to deriving tight (polynomial) regret and sample complexity guarantees. 4.1. Branching Value Difference Lemma and Law of Total Variance Different from standard episodic MDP (Azar et al., 2017; Zanette & Brunskill, 2019) where a trajectory is an H-step path, the trajectory of branching MDP is an m-ary tree with each node a state and each edge a state-base action pair. Thus, many analytical tools in standard MDP, e.g., value difference lemma (Dann et al., 2017) and law of total variance (Jin et al., 2018; Zanette & Brunskill, 2019), cannot be directly applied to branching MDP. To handle this problem, we establish new fundamental techniques for branching MDP, including branching value difference lemma and branching law of total variance. First, we present a branching value difference lemma. Lemma 3 (Branching Value Difference Lemma). For any two branching MDP M (S, Auniv, A, m, H, q , p , r) and M (S, Auniv, A, m, H, q , p , r), the difference in values under the same policy π satisfies that V π h (s) V π h (s) = ℓ=1 Eq ,p ,π h q (sτ, aτ ℓ) q (sτ, aτ ℓ) r(sτ, aτ ℓ) + q (sτ, aτ ℓ)p (sτ, aτ ℓ) q (sτ, aτ ℓ)p (sτ, aτ ℓ) V π |τ ℓ|+1 sσ = s i , where τ := σ σ . Using Lemma 3 with M and M being the optimistic and true models, respectively, we can bound the difference between optimistic and true values by the deviations between optimistic and true trigger and transition probabilities, in expectation with respect to the true model. Next, we provide a branching law of total variance, which is critical to analyzing the estimation error of transition. 
Lemma 4 (Branching Law of Total Variance). For any policy π, Eq,p,π h m (H 1) X ℓ=1 Varq,p V π |σ ℓ|+1(sσ ℓ)|sσ, aσ ℓ i =Eq,p,π h m (H 1) X ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) V π 1 (s ) 2i (2) Eq,p,π h m (H 1) X σ= 1 {sσ = s } 2i . (3) Here Varq,p V π |σ ℓ|+1(sσ ℓ)|sσ, aσ ℓ denotes the variance of value V π |σ ℓ|+1(sσ ℓ) with respect to sσ ℓ, which depends on trigger probability q(sσ, aσ ℓ) and transition probability p( |sσ, aσ ℓ), conditioning on (sσ, aσ ℓ). Remark 1. Lemma 4 exhibits that under branching MDP, the sum of conditional variances over all state-base action pairs is equal to the overall variance considering the whole trajectory, shown by Eq. (2). Furthermore, the overall variance can be bounded by the total number of regular (triggered) states, revealed by Eq. (3). From Lemma 4, we have that to bound the estimation error of transition, which is related to the sum of conditional variances, it suffices to bound the total number of triggered states in a trajectory tree (discussed in the following). 4.2. The Number of Triggered States In this subsection, we show that with Assumption 1 that only constrains the first moment of trigger distribution, we can bound both the first and second moments of the number of triggered states in a trajectory tree. Lemma 5 (The Number of Triggered States). For any policy π, Eq,p,π h m (H 1) X σ= 1 {sσ = s } i H, (4) Eq,p,π h m (H 1) X σ= 1 {sσ = s } 2i 3H2. (5) Remark 2. Eq. (4) gives a universal upper bound of value function as V π h (s) h Pm (H 1) σ= 1 {sσ = s } i H for any s S, h [H], π. Moreover, Eq. (5) provides a sharp upper bound for overall variance, as well as the sum of conditional variances of transition (by plugging Eq. (5) into Lemma 4). To our best knowledge, this second moment result is novel. Lemma 5 shows that despite the exponentially increasing nodes in branching MDP, its value and overall variance (estimation error) will not explode. This critical property enables us to avoid an exponentially-large regret or sample complexity. Novel Analysis for Triggered States. The analysis of Lemma 5 is highly non-trivial. We first relax all regular trigger probabilities to 1 m, and then investigate the number of triggered states for each step individually. While we can Branching Reinforcement Learning show that the number of triggered states at each step is a conditional Binomial random variable, the distribution of their sum is too complex to express. This incurs a nontrivial challenge on analyzing the second moment of the total number of triggered states. To tackle this challenge, we investigate the correlation of triggered states between any two steps, by exploiting the structure of branching MDP. Proof sketch. Under Assumption 1, to bound the total number of triggered (regular) states for any branching MDP and policy π, it suffices to bound it under a relaxed model M with q(s, a) = q := 1 m for all (s, a) S \ {s } Auniv. Let ωh denote the number of triggered states at each step h under M , and ω := PH h=1 ωh. Below we prove that E[ω] H and E[ω2] 3H2. For h = 1, ωh = 1 deterministically. For h 2, ωh|ωh 1 Binomial(mωh 1, q ). According to the properties of Binomial distribution and q := 1 m, for h 2, E [ωh] = mq E [ωh 1] = 1, E (ωh)2 =mq (1 q )E [ωh 1]+m2(q )2E (ωh 1)2 = (1 q ) + E (ωh 1)2 h. Hence, we have that E[ω] = PH h=1 E[ωh] = H, and h=1 E[(ωh)2] + 2 X 1 0, the regret of algorithm Branch VI is bounded by O(H SNK log( SNH(m H K) δ )). In particular, when K m H, the regret is bounded by SNK log SNHK Remark 3. 
Theorem 6 shows that, despite the exponentially-large trajectory of branching RL, Branch VI enjoys a regret only polynomial in problem parameters. For large enough K such that K m H, Theorem 6 matches the lower bound (presented in Section 5.3) up to logarithmic factors. In addition, when branching RL reduces to standard RL (i.e., m = 1), our result also matches the state-of-thearts (Azar et al., 2017; Zanette & Brunskill, 2019). Branching Regret Analysis. In contrast to standard RL (Azar et al., 2017; Zanette & Brunskill, 2019), we derive a tree-structured regret analysis based on special structural properties of branching RL in Section 4. Using the branching value difference lemma (Lemma 3), we can decompose the regret into the estimation error from each regular state-base action pair as follows: (s,a),s =s wk,σ,ℓ(s, a) h qk(s, a) q(s, a) r(s, a) | {z } Term 1: Triggered reward + qk(s, a) pk( |s, a) ˆqk(s, a)ˆpk( |s, a) V k |σ ℓ|+1 | {z } Term 2: Triggered transition optimism + ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) V |σ ℓ|+1 | {z } Term 3: Triggered transition estimation + ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) V k |σ ℓ|+1 V |σ ℓ|+1 | {z } Term 4: Lower order term Here wk,σ,ℓ(s, a) denotes the probability that (sσ, aσ ℓ) = (s, a) in episode k, and qk and pk represent optimistic trigger and transition probabilities, respectively. In this decomposition, we address the estimation error for triggered reward (Term 1) and triggered transition (Terms 2, 3) separately, and consider trigger and transition probabilities as a whole distribution (in Terms 2, 3). The dominant terms are Terms 2, 3, which stand for the estimation error for triggered transition and depend on the sum of conditional variances. Using branching law of total variance (Lemma 4) and the second moment bound of triggered states (Lemma 5), we can bound Terms 2, 3 by only O(H2) despite exponential state-base action pairs. We note that it is necessary to separately address triggered reward and triggered transition, and consider trigger and transition as a whole distribution in bonus design (Line 1) and regret decomposition. A naive adaption of standard RL algorithm (Zanette & Brunskill, 2019), which adds bonuses on trigger and transition probabilities respectively, i.e., replacing Line 1 with f k h(s, a) ˆqk(s, a) + bk,q(s, a) (r(s, a)+ ˆpk( |s, a) V k h+1 +bk,p V (s, a)), will suffer an extra factor H in the regret bound. Please see Appendix A for more discussion. 5.3. Regret Lower Bound In this subsection, we provide a lower bound for branching RL-RM which is polynomial in problem parameters and Branching Reinforcement Learning Algorithm 2 Branch RFE 1: Input: s , ε, δ, β(t, κ) := log(SN/κ) + S log(8e(t + 1)) for any t N and κ (0, 1). Initialize Bk h(s )= 0, h [H], k and Bk H+1(s)=0, s S, k. 2: for k = 1, 2, . . . do 3: for h = H, H 1, . . . , 1 do 4: for s S \ {s } do 5: for a Auniv do 6: ˆqk(s, a) Jk sum(s,a) nk(s,a) ; 7: ˆpk(s |s, a) P k sum(s |s,a) Jk sum(s,a) , s S; 8: gk h(s, a) 12H2 β(nk(s,a),δ) nk(s,a) + 1 + 1 H ˆqk(s, a)ˆpk( |s, a) Bk h+1(s); 9: end for 10: πk h(s) argmax A A P a A gk h(s, a); 11: Bk h(s) min{max A A P a A gk h(s, a), H}; 12: end for 13: end for 14: if 4e p Bk 1(s ) + Bk 1(s ) ε 2, return (ˆqk, ˆpk); 15: else Take policy πk and observe the trajectory; 16: end for demonstrates the optimality of Branch VI. Theorem 7 (Regret Lower Bound). There exists an instance of branching RL-RM, where any algorithm must have Ω(H SNK) regret. 
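To complement the pseudocode of Algorithm 2 above, the following Python sketch shows one backward pass of Branch RFE. The bookkeeping structure counts, the max(n, 1) guards for unvisited pairs, and the function name are our own assumptions rather than details from the paper; the point to note is that the maximization over super actions in Lines 10-11 reduces to selecting the m base actions with the largest per-action error terms, which is why the per-episode computation is O(S^2 N) rather than O(S^2 |A|).

```python
import numpy as np

def branch_rfe_backward_pass(S_reg, A_univ, m, H, counts, delta, s_end):
    """One backward pass of Branch RFE (Algorithm 2), sketched with our own
    bookkeeping: counts[(s, a)] = (n, J, P), where n is the number of visits
    to (s, a), J the number of successful triggers, and P a dict of
    successor-state counts among the triggered visits.
    """
    S, N = len(S_reg) + 1, len(A_univ)
    def beta(t):                                   # beta(t, delta) of Algorithm 2
        return np.log(S * N / delta) + S * np.log(8 * np.e * (t + 1))
    B = {s: 0.0 for s in list(S_reg) + [s_end]}    # B_{H+1} = 0
    policy = {}
    for h in range(H, 0, -1):
        B_h, policy[h] = {s_end: 0.0}, {}
        for s in S_reg:
            g = {}
            for a in A_univ:
                n, J, P = counts[(s, a)]
                q_hat = J / max(n, 1)
                # hat q * hat p(.|s,a)^T B_{h+1}: estimated downstream error if triggered
                next_err = q_hat * sum(c / max(J, 1) * B[s2] for s2, c in P.items())
                g[a] = 12 * H**2 * beta(n) / max(n, 1) + (1 + 1 / H) * next_err
            top_m = sorted(g, key=g.get, reverse=True)[:m]   # maximization oracle
            policy[h][s] = tuple(top_m)
            B_h[s] = min(sum(g[a] for a in top_m), H)
        B = B_h
    # keep exploring with `policy` until 4e * sqrt(B[s_init]) + B[s_init] <= eps / 2
    return B, policy
```

The same kind of top-m maximization oracle underlies Branch VI, with the optimistic state-base action values in place of the error terms g, so neither algorithm needs to enumerate the combinatorially many super actions.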
Remark 4.Theorem 7 validates that the regret of Branch VI (Theorem 6) is near-optimal for large enough K, and reveals that, a polynomial regret is achievable and tight even with exponentially-large trajectories in branching RL. Branching Regret Lower Bound Analysis. Unlike prior standard episodic RL works (Azar et al., 2017; Jin et al., 2018) which directly adapt the diameter-based lower bound (Jaksch et al., 2010) to the episodic setting, we derive a new lower bound analysis for branching RL-RM. We construct a hard instance, where an agent uniformly enters one of Θ(S) bandit states , i.e., states with an optimal action and sub-optimal actions, and hereafter always transitions to a homogeneous state , i.e., a state with homogeneous actions. Then, if the agent makes a mistake in a bandit state, she will suffer Ω(H) regret in this episode. By bounding the KL-divergence between this hard instance and uniform instance, we can derive a desired lower bound. 6. Branching Reinforcement Learning with Reward-free Exploration In this section, we investigate branching RL-RFE, and develop an efficient algorithm Branch RFE and nearly matching upper and lower bounds of sample complexity. 6.1. Algorithm Branch RFE In each episode, Branch RFE estimates the trigger and transition probabilities, and computes the estimation error Bk h(s), which stands for the cumulative variance of trigger and transition from step h to H. Once the total estimation error Bk 1(s ) is shrunk below the required accuracy, Branch RFE returns the estimated model (ˆqk, ˆpk). Given any reward function r, the optimal policy ˆπ under (ˆqk, ˆpk) with respect to r is ε-optimal, i.e., V ˆπ 1 (s ; r) V 1 (s ; r) ε, with probability at least 1 δ. We describe Branch RFE (Algorithm 2) as follows: In each episode k, for each step h, Branch RFE first estimates the trigger and transition probabilities ˆqk, ˆpk (Lines 2,2), and calculates the component estimation error gk h(s, a) for each state-base action pair (Line 2). Then, similar to Branch VI, we utilize a maximization oracle to efficiently find the action with the maximum estimation error (the most necessary action for exploration) to be πk h(s), and assign such maximum error to Bk h(s) (Lines 2,2). If the square root of total estimation error p Bk 1(s ), which represents the standard deviation of trigger and transition for whole trajectory, is smaller than ε/2, Branch RFE stops and outputs the estimated model (ˆqk, ˆpk) (Line 2); Otherwise, it continues to explore according to the computed policy πk (Line 2). The computation complexity of Branch RFE is also O(S2N) instead of O(S2|A|), since it only computes the component estimation error for state-base action pairs rather than enumerating super actions, and utilizes a maximization oracle to calculate πk h(s) and Bk h(s). 6.2. Sample Complexity Upper Bound for Branch RFE Now we present the sample complexity for Branch RFE. We say an algorithm for branching RL-RFE is (δ, ε)-correct, if it returns an estimated model (ˆq, ˆp) such that given any reward function r, the optimal policy under (ˆq, ˆp) with respect to r is ε-optimal with probability at least 1 δ. Theorem 8 (Sample Complexity Upper Bound). For any ε > 0 and δ (0, 1), algorithm Branch RFE is (δ, ε)- correct. Moreover, with probability 1 δ, the number of episodes used in Branch RFE is bounded by + S log e m H C2 , where C = log log SN δ + S log e m H HSN Remark 5. 
Theorem 8 shows that even with exponentially-large trajectories in branching RL, Branch RFE only needs polynomially many episodes to ensure an ε-optimal policy for any reward function. This sample complexity is optimal for small enough δ within logarithmic factors (compared to the lower bound presented in Section 6.3). In addition, when degenerating to standard RL (i.e., m = 1), our result also matches the state of the art (Ménard et al., 2021; Zhang et al., 2021).

Branching Sample Complexity Analysis. Unlike standard RL (Ménard et al., 2021; Zhang et al., 2021), which only bounds the estimation error in a single H-step path, in branching RL we need to unfold and analyze the estimation error for all state-base action pairs in a trajectory tree. In our analysis, we utilize special structural properties of branching MDP, e.g., the branching law of total variance (Lemma 4) and the second moment bound of triggered states (Lemma 5), to carefully bound the total estimation error throughout the trajectory tree. Despite exponentially-large trajectories, we obtain a sample complexity that is only polynomial in the problem parameters.

6.3. Sample Complexity Lower Bound

Theorem 9 (Sample Complexity Lower Bound). There exists an instance of branching RL-RFE, where any (δ, ε)-correct algorithm requires Ω((H^2 SN / ε^2) log(1/δ)) trajectories.

Remark 6. Theorem 9 demonstrates that Branch RFE (Theorem 8) achieves a near-optimal sample complexity for small enough δ, and also that a polynomial sample complexity is achievable and sharp for branching RL-RFE.

7. Experiments

In this section, we conduct experiments for branching RL. We set K = 5000, δ = 0.005, H = 6, m = 2, N ∈ {10, 15}, and let S contain the ending state together with the regular states s1, . . . , s5. A is the collection of all m-cardinality subsets of Auniv = {a1, . . . , aN}, and thus |A| = C(N, m) ∈ {45, 105}. The reward function is r(s, a) = 1 for any (s, a) ∈ S × Auniv. The trigger probability is q(s, a) = 1/m for any (s, a) ∈ S × {a_{N-1}, a_N}, and q(s, a) = 1/(2m) for any (s, a) ∈ S × (Auniv \ {a_{N-1}, a_N}). We set s1 as the initial state for each episode. Under all actions a ∈ Auniv, the transition probability is p(s'|s1, a) = 0.5 for any s' ∈ {s2, s3}, and p(s'|s, a) = 0.5 for any (s, s') ∈ {s2, s3} × {s4, s5} or (s, s') ∈ {s4, s5} × {s2, s3}. We perform 50 independent runs, and report the average regrets and running times (in legends) across runs.

Since we study a new problem and there is no existing algorithm for branching RL, we compare our algorithm Branch VI with two adaptations from standard RL, i.e., Euler-Adaptation (Zanette & Brunskill, 2019) and ε-Greedy (ε = 0.01). The former uses individual exploration bonuses for trigger and transition, the latter uses ε-greedy exploration, and both of them explicitly maintain Q-functions. As shown in Figure 2, Branch VI enjoys a lower regret and a faster running time than the baselines, which demonstrates the effectiveness of our exploration strategy and the computational efficiency of our algorithm.

[Figure 2. Experiments for branching RL: cumulative regret over 5000 episodes for Branch VI, Euler-Adaptation and ε-Greedy (ε = 0.01); the running times reported in the legends are 26.827s, 46.342s and 45.677s in the first setting, and 35.199s, 87.958s and 88.890s in the second.]

8. Conclusion and Future Work

In this paper, we formulate a novel branching RL model, and consider both regret minimization and reward-free exploration metrics.
Different from standard RL where each episode is a single H-step path, branching RL is a treestructured forward model which allows multiple base actions in a state and multiple successor states. For branching RL, we build novel fundamental analytical tools and carefully bound the overall variance. We design efficient algorithms and provide near-optimal upper and lower bounds. There are many interesting directions to explore. One direction is to improve the dependency on H in our regret upper bound (Theorem 6) for small K and close the gap on factors H, S between sample complexity upper and lower bounds (Theorems 8,9). Another direction is to extend branching RL from the tabular setting to function approximation, e.g., representing the value function in a linear form with respect to the feature vectors of state-action pairs (Jin et al., 2020b; Zhou et al., 2021). Agrawal, S. and Jia, R. Posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184 1194, 2017. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002. Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263 272. PMLR, 2017. Burnetas, A. N. and Katehakis, M. N. Optimal adaptive policies for markov decision processes. Mathematics of Operations Research, 22(1):222 255, 1997. Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Branching Reinforcement Learning Neural Information Processing Systems, pp. 2818 2826, 2015. Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5717 5727, 2017. Domingues, O. D., M enard, P., Kaufmann, E., and Valko, M. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Algorithmic Learning Theory, pp. 578 598. PMLR, 2021. Fu, M., Agrawal, A., Irissappane, A. A., Zhang, J., Huang, L., and Qu, H. Deep reinforcement learning framework for category-based item recommendation. IEEE Transactions on Cybernetics, 2021. Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010. Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4868 4878, 2018. Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870 4879. PMLR, 2020a. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020b. Kang, S., Jeong, C., and Chung, K. Tree-based real-time advertisement recommendation system in online broadcasting. IEEE Access, 8:192693 192702, 2020. Kaufmann, E., M enard, P., Domingues, O. D., Jonsson, A., Leurent, E., and Valko, M. Adaptive reward-free exploration. In International Conference on Algorithmic Learning Theory, pp. 865 891. PMLR, 2021. Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623 648, 2004. M enard, P., Domingues, O. 
D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599 7608. PMLR, 2021. Osband, I. and Van Roy, B. On lower bounds for regret in reinforcement learning. ar Xiv preprint ar Xiv:1608.02732, 2016. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304 7312. PMLR, 2019. Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 2020. Zhang, Z., Du, S. S., and Ji, X. Nearly minimax optimal reward-free reinforcement learning. International Conference on Machine Learning, 2021. Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pp. 12793 12802. PMLR, 2021. Branching Reinforcement Learning A. Discussion for Algorithm Branch VI We remark that it is necessary to separately address triggered reward and triggered transition, and consider trigger and transition as a whole distribution in bonus design (Line 1 in Algorithm 1) and regret analysis (Eq. (13)). A counter example is discussed below to support this statement. If one naively adapts standard RL algorithms (Azar et al., 2017; Zanette & Brunskill, 2019), she/he may directly add bonuses on trigger and transition probabilities, respectively, without separating triggered reward and triggered transition. In this case, f k h(s, a) (Line 1) becomes: f k h(s, a) ˆqk(s, a) + bk,q(s, a) r(s, a) + ˆpk( |s, a) V k h+1 + bk,p V (s, a) Then, in regret decomposition, we will obtain Regret(K) (a) O(1) (s,a),s =s wkσℓ(s, a) h bk,q(s, a) r(s, a) + p( |s, a) V k |σ ℓ|+1 | {z } Term Λ + bk,p V (s, a)q(s, a) i , where (a) omits second order terms. Since Term Λ already reaches Θ(h) order, further summing over all episodes k and steps h, we will suffer an extra H factor in the final result. One can see from this counter example that, our bonus design and analysis, which separately handle triggered reward and triggered transition and consider trigger and transition as a whole distribution, are sharp and enable a near-optimal regret. B. Proofs for Properties of Branching MDP In this section, we present the proofs for structural properties of branching MDP, including branching value difference lemma (Lemma 3), branching law of total variance (Lemma 4) and the upper bounds of the number of triggered states (Lemma 5). B.1. Proof of Lemma 3 Proof of Lemma 3. This proof adapts the analysis of Lemma E.15 in (Dann et al., 2017) to branching RL. 
According to branching Bellman equations, V π h (sσ) V π h (sσ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) + q (sσ, aσ ℓ)p (sσ, aσ ℓ) V π h+1 q (sσ, aσ ℓ)p (sσ, aσ ℓ) V π h+1 q (sσ, aσ ℓ)r(sσ, aσ ℓ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) + (q (sσ, aσ ℓ)p (sσ, aσ ℓ) q (sσ, aσ ℓ)p (sσ, aσ ℓ)) V π h+1 + q (sσ, aσ ℓ)p (sσ, aσ ℓ) V π h+1 V π h+1 q (sσ, aσ ℓ)r(sσ, aσ ℓ) q (sσ, aσ ℓ)r(sσ, aσ ℓ) + (q (sσ, aσ ℓ)p (sσ, aσ ℓ) q (sσ, aσ ℓ)p (sσ, aσ ℓ)) V π h+1 ℓ=1 Eq ,p ,π V π h+1(sσ ℓ) V π h+1(sσ ℓ)|sh = s ℓ=1 Eq ,p ,π h q (sσ σ , aσ σ ℓ)r(stℓ, aσ σ ℓ) q (stℓ, aσ σ ℓ)r(stℓ, aσ σ ℓ) Branching Reinforcement Learning + (q (stℓ, aσ σ ℓ)p (stℓ, aσ σ ℓ) q (stℓ, aσ σ ℓ)p (stℓ, aσ σ ℓ)) V π h+1 i σ =1 2 Eq ,p ,π V π h+2,ℓ(sσ ) V π h+2,ℓ(sσ )|sh ℓ=1 Eq ,p ,π h q (sσ σ , aσ σ ℓ)r(sσ σ , aσ σ ℓ) q (sσ σ , aσ σ ℓ)r(sσ σ , aσ σ ℓ) + (q (sσ σ , aσ σ ℓ)p (sσ σ , aσ σ ℓ) q (sσ σ , aσ σ ℓ)p (sσ σ , aσ σ ℓ)) V π |σ σ ℓ|+1 i B.2. Proof of Lemma 4 Proof of Lemma 4. First, we prove the equality. This proof adapts the analysis of standard law of total variance in (Jin et al., 2018; Zanette & Brunskill, 2019) to branching RL. ℓ=1 Varsσ ℓ q,p V π |σ ℓ|+1(sσ ℓ)|sσ, aσ ℓ V π |σ ℓ|+1(sσ ℓ) q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 2 " m (H 1) X q(sσ, aσ ℓ)r(sσ, aσ ℓ) + V π |σ ℓ|+1(sσ ℓ) q(sσ, aσ ℓ)r(sσ, aσ ℓ) + q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 !2# " m (H 1) X q(sσ, aσ ℓ)r(sσ, aσ ℓ) + V π |σ ℓ|+1(sσ ℓ) q(sσ, aσ ℓ)r(sσ, aσ ℓ) + q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 !2# ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) + ℓ=1 V π |σ ℓ|+1(sσ ℓ) σ= V π |σ|+1(sσ) ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) V π 1 (s ) Here (a) comes from the Markov property and that conditioning on the filtration of step h, the triggers and transitions of state-base action pairs at step h + 1 are independent. (b) is due to that V π |σ|+1(sσ) = Pm ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) + q(sσ, aσ ℓ)p(sσ, aσ ℓ) V π |σ ℓ|+1 and we can merge the m terms of state-base ac- tion value into V π |σ|+1(sσ). Now, we prove the inequality. ℓ=1 q(sσ, aσ ℓ)r(sσ, aσ ℓ) V π 1 (s ) Branching Reinforcement Learning ℓ=1 q(sσ, aσ ℓ) q(sσ, aσ ℓ)1 {sσ = s } + q(sσ, aσ ℓ)1 {sσ = s } 1 m1 {sσ = s } σ= 1 {sσ = s } where (a) is due to Assumption 1 and q(s , a) = 0 for any a Auniv. B.3. Proof of Lemma 5 Proof of Lemma 5. Under Assumption 1, to bound the total number of triggered (regular) states for any branching MDP and policy π, it suffices to bound it under a relaxed model M with q(s, a) = q := 1 m for all (s, a) S \ {s } Auniv. Let ωh denote the number of triggered states at each step h under M , and ω := PH h=1 ωh. Below we prove that E[ω] H and E[ω2] 3H2. For h = 1, ωh = 1 deterministically. For h 2, ωh|ωh 1 Binomial(mωh 1, q ). According to the properties of Binomial distribution and q := 1 m, for h 2, E [ωh] = mq E [ωh 1] = 1, E (ωh)2 =mq (1 q )E [ωh 1] + m2(q )2E (ωh 1)2 = (1 q ) + E (ωh 1)2 h. Hence, we have that E[ω] = PH h=1 E[ωh] = H, and h=1 E[(ωh)2] + 2 X 1 1 Then, Eq. (23) becomes E[Regret A] =E[Rewardπ ] E[Reward A] V 1 (s1) V πk 1 (s1) # j=1 E h K Txi,Aj(xi) m q mη + m2 q2 mη + + m H 1 q H 1 mη i j=1 E h K Txi,Aj(xi) m q (m q)H 1 1 =m q (m q)H 1 1 j=1 E Txi,Aj(xi) and Eq (25) becomes E[Regret A] m q (m q)H 1 1 j=1 E Txi,Aj(xi) Branching Reinforcement Learning m q (m q)H 1 1 m K 2c1d(S 3) Let η = c2 q m K for some small enough constant c2, we have E[Regret A] =Ω m q (m q)H 1 1 m q (m q)H 1 1 Therefore, when relaxing the trigger probability threshold in Assumption 1 to some q > 1 m, any algorithm for branching RL-RM must suffer an exponential regret. D. 
Proofs for Branching RL with Reward-Free Exploration In this section, we prove sample complexity upper and lower bounds (Theorems 8,9) for branching RL-RFE. D.1. Proof for Sample Complexity Upper Bound D.1.1. AUGMENTED TRANSITION DISTRIBUTION First, we introduce an augmented transition distribution paug( |s, a) and connect it with trigger distribution q(s, a) and transition distribution p( |s, a). For any (s, a) S Auniv, let paug( |s, a) denote the augmented transition distribution on S, which satisfies that paug(s |s, a) = 1 q(s, a), paug(s |s, a) = q(s, a)p(s |s, a), s S \ {s }. For any episode k, we can also define the empirical augmented transition distribution as ˆpaug,k(s |s, a) = 1 ˆqk(s, a), ˆpaug,k(s |s, a) = ˆqk(s, a)ˆpk(s |s, a), s S \ {s }. Lemma 18. For any function f( ) defined on S such that f(s ) = 0 (e.g., f can be the value function V π h ( ), h [H], π), it holds that paug( |s, a) f = q(s, a)p( |s, a) f, (29) ˆpaug,k( |s, a) Vh+1 = ˆqk(s, a)ˆpk( |s, a) f. (30) Proof of Lemma 18. We prove Eq. (29) as follows: paug( |s, a) f = X s S paug(s |s, a)f(s ) s S\{s } q(s, a)p(s |s, a)f(s ) = q(s, a)p( |s, a) f Eq. (30) can be proved in a similar manner. Branching Reinforcement Learning D.1.2. CONCENTRATION In the following, we introduce a concentration lemma. Lemma 19 (KL-divergence Based Concentration of Triggered Transition). Defining event KL ˆpaug,k( |s, a) paug( |s, a) log SN δ + S log 8e(nk(s, a) + 1) nk(s, a) , (s, a) S Auniv, k it holds that Pr [G] 1 δ. Proof of Lemma 19. Using Theorem 3 and Lemma 3 in (M enard et al., 2021), we can obtain this lemma. D.1.3. KL DIVERGENCE-BASED TECHNICAL TOOLS Below, we present several useful KL divergence-based technical tools. Lemma 20 (Lemma 10 in (M enard et al., 2021)). Let p1 and p2 be two distributions on S such that KL(p1, p2) α. Let f be a function defined on S such that for any s S, 0 f(s) b. Then, |p1f p2f| q 2Varp2(f)α + 2 Lemma 21 (Lemma 11 in (M enard et al., 2021)). Let p1 and p2 be two distributions on S such that KL(p1, p2) α. Let f be a function defined on S such that for any s S, 0 f(s) b. Then, Varp2(f) 2Varp1(f) + 4b2α Varp1(f) 2Varp2(f) + 4b2α Lemma 22 (Lemma 12 in (M enard et al., 2021)). Let p1 and p2 be two distributions on S such that KL(p1, p2) α. Let f, g be two functions defined on S such that for any s S, 0 f(s), g(s) b. Then, Varp1(f) 2Varp1(g) + 2bp1|f g| Varp2(f) Varp1(f) + 3b2α p1 p2 1 D.1.4. ESTIMATION ERROR Next, we state an important lemma on estimation error. Lemma 23 (Estimation Error). Suppose that the concentration event G holds. Then, for any episode k, policy π and reward function r, ˆV k,π 1 (s; r) V π 1 (s; r) 4e q Bk 1(s) + Bk 1(s). Proof of Lemma 23. For any t N, κ (0, 1), let β(t, κ) := log(SN/κ) + S log(8e(t + 1)). 
Then, for any π, r and (s, A) S \ {s } A, ˆQk,π h (s, A; r) Qπ h(s, A; r) ˆqk(s, a) q(s, a) r(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 q(s, a)p( |s, a) V π h+1 ˆqk(s, a) q(s, a) r(s, a) + ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) V π h+1 + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 Branching Reinforcement Learning β(nk(s, a), δ ) nk(s, a) r(s, a) + 2Vars q,p V h+1(s ) β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + 4Vars ˆq,ˆp V h+1(s ) β(nk(s, a), δ ) nk(s, a) + 8H2 β(nk(s, a), δ ) 3H β(nk(s, a), δ ) nk(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) 8Vars ( ˆV k,π h+1(s ))β(nk(s, a), δ ) nk(s, a) +8H ˆqk(s, a)ˆpk( |s, a) V π h+1 ˆV k,π h+1 β(nk(s, a), δ ) nk(s, a) +8H2 β(nk(s, a), δ ) 3H β(nk(s, a), δ ) nk(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(s, a), δ ) 8H2 β(nk(s, a), δ ) 1 H ˆqk(s, a)ˆpk( |s, a) V π h+1 ˆV k,π h+1 8H2 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(s, a), δ ) H ˆqk(s, a)ˆpk( |s, a) V π h+1 ˆV k,π h+1 + 12H2 β(nk(s, a), δ ) nk(s, a) + ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 β(nk(s, a), δ ) nk(s, a) r(s, a) + v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(s, a), δ ) nk(s, a) + 12H2 β(nk(s, a), δ ) ˆqk(s, a)ˆpk( |s, a) ˆV k,π h+1 V π h+1 Here (a)(b)(c) use Lemmas 20,21,22, respectively. (d) is due to that x + y + z x + y + z for x, y, z 0. (e) comes from that xy x + y for x, y 0. Then, unfolding ˆQk,π 1 (s, a; r) Qπ 1(s, a; r) , we have ˆQk,π 1 (s, A; r) Qπ 1(s, A; r) h 1 Eˆq,ˆp,π β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) r(sσ, aσ ℓ) Branching Reinforcement Learning v u u t8Vars ˆq,ˆp ˆV k,π h+1(s ) β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) + 12H2 β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) 1 {sσ = s } (s,a),s =s ˆwk,π σℓ(s, a) v u u t Vars ˆq,ˆp ˆV k,π h+1(s ) H2 H2β(nk(s, a), δ ) nk(s, a) + 12H2 β(nk(s, a), δ ) (s,a),s =s ˆwk,π σℓ(s, a) Vars ˆq,ˆp ˆV k,π h+1(s ) v u u u t H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) + 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) v u u u t 1 H2 Eˆq,ˆp,π ℓ=1 Vars ˆq,ˆp ˆV k,π h+1(s ) v u u u t H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) + 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) v u u u t H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) nk(s, a) + 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) where (a) uses branching law of total variance (Lemma 4), which also holds for the estimated model (ˆqk, ˆpk) if adding a clip operation ˆqk(s, a) min{ˆqk(s, a), 1 m} in algorithm Branch RFE to guarantee ˆqk 1 Bk,π h (s, A) := min 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 ˆqk(s, a)ˆpk( |s, a) Bk,π h+1 Bk,π h (s) :=Bk,π h (s, π(s)). Bk h(s, A) := min 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 ˆqk(s, a)ˆpk( |s, a) Bk h(s) , H Bk h(s) := max A A Bk h(s, A) In the following, we show ˆQk,π 1 (s, A; r) Qπ 1(s, A; r) 4e q Bk,π 1 (s, A) + Bk,π 1 (s, A) (32) If Bk,π 1 (s, A) = H, Eq. (32) holds trivially. Otherwise, unfolding Bk,π 1 (s, A), we have Bk,π 1 (s, A) = 12e H2 m (H 1) X (s,a),s =s ˆwk,π σℓ(s, a)β(nk(s, a), δ ) and using Eq. (31), we obtain ˆQk,π 1 (s, A; r) Qπ 1(s, A; r) 4e q Bk,π 1 (s, A) + Bk,π 1 (s, A). Branching Reinforcement Learning Thus, by the definitions of Bk h(s, A) and Bk h(s), we have ˆV k,π 1 (s; r) V π 1 (s; r) 4e q Bk,π 1 (s) + Bk,π 1 (s) 4e q Bk 1(s) + Bk 1(s). D.1.5. 
PROOF OF THEOREM 8 Now, we prove the sample complexity upper bound for algorithm Branch RFE (Theorem 8). Proof of Theorem 8. First, we prove the correctness. Let K denote the number of episodes that algorithm Branch RFE costs. According to the stopping rule (Line 2 in Algorithm Branch RFE) and Lemma 23, when algorithm Branch RFE stops in episode K, we have that for any π, r, ˆV K,π 1 (s1; r) V π 1 (s1; r) 4e q BK 1 (s) + BK 1 (s) ε Then, we have that for any r, V 1 (s1; r) V ˆπ 1 (s1; r) =V 1 (s1; r) ˆV K,π 1 (s1; r) + ˆV K,π 1 (s1; r) ˆV K,ˆπ 1 (s1; r) + ˆV K,ˆπ 1 (s1; r) V ˆπ 1 (s1; r) V 1 (s1; r) ˆV K,π 1 (s1; r) + ˆV K,ˆπ 1 (s1; r) V ˆπ 1 (s1; r) Now, we prove the sample complexity. Bk h(s, A) X 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 ˆqk(s, a)ˆpk( |s, a) Bk h+1 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 ˆqk(s, a)ˆpk( |s, a) q(s, a)p( |s, a) Bk h+1 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 2Vars q,p Bk h+1(s ) β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 2Hq(s, a)p( |s, a) Bk h+1 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 1 H q(s, a)p( |s, a) Bk h+12H2 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) Branching Reinforcement Learning 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 1 H q(s, a)p( |s, a) Bk h+1 + 2H2 β(nk(s, a), δ ) nk(s, a) + 2 3H β(nk(s, a), δ ) 12H2 β(nk(s, a), δ ) nk(s, a) + 1 + 1 q(s, a)p( |s, a) Bk h+1 + 2 H q(s, a)p( |s, a) Bk h+1 + 6H2 β(nk(s, a), δ ) 18H2 β(nk(s, a), δ ) nk(s, a) + 1 + 3 q(s, a)p( |s, a) Bk h+1 where (a) uses Lemma 20. Then, unfolding Bk 1(s) = Bk 1(s, πk(s)) and summing over k = 1, . . . , K 1, we have k=1 Bk 1(s) ℓ=1 Eq,p,πk 18e3H2 β(nk(sσ, aσ ℓ), δ ) nk(sσ, aσ ℓ) 1 {sσ = s } =18e3H2Eq,p,πk (s,a),s =s Xk(s, a)β(nk(s, a), δ ) 18e3H2Eq,p,πk k=1 Xk(s, a)β(nk(s, a), δ ) 18e3H2 β (K 1)m H+1 m m 1 , δ Eq,p,πk (s,a),s =s log n K 1(s, a) 18e3H2 β (K 1)m H+1 m (s,a),s =s log Eq,p,πk n K 1(s, a) 18e3H2SN β (K 1)m H+1 m m 1 , δ log (H(K 1)) (a) 18e3H2SN log SN + S log 8e Hm H+1(K 1) log (H(K 1)) , (33) where (a) comes from β(t, κ) := log(SN/κ) + S log(8e(t + 1)) and m H+1 m m 1 Hm H+1. According to the stopping rule, we have ε 4e p Bk 1(s) + Bk 1(s) for k = 1, . . . , K 1. Then, summing over k = 1, . . . , K 1 for both sides, we obtain k=1 Bk 1(s) v u u t(K 1) k=1 Bk 1(s) + k=1 Bk 1(s) v u u t(K 1) k=1 Bk 1(s) + 1 k=1 Bk 1(s) Branching Reinforcement Learning (K 1) 18e3H2SN log SN + S log (8e Hm H+1(K 1)) log (H(K 1)) + S log 8e Hm H+1(K 1) log (H(K 1)) (K 1) log SN log (H(K 1)) + S log (8em H+1) log2 (H(K 1)) log (H(K 1)) + S log 8em H+1 log2 (H(K 1)) Using Lemma 13 in (M enard et al., 2021) with τ = K 1, C = 4e ε , A = log SN δ , α = H, B = E = S log 8em H+1 and D = 18e3H2SN ε , we obtain K 1 = O C2(A + B)C2 1 + S log e m H C2 1 where C1 = log(α(A + E)(C + D)). Thus, we have + S log e m H C2 1 C1 = log log SN + S log e m H HSN D.2. Sample Complexity Lower Bound In this subsection, we prove the sample complexity lower bound (Theorem 9) for branching RL-RFE. Proof of Theorem 9. This lower bound analysis follows the proof procedure of Theorem 2 in (Dann & Brunskill, 2015). We consider the same instance as the proof of regret minimization lower bound in Section C.2. The optimal policy π is to take action Aj(xi) at state xi, and we have V 1 (s1) = m(α + η) + m2(α + η)2 + + m H(α + η)H = H (34) Fix a policy π. 
For each i [S 3], let Gi := {π(xi) = Aj(xi)} denotes the event that policy π chooses the optimal action Aj(xi) in state xi. Then, we have V π 1 (s1) =m(α + η) + 1 (S 3) i=1 E h 1 {Gi} m2(α + η)2 + + m H(α + η)H + (1 1 {Gi}) m2(α + η)α + m3(α + η)α(α + η) + + m H(α + η)α(α + η)H 2 i =m(α + η) + 1 (S 3) i=1 E h 1 {Gi} m2(α + η)2 + + m H(α + η)H + (1 1 {Gi}) m2(α + η)α + m3(α + η)2α + + m H(α + η)H 1α i (35) Branching Reinforcement Learning Subtracting Eq. (35) by Eq. (34), we have V 1 (s1) V π 1 (s1) = 1 (S 3) i=1 E h (1 1 {Gi}) m(α + η)mη + m2(α + η)2mη + + m H 1(α + η)H 1mη i i=1 E h (1 1 {Gi}) mη(H 1) i i=1 E h 1 {Gi} i! The following analysis follows the proof procedure of Theorem 2 in (Dann & Brunskill, 2015). For π to be ε-optimal, we need i=1 E h 1 {Gi} i! Let η = 8e2ε cm(H 1), where c is an absolute constant that we specify later. Then, we have i=1 E [1 {Gi}] 1 c 8e4 Using Markov s inequality, we have i=1 E [1 {Gi}] 1 c 8e4 1 (S 3) 1 c 8e4 i=1 Pr [Gi] Since all Gi (i [S 3]) are independent of each other, there exist {δi}i [S 3] such that Pr Gi δi and 1 (S 3) 1 c 8e4 i=1 (1 δi) 1 δ, which is equivalent to i=1 δi (S 3) 1 + δ 1 c 8e4 Let ε be small enough such that η = 8e2ε cm(H 1) 1 4, and let δ to be small enough δ c 8e4 . Since all Gi are independent, we can use Theorem 1 in (Mannor & Tsitsiklis, 2004) to obtain 1 + δ 1 c 8e4 1 + δ 1 c 8e4 Let ni denote the number of observations on state xi. The KL-divergence of the trigger distribution on state xi between our constructed instance and the uniform instance is m KL (Bernoulli(α)||Bernoulli(α + η)) m η2 (α+η)(1 (α+η)) = O(mη2) with (α + η)(1 (α + η)) c1 for some absolute positive constant c1. Then, to ensure Pr Gi δi, we need 1 n cδi 1 + δ 1 c 8e4 Branching Reinforcement Learning where c1 and c2 are appropriate absolute constant, e.g., c1 = 400 and c2 = 4. In the following, we compute the worst bound over all δ1, . . . , δS 3 to ensure that PS 3 i=1 δi (S 3) 1 + δ 1 c 8e4 1 c 8e4 . min δ1,...,δS 3 1 n cδi 1 + δ 1 c 8e4 i=1 δi (S 3) 1 + δ 1 c 8e4 Using Lemma D.1 in (Dann & Brunskill, 2015), the optimal solution of this optimization is δ1 = = δS 3 = z, if c (1 log z) 1 with z = 1 + δ 1 c 8e4 1 c 8e4 . Since z 1 1 c 8e4 = c 8e4 and c (1 log z) is decreasing wi respect to z, we can obtain a sufficient condition for c (1 log z) 1 as Let c = 1 10, which satisfies this condition. Thus, δ1 = = δS 3 = z is the optimal solution to Eq. (37). Since in each episode, we only observe a single state xi, the number of required episodes is at least c2 1 + δ 1 c 8e4 1 c 8e4 c1c2d(S 3)m(H 1)2 c2 δ 1 c 8e4 + c 8e4