Published as a conference paper at ICLR 2024

CASCADING REINFORCEMENT LEARNING

Yihan Du
Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
yihandu@illinois.edu

R. Srikant
Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
rsrikant@illinois.edu

Wei Chen
Microsoft Research, Beijing 100080, China
weic@microsoft.com

ABSTRACT

Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep, an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. Then, the user examines the list and clicks the first attractive item (if any), and after that, the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influence of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this fact, we propose a generalized cascading RL framework, which incorporates the impact of user states and state transitions into decisions. In cascading RL, we need to select items not only with large attraction probabilities but also leading to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions, and design an oracle Best Perm to efficiently find the optimal item list. Equipped with Best Perm, we develop two algorithms Cascading VI and Cascading BPI, which are both computation-efficient and sample-efficient, and provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments to show the improved computational and sample efficiencies of our algorithms compared to straightforward adaptations of existing RL algorithms in practice.

1 INTRODUCTION

In recent years, a model called cascading bandits (Kveton et al., 2015; Combes et al., 2015; Li et al., 2016; Vial et al., 2022) has received extensive attention in the online learning community, and has found various applications such as recommendation systems (Mary et al., 2015) and online advertising (Tang et al., 2013). In this model, an agent is given a ground set of items, each with an unknown attraction probability. At each timestep, the agent recommends an ordered list of items to a user, and the user examines the items one by one, where the probability that each item attracts the user is equal to its attraction probability. Then, the user clicks the first attractive item (if any), and skips the following items. If an item is clicked in the list, the agent receives a reward; if no item is clicked, the agent receives no reward. The objective of the agent is to maximize the expected cumulative reward.

While the cascading bandit model has been extensively studied, it neglects the influence of user states (e.g., users' past behaviors) on recommendations, and the fact that states can transition as users take actions. For example, in personalized video recommendation, the recommendation system usually suggests a list of videos according to the characteristics and viewing records of users. If the user clicks a video, the environment of the system can transition to a next state that stores the
Published as a conference paper at ICLR 2024 user s latest behavior and interest. Next, the recommendation system will suggest videos of a similar type as what the user watched before to improve the click-through rate. While contextual cascading bandits (Li et al., 2016; Zong et al., 2016) can also be used to formulate users historical behaviors as part of contexts, cascading RL is more suitable for long-term reward maximization, since it considers potential rewards from future states. For instance, to maximize the long-term reward, the recommendation system may recommend videos of TV series, since users may keep watching subsequent videos of the same TV series once they get attracted to one of them. To model such state-dependent behavior, we propose a novel framework called cascading reinforcement learning (RL), which generalizes the conventional cascading bandit model to depict the influence of user states on recommendations and the transition of states in realistic applications. In this framework, there is a pool of N items and a space of states. Each state-item pair has an unknown attraction probability, an underlying transition distribution and a deterministic reward. In each episode (e.g., a session in recommendation systems), at each step, the agent first observes the current state (e.g., the user s past behavior in the session), and recommends a list of at most m N items. Then, the user goes over the list one by one, and clicks the first interesting item. After that, the agent receives a reward, and transitions to a next state according to the current state and clicked item. If no item in the list interests the user, the agent receives zero reward, and transitions to a next state according to the current state and a virtual item a . Here we say that the user clicks a if no item is clicked. We define a policy as a mapping from the space of states and the current step to the space of item lists, and the optimal policy as the policy that maximizes the expected cumulative reward. Our work distinguishes itself from prior cascading bandit studies (Kveton et al., 2015; Vial et al., 2022) on its unique computational challenge. In prior cascading bandit studies, there is no state, and they only need to select m items with the highest attraction probabilities, which does not involve computational difficulty. In contrast, in cascading RL, states also matter, and we need to balance between the maximization of the attraction probabilities of chosen items and the optimization of the expected reward obtained from future states. This poses a great computational difficulty in the planning of the optimal policy under a combinatorial action space. Moreover, the combinatorial action space also brings a challenge on sample efficiency, i.e., how to avoid a dependency on the exponential number of actions in sample complexity. To handle these challenges, we conduct a fine-grained analysis on the properties of value functions, and design an efficient oracle Best Perm to find the optimal item list, based on a novel dynamic programming for combinatorial optimization. Furthermore, we design computation-efficient and sample-efficient algorithms Cascading VI and Cascading BPI, which employ oracle Best Perm to only maintain the estimates for items and avoid enumerating over item lists. Finally, we also conduct experiments to demonstrate the superiority of our algorithms over naive adaptations of classic RL algorithms in computation and sampling. The contributions of this work are summarized as follows. 
• We propose the cascading RL framework, which generalizes the traditional cascading bandit model to formulate the influence of user states (e.g., historical behaviors) on recommendations and the change of states through time. This framework can be applied to various real-world scenarios, e.g., personalized recommendation systems and online advertising.

• To tackle the computational challenge of cascading RL, we leverage the properties of value functions to develop a novel oracle Best Perm, which uses a carefully-designed dynamic programming to efficiently find the optimal item list under combinatorial action spaces.

• For the regret minimization objective, with oracle Best Perm, we design an efficient algorithm Cascading VI, and establish an Õ(H√(HSNK)) regret, which matches a known lower bound for the general episodic RL setting up to a factor of Õ(√H). Here S is the number of states, H is the length of each episode, and K is the number of episodes. Note that the regret depends only on the number of items N, instead of the exponential number of item lists.

• For the best policy identification objective, we devise a computation- and sample-efficient algorithm Cascading BPI, and provide an Õ(H³SN/ε²) sample complexity. Cascading BPI is optimal up to a factor of O(H) when ε < H/S², where ε is an accuracy parameter.

2 RELATED WORK

Cascading bandits. Kveton et al. (2015) and Combes et al. (2015) concurrently introduce the cascading bandit model, and design algorithms based on upper confidence bounds. Cheung et al. (2019) and Zhong et al. (2021) propose Thompson Sampling-type (Thompson, 1933) algorithms. Vial et al. (2022) develop algorithms equipped with variance-aware confidence intervals, and achieve near-optimal regret bounds. In these cascading bandit studies, there is no state (context), and the order of selected items does not matter. Thus, they only need to select the m items with the maximum attraction probabilities. Li et al. (2016), Zong et al. (2016) and Li & Zhang (2018) study linear contextual cascading bandits. In (Zong et al., 2016; Li & Zhang, 2018), the attraction probabilities depend on contexts, and the order of selected items still does not matter. Hence, they only need to select the m items with the maximum attraction probabilities in the current context. Li et al. (2016) consider position discount factors, where the order of selected items is important. Different from the above studies, our cascading RL formulation further considers state transitions, and the attraction probabilities and rewards depend on states. Thus, we need to put the items that induce higher expected future rewards in the current state at the front. In addition, we need to both maximize the attraction probabilities of the selected items and optimize the potential future reward.

Provably efficient RL. In recent years, there have been a number of studies on provably efficient RL, e.g., (Jaksch et al., 2010; Dann & Brunskill, 2015; Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019). However, none of the above studies tackle the challenge of a combinatorial action space and the computation and sample efficiency issues it brings. In contrast, we have to directly face this challenge when we generalize cascading bandits to the RL framework to better formulate the state transition in real-world recommendation and advertising applications, and we provide satisfactory solutions with near-optimal computation and sample efficiency.
3 PROBLEM FORMULATION We consider an episodic cascading Markov decision process (MDP) M(S, Aground, A, m, q, p, r, H). Here S is the space of states. Aground := {a1, . . . , a N, a } is the ground set of items, where a1, . . . , a N are regular items and a is a virtual item. Item a is put at the end of each item list by default, which represents that no item in the list is clicked. An action (i.e., item list) A is a permutation which consists of at least one and at most m N regular items and the item a at the end. A is the collection of all actions. We use action and item list interchangeably throughout the paper. For any A A and i [|A|], let A(i) denote the i-th item in A, and we have A(|A|) = a . For any (s, a) S Aground, q(s, a) [0, 1] denotes the attraction probability, which gives the probability that item a is clicked in state s; r(s, a) [0, 1] is the deterministic reward of clicking item a in state s; p( |s, a) S is the transition distribution, so that for any s S, p(s |s, a) gives the probability of transitioning to state s if item a is clicked in state s. For any s S, we define q(s, a ) := 1 and r(s, a ) := 0 (because if no regular item is clicked, the agent must click a and receives zero reward). Transition probability p( |s, a ) allows to formulate interesting transition scenarios that could happen when none of the recommended items is clicked. One of such scenarios is that the session ends with no more reward, which is modeled by letting p(s0|s, a ) = 1 where s0 is a special absorbing state that always induces zero reward. H is the length of each episode. In addition, we define a deterministic policy π = {πh : S A}h [H] as a collection of H mappings from the state space to the action space, so that πh(s) gives what item list to choose in state s at step h. The cascading RL game is as follows. In each episode k, an agent first chooses a policy πk, and starts from a fixed initial state sk 1 := s1. At each step h [H], the agent first observes the current state sk h (e.g., stored historical behaviors of the user), and then selects an item list Ak h = πk h(sk h) according to her policy. After that, the environment (user) browses the selected item list one by one following the order given in Ak h. When the user browses the i-th item Ak h(i) (i [|Ak h|]), there is a probability of q(sk h, Ak h(i)) that the user is attracted by the item and clicks it. This attraction and clicking event is independent among all items. Once an item Ak h(i) is clicked, the agent observes which item is clicked and receives reward r(sk h, Ak h(i)), skipping the subsequent items {Ak h(j)}i 0 and h [H], we use Ik,h to denote the index of the clicked item in Ak h, and Ik,h = |Ak h| if no regular item is clicked. In our cascading RL model, the agent only observes the attraction of items {Ak h(j)}j Ik,h (i.e., items {Ak h(j)}jIk,h is unobserved. For any policy π, h [H] and s S, we define value function V π h : S [0, H] as the expected cumulative reward that can be obtained under policy π, starting from state s at step h, till the end of the episode. Formally, V π h (s) := Eq,p,π[PH h =h r(sh , Ah (Ih ))|sh = s]. Similarly, for any policy π, h [H] and (s, A) S A, we define Q-value function Qπ h : S A [0, H] as the expected cumulative reward received under policy π, starting from (s, A) at step h, till the end of the episode, i.e., Qπ h(s, A) := Eq,p,π[PH h =h r(sh , Ah (Ih ))|sh = s, Ah = A]. 
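To make the interaction protocol above concrete, the following minimal Python sketch simulates a single step of the cascading feedback (the user examines the list in order, clicks the first attractive item, and the agent then receives the reward and transitions). It is an illustrative toy under our own naming conventions, not part of the algorithms in this paper; the quantities q, r and p are supplied by the caller, and the virtual item a_⊥ is represented by the index -1.

```python
import numpy as np

def cascade_step(s, item_list, q, r, p, rng):
    """Simulate one step of the cascading MDP from state s.

    The user examines item_list in order and clicks the first attractive
    item; clicking item a yields reward r[s, a] and a next state drawn
    from p[s, a].  If no item is clicked, the virtual item a_perp
    (index -1 here) is "clicked" with zero reward.
    q, r : dicts indexed by (state, item); p : dict mapping
    (state, item) -> probability vector over next states.
    """
    for i, a in enumerate(item_list):
        if rng.random() < q[s, a]:          # item a attracts the user
            s_next = rng.choice(len(p[s, a]), p=p[s, a])
            return r[s, a], s_next, i       # subsequent items are skipped
    a_perp = -1                              # virtual item: no click
    s_next = rng.choice(len(p[s, a_perp]), p=p[s, a_perp])
    return 0.0, s_next, len(item_list)       # zero reward, index |A|

# Toy example with 2 states and 3 regular items (all numbers are made up).
rng = np.random.default_rng(0)
S, N = 2, 3
q = {(s, a): 0.3 + 0.1 * a for s in range(S) for a in range(N)}
r = {(s, a): 1.0 for s in range(S) for a in range(N)}
p = {(s, a): np.ones(S) / S for s in range(S) for a in list(range(N)) + [-1]}
print(cascade_step(0, [2, 0, 1], q, r, p, rng))
```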
Since S, A and H are finite, there exists a deterministic optimal policy π which always gives the maximum value V π h (s) for all s S and h [H] (Sutton & Barto, 2018). The Bellman equation and Bellman optimality equation for cascading RL can be stated as follows. Qπ h(s, A) = 1 q(s, A(j)) q(s, A(i)) r(s, A(i)) + p( |s, A(i)) V π h+1 , V π h (s) = Qπ h(s, πh(s)), V π H+1(s) = 0, s S, A A, h [H]. Q h(s, A) = 1 q(s, A(j)) q(s, A(i)) r(s, A(i)) + p( |s, A(i)) V h+1 , V h (s) = max A A Q h(s, A), V H+1(s) = 0, s S, A A, h [H]. Here Qi 1 j=1(1 q(s, A(j)))q(s, A(i)) is the probability that item A(i) is clicked, which captures the cascading feature. r(s, A(i)) + p( |s, A(i)) V h+1 is the expected cumulative reward received from step h onward if item A(i) is clicked at step h. Q h(s, A) is the summation of the expected cumulative reward over each item in A. In this work, we focus on the tabular setting where S is not too large. This is practical in category-based recommendation applications. For example, in video recommendation, the videos are categorized into multiple types, and the recommendation system can suggest videos according to the types of the latest one or two videos that the user just watched. We investigate two popular objectives in RL, i.e., regret minimization and best policy identification. In the regret minimization setting, the agent plays K episodes with the goal of minimizing the regret R(K) = PK k=1 V 1 (s1) V πk 1 (s1). In the best policy identification setting, given a confidence parameter δ (0, 1) and an accuracy parameter ε, the agent aims to identify an ε-optimal policy ˆπ which satisfies V ˆπ 1 (s1) V 1 (s1) ε with probability at least 1 δ, using as few episodes as possible. Here the performance is measured by the number of episodes used, i.e., sample complexity. 4 AN EFFICIENT ORACLE FOR CASCADING RL In the framework of model-based RL, it is typically assumed that one has estimates (possibly including exploration bonuses) of the model, and then a planning problem is solved to compute the value functions. In our problem, this would correspond to solving the maximization in Eq. (1). Different from planning in classic RL (Jaksch et al., 2010; Azar et al., 2017; Sutton & Barto, 2018), in cascading RL, a naive implementation of this maximization will incur exponential computation complexity due to the fact that we have to consider all A A. To tackle this computational difficulty, we note that each backward recursion step in Eq. (1) involves the solution to a combinatorial optimization of the following form: For any ordered subset of Aground Published as a conference paper at ICLR 2024 denoted by A, u : Aground R and w : Aground R, max A A f(A, u, w) := 1 u(A(j)) u(A(i))w(A(i)). (2) Here f(A, u, w) corresponds to Q h(s, A), u(A(i)) represents q(s, A(i)), and w(A(i)) stands for r(s, A(i)) + p( |s, A(i)) V h+1. We have u(a ) = 1 to match with q(s, a ) = 1. 4.1 CRUCIAL PROPERTIES OF PROBLEM (2) Before introducing an efficient oracle to solve problem Eq. (2), we first exhibit several nice properties of this optimization problem, which serve as the foundation of our oracle design. For any subset of items X Aground, let Perm(X) denote the collection of permutations of the items in X, and Des W(X) Perm(X) denote the permutation where items are sorted in descending order of w. For convenience of analysis, here we treat a as an ordinary item as a1, . . . , a N, and a can appear in any position in the permutations in Perm(X). Lemma 1. 
The weighted cascading reward function f(A, u, w) satisfies the following properties: (i) For any u, w and X Aground, we have f(Des W(X), u, w) = max A Perm(X) f(A, u, w). (ii) For any u, w and disjoint X, X Aground \ {a } such that w(a) > w(a ) for any a X, X , we have f((Des W(X), a ), u, w) f((Des W(X X ), a ), u, w). Furthermore, for any u, w and disjoint X, X Aground \ {a } such that w(a) > w(a ) for any a X, and w(a) < w(a ) for any a X , we have f((Des W(X, X ), a ), u, w) f((Des W(X), a ), u, w). Property (i) can be proved by leveraging the continued product structure in f(A, u, w) and a similar analysis as the interchange argument (Bertsekas & Castanon, 1999; Ross, 2014). Property (ii) follows from property (i) and the fact that u(a ) = 1. The detailed proofs are presented in Appendix B. Remark. Property (i) exhibits that when fixing a subset of Aground, the best order of this subset is to rank items in descending order of w. Then, the best permutation selection problem in Eq. (2) can be reduced to a best subset selection problem. Then, a natural question is what items and how many items should the best subset (permutation) include? Property (ii) gives an answer we should include the items with weights above w(a ), and discard the items with weights below w(a ). The intuition behind is as follows: If a permutation does not include the items in X such that w(a) > w(a ) for any a X , this is equivalent to putting X behind a , since u(a ) = 1. Then, according to property (i), we can arrange the items in X in front of a (i.e., include them) to obtain a better permutation. Similarly, if a permutation includes the items X such that w(a) < w(a ) for any a X , from property (i), we can also obtain a better permutation by putting X behind a , i.e., discarding them. 4.2 ORACLE Best Perm Making use of the properties of f(A, u, w), we develop an efficient oracle Best Perm to solve problem Eq. (2), based on a carefully-designed dynamic programming to find the optimal item subset. Algorithm 1 presents the pseudo-code of oracle Best Perm. Given attraction probabilities u and weights w, we first sort the items in Aground in descending order of w, and denote the sorted sequence by a1, . . . , a J, a , a J+1, . . . , a N. Here J denotes the number of items with weights above w(a ). If J = 0, i.e., a has the highest weight, since the solution must contain at least one item, we choose the best single item as the solution (lines 3-4). If 1 J m, (a1, . . . , a J) satisfies the cardinality constraint and is the best permutation (lines 6-7). If J > m, to meet the cardinality constraint, we Published as a conference paper at ICLR 2024 Algorithm 1: Best Perm: find argmax A A f(A, u, w) and max A A f(A, u, w) Input: Aground, u : Aground [0, 1], w : Aground R. 1 Sort the items in Aground in descending order of w, and denote the sorted sequence by a1, . . . , a J, a , a J+1, . . . , a N. Here J denotes the number of items with weights above w(a ); 2 if J = 0 then 3 a argmaxa {a1,...,a N}{u(a)w(a) + (1 u(a))w(a )}; // Select the best single item 4 Sbest (a ). F best u(a )w(a ) + (1 u(a ))w(a ); 5 if 1 J m then 6 Sbest (a1, . . . , a J); // Simply output (a1, . . . , a J) 7 F best PJ i=1 Qi 1 j=1(1 u(aj))u(ai)w(ai) + QJ j=1(1 u(aj))w(a ); 8 if J > m then 9 S[J][1] (a J); // Select m best items from (a1, . . . , a J) 10 F[J][1] u(a J)w(a J) + (1 u(a J))w(a ); 11 For any i [J], S[i][0] and F[i][0] w(a ); 12 For any i [J] and J i + 1 < k m, F[i][k] ; 13 for i = J 1, J 2, . . . , 1 do 14 for k = 1, 2, . . . 
, min{m, J i + 1} do 15 if F[i + 1][k] w(ai)u(ai) + (1 u(ai))F[i + 1][k 1] then 16 S[i][k] S[i + 1][k]. F[i][k] F[i + 1][k]; 18 S[i][k] (ai, S[i + 1][k 1]); 19 F[i][k] u(ai)w(ai) + (1 u(ai))F[i + 1][k 1]; 20 Sbest S[1][m]. F best F[1][m]; 21 return F best, Sbest; need to select m best items from (a1, . . . , a J) which maximize the objective value, i.e., max (a 1,...,a m) (a1,...,a J) f((a 1, . . . , a m), u, w). (3) Eq. (3) is a challenging combinatorial optimization problem, and costs exponential computation complexity if one performs naive exhaustive search. To solve Eq. (3), we resort to dynamic programming. For any i [J] and k [min{m, J i + 1}], let S[i][k] and F[i][k] denote the optimal solution and optimal value of the problem max(a 1,...,a k) (ai,...,a J) f((a 1, . . . , a k), u, w). Utilizing the structure of f(A, u, w), we have F[J][1] = u(a J)w(a J) + (1 u(a J))w(a ), F[i][0] = w(a ), 1 i J, F[i][k] = , 1 i J, J i + 1 < k m, F[i][k] = max{F[i + 1][k], u(ai)w(ai) + (1 u(ai))F[i + 1][k 1]}, 1 i J 1, 1 k min{m, J i + 1}. The idea of this dynamic programming is as follows. Consider that we want to select k best items from (ai, . . . , a J). If we put ai into the solution, we need to further select k 1 best items from (ai+1, . . . , a J). In this case, we have F[i][k] = u(ai)w(ai) + (1 u(ai))F[i + 1][k 1]; Otherwise, if we do not put ai into the solution, we are just selecting k best items from (ai+1, . . . , a J), and then F[i][k] = F[i + 1][k]. After computing this dynamic programming, we output S[1][m] and F[1][m] as the optimal solution and optimal value of Eq. (2) (lines 9-20). Below we formally state the correctness of Best Perm. Lemma 2 (Correctness of Oracle Best Perm). Given any u : Aground [0, 1] and w : Aground R, the permutation Sbest returned by algorithm Best Perm satisfies f(Sbest, u, w) = max A A f(A, u, w). Best Perm achieves O(Nm + N log(N)) computation complexity. This is dramatically better than the computation complexity of the naive exhaustive search, which is O(|A|) = O(N m). Published as a conference paper at ICLR 2024 Algorithm 2: Cascading VI Input: δ, δ := δ 14, L := log( KHSN δ ). For any k > 0 and s S, qk(s, a ) = qk(s, a ) := 1 and V k H+1(s) = V k H+1(s) := 0. Initialize n1,q(s, a) = n1,p(s, a) := 0 for any (s, a) S A. 1 for k = 1, 2, . . . , K do 2 for h = H, H 1, . . . , 1 do 3 for s S do 4 for a Aground \ {a } do 5 bk,q(s, a) min{2 q ˆqk(s,a)(1 ˆqk(s,a))L nk,q(s,a) + 5L nk,q(s,a), 1}; 6 qk(s, a) ˆqk(s, a) + bk,q(s, a). qk(s, a) ˆqk(s, a) bk,q(s, a); 7 for a Aground do 8 bk,p V (s, a) min{2 Vars ˆ pk ( V k h+1(s ))L nk,p(s,a) + Es ˆ pk [( V k h+1(s ) V k h+1(s ))2]L nk,p(s,a) + 5HL nk,p(s,a), H}; 9 wk(s, a) r(s, a) + ˆpk( |s, a) V k h+1 + bk,p V (s, a); 10 V k h (s), πk h(s) Best Perm(Aground, qk(s, ), wk(s, )); 11 V k h (s) min{ V k h (s), H}. A πk h(s); 12 V k h(s) max{P|A | i=1 Qi 1 j=1(1 qk(s, A (j)))qk(s, A (i))(r(s, A (i)) + ˆpk( |s, A (i)) V k h+1 bk,p V (s, A (i))), 0}; 13 for h = 1, 2, . . . , H do 14 Observe the current state sk h; // Take policy πk and observe the trajectory 15 Take action Ak h = πk h(sk h). i 1; 16 while i m do 17 Observe if Ak h(i) is clicked or not. Update nk,q(sk h, Ak h(i)) and ˆqk(sk h, Ak h(i)); 18 if Ak h(i) is clicked then 19 Receive reward r(sk h, Ak h(i)), and transition to a next state sk h+1 p( |sk h, Ak h(i)); 20 Ik,h i. 
Update nk,p(sk h, Ak h(i)) and ˆpk( |sk h, Ak h(i)); 21 break while; // Skip subsequent items 23 i i + 1; 24 if i = m + 1 then 25 Transition to a next state sk h+1 p( |sk h, a ); // No item was clicked 26 Update nk,p(sk h, a ) and ˆpk( |sk h, a ); 5 REGRET MINIMIZATION FOR CASCADING RL In this section, we study cascading RL with the regret minimization objective. Building upon oracle Best Perm, we propose an efficient algorithm Cascading VI, which is both computation and sample efficient and has a regret bound that nearly matches the lower bound. 5.1 ALGORITHM Cascading VI Algorithm 2 describes Cascading VI. In each episode k, we construct the exploration bonus for the attraction probability bk,q and the exploration bonus for the expected future reward bk,p V . Then, we calculate the optimistic attraction probability qk(s, a) and weight wk(s, a), by adding exploration bonuses for q(s, a) and p( |s, a) V individually (lines 6 and 9). Here the weight wk(s, a) represents an optimistic estimate of the expected cumulative reward that can be obtained if item a is clicked in state s. Then, utilizing the monotonicity property of f(A, u, w) with respect to the attraction Published as a conference paper at ICLR 2024 probability u and weight w, we invoke oracle Best Perm with qk(s, ) and wk(s, ) to efficiently compute the optimistic value function V k h (s) and its associated greedy policy πk h(s) (line 10). After obtaining policy πk, we play episode k with πk (lines 14-26). At each step h [H], we first observe the current state sk h, and select the item list Ak h = πk h(s). Then, the environment (user) examines the items in Ak h and clicks the first attractive item. Once an item Ak h(i) is clicked, the agent receives a reward and transitions to sk h+1 p( |sk h, Ak h(i)), skipping the subsequent items. If no item in Ak h is clicked, the agent receives zero reward and transitions to sk h+1 p( |sk h, a ). In line 17, we increment the number of attraction observations nk,q(sk h, Ak h(i)) by 1, and update the empirical attraction probability ˆqk(sk h, Ak h(i)). In lines 20 and 26, we increment the number of transition observations nk,p(sk h, Ak h(i)) (or nk,p(sk h, a )) by 1, and update the empirical transition distribution ˆpk( |sk h, Ak h(i)) (or ˆpk( |sk h, a )) by incrementing the number of transitions to sk h+1 by 1 and keeping the number of transitions to other states unchanged. If one naively applies classical RL algorithms (Azar et al., 2012; Zanette & Brunskill, 2019) by treating each A A as an ordinary action, one will suffer a dependency on |A| in the results. By contrast, Cascading VI only maintains the estimates of attraction and transition probabilities for each (s, a), and employs oracle Best Perm to directly compute V k h (s), without enumerating over A A. Therefore, Cascading VI avoids the dependency on |A| in computation and statistical complexities. 5.2 THEORETICAL GUARANTEE OF ALGORITHM Cascading VI In regret analysis, we need to tackle several challenges: (i) how to guarantee the optimism of the output value by oracle Best Perm with the optimistic estimated inputs, and (ii) how to achieve the optimal regret bound when the problem degenerates to cascading bandits (Kveton et al., 2015; Vial et al., 2022). To handle these challenges, we prove the monotonicity property of f(A, u, w) with respect to the attraction probability u and weight w, leveraging the fact that the items in the optimal permutation are ranked in descending order of w. 
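Since both line 10 of Algorithm 2 and the analysis in this section revolve around the Best Perm oracle, the following minimal Python sketch illustrates the dynamic programming behind Best Perm (Algorithm 1, Section 4.2). It is illustrative only: all function and variable names are ours, it assumes u(a_⊥) = 1 as in the rest of the paper, and ties are broken arbitrarily.

```python
import math

def f_value(perm, u, w, w_perp):
    """Objective f(A, u, w) for an ordered list of regular items `perm`,
    with the virtual item (attraction 1, weight w_perp) appended at the end."""
    val, survive = 0.0, 1.0
    for a in perm:
        val += survive * u[a] * w[a]
        survive *= 1.0 - u[a]
    return val + survive * w_perp

def best_perm(items, u, w, w_perp, m):
    """Sketch of the Best Perm oracle (Algorithm 1).

    items : regular items only; u[a] in [0,1], w[a] real; w_perp = w(a_perp).
    Returns (best value, best ordered list of at most m regular items).
    """
    # Keep only items whose weight exceeds w_perp, sorted by descending weight.
    above = sorted([a for a in items if w[a] > w_perp], key=lambda a: -w[a])
    J = len(above)
    if J == 0:
        # Must recommend at least one item: pick the best single item.
        best = max(items, key=lambda a: u[a] * w[a] + (1 - u[a]) * w_perp)
        return f_value([best], u, w, w_perp), [best]
    if J <= m:
        return f_value(above, u, w, w_perp), above
    # J > m: dynamic programming over suffixes of `above`.
    # F[i][k] = best value using k items chosen (in order) from above[i:].
    F = [[-math.inf] * (m + 1) for _ in range(J + 1)]
    S = [[None] * (m + 1) for _ in range(J + 1)]
    for i in range(J + 1):
        F[i][0], S[i][0] = w_perp, []
    for i in range(J - 1, -1, -1):
        for k in range(1, min(m, J - i) + 1):
            a = above[i]
            skip = F[i + 1][k]
            take = u[a] * w[a] + (1 - u[a]) * F[i + 1][k - 1]
            if skip >= take:
                F[i][k], S[i][k] = skip, S[i + 1][k]
            else:
                F[i][k], S[i][k] = take, [a] + S[i + 1][k - 1]
    return F[0][m], S[0][m]

# Tiny sanity check on a hand-made instance (values are illustrative only).
items = list(range(5))
u = {a: 0.2 * (a + 1) for a in items}
w = {a: float(a) for a in items}
print(best_perm(items, u, w, w_perp=1.5, m=2))
```

This sketch runs in O(Nm + N log N) time, matching the complexity stated after Lemma 2, and for small N its output can be checked against brute-force enumeration over all item lists.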
The monotonicity property established above ensures the optimism of the value function V̄_h^k(s). In addition, we employ the variance-awareness of the exploration bonus b_{k,q}(s, a), which scales as √(q̂_k(s, a)(1 − q̂_k(s, a))), to save a factor of m in the regret bound. This enables our result to match the optimal result for cascading bandits (Vial et al., 2022) in the degenerate case.

Theorem 1 (Regret Upper Bound). With probability at least 1 − δ, the regret of algorithm Cascading VI is bounded by Õ(H√(HSNK)).

From Theorem 1, we see that the regret of Cascading VI depends only on the number of items N, instead of the number of item lists |A| = O(N^m). This demonstrates the efficiency of our estimation scheme and exploration bonus design. When our problem degenerates to cascading bandits, i.e., S = H = 1, our regret bound matches the optimal result for cascading bandits (Vial et al., 2022).

Lower bound and optimality. Recall from (Jaksch et al., 2010; Osband & Van Roy, 2016) that the lower bound for classic RL is Ω(H√(SNK)). Since cascading RL reduces to classic RL when q(s, a) = 1 for any (s, a) ∈ S × Aground (i.e., the user always clicks the first item), the lower bound Ω(H√(SNK)) also holds for cascading RL. Our regret bound matches the lower bound up to a factor of √H when ignoring logarithmic factors. Regarding this gap, one possibility is that it comes from our upper bound analysis. Specifically, we add exploration bonuses for q and p·V individually, which leads to a term (q + b_{k,q})(p̂·V̄ + b_{k,pV} − p·V̄) in the regret decomposition. Here the term b_{k,q}(p̂·V̄ + b_{k,pV} − p·V̄) = O(H·b_{k,q}) leads to an Õ(H√(HK)) term in the regret, which causes the extra √H gap. A straightforward idea to avoid this is to regard q and p as an integrated transition distribution over (s, A), and to construct an overall exploration bonus for (s, A). However, this strategy forces us to maintain estimates for all A ∈ A, and will incur a dependency on |A| in the computational and statistical complexities. Another possibility behind the √H gap is that it is needed in a tight lower bound for any polynomial-time algorithm for cascading RL. Therefore, how to close the √H gap while maintaining computational efficiency remains open for future work.

[Figure 1: Experiments for cascading RL on real-world data. Each panel plots cumulative regret against the number of episodes K for N ∈ {10, 15, 20, 25} (|A| ∈ {820, 2955, 7240, 14425}); the legends report running times: N = 10: Cascading VI 1194.29s, Cascading VI-Oracle 21809.87s, Cascading VI-Bonus 1099.32s, Adapt VI 28066.90s; N = 15: 2291.00s, 71190.59s, 2144.96s, 53214.95s; N = 20: 2770.94s, 145794.64s, 2731.21s, 81135.42s; N = 25: 5216.00s, 196125.72s, 5066.27s, 95235.55s.]

6 BEST POLICY IDENTIFICATION FOR CASCADING RL

Now we turn to cascading RL with the best policy identification objective. We propose an efficient algorithm Cascading BPI to find an ε-optimal policy. Here we mainly introduce the idea of Cascading BPI, and defer the detailed pseudo-code and description to Appendix D.1 due to the space limit. In Cascading BPI, we optimistically estimate q(s, a) and p(·|s, a)·V, and add exploration bonuses for them individually.
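These bonuses take the same Bernstein form as b_{k,q} and b_{k,pV} in Algorithm 2 (lines 5 and 8). The sketch below only illustrates that general shape from empirical counts; it omits the correction term involving the optimistic-pessimistic gap, and the exact constants and clipping used by Cascading VI and Cascading BPI are those of Algorithm 2 and Appendices C-D, not of this snippet.

```python
import numpy as np

def bonus_q(q_hat, n_q, L):
    """Variance-aware (Bernstein-style) bonus for an attraction probability,
    in the spirit of line 5 of Algorithm 2: a sqrt-variance term plus a
    lower-order 1/n term, clipped to [0, 1]."""
    if n_q == 0:
        return 1.0
    return min(2.0 * np.sqrt(q_hat * (1.0 - q_hat) * L / n_q) + 5.0 * L / n_q, 1.0)

def bonus_pV(p_hat, V_next, n_p, L, H):
    """Simplified bonus for the estimated future value p(.|s,a).V, in the
    spirit of line 8 of Algorithm 2 (the gap-correction term is omitted
    here), clipped to [0, H]."""
    if n_p == 0:
        return float(H)
    mean = float(np.dot(p_hat, V_next))
    var = float(np.dot(p_hat, (V_next - mean) ** 2))
    return min(2.0 * np.sqrt(var * L / n_p) + 5.0 * H * L / n_p, float(H))

# Illustrative call: 100 attraction observations with q_hat = 0.3,
# 50 transition observations, horizon H = 3, and L a log factor.
L = np.log(1e5)
print(bonus_q(0.3, 100, L), bonus_pV(np.array([0.5, 0.5]), np.array([1.0, 2.5]), 50, L, 3))
```

The variance-aware √(q̂(1 − q̂)/n) scaling of bonus_q is the ingredient that Section 5.2 credits with saving a factor of m in the regret bound relative to a variance-unaware bonus.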
Then, we call the oracle Best Perm to compute the optimistic value function and hypothesized optimal policy. Furthermore, we construct an estimation error to bound the deviation between the optimistic value function and true value function. If this estimation error shrinks within ε, we simply output the hypothesized optimal policy; Otherwise, we play an episode with this policy. In the following, we provide the sample complexity of Cascading BPI. Theorem 2. With probability at least 1 δ, algorithm Cascading BPI returns an ε-optimal policy, and the number of episodes used is bounded by O( H3SN HSN ε ε (log 1 δ + S)), where O( ) hides logarithmic factors with respect to N, H, S and ε. Theorem 2 reveals that the sample complexity of Cascading BPI is polynomial in problem parameters N, H, S and ε, and independent of |A|. Regarding the optimality, since cascading RL reduces to classic RL when q(s, a) = 1 for all (s, a) S Aground, existing lower bound for classic best policy identification Ω( H2SN δ )) (Dann & Brunskill, 2015) also applies here. This corroborates that Cascading BPI is near-optimal up to a factor of H when ε < H 7 EXPERIMENTS In this section, we present experimental results on a real-world dataset Movie Lens (Harper & Konstan, 2015), which contains millions of ratings for movies by users. We set δ = 0.005, K = 100000, H = 3, m = 3, S = 20, N {10, 15, 20, 25} and |A| {820, 2955, 7240, 14425}. We defer the detailed setup and more results to Appendix A. We compare our algorithm Cascading VI with three baselines, i.e., Cascading VI-Oracle, Cascading VI-Bonus and Adapt VI. Specifically, Cascading VI-Oracle replaces the efficient oracle Best Perm by a naive exhaustive search. Cascading VI-Bonus replaces the variance-aware exploration bonus bk,q by a variance-unaware bonus. Adapt VI adapts the classic RL algorithm (Zanette & Brunskill, 2019) to the combinatorial action space, which maintains the estimates for all (s, A). As shown in Figure 1, our algorithm Cascading VI achieves the lowest regret and a fast running time. Cascading VI-Oracle has a comparative regret performance to Cascading VI, but suffers a much higher running time, which demonstrates the power of our oracle Best Perm in computation. Cascading VI-Bonus attains a similar running time as Cascading VI, but has a worse regret. This corroborates the effectiveness of our variance-aware exploration bonus in enhancing sample efficiency. Adapt VI suffers a very high regret and running time, since it learns the information of all permutations independently and its statistical and computational complexities depend on |A|. 8 CONCLUSION In this work, we formulate a cascading RL framework, which generalizes the cascading bandit model to characterize the impacts of user states and state transition in applications such as recommendation systems. We design a novel oracle Best Perm to efficiently identify the optimal item list under combinatorial action spaces. Building upon this oracle, we develop efficient algorithms Cascading VI and Cascading BPI with near-optimal regret and sample complexity guarantees. Published as a conference paper at ICLR 2024 ACKNOWLEDGEMENT The work of Yihan Du and R. Srikant is supported in part by AFOSR Grant FA9550-24-1-0002, ONR Grant N00014-19-1-2566, and NSF Grants CNS 23-12714, CNS 21-06801, CCF 19-34986, and CCF 22-07547. Mohammad Gheshlaghi Azar, R emi Munos, and Hilbert Kappen. On the sample complexity of reinforcement learning with a generative model. 
In International Conference on Machine Learning, 2012. Mohammad Gheshlaghi Azar, Ian Osband, and R emi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263 272. PMLR, 2017. Dimitri P Bertsekas and David A Castanon. Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5:89 108, 1999. Wang Chi Cheung, Vincent Tan, and Zixin Zhong. A thompson sampling algorithm for cascading bandits. In International Conference on Artificial Intelligence and Statistics, pp. 438 447. PMLR, 2019. Richard Combes, Stefan Magureanu, Alexandre Proutiere, and Cyrille Laroche. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 231 244, 2015. Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818 2826, 2015. Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5717 5727, 2017. F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):1 19, 2015. Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010. Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4868 4878, 2018. Emilie Kaufmann, Pierre M enard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In International Conference on Algorithmic Learning Theory, pp. 865 891. PMLR, 2021. Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767 776. PMLR, 2015. Shuai Li and Shengyu Zhang. Online clustering of contextual cascading bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In International Conference on Machine Learning, pp. 1245 1253. PMLR, 2016. J er emie Mary, Romaric Gaudel, and Philippe Preux. Bandits and recommender systems. In Machine Learning, Optimization, and Big Data: First International Workshop, MOD 2015, Taormina, Sicily, Italy, July 21-23, 2015, Revised Selected Papers 1, pp. 325 336. Springer, 2015. Pierre M enard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599 7608. PMLR, 2021. Published as a conference paper at ICLR 2024 Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. ar Xiv preprint ar Xiv:1608.02732, 2016. Sheldon M Ross. Introduction to stochastic dynamic programming. Academic press, 2014. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the ACM International Conference on Information & Knowledge Management, pp. 
1587-1594, 2013.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

Daniel Vial, Sujay Sanghavi, Sanjay Shakkottai, and R Srikant. Minimax regret for cascading bandits. In Advances in Neural Information Processing Systems, volume 35, pp. 29126-29138, 2022.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304-7312. PMLR, 2019.

Zixin Zhong, Wang Chi Cheung, and Vincent YF Tan. Thompson sampling algorithms for cascading bandits. Journal of Machine Learning Research, 22(1):9915-9980, 2021.

Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 835-844. AUAI Press, 2016.

[Figure 2: The constructed cascading MDP in synthetic data. Under a good item a ∈ {a_{N-m+1}, ..., a_N}, each state transitions to the good state of the next layer with probability 0.9 and to the bad state with probability 0.1; under a bad item a ∈ {a_1, ..., a_{N-m}} ∪ {a_⊥}, these probabilities are reversed.]

[Figure 3: Experiments for cascading RL on synthetic data. (a) Regret minimization, N = 4 (cumulative regret vs. number of episodes K; running times: Cascading VI 76.68s, Adapt VI 128.29s). (b) Regret minimization, N = 8 (Cascading VI 178.70s, Adapt VI 1806.26s). (c) Best policy identification (sample complexity and running time of Cascading BPI and Adapt BPI vs. number of items N).]

A MORE EXPERIMENTS

In this section, we describe the setup for the experiments on real-world data (Figure 1), and present more experimental results on synthetic data (Figure 3).

A.1 EXPERIMENTAL SETUP WITH REAL-WORLD DATA

In our experiments in Figure 1, we consider the real-world dataset MovieLens (Harper & Konstan, 2015), which is also used in prior cascading bandit works (Zong et al., 2016; Vial et al., 2022). This dataset contains 25 million ratings on a 5-star scale for 62000 movies by 162000 users. We regard each user as a state, and each movie as an item. For each user-movie pair, we scale the rating to [0, 1] and regard it as the attraction probability. The reward of each user-movie pair is set to 1. For each user-movie pair with a rating no lower than 4.5 stars, we set the transition probability to this state (user) itself to 0.9, and that to each other state (user) to (1 - 0.9)/(S - 1). For each user-movie pair with a rating lower than 4.5 stars, we set the transition probability to every state (user) to 1/S. We use a subset of the data from MovieLens, and set δ = 0.005, K = 100000, H = 3, m = 3, S = 20, N ∈ {10, 15, 20, 25} and |A| = Σ_{m'=1}^{m} N!/(N - m')! ∈ {820, 2955, 7240, 14425}.

A.2 EXPERIMENTS ON SYNTHETIC DATA

For synthetic data, we consider a cascading MDP with H layers, S = 2H - 1 states and N items, as shown in Figure 2: There is only an initial state s_0 in the first layer. For any 2 ≤ h ≤ H, there are a good state s_{2(h-1)-1} and a bad state s_{2(h-1)} in layer h. The reward function depends only on states.
All good states induce reward 1, and all bad states and the initial state give reward 0. The attraction probability for all state-item pairs is 1 2. Denote Aground = {a1, . . . , a N}. For any h [H], under a good item a {a N m+1, . . . , a N}, the transition probabilities from each state in layer h to the good state and the bad state in layer h + 1 are 0.9 and 0.1, respectively. On the contrary, under a bad item a {a1, . . . , a N m} {a }, the transition probabilities from each state in layer h to the good state and the bad state in layer h + 1 are 0.1 and 0.9, respectively. Therefore, in this MDP, an optimal policy is to select good items (a N m+1, . . . , a N) in all states. We set δ = 0.005, H = 5, S = 9 and m = 3. Each algorithm is performed for 20 independent runs. In the regret minimization setting, we let N {4, 8} and K = 10000, and show the average cumulative regrets and average running times (in the legend) across runs. In the best policy identification setting, we set ϵ = 0.5 and N {4, 5, 6, 7, 8}, and plot the average sample complexities and average running times across runs with 95% confidence intervals. Under the regret minimization objective, we compare our algorithm Cascading VI with Adapt VI. From Figures 3(a) and 3(b), one can see that Cascading VI achieves significantly lower regret and running time than Adapt VI, and this advantage becomes more clear as N increases. This result demonstrates the efficiency of our computation oracle and estimation scheme. Regarding the best policy identification objective, we compare our algorithm Cascading BPI with Adapt BPI, an adaptation of a classic best policy identification algorithm in (M enard et al., 2021) to combinatorial actions. In Figure 3(c), as N increases, the sample complexity and running time of Adapt BPI increase exponentially fast. By contrast, Cascading BPI has much lower sample complexity and running time, and enjoys a mild growth rate as N increases. This matches our theoretical result that the sample and computation complexities of Cascading BPI are polynomial in N. B PROOFS FOR ORACLE Best Perm In this section, we present the proofs for oracle Best Perm. First, we introduce two important lemmas which are used in the proof of Lemma 1. Lemma 3 (Interchange by Descending Weights). For any u : Aground 7 [0, 1], w : Aground 7 R and A = (a1, . . . , aℓ, aℓ+1, . . . , an) such that 1 ℓ< n and w(aℓ) < w(aℓ+1), denoting A := (a1, . . . , aℓ+1, aℓ, . . . , an), we have f(A, u, w) f(A , u, w). Proof of Lemma 3. This proof uses a similar idea as the interchange argument (Bertsekas & Castanon, 1999; Ross, 2014). We have f(A, u, w) f(A , u, w) j=1 (1 u(aj)) (u(aℓ)w(aℓ) + (1 u(aℓ))u(aℓ+1)w(aℓ+1)) j=1 (1 u(aj)) (u(aℓ+1)w(aℓ+1) + (1 u(aℓ+1))u(aℓ)w(aℓ)) j=1 (1 u(aj)) (u(aℓ+1)u(aℓ)w(aℓ) u(aℓ)u(aℓ+1)w(aℓ+1)) j=1 (1 u(aj)) u(aℓ)u(aℓ+1) (w(aℓ) w(aℓ+1)) Published as a conference paper at ICLR 2024 Lemma 4 (Items behind a Do Not Matter). For any ordered subsets of {a1, . . . , a N}, A and A , such that A A = , we have f((A, a , A ), u, w) = f((A, a ), u, w). Proof. Since u(a ) = 1, we have f((A, a , A ), u, w) 1 u(A(j)) u(A(i))w(A(i)) + 1 u(A(j)) u(a )w(a ) 1 u(A(ℓ)) 1 u(a ) i 1 Y 1 u(A (j)) u(A (i))w(A (i)) 1 u(A(j)) u(A(i))w(A(i)) + 1 u(A(j)) u(a )w(a ) =f((A, a ), u, w). Now we prove Lemma 1. For any X Aground, let Perm(X) denote the collection of permutations of the items in X, and Des W(X) Perm(X) denote the permutation where items are sorted in descending order of w. Proof of Lemma 1. First, we prove property (i) by contradiction. 
Suppose that the best permutation A = argmax A Perm(X) f(A, u, w) does not rank items in descending order of w. In other words, there exist some aℓ, aℓ+1 such that we can write A = (a1, . . . , aℓ, aℓ+1, . . . , a|X|) and w(aℓ) < w(aℓ+1). Then, using Lemma 3, we have that A = (a1, . . . , aℓ+1, aℓ, . . . , a|X|) satisfies f(A , u, w) f(A , u, w), which contradicts the supposition. Given any permutation in Perm(X), we can repeatedly perform Lemma 3 to obtain a better permutation as bubble sort, until all items are ranked in descending order of w. Therefore, we obtain property (i). Next, we prove property (ii). For any u, w and disjoint X, X Aground \ {a } such that w(a) > w(a ) for any a X, X , we have f((Des W(X), a ), u, w) (a)= f((Des W(X), a , X ), u, w) (b) f((Des W(X X ), a ), u, w), Here in the right-hand side of equality (a), X can be in any order, and inequality (b) uses property (i). Furthermore, for any u, w and disjoint X, X Aground \ {a } such that w(a) > w(a ) for any a X, and w(a) < w(a ) for any a X , we have f((Des W(X, X ), a ), u, w) (c) f((Des W(X), a , Des W(X )), u, w) = f((Des W(X), a ), u, w), where inequality (c) is due to property (i). Next, we prove Lemma 2. Published as a conference paper at ICLR 2024 Proof of Lemma 2. From Lemma 1 (i), we have that when fixing a subset of Aground, the best order of this subset is to rank items in descending order of w. Thus, the problem of finding the best permutation in Eq. (2) reduces to finding the best subset, and then we can just sort items in this subset by descending w to obtain the solution. We sort the items in Aground in descending order of w, and denote the sorted sequence by a1, . . . , a J, a , a J+1, . . . , a N. Here J denotes the number of items with weights above a . According to Lemma 1 (ii), we have that the best permutation only consists of items in a1, . . . , a J. In other words, we should discard a J+1, . . . , a N. Case (i). If 1 J m, (a1, . . . , a J) satisfies the cardinality constraint and is the best permutation. Case (ii). Otherwise, if J = 0, we have to select a single best item to satisfy that there is at least one regular item in the solution. Why do not we select more items? We can prove that including more items gives a worse permutation by contradiction. Without loss of generality, consider a permutation with an additional item, i.e., (a, a , a ), where a, a Aground \ {a } and w(a), w(a ) < w(a ). Using Lemma 3, we have f((a, a , a ), u, w) f((a, a , a ), u, w) = f((a, a ), u, w). In this case, the best permutation is the best single item argmaxa {a1,...,a N} f((a, a ), u, w). Case (iii). If J > m, the problem reduces to selecting m best items from a1, . . . , a J. For any i [J] and k [min{m, J i + 1}], let F[i][k] denote the optimal value of the problem max(a 1,...,a k) (ai,...,a J) f((a 1, . . . , a k), u, w). From the structure of f(A, u, w), we have the following dynamic programming: F[J][1] = u(a J)w(a J) + (1 u(a J))w(a ), F[i][0] = w(a ), 1 i J, F[i][k] = , 1 i J, J i + 1 < k m, F[i][k] = max{F[i + 1][k], u(ai)w(ai) + (1 u(ai))F[i + 1][k 1]}, 1 i J 1, 1 k min{m, J i + 1}. Then, F[1][m] gives the objective value of the best permutation. Combining the above analysis, we have that Best Perm returns the best permutation, i.e., f(Sbest, u, w) = max A A f(A, u, w). C PROOFS FOR CASCADING RL WITH REGRET MINIMIZATION In this section, we provide the proofs for cascading RL with the regret minimization objective. 
C.1 VALUE DIFFERENCE LEMMA FOR CASCADING MDP We first give the value difference lemma for cascading MDP, which is useful for regret decomposition. Lemma 5 (Cascading Value Difference Lemma). For any two cascading MDPs M (S, Aground, A, m, H, q , p , r ) and M (S, Aground, A, m, H, q , p , r ), the difference in values under the same policy π satisfies V π h (s) V π h (s) = t=h Eq ,p ,π j=1 (1 q (st, At(j)))q (st, At(i))r (st, At(i)) j=1 (1 q (st, At(j)))q (st, At(i))r (st, At(i)) Published as a conference paper at ICLR 2024 j=1 (1 q (st, At(j)))q (st, At(i))p ( |st, At(i)) j=1 (1 q (st, At(j)))q (st, At(i))p ( |st, At(i)) V π t+1 sh = s, Ah = πh(sh) Proof of Lemma 5. Let A := πh(s). We have V π h (s) V π h (s) j=1 (1 q (s, A(j)))q (s, A(i)) r (s, A(i)) + p ( |s, A(i)) V π h+1 j=1 (1 q (s, A(j)))q (s, A(i)) r (s, A(i)) + p ( |s, A(i)) V π h+1 j=1 (1 q (s, A(j)))q (s, A(i))r (s, A(i)) j=1 (1 q (s, A(j)))q (s, A(i))r (s, A(i)) j=1 (1 q (s, A(j)))q (s, A(i))p ( |s, A(i)) j=1 (1 q (s, A(j)))q (s, A(i))p ( |s, A(i)) j=1 (1 q (s, A(j)))q (s, A(i))p ( |s, A(i)) V π h+1 V π h+1 j=1 (1 q (s, A(j)))q (s, A(i))r (s, A(i)) j=1 (1 q (s, A(j)))q (s, A(i))r (s, A(i)) j=1 (1 q (s, A(j)))q (s, A(i))p ( |s, A(i)) j=1 (1 q (s, A(j)))q (s, A(i))p ( |s, A(i)) + Eq ,p ,π V π h+1(sh+1) V π h+1(sh+1)|sh = s, π j=1 (1 q (st, At(j)))q (st, At(i))r (st, At(i)) j=1 (1 q (st, At(j)))q (st, At(i))r (st, At(i)) j=1 (1 q (st, At(j)))q (st, At(i))p ( |st, At(i)) j=1 (1 q (st, At(j)))q (st, At(i))p ( |st, At(i)) V π t+1 ! sh = s, Ah = πh(sh) Published as a conference paper at ICLR 2024 C.2 REGRET UPPER BOUND FOR ALGORITHM Cascading VI Below we prove the regret upper bound for algorithm Cascading VI. C.2.1 CONCENTRATION For any k > 0, s S and a Aground \ {a }, let nk,q(s, a) denote the number of times that the attraction of (s, a) is observed up to episode k. In addition, for any k > 0, s S and a Aground, let nk,p(s, a) denote the number of times that the transition of (s, a) is observed up to episode k. ( ˆqk(s, a) q(s, a) 2 ˆqk(s, a)(1 ˆqk(s, a)) log KHSA nk,q(s, a) + 5 log KHSA nk,q(s, a) , ˆqk(s, a)(1 ˆqk(s, a)) p q(s, a)(1 q(s, a)) 2 nk,q(s, a) , k [K], (s, a) S (Aground \ {a }) Lemma 6 (Concentration of Attractive Probability). It holds that Pr [E] 1 4δ . Proof of Lemma 6. Using Bernstern s inequality and a union bound over nk,q(s, a) [KH], k [K] and (s, a) S Aground, we have that with probability 1 2δ , ˆqk(s, a) q(s, a) 2 q(s, a)(1 q(s, a)) log KHSA nk,q(s, a) + log KHSA nk,q(s, a) . (4) Moreover, applying Lemma 1 in (Zanette & Brunskill, 2019), we have that with probability 1 2δ , ˆqk(s, a)(1 ˆqk(s, a)) p q(s, a)(1 q(s, a)) 2 nk,q(s, a) . (5) Combining Eqs. (4) and (5), we obtain this lemma. ( ˆpk( |s, a) p( |s, a) V h+1 2 Vars p V h+1(s ) log KHSA nk,p(s, a) + H log KHSA nk,p(s, a) , ˆpk(s |s, a) p(s |s, a) p(s |s, a)(1 p(s |s, a)) log KHSA nk,p(s, a) + log KHSA nk,p(s, a) , ˆpk( |s, a) p( |s, a) 1 2S log KHSA nk,p(s, a) , (8) k [K], h [H], (s, a) S Aground ) Lemma 7 (Concentration of Transition Probability). It holds that Pr [F] 1 6δ . Proof of Lemma 7. According to Bernstein s inequality and a union bound over nk,p(s, a) [KH], k [K], h [H] and (s, a) S Aground, we obtain Eqs. (6) and (7). In addition, Eq. (8) follows from (Weissman et al., 2003) and Eq. (55) in (Zanette & Brunskill, 2019) . Published as a conference paper at ICLR 2024 Vars ˆpk V h+1(s ) q Vars p V h+1(s ) 2H nk,p(s, a) , k [K], h [H], (s, a) S Aground ) Lemma 8 (Concentration of Variance). It holds that Pr [G] 1 2δ . 
Furthermore, assume event F G holds. Then, for any k [K], h [H] and (s, a) S Aground, if V k h+1(s ) V h+1(s ) V k h+1(s ) for any s S, we have ˆpk( |s, a) p( |s, a) V h+1 2 Vars ˆpk V k h+1(s ) log KHSA v u u u t Es ˆpk V k h+1(s ) V k h+1(s ) 2 log KHSA nk,p(s, a) + 5H log KHSA nk,p(s, a) . Proof of Lemma 8. According to Eq. (53) in (Zanette & Brunskill, 2019), we have Pr [G] 1 2δ . Moreover, assume event F G holds. Then, for any k [K], h [H] and (s, a) S Aground, if V k h+1(s ) V h+1(s ) V k h+1(s ) for any s S, we have Vars ˆpk V k h+1(s ) q Vars p V h+1(s ) Vars ˆpk V k h+1(s ) q Vars ˆpk V h+1(s ) Vars ˆpk V h+1(s ) q Vars p V h+1(s ) Es ˆpk V k h+1(s ) V k h+1(s ) 2 + 2H nk,p(s, a) , (10) where inequality (a) comes from Eqs. (48)-(52) in (Zanette & Brunskill, 2019). Plugging Eq. (10) into Eq. (6), we have ˆpk( |s, a) p( |s, a) V h+1 2 Vars ˆpk V k h+1(s ) log KHSA v u u u t Es ˆpk V k h+1(s ) V k h+1(s ) 2 log KHSA nk,p(s, a) + 5H log KHSA nk,p(s, a) . For any k > 0, h [H], i [m] and (s, a) S (Aground \ {a }), let vobserve,q k,h,i (s, a) denote the probability that the attraction of (s, a) is observed in the i-th position at step h of episode k. Let vobserve,q k,h (s, a) := Pm i=1 vobserve,q k,h,i (s, a) and vobserve,q k (s, a) := PH h=1 Pm i=1 vobserve,q k,h,i (s, a). For any k > 0, h [H], i [m + 1] and (s, a) S Aground, let vobserve,p k,h,i (s, a) denote the probability that the transition of (s, a) is observed (i.e., (s, a) is clicked) in the i-th posi- Published as a conference paper at ICLR 2024 tion at step h of episode k. Let vobserve,p k,h (s, a) := Pm+1 i=1 vobserve,p k,h,i (s, a) and vobserve,p k (s, a) := PH h=1 Pm+1 i=1 vobserve,p k,h,i (s, a). Lemma 9. For any k > 0 and h [H], we have X (s,a) S Aground\{a } vobserve,q k,h (s, a)q(s, a) 1. Proof. For any k > 0, h [H], s S and A A, let wk,h(s, A) denote the probability that (s, A) is visited at step h in episode k. It holds that X (s,a) S Aground\{a } vobserve,q k,h (s, a)q(s, a) (s,a) S Aground\{a } i=1 vobserve,q k,h,i (s, a)q(s, a) (s,a) S Aground\{a } j=1 (1 q(s, A(j)))q(s, a) (s,a) S Aground\{a } i=1 Pr[(s, a) is clicked in the i-th position at step h of episode k] (s,a) S Aground\{a } Pr[(s, a) is clicked at step h of episode k] nk,q(s, a) 1 k 0, we define the following two sets: (s, a) S (Aground \ {a }) : 1 k 0, h [H] and s S, we define bk,q(s, a) := min ˆqk(s, a)(1 ˆqk(s, a))L nk,q(s, a) + 5L nk,q(s, a), 1 , a Aground \ {a }, bk,q(s, a ) := 0, bk,p V (s, a) :=min Vars ˆpk V k h+1(s ) L nk,p(s, a) +2 v u u u t Es ˆpk V k h+1(s ) V k h+1(s ) 2 L + 5HL nk,p(s, a), H , a Aground. For any k > 0, h [H], s S and A A, we define Qk h(s, A) = min j=1 (1 qk(s, A(j))) qk(s, A(i)) r(s, A(i)) + ˆpk( |s, A(i)) V k h+1 + bk,p V (s, A(i)) , H πk h(s) = argmax A A Qk h(s, A), V k h (s) = Qk h(s, πk h(s)), V k H+1(s) = 0, Published as a conference paper at ICLR 2024 Qk h(s, A) = max j=1 (1 qk(s, A(j)))qk(s, A(i)) r(s, A(i)) + ˆpk( |s, A(i)) V k h+1 bk,p V (s, A(i)) , 0 V k h(s) = Qk h(s, πk h(s)), V k H+1(s) = 0. We first prove the monotonicity of f, which will be used in the proof of optimism. Lemma 14 (Monotonicity of f). For any A = (a1, . . . , an, a ) A, w : Aground 7 R and u, u : Aground 7 [0, 1] such that 1 n m, w(a1) w(an) w(a ) and u(a) u(a) for any a A, we have f(A, u, w) f(A, u, w). Proof of Lemma 14. In the following, we make the convention an+1 = a and prove f((a1, . . . , ak), u, w) f((a1, . . . , ak), u, w) for k = 1, 2, . . . , n + 1 by induction. Then, it suffices to prove that for k = 1, 2, . . . 
, n + 1, j=1 (1 u(aj)) u(ai)w(ai) j=1 (1 u(aj))u(ai)w(ai) (1 u(a1)) (1 u(ak 1))( u(ak) u(ak))w(ak) + (1 u(a1)) (1 u(ak 2))( u(ak 1) u(ak 1))(1 u(ak))w(ak) + . . . + ( u(a1) u(a1))(1 u(a2)) (1 u(ak))w(ak), (11) since the right-hand side of the above equation is nonnegative. First, for k = 1, we have u(a1)w(a1) u(a1)w(a1) = ( u(a1) u(a1)) w(a1), and thus Eq. (11) holds. Then, for any 1 k n, supposing that Eq. (11) holds for k, we prove that it also holds for k + 1. j=1 (1 u(aj)) u(ai)w(ai) j=1 (1 u(aj))u(ai)w(ai) j=1 (1 u(aj)) u(ai)w(ai) j=1 (1 u(aj))u(ai)w(ai) +(1 u(a1)) (1 u(ak)) u(ak+1)w(ak+1) (1 u(a1)) (1 u(ak))u(ak+1)w(ak+1) | {z } Term D Here, we have Term D = (1 u(a1)) (1 u(ak)) u(ak+1)w(ak+1) (1 u(a1)) (1 u(ak))u(ak+1)w(ak+1) + (1 u(a1)) (1 u(ak))u(ak+1)w(ak+1) (1 u(a1)) (1 u(ak))u(ak+1)w(ak+1) = (1 u(a1)) (1 u(ak)) ( u(ak+1) u(ak+1)) w(ak+1) + h (1 u(a1)) (1 u(ak)) (1 u(a1)) (1 u(ak)) i u(ak+1)w(ak+1) = (1 u(a1)) (1 u(ak)) ( u(ak+1) u(ak+1)) w(ak+1) + h (1 u(a1)) (1 u(ak)) Published as a conference paper at ICLR 2024 (1 u(a1)) (1 u(ak 1)(1 u(ak)) + (1 u(a1)) (1 u(ak 1)(1 u(ak)) (1 u(a1)) (1 u(ak)) i u(ak+1)w(ak+1) = (1 u(a1)) (1 u(ak)) ( u(ak+1) u(ak+1)) w(ak+1) + (1 u(a1)) (1 u(ak 1)(u(ak) u(ak))u(ak+1)w(ak+1) + h (1 u(a1)) (1 u(ak 1) (1 u(a1)) (1 u(ak 1)) i (1 u(ak))u(ak+1)w(ak+1) = (1 u(a1)) (1 u(ak)) ( u(ak+1) u(ak+1)) w(ak+1) + (1 u(a1)) (1 u(ak 1)(u(ak) u(ak))u(ak+1)w(ak+1) + (1 u(a1)) (1 u(ak 2)(u(ak 1) u(ak 1))(1 u(ak))u(ak+1)w(ak+1) + . . . + (u(a1) u(a1))(1 u(a2)) (1 u(ak))u(ak+1)w(ak+1). (13) Plugging Eq. (13) and the induction hypothesis into Eq. (12), we have j=1 (1 u(aj)) u(ai)w(ai) j=1 (1 u(aj))u(ai)w(ai) (1 u(a1)) (1 u(ak)) ( u(ak+1) u(ak+1)) w(ak+1) + (1 u(a1)) (1 u(ak 1)( u(ak) u(ak)) h w(ak) u(ak+1)w(ak+1) i + (1 u(a1)) (1 u(ak 2)( u(ak 1) u(ak 1))(1 u(ak)) h w(ak) u(ak+1)w(ak+1) i + ( u(a1) u(a1))(1 u(a2)) (1 u(ak)) h w(ak) u(ak+1)w(ak+1) i (1 u(a1)) (1 u(ak)) ( u(ak+1) u(ak+1)) w(ak+1) + (1 u(a1)) (1 u(ak 1)( u(ak) u(ak))(1 u(ak+1))w(ak+1) + (1 u(a1)) (1 u(ak 2)( u(ak 1) u(ak 1))(1 u(ak))(1 u(ak+1))w(ak+1) + . . . + ( u(a1) u(a1))(1 u(a2)) (1 u(ak))(1 u(ak+1))w(ak+1). Therefore, Eq. (11) holds for k + 1, and we complete the proof. Now we prove the optimism of V k h (s) and pessimism of V k h(s). Lemma 15 (Optimism and Pessimism). Assume that event E F G holds. Then, for any k [K], h [H] and s S, V k h (s) V h (s) V k h(s). In addition, for any k [K], h [H] and (s, a) S Aground, it holds that ˆpk( |s, a) p( |s, a) V h+1 2 Vars ˆpk V k h+1(s ) log KHSA v u u u t Es ˆpk V k h+1(s ) V k h+1(s ) 2 log KHSA nk,p(s, a) + 5H log KHSA nk,p(s, a) . Proof of Lemma 15. We prove this lemma by induction. For any k [K] and s S, it holds that V k H+1(s) = V H+1(s) = V k H+1(s) = 0. Published as a conference paper at ICLR 2024 First, we prove optimism. For any k [K] and h [H], if V k h (s) = H, then V k h (s) V h (s) trivially holds. Otherwise, supposing V k h+1(s ) V h+1(s ) V k h+1(s ) for any s S, we have V k h (s) = Qk h(s, πk h(s)) Qk h(s, A ) j=1 (1 qk(s, A (j))) qk(s, A (i)) r(s, A (i)) + ˆpk( |s, A (i)) V k h+1 + bk,p V (s, A (i)) j=1 (1 qk(s, A (j))) qk(s, A (i)) r(s, A (i)) + ˆpk( |s, A (i)) V h+1 + bk,p V (s, A (i)) j=1 (1 qk(s, A (j))) qk(s, A (i)) r(s, A (i)) + p( |s, A (i)) V h+1 j=1 (1 q(s, A (j)))q(s, A (i)) r(s, A (i)) + p( |s, A (i)) V h+1 = Q h(s, A ). = V h (s). Here A := argmax A A P|A| i=1 Qi 1 j=1(1 q(s, A(j)))q(s, A(i))(r(s, A(i)) + p( |s, A(i)) V h+1). Inequality (a) uses the induction hypothesis. 
Inequality (b) applies the second statement of Lemma 8 with the induction hypothesis. Inequality (c) follows from Lemma 14 and the fact that the optimal permutation A satisfies that the items in A are ranked in descending order of w(a) := r(s, a) + p( |s, a) V h+1. Next, we prove pessimism. For any k [K] and h [H], if V k h(s) = 0, then V k h(s) V h (s) trivially holds. Otherwise, supposing V k h+1(s ) V h+1(s ) V k h+1(s ) for any s S, we have Qk h(s, A) = j=1 (1 qk(s, A(j)))qk(s, A(i)) r(s, A(i)) + ˆpk( |s, A(i)) V k h+1 bk,p V (s, A(i)) j=1 (1 qk(s, A(j)))qk(s, A(i)) r(s, A(i)) + ˆpk( |s, A(i)) V h+1 bk,p V (s, A(i)) j=1 (1 q(s, A(j)))q(s, A(i)) r(s, A(i)) + p( |s, A(i)) V h+1 = Q h(s, A), where inequality (a) uses the induction hypothesis. Then, we have V k h(s) = Qk h(s, πk h(s)) Q h(s, πk h(s)) Q h(s, A ) = V h (s), which completes the proof of the first statement. Combining the first statement and Lemma 8, we obtain the second statement. Published as a conference paper at ICLR 2024 C.2.4 SECOND ORDER TERM Lemma 16 (Gap between Optimism and Pessimism). For any k > 0, h [H] and s S, V k h (s) V k h(s) t=h E(st,At) πk j=1 (1 qk(st, At(j))) 66H nk,q(st, At(i)) j=1 (1 qk(st, At(j)))q(st, At(i)) 20HL nk,p(st, At(i)) sh = s, πk # Proof of Lemma 16. Let A = πk h(s). Recall that j=1 (1 qk(s, A (j))) qk(s, A (i)) r(s, A (i)) + ˆpk( |s, A (i)) V k h+1 + bk,p V (s, A (i)) , j=1 (1 qk(s, A (j)))qk(s, A (i)) r(s, A (i)) + ˆpk( |s, A (i)) V k h+1 bk,p V (s, A (i)) . Then, we have V k h (s) V k h(s) j=1 (1 qk(s, A (j))) 2r(s, A (i))bk,q(s, A (i)) + qk(s, A (i)) + 2bk,q(s, A (i)) ˆpk( |s, A (i)) V k h+1 + bk,p V (s, A (i)) qk(s, A (i)) ˆpk( |s, A (i)) V k h+1 bk,p V (s, A (i)) j=1 (1 qk(s, A (j))) q(s, A (i)) ˆpk( |s, A (i)) V k h+1 ˆpk( |s, A (i)) V k h+1 + 2bk,p V (s, A (i)) + 6Hbk,q(s, A (i)) j=1 (1 qk(s, A (j))) ˆqk(s, a)(1 ˆqk(s, a))L nk,q(s, a) + 5L nk,q(s, a) j=1 (1 qk(s, A (j)))q(s, A (i)) Vars ˆpk V k h+1(s ) L nk,p(s, A (i)) v u u u t Es ˆpk V k h+1(s ) V k h+1(s ) 2 L nk,p(s, A (i)) + 5HL nk,p(s, A (i)) j=1 (1 qk(s, A (j)))q(s, A (i)) ˆpk( |s, A (i)) p( |s, A (i)) V k h+1 V k h+1 j=1 (1 qk(s, A (j)))q(s, A (i))p( |s, A (i)) V k h+1 V k h+1 Published as a conference paper at ICLR 2024 j=1 (1 qk(s, A (j))) q(s, a)(1 q(s, a))L nk,q(s, a) + 9L nk,q(s, a) j=1 (1 qk(s, A (j)))q(s, A (i)) 18HL p nk,p(s, A (i)) j=1 (1 qk(s, A (j)))q(s, A (i)) 2H nk,p(s, A (i)) j=1 (1 qk(s, A (j)))q(s, A (i))p( |s, A (i)) V k h+1 V k h+1 j=1 (1 qk(s, A (j))) 66H nk,q(s, A (i)) j=1 (1 qk(s, A (j)))q(s, A (i)) 20HL nk,p(s, A (i)) j=1 (1 qk(s, A (j)))q(s, A (i))p( |s, A (i)) V k h+1 V k h+1 t=h E(st,At) πk j=1 (1 qk(st, At(j))) 66H nk,q(st, At(i)) j=1 (1 qk(st, At(j)))q(st, At(i)) 20HL nk,p(st, At(i)) sh = s, πk # where inequality (a) is due to Eq. (8). Lemma 17 (Cumulative Gap between Optimism and Pessimism). It holds that (s,a) S Aground vobserve,p k,h (s, a) p( |s, a) V k h+1 V k h+1 2 152192(m + 1)H5S2NL3. Proof of Lemma 17. For any k > 0, h [H] and s S, let wk,h(s) denote the probability that state s is visited at step h in episode k. 
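For readability, we record the identity behind the first steps of the chain of inequalities below (this is a restatement of the argument, not an additional assumption): since the transition of $(s,a)$ is observed at step $h$ exactly when $(s,a)$ is clicked at step $h$ under $\pi^k$,
\[
\sum_{(s,a)\in S\times A_{\mathrm{ground}}} v^{\mathrm{observe},p}_{k,h}(s,a)\, p(s'\,|\,s,a) \;=\; w_{k,h+1}(s'), \qquad \forall\, s'\in S,
\]
i.e., integrating the click (transition-observation) probabilities against the transition kernel yields the visitation probability of $s'$ at step $h+1$.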
(s,a) S Aground vobserve,p k,h (s, a) p( |s, a) V k h+1 V k h+1 2 (s,a) S Aground vobserve,p k,h (s, a) X s S p(s |s, a) V k h+1(s ) V k h+1(s ) 2 (s,a) S Aground s S vobserve,p k,h (s , s, a) V k h+1(s ) V k h+1(s ) 2 s S wk,h+1(s ) V k h+1(s ) V k h+1(s ) 2 h=1 Esh+1 πk V k h+1(sh+1) V k h+1(sh+1) 2 Published as a conference paper at ICLR 2024 h=1 Esh πk V k h (sh) V k h(sh) 2 t=h E(st,At) πk j=1 (1 qk(st, At(j))) 66H nk,q(st, At(i)) j=1 (1 qk(s, At(j)))q(st, At(i)) 20HL nk,p(st, At(i)) sh = s, πk #!2# E(st,At) πk j=1 (1 q(st, At(j))) 66H nk,q(st, At(i)) j=1 (1 q(s, At(j)))q(st, At(i)) 20HL nk,p(st, At(i)) sh = s, πk #!2# t=h E(st,At) πk j=1 (1 q(st, At(j))) 66H nk,q(st, At(i)) j=1 (1 q(s, At(j)))q(st, At(i)) 20HL nk,p(st, At(i)) !2 sh = s, πk ## (d) 2(m + 1)H t=h E(st,At) πk j=1 (1 q(st, At(j)))2 4356H2L nk,q(st, At(i)) j=1 (1 q(s, At(j)))2q(st, At(i))2 400H2L2S nk,p(st, At(i)) =2(m + 1)H2 K X (s,a) S (Aground\{a }) vobserve,q k,h (s, a)4356H2L (s,a) S Aground vobserve,p k,h (s, a)400H2L2S (e) 2(m + 1)H2 8SN log(KH) + 8HSN log HSN 4356H2L + 400H2L2S 152192(m + 1)H5S2NL3. Here inequality (a) uses Lemma 16. Inequalities (b) and (d) are due to the Cauchy-Schwarz inequality. Inequality (c) comes from Jensen s inequality. Inequality (e) follows from Lemmas 12 and 13. C.2.5 PROOF OF THEOREM 1 Proof of Theorem 1. Recall that δ := δ 14. Combining Lemmas 6-10, we have Pr[E F G K] 1 14δ = 1 δ. In the following, we assume that event E F G K holds, and then prove the regret upper bound. V 1 (sk 1) V πk 1 (sk 1) V k 1 (sk 1) V πk 1 (sk 1) Published as a conference paper at ICLR 2024 j=1 (1 qk(sh, Ah(j))) qk(sh, Ah(i))r(sh, Ah(i)) j=1 (1 q(sh, Ah(j)))q(sh, Ah(i))r(sh, Ah(i)) j=1 (1 qk(sh, Ah(j))) qk(sh, Ah(i)) ˆpk( |sh, Ah(i)) V k h+1 + bk,p V (sh, Ah(i)) j=1 (1 q(sh, Ah(j)))q(sh, Ah(i))p( |sh, Ah(i)) V k h+1 j=1 (1 q(sh, Ah(j)))r(sh, Ah(i))bk,q(sh, Ah(i)) j=1 (1 q(sh, Ah(j))) q(sh, Ah(i)) + 2bk,q(sh, Ah(i)) ˆpk( |sh, Ah(i)) V k h+1 + bk,p V (sh, Ah(i)) j=1 (1 q(sh, Ah(j)))q(sh, Ah(i))p( |sh, Ah(i)) V k h+1 j=1 (1 q(sh, Ah(j)))r(sh, Ah(i))bk,q(sh, Ah(i)) j=1 (1 q(sh, Ah(j)))q(sh, Ah(i)) ˆpk( |sh, Ah(i)) V k h+1 + bk,p V (sh, Ah(i)) p( |sh, Ah(i)) V k h+1 + 4H j=1 (1 q(sh, Ah(j)))bk,q(sh, Ah(i)) j=1 (1 q(sh, Ah(j)))bk,q(sh, Ah(i)) j=1 (1 q(sh, Ah(j)))q(sh, Ah(i))bk,p V (sh, Ah(i)) j=1 (1 q(sh, Ah(j)))q(sh, Ah(i)) ˆpk( |sh, Ah(i)) p( |sh, Ah(i)) V h+1 j=1 (1 q(sh, Ah(j)))q(sh, Ah(i)) ˆpk( |sh, Ah(i)) p( |sh, Ah(i)) V k h+1 V h+1 # (s,a) S (Aground\{a }) vobserve,q k,h,i (s, a)bk,q(s, a) (s,a) S Aground vobserve,p k,h,i (s, a)bk,p V (s, a) (s,a) S Aground vobserve,p k,h,i (s, a) ˆpk( |s, a) p( |s, a) V h+1 Published as a conference paper at ICLR 2024 (s,a) S Aground vobserve,p k,h,i (s, a) ˆpk( |s, a) p( |s, a) V k h+1 V h+1 # 6Hvobserve,q k,h (s, a)bk,q(s, a) vobserve,p k,h (s, a)bk,p V (s, a) + vobserve,p k,h (s, a) ˆpk( |s, a) p( |s, a) V h+1 + vobserve,p k,h (s, a) ˆpk( |s, a) p( |s, a) V k h+1 V h+1 (s,a)/ Bq k vobserve,q k,h (s, a) + 3H (s,a)/ Bp k vobserve,p k,h (s, a) 6Hvobserve,q k,h (s, a)bk,q(s, a) vobserve,p k,h (s, a)bk,p V (s, a) + vobserve,p k,h (s, a) ˆpk( |s, a) p( |s, a) V h+1 + vobserve,p k,h (s, a) ˆpk( |s, a) p( |s, a) V k h+1 V h+1 + 72H2SNL, (14) where inequality (a) is due to that for any k > 0 and s S, bk,q(s, a ) := 0, and inequality (b) uses Lemma 12. We bound the four terms on the right-hand side of the above inequality as follows. 
(i) Term 1: vobserve,q k,h (s, a)bk,q(s, a) vobserve,q k,h (s, a) ˆqk(s, a)(1 ˆqk(s, a))L nk,q(s, a) + 5L nk,q(s, a) vobserve,q k,h (s, a) q(s, a)(1 q(s, a))L nk,q(s, a) + 9L nk,q(s, a) vobserve,q k,h (s, a)q(s, a) vobserve,q k,h (s, a) 8SN log (KH) where inequality (a) is due to Lemma 9. (ii) Term 2: vobserve,p k,h (s, a)bk,p V (s, a) Published as a conference paper at ICLR 2024 vobserve,p k,h (s, a) Vars ˆpk V k h+1(s ) L nk,p(s, a) v u u u t Es ˆpk V k h+1(s ) V k h+1(s ) 2 L nk,p(s, a) + 5HL nk,p(s, a) vobserve,p k,h (s, a) vobserve,p k,h (s, a)Vars ˆpk V k h+1(s ) vobserve,p k,h (s, a) vobserve,p k,h (s, a)p( |s, a) V k h+1(s ) V k h+1(s ) 2 vobserve,p k,h (s, a) (ˆpk( |s, a) p( |s, a)) V k h+1(s ) V k h+1(s ) 2 ! vobserve,p k,h (s, a) 8SN log (KH) H 8SN log (KH) p 152192(m + 1)H5S2NL3 1112H2S2NL2p (m + 1)HL + 5HL 8SN log (KH) KHSN + 3428H2SNL2p (m + 1)HSL. (iii) Term 3: vobserve,p k,h (s, a) ˆpk( |s, a) p( |s, a) V h+1 vobserve,p k,h (s, a) Vars p( |s,a) V h+1(s ) vobserve,p k,h (s, a) vobserve,p k,h (s, a) vobserve,p k,h (s, a)Vars p( |s,a) V h+1(s ) vobserve,p k,h (s, a) 8SN log (KH) H KH + HL 8SN log (KH) KHSN + 8HSNL2. Published as a conference paper at ICLR 2024 (iv) Term 4: vobserve,p k,h (s, a) ˆpk( |s, a) p( |s, a) V k h+1 V h+1 vobserve,p k,h (s, a) X p(s |s, a)(1 p(s |s, a))L V k h+1(s ) V h+1(s ) + HL nk,p(s, a) vobserve,p k,h (s, a) X p(s |s, a) nk,p(s, a) V k h+1(s ) V h+1(s ) vobserve,p k,h (s, a) vobserve,p k,h (s, a) p(s |s, a) nk,p(s, a) V k h+1(s ) V h+1(s ) 2 vobserve,p k,h (s, a) vobserve,p k,h (s, a) p( |s, a) V k h+1 V h+1 2 vobserve,p k,h (s, a) vobserve,p k,h (s, a) vobserve,p k,h (s, a) p( |s, a) V k h+1 V h+1 2 vobserve,p k,h (s, a) 8SN log (KH) p 152192(m + 1)H5S2NL3 + HL 8SN log (KH) 1104H2S2NL2p (m + 1)HL + 8HSNL2 1112H2S2NL2p where inequality (a) follows from Eq. (7). Plugging the above four terms into Eq. (14), we obtain R(K) = 48HL KHSN + 3428H2SNL2p KHSN + 8HSNL2 + 1112H2S2NL2p (m + 1)HL + 72H2SAL Published as a conference paper at ICLR 2024 D ALGORITHM AND PROOFS FOR CASCADING RL WITH BEST POLICY IDENTIFICATION In this section, we present algorithm Cascading BPI and proofs for cascading RL with the best policy identification objective. D.1 ALGORITHM Cascading BPI Algorithm 3: Cascading BPI Input: ε, δ, δ := δ 7. For any κ (0, 1) and n > 0, L (κ, n) := log( HSN κ ) + log(8e(n + 1)) and L(κ, n) := log( HSN κ ) + S log(8e(n + 1)). For any k > 0 and s S, qk(s, a ) = qk(s, a ) := 1 and V k H+1(s) = V k H+1(s) := 0. Initialize n1,q(s, a) = n1,p(s, a) := 0 for any (s, a) S A. 1 for k = 1, 2, . . . , K do 2 for h = H, H 1, . . . , 1 do 3 for s S do 4 for a Aground \ {a } do 5 bk,q(s, a) min{4 q ˆqk(s,a)(1 ˆqk(s,a))L (δ ,k) nk,q(s,a) + 15L (δ ,k) nk,q(s,a) , 1}; 6 qk(s, a) ˆqk(s, a) + bk,q(s, a). qk(s, a) ˆqk(s, a) bk,q(s, a); 7 for a Aground do 8 bk,p V (s, a) min{4 Vars ˆ pk ( V k h+1(s ))L (δ ,k) nk,p(s,a) + 15H2L(δ ,k) nk,p(s,a) + 2 H ˆpk( |s, a) ( V k h+1 V k h+1), H}; 9 wk(s, a) r(s, a) + ˆpk( |s, a) V k h+1 + bk,p V (s, a); 10 V k h (s), πk h(s) Best Perm(Aground, qk(s, ), wk(s, )); 11 V k h (s) min{ V k h (s), H}. A πk h(s); 12 V k h(s) max{P|A | i=1 Qi 1 j=1(1 qk(s, A (j)))qk(s, A (i))(r(s, A (i)) + ˆpk( |s, A (i)) V k h+1 bk,p V (s, A (i))), 0}; 13 Gk h(s) min{P|A | i=1 Qi 1 j=1(1 qk(s, A (j)))(6Hbk,q(s, A (i)) + qk(s, A (i))(2bk,p V (s, A (i)) + ˆpk( |s, A (i)) Gk h+1)), H}; 14 if Gk 1(s1) ε then 15 return πk; // Estimation error is small enough 16 for h = 1, 2, . . . 
, H do 17 Observe the current state sk h; // Take policy πk and observe the trajectory 18 Take action Ak h = πk h(sk h). i 1; 19 while i m do 20 Observe if Ak h(i) is clicked or not. Update nk,q(sk h, Ak h(i)) and ˆqk(sk h, Ak h(i)); 21 if Ak h(i) is clicked then 22 Receive reward r(sk h, Ak h(i)), and transition to a next state sk h+1 p( |sk h, Ak h(i)); 23 Ik,h i. Update nk,p(sk h, Ak h(i)) and ˆpk( |sk h, Ak h(i)); 24 break while; // Skip subsequent items 26 i i + 1; 27 if i = m + 1 then 28 Transition to a next state sk h+1 p( |sk h, a ); // No item was clicked 29 Update nk,p(sk h, a ) and ˆpk( |sk h, a ); Published as a conference paper at ICLR 2024 Algorithm 3 gives the pseudo-code of Cascading BPI. Similar to Cascading VI, in each episode, we estimates the attraction and transition for each item independently, and calculates the optimistic attraction probability qk(s, a) and weight wk(s, a) using exploration bonuses. Here wk(s, a) represents the optimistic cumulative reward that can be received if item a is clicked in state s. Then, we call the oracle Best Perm to compute the maximum optimistic value V k h (s) and its greedy policy πk h(s). Furthermore, we build an estimation error Gk h(s) which upper bounds the difference between V k h (s) and V πk h (s) with high confidence. If Gk h(s) shrinks within the accuracy parameter ε, we output the policy πk. Otherwise, we play episode k with policy πk, and update the estimates of attraction and transition for the clicked item and the items prior to it. Employing the efficient oracle Best Perm, Cascading BPI only maintains the estimated attraction and transition probabilities for each a Aground, instead of calculating Qk h(s, A) for each A A as in a naive adaption of existing RL algorithms (Kaufmann et al., 2021; M enard et al., 2021). Therefore, Cascading BPI achieves a superior computation cost that only depends on N, rather than |A|. D.2 SAMPLE COMPLEXITY FOR ALGORITHM Cascading BPI In the following, we prove the sample complexity for algorithm Cascading BPI. D.2.1 CONCENTRATION For any κ (0, 1) and n > 0, let L(κ, n) := log( HSN κ ) + S log(8e(n + 1)) and L (κ, n) := log( HSN κ ) + log(8e(n + 1)). ( ˆqk(s, a) q(s, a) 4 ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) nk,q(s, a) + 15L (δ , k) nk,q(s, a) , ˆqk(s, a)(1 ˆqk(s, a)) p q(s, a)(1 q(s, a)) 4 L (δ , k) nk,q(s, a), (15) k > 0, h [H], (s, a) S Aground \ {a } Lemma 18 (Concentration of Attractive Probability). It holds that Pr [L] 1 4δ . Proof of Lemma 18. Using a similar analysis as that for Lemma 6 and a union bound over k = 1, . . . , , we have that event L holds with probability 1 4δ . KL ˆpk( |s, a) p( |s, a) L(δ , k) nk,p(s, a), ˆpk( |s, a) p( |s, a) V h+1 2 Vars p V h+1(s ) L (δ , k) nk,p(s, a) + 3HL (δ , k) nk,p(s, a) , k > 0, h [H], (s, a) S Aground ) Lemma 19 (Concentration of Transition Probability). It holds that Pr [M] 1 3δ . Furthermore, if event L M holds, we have that for any k > 0, h [H] and (s, a) S Aground, ˆpk( |s, a) p( |s, a) V h+1 4 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 15H2L(δ , k) Published as a conference paper at ICLR 2024 H ˆpk( |s, a) V k h+1 V k h+1 . Proof of Lemma 19. Following Lemma 3 in (M enard et al., 2021), we have Pr [M] 1 3δ . Using Eq. 
(22) and (23) in Lemma 22, we have s Vars p V h+1(s ) L (δ , k) 2Vars ˆpk V h+1(s ) L (δ , k) nk,p(s, a) + 4H2L(δ , k) nk,p(s, a) L (δ , k) 4Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 4H ˆpk( |s, a) V k h+1 V h+1 L (δ , k) + 2HL(δ , k) 4Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 1 H ˆpk( |s, a) V k h+1 V k h+1 4H2L (δ , k) + 2HL(δ , k) Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 1 H ˆpk( |s, a) V k h+1 V k h+1 + 6H2L(δ , k) nk,p(s, a) , where inequality (a) uses the induction hypothesis of Lemma 15. Thus, we have ˆpk( |s, a) p( |s, a) V h+1 4 Vars ˆpk V k h+1(s ) L (δ , k) H ˆpk( |s, a) V k h+1 V k h+1 + 15H2L(δ , k) nk,p(s, a) . D.2.2 OPTIMISM AND ESTIMATION ERROR For any k > 0, h [H] and s S, we define bk,q(s, a) := min ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) nk,q(s, a) + 15L (δ , k) nk,q(s, a) , 1 a Aground \ {a }, bk,q(s, a ) := 0, bk,p V (s, a) := min Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 15H2L(δ , k) H ˆpk( |s, a) V k h+1 V k h+1 , H , a Aground. Published as a conference paper at ICLR 2024 For any k > 0, h [H], s S and A A, we define Qk h(s, A) = min j=1 (1 qk(s, A(j))) qk(s, A(i)) r(s, A(i)) + ˆpk( |s, A(i)) V k h+1 + bk,p V (s, A(i)) , H πk h(s) = argmax A A Qk h(s, A), V k h (s) = Qk h(s, πk h(s)), V k H+1(s) = 0, Qk h(s, A) = max j=1 (1 qk(s, A(j)))qk(s, A(i)) r(s, A(i)) + ˆpk( |s, A(i)) V k h+1 bk,p V (s, A(i)) , 0 V k h(s) = Qk h(s, πk h(s)), V k H+1(s) = 0, Gk h(s, A) = j=1 (1 qk(s, A(j))) ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) + 15L (δ , k) nk,q(s, a) , 1 + qk(s, A(i)) min 8 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 30H2L(δ , k) nk,p(s, a) + 4 H ˆpk( |s, a) Gk h+1, 2H + ˆpk( |s, a) Gk h+1 Gk h(s) = Gk h(s, πk h(s)), Gk H+1(s) = 0. Lemma 20 (Optimism). Assume that event L M holds. Then, for any k > 0, h [H] and s S, V k h (s) V h (s). Proof of Lemma 20. By a similar analysis as that for Lemma 15 with different definitions of bk,q(s, a), bk,p V (s, a) and Lemmas 18, 19, we obtain this lemma. Lemma 21 (Estimation Error). Assume that event K L M holds. Then, with probability at least 1 δ, for any k > 0, h [H] and s S, V k h (s) min{V k h(s), V πk h (s)} Gk h(s). Proof of Lemma 21. For any k > 0 and s S, it trivially holds that V k H+1(s) min{V k H+1(s), V πk H+1(s)} Gk H+1(s). For any k > 0, h [H] and (s, A) S A, if Gk h(s, A) = H, then it trivially holds that Qk h+1(s, A) min{Qk h+1(s, A), Qπk h+1(s, A)} Gk h+1(s, A). Otherwise, supposing V k h+1(s) min{V k h+1(s), V πk h+1(s)} Gk h+1(s), we first prove Qk h(s, A) Qk h(s, A) Gk h(s, A). Qk h(s, A) Qk h(s, A) Published as a conference paper at ICLR 2024 j=1 (1 qk(s, A(j)))(qk(s, A(i)) + 2bk,q(s, A(i))) r(s, A(i))+ˆpk( |s, A(i)) V k h+1+bk,p V (s, A(i)) j=1 (1 qk(s, A(j)))qk(s, A(i)) r(s, A(i)) + ˆpk( |s, A(i)) V k h+1 bk,p V (s, A(i)) j=1 (1 qk(s, A(j))) 6Hbk,q(s, A(i)) + qk(s, A(i)) 2bk,p V (s, A(i)) + ˆpk( |s, A(i)) V k h+1 V k h+1 j=1 (1 qk(s, A(j))) 6Hbk,q(s, A(i))+qk(s, A(i)) 2bk,p V (s, A(i))+ˆpk( |s, A(i)) Gk h+1 = Gk h(s, A). Now, we prove Qk h(s, A) Qπk h (s, A) Gk h(s, A). Qk h(s, A) Qπk h (s, A) j=1 (1 qk(s, A(j)))(qk(s, A(i))+2bk,q(s, A(i))) r(s, A(i))+ˆpk( |s, A(i)) V k h+1+bk,p V (s, A(i)) j=1 (1 qk(s, A(j)))qk(s, A(i)) r(s, A(i)) + p( |s, A(i)) V πk h+1 j=1 (1 qk(s, A(j))) 6Hbk,q(s, A(i)) + qk(s, A(i)) bk,p V (s, A(i)) + ˆpk( |s, A(i)) V k h+1 p( |s, A(i)) V πk h+1 j=1 (1 qk(s, A(j))) 6Hbk,q(s, A(i)) + qk(s, A(i)) bk,p V (s, A(i)) + ˆpk( |s, A(i)) V k h+1 V πk h+1 + ˆpk( |s, A(i)) p( |s, A(i)) V h+1 + p( |s, A(i)) ˆpk( |s, A(i)) V h+1 V πk h+1 ! 
In addition, for any (s, a) S Aground, we have p( |s, a) ˆpk( |s, a) V h+1 V πk h+1 Vars p V h+1(s ) V πk h+1(s ) L(δ , k) nk,p(s, a) + 3HL(δ , k) L(δ , k) nk,p(s, a) 2Vars ˆpk V h+1(s ) V πk h+1(s ) + 4H2L(δ , k) + 3HL(δ , k) Published as a conference paper at ICLR 2024 L(δ , k) nk,p(s, a) 2H ˆpk( |s, a) V h+1 V πk h+1 + 4H2L(δ , k) + 3HL(δ , k) 2H2L(δ , k) nk,p(s, a) 1 H ˆpk( |s, a) V h+1 V πk h+1 + 4HL(δ , k) nk,p(s, a) + 3HL(δ , k) H ˆpk( |s, a) V h+1 V πk h+1 + 11H2L(δ , k) nk,p(s, a) (17) Plugging Eq. (17) into Eq. (16), we have Qk h(s, A) Qπk h (s, A) j=1 (1 qk(s, A(j))) ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) nk,q(s, a) + 15L (δ , k) + qk(s, A(i)) 4 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 15H2L(δ , k) nk,p(s, a) + 2 H ˆpk( |s, a) V k h+1 V k h+1 + ˆpk( |s, A(i)) V k h+1 V πk h+1 + 2 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 3HL(δ , k) H ˆpk( |s, a) V h+1 V πk h+1 + 11H2L(δ , k) j=1 (1 qk(s, A(j))) ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) nk,q(s, a) + 15L (δ , k) + qk(s, A(i)) 6 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 29H2L(δ , k) nk,p(s, a) + 1 + 4 ˆpk( |s, a) Gk h+1 By the clipping bound in Eq. (16), we have Qk h(s, A) Qπk h (s, A) j=1 (1 qk(s, A(j))) ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) nk,q(s, a) + 15L (δ , k) nk,q(s, a) , 1 + qk(s, A(i)) min 6 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) + 29H2L(δ , k) H ˆpk( |s, a) Gk h+1, 2H + ˆpk( |s, a) Gk h+1 Gk h+1(s, A). Then, we have V k h (s) min{V k h(s), V πk h (s)} = Qk h(s, πk h(s)) min{Qk h(s, πk h(s)), Qπk h (s, πk h(s))} Gk h(s, πk h(s)) Published as a conference paper at ICLR 2024 which completes the proof. D.2.3 PROOF OF THEOREM 2 Proof of Theorem 2. Recall that δ := δ 7. From Lemmas 10, 18 and 19, we have Pr[K L M] 1 7δ = 1 δ. Below we assume that event K L M holds, and then prove the correctness and sample complexity. First, we prove the correctness. Using Lemma 20 and 21, we have that the output policy πK+1 satisfies that V 1 (s1) V πK+1 1 (s1) V K+1 1 (s1) V πK+1 1 (s1) GK+1 1 (s1) ε, which indicates that policy πK+1 is ε-optimal. Next, we prove sample complexity. For any k > 0, h [H], s S and A A, j=1 (1 qk(s, A(j))) ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) nk,q(s, a) + 15L (δ , k) + qk(s, A(i)) 8 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, A(i)) + 30H2L(δ , k) nk,p(s, A(i)) ˆpk( |s, A(i)) Gk h+1 j=1 (1 q(s, A(j))) ˆqk(s, a)(1 ˆqk(s, a))L (δ , k) nk,q(s, a) + 90HL (δ , k) + q(s, A(i)) 8 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, A(i)) + 30H2L(δ , k) nk,p(s, A(i)) ˆpk( |s, A(i)) Gk h+1 j=1 (1 q(s, A(j))) q(s, a)(1 q(s, a))L (δ , k) nk,q(s, a) + 186HL (δ , k) + q(s, A(i)) 8 Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, A(i)) + 30H2L(δ , k) nk,p(s, A(i)) p( |s, A(i)) Gk h+1 + 1 + 4 ˆpk( |s, A(i)) p( |s, A(i)) Gk h+1 where inequality (a) comes from Eq. (15). Using Eq. (21) in Lemma 22, we have ˆpk( |s, A(i)) p( |s, A(i)) Gk h+1 2Vars p Gk h+1(s ) L(δ, k) nk,p(s, A(i)) + 2HL(δ, k) 3nk,p(s, A(i)) 1 H p( |s, A(i)) Gk h+1 2H2L(δ, k) nk,p(s, A(i)) + 2HL(δ, k) 3nk,p(s, A(i)) H p( |s, A(i)) Gk h+1 + 3H2L(δ, k) nk,p(s, A(i)) (19) Published as a conference paper at ICLR 2024 According to Eqs. 
(22) and (23) in Lemma 22, we have q Vars ˆpk V k h+1(s ) 2Vars p V k h+1(s ) + 4H2L(δ, k) 4Vars p V πk h+1(s ) + 4Hp( |s, a) V k h+1 V πk h+1 + 4H2L(δ, k) 4Vars p V πk h+1(s ) + 4Hp( |s, a) V k h+1 V k h+1 + 4H2L(δ, k) 4Vars p V πk h+1(s ) + q 4Hp( |s, a) Gk h+1 + Vars ˆpk V k h+1(s ) L (δ , k) nk,p(s, a) 4Vars p V πk h+1(s ) L (δ , k) nk,p(s, a) 4H2L (δ , k) nk,p(s, a) 1 H p( |s, a) Gk h+1 + 2HL(δ , k) Vars p V πk h+1(s ) L (δ , k) nk,p(s, a) + 6H2L(δ , k) H p( |s, a) Gk h+1 (20) Plugging Eqs. (19) and (20) into Eq. (18), we obtain j=1 (1 q(s, A(j))) q(s, a)(1 q(s, a))L (δ , k) nk,q(s, a) + 186HL (δ , k) + q(s, A(i)) 16 Vars p V πk h+1(s ) L (δ , k) nk,p(s, a) + 42H2L(δ , k) nk,p(s, a) + 8 H p( |s, a) Gk h+1 + 30H2L(δ , k) nk,p(s, A(i)) + 1 + 4 p( |s, A(i)) Gk h+1 H p( |s, A(i)) Gk h+1 + 3H2L(δ, k) nk,p(s, A(i)) j=1 (1 q(s, A(j))) q(s, a)(1 q(s, a))L (δ , k) nk,q(s, a) + 186HL (δ , k) + q(s, A(i)) 16 Vars p V πk h+1(s ) L (δ , k) nk,p(s, a) + 87H2L(δ , k) nk,p(s, A(i)) p( |s, A(i)) Gk h+1 Unfolding the above inequality over h = 1, 2, . . . , H, we have (s,a) S Aground\{a } vobserve,q k,h (s, a) q(s, a)(1 q(s, a))L (δ , k) nk,q(s, a) + 186HL (δ , k) (s,a) S Aground vobserve,p k,h (s, a) Vars p V πk h+1(s ) L (δ , k) nk,p(s, a) Published as a conference paper at ICLR 2024 + 87e17H2L(δ , k) (s,a) S Aground\{a } vobserve,q k,h (s, a) q(s, a)(1 q(s, a))L (δ , k) (s,a) S Aground\{a } vobserve,q k,h (s, a) L (δ , k) (s,a) S Aground vobserve,p k,h (s, a)Vars p V πk h+1(s ) (s,a) S Aground vobserve,p k,h (s, a) L (δ , k) + 87e17 H X (s,a) S Aground vobserve,p k,h (s, a)H2L(δ , k) (s,a) S Aground\{a } vobserve,q k,h (s, a) q(s, a)(1 q(s, a))L (δ , k) (s,a) S Aground\{a } vobserve,q k,h (s, a) L (δ , k) (s,a) S Aground vobserve,p k,h (s, a) L (δ , k) + 87e17H2 H X (s,a) S Aground vobserve,p k,h (s, a) L(δ , k) nk,p(s, a). Let K denote the number of episodes that algorithm Cascading BPI plays. According to the stopping rule of algorithm Cascading BPI, we have that for any k K, ε < Gk 1(s1). Summing the above inequality over k = 1, . . . , K, dividing (s, a) by sets Bq k and Bp k, and using the clipping construction of Gk h(s, A), we obtain vobserve,q k,h (s, a) q(s, a)(1 q(s, a))L (δ , k) vobserve,q k,h (s, a) L (δ , k) vobserve,p k,h (s, a) L (δ , k) + 87e17H2 K X vobserve,p k,h (s, a) L(δ , k) Published as a conference paper at ICLR 2024 (s,a)/ Bq k vobserve,q k,h (s, a)6H + (s,a)/ Bp k vobserve,p k,h (s, a)2H vobserve,q k,h (s, a)q(s, a) vobserve,q k,h (s, a)L (δ , k) vobserve,q k,h (s, a) L (δ , k) vobserve,p k,h (s, a) L (δ , k) + 87e17H2 K X vobserve,p k,h (s, a) L(δ , k) nk,p(s, a) + 64H2SA log HSN KHSA log (KH) L (δ , K) + 1488HSA log (KH) L (δ , K) KHSA log (KH) L (δ , K) + 696e17H2SA log (KH) L(δ , K) + 64H2SA log HSN log (KH) + log2(8e H(K + 1)) + 2248e17H2SA log HSN + S log2 8e HSN(K + 1) where inequality (a) uses Lemma 9. Thus, we have K < 160e17H log (KH) + log2(8e H(K + 1)) + 2248e17H2SA + S log2 8e HSN(K + 1) Using Lemma 23 with τ = K + 1, α = 8e HSN δ , A = log HSN δ , B = 1, C = 160e17H HSA ε , D = 2248e17H2SA ε and E = S, we have C1 = O log HSN Therefore, we obtain Theorem 2. E TECHNICAL LEMMAS In this section, we present two useful technical lemmas. Lemma 22 (Lemmas 10, 11, 12 in (M enard et al., 2021)). Let p1 and p2 be two distributions on S such that KL(p1, p2) α. Let f and g be two functions defined on S such that for any s S, Published as a conference paper at ICLR 2024 0 f(s), g(s) b. 
Then,
\[
\big| p_1 f - p_2 f \big| \;\le\; \sqrt{2\,\mathrm{Var}_{p_2}(f)\,\alpha} + \tfrac{2}{3}\, b\, \alpha, \tag{21}
\]
\[
\mathrm{Var}_{p_2}(f) \;\le\; 2\,\mathrm{Var}_{p_1}(f) + 4 b^2 \alpha, \qquad \mathrm{Var}_{p_1}(f) \;\le\; 2\,\mathrm{Var}_{p_2}(f) + 4 b^2 \alpha, \tag{22}
\]
\[
\mathrm{Var}_{p_1}(f) \;\le\; 2\,\mathrm{Var}_{p_1}(g) + 2 b\, p_1 |f - g|. \tag{23}
\]

Lemma 23. Let $A, B, C, D, E$ and $\alpha$ be positive scalars such that $1 \le B \le E$ and $\alpha \ge e$. If $\tau \ge 0$ satisfies
\[
\tau \;\le\; C \sqrt{\tau\big( A \log(\alpha\tau) + B \log^2(\alpha\tau) \big)} + D\big( A \log(\alpha\tau) + E \log^2(\alpha\tau) \big),
\]
then
\[
\tau \;\le\; C^2 (A + B)\, C_1^2 + \big( D + 2\sqrt{D}\, C \big) (A + E)\, C_1^2 + 1,
\qquad \text{where } C_1 = 8\sqrt{5}\, \log\!\big( 11 \alpha^2 (A + E)(C + D) \big).
\]
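As an illustration of Lemma 22, the following short numerical sanity check (not part of the proofs, and assuming only NumPy) draws random distributions $p_1, p_2$ on a small state space together with random functions $f, g$ taking values in $[0, b]$, and reports the largest observed ratio between the left- and right-hand sides of inequalities (21)-(23) with $\alpha = \mathrm{KL}(p_1, p_2)$; ratios below one are consistent with the stated bounds.

```python
import numpy as np

# Illustrative numerical sanity check of Lemma 22 (not part of the proofs).
# We draw random distributions p1, p2 on a small state space and random
# functions f, g with values in [0, b], then report the largest observed
# LHS/RHS ratio for inequalities (21)-(23) with alpha = KL(p1, p2).

rng = np.random.default_rng(0)
S, b, trials = 6, 1.0, 20_000

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def var(p, f):
    return float(p @ f**2 - (p @ f) ** 2)

r21 = r22 = r23 = 0.0
for _ in range(trials):
    p1, p2 = rng.dirichlet(np.ones(S)), rng.dirichlet(np.ones(S))
    f, g = rng.uniform(0, b, S), rng.uniform(0, b, S)
    alpha = kl(p1, p2)

    # (21): |p1 f - p2 f| <= sqrt(2 Var_{p2}(f) alpha) + (2/3) b alpha
    r21 = max(r21, abs(p1 @ f - p2 @ f)
              / (np.sqrt(2 * var(p2, f) * alpha) + 2 * b * alpha / 3))
    # (22): Var_{p1}(f) <= 2 Var_{p2}(f) + 4 b^2 alpha
    r22 = max(r22, var(p1, f) / (2 * var(p2, f) + 4 * b**2 * alpha))
    # (23): Var_{p1}(f) <= 2 Var_{p1}(g) + 2 b E_{p1}|f - g|
    r23 = max(r23, var(p1, f)
              / (2 * var(p1, g) + 2 * b * float(p1 @ np.abs(f - g))))

print(f"max LHS/RHS ratios: (21) {r21:.3f}, (22) {r22:.3f}, (23) {r23:.3f}")
```

Reporting a maximum ratio rather than asserting keeps the check informative even under floating-point effects.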