# Reinforcement Learning with Segment Feedback

Yihan Du¹, Anna Winnicki², Gal Dalal³, Shie Mannor⁴ ³, R. Srikant¹

¹University of Illinois at Urbana-Champaign, ²Stanford University, ³NVIDIA Research, ⁴Technion. Correspondence to: Yihan Du, R. Srikant.

**Abstract.** Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications it is often difficult and costly to collect a reward for each state-action pair. While several works have considered RL with trajectory feedback, it is unclear whether trajectory feedback remains efficient for learning when trajectories are long. In this work, we consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. In this model, we consider an episodic Markov decision process (MDP), where each episode is divided into $m$ segments and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impact of the number of segments $m$ on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and experimental results show that under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.

Reinforcement learning (RL) is a class of sequential decision-making algorithms in which an agent interacts with an unknown environment over time with the goal of maximizing the obtained reward. RL has a variety of applications, such as robotics, autonomous driving, and game playing.
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

In classic RL, when the agent takes an action in a state, the environment provides a reward for this state-action pair. However, in real-world applications it is often difficult and costly to collect a reward for each state-action pair. For example, in robotics, when we instruct a robot to scramble eggs, it is hard to specify a reward for each individual action. In autonomous driving, it is difficult and onerous to evaluate each action against multiple criteria, including safety, comfort, and speed. Motivated by this fact, several works have considered RL with trajectory feedback (Efroni et al., 2021; Chatterji et al., 2021). In these works, the agent observes a reward signal only at the end of each episode, instead of at each step, with the signal indicating the quality of the trajectory generated during the episode. While these works mitigate the issue of impractical per-step reward feedback in classic RL, the relationship between the frequency of feedback and the performance of RL algorithms remains unknown. In particular, if we get feedback twice in each trajectory, for example, does that significantly improve performance over once-per-trajectory feedback?

To answer this question, we study a general model called RL with segment feedback, which bridges the gap between per-state-action feedback in classic RL (Sutton & Barto, 2018) and trajectory feedback in recent works (Efroni et al., 2021; Chatterji et al., 2021). In this model, we consider an episodic Markov decision process (MDP), where an episode is equally divided into $m$ segments. In each episode, at each step, the agent first observes the current state, takes an action, and then transitions to a next state according to the transition distribution. The agent observes a reward signal at the end of each segment.
Under this model, we consider two reward feedback settings: binary feedback and sum feedback. In the binary feedback setting, the agent observes a binary outcome (e.g., thumbs up/down) generated by a sigmoid function of the reward on the segment. In the sum feedback setting, the agent observes the sum of the rewards over the segment. In our model, the agent needs to learn the underlying reward function (i.e., the expected reward as a function of states and actions) from binary or sum segment feedback, and to maximize the expected reward achieved. While Tang et al. (2024) also studied this segment model before (they called it RL from bagged reward), their work is mostly empirical, and does not provide theoretical guarantees for algorithms or rigorously reveal the influence of segments on learning.

This model is applicable to many scenarios involving human queries. For instance, in autonomous driving, a driving trajectory is often divided into several segments, and human annotators are asked to provide feedback on each segment, e.g., thumbs up/down. Compared to state-action pairs or whole trajectories, segments are easier and more efficient to evaluate, since human annotators can focus on and rate behaviors in each segment, e.g., passing through intersections, reversing the car, and parking. In this segment model, there is an interesting balance between the number of segments (queries to humans) and the collected observations: we desire more observations, but we also want to reduce the number of queries. Therefore, it is critical to investigate the trade-off between the benefits brought by segments and the increase in queries, which essentially comes down to one question: How does the number of segments $m$ impact learning performance?

To answer this question, we design efficient algorithms for the binary and sum feedback settings in both the known and unknown transition cases.
Regret upper and lower bounds are provided to rigorously show the influence of the number of segments on learning performance. We also present experiments to validate our theoretical results. Note that studying RL with equal segments is an important starting point and serves as a foundation for further investigation of more general models and analyses for RL with unequal segments. Even under equal segments, this problem is already very challenging: (i) it cannot be solved by applying prior trajectory feedback works, e.g., (Efroni et al., 2021), since they use the martingale property of subsequent trajectories in their analysis, while subsequent segments are not a martingale due to the dependency among segments within a trajectory; (ii) in prior trajectory feedback works (Efroni et al., 2021; Chatterji et al., 2021), there is a gap between the upper and lower bounds for sum feedback, and there is no lower bound for binary feedback, which poses a significant challenge when trying to understand the influence of the number of segments $m$ on learning performance. Our work overcomes the above challenges and makes the following contributions.

1. We study a general model called RL with segment feedback, which seamlessly bridges the gap between per-state-action feedback in classic RL and trajectory feedback. Under this model, we consider two feedback settings: binary feedback and sum feedback.

2. For binary feedback, we design computationally-efficient and sample-efficient algorithms SegBiTS and SegBiTS-Tran for known and unknown transitions, respectively. We provide regret upper and lower bounds which depend on $\exp(\frac{Hr_{\max}}{2m})$, where $H$ is the length of each episode and $r_{\max}$ is a universal upper bound on rewards. Our results exhibit that under binary feedback, increasing the number of segments $m$ significantly helps accelerate learning.

3. For sum feedback, we devise algorithms E-LinUCB and LinUCB-Tran, which achieve near-optimal regrets in terms of $H$ and $m$.
We also establish lower bounds to validate this optimality, and show that the optimal regrets do not depend on $m$. Our results reveal that, surprisingly, under sum feedback, increasing the number of segments $m$ does not help expedite learning much.

4. We develop novel techniques which may be of independent interest, including a KL divergence analysis to derive an exponential lower bound under binary feedback, and the use of E-optimal experimental design in algorithm E-LinUCB to refine the minimum eigenvalue of the covariance matrix and reduce the regret.

## 1. Related Work

In this section, we briefly review prior related works. Algorithms and analysis for classic RL are well studied in the literature (Sutton & Barto, 2018; Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018; Zanette & Brunskill, 2019). Tang et al. (2024) proposed the RL with segment feedback problem (they called it RL from bagged rewards) and designed a transformer-based algorithm. However, their work is mostly empirical and does not provide theoretical guarantees. Gao et al. (2025) consider RL with bagged decision times, where the state transitions are non-Markovian within a bag and a reward is observed at the end of the bag. But the focus of (Gao et al., 2025) is handling the non-Markovian state transitions within a bag using a causal directed acyclic graph, rather than investigating, as we do, how to infer the reward function of state-action pairs from bagged rewards. In addition, to the best of our knowledge, no existing work rigorously quantifies the influence of segments on learning performance.

Two prior works (Efroni et al., 2021; Chatterji et al., 2021) study RL with trajectory feedback and are the most related to ours. Efroni et al. (2021) investigated RL with sum trajectory feedback, and designed upper confidence bound (UCB)-type and Thompson sampling (TS)-type algorithms with regret guarantees. Chatterji et al.
(2021) studied RL with binary trajectory feedback, but considered a formulation of binary feedback different from ours. Specifically, in their formulation, the objective is to find the policy that maximizes the expected probability of generating feedback 1, and their optimal policy can be non-Markovian due to the non-linearity of the sigmoid function. In our formulation, the objective is to find the optimal policy under the standard MDP definition by inferring rewards from binary feedback, and thus we consider Markovian policies. The algorithms in (Chatterji et al., 2021) are either computationally inefficient or have a suboptimal regret order, due to the non-linearity of their objective and direct maximization over all non-Markovian policies. Our algorithms are computationally efficient by adopting the TS algorithmic style and efficient MDP planning under Markovian policies. Our regret results cannot be directly compared to those in (Chatterji et al., 2021) due to the difference in formulation. Moreover, different from (Efroni et al., 2021; Chatterji et al., 2021), we study RL with segment feedback, which allows feedback from multiple segments within a trajectory, with per-state-action feedback and trajectory feedback as the two extremes. Under sum feedback, we improve the result in (Efroni et al., 2021) by a factor of $H$ using experimental design, when the problem reduces to the trajectory feedback setting. Under binary feedback, we propose TS-style algorithms which are computationally efficient, and build a lower bound that reveals an inevitable exponential factor in the regret bound, which is novel in the RL literature. Our work is also related to linear bandits (Abbasi-Yadkori et al., 2011) and logistic bandits (Filippi et al., 2010; Faury et al., 2020; Russac et al., 2021), and uses analytical techniques from that literature.
## 2. Formulation

In this section, we present the formulation of RL with binary and sum segment feedback. We consider an episodic MDP denoted by $\mathcal M(\mathcal S, \mathcal A, H, r, p, \rho)$. Here $\mathcal S$ is the state space and $\mathcal A$ is the action space. $H$ is the length of each episode. $r : \mathcal S \times \mathcal A \to [-r_{\max}, r_{\max}]$ is an unknown reward function, where $r_{\max} > 0$ is a universal constant. Define the reward parameter $\theta^* := [r(s,a)]_{(s,a)\in\mathcal S\times\mathcal A} \in \mathbb R^{|\mathcal S||\mathcal A|}$. $p : \mathcal S \times \mathcal A \to \Delta(\mathcal S)$ is the transition distribution: for any $(s, a, s') \in \mathcal S \times \mathcal A \times \mathcal S$, $p(s'|s,a)$ is the probability of transitioning to $s'$ if action $a$ is taken in state $s$. $\rho \in \Delta(\mathcal S)$ is an initial state distribution.

A policy $\pi : \mathcal S \times [H] \to \mathcal A$ is defined as a mapping from the state space and step indices to the action space, so that $\pi_h(s)$ specifies what action to take in state $s$ at step $h$. For any policy $\pi$, $h \in [H]$ and $(s,a) \in \mathcal S \times \mathcal A$, let $V^\pi_h(s)$ be the state value function and $Q^\pi_h(s,a)$ be the state-action value function, which denote the cumulative expected reward obtained under policy $\pi$ until the end of an episode, starting from $s$ and $(s,a)$ at step $h$, respectively. Formally,
$$V^\pi_h(s) := \mathbb E\Big[\textstyle\sum_{t=h}^H r(s_t,a_t)\,\Big|\,s_h = s,\ \pi\Big], \qquad Q^\pi_h(s,a) := \mathbb E\Big[\textstyle\sum_{t=h}^H r(s_t,a_t)\,\Big|\,s_h = s,\ a_h = a,\ \pi\Big].$$
The optimal policy is defined as $\pi^* = \operatorname{argmax}_\pi V^\pi_h(s)$ for all $s \in \mathcal S$ and $h \in [H]$. For any $s \in \mathcal S$ and $h \in [H]$, denote $V^*_h(s) := V^{\pi^*}_h(s)$.

The process of RL with segment feedback is as follows. In each episode $k$, the agent chooses a policy $\pi^k$ at the beginning of the episode, and starts from $s^k_1 \sim \rho$. At each step $h \in [H]$, the agent first observes the current state $s^k_h$, takes an action $a^k_h = \pi^k_h(s^k_h)$ according to her policy, and then transitions to the next state $s^k_{h+1} \sim p(\cdot|s^k_h, a^k_h)$. Each episode is equally divided into $m$ segments, each of length $\frac Hm$; for convenience, assume that $H$ is divisible by $m$. For any $k > 0$ and $i \in [m]$, let $\tau^k = (s^k_1, a^k_1, \ldots, s^k_H, a^k_H)$ denote the trajectory in episode $k$, and let $\tau^k_i = \big(s^k_{\frac Hm(i-1)+1},\ a^k_{\frac Hm(i-1)+1},\ \ldots,\ s^k_{\frac Hm i},\ a^k_{\frac Hm i}\big)$ denote the $i$-th segment of this trajectory. For any trajectory or trajectory segment $\tau$, $\phi_\tau \in \mathbb R^{|\mathcal S||\mathcal A|}$ denotes the vector whose entry $\phi_\tau(s,a)$ is the number of times $(s,a)$ is visited in $\tau$. For any policy $\pi$, $\phi^\pi \in \mathbb R^{|\mathcal S||\mathcal A|}$ denotes the vector whose entry $\phi^\pi(s,a)$ is the expected number of times $(s,a)$ is visited in an episode under policy $\pi$, i.e., $\phi^\pi(s,a) := \mathbb E\big[\sum_{h=1}^H \mathbb 1\{s_h = s, a_h = a\}\,\big|\,\pi\big]$.

In our model, the agent observes reward feedback only at the end of each segment, instead of at each step as in classic RL. We consider the following two reward feedback settings.

**Binary Segment Feedback.** Denote the sigmoid function by $\mu(x) := \frac1{1+\exp(-x)}$ for any $x \in \mathbb R$. In the binary segment feedback setting, in each episode $k$, at the end of each segment $i \in [m]$, the agent observes a binary outcome
$$y^k_i = \begin{cases} 1, & \text{w.p. } \mu\big((\phi_{\tau^k_i})^\top\theta^*\big), \\ 0, & \text{w.p. } 1 - \mu\big((\phi_{\tau^k_i})^\top\theta^*\big). \end{cases}$$
Note that our formulation is different from that in prior work on binary feedback (Chatterji et al., 2021). Chatterji et al. (2021) aim to find the policy that maximizes the expected probability of generating feedback 1, i.e., $\max_\pi \mathbb E_{\tau\sim\pi,p}\big[\mu\big((\phi_\tau)^\top\theta^*\big)\big]$, where the optimal policy can be non-Markovian due to the non-linearity of $\mu(\cdot)$. In contrast, we aim to find the optimal policy under the standard MDP definition, i.e., $\max_\pi \mathbb E_{\tau\sim\pi,p}\big[(\phi_\tau)^\top\theta^*\big]$, by inferring the reward $\theta^*$ from binary feedback, and thus we consider Markovian policies. Both formulations have value and are applicable in different contexts; in particular, our formulation is better suited to situations where we want to solve an MDP but only get binary segment feedback. Under our formulation, we design TS-type algorithms with confidence bonuses added on $\theta$ element-wise to achieve computational efficiency, which cannot be done without sacrificing the regret order under the formulation of (Chatterji et al., 2021).

**Sum Segment Feedback.**
In the sum segment feedback setting, in each episode $k$, at each step $h$, the environment generates an underlying random reward $R^k_h = r(s^k_h, a^k_h) + \varepsilon^k_h$, where $\varepsilon^k_h$ is zero-mean, 1-sub-Gaussian noise, independent of the transitions. At the end of each segment $i \in [m]$, the agent observes the sum of the random rewards,
$$R^k_i := \sum_{t=\frac Hm(i-1)+1}^{\frac Hm i} R^k_t = (\phi_{\tau^k_i})^\top\theta^* + \sum_{t=\frac Hm(i-1)+1}^{\frac Hm i} \varepsilon^k_t.$$
Under sum feedback, when $m = H$, our model degenerates to classic RL (Azar et al., 2017; Sutton & Barto, 2018). When $m = 1$, the above two settings reduce to the problems of RL with binary trajectory feedback (Chatterji et al., 2021) and sum trajectory feedback (Efroni et al., 2021), respectively. In our model, the agent needs to infer the reward function from sparse and implicit reward feedback.

Let $K$ denote the number of episodes played. The goal of the agent is to minimize the cumulative regret, defined as
$$\sum_{k=1}^K \big(V^*_1(s^k_1) - V^{\pi^k}_1(s^k_1)\big).$$
We note that, to the best of our knowledge, the fact that one gets extremely coarse information about the sum reward in the binary feedback case makes it impossible to have a common analysis for both feedback models.

## 3. Reinforcement Learning with Binary Segment Feedback

In this section, we investigate RL with binary segment feedback. To isolate the effect of segment feedback from transition model learning, we first design a computationally-efficient and sample-efficient algorithm SegBiTS for the known transition case, and establish a novel lower bound exhibiting the indispensable exponential dependency in the result under binary feedback. Then, we further develop an algorithm SegBiTS-Tran with carefully-designed transition bonuses for the unknown transition case.

### 3.1. Algorithm SegBiTS for Known Transition

Building upon the Thompson sampling algorithm (Thompson, 1933), SegBiTS adopts the maximum likelihood estimator (MLE) to learn rewards from binary feedback, and performs posterior sampling to compute the optimal policy.
Different from prior trajectory feedback algorithms (Chatterji et al., 2021), which are either computationally inefficient or have an $O(K^{\frac23})$ regret bound, SegBiTS is both computationally efficient and enjoys an $\widetilde O(\sqrt K)$ regret bound.

**Algorithm 1: SegBiTS**

1. **Input:** $\delta$, $\delta' := \frac\delta3$, $\alpha := \exp(\frac{Hr_{\max}}{m}) + \exp(-\frac{Hr_{\max}}{m}) + 2$, $\lambda$.
2. **for** $k = 1, \ldots, K$ **do**
3. $\quad \hat\theta_{k-1} \leftarrow \operatorname{argmin}_\theta -\sum_{k'=1}^{k-1}\sum_{i=1}^m \Big(y^{k'}_i\log\big(\mu\big((\phi_{\tau^{k'}_i})^\top\theta\big)\big) + (1-y^{k'}_i)\log\big(1-\mu\big((\phi_{\tau^{k'}_i})^\top\theta\big)\big)\Big)$
4. $\quad \Sigma_{k-1} \leftarrow \sum_{k'=1}^{k-1}\sum_{i=1}^m \phi_{\tau^{k'}_i}(\phi_{\tau^{k'}_i})^\top + \alpha\lambda I$
5. $\quad$ Sample $\xi^k \sim \mathcal N\big(0,\ \alpha\,\nu(k-1)^2\,\Sigma^{-1}_{k-1}\big)$, where $\nu(k-1)$ is defined in Eq. (1)
6. $\quad \theta^k \leftarrow \hat\theta_{k-1} + \xi^k$
7. $\quad \pi^k \leftarrow \operatorname{argmax}_\pi (\phi^\pi)^\top\theta^k$
8. $\quad$ Play episode $k$ with policy $\pi^k$. Observe trajectory $\tau^k$ and binary segment feedback $\{y^k_i\}_{i=1}^m$
9. **end for**

Algorithm 1 presents the procedure of SegBiTS. Specifically, in each episode $k$, SegBiTS first employs the MLE with past binary reward observations to obtain the estimated reward parameter $\hat\theta_{k-1}$ (Line 3). Then, SegBiTS calculates the feature covariance matrix of past segments $\Sigma_{k-1}$ (Line 4). After that, SegBiTS samples a noise $\xi^k$ from the Gaussian distribution $\mathcal N(0, \alpha\,\nu(k-1)^2\,\Sigma^{-1}_{k-1})$ (Line 5). Here $\alpha$ is a universal upper bound on the inverse of the sigmoid function's derivative, and for any $k > 0$, $\nu(k)$ (given explicitly in Eq. (1)) is the confidence radius factor of the MLE estimate $\hat\theta_k$: with high probability, $|\phi^\top\theta^* - \phi^\top\hat\theta_k| \le \sqrt\alpha\,\nu(k)\,\|\phi\|_{\Sigma^{-1}_k}$, where $\phi$ is the visitation indicator of any trajectory (Lemma C.7 in Appendix C.1). Adding the noise $\xi^k$ to $\hat\theta_{k-1}$, SegBiTS obtains a posterior reward estimate $\theta^k$ (Line 6). Then, it computes the optimal policy $\pi^k$ under reward $\theta^k$, i.e., $\operatorname{argmax}_\pi (\phi^\pi)^\top\theta^k$ (Line 7). Note that this step is computationally efficient: it can be easily solved by any MDP planning algorithm, e.g., value iteration, taking $\theta^k$ as the reward function. After obtaining $\pi^k$, SegBiTS plays episode $k$, and observes the trajectory $\tau^k$ and the binary feedback $\{y^k_i\}_{i=1}^m$ on each segment (Line 8).
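To illustrate the mechanics of Lines 3–7, the sketch below runs one round of the MLE-plus-posterior-sampling update on synthetic data, with several simplifications of our own: a generic $d$-dimensional feature space stands in for $\mathbb R^{|\mathcal S||\mathcal A|}$, plain gradient descent serves as the MLE solver (with a small ridge term added purely for numerical stability in this toy sketch), the constants $\alpha$, $\lambda$, $\nu$ are arbitrary toy values rather than the paper's, and MDP planning is trivialized to an argmax over two fixed candidate visitation vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 2                            # stands in for |S||A|
lam, alpha, nu = 1.0, 4.0, 1.0   # toy constants (not the paper's values)

# synthetic past segment features phi_{tau_i} and binary labels y_i
Phi = rng.integers(0, 3, size=(20, d)).astype(float)
true_theta = np.array([0.8, -0.4])
y = rng.binomial(1, sigmoid(Phi @ true_theta)).astype(float)

# Line 3: logistic MLE by gradient descent on the negative log-likelihood
theta_hat = np.zeros(d)
for _ in range(500):
    grad = Phi.T @ (sigmoid(Phi @ theta_hat) - y) + lam * theta_hat
    theta_hat -= 0.01 * grad

# Line 4: feature covariance matrix of past segments
Sigma = Phi.T @ Phi + alpha * lam * np.eye(d)

# Lines 5-6: posterior sampling around the MLE estimate
xi = rng.multivariate_normal(np.zeros(d), alpha * nu**2 * np.linalg.inv(Sigma))
theta_k = theta_hat + xi

# Line 7 (trivialized planning): pick the candidate "policy" whose
# visitation vector has the larger estimated reward
candidates = np.array([[3.0, 1.0], [1.0, 3.0]])   # toy visitation vectors
pi_k = int(np.argmax(candidates @ theta_k))
```

In the actual algorithm, Line 7 is an MDP planning step (e.g., value iteration with $\theta^k$ as the reward function) over all Markovian policies, rather than an argmax over a fixed candidate set.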
Now we provide a regret upper bound for SegBiTS.

**Theorem 3.1.** *With probability at least $1-\delta$, for any $K > 0$, the regret of algorithm SegBiTS is bounded by*
$$\widetilde O\left(\exp\left(\frac{Hr_{\max}}{2m}\right)|\mathcal S|^{\frac52}|\mathcal A|^{\frac52}H^2\sqrt{Km|\mathcal S||\mathcal A|}\right).$$

In this result, the dependencies on $|\mathcal S|$, $|\mathcal A|$ and $H$ are $|\mathcal S|^3$, $|\mathcal A|^3$ and $\exp(\frac{Hr_{\max}}{2m})H^2$, respectively. Our focus here is to reveal the exponential dependency on $\frac{Hr_{\max}}m$ in the regret bound under binary feedback, rather than to pursue absolute tightness of every polynomial factor. Since the exponential factor is usually the dominating factor, this result implies that as the number of segments $m$ increases, the regret decays rapidly. Thus, under binary feedback, increasing the number of segments significantly helps accelerate learning.

The intuition behind this exponential dependency is as follows. When the reward scale $x = \frac{Hr_{\max}}m$ is large, the binary feedback is generated in the range where the sigmoid function $\mu(x) = \frac1{1+\exp(-x)}$ is flat, i.e., where its derivative $\mu'(x)$ is small. The generated binary feedback is then likely to be always 0 or always 1, and it is hard to distinguish a good action from a bad one, leading to higher regret. On the contrary, when the reward scale $x = \frac{Hr_{\max}}m$ is small, the binary feedback is generated in the range where $\mu(x)$ is steep, i.e., $\mu'(x)$ is large. The generated binary feedback is then more dispersed between 0 and 1, and it is easier to distinguish a good action from a bad one, leading to lower regret. In other words, the regret bound depends on the inverse of the sigmoid function's derivative $\mu'(x) = \frac1{\exp(x)+\exp(-x)+2}$.

### 3.2. Regret Lower Bound for Known Transition

Below we provide a lower bound, the first to demonstrate the inevitability of the exponential factor in the regret bound for RL with binary feedback.

**Theorem 3.2.** *Consider RL with binary segment feedback and known transition.*
*There exists a distribution of instances where, for any $c_0 \in (0, \frac12)$, when $K \ge \exp(\frac{Hr_{\max}}{m})\frac{4|\mathcal S||\mathcal A|m}{H^2 r_{\max}^2 c_0^2}$, the regret of any algorithm must be*
$$\Omega\left(\exp\left(\frac{(1-c_0)Hr_{\max}}{2m}\right)\sqrt{|\mathcal S||\mathcal A|mK}\right).$$

Theorem 3.2 shows that under binary feedback, the exponential dependency on $\frac{Hr_{\max}}m$ in the result is indispensable, and the $\exp(\frac{Hr_{\max}}{2m})$ factor in Theorem 3.1 nearly matches the exponential factor in the lower bound, up to an arbitrarily small factor $c_0$ inside $\exp(\cdot)$. Theorem 3.2 reveals that when the number of segments $m$ increases, the regret indeed decreases at an exponential rate. In addition, this lower bound also holds for the unknown transition case, by constructing the same problem instance as in its proof.

To the best of our knowledge, our lower bound for binary feedback and its analysis are novel in the RL literature. In the analysis, we calculate the KL divergence of Bernoulli distributions whose parameters involve the sigmoid function. Then, we employ Pinsker's inequality and the fact that $\mu'(x) = \mu(x)(1-\mu(x))$ to build a connection between the calculated KL divergence and $\mu'(\frac{Hr_{\max}}m)$. Since $\mu'(x) = \frac1{\exp(x)+\exp(-x)+2}$ contains an exponential factor, we can finally derive an exponential dependency in the lower bound. Below we give a proof sketch of Theorem 3.2, and defer the full proof to Appendix C.2.

*Proof Sketch.* Consider the following instance: there are $n$ bandit states $s_1, \ldots, s_n$ (i.e., states with an optimal action and multiple suboptimal actions), a good absorbing state $s_{n+1}$ and a bad absorbing state $s_{n+2}$. The agent starts from $s_1, \ldots, s_n$ with equal probability $\frac1n$. For any $i \in [n]$, in state $s_i$, under the optimal action $a^*_i$, the agent transitions to $s_{n+1}$ deterministically, and $r(s_i, a^*_i) = r_{\max}$;

*Figure 1. Lower bound instance: $a^*_i$ transitions to $s_{n+1}$ with $r(s_i, a^*_i) = r_{\max}$ and $r(s_{n+1}, \cdot) = r_{\max}$; $a^{\mathrm{sub}}_i$ transitions to $s_{n+2}$ with $r(s_i, a^{\mathrm{sub}}_i) = (1-\varepsilon)r_{\max}$ and $r(s_{n+2}, \cdot) = (1-\varepsilon)r_{\max}$.*
Under any suboptimal action $a^{\mathrm{sub}}_i$, the agent transitions to $s_{n+2}$ deterministically, and $r(s_i, a^{\mathrm{sub}}_i) = (1-\varepsilon)r_{\max}$, where $\varepsilon \in (0, \frac12)$ is a parameter specified later. For all actions $a \in \mathcal A$, $r(s_{n+1}, a) = r_{\max}$ and $r(s_{n+2}, a) = (1-\varepsilon)r_{\max}$.

Then, the KL divergence of the binary observations in an episode between the optimal action and a suboptimal action is
$$\sum_{i=1}^m \mathrm{KL}\left(\mathcal B\Big(\mu\Big(\tfrac{(1-\varepsilon)Hr_{\max}}{m}\Big)\Big)\,\Big\|\,\mathcal B\Big(\mu\Big(\tfrac{Hr_{\max}}{m}\Big)\Big)\right) \overset{(a)}{\le} m\cdot\frac{\Big(\mu\big(\tfrac{Hr_{\max}}{m}\big)-\mu\big(\tfrac{(1-\varepsilon)Hr_{\max}}{m}\big)\Big)^2}{\mu'\big(\tfrac{Hr_{\max}}{m}\big)} \overset{(b)}{\le} m\cdot\frac{\mu'\big(\tfrac{(1-\varepsilon)Hr_{\max}}{m}\big)^2}{\mu'\big(\tfrac{Hr_{\max}}{m}\big)}\Big(\frac{\varepsilon Hr_{\max}}{m}\Big)^2. \quad (3)$$
Here $\mathcal B(p)$ denotes the Bernoulli distribution with parameter $p$. Inequality (a) uses the facts that $\mathrm{KL}(\mathcal B(p)\,\|\,\mathcal B(q)) \le \frac{(p-q)^2}{q(1-q)}$ and $\mu'(x) = \mu(x)(1-\mu(x))$; inequality (b) holds because $\mu'(x)$ is monotonically decreasing for $x > 0$.

Furthermore, we account for the reward scale $Hr_{\max}$ in each episode, and enumerate over each bandit state $s_i$ ($i \in [n]$) and each possible optimal action $a^*_i \in \mathcal A$ in the lower bound derivation. Then, following the analysis in (Auer et al., 2002), to learn the difference between the optimal action and the suboptimal actions, the agent must suffer a regret of order
$$\Omega\left(\varepsilon Hr_{\max}\sqrt{\frac{n|\mathcal A|\,K}{\text{Eq. (3)}}}\right) = \Omega\left(\sqrt{n|\mathcal A|\,m\,K\cdot\frac{\mu'\big(\tfrac{Hr_{\max}}{m}\big)}{\mu'\big(\tfrac{(1-\varepsilon)Hr_{\max}}{m}\big)^2}}\right).$$
Recall that $\mu'(x) = \frac1{\exp(x)+\exp(-x)+2}$, so that $\mu'\big(\tfrac{Hr_{\max}}{m}\big)/\mu'\big(\tfrac{(1-\varepsilon)Hr_{\max}}{m}\big)^2 = \Omega\big(\exp\big(\tfrac{(1-2\varepsilon)Hr_{\max}}{m}\big)\big)$. Let $\varepsilon = \Theta(\frac1{\sqrt K})$. For any constant $c_0 \in (0, \frac12)$, letting $K$ be large enough ($\varepsilon$ small enough) to satisfy $2\varepsilon \le c_0$, the regret is
$$\Omega\left(\exp\Big(\frac{(1-2\varepsilon)Hr_{\max}}{2m}\Big)\sqrt{K|\mathcal S||\mathcal A|m}\right) = \Omega\left(\exp\Big(\frac{(1-c_0)Hr_{\max}}{2m}\Big)\sqrt{K|\mathcal S||\mathcal A|m}\right). \qquad\square$$

### 3.3. Algorithm SegBiTS-Tran for Unknown Transition

Now we extend our results to the unknown transition case. We develop an efficient algorithm SegBiTS-Tran for binary segment feedback and unknown transition. SegBiTS-Tran incorporates a transition bonus into the posterior reward estimate $\theta^{b,k}$, and replaces the visitation indicator $\phi^\pi$ by its estimate $\hat\phi^\pi_{k-1}$. For any $(s, a)$, $\hat\phi^\pi_{k-1}(s, a)$ is the expected number of times $(s, a)$ is visited in an episode under policy $\pi$ on the empirical MDP $\hat p_{k-1}$, where $\hat p_{k-1}$ is the empirical estimate of the transition distribution $p$.
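The quantities $\hat p_{k-1}$ and $\hat\phi^\pi_{k-1}$ can be computed exactly by normalizing visit counts and running a forward pass over the empirical MDP, as the sketch below illustrates (the toy sizes, random counts, and deterministic policy are our own assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, H = 3, 2, 5   # toy sizes, our own assumption

# empirical transition estimate p_hat(s'|s,a) from visit counts N(s,a,s')
counts = rng.integers(1, 10, size=(S, A, S)).astype(float)
p_hat = counts / counts.sum(axis=2, keepdims=True)

rho = np.full(S, 1.0 / S)                  # initial state distribution
policy = rng.integers(0, A, size=(H, S))   # deterministic policy pi_h(s)

# hat-phi^pi(s,a): expected visitation counts of (s,a) under pi on the
# empirical MDP, obtained by propagating the state distribution forward
phi_hat = np.zeros((S, A))
d = rho.copy()
for h in range(H):
    d_next = np.zeros(S)
    for s in range(S):
        a = policy[h, s]
        phi_hat[s, a] += d[s]             # accumulate expected visits at step h
        d_next += d[s] * p_hat[s, a]      # push probability mass forward
    d = d_next
```

Since one state-action pair is visited per step, the entries of $\hat\phi^\pi$ always sum to $H$; evaluating a candidate policy under a reward vector $\theta$ then amounts to the inner product $(\hat\phi^\pi)^\top\theta$.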
Then, SegBiTS-Tran computes the optimal policy via $\operatorname{argmax}_\pi (\hat\phi^\pi_{k-1})^\top\theta^{b,k}$, which can be efficiently solved by any MDP planning algorithm with transition distribution $\hat p_{k-1}$ and reward $\theta^{b,k}$. We defer the details of SegBiTS-Tran to Appendix C.3, and present its regret performance as follows.

**Theorem 3.3.** *With probability at least $1-\delta$, for any $K > 0$, the regret of algorithm SegBiTS-Tran is bounded by*
$$\widetilde O\left(\exp\left(\frac{Hr_{\max}}{2m}\right)|\mathcal S|^{\frac52}|\mathcal A|^{\frac52}H^2\sqrt{Km|\mathcal S||\mathcal A|} + |\mathcal S|^2|\mathcal A|^{\frac32}H^{\frac32}\sqrt K\right).$$

Similar to algorithm SegBiTS (Theorem 3.1), the regret bound of algorithm SegBiTS-Tran also contains the factor $\exp(\frac{Hr_{\max}}{2m})$: when the number of segments $m$ increases, the regret of SegBiTS-Tran significantly decreases. Compared to SegBiTS, the regret of SegBiTS-Tran has an additional polynomial term in $|\mathcal S|$, $|\mathcal A|$, $H$ and $K$, incurred by learning the unknown transition distribution.

## 4. Reinforcement Learning with Sum Segment Feedback

In this section, we turn to RL with sum segment feedback. Different from the prior sum trajectory feedback algorithm (Efroni et al., 2021), which directly uses the least squares estimator and has a suboptimal regret bound, we develop an algorithm E-LinUCB for the known transition case, which adopts experimental design to perform an initial exploration and achieves a near-optimal regret with respect to $H$ and $m$. To validate the optimality, we further establish a regret lower bound. Moreover, we design an algorithm LinUCB-Tran equipped with a variance-aware transition bonus to handle the unknown transition case.

### 4.1. Algorithm E-LinUCB for Known Transition

If we regard the visitation indicators $\phi_{\tau^k_i}$ as feature vectors and $\theta^*$ as the reward parameter, RL with sum segment feedback and known transition is similar to a linear bandit problem. Building upon the classic linear bandit algorithm LinUCB (Abbasi-Yadkori et al., 2011), our algorithm E-LinUCB performs the E-optimal design (Pukelsheim, 2006) to conduct an initial exploration.
This scheme ensures sufficient coverage of the covariance matrix and sharpens the norm under the inverse of the covariance matrix, which enables an improved regret bound over the prior trajectory feedback algorithm (Efroni et al., 2021).

**Algorithm 2: E-LinUCB**

1. **Input:** $\delta$, $\delta' := \frac\delta3$, $\lambda := \frac H{r_{\max}^2 m}$, rounding procedure ROUND, rounding approximation parameter $\gamma := \frac1{10}$; for $k > 0$, $\beta(k) := \sqrt{m\log\big(1+\frac{kH^2}{\lambda|\mathcal S||\mathcal A|m}\big) + 2\log\frac1{\delta'}} + r_{\max}\sqrt{\lambda|\mathcal S||\mathcal A|}$.
2. Let $w^*$ over $\Pi$ and $z^*$ be the optimal solution and optimal value of the optimization
$$\min_{w}\ \left\|\left(\sum_{\pi\in\Pi} w(\pi)\sum_{i=1}^m \mathbb E_{\tau_i\sim\pi}\big[\phi_{\tau_i}\phi_{\tau_i}^\top\big]\right)^{-1}\right\| \qquad (4)$$
3. $K_0 \leftarrow \max\big\{2^6(1+\gamma)^2(z^*)^2H^4\log\big(\frac{2|\mathcal S||\mathcal A|}{\delta'}\big),\ |\mathcal S||\mathcal A|\big\}$
4. $(\pi^1, \ldots, \pi^{K_0}) \leftarrow \mathrm{ROUND}\big(\{\sum_{i=1}^m \mathbb E_{\tau_i\sim\pi}[\phi_{\tau_i}\phi_{\tau_i}^\top]\}_{\pi\in\Pi},\ w^*,\ \gamma,\ K_0\big)$
5. Play $K_0$ episodes with policies $\pi^1, \ldots, \pi^{K_0}$. Observe trajectories $\tau^1, \ldots, \tau^{K_0}$ and rewards $\{R^1_i\}_{i=1}^m, \ldots, \{R^{K_0}_i\}_{i=1}^m$
6. **for** $k = K_0+1, \ldots, K$ **do**
7. $\quad \hat\theta_{k-1} \leftarrow \big(\lambda I + \sum_{k'=1}^{k-1}\sum_{i=1}^m \phi_{\tau^{k'}_i}(\phi_{\tau^{k'}_i})^\top\big)^{-1}\sum_{k'=1}^{k-1}\sum_{i=1}^m \phi_{\tau^{k'}_i}R^{k'}_i$
8. $\quad \Sigma_{k-1} \leftarrow \lambda I + \sum_{k'=1}^{k-1}\sum_{i=1}^m \phi_{\tau^{k'}_i}(\phi_{\tau^{k'}_i})^\top$
9. $\quad \pi^k \leftarrow \operatorname{argmax}_{\pi\in\Pi}\big((\phi^\pi)^\top\hat\theta_{k-1} + \beta(k-1)\|\phi^\pi\|_{\Sigma_{k-1}^{-1}}\big)$
10. $\quad$ Play episode $k$ with policy $\pi^k$. Observe trajectory $\tau^k$ and sum segment feedback $\{R^k_i\}_{i=1}^m$
11. **end for**

Algorithm 2 shows the procedure of E-LinUCB. Specifically, E-LinUCB first performs the E-optimal design to compute a distribution $w^*$ over policies which maximizes the minimum eigenvalue of the feature covariance matrix $\sum_{\pi\in\Pi} w(\pi)\big(\sum_{i=1}^m \mathbb E_{\tau_i\sim\pi}[\phi_{\tau_i}\phi_{\tau_i}^\top]\big)$ (Line 2); we assume that there exists a policy distribution $w$ under which this matrix is invertible. Then, E-LinUCB calculates the number of samples $K_0$ for initial exploration according to the optimal value $z^*$ of the E-optimal design (Line 3). In Line 4, E-LinUCB calls a rounding procedure ROUND (Allen-Zhu et al., 2021) to transform the sampling distribution $w^*$ into a discrete sampling sequence $(\pi^1, \ldots, \pi^{K_0})$ which satisfies (see Appendix B for more details of ROUND)
$$\left\|\left(\sum_{k=1}^{K_0}\sum_{i=1}^m \mathbb E_{\tau_i\sim\pi^k}\big[\phi_{\tau_i}\phi_{\tau_i}^\top\big]\right)^{-1}\right\| \le (1+\gamma)\left\|\left(K_0\sum_{\pi\in\Pi} w^*(\pi)\sum_{i=1}^m \mathbb E_{\tau_i\sim\pi}\big[\phi_{\tau_i}\phi_{\tau_i}^\top\big]\right)^{-1}\right\|.$$
After that, E-LinUCB plays the $K_0$ episodes with $(\pi^1, \ldots, \pi^{K_0})$ to perform the initial exploration (Line 5). Owing to the E-optimal design, the covariance matrix of the initial exploration $\Sigma_{K_0}$ has an optimized minimum eigenvalue, and then $\|\phi^\pi\|_{\Sigma_{k-1}^{-1}}$ has a sharp upper bound for any $k > K_0$. This is the key to the optimality of E-LinUCB. In each episode $k > K_0$, E-LinUCB first calculates the least squares reward estimate $\hat\theta_{k-1}$ using past reward observations and the covariance matrix $\Sigma_{k-1}$ (Lines 7-8). Then, it computes the optimal policy with the reward estimate $\hat\theta_{k-1}$ and the reward confidence bonus $\beta(k-1)\|\phi^\pi\|_{\Sigma_{k-1}^{-1}}$ (Line 9). E-LinUCB plays episode $k$ with the computed policy $\pi^k$, and collects the trajectory $\tau^k$ and the reward observations $\{R^k_i\}_{i=1}^m$ on each segment (Line 10).

Below we present a regret upper bound for algorithm E-LinUCB.

**Theorem 4.1.** *With probability at least $1-\delta$, for any $K > 0$, the regret of algorithm E-LinUCB is bounded by*
$$O\left(\sqrt{HK\log\left(1+KHr_{\max}\right)} + (z^*)^2H^5\log\frac{|\mathcal S||\mathcal A|}{\delta}\right).$$

Surprisingly, under sum feedback, when the number of segments $m$ increases, the regret bound does not decrease significantly, e.g., at a rate of $\frac1m$ or $\frac1{\sqrt m}$. While this looks surprising at first glance, the analysis reveals an intuition: performance in RL is measured by the expected reward sum of an episode, so we only need to estimate the expected reward sum of an episode accurately. When the number of segments $m$ increases, we obtain more observations, but the segment features $\phi_{\tau^k_i}$ contributing to the covariance matrix $\Sigma_k$ shrink, which inflates the reward estimation uncertainty $\|\phi^\pi\|_{\Sigma_k^{-1}}$. When we focus on the estimation of the expected reward sum of an episode, these two effects cancel each other out, and the regret result is not distinctly influenced by $m$.
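The cancellation described above can be checked with a small simulation (entirely our own toy setup, not the paper's experiment: a fixed deterministic policy earning reward $r$ per step, unit-variance Gaussian step noise, and a simple averaging estimator). Each episode contributes $H$ steps' worth of noise to the feedback no matter how it is cut into segments, so the accuracy of the episode-level reward-sum estimate is essentially independent of $m$:

```python
import numpy as np

rng = np.random.default_rng(2)
H, K, r = 100, 2000, 0.3   # horizon, episodes, per-step reward (toy values)

def episode_sum_error(m):
    """Error of estimating the episode reward sum H*r from m segment sums."""
    seg_len = H // m
    # each segment sum = seg_len * r + (sum of seg_len unit-variance noises)
    obs = seg_len * r + rng.normal(0.0, 1.0, size=(K, m, seg_len)).sum(axis=2)
    # estimate the episode reward sum by averaging per-episode totals
    return abs(obs.sum(axis=1).mean() - H * r)

errors = {m: episode_sum_error(m) for m in (1, 10, 100)}
# splitting into more segments yields more observations, but each carries
# proportionally less signal; every error sits at the O(sqrt(H / K)) scale
```

Here $\sqrt{H/K} \approx 0.22$; more segments do not systematically sharpen the episode-level estimate, mirroring the $m$-independence of the bounds in this section.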
When $m = 1$, our problem reduces to RL with sum trajectory feedback (Efroni et al., 2021), and our result improves theirs by a factor of $H$ and achieves optimality with respect to $H$. This improvement comes from conducting the E-optimal design and the initial exploration to guarantee that $\|\phi^\pi\|_{\Sigma_{k-1}^{-1}} \lesssim 1$, instead of the bound $\|\phi^\pi\|_{\Sigma_{k-1}^{-1}} \le \frac H{\sqrt\lambda}$ used in (Efroni et al., 2021). Next, we study the lower bound to see whether the number of segments $m$ really does not influence the regret bound much.

### 4.2. Regret Lower Bound for Known Transition

We establish a lower bound for RL with sum segment feedback and known transition as follows.

**Theorem 4.2.** *Consider RL with sum segment feedback and known transition. There exists a distribution of instances where the regret of any algorithm must be* $\Omega(\sqrt{HK})$.

Theorem 4.2 demonstrates that our regret upper bound for algorithm E-LinUCB (Theorem 4.1) is optimal with respect to $H$ and $m$ when ignoring logarithmic factors. In addition, this lower bound corroborates that the number of segments $m$ does not impact the regret result in essence.

*Figure 2. Experimental results for RL with binary or sum segment feedback: (a) binary segment feedback (cumulative regret of SegBiTS and SegBiTS-Tran vs. the number of segments $m$); (b) sum segment feedback (cumulative regret of E-LinUCB and LinUCB-Tran vs. the number of segments $m$ and vs. the segment length $\frac Hm$).*

### 4.3. Algorithm LinUCB-Tran for Unknown Transition

Now we investigate RL with sum segment feedback in the unknown transition case. For unknown transitions, we design an algorithm LinUCB-Tran, which establishes a variance-aware uncertainty bound for the estimated visitation indicator $\hat\phi^\pi_k$, and incorporates this uncertainty bound into the exploration bonuses.
In the analysis, we handle the estimation error of the visitation indicators, ‖ϕ̂_k^π − ϕ^π‖₁, via this variance-aware uncertainty bound, which enables us to achieve a near-optimal regret in terms of H. The details of Lin UCB-Tran are deferred to Appendix D.3, and we state the regret performance of algorithm Lin UCB-Tran below.

Theorem 4.3. With probability at least 1 − δ, for any K > 0, the regret of algorithm Lin UCB-Tran is bounded by Õ((1 + rmax)|S|^{5/2}|A|² H

Theorem 4.3 shows that, similar to algorithm E-Lin UCB, the regret of Lin UCB-Tran does not depend on the number of segments m when ignoring logarithmic factors. The heavier dependence on |S|, |A| and H is due to the estimation of the unknown transition distribution. We also provide a lower bound for the unknown transition case, which demonstrates that the optimal regret indeed does not depend on m and that our upper bound is near-optimal with respect to H (see Appendix D.5).

5. Experiments

Below we present experiments for RL with segment feedback to validate our theoretical results. For the binary feedback setting, we evaluate our algorithms Seg Bi TS and Seg Bi TS-Tran in the known and unknown transition cases, respectively, and set |S| = 9, |A| = 5 and K = 30000. For the sum feedback setting, we similarly run our algorithms E-Lin UCB and Lin UCB-Tran in the known and unknown transition cases, respectively. Since E-Lin UCB and Lin UCB-Tran are computationally inefficient (they are mainly designed to reveal the optimal dependency on m), we use a small MDP with |S| = 3 and |A| = 5, and set K = 1000. The details of the instances used in our experiments are described in Appendix A. In both settings, we set rmax = 0.5, δ = 0.005, H = 100 and m ∈ {1, 2, 4, 5, 10, 20, 25, 50, 100}. For each algorithm, we perform 20 independent runs, and plot the average cumulative regret up to episode K across runs with a 95% confidence interval. Figure 2(a) reports the regrets of algorithms Seg Bi TS and Seg Bi TS-Tran under binary feedback.
One sees that as the number of segments m increases, the regret decreases rapidly. Specifically, when m decreases from 20 to 1, i.e., exp(H/(2m)) increases from exp(2.5) to exp(50), the regret grows explosively. This matches our theoretical results, i.e., Theorems 3.1 and 3.3, which show a dependency on exp(Hrmax/(2m)). Figure 2(b) plots the regrets of algorithms E-Lin UCB and Lin UCB-Tran under sum feedback. To see the impact of segments on the regret clearly, we show the regret with respect to the number of segments m and with respect to the length of each segment H/m in the left and right subfigures, respectively. In the left subfigure, as m increases, the regret remains almost the same for small m and slightly decreases for large m. To see the dependency on m more clearly, we turn to the right subfigure: as the length of each segment H/m increases, the regret slightly increases in a logarithmic trend. This also matches our theoretical bounds in Theorems 4.1 and 4.3, which do not depend on m except through logarithmic factors.

6. Conclusion

In this work, we formulate a model named RL with segment feedback, which offers a general paradigm for feedback, bridging the gap between per-state-action feedback in classic RL and trajectory feedback. In the binary feedback setting, we design efficient algorithms Seg Bi TS and Seg Bi TS-Tran, and provide regret upper and lower bounds which show a dependency on exp(Hrmax/(2m)). These results reveal that under binary feedback, increasing the number of segments m greatly helps expedite learning. In the sum feedback setting, we develop algorithms E-Lin UCB and Lin UCB-Tran that are near-optimal in terms of H and m, whose regret bounds do not depend on m when ignoring logarithmic factors. These results show that under sum feedback, increasing m does not help accelerate learning much. There are several interesting directions worth further investigation.
One direction is to consider segments of unequal lengths and study how to divide segments to optimize learning. Variable segment lengths affect the noise variance of the reward feedback and the analysis of sums of segment visitation indicators, and more advanced techniques are needed to handle these challenges. Another direction is to generalize the results to the function approximation setting. Since our analysis is based on the fact that segment reward feedback is generated linearly with respect to the visitation indicators ϕ(s, a), we believe that generalizing ϕ(s, a) from a visitation indicator in the tabular setting to a feature vector in the linear function approximation setting is viable.

Acknowledgements

The work of Yihan Du and R. Srikant is supported by NSF grants CNS 23-12714, CCF 22-07547, CNS 21-06801 and AFOSR grant FA9550-24-1-0002.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24, 2011.

Allen-Zhu, Z., Li, Y., Singh, A., and Wang, Y. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 186:439–478, 2021.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.

Borjesson, P. and Sundberg, C.-E. Simple approximations of the error function Q(x) for communications applications. IEEE Transactions on Communications, 27(3):639–643, 1979.

Chatterji, N., Pacchiano, A., Bartlett, P., and Jordan, M.
On the theory of reinforcement learning with once-per-episode feedback. In Advances in Neural Information Processing Systems, volume 34, pp. 3401–3412, 2021.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, volume 30, 2017.

Efroni, Y., Merlis, N., and Mannor, S. Reinforcement learning with trajectory feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7288–7295, 2021.

Faury, L., Abeille, M., Calauzènes, C., and Fercoq, O. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pp. 3052–3060. PMLR, 2020.

Filippi, S., Cappé, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, volume 23, 2010.

Gao, D., Lai, H.-Y., Klasnja, P., and Murphy, S. Harnessing causality in reinforcement learning with bagged decision times. In International Conference on Artificial Intelligence and Statistics, 2025.

Gheshlaghi Azar, M., Munos, R., and Kappen, H. J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91:325–349, 2013.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, volume 31, 2018.

Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334. Springer, 2012.

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1302–1338, 2000.

Ménard, P., Domingues, O.
D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599–7608. PMLR, 2021.

Munos, R. and Moore, A. Influence and variance of a Markov chain: Application to adaptive discretization in optimal control. In Proceedings of the IEEE Conference on Decision and Control, volume 2, pp. 1464–1469. IEEE, 1999.

Pukelsheim, F. Optimal Design of Experiments. SIAM, 2006.

Russac, Y., Faury, L., Cappé, O., and Garivier, A. Self-concordant analysis of generalized linear bandits with forgetting. In International Conference on Artificial Intelligence and Statistics, pp. 658–666. PMLR, 2021.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tang, Y., Cai, X.-Q., Ding, Y.-X., Wu, Q., Liu, G., and Sugiyama, M. Reinforcement learning from bagged reward. In ICML 2024 Workshop: Aligning Reinforcement Learning Experimentalists and Theorists, 2024.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Tropp, J. A. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304–7312. PMLR, 2019.

A. Details of the Experimental Setup

In this section, we detail the instances used in our experiments. For the binary segment feedback setting, we consider an MDP as in Figure 3: there are 9 states and 5 actions. For any a ∈ A, we have r(s0, a) = 0, r(si, a) = rmax for any i ∈ {1, 3, 5, 7} (called good states), and r(si, a) = −rmax for any i ∈ {2, 4, 6, 8} (called bad states).
There is an optimal action a* and four suboptimal actions a_sub in every state. The agent starts from an initial state s0. For any 0 ≤ i ≤ 6, in state si, under the optimal action a*, the agent transitions to the good state and the bad state at the next horizon with probabilities 0.9 and 0.1, respectively; under a suboptimal action a_sub, the agent transitions to the good state and the bad state at the next horizon with probabilities 0.1 and 0.9, respectively. In s7 or s8, under the optimal action a*, the agent transitions to s1 and s2 with probabilities 0.9 and 0.1, respectively; under a suboptimal action a_sub, the agent transitions to s1 and s2 with probabilities 0.1 and 0.9, respectively.

Figure 3. Instance used in the experiment for RL with binary segment feedback.

For the sum segment feedback setting, since algorithms E-Lin UCB and Lin UCB-Tran are computationally inefficient (they are mainly designed for revealing the optimal dependency on H and m), we consider a smaller MDP as in Figure 4: there are 3 states and 5 actions. For any a ∈ A, we have r(s0, a) = 0, r(s1, a) = rmax (called a good state), and r(s2, a) = −rmax (called a bad state). There is an optimal action a* and four suboptimal actions a_sub in every state. The agent starts from an initial state s0. In any state s ∈ S, under the optimal action a*, the agent transitions to s1 and s2 with probabilities 0.9 and 0.1, respectively; under a suboptimal action a_sub, the agent transitions to s1 and s2 with probabilities 0.1 and 0.9, respectively.

Figure 4. Instance used in the experiment for RL with sum segment feedback.

B.
Rounding Procedure ROUND Algorithm E-Lin UCB calls a rounding procedure ROUND (Allen-Zhu et al., 2021) in the experimental design literature. Taking X1, . . . , Xn Sd +, distribution w {X1,...,Xn}, rounding approximation error γ > 0 and the number of samples T d γ2 as inputs, ROUND rounds sampling distribution w into a discrete sampling sequence (Y1, . . . , YT ) {X1, . . . , Xn}T that satisfies ! 1 (1 + γ) i [n] w(Xi)Xi In implementation, we can regard xx in (Allen-Zhu et al., 2021) as Pm i=1 Eτi π[ϕτi(ϕτi) ], and regard sampling weight on x as the sampling weight on π in our work. C. Proofs for RL with Binary Segment Feedback In this section, we present the proofs for RL with binary segment feedback. C.1. Proof for the Regret Upper Bound with Known Transition First, we prove the regret upper bound (Theorem 3.1) of algorithm Seg Bi TS for known transition. For any k > 0 and θ Θ, define i=1 εk ,i ϕτ k i , i=1 µ((ϕτ k i ) θ) ϕτ k i + λθ, (5) i=1 µ ((ϕτ k i ) θ) ϕτ k i (ϕτ k i ) + λI. (6) Lemma C.1. For any k > 0 and θ Θ, we have det(Λk(θ)) H2µ maxk |S||A|m + λ |S||A| . Proof. For any k > 0, we have det(Λk(θ)) tr(Λk(θ)) 2 + λ|S||A| = H2µ maxk |S||A|m + λ |S||A| . For any k > 0, let Fk denote the filtration that includes all events up to the end of episode k, and Fk denote the filtration that includes all events before playing πk in episode k. Then, πk is Fk-measurable. For any k > 0 and i [m], let εk,i := yk i (ϕτ k i ) θ denote the noise of binary feedback, and v2 k,i := E[ε2 k,i| Fk] = (ϕτ k i ) θ (1 (ϕτ k i ) θ ) = µ ((ϕτ k i ) θ ) denote the variance of εk,i conditioning on Fk. Reinforcement Learning with Segment Feedback Then, we have i=1 µ ((ϕτ k i ) θ ) ϕτ k i (ϕτ k i ) + λI i=1 v2 k ,i ϕτ k i (ϕτ k i ) + λI. Lemma C.2 (Concentration of Noises under Binary Feedback). With probability at least 1 δ , for any k > 0, i=1 εk ,i ϕτ k i λ 2 + |S||A| δ 1 + H2µ maxk |S||A|mλ Proof. 
According to Theorem 1 in (Faury et al., 2020), we have that with probability at least 1 δ , for any k > 0, i=1 εk ,i ϕτ k i det(Λk(θ )) 1 2 λ |S||A| λ |S||A| log(2) 1 + H2µ maxk |S||A|mλ λ |S||A| log(2) λ 2 + |S||A| 1 + H2µ maxk |S||A|mλ λ |S||A| log(2) λ 2 + |S||A| 1 + H2µ maxk |S||A|mλ where (a) uses Lemma C.1. Define event E := gk(ˆθk) gk(θ ) Λ 1 k (θ ) ω(k), k > 0 . Lemma C.3. It holds that Pr [E] 1 δ . Proof. This proof is similar to that for Lemma 8 in (Faury et al., 2020). yk i log µ((ϕτ k i ) θ) +(1 yk i ) log 1 µ((ϕτ k i ) θ) 1 Recall that ˆθk = argminθ Lk(θ). Using the facts that Lk(ˆθk) = 0 and µ (x) = µ(x)(1 µ(x)), we have i=1 µ((ϕτ k i ) ˆθk) ϕτ k i + λˆθk i=1 yk i ϕτ k i . Hence, we have gk(ˆθk) gk(θ ) = i=1 yk i ϕτ k i i=1 µ((ϕτ k i ) θ ) ϕτ k i + λθ ! Reinforcement Learning with Segment Feedback i=1 εk ,i ϕτ k i λθ . (7) Then, using Lemma C.2, we have that with probability at least 1 δ , for any k > 0, gk(ˆθk) gk(θ ) Λ 1 k (θ ) i=1 εk ,i ϕτ k i Λ 1 k (θ ) + rmax p λ 2 + |S||A| δ 1 + H2µ maxk |S||A|mλ For any ϕ R|S||A| and θ1, θ2 Θ, define b(ϕ, θ1, θ2) := Z 1 z=0 µ ((1 z) ϕ θ1 + z ϕ θ2)dz. For any k > 0 and θ1, θ2 Θ, define Γk(θ1, θ2) := i=1 b(ϕ, θ1, θ2) ϕτ k i (ϕτ k i ) + λI. In the definitions of b(ϕ, θ1, θ2) and Γk(θ1, θ2), θ1 and θ2 have the same roles and can be interchanged. Recall that α := exp(Hrmax m ) + exp( Hrmax Then, we have sup τ seg,θ 1 µ ((ϕτ seg) θ) α, where τ seg denotes the visitation indicator of any possible trajectory segment. Lemma C.4. For any k 1 and θ Θ, we have Proof. We have 1 α = inf τ seg,θ µ ((ϕτ seg) θ). Then, it holds that i=1 ϕτ k i (ϕτ k i ) + αλI 1 α ϕτ k i (ϕτ k i ) + λI i=1 µ ((ϕτ k i ) θ) ϕτ k i (ϕτ k i ) + λI Reinforcement Learning with Segment Feedback Lemma C.5. For any ϕ R|S||A| and θ1, θ2 Θ, we have µ(ϕ θ1) µ(ϕ θ2) = b(ϕ, θ2, θ1) ϕ (θ1 θ2). In addition, for any k > 0 and θ1, θ2 Θ, we have θ1 θ2 Γk(θ2,θ1) = gk(θ1) gk(θ2) Γ 1 k (θ2,θ1) . Proof. The first statement follows from the mean-value theorem. 
Then, using the first statement, we have that for any k > 0, gk(θ1) gk(θ2) = µ((ϕτ k i ) θ1) µ((ϕτ k i ) θ2) ϕτ k i + λ (θ1 θ2) i=1 b(ϕτ k i , θ2, θ1) ϕτ k i (ϕτ k i ) (θ1 θ2) + λ (θ1 θ2) = Γk(θ2, θ1) (θ1 θ2), θ1 θ2 Γk(θ2,θ1) = q (θ1 θ2) Γk(θ2, θ1) (θ1 θ2) (θ1 θ2) Γk(θ2, θ1) Γ 1 k (θ2, θ1) Γk(θ2, θ1) (θ1 θ2) = gk(θ1) gk(θ2) Γ 1 k (θ2,θ1) , which gives the second statement. Recall that for any k > 0, Zk := Pk k =1 Pm i=1 εk ,i ϕτ k i . Lemma C.6. For any k > 0, we have Γk(θ , ˆθk) 1 + Hrmax p |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) Zk Γ 1 k (θ ,ˆθk) 1 + Hrmax p |S||A| m Zk Λ 1 k (θ ) + H m λ Zk 2 Λ 1 k (θ ) . Furthermore, assuming that event E holds, we have Zk Γ 1 k (θ ,ˆθk) 1 + Hrmax p |S||A| m ω(k) + H m Proof. This proof follows the analysis of Proposition 6 and Corollary 5 in (Russac et al., 2021). From Eq. (7), we have that for any k > 0, gk(ˆθk) gk(θ ) = Zk λθ . Using Lemma E.1, we have that for any ϕ R|S||A| such that ϕ 2 Lϕ, b(ϕ, θ , ˆθk) 1 + ϕ (θ ˆθk) 1 µ (ϕ θ ) = 1 + ϕ Γ 1 k (θ , ˆθk) (gk(θ ) gk(ˆθk)) 1 µ (ϕ θ ) Reinforcement Learning with Segment Feedback 1 + ϕ Γ 1 k (θ ,ˆθk) gk(θ ) gk(ˆθk) Γ 1 k (θ ,ˆθk) gk(θ ) gk(ˆθk) Γ 1 k (θ ,ˆθk) λ Zk λθ Γ 1 k (θ ,ˆθk) 1 + Lϕrmax p |S||A| + Lϕ λ Zk Γ 1 k (θ ,ˆθk) 1 µ (ϕ θ ). Using the above equation with ϕ = ϕτ k i and Lϕ = H Γk(θ , ˆθk) := i=1 b(ϕτ k i , θ , ˆθk) ϕτ k i (ϕτ k i ) + λI |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) µ (ϕ θ ) ϕτ k i (ϕτ k i ) +λI 1 + Hrmax p |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) This implies Zk 2 Γ 1 k (θ ,ˆθk) 1 + Hrmax p |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) Zk 2 Λ 1 k (θ ) , which is equivalent to Zk 2 Γ 1 k (θ ,ˆθk) H m λ Zk 2 Λ 1 k (θ ) Zk Γ 1 k (θ ,ˆθk) 1 + Hrmax p Zk 2 Λ 1 k (θ ) 0. By analysis of quadratic functions, we have Zk Γ 1 k (θ ,ˆθk) 1 + Hrmax p |S||A| m Zk Λ 1 k (θ ) + H m λ Zk 2 Λ 1 k (θ ) 1 + Hrmax p |S||A| m ω(k) + H m Lemma C.7 (Concentration of ϕ ˆθk under Binary Feedback). Assume that event E holds. Then, for any k > 0 and ϕ R|S||A|, |ϕ θ ϕ ˆθk| α ν(k) ϕ Σ 1 k . Proof. 
We have |ϕ θ ϕ ˆθk| = ϕ Γ 1 k (θ ,ˆθk) θ ˆθk Γk(θ ,ˆθk) Reinforcement Learning with Segment Feedback 1 + Hrmax p |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) ϕ Λ 1 k (θ ) gk(θ ) gk(ˆθk) Γ 1 k (θ ,ˆθk) 1 + Hrmax p |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) ϕ Λ 1 k (θ ) Zk λθ Γ 1 k (θ ,ˆθk) |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) ϕ Λ 1 k (θ ) Zk Γ 1 k (θ ,ˆθk)+rmax p 1 + Hrmax p |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) ϕ Λ 1 k (θ ) H m λ Zk Γ 1 k (θ ,ˆθk) + Hrmax p 1 + Hrmax p |S||A| m + H m λ Zk Γ 1 k (θ ,ˆθk) 2 ϕ Λ 1 k (θ ) |S||A| m + H m |S||A| m ω(k)+ H m = α ν(k) ϕ Σ 1 k , where inequality (a) is due to Lemmas C.5 and C.6, and inequality (b) follows from Lemmas C.4 and C.6. Lemma C.8 (Gaussian Anti-Concentration). Assume that event E holds. Then, for any k > 0 and Fk 1-measurable random variable X R|S||A|, we have Pr h X θk > X θ | Fk 1 i 1 2 Proof. This proof is originated from the analysis of Lemma 11 in (Efroni et al., 2021). Using Lemma C.7, we have that for any k > 0, |X θ X ˆθk 1| α ν(k 1) X Σ 1 k 1 . It holds that Pr h X θk > X θ | Fk 1 i " X θk X ˆθk 1 α ν(k 1) X Σ 1 k 1 > X θ X ˆθk 1 α ν(k 1) X Σ 1 k 1 | Fk 1 Here given Fk 1, X θk X ˆθk 1 = X ξk is a Gaussian random variable with mean 0 and standard deviation α ν(k 1) X Σ 1 k 1. Since when event E holds, X θ X ˆθk 1 α ν(k 1) X Σ 1 k 1 α ν(k 1) X Σ 1 k 1 α ν(k 1) X Σ 1 k 1 = 1, Pr h X θk > X θ | Fk 1 i Pr " X θk X ˆθk 1 α ν(k 1) X Σ 1 k 1 > 1 | Fk 1 Reinforcement Learning with Segment Feedback " X ξk α ν(k 1) X Σ 1 k 1 > 1 | Fk 1 where inequality (a) comes from that if Z FB UTran(0, 1), Pr[Z > z] 1 2π z 1+z2 e z2 2 (Borjesson & Sundberg, 1979). Lemma C.9. Let ξk, ξ k R|S||A| be i.i.d. random variables given Fk 1. Let p be a Fk 1-measurable transition model, and xk 1 R|S||A| be a Fk 1-measurable random variable. For any policy π, denote the visitation indicator under policy π on MDP p by ϕπ. Let πk := argmaxπ( ϕπ) (xk 1 + ξk). 
Then, we have E ( ϕ πk) (xk 1 + ξk) E h ( ϕ πk) (xk 1 + ξk) | Fk 1 i + Fk 1 E h |( ϕ πk) ξk| + |( ϕ πk) ξ k| Fk 1 i . Proof. This proof is originated from Lemma 12 in (Efroni et al., 2021). First, using the definition of πk and the fact that ξk and ξ k follow the same distribution, we have E h ( ϕ πk) (xk 1 + ξk) | Fk 1 i = E h max π ( ϕπ) (xk 1 + ξ k) | Fk 1 i . (8) Then, since given Fk 1, ξk and ξ k are independent, we have E h max π ( ϕπ) (xk 1 + ξ k) | Fk 1 i = E h max π ( ϕπ) (xk 1 + ξ k) | Fk 1, ξk, πk i E (ϕ πk) (xk 1 + ξ k) | Fk 1, ξk, πk . (9) Hence, combining Eqs. (8) and (9), we have E h (ϕ πk) (xk 1 + ξk) E (ϕ πk) (xk 1 + ξk) | Fk 1 + Fk 1 i E h (ϕ πk) (xk 1 + ξk) E (ϕ πk) (xk 1 + ξ k) | Fk 1, ξk, πk + Fk 1 i = E h E (ϕ πk) (xk 1 + ξk) (ϕ πk) (xk 1 + ξ k) | Fk 1, ξk, πk + Fk 1 i = E h E (ϕ πk) ξk (ϕ πk) ξ k | Fk 1, ξk, πk + Fk 1 i E h E (ϕ πk) ξk (ϕ πk) ξ k | Fk 1, ξk, πk Fk 1 i E h E |(ϕ πk) ξk (ϕ πk) ξ k| | Fk 1, ξk, πk Fk 1 i = E h |(ϕ πk) ξk (ϕ πk) ξ k| Fk 1 i E h |(ϕ πk) ξk| Fk 1 i + E h |(ϕ πk) ξ k| Fk 1 i . For any k > 0 and δk (0, 1), define event ϕ R|S||A| : |ϕ ξk| α ν(k 1) Reinforcement Learning with Segment Feedback Lemma C.10. For any k > 0 and δk (0, 1), we have Pr [Mk(δk) | Fk 1] 1 δk. In addition, for a random variable X R|S||A| such that X Σ 1 k 1 LX, we have E |X ξk| |Fk 1 α ν(k 1) E h X Σ 1 k 1 |Fk 1 i + α ν(k 1) LX p Proof. This proof is similar to the analysis of Lemma 13 in (Efroni et al., 2021). First, we prove the first statement. For any ϕ R|S||A|, we have |ϕ ξk| = |ϕ Σ 1 1 2 k 1ξk 2 = α ν(k 1) ϕ Σ 1 k 1 1 α ν(k 1)Σ Since given Fk 1, 1 α ν(k 1)Σ 1 2 k 1ξk R|S||A| is a vector with each entry being a standard Gaussian random variable, we have that 1 α ν(k 1)Σ 1 2 k 1ξk 2 is chi-distributed with parameter |S||A|. Then, using Lemma 1 in (Laurent & Massart, 2000), we have that with probability at least 1 δk, 1 α ν(k 1)Σ v u u t|S||A| + 2 |S||A| log 1 Next, we prove the second statement. 
For a random variable X R|S||A|, we have E |X ξk| |Fk 1 = Pr [Mk(δk)] E |X ξk| |Fk 1, Mk(δk) + Pr Mk(δk) E |X ξk| |Fk 1, Mk(δk) E h X Σ 1 k 1 |Fk 1 i Pr Mk(δk) E |X ξk|2 |Fk 1, Mk(δk) (a) α ν(k 1) E h X Σ 1 k 1 |Fk 1 i v u u tδk E X 2 Σ 1 k 1 1 α ν(k 1)Σ Reinforcement Learning with Segment Feedback E h X Σ 1 k 1 |Fk 1 i v u u tδk L2 XE " 1 α ν(k 1)Σ (b) α ν(k 1) E h X Σ 1 k 1 |Fk 1 i + α ν(k 1) LX p Here inequality (a) follows from the Cauchy-Schwarz inequality. Inequality (b) is due to the fact that given Fk 1, 1 α ν(k 1)Σ 1 2 k 1ξk 2 is chi-distributed with parameter |S||A|, and then E[ 1 α ν(k 1)Σ 1 2 k 1ξk 2 2 |Fk 1] = |S||A|. Define event FB KTran := Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 4H k αλ log 4k E h (ϕπk ) θ |Fk 1 i (ϕπk ) θ 4Hrmax Lemma C.11. It holds that Pr FB KTran 1 2δ . Proof. We prove the first inequality as follows. For any k 1, we have that ϕτ (Σk 1) 1 H αλ, and then |Eτ πk [ ϕτ (Σk 1) 1 |Fk 1] ϕτ (Σk 1) 1 | 2H Using the Azuma-Hoeffding inequality, we have that for any fixed k > 0, with probability at least 1 δ 2k2 , Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 αλ k log 4k2 Since P k=1 δ 2k2 δ , by a union bound over k, we have that with probability at least δ , for any k 1, Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 αλ k log 4k2 k αλ log 4k The second inequality can be obtained by a similar argument and the fact that |(ϕπk) θ | Hrmax for any k > 0. Lemma C.12. For any K 1, we have ϕτ k i (Σk 1) 1 2Km|S||A| max H2 mαλ, 1 log 1 + KH2 Proof. We have ϕτ k i (Σk 1) 1 Reinforcement Learning with Segment Feedback v u u t2Km max H2 mαλ, 1 log det(ΣK) 2Km|S||A| max H2 mαλ, 1 log 1 + KH2 where inequality (a) is due to that for any x [0, c] with constant c 0, it holds that x 2 max{c, 1} log(1 + x). Proof of Theorem 3.1. Letting δ = δ 3, we have Pr[E FB KTran] 1 δ. Then, to prove this theorem, it suffices to prove the regret bound when event E FB KTran holds. Assume that event E FB KTran holds. 
Then, we have (ϕπ ) θ (ϕπk) θ E h (ϕπ ) θ (ϕπk) θ |Fk 1 i + E h (ϕπk) θ |Fk 1 i (ϕπk) θ E h (ϕπ ) θ (ϕπk) θ |Fk 1 i + 4Hrmax For the first term, we have k=1 E h (ϕπ ) θ (ϕπk) θ |Fk 1 i E h (ϕπ ) θ (ϕπk) θk|Fk 1 i + E h (ϕπk) θk (ϕπk) θ |Fk 1 i . (14) In the following, we prove E h (ϕπ ) θ (ϕπk) θk|Fk 1 i 2 2πe E (ϕπk) θk E h (ϕπk) θk|Fk 1 i + |Fk 1 If E[(ϕπ ) θ (ϕπk) θk|Fk 1] < 0, then Eq. (15) trivially holds. Otherwise, letting z := E[(ϕπ ) θ (ϕπk) θk|Fk 1], we have E (ϕπk) θk E h (ϕπk) θk|Fk 1 i + |Fk 1 z Pr h (ϕπk) θk E h (ϕπk) θk|Fk 1 i z|Fk 1 i E h (ϕπ ) θ (ϕπk) θk|Fk 1 i Pr h (ϕπk) θk (ϕπ ) θ |Fk 1 i (a) E h (ϕπ ) θ (ϕπk) θk|Fk 1 i Pr h (ϕπ ) θk (ϕπ ) θ |Fk 1 i (b) E h (ϕπ ) θ (ϕπk) θk|Fk 1 i 1 2 where inequality (a) uses the definition of πk, and inequality (b) follows from Lemma C.8. Thus, we complete the proof of Eq. (15). Reinforcement Learning with Segment Feedback Let ξ k R|S||A| be a random variable that is i.i.d. with ξ given Fk 1. Then, using Lemma C.9 with p = p, xk 1 = ˆθk 1 and πk = πk, we have E h (ϕπ ) θ (ϕπk) θk|Fk 1 i 2 2πe E (ϕπk) θk E h (ϕπk) θk|Fk 1 i + |Fk 1 2πe E |ϕ(πk) ξk| + |ϕ(πk) ξ k| |Fk 1 . Plugging the above inequality into Eq. (14) and using Lemma C.10 with δk = 1 k4 and LX = H αλ, we have k=1 E h (ϕπ ) θ (ϕπk) θ |Fk 1 i 2πe E h |(ϕπk) ξk|+|(ϕπk) ξ k| |Fk 1 i +E h (ϕπk) ˆθk 1+ξk (ϕπk) θ |Fk 1 i 2πe + 1 E h |(ϕπk) ξk| |Fk 1 i + 2 2πe E h |(ϕπk) ξ k| |Fk 1 i + E h (ϕπk) ˆθk 1 (ϕπk) θ |Fk 1 i 2πe + 2 α ν(k 1) p |S||A| + 4 p log (k) E ϕπk Σ 1 k 1 |Fk 1 2πe + 1 α ν(k 1) where inequality (a) uses Lemmas C.7 and C.10. Here according to the definition of event FB KTran and Lemma C.12, we have k=1 E ϕπk Σ 1 k 1 |Fk 1 E ϕπk Σ 1 k 1 |Fk 1 ϕπk Σ 1 k 1 ϕπk Σ 1 k 1 K αλ log 4K 2Km|S||A| max H2 mαλ, 1 log 1 + KH2 Therefore, plugging the above two equations into Eq. 
(13), we have 2πe + 2 α ν(K) p |S||A| + 4 p K αλ log 4K 2Km|S||A| max H2 mαλ, 1 log 1 + KH2 2πe + 1 H ν(K) 2m ) ν(K) p Km|S||A| max H2 where in equality (a), the last two terms are absorbed into O( ). Reinforcement Learning with Segment Feedback : transition to 𝑠𝑛+1 𝑎𝑖 𝑠𝑢𝑏: transition to 𝑠𝑛+2 𝑟𝑠𝑛+1, = 𝑟𝑚𝑎𝑥 𝑟𝑠𝑛+2, = 1 𝜀𝑟𝑚𝑎𝑥 Optimal action: 𝑟(𝑠𝑖, 𝑎𝑖 ) = 𝑟𝑚𝑎𝑥 Suboptimal action: 𝑟(𝑠𝑖, 𝑎𝑖 𝑠𝑢𝑏) = (1 𝜀)𝑟𝑚𝑎𝑥 Figure 5. Instance for the lower bound under binary segment feedback and known transition. C.2. Proof for the Regret Lower Bound with Known Transition In the following, we prove the regret lower bound (Theorem 3.2) for RL with binary segment feedback and known transition. Proof of Theorem 3.2. We construct a random instance I as follows. As shown in Figure 5, there are n bandit states s1, . . . , sn (i.e., there is an optimal action and multiple suboptimal actions), a good absorbing state sn+1 and a bad absorbing state sn+2. The agent starts from s1, . . . , sn with equal probability 1 n. For any i [n], in state si, one action a J is uniformly chosen from A as the optimal action. In state si, under the optimal action a J, the agent transitions to sn+1 deterministically, and r(si, a J) = rmax; Under any suboptimal action a A \ {s J}, the agent transitions to sn+2 deterministically, and r(si, a) = (1 ε)rmax, where ε (0, 1 2) is a parameter specified later. For all actions a A, r(sn+1, a) = rmax and r(sn+2, a) = (1 ε)rmax. In this proof, we will also use an alternative uniform instance Iunif. The only difference between Iunif and I is that for any i [n], in state si, under all actions a A, the agent transitions to sn+2 deterministically, and r(si, a) = (1 ε)rmax. Fix an algorithm A. Let Eunif[ ] denote the expectation with respect to Iunif. Let E [ ] denote the expectation with respect to I. 
For any i [n] and j [|A|], let Ei,j[ ] denote the expectation with respect to the case where aj is the optimal action in state si, and Ni,j denote the number of episodes where algorithm A chooses aj in state si, i.e., Ni,j = PK k=1 1{πk 1(si) = aj}. The KL divergence of binary observations if taking a J in si in each episode between Iunif and I is i=1 KL B µ (1 ε)rmax H µ (1 ε)rmax H (b) m µ (1 ε) Hrmax m 2 ε Hrmax where inequality (a) uses the fact that KL(B(p) B(q)) (p q)2 q(1 q), and inequality (b) is due to that µ (x) is monotonically decreasing when x > 0. In addition, the agent has probability only 1 n to arrive at (observe) state si. Thus, using Lemma A.1 in (Auer et al., 2002), we have that for any i [n], in state si, Ei,j[Ni,j] Eunif[Ni,j] + K n Eunif[Ni,j] m µ (1 ε) Hrmax m 2 ε Hrmax Reinforcement Learning with Segment Feedback = Eunif[Ni,j] + K n Eunif[Ni,j] µ (1 ε) Hrmax Summing over j [|A|], using the Cauchy-Schwarz inequality and the fact that P|A| j=1 Eunif[Ni,j] = K, we have j=1 Ei,j[Ni,j] K + KHrmaxε v u u t|A|K mn µ (1 ε) Hrmax K + KHrmaxε v u u t|A|K mn µ (1 c0) Hrmax where c0 (0, 1 2) is a constant which satisfies c0 ε. We will specify how to make c0 ε to satisfy this condition later. Then, we have k=1 E h V V πki = rmax HK 1 (1 ε)rmax HK + εrmax H 1 j=1 Ei,j[Ni,j] |A| KHrmaxε v u u t K |A|mn µ (1 c0) Hrmax ε = 1 2Hrmax v u u t|A|mn µ (1 c0) Hrmax Then, the constant c0 should satisfy ε = 1 2Hrmax v u u t|A|mn µ (1 c0) Hrmax µ (1 c0) Hrmax exp (1 c0) Hrmax m + exp (1 c0) Hrmax m + exp Hrmax 4 exp (1 c0) Hrmax = 16 exp 1 2c0 Hrmax it suffices to let c0 satisfy K 16 exp (1 2c0)Hrmax Reinforcement Learning with Segment Feedback Algorithm 3 Seg Bi TS-Tran 1: Input: δ, δ := δ 8, λ. 2: for k = 1, . . . , K do 3: ˆθk 1 argminθ (Pk 1 k =1 Pm i=1(yk i log(µ((ϕτ k i ) θ)) + (1 yk i ) log(1 µ((ϕτ k i ) θ))) 1 4: Σk 1 Pk 1 k =1 Pm i=1 ϕτ k i (ϕτ k i ) + αλI 5: Draw a noise ξk N(0, α ν(k 1)2 Σ 1 k 1), where ν(k 1) is defined in Eq. 
(1) 6: bpv k 1(s, a) min{2Hrmax log( KH|S||A| δ ) nk 1(s,a) , Hrmax} for any (s, a) S A 7: θb k ˆθk 1 + ξk + bpv k 1 8: πk argmaxπ(ˆϕπ k 1) θb k, where ˆϕπ k 1 is defined in Eq. (18) 9: Play episode k with policy πk. Observe τ k and binary segment feedback {yk i }m i=1 10: end for which is equivalent to K 4|A|mn H2r2maxc2 0 exp((1 2c0) Hrmax It suffices to let K 4|A|mn H2r2maxc2 0 exp Hrmax and then c0 can be any constant in (0, 1 Let |S| 3, |A| 2, c0 (0, 1 2) and K 4|A|mn H2r2 maxc2 0 exp( Hrmax µ (1 c0) Hrmax exp (1 c0) Hrmax m + exp (1 c0) Hrmax m + exp Hrmax exp (1 c0) Hrmax 4 exp Hrmax 4 exp 1 2c0 Hrmax R(K) 1 2Hrmax v u u t|A|mn µ (1 c0) Hrmax m 2 rmax H K K exp (1 2c0)Hrmax |S||A|m K . C.3. Pseudo-code and Detailed Description of Algorithm Seg Bi TS-Tran Algorithm 3 illustrates the procedure of Seg Bi TS-Tran. In episode k, similar to Seg Bi TS, Seg Bi TS-Tran first uses MLE with past binary segment observations to obtain a reward estimate ˆθk 1, and calculates the covariance matrix of past observations Σk 1 (Lines 3-4). After that, Seg Bi TS-Tran samples a Gaussian noise ξk using Σk 1 (Line 4). For any k > 0 and (s, a) S A, let ˆpk( |s, a) denote the empirical estimate of p( |s, a), and nk(s, a) denote the number of times (s, a) was visited at the end of episode k. Then, Seg Bi TS-Tran constructs a transition bonus bpv k 1(s, a), which represents the uncertainty on transition estimation. Incorporating the MLE estimate ˆθk 1, noise ξk and transition bonus bpv k 1(s, a), Seg Bi TS-Tran constitutes a posterior estimate of the reward parameter θk (Line 7). Reinforcement Learning with Segment Feedback For any policy π, k > 0 and (s, a) S A, we define ˆϕπ k(s, a) := Eˆpk h=1 1{sh = s, ah = a}|π which denotes the expected number of times (s, a) is visited in an episode under policy π on the empirical MDP ˆpk. In addition, let ˆϕπ k := [ˆϕπ k(s, a)](s,a) S A R|S||A|. 
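The empirical visitation vector ϕ̂_k^π defined above can be computed by a forward pass over the H steps of the empirical MDP. A minimal sketch (our own NumPy code, assuming a deterministic policy and a single initial state; names are illustrative):

```python
import numpy as np

def expected_visitations(p_hat, policy, s0, H):
    """Expected number of visits to each (s, a) over an H-step episode
    under a deterministic `policy` on the empirical transition model
    `p_hat`, where p_hat[s, a, s'] is the estimated transition
    probability; returns the flattened visitation vector."""
    n_s, n_a = p_hat.shape[0], p_hat.shape[1]
    d = np.zeros(n_s)                  # state-occupancy distribution
    d[s0] = 1.0
    phi = np.zeros((n_s, n_a))
    for _ in range(H):
        for s in range(n_s):
            phi[s, policy[s]] += d[s]  # visitation mass at this step
        # propagate occupancy one step through the empirical model
        d = sum(d[s] * p_hat[s, policy[s]] for s in range(n_s))
    return phi.flatten()
```

Planning with a given reward vector then amounts to maximizing the inner product of this visitation vector with the reward estimate, which standard MDP planning (e.g., dynamic programming over the horizon) performs without enumerating policies.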
Then, Seg Bi TS-Tran finds the optimal policy via argmaxπ(ˆϕπ k 1) θb k, which can be efficiently solved by any MDP planning algorithm with transition ˆpk 1 and reward θb k (Line 8). With the computed optimal policy πk, Seg Bi TS-Tran plays episode k, and observes a trajectory and binary feedback on each segment (Line 9). C.4. Proof for the Regret Upper Bound with Unknown Transition In the following, we prove the regret upper bound (Theorem 3.3) of algorithm Seg Bi TS-Tran for unknown transition. Define event ( ˆpk 1( |s, a) V h+1 p( |s, a) V h+1 v u u tlog KH|S||A| nk 1(s, a) Hrmax (s, a) S A, k > 0 Lemma C.13. It holds that Pr [GHoeff] 1 2δ . Proof. This lemma follows from the Hoeffding inequality and a union bound over nk 1(s, a) [KH] and (s, a) S A. Lemma C.14 (Optimism of Thompson Sampling with Unknown Transition). Assume that event E and GHoeff holds. Then, for any k > 0, we have Pr h ˆϕk 1(πk) θb k > (ϕπ ) θ | Fk 1 i 1 2 Proof. This proof follows the analysis of Lemma 17 in (Efroni et al., 2021). Using the value difference lemma (see Lemma E.2), we have ˆϕk 1(π ) θb k (ϕπ ) θ θb k(sh, ah) θ (sh, ah) + (ˆpk 1( |sh, ah) p( |sh, ah)) V h+1 # θk(sh, ah) θ (sh, ah)+bpv k 1(sh, ah)+(ˆpk 1( |sh, ah) p( |sh, ah)) V h+1 # (a) Eˆpk 1,π θk(sh, ah) θ (sh, ah) + bpv k 1(sh, ah) bpv k 1(sh, ah) # θk(sh, ah) θ (sh, ah) # = ˆϕk 1(π ) θk ˆϕk 1(π ) θ , Reinforcement Learning with Segment Feedback where inequality (a) uses the definition of event GHoeff. Thus, by the definition of πk, we have Pr h ˆϕk 1(πk) θb k > (ϕπ ) θ | Fk 1 i (a) Pr h ˆϕk 1(π ) θb k > (ϕπ ) θ | Fk 1 i = Pr h ˆϕk 1(π ) θb k (ϕπ ) θ > 0 | Fk 1 i Pr h ˆϕk 1(π ) θk ˆϕk 1(π ) θ > 0 | Fk 1 i where inequality (a) is due to the definition of πk, and inequality (b) follows from Lemma C.8. Define event GKL := KL(ˆpk 1( |s, a), p( |s, a)) L nk 1(s, a), k > 0, (s, a) S A . (19) Lemma C.15 (Concentration of Transition). It holds that Pr[GKL] 1 δ . Proof. 
This lemma can be obtained by Theorem 3 and Lemma 3 in (M enard et al., 2021). Recall that for any k > 0 and (s, a) S A, nk(s, a) denotes the cumulative number of times that (s, a) is visited at the end of episode k. For any k > 0, h [H] and (s, a) S A, let wk,h(s, a) denote the probability that (s, a) is visited at step h in episode k, and let wk(s, a) := PH h=1 wk,h(s, a). Define event k =1 wk (s, a) H log |S||A|H , k > 0, (s, a) S A Lemma C.16 (Concentration of the Number of Visitations). It holds that Pr[H] 1 δ . Proof. This lemma can be obtained from Lemma F.4 in (Dann et al., 2017) and summing over h [H]. Define event FB UTran := E h (ϕπk ) bpv k 1|Fk 1 i (ϕπk ) bpv k 1 4H2rmax E h ˆϕk 1(πk ) ϕ(πk ) 1 |Fk 1 i ˆϕk 1(πk ) ϕ(πk ) 1 Lemma C.17. It holds that Pr FB UTran 1 2δ . Reinforcement Learning with Segment Feedback Proof. This lemma can be obtained by a similar analysis as Lemma C.11, and the facts that |(ϕπk) bpv k 1| H2rmax and ˆϕk 1(πk) ϕπk 1 2H for any k 1. Lemma C.18. Assume that event FB UTran GKL H holds. Then, we have k=1 E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i 24e12|S| 3 2 |A| 3 2 H 3 2 p KL log(2KH) + 192e12|S|2|A|2H2L log 2KH|S||A| Proof. First, from Lemmas D.10 and D.11, we have ˆϕk 1(π) ϕ(π) 1 (s,a) Dk wπk h (s, a) L nk 1(s, a) + 46H2L nk 1(s, a) + e12|S||A|H (s,a)/ Dk wπk h (s, a) 8e12|S||A|H (s,a) Dk wπk h (s, a) wπk h (s, a) nk 1(s, a) + 46e12|S||A|H2L wπk h (s, a) nk 1(s, a) + 8e12|S|2|A|2H2 log |S||A|H 16e12|S| 3 2 |A| 3 2 H 3 2 p KL log(2KH) + 184e12|S|2|A|2H2L log(2KH) + 8e12|S|2|A|2H2 log |S||A|H 16e12|S| 3 2 |A| 3 2 H 3 2 p KL log(2KH) + 192e12|S|2|A|2H2L log 2KH|S||A| Next, we have k=1 E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i ˆϕk 1(πk) ϕπk 1 + E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i ˆϕk 1(πk) ϕπk 1 16e12|S| 3 2 |A| 3 2 H 3 2 p KL log(2KH) + 192e12|S|2|A|2H2L log 2KH|S||A| 24e12|S| 3 2 |A| 3 2 H 3 2 p KL log(2KH) + 192e12|S|2|A|2H2L log 2KH|S||A| Lemma C.19. Assume that event FB UTran holds. 
Then, we have k=1 E h (ϕπk) bpv k 1|Fk 1 i 20|S||A|H2rmax K log 4KH|S||A| Reinforcement Learning with Segment Feedback Proof. It holds that k=1 E h (ϕπk) bpv k 1|Fk 1 i k=1 (ϕπk) bpv k 1 + E h (ϕπk) bpv k 1|Fk 1 i (ϕπk) bpv k 1 s,a wπk h (s, a) v u u tlog KH|S||A| nk 1(s, a) Hrmax log KH|S||A| wπk h (s, a) p (s,a)/ Dk wπk h (s, a) + 4H2rmax log KH|S||A| wπk h (s, a) nk 1(s, a) + 8|S||A|H2rmax log |S||A|H log KH|S||A| 4|S||A| log(2KH) + 8|S||A|H2rmax log |S||A|H 16|S||A|H2rmax K log 4KH|S||A| Proof of Theorem 3.3. Letting δ = δ 8, we have Pr[E FB KTran GHoeff GKL H FB UTran] 1 δ. Then, to prove this theorem, it suffices to prove the regret bound when event E FB KTran GHoeff GKL H FB UTran holds. Assume that event E FB KTran GHoeff GKL H FB UTran holds. Then, we have (ϕπ ) θ (ϕπk) θ E h (ϕπ ) θ (ϕπk) θ |Fk 1 i + E h (ϕπk) θ |Fk 1 i (ϕπk) θ E h (ϕπ ) θ (ϕπk) θ |Fk 1 i + 4Hrmax For the first term, we have k=1 E h (ϕπ ) θ (ϕπk) θ |Fk 1 i Reinforcement Learning with Segment Feedback E h (ϕπ ) θ ˆϕk 1(πk) θb k|Fk 1 i + E h ˆϕk 1(πk) θb k (ϕπk) θ |Fk 1 i . (22) In the following, we prove E h (ϕπ ) θ ˆϕk 1(πk) θb k|Fk 1 i 2πe E ˆϕk 1(πk) θb k E h ˆϕk 1(πk) θb k|Fk 1 i + |Fk 1 If E[(ϕπ ) θ ˆϕk 1(πk) θb k|Fk 1] < 0, then Eq. (23) trivially holds. Otherwise, letting z := E[(ϕπ ) θ ˆϕk 1(πk) θb k|Fk 1], we have E ˆϕk 1(πk) θb k E h ˆϕk 1(πk) θb k|Fk 1 i + |Fk 1 z Pr h ˆϕk 1(πk) θb k E h ˆϕk 1(πk) θb k|Fk 1 i z|Fk 1 i E h (ϕπ ) θ ˆϕk 1(πk) θb k|Fk 1 i Pr h ˆϕk 1(πk) θb k (ϕπ ) θ |Fk 1 i (a) E h (ϕπ ) θ ˆϕk 1(πk) θb k|Fk 1 i 1 2 where inequality (a) uses Lemma C.14. Thus, we complete the proof of Eq. (23). Let ξ k R|S||A| be an i.i.d. random variable with ξ given Fk 1. Then, using Lemma C.9 with p = ˆpk 1, xk 1 = ˆθk 1 + bpv k 1 and πk = πk, we have E h (ϕπ ) θ ˆϕk 1(πk) θb k|Fk 1 i 2πe E ˆϕk 1(πk) θb k E h ˆϕk 1(πk) θb k|Fk 1 i + |Fk 1 2πe E h |ˆϕk 1(πk) ξk| + |ˆϕk 1(πk) ξ k| |Fk 1 i . Plugging the above inequality into Eq. 
(22) and using Lemma C.10 with δk = 1 k4 and LX = H αλ, we have k=1 E h (ϕπ ) θ (ϕπk) θ |Fk 1 i 2πe E h |ˆϕk 1(πk) ξk| + |ˆϕk 1(πk) ξ k| |Fk 1 i + E h ˆϕk 1(πk) ˆθk 1 + bpv k 1 + ξk (ϕπk) θ |Fk 1 i 2πe + 1 E h |ˆϕk 1(πk) ξk| |Fk 1 i + 2 2πe E h |ˆϕk 1(πk) ξ k| |Fk 1 i + E h ˆϕk 1(πk) ˆθk 1 + bpv k 1 (ϕπk) θ |Fk 1 i 2πe + 1 α ν(k 1) p |S||A| + 4 p log (k) E ˆϕk 1(πk) Σ 1 k 1 |Fk 1 2πe + 1 α ν(k 1) Reinforcement Learning with Segment Feedback + E h ˆϕk 1(πk) ˆθk 1 + bpv k 1 (ϕπk) θ |Fk 1 i ! E h ˆϕk 1(πk) ˆθk 1 + bpv k 1 (ϕπk) θ |Fk 1 i = E h ˆϕk 1(πk) ˆθk 1 θ |Fk 1 i + E ˆϕk 1(πk) ϕπk θ |Fk 1 + E h ˆϕk 1(πk) bpv k 1|Fk 1 i α ν(k 1)E ˆϕk 1(πk) Σ 1 k 1 |Fk 1 + rmax E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i + E h ˆϕk 1(πk) bpv k 1|Fk 1 i . Hence, plugging the above inequality into Eq. (24), we have k=1 E h (ϕπ ) θ (ϕπk) θ |Fk 1 i 2πe + 2 α ν(k 1) p |S||A| + 4 p log (k) E ˆϕk 1(πk) Σ 1 k 1 |Fk 1 2πe + 1 ν(k 1) H λ + rmax E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i + E h ˆϕk 1(πk) bpv k 1|Fk 1 i ! Here we have k=1 E ˆϕk 1(πk) Σ 1 k 1 |Fk 1 k=1 E ϕπk Σ 1 k 1 |Fk 1 k=1 E ˆϕk 1(πk) ϕπk Σ 1 k 1 |Fk 1 k=1 E ϕπk Σ 1 k 1 |Fk 1 k=1 E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i K αλ log 4K 2Km|S||A| max H2 mαλ, 1 log 1 + KH2 k=1 E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i , where inequality (a) uses Eq. (17). In addition, we have k=1 E h ˆϕk 1(πk) bpv k 1|Fk 1 i k=1 E h (ϕπk) bpv k 1|Fk 1 i + k=1 E h ˆϕk 1(πk) ϕπk 1 bpv k 1 |Fk 1 i Reinforcement Learning with Segment Feedback k=1 E h (ϕπk) bpv k 1|Fk 1 i + Hrmax k=1 E h ˆϕk 1(πk) ϕπk 1 |Fk 1 i . Therefore, plugging the above three equations into Eq. (21), we have 2πe + 2 α ν(K) p |S||A| + 4 p K αλ log 4K 2Km|S||A| max H2 mαλ, 1 log 1 + KH2 log(K) +2Hrmax k=1 E h ˆϕk 1(πk) ϕπk 1|Fk 1 i k=1 E h (ϕπk) bpv k 1|Fk 1 i +2 4 2πe + 1 H ν(K) Km|S||A| max H2 |S|2|A| 3 2 H 3 2 where in equality (a), we use Lemmas C.18 and C.19, and the last three terms are absorbed into O( ). D. Proofs for RL with Sum Segment Feedback In this section, we provide the proofs for RL with sum segment feedback. D.1. 
Proof for the Regret Upper Bound with Known Transition We first prove the regret upper bound (Theorem 4.1) of algorithm E-Lin UCB for known transition. Define event i=1 ϕτ k i (ϕτ k i ) Eτi πk i=1 ϕ(τi)ϕ(τi) #! K0 log 2|S||A| m log 2|S||A| Lemma D.1 (Concentration of Initial Sampling). It holds that Pr [J ] 1 δ . Proof. Note that π1, . . . , πK0 and K0 are fixed before sampling, E[Pm i=1 ϕτ k i (ϕτ k i ) ] = Eτi πk Pm i=1 ϕ(τi)ϕ(τi) , and Pm i=1 ϕτ k i (ϕτ k i ) H2 m . Then, using the matrix Bernstein inequality (Theorem 6.1.1 in (Tropp et al., 2015)), we can obtain this lemma. Lemma D.2 (E-optimal Design). Assume that event J holds. Then, we have i=1 ϕτ k i (ϕτ k i ) ! 1 1 H2 . Reinforcement Learning with Segment Feedback Proof. Using the guarantee of the rounding procedure ROUND (Theorem 1.1 in (Allen-Zhu et al., 2021)) and the fact that K0 |S||A| γ2 , we have i=1 ϕ(τi)ϕ(τi) #! 1 π Π w (π) Eτi πk i=1 ϕ(τi)ϕ(τi) #! 1 Let σmin( ) denote the minimum eigenvalue. Then, we have i=1 ϕτ k i (ϕτ k i ) ! i=1 ϕ(τi)ϕ(τi) # i=1 ϕτ k i (ϕτ k i ) i=1 ϕ(τi)ϕ(τi) #! i=1 ϕ(τi)ϕ(τi) #! i=1 ϕτ k i (ϕτ k i ) i=1 ϕ(τi)ϕ(τi) # K0 (1 + γ)z 4H2 log 2|S||A| m log 2|S||A| Let x = K0 and f(x) = 1 (1 + γ)z x2 4H2 log 2|S||A| m log 2|S||A| According to the property of quadratic functions, when log 2|S||A| log 2|S||A| δ 2 + 4 1 (1+γ)z 4H2 m log 2|S||A| 2 1 (1+γ)z , (27) we have f(x) 0. To make Eq. (27) hold, it suffices to set K0 (1 + γ)2(z )2 log 2|S||A| log 2|S||A| + 8 (1 + γ)z 5H2 log 2|S||A| = 16H4(1 + γ)2(z )2 m2 + 10H2(1 + γ)z log 2|S||A| Furthermore, since P π Π w (π)Eτi πk Pm i=1 ϕ(τi)ϕ(τi) H2 and then z 1 H2 , to make the right-hand-side in Eq. (26) no smaller than H2, it suffices to set K0 26H4(1 + γ)2(z )2 log 2|S||A| Reinforcement Learning with Segment Feedback Therefore, combining the definition of K0 and Eq. (26), we have k=1 ϕ(τ k)ϕ(τ k) ! which completes the proof. Lemma D.3. For any k > 0, = log det(Σk) |S||A| log 1 + k H2 Proof. 
For any k > 0, it holds that det(Σk) = det i=1 ϕτ k i (ϕτ k i ) ! = det(Σk 1) det i=1 (Σk 1) 1 2 ϕτ k i (ϕτ k i ) (Σk 1) 1 = det(Σk 1) Taking the logarithm on both sides, we have log det(Σk) = log det(λI) + = log det(Σk) tr(Σk) |S||A| |S||A| = |S||A| log tr(Σk) λ|S||A| + km H2 = |S||A| log 1 + k H2 where (a) uses the arithmetic mean-geometric mean inequality. Lemma D.4 (Elliptical Potential with Optimized Initialization). Assume that event J holds. Then, for any k K0 + 1, (Σk 1) 1 1. Reinforcement Learning with Segment Feedback Furthermore, for any K K0 + 1, ϕτ k i (Σk 1) 1 2Km|S||A| log 1 + KH2 Proof. Using Lemma D.2, for any k K0 + 1, we have ϕτ k i 2 λI+PK0 k =1 ϕτk (ϕτk ) +Pk 1 k =K0+1 ϕτk (ϕτk ) 1 ϕτ i i 2 PK0 k =1 ϕτk (ϕτk ) 1 Then, we have ϕτ k i (Σk 1) 1 v u u t Km 2 v u u t Km 2 2Km|S||A| log 1 + KH2 where (a) uses the fact that x 2 log(1 + x) for any x [0, 1], and (b) follows from Lemma D.3. Define event m log 1+ k H2 λ|S||A|, k > 0 Lemma D.5 (Concentration of ˆθk under Sum Feedback). It holds that Pr [K] 1 δ . Proof. Since the sum feedback on each segment is H m-sub-Gaussian given the observation of transition and θ rmax p |S||A|, using Lemma 2 in (Abbasi-Yadkori et al., 2011), we can obtain this lemma. Define event Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 Reinforcement Learning with Segment Feedback Lemma D.6 (Concentration of Visitation Indicators). It holds that Pr FS opt 1 δ . Proof. According to Lemma D.4, we have that for any k K0 + 1, ϕτ (Σk 1) 1 1, and then |Eτ πk [ ϕτ (Σk 1) 1 |Fk 1] ϕτ (Σk 1) 1 | 2. Using the Azuma-Hoeffding inequality, we have that for any fixed k K0 + 1, with probability at least 1 δ 2k2 , Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 2 4(k K0 1) log 4k2 Since P k=K0+1 δ 2k2 δ , by a union bound over k, we have that with probability at least δ , for any k K0 + 1, Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 2 4(k K0 1) log 4k2 Proof of Theorem 4.1. Let δ = δ 3. We have Pr[J K FS opt] 1 δ. 
To prove this theorem, it suffices to prove the regret bound when event J K FS opt holds. Assume that event J K FS opt holds. Then, we have (ϕπ ) θ (ϕπk) θ (ϕπ ) ˆθk 1 + β(k 1) ϕπ (Σk 1) 1 (ϕπk) θ + K0H (ϕπk) ˆθk 1 + β(k 1) ϕπk (Σk 1) 1 (ϕπk) θ + K0H k=K0+1 2β(k 1) ϕπk (Σk 1) 1 + K0H k=K0+1 Eτ πk [ϕτ|Fk 1] (Σk 1) 1 + K0H k=K0+1 Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i + K0H Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 + ϕτ (Σk 1) 1 +K0H Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 + ϕτ k i (Σk 1) 1 + K0H, (30) Reinforcement Learning with Segment Feedback : transition to 𝑠𝑛+1 𝑎𝑖 𝑠𝑢𝑏: transition to 𝑠𝑛+2 𝑟𝑠𝑛+1, = 1 2 + 𝜀𝑟𝑚𝑎𝑥 Optimal action: 𝑟(𝑠𝑖, 𝑎𝑖 1 2 + 𝜀𝑟𝑚𝑎𝑥 Suboptimal action: 𝑟(𝑠𝑖, 𝑎𝑖 Figure 6. Instance for the lower bound under sum segment feedback and known transition. where inequality (a) follows from Eq. (28), inequality (b) is due to the definition of πk, and inequality (c) uses the Jensen inequality. Plugging Eq. (29) and Lemma D.4 into Eq. (30) and using the fact that λ := H r2maxm, we have m log 1 + KH2 2Km|S||A| log 1 + KH2 + H max 26H4(1 + γ)2(z )2 log 2|S||A| HK log 1 + KHr2 max |S||A|m +(z )2H5 log |S||A| D.2. Proof for the Regret Lower Bound with Known Transition Now we prove the regret lower bound (Theorem 4.2) for RL with sum segment feedback and known transition. Proof of Theorem 4.2. We construct a random instance I as follows. As shown in Figure 6, there are n bandit states s1, . . . , sn (i.e., there is an optimal action and multiple suboptimal actions), a good absorbing state sn+1 and a bad absorbing state sn+2. The agent starts from s1, . . . , sn with equal probability 1 n. For any i [n], in state si, one action a J is uniformly chosen from A as the optimal action. In state si, under the optimal action a J, the agent transitions to sn+1 deterministically, and r(si, a J) = ( 1 2 + ε)rmax, where ε (0, 1 2] is a parameter specified later; Under any suboptimal action a A \ {s J}, the agent transitions to sn+2 deterministically, and r(si, a) = 1 2rmax. 
For all actions a A, r(sn+1, a) = ( 1 2 + ε)rmax and r(sn+2, a) = 1 2rmax. For any (s, a) S A, the reward distribution of (s, a) is Gaussian distribution N(r(s, a), 1). In this proof, we will also use an alternative uniform instance Iunif. The only difference between Iunif and I is that for any i [n], in state si, under all actions a A, the agent transitions to sn+2 deterministically, and r(si, a) = 1 Fix an algorithm A. Let Eunif[ ] denote the expectation with respect to Iunif. Let E [ ] denote the expectation with respect to I. For any i [n] and j [|A|], let Ei,j[ ] denote the expectation with respect to the case where aj is the optimal action in state si, and Ni,j denote the number of episodes where algorithm A chooses aj in state si, i.e., Ni,j = PK k=1 1{πk 1(si) = aj}. The KL divergence of the reward observations if taking a J in si (i [n]) between Iunif and I is 2 + ε rmax H Reinforcement Learning with Segment Feedback Algorithm 4 Lin UCB-Tran 1: Input: δ, δ := δ 4, λ := H m, L := log( 3|S||A|H δ ) + S log(8e(1 + KH)). For any k 1, β(k) := q m log(1 + k H2 λ|S||A|m) + 2 log( 1 δ ) + rmax p 2: for k = 1, . . . , K do 3: ˆθk 1 (λI + Pk 1 k =1 Pm i=1 ϕτ k i (ϕτ k i ) ) 1 Pk 1 k =1 Pm i=1 ϕτ k i Rk i 4: Σk 1 λI + Pk 1 k =1 Pm i=1 ϕτ k i (ϕτ k i ) 5: πk argmaxπ Π((ˆϕπ k 1) ˆθk 1 +β(k 1) ˆϕπ k 1 (Σk 1) 1 +P s ,a Es1 ρ[Bπ;s ,a ;k 1 (s1)]), where Bπ;s ,a ;k 1 (s1) is defined in Eq. (32) 6: Play episode k with policy πk. Observe τ k and sum segment feedback {Rk i }m i=1 7: end for H m = Hr2 maxε2. In addition, the agent has probability only 1 n to arrive at (observe) state si. Hence, using Lemma A.1 in (Auer et al., 2002), we have that for any i [n], in state si, Ei,j[Ni,j] Eunif[Ni,j] + K 1 n Eunif[Ni,j] Hr2maxε2. 
Summing over $j \in [|A|]$, using the Cauchy--Schwarz inequality and the fact that $\sum_{j=1}^{|A|} \mathbb{E}_{\mathrm{unif}}[N_{i,j}] = K$, we have
$$\sum_{j=1}^{|A|} \mathbb{E}_{i,j}[N_{i,j}] \leq K + K r_{\max}\varepsilon\sqrt{\frac{H|A|K}{n}}.$$
Then, we have
$$\sum_{k=1}^{K} \mathbb{E}\big[V^* - V^{\pi_k}\big] \geq \Big(\frac{1}{2}+\varepsilon\Big) r_{\max} H K - \frac{1}{2} r_{\max} H K - \varepsilon r_{\max} H \cdot \frac{1}{n|A|}\sum_{i=1}^{n}\sum_{j=1}^{|A|} \mathbb{E}_{i,j}[N_{i,j}].$$
Recall that $n = |S| - 2$. Let $|S| \geq 3$, $|A| \geq 2$, $K \geq \frac{n|A|}{r_{\max}^2 H}$ and $\varepsilon = \frac{1}{2 r_{\max}}\sqrt{\frac{n|A|}{HK}}$. Then, combining the displays above yields the lower bound $R(K) = \Omega\big(\sqrt{H|S||A|K}\big)$ stated in Theorem 4.2.

D.3. Pseudo-code and Detailed Description of Algorithm LinUCB-Tran

Algorithm 4 presents the pseudo-code of LinUCB-Tran. In each episode $k$, similar to algorithm E-LinUCB, LinUCB-Tran first computes the least squares estimate $\hat{\theta}_{k-1}$ of the reward parameter and the covariance matrix $\Sigma_{k-1}$ from past observations (Lines 3-4).

Next, we introduce the transition estimation in LinUCB-Tran. We first define some notation, which also appears in algorithm SegBiTS-Tran. For any $k > 0$ and $(s, a) \in S \times A$, let $\hat{p}_k(\cdot|s, a)$ denote the empirical estimate of $p(\cdot|s, a)$, and $n_k(s, a)$ denote the number of times $(s, a)$ has been visited up to the end of episode $k$. In addition, for any policy $\pi$, let $\hat{\phi}^{\pi}_k(s, a)$ denote the expected number of times $(s, a)$ is visited in an episode under policy $\pi$ on the empirical MDP $\hat{p}_k$ (see Eq. (18) for the formal definition).

Below we establish a bound on the deviation between $\hat{\phi}^{\pi}_{k-1}$ and $\phi^{\pi}$. For ease of analysis, we first connect $\phi^{\pi}$ with a newly-defined visitation value function $G^{\pi;s',a'}_{h}(s; p')$. For any transition model $p'$, policy $\pi$ and $(s', a') \in S \times A$, if we regard hitting $(s', a')$ as yielding an instantaneous reward of one, then we can define a visitation value function:
$$G^{\pi;s',a'}_{h}(s; p') = \mathbb{1}\{s = s',\ \pi_h(s) = a'\} + p'(\cdot|s, \pi_h(s))^{\top} G^{\pi;s',a'}_{h+1}(\cdot; p'), \quad \forall s \in S,\ h \in [H]; \qquad G^{\pi;s',a'}_{H+1}(s; p') = 0, \quad \forall s \in S. \tag{31}$$
Here $G^{\pi;s',a'}_{h}(s; p')$ denotes the expected cumulative number of times $(s', a')$ is hit starting from $s$ at step $h$ under policy $\pi$ on MDP $p'$, until the end of the episode. It holds that $\phi^{\pi}(s', a') = \mathbb{E}_{s_1 \sim \rho}[G^{\pi;s',a'}_{1}(s_1; p)]$ and $\hat{\phi}^{\pi}_{k-1}(s', a') = \mathbb{E}_{s_1 \sim \rho}[G^{\pi;s',a'}_{1}(s_1; \hat{p}_{k-1})]$ for any $(s', a') \in S \times A$.
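Eq. (31) is a standard backward (value-iteration-style) recursion, and the identity $\phi^{\pi}(s', a') = \mathbb{E}_{s_1 \sim \rho}[G^{\pi;s',a'}_{1}(s_1; p)]$ can be checked numerically against a forward occupancy computation. A small sketch under our own naming, again assuming a deterministic time-dependent policy:

```python
import numpy as np

def visitation_value(p, pi, s_star, a_star, H):
    """Backward recursion of Eq. (31): G[h][s] is the expected number of
    visits to (s_star, a_star) from step h onward, starting in state s,
    under the deterministic policy pi on transition model p (S, A, S)."""
    S = p.shape[0]
    G = np.zeros((H + 2, S))             # G[H+1] = 0 is the boundary condition
    for h in range(H, 0, -1):
        for s in range(S):
            hit = float(s == s_star and pi[h - 1, s] == a_star)
            G[h, s] = hit + p[s, pi[h - 1, s]] @ G[h + 1]
    return G

def forward_visits(p, pi, rho, s_star, a_star, H):
    """Forward occupancy computation of phi^pi(s_star, a_star)."""
    d, total = rho.copy(), 0.0
    for h in range(H):
        if pi[h, s_star] == a_star:
            total += d[s_star]           # Pr[(s_h, a_h) = (s_star, a_star)]
        d = sum(d[s] * p[s, pi[h, s]] for s in range(len(d)))
    return total
```

On a randomly drawn MDP, $\rho^{\top} G_1$ from the backward pass matches the forward count up to floating-point error.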
With the definition of Gπ;s ,a h , bounding the deviation between ˆϕπ k 1 and ϕπ is similar to bounding the gap between the estimated and true value functions. Then, we can build a Bernstern-type uncertainty bound between ˆϕπ k 1 and ϕπ using the variance of Gπ;s ,a h . For any policy π, (s , a ) S A and k > 0, define Bπ;s ,a ;k h (s) = min 4 Varˆ pk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s,πh(s)) + 13H2L nk 1(s,πh(s)) H ˆpk 1( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) , H , s S, h [H], Bπ;s ,a ;k H+1 (s) = 0, s S. The construction of Bπ;s ,a ;k h (s) satisfies (see Lemma D.10 for more details) |ˆϕπ k 1(s , a ) ϕπ(s , a )| Es1 ρ h Bπ;s ,a ;k 1 (s1) i , (s , a ) S A, ˆϕπ k 1 ϕπ 1 X (s ,a ) Es1 ρ h Bπ;s ,a ;k 1 (s1) i . Incorporating this transition uncertainty Es1 ρ[Bπ;s ,a ;k 1 (s1)] and reward uncertainty ˆϕπ k 1 (Σk 1) 1 into exploration bonuses, Lin UCB-Tran computes the optimal policy πk under optimistic estimation (Line 5). After that, Lin UCB-Tran plays episode k with πk, and collects trajectory τ k and reward observation on each segment {Rk i }m i=1 (Line 6). D.4. Proof for the Regret Upper Bound with Unknown Transition In the following, we prove the regret upper bound (Theorem 4.3) of algorithm Lin UCB-Tran for unknown transition. Recall the definition of events GKL and H in Eqs. (19) and (20), respectively. For any k > 0, define the set of state-action pairs (s, a) S A : 1 k =1 wk (s, a) H log |S||A|H Dk stands for the set of state-action pairs which have sufficient visitations in expectation. Lemma D.7. Assume that event H holds. Then, if (s, a) Dk, nk 1(s, a) 1 k =1 wk (s, a). Reinforcement Learning with Segment Feedback Proof. We have nk 1(s, a) 1 k =1 wk (s, a) H log |S||A|H k =1 wk (s, a) + 1 k =1 wk (s, a) H log |S||A|H k =1 wk (s, a) + 1 k =1 wk (s, a) H log |S||A|H k =1 wk (s, a) + H 1 k =1 wk (s, a), where (a) is due to the definition of Dk (Eq. (33)). Lemma D.8. It holds that (s,a)/ Dk wk,h(s, a) 8|S||A|H log |S||A|H Proof. 
If (s, a) / Dk, then k =1 wk (s, a) < H log |S||A|H Thus, we have (s,a)/ Dk wk,h(s, a) = X h=1 1{(s, a) / Dk} wk,h(s, a) k=1 1{(s, a) / Dk} wk(s, a) 4|S||A|H log |S||A|H 8|S||A|H log |S||A|H Lemma D.9. Assume that event H holds. Then, we have wk,h(s, a) nk 1(s, a) 4|S||A| log(2KH). Proof. It holds that wk,h(s, a) nk 1(s, a) = wk(s, a) nk 1(s, a) Reinforcement Learning with Segment Feedback wk(s, a) nk 1(s, a) 1{(s, a) Dk} wk(s, a) Pk k =1 wk(s, a) 1{(s, a) Dk} wk(s, a) Pk k =1 wk(s, a) 1{(s, a) Dk} (b) 4|S||A| log(2KH), where (a) uses Lemma D.7, and (b) follows from the analysis of Lemma 13 in (Zanette & Brunskill, 2019). Lemma D.10 (Error in Visitation Vectors). Assume that event GKL holds. Then, for any k > 0 and policy π, ˆϕk 1(π) ϕ(π) 1 X s ,a Es1 ρ h Bπ;s ,a ;k 1 (s1) i . Proof. Since ϕπ(s , a ) = Es1 ρ[Gπ;s ,a 1 (s1|p)] and ˆϕπ k 1(s , a ) = Es1 ρ[Gπ;s ,a 1 (s1|ˆpk 1)], in this proof, we investigate the error in Gπ;s ,a h due to the estimation of the transition model. In the following, we prove by induction that for any h [H] and s S, |Gπ;s ,a h (s|ˆpk 1) Gπ;s ,a h (s|p)| Bπ;s ,a ;k h (s). When h = H + 1, by definition, we have Gπ;s ,a H+1 (s|ˆpk 1) = Gπ;s ,a H+1 (s|p) = Bπ;s ,a ;k H+1 (s) = 0 for any s S, and then the above statement trivially holds. When 1 h H, if |Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a h+1 ( |p)| Bπ;s ,a ;k h+1 ( ) element-wise, then for any s S, we have h (s|ˆpk 1) Gπ;s ,a = ˆpk 1( |s, πh(s)) Gπ;s ,a h+1 ( |ˆpk 1) p( |s, πh(s)) Gπ;s ,a = ˆpk 1( |s, πh(s)) Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a + (ˆpk 1( |s, πh(s)) p( |s, πh(s))) Gπ;s ,a (a) ˆpk 1( |s, πh(s)) Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a h+1 ( |p) + 2 Varp( |s,πh(s))(Gπ;s ,a h+1 ( |p)) L nk 1(s, πh(s)) + HL nk 1(s, πh(s)), (34) where (a) is due to Lemma E.4. 
Here, we have Varp( |s,πh(s))(Gπ;s ,a (a) 2Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |p)) + 4H2L nk 1(s, πh(s)) (b) 4Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) + 4H ˆpk 1( |s, πh(s)) |Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a + 4H2L nk 1(s, πh(s)), where (a) uses Lemma E.5 and (b) comes from Lemma E.6. Reinforcement Learning with Segment Feedback Varp( |s,πh(s))(Gπ;s ,a h+1 ( |p)) L nk 1(s, πh(s)) (35) 4Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) 1 H ˆpk 1( |s, πh(s)) |Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a h+1 ( |p)| 4H2L nk 1(s, πh(s)) + 2HL nk 1(s, πh(s)) Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) H ˆpk 1( |s, πh(s)) |Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a h+1 ( |p)| + 6H2L nk 1(s, πh(s)), (36) where (a) is due to the fact that xy x + y. Hence, plugging Eq. (36) into Eq. (34) and using the fact that |Gπ;s ,a h (s)| [0, H], we have h (s|ˆpk 1) Gπ;s ,a Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) + 13H2L nk 1(s, πh(s)) ˆpk 1( |s, πh(s)) Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) + 13H2L nk 1(s, πh(s)) ˆpk 1( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) = Bπ;s ,a ;k h (s), which completes the induction proof. Therefore, ˆϕπ k 1(s , a ) ϕπ(s , a ) = Es1 ρ h Gπ;s ,a 1 (s1|ˆpk 1) i Es1 ρ h Gπ;s ,a 1 (s1|p) i Es1 ρ h Gπ;s ,a 1 (s1|ˆpk 1) Gπ;s ,a 1 (s1|p) i Es1 ρ h Bπ;s ,a ;k 1 (s1) i . Summing over (s , a ) S A, we obtain this lemma. Lemma D.11. Assume that event GKL H holds. Then, for any k > 0 and policy π, Es1 ρ h Bπ;s ,a ;k 1 (s1) i s,a wπ h(s, a) Varp( |s,a)(Gπ;s ,a h+1 ( |p)) L nk 1(s, a) + 46H2L nk 1(s, a) k=1 Es1 ρ h Bπk;s ,a ;k 1 (s1) i Reinforcement Learning with Segment Feedback 16e12|S| 3 2 |A| 3 2 H p KL log(2KH) + 192e12|S|2|A|2H2L log(2KH). Proof. First, we prove the first statement. 
For any policy π, k > 0, (s , a ) S A, h [H] and s S, we have Bπ;s ,a ;k h (s) 4 Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) + 13H2L nk 1(s, πh(s)) ˆpk 1( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) + 13H2L nk 1(s, πh(s)) p( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) (ˆpk 1( |s, πh(s)) p( |s, πh(s))) Bπ;s ,a ;k h+1 ( ) Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) + 13H2L nk 1(s, πh(s)) p( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) Varp( |s,πh(s))(Bπ;s ,a ;k h+1 ( )) L nk 1(s, πh(s)) + HL nk 1(s, πh(s)) Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) + 13H2L nk 1(s, πh(s)) p( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) 1 H p( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) H2L nk 1(s, πh(s)) + HL nk 1(s, πh(s)) Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) + 22H2L nk 1(s, πh(s)) p( |s, πh(s)) Bπ;s ,a ;k h+1 ( ), (37) where (a) uses Lemma E.4, and (b) follows from the fact that xy x + y. In addition, we have Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) (a)= 2Varp( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) + 4H2L nk 1(s, a) (b) 4Varp( |s,πh(s))(Gπ;s ,a h+1 ( |p)) + 4Hp( |s, πh(s)) Gπ;s ,a h+1 ( |ˆpk 1) Gπ;s ,a + 4H2L nk 1(s, a) 4Varp( |s,πh(s))(Gπ;s ,a h+1 ( |p)) + 4Hp( |s, πh(s)) Bπ;s ,a h+1 ( |ˆpk 1) + 4H2L nk 1(s, a), Reinforcement Learning with Segment Feedback where (a) uses Lemma E.5, and (b) comes from Lemma E.6. Varˆpk 1( |s,πh(s))(Gπ;s ,a h+1 ( |ˆpk 1)) L nk 1(s, πh(s)) 4Varp( |s,πh(s))(Gπ;s ,a h+1 ( |p)) L nk 1(s, πh(s)) + 1 H p( |s, πh(s)) Bπ;s ,a h+1 ( |ˆpk 1) 4H2L nk 1(s, πh(s)) + 2HL nk 1(s, a) Varp( |s,πh(s))(Gπ;s ,a h+1 ( |p)) L nk 1(s, πh(s)) + 1 H p( |s, πh(s)) Bπ;s ,a h+1 ( |ˆpk 1) + 6H2L nk 1(s, πh(s)) (38) Plugging Eq. (38) into Eq. 
(37) and using the clipping definition of Bπ;s ,a ;k h (s), we have Bπ;s ,a ;k h (s) Varp( |s,πh(s))(Gπ;s ,a h+1 ( |p)) L nk 1(s, πh(s)) + 46H2L nk 1(s, πh(s)) p( |s, πh(s)) Bπ;s ,a ;k h+1 ( ) Using the above inequality, taking s1 ρ, and unfolding Bπ;s ,a ;k 1 (s1) over h, we have Es1 ρ h Bπ;s ,a ;k 1 (s1) i s,a wπ h(s, a) Varp( |s,a)(Gπ;s ,a h+1 ( |p)) L nk 1(s, a) + 46H2L nk 1(s, a) Next, we prove the second statement. It holds that k=1 Es1 ρ h Bπk;s ,a ;k 1 (s1) i (s,a) Dk wk,h(s, a) Varp( |s,a)(Gπk;s ,a h+1 ( |p)) L nk 1(s, a) + 46H2L nk 1(s, a) + e12H|S||A| (s,a)/ Dk wk,h(s, a) (s,a) Dk wk,h(s, a)Varp( |s,a)(Gπk;s ,a h+1 ( |p)) wk,h(s, a) nk 1(s, a) + e12|S||A| 46H2L wk,h(s, a) nk 1(s, a) + 8e12|S|2|A|2H2 log |S||A|H Reinforcement Learning with Segment Feedback (b) 8e12|S||A| 4|S||A| log(2KH) + 184e12|S|2|A|2H2L log(2KH) + 8e12|S|2|A|2H2 log |S||A|H 16e12|S| 3 2 |A| 3 2 H p KL log(2KH) + 192e12|S|2|A|2H2L log(2KH), where (a) is due to Lemma D.8, and (b) follows from Lemmas E.3 and D.9. Lemma D.12 (Optimism under Sum Feedback and Unknown Transition). Assume that event GKL holds. Then, for any k > 0 and fixed policy π, V π 1 (s1) ˆϕk 1(π) ˆθk 1 + β(k 1) ˆϕk 1(π) (Σk 1) 1 + rmax X s ,a Es1 ρ h Bπ;s ,a ;k 1 (s1) i . Proof. It holds that V π 1 (s1) = ϕ(π) θ = ˆϕk 1(π) ˆθk 1 + ϕ(π) θ ˆϕk 1(π) θ + ˆϕk 1(π) θ ˆϕk 1(π) ˆθk 1 ˆϕk 1(π) ˆθk 1 + ϕ(π) ˆϕk 1(π) 1 θ + β(k 1) ˆϕk 1(π) (Σk 1) 1 (a) ˆϕk 1(π) ˆθk 1 + β(k 1) ˆϕk 1(π) (Σk 1) 1 + rmax X s ,a Es1 ρ h Bπ;s ,a ;k 1 (s1) i , where (a) uses Lemma D.10. Lemma D.13. For any K 1, we have ϕτ k i (Σk 1) 1 H λ log 1 + KH2 Proof. We have ϕτ k i (Σk 1) 1 (Σk 1) 1 , H2 v u u t H2K (Σk 1) 1 , 1 v u u t2H2K (Σk 1) 1 , 1 v u u t2H2K λ log 1 + KH2 where inequality (a) uses the fact that x 2 log(1 + x) for any 0 x 1, inequality (b) is due to the fact that λ H2 m , and inequality (c) follows from Lemma D.3. 
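The log-determinant manipulations in Lemmas D.3, D.4 and D.13 are instances of the standard elliptical-potential argument: with $\Sigma_k = \lambda I + \sum_{k' \le k} x_{k'} x_{k'}^{\top}$, the matrix-determinant lemma gives $\det \Sigma_k = \det \Sigma_{k-1}\,(1 + \|x_k\|^2_{\Sigma_{k-1}^{-1}})$, and the clipped potential $\sum_k \min\{1, \|x_k\|^2_{\Sigma_{k-1}^{-1}}\}$ is bounded by twice the log-determinant gain. A small self-contained numerical check (dimensions and data are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, lam = 4, 200, 1.0
Sigma = lam * np.eye(d)
potential, logdet_incr = 0.0, 0.0
for _ in range(K):
    x = rng.uniform(-1.0, 1.0, d)
    u = x @ np.linalg.solve(Sigma, x)    # ||x||^2 in the Sigma^{-1} norm
    potential += min(1.0, u)
    logdet_incr += np.log(1.0 + u)       # det(Sigma + x x^T) = det(Sigma) (1 + u)
    Sigma += np.outer(x, x)
sign, logdet = np.linalg.slogdet(Sigma)
gain = logdet - d * np.log(lam)          # log det(Sigma_K) - log det(lam I)
assert abs(logdet_incr - gain) < 1e-6    # matrix-determinant lemma, telescoped
assert potential <= 2.0 * gain           # clipped elliptical-potential bound
```

Both assertions hold deterministically: the first is the telescoped determinant identity, and the second uses $\min\{1, u\} \le 2\log(1+u)$ for $u \ge 0$.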
Reinforcement Learning with Segment Feedback Define event Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 4H Event FS reg is similar to FS opt, except that here the universal upper bound of ϕτ (Σk 1) 1 is H λ rather than 1. Lemma D.14. It holds that Pr FS reg 1 δ . Proof. For any k 1, we have that ϕτ (Σk 1) 1 H λ, and then |Eτ πk [ ϕτ (Σk 1) 1 |Fk 1] ϕτ (Σk 1) 1 | Using the Azuma-Hoeffding inequality, we have that for any fixed k > 0, with probability at least 1 δ 2k2 , Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 λ k log 4k2 Since P k=1 δ 2k2 δ , by a union bound over k, we have that with probability at least δ , for any k 1, Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕτ (Σk 1) 1 λ k log 4k2 Proof of Theorem 4.3. Let δ = δ 4. Then, we have Pr[K FS reg GKL H] 1 δ. Thus, it suffices to prove the regret upper bound when event K FS reg GKL H holds. Assume that event K FS reg GKL H holds. For any k > 0, we have V (s1) V πk(s1) ˆϕk 1(π ) ˆθk 1 + β(k 1) ˆϕk 1(π ) (Σk 1) 1 + rmax X s ,a Es1 ρ h Bπ ;s ,a ;k 1 (s1) i ˆϕk 1(πk) ˆθk 1 + β(k 1) ˆϕk 1(πk) (Σk 1) 1 + rmax X s ,a Es1 ρ h Bπk;s ,a ;k 1 (s1) i ˆϕk 1(πk) ˆθk 1 ˆϕk 1(πk) θ + ˆϕk 1(πk) θ (ϕπk) θ Reinforcement Learning with Segment Feedback + β(k 1) ˆϕk 1(πk) (Σk 1) 1 + rmax X s ,a Es1 ρ h Bπk;s ,a ;k 1 (s1) i ! 2β(k 1) ˆϕk 1(πk) (Σk 1) 1 + 2rmax X s ,a Es1 ρ h Bπk;s ,a ;k 1 (s1) i k=1 ˆϕk 1(πk) (Σk 1) 1 + 2rmax s ,a Es1 ρ h Bπk;s ,a ;k 1 (s1) i , (41) where (a) uses Lemma D.12, (b) is due to the definition of πk, and (c) follows from Lemma D.10 and the definition of event K. Next, we first bound PK k=1 ˆϕk 1(πk) (Σk 1) 1. 
k=1 ˆϕk 1(πk) (Σk 1) 1 ϕπk (Σk 1) 1 + ˆϕk 1(πk) ϕπk (Σk 1) 1 ϕπk (Σk 1) 1 + 1 λ ˆϕk 1(πk) ϕπk 2 ϕπk (Σk 1) 1 + 1 λ ˆϕk 1(πk) ϕπk 1 Here we have k=1 ϕπk (Σk 1) 1 k=1 Eτ πk [ϕτ|Fk 1] (Σk 1) 1 k=1 Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕ(τ k) (Σk 1) 1 + ϕ(τ k) (Σk 1) 1 Eτ πk h ϕτ (Σk 1) 1 |Fk 1 i ϕ(τ k) (Σk 1) 1 + ϕτ k i (Σk 1) 1 λ log 1 + KH2 where (a) uses the Jensen inequality, and (b) comes from the definition of FS reg and Lemma D.13. Hence, plugging Eq. (43) into Eq. (42) and using Lemma D.10, we have k=1 ˆϕk 1(πk) (Σk 1) 1 4H λ log 1 + KH2 s ,a Es1 ρ h Bπ;s ,a ;k 1 (s1) i . (44) Reinforcement Learning with Segment Feedback On the other hand, according to Eq. (39), we have Es1 ρ h Bπ;s ,a ;k 1 (s1) i e12 H X s,a wπ h(s, a) Varp( |s,a)(Gπ;s ,a h+1 ( |p)) L nk 1(s, a) + 46H2L nk 1(s, a) Therefore, plugging Eqs. (44) and (39) into Eq. (41), we have λ log 1 + KH2 k=1 Es1 ρ h Bπk;s ,a ;k 1 (s1) i λ log 1 + KH2 16e12|S| 3 2 |A| 3 2 H p KL log(2KH) + 192e12|S|2|A|2H2L log(2KH) m log 1 + KH2 λ log 1 + KH2 + |S| 3 2 |A| 3 2 H λ log(KH) + |S|2|A|2H2L λ log(KH) ! (1 + rmax)|S|2|A|2H K log 1 + KH + (1 + rmax)|S| 5 2 |A| 5 2 H2L log(KH) = O (1 + rmax)|S| 5 2 |A|2H K + (1 + rmax)|S| 7 2 |A| 5 2 H2 , where inequality (a) comes from Lemma D.11, and equality (b) uses the fact that λ := H D.5. A Lower Bound for Unknown Transition and its Proof Below we provide a lower bound for RL with sum segment feedback and unknown transition with the proof. Theorem D.15. Consider the problem of RL with sum segment feedback and unknown transition. There exists a distribution of instances where the regret of any algorithm must be Proof of Theorem D.15. We construct a random instance I as follows. As shown in Figure 7, there are n bandit states s1, . . . , sn (i.e., there are an optimal action and multiple suboptimal actions), a good absorbing state sn+1 and a bad absorbing state sn+2. The agent starts from s1, . . . , sn with equal probability 1 n. 
For any $i \in [n]$, in state $s_i$, one action $a^*$ is uniformly chosen from $A$ as the optimal action. In state $s_i$, under the optimal action $a^*$, the agent transitions to $s_{n+1}$ and $s_{n+2}$ with probabilities $\frac{1}{2}+\varepsilon$ and $\frac{1}{2}-\varepsilon$, respectively, where $\varepsilon \in (0, \frac{1}{4})$ is a parameter specified later; under any suboptimal action $a \in A \setminus \{a^*\}$, the agent transitions to $s_{n+1}$ and $s_{n+2}$ with equal probability $\frac{1}{2}$. The rewards are deterministic for all state-action pairs. For any $a \in A$, $r(s_{n+1}, a) = r_{\max}$. For any $i \in \{1, \dots, n, n+2\}$ and $a \in A$, $r(s_i, a) = 0$.

Figure 7. Instance for the lower bound under sum segment feedback and unknown transition. (In state $s_i$, the optimal action transitions to $s_{n+1}$ w.p. $\frac{1}{2}+\varepsilon$ and to $s_{n+2}$ w.p. $\frac{1}{2}-\varepsilon$; a suboptimal action transitions to each w.p. $\frac{1}{2}$; $r(s_{n+1}, \cdot) = r_{\max}$ and $r(s_{n+2}, \cdot) = 0$.)

In this proof, we will also use an alternative uniform instance $I_{\mathrm{unif}}$. The only difference between $I_{\mathrm{unif}}$ and $I$ is that for any $i \in [n]$, in state $s_i$, under all actions $a \in A$, the agent transitions to $s_{n+1}$ and $s_{n+2}$ with equal probability $\frac{1}{2}$.

Fix an algorithm $\mathcal{A}$. Let $\mathbb{E}_{\mathrm{unif}}[\cdot]$ denote the expectation with respect to $I_{\mathrm{unif}}$, and $\mathbb{E}[\cdot]$ the expectation with respect to $I$. For any $i \in [n]$ and $j \in [|A|]$, let $\mathbb{E}_{i,j}[\cdot]$ denote the expectation with respect to the case where $a_j$ is the optimal action in state $s_i$, and let $N_{i,j}$ denote the number of episodes in which algorithm $\mathcal{A}$ chooses $a_j$ in state $s_i$, i.e., $N_{i,j} = \sum_{k=1}^{K} \mathbb{1}\{\pi^k_1(s_i) = a_j\}$.

The KL divergence of the transition distribution at $(s_i, a^*)$ ($i \in [n]$) between $I_{\mathrm{unif}}$ and $I$ is
$$\frac{1}{2}\ln\frac{1/2}{1/2-\varepsilon} + \frac{1}{2}\ln\frac{1/2}{1/2+\varepsilon} = \frac{1}{2}\ln\frac{1/4}{1/4-\varepsilon^2} \overset{(a)}{\leq} 4\varepsilon^2,$$
where (a) uses the fact that $-\ln(1-x) \leq 2x$ for $x \in (0, \frac{1}{2}]$.

In addition, the agent has probability only $\frac{1}{n}$ of arriving at (observing) state $s_i$. Thus, using Lemma A.1 in (Auer et al., 2002), we have that for any $i \in [n]$, in state $s_i$,
$$\mathbb{E}_{i,j}[N_{i,j}] \leq \mathbb{E}_{\mathrm{unif}}[N_{i,j}] + \frac{K}{2}\sqrt{\frac{1}{n}\,\mathbb{E}_{\mathrm{unif}}[N_{i,j}] \cdot 4\varepsilon^2} = \mathbb{E}_{\mathrm{unif}}[N_{i,j}] + K\varepsilon\sqrt{\frac{1}{n}\,\mathbb{E}_{\mathrm{unif}}[N_{i,j}]}.$$
Summing over $j \in [|A|]$, using the Cauchy--Schwarz inequality and the fact that $\sum_{j=1}^{|A|} \mathbb{E}_{\mathrm{unif}}[N_{i,j}] = K$, we have
$$\sum_{j=1}^{|A|} \mathbb{E}_{i,j}[N_{i,j}] \leq K + K\varepsilon\sqrt{\frac{|A|K}{n}}.$$
Then, we have
$$\sum_{k=1}^{K} \mathbb{E}\big[V^* - V^{\pi_k}\big] \geq \Big(\frac{1}{2}+\varepsilon\Big)(H-1) r_{\max} K - \frac{1}{2}(H-1) r_{\max} K - \varepsilon (H-1) r_{\max} \cdot \frac{1}{n|A|}\sum_{i=1}^{n}\sum_{j=1}^{|A|} \mathbb{E}_{i,j}[N_{i,j}].$$
Recall that $n = |S| - 2$. Let $|S| \geq 3$, $|A| \geq 2$, $H \geq 2$, $K > |A|n$ and $\varepsilon = \frac{1}{4}\sqrt{\frac{n|A|}{K}}$. Then, we have
$$R(K) = \Omega\Big(r_{\max} H \sqrt{|S||A|K}\Big).$$

E. Technical Tools

In this section, we introduce several technical tools.

Lemma E.1 (Self-concordance, Lemma 9 in (Faury et al., 2020)). For any $x_1, x_2 \in \mathbb{R}$, we have
$$\mu'(x_1)\,\frac{1 - \exp(-|x_1 - x_2|)}{|x_1 - x_2|} \leq \int_{z=0}^{1} \mu'\big((1-z)x_1 + z x_2\big)\,dz \leq \mu'(x_1)\,\frac{\exp(|x_1 - x_2|) - 1}{|x_1 - x_2|}.$$
Furthermore, we have
$$\int_{z=0}^{1} \mu'\big((1-z)x_1 + z x_2\big)\,dz \geq \frac{\mu'(x_1)}{1 + |x_1 - x_2|}.$$

Lemma E.2 (Value Difference Lemma, Lemma E.15 in (Dann et al., 2017)). For any two MDPs $M'$ and $M''$ with rewards $r'$ and $r''$ and transition distributions $p'$ and $p''$, we have that for any $h \in [H]$ and $s \in S$,
$$V'_h(s) - V''_h(s) = \mathbb{E}_{p''}\Bigg[\sum_{t=h}^{H}\Big(r'(s_t, a_t) - r''(s_t, a_t) + \big(p'(\cdot|s_t, a_t) - p''(\cdot|s_t, a_t)\big)^{\top} V'_{t+1}(\cdot)\Big)\,\Big|\, s_h = s\Bigg].$$

Lemma E.3 (Law of Total Variance, Lemma 15 in (Zanette & Brunskill, 2019)). For an MDP $p$ and a fixed policy $\pi$, we have
$$\mathbb{E}_{\pi}\Bigg[\Big(\sum_{h=1}^{H} r(s_h, \pi_h(s_h)) - V^{\pi}_1(s_1)\Big)^2\,\Big|\, s_1\Bigg] = \mathbb{E}_{\pi}\Bigg[\sum_{h=1}^{H} \mathrm{Var}_{s_{h+1} \sim p(\cdot|s_h, \pi_h(s_h))}\big(V^{\pi}_{h+1}(s_{h+1})\big)\,\Big|\, s_1\Bigg].$$
The idea of Lemma E.3 was also used in earlier works, e.g., (Munos & Moore, 1999; Lattimore & Hutter, 2012; Gheshlaghi Azar et al., 2013).

Lemma E.4 (Lemma 10 in (Ménard et al., 2021)). For distributions $p, q \in \Delta_S$ and a function $f: S \to [0, b]$, if $\mathrm{KL}(p, q) \leq \alpha$, then
$$\big|(p(\cdot) - q(\cdot))^{\top} f(\cdot)\big| \leq \sqrt{2\,\mathrm{Var}_q(f)\,\alpha} + \frac{2}{3} b\alpha.$$

Lemma E.5 (Lemma 11 in (Ménard et al., 2021)). For distributions $p, q \in \Delta_S$ and a function $f: S \to [0, b]$, if $\mathrm{KL}(p, q) \leq \alpha$, then
$$\mathrm{Var}_q(f) \leq 2\,\mathrm{Var}_p(f) + 4b^2\alpha.$$

Lemma E.6 (Lemma 12 in (Ménard et al., 2021)). For a distribution $p \in \Delta_S$ and functions $f, g: S \to [0, b]$, we have
$$\mathrm{Var}_p(f) \leq 2\,\mathrm{Var}_p(g) + 2b\,p(\cdot)^{\top}\big|f(\cdot) - g(\cdot)\big|.$$
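Lemma E.6 is an elementary inequality: writing $f = g + (f - g)$ gives $\mathrm{Var}_p(f) \leq 2\,\mathrm{Var}_p(g) + 2\,\mathbb{E}_p[(f-g)^2]$, and $|f - g| \leq b$ bounds the last term by $2b\,\mathbb{E}_p|f - g|$. It can be checked directly on random finite distributions; a quick self-contained sketch (not part of the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
b = 5.0
for _ in range(200):
    n = rng.integers(2, 8)
    p = rng.dirichlet(np.ones(n))            # a distribution on n outcomes
    f = rng.uniform(0.0, b, n)               # functions with range in [0, b]
    g = rng.uniform(0.0, b, n)
    var = lambda h: p @ (h - p @ h) ** 2     # Var_p(h) for a finite distribution
    # Lemma E.6: Var_p(f) <= 2 Var_p(g) + 2 b * E_p|f - g|
    assert var(f) <= 2.0 * var(g) + 2.0 * b * (p @ np.abs(f - g)) + 1e-9
```

The inequality holds for every draw, since the derivation above is distribution-free.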