Online Restless Bandits with Unobserved States

Bowen Jiang 1, Bo Jiang 1, Jian Li 2, Tao Lin 3, Xinbing Wang 1, Chenghu Zhou 4

1 Shanghai Jiao Tong University, Shanghai, China. 2 SUNY Binghamton University, Binghamton, NY, USA. 3 Communication University of China, Beijing, China. 4 Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China. Correspondence to: Bo Jiang. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We study the online restless bandit problem, where each arm evolves independently according to a Markov chain, and the reward of pulling an arm depends on both the pulled arm and the current state of the corresponding Markov chain. The agent (decision maker) does not know the transition functions or the reward functions, and cannot observe the states of the arms even after pulling. The goal is to sequentially choose which arms to pull so as to maximize the expected cumulative reward. We propose TSEETC, a learning algorithm based on Thompson Sampling with Episodic Explore-Then-Commit. The algorithm proceeds in episodes of increasing length, and each episode is divided into an exploration phase and an exploitation phase. During the exploration phase, samples of action-reward pairs are collected in a round-robin fashion and used to update the posterior distribution, which takes the form of a mixture of Dirichlet distributions. At the beginning of the exploitation phase, TSEETC draws a sample from the posterior distribution and treats it as the true parameters; it then follows the optimal policy for the sampled model for the rest of the episode. We establish a Bayesian regret bound of $O(\sqrt{T})$ for TSEETC, where $T$ is the time horizon. We show through simulations that TSEETC outperforms existing algorithms in regret.

1. Introduction

The restless multi-armed bandit (RMAB) is a general setup that models many sequential decision-making problems, ranging from wireless communication (Tekin & Liu, 2011; Sheng et al., 2014; Xiong et al., 2022b) and sensor/machine maintenance (Ahmad et al., 2009; Akbarzadeh & Mahajan, 2021) to healthcare (Mate et al., 2020; 2021). This problem considers one agent and $N$ arms. Each arm $i$ is modulated by a Markov chain $M^i$ with state transition function $P^i$ and reward function $R^i$. At each time, the agent decides which arm to pull. After the pull, all arms make a state transition independently. The state transitions can be action-dependent or not; we consider the action-independent case. That is, the state of each arm makes one transition per time slot regardless of whether it is pulled, and the transition function is the same whether the arm is pulled or not. The goal is to decide which arm to pull to maximize the expected reward, i.e., $\mathbb{E}[\sum_{t=1}^{T} r_t]$, where $r_t$ is the reward at time $t$ and $T$ is the time horizon. In this paper, we consider the online restless bandit problem with unknown parameters (transition functions and reward functions) and unobserved states. Many works assume the arm states are observed and concentrate on learning the unknown parameters (Liu et al., 2010; 2011; Ortner et al., 2012; Wang et al., 2020; Xiong et al., 2022a;d;c). However, the arm states are often unobserved in real-world applications, such as cache access (Paria & Sinha, 2021) and recommender systems (Peng et al., 2020).
In the cache access problem, the user can only get the perceived delay but cannot know whether the requested content is stored in the cache before or after the access. In the recommender system, we do not know the user s preference for the items. There are some studies that consider the unobserved states. However, they often assume the parameters are known (Mate et al., 2020; Meshram et al., 2018; Akbarzadeh & Mahajan, 2021) or there is no discussion about the regret bound (Peng et al., 2020; Hu et al., 2020). One common way to handle the unknown parameters but with observed states is to use the optimism in the face of uncertainty (OFU) principle (Liu et al., 2010; Ortner et al., 2012; Wang et al., 2020). However, existing policies may not perform close to the optimal offline policy, e.g., Liu et al. (2010) only considers the best policy that constantly pulls a fixed arm, which is not optimal for RMAB problems. Ortner et al. (2012) derives the lower bound O( T) for RMAB problems. Another way to estimate the unknown parameters is Thompson Sampling (TS) (Jung & Tewari, 2019; Jung et al., 2019; Jahromi et al., 2022; Hong et al., 2022). TS Online Restless Bandits with Unobserved States algorithms do not need to solve all instances that lie within the confident sets as OFU-based algorithms (Ouyang et al., 2017). What s more, empirical studies suggest that TS algorithms outperform OFU-based algorithms in bandits and Markov decision process (MDP) problems (Scott, 2010; Chapelle & Li, 2011; Osband & Van Roy, 2017). Some studies assume that only the states of pulled arms are observable (Mate et al., 2020; Liu & Zhao, 2010; Wang et al., 2020; Jung & Tewari, 2019). They translate the partially observable Markov decision process (POMDP) problem into a fully observable MDP by regarding the state last observed and the time elapsed as a meta-state (Mate et al., 2020; Jung & Tewari, 2019). Mate et al. (2020), and Liu & Zhao (2010) derive the optimal index policy but they assume the parameters are known. Restless-UCB in Wang et al. (2020) achieves the regret bound of O(T 2/3) and their algorithm is restricted to restless bandit problems with birth-death state Markov chains. A general Markov chain is considered in (Xiong et al., 2022c) with a regret bound of O( T) guarantee. There are also some works that consider that the arm state is not visible even after pulling (Meshram et al., 2018; Akbarzadeh & Mahajan, 2021; Peng et al., 2020; Hu et al., 2020; Zhou et al., 2021; Yemini et al., 2021) and the classical POMDP setting (Jahromi et al., 2022). Meshram et al. (2018) and Akbarzadeh & Mahajan (2021) study the RMAB problem with unobserved states but with known parameters. However, the true value of the parameters are often unavailable in practice. Some works study POMDP problem from a learning perspective, e.g., Peng et al. (2020); Hu et al. (2020), but there is no regret analysis. Under the unobserved states setting, the state-of-the-art algorithm achieves O(T 2/3) bound on the frequentist regret (Zhou et al., 2021). Yemini et al. (2021) considers the arms are modulated by two unobserved states and with linear reward. This linear structure is quite a bit of side information that the decision maker can take advantage of and a instancedependent regret bound of log(T) is given. To the best of our knowledge, there is no known policy that performs close to the offline optimum with a provable regret bound of O( T) for online restless bandits with unobserved states even after pulling. 
The unobserved states and unknown parameters bring many challenges. First, we need to control estimation error about states, which are not directly observed. Second, the error depends on the model parameters in a complex way via Bayesian updating and the parameters are still unknown. Third, since the state is not fully observable, the decision-maker cannot keep track of the number of visits to state-action pairs, a quantity that is crucial in the theoretical analysis. To deal with this challenge, we design a learning algorithm TSEETC to estimate these unknown parameters and update the posterior distribution about unknown parameters as mixture of Dirichlet distributions. We define the pseudo-count about the number of visits to state-action based on Dirichlet distribution and with this pseudo-count we obtain a bound about parameters estimation errors. Benchmarked on a stronger oracle, we show that our algorithm achieves a bound O( T) in Bayesian regret. In summary, we make the following contributions: Problem formulation. We consider the online restless bandit problems with unobserved states and unknown parameters. Compared with Jahromi et al. (2022), our reward functions are unknown. Algorithmic design. We propose TSEETC, a learning algorithm based on Thompson Sampling with Episodic Explore Then-Commit. The whole learning horizon is divided into episodes of increasing length, each of which consists of exploration and exploitation phases. During the exploration phase, we utilize a mixture of Dirichlet distributions to update posterior distributions and estimate unknown parameters. The belief state is implemented to encode previous historical information for unobserved states. In the exploitation phase, we sample parameters from the posterior distribution and derive an optimal policy based on the sampled parameter. Furthermore, we design increasing episode lengths to control the total number of episodes, which is crucial to bound the regret caused by exploration. Regret analysis. We consider a stronger oracle which solves POMDP based on our belief state. And we define the pseudo-count to store the state-action pairs. Under a Bayesian framework, we show that the expected regret of TSEETC accumulated up to time T is bounded by O( T), where O hides logarithmic factors. This is the first O( T) Bayesian regret bound in the setting with unknown parameters and unobserved states even after pulling the arm. Experiment results. We conduct the proof-of-concept experiments, and compare our policy with existing baseline algorithms. Our results show that TSEETC outperforms existing algorithms in regret and the regret order is consistent with our theoretical result. 2. Related Work We review the related works in two main domains: learning algorithm for unknown parameters, and methods to identify unknown states. Unknown parameters. Since the system parameters are unknown in advance, it is essential to study RMAB problems from a learning perspective. Generally speaking, these works can be divided into two categories: OFU (Ortner et al., 2012; Wang et al., 2020; Xiong et al., 2022a; Zhou et al., 2021; Xiong et al., 2022d;c) or TS based (Jung et al., 2019; Jung & Tewari, 2019; Jahromi et al., 2022; Hong et al., 2022). The algorithms based on OFU often construct Online Restless Bandits with Unobserved States confidence sets for the system parameters at each time, find the optimistic estimator that is associated with the maximum reward, and then select an action based on the optimistic estimator. 
Apart from these works, Thompson sampling (Jung & Tewari, 2019; Jung et al., 2019) were used to solve this problem. A TS algorithm generally samples a set of MDP parameters randomly from the posterior distribution, then actions are selected based on the sampled model. Jung & Tewari (2019) and Jung et al. (2019) provide theoretical guarantee O( T) in the Bayesian setting for the online restless bandit with partially observed states. TS algorithms are confirmed to outperform optimistic algorithms in bandit and MDP problems (Scott, 2010; Chapelle & Li, 2011; Osband & Van Roy, 2017). Unknown states. There are some works that consider the states of the pulled arm are unobserved (Mate et al., 2020; Liu & Zhao, 2010; Wang et al., 2020; Jung & Tewari, 2019). Mate et al. (2020) and Liu & Zhao (2010) assumes the unobserved states but with known parameters. Wang et al. (2020) constructs an offline instance and give the regret bound O(T 2/3). Jung & Tewari (2019) considers the episodic RMAB problems with observed states about pulled arms and the regret bound O( T) is guaranteed in the Bayesian setting. Some studies assume that the states are unobserved even after pulling. Akbarzadeh & Mahajan (2021) and Meshram et al. (2018) consider the RMAB problem with unknown states but known system parameters. And there is no regret guarantee. Peng et al. (2020) and Hu et al. (2020) consider the unknown parameters but there are also no any theoretical results. The most similar to our work is Zhou et al. (2021) and Jahromi et al. (2022). Zhou et al. (2021) considers that all arms are modulated by a common unobserved Markov chain. They proposed the estimation method based on spectral method (Anandkumar et al., 2012) and learning algorithm based on upper confidence bound (UCB) strategy (Auer et al., 2002). They also give the regret bound O(T 2/3). Jahromi et al. (2022) considers the POMDP setting and propose the pseudo counts to store the state-action pairs. Their learning algorithm is based on Ouyang et al. (2017) and the regret bound is also O(T 2/3). 3. Problem Setting Consider a restless bandit problem with one agent and N arms. Each arm i [N] := {1, 2, . . . , N} is associated with an independent discrete time Markov chain Mi = (Si, P i), where Si is the discrete state space and P i RSi Si the transition functions. Let si t denote the state of arm i at time t and st = (s1 t, s2 t, . . . , s N t ) the state of all arms. Each arm i is also associated with a reward function Ri RSi R, where Ri (r | s) is the probability that the agent receives a reward r R when he pulls arm i in state s. We assume the state spaces Si and the reward set R are finite and known to the agent. The parameters P i and Ri, i [N] are unknown, and the state st is also unobserved to the agent. For the sake of notational simplicity, we assume that all arms have the same state spaces S with size S. Our result can be generalized in a straightforward way to allow different state spaces. The whole game is divided into T time steps. The initial state si 0 for each arm i [N] is drawn from a distribution hi independently, which we assume to be known to the agent. At each time t, the agent chooses one arm at [N] to pull and receives a reward rt R with probability Rat(rt | sat t ). Note that only the pulled arm has the reward feedback. His decision on which arm at to pull is based on the observed history Ht = [a1, r1, a2, r2 , at 1, rt 1]. Note that the states of the arms are never observed, even after pulling. 
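To make this interaction protocol concrete, the following minimal Python sketch simulates such an environment. The class, method, and variable names are our own illustrative choices, not the authors' code; the dynamics follow the description above (every arm transitions at every step, and only the pulled arm emits an observable reward).

```python
import numpy as np

class HiddenRestlessEnv:
    """Illustrative simulator: N independent hidden Markov chains (arms).
    The agent observes only the reward of the pulled arm, never the states."""

    def __init__(self, P, R, reward_set, h, seed=0):
        # P[i]: S x S transition matrix of arm i, P[i][s, s'] = P^i(s' | s)
        # R[i]: S x |reward_set| emission matrix, R[i][s, j] = R^i(reward_set[j] | s)
        # h[i]: initial state distribution of arm i (assumed known to the agent)
        self.P, self.R, self.reward_set = P, R, reward_set
        self.rng = np.random.default_rng(seed)
        self.states = [self.rng.choice(len(h_i), p=h_i) for h_i in h]

    def step(self, arm):
        # reward is emitted according to the pulled arm's current hidden state
        j = self.rng.choice(len(self.reward_set), p=self.R[arm][self.states[arm]])
        # every arm transitions, pulled or not (action-independent dynamics)
        self.states = [self.rng.choice(len(P_i), p=P_i[s])
                       for P_i, s in zip(self.P, self.states)]
        return self.reward_set[j]   # the agent records only (a_t, r_t)
```

An agent interacting with such a simulator accumulates exactly the history $H_t$ of actions and rewards described above, with the arm states remaining hidden throughout.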
Each arm i makes a state transition independently according to the associated P i, whether it is pulled or not. This process continues until the end of the game. The goal of the agent is to maximize the total expected reward. We use θi to denote the unknown P i and Ri for arm i and denote θ as the unknown P i and Ri for all i [N] collectively. Since the true states are unobservable, the agent maintains a belief state bi t = [bi t(s, θi), s S] S for each arm i, where bi t(s, θi) := P si t = s | Ht, θi , and S := b RS + : P s S b(s) = 1 is the probability simplex in RS. Note that bi t(s, θi) depends on the unknown model parameter θi, which itself has to be learned by the agent. We aggregate all arms as a whole Markov chain M and denote its transition matrix and reward function as P and R, respectively. Note that the states of the arms at any given time t are independent, since the initial states are independent and they also evolve independently. As a consequence, for a given θ, the overall belief state bt = (b1 t, b2 t, , b N t ) is a sufficient statistic for Ht 1 (Smallwood & Sondik, 1973). Thus the agent can base his decision at time t on bt only. Let b = N i=1 S be the state space of the overall belief state of the system. A deterministic stationary policy π : b [N] maps a belief state to an action. The long-term average reward of a policy π is defined as Jπ(h, θ) := lim sup T t=1 rt h, θ We use J(h, θ) = supπ Jπ(h, θ) to denote the optimal longterm average reward. We assume J(h, θ) is independent of the initial distribution h as in Jahromi et al. (2022) and denote it by J(θ). We make the following assumptions. Assumption 3.1. The smallest element ϵ1 in the transition functions P i, i N is larger than zero. Online Restless Bandits with Unobserved States Assumption 3.2. The smallest element ϵ2 in the reward functions Ri, i N is larger than zero. Assumption 3.1 and Assumption 3.2 are strong in general, but they help us bound the error of belief estimation (De Castro et al., 2017). Assumption 3.1 also makes the MDP weakly communicating (Bertsekas et al., 2011). For weakly communicating MDP, it is known that there exists a bounded function v( , θ) : b R such that for all b b (Bertsekas et al., 2011), J(θ)+v(b, θ) = max a r(b, a) + X r P(r | b, a, θ)v (b , θ) (2) where v is the relative value function, r(b, a) = P r ba(s, θ)Ra(r | s)r is the expected reward, b is the updated belief after obtaining the reward r, and P(r | b, a, θ) is the probability of observing r in the next step, conditioned on the current belief b and action a. The corresponding optimal policy is the maximizer of the right part in (2). Since the value function v(, θ) is finite, we can bound the span function sp(θ) := maxb v(b, θ) minb v(b, θ) as in Zhou et al. (2021). We show the details about this bound in Proposition D.1 and denote the bound by H. We consider the Bayesian regret. The parameters θ is randomly generated from a known prior distribution Q at the beginning and then fixed but unknown to the agent. We measure the efficiency of a policy π by its regret, defined as the expected gap between the cumulative reward of an offline oracle and that of π. If an oracle knows P i, Ri and underlying state si t, the problem becomes simple as the agent would select a t = argmaxa N rt Rat (rt | sat t ) where Rat (rt | sat t ) is the reward function of the pulled arm at and rt is the obtained reward. 
If we benchmark a learning policy against the oracle, then the regret must be linear in T, because the oracle always observes st while the agent cannot predict the transition based on the history. Whenever a transition occurs, there is a non-vanishing regret incurred. Since the number of transitions during the time interval [0, T] is linear in T, the total regret is of the same order. Since comparing to the oracle knowing st is uninformative, we consider such an oracle that assumes the unknown states and known parameters. The offline oracle is similar to Zhou et al. (2021), which is stronger than those considered in Azizzadenesheli et al. (2016) and Fiez et al. (2018). We focus on the Bayesian regret of policy π (Ouyang et al., 2017; Jung & Tewari, 2019) as follows, t=1 (J(θ ) rt) The above expectation is with respect to the prior distribution about θ , the randomness in state transitions and the random reward. 4. The TSEETC Algorithm In section 4.1, we define the belief state and show how to update it with new observation. In section 4.2, we show how to update the posterior distributions under unknown states. In section 4.3, we show the details about our learning algorithm TSEETC. 4.1. Belief Encoder for Unobserved State Here we focus on the belief update for arm i with known parameters θi = (P i, Ri). At time t, the belief for arm i in state s is bi t(s, θi). Then after the pulling of arm i, we obtain the observation rt. The belief bi t(s , θi) can be updated as follows: bi t+1(s , θi) = P s bi t(s, θi)Ri (rt | s) P i(s | s) P s bi t(s, θi)Ri (rt | s) , (4) where the P i(s | s) is the probability of transitioning from state s at time t to state s and Ri (rt | s) is the probability of obtain reward rt under state s. If the arm i is not pulled, we update its belief as follows: bi t+1(s , θi) = X s bi t(s, θi)P i(s | s). (5) Then at each time, we can aggregate the belief of all arms as bt. Based on (2), we can derive the optimal action at for current belief bt. 4.2. Mixture of Dirichlet Distribution In this section, we estimate the unknown P i and Ri based on Dirichlet distribution. The Dirichlet distribution is parameterized by a count vector, ϕ = (ϕ1, . . . , ϕk), where ϕi 0, such that the density of probability distribution p = (p1, . . . , pk) is defined as f(p | ϕ) Qk i=1 pϕi 1 i (Ghavamzadeh et al., 2015). Since the true states are unobserved, all state sequences (and their corresponding Dirichlet posteriors) should be considered, with some weight proportional to the likelihood of each state sequence (Ross et al., 2011). Denote the reward history collected from time t1 till t2 (not including t2) for arm i as ri t1:t2 and similarly the states history is denoted as si t1:t2. And the belief state history is denoted as bi t1:t2. Recall that we assume the smallest element in the transition functions and reward functions are ϵ1 and ϵ2, respectively. To satisfy this, we can assume the transition function P i takes the form P i = ϵ11 + (1 Sϵ1) P i, where P i follows the Dirichlet distribution and 1 is the vector with one in each position. Similarly, we assume the reward function Ri takes the form Ri = ϵ21 + (1 Sϵ2) Ri, where Ri also follows the Dirichlet distribution. The element 1 can have different lengths in correspondence with the dimension of Online Restless Bandits with Unobserved States P i and Ri. Then with these history information bi t1:t2 and ri t1:t2, the posterior distribution gt(P i) and gt(Ri) at time t can be updated as in Lemma 4.1. Lemma 4.1. 
Assuming the transition function $P^i$ has prior $g_0(P^i) = f\!\left(\frac{P^i - \epsilon_1 \mathbf{1}}{1 - S\epsilon_1} \,\middle|\, \phi^i\right)$ and the reward function $R^i$ has prior $g_0(R^i) = f\!\left(\frac{R^i - \epsilon_2 \mathbf{1}}{1 - S\epsilon_2} \,\middle|\, \psi^i\right)$, given the information $r^i_{0:t}$ and $b^i_{0:t}$, the posterior distributions in the unobserved-state setting are as follows:

$$g_t(P^i) \propto \sum_{s^i_{0:t} \in \mathcal{S}^t} w(s^i_{0:t}) \prod_{s, s'} \left(\frac{P^i(s' \mid s) - \epsilon_1}{1 - S\epsilon_1}\right)^{N^i_{s,s'}(s^i_{0:t}) + \phi^i_{s,s'} - 1}, \qquad (6)$$

$$g_t(R^i) \propto \sum_{s^i_{0:t} \in \mathcal{S}^t} w(s^i_{0:t}) \prod_{s, r} \left(\frac{R^i(r \mid s) - \epsilon_2}{1 - S\epsilon_2}\right)^{N^i_{s,r}(s^i_{0:t}) + \psi^i_{s,r} - 1}, \qquad (7)$$

where $w(s^i_{0:t})$ is the likelihood of the state sequence $s^i_{0:t}$, $\mathcal{S}^t$ is the set of all possible state sequences from time $0$ to $t-1$, and $\phi^i$ and $\psi^i$ are the count vectors for the transition matrix and reward function of arm $i$, respectively.

This procedure is summarized in Algorithm 1. In lines 2-3, we enumerate all possible state transition sequences and calculate their corresponding weights. We then derive the Dirichlet distribution associated with each specific sequence (lines 4-8). In line 9, we update the posterior distribution as a mixture of Dirichlet distributions.

Algorithm 1 Posterior Update for $R^i(s, \cdot)$ and $P^i(s, \cdot)$
1: Input: the history length $\tau_1$, the state space $\mathcal{S}$, the belief history $b^i_{0:\tau_1}$, the reward history $r^i_{0:\tau_1}$, the initial parameters $\phi^i_{s,s'}, \psi^i_{s,r}$ for $s, s' \in \mathcal{S}$, $r \in \mathcal{R}$
2: generate the $S^{\tau_1}$ possible state sequences
3: calculate the weight $w(j) = \prod_{t=0}^{\tau_1 - 1} b^i_t(s^{(j)}_t, \theta)$ for each sequence $j \in [S^{\tau_1}]$
4: for $j = 1, \ldots, S^{\tau_1}$ do
5:   count the occurrences of the events $(s, s')$ and $(s, r)$ in sequence $j$ as $N^i_{s,s'}$ and $N^i_{s,r}$
6:   update $\phi^i_{s,s'} \leftarrow \phi^i_{s,s'} + N^i_{s,s'}$, $\psi^i_{s,r} \leftarrow \psi^i_{s,r} + N^i_{s,r}$
7:   aggregate the $\phi^i_{s,s'}$ as $\phi(j)$ and the $\psi^i_{s,r}$ as $\psi(j)$ for all $s, s' \in \mathcal{S}$, $r \in \mathcal{R}$
8: end for
9: update the mixture Dirichlet distribution $g_{\tau_1}(P^i) \propto \sum_{j=1}^{S^{\tau_1}} w(j) f\!\left(\frac{P^i - \epsilon_1 \mathbf{1}}{1 - S\epsilon_1} \,\middle|\, \phi(j)\right)$, $\; g_{\tau_1}(R^i) \propto \sum_{j=1}^{S^{\tau_1}} w(j) f\!\left(\frac{R^i - \epsilon_2 \mathbf{1}}{1 - S\epsilon_2} \,\middle|\, \psi(j)\right)$

Algorithm 2 Thompson Sampling with Episodic Explore-Then-Commit
1: Input: priors $g_0(P)$, $g_0(R)$, initial belief $b_0$, exploration length $\tau_1$, first episode length $T_1$
2: for episode $k = 1, 2, \ldots$ do
3:   set the start time of episode $k$: $t_k := t$
4:   sample $R_{t_k} \sim g_{t_{k-1}+\tau_1}(R)$ and $P_{t_k} \sim g_{t_{k-1}+\tau_1}(P)$
5:   for $t = t_k, t_k + 1, \ldots, t_k + \tau_1$ do
6:     pull each arm $i$ for $\tau_1 / N$ times in a round-robin way
7:     receive the reward $r_t$
8:     update the belief $b^i_t$ of the pulled arm using $R_{t_k}$, $P_{t_k}$ according to (4)
9:     update the beliefs $b^j_t$, $j \in [N] \setminus \{i\}$, using $P_{t_k}$ according to (5)
10:  end for
11:  for $i = 1, 2, \ldots, N$ do
12:    input the collected $r_{t_1:t_1+\tau_1}, \ldots, r_{t_k:t_k+\tau_1}$ and $b_{t_1:t_1+\tau_1}, \ldots, b_{t_k:t_k+\tau_1}$ into Algorithm 1 to update the posterior distributions $g_{t_k+\tau_1}(P)$, $g_{t_k+\tau_1}(R)$
13:  end for
14:  sample $P_{t_k+\tau_1} \sim g_{t_k+\tau_1}(P)$ and $R_{t_k+\tau_1} \sim g_{t_k+\tau_1}(R)$
15:  for $i = 1, 2, \ldots, N$ do
16:    re-update the belief $b^i_t$ from time $0$ to $t_k + \tau_1$ using $R_{t_k+\tau_1}$ and $P_{t_k+\tau_1}$
17:  end for
18:  compute $\pi^*_k(\cdot) = \mathrm{Oracle}(\cdot, R_{t_k+\tau_1}, P_{t_k+\tau_1})$
19:  for $t = t_k + \tau_1 + 1, \ldots, t_{k+1} - 1$ do
20:    apply action $a_t = \pi^*_k(b_t)$
21:    observe the new reward $r_{t+1}$
22:    update the beliefs $b_t$ of all arms using (4) and (5)
23:  end for
24: end for

4.3. Our Algorithm

In this section, we present the details of our TSEETC algorithm. TSEETC operates in episodes of different lengths. The total number of episodes is denoted by $k_T$. The length of episode $k$ is denoted by $T_k$ and is determined by $T_k = T_1 + k - 1$, where the first episode length $T_1$ is chosen to be of order $\sqrt{T}$. Denote the start time of episode $k$ by $t_k$. Each episode is split into an exploration phase and an exploitation phase. The length of the exploration phase in each episode is fixed to $\tau_1$, chosen such that $\tau_1 k_T = O(\sqrt{T})$ and $\tau_1 \le \frac{T_1 + k_T - 1}{2}$. Define the parameters sampled at time $t$ as $R_t$ and $P_t$. With these notations, the pseudo-code of TSEETC is given in Algorithm 2.
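For concreteness, the belief recursions (4) and (5) that Algorithm 2 invokes (lines 8-9 and 22) can be written in a few lines of NumPy. This is a minimal illustrative sketch under our own naming conventions, not the authors' implementation; P and R are the sampled transition and emission matrices of a single arm.

```python
import numpy as np

def belief_update_pulled(b, r_idx, P, R):
    """Eq. (4): belief update for the pulled arm after observing reward r.
    b: length-S belief vector, P: S x S with P[s, s'] = P(s' | s),
    R: S x |rewards| with R[s, r_idx] = R(r | s)."""
    # weight each current state by how likely it was to emit the observed reward
    w = b * R[:, r_idx]          # b_t(s) * R(r_t | s)
    w = w / w.sum()              # normalize (denominator of Eq. (4))
    return w @ P                 # push forward through the transition matrix

def belief_update_unpulled(b, P):
    """Eq. (5): no observation, just propagate the belief through the dynamics."""
    return b @ P
```

In Algorithm 2, `belief_update_pulled` would be applied to the pulled arm with the sampled $(P_{t_k}, R_{t_k})$, while `belief_update_unpulled` propagates every other arm's belief with $P_{t_k}$ alone.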
In episode k, for the exploration phase (line 3-17), we first sample the Rtk, Ptk from the distribution gtk 1+τ1(P) and gtk 1+τ1(R) to update the belief states. We pull each arm for τ1/N times in a round-robin way. For the pulled arm, we update its belief according to (4) using Rtk and Ptk. For the arms that are not pulled, we update its belief according to (5) using Ptk. The reward and belief history of each arm are Online Restless Bandits with Unobserved States input into Algorithm 1 to update the posterior distribution after the exploration phase. Then we sample the new Rtk+τ1, Ptk+τ1 from the posterior distribution, and re-calibrate the belief bt based on the most recent sampled Rtk+τ1, Ptk+τ1. Next, we enter into the exploitation phase (line 18-23). First, we use an Oracle to derive the optimal policy πk for the sampled parameters Rtk+τ1, Ptk+τ1 . The Oracle can be the Bellman equation for POMDP as we introduced in equation (2), or the approximation methods (Pineau et al., 2003; Silver & Veness, 2010), etc. The approximation error is discussed in Remark 4.2. Then we use policy πk for the rest of the episode k. Our deterministic linear increment of episode length guarantees the episode number k T is order O( T) as in Lemma B.6. Then the regret of the exploration phases can be bound by O( T), which is an crucial part in Theorem 5.1. Remark 4.2. If the oracle returns an ϵk-approximate policy πk in each episode instead of the optimal policy, i.e., r(b, πk(b)) + P r P(r | b, πk(b), θ)v (b , θ) maxa {r(b, a) + P r P(r | b, a, θ)v (b , θ)} ϵk, then there will be an extra regret term E h P k:tk T (Tk τ1)ϵk i in the exploitation phase. If we control the error as ϵk 1 Tk τ1 , then this extra regret can be bounded as E h P k:tk T (Tk τ1)ϵk i k T = O( T) by Lemma B.6. Thus the approximation error in the computation of optimal policy does not affect the order of our regret bound. 5. Performance Analysis In Section 5.1, we show our theoretical results and some discussions. In Section 5.2, we provide a proof sketch and the detailed proof is in Appendix B. 5.1. Regret Bound and Discussions Theorem 5.1. Suppose Assumptions 3.1,3.2 hold and the Oracle returns the optimal policy in each episode. The Bayesian regret of our algorithm satisfies RT 48C1C2S p NT log(NT) + C1C2+ (τ1 R + H + 4C1C2SN) where C1 = L1 + L2N + N 2 + S2, C2 = rmax + H are constants independent with time horizon T, L1 = 4(1 ϵ1)2 Nϵ2 1ϵ2 , L2 = 4(1 ϵ1)2 ϵ3 1 , ϵ1 and ϵ2 are the lower bounds of the functions P and R , respectively. τ1 is the fixed exploration length in each episode, R is is the gap between the maximum and the minimum rewards, H is the bounded span, rmax is the maximum reward obtain each time, N is the number of arms and S is the state size for each arm. The best existing bounds on both the frequentist regret and the Bayesian regret are O(T 2/3) in the setting with both unobserved state and unknown parameters. Our algorithm is the first to achieves the O( T) Bayesian regret bound on average. Whether one can achieve the O( T) frequentist regret bound is still open. The key ingredients that allow us to obtain the O( T) bound are as follows. First, we estimate the unknown parameters based on Thompson sampling to update the posterior distribution of unknown parameters as the mixture of each combined distribution. 
Second, to control the regret caused by the exploration phases, we use an episodic algorithm and increase the episode length in a deterministic manner that guarantees the total episode number is order O( T), so the regret of the exploration phases is bounded by O( T). Third, we propose a novel pseudo count of the state-action pairs based on Dirichlet distribution, which allows us to bound the total estimation errors about unknown parameters and unobserved states in the exploitation phase by O( The algorithm in Zhou et al. (2021) is also episodic and each episode is divided into an exploration phase and an exploitation phase as ours. Their cumulative regret bounds in each phase are both O(T 2/3) in the frequentist sense. The bottleneck of their method is the spectral estimator used for parameter estimation, which has an error bound of order 1/ k, where k is the episode index. To control this error, they had to use a longer exploration phase than we do, which results in a larger regret. In contrast, the regrets of our algorithm in both phases are well controlled by O( T), in the Bayesian sense though. Jahromi et al. (2022) considers a similar problem in the POMDP setting and obtain a Bayesian regret bound of O(T 2/3). They define the pseudo counts of state-action pairs, but their pseudo counts are always smaller than the true counts with a nonzero probability at any time. On the other hand, in our algorithm, the sampled parameter is more concentrated around the true values with the posterior update. Therefore, our belief-based pseudo counts defined in (13) approximate the true counts more closely, which helps us obtain the final O( T) regret bound. The existing work in restless bandits provide regret bounds depending on the mixing time Tmix. To guarantee an accuracy of 1 T , the mixing time Tmix can be bound by O(log T) (Jung et al., 2019). Therefore, bounds depending on Tmix can be bound as O(log T T) (Jung et al., 2019) and O log7/2 T T (Ortner et al., 2012). Our regret bound depends on the lower bounds ϵ1 and ϵ2 in Assumptions 3.1,3.2, which is independent with the time horizon T. Then our regret bound is O( T log T). Therefore our regret bound improves those two bounds by a logarithmic factor. Remark 5.2. (Continuous reward functions). We assume the reward set is finite as Jung & Tewari (2019); Zhou et al. (2021); Singh et al. (2022). However, our TSEETC algo- Online Restless Bandits with Unobserved States rithm can be extended to handle continuous rewards. First, for unknown states, we can update the belief states incorporated with such continuous reward function. After pulling the arm and observing r, the belief state b is revised according to Bayes theorem: b (s ) P s b(s)R(r | s)P (s | s) (Hoey & Poupart, 2005), where R(r | s) is a probability density function. Second, the posterior distribution g(R) should be updated with continuous reward. For the case where each arm has two states, g(R) is the Beta distribution. We can accordingly modify TSEETC so that after observing the reward rt [0, 1] at time t, it performs a Bernoulli trial with success probability rt. Let the random variable rt denote the outcome of this Bernoulli trial, and let Si(t), Fi(t) be the number of successes and failures in the Bernoulli trials until time t. If rt = 1, we set Si(t) = Si(t) + 1. Otherwise, we let Fi(t) = Fi(t) + 1. Then we can update the parameters in g(R) accordingly. We leave the theoretical analysis for continuous reward as future work. Remark 5.3. (Thompson sampling approximation error). 
The establishment of the final regret bound and the total estimation error for belief states requires satisfying not only Assumption 3.1 and Assumption 3.2, but also relies crucially on the Oracle returning the optimal policy in each episode and exact posterior updates. However, in practice, marginalizing the full state sequence in Algorithm 1 is an exponentially costly task. Therefore, we approximate the posterior distribution (Urteaga & Wiggins, 2018; Lu & Van Roy, 2017) by sampling M state transition sequences, where M is a hyperparameter. As such, it is necessary to consider the impact of approximation errors on the posterior distribution (Phan et al., 2019; Mazumdar et al., 2020) in relation to the regret bound. A desired final regret bound comprises two terms: the first term is the regret bound achieved by Thompson sampling with an exact posterior, while the second term is an incremental term and accounts for posterior mismatch. Importantly, the second term converges to zero as the size of sequences considered M approaches infinity. To achieve this, we need to study the divergence between two distributions g(P) and g (P) defined by different Dirichlet counts and bound the gap between the sampled transition matrix and the true parameters based on such a divergence. This is nontrivial and it deserves further study. 5.2. Proof Sketch The total regret can be decomposed as follows: tk J(θ ) rt | {z } Regret (A) tk+τ1+1 J(θ ) rt | {z } Regret (B) Bounding Regret (A). The Regret (A) is the regret caused in the exploration phase of each episode. This term can be simply bounded as follows: Regret (A) Eθ τ1 Rk T (9) where R = rmax rmin is the gap between the maximum and the minimum rewards. The regret in (9) is related to the episode number k T . Since the first episode has length of order O( T) and the episode length is increasing linearly with the episode index, we can easily bound the total episode number by O( T) as in Lemma B.6. Bounding Regret (B). Next we bound Regret(B) in the exploitation phase. Let ˆbt denote the belief updated with parameter θk and b t the belief with true parameter θ . During episode k, based on (2) for the sampled parameter θk and that at = π (ˆbt), we can write: J (θk)+v(ˆbt, θk) = r(ˆbt, at)+ X r P(r | ˆbt, at, θk)v(b , θk). (10) With (10), we proceed by decomposing the regret as: Regret(B) = R1 + R2 + R3 + R4 (11) where each term is defined as follows: k=1 [(Tk τ1 1) (J(θ ) J(θk))] , v(ˆbt+1, θk) v(ˆbt, θk) # r P(r | ˆbt, at, θk)v(b , θk) v(ˆbt+1, θk) # r(ˆbt, at) r(b t , at) # Bounding R1. A key property of TS algorithms is that when the prior distribution g0 coincides with that of the true parameter θ , given the history Htk, the sampled θk has the same distribution as θ at time tk as stated in Lemma 5.4. Since the length Tk is deterministic and independent of θk, R1 is zero thanks to this property as stated in Lemma 5.5 . Lemma 5.4. (Posterior Sampling (Ouyang et al., 2017)). In TSEETC, tk is an almost surely finite σ (Htk)-stopping time. If the prior distribution g0(P),g0(R) is the distribution of θ , then for any measurable function g, E [g (θ ) | Htk] = E [g (θk) | Htk] . Online Restless Bandits with Unobserved States Lemma 5.5. R1 satisfies that R1 = 0. Bounding R2. The inner sum in R2 is a telescopic sum that reduces to the difference of two value functions, which is upper bounded by the span and hence by H. Since the number of episodes k T is deterministic and O( T), we have R2 Hk T = O(H T) and we have Lemma 5.6. Lemma 5.6. 
$R_2$ is bounded by $R_2 \le O(H\sqrt{T})$.

Bounding $R_3$ and $R_4$. The regret terms $R_3$ and $R_4$ are related to the estimation error of $\theta$ (including the transition function $P$ and the reward function $R$) and the estimation error of the belief state. The belief estimation error can be bounded in terms of the estimation errors of $\theta$ by a result of Xiong et al. (2022e), reproduced in Proposition D.2. Thus the key is to bound the estimation error of $\theta$. Recalling the definition of $\phi, \psi$ in Lemma 4.1, we define the posterior means $\hat{P}^i(s' \mid s)$ and $\hat{R}^i(r \mid s)$ for arm $i$ at time $t$ as follows:
$$\hat{P}^i(s' \mid s) = \frac{\epsilon_1 + (1 - \epsilon_1)\,\phi^i_{s,s'}(t)}{S\epsilon_1 + (1 - \epsilon_1)\,\|\phi^i_{s,\cdot}(t)\|_1}, \qquad \hat{R}^i(r \mid s) = \frac{\epsilon_2 + (1 - \epsilon_2)\,\psi^i_{s,r}(t)}{S\epsilon_2 + (1 - \epsilon_2)\,\|\psi^i_{s,\cdot}(t)\|_1}.$$
For a fixed arm $i$, at each time it is either pulled or not; the action $a$ is $1$ or $0$ accordingly. We then define the pseudo-count of the state-action pair $(s, a)$ before episode $k$ as
$$N^i_{t_k}(s, 1) = \|\psi^i_{s,\cdot}(t_k)\|_1 - \|\psi^i_{s,\cdot}(0)\|_1, \qquad (13)$$
$$N^i_{t_k}(s, 0) = \Big(\sum_{j=1}^{k-1} T_j\Big) - N^i_{t_k}(s, 1), \qquad (14)$$
where $\psi^i_{s,\cdot}(t_k)$ are the parameters of the Dirichlet distribution for the reward function of arm $i$ at time $t_k$. Let $\mathcal{M}^i_k$ be the set of plausible MDPs in episode $k$ with reward function $\tilde{R}^i(r \mid z)$ and transition function $\tilde{P}^i(s' \mid z)$ satisfying
$$\sum_{s'} \big|\tilde{P}^i(s' \mid z) - \hat{P}^i_k(s' \mid z)\big| \le \beta^i_k(z), \qquad \sum_{r} \big|\tilde{R}^i(r \mid z) - \hat{R}^i_k(r \mid z)\big| \le \beta^i_k(z), \qquad (15)$$
where $\beta^i_k(s, a) := \sqrt{\frac{14 S \log(2 N t_k T)}{\max\{1, N^i_{t_k}(s, a)\}}}$ is chosen conservatively (Auer et al., 2008) so that $\mathcal{M}^i_k$ contains both $P^i$ and $P^i_k$, $R^i$ and $R^i_k$ with high probability. Here $P^i$ and $R^i$ are the true parameters as defined in Section 4.1. The core of the proof lies in deriving a high-probability confidence set with our pseudo-counts and showing that the estimation error accumulated up to time $T$ for each arm is bounded by $O(\sqrt{T})$. We can then derive the final error bound for the MDP aggregated over all arms, as stated in Lemma 5.7. The proof of Lemma 5.7 is in Appendix B.3.2.

Lemma 5.7. (Estimation errors of unknown parameters). Suppose Assumptions 3.1 and 3.2 hold and the posterior distributions are exactly updated. Then the total estimation error of the unknown parameters accumulated over all exploitation phases satisfies
$$\mathbb{E}_{\theta^*}\Bigg[\sum_{k=1}^{k_T} \sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \|P^* - P_k\|_1\Bigg] + \mathbb{E}_{\theta^*}\Bigg[\sum_{k=1}^{k_T} \sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \|R^* - R_k\|_1\Bigg] = O\big(\sqrt{NT \log(NT)}\big),$$
where $P_k, R_k$ are the sampled parameters in episode $k$.

With Lemma 5.7, we can bound the estimation errors of the belief states as stated in Lemma 5.8, whose proof is also in Appendix B.3.2.

Lemma 5.8. (Control belief error). Suppose Assumptions 3.1 and 3.2 hold and the posterior distributions are exactly updated. Then the total estimation error of the belief states accumulated over all exploitation phases satisfies
$$\mathbb{E}_{\theta^*}\Bigg[\sum_{k=1}^{k_T} \sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \|b^*_t - \hat{b}_t\|_1\Bigg] = O\big(\sqrt{NT \log(NT)}\big).$$

Lemma 5.8 shows that the accumulated belief error about the unobserved states is also bounded by $O(\sqrt{T})$. We can then obtain the final bounds on $R_3$ and $R_4$; the detailed proofs are in Appendices B.3 and B.4.

Lemma 5.9. $R_3$ satisfies the bound $R_3 \le 48 C_1 S H \sqrt{NT \log(NT)} + C_1 H + 4 C_1 S N H (T_1 + k_T - \tau_1 - 1)$.

Lemma 5.10. $R_4$ satisfies the bound $R_4 \le 48 C_1 S r_{\max} \sqrt{NT \log(NT)} + C_1 r_{\max}$.

Then the claim of Theorem 5.1 follows directly from Lemmas 5.5, 5.6, 5.9, and 5.10.

6. Numerical Experiments

In this section, we present proof-of-concept experiments. To implement TSEETC efficiently, we only keep the most probable state-transition sequences in the posterior update of the unknown parameters. This approximation reduces the computational complexity, and the final simulation results show that the approximate algorithm still achieves better performance than the existing algorithms; one way to implement this truncation is sketched below.
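The paper does not give pseudo-code for this truncation, so the following sketch shows one plausible implementation for a single arm: keep only the $M$ highest-weight state sequences before forming the Dirichlet-count mixture of Algorithm 1. The function name, the top-$M$ rule, and the brute-force enumeration are our own illustrative choices; a practical implementation would prune sequences incrementally (e.g., a beam search over time steps) rather than enumerating all $S^T$ of them.

```python
import itertools
import numpy as np

def truncated_posterior_counts(beliefs, rewards, reward_set, phi, psi, M):
    """Approximate mixture-of-Dirichlet posterior for one arm.
    beliefs: T x S array of belief vectors over an exploration window
    rewards: length-T list of rewards observed for this arm in that window
    phi, psi: prior Dirichlet counts (S x S and S x |reward_set|)
    Returns a list of (weight, phi_j, psi_j) mixture components, truncated to M."""
    T, S = beliefs.shape
    # weight of a sequence = product of belief probabilities along it (Algorithm 1, line 3)
    scored = []
    for seq in itertools.product(range(S), repeat=T):
        w = np.prod([beliefs[t, s] for t, s in enumerate(seq)])
        scored.append((w, seq))
    scored.sort(key=lambda x: x[0], reverse=True)
    top = scored[:M]                                 # the truncation step
    total = sum(w for w, _ in top)
    components = []
    for w, seq in top:
        phi_j, psi_j = phi.copy(), psi.copy()
        for t in range(T - 1):                       # transition counts N^i_{s,s'}
            phi_j[seq[t], seq[t + 1]] += 1
        for t in range(T):                           # reward counts N^i_{s,r}
            psi_j[seq[t], reward_set.index(rewards[t])] += 1
        components.append((w / total, phi_j, psi_j))
    return components
```

The returned components define the truncated mixture $g(P^i) \approx \sum_j w_j f\!\big(\frac{P^i - \epsilon_1 \mathbf{1}}{1 - S\epsilon_1} \mid \phi_j\big)$ (and analogously for $R^i$), from which the exploitation phase samples its parameters.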
We consider two arms, each with two hidden states (0 and 1), and we pull exactly one arm at each time. The learning horizon is $T = 50000$, and each algorithm is run for 100 iterations. At state 1 the reward set is $\{10, 20\}$, and at state 0 the reward set is $\{-10, 10\}$. The transition functions and reward functions are the same for all arms. We initialize the algorithm with an uninformative Dirichlet prior on the unknown parameters. The baselines include $\epsilon$-greedy (Lattimore & Szepesvári, 2020) with $\epsilon = 0.01$, Sliding-Window UCB (Garivier & Moulines, 2011) with a window size of 50, RUCB (Liu et al., 2010), Q-learning (Hu et al., 2020), and SEEU (Zhou et al., 2021). The pseudo-counts in Jahromi et al. (2022) depend on the expectation of the true counts, which cannot be obtained under unknown states, so we exclude that algorithm from our experiments. The results are shown in Figure 1. We observe that approximate TSEETC has the minimum regret among these algorithms.

Figure 1. The cumulative regret.

In Figure 2, we plot the cumulative regret versus $T$ of the six algorithms in log-log scale. We observe that the slopes of all algorithms except our TSEETC and SEEU are close to one, suggesting that they incur linear regret. Moreover, the slope of TSEETC is close to 0.5, which is better than SEEU. This is consistent with our theoretical result.

Figure 2. The log-log regret.

Next, we examine the robustness of TSEETC to other action and state dimensionalities. We first consider settings with different numbers of arms, each arm having the same state space. Second, we consider the case where the number of arms is fixed but the size of each arm's state space varies. The results are shown in Table 1. TSEETC achieves the minimum cumulative regret among all compared algorithms under all settings.

Table 1. The average accumulated regrets of different algorithms with different arms and states.

(ARMS, STATES)   TSEETC    SEEU    RUCB   Q-LEARNING   ϵ-GREEDY   SLIDE-UCB
(2, 2)              580     871    1259         1710       2653        4039
(4, 2)             9968   10253   13520        14932      16684       17690
(6, 2)            14640   25940   26932        29875      30260       33894
(8, 2)            27252   34614   35650        42261      44962       46541
(10, 2)           39635   42600   44506        49580      51540       54652
(2, 3)             4654    6065    7420         7976       8598        9590
(2, 4)            10080   11652   14064        15648      17895       18953

7. Conclusion

In this paper, we consider restless bandits with unknown states and unknown dynamics. We propose the TSEETC algorithm to estimate the unknown parameters and derive the optimal policy, and we establish a Bayesian regret bound of $O(\sqrt{T})$ for it. Numerical results validate that TSEETC outperforms other learning algorithms in regret. A related open question is whether our method can be extended to the setting where the transition functions are action-dependent. We leave this for future work.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (No. 62072302, 42050105, 62262018), and the Open Research Project of the State Key Laboratory of Media Convergence and Communication, Communication University of China, China (No. SKLMCC2021KF011). We thank all reviewers for their constructive feedback.

References

Ahmad, S. H. A., Liu, M., Javidi, T., Zhao, Q., and Krishnamachari, B. Optimality of myopic sensing in multichannel opportunistic access. IEEE Transactions on Information Theory, 55(9):4040-4050, 2009.
Akbarzadeh, N. and Mahajan, A. Maintenance of a collection of machines under partial observability: Indexability and computation of whittle index. ar Xiv preprint ar Xiv:2104.05151, 2021. Anandkumar, A., Hsu, D., and Kakade, S. M. A method of moments for mixture models and hidden markov models. In Conference on Learning Theory, pp. 33 1. JMLR Workshop and Conference Proceedings, 2012. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002. Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008. Azizzadenesheli, K., Lazaric, A., and Anandkumar, A. Reinforcement learning of pomdps using spectral methods. In Conference on Learning Theory, pp. 193 256. PMLR, 2016. Bertsekas, D. P. et al. Dynamic programming and optimal control 3rd edition, volume ii. Belmont, MA: Athena Scientific, 2011. Chapelle, O. and Li, L. An empirical evaluation of thompson sampling. Advances in neural information processing systems, 24, 2011. De Castro, Y., Gassiat, E., and Le Corff, S. Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden markov models. IEEE Transactions on Information Theory, 63(8):4758 4777, 2017. Fiez, T., Sekar, S., and Ratliff, L. J. Multi-armed bandits for correlated markovian environments with smoothed reward feedback. ar Xiv preprint ar Xiv:1803.04008, 2018. Garivier, A. and Moulines, E. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pp. 174 188. Springer, 2011. Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A., et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359 483, 2015. Hoey, J. and Poupart, P. Solving pomdps with continuous or large discrete observation spaces. In IJCAI, pp. 1332 1338, 2005. Hong, J., Kveton, B., Zaheer, M., Ghavamzadeh, M., and Boutilier, C. Thompson sampling with a mixture prior. In International Conference on Artificial Intelligence and Statistics, pp. 7565 7586. PMLR, 2022. Hu, Z., Zhu, M., and Liu, P. Adaptive cyber defense against multi-stage attacks using learning-based pomdp. ACM Transactions on Privacy and Security (TOPS), 24(1):1 25, 2020. Jahromi, M. J., Jain, R., and Nayyar, A. Online learning for unknown partially observable mdps. In International Conference on Artificial Intelligence and Statistics, pp. 1712 1732. PMLR, 2022. Jung, Y. H. and Tewari, A. Regret bounds for thompson sampling in episodic restless bandit problems. Advances in Neural Information Processing Systems, 32, 2019. Jung, Y. H., Abeille, M., and Tewari, A. Thompson sampling in non-episodic restless bandits. ar Xiv preprint ar Xiv:1910.05654, 2019. Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207 1216, Stanford, CA, 2000. Morgan Kaufmann. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Liu, H., Liu, K., and Zhao, Q. Learning in a changing world: Non-bayesian restless multi-armed bandit. Technical report, California Univ Davis Dept of Electrical and Computer Engineering, 2010. Liu, H., Liu, K., and Zhao, Q. Logarithmic weak regret of non-bayesian restless multi-armed bandit. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1968 1971. IEEE, 2011. 
Liu, K. and Zhao, Q. Indexability of restless bandit problems and optimality of whittle index for dynamic multichannel access. IEEE Transactions on Information Theory, 56 (11):5547 5567, 2010. Lu, X. and Van Roy, B. Ensemble sampling. Advances in neural information processing systems, 30, 2017. Mate, A., Killian, J., Xu, H., Perrault, A., and Tambe, M. Collapsing bandits and their application to public health intervention. Advances in Neural Information Processing Systems, 33:15639 15650, 2020. Online Restless Bandits with Unobserved States Mate, A., Perrault, A., and Tambe, M. Risk-aware interventions in public health: Planning with restless multi-armed bandits. In 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). London, UK, volume 10, 2021. Mazumdar, E., Pacchiano, A., Ma, Y., Jordan, M., and Bartlett, P. On approximate thompson sampling with langevin algorithms. In International Conference on Machine Learning, pp. 6797 6807. PMLR, 2020. Meshram, R., Manjunath, D., and Gopalan, A. On the whittle index for restless multiarmed hidden markov bandits. IEEE Transactions on Automatic Control, 63(9): 3046 3053, 2018. Ortner, R., Ryabko, D., Auer, P., and Munos, R. Regret bounds for restless markov bandits. In International conference on algorithmic learning theory, pp. 214 228. Springer, 2012. Osband, I. and Van Roy, B. Why is posterior sampling better than optimism for reinforcement learning? In International conference on machine learning, pp. 2701 2710. PMLR, 2017. Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown markov decision processes: A thompson sampling approach. Advances in neural information processing systems, 30, 2017. Paria, D. and Sinha, A. Leadcache: Regret-optimal caching in networks. Advances in Neural Information Processing Systems, 34:4435 4447, 2021. Peng, Z., Jin, J., Luo, L., Yang, Y., Luo, R., Wang, J., Zhang, W., Xu, H., Xu, M., Yu, C., et al. Learning to infer user hidden states for online sequential advertising. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2677 2684, 2020. Phan, M., Abbasi Yadkori, Y., and Domke, J. Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32, 2019. Pineau, J., Gordon, G., Thrun, S., et al. Point-based value iteration: An anytime algorithm for pomdps. In IJCAI, volume 3, pp. 1025 1032. Citeseer, 2003. Ross, S., Pineau, J., Chaib-draa, B., and Kreitmann, P. A bayesian approach for learning and planning in partially observable markov decision processes. Journal of Machine Learning Research, 12(5), 2011. Scott, S. L. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639 658, 2010. Sheng, S.-P., Liu, M., and Saigal, R. Data-driven channel modeling using spectrum measurement. IEEE Transactions on Mobile Computing, 14(9):1794 1805, 2014. Silver, D. and Veness, J. Monte-carlo planning in large pomdps. Advances in neural information processing systems, 23, 2010. Singh, S. K., Borkar, V. S., and Kasbekar, G. S. User association in dense mmwave networks as restless bandits. IEEE Transactions on Vehicular Technology, 71(7):7919 7929, 2022. Smallwood, R. D. and Sondik, E. J. The optimal control of partially observable markov processes over a finite horizon. Operations research, 21(5):1071 1088, 1973. Tekin, C. and Liu, M. Online learning in opportunistic spectrum access: A restless bandit approach. In 2011 Proceedings IEEE INFOCOM, pp. 2462 2470. 
IEEE, 2011. Urteaga, I. and Wiggins, C. Variational inference for the multi-armed contextual bandit. In International Conference on Artificial Intelligence and Statistics, pp. 698 706. PMLR, 2018. Wang, S., Huang, L., and Lui, J. Restless-ucb, an efficient and low-complexity algorithm for online restless bandits. Advances in Neural Information Processing Systems, 33: 11878 11889, 2020. Xiong, G., Li, J., and Singh, R. Reinforcement learning augmented asymptotically optimal index policy for finitehorizon restless bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8726 8734, 2022a. Xiong, G., Qin, X., Li, B., Singh, R., and Li, J. Index-aware reinforcement learning for adaptive video streaming at the wireless edge. In Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pp. 81 90, 2022b. Xiong, G., Wang, S., and Li, J. Learning infinite-horizon average-reward restless multi-action bandits via index awareness. Advances in Neural Information Processing Systems, 35:17911 17925, 2022c. Xiong, G., Wang, S., Yan, G., and Li, J. Reinforcement Learning for Dynamic Dimensioning of Cloud Caches: A Restless Bandit Approach. In Proc. of IEEE INFOCOM, 2022d. Xiong, Y., Chen, N., Gao, X., and Zhou, X. Sublinear regret for learning pomdps. Production and Operations Management, 31(9):3491 3504, 2022e. Online Restless Bandits with Unobserved States Yemini, M., Leshem, A., and Somekh-Baruch, A. The restless hidden markov bandit with linear rewards and side information. IEEE Transactions on Signal Processing, 69:1108 1123, 2021. Zhou, X., Xiong, Y., Chen, N., and Gao, X. Regime switching bandits. Advances in Neural Information Processing Systems, 34, 2021. Online Restless Bandits with Unobserved States A. Table of Notations Notation Description T The length of horizon k T The episode number of time T Tk The episode length of episode k τ1 The fixed exploration length in each episode P i The transition functions for arm i Ri The reward function for arm i Pk The sampled transition function for aggregated MDP Rk The sampled reward function for aggregated MDP rt The reward obtained at time t bi t(s, θ) The belief state for being in state s at time t for arm i with parameter θ ˆbt The belief of all arms at time t with parameter θk b t The belief of all arms at time t with parameter θ at The action at time t r(bt, at) The expected reward obtained when the belief state is bt and the action is at J(θk) The optimal long term average reward with parameter θk rmax The maximum reward obtained each time rmin The minimum reward obtained each time R The biggest gap of the obtained reward B. Proof of Theorem 5.1 Recall that our goal is to minimize the regret : t=1 (J(θ ) rt) rt depends on the state st and at. Thus rt can be written as r(st, at). Due to Eθ [r (st, at) | Ht 1] = r(b t , at) for any t, we have, t=1 (J(θ ) r(b t , at)) In our algorithm, each episode is split into the exploration and exploitation phase then we can rewrite the regret as: tk (J(θ ) r (b t , at)) + tk+τ1+1 (J(θ ) r (b t , at)) where τ1 is the exploration length for each episode. τ1 is a constant. tk is the start time of episode k. Define the first part as Regret (A) which is caused by the exploration operations. The another part Regret (B) is as follows. 
Regret (A) = Eθ tk (J(θ ) r (b t , at)) Regret (B) = Eθ tk+τ1+1 (J(θ ) r (b t , at)) Recall that the reward set is R and we define the maximum reward gap in R as R = rmax rmin. Then we get: J(θ ) r (b t , at) R. Online Restless Bandits with Unobserved States Then Regret (A) can be simply upper bounded as follows: Regret (A) Eθ Regret (A) is related with the episode number k T obviously, which is bounded in Lemma B.6. Next we should bound the term Regret (B). During the episode k, based on (2), we get: J (θk) + v(ˆbt, θk) = r(ˆbt, at) + X r P(r | ˆbt, at, θk)v (b , θk) , (19) where J (θk) is the optimal long-term average reward when the system parameter is θk, ˆbt is the belief at time t updated with parameter θk, r(ˆbt, at) is the expected reward we can get when the action at is taken for the current belief ˆbt, b is the updated belief based on (4) with parameter θk when the reward r is received. Using this equation, we proceed by decomposing the regret as: Regret(B) = R1 + R2 + R3 + R4, (20) k=1 [(Tk τ1 1) (J(θ ) J(θk))] , v(ˆbt+1, θk) v(ˆbt, θk) # r P(r | ˆbt, at, θk)v(b , θk) v(ˆbt+1, θk) r(ˆbt, at) r(b t , at) # Next we bound the four parts one by one. B.1. Bound R1 Lemma B.1. R1 satisfies that R1 = 0. Proof. Recall that: k=1 [(Tk τ1 1) (J(θ ) J(θk)] . For each episode, Tk is determined and is independent with θk. Based on Lemma 5.4, we know that, Eθ [J(θ )] = Eθ [J(θk)]. therefore, the part R1 is 0. B.2. Bound R2 Lemma B.2. R2 satisfies the following bound R2 Hk T , where k T is the total number of episodes until time T. Online Restless Bandits with Unobserved States Proof. Recall that R2 is the telescoping sum of value function at time t + 1 and t. h v(ˆbt+1, θk) v(ˆbt, θk) i# We consider the whole sum in episode k, then the R2 can be rewrite as: h v(ˆbtk+1, θk) v(ˆbtk+τ1+1, θk) i . Due to the span of v(b, θ) is bounded by H as in proposition D.1 , then we can obtain the final bound, B.3. Bound R3 In this section, we first rewrite the R3 in section B.3.1. In section B.3.2, we show the details about how to bound R3. B.3.1. REWRITE R3 Lemma B.3. (Rewrite R3 ) The regret R3 can be bounded as follows: t=tk+τ1+1 ||P Pk||1 t=tk+τ1+1 ||b t ˆbt||1 t=tk+τ1+1 R Rk 1 where Pk is the sampled transition functions in episode k, Rk is the sampled reward functions in episode k, b t is the belief at time t updated with true P and R , ˆbt is the belief at time t updated with sampled Pk, Rk. Proof. The most part is similar to Jahromi et al. (2022), except that we should handle the unknown reward functions. Recall that R3 = Eθ Pk T k=1 h Ptk+1 1 t=tk+τ1+1 P r P(r | ˆbt, at, θk)v (b , θk) v(ˆbt+1, θk) i . Recall that Ht is the history of actions and observations prior to action at. Conditioned on Ht, θ and θk, the only random variable in ˆbt+1 is rt+1, then we can get, Eθ h v(ˆbt+1, θk) | Ht, θk i = X r R v (b , θk) P(r | b t , at, θ ), (22) where P(r | b t , at, θ ) is the probability of getting reward r given b t , at, θ . By the law of probability, P(r | b t , at, θ ) can be written as follows, P(r | b t , at, θ ) = X s R (r | s ) P (st+1 = s | Ht, θ ) s R (r | s ) X s P (st+1 = s | st = s, Ht, at, θ ) P (st = s | Ht, θ ) s b t (s)P (s | s) R (r | s ) , where P is the transition functions for the MDP aggregated by all arms, R is the reward function for the aggregated MDP. 
Therefore, we can rewrite the R3 as follows, r R (P(r | ˆbt, at, θk) P(r | b t , at, θ )v (b , θk) Online Restless Bandits with Unobserved States Based on (23), we get s v (b , θk) Rk (r | s ) X s ˆbt(s)Pk(s | s) s v (b , θk) R (r | s ) X s b t (s)P (s | s) s v (b , θk) Rk (r | s ) X s ˆbt(s)Pk(s | s) s v (b , θk) Rk (r | s ) X s b t (s)P (s | s) s v (b , θk) Rk (r | s ) X s b t (s)P (s | s) s v (b , θk) R (r | s ) X s b t (s)P (s | s) where Rk is the sampled reward function for aggregated MDP, Pk is the sampled transition function for aggregated MDP. s v (b , θk) Rk (r | s ) s ˆbt(s)Pk(s | s) X s b t (s)P (s | s) s v (b , θk) [Rk (r | s ) R (r | s )] X s b t (s)P (s | s) Bounding R 3. The part R 3 can be bounded as Jahromi et al. (2022). s v (b , θk) Rk (r | s ) s ˆbt(s)Pk(s | s) X s b t (s)P (s | s) = R 3(0) + R 3(1) where R 3(0) = Eθ s v (b , θk) Rk (r | s ) X s ˆbt(s)Pk(s | s) s v (b , θk) Rk (r | s ) X s b t (s)Pk(s | s) R 3(1) = Eθ s v (b , θk) Rk (r | s ) X s b t (s)Pk(s | s) s v (b , θk) Rk (r | s ) X s b t (s)P (s | s) Online Restless Bandits with Unobserved States For R 3(0), because P r Rk (r | s ) = 1,P s Pk(s | s) = 1,v (b , θk) H, we have R 3(0) = Eθ s v (b , θk) Rk (r | s ) X s ˆbt(s)Pk(s | s) s v (b , θk) Rk (r | s ) X s b t (s)Pk(s | s) s v (b , θk) Rk (r | s ) X s (ˆbt(s) b t (s)Pk(s | s) s v (b , θk) Rk (r | s ) X s |ˆbt(s) b t (s)|Pk(s | s) s |ˆbt(s) b t (s)| ˆbt(s) b t (s) 1 where the first inequality is due to ˆbt(s) b t (s) |ˆbt(s) b t (s)| and the second inequality is because P r Rk (r | s ) = 1, P s Pk(s | s) = 1, v (b , θk) H. For the first term in R 3(1) , note that conditioned on Ht, θ , the distribution of st is b t . Furthermore, at is measurable with respect to the sigma algebra generated by Ht, θk since at = π (ˆbt, θk). Thus, we have v (b , θk) X s P (s | s) b (s) | Ht, θk = v (b , θk) Eθ [P (s | s) | Ht, θk] . (25) v (b , θk) X s Pk (s | s) b (s) | Ht, θk = v (b , θk) Eθ [Pk (s | s) | Ht, θk] . (26) Substitute (25), (26) into R 3(1), we have R 3(1) = Eθ s v (b , θk) Rk (r | s ) (Pk(s | s) P (s | s)) s v (b , θk) Rk (r | s ) |Pk(s | s) P (s | s)| s |Pk(s | s) P (s | s)| t=tk+τ1+1 ( Pk P 1) where the first inequality is because Pk(s | s) P (s | s) |Pk(s | s) P (s | s)|, the second inequality is due to v (b , θk) H and P r Rk (r | s ) = 1. Therefore we obtain the final results, t=tk+τ1+1 ||P Pk||1 t=tk+τ1+1 ||b t ˆbt||1 Online Restless Bandits with Unobserved States Bounding R 3. For part R 3, note that for any fixed s , P s b t (s)P (s | s) S, therefore we can bound R 3 as follows, s v (b , θk) [Rk (r | s ) R (r | s )] X s b t (s)P (s | s) r [Rk (r | s ) R (r | s )] t=tk+τ1+1 S Rk R 1 t=tk+τ1+1 Rk R 1 where the first inequality is due to v (b , θk) H and P s b t (s)P (s | s) S , the second inequality is due to for any fixed s , P r [Rk (r | s ) R (r | s )] Rk R 1. B.3.2. BOUND R3 Lemma B.4. R3 satisfies the following bound R3 48(L1 + L2N + N + S2)SH p NT log(NT) + (L1 + L2N + N + S2)H + 4(L1 + L2N + N 2 + S2)SNH(T1 + k T τ1 1). Proof. Recall that the R3 is as follows: r P[r | ˆbt, at, θk]v (b , θk) v(ˆbt+1, θk) This regret terms are dealing with the model estimation errors. That is to say, they depend on the on-policy error between the sampled transition functions and the true transition functions, the sampled reward functions and the true reward functions. Thus we should bound the parameters error especially in our unobserved state setting. 
B.3.2. BOUND R3

Lemma B.4. $R_3$ satisfies the following bound:
$$R_3 \le 48(L_1 + L_2 N + N + S^2) S H \sqrt{NT \log(NT)} + (L_1 + L_2 N + N + S^2) H + 4(L_1 + L_2 N + N^2 + S^2) S N H (T_1 + k_T - \tau_1 - 1).$$

Proof. Recall that
$$R_3 = \mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \Big(\sum_{r} P(r \mid \hat b_t, a_t, \theta_k)\, v(b', \theta_k) - v(\hat b_{t+1}, \theta_k)\Big)\Big].$$
This regret term captures the model estimation error: it depends on the on-policy error between the sampled and true transition functions, and between the sampled and true reward functions. Thus we need to bound the parameter estimation errors, which is delicate in our unobserved-state setting.

Based on the parameters of our Dirichlet posterior, we define the empirical estimates of the transition function and the reward function for arm $i$ as
$$\hat P^i(s' \mid s) = \frac{\epsilon_1 + (1 - \epsilon_1)\phi^i_{s,s'}(t)}{S\epsilon_1 + (1 - \epsilon_1)\|\phi^i_{s,\cdot}(t)\|_1}, \qquad \hat R^i(r \mid s) = \frac{\epsilon_2 + (1 - \epsilon_2)\psi^i_{s,r}(t)}{S\epsilon_2 + (1 - \epsilon_2)\|\psi^i_{s,\cdot}(t)\|_1}, \qquad (28)$$
where $\phi^i_{s,s'}(t)$ are the parameters of the posterior distribution of $P^i$ at time $t$ and $\psi^i_{s,r}(t)$ are the parameters of the posterior distribution of $R^i$ at time $t$. Each arm can be pulled or not; we write $a = 1$ when the arm is pulled and $a = 0$ otherwise. We then define the pseudo count $N^i_{t_k}(s, a)$ of the state-action pair $(s, a)$ before episode $k$ for arm $i$ as
$$N^i_{t_k}(s, 1) = \|\psi^i_{s,\cdot}(t_k)\|_1 - \|\psi^i_{s,\cdot}(0)\|_1, \qquad N^i_{t_k}(s, 0) = \Big(\sum_{j=1}^{k-1} T_j\Big) - N^i_{t_k}(s, 1).$$
For notational simplicity, we use $z = (s, a) \in \mathcal{S} \times \mathcal{A}$ and $z_t = (s_t, a_t)$ to denote the corresponding state-action pair. Then, based on Lemma B.3, we can decompose $R_3$ as
$$R_3 = \mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \sum_{r} \big(P(r \mid \hat b_t, a_t, \theta_k) - P(r \mid b^*_t, a_t, \theta^*)\big)\, v(b', \theta_k)\Big] \le R^0_3 + R^1_3 + R^2_3,$$
where
$$R^0_3 = H\,\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \|P^* - P_k\|_1\Big], \quad R^1_3 = H\,\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \|b^*_t - \hat b_t\|_1\Big], \quad R^2_3 = S^2 H\,\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \|R^* - R_k\|_1\Big].$$
Note that the following results all focus on a single arm. Let $P^{i*}$ be the true transition function of arm $i$ and $P^i_k$ the sampled transition function of arm $i$. The single-arm results are extended to the aggregated MDP via Lemma D.3.

Bounding $R^0_3$. Since $0 \le v(b', \theta_k) \le H$ by our assumption, each term in the inner summation is bounded by
$$\sum_{s' \in \mathcal{S}} \big|P^{i*}(s' \mid z_t) - P^i_k(s' \mid z_t)\big|\, v(s', \theta_k) \le H \sum_{s'} \big|P^{i*}(s' \mid z_t) - \hat P^i_k(s' \mid z_t)\big| + H \sum_{s'} \big|P^i_k(s' \mid z_t) - \hat P^i_k(s' \mid z_t)\big|,$$
where $P^{i*}(s' \mid z_t)$ is the true transition function, $P^i_k(s' \mid z_t)$ is the sampled transition function and $\hat P^i_k(s' \mid z_t)$ is the posterior mean; the inequality follows from the triangle inequality. Let $\mathcal{M}^i_k$ be the set of plausible MDPs in episode $k$ with reward function $R^i(r \mid z)$ and transition function $P^i(s' \mid z)$ satisfying
$$\sum_{s'} \big|P^i(s' \mid z) - \hat P^i_k(s' \mid z)\big| \le \beta^i_k(z), \qquad \sum_{r} \big|R^i(r \mid z) - \hat R^i_k(r \mid z)\big| \le \beta^i_k(z),$$
where
$$\beta^i_k(s, a) := \sqrt{\frac{14 S \log(2 N t_k T)}{\max\{1, N^i_{t_k}(s, a)\}}}$$
is chosen conservatively (Auer et al., 2008) so that $\mathcal{M}^i_k$ contains both $P^{i*}$ and $P^i_k$, $R^{i*}$ and $R^i_k$, with high probability. Here $P^{i*}$ and $R^{i*}$ are the true parameters defined in Section 4.1. Note that $\beta^i_k(z)$ is the confidence width with $\delta = 1/t_k$.
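As an illustration of how these quantities are formed, the following minimal sketch computes the smoothed estimates in (28), the pseudo counts, and the confidence width $\beta^i_k(z)$ from hypothetical Dirichlet parameters. The counts, horizon and number of arms are assumed for illustration only, and the number of reward values is taken equal to $S$ so that the denominator in (28) normalizes both estimates exactly.

```python
# Minimal sketch of the smoothed estimates in (28), the pseudo counts, and the
# confidence width beta^i_k(z).  The Dirichlet parameters below are hypothetical.
import numpy as np

S, eps1, eps2 = 3, 0.05, 0.05
N_arms, t_k, T = 2, 10, 1000

rng = np.random.default_rng(1)
phi = rng.integers(1, 6, size=(S, S)).astype(float)   # phi^i_{s,s'}(t_k): prior ones + counts
psi = rng.integers(1, 6, size=(S, S)).astype(float)   # psi^i_{s,r}(t_k): prior ones + counts

P_hat = (eps1 + (1 - eps1) * phi) / (S * eps1 + (1 - eps1) * phi.sum(axis=1, keepdims=True))
R_hat = (eps2 + (1 - eps2) * psi) / (S * eps2 + (1 - eps2) * psi.sum(axis=1, keepdims=True))
print(P_hat.sum(axis=1), R_hat.sum(axis=1))           # each row sums to 1

# Pseudo count of pulls per state (prior psi(0) assumed to be all ones here) and the
# resulting confidence width from the plausible set M^i_k.
N_pulled = psi.sum(axis=1) - S
beta = np.sqrt(14 * S * np.log(2 * N_arms * t_k * T) / np.maximum(1, N_pulled))
print(beta)
```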
Then we can obtain
$$\sum_{s'} \big|P^{i*}(s' \mid z_t) - \hat P^i_k(s' \mid z_t)\big| + \sum_{s'} \big|P^i_k(s' \mid z_t) - \hat P^i_k(s' \mid z_t)\big| \le 2\beta^i_k(z_t) + 2\big(\mathbb{I}\{P^{i*} \notin B_k\} + \mathbb{I}\{P^i_k \notin B_k\}\big).$$
We assume the last episode is the longest. Even if this assumption does not hold, we can enlarge the summation to $T_{k_T} - \tau_1 - 1$ terms, which does not affect the order of our regret bound. Under this assumption, since no episode is longer than the last one, i.e., $t_{k+1} - 1 - (t_k + \tau_1) \le T_{k_T} - \tau_1$, we obtain
$$\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1}^{t_{k+1}-1} \beta^i_k(z_t) \le \sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T} - \tau_1} \beta^i_k(z_t). \qquad (30)$$
Note that $\sum_{s' \in \mathcal{S}} |P^{i*}(s' \mid z_t) - \hat P^i_k(s' \mid z_t)| \le 2$ always holds, and with our assumption $\tau_1 \le \frac{T_1 + k_T - 1}{2}$ it is easy to show that $\beta^i_k(z_t) \le 2$ whenever $N^i_{t_k}(z_t) \ge T_{k_T} - \tau_1$. Then we obtain
$$\sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \min\{2, \beta^i_k(z_t)\} \le \sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} 2\,\mathbb{I}\big(N^i_{t_k}(z_t) < T_{k_T} - \tau_1\big) + \sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \mathbb{I}\big(N^i_{t_k}(z_t) \ge T_{k_T} - \tau_1\big) \sqrt{\frac{14 S \log(2 N t_k T)}{\max\{1, N^i_{t_k}(z_t)\}}}. \qquad (31)$$
Consider the first part in (31). Obviously, the maximum of $N^i_{t_k}$ is $T_{k_T} - \tau_1$. Because there are $SA$ state-action pairs in total, the first part of (31) can be bounded as
$$\sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} 2\,\mathbb{I}\big(N^i_{t_k}(z_t) < T_{k_T} - \tau_1\big) \le 2(T_{k_T} - \tau_1) S A.$$
Since $T_{k_T} = T_1 + k_T - 1$ and by Lemma B.6, we get
$$2(T_{k_T} - \tau_1) S A = 2(T_1 + k_T - \tau_1 - 1) S A = O(\sqrt{T}).$$
Consider the second part in (31). Denote by $N^i_t(s, a)$ the count of visits to $(s, a)$ before time $t$ (not including $t$). Since we only count the exploration phase of each episode, $N^i_t(s, a)$ can be calculated as
$$N^i_t(s, a) = \big|\{\tau < t,\ \tau \in [t_k, t_k + \tau_1],\ k \le k(t) : (s^i_\tau, a^i_\tau) = (s, a)\}\big|,$$
where $k(t)$ is the index of the episode containing time $t$. In the second part of (31), when $N^i_{t_k}(z_t) \ge T_{k_T} - \tau_1$, our assumption $\tau_1 \le \frac{T_1 + k_T - 1}{2}$ gives $2\tau_1 \le T_1 + k_T - 1 = T_{k_T}$, hence $T_{k_T} - \tau_1 \ge \tau_1$ and therefore $N^i_{t_k}(s, a) \ge \tau_1$. For any $t \in [t_k, t_k + \tau_1]$ we then have $N^i_t(s, a) \le N^i_{t_k}(s, a) + \tau_1 \le 2 N^i_{t_k}(s, a)$. Using $N^i_t(s, a) \le 2 N^i_{t_k}(s, a)$, we can bound the confidence terms as follows:
$$\sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \beta^i_k(z_t) = \sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \sqrt{\frac{14 S \log(2 N t_k T)}{\max\{1, N^i_{t_k}(z_t)\}}} \le \sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \sqrt{\frac{14 S \log(2 N T^2)}{\max\{1, N^i_{t_k}(z_t)\}}} \le \sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \sqrt{\frac{28 S \log(2 N T^2)}{\max\{1, N^i_t(z_t)\}}} \le \sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \sqrt{\frac{56 S \log(2 N T)}{\max\{1, N^i_t(z_t)\}}}, \qquad (32)$$
where the first inequality in (32) is due to $t_k \le T$ for all episodes and the second inequality is due to $N^i_t(s, a) \le 2 N^i_{t_k}(s, a)$. Then, similar to Ouyang et al. (2017), since $N^i_t(z_t)$ counts the visits to $z_t$, we have
$$\sum_{t=1}^{T} \frac{1}{\sqrt{\max\{1, N^i_t(z_t)\}}} = \sum_{z} \sum_{t:\, z_t = z} \frac{1}{\sqrt{\max\{1, N^i_t(z)\}}} \le \sum_{z} \Big(\mathbb{I}\{N^i_{T+1}(z) > 0\} + 2\sqrt{N^i_{T+1}(z)}\Big) \le 3 \sum_{z} \sqrt{N^i_{T+1}(z)}. \qquad (33)$$
Since $\sum_{z} N^i_{T+1}(z) \le T$, we have $3 \sum_{z} \sqrt{N^i_{T+1}(z)} \le 3\sqrt{SA \sum_{z} N^i_{T+1}(z)} \le 3\sqrt{SAT}$. With (32) and (33) we get
$$\sum_{k=1}^{k_T}\sum_{t=1}^{T_{k_T}-\tau_1} \beta^i_k(z_t) \le 6\sqrt{14}\, S \sqrt{NT \log(NT)} \le 24 S \sqrt{NT \log(NT)}.$$
Then we can bound (30) as
$$\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1}^{t_{k+1}-1} \beta^i_k(z_t) \le 24 S \sqrt{NT \log(NT)} + 2 S A (T_1 + k_T - \tau_1 - 1). \qquad (34)$$
Choosing $\delta = 1/T$ in Lemma D.4 and using Lemma 5.4, we obtain $\mathbb{P}(P^i_k \notin B_k) = \mathbb{P}(P^{i*} \notin B_k) \le \frac{1}{15 T t_k^6}$. Then
$$\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T} T_k \big(\mathbb{I}\{\theta^* \notin B_k\} + \mathbb{I}\{\theta_k \notin B_k\}\big)\Big] \le \sum_{k=1}^{\infty} \frac{2 T_k}{15 T t_k^6} \le \sum_{k=1}^{\infty} \frac{4}{15 k^6} \le 1. \qquad (35)$$
Therefore we obtain the bound for one arm:
$$\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \sum_{s'} \big|P^{i*}(s' \mid z_t) - P^i_k(s' \mid z_t)\big|\, v(s', \theta_k)\Big] \le H + 4 S N H (T_1 + k_T - \tau_1 - 1) + 48 H S \sqrt{NT \log(NT)}.$$
Next we consider the state transitions of all arms. Recall that the joint state of all arms at time $t$ is $s_t$. Because every arm evolves independently, the transition probability from $s_t$ to $s_{t+1}$ is
$$P(s_{t+1} \mid s_t, \theta^*) = \prod_{i=1}^{N} P^{i*}\big(s^i_{t+1} \mid s^i_t\big),$$
where $P^{i*}$ is the true transition function of arm $i$. By Lemma D.3 and our assumption that all arms share the same state space $\mathcal{S}$, we obtain
$$\sum_{s_{t+1}} \big|P(s_{t+1} \mid s_t, \theta^*) - P(s_{t+1} \mid s_t, \theta_k)\big| \le \sum_{i=1}^{N} \big\|P^{i*}(\cdot \mid s^i_t) - P^i_k(\cdot \mid s^i_t)\big\|_1 \le N \max_i \big\|P^{i*}(\cdot \mid s^i_t) - P^i_k(\cdot \mid s^i_t)\big\|_1.$$
Therefore, we can bound $R^0_3$ as
$$R^0_3 \le N H + 4 S N^2 H (T_1 + k_T - \tau_1 - 1) + 48 S N H \sqrt{NT \log(NT)}. \qquad (39)$$
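The aggregation step above rests on Lemma D.3: the $\ell_1$ distance between product distributions is at most the sum of the per-arm $\ell_1$ distances. The following small numerical illustration, on random and purely hypothetical per-arm distributions, is not a proof but shows the inequality in action.

```python
# Numerical illustration (not a proof) of the aggregation step via Lemma D.3: the
# L1 distance between the product distributions over all arms is at most the sum
# of the per-arm L1 distances.  The per-arm distributions below are hypothetical.
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N, S = 3, 4                                          # arms and states per arm
P = [rng.dirichlet(np.ones(S)) for _ in range(N)]    # "true" next-state distribution per arm
Q = [rng.dirichlet(np.ones(S)) for _ in range(N)]    # "sampled" next-state distribution per arm

lhs = sum(abs(np.prod([P[i][x[i]] for i in range(N)]) -
              np.prod([Q[i][x[i]] for i in range(N)]))
          for x in product(range(S), repeat=N))
rhs = sum(np.abs(P[i] - Q[i]).sum() for i in range(N))
assert lhs <= rhs + 1e-12
print(lhs, "<=", rhs)
```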
Bounding $R^1_3$. Based on Proposition D.2, we know that
$$\|b^*_t - \hat b_t\|_1 \le L_1 \|R^* - R_k\|_1 + L_2 \max_{s} \|P^*(s, :) - P_k(s, :)\|_2.$$
Note that the entries of the true transition matrix $P^*$ and of the sampled matrix $P_k$ lie in the interval $(0, 1)$. Then, by standard norm inequalities,
$$\max_{s} \|P^*(s, :) - P_k(s, :)\|_2 \le \|P^* - P_k\|_1.$$
Therefore, we can bound the belief error at any time as
$$\|b^*_t - \hat b_t\|_1 \le L_1 \|R^* - R_k\|_1 + L_2 \|P^* - P_k\|_1. \qquad (40)$$
Recall that in the confidence set $\mathcal{M}_k$ the error bound is the same for $\|R^* - R_k\|_1$ and $\|P^* - P_k\|_1$. Hence, based on the bounds in (34) and (35), we can bound $R^1_3$ as
$$R^1_3 = H\,\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \big(L_1 \|R^* - R_k\|_1 + L_2 \|P^* - P_k\|_1\big)\Big] \le (L_1 + L_2 N) H\,\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \Big(2\beta^i_k(z_t) + 2\big(\mathbb{I}\{P^* \notin B_k\} + \mathbb{I}\{P_k \notin B_k\}\big)\Big)\Big]$$
$$\le 48(L_1 + L_2 N) S H \sqrt{NT \log(NT)} + (L_1 + L_2 N) H + 4(L_1 + L_2 N) S N H (T_1 + k_T - \tau_1 - 1). \qquad (41)$$

Bounding $R^2_3$. Based on (34) and (35), we can bound $R^2_3$ as
$$R^2_3 = S^2 H\,\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \|R^*(\cdot \mid s) - R_k(\cdot \mid s)\|_1\Big] \le S^2 H\,\mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \Big(2\beta^i_k(z_t) + 2\big(\mathbb{I}\{R^* \notin B_k\} + \mathbb{I}\{R_k \notin B_k\}\big)\Big)\Big]$$
$$\le H S^2 + 4 S^3 N H (T_1 + k_T - \tau_1 - 1) + 48 H S^3 \sqrt{NT \log(NT)}. \qquad (42)$$
Combining the bounds in (39), (41) and (42), we bound $R_3$ as
$$R_3 \le 48(L_1 + L_2 N) S H \sqrt{NT \log(NT)} + 4(L_1 + L_2 N) S N H (T_1 + k_T - \tau_1 - 1) + (L_1 + L_2 N) H + N H + 4 S N^2 H (T_1 + k_T - \tau_1 - 1) + 48 S N H \sqrt{NT \log(NT)} + H S^2 + 4 S^3 N H (T_1 + k_T - \tau_1 - 1) + 48 H S^3 \sqrt{NT \log(NT)}$$
$$\le 48(L_1 + L_2 N + N + S^2) S H \sqrt{NT \log(NT)} + (L_1 + L_2 N + N + S^2) H + 4(L_1 + L_2 N + N^2 + S^2) S N H (T_1 + k_T - \tau_1 - 1).$$

B.4. Bound R4

Lemma B.5. $R_4$ satisfies the following bound:
$$R_4 \le 48(L_1 + L_2 N + N + S^2) S r_{\max} \sqrt{NT \log(NT)} + (L_1 + L_2 N + N + S^2) r_{\max} + 4(L_1 + L_2 N + N + S^2) S N r_{\max} (T_1 + k_T - \tau_1 - 1).$$

Proof. We can rewrite $R_4$ as
$$R_4 = \mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \Big(\sum_{s} r_k(s, a_t)\, \hat b_t(s) - \sum_{s} r^*(s, a_t)\, b^*_t(s)\Big)\Big] = \mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{k_T}\sum_{t=t_k+\tau_1+1}^{t_{k+1}-1} \Big(\sum_{s} r_k(s, a_t)\, \hat b_t(s) - \sum_{s} r_k(s, a_t)\, b^*_t(s) + \sum_{s} r_k(s, a_t)\, b^*_t(s) - \sum_{s} r^*(s, a_t)\, b^*_t(s)\Big)\Big] =: R^0_4 + R^1_4, \qquad (44)$$
where $r_k(s, a_t) = \sum_{r} r\, R^{a_t}_k(r \mid s)$ is the expected reward given the state $s$ of the pulled arm and $a_t$ when the reward function is $R^{a_t}_k$, and $r^*(s, a_t) = \sum_{r} r\, R^{a_t*}(r \mid s)$ is the expected reward given $s$ and $a_t$ under the true reward function $R^{a_t*}$. Equation (44) follows by adding and subtracting the term $\sum_{s} r_k(s, a_t)\, b^*_t(s)$.

For the first difference,
$$\sum_{s} r_k(s, a_t)\, \hat b_t(s) - \sum_{s} r_k(s, a_t)\, b^*_t(s) = \sum_{s} r_k(s, a_t) \big(\hat b_t(s) - b^*_t(s)\big) \le r_{\max} \|\hat b_t - b^*_t\|_1, \qquad (45)$$
where the last inequality is due to $r_k(s, a_t) \le r_{\max}$. For the second difference,
$$\sum_{s} r_k(s, a_t)\, b^*_t(s) - \sum_{s} r^*(s, a_t)\, b^*_t(s) = \sum_{s} \big(r_k(s, a_t) - r^*(s, a_t)\big)\, b^*_t(s) \le \sum_{s} \big|r_k(s, a_t) - r^*(s, a_t)\big| \le r_{\max} \sum_{s} \big\|R^{a_t}_k(\cdot \mid s) - R^{a_t*}(\cdot \mid s)\big\|_1, \qquad (46)$$
where the first inequality in (46) is due to $b^*_t(s) \le 1$ and $r_k(s, a_t) - r^*(s, a_t) \le |r_k(s, a_t) - r^*(s, a_t)|$, and the second inequality is due to $|r_k(s, a_t) - r^*(s, a_t)| = |\sum_r r\, (R^{a_t}_k(r \mid s) - R^{a_t*}(r \mid s))| \le r_{\max} \|R^{a_t}_k(\cdot \mid s) - R^{a_t*}(\cdot \mid s)\|_1$.

Based on (41), we can bound $R^0_4$ as
$$R^0_4 \le 48(L_1 + L_2 N) S r_{\max} \sqrt{NT \log(NT)} + (L_1 + L_2 N) r_{\max} + 4(L_1 + L_2 N) S N r_{\max} (T_1 + k_T - \tau_1 - 1). \qquad (47)$$
Note that for any reward function $R(r \mid z)$ in the confidence set $\mathcal{M}_k$, we have $\sum_r |R(r \mid z) - \hat R^i_k(r \mid z)| \le \beta^i_k(z)$. Then, based on (42), we get
$$R^1_4 \le 48 S^2 r_{\max} \sqrt{NT \log(NT)} + 2 S^2 N r_{\max} (T_1 + k_T - \tau_1 - 1) + S r_{\max}.$$
Then we can obtain the final bound:
$$R_4 \le 48(L_1 + L_2 N + S) S r_{\max} \sqrt{NT \log(NT)} + 4(L_1 + L_2 N + S) S N r_{\max} (T_1 + k_T - \tau_1 - 1) + (L_1 + L_2 N + S) r_{\max}$$
$$\le 48(L_1 + L_2 N + N + S^2) S r_{\max} \sqrt{NT \log(NT)} + (L_1 + L_2 N + N + S^2) r_{\max} + 4(L_1 + L_2 N + N + S^2) S N r_{\max} (T_1 + k_T - \tau_1 - 1),$$
where the last inequality is due to $S \le N + S^2$.

B.5. The total regret

Next we bound the number of episodes.

Lemma B.6 (Bound on the episode number). With the convention $T_1 = 2$ and $T_k = T_{k-1} + 1$, the number of episodes is bounded by $k_T = O(\sqrt{T})$.

Proof. Note that the total horizon is $T$ and the length of episode $k$ is $T_k = T_1 + k - 1$. Then we get
$$T = T_1 + T_2 + \cdots + T_{k_T} = T_1 + (T_1 + 1) + \cdots + (T_1 + k_T - 1) = k_T T_1 + \frac{k_T (k_T - 1)}{2}.$$
Therefore
$$k_T^2 + (2 T_1 - 1) k_T - 2 T = 0. \qquad (48)$$
With the convention $T_1 = 2$, we get $k_T \le \sqrt{2T}$, i.e., $k_T = O(\sqrt{T})$.
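As a quick illustration of Lemma B.6, the following sketch counts the episodes generated by the schedule $T_1 = 2$, $T_k = T_{k-1} + 1$ and compares the result with $\sqrt{2T}$; the horizons used are arbitrary.

```python
# A quick check of Lemma B.6: with T_1 = 2 and T_k = T_{k-1} + 1, the number of
# episodes needed to cover a horizon T stays below sqrt(2T).
def num_episodes(T: int, T1: int = 2) -> int:
    k, elapsed, length = 0, 0, T1
    while elapsed < T:
        elapsed += length        # episode k+1 has length T1 + k
        length += 1
        k += 1
    return k

for T in (100, 10_000, 1_000_000):
    print(T, num_episodes(T), round((2 * T) ** 0.5, 1))
```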
Denote $C_1 = L_1 + L_2 N + N^2 + S^2$, $C_2 = H + r_{\max}$ and $C_3 = T_1 + k_T - \tau_1 - 1$. Then we get the final regret:
$$R_T = \mathrm{Regret}(A) + R_1 + R_2 + R_3 + R_4$$
$$\le \tau_1 \Delta R\, k_T + H k_T + 48 C_1 S H \sqrt{NT \log(NT)} + 4 C_1 C_3 S N H + C_1 H + 48 C_1 S r_{\max} \sqrt{NT \log(NT)} + 4 C_1 C_3 S N r_{\max} + C_1 r_{\max}$$
$$\le (\tau_1 \Delta R + H)\sqrt{2T} + 48 C_1 S (H + r_{\max}) \sqrt{NT \log(NT)} + 4 C_1 S N (r_{\max} + H) \sqrt{2T} + C_1 (H + r_{\max})$$
$$= 48 C_1 C_2 S \sqrt{NT \log(NT)} + (\tau_1 \Delta R + H + 4 C_1 C_2 S N) \sqrt{2T} + C_1 C_2,$$
where the third inequality uses $k_T \le \sqrt{2T}$ and $C_3 \le \sqrt{2T}$. Thus, we get the final theorem.

Theorem B.7. Suppose Assumptions 3.1 and 3.2 hold and the Oracle returns the optimal policy in each episode. The Bayesian regret of our algorithm satisfies
$$R_T \le 48 C_1 C_2 S \sqrt{NT \log(NT)} + (\tau_1 \Delta R + H + 4 C_1 C_2 S N) \sqrt{2T} + C_1 C_2,$$
where $C_1 = L_1 + L_2 N + N^2 + S^2$ and $C_2 = r_{\max} + H$ are constants independent of the time horizon $T$, $L_1 = \frac{4(1-\epsilon_1)^2}{N \epsilon_1^2 \epsilon_2}$, $L_2 = \frac{4(1-\epsilon_1)^2}{\epsilon_1^3}$, $\epsilon_1$ and $\epsilon_2$ are the minimum elements of the functions $P^*$ and $R^*$, respectively, $\tau_1$ is the fixed exploration length in each episode, $\Delta R$ is the gap between the maximum and minimum rewards, $H$ is the bound on the span, $r_{\max}$ is the maximum reward obtainable in each step, $N$ is the number of arms and $S$ is the number of states of each arm.

C. Posterior distribution

Note that we assume the state transition is independent of the action for each arm. Denote the state history of arm $i$ from time $0$ to $t$ by $s^i_{0:t}$, the collected reward history by $r^i_{0:t}$, and the action history from time $0$ to $t$ by $a^i_{0:t}$. Denote by $N^i_{s,s'}(s^i_{0:t})$ the number of times the state of arm $i$ transitions from $s$ to $s'$ in the state history $s^i_{0:t}$. Hence, if the prior $g(P^i(s, \cdot))$ is $\mathrm{Dirichlet}(\phi^i_{s,s_1}, \ldots, \phi^i_{s,s_S})$, then after observing the history $s^i_{0:t}$, the posterior $g(P^i(s, \cdot) \mid s^i_{0:t})$ is $\mathrm{Dirichlet}(\phi^i_{s,s_1} + N^i_{s,s_1}(s^i_{0:t}), \ldots, \phi^i_{s,s_S} + N^i_{s,s_S}(s^i_{0:t}))$ (Ross et al., 2011). Similarly, if the prior $g(R^i(s, \cdot))$ is $\mathrm{Dirichlet}(\psi^i_{s,r_1}, \ldots, \psi^i_{s,r_k})$, then after observing the reward history $r^i_{0:t}$ and $s^i_{0:t}$, the posterior $g(R^i(s, \cdot) \mid r^i_{0:t}, s^i_{0:t})$ is $\mathrm{Dirichlet}(\psi^i_{s,r_1} + N^i_{s,r_1}(s^i_{0:t}, r^i_{0:t}), \ldots, \psi^i_{s,r_k} + N^i_{s,r_k}(s^i_{0:t}, r^i_{0:t}))$, where $N^i_{s,r}$ is the number of times the observation $(s, r)$ appears in the history $(s^i_{0:t}, r^i_{0:t})$.

Here we drop the arm index and consider a fixed arm. For the unknown transition function, we assume the prior $g_0(P) = f\big(\frac{P - \epsilon_1 \mathbf{1}}{1 - S\epsilon_1} \,\big|\, \phi\big)$. We use this particular prior because every element of the transition matrix is at least $\epsilon_1$. Next we show how to update the posterior distribution for the unknown $P$; the update for the unknown reward function $R$ is analogous and omitted.
$$g(P \mid a_{0:t-1}, r_{0:t-1}) = \frac{P(r_{0:t-1}, s_t \mid P, a_{0:t-1})\, g(P, a_{0:t-1})}{\int P(r_{0:t-1}, s_t \mid P, a_{0:t-1})\, g(P, a_{0:t-1})\, dP} = \frac{\sum_{s_{0:t-1} \in \mathcal{S}^t} P(r_{0:t-1}, s_{0:t} \mid P, a_{0:t-1})\, g(P)}{\int P(r_{0:t-1}, s_t \mid P, a_{0:t-1})\, g(P, a_{0:t-1})\, dP}$$
$$\propto \sum_{s_{0:t-1} \in \mathcal{S}^t} g(P) \prod_{i=1}^{t} P(s_i \mid s_{i-1}) \propto \sum_{s_{0:t-1} \in \mathcal{S}^t} g(P) \prod_{s,s'} \Big(\frac{P(s' \mid s) - \epsilon_1}{1 - \epsilon_1}\Big)^{N_{ss'}(s_{0:t})},$$
where the last step uses the prior $g_0(P) = f\big(\frac{P - \epsilon_1 \mathbf{1}}{1 - S\epsilon_1} \,\big|\, \phi\big)$ for the unknown $P$.

Next we show the Bayesian approach to learning the unknown $P$ and $R$ from the history $(a_{0:t-1}, r_{0:t})$. Since the current state $s_t$ of the agent at time $t$ is unobserved, we consider a joint posterior $g(s_t, P, R \mid a_{0:t-1}, r_{0:t})$ over $s_t$, $P$ and $R$ (Ross et al., 2011). Most parts are similar to Ross et al. (2011), except for our special priors:
$$g(s_t, P, R \mid a_{0:t-1}, r_{0:t-1}) \propto P(r_{0:t}, s_t \mid P, R, a_{0:t-1})\, g(P, R, a_{0:t-1}) \propto \sum_{s_{0:t-1} \in \mathcal{S}^t} P(r_{0:t}, s_{0:t} \mid P, R, a_{0:t-1})\, g(P, R)$$
$$\propto \sum_{s_{0:t-1} \in \mathcal{S}^t} g(s_0, P, R) \prod_{i=1}^{t} P(s_i \mid s_{i-1})\, R(r_i \mid s_i) \propto \sum_{s_{0:t-1} \in \mathcal{S}^t} g(s_0, P, R) \Big[\prod_{s,s'} \Big(\frac{P(s' \mid s) - \epsilon_1}{1 - \epsilon_1}\Big)^{N_{ss'}(s_{0:t})}\Big] \Big[\prod_{s,r} \Big(\frac{R(r \mid s) - \epsilon_2}{1 - \epsilon_2}\Big)^{N_{sr}(s_{0:t}, r_{0:t-1})}\Big],$$
where $g(s_0, P, R)$ is the joint prior over the initial state $s_0$, the transition function $P$ and the reward function $R$; $N_{ss'}(s_{0:t})$ is the number of times the transition $(s, s')$ appears in the state history $s_{0:t}$; and $N_{sr}(s_{0:t}, r_{0:t-1})$ is the number of times the observation $(s, r)$ appears in the state-reward history $(s_{0:t}, r_{0:t-1})$.
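The conjugate count update described above is straightforward to implement when the state history is given. The following minimal sketch does so for one arm's transition function; the state history is hypothetical, and with unobserved states the posterior used by TSEETC is a mixture of such Dirichlets, one per candidate state sequence.

```python
# Minimal sketch of the conjugate Dirichlet update for one arm's transition
# function, assuming a fully observed state history (with unobserved states the
# posterior becomes a mixture of such Dirichlets, one per state sequence).
# The history below is hypothetical.
import numpy as np

S = 3
phi = np.ones((S, S))                      # prior Dirichlet parameters phi_{s,s'}
states = [0, 1, 1, 2, 0, 0, 1, 2, 2, 1]    # hypothetical state history s_{0:t}

for s, s_next in zip(states[:-1], states[1:]):
    phi[s, s_next] += 1                    # posterior parameter = prior + transition count

rng = np.random.default_rng(3)
P_sample = np.vstack([rng.dirichlet(phi[s]) for s in range(S)])  # one posterior draw of P
print(P_sample)                            # each row is a sampled distribution over next states
```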
D. Technical Results

Proposition D.1 (Uniform bound on the bias span (Zhou et al., 2021)). If the belief MDP satisfies Assumptions 3.1 and 3.2, then for any $(J(\theta), v(\cdot, \theta))$ satisfying the Bellman equation (2), the span of the bias function, $\mathrm{span}(v, \theta) := \max_{b} v(b, \theta) - \min_{b} v(b, \theta)$, is bounded by $H$, where
$$H := 8\Big(\frac{2}{(1-\alpha)^2} + \frac{(1+\alpha)\log_{\alpha}(1-\alpha)}{1-\alpha}\Big), \qquad \text{with } \alpha = \frac{1 - \epsilon_1}{1 - \epsilon_1/2} \in (0, 1).$$

Proposition D.2 (Controlling the belief error (Xiong et al., 2022e)). Suppose Assumptions 3.1 and 3.2 hold. Let $(R_k, P_k)$ be an estimator of the true model parameters $(R^*, P^*)$. For an arbitrary reward-action sequence $\{r_t, a_t\}$, let $\hat b_t(\cdot\,; R_k, P_k)$ and $b_t(\cdot\,; R^*, P^*)$ be the corresponding beliefs in period $t$ under $(R_k, P_k)$ and $(R^*, P^*)$, respectively. Then there exist constants $L_1$, $L_2$ such that
$$\big\|b_t(\cdot\,; R^*, P^*) - \hat b_t(\cdot\,; R_k, P_k)\big\|_1 \le L_1 \|R_k - R^*\|_1 + L_2 \max_{s} \|P^*(s, :) - P_k(s, :)\|_2,$$
where $L_1 = \frac{4(1-\epsilon_1)^2}{N \epsilon_1^2 \epsilon_2}$, $L_2 = \frac{4(1-\epsilon_1)^2}{\epsilon_1^3}$, and $\epsilon_1$, $\epsilon_2$ are the minimum elements of the functions $P^*$ and $R^*$, respectively.

Lemma D.3 (Lemma 13 in Jung et al. (2019)). Suppose $a_k$ and $b_k$ are probability distributions over a set $[n_k]$ for $k \in [K]$. Then we have
$$\sum_{x \in \prod_{k=1}^{K} [n_k]} \Big|\prod_{k=1}^{K} a_k(x_k) - \prod_{k=1}^{K} b_k(x_k)\Big| \le \sum_{k=1}^{K} \|a_k - b_k\|_1.$$

Lemma D.4 (Lemma 17 in Auer et al. (2008)). For any $t \ge 1$, the probability that the true MDP $M$ is not contained in the set of plausible MDPs $\mathcal{M}(t)$ at time $t$ is at most $\frac{\delta}{15 t^6}$, that is, $\mathbb{P}\{M \notin \mathcal{M}(t)\} < \frac{\delta}{15 t^6}$.
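To give a sense of scale, the constants entering Theorem B.7 are simple functions of $\epsilon_1$, $\epsilon_2$, $N$, $S$, $H$ and $r_{\max}$. The following sketch evaluates $L_1$, $L_2$, $\alpha$, $C_1$ and $C_2$ for hypothetical parameter values; $H$ and $r_{\max}$ are treated as given numbers here rather than derived from the span bound.

```python
# Evaluating the constants from Propositions D.1-D.2 and Theorem B.7 for
# hypothetical parameter values (epsilon_1, epsilon_2, N, S chosen arbitrarily;
# H and r_max are assumed inputs rather than recomputed).
eps1, eps2 = 0.05, 0.05      # minimum entries of P* and R* (hypothetical)
N, S = 5, 3                  # number of arms and states per arm (hypothetical)
H, r_max = 50.0, 1.0         # span bound and maximum one-step reward (assumed)

L1 = 4 * (1 - eps1) ** 2 / (N * eps1 ** 2 * eps2)   # Proposition D.2
L2 = 4 * (1 - eps1) ** 2 / eps1 ** 3                # Proposition D.2
alpha = (1 - eps1) / (1 - eps1 / 2)                 # contraction factor in Proposition D.1

C1 = L1 + L2 * N + N ** 2 + S ** 2                  # constants in Theorem B.7
C2 = r_max + H
print(f"L1={L1:.1f}  L2={L2:.1f}  alpha={alpha:.4f}  C1={C1:.1f}  C2={C2:.1f}")
print("leading sqrt(NT log NT) coefficient:", 48 * C1 * C2 * S)
```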