# Regime Switching Bandits

Xiang Zhou, Yi Xiong, Ningyuan Chen, Xuefeng Gao (the first two authors, Xiang Zhou and Yi Xiong, contributed equally)

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; 1911606962@qq.com
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; yxiong@se.cuhk.edu.hk
The Rotman School of Management, University of Toronto; ningyuan.chen@utoronto.ca
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong; xfgao@se.cuhk.edu.hk

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract. We study a multi-armed bandit problem where the rewards exhibit regime switching. Specifically, the distributions of the random rewards generated from all arms are modulated by a common underlying state modeled as a finite-state Markov chain. The agent does not observe the underlying state and has to learn the transition matrix and the reward distributions. We propose a learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models, belief error control in partially observable Markov decision processes, and upper-confidence-bound methods for online learning. We establish a regret upper bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. Finally, we conduct proof-of-concept experiments to illustrate the performance of the learning algorithm.

1 Introduction

The multi-armed bandit (MAB) problem is a popular model for sequential decision making with unknown information: the decision maker chooses repeatedly among $I$ different options, or arms. After each decision she receives a random reward drawn from an unknown probability distribution that depends on the chosen arm. The objective is to maximize the expected total reward over a finite horizon of $T$ periods. The MAB problem has been extensively studied in various fields and applications including Internet advertising, dynamic pricing, recommender systems, clinical trials and medicine [12, 13, 43].

In the classical MAB problem, it is typically assumed that the random reward of each arm is i.i.d. (independently and identically distributed) over time and independent of the rewards from other arms. However, these assumptions do not necessarily hold in practice [10]. To address this drawback, a growing body of literature studies MAB problems with non-stationary rewards, which capture temporal changes in the reward distributions in applications; see, e.g., [11, 16, 20].

In this paper, we study a non-stationary MAB model with Markovian regime-switching rewards. We assume that the random rewards associated with all the arms are modulated by a common unobserved state (or regime) $\{M_t : t = 1, 2, \dots\}$ modeled as a finite-state discrete-time Markov chain. This chain makes a transition at each period regardless of which arm is pulled, and its transition probabilities are independent of the action chosen. Given $M_t = m$, the reward of arm $i$ is i.i.d., with distribution denoted by $Q(\cdot|m, i)$. Such structural change of the environment is usually referred to as regime switching in finance [33]. The agent does not observe or control the underlying state $M_t$, and has to learn the transition probability matrix $P$ of $\{M_t\}$ as well as the reward distribution $Q(\cdot|m, i)$ of each arm based on the observed historical rewards. The goal of the agent is to design a learning policy that decides which arm to pull in each period so as to minimize the expected regret over $T$ periods.
Regime-switching models are widely used in various industries. For example, in finance, sudden changes of market environments are usually modeled by hidden Markov chains. In revenue management and marketing, a firm may face a shift in consumer demand due to undetected changes in sentiment or competition. In such cases, when the agents (traders or firms) take actions (trading financial assets with different strategies, or setting prices), they need to learn the reward and the underlying state at the same time. Our setup is designed to tackle such problems.

Our Contribution. Our study features novel designs in three regards: we propose a new bandit problem formulation, develop a new learning algorithm, and prove regret bounds with new techniques.

In terms of problem formulation, online learning with unobserved states has attracted some attention recently [8, 19]. We consider the strongest oracle among these studies, who knows $P$ and $Q(\cdot|m, i)$ but does not observe the hidden state $M_t$. The oracle thus faces a partially observable Markov decision process (POMDP) [26] with a long-run average reward objective. By reformulating it as a Markov decision process (MDP) with a continuous belief space (i.e., a distribution over hidden states), the oracle solves for the optimal policy (mapping belief states to actions) using the Bellman equation. Having sublinear regret benchmarked against this strong oracle, our algorithm has better theoretical performance than others with weaker oracles such as the best fixed arm [19] or memoryless policies (where the action only depends on the current observation) [8].

In terms of algorithmic design, we propose a learning algorithm (see Algorithm 2) with two key ingredients. First, it builds on the recent advances in estimating the parameters of hidden Markov models (HMMs) using spectral method-of-moments methods [3, 2, 8]. It benefits from the theoretical finite-sample bounds of spectral estimators, while finite-sample guarantees for alternatives such as maximum likelihood estimators remain an open problem [30]. Second, it builds on the well-known upper confidence bound (UCB) method in reinforcement learning [7, 25]. There are two difficulties here, because the oracle uses the optimal (belief-based) policy of the POMDP. First, the spectral method cannot use the non-i.i.d. samples generated from the belief-based policy due to the complex history dependency. Second, the belief of the hidden state is subject to estimation error. Hence, we divide the horizon into nested exploration and exploitation phases. We use spectral estimators in the exploration phases to gauge the estimation error of $P$ and $Q(\cdot|m, i)$. We use the UCB method to control the regret in the exploitation phases. Different from other learning problems, we re-calibrate the belief at the beginning of each exploitation phase based on the parameters estimated in the most recent exploration phase using previous exploratory samples.
In terms of technical analysis, we establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for our proposed learning algorithm, where $T$ is the learning horizon. Our regret analysis draws inspiration from [25, 35] for learning MDPs and undiscounted reinforcement learning problems, but the analysis differs significantly from theirs due to two main technical challenges in our problem. First, to control the regret, we need to control the error of the belief state, which itself is not directly observed and needs to be estimated. This is in stark contrast to learning MDPs [25, 35] with observed states. Specifically, we need to bound the estimation error of belief states by the estimation errors of the model parameters. In addition, since the belief state is unobservable, we also need to bound the error in the belief transition kernel, which measures the distance between the belief transition kernel under the optimistic belief MDP model (from the UCB component of our algorithm) at each episode and the belief transition kernel under the true model. These bounds are not trivial, since the transition kernel of the belief state depends on the model parameters in a complex way via Bayesian updating. We overcome the difficulties by building on [18] and a delicate analysis of the belief transition kernel to control the errors in the estimation of belief states and belief transitions.

Second, to establish the regret bound, we need an explicit bound for the span of the bias function (also referred to as the relative value function) of the belief MDP, which has a continuous state space. Such a bound is often critical in the regret analysis of undiscounted reinforcement learning for continuous MDPs, but it is either taken as an assumption [39] or proved under Hölder continuity assumptions that do not hold for the belief transitions in our setting [35, 27]. We overcome this challenge and bound the bias span by developing a novel approach, which could be of independent interest for learning continuous-state MDPs. Specifically, we bound the bias span by bounding the Lipschitz module of the bias function for our infinite-horizon undiscounted problem (with a long-run average reward objective). To achieve this, we rely on a non-trivial application of the results in [23], which provide general tools for proving Lipschitz continuity of value functions in finite-horizon discounted MDPs. One key step is to bound the Lipschitz module of the belief transitions using the Kantorovich metric, which allows us to establish a bound on the Lipschitz module of the value function of the finite-horizon discounted problem uniformly over the discount factors. Exploiting the connection with the infinite-horizon undiscounted problem then yields an explicit bound on the bias span for our problem. We also mention that our bound on the bias span is unrelated to the diameter of the POMDP discussed in [8]. The diameter in [8] is only for observation-based policies, not for the belief-state based policies we consider. That also explains why we need a new approach to bound the bias span.

Related Work. Studies on non-stationary/switching MAB investigate the problem when the rewards of all arms may change over time [5, 20, 10, 6]. Our formulation can be regarded as a non-stationary MAB with a special structure. These studies consider an even stronger oracle than ours, the best arm in each period. However, the total number of changes or the changing budget has to be sublinear in $T$ to achieve sublinear regret.
In our formulation, the total number of transitions of the underlying state is linear in $T$, and the algorithms proposed in those papers fail even against our oracle (see Section 5). Other studies focus on linear changing budgets with additional structure such as seasonality [15], which is not present in this paper. Our work is also related to the restless Markov bandit problem [21, 36, 44], in which the state of each arm evolves according to independent Markov chains. In contrast, our regime-switching MAB model assumes a common underlying Markov chain so that the rewards of all arms are correlated, and the underlying state is unobservable to the agent. In addition, our work is related to MAB studies where the rewards of all arms depend on a common unknown parameter or a latent random variable; see, e.g., [4, 22, 28, 32, 34]. Our model differs from them in that the common latent state follows a dynamic stochastic process, which introduces difficulties in algorithm design and analysis.

Two papers have settings similar to ours. [19] studies MAB problems whose rewards are modulated by an unobserved Markov chain and where the transition matrix may depend on the action. However, their oracle when defining the regret is the best fixed arm, which is much weaker than the optimal policy of the POMDP (the performance gap between the two oracles is linear). Therefore, their algorithm is expected to have linear regret when measured against our oracle. [8] proposes an algorithm based on spectral estimators for learning POMDPs. Their algorithm will generally suffer linear regret in our problem setting. This is because we consider a stronger oracle, the optimal POMDP policy, whereas their oracle is the optimal memoryless policy, i.e., a policy that only depends on the current reward observation instead of using all historical observations to form the belief of the underlying state. It is known that memoryless policies are in general not optimal, and hence the gap between their oracle and our oracle can be linear. Note that considering the memoryless policy allows [8] to circumvent the introduction of the belief entirely. Indeed, the dynamics under a memoryless policy can be viewed as a finite-state (modified) HMM, and spectral estimators can be applied. In contrast, because we consider the optimal belief-based policy, such a reduction is not available. Our algorithm and regret analysis hinge on the interaction between the estimation of the belief and the spectral estimators. Our algorithm needs separate exploration to apply the spectral methods and uses the belief-based policy for exploitation. For the regret analysis, unlike [8], we have to carefully control the belief error and bound the span of the bias function of the optimistic belief MDP in each episode.

The comparison of our study with related papers is summarized in Table 1. Note that we only present the regret in terms of $T$.

| Papers | Oracle | Changing budget | Regret |
| --- | --- | --- | --- |
| [5] | Best fixed action | Linear | $O(\sqrt{T})$ |
| [20] | Best action in each period | Finite | $O(\sqrt{T})$ |
| [10] | Best action in each period | Sublinear | $O(T^{2/3})$ |
| [8] | Optimal memoryless policy | Linear | $O(\sqrt{T})$ |
| This paper | Optimal POMDP policy | Linear | $O(T^{2/3})$ |

Table 1: Comparison of our study with some related literature.

2 MAB with Markovian Regime Switching

2.1 Problem Formulation

Consider the MAB problem with arms $\mathcal{I} := \{1, \dots, I\}$. There is a Markov chain $\{M_t\}$ with states $\mathcal{M} := \{1, 2, \dots, M\}$ and transition probability matrix $P \in \mathbb{R}^{M \times M}$.
In period $t = 1, 2, \dots$, if the state of the Markov chain is $M_t = m$ and the agent chooses arm $I_t = i$, then the reward in that period is $R_t$ with discrete finite support, and its distribution is denoted by $Q(\cdot|m, i) := \mathbb{P}(R_t \in \cdot \mid M_t = m, I_t = i)$, with $\mu_{m,i} := \mathbb{E}[R_t \mid M_t = m, I_t = i]$. We use $\mu := (\mu_{m,i}) \in \mathbb{R}^{M \times I}$ to denote the mean reward matrix. The agent knows $M$ and $I$, but has no knowledge about the underlying state $M_t$ (also referred to as the regime), the transition matrix $P$, or the reward distributions $Q(\cdot|m, i)$. The goal is to design a learning policy, adapted to the filtration generated by the observed rewards, that decides which arm to pull in each period to maximize the expected cumulative reward over $T$ periods, where $T$ is unknown in advance.

If an oracle knows $P$, $Q(\cdot|m, i)$ and the underlying state $M_t$, then the problem becomes trivial, as s/he would select $I^*_t = \arg\max_{i \in \mathcal{I}} \mu_{M_t, i}$ in period $t$. If we benchmark a learning policy against this oracle, then the regret must be linear in $T$, because the oracle always observes $M_t$ while the agent cannot predict the transition based on the history. Whenever a transition occurs, a non-vanishing regret is incurred. Since the number of transitions during $[0, T]$ is linear in $T$, the total regret is of the same order. Since comparing to the oracle knowing $M_t$ is uninformative, we consider a weaker oracle who knows $P$ and $Q(\cdot|m, i)$, but not $M_t$. In this case, the oracle solves a POMDP, since the states $M_t$ are unobservable and the optimal policy maps belief states (distributions over the hidden states) to actions. The total expected reward of the POMDP scales linearly in $T$, and asymptotically the reward per period converges to a constant denoted by $\rho^*$ under the optimal belief-based policy; see Section 2.2. For a learning policy $\pi$, we denote by $R^\pi_t$ the reward received under the learning policy $\pi$ (which does not know $P$, $Q(\cdot|m, i)$ initially) in period $t$. We follow the literature (see, e.g., [25, 36, 1]) and define its total regret after $T$ periods by

$$R_T := T\rho^* - \sum_{t=1}^{T} R^\pi_t. \qquad (1)$$

Our goal is to design a learning algorithm with theoretical guarantees, including high-probability and expectation bounds (sublinear in $T$) on the total regret.

Without loss of generality, we consider Bernoulli rewards with mean $\mu_{m,i} \in (0, 1)$ for all $m, i$; hence $\mu_{m,i}$ characterizes the distribution $Q(\cdot|m, i)$. Our analysis holds generally for random rewards with discrete finite support. In addition, we impose the following assumptions.

Assumption 1. The transition matrix $P$ of the Markov chain $\{M_t\}$ is invertible.

Assumption 2. The mean reward matrix $\mu = (\mu_{m,i})$ has full row rank.

Assumption 3. The smallest element of the transition matrix satisfies $\epsilon := \min_{i,j \in \mathcal{M}} P_{ij} > 0$.

The first two assumptions are required for the finite-sample guarantees of spectral estimators for HMMs [3, 2]. The third assumption is needed to control the belief error of the hidden states by state-of-the-art methods; see [18] and our Proposition 3.
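To make the dynamics concrete, below is a minimal simulation sketch of the regime-switching reward process described above. The specific values of $P$ and $\mu$ are hypothetical placeholders, chosen only so that Assumptions 1–3 hold; they are not the parameters used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model primitives (M = 2 hidden states, I = 2 arms), chosen only so
# that P is invertible, mu has full row rank, and min_ij P_ij > 0 (Assumptions 1-3).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])          # transition matrix of the hidden chain {M_t}
mu = np.array([[0.9, 0.2],
               [0.4, 0.7]])         # mu[m, i] = E[R_t | M_t = m, I_t = i]

def step(state, arm):
    """One period: draw a Bernoulli reward from Q(.|state, arm), then the hidden
    chain makes a transition according to P, regardless of the arm pulled."""
    reward = rng.binomial(1, mu[state, arm])
    next_state = rng.choice(P.shape[0], p=P[state])
    return reward, next_state

# Simulate T periods; the agent observes rewards but never the hidden state.
T, state, total = 1000, 0, 0
for t in range(T):
    arm = int(rng.integers(mu.shape[1]))   # placeholder policy: pick an arm at random
    reward, state = step(state, arm)
    total += reward
print("average reward per period:", total / T)
```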
We next reformulate the POMDP as a belief MDP.

2.2 Reduction of POMDP to Belief MDP

To present our learning algorithm and analyze the regret, we first investigate the POMDP problem faced by the oracle, where the parameters $P$ and $\mu$ (equivalently $Q$ for Bernoulli rewards) are known but the states $M_t$ are unobserved. Based on the observed history, the oracle forms a belief of the underlying state. The belief can be encoded by an $M$-dimensional vector $b_t = (b_t(1), \dots, b_t(M)) \in B$:
$$b_t(m) := \mathbb{P}(M_t = m \mid I_1, \dots, I_{t-1}, R_1, \dots, R_{t-1}), \quad \text{where } B := \Big\{ b \in \mathbb{R}^M_+ : \textstyle\sum_{m=1}^M b(m) = 1 \Big\}.$$

It is well known that the POMDP of the oracle can be viewed as an MDP on the (continuous) belief state space $B$, where a policy is a mapping from belief states to actions; see [26]. We next introduce a few notations that facilitate the analysis. For notational simplicity, we write $c(m, i) := \mu_{m,i}$. Given a belief $b \in B$, the expected reward of arm $i$ is $c(b, i) := \sum_{m=1}^M c(m, i)\, b(m)$. In period $t$, given belief $b_t = b$, $I_t = i$ and $R_t = r$, Bayes' theorem yields the updated belief $b_{t+1} = H_{\mu, P}(b, i, r)$, whose $m$-th entry is
$$b_{t+1}(m) = \frac{\sum_{m'} P(m', m)\,(\mu_{m',i})^{r}(1-\mu_{m',i})^{1-r}\, b_t(m')}{\sum_{m'} (\mu_{m',i})^{r}(1-\mu_{m',i})^{1-r}\, b_t(m')}.$$
It is clear that the forward function $H$ depends on the transition matrix $P$ and the mean reward matrix $\mu$. We can also define the transition probability of the belief state conditional on the arm pulled: $T(\cdot|b, i) := \mathbb{P}(b_{t+1} \in \cdot \mid b, i)$, where $b_{t+1}$ is random due to the random reward. The long-run average reward of the infinite-horizon belief MDP following policy $\pi$ with initial belief $b$ can be written as
$$\rho^\pi_b := \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^T R^\pi_t \,\Big|\, b_1 = b\Big].$$
The optimal (belief-based) policy maximizes $\rho^\pi_b$ for a given $b$. One can show that $\sup_\pi \rho^\pi_b$ is independent of the initial belief $b$ (Proposition 8.2.1 of [38]). Therefore, we use $\rho^* := \sup_\pi \rho^\pi_b$ to denote the optimal long-run average reward. Under this belief MDP formulation, for all $b \in B$, the Bellman equation states that
$$\rho^* + v(b) = \max_{i \in \mathcal{I}} \Big\{ c(b, i) + \int_B T(db'|b, i)\, v(b') \Big\}, \qquad (2)$$
where $v : B \to \mathbb{R}$ is the bias function. It can be shown (see the proof of Proposition 2 in the Appendix) that under our assumptions, $\rho^*$ and $v(b)$ are well defined and there exists a stationary deterministic optimal policy $\pi^*$ which maps a belief state to an arm to pull (an action that maximizes the right-hand side of (2)). Finding the optimal policy for a POMDP model is computationally intractable in general [37, 31]. Therefore, various approximate methods have been proposed to solve (2) and to find the optimal policy for the belief MDP; see e.g. [46, 41, 42]. In this work, we do not focus on this planning problem for a known model, and we assume access to an optimization oracle that solves (2) and returns the optimal average reward $\rho^*$ and the optimal stationary policy for a given known model.
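As an illustration, the following is a minimal sketch of the belief update $H_{\mu,P}(b, i, r)$ above for Bernoulli rewards; the parameter values are hypothetical.

```python
import numpy as np

def belief_update(b, arm, reward, mu, P):
    """One step of H_{mu,P}(b, i, r): reweight the belief by the likelihood of the
    observed Bernoulli reward under each hidden state, normalize, then push the
    posterior through the transition matrix P to predict the next state."""
    likelihood = mu[:, arm] ** reward * (1.0 - mu[:, arm]) ** (1 - reward)
    posterior = likelihood * b
    posterior /= posterior.sum()
    return posterior @ P          # entry m: sum_{m'} P(m', m) * posterior(m')

# Hypothetical parameters with M = 2 states and I = 2 arms.
P = np.array([[0.8, 0.2], [0.3, 0.7]])
mu = np.array([[0.9, 0.2], [0.4, 0.7]])

b = np.array([0.5, 0.5])          # initial belief b_1
for arm, reward in [(0, 1), (1, 0), (0, 1)]:
    b = belief_update(b, arm, reward, mu, P)
    print(b, b.sum())             # stays a probability vector over the M states
```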
3 The SEEU Algorithm

This section describes our learning algorithm for the regime-switching MAB model: the Spectral Exploration and Exploitation with UCB (SEEU) algorithm. To devise a learning policy for the POMDP with unknown $\mu$ and $P$, one needs a procedure to estimate those quantities from observed rewards. [3, 2] propose the so-called spectral estimator for the unknown parameters in HMMs. However, their algorithm is not directly applicable to our problem, because there is no decision making involved in HMMs. To use the spectral estimator, we divide the learning horizon $T$ into nested exploration and exploitation phases. In each exploration phase, we select an arm uniformly at random in each period. This transforms the system into an HMM, so that we can apply the spectral method to estimate $\mu$ and $P$ from the rewards observed in the phase. In each exploitation phase, based on the estimates obtained from the exploration phases, we use a UCB-type policy to further narrow down the optimal belief-based policy of the POMDP introduced in Section 2.2.

3.1 Spectral Estimator

We introduce the spectral estimator [3, 2] and adapt it to our setting. To simplify the notation, suppose the exploration phase starts from period 1 and lasts until period $n$, with realized arms $\{i_1, \dots, i_n\}$ and realized rewards $\{r_1, \dots, r_n\}$ sampled from Bernoulli distributions. Recall that $I$ is the cardinality of the arm set $\mathcal{I}$; one can then create a one-to-one mapping from a pair $(i, r)$ to a scalar $s \in \{1, 2, \dots, 2I\}$. Therefore, the pair can be expressed as a vector $y \in \{0, 1\}^{2I}$ such that in each period $t$, $y_t$ satisfies $\mathbb{1}\{y_t = e_s\} = \mathbb{1}\{r_t = r, i_t = i\}$, where $e_s$ is the basis vector whose $s$-th element is one and all other elements are zero. Let $A \in \mathbb{R}^{2I \times M}$ be the observation probability matrix conditional on the state: $A(s, m) = \mathbb{P}(R_t = r, I_t = i \mid M_t = m)$. It can be shown that $A$ satisfies $\mathbb{E}[y_t \mid M_t = m] = A e_m$ and $\mathbb{E}[y_{t+1} \mid M_t = m] = A P^\top e_m$. Write $\otimes$ for the tensor product. For three consecutive observations $y_{t-1}, y_t, y_{t+1}$, define
$$\tilde{y}_{t-1} := \mathbb{E}[y_{t+1} \otimes y_t]\, \big(\mathbb{E}[y_{t-1} \otimes y_t]\big)^{-1} y_{t-1}, \qquad \tilde{y}_t := \mathbb{E}[y_{t+1} \otimes y_{t-1}]\, \big(\mathbb{E}[y_t \otimes y_{t-1}]\big)^{-1} y_t,$$
$$M_2 := \mathbb{E}[\tilde{y}_{t-1} \otimes \tilde{y}_t], \qquad M_3 := \mathbb{E}[\tilde{y}_{t-1} \otimes \tilde{y}_t \otimes y_{t+1}].$$
From the observations $\{y_1, y_2, \dots, y_n\}$, we may construct the estimates $\hat{M}_2$ and $\hat{M}_3$ of $M_2$ and $M_3$, respectively, and apply tensor decomposition to obtain the estimator $\hat{\mu}$ of the unknown mean reward matrix and $\hat{P}$ of the transition matrix. This procedure is summarized in Algorithm 1. We use $\hat{A}_m$ (respectively $\hat{B}_m$) to denote the $m$-th column vector of $\hat{A}$ (respectively $\hat{B}$).

Algorithm 1: Spectral estimation of $(\mu, P)$ from the observations in the exploration phases [2, 3, 8].
Input: sample size $n$; $\{y_1, y_2, \dots, y_n\}$ created from the rewards $\{r_1, \dots, r_n\}$ and arms $\{i_1, \dots, i_n\}$.
Output: the estimates $\hat{\mu}$, $\hat{P}$.
1: For $i, j \in \{-1, 0, 1\}$: compute $\hat{W}_{i,j} = \frac{1}{n-2} \sum_{t=2}^{n-1} y_{t+i} \otimes y_{t+j}$.
2: For $t = 2, \dots, n-1$: compute $\hat{y}_{t-1} := \hat{W}_{1,0}(\hat{W}_{-1,0})^{-1} y_{t-1}$ and $\hat{y}_t := \hat{W}_{1,-1}(\hat{W}_{0,-1})^{-1} y_t$.
3: Compute $\hat{M}_2 := \frac{1}{n-2} \sum_{t=2}^{n-1} \hat{y}_{t-1} \otimes \hat{y}_t$ and $\hat{M}_3 := \frac{1}{n-2} \sum_{t=2}^{n-1} \hat{y}_{t-1} \otimes \hat{y}_t \otimes y_{t+1}$.
4: Apply tensor decomposition ([2]): $\hat{B} = \text{TensorDecomposition}(\hat{M}_2, \hat{M}_3)$.
5: Compute $\hat{A}_m = \hat{W}_{-1,0}(\hat{W}_{1,0})^{\dagger} \hat{B}_m$ for each $m \in \mathcal{M}$.
6: Recover the $m$-th row vector $(\hat{\mu})_m$ of $\hat{\mu}$ from $\hat{A}_m$.
7: Return $\hat{P} = (\hat{A}^{\dagger} \hat{B})^\top$, where $\dagger$ denotes the pseudo-inverse of a matrix.

In addition, we have the following result ([8]), which provides the confidence regions for the estimators in Algorithm 1.

Proposition 1. Under Assumptions 1 and 2, for any $\delta \in (0, 1)$ and any initial distribution, there exists $N_0$ such that when $n \ge N_0$, with probability at least $1 - \delta$, the estimates $\hat{\mu}$ and $\hat{P}$ returned by Algorithm 1 satisfy
$$\|(\mu)_m - (\hat{\mu})_m\|_2 \le C_1 \sqrt{\frac{\log\big(9(S+1)/\delta\big)}{n}} \ \ \text{for all } m \in \mathcal{M}, \qquad \|P - \hat{P}\|_2 \le C_2 \sqrt{\frac{\log\big(9(S+1)/\delta\big)}{n}}, \qquad (3)$$
where $(\mu)_m$ and $(\hat{\mu})_m$ are the $m$-th row vectors of $\mu$ and $\hat{\mu}$, respectively. Here, $S = 2I$, and $C_1, C_2$ are constants independent of $n$. The expressions of the constants $N_0$, $C_1$, $C_2$ are given in Section B of the appendix. Note that the parameters $\mu$, $P$ are identifiable only up to permutations of the hidden states [8].
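For concreteness, below is a minimal numpy sketch of the moment-construction part of Algorithm 1 (Steps 1–3). The one-hot encoding of the pairs $(i, r)$ uses one arbitrary bijection, and pseudo-inverses are used in place of the inverses since the empirical moment matrices are rank-deficient; the tensor decomposition and the recovery of $(\hat\mu, \hat P)$ (Steps 4–7) are omitted because they rely on the routine of [2].

```python
import numpy as np

def encode(arms, rewards, num_arms):
    """One-hot encode each pair (i_t, r_t) into y_t in {0,1}^{2I}, using the
    (arbitrary) bijection s = 2*i + r."""
    Y = np.zeros((len(arms), 2 * num_arms))
    for t, (i, r) in enumerate(zip(arms, rewards)):
        Y[t, 2 * int(i) + int(r)] = 1.0
    return Y

def empirical_moments(Y):
    """Steps 1-3 of Algorithm 1: cross moments W_{i,j}, symmetrized views, M2, M3."""
    n = Y.shape[0]
    W = {(a, b): sum(np.outer(Y[t + a], Y[t + b]) for t in range(1, n - 1)) / (n - 2)
         for a in (-1, 0, 1) for b in (-1, 0, 1)}
    Y_prev = Y[0:n - 2] @ (W[(1, 0)] @ np.linalg.pinv(W[(-1, 0)])).T   # \hat y_{t-1}
    Y_curr = Y[1:n - 1] @ (W[(1, -1)] @ np.linalg.pinv(W[(0, -1)])).T  # \hat y_t
    Y_next = Y[2:n]                                                    # y_{t+1}
    M2 = np.einsum('ta,tb->ab', Y_prev, Y_curr) / (n - 2)
    M3 = np.einsum('ta,tb,tc->abc', Y_prev, Y_curr, Y_next) / (n - 2)
    return M2, M3

# Tiny usage example on synthetic exploration data (uniformly random arms).
rng = np.random.default_rng(0)
arms = rng.integers(2, size=500)
rewards = rng.binomial(1, 0.5, size=500)
M2, M3 = empirical_moments(encode(arms, rewards, num_arms=2))
print(M2.shape, M3.shape)   # (4, 4) and (4, 4, 4)

# Steps 4-7 (tensor decomposition of (M2, M3) and recovery of mu_hat, P_hat)
# would follow here via the robust tensor power method of [2]; omitted.
```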
3.2 The SEEU Algorithm

The SEEU algorithm proceeds in episodes of increasing length. As mentioned before, each episode is divided into an exploration phase and an exploitation phase. Episode $k$ starts with an exploration phase that lasts for a fixed number of periods $\tau_1$, during which the algorithm chooses arms uniformly at random and observes the rewards. After the exploration phase, the algorithm applies Algorithm 1 to (re-)estimate $\mu$ and $P$. Moreover, it constructs a confidence region based on Proposition 1 with confidence level $1 - \delta_k$, where $\delta_k := \delta/k^3$ is a vanishing sequence. Then the algorithm enters the exploitation phase, whose length is proportional to $\sqrt{k}$. In the exploitation phase, it conducts UCB-type learning: the arm is pulled according to the policy that corresponds to the optimistic estimate of $\mu$ and $P$ inside the confidence region. The detailed steps are listed in Algorithm 2.

Algorithm 2: The SEEU Algorithm
Input: initial belief $b_1$, precision $\delta$, exploration parameter $\tau_1$, exploitation parameter $\tau_2$.
1: for $k = 1, 2, 3, \dots$ do
2:   Set the start time of episode $k$: $t_k := t$.
3:   for $t = t_k, t_k + 1, \dots, t_k + \tau_1$ do
4:     Uniformly randomly select an arm: $\mathbb{P}(I_t = i) = 1/I$.
5:   end for
6:   Input the realized actions and rewards of all previous exploration phases, $\hat{\mathcal{I}}_k := \{i_{t_1:t_1+\tau_1}, \dots, i_{t_k:t_k+\tau_1}\}$ and $\hat{\mathcal{R}}_k := \{r_{t_1:t_1+\tau_1}, \dots, r_{t_k:t_k+\tau_1}\}$, to Algorithm 1 to compute $(\hat{\mu}_k, \hat{P}_k) = \text{SpectralEstimation}(\hat{\mathcal{I}}_k, \hat{\mathcal{R}}_k)$.
7:   Compute the confidence region $C_k(\delta_k)$ from (3) using the confidence level $1 - \delta_k = 1 - \delta/k^3$, so that $\mathbb{P}\{(\mu, P) \in C_k(\delta_k)\} \ge 1 - \delta_k$.
8:   Find the optimistic POMDP in the confidence region: $(\mu_k, P_k) = \arg\max_{(\mu, P) \in C_k(\delta_k)} \rho^*(\mu, P)$.
9:   for $t = 1, 2, \dots, t_k + \tau_1$ do
10:    Update the belief $b^k_t$ to $b^k_{t+1} = H_{\mu_k, P_k}(b^k_t, i_t, r_t)$ under the new parameters $(\mu_k, P_k)$.
11:  end for
12:  for $t = t_k + \tau_1 + 1, \dots, t_k + \tau_1 + \tau_2\sqrt{k}$ do
13:    Execute the optimal policy $\pi^{(k)}$ obtained by solving the Bellman equation (2) under parameters $(\mu_k, P_k)$: $i_t = \pi^{(k)}(b^k_t)$.
14:    Observe the reward $r_t$.
15:    Update the belief at $t + 1$ by $b^k_{t+1} = H_{\mu_k, P_k}(b^k_t, i_t, r_t)$.
16:  end for
17: end for

Note that we use the UCB component in the exploitation phase, instead of point estimates of $\mu$ and $P$, for the ease of analysis.

3.3 Discussions on the SEEU Algorithm

Computations. For given parameters $(\mu, P)$, we need to compute the optimal average reward $\rho^*(\mu, P)$ that depends on the parameters (Step 8 in Algorithm 2). Various computational and approximation methods have been proposed to tackle this planning problem for belief MDPs, as mentioned in Section 2.2. In addition, we need to find the optimistic POMDP in the confidence region $C_k(\delta_k)$ with the best average reward. In general, it is not clear whether there is an efficient computational method to find the optimistic plausible POMDP model in the confidence region when the unknown parameters are high-dimensional. The extended value iteration method in [25] does not work in our POMDP setting. This is because we cannot optimize over $\mu$ and $P$ separately, as $c$ and $T$ in (2) both depend on $\mu$, and we cannot write the inner optimization in the extended value iteration for finding the optimistic $P$ as a linear program over the convex polytope $C_k(\delta_k)$, since $T$ is a nonlinear function of $\mu$ and $P$. This issue is also present in recent studies on learning continuous-state MDPs with the upper confidence bound approach; see e.g. [27] for further discussions. For low-dimensional models, one can discretize $C_k(\delta_k)$ into a grid and calculate the corresponding optimal average reward $\rho^*$ at each grid point, so as to find (approximately) the optimistic model $(\mu_k, P_k)$. The discretization error does affect the regret, although it can be controlled arbitrarily well with sufficient computational capacity.
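A sketch of this grid-search step for a low-dimensional model is given below. The planner `solve_average_reward`, which returns $\rho^*(\mu, P)$ for a candidate model (e.g. via one of the approximate methods cited in Section 2.2), is assumed to be available, and the grid construction is deliberately crude and illustrative: it perturbs all entries by a common offset rather than enumerating the full confidence region.

```python
import itertools
import numpy as np

def optimistic_model(mu_hat, P_hat, radius_mu, radius_P, solve_average_reward,
                     num_grid=5):
    """Approximate Step 8 of Algorithm 2 by a crude grid search: perturb the point
    estimates within the confidence radii, re-normalize P, and keep the candidate
    with the largest optimal average reward rho*(mu, P)."""
    offsets = np.linspace(-1.0, 1.0, num_grid)
    best_rho, best_model = -np.inf, (mu_hat, P_hat)
    for d_mu, d_P in itertools.product(offsets, repeat=2):
        mu = np.clip(mu_hat + d_mu * radius_mu, 1e-3, 1 - 1e-3)  # keep means in (0, 1)
        P = np.clip(P_hat + d_P * radius_P, 1e-3, None)
        P = P / P.sum(axis=1, keepdims=True)                     # rows must sum to one
        rho = solve_average_reward(mu, P)                        # planning oracle for (2)
        if rho > best_rho:
            best_rho, best_model = rho, (mu, P)
    return best_model, best_rho
```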
Below we discuss the impact of the discretization error on the regret. The discretization kicks in when we want to find the optimistic POMDP in the confidence region, whose gain is denoted by $\rho_k$ in episode $k$. In practice, we have to discretize the confidence region into grid points and find the optimistic POMDP among the finite set. Suppose one can obtain an approximate optimistic model with error $\epsilon_k$, that is, suppose we can find a model whose gain is at least $\rho_k - \epsilon_k$. Then we can infer from formula (22) in the appendix that the extra regret incurred due to the discretization error is given by $\sum_{k=1}^K \sum_{t \in E_k} \epsilon_k = \tau_2 \sum_{k=1}^K \sqrt{k}\, \epsilon_k$, where $E_k$ denotes the exploitation phase in episode $k$. One can show that the order of $K$, the number of episodes up to time $T$, is $T^{2/3}$. Hence, if the discretization error can be controlled at $\epsilon_k = c/\sqrt{k}$, then the extra regret is simply $c\tau_2 K$, which is $O(T^{2/3})$. On the other hand, if the discretization error $\epsilon_k$ is a constant $c > 0$ in all the exploitation phases, then the extra regret incurred is of the order $\sum_{k=1}^K \sqrt{k}\, c \asymp c K^{3/2}$, which is of order $cT$. There is a trade-off between the additional computational complexity due to discretization and the regret bound. For instance, to control the discretization error at $\epsilon_k = c/\sqrt{k}$, the computational cost is higher compared with the case $\epsilon_k = c$. In general, it remains open to find efficient methods to solve for the optimistic POMDP approximately in the high-dimensional setting. In our regret analysis below, we do not take into account approximation errors arising from the computational aspects discussed above, as in [35, 27].

Dependence on the unknown parameters. When computing the confidence region in Step 7 of Algorithm 2, the agent needs to know the constants $C_1$ and $C_2$ in Proposition 1. These constants depend on a few primitives that can be hard to know, for example, the mixing rate of the underlying Markov chain. However, we only need upper bounds for $C_1$ and $C_2$ for the theoretical guarantee, and hence a rough and conservative estimate is sufficient. Such dependence on unknown parameters is common in learning problems, and one remedy is to dedicate the beginning of the horizon to estimating the unknown parameters, which typically does not increase the rate of the regret. Alternatively, $C_1$ and $C_2$ can be replaced by parameters that are tuned by hand. See Remark 3 of [8] for a further discussion of this issue.

4 Regret Bound

This section presents our main theoretical results on the regret bound for the SEEU algorithm. We first state two technical results that are important in proving the regret bounds.

Proposition 2 (Uniform bound on the bias span). If the belief MDP satisfies Assumption 3, then for $(\rho^*, v)$ satisfying the Bellman equation (2), the span of the bias function, $\mathrm{span}(v) := \max_{b \in B} v(b) - \min_{b \in B} v(b)$, is bounded by $D(\epsilon)$, where
$$D(\epsilon) := \frac{8\sqrt{2}}{(1-\alpha)^2} + \frac{(1+\alpha)\log_\alpha(1-\alpha)}{1-\alpha}, \qquad \text{with } \alpha = \frac{1-2\epsilon}{1-\epsilon} \in (0, 1).$$

Recall that $v_k$ is the bias function of the optimistic belief MDP in episode $k$. Proposition 2 guarantees that $\mathrm{span}(v_k)$ is bounded by $D = D(\epsilon/2)$ uniformly in $k$, because Assumption 3 is satisfied (with $\epsilon$ replaced by $\epsilon/2$) by the optimistic MDPs when $T$ is sufficiently large, due to Proposition 1.

Proposition 3 (Controlling the belief error). Suppose Assumption 3 holds, and let $(\hat{\mu}, \hat{P})$ be an estimate of the true model parameters $(\mu, P)$. For an arbitrary reward-action sequence $\{r_{1:t}, i_{1:t}\}_{t \ge 1}$, let $\hat{b}_t$ and $b_t$ be the corresponding beliefs in period $t$ under $(\hat{\mu}, \hat{P})$ and $(\mu, P)$, respectively. Then there exist constants $L_1, L_2$ such that
$$\|\hat{b}_t - b_t\|_1 \le L_1 \|\hat{\mu} - \mu\|_1 + L_2 \|\hat{P} - P\|_F, \qquad (4)$$
where $L_1 = \frac{4M(1-\epsilon)^2}{\epsilon^2 \min\{\mu_{\min},\, 1-\mu_{\max}\}}$, $L_2 = \frac{4M(1-\epsilon)^2}{\epsilon^2}\sqrt{M}$, $\|\cdot\|_F$ is the Frobenius norm, and $\mu_{\max}$ and $\mu_{\min}$ are the maximum and minimum elements of the matrix $\mu$, respectively.
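As an empirical sanity check of Proposition 3, the short simulation below runs the belief filter of Section 2.2 along the same reward-action sequence under the true parameters and under a slightly perturbed estimate, and reports the worst-case $\ell_1$ belief error together with the parameter errors; all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def belief_step(b, arm, reward, mu, P):
    """One step of the filter H_{mu,P}(b, i, r) for Bernoulli rewards."""
    like = mu[:, arm] ** reward * (1.0 - mu[:, arm]) ** (1 - reward)
    post = like * b
    return (post / post.sum()) @ P

# True model (hypothetical, satisfying Assumption 3) and a small perturbation of it.
P_true = np.array([[0.6, 0.4], [0.3, 0.7]])
mu_true = np.array([[0.8, 0.3], [0.4, 0.7]])
P_hat = np.abs(P_true + rng.normal(scale=0.02, size=P_true.shape))
P_hat = P_hat / P_hat.sum(axis=1, keepdims=True)
mu_hat = np.clip(mu_true + rng.normal(scale=0.02, size=mu_true.shape), 0.05, 0.95)

# Run the two filters on the same reward-action sequence generated by the true model.
b, b_hat, state, errors = np.ones(2) / 2, np.ones(2) / 2, 0, []
for t in range(500):
    arm = int(rng.integers(2))
    reward = int(rng.binomial(1, mu_true[state, arm]))
    state = int(rng.choice(2, p=P_true[state]))
    b = belief_step(b, arm, reward, mu_true, P_true)
    b_hat = belief_step(b_hat, arm, reward, mu_hat, P_hat)
    errors.append(np.abs(b_hat - b).sum())

print("max l1 belief error:", max(errors))
print("entrywise l1 error of mu:", np.abs(mu_hat - mu_true).sum(),
      " Frobenius error of P:", np.linalg.norm(P_hat - P_true))
```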
We now state our first main result. The proof is given in Appendix E.

Theorem 1. Suppose Assumptions 1 to 3 hold, and fix the parameter $\tau_1$ in Algorithm 2 to be sufficiently large. Then there exist constants $T_0$ and $C$, independent of $T$, such that when $T > T_0$, with probability at least $1 - \frac{7}{2}\delta$, the regret of Algorithm 2 satisfies
$$R_T \le C\, T^{2/3} \sqrt{\log\frac{9(S+1)T}{\delta}} + T_0\, \rho^*,$$
where $S = 2I$ and $\rho^*$ denotes the optimal long-run average reward under the true model.

The constant $T_0$ measures the number of periods needed for the sample size in the exploration phases to exceed the threshold $N_0$ arising in Proposition 1. The constant $C$ has an explicit expression (see Appendix E) in terms of $M$ (the number of hidden states), the constants $C_1, C_2$ given in Proposition 1, $\alpha$ and $D$ given in Proposition 2, $L_1, L_2$ given in Proposition 3, and the phase-length parameters $\tau_1, \tau_2$. One can verify that $C = O(\epsilon^{-4} C_1 M^{5/2} + \epsilon^{-5} C_2 M^{3/2})$. Here we require $\tau_1$ to be large so as to apply Proposition 1 from the first episode, which simplifies the presentation. Theoretically, this requirement is not essential and can be removed without affecting the order of the regret. The parameter $\tau_2$ can be an arbitrary positive integer in our regret analysis.

Remark 2 (Effect of the lengths of the exploration and exploitation phases on the regret order). In Algorithm 2, we use an exploration phase of length $\tau_1$ and an exploitation phase of length $\tau_2\sqrt{k}$ in episode $k$. This in the end leads to a regret of order $O(T^{2/3})$, as the total length of exploration (and the total number of episodes) is $\Theta(T^{2/3})$. The intuition behind this choice is the following. After $k$ episodes, the total length of exploration is $\tau_1 k$, and hence the estimation error of the model parameters, and also of the belief states, is of order $1/\sqrt{k}$. As a result, we need to choose an exploitation phase of length $\tau_2\sqrt{k}$ to balance this, so that the regret incurred per episode is controlled. One might wonder whether one can simply choose a longer exploitation phase of length $\tau_2 k^\alpha$ with $\alpha \ge 1/2$, so that the regret order is reduced. This is not possible, and the intuition is as follows. The total number of episodes $K$ satisfies $\sum_{k=1}^K(\tau_1 + \tau_2 k^\alpha) = T$, which yields $K \asymp T^{\frac{1}{\alpha+1}}$. The total regret incurred is then, ignoring logarithmic factors, of the order $\sum_{k=1}^K\big(\tau_1 + \frac{1}{\sqrt{k}}\,\tau_2 k^\alpha\big)$, which is $T^{\frac{1}{\alpha+1}} + T^{\frac{\alpha+1/2}{\alpha+1}}$. One can immediately see that the choice $\alpha = 1/2$ is optimal, leading to a regret of order $T^{2/3}$. The impact of the hyper-parameters $\tau_1$ and $\tau_2$ on the regret will be studied numerically in Section 5.

From Theorem 1, we can choose an appropriate $\delta = 9(S+1)/T$ and readily obtain the following expectation bound on the regret. The proof is omitted.

Theorem 3. Under the same assumptions as Theorem 1, the regret of Algorithm 2 satisfies
$$\mathbb{E}[R_T] \le C\, T^{2/3} \sqrt{2\log T} + \big(T_0 + 32(S+1)\big)\rho^*.$$

Remark 4 (Lower bound). For the lower bound on the regret, consider the following $I$ problem instances (one per arm): in instance $i$, let $\mu_{m,i} = 0.5 + \epsilon$ for all $m$, for a small positive constant $\epsilon$, and let $\mu_{m,j} = 0.5$ for all $m$ and $j \neq i$. Such a structure ensures that the oracle policy simply pulls one arm without the need to infer the state. Since the problem reduces to the classic MAB, the regret is at least of order $\sqrt{IT}$ in this case. Note that the setup of these instances may violate Assumption 2, but this can be easily fixed by introducing an arbitrarily small perturbation to $\mu$. The gap between the upper and lower bounds is probably caused by the split of exploration/exploitation phases in our algorithm. In the exploration phase, arms have to be pulled purely at random in order to estimate the parameters. Such naive exploration without any maximization may lead to unnecessary regret. In fact, it resembles the structure of the explore-then-commit algorithm (Chapter 6 of [29]) for the classic MAB problem, which also turns out to have the suboptimal regret $O(T^{2/3})$.
With special structures, such as a common mean reward for all states, special-purpose algorithms can be designed to achieve the optimal rate $O(T^{1/2})$. Unfortunately, our algorithm does not adapt to such special structure; in fact, the spectral estimator no longer works because of the degeneracy, as it cannot differentiate between states from the observations. One natural direction that may potentially improve our algorithm and close the gap is to integrate exploration and exploitation. We may follow a design similar to UCRL2 [25]: in each episode, we use the parameters estimated in previous episodes and apply the optimistic policy in the confidence region; the observations are then used to update the estimation. This can lower the cost of exploration, because the naive exploration is replaced by a near-optimal policy when the confidence region is small. However, it requires the spectral estimator to work with non-i.i.d. samples generated by the optimistic belief-based policy. The design of the spectral estimator makes it hard to handle non-i.i.d. observations with complex history dependency. This is also why [8] focuses on observation-based policies in order to apply the spectral estimator. We may replace the spectral estimator by other estimators with better theoretical properties. Likelihood-based estimators may be a promising candidate [45], but they have not been fully extended to POMDPs yet. In summary, we need better estimators for POMDPs/HMMs (which is an active research area in itself) with finite-sample guarantees to improve the upper bound. Nevertheless, we are not aware of other algorithms that can achieve sublinear regret in our setting.

Remark 5 (General reward distributions). Our model requires the estimation of the reward distribution, instead of just the mean, to estimate the belief. For discrete rewards taking $O$ possible values, the regret bound holds with $S = OI$ instead of $2I$ for Bernoulli rewards. For continuous reward distributions, it might be possible to combine the non-parametric HMM inference method in [18] with our algorithm design to obtain regret bounds, and we leave this for future work.

5 Numerical Experiment

In this section, we present proof-of-concept experiments. Note that large-scale POMDPs with long-run average objectives (the oracle in our problem) are computationally difficult to solve, i.e., it is challenging to find the optimal belief-based policy [14]. On the other hand, while there can be many hidden states in general, often only two or three states are important to model in several application areas, e.g., bull and bear markets in finance [17]. Hence we focus on small-scale experiments, following some recent literature on reinforcement learning for POMDPs [8, 24]. As a representative example, we consider a 2-hidden-state, 2-arm setting with
$$P = \begin{pmatrix} 1/3 & 2/3 \\ 3/4 & 1/4 \end{pmatrix}, \qquad \mu = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.6 \end{pmatrix},$$
where the random rewards follow Bernoulli distributions. We compare our algorithm with (1) $\epsilon$-greedy ($\epsilon = 0.1$); non-stationary bandit algorithms, including (2) Sliding-Window UCB (SW-UCB) [20] with tuned window size and (3) Exp3.S [5] with $L = T$ (the hyperparameter $L$ is the number of changes in their algorithm); and (4) the optimal memoryless policy for POMDPs discussed in [8]. To implement (4) under our model setting, we assume the reward matrix $\mu$ and the transition matrix $P$ are known and search for the optimal stochastic memoryless policy that maps the reward and action in the last period to a distribution over the two arms.
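For reference, the snippet below instantiates this example and verifies that it satisfies Assumptions 1–3.

```python
import numpy as np

P = np.array([[1/3, 2/3],
              [3/4, 1/4]])
mu = np.array([[0.9, 0.1],
               [0.5, 0.6]])

assert np.linalg.det(P) != 0                     # Assumption 1: P is invertible
assert np.linalg.matrix_rank(mu) == mu.shape[0]  # Assumption 2: mu has full row rank
assert P.min() > 0                               # Assumption 3: epsilon = min_ij P_ij > 0
print("epsilon =", P.min())                      # epsilon = 1/4 for this example
```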
We also run an algorithm/oracle that implements SEEU with known $P$ but unknown $\mu$. This allows us to better understand the value of knowing the information of $P$.

[Figure 1: (a) Regret comparison of different algorithms (SEEU, SEEU with known P, Memoryless, ε-Greedy, Exp3.S, SW-UCB), plotted against log T; (b) Effect of (τ1, τ2) on the regret of SEEU, for (τ1, τ2) ∈ {(100, 500), (100, 1000), (300, 1000), (500, 1000)}.]

In Figure 1(a), we plot the average regret versus $T$ of the different algorithms on a log-log scale, where the number of runs for each algorithm is 500. The numerical experiments are conducted on a PC with a 3.10 GHz Intel processor and 16 GB of RAM. We observe that the slopes of all algorithms except the two SEEU implementations are close to one, suggesting that they incur linear regret. This is expected, because these algorithms do not take into account the hidden states. On the other hand, the slope of SEEU is close to 2/3. This is consistent with our theoretical result (Theorem 3). Similar observations are made on other small-scale examples. This demonstrates the effectiveness of our SEEU algorithm, particularly when the horizon length $T$ is relatively large. We also briefly discuss the impact of the parameters $\tau_1$ and $\tau_2$ on the performance of the SEEU algorithm. For the example above, we calculate the average regret for several pairs of parameters $(\tau_1, \tau_2)$. It can be seen that the choices of these parameters do not affect the order $O(T^{2/3})$ of the regret (the slope); see Figure 1(b) for an illustration.

6 Conclusions, Limitations and Future Research

In this paper, we study a non-stationary MAB model with Markovian regime-switching rewards. We propose a learning algorithm that integrates spectral estimators for hidden Markov models and upper confidence bound methods from reinforcement learning. We also establish a regret bound of order $O(T^{2/3}\sqrt{\log T})$ for the learning algorithm. As far as we know, this is the first algorithm with sublinear regret for MAB with unobservable regime switching. The main limitation of our work is that the regret upper bound does not match the lower bound. It would be interesting to find out whether one can improve the regret upper bound $O(T^{2/3}\sqrt{\log T})$. The gap between the upper and lower bounds is likely to be closed if one can resolve either of the following two fundamental open questions: (1) show that the spectral method can be applied to non-i.i.d. samples generated from belief-based policies, or (2) establish finite-sample guarantees for other estimators (e.g., maximum likelihood estimators) of POMDP parameters with data generated from adaptive policies. We leave them for future research.

Acknowledgments and Disclosure of Funding

The authors thank the Area Chair and three anonymous referees for constructive comments and helpful suggestions. Xuefeng Gao acknowledges support from Hong Kong RGC GRF Grants 14201520 and 14201421. The research of Ningyuan Chen is partially supported by NSERC Discovery Grant RGPIN-2020-04038.

References

[1] S. Agrawal and R. Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[3] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, 2012.
[4] O. Atan, C. Tekin, and M. van der Schaar. Global bandits. IEEE Transactions on Neural Networks and Learning Systems, 29(12):5798–5811, 2018.
[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[6] P. Auer, P. Gajane, and R. Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138–158, 2019.
[7] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56, 2007.
[8] K. Azizzadenesheli, A. Lazaric, and A. Anandkumar. Reinforcement learning of POMDPs using spectral methods. arXiv preprint arXiv:1602.07764, 2016.
[9] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.
[10] O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems, pages 199–207, 2014.
[11] O. Besbes, Y. Gur, and A. Zeevi. Optimal exploration-exploitation in a multi-armed-bandit problem with non-stationary rewards. Stochastic Systems, 2019.
[12] D. Bouneffouf and I. Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040, 2019.
[13] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[14] K. Chatterjee, R. Saona, and B. Ziliotto. The complexity of POMDPs with long-run average objectives. arXiv preprint arXiv:1904.13360, 2019.
[15] N. Chen, C. Wang, and L. Wang. Learning and optimization with seasonal patterns. Working paper, 2020.
[16] W. C. Cheung, D. Simchi-Levi, and R. Zhu. Learning to optimize under non-stationarity. arXiv preprint arXiv:1810.03024, 2018.
[17] M. Dai, Q. Zhang, and Q. J. Zhu. Trend following trading under a regime switching model. SIAM Journal on Financial Mathematics, 1(1):780–810, 2010.
[18] Y. De Castro, E. Gassiat, and S. Le Corff. Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden Markov models. IEEE Transactions on Information Theory, 63(8):4758–4777, 2017.
[19] T. Fiez, S. Sekar, and L. J. Ratliff. Multi-armed bandits for correlated Markovian environments with smoothed reward feedback. arXiv preprint arXiv:1803.04008, 2019.
[20] A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188. Springer, Berlin, Heidelberg, 2011.
[21] S. Guha, K. Munagala, and P. Shi. Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58(1):3, 2010.
[22] S. Gupta, G. Joshi, and O. Yağan. Correlated multi-armed bandits with a latent random source. arXiv preprint arXiv:1808.05904, 2018.
[23] K. Hinderer. Lipschitz continuity of value functions in Markovian decision processes. Mathematical Methods of Operations Research, 62(1):3–22, 2005.
[24] M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson. Deep variational reinforcement learning for POMDPs. arXiv preprint arXiv:1806.02426, 2018.
[25] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
[26] V. Krishnamurthy. Partially Observed Markov Decision Processes. Cambridge University Press, 2016.
[27] K. Lakshmanan, R. Ortner, and D. Ryabko. Improved regret bounds for undiscounted continuous reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 524–532, 2015.
[28] T. Lattimore and R. Munos. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558, 2014.
[29] T. Lattimore and C. Szepesvári. Bandit Algorithms. Preprint, 2018.
[30] L. Lehéricy. Consistent order estimation for nonparametric hidden Markov models. Bernoulli, 25(1):464–498, 2019.
[31] O. Madani, S. Hanks, and A. Condon. On the computability of infinite-horizon partially observable Markov decision processes. In AAAI 1998 Fall Symposium on Planning with POMDPs, Orlando, FL. Citeseer, 1998.
[32] O. A. Maillard and S. Mannor. Latent bandits. In International Conference on Machine Learning, pages 136–144, 2014.
[33] R. S. Mamon and R. J. Elliott. Hidden Markov Models in Finance, volume 4. Springer, 2007.
[34] A. J. Mersereau, P. Rusmevichientong, and J. N. Tsitsiklis. A structured multiarmed bandit problem and the greedy policy. IEEE Transactions on Automatic Control, 54(12):2787–2802, 2009.
[35] R. Ortner and D. Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems, pages 1763–1771, 2012.
[36] R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless Markov bandits. Theoretical Computer Science, 558:62–76, 2014.
[37] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[38] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 2014.
[39] J. Qian, R. Fruit, M. Pirotta, and A. Lazaric. Exploration bonus for regret minimization in undiscounted discrete and continuous Markov decision processes. arXiv preprint arXiv:1812.04363, 2018.
[40] S. M. Ross. Arbitrary state Markovian decision processes. The Annals of Mathematical Statistics, 39(6):2118–2122, 1968.
[41] N. Saldi, S. Yüksel, and T. Linder. On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Mathematics of Operations Research, 42(4):945–978, 2017.
[42] H. Sharma, M. Jafarnia-Jahromi, and R. Jain. Approximate relative value learning for average-reward continuous state MDPs. In Uncertainty in Artificial Intelligence, pages 956–964. PMLR, 2020.
[43] A. Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
[44] A. Slivkins and E. Upfal. Adapting to a changing environment: the Brownian restless bandits. In Conference on Learning Theory, pages 343–354, 2008.
[45] F. Yang, S. Balakrishnan, and M. J. Wainwright. Statistical and computational guarantees for the Baum-Welch algorithm. The Journal of Machine Learning Research, 18(1):4528–4580, 2017.
[46] H. Yu and D. P. Bertsekas. Discretized approximations for POMDP with average cost. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 619–627. AUAI Press, 2004.