# Latent Bandits Revisited

Joey Hong (Google Research, jxihong@google.com), Branislav Kveton (Google Research, bkveton@google.com), Manzil Zaheer (Google Research, manzilzaheer@google.com), Yinlam Chow (Google Research, yinlamchow@google.com), Amr Ahmed (Google Research, amra@google.com), Craig Boutilier (Google Research, cboutilier@google.com)

A latent bandit is a bandit problem where the learning agent knows the reward distributions of the arms conditioned on an unknown discrete latent state. The goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning, where complex models can be learned offline and the agent identifies the latent state online. This is of high practical relevance, for instance in recommender systems. In this work, we propose general algorithms for latent bandits, based on both upper confidence bounds and Thompson sampling. The algorithms are contextual, and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions. A comprehensive empirical study showcases the advantages of our approach.

## 1 Introduction

Many online platforms, such as search engines and recommender systems, display results based on observed properties of the user and their query. However, the user's behavior is often influenced by a latent state not explicitly revealed to the system. This might be user intent (e.g., reflecting a long-term task) in search, or short- and long-term preferences (e.g., reflecting topic interests) in a recommender. In each case, the unobserved latent state influences the user response to the displayed results, and thus the associated reward. A machine learning (ML) system, thus, should take steps to infer the latent state and tailor its results accordingly. While many ML models use either heuristic features [1, 4] or recurrent models [33] to capture user history, explicit exploration for latent state identification (i.e., reducing uncertainty regarding the true state) is less common in practice.

In this paper, we study latent bandits, which model online interactions of the above type as follows. In each round, the learning agent observes a context (e.g., a query or user demographics), takes an action (e.g., a recommendation), and observes its reward (e.g., user engagement with the recommendation). The reward depends stochastically on both the context and the user's latent state. As a result, it provides information about the unobserved latent state, which can be used to improve predictions in the future. We are interested in designing exploration policies that allow the agent to quickly maximize its per-round reward by resolving the relevant latent state uncertainty. Specifically, we want policies that have low n-round regret.

Our latent state structure allows an agent to quickly adapt its results to new users, or adapt to new user tasks or intents on a per-session basis. One example of such structure is clusters of users with similar item preferences [35]. In this example, a system should be able to identify which cluster a new user belongs to faster than learning all user preferences without any prior information. This would yield much better recommendations in the short-horizon, or cold-start, regime.
Fully online exploration (e.g., for personalization) involves learning a reward model, conditioned on the context and latent state, and generally requires massive amounts of interaction data. Fortunately, many platforms have such offline data (e.g., past user interactions) available, which can be used to construct both a reasonable latent state space and conditional reward models [23, 9]. We assume that such models are available and focus on the simpler online problem of latent state identification. Prior works in this setting [25, 35] assumed that the true conditional reward models are given. Moreover, they focused on upper confidence bound (UCB) designs, which are theoretically optimal but often perform poorly empirically. We provide a unified framework that combines offline-learned models with online exploration using both UCBs and Thompson sampling. Our algorithms are practical, analyzable, contextual, and robust to natural forms of model misspecification. We are the first to propose algorithms for latent bandits that are aware of model uncertainty.

Our paper is organized as follows. In Section 3, we propose novel practical algorithms based on UCBs and Thompson sampling. Based on a connection between UCBs and posterior sampling [28], we derive near-optimal bounds on the Bayes regret of our algorithms in Section 4. Finally, in Section 5, we demonstrate their effectiveness in synthetic simulations and on a large-scale real-world dataset.

## 2 Problem Formulation

We adopt the following notation. Random variables are capitalized. The set of actions is A, the set of contexts is X, and the set of latent states is S, with |S| ≤ |A|.

We study a latent bandit problem, where the learning agent interacts with an environment over n rounds as follows. In round $t \in [n]$, the agent observes context $X_t \in X$, takes action $A_t \in A$, and observes reward $R_t \in \mathbb{R}$. The random variable $R_t$ depends on the context $X_t$, action $A_t$, and latent state $s_* \in S$, where $s_*$ is fixed but unknown. (The latent state $s_*$ can be viewed as a user's current task or preferences, and is fixed over all n rounds.) The observation history of the learning agent up to round t is a vector $H_t = (X_1, A_1, R_1, \dots, X_{t-1}, A_{t-1}, R_{t-1})$. The policy of the agent maps $H_t$ and $X_t$ to the choice of action $A_t$.

Now we state our assumptions on how rewards and contexts are generated. The rewards are drawn i.i.d. from a conditional reward distribution $P(\cdot \mid A, X, s; \theta)$, which is parameterized by a vector $\theta \in \Theta$, where $\Theta$ is a set of feasible model parameters. We use $\mu(a, x, s; \theta) = \mathbb{E}_{R \sim P(\cdot \mid a, x, s; \theta)}[R]$ to denote the mean reward of action a in context x and latent state s under model θ. We denote the true (unknown) latent state by $s_*$ and the true model parameters by $\theta_*$. We assume that $\theta_*$ can be estimated offline. We also assume that the rewards are σ²-sub-Gaussian, that is,
$$\mathbb{E}_{R \sim P(\cdot \mid a, x, s; \theta_*)}\left[\exp\left(\lambda (R - \mu(a, x, s; \theta_*))\right)\right] \leq \exp(\sigma^2 \lambda^2 / 2)$$
for all a, x, s, and λ > 0. The contexts can be generated by an arbitrary process independent of the agent's actions, and the mean reward $\mu(a, x, s; \theta)$ can be any complex function of θ.

The performance of the agent is measured by regret. Let $A_{t,*} = \arg\max_{a \in A} \mu(a, X_t, s_*; \theta_*)$ be the optimal action for latent state $s_* \in S$ and model $\theta_* \in \Theta$. Then the expected n-round regret is
$$\mathcal{R}(n; s_*, \theta_*) = \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, s_*; \theta_*) - \mu(A_t, X_t, s_*; \theta_*)\right].$$
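To make the interaction protocol and the regret definition concrete, here is a minimal simulation sketch in Python. It uses a non-contextual Gaussian reward model, and all names (`LatentBandit`, `pull`, `regret`) are ours for illustration; the paper itself allows arbitrary contexts and reward distributions.

```python
import numpy as np

class LatentBandit:
    """Minimal non-contextual latent bandit: |S| latent states, |A| actions.

    mu[s, a] is the mean reward of action a in latent state s; rewards are
    Gaussian with standard deviation sigma (hence sigma^2-sub-Gaussian).
    """

    def __init__(self, mu, sigma, rng):
        self.mu, self.sigma, self.rng = mu, sigma, rng
        self.s_star = rng.integers(mu.shape[0])   # true latent state, fixed and unknown

    def pull(self, a):
        """Observe a stochastic reward for action a under the hidden state."""
        return self.mu[self.s_star, a] + self.sigma * self.rng.normal()

    def regret(self, a):
        """Per-round pseudo-regret of playing a instead of the optimal action."""
        return self.mu[self.s_star].max() - self.mu[self.s_star, a]

# Example: cumulative n-round pseudo-regret of a uniformly random policy.
rng = np.random.default_rng(0)
env = LatentBandit(mu=rng.uniform(0, 1, size=(5, 10)), sigma=0.5, rng=rng)
print(sum(env.regret(rng.integers(10)) for _ in range(1000)))
```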
While a fixed-state regret is useful, we are often more concerned with the average performance over a range of states (e.g., multiple users, and multiple sessions with the same user). Thus we also consider the Bayes regret, where we take an expectation over a random latent state and model. Suppose that $S_*$ and $\theta_*$ are drawn from some prior $P_1$. Then the n-round Bayes regret is
$$\mathcal{BR}(n) = \mathbb{E}\left[\mathcal{R}(n; S_*, \theta_*)\right] = \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, S_*; \theta_*) - \mu(A_t, X_t, S_*; \theta_*)\right].$$
Note that $S_*$ and $\theta_*$ in the definition of $A_{t,*} = \arg\max_{a \in A} \mu(a, X_t, S_*; \theta_*)$ are now random.

## 3 Algorithms

In this section, we develop both UCB and Thompson sampling (TS) algorithms that leverage an arbitrarily complex environment model, generally learned offline, to expedite online exploration. As discussed earlier, such offline models can readily be learned given the large amounts of offline interaction data available in many interactive systems. In each subsection below, we specify a particular form of the offline-learned model and develop a corresponding online algorithm.

It is important to note that, given an environment model, an optimal solution is to compute and maximize the Gittins index over actions [15]. However, this is often computationally intractable and does not generalize to complex latent variable models. We want our methods to be practical, and thus only consider myopic policies that can be efficiently implemented online.

### 3.1 UCB with Perfect Model (mUCB)

We first design a UCB-like algorithm with a learned model $\hat{\theta} \in \Theta$. Let $\hat{\mu}(a, x, s) = \mu(a, x, s; \hat{\theta})$ be the mean reward under model $\hat{\theta}$ and $\mu(a, x, s) = \mu(a, x, s; \theta_*)$ be the true mean reward. We initially assume that the learned model is accurate, that is, we are given $\hat{\theta} = \theta_*$ as an input.

The key idea in UCB algorithms is to compute a high-probability upper confidence bound $U_t(a)$ on the mean reward of each action a in any round t, where $U_t$ is some function of the history [8]. Then the algorithms take action $A_t = \arg\max_{a \in A} U_t(a)$. Our model-based UCB algorithm, which we call mUCB, works very similarly. The pseudocode of mUCB is in Algorithm 1.

Algorithm 1 (mUCB):
1. Input: model parameters $\hat{\theta}$.
2. For t = 1, 2, ...:
   1. Define $N_t(s) = \sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\}$ and $G_t(s) = \sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\} \, (\hat{\mu}(A_\ell, X_\ell, s) - R_\ell)$. (3)
   2. Compute the set of consistent latent states $C_t = \{s \in S : G_t(s) \leq \sigma \sqrt{6 N_t(s) \log n}\}$.
   3. Select $(B_t, A_t) = \arg\max_{s \in C_t, a \in A} \hat{\mu}(a, X_t, s)$.

The algorithm works as follows. In round t, it maintains a set of latent states $C_t$ that are consistent with the rewards observed thus far. Then it chooses a specific believed latent state $B_t$ from the consistent set $C_t$ and takes the action $A_t$ with the maximum expected reward in that state, $(B_t, A_t) = \arg\max_{s \in C_t, a \in A} \hat{\mu}(a, X_t, s)$. Thus the UCB of action a is $U_t(a) = \max_{s \in C_t} \hat{\mu}(a, X_t, s)$.

mUCB tracks two key quantities: the number of times that state s is selected up to round t, $N_t(s)$; and the gap between the expected and realized rewards in state s up to round t, $G_t(s)$. If $G_t(s)$ is high, mUCB marks state s as inconsistent and does not consider it in round t. Note that the gap is defined over latent states rather than actions, and with respect to realized rewards rather than expected rewards.
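For concreteness, here is a minimal non-contextual sketch of mUCB in Python, reusing the `LatentBandit` simulator sketched in Section 2; `mu_hat` plays the role of the offline model $\hat{\mu}$ and all other names are ours.

```python
import numpy as np

def m_ucb(env, mu_hat, n, sigma):
    """Sketch of Algorithm 1 (mUCB) without context; mu_hat[s, a] is the offline model."""
    num_states = mu_hat.shape[0]
    N = np.zeros(num_states)      # N_t(s): rounds in which s was the believed state
    G = np.zeros(num_states)      # G_t(s): cumulative predicted-minus-realized reward
    cum_regret = 0.0
    for t in range(n):
        consistent = np.flatnonzero(G <= sigma * np.sqrt(6 * N * np.log(n)))
        if consistent.size == 0:  # s_* stays consistent w.h.p.; keep the sketch safe anyway
            consistent = np.arange(num_states)
        # Jointly maximize the model's mean reward over consistent states and all actions.
        b = consistent[np.argmax(mu_hat[consistent].max(axis=1))]
        a = int(np.argmax(mu_hat[b]))
        r = env.pull(a)
        N[b] += 1
        G[b] += mu_hat[b, a] - r   # update the gap statistic of the believed state
        cum_regret += env.regret(a)
    return cum_regret
```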
### 3.2 UCB with Misspecified Model (mmUCB)

Now we extend mUCB to misspecified models, where we are given $\hat{\theta} \neq \theta_*$ as an input. We consider the following worst-case high-probability definition of model misspecification: there exist ε, δ > 0 such that $|\hat{\mu}(a, x, s) - \mu(a, x, s)| \leq \varepsilon$ holds with probability at least 1 − δ jointly for all a, x, and s. Guarantees of this form exist, for example, for spectral methods for latent variable models, where ε and δ depend on the size of the offline dataset [5]. We modify mUCB to be robust to this type of model error, which gives rise to a new algorithm, mmUCB.

The only change to mUCB is that $G_t(s)$ is replaced with the high-probability lower bound
$$G_t(s) = \sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\} \, (\hat{\mu}(A_\ell, X_\ell, s) - \varepsilon - R_\ell). \tag{4}$$
This allows mmUCB to be conservative when identifying inconsistent latent states, so that $s_* \in C_t$ continues to hold with high probability. Just as importantly, it is also useful for deriving worst-case regret bounds, both for UCB algorithms and for TS.

### 3.3 Thompson Sampling with Perfect Model (mTS)

Our UCB algorithms are designed for the worst case. Now we adopt a more relaxed design, where the latent state is random, and the model is fixed and known. That is, we are given $\hat{\theta} = \theta_*$ and a prior distribution over latent states $P_1$ as inputs.

Our solution is a variant of Thompson sampling [32, 10, 28]. The key idea in TS is to sample actions according to their posterior probability of being optimal, conditioned on the history of the agent. In our case, the optimal action in round t is $A_{t,*} = \arg\max_{a \in A} \mu(a, X_t, S_*; \theta_*)$, and it is random due to both the observed context $X_t$ and the unknown latent state $S_*$. Therefore, TS should take action a with probability $P(A_t = a \mid X_t, H_t) = P(A_{t,*} = a \mid X_t, H_t)$. An advantage of TS over UCB algorithms is that it obviates the need to design UCBs, which are often loose. As a result, UCB algorithms are often conservative in practice, and TS typically offers better empirical performance [10].

Our model-based TS algorithm, which we call mTS, is presented in Algorithm 2.

Algorithm 2 (mTS):
1. Input: model parameters $\hat{\theta}$ and a prior over latent states $P_1(s)$.
2. For t = 1, 2, ...:
   1. Define $P_t(s) \propto P_1(s) \prod_{\ell=1}^{t-1} P(R_\ell \mid A_\ell, X_\ell, s; \hat{\theta})$.
   2. Sample $B_t \sim P_t$.
   3. Select $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$.

The algorithm works as follows. In round t, it maintains a posterior probability, $P_t(s) = P(S_* = s \mid H_t)$, that each latent state s is optimal. Then it samples the latent state from its posterior distribution, $B_t \sim P_t$, and takes action $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$. For any fixed s, $P_t(s) \propto P_1(s) \prod_{\ell=1}^{t-1} P(R_\ell \mid A_\ell, X_\ell, s; \hat{\theta})$. As a result, $P_t$ can be updated incrementally in the standard Bayesian filtering fashion [30]. Note that unlike mUCB, mTS needs to know the conditional reward distribution $P(\cdot \mid a, x, s; \hat{\theta})$ to update $P_t(s)$.
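A minimal non-contextual sketch of mTS in Python, again reusing the simulator above and assuming a Gaussian reward likelihood with known noise σ (an assumption of this sketch, not a requirement of Algorithm 2):

```python
import numpy as np

def m_ts(env, mu_hat, n, sigma, rng, prior=None):
    """Sketch of Algorithm 2 (mTS) without context, with a Gaussian reward likelihood."""
    num_states = mu_hat.shape[0]
    if prior is None:
        prior = np.full(num_states, 1.0 / num_states)   # uniform prior P_1(s)
    log_post = np.log(np.asarray(prior, dtype=float))   # log of the state posterior P_t(s)
    cum_regret = 0.0
    for t in range(n):
        p = np.exp(log_post - log_post.max())
        p /= p.sum()
        b = rng.choice(num_states, p=p)        # sample B_t from the state posterior
        a = int(np.argmax(mu_hat[b]))          # act greedily under the sampled state
        r = env.pull(a)
        # Incremental Bayesian filtering: add the Gaussian log-likelihood of the
        # observed reward under each candidate latent state.
        log_post += -0.5 * ((r - mu_hat[:, a]) / sigma) ** 2
        cum_regret += env.regret(a)
    return cum_regret
```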
### 3.4 Thompson Sampling with Misspecified Model (mmTS)

Now we extend mTS to misspecified models. However, instead of considering a worst-case estimate of $\theta_*$, as in mmUCB, we assume that $\theta_* \sim P_1$ and that the learning agent knows $P_1$. This is motivated by prior literature on modeling epistemic uncertainty [11]. In practice, learning a distribution over parameters is intractable for complex models, but approximate inference can always be performed, for instance with an ensemble of bootstrapped models [11].

Our TS algorithm for misspecified models, which we call mmTS, is presented in Algorithm 3.

Algorithm 3 (mmTS):
1. Input: a joint prior over latent states and model parameters $P_1(s, \theta)$.
2. For t = 1, 2, ...:
   1. Define $P_t(s, \theta) \propto P_1(s, \theta) \prod_{\ell=1}^{t-1} P(R_\ell \mid A_\ell, X_\ell, s; \theta)$.
   2. Sample $(B_t, \hat{\theta}) \sim P_t$.
   3. Select $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$.

The algorithm seamlessly integrates model uncertainty into mTS as follows. In round t, the latent state $B_t$ and estimated model parameters $\hat{\theta}$ are sampled from their joint posterior, and then the action $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$ is taken. In general, sequential Monte Carlo methods [12] can be used for approximate posterior sampling. However, when the model prior is conjugate to the likelihood, the posterior has a closed form and can be sampled from as follows. Since S is finite, we can tractably sample from the joint posterior by first sampling the latent state $B_t$ from its marginal posterior and then $\hat{\theta}$ conditioned on $B_t$. For exponential-family distributions, the posterior parameters can also be updated online and efficiently. We provide more details in Appendix A, and pseudocode for Gaussians in Appendix B.
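Appendix B is not reproduced here; the following is our own sketch of what a conjugate Gaussian instantiation of mmTS could look like for the non-contextual simulator above. It assumes each mean µ(a, s) has an independent Gaussian prior centered at the offline model, and all names are ours.

```python
import numpy as np

def mm_ts_gaussian(env, mu_hat, n, sigma, sigma0, rng):
    """Sketch of a conjugate Gaussian mmTS (Algorithm 3), non-contextual case.

    Prior: mu(a, s) ~ N(mu_hat[s, a], sigma0^2) independently; likelihood:
    rewards ~ N(mu(a, s), sigma^2). The joint posterior over (s, theta) then
    factorizes into a categorical marginal over s and per-entry Gaussian
    conditionals over the means.
    """
    num_states, num_actions = mu_hat.shape
    m = mu_hat.copy()                                     # conditional posterior means
    v = np.full((num_states, num_actions), sigma0 ** 2)   # conditional posterior variances
    log_post = np.full(num_states, -np.log(num_states))   # log of the marginal over states
    cum_regret = 0.0
    for t in range(n):
        p = np.exp(log_post - log_post.max())
        p /= p.sum()
        b = rng.choice(num_states, p=p)                   # sample B_t from its marginal
        theta_b = m[b] + np.sqrt(v[b]) * rng.normal(size=num_actions)  # then theta | B_t
        a = int(np.argmax(theta_b))
        r = env.pull(a)
        # Predictive (marginal-likelihood) update of the state posterior.
        pred_var = v[:, a] + sigma ** 2
        log_post += -0.5 * (np.log(2 * np.pi * pred_var) + (r - m[:, a]) ** 2 / pred_var)
        # Conjugate update of each state's conditional posterior for the pulled action.
        gain = v[:, a] / pred_var
        m[:, a] += gain * (r - m[:, a])
        v[:, a] *= 1.0 - gain
        cum_regret += env.regret(a)
    return cum_regret
```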
## 4 Regret Analysis

Maillard and Mannor [25] derived gap-dependent regret bounds for UCB algorithms in latent bandits under the assumption that the true model is known and the arms are independent. We provide a unified analysis that extends their results to include context, model misspecification, and TS algorithms.

### 4.1 Regret Decomposition

UCB algorithms explore using upper confidence bounds, while TS samples from the posterior. Russo and Van Roy [28] related these two designs with a unified regret decomposition. In our problems, this is reflected as follows. Let $s_*$ be the true latent state. Then the regret of our UCB algorithms in round t decomposes as
$$\mu(A_{t,*}, X_t, s_*) - \mu(A_t, X_t, s_*) = \mu(A_{t,*}, X_t, s_*) - U_t(A_t) + U_t(A_t) - \mu(A_t, X_t, s_*) \leq \left[\mu(A_{t,*}, X_t, s_*) - U_t(A_{t,*})\right] + \left[U_t(A_t) - \mu(A_t, X_t, s_*)\right],$$
where the inequality holds by the definition of $A_t$. A similar inequality without latent states appears in prior work [28]. This yields the regret decomposition
$$\mathcal{R}(n; s_*, \theta_*) \leq \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, s_*) - U_t(A_{t,*})\right] + \mathbb{E}\left[\sum_{t=1}^{n} U_t(A_t) - \mu(A_t, X_t, s_*)\right]. \tag{5}$$
An analogous decomposition exists for the Bayes regret of our TS algorithms. Specifically, for any TS algorithm and any function $U_t$ of the history, we have
$$\mathcal{BR}(n) \leq \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, S_*; \theta_*) - U_t(A_{t,*})\right] + \mathbb{E}\left[\sum_{t=1}^{n} U_t(A_t) - \mu(A_t, X_t, S_*; \theta_*)\right]. \tag{6}$$
The proof uses the fact that $\mathbb{E}[U_t(A_{t,*}) \mid X_t, H_t] = \mathbb{E}[U_t(A_t) \mid X_t, H_t]$ holds for any $H_t$ and $X_t$, by the design of TS. Thus $U_t$ can be an upper confidence bound from a UCB algorithm. Though the UCBs $U_t$ are not used by TS, they can be used to analyze it, due to the similarity of (5) and (6). Thus regret bounds for UCB algorithms can be translated into Bayes regret bounds for TS. There are two caveats, though. First, since actions in TS do not maximize $U_t$, the regret bound must be proved using a worst-case argument over suboptimal actions. Second, because the Bayes regret is an expectation over problem instances, the resulting regret bounds are problem-independent, also known as gap-free.

### 4.2 Key Steps in Our Proofs

Full proofs of our regret bounds are in Appendix C. All proofs follow the same outline, whose key steps are summarized below. To ease the exposition, we assume that the suboptimality gap of any action is bounded by 1.

Step 1: Concentration of realized rewards at their means. We first show that the total observed reward does not deviate too much from its expectation under any believed latent state s. Formally, we show that
$$P\left(\sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\} \, (\mu(A_\ell, X_\ell, s_*) - R_\ell) \geq \sigma \sqrt{6 N_t(s) \log n}\right) = O(n^{-2})$$
holds for any round t and state s. Since we consider the contextual case, which requires joint estimation over dependent arms, we use martingales and Azuma's inequality.

Step 2: $s_* \in C_t$ in each round t with high probability. This follows from the definition of $C_t$ and the concentration argument in Step 1 for $s = s_*$. Then, in any round t where $s_* \in C_t$, we can use that $U_t(a) \geq \mu(a, X_t, s_*)$ for any action a in mUCB, and $U_t(a) \geq \mu(a, X_t, s_*) - \varepsilon$ in mmUCB.

Step 3: Upper bound on the UCB regret. We prove this by bounding each term in (5) separately. The first term is at most 0 with high probability by Step 2. The second term is a sum of confidence widths, the differences between $U_t$ and the mean reward in round t. We partition it by the chosen latent state in each round. For each latent state s, we introduce the realized rewards $R_t$ and get
$$\sum_{t=1}^{n} \mathbb{1}\{B_t = s\} \, (U_t(A_t) - \mu(A_t, X_t, s_*)) \leq G_n(s) + 1 + \sum_{t=1}^{n} \mathbb{1}\{B_t = s\} \, (R_t - \mu(A_t, X_t, s_*)).$$
The key step in proving the above bound is that $G_\ell(s) = G_n(s)$ holds for the last round ℓ in which state s is chosen, where $\sum_{t=1}^{\ell-1} \mathbb{1}\{B_t = s\} \, (U_t(A_t) - R_t) = G_\ell(s)$ holds. This relation also implies that $G_n(s) \leq \sigma\sqrt{6 N_n(s) \log n}$. The other term is bounded by Step 1, which gives a total upper bound of $2\sigma\sqrt{6 N_n(s) \log n}$. Finally, we apply the Cauchy-Schwarz inequality to combine the bounds for individual latent states.

Step 4: Upper bound on the TS regret. We use the fact that the regret decomposition for the Bayes regret in (6) is similar to that for the UCB regret in (5). As mentioned in Section 4.1, as long as our UCB analysis in Step 3 is worst-case over all possible sequences of actions and gap-free, the UCB regret bound transfers to a Bayes regret bound for TS.

### 4.3 Regret Bounds

Our first result is an upper bound on the n-round regret of mUCB. This result differs from Maillard and Mannor [25] in two respects: our bound is gap-free and accounts for context.

Theorem 1. Assume that $\hat{\theta} = \theta_*$. Then, for any $s_* \in S$ and $\theta_* \in \Theta$, the n-round regret of mUCB is bounded as $\mathcal{R}(n; s_*, \theta_*) \leq 2\sigma\sqrt{6|S| n \log n} + 3|S|$.

A gap-free lower bound on the regret in a K-armed bandit is $\Omega(\sqrt{Kn})$ [7]. So our upper bound is optimal up to log factors, when substituting actions A with latent states S. The bound can be much lower when $|S| \ll K$. It also holds for arbitrary reward models and is contextual. From Step 4 of the proof outline, we also have that the Bayes regret of mTS is bounded.

Corollary 1. Assume that $\hat{\theta} = \theta_*$. Then, for $S_* \sim P_1$ and any $\theta_* \in \Theta$, the n-round Bayes regret of mTS is bounded as $\mathcal{BR}(n) \leq 2\sigma\sqrt{6|S| n \log n} + 3|S|$.

Our next results apply to misspecified models. We assume that $\hat{\theta}$ is estimated by some black-box method. For mmUCB, our bound depends on a high-probability maximum error ε induced by $\hat{\theta}$.

Theorem 2. Let $P\left(\forall a \in A, x \in X, s \in S : |\mu(a, x, s; \hat{\theta}) - \mu(a, x, s; \theta_*)| \leq \varepsilon\right) \geq 1 - \delta$ for some ε, δ > 0. Then, for any $s_* \in S$ and $\theta_* \in \Theta$, the n-round regret of mmUCB is bounded as $\mathcal{R}(n; s_*, \theta_*) \leq 2\sigma\sqrt{6|S| n \log n} + 2\varepsilon n + \delta n + 3|S|$.

The proof of Theorem 2 follows our outline in Section 4.2. Steps 1 and 2 remain unchanged, but Step 3 changes to account for the error due to model misspecification. The linear dependence on the error ε and probability δ is unavoidable in the worst case, specifically when ε is larger than the suboptimality gap. Nevertheless, some offline learning methods, such as spectral methods for latent variable models [5], allow ε and δ to be arbitrarily small as the size of the offline dataset grows. So our bound can be sublinear in n.

For mmTS, we assume that a prior distribution over model parameters is known. The key step in the proof is to introduce $\bar{\mu}(a, x, s) = \int_\theta \mu(a, x, s; \theta) P_1(\theta) \, \mathrm{d}\theta$, the conditional mean reward marginalized over the prior. Using this quantity, we can obtain the following Bayes regret bound.

Corollary 2. Let $\theta_* \sim P_1$ and $P\left(\forall a \in A, x \in X, s \in S : |\bar{\mu}(a, x, s) - \mu(a, x, s; \theta_*)| \leq \varepsilon\right) \geq 1 - \delta$ for some ε, δ > 0. Then, for $S_*, \theta_* \sim P_1$, the n-round Bayes regret of mmTS is bounded as $\mathcal{BR}(n) \leq 2\sigma\sqrt{6|S| n \log n} + 2\varepsilon n + \delta n + 3|S|$.

The proof of Corollary 2 relies on a variant of mmUCB where $\hat{\mu}(a, x, s)$ is replaced with $\bar{\mu}(a, x, s)$. Note that unlike in mmUCB, the linear dependence on ε is conservative, as mmTS updates its model posterior and eventually concentrates. We leave the derivation of a tighter bound for future work.

We can relate the error ε and its probability δ using the tails of the conditional reward distributions. In particular, let $\mu(a, x, s; \theta) - \bar{\mu}(a, x, s)$ be v²-sub-Gaussian in $\theta \sim P_1$ for any a, x, and s. Then for any $\delta \in [0, 1]$, $\varepsilon = O\left(v\sqrt{2 \log(K|X||S|/\delta)}\right)$ satisfies the conditions on ε and δ in Corollary 2.
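As a quick numerical illustration (ours, not from the paper) of how the Theorem 1 bound scales with the number of latent states, one can evaluate the bound for |S| and compare it with the same expression evaluated at K, which is roughly what a policy that explored each action as its own state would pay:

```python
import numpy as np

def theorem1_bound(num_states, n, sigma):
    """Evaluate the Theorem 1 upper bound 2*sigma*sqrt(6*|S|*n*log n) + 3*|S|."""
    return 2 * sigma * np.sqrt(6 * num_states * n * np.log(n)) + 3 * num_states

# Sizes from the synthetic experiments in Section 5: |S| = 5 latent states, K = 10 actions.
sigma, n = 0.5, 1000
print(theorem1_bound(5, n, sigma))    # latent-state-aware scaling, ~ sqrt(|S| n log n)
print(theorem1_bound(10, n, sigma))   # same expression with |S| replaced by K
```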
## 5 Experiments

In this section, we evaluate our algorithms on both synthetic and real-world datasets. We compare the following methods: (i) UCB: UCB1/LinUCB with no offline model [8, 1]; (ii) TS: TS/LinTS with no offline model [3, 4]; (iii) Exp4: Exp4 where the models are experts [7]; (iv) mUCB, mmUCB: our proposed UCB algorithms; (v) mTS, mmTS: our proposed TS algorithms. In contrast to our methods, the UCB and TS baselines do not use an offline-learned model. Exp4 uses the models as experts, where each expert pulls the best arm given the context and its latent state. Since we are interested in fast personalization, we experiment with short horizons of up to 1000 rounds.

[Figure 1. Left: Mean reward and standard error for low model noise (σ0 = 0.05). Middle/Right: Mean/worst-case reward and standard error for high model noise (σ0 = 0.2).]

### 5.1 Synthetic Experiments

We first experiment with synthetic non-contextual bandits where A = [10] and S = [5]. The mean rewards are sampled uniformly at random as $\mu(a, s) \sim \mathrm{Uniform}(0, 1)$ for each $a \in A$ and $s \in S$. Using rejection sampling, we constrain the suboptimality gap of all actions in each state s to be at least 0.1, to ensure statistically significant comparisons at short horizons. The rewards are drawn i.i.d. from $P(\cdot \mid a, s) = N(\mu(a, s), \sigma^2)$ with σ = 0.5. We evaluate each algorithm on 500 independent runs, with a uniformly sampled latent state in each run, and report the average reward over time. We analyze the effect of model misspecification by perturbing the mean rewards with various degrees of noise: for noise $\sigma_0 > 0$, the estimated means are sampled as $\hat{\mu}(a, s) \sim N(\mu(a, s), \sigma_0^2)$ for each action a and latent state s.

The left plot in Figure 1 shows the average reward over time for low model noise, $\sigma_0 = 0.05$. In this setting, our algorithms mUCB and mTS perform better than the baselines UCB1 and TS. In the middle plot, we increase the noise to $\sigma_0 = 0.2$. Neither mUCB nor mTS accounts for modeling errors, and thus their performance degrades. On the other hand, the uncertainty-aware mmTS outperforms mTS. However, mmUCB (not reported, to reduce clutter) performs the same as mUCB. This is likely because of the conservative nature of UCBs. Despite having similar worst-case regret guarantees, Exp4 is tailored to adversarial bandits rather than stochastic ones. Therefore, it is outperformed by mUCB and mTS with the same model.

The right plot in Figure 1 shows the worst 10% of runs from the middle plot, as measured by the average reward in the last round. This is a measure of worst-case performance. Both UCB1 and TS are unaffected by model misspecification, and outperform mUCB and mTS. However, mmTS still performs best, because it adapts the misspecified model to online data. This shows that uncertainty awareness makes model-based algorithms much more robust to model misspecification and learning error.
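A sketch of the synthetic instance generator described above (our reconstruction of the text; names are ours), which can be plugged into the earlier simulator and algorithm sketches:

```python
import numpy as np

def make_synthetic_instance(rng, num_actions=10, num_states=5, min_gap=0.1, model_noise=0.05):
    """Sketch of the Section 5.1 setup: true mean rewards plus a noisy offline model."""
    while True:
        mu = rng.uniform(0.0, 1.0, size=(num_states, num_actions))
        top2 = np.sort(mu, axis=1)[:, -2:]
        # Rejection sampling: in every state, the best action must beat all others
        # by at least min_gap (equivalently, best minus second best >= min_gap).
        if np.all(top2[:, 1] - top2[:, 0] >= min_gap):
            break
    mu_hat = mu + model_noise * rng.normal(size=mu.shape)   # misspecified offline model
    return mu, mu_hat

# Example: one of the 500 runs described in the text, with high model noise.
rng = np.random.default_rng(1)
mu, mu_hat = make_synthetic_instance(rng, model_noise=0.2)
```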
### 5.2 MovieLens Experiments

We also assess the performance of our algorithms on the MovieLens 1M dataset [17], a large-scale collaborative filtering dataset in which 6040 users rate 3883 movies. Each movie has a set of genres. We filter the dataset to include only users who rated at least 200 movies and movies rated by at least 200 users, which results in 1353 users and 1124 movies. We randomly select 50% of all ratings as our training set and use the remaining 50% as the test set, which results in sparse rating matrices $M_{\mathrm{train}}$ and $M_{\mathrm{test}}$. We complete each matrix using least-squares matrix completion [29] with rank 20. This rank is high enough to yield a low prediction error, yet small enough to avoid overfitting. The learned latent factors are $M_{\mathrm{train}} = \hat{U}\hat{V}^\top$ and $M_{\mathrm{test}} = U V^\top$. User i and movie j correspond to rows $U_i$ and $V_j$, respectively, in the matrix factors.

We define a latent contextual bandit instance with A = [20] and S = [5] as follows. Using k-means clustering on the rows of U, we cluster users into 5 clusters, where 5 is the largest value that does not yield empty clusters. First, a user i is sampled uniformly at random. In each round, 20 genres, and then a movie for each genre, are sampled uniformly, creating a set of diverse movies. The context $X_t \in \mathbb{R}^{20 \times 20}$ is a matrix where each row is the training latent factor of one sampled movie, that is, $\hat{V}_j$ for movie j. The agent selects one movie from $X_t$ and observes its reward. The reward distribution for user i and movie j is $N(U_i^\top V_j, 0.5)$, and its mean is the product of the corresponding user and item factors on the test set. We evaluate on 500 users.

Our offline model is a Gaussian mixture that is learned in the same way as the true model, except on the training set. It has 5 clusters of users derived from k-means clustering on the rows of $\hat{U}$. For each latent state, the prior is a Gaussian with the corresponding cluster mean and covariance. The context in LinUCB and LinTS is $X_t$, the same as in our algorithms, and they only learn the user latent factor. This is more information than in low-rank bandit algorithms [24], which learn both the user and movie representations, and thus perform poorly in the short horizons that we consider.

[Figure 2. Mean/worst-case rating and standard error on the MovieLens 1M dataset.]

The left plot in Figure 2 shows the mean ratings and standard errors of 6 algorithms (as earlier, mmUCB performs similarly to mUCB and is not shown). mUCB and mTS clearly adapt and personalize to users faster than LinUCB and LinTS, and converge to better policies than Exp4. Both mUCB and mTS are affected by model misspecification. In comparison, mmTS handles model uncertainty and converges to the best policies. The right plot in Figure 2 shows the average results for the worst 10% of users. Again, mmTS dramatically outperforms mTS in the worst case.
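A rough sketch of how such a bandit instance can be assembled from the learned factors (our reconstruction of the description above; `genre_to_movies` and all other names are ours, and the details of the paper's pipeline may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def offline_latent_model(U_hat, num_states=5, seed=0):
    """Cluster training user factors into latent states, as described in the text."""
    km = KMeans(n_clusters=num_states, random_state=seed).fit(U_hat)
    return km.cluster_centers_, km.labels_

def movielens_round(rng, U, V, V_hat, user, genre_to_movies, num_genres=20):
    """One round of the Section 5.2 bandit: sample 20 genres, one movie per genre.

    U, V are test-set factors (they define the true reward); V_hat holds training-set
    movie factors (they define the context). genre_to_movies maps each genre index to
    the indices of movies with that genre; it must contain at least num_genres genres.
    """
    genres = rng.choice(len(genre_to_movies), size=num_genres, replace=False)
    movies = np.array([rng.choice(genre_to_movies[g]) for g in genres])
    context = V_hat[movies]                      # 20 x 20 matrix, one training factor per row

    def reward(arm):
        # Pulling an arm selects one of the 20 movies; noise scale 0.5 as in the text.
        return U[user] @ V[movies[arm]] + 0.5 * rng.normal()

    return context, reward
```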
## 6 Related Work

Latent bandits. Latent contextual bandits can personalize faster than standard contextual bandit policies, such as LinUCB [1] or linear Thompson sampling [4, 2]. The closest to our work is that of Maillard and Mannor [25], who proposed and analyzed non-contextual UCB algorithms, with side information that uniquely identifies users, under the assumption that the mean rewards for each latent state are known. They then relaxed the assumption of known means, but assumed the other extreme case, where the means are learned completely online. Zhou and Brunskill [35] extended this formulation to contextual bandits. However, they used offline-learned policies that were deployed online as a mixture, using Exp4. Bayesian policy reuse (BPR) [27] selects offline-learned policies by maintaining a belief over the optimality of each policy through posterior inference, but no analysis exists. Krause and Guestrin [21] used inference in latent graphical models to gather information, but only to identify the state. This is akin to best-arm identification [6], which is a different objective from cumulative regret minimization. We subsume prior work by providing contextual, uncertainty-aware UCB and TS algorithms, and their unified analyses.

Low-rank bandits. Low-rank bandits can be viewed as a generalization of latent bandits, where low-rank matrices parameterize the reward and are learned online. Kawale et al. [20] proposed a TS algorithm for low-rank matrix factorization; however, their algorithm is inefficient and is analyzed only for rank-1 matrices. Sen et al. [31] analyzed an ε-greedy algorithm, but relied on properties that rarely hold in practice. Another body of work studied online clustering of bandits, which is based on a specific low-rank structure [24, 13, 14, 26]. Yet another studied low-rank matrices where both rows and columns are arms [19, 18]. None of these prior works used offline-learned models, an important practical consideration given the general availability of offline data, and they learned at most linear models. In Section 5, we compare to idealized versions of these methods where low-rank features are provided.

Structured bandits. In structured bandits, the arms are related through a common latent parameter. Lattimore and Munos [22] proposed a UCB algorithm for a multi-armed bandit variant of this problem. Recently, Gupta et al. [16] proposed a unified framework that adapts classic bandit algorithms, such as UCB1 and TS, to structured multi-armed bandits. Though similar to our work, the algorithms differ in key aspects: we put confidence intervals on latent states instead of arms, and we develop contextual algorithms with a prior model. Yu et al. [34] proposed variational Thompson sampling for learning to act in graphical models with latent variables. Since this framework is very general, with both model parameters and latent variables learned, the regret guarantees are weak and Thompson sampling must be approximated. This is in stark contrast to our work, where we obtain strong regret guarantees and can implement the algorithms as analyzed.

## 7 Conclusions

We study latent bandits, a class of bandit problems where the reward model is parameterized by a latent state and is at least partially known. We propose UCB and Thompson sampling algorithms for solving this problem, which identify the latent state conditioned on offline-learned reward models. The algorithms are contextual and robust to misspecification. We bound their regret using a unified analysis, and validate them empirically on both a synthetic problem and the MovieLens 1M dataset.

Because of its generality and practicality, our work can be naturally extended to more complicated graphical models of the environment. For example, in this work, we assume that the latent state is fixed over time. However, a transition model could be incorporated into our setting to model an evolving latent state. This would be useful when user preferences or intents change over time.
In addition, we could consider a more expressive latent state, such as a mixture of topics, which would be useful if user types could not be classified into a finite discrete set. We leave the detailed study of these extensions to future work.

## Broader Impact

Our work develops improved algorithms for bandit-style exploration in a very general and abstract sense. We have demonstrated its ability to increase the rate at which interactive systems identify a user's latent state, to improve the long-term impact on user reward (e.g., engagement in a recommender system). Our work is agnostic to the form of the reward. We are strongly motivated by improving users' positive engagement with interactive systems (e.g., by identifying user interests or preferences in a recommender system). However, other forms of reward that are unaligned with a user's best interests could be used; our methods do not propose specific reward models. That said, our work has no social implications (welfare, fairness, privacy, etc.) beyond those already at play in the interactive system to which our methods might be applied.

## Acknowledgments and Disclosure of Funding

There are no additional sources of funding related to this work.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.

[2] Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. Electronic Journal of Statistics, 2016.

[3] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. CoRR, abs/1111.1797, 2011.

[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. ICML, 2013.

[5] Anima Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 2014.

[6] Jean-Yves Audibert, Sebastien Bubeck, and Remi Munos. Best arm identification in multi-armed bandits. In COLT, 2010.

[7] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.

[8] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.

[9] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 2013.

[10] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems, pages 2249-2257, 2012.

[11] Merlise Clyde and Edward I. George. Model uncertainty. Statistical Science, 2004.

[12] Arnaud Doucet, Neil Gordon, and Nando de Freitas. Sequential Monte Carlo Methods in Practice. Springer New York, 2013.

[13] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. ICML, 2014.

[14] Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. ICML, 2017.

[15] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, 41:148-177, 1979.

[16] Samarth Gupta, Shreyas Chaudhari, Subhojyoti Mukherjee, Gauri Joshi, and Osman Yağan. A unified approach to translate classical bandit algorithms to the structured bandit setting. CoRR, abs/1810.08164, 2018.
[17] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 2015.

[18] Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Bernoulli rank-1 bandits for click feedback. IJCAI, 2017.

[19] Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. AISTATS, 2017.

[20] Jaya Kawale, Hung H. Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. NeurIPS, 2015.

[21] Andreas Krause and Carlos Guestrin. Optimal value of information in graphical models. JAIR, 2009.

[22] Tor Lattimore and Remi Munos. Bounded regret for finite-armed structured bandits. NeurIPS, 2014.

[23] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.

[24] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. SIGIR, 2016.

[25] Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. ICML, 2014.

[26] Trong T. Nguyen and Hady W. Lauw. Dynamic clustering of contextual multi-armed bandits. CIKM, 2014.

[27] Benjamin Rosman, Majd Hawasly, and Subramanian Ramamoorthy. Bayesian policy reuse. Machine Learning, 2016.

[28] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.

[29] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. NeurIPS, 2008.

[30] Simo Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.

[31] Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An NMF approach. AISTATS, 2017.

[32] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.

[33] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. WSDM, 2017.

[34] Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, and Ole J. Mengshoel. Graphical models meet bandits: A variational Thompson sampling approach. In ICML, 2020.

[35] Li Zhou and Emma Brunskill. Latent contextual bandits and their application to personalized recommendations for new users. IJCAI, 2016.