# Latent Bandits Revisited

Joey Hong (Google Research, jxihong@google.com), Branislav Kveton (Google Research, bkveton@google.com), Manzil Zaheer (Google Research, manzilzaheer@google.com), Yinlam Chow (Google Research, yinlamchow@google.com), Amr Ahmed (Google Research, amra@google.com), Craig Boutilier (Google Research, cboutilier@google.com)

A latent bandit is a bandit problem where the learning agent knows the reward distributions of the arms conditioned on an unknown discrete latent state. The goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning, where complex models can be learned offline and the agent identifies the latent state online. This is of high practical relevance, for instance in recommender systems. In this work, we propose general algorithms for latent bandits, based on both upper confidence bounds and Thompson sampling. The algorithms are contextual, and aware of model uncertainty and misspecification. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than the number of actions. A comprehensive empirical study showcases the advantages of our approach.

## 1 Introduction

Many online platforms, such as search engines and recommender systems, display results based on observed properties of the user and their query. However, the user's behavior is often influenced by a latent state not explicitly revealed to the system. This might be user intent (e.g., reflecting a long-term task) in search, or short- and long-term preferences (e.g., reflecting topic interests) in a recommender. In each case, the unobserved latent state influences the user response to the displayed results, and thus the associated reward. A machine learning (ML) system, thus, should take steps to infer the latent state and tailor its results accordingly. While many ML models use either heuristic features [1, 4] or recurrent models [33] to capture user history, explicit exploration for latent state identification (i.e., reducing uncertainty regarding the true state) is less common in practice.

In this paper, we study latent bandits, which model online interactions of the above type as follows. In each round, the learning agent observes a context (e.g., a query or user demographics), takes an action (e.g., a recommendation), and observes its reward (e.g., user engagement with the recommendation). The reward depends stochastically on both the context and the user's latent state. As a result, it provides information about the unobserved latent state, which can be used to improve predictions in the future. We are interested in designing exploration policies that allow the agent to quickly maximize its per-round reward by resolving the relevant latent state uncertainty. Specifically, we want policies that have low n-round regret.

Our latent state structure allows an agent to quickly adapt its results to new users, or adapt to new user tasks or intents on a per-session basis. One example of such structure is clusters of users with similar item preferences [35]. In this example, a system should be able to identify which cluster a new user belongs to faster than learning all user preferences without any prior information. This would yield much better recommendations in the short-horizon, or cold-start, regime.
Fully online exploration (e.g., for personalization) involves learning a reward model, conditioned on the context and latent state, and generally requires massive amounts of interaction data. Fortunately, many platforms have such offline data (e.g., past user interactions) available, which can be used to construct both a reasonable latent state space and conditional reward models [23, 9]. We assume that such models are available and focus on the simpler online problem of latent state identification. Prior works in this setting [25, 35] assumed that the true conditional reward models are given. Moreover, they focused on upper confidence bound (UCB) designs, which are theoretically optimal but often perform poorly empirically. We provide a unified framework that combines offline-learned models with online exploration using both UCBs and Thompson sampling. Our algorithms are practical, analyzable, contextual, and robust to natural forms of model misspecification. We are the first to propose algorithms for latent bandits that are aware of model uncertainty.

Our paper is organized as follows. In Section 3, we propose novel practical algorithms based on UCBs and Thompson sampling. Based on a connection between UCBs and posterior sampling [28], we derive near-optimal bounds on the Bayes regret of our algorithms in Section 4. Finally, in Section 5, we demonstrate their effectiveness in synthetic simulations and on a large-scale real-world dataset.

## 2 Problem Formulation

We adopt the following notation. Random variables are capitalized. The set of actions is A, the set of contexts is X, and the set of latent states is S, with |S| ≤ |A|.

We study a latent bandit problem, where the learning agent interacts with an environment over n rounds as follows. In round $t \in [n]$, the agent observes context $X_t \in X$, takes action $A_t \in A$, and observes reward $R_t \in \mathbb{R}$. The random variable $R_t$ depends on the context $X_t$, action $A_t$, and latent state $s_* \in S$, where $s_*$ is fixed but unknown. (The latent state $s_*$ can be viewed as a user's current task or preferences, and is fixed over all n rounds.) The observation history of the learning agent up to round t is a vector $H_t = (X_1, A_1, R_1, \dots, X_{t-1}, A_{t-1}, R_{t-1})$. The policy of the agent maps $H_t$ and $X_t$ to the choice of action $A_t$.

Now we state our assumptions on how rewards and contexts are generated. The rewards are drawn i.i.d. from a conditional reward distribution $P(\cdot \mid A, X, s; \theta)$, which is parameterized by a vector $\theta \in \Theta$, where $\Theta$ is a set of feasible model parameters. We use $\mu(a, x, s; \theta) = \mathbb{E}_{R \sim P(\cdot \mid a, x, s; \theta)}[R]$ to denote the mean reward of action a in context x and latent state s under model θ. We denote the true (unknown) latent state by $s_*$ and the true model parameters by $\theta_*$. We assume that $\theta_*$ can be estimated offline. We also assume that the rewards are σ²-sub-Gaussian, that is,
$$\mathbb{E}_{R \sim P(\cdot \mid a, x, s; \theta_*)}\left[\exp\left(\lambda (R - \mu(a, x, s; \theta_*))\right)\right] \leq \exp(\sigma^2 \lambda^2 / 2)$$
for all a, x, s, and λ > 0. The contexts can be generated by an arbitrary process independent of the agent's actions, and the mean reward $\mu(a, x, s; \theta)$ can be any complex function of θ.

The performance of the agent is measured by regret. Let $A_{t,*} = \arg\max_{a \in A} \mu(a, X_t, s_*; \theta_*)$ be the optimal action for latent state $s_* \in S$ and model $\theta_* \in \Theta$. Then the expected n-round regret is
$$\mathcal{R}(n; s_*, \theta_*) = \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, s_*; \theta_*) - \mu(A_t, X_t, s_*; \theta_*)\right].$$
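To make the interaction protocol and the regret definition concrete, here is a minimal simulation sketch in Python. It uses a non-contextual Gaussian reward model, and all names (`LatentBandit`, `pull`, `regret`) are ours for illustration; the paper itself allows arbitrary contexts and reward distributions.

```python
import numpy as np

class LatentBandit:
    """Minimal non-contextual latent bandit: |S| latent states, |A| actions.

    mu[s, a] is the mean reward of action a in latent state s; rewards are
    Gaussian with standard deviation sigma (hence sigma^2-sub-Gaussian).
    """

    def __init__(self, mu, sigma, rng):
        self.mu, self.sigma, self.rng = mu, sigma, rng
        self.s_star = rng.integers(mu.shape[0])   # true latent state, fixed and unknown

    def pull(self, a):
        """Observe a stochastic reward for action a under the hidden state."""
        return self.mu[self.s_star, a] + self.sigma * self.rng.normal()

    def regret(self, a):
        """Per-round pseudo-regret of playing a instead of the optimal action."""
        return self.mu[self.s_star].max() - self.mu[self.s_star, a]

# Example: cumulative n-round pseudo-regret of a uniformly random policy.
rng = np.random.default_rng(0)
env = LatentBandit(mu=rng.uniform(0, 1, size=(5, 10)), sigma=0.5, rng=rng)
print(sum(env.regret(rng.integers(10)) for _ in range(1000)))
```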
While a fixed-state regret is useful, we are often more concerned with the average performance over a range of states (e.g., multiple users, and multiple sessions with the same user). Thus we also consider the Bayes regret, where we take an expectation over a random latent state and model. Suppose that $S_*$ and $\theta_*$ are drawn from some prior $P_1$. Then the n-round Bayes regret is
$$\mathcal{BR}(n) = \mathbb{E}\left[\mathcal{R}(n; S_*, \theta_*)\right] = \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, S_*; \theta_*) - \mu(A_t, X_t, S_*; \theta_*)\right].$$
Note that $S_*$ and $\theta_*$ in the definition of $A_{t,*} = \arg\max_{a \in A} \mu(a, X_t, S_*; \theta_*)$ are now random.

## 3 Algorithms

In this section, we develop both UCB and Thompson sampling (TS) algorithms that leverage an arbitrarily complex environment model, generally learned offline, to expedite online exploration. As discussed earlier, such offline models can readily be learned given the large amounts of offline interaction data available in many interactive systems. In each subsection below, we specify a particular form of the offline-learned model and develop a corresponding online algorithm.

It is important to note that, given an environment model, an optimal solution is to compute and maximize the Gittins index over actions [15]. However, this is often computationally intractable and does not generalize to complex latent variable models. We want our methods to be practical, and thus only consider myopic policies that can be efficiently implemented online.

### 3.1 UCB with Perfect Model (mUCB)

We first design a UCB-like algorithm with a learned model $\hat{\theta} \in \Theta$. Let $\hat{\mu}(a, x, s) = \mu(a, x, s; \hat{\theta})$ be the mean reward under model $\hat{\theta}$ and $\mu(a, x, s) = \mu(a, x, s; \theta_*)$ be the true mean reward. We initially assume that the learned model is accurate, that is, we are given $\hat{\theta} = \theta_*$ as an input.

The key idea in UCB algorithms is to compute a high-probability upper confidence bound $U_t(a)$ on the mean reward of each action a in any round t, where $U_t$ is some function of the history [8]. Then the algorithms take action $A_t = \arg\max_{a \in A} U_t(a)$. Our model-based UCB algorithm, which we call mUCB, works very similarly. The pseudocode of mUCB is in Algorithm 1.

Algorithm 1 (mUCB):
1. Input: model parameters $\hat{\theta}$.
2. For t = 1, 2, ...:
   1. Define $N_t(s) = \sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\}$ and $G_t(s) = \sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\} \, (\hat{\mu}(A_\ell, X_\ell, s) - R_\ell)$. (3)
   2. Compute the set of consistent latent states $C_t = \{s \in S : G_t(s) \leq \sigma \sqrt{6 N_t(s) \log n}\}$.
   3. Select $(B_t, A_t) = \arg\max_{s \in C_t, a \in A} \hat{\mu}(a, X_t, s)$.

The algorithm works as follows. In round t, it maintains a set of latent states $C_t$ that are consistent with the rewards observed thus far. Then it chooses a specific believed latent state $B_t$ from the consistent set $C_t$ and takes the action $A_t$ with the maximum expected reward in that state, $(B_t, A_t) = \arg\max_{s \in C_t, a \in A} \hat{\mu}(a, X_t, s)$. Thus the UCB of action a is $U_t(a) = \max_{s \in C_t} \hat{\mu}(a, X_t, s)$.

mUCB tracks two key quantities: the number of times that state s is selected up to round t, $N_t(s)$; and the gap between the expected and realized rewards in state s up to round t, $G_t(s)$. If $G_t(s)$ is high, mUCB marks state s as inconsistent and does not consider it in round t. Note that the gap is defined over latent states rather than actions, and with respect to realized rewards rather than expected rewards.
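For concreteness, here is a minimal non-contextual sketch of mUCB in Python, reusing the `LatentBandit` simulator sketched in Section 2; `mu_hat` plays the role of the offline model $\hat{\mu}$ and all other names are ours.

```python
import numpy as np

def m_ucb(env, mu_hat, n, sigma):
    """Sketch of Algorithm 1 (mUCB) without context; mu_hat[s, a] is the offline model."""
    num_states = mu_hat.shape[0]
    N = np.zeros(num_states)      # N_t(s): rounds in which s was the believed state
    G = np.zeros(num_states)      # G_t(s): cumulative predicted-minus-realized reward
    cum_regret = 0.0
    for t in range(n):
        consistent = np.flatnonzero(G <= sigma * np.sqrt(6 * N * np.log(n)))
        if consistent.size == 0:  # s_* stays consistent w.h.p.; keep the sketch safe anyway
            consistent = np.arange(num_states)
        # Jointly maximize the model's mean reward over consistent states and all actions.
        b = consistent[np.argmax(mu_hat[consistent].max(axis=1))]
        a = int(np.argmax(mu_hat[b]))
        r = env.pull(a)
        N[b] += 1
        G[b] += mu_hat[b, a] - r   # update the gap statistic of the believed state
        cum_regret += env.regret(a)
    return cum_regret
```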
### 3.2 UCB with Misspecified Model (mmUCB)

Now we extend mUCB to misspecified models, where we are given $\hat{\theta} \neq \theta_*$ as an input. We consider the following worst-case high-probability definition of model misspecification: there exist ε, δ > 0 such that $|\hat{\mu}(a, x, s) - \mu(a, x, s)| \leq \varepsilon$ holds with probability at least 1 − δ jointly for all a, x, and s. Guarantees of this form exist, for example, for spectral methods for latent variable models, where ε and δ depend on the size of the offline dataset [5]. We modify mUCB to be robust to this type of model error, which gives rise to a new algorithm, mmUCB.

The only change to mUCB is that $G_t(s)$ is replaced with the high-probability lower bound
$$G_t(s) = \sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\} \, (\hat{\mu}(A_\ell, X_\ell, s) - \varepsilon - R_\ell). \tag{4}$$
This allows mmUCB to be conservative when identifying inconsistent latent states, so that $s_* \in C_t$ continues to hold with high probability. Just as importantly, it is also useful for deriving worst-case regret bounds, both for UCB algorithms and for TS.

### 3.3 Thompson Sampling with Perfect Model (mTS)

Our UCB algorithms are designed for the worst case. Now we adopt a more relaxed design, where the latent state is random, and the model is fixed and known. That is, we are given $\hat{\theta} = \theta_*$ and a prior distribution over latent states $P_1$ as inputs.

Our solution is a variant of Thompson sampling [32, 10, 28]. The key idea in TS is to sample actions according to their posterior probability of being optimal, conditioned on the history of the agent. In our case, the optimal action in round t is $A_{t,*} = \arg\max_{a \in A} \mu(a, X_t, S_*; \theta_*)$, and it is random due to both the observed context $X_t$ and the unknown latent state $S_*$. Therefore, TS should take action a with probability $P(A_t = a \mid X_t, H_t) = P(A_{t,*} = a \mid X_t, H_t)$. An advantage of TS over UCB algorithms is that it obviates the need to design UCBs, which are often loose. As a result, UCB algorithms are often conservative in practice, and TS typically offers better empirical performance [10].

Our model-based TS algorithm, which we call mTS, is presented in Algorithm 2.

Algorithm 2 (mTS):
1. Input: model parameters $\hat{\theta}$ and a prior over latent states $P_1(s)$.
2. For t = 1, 2, ...:
   1. Define $P_t(s) \propto P_1(s) \prod_{\ell=1}^{t-1} P(R_\ell \mid A_\ell, X_\ell, s; \hat{\theta})$.
   2. Sample $B_t \sim P_t$.
   3. Select $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$.

The algorithm works as follows. In round t, it maintains a posterior probability, $P_t(s) = P(S_* = s \mid H_t)$, that each latent state s is optimal. Then it samples the latent state from its posterior distribution, $B_t \sim P_t$, and takes action $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$. For any fixed s, $P_t(s) \propto P_1(s) \prod_{\ell=1}^{t-1} P(R_\ell \mid A_\ell, X_\ell, s; \hat{\theta})$. As a result, $P_t$ can be updated incrementally in the standard Bayesian filtering fashion [30]. Note that unlike mUCB, mTS needs to know the conditional reward distribution $P(\cdot \mid a, x, s; \hat{\theta})$ to update $P_t(s)$.
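A minimal non-contextual sketch of mTS in Python, again reusing the simulator above and assuming a Gaussian reward likelihood with known noise σ (an assumption of this sketch, not a requirement of Algorithm 2):

```python
import numpy as np

def m_ts(env, mu_hat, n, sigma, rng, prior=None):
    """Sketch of Algorithm 2 (mTS) without context, with a Gaussian reward likelihood."""
    num_states = mu_hat.shape[0]
    if prior is None:
        prior = np.full(num_states, 1.0 / num_states)   # uniform prior P_1(s)
    log_post = np.log(np.asarray(prior, dtype=float))   # log of the state posterior P_t(s)
    cum_regret = 0.0
    for t in range(n):
        p = np.exp(log_post - log_post.max())
        p /= p.sum()
        b = rng.choice(num_states, p=p)        # sample B_t from the state posterior
        a = int(np.argmax(mu_hat[b]))          # act greedily under the sampled state
        r = env.pull(a)
        # Incremental Bayesian filtering: add the Gaussian log-likelihood of the
        # observed reward under each candidate latent state.
        log_post += -0.5 * ((r - mu_hat[:, a]) / sigma) ** 2
        cum_regret += env.regret(a)
    return cum_regret
```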
### 3.4 Thompson Sampling with Misspecified Model (mmTS)

Now we extend mTS to misspecified models. However, instead of considering a worst-case estimate of $\theta_*$, as in mmUCB, we assume that $\theta_* \sim P_1$ and that the learning agent knows $P_1$. This is motivated by prior literature on modeling epistemic uncertainty [11]. In practice, learning a distribution over parameters is intractable for complex models, but approximate inference can always be performed, for instance with an ensemble of bootstrapped models [11].

Our TS algorithm for misspecified models, which we call mmTS, is presented in Algorithm 3.

Algorithm 3 (mmTS):
1. Input: a joint prior over latent states and model parameters $P_1(s, \theta)$.
2. For t = 1, 2, ...:
   1. Define $P_t(s, \theta) \propto P_1(s, \theta) \prod_{\ell=1}^{t-1} P(R_\ell \mid A_\ell, X_\ell, s; \theta)$.
   2. Sample $(B_t, \hat{\theta}) \sim P_t$.
   3. Select $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$.

The algorithm seamlessly integrates model uncertainty into mTS as follows. In round t, the latent state $B_t$ and estimated model parameters $\hat{\theta}$ are sampled from their joint posterior, and then the action $A_t = \arg\max_{a \in A} \hat{\mu}(a, X_t, B_t)$ is taken. In general, sequential Monte Carlo methods [12] can be used for approximate posterior sampling. However, when the model prior is conjugate to the likelihood, the posterior has a closed form and can be sampled from as follows. Since S is finite, we can tractably sample from the joint posterior by first sampling the latent state $B_t$ from its marginal posterior and then $\hat{\theta}$ conditioned on $B_t$. For exponential-family distributions, the posterior parameters can also be updated online and efficiently. We provide more details in Appendix A, and pseudocode for Gaussians in Appendix B.
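Appendix B is not reproduced here; the following is our own sketch of what a conjugate Gaussian instantiation of mmTS could look like for the non-contextual simulator above. It assumes each mean µ(a, s) has an independent Gaussian prior centered at the offline model, and all names are ours.

```python
import numpy as np

def mm_ts_gaussian(env, mu_hat, n, sigma, sigma0, rng):
    """Sketch of a conjugate Gaussian mmTS (Algorithm 3), non-contextual case.

    Prior: mu(a, s) ~ N(mu_hat[s, a], sigma0^2) independently; likelihood:
    rewards ~ N(mu(a, s), sigma^2). The joint posterior over (s, theta) then
    factorizes into a categorical marginal over s and per-entry Gaussian
    conditionals over the means.
    """
    num_states, num_actions = mu_hat.shape
    m = mu_hat.copy()                                     # conditional posterior means
    v = np.full((num_states, num_actions), sigma0 ** 2)   # conditional posterior variances
    log_post = np.full(num_states, -np.log(num_states))   # log of the marginal over states
    cum_regret = 0.0
    for t in range(n):
        p = np.exp(log_post - log_post.max())
        p /= p.sum()
        b = rng.choice(num_states, p=p)                   # sample B_t from its marginal
        theta_b = m[b] + np.sqrt(v[b]) * rng.normal(size=num_actions)  # then theta | B_t
        a = int(np.argmax(theta_b))
        r = env.pull(a)
        # Predictive (marginal-likelihood) update of the state posterior.
        pred_var = v[:, a] + sigma ** 2
        log_post += -0.5 * (np.log(2 * np.pi * pred_var) + (r - m[:, a]) ** 2 / pred_var)
        # Conjugate update of each state's conditional posterior for the pulled action.
        gain = v[:, a] / pred_var
        m[:, a] += gain * (r - m[:, a])
        v[:, a] *= 1.0 - gain
        cum_regret += env.regret(a)
    return cum_regret
```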
## 4 Regret Analysis

Maillard and Mannor [25] derived gap-dependent regret bounds for UCB algorithms in latent bandits under the assumption that the true model is known and the arms are independent. We provide a unified analysis that extends their results to include context, model misspecification, and TS algorithms.

### 4.1 Regret Decomposition

UCB algorithms explore using upper confidence bounds, while TS samples from the posterior. Russo and Van Roy [28] related these two designs with a unified regret decomposition. In our problems, this is reflected as follows. Let $s_*$ be the true latent state. Then the regret of our UCB algorithms in round t decomposes as
$$\mu(A_{t,*}, X_t, s_*) - \mu(A_t, X_t, s_*) = \mu(A_{t,*}, X_t, s_*) - U_t(A_t) + U_t(A_t) - \mu(A_t, X_t, s_*) \leq \left[\mu(A_{t,*}, X_t, s_*) - U_t(A_{t,*})\right] + \left[U_t(A_t) - \mu(A_t, X_t, s_*)\right],$$
where the inequality holds by the definition of $A_t$. A similar inequality without latent states appears in prior work [28]. This yields the regret decomposition
$$\mathcal{R}(n; s_*, \theta_*) \leq \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, s_*) - U_t(A_{t,*})\right] + \mathbb{E}\left[\sum_{t=1}^{n} U_t(A_t) - \mu(A_t, X_t, s_*)\right]. \tag{5}$$
An analogous decomposition exists for the Bayes regret of our TS algorithms. Specifically, for any TS algorithm and any function $U_t$ of the history, we have
$$\mathcal{BR}(n) \leq \mathbb{E}\left[\sum_{t=1}^{n} \mu(A_{t,*}, X_t, S_*; \theta_*) - U_t(A_{t,*})\right] + \mathbb{E}\left[\sum_{t=1}^{n} U_t(A_t) - \mu(A_t, X_t, S_*; \theta_*)\right]. \tag{6}$$
The proof uses the fact that $\mathbb{E}[U_t(A_{t,*}) \mid X_t, H_t] = \mathbb{E}[U_t(A_t) \mid X_t, H_t]$ holds for any $H_t$ and $X_t$, by the design of TS. Thus $U_t$ can be an upper confidence bound from a UCB algorithm. Though the UCBs $U_t$ are not used by TS, they can be used to analyze it, due to the similarity of (5) and (6). Thus regret bounds for UCB algorithms can be translated into Bayes regret bounds for TS. There are two caveats, though. First, since actions in TS do not maximize $U_t$, the regret bound must be proved using a worst-case argument over suboptimal actions. Second, because the Bayes regret is an expectation over problem instances, the resulting regret bounds are problem-independent, also known as gap-free.

### 4.2 Key Steps in Our Proofs

Full proofs of our regret bounds are in Appendix C. All proofs follow the same outline, whose key steps are summarized below. To ease the exposition, we assume that the suboptimality gap of any action is bounded by 1.

Step 1: Concentration of realized rewards at their means. We first show that the total observed reward does not deviate too much from its expectation under any believed latent state s. Formally, we show that
$$P\left(\sum_{\ell=1}^{t-1} \mathbb{1}\{B_\ell = s\} \, (\mu(A_\ell, X_\ell, s_*) - R_\ell) \geq \sigma \sqrt{6 N_t(s) \log n}\right) = O(n^{-2})$$
holds for any round t and state s. Since we consider the contextual case, which requires joint estimation over dependent arms, we use martingales and Azuma's inequality.

Step 2: $s_* \in C_t$ in each round t with high probability. This follows from the definition of $C_t$ and the concentration argument in Step 1 for $s = s_*$. Then, in any round t where $s_* \in C_t$, we can use that $U_t(a) \geq \mu(a, X_t, s_*)$ for any action a in mUCB, and $U_t(a) \geq \mu(a, X_t, s_*) - \varepsilon$ in mmUCB.

Step 3: Upper bound on the UCB regret. We prove this by bounding each term in (5) separately. The first term is at most 0 with high probability by Step 2. The second term is a sum of confidence widths, the differences between $U_t$ and the mean reward in round t. We partition it by the chosen latent state in each round. For each latent state s, we introduce the realized rewards $R_t$ and get
$$\sum_{t=1}^{n} \mathbb{1}\{B_t = s\} \, (U_t(A_t) - \mu(A_t, X_t, s_*)) \leq G_n(s) + 1 + \sum_{t=1}^{n} \mathbb{1}\{B_t = s\} \, (R_t - \mu(A_t, X_t, s_*)).$$
The key step in proving the above bound is that $G_\ell(s) = G_n(s)$ holds for the last round ℓ in which state s is chosen, where $\sum_{t=1}^{\ell-1} \mathbb{1}\{B_t = s\} \, (U_t(A_t) - R_t) = G_\ell(s)$ holds. This relation also implies that $G_n(s) \leq \sigma\sqrt{6 N_n(s) \log n}$. The other term is bounded by Step 1, which gives a total upper bound of $2\sigma\sqrt{6 N_n(s) \log n}$. Finally, we apply the Cauchy-Schwarz inequality to combine the bounds for individual latent states.

Step 4: Upper bound on the TS regret. We use the fact that the regret decomposition for the Bayes regret in (6) is similar to that for the UCB regret in (5). As mentioned in Section 4.1, as long as our UCB analysis in Step 3 is worst-case over all possible sequences of actions and gap-free, the UCB regret bound transfers to a Bayes regret bound for TS.

### 4.3 Regret Bounds

Our first result is an upper bound on the n-round regret of mUCB. This result differs from Maillard and Mannor [25] in two respects: our bound is gap-free and accounts for context.

Theorem 1. Assume that $\hat{\theta} = \theta_*$. Then, for any $s_* \in S$ and $\theta_* \in \Theta$, the n-round regret of mUCB is bounded as $\mathcal{R}(n; s_*, \theta_*) \leq 2\sigma\sqrt{6|S| n \log n} + 3|S|$.

A gap-free lower bound on the regret in a K-armed bandit is $\Omega(\sqrt{Kn})$ [7]. So our upper bound is optimal up to log factors, when substituting actions A with latent states S. The bound can be much lower when $|S| \ll K$. It also holds for arbitrary reward models and is contextual. From Step 4 of the proof outline, we also have that the Bayes regret of mTS is bounded.

Corollary 1. Assume that $\hat{\theta} = \theta_*$. Then, for $S_* \sim P_1$ and any $\theta_* \in \Theta$, the n-round Bayes regret of mTS is bounded as $\mathcal{BR}(n) \leq 2\sigma\sqrt{6|S| n \log n} + 3|S|$.

Our next results apply to misspecified models. We assume that $\hat{\theta}$ is estimated by some black-box method. For mmUCB, our bound depends on a high-probability maximum error ε induced by $\hat{\theta}$.

Theorem 2. Let $P\left(\forall a \in A, x \in X, s \in S : |\mu(a, x, s; \hat{\theta}) - \mu(a, x, s; \theta_*)| \leq \varepsilon\right) \geq 1 - \delta$ for some ε, δ > 0. Then, for any $s_* \in S$ and $\theta_* \in \Theta$, the n-round regret of mmUCB is bounded as $\mathcal{R}(n; s_*, \theta_*) \leq 2\sigma\sqrt{6|S| n \log n} + 2\varepsilon n + \delta n + 3|S|$.

The proof of Theorem 2 follows our outline in Section 4.2. Steps 1 and 2 remain unchanged, but Step 3 changes to account for the error due to model misspecification. The linear dependence on the error ε and probability δ is unavoidable in the worst case, specifically when ε is larger than the suboptimality gap. Nevertheless, some offline learning methods, such as spectral methods for latent variable models [5], allow ε and δ to be arbitrarily small as the size of the offline dataset grows. So our bound can be sublinear in n.

For mmTS, we assume that a prior distribution over model parameters is known. The key step in the proof is to introduce $\bar{\mu}(a, x, s) = \int_\theta \mu(a, x, s; \theta) P_1(\theta) \, \mathrm{d}\theta$, the conditional mean reward marginalized over the prior. Using this quantity, we can obtain the following Bayes regret bound.

Corollary 2. Let $\theta_* \sim P_1$ and $P\left(\forall a \in A, x \in X, s \in S : |\bar{\mu}(a, x, s) - \mu(a, x, s; \theta_*)| \leq \varepsilon\right) \geq 1 - \delta$ for some ε, δ > 0. Then, for $S_*, \theta_* \sim P_1$, the n-round Bayes regret of mmTS is bounded as $\mathcal{BR}(n) \leq 2\sigma\sqrt{6|S| n \log n} + 2\varepsilon n + \delta n + 3|S|$.

The proof of Corollary 2 relies on a variant of mmUCB where $\hat{\mu}(a, x, s)$ is replaced with $\bar{\mu}(a, x, s)$. Note that unlike in mmUCB, the linear dependence on ε is conservative, as mmTS updates its model posterior and eventually concentrates. We leave the derivation of a tighter bound for future work.

We can relate the error ε and its probability δ using the tails of the conditional reward distributions. In particular, let $\mu(a, x, s; \theta) - \bar{\mu}(a, x, s)$ be v²-sub-Gaussian in $\theta \sim P_1$ for any a, x, and s. Then for any $\delta \in [0, 1]$, $\varepsilon = O\left(v\sqrt{2 \log(K|X||S|/\delta)}\right)$ satisfies the conditions on ε and δ in Corollary 2.
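As a quick numerical illustration (ours, not from the paper) of how the Theorem 1 bound scales with the number of latent states, one can evaluate the bound for |S| and compare it with the same expression evaluated at K, which is roughly what a policy that explored each action as its own state would pay:

```python
import numpy as np

def theorem1_bound(num_states, n, sigma):
    """Evaluate the Theorem 1 upper bound 2*sigma*sqrt(6*|S|*n*log n) + 3*|S|."""
    return 2 * sigma * np.sqrt(6 * num_states * n * np.log(n)) + 3 * num_states

# Sizes from the synthetic experiments in Section 5: |S| = 5 latent states, K = 10 actions.
sigma, n = 0.5, 1000
print(theorem1_bound(5, n, sigma))    # latent-state-aware scaling, ~ sqrt(|S| n log n)
print(theorem1_bound(10, n, sigma))   # same expression with |S| replaced by K
```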
## 5 Experiments

In this section, we evaluate our algorithms on both synthetic and real-world datasets. We compare the following methods: (i) UCB: UCB1/LinUCB with no offline model [8, 1]; (ii) TS: TS/LinTS with no offline model [3, 4]; (iii) Exp4: Exp4 where the models are experts [7]; (iv) mUCB, mmUCB: our proposed UCB algorithms; (v) mTS, mmTS: our proposed TS algorithms. In contrast to our methods, the UCB and TS baselines do not use an offline-learned model. Exp4 uses the models as experts, where each expert pulls the best arm given the context and its latent state. Since we are interested in fast personalization, we experiment with short horizons of up to 1000 rounds.

[Figure 1. Left: Mean reward and standard error for low model noise (σ0 = 0.05). Middle/Right: Mean/worst-case reward and standard error for high model noise (σ0 = 0.2).]

### 5.1 Synthetic Experiments

We first experiment with synthetic non-contextual bandits where A = [10] and S = [5]. The mean rewards are sampled uniformly at random as $\mu(a, s) \sim \mathrm{Uniform}(0, 1)$ for each $a \in A$ and $s \in S$. Using rejection sampling, we constrain the suboptimality gap of all actions in each state s to be at least 0.1, to ensure statistically significant comparisons at short horizons. The rewards are drawn i.i.d. from $P(\cdot \mid a, s) = N(\mu(a, s), \sigma^2)$ with σ = 0.5. We evaluate each algorithm on 500 independent runs, with a uniformly sampled latent state in each run, and report the average reward over time. We analyze the effect of model misspecification by perturbing the mean rewards with various degrees of noise: for noise $\sigma_0 > 0$, the estimated means are sampled as $\hat{\mu}(a, s) \sim N(\mu(a, s), \sigma_0^2)$ for each action a and latent state s.

The left plot in Figure 1 shows the average reward over time for low model noise, $\sigma_0 = 0.05$. In this setting, our algorithms mUCB and mTS perform better than the baselines UCB1 and TS. In the middle plot, we increase the noise to $\sigma_0 = 0.2$. Neither mUCB nor mTS accounts for modeling errors, and thus their performance degrades. On the other hand, the uncertainty-aware mmTS outperforms mTS. However, mmUCB (not reported, to reduce clutter) performs the same as mUCB. This is likely because of the conservative nature of UCBs. Despite having similar worst-case regret guarantees, Exp4 is tailored to adversarial bandits rather than stochastic ones. Therefore, it is outperformed by mUCB and mTS with the same model.

The right plot in Figure 1 shows the worst 10% of runs from the middle plot, as measured by the average reward in the last round. This is a measure of worst-case performance. Both UCB1 and TS are unaffected by model misspecification, and outperform mUCB and mTS. However, mmTS still performs best, because it adapts the misspecified model to online data. This shows that uncertainty awareness makes model-based algorithms much more robust to model misspecification and learning error.
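A sketch of the synthetic instance generator described above (our reconstruction of the text; names are ours), which can be plugged into the earlier simulator and algorithm sketches:

```python
import numpy as np

def make_synthetic_instance(rng, num_actions=10, num_states=5, min_gap=0.1, model_noise=0.05):
    """Sketch of the Section 5.1 setup: true mean rewards plus a noisy offline model."""
    while True:
        mu = rng.uniform(0.0, 1.0, size=(num_states, num_actions))
        top2 = np.sort(mu, axis=1)[:, -2:]
        # Rejection sampling: in every state, the best action must beat all others
        # by at least min_gap (equivalently, best minus second best >= min_gap).
        if np.all(top2[:, 1] - top2[:, 0] >= min_gap):
            break
    mu_hat = mu + model_noise * rng.normal(size=mu.shape)   # misspecified offline model
    return mu, mu_hat

# Example: one of the 500 runs described in the text, with high model noise.
rng = np.random.default_rng(1)
mu, mu_hat = make_synthetic_instance(rng, model_noise=0.2)
```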
### 5.2 MovieLens Experiments

We also assess the performance of our algorithms on the MovieLens 1M dataset [17], a large-scale collaborative filtering dataset in which 6040 users rate 3883 movies. Each movie has a set of genres. We filter the dataset to include only users who rated at least 200 movies and movies rated by at least 200 users, which results in 1353 users and 1124 movies. We randomly select 50% of all ratings as our training set and use the remaining 50% as the test set, which results in sparse rating matrices $M_{\mathrm{train}}$ and $M_{\mathrm{test}}$. We complete each matrix using least-squares matrix completion [29] with rank 20. This rank is high enough to yield a low prediction error, yet small enough to avoid overfitting. The learned latent factors are $M_{\mathrm{train}} = \hat{U}\hat{V}^\top$ and $M_{\mathrm{test}} = U V^\top$. User i and movie j correspond to rows $U_i$ and $V_j$, respectively, in the matrix factors.

We define a latent contextual bandit instance with A = [20] and S = [5] as follows. Using k-means clustering on the rows of U, we cluster users into 5 clusters, where 5 is the largest value that does not yield empty clusters. First, a user i is sampled uniformly at random. In each round, 20 genres, and then a movie for each genre, are sampled uniformly, creating a set of diverse movies. The context $X_t \in \mathbb{R}^{20 \times 20}$ is a matrix where each row is the training latent factor of one sampled movie, that is, $\hat{V}_j$ for movie j. The agent selects one movie from $X_t$ and observes its reward. The reward distribution for user i and movie j is $N(U_i^\top V_j, 0.5)$, and its mean is the product of the corresponding user and item factors on the test set. We evaluate on 500 users.

Our offline model is a Gaussian mixture that is learned in the same way as the true model, except on the training set. It has 5 clusters of users derived from k-means clustering on the rows of $\hat{U}$. For each latent state, the prior is a Gaussian with the corresponding cluster mean and covariance. The context in LinUCB and LinTS is $X_t$, the same as in our algorithms, and they only learn the user latent factor. This is more information than in low-rank bandit algorithms [24], which learn both the user and movie representations, and thus perform poorly in the short horizons that we consider.

[Figure 2. Mean/worst-case rating and standard error on the MovieLens 1M dataset.]

The left plot in Figure 2 shows the mean ratings and standard errors of 6 algorithms (as earlier, mmUCB performs similarly to mUCB and is not shown). mUCB and mTS clearly adapt and personalize to users faster than LinUCB and LinTS, and converge to better policies than Exp4. Both mUCB and mTS are affected by model misspecification. In comparison, mmTS handles model uncertainty and converges to the best policies. The right plot in Figure 2 shows the average results for the worst 10% of users. Again, mmTS dramatically outperforms mTS in the worst case.
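A rough sketch of how such a bandit instance can be assembled from the learned factors (our reconstruction of the description above; `genre_to_movies` and all other names are ours, and the details of the paper's pipeline may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def offline_latent_model(U_hat, num_states=5, seed=0):
    """Cluster training user factors into latent states, as described in the text."""
    km = KMeans(n_clusters=num_states, random_state=seed).fit(U_hat)
    return km.cluster_centers_, km.labels_

def movielens_round(rng, U, V, V_hat, user, genre_to_movies, num_genres=20):
    """One round of the Section 5.2 bandit: sample 20 genres, one movie per genre.

    U, V are test-set factors (they define the true reward); V_hat holds training-set
    movie factors (they define the context). genre_to_movies maps each genre index to
    the indices of movies with that genre; it must contain at least num_genres genres.
    """
    genres = rng.choice(len(genre_to_movies), size=num_genres, replace=False)
    movies = np.array([rng.choice(genre_to_movies[g]) for g in genres])
    context = V_hat[movies]                      # 20 x 20 matrix, one training factor per row

    def reward(arm):
        # Pulling an arm selects one of the 20 movies; noise scale 0.5 as in the text.
        return U[user] @ V[movies[arm]] + 0.5 * rng.normal()

    return context, reward
```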
## 6 Related Work

Latent bandits. Latent contextual bandits can personalize faster than standard contextual bandit policies, such as LinUCB [1] or linear Thompson sampling [4, 2]. The closest to our work is that of Maillard and Mannor [25], who proposed and analyzed non-contextual UCB algorithms, with side information that uniquely identifies users, under the assumption that the mean rewards for each latent state are known. They then relaxed the assumption of known means, but assumed the other extreme case, where the means are learned completely online. Zhou and Brunskill [35] extended this formulation to contextual bandits. However, they used offline-learned policies that were deployed online as a mixture, using Exp4. Bayesian policy reuse (BPR) [27] selects offline-learned policies by maintaining a belief over the optimality of each policy through posterior inference, but no analysis exists. Krause and Guestrin [21] used inference in latent graphical models to gather information, but only to identify the state. This is akin to best-arm identification [6], which is a different objective from cumulative regret minimization. We subsume prior work by providing contextual, uncertainty-aware UCB and TS algorithms, and their unified analyses.

Low-rank bandits. Low-rank bandits can be viewed as a generalization of latent bandits, where low-rank matrices parameterize the reward and are learned online. Kawale et al. [20] proposed a TS algorithm for low-rank matrix factorization; however, their algorithm is inefficient and is analyzed only for rank-1 matrices. Sen et al. [31] analyzed an ε-greedy algorithm, but relied on properties that rarely hold in practice. Another body of work studied online clustering of bandits, which is based on a specific low-rank structure [24, 13, 14, 26]. Yet another studied low-rank matrices where both rows and columns are arms [19, 18]. None of these prior works used offline-learned models, an important practical consideration given the general availability of offline data, and they learned at most linear models. In Section 5, we compare to idealized versions of these methods where low-rank features are provided.

Structured bandits. In structured bandits, the arms are related through a common latent parameter. Lattimore and Munos [22] proposed a UCB algorithm for a multi-armed bandit variant of this problem. Recently, Gupta et al. [16] proposed a unified framework that adapts classic bandit algorithms, such as UCB1 and TS, to structured multi-armed bandits. Though similar to our work, the algorithms differ in key aspects: we put confidence intervals on latent states instead of arms, and we develop contextual algorithms with a prior model. Yu et al. [34] proposed variational Thompson sampling for learning to act in graphical models with latent variables. Since this framework is very general, with both model parameters and latent variables learned, the regret guarantees are weak and Thompson sampling must be approximated. This is in stark contrast to our work, where we obtain strong regret guarantees and can implement the algorithms as analyzed.

## 7 Conclusions

We study latent bandits, a class of bandit problems where the reward model is parameterized by a latent state and is at least partially known. We propose UCB and Thompson sampling algorithms for solving this problem, which identify the latent state conditioned on offline-learned reward models. The algorithms are contextual and robust to misspecification. We bound their regret using a unified analysis, and validate them empirically on both a synthetic problem and the MovieLens 1M dataset.

Because of its generality and practicality, our work can be naturally extended to more complicated graphical models of the environment. For example, in this work, we assume that the latent state is fixed over time. However, a transition model could be incorporated into our setting to model an evolving latent state. This would be useful when user preferences or intents change over time.
In addition, we could consider a more expressive latent state, such as a mixture of topics, which would be useful if user types could not be classified into a finite discrete set. We leave the detailed study of these extensions to future work.

## Broader Impact

Our work develops improved algorithms for bandit-style exploration in a very general and abstract sense. We have demonstrated its ability to increase the rate at which interactive systems identify a user's latent state, to improve the long-term impact on user reward (e.g., engagement in a recommender system). Our work is agnostic to the form of the reward. We are strongly motivated by improving users' positive engagement with interactive systems (e.g., by identifying user interests or preferences in a recommender system). However, other forms of reward that are unaligned with a user's best interests could be used; our methods do not propose specific reward models. That said, our work has no social implications (welfare, fairness, privacy, etc.) beyond those already at play in the interactive system to which our methods might be applied.

## Acknowledgments and Disclosure of Funding

There are no additional sources of funding related to this work.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.

[2] Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. Electronic Journal of Statistics, 2016.

[3] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. CoRR, abs/1111.1797, 2011.

[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. ICML, 2013.

[5] Anima Anandkumar, Rong Ge, Daniel J. Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 2014.

[6] Jean-Yves Audibert, Sebastien Bubeck, and Remi Munos. Best arm identification in multi-armed bandits. In COLT, 2010.

[7] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.

[8] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.

[9] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 2013.

[10] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems, pages 2249-2257, 2012.

[11] Merlise Clyde and Edward I. George. Model uncertainty. Statistical Science, 2004.

[12] Arnaud Doucet, Neil Gordon, and Nando de Freitas. Sequential Monte Carlo Methods in Practice. Springer New York, 2013.

[13] Claudio Gentile, Shuai Li, and Giovanni Zappella. Online clustering of bandits. ICML, 2014.

[14] Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. ICML, 2017.

[15] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, 41:148-177, 1979.

[16] Samarth Gupta, Shreyas Chaudhari, Subhojyoti Mukherjee, Gauri Joshi, and Osman Yağan. A unified approach to translate classical bandit algorithms to the structured bandit setting. CoRR, abs/1810.08164, 2018.
[17] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 2015.

[18] Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Bernoulli rank-1 bandits for click feedback. IJCAI, 2017.

[19] Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. AISTATS, 2017.

[20] Jaya Kawale, Hung H. Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. NeurIPS, 2015.

[21] Andreas Krause and Carlos Guestrin. Optimal value of information in graphical models. JAIR, 2009.

[22] Tor Lattimore and Remi Munos. Bounded regret for finite-armed structured bandits. NeurIPS, 2014.

[23] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.

[24] Shuai Li, Alexandros Karatzoglou, and Claudio Gentile. Collaborative filtering bandits. SIGIR, 2016.

[25] Odalric-Ambrym Maillard and Shie Mannor. Latent bandits. ICML, 2014.

[26] Trong T. Nguyen and Hady W. Lauw. Dynamic clustering of contextual multi-armed bandits. CIKM, 2014.

[27] Benjamin Rosman, Majd Hawasly, and Subramanian Ramamoorthy. Bayesian policy reuse. Machine Learning, 2016.

[28] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.

[29] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. NeurIPS, 2008.

[30] Simo Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.

[31] Rajat Sen, Karthikeyan Shanmugam, Murat Kocaoglu, Alex Dimakis, and Sanjay Shakkottai. Contextual bandits with latent confounders: An NMF approach. AISTATS, 2017.

[32] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.

[33] Chao-Yuan Wu, Amr Ahmed, Alex Beutel, Alexander J. Smola, and How Jing. Recurrent recommender networks. WSDM, 2017.

[34] Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, and Ole J. Mengshoel. Graphical models meet bandits: A variational Thompson sampling approach. In ICML, 2020.

[35] Li Zhou and Emma Brunskill. Latent contextual bandits and their application to personalized recommendations for new users. IJCAI, 2016.