# Multi-Task Learning for Contextual Bandits

Aniket Anand Deshmukh (Department of EECS, University of Michigan Ann Arbor, Ann Arbor, MI 48105; aniketde@umich.edu), Urun Dogan (Microsoft Research, Cambridge CB1 2FB, UK; urun.dogan@skype.net), Clayton Scott (Department of EECS, University of Michigan Ann Arbor, Ann Arbor, MI 48105; clayscot@umich.edu)

## Abstract

Contextual bandits are a form of multi-armed bandit in which the agent has access to predictive side information (known as the context) for each arm at each time step, and they have been used to model personalized news recommendation, ad placement, and other applications. In this work, we propose a multi-task learning framework for contextual bandit problems. As in multi-task learning in the batch setting, the goal is to leverage similarities in contexts for different arms so as to improve the agent's ability to predict rewards from contexts. We propose an upper confidence bound-based multi-task learning algorithm for contextual bandits, establish a corresponding regret bound, and interpret this bound to quantify the advantages of learning in the presence of high task (arm) similarity. We also describe an effective scheme for estimating task similarity from data, and demonstrate our algorithm's performance on several data sets.

## 1 Introduction

A multi-armed bandit (MAB) problem is a sequential decision making problem where, at each time step, an agent chooses one of several "arms" and observes some reward for the choice it made. The reward for each arm is random according to a fixed distribution, and the agent's goal is to maximize its cumulative reward [4] through a combination of exploring different arms and exploiting those arms that have yielded high rewards in the past [15, 11].

The contextual bandit problem is an extension of the MAB problem where there is some side information, called the context, associated with each arm [12]. Each context determines the distribution of rewards for the associated arm. The goal in contextual bandits is still to maximize the cumulative reward, but now leveraging the contexts to predict the expected reward of each arm. Contextual bandits have been employed to model various applications such as news article recommendation [7], computational advertising [9], website optimization [20] and clinical trials [19]. For example, in the case of news article recommendation, the agent must select a news article to recommend to a particular user. The arms are articles, and the contextual features are derived from the article and the user. The reward is based on whether the user reads the recommended article.

One common approach to contextual bandits is to fix a class of policy functions (i.e., functions from contexts to arms) and try to learn the best function in that class over time [13, 18, 16]. Most algorithms estimate rewards either separately for each arm, or with one single estimator applied to all arms. In contrast, our approach is to adopt the perspective of multi-task learning (MTL). The intuition is that some arms may be similar to each other, in which case it should be possible to pool the historical data for these arms to estimate the mapping from context to rewards more rapidly. For example, in the case of news article recommendation, there may be thousands of articles, and some of them are bound to be similar to each other.
**Problem 1 (Contextual Bandits).** For $t = 1, \ldots, T$:

1. Observe context $x_{a,t} \in \mathbb{R}^d$ for all arms $a \in [N]$, where $[N] = \{1, \ldots, N\}$.
2. Choose an arm $a_t \in [N]$.
3. Receive a reward $r_{a_t,t} \in \mathbb{R}$.
4. Improve the arm selection strategy based on the new observation $(x_{a_t,t}, a_t, r_{a_t,t})$.

The contextual bandit problem is formally stated in Problem 1. The total $T$-trial reward is defined as $\sum_{t=1}^T r_{a_t,t}$ and the optimal $T$-trial reward as $\sum_{t=1}^T r_{a_t^*,t}$, where $r_{a_t,t}$ is the reward of the selected arm $a_t$ at time $t$ and $a_t^*$ is the arm with maximum reward at trial $t$. The goal is to find an algorithm that minimizes the $T$-trial regret

$$R(T) = \sum_{t=1}^T r_{a_t^*,t} - \sum_{t=1}^T r_{a_t,t}.$$

We focus on upper confidence bound (UCB) type algorithms for the remainder of the paper. A UCB strategy is a simple way to handle the exploration/exploitation tradeoff. For each arm, there is an upper bound on the reward, comprising two terms. The first term is a point estimate of the reward, and the second term reflects the confidence in the reward estimate. The strategy is to select the arm with the maximum UCB. The second term dominates when the agent is not confident about its reward estimates, which promotes exploration. On the other hand, when all the confidence terms are small, the algorithm exploits the best arm(s) [2].

In the popular UCB-type contextual bandit algorithm Lin-UCB, the expected reward of an arm is modeled as a linear function of the context, $\mathbb{E}[r_{a,t} \mid x_{a,t}] = x_{a,t}^T \theta_a^*$, where $r_{a,t}$ is the reward of arm $a$ at time $t$ and $x_{a,t}$ is the context of arm $a$ at time $t$. To select the best arm, one estimates $\theta_a$ for each arm independently using the data for that particular arm [13]. In the language of multi-task learning, each arm is a task, and Lin-UCB learns each task independently. In the theoretical analysis of Lin-UCB [7] and its kernelized version Kernel-UCB [18], $\theta_a$ is replaced by $\theta$, and the goal is to learn one single estimator using data from all the arms. In other words, the data from the different arms are pooled together and viewed as coming from a single task. These two approaches, independent and pooled learning, are two extremes, and reality often lies somewhere in between. In the MTL approach, we seek to pool some tasks together, while learning others independently. We present an algorithm motivated by this idea and call it kernelized multi-task learning UCB (KMTL-UCB).

Our main contributions are: proposing a UCB-type multi-task learning algorithm for contextual bandits; establishing a regret bound and interpreting the bound to reveal the impact of increased task similarity; introducing a technique for estimating task similarities on the fly; and demonstrating the effectiveness of our algorithm on several datasets.

This paper is organized as follows. Section 2 describes related work, and in Section 3 we propose a UCB algorithm using multi-task learning. Regret analysis is presented in Section 4, and our experimental findings are reported in Section 5. We conclude in Section 6.

## 2 Related Work

A UCB strategy is a common approach to quantify the exploration/exploitation tradeoff. At each time step $t$, and for each arm $a$, a UCB strategy estimates a reward $\hat{r}_{a,t}$ and a one-sided confidence interval above $\hat{r}_{a,t}$ with width $\hat{w}_{a,t}$. The term $\mathrm{ucb}_{a,t} = \hat{r}_{a,t} + \hat{w}_{a,t}$ is called the UCB index or simply the UCB. Then at each time step $t$, the algorithm chooses the arm $a$ with the highest UCB. In contextual bandits, the idea is to view learning the mapping $x \mapsto r$ as a regression problem.
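Independent of which regression model is plugged in, the interaction pattern of Problem 1 driven by a UCB index can be summarized in a short sketch. The `env`, `estimate`, and `width` callables below are hypothetical placeholders standing in for the environment, the reward estimator, and the confidence width; they are not part of the paper.

```python
import numpy as np

def ucb_bandit_loop(env, estimate, width, T, N):
    """Generic UCB-driven contextual bandit loop (a sketch of Problem 1)."""
    history = []                          # past (context, arm, reward) triples
    cumulative_reward = 0.0
    for t in range(T):
        contexts = env.contexts(t)        # contexts[a] is x_{a,t}
        # ucb_{a,t} = \hat r_{a,t} + \hat w_{a,t}: point estimate plus confidence width
        ucb = [estimate(history, contexts[a], a) + width(history, contexts[a], a)
               for a in range(N)]
        a_t = int(np.argmax(ucb))         # pick the arm with the largest UCB index
        r = env.reward(t, a_t)            # only the chosen arm's reward is revealed
        history.append((contexts[a_t], a_t, r))
        cumulative_reward += r
    return cumulative_reward, history
```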
Lin-UCB uses a linear regression model while Kernel-UCB uses a nonlinear regression model drawn from the reproducing kernel Hilbert space (RKHS) of a symmetric and positive definite (SPD) kernel. Either of these two regression models could be applied in either the independent setting or the pooled setting. In the independent setting, the regression function for each arm is estimated separately. This was the approach adopted by Li et al. [13] with a linear model. Regret analyses for both Lin-UCB and Kernel-UCB adopted the pooled setting [7, 18]. Kernel-UCB in the independent setting has not previously been considered to our knowledge, although the algorithm would just be a kernelized version of Li et al. [13]. We will propose a methodology that extends the above four combinations of setting (independent and pooled) and regression model (linear and nonlinear). Gaussian Process UCB (GP-UCB) uses a Gaussian prior on the regression function and is a Bayesian equivalent of Kernel-UCB [16].

There are some contextual bandit setups that incorporate multi-task learning. In Lin-UCB with hybrid linear models, the estimated reward consists of two linear terms, one that is arm-specific and another that is common to all arms [13]. Gang of bandits [5] uses a graph structure (e.g., a social network) to transfer the learning from one user to another for personalized recommendation. Collaborative filtering bandits [14] is a similar technique which clusters the users based on context. Contextual Gaussian Process UCB (CGP-UCB) builds on GP-UCB and has many elements in common with our framework [10]. We defer a more detailed comparison to CGP-UCB until later.

## 3 KMTL-UCB

We propose an alternate regression model that includes the independent and pooled settings as special cases. Our approach is inspired by work on transfer and multi-task learning in the batch setting [3, 8]. Intuitively, if two arms (tasks) are similar, we can pool the data for those arms to train better predictors for both.

Formally, we consider regression functions of the form $f : \tilde{X} \to Y$, where $\tilde{X} = Z \times X$, $Z$ is what we call the task similarity space, $X$ is the context space, and $Y \subseteq \mathbb{R}$ is the reward space. Every context $x_a \in X$ is associated with an arm descriptor $z_a \in Z$, and we define $\tilde{x}_a = (z_a, x_a)$ to be the augmented context. Intuitively, $z_a$ is a variable that can be used to determine the similarity between different arms. Examples of $Z$ and $z_a$ will be given below. Let $\tilde{k}$ be an SPD kernel on $\tilde{X}$. In this work we focus on kernels of the form

$$\tilde{k}\big((z, x), (z', x')\big) = k_Z(z, z')\, k_X(x, x'), \tag{1}$$

where $k_X$ is an SPD kernel on $X$, such as a linear or Gaussian kernel when $X = \mathbb{R}^d$, and $k_Z$ is a kernel on $Z$ (examples are given below). Let $\tilde{\mathcal{H}}_{\tilde{k}}$ be the RKHS of functions $f : \tilde{X} \to \mathbb{R}$ associated to $\tilde{k}$. Note that a product kernel is just one option for $\tilde{k}$, and other forms may be worth exploring.

### 3.1 Upper Confidence Bound

Instead of learning regression estimates for each arm separately, we effectively learn regression estimates for all arms at once by using all the available training data. Let $N$ be the total number of distinct arms that the algorithm has to choose from. Define $[N] = \{1, \ldots, N\}$ and let the observed contexts at time $t$ be $x_{a,t}$, $a \in [N]$. Let $n_{a,t}$ be the number of times the algorithm has selected arm $a$ up to and including time $t$, so that $\sum_{a=1}^N n_{a,t} = t$. Define the sets $t_a = \{\tau < t : a_\tau = a\}$, where $a_\tau$ is the arm selected at time $\tau$. Notice that $|t_a| = n_{a,t-1}$ for all $a$.
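With this notation in place, the Gram matrix on past augmented contexts factors according to Eq. (1) into a task-similarity part and a context part. The following sketch assembles such a matrix; the Gaussian form of `k_X` and the helper names are illustrative assumptions, with the task-similarity matrix `KZ` supplied by whichever choice of $Z$ and $k_Z$ is in use.

```python
import numpy as np

def gaussian_kX(X1, X2, sigma=1.0):
    """Gaussian context kernel k_X between the rows of X1 and X2 (an assumed choice)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def augmented_gram(arms1, X1, arms2, X2, KZ, sigma=1.0):
    """Product kernel of Eq. (1): k~((z,x),(z',x')) = k_Z(z,z') * k_X(x,x')."""
    task_part = KZ[np.ix_(arms1, arms2)]          # k_Z evaluated on the arm descriptors
    context_part = gaussian_kX(X1, X2, sigma)     # k_X evaluated on the raw contexts
    return task_part * context_part               # entrywise (Hadamard) product
```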
We solve the following problem at time $t$:

$$\hat{f}_t = \arg\min_{f \in \tilde{\mathcal{H}}_{\tilde{k}}} \sum_{a=1}^N \frac{1}{n_{a,t-1}} \sum_{\tau \in t_a} \big(f(\tilde{x}_{a,\tau}) - r_{a,\tau}\big)^2 + \lambda \|f\|^2_{\tilde{\mathcal{H}}_{\tilde{k}}}, \tag{2}$$

where $\tilde{x}_{a,\tau}$ is the augmented context of arm $a$ at time $\tau$, and $r_{a,\tau}$ is the reward of arm $a$ selected at time $\tau$. Problem (2) is a variant of kernel ridge regression. Applying the representer theorem [17], the optimal $f$ can be expressed as $f = \sum_{a'=1}^N \sum_{\tau \in t_{a'}} \alpha_{a',\tau}\, \tilde{k}(\cdot, \tilde{x}_{a',\tau})$, which yields the solution (a detailed derivation is in the supplementary material)

$$\hat{f}_t(\tilde{x}) = \tilde{k}_{t-1}(\tilde{x})^T (\eta_{t-1} \tilde{K}_{t-1} + \lambda I)^{-1} \eta_{t-1} y_{t-1}, \tag{3}$$

where $\tilde{K}_{t-1}$ is the $(t-1) \times (t-1)$ kernel matrix on the augmented data $[\tilde{x}_{a_\tau,\tau}]_{\tau=1}^{t-1}$, $\tilde{k}_{t-1}(\tilde{x}) = [\tilde{k}(\tilde{x}, \tilde{x}_{a_\tau,\tau})]_{\tau=1}^{t-1}$ is the vector of kernel evaluations between $\tilde{x}$ and the past data, $y_{t-1} = [r_{a_\tau,\tau}]_{\tau=1}^{t-1}$ collects all observed rewards, and $\eta_{t-1}$ is the $(t-1) \times (t-1)$ diagonal matrix $\eta_{t-1} = \mathrm{diag}\big[\tfrac{1}{n_{a_\tau,t-1}}\big]_{\tau=1}^{t-1}$. When $\tilde{x} = \tilde{x}_{a,t}$, we write $\tilde{k}_{a,t} = \tilde{k}_{t-1}(\tilde{x}_{a,t})$. With only minor modifications to the argument in Valko et al. [18], we have the following:

**Lemma 1.** Suppose the rewards $[r_{a_\tau,\tau}]_{\tau=1}^T$ are independent random variables with means $\mathbb{E}[r_{a_\tau,\tau} \mid \tilde{x}_{a_\tau,\tau}] = f^*(\tilde{x}_{a_\tau,\tau})$, where $f^* \in \tilde{\mathcal{H}}_{\tilde{k}}$ and $\|f^*\|_{\tilde{\mathcal{H}}_{\tilde{k}}} \le c$. Let $\alpha = \sqrt{\tfrac{\log(2TN/\delta)}{2}}$ with $\delta > 0$. With probability at least $1 - \tfrac{\delta}{T}$, for all $a \in [N]$,

$$\big|\hat{f}_t(\tilde{x}_{a,t}) - f^*(\tilde{x}_{a,t})\big| \le w_{a,t} := (\alpha + c\sqrt{\lambda})\, s_{a,t},$$

where $s_{a,t} = \lambda^{-1/2} \sqrt{\tilde{k}(\tilde{x}_{a,t}, \tilde{x}_{a,t}) - \tilde{k}_{a,t}^T (\eta_{t-1} \tilde{K}_{t-1} + \lambda I)^{-1} \eta_{t-1} \tilde{k}_{a,t}}$.

The result in Lemma 1 motivates the UCB $\mathrm{ucb}_{a,t} = \hat{f}_t(\tilde{x}_{a,t}) + w_{a,t}$ and inspires Algorithm 1.

**Algorithm 1: KMTL-UCB.** Input: $\beta \in \mathbb{R}_+$.

- for $t = 1, \ldots, T$ do
  - Update the (product) kernel matrix $\tilde{K}_{t-1}$ and $\eta_{t-1}$.
  - Observe the context features at time $t$: $x_{a,t}$ for each $a \in [N]$.
  - Determine the arm descriptor $z_a$ for each $a \in [N]$ to get the augmented context $\tilde{x}_{a,t}$.
  - for all $a$ at time $t$ do
    - $p_{a,t} \leftarrow \hat{f}_t(\tilde{x}_{a,t}) + \beta s_{a,t}$
  - end for
  - Choose arm $a_t = \arg\max_a p_{a,t}$, observe a real-valued payoff $r_{a_t,t}$, and update $y_t$.
  - Output: $a_t$
- end for

Before an arm has been selected at least once, $\hat{f}_t(\tilde{x}_{a,t})$ and the second term under the square root in $s_{a,t}$, i.e., $\tilde{k}_{a,t}^T (\eta_{t-1} \tilde{K}_{t-1} + \lambda I)^{-1} \eta_{t-1} \tilde{k}_{a,t}$, are taken to be 0. In that case, the algorithm only uses the first term of $s_{a,t}$, i.e., $\sqrt{\tilde{k}(\tilde{x}_{a,t}, \tilde{x}_{a,t})}$, to form the UCB.

### 3.2 Choice of Task Similarity Space and Kernel

To illustrate the flexibility of our framework, we present the following three options for $Z$ and $k_Z$:

1. Independent: $Z = \{1, \ldots, N\}$, $k_Z(a, a') = \mathbf{1}_{a = a'}$. The augmented context for a context $x_a$ from arm $a$ is just $(a, x_a)$.
2. Pooled: $Z = \{1\}$, $k_Z \equiv 1$. The augmented context for a context $x_a$ from arm $a$ is just $(1, x_a)$.
3. Multi-Task: $Z = \{1, \ldots, N\}$ and $k_Z$ is a PSD matrix reflecting arm/task similarities. If this matrix is unknown, it can be estimated as discussed below.

Algorithm 1 with the first two choices specializes to the independent and pooled settings mentioned previously. In either setting, choosing a linear kernel for $k_X$ leads to Lin-UCB, while a more general kernel essentially gives rise to Kernel-UCB. We will argue that the multi-task setting facilitates learning when there is high task similarity.

We also introduce a fourth option for $Z$ and $k_Z$ that allows the task similarity to be estimated when it is unknown. In particular, we are inspired by the kernel transfer learning framework of Blanchard et al. [3]. Thus, we define the arm similarity space to be $Z = \mathcal{P}_X$, the set of all probability distributions on $X$. We further assume that contexts for arm $a$ are drawn from a probability measure $P_a$. Given a context $x_a$ for arm $a$, we define its augmented context to be $(P_a, x_a)$. To define a kernel on $Z = \mathcal{P}_X$, we use the same construction described in [3], originally introduced by Steinwart and Christmann [6].
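Returning to Algorithm 1, the following sketch shows how the reward estimate of Eq. (3) and the width $s_{a,t}$ of Lemma 1 might be computed for a single candidate arm. The inputs (the past Gram matrix, the kernel evaluations against the candidate augmented context, the diagonal weights $1/n_{a_\tau,t-1}$, the past rewards, and the constants $\lambda$ and $\beta$) are assumed to come from the bookkeeping of the main loop; this is a sketch, not the authors' implementation.

```python
import numpy as np

def kmtl_ucb_index(K_hist, k_vec, k_self, eta, y, lam, beta):
    """UCB index p_{a,t} = f_hat_t(x~_{a,t}) + beta * s_{a,t} for one candidate arm.

    K_hist : (t-1, t-1) Gram matrix on past augmented contexts
    k_vec  : (t-1,) kernel evaluations between the candidate and past contexts
    k_self : scalar k~(x~_{a,t}, x~_{a,t})
    eta    : (t-1,) diagonal weights 1 / n_{a_tau, t-1}
    y      : (t-1,) observed rewards
    """
    t1 = K_hist.shape[0]
    if t1 == 0:                                   # nothing observed yet: first term only
        return beta * np.sqrt(k_self / lam)
    A = np.diag(eta) @ K_hist + lam * np.eye(t1)  # eta * K + lambda * I
    r_hat = k_vec @ np.linalg.solve(A, eta * y)   # reward estimate, Eq. (3)
    quad = k_vec @ np.linalg.solve(A, eta * k_vec)
    s = np.sqrt(max(k_self - quad, 0.0) / lam)    # width term from Lemma 1
    return r_hat + beta * s
```

In Algorithm 1, this quantity corresponds to $p_{a,t}$, evaluated for every arm at each round before selecting the maximizer.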
For the kernel on $Z = \mathcal{P}_X$, in our experiments we use the Gaussian-like kernel

$$k_Z(P_a, P_{a'}) = \exp\!\big(-\|\Psi(P_a) - \Psi(P_{a'})\|^2 / 2\sigma_Z^2\big), \tag{5}$$

where $\Psi(P) = \int k_X'(\cdot, x)\, dP(x)$ is the kernel mean embedding of a distribution $P$. This embedding is defined by yet another SPD kernel $k_X'$ on $X$, which could be different from the $k_X$ used to define $\tilde{k}$. We may estimate $\Psi(P_a)$ via $\Psi(\hat{P}_a) = \frac{1}{n_{a,t-1}} \sum_{\tau \in t_a} k_X'(\cdot, x_{a_\tau,\tau})$, which leads to an estimate of $k_Z$.

## 4 Theoretical Analysis

To simplify the analysis, we consider a modified version of the original problem (2):

$$\hat{f}_t = \arg\min_{f \in \tilde{\mathcal{H}}_{\tilde{k}}} \sum_{a=1}^N \sum_{\tau \in t_a} \big(f(\tilde{x}_{a,\tau}) - r_{a,\tau}\big)^2 + \lambda \|f\|^2_{\tilde{\mathcal{H}}_{\tilde{k}}}. \tag{6}$$

In particular, this modified problem omits the terms $\frac{1}{n_{a,t-1}}$, as they obscure the analysis. In practice, these terms should be incorporated. In this case

$$s_{a,t} = \lambda^{-1/2} \sqrt{\tilde{k}(\tilde{x}_{a,t}, \tilde{x}_{a,t}) - \tilde{k}_{a,t}^T (\tilde{K}_{t-1} + \lambda I)^{-1} \tilde{k}_{a,t}}.$$

Under this assumption, Kernel-UCB is exactly KMTL-UCB with $k_Z \equiv 1$. On the other hand, KMTL-UCB can be viewed as a special case of Kernel-UCB on the augmented context space $\tilde{X}$. Thus, the regret analysis of Kernel-UCB applies to KMTL-UCB, but it does not reveal the potential gains of multi-task learning. We present an interpretable regret bound that reveals the benefits of MTL. We also establish a lower bound on the UCB width that decreases as task similarity increases (presented in the supplementary file).

### 4.1 Analysis of SupKMTL-UCB

It is not trivial to analyze Algorithm 1 because the reward at time $t$ depends on the past rewards. We follow the strategy originally proposed in [1] and used in [7, 18], which uses SupKMTL-UCB as a master algorithm and BaseKMTL-UCB (which is called by SupKMTL-UCB) to get estimates of the reward and width. SupKMTL-UCB builds mutually exclusive subsets of $[T]$ such that rewards in any subset are independent. This guarantees that the independence assumption of Lemma 1 is satisfied. We describe these algorithms in a supplementary section because of space constraints.

**Theorem 1.** Assume that $r_{a,t} \in [0, 1]$ for all $a \in [N]$, $T \ge 1$, $\|f^*\|_{\tilde{\mathcal{H}}_{\tilde{k}}} \le c$, $\tilde{k}(\tilde{x}, \tilde{x}) \le c_{\tilde{k}}$ for all $\tilde{x} \in \tilde{X}$, and that the task similarity matrix $K_Z$ is known. With probability at least $1 - \delta$, SupKMTL-UCB satisfies

$$R(T) \le 2\sqrt{T} + 10\left(\sqrt{\frac{\log\big(2TN(\log(T)+1)/\delta\big)}{2}} + c\sqrt{\lambda}\right)\sqrt{2m}\,\sqrt{T \log(g([T]))},$$

where $g([T]) = \frac{\det(\tilde{K}_{T+1} + \lambda I)}{\lambda^{T+1}}$ and $m = \max\big(1, \tfrac{c_{\tilde{k}}}{\lambda}\big)$.

Note that this theorem assumes that the task similarity is known. In the experiments on real datasets, we estimate the task similarity from the available data using the approach discussed in Section 3.2.

### 4.2 Interpretation of Regret Bound

The following theorems help us interpret the regret bound by examining $g([T]) = \frac{\det(\tilde{K}_{T+1} + \lambda I)}{\lambda^{T+1}}$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{T+1}$ are the eigenvalues of the kernel matrix $\tilde{K}_{T+1}$. As mentioned above, the regret bound of Kernel-UCB applies to our method, and we are able to recover this bound as a corollary of Theorem 1. In the case of Kernel-UCB, $\tilde{K}_t = K_{X_t}$ for all $t \in [T]$, as all arm estimators are assumed to be the same. We define the effective rank of $\tilde{K}_{T+1}$ in the same way as [18] defines the effective dimension of the kernel feature space.

**Definition 1.** The effective rank of $\tilde{K}_{T+1}$ is defined to be $r := \min\{j : j\lambda \log T \ge \sum_{i=j+1}^{T+1} \lambda_i\}$.

In the following result, the notation $\tilde{O}$ hides logarithmic terms.

**Corollary 1.** $\log(g([T])) \le r \log\left(\frac{2T\big(2(T+1)c_{\tilde{k}} + r\lambda\big)}{r\lambda\log T}\right)$, and therefore $R(T) = \tilde{O}(\sqrt{rT})$.

However, beyond recovering a known bound, Theorem 1 can also be interpreted to reveal the potential gains of multi-task learning. To interpret the regret bound in Theorem 1, we make the further assumption that after time $t$, $n_{a,t} = \frac{t}{N}$ for all $a \in [N]$. For simplicity, define $n_t = n_{a,t}$.
Let $\circ$ denote the Hadamard product, $\otimes$ denote the Kronecker product, and $\mathbf{1}_n \in \mathbb{R}^n$ be the vector of ones. Let $K_{X_t} = [k_X(x_{a_\tau,\tau}, x_{a_{\tau'},\tau'})]_{\tau,\tau'=1}^t$ be the $t \times t$ kernel matrix on contexts, $K_{Z_t} = [k_Z(z_{a_\tau}, z_{a_{\tau'}})]_{\tau,\tau'=1}^t$ be the associated $t \times t$ kernel matrix based on arm similarity, and $K_Z = [k_Z(z_a, z_{a'})]_{a,a'=1}^N$ be the $N \times N$ arm/task similarity matrix between the $N$ arms, where $x_{a_\tau,\tau}$ is the observed context and $z_{a_\tau}$ is the associated arm descriptor. Using Eq. (1), we can write $\tilde{K}_t = K_{Z_t} \circ K_{X_t}$. We rearrange the sequence of contexts $x_{a_\tau,\tau}$, grouping them by arm, so that elements $(a-1)n_t + 1$ through $a n_t$ belong to arm $a$. Define $\tilde{K}^r_t$, $K^r_{X_t}$ and $K^r_{Z_t}$ to be the rearranged kernel matrices based on the re-ordered contexts. Notice that we can write $\tilde{K}^r_t = (K_Z \otimes \mathbf{1}_{n_t}\mathbf{1}_{n_t}^T) \circ K^r_{X_t}$, and the eigenvalues $\lambda(\tilde{K}_t)$ and $\lambda(\tilde{K}^r_t)$ are equal. To summarize, we have

$$\tilde{K}_t = K_{Z_t} \circ K_{X_t} \quad \Longrightarrow \quad \lambda(\tilde{K}_t) = \lambda\big((K_Z \otimes \mathbf{1}_{n_t}\mathbf{1}_{n_t}^T) \circ K^r_{X_t}\big). \tag{7}$$

**Theorem 2.** Let the rank of the matrix $K_{X_{T+1}}$ be $r_x$ and the rank of the matrix $K_Z$ be $r_z$. Then $\log(g([T])) \le r_z r_x \log\left(\frac{(T+1)c_{\tilde{k}} + \lambda}{\lambda}\right)$.

This means that when the rank of the task similarity matrix is low, which reflects a high degree of inter-task similarity, the regret bound is tighter. For comparison, note that when all tasks are independent, $r_z = N$, and when all tasks are the same (pooled), $r_z = 1$. In the case of Lin-UCB [7], where all arm estimators are assumed to be the same and $k_X$ is a linear kernel, the regret bound in Theorem 1 evaluates to $\tilde{O}(\sqrt{dT})$, where $d$ is the dimension of the context space. In the original Lin-UCB algorithm [13], where all arm estimators are different, the regret bound would be $\tilde{O}(\sqrt{NdT})$.

We can further comment on $g([T])$ when all distinct tasks (arms) are similar to each other with task similarity equal to $\mu$. Thus define $K_Z(\mu) := (1 - \mu)I_N + \mu\mathbf{1}_N\mathbf{1}_N^T$ and $\tilde{K}^r_t(\mu) = (K_Z(\mu) \otimes \mathbf{1}_{n_t}\mathbf{1}_{n_t}^T) \circ K^r_{X_t}$.

**Theorem 3.** Let $g_\mu([T]) = \frac{\det(\tilde{K}^r_{T+1}(\mu) + \lambda I)}{\lambda^{T+1}}$. If $\mu_1 \ge \mu_2$, then $g_{\mu_1}([T]) \le g_{\mu_2}([T])$.

This shows that when there is more task similarity, the regret bound is tighter.

### 4.3 Comparison with CGP-UCB

CGP-UCB transfers the learning from one task to another by leveraging additional known task-specific context variables [10], similar in spirit to KMTL-UCB. Indeed, with slight modifications, KMTL-UCB can be viewed as a frequentist analogue of CGP-UCB, and similarly CGP-UCB could be modified to address our setting. Furthermore, the term $g([T])$ appearing in our regret bound is equivalent to an information gain term used to analyze CGP-UCB. In the agnostic case of CGP-UCB, where there is no assumption of a Gaussian prior on decision functions, their regret bound is $O(\log(g([T]))\sqrt{T})$, while their regret bound matches ours when they adopt a GP prior on $f^*$. Thus, our primary contributions with respect to CGP-UCB are to provide a tighter regret bound in the agnostic case, and a technique for estimating task similarity, which is critical for real-world applications.

## 5 Experiments

We test our algorithm on synthetic data and on several multi-class classification datasets. In the case of multi-class datasets, the number of arms $N$ is the number of classes, and the reward is 1 if we predict the correct class and 0 otherwise. We separate the data into two parts: a validation set and a test set. We use Gaussian kernels throughout, pre-select the kernel bandwidths using five-fold cross-validation on the holdout validation set, and use $\beta = 0.1$ for all experiments. We then run the algorithm on the test set 10 times (with different sequences of streaming data) and report the mean regret.
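KMTL-UCB-Est in the comparisons below relies on the task-similarity estimate of Section 3.2. The sketch below shows one way the empirical kernel mean embeddings and Eq. (5) might be turned into an estimated task-similarity matrix; the function and argument names are our own, and the squared RKHS distance between embeddings is expanded into averages of the embedding kernel over within-arm and between-arm pairs.

```python
import numpy as np

def estimate_task_similarity(contexts_by_arm, kX, sigma_Z=1.0):
    """Estimate the N x N matrix K_Z of Eq. (5) from empirical mean embeddings.

    contexts_by_arm : list of arrays; contexts_by_arm[a] holds the contexts observed
                      so far for arm a (possibly empty early on)
    kX              : function returning the Gram matrix of the embedding kernel k_X'
    """
    N = len(contexts_by_arm)
    KZ_hat = np.eye(N)                    # exp(0) = 1 on the diagonal
    for a in range(N):
        for b in range(a + 1, N):
            Xa, Xb = contexts_by_arm[a], contexts_by_arm[b]
            if len(Xa) == 0 or len(Xb) == 0:
                continue                  # no data yet: leave the off-diagonal entry at 0
            # ||Psi(P_a) - Psi(P_b)||^2 = mean kX(Xa,Xa) + mean kX(Xb,Xb) - 2 mean kX(Xa,Xb)
            sq_dist = kX(Xa, Xa).mean() + kX(Xb, Xb).mean() - 2.0 * kX(Xa, Xb).mean()
            KZ_hat[a, b] = KZ_hat[b, a] = np.exp(-sq_dist / (2.0 * sigma_Z ** 2))
    return KZ_hat
```

A helper such as `gaussian_kX` from the earlier sketch could be passed as `kX`, playing the role of the embedding kernel $k_X'$.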
For the synthetic data, we compare Kernel-UCB in the independent setting (Kernel-UCB-Ind) and the pooled setting (Kernel-UCB-Pool), KMTL-UCB with known task similarity, and KMTL-UCB-Est, which estimates task similarity on the fly. For the real datasets in the multi-class classification setting, we compare Kernel-UCB-Ind and KMTL-UCB-Est. In this case, the pooled setting is not valid because $x_{a,t}$ is the same for all arms (only $z_a$ differs), and KMTL-UCB is not valid because the task similarity matrix is unknown. We also report confidence intervals for these results in the supplementary material.

### 5.1 Synthetic News Article Data

Suppose an agent has access to a pool of articles and their context features. The agent then sees a user, along with his/her features, for whom it needs to recommend an article. Based on the user features and the article features, the algorithm forms a combined context $x_{a,t}$. The user context $x_{u,t} \in \mathbb{R}^2$ is, for each $t$, randomly drawn from an ellipse centered at $(0, 0)$ with major axis length 1 and minor axis length 0.5. Let $x_{u,t}[:, 1]$ be the minor axis and $x_{u,t}[:, 2]$ the major axis. The article context $x_{art,t}$ is an angle $\theta \in [0, \frac{\pi}{2}]$. To get the overall summary $x_{a,t}$ of user and article, the user context $x_{u,t}$ is rotated by $x_{art,t}$. Rewards for each article are defined based on the minor axis: $r_{a,t} = 1.0 - \big(x_{u,t}[:, 1] - \frac{a}{N} + 0.5\big)^2$.

*Figure 1: Synthetic Data*

Figure 1 shows one such example for 4 different arms. The color code describes the reward, the two axes show the user context, and $\theta$ is the article context. We take $N = 5$. For KMTL-UCB, we use a Gaussian kernel on $x_{art,t}$ to obtain the task similarity. The results of this experiment are shown in Figure 1. As one can see, Kernel-UCB-Pool performs the worst; in this setting, combining all the data and learning a single estimator is not effective. KMTL-UCB beats the other methods in all 10 runs, and Kernel-UCB-Ind and KMTL-UCB-Est perform equally well.

### 5.2 Multi-class Datasets

In the case of multi-class classification, each class is an arm, and the features of an example for which the algorithm needs to recommend a class are the contexts. We consider the following datasets: Digits ($N = 10$, $d = 64$), Letter ($N = 26$, $d = 16$), MNIST ($N = 10$, $d = 780$), Pendigits ($N = 10$, $d = 16$), Segment ($N = 7$, $d = 19$) and USPS ($N = 10$, $d = 256$). Empirical mean regrets are shown in Figure 2. KMTL-UCB-Est performs the best on three of the datasets and performs equally well on the other three. Figure 3 shows the estimated task similarity (re-ordered to reveal block structure), and one can see the effect of the estimated task similarity matrix on the empirical regret in Figure 2. For the Digits, Segment and MNIST datasets, there is significant inter-task similarity. For the Digits and Segment datasets, KMTL-UCB-Est is the best in all 10 runs of the experiment, while for MNIST, KMTL-UCB-Est is better in all but 1 run.

*Figure 2: Results on Multiclass Datasets - Empirical Mean Regret*

*Figure 3: Estimated Task Similarity for Real Datasets*

## 6 Conclusions and future work

We present a multi-task learning framework in the contextual bandit setting and describe a way to estimate task similarity when it is not given. We give a theoretical analysis, interpret the regret bound, and support the theoretical analysis with extensive experiments. In the supplementary material we establish a lower bound on the UCB width, and argue that it decreases as task similarity increases.
Our proposal to estimate the task similarity matrix using the arm similarity space $Z = \mathcal{P}_X$ can be extended in different ways. For example, we could also incorporate previously observed rewards into $Z$. This would alleviate a potential problem with our approach, namely, that some contexts may have been selected when they did not yield a high reward. Additionally, by estimating the task similarity matrix, we are estimating arm-specific information. In the case of multi-class classification, $k_Z$ reflects information that represents the various classes. A natural extension is to incorporate methods for representation learning into the MTL bandit setting.

## References

[1] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[3] G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems, pages 2178-2186, 2011.
[4] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.
[5] N. Cesa-Bianchi, C. Gentile, and G. Zappella. A gang of bandits. In Advances in Neural Information Processing Systems, pages 737-745, 2013.
[6] A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems, pages 406-414, 2010.
[7] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[8] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109-117. ACM, 2004.
[9] S. Kale, L. Reyzin, and R. E. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054-1062, 2010.
[10] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447-2455, 2011.
[11] V. Kuleshov and D. Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
[12] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817-824, 2008.
[13] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661-670. ACM, 2010.
[14] S. Li, A. Karatzoglou, and C. Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539-548. ACM, 2016.
[15] H. Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169-177. Springer, 1985.
[16] N. Srinivas, A. Krause, M. Seeger, and S. M. Kakade. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1015-1022, 2010.
[17] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[18] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised contextual bandits. In Uncertainty in Artificial Intelligence, page 654. Citeseer, 2013.
[19] S. S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.
[20] J. White. Bandit Algorithms for Website Optimization. O'Reilly Media, Inc., 2012.