# CBRAP: Contextual Bandits with RAndom Projection

Xiaotian Yu, Michael R. Lyu, Irwin King
Department of Computer Science and Engineering
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Email: {xtyu,lyu,king}@cse.cuhk.edu.hk

## Abstract

Contextual bandits with linear payoffs, also known as linear bandits, provide a powerful alternative for solving practical problems of sequential decisions, e.g., online advertisements. In the era of big data, contextual data usually tend to be high-dimensional, which leads to new challenges for traditional linear bandits, mostly designed for the setting of low-dimensional contextual data. Due to the curse of dimensionality, most current bandit algorithms face two challenges: the first is high time complexity; the second is extremely large upper regret bounds with high-dimensional data. In this paper, in order to attack the above two challenges effectively, we develop an algorithm of Contextual Bandits via RAndom Projection (CBRAP) in the setting of linear payoffs, which works especially for high-dimensional contextual data. The proposed CBRAP algorithm is time-efficient and flexible, because it enables players to choose an arm in a low-dimensional space, and relaxes the sparsity assumption of a constant number of non-zero components in previous work. Besides, we prove an upper regret bound for the proposed algorithm, which is associated with the reduced dimension. By comparing with three benchmark algorithms, we demonstrate improved performance on cumulative payoffs of CBRAP during its sequential decisions on both synthetic and real-world datasets, as well as its superior time-efficiency.

## Introduction

The Multi-Armed Bandit (MAB) problem was proposed and investigated by Robbins in 1952, and has since attracted great interest from numerous researchers in operations research and computer science (Robbins 1952; Auer, Cesa-Bianchi, and Fischer 2002; Bubeck and Cesa-Bianchi 2012). The fundamental issue in the MAB problem and its variants is the exploration-exploitation trade-off: an algorithm tries to maximize cumulative rewards in sequential decisions while having only limited knowledge about the mechanism generating the rewards (Auer 2002).

As a natural and important variant of the basic MAB problem, contextual bandits with linear payoffs, also known as linear bandits, are sequential decision-making problems with side information (Wang, Kulkarni, and Poor 2005; Dani, Hayes, and Kakade 2008; Abbasi-Yadkori, Pál, and Szepesvári 2011; Chu et al. 2011). Specifically, given feature information of the arm space for each of $T$ rounds, a learner is required to choose one of $K$ arms. Linear bandits contain a basic assumption of a linear mapping from the arm space to the reward space (Filippi et al. 2010), which is the most common case in practice.

Recently, contextual bandits with linear payoffs have been successfully applied to many practical applications, such as recommender systems (Tang et al. 2013), social network analysis (Zhao, McAuley, and King 2014; Zhao and King 2016b) and information retrieval (Zhao and King 2016a). In (Tang et al. 2013), the demonstration of advertisements was based on users' input information on web pages.
The authors formulated the problem of automatic layout selection in online advertisements as a contextual bandit problem. These personalized advertisements are expected to improve click-through rates of web links. For such models of sequential decisions, recommendation algorithms always receive additional contextual information from users, which can be greatly useful for online sequential decisions.

In the big data era, it is common to encounter high-dimensional and/or sparse contextual information. In this case, traditional bandit algorithms, which are mostly designed in the setting of low-dimensional data, face new challenges in applications. Due to the curse of dimensionality, there are two challenges in most of the current bandit algorithms. The first is high time complexity; the second is extremely large upper regret bounds with high-dimensional data. Specifically, traditional contextual bandit algorithms (e.g., LinUCB in (Chu et al. 2011)) perform matrix-inverse operations in the original contextual space, which is time-consuming. Besides, the regret bounds of linear bandits in (Chu et al. 2011; Abbasi-Yadkori, Pál, and Szepesvári 2011) are related to the original dimension of contextual data, which can lead to increasing regret bounds under the curse of dimensionality. This becomes even worse when the original dimension of contextual data is larger than the total number of sequential rounds of playing bandits.

There have been some efforts on contextual bandits with linear payoffs for high-dimensional and/or sparse contextual data (Deshpande and Montanari 2012; Carpentier, Munos, and others 2012). The corresponding bandit algorithms are named BallExp in (Deshpande and Montanari 2012) and SLUCB in (Carpentier, Munos, and others 2012). However, in SLUCB, the authors assumed that the contextual data contain $S$ non-zero components, which may not be flexible in applications, especially for cases with high-dimensional dense data. In BallExp, the upper and lower regret bounds are relatively loose, and are still closely related to the original dimension of contextual data. Besides, both SLUCB and BallExp adopt the technique of ball exploration in the high-dimensional space, which is time-consuming in applications. For a rigorous analysis of the time complexity of these two algorithms, interested readers can refer to (Dani, Hayes, and Kakade 2008).

Random projection is a powerful and popular technique for dealing with high-dimensional data (Fern and Brodley 2003; Zhang et al. 2016), which maps high-dimensional data onto a low-dimensional space. Note that random projection does not require an assumption of sparsity in the high-dimensional data. The most common construction is a Gaussian random matrix, where each element is an i.i.d. sample following a standard normal distribution. Random projection has been proved to preserve Euclidean distances within an error ball (Dasgupta and Gupta 1999). Besides, error bounds for inner products under random projection have been investigated (Kabán 2015), which serves as an effective tool for analyzing upper regret bounds in contextual bandits.
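The distance and inner-product preservation properties cited above are easy to observe numerically. The following minimal sketch (our illustration, not code from the paper) projects two high-dimensional vectors with a Gaussian random matrix and compares their inner product before and after the projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5000, 50                        # original and reduced dimensions
x = rng.standard_normal(n)
theta = rng.standard_normal(n)
x /= np.linalg.norm(x)                 # normalize so that ||x||_2 = ||theta||_2 = 1
theta /= np.linalg.norm(theta)

# Gaussian random projection with entry variance 1/m, so that
# E[<Mx, M theta>] = <x, theta>
M = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))
z, theta_z = M @ x, M @ theta

print("original inner product :", x @ theta)
print("projected inner product:", z @ theta_z)
```

Averaged over many draws of M, the projected inner product concentrates around the original one, with deviations shrinking as m grows; this is the behavior quantified in (Kabán 2015).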
In this paper, to tackle the aforementioned two challenges effectively, we propose an algorithm of Contextual Bandits via RAndom Projection (CBRAP) in the setting of linear payoffs, which works especially for high-dimensional data. Note that, for simplicity, we assume in this work that contextual bandits have linear payoff functions, but the framework of our bandit algorithm can be easily generalized by relaxing the linear assumption. Specifically, our proposed algorithm adopts random projection to map the high-dimensional contextual information onto a low-dimensional space, for which we design a random matrix. The proposed CBRAP algorithm is time-efficient and flexible, because it enables players to choose an arm in a low-dimensional space, and relaxes the sparsity assumption of a constant number of non-zero components in previous work. Besides, we prove an upper regret bound for the proposed algorithm, and show the bound to be better than the traditional ones with an appropriately reduced dimension. By comparing with three benchmark algorithms (i.e., LinUCB, BallExp and SLUCB), we demonstrate improved performance on cumulative payoffs of the CBRAP algorithm during its sequential decisions on both synthetic and real-world datasets, as well as its superior time-efficiency.

In summary, we make the following contributions.

- For contextual bandits in the setting of linear payoffs, we develop an efficient and practical algorithm named CBRAP by taking advantage of random projection.
- We derive an upper regret bound for the proposed CBRAP algorithm, which guarantees that the worst-case regret is associated with the reduced dimension. Besides, our algorithm is more flexible and time-efficient than BallExp and SLUCB in high-dimensional settings.
- We evaluate the CBRAP algorithm via a series of experiments with synthetic and real-world datasets. Compared with the three benchmarks, we demonstrate the proposed algorithm's improved performance of cumulative payoffs during sequential decisions, as well as its time-efficiency.

## Preliminary and Related Work

In this section, we first introduce notations and the definition of a sub-Gaussian random variable, which will be used throughout this paper. Then, we present the process of contextual bandits with linear payoffs, as well as the performance metric of bandits. Finally, we provide a brief survey on random projection.

### Notations and Definition of Sub-Gaussian

The total number of sequential rounds of playing bandits is $T$. For each round $t \in [T]$ with $[T] = \{1, 2, \ldots, T\}$, a learner receives contextual information from the set $\mathcal{X} \subseteq \mathbb{R}^n$, where $n$ can be an extremely large integer representing a high-dimensional space. In this work, high-dimensional contextual data precisely means that $T \leq n$ or even $T \ll n$, which is the case mentioned in (Carpentier, Munos, and others 2012). Let $K \in \mathbb{N}^+$ be the number of arms and $\pi_{t,y} \in [0, 1]$ the reward of arm $y$ on round $t$ with $y \in [K]$ and $[K] = \{1, 2, \ldots, K\}$. We adopt $\|\cdot\|_2$ to denote the $\ell_2$ norm of a vector $x \in \mathbb{R}^n$, and $I_{m \times m}$ to denote the identity matrix with dimensions of $m \times m$. For a positive definite matrix $A \in \mathbb{R}^{m \times m}$, the weighted norm of a vector $x$ is defined as $\|x\|_A = \sqrt{x^T A x}$. The inner product is represented as $\langle \cdot, \cdot \rangle$, and the weighted inner product is $x^T A y = \langle x, y \rangle_A$. Mathematically, we give the following definition of a sub-Gaussian random variable.

**Definition 1** ((Buldygin and Kozachenko 1980)). A random variable $\xi$ is sub-Gaussian if there exists an $R \geq 0$ such that
$$\mathbb{E}[\exp(\lambda \xi)] \leq \exp\left(\frac{\lambda^2 R^2}{2}\right), \qquad (1)$$
where $\lambda \in \mathbb{R}$, $\mathbb{E}[\cdot]$ is the expectation of a random variable and $\exp(\cdot)$ denotes the exponential operation. Given a $\sigma$-algebra $\mathcal{F}$, $\xi$ is conditionally $R$-sub-Gaussian if, for all $\lambda \in \mathbb{R}$ and a fixed $R \geq 0$, we have $\mathbb{E}[\exp(\lambda \xi) \mid \mathcal{F}] \leq \exp\left(\frac{\lambda^2 R^2}{2}\right)$.
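As a concrete illustration of Definition 1 (our own sketch, not part of the paper), a zero-mean random variable bounded in $[-1, 1]$ is $1$-sub-Gaussian by Hoeffding's lemma, and a quick Monte Carlo estimate of the moment generating function makes the bound visible:

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.uniform(-1.0, 1.0, size=1_000_000)  # zero-mean, bounded in [-1, 1]

R = 1.0  # by Hoeffding's lemma, xi is 1-sub-Gaussian
for lam in (0.5, 1.0, 2.0):
    mgf = np.mean(np.exp(lam * xi))       # Monte Carlo estimate of E[exp(lam * xi)]
    bound = np.exp(lam**2 * R**2 / 2)     # sub-Gaussian bound exp(lam^2 R^2 / 2)
    print(f"lambda={lam}: MGF estimate {mgf:.4f} <= bound {bound:.4f}")
```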
### Contextual Bandits with Linear Payoffs

As shown in (Auer 2002; Chu et al. 2011), an algorithm (denoted by $\mathcal{A}$) for contextual bandits with linear payoffs usually contains the following three steps at round $t$: 1) contextual information $x_{t,y} \in \mathcal{X}$ for all $y \in [K]$ is revealed to the bandit algorithm $\mathcal{A}$; 2) the bandit algorithm $\mathcal{A}$ chooses an arm $a_t \in [K]$, whose payoff follows an underlying distribution $\Pi(x_{t,a_t}, \theta^*)$ with $\theta^* \in \mathbb{R}^n$ being the unknown true parameter vector; and 3) a stochastic payoff $\pi_{t,a_t} \in [0, 1]$ is revealed to the bandit algorithm $\mathcal{A}$.

In the above stochastic setting of step 3, for the chosen arm $a_t$ at round $t$, we usually assume that there is an underlying distribution $\Pi(x_{t,a_t}, \theta^*)$ with first moment $\langle x_{t,a_t}, \theta^* \rangle$, so that $\pi_{t,a_t}$ is a sample from $\Pi(x_{t,a_t}, \theta^*)$. Thus, we have
$$\pi_{t,a_t} = x_{t,a_t}^T \theta^* + \eta_t, \qquad (2)$$
where $\eta_t$ is a random noise satisfying the assumption of being conditionally $R$-sub-Gaussian. That is, for all $\lambda \in \mathbb{R}$, we have
$$\mathbb{E}[\exp(\lambda \eta_t) \mid \mathcal{F}_t] \leq \exp\left(\frac{\lambda^2 R^2}{2}\right), \qquad (3)$$
where $\mathcal{F}_t$ is the $\sigma$-algebra $\sigma(\{x_{i,a_i}\}_{i \in [t]}, \{\eta_i\}_{i \in [t-1]})$ and $R \geq 0$. Eq. (3) implies that $\mathbb{E}[\eta_t \mid \mathcal{F}_t] = 0$.

A popular measure of the performance of an algorithm for solving MAB problems is regret, which is defined as the difference between the expected payoff of the optimal decision in hindsight and that of the algorithm. Mathematically, the regret of the algorithm $\mathcal{A}$ is defined as
$$\mathrm{Regret}(T) \triangleq \mathbb{E}\left[\sum_{t=1}^{T} \max_{y \in [K]} x_{t,y}^T \theta^* - \sum_{t=1}^{T} \pi_{t,a_t}\right]. \qquad (4)$$

### Random Projection

One common technique for dimensionality reduction is to perform linear random projection (Baraniuk, Cevher, and Wakin 2010; Fodor 2002). In this paper, we consider projecting the contextual data of $\mathcal{X} \subseteq \mathbb{R}^n$ onto a low-dimensional space $\mathcal{Z} \subseteq \mathbb{R}^m$. Without loss of generality, we denote the random projection matrix by $M \in \mathbb{R}^{m \times n}$. Then, we have
$$z = Mx, \qquad (5)$$
where $z \in \mathcal{Z}$ and $x \in \mathcal{X}$. In (Blum 2006), $M$ is constructed as a random matrix where each element follows a normal distribution $\mathcal{N}(0, \hat{\sigma}^2)$. Setting $\hat{\sigma}^2 = 1/m$ in the next section, we name this variant of our algorithm CBRAP with Standard Gaussian (SG) matrix (abbreviated as CBRAP.SG). In (Achlioptas 2003), the authors proposed new methods for constructing a sparse random sign matrix for dimensionality reduction. In the ensuing section, we name the corresponding variant CBRAP with Random Sign (RS) matrix (abbreviated as CBRAP.RS).

In addition to the above work, there have been other ways of constructing a matrix for random projection (Li, Hastie, and Church 2006; Ailon and Chazelle 2006; Clarkson and Woodruff 2013; Lu et al. 2013). These investigations consider how to speed up the dimensionality reduction, or how to conduct the random projection under a low-rank matrix assumption. In this paper, since our focus is to conduct the dimensionality reduction of contextual bandits in a general way, we only consider the constructions of random matrices in (Blum 2006; Achlioptas 2003).
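The two constructions above are straightforward to implement. The sketch below is ours, with a plain ±1 sign matrix as one common instantiation of the RS construction from (Achlioptas 2003):

```python
import numpy as np

def sg_matrix(m: int, n: int, rng: np.random.Generator) -> np.ndarray:
    """Standard Gaussian projection (Blum 2006): i.i.d. N(0, 1/m) entries."""
    return rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))

def rs_matrix(m: int, n: int, rng: np.random.Generator) -> np.ndarray:
    """Random sign projection in the spirit of (Achlioptas 2003):
    i.i.d. entries equal to +1/sqrt(m) or -1/sqrt(m) with equal probability."""
    return rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)        # a high-dimensional context
z = sg_matrix(20, 1000, rng) @ x     # its 20-dimensional projection, Eq. (5)
```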
### Related Work

Contextual bandits are important variants of traditional MAB problems and match many real applications (Langford and Zhang 2008; Li et al. 2010; Tang et al. 2013; Wang, Kulkarni, and Poor 2005). Contextual bandits with linear payoffs have been intensively investigated in previous work (Abbasi-Yadkori, Pál, and Szepesvári 2012; Abe, Biermann, and Long 2003; Abe and Long 1999; Chu et al. 2011; Kaelbling 1994). As shown in (Chu et al. 2011), the traditional upper regret bound for the LinUCB algorithm is
$$\mathrm{Regret}(T) \leq O\left(\sqrt{Tn \ln^3(KT \ln(T)/\delta)}\right), \qquad (6)$$
where $\delta \in (0, 1)$ is a confidence parameter. The dimension $n$ of the contextual information in Eq. (6) increases as the dimension of the context space increases. Roughly, we have the upper regret bound $\mathrm{Regret}(T) \leq O(T)$ when $n = T$. This becomes even worse when the dimension of contextual data grows larger, especially for $n \gg T$.

In (Abbasi-Yadkori, Pál, and Szepesvári 2012), Abbasi-Yadkori et al. studied a sparse variant of stochastic linear bandits. For high-dimensional bandits, Carpentier and Munos (Carpentier, Munos, and others 2012) attacked high-dimensional stochastic linear bandits with the sparsity assumption of $S$ non-zero components, where the algorithm is named SLUCB. The upper regret bound in (Carpentier, Munos, and others 2012) is $O(S\sqrt{T})$. In real applications, the sparsity assumption may be unreasonable, especially for high-dimensional dense data. In (Deshpande and Montanari 2012), the authors proposed an algorithm named BallExp for high-dimensional linear bandits, where the regret bound is relatively loose and directly related to the dimension of the data. Recently, by adopting additional assumptions of margin and compatibility conditions, the authors of (Bastani and Bayati 2015) investigated high-dimensional covariates in online decision-making.

In light of prior work, it is urgent and important to develop a flexible and practical algorithm for contextual bandits with high-dimensional data that does not require additional assumptions (e.g., sparsity or the margin condition). This motivates our proposed CBRAP algorithm in the next section.

## The CBRAP Algorithm

In this section, we first present an overview of CBRAP, and then provide theoretical analyses of a practical upper regret bound and the time complexity of the algorithm.

### Overview of CBRAP

Our proposed bandit algorithm, named CBRAP, is shown in Algorithm 1. The basic idea of the CBRAP algorithm is to project the high-dimensional data onto a low-dimensional space, and to maintain a confidence set for the unknown optimal parameter $\theta_z^* \in \mathbb{R}^m$ in the low-dimensional space, where $\theta_z^*$ corresponds to the original true parameter $\theta^*$ in the $n$-dimensional space.

**Algorithm 1** CBRAP
1: input: $m$, $T$, $\beta \in \mathbb{R}^+$ and $\alpha \in \mathbb{R}^+$
2: for $p = 1, 2, \ldots, m$ do
3:   for $q = 1, 2, \ldots, n$ do
4:     generate a random value $b_{pq}$ based on SG or RS
5:     $M(p, q) \leftarrow b_{pq}$
6:   end for
7: end for
8: $A_0 \leftarrow I_{m \times m}$
9: $b_0 \leftarrow 0_m$
10: for $t = 1, 2, \ldots, T$ do
11:   observe context $x_{t,y} \in \mathbb{R}^n$ for all $y \in [K]$
12:   $z_{t,y} \leftarrow M x_{t,y}$ for all $y \in [K]$
13:   if $t == 1$ then
14:     $A_t \leftarrow A_{t-1}$
15:     $b_t \leftarrow b_{t-1}$
16:   else
17:     $A_t \leftarrow A_{t-1} + z_{t-1,a_{t-1}} z_{t-1,a_{t-1}}^T$
18:     $b_t \leftarrow b_{t-1} + \pi_{t-1,a_{t-1}} z_{t-1,a_{t-1}}$
19:   end if
20:   $\theta_t^z \leftarrow A_t^{-1} b_t$
21:   for $y \in [K]$ do
22:     $v_{t,y} \leftarrow \beta \|z_{t,y}\|_{A_t^{-1}}$
23:     $\hat{r}_{t,y} \leftarrow \langle \theta_t^z, z_{t,y} \rangle$
24:     $ucb_{t,y} \leftarrow \hat{r}_{t,y} + v_{t,y}$
25:   end for
26:   choose the arm $a_t \leftarrow \arg\max_{y \in [K]} ucb_{t,y}$ (break ties arbitrarily)
27:   observe the reward $\pi_{t,a_t}$
28: end for

Our main contribution in the CBRAP algorithm is twofold. First, we construct a random matrix for contextual bandits in Steps 2 to 7 of Algorithm 1. Note that the designed random matrix is flexible and can be revised based on users' needs; here we only consider the random matrices in (Blum 2006; Achlioptas 2003). Second, via the designed random matrix $M$, we conduct dimensionality reduction for the contextual information in Algorithm 1.
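For readers who prefer code, the following is a compact Python sketch of Algorithm 1 under our reading of it; the context stream `contexts`, the reward oracle `pull` and the factory `matrix_fn` (e.g., `sg_matrix` or `rs_matrix` from the earlier sketch) are placeholders we introduce for illustration:

```python
import numpy as np

def cbrap(contexts, pull, m, T, beta, matrix_fn, rng):
    """Sketch of Algorithm 1. contexts(t) -> (K, n) array of arm contexts;
    pull(t, a) -> observed reward of arm a; matrix_fn builds an SG or RS matrix."""
    n = contexts(1).shape[1]
    M = matrix_fn(m, n, rng)              # Steps 2-7: random projection matrix
    A = np.eye(m)                         # Step 8
    b = np.zeros(m)                       # Step 9
    total = 0.0
    for t in range(1, T + 1):             # Step 10
        Z = contexts(t) @ M.T             # Step 12: z_{t,y} = M x_{t,y}
        A_inv = np.linalg.inv(A)          # dominant O(m^3) cost per round
        theta_z = A_inv @ b               # Step 20: ridge estimate
        # Steps 21-25: UCB index, <theta_z, z> + beta * ||z||_{A^{-1}}
        widths = beta * np.sqrt(np.einsum('km,mj,kj->k', Z, A_inv, Z))
        a = int(np.argmax(Z @ theta_z + widths))   # Step 26
        reward = pull(t, a)               # Step 27
        A += np.outer(Z[a], Z[a])         # Step 17 (used from round t+1 on)
        b += reward * Z[a]                # Step 18
        total += reward
    return total
```

Updating $A$ and $b$ at the end of round $t$ is equivalent to Steps 13-19, which fold in the observation of round $t-1$ at the beginning of round $t$.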
### Theoretical Analyses of Regret Bound

In this section, we provide theoretical results on the upper regret bound of the proposed CBRAP algorithm. First of all, we show that the errors resulting from projecting high-dimensional data onto the low-dimensional space are conditionally sub-Gaussian, with a parameter associated with the reduced dimension $m$. Then, we derive a practical upper regret bound for CBRAP.

**Incurred sub-Gaussian errors in CBRAP.** We focus on the errors due to dimensionality reduction in CBRAP. The following theorem shows that the incurred errors are sub-Gaussian, and can be combined with the original independent noise $\eta_t$ into a new sub-Gaussian random noise.

**Theorem 1.** Consider a linear stochastic payoff model $\pi_t = \langle x_t, \theta^* \rangle + \eta_t$, where $x_t \in \mathbb{R}^n$, $\theta^* \in \mathbb{R}^n$ and $\eta_t$ is conditionally $R$-sub-Gaussian with respect to the $\sigma$-algebra $\mathcal{F}_t = \sigma(\{x_i\}_{i \in [t]}, \{\eta_i\}_{i \in [t-1]})$. We can design a random matrix $M \in \mathbb{R}^{m \times n}$ such that $\pi_t = \langle z_t, \theta_z^* \rangle + \eta_t + \rho_t$ and, for all $\lambda \in \mathbb{R}$,
$$\mathbb{E}[\exp(\lambda(\eta_t + \rho_t)) \mid \mathcal{F}'_t] \leq \exp\left(\frac{\lambda^2 (\sqrt{4/m} + R)^2}{2}\right), \qquad (7)$$
where $\mathcal{F}'_t = \sigma(\{x_i\}_{i \in [t]}, \{\eta_i\}_{i \in [t-1]}, \{\rho_i\}_{i \in [t-1]})$ is a $\sigma$-algebra, $z_t = M x_t$ and $\theta_z^* \in \mathbb{R}^m$ is the unknown parameter in the $m$-dimensional space. In other words, with the designed random matrix $M$, $\eta_t + \rho_t$ is conditionally $(\sqrt{4/m} + R)$-sub-Gaussian.

*Proof.* Without loss of generality, we assume that $\|x\|_2 \leq 1$ and $\|\theta^*\|_2 \leq 1$; otherwise, we can conduct normalization. Given a random matrix $M \in \mathbb{R}^{m \times n}$ with entries drawn from the normal distribution $\mathcal{N}(0, \hat{\sigma}^2)$, the mapping from the $n$-dimensional space to the $m$-dimensional space for $\theta \in \mathbb{R}^n$ is denoted by
$$\theta_z = M\theta, \qquad (8)$$
where $\theta_z \in \mathbb{R}^m$.

Since the random noise $\eta_t$ is conditionally $R$-sub-Gaussian with $\mathcal{F}_t = \sigma(\{x_i\}_{i \in [t]}, \{\eta_i\}_{i \in [t-1]})$, we readily have $\mathbb{E}[\eta_t \mid \mathcal{F}_t] = 0$ and $\mathbb{E}[\eta_t \mid \mathcal{F}'_t] = 0$. Besides, we have
$$\mathbb{E}[\langle x_t, \theta^* \rangle + \eta_t \mid \mathcal{F}_t] = \mathbb{E}[\langle z_t, \theta_z^* \rangle + \eta_t + \rho_t \mid \mathcal{F}'_t]. \qquad (9)$$
Due to the mean preservation of inner products under random projection, shown in Lemma 4 of (Shi et al. 2012), we have $\mathbb{E}[\langle x_t, \theta^* \rangle \mid \mathcal{F}_t] = \mathbb{E}[\langle z_t, \theta_z^* \rangle \mid \mathcal{F}'_t]$. Thus,
$$\mathbb{E}[\rho_t \mid \mathcal{F}'_t] = 0. \qquad (10)$$
Based on (Kabán 2015), given $\epsilon \in (0, 1)$, we have
$$\Pr\left\{z^T \theta_z < x^T \theta\, m\hat{\sigma}^2 - \epsilon m\hat{\sigma}^2 \|x\|_2 \|\theta\|_2\right\} < \exp\left(-\frac{m\epsilon^2}{8}\right), \qquad (11)$$
$$\Pr\left\{z^T \theta_z > x^T \theta\, m\hat{\sigma}^2 + \epsilon m\hat{\sigma}^2 \|x\|_2 \|\theta\|_2\right\} < \exp\left(-\frac{m\epsilon^2}{8}\right). \qquad (12)$$
We focus on the error bounds of $z^T\theta_z - x^T\theta$. In practice, we can set $m\hat{\sigma}^2 = 1$. Note that $\|x\|_2 \leq 1$ and $\|\theta\|_2 \leq 1$. Thus, we have
$$\Pr\{z^T \theta_z < x^T \theta - \epsilon\} < \exp\left(-\frac{m\epsilon^2}{8}\right), \quad \Pr\{z^T \theta_z > x^T \theta + \epsilon\} < \exp\left(-\frac{m\epsilon^2}{8}\right).$$
Note that $\rho_t = \langle x_t, \theta^* \rangle - \langle z_t, \theta_z^* \rangle = x_t^T\theta^* - z_t^T M\theta^*$. Thus, with high probability $1 - 2\exp(-\frac{m\epsilon^2}{8})$, we have $|\rho_t| \leq \epsilon$. Since we can adopt the $\sigma$-algebra $\mathcal{F}'_t$ in Eqs. (11) and (12),
$$\Pr\{|\rho_t| > \epsilon \mid \mathcal{F}'_t\} \leq 2\exp\left(-\frac{m\epsilon^2}{8}\right). \qquad (13)$$
Based on Eqs. (10) and (13), and Theorem 3.1 in (Rivasplata 2012), we have $\mathbb{E}[\exp(\lambda \rho_t) \mid \mathcal{F}'_t] \leq \exp(2\lambda^2/m)$ for all $\lambda \in \mathbb{R}$. In light of the definition of sub-Gaussian, $\rho_t$ is conditionally $\sqrt{4/m}$-sub-Gaussian.

Now we know that $\eta_t$ and $\rho_t$ are both conditionally sub-Gaussian. Thus, in light of Theorem 2.7 in (Rivasplata 2012), we have $\mathbb{E}[\exp(\lambda(\eta_t + \rho_t)) \mid \mathcal{F}'_t] \leq \exp(\lambda^2 (\sqrt{4/m} + R)^2/2)$ for all $\lambda \in \mathbb{R}$, which means that $\eta_t + \rho_t$ is conditionally $(\sqrt{4/m} + R)$-sub-Gaussian.
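A quick simulation (ours) makes the concentration behind Eq. (13) tangible: sampling many Gaussian matrices $M$ with $\hat{\sigma}^2 = 1/m$ and measuring the projection error $\rho = x^T\theta - (Mx)^T(M\theta)$, the empirical tail probability stays below the bound $2\exp(-m\epsilon^2/8)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps, trials = 1000, 50, 0.5, 2000

x = rng.standard_normal(n); x /= np.linalg.norm(x)
theta = rng.standard_normal(n); theta /= np.linalg.norm(theta)

rho = np.empty(trials)
for i in range(trials):
    M = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))
    rho[i] = x @ theta - (M @ x) @ (M @ theta)   # projection error rho

print("empirical Pr{|rho| > eps}:", np.mean(np.abs(rho) > eps))
print("theoretical bound        :", 2 * np.exp(-m * eps**2 / 8))
```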
**Upper regret bound for CBRAP.** In this subsection, we derive an upper regret bound for the proposed CBRAP algorithm.

**Theorem 2.** If CBRAP is run, then with probability at least $(1-\delta)\left(1 - 2\exp\left(-\frac{m\epsilon^2}{8}\right)\right)$, the regret of the algorithm is upper bounded as
$$\mathrm{Regret}(T) \leq 2\sqrt{2mT\log\left(1+\frac{TL_2^2}{\gamma m}\right)\beta_T^2(\delta)} + T\beta_T(\delta)\epsilon,$$
where $\beta_T(\delta) = \left(\sqrt{4/m}+R\right)\sqrt{m \log\left(\frac{1+TL_2^2/\gamma}{\delta}\right)} + \gamma^{1/2}L_1$, with $L_1$ and $L_2$ being parameters associated with the reduced dimension.

*Proof.* We provide a sketch of the proof here; the detailed proof can be found in the appendix. First, we adopt the theoretical result in (Abbasi-Yadkori, Pál, and Szepesvári 2011) to obtain a confidence bound for the unknown parameter in the reduced $m$-dimensional space. Second, we divide the regret between the payoffs in the $m$-dimensional space and the optimal payoffs into two parts: the first is the error resulting from the dimensionality reduction; the second is the error due to adopting the confidence bound. Third, by adding the errors together, we obtain the final regret of the algorithm in the $m$-dimensional space.

### Time Complexity Analysis

We analyze the time complexity of the CBRAP algorithm and the three benchmarks, as shown in Table 2. Note that, in the CBRAP algorithm and the LinUCB algorithm, the most time-consuming operation is the inversion of the matrix $A_t$. For SLUCB and BallExp, the analysis of time complexity can be found in (Dani, Hayes, and Kakade 2008). From the table, we can see that the proposed CBRAP algorithm is time-efficient, especially in the case of $m \ll n$.

Table 2: Time complexity for bandit algorithms.

| CBRAP | SLUCB | BallExp | LinUCB |
|---|---|---|---|
| $O(m^3KT)$ | $O(n^2KT)$ | $O(n^2KT)$ | $O(n^3KT)$ |

## Experiments

In this section, we conduct experiments with the proposed CBRAP algorithm and three benchmarks: LinUCB, BallExp and SLUCB. Note that, by designing different random matrices in the CBRAP algorithm, we obtain the two variants CBRAP.SG and CBRAP.RS. Our algorithm and the datasets used are all publicly available [1].

### Datasets

For verification, we adopt six datasets in the experiments, whose statistics are shown in Table 1. Specifically, we first construct four synthetic datasets with different dimensions and sparsity, named s1 to s4. Then, we conduct experiments on two real-world datasets, i.e., Movielens [2] and Jester [3]. In Table 1, the sparsity is defined as the number of zero components divided by the total number of contextual dimensions. For example, given the 0.10 sparsity of s1, the number of zero components in each contextual vector is 1000 × 0.10 = 100. For the non-zero components in s1, we generate values from a standard Gaussian distribution. Note that, for the sparsity of the real-world datasets, we count the fraction of zero components among the dimensions of each feature vector, and then report the average fraction over all contextual vectors. For comparison, all experiments are repeated 10 times on each dataset. Besides, the performance metric of the bandit algorithms is the cumulative payoff with T = 1000.

Table 1: Statistics of used datasets.

| dataset | name | #items | #users | #dim. | {(#feedback; sparsity)} | rewards |
|---|---|---|---|---|---|---|
| synthetic 1 | s1 | 100 | 10 | 1,000 | {(200; 0.10), (201; 0.30), (194; 0.50), (333; 0.70), (238; 0.90)} | {0,1} |
| synthetic 2 | s2 | 200 | 20 | 2,000 | {(1,098; 0.50)} | {0,1} |
| synthetic 3 | s3 | 200 | 20 | 5,000 | {(1,282; 0.50)} | {0,1} |
| synthetic 4 | s4 | 100 | 30 | 10,000 | {(592; 0.50)} | {0,1} |
| Movielens | ML | 668 | 10,329 | 9,689 | {(105,339; 0.99)} | {0,1,2,3,4,5} |
| Jester | JR | 24,983 | 100 | 1,458 | {(1,810,455; 0.98)} | [0,10] |

[1] https://github.com/Aaronyxt/CBRAP
[2] http://grouplens.org/datasets/movielens/
[3] http://www.ieor.berkeley.edu/~goldberg/jester-data/

### Setting

We conduct all experiments on a server running Ubuntu 12.04.5 LTS, with 24 Intel CPU cores at 2.60GHz and a total memory of 200GB. In the experiments, we investigate cumulative payoffs for the CBRAP algorithm with different values of n and m. Against the three benchmarks LinUCB, BallExp and SLUCB, we evaluate CBRAP.SG and CBRAP.RS via the following three questions. 1) Given a fixed high-dimensional space n, do different values of m affect the performance of CBRAP? 2) What is the performance of CBRAP when the sparsity of contextual data increases? 3) What is the performance of CBRAP compared with the three benchmarks?
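Before turning to the results, the following sketch (ours; the generator is a hypothetical illustration that follows the dataset description above) shows how a synthetic contextual dataset with a controlled sparsity level can be constructed:

```python
import numpy as np

def make_sparse_contexts(num_items: int, dim: int, sparsity: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Contexts with a `sparsity` fraction of zero components; the remaining
    components are drawn from a standard Gaussian, as described for s1-s4."""
    X = rng.standard_normal((num_items, dim))
    num_zeros = int(dim * sparsity)     # e.g., 1000 * 0.10 = 100 zeros for s1
    for row in X:
        row[rng.choice(dim, size=num_zeros, replace=False)] = 0.0
    return X

rng = np.random.default_rng(0)
X = make_sparse_contexts(100, 1000, 0.10, rng)  # an s1-like context matrix
print(np.mean(X == 0.0))                        # fraction of zeros: 0.10
```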
### Results

For the first question, we set the reduced dimension (RD) to m = 10, 20, 30, 40, 50 on the synthetic datasets. We show the cumulative payoffs for dataset s1 in Figure 1. From the figure, we find that the proposed CBRAP algorithm is robust to different RD values, especially CBRAP.SG. Similar results for the other datasets can be obtained via the source code [1].

Figure 1: Cumulative payoffs with different reduced dimensions in s1 by adopting CBRAP.SG and CBRAP.RS.

For the second question, we investigate the performance of the CBRAP algorithm on the synthetic dataset s1. The experimental results are shown in Figure 2. We find that CBRAP is stable across sparsity levels, which means the sparsity assumption in previous work can be relaxed.

Figure 2: Cumulative payoffs with reduced dimension m = 10 and different sparsity in s1.

For the third question, we show the results in Tables 3 and 4. From the tables, we know that, when the dimension is high, the proposed CBRAP algorithm outperforms the other three benchmarks. Due to space limitations, we only show the results with m = 20, 50. For the real-world datasets, we find that the CBRAP algorithm greatly outperforms SLUCB and BallExp, and is slightly better than LinUCB.

Table 3: Cumulative payoffs for synthetic datasets with the sparsity being 0.50.

| name | algorithm | RD (m) | payoffs (μ ± σ) |
|---|---|---|---|
| s1 | CBRAP.SG | 20 | 981.4 ± 1.0 |
|    | CBRAP.SG | 50 | 955.6 ± 2.2 |
|    | CBRAP.RS | 20 | 995.0 ± 3.9 |
|    | CBRAP.RS | 50 | 994.6 ± 3.4 |
|    | BallExp  | NA | 206.8 ± 8.7 |
|    | SLUCB    | NA | 312.5 ± 10.1 |
|    | LinUCB   | NA | 989.7 ± 2.4 |
| s2 | CBRAP.SG | 20 | 982.5 ± 1.6 |
|    | CBRAP.SG | 50 | 955.6 ± 3.9 |
|    | CBRAP.RS | 20 | 996.3 ± 3.1 |
|    | CBRAP.RS | 50 | 998.8 ± 1.0 |
|    | BallExp  | NA | 183.2 ± 5.0 |
|    | SLUCB    | NA | 350.6 ± 4.7 |
|    | LinUCB   | NA | 975.4 ± 3.1 |
| s3 | CBRAP.SG | 20 | 982.7 ± 1.3 |
|    | CBRAP.SG | 50 | 957.3 ± 2.5 |
|    | CBRAP.RS | 20 | 991.6 ± 2.9 |
|    | CBRAP.RS | 50 | 998.0 ± 2.2 |
|    | BallExp  | NA | 280.4 ± 19.0 |
|    | SLUCB    | NA | 321.6 ± 5.1 |
|    | LinUCB   | NA | 981.2 ± 4.5 |
| s4 | CBRAP.SG | 20 | 982.4 ± 1.6 |
|    | CBRAP.SG | 50 | 957.0 ± 2.6 |
|    | CBRAP.RS | 20 | 989.4 ± 4.8 |
|    | CBRAP.RS | 50 | 998.1 ± 1.8 |
|    | BallExp  | NA | 201.5 ± 19.4 |
|    | SLUCB    | NA | 298.1 ± 7.5 |
|    | LinUCB   | NA | 967.7 ± 6.3 |

Table 4: Cumulative payoffs for real-world datasets.

| name | algorithm | RD (m) | payoffs (μ ± σ) |
|---|---|---|---|
| ML | CBRAP.SG | 20 | 2693.0 ± 30.3 |
|    | CBRAP.SG | 50 | 2302.1 ± 14.9 |
|    | CBRAP.RS | 20 | 2303.3 ± 46.2 |
|    | CBRAP.RS | 50 | 2034.5 ± 17.8 |
|    | BallExp  | NA | 628.6 ± 9.8 |
|    | SLUCB    | NA | 500.1 ± 19.0 |
|    | LinUCB   | NA | 2595.2 ± 8.5 |
| JR | CBRAP.SG | 20 | 1016.5 ± 10.1 |
|    | CBRAP.SG | 50 | 1008.9 ± 1.8 |
|    | CBRAP.RS | 20 | 1028.0 ± 11.2 |
|    | CBRAP.RS | 50 | 1065.1 ± 3.7 |
|    | BallExp  | NA | 441.2 ± 7.6 |
|    | SLUCB    | NA | 323.5 ± 5.3 |
|    | LinUCB   | NA | 998.2 ± 10.2 |

Besides, we show comparisons of the average time consumption in the experiments in Table 5, reported for the synthetic dataset s1 with the sparsity being 0.50. From the table, we find that the CBRAP algorithm is the most time-efficient. Overall, the proposed CBRAP algorithm is suitable and flexible for bandits with high-dimensional contextual data.
Table 5: Comparisons of the average time consumption for the synthetic dataset s1 with the sparsity being 0.50.

| algorithm | RD (m) | time (seconds) |
|---|---|---|
| CBRAP.SG | 20 | 1,424.9 |
| CBRAP.SG | 50 | 3,285.3 |
| CBRAP.RS | 20 | 1,410.4 |
| CBRAP.RS | 50 | 3,046.9 |
| BallExp  | NA | 359,555.7 |
| SLUCB    | NA | 337,197.2 |
| LinUCB   | NA | 1,633,656.4 |

## Conclusion

In this paper, we have investigated contextual bandits with linear payoffs by adopting the technique of random projection. Due to the curse of dimensionality in the big data era, most current bandit algorithms face two main challenges: the first is high time complexity; the second is increasing upper regret bounds. To address these two challenges, we proposed an algorithm named CBRAP for contextual bandits with linear payoffs, which adopts the technique of random projection. For theoretical results, we have derived a practical upper regret bound for CBRAP. Finally, experimental results have demonstrated the improved payoffs of CBRAP, as well as its time-efficiency.

## Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK14205214 and No. CUHK14208815 of the General Research Fund), and the 2015 Microsoft Research Asia Collaborative Research Program (Project No. FY16-RES-THEME-005).

## References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In NIPS, 2312–2320.
Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2012. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, 1–9.
Abe, N., and Long, P. M. 1999. Associative reinforcement learning using linear probabilistic concepts. In ICML, 3–11.
Abe, N.; Biermann, A. W.; and Long, P. M. 2003. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica 37(4):263–293.
Achlioptas, D. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66(4):671–687.
Ailon, N., and Chazelle, B. 2006. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 557–563.
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235–256.
Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397–422.
Baraniuk, R. G.; Cevher, V.; and Wakin, M. B. 2010. Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective. Proceedings of the IEEE 98(6):959–971.
Bastani, H., and Bayati, M. 2015. Online decision-making with high-dimensional covariates. Available at SSRN 2661896.
Blum, A. 2006. Random projection, margins, kernels, and feature-selection. In Subspace, Latent Structure and Feature Selection. Springer. 52–68.
Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.
Buldygin, V. V., and Kozachenko, Y. V. 1980. Sub-Gaussian random variables. Ukrainian Mathematical Journal 32(6):483–489.
Carpentier, A.; Munos, R.; et al. 2012. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In AISTATS, 190–198.
Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual bandits with linear payoff functions. In AISTATS, 208–214.
Clarkson, K. L., and Woodruff, D. P. 2013. Low rank approximation and regression in input sparsity time. In STOC, 81–90.
Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. Stochastic linear optimization under bandit feedback. In COLT, 355–366.
Dasgupta, S., and Gupta, A. 1999. An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical Report 99-006.
Deshpande, Y., and Montanari, A. 2012. Linear bandits in high dimension and recommendation systems. In CCC, 1750–1754.
Fern, X. Z., and Brodley, C. E. 2003. Random projection for high dimensional data clustering: A cluster ensemble approach. In ICML, 186–193.
Filippi, S.; Cappe, O.; Garivier, A.; and Szepesvári, C. 2010. Parametric bandits: The generalized linear case. In NIPS, 586–594.
Fodor, I. K. 2002. A survey of dimension reduction techniques. Technical Report.
Kabán, A. 2015. Improved bounds on the dot product under random projection and random sign projection. In SIGKDD, 487–496.
Kaelbling, L. P. 1994. Associative reinforcement learning: A generate and test algorithm. Machine Learning 15(3):299–319.
Langford, J., and Zhang, T. 2008. The epoch-greedy algorithm for multi-armed bandits with side information.
In NIPS, 817–824.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW, 661–670.
Li, P.; Hastie, T. J.; and Church, K. W. 2006. Very sparse random projections. In SIGKDD, 287–296.
Lu, Y.; Dhillon, P.; Foster, D. P.; and Ungar, L. 2013. Faster ridge regression via the subsampled randomized Hadamard transform. In NIPS, 369–377.
Rivasplata, O. 2012. Subgaussian random variables: an expository note. Internet publication.
Robbins, H. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5):527–535.
Shi, Q.; Shen, C.; Hill, R.; and Hengel, A. 2012. Is margin preserved after random projection? In ICML, 591–598.
Tang, L.; Rosales, R.; Singh, A.; and Agarwal, D. 2013. Automatic ad format selection via contextual bandits. In CIKM, 1587–1594.
Wang, C.-C.; Kulkarni, S. R.; and Poor, H. V. 2005. Bandit problems with side observations. IEEE Transactions on Automatic Control 50(3):338–355.
Zhang, W.; Zhang, L.; Jin, R.; Cai, D.; and He, X. 2016. Accelerated sparse linear regression via random projection. In AAAI, 2337–2343.
Zhao, T., and King, I. 2016a. Constructing reliable gradient exploration for online learning to rank. In CIKM, 1643–1652.
Zhao, T., and King, I. 2016b. Locality-sensitive linear bandit model for online social recommendation. In ICONIP, 80–90.
Zhao, T.; McAuley, J. J.; and King, I. 2014. Leveraging social connections to improve personalized ranking for collaborative filtering. In CIKM, 261–270.

## Appendix

In this appendix, we show the proof of Theorem 2, and discuss the lower regret bound of the proposed algorithm.

### Proof of Theorem 2

Before showing the proof of Theorem 2, we state the following lemma, which is directly adopted from (Abbasi-Yadkori, Pál, and Szepesvári 2011).

**Lemma 1.** Let $\pi_t = z_{t,a_t}^T \theta_z^* + \eta_t + \rho_t$, where $\eta_t + \rho_t$ is $(\sqrt{4/m}+R)$-sub-Gaussian. Assume $\|\theta_z^*\|_2 \leq L_1$ and $\|z_{t,a_t}\|_2 \leq L_2$. Defining $A_t = \gamma I_{m \times m} + \sum_{i=1}^{t} z_{i,a_i} z_{i,a_i}^T$, $b_t = \sum_{i=1}^{t} \pi_i z_{i,a_i}$ and $\hat{\theta}_t^z = A_t^{-1} b_t$, with probability at least $1-\delta$, for all rounds $t \geq 0$, we have
$$\left\|\hat{\theta}_t^z - \theta_z^*\right\|_{A_t} \leq \left(\sqrt{4/m}+R\right)\sqrt{m \log\left(\frac{1 + tL_2^2/\gamma}{\delta}\right)} + \gamma^{1/2} L_1. \qquad (14)$$

Now we provide the proof of Theorem 2.

*Proof.* Consider the random projections of the contextual vector and the unknown weight vector:
$$z_{t,a} = M x_{t,a}, \qquad (15)$$
$$\theta_z^* = M \theta^*, \qquad (16)$$
where $z_{t,a} \in \mathbb{R}^m$, $x_{t,a} \in \mathbb{R}^n$, $\theta_z^* \in \mathbb{R}^m$ and $\theta^* \in \mathbb{R}^n$.

For each round, we focus on the error bound for $\pi^*_{t,a} - x_{t,a_t}^T \theta^*$, where $\pi^*_{t,a}$ denotes the optimal expected payoff at round $t$. By adding and subtracting a term in the $m$-dimensional space, we have
$$r_t = \pi^*_{t,a} - x_{t,a_t}^T \theta^* = \pi^*_{t,a} - z_{t,a_t}^T \theta_z^* + z_{t,a_t}^T \theta_z^* - x_{t,a_t}^T \theta^* \leq \left|\pi^*_{t,a} - z_{t,a_t}^T \theta_z^*\right| + \left|z_{t,a_t}^T \theta_z^* - x_{t,a_t}^T \theta^*\right|,$$
by the triangle inequality. Now we bound $|\pi^*_{t,a} - z_{t,a_t}^T \theta_z^*|$ and $|z_{t,a_t}^T \theta_z^* - x_{t,a_t}^T \theta^*|$ separately.

Since $\tilde{\theta}_t^z$ is optimistic based on $\mathcal{F}'_t$, we have
$$\pi^*_{t,a} - z_{t,a_t}^T \theta_z^* \leq \langle z_{t,a_t}, \tilde{\theta}_t^z \rangle - \langle z_{t,a_t}, \theta_z^* \rangle = \langle z_{t,a_t}, \tilde{\theta}_t^z - \theta_z^* \rangle = \langle z_{t,a_t}, \hat{\theta}_{t-1}^z - \theta_z^* \rangle + \langle z_{t,a_t}, \tilde{\theta}_t^z - \hat{\theta}_{t-1}^z \rangle \leq \|z_{t,a_t}\|_{A_t^{-1}} \left(\left\|\hat{\theta}_{t-1}^z - \theta_z^*\right\|_{A_t} + \left\|\tilde{\theta}_t^z - \hat{\theta}_{t-1}^z\right\|_{A_t}\right) \leq 2\beta_t(\delta) \|z_{t,a_t}\|_{A_t^{-1}},$$
where $\beta_t(\delta) = (\sqrt{4/m}+R)\sqrt{m\log\left(\frac{1+tL_2^2/\gamma}{\delta}\right)} + \gamma^{1/2}L_1$, and $\tilde{\theta}_t^z$ is the optimistic parameter for $z_{t,a_t}$ satisfying $\|\tilde{\theta}_t^z\|_2 \leq L_1$. The first inequality follows from the fact that $\langle z_{t,a_t}, \tilde{\theta}_t^z \rangle$ upper bounds the optimal reward at round $t$; the second inequality follows from the Cauchy-Schwarz inequality; and the last inequality follows from Lemma 1.

For $|z_{t,a_t}^T \theta_z^* - x_{t,a_t}^T \theta^*|$, based on (Kabán 2015), we have
$$\Pr\{z^T\theta_z < x^T\theta - \epsilon\} < \exp\left(-\frac{m\epsilon^2}{8}\right), \qquad (17)$$
$$\Pr\{z^T\theta_z > x^T\theta + \epsilon\} < \exp\left(-\frac{m\epsilon^2}{8}\right), \qquad (18)$$
which implies that, with high probability $1 - 2\exp(-\frac{m\epsilon^2}{8})$,
$$\left|z^T\theta_z - x^T\theta\right| \leq \epsilon. \qquad (19)$$
Thus, for each round $t$, with probability $(1-\delta)\left(1 - 2\exp\left(-\frac{m\epsilon^2}{8}\right)\right)$, we have
$$r_t \leq 2\beta_t(\delta)\|z_{t,a_t}\|_{A_t^{-1}} + \epsilon. \qquad (20)$$
Finally, with probability $(1-\delta)\left(1 - 2\exp\left(-\frac{m\epsilon^2}{8}\right)\right)$,
$$\mathrm{Regret}(T) \leq \sqrt{T\sum_{t=1}^{T} r_t^2} \leq \sqrt{T\sum_{t=1}^{T}\left(2\beta_t(\delta)\|z_{t,a_t}\|_{A_t^{-1}} + \epsilon\right)^2} \leq \sqrt{4T\beta_T^2(\delta)\sum_{t=1}^{T}\|z_{t,a_t}\|^2_{A_t^{-1}} + T^2\epsilon^2 + 4T\epsilon\beta_T(\delta)\sum_{t=1}^{T}\|z_{t,a_t}\|_{A_t^{-1}}} \leq 2\sqrt{2mT\log\left(1+\frac{TL_2^2}{\gamma m}\right)\beta_T^2(\delta)} + T\beta_T(\delta)\epsilon,$$
where the last step uses $\sum_{t=1}^{T}\|z_{t,a_t}\|^2_{A_t^{-1}} \leq 2m\log(1+TL_2^2/(\gamma m))$ and $\beta_T(\delta) = (\sqrt{4/m}+R)\sqrt{m\log\left(\frac{1+TL_2^2/\gamma}{\delta}\right)} + \gamma^{1/2}L_1$.

**Remarks.** Based on the work in (Kabán 2015), we know that the parameters $L_1$ and $L_2$ are associated with the reduced dimension $m$. Thus, the regret bound for CBRAP has a non-linear relationship with $m$. For the lower bound of CBRAP, based on (Chu et al. 2011), we can roughly obtain the relationship $\mathrm{Regret}(T) \geq \gamma' \sqrt{mT}$, where $\gamma' > 0$ is a constant.
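To get a feel for how the bound in Theorem 2 scales with the reduced dimension, the following sketch (ours; the constants R, L1, L2 and γ are illustrative placeholders, since in the paper L1 and L2 themselves depend on m) evaluates the bound numerically:

```python
import numpy as np

def regret_bound(m, T, delta, eps, R=1.0, L1=1.0, L2=1.0, gamma=1.0):
    """Evaluate the upper bound of Theorem 2 for illustrative constants."""
    beta_T = ((np.sqrt(4.0 / m) + R)
              * np.sqrt(m * np.log((1 + T * L2**2 / gamma) / delta))
              + np.sqrt(gamma) * L1)
    return (2 * np.sqrt(2 * m * T * np.log(1 + T * L2**2 / (gamma * m))
                        * beta_T**2)
            + T * beta_T * eps)

for m in (10, 20, 50, 100):
    print(m, round(regret_bound(m, T=1000, delta=0.05, eps=0.1), 1))
```

Under these placeholder constants, the first term grows with m through the √m factor in β_T(δ), while a larger m sharpens the probability guarantee (1 − 2exp(−mε²/8)); this trade-off governs the choice of the reduced dimension in practice.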