Efficient and Robust High-Dimensional Linear Contextual Bandits

Cheng Chen1, Luo Luo2, Weinan Zhang1, Yong Yu1 and Yijiang Lian3
1Shanghai Jiao Tong University  2Hong Kong University of Science and Technology  3Baidu
jack chen1990@sjtu.edu.cn, luoluo@ust.hk, {wnzhang, yyu}@apex.sjtu.edu.cn, lianyijiang@baidu.com

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

Abstract
The linear contextual bandit is a sequential decision-making problem in which an agent repeatedly chooses among actions given their corresponding contexts. As large-scale data sets become increasingly common, we study linear contextual bandits in the high-dimensional setting. Recent works employ matrix sketching methods to accelerate contextual bandits; however, the matrix approximation error introduces additional terms into the regret bound. In this paper we first propose a novel matrix sketching method called Spectral Compensation Frequent Directions (SCFD). We then propose an efficient approach for contextual bandits that adopts SCFD to approximate the covariance matrices. By maintaining and manipulating sketched matrices, our method needs only O(md) space and O(md) update time per round, where d is the dimensionality of the data and m is the sketching size. Theoretical analysis reveals that our method has better regret bounds than previous methods in high-dimensional cases. Experimental results demonstrate the effectiveness of our algorithm and verify our theoretical guarantees.

1 Introduction
The contextual bandit is a popular model for sequential decision-making problems [Auer, 2002; Dani et al., 2008; Chu et al., 2011; Abbasi-Yadkori et al., 2011]. In contextual bandit problems, a decision maker repeatedly (a) receives a set of contexts (or actions), (b) chooses an action, and (c) observes a reward for the chosen action. The goal is to maximize the expected cumulative reward over some time horizon.
Recently, contextual bandits have been applied to many machine learning tasks, including personalized recommendation [Li et al., 2010], web advertising [Tang et al., 2013], social network analysis [Zhao et al., 2014] and mobile health platforms [Tewari and Murphy, 2017]. Contextual bandits have been widely studied in recent years [Langford and Zhang, 2008; Tang et al., 2013]. Traditional methods for linear contextual bandits include upper-confidence-bound algorithms [Chu et al., 2011; Abbasi-Yadkori et al., 2011] and Thompson sampling algorithms [Agrawal and Goyal, 2013; Russo and Van Roy, 2014]. Other works extend linear bandits to generalized linear models [Filippi et al., 2010; Li et al., 2017; Dumitrascu et al., 2018]. A famous algorithm for linear contextual bandits is LinUCB [Chu et al., 2011], whose regret bound is O(√(dT log³(KT log(T)/δ))). Here d is the dimension of the contextual vectors and T is the time horizon. This result was improved to O(d√T log(T) + √(dT log(T) log(1/δ))) [Abbasi-Yadkori et al., 2011], which is independent of the number of arms K.

In the big-data era, the dimension of the data grows rapidly. In the high-dimensional situation, i.e., d ≫ T, most traditional bandit algorithms are time-consuming, and the regret upper bounds grow with the dimension. Some recent works focus on high-dimensional linear contextual bandits. SLUCB [Carpentier and Munos, 2012] achieves a regret bound of O(S√T) by assuming that the contextual data contain only S non-zero components. BallExp [Deshpande and Montanari, 2012] leverages ball exploration in the high-dimensional space, but its regret bound is very loose. These two methods are still inefficient in practice because they require O(d²) time per round. Calandriello et al. [2019] proposed an efficient Gaussian-process-based algorithm for kernel bandits, but its regret bound is O(d√T) for linear kernels.
Recently, some researchers have applied matrix sketching techniques to contextual bandit problems [Yu et al., 2017; Kuzborskij et al., 2019]. Yu et al. [2017] adopted random projection to map the high-dimensional contextual information into a random m-dimensional subspace and proposed the Contextual Bandits with RAndom Projection (CBRAP) algorithm. Though CBRAP successfully reduces the update time from O(d²) to O(md + m³), the random projection introduces an additive error εT into its regret bound. To keep the failure probability δ bounded away from 1, the parameter ε has to be Ω(m^{−1/2}). Kuzborskij et al. [2019] proposed the SOFUL algorithm, which adopts Frequent Directions (FD) to sketch the covariance matrices of linear contextual bandits. Since FD [Liberty, 2013; Ghashami et al., 2016] has good theoretical guarantees in the streaming setting, it is more suitable for contextual bandits than random projection. Due to the advantage of FD, SOFUL requires only O(md) update time and achieves a regret bound of O((1 + Δ_T)^{3/2}(m + d log(1 + Δ_T))√T), where Δ_T is upper bounded by the spectral tail of the covariance matrix. However, FD still has some drawbacks when applied to contextual bandits. First, the sequence of covariance matrices satisfies positive definite monotonicity, i.e., V_t ⪰ V_{t−1}, where V_t is the covariance matrix in the t-th round. But FD produces a rank-deficient approximation, which destroys the positive definite monotonicity of the covariance matrices. This deficiency leads to additional terms in the regret bounds of contextual bandits. Specifically, the regret bound of [Kuzborskij et al., 2019] involves the spectral tail raised to the power 3/2. Since the sketch size is usually much smaller than the rank of the covariance matrix in practice, their bound may be heavily affected by the spectral tail.
In this paper, we present a variant of Frequent Directions called Spectral Compensation Frequent Directions (SCFD). To overcome the disadvantages of FD, SCFD adds back to the result of FD a diagonal matrix containing the lost spectral information. We further adopt SCFD to accelerate LinUCB and propose an efficient algorithm for high-dimensional linear contextual bandits, called Contextual Bandits via Spectral Compensation Frequent Directions (CBSCFD). Unlike the regularization term used in [Kuzborskij et al., 2019], which is invariant during the sequential decisions, the compensated diagonal matrix of SCFD changes adaptively in each round. Compared with previous methods for contextual bandits, our method provides better regret bounds in high-dimensional cases. In addition, our algorithm is much less sensitive to the regularization hyper-parameter, which means that our method is robust. The main contributions of this paper are summarized as follows:

- We propose the SCFD algorithm for matrix sketching. SCFD can approximate a sequence of incremental covariance matrices while preserving positive definite monotonicity.
- We propose CBSCFD for high-dimensional linear contextual bandits. Our method requires only O(md) time and O(md) space to update the model in each round. Moreover, our method obtains a regret bound of O((√(m + d log(1 + Δ_T)) + √Δ_T)·√(mT)), which is better than those of state-of-the-art methods. Here Δ_T is upper bounded by the spectral tail (the sum of the last d − m + 1 singular values) of the covariance matrix.
- We validate our approach on both synthetic and real-world data sets. The experimental results show that our algorithm outperforms other state-of-the-art algorithms in the high-dimensional setting.

2 Preliminaries
In this section, we first introduce the notation used in this paper. Then we describe the problem setting of linear contextual bandits. Finally, we review recent works on Frequent Directions.
2.1 Notation and Definition
Let X denote the context space and A = {1, . . . , K} denote the action space. Suppose the total number of rounds is T. We use the Õ notation to hide log(T) factors in the complexity analysis. We use I_m to denote the m × m identity matrix, and 0 to denote a zero vector or matrix of appropriate size. For a vector x ∈ R^d, let ‖x‖₂ be its ℓ₂-norm. For a positive semidefinite (PSD) matrix A, the weighted 2-norm of x is defined by ‖x‖_A = √(x^⊤Ax). We denote the inner product by ⟨·, ·⟩ and the weighted inner product by x^⊤Ay = ⟨x, y⟩_A. Let det(A) be the determinant of A. For two PSD matrices A and B, we write A ⪰ B to mean that A − B is PSD. Let ρ be the rank of a matrix A. The reduced SVD of A is A = UΣV^⊤ = Σ_{i=1}^{ρ} σ_i u_i v_i^⊤, where the σ_i are the positive singular values in descending order; that is, σ_i(A) is the i-th largest singular value of A. Let κ(A) = σ_max(A)/σ_min(A) be the condition number of A, and let (A)_k = Σ_{i=1}^{k} σ_i u_i v_i^⊤ denote the best rank-k approximation of A. Additionally, we let ‖A‖_F = √(Σ_{i,j} A²_{ij}) be the Frobenius norm and ‖A‖₂ = σ_max(A) be the spectral norm. We define positive definite monotonicity as follows:

Definition 1. A sequence of positive semidefinite matrices {A_i}_{i=1}^{T} satisfies positive definite monotonicity if and only if A_i ⪰ A_{i−1} for all i ∈ {2, 3, . . . , T}.

2.2 Problem Setting
We introduce the problem of linear contextual bandits [Chu et al., 2011; Abbasi-Yadkori et al., 2011]. At each round t, the learner receives contexts x_{t,a} for all a ∈ A. Then the learner chooses an action a_t and observes a reward y_t ∈ [0, 1]. The reward has the structure y_t = x_{t,a_t}^⊤ θ* + η_t, where θ* ∈ R^d is an unknown true parameter and η_t is conditionally R-sub-Gaussian noise. That is, for all λ ∈ R,

E[e^{λη_t} | x_{1,a_1}, . . . , x_{t,a_t}; η_1, . . . , η_{t−1}] ≤ exp(λ²R²/2).

This condition implies that E[η_t | x_{1,a_1}, . . . , x_{t,a_t}; η_1, . . . , η_{t−1}] = 0.
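The reward model above can be simulated with a few lines of NumPy. This is an illustrative sketch only: the function name and the choice of Gaussian contexts and Gaussian noise (which is R-sub-Gaussian when its standard deviation is R) are our own assumptions, not part of the paper.

```python
import numpy as np

def make_linear_bandit(d, K, R=0.1, seed=0):
    # Toy environment for the setting of Section 2.2 (illustrative names):
    # theta* is unknown to the learner, rewards are x^T theta* + eta_t, and
    # Gaussian noise with standard deviation R is R-sub-Gaussian.
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d)
    theta_star /= np.linalg.norm(theta_star)      # normalize ||theta*||_2 = 1

    def step():
        contexts = rng.standard_normal((K, d))    # x_{t,a} for a = 1, ..., K
        def reward(a):
            return float(contexts[a] @ theta_star + R * rng.standard_normal())
        return contexts, reward

    return theta_star, step
```

A learner would call `step()` once per round, pick an arm from `contexts`, and query `reward` only for the chosen arm, matching the bandit feedback model.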
The cumulative regret in this setting is defined by

R_T = Σ_{t=1}^{T} x_{t,a_t*}^⊤ θ* − Σ_{t=1}^{T} x_{t,a_t}^⊤ θ*,   (1)

where a_t* = argmax_{a∈A} x_{t,a}^⊤ θ* is the optimal action at round t. The goal of the learner is to minimize the regret R_T.

2.3 Frequent Directions
Frequent Directions [Liberty, 2013; Ghashami et al., 2016] is a deterministic matrix sketching method for the streaming model. It has been applied to speed up many online learning algorithms, including linear bandits [Kuzborskij et al., 2019; Luo et al., 2016; Luo et al., 2019]. For any given matrix X_T ∈ R^{T×d} whose rows arrive sequentially, FD aims to generate a matrix Z_T ∈ R^{m×d} such that X_T^⊤X_T ≈ Z_T^⊤Z_T. Here m ≪ min{d, T} is the sketching size. Suppose the given matrix is X_T = [x_1, x_2, . . . , x_T]^⊤. At round t, FD inserts x_t^⊤ as the last row of Z_{t−1} and performs the following manipulations:

[U, Σ, V] = SVD(Z_{t−1}),  δ_t = Σ²_{mm},  Z_t = √(max{Σ² − δ_t I, 0}) V^⊤.

By using the doubling-space technique, FD needs only O(md) time to update the sketch per round (see Section 3.2 of [Ghashami et al., 2016]). Moreover, the approximation error is bounded by

‖X_T^⊤X_T − Z_T^⊤Z_T‖₂ ≤ ‖X_T − (X_T)_k‖²_F / (m − k)

for any 0 < k < m. Some variants of FD have been proposed in recent years. Parameterized Frequent Directions [Desai et al., 2016] specifies the proportion of singular values shrunk in each round; Compensative Frequent Directions [Desai et al., 2016] increases the singular values of the sketching matrix. Both methods may improve performance empirically, but keep the same error bound as the original FD algorithm. Robust Frequent Directions [Luo et al., 2019] introduces an adaptive regularizer and improves the approximation error bound by a factor of 1/2. Though these variants improve on FD to different degrees, they all destroy the positive definite monotonicity of the covariance matrices and may increase the regret bound when applied to linear contextual bandits.
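The FD procedure above, with the doubling-space trick, can be sketched as follows. This is a minimal illustration of the standard algorithm, not the paper's code; the tolerance for dropping zero rows is our own choice.

```python
import numpy as np

def frequent_directions(X, m):
    # Plain FD with the doubling-space trick: buffer up to 2m rows, then
    # shrink all squared singular values by delta = sigma_m^2 and drop the
    # rows that become zero.  The sketch satisfies X^T X >= Z^T Z (PSD order).
    T, d = X.shape
    Z = np.zeros((0, d))
    for t in range(T):
        Z = np.vstack([Z, X[t]])
        if Z.shape[0] >= 2 * m:
            _, s, Vt = np.linalg.svd(Z, full_matrices=False)
            delta = s[m - 1] ** 2                      # squared m-th singular value
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            Z = s_shrunk[:, None] * Vt                 # diag(s_shrunk) @ V^T
            Z = Z[s_shrunk > 1e-12]                    # remove zeroed rows
    return Z
```

Because the buffer is rebuilt from at most m − 1 nonzero rows, the SVD is amortized over m + 1 insertions, which is the source of the O(md) per-round cost quoted above.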
3 Main Results
In this section, we first present our SCFD algorithm. Then we propose the CBSCFD algorithm for linear bandits based on SCFD and analyze its complexity. Finally, we provide a theoretical analysis of the regret bound of our algorithm.

3.1 Spectral Compensation Frequent Directions
Reviewing the procedure of Frequent Directions, we notice that FD subtracts a small term δ_t from the squared singular values in each round. This manipulation brings approximation errors and breaks positive definite monotonicity. Our idea is therefore to compensate the approximation with the lost spectral information. Specifically, we keep a counter α_t that adds up the total mass of the values subtracted during the Frequent Directions procedure. Then we compensate the reduced spectrum by adding the diagonal matrix α_t I_d to the sketch. Namely, we use a full-rank matrix to approximate the original matrix:

X_t^⊤X_t ≈ Z_t^⊤Z_t + α_t I_d,

where X_t = [x_1, x_2, . . . , x_t]^⊤. We present the procedure of SCFD in Algorithm 1, which also shows how the doubling-space technique [Ghashami et al., 2016] is applied to SCFD.

Algorithm 1 SCFD
Input: X_T = [x_1, . . . , x_T]^⊤, sketch size m, α_0 ≥ 0.
1: Z_0 = 0_{m×d}
2: for t = 1, 2, . . . , T do
3:   Z_t ← [Z_{t−1}^⊤, x_t]^⊤, α_t ← α_{t−1}
4:   if Z_t has 2m rows then
5:     Compute the SVD: [U, Σ, V] ← SVD(Z_t)
6:     δ_t ← σ_m²(Z_t), α_t ← α_{t−1} + δ_t
7:     Σ̂ ← √(max{Σ² − δ_t I, 0})
8:     Z_t ← Σ̂V^⊤
9:     Remove the zero rows of Z_t
10:  end if
11: end for
Output: Z_T, α_T

Notice that after performing lines 5–9, Z_t has at most m − 1 rows. Thus the if statement is triggered at most once every m + 1 iterations, which implies that the average time cost of SCFD is O(md) per round. Let Δ_t = Σ_{i=1}^{t} δ_i, so that α_t = α_0 + Δ_t. We have the following theorem:

Theorem 1. Assume that SCFD runs with α_0 = 0. Let Z_t^⊤Z_t + α_t I_d be the sketch of X_t^⊤X_t at round t. Then for all 0 < k < m,

‖Z_t^⊤Z_t + α_t I_d − X_t^⊤X_t‖₂ ≤ ‖X_t − (X_t)_k‖²_F / (m − k).

Proof.
Since α_t I_d is only added to the output, the sketch Z_t itself still satisfies the properties of the original FD. Properties 1 and 2 of Ghashami et al. [2016] show that

Z_t^⊤Z_t + Δ_t I_d ⪰ X_t^⊤X_t ⪰ Z_t^⊤Z_t.

Combined with the fact that α_t = α_0 + Δ_t = Δ_t, this gives

‖Z_t^⊤Z_t + α_t I_d − X_t^⊤X_t‖₂ ≤ Δ_t.

Using the fact that Δ_t ≤ ‖X_t − (X_t)_k‖²_F / (m − k) for all 0 < k < m [Ghashami et al., 2016, Theorem 3.1], we obtain the conclusion of Theorem 1. ∎

Theorem 1 shows that the error of SCFD is bounded by the spectral tail of the data matrix; in fact, SCFD has the same error bound as the original Frequent Directions. However, SCFD additionally guarantees that the sequence of approximation matrices {Z_t^⊤Z_t + α_t I_d}_{t=0}^{T} satisfies positive definite monotonicity, which is crucial in the regret analysis. We formally state this property as follows:

Property 1. Suppose α_0 ≥ 0. At each round t of SCFD, we have Z_t^⊤Z_t + α_t I_d ⪰ Z_{t−1}^⊤Z_{t−1} + α_{t−1} I_d.

Proof. Let Z = [Z_{t−1}^⊤, x_t]^⊤. Then

Z_t^⊤Z_t + δ_t I_d ⪰ Z^⊤Z ⪰ Z_{t−1}^⊤Z_{t−1}.   (2)

Hence, for any unit vector w,

w^⊤(Z_t^⊤Z_t + α_t I_d − Z_{t−1}^⊤Z_{t−1} − α_{t−1} I_d)w = w^⊤(Z_t^⊤Z_t + δ_t I_d − Z_{t−1}^⊤Z_{t−1})w ≥ 0,   (3)

where the last step holds because of Eq. (2). ∎

Another property is that the approximation matrices of SCFD are better conditioned than those of FD, and even than the original matrix, which indicates that SCFD makes the whole algorithm more stable. Let V_SCFD = Z_t^⊤Z_t + α_t I_d, V_FD = Z_t^⊤Z_t + α_0 I_d and V = X_t^⊤X_t + α_0 I_d. We have the following property:

Property 2. Suppose α_0 > 0. At each round t of SCFD, we have κ(V_SCFD) ≤ κ(V_FD) and κ(V_SCFD) ≤ κ(V).

Proof. In the high-dimensional setting, Z_t^⊤Z_t and X_t^⊤X_t are rank-deficient, so the smallest eigenvalues of V_SCFD, V_FD and V are α_t, α_0 and α_0, respectively. Since α_t ≥ α_0, we have

κ(V_SCFD) = (σ_max(Z_t^⊤Z_t) + α_t)/α_t ≤ (σ_max(Z_t^⊤Z_t) + α_0)/α_0 = κ(V_FD),
κ(V_SCFD) = (σ_max(Z_t^⊤Z_t) + α_t)/α_t ≤ (σ_max(X_t^⊤X_t) + α_t)/α_t ≤ (σ_max(X_t^⊤X_t) + α_0)/α_0 = κ(V). ∎

Remark 1. Theorem 1 and Property 2 show that our approximation matrices are both theoretically guaranteed and well-conditioned.
In this sense, the α_t selected by Algorithm 1 is optimal: setting α_t to a larger value would break Theorem 1, while setting it to a smaller value would worsen the condition number of the approximation matrices.

3.2 Contextual Bandits via Spectral Compensation Frequent Directions
Our method is a sketched version of LinUCB, which finds its solution in a confidence ellipsoid [Abbasi-Yadkori et al., 2011]. Let X_t = [x_{1,a_1}, x_{2,a_2}, . . . , x_{t,a_t}]^⊤ be the sequence of selected arms of a contextual bandit. The regularized covariance matrix is defined as V_t = X_t^⊤X_t + λI. In each round, LinUCB requires O(d²) time to update the inverse of the regularized covariance matrix, which is rather costly when d is large. We therefore adopt SCFD to approximate V_t in LinUCB by setting α_0 = λ. We present CBSCFD in Algorithm 2.

Algorithm 2 CBSCFD
Input: sketch size m, parameters λ > 0, β > 0.
1: θ̂_0 = 0, α_0 = λ, V̂_0 = α_0 I_d, Z_0 = 0_{m×d}
2: for t = 1, 2, . . . , T do
3:   Observe the contexts x_{t,a}
4:   Select a_t = argmax_{a∈A} { θ̂_{t−1}^⊤ x_{t,a} + β‖x_{t,a}‖_{V̂_{t−1}^{−1}} }
5:   Receive the reward y_t
6:   Z_t ← [Z_{t−1}^⊤, x_{t,a_t}]^⊤, α_t ← α_{t−1}
7:   if Z_t has 2m rows then
8:     Compute the SVD: [U, Σ, V] ← SVD(Z_t)
9:     δ_t ← σ_m²(Z_t), α_t ← α_{t−1} + δ_t
10:    Σ̂ ← √(max{Σ² − δ_t I, 0}), Z_t ← Σ̂V^⊤
11:    H_t ← (Σ̂² + α_t I)^{−1}
12:  else
13:    Compute H_t as in Eq. (4)
14:  end if
15:  V̂_t^{−1} ← (I − Z_t^⊤H_t Z_t)/α_t
16:  θ̂_t ← V̂_t^{−1} Σ_{i=1}^{t} x_{i,a_i} y_i
17: end for

We now show how CBSCFD efficiently computes the inverse of the approximated covariance matrix V̂_t^{−1}. Let H_t = (Z_t Z_t^⊤ + α_t I)^{−1}. By the Woodbury formula,

V̂_t^{−1} = (Z_t^⊤Z_t + α_t I)^{−1} = (I − Z_t^⊤H_t Z_t)/α_t.

When the if statement is triggered, we have Z_t = Σ̂V^⊤, so H_t can easily be computed as H_t = (Σ̂² + α_t I)^{−1}. When the if statement is not triggered, writing x_t for x_{t,a_t} and Z_t = [Z_{t−1}^⊤, x_t]^⊤, we can compute H_t incrementally as

H_t = [ H_{t−1} + pp^⊤/k,  −p/k ;  −p^⊤/k,  1/k ],   (4)

where p = H_{t−1}Z_{t−1}x_t and k = x_t^⊤x_t − x_t^⊤Z_{t−1}^⊤p + α_t. Since the size of H_t is at most 2m × 2m, CBSCFD requires at most O(md) time to compute H_t.
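The two inverse computations above can be checked numerically. The sketch below is illustrative (function names are ours): `apply_inv` implements the Woodbury identity V̂^{−1}v = (v − Z^⊤HZv)/α, and `update_H` implements the block-inverse update of Eq. (4), where k is the Schur complement of the enlarged matrix.

```python
import numpy as np

def apply_inv(Z, H, alpha, v):
    # Woodbury identity: (Z^T Z + alpha I_d)^{-1} v = (v - Z^T H Z v) / alpha,
    # with H = (Z Z^T + alpha I_m)^{-1} only m x m.  Only matrix-vector
    # products with Z are needed, so the cost is O(md).
    return (v - Z.T @ (H @ (Z @ v))) / alpha

def update_H(Z, H, alpha, x):
    # Block-inverse update of H when the row x is appended to Z (Eq. (4)):
    # p = H Z x and k = x^T x - x^T Z^T p + alpha is the Schur complement.
    p = H @ (Z @ x)
    k = x @ x - x @ (Z.T @ p) + alpha
    top = np.hstack([H + np.outer(p, p) / k, (-p / k)[:, None]])
    bottom = np.append(-p / k, 1.0 / k)
    return np.vstack([top, bottom])
```

Both operations avoid ever forming the d × d matrix V̂_t^{−1}, which is the point of line 15 of Algorithm 2: it is applied to vectors, never materialized.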
In addition, our method does not need to form the entries of V̂_t^{−1} explicitly, because all operations involving V̂_t^{−1} are matrix-vector multiplications. Therefore we only need to consider the matrix-vector multiplications involving Z_t, which require O(md) time. As discussed in Section 3.1, SCFD only needs to compute an SVD every m + 1 rounds, so the average time cost per round of CBSCFD is O(md). Since we only need to store Z_t and H_t, the space complexity of CBSCFD is O(md).

3.3 Regret Analysis
Define Y_t = [y_1, y_2, . . . , y_t]^⊤, so that θ̂_t = V̂_t^{−1}X_t^⊤Y_t. The regret upper bound of Algorithm 2 is summarized in the following theorem.

Theorem 2. Assume that ‖θ*‖₂ ≤ S, ‖x_{t,a}‖₂ ≤ L and η_t is R-sub-Gaussian for all t ∈ {1, 2, . . . , T}. If Algorithm 2 runs with β = β_T(δ), then with probability at least 1 − δ, the regret of Algorithm 2 satisfies

Regret(T) ≤ β_T(δ)·√(8mT log(1 + TL²/(mλ))),

where β_T(δ) = O(R√(log(1/δ) + m log(1 + TL²/(mλ)) + d log(1 + Δ_T/λ)) + S√(λ + Δ_T)).

Proof sketch. The first step is to find a confidence ellipsoid containing θ*. We can show that, with probability at least 1 − δ, θ* lies in the set

C_t = {θ : ‖θ̂_t − θ‖_{V̂_t} ≤ β_t(δ)}.   (5)

Algorithm   | Time cost per round | Space  | Upper regret bound
LinUCB      | O(d²)               | O(d²)  | O(d√T)
CBRAP       | O(md + m³)          | O(md)  | O(m√T + m^{−1/2}T)
SOFUL       | O(md)               | O(md)  | O((1 + Δ_T)^{3/2}(m + d log(1 + Δ_T))√T)
Our method  | O(md)               | O(md)  | O((√(m + d log(1 + Δ_T)) + √Δ_T)·√(mT))

Table 1: Comparison of our theoretical results with those of state-of-the-art approaches. Here the parameter λ is treated as a constant.

Notice that line 4 of Algorithm 2 is equivalent to solving

(x_{t,a_t}, θ̃_t) = argmax_{(x,θ) ∈ D_t × C_{t−1}} x^⊤θ,

where D_t is the decision set that x_{t,a} belongs to. Thus the action a_t chosen by Algorithm 2 satisfies x_{t,a_t}^⊤θ̃_t ≥ x_{t,a_t*}^⊤θ*.
We next consider the instantaneous regret r_t at round t:

r_t = x_{t,a_t*}^⊤θ* − x_{t,a_t}^⊤θ*
    ≤ x_{t,a_t}^⊤θ̃_t − x_{t,a_t}^⊤θ*
    = x_{t,a_t}^⊤(θ̂_{t−1} − θ*) + x_{t,a_t}^⊤(θ̃_t − θ̂_{t−1})
    ≤ ‖x_{t,a_t}‖_{V̂_{t−1}^{−1}} (‖θ̂_{t−1} − θ*‖_{V̂_{t−1}} + ‖θ̃_t − θ̂_{t−1}‖_{V̂_{t−1}})
    ≤ 2β_t(δ)‖x_{t,a_t}‖_{V̂_{t−1}^{−1}}.

Since r_t ≤ 2, we have r_t ≤ 2 min(β_t(δ)‖x_{t,a_t}‖_{V̂_{t−1}^{−1}}, 1) ≤ 2β_t(δ) min(‖x_{t,a_t}‖_{V̂_{t−1}^{−1}}, 1). Then

Regret(T) ≤ √(T Σ_{t=1}^{T} r_t²) ≤ 2β_T(δ)·√(2mT log(1 + TL²/(mλ))),

where the last step follows from the lemma

Σ_{t=1}^{T} min(1, ‖x_{t,a_t}‖²_{V̂_{t−1}^{−1}}) ≤ 2m log(1 + TL²/(mλ)).

3.4 Discussion
Briefly, our bound can be written as

O((√(m + d log(1 + Δ_T/λ)) + √(Δ_T/λ))·√(mT)).

According to Theorem 1, we have Δ_T ≤ ‖X_T − (X_T)_k‖²_F/(m − k) for all 0 < k < m. Thus Δ_T is very small when the covariance matrix is approximately low-rank. In particular, when the rank of the covariance matrix is less than m, we have Δ_T = 0 and our regret bound becomes O(m√T). When the covariance matrix is not low-rank, our method significantly reduces the influence of Δ_T compared with SOFUL [Kuzborskij et al., 2019]. Note that the regret bound of SOFUL is O((1 + Δ_T/λ)^{3/2}(m + d log(1 + Δ_T/λ))√T); our method reduces the order of Δ_T from 3/2 to 1/2. In addition, our regret bound decouples the dimension d from Δ_T, which further reduces the influence of Δ_T. Finally, our bound is less sensitive to the parameter λ (which is usually a small number in practice) because the term Δ_T/λ appears inside a logarithm. Overall, our method is more effective and robust than SOFUL. We summarize the comparison between our theoretical results and state-of-the-art methods in Table 1, which shows that our method has the best regret bound in the high-dimensional case.

4 Experiments
In this section, we empirically verify the efficiency and effectiveness of our CBSCFD algorithm on both synthetic and real-world data sets. The baseline approaches are LinUCB [Abbasi-Yadkori et al., 2011], CBRAP [Yu et al., 2017] and SOFUL [Kuzborskij et al., 2019]. For the CBRAP algorithm, we use a Gaussian random matrix as the projection matrix because it gives better performance.
We conduct all experiments on a Linux server with 8 processors and 32GB of memory. The code is implemented in Matlab R2017b.

4.1 Synthetic Data
We generate a synthetic data set with 100 arms and 2000 features per context. Specifically, all contexts x_{t,a} ∈ R^{2000} are drawn independently from the multivariate Gaussian distribution x_{t,a} ∼ MVN(1, I_{2000}). The true parameter is θ* = θ̄/‖θ̄‖₂, where θ̄ is drawn from the multivariate Gaussian distribution θ̄ ∼ MVN(0, I_{2000}). We set T = 1000 and repeat each experiment 20 times. The parameter β of all methods is searched in {10^{−4}, 10^{−3}, . . . , 1} and λ is searched in {2·10^{−4}, 2·10^{−3}, . . . , 2·10^{4}}. We choose the best values for each approach and report the average results in Figure 1.

In Figure 1, we find that the running time of LinUCB is much larger than that of the other three methods, which use matrix sketching. CBRAP is slightly faster than SOFUL and CBSCFD; the reason is that SOFUL and CBSCFD perform SVD decompositions, which are much slower than matrix multiplications even though they have the same asymptotic complexity. Comparing the cumulative regrets in Figure 1, we find that our CBSCFD outperforms the other approaches.

4.2 Online Classification
We next perform online classification to evaluate the performance of our method, following the experimental setup of [Kuzborskij et al., 2019]. Specifically, we fit the online classification problem into the contextual bandit setting as follows.

Figure 1: Comparison of the average cumulative regrets and running time on the synthetic data set.

Data set | #Samples | #Features | #Classes | Sketch size
MNIST    | 60000    | 784       | 10       | 10
CIFAR10  | 50000    | 3072      | 10       | 20

Table 2: Summary of data sets for contextual bandits.
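The classification-to-bandit reduction used in these experiments (one freshly drawn sample per class forms the K contexts; the reward is 1 iff the target class is picked) can be sketched as follows. The function name and the data layout (a mapping from class id to a matrix of samples) are hypothetical, for illustration only.

```python
import numpy as np

def bandit_round(samples_by_class, target_class, rng):
    # One round of the classification-as-bandit protocol: draw one sample
    # per class as the K contexts; reward is 1 iff the learner picks the
    # sample of the target class, and 0 otherwise.
    contexts, labels = [], []
    for c in sorted(samples_by_class):
        data = samples_by_class[c]
        contexts.append(data[rng.integers(len(data))])
        labels.append(c)
    def reward(chosen_arm):
        return 1.0 if labels[chosen_arm] == target_class else 0.0
    return np.stack(contexts), reward
```

The learner sees only the bandit feedback for the arm it pulls, so the number of online mistakes equals T minus the cumulative reward.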
Given a data set with samples in K classes, we first choose one class as the target class. In each round, the environment randomly draws one sample from each class to compose a set of K contexts. The learner chooses one sample from this set and observes a reward indicating whether the selected sample belongs to the target class: the reward is 1 if it does, and 0 otherwise. We perform our experiments on two real-world data sets, MNIST [LeCun et al., 1998] and CIFAR10 [Krizhevsky and Hinton, 2009], whose statistics are summarized in Table 2. For the MNIST data set we use sketch size m = 10; for CIFAR10 we set m = 20. As in the previous experiment, the parameter β is searched in {10^{−4}, 10^{−3}, . . . , 1} and λ is searched in {2·10^{−4}, 2·10^{−3}, . . . , 2·10^{4}}. We repeat each experiment 20 times and report the average numbers of online mistakes in Figure 2. Our CBSCFD algorithm outperforms all other methods on both MNIST and CIFAR10, which validates the effectiveness of our method.

4.3 Robustness of CBSCFD
Finally, we investigate the robustness of our method by comparing its sensitivity to the parameter λ with that of the baselines. Since CBRAP always sets λ = 1, we do not include it as a baseline here. In this experiment, we fix β = 0.01 and m = 10, and run on the MNIST data set with λ ∈ {2·10^{−4}, 2·10^{−3}, . . . , 2·10^{4}}. The results are presented in Figure 3. The performance of CBSCFD changes only slightly as λ varies, whereas the performance of the other two methods depends significantly on the choice of λ. This validates our theoretical analysis in Property 2 and shows that our method is more robust than the others.

4.4 Discussion
It is surprising to see that our method outperforms LinUCB, which adopts no approximations.
This happens because our experiments focus on the high-dimensional setting, i.e., d ≫ T.

Figure 2: Comparison of online mistakes among different contextual bandit algorithms on real-world data sets (left: MNIST, m = 10; right: CIFAR10, m = 20).

Figure 3: Total mistakes with different choices of the parameter λ.

When the iteration round t < d, the matrix X_t^⊤X_t is singular. LinUCB uses a constant regularization and enhances the spectrum by λ. When the regularization parameter λ is very small, the matrix V_t is ill-conditioned, which makes LinUCB unstable. On the other hand, CBSCFD enhances the spectrum by Δ_t + λ, where Δ_t is updated in each round and is no larger than the smallest nonzero singular value of X_t^⊤X_t. Thus the approximated covariance matrix may even perform better than the original one. Moreover, the covariance matrix of CBSCFD is better conditioned than that of LinUCB, which makes the algorithm more stable and less sensitive to λ.

5 Conclusions
In this paper, we presented SCFD, a variant of FD for matrix sketching, and proposed the CBSCFD algorithm for high-dimensional linear contextual bandits based on SCFD. Our method is much more efficient than LinUCB and requires less space. Compared with previous methods that use matrix sketching to accelerate linear contextual bandits, our method has a better upper regret bound and is more robust. Finally, our experiments demonstrate the efficiency, effectiveness and robustness of CBSCFD.

Acknowledgments
Luo Luo was partially supported by the Baidu academic collaboration program. The corresponding authors Yong Yu and Weinan Zhang thank the support of NSFC 61702327 and 61772333.
References

[Abbasi-Yadkori et al., 2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pages 127–135, 2013.

[Auer, 2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[Calandriello et al., 2019] Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, and Lorenzo Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Proceedings of the 32nd Annual Conference on Learning Theory, pages 533–557, 2019.

[Carpentier and Munos, 2012] Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 2012.

[Chu et al., 2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

[Dani et al., 2008] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.

[Desai et al., 2016] Amey Desai, Mina Ghashami, and Jeff M. Phillips. Improved practical matrix sketching with guarantees. IEEE Transactions on Knowledge and Data Engineering, 28(7):1678–1690, 2016.

[Deshpande and Montanari, 2012] Yash Deshpande and Andrea Montanari. Linear bandits in high dimension and recommendation systems.
In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1750–1754. IEEE, 2012.

[Dumitrascu et al., 2018] Bianca Dumitrascu, Karen Feng, and Barbara Engelhardt. PG-TS: Improved Thompson sampling for logistic contextual bandits. In Advances in Neural Information Processing Systems, pages 4629–4638, 2018.

[Filippi et al., 2010] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.

[Ghashami et al., 2016] Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.

[Krizhevsky and Hinton, 2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

[Kuzborskij et al., 2019] Ilja Kuzborskij, Leonardo Cella, and Nicolò Cesa-Bianchi. Efficient linear bandits through matrix sketching. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[Langford and Zhang, 2008] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.

[LeCun et al., 1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[Li et al., 2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[Li et al., 2017] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 2071–2080, 2017.
[Liberty, 2013] Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–588. ACM, 2013.

[Luo et al., 2016] Haipeng Luo, Alekh Agarwal, Nicolò Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems, pages 902–910, 2016.

[Luo et al., 2019] Luo Luo, Cheng Chen, Zhihua Zhang, Wu-Jun Li, and Tong Zhang. Robust frequent directions with application in online learning. Journal of Machine Learning Research, 20(45):1–41, 2019.

[Russo and Van Roy, 2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 2014.

[Tang et al., 2013] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 1587–1594. ACM, 2013.

[Tewari and Murphy, 2017] Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health. Springer, 2017.

[Yu et al., 2017] Xiaotian Yu, Michael R. Lyu, and Irwin King. CBRAP: Contextual bandits with random projection. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017.

[Zhao et al., 2014] Tong Zhao, Julian McAuley, and Irwin King. Leveraging social connections to improve personalized ranking for collaborative filtering. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 2014.