# Bilinear Bandits with Low-rank Structure

Kwang-Sung Jun¹, Rebecca Willett², Stephen Wright³, Robert Nowak³

¹Boston University  ²University of Chicago  ³University of Wisconsin-Madison. Correspondence to: Kwang-Sung Jun.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract. We introduce the bilinear bandit problem with low-rank structure, in which an action takes the form of a pair of arms from two different entity types, and the reward is a bilinear function of the known feature vectors of the arms. The unknown in the problem is a $d_1 \times d_2$ matrix $\Theta^*$ that defines the reward and has low rank $r \ll \min\{d_1, d_2\}$. Determining $\Theta^*$ under this low-rank structure poses a significant challenge in finding the right exploration-exploitation tradeoff. In this work, we propose a new two-stage algorithm called Explore-Subspace-Then-Refine (ESTR). The first stage is an explicit subspace exploration, while the second stage is a linear bandit algorithm called almost-low-dimensional OFUL (LowOFUL) that exploits and further refines the estimated subspace via a regularization technique. We show that the regret of ESTR is $\tilde{O}((d_1+d_2)^{3/2}\sqrt{rT})$, where $\tilde{O}$ hides logarithmic factors and $T$ is the time horizon, which improves upon the regret $\tilde{O}(d_1 d_2 \sqrt{T})$ attained by a naive linear bandit reduction. We conjecture that the regret bound of ESTR is unimprovable up to polylogarithmic factors, and our preliminary experiment shows that ESTR outperforms a naive linear bandit reduction.

1 Introduction

Consider a drug discovery application where scientists would like to choose a (drug, protein) pair and measure whether the pair exhibits the desired interaction (Luo et al., 2017). Over many repetitions of this step, one would like to maximize the number of discovered pairs with the desired interaction. Similarly, an online dating service may want to choose a (female, male) pair from the user pool, match them, and receive feedback about whether they like each other or not. For clothing websites, the recommendation system may want to choose a pair of items (top, bottom) for a customer, whose appeal depends in part on whether they match. In these applications, the two types of entities are recommended and evaluated as a unit. Having feature vectors of the entities available (obtained either directly from the entity description, such as hobbies or age, or by other preprocessing techniques, such as embedding), the system must explore and learn what features of the two entities jointly predict positive feedback in order to make effective recommendations. The recommendation system aims to obtain large rewards (the amount of positive feedback) but does not know ahead of time the relationship between the features and the feedback. The system thus faces two conflicting goals: choosing pairs that (i) maximally help estimate the relationship ("exploration") but may give small rewards, and (ii) return relatively large, but possibly suboptimal, rewards ("exploitation"), given the limited information obtained from the feedback collected so far. Such an exploration-exploitation dilemma can be formulated as a multi-armed bandit problem (Lai & Robbins, 1985; Auer et al., 2002).
When feature vectors are available for each arm, one can postulate simple reward structures such as (generalized) linear models to allow a large or even infinite number of arms (Auer, 2002; Dani et al., 2008; Abbasi-Yadkori et al., 2011; Filippi et al., 2010), a paradigm that has received much attention during the past decade, with applications such as online news recommendation (Li et al., 2010). Less is known for the situation we consider here, in which the recommendation (action) involves two different entity types and forms a bilinear structure. The closest work we are aware of is Kveton et al. (2017), whose action structure is the same as ours but without arm feature vectors. Factored bandits (Zimmert & Seldin, 2018) provide a more general view with $L$ entity types rather than two, but they utilize neither arm features nor the low-rank structure. Our problem is different from dueling bandits (Yue et al., 2012a) or bandits with an unknown user segment (Bhargava et al., 2017), which choose two arms from the same entity set rather than from two different entity types. Section 7 below contains detailed comparisons to related work.

This paper introduces the bilinear bandit problem with low-rank structure. In each round $t$, an algorithm chooses a left arm $x_t$ from $\mathcal{X} \subseteq \mathbb{R}^{d_1}$ and a right arm $z_t$ from $\mathcal{Z} \subseteq \mathbb{R}^{d_2}$, and observes a noisy reward of a bilinear form:

$$y_t = x_t^\top \Theta^* z_t + \eta_t, \qquad (1)$$

where $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$ is an unknown parameter and $\eta_t$ is a $\sigma$-sub-Gaussian random variable conditioned on $x_t$, $z_t$, and all the observations before (and excluding) time $t$. Denoting by $r$ the rank of $\Theta^*$, we assume that $r$ is small ($r \ll \min\{d_1, d_2\}$), which means that the reward is governed by a few factors. Such low-rank structure appears in many recommendation applications (Ma et al., 2008). Our choice of reward model is popular and arguably natural; for example, the same model was used in Luo et al. (2017) for drug discovery. The goal is to maximize the cumulative reward up to time $T$. Equivalently, we aim to minimize the cumulative regret (this definition is usually called pseudo-regret; we refer to Bubeck & Cesa-Bianchi (2012, Section 1) for details):

$$\mathrm{Regret}_T = \sum_{t=1}^{T} \Big[ \max_{x \in \mathcal{X},\, z \in \mathcal{Z}} x^\top \Theta^* z \;-\; x_t^\top \Theta^* z_t \Big]. \qquad (2)$$

A naive approach to this problem is to reduce the bilinear problem to a linear one, as follows:

$$x^\top \Theta^* z = \big\langle \operatorname{vec}(x z^\top),\, \operatorname{vec}(\Theta^*) \big\rangle. \qquad (3)$$

Throughout the paper, we focus on the regime in which the numbers of possible actions $N_1 = |\mathcal{X}| \in \mathbb{N}_+ \cup \{\infty\}$ and $N_2 = |\mathcal{Z}| \in \mathbb{N}_+ \cup \{\infty\}$ are much larger than the dimensions $d_1$ and $d_2$, respectively. (Otherwise, one can reduce the problem to the standard $K$-armed bandit problem and enjoy regret of $\tilde{O}(\sqrt{N_1 N_2 T})$. With SupLinRel (Auer, 2002), one may also achieve $\tilde{O}(\sqrt{d_1 d_2 T \log(N_1 N_2)})$, but this approach wastes a lot of samples and does not allow an infinite number of arms.) The reduction (3) allows us to use standard linear bandit algorithms (see, for example, Abbasi-Yadkori et al. (2011)) in the $d_1 d_2$-dimensional space and achieve regret of $\tilde{O}(d_1 d_2 \sqrt{T})$, where $\tilde{O}$ hides logarithmic factors. However, $d_1 d_2$ can be large, making this regret bound undesirably large. Moreover, the regret does not decrease as $r$ gets smaller, since the reduction prevents us from exploiting the low-rank structure. We address the following challenge: can we design an algorithm for the bilinear bandit problem that exploits the low-rank structure and enjoys regret strictly smaller than $\tilde{O}(d_1 d_2 \sqrt{T})$?

We answer the question in the affirmative by proposing Explore-Subspace-Then-Refine (ESTR), an approach that achieves a regret bound of $\tilde{O}((d_1+d_2)^{3/2}\sqrt{rT})$. ESTR consists of two stages. In the first stage, we estimate the row and column subspaces of $\Theta^*$ by randomly sampling from a carefully chosen subset of arms.
In the second stage, we leverage the estimated subspace by invoking an approach called almost-low-dimensional OFUL (LowOFUL), a variant of OFUL (Abbasi-Yadkori et al., 2011) that uses regularization to penalize the subspaces that are apparently not spanned by the rows and columns (respectively) of $\Theta^*$. We conjecture that our regret upper bound is minimax optimal up to polylogarithmic factors, based on the fact that the bilinear model has a much lower expected signal strength than the linear model. We provide a detailed argument on the lower bound in Section 5.

While the idea of having an explicit exploration stage, so-called Explore-Then-Commit (ETC), is not new, the way we exploit the subspace with LowOFUL is novel for two reasons. First, the standard ETC commits to the estimated parameter without refining it and is thus known to have $O(\sqrt{T})$ regret only for smooth arm sets such as the unit ball (Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2009). This means that refining the estimate is necessary for generic arm sets. Second, after the first stage outputs a subspace estimate, it is tempting to project all the arms onto the identified subspaces ($r$ dimensions for each of the row and column spaces) and naively invoke OFUL in the $r^2$-dimensional space. However, the subspace mismatch invalidates the upper confidence bound used in OFUL; i.e., the confidence bound does not actually bound the mean reward. Correcting the confidence bound so that it is faithful is nontrivial, and we are unaware of a correction that leads to improved regret bounds. Departing from completely committing to the identified subspaces, LowOFUL works with the full $d_1 d_2$-dimensional space but penalizes the subspace that is complementary to the estimated subspace, thus continuing to refine the subspace estimate. We calibrate the amount of regularization as a function of the subspace estimation error; this is the key to achieving our final regret bound.

We remark that our bandit problem can be modified slightly for the setting in which the arm $z_t$ is treated as a context obtained from the environment. This situation arises, for example, in recommendation systems where $\mathcal{Z}$ is the set of users represented by indicator vectors (i.e., $d_2 = N_2$) and $\mathcal{X}$ is the set of items. Such a setting is similar to Cesa-Bianchi et al. (2013), but we assume that $\Theta^*$ is low-rank rather than assuming known graph information. Furthermore, when user information is available, one can take $\mathcal{Z}$ to be the set of user feature vectors.

The paper is structured as follows. In Section 2, we define the problem formally and provide a sketch of the main contribution. Sections 3 and 4 describe the details of stages 1 and 2 of ESTR, respectively. We elaborate our conjecture on the regret lower bound in Section 5. After presenting our preliminary experimental results in Section 6, we discuss related work in Section 7 and propose future research directions in Section 8.

Input: time horizon $T$, the exploration length $T_1$, the rank $r$ of $\Theta^*$, and the spectral bounds $S_F$, $S_2$, and $S_r$ of $\Theta^*$.
Stage 1 (Section 3):
- Solve (approximately)

$$\operatorname*{arg\,max}_{\text{distinct } x^{(1)},\ldots,x^{(d_1)} \in \mathcal{X}} \ \sigma_{\min}\big([x^{(1)}, \ldots, x^{(d_1)}]\big) \qquad (4)$$

and define $\bar{\mathcal{X}} = \{x^{(1)}, \ldots, x^{(d_1)}\}$. Define $\bar{\mathcal{Z}}$ similarly.
- For $T_1$ rounds, choose a pair of arms from $\bar{\mathcal{X}} \times \bar{\mathcal{Z}}$, pulling each pair the same number of times to the extent possible. That is, choose each pair $\lfloor T_1/(d_1 d_2) \rfloor$ times, then choose the remaining $T_1 - d_1 d_2 \lfloor T_1/(d_1 d_2) \rfloor$ pairs uniformly at random without replacement. Let $\bar{K}$ be the matrix such that $\bar{K}_{ij}$ is the average reward from pulling the pair $(x^{(i)}, z^{(j)})$.
- Invoke a noisy matrix recovery algorithm (e.g., OptSpace (Keshavan et al., 2010)) with $\bar{K}$ and the rank $r$ to obtain an estimate $\hat{K}$.
- Let $\hat{\Theta} = X^{-1} \hat{K} (Z^\top)^{-1}$, where $X = [(x^{(1)})^\top; \cdots; (x^{(d_1)})^\top] \in \mathbb{R}^{d_1 \times d_1}$ (abusing notation) and $Z$ is defined similarly. Let $\hat{\Theta} = \hat{U} \hat{S} \hat{V}^\top$ be the SVD of $\hat{\Theta}$. Let $\hat{U}_\perp$ and $\hat{V}_\perp$ be orthonormal bases of the complementary subspaces of $\hat{U}$ and $\hat{V}$, respectively.
- Let $\gamma(T_1)$ be the subspace angle error bound such that, with high probability,

$$\|\hat{U}_\perp^\top U\|_F \, \|\hat{V}_\perp^\top V\|_F \le \gamma(T_1), \qquad (5)$$

where $\Theta^* = U S V^\top$ is the SVD of $\Theta^*$.

Stage 2 (Section 4):
- Rotate the arm sets: $\mathcal{X}' = \{[\hat{U}\ \hat{U}_\perp]^\top x : x \in \mathcal{X}\}$ and $\mathcal{Z}' = \{[\hat{V}\ \hat{V}_\perp]^\top z : z \in \mathcal{Z}\}$.
- Define a vectorized arm set so that the last $(d_1-r)(d_2-r)$ components come from the complementary subspaces:

$$\mathcal{A} = \Big\{ \big[\operatorname{vec}(x'_{1:r} (z'_{1:r})^\top);\ \operatorname{vec}(x'_{r+1:d_1} (z'_{1:r})^\top);\ \operatorname{vec}(x'_{1:r} (z'_{r+1:d_2})^\top);\ \operatorname{vec}(x'_{r+1:d_1} (z'_{r+1:d_2})^\top)\big] \in \mathbb{R}^{d_1 d_2} : x' \in \mathcal{X}',\ z' \in \mathcal{Z}' \Big\}.$$

- For $T_2 = T - T_1$ rounds, invoke LowOFUL with the arm set $\mathcal{A}$, the low dimension $k = (d_1+d_2)r - r^2$, and the bound $\gamma(T_1)$.

Figure 1. A sketch of Explore-Subspace-Then-Refine (ESTR).

2 Preliminaries

We define the problem formally as follows. Let $\mathcal{X} \subseteq \mathbb{R}^{d_1}$ and $\mathcal{Z} \subseteq \mathbb{R}^{d_2}$ be the left and right arm spaces, respectively. Define $N_1 = |\mathcal{X}|$ and $N_2 = |\mathcal{Z}|$. (Either or both can be infinite.) We assume that both the left and right arms have Euclidean norm at most 1: $\|x\|_2 \le 1$ and $\|z\|_2 \le 1$ for all $x \in \mathcal{X}$ and $z \in \mathcal{Z}$. Without loss of generality, we assume $\mathcal{X}$ ($\mathcal{Z}$) spans the whole $d_1$- ($d_2$-) dimensional space, respectively, since, if not, one can project the arm set onto a lower-dimensional space that is fully spanned. (In this case, we effectively work with a projected version of $\Theta^*$, and its rank may become smaller as well.) We assume $d_2 = \Theta(d_1)$ and define $d = \max\{d_1, d_2\}$.

If $A$ is a positive integer, we use the notation $[A] = \{1, 2, \ldots, A\}$. We denote by $v_{i:j}$ the $(j-i+1)$-dimensional vector formed by coordinates $i$ through $j$ of $v$. Similarly, we define $M_{i:j,\,k:\ell} \in \mathbb{R}^{(j-i+1)\times(\ell-k+1)}$ to be the submatrix of $M$ with row indices from $i$ to $j$ and column indices from $k$ to $\ell$. We denote by $v_i$ the $i$-th component of the vector $v$ and by $M_{ij}$ the entry of a matrix $M$ at the $i$-th row and $j$-th column. Denote by $\sigma_k(M)$ the $k$-th largest singular value, and define $\sigma_{\max}(M) = \sigma_1(M)$. Let $\sigma_{\min}(M)$ be the smallest nonzero singular value of $M$. $|M|$ denotes the determinant of a matrix $M$.

The protocol of the bilinear bandit problem is as follows. At time $t$, the algorithm chooses a pair of arms $(x_t, z_t) \in \mathcal{X} \times \mathcal{Z}$ and receives a noisy reward $y_t$ according to (1). We make the standard assumptions of linear bandits: the Frobenius and operator norms of $\Theta^*$ are bounded by known constants, $\|\Theta^*\|_F \le S_F$ and $\|\Theta^*\|_2 \le S_2$, and the sub-Gaussian scale $\sigma$ of $\eta_t$ is known to the algorithm. (When $S_2$ is not known, one can set $S_2 = S_F$. In some applications $S_2$ is known; for example, in the binary model $y_t \sim \mathrm{Bernoulli}\big((x_t^\top \Theta^* z_t + 1)/2\big)$, we can evidently set $S_2 = 1$.) We denote by $s_i^*$ the $i$-th largest singular value of $\Theta^*$. We assume that the rank $r$ of the matrix $\Theta^*$ is known and that $s_r^* \ge S_r$ for some known $S_r > 0$. (In practice, one can perform rank estimation after the first stage; see, for example, Keshavan et al. (2010).)
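To make the protocol concrete, the following minimal NumPy sketch (all values hypothetical) checks the identity behind the naive reduction (3) and simulates one observation of the model (1):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 8, 6, 2

# A rank-r parameter Theta* and unit-norm arms (illustrative values only).
Theta = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
x = rng.standard_normal(d1); x /= np.linalg.norm(x)
z = rng.standard_normal(d2); z /= np.linalg.norm(z)

# The bilinear reward in (1) equals its linear-bandit reduction (3):
# x^T Theta* z = <vec(x z^T), vec(Theta*)>.
assert np.isclose(x @ Theta @ z, np.outer(x, z).ravel() @ Theta.ravel())

# A noisy observation y_t as in (1), here with Gaussian noise of scale sigma.
sigma = 0.01
y = x @ Theta @ z + sigma * rng.standard_normal()
```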
The main contribution of this paper is the first nontrivial upper bound on the achievable regret for the bilinear bandit problem. In this section, we provide a sketch of the overall result and the key insight. For simplicity, we omit constants and variables other than $d$, $r$, and $T$. Our proposed ESTR algorithm enjoys the following regret bound, which strictly improves on the naive linear bandit reduction when $r \ll d$.

Theorem 1 (an informal version of Corollary 2). Under mild assumptions, the regret of ESTR is $\tilde{O}(d^{3/2}\sqrt{rT})$ with high probability.

We conjecture that this regret bound is minimax optimal up to polylogarithmic factors, since the expected signal strength in the bilinear model is much weaker than in the linear model. We elaborate on this argument in Section 5.

We describe ESTR in Figure 1. The algorithm proceeds in two stages. In the first stage, we estimate the column and row subspaces of $\Theta^*$ from noisy rank-one measurements using a matrix recovery algorithm. Specifically, we first identify $d_1$ and $d_2$ arms from the sets $\mathcal{X}$ and $\mathcal{Z}$, respectively, in such a way that the smallest singular values of the matrices formed from these arms are approximately maximized (see (4)), a form of submatrix selection problem (details in Section 3). We emphasize that finding the exact solution is not necessary, since Theorem 1 depends only mildly on the smallest singular value found when approximating (4). We then use the popular matrix recovery algorithm OptSpace (Keshavan et al., 2010) to estimate $\Theta^*$. The sin $\Theta$ theorem of Wedin (Stewart & Sun, 1990) is used to convert the matrix recovery error bound of OptSpace into the desired subspace angle guarantee (5) with $\gamma(T_1) = \tilde{O}(d^3 r / T_1)$. The regret incurred in stage 1 is bounded trivially by $2 S_2 T_1$, since the instantaneous regret is at most $2\max_{x,z} |x^\top \Theta^* z| \le 2 S_2$.

In the second stage, we transform the problem into a $d_1 d_2$-dimensional linear bandit problem and invoke LowOFUL, which we introduce in Section 4. This technique represents the arms in terms of both the estimated subspace and its complementary subspace and uses $\gamma(T_1)$ to penalize weights in the complementary subspaces $\hat{U}_\perp$ and $\hat{V}_\perp$. LowOFUL enjoys a regret bound of $\tilde{O}\big((dr + \sqrt{T}\,\gamma(T_1))\sqrt{T - T_1}\big)$ over the remaining $T - T_1$ rounds. Combining with the regret of the first stage, we obtain an overall regret of

$$\tilde{O}\Big( T_1 + \big(dr + \sqrt{T}\,\gamma(T_1)\big)\sqrt{T - T_1} \Big).$$

Choosing $T_1$ to minimize this expression, we obtain a regret bound of $\tilde{O}(d^{3/2}\sqrt{rT})$.

3 Stage 1: Subspace estimation

The goal of stage 1 is to estimate the row and column subspaces of the true parameter $\Theta^*$. How should we choose which arm pairs to pull, and what guarantee can we obtain on the subspace estimation error? One could apply a noisy matrix recovery algorithm based on affine rank minimization (Recht et al., 2010; Mohan & Fazel, 2010) to the measurements obtained from the arm pulls. However, these methods require the measurements to be Gaussian or Rademacher, so their guarantees depend on satisfying an RIP property (Recht et al., 2010) or, for rank-one projection measurements, an RUB property (Cai et al., 2015). Such assumptions are not suitable for our setting, since the measurements are restricted to the arbitrarily given arm sets $\mathcal{X}$ and $\mathcal{Z}$; uniform sampling from an arm set cannot guarantee RIP, as the arm set itself can be heavily biased in certain directions. We instead design a simple reduction through matrix recovery with noisy entry observations, leaving a more sophisticated treatment as future work.

The $d_1$ arms in $\bar{\mathcal{X}}$ are chosen according to the criterion (4), which is a combinatorial problem that is hard to solve exactly.
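Criterion (4) only needs to be solved approximately. As an illustration, here is a hypothetical greedy heuristic (our sketch, not the paper's implementation, which used simple methods such as random selection) that repeatedly adds the arm maximizing the smallest singular value of the stacked matrix:

```python
import numpy as np

def greedy_arm_selection(arms, d):
    """Greedily pick d arms (rows) so that sigma_min of the stacked
    d x d matrix is approximately maximized -- a heuristic for (4)."""
    pool = list(range(len(arms)))
    # Seed with the largest-norm arm.
    chosen = [max(pool, key=lambda i: np.linalg.norm(arms[i]))]
    pool.remove(chosen[0])
    while len(chosen) < d:
        def score(i):
            X = np.stack([arms[j] for j in chosen + [i]])
            return np.linalg.svd(X, compute_uv=False)[-1]  # smallest singular value
        best = max(pool, key=score)
        chosen.append(best)
        pool.remove(best)
    return chosen

# Example: select d1 = 8 arms from a pool of 100 unit vectors.
rng = np.random.default_rng(0)
X_pool = rng.standard_normal((100, 8))
X_pool /= np.linalg.norm(X_pool, axis=1, keepdims=True)
idx = greedy_arm_selection(X_pool, 8)
```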
Our analysis does not require an exact solution of (4), however; it is enough that the objective value is nonzero (that is, the matrix $X$ constructed from these $d_1$ arms is nonsingular). (Similar comments hold for the matrix $Z$.) We remark that problem (4) is shown to be NP-hard by Çivril & Magdon-Ismail (2009) and is related to finding submatrices with favorable spectral properties (Çivril & Magdon-Ismail, 2007; Tropp, 2009), but a thorough review of algorithms and their limits is beyond the scope of this paper. For our experiments, simple methods such as random selection were sufficient; we describe our implementation in the supplementary material.

If $K^*$ is the matrix defined by $K^*_{ij} = (x^{(i)})^\top \Theta^* z^{(j)}$, each time step of stage 1 obtains a noisy estimate of one element of $K^*$. Since multiple measurements of each entry are made in general, we compute the average measurement for each entry. A matrix recovery algorithm applied to this matrix of averaged measurements yields the estimate $\hat{K}$ of the rank-$r$ matrix $K^*$. Since $K^* = X \Theta^* Z^\top$, we estimate $\Theta^*$ by $\hat{\Theta} = X^{-1} \hat{K} (Z^\top)^{-1}$ and then compute the subspace estimates $\hat{U} \in \mathbb{R}^{d_1\times r}$ and $\hat{V} \in \mathbb{R}^{d_2\times r}$ by applying SVD to $\hat{\Theta}$.

We choose the recovery algorithm OptSpace of Keshavan et al. (2010) because of its strong (near-optimal) guarantee. Denoting the SVD of $K^*$ by $U R V^\top$, we use the matrix incoherence definition from Keshavan et al. (2010) and let $(\mu_0, \mu_1)$ be the smallest values such that, for all $i \in [d_1]$ and $j \in [d_2]$,

$$\sum_{k=1}^{r} U_{ik}^2 \le \mu_0 \frac{r}{d_1}, \qquad \sum_{k=1}^{r} V_{jk}^2 \le \mu_0 \frac{r}{d_2}, \qquad \Big|\sum_{k=1}^{r} U_{ik}\,\big(\sigma_k(K^*)/\sigma_{\max}(K^*)\big)\, V_{jk}\Big| \le \mu_1 \sqrt{\frac{r}{d_1 d_2}}.$$

Define the condition number $\kappa = \sigma_{\max}(K^*)/\sigma_{\min}(K^*)$. We present the guarantee of OptSpace (Keshavan et al., 2010) in a paraphrased form. (The proof of this result, and all subsequent proofs, are deferred to the supplementary material.)

Theorem 2. There exists a constant $C_0$ such that for $T_1 \ge C_0\, \sigma^2 \mu_0^2 \kappa^2\, \sigma_{\min}(K^*)^{-2}\, d\, r (r + \log d)$, we have, with probability at least $1 - \frac{2}{d^3}$,

$$\|\hat{K} - K^*\|_F \le C_1 \kappa^2 \sigma\, d^{3/2} \sqrt{\frac{r}{T_1}}, \qquad (6)$$

where $C_1$ is an absolute constant. The original theorem from Keshavan et al. (2010) assumes $T_1 \le d_1 d_2$ and does not allow repeated sampling. However, we show in the proof that the same guarantee holds for $T_1 > d_1 d_2$, since repeated sampling of entries has the effect of reducing the noise parameter $\sigma$.

Our recovery of an estimate $\hat{K}$ of $K^*$ implies the bound $\|\hat{\Theta} - \Theta^*\|_F \le \|X^{-1}\|_2\, \|Z^{-1}\|_2\, \theta_K$, where $\theta_K$ is the right-hand side of (6). However, our goal in stage 1 is to obtain bounds on the subspace estimation errors. That is, given the SVDs $\Theta^* = U S V^\top$ and $\hat{\Theta} = \hat{U} \hat{S} \hat{V}^\top$, we wish to quantify how close $\hat{U}$ ($\hat{V}$) is to $U$ ($V$, respectively). Such guarantees on the subspace error can be obtained via the sin $\Theta$ theorem of Stewart & Sun (1990), which we restate in our supplementary material. Roughly, this theorem bounds the canonical angles between two subspaces by the Frobenius norm of the difference between the two matrices. Recall that $s_r^*$ is the $r$-th largest singular value of $\Theta^*$.

Theorem 3. Suppose we invoke OptSpace to compute $\hat{K}$ as an estimate of the matrix $K^*$. After stage 1 of ESTR with $T_1$ satisfying the condition of Theorem 2, we have, with probability at least $1 - \frac{2}{d^3}$,

$$\|\hat{U}_\perp^\top U\|_F\, \|\hat{V}_\perp^\top V\|_F \le \frac{\theta^2}{(s_r^*)^2 - \theta^2}, \qquad (7)$$

where $\theta = \|X^{-1}\|_2\, \|Z^{-1}\|_2\, C_1 \kappa^2 \sigma d^{3/2} \sqrt{r/T_1}$ is the bound on $\|\hat{\Theta} - \Theta^*\|_F$ from above.
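Putting stage 1 together, here is a minimal sketch (assuming square, invertible $X$ and $Z$ built from the selected arms; a plain rank-$r$ SVD truncation of the averaged rewards stands in for OptSpace, and all helper names are hypothetical):

```python
import numpy as np

def stage1_subspace_estimate(X_sel, Z_sel, pull, T1, r, rng):
    """Estimate row/column subspaces of Theta*.
    X_sel: (d1, d1) selected left arms as rows; Z_sel: (d2, d2) right arms.
    pull(i, j) returns a noisy reward for the pair (x^(i), z^(j))."""
    d1, d2 = X_sel.shape[0], Z_sel.shape[0]
    sums = np.zeros((d1, d2)); counts = np.zeros((d1, d2))
    pairs = [(i, j) for i in range(d1) for j in range(d2)]
    full = (T1 // len(pairs)) * len(pairs)
    for t in range(T1):
        # Cycle through all pairs; sample the remainder at random.
        i, j = pairs[t % len(pairs)] if t < full else pairs[rng.integers(len(pairs))]
        sums[i, j] += pull(i, j); counts[i, j] += 1
    K_avg = sums / np.maximum(counts, 1)
    # Rank-r truncated SVD of the averaged matrix (OptSpace surrogate).
    u, s, vt = np.linalg.svd(K_avg)
    K_hat = u[:, :r] * s[:r] @ vt[:r]
    # Theta_hat = X^{-1} K_hat (Z^T)^{-1}, then SVD for the subspaces.
    Theta_hat = np.linalg.solve(X_sel, K_hat) @ np.linalg.inv(Z_sel.T)
    U_hat, _, Vt_hat = np.linalg.svd(Theta_hat)
    # Return (U_hat, U_perp, V_hat, V_perp).
    return U_hat[:, :r], U_hat[:, r:], Vt_hat[:r].T, Vt_hat[r:].T
```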
4 Stage 2: Almost-low-dimensional linear bandits

The goal of stage 2 is to exploit the subspaces $\hat{U}$ and $\hat{V}$ estimated in stage 1 to perform efficient bandit learning. At first, it is tempting to project all the left and right arms onto the $r$-dimensional subspaces using $\hat{U}$ and $\hat{V}$, respectively, which appears to yield a bilinear bandit problem with an $r \times r$ unknown matrix; one could then reduce it to an $r^2$-dimensional linear bandit problem and solve it with standard algorithms such as OFUL (Abbasi-Yadkori et al., 2011). Indeed, if $\hat{U}$ and $\hat{V}$ exactly spanned the row and column spaces of $\Theta^*$, this strategy would yield a regret bound of $\tilde{O}(r^2 \sqrt{T})$. In reality, these matrices (subspaces) are not exact, so there is a model mismatch that makes it difficult to apply standard regret analysis. The upper confidence bound (UCB) used in popular algorithms becomes invalid, and, to the best of our knowledge, there is no known correction that leads to a regret bound lower than $\tilde{O}(d_1 d_2 \sqrt{T})$.

In this section, we show how stage 2 of our approach avoids the mismatch issue by returning to the full $d_1 d_2$-dimensional space, allowing the subspace estimates to be inexact but penalizing those components that are complementary to $\hat{U}$ and $\hat{V}$. This effectively constrains the hypothesis space to be much smaller than the full $d_1 d_2$-dimensional space. We show how the bilinear bandit problem with good subspace estimates can be turned into the almost-low-dimensional linear bandit problem, and how much penalization/regularization is needed to achieve a low overall regret bound. Finally, we state our main theorem with the overall regret bound of ESTR.

Reduction to linear bandits. Recall that $\Theta^* = U S V^\top$ is the SVD of $\Theta^*$ (where $S$ is $r\times r$ diagonal) and that $\hat{U}_\perp$ and $\hat{V}_\perp$ span the complementary subspaces of $\hat{U}$ and $\hat{V}$, respectively. Let $M = [\hat{U}\ \hat{U}_\perp]^\top\, \Theta^*\, [\hat{V}\ \hat{V}_\perp]$ be a rotated version of $\Theta^*$. Then we have $\Theta^* = [\hat{U}\ \hat{U}_\perp]\, M\, [\hat{V}\ \hat{V}_\perp]^\top$ and

$$x^\top \Theta^* z = \big([\hat{U}\ \hat{U}_\perp]^\top x\big)^\top M \big([\hat{V}\ \hat{V}_\perp]^\top z\big).$$

Thus, the bilinear bandit problem with the unknown $\Theta^*$ and arm sets $\mathcal{X}$ and $\mathcal{Z}$ is equivalent to the one with the unknown $M$ and arm sets $\mathcal{X}' = \{[\hat{U}\ \hat{U}_\perp]^\top x : x \in \mathcal{X}\}$ and $\mathcal{Z}'$ (defined similarly). As mentioned earlier, this problem can be cast as a $d_1 d_2$-dimensional linear bandit problem with unknown vector $\theta^* = \operatorname{vec}(M)$. The difference, however, is that we have learned something about the subspace in stage 1. We define $\theta^*$ to be a rearranged version of $\operatorname{vec}(M)$ such that the last $(d_1-r)(d_2-r)$ dimensions of $\theta^*$ are $M_{ij}$ for $i \in \{r+1,\ldots,d_1\}$ and $j \in \{r+1,\ldots,d_2\}$; that is, letting $k = d_1 d_2 - (d_1-r)(d_2-r)$,

$$\theta^*_{1:k} = \big[\operatorname{vec}(M_{1:r,1:r});\ \operatorname{vec}(M_{r+1:d_1,1:r});\ \operatorname{vec}(M_{1:r,r+1:d_2})\big], \qquad \theta^*_{k+1:p} = \operatorname{vec}(M_{r+1:d_1,r+1:d_2}). \qquad (8)$$

Then we have

$$\|\theta^*_{k+1:p}\|_2^2 = \sum_{i>r}\sum_{j>r} M_{ij}^2 = \|\hat{U}_\perp^\top (U S V^\top)\, \hat{V}_\perp\|_F^2 \le S_2^2\, \|\hat{U}_\perp^\top U\|_F^2\, \|\hat{V}_\perp^\top V\|_F^2, \qquad (9)$$

which implies $\|\theta^*_{k+1:p}\|_2 = \tilde{O}(d^3 r / T_1)$ by Theorem 3.
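As an illustration, the following minimal sketch (hypothetical helper; assumes $\hat{U}$ and $\hat{V}$ have orthonormal columns) builds the rearranged arm vector of (8) for one pair $(x, z)$:

```python
import numpy as np

def rearranged_arm(x, z, U_hat, V_hat):
    """Map an arm pair (x, z) to the vectorized arm of (8):
    rotated coordinates, with the complement-complement block last."""
    d1, r = U_hat.shape
    d2 = V_hat.shape[0]
    # Orthonormal bases of the complementary subspaces.
    U_perp = np.linalg.svd(U_hat, full_matrices=True)[0][:, r:]
    V_perp = np.linalg.svd(V_hat, full_matrices=True)[0][:, r:]
    xr = np.concatenate([U_hat.T @ x, U_perp.T @ x])  # [U_hat U_perp]^T x
    zr = np.concatenate([V_hat.T @ z, V_perp.T @ z])  # [V_hat V_perp]^T z
    M = np.outer(xr, zr)                              # arm in the rotated matrix space
    a = np.concatenate([M[:r, :r].ravel(),            # subspace-subspace block
                        M[r:, :r].ravel(),            # complement-subspace
                        M[:r, r:].ravel(),            # subspace-complement
                        M[r:, r:].ravel()])           # complement-complement (last)
    k = d1 * d2 - (d1 - r) * (d2 - r)                 # the "almost-low" dimension
    return a, k
```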
Our knowledge of the subspace thus translates into knowledge of the norm of certain coordinates! Can we exploit this knowledge to enjoy a regret bound better than $\tilde{O}(d_1 d_2 \sqrt{T})$? We answer this question in the affirmative below.

Almost-low-dimensional OFUL (LowOFUL). We now focus on an abstraction of the conversion described in the previous paragraph, which we call the almost-low-dimensional linear bandit problem. In the standard linear bandit problem in $p$ dimensions, the player chooses an arm $a_t$ at time $t$ from an arm set $\mathcal{A} \subseteq \mathbb{R}^p$ and observes a noisy reward $y_t = \langle a_t, \theta^* \rangle + \eta_t$, where the noise $\eta_t$ has the same properties as in (1). We assume $\|a\|_2 \le 1$ for all $a \in \mathcal{A}$ and $\|\theta^*\|_2 \le B$ for some known constant $B > 0$. In almost-low-dimensional linear bandits, we have the additional knowledge that $\|\theta^*_{k+1:p}\|_2 \le B_\perp$ for some index $k$ and some constant $B_\perp$ (ideally $B_\perp \ll B$). This means that all but $k$ dimensions of $\theta^*$ are close to zero. To exploit this extra knowledge about the unknown, we propose almost-low-dimensional OFUL (LowOFUL), which extends the standard linear bandit algorithm OFUL (Abbasi-Yadkori et al., 2011).

To describe OFUL, define the design matrix $A \in \mathbb{R}^{t\times p}$ with rows $a_s^\top$, $s = 1, 2, \ldots, t$, and the vector of rewards $y = [y_1,\ldots,y_t]^\top$. The key estimator is based on regression with the standard squared 2-norm regularizer:

$$\hat\theta_t = \operatorname*{arg\,min}_\theta\ \|A\theta - y\|_2^2 + \lambda \|\theta\|_2^2 = (\lambda I + A^\top A)^{-1} A^\top y.$$

OFUL then defines a confidence ellipsoid around $\hat\theta_t$, from which one can compute an upper confidence bound on the mean reward of any arm. In our variant, we allow a different regularization for each coordinate, replacing the regularizer $\lambda\|\theta\|_2^2$ with $\|\theta\|_\Lambda^2 = \theta^\top \Lambda \theta$ for a positive diagonal matrix $\Lambda$. Specifically, we define $\Lambda = \operatorname{diag}(\lambda,\ldots,\lambda,\lambda_\perp,\ldots,\lambda_\perp)$, where $\lambda$ occupies the first $k$ diagonal entries and $\lambda_\perp$ the last $p-k$. With this modification, the estimator becomes

$$\hat\theta_t = \operatorname*{arg\,min}_\theta\ \|A\theta - y\|_2^2 + \|\theta\|_\Lambda^2 = (\Lambda + A^\top A)^{-1} A^\top y.$$

Algorithm 1 LowOFUL
1: Input: $T$, $k$, the arm set $\mathcal{A} \subseteq \mathbb{R}^p$, failure rate $\delta$, and positive constants $B$, $B_\perp$, $\lambda$, $\lambda_\perp$.
2: Set $\Lambda = \operatorname{diag}(\lambda,\ldots,\lambda,\lambda_\perp,\ldots,\lambda_\perp)$, where $\lambda$ occupies the first $k$ diagonal entries.
3: for $t = 1, 2, \ldots, T$ do
4:   Compute $a_t = \operatorname{arg\,max}_{a\in\mathcal{A}}\ \max_{\theta \in c_{t-1}} \langle \theta, a \rangle$.
5:   Pull arm $a_t$.
6:   Receive reward $y_t$.
7:   Set $c_t$ as in (12).
8: end for

Define $V_t = \Lambda + \sum_{s=1}^{t} a_s a_s^\top = \Lambda + A^\top A$, and let $\delta$ be the failure rate we are willing to endure. The confidence ellipsoid for $\theta^*$ is

$$c_t = \Big\{ \theta : \|\theta - \hat\theta_t\|_{V_t} \le \beta_t \Big\}, \qquad \text{where } \beta_t = \sigma \sqrt{\log\frac{|V_t|}{|\Lambda|\,\delta^2}} + \sqrt{\lambda}\, B + \sqrt{\lambda_\perp}\, B_\perp. \qquad (12)$$

This ellipsoid enjoys the following guarantee, which is a direct consequence of Valko et al. (2014, Lemma 3), itself based on the self-normalized martingale inequality of Abbasi-Yadkori et al. (2011, Theorem 1).

Lemma 1. With probability at least $1-\delta$, we have $\theta^* \in c_t$ for all $t \ge 1$.

We summarize LowOFUL in Algorithm 1, where $\max_{\theta\in c_{t-1}} \langle \theta, a \rangle$ can be simplified to $\langle \hat\theta_{t-1}, a \rangle + \beta_{t-1}\, \|a\|_{V_{t-1}^{-1}}$. We now state the regret bound of LowOFUL in Theorem 4, which follows the standard linear bandit regret analysis dating back to Auer (2002).

Theorem 4. The regret of LowOFUL is, with probability at least $1-\delta$, $\tilde{O}\big(\beta_T \sqrt{T \log(|V_T|/|\Lambda|)}\big)$.

In the standard linear bandit setting where $\lambda_\perp = \lambda$ and $B_\perp = B$, we recover the regret bound $\tilde{O}(p\sqrt{T})$ of OFUL, since $\log(|V_T|/|\Lambda|) = \tilde{O}(p)$ (Abbasi-Yadkori et al., 2011, Lemma 10). To alleviate the dependence on $p$ in the regret bound, we propose a carefully chosen value of $\lambda_\perp$ in the following corollary.

Corollary 1. The regret of LowOFUL with $\lambda_\perp = \frac{T}{k \log(1 + T/\lambda)}$ is, with probability at least $1-\delta$, $\tilde{O}\big((\sigma\sqrt{k} + \sqrt{\lambda}\, B + \sqrt{T}\, B_\perp)\sqrt{kT}\big)$.

The bound improves the dependence on the dimensionality from $p$ to $k$, but introduces an extra factor of $\sqrt{T}$ on $B_\perp$, resulting in linear regret. While this choice is not interesting in general, it is useful in our case: since $\|\theta^*_{k+1:p}\|_2 = \tilde{O}(1/T_1)$ (hiding $d$ and $r$), we can set $B_\perp = \tilde{O}(1/T_1)$ as a valid upper bound on $\|\theta^*_{k+1:p}\|_2$. By setting $T_1 = \tilde\Theta(\sqrt{T})$, the regret bound in Corollary 1 then scales with $\sqrt{T}$ rather than $T$. Concretely, using (9), we set

$$B = S_F \quad\text{and}\quad B_\perp = S_2\, \gamma(T_1), \qquad (14)$$

where $\gamma(T_1)$ is the right-hand side of (7) with $\theta = \|X^{-1}\|_2\, \|Z^{-1}\|_2\, C_1 \kappa^2 \sigma d^{3/2}\sqrt{r/T_1}$. $B$ and $B_\perp$ are valid upper bounds of $\|\theta^*\|_2$ and $\|\theta^*_{k+1:p}\|_2$, respectively, with high probability. Note that we must use $S_2$, $S_r$, and $S_2/S_r$ in place of $s_1^*$, $s_r^*$, and $\kappa$, respectively, since the latter are unknown to the learner.

Overall regret. Theorem 5 states the overall regret bound of ESTR.

Theorem 5. Suppose we run ESTR (Figure 1) with $T_1 \ge C_0\, \sigma^2 \mu_0^2 \kappa^2\, \sigma_{\min}(K^*)^{-2}\, dr(r + \log d)$. We invoke LowOFUL in stage 2 with $p = d_1 d_2$, $k = r(d_1 + d_2 - r)$, $\theta^*$ defined as in (8), the rotated arm set $\mathcal{A}$, $\lambda_\perp = \frac{T_2}{k\log(1+T_2/\lambda)}$, and $B$ and $B_\perp$ as in (14). The regret of ESTR is then, with probability at least $1 - \delta - \frac{2}{d^3}$,

$$\tilde{O}\Big( S_2\, T_1 + \big(\sqrt{k} + \sqrt{T}\, S_2\,\gamma(T_1)\big)\sqrt{T - T_1} \Big), \quad \text{with } \gamma(T_1) = \tilde{O}\big(\|X^{-1}\|_2^2\, \|Z^{-1}\|_2^2\, \kappa^4 \sigma^2 d^3 r / (S_r^2\, T_1)\big),$$

where the first term bounds the stage-1 regret and the second term applies Corollary 1 to stage 2. One can see that there exists an optimal choice of $T_1$, which we state in the following corollary.

Corollary 2. Suppose the assumptions in Theorem 5 hold. If $T_1 = \tilde\Theta\big(\|X^{-1}\|_2\, \|Z^{-1}\|_2\, (S_2^2/S_r^3)\, \sigma\, d^{3/2}\sqrt{rT}\big)$, then the regret of ESTR is, with probability at least $1 - \delta - \frac{2}{d^3}$, $\tilde{O}\big(\|X^{-1}\|_2\, \|Z^{-1}\|_2\, (S_2^3/S_r^3)\, \sigma\, d^{3/2}\sqrt{rT}\big)$.

Note that, for our problem, the incoherence constants $\mu_0$ and $\mu_1$ do not play an important role for large enough $T$.
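For concreteness, here is a minimal sketch of Algorithm 1 for a finite arm set, with $\beta_t$ as in (12). This is our reading of LowOFUL under the stated assumptions, not the authors' implementation; all names are hypothetical:

```python
import numpy as np

def low_oful(arms, T, k, B, B_perp, lam, lam_perp, sigma, delta, pull):
    """LowOFUL (Algorithm 1): OFUL with per-coordinate regularization
    that penalizes the last p - k coordinates via lam_perp >> lam."""
    p = arms.shape[1]
    Lam = np.concatenate([np.full(k, lam), np.full(p - k, lam_perp)])
    V = np.diag(Lam)                 # V_t = Lambda + A^T A
    Aty = np.zeros(p)                # running A^T y
    logdet_Lam = np.sum(np.log(Lam))
    for t in range(T):
        theta = np.linalg.solve(V, Aty)          # (Lambda + A^T A)^{-1} A^T y
        _, logdet_V = np.linalg.slogdet(V)
        # beta_t as in (12): self-normalized term plus sqrt(lam)B + sqrt(lam_perp)B_perp.
        beta = (sigma * np.sqrt(logdet_V - logdet_Lam + 2.0 * np.log(1.0 / delta))
                + np.sqrt(lam) * B + np.sqrt(lam_perp) * B_perp)
        Vinv = np.linalg.inv(V)
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, Vinv, arms))
        a = arms[np.argmax(arms @ theta + beta * widths)]  # optimistic arm
        y = pull(a)                                        # observe noisy reward
        V += np.outer(a, a)
        Aty += y * a
    return theta
```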
Remark. One might notice that we could also regularize the submatrices $M_{r+1:d_1,1:r}$ and $M_{1:r,r+1:d_2}$, since they come partly from the complementary subspace of $\hat{U}$ and partly from the complement of $\hat{V}$ (but not both). In practice, such regularization can reduce the regret slightly, but it does not affect the order of the regret; the decrease in magnitude is not sufficient to yield interesting bounds. One can show that, while $\|M_{r+1:d_1,r+1:d_2}\|_F^2 = \tilde{O}(1/T_1^2)$, the quantities $\|M_{1:r,r+1:d_2}\|_F^2$ and $\|M_{r+1:d_1,1:r}\|_F^2$ are only $\tilde{O}(1/T_1)$.

5 Lower bound

A simple lower bound is $\Omega(d\sqrt{T})$: when the arm set $\mathcal{Z}$ is a singleton, the problem reduces to a $d_1$-dimensional linear bandit problem. We have attempted to extend the existing lower-bound proof techniques of Rusmevichientong & Tsitsiklis (2010), Dani et al. (2008), and Lattimore & Szepesvári (2018), but the bilinear nature of the problem introduces cross terms between the left and right arms that are difficult to deal with in general. Nevertheless, we conjecture that the lower bound is $\Omega(d^{3/2}\sqrt{rT})$. We provide an informal argument below that the dependence on $d$ must be $d^{3/2}$, based on the observation that the signal-to-noise ratio (SNR) of the rank-one bilinear reward model is significantly worse than that of the linear reward model.

Consider a rank-one $\Theta^*$ that can be decomposed as $u v^\top$ for some $u, v \in \{\pm 1/\sqrt{d}\}^d$. Suppose the left and right arm sets are $\mathcal{X} = \mathcal{Z} = \{\pm 1/\sqrt{d}\}^d$, and let us choose $x_t$ and $z_t$ uniformly at random (the sort of pure exploration that must be performed initially). A simple calculation shows that the expected squared signal strength of such a random choice factorizes over the two independent arms:

$$\mathbb{E}\big[(x_t^\top \Theta^* z_t)^2\big] = \mathbb{E}\big[(x_t^\top u)^2\big]\,\mathbb{E}\big[(v^\top z_t)^2\big] = \frac{1}{d}\cdot\frac{1}{d} = \frac{1}{d^2}.$$

In contrast, the expected squared signal strength for a linear reward model with parameter $u$ is $\mathbb{E}\big[(x_t^\top u)^2\big] = 1/d$. The effect is analogous to increasing the sub-Gaussian scale of the noise $\eta_t$ by a factor of $\sqrt{d}$. We thus conjecture that this $\sqrt{d}$ difference in the SNR introduces a dependence of $d^{3/2}$ in the regret rather than $d$.
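The $1/d^2$ versus $1/d$ calculation is easy to check numerically; a minimal Monte Carlo sketch (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 200_000
u = rng.choice([-1.0, 1.0], d) / np.sqrt(d)   # rank-one Theta* = u v^T
v = rng.choice([-1.0, 1.0], d) / np.sqrt(d)
x = rng.choice([-1.0, 1.0], (n, d)) / np.sqrt(d)   # uniform random left arms
z = rng.choice([-1.0, 1.0], (n, d)) / np.sqrt(d)   # uniform random right arms

bilinear = (x @ u) * (z @ v)   # x^T (u v^T) z
linear = x @ u                 # linear model with parameter u
print(np.mean(bilinear**2), 1 / d**2)   # both approximately 1/d^2
print(np.mean(linear**2), 1 / d)        # both approximately 1/d
```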
6 Experiments

We present a preliminary experimental result and discuss practical concerns. Bandits in practice require tuning the exploration rate to perform well, which is usually done by adjusting the confidence bound width (Chapelle & Li, 2011; Li et al., 2010; Zhang et al., 2016); this amounts to replacing $\beta_t$ with $c\beta_t$ for some $c > 0$ in OFUL or its variants (including LowOFUL). Efficient parameter tuning in bandits is an open problem and beyond our scope. For the sake of comparison, we tune $c$ by grid search and report the result with the smallest average regret. For ESTR, the value of $T_1$ used in the proof involves unknown constants; to account for this, we tune $T_1$ by grid search as well. We consider the following methods:

- OFUL: the reduction described in (3), which ignores the low-rank structure.
- ESTR-OS: our proposed method; we simplify $B_\perp$ in (14) to $S_2 \sigma^2 d^3 r / T_1$.
- ESTR-BM: we replace OptSpace with the Burer-Monteiro formulation and perform alternating minimization (Burer & Monteiro, 2003).
- ISSE (Implicit SubSpace Exploration): LowOFUL with a heuristic subspace estimation that avoids an explicit exploration stage. We split the time steps into intervals with knots at $t \in \{10^{0.5}, 10^{1}, 10^{1.5}, \ldots\}$. At the beginning $t$ of each interval, we perform matrix recovery with the Burer-Monteiro formulation using all the past data, estimate the subspaces, and use them to initialize LowOFUL with $B_\perp = S_2 \sigma^2 d^3 r / t$ and all the past data.

Note that OFUL and ISSE require tuning only $c$, whereas the ESTR methods require tuning both $c$ and $T_1$. We run our simulation with $d_1 = d_2 = 8$, $r = 1$, and $\sigma = 0.01$. We set $\lambda = 1$ for both OFUL and LowOFUL. We draw 16 arms from the unit sphere for each arm set $\mathcal{X}$ and $\mathcal{Z}$ and simulate the bandit game for $T = 10^4$ iterations, which we repeat 60 times for each method.

[Figure 2. Simulation results for $d = 8$ and $r = 1$. Our method ESTR-OS, its variant ESTR-BM, and an implicit exploration variant of ESTR called ISSE all outperform the baseline linear bandit method OFUL.]

Figure 2 plots the average regret of the methods and the 95% confidence intervals. All the methods outperform OFUL, and the regret differences from OFUL are statistically significant. We observe that ESTR-BM performs better than ESTR-OS. We believe this is due to our limit of 1000 on the number of iterations of OptSpace, which we imposed due to its slow convergence in our experiments. (We used the authors' C implementation, which is wrapped in Python: https://github.com/strin/pyOptSpace.) The Burer-Monteiro formulation, in contrast, converged within 200 iterations. Finally, ISSE performs close to ESTR-BM, but with larger variance. Although ISSE does not have a theoretical guarantee, it does not require tuning $T_1$ and performs better than OFUL.

7 Related work

A few existing studies consider pulling a pair of arms as a unit action, as we do. Kveton et al. (2017) consider the $K$-armed bandit with $N_1$ left arms and $N_2$ right arms. The expected rewards can be represented as a matrix $R \in \mathbb{R}^{N_1\times N_2}$, which the authors assume has rank $r \ll \min\{N_1, N_2\}$. The main difference from our setting is that they do not assume arm features are available, so our work relates to Kveton et al. (2017) in the same way that linear bandits relate to $K$-armed bandits. The problem considered in Katariya et al. (2017b) is essentially a rank-one version of Kveton et al. (2017), motivated by a click-feedback model called the position-based model with $N_1$ items and $N_2$ positions. This work is further extended to a tighter KL-based bound by Katariya et al. (2017a). All these studies successfully exploit the low-rank structure to enjoy regret bounds that scale with $r(N_1+N_2)$ rather than $N_1 N_2$. Zimmert & Seldin (2018) propose a more generic problem called factored bandits, whose action set is a product of $L$ atomic action sets rather than two. While they achieve generality by not requiring knowledge of the explicit reward model, factored bandits leverage neither the known arm features nor the low-rank structure, resulting in large regret in our problem.

Other works exploit low-rank structure in the reward matrix when the action is a single arm pull. Sen et al. (2017) consider the contextual bandit setting with $N_1$ discrete contexts and $N_2$ arms, but do not take into account observed features of contexts or arms. Under the so-called separability assumption, the authors use the Hottopix algorithm to exploit the low-rank structure. Gopalan et al. (2016) consider a similar setting but employ the robust tensor power method for recovery. Kawale et al. (2015) study essentially the same problem, but place assumptions on the prior that generates the unknown matrix and perform online matrix factorization with particle filtering to leverage the low-rank structure. These studies also exploit the low-rank structure successfully and enjoy regret bounds that scale much better than $N_1 N_2$.
There has been a plethora of contextual bandit studies that exploit structures other than low-rank-ness, where the context is usually the user identity or features. For example, Gentile et al. (2014) and its follow-up studies (Li et al., 2016; Gentile et al., 2017) leverage the clustering structure of the contexts. In Cesa-Bianchi et al. (2013) and Vaswani et al. (2017), a graph structure on the users is leveraged to achieve regret lower than that of running bandits on each context (i.e., user) independently. Deshmukh et al. (2017) introduce a multitask learning view and exploit arm similarity information via kernels, but their regret guarantee is valid only when the similarity is known ahead of time. In this vein, if we think of the right arm set $\mathcal{Z}$ as tasks, we effectively assume different parameters for each task but with a low-rank structure; that is, the parameters can be written as linear combinations of a few hidden factors, which are estimated on the fly rather than known in advance. Johnson et al. (2016) consider low-rank structured bandits in a different setup. Their reward model has expected reward of the form $\operatorname{tr}(X_t^\top \Theta^*)$ with arm $X_t \in \mathbb{R}^{d\times p}$ and unknown $\Theta^* \in \mathbb{R}^{d\times p}$. While $X_t$ corresponds to $x_t z_t^\top$ in our setting, they consider a continuous arm set only, so their algorithm cannot be applied to our problem.

Our subroutine LowOFUL is quite similar to SpectralUCB of Valko et al. (2014), which is designed specifically for graph-structured arms in which the expected rewards of two arms are close to each other (i.e., "smooth") when there is an edge between them. Although the technical ingredients for Corollary 1 stem from Valko et al. (2014), LowOFUL addresses an inherently different setup: we design the regularization matrix $\Lambda$ to maximally exploit the subspace knowledge and minimize the regret, rather than receiving $\Lambda$ from the environment as part of the problem definition. Gilton & Willett (2017) study a similar regularizer in the context of sparse linear bandits, under the assumption that a superset of the sparse locations is known ahead of time. Yue et al. (2012b) consider a setup similar to LowOFUL; they assume an estimate of the subspace is available, but their regret bound still depends on the total dimension $p$.

8 Conclusion

In this paper, we introduced the bilinear low-rank bandit problem and proposed the first algorithm with a nontrivial regret guarantee. Our study opens up several future research directions. First, there is currently no nontrivial lower bound, and whether the regret of $\tilde{O}(d^{3/2}\sqrt{rT})$ is tight remains open. Second, while our algorithm improves the regret bound over the trivial linear bandit reduction, it requires tuning an extra parameter $T_1$. It would be more natural to continuously update the subspace estimate and the amount of regularization, as ISSE does. However, proving a theoretical guarantee would be challenging, since most matrix recovery algorithms require some sort of uniform sampling with a nice set of measurements. We speculate that one could employ randomized arm selection and use importance-weighted data to perform effective and provable matrix recovery on the fly.

Acknowledgements

This work was supported by NSF 1447449 - IIS, NIH 1 U54 AI117924-01, NSF CCF-0353079, NSF 1740707, and AFOSR FA9550-18-1-0166.

References

Abbasi-Yadkori, Y., Antos, A., and Szepesvári, C.
Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback. Citeseer, 2009.

Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1-19, 2011.

Auer, P. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2-3):235-256, 2002.

Bhargava, A., Ganti, R., and Nowak, R. Active Positive Semidefinite Matrix Completion: Algorithms, Theory and Applications. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pp. 1349-1357, 2017.

Bubeck, S. and Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5:1-122, 2012.

Burer, S. and Monteiro, R. D. C. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, Series B, 2003.

Cai, T. T., Zhang, A., et al. ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102-138, 2015.

Çivril, A. and Magdon-Ismail, M. Finding maximum volume sub-matrices of a matrix. RPI Comp Sci Dept TR, pp. 7-8, 2007.

Çivril, A. and Magdon-Ismail, M. On selecting a maximum volume sub-matrix of a matrix and related problems. Theoretical Computer Science, 410(47):4801, 2009.

Cesa-Bianchi, N., Gentile, C., and Zappella, G. A Gang of Bandits. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems (NeurIPS), pp. 737-745, 2013.

Chapelle, O. and Li, L. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2249-2257, 2011.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic Linear Optimization under Bandit Feedback. In Proceedings of the Conference on Learning Theory (COLT), pp. 355-366, 2008.

Deshmukh, A. A., Dogan, U., and Scott, C. Multi-Task Learning for Contextual Bandits. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4848-4856, 2017.

Filippi, S., Cappe, O., Garivier, A., and Szepesvári, C. Parametric Bandits: The Generalized Linear Case. In Advances in Neural Information Processing Systems (NeurIPS), pp. 586-594, 2010.

Gentile, C., Li, S., and Zappella, G. Online Clustering of Bandits. In Proceedings of the International Conference on Machine Learning (ICML), volume 32, pp. 757-765, 2014.

Gentile, C., Li, S., Kar, P., Karatzoglou, A., Zappella, G., and Etrue, E. On Context-Dependent Clustering of Bandits. In Proceedings of the International Conference on Machine Learning (ICML), volume 70, pp. 1253-1262, 2017.

Gilton, D. and Willett, R. Sparse linear contextual bandits via relevance vector machines. In International Conference on Sampling Theory and Applications (SampTA), pp. 518-522, 2017.

Gopalan, A., Maillard, O.-A., and Zaki, M. Low-rank Bandits with Latent Mixtures. arXiv:1609.01508 [cs.LG], 2016.

Johnson, N., Sivakumar, V., and Banerjee, A. Structured Stochastic Linear Bandits. arXiv:1606.05693 [stat.ML], 2016.

Jun, K.-S., Bhargava, A., Nowak, R., and Willett, R. Scalable Generalized Linear Bandits: Online Computation and Hashing. In Advances in Neural Information Processing Systems (NeurIPS), pp.
99-109, 2017.

Katariya, S., Kveton, B., Szepesvári, C., Vernade, C., and Wen, Z. Bernoulli Rank-1 Bandits for Click Feedback. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 2001-2007, 2017a.

Katariya, S., Kveton, B., Szepesvári, C., Vernade, C., and Wen, Z. Stochastic Rank-1 Bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 392-401, 2017b.

Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L., and Chawla, S. Efficient Thompson Sampling for Online Matrix-Factorization Recommendation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1297-1305, 2015.

Keshavan, R. H., Montanari, A., and Oh, S. Matrix Completion from Noisy Entries. Journal of Machine Learning Research, 11:2057-2078, 2010. ISSN 1532-4435.

Kveton, B., Szepesvári, C., Rao, A., Wen, Z., Abbasi-Yadkori, Y., and Muthukrishnan, S. Stochastic Low-Rank Bandits. arXiv:1712.04644 [cs.LG], 2017.

Lai, T. L. and Robbins, H. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. 2018. URL http://downloads.tor-lattimore.com/book.pdf.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the International Conference on World Wide Web (WWW), pp. 661-670, 2010.

Li, S., Karatzoglou, A., and Gentile, C. Collaborative Filtering Bandits. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 539-548, 2016.

Luo, Y., Zhao, X., Zhou, J., Yang, J., Zhang, Y., Kuang, W., Peng, J., Chen, L., and Zeng, J. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nature Communications, 8(1):573, 2017.

Ma, H., Yang, H., Lyu, M. R., and King, I. SoRec: Social Recommendation Using Probabilistic Matrix Factorization. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2008.

Mohan, K. and Fazel, M. New restricted isometry results for noisy low-rank recovery. In IEEE International Symposium on Information Theory - Proceedings, 2010.

Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review, 52(3):471-501, 2010.

Rusmevichientong, P. and Tsitsiklis, J. N. Linearly Parameterized Bandits. Mathematics of Operations Research, 35(2):395-411, 2010.

Sen, R., Shanmugam, K., Kocaoglu, M., Dimakis, A., and Shakkottai, S. Contextual Bandits with Latent Confounders: An NMF Approach. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pp. 518-527, 2017.

Stewart, G. W. and Sun, J.-g. Matrix Perturbation Theory. Academic Press, 1990.

Tropp, J. A. Column subset selection, matrix factorization, and eigenvalue optimization. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 978-986. Society for Industrial and Applied Mathematics, 2009.

Valko, M., Munos, R., Kveton, B., and Kocak, T. Spectral Bandits for Smooth Graph Functions. In Proceedings of the International Conference on Machine Learning (ICML), pp. 46-54, 2014.

Vaswani, S., Schmidt, M., and Lakshmanan, L. V. S. Horde of Bandits using Gaussian Markov Random Fields.
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 690-699, 2017.

Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 2012a.

Yue, Y., Hong, S. A. S., and Guestrin, C. Hierarchical exploration for accelerating contextual bandits. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1895-1902, 2012b.

Zhang, L., Yang, T., Jin, R., Xiao, Y., and Zhou, Z.-H. Online Stochastic Linear Optimization under One-bit Feedback. In Proceedings of the International Conference on Machine Learning (ICML), volume 48, pp. 392-401, 2016.

Zimmert, J. and Seldin, Y. Factored Bandits. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2840-2849, 2018.