# Personalizing Many Decisions with High-Dimensional Covariates

Nima Hamidi (Department of Statistics, Stanford University, hamidi@stanford.edu), Mohsen Bayati (Graduate School of Business, Stanford University, bayati@stanford.edu), Kapil Gupta (Airbnb, kapil.gupta@airbnb.com)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

**Abstract.** We consider the $k$-armed stochastic contextual bandit problem with $d$-dimensional features, when both $k$ and $d$ can be large. To the best of our knowledge, all existing algorithms for this problem have regret bounds that scale as polynomials of degree at least two in $k$ and $d$. The main contribution of this paper is to introduce and theoretically analyse a new algorithm (REAL-Bandit) with a regret that scales as $r^2(k + d)$, where $r$ is the rank of the $k \times d$ matrix of unknown parameters. REAL-Bandit relies on ideas from the low-rank matrix estimation literature and a new row-enhancement subroutine that yields sharper bounds for estimating each row of the parameter matrix, which may be of independent interest. We also show via simulations that the REAL-Bandit algorithm outperforms existing algorithms that do not leverage the low-rank structure of the problem.

## 1 Introduction

Running online experiments has recently become a popular approach in data-centric enterprises. However, running an experiment involves an opportunity cost or regret (e.g., exposing some users to potentially inferior experiences). To reduce this opportunity cost, a growing number of companies leverage multi-armed bandit (MAB) experiments [38, 39, 19], which were initially motivated by the cost of experimentation in clinical trials [41, 27]. Another common feature of online experiments is personalization: users have heterogeneous preferences, which means that the optimal decisions depend on user or product characteristics (also known as context). The MAB approach for personalizing decisions is therefore called contextual MAB (or contextual bandit) [29]. For example, [30] used contextual bandits to propose a personalized news article recommender system.

There is a large body of literature on algorithms with theoretical guarantees for contextual bandits with linear reward functions. An admittedly incomplete list is [5, 13, 2, 12, 14, 4, 34, 37, 42, 25, 6], and we defer to [7] for additional references. While these papers study the problem under a variety of different assumptions, they can be divided into two groups: (A) when context vectors are arbitrary and can be potentially selected by an adversary, and (B) when context vectors are i.i.d. samples from a fixed (but unknown) probability distribution. Our focus in this paper is the latter group (first studied by [14]). As the number of decisions $T$ (time horizon) grows, the regret bounds for the algorithms in group (A) grow with $\sqrt{T}$, whereas the algorithms in group (B) take advantage of the i.i.d. assumption and have a significantly lower (logarithmic) dependence on $T$. Two other important parameters are the number of arms $k$ and the dimension of the context vectors $d$. For example, when $d$ grows, the regret bound of [14] grows as $d^3$, which can dominate the dependence on $T$. [6] tackled this difficulty by imposing a sparsity assumption and replaced $d^3$ with $s^2$ (up to logarithmic factors), where $s$ is the sparsity of the parameter vectors of the reward functions. On the other hand, a careful inspection of the bounds in [14, 6] reveals that their regret bounds could grow as $k^3$ (in the worst case), which can be very large in applications such as assortment optimization [21].
On the other hand, the lower bounds of [13] and [14] imply that if there is no structural assumption, the lowest possible regret grows at least as $kd$. The aim of this paper is to reduce this dependence under a low-rank assumption on the $k \times d$ matrix of parameters of the reward functions. Specifically, each of the $k$ reward functions is represented by a $d$-vector of coefficients (one coefficient per covariate) that forms a row of the parameter matrix. This can also be interpreted as imposing a similarity between the reward functions of the $k$ different arms, as in the multi-task learning literature [10]. We propose a new algorithm (called REAL-Bandit) and prove that its regret grows as $r^2(k + d)$, where $r$ is the rank of the parameter matrix.

**Contributions.** Our main technical contributions in the design and analysis of REAL-Bandit are as follows. (1) *Stronger row-wise guarantees:* To prove regret guarantees for REAL-Bandit, we need bounds for estimating every single row of the matrix. However, existing matrix estimation results provide a bound on the estimation error of the whole matrix, which would be a crude upper bound for the estimation error of each single row. Therefore, REAL-Bandit includes a subroutine (called Row Enhancement) that refines the estimates in order to establish stronger row-wise guarantees that may be of independent interest (see Section 3 for details). Very recently, [11] provided bounds for the matrix completion problem that are also sharp at the row or entry level; however, their results are only for the matrix completion case and do not apply to our setting. (2) *Implementation:* Our theoretical analysis does not require that each matrix estimation phase in REAL-Bandit is solved to completion. In other words, REAL-Bandit does not need to find a global minimum of the relevant estimator's optimization (penalized maximum likelihood) problem. REAL-Bandit only needs a solution with cost below a certain threshold, which can be used to significantly speed up the implementation of REAL-Bandit. (3) *Estimator independence:* Over the last decade, several types of estimators have been introduced for recovering low-rank matrices from noisy observations, with varying assumptions and theoretical guarantees. Two of the most common approaches are based on convex optimization [8, 9, 31, 15, 35, 36, 26, 24] or non-convex optimization [40, 22, 23]. Unlike [14, 6], which work with a fixed estimator, REAL-Bandit is designed to be estimator agnostic and works with any matrix estimation algorithm with theoretical guarantees (see Section 3 for details).

**Other related literature.** A class of decision-making problems with a large number of arms are assortment optimization problems, in which a small subset from a potentially large set of items should be selected. Among the rich literature on this topic, [21, 3] are most related to our paper since they consider dynamic allocation of assortments via multi-armed bandit ideas. [21] (like us) uses low-rank matrix estimation methods for the learning part. However, the problems and models they study are very different. Specifically, they assume that a decision-maker shows a subset of products to a user.
Then the user selection is modeled via a multinomial logit (MNL) model [32], where the parameters of the MNL model form a low-rank matrix with rows representing customer types and columns representing products. Two other relevant papers are [42, 25], since they too tackle bandit problems with many actions. They introduce algorithms with regrets that scale with the spectral dimension of the Laplacian of a graph that has arms as its vertices. These papers are in group (A) of the aforementioned classes of bandit papers, which are inherently different. Specifically, they assume actions have known feature vectors (with low spectral dimension) that, together with a single unknown parameter vector, define the linear reward functions. There is a reduction from this setting to our problem only when the action set is allowed to change (see [1]), which is not the case in [42, 25]. Another recent paper in this category is [28]. The main difference between all of these papers and ours, as discussed above, is that we consider i.i.d. context vectors, which allows regret bounds that scale logarithmically in $T$ instead of scaling with $\sqrt{T}$. Finally, the recent paper of [20] studies a bandit problem where each action is a pair of arms and the reward function is a bilinear function of the feature vectors of each arm, with a low-rank parameter matrix.

**Organization.** We introduce additional notation in Section 2. The REAL-estimator and the REAL-Bandit algorithm are then introduced in Section 3, followed by simulations in Section 4. In Section 5 we present our assumptions and the statement of the main theorem, as well as its proof. Proofs of lemmas and additional details are deferred to the extended version of the paper [17].

## 2 Setting and notation

Let $B^*$ be a $k \times d$ matrix with real-valued entries. We further assume that $B^*$ is of rank $r$ with $r \ll \min(k, d)$. At time $t = 1, 2, \dots$, a context vector $X_t \in \mathbb{R}^d$ is drawn from a fixed probability distribution $\mathcal{P}$, independently from $X_s$ for $s < t$. Then, by choosing arm $1 \le \kappa \le k$, the reward
$$y_t := \langle B^*_\kappa, X_t \rangle + \varepsilon_{\kappa, t}$$
is generated, where $B^*_\kappa$ is the $\kappa$-th row of the matrix $B^*$ and the $\varepsilon_{\kappa, t}$ are independent $\sigma_\varepsilon^2$-sub-Gaussian random variables. In addition, $\langle U, V \rangle$ refers to the inner product of vectors $U$ and $V$, and $U^\top$ refers to the transpose of $U$. Throughout, we use bold capital letters for matrices, and use the notation $[n]$ for the set $\{1, 2, \dots, n\}$ when $n$ is an integer. For any two matrices $Y_1$ and $Y_2$ with $d$ columns, by $Y_1 \sqsubseteq Y_2$ we mean that all rows of $Y_1$ are also rows of $Y_2$. Also, for any subset $\mathcal{U}$ of $\mathbb{R}^d$, the notation $\mathcal{P}_{\mathcal{U}}$ refers to the conditional distribution $\mathcal{P}(\cdot \mid \mathcal{U})$ of the contexts.

A policy $\pi$ is a sequential decision-making algorithm that, at each time $t$, chooses the arm $\pi_t \in [k]$ given previous observations and the revealed context $X_t$. We evaluate the performance of a policy by its cumulative regret, defined as $R_T = \sum_{t=1}^T r_t$, where
$$r_t = \mathbb{E}\Big[\max_{\kappa \in [k]} \langle X_t, B^*_\kappa \rangle - \langle X_t, B^*_{\pi_t} \rangle\Big],$$
and the expectation is with respect to the randomness of $X_t$, $\varepsilon_t$, and potential randomness introduced by the policy $\pi$. Our goal is to find policies with low $R_T$.

In order to avoid dealing with unnecessary subscripts, for each context vector $X_t \in \mathbb{R}^d$ we define $X^\pi_t$ to be a $k \times d$ matrix with all elements equal to zero, except for the $\pi_t$-th row, which is equal to $X_t^\top$. Using this notation, we have that
$$y_t = \langle B^*, X^\pi_t \rangle + \varepsilon_t, \qquad (1)$$
where the inner product for matrices is defined as $\langle U, V \rangle := \mathrm{tr}(U V^\top)$. Note that $\varepsilon_t$ is actually $\varepsilon_{\pi_t, t}$, but since all noise values are i.i.d., we will drop the dependence on $\pi$.
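To make the setting concrete, here is a small numpy sketch of the data-generating model (1) and of the per-round regret gap (before taking expectations). The dimensions, the standard Gaussian context distribution, and the helper name `step` are illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, r, sigma_eps = 20, 10, 3, 1.0

# A rank-r parameter matrix B* = U V^T, one d-dimensional row per arm.
U = rng.normal(size=(k, r))
V = rng.normal(size=(d, r))
B_star = U @ V.T

def step(policy_arm_fn):
    """One round of the model: draw X_t ~ P, pull an arm chosen from the
    context, observe the linear reward with sub-Gaussian noise, and return
    the realized regret gap (r_t is the expectation of this quantity)."""
    x = rng.normal(size=d)                             # X_t ~ P (here N(0, I_d))
    arm = policy_arm_fn(x)                             # pi_t
    y = B_star[arm] @ x + sigma_eps * rng.normal()     # y_t = <B*_{pi_t}, X_t> + eps_t
    gap = np.max(B_star @ x) - B_star[arm] @ x         # instantaneous regret gap
    return y, gap
```

For example, always pulling arm 0 via `step(lambda x: 0)` makes the per-round gap visible immediately.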
For a given subset $\mathcal{I} = \{t_1, \dots, t_n\}$ of $[T]$, consider the set of corresponding context matrices $\{X^\pi_{t_i} \mid i \in [n]\}$. We define an associated sampling operator $\mathfrak{X}^\pi_{\mathcal{I}} : \mathbb{R}^{k \times d} \to \mathbb{R}^n$ as follows. For any matrix $B \in \mathbb{R}^{k \times d}$, $\mathfrak{X}^\pi_{\mathcal{I}}(B)$ is a vector of length $n$ whose $i$-th entry ($i \in [n]$) is given by $[\mathfrak{X}^\pi_{\mathcal{I}}(B)]_i := \langle B, X^\pi_{t_i} \rangle$. Therefore, the vector form of (1) is $Y = \mathfrak{X}^\pi_{\mathcal{I}}(B^*) + E$, where $E$ is the $n$-vector of all noise values $\varepsilon_{t_1}, \dots, \varepsilon_{t_n}$. In the remainder, we use the simpler notation $\mathfrak{X}(\cdot)$ instead of $\mathfrak{X}^\pi_{\mathcal{I}}(\cdot)$ when $\mathcal{I}$ and $\pi$ are clear from context.

We also use several norms in our algorithm and analysis. $\|\cdot\|_2$ refers to the usual $\ell_2$ norm of a vector. The nuclear norm (or trace norm) of a matrix is denoted by $\|\cdot\|_*$, and $\|\cdot\|_F$ and $\|\cdot\|_\infty$ refer to the Frobenius and infinity norms of a matrix. Also, for a given distribution $\mathcal{P}$ over $\mathbb{R}^{k \times d}$, we define the norm $\|B\|_{\mathcal{P}} := \sqrt{\mathbb{E}\,\langle B, Z \rangle^2}$ for all $B \in \mathbb{R}^{k \times d}$, where $Z$ is drawn from $\mathcal{P}$. Finally, $\|\Gamma\|_{\infty, 2}$ is the maximum of $\|\Gamma_\kappa\|_2$ over $\kappa \in [k]$ (recall that $\Gamma_\kappa$ is the $\kappa$-th row of $\Gamma$). In fact, one of our assumptions, stated explicitly later, is that the matrix $B^*$ belongs to the set
$$\mathcal{S} = \big\{ B \in \mathbb{R}^{k \times d} \mid \|B\|_{\infty, 2} \le b_* \big\},$$
for a positive constant $b_*$. Also, for a $k \times d$ matrix $B$ of rank $r$ with singular value decomposition $B = U D V^\top$, we define the row-incoherence parameter $\mu(B)$ in terms of the maximum $\ell_2$ norm of the rows of $U$ (equation (2)).

## 3 Algorithm

In this section, we describe the Row-Enhanced and Low-Rank Bandit (REAL-Bandit) algorithm. The algorithm combines ideas from the existing literature [14, 6] with a new row-enhancement procedure to obtain a sharper convergence rate when $k$ is very large. The REAL-Bandit algorithm has two disjoint phases for exploration and exploitation, similar to [14, 6]. In the exploration phase, all arms are given an equal chance to be explored, enabling the algorithm to obtain an estimate of their corresponding parameters. These forced-sampling estimates are not sufficiently accurate to pick the best arm with high probability; however, they are accurate enough to rule out all of the arms that are substantially inferior to the optimal arm. At each time $t$, these estimates are used as a proxy for the actual arm parameters to form a set of candidate arms. In order to choose among these candidates, we need more accurate estimates, and so the algorithm uses the all-sampling estimates, obtained from all the observations made thus far, to pick the best arm. However, unlike [14, 6], which estimate each of the arm parameters $\{B^*_\kappa\}_{\kappa \in [k]}$ separately, our forced-sampling and all-sampling estimates utilize the low-rank assumption on the matrix $B^*$ and estimate all parameters simultaneously (as in the multi-task learning literature).

**The estimators.** REAL-Bandit is designed to work with any matrix estimation method that has theoretical guarantees. Two such estimators (developed in the matrix completion literature) are: (1) estimators based on convex optimization, and (2) estimators based on non-convex optimization. Before presenting these two classes of estimators, we assume that a set of time periods $\mathcal{J} = \{t_1, \dots, t_n\}$, their associated observations $(X^\pi_{t_1}, y_{t_1}), \dots, (X^\pi_{t_n}, y_{t_n})$, and a positive constant $\lambda$ are available. We use the notations $\bar{B}(\mathcal{J})$ or $\hat{B}(\mathcal{J})$ for estimators of $B^*$ that use observations from time periods in $\mathcal{J}$. When $\mathcal{J}$ is clear, we use the simpler notations $\bar{B}$ and $\hat{B}$.

*(1) Convex optimization.* In this approach, introduced by [8], the approximation to $B^*$ is the minimizer of the convex program
$$\underset{B}{\text{minimize}} \quad n^{-1} \|Y - \mathfrak{X}(B)\|_2^2 + \lambda \|B\|_*. \qquad (3)$$
In fact, as [16] shows, one just needs a feasible solution $\bar{B} = \bar{B}(\mathcal{J}, \lambda)$ that satisfies
$$n^{-1} \|Y - \mathfrak{X}(\bar{B})\|_2^2 + \lambda \|\bar{B}\|_* \le n^{-1} \|Y - \mathfrak{X}(B^*)\|_2^2 + \lambda \|B^*\|_*. \qquad (4)$$
This brings additional flexibility in the choice of the optimizer and has computational advantages.
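For illustration, the following is a minimal numpy sketch of one standard way to approximately solve the convex program (3): proximal gradient descent, where the proximal map of the nuclear-norm penalty is singular value soft-thresholding. The function names, step size, and iteration count are our own choices and are not prescribed by the paper; per (4), any solver whose final penalized objective does not exceed the objective evaluated at $B^*$ would suffice.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the proximal map of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

def nuclear_norm_estimator(arms, contexts, rewards, k, lam, step=0.1, n_iter=500):
    """Proximal-gradient sketch for problem (3):
    minimize over B in R^{k x d}:  (1/n) ||Y - X(B)||_2^2 + lam * ||B||_*."""
    n, d = contexts.shape
    B = np.zeros((k, d))
    for _ in range(n_iter):
        # Residuals y_i - <B_{pi_i}, x_i> under the current iterate.
        preds = np.einsum('ij,ij->i', B[arms], contexts)
        resid = rewards - preds
        # Gradient of the smooth part: -(2/n) * sum_i resid_i * X^pi_i
        # (each observation only touches the row of its pulled arm).
        grad = np.zeros_like(B)
        np.add.at(grad, arms, -(2.0 / n) * resid[:, None] * contexts)
        # Gradient step followed by the nuclear-norm proximal map.
        B = svt(B - step * grad, step * lam)
    return B
```

In practice one could monitor the penalized objective and stop as soon as it drops below a target threshold, which is exactly the flexibility that (4) allows.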
*(2) Non-convex optimization.* Another approach is to explicitly impose the low-rank constraint by writing $B$ as $U V^\top$, where $U \in \mathbb{R}^{k \times r}$ and $V \in \mathbb{R}^{d \times r}$. The optimization problem is then
$$\underset{U, V}{\text{minimize}} \quad n^{-1} \|Y - \mathfrak{X}(U V^\top)\|_2^2 + \lambda \big(\|U\|_F^2 + \|V\|_F^2\big)/2. \qquad (5)$$
One challenge is that this is not a convex program, but it has been shown that, under certain conditions, alternating minimization is an effective algorithm [18]. It can also be shown (e.g., see [22, 31]) that minimizing the above loss function is equivalent to solving the following optimization problem:
$$\underset{B}{\text{minimize}} \quad n^{-1} \|Y - \mathfrak{X}(B)\|_2^2 + \lambda \|B\|_* \quad \text{subject to} \quad \mathrm{rank}(B) \le r. \qquad (6)$$
If $\bar{B}$ is a solution to (6), then since $B^*$ is also a feasible solution, (4) must hold.

**REAL-estimator.** The existing theory of matrix estimation provides error bounds for $\|\bar{B} - B^*\|_F$ [9, 26, 33, 24, 16]. However, these results do not characterize how this error is distributed across different rows. On the other hand, in order to obtain a regret bound, we need to control $\|\bar{B}_\kappa - B^*_\kappa\|_2$ for all $\kappa \in [k]$, and the trivial inequality $\|\bar{B}_\kappa - B^*_\kappa\|_2 \le \|\bar{B} - B^*\|_F$ would introduce an unnecessary factor of $k$. To remedy this, we introduce the REAL-estimator, which uses a set of almost independent observations to improve the row-wise error bound. As before, let $\mathcal{J} = \{t_1, \dots, t_n\}$ be a set of $n$ time periods with $t_1 < t_2 < \dots < t_n$. We split $\mathcal{J}$ into $\mathcal{J}_1 := \{t_1, \dots, t_{\lceil n/2 \rceil}\}$ and $\mathcal{J}_2 := \{t_{\lceil n/2 \rceil + 1}, \dots, t_n\}$. For any $\mathcal{K} \subseteq [k]$, let $\mathcal{J}^{\mathcal{K}}$ be the subset of $\mathcal{J}$ in which an arm in $\mathcal{K}$ is pulled, i.e., $\pi_{t_i} \in \mathcal{K}$. Moreover, for $\ell \in \{1, 2\}$ and $\mathcal{K} \subseteq [k]$, let $\mathcal{J}^{\mathcal{K}}_\ell := \mathcal{J}^{\mathcal{K}} \cap \mathcal{J}_\ell$. For $\kappa \in [k]$, when $\mathcal{K} = \{\kappa\}$, we use the superscript $\kappa$ rather than $\{\kappa\}$ for simplicity. Next, for any low-rank matrix estimator $\bar{B}$, Algorithm 1 performs the row-enhancement procedure. We call the output of this algorithm the REAL-estimator and denote it by $\hat{B}(\mathcal{J})$.

The difficulty of analyzing this estimator arises from the fact that the observations are generated in an adaptive fashion, and thus results that require an independence assumption are not applicable in our case. However, in the analysis we will show that these observations can be approximated by i.i.d. samples, and as a result theoretical guarantees can be obtained. In the following, we state the assumptions formally, and then we will verify that they continue to hold throughout the analysis. Before we define the notion of approximate independence, we need a few more notations. For $\kappa \in [k]$, $X^\kappa$ is the matrix constructed by stacking, as its rows, the context vectors of the observations for arm $\kappa$. We define $X^\kappa_\ell$ for $\ell \in \{1, 2\}$ and $\mathcal{J}_\ell$ similarly. Recall that, for matrices $Y_1$ and $Y_2$ with $d$ columns, $Y_1 \sqsubseteq Y_2$ means that all rows of $Y_1$ are also rows of $Y_2$.

**Algorithm 1** Row-enhancement procedure
Input: Low-rank matrix estimator $\bar{B} \in \mathbb{R}^{k \times d}$, $\mathcal{J} = \{t_1, \dots, t_n\}$, and observations $(X^\pi_{t_1}, y_{t_1}), \dots, (X^\pi_{t_n}, y_{t_n})$.
1: Initialize $\hat{B} \in \mathbb{R}^{k \times d}$,
2: Split $\mathcal{J}$ into $\mathcal{J}_1 := \{t_1, \dots, t_{\lceil n/2 \rceil}\}$ and $\mathcal{J}_2 := \{t_{\lceil n/2 \rceil + 1}, \dots, t_n\}$,
3: Compute the SVD $\bar{B}(\mathcal{J}_1) = U D V^\top$,
4: Let $V_r$ be the matrix containing the first $r$ columns of $V$,
5: for $\kappa = 1, 2, \dots, k$ do
6: Let $\hat{\vartheta}_\kappa = \arg\min_{\vartheta \in \mathbb{R}^r} \sum_{t_i \in \mathcal{J}^\kappa_2} \big(y_{t_i} - \langle V_r \vartheta, X_{t_i} \rangle\big)^2$,
7: Set row $\kappa$ of $\hat{B}$ to $(V_r \hat{\vartheta}_\kappa)^\top$.
8: end for
9: Return $\hat{B}$.

**Definition 3.1** (Approximate independence). Let $\mathcal{J}$ be a given set of $n$ time periods, and let $\mathcal{P}$, $\mathcal{P}_{\mathcal{U}}$, and $\mathcal{P}_{\mathcal{V}}$ be three distributions. Then, for $\kappa \in [k]$, we say that $\mathcal{J}^\kappa$ is an $(n_{\mathcal{U}}, n_{\mathcal{V}})$-approximately independent set of observations if there exist random matrices $X^\kappa_{\mathcal{U}}$ and $X^\kappa_{\mathcal{V}}$ such that
1. $X^\kappa_{\mathcal{U}} \sqsubseteq X^\kappa$ and $X^\kappa \sqsubseteq X^\kappa_{\mathcal{V}}$,
2. all rows of $X^\kappa_{\mathcal{U}}$ are independent samples from $\mathcal{P}_{\mathcal{U}}$,
3. all rows of $X^\kappa_{\mathcal{V}}$ are independent samples from either $\mathcal{P}$ or $\mathcal{P}_{\mathcal{V}}$,
4. $X^\kappa_{\mathcal{U}}$ and $X^\kappa_{\mathcal{V}}$ have $n_{\mathcal{U}}$ and $n_{\mathcal{V}}$ rows, respectively.

This definition requires the observations for a row $\kappa$ to lie between two sets of i.i.d. samples. The notion becomes extremely useful whenever one can prove that $n_{\mathcal{U}}$ and $n_{\mathcal{V}}$ are of the same order. Next, we specify the conditions that $\mathcal{P}$, $\mathcal{P}_{\mathcal{U}}$, and $\mathcal{P}_{\mathcal{V}}$ need to satisfy so that we can prove error bounds.

**Definition 3.2.** We say that a distribution $\mathcal{P}(\cdot)$ on $\mathbb{R}^d$ is $(\gamma_{\min}, \gamma_{\max}, \sigma_X)$-diverse if $\gamma_{\min} \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le \gamma_{\max}$, where $\Sigma = \mathbb{E}[X X^\top]$, and $\Sigma^{-1/2} X$ is $\sigma_X^2$-sub-Gaussian (i.e., for any deterministic unit vector $u \in \mathbb{R}^d$, the real-valued random variable $u^\top \Sigma^{-1/2} X$ is $\sigma_X^2$-sub-Gaussian).

We will treat $\sigma_X$ as a constant. Note that, for instance, when $X$ follows a multivariate Gaussian distribution, $\sigma_X = 1$. In our proofs, we will show that whenever $\mathcal{J}$ can be split into two almost independent halves, the row-enhancement procedure gives sharper per-row guarantees than the raw matrix estimator $\bar{B}$.

**The REAL-Bandit algorithm.** Here we describe the REAL-Bandit algorithm, presented in Algorithm 2. As mentioned earlier, this algorithm has disjoint exploration and exploitation phases, which are specified by a force-sampling rule $f : \mathbb{N} \to [k] \cup \{\emptyset\}$. At time $t$, the force-sampling rule decides between forcing the arm $f_t \in [k]$ to be pulled or exploiting the past data, indicated by $f_t = \emptyset$. By $\mathcal{F}_t$ we denote the time periods in which an arm was forced to be pulled, i.e., $\mathcal{F}_t := \{\tau \le t : f_\tau \in [k]\}$. For simplicity, we also use $\mathcal{A}_t := [t]$ to refer to all time periods up to time $t$. The force-sampling rule that we use is a randomized function that picks an arm $\kappa \in [k]$ with probability
$$\mathbb{P}(f_t = \kappa) = \begin{cases} \dfrac{1}{k} & \text{if } t \le 2\rho \log(\rho), \\[6pt] \dfrac{\rho}{k\,[t - \rho \log(\rho) + 1]} & \text{if } t > 2\rho \log(\rho), \end{cases} \qquad (7)$$
and sets $f_t = \emptyset$ otherwise. We will specify the hyper-parameter $\rho$ in Section 5. As we will see in our analysis, this force-sampling rule ensures that $|\mathcal{F}^\kappa_t|$ grows as $O(\log t)$ for all $\kappa \in [k]$. One can alternatively use any force-sampling rule with this rate of exploration.

**Remark 1.** The algorithms proposed in [14, 6] are similar to REAL-Bandit. They, however, use a deterministic force-sampling rule (which could be used here as well). Our randomized rule brings practical advantages in exchange for a slightly more complex theoretical analysis.

Now, let $\bar{B}^F$ and $\bar{B}^A$ be two low-rank matrix estimators (obtained from the observations of the force-sampling rounds and the all-sampling rounds, respectively) and denote by $\hat{B}^F$ and $\hat{B}^A$ their corresponding REAL-estimators, introduced above. These estimators serve different purposes in our algorithm. We will show that the forced-samples estimator $\hat{B}^F$ satisfies $\|\hat{B}^F_\kappa - B^*_\kappa\|_2 \le O(1)$ with probability at least $1 - O(1/t)$ for all arms $\kappa \in [k]$. The key idea is that $O(\log t)$ i.i.d. samples are enough to obtain such a guarantee. These estimates are then only used to rule out arms that are very far from the optimal arm. The threshold for eliminating sub-optimal arms is determined by a hyper-parameter $h$ that is given to the algorithm; this parameter can be thought of as the average gap of the problem. The remaining arms are candidates for being the optimal arm. Then the all-samples estimator $\hat{B}^A$ comes into play: this estimator is used to pick the best arm among the candidate arms. We will show that $\hat{B}^A$ enjoys the sharper bound $\|\hat{B}^A_\kappa - B^*_\kappa\|_2 \le O(1/\sqrt{t})$ for all optimal arms $\kappa \in \mathcal{K}_{\mathrm{opt}} \subseteq [k]$ with probability at least $1 - O(1/t)$, where $\mathcal{K}_{\mathrm{opt}}$ is defined formally in Assumption 3 of Section 5. This sharper rate significantly improves the accuracy of the decisions made by the algorithm.
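To make the row-enhancement step concrete, the following numpy sketch mirrors Algorithm 1: it extracts the top-$r$ right singular vectors of a low-rank estimate fitted on the first half of the data, and then re-fits each arm's row by least squares within that $r$-dimensional subspace using the second half of the data. The function name and the fallback for arms with no second-half observations are our own illustrative choices.

```python
import numpy as np

def row_enhance(B_bar, arms2, contexts2, rewards2, r, k):
    """Sketch of the row-enhancement procedure (Algorithm 1).
    B_bar: low-rank estimate fitted on the first half J_1.
    arms2, contexts2, rewards2: observations from the second half J_2."""
    _, _, Vt = np.linalg.svd(B_bar, full_matrices=False)
    Vr = Vt[:r].T                          # d x r basis of the estimated row space
    d = Vr.shape[0]
    B_hat = np.zeros((k, d))
    for kappa in range(k):
        idx = np.where(arms2 == kappa)[0]  # indices in J_2^kappa
        if idx.size == 0:                  # no second-half samples: keep the raw row
            B_hat[kappa] = B_bar[kappa]
            continue
        Z = contexts2[idx] @ Vr            # reduced design, |J_2^kappa| x r
        theta, *_ = np.linalg.lstsq(Z, rewards2[idx], rcond=None)
        B_hat[kappa] = Vr @ theta          # lift the r-dimensional fit back to R^d
    return B_hat
```

The per-arm regression has only $r$ unknowns, which is why the resulting row-wise error can be much smaller than the crude bound obtained from the full-matrix Frobenius error.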
**Algorithm 2** REAL-Bandit algorithm
Input: Force-sampling rule $f$, gap $h$.
1: for $t = 1, 2, \dots$ do
2: Observe $X_t \sim \mathcal{P}$,
3: if $f_t \neq \emptyset$ then
4: $\pi_t \leftarrow f_t$
5: else
6: $\mathcal{C} = \big\{ \kappa \in [k] \mid \langle X_t, \hat{B}^F_\kappa(\mathcal{F}_{t-1}) \rangle \ge \max_{\ell \in [k]} \langle X_t, \hat{B}^F_\ell(\mathcal{F}_{t-1}) \rangle - h \big\}$
7: $\pi_t \leftarrow \arg\max_{\kappa \in \mathcal{C}} \langle X_t, \hat{B}^A_\kappa(\mathcal{A}_{t-1}) \rangle$
8: end if
9: end for

## 4 Simulations

We compared the REAL-Bandit algorithm with four other algorithms: OLS-Bandit of [14] (using the improved version from [6] that filters sub-optimal arms), LASSO-Bandit of [6], OFUL of [2], which is based on the Upper Confidence Bound (UCB) idea, and Thompson sampling (the version from [37]). Taking $k = 201$, $d = 200$, and $r = 3$, we generated the matrix $B^*$ as $U V^\top$, where the rows of $U \in \mathbb{R}^{201 \times 3}$ and $V \in \mathbb{R}^{200 \times 3}$ are drawn independently and uniformly from the unit sphere in $\mathbb{R}^3$. The noise variance is 1, and the features are i.i.d. $N(0, I_d)$. We gave Thompson sampling the true prior mean and variance of the arm parameters, as well as the true noise variance. Similarly, OFUL had access to the true noise variance. The other parameters of OLS-Bandit, LASSO-Bandit, and OFUL were selected as in [6]. We generated 10 data sets and executed all algorithms for a time horizon of length $T = 40{,}000$. Figure 1 shows the average cumulative regret (with $\pm 1$ SE error bars) for all algorithms across these 10 runs.

Figure 1: Cumulative regret of REAL-Bandit versus LASSO-Bandit, OFUL, OLS-Bandit, and Thompson sampling for $(k, d, r) = (201, 200, 3)$.

The results of this simulation support our theoretical analysis: REAL-Bandit takes advantage of the low-rank structure of the problem parameters and significantly outperforms the other benchmarks, which do not leverage this structure.

## 5 Analysis

This section is dedicated to the analysis of REAL-Bandit. We first state the assumptions underlying the analysis and then state the main theorem of this section. A discussion of some of these assumptions can be found in [6].

**Assumption 1** (Parameter set). Assume the rank of $B^*$ is $r$, $\|B^*\|_{\infty, 2} \le b_*$, and $\mu := \mu(B^*)$, where $\mu(\cdot)$ is defined in (2).

**Assumption 2** (Margin condition). For any $a > 0$, there is a constant $c_0 > 0$ such that $\mathbb{E}[N_a] \le k c_0 a$, where the random variable $N_a$ is defined by
$$N_a := \sum_{\kappa = 1}^{k} \mathbb{I}\big\{ \langle X, B^*_{\kappa^*} - B^*_\kappa \rangle \le a\, b_*\, \|X\|_2 \big\},$$
and $\kappa^* = \kappa^*(X)$ is the optimal arm given the context vector $X$.

**Assumption 3** (Arm optimality). Let $\mathcal{K}_{\mathrm{opt}}$ and $\mathcal{K}_{\mathrm{sub}}$ be a partition of $[k]$. Then, for some $h > 0$, the following conditions hold:
1) For any sub-optimal arm $\kappa \in \mathcal{K}_{\mathrm{sub}}$, $\langle X, B^*_\kappa \rangle \le \max_{\kappa'} \langle X, B^*_{\kappa'} \rangle - h \|X\|_2$ for any context $X$.
2) For each arm $\kappa \in \mathcal{K}_{\mathrm{opt}}$, $\mathbb{P}(X \in \mathcal{U}_\kappa) \ge \dfrac{p}{|\mathcal{K}_{\mathrm{opt}}|}$, where the set $\mathcal{U}_\kappa$ is defined by
$$\mathcal{U}_\kappa := \Big\{ X \in \mathbb{R}^d \;\Big|\; \langle X, B^*_\kappa \rangle > \max_{\kappa' \neq \kappa} \langle X, B^*_{\kappa'} \rangle + h \|X\|_2 \Big\}.$$
3) For each arm $\kappa \in \mathcal{K}_{\mathrm{opt}}$, there exists a constant $q > 0$ such that $\max_{\kappa \in \mathcal{K}_{\mathrm{opt}}} \mathbb{P}(X \in \mathcal{V}_\kappa) \le \dfrac{q}{|\mathcal{K}_{\mathrm{opt}}|}$, where the set $\mathcal{V}_\kappa$ is defined as
$$\mathcal{V}_\kappa := \Big\{ X \in \mathbb{R}^d \;\Big|\; \langle X, B^*_\kappa \rangle > \max_{\kappa' \neq \kappa} \langle X, B^*_{\kappa'} \rangle - h \|X\|_2 \Big\}.$$

**Assumption 4** (Diversity). For all $\kappa \in \mathcal{K}_{\mathrm{opt}}$, the distributions $\mathcal{P}$, $\mathcal{P}_{\mathcal{U}_\kappa}$, and $\mathcal{P}_{\mathcal{V}_\kappa}$ are $(\gamma_{\min}, \gamma_{\max}, \sigma_X)$-diverse.

**Assumption 5** (Low-rank estimators). Assume the following tail bounds hold for $\bar{B}^F$ and $\bar{B}^A$:
1) Let $\mathcal{J}$ be a set of $n$ time periods such that, for each arm $\kappa \in [k]$, the matrix $X^\kappa$ of the context vectors associated with arm $\kappa$ has i.i.d. rows sampled from $\mathcal{P}$. Then
$$\mathbb{P}\big( \|\bar{B}^F(\mathcal{J}) - B^*\|_F \ge \delta \big) \le \exp\big(-c_1 \delta^2 n\big)$$
holds whenever $\dfrac{n}{\log(n)} \ge c_2 \Big(1 + \dfrac{1}{\delta^2}\Big) r(k + d)$, where $\delta := \sqrt{\dfrac{\gamma_{\min}}{\gamma_{\max}}} \sqrt{\dfrac{k}{r}}\, \dfrac{h}{64 \mu}$ and $c_1, c_2$ are positive constants.
2) Let $\mathcal{J}$ be a set of $n$ observations such that, for all $\kappa \in \mathcal{K}_{\mathrm{opt}}$, $\mathcal{J}^\kappa$ is a set of $(np/2,\, 2nq)$-approximately independent observations. Then
$$\big\| \Pi_{\mathrm{opt}}\big[ \bar{B}^A(\mathcal{J}) - B^* \big] \big\|_F \le \frac{\sqrt{d}\, \sigma_\varepsilon \vee b_*}{\sqrt{d}\, \gamma_{\min}} \sqrt{\frac{c_3\, r(k + d) \log(n)}{n}}$$
provided that $n / \log n \ge c_2\, r(k + d)$, for some constant $c_3 > 0$, where $\Pi_{\mathrm{opt}} : \mathbb{R}^{k \times d} \to \mathbb{R}^{k \times d}$ denotes the linear map that sets the rows corresponding to the sub-optimal arms to zero and keeps the rest unchanged.
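Before turning to the analysis, the sketch below shows how the randomized force-sampling rule (7) and the exploit step of Algorithm 2 fit together in code. The helper names (`force_sample`, `choose_arm`), the `None` encoding of $f_t = \emptyset$, and the capping of the exploration probability at one are our own conventions; the estimation subroutines (any matrix estimator satisfying Assumption 5, refined by the row-enhancement step sketched earlier) are assumed to be supplied separately.

```python
import numpy as np

rng = np.random.default_rng(0)

def force_sample(t, rho, k):
    """Randomized force-sampling in the spirit of rule (7): always explore
    (uniformly over arms) during the first 2*rho*log(rho) rounds, then
    explore with probability ~ rho / (t - rho*log(rho) + 1)."""
    if t <= 2 * rho * np.log(rho):
        return rng.integers(k)
    p_explore = min(1.0, rho / (t - rho * np.log(rho) + 1))
    return rng.integers(k) if rng.random() < p_explore else None  # None ~ f_t = empty

def choose_arm(x, B_F, B_A, h):
    """Exploit step (lines 6-7 of Algorithm 2): keep arms whose forced-sample
    score is within h of the best such score, then pick the best of these
    candidates under the all-sample estimates."""
    scores_F = B_F @ x
    candidates = np.where(scores_F >= scores_F.max() - h)[0]
    scores_A = B_A @ x
    return candidates[np.argmax(scores_A[candidates])]
```

In a full loop, each round would call `force_sample` first and fall back to `choose_arm` only when it returns `None`, re-fitting the two REAL-estimators from the forced-sample and all-sample histories on whatever schedule is computationally convenient.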
Now we are prepared to state our main theoretical result.

**Theorem 1.** If Assumptions 1–5 hold, then the cumulative regret of Algorithm 2 is bounded above by
$$R_T \le x_n \Big[ C\, c_2\, b_* \Big(1 + \frac{1}{\delta^2}\Big) r(k + d) \log(T) + C'\, c_0\, r^2 (k + d) \log(T)^2 \Big],$$
where $C > 0$ is a constant, the forced-sampling parameter $\rho$ is set to $2 c_2 (1 + \delta^{-2}) r(k + d)$, and
$$C' := \frac{c_3\, \mu^2\, \gamma_{\max}^2\, q}{\gamma_{\min}^2\, p}\, \big(d\, \sigma_\varepsilon^2 \vee b_*^2\big), \qquad x_n := \sup_{\|V\|_2 = 1} \mathbb{E}\big[\, \|X\|_2 \,\big|\, X = \|X\|_2\, V \,\big].$$

Before describing the proof of Theorem 1, we state four key lemmas that will be used in the proof. Due to space limitations, we defer the proofs of the lemmas to the extended version of the paper [17].

**Lemma 1.** The force-sampling sets created by the force-sampling rule (7) satisfy the following inequalities for all $t \ge 2\rho \log(\rho)$, provided that $\rho \ge 24$:
$$\mathbb{P}\big(|\mathcal{F}_t| \ge 6\rho \log t\big) \le t^{-1} \qquad \text{and} \qquad \mathbb{P}\big(|\mathcal{F}_t| \le [\rho/2] \log t\big) \le t^{-3}.$$

**Lemma 2.** Let $\mathcal{I}$ be a (deterministic) subset of the forced-sampling observations, and let $\mathcal{I}^\kappa \subseteq \mathcal{I}$ denote the observations in $\mathcal{I}$ corresponding to arm $\kappa \in [k]$. Then each $|\mathcal{I}^\kappa|$ is of the same order as $|\mathcal{I}|/k$, with a failure probability that is exponentially small in $|\mathcal{I}|$.

**Lemma 3.** For all $t \ge 10 c_2 (1 + \delta^{-2}) r(k + d) \log(kd)$ and $\kappa \in [k]$, with probability at least $1 - 10 t^{-3}$, the following inequality holds:
$$\|\hat{B}^F_\kappa - B^*_\kappa\|_2 \le h/4.$$

**Lemma 4.** For all $t > 10 c_2\, r(k + d) \log(kd)$ and $\kappa \in \mathcal{K}_{\mathrm{opt}}$, with probability at least $1 - 100\, r(k + d)\, t^{-1}$, we have
$$\|\hat{B}^A_\kappa - B^*_\kappa\|_2 \le \sqrt{\frac{10\, C'\, r^2 (k + d) \log(t)}{t}}.$$

**Proof of Theorem 1.** Following the lines of the proof of Theorem 1 in [6], we define $G(\mathcal{F}_{t-1})$ as
$$G(\mathcal{F}_{t-1}) := \begin{cases} 1 & \text{if } \|\hat{B}^F_\kappa(\mathcal{F}_{t-1}) - B^*_\kappa\|_2 \le \frac{h}{4} \text{ for all } \kappa \in [k], \\ 0 & \text{otherwise.} \end{cases}$$
Define $c_4 := 10 c_2 (1 + \delta^{-2}) r(k + d) \log(kd)$. Then we split the regret of the algorithm into the following three cases and bound each case separately:
(a) Initialization (i.e., when $t \le c_4$) and forced-sampling rounds.
(b) When $t > c_4$ and $G(\mathcal{F}_{t-1}) = 0$.
(c) When $t > c_4$ and $G(\mathcal{F}_{t-1}) = 1$, but a suboptimal arm is chosen due to inaccurate all-sampling estimates.

Let $R^{(a)}_T$, $R^{(b)}_T$, and $R^{(c)}_T$ denote the regret incurred in the above cases, respectively. Clearly, $R_T = R^{(a)}_T + R^{(b)}_T + R^{(c)}_T$. Before proving upper bounds, note that for each suboptimal choice the regret incurred at each step is at most $\langle X, B^*_{\kappa^*} - B^*_\kappa \rangle$ for some $\kappa, \kappa^*$, which in turn is bounded above by
$$|\langle X, B^*_{\kappa^*} - B^*_\kappa \rangle| \le \|X\|_2\, \|B^*_{\kappa^*} - B^*_\kappa\|_2 \le 2 b_* \|X\|_2.$$
This fact can be used to obtain regret bounds by bounding the number of times each suboptimal arm is pulled. Clearly, for part (a), this number is at most $c_4 + |\mathcal{F}_T|$. Using Lemma 1,
$$\mathbb{E}\big[R^{(a)}_T\big] \le 2 b_* x_n\, \mathbb{E}\big[c_4 + |\mathcal{F}_T|\big] \le 2 b_* x_n \big(c_4 + 6\rho \log T\big).$$
Next, it follows from the definition of $G(\mathcal{F}_{t-1})$ and Lemma 3 that the number of times that $G(\mathcal{F}_{t-1}) = 0$ is controlled by
$$\mathbb{E}\Big[\sum_{t = c_4 + 1}^{T} \big[1 - G(\mathcal{F}_{t-1})\big]\Big] = \sum_{t = c_4 + 1}^{T} \mathbb{P}\big(G(\mathcal{F}_{t-1}) = 0\big) \le \sum_{t = c_4 + 1}^{T} 10\, t^{-3} \le \sum_{t = c_4 + 1}^{T} 10\, t^{-1} \le 10 \log(T).$$
Now, since $R^{(b)}_T \le 2 b_*\, \mathbb{E}\big[\sum_{t = c_4 + 1}^{T} \|X_t\|_2 \big[1 - G(\mathcal{F}_{t-1})\big]\big]$, we have
$$R^{(b)}_T \le 2 b_* x_n\, \mathbb{E}\Big[\sum_{t = c_4 + 1}^{T} \big(1 - G(\mathcal{F}_{t-1})\big)\Big] \le 20\, b_* x_n \log(T).$$
Finally, we need to find an upper bound for $R^{(c)}_T$. It follows from a slightly modified version of Lemma EC.18 in [6] that, whenever $G(\mathcal{F}_{t-1}) = 1$, the set $\mathcal{C}$ contains the optimal arm and no suboptimal arm. In particular, if the best arm is $\kappa^*$, we get the following inequality for all $\kappa \in \mathcal{C}$:
$$0 \le \langle X_t, B^*_{\kappa^*} - B^*_\kappa \rangle < h\, \|X_t\|_2. \qquad (8)$$
Therefore, whenever $G(\mathcal{F}_{t-1}) = 1$, for any $\kappa \in \mathcal{C}$ we have $X \in \mathcal{V}_\kappa$, and if $X \in \mathcal{U}_\kappa$, then $\mathcal{C} = \{\kappa\}$. Now we are ready to use Lemma 4 to bound the probability of pulling an incorrect arm. Let
$$E^A_t := \Big\{ \exists\, \kappa \in \mathcal{K}_{\mathrm{opt}} : \|\hat{B}^A_\kappa(\mathcal{A}_t) - B^*_\kappa\|_2 > \sqrt{c_5/t} \Big\}, \qquad \text{where} \quad c_5 := \frac{100\, C'\, r^2 (k + d) \log(T)}{k p}.$$
Using Lemma 4, we have for all $t > c_4$,
$$\mathbb{P}\big(E^A_t\big) \le \frac{100 (k + d) r}{t}. \qquad (9)$$
Now recall that $\kappa^*$ denotes the optimal arm and $\pi_t$ is the pulled arm. For the (random) arm $\kappa \in [k]$, define the event
$$D_\kappa := \Big\{ \langle X_t, B^*_{\kappa^*} - B^*_\kappa \rangle > 2 \sqrt{c_5/t}\, \|X_t\|_2 \Big\}.$$
It follows from the definition of $r_t$ that
$$\begin{aligned} r_t &= \mathbb{E}\big[\langle X_t, B^*_{\kappa^*} - B^*_{\pi_t} \rangle\big] \\ &= \mathbb{E}\big[\langle X_t, B^*_{\kappa^*} - B^*_{\pi_t} \rangle\, \mathbb{I}\{D_{\pi_t} \cup E^A_t\}\big] + \mathbb{E}\big[\langle X_t, B^*_{\kappa^*} - B^*_{\pi_t} \rangle\, \mathbb{I}\{D^c_{\pi_t} \cap E^{A,c}_t\}\big] \\ &\le \mathbb{E}\Big[\langle X_t, B^*_{\kappa^*} - B^*_{\pi_t} \rangle\, \mathbb{I}\Big\{\big\{\langle X_t, \hat{B}^A_{\pi_t}\rangle \ge \langle X_t, \hat{B}^A_{\kappa^*}\rangle\big\} \cap \big(D_{\pi_t} \cup E^A_t\big)\Big\}\Big] + \mathbb{E}\big[\langle X_t, B^*_{\kappa^*} - B^*_{\pi_t} \rangle\, \mathbb{I}\{D^c_{\pi_t} \cap E^{A,c}_t\}\big] \\ &\le 2 b_* x_n\, \mathbb{P}\Big(\big\{\langle X_t, \hat{B}^A_{\pi_t}\rangle \ge \langle X_t, \hat{B}^A_{\kappa^*}\rangle\big\} \cap \big(D_{\pi_t} \cup E^A_t\big)\Big) + 2 \sqrt{c_5/t}\; x_n\, \mathbb{P}\big(D^c_{\pi_t} \cap E^{A,c}_t\big). \end{aligned} \qquad (10)$$
Note that $\langle X_t, \hat{B}^A_{\pi_t}\rangle \ge \langle X_t, \hat{B}^A_{\kappa^*}\rangle$, in combination with the definition of $D_{\pi_t}$, implies that
$$0 \ge \big\langle X_t, \hat{B}^A_{\kappa^*} - B^*_{\kappa^*} \big\rangle + \big\langle X_t, B^*_{\pi_t} - \hat{B}^A_{\pi_t} \big\rangle + 2 \sqrt{c_5/t}\, \|X_t\|_2,$$
and this entails that at least one of the following inequalities holds:
$$\|X_t\|_2\, \big\|B^*_{\kappa^*} - \hat{B}^A_{\kappa^*}\big\|_2 \ge \big\langle X_t, B^*_{\kappa^*} - \hat{B}^A_{\kappa^*} \big\rangle \ge \sqrt{c_5/t}\, \|X_t\|_2, \qquad \|X_t\|_2\, \big\|\hat{B}^A_{\pi_t} - B^*_{\pi_t}\big\|_2 \ge \big\langle X_t, \hat{B}^A_{\pi_t} - B^*_{\pi_t} \big\rangle \ge \sqrt{c_5/t}\, \|X_t\|_2.$$
Since $\mathcal{C}$ does not contain any suboptimal arm, we have that
$$\big\{\langle X_t, \hat{B}^A_{\pi_t}\rangle \ge \langle X_t, \hat{B}^A_{\kappa^*}\rangle\big\} \cap D_{\pi_t} \subseteq E^A_t.$$
This fact, combined with (9), means that for all $t > c_4$ the following holds:
$$\mathbb{P}\Big(\big\{\langle X_t, \hat{B}^A_{\pi_t}\rangle \ge \langle X_t, \hat{B}^A_{\kappa^*}\rangle\big\} \cap \big(D_{\pi_t} \cup E^A_t\big)\Big) \le \frac{100 (k + d) r}{t}.$$
Finally, by using the margin condition, we get that
$$\mathbb{P}\big(D^c_{\pi_t} \cap E^{A,c}_t\big) \le \mathbb{P}\big(D^c_{\pi_t}\big) \le \mathbb{E}\big[N_{2\sqrt{c_5/t}\,/\,b_*}\big].$$
Therefore, using (10) and Assumption 2, we have
$$r_t \le \frac{2 x_n}{t}\Big(100\, r(k + d)\, b_* + k c_0 c_5\Big),$$
and summing over $t > c_4$ gives
$$R^{(c)}_T \le 2 x_n \Big(100\, r(k + d)\, b_* + 4 k c_0 c_5\Big) \log(T).$$
Combining the bounds on $R^{(a)}_T$, $R^{(b)}_T$, and $R^{(c)}_T$ yields the statement of the theorem.

**Acknowledgments.** The authors gratefully acknowledge support of the National Science Foundation (CAREER award CMMI: 1554140), the Stanford Data Science Initiative, and the Human-Centered AI Initiative.

## References

[1] Yasin Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. PhD thesis, 2012.
[2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[3] S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. MNL-Bandit: A dynamic learning approach to assortment selection. arXiv e-prints, June 2017.
[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pages 127–135, 2013.
[5] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2003.
[6] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates, 2015. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2661896.
[7] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[8] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2009.
[9] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[10] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[11] Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. arXiv e-prints, arXiv:1902.07698, February 2019.
[12] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
[13] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. 2008.
[14] Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
[15] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.
[16] Nima Hamidi and Mohsen Bayati. On low-rank trace regression under general sampling distribution. arXiv preprint arXiv:1904.08576, 2019.
[17] Nima Hamidi, Mohsen Bayati, and Kapil Gupta. Personalizing many decisions with high-dimensional covariates. arXiv preprint, 2019. Extended version.
[18] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM, 2013.
[19] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. Peeking at A/B tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1517–1525, New York, NY, USA, 2017. ACM.
[20] Kwang-Sung Jun, Rebecca Willett, Stephen Wright, and Robert Nowak. Bilinear bandits with low-rank structure. arXiv e-prints, arXiv:1901.02470, January 2019.
[21] N. Kallus and M. Udell. Dynamic assortment personalization in high dimensions. arXiv e-prints, October 2016.
[22] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2009.
[23] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11(Jul):2057–2078, 2010.
[24] Olga Klopp et al. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20(1):282–303, 2014.
[25] Tomáš Kocák, Michal Valko, Rémi Munos, and Shipra Agrawal. Spectral Thompson sampling. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI '14, pages 1911–1917. AAAI Press, 2014.
[26] Vladimir Koltchinskii, Karim Lounici, Alexandre B. Tsybakov, et al. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.
[27] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[28] Sahin Lale, Kamyar Azizzadenesheli, Anima Anandkumar, and Babak Hassibi. Stochastic linear bandits with hidden low rank structure. arXiv e-prints, arXiv:1901.09490, January 2019.
[29] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 817–824. Curran Associates, Inc., 2008.
[30] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[31] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287–2322, 2010.
[32] Daniel McFadden. Econometric models for probabilistic choice among products. The Journal of Business, 53(3):S13–S29, 1980.
[33] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May):1665–1697, 2012.
[34] Ian Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
[35] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011.
[36] Angelika Rohde, Alexandre B. Tsybakov, et al.
Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930, 2011.
[37] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[38] Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
[39] Steven L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
[40] Nathan Srebro, Noga Alon, and Tommi S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1321–1328. 2005.
[41] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
[42] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In Proceedings of the 31st International Conference on Machine Learning, ICML '14, pages II-46–II-54. JMLR.org, 2014.