Disposable Linear Bandits for Online Recommendations

Melda Korkut, Andrew Li
Tepper School of Business, Carnegie Mellon University
{melda,aali1}@cmu.edu

We study the classic stochastic linear bandit problem under the restriction that each arm may be selected only a limited number of times. This simple constraint, which we call disposability, captures a common restriction that occurs in recommendation problems from a diverse array of applications, ranging from personalized styling services to dating platforms. We show that the regret for this problem is characterized by a previously-unstudied function of the reward distribution among optimal arms. Algorithmically, our upper bound relies on an optimism-based policy which, while computationally intractable, lends itself to approximation via a fast alternating heuristic initialized with a classic similarity score. Experiments show that our policy dominates a set of benchmarks which includes algorithms known to be optimal for the linear bandit without disposability, along with natural modifications of these algorithms for the disposable setting.

Introduction

Consider the stochastic linear bandit problem: given a set of arms $a_1, \ldots, a_K \in \mathbb{R}^d$, the task is to select a single arm in each of a sequence of time periods so as to maximize the total reward gained. The rewards are random, but there exists some (unknown) $\theta^* \in \mathbb{R}^d$ such that the mean reward when an arm $a_i$ is chosen is equal to $\langle a_i, \theta^* \rangle$. Algorithms known to be theoretically optimal (in a sense we will review later on) for this problem successfully balance the trade-off between arms that are known to yield high reward and arms that may reveal more information about the unknown $\theta^*$. These algorithms are now a core component of most modern recommender systems, as the linear bandit model provides an extremely close fit to a common problem in recommendations: the cold-start problem. Succinctly, this is the problem of making recommendations to newer users whose amount of activity is insufficient to accurately estimate their preferences. To map the cold-start problem to the linear bandit model: the $a_i$'s encode the features of each item, which have been previously estimated using past data, and $\theta^*$ encodes the unknown features of the new user. The linear form of the mean reward matches the underlying preference model of a variety of recommendation algorithms, including those based on matrix factorization.

The key problem we seek to address arises in a particular subset of recommender systems in which the number of recommendations of an item made to a person may not exceed a certain limit, a property we will refer to as disposability. One natural example of disposability in practice is personalized styling services (e.g. Stitch Fix, BeautyFIX, Trunk Club), which by and large operate by sending personalized boxes of items to users. Users can choose to purchase any subset of the box, and these companies do not re-send an item to the same user. Another application area is online dating, where some platforms do not want to show the same person to a user multiple times.

This Paper: Thus motivated, we study the linear bandit problem under the additional rule that each arm may be selected at most a limited number of times for each user.
Our primary observation is that this disposability constraint impacts both (a) the performance of algorithms known to be optimal when disposability is not imposed, and (b) the nature of algorithms that achieve meaningful theoretical (regret) guarantees. Our specific contributions are as follows:

1. We prove a minimax lower bound of $\Omega(\phi d \sqrt{T})$ on the regret of the disposable linear bandit problem, which suggests that disposability should allow for lower regret than the standard linear bandit. In particular, while the dependence on the dimensionality $d$ and time horizon $T$ remains the same as in the standard linear bandit, a new term $\phi \in [0, 1]$ loosely captures the spread of the optimal set of arms.

2. We propose an optimism-based algorithm, which generalizes the optimal upper confidence bound algorithm (called LinUCB) for the standard linear bandit. We prove an upper bound of $O(d\sqrt{\bar{\gamma}\, T})$ on its regret, where $\bar{\gamma}$ is a parameter that again encodes a notion of the spread of the optimal arm set. Noting that our algorithm involves a computationally hard optimization formulation, we propose a fast alternating heuristic. While this heuristic is not guaranteed to find the true optimum, we observe that it performs well in practice when combined with a clever initialization based on similarity scores.

3. We evaluate our algorithm's performance on a recommendation task based on synthetically generated data. Compared to a number of benchmarks, including LinUCB and a natural modification of Thompson sampling, our algorithm (solved via the preceding heuristic) achieves as much as 10% lower regret than all competing algorithms.

Previous Work: Recommender systems mainly use two types of filtering: content-based filtering, as in (Gómez Hidalgo et al. 2006), (Lang 1995), and (Mooney and Roy 2000), and collaborative filtering (Sarwar et al. 2001). Collaborative filtering makes use of the idea that similar people will also prefer similar products. The recommendation problem becomes more complicated when a new user joins the network, which is called the cold-start problem. (Bobadilla et al. 2012), (Wei et al. 2017), (Lika, Kolomvatsos, and Hadjiefthymiades 2014), and many others focused especially on algorithms for this problem. Our paper also contributes to this stream of research, albeit using bandit algorithms.

Another method used to learn about user preferences is multi-armed bandits (MABs). The classical multi-armed bandit problem was introduced by (Robbins 1952). The problem has been well studied for decades, pioneered by (Gittins 1979), who introduced the famous Gittins index. The best-known algorithm for the classic MAB is the Upper Confidence Bound (UCB) algorithm of (Lai and Robbins 1985). The use of MAB techniques for recommendations has been studied before, such as in (Bouneffouf, Bouzeghoub, and Gançarski 2012) and (Zeng et al. 2016). In the recommendation setting, however, learning a user's preferences is the primary goal, so linear bandits, first introduced in (Auer 2002) and then used for recommendations in (Li et al. 2010), are also a recent popular tool. Introduced by (Basu et al. 2019) and then extended to the contextual version by (Basu et al. 2020), blocking bandits represent the case in which arms become unavailable for a deterministic amount of time after they are played. A similar bandit model to ours is budgeted bandits.
Budgeted bandits have been studied in both the classic MAB and linear models in (Xia et al. 2017), Thompson Sampling for budgeted bandits in (Xia et al. 2015), and cost-aware cascading bandits in (Gan et al. 2020). In all of these models, however, the budget is a common budget on all of the resources that arms might consume, whereas in our case we consider independent budgets on arms. In (Slivkins 2013; Combes, Jiang, and Srikant 2015), the model is a classical, not linear, MAB model, except that each arm has its own budget. Bandits-with-knapsacks, as studied in (Badanidiyuru, Kleinberg, and Slivkins 2013; Agrawal and Devanur 2016), are bandits that consume a set of resources each time they are played. Bandits that deteriorate or vanish over time have been studied in (Farias and Madan 2011; Chakrabarti et al. 2009; Levine, Crammer, and Mannor 2017; Seznec et al. 2019). In (Farias and Madan 2011), the agent can either keep playing an arm or decide to discard it. (Chakrabarti et al. 2009) has a stochastic time until an arm is exhausted. In (Levine, Crammer, and Mannor 2017), arms have a decaying reward function depending on the number of times an arm is pulled. (Kleinberg, Niculescu-Mizil, and Sharma 2010) has varying arm sets over time; however, the structure differs significantly from ours, since we do not allow an unavailable action to become available again in the future. None of the above models has extensively studied the problem of budgeted linear bandits the way we model it in this paper.

Model: Disposable Linear Bandits

We begin by formally defining the disposable linear bandit problem that we will study. First, recall the well-studied (stochastic) linear bandit problem, which we will refer to here as the non-disposable problem: we have a set $A \subseteq \mathbb{R}^d$ of available actions, or arms. At each of $T$ discrete time periods, we select exactly one arm, and a reward is observed and received. The reward from selecting any arm $a \in A$ is drawn independently as $\langle a, \theta^* \rangle + \eta$, where $\theta^* \in \mathbb{R}^d$ is an (unknown) vector that we seek to learn over time, and $\eta$ is some mean-zero noise. The goal is, loosely, to collect as much total reward as possible within the $T$ time periods. We quickly summarize some standard assumptions and notation that will be needed:

1. Bounded arms: $\|a\| \le L$ for all $a \in A$
2. Bounded $\theta^*$: $\|\theta^*\| \le L$
3. Sub-Gaussian noise: $\|\eta\|_{\psi_2} \le \sigma$ (the sub-Gaussian norm of a random variable $X$ is defined as $\|X\|_{\psi_2} \triangleq \inf\{b > 0 : \mathbb{E}[\exp(|X|^2/b^2)] \le 2\}$)
4. Finite arm set: $A = \{a_1, \ldots, a_K\}$
5. Non-negative mean rewards: $\langle a, \theta^* \rangle \ge 0$ for all $a \in A$

The fact that the same $L$ is used in the first two assumptions is not necessary, and is done purely to save on notation. Similarly, the fifth assumption is not necessary, but allows us to avoid problems in which it may be preferable to not select any arm. From an applications standpoint, the typical mapping to the recommendation setting is to treat $\theta^*$ as the cold user, the arms as items to recommend, and the reward as some form of conversion. The embedding of the user and items in some inner-product space is estimated from previous data, e.g. with collaborative filtering via matrix factorization (for example, see (Koren, Bell, and Volinsky 2009)).

The above problem is made disposable by imposing just one additional constraint: each arm can be selected at most once. Implicitly, this also requires assuming that the number of arms $K$ is at least $T$, though in our target applications it is typically the case that $K \gg T$.
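To fix ideas, the following is a minimal simulator of the disposable setup. It is our own sketch rather than code from the paper: it assumes Gaussian noise and unit-norm arms, and the class and variable names are illustrative only.

```python
import numpy as np

class DisposableLinearBandit:
    """Toy simulator: mean reward is <a, theta*>, and each arm may be played at most once."""

    def __init__(self, arms, theta_star, sigma=1.0, seed=0):
        self.arms = np.asarray(arms, dtype=float)        # shape (K, d)
        self.theta_star = np.asarray(theta_star, float)  # unknown to the learner
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.available = np.ones(len(self.arms), dtype=bool)

    def pull(self, i):
        """Play arm i, observe a noisy reward, and dispose of the arm."""
        assert self.available[i], "disposability: each arm can be selected at most once"
        self.available[i] = False
        mean = self.arms[i] @ self.theta_star            # <a_i, theta*>
        return mean + self.sigma * self.rng.standard_normal()

# Illustrative sizes only (the paper's experiments use K = 5000, d = 15).
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 15))
A /= np.linalg.norm(A, axis=1, keepdims=True)            # place arms on the unit ball
theta = rng.standard_normal(15)
theta /= np.linalg.norm(theta)
env = DisposableLinearBandit(A, theta)
r0 = env.pull(0)
```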
Note that we have not required the arms in $A$ to be unique (so $A$ should technically be referred to as a multiset, though we will continue to use the shorter term set), so this problem includes as special cases the original non-disposable problem, as well as versions with arm-specific restrictions on selection such as so-called budgeted bandits (Slivkins 2013).

Let $\pi$ denote any policy, i.e. a collection of functions $\pi_t : ([K] \times \mathbb{R})^{t-1} \to [K]$ which, at each time $t$, map the previously selected arms and observed rewards to the next selected arm (the notation $[K]$ refers to the set $\{1, \ldots, K\}$). We measure the performance of a policy $\pi$ via its regret with respect to the offline optimal policy which knows $\theta^*$. Specifically, the regret $\mathrm{Reg}(\pi)$ is defined as

$$\mathrm{Reg}(\pi) \triangleq \max_{S \subseteq [K],\, |S| = T} \sum_{s \in S} \langle a_s, \theta^* \rangle \;-\; \mathbb{E}\Big[ \sum_{t=1}^{T} \langle a_{\pi_t}, \theta^* \rangle \Big],$$

where the first term is the maximum achievable expected reward, and the second term is the expected reward of $\pi$. To summarize so far, our goal is to construct a policy $\pi$ which both (a) satisfies the disposability constraint, and (b) achieves low regret. Before describing our policy, we will state a lower bound on the regret of any policy. Comparing this lower bound with known results for the non-disposable problem will serve to highlight the effect of the disposability constraint. Before proceeding, we conclude this section by defining some additional notation that will be needed.

Notation: For any subset of arms $S \subseteq A$, let $S(i)$ denote the arm in $S$ with the $i$th highest mean reward, with ties broken arbitrarily. In other words, $\langle S(1), \theta^* \rangle \ge \langle S(2), \theta^* \rangle \ge \cdots$. For any subset of arms $S$, define the averaged arm $\bar{S}$ as $\bar{S} \triangleq \frac{1}{|S|} \sum_{a \in S} a$. For any policy $\pi$ and time $t$, let $A^{\pi}_t$ denote the subset of arms remaining (i.e. not selected in the first $t - 1$ periods) under policy $\pi$.

Lower Regret Bound

Our first result is the following minimax lower bound for our disposable linear bandit:

Proposition 1. Let $A = \{-1, 1\}^d$ (and thus $T \le |A| = 2^d$). For every policy $\pi$, there exists $\theta^* \in [-1, 1]^d$ such that

$$\mathrm{Reg}(\pi) \ge C (d - k) \sqrt{T},$$

where $C$ is a universal constant, and $k$ is defined to be the minimum integer in $[0, d]$ satisfying $\ldots$

It may be useful to compare this lower bound vis-à-vis existing results for the non-disposable linear bandit: similar results (e.g. Theorem 24.1 in (Lattimore and Szepesvári 2020)) show that the minimax regret for the non-disposable setting is $\Omega(d\sqrt{T})$. Compared to this, Proposition 1 suggests that lower regret may be achievable under disposability. The $\sqrt{T}$ scaling remains, but the effect of the ambient dimension $d$ might be reduced. The reduction in regret is dependent on the size of $T$ relative to $|A|$: as $T$ increases, $k$ increases, and thus the lower bound on regret decreases.

Setting          Greedy   LinUCB
Non-disposable   100%     52.9%
Disposable       100%     89.91%

Table 1: Average regret relative to Greedy under non-disposable and disposable settings. Experiments were conducted with $T = 200$, with synthetically generated arms and $\theta^*$, and with Bernoulli rewards. Reported values are averaged over 2000 replications.

Another useful interpretation of $k$ is via the relative difference in rewards among the optimal set of arms. Let $\phi$ be the ratio between the lowest and highest mean rewards among the set of $T$ optimal arms, i.e.

$$\phi \triangleq \frac{\langle A(T), \theta^* \rangle}{\langle A(1), \theta^* \rangle}.$$

Intuitively, a larger value of $\phi$ corresponds to problems where the optimal offline set of arms is closer together, and thus disposability has less of an effect and the achievable regret should be higher.
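For concreteness, $\phi$ is simple to compute once the arms and $\theta^*$ are in hand. The short sketch below is our own illustration (the function name is not from the paper) and assumes non-negative mean rewards, as in assumption 5.

```python
import numpy as np

def phi(arms, theta_star, T):
    """Ratio of the lowest to the highest mean reward among the T offline-optimal arms."""
    means = arms @ theta_star              # mean reward <a_i, theta*> of every arm
    top_T = np.sort(means)[-T:]            # the T largest mean rewards, ascending
    return top_T[0] / top_T[-1]            # lowest over highest, a value in [0, 1]
```

When the top $T$ arms have nearly identical mean rewards, this ratio approaches 1 and, per the discussion above, disposability has little effect.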
This plays out precisely in Proposition 1, where $\phi \approx (d - k)/d$, and so the lower bound can be re-written as $\Omega(\phi d \sqrt{T})$. Similarly, $\phi = 1$ for any non-disposable problem, so the corresponding lower bounds can also be written as $\Omega(\phi d \sqrt{T})$.

Practical Effect of Disposability: Proposition 1 serves as theoretical evidence that disposability may alter the achievable performance in a given problem. Before moving on to describing our policy, it is worth investigating experimentally whether existing algorithms for the non-disposable setting succeed. A more-expanded version of these experiments is described in the Experiments section, but for now we briefly touch on the performance of a policy known to be regret-optimal under non-disposability: Linear UCB (LinUCB). We performed a set of synthetic experiments under both non-disposable and disposable settings, and compared the average regret of LinUCB to that of a greedy policy (both policies will be described in detail later). The results are given in Table 1. The main takeaway is that while LinUCB performed better than the greedy policy in both settings, its relative performance significantly decreased under disposability. This suggests (though it does not guarantee) that better policies may exist. We will propose such a policy now.

Our Algorithm: Generalized UCB

The algorithm we propose generalizes the previously mentioned LinUCB algorithm (Auer 2002; Dani, Hayes, and Kakade 2008; Li et al. 2010; Abbasi-Yadkori, Pál, and Szepesvári 2011) to the disposable setting. It will be necessary first to recall the details of that algorithm, and in particular the construction of confidence sets on $\theta^*$, as we will use the same in our own algorithm.

Confidence Sets and LinUCB: The LinUCB algorithm relies on using previously-observed rewards to construct carefully tuned confidence sets on $\theta^*$. The following construction is based on that of (Abbasi-Yadkori, Pál, and Szepesvári 2011). Let $\lambda > 0$ be an arbitrary constant (in practice, this can be treated as a tuning parameter). Suppose that at the beginning of time period $t + 1$, the policy $\pi$ has selected arms $a_{\pi_1}, \ldots, a_{\pi_t}$ and observed rewards $r_1, \ldots, r_t$. We define the matrix $V_t$ as

$$V_t = \sum_{i=1}^{t} a_{\pi_i} a_{\pi_i}^{\top} + \lambda I,$$

where $I$ is the identity matrix. Then letting $\hat{\theta}_t$ denote the regularized least-squares estimator of $\theta^*$, we have

$$\hat{\theta}_t = V_t^{-1} \sum_{i=1}^{t} r_i\, a_{\pi_i}, \tag{1}$$

and finally, the confidence set used in the LinUCB algorithm is the following ellipsoid:

$$\Theta_t = \{\theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_t\|_{V_t} \le \beta_t\}, \tag{2}$$

where $\|x\|_V^2 = x^{\top} V x$ for any $x \in \mathbb{R}^d$, and where

$$\beta_t = \sigma \sqrt{2 \log T + d \log \frac{d\lambda + T L^2}{d\lambda}} + \sqrt{\lambda}\, L. \tag{3}$$

The LinUCB algorithm, then, selects the arm that has the highest potential reward over all $\theta \in \Theta_t$, i.e. it selects $\operatorname{argmax}_{a \in A} \max_{\theta \in \Theta_t} \langle a, \theta \rangle$. As a side note, implementing this is tractable (unless $A$ is large or infinite) because the ellipsoidal form of $\Theta_t$ enables a closed-form expression for the inner maximization above:

$$\mathrm{UCB}_t(a) \triangleq \max_{\theta \in \Theta_t} \langle a, \theta \rangle = \langle a, \hat{\theta}_t \rangle + \beta_t\, \|a\|_{V_t^{-1}}. \tag{4}$$
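To make the above construction concrete, here is a minimal sketch of the running quantities behind (1)-(4). It is our own illustration rather than the paper's implementation, and it treats the confidence width $\beta$ as a plain tuning parameter (as the paper does for its tuning parameters in the experiments).

```python
import numpy as np

class LinUCBState:
    """Tracks V_t, theta_hat_t, and the UCB index from equations (1)-(4)."""

    def __init__(self, d, lam=1.0):
        self.V = lam * np.eye(d)        # V_t = sum_i a_i a_i^T + lambda * I
        self.b = np.zeros(d)            # sum_i r_i * a_i
        self.theta_hat = np.zeros(d)    # regularized least-squares estimate (1)

    def update(self, a, r):
        """Incorporate one (arm, reward) observation."""
        self.V += np.outer(a, a)
        self.b += r * a
        self.theta_hat = np.linalg.solve(self.V, self.b)

    def ucb(self, a, beta):
        """Closed-form optimistic reward (4): <a, theta_hat> + beta * ||a||_{V^{-1}}."""
        width = np.sqrt(a @ np.linalg.solve(self.V, a))
        return a @ self.theta_hat + beta * width
```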
Generalized UCB: The guiding principle behind LinUCB, and in fact optimism-based algorithms in general, can be viewed as selecting the action which has the highest optimistic (within reason) reward. For the non-disposable linear bandit, "within reason" is precisely encoded by the previously-described uncertainty sets $\Theta_t$, and the actions are simply the arms. From this lens, it seems that the natural generalization of LinUCB to the disposable setting is to keep the same uncertainty sets, but to treat an action as a plan for the set of arms that will be selected in the remaining periods. Concretely, at any time $t$, there are $T - t + 1$ selections remaining, and so we should choose optimistically from among the subsets of $T - t + 1$ arms. This is precisely the idea behind our algorithm, Generalized UCB (UCBG). Recall that we use $A^{\mathrm{UCBG}}_t$ to denote the remaining (i.e. not yet selected) arms at the beginning of time $t$. Then the optimistic subset of arms is chosen according to:

$$S^{\mathrm{UCBG}}_t \in \operatorname*{argmax}_{S \subseteq A^{\mathrm{UCBG}}_t,\; |S| = T - t + 1}\; \max_{\theta \in \Theta_{t-1}} \sum_{a \in S} \langle a, \theta \rangle. \tag{5}$$

Algorithm 1: Generalized LinUCB (UCBG)
Input: $\lambda > 0$
Initialization: $V_0 = \lambda I$, $\hat{\theta}_0 = [0]^d$; calculate $\Theta_0$ according to (2)
for $t = 1, 2, \ldots, T$ do
    Calculate $S^{\mathrm{UCBG}}_t$ according to (5)
    Select arm $a^{\mathrm{UCBG}}_t = \operatorname{argmax}_{a \in S^{\mathrm{UCBG}}_t} \mathrm{UCB}_{t-1}(a)$
    Observe reward $r_t$
    Calculate $V_t = V_{t-1} + a^{\mathrm{UCBG}}_t (a^{\mathrm{UCBG}}_t)^{\top}$
    Calculate $\Theta_t$ according to (2)
end

The actual arm selected by UCBG, denoted $a^{\mathrm{UCBG}}_t$, is chosen optimistically from this subset: $a^{\mathrm{UCBG}}_t \in \operatorname{argmax}_{a \in S^{\mathrm{UCBG}}_t} \mathrm{UCB}_{t-1}(a)$, with $\mathrm{UCB}_t(a)$ defined as in (4). The UCBG algorithm is described fully in Algorithm 1.

Our Guarantee

Next we prove an upper bound on the regret of our algorithm. To summarize briefly, UCBG achieves $O(d\sqrt{\bar{\gamma}\, T})$ regret, which is a reduction of the $O(d\sqrt{T})$ regret for non-disposable linear bandits. This reduction depends on the quantity $\bar{\gamma}$, which averages, over time, a certain shrinkage factor that loosely captures the difference in rewards among the best arms.

Shrinkage Factor: Fix any time period $t$. Note that at time $t$, there are $T - t + 1$ selections remaining and, irrespective of the previous selections, there are $K - t + 1$ remaining arms from which to select. For any such set of remaining arms, we can compute the ratio between the $(T - t + 1)$th highest mean reward and the average mean reward across the best $T - t + 1$ arms. We define $\gamma_t$ to be an upper bound on this ratio across all such subsets:

$$\frac{\big\langle S(T - t + 1), \theta^* \big\rangle}{\frac{1}{T - t + 1} \sum_{i=1}^{T - t + 1} \big\langle S(i), \theta^* \big\rangle} \le \gamma_t \quad \text{for all } S \subseteq A \text{ with } |S| = K - t + 1. \tag{6}$$

Notice that $\gamma_t \le 1$ for all $t$, and is in general smaller when the arms are more spread out. As a special case, any non-disposable problem, when represented in the disposable framework, has $\gamma_t = 1$. With $\gamma_t$ defined, we can now state our regret bound:

Theorem 1. With probability at least $1 - O(1/T)$, the expected regret of UCBG satisfies

$$\mathrm{Reg}(\mathrm{UCBG}) = O\big(d \sqrt{\bar{\gamma}\, T}\big),$$

where $\bar{\gamma} \triangleq \frac{1}{T} \sum_{t=1}^{T} \gamma_t$ and $\gamma_t$ is defined according to (6).

Proof Sketch of Theorem 1. Our proof relies on a framework similar to that used to upper bound the regret of the LinUCB policy. The key challenge in adapting that framework is that the disposable setting lacks an important property: in the non-disposable setting, the optimal arm at each time period is a single best arm, and so the regret analysis can be almost entirely decomposed over time. In our disposable setting, on the other hand, the optimal sequence of arms is unique only up to permutation, which prevents the same decomposition if the usual method of regret book-keeping is applied. To deal with this, we prove the following key lemma (Lemma 1), which suggests that there is an alternative book-keeping, with a different form of instantaneous regret, that allows for decomposition over time:

$$\big\langle A^{\pi}_t(T - t + 1), \theta^* \big\rangle \le \big\langle a_{\pi_t}, \theta^* \big\rangle + \cdots$$

A Fast Heuristic

So far, we have proposed and analyzed the UCBG policy, and while it enjoys the theoretical regret guarantee given in Theorem 1, there is the computational issue of solving (5) at each time step. That is, we need to maximize a function of the form $f(\theta, S) \triangleq \sum_{a \in S} \langle a, \theta \rangle$, where $\theta$ ranges over ellipsoidal confidence sets and $S$ ranges over subsets of $A$. So even though the inner maximization over $\theta$ can be done in closed form, the outer maximization is still over exponentially many subsets $S$.
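To see both points at once, the hypothetical brute-force solver below uses the closed-form inner maximization (the optimistic value of a subset depends only on the summed vector $\sum_{a \in S} a$, via (4)) but still enumerates every subset of size $T - t + 1$, so it is only usable on toy instances; all names are ours.

```python
import numpy as np
from itertools import combinations

def ucbg_subset_bruteforce(arms, remaining, m, theta_hat, V, beta):
    """Exact solver for (5): enumerate all size-m subsets of the remaining arms.

    For a fixed subset S, the inner maximization over the ellipsoid is closed-form:
        max_theta sum_{a in S} <a, theta> = <s, theta_hat> + beta * ||s||_{V^{-1}},
    where s is the sum of the arms in S. The enumeration over subsets is what
    makes the exact problem intractable for realistic K.
    """
    best_val, best_set = -np.inf, None
    for subset in combinations(remaining, m):
        s = arms[list(subset)].sum(axis=0)
        val = s @ theta_hat + beta * np.sqrt(s @ np.linalg.solve(V, s))
        if val > best_val:
            best_val, best_set = val, subset
    return best_set, best_val
```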
In this section, we propose a heuristic for dealing with this (this heuristic will be the algorithm we test in the coming experiments). To simplify the notation, assume that all of the arms in $A$ and $\theta^*$ lie on the unit ball (these have already been assumed to be bounded, so the extra assumption is that they are of the same size; taking $L = 1$ is done without loss of generality). Our heuristic is based on the idea that if the arms are fairly dense on the unit ball (a reasonable assumption in our target applications), then whatever $\theta^*$ is, there is likely to be an arm close to it, where closeness is measured in Euclidean distance. To be more precise, consider functions $g(\theta)$ of the form

$$g(\theta) \triangleq \max_{S \subseteq A,\, |S| = k} \sum_{a \in S} \langle a, \theta \rangle,$$

and for any given vector $v \in \mathbb{R}^d$, let $a_i(v)$ be the $i$th closest arm (from the set $A$) to $v$ in Euclidean distance. Then $g(\theta)$ is equivalently $\sum_{i=1}^{k} \langle a_i(\theta), \theta \rangle$. By holding out the first term, and making the (approximate) substitution $a_1(\theta) \approx \theta$ twice, we have

$$g(\theta) = \langle a_1(\theta), \theta \rangle + \sum_{i=2}^{k} \langle a_i(\theta), \theta \rangle \approx \langle a_1(\theta), \theta \rangle + \sum_{i=2}^{k} \langle a_i(a_1(\theta)), \theta \rangle \approx \langle a_1(\theta), \theta \rangle + \sum_{i=2}^{k} \langle a_i(a_1(\theta)), a_1(\theta) \rangle. \tag{7}$$

The key property we use from the final expression above is that the nearest-arm operations ($a_1(\cdot), a_2(\cdot), \ldots$) are only applied to $a_1(\theta)$. Indeed, the second term in (7) depends only on the set of arms (i.e. not on $\theta$), and can be precomputed for each $a \in A$. We call this term the similarity score of an arm (this is indeed the usual cosine similarity). This enables the following heuristic: instead of maximizing $g(\theta)$ over an ellipsoid, we iterate over all of the arms, computing the maximum of (7) over this same ellipsoid (which can be done in closed form). Another lens toward this heuristic is that it balances the usual objectives of exploration and exploitation with a third objective, which is to select arms for which there are other similar arms. This makes intuitive sense in the disposable setting: there is little reason to learn the mean reward of an arm if there are no similar arms to leverage later on.

To formally state the method, we must define two functions: $\mathrm{Similarity}(v, B, s)$ and $\mathrm{Closest}(v, B, s)$. $\mathrm{Similarity}(v, B, s)$ is the cumulative similarity score of vector $v$ with the set of arms that are closest to it in the cosine-similarity measure. $\mathrm{Closest}(v, B, s)$ gives a subset of arms of size $s$ from $B$ with the highest inner products with vector $v$:

$$\mathrm{Similarity}(v, B, s) = \max_{S \subseteq B,\, |S| = s} \sum_{a \in S} \langle a, v \rangle, \qquad \mathrm{Closest}(v, B, s) = \operatorname*{argmax}_{S \subseteq B,\, |S| = s} \sum_{a \in S} \langle a, v \rangle.$$

Our heuristic, which we call the Alternating Heuristic, works as follows. We first choose the arm $\bar{a}$ with the highest sum of UCB index and similarity score. Intuitively, this corresponds to an arm which has a high reward estimate and a high number of arms close to it. Then, we find the set of arms closest to this arm, which we denote as $S^{\mathrm{AH}}_t$. Note that the $\theta$ (from the uncertainty set) that will maximize the optimistic reward for $S^{\mathrm{AH}}_t$ is different from that of $\bar{a}$. Let $\theta^{\mathrm{UCB}}(a) = \operatorname{argmax}_{\theta \in \Theta_{t-1}} \langle \theta, a \rangle$ and $\theta^{\mathrm{UCBG}}(S) = \operatorname{argmax}_{\theta \in \Theta_{t-1}} \sum_{a \in S} \langle \theta, a \rangle$. We find the $\theta \in \Theta_{t-1}$ that maximizes $\langle \bar{S}^{\mathrm{AH}}_t, \theta \rangle$. Here we use the fact that maximizing $\mathrm{UCB}(\bar{S}^{\mathrm{AH}}_t)$ is equivalent to maximizing $\mathrm{UCBG}(S^{\mathrm{AH}}_t)$:

$$\theta^{\mathrm{UCBG}}(S) = \operatorname*{argmax}_{\theta \in \Theta_{t-1}} \sum_{a \in S} \langle a, \theta \rangle = \operatorname*{argmax}_{\theta \in \Theta_{t-1}} \langle \bar{S}, \theta \rangle = \theta^{\mathrm{UCB}}(\bar{S}).$$

For a fixed $\theta$, we can compute $g(\theta)$ by finding the set of arms closest to it, which we denote as $S^{\theta}_t$. We iterate until the two arm sets, $S^{\theta}_t$ and $S^{\mathrm{AH}}_t$, are equal. Finally, the arm in the final subset with the maximum UCB index is selected.

Algorithm 2: Alternating Heuristic
Input: $\lambda, \alpha, c > 0$
Initialization: $V_0 = \lambda I$; $\hat{\theta}_0 = [0]^d$; $X, Y = [\,]$
for $t = 1, 2, \ldots, T$ do
    $\bar{a} = \operatorname{argmax}_{a \in A^{\mathrm{AH}}_t} \langle \hat{\theta}_{t-1}, a \rangle + c\, \|a\|_{V_{t-1}^{-1}} + \alpha\, \mathrm{Similarity}(a, A^{\mathrm{AH}}_t, T - t)$
    $S^{\mathrm{AH}}_t = \mathrm{Closest}(\bar{a}, A^{\mathrm{AH}}_t, T - t + 1)$
    $\bar{\theta} = \theta^{\mathrm{UCB}}(\bar{S}^{\mathrm{AH}}_t)$
    $S^{\theta}_t = \mathrm{Closest}(\bar{\theta}, A^{\mathrm{AH}}_t, T - t + 1)$
    while $S^{\mathrm{AH}}_t \ne S^{\theta}_t$ do
        $S^{\mathrm{AH}}_t = \mathrm{Closest}(\bar{S}^{\theta}_t, A^{\mathrm{AH}}_t, T - t + 1)$
        $\bar{\theta} = \theta^{\mathrm{UCB}}(\bar{S}^{\mathrm{AH}}_t)$
        $S^{\theta}_t = \mathrm{Closest}(\bar{\theta}, A^{\mathrm{AH}}_t, T - t + 1)$
    end
    Play the arm $a^{\mathrm{AH}}_t = \operatorname{argmax}_{a \in S^{\mathrm{AH}}_t} \langle \hat{\theta}_{t-1}, a \rangle + c\, \|a\|_{V_{t-1}^{-1}}$
    Observe reward $r_t$
    $V_t = V_{t-1} + a^{\mathrm{AH}}_t (a^{\mathrm{AH}}_t)^{\top}$
    $X = [X;\, a^{\mathrm{AH}}_t]$, $Y = [Y;\, r_t]$
    Calculate $\hat{\theta}_t = V_t^{-1} X^{\top} Y$
end
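As a minimal illustration of one round of Algorithm 2 (our own simplified sketch, not the paper's code): it folds the two Closest steps of the inner loop into a single alternation between a candidate set and its optimistic $\theta$, and all function and variable names are ours.

```python
import numpy as np

def closest(v, remaining, arms, s):
    """Indices of the s remaining arms with the highest inner product with v."""
    scores = arms[remaining] @ v
    order = np.argsort(scores)[::-1][:s]
    return [remaining[i] for i in order]

def similarity(v, remaining, arms, s):
    """Cumulative similarity score: sum of the s largest inner products with v."""
    return np.sort(arms[remaining] @ v)[::-1][:s].sum()

def theta_ucb(v, theta_hat, V, beta):
    """Maximizer of <v, theta> over the ellipsoid {theta : ||theta - theta_hat||_V <= beta}."""
    Vinv_v = np.linalg.solve(V, v)
    return theta_hat + beta * Vinv_v / np.sqrt(v @ Vinv_v)

def alternate(init_set, remaining, arms, m, theta_hat, V, beta, max_iter=50):
    """Alternate between a candidate set S and theta^UCB(S_bar) until S stops changing."""
    S = list(init_set)
    for _ in range(max_iter):
        theta = theta_ucb(arms[S].mean(axis=0), theta_hat, V, beta)  # theta^UCB(S_bar)
        S_new = closest(theta, remaining, arms, m)
        if set(S_new) == set(S):
            break
        S = S_new
    return S
```

In practice, the initial set comes from the similarity-based rule in Algorithm 2, and the iteration cap above is purely a safeguard.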
Note that in each alternating step, the objective value $f$ increases. Thus, since the set of arms is finite, the heuristic is guaranteed to converge. However, since $f$ is non-convex, the point of convergence is not guaranteed to be $S^{\mathrm{UCBG}}_t$. We find in experiments that the choice of initialization is crucial to the performance of the method. We investigated three different methods to initialize this set, and compared both the final objective values and the convergence rates. The data and experiments are explained in detail in the Experiments section, but in Figure 1 we provide a comparison of the objective function across the different initial sets given to the Alternating Heuristic. These trends were observed at period $t = 5$. We tested three different initialization methods over the arm set:

Random: Choose $T - t + 1$ arms randomly and iterate until convergence. As seen in Figure 1, this method takes more iterations to converge and converges to a lower value of the objective function.

UCB: Choose $T - t + 1$ arms by $\operatorname{argmax}_{S \subseteq A^{\mathrm{AH}}_t,\, |S| = T - t + 1} \sum_{a \in S} \mathrm{UCB}(a)$, i.e. the arms with the highest combined UCB values. This method is significantly better than random initialization, but still has a lower final value than that of the Similarity initialization.

Similarity: Choose $T - t + 1$ arms by maximizing the similarity score of one arm, and choosing the $T - t$ arms closest to it. This initialization not only converges to the highest function value, but also converges faster than random initialization, comparable to the UCB initialization.

Figure 1: Performance and convergence rate of the alternating heuristic with three different initialization methods. Results are averaged over 200 instances, all at period $t = 5$. Performance (solid): objective function vs. iteration number. Convergence rate (dashed): gap fraction to converged value vs. iteration number.

Experiments

We tested our final algorithm (the alternating heuristic with similarity-score initialization) against a number of baseline algorithms. These experiments reveal the following:

1. Bernoulli reward setting (see Table 2 top and Figure 2): for all three time horizons tested, the alternating heuristic outperformed all benchmarks in terms of average regret. These benchmarks included LinUCB (with perfect choice of tuning parameter) and the greedy algorithm.

2. Gaussian reward setting (see Table 2 bottom): the alternating heuristic again outperformed both methods, even though the tuning parameters were chosen perfectly for LinUCB but not for the heuristic.

We begin by describing the experimental setup. We generated $K$ arms $a_i$, $i = 1, \ldots, K$, in $d$-dimensional space, where $K = 5000$ and $d = 15$. Similarly, we generated a set of $\theta$'s that lie in the same space, where the total number of $\theta$'s is 5000 (the corresponding data can be found in https://github.com/Melda Kor/Disposable Linear Bandits). As described in the Model section, the mean reward of pulling arm $a$ when $\theta$ is the unknown vector is $\langle a, \theta \rangle$, with the observed reward following the distribution $r(a) = \langle a, \theta \rangle + \eta$, where $\eta$ is mean-zero sub-Gaussian noise. We now give the disposable adaptations of the benchmarks we use.
Disposable Setting Adaptations: The modification of all benchmark algorithms to the disposable setting is straightforward: after playing an arm, we remove it from the future available arm sets. The decision-making procedures for Greedy and LinUCB remain the same: LinUCB selects the arm with the highest optimistic reward, whereas Greedy plays the arm with the highest estimated reward, i.e. it uses the MLE estimator for $\theta$. Both of these algorithms, as well as the Alternating Heuristic, can be run with either Bernoulli or Gaussian reward functions. In the Gaussian setting, we also test a version of Thompson Sampling that is given the (true) multivariate Gaussian prior, similar to that studied in (Agrawal and Goyal 2013). Formally, we assume that rewards are drawn according to $r(a) = \langle \theta^*, a \rangle + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$, and that $\theta^*$ is drawn from $\mathcal{N}(\theta_0, \Sigma_0)$. We propose the following adaptation of Thompson Sampling for the disposable setting: choose an arm, uniformly at random, from among the closest $T - t + 1$ arms to $\theta^{\mathrm{TS}}_t$, which is sampled from the running posterior distribution.

Type       Horizon   Greedy   LinUCB1   LinUCB2   LinUCB3   UCBG1    UCBG2    UCBG3    TS-Adapted
Bernoulli  50        100%     94.81%    94.36%    99.46%    94.82%   94.15%   88.15%   -
           100       100%     91.67%    87.66%    95.65%    86.94%   84.75%   78.97%   -
           200       100%     91.42%    89.91%    99.59%    93.4%    88.60%   83.07%   -
Gaussian   50        100%     94.29%    88.97%    94.92%    86.13%   83.11%   79.67%   106.65%

Table 2: Final regret comparisons. We report the final regret with respect to the Greedy method. For all experiments, we set the tuning parameter $\alpha$ for the heuristic equal to LinUCB's $c$. In the experiments, $\alpha, c = (1/2)^i$ where $i = 3, 4, 5$.

Results

We ran two sets of experiments, differing in the reward function: one in which the reward has an underlying Gaussian distribution and one in which it has a Bernoulli distribution. In both cases the mean is, naturally, $\langle a, \theta^* \rangle$ when an arm $a$ is pulled and $\theta^*$ is the unknown parameter. We ran the experiments with 5000 arms and horizons of $T = 50, 100, 200$. The number of instances, i.e. different $\theta^*$ values we worked with, ranges from 200 to 2500. The immediate pseudo-regret was calculated as in Lemma 1, using the new book-keeping method for disposable bandits. The regret value of each instance was averaged over 10 runs, and the final results were averaged over all instances.

1. Gaussian Reward: The reason we work with a Gaussian reward distribution, even though it is more realistic for matching rewards on online platforms to come from a Bernoulli distribution, is that we know the posterior of a Gaussian prior in linear Thompson Sampling. The posterior is calculated as given in the previous subsection. The methods we implement are the Thompson Sampling adaptation, Greedy, LinUCB, and the Alternating Heuristic. To give Thompson Sampling the benefit of the doubt, we sample $\theta^* \in \mathbb{R}^d$ from the standard multivariate normal distribution $\mathcal{N}([0]^d, I_d)$. However, the arms $a_i \in \mathbb{R}^d$, $i = 1, \ldots, 5000$, are as generated. So, for an arm $a_i$, the reward we observe becomes $r(a_i) = \langle a_i, \theta^* \rangle + \eta$, where $\eta \sim \mathcal{N}(0, 1)$.

2. Bernoulli Reward: In the Bernoulli case, the reward we observe from pulling an arm $a_i$ has an underlying distribution of Bernoulli($\langle a_i, \theta^* \rangle$). The methods we test are Greedy, LinUCB, and the Alternating Heuristic.
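Returning to the Thompson Sampling adaptation described under Disposable Setting Adaptations above, the following is a minimal sketch of one step (ours, not the paper's code): a standard Bayesian linear-regression posterior update, followed by a uniform draw from the $T - t + 1$ available arms closest to the posterior sample, with closeness measured by inner product as in the Closest operation.

```python
import numpy as np

def gaussian_posterior_update(mean, cov, a, r, sigma=1.0):
    """Standard Bayesian linear-regression update after observing reward r for arm a."""
    precision = np.linalg.inv(cov) + np.outer(a, a) / sigma**2
    new_cov = np.linalg.inv(precision)
    new_mean = new_cov @ (np.linalg.solve(cov, mean) + r * a / sigma**2)
    return new_mean, new_cov

def ts_disposable_step(arms, available, mean, cov, t, T, rng):
    """Sample theta_TS from the posterior, then pick uniformly among the closest available arms."""
    theta_ts = rng.multivariate_normal(mean, cov)
    idx = np.flatnonzero(available)
    scores = arms[idx] @ theta_ts
    candidates = idx[np.argsort(scores)[::-1][: T - t + 1]]
    return int(rng.choice(candidates))
```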
Figure 2: Point-wise final-regret comparison per instance; the tuned parameter for LinUCB and the corresponding heuristic parameter were chosen.

In Figure 2, we work with Bernoulli-type feedback, and we provide a point-wise comparison of each instance based on the final regret value. For each horizon value, we plotted 200 different randomly chosen instances. The Alternating Heuristic outperforms LinUCB in the majority of the instances for all three horizons implemented. In Table 2, we give the final regret of these methods, with varying parameters and horizons, as a fraction of that of the Greedy algorithm. In both the Gaussian and Bernoulli cases, we see that even though we did not perfectly tune the similarity parameter $\alpha$ and simply took it equal to LinUCB's $c$, the Alternating Heuristic outperforms all other benchmarks. In the Gaussian case, the main purpose of which was to test the computationally tractable adaptation of Thompson Sampling, we surprisingly see that the TS adaptation had a higher regret, averaged over the instances, than the other methods.

Conclusion

We tackled the problem of making budgeted recommendations using linear bandits. We first introduced an ideal generalization of the LinUCB algorithm and, due to its computational difficulty, gave an alternating heuristic that approximates the UCBG index. We tested the heuristic along with other methods on synthetically generated data, and saw that it outperforms all others when arms are disposable.

Ethics Statement

The primary motivation for this work is recommender systems. Such systems are in broad use, and so it is impossible to predict all future ethical and societal issues that may arise from their use. One foreseeable issue is that these systems may introduce biases against certain groups.

References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312-2320.

Agrawal, S.; and Devanur, N. 2016. Linear contextual bandits with knapsacks. In Advances in Neural Information Processing Systems, 3450-3458.

Agrawal, S.; and Goyal, N. 2013. Thompson Sampling for Contextual Bandits with Linear Payoffs. In Dasgupta, S.; and McAllester, D., eds., Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, 127-135. Atlanta, Georgia, USA: PMLR. URL http://proceedings.mlr.press/v28/agrawal13.html.

Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3(Nov): 397-422.

Badanidiyuru, A.; Kleinberg, R.; and Slivkins, A. 2013. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 207-216. IEEE.

Basu, S.; Papadigenopoulos, O.; Caramanis, C.; and Shakkottai, S. 2020. Contextual Blocking Bandits. arXiv preprint arXiv:2003.03426.

Basu, S.; Sen, R.; Sanghavi, S.; and Shakkottai, S. 2019. Blocking Bandits. CoRR abs/1907.11975. URL http://arxiv.org/abs/1907.11975.

Bobadilla, J.; Ortega, F.; Hernando, A.; and Bernal, J. 2012. A collaborative filtering approach to mitigate the new user cold start problem. Knowledge-Based Systems 26: 225-238.

Bouneffouf, D.; Bouzeghoub, A.; and Gançarski, A. L. 2012. A Contextual-Bandit Algorithm for Mobile Context-Aware Recommender System. In Huang, T.; Zeng, Z.; Li, C.; and Leung, C. S., eds., Neural Information Processing, 324-331. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-642-34487-9.

Chakrabarti, D.; Kumar, R.; Radlinski, F.; and Upfal, E. 2009. Mortal Multi-Armed Bandits. 273-280. URL http://papers.nips.cc/paper/3580-mortal-multi-armed-bandits.pdf.

Combes, R.; Jiang, C.; and Srikant, R. 2015. Bandits with budgets: Regret lower bounds and optimal algorithms. ACM SIGMETRICS Performance Evaluation Review 43(1): 245-257.
Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. Stochastic Linear Optimization under Bandit Feedback. In COLT.

Farias, V. F.; and Madan, R. 2011. The Irrevocable Multiarmed Bandit Problem. Operations Research 59(2): 383-399. ISSN 0030-364X. doi:10.1287/opre.1100.0891. URL http://dx.doi.org/10.1287/opre.1100.0891.

Gan, C.; Zhou, R.; Yang, J.; and Shen, C. 2020. Cost-Aware Cascading Bandits. IEEE Transactions on Signal Processing 68: 3692-3706.

Gittins, J. C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological) 41(2): 148-164.

Gómez Hidalgo, J. M.; Bringas, G. C.; Sánz, E. P.; and García, F. C. 2006. Content based SMS spam filtering. In Proceedings of the 2006 ACM Symposium on Document Engineering, 107-114. ACM.

Kleinberg, R.; Niculescu-Mizil, A.; and Sharma, Y. 2010. Regret bounds for sleeping experts and bandits. Machine Learning 80(2-3): 245-272.

Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42(8): 30-37.

Lai, T.; and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1): 4-22. ISSN 0196-8858. doi:10.1016/0196-8858(85)90002-8. URL http://www.sciencedirect.com/science/article/pii/0196885885900028.

Lang, K. 1995. NewsWeeder: Learning to Filter Netnews. In Proceedings of the 12th International Machine Learning Conference (ML95).

Lattimore, T.; and Szepesvári, C. 2020. Bandit Algorithms. Cambridge University Press.

Levine, N.; Crammer, K.; and Mannor, S. 2017. Rotting Bandits. In NIPS.

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, 661-670. New York, NY, USA: ACM. ISBN 978-1-60558-799-8. doi:10.1145/1772690.1772758. URL http://doi.acm.org/10.1145/1772690.1772758.

Lika, B.; Kolomvatsos, K.; and Hadjiefthymiades, S. 2014. Facing the cold start problem in recommender systems. Expert Systems with Applications 41(4): 2065-2073.

Mooney, R. J.; and Roy, L. 2000. Content-based book recommending using learning for text categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, 195-204. ACM.

Robbins, H. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5): 527-535. URL https://projecteuclid.org:443/euclid.bams/1183517370.

Sarwar, B. M.; Karypis, G.; Konstan, J. A.; Riedl, J.; et al. 2001. Item-based collaborative filtering recommendation algorithms. WWW 1: 285-295.

Seznec, J.; Locatelli, A.; Carpentier, A.; Lazaric, A.; and Valko, M. 2019. Rotting bandits are no harder than stochastic ones. In Chaudhuri, K.; and Sugiyama, M., eds., Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, 2564-2572. PMLR. URL http://proceedings.mlr.press/v89/seznec19a.html.

Slivkins, A. 2013. Dynamic Ad Allocation: Bandits with Budgets. arXiv abs/1306.0155.

Wei, J.; He, J.; Chen, K.; Zhou, Y.; and Tang, Z. 2017. Collaborative filtering and deep learning based recommendation system for cold start items. Expert Systems with Applications 69: 29-39.

Xia, Y.; Li, H.; Qin, T.; Yu, N.; and Liu, T.-Y. 2015. Thompson sampling for budgeted multi-armed bandits. arXiv preprint arXiv:1505.00146.
Xia, Y.; Qin, T.; Ding, W.; Li, H.; Zhang, X.; Yu, N.; and Liu, T.-Y. 2017. Finite budget analysis of multi-armed bandit problems. Neurocomputing 258: 13-29. ISSN 0925-2312. doi:10.1016/j.neucom.2016.12.079. URL http://www.sciencedirect.com/science/article/pii/S0925231217304216. Special Issue on Machine Learning.

Zeng, C.; Wang, Q.; Mokhtari, S.; and Li, T. 2016. Online Context-Aware Recommendation with Time Varying Multi-Armed Bandit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, 2025-2034. New York, NY, USA: ACM. ISBN 978-1-4503-4232-2. doi:10.1145/2939672.2939878. URL http://doi.acm.org/10.1145/2939672.2939878.