# Multinomial Logit Bandit with Low Switching Cost

Kefan Dong^1, Yingkai Li^2, Qin Zhang^3, Yuan Zhou^4

Author names are listed in alphabetical order. ^1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. ^2 Department of Computer Science, Northwestern University, Evanston, Illinois, USA. ^3 Computer Science Department, Indiana University, Bloomington, Indiana, USA. ^4 Department of ISE, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. Correspondence to: Yuan Zhou.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We study the multinomial logit bandit with limited adaptivity, where the algorithm changes its exploration actions as infrequently as possible while achieving almost optimal minimax regret. We propose two measures of adaptivity: the assortment switching cost and the more fine-grained item switching cost. We present an anytime algorithm (AT-DUCB) with $O(N \log T)$ assortment switches, almost matching the lower bound $\Omega(N \log T / \log\log T)$. In the fixed-horizon setting, our algorithm FH-DUCB incurs $O(N \log\log T)$ assortment switches, matching the asymptotic lower bound. We also present the ESUCB algorithm with item switching cost $O(N \log^2 T)$.

1. Introduction

The dynamic assortment selection problem with the multinomial logit (MNL) choice model, also called MNL-bandit, is a fundamental problem in online learning and operations research. In this problem we have $N$ distinct items, each of which is associated with a known reward $r_i$ and an unknown preference parameter $v_i$. In the MNL choice model, given a subset $S \subseteq [N] \stackrel{\text{def}}{=} \{1, 2, 3, \ldots, N\}$, the probability that a user chooses item $i$ is

$$
p_i(S) = \begin{cases} \dfrac{v_i}{v_0 + \sum_{j \in S} v_j} & \text{if } i \in S \cup \{0\}, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)
$$

where $0$ stands for the case that the user does not choose any item, and $v_0$ is the associated preference parameter. As a convention (see, e.g., Agrawal et al., 2019), we assume that no-purchase is the most frequent choice, which is very natural in retailing. W.l.o.g., we assume $v_0 = 1$ and $v_i \le 1$ for all $i \in [N]$. The expected reward of the assortment $S$ under the preference vector $v = (v_0, v_1, \ldots, v_N)$ is defined to be

$$
R(S, v) = \frac{\sum_{i \in S} r_i v_i}{1 + \sum_{j \in S} v_j}. \qquad (2)
$$

For any online policy that selects a subset $S_t \subseteq [N]$ ($|S_t| \le K$, where $K$ is a predefined capacity parameter) at each time step $t$, observes the user's choice $a_t$ to gradually learn the preference parameters $\{v_i\}$, and runs for a horizon of $T$ time steps, we define the regret of the policy to be

$$
\mathrm{Reg}_T = \sum_{t=1}^{T} \big( R(S^\star, v) - R(S_t, v) \big), \qquad (3)
$$

where $S^\star = \arg\max_{S \subseteq [N], |S| \le K} R(S, v)$ is the optimal assortment in hindsight. The goal is to find a policy that minimizes the expected regret $\mathbb{E}[\mathrm{Reg}_T]$ for all MNL-bandit instances.

To motivate the definition of the MNL-bandit problem, consider a fast fashion retailer such as Zara or Mango. Each of its products corresponds to an item in $[N]$, and by selling the $i$-th item the retailer earns a profit of $r_i$. At any specific time in each of its shops, the retailer can only present a certain number of items (say, at most $K$) on the shelf due to space constraints. As a consequence, customers who visit the store can only pick items from the presented assortment (or buy nothing, which corresponds to item $0$), following a choice model. A number of choice models have been proposed in the literature (see, e.g., (Train, 2009; Luce, 2012) for overviews), and the MNL model is arguably the most popular one.
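To make the model concrete, the following minimal Python sketch computes the MNL choice probabilities (1) and the expected revenue (2); the preference weights and rewards below are made-up toy numbers, not values from the paper.

```python
def choice_probabilities(S, v):
    """MNL choice probabilities (1) for assortment S; item 0 is the no-purchase option."""
    denom = v[0] + sum(v[i] for i in S)
    probs = {i: v[i] / denom for i in S}
    probs[0] = v[0] / denom
    return probs

def expected_revenue(S, v, r):
    """Expected revenue R(S, v) from (2), using the convention v[0] = 1."""
    return sum(r[i] * v[i] for i in S) / (1.0 + sum(v[j] for j in S))

# Toy instance with N = 4 items (illustrative numbers only).
v = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.3, 4: 0.9}   # preference parameters, v_i <= v_0 = 1
r = {1: 0.6, 2: 0.9, 3: 1.0, 4: 0.4}           # known per-item rewards
S = {1, 2, 3}
print(choice_probabilities(S, v))   # probabilities over S and the no-purchase option sum to 1
print(expected_revenue(S, v, r))
```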
The retailer certainly wants to maximize its profit by identifying the best assortment $S^\star$ to present. However, it does not know in advance the customers' preferences for the items in $[N]$ (i.e., the preference vector $v$), which it has to learn from the customers' actual choices. More precisely, the retailer needs to develop a policy that chooses at each time step $t$ an assortment $S_t \subseteq [N]$ ($|S_t| \le K$) based on the previously presented assortments $S_1, \ldots, S_{t-1}$ and the customers' choices in the past $(t-1)$ time steps. The retailer's expected reward over a time horizon $T$ can be expressed as $\sum_{t=1}^{T} R(S_t, v)$, which is typically reformulated as the regret against the best policy, in the form of (3).

The MNL-bandit problem has attracted considerable attention in the past decade (Rusmevichientong et al., 2010; Sauré & Zeevi, 2013; Agrawal et al., 2016; 2017; Chen & Wang, 2018). However, none of these works considers an important practical issue for regret minimization: in reality it is often impossible to change the assortment display frequently. For example, in retail stores it may not be possible to change the display in the middle of the day, let alone after each purchase. We thus hope to minimize the number of assortment switches over the selling horizon without increasing the regret by much. Another advantage of a small number of assortment switches is that such algorithms are easier to parallelize, which enables us to learn users' preferences much faster. This feature is particularly useful in applications such as online advertising, where it is easy to show the same assortment (i.e., a set of ads) on a large number of end users' displays simultaneously.

We are interested in two kinds of switching costs under a time horizon $T$. The first is the assortment switching cost, defined as
$$
\Psi_T^{(\mathrm{asst})} \stackrel{\mathrm{def}}{=} \sum_{t=1}^{T-1} \mathbb{I}[S_t \ne S_{t+1}].
$$
The second is the item switching cost, defined as
$$
\Psi_T^{(\mathrm{item})} \stackrel{\mathrm{def}}{=} \sum_{t=1}^{T-1} |S_t \,\triangle\, S_{t+1}|,
$$
where the binary operator $\triangle$ computes the symmetric difference of two sets. In comparison, the item switching cost is more fine-grained and puts less penalty on consecutive assortments that are almost the same. As a straightforward observation, we always have that
$$
\Psi_T^{(\mathrm{asst})} \le \Psi_T^{(\mathrm{item})} \le \min\{2K, N\} \cdot \Psi_T^{(\mathrm{asst})}. \qquad (4)
$$

Our results. In this paper we obtain the following results for MNL-bandit with low switching cost. By default, all logarithms are of base 2.

We first introduce an algorithm, AT-DUCB, that achieves almost optimal regret (up to a logarithmic factor) and incurs an assortment switching cost of $O(N \log T)$; this algorithm is anytime, i.e., it does not need to know the time horizon $T$ in advance. We then show that the AT-DUCB algorithm achieves an almost optimal assortment switching cost. In particular, we prove that every anytime algorithm that achieves almost optimal regret must incur an assortment switching cost of at least $\Omega(N \log T / \log\log(NT))$. These results are presented in Section 2.

When the time horizon is known beforehand, we obtain an algorithm, FH-DUCB, that achieves almost optimal regret (up to a logarithmic factor) and incurs an assortment switching cost of $O(N \log\log T)$. We also prove the optimality of this switching cost by establishing a matching lower bound. See Section 3.

For item switches, while the trivial application of (4) leads to $O(N^2 \log T)$ and $O(N^2 \log\log T)$ item switching cost bounds for AT-DUCB and FH-DUCB respectively, in Section 4 we design a new algorithm, ESUCB, that achieves an item switching cost of $O(N \log^2 T)$. In Appendix F, we show that a more careful modification of the algorithm further improves the item switching cost to $O(N \log T)$.
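As a concrete illustration of the two switching-cost measures defined above, the following short Python sketch computes both costs for a toy sequence of assortments; it is only meant to make the definitions tangible.

```python
def switching_costs(assortments):
    """Assortment and item switching costs of a sequence S_1, ..., S_T of sets."""
    asst = sum(assortments[t] != assortments[t + 1] for t in range(len(assortments) - 1))
    item = sum(len(assortments[t] ^ assortments[t + 1]) for t in range(len(assortments) - 1))
    return asst, item

# Three time steps; the single assortment change swaps one item for another.
seq = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
print(switching_costs(seq))  # (1, 2): one assortment switch, two item switches
```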
We make two interesting observations from the results above: (1) there is a separation between the assortment switching complexities with and without knowledge of the time horizon $T$; in other words, the time horizon $T$ is useful for achieving a smaller assortment switching cost; (2) the item switching cost is at most a logarithmic factor higher than the assortment switching cost.

Technical contributions. We combine the epoch-based offering algorithm for MNL-bandits (Agrawal et al., 2019) and a natural delayed update policy in the design of AT-DUCB. Although a similar delayed update rule has recently been analyzed for multi-armed bandits and Q-learning (Bai et al., 2019), and such a result does not seem surprising, we present it in the paper as a warm-up to help readers get familiar with a few algorithmic techniques commonly used for the MNL-bandit problem.

Our first main technical contribution comes from the design of the FH-DUCB algorithm, where we invent a novel delayed update policy that uses the horizon information to improve the switching cost from $O(N \log T)$ to $O(N \log\log T)$. We note that for the ordinary multi-armed bandit problem, recent works (Gao et al., 2019) and (Simchi-Levi & Xu, 2019) managed to show a similar $O(N \log\log T)$ switching cost with a known horizon. However, their update rules do not have to utilize the learned parameters of the arms, and a straightforward conversion of such update rules to the MNL-bandit problem does not produce the desired guarantees. In contrast, our update rule, formally described in (6), carefully exploits the structure of MNL-bandits and uses the information of the partially learned preference parameters (more specifically, the $\hat v_{i,\ell_i}$ in (6)) to adaptively decide when to switch to a different assortment.

Our second main technical contribution is the ESUCB algorithm for low item switching cost. The technical challenge here stems from the fact that low item switching cost is a much stronger requirement than low assortment switching cost: simple lazy updates with the doubling trick and the straightforward analysis only show that the item switching cost is at most $N$ times the assortment switching cost (see (4)), leading to a total item switching cost of $O(N^2 \log T)$. To remove the extra factor of $N$, we propose the idea of decoupling the learning of the optimal revenue from the learning of the optimal assortment, so that the offered assortment is decided by optimizing a new objective function based on the (usually) fixed revenue estimate. Since the revenue estimates are fixed, the offered assortments enjoy improved stability, and the item switching cost can be upper bounded by a careful analysis.

We remark that the item switching cost is a particularly interesting objective that arises in online learning problems where the actions are sets of elements, which is very different from traditional MAB and linear bandits. Thanks to our novel technical ingredients, we are able to bring the item switching cost down to almost the same order as the assortment switching cost. We hope our results will inspire future study of switching costs in both settings for other online learning problems with set actions.

Related work.
MNL-bandit was first studied in (Rusmevichientong et al., 2010) and (Sauré & Zeevi, 2013), where the authors took the explore-then-commit approach and proposed algorithms with regret $O(N^2 \log^2 T)$ and $O(N \log T)$ respectively, under the assumption that the gap between the best and the second best assortments is known. (Agrawal et al., 2016) removed this assumption using a UCB-type algorithm, which achieves a regret of $O(\sqrt{NT \log T})$. An almost tight regret lower bound of $\Omega(\sqrt{NT})$ was later given by (Chen & Wang, 2018). (Agrawal et al., 2017) proposed an algorithm using Thompson sampling, which achieves a regret bound comparable to that of the UCB-type algorithms while demonstrating better numerical performance.

Learning with low policy switches (also called learning in the batched model, or learning with limited adaptivity) has recently been studied for several other problems, including stochastic multi-armed bandits (Perchet et al., 2015; Jun et al., 2016; Agarwal et al., 2017; Gao et al., 2019; Esfandiari et al., 2019; Simchi-Levi & Xu, 2019), Q-learning (Bai et al., 2019), and online learning (Cesa-Bianchi et al., 2013). This research direction is motivated by the fact that in many practical settings changing the learning policy is very costly. For example, in clinical trials, every treatment policy switch triggers a separate approval process. In crowdsourcing, it takes time for the crowd to answer questions, and thus a small number of rounds of interaction with the crowd is desirable. The learning performance is much better when the data is processed in batches and the learning policy is fixed within each batch.

2. Warm-up: An anytime algorithm with O(N log T) assortment switches

As a warm-up, we begin with a simple anytime algorithm that uses at most $O(N \log T)$ assortment switches. Our algorithm combines the epoch-based offering framework introduced in (Agrawal et al., 2016) with a deferred update policy. We first briefly explain the epoch-based offering procedure, and then present and analyze our algorithm.

The epoch-based offering. In the epoch-based offering framework, whenever we are to offer an assortment $S$, instead of offering it for only one time period, we keep offering $S$ until a no-purchase decision (item $0$) is observed, and refer to all the consecutive time periods involved in this procedure as an epoch. The detailed offering procedure is described in Algorithm 1, where $t$ is the global counter of time periods and $\{\nu_i\}$ records the number of purchases made for each item $i$ in the epoch.

Algorithm 1: EXPLORATION($S$)
1 Initialize: $\nu_i \leftarrow 0$ for all $i \in [N]$;
2 while TRUE do
3   $t \leftarrow t + 1$;
4   offer assortment $S$, and observe the purchase decision $a_t$;
5   if $a_t = 0$ then return $\{\nu_i\}$;
6   $\nu_{a_t} \leftarrow \nu_{a_t} + 1$;

The following key observation about EXPLORATION($S$) states that $\{\nu_i\}$ forms an unbiased estimate of the preference parameters of all items in $S$.

Observation 1. Let $\{\nu_i\}$ be returned by EXPLORATION($S$). For each $i \in S$, $\nu_i$ is an independent geometric random variable with mean $v_i$. Moreover, one can verify that $\mathbb{E}[\nu_i] = v_i$ and $\Pr[\nu_i = k] = \big(\tfrac{v_i}{1+v_i}\big)^k \cdot \tfrac{1}{1+v_i}$ for $k = 0, 1, 2, \ldots$.

At any point of the algorithm when an epoch has ended, for each item $i \in [N]$, we let $\bar v_i = n_i / T_i$, where $T_i$ is the number of past epochs in which $i$ was included in the offered assortment, and $n_i$ is the total number of purchases of item $i$ during all past epochs. By Observation 1, $\bar v_i$ is also an unbiased estimate of $v_i$.
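To illustrate Observation 1, here is a small simulation of the epoch-based offering on a toy MNL instance (made-up numbers); it samples many epochs of EXPLORATION($S$) and checks empirically that the per-epoch purchase counts have mean $v_i$.

```python
import random

def exploration(S, v, rng):
    """Offer assortment S until a no-purchase (item 0) is observed (one epoch).
    Returns the number of purchases of each item in S during the epoch."""
    counts = {i: 0 for i in S}
    items = [0] + sorted(S)
    weights = [v[i] for i in items]      # MNL sampling uses the unnormalized weights
    while True:
        choice = rng.choices(items, weights=weights)[0]
        if choice == 0:
            return counts
        counts[choice] += 1

# Empirical check of Observation 1: the epoch counts are geometric with mean v_i.
v = {0: 1.0, 1: 0.7, 2: 0.3}
S = {1, 2}
rng = random.Random(0)
epochs = [exploration(S, v, rng) for _ in range(20000)]
for i in sorted(S):
    print(i, round(sum(e[i] for e in epochs) / len(epochs), 3))  # close to v[i]
```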
In (Agrawal et al., 2016), the following upper confidence bound (UCB) is constructed for each $i \in [N]$:
$$
\hat v_i = \bar v_i + \sqrt{\frac{48\,\bar v_i \ln(\sqrt{N}\,T_i + 1)}{T_i}} + \frac{48 \ln(\sqrt{N}\,T_i + 1)}{T_i}. \qquad (5)
$$
We compute the assortment for the next epoch based on the vector of UCB values $\hat v = (\hat v_1, \hat v_2, \ldots, \hat v_N)$. We describe our algorithm in Algorithm 2, which can be seen as an adaptation of the one in (Agrawal et al., 2016). The main difference from (Agrawal et al., 2016) is that the UCB values (and hence the assortment) are updated only when $T_i$ reaches an integer power of $2$ for some item $i \in [N]$.

Algorithm 2: Anytime Deferred Update UCB (AT-DUCB)
1 Initialize: $\hat v_i \leftarrow 1$, $T_i \leftarrow 0$, $n_i \leftarrow 0$ for all $i \in [N]$, $t \leftarrow 0$;
2 for $\kappa = 1, 2, 3, \ldots$ do
3   compute $S_\kappa = \arg\max_{S \subseteq [N]: |S| \le K} R(S, \hat v)$;
4   $\{\nu_i\} \leftarrow$ EXPLORATION($S_\kappa$);
5   for $i \in S_\kappa$ do
6     $n_i \leftarrow n_i + \nu_i$ and $T_i \leftarrow T_i + 1$;
7     if $T_i = 2^k$ for some $k \in \mathbb{Z}$ then
8       $\bar v_i \leftarrow n_i / T_i$; $\hat v_i \leftarrow \min\Big\{\hat v_i,\ \bar v_i + \sqrt{\tfrac{48\,\bar v_i \ln(\sqrt{N}\,T_i + 1)}{T_i}} + \tfrac{48 \ln(\sqrt{N}\,T_i + 1)}{T_i}\Big\}$;

This deferred update strategy is implemented in Line 7. Also note that, instead of directly evaluating (5), the update in Line 8 makes sure that $\hat v_i$ is non-increasing as the algorithm proceeds. We comment that the optimization task in Line 3 can be done efficiently, as studied in, for example, (Rusmevichientong et al., 2010).

Theorem 2. For any time horizon $T$, the expected regret incurred by Algorithm 2 is $\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT \log T}$, and the expected number of assortment switches $\mathbb{E}[\Psi_T^{(\mathrm{asst})}]$ is $O(N \log T)$.^1

The proof of the regret upper bound in Theorem 2 is similar to that of (Agrawal et al., 2016), except for a more careful analysis of the deferred update rule. For completeness, we prove this part in Appendix A.

Proof of the assortment switch upper bound in Theorem 2. Let $D_i^{(\kappa)}$ be the event that Line 8 is executed in Algorithm 2 for item $i$ at the $\kappa$-th epoch. Recall that the assortment $S_\kappa$ is computed as $S_\kappa = \arg\max_{S \subseteq [N], |S| \le K} R(S, \hat v)$, and $\hat v$ is updated after epoch $\kappa$ only when $D_i^{(\kappa)}$ happens for some $i \in [N]$. Let $L$ be the total number of epochs at or before time $T$; since $T_i \le T$ and Line 8 is executed for item $i$ only when $T_i$ is a power of $2$, we have $\sum_{\kappa=1}^{L} \mathbb{I}[D_i^{(\kappa)}] \le \log T + 1$. We then have that
$$
\sum_{t=1}^{T-1} \mathbb{I}[S_t \ne S_{t+1}] \le \sum_{\kappa=1}^{L} \sum_{i=1}^{N} \mathbb{I}\big[D_i^{(\kappa)}\big] \lesssim N \log T. \qquad \square
$$

^1 For two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if there exists a universal constant $C < \infty$ such that $\limsup_{n\to\infty} |a_n|/|b_n| \le C$. Similarly, we write $a_n = \Omega(b_n)$ or $a_n \gtrsim b_n$ if there exists a universal constant $c > 0$ such that $\liminf_{n\to\infty} |a_n|/|b_n| \ge c$.

The lower bound. We complement our algorithmic result with the following almost matching lower bound. The theorem states that the number of assortment switches has to be $\Omega(N \log T / \log\log(NT))$ if the algorithm is anytime and incurs only $\sqrt{NT} \cdot \mathrm{poly}\log(NT)$ regret. The proof of Theorem 3 can be found in Appendix E.1.

Theorem 3. There exist universal constants $d_0, d_1 > 0$ such that the following holds. For any constant $C \ge 1$, if an anytime algorithm $\mathcal{A}$ achieves expected regret at most $d_0 \sqrt{NT}(\ln(NT))^C$ for all $T$ and all instances with $N$ items, then for any $N \ge 2$ and any $T_0 \ge N$ with $T_0$ greater than a sufficiently large constant that depends only on $C$, there exists an instance with $N$ items and a time horizon $T \in [T_0, T_0^2]$ such that the expected number of assortment switches before time $T$ is at least $d_1 N \log T / (C \log\log(NT))$.
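Before moving on to the fixed-horizon setting, the following minimal sketch shows the deferred-update bookkeeping at the heart of AT-DUCB: an item's UCB (and hence the offered assortment) can only change when its epoch count $T_i$ hits a power of two. The confidence radius mirrors the reconstructed update in Line 8, so the constants here are indicative rather than authoritative.

```python
import math

def atducb_update(i, nu_i, state, N):
    """Record one epoch's purchases of item i; recompute its UCB only when
    T_i reaches a power of two (the deferred update of Lines 7-8)."""
    state["n"][i] += nu_i
    state["T"][i] += 1
    T_i = state["T"][i]
    if (T_i & (T_i - 1)) == 0:                 # T_i is a power of two
        v_bar = state["n"][i] / T_i
        log_term = math.log(math.sqrt(N) * T_i + 1)
        ucb = v_bar + math.sqrt(48 * v_bar * log_term / T_i) + 48 * log_term / T_i
        state["vhat"][i] = min(state["vhat"][i], ucb)   # keep the UCB non-increasing
        return True    # the assortment may be recomputed after this epoch
    return False       # UCB unchanged: the next epoch reuses the same assortment

N = 5
state = {"n": {i: 0 for i in range(1, N + 1)},
         "T": {i: 0 for i in range(1, N + 1)},
         "vhat": {i: 1.0 for i in range(1, N + 1)}}
print(atducb_update(1, nu_i=2, state=state, N=N))   # True:  T_1 = 1 = 2**0
print(atducb_update(1, nu_i=0, state=state, N=N))   # True:  T_1 = 2 = 2**1
print(atducb_update(1, nu_i=1, state=state, N=N))   # False: T_1 = 3 is not a power of two
```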
3. Achieving O(N log log T) assortment switches with a known horizon

When the time horizon is known to the algorithm, we can exploit this advantage via a more carefully designed update policy to achieve only $O(N \log\log T)$ assortment switches. For convenience of presentation, we first introduce a few notations.

For each item $i \in [N]$, we divide the time periods into consecutive stages, where the boundaries between any two neighboring stages are marked by the UCB updates for item $i$. Note that the division into stages may be different for different items. For any $\ell \in \{1, 2, 3, \ldots\}$, let $T(i, \ell)$ be the set of epochs that offer item $i$ in stage $\ell$ for the item. Let $T_i^{(\ell)} = \sum_{\ell'=1}^{\ell-1} |T(i, \ell')|$ be the total number of epochs that offer item $i$ before stage $\ell$ for the item, and let $n_i^{(\ell)}$ be the total number of purchases of item $i$ in the epochs counted by $T_i^{(\ell)}$. We can therefore define $\bar v_{i,\ell} = n_i^{(\ell)} / T_i^{(\ell)}$ as an unbiased estimate of $v_i$ based on the observations before stage $\ell$. Similarly to (5), we can define $\hat v_{i,\ell}$ as a UCB for $v_i$. The UPDATE($i$) procedure (formally described in Algorithm 3) is invoked whenever the main algorithm decides to conclude the current stage for item $i$ and to update the UCB for $v_i$ together with the quantities defined above, where $\ell_i$ is the counter of the number of stages for item $i$, and $n_{i,\ell}$ is the number of purchases of item $i$ observed in stage $\ell$.

Algorithm 3: UPDATE($i$)
1 $\ell_i \leftarrow \ell_i + 1$; $T_i^{(\ell_i)} \leftarrow T_i^{(\ell_i - 1)} + |T(i, \ell_i - 1)|$; $n_i^{(\ell_i)} \leftarrow n_i^{(\ell_i - 1)} + n_{i, \ell_i - 1}$;
2 $\bar v_{i,\ell_i} \leftarrow n_i^{(\ell_i)} / T_i^{(\ell_i)}$;
3 $\hat v_{i,\ell_i} \leftarrow \min\Big\{\hat v_{i,\ell_i - 1},\ \bar v_{i,\ell_i} + \sqrt{\tfrac{48\,\bar v_{i,\ell_i} \ln(\sqrt{N}\,T + 1)}{T_i^{(\ell_i)}}} + \tfrac{48 \ln(\sqrt{N}\,T + 1)}{T_i^{(\ell_i)}}\Big\}$;

The key to the design of our main algorithm for the fixed-horizon setting is a new trigger for updating the UCB values. Let $\ell_0 = \lceil \log\log(T/N) + 1 \rceil$. For each item $i \in [N]$, we conclude the current stage $\ell_i$ and invoke UPDATE($i$) whenever the following condition $P(i, \ell_i)$ is satisfied. Note that $P(i, \ell_i)$ is adaptive to the estimated parameter $\hat v_{i,\ell_i}$ so as to customize the number of epochs between assortment switches for each item. More specifically, the smaller $\hat v_{i,\ell_i}$ is, the less regret may be incurred by offering item $i$, and therefore the longer we can offer item $i$ without switching and without incurring too much regret; this is reflected in the design of $P$:
$$
P(i, \ell_i): \quad \begin{cases} |T(i, \ell_i)| \ge 1 + \big(\tfrac{T}{N}\big)^{1 - 2^{-\ell_i}} & \text{if } \ell_i < \ell_0, \\[4pt] |T(i, \ell_i)| \ge 1 + \big(\tfrac{T}{N \hat v_{i,\ell_i}}\big)^{1 - 2^{-\ell_i + \ell_0}} \ \text{ and } \ \hat v_{i,\ell_0} > \tfrac{1}{\sqrt{NT}} & \text{if } \ell_i \ge \ell_0. \end{cases} \qquad (6)
$$

For each epoch $\kappa$, we use $\ell_i(\kappa)$ to denote the stage (in terms of item $i$) to which epoch $\kappa$ belongs. We present the details of our main algorithm in Algorithm 4. The algorithm is terminated whenever the time step $t$ reaches the horizon $T$.

Algorithm 4: Deferred Update UCB for Fixed Time Horizon (FH-DUCB)
Input: the time horizon $T$.
1 Initialize: $\ell_i \leftarrow 1$, $\hat v_{i,\ell_i} \leftarrow 1$, $n_{i,\ell_i} \leftarrow 0$, $T(i, \ell_i) \leftarrow \emptyset$, $T_i^{(\ell_i)} \leftarrow 0$ for all $i \in [N]$;
2 $t \leftarrow 0$, $S_0 \leftarrow [N]$;
3 for $\kappa = 1, 2, 3, \ldots$ do
4   if $\exists i : P(i, \ell_i)$ holds then
5     UPDATE($i$) for all $i$ such that $P(i, \ell_i)$ holds;
6   compute $S_\kappa \leftarrow \arg\max_{S \subseteq [N]: |S| \le K} R(S, \hat v_\kappa)$ where $\hat v_\kappa = (\hat v_{i,\ell_i(\kappa)})_{i \in [N]}$;
7   $\{\nu_i\} \leftarrow$ EXPLORATION($S_\kappa$);
8   for $i \in S_\kappa$ do
9     $n_{i,\ell_i} \leftarrow n_{i,\ell_i} + \nu_i$; add $\kappa$ to $T(i, \ell_i)$;

Theorem 4. For any given time horizon $T \ge N^4$, we have the following upper bound on the expected regret:
$$
\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT}\,\big(\ln^{1/2}(\sqrt{N}\,T) + 1\big) \log\log T,
$$
and the following upper bound on the expected number of assortment switches:
$$
\mathbb{E}\big[\Psi_T^{(\mathrm{asst})}\big] \lesssim N \log\log T.
$$
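Before turning to the proof of Theorem 4, the following sketch makes the update trigger concrete. It follows the reconstructed form of condition (6) above, so the exact exponents and constants should be treated as illustrative; the qualitative behavior (doubly exponentially growing stages, and longer stages for items with small $\hat v_{i,\ell}$) is what matters.

```python
import math

def should_update(ell_i, stage_len, vhat_i, vhat_i_ell0, ell0, N, T):
    """Deferred-update trigger P(i, ell_i) for FH-DUCB, in the reconstructed form of (6).
    Early stages grow doubly exponentially; from stage ell0 on, the threshold also
    scales with 1 / vhat_i, so low-preference items are re-examined less often."""
    if ell_i < ell0:
        return stage_len >= 1 + (T / N) ** (1 - 2.0 ** (-ell_i))
    if vhat_i_ell0 <= 1 / math.sqrt(N * T):
        return False   # item i contributes negligible regret: never conclude this stage
    return stage_len >= 1 + (T / (N * vhat_i)) ** (1 - 2.0 ** (-(ell_i - ell0)))

T, N = 10**6, 10
ell0 = math.ceil(math.log2(math.log2(T / N)) + 1)
print(ell0)                                              # roughly log log (T/N)
print(should_update(1, stage_len=40, vhat_i=0.5, vhat_i_ell0=0.5,
                    ell0=ell0, N=N, T=T))                # False: stage 1 needs ~ sqrt(T/N) epochs
```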
To prove Theorem 4, we first define the desired events. For each item $i$ and stage $\ell$, let $E^{(1)}_{i,\ell}$ be the event that
$$
\hat v_{i,\ell} \ge v_i \quad \text{and} \quad \hat v_{i,\ell} \le v_i + \sqrt{\frac{144\,v_i \ln(\sqrt{N}\,T + 1)}{T_i^{(\ell)}}} + \frac{144 \ln(\sqrt{N}\,T + 1)}{T_i^{(\ell)}},
$$
and let $E^{(1)} \stackrel{\mathrm{def}}{=} \bigcap_{i,\ell} E^{(1)}_{i,\ell}$. We also let $E^{(2)}_{i,\ell}$ be the event that $n_{i,\ell} \ge \tfrac{v_i}{2} |T(i,\ell)|$ whenever $v_i \ge \tfrac{1}{NT}$ and $|T(i,\ell)| \ge \tfrac{T}{4N v_i}$, and let $E^{(2)} \stackrel{\mathrm{def}}{=} \bigcap_{i,\ell} E^{(2)}_{i,\ell}$. Finally, let $E = E^{(1)} \cap E^{(2)}$. In Appendix B.1, we prove the following lemma.

Lemma 5. If $T \ge N^4$ and $T$ is greater than a large enough universal constant, then $\Pr[E] \ge 1 - 14/T$.

Bounds for the stage lengths. When $E$ happens, we can infer the following useful lower bound on the lengths of the stages after $\ell_0$. The lemma is proved in Appendix B.2.

Lemma 6. Assume that $T \ge N^4$ and $T$ is greater than a sufficiently large universal constant. Conditioned on $E^{(1)}$, for each $i \in [N]$, if $\ell_0$ is not the last stage for item $i$, we have that $\hat v_{i,\ell_0} \le \max\{2 v_i,\ 1/\sqrt{NT}\}$. Additionally, if $\hat v_{i,\ell_0} > 1/\sqrt{NT}$, then for all $\ell > \ell_0$ such that $\ell$ is not the last stage for item $i$, we have that $|T(i,\ell)| \ge \big(T/(2N v_i)\big)^{1 - 2^{-\ell + \ell_0 + 1}}$.

Upper bounding the number of assortment switches. Suppose that there are $L$ epochs before the algorithm terminates. We only need to upper bound $\mathbb{E}\big[\sum_{i=1}^{N} \ell_i(L)\big]$, which upper bounds the number of assortment switches $\mathbb{E}[\Psi_T^{(\mathrm{asst})}]$. For each $i \in [N]$, if $\ell_i(L) \ge \ell_0$ and $\hat v_{i,\ell_0} \le 1/\sqrt{NT}$, we easily deduce that $\ell_i(L) \le \ell_0 + 1$ because of the condition $P(i, \ell_0)$. Otherwise, assuming that $\hat v_{i,\ell_0} > 1/\sqrt{NT}$, by Lemma 6, conditioned on $E^{(1)}$, we have that $v_i \ge \tfrac{1}{2\sqrt{NT}}$ and $|T(i,\ell)| \ge \tfrac{T}{4N v_i}$ for all $\ell \in [\ell_0 + \log\log\tfrac{T}{2N v_i} + 1,\ \ell_i(L) - 1]$. Because of $E^{(2)}$, we have $n_{i,\ell} \ge \tfrac{v_i}{2} |T(i,\ell)| \ge \tfrac{T}{8N}$ for all $\ell \in [\ell_0 + \log\log\tfrac{T}{2N v_i} + 1,\ \ell_i(L) - 1]$. Since the total number of purchases is at most $T$, there are no more than $8N$ pairs of $(i, \ell)$ satisfying $\ell \in [\ell_0 + \log\log\tfrac{T}{2N v_i} + 1,\ \ell_i(L) - 1]$. In total, conditioned on $E$, we have that
$$
\sum_{i=1}^{N} \ell_i(L) \lesssim N(\ell_0 + 1) + \sum_{i:\,\hat v_{i,\ell_0} > 1/\sqrt{NT}} \Big( \log\log\tfrac{T}{2N v_i} + \max\big\{\ell_i(L) - \ell_0 - \log\log\tfrac{T}{2N v_i},\ 0\big\} \Big) \lesssim N \log\log T + N \log\log\big(T^{3/2} N^{-1/2}\big) \lesssim N \log\log T, \qquad (7)
$$
where the second inequality is because of Lemma 6 (which gives $v_i \ge \tfrac{1}{2\sqrt{NT}}$) and the fact established above that the max terms contribute at most $8N$ in total. Finally, since the contribution to the expected number of assortment switches when $E$ fails is at most $\Pr[\neg E] \cdot T \le O(1)$ (because of Lemma 5), we prove the upper bound on the number of assortment switches in Theorem 4.

Upper bounding the expected regret. Let $E(\kappa)$ be the length of epoch $\kappa$, i.e., the number of time steps taken in epoch $\kappa$. Note that $E(\kappa)$ is a geometric random variable with mean value $(1 + \sum_{i \in S_\kappa} v_i)$. Also recall that there are $L$ epochs in total. Letting $S^\star$ be the optimal assortment, conditioned on event $E^{(1)}$, we have that
$$
\mathbb{E}[\mathrm{Reg}_T] = \mathbb{E}\Big[\sum_{\kappa=1}^{L} E(\kappa)\big(R(S^\star, v) - R(S_\kappa, v)\big)\Big] = \mathbb{E}\Big[\sum_{\kappa=1}^{L} \Big(1 + \sum_{i \in S_\kappa} v_i\Big)\big(R(S^\star, v) - R(S_\kappa, v)\big)\Big] \le \mathbb{E}\Big[\sum_{\kappa=1}^{L} \sum_{i \in S_\kappa} \big(\hat v_{i,\ell_i(\kappa)} - v_i\big)\Big] = \mathbb{E}\Big[\sum_{i=1}^{N} \sum_{\ell \le \ell_i(L)} \sum_{\kappa \in T(i,\ell)} \big(\hat v_{i,\ell} - v_i\big)\Big], \qquad (8)
$$
where the inequality is due to Lemma 17. In the next lemma, we upper bound the contribution of each item $i$ and stage $\ell$ to the upper bound in (8). The lemma is proved in Appendix B.3.

Lemma 7. Conditioned on event $E^{(1)}$, for any item $i$ and any stage $\ell \le \ell_i(L)$, we have that
$$
\sum_{\kappa \in T(i,\ell)} \big(\hat v_{i,\ell} - v_i\big) \lesssim \sqrt{NT}\,\big(\ln^{1/2}(\sqrt{N}\,T) + 1\big) / N.
$$

Combining Lemma 5, Lemma 7, and inequalities (7) and (8), we have that
$$
\mathbb{E}[\mathrm{Reg}_T] \le T \Pr\big[\neg E^{(1)}\big] + \mathbb{E}\Big[\mathbb{I}\big[E^{(1)}\big] \sum_{i=1}^{N} \sum_{\ell \le \ell_i(L)} \sum_{\kappa \in T(i,\ell)} \big(\hat v_{i,\ell} - v_i\big)\Big] \lesssim \sqrt{NT}\,\big(\ln^{1/2}(\sqrt{N}\,T) + 1\big) \log\log T,
$$
proving the expected regret upper bound in Theorem 4.

The lower bound. We prove the following matching lower bound in Appendix E.2.

Theorem 8. For any constant $C \ge 0$ and time horizon $T$, if an algorithm $\mathcal{A}$ achieves expected regret $\mathbb{E}[\mathrm{Reg}_T]$ at most $\sqrt{NT}(\ln(NT))^C$ for all $N$-item instances, then there exists an $N$-item instance such that the expected number of assortment switches is $\mathbb{E}[\Psi_T^{(\mathrm{asst})}] = \Omega(N \log\log T)$.
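As an aside, to see why a trigger of the form (6) yields only $O(\log\log T)$ stages per item, here is an illustrative back-of-envelope calculation under the reconstructed form of (6); it is a sanity check, not a substitute for the formal argument above (all logarithms are base 2, as in the rest of the paper).

```latex
% If stage \ell of item i is not its last stage, the first branch of (6) gives
% |T(i,\ell)| \ge (T/N)^{1-2^{-\ell}}.  Once \ell \ge \lceil \log\log(T/N) \rceil,
% we have 2^{-\ell} \le 1/\log(T/N), hence
\[
  |T(i,\ell)| \;\ge\; \Big(\tfrac{T}{N}\Big)^{1-\frac{1}{\log(T/N)}}
  \;=\; \tfrac{T}{N}\cdot 2^{-\frac{\log(T/N)}{\log(T/N)}}
  \;=\; \tfrac{T}{2N}.
\]
% Each such stage consumes at least T/(2N) epochs, and the total number of epochs is
% at most T, so summed over all N items there are at most 2N such stages.  Together
% with the at most \lceil \log\log(T/N) \rceil earlier stages per item, this gives
% O(N \log\log T) stages overall, i.e., O(N \log\log T) assortment switches.
```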
4. Optimizing the number of item switches

In this section, we study how to minimize the item switching cost while still achieving $\tilde O(\sqrt{NT})$ regret.

We now propose a new algorithm, Exponential Stride UCB (ESUCB), that achieves an item switching cost linear in $N$ and poly-logarithmic in $T$. The specific guarantee of the ESUCB algorithm is presented in Theorem 10, the main theorem of this section. The key idea of the algorithm is to decouple the learning of the optimal expected revenue from the learning of the optimal assortment, which is made possible by the following lemma.

Lemma 9. Define $G(\lambda) \stackrel{\mathrm{def}}{=} R(S_\lambda, v)$, where $S_\lambda \stackrel{\mathrm{def}}{=} \arg\max_{S \subseteq [N]: |S| \le K} \sum_{i \in S} v_i (r_i - \lambda)$. There exists a unique $\lambda^\star$ such that $G(\lambda^\star) = \lambda^\star = \max_{S: |S| \le K} R(S, v)$. Moreover, (1) for any $\lambda < \lambda^\star$, we have that $G(\lambda) > \lambda$, and (2) for any $\lambda > \lambda^\star$, we have that $G(\lambda) < \lambda$.

The proof of Lemma 9 is deferred to Appendix D.1. Motivated by the lemma, we present our ESUCB algorithm in Algorithm 5. The algorithm learns the optimal revenue $\lambda^\star$ in the main loop, using a sequence of exponentially decreasing learning step sizes $\varepsilon_\tau$. For each estimate $\hat\lambda$, the CHECK procedure (Algorithm 6) learns the corresponding assortment via the UCB method with deferred updates. (More precisely, the algorithm learns $S_{\hat\lambda - \varepsilon_\tau}$ and $S_{\hat\lambda - 3\varepsilon_\tau}$, and at Line 4 of Algorithm 6 it chooses between them based on the UCB estimate $\tilde\lambda$ of the expected revenue of $S_{\hat\lambda - \varepsilon_\tau}$.) In the CHECK procedure, the variable $t$ keeps the count of time steps and is updated inside EXPLORATION. We also make the following notes: 1) The ESUCB algorithm needs the horizon $T$ as input, and uses a confidence parameter $\delta$, which is usually set to $1/T$; the whole algorithm terminates whenever the horizon $T$ is reached. 2) At the optimization steps (Lines 6 and 9 of Algorithm 6), we have to adopt a deterministic tie-breaking rule; e.g., we let the $\arg\max$ operator return the $S$ that minimizes $\sum_{i \in S} 2^i$ among all maximizers.

Algorithm 5: The Exponential Stride UCB algorithm (ESUCB) for MNL-bandit
1 Initialize: $\hat\lambda \leftarrow 1$, $\varepsilon_1 \leftarrow 1/3$, $c_1 \leftarrow 44840$;
2 for $\tau = 1, 2, 3, \ldots$ do
3   $t_{\max} \leftarrow c_1 N \ln^3(NT/\delta) / \varepsilon_\tau^2$;
4   if CHECK($\hat\lambda - 3\varepsilon_\tau$, $\hat\lambda - \varepsilon_\tau$, $t_{\max}$) then $\hat\lambda \leftarrow \hat\lambda - \varepsilon_\tau$;
5   $\varepsilon_{\tau+1} \leftarrow \tfrac{2}{3}\varepsilon_\tau$;

Algorithm 6: CHECK($\lambda_l$, $\lambda_r$, $t_{\max}$)
1 Initialize: $\hat v_i \leftarrow 1$, $T_i \leftarrow 0$, $n_i \leftarrow 0$ for all $i \in [N]$; $c_2 \leftarrow 688$, $c_3 \leftarrow 21732$;
2 $\rho \leftarrow 0$, $\tilde\lambda \leftarrow 1$, $b \leftarrow$ false, $t \leftarrow 0$;
3 for $\kappa = 1, 2, 3, \ldots$ do
4   if $\tilde\lambda < \lambda_r$ then
5     $b \leftarrow$ true;
6     $S_\kappa \leftarrow \arg\max_{S \subseteq [N], |S| \le K} \sum_{i \in S} \hat v_i (r_i - \lambda_l)$;
7     $\{\nu_i\} \leftarrow$ EXPLORATION($S_\kappa$);
8   else
9     $S_\kappa \leftarrow \arg\max_{S \subseteq [N], |S| \le K} \sum_{i \in S} \hat v_i (r_i - \lambda_r)$;
10    $\{\nu_i\} \leftarrow$ EXPLORATION($S_\kappa$);
11  $\rho \leftarrow \rho + \sum_{i \in S_\kappa} \nu_i r_i$; $\tilde\lambda \leftarrow \frac{\rho}{t} + \frac{1}{t}\Big(c_2 \sqrt{N t_{\max} \ln^3(NT/\delta)} + c_3 N \ln^3(NT/\delta)\Big)$;
12  if $t \ge t_{\max}$ then return $b$;
13  for $i \in S_\kappa$ do
14    $n_i \leftarrow n_i + \nu_i$, $T_i \leftarrow T_i + 1$;
15    if $T_i = 2^k$ for some $k \in \mathbb{Z}$ then
16      $\bar v_i \leftarrow n_i / T_i$; $\hat v_i \leftarrow \min\Big\{\hat v_i,\ \bar v_i + \sqrt{\tfrac{196\,\bar v_i \log(NT/\delta + 1)}{T_i}} + \tfrac{292 \log(NT/\delta + 1)}{T_i}\Big\}$;

Theorem 10. Setting $\delta = 1/T$, we have the following upper bound on the expected regret of ESUCB: $\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT} \log^{1.5}(NT)$, and the item switching cost of ESUCB satisfies $\Psi_T^{(\mathrm{item})} \lesssim N \log^2 T$.

To prove Theorem 10, we upper bound the item switching cost and the expected regret separately.

Upper bounding the item switching cost. Since the estimate of $\lambda^\star$ is fixed within CHECK, the outcome of $\arg\max_{S: |S| \le K} \sum_{i \in S} \hat v_i (r_i - \lambda)$ (corresponding to Lines 6 and 9 of Algorithm 6) is more stable than that of $\arg\max_{S: |S| \le K} R(S, \hat v)$ in the previous algorithms. Exploiting this advantage, we upper bound the number of item switches incurred by each call of CHECK as follows. The lemma is proved in Appendix D.2.

Lemma 11. The item switching cost incurred by any invocation CHECK($\lambda_l$, $\lambda_r$, $t_{\max}$) is $O(N \log T)$.

Since the loop in Algorithm 5 iterates only $O(\log T)$ times, Lemma 11 easily implies an $O(N \log^2 T)$ item switching cost upper bound for ESUCB. We also note that this bound can be improved to $O(N \log T)$ via a slight modification of the algorithm, which is elaborated in Appendix F.

Upper bounding the expected regret. We first provide the following guarantees for CHECK.

Lemma 12 (Main Lemma for CHECK). For any invocation CHECK($\lambda_l$, $\lambda_r$, $t_{\max}$), with probability at least $(1 - \delta/T)$ the following statements hold.
(a) If CHECK returns true, then $G(\lambda_r) < \lambda_r$.
(b) If CHECK returns false, then $G(\lambda_r) \ge \lambda_r - \frac{1}{t_{\max}}\big(c_2 \sqrt{N t_{\max} \ln^3(NT/\delta)} + c_3 N \ln^3(NT/\delta)\big)$.
(c) Let $r^{(t)}_{\mathrm{CHECK}}$ be the reward at time step $t$ in this invocation. If $\lambda_l \le \lambda^\star$, then we have that
$$
\sum_{t} \big(\lambda_l - \mathbb{E}\big[r^{(t)}_{\mathrm{CHECK}}\big]\big) \lesssim \sqrt{N t_{\max} \ln^3(NT/\delta)} + N \ln^3(NT/\delta).
$$

The proof of Lemma 12 is built upon Lemma 9 and deferred to Appendix D.3. Let $Q_\tau$ be the event that statements (a)–(c) hold for the invocation of CHECK at iteration $\tau$ of Algorithm 5, and let $Q$ be the event that $Q_\tau$ holds for every $\tau$. By Lemma 12 and a union bound, we immediately have that $\Pr[Q] \ge 1 - \delta$.
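The following minimal sketch shows how the outer loop of Algorithm 5 uses CHECK to home in on $\lambda^\star$. Here `check` is a black-box stand-in with the semantics of Lemma 12(a) (it returns true only if $\lambda_r$ overshoots $\lambda^\star$), and the $2/3$ step-size decay follows our reading of the induction in Lemma 13; both are assumptions of this sketch rather than verbatim details of the algorithm.

```python
import math

def esucb_outer_loop(check, N, T, delta, c1=44840, iters=50):
    """Sketch of ESUCB's outer search for the optimal revenue lambda* (Algorithm 5).
    check(lam_l, lam_r, t_max) returns True only if G(lam_r) < lam_r, i.e. lam_r > lambda*.
    Invariant maintained (cf. Lemma 13): lam_hat - 3*eps <= lambda* <= lam_hat."""
    lam_hat, eps = 1.0, 1.0 / 3.0
    for _ in range(iters):
        t_max = c1 * N * math.log(N * T / delta) ** 3 / eps ** 2
        if check(lam_hat - 3 * eps, lam_hat - eps, t_max):
            lam_hat -= eps          # lam_r is provably above lambda*: move the estimate down
        eps *= 2.0 / 3.0            # shrink the bracket [lam_hat - 3*eps, lam_hat]
    return lam_hat

# Toy run with an idealized check oracle that knows lambda* exactly.
lam_star = 0.42
oracle = lambda lam_l, lam_r, t_max: lam_r > lam_star
print(round(esucb_outer_loop(oracle, N=10, T=10**5, delta=1e-5), 4))   # ~ 0.42
```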
The next lemma, built upon Lemma 9 and Lemma 12, shows that $\hat\lambda$ in Algorithm 5 is always an upper confidence bound for the true parameter $\lambda^\star$, and converges to $\lambda^\star$ at a decent rate.

Lemma 13. Let $\hat\lambda^{(\tau)}$ be the value of $\hat\lambda$ at the beginning of iteration $\tau$ of Algorithm 5. Conditioned on event $Q$, for any iteration $\tau = 1, 2, 3, \ldots$, we have that
$$
\hat\lambda^{(\tau)} - 3\varepsilon_\tau \le \lambda^\star \le \hat\lambda^{(\tau)}. \qquad (9)
$$

Proof. We prove (9) by induction on $\tau$. For iteration $\tau = 1$, (9) trivially holds since $0 \le r_i \le 1$ and therefore $0 \le \lambda^\star \le 1$. Now suppose (9) holds for iteration $\tau$; we will establish (9) for iteration $(\tau+1)$. Consider the invocation of CHECK($\lambda_l$, $\lambda_r$, $t_{\max}$) at iteration $\tau$, where $\lambda_l = \hat\lambda^{(\tau)} - 3\varepsilon_\tau$ and $\lambda_r = \hat\lambda^{(\tau)} - \varepsilon_\tau$. We discuss the following two cases.

Case 1. When the CHECK procedure returns true, by Lemma 12 we have that $G(\lambda_r) < \lambda_r$. By Lemma 9, we have that $\lambda_r > \lambda^\star$. Therefore, by Line 4 and the induction hypothesis we have that $\hat\lambda^{(\tau+1)} = \hat\lambda^{(\tau)} - \varepsilon_\tau = \lambda_r > \lambda^\star$, and $\hat\lambda^{(\tau+1)} - 3\varepsilon_{\tau+1} = \lambda_r - 2\varepsilon_\tau = \hat\lambda^{(\tau)} - 3\varepsilon_\tau \le \lambda^\star$, proving (9).

Case 2. When the CHECK procedure returns false, by Lemma 12 we have that
$$
G(\lambda_r) \ge \lambda_r - \frac{1}{t_{\max}}\Big(c_2 \sqrt{N t_{\max} \ln^3(NT/\delta)} + c_3 N \ln^3(NT/\delta)\Big).
$$
Recall that at Line 3 we set $t_{\max} = c_1 N \ln^3(NT/\delta) / \varepsilon_\tau^2$. For large enough $c_1$, and since $G(\lambda_r) \le \max_{S: |S| \le K} R(S, v) = \lambda^\star$, this implies that $\lambda^\star \ge \lambda_r - \varepsilon_\tau = \hat\lambda^{(\tau)} - 2\varepsilon_\tau = \hat\lambda^{(\tau+1)} - 3\varepsilon_{\tau+1}$. By Line 4 and the induction hypothesis we have that $\hat\lambda^{(\tau+1)} = \hat\lambda^{(\tau)} \ge \lambda^\star$, finishing the proof of (9). $\square$

Finally, we upper bound the expected regret of Algorithm 5.

Lemma 14. With probability at least $1 - \delta$, the expected regret incurred by Algorithm 5 is $O(\sqrt{NT} \log^{1.5}(NT/\delta))$. Therefore, if we set $\delta = 1/T$, we have that $\mathbb{E}[\mathrm{Reg}_T] \lesssim \sqrt{NT} \log^{1.5}(NT)$.

Proof. Throughout the proof we condition on the event $Q$, which happens with probability at least $(1 - \delta)$. We first prove that the expected regret incurred during iteration $\tau$ of Algorithm 5 is bounded by $O(N \ln^3(NT/\delta) / \varepsilon_\tau)$. Consider the invocation CHECK($\lambda_l$, $\lambda_r$, $t_{\max}$) at Line 4, and recall that we set $t_{\max} = c_1 N \ln^3(NT/\delta) / \varepsilon_\tau^2$. Combining statement (c) of Lemma 12 with Lemma 13, the expected regret of this invocation is bounded by
$$
\sum_t \big(\lambda^\star - \mathbb{E}\big[r^{(t)}_{\mathrm{CHECK}}\big]\big) \lesssim t_{\max}(\lambda^\star - \lambda_l) + \sum_t \big(\lambda_l - \mathbb{E}\big[r^{(t)}_{\mathrm{CHECK}}\big]\big) \lesssim t_{\max}(\lambda^\star - \lambda_l) + \sqrt{N t_{\max} \ln^3(NT/\delta)} + N \ln^3(NT/\delta), \qquad (10)
$$
plus an additive $O(N)$ term due to the last epoch, which might run past time $t_{\max}$. By Lemma 13, we have that $\lambda^\star - \lambda_l \lesssim \varepsilon_\tau$. Therefore, (10) is upper bounded by $O(N \ln^3(NT/\delta) / \varepsilon_\tau)$. Since CHECK($\lambda_l$, $\lambda_r$, $t_{\max}$) runs for at least $t_{\max}$ time steps, the second-to-last iteration $(\tau_{\max} - 1)$ satisfies $c_1 N \ln^3(NT/\delta) / \varepsilon_{\tau_{\max}-1}^2 \le T$, which means that $\varepsilon_{\tau_{\max}-1} \ge \sqrt{N \log^3(NT/\delta) / T}$. Since $\{\varepsilon_\tau\}$ is a geometrically decreasing sequence, the overall expected regret is bounded by the order of $N \log^3(NT/\delta) / \varepsilon_{\tau_{\max}-1} \lesssim \sqrt{NT \log^3(NT/\delta)}$. $\square$

Refined and non-trivial item switching cost upper bound for the AT-DUCB algorithm. Since an assortment switch may incur at most $2K$ item switches, Theorem 2 trivially implies that Algorithm 2 (AT-DUCB) incurs at most $O(KN \log T)$ item switches, which is upper bounded by $O(N^2 \log T)$ since $K = O(N)$. In Appendix C, we present a refined analysis showing that the item switching cost of AT-DUCB is at most $O(N^{1.5} \log T)$. While it is not clear to us whether the dependence on $N$ delivered by this analysis is optimal, we also discuss the relationship between the analysis and an extensively studied (but not yet fully resolved) problem in discrete geometry, namely the maximum number of planar $K$-sets. We hope that further study of this relationship might lead to improvements of both the upper and lower bounds on the item switching cost of AT-DUCB. Please refer to Appendix C for more details.
5. Conclusion

In this paper, we present algorithms for MNL-bandits that achieve both almost optimal regret and almost optimal assortment switching cost, in both the anytime and the fixed-horizon settings. We also design the ESUCB algorithm, which achieves almost optimal regret and an item switching cost of $O(N \log^2 T)$. For future directions, it is interesting to study whether it is possible to achieve an item switching cost of $O(N \log T)$ in the anytime setting and $O(N \log\log T)$ in the fixed-horizon setting. Also, as mentioned in Section 4 (and Appendix C), given the simplicity of our AT-DUCB algorithm, it is worthwhile to further refine the bounds on its item switching cost.

Acknowledgement

Part of this work was done while Kefan Dong was a visiting student at UIUC. Kefan Dong and Yuan Zhou were supported in part by a Ye Grant and a JPMorgan Chase AI Research Faculty Research Award. Qin Zhang was supported in part by NSF IIS-1633215, CCF-1844234 and CCF-2006591.

References

Agarwal, A., Agarwal, S., Assadi, S., and Khanna, S. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In COLT, pp. 39–75, 2017.

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. A near-optimal exploration-exploitation approach for assortment selection. In EC, pp. 599–600, 2016.

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the MNL-bandit. In COLT, pp. 76–78, 2017.

Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.

Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. Provably efficient Q-learning with low switching cost. In NeurIPS, 2019.

Cesa-Bianchi, N., Dekel, O., and Shamir, O. Online learning with switching costs and other adaptive adversaries. In NIPS, pp. 1160–1168, 2013.

Chen, X. and Wang, Y. A note on a tight lower bound for capacitated MNL-bandit assortment selection models. Oper. Res. Lett., 46(5):534–537, 2018.

Dey, T. K. Improved bounds for planar k-sets and related problems. Discrete & Computational Geometry, 19(3):373–382, 1998.

Esfandiari, H., Karbasi, A., Mehrabian, A., and Mirrokni, V. S. Batched multi-armed bandits with optimal regret. CoRR, abs/1910.04959, 2019.

Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multi-armed bandits problem. In NeurIPS, 2019.

Jin, Y., Li, Y., Wang, Y., and Zhou, Y. On asymptotically tight tail bounds for sums of geometric and exponential random variables. arXiv preprint arXiv:1902.02852, 2019.

Jun, K., Jamieson, K. G., Nowak, R. D., and Zhu, X. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pp. 139–148, 2016.

Luce, R. D. Individual Choice Behavior: A Theoretical Analysis. Courier Corporation, 2012.

Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. Batched bandit problems. In COLT, pp. 1456, 2015.

Pinsker, M. S. Information and Information Stability of Random Variables and Processes. Holden-Day, 1964.

Rusmevichientong, P., Shen, Z. M., and Shmoys, D. B. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666–1680, 2010.

Sauré, D. and Zeevi, A. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387–404, 2013.

Simchi-Levi, D. and Xu, Y. Phase transitions and cyclic phenomena in bandits with switching constraints. In NeurIPS, 2019.
Tóth, G. Point sets with many k-sets. Discrete & Computational Geometry, 26(2):187–194, 2001.

Train, K. E. Discrete Choice Methods with Simulation. Cambridge University Press, 2009.