# Multinomial Logit Bandit with Linear Utility Functions

Mingdong Ou, Nan Li, Shenghuo Zhu, Rong Jin
Alibaba Group, Hangzhou, China
{mingdong.omd, nanli.ln, shenghuo.zhu, jinrong.jr}@alibaba-inc.com

Abstract

Multinomial logit bandit is a sequential subset selection problem which arises in many applications. In each round, the player selects a K-cardinality subset from N candidate items and receives a reward which is governed by a multinomial logit (MNL) choice model considering both item utility and the substitution property among items. The player's objective is to dynamically learn the parameters of the MNL model and maximize the cumulative reward over a finite horizon T. This problem faces the exploration-exploitation dilemma, and its combinatorial nature makes it non-trivial. In recent years, several algorithms have been developed by exploiting specific characteristics of the MNL model, but all of them estimate the parameters of the MNL model separately and incur a regret no better than O(√(NT)), which is undesirable for a large candidate set size N. In this paper, we consider the linear utility MNL choice model, whose item utilities are represented as linear functions of d-dimensional item features, and propose an algorithm, titled LUMB, to exploit the underlying structure. It is proven that the proposed algorithm achieves O(dK√T (log T)^2) regret, which is free of the candidate set size. Experiments show the superiority of the proposed algorithm.

1 Introduction

In the traditional stochastic multi-armed bandit (MAB) [Bubeck and Cesa-Bianchi, 2012], the player selects one of N items and receives a reward corresponding to that item in each round. The objective is to maximize the cumulative reward over a finite horizon of length T, or alternatively, to minimize the regret relative to an oracle. Typically, algorithms are designed based on an appropriate exploration-exploitation tradeoff which allows the player to identify the best item through exploration whilst not spending too much on sub-optimal ones; the family of upper confidence bound (UCB) algorithms [Auer, 2002; Chu et al., 2011] and Thompson sampling (TS) [Thompson, 1933; Agrawal and Goyal, 2012] are representative examples. This paper studies a combinatorial variant of MAB, where in each round the player offers a subset of cardinality K to a user and receives the reward associated with one of the items in the selected subset.¹ The player faces the problem of determining which subset of items to present to users who arrive sequentially, whilst not knowing user preference. Similar to MAB, we need to solve the exploration-exploitation tradeoff in this problem. However, a naive translation of this problem to MAB is prohibitive, since the number of possible K-cardinality subsets is exponentially large and cannot be efficiently explored within a reasonably sized time horizon. To tackle this issue, different strategies have been proposed in the literature, e.g., [Kveton et al., 2015; Lagrée et al., 2016; Agrawal et al., 2017a]. In recent literature, the multinomial logit (MNL) choice model [Luce, 2005; Plackett, 1975], which is widely used in economics, is utilized to model user choice behavior by specifying the probability that a user selects an item given the offered set, and the above exploration-exploitation problem is referred to as the MNL-Bandit problem [Rusmevichientong et al., 2010; Agrawal et al., 2017a; 2017b; Cheung and Simchi-Levi, 2017].

¹This problem is known as dynamic assortment selection in the literature [Caro and Gallien, 2007; Rusmevichientong et al., 2010], where the selected subset of items forms an assortment.
Unlike other combinatorial bandit problems, the MNL-Bandit problem considers the substitution property among items, which leads to a non-monotonic reward function. By exploiting specific characteristics of the MNL model, UCB-style [Agrawal et al., 2017a] and TS-style [Agrawal et al., 2017b] algorithms have been developed to dynamically learn the parameters of the MNL model, which are a priori unknown, achieving a regret of O(√(NT)) under a mild assumption. Also, a lower regret bound is established in [Agrawal et al., 2017a], by showing that any algorithm based on the MNL choice model must incur a regret of Ω(√(NT/K)). It is easy to see that the regret depends on the number of candidate items N, making these algorithms less preferred in many large-scale applications such as online advertising.

In this paper, we use the linear utility MNL choice model (formulated in Section 3) to model user choice behavior given a set of items, rather than the traditional MNL model. Specifically, it is assumed that each item in the candidate set is described by a d-dimensional feature vector, and the item utilities of the MNL model can be formulated as linear functions of the item features. Based on this, the problem of estimating item utilities (i.e., the parameters of the MNL model) is changed to estimating the underlying parameters of the linear functions. Since the number of parameters is independent of the number of items, it is possible to achieve a more efficient solution when the number of items is large. By taking the UCB approach, we propose an algorithm, titled LUMB (short for Linear Utility MNL-Bandit), to dynamically learn the parameters and reduce the regret. The main contributions of this work include:

- To the best of our knowledge, this is the first work to use the linear utility MNL choice model in the sequential subset selection problem. Also, a UCB-style algorithm, LUMB, is proposed to learn the model parameters dynamically.
- An upper regret bound O(dK√T (log T)^2) is established for the proposed LUMB algorithm. This regret bound is free of the candidate item set size, which means that LUMB can be applied to large item sets.
- Empirical studies demonstrate the superiority of the proposed LUMB algorithm over existing algorithms.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. Sections 3 and 4 present the problem formulation and the LUMB algorithm. Section 5 establishes the regret analysis. Section 6 summarizes the experiments, and Section 7 concludes this work with future directions.

2 Related Work

Classical bandit algorithms aim to find the best arm with an exploration-exploitation strategy. Auer [2002] first proposes a UCB approach in the linear payoff setting. Dani et al. [2008] and Abbasi-Yadkori et al. [2011] propose improved algorithms which bound the linear parameters directly. Agrawal and Goyal [2013] propose a Thompson sampling approach. However, because the reward of a subset is not a linear function of the features of the items in the subset, these works cannot be directly applied to our problem. Another class of bandit works related to ours is combinatorial bandits, where the player selects a subset of arms and receives a collective reward in each round.
Researchers study the problem mainly in two settings, the stochastic setting [Gai et al., 2012; Russo and Van Roy, 2014; Kveton et al., 2015] and the adversarial setting [Cesa-Bianchi and Lugosi, 2012; Audibert et al., 2013]. Gai et al. [2012] first study the problem in the linear reward setting, and Kveton et al. [2015] prove a tight regret bound. It is generalized to non-linear rewards in [Chen et al., 2016; 2013]. Wen et al. [2015] and Wang et al. [2017] propose contextual algorithms which can handle large item sets. However, these works assume that the reward is monotonic, which does not hold in MNL-Bandit (Section 3). In practice, as clarified in [Cheung and Simchi-Levi, 2017], a low-reward item may divert the attention of the user and lead to a lower subset reward. Rusmevichientong et al. [2010] solve the MNL-Bandit problem and achieve an instance-dependent O(log T) upper regret bound, and Sauré and Zeevi [2013] extend it to a wider class of choice models. Agrawal et al. [2017a] propose a UCB approach and achieve an instance-independent O(√(NT)) upper regret bound. Agrawal et al. [2017b] propose a Thompson sampling approach with better empirical performance. Recently, some works have begun to study variants of the classic MNL-Bandit problem. Some works study the personalized MNL-Bandit problem [Kallus and Udell, 2016; Bernstein et al., 2017; Golrezaei et al., 2014]. Cheung and Simchi-Levi [2017] study the problem with resource constraints. However, as clarified in Section 1, these works model item utilities separately, which is not feasible for large item sets.

3 Problem Formulation

Suppose there is a candidate item set, S = {1, ..., N}, to offer to users. Each item i corresponds to a reward r_i ≥ 0 and a feature vector x_i ∈ R^d. Let X = [x_1, ..., x_N] be the feature matrix. In MNL-Bandit, at each time step t, the player selects a subset S_t ∈ C(S) = {S' | S' ⊆ S, |S'| ≤ K} and observes the user choice c_t ∈ S_t ∪ {0}, where c_t = 0 represents that the user chooses nothing from S_t. The objective is to design a bandit policy to approximately maximize the expected cumulative reward of chosen items, i.e., E(Σ_t r_{c_t}). According to the above setting, at each time step t, a bandit policy is only allowed to exploit the item features {x_1, ..., x_N}, the historically selected subsets {S_τ | τ < t}, and the user choice feedbacks {c_τ | τ < t}.

The user choice follows the MNL model [Luce, 2005; Plackett, 1975]. MNL assumes a substitution property among the items in a selected subset: for item i, the larger the utilities of the other items, the smaller the probability that item i is chosen. The choice probability follows a multinomial distribution,

P(c_t = i \mid S_t) = \begin{cases} \dfrac{v_i}{1 + \sum_{j \in S_t} v_j}, & i \in S_t, \\ \dfrac{1}{1 + \sum_{j \in S_t} v_j}, & i = 0, \\ 0, & \text{otherwise}, \end{cases}    (1)

where v_i ≥ 0 is the item utility, which is a priori unknown to the player. The choice probability of item i in the selected subset S_t is linear in its utility v_i. Besides, it is possible that nothing is chosen, which is realized by adding a virtual item 0 with utility 1. With the MNL model, we have the expected reward under a given utility vector v = [v_1, ..., v_N],

R(S_t, v) = \frac{\sum_{i \in S_t} v_i r_i}{1 + \sum_{i \in S_t} v_i}.    (2)

Note that the expected reward is non-monotonic, that is, both the addition of a low-reward item to the selected subset and an increment in the utility of a low-reward item may lead to a lower reward. The expected cumulative reward is

\sum_{t=1}^{T} E(r_{c_t}) = \sum_{t=1}^{T} R(S_t, v).    (3)
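As a concrete illustration of Eqs. (1) and (2), the minimal sketch below computes the MNL choice probabilities of an offered subset and its expected single-step reward. It is only meant to make the model explicit; the function names and the NumPy-based implementation are ours, not part of the paper.

```python
import numpy as np

def mnl_choice_probabilities(v, subset):
    """MNL choice probabilities of Eq. (1) for the offered subset.

    v      : array of item utilities v_i >= 0
    subset : list of offered item indices (the assortment S_t)
    Returns (p_item, p_none): per-item choice probabilities and the
    probability of the no-purchase option (the virtual item with utility 1).
    """
    denom = 1.0 + sum(v[i] for i in subset)
    p_item = {i: v[i] / denom for i in subset}
    p_none = 1.0 / denom
    return p_item, p_none

def expected_reward(v, r, subset):
    """Expected single-step reward R(S, v) of an assortment, Eq. (2)."""
    denom = 1.0 + sum(v[i] for i in subset)
    return sum(v[i] * r[i] for i in subset) / denom

# Toy usage with random utilities and rewards.
rng = np.random.default_rng(0)
v = rng.uniform(0.0, 1.0, size=6)
r = rng.uniform(0.0, 1.0, size=6)
print(mnl_choice_probabilities(v, [1, 3, 4]))
print(expected_reward(v, r, [1, 3, 4]))
```

Note how adding an item with a small reward still enlarges the denominator of Eq. (2), which is exactly the non-monotonicity discussed above.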
Since a direct analysis of Eq. (3) is not tractable when v is unknown, we analyze the regret instead,

Reg(T, v) = \sum_{t=1}^{T} \left( R(S^*, v) - E\left[ R(S_t, v) \right] \right),    (4)

where T is the length of the time horizon and S^* is the optimal subset,

S^* = \arg\max_{S \in C(S)} R(S, v).

Naturally, the objective is to approximately minimize the expected cumulative regret, Reg(T, v), with an appropriate bandit policy. In particular, after enough time steps, an appropriate policy should almost always offer the subsets with the highest reward, which implies that the cumulative regret Reg(T, v) should be sub-linear in T. As each item corresponds to a utility which needs to be estimated separately, the lower bound on the cumulative regret depends on the number of items, which is not feasible for large item sets. Therefore, a linear item utility is introduced, where the item utility is a linear function of the item feature,

v_i = \theta^{*\top} x_i,    (5)

where θ^* is a linear parameter vector unknown to the player. Thus, estimating the item utilities is changed to estimating the utility function parameters, which can exploit the correlation between items through their features, making it possible to achieve a regret bound free of the item number N.

4 Algorithm

We propose an algorithm, called Linear Utility MNL-Bandit (LUMB), which takes a UCB approach to sequentially estimate the linear utility function and approach the highest reward. LUMB first estimates the linear parameters of the utility function, then constructs the UCB of the item utilities and of the subset reward, and finally offers the subset with the highest reward UCB. Algorithm 1 presents the details of LUMB.

Algorithm 1 Linear Utility MNL-Bandit
1: Inputs: S = {1, ..., N}, X = [x_1, ..., x_N], r = [r_1, ..., r_N]
2: Initialize: θ_0 ← [0]_d, A_0 ← I_d, b_0 ← [0]_d, t ← 1, l ← 1, E_1 ← ∅, c_0 ← 0
3: repeat
4:   if c_{t-1} = 0 then
5:     Compute S_l ← arg max_{S ∈ C(S)} R(S, v^{UCB}_{l-1})
6:   end if
7:   Offer subset S_l, observe the user choice c_t.
8:   if c_t = 0 then
9:     for i ∈ S_l do
10:      compute \hat{v}_{i,l} ← \sum_{t' ∈ E_l} I(c_{t'} = i)
11:    end for
12:    update b_l ← b_{l-1} + \sum_{i ∈ S_l} \hat{v}_{i,l} x_i
13:    update A_l ← A_{l-1} + \sum_{i ∈ S_l} x_i x_i^\top
14:    update θ_l ← A_l^{-1} b_l
15:    for i ∈ S do
16:      v^{UCB}_{i,l} ← θ_l^\top x_i + (2 + α) \sqrt{x_i^\top A_l^{-1} x_i}
17:    end for
18:    l ← l + 1
19:    E_l ← ∅
20:  else
21:    E_l ← E_l ∪ {t}
22:  end if
23:  t ← t + 1
24: until t > T

4.1 Estimation of the Linear Utility Function

As the choice probability of an item is non-linear in the parameters of the utility function, it is difficult to estimate the parameters directly from the user choice feedback. Instead, we split the time horizon into epochs as in [Agrawal et al., 2017a]. Let L be the number of epochs. In each epoch l, the selected subset S_l ∈ C(S) is offered repeatedly until the user chooses nothing from the offered subset. Then, we can obtain the number of times each item i ∈ S_l is chosen,

\hat{v}_{i,l} = \sum_{t \in E_l} I(c_t = i),    (6)
\text{s.t. } I(c_t = i) = \begin{cases} 1, & c_t = i, \\ 0, & c_t \neq i, \end{cases}

where E_l is the set of time steps in epoch l and c_t is the chosen item at time step t. It can be proven that E(\hat{v}_{i,l}) = v_i (Lemma 1), which means \hat{v}_{i,l} is an unbiased estimate of the real item utility and is irrelevant to the other items in the subset. Thus, the estimation of the utility function can simply be formulated as a linear regression which directly fits the empirical samples \hat{v}_{i,l}. Specifically,

\theta_l = \arg\min_{\theta} \sum_{\tau \le l} \sum_{i \in S_\tau} \left( \theta^\top x_i - \hat{v}_{i,\tau} \right)^2 + \lambda \|\theta\|_2^2,

where λ is a constant regularization coefficient. Then, we can obtain the closed-form solution

A_l = \lambda I_d + \sum_{\tau \le l} \sum_{i \in S_\tau} x_i x_i^\top,    (7)
b_l = \sum_{\tau \le l} \sum_{i \in S_\tau} \hat{v}_{i,\tau} x_i,    (8)
\theta_l = A_l^{-1} b_l,    (9)

where I_d is a d-by-d identity matrix.
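The following sketch mirrors the epoch update of Algorithm 1 (lines 10-16): the closed-form ridge regression of Eqs. (7)-(9) and the utility upper confidence bound computed in line 16 (derived as Eq. (10) in the next subsection). It is a minimal illustration under our own naming; the class and method names are not from the paper.

```python
import numpy as np

class LinearUtilityEstimator:
    """Epoch-wise ridge regression of Section 4.1 (Eqs. (7)-(9))."""

    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)   # A_0 = lambda * I_d
        self.b = np.zeros(d)       # b_0 = 0
        self.theta = np.zeros(d)   # theta_0 = 0

    def update(self, X_offered, v_hat):
        """Fold in one finished epoch.

        X_offered : (|S_l|, d) feature rows of the items offered in epoch l
        v_hat     : (|S_l|,) per-item choice counts of Eq. (6)
        """
        self.A += X_offered.T @ X_offered              # Eq. (7)
        self.b += X_offered.T @ v_hat                  # Eq. (8)
        self.theta = np.linalg.solve(self.A, self.b)   # Eq. (9)

    def utility_ucb(self, X_all, alpha):
        """Utility UCBs: theta_l^T x_i + (2 + alpha) * sigma_{i,l}."""
        A_inv = np.linalg.inv(self.A)
        sigma = np.sqrt(np.einsum('ij,jk,ik->i', X_all, A_inv, X_all))
        return X_all @ self.theta + (2.0 + alpha) * sigma
```

In practice one would update a Cholesky factor or the inverse of A_l incrementally rather than inverting it from scratch, but the direct form keeps the correspondence with Eqs. (7)-(10) obvious.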
4.2 Construction of the Upper Confidence Bound

We construct the UCB of the item utility, which is proven in Lemma 2, as

v^{UCB}_{i,l} = \theta_l^\top x_i + (2 + \alpha) \sigma_{i,l},    (10)
\text{s.t. } \sigma_{i,l} = \sqrt{x_i^\top A_l^{-1} x_i},

where α is a constant. Let v^{UCB}_l = [v^{UCB}_{1,l}, ..., v^{UCB}_{N,l}]. Then, the UCB of the highest reward, R(S^*, v), is constructed as the highest reward under v^{UCB}_l (Lemma 5). The corresponding subset is

S_{l+1} = \arg\max_{S \in C(S)} R(S, v^{UCB}_l),    (11)

and we offer the subset S_{l+1} in epoch l + 1. It is hard to obtain S_{l+1} by directly solving the above optimization problem. According to [Davis et al., 2013], it can be translated into a linear program,

\max_{w} \sum_{i=1}^{N} r_i w_i,    (12)
\text{s.t. } w_0 + \sum_{i=1}^{N} w_i = 1, \quad \sum_{i=1}^{N} \frac{w_i}{v^{UCB}_i} \le K w_0, \quad 0 \le \frac{w_i}{v^{UCB}_i} \le w_0 \;\; \forall i \in S.

Then, S_{l+1} = {i | w_i > 0, i > 0}.
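The LP in Eq. (12) can be handed to any off-the-shelf solver. The sketch below uses scipy.optimize.linprog purely as an illustration; the paper does not prescribe a particular solver, and the function name and tolerance are ours.

```python
import numpy as np
from scipy.optimize import linprog

def select_assortment(r, v_ucb, K):
    """Solve the LP of Eq. (12) and recover S_{l+1} = {i : w_i > 0}.

    r     : (N,) item rewards
    v_ucb : (N,) utility UCBs (assumed strictly positive)
    K     : cardinality constraint
    """
    r = np.asarray(r, dtype=float)
    v_ucb = np.asarray(v_ucb, dtype=float)
    N = len(r)

    # Decision variables: [w_0, w_1, ..., w_N]; linprog minimizes, so negate r.
    c = np.concatenate(([0.0], -r))

    # Equality constraint: w_0 + sum_i w_i = 1.
    A_eq = np.ones((1, N + 1))
    b_eq = np.array([1.0])

    # Inequalities: sum_i w_i / v_i^UCB <= K w_0  and  w_i / v_i^UCB <= w_0.
    A_ub = np.zeros((N + 1, N + 1))
    A_ub[0, 0] = -K
    A_ub[0, 1:] = 1.0 / v_ucb
    for i in range(N):
        A_ub[i + 1, 0] = -1.0
        A_ub[i + 1, i + 1] = 1.0 / v_ucb[i]
    b_ub = np.zeros(N + 1)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * (N + 1), method="highs")
    w = res.x[1:]
    return {i for i in range(N) if w[i] > 1e-9}
```

As argued in [Davis et al., 2013], the constraint structure is totally unimodular, so a basic optimal solution satisfies w_i / v_i^{UCB} ∈ {0, w_0} and puts positive weight on at most K items; the recovered set therefore respects the cardinality constraint.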
5 Regret Analysis

In this section, we analyze the upper regret bound of Algorithm 1 to theoretically characterize its convergence. Without loss of generality, we first state the assumption used in the following analysis.

Assumption 1. ∀i ∈ S, r_i ≤ 1, ‖x_i‖ ≤ 1, ‖θ^*‖ ≤ 1.

According to the assumption, we have ∀i ∈ S, v_i ≤ 1. Moreover, we let λ = 1. We give the upper bound of the regret in Theorem 1 in advance; it is proven in Section 5.3. A result similar to Theorem 1 can be obtained when the quantities in Assumption 1 are bounded by arbitrary finite constants.

Theorem 1. Following the process in Algorithm 1, let β = 2 log_2 T and let α be set according to the self-normalized concentration bound used in the proof of Lemma 2 (Eq. (17)); then the upper bound of Reg(T, v) is

O\left( dK \sqrt{T} (\log T)^2 \right).    (13)

The proof proceeds in three steps. We first prove the correctness of the constructed utility UCB and show that the cumulative deviation between the utility UCB and the real utility is sub-linear with respect to T. We then prove that the deviation between the reward UCB and the real reward can be bounded by the deviation of the utilities. Finally, the upper bound on the cumulative regret follows by combining the two results.

5.1 Analysis of Utility

We first derive the distribution of \hat{v}_{i,l}.

Lemma 1. With the definition in Eq. (6), ∀l ≤ L, i ∈ S_l, \hat{v}_{i,l} follows a geometric distribution, that is,

P(\hat{v}_{i,l} = \beta) = \left( \frac{v_i}{1 + v_i} \right)^{\beta} \frac{1}{1 + v_i}, \quad \beta \ge 0,    (14)
E(\hat{v}_{i,l}) = v_i.    (15)

Because of the space limitation, the proof will be attached in a longer version. According to the above lemma, the deviation between \hat{v}_{i,l} and the real utility is unbounded, which makes proving the utility UCB difficult. Fortunately, the probability P(\hat{v}_{i,l} > \beta) decays exponentially as β increases, so we can bound \hat{v}_{i,l} within a relatively small interval with high probability. Then, we can prove the utility UCB as below.

Lemma 2. With the definition of v^{UCB}_{i,l} in Eq. (10) and the definition of \hat{v}_{i,l} in Eq. (6), if β ≥ 2 log_2 T and \hat{v}_{j,\tau} ≤ β for all τ ≤ l, j ∈ S_τ, then, with high probability, ∀i ∈ S, l ≤ L,

0 \le v^{UCB}_{i,l} - v_i \le 2(2 + \alpha) \sigma_{i,l}.    (16)

Proof. Let \Delta_{i,l} = |\theta_l^\top x_i - v_i|; we only need to prove \Delta_{i,l} \le (2 + \alpha)\sigma_{i,l}. According to Lemma 1, when \hat{v}_{i,l} \le \beta,

E(\hat{v}_{i,l} - v_i) = -\frac{(1 + \beta) p_i^{\beta+1}}{1 - p_i^{\beta+1}}, \quad \text{s.t. } p_i = \frac{v_i}{1 + v_i}.

Note that E(\hat{v}_{i,l} - v_i) does not depend on l. Let \epsilon_i = E(\hat{v}_{i,l} - v_i). Then

\Delta_{i,l} \le \left| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j (\hat{v}_{j,\tau} - v_j - \epsilon_j) \right| + \left| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j \epsilon_j - x_i^\top A_l^{-1} \theta^* \right|.

We bound the two parts respectively. Let u_i = \hat{v}_{i,\tau} - v_i - \epsilon_i and s_l = \sum_{\tau \le l} \sum_{i \in S_\tau} x_i u_i. It is easy to prove that E(\exp(\gamma u_i)) \le \exp(\gamma^2 \beta^2 / 2); then, with Lemma 9 in [Abbasi-Yadkori et al., 2011], we can prove that, with high probability,

\left| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j (\hat{v}_{j,\tau} - v_j - \epsilon_j) \right| \le \alpha \sigma_{i,l}.    (17)

Moreover, with the Cauchy-Schwarz inequality, the other part is bounded as

\left| x_i^\top A_l^{-1} \sum_{\tau \le l} \sum_{j \in S_\tau} x_j \epsilon_j - x_i^\top A_l^{-1} \theta^* \right| \le 2 \sigma_{i,l}.    (18)

The lemma is proven by combining Eq. (17) and Eq. (18).

Moreover, we prove a bound on the cumulative utility deviation.

Lemma 3. Following the process in Algorithm 1, the cumulative deviation between the utility UCB and the real utility can be bounded as

\sum_{l=1}^{L} \sum_{i \in S_l} \sigma_{i,l} \le K \sqrt{70 d L \log L}.    (19)

The proof is similar to that of Lemma 3 in [Chu et al., 2011]. Because of the space limitation, it will be attached in a longer version. Lemma 3 shows that the cumulative deviation is sub-linear in the number of epochs, so the average deviation per epoch vanishes after enough epochs.

5.2 Analysis of Reward

We first bound the deviation between the estimated reward and the real reward of S_l using the result of Lemma 3.

Lemma 4. In each epoch l of Algorithm 1, if ∀i ∈ S_l, 0 ≤ v^{UCB}_i − v_i ≤ 2(2 + α)σ_{i,l}, then the cumulative deviation between the estimated reward and the real reward of S_l satisfies

E\left( \sum_{t \in E_l} \left( R(S_l, v^{UCB}_l) - r_{c_t} \right) \right) \le 2(2 + \alpha) \sum_{i \in S_l} r_i \sigma_{i,l}.    (20)

Because of the space limitation, the proof will be attached in a longer version. This lemma means that the deviation of the subset reward is bounded by the deviation of the item utilities in the subset.

Lemma 5. (Lemma 4.2 in [Agrawal et al., 2017a]) With the reward defined in Eq. (2), suppose there are two subsets, S^{UCB} and S^*, such that

S^{UCB} = \arg\max_{S \in C(S)} R(S, v^{UCB}), \quad S^* = \arg\max_{S \in C(S)} R(S, v).

If ∀i ∈ S, v^{UCB}_i ≥ v_i, then the rewards satisfy the inequality

R(S^*, v) \le R(S^*, v^{UCB}) \le R(S^{UCB}, v^{UCB}).    (21)

Lemma 5 shows that the estimated reward of the subset S_l is an upper bound on the real highest reward. We can then easily bound the regret in each epoch with Lemma 4 and Lemma 5.

Lemma 6. (Regret bound in a single epoch) In each epoch l of Algorithm 1, if

\forall i \in S_l \cup S^*, \quad 0 \le v^{UCB}_i - v_i \le 2(2 + \alpha)\sigma_{i,l},    (22)

then the regret of epoch l is

E\left( \sum_{t \in E_l} \left( R(S^*, v) - r_{c_t} \right) \right) \le 2(2 + \alpha) \sum_{i \in S_l} r_i \sigma_{i,l}.    (23)

5.3 Upper Bound of the Regret

We first prove a more general version of Theorem 1 with parameters α and β.

Lemma 7. Following the process in Algorithm 1, if β ≥ 2 log_2 T, the cumulative regret defined in Eq. (4) can be bounded as

Reg(T, v) \le 2TK(1 + T)\,2^{-(\beta+2)} + 2(2 + \alpha) K \sqrt{70 d T \log T}.    (24)

Proof. We bound the regret separately in two situations: the item utility inequality in Lemma 2 is, or is not, satisfied. We model the event that the item utility inequality in Lemma 2 is not satisfied as

A_l = U_l \cup B_l,    (25)
\text{s.t. } U_l = \{\exists i \in S_l \cup S^*: v^{UCB}_{i,l} > v_i + 2(2+\alpha)\sigma_{i,l} \text{ or } v^{UCB}_{i,l} < v_i\},
\quad B_l = \{\exists \tau \le l, i \in S_\tau: \hat{v}_{i,\tau} > \beta\}.

Then, it is easy to bound the probability of A_l,

P(A_l) \le 2K(1 + T)\,2^{-(\beta+2)}.

Let \bar{A}_l be the complement of A_l. The regret can then be split into two parts, that is,

Reg(T, v) = E\left( \sum_l \sum_{t \in E_l} I(A_{l-1}) (R(S^*, v) - r_{c_t}) \right) + E\left( \sum_l \sum_{t \in E_l} I(\bar{A}_{l-1}) (R(S^*, v) - r_{c_t}) \right),

where I(A) is an indicator random variable whose value is 1 when A happens and 0 otherwise. We first consider the situation where A_{l-1} happens,

E\left( \sum_l \sum_{t \in E_l} I(A_{l-1}) (R(S^*, v) - r_{c_t}) \right) \le 2TK(1 + T)\,2^{-(\beta+2)}.    (27)

Then, we consider the situation where A_{l-1} does not happen. According to Lemma 6 and Lemma 3,

E\left( \sum_l \sum_{t \in E_l} I(\bar{A}_{l-1}) (R(S^*, v) - r_{c_t}) \right) \le 2(2 + \alpha) K \sqrt{70 d T \log T}.    (28)

Finally, we finish the proof by adding Eq. (27) and Eq. (28).

With Lemma 7, Theorem 1 is proven by setting β = 2 log_2 T and choosing α as stated in Theorem 1. Since our method is in a similar framework to MNL-Bandit [Agrawal et al., 2017a], whose lower bound scales as √T with respect to T, our regret bound matches the lower bound up to logarithmic terms in T.
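As a quick numerical illustration of Lemma 1 and of the epoch structure in Algorithm 1, the sketch below simulates epochs under known utilities and checks that the per-epoch choice counts average to v_i. The simulation and its function names are ours, added only for illustration; they are not part of the paper's experimental protocol.

```python
import numpy as np

def run_epoch(v, subset, rng):
    """Offer `subset` repeatedly until the no-purchase option is chosen.
    Returns the per-item choice counts, i.e. the estimates of Eq. (6)."""
    counts = {i: 0 for i in subset}
    denom = 1.0 + sum(v[i] for i in subset)
    probs = np.array([1.0] + [v[i] for i in subset]) / denom
    while True:
        k = rng.choice(len(probs), p=probs)  # index 0 = no purchase
        if k == 0:
            return counts
        counts[subset[k - 1]] += 1

rng = np.random.default_rng(1)
v = np.array([0.2, 0.5, 0.8, 0.3])   # true utilities, known here only for the check
subset = [0, 2, 3]
n_epochs = 20000
totals = {i: 0 for i in subset}
for _ in range(n_epochs):
    counts = run_epoch(v, subset, rng)
    for i in subset:
        totals[i] += counts[i]
print({i: totals[i] / n_epochs for i in subset})  # close to {0: 0.2, 2: 0.8, 3: 0.3}, per Lemma 1
```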
6 Experiments

In this section, we evaluate LUMB on synthetic data and compare it with three existing alternative algorithms. We demonstrate the superiority of LUMB in terms of cumulative regret. Moreover, we show that the estimated linear parameters of the utility function, and the resulting utilities, asymptotically converge to the real values.

6.1 Setting

The synthetic data is generated randomly. N rewards are sampled uniformly from the interval (0, 1]. The d-dimensional parameter vector of the utility function, θ^*, is sampled uniformly from [0, 1]^d and then normalized to unit norm. N d-dimensional feature vectors are sampled uniformly from [0, 1]^d. To follow the experimental setting in [Agrawal et al., 2017b], the feature vectors are normalized so that the item utilities are distributed uniformly on [0, 1]. All experiments are performed on ten randomly generated data sets, and the results reported below are averages over these data sets. Three alternative algorithms are compared:

- UCB-MNL [Agrawal et al., 2017a]: This algorithm adopts a UCB approach with the MNL choice model.
- Thompson-Beta [Agrawal et al., 2017b]: This algorithm adopts a Thompson sampling approach with the MNL choice model.
- Thompson-Corr [Agrawal et al., 2017b]: This algorithm is a variant of Thompson-Beta which samples item utilities with correlated sampling.

6.2 Results

We conduct experiments on synthetic data sets with N = 1000 and d = 10. The subset size K is set to 10. Figure 1 shows the cumulative regret on the synthetic data sets, normalized by the best reward, i.e., Reg(t, v) / R(S^*, v). Note that the cumulative-regret axis is plotted on a logarithmic scale for the convenience of observing the trend of the LUMB regret over the time horizon. The cumulative regrets increase more slowly as the time step increases. Besides, the cumulative regret of LUMB is much smaller than that of the alternative algorithms throughout the time horizon.

Figure 1: Cumulative regret along the time horizon.

We evaluate the convergence of the utility vector on synthetic data sets with N = 1000 and d = 10. Figure 2 (upper) shows the deviation between the estimated mean utility and the real utility, normalized by the norm of the real utility, i.e., ‖θ_t^⊤ X − v‖ / ‖v‖. The deviation of LUMB decreases quickly in the early stage and reaches a smaller value than the alternative algorithms. Moreover, we evaluate the deviation of the estimated linear parameter vector in Figure 2 (lower). The deviation is normalized by the norm of the real parameters, i.e., ‖θ_t − θ^*‖ / ‖θ^*‖. Note that this deviation also decreases quickly in the early stage and asymptotically converges to zero. This demonstrates that LUMB correctly estimates the linear parameters.

Figure 2: Deviation of the item utility vector (upper figure) and of the linear utility function parameter vector (lower figure) at different time steps.

7 Conclusion

We study the sequential subset selection problem with the linear utility MNL choice model and propose a UCB-style algorithm, LUMB. An upper regret bound, O(dK√T (log T)^2), is established, which is free of the candidate item number. Experiments show the performance of LUMB. In future work, we are interested in extending the idea to other choice models such as the nested MNL.

References

[Abbasi-Yadkori et al., 2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, pages 2312-2320, 2011.
[Agrawal and Goyal, 2012] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, pages 39.1-39.26, 2012.
[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, pages 127-135, 2013.
[Agrawal et al., 2017a] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. MNL-Bandit: A dynamic learning approach to assortment selection. arXiv preprint arXiv:1706.03880, 2017.
[Agrawal et al., 2017b] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Thompson sampling for the MNL-bandit. arXiv preprint arXiv:1706.00977, 2017.
[Audibert et al., 2013] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31-45, 2013.
[Auer, 2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3(Nov):397-422, 2002.
[Bernstein et al., 2017] Fernando Bernstein, Sajad Modaresi, and Denis Sauré. A dynamic clustering approach to data-driven assortment personalization. 2017.
[Bubeck and Cesa-Bianchi, 2012] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. CoRR, abs/1204.5721, 2012.
[Caro and Gallien, 2007] Felipe Caro and Jérémie Gallien. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53(2):276-292, 2007.
[Cesa-Bianchi and Lugosi, 2012] Nicolò Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.
[Chen et al., 2013] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In ICML, pages 151-159, 2013.
[Chen et al., 2016] Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions. In NIPS, pages 1659-1667, 2016.
[Cheung and Simchi-Levi, 2017] Wang Chi Cheung and David Simchi-Levi. Assortment optimization under unknown multinomial logit choice models. arXiv preprint arXiv:1704.00108, 2017.
[Chu et al., 2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In AISTATS, pages 208-214, 2011.
[Dani et al., 2008] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355-366, 2008.
[Davis et al., 2013] James Davis, Guillermo Gallego, and Huseyin Topaloglu. Assortment planning under the multinomial logit model with totally unimodular constraint structures. Technical report, Cornell University, pages 335-357, 2013.
[Gai et al., 2012] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. TON, 20(5):1466-1478, 2012.
[Golrezaei et al., 2014] Negin Golrezaei, Hamid Nazerzadeh, and Paat Rusmevichientong. Real-time optimization of personalized assortments. Management Science, 60(6):1532-1551, 2014.
[Kallus and Udell, 2016] Nathan Kallus and Madeleine Udell. Dynamic assortment personalization in high dimensions. arXiv preprint arXiv:1610.05604, 2016.
[Kveton et al., 2015] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535-543, 2015.
[Lagrée et al., 2016] Paul Lagrée, Claire Vernade, and Olivier Cappé. Multiple-play bandits in the position-based model. In NIPS, pages 1597-1605, 2016.
[Luce, 2005] Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005.
[Plackett, 1975] Robin L. Plackett. The analysis of permutations. Applied Statistics, pages 193-202, 1975.
[Rusmevichientong et al., 2010] Paat Rusmevichientong, Zuo-Jun Max Shen, and David Shmoys. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666-1680, 2010.
[Russo and Van Roy, 2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.
[Sauré and Zeevi, 2013] Denis Sauré and Assaf Zeevi. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387-404, 2013.
[Thompson, 1933] William Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.
[Wang et al., 2017] Yingfei Wang, Hua Ouyang, Chu Wang, Jianhui Chen, Tsvetan Asamov, and Yi Chang. Efficient ordered combinatorial semi-bandits for whole-page recommendation. In AAAI, pages 2746-2753, 2017.
[Wen et al., 2015] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In ICML, pages 1113-1122, 2015.