Journal of Machine Learning Research 24 (2023) 1-49. Submitted 9/20; Revised 5/23; Published 10/23.

Learning to Rank under Multinomial Logit Choice

James A. Grant (j.grant@lancaster.ac.uk), Department of Mathematics and Statistics, Lancaster University, Lancaster, UK

David S. Leslie (d.leslie@lancaster.ac.uk), Department of Mathematics and Statistics, Lancaster University, Lancaster, UK

Editor: Csaba Szepesvári

Abstract

Learning the optimal ordering of content is an important challenge in website design. The learning to rank (LTR) framework models this problem as a sequential problem of selecting lists of content and observing where users decide to click. Most previous work on LTR assumes that the user considers each item in the list in isolation, and makes binary choices to click or not on each. We introduce a multinomial logit (MNL) choice model to the LTR framework, which captures the behaviour of users who consider the ordered list of items as a whole and make a single choice among all the items and a no-click option. Under the MNL model, the user favours items which are either inherently more attractive, or placed in a preferable position within the list. We propose upper confidence bound (UCB) algorithms to minimise regret in two settings: where the position-dependent parameters are known, and unknown. We present theoretical analysis leading to an $\Omega(\sqrt{JT})$ lower bound for the problem, an $O(\sqrt{JT})$ upper bound on the regret of the UCB algorithm in the known-parameter setting, and an $O(K^2\sqrt{JT})$ upper bound on regret, the first such bound, in the more challenging unknown-position-parameter setting. Our analyses are based on tight new concentration results for Geometric random variables, and novel functional inequalities for maximum likelihood estimators computed on discrete data.

Keywords: learning to rank; multinomial logit choice model; multi-armed bandits; upper confidence bound; concentration inequalities
1. Introduction

Learning the optimal ordering of content is an important challenge in website design and online advertising. The learning to rank (LTR) framework captures such a challenge via a sequential decision-making model. In this setting, a decision-maker repeatedly selects orderings of items (product advertisements, search results, news articles, etc.) and displays them to a user visiting their website. In response the user opts to click on none, one, or more of the displayed items. The objective of the decision-maker is to maximise the number of clicks received over many iterations of this process. Such an objective is a reasonable and widely-used proxy for the most common interests of a decision-maker in this setting, e.g. maximising profit and maximising user satisfaction. As such, methods which achieve this objective can be hugely impactful in real-world settings.

(c) 2023 James A. Grant and David S. Leslie. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/20-981.html.

Recent works (e.g. Kveton et al. (2015), Lagrée et al. (2016)) have considered various formulations of LTR, distinguished by their assumptions on the click model assumed to govern how users decide to click on items. A majority of previous works utilise a factored click model, which assumes, in particular, that the user will click on any displayed item satisfying two conditions: 1) the user finds the item attractive, and 2) the user examines the item. The various factored models are differentiated by their specification of the probability of attraction and examination events given particular orderings of content. Factored models represent the user's choice as a series of binary decisions to click or not click on each examined item.
They fail to capture settings where the user's decisions are made among more than two alternatives, for instance choosing between several items considered simultaneously. In this work, we consider LTR under a click model which captures the phenomenon of a user making a single decision among several alternatives, including the option to not click at all. Our click model is based on the multinomial logit (MNL) model of discrete user choice (Luce, 1959; Plackett, 1975). We augment the classical MNL model with position effects to capture the relative prominence of different display positions, which may be pre-specified or learned online. The classic model is a special case of ours. Our model may be more suitable in a variety of settings: for instance, where all items are visible to a user simultaneously, or are laid out on a grid such that it is not possible, a priori, to specify a rank order of positions in terms of prominence, or to assume that users consider items sequentially. Several studies in the information retrieval and recommender systems literatures highlight the complexity of position effects in grid displays (Xie et al., 2017, 2019; Guo et al., 2020), and the issues with assuming by default that items are considered independently (Oosterhuis and de Rijke, 2019; Çapan et al., 2022; Hazrati and Ricci, 2022).

1.1 Problem Definition

We propose the Multinomial Logit Learning to Rank (MNL-LTR) problem. This problem captures the challenge of learning an optimal list of $K$ items among $J$, where the click model is an order-dependent variant of multinomial logit choice. In each of a series of rounds $t \in [T]$,¹ the decision-maker chooses an action $a_t = (a_{1,t}, \ldots, a_{K,t}) \in \mathcal{A} \subseteq [J]^K$, where $\mathcal{A}$ is the set of all ordered lists of length $K$ consisting of items drawn from $[J]$ without replacement. The action indicates an ordering of $K$ items to display to the user in round $t$.
Each item $j \in [J]$ has an associated attractiveness parameter $\alpha_j \in (0, 1]$, and each slot $k \in [K]$ has an associated position bias $\lambda_k \in (0, 1]$. We let $\alpha_{k,t} = \alpha_{a_{k,t}}$ denote the attractiveness parameter of item $a_{k,t}$. In response to the action $a_t$, the user will either click on a displayed item or take a no-click action. This process is captured via a click variable $Q_t$ taking values in $[K]_0$. The click probabilities follow from the MNL choice model, whose parameters are the products of attractiveness and bias parameters. Specifically, we have

$$P(Q_t = k \mid a_t) = \frac{\lambda_k \alpha_{k,t}}{1 + \sum_{v=1}^{K} \lambda_v \alpha_{v,t}}, \quad k \in [K], \qquad (1)$$

and $P(Q_t = 0 \mid a_t) = 1 - \sum_{k=1}^{K} P(Q_t = k \mid a_t)$. Following the user's choice, the decision-maker receives a reward $R(a_t) = \mathbb{I}\{Q_t \neq 0\}$. The decision-maker's aim is to maximise their expected cumulative reward over $T$ rounds. The expected reward of an action $a \in \mathcal{A}$ (in any round) is written $r(a) := \mathbb{E}(R(a)) = \sum_{k=1}^{K} P(Q = k \mid a)$. The challenge for the decision-maker is that the attractiveness parameters are unknown and the optimal action is therefore initially unclear. We will consider the problem in two informational settings: where the position biases are known, and unknown.

1. For an integer $W \geq 1$, we let $[W]$ denote the set $\{1, \ldots, W\}$.

1.2 On Approaches to the Problem

The decision-maker faces a classic exploration-exploitation dilemma and must employ a strategy which balances reward-maximising and information-gaining actions. We refer to such a strategy as a policy, and formalise it as a (possibly randomised) mapping from a history $\mathcal{H}_{t-1} = \sigma(a_1, Q_1, \ldots, a_{t-1}, Q_{t-1})$ to an action $a_t \in \mathcal{A}$ for each time $t \in [T]$. We propose upper confidence bound (UCB) policies for both the known and unknown position bias settings. UCB approaches are well-studied in the context of multi-armed bandits, following from Lai and Robbins (1985), and are known to achieve optimal regret in a variety of settings.
The canonical principle of a UCB approach is as follows. In each round, the policy computes high-probability upper confidence bounds on the expected rewards of actions by utilising tight concentration results, and selects an action with maximal associated bound. Intuitively speaking, these approaches are effective because they tend to select actions that either have high reward (and thus are profitable) or high uncertainty (and thus provide a substantial information gain).

In the MNL-LTR setting, the identification of tight concentration results is the most involved aspect of designing UCB policies. In part, this is because the likelihood induced by the MNL model (1) has a complex combinatorial structure, making it hard to identify parameter estimates with known distributional properties. Our proposed strategies subvert this issue by utilising a restriction on the decision-maker's actions such that the likelihood factorises usefully. This technique (first used by Agrawal et al. (2017, 2019)) restricts the decision-maker to repeatedly display any selected ordered list in each round until a no-click event is observed. Unbiased estimators may then be constructed as sums of geometrically distributed random variables which are functions of the users' stochastic behaviour.

We will be interested in the empirical and theoretical performance of policies, measured in terms of their expected pseudoregret (referred to simply as regret in what follows) in $T$ rounds, defined as

$$\mathrm{Reg}(T) = T r(a^*) - \mathbb{E}\left(\sum_{t=1}^{T} R(a_t)\right), \qquad (2)$$

where $a^* \in \arg\max_{a \in \mathcal{A}} r(a)$ is an optimal action. Specifically, we will be interested in the order (w.r.t. $T$, $K$, and $J$) of upper bounds on the regret of our proposed policies, and lower bounds on the regret which hold uniformly across all (reasonable) policies. We will study the problem in two informational settings: one where only attractiveness parameters are unknown, and another where both the attractiveness parameters and position biases are unknown.
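To make the model above concrete, the following sketch (with illustrative parameter values, not taken from the paper) computes the click distribution (1), the expected reward $r(a)$, and an optimal action; the optimal action pairs the largest attractiveness parameters with the most prominent slots, by a rearrangement argument, since $r$ is increasing in $\sum_k \lambda_k \alpha_{a_k}$:

```python
import numpy as np

def click_probabilities(alpha, lam, action):
    """P(Q = k | a) for k = 0..K under the MNL model (1) with position biases."""
    w = np.array([lam[k] * alpha[j] for k, j in enumerate(action)])
    denom = 1.0 + w.sum()
    return 1.0 - w.sum() / denom, w / denom   # (no-click prob, per-slot click probs)

def expected_reward(alpha, lam, action):
    """r(a) = sum_k lam_k alpha_{a_k} / (1 + sum_v lam_v alpha_{a_v})."""
    w = sum(lam[k] * alpha[j] for k, j in enumerate(action))
    return w / (1.0 + w)

def optimal_action(alpha, lam):
    """Place the k-th most attractive item in the k-th most prominent slot."""
    items = np.argsort(alpha)[::-1][: len(lam)]   # top-K items by attractiveness
    action = [0] * len(lam)
    for rank, slot in enumerate(np.argsort(lam)[::-1]):
        action[slot] = int(items[rank])
    return action

# Illustrative values only
alpha, lam = np.array([0.1, 0.4, 0.2]), np.array([1.0, 0.5])
best = optimal_action(alpha, lam)                 # -> [1, 2]
p0, pk = click_probabilities(alpha, lam, best)
r_best = expected_reward(alpha, lam, best)
```

Note that the no-click probability and the slot click probabilities always sum to one, as required by the model.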
Our results establish an $\Omega(\sqrt{JT})$ lower bound on regret, an upper bound matching this up to logarithmic factors for the known-position bias setting, and the first upper bound for the unknown-position bias setting, of $O(K^2\sqrt{JT})$.

1.3 Key Contributions

The primary contributions of this work are threefold. Firstly, we provide a new parametric model of LTR based on set-wise user decisions with foundations in classic choice theory. This model can capture the complex patterns of position effects in modern recommender systems, and the phenomenon of users considering multiple items simultaneously (not possible with factored click models). Second, we derive new theoretical results concerning the concentration of Geometric random variables, giving rise to two new exponential inequalities. The first is an improved high-probability bound on the sum of non-independent, non-identically distributed (n.i.n.i.d.) Geometric random variables. The second is a high-probability bound on smooth functions of n.i.n.i.d. Geometric random variables, which is applied to the maximum likelihood estimates (MLEs) in the unknown position bias setting to give non-asymptotic confidence sets for all parameters, even in the absence of closed-form expressions for the MLEs. Finally, based on these results we propose UCB algorithms for the known and unknown position bias settings, and validate their efficacy through derivation of upper and lower bounds on regret (which match up to logarithmic factors in the former setting) and empirical assessment against other state-of-the-art approaches.

1.4 Related Work

A special case of our MNL-LTR model is the MNL bandit (Rusmevichientong et al., 2010). It does not consider ordering of the items, and coincides with our model when all position biases are equal.
Initial studies of the MNL-bandit problem presented explore-then-commit approaches, which only behave optimally for specific problem classes, and with prior knowledge of certain problem parameters (Rusmevichientong et al., 2010; Sauré and Zeevi, 2013). Agrawal et al. (2019) and Agrawal et al. (2017) have since presented UCB and TS approaches to the MNL-bandit respectively, which have $O(\sqrt{JT})$ regret, matching the minimax lower bound derived by Chen and Wang (2018) (up to logarithmic factors). These methods use restrictions on decision-making to permit the construction of estimators with desirable properties. Such approaches and results are natural benchmarks² for our problem, since when the position biases satisfy $\lambda_k = 1$ for all $k \in [K]$ the MNL-LTR problem reduces to the MNL-bandit. When the position biases differ, however, these methods are not as successfully applied. As we show in the numerical experiments of Section 6 and the accompanying Appendix F, an MNL-bandit algorithm can be applied to MNL-LTR by overparameterising the problem: that is, treating each item-slot combination as an item in its own right, placing constraints on which items can be selected concurrently, and mapping each selection to an ordered list. The theory of Agrawal et al. (2017, 2019) then guarantees $O(\sqrt{JKT})$ regret, but this is suboptimal, as it does not exploit the problem structure. Our approach to the known position bias setting does exploit this structure and thus enjoys a lower regret.

2. Independence from $J$ is possible in some variants: Wang et al. (2018) achieve this subject to further assumptions on the attractiveness parameters, and Chen et al. (2021) and Peeters and den Boer (2022) find such results for an uncapacitated version of the problem, but these are either less general or not comparable to our model. Gap-dependent results have also been recently identified by Yang (2021); however, we retain a focus on minimax-style results in this work.
There has since been interest in extending the MNL-bandit model in various directions, considering the best-action identification variant (Chen et al., 2018; Fang, 2022), the context-dependent variant (Oh and Iyengar, 2019; Chen et al., 2020; Bernstein et al., 2022), and variants including variable rewards, inventory constraints, or the omission of the no-click event (Bengs and Hüllermeier, 2020; Saha and Gopalan, 2019; Mesaoudi-Paul et al., 2020; Dong et al., 2020; Zhang et al., 2022). None of these works, however, fully considers the effect of position biases, i.e. as in our LTR variant, though they can be adapted to give consideration to ordering through the specification of constraints.

Works on LTR are mainly distinguished by different click models (Chuklin et al., 2015), the majority being of the factored form previously described. Two notable choices are the Cascade Model (CM) (Craswell et al., 2008) and the Position Based Model (PBM) (Richardson et al., 2007). Under the CM the user considers each item in sequence, and decides whether or not to click on it before considering any further items. If the user clicks an item, or reaches the end of the list without clicking any items, they stop. In contrast, under the PBM, the user may click on multiple items, and chooses whether to examine each item independently, with probabilities similar to our position biases. Kveton et al. (2015) consider a LTR problem incorporating the CM, and Lagrée et al. (2016) and Komiyama et al. (2017) the PBM. In these specific settings upper confidence bound approaches can achieve $O(\sqrt{T})$ regret (optimal for those models), by exploiting knowledge of the true click model. Recently, Gauthier et al. (2021) have considered a version of the PBM where the position effects are unknown, showing effective performance of a Thompson Sampling approach, but without theoretical guarantees. The works of Zoghi et al. (2017), Lattimore et al.
(2018), and Li et al. (2019) have investigated more general click models which include the CM and PBM as special cases. The models of Zoghi et al. (2017) and Li et al. (2019) retain the assumption of a factored model, but are less restrictive than the CM and PBM. The model of Lattimore et al. (2018) makes sufficiently few assumptions to capture a wider range of models, including that which we propose. However, such a general approach does not admit as tight theoretical guarantees. Table 1 compares the existing results on regret in LTR and the MNL-bandit with our regret upper bound.

| | MNL choice | LTR | Algorithm | Regret on MNL-LTR |
|---|---|---|---|---|
| Agrawal et al. (2019) | yes | * | UCB | O(·) |
| Agrawal et al. (2017) | yes | * | TS | O(·) |
| Lattimore et al. (2018) | included as special case | yes | TopRank | O(·) |
| This paper | yes | yes | UCB | O(·) |

Table 1: Comparison of results in the present paper and related work for MNL-LTR with known position biases. $T$ denotes the number of rounds, $J$ the number of items, and $K$ the number of items chosen per round. An asterisk (*) in the LTR column denotes frameworks that can be adapted to LTR, but do so by over-parameterising the problem.

2. Inference

In the MNL-LTR framework, the task of making accurate and efficient inference on the attractiveness parameters is more challenging than in other variants of LTR. Consider the likelihood of the sequence of clicks $Q_{1:T} = \{Q_1, \ldots, Q_T\}$ given the attractiveness parameters $\alpha$, position biases $\lambda$, and action sequence $a_{1:T} = \{a_1, \ldots, a_T\}$:

$$L(Q_{1:T} \mid a_{1:T}, \alpha, \lambda) = \prod_{t=1}^{T} \frac{\sum_{j=0}^{J} \alpha_j \sum_{k=1}^{K} \lambda_k \mathbb{I}\{a_{k,t} = j, Q_t = k\}}{1 + \sum_{j=1}^{J} \alpha_j \sum_{k=1}^{K} \lambda_k \mathbb{I}\{a_{k,t} = j\}}. \qquad (3)$$

The likelihood (3) lacks a closed-form maximiser, meaning maximum likelihood estimators of $\alpha$ and $\lambda$ can only be computed numerically. Similarly, any Bayesian inference would necessarily be approximate, and computationally intensive. Both of these approximations (which are not necessary in related, factored models) are obstacles to the design and analysis of efficient, optimal sequential decision-making policies.
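Although (3) lacks a closed-form maximiser, evaluating it is straightforward, since each round contributes the factor given by (1). A minimal sketch (the data layout is hypothetical: actions as lists of item indices, `clicks[t] = 0` encoding no click and otherwise the 1-based clicked slot):

```python
import math

def log_likelihood(alpha, lam, actions, clicks):
    """Log-likelihood of a click history under the MNL-LTR model (1)."""
    ll = 0.0
    for a_t, q_t in zip(actions, clicks):
        # Denominator of (1) for the displayed list a_t
        denom = 1.0 + sum(lam[k] * alpha[j] for k, j in enumerate(a_t))
        # Numerator: clicked slot's weight, or 1 for the no-click event
        numer = lam[q_t - 1] * alpha[a_t[q_t - 1]] if q_t > 0 else 1.0
        ll += math.log(numer / denom)
    return ll

# One round: items 1 and 3 shown, user clicks slot 1 (hypothetical values)
ll = log_likelihood([0.3, 0.2, 0.1], [1.0, 0.5], [[0, 2]], [1])
```

Any numerical maximisation of (3) would repeatedly evaluate a routine of this form.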
2.1 Inference with Known Position Biases

Exact inference is possible if we restrict the manner in which actions are selected. For the MNL-bandit, Agrawal et al. (2019) propose a restriction on decision-making that admits unbiased, independent estimators of the attractiveness parameters. Specifically, if each selected set of items is displayed repeatedly until a no-click event is observed, then unbiased estimators of the attractiveness parameters are available. We will show that the same is possible in the MNL-LTR setting, if we display the same ranked list repeatedly until a no-click event occurs.

To describe this approach, we think of the $T$ rounds as being divided into $L \leq T$ epochs of variable length. An epoch $l \in [L]$ consists of a sequence of consecutive time periods $E_l \subseteq [T]$. In each epoch $l$ we offer an ordered list $a^l \in \mathcal{A}$ repeatedly, until a no-click event is observed. Let $a_k^l$ be the item in position $k$ in epoch $l$ and let $\alpha_k^l$ be the attractiveness parameter of this item, for $k \in [K]$. As usual, in each round $t \in E_l$ a click variable $Q_t$ is observed. For each slot $k \in [K]_0$ the number of clicks on position $k$ in epoch $l$ is defined as $n_k^l = \sum_{t \in E_l} \mathbb{I}\{Q_t = k\}$. By the construction of the epochs we always have $n_0^l = 1$, unless $l = L$ and the final epoch is stopped by the completion of the time horizon, rather than by a no-click event.

We now show that these counts $n_k^l$ can be used to construct simple closed-form estimators of the attractiveness parameters. The log-likelihood of the observed clicks $n^l = (n_0^l, n_1^l, \ldots, n_K^l)$ in a single epoch $l$, with fixed action $a^l$, can be written as

$$\log L(n^l \mid a^l, \alpha) = \sum_{k=1}^{K} n_k^l \log(\lambda_k \alpha_k^l) - \Bigl(1 + \sum_{k=1}^{K} n_k^l\Bigr) \log\Bigl(1 + \sum_{v=1}^{K} \lambda_v \alpha_v^l\Bigr).$$

The single-epoch likelihood is maximised by the estimators $\hat{\alpha}_k^l = n_k^l / \lambda_k$ for $k \in [K]$. Inspired by these within-epoch estimators, we may then construct estimators for each attractiveness parameter $\alpha_j$, $j \in [J]$, aggregating over $L$ complete epochs, as

$$\bar{\alpha}_j(L) = \frac{\sum_{l=1}^{L} \sum_{k=1}^{K} \mathbb{I}\{a_k^l = j\}\, n_k^l}{\sum_{l=1}^{L} \sum_{k=1}^{K} \mathbb{I}\{a_k^l = j\}\, \lambda_k}, \quad j \in [J]. \qquad (4)$$
These estimators result from a weighted averaging of the within-epoch maximum likelihood estimators, which is preferable to uniform averaging: we should expect epochs where item $j$ was placed in a slot with a higher position bias to be less variable and thus more reliable. The lemma below, whose proof is reserved for Appendix E, gives the distribution of the random variables $n_k^l$. It follows immediately that our estimators $\hat{\alpha}_k^l$ and $\bar{\alpha}_j(L)$ are unbiased.

Lemma 1 For each $k \in [K]$ and $l \in [L]$, $n_k^l$, the number of clicks on the item in position $k$ during epoch $l$, follows a Geometric³ distribution with parameter $(1 + \lambda_k \alpha_k^l)^{-1}$.

2.2 Inference with Unknown Position Biases

When the position biases are unknown, epoch-based decision making is also useful. In this setting the likelihood is not identified unless we fix one of the position biases, so we fix $\lambda_1 = 1$. This restriction may rescale the other parameters with respect to the known position bias case, but crucially it does not change the interpretation of the model.

Some further notation is also useful to describe inference in this setting. Define the $K \times J$ matrix of click counts in $l \in [L]$ epochs as $N(l)$, having entries

$$N_{kj}(l) = \sum_{l'=1}^{l} \sum_{t \in E_{l'}} \mathbb{I}\{Q_t = k, a_k^{l'} = j\}, \quad k \in [K], \; j \in [J].$$

Similarly, define $\bar{N}(l)$ as the matrix of counts of selections of item-position combinations, whose entries are

$$\bar{N}_{kj}(l) = \sum_{l'=1}^{l} \mathbb{I}\{a_k^{l'} = j\}, \quad k \in [K], \; j \in [J].$$

Now, define $\gamma_{jk} = \alpha_j \lambda_k$, $j \in [J]$, $k \in [K]$, to be the products of the attraction probabilities and position biases. Using the known distribution of the click counts $n_k^l$ we can derive an unbiased product parameter estimate $\bar{\gamma}_{jk}(L) = N_{kj}/\bar{N}_{kj}$ for each $j \in [J]$ and $k \in [K]$. A naive approach would independently estimate the $JK$ product parameters and build UCBs around those. Such an approach does not make efficient use of the data, and as such the associated decision-making rules can spend a prohibitively long time exploring, although they will eventually converge to optimal actions.
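Lemma 1 also gives a direct way to sanity-check the estimator (4) by simulation: sample the per-epoch click counts for a single tracked item as geometric random variables and form the weighted average. A sketch under assumed parameter values (the slot assignments are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, lam = 0.4, np.array([1.0, 0.6, 0.3])

# Slot occupied by the tracked item in each of L epochs (hypothetical assignment)
L = 20000
slots = rng.integers(0, len(lam), size=L)
means = lam[slots] * alpha_true                  # E[n^l_k] = lam_k * alpha_j, by Lemma 1

# numpy's geometric counts trials >= 1; subtracting 1 matches the paper's
# "number of failures" parametrisation with success probability 1/(1 + lam_k alpha_j)
counts = rng.geometric(1.0 / (1.0 + means)) - 1

alpha_hat = counts.sum() / lam[slots].sum()      # estimator (4) for this item
```

With many epochs the weighted average concentrates around the true attractiveness parameter, consistent with the unbiasedness noted above.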
We discuss the limitations of such an approach in more detail in Appendix F and revisit it in the experiments in Section 6. In the remainder of this section we will focus on direct inference on the attractiveness parameters and position biases.

We may obtain estimates of the attractiveness parameters and position biases via the EM scheme outlined in Algorithm 1. In particular, this algorithm exploits that, conditioned on estimates $\hat{\alpha}_{1:J}(L)$ of the attractiveness parameters, we have an estimate of the position bias for slot $k \in \{2, \ldots, K\}$ as

$$\hat{\lambda}_k(L) = \frac{\sum_{j=1}^{J} N_{kj}}{\sum_{j=1}^{J} \bar{N}_{kj}\, \hat{\alpha}_j(L)}. \qquad (5)$$

3. For clarity, we note that throughout this paper we use the following parametrisation of the geometric distribution: if $X \sim \mathrm{Geom}(p)$ then $P(X = x) = (1-p)^x p$, $x \in \mathbb{N} := \{0, 1, \ldots\}$.

Similarly, we have an estimate of the attractiveness of item $j \in [J]$, given estimates $\hat{\lambda}_{2:K}(L)$ of the position biases, as

$$\hat{\alpha}_j(L) = \frac{\sum_{k=1}^{K} N_{kj}}{\sum_{k=1}^{K} \bar{N}_{kj}\, \hat{\lambda}_k(L)}, \qquad (6)$$

where $\hat{\lambda}_1(L) = \lambda_1 = 1$. Algorithm 1 iterates between estimating position biases and attractiveness parameters until the estimates converge to within some tolerance.⁴ The following lemma guarantees the convergence of this EM scheme. Its proof is given in Appendix E, and follows from the unimodality of the log-likelihood function.

Lemma 2 The estimators $\alpha^{EM}$ and $\lambda^{EM}$ derived from the EM algorithm, Algorithm 1, converge monotonically to the maximum likelihood estimators.

Algorithm 1 EM Algorithm for MNL-LTR with Unknown Position Biases

Inputs: Initial parameter values $\alpha_{j,0}$ for all $j \in [J]$, and $\lambda_{k,0}$ for $k \in \{2, \ldots, K\}$. Tolerance parameter $0 < \xi < 1$. Action and click histories $a^{1:L}$, $Q_{1:T}$.

Set $d \leftarrow 1$, $s \leftarrow 0$, and $\lambda_{1,s} \leftarrow 1$ for all $s \geq 0$. While $d > \xi$ do:
- Set $s \leftarrow s + 1$.
- E-Step: For each $k \in \{2, \ldots, K\}$, calculate $\lambda_{k,s}$ according to (5).
- M-Step: For each $j \in [J]$, calculate $\alpha_{j,s}$ according to (6).
- Calculate $d = \max\bigl( \max_{k \in \{2,\ldots,K\}} |\lambda_{k,s} - \lambda_{k,s-1}|,\; \max_{j \in [J]} |\alpha_{j,s} - \alpha_{j,s-1}| \bigr)$.

Return $\lambda_{1,s}, \ldots, \lambda_{K,s}$ and $\alpha_{1,s}, \ldots$
, $\alpha_{J,s}$ as estimates of the position biases and attractiveness parameters.

3. Concentration Results

In this section we derive concentration results for the parameter estimates in both the known and unknown position bias settings. Quantification of the uncertainty in the parameters is key to designing effective sequential decision-making algorithms, and the results in this section will later be used to construct UCB approaches.

3.1 Concentration Results Relevant to the Known Position Bias Setting

As discussed in Section 2.1, the empirical means $\bar{\alpha}_j$ are weighted averages of Geometric random variables. The following theorem gives a martingale-type concentration result for the sum of geometrically distributed random variables with differing means. This result is not specific to the MNL-LTR or MNL-bandit settings, and therefore may be of independent interest. It is worth noting that the results of Theorem 3 simultaneously have improved coefficients and a greater generality than alternative results for i.i.d. geometric random variables obtained by Agrawal et al. (2019). We require the greater generality in the MNL-LTR setting because the random variables associated with clicks on an item per epoch will be a) non-identically distributed, as they depend on the position bias, and b) non-independent, as the assignment of items to slots depends on the previously observed data.

4. There is an issue with the numerical stability of this EM scheme: if a given item or position has no associated clicks, its estimate will go to 0. We can resolve this either by adopting the convention that 0/0 = 0, or by artificially constraining the estimates to be no smaller than some $\epsilon > 0$.

Theorem 3 Consider geometric random variables $Y_i$ with parameter $p_i$, $i \in [n]$, where $p_i$ may be a function of $p_1, \ldots, p_{i-1}, Y_1, \ldots, Y_{i-1}$. Let $\mu_i = \frac{1-p_i}{p_i}$, and $\sigma_i^2 = \mu_i^2 + \mu_i$. If $\mu_i \leq 1$ for all $i \in [n]$, then we have for all $C > 0$,

$$P\left( \Bigl| \sum_{i=1}^{n} (Y_i - \mu_i) \Bigr| \geq \sqrt{8 \sum_{i=1}^{n} \sigma_i^2 \log(C)} + 4\log(C) \right) \leq 2C^{-1}, \quad n \geq 1. \qquad (7)$$
Furthermore, we have for all $C > 0$,

$$P\left( \Bigl| \sum_{i=1}^{n} (Y_i - \mu_i) \Bigr| \geq \sqrt{8 \sum_{i=1}^{n} Y_i \log(C)} + 4\log(C) \right) \leq A_n, \qquad (8)$$

where $A_n = \frac{n}{\sum_{i=1}^{n} \mu_i} \Bigl\{ 8\log(C) + \sqrt{8 \sum_{i=1}^{n} \sigma_i^2 \log(C)} \Bigr\}$.

A full proof of Theorem 3 is provided in Appendix A, but we briefly outline its intuition here. The proof derives a new bound on the central moments of the geometric distribution in order to utilise a Bernstein-like inequality for martingale difference sequences. As the central moments of the geometric distribution lack a closed-form expression, this is non-trivial. We achieve the bound by first bounding the cumulants of the geometric distribution and exploiting a combinatorial link between central moments and cumulants.

The following lemma adapts the result of Theorem 3 to the LTR setting. Its proof is also given in Appendix A. The UCB algorithm we propose in Section 4 for the known position bias setting is designed to exploit these results.

Lemma 4 We have for the estimators $\bar{\alpha}_j(l)$, $j \in [J]$, defined as in Equation (4), and attractiveness parameters $0 < \alpha_j \leq 1$, $j \in [J]$, the following concentration results, for all $l$: $\Lambda_{j,l} > 0$,

$$P\left( |\bar{\alpha}_j(l) - \alpha_j| > \sqrt{\frac{4\alpha_j \log(Jl^2/2)}{\Lambda_{j,l}}} + \frac{4\log(Jl^2/2)}{\Lambda_{j,l}} \right) \leq \frac{4}{Jl^2}, \qquad (9)$$

$$P\left( |\bar{\alpha}_j(l) - \alpha_j| > \sqrt{\frac{8\bar{\alpha}_j(l) \log(Jl^2/2)}{\Lambda_{j,l}}} + \frac{8\log(Jl^2/2)}{\Lambda_{j,l}} \right) \leq \frac{4}{Jl^2}. \qquad (10)$$

Furthermore, for $l$: $\Lambda_{j,l} > 4\log(Jl^2/2)/\alpha_j$, we have

$$P\left( \bar{\alpha}_j(l) > 2\alpha_j + \frac{4\log(Jl^2/2)}{\Lambda_{j,l}} \right) \leq \frac{4}{Jl^2}. \qquad (11)$$

3.2 Concentration Results Relevant to the Unknown Position Bias Setting

The derivation of concentration results in the unknown position bias setting is more challenging, since the MLE of any unknown parameter (attractiveness or position bias) does not have a closed form. The asymptotic properties of MLEs are well documented, but there are comparatively few general guarantees relating to finite-time behaviour. However, here we are able to utilise non-asymptotic deviation inequalities on certain functions of random variables to derive concentration properties for a family of MLEs derived from geometrically distributed data.
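Since the MLEs in question are only available via the alternation of Algorithm 1, it is worth making that computation concrete. Below is a minimal sketch, a reconstruction of equations (5) and (6) rather than the authors' reference implementation, assuming aggregated count matrices `N` (clicks) and `Nbar` (selections) as defined in Section 2.2 and the 0/0 := 0 convention:

```python
import numpy as np

def em_estimates(N, Nbar, tol=1e-8, max_iter=1000):
    """Alternate between position bias and attractiveness estimates.

    N[k, j]    : clicks on item j while shown in slot k (aggregated over epochs).
    Nbar[k, j] : number of epochs in which item j was shown in slot k.
    Identifiability is fixed by lam[0] = 1, as in the text.
    """
    K, J = N.shape
    alpha, lam = np.full(J, 0.5), np.ones(K)
    for _ in range(max_iter):
        # E-step: position bias for slots 2..K from current alpha (eq. (5))
        new_lam = lam.copy()
        for k in range(1, K):
            denom = (Nbar[k] * alpha).sum()
            new_lam[k] = N[k].sum() / denom if denom > 0 else 0.0
        # M-step: attractiveness of each item from updated lam (eq. (6))
        new_alpha = alpha.copy()
        for j in range(J):
            denom = (Nbar[:, j] * new_lam).sum()
            new_alpha[j] = N[:, j].sum() / denom if denom > 0 else 0.0
        d = max(np.abs(new_lam - lam).max(), np.abs(new_alpha - alpha).max())
        alpha, lam = new_alpha, new_lam
        if d < tol:
            break
    return alpha, lam

# Sanity check on noiseless expected counts (hypothetical instance):
lam_true, alpha_true = np.array([1.0, 0.5]), np.array([0.4, 0.2, 0.1])
Nbar = np.full((2, 3), 100.0)
N = Nbar * np.outer(lam_true, alpha_true)        # E[N_kj] = Nbar_kj * lam_k * alpha_j
alpha_hat, lam_hat = em_estimates(N, Nbar)
```

On noiseless expected counts the alternation recovers the generating parameters exactly, consistent with the monotone convergence guaranteed by Lemma 2.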
As in the previous section, we have a general concentration result, followed by an application to the MNL-LTR setting. We begin with Theorem 5, whose proof is given in Appendix B, which gives a deviation inequality for a function of multivariate Geometric data. The derivation of this result is based on theory from Bobkov and Ledoux (1998) and a logarithmic Sobolev inequality of Joulin and Privault (2004). Before stating our result we introduce a notion of the smoothness of a discrete function, expressed in terms of its finite differences. Let $(\epsilon_{li})_{i \in [d], l \in [n]}$ denote the canonical basis on $\mathbb{R}^{d \times n}$. For a function $F: \mathbb{N}^{d \times n} \to \mathbb{R}$, define the finite difference with respect to the input variable indexed $(l, i)$ as $D_{li}F(X) = F(X + \epsilon_{li}) - F(X)$, $X \in \mathbb{N}^{d \times n}$. We say that a function $F: \mathbb{N}^{d \times n} \to \mathbb{R}$ is $(\beta_1, \beta_2)$-smooth, for parameters $\beta_1, \beta_2 > 0$, if

$$\sum_{l=1}^{n} \sum_{i=1}^{d} |D_{li}F|^2 \leq \beta_1^2, \quad \text{and} \quad \max_{l \in [n]} \max_{i \in [d]} |D_{li}F| \leq \beta_2, \quad \forall X \in \mathbb{N}^{d \times n}. \qquad (12)$$

Theorem 5 Let $n, d \in \mathbb{N}$ and $\mu_l \mid \mu_{1:l-1}$, $l \in [n]$, be a series of conditional multivariate geometric measures on $\mathbb{N}^d$, such that each component $\mu_{li}$ is a geometric law with parameter $p_{li} \in (0, 1]$ for $i \in [d]$, $l \in [n]$. Define $\mu^n = \bigotimes_{l=1}^{n} \mu_l$ as the product measure, and let $F$ be a $(\beta_1, \beta_2)$-smooth function with $\beta_1 > 0$ and $\beta_2 \in (0, \max_{i,l}(-\log(1 - p_{li}))]$. Then $\mathbb{E}_{\mu^n}(|F|) < \infty$, and for every $\delta > 0$,

$$P_{\mu^n}\bigl(F \geq \mathbb{E}_{\mu^n}(F) + \delta\bigr) \leq \exp\left( \min\left\{ -\frac{\delta^2}{4\beta_1^2 M},\; \frac{(\log(1-p))^2 \beta_1^2 M}{4\beta_2^2} + \log(1-p)\,\delta \right\} \right),$$

where $M > 0$ is a known finite constant depending on the parameters $\{p_{li}\}_{i \in [d], l \in [n]}$.

Choosing the function $F$ to be the MLEs in the MNL-LTR setting, we have the following result, giving concentration inequalities for the estimators derived from Algorithm 1.

Corollary 6 We have for the EM estimators $\bar{\alpha}^{EM}_{l,j}$, $j \in [J]$, and attractiveness parameters $0 < \alpha_j \leq 1$, $j \in [J]$, that for all $l$: $\sum_{k=1}^{K} \bar{N}_{kj} > 0$,

$$P\left( \bar{\alpha}^{EM}_{l,j} - \alpha_j > \sqrt{36\,\beta_{1,l,j}^2 \log(Jl^2)} \right) \leq \frac{2}{Jl^2},$$

where $\beta_{1,l,j}$ is a sensitivity parameter defined as

$$\beta_{1,l,j}^2 = \sum_{(k,s):\, \bar{N}_{ks} \geq 1} \bigl( \bar{\alpha}^{EM}_{l-1,j} - \bar{\alpha}^{EM}_{l-1,ks,j} \bigr)^2 \bar{N}_{ks}, \quad \text{for } j \in [J], \; l \geq 1.$$

The proof of this corollary is reserved for Appendix B.
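The sensitivity parameter appearing in Corollary 6 can be computed by brute force for any estimator map, by re-running the estimator on perturbed click matrices. A generic sketch follows; the simple column-ratio estimator in the demonstration is a hypothetical stand-in for the EM map, chosen so the result can be verified by hand:

```python
import numpy as np

def squared_sensitivity(estimator, N, Nbar, j):
    """Sum, over observed (k, s) cells, of the squared change in estimate j
    when the click count N[k, s] is incremented by one, weighted by Nbar[k, s]."""
    base = estimator(N, Nbar)[j]
    total = 0.0
    K, J = N.shape
    for k in range(K):
        for s in range(J):
            if Nbar[k, s] >= 1:
                N_pert = N.copy()
                N_pert[k, s] += 1          # perturb a single click count
                diff = estimator(N_pert, Nbar)[j] - base
                total += diff * diff * Nbar[k, s]
    return total

# Hypothetical stand-in estimator: per-item click-through ratio
ratio = lambda N, Nbar: N.sum(axis=0) / Nbar.sum(axis=0)
N = np.array([[2.0, 0.0], [1.0, 3.0]])
Nbar = np.full((2, 2), 5.0)
beta_sq = squared_sensitivity(ratio, N, Nbar, 0)   # -> 0.1
```

For the ratio estimator each extra click on item 1 moves its estimate by 0.1, so the weighted sum of squared differences is 2 x 0.1^2 x 5 = 0.1; the same machinery applies unchanged when `estimator` is the EM map of Algorithm 1.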
Its main argument is to recognise that the restriction of $\bar{\alpha}^{EM}_l$ to its $j$th output, for a fixed selections matrix $\bar{N}(l)$, is (subject to a minor rearrangement of the inputs) a function from $\mathbb{N}^{K \times l}$ to $[0, 1]$, to which the functional inequality of Theorem 5 applies. A similar result holds for the position bias parameters but is not required, since the algorithm we propose in the following section does not need to construct UCBs for the position biases: every slot is utilised in every round.

Finally, for the purpose of the regret bounds which will follow in Section 5, Lemma 7 below relates the sensitivity parameter $\beta_{1,l,j}$ to the number of selections of item $j$ in $l$ epochs. Its proof is given in Appendix B, and is based on bounding inverse gradients of the log-likelihood to quantify the sensitivity of the MLEs.

Lemma 7 There exists a constant $C > 0$, independent of $J$, $K$, and $L$, such that the sensitivity parameter $\beta_{1,j,l}$ satisfies

$$\beta_{1,j,l} \leq C\sqrt{\frac{JK}{\max_k \bar{N}_{kj}}}.$$

4. Decision-making Algorithms

We now outline our new UCB approaches. As is typical, the algorithms select actions which maximise the expected reward with respect to a set of upper confidence bounds on the attractiveness parameters. Such an optimal action will place the item with the $k$th largest UCB in the slot with the $k$th largest position bias (or estimated position bias, if the position biases are unknown) for each $k \in [K]$. Then, however, the algorithms will repeatedly use this action in each round until a no-click event is observed. This is in contrast to the traditional approach of calculating new UCBs in every round.

Algorithm 2 Epoch-UCB algorithm for known position biases

Initialise with $l = 0$ and $Q_0 = 0$. Iteratively perform the following for $t \in [T]$:

If $Q_{t-1} = 0$:
- Set $l \leftarrow l + 1$.
- Calculate UCBs. For $j \in [J]$ compute

$$\alpha^{UCB}_{j,l} = \bar{\alpha}_j(l-1) + \sqrt{\frac{4\min(1, 2\bar{\alpha}_j(l-1)) \log(Jl^2/2)}{\Lambda_{j,l-1}}} + \frac{4\log(Jl^2/2)}{\Lambda_{j,l-1}}. \qquad (13)$$

- Select an action $a_t \in \arg\max_{a \in \mathcal{A}} r_{\alpha^{UCB}_l}(a)$ which is optimal with respect to the UCB vector $\alpha^{UCB}_l := (\alpha^{UCB}_{1,l}, \ldots$
$, \alpha^{UCB}_{J,l})$, and observe the click variable $Q_t$. Otherwise, set action $a_t = a_{t-1}$, and observe the click variable $Q_t$.

In Algorithm 2 we present our Epoch-UCB method for the variant of MNL-LTR with known position biases. In each epoch $l \in [L]$ the algorithm computes a UCB $\alpha^{UCB}_{j,l}$ for each item $j \in [J]$. This UCB is constructed using the concentration results of Lemma 4 to give an upper bound on $\alpha_j$ with high probability. The $\min(1, 2\bar{\alpha}_j(l-1))$ term allows the UCB to adapt to whichever of (9) and (10) gives the tighter bound. Henceforth, for $l \geq 1$ and $j \in [J]$, we define $\Lambda_{j,l} := \sum_{k=1}^{K} \sum_{s=1}^{l} \lambda_k \mathbb{I}\{a_k^s = j\}$.

To state our algorithm for the setting where position biases are unknown, define $\bar{\alpha}^{EM}: \mathbb{N}^{K \times J} \times \mathbb{N}^{K \times J} \to [0, 1]^J$ to be the function which takes click and selection count matrices as inputs and returns the EM estimates of the attractiveness parameters $\alpha$. In effect, $\bar{\alpha}^{EM}$ represents the application of Algorithm 1.

Algorithm 3 Epoch-UCB algorithm for unknown position biases

Initialise with $l = 0$ and $Q_0 = 0$. Iteratively perform the following for $t \in [T]$:

If $Q_{t-1} = 0$:
- Set $l \leftarrow l + 1$.
- Compute EM estimators and finite difference bounds,

$$\bar{\alpha}^{EM}_l = \bar{\alpha}^{EM}(N(l-1), \bar{N}(l-1)), \qquad (14)$$

$$\bar{\alpha}^{EM}_{l,kj} = \bar{\alpha}^{EM}(N(l-1) + \epsilon_{kj}, \bar{N}(l-1)), \quad \forall k, j: \bar{N}_{kj} \geq 1, \qquad (15)$$

$$\beta^2_{1,l,j} = \sum_{(k,s):\, \bar{N}_{ks} \geq 1} \bigl( \bar{\alpha}^{EM}_{l-1,j} - \bar{\alpha}^{EM}_{l-1,ks,j} \bigr)^2 \bar{N}_{ks}, \quad j \in [J]. \qquad (16)$$

- Calculate UCBs. For $j \in [J]$ compute

$$\alpha^{UCB-U}_{j,l} = \bar{\alpha}^{EM}_{j,l} + \sqrt{36\,\beta^2_{1,j,l} \log(Jl^2)}.$$

- Select an action $a_t \in \arg\max_{a \in \mathcal{A}} r_{\alpha^{UCB-U}_l}(a)$, which is optimal with respect to the UCB vector, and observe the click variable $Q_t$.

Otherwise, set action $a_t = a_{t-1}$, and observe the click variable $Q_t$.

In Algorithm 3 we give our policy for the setting where position biases are not known. Its structure is similar to that of Algorithm 2, but the computation of the UCBs is more involved, as it requires finite difference gradients. In each epoch $l \in [L]$, estimates of the MLEs, $\bar{\alpha}^{EM}_l$, are computed via Algorithm 1 as in (14). Our approach proceeds to calculate further estimates of the attractiveness parameters, but on modified data, as in (15).
For each item-position pair that has been selected at least once, i.e. each k, j with N̄_kj ≥ 1, we compute α^EM_{l,kj}, parameter estimates based on l epochs of data but with N_kj incremented by 1. A sum of the squared finite differences is then computed as in (16), which is used in the UCB inspired by Corollary 6. This is in place of a supremum bound on the sum of squared differences over all possible outcomes, which would be difficult to compute in practice.

5. Regret Bounds

In this section we give upper and lower bounds on the regret of MNL-LTR algorithms. Proposition 8 gives our upper bound on the regret incurred by Algorithm 2 when the position biases are known. Proposition 9 gives a lower bound on the regret of any algorithm, in terms of S_K = Σ_{k=1}^K λ_k and S_{K,2} = Σ_{k=1}^K λ²_k. The proofs of both results are given in the appendix: Proposition 8 in Appendix C, and Proposition 9 in Appendix D.

Proposition 8 The regret in T rounds of the Epoch-UCB approach, Algorithm 2, for any MNL-LTR problem where the item attractiveness parameters satisfy α_j ≤ α_0 = 1, j ∈ [J], and the position biases λ_k ≤ 1, k ∈ [K], are known, satisfies

Reg(T) = O( ( max_{k∈[K]} λ_k / min_{k∈[K]} λ_k ) √( log(JT) JT ) ).

Proposition 9 The regret of any algorithm for the MNL-LTR problem with position biases 1 ≥ λ_1 > λ_2 > ⋯ > λ_K > 0 satisfying S_{K,2} > 1, and J ≥ 4K items with attractiveness parameters α_j ∈ (0, 1], j ∈ [J], is lower bounded as Reg(T) = Ω( √(JT) ).

The upper and lower bounds in the known position bias case match in their order with respect to J and T (up to logarithmic factors). For the case of unknown position biases, the complex form of the derived UCB algorithm prohibits an analysis which quantifies the dependence on the position biases as sharply. Nevertheless, we have an upper bound on the regret of Algorithm 3 in terms of J, K, and T, in the proposition below, whose proof is in Appendix C.
Proposition 10 The regret in T rounds of the Epoch-UCB approach, Algorithm 3, for any MNL-LTR problem where the unknown item attractiveness parameters satisfy α_j ≤ α_0 = 1, j ∈ [J], and the unknown position biases satisfy λ_k ≤ 1, k ∈ [K], satisfies

Reg(T) = O( K² √( log(JT²) JT ) ).

There is some suboptimality in the bound of Proposition 10 with respect to K. We believe this arises, in part, as an artefact of the (simplifying) focus on a single k* = argmax_k N̄_kj in Lemma 7, and it may yet be possible (through further laborious analytic work) to achieve a less elegant but tighter bound.

6. Experiments

We now conduct empirical comparisons on three instances of MNL-LTR:

(a) There are K = 4 slots with position biases λ = (1, 0.3, 0.2, 0.1). There are J = 6 items with attractiveness parameters α = (0.3, 0.28, 0.26, 0.24, 0.22, 0.2).

(b) There are K = 3 slots with position biases λ = (1, 0.2, 0.9). There are J = 4 items with attractiveness parameters α = (0.05, 0.1, 0.15, 0.2).

(c) There are K = 6 slots with position biases λ = (1, 0.9, 0.7, 0.3, 0.5, 0.7) and J = 30 items, four having attractiveness parameter 1, two having attractiveness parameter 0.8, and the remaining twenty-four having attractiveness parameter 0.1.

Together, these cover several scales of problem where users may simultaneously consider multiple options with different prominence and items may be modelled as having independent attractiveness. We consider both Algorithm 2, which knows the position biases, and Algorithm 3, where the position biases are inferred. We will refer to the former as Epoch-UCB, and the latter as Epoch-UCB UPB (unknown position biases) in what follows. Experimental results suggest that while the Epoch-UCB UPB algorithm does eventually learn the optimal actions, it can be overly conservative.
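Under the MNL model assumed throughout, a user presented with an ordered list clicks the item in slot k with probability proportional to λ_k α_{a_k}, and clicks nothing with weight 1. A minimal sampler for instance (a), assuming this standard MNL parameterisation (the helper name is ours):

```python
import numpy as np

def mnl_click_probs(lam, alpha, action):
    """Choice probabilities under the MNL model: weight lam[k] * alpha[j]
    for the item j displayed in slot k, and weight 1 for no-click."""
    weights = np.array([lam[k] * alpha[j] for k, j in enumerate(action)])
    denom = 1.0 + weights.sum()
    return np.append(weights / denom, 1.0 / denom)  # last entry: no click

# Instance (a): K = 4 slots, J = 6 items; display items 0-3 in order.
lam = [1.0, 0.3, 0.2, 0.1]
alpha = [0.3, 0.28, 0.26, 0.24, 0.22, 0.2]
probs = mnl_click_probs(lam, alpha, action=[0, 1, 2, 3])
```

A single user response can then be drawn with `np.random.choice(len(probs), p=probs)`, where the final index represents the no-click event that ends an epoch.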
We therefore also investigate a modification, Epoch-UCB* UPB, which is identical to Algorithm 3 except that the UCB for item j ∈ [J] in epoch l ∈ [L] is calculated as

α^UCB*_{j,l} = α^EM_{j,l} + 0.5 √( β²_{1,j,l} log(Jl) ).

We compare our algorithms to a range of alternative approaches. Firstly, we have a further known-position-bias epoch-based approach, Epoch-UCB-W. This uses the coefficients we would expect from adapting the weaker concentration inequalities used in Agrawal et al. (2019). Epoch-UCB-W is identical to Algorithm 2 except that it calculates the UCB index for item j ∈ [J] in epoch l ∈ [L] as

α^UCB-W_{j,l} = ᾱ_j(l−1) + √( 48 min(1, 2ᾱ_j(l−1)) log(Jl²/2) / Λ_{j,l−1} ) + 48 log(Jl²/2) / Λ_{j,l−1}.

Second, we also consider the TopRank algorithm of Lattimore et al. (2018). This algorithm can operate without knowledge of the position biases, but assumes that the slots are of decreasing attractiveness. TopRank has a markedly different structure to Epoch-UCB. TopRank maintains a hierarchical partition of the item set, such that the items sit in strata based on their perceived attractiveness. In each round the displayed list is constructed by randomising the order of the n_1 ≥ 1 items in the top stratum and assigning these to the first n_1 slots, then randomising the order of the n_2 ≥ 1 items in the second stratum, assigning these to the next n_2 slots, and proceeding in such a fashion until all K slots are filled. Items are demoted to lower strata if they have received sufficiently fewer clicks than another item in their stratum. As discussed in Section 1.4, the other LTR approaches that we are aware of are all designed with factored click models in mind, and do not carry performance guarantees over to the MNL-LTR setting. We do, however, investigate the Position Bias Upper Confidence Bound (PBUCB) algorithm of Lagrée et al. (2016), which is based on the position bias click model.
This algorithm can use our position bias parameters within its model, but will underestimate the α parameters, as it expects that multiple items may be clicked by a user; i.e. its inference model is inconsistent with the MNL data generating process. Further modifications would be necessary to deploy a version of this algorithm if the position biases were not known.

Finally, we compare to the UCB approach described in Appendix F. This approach, which we refer to as MNL-bandit in the figures, ignores some of the LTR structure, and treats the unknown position bias version of the problem as a constrained MNL bandit problem. It learns the product parameters γ_{j,k} = λ_k α_j individually and avoids the need for running the EM algorithm. As discussed in Appendix F, it does have a sublinear regret guarantee, but must perform more exploration than the other approaches due to being overparameterised.

Note that problems (b) and (c) give examples where the optimal ordering of items is not in decreasing order of attractiveness. In (b), for instance, the final slot, not the second slot, has the second-to-largest position bias. In the unknown position bias variant of this problem, Epoch-UCB UPB and Epoch-UCB* UPB can adapt to this as they actively learn the position biases, but algorithms assuming decreasing position biases, such as TopRank, cannot.

The aforementioned algorithms were applied to problems (a) and (b) over 50000 decision-making rounds, with 40 replications. For problem (c) we use 8000 decision-making rounds and 40 replications, since the optimal action can be learned more quickly. Figures 1, 2, and 3 display the results in two forms: the mean regret accumulated through time in their left panes, and the distribution of regret in the final rounds in their right.
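The assignment rule from Section 4, pairing the kth most attractive item with the kth largest position bias, can be illustrated on instance (b), where it produces a list that is not ordered by slot index. This is a sketch; `best_action` is our illustrative helper, not code from the paper.

```python
import numpy as np

def best_action(lam, alpha, K):
    """Place the item with the kth largest attractiveness in the slot
    with the kth largest position bias."""
    slot_rank = np.argsort(lam)[::-1]    # slots by decreasing bias
    item_rank = np.argsort(alpha)[::-1]  # items by decreasing attractiveness
    action = np.empty(K, dtype=int)
    action[slot_rank] = item_rank[:K]
    return action

# Instance (b): lam[2] > lam[1], so the optimal list is not sorted by slot.
lam = np.array([1.0, 0.2, 0.9])
alpha = np.array([0.05, 0.1, 0.15, 0.2])
action = best_action(lam, alpha, K=3)
# The most attractive item (index 3) lands in slot 0, while the second
# most attractive (index 2) lands in the *final* slot, which has bias 0.9.
```

Algorithms that assume decreasing position biases would instead place the second most attractive item in slot 1, which is exactly the linear-regret failure mode discussed above for TopRank.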
We focus only on the distribution in the final rounds as the results are such that plotting error bars with each of the seven mean trajectories would make the graphs difficult to read. Across all three problems we find that our Epoch-UCB and the PBUCB algorithms generally perform best. This is to be expected, as they have access to the known position biases and assume a position-based model. Our improved coefficients for the known position-bias setting are seen to have a substantial benefit, as Epoch-UCB-W is much more conservative than Epoch-UCB. Indeed, Epoch-UCB-W even suffers worse performance than TopRank, which can identify the correct ordering of items despite not knowing the true click model. In the unknown position bias setting, we see that the unmodified Epoch-UCB UPB approach is also overly conservative, but the approach with modified coefficients, Epoch-UCB* UPB, is not much worse than the best known position bias algorithms. TopRank does not know the click model but performs well when the position biases are decreasing in the slot index. In problems (b) and (c), where the position biases do not fit this assumption, TopRank can identify the top K items reliably, but incurs a linear regret due to repeatedly ordering these suboptimally.

Problem (c) most clearly demonstrates the issue with the MNL-bandit approach. Here, K and J are much larger than in problems (a) and (b), and the MNL-bandit approach continues to explore long after Epoch-UCB* UPB has reached the point of mostly exploiting near-optimal actions. This is because the MNL-bandit approach aims to collect data to estimate JK = 180 different parameters, whereas Epoch-UCB* UPB utilises the known click model such that it only estimates J + K − 1 = 35 (assuming λ_1 = 1). This example demonstrates that although the MNL-bandit approach has a sublinear regret guarantee, as shown in Appendix F, it may be inappropriate in practice.
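The gap in statistical burden described above is simple to quantify for instance (c): the unstructured MNL-bandit approach must estimate every slot-item product parameter, while the structured model shares parameters across cells, with λ_1 = 1 fixed.

```python
# Instance (c): J = 30 items, K = 6 slots.
J, K = 30, 6

mnl_bandit_params = J * K       # one gamma_{j,k} = lam_k * alpha_j per cell
structured_params = J + K - 1   # J attractiveness + (K - 1) free position biases

print(mnl_bandit_params, structured_params)
```

The roughly five-fold difference in parameter count is what drives the prolonged exploration of the MNL-bandit approach in Figure 3.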
Figure 1: Performance of algorithms on problem (a). The left panels show the mean cumulative regret trajectory for each algorithm over 50000 rounds, in the known and unknown position bias settings. The right panels show the distribution of regret by algorithm at the end of 50000 rounds.

Figure 2: Performance of algorithms on problem (b). The left panels show the mean cumulative regret trajectory for each algorithm over 50000 rounds, in the known and unknown position bias settings. The right panels show the distribution of regret by algorithm at the end of 50000 rounds.

Figure 3: Performance of algorithms on problem (c). The left panels show the mean cumulative regret trajectory for each algorithm over 8000 rounds, in the known and unknown position bias settings. The right panels show the distribution of regret by algorithm at the end of 8000 rounds.

7. Conclusion

In this paper, we have proposed and analysed the multinomial logit choice variant of the learning to rank problem. Distinct from other model-based treatments of learning to rank, our model captures the behaviour of a user who makes a decision over a set of alternatives, rather than making a sequence of independent decisions.
We proposed upper confidence bound approaches for the problem in two informational settings: where the effects of rank are known and unknown, respectively. Both of these approaches are derived from new concentration theory. In the known position bias setting, we have derived and verified a Bernstein condition on the moments of the Geometric distribution and provided new martingale inequalities for Geometric random variables. In the unknown position bias setting we have provided new functional inequalities for Geometric data, giving concentration results for numerical maximum likelihood estimators. Further, we have provided upper and lower bounds on regret in the known position bias setting, and simulations to display the effectiveness of our approaches.

Our proposed framework makes few assumptions beyond the MNL choice, which is a long-standing popular model in decision theory and may be applicable in numerous domains. Our concentration results are also of potential interest in other areas. Further, we lay the groundwork for further study of MNL-type LTR problems. Future work may consider randomised approaches, utilising similar posterior approximations as in Agrawal et al. (2017), or recently proposed bootstrapping techniques as in Kveton et al. (2019). The development of algorithms for a contextual variant of the problem, or with more complex or alternatively structured actions (perhaps bespoke to particular settings), would also seem to be worthy avenues to follow in the extension of this work.

Acknowledgments

The authors gratefully acknowledge the support of EPSRC grant EP/L015692/1 (STOR-i Centre for Doctoral Training).

Appendix A. Proof of Geometric Martingale Concentration

In this section, we provide proofs of Theorem 3 and Lemma 4, which comprise the concentration results relevant to the known position bias setting. The following result is a key component of the proof of Theorem 3. It gives a Bernstein-like bound for heavy-tailed martingale data.
Lemma 11 (Theorem 1.2A of de la Peña (1999)) Let {d_j, F_j} be a martingale difference sequence with E(d_j | F_{j−1}) = 0 and E(d²_j | F_{j−1}) = σ²_j for each j, and let V²_n = Σ_{j=1}^n σ²_j. Furthermore, assume that

E(|d_j|^k | F_{j−1}) ≤ (k!/2) σ²_j c^{k−2} a.e. (18)

or P(|d_j| ≤ c | F_{j−1}) = 1, for k > 2 and 0 < c < ∞. Then for all x, y > 0,

P( Σ_{j=1}^n d_j ≥ x and V²_n ≤ y for some n ) ≤ exp( −x² / (2(y + cx)) ).

A.1 Proof of Theorem 3

Firstly, we demonstrate that a Geometric martingale difference sequence meets the conditions of Lemma 11. Define Z_i = Σ_{j=1}^i (Y_j − µ_j) and W_i = Z_i − Z_{i−1}. By definition {Z_i}_{i≥1} is a martingale and {W_i}_{i≥2} is a martingale difference sequence. Immediately, from the distribution of Y_i, i ∈ [n], we have E(W_i | F_{i−1}) = 0 and E(W²_i | F_{i−1}) = Var(Y_i | F_{i−1}) = µ²_i + µ_i. The higher-order central moments of the Geometric distribution lack a closed-form expression, which makes checking condition (18) more complex. Our technique relies on two main steps: we identify a bound on the cumulants of the Geometric distribution, and we use a link between the central moments and cumulants from combinatorics to realise a central moment bound, given in the following lemma.

Lemma 12 The central moments µ_n, n ≥ 1, of the Geometric random variable with parameter p satisfy

µ_n ≤ !n (1 − p) / p^n,

where !m denotes the number of derangements of an integer m ≥ 1, defined recursively as !m = (m − 1)(!(m − 1) + !(m − 2)), where !0 = 1 and !1 = 0.

The proof of Lemma 12 is given in Section A.3. It uses the property that the central moments of any distribution may be expressed in terms of the cumulants κ_k of the distribution via incomplete exponential Bell polynomials. In particular, we have

E(W^k_i | F_{i−1}) = Σ_{m=1}^k B_{k,m}(0, κ_2, …, κ_{k−m+1}), (19)

where the summands B_{k,m} are incomplete exponential Bell polynomials (see e.g. Chapter 11 of Charalambides (2002)). For integers n ≥ m ≥ 1 and arguments x_1, …, x_{n−m+1} ∈ Z^{n−m+1}, these polynomials are defined as

B_{n,m}(x_1, …, x_{n−m+1}) = Σ [ n! / (j_1! j_2! ⋯ j_{n−m+1}!) ] (x_1/1!)^{j_1} (x_2/2!)^{j_2} ⋯ (x_{n−m+1}/(n−m+1)!)^{j_{n−m+1}}
(20) where the sum is over all sequences j_1, j_2, …, j_{n−m+1} of non-negative integers such that Σ_{i=1}^{n−m+1} j_i = m and Σ_{i=1}^{n−m+1} i j_i = n.

The bound in Lemma 12 can be adapted to the form required for condition (18). We have the following relationship between the number of derangements and the factorial, !m = [m!/e], where [·] is the nearest integer function. It follows that

E(W^k_i | F_{i−1}) ≤ k! ( (1 − p)/p² ) (1/p)^{k−2}. (21)

Thus, the central moments of the Geometric random variable X_i satisfy (18) with σ² = (1 − p)/p² and c = 1/p. Thus from Lemma 11 we have, for some n ≥ 1 and any x > 0,

P( Σ_{i=1}^n (Y_i − µ_i) ≥ x ) ≤ exp( −x² / ( 2 Σ_{i=1}^n σ²_i + 2x/(min_i p_i) ) ).

Therefore if, for C > 0, we take x = 2 log(C)/(min_i p_i) + √( 2 Σ_{i=1}^n σ²_i log(C) ), we have

P( Σ_{i=1}^n (Y_i − µ_i) ≥ x ) ≤ exp( − [ 4 log²(C)/(min_i p_i)² + 4 √( 2 Σ_{i=1}^n σ²_i log³(C) )/(min_i p_i) + 2 Σ_{i=1}^n σ²_i log(C) ] / [ 4 log(C)/(min_i p_i)² + 2 √( 2 Σ_{i=1}^n σ²_i log(C) )/(min_i p_i) + 2 Σ_{i=1}^n σ²_i ] ) ≤ exp( −log(C) ) = C^{−1}.

By symmetry we have the same bound on P( Σ_{i=1}^n (µ_i − Y_i) ≥ x ), and the statement of (7) follows. Now consider

P( Σ_{i=1}^n Y_i ≤ (1 − δ) Σ_{i=1}^n µ_i )

for any δ ∈ [0, 1/2]. Choosing δ = ( 2 log(C)/(min_i p_i) + √( 2 Σ_{i=1}^n σ²_i log(C) ) ) / Σ_{i=1}^n µ_i, and noting δ ≤ 1/2 when Σ_{i=1}^n µ_i > 4 log(C)/(min_i p_i) + √( 8 Σ_{i=1}^n σ²_i log(C) ), we have by Lemma 11 and a similar manipulation to that used in the proof of (7) that this probability is at most C^{−1}. Furthermore, since σ²_i = µ²_i + µ_i ≤ 2µ_i (as µ_i ≤ 1), we also have that, with probability at least 1 − C^{−1},

Σ_{i=1}^n Y_i ≥ Σ_{i=1}^n µ_i − 2 log(C)/(min_i p_i) − √( 2 Σ_{i=1}^n σ²_i log(C) ).

By symmetry we have the same high-probability bound on Σ_{i=1}^n (µ_i − Y_i), and the statement of (8) follows.

A.2 Proof of Lemma 4

We recall that the number of clicks on item j ∈ [J] in an epoch l ∈ [L] is a Geometric random variable with parameter p_{j,l} = ( Σ_{k=1}^K I{a^l_k = j}(1 + λ_k α_j) )^{−1}, and as such

p_{j,l} ∈ [ (1 + max_k λ_k α_j)^{−1}, (1 + min_k λ_k α_j)^{−1} ]

for an epoch where j ∈ a^l. Thus, the sequence of click counts is a sequence of Geometric random variables of the form considered in Theorem 3.
It follows from Theorem 3, specifically equation (7), that for any item j ∈ [J] the sum of clicks on that item in L epochs obeys

| Σ_{l=1}^L Σ_{k=1}^K I{a^l_k = j} n^l_k − Σ_{l=1}^L Σ_{k=1}^K I{a^l_k = j} λ_k α_j | ≤ √( 2 Σ_{l=1}^L σ²_{j,l} log(JL²/2) ) + 4 log(JL²/2),

with probability at least 1 − 4/(JL²). As per their definition in equation (4), the estimators ᾱ_j(l) are weighted sums of these click counts, and therefore we have, for any j ∈ [J], l ∈ [L], and a_{1:l} such that Σ_{s=1}^l Σ_{k=1}^K I{a^s_k = j} > 0,

| ᾱ_j(l) − α_j | ≤ √( 2 Σ_{s=1}^l σ²_{j,s} log(Jl²/2) ) / Σ_{s=1}^l Σ_{k=1}^K λ_k I{a^s_k = j} + 4 log(Jl²/2) / Σ_{s=1}^l Σ_{k=1}^K λ_k I{a^s_k = j}, (22)

with probability at least 1 − 4/(Jl²). Notice that

σ²_{j,l} = µ²_{j,l} + µ_{j,l} ≤ 2µ_{j,l} = 2α_j Σ_{k=1}^K λ_k I{a^l_k = j} ≤ 2 Σ_{k=1}^K I{a^l_k = j},

and thus we also have, for all j ∈ [J] and l ∈ [L],

P( | ᾱ_j(l) − α_j | > √( 4α_j log(Jl²/2) / Λ_{j,l} ) + 4 log(Jl²/2) / Λ_{j,l} ) ≤ 4/(Jl²).

Fixing j and l and considering the unconditioned probability, the result stated in equation (9) follows via a union bound. Similarly, it follows from equation (8) that the data-adaptive bound below holds with probability at least 1 − 6/(JL²):

| Σ_{l=1}^L Σ_{k=1}^K I{a^l_k = j} n^l_k − Σ_{l=1}^L Σ_{k=1}^K I{a^l_k = j} λ_k α_j | ≤ √( 4 Σ_{l=1}^L Σ_{k=1}^K I{a^l_k = j} n^l_k log(JL²/2) ) + 4 log(JL²/2).

Then by a similar union bound we have the result (10) as stated. Finally, we consider the probability in equation (11). We have, for Λ_{j,l} and l such that Λ_{j,l} > 4 log(Jl²/2)/α_j,

P( ᾱ_j(l) > 2α_j + 4 log(Jl²/2)/Λ_{j,l} ) ≤ P( ᾱ_j(l) > α_j + √( 4α_j log(Jl²/2)/Λ_{j,l} ) + 4 log(Jl²/2)/Λ_{j,l} ) ≤ 4/(Jl²),

with the final inequality using Theorem 3 once again.

A.3 Proof of Lemma 12

First, we give a recurrence relation for the cumulants of the Geometric distribution. This will be used to verify the Bernstein condition for the central moments. These results may also be of independent interest.

Lemma 13 The cumulants κ_n, n ≥ 1, of the Geometric random variable with parameter p satisfy

κ_n = Σ_{i=1}^n (−1)^{n−i} h_{n,i} / p^i, (23)

where the coefficients h_{n,i}, n ≥ 1, i ≤ n, are recursively defined positive integers satisfying h_{n,1} = 1 for n ≥ 1, h_{n,i} = i h_{n−1,i} + (i − 1) h_{n−1,i−1} for n ≥ 3, i ∈ {2, …
, n − 1}, and h_{n,n} = (n − 1) h_{n−1,n−1} for n ≥ 1.

Proof: Firstly we note that the cumulants of the Geometric distribution with parameter p satisfy the recurrence relation

κ_k = (p − 1) dκ_{k−1}/dp, with κ_1 = (1 − p)/p. (24)

The second and third cumulants follow immediately from (24) as

κ_2 = (p − 1) dκ_1/dp = (1 − p)/p², and κ_3 = (p − 1) dκ_2/dp = 2/p³ − 3/p² + 1/p. (25)

Thus we may verify that κ_3 satisfies the definition in (23). Now assume that κ_n matches the definition in (23) for some fixed n > 3, and consider κ_{n+1}. We have from (24) the following:

κ_{n+1} = (p − 1) dκ_n/dp = Σ_{i=1}^n (−1)^{n+1−i} i h_{n,i}/p^i + Σ_{i=1}^n (−1)^{n−i} i h_{n,i}/p^{i+1}
= (−1)^n h_{n,1}/p + Σ_{j=2}^n (−1)^{n+1−j} [ j h_{n,j} + (j − 1) h_{n,j−1} ]/p^j + n h_{n,n}/p^{n+1},

thus proving the statement by induction.

Considering this form of the cumulants (23), and the nature of the Bell polynomial (20), it is apparent that the nth central moment µ_n may also be written as an O((1/p)^n) polynomial, with some non-negative integer coefficients f_{n,1}, …, f_{n,n} (to be specified later). Specifically, we may write

µ_n = Σ_{i=1}^n (−1)^{n−i} f_{n,i} / p^i. (26)

Next, we introduce a relevant property of a sequence of non-negative integers, and give a lemma showing that the coefficients of the cumulants and central moments have this property.

Definition 14 (Alternating Partial Sum (APS)) A sequence of n > 0 non-negative integers h_1, …, h_n is called APS if for all k ∈ [n],

Σ_{i=1}^k (−1)^{n−i} h_i ≥ 0 when (n − k) mod 2 = 0, and ≤ 0 when (n − k) mod 2 = 1.

Lemma 15 For any integer n ≥ 3, and Geometric random variable X with parameter p, both the coefficients of the polynomial expression for the cumulants of X, h_{n,1}, …, h_{n,n}, and the coefficients of the polynomial expression for the central moments of X, f_{n,1}, …, f_{n,n}, are APS sequences.

The full proof of Lemma 15 is in the next subsection. The proof has two main steps. First we show that the sequence h_{n,1}, …, h_{n,n} is APS for any n, from its recursive formula. Second, we show that the sequence f_{n,1}, …
, f_{n,n} can be written as a linear combination of APS sequences (derived from multiplying together cumulants in the Bell polynomial). This linear combination operation preserves the APS property, and thus the sequence f_{n,1}, …, f_{n,n} is also APS.

The proof of Lemma 12 follows from application of Lemma 15. First, we demonstrate that f_{n,n} = !n. Recall that the central moments µ_n are defined in terms of the cumulants as

µ_n = Σ_{m=1}^n B_{n,m}( 0, Σ_{i=1}^2 (−1)^{2−i} h_{2,i}/p^i, …, Σ_{i=1}^{n−m+1} (−1)^{n−m+1−i} h_{n−m+1,i}/p^i ).

It follows that the leading order coefficient f_{n,n} of the polynomial expression for µ_n can be expressed in terms of incomplete Bell polynomials of the leading order coefficients of the preceding cumulants, i.e.

f_{n,n} = Σ_{m=1}^n B_{n,m}(0, h_{2,2}, …, h_{n−m+1,n−m+1}) = Σ_{m=1}^n B_{n,m}(0, 1!, 2!, …, (n − m)!), (27)

where the second equality follows from Lemma 13. The above definition of f_{n,n} coincides with a complete Bell polynomial, such that we have

f_{n,n} = B_n(0, 1!, 2!, …, (n − 1)!) = Σ_{j_2,…,j_n : 2j_2+⋯+nj_n=n} [ n!/(j_2! ⋯ j_n!) ] (1!/2!)^{j_2} ⋯ ((n − 1)!/n!)^{j_n} = Σ_{j_2,…,j_n : 2j_2+⋯+nj_n=n} n!/(j_2! ⋯ j_n! 2^{j_2} ⋯ n^{j_n}) = !n. (28)

The final equality follows from the observation that each of the summands in the penultimate expression is the number of permutations in the group of all permutations of n integers with cycle structure 2^{j_2} 3^{j_3} ⋯ n^{j_n}. By definition, this sum is the number of derangements of n.

The second stage of the proof is to demonstrate that µ_n ≤ f_{n,n}(1 − p)/p^n. First, we note that if p = 1 then the Geometric variable X has P(X = 0) = 1. Thus, by the definition of µ_n as a central moment, if p = 1 then µ_n = 0. This implies that the alternating sum of polynomial coefficients f_{n,1}, …, f_{n,n} must be 0 for any n, i.e. Σ_{i=1}^n (−1)^{n−i} f_{n,i} = 0, and in particular, that

f_{n,n} = Σ_{i=1}^{n−1} (−1)^{n−1−i} f_{n,i}. (29)

As p ≤ 1 by definition, the APS property of f_{n,1}, …, f_{n,n−1} tells us that

(1/p^{n−1}) Σ_{i=1}^{n−1} (−1)^{n−1−i} f_{n,i} ≤ Σ_{i=1}^{n−1} (−1)^{n−1−i} f_{n,i}/p^i.
(30)

We verify this by considering that

Σ_{i=1}^{n−1} (−1)^{n−1−i} f_{n,i} = Σ_{i=1}^{n−1} (−1)^{n−1−i} p^{n−1−i} f_{n,i} + Σ_{i=1}^{n−1} (−1)^{n−1−i} (1 − p^{n−1−i}) f_{n,i} ≤ Σ_{i=1}^{n−1} (−1)^{n−1−i} p^{n−1−i} f_{n,i},

where the inequality holds since the second sum is non-positive. Dividing both sides by p^{n−1} gives (30). We then complete the proof by bounding µ_n as follows:

µ_n = Σ_{i=1}^n (−1)^{n−i} f_{n,i}/p^i ≤ f_{n,n}/p^n − (1/p^{n−1}) Σ_{i=1}^{n−1} (−1)^{n−1−i} f_{n,i} = f_{n,n}(1 − p)/p^n = !n (1 − p)/p^n,

where the inequality follows from (30) and (29), and the final equality follows from (28).

A.4 Proof of Lemma 15

Firstly we show by induction that the sequences h_{n,1}, …, h_{n,n} are APS for all n ≥ 3. Consider first the case of n = 3. We have, as defined in Lemma 13, that h_{3,1} = 1, h_{3,1} − h_{3,2} = 1 − 3 = −2, and h_{3,1} − h_{3,2} + h_{3,3} = 1 − 3 + 2 = 0. Thus all of the non-negativity and non-positivity conditions are satisfied, and the sequence h_{3,1}, h_{3,2}, h_{3,3} is APS. We now assume for some fixed m ≥ 3 that h_{m,1}, …, h_{m,m} is APS, and proceed to consider the sequence h_{m+1,1}, …, h_{m+1,m+1}. By definition we have h_{m+1,1} = 1 and h_{m+1,m+1} = m!. Thus the APS conditions are satisfied for k = 1 and k = m + 1. We proceed to consider Σ_{i=1}^k (−1)^{m+1−i} h_{m+1,i} for k ∈ {2, …, m}. We have

Σ_{i=1}^k (−1)^{m+1−i} h_{m+1,i} = (−1)^m h_{m,1} + Σ_{i=2}^k (−1)^{m+1−i} [ i h_{m,i} + (i − 1) h_{m,i−1} ] = (−1)^{m+1−k} k h_{m,k}. (31)

Since all h_{m,k}, m ≥ 2, k ≤ m, are positive integers, (31) is positive when (m + 1 − k) mod 2 = 0 and negative when (m + 1 − k) mod 2 = 1. Thus the APS conditions are satisfied for the sequence h_{m+1,1}, …, h_{m+1,m+1}, given that h_{m,1}, …, h_{m,m} is APS. Thus, by induction, the sequences h_{n,1}, …, h_{n,n} are APS for all n ≥ 3.

We next show two properties of APS sequences. Firstly, we have the property that addition preserves APS.

Property 1 (Preservation of APS under addition) If a_1, …, a_n and b_1, …, b_n are APS sequences, then the sequence a_1 + b_1, …, a_n + b_n is APS.

To verify, consider first j ≤ n such that (n − j) mod 2 = 0; we have

Σ_{i=1}^j (−1)^{n−i} (a_i + b_i) = Σ_{i=1}^j (−1)^{n−i} a_i + Σ_{i=1}^j (−1)^{n−i} b_i ≥ 0,

since a_1, …, a_n and b_1, …
, b_n are both APS. Similarly, for j ≤ n such that (n − j) mod 2 = 1, we have

Σ_{i=1}^j (−1)^{n−i} (a_i + b_i) = Σ_{i=1}^j (−1)^{n−i} a_i + Σ_{i=1}^j (−1)^{n−i} b_i ≤ 0,

showing that a_1 + b_1, …, a_n + b_n is APS.

The second property states that if two polynomials (in the same variable) have APS coefficients, the product of these polynomials has APS coefficients.

Property 2 (Preservation of APS under polynomial multiplication) Let a_1, …, a_n and b_1, …, b_m be APS for n, m ∈ N, with Σ_{i=1}^n (−1)^{n−i} a_i = 0 and Σ_{i=1}^m (−1)^{m−i} b_i = 0. Then the sequence of polynomial coefficients c_1, …, c_{n+m} such that

Σ_{i=1}^{n+m} (−1)^{n+m−i} c_i x^i = ( Σ_{i=1}^n (−1)^{n−i} a_i x^i ) ( Σ_{i=1}^m (−1)^{m−i} b_i x^i ), x ∈ R,

is APS. To verify this, consider

Σ_{i=1}^{n+m} (−1)^{n+m−i} c_i x^i = ( a_n x^n − a_{n−1} x^{n−1} + ⋯ + (−1)^{n−1} a_1 x ) Σ_{j=1}^m (−1)^{m−j} b_j x^j
= Σ_{j=1}^m (−1)^{m−j} b_j ( a_n x^{n+j} − a_{n−1} x^{n−1+j} + ⋯ + (−1)^{n−1} a_1 x^{1+j} )
= Σ_{j=1}^m Σ_{i=1}^{n+m} (−1)^{n+m−i} d^{(j)}_i x^i,

for coefficients d^{(j)}_i, j ∈ [m], i ∈ [n + m], defined as follows: d^{(j)}_i = 0 for i < 1 + j; d^{(j)}_i = a_{i−j} b_j for i ∈ {1 + j, …, n + j}; and d^{(j)}_i = 0 for i > n + j. Now, since a_1, …, a_n is APS and has Σ_{i=1}^n (−1)^{n−i} a_i = 0, we have

Σ_{i=1}^k (−1)^{n+m−i} d^{(j)}_i = 0 for k < 1 + j; = (−1)^{m−j} b_j Σ_{i=1}^{k−j} (−1)^{n−i} a_i for k ∈ {1 + j, …, n + j}; and = 0 for k > n + j,

for all j ∈ [m]. Thus the sequence {d^{(j)}_i}_{i=1}^{n+m} is APS for each j ∈ [m] and, by Property 1, the coefficients c_1, …, c_{n+m} are also APS.

Using the result that h_{n,1}, …, h_{n,n} is APS for any n ≥ 3 and the above properties, we will show that the sequences f_{n,1}, …, f_{n,n} are also APS for all n ≥ 3. Firstly, we recall the definition of the central moments in terms of cumulants,

µ_n = Σ_k Σ_{j_2,…,j_n : j_2+⋯+j_n=k, 2j_2+⋯+nj_n=n} [ n!/(j_2! ⋯ j_n!) ] (κ_2/2!)^{j_2} (κ_3/3!)^{j_3} ⋯ (κ_n/n!)^{j_n}.

Since κ_2 = O(p^{−2}), κ_3 = O(p^{−3}), and in general κ_n = O(p^{−n}), each summand is O(p^{−n}), and may be written as a polynomial

Σ_{i=1}^n (−1)^{n−i} J_i/p^i = [ n!/(j_2! ⋯ j_n!) ] (κ_2/2!)^{j_2} (κ_3/3!)^{j_3} ⋯ (κ_n/n!)^{j_n},

where J_1, …, J_n are (possibly zero or negative) coefficients which are a function of the variables j_2, …, j_n, the order n of the moment in question, and the sequences h_{m,1}, …, h_{m,m} for all m ≤ n. It follows by Property 2 that the coefficients J_1, …, J_n are APS.
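Definition 14 and Property 1 are easy to check numerically. The sketch below (the `is_aps` helper is our own, not code from the paper) verifies the APS property for the cumulant coefficients h_{3,·} = (1, 3, 2) used in the base case above, and for a coordinate-wise sum as in Property 1.

```python
def is_aps(h):
    """Check the Alternating Partial Sum property of Definition 14:
    for each k, sum_{i<=k} (-1)^(n-i) h_i must be >= 0 when (n - k) is
    even and <= 0 when (n - k) is odd."""
    n = len(h)
    for k in range(1, n + 1):
        s = sum((-1) ** (n - i) * h[i - 1] for i in range(1, k + 1))
        if (n - k) % 2 == 0 and s < 0:
            return False
        if (n - k) % 2 == 1 and s > 0:
            return False
    return True

# Cumulant coefficients h_{3,1}, h_{3,2}, h_{3,3} from Lemma 13.
h3 = (1, 3, 2)
```

Here `is_aps(h3)` holds, as does the property for the element-wise sum of `h3` with itself, illustrating the preservation under addition that the proof relies on.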
Finally, since the coefficients f_{n,1}, …, f_{n,n} can be written as sums of the J coefficients over the valid combinations of j_2, …, j_n, we have by Property 1 that the coefficients f_{n,1}, …, f_{n,n} are APS.

Appendix B. Proof of Maximum Likelihood Concentration

In this section we provide proofs of Theorem 5 and Corollary 6, giving our results on the concentration of the maximum likelihood estimators in the unknown position bias setting.

B.1 Proof of Theorem 5

The first step of the proof is to derive a bound on the relative entropy of functions with respect to the product measure. We make use of the following result from Joulin and Privault (2004). It gives a logarithmic Sobolev inequality tailored to functions of a univariate Geometric distribution.

Theorem 16 (Theorem 3.7 of Joulin and Privault (2004)) Let π denote the law of a Geometric random variable with parameter p. Let 0 < b < −log(1 − p) and let f : N → R be such that |d⁺f(k)| = |f(k + 1) − f(k)| ≤ b for all k ∈ N. We have

Ent_π( e^f ) ≤ [ (1 − p)e^b / (1 − (1 − p)e^b) ] E_π( |d⁺f|² e^f ).

The chain rule for differential entropy (see e.g. Theorem 8.6.2 of Cover and Thomas (2012)) states that for a series of random variables X_1, …, X_n with joint distribution µ^n,

Ent_{µ^n}(X_1, …, X_n) = Σ_{l=1}^n Ent_{µ^l}(X_l | X_1, …, X_{l−1}).

Using this result we may extend the univariate bound in Theorem 16 in both dimensions to a bound under the product measure. Specifically, for every function G : N^{d×n} → R satisfying |d⁺G_{li}| ≤ b_{li} for constants 0 < b_{li} < −log(1 − p_{li}), i ∈ [d], l ∈ [n], we have

Ent_{µ^n}( e^G ) ≤ max_{i∈[d], l∈[n]} [ (1 − p_{li})e^{b_{li}} / (1 − (1 − p_{li})e^{b_{li}}) ] E_{µ^n}( Σ_{l=1}^n Σ_{i=1}^d |d⁺G_{li}|² e^G ). (32)

In particular, under the assumptions of the theorem statement, this holds for F, with max_{l,i} b_{li} = β_2. We now follow the so-called Herbst's method (Davies and Simon, 1984; Aida et al., 1994) to achieve a deviation inequality on F.
For ease in what follows, we introduce the function M_G : R → R with

M_G(b) = max_{i∈[d], l∈[n]} (1 − p_{li})e^b / (1 − (1 − p_{li})e^b), b ∈ ( 0, max_{i∈[d], l∈[n]} −log(1 − p_{li}) ). (33)

Further, we let p∗ be the parameter attaining the maximum in (33), i.e. p∗ = argmax_{p_{li} : i∈[d], l∈[n]} (1 − p_{li})e^b / (1 − (1 − p_{li})e^b), for any valid b. Applying (32) to ηF for every 0 < η ≤ −log(1 − p∗)/(2β_2) and substituting the definition of entropy, we have

E_{µ^n}( ηF e^{ηF} ) − E_{µ^n}( e^{ηF} ) log E_{µ^n}( e^{ηF} ) ≤ M_G(ηβ_2) E_{µ^n}( Σ_{l=1}^n Σ_{i=1}^d η² |F(X + ϵ_{li}) − F(X)|² e^{ηF} ).

Exploiting the assumed bound (12), and introducing the notation H(η) = E_{µ^n}( e^{ηF} ), we may then rewrite the above display as

ηH′(η) − H(η) log(H(η)) ≤ η² β²_1 M_G(ηβ_2) H(η).

We then set K(η) = (1/η) log(H(η)), with K(0) = H′(0)/H(0) = E_{µ^n}(F), and observe

K′(η) ≤ η² β²_1 M_G(ηβ_2) H(η) / (η² H(η)) = β²_1 M_G(ηβ_2) ≤ β²_1 M_G( −log(1 − p∗)/2 ),

since M_G is an increasing function. We will define M := M_G( −log(1 − p∗)/2 ) for convenience. As such, we may bound K(η) ≤ E_{µ^n}(F) + ηβ²_1 M, and derive the exponential inequality

E_{µ^n}( e^{ηF} ) ≤ exp( ηE_{µ^n}(F) + η² β²_1 M ), 0 < η ≤ −log(1 − p∗)/(2β_2). (34)

Finally, we apply a Chernoff bound to F and substitute (34), giving

P_{µ^n}( F ≥ E_{µ^n}(F) + δ ) ≤ exp( ηE_{µ^n}(F) + η² β²_1 M − ηE_{µ^n}(F) − ηδ ) = exp( η² β²_1 M − ηδ ),

which, when minimised over η ∈ (0, −log(1 − p∗)/(2β_2)], yields the stated result.

B.2 Proof of Corollary 6

The function α^EM, which computes the EM estimates of α, is posed as a function from N^{K×J} × N^{K×J} to [0, 1]^J, where the input matrices are of the form of N(L) and N̄(L). Recall that we have α^EM(N(L), N̄(L)) = α^EM, where α^EM are the EM estimates of α derived from Algorithm 1. For the purposes of this proof we will define α̃^EM : N^{K×L} × [J]^{K×L} → [0, 1]^J, which computes the same α^EM estimates but via an alternative arrangement of the input data. Let Ñ(L) be the K × L matrix whose lth column (l ∈ [L]) is n^l, which we recall is the vector of clicks per slot in epoch l. Let A(L) be the K × L matrix whose lth column (l ∈ [L]) is a^l, the action vector in epoch l. Define α̃^EM such that α̃^EM(Ñ(L), A(L)) = α^EM(N(L), N̄(L)) = α^EM.
Then, for fixed A(L), the restriction of α̃^EM to its jth output, α̃^EM_j, is a function from N^{K×L} to [0, 1]. It also follows from the definitions above that E( α̃^EM_j ) = α_j. Finally, since the entries of the (non-fixed) input matrix Ñ(L) are Geometric random variables, the function α̃^EM_j fits within the framework of Theorem 5. We therefore have that

P( |α^EM_j(N(L), N̄(L)) − α_j| ≥ δ ) = P( |α̃^EM_j(Ñ(L), A(L)) − α_j| ≥ δ ) ≤ exp( −δ² / (4β²_{1,L,j} M) ),

where

β²_{1,L,j} = sup_{X∈N^{K×L}} Σ_{l=1}^L Σ_{k=1}^K ( α̃^EM_j(X + ϵ_{lk}) − α̃^EM_j(X) )², (35)

and we recall that M = M_G( max_{k∈[K], l∈[L]} −log(1 − p_{kl})/2 ). Specifically, in the context of our MNL-LTR problem, we have p_{kl} ∈ [1/2, 1) for all k ∈ [K], l ∈ [L]. Thus, M ≤ 0.5/(0.5(1 − (0.5)^{1/4})) ≤ 9. Rearranging the exponential inequality and substituting the bound on M, we therefore have

P( |α^EM_j(N(L), N̄(L)) − α_j| ≥ √( 72 β²_{1,L,j} log(JL) ) ) ≤ 2/(JL²).

B.3 Proof of Lemma 7

In this section, we derive the bound on the sensitivity parameter β_{1,j,l} which enables an analysis of the regret. We first derive an alternative expression for β_{1,j,l} to that in (35), in terms of the α^EM functions as opposed to the α̃^EM functions:

β²_{1,j,l} = sup_{N∈N^{K×J}} Σ_{i=1}^J Σ_{k=1}^K N̄_ki ( α^EM_j(N, N̄) − α^EM_j(N + ϵ_ki, N̄) )².

Bounding the finite differences with partial derivatives, we have the following bound on the sensitivity parameter:

β²_{1,j,l} ≤ sup_{N∈N^{K×J}} Σ_{i=1}^J Σ_{k=1}^K N̄_ki max_{n_ki∈[N_ki, N_ki+1]} ( ∂α*_j/∂N_ki )². (36)

In what follows we shall obtain a further analytical bound on this expression in order to relate the sensitivity parameter to the selected actions and the number of rounds. For the purposes of this section, let θ = (α, λ_{−1}) denote the length J + K − 1 vector of the unknown parameters in the unknown position biases model. As before, N and N̄ are K × J matrices of click- and play-counts for each item-slot combination. We will suppress dependence on L for brevity in this section. In order to express gradients with respect to click counts conveniently, we introduce the notation n to represent a vectorised version of N.
Here element m of n corresponds to element N(m mod K), m/K of the matrix N. For a given pair n, N, the maximum likelihood estimate, θ (n, N) argmax θ [0,1]J+K 1 ℓ θ; n, N , is found where θℓ(θ; n, N) = 0. It then follows from the continuous differentiability of the likelihood, and an application of the implicit function theorem, that the gradient of the maximiser with respect to the click vector n, within an open neighbourhood of a particular solution θ 0, may be written, nθ (n, N) = h θθℓ θ; n, N i 1 θnℓ θ; n, N . (37) Elements of nθ , a (J +K 1) JK matrix, provide the gradients of maximum likelihood estimators contained in (36). To compute (37), we first consider the gradient vector θℓwhich contains the partial derivatives, αj λk(Nkj + Nkj) (1 + αjλk) j [J], ℓ λk = λk αj(Nkj + Nkj) (1 + αjλk) k [K] \ {1}. Thus, the second derivatives contained in the Hessian matrix θθℓare of the following form λ2 k(Nkj + Nkj) (1 + αjλk)2 Nkj α2 j , 2ℓ λ2 k = α2 j(Nkj + Nkj) (1 + αjλk)2 Nkj and 2ℓ αjλk = (Nkj + Nkj) (1 + αjλk)2 , with 2ℓ/ αj α j = 0 where j = j and 2ℓ/ λk λ k = 0 where k = k . The Hessian therefore has a sparsity structure which we later exploit: the block of second derivatives in the αj parameters and the block in the λk parameters are both diagonal, and the only remaining non-zero entries are the mixed derivatives 2ℓ/ αj λk in the off-diagonal blocks. Finally, the elements of θnℓare the second derivatives, 2ℓ αj Nkj = 1 αj λk (1 + αjλk), 2ℓ λk Nkj = 1 λk αj (1 + αjλk), (38) and 2ℓ/ αj Nkj = 0 for j = j and 2ℓ/λk Nk j = 0 for k = k . It follows from the sparsity of θnℓthat the gradient of a particular attractiveness parameter estimate α j with respect to a particular click count Nkl, k [K], l [J], may be written, js 2ℓ αs Nkl j,J+m 1 2ℓ λm Nkl jl 2ℓ αl Nkl I{k = 1} 1 θθ j,J+k 1 2ℓ λk Nkl , (39) where 1 θθ will henceforth be shorthand for the inverse Hessian, ( θθℓ) 1.
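The implicit-function-theorem expression (37) can be sanity-checked on a one-parameter analogue of the likelihood (50): with a single item in a single slot and λ = 1, the log-likelihood ℓ(θ; n, m) = n log θ − (n + m) log(1 + θ) has the closed-form maximiser θ*(n) = n/m, so dθ*/dn = 1/m, and the expression −(∂²ℓ/∂θ²)⁻¹ ∂²ℓ/∂θ∂n should recover exactly that. A minimal sketch (the values n = 6, m = 3 are illustrative):

```python
# One-dimensional analogue of (37): the gradient of the MLE with respect to the
# click count equals -(d2l/dtheta2)^(-1) * (d2l/dtheta dn), evaluated at the MLE.
n, m = 6.0, 3.0
theta = n / m                                            # closed-form maximiser

d2l_dtheta2 = -n / theta**2 + (n + m) / (1 + theta)**2   # Hessian (here a scalar)
d2l_dtheta_dn = 1 / theta - 1 / (1 + theta)              # mixed second derivative

ift_gradient = -(1 / d2l_dtheta2) * d2l_dtheta_dn
assert abs(ift_gradient - 1 / m) < 1e-12                 # matches d(n/m)/dn = 1/m
```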
We proceed to bound these gradients uniformly, and apply said bounds to derive a bound on β1,j,l. To bound the elements of 1 θθ, recall the definition of the matrix inverse as A 1 = (det(A)) 1adj(A), where adj(A) denotes the adjugate of a square matrix A, a matrix of the same dimension consisting of signed minors (cofactors) of the matrix. Specifically, entry (h, i) of adj(A) is the determinant of the matrix obtained by removing the ith row and hth column of A, times ( 1)h+i, i.e. adj(A)hi = ( 1)h+i det(A i, h). We may therefore write, for j [J], k [K], and l [J], α j Nkl = (adj( θθ))jl det( θθ) 2ℓ αl Nkl I{k = 1} (adj( θθ))j,J+k 1 det( θθ) 2ℓ λk Nkl , and, since the second derivatives which mix parameter and click derivatives (those of the form 2ℓ/ α N and 2ℓ/ λ N defined in (38)) are independent of N and N, there exists a constant C1 > 0 such that, k=1 Nkl max n αj Nkl (adj( θθ))jl (adj( θθ))jl det( θθ) (adj( θθ))j,J+k+1 C1 PJ l=1 N1l (adj( θθ)jl)2 + PK k=2 Nkl (adj( θθ)jl + adj( θθ)j,J+k 1)2 det( θθ)2 . (40) For any j [J], k [K] \ {1}, the variable Nkj enters (always linearly) in four elements of θθℓonly, namely those indexed (j, j), (j, J +k 1), (J +k 1, j), and (J +k 1, J +k 1). The determinant computation never takes a product of terms in the same row or column, and thus, det( θθℓ) = O N2 kj , j [J], k [K] \ {1}. For the adjugate of our Hessian matrix, we have the following order results, for h, i [J + K 1], j [J], and k [K], |(adj( θθ))hi| = O Nkj when h {j, J + k 1} and/or i {j, J + k 1}, O N2 kj otherwise. That is, the elements of the adjugate (themselves determinants of submatrices) are quadratic in the variables Nkj unless they are contained within a row or column sharing indices of the variable. In such rows and columns, the order is linear, since instances of Nkj are removed from the determinant computation.
The order never falls to O(1) because the removal of no single row h and column i can remove all four instances of the variable Nkj. It follows from these order results that, within the square root of Equation (40), for all k [K] the numerator is of order O N3 kj , while the denominator is O N4 kj , and thus there exists a constant C > 0 such that, JK maxk Nkj . Appendix C. Proof of Regret Upper Bounds In this section we provide a proof of Proposition 8, giving an upper bound on the regret of the Epoch-UCB algorithm for the known position bias setting. The proof has two main stages. First we construct an event that all the UCB indices remain in intervals of certain width around the unknown attractiveness parameters, and verify that this is a high-probability event. We simply assume that the regret is the worst possible if this event does not occur. Then, conditioned on the high-probability event occurring, we utilise bounds on the values of the UCB indices and the properties of the reward function to bound the regret per epoch, which is aggregated over L epochs to give the stated regret bound. C.1 Proof of Proposition 8 We begin by defining the regret specifically for an epoch-based algorithm. We have that the T-round regret as defined in (2) may be rewritten as t El R(a ) R al l=1 |El| R(a ) R al ! We recall that |El|, the number of rounds in epoch l, follows a Geometric distribution when conditioned on al, and that al is a deterministic function of the history Hl 1. As such we may use the law of total expectation to replace |El| with its expectation. We have l=1 E |El| R(a ) R al | Hl 1 ! R(a ) R al !
To aid readability, we will define Rl = 1 + PK k=1 λkαl k R(a ) R al for each epoch l [L] so that we have Reg(T) = E L X Next, we define a series of events Bl, l [L] which concern the value of the UCBs, as follows, αUCB j,l / αj, αj + (1 + 8αj log(Jl2/2) Λj,l + 4(2 + 2) log(Jl2/2) We can bound the probability of this event using the concentration results derived in Lemma 4. Specifically, we have that j=1 P(αUCB j,l < αj) + P αUCB j,l > αj + (1 + 8αj log(Jl2/2) Λj,l + 4(2 + 2) log(Jl2/2) 3 Jl + P | αUCB j,l αj| > (1 + 8αj log(Jl2/2) Λj,l + 4(2 + 2) log(Jl2/2) since P(αUCB j,l < αj) is bounded for each j [J] whether min(1, 2 αj,l) is 1 or 2 αj,l. Fixing j [J] and considering a single summand from (41) we have, P | αUCB j,l αj(l)| + | αj(l) αj| > (1 + 8αj log(Jl2/2) Λj,l + 4(2 + 2) log(Jl2/2) P | αUCB j,l αj(l)| > 16αj log(Jl2/2) Λj,l + 4(1 + 2) log(Jl2/2) + P | αj(l) αj| > 8αj log(Jl2/2) Λj,l + 4 log(Jl2/2) 4 min(1, 2 αj,l) log(Jl2/2) Λj,l + 4 log(Jl2/2) 16αj log(Jl2/2) Λj,l + (1 + 2)4 log(Jl2/2) 4 min(1, 2 αj,l) log(Jl2/2) 16αj log(Jl2/2) 2 log(Jl2/2) P 8 αj,l log(Jl2/2) Λj,l > 16αj log(Jl2/2) Λj,l + 32 log2(Jl2/2) = P αj,l > 2αj + 4 log(Jl2/2) The first inequality uses the triangle inequality, the second an application of equation (9), the third bounds by replacing the minimum with the 2 αj,l term, and the final uses (11). It follows, combining (41) and (42), that for any l [L], P(Bl) 9/l. Having established Bl as a low-probability event, we will decompose the regret according to Bl, and bound it separately under the event Bl and its complement. Under Bl we will resort to trivial bounds on regret coming from the maximum of the reward function, but these will make limited contribution to the overall expected regret, since Bl occurs with low probability. On the complement of Bl, the parameters are bounded in a way that we can exploit to bound the per-round regret with quantities leading to an optimal overall bound.
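The UCB indices appearing in the event Bl combine the empirical estimate with a square-root exploration term and a log/Λ correction term. A minimal sketch of an index of this shape (the constants 8 and 4(2 + √2) follow the display above; the helper name ucb_index and the numeric arguments are ours):

```python
import math

def ucb_index(alpha_hat, Lambda, J, l):
    """UCB index of the shape used in the event B_l: empirical estimate plus a
    sqrt(log / Lambda) exploration term and a log / Lambda correction term."""
    log_term = math.log(J * l**2 / 2)
    return (alpha_hat
            + math.sqrt(8 * alpha_hat * log_term / Lambda)
            + 4 * (2 + math.sqrt(2)) * log_term / Lambda)

# The index dominates the estimate, and shrinks as the evidence Lambda grows.
assert ucb_index(0.5, 10.0, 10, 5) > 0.5
assert ucb_index(0.5, 1000.0, 10, 5) < ucb_index(0.5, 10.0, 10, 5)
```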
Specifically, we have for l [L], E(Rl) = E Rl I{Bl} + Rl I{ Bl} (K + 1)P(Bl) + E Rl I{ Bl} k=1 λkαl k R(a ) R(al) I{ Bl} Since the reward function R is monotonically increasing in the attractiveness parameter vector, and under Bl we have αUCB j,l αj for all j [J], it follows that we also have R(a, αUCB l ) R(a, α), a A, under Bl. We also have by definition of the UCB algorithm that R(al, αUCB l ) R(a , αl), and thus under Bl we have R(a ) R(al) R(al, αUCB l ) R(al, α) PK k=1 λkαUCB l,al k 1 + PK k=1 λkαUCB l,al k PK k=1 λkαal k 1 + PK k=1 λkαal k PK k=1 λk(αUCB l,al k αal k) 1 + PK k=1 λkαUCB l,al k PK k=1 λk(αUCB l,al k αal k) 1 + PK k=1 λkαal k (44) Thus, combining (43) and (44) we have the following bound on per-epoch regret, E Rl 9(K + 1) v u u t8αal k log(Jl2/2) Λal k,l + 4(2 + 2) log(Jl2/2) Λal k,l Aggregating over L epochs, it follows that Reg(T) is upper bounded by v u u t8αal k log(Jl2/2) Λal k,l + 4(2 + 2) log(Jl2/2) Λal k,l 9(K + 1)(log(T) + 1) + E v u u t 48αal kλ2 k log(Jl2/2) λK Pl s=1 I{al k as} + 14 log(Jl2/2) λK Pl s=1 I{al k as} 9(K + 1) + 14KJ λK log(JT 2/2) (log(T) + 1) λK αjλ2 1 log(Jl2/2) s=1 I{j as} 9(K + 1) + 14JK λK log(JT 2/2) (log(T) + 1) v u u t192λ2 1 log(Jl2/2)JE PJ j=1 PL s=1 αj I{al k as} 9(K + 1) + 14JK λK log(JT 2/2) (log(T) + 1) + 192λ2 1 log(JT 2)JT Here the second inequality replaces sums of reciprocals with increasing functions of T (since T > L), the third inequality uses Cauchy-Schwarz, and the final inequality uses the property that PJ j=1 PL s=1 αj I{al k as} is less than T in expectation, as derived in (Agrawal et al., 2019, Appendix A.3). C.2 Proof of Proposition 10 The structure of the proof is similar to that of Proposition 8. We again decompose the regret as ! R(a ) R al !
and define a series of low-probability events Cl, l [L], where the UCBs deviate substantially from their expected value, n αUCB U j,l / h αj, αj + 2 q 36β2 i,l,j log(Jl2) io For each l [L], we may bound the probability of Cl as, j=1 P αUCB U j,l < αj + P αUCB U j,l > αj + 2 q 36β2 i,l,j log(Jl2) 4 We now decompose the per-epoch regret on the low-probability event Cl. For each l [L] we have, E Rl E Rl | Cl + E Rl | Cl ! R(a ) R al Cl The result (44) holds irrespective of the knowledge of position bias, so we have the following bound on per-epoch regret, combining (44), (45), and then utilising Lemma 7, E Rl 4K + 4 36β2 1,al k,l log(Jl2) 36CJK log(Jl2) maxk [K] Nk al k Aggregating over L epochs, we reach the stated regret bound as follows, 36CJK log(Jl2) maxk [K] Nk al k 8(K + 1) + E 36CJK log(Jl2) maxk [K] Nk al k 8(K + 1) + p 144CJK4T log(JT 2). Appendix D. Proof of Regret Lower Bound Proof: We first introduce some further notation. For each action a A, a fixed position bias vector λ, and some constant ϵ (0, 1/2] to be fixed later, define the attractiveness parameter vector τ a (0, 1]J+1 such that SK,2 , j = ak, k [K], 1 SK , otherwise. Let Pa and Ea denote the law and expectation with respect to the parametrisation αj = τa,j. Under Pa, the action a is optimal. We will also make use of the additional laws Pa\j and expectations Ea\j for j a, for each a A. Under Pa\j we set αj = τj , for all j = j, and have αj = 1/SK. We define A as the set of incomplete orderings, i.e. orderings in A with a single unfilled slot. Each incomplete ordering a A has a corresponding law Pa and expectation operator Ea which coincide with Pa\j and Ea\j when placing j in the empty slot of a would generate ordering a. Further, for j a, introduce the notation a 1(j) to refer to the slot in which action a places item j, i.e. aa 1(j) = j. For j / a, a 1(j) = K + 1, and we let λK+1 = 0.
For a fixed a A, consider the per-round regret under the problem with parameters τ a. We have, for any t [T], Ea(r(a) r(at)) = Ea 2 + ϵ 1 + ϵ SK,2 PK k=1 λk P j a λa 1(j)I{ak,t = j} 2 + ϵ SK,2 PK k=1 λk P j a λa 1(j)I{ak,t = j} ϵ ϵ SK,2 PK k=1 λk P j a λa 1(j)I{ak,t = j} (2 + ϵ) 2 + ϵ SK,2 PK k=1 λk P j a λa 1(j)I{ak,t = j} ϵ ϵ SK,2 PK k=1 λk P j a λa 1(j)I{ak,t = j} It follows that in T rounds, the regret satisfies max α (0,1]J Regα(T) max a A Ea t=1 r(a) r(at) t=1 r(a) r(at) j a λa 1(j) t=1 λa 1 t (j) Modifying slightly the notation from Section 4, we consider the innermost sum Λj(T) = PT t=1 λa 1 t (j) in isolation, for j [J]. For an a A, and j a, the expectation of these random variables has the following property, Ea(Λj(T)) Ea\j(Λj(T)) + Ea\j(Λj(T)) Ea(Λj(T)) Ea\j(Nkj(T)) + Tλmax||Pa\j Pa||TV Ea\j(Λj(T)) + Tλmax KL(Pa || Pa\j) Here, the second inequality uses the discrete support of Λj(T), and the final inequality uses Pinsker's inequality. We now turn our attention to the KL divergence term KL(Pa || Pa\j). By the chain rule for relative entropy (see e.g. Theorem 2.5.3 of Cover and Thomas (2012)), we have KL(Pa || Pa\j) = t=1 KL Pa(Qt | Q1, . . . , Qt 1) || Pa\j (Qt | Q1, . . . , Qt 1) a A:j a Pa {at = a }KL(Pa(Q | a ) || Pa\j (Q | a )) k=1 max a A:a k=j Ea (Nkj(T))KL(Pa(Q | a ) || Pa\j(Q | a )) (48) Here we also use the property that any algorithm gives a deterministic mapping from Q1:t 1 to at, since even an instance of a randomised algorithm can, alternatively, be viewed as a deterministic algorithm randomly selected from a (potentially infinitely large) population of algorithms. The following lemma, whose proof is given in the following subsection, bounds the contribution to the KL divergence from a single round.
Lemma 17 For two actions a, a A, and an item j a such that an = j and a n = j for some (possibly equal) n, n [K], we have the following bound on the KL divergence between Pa(Q | a ) and Pa\j (Q | a ), the marginal distributions on Q | a implied by the laws Pa and Pa\j, KL Pa(Q | a ) || Pa\j(Q | a ) 17ϵ2λ2 a 1(j)SK S2 K,2 . (49) Thus, combining the decomposition of KL divergence in (48) and the bound on per-round KL divergence in (49), we have for any j a, KL(Pa || Pa\j) k=1 Ea (Nkj(T)) 17ϵ2λ2 k SK S2 K,2 Then combining with (47) we have that the regret is lower bounded, similarly to Chen and Wang (2018), as, Reg(T) 1 7|A| j a λa 1(j)Λj(T) j a λa 1(j) Ea\j(Λj(T)) + Tλmax KL(Pa || Pa\j) 7 ϵ 7SK,2|A| Ea\j(Λj(T)) + Tλmax KL(Pa || Pa\j) 7 ϵ 7SK,2|A| Ea (Λj(T)) + Tλmax KL(Pa || Pa\j ) 7 ϵ 7SK,2|A| a A λa SKT ϵ|A |Tλmax 7SK,2|A| max a A X KL(Pa || Pa\j ) 7 ϵSKKT 7SK,2(J K + 1) ϵKT v u u tmax a A X PK k=1 Ea Nkj (T) 17ϵ2λ2 a SK 2S2 K,2(J + K 1) 7 ϵCSKT 7(J K + 1) 2SK,2(J K + 1). Here inequality (a) introduces the bound on the Λj terms, and inequality (b) reorders the summations. Inequality (c) replaces the |A | individual summands with |A | copies of their largest and notes any sum of Λj(T) terms is necessarily bounded by SKT. Then, inequality (d) uses the concavity of the square root, while noting |A |/|A| = K/(J K + 1). The final inequality introduces a constant C > 0 such that K/SK,2 < C. Finally, we complete the proof by choosing ϵ = O(p JSK,2/ TSK) and using the assumption that K J/4. D.1 Proof of Lemma 17 In this section we provide a proof of Lemma 17, which helps to complete the proof of the regret lower bound, Proposition 9. Lemma 17 gives a bound on the KL divergence between the marginal distributions over a single click variable, under (particular) different attractiveness parameter vectors.
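The passage from total variation to KL divergence in the argument above relies on Pinsker's inequality, ||P − Q||_TV ≤ sqrt(KL(P || Q)/2), which can be checked numerically on a small pair of categorical distributions (the distributions here are arbitrary illustrations):

```python
import math

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))              # total variation
kl = sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)  # KL divergence

assert tv <= math.sqrt(kl / 2)   # Pinsker's inequality
```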
The proof makes use of the following result, originally given as Lemma 3 in Chen and Wang (2018), bounding the KL divergence between categorical random variables. Lemma 18 (Lemma 3 of Chen and Wang (2018)) Suppose P is a categorical distribution on [M]0 with parameters p0, . . . , p M, such that if X P, P(X = m) = pm for m [M]0. Suppose also that Q is an equivalently defined categorical distribution with parameters q0, . . . , q M, and we have δm = pm qm for m [M]0. Then KL(P || Q) ≤ Σm [M]0 δ2 m/qm. Proof of Lemma 17: We begin by deriving expressions for the parameters pk := Pa(Q = k | a ), 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am} , λk SK + ϵλk PK m=1 λm SK,2 I{a k = am} 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am} , k [K], and qk := Pa\j(Q = k | a ), 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am, m = n} , λk SK + ϵλk PK m=1 λm SK,2 I{a k = am, m = n} 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am, m = n} , k [K]. To apply Lemma 18 we consider the differences in these parameters. For the no-click event we have p0 q0 = ϵλnλn SK,2 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am} 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am, m = n} . For k [K] such that k = n we have λk SK + ϵλk PK m=1 λm SK,2 I{a k = am} 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am} λk SK + ϵλk PK m=1 λm SK,2 I{a k = am, m = n} 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am, m = n} , SKSK,2 ϵ2 λ2 kλnλn S2 K,2 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am} 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am, m = n} . Finally for the slot n in which item j is placed we have λn SK + ϵλnλn SK,2 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am} λn SK 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am, m = n} 2ϵ + ϵ2 PK l=1 PK m=1 λlλm SK,2 I{a l = am, m = n} ϵ λnλ2 n SKSK,2 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am} 2 + ϵ PK l=1 λl PK m=1 λm SK,2 I{a l = am, m = n} . It follows, subsequent to some further algebra, that we have q0 ϵ2λ2 nλ2 n 8S2 K,2 , (pk qk)2 qk 7ϵ2λkλnλ2 n 8S3 K,2/SK , and (pn qn )2 qn 9ϵ2λnλ2 n 8S2 K,2/SK .
As such, we have the following by Lemma 18, and the assumption that SK,2 1 KL(Pa(Q | a ) || Pa\j(Q | a )) ϵ2λ2 nλ2 n 8S2 K,2 + 7ϵ2λkλnλ2 n 8S3 K,2/SK I{k = n } + 9ϵ2λnλ2 n 8S2 K,2/SK 17ϵ2λ2 n SK S2 K,2 . Appendix E. Proofs of Technical Lemmas In this section we provide proofs of the remaining technical lemmas arising in the main text. E.1 Proof of Lemma 1 The probability of a no-click event given action al is given as p0(al) = P(Qt = 0|at = al) = 1 1 + PK k=1 λkαl k . It follows that nl = |El| 1, the number of clicks before the no-click event in epoch l, is a Geometric random variable with parameter p0(al). It follows that, conditioned on nl, each nl k count may be viewed as a Binomial random variable, nl k|nl Binom(nl, pk), pk = λkαk PK v=1 λvαv , where pk is the probability of a click on the item in position k, given that there is a click. The moment generating function of a Binomial random variable is well known, and we therefore have Eπ(eθnl k) = Enl Eπ(eθnl k|nl) = Enl ( pkeθ + 1 pk)nl . We then consider the result that if X is a Geometric random variable with parameter p, then for any τ such that τ(1 p) < 1 we have E(τ X) = p/(1 τ(1 p)). It follows that, for any θ < log(λkαl k+1 λkαl k ), we have Eπ(eθnl k) = p0 1 ( pkeθ/λk + 1 pk)(1 p0) = p0 1 pk(eθ 1) + 1 (1 p0) = p0 1 (1 p0) λkαl kp0(eθ 1) = 1 1 λkαl k(eθ 1), as stated. We recognise that each nl k | al is an independent Geometric random variable by considering the moment generating function of a Geometric random variable X with parameter p and mass function f X(k) = p(1 p)k for k {0, 1, 2, . . . }, which is given as M(t) = p 1 et(1 p) = 1 1 (1 p p )(et 1) , and is defined only for et(1 p) < 1.
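The geometric law of the per-slot click counts in Lemma 1 can be illustrated by simulation: within an epoch, each round produces a click on slot k with probability λkαk/(1 + Σv λvαv), or a no-click (ending the epoch) with probability 1/(1 + Σv λvαv), and the resulting count nk then has mean λkαk. A small sketch (the parameter values are illustrative):

```python
import random

random.seed(0)
lam = [1.0, 0.5]          # position biases
alpha = [0.8, 0.4]        # attractiveness parameters
weights = [l * a for l, a in zip(lam, alpha)]   # lambda_k * alpha_k = [0.8, 0.2]
denom = 1 + sum(weights)                        # = 2, so p0 = 1/2

def run_epoch():
    """Count clicks per slot until the no-click event ends the epoch."""
    counts = [0, 0]
    while True:
        u = random.random() * denom
        if u < 1:                  # no-click: epoch ends
            return counts
        k = 0 if u < 1 + weights[0] else 1
        counts[k] += 1

epochs = 200_000
totals = [0, 0]
for _ in range(epochs):
    c = run_epoch()
    totals[0] += c[0]
    totals[1] += c[1]

means = [t / epochs for t in totals]
# Lemma 1: n_k is Geometric-type with mean lambda_k * alpha_k
assert abs(means[0] - 0.8) < 0.02
assert abs(means[1] - 0.2) < 0.02
```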
E.2 Proof of Lemma 2 We will first demonstrate that the log-likelihood function has at most one stationary point on the parameter space (0, 1]J+K 1. We have that the log-likelihood of data N given N and parameters α, λ is log L(N; N, α, λ) = j=1 Nkj log(αjλk) (Nkj + Nkj) log(1 + αjλk). (50) Consider the partial derivatives of the log-likelihood, Nkj αjλk Nkj αj(1 + αjλk) , j [J], log L Nkj αjλk Nkj λk(1 + αjλk) , k [K] \ {1}. The solutions of the system of equations log L/ αj = 0, log L/ λk = 0, j [J], k [K] \ {1}, coincide with the solutions of, k=1 (Nkj αj Nkj) Y m =k (1 + αjλm) = 0, j [J] (51) j=1 (Nkj λk Nkj) Y i =j (1 + αiλk) = 0, k [K] \ {1}. (52) We will demonstrate that the log-likelihood has at most one stationary point on (0, 1]J+K 1 by showing that the system of equations given by (51) and (52) has at most one solution on (0, 1]J+K 1. For each j [J] consider the LHS of equation (51) with all λ2, . . . , λK fixed in (0, 1]. The result is an O(αK j ) polynomial, which can be decomposed into the summation of K O(αK j ) polynomials, gk(αj) = (Nkj αj Nkj) Q m =k(1 + αjλm), k [K]. We have roots of gk at αj = Nkj/ Nkj and αj = 1/λm, m = k, and notice that gk(αj) has negative leading order term when αj > 0. Notice that only one of the solutions is positive, and lies in (0, 1] iff 0 < Nkj Nkj. Since each polynomial gk has negative leading order term, it follows that PK k=1 gk(αj) = 0, i.e. equation (51), also has at most one solution in (0, 1], for fixed λ2, . . . , λK. An analogous argument applied to equation (52) tells us that there is at most one positive solution value in (0, 1] for each λk, k [K] \ {1}, coinciding with all other variables being positive. This tells us that the system of equations where the partial derivatives are set to zero has at most one solution in (0, 1]K+J 1 and as such the log-likelihood function has at most one stationary point on (0, 1]K+J 1.
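The stationary-point system (51)-(52) can also be solved numerically by alternating one-dimensional root-finding in each parameter, since each partial derivative is large and positive near zero and changes sign at most once on (0, 1]. A minimal sketch, as an alternative to the paper's EM iteration (the counts below are constructed so the exact solution is α = (0.7, 0.3) and λ = (1, 0.6)):

```python
def dl_dalpha(j, a, lam, N, Nbar):
    """Partial derivative of the log-likelihood (50) in alpha_j."""
    return sum(N[k][j] / a[j] - (N[k][j] + Nbar[k][j]) * lam[k] / (1 + a[j] * lam[k])
               for k in range(len(lam)))

def dl_dlambda(k, a, lam, N, Nbar):
    """Partial derivative of the log-likelihood (50) in lambda_k."""
    return sum(N[k][j] / lam[k] - (N[k][j] + Nbar[k][j]) * a[j] / (1 + a[j] * lam[k])
               for j in range(len(a)))

def solve(f, lo=1e-9, hi=1.0, iters=80):
    """Bisection: f is positive near 0 and changes sign at most once on (0, 1]."""
    if f(hi) > 0:              # stationary point at or beyond the boundary
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Play counts Nbar[k][j] and click counts N[k][j] (rows: slots, columns: items),
# chosen consistent with alpha = (0.7, 0.3) and lambda = (1, 0.6).
Nbar = [[50, 50], [50, 50]]
N = [[35, 15], [21, 9]]

a, lam = [0.5, 0.5], [1.0, 0.5]     # lambda_1 fixed at 1 for identifiability
for _ in range(30):
    for j in range(2):
        a[j] = solve(lambda x: dl_dalpha(j, a[:j] + [x] + a[j+1:], lam, N, Nbar))
    lam[1] = solve(lambda x: dl_dlambda(1, a, lam[:1] + [x], N, Nbar))

assert abs(a[0] - 0.7) < 1e-3 and abs(a[1] - 0.3) < 1e-3 and abs(lam[1] - 0.6) < 1e-3
```

This alternation exploits the same uniqueness property established above: with the other block held fixed, each one-dimensional slice has a single stationary point on (0, 1].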
From this we deduce that the log-likelihood is either monotonic on (0, 1]K+J 1 or unimodal. We have from Wu (1983) that the EM algorithm will converge to the unique MLE if the log-likelihood is unimodal and continuous, and thus that the EM algorithm, Algorithm 1, will converge to the MLEs. Appendix F. Independent Product Parameter Model A perhaps more straightforward approach to the unknown position bias variant of the MNL-LTR problem would be to exploit the closed-form distribution of the estimators γj,k(L) = Nkj/ Nkj of product parameters γj,k = αjλk, j [J], k [K], and build UCBs around these parameters independently. In this section we argue that this is not an appropriate strategy. Specifically, although such an approach can be shown to eventually learn the optimal action, and indeed have sublinear regret, the amount of exploration it requires is prohibitively large in comparison to our proposed strategy. F.1 An MNL-Bandit Approach to MNL-LTR with Unknown Position Biases We notice that by modelling the γjk parameters as independent, the unknown position bias variant of MNL-LTR can also be thought of as a constrained MNL-bandit problem. We can design such a formulation where the decision-maker is oblivious to the ranking aspect, but the constraints on their actions ensure that they implicitly assign a single item to a single slot. In such a setting there are JK objects, indexed Sj,k for j [J], k [K], where selecting object Sj,k represents placing item j in slot k. Object Sj,k is therefore associated with parameter γjk. In each round t [T] the decision-maker chooses a set St containing K of the objects. The constraints of the associated MNL-LTR problem are such that in each round exactly one of these objects must have index k for each k [K]. Introducing JK further indicator variables xjk(t) = I{Sj,k St} for each t [T], these constraints may be expressed as, j=1 xj,k(t) = 1 k [K], and k=1 xj,k(t) 1 j [J], t [T].
(53) The first constraint captures the rule that every slot is utilised, and the second constraint captures the rule that each item is used at most once. A valid set of objects St, satisfying (53), maps in a one-to-one fashion to a valid action at A for our MNL-LTR problem. Both Agrawal et al. (2017) and Agrawal et al. (2019) derive order-optimal guarantees for constrained MNL-bandit algorithms. However, the constraint considered in Agrawal et al. (2017) is only a simple cardinality constraint, i.e. an upper limit on the number of objects chosen in each round. Thus the guarantees on the Thompson Sampling approach proposed therein do not carry over to the problem with constraints (53). Agrawal et al. (2019), however, allow for a more general class of constraints, requiring that constraints may be expressed in the form Ax b, where A is a totally unimodular (TU) matrix, and b is an integer-valued vector. The JK (J + 2K) matrix A implied by constraints (53) can be shown to be TU. Thus the regret guarantees on the UCB algorithm proposed in Agrawal et al. (2019) apply directly to the constrained MNL-bandit instance associated with an MNL-LTR instance. Algorithm 4 describes a modification of such an algorithm to incorporate our sharper concentration results. Then, the corollary below gives the corresponding order result on regret enjoyed by this algorithm. The proof of this result is omitted as it follows directly from the observation that the coefficient matrix implied by constraints (53) is TU, and substituting the concentration results of Lemma 4 into the proof of Theorem 1 of Agrawal et al. (2019). Corollary 19 There exist constants C1, C2 > 0 such that for any MNL-LTR problem where the item attractiveness parameters satisfy αj α0 = 1, j [J] and the position biases satisfy λ1 = 1, λk λ1 k 2, the regret in T rounds of Algorithm 4 satisfies Reg(T) C1 p JKT log(JKT 2) + C2JK log2(JKT).
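The optimisation over indicator variables subject to (53) is an assignment problem: pick K distinct items, one per slot, maximising the sum of the selected γ values. For small J and K it can be solved by brute force over slot-assignments (the γ matrix below is an arbitrary illustration; in practice the entries would be the UCB indices):

```python
from itertools import permutations

# gamma[j][k]: value of placing item j (of J = 4) in slot k (of K = 2)
gamma = [[0.9, 0.4],
         [0.8, 0.7],
         [0.3, 0.2],
         [0.5, 0.6]]
J, K = 4, 2

# Each permutation of K distinct items corresponds to one feasible x under (53).
best = max(permutations(range(J), K),
           key=lambda items: sum(gamma[j][k] for k, j in enumerate(items)))

assert best == (0, 1)   # item 0 in the first slot, item 1 in the second
assert abs(sum(gamma[j][k] for k, j in enumerate(best)) - 1.6) < 1e-12
```

For larger J and K the brute force is infeasible, but the total unimodularity of the constraint matrix noted above means the linear relaxation has integral optima, so the problem can be solved in polynomial time as a linear assignment problem (e.g. by the Hungarian algorithm).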
Although this implies a guarantee on the performance of Algorithm 4 which is near optimal in its dependence on T, we see that the O(log(T)) term has a worse dependence on JK than, for instance, the Epoch-UCB algorithm for the known position bias case. This hints at the critical issue with deploying Algorithm 4: its exploration cost scales linearly with the product of the number of items and the number of slots. In the experiments in Section 6, in particular problem (c), we see that even in simple problems, where the optimal action can be identified quickly, Algorithm 4 is slow to converge. Algorithm 4 Epoch-UCB algorithm for MNL-bandit model of MNL-LTR Initialise with l = 0, and Q0 = 0. Iteratively perform the following for t [T], If Qt 1 = 0 Set l l + 1 Calculate UCBs. For j [J], k [K] compute, γUCB j,k,l = γj,k(l 1) + 4 min(1, 2 γj,k(l 1)) log(JKl2/2) Pl 1 s=1 I{Sj,k Sl} + 4 log(JKl2/2) Pl 1 s=1 I{Sj,k Sl} . Solve the optimisation problem for object indicator variables xl = argmax xj,k,j [J],k [K] k=1 xj,kγUCB j,k,l j=1 xj,k(t) = 1 k [K], k=1 xj,k(t) 1 j [J] Select an action at A associated with xl {0, 1}J K, and observe click variable Qt, otherwise, set action at = at 1, and observe click variable Qt. References Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Thompson sampling for the MNL-bandit. In Conference on Learning Theory, pages 76-78. PMLR, 2017. Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 2019. Shigeki Aida, Takao Masuda, and Ichiro Shigekawa. Logarithmic Sobolev inequalities and exponential integrability. Journal of Functional Analysis, 126(1):83-101, 1994. Viktor Bengs and Eyke Hüllermeier. Preselection bandits. In International Conference on Machine Learning, pages 778-787. PMLR, 2020. Fernando Bernstein, Sajad Modaresi, and Denis Sauré.
Exploration optimization for dynamic assortment personalization under linear preferences. Available at SSRN, 2022. Sergey G Bobkov and Michel Ledoux. On modified logarithmic Sobolev inequalities for Bernoulli and Poisson measures. Journal of Functional Analysis, 156(2):347-365, 1998. Gökhan Çapan, İlker Gündoğdu, Ali Caner Türkmen, and Ali Taylan Cemgil. Dirichlet-Luce choice model for learning from interactions. User Modeling and User-Adapted Interaction, pages 1-38, 2022. Charalambos A Charalambides. Enumerative Combinatorics. CRC Press, 2002. Xi Chen and Yining Wang. A note on a tight lower bound for capacitated MNL-bandit assortment selection models. Operations Research Letters, 46(5):534-537, 2018. Xi Chen, Yuanzhi Li, and Jieming Mao. A nearly instance optimal algorithm for top-k ranking under the multinomial logit model. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2504-2522. SIAM, 2018. Xi Chen, Yining Wang, and Yuan Zhou. Dynamic assortment optimization with changing contextual information. The Journal of Machine Learning Research, 21(1):8918-8961, 2020. Xi Chen, Yining Wang, and Yuan Zhou. Optimal policy for dynamic assortment planning under multinomial logit models. Mathematics of Operations Research, 46(4):1639-1657, 2021. Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services, 7(3):1-115, 2015. Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012. Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87-94. ACM, 2008. Edward Brian Davies and Barry Simon. Ultracontractivity and the heat kernel for Schrödinger operators and Dirichlet Laplacians. Journal of Functional Analysis, 59(2):335-395, 1984.
Victor H. de la Peña. A general class of exponential inequalities for martingales and ratios. The Annals of Probability, 27(1):537-564, 1999. Kefan Dong, Yingkai Li, Qin Zhang, and Yuan Zhou. Multinomial logit bandit with low switching cost. In International Conference on Machine Learning, pages 2607-2615. PMLR, 2020. Boli Fang. Fixed-budget pure exploration in multinomial logit bandits. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 2951-2957, 2022. Camille-Sovanneary Gauthier, Romaric Gaudel, and Elisa Fromont. Position-based multiple-play bandits with Thompson sampling. In IDA '21, 2021. Ruocheng Guo, Xiaoting Zhao, Adam Henderson, Liangjie Hong, and Huan Liu. Debiasing grid-based product search in e-commerce. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2852-2860, 2020. Naieme Hazrati and Francesco Ricci. Recommender systems' effect on the evolution of users' choices distribution. Information Processing & Management, 59(1):102766, 2022. Aldéric Joulin and Nicolas Privault. Functional inequalities for discrete gradients and application to the geometric distribution. ESAIM: Probability and Statistics, 8:87-101, 2004. Junpei Komiyama, Junya Honda, and Akiko Takeda. Position-based multiple-play bandit problem with unknown position bias. In Advances in Neural Information Processing Systems, pages 4998-5008, 2017. Branislav Kveton, Csaba Szepesvári, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767-776, 2015. Branislav Kveton, Csaba Szepesvári, Sharan Vaswani, Zheng Wen, Tor Lattimore, and Mohammad Ghavamzadeh. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pages 3601-3610, 2019. Paul Lagrée, Claire Vernade, and Olivier Cappé.
Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597-1605, 2016. Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985. Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvári. TopRank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3945-3954, 2018. Shuai Li, Tor Lattimore, and Csaba Szepesvári. Online learning to rank with features. In International Conference on Machine Learning, pages 3856-3865, 2019. R Duncan Luce. Individual Choice Behavior. Wiley, 1959. Adil El Mesaoudi-Paul, Viktor Bengs, and Eyke Hüllermeier. Online preselection with context information under the Plackett-Luce model. arXiv preprint arXiv:2002.04275, 2020. Min-hwan Oh and Garud Iyengar. Thompson sampling for multinomial logit contextual bandits. In Advances in Neural Information Processing Systems, pages 3145-3155, 2019. Harrie Oosterhuis and Maarten de Rijke. Optimizing ranking models in an online setting. In European Conference on Information Retrieval, pages 382-396. Springer, 2019. Yannik Peeters and Arnoud V den Boer. Stochastic approximation for uncapacitated assortment optimization under the multinomial logit model. Naval Research Logistics (NRL), 69(7):927-938, 2022. Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193-202, 1975. Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521-530. ACM, 2007. Paat Rusmevichientong, Zuo-Jun Max Shen, and David B Shmoys. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666-1680, 2010. Aadirupa Saha and Aditya Gopalan.
Combinatorial bandits with relative feedback. In Advances in Neural Information Processing Systems, pages 983-993, 2019. Denis Sauré and Assaf Zeevi. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387-404, 2013. Yining Wang, Xi Chen, and Yuan Zhou. Near-optimal policies for dynamic multinomial logit assortment selection models. In Advances in Neural Information Processing Systems, pages 3101-3110, 2018. C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95-103, 1983. Xiaohui Xie, Yiqun Liu, Xiaochuan Wang, Meng Wang, Zhijing Wu, Yingying Wu, Min Zhang, and Shaoping Ma. Investigating examination behavior of image search users. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 275-284, 2017. Xiaohui Xie, Jiaxin Mao, Yiqun Liu, Maarten de Rijke, Yunqiu Shao, Zixin Ye, Min Zhang, and Shaoping Ma. Grid-based evaluation metrics for web image search. In The World Wide Web Conference, pages 2103-2114, 2019. Jiaqi Yang. Fully gap-dependent bounds for multinomial logit bandit. In International Conference on Artificial Intelligence and Statistics, pages 199-207. PMLR, 2021. Hongbin Zhang, Yu Yang, Feng Wu, and Qixin Zhang. MNL-bandits under inventory and limited switches constraints. arXiv preprint arXiv:2204.10787, 2022. Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvári, and Zheng Wen. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 4199-4208. JMLR.org, 2017.