# Parametric Graph for Unimodal Ranking Bandit

Camille-Sovanneary Gauthier *1 2, Romaric Gaudel *3, Elisa Fromont 4 5 2, Boammani Aser Lompo 6

*Equal contribution. 1Louis Vuitton, F-75001 Paris, France. 2IRISA UMR 6074 / Inria RBA, F-35000 Rennes, France. 3Univ Rennes, Ensai, CNRS, CREST - UMR 9194, F-35000 Rennes, France. 4Univ. Rennes 1, F-35000 Rennes, France. 5Institut Universitaire de France, M.E.S.R.I., F-75231 Paris, France. 6ENS Rennes, F-35000 Rennes, France. Correspondence to: Camille-Sovanneary Gauthier.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

We tackle the online ranking problem of assigning L items to K positions on a web page in order to maximize the number of user clicks. We propose an original algorithm, easy to implement and with strong theoretical guarantees, to tackle this problem in the Position-Based Model (PBM) setting, well suited for applications where items are displayed on a grid. Besides learning to rank, our algorithm, GRAB (for parametric Graph for unimodal RAnking Bandit), also learns the parameter of a compact graph over permutations of K items among L. The logarithmic regret bound of this algorithm is a direct consequence of the unimodality property of the bandit setting with respect to the learned graph. Experiments against state-of-the-art learning algorithms which also tackle the PBM setting show that our method is more efficient while giving regret performance on par with the best known algorithms on simulated and real-life datasets.

1. Introduction

Online recommendation systems are used to choose relevant items, such as songs, ads or movies, on a website. At each call, they select K items among L potential ones, with K ≤ L. User feedback is collected for each displayed item, reflecting how relevant these choices are: listening time, clicks, ratings, etc. This feedback is available only for the items which were presented to the user. This corresponds to an instance of the multi-armed bandit problem with semi-bandit feedback (Gai et al., 2012; Chen et al., 2013). Another problem, related to ranking, is to display the K chosen items at the right positions to maximize the user attention. Typical examples of such displays are (i) a list of news, visible one by one by scrolling; (ii) a list of products, arranged by rows; or (iii) advertisements spread in a web page. Numerous approaches have been proposed to jointly learn how to choose the best positions for the corresponding best items (Radlinski et al., 2008; Combes et al., 2015; Li et al., 2019), referred to as multiple-play bandits or online learning to rank (OLR). To take into account the user behaviour when facing such a list of items, several models exist (Richardson et al., 2007; Craswell et al., 2008) and have been transposed to the bandit framework (Kveton et al., 2015a; Komiyama et al., 2017), such as the Position-Based Model (PBM) (Richardson et al., 2007). PBM makes it possible to take into account displays where the best position is a priori unknown. This is typically the case when items are displayed on a grid and not in an ordered list. PBM assumes that the click probability on an item i at position k results from the product of two independent factors: the item's relevance and its position's visibility. Items displayed at other positions do not impact the probability of considering item i at position k.
According to PBM, a user may give more than one feedback: she may click on all items relevant to her, e.g. when looking for products on commercial websites. PBM is also particularly interesting when the display is dynamic, as often on modern web pages, and may depend on the reading direction of the user (which varies from one country to another) and on the ever-changing layout of the page.

In this paper, we tackle an online learning to rank bandit setting, which mainly covers the PBM click model, from a unimodal bandit point of view (Combes & Proutière, 2014). First, we expose a family of parametric graphs of degree L − 1 over permutations, such that the PBM setting is unimodal w.r.t. one graph in this family. While the corresponding graph is unknown to the learner, graphs of this family enable an efficient exploration strategy of the set of potential recommendations. Secondly, we introduce a new bandit algorithm, GRAB, which learns online the appropriate graph in this family and bases its recommendations on the learned graph. From an application point of view, this algorithm has several interesting features: it is simple to implement and efficient in terms of computation time; it handles the PBM bandit setting without any knowledge on the impact of positions (contrary to many competitors); and it empirically exhibits a regret on par with other theoretically proven algorithms on both artificial and real datasets. In particular, we prove a O(L/∆ log T) regret upper-bound for GRAB. The corresponding proof extends OSUB's proof (Combes & Proutière, 2014) both (i) to the context of a graph learned online, and (ii) to the combinatorial semi-bandit setting.

This paper is organized as follows: Section 2 presents the related work and Section 3 defines our target setting. We introduce GRAB and the hypotheses needed in Section 4. Theoretical guarantees and empirical performance are presented respectively in Sections 5 and 6. We conclude in Section 7.

2. Related Work

A comparison of the assumptions and the regret upper-bounds of the related algorithms is shown in Table 1.

Table 1. Settings and upper-bound on cumulative regret for state-of-the-art algorithms. The main notations for the assumptions are given in Section 3. N_π*(a*) is the set of recommendations in the neighborhood of the best recommendation. K_max is the maximum number of differences between two arms; see Theorem 2 for a specific definition.

ALGORITHM | HANDLED SETTINGS | REGRET, ASSUMING θ_1 ≥ θ_2 ≥ · · · ≥ θ_L
GRAB (our algorithm) | PBM | O( L log T / min_{a ∈ N_π*(a*)} (µ* − µ_a) )
CombUCB1 (Kveton et al., 2015b) | PBM | O( LK² log T / min_{a ∈ P^L_K : µ_a ≠ µ*} (µ* − µ_a) )
PBM-PIE (Lagrée et al., 2016) | PBM with κ known | O( (L − K) log T / min_{i ∈ {K+1,...,L}} (µ* − µ_{a*[K:=i]}) )
PMED-Hinge (Komiyama et al., 2017) | PBM with κ_1 ≥ · · · ≥ κ_K | O( c*(θ, κ) log T )
TopRank (Lattimore et al., 2018) | PBM with κ_1 ≥ · · · ≥ κ_K, CM, . . . | O( LK log T / min_{(j,i) ∈ [L]×[K] : j > i} (θ_i − θ_j) )
OSUB (Combes & Proutière, 2014) | unimodal | O( γ log T / min_{a ∈ N_G(a*)} (µ* − µ_a) )
KL-CombUCB (Theorem 2) | combinatorial | O( |A| K²_max log T / min_{a ∈ A : µ_a ≠ µ*} (µ* − µ_a) )

The Position-Based Model (PBM) (Richardson et al., 2007; Craswell et al., 2008) relies on two vectors of parameters: θ ∈ [0, 1]^L and κ ∈ [0, 1]^K, where θ_i is the probability for the user to click on item i when she observes this item, and κ_k is the probability for the user to observe position k. Several bandit algorithms are designed to handle PBM (Komiyama et al., 2015; Lagrée et al., 2016; Komiyama et al., 2017). However, each of them assumes some knowledge about the ranking of positions.
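To make these two parameter vectors concrete, here is a minimal sketch of a PBM click simulator in which a click on item i at position k occurs with probability θ_i κ_k, independently of the other displayed items; the variable names and values are illustrative and not taken from the authors' implementation.

```python
# Minimal PBM click simulator: item i displayed at position k is clicked with
# probability theta[i] * kappa[k], independently of the other displayed items.
# Names and values are illustrative, not taken from the paper's code.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.9, 0.7, 0.5, 0.3, 0.2, 0.1])   # attractiveness of the L = 6 items
kappa = np.array([1.0, 0.6, 0.3])                   # observation probability of the K = 3 positions

def pbm_clicks(recommendation, theta, kappa, rng):
    """recommendation[k] is the item shown at position k; returns the clicks vector c."""
    probs = theta[np.asarray(recommendation)] * kappa
    return (rng.random(len(kappa)) < probs).astype(int)

print(pbm_clicks([2, 0, 5], theta, kappa, rng))     # item 2 at position 1, item 0 at position 2, item 5 at position 3
```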
(Komiyama et al., 2015) and (Lagrée et al., 2016) assume κ known beforehand. Thanks to this very strong assumption (which we do not make in this paper), the theoretical results from (Lagrée et al., 2016) depend on the L − K worst items and their regret is expressed as O((L − K)/∆ log T). (Komiyama et al., 2017) and (Gauthier et al., 2021) propose respectively PMED and PB-MHB, the only approaches learning both θ and κ while recommending. However, PMED still requires the κ_k values to be organized in decreasing order. It derives a bound on the regret in O(c*(θ, κ) log T), where c*(θ, κ) only depends on θ and κ and is asymptotically optimal in this setting. Unfortunately, to the best of our knowledge, there is no known closed form for c*(θ, κ), which hinders the comparison to other algorithms, including ours. (Gauthier et al., 2021) has shown very good performance on PBM, but PB-MHB, based on Thompson sampling, does not have any theoretical guarantee.

Other learning to rank algorithms such as TopRank (Lattimore et al., 2018) and BubbleRank (Li et al., 2019) cover many click models (including PBM). They exhibit a regret upper-bound for T iterations of O(LK/∆ log T), where ∆ depends on the attraction probabilities θ of the items. They also assume that the best recommendation is the one displaying the items from the most attractive to the K-th most attractive, which implies that the first position is the most-observed one, the second position is the second most-observed one, and so on. The hypotheses made by PMED and TopRank are thus often shared by the approaches handling the PBM setting. In this paper, in contrast, we tackle a full PBM setting where there is no a priori hypothesis on the ordering of positions. Our algorithm, GRAB, suffers a O(L/∆ log T) regret that we conjecture to be on par with the best theoretical results, provided by PMED (O(c*(θ, κ) log T)).

Combinatorial algorithms (Gai et al., 2012; Chen et al., 2013) can also handle the PBM bandit setting. Typically, CombUCB1 (Kveton et al., 2015b) applied to PBM leads to an algorithm which suffers a O(LK²/∆ log T) regret (see the appendix for more details), which is higher than the upper-bound on the regret of GRAB by a factor K². Note that the proof of the upper-bound on the regret of GRAB is based on the same reduction of the PBM bandit to a combinatorial semi-bandit, but with two additional properties derived from the design of GRAB.

Figure 1. Assumption 1 in practice. To distinguish between items and positions, the 4 items are denoted A, B, C, and D. On the left: parameters and considered recommendation a. We consider a matrix of click probabilities ρ which corresponds to a PBM click model,

        pos. 1  pos. 2  pos. 3
   A     1.00    0.90    0.80
   B     0.90    0.81    0.72
   C     0.80    0.72    0.64
   D     0.70    0.63    0.56

and a sub-optimal recommendation a = (B, C, A). Since ρ_{a_1,1} = 0.90, ρ_{a_2,2} = 0.72 < ρ_{a_3,3} = 0.80, the corresponding set Π_ρ(a) of appropriate rankings of positions is composed of a unique permutation π = (1, 3, 2), with ρ_{a_{π_1},π_1} = 0.90 ≥ ρ_{a_{π_2},π_2} = 0.80 ≥ ρ_{a_{π_3},π_3} = 0.72. Note that µ_{a∘(3,1)} = 1.00 + 0.72 + 0.72 = 2.44 > µ_a = 0.90 + 0.72 + 0.80 = 2.42. On the right: corresponding neighborhoods of a, namely a∘(2,1) = (C, B, A), a∘(2,3) = (B, A, C), a∘(3,1) = (A, C, B), a[1 := D] = (D, C, A), a[2 := D] = (B, D, A), and a[3 := D] = (B, C, D). Solid lines identify the neighborhood N_π(a) used by GRAB, and both solid and dashed lines correspond to the neighborhood N_G(a) used by S-GRAB. Note that there is a recommendation better than a in both neighborhoods: a∘(3,1) = (A, C, B).
Finally, GRAB extends the unimodal bandit setting (Combes & Proutière, 2014), which assumes the existence of a known graph G carrying a partial order on the set of bandit arms denoted A. The unimodal bandit algorithms are aware of G, but ignore the partial order induced by the edges of G. However, they rely on G to efficiently browse the arms up to the best one. Typically, the algorithm OSUB (Combes & Proutière, 2014) selects at each iteration t an arm a(t) in the neighborhood N_G(ā(t)), w.r.t. G, of the current best arm ā(t) (a.k.a. the leader). By restricting the exploration to this neighborhood, the regret suffered by OSUB scales as O(γ/∆ log T), where γ is the maximum degree of G, to be compared with O(|A|/∆ log T) if the arms were handled independently.

3. Learning to Rank in a Semi-Bandit Setting

We consider the following online learning to rank (OLR) problem with click feedback, which encompasses the PBM setting. For any integer n, let [n] denote the set {1, . . . , n}. An instance of our OLR problem is a tuple (L, K, (ρ_{i,k})_{(i,k) ∈ [L]×[K]}), where L is the number of items to be displayed, K ≤ L is the number of positions to display the items, and for any item i and position k, ρ_{i,k} is the probability for a user to click on item i when displayed at position k, independently of the items displayed at other positions. Under the PBM click model, there exist two vectors θ ∈ ℝ^L and κ ∈ ℝ^K such that ρ_{i,k} = θ_i κ_k (i.e. ρ is of rank 1).

A recommendation algorithm is only aware of L and K and has to deliver T consecutive recommendations. At each iteration t ∈ [T], the algorithm recommends a permutation a(t) = (a_1(t), . . . , a_K(t)) of K distinct items among L, where a_k(t) is the item displayed at position k. We denote A = P^L_K the set of such permutations, which corresponds to the set of arms of the bandit setting. Throughout the paper, we use the terms permutation and recommendation interchangeably to denote an element of P^L_K. Thereafter, the algorithm observes the clicks vector c(t) ∈ {0, 1}^K, where for any position k, c_k(t) equals 1 if the user clicks on item a_k(t) displayed at position k, and 0 otherwise. Note that the recommendation at time t is only based on previous recommendations and observations. While the individual clicks are observed, the reward of the algorithm is their total number r(t) := Σ_{k=1}^K c_k(t). Let µ_a denote the expectation of r(t) when the recommendation is a(t) = a, and µ* := max_{a ∈ P^L_K} µ_a the highest expected reward. The aim of the algorithm is to minimize the cumulative (pseudo-)regret

$$R(T) = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r(t)\right], \qquad (1)$$

where the expectation is taken w.r.t. the recommendations of the algorithm and the clicks. Note that for any recommendation a ∈ P^L_K, µ_a = Σ_{k=1}^K ρ_{a_k,k}.

3.1. Modeling Assumption

Apart from the independence of the clicks, the proposed algorithm assumes a relaxed version of unimodality. Here we present this assumption and state its relation with PBM. We first define the set of appropriate rankings of positions: for each recommendation a ∈ P^L_K, we denote Π_ρ(a) ⊆ P^K_K the set of permutations π of the K positions such that ρ_{a_{π_1},π_1} ≥ ρ_{a_{π_2},π_2} ≥ · · · ≥ ρ_{a_{π_K},π_K}. Therefore, an appropriate ranking of positions orders the positions from the one with the highest probability of click to the one with the lowest probability of click. See Figure 1 for an example.
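As a quick illustration of these definitions, the following sketch recomputes the quantities of Figure 1, whose click matrix ρ matches a PBM with, for instance, θ = (1.0, 0.9, 0.8, 0.7) and κ = (1.0, 0.9, 0.8); the variable names are ours, not taken from the paper's code.

```python
# Recomputing the Figure 1 example: rho[i, k] is the click probability of item i
# (rows A, B, C, D) at position k; it matches a PBM with theta = (1.0, 0.9, 0.8, 0.7)
# and kappa = (1.0, 0.9, 0.8). Names are illustrative, not the authors' code.
import numpy as np

theta = np.array([1.0, 0.9, 0.8, 0.7])
kappa = np.array([1.0, 0.9, 0.8])
rho = np.outer(theta, kappa)                          # rho[i, k] = theta[i] * kappa[k]

A, B, C, D = 0, 1, 2, 3
a = (B, C, A)                                         # the sub-optimal recommendation of Figure 1
K = len(a)

mu_a = sum(rho[a[k], k] for k in range(K))            # expected reward of a
pi = np.argsort([-rho[a[k], k] for k in range(K)])    # positions sorted by decreasing click probability
print(round(mu_a, 2))                                 # 2.42 = 0.90 + 0.72 + 0.80
print([int(p) + 1 for p in pi])                       # [1, 3, 2]: the unique appropriate ranking of Figure 1

a_swapped = (A, C, B)                                 # a o (3, 1): items at positions 3 and 1 swapped
print(round(sum(rho[a_swapped[k], k] for k in range(K)), 2))   # 2.44 > mu_a, as in Figure 1
```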
With this notation, our assumption is the following:

Assumption 1 (Relaxed Unimodality). For any recommendation a ∈ P^L_K and any ranking of positions π ∈ Π_ρ(a), if µ_a ≠ µ*, then either there exists k ∈ [K−1] such that

$$\mu_a < \mu_{a \circ (\pi_k, \pi_{k+1})}, \qquad (2)$$

or there exists i ∈ [L] \ a([K]) such that

$$\mu_a < \mu_{a[\pi_K := i]}, \qquad (3)$$

where a ∘ (π_k, π_{k+1}) is the permutation for which the items at positions π_k and π_{k+1} are swapped, a[π_K := i] is the permutation which is the same as a for any position k ≠ π_K and such that a[π_K := i]_{π_K} = i, and a([K]) is the set of items recommended by a, namely a([K]) := {a_1, . . . , a_K}.

Assumption 1 relates to a natural property of standard click models: (i) for the optimal recommendation, the position with the k-th highest probability to be observed is the one displaying the k-th most attractive item; (ii) for a sub-optimal recommendation, swapping two consecutive items, given this order, leads to an increase of the expected reward. However, Assumption 1 considers the order based on the click probabilities ρ_{a_k,k}, not on the observation probabilities κ_k. Figure 1 gives an example of both orders and of the neighborhood associated with the ranking π defined from the order on click probabilities ρ_{a_k,k}. While the existence of a better recommendation in the neighborhood defined given this order is less intuitive, it remains true for state-of-the-art click models (PBM, the cascading model, and the dependent click model) and paves the way to an algorithm based on observed random variables. Note also that while there exists a better recommendation both in the neighborhood based on the order on observation probabilities and in the neighborhood based on the order on click probabilities, this is not true for a neighborhood based on an arbitrary order (as soon as K ≥ 4).

Lemma 1 below states that Assumption 1 is weaker than the PBM assumption. The proof of this lemma is deferred to the appendix.

Lemma 1. Let (L, K, (θ_i κ_k)_{(i,k) ∈ [L]×[K]}) be an online learning to rank problem with users following PBM, with positive probabilities of looking at any given position. Then Assumption 1 is fulfilled.

3.2. Relation with Unimodality

Assumption 1 relates to the unimodality of the set of expected rewards (µ_a)_{a ∈ P^L_K}. Let us first recall the definition of unimodality from (Combes & Proutière, 2014) and then express this relation.

Definition 1 (Unimodality). Let A be a set of arms, and (ν_a)_{a ∈ A} a set of reward distributions of respective expectations (µ_a)_{a ∈ A}. Let G = (V, E) be an undirected graph with vertices V = A and edges E ⊆ V². The set of expected rewards (µ_a)_{a ∈ A} is unimodal w.r.t. G if and only if:
1. the set of expected rewards admits a unique best arm: argmax_{a ∈ A} µ_a = {a*};
2. and from any arm a ≠ a*, there exists a path (a_0, a_1, . . . , a_n) in G such that a_0 = a, a_n = a*, and for any i ∈ [n], µ_{a_i} > µ_{a_{i−1}}.

Note that the second property of unimodal sets of expected rewards is equivalent to the property stating that from any sub-optimal arm a, there exists an arm a′ ∈ N_G(a) such that µ_{a′} > µ_a, where N_G(a) is the neighborhood of a in G.

Let us assume that there exists a unique recommendation a* with maximum expected reward, and denote F = (π^a)_{a ∈ P^L_K} a set of rankings of positions such that for any recommendation a, π^a ∈ Π_ρ(a). Then, by denoting G_F = (V, E_F) the directed graph with vertices V = P^L_K and edges

$$E_F := \left\{\left(a,\, a \circ (\pi^a_k, \pi^a_{k+1})\right) : k \in [K-1]\right\} \cup \left\{\left(a,\, a[\pi^a_K := i]\right) : i \in [L] \setminus a([K])\right\},$$

(µ_a)_{a ∈ P^L_K} is unimodal¹ with respect to G_F. Note that this graph is unknown to the algorithm as it builds upon the unknown mapping F. However, this mapping may be learned online, paving the way to an OSUB-like algorithm to explore the space of recommendations.

¹While the definition of unimodality in (Combes & Proutière, 2014) involves an undirected graph, OSUB only requires a directed graph and the existence of a strictly increasing path from any sub-optimal arm to the optimal one.
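To fix ideas, here is a small sketch of the out-neighborhood of a recommendation in G_F, i.e. the set N_π(a) that GRAB (Section 4 below) restricts its exploration to: adjacent transpositions in the π order, plus replacement of the item at the last-ranked position. The function name is ours.

```python
# Edges leaving recommendation a in the graph G_F: swap the items sitting at
# positions pi_k and pi_{k+1} (consecutive in the ranking pi), or replace the
# item at position pi_K by a non-displayed item. Names are illustrative.
from itertools import permutations

def neighborhood(a, pi, L):
    """N_pi(a) for a recommendation a (tuple of K distinct items among range(L))
    and a ranking of positions pi (tuple of K position indices, 0-based)."""
    K = len(a)
    neighbors = []
    for k in range(K - 1):                      # a o (pi_k, pi_{k+1})
        b = list(a)
        b[pi[k]], b[pi[k + 1]] = b[pi[k + 1]], b[pi[k]]
        neighbors.append(tuple(b))
    for i in set(range(L)) - set(a):            # a[pi_K := i]
        b = list(a)
        b[pi[K - 1]] = i
        neighbors.append(tuple(b))
    return neighbors

# Figure 1 example: a = (B, C, A), pi = (1, 3, 2); with 0-based items and positions:
A, B, C, D = 0, 1, 2, 3
print(neighborhood((B, C, A), (0, 2, 1), L=4))
# [(0, 2, 1), (1, 0, 2), (1, 3, 0)], i.e. (A, C, B), (B, A, C), (B, D, A): the solid-line neighbors of Figure 1

# Each recommendation has exactly (K - 1) + (L - K) = L - 1 neighbors:
assert all(len(neighborhood(a, (0, 2, 1), 4)) == 3 for a in permutations(range(4), 3))
```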
4. GRAB Algorithm

Our algorithm, GRAB, takes inspiration from the unimodal bandit algorithm OSUB (Combes & Proutière, 2014) by selecting at each iteration t an arm a(t) in the neighborhood of the current best arm (a.k.a. the leader). While in OSUB the neighborhood is known beforehand, here we learn it online. GRAB is described in Algorithm 1. This algorithm uses the following notations.

At each iteration t, we denote ρ̂_{i,k}(t) := (1/T_{i,k}(t)) Σ_{s=1}^{t−1} 1{a_k(s) = i} c_k(s) the average number of clicks obtained at position k when displaying item i at this position, where T_{i,k}(t) := Σ_{s=1}^{t−1} 1{a_k(s) = i} is the number of times item i has been displayed at position k; ρ̂_{i,k}(t) := 0 when T_{i,k}(t) = 0. We also denote ā(t) the leader, meaning the recommendation with the best pseudo average reward µ̃_a(t) := Σ_{k=1}^K ρ̂_{a_k,k}(t), and we note T̄_a(t) := Σ_{s=1}^{t−1} 1{ā(s) = a} the number of times the leader is a for iterations 1 to t − 1. Finally, the statistics ρ̂_{i,k}(t) are paired with their respective indices b_{i,k}(t) := f(ρ̂_{i,k}(t), T_{i,k}(t), T̄_{ā(t)}(t) + 1), where f(ρ̂, s, t) stands for sup{p ∈ [ρ̂, 1] : s · kl(ρ̂, p) ≤ log(t) + 3 log(log(t))}, with kl(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)) the Kullback-Leibler divergence from a Bernoulli distribution of mean p to a Bernoulli distribution of mean q; f(ρ̂, s, t) := 1 when ρ̂ = 1, s = 0, or t = 0.

At each iteration t, GRAB first identifies the leader ā(t), and then recommends either ā(t) every L-th iteration, or the best permutation in the inferred neighborhood, given the sum of indices Σ_{k=1}^K b_{a_k,k}(t) (see Figure 1 for an example of a neighborhood). Each time an argmax is computed, ties are broken at random. To finish the presentation of GRAB, let us now discuss its initialisation and its time complexity.

Remark 1 (Initialisation). The initialisation of the algorithm is handled through the default value of the indices b_{i,k}, namely 1. This value ensures that any permutation is recommended at least once, as soon as it belongs to the neighborhood of an arm which is often the leader. If a permutation is not in such a neighborhood, the theoretical analysis in Section 5 proves that this permutation is sub-optimal, hence it does not matter whether this permutation is explored at least once or not.

Algorithm 1 GRAB: parametric Graph for unimodal RAnking Bandit
Input: number of items L, number of positions K
1: for t = 1, 2, . . . do
2:   ā(t) ← argmax_{a ∈ P^L_K} Σ_{k=1}^K ρ̂_{a_k,k}(t)
3:   find π̂(t) such that ρ̂_{ā_{π̂_1(t)}(t), π̂_1(t)}(t) ≥ ρ̂_{ā_{π̂_2(t)}(t), π̂_2(t)}(t) ≥ · · · ≥ ρ̂_{ā_{π̂_K(t)}(t), π̂_K(t)}(t)
4:   recommend a(t) = ā(t) if T̄_{ā(t)}(t) ≡ 0 (mod L), and a(t) = argmax_{a ∈ {ā(t)} ∪ N_{π̂(t)}(ā(t))} Σ_{k=1}^K b_{a_k,k}(t) otherwise,
     where N_π(a) = {a ∘ (π_k, π_{k+1}) : k ∈ [K−1]} ∪ {a[π_K := i] : i ∈ [L] \ a([K])}
5:   observe the clicks vector c(t)
6: end for
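As an illustration of Algorithm 1, here is a sketch of one GRAB iteration, assuming the statistics ρ̂, T_{i,k} and the per-leader counts are maintained by the caller. The leader computation of Line 2 uses a linear sum assignment solver (as discussed in Remark 2 right after), the "every L-th iteration" scheduling of Line 4 is encoded as T̄_{ā(t)}(t) ≡ 0 (mod L), and all names are ours rather than the authors' code.

```python
# One iteration of GRAB (sketch of Algorithm 1). The caller maintains rho_hat,
# T_ik (display counts) and T_leader (leader counts) and updates them from the
# observed clicks. Names are illustrative, not the authors' implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment


def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence kl(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))


def kl_ucb_index(rho_hat, s, t):
    """f(rho_hat, s, t) = sup{p in [rho_hat, 1] : s kl(rho_hat, p) <= log t + 3 log log t},
    computed by bisection; equals 1 when rho_hat = 1, s = 0 or t = 0."""
    if s == 0 or t == 0 or rho_hat >= 1:
        return 1.0
    budget = np.log(t) + 3 * np.log(max(np.log(t), 1e-12))
    lo, hi = rho_hat, 1.0
    for _ in range(50):                       # bisection on the increasing map p -> kl(rho_hat, p)
        mid = (lo + hi) / 2
        if s * kl_bernoulli(rho_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def grab_recommend(rho_hat, T_ik, T_leader, L, K):
    """Return (leader, recommendation) for one GRAB iteration.

    rho_hat, T_ik : (L, K) arrays of click averages and display counts;
    T_leader      : dict mapping a leader (tuple of K items) to the number of times it led."""
    # Line 2: the leader is the best permutation w.r.t. rho_hat (linear sum assignment).
    items, positions = linear_sum_assignment(rho_hat, maximize=True)
    leader = tuple(int(i) for i in items[np.argsort(positions)])   # item shown at positions 1..K
    # Line 3: ranking of the positions by decreasing rho_hat at the leader.
    pi = np.argsort([-rho_hat[leader[k], k] for k in range(K)])
    # Line 4: play the leader every L-th time it leads, otherwise the best candidate by index sum.
    n_lead = T_leader.get(leader, 0)
    if n_lead % L == 0:
        return leader, leader
    b = np.array([[kl_ucb_index(rho_hat[i, k], T_ik[i, k], n_lead + 1)
                   for k in range(K)] for i in range(L)])
    candidates = [leader]
    for j in range(K - 1):                    # a o (pi_j, pi_{j+1}): swap two consecutive positions
        a = list(leader)
        a[pi[j]], a[pi[j + 1]] = a[pi[j + 1]], a[pi[j]]
        candidates.append(tuple(a))
    for i in set(range(L)) - set(leader):     # a[pi_K := i]: replace the item at the last-ranked position
        a = list(leader)
        a[pi[K - 1]] = i
        candidates.append(tuple(a))
    scores = [sum(b[a[k], k] for k in range(K)) for a in candidates]
    return leader, candidates[int(np.argmax(scores))]
```

A full run would wrap this routine in a loop that observes the clicks vector c(t) and updates ρ̂, T_{i,k} and the leader counts accordingly.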
Remark 2 (Algorithmic Complexity). Even though the two optimization steps might seem costly, at each iteration the choice of a recommendation is done in polynomial time w.r.t. L and K. First, the maximization at Line 2 is a linear sum assignment problem, which is solvable in O(K²(L + log K)) time (Ramshaw & Tarjan, 2012); it is not required to scan the L!/(L − K)! permutations of K distinct items among L. Secondly, the maximization at Line 4 is over a set of L − 1 recommendations and is equivalent to the maximization of

$$B_a(t) := \sum_{k=1}^{K} b_{a_k,k}(t) - \sum_{k=1}^{K} b_{\bar a_k(t),k}(t),$$

which reduces to the sum of at most four b_{a_k,k}(t) terms as we are looking at recommendations a in the neighborhood of the leader. Specifically, either a = ā(t) and B_a(t) = 0, or a = ā(t) ∘ (k, k′) and B_a(t) = b_{ā_{k′}(t),k}(t) + b_{ā_k(t),k′}(t) − b_{ā_k(t),k}(t) − b_{ā_{k′}(t),k′}(t), or a = ā(t)[k := i] and B_a(t) = b_{i,k}(t) − b_{ā_k(t),k}(t). Hence, this maximization requires O(L) computation time. Overall, the computation time per iteration is O(K²(L + log K)).

5. Theoretical Analysis

As already discussed in Section 2, the proof of the upper-bound on the regret of GRAB follows a similar path as the proof of OSUB (Combes & Proutière, 2014): (1) apply standard bandit analysis to control the regret under the condition that the leader ā(t) is the best arm a*, and (2) upper-bound the expected number of iterations such that ā(t) ≠ a* by a O(log log T). The inference of the rankings on positions adds a third step: (3) upper-bound the expected number of iterations such that π̂(t) ∉ Π_ρ(ā(t)).

The first step differs from (Combes & Proutière, 2014), as we have to account for the semi-bandit feedback. We note that when the leader is the best arm, GRAB behaves as a Kullback-Leibler variation of CombUCB1 (Kveton et al., 2015b) that we call KL-CombUCB in the following (see the appendix for a complete definition of KL-CombUCB). We derive an upper-bound specific to KL-CombUCB which accounts for the fact that the maximization at Line 4 of Algorithm 1 can be reduced to the maximization over sums of at most 4 terms (see Remark 2). In the context of GRAB, this new result, expressed by Theorem 2, reduces the regret bound by a factor K w.r.t. the standard upper-bound for CombUCB1.

The second part of the analysis is based on the fact that with high probability µ̃_a(t) > µ̃_{a′}(t) if µ_a > µ_{a′}, which derives from the control of the deviation of each ρ̂_{i,k}(t). Here lies the second main difference with Combes & Proutière's analysis: we control the deviation of each individual ρ̂_{i,k}(t), while they control the deviation of µ̂_a(t) := (Σ_{s=1}^{t−1} 1{a(s) = a})^{−1} Σ_{s=1}^{t−1} 1{a(s) = a} r(s). Again, the analysis benefits from the small number of differences between recommendations in the neighborhood of the leader. Moreover, the analysis handles the fact that the neighborhoods may change from one iteration to another, while the neighborhoods are constant in Combes & Proutière's analysis. The corresponding result is expressed, in the following, by Lemma 2. Finally, the number of iterations at which the inferred ranking on the positions is inappropriate is controlled by Lemma 3. The proof of this lemma is eased by the fact that the number of times the leader is played is at least proportional to the number of times it is the leader.

We now propose and prove the main theorem which upper-bounds the regret of GRAB. Its proof is given after the presentation of all the necessary theorems and lemmas.

Theorem 1 (Upper-Bound on the Regret of GRAB). Let (L, K, (ρ_{i,k})_{(i,k) ∈ [L]×[K]}) be an online learning to rank problem satisfying Assumption 1 and such that there exists a unique recommendation a* with maximum expected reward.
When facing this problem, GRAB fulfills:

$$\forall a \in \mathcal{N}_{\pi^*}(a^*), \quad \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t)=a^*, \hat\pi(t)=\pi^*, a(t)=a\}\right] \le \frac{8}{\Delta_a^2}\log T + O(\log\log T), \qquad (4)$$

$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t) \neq a^*\}\right] = O(\log\log T), \qquad (5)$$

$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\hat\pi(t) \notin \Pi_\rho(\bar a(t))\}\right] = O(1), \qquad (6)$$

$$R(T) \le \sum_{a \in \mathcal{N}_{\pi^*}(a^*)} \frac{8}{\Delta_a}\log T + O(\log\log T) = O\!\left(\frac{L}{\Delta_{\min}}\log T\right), \qquad (7)$$

where π* is the unique ranking of positions in Π_ρ(a*), Δ_a := µ* − µ_a, and Δ_min := min_{a ∈ N_π*(a*)} Δ_a.

The first upper-bound (Equation (4)) deals with the expected number of iterations at which GRAB recommends a sub-optimal permutation while the leader is the best permutation. It derives from Theorem 2 hereafter, whose detailed proof is in the appendix.

Theorem 2 (New Upper-Bound on the Regret of KL-CombUCB). We consider a combinatorial semi-bandit setting. Let E be a set of elements and A ⊆ {0, 1}^E be a set of arms, where each arm a is a subset of E. Let us assume that the reward when drawing the arm a ∈ A is Σ_{e ∈ a} c_e, where for each element e ∈ E, c_e is an independent draw of a Bernoulli distribution of mean ρ_e ∈ [0, 1]. Therefore, the expected reward when drawing the arm a ∈ A is µ_a = Σ_{e ∈ a} ρ_e. When facing this bandit setting, KL-CombUCB (CombUCB1 equipped with Kullback-Leibler indices, see the appendix) fulfills

$$\forall a \in \mathcal{A} \text{ s.t. } \mu_a \neq \mu^*, \quad \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{a(t) = a\}\right] \le \frac{2K_a^2}{\Delta_a^2}\log T + O(\log\log T),$$

$$R(T) \le \sum_{a \in \mathcal{A}: \Delta_a > 0} \frac{2K_a^2}{\Delta_a}\log T + O(\log\log T) = O\!\left(\frac{|\mathcal{A}| K_{\max}^2}{\Delta_{\min}}\log T\right),$$

where µ* := max_{a ∈ A} µ_a, Δ_a := µ* − µ_a, Δ_min := min_{a ∈ A: Δ_a > 0} Δ_a, K_a := min_{a′ ∈ A: µ_{a′} = µ*} |a \ a′| is the smallest number of elements to remove from a to get an optimal arm, and K_max := max_{a ∈ A: µ_a ≠ µ*} K_a.

Secondly, the expected number of iterations at which the leader is not the optimal arm (Equation (5)) is controlled by Lemma 2, whose detailed proof is in the appendix.

Lemma 2 (Upper-Bound on the Number of Iterations of GRAB for which ā(t) ≠ a*). Under the hypotheses of Theorem 1 and using its notations,

$$\forall a \in \mathcal{P}_K^L \setminus \{a^*\}, \quad \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t) = a\}\right] = O(\log\log T).$$

Finally, the number of iterations at which the inferred ranking on the positions is inappropriate (Equation (6)) is controlled by Lemma 3, whose detailed proof is in the appendix.

Lemma 3 (Upper-Bound on the Number of Iterations of GRAB for which π̂(t) ∉ Π_ρ(ā)). Under the hypotheses of Theorem 1 and using its notations,

$$\forall a \in \mathcal{P}_K^L, \quad \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t) = a, \hat\pi(t) \notin \Pi_\rho(a)\}\right] = O(1).$$

We assemble these results to get the proof of Theorem 1.

Proof of Theorem 1. First note that, since there is a unique optimal permutation, there is a unique appropriate ranking π* of positions w.r.t. a*: Π_ρ(a*) = {π*}. Then, the proof is based on the following decomposition of the set [T] of iterations:

$$[T] \subseteq \bigcup_{a \in \{a^*\} \cup \mathcal{N}_{\pi^*}(a^*)} \{t \in [T] : \bar a(t) = a^*, \hat\pi(t) = \pi^*, a(t) = a\} \;\cup\; \{t \in [T] : \bar a(t) \neq a^*\} \;\cup\; \{t \in [T] : \hat\pi(t) \notin \Pi_\rho(\bar a(t))\}.$$

As for any recommendation a, Δ_a ≤ K, this decomposition leads to the inequality R(T) ≤ Σ_{a ∈ N_π*(a*)} Δ_a A_a + K·B + K·C, with

$$A_a := \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t) = a^*, \hat\pi(t) = \pi^*, a(t) = a\}\right], \quad B := \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t) \neq a^*\}\right], \quad C := \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\hat\pi(t) \notin \Pi_\rho(\bar a(t))\}\right].$$

The term A_a is smaller than the expected number of times the arm a is chosen by KL-CombUCB when it plays on the set of arms {a*} ∪ N_π*(a*). As any of these arms differs from a* at at most two positions, Theorem 2 upper-bounds A_a by (8/Δ_a²) log T + O(log log T), and hence Σ_{a ∈ N_π*(a*)} Δ_a A_a = O(L/Δ_min log T) as |N_π*(a*)| = L − 1. Note that Theorem 5 of (Kveton et al., 2015b), upper-bounding the regret of CombUCB1, leads to a O(LK/Δ_min log T) bound² for Σ_{a ∈ N_π*(a*)} Δ_a A_a, which we reduce by a factor K by using Theorem 2.

²In this setting, the ground set is E := ⋃_{k ∈ [K]} {(a*_{max(k−1,1)}, k), (a*_k, k), (a*_{min(k+1,K)}, k)} ∪ ⋃_{k ∈ [L]\[K]} {(a*_k, K)}, which is of size L + 2K − 2, and any arm is composed of exactly K elements of E.
From Lemma 2, the term B is upper-bounded by

$$B \le \sum_{a \in \mathcal{P}_K^L \setminus \{a^*\}} \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t) = a\}\right] = O(\log\log T),$$

and we upper-bound the term C with Lemma 3:

$$C \le \sum_{a \in \mathcal{P}_K^L} \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{\bar a(t) = a, \hat\pi(t) \notin \Pi_\rho(a)\}\right] = O(1).$$

Finally, the regret of GRAB is upper-bounded by summing these three terms, which concludes the proof.

5.1. Discussion

Assuming θ_1 > · · · > θ_L and κ_1 > · · · > κ_K, the detailed formula for the regret upper-bound (7) is

$$\sum_{k=1}^{K-1} \frac{8 \log T}{(\kappa_k - \kappa_{k+1})(\theta_k - \theta_{k+1})} \;+\; \sum_{k=K+1}^{L} \frac{8 \log T}{\kappa_K (\theta_K - \theta_k)},$$

where the first sum corresponds to the set of neighbors of a* which recommend the same items as a*, and the second sum relates to the set of neighbors of a* which replace the last item of a*. Hence, the number of displayed items does not impact the total number of terms, but the gaps Δ_a.

Note also that GRAB is, by design, robust to misspecifications. Typically, GRAB would properly handle a matrix ρ = θκ⊤ + E if max_{i,j} |E_{i,j}| is smaller than half of the minimum gap between two entries of the matrix θκ⊤. However, if there is a set A* of optimal recommendations (instead of a unique one), after convergence the leader will be picked in that set at each iteration. So the neighborhood of each optimal recommendation will be explored, and we will get a regret bound in O(|A*| L). This behavior questions the applicability of unimodality to the Cascading Model (CM), as with this model there are at least K! optimal recommendations. Moreover, while Assumption 1 is valid for CM and the Dependent Click Model (DCM), our setting also assumes the existence of the matrix ρ, which is false for CM and DCM: in both settings the probability of clicking on item i at position ℓ depends on the other displayed items.

6. Experiments

In this section, we compare GRAB to PMED (Komiyama et al., 2017), to TopRank (Lattimore et al., 2018), to PB-MHB (Gauthier et al., 2021), to εn-greedy, to Static Graph for unimodal RAnking Bandit (S-GRAB), a simplified version of GRAB, and to KL-CombUCB, an adaptation of CombUCB1 (Kveton et al., 2015b) (see the appendix for details regarding S-GRAB and KL-CombUCB). The experiments are conducted on the Yandex dataset (Yandex, 2013) and on purely simulated data. We use the cumulative regret to evaluate the performance of each algorithm, where the cumulative regret is averaged over 20 independent runs of T = 10⁷ iterations each. Code and data for replicating our experiments are available at https://github.com/gaudel/ranking_bandits.

6.1. Experimental Setting

We use two types of datasets: a simulated one for which we set the values of κ and θ, and a real one for which the parameters are inferred from real-life logs of the Yandex search engine (Yandex, 2013). Let us recall that θ_i is the probability for the user to click on item i when she observes this item, and κ_k is the probability for the user to observe position k.

Simulated data allow us to test GRAB in extreme situations. We consider L = 10 items, K = 5 positions, and κ = [1, 0.75, 0.6, 0.3, 0.1]. The range of values for θ is either close to zero (θ⁻ = [10⁻³, 5·10⁻⁴, 10⁻⁴, 5·10⁻⁵, 10⁻⁵, 10⁻⁶, . . . , 10⁻⁶]) or close to one (θ⁺ = [0.99, 0.95, 0.9, 0.85, 0.8, 0.75, . . . , 0.75]).
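As a small sketch of these two simulated settings, the following computes the best expected reward µ* that the regret of Equation (1) is measured against, assuming the κ and θ values listed above; the helper names are ours.

```python
# The two simulated PBM settings (L = 10, K = 5) and the best expected reward mu*
# they induce; kappa and theta are the values listed in the text, helper names are ours.
import numpy as np
from scipy.optimize import linear_sum_assignment

kappa = np.array([1.0, 0.75, 0.6, 0.3, 0.1])
theta_minus = np.array([1e-3, 5e-4, 1e-4, 5e-5, 1e-5] + [1e-6] * 5)   # theta close to 0
theta_plus = np.array([0.99, 0.95, 0.9, 0.85, 0.8] + [0.75] * 5)      # theta close to 1

def best_expected_reward(theta, kappa):
    """mu* = max_a sum_k rho[a_k, k] for rho = theta kappa^T, via linear sum assignment."""
    rho = np.outer(theta, kappa)
    rows, cols = linear_sum_assignment(rho, maximize=True)
    return rho[rows, cols].sum()

for name, theta in [("theta-", theta_minus), ("theta+", theta_plus)]:
    print(name, "mu* =", round(best_expected_reward(theta, kappa), 5))
```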
Real data contain the logs of actions on the Yandex search engine: 65 million search queries and 167 million hits (clicks). A common use of this dataset in the bandit setting consists first in extracting from these logs the parameters of the chosen click model, and then in simulating user interactions given these parameters (Lattimore et al., 2018). We use the PyClick library (Chuklin et al., 2015) to infer the PBM parameters of each query with the expectation-maximization algorithm. This leads to θ_i values ranging from 0.070 to 0.936, depending on the query. Similarly to (Lattimore et al., 2018), we look at the results averaged over the 10 most frequent queries, while displaying K = 5 items among the L = 10 most attractive ones.

Among our opponents, TopRank and PMED require decreasing values of κ, which may not be fulfilled by PBM. We pre-order them to fulfill these algorithms' requirements. Otherwise, κ is shuffled at the beginning of each sequence of recommendations. We also carefully tune the exploration hyper-parameter c of εn-greedy, taking values ranging exponentially from 10⁰ to 10⁶. For PB-MHB, we use the hyper-parameters recommended in (Gauthier et al., 2021).

Figure 2. Cumulative regret w.r.t. iterations on the Yandex dataset (iterations shown on a log scale from 10³ to 10⁷) for GRAB, S-GRAB, KL-CombUCB, PB-MHB (c = 10³, m = 1), εn-greedy (c = 10⁴), TopRank, and PMED. The plotted curves correspond to the average over 200 independent sequences of recommendations (20 sequences per query). The shaded area depicts the standard error of our regret estimates.

6.2. Results

Figure 2 shows the results of the algorithms on Yandex and Figure 3 on the simulated data. We measure the performance of each algorithm according to the cumulative regret (see Equation (1)). It is the sum, over T consecutive recommendations, of the difference between the expected reward of the best answer and that of the answer of a given recommender system. The best algorithm is the one with the lowest regret. We average the results of each algorithm over 20 independent sequences of recommendations per query or simulated setting. Although PMED theoretically yields an asymptotically optimal regret, we stop it at iteration t = 10⁵ due to its heavy computation time.

Figure 3. Cumulative regret w.r.t. iterations on simulated data, (a) parameter θ close to 1 and (b) parameter θ close to 0 (iterations shown on a log scale up to 10⁷), for GRAB, S-GRAB, KL-CombUCB, PB-MHB (c = 10³, m = 1), εn-greedy, TopRank, and PMED. The plotted curves correspond to the average over 20 independent sequences of recommendations. The shaded area depicts the standard error of our regret estimates. For εn-greedy, c is set to 10⁵ when θ is close to 0, and to 10³ when θ is close to 1.

Ablation Study. The two main ingredients of GRAB are the use of a graph to explore the set of recommendations, and the online inference of this graph. Without these ingredients, GRAB boils down to KL-CombUCB, which recommends at each iteration the best permutation given the sum of indices b_{i,k} and has a O(LK²/∆ log T) regret. With only the first ingredient (namely a static graph of degree Θ(LK)), we get S-GRAB, whose regret is upper-bounded by O(LK/∆ log T), while GRAB's regret is upper-bounded by O(L/∆ log T) thanks to a set of graphs of degree L − 1. We want to assess the empirical impact of these ingredients. On Figures 2 and 3, we see that GRAB has a better regret than S-GRAB and KL-CombUCB in every setting.
This confirms that the proposed graphs are relevant to explore the set of recommendations, and that GRAB quickly infers the appropriate graph in the family of potential ones.

Results Analysis. Figure 2 compares the empirical regret of all algorithms on the Yandex dataset. GRAB is the second best, with a regret at T = 10⁷ about two times smaller than that of the rest of the algorithms. Only PB-MHB yields a smaller regret, but PB-MHB is more than ten times slower than GRAB to deliver a recommendation and it does not have any theoretical guarantee. Figure 3 shows our results on purely simulated data illustrating extreme settings, even though these settings are less realistic. In both settings, GRAB is among the top-3 algorithms. PB-MHB is still the algorithm yielding the best regret. However, while TopRank provides better or similar results as GRAB at iteration 10⁷, its regret is higher than the one of GRAB up to iteration t = 4·10⁶. TopRank only catches up with GRAB at the end of the sequences of recommendations. We note that in the setting close to 1, TopRank manages to find the perfect order after 10⁶ iterations. In this setting too, εn-greedy has a better performance during the first 10⁶ iterations, but suffers from its greedy behaviour during the last steps, with a large variance.

Table 2. Average computation time for sequences of 10⁷ recommendations vs. all queries of the Yandex dataset.

ALGORITHM | SEQUENCE (HOUR/MIN) | TRIAL (MS)
GRAB | 2h24 | 0.9
S-GRAB | 9h56 | 3.6
εn-greedy, c = 10⁴ | 1h13 | 0.4
PB-MHB, c = 10³, m = 1 | 44h50 | 16
KL-CombUCB | 2h03 | 0.7
PMED* | 474h13 | 170
TopRank | 9h29 | 3

*Extrapolation from 10⁵ recommendations.

Computation Time. As shown in Table 2, the fastest algorithm is εn-greedy. KL-CombUCB and GRAB are two times slower. The exploration of S-GRAB multiplies its computation time by 4 compared to GRAB. TopRank is about three times slower than GRAB, and PB-MHB, despite its good regret, is one of the slowest algorithms together with PMED.

7. Conclusion

Our work targets the full PBM setting, which aims at recommending a ranking of K items among L and displaying them, without prior knowledge on the attractiveness of the positions. We learn online both the user preferences and their gaze habits. To solve this problem, we define a graph parametrized by rankings of positions, and we extend the unimodal bandit setting to this family of graphs. We also design GRAB, an algorithm that learns online the proper parametrization of the graph, and we prove a regret upper-bound in O(L/∆ log T) for this algorithm, which reduces by a factor K² (respectively K) the bound which would be obtained without the unimodal setting (resp. with the standard unimodal setting). On real and simulated data, GRAB quickly delivers good recommendations. The extension of the unimodal setting is a promising tool which may benefit recommendation to users with a more general behavior, or other combinatorial semi-bandit scenarios. The integration of unimodal bandit algorithms working on parametric spaces (Combes et al., 2020) may also pave the way to efficient contextual recommendation systems handling larger sets of items and positions.

References

Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework and applications. In Proc. of the 30th Int. Conf. on Machine Learning, ICML'13, 2013.

Chuklin, A., Markov, I., and de Rijke, M. Click Models for Web Search. Morgan & Claypool Publishers, 2015.

Combes, R. and Proutière, A. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proc. of the 31st Int. Conf. on Machine Learning, ICML'14, 2014.
Combes, R., Magureanu, S., Proutière, A., and Laroche, C. Learning to rank: Regret lower bounds and efficient algorithms. In Proc. of the ACM SIGMETRICS Int. Conf. on Measurement and Modeling of Computer Systems, 2015.

Combes, R., Proutière, A., and Fauquette, A. Unimodal bandits with continuous arms: Order-optimal regret without smoothness. Proc. ACM Meas. Anal. Comput. Syst., 4(1), May 2020.

Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. An experimental comparison of click position-bias models. In Proc. of the Int. Conf. on Web Search and Data Mining, WSDM'08, 2008.

Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Trans. Netw., 20(5):1466-1478, October 2012.

Gauthier, C.-S., Gaudel, R., Fromont, E., and Lompo, B. A. Position-based multiple-play bandits with Thompson sampling. In Proc. of the 19th Symposium on Intelligent Data Analysis, IDA'21, 2021.

Komiyama, J., Honda, J., and Nakagawa, H. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proc. of the 32nd Int. Conf. on Machine Learning, ICML'15, 2015.

Komiyama, J., Honda, J., and Takeda, A. Position-based multiple-play bandit problem with unknown position bias. In Advances in Neural Information Processing Systems 30, NIPS'17, 2017.

Kveton, B., Szepesvári, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In Proc. of the 32nd Int. Conf. on Machine Learning, ICML'15, 2015a.

Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Proc. of the 18th Int. Conf. on Artificial Intelligence and Statistics, AISTATS'15, 2015b.

Lagrée, P., Vernade, C., and Cappé, O. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, NIPS'16, 2016.

Lattimore, T., Kveton, B., Li, S., and Szepesvári, C. TopRank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems 31, NIPS'18, 2018.

Li, C., Kveton, B., Lattimore, T., Markov, I., de Rijke, M., Szepesvári, C., and Zoghi, M. BubbleRank: Safe online learning to re-rank via implicit click feedback. In Proc. of the 35th Uncertainty in Artificial Intelligence Conference, UAI'19, 2019.

Radlinski, F., Kleinberg, R., and Joachims, T. Learning diverse rankings with multi-armed bandits. In Proc. of the 25th Int. Conf. on Machine Learning, ICML'08, 2008.

Ramshaw, L. and Tarjan, R. E. On minimum-cost assignments in unbalanced bipartite graphs. Technical report, HP Research Labs, 2012.

Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: Estimating the click-through rate for new ads. In Proc. of the 16th International World Wide Web Conference, WWW'07, 2007.

Yandex. Yandex personalized web search challenge, 2013. URL https://www.kaggle.com/c/yandex-personalized-web-search-challenge.