# Online Learning to Rank with Features

Shuai Li¹, Tor Lattimore², Csaba Szepesvári²

¹The Chinese University of Hong Kong. ²DeepMind. Correspondence to: Shuai Li, Tor Lattimore, Csaba Szepesvári.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We introduce a new model for online ranking in which the click probability factors into an examination and attractiveness function and the attractiveness function is a linear function of a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. A novel algorithm for this setup is analysed, showing that the dependence on the number of items is replaced by a dependence on the dimension, allowing the new algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state-of-the-art.

1. Introduction

Let $\mathcal{L}$ be a large set of items to be ranked, for example a database of movies, news articles or search results. We consider a sequential version of the ranking problem where in each round the learner chooses an ordered list of $K$ distinct items from $\mathcal{L}$ to show the user. We assume the feedback comes in the form of clicks and the learner's objective is to maximise the expected number of clicks over $T$ rounds. Our focus is on the case where $\mathcal{L}$ is large (perhaps millions) and $K$ is relatively small (fifty or so).

There are two main challenges that arise in online ranking problems: (a) The number of rankings grows exponentially in $K$, which makes learning one parameter for each ranking a fruitless endeavour. Click models may be used to reduce the dimensionality of the learning problem, but balancing generality of the model with learnability is a serious challenge. The majority of previous works on online learning to rank have used unstructured models, which are not well suited to our setting where $\mathcal{L}$ is large. (b) Most click models depend on an unknown attractiveness function that endows the item set with an order. This yields a model with at least $|\mathcal{L}|$ parameters, which is prohibitively large in the applications we have in mind.

The first challenge is tackled by adapting the flexible click models introduced in (Zoghi et al., 2017; Lattimore et al., 2018) to our setting. For the second we follow previous works on bandits with large action sets by assuming the attractiveness function can be written as a linear function of a relatively small number of features.

Contribution We make several contributions: A new model for ranking problems with features is proposed that generalises previous work (Li et al., 2016; Zong et al., 2016; Liu et al., 2018) by relaxing the relatively restrictive assumptions on the probability that a user clicks on an item. The new model is strictly more robust than previous works focusing on regret analysis for large item sets. We introduce a novel polynomial-time algorithm called RecurRank. The algorithm operates recursively over an increasingly fine set of partitions of $[K]$. Within each partition the algorithm balances exploration and exploitation, subdividing the partition once it becomes sufficiently certain about the suboptimality of a subset of items. A regret analysis shows that the cumulative regret of RecurRank is at most $R_T = O(K\sqrt{dT\log(LT)})$, where $K$ is the number of positions, $L$ is the number of items and $d$ is the dimension of the feature space.
Even in the non-feature case where $L = d$ this improves on the state-of-the-art by a factor of $\sqrt{K}$. A comparison with the most related works is shown in Table 1.

Table 1. This table compares settings and regret bounds of the most related works on online learning to rank. $T$ is the number of total rounds, $K$ is the number of positions, $L$ is the number of items and $d$ is the feature space dimension. $\Delta$ is the minimal gap between the expected click rate of the best items and the expected click rate of the suboptimal items.

| | Context | Click Model | Regret |
|---|---|---|---|
| Kveton et al. (2015) | – | Cascade Model (CM) | $\Theta\big(\tfrac{L}{\Delta}\log(T)\big)$ |
| Li et al. (2016); Zong et al. (2016); Li & Zhang (2018) | (Generalised) Linear Form | CM | $O\big(d\sqrt{TK}\log(T)\big)$ |
| Katariya et al. (2016) | – | Dependent Click Model (DCM) | $\Theta\big(\tfrac{L}{\Delta}\log(T)\big)$ |
| Liu et al. (2018) | Generalised Linear Form | DCM | $O\big(dK\sqrt{T}\log(T)\big)$ |
| Lagree et al. (2016) | – | Position-Based Model (PBM) with known position bias | $O\big(\tfrac{L}{\Delta}\log(T)\big)$ |
| Zoghi et al. (2017) | – | General Click Model | $O\big(\tfrac{K^3 L}{\Delta}\log(T)\big)$ |
| Lattimore et al. (2018) | – | General Click Model | $O\big(\tfrac{KL}{\Delta}\log(T) \wedge \sqrt{K^3 L T \log(T)}\big)$ |
| Ours | Linear Form | General Click Model | $O\big(K\sqrt{dT\log(LT)}\big)$ |

Related work Online learning to rank has seen an explosion of research in the last decade and there are multiple ways of measuring the performance of an algorithm. One view is that the clicks themselves should be maximised, which we take in this article. An alternative is to assume an underlying relevance of all items in a ranking that is never directly observed, but can be inferred in some way from the observed clicks. In all generality this latter setting falls into the partial monitoring framework (Rustichini, 1999), but has been studied in specific ranking settings (Chaudhuri, 2016, and references therein). See the article by Hofmann et al. (2011) for more discussion on various objectives.

Maximising clicks directly is a more straightforward objective because clicks are an observed quantity. Early work was empirically focused. For example, Li et al. (2010) propose a modification of LinUCB for contextual ranking and Chen & Hofmann (2015) modify the optimistic algorithms for linear bandits. These algorithms do not come with theoretical guarantees, however. There has recently been significant effort towards designing theoretically justified algorithms in settings of increasing complexity (Kveton et al., 2015; Combes et al., 2015; Zong et al., 2016; Katariya et al., 2016; Lagree et al., 2016). These works assume the user's clicks follow a click model that connects properties of the shown ranking to the probability that a user clicks on an item placed in a given position. For example, in the document-based model it is assumed that the probability that the user clicks on a shown item only depends on the unknown attractiveness of that item and not its position in the ranking or the other items. Other simple models include the position-based, cascade and dependent click models. For a survey of click models see (Chuklin et al., 2015). As usual, however, algorithms designed for specific models are brittle when the modelling assumptions are not met. Recent work has started to relax the strong assumptions by making the observation that in all of the above click models the probability of a user clicking on an item can be written as the product of the item's inherent attractiveness and the probability that the user examines its position in the list. Zoghi et al. (2017) use a click model where this decomposition is kept, but the assumption on how the examination probability of a position depends on the list is significantly relaxed.
This is relaxed still further by Lattimore et al. (2018) who avoid the factorisation assumption by making assumptions directly on the click probabilities, but the existence of an attractiveness function remains. The models mentioned in the last paragraph do not make assumptions on the attractiveness function, which means the regret depends badly on the size of $\mathcal{L}$. Certain simple click models have assumed the attractiveness function is a linear function of an item's features and the resulting algorithms are suitable for large action sets. This has been done for the cascade model (Li et al., 2016) and the dependent-click model (Liu et al., 2018). While these works are welcome, the strong assumptions leave a lingering doubt that perhaps the models may not be a good fit for practical problems. Of course, our work is closely related to stochastic linear bandits, first studied by Abe & Long (1999) and refined by Auer (2002); Abbasi-Yadkori et al. (2011); Valko et al. (2014) and many others.

Ranking has also been examined in an adversarial framework by Radlinski et al. (2008). These settings are most similar to the stochastic position-based and document-based models, but with the additional robustness bought by the adversarial framework. Another related setup is the rank-1 bandit problem in which the learner should choose just one of $L$ items to place in one of $K$ positions, for example the location of a billboard with the budget to place only one. These setups have a lot in common with the present one, but cannot be directly applied to ranking problems. For more details see (Katariya et al., 2017a;b). Finally, we note that some authors do not assume an ordering of the item set provided by an attractiveness function. The reader is referred to the work by Slivkins et al. (2013) (which is a follow-up work to Radlinski et al. (2008)) where the learner's objective is to maximise the probability that a user clicks on any item, rather than rewarding multiple clicks. This model encourages diversity and provides an interesting alternative approach.

2. Preliminaries

Notation Let $[n] = \{1, 2, \ldots, n\}$ denote the first $n$ natural numbers. Given a set $X$ the indicator function is $\mathbb{1}_X$. For a vector $x \in \mathbb{R}^d$ and positive definite matrix $V \in \mathbb{R}^{d \times d}$ we let $\|x\|_V^2 = x^\top V x$. The Moore–Penrose pseudoinverse of a matrix $V$ is $V^+$.

Problem setup Let $\mathcal{L} \subset \mathbb{R}^d$ be a finite set of items, $L = |\mathcal{L}|$ and $K > 0$ a natural number denoting the number of positions. A ranking is an injective function from $[K]$, the set of positions, to $\mathcal{L}$, and the set of all rankings is denoted by $\Sigma$. We use uppercase letters like $A$ to denote rankings in $\Sigma$ and lowercase letters $a, b$ to denote items in $\mathcal{L}$. The game proceeds over $T$ rounds. In each round $t \in [T]$ the learner chooses a ranking $A_t \in \Sigma$ and subsequently receives feedback in the form of a vector $C_t \in \{0, 1\}^K$ where $C_{tk} = 1$ if the user clicked on the $k$th position. We assume that the conditional distribution of $C_t$ only depends on $A_t$, which means there exists an unknown function $v : \Sigma \times [K] \to [0, 1]$ such that for all $A \in \Sigma$ and $k \in [K]$,

$$P(C_{tk} = 1 \mid A_t = A) = v(A, k)\,. \qquad (1)$$

Remark 1. We do not assume conditional independence of $(C_{tk})_{k=1}^K$.

In all generality the function $v$ has $K|\Sigma|$ parameters, which is usually impractically large to learn in any reasonable timeframe. A click model corresponds to making assumptions on $v$ that reduce the statistical complexity of the learning problem.
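To make the interaction protocol of Eq. (1) concrete, the following is a minimal simulation sketch in Python; the round loop, the function names and the toy click model are illustrative assumptions rather than part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def play_round(v, A, K, rng):
    """Simulate one round of Eq. (1): given a ranking A of K items, sample the
    click vector C in {0,1}^K with P(C_k = 1 | A) = v(A, k).  The click model v
    may depend on the whole list and on the position in an arbitrary way."""
    return np.array([rng.random() < v(A, k) for k in range(K)], dtype=int)

# A toy (hypothetical) click model used only to exercise the interface: position
# k (0-indexed here) is examined with probability 0.9 ** k and each "item" is
# represented directly by its attractiveness in [0, 1].
def v_toy(A, k):
    return (0.9 ** k) * A[k]

K = 5
A_t = (0.9, 0.7, 0.5, 0.3, 0.1)        # the learner's chosen ranking this round
C_t = play_round(v_toy, A_t, K, rng)    # observed click vector, e.g. [1 0 0 0 0]
```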
We assume a factored model:

$$v(A, k) = \chi(A, k)\,\alpha(A(k))\,, \qquad (2)$$

where $\chi : \Sigma \times [K] \to [0, 1]$ is called the examination probability and $\alpha : \mathcal{L} \to [0, 1]$ is the attractiveness function. We assume that attractiveness is linear in the action, which means there exists an unknown $\theta \in \mathbb{R}^d$ such that

$$\alpha(a) = \langle a, \theta\rangle \quad \text{for all } a \in \mathcal{L}\,. \qquad (3)$$

Let $a^*_k$ be the $k$-th best item sorted in order of decreasing attractiveness. Then let $A^* = (a^*_1, \ldots, a^*_K)$. In case of ties the choice of $A^*$ may not be unique. All of the results that follow hold for any choice. The examination function satisfies three additional assumptions. The first says the examination probability of position $k$ only depends on the identity of the first $k-1$ items and not their order:

Assumption 1. $\chi(A, k) = \chi(A', k)$ for any $A, A' \in \Sigma$ with $A([k-1]) = A'([k-1])$.

The second assumption is that the examination probability on any ranking is monotone decreasing in $k$:

Assumption 2. $\chi(A, k+1) \le \chi(A, k)$ for all $A \in \Sigma$ and $k \in [K-1]$.

The third assumption is that the examination probability on ranking $A^*$ is minimal:

Assumption 3. $\chi(A, k) \ge \chi(A^*, k) =: \chi^*_k$ for all $A \in \Sigma$ and $k \in [K]$.

All of these assumptions are satisfied by many standard click models, including the document-based, position-based and cascade models. These assumptions are strictly weaker than those made by Zoghi et al. (2017) and orthogonal to those by Lattimore et al. (2018), as we discuss in Section 6.

The learning objective We measure the performance of our algorithm in terms of the cumulative regret, which is

$$R_T = T\sum_{k=1}^K v(A^*, k) - \mathbb{E}\left[\sum_{t=1}^T\sum_{k=1}^K v(A_t, k)\right].$$

Remark 2. The regret is defined relative to $A^*$, but our assumptions do not imply that

$$A^* \in \operatorname*{arg\,max}_{A \in \Sigma} \sum_{k=1}^K v(A, k)\,. \qquad (4)$$

The assumptions in all prior work in Table 1 either directly or indirectly ensure that (4) holds. Our regret analysis does not rely on this, so we do not assume it. Note, however, that the definition of regret is most meaningful when Eq. (4) approximately holds.

Experimental design Our algorithm makes use of an exploration spanner that approximately minimises the covariance of the least-squares estimator. Given an arbitrary finite set of vectors $\mathcal{X} = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ and a distribution $\pi : \mathcal{X} \to [0, 1]$, let $Q(\pi) = \sum_{x \in \mathcal{X}} \pi(x)\, x x^\top$. By the Kiefer–Wolfowitz theorem (Kiefer & Wolfowitz, 1960) there exists a $\pi$, called the G-optimal design, such that

$$\max_{x \in \mathcal{X}} \|x\|^2_{Q(\pi)^+} \le d\,. \qquad (5)$$

As explained in Chap. 21 of (Lattimore & Szepesvári, 2018), $\pi$ may be chosen so that $|\{x : \pi(x) > 0\}| \le d(d+1)/2$. A G-optimal design $\pi$ for $\mathcal{X}$ has the property that if each element $x \in \mathcal{X}$ is observed $n\pi(x)$ times for some $n$ large enough, the value estimate obtained via least-squares will have its maximum uncertainty over the items minimised. Given a finite (multi-)set of vectors $\mathcal{X} \subset \mathbb{R}^d$ we let $\mathrm{GOPT}(\mathcal{X})$ denote a G-optimal design distribution. Methods from experimental design have been used for pure exploration in linear bandits (Soare et al., 2014; Xu et al., 2017) and also finite-armed linear bandits (Lattimore & Szepesvári, 2018, Chap. 22) as well as adversarial linear bandits (Bubeck et al., 2012).
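For illustration, a minimal Frank–Wolfe (Fedorov–Wynn style) sketch of an approximate G-optimal design in the sense of Eq. (5) is given below. This is not the implementation used in the paper; the function name, the stopping rule and the assumption that the items span $\mathbb{R}^d$ are ours.

```python
import numpy as np

def g_optimal_design(X, max_iter=1000, tol=1e-3):
    """Approximate G-optimal design on the rows of X (shape n x d) by Frank-Wolfe
    on the log-det objective.  By the Kiefer-Wolfowitz theorem the optimum
    satisfies max_x ||x||^2_{Q(pi)^+} = d, so we stop once the maximum falls
    below d * (1 + tol).  Assumes the rows of X span R^d."""
    n, d = X.shape
    pi = np.full(n, 1.0 / n)                    # start from the uniform design
    for _ in range(max_iter):
        Q = X.T @ (pi[:, None] * X)             # Q(pi) = sum_x pi(x) x x^T
        g = np.einsum('ij,jk,ik->i', X, np.linalg.pinv(Q), X)  # ||x||^2_{Q(pi)^+}
        j = int(np.argmax(g))
        if g[j] <= d * (1 + tol):
            break
        gamma = (g[j] - d) / (d * (g[j] - 1))   # exact line-search step size
        pi = (1 - gamma) * pi
        pi[j] += gamma
    return pi
```

In the algorithm below, the items with $\pi(a) > 0$ are the ones an interval shows (in proportion to $\pi$) at its first position.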
3. Algorithm

The new algorithm is called RecurRank (recursive ranker). The algorithm maintains a partition of the $K$ positions into intervals. Associated with each interval is an integer-valued phase number and an ordered set of items, which has the same size as the interval for all but the last interval (containing position $K$). Initially the partition only contains one interval that is associated with all the items and phase number $\ell = 1$.

At any point in time, RecurRank works in parallel on all intervals. Within an interval associated with phase number $\ell$, the algorithm balances exploration and exploitation while determining the relative attractiveness of the items to accuracy $\Delta_\ell = 2^{-\ell}$. To do this, items are placed in the first position of the interval in proportion to an experimental design. The remaining items are placed in order in the remaining positions. Once sufficient data is collected, the interval is divided into a collection of subintervals and the algorithm is restarted on each subinterval with the phase number increased.

The natural implementation of the algorithm maintains a list of partitions and associated items. In each round it iterates over the partitions and makes assignments of the items within each partition. The assignments are based on a round-robin idea using an experimental design, which means the algorithm needs to keep track of how often each item has been placed in the first position. This is not a problem from an implementation perspective, but stateful code is hard to interpret in pseudocode. We provide a recursive implementation that describes the assignments made within each interval and the rules for creating a new partition. A flow chart depicting the operation of the algorithm is given in Fig. 1. The code is provided in the supplementary material.

Algorithm 1 RecurRank
1: Input: phase number $\ell$, ordered items $A = (a_1, a_2, \ldots)$ and positions $\mathcal{K} = (k, k+1, \ldots, k+m-1)$
2: Find a G-optimal design $\pi = \mathrm{GOPT}(A)$
3: Let $\Delta_\ell = 2^{-\ell}$ and $T(a) = \left\lceil \frac{2 d\, \pi(a)}{\Delta_\ell^2}\log\frac{1}{\delta_\ell}\right\rceil$. This instance will run for $\sum_{a \in A} T(a)$ rounds.
4: Select each item $a \in A$ exactly $T(a)$ times at position $k$, put the available items among $\{a_1, \ldots, a_m\}$ sequentially in positions $\{k+1, \ldots, k+m-1\}$, and receive the click feedback (synchronised by a global clock).
5: Let $D = \{(\beta_1, \zeta_1), \ldots\}$ be the multiset of item/click pairs collected at position $k$ and compute
$$\hat\theta = V^+ S \quad \text{with} \quad V = \sum_{(\beta,\zeta) \in D} \beta\beta^\top \quad \text{and} \quad S = \sum_{(\beta,\zeta) \in D} \zeta\beta\,. \qquad (7)$$
6: Let $a_{(1)}, a_{(2)}, \ldots, a_{(|A|)}$ be an ordering of $A$ such that $\varepsilon_i = \langle \hat\theta, a_{(i)} - a_{(i+1)}\rangle \ge 0$ for all $1 \le i < |A|$ and set $\varepsilon_{|A|} = 2\Delta_\ell$.
7: Let $(u_1, \ldots, u_p) = (i \in [|A|] : \varepsilon_i \ge 2\Delta_\ell)$ and $u_0 = 0$, and for each $i$ let
$$A_i = (a_{(u_{i-1}+1)}, \ldots, a_{(u_i)})\,, \qquad \mathcal{K}_i = (k + u_{i-1}, \ldots, k + \min(m, u_i) - 1)\,.$$
8: For each $i \in [p]$ such that $k + u_{i-1} \le k + m - 1$, call RecurRank$(\ell+1, A_i, \mathcal{K}_i)$ on a separate thread.

Figure 1. A flow chart demonstration of the algorithm. Each dotted circle represents a subinterval and runs an instance of Algorithm 1. The dashed line denotes the first position for each interval.

The pseudocode of the core subroutine on each interval is given in Algorithm 1. The subroutine accepts as input (1) the phase number $\ell$, (2) the positions of the interval $\mathcal{K} \subseteq [K]$ and (3) an ordered list of items, $A$. The phase number determines the length of the experiment and the target precision. The ordering of the items in $A$ is arbitrary in the initial partition (when $\ell = 1$). When $\ell > 1$ the ordering is determined by the empirical estimate of attractiveness in the previous experiment, which is crucial for the analysis. The whole algorithm is started by calling RecurRank$(1, \mathcal{L}, (1, 2, \ldots, K))$, where the order of $\mathcal{L}$ is random. The algorithm is always instantiated with parameters that satisfy $|A| \ge |\mathcal{K}| = m$. Furthermore, $|A| > |\mathcal{K}|$ is only possible when $K \in \mathcal{K}$.

The subroutine learns about the common unknown parameter vector by placing items in the first position of the interval in proportion to a G-optimal design for the available items. The remaining items in $A$ are placed in order into the remaining positions (Line 4).
This means that each item $a \in A$ is placed exactly $T(a)$ times in the first position $k$ of the interval. The choice of $T(a)$ is based on the phase number $\ell$ and the G-optimal design $\pi$ over $A$ (Line 2). Note $T(a) = 0$ if $\pi(a) = 0$. For example, if $A = (a_1, \ldots, a_m)$ and $a_3$ is placed at the first position, then the rest of the positions are filled in as $a_1, a_2, a_4, a_5, \ldots, a_m$. The subroutine runs for $\sum_{a \in A} T(a)$ rounds. The G-optimal design means that the number of rounds required to estimate the value of each item to a fixed precision depends only logarithmically on the number of items. A higher phase number means a longer experiment and also a higher target precision.

Once all arms $a \in A$ have been placed in the first position of the interval $T(a)$ times, RecurRank estimates the attractiveness of the items in $A$ using a least-squares estimator based on the data collected from the first position (Line 5). The items are then ordered based on their estimated attractiveness. The subroutine then partitions the ordered items where the difference between the estimated attractiveness of consecutive items is sufficiently large (Line 7). Finally the subroutine recursively calls RecurRank on each partition for which there are positions available, with an increased phase number and with items sorted according to their empirical attractiveness (Line 8).

Remark 3. Items are eliminated entirely if at the end of a subroutine a partition is formed for which there are no available positions. For example, consider the first instantiation of RecurRank with $\ell = 1$ and $\mathcal{K} = [K]$ and $A = \mathcal{L}$. Suppose the observed data is such that $p = 2$ and $u_1 \ge K$; then items $a_{(u_1+1)}, a_{(u_1+2)}, \ldots, a_{(u_2)}$ will be discarded because the starting position of the second partition would be larger than $K$.

Remark 4. The least-squares estimator $\hat\theta$ defined in Eq. (7) actually does not have expectation $\theta$, which means the algorithm is not really estimating attractiveness. Our assumptions ensure that the expectation of $\hat\theta$ is proportional to $\theta$, however, which is sufficient for our analysis. This is the reason for only using the first position within an interval for estimation.

Remark 5. The subroutine only uses data collected during its own run. Not doing this would introduce bias that may be hard to control.

In Fig. 1, the algorithm starts with Instance 1 of phase number $\ell = 1$, all items and all positions. At time $t_1$, Instance 1 splits into two, each with an increased phase number $\ell = 2$. Instance 2 contains 3 items and 3 positions and Instance 3 contains 5 positions but 22 items. The remaining items have been eliminated. At time $t_2$, Instance 2 finishes running but has no split, so it calls Instance 4 with the same items and same positions but increased phase number $\ell = 3$. During time $t_1$ to $t_2$, Instance 2 and Instance 3 run in parallel and recommend lists together; during time $t_2$ to $t_3$, Instances 3 and 4 run in parallel and recommend lists together. At time $t_3$, Instance 3 finishes and splits into another two threads, both with increased phase number $\ell = 3$. Instance 5 contains exactly 2 items and 2 positions and Instance 6 contains 3 positions but 7 items. Note that the number of involved items keeps shrinking. Right after time $t_3$, Instances 4, 5 and 6 run in parallel and recommend lists together.
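As an illustration of Lines 5–8, the following sketch shows how a single call could turn the clicks collected at its first position into a least-squares estimate, an ordering and a set of recursive calls. The data layout and function name are ours; this is not the authors' code.

```python
import numpy as np

def estimate_and_split(A, D, ell, k, m):
    """A sketch of Lines 5-8 of Algorithm 1 (not the authors' implementation).
    A    : list of item feature vectors handled by this call (length >= m)
    D    : list of (beta, zeta) pairs: the item shown at the interval's first
           position and the 0/1 click it received
    ell  : phase number; the target precision is Delta_ell = 2 ** (-ell)
    k, m : first position and number of positions of the interval
    Returns the (phase, items, positions) triples for the recursive calls."""
    delta_ell = 2.0 ** (-ell)
    B = np.array([beta for beta, _ in D])
    z = np.array([zeta for _, zeta in D], dtype=float)
    V = B.T @ B                                   # V = sum beta beta^T
    theta_hat = np.linalg.pinv(V) @ (B.T @ z)     # theta_hat = V^+ S, S = sum zeta beta

    # order items by estimated attractiveness <theta_hat, a>, largest first
    scores = np.array([a @ theta_hat for a in A])
    order = np.argsort(-scores)
    eps = -np.diff(scores[order])                 # gaps between consecutive items
    eps = np.append(eps, 2 * delta_ell)           # eps_{|A|} = 2 * Delta_ell

    # cut wherever the estimated gap is at least 2 * Delta_ell
    cuts = [i + 1 for i in range(len(A)) if eps[i] >= 2 * delta_ell]
    calls, u_prev = [], 0
    for u in cuts:
        first_pos = k + u_prev                    # first position of the sub-interval
        if first_pos > k + m - 1:                 # no positions left: items are dropped
            break
        items = [A[j] for j in order[u_prev:u]]
        positions = list(range(first_pos, k + min(m, u)))
        calls.append((ell + 1, items, positions))
        u_prev = u
    return calls
```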
RecurRank has two aspects that one may think can lead to an unjustifiable increase of regret: (i) each subroutine only uses data from the first position to estimate attractiveness, and (ii) data collected by one subroutine is not re-used subsequently. The second of these is relatively minor. Like many elimination algorithms, the halving of the precision means that at most a constant factor is lost by discarding the data. The first issue is more delicate. On the one hand, it seems distasteful not to use all available data. But the assumptions do not make it easy to use data collected in later positions. And actually the harm may not be so great. Intuitively the cost of only using data from the first position is greatest when the interval is large and the attractiveness varies greatly within the interval. In this case, however, a split will happen relatively fast.

Running time The most expensive component is computing the G-optimal design. This is a convex optimisation problem and has been studied extensively (see Boyd & Vandenberghe 2004, Sec. 7.5 and Todd 2016). It is not necessary to solve the optimisation problem exactly. Suppose instead we find a distribution $\pi$ on $A$ with support of size at most $D(D+1)/2$ and for which $\max_{a \in A}\|a\|^2_{Q(\pi)^+} \le D$. Then our bounds continue to hold with $d$ replaced by $D$. Such approximations are generally easy to find. For example, $\pi$ may be chosen to be a uniform distribution on a volumetric spanner of $A$ of size $D$. See the supplementary material for a summary of volumetric spanners. Hazan & Karnin (2016) provide a randomised algorithm that returns a volumetric spanner of size at most $O(d \log(d) \log(|A|))$ with an expected running time of $O(|A| d^2)$. For the remaining parts of the algorithm, the least-squares estimation is at most $O(d^3)$. The elimination and partitioning run in $O(|A| d)$. Note these computations happen only once for each instantiation. The update for each partition in each round is $O(d^2)$. The total running time is $O(L d^2 \log(T) + K d^2 T)$.

4. Regret Analysis

Our main theorem bounds the regret of Algorithm 1.

Theorem 1. There exists a universal constant $C > 0$ such that the regret of Algorithm 1 with $\delta = 1/T$ satisfies $R_T \le C K\sqrt{dT\log(LT)}$.

Let $I_\ell$ be the number of calls to RecurRank with phase number $\ell$. Hence each $i \in [I_\ell]$ corresponds to a call of RecurRank with phase number $\ell$ and the arguments are denoted by $\mathcal{A}_{\ell i}$ and $\mathcal{K}_{\ell i}$. Abbreviate $K_{\ell i} = \min \mathcal{K}_{\ell i}$ for the first position of $\mathcal{K}_{\ell i}$, $M_{\ell i} = |\mathcal{K}_{\ell i}|$ for the number of positions and $\mathcal{K}^+_{\ell i} = \mathcal{K}_{\ell i} \setminus \{K_{\ell i}\}$. We also let $K_{\ell, I_\ell+1} = K + 1$ and assume that the calls $i \in [I_\ell]$ are ordered so that $1 = K_{\ell 1} < K_{\ell 2} < \cdots < K_{\ell I_\ell} \le K < K + 1 = K_{\ell, I_\ell+1}$. The reader is reminded that $\chi^*_k = \chi(A^*, k)$ is the examination probability of the $k$th position under the optimal list. Let $\chi_{\ell i} = \chi^*_{K_{\ell i}}$ be shorthand for the optimal examination probability of the first position in call $(\ell, i)$. We let $\hat\theta_{\ell i}$ be the least-squares estimator computed in Eq. (7) in Algorithm 1. The maximum phase number during the entire operation of the algorithm is $\ell_{\max}$.

Definition 1. Let $F$ be the failure event that there exist $\ell \in [\ell_{\max}]$, $i \in [I_\ell]$ and $a \in \mathcal{A}_{\ell i}$ such that $|\langle \hat\theta_{\ell i}, a\rangle - \chi_{\ell i}\langle \theta, a\rangle| \ge \Delta_\ell$, or there exist $\ell \in [\ell_{\max}]$, $i \in [I_\ell]$ and $k \in \mathcal{K}_{\ell i}$ such that $a^*_k \notin \mathcal{A}_{\ell i}$.

The first lemma shows that the failure event occurs with low probability. The proof follows the analysis in (Lattimore & Szepesvári, 2018, Chap. 22) and is summarised in the supplementary material.

Lemma 1. $P(F) \le \delta$.

The proofs of the following lemmas are also provided in the supplementary material.
Lemma 2. On the event $F^c$ it holds for any $\ell \in [\ell_{\max}]$, $i \in [I_\ell]$ and positions $k, k+1 \in \mathcal{K}_{\ell i}$ that $\chi_{\ell i}(\alpha(a^*_k) - \alpha(a^*_{k+1})) \le 8\Delta_\ell$.

Lemma 3. On the event $F^c$ it holds for any $\ell \in [\ell_{\max}]$ and $a \in \mathcal{A}_{\ell I_\ell}$ that $\chi_{\ell I_\ell}(\alpha(a^*_K) - \alpha(a)) \le 8\Delta_\ell$.

Lemma 4. Suppose that in its $(\ell, i)$th call RecurRank places item $a$ in position $k = K_{\ell i}$. Then, provided $F^c$ holds, $\chi_{\ell i}(\alpha(a^*_k) - \alpha(a)) \le 8 M_{\ell i}\Delta_\ell$.

Lemma 5. Suppose that in its $(\ell, i)$th call RecurRank places item $a$ in position $k \in \mathcal{K}^+_{\ell i}$. Then, provided $F^c$ holds, $\chi_{\ell i}(\alpha(a^*_k) - \alpha(a)) \le 4\Delta_\ell$.

Proof of Theorem 1. The first step is to decompose the regret using the failure event:

$$R_T \le P(F)\, TK + \mathbb{E}\left[\mathbb{1}_{F^c}\sum_{t=1}^T\sum_{k=1}^K \big(v(A^*, k) - v(A_t, k)\big)\right].$$

From now on we assume that $F^c$ holds and bound the term inside the expectation. Given $\ell$ and $i \in [I_\ell]$ let $T_{\ell i}$ be the set of rounds when algorithm $(\ell, i)$ is active. Then

$$\sum_{t=1}^T\sum_{k=1}^K \big(v(A^*, k) - v(A_t, k)\big) = \sum_{\ell}\sum_{i=1}^{I_\ell} R_{\ell i}\,, \qquad (8)$$

where $R_{\ell i}$ is the regret incurred during call $(\ell, i)$:

$$R_{\ell i} = \sum_{t \in T_{\ell i}}\sum_{k \in \mathcal{K}_{\ell i}} \big(v(A^*, k) - v(A_t, k)\big)\,.$$

This quantity is further decomposed into the first position in $\mathcal{K}_{\ell i}$, which is used for exploration, and the remaining positions:

$$R^{(1)}_{\ell i} = \sum_{t \in T_{\ell i}} \big(v(A^*, K_{\ell i}) - v(A_t, K_{\ell i})\big)\,, \qquad R^{(2)}_{\ell i} = \sum_{t \in T_{\ell i}}\sum_{k \in \mathcal{K}^+_{\ell i}} \big(v(A^*, k) - v(A_t, k)\big)\,.$$

Each of these terms is bounded separately. For the first term we have

$$R^{(1)}_{\ell i} = \sum_{t \in T_{\ell i}} \big(v(A^*, K_{\ell i}) - v(A_t, K_{\ell i})\big) = \sum_{t \in T_{\ell i}} \big(\chi(A^*, K_{\ell i})\alpha(a^*_{K_{\ell i}}) - \chi(A_t, K_{\ell i})\alpha(A_t(K_{\ell i}))\big) = \sum_{t \in T_{\ell i}} \chi_{\ell i}\big(\alpha(a^*_{K_{\ell i}}) - \alpha(A_t(K_{\ell i}))\big) \le \sum_{t \in T_{\ell i}} 8 M_{\ell i}\Delta_\ell\,, \qquad (9)$$

where the first equality is the definition of $R^{(1)}_{\ell i}$ and the second is the definition of $v$. The third equality is true because event $F^c$ ensures that $\{A_t(k) : k < K_{\ell i}\} = \{a^*_k : k < K_{\ell i}\}$, which combined with Assumption 1 shows that $\chi(A^*, K_{\ell i}) = \chi(A_t, K_{\ell i}) = \chi_{\ell i}$. The inequality in Eq. (9) follows from Lemma 4. Moving on to the second term,

$$R^{(2)}_{\ell i} = \sum_{t \in T_{\ell i}}\sum_{k \in \mathcal{K}^+_{\ell i}} \big(v(A^*, k) - v(A_t, k)\big) \le \sum_{t \in T_{\ell i}}\sum_{k \in \mathcal{K}^+_{\ell i}} \chi^*_k\big(\alpha(a^*_k) - \alpha(A_t(k))\big) \le \sum_{t \in T_{\ell i}}\sum_{k \in \mathcal{K}^+_{\ell i}} \chi_{\ell i}\big(\alpha(a^*_k) - \alpha(A_t(k))\big) \le \sum_{t \in T_{\ell i}}\sum_{k \in \mathcal{K}^+_{\ell i}} 4\Delta_\ell \qquad (10)$$
$$\le \sum_{t \in T_{\ell i}} 4 M_{\ell i}\Delta_\ell\,,$$

where the second inequality follows from Assumption 3 and the third inequality follows from Assumption 2 applied to ranking $A^*$. The inequality in Eq. (10) follows from Lemma 5 and the one after it from the definition of $M_{\ell i} = |\mathcal{K}_{\ell i}|$. Putting things together,

$$\sum_{\ell}\sum_{i \in [I_\ell]} R_{\ell i} \le \sum_{\ell}\sum_{i \in [I_\ell]} 12\,|T_{\ell i}|\, M_{\ell i}\Delta_\ell \le 12 K \sum_{\ell=1}^{\ell_{\max}} \max_{i \in [I_\ell]} |T_{\ell i}|\,\Delta_\ell\,, \qquad (11)$$

where we used that $\sum_{i \in [I_\ell]} M_{\ell i} = K$. To bound $|T_{\ell i}|$ note that, on the one hand, $|T_{\ell i}| \le T$ (this will be useful when $\ell$ is large), while on the other hand, by the definition of the algorithm and the fact that the G-optimal design is supported on at most $d(d+1)/2$ points we have

$$|T_{\ell i}| \le \sum_{a \in \mathcal{A}_{\ell i}} \left\lceil \frac{2 d\,\pi(a)\log(1/\delta_\ell)}{\Delta_\ell^2} \right\rceil \le \frac{d(d+1)}{2} + \frac{2 d \log(1/\delta_\ell)}{\Delta_\ell^2}\,.$$

We now split the sum in (11) into two. For $1 \le \ell_0 \le \ell_{\max}$ to be chosen later,

$$\sum_{\ell=1}^{\ell_0} \max_{i \in [I_\ell]} |T_{\ell i}|\,\Delta_\ell \le \frac{d(d+1)}{2} + 4 d \log(1/\delta_{\ell_0})\, 2^{\ell_0}\,, \qquad \sum_{\ell=\ell_0+1}^{\ell_{\max}} \max_{i \in [I_\ell]} |T_{\ell i}|\,\Delta_\ell \le T\sum_{\ell=\ell_0+1}^{\infty} \Delta_\ell \le T 2^{-\ell_0}\,.$$

Combining the last two displays with Eq. (8) and Eq. (11) shows that

$$\sum_{t=1}^T\sum_{k=1}^K \big(v(A^*, k) - v(A_t, k)\big) \le 12 K\left(\frac{d(d+1)}{2} + 4 d \log(1/\delta_{\ell_0})\, 2^{\ell_0} + T 2^{-\ell_0}\right).$$

The result is completed by optimising $\ell_0$.

5. Experiments

We run experiments to compare RecurRank with CascadeLinUCB (Li et al., 2016; Zong et al., 2016) and TopRank (Lattimore et al., 2018).

Synthetic experiments We construct environments using the cascade click model (CM) and the position-based click model (PBM) with $L = 10^4$ items in $d = 5$ dimensions to be displayed in $K = 10$ positions. We first randomly draw item vectors $\mathcal{L}$ and the weight vector $\theta$ in $d-1$ dimensions with each entry a standard Gaussian variable, then normalise, add one more dimension with constant 1, and divide by 2. The transformation is as follows:

$$x \mapsto \frac{1}{2}\left(\frac{x}{\|x\|},\, 1\right). \qquad (12)$$

This transformation on both the item vectors $x \in \mathcal{L} \subset \mathbb{R}^d$ and the weight vector $\theta$ guarantees that the attractiveness $\langle \theta, x\rangle$ of each item $x$ lies in $[0, 1]$. The position bias for PBM is set to $1/k$ for position $k$, which is often adopted in applications (Wang et al., 2018). The evolution of the regret as a function of time is shown in Fig. 2(a)(b).
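A sketch of the synthetic PBM environment described above (features transformed as in Eq. (12), position bias $1/k$); the variable names are illustrative and this is not the experiment code behind Fig. 2.

```python
import numpy as np

rng = np.random.default_rng(0)
L_items, d, K = 10_000, 5, 10

def transform(v):
    """Eq. (12)-style transformation: normalise a (d-1)-dimensional Gaussian
    draw, append a constant 1, and divide by 2, so that <theta, x> lies in [0, 1]."""
    return 0.5 * np.append(v / np.linalg.norm(v), 1.0)

items = np.array([transform(rng.standard_normal(d - 1)) for _ in range(L_items)])
theta = transform(rng.standard_normal(d - 1))
attract = items @ theta                        # attractiveness of every item, in [0, 1]

# position-based model: position k is examined with bias 1/k, independently of
# which other items appear in the list
position_bias = 1.0 / np.arange(1, K + 1)

def pbm_clicks(ranking):
    """Sample a click vector for a ranking given as K item indices."""
    p = position_bias * attract[np.asarray(ranking)]
    return (rng.random(K) < p).astype(int)

clicks = pbm_clicks(np.argsort(-attract)[:K])   # e.g. clicks on the optimal list
```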
The regrets at the end and the total running times are given in the supplementary material.

Figure 2. The figures compare RecurRank (red) with CascadeLinUCB (black) and TopRank (blue). Subfigure (a) shows results for an environment that follows the cascade click model (CM), while subfigure (b) does the same for the position-based click model (PBM). On these figures, regret over time is shown (smaller is better). In both models there are $L = 10^4$ items and $K = 10$ positions, and the feature space dimension is $d = 5$. Note the logarithmic scale of the y axis on subfigure (a). Subfigure (c) shows the regret over time on the MovieLens dataset with $L = 10^3$, $d = 5$, $K = 10$. All results are averaged over 10 random runs. The error bars are standard errors.

CascadeLinUCB is best in CM but worst in PBM because of its modelling bias. TopRank takes much longer to converge than either CascadeLinUCB or RecurRank since it neither exploits the specifics of the click model, nor does it use the linear structure.

MovieLens dataset We use the MovieLens 20M dataset (Harper & Konstan, 2016), which contains 20 million ratings for $2.7 \times 10^4$ movies by $1.38 \times 10^5$ users. We extract the $L = 10^3$ movies with the most ratings and the $1.1 \times 10^3$ users who rate the most, and randomly split the user set into two parts, $U_1$ and $U_2$, with $|U_1| = 100$ and $|U_2| = 10^3$. We then use the rating matrix of the users in $U_1$ to derive feature vectors with $d = 5$ for all movies by singular-value decomposition (SVD). The resulting feature vectors $\mathcal{L}$ are also processed as in (12). The true weight vector $\theta$ is computed by solving the linear system of $\mathcal{L}$ with respect to the rating matrix of $U_2$. The environment is the document-based click model (DBM) with $\mathcal{L}$ and $\theta$, and we set $K = 10$. Performance is measured in regret, as shown in Fig. 2(c). As can be seen, RecurRank learns faster than the other two algorithms. Of these two algorithms, the performance of CascadeLinUCB saturates: this is due to its incorrect bias.

6. Discussion

Assumptions Our assumptions are most closely related to the work by Lattimore et al. (2018) and Zoghi et al. (2017). The latter work also assumes a factored model where the probability of clicking on an item factors into an examination probability and an attractiveness function. Neither of these works makes use of features to model the attractiveness of items: they are a special case of our model when we set the features of items to be orthogonal to each other (in particular, $d = L$). Our assumptions on the examination probability function are weaker than those by Zoghi et al. (2017). Despite this, our regret upper bound is better by a factor of $K$ (when setting $d = L$) and the analysis is also simpler. The paper by Lattimore et al. (2018) does not assume a factored model, but instead places assumptions directly on $v$. Their assumptions are weaker in the sense that they do not assume the probability of clicking on position $k$ only depends on the identities of the items in positions $[k-1]$ and the attractiveness of the item in position $k$. On the other hand, they do assume a specific behaviour of the $v$ function under pairwise exchanges that is not required by our analysis. It is unclear which set of these assumptions is preferable.
Lower bounds In the orthogonal case where $d = L$, the lower bound in (Lattimore et al., 2018) provides an example where the regret is at least $\Omega(\sqrt{KLT})$. For $d \le L$, the standard techniques for proving lower bounds for linear bandits can be used to prove the regret is at least $\Omega(\sqrt{dKT})$, which except for logarithmic terms means our upper bound is suboptimal by a factor of at most $K$. We are not sure whether either the lower bound or the upper bound is tight.

Open questions Only using data from the first position seems suboptimal, but is hard to avoid without making additional assumptions. Nevertheless, we believe a small improvement should be possible here. Another natural question is how to deal with the situation when the set of available items is changing. In practice this happens in many applications, either because the features are changing or because new items are being added or removed. Other interesting directions are to use weighted least-squares estimators to exploit the low variance when the examination probability and attractiveness are small. Additionally one can use a generalised linear model instead of the linear model to model the attractiveness function, which may be analysed using techniques developed by Filippi et al. (2010) and Jun et al. (2017). Finally, it could be interesting to generalise to the setting where item vectors are sparse (see Abbasi-Yadkori et al. 2012 and Lattimore & Szepesvári 2018, Chap. 23).

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, NIPS, pp. 2312–2320. Curran Associates, Inc., 2011.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Lawrence, N. D. and Girolami, M. (eds.), Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pp. 1–9, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.

Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the 16th International Conference on Machine Learning, ICML, pp. 3–11, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bubeck, S., Cesa-Bianchi, N., and Kakade, S. Towards minimax policies for online linear optimization with bandit feedback. In Annual Conference on Learning Theory, volume 23, pp. 41.1–41.14. Microtome, 2012.

Chaudhuri, S. Learning to Rank: Online Learning, Statistical Theory and Applications. PhD thesis, 2016.

Chen, Y. and Hofmann, K. Online learning to rank: Absolute vs. relative. In Proceedings of the 24th International Conference on World Wide Web, pp. 19–20. ACM, 2015.

Chuklin, A., Markov, I., and de Rijke, M. Click Models for Web Search. Morgan & Claypool Publishers, 2015.

Combes, R., Magureanu, S., Proutiere, A., and Laroche, C. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 231–244. ACM, 2015. ISBN 978-1-4503-3486-0.
Filippi, S., Cappé, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, NIPS, pp. 586–594. Curran Associates, Inc., 2010.

Harper, F. M. and Konstan, J. A. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016. URL https://grouplens.org/datasets/movielens/20m/.

Hazan, E. and Karnin, Z. Volumetric spanners: An efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062–4095, 2016.

Hofmann, K., Whiteson, S., and de Rijke, M. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 249–258. ACM, 2011.

Jun, K., Bhargava, A., Nowak, R., and Willett, R. Scalable generalized linear bandits: Online computation and hashing. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 99–109. Curran Associates, Inc., 2017.

Katariya, S., Kveton, B., Szepesvári, C., and Wen, Z. DCM bandits: Learning to rank with multiple clicks. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1215–1224, 2016.

Katariya, S., Kveton, B., Szepesvári, C., Vernade, C., and Wen, Z. Bernoulli rank-1 bandits for click feedback. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017a.

Katariya, S., Kveton, B., Szepesvári, C., Vernade, C., and Wen, Z. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017b.

Kiefer, J. and Wolfowitz, J. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12(5):363–365, 1960.

Kveton, B., Szepesvári, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, Volume 37, pp. 767–776. JMLR.org, 2015.

Lagree, P., Vernade, C., and Cappé, O. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems 29, NIPS, pp. 1597–1605. Curran Associates Inc., 2016.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. preprint, 2018.

Lattimore, T., Kveton, B., Li, S., and Szepesvári, C. TopRank: A practical algorithm for online stochastic ranking. In Proceedings of the 31st Conference on Neural Information Processing Systems, 2018.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.

Li, S. and Zhang, S. Online clustering of contextual cascading bandits. In The 32nd AAAI Conference on Artificial Intelligence, pp. 3554–3561, 2018.

Li, S., Wang, B., Zhang, S., and Chen, W. Contextual combinatorial cascading bandits. In Proceedings of the 33rd International Conference on Machine Learning, pp. 1245–1253, 2016.

Liu, W., Li, S., and Zhang, S. Contextual dependent click bandit algorithm for web recommendation. In International Computing and Combinatorics Conference, pp. 39–50. Springer, 2018.
Radlinski, F., Kleinberg, R., and Joachims, T. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pp. 784–791. ACM, 2008.

Rustichini, A. Minimizing regret: The general case. Games and Economic Behavior, 29(1):224–243, 1999.

Slivkins, A., Radlinski, F., and Gollapudi, S. Ranked bandits in metric spaces: Learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(Feb):399–436, 2013.

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, NIPS, pp. 828–836. Curran Associates, Inc., 2014.

Todd, M. J. Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, 2016.

Valko, M., Munos, R., Kveton, B., and Kocák, T. Spectral bandits for smooth graph functions. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 46–54, Beijing, China, 22–24 Jun 2014. PMLR.

Wang, X., Golbandi, N., Bendersky, M., Metzler, D., and Najork, M. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 610–618. ACM, 2018.

Xu, L., Honda, J., and Sugiyama, M. Fully adaptive algorithm for pure exploration in linear bandits. arXiv preprint arXiv:1710.05552, 2017.

Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvári, C., and Wen, Z. Online learning to rank in stochastic click models. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of PMLR, pp. 4199–4208, 2017.

Zong, S., Ni, H., Sung, K., Ke, R. N., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, UAI, 2016.