# stochasticrank_global_optimization_of_scalefree_discrete_functions__164ad595.pdf Stochastic Rank: Global Optimization of Scale-Free Discrete Functions Aleksei Ustimenko 1 Liudmila Prokhorenkova 1 2 3 In this paper, we introduce a powerful and efficient framework for direct optimization of ranking metrics. The problem is ill-posed due to the discrete structure of the loss, and to deal with that, we introduce two important techniques: stochastic smoothing and novel gradient estimate based on partial integration. We show that classic smoothing approaches may introduce bias and present a universal solution for a proper debiasing. Importantly, we can guarantee global convergence of our method by adopting a recently proposed Stochastic Gradient Langevin Boosting algorithm. Our algorithm is implemented as a part of the Cat Boost gradient boosting library and outperforms the existing approaches on several learning-torank datasets. In addition to ranking metrics, our framework applies to any scale-free discrete loss function. 1. Introduction The quality of ranking algorithms is traditionally measured by ranking quality metrics such as Normalized Discounted Cumulative Gain (NDCG), Expected Reciprocal Rank (ERR), Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), and so on (Sakai, 2013). These metrics are defined on a list of documents sorted by their predicted relevance to a query and capture the utility of that list for users of a search engine, who are more likely to scan documents starting at the top. Direct optimization of ranking metrics is an extremely challenging problem since sorting makes them piecewise constant (as functions of predicted relevances), so they are neither convex nor smooth. Many algorithms were proposed for different ranking objectives in the learning-to-rank (LTR) research field. We refer to Liu (2009) for a systematic overview of some classic methods. 1Yandex, Moscow, Russia 2Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia 3Higher School of Economics, Moscow, Russia. Correspondence to: Aleksei Ustimenko . Proceedings of the 37 th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s). To deal with the discrete structure of a ranking loss, one can use some smooth approximation, which is easier to optimize. This technique lies behind such well-known algorithms as Soft Rank (Taylor et al., 2008), Approx NDCG (Qin et al., 2010), Rank Net (Burges, 2010), etc. The obtained smooth function can be optimized by gradient-based methods and, in particular, by Stochastic Gradient Boosting (SGB) that is known to be the learning algorithm behind most stateof-the-art LTR frameworks and is commonly preferred by major search engines (Chapelle & Chang, 2011; Yin et al., 2016). Unfortunately, all known smoothing approaches suffer from bias (see Sections 4.2-4.3) which prevents them from truly direct optimization. Moreover, smoothed ranking loss functions are non-convex, and existing algorithms can guarantee only local optima. Our ultimate goal is to solve these problems and propose a truly direct LTR algorithm with provable guarantees of global convergence and generalization. We adopt a theoretical approach, so we start with formal definitions of the class of ranking losses and its generalization to scale-free (SF) discrete loss functions (Section 3.2). Our results hold for the general class of SF losses, which, in addition to all ranking metrics, includes, e.g., a recently proposed loss function for Learning-to-Select-with-Order (Vorobev et al., 2019). Then, to mitigate the discontinuity of the loss, we use stochastic smoothing. We prove that previous smoothing-based approaches are inconsistent with the underlying loss (due to the problem of ties, which we discuss in the next section) and propose a universal solution to this problem (relevancebased consistent smoothing, see Section 4.3). Next, we derive a novel stochastic gradient estimate, which can be applied to the entire class of SF losses (see Section 5). The obtained estimate has low variance and uniformly bounded error, which is crucial for our analysis. Finally, to guarantee global convergence of the algorithm, we adopt a recently proposed Stochastic Gradient Langevin Boosting (SGLB) algorithm (Ustimenko & Prokhorenkova, 2020). SGLB is based on a well studied Stochastic Gradient Langevin Dynamics (Gelfand et al., 1992; Raginsky et al., 2017; Erdogdu et al., 2018) and converges globally for a wide range of loss functions including non-convex ones. We adapt SGLB to our setting and obtain a gradient boosting algorithm that converges globally for the entire class of SF loss functions with provable generalization guarantees (see Section 6). Stochastic Rank: Globally Convergent Stochastic Ranking To sum up, to the best of our knowledge, the proposed Stochastic Rank algorithm is the first globally converging LTR method with provable guarantees that optimizes exactly the underlying ranking quality loss. Stochastic Rank is implemented within the official Cat Boost library (Prokhorenkova et al., 2018; Cat Boost, 2020). Our experiments show that Stochstic Rank outperforms the existing approaches on several LTR datasets. The rest of the paper is organized as follows. In the next section, we briefly overview the related research on learning to rank. In Section 3, we formalize the problem and, in particular, define a general class of ranking loss functions. In Section 4, we formulate the problem of smoothing bias and propose an unbiased solution. Then, in Section 5, we derive a novel stochastic gradient estimate for the whole class of loss functions under consideration. In Section 6, we show how SGLB can be used to achieve global convergence. Finally, Section 7 empirically compares the proposed algorithm with existing approaches, and Section 8 concludes the paper. 2. Related Work Usually, researches divide all LTR methods into three categories: pointwise, pairwise, and listwise (Liu, 2009). Pointwise are the earliest and simplest methods: they approximate relevance labels based on simple or ordinal regression or classification. Such methods were shown to be ineffective for LTR, since loss functions they optimize (e.g., RMSE for relevance labels) differ significantly from the target ranking metric, e.g., NDCG@k. Pairwise methods make a step forward and focus on pairwise preferences and thus known to outperform pointwise approaches significantly. Nevertheless, pairwise approaches still suffer from the problem of solving a different task rather than optimizing a ranking quality objective. Listwise methods try to solve the problem directly by developing either smooth proxies of the target ranking metric like Soft Rank (Taylor et al., 2008), Boltz Rank (Volkovs & Zemel, 2009), Approx NDCG (Qin et al., 2010), Rank Net (Burges, 2010) or by Majorization-Minimization procedure that builds a convex upper bound on the metric on each iteration like Lambda MART (Wu et al., 2010), Lambda Loss (Wang et al., 2018), Permu Rank (Xu et al., 2010), SVMRank (Cao et al., 2006), etc. As discussed in the previous section, algorithms based on smooth approximations suffer from bias and local optima. Also, there are listwise approaches that try to optimize the target loss function without smoothing. For instance, Direct Rank (Tan et al., 2013) constructs an ensemble of decision trees, where the values in the leaves are chosen to optimize the original loss. However, due to greediness, this approach can guarantee only local optima. Finally, let us note that algorithms optimizing a convex upper bound instead of the original loss cannot be truly direct since the optimum for the upper bound can potentially be far away from the true optimum. This is nicely illustrated by Nguyen & Sanner (2013) for accuracy optimization. Let us also mention a recent approach for improving learningto-rank algorithms by adding Gumbel noise to model predictions (Bruch et al., 2020). This is a regularization technique since it builds a convex upper bound on any given convex loss (e.g., Lambda MART).1 Thus, from a theoretical point of view, this approach cannot be truly direct since it uses convex upper bounding. The issue of smoothing bias mentioned in the introduction is connected to the problem of ties: if predicted relevances of some documents coincide, one has to order them somehow to compute a ranking metric. This situation may occur when two documents have equal features. More importantly, ties are always present in boosting algorithms based on discrete weak learners such as decision trees. Unfortunately, this problem is rarely addressed in LTR papers. In practice, it is reasonable to use the worst permutation. First, due to strong penalization, it would force an optimization algorithm to avoid ties. Second, in practice, one cannot know how a production system would rank the items, and often some attribute negatively correlated with relevance is used (e.g., sorting by a bid in online auctions). The importance of using the worst permutation is also discussed by Rudin & Wang (2018), and this ordering is adopted in some open-source libraries like Cat Boost (Prokhorenkova et al., 2018). An alternative choice is to compute the expected value of a ranking metric for a random permutation. This choice is rarely used in practice, since it is computationally complex and gives non-trivial scores to trivial constant predictions, but is often assumed (explicitly or implicitly) by LTR algorithms (Kustarev et al., 2011). 3. Problem Formalization 3.1. Examples of Ranking Loss Functions Before we introduce a general class of loss functions, let us define classic ranking quality functions widely used throughout the literature and in practice.2 These loss functions depend on z, which is a vector of scores produced by the model, and r, which is a vector of relevance labels for a given query. The length of these vectors is denoted by n and can be different for different queries. 1Nesterov & Spokoiny (2017) prove this for Gaussian noise, but the same result generalizes to any centered noise. 2To obtain the loss function from the corresponding quality function, we multiply it by 1. Stochastic Rank: Globally Convergent Stochastic Ranking Let s = argsort(z), i.e., si is the index of a document at i-th position if documents are ordered according to their scores (if zi = zj for j = i, then we place the less relevant one first). Let us define DCG@k, where k denotes the number of top documents we are interested in: DCG@k(z, r) = 2rsi 1 24 log2(i + 1), (1) where ri [0, 4] are relevance labels. This quality function is called Discounted Cumulative Gain: for each document, the numerator corresponds to gain for the relevance, while the denominator discounts for a lower position. NDCG@k is a normalized variant of DCG@k: NDCG@k(z, r) = DCG@k(z, r) maxz Rn DCG@k(z , r). (2) Expected Reciprocal Rank ERR@k assumes that rj [0, 1]: ERR@k(z, r) = j=1 (1 rsj). (3) Mean reciprocal rank (MRR) is used for binary relevance labels rj {0, 1}: MRR(z, r) = j=1 (1 rsj), (4) which is the inverse rank of the first relevant document. Finally, let us define a quality function for the LSO (learning to select with order) problem introduced by Vorobev et al. (2019), which is not exactly a ranking metric, but has a similar structure. The order of elements is predefined (documents are sorted by their indices), but the list of documents to be included is determined by (1{zi>0})n i=1 {0, 1}n: DCG-RR(z, r) = ri 1{zi>0} 1 + P j0} . (5) In the sum above, for each included document we divide its relevance by its rank. 3.2. Generalized Ranking Loss Functions To develop a stochastic ranking theory, we first formalize the class of loss functions to which our results apply. We start with a very general class of scale-free (SF) discrete loss functions. Further, by ξ we denote a vector of context, which may include relevance and any other factors affecting the ranking quality value (like query type or document topic). Definition 1. A function L(z, ξ) : n>0 Rn Ξn R is a Scale-Free Discrete Loss Function iff the following conditions hold: Uniform boundedness: There exists a constant l > 0 such that |L(z, ξ)| l holds n, ξ Ξn, z Rn; Discreteness on subspaces: For each n N and linear subspace V Rn there exist convex open subsets U1, . . . , Uk V, k = k(n, V ) (w.r.t. induced topology on V ), mutually disjoint Ui Uj = for i = j, with everywhere dense union i Ui = V (X denotes the closure of X w.r.t. the ambient topology), such that for any ξ Ξn and i k holds L(z, ξ) Ui const(i, ξ, V ); Jumps regularity: By reusing Ui defined above, for any z i Ui either of the following conditions holds: lim inf z z L(z , ξ) < L(z, ξ) lim sup z z L(z , ξ), lim inf z z L(z , ξ) = L(z, ξ) = lim sup z z L(z , ξ), where z z means z Ui, z z. Scalar freeness: For any n > 0, ξ Ξn, z Rn, λ > 0 holds L(λz, ξ) = L(z, ξ). We denote the class of all SF discrete loss functions by R0. Informally speaking, R0 is a class of bounded discrete functions on a sphere. The jumps regularity property is needed to exclude the breaking points from arg min L. One can show that all loss functions defined in Section 3.1, including the LSO loss DCG-RR, belong to R0. Stochastic Rank out-of-box can be applied to any SF discrete loss function. However, to guarantee global convergence, we need to use consistent smoothing (see Section 4.3), which has to be chosen based on the properties of a particular metric. We propose smoothing which is consistent for the whole class of ranking loss functions defined below. Assume that Ξn = Rn Ξ n and ξ Ξn is a tuple (r, ξ ), where r Rn is a vector of relevance labels. As discussed in Section 2, a particular definition of a ranking loss depends on tie resolution. When some documents have equal scores, we may either use the worst permutation (as commonly done in practice) or compute the average over all orderings of such documents (as usually assumed by LTR algorithms). The definition below assumes the worst permutation. Definition 2. A function L(z, ξ) R0 is a Ranking Loss Function iff the following properties hold: Relevance monotonicity: For each n > 0 and z, r Rn, there exists ϵ0 = ϵ0(r, z) > 0 such that ϵ (0, ϵ0] δ = δ(ϵ, r, z) > 0 such that z : z z < δ: lim sup z z L(z , ξ) = L(z ϵr, ξ). Informally, r is the worst direction for the loss function, i.e., near a breaking point with zi = zj and ri > rj for some i, j, it is better to have zi > zj. Stochastic Rank: Globally Convergent Stochastic Ranking Strong upper semi-continuity (s.u.s.c.): For each n > 0 and z, r Rn: lim sup z z L(z , ξ) = L(z, ξ). Informally, this means that if we do not know how to rank two items (i.e., zi = zj for i = j), then we shall rank them by placing the less relevant one first. Translation invariance:3 For any n > 0, r, z Rn, λ R holds: L(z + λ1n, ξ) L(z, ξ), where 1n := (1, . . . , 1) Rn. Pairwise decision boundary:4 Partition of the space for discreteness on subspaces {Ui} for Rn can be obtained as connected components of Rn\ i,j {z : zi zj = 0}, similarly for an arbitrary subspace V . We denote this class of functions by R1. It can be shown that R1 includes all ranking losses defined in Section 3.1, but not the LSO loss DCG-RR which does not satisfy Relevance monotonicity. Let us now define a class Rsoft 1 , where instead of the worst ranking for ties, we consider the expected loss of a random ranking. For this, we replace the s.u.s.c. condition by: Soft semi-continuity (s.s.c.): For each n > 0 and r, z Rn we have: lim σ 0+ EL(z + σε, ξ) = L(z, ξ), where ε N(0n, In) is a normally distributed random variable. We will show that under some restrictive conditions (that are commonly assumed in the LTR literature), it does not matter which of the two definitions we use (R1 or Rsoft 1 ) as they coincide almost surely and have equal arg min L sets. However, we will explain why these conditions do not hold in practice and in general the minimizers for R1 and Rsoft 1 do not coincide. 3.3. Model Assumptions We assume that for each n > 0 and ξ Ξn there is a model fξ(θ) : Rm Rn such that fξ(θ) = Φξθ for some matrix Φξ Rn m, where θ Rm is a vector of parameters (independent from ξ) and m N is the number of parameters. Typically, each row of Φξ is a feature vector. Gradient boosting over decision trees satisfies this assumption. Indeed, let 3This property is assumed only to be consistent with the learning-to-rank literature and can be omitted. 4This condition can also be removed, but it simplifies the analysis of smoothing bias. us consider all possible trees of a fixed depth formed by a finite number of binary splits obtained by binarization of the initial feature vectors. To get a linear model, we say that θ is a vector of leaf weights of these trees and Φξ is a binary matrix formed by the binarized feature vectors. We will also assume that 1n, z 2 = 0. Indeed, instead of z = fξ(θ) we can define the model as z = fξ(θ) 1 n1T nfξ(θ)1n, which is equivalent due to the translation invariance property. 3.4. Data Distribution Assume that we are given some distribution ξ D on Ξ := n>0 Ξn meaning that ξ also implicitly incorporates information about the number of items n, i.e., for ξ Ξ there exists a unique number n > 0 so that ξ Ξn. D is some unknown distribution, e.g., the distribution of queries submitted to a search system. We are given a finite i.i.d. sample ξ1, . . . , ξN D that corresponds to the train set. Let DN := 1 N PN i=1 δξi be the empirical distribution. 3.5. Optimization Target The assumptions and definitions above allow us to define the expected (generalized) ranking quality for the function L R0 with respect to ξ D and model parameters θ Rm: L(θ) := Eξ DL(fξ(θ), ξ). Our ultimate goal is to find arg minθ L(θ). However, since the distribution D is unknown, we have only i.i.d. samples ξ1, . . . , ξN as defined above. So, we consider the expected ranking quality under the empirical distribution DN: LN(θ) := Eξ DN L(fξ(θ), ξ) = 1 i=1 L(fξi(θ), ξi). We want to optimize L(θ) globally by optimizing LN(θ). This is possible because of the stability of global minimizers even for discontinuous functions: for N 1 an almost minimizer of LN(θ) should be an almost minimizer of L(θ) (Artstein & Wets, 1995). Thus, we need to find a global minimizer of LN. Due to the discrete structure, we can ignore sets of zero Lebesgue measure. Recall that essential infimum (ess inf) is infimum that ignores sets of zero measure and int U denotes an open interior of the set U. Definition 3. For any function L(θ) : Rm R with L := ess infθ Rm L(θ) > , we define arg min L(θ) := int θ Rm : L(θ) = L . We need this unusual definition because of the discrete structure of our loss: we want to exclude the breaking points from arg min. One can see that despite L( , ) satisfies Jumps regularity, the function LN(θ) does not have to. Stochastic Rank: Globally Convergent Stochastic Ranking Statement 1. The set arg minθ Rm LN(θ) is not empty. The proof is straightforward (see Section A of the supplementary materials). 4. Stochastic Smoothing 4.1. Smoothing of Scores The discrete structure of ranking loss functions prevents their effective optimization. Hence, some smoothing is needed and a natural approach for this is mollification (Ermoliev et al., 1995; Dolecki et al., 1983), i.e., adding randomness to parameters. We refer to Section B.1 of the supplementary materials for the formal definition and the reasons why this approach is not applicable in our case. Thus, instead of acting on the level of parameters θ, we act on the level of scores z: Lπ ξ (z, σ) := EL(z + σε, ξ), where ε is a random variable with p.d.f. π(z). We multiply the noise by σ to preserve Scalar-freeness in a sense that Lπ ξ (λz, λσ) = Lπ ξ (z, σ) for any λ > 0. In the linear case f(θ) = Φ θ, if rk Φ = n, it is not hard to show the convergence of minimizers. However, in general, we cannot assume rk Φ = n. In particular, this property is violated in the presence of ties that always occur in gradient boosting due to the discrete nature of decision trees. As a result, there is a smoothing bias that alters the set of minimizers. 4.2. Simple Example of Smoothing Bias Within this section, assume for simplicity that we are dealing with one function L(z) := L(z, ξ) : Rn R for some arbitrary fixed n and ξ Ξn. Let Φ = Φξ Rm n and L(θ) := L(Φ θ). To clearly see how a smoothing bias can be introduced, consider the case when im(Φ) Rn \ k i=1Ui, where Ui are from the Discreteness on subspaces assumption for V = Rn. Denote by c1, . . . , ck R the values of L(z) on the corresponding subsets Ui. Consider the functions L(θ) and Lπ(θ) := limσ 0+ Eε πL(Φθ + σε). The value of Lπ(θ) is fully determined by π, c1, . . . , ck and the subsets U1, . . . , Uk in the following way: Lπ(θ) = P i αici with αi = αi(π, θ, U1, . . . , Uk) = lim σ 0+ P(Φ θ + σε Ui). In contrast, the value L(θ) depends on the values c1, . . . , ck much weaker: for fixed θ, consider the values c 1, . . . , c k that correspond to Ui such that Φθ Ui, then the only limitation we have is min c i < L(θ) max c i (this is required by Jumps regularity), which clearly allows more flexibility than the linear combination defined above. In LTR, the issue of smoothing bias is connected to the problems of ties: the situations when zi = zj and ri = rj. 4.3. Consistent Smoothing Definition 4. We say that the family of distributions πξ(z) : n>0 Rn Ξn R+ is a consistent smoothing for L(z, ξ) R0 and for the model fξ iff for each n > 0, ξ Ξn the following limit holds almost surely locally uniform in θ: L(fξ(θ), ξ) = lim σ 0+ Lπ ξ (fξ(θ), σ). If π is smooth enough and consistent, then the function Lπ N(θ, σ) := 1 N PN i=1 Lπ ξi(fξi(θ), σ) is also smooth and almost surely locally uniformly approximates the discrete loss LN(θ) as σ 0+. To optimize ranking losses, it is important to find a consistent smoothing π for functions in R1. Fortunately, we can do this with an arbitrary precision by shifting the normal distribution by µr for large enough µ. Relevance monotonicity and s.u.s.c. imply the following pointwise limit: lim µ lim σ 0+ Eε N( µr,In)L(z + σε, ξ) = lim µ lim σ 0+ Eε N(0n,In)L(z σµr+σε, ξ) = L(z, r) . This can be strengthened to the following theorem, which is proven in Section B.2 of the supplementary materials. Theorem 1. πξ,µ = N( µr, In) is a consistent smoothing for R1 as µ . Formally, θ except zero measure δ > 0 ϵ > 0 µ > 0 σ0 > 0 such that σ (0, σ0) and θ : θ θ < δ holds |Lπ ξ (fξ(θ ), σ) L(fξ(θ ), ξ)| < ϵ. By similar arguments, one can show that N(0, In) is a consistent smoothing for Rsoft 1 . Note that in both cases the consistent smoothing is universal for the entire class (R1 of Rsoft 1 ), i.e., it is independent from the choice of fξ. Thus, LTR problems require non-trivial smoothing to preserve consistency. However, under some restrictive assumptions on the loss and on the model, any smoothing π is consistent. Recall that LN(θ) = 1 N PN i=1 L(Φξiθ, ξi) and assume that L(z, ξ) R0. The following theorem is proven in Section B.3 of the supplementary materials. Theorem 2. Consider open and convex subsets U ij := Uij im Φξi. If i j s.t. U ij = and j U ij = im Φξi, then any smoothing π is consistent for LN(θ). In early literature on LTR, all authors used such conditions implicitly by assuming that scores for all items are different. In contrast, we do not use this assumption as it never Stochastic Rank: Globally Convergent Stochastic Ranking holds in practice (e.g., when two documents have equal features). As a result, all existing LTR approaches suffer from a smoothing bias. In contrast, for the LSO problem, any smoothing is consistent, as we discuss in Section B.4 of the supplementary materials. 4.4. Scale-Free Acceleration It is intuitively clear that for a scale-free function it is better to have a scalar-free approximation. However, for each λ > 0 we have Lπ ξ (λz, σ) = Lπ ξ (z, λ 1σ), i.e., the smoothed function is no longer scale-free. To enforce scale-freeness, we take a vector z with z 2 > 0 and define Lπ ξ (z, σ|z ) := Lπ ξ We refer to such smoothing as Scale-Free Acceleration (SFA). The obtained function is indeed scale-free: Lπ ξ (λz, σ|z ) Lπ ξ (z, σ|z ) for any λ > 0. Let bσ(z) := z 2 z 2 σ. In our optimization, we will be interested only in the case when z = zt is the vector of scores obtained on t-th iteration of the optimization algorithm. So, we have bσ(zt) = σ and SFA does not change the scale σ. One can imagine a sphere of radius R = z 2, where we restrict Lπ ξ (z, σ) and homogenize it along the rays from the origin to infinity to obtain a scalar-free function. 4.5. Smoothing Properties Finally, let us discuss regularity assumptions for smoothing on which our optimization method relies. Consider a family of distributions with p.d.f. πξ(z) with ξ Ξn for some n > 0, z Rn. We require the following properties: Continuous differentiability: πξ(z) is C(1)(Rn), i.e., is differentiable with a continuous derivative. Uniformly bounded derivative: n N, ξ Ξn we have zπξ 2 = O(1) uniformly in z Rn. Derivative decay: n N we have zπξ 2 = O( z n 2 2 ) as z 2 . Tractable conditional expectations: conditional densities πj ξ(zj) := πξ(zj|z\j) are easy to compute.5 Clearly, N( µr, IN) satisfies these assumptions µ 0. 5We do not use the log-derivative trick, so we do not care about the ability to compute d dzj πj ξ(zj) and d dzj log πj ξ(zj), our gradient estimates require only computation of πj ξ(zj). 5. Coordinate Conditional Sampling 5.1. Gradient Estimate In the previous section, we required the ability to easily compute πj ξ(zj) = πξ(zj|z\j). This property allows us to do the following trick: we decompose πξ(z) = πj ξ(zj)π\j ξ (z\j) with π\j ξ (z\j) being the marginal distribution for z\j. Then, we can represent Lπ ξ (z, σ) = Lπ ξ πj ξ π\j ξ . Note that the convolution is an associative operation that commutes with differentiation and, henceforth, zj Lπ ξ (z, σ) = zj Lπ ξ πj ξ Note that we differentiate by zj the convolution by the same zj. So, if we want to estimate the gradient unbiasedly, we need to sample ε\j π\j ξ and then compute exactly zj Lπ ξ πj ξ (zj, z\j + σε\j) . The resulting estimate would be unbiased by construction. The following lemma suggests how to deal with zj Lπ ξ πj ξ. Lemma 1. The function lj(zj) := L((zj, z\j), ξ) : R R for all z except zero measure has at most k k(n, Rn) 1 (k is from the Discreteness on subspaces assumption) breaking points b1, . . . , bk (possibly depending on z\j and ξ) and can be represented as: s=1 lj(bs)1{zj bs} + const(z\j, ξ), lj(bs) := lim ϵ 0+ lj(bs + ϵ) lj(bs ϵ). All results of this section are proven in Section C of the supplementary materials. Based on the above lemma, we prove the following theorem. Theorem 3. The derivative zj Lπ ξ (z, σ) is equal to: σ 1 Eε\j π\j ξ s=1 lj(bs)πj ξ(σ 1(bs zj)), where k and bs = bs(z\j + σε\j) are from Lemma 1. Corollary 1. For LTR losses, the above formula becomes: zj Lπ ξ (z, σ) = σ 1 s=1 lj(zs + σεs)πj ξ(σ 1(zs zj) + εs). Uniform boundedness of lj and π implies the following. Statement 2. The estimate is uniformly bounded by O(σ 1). Stochastic Rank: Globally Convergent Stochastic Ranking Proceeding analogously with each coordinate j {1, . . . , n}, we obtain an unbiased estimate of z Lπ ξ (z, σ) that is uniformly bounded, in contrast to the classic estimate σ 1(L(z + σε) L(z))ε (Nesterov & Spokoiny, 2017) obtained by the log-derivative trick for the normal distribution that is also known as REINFORCE (Williams, 1992). Uniform boundedness is crucial since without it we would not be able to claim global convergence. We call such estimate Conditional Coordinate Sampling (CCS) and denote it by b CCLπ ξ (z, σ). Note that for each coordinate when estimating b CCLπ ξ (z, σ) we use the shared noise vector ε πξ, i.e., the components of the gradient can have non-trivial covariation, but due to the uniform boundness the covariation is also uniformly bounded by O(σ 1). Finally, let us discuss the complexity of computing b CCLπ ξ (z, σ). The following result follows from Section D of the supplementary materials. Statement 3. The estimate b CCLπ ξ (z, σ) can be computed in: O((k+log n)n) operations and O(n) additional memory for (N)DCG@k and ERR@k. O(n log n) operations and O(1) memory for MRR. 5.2. SFA Gradient Estimate It is not hard to generalize CCS to SFA. The following theorem holds. Theorem 4. For bσ(z) = z 2 z 2 σ at z = z we have: z Lπ(z, bσ(z)) = z Lπ ξ D z Lπ ξ , z z 2 Corollary 2. Unbiased CCS estimate for SFA can be obtained by orthogonalizing b CCLπ ξ (z, σ) and z. Since orthogonalization reduces the norm of the estimate, it necessarily reduces the variance, so we obtain the following corollary. Corollary 3. SFA CCS estimate has a lower variance than the original CCS. The intuition for the orthogonalization is based on Scalarfreenees: the function L(z, ξ) does not change along z direction, so this direction in the gradient z Lπ ξ does not contribute to L(z, ξ) optimization. As we need to deal with possibility of z = z = 0n, we introduce a parameter ν > 0 and replace z 2 by z 2 + ν: b CCLπ ξ (z, σ|z , ν) z =z := b CCLπ ξ (z, σ) D b CCLπ ξ (z, σ), z z 2 + ν 2 z z 2 + ν . Lemma 2. Bias of SFA CCS estimate is uniformly bounded: Eb CCLπ ξ (z, σ0|z , ν) z Lπ ξ (z, bσ) = O 1 z + ν As a consequence, if ν or z , then the estimate is asymptotically unbiased. Thus, for the convergence analysis we consider only b CCLπ(z, σ) since the estimate b CCLπ(z, σ0|z , ν) can be made unbiased by varying the parameter ν > 0. In practice, we consider b CCLπ ξ (z, σ0|z , ν) with fixed ν = 10 2 as we observed that this parameter performs well enough. Moreover, SFA can be seen as a bias variance tradeoff controlled by ν > 0 for CCS estimate of z Lπ ξ (z, σ). For prac- tical comparison of b CCLπ ξ (z, σ) and b CCLπ ξ (z, σ0|z , ν) we refer to Section 7, where we show that SFA gives a significant improvement. 6. Global Optimization by Diffusion Previously, we discussed the importance of global optimization of LN(θ). As we show in this section, this can be achieved by global optimization of smoothed Lπ N(θ, σ) with σ = 1 (if smoothing is consistent) using the recently proposed Stochastic Gradient Langevin Boosting (SGLB) (Ustimenko & Prokhorenkova, 2020). SGLB is easy to apply: essentially, each iteration of standard SGB is modified via model shrinkage and adding Gaussian noise to the gradients. However, the obtained algorithm is backed by strong theoretical results, see (Ustimenko & Prokhorenkova, 2020) for the details and the supplementary materials (Section E.1) for a brief sketch. The global convergence is implied by the fact that as the number of iterations grows, the stationary distribution pβ(F) of the predictions F = (fξ1(θ), . . . , fξN (θ)) concentrates around the global optima of the implicitly regularized loss Lπ N(F, σ, γ) = Lπ N(F, σ) + γ where Γ is an implicitly defined regularization matrix. More formally, pβ(F) exp( βLπ N(F, σ, γ)). Global convergence of SGLB requires Lipschitz smoothness and continuity (Ustimenko & Prokhorenkova, 2020). We can ensure this for the entire R0, which allows us to claim the following theorem (see Section E.2 for the proof). Theorem 5. SGLB method applied to Lπ N(F, σ) converges globally to optima of LN(F) LN(θ) when used with CCS estimate. The following statement ensures that we can safely fix σ = 1 and fit only γ parameter without loosing any possible solution. Stochastic Rank: Globally Convergent Stochastic Ranking Statement 4. EF pβLN(F) = EF p βLN(F ), where pβ corresponds to (σ, γ) and p β to (1, σ2γ). Proof. Due to Scalar-freeness, we can write Lπ N(F, σ) Lπ N(σ 1F, 1) and γ 2 Γ σ 1F 2 2. Finally, due to Scalar-freeness, the change F = σ 1F does not change the value of LN(F) LN(F ) and thus the expectation does not change. 6.2. Generalization Ustimenko & Prokhorenkova (2020) related the generalization gap with the uniform spectral gap parameter λ 0 for the distribution pβ(θ) := exp( βLN(θ,σ,γ)) R Rm exp( βLN(θ,σ,γ))dθ (see Raginsky et al. (2017) for the definition of a uniform spectral gap). Here pβ(θ) represents the limiting (as the learning rate goes to zero) distribution of the vector of parameters θ and is induced by the distribution pβ(F) exp( βL(F, σ, γ)) using the relation F = Φθ. The following theorem is proven in the supplementary materials (Section E.3). Theorem 6. The generalization gap Eθ pβ(θ)Lπ(θ, σ) Eθ pβ(θ)Lπ N(θ, σ) can be bounded by: β + 2d + d2 7. Experiments As baseline approaches, we consider the well-known Lambda MART framework optimized for NDCG@k (Wu et al., 2010), NDCG-Loss2++ from the Lambda Loss framework (Wang et al., 2018), and Soft Rank (Taylor et al., 2008). We also apply the technique proposed by Bruch et al. (2020) to the baselines, the corresponding methods are called Eλ-MART and Eλ-Loss. Similarly to Wang et al. (2018), we set the parameter µ for NDCG-Loss2++ to be equal to 5. According to our experiments, NDCG-Loss2++ performed significantly better than NDCG-Loss2, which agrees with Wang et al. (2018). 7.1. Synthetic Data Unfortunately, in practice, we cannot verify if we have reached the global optimum as we cannot evaluate all possible ensembles of trees. But having theoretical guarantees is important as it implies the stability of the algorithm and good generalization. In this section, we describe a simple synthetic test to verify whether Stochastic Rank can reach the global optimum. The following dataset is multimodal (has several local optima) for NDCG@3: the number of queries is N = 2, first relevance vector is r1 = (3, 2, 1) and the second is r2 = (3, 2). We consider the following features for the first Table 1. Experimental results on synthetic data. Method NDCG@3 λ-MART 0.903 λ-Loss 0.903 Eλ-MART 0.903 Eλ-Loss 0.903 Soft Rank 0.903 Stochastic Rank 0.917 query: x1 = (1, 0, 0), x2 = (0, 1, 0), x3 = (0, 0, 1) and for the second x3 and x1 (in the given order). We consider this simple synthetic dataset for two reasons: first, it clearly shows that ranking losses are likely to be multimodal; second, it allows us to demonstrate how multimodality prevents existing approaches from reaching the global optimum. We limited the tree depth parameter to 3, so one tree can separate all documents with different features. We set the number of iterations to 1000, learning rate to 0.1, diffusion temperature to 103, and model-shrink-rate to 10 3. The results are shown in Table 1. We note that the maximum achievable NDCG@3 for this dataset is 0.917, i.e., Stochastic Rank successfully recovers the global optimum while all other approaches converge to a local optimum 0.903. 7.2. Real Data Datasets For our experiments, we use the following publicly available datasets. First, we use the data from YAHOO! Learning to Rank Challenge (Chapelle & Chang, 2011): there are two datasets, each is pre-divided into training, validation, and testing parts. The other datasets are WEB10K and WEB30K released by Microsoft (Qin & Liu, 2013). Following Wang et al. (2018), we use Fold 1 for these two datasets. Quality metrics The first metric we use is NDCG@5, which is very common in LTR research. The second one is MRR, which is a well-known click-based metric. Recall that MRR requires binary labels, so we binarize each label by eyi := 1{yi>0}. Notably, while MRR is frequently used in online evaluations, it is much less studied compared to NDCG@k and there are no effective approaches designed for it. Fortunately, our method can be easily adapted to any ranking metric via a combination of SGLB with Coordinate Conditional Sampling smoothed by Gaussian noise. Framework We implemented all approaches in Cat Boost, which is an open-source gradient boosting library outperforming the most popular alternatives like XGBoost (Chen Stochastic Rank: Globally Convergent Stochastic Ranking Table 2. Experimental results. Method Dataset NDCG@5 MRR λ-MART Yahoo Set 1 74.53 90.21 λ-Loss Yahoo Set 1 74.73 - Eλ-MART Yahoo Set 1 74.57 90.30 Eλ-Loss Yahoo Set 1 74.75 - Soft Rank Yahoo Set 1 71.98 90.17 SR-Rsoft 1 Yahoo Set 1 74.68 91.07 SR-R1 Yahoo Set 1 74.92 90.97 λ-MART Yahoo Set 2 73.87 91.48 λ-Loss Yahoo Set 2 73.89 - Eλ-MART Yahoo Set 2 73.87 91.48 Eλ-Loss Yahoo Set 2 73.91 - Soft Rank Yahoo Set 2 73.91 92.16 SR-Rsoft 1 Yahoo Set 2 73.95 93.16 SR-R1 Yahoo Set 2 74.15 93.56 λ-MART WEB10K 48.22 81.85 λ-Loss WEB10K 48.33 - Eλ-MART WEB10K 48.29 81.72. Eλ-Loss WEB10K 48.47 - Soft Rank WEB10K 42.82 81.38 SR-Rsoft 1 WEB10K 48.19 83.08 SR-R1 WEB10K 48.53 83.30 λ-MART WEB30K 49.55 83.79 λ-Loss WEB30K 49.45 - Eλ-MART WEB30K 49.49 83.79 Eλ-Loss WEB30K 49.52 - Soft Rank WEB30K 43.46 82.73 SR-Rsoft 1 WEB30K 49.67 85.19 SR-R1 WEB30K 49.59 85.01 & Guestrin, 2016) and Light GBM (Ke et al., 2017) for several tasks (Prokhorenkova et al., 2018). Lambda MART can be easily adapted for optimizing MRR, so we implemented both versions. In contrast, Lambda Loss is specifically designed for NDCG and cannot be easily modified for MRR. For Soft Rank we used CCS to estimate gradients, since the original approach is computation and memory demanding, so it is infeasible in gradient boosting which requires all gradients to be estimated at each iteration. Parameter tuning For all algorithms, we set the maximum number of trees to 1000. We tune the hyperparameters using 500 iterations of random search and select the best combination using the validation set, the details are given in the supplementary materials (Section F). Table 3. Comparison of the algorithm s features on Yahoo Set 1, where πµ means using unbiased smoothing. Features NDCG@5 REINFORCE 70.74 CCS 71.89 CCS+SFA 74.55 CCS+SFA+SGLB (SR-Rsoft 1 ) 74.68 CCS+SFA+SGLB+πµ (SR-R1) 74.92 Results The results are shown in Table 2. One can see that Stochastic Rank (SR-R1) outperforms the baseline approaches on all datasets. In all cases, the difference with the closest baseline is statistically significant with a p-value < 0.05 measured by the paired one-tailed t-test. Also, in most cases, SR-R1 outperforms SR-Rsoft 1 , which clearly demonstrates the advantage of unbiased smoothing, which takes into account the tie resolution policy. The results in Table 2 are comparable to previously reported numbers, although they cannot be compared directly, since experimental setup (e.g., the maximum number of trees) is not fully described in many cases (Wang et al., 2018). More importantly, the previously reported results can be overvalued, since many openly available libraries compute ranking metrics using neither worst (as in our case) nor expected permutation, but some fixed arbitrary one depending on a particular implementation of the sorting operation. To further understand how different techniques proposed in this paper affect the quality of the algorithm, we show the improvement obtained from each feature using the Yahoo dataset and the NDCG metrics (see Table 3). We see that CCS is significantly better than REINFORCE, while SFA gives an additional significant performance boost. SGLB and consistent smoothing further improve NDCG. We note that for both REINFORCE and CCS we use one sample per gradient estimate since the most time-consuming operation for both estimates is sorting (see Section D of the supplementary materials). 8. Conclusion In this paper, we proposed the first truly direct LTR algorithm. We formally proved that this algorithm converges globally to the minimizer of the target loss function. This is possible due to the combination of three techniques: unbiased smoothing for consistency between the original and smoothed losses; SGLB for global optimization via gradient boosting; and CCS gradient estimate with uniformly bounded error and low variance, which is required for SGLB to be applied. Our experiments clearly illustrate that the new algorithm outperforms state-of-the-art LTR methods. Stochastic Rank: Globally Convergent Stochastic Ranking Artstein, Z. and Wets, R. Consistency of minimizers and the SLLN for stochastic programs. Journal of Convex Analysis, 2(1-2):1 17, 1995. Bruch, S., Han, S., Bendersky, M., and Najork, M. A stochastic treatment of learning to rank scoring functions. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020), pp. 61 69, 2020. Burges, C. J. C. From Rank Net to Lambda Rank to Lambda MART: An overview. Technical report, Microsoft Research, 2010. Cao, Y., Xu, J., Liu, T.-Y., Li, H., Huang, Y., and Hon, H.-W. Adapting ranking SVM to document retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 186 193. ACM, 2006. Cat Boost. Ranking: objectives and metrics. https://catboost.ai/docs/concepts/ loss-functions-ranking.html, 2020. Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. In Proceedings of the learning to rank challenge, pp. 1 24, 2011. Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785 794, 2016. Dolecki, S., Salinetti, G., and Wets, R. J.-B. Convergence of functions: equi-semicontinuity. Transactions of the American Mathematical Society, 276(1):409 429, 1983. Erdogdu, M. A., Mackey, L., and Shamir, O. Global nonconvex optimization with discretized diffusions. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, pp. 9694 9703, 2018. Ermoliev, Y., Norkin, V., and Wets, R. The minimization of semicontinuous functions: mollifier subgradients. SIAM Journal on Control and Optimization, 33, 01 1995. Gelfand, S. B., Doerschuk, P. C., and Nahhas-Mohandes, M. Theory and application of annealing algorithms for continuous optimization. In Proceedings of the 24th conference on Winter simulation, pp. 494 499, 1992. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. Light GBM: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pp. 3146 3154, 2017. Kustarev, A., Ustinovsky, Y., Logachev, Y., Grechnikov, E., Segalovich, I., and Serdyukov, P. Smoothing NDCG metrics using tied scores. In Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 2053 2056, 2011. Liu, T.-Y. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225 331, 2009. Nesterov, Y. and Spokoiny, V. G. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17:527 566, 2017. Nguyen, T. and Sanner, S. Algorithms for direct 0 1 loss optimization in binary classification. In International Conference on Machine Learning, pp. 1085 1093, 2013. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. Catboost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638 6648, 2018. Qin, T. and Liu, T. Introducing LETOR 4.0 datasets. Co RR, abs/1306.2597, 2013. Qin, T., Liu, T.-Y., and Li, H. A general approximation framework for direct optimization of information retrieval measures. Information retrieval, 13(4):375 397, 2010. Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. Co RR, abs/1702.03849, 2017. Rudin, C. and Wang, Y. Direct learning to rank and rerank. In International Conference on Artificial Intelligence and Statistics, pp. 775 783, 2018. Sakai, T. Metrics, statistics, tests. In Bridging Between Information Retrieval and Databases - PROMISE Winter School 2013, Bressanone, Italy, February 4-8, 2013. Revised Tutorial Lectures, pp. 116 163, 2013. Tan, M., Xia, T., Guo, L., and Wang, S. Direct optimization of ranking measures for learning to rank models. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 856 864, 2013. Taylor, M., Guiver, J., Robertson, S., and Minka, T. Soft Rank: Optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM 08, pp. 77 86, 2008. ISBN 978-1-59593-927-2. Ustimenko, A. and Prokhorenkova, L. SGLB: Stochastic Gradient Langevin Boosting. ar Xiv e-prints, art. ar Xiv:2001.07248, 2020. Stochastic Rank: Globally Convergent Stochastic Ranking Volkovs, M. N. and Zemel, R. S. Boltz Rank: learning to maximize expected ranking gain. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1089 1096, 2009. Vorobev, A., Ustimenko, A., Gusev, G., and Serdyukov, P. Learning to select for a predefined ranking. In International Conference on Machine Learning, pp. 6477 6486, 2019. Wang, X., Li, C., Golbandi, N., Bendersky, M., and Najork, M. The lambdaloss framework for ranking metric optimization. In Proceedings of The 27th ACM International Conference on Information and Knowledge Management (CIKM 18), pp. 1313 1322, 2018. Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229 256, 1992. Wu, Q., Burges, C. J., Svore, K. M., and Gao, J. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254 270, 2010. Xu, J., Li, H., Liu, T.-Y., Peng, Y., Lu, M., and Ma, W.-Y. Direct optimization of evaluation measures in learning to rank. 2010. Yin, D., Hu, Y., Tang, J., Daly, T., Zhou, M., Ouyang, H., Chen, J., Kang, C., Deng, H., Nobata, C., Langlois, J.-M., and Chang, Y. Ranking relevance in Yahoo search. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 16, pp. 323 332, 2016.