CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova (1,2), Gleb Gusev (1,2), Aleksandr Vorobev (1), Anna Veronika Dorogush (1), Andrey Gulin (1)
(1) Yandex, Moscow, Russia
(2) Moscow Institute of Physics and Technology, Dolgoprudny, Russia
{ostroumova-la, gleb57, alvor88, annaveronika, gulin}@yandex-team.ru

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that the proposed algorithms solve it effectively, leading to excellent empirical results.

1 Introduction

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For many years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others [5, 26, 29, 32]. Gradient boosting is essentially a process of constructing an ensemble predictor by performing gradient descent in a functional space. It is backed by solid theoretical results that explain how strong predictors can be built by iteratively combining weaker models (base predictors) in a greedy manner [17].

We show in this paper that all existing implementations of gradient boosting face the following statistical issue. A prediction model $F$ obtained after several steps of boosting relies on the targets of all training examples. We demonstrate that this actually leads to a shift of the distribution of $F(\mathbf{x}_k) \mid \mathbf{x}_k$ for a training example $\mathbf{x}_k$ from the distribution of $F(\mathbf{x}) \mid \mathbf{x}$ for a test example $\mathbf{x}$. This finally leads to a prediction shift of the learned model. We identify this problem as a special kind of target leakage in Section 4.

Further, there is a similar issue in standard algorithms for preprocessing categorical features. One of the most effective ways [6, 25] to use them in gradient boosting is converting categories to their target statistics. A target statistic is a simple statistical model itself, and it can also cause target leakage and a prediction shift. We analyze this in Section 3.

In this paper, we propose the ordering principle to solve both problems. Relying on it, we derive ordered boosting, a modification of the standard gradient boosting algorithm which avoids target leakage (Section 4), and a new algorithm for processing categorical features (Section 3). Their combination is implemented as an open-source library (https://github.com/catboost/catboost) called CatBoost (for "Categorical Boosting"), which outperforms the existing state-of-the-art implementations of gradient boosted decision trees, XGBoost [8] and LightGBM [16], on a diverse set of popular machine learning tasks (see Section 6).
2 Background

Assume we observe a dataset of examples $\mathcal{D} = \{(\mathbf{x}_k, y_k)\}_{k=1..n}$, where $\mathbf{x}_k = (x^1_k, \ldots, x^m_k)$ is a random vector of $m$ features and $y_k \in \mathbb{R}$ is a target, which can be either binary or a numerical response. Examples $(\mathbf{x}_k, y_k)$ are independent and identically distributed according to some unknown distribution $P(\cdot, \cdot)$. The goal of a learning task is to train a function $F: \mathbb{R}^m \to \mathbb{R}$ which minimizes the expected loss $\mathcal{L}(F) := \mathbb{E}\, L(y, F(\mathbf{x}))$. Here $L(\cdot, \cdot)$ is a smooth loss function and $(\mathbf{x}, y)$ is a test example sampled from $P$ independently of the training set $\mathcal{D}$.

A gradient boosting procedure [12] iteratively builds a sequence of approximations $F^t: \mathbb{R}^m \to \mathbb{R}$, $t = 0, 1, \ldots$ in a greedy fashion. Namely, $F^t$ is obtained from the previous approximation $F^{t-1}$ in an additive manner: $F^t = F^{t-1} + \alpha h^t$, where $\alpha$ is a step size and the function $h^t: \mathbb{R}^m \to \mathbb{R}$ (a base predictor) is chosen from a family of functions $H$ in order to minimize the expected loss:

$$h^t = \underset{h \in H}{\arg\min}\; \mathcal{L}(F^{t-1} + h) = \underset{h \in H}{\arg\min}\; \mathbb{E}\, L(y, F^{t-1}(\mathbf{x}) + h(\mathbf{x})). \qquad (1)$$

The minimization problem is usually approached by the Newton method using a second-order approximation of $\mathcal{L}(F^{t-1} + h^t)$ at $F^{t-1}$ or by taking a (negative) gradient step. Both methods are kinds of functional gradient descent [10, 24]. In particular, the gradient step $h^t$ is chosen in such a way that $h^t(\mathbf{x})$ approximates $-g^t(\mathbf{x}, y)$, where $g^t(\mathbf{x}, y) := \frac{\partial L(y, s)}{\partial s}\big|_{s = F^{t-1}(\mathbf{x})}$. Usually, the least-squares approximation is used:

$$h^t = \underset{h \in H}{\arg\min}\; \mathbb{E}\left(-g^t(\mathbf{x}, y) - h(\mathbf{x})\right)^2. \qquad (2)$$

CatBoost is an implementation of gradient boosting which uses binary decision trees as base predictors. A decision tree [4, 10, 27] is a model built by a recursive partition of the feature space $\mathbb{R}^m$ into several disjoint regions (tree nodes) according to the values of some splitting attributes $a$. Attributes are usually binary variables that indicate whether some feature $x^k$ exceeds some threshold $t$, that is, $a = \mathbb{1}_{\{x^k > t\}}$, where $x^k$ is either a numerical or a binary feature; in the latter case $t = 0.5$ (see Footnote 2). Each final region (leaf of the tree) is assigned a value, which is an estimate of the response $y$ in the region for the regression task or the predicted class label in the case of a classification problem (see Footnote 3). In this way, a decision tree $h$ can be written as

$$h(\mathbf{x}) = \sum_{j} b_j \mathbb{1}_{\{\mathbf{x} \in R_j\}}, \qquad (3)$$

where $R_j$ are the disjoint regions corresponding to the leaves of the tree.

Footnote 2: Alternatively, non-binary splits can be used, e.g., a region can be split according to all values of a categorical feature. However, such splits, compared to binary ones, would lead to either shallow trees (unable to capture complex dependencies) or to very complex trees with an exponential number of terminal nodes (having weaker target statistics in each of them). According to [4], the tree complexity has a crucial effect on the accuracy of the model, and less complex trees are less prone to overfitting.

Footnote 3: In a regression task, splitting attributes and leaf values are usually chosen by the least-squares criterion. Note that, in gradient boosting, a tree is constructed to approximate the negative gradient (see Equation (2)), so it solves a regression problem.
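To make the generic gradient step of Equations (1)-(2) concrete, the following minimal sketch (ours, not part of the paper) fits each base predictor to the negative gradient by least squares, using the quadratic loss and scikit-learn regression trees as the family $H$; the synthetic data, tree depth, step size, and iteration count are illustrative choices only.

```python
# One boosting step per Eq. (2): fit h^t by least squares to -g^t, then F^t = F^{t-1} + alpha*h^t.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

alpha, n_iters = 0.1, 100          # step size and number of boosting steps (illustrative)
F = np.full(len(y), y.mean())      # F^0: a constant initial approximation

for t in range(n_iters):
    # For the quadratic loss L(y, s) = (y - s)^2 / 2, g^t = dL/ds at s = F^{t-1}(x) = F - y,
    # so the negative gradient is simply the residual y - F.
    neg_grad = y - F
    h = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)   # least-squares fit, Eq. (2)
    F = F + alpha * h.predict(X)                              # additive update

print("training MSE after boosting:", np.mean((y - F) ** 2))
```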
3 Categorical features

3.1 Related work on categorical features

A categorical feature is one with a discrete set of values called categories that are not comparable to each other. One popular technique for dealing with categorical features in boosted trees is one-hot encoding [7, 25], i.e., for each category, adding a new binary feature indicating it. However, in the case of high-cardinality features (e.g., a "user ID" feature), such a technique leads to an infeasibly large number of new features. To address this issue, one can group categories into a limited number of clusters and then apply one-hot encoding. A popular method is to group categories by target statistics (TS) that estimate the expected target value in each category. Micci-Barreca [25] proposed to consider TS as a new numerical feature instead. Importantly, among all possible partitions of categories into two sets, an optimal split on the training data in terms of logloss, Gini index, or MSE can be found among thresholds for the numerical TS feature [4, Section 4.2.2] [11, Section 9.2.4].

In LightGBM [20], categorical features are converted to gradient statistics at each step of gradient boosting. Though providing important information for building a tree, this approach can dramatically increase (i) computation time, since it calculates statistics for each categorical value at each step, and (ii) memory consumption, to store which category belongs to which node for each split based on a categorical feature. To overcome this issue, LightGBM groups tail categories into one cluster [21] and thus loses part of the information. Besides, the authors claim that it is still better to convert categorical features with high cardinality to numerical features [19]. Note that TS features require calculating and storing only one number per category.

Thus, using TS as new numerical features seems to be the most efficient method of handling categorical features with minimum information loss. TS are widely used, e.g., in the click prediction task (click-through rates) [1, 15, 18, 22], where such categorical features as user, region, ad, and publisher play a crucial role. We further focus on ways to calculate TS and leave one-hot encoding and gradient statistics out of the scope of the current paper. At the same time, we believe that the ordering principle proposed in this paper is also effective for gradient statistics.

3.2 Target statistics

As discussed in Section 3.1, an effective and efficient way to deal with a categorical feature $i$ is to substitute the category $x^i_k$ of the $k$-th training example with one numeric feature equal to some target statistic (TS) $\hat{x}^i_k$. Commonly, it estimates the expected target $y$ conditioned on the category: $\hat{x}^i_k \approx \mathbb{E}(y \mid x^i = x^i_k)$.

Greedy TS. A straightforward approach is to estimate $\mathbb{E}(y \mid x^i = x^i_k)$ as the average value of $y$ over the training examples with the same category $x^i_k$ [25]. This estimate is noisy for low-frequency categories, and one usually smoothes it by some prior $p$:

$$\hat{x}^i_k = \frac{\sum_{j=1}^{n} \mathbb{1}_{\{x^i_j = x^i_k\}} \cdot y_j + a\, p}{\sum_{j=1}^{n} \mathbb{1}_{\{x^i_j = x^i_k\}} + a}, \qquad (4)$$

where $a > 0$ is a parameter. A common setting for $p$ is the average target value in the dataset [25].

The problem of such a greedy approach is target leakage: the feature $\hat{x}^i_k$ is computed using $y_k$, the target of $\mathbf{x}_k$. This leads to a conditional shift [30]: the distribution of $\hat{x}^i \mid y$ differs for training and test examples. The following extreme example illustrates how dramatically this may affect the generalization error of the learned model. Assume the $i$-th feature is categorical, all its values are unique, and for each category $A$, we have $P(y = 1 \mid x^i = A) = 0.5$ for a classification task. Then, on the training dataset, $\hat{x}^i_k = \frac{y_k + a p}{1 + a}$, so it is sufficient to make only one split with threshold $t = \frac{0.5 + a p}{1 + a}$ to perfectly classify all training examples. However, for all test examples, the value of the greedy TS is $p$, and the obtained model predicts 0 for all of them if $p < t$ and predicts 1 otherwise, thus having accuracy 0.5 in both cases.
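As an illustration of Equation (4), the following sketch (ours, not from the paper) computes a greedy TS with smoothing over the whole training set; the column names and the value of $a$ are illustrative. Because each example's own target enters its TS, this is exactly the leaky variant discussed above.

```python
# Greedy TS of Eq. (4): smoothed per-category mean of the target, computed on all data.
import pandas as pd

def greedy_ts(cat: pd.Series, y: pd.Series, a: float = 1.0) -> pd.Series:
    p = y.mean()                                   # prior p: average target in the dataset
    stats = pd.DataFrame({"cat": cat, "y": y}).groupby("cat")["y"].agg(["sum", "count"])
    ts = (stats["sum"] + a * p) / (stats["count"] + a)
    return cat.map(ts)                             # every example gets its category's TS

df = pd.DataFrame({"ad_topic": ["cars", "cars", "travel", "music", "music", "music"],
                   "click":    [1, 0, 1, 0, 1, 1]})
df["ad_topic_ts"] = greedy_ts(df["ad_topic"], df["click"], a=1.0)
print(df)
```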
To this end, we formulate the following desired property for TS:

P1. $\mathbb{E}(\hat{x}^i \mid y = v) = \mathbb{E}(\hat{x}^i_k \mid y_k = v)$, where $(\mathbf{x}_k, y_k)$ is the $k$-th training example.

In our example above, $\mathbb{E}(\hat{x}^i_k \mid y_k) = \frac{y_k + a p}{1 + a}$ and $\mathbb{E}(\hat{x}^i \mid y) = p$ are different. There are several ways to avoid this conditional shift. Their general idea is to compute the TS for $\mathbf{x}_k$ on a subset of examples $\mathcal{D}_k \subset \mathcal{D} \setminus \{\mathbf{x}_k\}$ excluding $\mathbf{x}_k$:

$$\hat{x}^i_k = \frac{\sum_{\mathbf{x}_j \in \mathcal{D}_k} \mathbb{1}_{\{x^i_j = x^i_k\}} \cdot y_j + a\, p}{\sum_{\mathbf{x}_j \in \mathcal{D}_k} \mathbb{1}_{\{x^i_j = x^i_k\}} + a}. \qquad (5)$$

Holdout TS. One way is to partition the training dataset into two parts $\mathcal{D} = \hat{\mathcal{D}}_0 \sqcup \hat{\mathcal{D}}_1$ and use $\mathcal{D}_k = \hat{\mathcal{D}}_0$ for calculating the TS according to (5) and $\hat{\mathcal{D}}_1$ for training (e.g., applied in [8] for the Criteo dataset). Though such a holdout TS satisfies P1, this approach significantly reduces the amount of data used both for training the model and calculating the TS. So, it violates the following desired property:

P2. Effective usage of all training data for calculating TS features and for learning a model.

Leave-one-out TS. At first glance, a leave-one-out technique might work well: take $\mathcal{D}_k = \mathcal{D} \setminus \{\mathbf{x}_k\}$ for training examples $\mathbf{x}_k$ and $\mathcal{D}_k = \mathcal{D}$ for test ones [31]. Surprisingly, it does not prevent target leakage. Indeed, consider a constant categorical feature: $x^i_k = A$ for all examples. Let $n^+$ be the number of examples with $y = 1$; then $\hat{x}^i_k = \frac{n^+ - y_k + a\, p}{n - 1 + a}$, and one can perfectly classify the training dataset by making a split with threshold $t = \frac{n^+ - 0.5 + a\, p}{n - 1 + a}$.

Ordered TS. CatBoost uses a more effective strategy. It relies on the ordering principle, the central idea of this paper, and is inspired by online learning algorithms, which get training examples sequentially in time [1, 15, 18, 22]. Clearly, the values of TS for each example rely only on the observed history. To adapt this idea to the standard offline setting, we introduce an artificial "time", i.e., a random permutation $\sigma$ of the training examples. Then, for each example, we use all the available "history" to compute its TS, i.e., take $\mathcal{D}_k = \{\mathbf{x}_j : \sigma(j) < \sigma(k)\}$ in Equation (5) for a training example and $\mathcal{D}_k = \mathcal{D}$ for a test one. The obtained ordered TS satisfies the requirement P1 and allows us to use all training data for learning the model (P2). Note that, if we use only one random permutation, then preceding examples have TS with much higher variance than subsequent ones. To this end, CatBoost uses different permutations for different steps of gradient boosting; see details in Section 5.
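The ordered TS can be sketched as a single pass over a random permutation, updating running per-category statistics so that each example only sees its own "history"; this is an illustrative reimplementation following the notation of Equation (5), not CatBoost's internal code, and the smoothing parameter and column names are arbitrary.

```python
# Ordered TS: D_k = {x_j : sigma(j) < sigma(k)}; a test example would use the full data.
import numpy as np
import pandas as pd

def ordered_ts(cat: pd.Series, y: pd.Series, a: float = 1.0, seed: int = 0) -> pd.Series:
    p = y.mean()                                          # prior p
    n = len(cat)
    order = np.random.default_rng(seed).permutation(n)    # artificial "time" sigma
    ts = np.empty(n)
    sums, counts = {}, {}                                 # running per-category statistics
    for pos in order:                                     # visit examples in sigma order
        c = cat.iloc[pos]
        ts[pos] = (sums.get(c, 0.0) + a * p) / (counts.get(c, 0) + a)   # history only
        sums[c] = sums.get(c, 0.0) + y.iloc[pos]          # add the example after using it
        counts[c] = counts.get(c, 0) + 1
    return pd.Series(ts, index=cat.index)

df = pd.DataFrame({"ad_topic": ["cars", "cars", "travel", "music", "music", "music"],
                   "click":    [1, 0, 1, 0, 1, 1]})
df["ad_topic_ordered_ts"] = ordered_ts(df["ad_topic"], df["click"])
print(df)
```

Note that the examples appearing early in the permutation receive a TS close to the prior $p$; this is the high-variance effect that motivates the use of several permutations mentioned above.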
4 Prediction shift and ordered boosting

4.1 Prediction shift

In this section, we reveal the problem of prediction shift in gradient boosting, which was neither recognized nor previously addressed. As in the case of TS, prediction shift is caused by a special kind of target leakage. Our solution is called ordered boosting and resembles the ordered TS method.

Let us go back to the gradient boosting procedure described in Section 2. In practice, the expectation in (2) is unknown and is usually approximated using the same dataset $\mathcal{D}$:

$$h^t = \underset{h \in H}{\arg\min}\; \frac{1}{n}\sum_{k=1}^{n} \left(-g^t(\mathbf{x}_k, y_k) - h(\mathbf{x}_k)\right)^2. \qquad (6)$$

Now we describe and analyze the following chain of shifts:

1. the conditional distribution of the gradient $g^t(\mathbf{x}_k, y_k) \mid \mathbf{x}_k$ (accounting for the randomness of $\mathcal{D} \setminus \{\mathbf{x}_k\}$) is shifted from that distribution on a test example $g^t(\mathbf{x}, y) \mid \mathbf{x}$;
2. in turn, the base predictor $h^t$ defined by Equation (6) is biased from the solution of Equation (2);
3. this, finally, affects the generalization ability of the trained model $F^t$.

As in the case of TS, these problems are caused by the target leakage. Indeed, the gradients used at each step are estimated using the target values of the same data points the current model $F^{t-1}$ was built on. However, the conditional distribution $F^{t-1}(\mathbf{x}_k) \mid \mathbf{x}_k$ for a training example $\mathbf{x}_k$ is shifted, in general, from the distribution $F^{t-1}(\mathbf{x}) \mid \mathbf{x}$ for a test example $\mathbf{x}$. We call this a prediction shift.

Related work on prediction shift. The shift of the gradient conditional distribution $g^t(\mathbf{x}_k, y_k) \mid \mathbf{x}_k$ was previously mentioned in papers on boosting [3, 13] but was not formally defined. Moreover, even the existence of a non-zero shift was not proved theoretically. Based on the out-of-bag estimation [2], Breiman proposed iterated bagging [3], which constructs a bagged weak learner at each iteration on the basis of out-of-bag residual estimates. However, as we formally show in Section E of the supplementary material, such residual estimates are still shifted. Besides, the bagging scheme increases the learning time by a factor of the number of data buckets. Subsampling of the dataset at each iteration, proposed by Friedman [13], addresses the problem much more heuristically and also only alleviates it.

Analysis of prediction shift. We formally analyze the problem of prediction shift in a simple case of a regression task with the quadratic loss function $L(y, \hat{y}) = (y - \hat{y})^2$ (we restrict the rest of Section 4 to this case, but the approaches of Section 4.2 are applicable to other tasks). In this case, the negative gradient $-g^t(\mathbf{x}_k, y_k)$ in Equation (6) can be substituted by the residual function $r^{t-1}(\mathbf{x}_k, y_k) := y_k - F^{t-1}(\mathbf{x}_k)$ (here we removed the multiplier 2, which does not matter for further analysis). Assume we have $m = 2$ features $x^1, x^2$ that are i.i.d. Bernoulli random variables with $p = 1/2$ and $y = f(\mathbf{x}) = c_1 x^1 + c_2 x^2$. Assume we make $N = 2$ steps of gradient boosting with decision stumps (trees of depth 1) and step size $\alpha = 1$. We obtain a model $F = F^2 = h^1 + h^2$. W.l.o.g., we assume that $h^1$ is based on $x^1$ and $h^2$ is based on $x^2$, which is typical for $|c_1| > |c_2|$ (here we set some asymmetry between $x^1$ and $x^2$).

Theorem 1.
1. If two independent samples $\mathcal{D}_1$ and $\mathcal{D}_2$ of size $n$ are used to estimate $h^1$ and $h^2$, respectively, using Equation (6), then $\mathbb{E}_{\mathcal{D}_1, \mathcal{D}_2} F^2(\mathbf{x}) = f(\mathbf{x}) + O(1/2^n)$ for any $\mathbf{x} \in \{0, 1\}^2$.
2. If the same dataset $\mathcal{D} = \mathcal{D}_1 = \mathcal{D}_2$ is used in Equation (6) for both $h^1$ and $h^2$, then $\mathbb{E}_{\mathcal{D}} F^2(\mathbf{x}) = f(\mathbf{x}) - \frac{1}{n-1} c_2 \left(x^2 - \frac{1}{2}\right) + O(1/2^n)$.

This theorem means that the trained model is an unbiased estimate of the true dependence $y = f(\mathbf{x})$ when we use independent datasets at each gradient step (up to an exponentially small term, which occurs for a technical reason). On the other hand, if we use the same dataset at each step, we suffer from a bias $-\frac{1}{n-1} c_2 \left(x^2 - \frac{1}{2}\right)$, which is inversely proportional to the data size $n$. Also, the value of the bias can depend on the relation $f$: in our example, it is proportional to $c_2$. We track the chain of shifts for the second part of Theorem 1 in a sketch of the proof below, while the full proof of Theorem 1 is available in the supplementary material (Section A).

Sketch of the proof. Denote by $\xi_{st}$, $s, t \in \{0, 1\}$, the number of examples $(\mathbf{x}_k, y_k) \in \mathcal{D}$ with $\mathbf{x}_k = (s, t)$. We have $h^1(s, t) = c_1 s + c_2 \frac{\xi_{s1}}{\xi_{s0} + \xi_{s1}}$. Its expectation $\mathbb{E}(h^1(\mathbf{x}))$ on a test example $\mathbf{x}$ equals $c_1 x^1 + \frac{c_2}{2}$. At the same time, the expectation $\mathbb{E}(h^1(\mathbf{x}_k))$ on a training example $\mathbf{x}_k$ is different and equals $\left(c_1 x^1 + \frac{c_2}{2}\right) + c_2\,\frac{2x^2 - 1}{n} + O(2^{-n})$. That is, we experience a prediction shift of $h^1$. As a consequence, the expected value of $h^2(\mathbf{x})$ is $\mathbb{E}(h^2(\mathbf{x})) = c_2\left(x^2 - \frac{1}{2}\right)\left(1 - \frac{1}{n-1}\right) + O(2^{-n})$ on a test example $\mathbf{x}$, and $\mathbb{E}(h^1(\mathbf{x}) + h^2(\mathbf{x})) = f(\mathbf{x}) - \frac{1}{n-1} c_2 \left(x^2 - \frac{1}{2}\right) + O(1/2^n)$.

Finally, recall that the greedy TS $\hat{x}^i$ can be considered as a simple statistical model predicting the target $y$, and it suffers from a similar problem, a conditional shift of $\hat{x}^i_k \mid y_k$, caused by the target leakage, i.e., using $y_k$ to compute $\hat{x}^i_k$.
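The bias in the second part of Theorem 1 can be checked numerically. The following Monte Carlo sketch (ours, not the paper's proof) reproduces the setting of the theorem: $y = c_1 x^1 + c_2 x^2$ with i.i.d. Bernoulli(1/2) features, two boosting steps with depth-1 stumps and step size 1, where $h^1$ splits on $x^1$ and $h^2$ on $x^2$; the constants, the sample size, and the number of trials are arbitrary illustrative choices.

```python
# Compare E_D F^2(x) - f(x) when h2 reuses the dataset of h1 vs. when it gets a fresh one.
import numpy as np

rng = np.random.default_rng(0)
c1, c2, n, trials = 6.0, 1.0, 20, 50_000

def stump(col, target):
    """Least-squares stump on a binary column: leaf value = mean target per value."""
    return np.array([target[col == v].mean() if np.any(col == v) else 0.0 for v in (0, 1)])

test = np.array([(s, t) for s in (0, 1) for t in (0, 1)])      # the four points of {0,1}^2
f_test = c1 * test[:, 0] + c2 * test[:, 1]
bias = {"shared": np.zeros(4), "independent": np.zeros(4)}

for _ in range(trials):
    d1 = rng.integers(0, 2, (n, 2))                            # dataset used to fit h1
    d2 = rng.integers(0, 2, (n, 2))                            # fresh dataset for h2
    for mode, d in (("shared", d1), ("independent", d2)):
        leaf1 = stump(d1[:, 0], c1 * d1[:, 0] + c2 * d1[:, 1]) # h1 fit on d1 (splits on x1)
        resid = (c1 * d[:, 0] + c2 * d[:, 1]) - leaf1[d[:, 0]] # residuals on d
        leaf2 = stump(d[:, 1], resid)                          # h2 fit on the residuals (splits on x2)
        pred = leaf1[test[:, 0]] + leaf2[test[:, 1]]           # F^2 on the test points
        bias[mode] += (pred - f_test) / trials

print("empirical bias, shared dataset:        ", np.round(bias["shared"], 4))
print("Theorem 1 value -c2*(x2 - 1/2)/(n - 1):", np.round(-c2 * (test[:, 1] - 0.5) / (n - 1), 4))
print("empirical bias, independent datasets:  ", np.round(bias["independent"], 4))
```

With these illustrative settings, the empirical bias for the shared dataset should closely track $-c_2(x^2 - 1/2)/(n-1)$, while the bias for independent datasets stays near zero, in line with the theorem.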
4.2 Ordered boosting

Here we propose a boosting algorithm which does not suffer from the prediction shift problem described in Section 4.1. Assuming access to an unlimited amount of training data, we can easily construct such an algorithm. At each step of boosting, we sample a new dataset $\mathcal{D}_t$ independently and obtain unshifted residuals by applying the current model to new training examples. In practice, however, labeled data is limited. Assume that we learn a model with $I$ trees. To make the residual $r^{I-1}(\mathbf{x}_k, y_k)$ unshifted, we need to have $F^{I-1}$ trained without the example $\mathbf{x}_k$. Since we need unbiased residuals for all training examples, no examples may be used for training $F^{I-1}$, which at first glance makes the training process impossible. However, it is possible to maintain a set of models differing by the examples used for their training. Then, for calculating the residual on an example, we use a model trained without it. In order to construct such a set of models, we can use the ordering principle previously applied to TS in Section 3.2. To illustrate the idea, assume that we take one random permutation $\sigma$ of the training examples and maintain $n$ different supporting models $M_1, \ldots, M_n$ such that the model $M_i$ is learned using only the first $i$ examples in the permutation. At each step, in order to obtain the residual for the $j$-th sample, we use the model $M_{j-1}$ (see Figure 1). The resulting Algorithm 1 is called ordered boosting below. Unfortunately, this algorithm is not feasible in most practical tasks due to the need to train $n$ different models, which increases the complexity and memory requirements by $n$ times. In CatBoost, we implemented a modification of this algorithm on the basis of the gradient boosting algorithm with decision trees as base predictors (GBDT), described in Section 5.

Ordered boosting with categorical features. In Sections 3.2 and 4.2 we proposed to use random permutations $\sigma_{cat}$ and $\sigma_{boost}$ of training examples for the TS calculation and for ordered boosting, respectively. Combining them in one algorithm, we should take $\sigma_{cat} = \sigma_{boost}$ to avoid prediction shift. This guarantees that the target $y_i$ is not used for training $M_i$ (neither for the TS calculation, nor for the gradient estimation). See Section F of the supplementary material for theoretical guarantees. Empirical results confirming the importance of having $\sigma_{cat} = \sigma_{boost}$ are presented in Section G of the supplementary material.

5 Practical implementation of ordered boosting

CatBoost has two boosting modes, Ordered and Plain. The latter mode is the standard GBDT algorithm with inbuilt ordered TS. The former mode presents an efficient modification of Algorithm 1. A formal description of the algorithm is included in Section B of the supplementary material. In this section, we overview the most important implementation details.

[Figure 1: Ordered boosting principle; the examples are ordered according to σ, and the residual of example 7 is computed as $r^t(\mathbf{x}_7, y_7) = y_7 - M_6(\mathbf{x}_7)$, using the model $M_6$ built on the preceding examples.]
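The ordered boosting principle of Figure 1, formalized as Algorithm 1 below, can also be sketched directly in code: model $M_i$ is built only from the first $i$ examples of the permutation, so the residual of example $i$ is always computed by a model that has never seen its target. The sketch below is an illustrative, deliberately unoptimized rendering (ours) with squared loss and scikit-learn stumps; it maintains the $n$ supporting models whose cost motivates the practical modification of Section 5.

```python
# Illustrative Algorithm 1: one supporting model per prefix of a random permutation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ordered_boosting(X, y, n_iters=10, alpha=0.5, max_depth=1, seed=0):
    n = len(y)
    order = np.random.default_rng(seed).permutation(n)
    X, y = X[order], y[order]                 # after reordering, position i is the i-th example in sigma
    M = np.zeros((n + 1, n))                  # M[j, :] = predictions of model M_j on all training examples
    trees = []                                # the trees composing M_n (usable at test time)
    for _ in range(n_iters):
        # Residual of example i uses M_i, which was fit on positions 0..i-1 only,
        # so y_i never leaks into the model that evaluates it (M[0] is the zero model).
        r = y - M[np.arange(n), np.arange(n)]
        for j in range(1, n + 1):
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X[:j], r[:j])
            M[j] += alpha * tree.predict(X)   # grow M_j using only its own prefix
            if j == n:
                trees.append(tree)
    return lambda X_new: alpha * sum(t.predict(X_new) for t in trees)

# usage on toy data (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)
predict = ordered_boosting(X, y)
print("MSE on the training features (for illustration):", np.mean((y[np.argsort(np.random.default_rng(0).permutation(60))] - y[np.argsort(np.random.default_rng(0).permutation(60))]) ** 0) and np.mean((predict(X) - (X[:, 0] + 0.5 * X[:, 1])) ** 2))
```

Fitting $n$ models per iteration is exactly the overhead the paper calls infeasible; the Complexity paragraph of Section 5 below describes how CatBoost keeps only exponentially spaced prefixes instead.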
Algorithm 1: Ordered boosting
    input: {(x_k, y_k)}_{k=1}^n, I
    σ ← random permutation of [1, n];
    M_i ← 0 for i = 1..n;
    for t ← 1 to I do
        for i ← 1 to n do
            r_i ← y_i − M_{σ(i)−1}(x_i);
        for i ← 1 to n do
            ΔM ← LearnModel((x_j, r_j) for j : σ(j) ≤ i);
            M_i ← M_i + ΔM;

Algorithm 2: Building a tree in CatBoost
    input: M, {(x_i, y_i)}_{i=1}^n, α, L, {σ_i}_{i=1}^s, Mode
    grad ← CalcGradient(L, M, y);
    r ← random(1, s);
    if Mode = Plain then G ← (grad_r(i) for i = 1..n);
    if Mode = Ordered then G ← (grad_{r,σ_r(i)−1}(i) for i = 1..n);
    T ← empty tree;
    foreach step of the top-down procedure do
        foreach candidate split c do
            T_c ← add split c to T;
            if Mode = Plain then
                Δ(i) ← avg(grad_r(p) for p : leaf_r(p) = leaf_r(i)) for i = 1..n;
            if Mode = Ordered then
                Δ(i) ← avg(grad_{r,σ_r(i)−1}(p) for p : leaf_r(p) = leaf_r(i), σ_r(p) < σ_r(i)) for i = 1..n;
            loss(T_c) ← cos(Δ, G);
        T ← arg min_{T_c} loss(T_c);
    if Mode = Plain then
        M_{r'}(i) ← M_{r'}(i) − α · avg(grad_{r'}(p) for p : leaf_{r'}(p) = leaf_{r'}(i)) for r' = 1..s, i = 1..n;
    if Mode = Ordered then
        M_{r',j}(i) ← M_{r',j}(i) − α · avg(grad_{r',j}(p) for p : leaf_{r'}(p) = leaf_{r'}(i), σ_{r'}(p) ≤ j) for r' = 1..s, i = 1..n, j ≥ σ_{r'}(i) − 1;
    return T, M

At the start, CatBoost generates $s + 1$ independent random permutations of the training dataset. The permutations $\sigma_1, \ldots, \sigma_s$ are used for evaluation of splits that define tree structures (i.e., the internal nodes), while $\sigma_0$ serves for choosing the leaf values $b_j$ of the obtained trees (see Equation (3)). For examples with a short history in a given permutation, both TS and the predictions used by ordered boosting ($M_{\sigma(i)-1}(\mathbf{x}_i)$ in Algorithm 1) have a high variance. Therefore, using only one permutation may increase the variance of the final model predictions, while several permutations allow us to reduce this effect in a way we further describe. The advantage of several permutations is confirmed by our experiments in Section 6.

Building a tree. In CatBoost, base predictors are oblivious decision trees [9, 14], also called decision tables [23]. The term "oblivious" means that the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up execution at testing time significantly. The procedure of building a tree in CatBoost is described in Algorithm 2.

In the Ordered boosting mode, during the learning process, we maintain the supporting models $M_{r,j}$, where $M_{r,j}(i)$ is the current prediction for the $i$-th example based on the first $j$ examples in the permutation $\sigma_r$. At each iteration $t$ of the algorithm, we sample a random permutation $\sigma_r$ from $\{\sigma_1, \ldots, \sigma_s\}$ and construct a tree $T_t$ on the basis of it. First, for categorical features, all TS are computed according to this permutation. Second, the permutation affects the tree learning procedure. Namely, based on $M_{r,j}(i)$, we compute the corresponding gradients $grad_{r,j}(i) = \frac{\partial L(y_i, s)}{\partial s}\big|_{s = M_{r,j}(i)}$. Then, while constructing a tree, we approximate the gradient $G$ in terms of the cosine similarity $\cos(\cdot, \cdot)$, where, for each example $i$, we take the gradient $grad_{r,\sigma_r(i)-1}(i)$ (it is based only on the previous examples in $\sigma_r$). At the candidate splits evaluation step, the leaf value $\Delta(i)$ for example $i$ is obtained individually by averaging the gradients $grad_{r,\sigma_r(i)-1}$ of the preceding examples $p$ lying in the same leaf $leaf_r(i)$ the example $i$ belongs to. Note that $leaf_r(i)$ depends on the chosen permutation $\sigma_r$, because $\sigma_r$ can influence the values of the ordered TS for example $i$.
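The ordered leaf-value estimates $\Delta(i)$ can be computed in a single pass over the permutation with running per-leaf sums. The sketch below (ours) is a simplified variant: it reuses, for each preceding example $p$, the same per-example gradient that enters $G$, whereas Algorithm 2 recomputes $grad_{r,\sigma_r(i)-1}(p)$ per evaluated example $i$; the names `leaf_idx`, `grad`, and `sigma` stand in for $leaf_r$, the gradients, and $\sigma_r$, and the fallback value 0 for an empty history is an assumption.

```python
# Delta(i): average gradient over examples in the same leaf that precede i in sigma.
import numpy as np

def ordered_leaf_values(leaf_idx, grad, sigma):
    n = len(grad)
    delta = np.zeros(n)
    sums, counts = {}, {}                    # running statistics per leaf
    for i in np.argsort(sigma):              # visit examples in the order given by sigma
        leaf = leaf_idx[i]
        c = counts.get(leaf, 0)
        delta[i] = sums.get(leaf, 0.0) / c if c > 0 else 0.0   # history-only average
        sums[leaf] = sums.get(leaf, 0.0) + grad[i]
        counts[leaf] = c + 1
    return delta

leaf_idx = np.array([0, 1, 0, 1, 0, 1])      # leaf assignment under a candidate split
grad = np.array([0.5, -0.2, 0.1, -0.4, 0.3, -0.1])
sigma = np.array([2, 0, 4, 1, 3, 5])         # sigma[i] = position of example i in the permutation
print(ordered_leaf_values(leaf_idx, grad, sigma))
# The resulting vector Delta is then compared with G via cosine similarity as in Algorithm 2.
```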
When the tree structure $T_t$ (i.e., the sequence of splitting attributes) is built, we use it to boost all the models $M_{r',j}$. Let us stress that one common tree structure $T_t$ is used for all the models, but this tree is added to different $M_{r',j}$ with different sets of leaf values depending on $r'$ and $j$, as described in Algorithm 2.

The Plain boosting mode works similarly to a standard GBDT procedure, but, if categorical features are present, it maintains $s$ supporting models $M_r$ corresponding to TS based on $\sigma_1, \ldots, \sigma_s$.

Choosing leaf values. Given all the trees constructed, the leaf values of the final model $F$ are calculated by the standard gradient boosting procedure equally for both modes. Training examples $i$ are matched to leaves $leaf_0(i)$, i.e., we use the permutation $\sigma_0$ to calculate TS here. When the final model $F$ is applied to a new example at testing time, we use TS calculated on the whole training data according to Section 3.2.

Complexity. In our practical implementation, we use one important trick, which significantly reduces the computational complexity of the algorithm. Namely, in the Ordered mode, instead of $O(s\,n^2)$ values $M_{r,j}(i)$, we store and update only the values $M'_{r,j}(i) := M_{r,2^j}(i)$ for $j = 1, \ldots, \lceil \log_2 n \rceil$ and all $i$ with $\sigma_r(i) \leq 2^{j+1}$, which reduces the number of maintained supporting predictions to $O(s\,n)$. See Section B of the supplementary material for the pseudocode of this modification of Algorithm 2.

In Table 1, we present the computational complexity of the different components of both CatBoost modes per one iteration (see Section C.1 of the supplementary material for the proof). Here $N_{TS,t}$ is the number of TS to be calculated at iteration $t$, and $C$ is the set of candidate splits to be considered at the given iteration. It follows that our implementation of ordered boosting with decision trees has the same asymptotic complexity as the standard GBDT with ordered TS. In comparison with other types of TS (Section 3.2), ordered TS slow down the procedures CalcGradient, updating the supporting models $M$, and the computation of TS by a factor of $s$.

Table 1: Computational complexity per iteration $t$.
Procedure  | CalcGradient | Build $T$   | Calc all $b^t_j$ | Update $M$ | Calc ordered TS
Complexity | $O(s\,n)$    | $O(|C|\,n)$ | $O(n)$           | $O(s\,n)$  | $O(N_{TS,t}\,n)$

Feature combinations. Another important detail of CatBoost is using combinations of categorical features as additional categorical features which capture high-order dependencies, like the joint information of user ID and ad topic in the task of ad click prediction. The number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to process all of them. CatBoost constructs combinations in a greedy way. Namely, for each split of a tree, CatBoost combines (concatenates) all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset. Combinations are converted to TS on the fly, as sketched below.
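The greedy construction of feature combinations can be illustrated as plain concatenation of the categorical columns already used by the current tree with each remaining categorical column, after which the combined column is converted to a TS like any other category. In the sketch below (ours), the column names, the separator, and the smoothed mean encoding standing in for the ordered TS are illustrative simplifications.

```python
# Greedy feature combination: concatenate used categorical features with another one,
# then encode the combined category with a TS.
import pandas as pd

def combine(df: pd.DataFrame, cols: list) -> pd.Series:
    """Concatenate several categorical columns into one combined categorical column."""
    return df[cols].astype(str).agg("|".join, axis=1)

df = pd.DataFrame({"user_id":  ["u1", "u2", "u1", "u3", "u1"],
                   "ad_topic": ["cars", "cars", "travel", "cars", "travel"],
                   "click":    [1, 0, 1, 0, 1]})

used_in_tree = ["user_id"]                        # categorical features of the previous splits
for other in ["ad_topic"]:                        # every other categorical feature in the dataset
    combo = combine(df, used_in_tree + [other])   # e.g. "u1|cars"
    # CatBoost would convert the combination to an *ordered* TS on the fly; a smoothed
    # mean encoding (a = 1, prior p = mean target) stands in here for brevity.
    p, a = df["click"].mean(), 1.0
    stats = df.groupby(combo)["click"].agg(["sum", "count"])
    df["ts_" + "+".join(used_in_tree + [other])] = combo.map((stats["sum"] + a * p) / (stats["count"] + a))
print(df)
```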
Other important details. Finally, let us discuss two options of the CatBoost algorithm not covered above. The first one is subsampling of the dataset at each iteration of the boosting procedure, as proposed by Friedman [13]. We claimed earlier in Section 4.1 that this approach alone cannot fully avoid the problem of prediction shift. However, since it has proved effective, we implemented it in both modes of CatBoost as a Bayesian bootstrap procedure. Specifically, before training a tree according to Algorithm 2, we assign a weight $w_i = a^t_i$ to each example $i$, where the $a^t_i$ are generated according to the Bayesian bootstrap procedure (see [28, Section 2]). These weights are used as multipliers for the gradients $grad_r(i)$ and $grad_{r,j}(i)$ when we calculate $\Delta(i)$ and the components of the vector $G$ that define $loss(T_c)$.

The second option deals with the first several examples in a permutation. For examples $i$ with small values of $\sigma_r(i)$, the variance of $grad_{r,\sigma_r(i)-1}(i)$ can be high. Therefore, we discard the examples from the beginning of the permutation when we calculate $loss(T_c)$ in Algorithm 2. In particular, we eliminate the corresponding components of the vectors $G$ and $\Delta$ when calculating the cosine similarity between them.

6 Experiments

Comparison with baselines. We compare our algorithm with the most popular open-source libraries, XGBoost and LightGBM, on several well-known machine learning tasks. A detailed description of the experimental setup together with dataset descriptions is available in the supplementary material (Section D). The source code of the experiment is available, and the results can be reproduced (https://github.com/catboost/benchmarks/tree/master/quality_benchmarks). For all learning algorithms, we preprocess categorical features using the ordered TS method described in Section 3.2. The parameter tuning and training were performed on 4/5 of the data, and the testing was performed on the remaining 1/5 (for Epsilon, we use default parameters instead of parameter tuning due to the large running time for all algorithms and tune only the number of trees to avoid overfitting). The results measured by logloss and zero-one loss are presented in Table 2 (the absolute values for the baselines can be found in Section G of the supplementary material). For CatBoost, we used the Ordered boosting mode in this experiment; the numbers for CatBoost in Table 2 may slightly differ from the corresponding numbers in our GitHub repository, since we use another version of CatBoost with all the discussed features implemented. One can see that CatBoost outperforms the other algorithms on all the considered datasets. We also measured the statistical significance of the improvements presented in Table 2: except for three datasets (Appetency, Churn and Upselling), the improvements are statistically significant with p-value 0.01 measured by the paired one-tailed t-test.

To demonstrate that our implementation of plain boosting is an appropriate baseline for our research, we show that a raw setting of CatBoost provides state-of-the-art quality. Particularly, we take a setting of CatBoost which is close to classical GBDT [12] and compare it with the baseline boosting implementations in Section G of the supplementary material. Experiments show that this raw setting differs from the baselines insignificantly.

Table 2: Comparison with baselines: logloss / zero-one loss (relative increase for baselines).
Dataset   | CatBoost      | LightGBM      | XGBoost
Adult     | 0.270 / 0.127 | +2.4% / +1.9% | +2.2% / +1.0%
Amazon    | 0.139 / 0.044 | +17% / +21%   | +17% / +21%
Click     | 0.392 / 0.156 | +1.2% / +1.2% | +1.2% / +1.2%
Epsilon   | 0.265 / 0.109 | +1.5% / +4.1% | +11% / +12%
Appetency | 0.072 / 0.018 | +0.4% / +0.2% | +0.4% / +0.7%
Churn     | 0.232 / 0.072 | +0.1% / +0.6% | +0.5% / +1.6%
Internet  | 0.209 / 0.094 | +6.8% / +8.6% | +7.9% / +8.0%
Upselling | 0.166 / 0.049 | +0.3% / +0.1% | +0.04% / +0.3%
Kick      | 0.286 / 0.095 | +3.5% / +4.4% | +3.2% / +4.1%

Table 3: Plain boosting mode: logloss, zero-one loss and their change relative to Ordered boosting mode.
Dataset   | Logloss        | Zero-one loss
Adult     | 0.272 (+1.1%)  | 0.127 (-0.1%)
Amazon    | 0.139 (-0.6%)  | 0.044 (-1.5%)
Click     | 0.392 (-0.05%) | 0.156 (+0.19%)
Epsilon   | 0.266 (+0.6%)  | 0.110 (+0.9%)
Appetency | 0.072 (+0.5%)  | 0.018 (+1.5%)
Churn     | 0.232 (-0.06%) | 0.072 (-0.17%)
Internet  | 0.217 (+3.9%)  | 0.099 (+5.4%)
Upselling | 0.166 (+0.1%)  | 0.049 (+0.4%)
Kick      | 0.285 (-0.2%)  | 0.095 (-0.1%)

We also empirically analyzed the running times of the algorithms on the Epsilon dataset. The details of the comparison can be found in the supplementary material (Section C.2). To summarize, we obtained that CatBoost Plain and LightGBM are the fastest ones, followed by the Ordered mode, which is about 1.7 times slower.
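For readers who want to try the two modes themselves, the following usage sketch relies on the open-source CatBoost Python package; the parameter names (boosting_type, cat_features, loss_function) reflect our understanding of the current package and are not taken from the paper, and the synthetic data, iteration count, and learning rate are arbitrary.

```python
# Hedged usage sketch: selecting the Ordered and Plain boosting modes of the CatBoost package.
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
region = rng.integers(0, 20, n)
price = rng.normal(size=n)
y = ((price + 0.3 * (region % 2) + rng.normal(scale=0.5, size=n)) > 0.3).astype(int)
X = pd.DataFrame({"user_id": rng.integers(0, 500, n).astype(str),   # high-cardinality categorical
                  "region": region.astype(str),
                  "price": price})
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for mode in ("Ordered", "Plain"):
    model = CatBoostClassifier(iterations=200, learning_rate=0.1, depth=6,
                               loss_function="Logloss", boosting_type=mode, verbose=False)
    model.fit(X_tr, y_tr, cat_features=["user_id", "region"])   # TS are built internally
    print(mode, "test logloss:", round(log_loss(y_te, model.predict_proba(X_te)[:, 1]), 4))
```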
Ordered and Plain modes. In this section, we compare the two essential boosting modes of CatBoost: Plain and Ordered. First, we compared their performance on all the considered datasets; the results are presented in Table 3. It can be clearly seen that the Ordered mode is particularly useful on small datasets. Indeed, the largest benefit from Ordered is observed on the Adult and Internet datasets, which are relatively small (less than 40K training examples), which supports our hypothesis that a higher bias negatively affects the performance. Indeed, according to Theorem 1 and our reasoning in Section 4.1, the bias is expected to be larger for smaller datasets (however, it can also depend on other properties of the dataset, e.g., on the dependency between features and target). In order to further validate this hypothesis, we make the following experiment: we train CatBoost in Ordered and Plain modes on randomly filtered datasets and compare the obtained losses, see Figure 2. As we expected, for smaller datasets the relative performance of Plain mode becomes worse. To save space, here we present the results only for logloss; the figure for zero-one loss is similar. We also compare Ordered and Plain modes in the above-mentioned raw setting of CatBoost in Section G of the supplementary material and conclude that the advantage of Ordered mode is not caused by interaction with specific CatBoost options.

Table 4: Comparison of target statistics: relative change in logloss / zero-one loss compared to ordered TS.
Dataset   | Greedy       | Holdout        | Leave-one-out
Adult     | +1.1% / +0.8% | +2.1% / +2.0%  | +5.5% / +3.7%
Amazon    | +40% / +32%   | +8.3% / +8.3%  | +4.5% / +5.6%
Click     | +13% / +6.7%  | +1.5% / +0.5%  | +2.7% / +0.9%
Appetency | +24% / +0.7%  | +1.6% / -0.5%  | +8.5% / +0.7%
Churn     | +12% / +2.1%  | +0.9% / +1.3%  | +1.6% / +1.8%
Internet  | +33% / +22%   | +2.6% / +1.8%  | +27% / +19%
Upselling | +57% / +50%   | +1.6% / +0.9%  | +3.9% / +2.9%
Kick      | +22% / +28%   | +1.3% / +0.32% | +3.7% / +3.3%

[Figure 2: Relative error (relative logloss, %) of Plain boosting mode compared to Ordered boosting mode depending on the fraction of the dataset; one curve per dataset (Adult, Amazon, Click prediction, KDD appetency, KDD churn, KDD internet, KDD upselling, Kick prediction).]

Analysis of target statistics. We compare the different TSs introduced in Section 3.2 as options of CatBoost in Ordered boosting mode, keeping all other algorithmic details the same; the results can be found in Table 4. Here, to save space, we present only the relative increase in the loss functions for each algorithm compared to CatBoost with ordered TS. Note that the ordered TS used in CatBoost significantly outperform all other approaches. Also, among the baselines, the holdout TS is the best for most of the datasets, since it does not suffer from the conditional shift discussed in Section 3.2 (P1); still, it is worse than CatBoost due to less effective usage of training data (P2). Leave-one-out is usually better than the greedy TS, but it can be much worse on some datasets, e.g., on Adult.
The reason is that the greedy TS suffer from low-frequency categories, while the leave-one-out TS suffer also from high-frequency ones, and on Adult all the features have high frequency. Finally, let us note that in Table 4 we combine the Ordered mode of CatBoost with different TSs. To generalize these results, we also made a similar experiment by combining different TSs with the Plain mode used in standard gradient boosting. The obtained results and conclusions turned out to be very similar to the ones discussed above.

Feature combinations. The effect of feature combinations discussed in Section 5 is demonstrated in Figure 1 of the supplementary material. On average, changing the number $c_{max}$ of features allowed to be combined from 1 to 2 provides an outstanding improvement of logloss by 1.86% (reaching 11.3%), changing from 1 to 3 yields 2.04%, and a further increase of $c_{max}$ does not influence the performance significantly.

Number of permutations. The effect of the number $s$ of permutations on the performance of CatBoost is presented in Figure 2 of the supplementary material. On average, increasing $s$ slightly decreases logloss, e.g., by 0.19% for $s = 3$ and by 0.38% for $s = 9$ compared to $s = 1$.

7 Conclusion

In this paper, we identify and analyze the problem of prediction shift present in all existing implementations of gradient boosting. We propose a general solution, ordered boosting with ordered TS, which solves the problem. This idea is implemented in CatBoost, which is a new gradient boosting library. Empirical results demonstrate that CatBoost outperforms leading GBDT packages and leads to new state-of-the-art results on common benchmarks.

Acknowledgments

We are very grateful to Mikhail Bilenko for important references and advice that led to the theoretical analysis of this paper, as well as for suggestions on the presentation. We also thank Pavel Serdyukov for many helpful discussions and valuable links, and Nikita Kazeev, Nikita Dmitriev, Stanislav Kirillov and Victor Omelyanenko for help with the experiments.

References

[1] L. Bottou and Y. L. Cun. Large scale online learning. In Advances in Neural Information Processing Systems, pages 217-224, 2004.
[2] L. Breiman. Out-of-bag estimation, 1996.
[3] L. Breiman. Using iterated bagging to debias regressions. Machine Learning, 45(3):261-277, 2001.
[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC Press, 1984.
[5] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161-168. ACM, 2006.
[6] B. Cestnik et al. Estimating probabilities: a crucial task in machine learning. In ECAI, volume 90, pages 147-149, 1990.
[7] O. Chapelle, E. Manavoglu, and R. Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2015.
[8] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
[9] M. Ferov and M. Modrý. Enhancing LambdaMART using oblivious trees. arXiv preprint arXiv:1609.05610, 2016.
[10] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000.
[11] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[12] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232, 2001.
[13] J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.
[14] A. Gulin, I. Kuralenok, and D. Pavlov. Winning the transfer learning track of Yahoo!'s learning to rank challenge with YetiRank. In Yahoo! Learning to Rank Challenge, pages 63-76, 2011.
[15] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1-9. ACM, 2014.
[16] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149-3157, 2017.
[17] M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67-95, 1994.
[18] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777-801, 2009.
[19] LightGBM. Categorical feature support. http://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support, 2017.
[20] LightGBM. Optimal split for categorical features. http://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features, 2017.
[21] LightGBM. feature_histogram.hpp. https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp, 2018.
[22] X. Ling, W. Deng, C. Gu, H. Zhou, C. Li, and F. Sun. Model ensemble for click prediction in Bing search ads. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 689-698. International World Wide Web Conferences Steering Committee, 2017.
[23] Y. Lou and M. Obukhov. BDT: Gradient boosted decision tables for high accuracy and scoring efficiency. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1893-1901. ACM, 2017.
[24] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512-518, 2000.
[25] D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27-32, 2001.
[26] B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2):577-584, 2005.
[27] L. Rokach and O. Maimon. Top-down induction of decision trees classifiers: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4):476-487, 2005.
[28] D. B. Rubin. The Bayesian bootstrap. The Annals of Statistics, pages 130-134, 1981.
[29] Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting boosting for information retrieval measures. Information Retrieval, 13(3):254-270, 2010.
[30] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819-827, 2013.
[31] O. Zhang. Winning data science competitions. https://www.slideshare.net/ShangxuanZhang/winning-data-science-competitions-presented-by-owen-zhang, 2015.
[32] Y. Zhang and A. Haghani. A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies, 58:308-324, 2015.