# Faster Boosting with Smaller Memory

Julaiti Alafate
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093

Yoav Freund
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA 92093

## Abstract

State-of-the-art implementations of boosting, such as XGBoost and LightGBM, can process large training sets extremely fast. However, this performance requires that the memory size is sufficient to hold a 2-3 multiple of the training set size. This paper presents an alternative approach to implementing boosted trees, which achieves a significant speedup over XGBoost and LightGBM, especially when the memory size is small. This is achieved using a combination of three techniques: early stopping, effective sample size, and stratified sampling. Our experiments demonstrate a 10-100x speedup over XGBoost when the training data is too large to fit in memory.

## 1 Introduction

Boosting [7, 16], and in particular gradient boosted trees [9], are some of the most popular learning algorithms used in practice. There are several highly optimized implementations of boosting, among which XGBoost [5] and LightGBM [12] are two of the most popular. These implementations can train models with hundreds of trees using millions of training examples in a matter of minutes. However, a significant limitation of these methods is that all of the training examples must be stored in main memory. For LightGBM this requirement is strict. XGBoost can operate in disk mode, which makes it possible to use machines with less memory than the training set size, but at the cost of much longer training time.

In this paper, we present a new implementation of boosted trees¹. This implementation runs efficiently on machines whose memory is much smaller than the training set. It achieves this with no penalty in accuracy, and with a speedup of 10-100x over XGBoost in disk mode.

Our method is based on the observation that each boosting step corresponds to an estimation of the gradients along the axes defined by the weak rules. The common approach to performing this estimation is to scan all of the training examples so as to minimize the estimation error. This operation is very expensive, especially when the training set does not fit in memory. We reduce the number of examples scanned in each boosting iteration by combining two ideas. First, we use early stopping [19] to minimize the number of examples scanned at each boosting iteration. Second, we keep in memory only a sample of the training set, and we replace the sample when the sample in memory becomes a poor representation of the complete training set. We exploit the fact that boosting tends to place large weights on a small subset of the training set, thereby reducing the effectiveness of the memory-resident training set. We propose a measure for quantifying the variation in weights called the effective number of examples. We also describe an efficient sampling algorithm, stratified weighted sampling. Early stopping for boosting was studied in previous work [6, 4]. The other two techniques are, to the best of our knowledge, novel. In the following paragraphs, we give a high-level description of these three ideas, which are elaborated on in the rest of the paper.

¹The source code of the implementation is released at
**Early Stopping** We use early stopping to reduce the number of examples that the boosting algorithm reads from memory to the CPU. A boosting algorithm iteratively adds a weak rule to the combined strong rule. In most implementations, the algorithm searches for the best weak rule, which requires scanning all of the training examples. However, the theory of boosting only requires the added weak rule to be significantly better than random guessing, which does not require scanning all of the training examples. Our approach is therefore to read just as many examples as needed to identify a weak rule that is significantly better than random guessing. It is based on sequential analysis and early stopping [19]. Using sequential analysis methods, we designed a stopping rule that decides when to stop reading more examples without increasing the chance of over-fitting.

**Effective Number of Examples** Boosting assigns different weights to different examples. The weight of an example represents the magnitude of its influence on the estimate of the gradient. When the weight distribution of a training set is dominated by a small number of heavy examples, the variance of the gradient estimates is high. This leads to over-fitting, and effectively reduces the size of the training set. We quantify this reduction using the effective number of examples, $n_{\text{eff}}$. To get reliable estimates, $n_{\text{eff}}$ should be close to the size of the current training set in memory, $n$. When $n_{\text{eff}}/n$ is small, we flush the current training set and draw a new sample using weighted sampling.

**Stratified Weighted Sampling** While there are well-known methods for weighted sampling, all of the existing methods (that we know of) are inefficient when the weights are highly skewed. In such cases, most of the scanned examples are rejected, which makes sampling very slow. To increase the sampling efficiency, we introduce a technique we call stratified weighted sampling. It generates the same sampled distribution while guaranteeing that the fraction of rejected examples is no larger than 1/2.

We implemented a new boosted tree algorithm with these three techniques, called Sparrow. We compared its performance to that of XGBoost and LightGBM on two large datasets: one with 50 million examples (the human acceptor splice site dataset [18, 1]), and the other with over 600 million examples (the bathymetry dataset [11]). We show that Sparrow can achieve a 10-20x speed-up over LightGBM and XGBoost, especially in limited-memory settings.

The rest of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we review the confidence-rated boosting algorithm. In Section 4 we describe the statistical theory behind the design of Sparrow. In Section 5 we describe the design of our implementation. In Section 6 we describe our experiments. We conclude with future work directions in Section 7.

## 2 Related Work

There are several methods that use sampling to reduce the training time of boosting. Both Friedman et al. [8] and LightGBM [12] use a fixed threshold to filter out light-weight examples: the former discards the examples whose weights are smaller than the threshold; the latter keeps all examples whose gradients exceed the threshold, and accepts the rest with a fixed probability. The major difference from Sparrow is that their sampling methods are biased, while Sparrow does not change the original data distribution. Appel et al.
[2] use small samples to prune weak rules associated with unpromising features, and only scan all examples to evaluate the remaining ones. The major difference from Sparrow is that they focus on finding the best weak rule, while Sparrow looks for a statistically significant one. The former requires scanning all of the examples, while with a stopping rule our algorithm often stops after reading only a small fraction of them. The idea of accelerating boosting with stopping rules was also studied by Domingo and Watanabe [6] and by Bradley and Schapire [4]. Our contribution is a tighter stopping rule; it is tighter because it takes into account the variance of the sample weights. There are several techniques that speed up boosting by taking advantage of the sparsity of the dataset [5, 12]. We will consider those techniques in future work.

## 3 Confidence-Rated Boosting

We start with a brief description of the confidence-rated boosting algorithm under the AdaBoost framework (Algorithm 9.1 on page 274 of [16]). Let $\vec{x} \in \mathcal{X}$ be the feature vectors and let the output be $y \in \mathcal{Y} = \{-1, +1\}$. For a joint distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, our goal is to find a classifier $c: \mathcal{X} \to \mathcal{Y}$ with small error:
$$\mathrm{err}_D(c) \doteq P_{(\vec{x},y)\sim D}\left[c(\vec{x}) \neq y\right].$$
We are given a set $\mathcal{H}$ of base classifiers (weak rules) $h: \mathcal{X} \to [-1, +1]$. We want to generate a score function, which is a weighted sum of $T$ rules from $\mathcal{H}$:
$$S_T(\vec{x}) = \sum_{t=1}^{T} \alpha_t h_t(\vec{x}).$$
The term $\alpha_t$ is the weight by which each base classifier contributes to the final prediction, and is decided by the specific boosting paradigm. Finally, we obtain the strong classifier as the sign of the score function: $H_T = \mathrm{sign}(S_T)$.

AdaBoost can be viewed as a coordinate-wise gradient descent algorithm [15]. The algorithm iteratively finds the direction (weak rule) which maximizes the decrease of the average potential function, and then adds this weak rule to $S_T$ with a weight that is proportional to the magnitude of the gradient. The potential function used in AdaBoost is $\Phi(\vec{x}, y) = e^{-S_T(\vec{x})y}$. Other potential functions have been studied (e.g. [9]); in this work we focus on the potential function used in AdaBoost.

We distinguish between two types of average potentials: the expected potential, or true potential,
$$\Phi(S_T) = E_{(\vec{x},y)\sim D}\left[e^{-S_T(\vec{x})y}\right],$$
and the average potential, or empirical potential,
$$\hat{\Phi}(S_T) = \frac{1}{n}\sum_{i=1}^{n} e^{-S_T(\vec{x}_i)y_i}.$$
The ultimate goal of the boosting algorithm is to minimize the expected potential, which determines the true error rate. However, most boosting algorithms, including XGBoost and LightGBM, focus on minimizing the empirical potential $\hat{\Phi}(S_T)$, and rely on the limited capacity of the weak rules to guarantee that the true potential is also small. Sparrow takes a different approach: it uses an estimator of the true edge (explained below) to identify weak rules that reduce the true potential with high probability.

Consider adding a weak rule $h_t$ to the score function $S_{t-1}$, so that $S_t = S_{t-1} + \alpha_t h_t$. Taking the partial derivative of the average potential $\Phi$ with respect to $\alpha_t$, we get
$$-\frac{\partial}{\partial \alpha_t}\,\Phi(S_{t-1} + \alpha_t h)\Big|_{\alpha_t=0} = Z_{t-1}\, E_{(\vec{x},y)\sim D_{t-1}}\left[h(\vec{x})y\right], \quad (1)$$
where
$$D_{t-1} = \frac{D}{Z_{t-1}} \exp\left(-S_{t-1}(\vec{x})y\right), \quad (2)$$
and $Z_{t-1}$ is a normalization factor that makes $D_{t-1}$ a distribution. Boosting algorithms perform coordinate-wise gradient descent on the average potential, where each coordinate corresponds to one weak rule. Using equation (1), we can express the gradient with respect to the weak rule $h$ as a correlation, which we call the true edge:
$$\gamma_t(h) \doteq \mathrm{corr}_{D_{t-1}}(h) \doteq E_{(\vec{x},y)\sim D_{t-1}}\left[h(\vec{x})y\right], \quad (3)$$
which is not directly measurable.
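For completeness, the one-line calculation behind equation (1) (our own restatement; it follows directly from the definitions above):
$$-\frac{\partial}{\partial \alpha}\, E_{(\vec{x},y)\sim D}\!\left[e^{-(S_{t-1}(\vec{x}) + \alpha h(\vec{x}))\,y}\right]\Big|_{\alpha=0} = E_{(\vec{x},y)\sim D}\!\left[h(\vec{x})\,y\, e^{-S_{t-1}(\vec{x})\,y}\right] = Z_{t-1}\, E_{(\vec{x},y)\sim D_{t-1}}\!\left[h(\vec{x})\,y\right].$$
Since $Z_{t-1} > 0$, maximizing the decrease of the potential along coordinate $h$ is the same as maximizing the edge $\gamma_t(h)$.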
Given $n$ i.i.d. samples, an unbiased estimate of the true edge is the empirical edge:
$$\hat{\gamma}_t(h) \doteq \widehat{\mathrm{corr}}_{D_{t-1}}(h) \doteq \frac{1}{Z_{t-1}} \sum_{i=1}^{n} w_i\, h(\vec{x}_i) y_i, \quad (4)$$
where $w_i = e^{-S_{t-1}(\vec{x}_i)y_i}$ and $Z_{t-1} = \sum_{i=1}^{n} w_i$. To decrease the expected potential, we want to find a weak rule with a large edge and add it to the score function. XGBoost and LightGBM do this by searching for the weak rule with the largest empirical edge. Sparrow instead finds a weak rule which, with high probability, has a significantly large true edge.

## 4 Statistical Theory

Next, we explain the statistical techniques for identifying such weak rules while minimizing the number of examples needed to compute the estimates.

### 4.1 Effective Number of Examples

Equation (4) defines $\hat{\gamma}(h)$, an unbiased estimate of $\gamma(h)$. How accurate is this estimate? A standard quantifier is the variance of the estimator. Suppose the true edge of a weak rule $h$ is $\gamma$, so that the expected value of the (normalized) weighted correlation between the predictions of $h$ and the true labels is $\gamma$. The variance of this correlation involves both the spread of the weights and a term proportional to $\gamma^2$. Ignoring the $\gamma^2$ term (because $\gamma$ is usually close to zero) and the variance in the mean weight $E(w)$, we approximate the variance of the edge estimate as
$$\mathrm{Var}(\hat{\gamma}) \approx \frac{\sum_{i=1}^{n} w_i^2}{\left(\sum_{i=1}^{n} w_i\right)^2}. \quad (5)$$
If all of the weights are equal, then $\mathrm{Var}(\hat{\gamma}) = 1/n$, corresponding to a standard deviation of $1/\sqrt{n}$, which is the expected relation between sample size and error. If the weights are not equal, the variance is larger and thus the estimate is less accurate. We define the effective number of examples $n_{\text{eff}}$ to be $1/\mathrm{Var}(\hat{\gamma})$, specifically
$$n_{\text{eff}} \doteq \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}. \quad (6)$$
To see that the name effective number of examples makes sense, consider $n$ weights where $w_1 = \cdots = w_k = 1/k$ and $w_{k+1} = \cdots = w_n = 0$. It is easy to verify that in this case $n_{\text{eff}} = k$, which agrees with our intuition: the examples with zero weight do not affect the estimate.

Suppose the memory is only large enough to store $n$ examples. If $n_{\text{eff}} \ll n$, then we are wasting valuable memory space on examples with small weights, which can significantly increase the chance of over-fitting. We can fix this problem by using weighted sampling: we repopulate memory with $n$ equally weighted examples, making it possible to learn without over-fitting.

### 4.2 Weighted Sampling

When Sparrow detects that $n_{\text{eff}}$ is much smaller than the memory size $n$, it clears the memory and collects a new sample from disk using weighted sampling. The specific sampling algorithm Sparrow uses is minimal variance weighted sampling [13]. This method reads one example $(\vec{x}, y)$ at a time from disk, calculates its weight, and accepts the example with probability proportional to that weight. Accepted examples are stored in memory with an initial weight of 1. This increases the effective sample size from $n_{\text{eff}}$ back to $n$, thereby reducing the chance of over-fitting.

To gain some intuition for this effect, consider the following imbalanced classification problem. Suppose the training set size is $N = 100{,}000$, of which $0.01N$ are positive and $0.99N$ are negative, and suppose we can store $n = 2{,}000$ examples in memory. The number of positive memory-resident examples is then $0.01n = 20$. Clearly, with such a small number of positive examples, there is a danger of over-fitting. However, an (almost) all-negative rule is 99% correct. If we then reweigh the examples using the AdaBoost rule, half of the total weight goes to the 20 positives and the other half to the 1,980 negatives. The value of $n_{\text{eff}}$ will drop to about 80.
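To make the definition concrete, here is a small Rust sketch (our own illustration, not part of the released Sparrow code) that computes $n_{\text{eff}}$ from a slice of weights and reproduces the arithmetic of the imbalanced example above:

```rust
/// Effective number of examples: (sum of weights)^2 / (sum of squared weights).
/// Equals n when all weights are equal, and k when only k weights are non-zero
/// and equal, matching equation (6).
fn effective_sample_size(weights: &[f64]) -> f64 {
    let sum: f64 = weights.iter().sum();
    let sum_sq: f64 = weights.iter().map(|w| w * w).sum();
    sum * sum / sum_sq
}

fn main() {
    // The imbalanced example above: after reweighting, half of the total
    // weight sits on 20 positives and half on 1,980 negatives.
    let mut weights = vec![0.5 / 20.0; 20];
    weights.extend(vec![0.5 / 1980.0; 1980]);
    // Prints roughly 79.2: n_eff drops from 2,000 to about 80.
    println!("n_eff = {:.1}", effective_sample_size(&weights));
}
```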
A drop of this size triggers a resampling step, which generates a training set with 1,000 positives and 1,000 negatives. This allows us to find additional weak rules with little danger of over-fitting. The process continues as long as Sparrow is making progress and the weights are becoming increasingly skewed: when the skew is large, $n_{\text{eff}}$ is small and Sparrow draws a new sample with uniform weights.

Sparrow uses weighted sampling to achieve high disk-to-memory efficiency. In addition, Sparrow achieves high memory-to-CPU efficiency by reading from memory the minimal number of examples necessary to establish that a particular weak rule has a significant edge. This is done using sequential analysis and early stopping.

### 4.3 Sequential Analysis

Sequential analysis was introduced by Wald in the 1940s [19]. Suppose we want to estimate the expected loss of a model. In standard large-deviation analysis, we assume that the loss is bounded in some range, say $[-M, +M]$, and that the size of the training set is $n$. This implies that the standard deviation of the training loss is at most $M/\sqrt{n}$. To make this standard deviation smaller than some $\epsilon > 0$, we need $n > (M/\epsilon)^2$. While this analysis is optimal in the worst case, it can be improved if we have additional information about the standard deviation. We can glean such information from the observed losses using the following sequential analysis method. Instead of choosing $n$ ahead of time, the algorithm computes the loss one example at a time. It uses a stopping rule to decide whether, conditioned on the sequence of losses seen so far, the difference between the average loss and the true loss is smaller than $\epsilon$ with large probability. As a result, when the standard deviation is significantly smaller than $M/\sqrt{n}$, the number of examples needed for the estimate is much smaller than $(M/\epsilon)^2$.

We use a stopping rule based on Theorem 1 in Appendix B, which depends on both the mean and the variance of the weighted correlation [3]. Fixing the current strong rule $H$ (i.e. the score function), we define an (unnormalized) weight for each example, $w(\vec{x}, y) = e^{-H(\vec{x})y}$. Consider a particular candidate weak rule $h$ and a sequence of labeled examples $\{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots\}$. For some $\gamma > 0$, we define two cumulative quantities (after seeing $t$ examples from the sequence):
$$M_t \doteq \sum_{i=1}^{t} w(\vec{x}_i, y_i)\left(h(\vec{x}_i)y_i - \gamma\right), \quad \text{and} \quad V_t \doteq \sum_{i=1}^{t} w(\vec{x}_i, y_i)^2. \quad (7)$$
$M_t$ is an estimate of the difference between the true correlation of $h$ and $\gamma$; $V_t$ quantifies the variance of this estimate. The goal of the stopping rule is to identify a weak rule $h$ whose true edge is larger than $\gamma$. The rule is defined as
$$t > t_0 \quad \text{and} \quad M_t \geq \sqrt{C\, V_t \left(\log\log V_t + B\right)}, \quad (8)$$
where $t_0$, $C$, and $B$ are parameters. If both conditions of the stopping rule hold, we claim that the true edge of $h$ is larger than $\gamma$ with high probability. The proof of this test can be found in [3].

Note that our stopping rule is tied to the cumulative variance $V_t$, which is essentially the same quantity as $1/n_{\text{eff}}$ (up to normalization). If $n_{\text{eff}}$ is large, say $n_{\text{eff}} = n$ when a new sample is placed in memory, the stopping rule stops quickly. On the other hand, when the weights diverge, $n_{\text{eff}}$ becomes smaller than $n$, and the stopping rule requires proportionally more examples before stopping.
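As a concrete illustration of equations (7) and (8), here is a minimal Rust sketch of an online accumulator for $M_t$ and $V_t$ together with the stopping test. This is our own sketch, not Sparrow's implementation; the parameters $t_0$, $C$, $B$ and the exact form of the bound should be taken from [3] rather than from this code.

```rust
/// Running statistics for one candidate rule h, following equations (7)-(8).
struct StoppingRule {
    m: f64,       // M_t: weighted sum of (h(x_i) * y_i - gamma)
    v: f64,       // V_t: sum of squared weights
    count: usize, // t: number of examples seen
}

impl StoppingRule {
    fn new() -> Self {
        StoppingRule { m: 0.0, v: 0.0, count: 0 }
    }

    /// Update with one example: its weight w and the product h(x) * y in [-1, +1].
    fn update(&mut self, w: f64, hy: f64, gamma: f64) {
        self.m += w * (hy - gamma);
        self.v += w * w;
        self.count += 1;
    }

    /// Fire when t > t0 and M_t clears a sqrt(V_t log log V_t)-style bound.
    /// t0, c, and b are parameters; see [3] for values with formal guarantees.
    fn fires(&self, t0: usize, c: f64, b: f64) -> bool {
        self.count > t0
            && self.v > std::f64::consts::E // keep log log V_t well-defined
            && self.m >= (c * self.v * (self.v.ln().ln() + b)).sqrt()
    }
}
```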
Figure 1: The Sparrow system architecture. Left: the workflow of the Scanner and the Sampler. Right: partitioning of the examples stored on disk according to their weights.

The relationship between martingales, sequential analysis, and stopping rules has been studied in previous work [19]. Briefly, when the edge of a rule is smaller than $\gamma$, the sequence $M_t$ is a supermartingale; if it is larger than $\gamma$, it is a submartingale. The only assumption is that the examples are sampled i.i.d. Theorem 1 in Appendix B guarantees two things about the stopping rule defined in equation (8): (1) if the true edge is smaller than $\gamma$, the stopping rule will never fire (with high probability); (2) if the stopping rule fires, the true edge of the rule $h$ is larger than $\gamma$.

## 5 System Design and Algorithms

In this section we describe Sparrow. As Sparrow consists of a number of concurrent threads and many queues, we chose to implement it in the Rust programming language for the benefits of its memory-safety and thread-safety guarantees [14]. We use a bold letter in parentheses to refer to the corresponding component in the workflow diagram in Figure 1. We also provide pseudo-code in Appendix C.

The main procedure of Sparrow generates a sequence of weak rules $h_1, \ldots, h_k$ and combines them into a strong rule $H_k$. It calls two subroutines that execute in parallel: a Scanner and a Sampler.

**Scanner** The task of a scanner (the upper part of the workflow diagram in Figure 1) is to read training examples sequentially and stop when it has identified one of the candidate rules as a good rule. At any point in time, the scanner maintains the current strong rule $H_t$, a set of candidate weak rules $W$, and a target edge $\gamma_{t+1}$. For example, when training boosted decision trees, the current strong rule $H_t$ consists of a set of decision trees, the candidate weak rules $W$ are the candidate splits on all features, and $\gamma_{t+1} \in (0.0, 0.5)$.

Inside the scanner, a booster (d) scans the training examples stored in main memory (c) sequentially, one at a time. It computes the weight of each read example using $H_t$ and then updates a running estimate of the edge of each weak rule $h \in W$ accordingly. Periodically, it feeds these running estimates into the stopping rule, and stops the scanning when the stopping rule fires. The stopping rule is designed such that if it fires for $h_{t+1}$, then with high probability the true edge $\gamma(h_{t+1})$ is larger than the target threshold $\gamma_{t+1}$. The booster then adds the identified weak rule $h_{t+1}$ (f) to the current strong rule $H_t$ to create a new strong rule $H_{t+1}$ (g). The booster decides the weight of the weak rule $h_{t+1}$ in $H_{t+1}$ based on $\gamma_{t+1}$ (a lower bound on its accuracy). This can underestimate the weight; however, if the underestimate is large, the weak rule $h_{t+1}$ is likely to be re-discovered later, which effectively increases its weight.

Lastly, the scanner falls into the Failed state if, after exhausting all examples in the current sample set, no weak rule with an advantage larger than the target threshold $\gamma_{t+1}$ is detected. When this happens, the scanner shrinks the value of $\gamma_{t+1}$ and restarts scanning. More precisely, it keeps track of the empirical edges $\hat{\gamma}(h)$ of all weak rules $h$; on failure, it resets the threshold $\gamma_{t+1}$ to just below the current maximum empirical edge over all weak rules.

To illustrate the relationship between the target threshold and the empirical edge of the detected weak rule, we compare their values in Figure 2. The empirical edge $\hat{\gamma}(h_{t+1})$ of the detected weak rules is usually larger than $\gamma_{t+1}$.
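Putting the pieces together, the following hedged sketch shows the scanner's control flow: stream the memory-resident examples, update each candidate's running statistics, return the first rule whose stopping test fires, and on failure shrink the target edge toward the best empirical edge. `StoppingRule` is the accumulator from the previous sketch; the types, parameter values, and shrink factor are stand-ins, not Sparrow's actual interfaces.

```rust
// Hypothetical sketch of one scanning pass over the memory-resident sample.
fn scan_pass(
    examples: &[(Vec<f64>, f64, f64)],    // (features x, label y, weight w)
    rules: &[Box<dyn Fn(&[f64]) -> f64>], // candidate weak rules h: x -> [-1, +1]
    gamma: &mut f64,
) -> Option<usize> {
    let mut stats: Vec<StoppingRule> =
        (0..rules.len()).map(|_| StoppingRule::new()).collect();
    let mut total_weight = 0.0;
    for (x, y, w) in examples {
        total_weight += *w;
        for (j, h) in rules.iter().enumerate() {
            stats[j].update(*w, h(x.as_slice()) * y, *gamma);
            // In practice the test runs periodically, not on every example;
            // 100, 3.0, and 1.0 are placeholder parameters.
            if stats[j].fires(100, 3.0, 1.0) {
                return Some(j); // w.h.p. the true edge of rule j exceeds gamma
            }
        }
    }
    // Failed state: no rule cleared the target edge. Reset gamma to just below
    // the largest empirical edge (M_t / sum(w) + gamma) and let the caller
    // rescan with the smaller target; 0.9 is a placeholder shrink factor.
    let best_edge = stats
        .iter()
        .map(|s| s.m / total_weight + *gamma)
        .fold(f64::MIN, f64::max);
    *gamma = 0.9 * best_edge;
    None
}
```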
Figure 2: The empirical edge and the corresponding target edge $\gamma$ of the weak rules being added to the ensemble. Sparrow adds new weak rules with a weight calculated from the value of $\gamma$ at the time of their detection, and shrinks $\gamma$ when it cannot detect a rule with an edge over $\gamma$.

The weak rules are then added to the strong rule with a weight corresponding to $\gamma_{t+1}$ (the lower bound on their true edges) to avoid over-estimation. The value of $\gamma_{t+1}$ shrinks over time, as no weak rule with a larger edge remains.

**Sampler** Our assumption is that the entire training dataset does not fit into main memory and is therefore stored in external storage (a). As boosting progresses, the weights of the examples become increasingly skewed, making the dataset in memory effectively smaller. To counteract that skew, the Sampler prepares a new training set, in which all of the examples have equal weights, using selective sampling. When the effective sample size $n_{\text{eff}}$ associated with the old training set becomes too small, the scanner stops using the old training set and starts using the new one².

The sampler uses selective sampling, by which we mean that the probability of an example $(\vec{x}, y)$ being added to the sample is proportional to its weight $w(\vec{x}, y)$. Each added example is assigned an initial weight of 1. There are several known algorithms for selective sampling. The best known is rejection sampling, in which a biased coin is flipped for each example. We instead use a method known as minimal variance sampling [13], because it produces less variation in the sampled set.

**Stratified Storage and Stratified Sampling** The standard approach to sampling reads examples one at a time, calculates the weight of each example, and accepts the example into memory with probability proportional to its weight, otherwise rejecting it. Let the largest weight be $w_{\max}$ and the average weight be $w_{\text{mean}}$; then the maximal rate at which examples are accepted is $w_{\text{mean}}/w_{\max}$. If the weights are highly skewed, this ratio can be arbitrarily small, which means that only a small fraction of the evaluated examples are accepted. As weight evaluation is time-consuming, this process becomes a computational bottleneck.

We propose a stratified sampling mechanism to address this issue (the right part of Figure 1). It applies incremental updates to reduce the computational cost of making predictions with a large model, and uses a stratified data organization to reduce the rejection rate.

To implement incremental updates, we store for each example, whether on disk or in memory, the result of its latest weight update. Specifically, we store each training example as a tuple $(\vec{x}, y, H_l, w_l)$, where $\vec{x}, y$ are the feature vector and the label, $H_l$ is the last strong rule used to calculate the weight of the example, and $w_l$ is the last calculated weight. In this way, both the scanner and the sampler only need to evaluate the incremental changes to the model since the last time it was used to predict on that example.

To reduce the rejection rate, we want the sampler to avoid reading examples that it is likely to reject. We organize examples in a stratified structure, where stratum $k$ contains the examples whose weights are in $[2^k, 2^{k+1})$. This limits the skew of the weights within each stratum so that $w_{\text{mean}}/w_{\max} \geq 1/2$, as illustrated in the sketch below.
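To illustrate why stratifying by weight bounds the rejection rate, here is a simplified Rust sketch of the two operations described above: assigning an example to a stratum, and sampling within one stratum. This is our own illustration (assuming the `rand` crate, version 0.8), not Sparrow's actual code, and it uses plain rejection sampling within a stratum rather than the minimal variance sampler [13] that Sparrow uses.

```rust
use rand::Rng; // external crate assumption: rand = "0.8"

/// Stratum k holds examples whose weights fall in [2^k, 2^(k+1)), so within a
/// stratum w_mean / w_max >= 1/2.
fn stratum_index(weight: f64) -> i32 {
    weight.log2().floor() as i32
}

/// Draw one example index from stratum k (all `weights` assumed to lie in
/// [2^k, 2^(k+1))) by rejection sampling: accept an example with probability
/// weight / 2^(k+1), the stratum's weight ceiling.
fn sample_from_stratum<R: Rng>(rng: &mut R, k: i32, weights: &[f64]) -> usize {
    let ceiling = 2f64.powi(k + 1);
    loop {
        let i = rng.gen_range(0..weights.len());
        // Acceptance probability is at least 1/2 because weights[i] >= 2^k,
        // so the expected number of rejections per accepted example is < 1.
        if rng.gen::<f64>() < weights[i] / ceiling {
            return i;
        }
    }
}
```

Choosing which stratum to read from, proportionally to its total weight, is described next.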
In addition, the sampler maintains the (estimated) total weight of the examples in each stratum, and associates a probability with each stratum by normalizing the total weights to 1. To sample a new example, the sampler first samples the next stratum to read from, then reads examples from the selected stratum until one of them is accepted. For each example, the sampler first updates its weight, then decides whether or not to accept it, and finally writes it back to the stratum it belongs to according to its updated weight. As a result, the rejection rate is at most 1/2, which greatly improves the speed of sampling. Lastly, since the stratified structure contains all of the examples, it is managed mostly on disk, with a small in-memory buffer to speed up I/O operations.

²The sampler and scanner can run in parallel on a multi-core machine, or on two different machines. In our experiments, we run them on one machine.

## 6 Experiments

In this section we describe the experimental results of Sparrow. In all experiments, we use trees as weak rules. First, we use the forest cover type dataset [10] to evaluate the sampling effectiveness. It contains 581K examples; we performed an 80/20 random split for training and testing. In addition, we use two large datasets to evaluate the overall performance of Sparrow. The first large dataset is the splice site dataset for detecting human acceptor splice sites [18, 1]. We use the same training dataset of 50M examples as in prior work, and validate the model on a testing set of 4.6M examples; the training dataset takes over 39 GB on disk. The second large dataset is the bathymetry dataset for detecting human mislabeling in bathymetry data [11]. We use a training dataset of 623M examples, and validate the model on a testing set of 83M examples; the training dataset takes 100 GB on disk. Both learning tasks are binary classification.

The experiments on the large datasets are all conducted on EC2 instances with attached SSD storage from Amazon Web Services. We ran the evaluations on five instance types with increasing memory capacities, ranging from 8 GB to 244 GB (for details see Appendix A).

### 6.1 Effectiveness of Weighted Sampling

We evaluate the effectiveness of weighted sampling by comparing it to uniform sampling. The comparison is over the model accuracy on the testing data when both are trained for 500 boosting iterations on the cover type dataset. For both methods, we generate trees of depth 5 as weak rules. For uniform sampling, we first randomly sample from the training data at each sampling ratio, and use XGBoost to train the models. We evaluated the model performance with sampling ratios ranging from 0.1 to 0.5, and repeated each evaluation 10 times.

The results are shown in Figure 3. The accuracy of Sparrow is higher with the same number of boosting iterations and the same sampling ratio. In addition, the variance of the model accuracy is smaller. This demonstrates that the weighted sampling method used in Sparrow is more effective and more stable than uniform sampling.

Figure 3: Accuracy comparison on the cover type dataset. For uniform sampling, we trained XGBoost on a uniformly sampled dataset with the same sample fraction used in Sparrow. The accuracy is evaluated with the same number of boosting iterations.

### 6.2 Training on Large Datasets

We compare Sparrow on the two large datasets, using XGBoost and LightGBM as baselines since they out-perform other boosting implementations [5, 12]. The comparison is done in terms of the reduction in the exponential loss, which is what boosting minimizes directly, and in terms of AUROC, which is often more relevant in practice.
We include the data loading time in the reported training times. There are two popular tree-growth algorithms: depth-wise and leaf-wise [17]. Both Sparrow and LightGBM grow trees leaf-wise; XGBoost uses the depth-wise method by default. In all experiments, we grow trees with at most 4 leaves, or depth two; we chose to train smaller trees in these experiments because training takes a very long time otherwise. For XGBoost, we chose the approximate greedy algorithm, which is its fastest training method. LightGBM supports sampling during training, which it calls Gradient-based One-Side Sampling (GOSS): GOSS keeps a fixed percentage of the examples with large gradients, and randomly samples from the remaining examples. We selected GOSS as the tree construction algorithm for LightGBM. In addition, we enabled the option in LightGBM that reduces its memory footprint.

Figure 4: Time-AUROC curves on the splice site detection dataset (higher is better; clipped on the right and bottom). The (S) suffix is for training with 30.5 GB of memory, and the (L) suffix is for training with 61 GB of memory.

Figure 5: Time-AUROC curves on the bathymetry dataset (higher is better; clipped on the right and bottom). The (S) suffix is for training with 61 GB of memory, and the (L) suffix is for training with 244 GB of memory.

The memory requirement of Sparrow is decided by the sample size, which is a configurable parameter. XGBoost supports external-memory training when the memory is too small to fit the training dataset. We use the in-memory version of XGBoost whenever possible; if it runs out of memory, we train the model with the external-memory version instead. Unlike XGBoost, LightGBM does not support external-memory execution. Lastly, all algorithms in this comparison optimize the exponential loss as defined in AdaBoost. Due to the space limit, the detailed summaries of the experiment results are given in Table 1 and Table 2 in Appendix A.

We evaluated each algorithm in terms of AUROC on the testing dataset as a function of training time. The results are given in Figure 4 and Figure 5. On the splice site dataset, Sparrow is able to run on instances with as little as 8 GB of memory. The external-memory version of XGBoost can execute with a reasonable amount of memory (though still no less than 15 GB) but takes about 3x longer to train. However, we also noticed that Sparrow does not have an advantage over the other two boosting implementations when the memory is large enough to load the entire training dataset. On the bathymetry dataset, Sparrow consistently out-performs XGBoost and LightGBM, even when the memory size is larger than the dataset size. In extreme cases, Sparrow takes 10x-20x less training time and achieves better accuracy. In addition, both LightGBM and the in-memory version of XGBoost crash when trained with less than 244 GB of memory.

We observed that properly initializing the value of $\gamma$ and setting a reasonable sample set size can have a great impact on the performance of Sparrow. If the stopping rule frequently fails to fire, it can introduce significant overhead to the training process. Specific to boosted trees, one heuristic we find useful is to initialize $\gamma$ to the maximum advantage of the tree nodes in the previous tree. A more systematic approach for deciding $\gamma$ and the sample set size is left as future work.
## 7 Conclusion and Future Work

In this paper, we have proposed a boosting algorithm that combines three techniques: the effective number of examples, weighted sampling, and early stopping. Our preliminary results show that they can dramatically speed up boosting on large real-world datasets, especially when the data size exceeds the memory capacity. For future work, we are developing a parallelized version of Sparrow that uses a novel asynchronous communication protocol. It uses the stopping rule to decide on model updates, relaxing the need for frequent communication between workers, especially when training on large datasets; we believe this is a better parallel learning paradigm.

## Acknowledgements

We are grateful to David Sandwell and Brook Tozer for providing the bathymetry dataset. This work was supported by the NIH (grant U19 NS107466).

## References

[1] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15:1111–1133, 2014.

[2] Ron Appel, Thomas Fuchs, Piotr Dollár, and Pietro Perona. Quickly Boosting Decision Trees – Pruning Underachieving Features Early. In International Conference on Machine Learning, pages 594–602, 2013.

[3] Akshay Balsubramani. Sharp Finite-Time Iterated-Logarithm Martingale Concentration. arXiv:1405.2639 [cs, math, stat], May 2014.

[4] Joseph K. Bradley and Robert E. Schapire. FilterBoost: Regression and Classification on Large Datasets. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pages 185–192, USA, 2007. Curran Associates Inc.

[5] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'16, pages 785–794, New York, NY, USA, 2016. ACM.

[6] Carlos Domingo and Osamu Watanabe. Scaling Up a Boosting-Based Learner via Adaptive Sampling. In Knowledge Discovery and Data Mining. Current Issues and New Applications, Lecture Notes in Computer Science, pages 317–328. Springer, Berlin, Heidelberg, April 2000.

[7] Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.

[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, April 2000.

[9] Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5):1189–1232, 2001.

[10] João Gama, Ricardo Rocha, and Pedro Medas. Accurate decision trees for mining high-speed data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 523–528. ACM, 2003.

[11] Japan Agency for Marine-Earth Science and Technology (JAMSTEC). Data and sample research system for whole cruise information in JAMSTEC (DARWIN).

[12] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3146–3154. Curran Associates, Inc., 2017.

[13] Genshiro Kitagawa. Monte Carlo Filter and Smoother for Non-Gaussian Nonlinear State Space Models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.

[14] Steve Klabnik and Carol Nichols. The Rust Programming Language. No Starch Press, 2018.

[15] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting Algorithms As Gradient Descent. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, pages 512–518, Cambridge, MA, USA, 1999. MIT Press.

[16] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

[17] Haijian Shi. Best-first decision tree learning. PhD thesis, The University of Waikato, 2007.

[18] Soeren Sonnenburg and Vojtěch Franc. COFFIN: A Computational Framework for Linear SVMs. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, pages 999–1006, USA, 2010. Omnipress.

[19] Abraham Wald. Sequential Analysis. Courier Corporation, 1973.