# Boosted Histogram Transform for Regression

Yuchao Cai*¹²  Hanyuan Hang*¹  Hanfang Yang²  Zhouchen Lin³

*Equal contribution. ¹AI Lab, Samsung Research China - Beijing (SRC-B), Beijing, China. ²School of Statistics, Renmin University of China, Beijing, China. ³Key Lab. of Machine Perception (MoE), School of EECS, Peking University, Beijing, China. Correspondence to: Hanfang Yang. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

In this paper, we propose a boosting algorithm for regression problems called boosted histogram transform for regression (BHTR), based on histogram transforms composed of random rotations, stretchings, and translations. From the theoretical perspective, we first prove fast convergence rates for BHTR under the assumption that the target function lies in the space $C^{0,\alpha}$. Moreover, if the target function resides in the subspace $C^{1,\alpha}$, for the first time we manage to explain the benefits of the boosting procedure by establishing the upper bound of the convergence rate for the boosted regressor, i.e. BHTR, and the lower bound for the base regressors, i.e. histogram transform regressors (HTR). In the experiments, compared with other state-of-the-art algorithms such as gradient boosted regression tree (GBRT), Breiman's forest, and kernel-based methods, our BHTR algorithm shows promising performance on both synthetic and real datasets.

1. Introduction

Over the past two decades, boosting has become one of the most successful algorithms in the machine learning community (Bühlmann & Yu, 2003). When the idea of iteratively utilizing weak learners from a certain function space to generate a strong one, which is called boosting, first came out in Schapire (1990); Freund (1995), it gained a lot of attention, and a wealth of literature has applied it to a large number of datasets. During this period, many boosting algorithms with impressive performance have been proposed. Perhaps the first boosting algorithm goes back to AdaBoost for classification by Schapire & Freund (1995); Freund & Schapire (1997). Another important boosting algorithm for regression, called Gradient Boosted Regression Tree (GBRT), was proposed by Friedman (2001). GBRT takes advantage of tree-based learners to capture complex data structure. More recently, inspired by the second order method originating from Friedman (2001), Chen & Guestrin (2016) came up with the eXtreme Gradient Boosting (XGBoost), which achieves excellent experimental performance against overfitting with the help of certain regularization terms.

Due to the great success of these boosting algorithms, a lot of attempts have been made to establish their theoretical foundations. First of all, theoretical margin guarantees of AdaBoost have been well studied by Freund & Schapire (1997); Koltchinskii & Panchenko (2002). Furthermore, based on the view of gradient descent optimization, various versions of boosting algorithms have been shown to be consistent in different settings, see Friedman et al. (2000); Mannor et al. (2002); Bühlmann & Yu (2003); Lugosi & Vayatis (2004); Zhang (2004). Moreover, Blanchard et al. (2003) conducted a deeper investigation of the convergence of regularized boosting classifiers through a restriction on the weights of the composite estimator.
Finally, a different method of achieving consistency and convergence rate results is through an early stopping rule, which is designed to prevent overfitting (Zhang & Yu, 2005), whereas Mease & Wyner (2008) argued that in practice additional iterations beyond the necessary number actually reduce the overfitting that has already occurred. Unfortunately, none of the above-mentioned boosting works present a satisfactory explanation from the statistical optimization view (Mease & Wyner, 2008; Wyner et al., 2017).

Nevertheless, some of the boosting variants with specific base learners are more easily accessible for statistical analysis. For example, Bühlmann & Yu (2003) derived an exponential bias-variance trade-off for linear regression to illustrate the near resistance to overfitting of $L_2$-Boosting in a fixed design setting. Moreover, Park et al. (2009) and Lin et al. (2019) established the theoretical analysis of boosting methods using Nadaraya-Watson kernel estimates and kernel ridge regression estimates as base learners, respectively. However, these methods are of little practical value since they fail to capture the complex data dependencies in applications. In addition, they did not show the benefits of the boosting procedure from the theoretical perspective.

Against this background, this paper aims to establish a new boosting algorithm which not only has satisfactory performance but also has solid theoretical foundations. To be specific, motivated by the random rotation ensemble algorithms (López-Rubio, 2013; Blaser & Fryzlewicz, 2016), we propose boosted histogram transform for regression (BHTR), which takes full advantage of the high effectiveness of the boosting procedure based on histogram transforms: First of all, we generate a random histogram transform consisting of random rotations, stretchings, and translations. Then the input space is partitioned into non-overlapping cells corresponding to the unit bins in the transformed space. On these cells, we obtain base learners by applying piecewise constant functions. Then the iterative process of fitting residuals is started with the help of a sequence of random histogram transforms, by a natural adaptation of the gradient descent boosting algorithm. Finally, by integrating the estimators generated by the above procedure, we obtain the boosted histogram transform regressor.

It is worth mentioning that BHTR enjoys two advantages, which can be stated as follows: First, the algorithm can be locally adaptive by applying stretching matrices associated with the variance of the samples in each dimension. Second, the obtained regression function can be globally smooth thanks to the diversity of the base learners resulting from the random histogram partitions. Although the benefits of ensembles for large scale regression were explored in Hang et al. (2019), in this paper we attempt for the first time to reveal the benefits of the boosting procedure for single regressors equipped with histogram transforms.

The contributions of this paper come from both theoretical and experimental perspectives: (i) In Section 3, we derive the theoretical results under the assumption that the target function resides in $C^{0,\alpha}$ and $C^{1,\alpha}$, respectively. By decomposing the error term into approximation error and sample error, we establish the fast convergence rates of BHTR in the space $C^{0,\alpha}$.
Moreover, for the subspace $C^{1,\alpha}$ consisting of smoother functions, we are able to show that BHTR can attain the convergence rate $O(n^{-2(1+\alpha)/(4(1+\alpha)+d)})$, whereas the lower bound of the convergence rates for HTR is merely of the order $O(n^{-2/(2+d)})$. As a result, when $d \geq 2(1+\alpha)/\alpha$, BHTR actually outperforms HTR, which confirms the benefits of the boosting procedure. (ii) In Section 4, several numerical experiments are designed to study the parameters including the bin width of the histogram transform, the learning rate, and the number of boosting iterations, which coincide with the theoretical analysis of these parameters in the established convergence rates. Moreover, to validate the performance of BHTR, we conduct experiments with several algorithms including GBRT, Breiman's forest, and kernel-based methods on both synthetic and real datasets. Thanks to the randomness of the histogram transforms and the boosting procedure, our BHTR demonstrates both high accuracy and strong overfitting resistance.

2. Methodology

2.1. Notations

Regression aims to predict the value of an unobserved output variable $Y$ based on the observed input variable $X$, given a dataset $D := \{(x_1, y_1), \ldots, (x_n, y_n)\}$ consisting of i.i.d. observations drawn from an unknown probability measure $\mathrm{P}$ on $\mathcal{X} \times \mathcal{Y}$. Throughout this paper, we assume that $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ are compact and non-empty. For any fixed $R > 0$, we denote by $B_R$ the centered hypercube of $\mathbb{R}^d$ with size $2R$, that is, $B_R := [-R, R]^d := \{x = (x_1, \ldots, x_d) \in \mathbb{R}^d : x_i \in [-R, R],\ i = 1, \ldots, d\}$, and for any $r \in (0, R)$, we write $B_{R,r}^+ := [-R+r, R-r]^d$. Recall that for $1 \le p < \infty$, the $L_p$-norm of $x = (x_1, \ldots, x_d)$ is defined by $\|x\|_p := (|x_1|^p + \cdots + |x_d|^p)^{1/p}$, and the $L_\infty$-norm is defined by $\|x\|_\infty := \max_{i=1,\ldots,d} |x_i|$. Throughout this paper, we use the notation $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ to denote that there exist positive constants $c$ and $c'$ such that $a_n \le c\, b_n$ and $a_n \ge c'\, b_n$, for all $n \in \mathbb{N}$. Moreover, for any $x \in \mathbb{R}$, let $\lfloor x \rfloor$ denote the largest integer less than or equal to $x$. In the sequel, the following multi-index notations are used frequently: for any vector $x = (x_i)_{i=1}^d \in \mathbb{R}^d$, we write $\lfloor x \rfloor := (\lfloor x_i \rfloor)_{i=1}^d$, $x^{-1} := (x_i^{-1})_{i=1}^d$, $\log(x) := (\log x_i)_{i=1}^d$, $\overline{x} := \max_{i=1,\ldots,d} x_i$, and $\underline{x} := \min_{i=1,\ldots,d} x_i$.

2.2. Least Square Regression

It is legitimate to consider the least square loss $L : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ defined by $L(x, y, f(x)) := (y - f(x))^2$ for our target of regression. Then, for a measurable decision function $f : \mathcal{X} \to \mathcal{Y}$, the risk is defined by $\mathcal{R}_{L,\mathrm{P}}(f) := \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, d\mathrm{P}(x, y)$ and the empirical risk is defined by $\mathcal{R}_{L,\mathrm{D}}(f) := \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, f(x_i))$. The Bayes risk, which is the smallest possible risk with respect to $\mathrm{P}$ and $L$, is given by $\mathcal{R}_{L,\mathrm{P}}^* := \inf\{\mathcal{R}_{L,\mathrm{P}}(f) \mid f : \mathcal{X} \to \mathcal{Y} \text{ measurable}\}$.

In what follows, it suffices to consider predictors with values in $[-M, M]$. To this end, we introduce the concept of clipping for the decision function, see also Definition 2.22 in Steinwart & Christmann (2008). Let $\widetilde{t}$ be the value of $t \in \mathbb{R}$ clipped at $\pm M$, that is, $\widetilde{t} := -M$ if $t < -M$, $\widetilde{t} := t$ if $t \in [-M, M]$, and $\widetilde{t} := M$ if $t > M$. Then a loss is called clippable at $M > 0$ if, for all $(y, t) \in \mathcal{Y} \times \mathbb{R}$, there holds $L(x, y, \widetilde{t}\,) \le L(x, y, t)$. According to Example 2.26 in Steinwart & Christmann (2008), the least square loss $L$ is clippable at $M$ with the risk reduced after clipping, i.e. $\mathcal{R}_{L,\mathrm{P}}(\widetilde{f}\,) \le \mathcal{R}_{L,\mathrm{P}}(f)$. Therefore, in the following, we only consider the clipped version $\widetilde{f}_D$ of the decision function as well as the risk $\mathcal{R}_{L,\mathrm{P}}(\widetilde{f}_D)$.
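To make the clipping step concrete, the following minimal Python sketch (our own illustration, with hypothetical helper names, not code from the paper) computes the empirical least-squares risk before and after clipping the predictions to $[-M, M]$; when the labels themselves lie in $[-M, M]$, clipping never increases the empirical risk.

```python
import numpy as np

def clip_predictions(preds, M):
    # Clip the decision function's outputs to the interval [-M, M].
    return np.clip(preds, -M, M)

def empirical_risk(y, preds):
    # Empirical least-squares risk R_{L,D}(f) = (1/n) * sum_i (y_i - f(x_i))^2.
    return np.mean((y - preds) ** 2)

# Toy check of the clipping property for the least-squares loss.
rng = np.random.default_rng(0)
M = 1.0
y = rng.uniform(-M, M, size=100)             # labels inside [-M, M]
preds = y + rng.normal(scale=2.0, size=100)  # an unclipped, noisy predictor
assert empirical_risk(y, clip_predictions(preds, M)) <= empirical_risk(y, preds)
```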
2.3. Histogram Transform for Regression

To give a clear description of one possible construction procedure of histogram transforms, we introduce a random vector $(R, S, b)$ whose elements represent the rotation matrix, the stretching matrix, and the translation vector, respectively.

To be specific, $R$ denotes the rotation matrix, which is a real-valued $d \times d$ orthogonal square matrix with unit determinant, that is,

$R^\top = R^{-1}$ and $\det(R) = 1$. $\quad$ (1)

$S$ stands for the stretching matrix, which is a positive real-valued $d \times d$ diagonal scaling matrix with diagonal elements $(s_i)_{i=1}^d$ that are certain random variables. Obviously, there holds

$\det(S) = \prod_{i=1}^d s_i$. $\quad$ (2)

Moreover, we denote $s = (s_i)_{i=1}^d$, and the bin width vector defined on the input space is given by

$h = s^{-1}$. $\quad$ (3)

Finally, $b \in [0, 1]^d$ is a $d$-dimensional vector named the translation vector.

Figure 1. Two-dimensional examples of histogram transforms. The left subfigure shows the original data and the other two subfigures show possible histogram transforms of the original sample space, with different rotating orientations and scales of stretching.

Based on the above notation, we define the histogram transform $H : \mathcal{X} \to \mathcal{X}$ by

$H(x) := R \cdot S \cdot x + b$. $\quad$ (4)

It is important to note that there is no point in considering a bin width other than $1$ in the transformed space, since the same effect can be achieved by rescaling the transformation $H$. Therefore, let $\lfloor H(x) \rfloor$ be the transformed bin indices; then the transformed bin is given by

$A'_H(x) := \{ H(x') \mid \lfloor H(x') \rfloor = \lfloor H(x) \rfloor \}$. $\quad$ (5)

The corresponding histogram bin containing $x \in \mathcal{X}$ in the input space is

$A_H(x) := \{ x' \mid H(x') \in A'_H(x) \}$, $\quad$ (6)

and we further denote all the bins induced by $H$ as $\{A'_j\} = \{A_H(x) : x \in \mathcal{X}\}$, with each repeated bin counted only once, and $\mathcal{I}_H$ as the index set for $H$ such that for $j \in \mathcal{I}_H$ we have $A'_j \cap B_R \neq \emptyset$. As a result, the set $\pi_H := \{A_j\}_{j \in \mathcal{I}_H} := \{A'_j \cap B_R\}_{j \in \mathcal{I}_H}$ forms a partition of $B_R$. For the sake of convenience, we substitute $A_0$ for $B_R^c$, and then $\pi'_H := \{A_j\}_{j \in \mathcal{I}_H \cup \{0\}}$ forms a partition of $\mathbb{R}^d$.

Here we describe a practical method for the construction of histogram transforms, to which we are confined in this study. Starting with a $d \times d$ square matrix $M$ consisting of $d^2$ independent univariate standard normal random variates, a Householder QR decomposition is applied to obtain a factorization of the form $M = R \cdot W$, with an orthogonal matrix $R$ and an upper triangular matrix $W$ with positive diagonal elements. The resulting matrix $R$ is orthogonal by construction and can be shown to be uniformly distributed. Unfortunately, if $R$ does not feature a positive determinant, then it is not a proper rotation matrix according to definition (1). In this case, we can change the sign of the first column of $R$ to construct a new rotation matrix $R^+$ that satisfies condition (1). We build the diagonal scaling matrix $S$ whose diagonal elements $s_i$ are drawn from the well-known Jeffreys prior, that is, $\log(s_i)$ follows the uniform distribution over a certain interval of real numbers $[\log(\underline{s}_0), \log(\overline{s}_0)]$ for fixed constants $\underline{s}_0$ and $\overline{s}_0$ with $0 < \underline{s}_0 < \overline{s}_0 < \infty$. For simplicity and uniformity of notation, in the sequel we denote $\overline{h}_0 = \underline{s}_0^{-1}$ and $\underline{h}_0 = \overline{s}_0^{-1}$, and then we have $h_i \in [\underline{h}_0, \overline{h}_0] = [\overline{s}_0^{-1}, \underline{s}_0^{-1}]$, $i = 1, \ldots, d$. Moreover, the translation vector $b$ is drawn from the uniform distribution over the hypercube $[0, 1]^d$.

Given a histogram transform $H$, the set $\pi_H = \{A_j\}_{j \in \mathcal{I}_H}$ forms a partition of $B_R$. We consider the following function set $\mathcal{F}_H$ defined by

$\mathcal{F}_H := \big\{ \sum_{j \in \mathcal{I}_H} c_j \mathbf{1}_{A_j} : c_j \in [-M, M] \big\}$. $\quad$ (7)

In order to constrain the complexity of $\mathcal{F}_H$, we penalize the bin width $h := (h_i)_{i=1}^d$ of the partition $\pi_H$.
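The construction just described can be summarized in a few lines of code. The following Python sketch is our own illustration under the stated recipe (Householder QR of a Gaussian matrix, a sign flip to enforce $\det(R) = 1$, log-uniform stretching scales, and a uniform translation); the function names are hypothetical, not from the paper.

```python
import numpy as np

def random_histogram_transform(d, s0_low, s0_high, rng):
    """Draw (R, S, b): a uniformly distributed rotation R with det(R) = +1,
    a diagonal stretching S with log-uniform scales in [s0_low, s0_high],
    and a translation b uniform on [0, 1]^d."""
    G = rng.standard_normal((d, d))
    R, W = np.linalg.qr(G)
    R = R * np.sign(np.diag(W))   # fix signs so the triangular factor has a positive diagonal
    if np.linalg.det(R) < 0:      # flip one column to obtain a proper rotation, cf. (1)
        R[:, 0] = -R[:, 0]
    s = np.exp(rng.uniform(np.log(s0_low), np.log(s0_high), size=d))  # Jeffreys prior
    return R, np.diag(s), rng.uniform(0.0, 1.0, size=d)

def bin_indices(X, R, S, b):
    """Map each row x of X to its transformed bin index floor(H(x)),
    where H(x) = R S x + b, cf. (4)-(5)."""
    return np.floor(X @ (R @ S).T + b).astype(int)

rng = np.random.default_rng(0)
R, S, b = random_histogram_transform(d=2, s0_low=0.5, s0_high=2.0, rng=rng)
X = rng.uniform(-1.0, 1.0, size=(5, 2))
print(bin_indices(X, R, S, b))    # one d-dimensional bin index per sample
```

Cells of the input-space partition $\pi_H$ correspond to distinct values of these integer index vectors.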
Then the histogram transform regressor (HTR) can be produced by regularized empirical risk minimization (RERM) over $\mathcal{F}_H$, i.e.

$(f_{\mathrm{D}}, h^*) = \operatorname{arg\,min}_{f \in \mathcal{F}_H,\, h \in \mathbb{R}^d} \Omega(h) + \mathcal{R}_{L,\mathrm{D}}(f)$, $\quad$ (8)

where $\Omega(h) := \lambda h_0^{-2d}$. It is worth pointing out that, for simplicity of computation, we adopt an isotropic penalty rather than penalizing each element $h_1, \ldots, h_d$ separately.

2.4. Boosted Histogram Transform with L2 Penalty

Boosting is the task of converting inaccurate weak learners into a single accurate predictor. To be specific, let a restricted family of functions $\mathcal{F}$ be the set of base learners; a general boosting algorithm then combines a sequence of functions $\{f_t\}_{t=1}^T$ from $\mathcal{F}$ to minimize a certain empirical loss, and the final predictor can be represented as $F = \sum_{t=1}^T w_t f_t$, where $w_t \ge 0$, $t = 1, \ldots, T$, are weights and $f_t \in \mathcal{F}$, $t = 1, \ldots, T$. From a functional gradient descent viewpoint in statistics (Friedman, 2001), boosting is reformulated as a stage-wise optimization problem with different loss functions. In this scenario, gradient boosting requires computing the negative functional gradient as the response,

$U_i = -\dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\Big|_{f(x_i) = \hat{f}(x_i)}$,

and selecting a particular model from the allowable class of functions at each boosting iteration to update the predictor. In this work, we mainly focus on the boosting algorithm equipped with histogram transform regressors as base learners, since they are weak predictors and enjoy computational efficiency.

Before we proceed, we need to introduce the function space that we are most interested in to establish our learning theory. Assume that $\{H_t\}_{t=1}^T$ is an i.i.d. sequence of histogram transforms drawn from some probability measure $\mathrm{P}_H$, and that $\mathcal{F}_t := \mathcal{F}_{H_t}$, $t = 1, \ldots, T$, are defined as in (7). Then we define the function space $E$ by

$E := \big\{ f : B_R \to \mathbb{R} \mid f = \sum_{t=1}^T w_t f_t,\ f_t \in \mathcal{F}_t \big\}$. $\quad$ (9)

Moreover, for $f \in E$, we define $\|f\|_E := \inf \sum_{t=1}^T |w_t|^2$, where the infimum is taken over all representations $f = \sum_{t=1}^T w_t f_t$ with $f_t \in \mathcal{F}_t$. Then for any $f \in E$, by the Cauchy-Schwarz inequality, we immediately get $\|f\|_\infty \le M \sum_{t=1}^T |w_t| \le M (T \|f\|_E)^{1/2}$. In fact, $(E, \|\cdot\|_E)$ is a function space that consists of measurable and bounded functions.

As mentioned above, boosting methods may be viewed as iterative methods for optimizing a convex empirical cost function. To simplify the theoretical analysis, following the approach of Blanchard et al. (2003), we ignore the dynamics of the optimization procedure and simply consider minimizers of an empirical cost function to establish the oracle inequalities, which leads to the following definition.

Definition 1. Let $E$ be the function space (9) and $L$ be the least square loss. Given $\lambda_1 > 0$, $\lambda_2 > 0$, we call a learning method that assigns to every $D \in (\mathcal{X} \times \mathcal{Y})^n$ a function $f_{\mathrm{D},B} : \mathcal{X} \to \mathbb{R}$ such that

$(f_{\mathrm{D},B}, h^*) = \operatorname{arg\,min}_{f \in E,\, h \in \mathbb{R}^d} \Omega_\lambda(f) + \mathcal{R}_{L,\mathrm{D}}(f)$ $\quad$ (10)

a boosted histogram transform for regression (BHTR) algorithm with respect to $E$, where $\Omega_\lambda(f)$ is defined by $\Omega_\lambda(f) := \lambda_1 \Omega_1(f) + \lambda_2 \Omega_2(f) := \lambda_1 \|f\|_E + \lambda_2 h_0^{-2d}$.

The regularization term consists of two components. The first term is motivated by the fact that early boosting methods such as AdaBoost may overfit in the presence of label noise. It helps control the degree of overfitting through the $L_2$-norm of the weights of the composite estimators and helps achieve the consistency and convergence results. The second term is added to control the bin width of the histogram transform, which has been discussed in Subsection 2.3. In fact, it is equivalent to adding the $L_p$-norm of the base learners $f_t$, since piecewise constant functions are applied on cells with volume no more than $\overline{h}_0^d$.
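For a fixed histogram transform, the unregularized empirical risk minimizer over $\mathcal{F}_H$ in (7) has a closed form: each occupied cell receives the mean of its responses, projected onto $[-M, M]$. The following Python sketch (again our own illustrative code, with hypothetical names) makes this explicit; it is the base-learner fit reused by the boosting procedure below.

```python
import numpy as np

def fit_htr(bins, y, M):
    """Piecewise-constant least-squares fit over a fixed partition:
    the optimal coefficient of each occupied cell is the clipped cell mean."""
    cells = {}
    for idx, yi in zip(map(tuple, bins), y):
        cells.setdefault(idx, []).append(yi)
    return {idx: float(np.clip(np.mean(vals), -M, M)) for idx, vals in cells.items()}

def predict_htr(model, bins, default=0.0):
    """Predict with a fitted HTR; cells unseen during fitting fall back to `default`."""
    return np.array([model.get(tuple(idx), default) for idx in bins])

# Toy usage with integer bin indices, e.g. produced by floor(H(x)).
rng = np.random.default_rng(1)
bins = rng.integers(0, 3, size=(200, 2))
y = rng.normal(size=200)
model = fit_htr(bins, y, M=5.0)
preds = predict_htr(model, bins)
```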
To conduct the theoretical analysis, we also need the infinite sample version of Definition 1. To this end, we fix a distribution $\mathrm{P}$ on $\mathcal{X} \times \mathcal{Y}$ and let the function space $E$ be as in (9). Then every $f_{\mathrm{P},B} \in E$ satisfying

$\Omega_\lambda(f_{\mathrm{P},B}) + \mathcal{R}_{L,\mathrm{P}}(f_{\mathrm{P},B}) = \inf_{f \in E} \Omega_\lambda(f) + \mathcal{R}_{L,\mathrm{P}}(f)$

is called an infinite sample version of BHTR with respect to $E$ and $L$. Moreover, the approximation error function $A(\lambda)$ is defined by

$A(\lambda) = \inf_{f \in E} \Omega_\lambda(f) + \mathcal{R}_{L,\mathrm{P}}(f) - \mathcal{R}_{L,\mathrm{P}}^*$. $\quad$ (11)

With all these preparations, we now present a general form of the BHTR algorithm in Algorithm 1. Indeed, the randomness of the histogram transform provides an effective procedure for carrying out boosting. With the help of HTR, we repeat the least squares fitting of residuals. Moreover, we introduce the learning rate $\rho$ to dampen the move along the gradient descent update, which is related to regularization through shrinkage.

Algorithm 1 Boosted Histogram Transform for Regression
Input: training data $D := \{(x_1, y_1), \ldots, (x_n, y_n)\}$; bin width parameters $\underline{h}_0, \overline{h}_0$; learning rate $\rho > 0$; number of iterations $T$.
Initialization: $U_i = y_i$, $i = 1, \ldots, n$; $\widehat{F}_0(x) = 0$.
for $t = 1$ to $T$ do
  Generate a random affine transform matrix $H_t^n = R_t^n \cdot S_t^n$;
  Apply the data-independent splitting to the transformed sample space;
  Apply constant functions to each cell, that is, fit the residuals with a function $f_t$ such that $f_t = \operatorname{arg\,min}_{f \in \mathcal{F}_t} \sum_{i=1}^n L(U_i, f(x_i))$, where $\mathcal{F}_t$ is defined as in (7) for $H_t^n$;
  Update: $\widehat{F}_t(x) = \widehat{F}_{t-1}(x) + \rho f_t(x)$;
  Compute the residuals $U_i = y_i - \widehat{F}_t(x_i)$, $i = 1, \ldots, n$.
end for
Output: the boosted histogram transform estimator for regression, $f_{\mathrm{D}}(x) = \widehat{F}_T(x)$.

In fact, the gradient boosting Algorithm 1 converges to the empirical risk minimizer of the mean squared error

$\frac{1}{n} \sum_{i=1}^n (y_i - F(x_i))^2$ $\quad$ (12)

with respect to the function space $E$ defined by (9), which is illustrated as follows:

(i) In the least squares regression setting, the goal of a gradient boosting algorithm with $T$ stages is to fit a function $F$ of the form $F(x) = \sum_{t=1}^T f_t(x)$ to minimize (12).

(ii) At stage $t$ ($1 \le t \le T$), our algorithm should add some new estimator to the imperfect model $F_{t-1}$ to correct the errors of its predecessor. For regression problems, we observe that the residuals are the negative gradients (with respect to $F(x)$) of the squared error loss function $(y - F(x))^2/2$. Then gradient boosting fits $f_t$ from the hypothesis space defined as in (7) to the residuals $U_i = y_i - F_{t-1}(x_i)$.

(iii) In other words, Algorithm 1 does so by starting with the model $F_0(x) = 0$ and incrementally expanding it in a greedy fashion. The main idea is to apply a (functional gradient) descent step to this minimization problem, which is in general computationally infeasible to solve exactly.

(iv) Finally, the regularization of the gradient boosting method is realized by shrinkage, which consists in modifying the update rule with the learning rate $\rho$ as shown in Algorithm 1 to improve the generalization ability of the model.

In summary, in accordance with the empirical risk minimization principle, the gradient boosting algorithm tries to find an approximation $F(x)$ that minimizes the average value of the loss function on the training set, i.e., minimizes the empirical risk with respect to the space $E$ defined by (9).
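The training loop of Algorithm 1 then amounts to repeatedly drawing a fresh histogram transform, fitting the current residuals by an HTR, and applying the shrunken update. The Python sketch below is our own illustration and reuses the hypothetical helpers random_histogram_transform, bin_indices, fit_htr, and predict_htr from the previous sketches.

```python
import numpy as np

def train_bhtr(X, y, T, rho, s0_low, s0_high, M, rng):
    """Gradient-boosting loop of Algorithm 1 with HTR base learners."""
    U = y.astype(float)                        # residuals, initialized to the labels
    ensemble = []
    for _ in range(T):
        R, S, b = random_histogram_transform(X.shape[1], s0_low, s0_high, rng)
        bins = bin_indices(X, R, S, b)
        f_t = fit_htr(bins, U, M)              # least-squares fit to the current residuals
        U = U - rho * predict_htr(f_t, bins)   # shrinkage update: F_t = F_{t-1} + rho * f_t
        ensemble.append((R, S, b, f_t))
    return ensemble

def predict_bhtr(ensemble, X, rho):
    """F_T(x) = rho * sum_t f_t(x)."""
    preds = np.zeros(len(X))
    for R, S, b, f_t in ensemble:
        preds += rho * predict_htr(f_t, bin_indices(X, R, S, b))
    return preds
```

In this sketch the regularization of Definition 1 is reflected only implicitly, through the admissible bin-width range and the shrinkage parameter $\rho$, which is how Algorithm 1 operates in practice.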
3. Theoretical Results

In this paper, the theoretical analysis is built on the Hölder spaces $C^{k,\alpha}$ consisting of $(k, \alpha)$-Hölder continuous functions of different orders of smoothness.

Definition 2. Let $k \in \mathbb{N}$, $\alpha \in (0, 1]$, and $R > 0$. We say that a function $f : B_R \to \mathbb{R}$ is $(k, \alpha)$-Hölder continuous if there exists a finite constant $c_L > 0$ such that $\|\nabla^\ell f\|_\infty \le c_L$ for all $\ell \in \{1, \ldots, k\}$ and $\|\nabla^k f(x) - \nabla^k f(x')\| \le c_L \|x - x'\|^\alpha$ for all $x, x' \in B_R$. The set of such functions is denoted by $C^{k,\alpha}(B_R)$.

From Definition 2 we see that the functions contained in the space $C^{k,\alpha}$ with larger $k$ enjoy a higher level of smoothness. In particular, for the special case $k = 0$, the corresponding function space $C^{0,\alpha}(B_R)$ coincides with the commonly used $\alpha$-Hölder continuous function space $C^\alpha(B_R)$.

Throughout this paper, we make the following assumption on the bin width $h$.

Assumption 1. Let the bin width $h \in [\underline{h}_0, \overline{h}_0]$ be defined as in (3), and assume that there exists some constant $c_0 \in (0, 1)$ such that $c_0 \underline{h}_0 \le \overline{h}_0 \le c_0^{-1} \underline{h}_0$. Moreover, if the bin width $h$ depends on the sample size $n$, that is, $h_n \in [\underline{h}_{0,n}, \overline{h}_{0,n}]$, assume that there exists a constant $c_0 \in (0, 1)$ such that $c_0 \underline{h}_{0,n} \le \overline{h}_{0,n} \le c_0^{-1} \underline{h}_{0,n}$ for all $n$.

Assumption 1 requires that the upper and lower bounds of the bin width $h$ be of the same order. In other words, the extent of stretching in each dimension cannot vary too much.

Finally, to leave out the boundary effect on the convergence rate, we denote by $L_{h_0}(x, y, t)$ the least squares loss function restricted to $B^+_{R, \sqrt{d}\, h_0}$, that is,

$L_{h_0}(x, y, t) := \mathbf{1}_{B^+_{R, \sqrt{d}\, h_0}}(x)\, L(x, y, t)$, $\quad$ (13)

where $L(x, y, t)$ is the least squares loss.

3.1. Convergence Rates for BHTR in $C^{0,\alpha}$

Theorem 1. Let the histogram transform $H_n$ be defined as in (4) with bin width $h_n$ satisfying Assumption 1, and let $f_{\mathrm{D},B}$ be defined as in (10). Furthermore, suppose that the Bayes decision function satisfies $f^*_{L,\mathrm{P}} \in C^{0,\alpha}$. Moreover, let $\{\lambda_{1,n}\}$, $\{\lambda_{2,n}\}$, and $\{h_{0,n}\}$ be chosen as

$\lambda_{1,n} := n^{-\frac{2\alpha}{(4-2\delta)\alpha+d}}$, $\quad \lambda_{2,n} := n^{-\frac{2(\alpha+d)}{(4-2\delta)\alpha+d}}$, $\quad h_{0,n} := n^{-\frac{1}{(4-2\delta)\alpha+d}}$,

where $\delta := (h_{0,n}/c_d)^d$. Then for all $\tau > 0$,

$\mathcal{R}_{L,\mathrm{P}}(f_{\mathrm{D},B}) - \mathcal{R}^*_{L,\mathrm{P}} \lesssim n^{-\frac{2\alpha}{(4-2\delta)\alpha+d}}$

holds with probability $\mathrm{P}^n \otimes \mathrm{P}_H$ at least $1 - 3e^{-\tau}$.

3.2. Convergence Rates for BHTR in $C^{1,\alpha}$

Theorem 2. Let the histogram transform $H_n$ be defined as in (4) with bin width $h_n$ satisfying Assumption 1, and let $T_n$ be the number of iterations. Furthermore, let $f_{\mathrm{D},B}$ be defined as in (10), and suppose that the Bayes decision function satisfies $f^*_{L,\mathrm{P}} \in C^{1,\alpha}$ and that $\mathrm{P}_X$ is the uniform distribution. Moreover, let $L_{h_0}(x, y, t)$ be the restricted least squares loss defined as in (13), and let the sequences $\{T_n\}$, $\{\lambda_{1,n}\}$, $\{\lambda_{2,n}\}$, and $\{h_{0,n}\}$ be chosen as

$\lambda_{1,n} := n^{-\frac{2}{2(1+\alpha)(2-\delta)+d}}$, $\quad \lambda_{2,n} := n^{-\frac{2(\alpha+d+1)}{2(1+\alpha)(2-\delta)+d}}$, $\quad h_{0,n} := n^{-\frac{1}{2(1+\alpha)(2-\delta)+d}}$, $\quad T_n := n^{\frac{2\alpha}{2(1+\alpha)(2-\delta)+d}}$,

where $\delta := (h_{0,n}/c_d)^d$ with $c_d$ depending only on $d$. Then, for all $\tau > 0$, the boosted histogram transform regressor satisfies

$\mathcal{R}_{L_{h_0},\mathrm{P}}(f_{\mathrm{D},B}) - \mathcal{R}^*_{L_{h_0},\mathrm{P}} \lesssim n^{-\frac{2(1+\alpha)}{2(1+\alpha)(2-\delta)+d}}$ $\quad$ (14)

with probability $\mathrm{P}^n$ not less than $1 - 3e^{-\tau}$ in expectation with respect to $\mathrm{P}_H$.

Note that as $n \to \infty$, we have $h_{0,n} \to 0$ and hence $\delta \to 0$, and thus the upper bound for our BHTR asymptotically attains a convergence rate slightly faster than

$n^{-\frac{2(1+\alpha)}{4(1+\alpha)+d}}$. $\quad$ (15)

Moreover, the excess risk decreases as $T_n$ increases at the beginning, and when $T_n$ reaches a certain level, the algorithm achieves the optimal learning rate. Compared to kernel regression or orthonormal basis regression, the convergence rates for BHTR are suboptimal. We conjecture that this is mainly due to the fact that the entropy number estimate (B.12) in the supplementary material may not be tight enough. As a result, we are able to show that only when $d > 2(1+\alpha)/\alpha$ does BHTR outperform single base estimators.
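To see where this threshold on $d$ comes from, one can compare the asymptotic exponents of the two bounds directly; the short derivation below is added for clarity and follows from (15) and the lower bound for HTR stated next.

```latex
% BHTR upper bound:  n^{-2(1+\alpha)/(4(1+\alpha)+d)}   (asymptotic form of (14), cf. (15))
% HTR lower bound:   n^{-2/(2+d)}
% BHTR converges faster than HTR iff its rate exponent is larger:
\frac{2(1+\alpha)}{4(1+\alpha)+d} > \frac{2}{2+d}
\;\Longleftrightarrow\; (1+\alpha)(2+d) > 4(1+\alpha)+d
\;\Longleftrightarrow\; \alpha d > 2(1+\alpha)
\;\Longleftrightarrow\; d > \frac{2(1+\alpha)}{\alpha}.
```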
Theorem 3. Let the histogram transform $H_n$ be defined as in (4) with bin width $h_n$ satisfying Assumption 1 with $h_{0,n} \le 1$. Furthermore, let the histogram transform regressor $f_{\mathrm{D}}$ be defined as in (8) and the regression model be defined by $Y := f(X) + \varepsilon$, where $\mathrm{P}_X$ is the uniform distribution over $B_R$ and $\varepsilon$ is independent of $X$ such that $\mathbb{E}(\varepsilon \mid X) = 0$ and $\mathrm{Var}(\varepsilon \mid X) = \sigma^2 < \infty$. Moreover, assume that $f \in C^{1,\alpha}$ and that there exists a constant $c_f \in (0, \infty)$ such that $\|\nabla f\| \ge c_f$. Then for all $n > N_1$, there holds

$\mathcal{R}_{L,\mathrm{P}}(f_{\mathrm{D}}) - \mathcal{R}^*_{L,\mathrm{P}} \gtrsim n^{-2/(2+d)}$ $\quad$ (16)

in expectation with respect to $\mathrm{P}^n \otimes \mathrm{P}_H$, where the constant $N_1$ is specified in the proof.

Note that for any $\alpha \in (0, 1]$, if $d \ge 2(1+\alpha)/\alpha$, then the upper bound (14) of the convergence rate for BHTR is smaller than the lower bound (16) for HTR, which explains the benefits of the boosting procedure.

The theoretical analysis above implies that, under the restriction on the hypothesis space, or equivalently, with the additional regularization terms in Definition 1, the empirical functional minimization is equivalent to minimizing the generalization error; a similar discussion can be found in Lugosi & Vayatis (2004). As a result, according to Theorems 1 and 2, if we feed Algorithm 1 with appropriate $\underline{h}_0$, $\overline{h}_0$, $T$, and $\rho$, we obtain a linear combination of base learners which minimizes the empirical risk with respect to the space (9). We mention that all the proofs in this paper can be found in the supplementary material.

4. Numerical Experiments

4.1. Experimental Setup

We generate the random rotation matrix $R$ in the manner described in Section 2.3 and apply the well-known Jeffreys prior to the scale parameters (Jeffreys, 1946). To be specific, we draw $\log(s_i)$ from the uniform distribution over the interval $[\log(\underline{s}_0), \log(\overline{s}_0)]$. Recall that $h = s^{-1}$ stands for the bin width vector measured in the input space. We choose $\underline{s}_0$ and $\overline{s}_0$, as recommended by López-Rubio (2013), based on

$\widehat{h} = 3.5 \sigma n^{-1/(2+d)}$,

where $\sigma := \sqrt{\operatorname{trace}(V)/d}$ is the standard deviation defined by $V := \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$ and $\bar{x} := \frac{1}{n} \sum_{i=1}^n x_i$. Then we can invert the bin width vector to obtain the scale parameter

$\widehat{s} = (\widehat{h})^{-1} = (3.5\sigma)^{-1} n^{\frac{1}{2+d}}$,

which can be further refined as $\log(\underline{s}_0) := s_{\min} + \log(\widehat{s})$ and $\log(\overline{s}_0) := s_{\max} + \log(\widehat{s})$, where $s_{\min} < s_{\max}$ are tunable parameters. Finally, to measure the performance of the regression estimators, we adopt the mean squared error (MSE) defined by $\mathrm{MSE}(\widehat{f}) = \frac{1}{n} \sum_{j=1}^n (y_j - \widehat{f}(x_j))^2$ over the test set $\{(x_j, y_j)\}_{j=1}^n$.

4.2. Parameter Analysis for Histogram Transforms

In this subsection, we mainly conduct experiments dealing with the parameters of the histogram transform for our BHTR algorithm, namely the lower and upper scale parameters $s_{\min}, s_{\max} \in \mathbb{R}$. To this end, we consider the following model:

$Y = \sin(16X) + \varepsilon$, $\quad$ (17)

where $X \sim \mathrm{Unif}[0, 1]$ and $\varepsilon \sim \mathcal{N}(0, 1)$. Recall that the scale parameters $s_{\min}$ and $s_{\max}$ of the stretching matrix $S$ control the size of the histogram bins. For regions with complex data structure, smaller bins are required, while regions with simple structure call for larger bins. A narrower range of bin sizes is accommodated to cope with the varying scales while preserving a homogeneous structure. We perform experiments with $n = 500$ training data and then predict 2000 test observations with four pairs of scale parameters $(s_{\min}, s_{\max}) \in \{(-2, 0), (-1, 1), (0, 2), (1, 3)\}$. In addition, we select $\rho = 0.01$ and $T = 500$. The results are shown in Figure 2.

Figure 2. Predictions under the four scale-parameter settings: (a) $(s_{\min}, s_{\max}) = (-2, 0)$; (b) $(-1, 1)$; (c) $(0, 2)$; (d) $(1, 3)$. Red points represent the training sample and green ones denote the predicted values on the test sample.
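For reference, the data-dependent choice of the scale interval described in Section 4.1 can be written out as follows; this is our own sketch, with hypothetical names, of the heuristic $\widehat{h} = 3.5\sigma n^{-1/(2+d)}$ and the offsets $s_{\min}$, $s_{\max}$.

```python
import numpy as np

def scale_interval(X, s_min, s_max):
    """Return the interval [log(s0_low), log(s0_high)] from which log(s_i) is drawn:
    h_hat = 3.5 * sigma * n^(-1/(2+d)),  s_hat = 1 / h_hat,
    log(s0_low) = s_min + log(s_hat),  log(s0_high) = s_max + log(s_hat)."""
    n, d = X.shape
    V = np.atleast_2d(np.cov(X, rowvar=False))   # sample covariance, 1/(n-1) normalization
    sigma = np.sqrt(np.trace(V) / d)
    h_hat = 3.5 * sigma * n ** (-1.0 / (2 + d))
    log_s_hat = -np.log(h_hat)                   # s_hat = h_hat^(-1)
    return s_min + log_s_hat, s_max + log_s_hat

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 1))
log_lo, log_hi = scale_interval(X, s_min=-1.0, s_max=1.0)
# Each log(s_i) is then drawn uniformly from [log_lo, log_hi].
```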
As we can see, lower values of these parameters lead to a coarser approximation of the underlying Bayes decision function, which results in a loss of precision. From Figure 2(a) we see that the regressor underfits when the bin width is too large. On the contrary, when the bin width is too small, few samples lie in most of the histogram bins, which results in overfitting, as shown in Figure 2(d). Therefore, it is of great importance to properly choose the values of $s_{\min}$ and $s_{\max}$, for which a grid search procedure can be adopted.

4.3. Learning Rate

From the theory of boosting, it is well known that weak learners should underfit; the bias and variance are then decreased and increased accordingly during the iterations. For an estimator with a high level of underfitting, more boosting iterations are needed in order to achieve a competitive performance compared with other efficient learning algorithms. Therefore, if HTRs are used as base learners, we should select a relatively small learning rate $\rho$ in Algorithm 1. In this subsection, we conduct simulations to verify this argument and show the relationship between the generalization ability and the learning rate in Algorithm 1.

In the simulation, we choose the pair of scale parameters $(s_{\min}, s_{\max})$ to be $(-1, 1)$ and the learning rate $\rho$ over the set $\{0.01 + 0.03k,\ k = 0, \ldots, 33\}$. Then we perform experiments with 350 training data and 150 validation data for the model (17), with $\varepsilon$ independently drawn from the Gaussian distribution $\mathcal{N}(0, 1)$. For each learning rate parameter, we run BHTR on the training set until $T$ reaches 500. We repeat this procedure 30 times, denote the corresponding estimators by $\{\widehat{F}^i_T(x)\}_{i=1}^{30}$, and compute the average of these estimators, that is, $\widehat{F}_T(x) = \frac{1}{30} \sum_{i=1}^{30} \widehat{F}^i_T(x)$. For each $\rho$, we estimate the error over the validation set $\{(x_j, y_j)\}_{j=1}^{150}$ by $\frac{1}{150} \sum_{j=1}^{150} (y_j - \widehat{F}_T(x_j))^2$. Then we record the optimal $T$ with respect to the validation data and the corresponding test error on 2000 data. Figure 3 reports the test errors and iteration numbers versus the learning rate parameters.

Figure 3. Figure (a) shows test errors of BHTR versus learning rate parameters, while Figure (b) shows the optimal iteration numbers.

From Figure 3 we see that the test error grows as the learning rate $\rho$ increases from 0.01 to 1. Furthermore, when $\rho$ is larger than some value, e.g. 0.2 in this simulation, BHTR needs nearly the same number of steps to achieve the optimal test error. Moreover, Figure 3 shows that the more underfitted the HTR is, the more boosting iterations are required. Therefore, it is reasonable to use HTR to build weak learners for boosting, and in this case a smaller learning rate $\rho$ is necessary to achieve better experimental performance.

4.4. Behavior of BHTR

In this subsection, we give a more comprehensive understanding of the behavior of BHTR. Since BHTR is run with a fixed learning rate parameter and a varying number of iterations, we perform the following experiment to study how the test error behaves as a function of the number of iterations. In this simulation, the learning rate $\rho$ is picked from the set $\{0.002, 0.01, 0.25, 0.5\}$ and the other experimental setups are the same as those in Section 4.3. In Figure 4, the test error versus the number of iterations is plotted for each learning rate.
It can be clearly seen that the test error typically decreases until the number of iterations reaches a certain value, and then it increases slowly, which reflects the trade-off between the approximation error and the sample error in the theoretical results of Section 3. Note that a too small $\rho$ is likely to make the test error converge too slowly and brings an additional computational burden. Therefore, it is necessary to select an appropriate learning rate. Furthermore, Figure 4 shows a stable relation between the generalization performance and the number of iterations for small $\rho$, e.g. $\rho < 0.05$ in this experiment. This tells us that overfitting does not seem to occur even though we run the boosting with plenty of iterations. In addition, it is easily seen that the number $T$ of iterations at which the test error achieves its minimal value would increase as $\rho$ increases. Last but not least, the lower test error implies better performance of BHTR than HTR in terms of accuracy for a wide range of $\rho$.

Figure 4. Test errors versus the number of iterations for different learning rates. The horizontal line indicates the test error for HTR.

4.5. Synthetic and Real Data Analysis

In this section, our experiments are carried out on three benchmark synthetic datasets and eight real datasets. For our BHTR, we first scale each feature individually to the range $[0, 1]$ on the training set, and then impose a grid of size 4 on the learning rate $\rho \in \{0.2, 0.1, 0.05, 0.01\}$, a grid of size 3 on $s_{\min} \in \{-4, -3, -2\}$, and a grid of size 2 on $s_{\max} - s_{\min} \in \{1, 2\}$. For each element of the Cartesian product of these grids, we run BHTR with the number of iterations $T = 3000$. The optimal number of iterations $t^* \le T$ and the optimal parameters $\rho$, $s_{\min}$, and $s_{\max}$ are chosen by 5-fold cross-validation. The comparisons are conducted among:

Gradient Boosting Regression Tree (GBRT): proposed by Friedman (2001). We utilize the package sklearn in Python with n_estimators = 3000 and the other parameters set to their defaults.

Random Forest (RF): proposed by Breiman (2001). The sklearn package in Python is applied with n_estimators = 100 and the other parameters set to their defaults.

Boosted Kernel Ridge Regression (BKRR): proposed by Lin et al. (2019), which combines L2-Boosting with kernel ridge regression. The regularization parameter $\lambda$ is chosen from the grid np.logspace(0, 4, 5) and the bandwidth $\sigma$ from the grid np.logspace(-1, 1, 3). The number of iterations is set to 3000 and the hyper-parameters are chosen by 5-fold cross-validation.

Comparisons on Synthetic Datasets. We choose Friedman's benchmark functions (Friedman, 1991), which are widely and frequently employed models in regression problems (Smola & Schölkopf, 2004; Brown et al., 2005; Feng et al., 2015), listed as follows:

1. $f_1(x) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5$;
2. $f_2(x) = \sqrt{x_1^2 + (x_2 x_3 - 1/(x_2 x_4))^2}$;
3. $f_3(x) = \arctan\big((x_2 x_3 - 1/(x_2 x_4))/x_1\big)$.

For $f_1$ we have $x = (x_1, \ldots, x_{10})$, where $x_j \sim \mathrm{Unif}[0, 1]$, $j = 1, \ldots, 5$, and $x_j$, $j = 6, \ldots, 10$, are noise variables. For $f_2$ and $f_3$, we have $x = (x_1, \ldots, x_4)$, where $x_1 \sim \mathrm{Unif}[0, 100]$, $x_2 \sim \mathrm{Unif}[40\pi, 560\pi]$, $x_3 \sim \mathrm{Unif}[0, 1]$, and $x_4 \sim \mathrm{Unif}[1, 11]$. Moreover, we assume that the noise added to each function $f$ is drawn from the standard normal distribution. For each function, 1000 observations are generated for training and another 1000 for testing. The cross-validation procedure is adopted for hyper-parameter selection.
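As a concrete reference for the data-generation step, here is a short Python sketch (our own code) of the three benchmark functions with the sampling ranges listed above; we assume the five noise features of $f_1$ are drawn from the same uniform distribution as the informative ones.

```python
import numpy as np

def friedman_data(which, n, rng, noise=1.0):
    """Generate n samples from Friedman's benchmark function #1, #2 or #3
    with additive standard-normal noise."""
    if which == 1:
        X = rng.uniform(0.0, 1.0, size=(n, 10))           # x6..x10 are noise features
        f = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
             + 20 * (X[:, 2] - 0.5) ** 2 + 10 * X[:, 3] + 5 * X[:, 4])
    else:
        X = np.column_stack([
            rng.uniform(0.0, 100.0, size=n),
            rng.uniform(40 * np.pi, 560 * np.pi, size=n),
            rng.uniform(0.0, 1.0, size=n),
            rng.uniform(1.0, 11.0, size=n),
        ])
        inner = X[:, 1] * X[:, 2] - 1.0 / (X[:, 1] * X[:, 3])
        f = np.sqrt(X[:, 0] ** 2 + inner ** 2) if which == 2 else np.arctan(inner / X[:, 0])
    return X, f + noise * rng.standard_normal(n)

rng = np.random.default_rng(0)
X_train, y_train = friedman_data(1, n=1000, rng=rng)
X_test, y_test = friedman_data(1, n=1000, rng=rng)
```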
Table 1. Average MSE over the synthetic datasets.

Dataset   BHTR     GBRT     RF       BKRR
Fried 1   3.55     2.01     3.87     9.41
Fried 2   258.81   313.76   310.08   10448
Fried 3   1.09     1.43     1.10     1.03

* The best results are marked in bold, and the second best are marked in italic.

Table 1 shows that on the second dataset, Fried 2, BHTR performs the best, while on the remaining two datasets it ranks second.

Comparisons on Real Datasets. The eight real datasets are listed in Table 2. Further information can be found in the supplementary material.

Table 2. Description of the real datasets.

DATASET   N      d     DATASET   N      d
ABA       4177   8     MPG       392    7
BOD       252    14    PYR       74     27
HOU       506    13    SPA       3107   6
MG        1385   6     TRI       186    60

Table 3. Average MSE over the real datasets.

Dataset   BHTR       GBRT       RF         BKRR
ABA       4.60       5.73       4.83       6.43
BOD       1.65e-05   9.51e-06   9.53e-06   3.56e-04
HOU       13.03      10.03      12.33      46.39
MG        0.016      0.017      0.015      0.021
MPG       6.98       8.47       7.67       11.54
PYR       0.0063     0.0067     0.0065     0.010
SPA       0.0123     0.0120     0.0137     0.061
TRI       0.021      0.019      0.016      0.023

* The best results are marked in bold, and the second best are marked in italic.

Table 3 shows that on three datasets, ABA, MPG, and PYR, our BHTR has the best accuracy, and on another two datasets, MG and SPA, BHTR ranks second. From the results in Tables 1 and 3 we come to the conclusion that our method shows promising performance compared to the efficient algorithms GBRT and RF. Note that BKRR, which uses kernel functions, always works worse than the other methods. The experimental results in the simulation reveal that when the feature variables are continuous and the regression function is smooth, the BKRR algorithm may have better performance. However, in real data applications, the feature variables are usually categorical and the smoothness of the regression function cannot be guaranteed. We mention that our BHTR requires less running time than BKRR, since we only need to carry out the QR decomposition of a matrix instead of computing the inverse of the kernel matrix; it needs a bit more time than the other tree-based methods.

Significance Test. We apply the Wilcoxon signed-rank test (Wilcoxon, 1946) to find out whether the performance of our method and the competing methods are significantly different. The results are shown in Table 4.

Table 4. P-values under the Wilcoxon signed-rank test.

Dataset   GBRT       RF         BKRR
Fried 1   0.000089   0.000120   0.000089
Fried 2   0.000254   0.001162   0.000089
Fried 3   0.000089   0.765198   0.000254
ABA       0.000089   0.000120   0.000089
BOD       0.000120   0.000089   0.000089
HOU       0.000892   0.262722   0.000089
MG        0.000681   0.575486   0.000089
MPG       0.000163   0.001325   0.000089
PYR       0.681322   0.550292   0.000449
SPA       0.370261   0.000089   0.000089
TRI       0.022769   0.000219   0.000892

From the results in Tables 1, 3, and 4, we come to the conclusion that our method shows promising performance compared to the efficient algorithms GBRT and RF. At a significance level of 0.05, our method is significantly different from the other methods in many cases. To be specific, on three of the four datasets on which our method performs the best, it shows a pronounced difference. Moreover, for the datasets on which BHTR ranks second, most of the results are significant. We mention that, to improve the performance, Friedman (2002) proposed the stochastic gradient boosting algorithm, which uses subsampling to reduce the variance of the base learners. Therefore, in future work, we believe that to further improve the performance of BHTR, the combination of BHTR with subsampling should be explored.
5. Conclusion

In the present paper, we propose a boosting algorithm called boosted histogram transform for regression (BHTR), based on histogram transforms consisting of random rotations, stretchings, and translations. By conducting a theoretical analysis within the framework of regularized empirical risk minimization in learning theory, we are able to prove fast convergence rates for BHTR when the target function $f^*_{L,\mathrm{P}}$ lies in the Hölder space $C^{0,\alpha}$. Moreover, when $f^*_{L,\mathrm{P}} \in C^{1,\alpha}$, for the first time we successfully derive the upper bound of the convergence rates for the boosted predictor BHTR and the lower bound for the base predictor HTR, which further demonstrates the benefits of the boosting procedure. In the experiments, numerical simulations and real data comparisons with other state-of-the-art methods, including GBRT, random forests, and kernel-based boosting algorithms, are provided to support the theoretical results and justify the performance of BHTR.

Acknowledgements

H. Yang is supported by the National Key R&D Program of China (2018YFC0830300). Z. Lin is supported by NSF China (grant nos. 61625301 and 61731018). Resources supporting this work were provided by the High-performance Computing Platform of Renmin University of China.

References

Blanchard, G., Lugosi, G., and Vayatis, N. On the rate of convergence of regularized boosting classifiers. The Journal of Machine Learning Research, 4(Oct):861-894, 2003.

Blaser, R. and Fryzlewicz, P. Random rotation ensembles. The Journal of Machine Learning Research, 17(1):126-151, 2016.

Breiman, L. Random forests. Machine Learning, 45(1):5-32, 2001.

Brown, G., Wyatt, J. L., and Tiňo, P. Managing diversity in regression ensembles. The Journal of Machine Learning Research, 6(Sep):1621-1650, 2005.

Bühlmann, P. and Yu, B. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324-339, 2003.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, 2016.

Feng, Y., Huang, X., Shi, L., Yang, Y., and Suykens, J. Learning with the maximum correntropy criterion induced losses for regression. The Journal of Machine Learning Research, 16:993-1034, 2015.

Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407, 2000.

Friedman, J. H. Multivariate adaptive regression splines. The Annals of Statistics, pp. 1-67, 1991.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. The Annals of Statistics, pp. 1189-1232, 2001.

Friedman, J. H. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.

Hang, H., Lin, Z., Liu, X., and Wen, H. Histogram transform ensembles for large-scale regression. arXiv preprint arXiv:1912.04738, 2019.

Jeffreys, H. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453-461, 1946.

Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1-50, 2002.
Lin, S.-B., Lei, Y., and Zhou, D.-X. Boosted kernel ridge regression: Optimal learning rates and early stopping. The Journal of Machine Learning Research, 20(46):1-36, 2019.

López-Rubio, E. A histogram transform for probability density function estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):644-656, 2013.

Lugosi, G. and Vayatis, N. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32(1):30-55, 2004.

Mannor, S., Meir, R., and Zhang, T. The consistency of greedy algorithms for classification. In International Conference on Computational Learning Theory, pp. 319-333. Springer, 2002.

Mease, D. and Wyner, A. Evidence contrary to the statistical view of boosting. The Journal of Machine Learning Research, 9(Feb):131-156, 2008.

Park, B., Lee, Y., and Ha, S. L2 boosting in kernel regression. Bernoulli, 15(3):599-613, 2009.

Schapire, R. and Freund, Y. A decision-theoretic generalization of on-line learning and an application to boosting. In Second European Conference on Computational Learning Theory, pp. 23-37, 1995.

Schapire, R. E. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

Smola, A. J. and Schölkopf, B. A tutorial on support vector regression. Statistics and Computing, 14(3):199-222, 2004.

Steinwart, I. and Christmann, A. Support Vector Machines. Information Science and Statistics. Springer, New York, 2008.

Wilcoxon, F. Individual comparisons of grouped data by ranking methods. Journal of Economic Entomology, 39(2):269-270, 1946.

Wyner, A. J., Olson, M., Bleich, J., and Mease, D. Explaining the success of AdaBoost and random forests as interpolating classifiers. The Journal of Machine Learning Research, 18(1):1558-1590, 2017.

Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56-85, 2004.

Zhang, T. and Yu, B. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538-1579, 2005.