# PAC-Bayes Tree: weighted subtrees with guarantees

Tin Nguyen* (MIT EECS, tdn@mit.edu)    Samory Kpotufe (Princeton University ORFE, samory@princeton.edu)

Abstract

We present a weighted-majority classification approach over subtrees of a fixed tree, which provably achieves excess risk of the same order as the best tree-pruning. Furthermore, the computational efficiency of pruning is maintained at both training and testing time, despite having to aggregate over an exponential number of subtrees. We believe this is the first subtree-aggregation approach with such guarantees. The guarantees are obtained via a simple combination of insights from PAC-Bayes theory, which we believe should be of independent interest, as it generically implies consistency for weighted-voting classifiers w.r.t. Bayes, while, in contrast, usual PAC-Bayes approaches only establish consistency of Gibbs classifiers.

1 Introduction

Classification trees endure as popular tools in data analysis, offering both efficient prediction and interpretability, yet they remain hard to analyze in general. So far there are two main approaches with generalization guarantees: in both approaches, a large tree (possibly overfitting the data) is first obtained; one approach is then to prune back this tree down to a subtree² that generalizes better; the alternative approach is to combine all possible subtrees of the tree by weighted majority vote. Interestingly, while both approaches are competitive with other practical heuristics, it remains unclear whether the alternative of weighting subtrees enjoys the same strong generalization guarantees as pruning; in particular, no weighting scheme to date has been shown to be statistically consistent, let alone attain the same tight generalization rates (in terms of excess risk) as pruning approaches.

In this work, we consider a new weighting scheme based on PAC-Bayesian insights [1] that (a) is consistent and attains the same generalization rates as the best pruning of a tree, (b) is efficiently computable at both training and testing time, and (c) competes against pruning approaches on real-world data. To the best of our knowledge, this is the first practical scheme with such guarantees.

The main technical hurdle has to do with a subtle tension between goals (a) and (b) above. Namely, let $T_0$ denote a large tree built on $n$ datapoints, usually a binary tree with $O(n)$ nodes; the family of subtrees $T$ of $T_0$ is typically of exponential size in $n$ [2], so a naive voting scheme that requires visiting all subtrees is impractical. On the other hand, it is known that if the weights decompose favorably over the leaves of $T$ (e.g., are multiplicative over leaves), then efficient classification is possible. Unfortunately, while various such multiplicative weights have been designed for voting with subtrees [3, 4, 5], they are not known to yield statistically consistent prediction. In fact, the best known result to date [5] presents a weighting scheme which can provably achieve an excess risk³ (over the Bayes classifier) of the form $o_P(1) + C \min_T R(h_T)$, where $R(h_T)$ denotes the misclassification rate of a classifier $h_T$ based on subtree $T$. In other words, the excess risk might never go to 0 as the sample size increases, which in contrast is a basic property of the pruning alternative.

* The majority of the research was done when the author was an undergraduate student at Princeton University ORFE.
² Considering only subtrees that partition the data space.
³ The excess risk of a classifier $h$ over the Bayes classifier $h_B$ (which minimizes $R(h)$ over all classifiers $h$) is $R(h) - R(h_B)$.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Furthermore, the approach of [5], based on $\ell_1$-risk minimization, does not trivially extend to multiclass classification, which is most common in practice. Our approach is designed for multiclass by default.

Statistical contribution. PAC-Bayesian theory [1, 6, 7, 8] offers useful insights into designing weighting schemes with generalization guarantees (w.r.t. a prior distribution $P$ over classifiers). However, a direct application of existing results fails to yield a consistent weighted-majority scheme. This is because PAC-Bayes results are primarily concerned with so-called Gibbs classifiers, which in our context correspond to predicting with a random classifier $h_T$ drawn according to a weight distribution $Q$ over subtrees of $T_0$. Instead, we are interested in $Q$-weighted majority classifiers $h_Q$. Unfortunately, the corresponding error $R(h_Q)$ can be twice the risk $R(Q) = \mathbb{E}_{h_T \sim Q} R(h_T)$ of the corresponding Gibbs classifier: this then results, at best (see overview in Section 2.2), in an excess risk of the form
$$\bigl(R(h_Q) - R(h_B)\bigr) \;\le\; \bigl(R(Q) - R(h_B)\bigr) + R(Q) \;=\; o_P(1) + R(h_B),$$
which, similar to [5], does not go to 0. So far, this problem is best addressed in PAC-Bayes results such as the MinCq bound of [6, 8] on $R(h_Q)$, which is tighter in the presence of low correlation between base classifiers. In contrast, our PAC-Bayes result applies even without low correlation between base classifiers, and allows an excess risk $o_P(1) + (C/n) \min_T \log(1/P(T)) \to 0$ (Proposition 2). This first result is in fact of general interest since it extends beyond subtrees to any family of classifiers, and is obtained by carefully combining existing arguments from PAC-Bayes analysis.

However, our basic PAC-Bayes result alone does not ensure convergence at the same rate as that of the best pruning approaches. This requires designing a prior $P$ that scales properly with the size of subtrees $T$ of $T_0$. For instance, if $P$ were uniform over all subtrees of $T_0$, then $\log(1/P(T)) = \Omega(n)$, yielding a vacuous excess risk. We show through information-theoretic arguments that an appropriate prior $P$ can be designed to yield rates of convergence of the same order as that of the best pruning of $T_0$. In particular, our resulting weighting scheme maintains ideal properties of pruning approaches such as adaptivity to the intrinsic dimension of the data (see e.g. [9]).

Algorithmic contribution. We show that we can design a prior $P$ which, while meeting the above statistical constraints, yields posterior weights that decompose favorably over the leaves of a subtree $T$. As a result of this decomposition, the weights of all subtrees can be recovered by simply maintaining corresponding weights at the nodes of the original tree $T_0$, for efficient classification in time $O(\log n)$ (this is illustrated in Figure 1). We then propose an efficient approach to obtain the weights at the nodes of $T_0$, consisting of concurrent top-down and bottom-up dynamic programs that run in $O(n)$ time. These match the algorithmic complexity of the most efficient pruning approaches, and thus offer a practical alternative. Our theoretical results are then verified in experiments over many real-world datasets.
In particular, we show that our weighted-voting scheme achieves similar or better error than pruning on practical problems, as suggested by our theoretical results.

Paper Organization. We start in Section 2 with the theoretical setup and an overview of PAC-Bayes analysis. This is followed in Section 3 with an overview of our statistical results, and in Section 4 with algorithmic results. Our experimental analysis is then presented in Section 5.

2 Preliminaries

2.1 Classification setup

We consider a multiclass setup where the input $X \in \mathcal{X}$, for a bounded subset $\mathcal{X}$ of $\mathbb{R}^D$, possibly of lower intrinsic dimension. For simplicity of presentation we assume $\mathcal{X} \subseteq [0,1]^D$ (as in normalized data). The output $Y \in [L]$, where we use the notation $[L] = \{1, 2, \ldots, L\}$ for $L \in \mathbb{N}$. We are to learn a classifier $h : \mathcal{X} \to [L]$, given an i.i.d. training sample $\{X_i, Y_i\}_{i=1}^{2n}$ of size $2n$ from an unknown distribution over $X, Y$. Throughout, we let $S \doteq \{X_i, Y_i\}_{i=1}^{n}$ and $S_0 \doteq \{X_i, Y_i\}_{i=n+1}^{2n}$, which will serve later to simplify dependencies in our analysis. Our performance measure is as follows.

Definition 1. The risk of a classifier $h$ is given as $R(h) = \mathbb{E}\,\mathbf{1}\{h(X) \neq Y\}$. This is minimized by the Bayes classifier $h_B(x) \doteq \operatorname{argmax}_{l \in [L]} \mathbb{P}(Y = l \mid X = x)$. Therefore, for any classifier $\hat h$ learned over a sample $\{X_i, Y_i\}_i$, we are interested in the excess risk $\mathcal{E}(\hat h) \doteq R(\hat h) - R(h_B)$.

Figure 1: A partition tree $T_0$ over input space $\mathcal{X}$, and a query $x \in \mathcal{X}$ to classify. The leaves of $T_0$ are the 4 cells shown left, and the root is $\mathcal{X}$. A query $x$ follows a single path (shown in bold) from the root down to a leaf. A key insight towards efficient weighted voting is that this path visits all leaves (containing $x$) of any subtree of $T_0$. Therefore, weighted voting might be implemented by keeping a weight $w(A)$ at any node $A$ along the path, where $w(A)$ aggregates the weights $Q(T)$ of every subtree $T$ that has $A$ as a leaf. This is feasible if we can restrict $Q(T)$ to be multiplicative over the leaves of $T$, without trading off accuracy.

Here we are interested in aggregations of classification trees, defined as follows.

Definition 2. A hierarchical partition or (space) partition-tree $T$ of $\mathcal{X}$ is a collection of nested partitions of $\mathcal{X}$; this is viewed as a tree where each node is a subset $A$ of $\mathcal{X}$, each child $A'$ of a node $A$ is a subset of $A$, and whose collection of leaves, denoted $\pi(T)$, is a partition of $\mathcal{X}$. A classification tree $h_T$ on $\mathcal{X}$ is a labeled partition-tree $T$ of $\mathcal{X}$: each leaf $A \in \pi(T)$ is assigned a label $l = l(A) \in [L]$; the classification rule is simply $h_T(x) = l(A)$ for any $x \in A$.

Given an initial tree $T_0$, we will consider only subtrees $T$ of $T_0$ that form a hierarchical partition of $\mathcal{X}$, and we henceforth use the term subtrees (of $T_0$) without additional qualification. Finally, aggregation (of subtrees of $T_0$) consists of majority voting as defined below.

Definition 3. Let $\mathcal{H}$ denote a discrete family of classifiers $h : \mathcal{X} \to [L]$, and let $Q$ denote a distribution over $\mathcal{H}$. The $Q$-majority classifier $h_Q \doteq h_Q(\mathcal{H})$ is one satisfying, for any $x \in \mathcal{X}$,
$$h_Q(x) = \operatorname{argmax}_{l \in [L]} \sum_{h \in \mathcal{H}:\, h(x) = l} Q(h).$$

Our oracle rates of Theorem 1 require no additional assumptions; however, the resulting corollary is stated under standard distributional conditions that characterize convergence rates for tree prunings.

2.2 PAC-Bayes Overview

PAC-Bayes analysis develops tools to bound the error of a Gibbs classifier, i.e., one that predicts with a classifier $h$ randomly sampled according to a distribution $Q$ over a family of classifiers $\mathcal{H}$. In this work we are interested in families $\{h_T\}$ defined over subtrees of an initial tree $T_0$.
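In the tree setting, the path-based voting suggested by Figure 1 and Definition 3 can be sketched as follows. The snippet below is a minimal illustration (not the authors' code): it assumes a simple `Node` structure with axis-aligned splits and assumes that each node already stores the aggregated weight $w(A)$ of all subtrees having $A$ as a leaf.

```python
# Minimal sketch of Q-majority prediction once per-node aggregated weights
#   w(A) = sum of Q(T) over subtrees T having A as a leaf
# are available (Figure 1). Node fields and the axis-aligned split are
# illustrative assumptions, not the paper's implementation.
from collections import defaultdict

class Node:
    def __init__(self, label, weight, split_dim=None, split_val=None,
                 left=None, right=None):
        self.label = label          # l(A): label assigned to cell A
        self.weight = weight        # w(A): aggregated weight of subtrees with A as a leaf
        self.split_dim = split_dim  # None for a leaf of T_0
        self.split_val = split_val
        self.left, self.right = left, right

def predict(root, x):
    """Q-majority vote: accumulate w(A) per label along the root-to-leaf path of x."""
    scores = defaultdict(float)
    node = root
    while node is not None:
        scores[node.label] += node.weight
        if node.split_dim is None:          # reached a leaf of T_0
            break
        node = node.left if x[node.split_dim] <= node.split_val else node.right
    return max(scores, key=scores.get)      # argmax over labels of accumulated weight
```

The walk touches one node per level, so prediction costs $O(\mathrm{height}(T_0))$, i.e., $O(\log n)$ for balanced trees, regardless of how many subtrees carry weight.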
Here we present a basic PAC-Bayes result which we extend for our analysis. While such results are generally presented for the classification risk $R$ (defined above), we keep our presentation generic, as we show later that a different choice of risk leads to stronger results for $R$ than what is possible through direct application of existing results.

Generic Setup. Consider a random vector $Z$, and an i.i.d. sample $Z_{[n]} = \{Z_i\}_{i=1}^{n}$. Let $\mathcal{Z}$ be the support of $Z$, and let $\mathcal{L} = \{\ell_h : h \in \mathcal{H}\}$ be a loss class indexed by a discrete family $\mathcal{H}$, where $\ell_h : \mathcal{Z} \to [0,1]$. For $h \in \mathcal{H}$, the loss $\ell_h$ induces the following risk and empirical counterpart:
$$R_{\mathcal{L}}(h) \doteq \mathbb{E}_Z\, \ell_h(Z), \qquad \hat R_{\mathcal{L}}(h, Z_{[n]}) \doteq \frac{1}{n}\sum_{i=1}^{n} \ell_h(Z_i).$$
In particular, for the classification risk $R$ above and $Z \equiv (X, Y)$, we have $\ell_h(Z) = \mathbf{1}\{h(X) \neq Y\}$. Given a distribution $Q$ over $\mathcal{H}$, the risk (and empirical counterpart) of the Gibbs classifier is then
$$R_{\mathcal{L}}(Q) \doteq \mathbb{E}_{h \sim Q}\, R_{\mathcal{L}}(h), \qquad \hat R_{\mathcal{L}}(Q, Z_{[n]}) \doteq \mathbb{E}_{h \sim Q}\, \hat R_{\mathcal{L}}(h, Z_{[n]}).$$
PAC-Bayesian results bound $R_{\mathcal{L}}(Q)$ in terms of $\hat R_{\mathcal{L}}(Q, Z_{[n]})$, uniformly over all distributions $Q$, given a fixed prior distribution $P$ over $\mathcal{H}$. We will build on the following form of [10], which yields an upper bound that is convex in $Q$ (and can therefore be optimized for a good posterior $Q^*$).

Proposition 1 (PAC-Bayes on $R_{\mathcal{L}}$ [10]). Fix a prior $P$ supported on $\mathcal{H}$, and let $n \ge 8$ and $\delta \in (0,1)$. With probability at least $1 - \delta$ over $Z_{[n]}$, simultaneously for all $\lambda \in (0,2)$ and all posteriors $Q$ over $\mathcal{H}$:
$$R_{\mathcal{L}}(Q) \;\le\; \frac{\hat R_{\mathcal{L}}(Q, Z_{[n]})}{1 - \lambda/2} \;+\; \frac{D_{\mathrm{kl}}(Q \,\|\, P) + \log\bigl(2\sqrt{n}/\delta\bigr)}{\lambda(1 - \lambda/2)\, n},$$
where $D_{\mathrm{kl}}(Q \,\|\, P) \doteq \mathbb{E}_{h \sim Q} \log \frac{Q(h)}{P(h)}$ is the Kullback-Leibler divergence between $Q$ and $P$.

Choice of posterior $Q^*$. Let $Q^*$ minimize the above upper bound, and let $h^*$ minimize $R_{\mathcal{L}}$ over $\mathcal{H}$. Then, by letting $Q_{h^*}$ put all its mass on $h^*$ (so that the bound at $Q^*$ is at most the bound at $Q_{h^*}$), we automatically get that, with probability at least $1 - 2\delta$:
$$R_{\mathcal{L}}(Q^*) \;\le\; C\left(\hat R_{\mathcal{L}}(h^*, Z_{[n]}) + \frac{\log(1/P(h^*)) + \log(n/\delta)}{n}\right) \;\le\; C\left(R_{\mathcal{L}}(h^*) + \frac{\log(1/P(h^*)) + \log(n/\delta)}{n}\right), \tag{1}$$
where the last inequality results from bounding $|R_{\mathcal{L}}(h^*) - \hat R_{\mathcal{L}}(h^*, Z_{[n]})|$ using a Chernoff bound.

Unfortunately, such a direct application is not enough for our purpose when $R_{\mathcal{L}} = R$. We want to bound the excess risk $\mathcal{E}(h_Q)$ of a $Q$-majority classifier $h_Q$ over $h$'s in $\mathcal{H}$. It is known that $R(h_Q) \le 2R(Q)$, which yields a bound of the form (1) on $R(h_{Q^*})$; however, this implies at best that $R(h_{Q^*}) \to 2R(h_B)$, even if $\mathcal{E}(h^*) \to 0$ (which is generally the case for the optimal tree-pruning $h_{T^*}$ [9]). This is a general problem in converting from Gibbs error to that of majority voting, and is studied for instance in [6, 8], where it is shown that $R(h_Q)$ can in fact be smaller than the Gibbs risk in some situations.

Improved choice of $Q^*$. Here, we want to design $Q^*$ such that $R(h_{Q^*}) \to R(h_B)$ (i.e., $\mathcal{E}(h_{Q^*}) \to 0$) at the same rate at which $\mathcal{E}(h_{T^*}) \to 0$. Our solution relies on a proper choice of loss $\ell_h$ that relates more directly to the excess risk $\mathcal{E}$ than the 0-1 loss $\mathbf{1}\{h(x) \neq y\}$. A first candidate is to define $\ell_h(x,y)$ as $e_h(x,y) \doteq \mathbf{1}\{h(x) \neq y\} - \mathbf{1}\{h_B(x) \neq y\}$, since $\mathcal{E}(h) = \mathbb{E}\, e_h(X,Y)$; however $e_h(x,y) \notin [0,1]$, as it can take negative values. This is resolved by considering an intermediate loss $\tilde e_h(x) = \mathbb{E}_{Y|x}\, e_h(x,Y) \in [0,1]$, which is related back to $e_h(x,y)$ by integrating in a suitable order.

3 Statistical results

3.1 Basic PAC-Bayes result

We start with the following intermediate loss family over classifiers $h$, defined w.r.t. the Bayes classifier $h_B$.

Definition 4. Let $e_h(x,y) \doteq \mathbf{1}\{h(x) \neq y\} - \mathbf{1}\{h_B(x) \neq y\}$, and $\tilde e_h(x) \doteq \mathbb{E}_{Y|x}\, e_h(x,Y)$; define $\tilde{\mathcal{E}}(h,S) \doteq \frac{1}{n}\sum_{i=1}^{n} \tilde e_h(X_i)$ and $\hat{\mathcal{E}}(h,S) \doteq \frac{1}{n}\sum_{i=1}^{n} e_h(X_i, Y_i)$.

Our first contribution is a basic PAC-Bayes result on which the rest of our analysis builds.
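Concretely, integrating $e_h$ over $Y \mid X$ first and writing $\eta_l(x) \doteq \mathbb{P}(Y = l \mid X = x)$, we have
$$\mathcal{E}(h) \;=\; R(h) - R(h_B) \;=\; \mathbb{E}\, e_h(X, Y) \;=\; \mathbb{E}_X\bigl[\mathbb{E}_{Y|X}\, e_h(X, Y)\bigr] \;=\; \mathbb{E}_X\, \tilde e_h(X),$$
$$\tilde e_h(x) \;=\; \bigl(1 - \eta_{h(x)}(x)\bigr) - \bigl(1 - \eta_{h_B(x)}(x)\bigr) \;=\; \eta_{h_B(x)}(x) - \eta_{h(x)}(x) \;\in\; [0,1],$$
since $h_B(x)$ maximizes $\eta_l(x)$ over $l \in [L]$. Thus the intermediate loss is a valid $[0,1]$-valued loss whose population average is exactly the excess risk.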
Proposition 2 (PAC-Bayes on excess risk). Let $\mathcal{H}$ denote a discrete family of classifiers, and fix a prior distribution $P$ with support $\mathcal{H}$. Let $n \ge 8$ and $\delta \in (0,1)$. Suppose there exist bounded functions $b_n(h,S)$ and $\psi_n(h)$, $h \in \mathcal{H}$ (depending on $\delta$), such that
$$\mathbb{P}\Bigl(\forall h \in \mathcal{H}:\ \tilde{\mathcal{E}}(h,S) \le \hat{\mathcal{E}}(h,S) + b_n(h,S)\Bigr) \ge 1 - \delta, \qquad \inf_{h \in \mathcal{H}} \mathbb{P}\bigl(b_n(h,S) \le \psi_n(h)\bigr) \ge 1 - \delta.$$
For any $\lambda \in (0,2)$, consider the following posterior over $\mathcal{H}$:
$$Q^*_\lambda(h) \doteq c^{-1}\, e^{-n\lambda\left(\hat R(h,S) + b_n(h,S)\right)} P(h), \qquad \text{for } c = \mathbb{E}_{h \sim P}\, e^{-n\lambda\left(\hat R(h,S) + b_n(h,S)\right)}. \tag{2}$$
Then, with probability at least $1 - 4\delta$ over $S$, simultaneously for all $\lambda \in (0,2)$:
$$\mathcal{E}(h_{Q^*_\lambda}) \;\le\; \frac{L}{1 - \lambda/2}\, \inf_{h \in \mathcal{H}} \left( \mathcal{E}(h) + \psi_n(h) + \frac{\log(1/P(h)) + \log\bigl(2\sqrt{n}/\delta\bigr)}{\lambda n} \right).$$

Proposition 2 builds on Proposition 1 by first taking $R_{\mathcal{L}}(h)$ to be $\mathcal{E}(h)$, $\hat R_{\mathcal{L}}(h)$ to be $\tilde{\mathcal{E}}(h)$, and $Z$ to be $X$. The bound in Proposition 2 is then obtained by optimizing over $Q$ for fixed $\lambda$. Since this bound is on excess error (rather than error), optimizing over $\lambda$ can only improve constants, while the choice of prior $P$ is crucial in obtaining optimal rates as $|\mathcal{H}| \to \infty$. Such a choice is treated next.

3.2 Oracle risk for trees (as $\mathcal{H} \doteq \mathcal{H}(T_0)$ grows in size with $T_0$)

We start with the following definitions of the classifiers of interest and related quantities.

Definition 5. Let $T_0$ be a binary partition-tree of $\mathcal{X}$ obtained from the data $S_0$, of depth $D_0$. Consider a family of classification trees $\mathcal{H}(T_0) \doteq \{h_T\}$ indexed by subtrees $T$ of $T_0$, where $h_T$ uses a fixed labeling $l(A)$ of the nodes $A \in \pi(T)$, e.g., $l(A) \doteq$ the majority label among the $Y$ values in $A \cap S_0$ whenever $A \cap S_0 \neq \emptyset$. Furthermore, for any node $A$ of $T_0$, let $\hat p(A,S)$ denote the empirical mass of $A$ under $S$ and $p(A)$ its population mass. Then, for any subtree $T$ of $T_0$, let $|T|$ denote the number of nodes in $T$ and define
$$b_n(h_T, S) \doteq \sum_{A \in \pi(T)} \sqrt{\frac{2\,\hat p(A,S)\,\log(|T_0|/\delta)}{n}}, \quad \text{and} \tag{3}$$
$$\psi_n(h_T) \doteq \sum_{A \in \pi(T)} \sqrt{\frac{8 \max\left\{p(A),\ \frac{(2 + \log D)\,D_0 + \log(1/\delta)}{n}\right\} \log(|T_0|/\delta)}{n}}. \tag{4}$$

Remark 1. In practice, we might start with a space-partitioning tree $\bar T_0$ (e.g., a dyadic tree or a KD-tree) which partitions $[0,1]^D$ rather than the support $\mathcal{X}$. We then view $T_0$ as the intersection of $\bar T_0$ with $\mathcal{X}$.

Our main theorem below follows from Proposition 2 on excess risk, by showing (a) that the above definitions of $b_n(h_T,S)$ and $\psi_n(h_T)$ satisfy the conditions of Proposition 2, and (b) that there exists a proper prior $P$ such that $\log(1/P(T)) \lesssim D_0\,|\pi(T)|$, i.e., it depends only on the complexity of the subtree rather than on that of $T_0$. The main technicality in showing (b) stems from the fact that $P$ needs to be a proper distribution (i.e., $\sum_T P(T) = 1$) without requiring too large a normalization constant (recall that the number of subtrees can be exponential in the size of $T_0$). This is established through arguments from coding theory, in particular the Kraft-McMillan inequality.

Theorem 1 (Oracle risk for trees). Let the prior satisfy $P(h_T) \doteq (1/C_P)\, e^{-3 D_0 |\pi(T)|}$ for a normalizing constant $C_P$, and consider the corresponding posterior $Q^*_\lambda$ as defined in Equation (2). Then, with probability at least $1 - 4\delta$ over $S$, simultaneously for all $\lambda \in (0,2)$, the excess risk $\mathcal{E}(h_{Q^*_\lambda})$ of the majority classifier is at most
$$\frac{L}{1 - \lambda/2}\, \min_{h_T \in \mathcal{H}(T_0)} \left( \mathcal{E}(h_T) + \psi_n(h_T) + \frac{3 D_0\,|\pi(T)| + \log\bigl(2\sqrt{n}/\delta\bigr)}{\lambda n} \right).$$

From Theorem 1 we deduce that the majority classifier $h_{Q^*_\lambda}$ is consistent whenever pruning to the best subtree is consistent (typically, $\min_{h_T} \bigl(\mathcal{E}(h_T) + D_0\,|\pi(T)|/n\bigr) = o_P(1)$). Furthermore, we can infer that $\mathcal{E}(h_{Q^*_\lambda})$ converges at the same rate as pruning approaches: the terms $\psi_n(h_T)$ and $D_0\,|\pi(T)|/n$ can be shown to typically be of lower or similar order as $\mathcal{E}(h_T)$ for the best subtree classifier $h_T$. These remarks are formalized next and result in Corollary 1 below.
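Before turning to rates, a small numerical sanity check of the posterior in Equation (2) may help. The snippet below is our illustration, not the paper's setup: the toy family of classifiers (represented by their label vectors on $S$), the uniform prior, and the constant stand-in for $b_n$ are assumptions made purely for the example.

```python
# Toy illustration of Equation (2): Q_lambda(h) ∝ exp(-n*lam*(emp_risk + b_n)) * prior(h),
# over a small discrete family, followed by the Q-majority vote of Definition 3.
import numpy as np

rng = np.random.default_rng(0)
n, L = 200, 3
y = rng.integers(0, L, size=n)                              # labels of the sample S

# Toy family H: a few random label vectors plus the "perfect" one.
toy_family = [rng.integers(0, L, size=n) for _ in range(5)] + [y.copy()]

prior = np.full(len(toy_family), 1.0 / len(toy_family))     # uniform prior P
emp_risk = np.array([np.mean(h != y) for h in toy_family])  # \hat R(h, S)
b_pen = np.full(len(toy_family), 0.05)                      # constant stand-in for b_n(h, S)

lam = 0.5
log_w = -n * lam * (emp_risk + b_pen) + np.log(prior)
log_w -= log_w.max()                                        # stabilize before exponentiating
Q = np.exp(log_w) / np.exp(log_w).sum()                     # posterior Q*_lambda of Eq. (2)

# Q-majority prediction on S: accumulate posterior mass per label at each point.
votes = np.zeros((n, L))
for q, h in zip(Q, toy_family):
    votes[np.arange(n), h] += q
pred = votes.argmax(axis=1)
print("posterior:", np.round(Q, 3), " majority-vote empirical error:", np.mean(pred != y))
```

Because the exponent scales with $-n\lambda$ times the penalized empirical risk, the posterior concentrates sharply on low-risk members of the family, which is what drives the oracle bounds above.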
3.3 Rate of convergence

Most known rates for tree pruning are established for dyadic trees (see e.g. [9, 11]), due to their simplicity, under nonparametric assumptions on $\mathbb{E}[Y|X]$. We therefore adopt such standard assumptions here to illustrate the rates achievable by $h_{Q^*_\lambda}$, following the more general statement of Theorem 1. The first standard assumption below restricts how fast class probabilities can change over space.

Assumption 1. Consider the so-called regression function $\eta(x) \in \mathbb{R}^L$ with coordinates $\eta_l(x) \doteq \mathbb{E}_{Y|x}\mathbf{1}\{Y = l\}$, $l \in [L]$. We assume $\eta$ is $\alpha$-Hölder for some $\alpha \in (0,1]$, i.e., there exists $\lambda$ such that for all $x, x' \in \mathcal{X}$, $\|\eta(x) - \eta(x')\| \le \lambda\, \|x - x'\|^{\alpha}$.

Next, we list some of the key conditions satisfied by dyadic trees, on which standard results build. In particular, we want the diameters of the nodes of $T_0$ to decrease relatively fast from the root down.

Assumption 2 (Conditions on $T_0$). The tree $T_0$ is obtained as the intersection of $\mathcal{X}$ with a dyadic partition tree of $[0,1]^D$ (e.g., built by cycling through coordinates) of depth $D_0 = O(D \log n)$ and partition size $|T_0| = O(n)$. In particular, we emphasize that the following conditions on subtrees then hold. For any subtree $T$ of $T_0$, let $r(T)$ denote the maximum diameter of the leaves of $T$ (viewed as subsets of $\mathcal{X}$). There exist $C_1, C_2, d > 0$ such that: for all $C_1/n < r \le 1$, there exists a subtree $T$ of $T_0$ such that $r(T) \le r$ and $|\pi(T)| \le C_2\, r^{-d}$.

The above conditions on subtrees are known to hold approximately for other procedures such as KD-trees and PCA-trees; in this sense, analyses of dyadic trees do yield some insight into the performance of other approaches. The quantity $d$ captures the intrinsic dimension (e.g., doubling or box dimension) of the data space $\mathcal{X}$, or is often of the same order [12, 13, 14].

Under the above two assumptions, it can be shown through standard arguments that the excess error of the best pruning, namely $\min_{h_T \in \mathcal{H}(T_0)} \mathcal{E}(h_T)$, is of order $n^{-\alpha/(2\alpha + d)}$, which is tight (see e.g. the minimax lower bounds of [15]). The following corollary to Theorem 1 states that such a rate, up to a logarithmic factor in $n$, is also attained by majority classification under $Q^*_\lambda$.

Corollary 1 (Adaptive rate of convergence). Assume that for any cell $A$ of $T_0$, the labeling $l(A)$ corresponds to the majority label in $A$ (under $S_0$) if $A \cap S_0 \neq \emptyset$, and $l(A) = 1$ otherwise. Then, under Assumptions 1 and 2, and the conditions of Theorem 1, there exists a constant $C$ such that:
$$\mathbb{E}_{S_0, S}\, \mathcal{E}(h_{Q^*_\lambda}) \;\le\; C \left( \frac{\log n}{n} \right)^{\alpha/(2\alpha + d)}.$$

4 Algorithmic Results

Here we show that $h_Q$ can be efficiently implemented by storing appropriate weights at the nodes of $T_0$. Let $w_Q(A) \doteq \sum_{h_T :\, A \in \pi(T)} Q(h_T)$ aggregate the weights of all subtrees $T$ of $T_0$ having $A$ as a leaf. Then $h_Q(x) = \operatorname{argmax}_{l \in [L]} \sum_{A \in \mathrm{path}(x):\, l(A) = l} w_Q(A)$, where $\mathrm{path}(x)$ denotes the set of nodes of $T_0$ containing $x$. Thus, $h_Q(x)$ is computable from weights proportional to $w_Q(A)$ at every node.

We show in what follows that we can efficiently obtain $w(A) = C\, w_{Q^*_\lambda}(A)$ by dynamic programming, by ensuring that $Q^*_\lambda(h_T)$ is multiplicative over $\pi(T)$. This is the case given our choice of prior from Theorem 1: we have
$$Q^*_\lambda(h_T) = \frac{1}{C_{Q^*_\lambda}} \exp\Bigl(\sum_{A \in \pi(T)} \varphi(A)\Bigr), \quad \text{where } \varphi(A) \doteq -\lambda \sum_{i:\, X_i \in A \cap S} \mathbf{1}\{Y_i \neq l(A)\} \;-\; n\lambda \sqrt{\frac{2\,\hat p(A,S)\,\log(|T_0|/\delta)}{n}} \;-\; 3 D_0.$$
We can then compute $w(A) \doteq C_{Q^*_\lambda}\, w_{Q^*_\lambda}(A)$ via dynamic programming. The intuition is similar to that in [5]; however, the particular form of our weights requires a two-pass dynamic program (bottom-up and top-down) rather than the single pass of [5]. Namely, $w(A)$ divides into subweights that any node $A$ might contribute up or down the tree. Let
$$\alpha(A) \doteq \sum_{h_T :\, A \in \pi(T)} \exp\Bigl(\sum_{A' \in \pi(T),\, A' \neq A} \varphi(A')\Bigr), \tag{5}$$
so that $w(A) = e^{\varphi(A)}\, \alpha(A)$.
As we will show (proof of Theorem 2), $\alpha(A)$ decomposes into contributions from the parent $A_p$ and sibling $A_s$ of $A$, i.e., $\alpha(A) = \alpha(A_p)\,\beta(A_s)$, where $\beta(A)$ is given as (writing $T_0^A$ for the subtree of $T_0$ rooted at $A$, and $T \preceq T'$ when $T$ is a subtree of $T'$):
$$\beta(A) \doteq \sum_{T \preceq T_0^A} \exp\Bigl(\sum_{A' \in \pi(T)} \varphi(A')\Bigr). \tag{6}$$
The contributions $\beta(A)$ are first computed using the bottom-up Algorithm 1, and the contributions $\alpha(A)$ and final weights $w(A)$ are then computed using the top-down Algorithm 2. For ease of presentation, these routines run on a full-binary version $\tilde T_0$ of $T_0$, obtained by adding a dummy child to each node $A$ that has a single child in $T_0$. Each dummy node $A'$ has $\varphi(A') = 0$.

Algorithm 1: Bottom-up pass
  for $A \in \pi(\tilde T_0)$ do
    $\beta(A) \leftarrow e^{\varphi(A)}$
  end for
  for $i = D_0$ down to $0$ do
    $\mathcal{A}_i \leftarrow$ set of nodes of $\tilde T_0$ at depth $i$
    for $A \in \mathcal{A}_i \setminus \pi(\tilde T_0)$ do
      $N \leftarrow$ the children nodes of $A$
      $\beta(A) \leftarrow e^{\varphi(A)} + \prod_{A' \in N} \beta(A')$
    end for
  end for

Algorithm 2: Top-down pass
  $\alpha(\mathrm{root}) \leftarrow 1$
  for $i = 1$ to $D_0$ do
    $\mathcal{A}_i \leftarrow$ set of nodes of $\tilde T_0$ at depth $i$
    for $A \in \mathcal{A}_i$ do
      $A_p, A_s \leftarrow$ parent of node $A$, sibling of node $A$
      $\alpha(A) \leftarrow \alpha(A_p)\,\beta(A_s)$
      $w(A) \leftarrow e^{\varphi(A)}\,\alpha(A)$
    end for
  end for

Theorem 2 (Computing $w(A)$). Running Algorithm 1, then Algorithm 2, we obtain $w(A) \doteq C_{Q^*_\lambda}\, w_{Q^*_\lambda}(A)$, where $Q^*_\lambda$ is as defined in Theorem 1. Furthermore, the combined runtime of Algorithms 1 and 2 is $2|\tilde T_0| \le 4|T_0|$, where $|T|$ is the number of nodes in $T$.

5 Experiments

Here we present experiments on real-world datasets, for two common partition-tree approaches: dyadic trees and KD-trees. The datasets are described in Table 1.

Table 1: UCI datasets

| Name (abbreviation) | Features count | Labels count | Train size |
|---|---|---|---|
| Spambase (spam) | 57 | 2 | 2601 |
| EEG Eye State (eeg) | 14 | 2 | 12980 |
| Epileptic Seizure Recognition (epileptic) | 178 | 2 | 9500 |
| Crowdsourced Mapping (crowd) | 28 | 6 | 8546 |
| Wine Quality (wine) | 12 | 11 | 4497 |
| Optical Recognition of Handwritten Digits (digit) | 64 | 10 | 3620 |
| Letter Recognition (letter) | 16 | 26 | 18000 |

The main baseline we compare against is a popular efficient pruning heuristic where a subtree of $T_0$ is selected to minimize the penalized error $C_1(h_T) = \hat R(h_T, S) + \lambda\,|\pi(T,S)|$. We also compare against other tree-based approaches that are theoretically driven and efficient. First is a pruning approach proposed in [16], which picks a subtree minimizing the penalized error
$$C_2(h_T) = \hat R(h_T, S) + \lambda \sum_{A \in \pi(T)} \sqrt{\frac{\max\{\hat p(A,S),\ \|A\|/n\}\,\|A\|}{n}},$$
where $\|A\|$ denotes the depth of node $A$ in $T_0$. We note that we choose here a form of $C_2$ that avoids theoretical constants of a technical nature, and instead let $\lambda$ account for them. We report this approach as SN-pruning. Second is the majority classifier of [5], which however is geared towards binary classification, as it requires regression-type estimates in $[0,1]$ at each node. This is denoted HS-vote. All the above approaches have efficient dynamic programs that run in time $O(|T_0|)$, and all predict in time $O(\mathrm{height}(T_0))$. The same holds for our PAC-Bayes approach, as discussed in Section 4.

Practical implementation of PAC-Bayes tree. Our implementation rests on the theoretical insights of Theorem 1; however, we avoid some of the technical details that were needed for rigor, such as sample splitting and overly conservative constants in concentration results. Instead, we advise cross-validating for such constants in the prior and posterior definitions. Namely, we first set $P(h_T) \propto \exp(-|\pi(T,S)|)$, where $\pi(T,S)$ denotes the leaves of $T$ containing data. We set $\psi_n(h_T, S) = \sum_{A \in \pi(T,S)} \sqrt{\hat p(A,S)/n}$. The posterior is then set as $Q^*(h_T) \propto \exp\bigl(-n\,(\lambda_1 \hat R(h_T, S) + \lambda_2\, \psi_n(h_T, S))\bigr)\, P(h_T)$, where $\lambda_1, \lambda_2$ account for concentration terms and are tuned to the data.
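To make the two passes concrete, here is a minimal Python sketch (our illustration under simplifying assumptions, not the authors' released code) that wires the practical per-node potential just described into Algorithms 1 and 2, assuming a full binary tree with per-node counts already computed; the `Node` fields and helper names are ours, and real data would call for log-space (log-sum-exp) arithmetic to avoid underflow.

```python
# Two-pass dynamic program of Algorithms 1 and 2, with the simplified practical
# potential phi(A). Assumes a full binary tree (every internal node has two
# children); counts n_points = |A ∩ S| and n_errors = points of A misclassified
# by l(A) are precomputed. Illustrative sketch only.
import math

class Node:
    def __init__(self, label, n_points, n_errors, left=None, right=None,
                 split_dim=None, split_val=None):
        self.label = label
        self.left, self.right = left, right
        self.split_dim, self.split_val = split_dim, split_val
        self.n_points, self.n_errors = n_points, n_errors
        self.phi = self.beta = self.alpha = self.w = None

def set_potentials(node, n, lam1, lam2):
    # phi(A) = -lam1*(#errors in A) - n*lam2*sqrt(phat(A)/n)
    #          - 1 if A contains data (per-leaf prior term of exp(-|pi(T,S)|)).
    phat = node.n_points / n
    node.phi = (-lam1 * node.n_errors
                - n * lam2 * math.sqrt(phat / n)
                - (1.0 if node.n_points > 0 else 0.0))
    for child in (node.left, node.right):
        if child is not None:
            set_potentials(child, n, lam1, lam2)

def bottom_up(node):
    # Algorithm 1: beta(A) = e^{phi(A)} at leaves, else e^{phi(A)} + prod of children's beta.
    if node.left is None:
        node.beta = math.exp(node.phi)
    else:
        node.beta = math.exp(node.phi) + bottom_up(node.left) * bottom_up(node.right)
    return node.beta

def top_down(node, alpha=1.0):
    # Algorithm 2: alpha(A) = alpha(parent)*beta(sibling); w(A) = e^{phi(A)}*alpha(A).
    node.alpha = alpha
    node.w = math.exp(node.phi) * alpha
    if node.left is not None:
        top_down(node.left, alpha * node.right.beta)
        top_down(node.right, alpha * node.left.beta)

# Tiny usage example: a depth-1 tree with two leaves.
leafL = Node(label=0, n_points=6, n_errors=1)
leafR = Node(label=1, n_points=4, n_errors=0)
root = Node(label=0, n_points=10, n_errors=4, left=leafL, right=leafR,
            split_dim=0, split_val=0.5)
set_potentials(root, n=10, lam1=0.5, lam2=0.5)
bottom_up(root)
top_down(root)
```

After the two passes, `node.w` holds a quantity proportional to $w_Q(A)$, and prediction then proceeds exactly as in the path-walk sketch of Section 2: accumulate `node.w` per label along $\mathrm{path}(x)$ and take the argmax.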
Finally, we use the entire data to construct $T_0$ and compute the weights, i.e., $S_0 = S$, as interdependencies are in fact less of an issue in practice. We note that the above alternative theoretical approaches, SN-pruning and HS-vote, are also assumed (in theory) to work on a sample-independent choice of $T_0$ (or equivalently, one built and labeled on a separate sample $S_0$), but are implemented here on the entire data to similarly take advantage of larger data sizes. The baseline pruning heuristic is by default always implemented on the full data.

Experimental setup and results. The data is preprocessed as follows: for dyadic trees, data is scaled to lie in $[0,1]^D$, while for KD-trees data is normalized across each coordinate by its standard deviation. The testing set is fixed to be of size 2000, while each experiment is run 5 times (with a random choice of training data of the size reported in Table 1) and average performance is reported. In each experiment, all parameters are chosen by 2-fold cross-validation for each of the procedures. The log-grid is 10 values, equally spaced in logarithm, from $2^{-8}$ to $2^{6}$, while the linear-grid is 10 linearly spaced values between half the best value of the log-search and twice the best value of the log-search.

Table 2 reports the classification performance of the various theoretical methods relative to the baseline pruning heuristic. We see that the proposed PAC-Bayes tree achieves competitive performance against all other alternatives. All the approaches have similar performance across datasets, with some working slightly better on particular datasets. Figure 2 further illustrates typical performance on multiclass problems as training size varies.

Table 2: Ratio of classification error over that of the default pruning baseline (values below 1 indicate improvement over the baseline); N/A means the algorithm was not run on that task.

| Dataset | SN-pruning (dyadic $T_0$) | PAC-Bayes tree (dyadic $T_0$) | HS-vote (dyadic $T_0$) | SN-pruning (KD $T_0$) | PAC-Bayes tree (KD $T_0$) | HS-vote (KD $T_0$) |
|---|---|---|---|---|---|---|
| spam | 1.118 | 0.975 | 1.224 | 1.048 | 1.020 | 1.075 |
| eeg | 0.979 | 0.993 | 1.029 | 1.000 | 0.990 | 1.000 |
| epileptic | 0.993 | 0.992 | 0.951 | 0.977 | 0.987 | 0.907 |
| crowd | 0.991 | 1.020 | N/A | 1.001 | 1.017 | N/A |
| wine | 1.035 | 0.991 | N/A | 1.010 | 0.997 | N/A |
| digit | 1.000 | 0.936 | N/A | 0.994 | 0.997 | N/A |
| letter | 1.005 | 0.993 | N/A | 1.000 | 1.001 | N/A |

Figure 2: Classification error versus training size.

References

[1] David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
[2] László A. Székely and Hua Wang. On subtrees of trees. Advances in Applied Mathematics, 34(1):138–155, 2005.
[3] Trevor Hastie and Daryl Pregibon. Shrinking trees. AT&T Bell Laboratories, 1990.
[4] Wray Buntine and Tim Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8(1):75–85, 1992.
[5] David P. Helmbold and Robert E. Schapire. Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27(1):51–68, 1997.
[6] Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems, pages 769–776, 2007.
[7] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446, 2003.
[8] Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Jean-Francis Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm.
Journal of Machine Learning Research, 16:787–860, 2015.
[9] C. Scott and R. D. Nowak. Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory, 52, 2006.
[10] Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A strongly quasiconvex PAC-Bayesian bound. In Steve Hanneke and Lev Reyzin, editors, Proceedings of the 28th International Conference on Algorithmic Learning Theory, volume 76 of Proceedings of Machine Learning Research, pages 466–492, Kyoto University, Kyoto, Japan, 15–17 Oct 2017. PMLR.
[11] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, New York, NY, 2002.
[12] Nakul Verma, Samory Kpotufe, and Sanjoy Dasgupta. Which spatial partition trees are adaptive to intrinsic dimension? In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 565–574. AUAI Press, 2009.
[13] Samory Kpotufe and Sanjoy Dasgupta. A tree-based regressor that adapts to intrinsic dimension. Journal of Computer and System Sciences, 78(5):1496–1515, 2012.
[14] Santosh Vempala. Randomly-oriented kd-trees adapt to intrinsic dimension. In FSTTCS, volume 18, pages 48–57, 2012.
[15] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
[16] Clayton Scott. Dyadic Decision Trees. PhD thesis, Rice University, 2004.