# On Computing Optimal Tree Ensembles

Christian Komusiewicz *1   Pascal Kunz *2 3   Frank Sommer *1   Manuel Sorge *4

*Equal contribution. 1 Fakultät für Mathematik und Informatik, Friedrich-Schiller-Universität Jena, Germany. 2 Algorithmics and Computational Complexity, TU Berlin, Germany. 3 Algorithm Engineering, HU Berlin, Germany. 4 Institute of Logic and Computation, TU Wien, Austria. Correspondence to: Christian Komusiewicz, Pascal Kunz, Frank Sommer, Manuel Sorge.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Random forests and, more generally, (decision-)tree ensembles are widely used methods for classification and regression. Recent algorithmic advances make it possible to compute decision trees that are optimal for various measures such as their size or depth. We are not aware of such research for tree ensembles and aim to contribute to this area. Mainly, we provide two novel algorithms and corresponding lower bounds. First, we are able to carry over and substantially improve on tractability results for decision trees, obtaining a (6δDS)^S · poly-time algorithm, where S is the number of cuts in the tree ensemble, D the largest domain size, and δ the largest number of features in which two examples differ. To achieve this, we introduce the witness-tree technique, which also seems promising for practice. Second, we show that dynamic programming, which has been successful for decision trees, may also be viable for tree ensembles, providing an ℓ^n · poly-time algorithm, where ℓ is the number of trees and n the number of examples. Finally, we compare the number of cuts necessary to classify training data sets for decision trees and tree ensembles, showing that ensembles may need exponentially fewer cuts for an increasing number of trees.

1. Introduction

Random forests is a method for classification or regression in which we construct an ensemble of decision trees for (random subsets of) the training data and, in the classification phase, aggregate their outcomes by majority voting. The random-forests method has received a tremendous amount of attention for its simplicity and improved accuracy over plain decision trees (Breiman, 2001; Verikas et al., 2011; Kulkarni & Sinha, 2013; Rokach, 2016). Commonly, fast heuristics without performance guarantees are used for computing random forests (Kulkarni & Sinha, 2013; Rokach, 2016), in particular for computing the individual decision trees in the forest. For plain decision trees, researchers have lately made several advances in computing optimal decision trees, that is, decision trees that provably optimize criteria such as minimizing the tree size (Bessiere et al., 2009; Narodytska et al., 2018; Carrizosa et al., 2021; Demirović et al., 2022). With that increased amount of attention also came theoretical advances, showing the limits and opportunities for developing efficient exact algorithms for computing decision trees (Ordyniak & Szeider, 2021; Kobourov et al., 2022; Eiben et al., 2023). One impetus for computing optimal decision trees is that minimizing the size reduces tendencies to overfitting (Bessiere et al., 2009; Demirović et al., 2022). It is conceivable that such benefits transfer to globally optimizing the tree ensembles computed by random forests. However, apart from sporadic hardness results (Tamon & Xiang, 2000), we are not aware of exact algorithmic research for tree ensembles.
In this work, we aim to initiate this direction; that is, we begin to build the theoretical footing for exact algorithmics of computing optimal tree ensembles and provide potential avenues for exact algorithms that are guaranteed to provide optimal results with acceptable worst-case running times. We study the algorithmic properties of two canonical formulations of the training problem for tree ensembles: We are given a set of training examples labeled with two classes and a number ℓ of trees, and we want to compute a tree ensemble containing ℓ trees that classifies the examples consistently with the given class labels.1 We want to minimize either the sum of the tree sizes, resulting in the problem MINIMUM TREE ENSEMBLE SIZE (MTES), or the largest size of a tree in the ensemble, resulting in the problem MINIMAX TREE ENSEMBLE SIZE (MMAXTES).2 Both contain as a special case the problem of computing a minimum-size decision tree, which is known to be NP-hard (Hyafil & Rivest, 1976; Ordyniak & Szeider, 2021). However, the hardness constructions do not necessarily reflect practical data. Thus, we are interested in precisely which properties make the problems hard or tractable.

1 To keep the presentation focused, we consider mainly the case without training error. See the conclusion for extensions to minimizing training error.
2 It is also natural to consider the depths of the trees instead of their sizes, but results are usually transferable between these two optimization goals and the size makes the presentation more accessible.

Mainly, we provide two novel algorithms for MTES and MMAXTES3 and matching lower-bound results for their running times. We call the first one the witness-tree algorithm. This algorithm demonstrates that prospects for tractable algorithms for optimizing decision trees can be non-trivially generalized to optimizing tree ensembles. Namely, it was known that for small tree size s, moderate maximum domain size D of any feature, and moderate number δ of features in which two examples differ,4 a minimum decision tree can be computed efficiently, that is, in f(s, D, δ) · poly time, where poly is a polynomial in the input size (Ordyniak & Szeider, 2021). However, the function f is at least δ^s · (Ds2δ)^s · 2^{s^2} and the algorithm involves enumerative steps in which the worst-case running time equals the average case. We show that, even for the more general MTES, we can improve the running time to O((6δDS)^S · Sℓn), where S denotes the sum of the tree sizes, ℓ the number of trees in the ensemble, and n the number of training examples (Theorem 4.1). Moreover, we can avoid the enumerative approach, obtaining a search-tree algorithm that is both conceptually simpler and more easily amenable to heuristic improvements such as early search-termination rules and data reduction.5 We achieve this by growing the trees iteratively and labeling their leaves with witness examples that need to be classified in these leaves. This allows us to localize misclassifications and their rectification, shrinking the search space. We believe that this technique may have practical applications beyond improving the worst-case running times as we do here. The running time that we achieve is tight in the sense that we cannot decrease the exponent to o(S) without violating reasonable complexity-theoretic assumptions (Theorem 4.4).

3 The algorithms work on the decision version of these problems, but they easily apply to the optimization versions as well.
4 See Ordyniak & Szeider (2021) for measurements showing that this is a reasonably small parameter in several datasets.
5 In the meantime since this paper has been accepted, Eiben et al. (2023) showed that for minimum-size decision trees the dependency on the domain size can be dropped, that is, there is an f(s, δ) · poly-time algorithm. It still uses enumerative steps which would be infeasible in practice.

Recently, exponential-time dynamic programming has been applied to compute optimal decision trees and the resulting
trees have shown comparable performance to (heuristic) random forests on some datasets (Demirović et al., 2022). With the second algorithm that we provide, we investigate the potential of dynamic programming for computing optimal tree ensembles. We first show that minimizing decision trees can be done in O(3^n) time, where n is the number of input examples, by a dynamic-programming approach that works on all possible splits of the examples (Corollary 5.2). (Indeed, the algorithm employed by Demirović et al. (2022) similarly computes a table over all possible splits in the worst case.) We then extend this algorithm to tree ensembles with ℓ trees, achieving (ℓ + 1)^n · poly running time (Theorem 5.3). Unfortunately, we also show that the running time cannot be substantially improved: A running time of f(ℓ) · 2^{o(log ℓ) · n} would violate reasonable complexity-theoretic assumptions (Theorem 5.4).

Finally, we compare the power of decision trees and tree ensembles in terms of their sizes. Here, we show that a training data set D that can be classified by a tree ensemble with ℓ trees of size at most s can also be classified by a decision tree of size (s + 1)^ℓ (Theorem 3.1). However, such an exponential increase is necessary in the worst case: We show that there are such training data sets D that cannot be classified by any decision tree of size roughly (s/2)^{ℓ/2} (Theorem 3.2). In summary, as the number of trees in a tree ensemble grows, the classification power increases exponentially over decision trees. Nevertheless, we are able to carry over and substantially improve on tractability results for decision trees if, in particular, the number of cuts in the optimal ensemble is relatively small. The underlying witness-tree technique seems promising to try in practice. Furthermore, we show that dynamic programming, which has been successful for decision trees, may also be viable for tree ensembles. We also provide matching lower bounds for the running times. Apart from tuning our algorithms, in the future, deconstructing these lower bounds may provide further guidelines towards which properties of the input data we may exploit for efficient algorithms and which we likely may not.

Proofs of statements marked with (★) are deferred to a full version of the paper.

2. Preliminaries

For n ∈ ℕ we use [n] := {1, 2, . . . , n}. For a vector x ∈ ℝ^d we denote by x[i] the ith entry of x. Let Σ be a set of class symbols; unless stated otherwise, we use Σ = {blue, red}. A decision tree in ℝ^d with classes Σ is the following. Let T be an ordered binary tree, that is, each inner node has a well-defined left and right child. Let dim: V(T) → [d] and thr: V(T) → ℝ be labelings of each internal node t ∈ V(T) by a dimension dim(t) ∈ [d] and a threshold thr(t) ∈ ℝ. Furthermore, let cla: V(T) → Σ be a labeling of the leaves of T by class symbols. Then the tuple (T, dim, thr, cla) is a decision tree in ℝ^d with classes Σ. We often omit the labelings dim, thr, cla and just refer to the tree T.
The size of T is the number of its internal nodes. We conveniently call the internal nodes of T and their associated labels cuts. A training data set is a tuple (E, λ) of a set of examples E ⊆ ℝ^d and their class labeling λ: E → Σ. Given a training data set, we fix for each dimension i a minimum-size set Thr(i) of thresholds that distinguishes between all values of the examples in the ith dimension. In other words, for each pair of elements e and e′ with e[i] < e′[i], there is at least one value t ∈ Thr(i) such that e[i] < t < e′[i]. Let t ∈ ℝ be some threshold. We use E[f_i ≤ t] = {x ∈ E | x[i] ≤ t} and E[f_i > t] = {x ∈ E | x[i] > t} to denote the set of examples of E whose ith dimension is less than or equal to and strictly greater than the threshold t, respectively.

Now let T be a decision tree. Each node t ∈ V(T), including the leaves, defines a subset E[T, t] ⊆ E as follows. For the root t of T, we define E[T, t] := E. For each non-root node t, let p denote the parent of t. We then define E[T, t] := E[T, p] ∩ E[f_{dim(p)} ≤ thr(p)] if t is the left child of p and E[T, t] := E[T, p] ∩ E[f_{dim(p)} > thr(p)] if t is the right child of p. If the tree T is clear from the context, we simplify E[T, t] to E[t]. We say that T classifies (E, λ) if for each leaf u ∈ V(T) and each example e ∈ E[u] we have λ(e) = cla(u) (recall that cla is a labeling of the leaves of T by classes). Note that the set family that contains E[u] for all leaves u of T forms a partition of E. Thus, for each example e ∈ E there is a unique leaf u such that e ∈ E[u]. We also say that u is e's leaf. We say that cla(u) is the class assigned to e by T and we write T[e] for cla(u).

A tree ensemble is a set of decision trees. A tree ensemble 𝒯 classifies (E, λ) if for each example e ∈ E the majority vote of the trees in 𝒯 agrees with the label λ(e). That is, for each example e ∈ E we have λ(e) = arg max_{σ ∈ Σ} |{T ∈ 𝒯 | T[e] = σ}| =: 𝒯[e]. To avoid ambiguity in the maximum, we fix an ordering of Σ and break ties according to this ordering. If Σ = {blue, red}, we break ties in favor of blue. The overall size of a tree ensemble 𝒯 is the sum of the sizes of the decision trees in 𝒯.

The computational problems that we consider are as follows.

MINIMUM TREE ENSEMBLE SIZE (MTES)
Instance: A training data set (E, λ), a number ℓ of trees, and a size bound S.
Question: Is there a tree ensemble of overall size at most S that classifies (E, λ) and contains exactly ℓ trees?

When restricting to ℓ = 1, MTES is known as MINIMUM DECISION TREE SIZE (DTS) (Ordyniak & Szeider, 2021; Kobourov et al., 2022). In the variant MINIMAX TREE ENSEMBLE SIZE (MMAXTES), instead of S, we are given an integer s and we ask whether there is a tree ensemble that classifies (E, λ) and contains exactly ℓ trees, each of which has size at most s.
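To make the classification semantics above concrete, the following is a minimal Python sketch of how a single tree T assigns the class T[e] and how an ensemble 𝒯 assigns 𝒯[e] by majority vote with ties broken in favor of blue. All names (`Node`, `classify`, `ensemble_vote`) are illustrative and not part of the paper.

```python
# Minimal sketch of the classification semantics; names are illustrative.

class Node:
    def __init__(self, dim=None, thr=None, left=None, right=None, cla=None):
        self.dim, self.thr = dim, thr        # cut: dimension and threshold (inner nodes)
        self.left, self.right = left, right  # children (inner nodes)
        self.cla = cla                       # class label (leaves only)

def classify(tree, e):
    """Return T[e]: follow cuts (<= goes left, > goes right) until a leaf is reached."""
    node = tree
    while node.cla is None:
        node = node.left if e[node.dim] <= node.thr else node.right
    return node.cla

def ensemble_vote(trees, e, order=("blue", "red")):
    """Return the majority vote of the ensemble; ties are broken towards 'blue'."""
    votes = [classify(T, e) for T in trees]
    return max(order, key=lambda c: (votes.count(c), -order.index(c)))

# A single cut in dimension 0 at threshold 0.5: blue on the left, red on the right.
T = Node(dim=0, thr=0.5, left=Node(cla="blue"), right=Node(cla="red"))
print(classify(T, (0.2, 1.0)), ensemble_vote([T], (0.9, 1.0)))  # blue red
```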
Our analysis is within the framework of parameterized complexity (Gottlob et al., 2002; Flum & Grohe, 2006; Niedermeier, 2006; Cygan et al., 2015; Downey & Fellows, 2013). Let L ⊆ Σ* be a computational problem specified over some alphabet Σ and let p: Σ* → ℕ be a parameter, that is, p assigns to each instance of L an integer parameter value (which we simply denote by p if the instance is clear from the context). We say that L is fixed-parameter tractable (FPT) with respect to p if it can be decided in f(p) · poly(n) time, where n is the input encoding length. The corresponding hardness concept related to fixed-parameter tractability is W[t]-hardness, t ≥ 1; if a problem L is W[t]-hard with respect to p, then L is thought not to be fixed-parameter tractable; see (Flum & Grohe, 2006; Niedermeier, 2006; Cygan et al., 2015; Downey & Fellows, 2013) for details. The Exponential Time Hypothesis (ETH) (Impagliazzo & Paturi, 2001; Impagliazzo et al., 2001) states that 3-SAT on n-variable formulas cannot be solved in 2^{o(n)} time.

3. Decision Trees Versus Tree Ensembles

We will call a decision tree and a tree ensemble equivalent if any training data set is classified by the one if and only if it is classified by the other. We start by comparing the minimum size of a decision tree to that of a minimum-size decision tree ensemble, showing that there are examples where the latter is significantly smaller. Vidal & Schiffer (2020, Theorem 2) obtained similar results for the depth of a decision tree and a decision tree ensemble, showing that any training data set that can be classified by a decision tree ensemble with ℓ trees of depth t can also be classified by a tree of depth ℓ · t and that this bound is tight. Here, we analyze trees and tree ensembles in terms of their size, showing that a minimum ensemble can be exponentially smaller than any equivalent decision tree.

Theorem 3.1 (★). Any training data set that can be classified by a decision tree ensemble consisting of ℓ trees, each of size at most s, can also be classified by a decision tree of size (s + 1)^ℓ − 1.

Theorem 3.2 (★). For any odd ℓ, s ∈ ℕ, there is a training data set that can be classified by a decision tree ensemble containing ℓ trees of size s each, but cannot be classified by a single decision tree of size smaller than ((s + 1)/2)^{(ℓ+1)/2} − 1.

Proof. For any x ∈ ℕ^ℓ, let even(x) := {i ∈ [ℓ] | x[i] is even} and odd(x) := [ℓ] \ even(x). Furthermore, let ev(x) and od(x) denote the sizes of even(x) and odd(x), respectively. Let E := {x ∈ [s + 1]^ℓ | |ev(x) − od(x)| = 1} and λ: E → {blue, red} with λ(x) = blue if and only if ev(x) > od(x). We show that (E, λ) fulfills the claim. We defer the proof that there is a decision tree ensemble 𝒯 with ℓ trees of size at most s that classifies (E, λ) to the full version.

We will show that any decision tree T that classifies E has at least ((s + 1)/2)^{(ℓ+1)/2} leaves. Observe that |E| ≥ (ℓ choose (ℓ+1)/2) · ((s + 1)/2)^ℓ ≥ ((s + 1)/2)^ℓ, because even if we fix some subset I of [ℓ] such that x[i] is to be even if and only if i ∈ I, then there are (s + 1)/2 possible values for each component of x. Therefore, it is sufficient to prove that |E[T, t]| ≤ ((s + 1)/2)^{(ℓ−1)/2} for every leaf t of T.

Let t be a leaf of T. Without loss of generality, assume that cla(t) = blue, that is, λ(x) = blue for all x ∈ E[T, t]. We will show that x[i] = y[i] for all x, y ∈ E[T, t] and i ∈ even(x). Suppose that x[i] ≠ y[i] for some i ∈ even(x) and x, y ∈ E[T, t]. Without loss of generality, x[i] < y[i]. Define z ∈ [s + 1]^ℓ by z[i] := x[i] + 1 and z[j] := x[j] for all j ∈ [ℓ] \ {i}. Then, even(z) = even(x) \ {i}, implying that λ(z) = red. Hence, z ∉ E[T, t]. This implies that in T there must be a node v with dim(v) = i and x[i] ≤ thr(v) < x[i] + 1 on the path from the root to t. However, this means that y ∉ E[T, t]. Hence, the examples in E[T, t] can differ only in the components in odd(x). Moreover, od(x) = (ℓ − 1)/2 and [s + 1] contains (s + 1)/2 odd values. Since the size of a binary tree is the number of leaves minus one, the claim follows.
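For concreteness, the snippet below generates the training data set (E, λ) defined in the proof of Theorem 3.2 for given odd ℓ and s. It is a small sketch that simply mirrors the definitions of ev and od above; the function name and the printed sanity check are illustrative.

```python
from itertools import product

def parity_dataset(ell, s):
    """Training data from the proof of Theorem 3.2 (sketch): all x in [s+1]^ell
    with |ev(x) - od(x)| = 1; blue iff the even components form the majority."""
    assert ell % 2 == 1 and s % 2 == 1
    E, lam = [], {}
    for x in product(range(1, s + 2), repeat=ell):
        ev = sum(1 for v in x if v % 2 == 0)
        od = ell - ev
        if abs(ev - od) == 1:
            E.append(x)
            lam[x] = "blue" if ev > od else "red"
    return E, lam

E, lam = parity_dataset(ell=3, s=3)
print(len(E), lam[(2, 2, 1)])  # 48 examples; (2, 2, 1) has two even components -> blue
```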
This result still leaves a considerable gap between the upper and the lower bound. We conjecture that the lower bound can be improved, for example, by showing that in the example presented in the proof of Theorem 3.2 the number of examples in each leaf is, on average, smaller than we showed.

4. The Witness-Tree Algorithm

In this section, we prove the following theorem. Recall that S is the desired overall size of the tree ensemble, s is the maximum size of a tree in the ensemble, ℓ is the number of trees in the ensemble, D is the largest domain of a feature, δ is the largest number of features in which two examples of different classes differ, and n is the number of training examples.

Theorem 4.1. MINIMUM TREE ENSEMBLE SIZE can be solved in O((6δDS)^S · Sℓn) time and MINIMAX TREE ENSEMBLE SIZE in O(2^ℓ · (δDℓ(2s + 1))^{sℓ} · sℓ²n) time.

The basic idea is to start with a tree ensemble that contains only trivial trees and to successively refine the trees in the ensemble until all input examples are classified correctly. To facilitate the refinement process, each leaf of each tree is assigned a distinct example, called a witness. In a recursive process we then aim to find refinements of the trees such that each witness is classified in the assigned leaf. This will speed up the refinement process because it enables us to detect examples that need to be cut away from witnesses in some trees of the ensemble.

Formally, a witness tree is a tuple (T, dim, thr, cla, wit) wherein (T, dim, thr, cla) is a decision tree and wit: V(T) → E is a mapping from the leaves of T to the set of examples such that for each leaf t we have wit(t) ∈ E[t]. The images of wit are called witnesses. Note that a witness is not necessarily classified correctly, that is, we permit T[wit(t)] ≠ λ(wit(t)). A witness ensemble is a set of witness trees.

We aim to successively refine the trees in a witness ensemble until all examples are classified correctly. For this, an example e ∈ E is dirty for some tree T (or tree ensemble F) if the label T[e] (or F[e]) assigned to e by T (or by F) is not equal to λ(e). Next, we define a refinement of a tree in an ensemble. All our refinements will take a dirty example and change the class label assigned to this example by one of the decision trees. Consider a witness tree T and a dirty example e for T. Intuitively, we take the leaf t of T in which e is classified and consider its witness wit(t). Then we pick a way of introducing into T a new cut on the path from the root to t that cuts apart wit(t) and e. This then results in a refinement of T. Formally, let T be a witness tree. A one-step refinement R of T is a witness tree constructed in one of the following two ways (illustrated in Figure 1):

Possibility one: Add a new root r to T, labeled with a dimension dim(r) and threshold thr(r), put the old root of T to be the left or right child of r, and put the other child of r to be a new leaf v, labeled with an arbitrary class label and with a witness x ∈ E such that x ∈ E[R, v].

Possibility two: Pick any edge f in T. Subdivide f with a new node u, labeled with a dimension dim(u) and threshold thr(u), and add a new leaf v as a child to u, labeled with an arbitrary class label and with a witness x ∈ E. The order of the children of u is chosen such that x ∈ E[R, v].

This finishes the definition of a one-step refinement. We also say that the one-step refinement introduces the new leaf v, the witness x, and the node r or u (thought of as the nodes including their associated labelings), respectively.
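The two one-step refinements are simple structural operations on the tree. The sketch below shows only this structural part on a small node class; the witness bookkeeping and the requirement that dim/thr keep every witness in its leaf are deliberately omitted, and all names are illustrative rather than taken from the paper.

```python
# Structural sketch of the two one-step refinements (witness bookkeeping omitted).

class Node:
    def __init__(self, dim=None, thr=None, left=None, right=None, cla=None, wit=None):
        self.dim, self.thr, self.left, self.right = dim, thr, left, right
        self.cla, self.wit = cla, wit  # class label and witness (leaves only)

def add_root(old_root, dim, thr, new_leaf, leaf_on_right=True):
    """Possibility one: add a new root above the old one; the new leaf becomes one child."""
    if leaf_on_right:
        return Node(dim=dim, thr=thr, left=old_root, right=new_leaf)
    return Node(dim=dim, thr=thr, left=new_leaf, right=old_root)

def subdivide(parent, child_attr, dim, thr, new_leaf, leaf_on_right=True):
    """Possibility two: subdivide the edge from `parent` to its child ('left'/'right')
    with a new inner node u and attach the new leaf to u."""
    old_child = getattr(parent, child_attr)
    u = add_root(old_child, dim, thr, new_leaf, leaf_on_right)
    setattr(parent, child_attr, u)
    return u

# Example: refine a single cut by subdividing its left edge.
root = Node(dim=0, thr=0.5, left=Node(cla="blue"), right=Node(cla="red"))
subdivide(root, "left", dim=1, thr=2.0, new_leaf=Node(cla="red"))
```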
Observe that the refinement is a witness tree and thus the previous witnesses need to be preserved. That is, the choices of the dimension dim(r) or dim(u) and the threshold thr(r) or thr(u) need to be such that each witness is still classified in its leaf. In formulas, for each leaf t of T it must hold that wit(t) ∈ E[R, t].

Figure 1. Two ways of refining a tree: on the left, a new root r and a new leaf t are introduced; on the right, an existing edge between the subtrees T1 and T2 is subdivided with a vertex u and a new leaf v is introduced.

A refinement R of a witness tree T is obtained by a series T = T1, T2, . . . , Tk = R of witness trees such that Ti is a one-step refinement of Ti−1. A decision tree R (without witness labeling) is a refinement of a witness tree T if there is a labeling of the leaves of R by witnesses such that (a) the labeling results in a witness tree, that is, for each leaf t of R the witness wit(t) of t is in E[R, t], and (b) after the labeling the witness tree R is a refinement of T. If a tree ensemble C′ consists of the trees of a witness ensemble C or refinements thereof, then we say that C′ is a refinement of C.

For the correctness proof for our algorithm we need a property of refinements which shows that the order in which we introduce nodes for certain refinements is immaterial.

Lemma 4.2. Let T1 be a witness tree, T2 a one-step refinement of T1, and T3 a one-step refinement of T2. Let T3 introduce an inner node u. Assume that T3 does not subdivide the edge incident with a new leaf introduced in T2. Then there is a one-step refinement S2 of T1 and a one-step refinement S3 of S2 such that S2 introduces u and S3 = T3.

Proof. Refinement T3 may introduce a new root or subdivide an edge. Suppose that T3 introduces u as a new root above the root r of T2. If r is also present in T1, then T2 subdivides an edge f of T1. Thus, in S2 we may instead introduce the root u, maintaining that edge f is present, and then in S3 we may subdivide the edge f as in T2. This results in the same witness tree S3 as T3 and thus maintains the sequence of refinements. If r is not present in T1, then T2 introduces the new root r above the old root r′ of T1. Instead, we may proceed as follows. In S2 we first introduce the new root u above the old root r′ of T1. Then, in S3 we subdivide the edge between r′ and u to introduce the node r with the same labels as in T2. Again, this results in the same witness tree S3 as T3.

Now suppose that T3 subdivides the edge f to introduce node u. If f is present in T1, then T2 introduces a new root that is not incident with f or subdivides an edge different from f. In both cases, we may reverse the two operations, resulting in the same tree T3. If f is not present in T1, then f is a new edge introduced in T2. By precondition, f is not incident with a new leaf of T2. Hence, f is introduced by adding a new root above an old root or by subdividing an edge. If f was introduced by adding a new root r into T1, we may instead proceed as follows. To obtain S2, we take T1 and introduce u as a new root. Afterwards, to obtain S3, we take S2 and introduce r as a new root above u. We again have that T3 and S3 are the same and the sequence of refinements is maintained. Suppose that f was introduced in T2 by subdividing an edge f′ of T1 to introduce some node v. Then, in T3 node v is the parent of u or vice versa. Without loss of generality, assume that v is the parent of u.
Instead, we may first subdivide f′ in T1 and introduce u to obtain S2 and then subdivide the edge from u to its parent in S2 to introduce v and obtain S3. This again results in the same witness tree as T3. Concluding, we may replace T3 and T2 with S3 and S2, obtaining the required properties.

We can now describe the recursive algorithm for solving MTES. The pseudo-code is given in Algorithm 1. As mentioned, it checks whether the current witness ensemble is sufficiently small and classifies the input and, if so, reports it as a solution. Otherwise, it finds a dirty example e and tries all possibilities of reclassifying the example in a refinement of one of the trees in the current ensemble. Note that since e is dirty, a tree which can be refined always exists. Then it continues recursively. In Line 6, a one-step refinement of T is important if it is obtained by introducing a new node w that is labeled by a dimension i ∈ [d] in which e and the witness x of e's leaf in T differ, i.e., e[i] ≠ x[i], and by a threshold t′ ∈ Thr(i) such that t′ is between e[i] and x[i].

Algorithm 1: Computing tree ensembles.
1  Function Refine Ensemble(C, (E, λ), S)
   Input: A witness ensemble C, a training data set (E, λ), and a size threshold S ∈ ℕ.
   Output: A tree ensemble of overall size at most S that is a refinement of C and classifies (E, λ), or ⊥ if none exists.
2    if overall size of C is larger than S then return ⊥
3    if C classifies (E, λ) then return C
4    e ← a dirty example for C
5    for each tree T in C in which e is not a witness do
6      for each important one-step refinement T′ of T introducing e as witness and s.t. T′[e] = λ(e) do
7        C′ ← C with T replaced by T′
8        D ← Refine Ensemble(C′, (E, λ), S)
9        if D ≠ ⊥ then return D
     return ⊥

Note that, if ℓ > S, then necessarily in the solution ensemble there are trivial trees without any cuts, that is, trees that classify all examples as blue or all examples as red. Now, with a factor of ℓ in the running time, we may determine how many such trivial trees of either type there are (by trying all possibilities). Thus, in the following, we will assume that each tree has at least one inner node and thus that ℓ ≤ S.

The initial calls to Refine Ensemble are made with the following 2^ℓ witness ensembles C: For each tree T in C we pick an arbitrary distinct example e and try both possibilities for whether e is classified as λ(e) or not (i.e., with the other class) and make T a tree consisting of a single leaf labeled by the corresponding class and with e as its witness. This concludes the description of the algorithm. For the algorithm for MMAXTES we replace S by s and we modify the check in Line 2 to check whether the size of the largest tree is larger than s instead.
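The saving behind the branching in Line 6 is that only "important" cuts need to be tried, i.e., cuts that separate the dirty example from the witness of its leaf. The sketch below enumerates these candidate cuts; it is illustrative (the function names are not from the paper) and assumes Thr(i) is realized as midpoints between consecutive distinct values, consistent with the definition in Section 2.

```python
def thresholds(examples, i):
    """A minimum-size threshold set Thr(i): midpoints between consecutive distinct values."""
    vals = sorted({e[i] for e in examples})
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

def important_cuts(e, witness, examples):
    """Candidate cuts (dimension, threshold) for an important one-step refinement:
    dimensions in which the dirty example e and its leaf's witness differ, and
    thresholds from Thr(i) lying strictly between their values in that dimension.
    At most delta dimensions and D thresholds qualify, which bounds the branching."""
    cuts = []
    for i in range(len(e)):
        if e[i] == witness[i]:
            continue
        lo, hi = min(e[i], witness[i]), max(e[i], witness[i])
        cuts.extend((i, t) for t in thresholds(examples, i) if lo < t < hi)
    return cuts

examples = [(1, 4), (2, 1), (3, 3)]
print(important_cuts((1, 4), (3, 3), examples))  # [(0, 1.5), (0, 2.5), (1, 3.5)]
```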
Proof of Theorem 4.1. We now show that the algorithm described above achieves the required running time and that it is correct. For the running time, observe that, after one of the 2^ℓ initial calls, the algorithm describes a search tree in which each node corresponds to a call to Refine Ensemble. The depth of this tree is at most S for MTES and at most sℓ for MMAXTES because in each call at least one refinement is made and thus the overall size increases by at least one. We claim that each search-tree node has at most δD(2S + ℓ) or δD(2s + 1)ℓ children, respectively. To see this, we show that the total number of refinements of C considered in Line 6 is bounded by that number: Each such refinement is specified (1) by a new root or an edge on a root-leaf path of a tree in C, (2) a dimension in the newly introduced node, and (3) a threshold in the newly introduced node. For (3) there are at most D possibilities. For (2) there are at most δ possibilities: Observe that e has a different class label than the witness w of its old leaf. Thus, there are at most δ dimensions in which e and w differ. For (1) there are at most ℓ ways to choose to introduce a new root. Furthermore, since each edge in a tree is incident with an inner node and each inner node u is incident with at most two edges to children of u, there are at most 2S ways (resp. 2sℓ ways) to choose an edge for subdivision. Thus, the overall search tree has size at most (δD(2S + ℓ))^S ≤ (3δDS)^S (resp. (δD(2s + 1)ℓ)^{sℓ}). Accounting for the 2^ℓ ≤ 2^S initial calls and noticing that the operations in one search-tree node take O(Sn) time yields the claimed running time.

It remains to prove the correctness. Clearly, if the algorithm returns something different from ⊥, then it is a tree ensemble that classifies (E, λ) and is of the required size. Now assume that there is a tree ensemble that classifies (E, λ) and is of the required size. We show that the algorithm will not return ⊥. We say that a witness-tree ensemble C is good if there is a tree ensemble C′ that classifies (E, λ), is of the required size, and is such that C′ refines C. We claim that (1) one of the ensembles C of an initial call to Refine Ensemble is good and (2) that if C in a call to Algorithm 1 is good, then either it classifies (E, λ) or in at least one recursive call Refine Ensemble(C′, ·, ·) it is the case that C′ is good. Observe that it is enough to prove both claims.

As to claim (1): Consider a solution C′ and the witnesses that were chosen arbitrarily for C before the initial calls to Refine Ensemble. Fix an arbitrary mapping of trees between C and C′. Consider a tree T in C and its corresponding tree T′ in C′. Observe that T′ has a leaf t such that E[T′, t] contains the (single) witness of T. Label t with this witness. For each remaining leaf, pick an arbitrary example that is classified in this leaf and label it as its witness. (Note that, without loss of generality, there is at least one example in each leaf because if there is a leaf without an example then we can find a smaller solution ensemble.) Observe that the witness tree resulting from T′ by this labeling is a refinement of T if the leaves containing the single witness of T have the same class label. Since we try all possible class labels for the leaf in T before the initial calls to Refine Ensemble, eventually we obtain that T′ is a refinement of T, and indeed this holds for all pairs of mapped trees of C and C′. Thus, C is good.

As to claim (2): Assume that the witness-tree ensemble C is good and let C′ be a corresponding tree ensemble. If C classifies (E, λ) then there is nothing to prove. Otherwise, there is at least one dirty example for C. Let e be the dirty example picked by the algorithm in Line 4. Consider the classes assigned to e by the trees in C and those assigned by the corresponding refinements in C′. Observe that there is at least one tree T ∈ C such that its refinement R ∈ C′ assigns a different class to e. In a refinement, the class of a witness is never changed and thus e is not a witness in T.
Hence, the for-loop in Line 5 selects the tree T in one iteration. We now claim that the loop in Line 6 will select a refinement T′ such that R (thought of as a non-witness decision tree) is a refinement of T′. Consider a sequence of one-step refinements T = T1, T2, . . . , Tk = R. We call a refinement Tj in this sequence e-rerouting if Tj introduces a new leaf t such that e ∈ E[Tj, t]. We claim that we may assume that T2 is e-rerouting. For a contradiction, assume that in all sequences (Ti)_{i ∈ [k]} the first e-rerouting one-step refinement Tj satisfies j > 2. (Note that such a refinement always exists because e is classified differently in R from T.) Pick the sequence (Ti)_{i ∈ [k]} such that j is minimal. Let Tj introduce some node w. Observe that e's leaf t of T is present in Tj and that there is a path from w to t. Thus, w is not introduced by subdividing the edge incident with a leaf that has been introduced in Tj−1. Thus, by Lemma 4.2 we may replace Tj and Tj−1 by one-step refinements that result in the same witness tree Tj and such that Tj−1 introduces w instead. This is a contradiction to the minimality of j and thus we may assume that T2 is e-rerouting.

Next, we claim that we may assume that the witness assigned to the new leaf introduced in T2 is e. Before we can show this, we need to ensure that there are no further e-rerouting refinements in the sequence. Consider an e-rerouting refinement Tj after T2, introducing some node u and a new leaf t′ such that e ∈ E[t′]. Consider the other child v of u. By Lemma 4.2 we may assume that all nodes in the subtree of Tj rooted at v are introduced after u and thus that u has the two leaves t′ and v as children. Thus, instead of introducing t′ in Tj we may introduce v. That is, we may replace Tj by a different refinement that is not e-rerouting, maintaining the sequence of refinements. After doing this for all e-rerouting refinements, it is the case that the leaf t containing e is introduced once in T2 and then maintained in every refinement until we reach R. Thus, we may relabel the witness of t to be e in all the refinements. Finally, observe that we may assume that T2 is important and thus T2 equals T′, the refinement selected by the algorithm in Line 6. Thus, the refined ensemble C′ constructed from T′ is good, as claimed.

Recall that MINIMUM DECISION TREE SIZE (DTS) is the special case of MTES in which ℓ = 1 (and thus, without loss of generality, s = S). Thus, we have the following.

Corollary 4.3. MINIMUM DECISION TREE SIZE can be solved in O((6δDs)^s · sn) time.

This improves on the running time for DTS given by Ordyniak & Szeider (2021) (see their main theorem, Theorem 8). The following theorem shows that the exponent in our running time cannot be improved.

Theorem 4.4 (★). Solving MTES in (δDS)^{o(S)} · poly time would contradict the Exponential Time Hypothesis, even if D = 2 and ℓ = 1.

5. Tight Exponential-Time Algorithm

We now give an algorithm that solves MTES in (ℓ + 1)^n · poly time, where n := |E| is the number of examples. This running time is single-exponential in n for every fixed number of trees. More importantly, we show that this running time is essentially optimal. To obtain the algorithm, we first show how to compute, in a suitable running time, the sizes of smallest trees for essentially all possible classification outcomes of a decision tree.

Lemma 5.1. Given a training data set (E, λ), one can compute in 3^n · poly time for all E′ ⊆ E the size of a smallest decision tree T such that T[e] = blue if and only if e ∈ E′.

Proof.
We solve the problem using dynamic programming over subsets of E. The dynamic-programming table has entries of the type Q[E_b, E_r] where E_b and E_r are example sets. Each table entry stores the size of a smallest decision tree on E_b ∪ E_r where exactly the examples of E_b receive the label blue. We fill the table entries for increasing values of |E_b ∪ E_r|. We initialize the table by setting Q[E_b, ∅] := 0 and Q[∅, E_r] := 0 for all E_b ⊆ E and all E_r ⊆ E. This is correct since in these cases a decision tree without cuts and only one leaf with the appropriate class label suffices. The recurrence to fill the table when E_b and E_r are nonempty is

Q[E_b, E_r] := min_{i ∈ [d], t ∈ Thr(i)} ( Q[E_b[f_i ≤ t], E_r[f_i ≤ t]] + Q[E_b[f_i > t], E_r[f_i > t]] + 1 ).

Recall that Thr(i) denotes some minimum-size set of thresholds that distinguishes between all values of the examples in the ith dimension. In other words, for each pair of elements e and e′ with e[i] < e′[i], there is at least one value t ∈ Thr(i) such that e[i] < t < e′[i]. Moreover, we only consider those cases where E_b[f_i ≤ t] ∪ E_r[f_i ≤ t] ≠ ∅ and E_b[f_i > t] ∪ E_r[f_i > t] ≠ ∅. That is, we consider only the case that the cut gives two nonempty subtrees. This ensures that the recurrence only considers table entries with smaller set sizes.

The idea behind the recurrence is that we consider all possibilities for the cut at the root and then use the smallest decision trees to achieve the desired labeling for the two resulting subtrees. The size of the resulting tree is the size of the two subtrees plus one, for the additional root vertex. The formal correctness proof is standard and omitted.

The running time bound can be seen as follows. The number of table entries is O(3^n) since each entry corresponds to a 3-partition of E into E_b, E_r, and E \ (E_b ∪ E_r). Each entry can be evaluated in polynomial time since the number of dimensions and the size of Thr(i) is polynomial in the input size.

The above lemma directly gives an algorithm for DTS with the same running time.

Corollary 5.2. MINIMUM DECISION TREE SIZE can be solved in 3^n · poly time.

Before proving the running time bound for constructing tree ensembles, let us remark that the above algorithm can be used not only to produce perfect decision trees but also to give a set of Pareto-optimal trees for the trade-off between size and essentially any type of efficiently computable classification error, for example precision, recall, or F1-score.
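A compact brute-force sketch of the dynamic program behind Lemma 5.1 follows; it is meant only to illustrate the recurrence on tiny inputs. The function names are illustrative, and Thr(i) is realized as midpoints between consecutive distinct values, as in Section 2.

```python
from functools import lru_cache

def thresholds(points, i):
    """Thr(i): midpoints between consecutive distinct values in dimension i."""
    vals = sorted({p[i] for p in points})
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

def min_tree_size(E_blue, E_red):
    """Size of a smallest decision tree labelling exactly E_blue blue and E_red red
    (the recurrence from the proof of Lemma 5.1; exponential time, tiny inputs only)."""
    @lru_cache(maxsize=None)
    def Q(Eb, Er):
        if not Eb or not Er:              # one leaf with the right class suffices
            return 0
        pts = Eb | Er
        d = len(next(iter(pts)))
        best = float("inf")
        for i in range(d):
            for t in thresholds(pts, i):
                Eb_le = frozenset(x for x in Eb if x[i] <= t)
                Er_le = frozenset(x for x in Er if x[i] <= t)
                Eb_gt, Er_gt = Eb - Eb_le, Er - Er_le
                if not (Eb_le | Er_le) or not (Eb_gt | Er_gt):
                    continue              # only cuts yielding two nonempty subtrees
                best = min(best, Q(Eb_le, Er_le) + Q(Eb_gt, Er_gt) + 1)
        return best
    return Q(frozenset(E_blue), frozenset(E_red))

# XOR-like data needs 3 cuts; a single cut suffices for linearly separable data.
print(min_tree_size({(0, 0), (1, 1)}, {(0, 1), (1, 0)}))  # 3
print(min_tree_size({(0, 0)}, {(1, 1)}))                  # 1
```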
We now use this algorithm as a subroutine in an algorithm for MINIMUM TREE ENSEMBLE SIZE.

Theorem 5.3. For ℓ > 1, one can solve MINIMUM TREE ENSEMBLE SIZE in (ℓ + 1)^n · poly time.

Proof. We use again dynamic programming. It is not sufficient to use subsets of elements that are classified correctly. Instead, we build the solution by iteratively adding trees and storing for each example e how often e is classified correctly. To store subsolutions, we use a table R with entries of the type R[c, j], where c is a length-n integer vector in which each c_i is an integer in [0, ⌈ℓ/2⌉] for each i ∈ [n], and j ∈ [ℓ]. An entry R[c, j] stores the smallest total size of any set of j decision trees such that each element e_i is classified correctly exactly c_i times if c_i < ⌈ℓ/2⌉ and at least c_i times if c_i = ⌈ℓ/2⌉. The distinction between c_i < ⌈ℓ/2⌉ and c_i = ⌈ℓ/2⌉ allows us to assign only one value of c_i to the situation that e_i is already correctly classified irrespective of the other trees of the ensemble.

The first step of the algorithm is to compute for all E′ ⊆ E the smallest size of any decision tree T assigning the blue label exactly to all e ∈ E′. From this information, we can directly compute, for all E′ ⊆ E, the size of a smallest decision tree that classifies all examples in E′ correctly and all examples in E \ E′ incorrectly. We will store these sizes in table entries Q[E′]. Now, we initialize R for j = 1 by setting

R[c, 1] := Q[E′] if c = 1_{E′} for some E′ ⊆ E, and R[c, 1] := +∞ otherwise.

Here, we let 1_{E′} denote the indicator vector for E′. Now, for j > 1, we use the recurrence

R[c, j] := min_{E′ ⊆ E, c′ : c′ ⊕ 1_{E′} = c} ( R[c′, j − 1] + Q[E′] ).   (1)

Here, ⊕ is a truncated addition, that is, for the ith component of c′, we add 1 if it is strictly smaller than ⌈ℓ/2⌉. If for some R[c, j] the minimum ranges over an empty set, then we set R[c, j] := +∞. The idea of the recurrence is simply that the jth tree classifies some element set E′ correctly and that this increases the number of correct classifications for all elements of E′. The smallest size of a tree ensemble with ℓ trees to correctly classify E can then be found in R[c*, ℓ], where we let c* denote the length-n vector in which each component has value ⌈ℓ/2⌉. The formal proof is again standard and omitted.

The table has size (⌈ℓ/2⌉ + 1)^n · ℓ. The bottleneck in the running time to fill the table is the time needed for evaluating the minimum in Equation (1). A straightforward estimation gives a time of 2^n · ⌈ℓ/2⌉^n for each entry since we consider all possible subsets E′ of E and possibly all vectors c′. Instead, we may fill the table entries also in a forward direction, that is, for each c′ and each E′ ⊆ E, we compute c′ ⊕ 1_{E′} and update the table entry for R[c, j] if R[c′, j − 1] + Q[E′] is smaller than the current entry of R[c, j]. This way, the total time for updating table entries is 2^n · ⌈ℓ/2⌉^n ≤ (ℓ + 1)^n since we consider 2^n possible choices for E′ at each vector c′ and directly derive the corresponding c for each choice. The overall time bound follows from the observation that the 3^n · poly time needed for the preprocessing is upper-bounded by (ℓ + 1)^n · poly since ℓ > 1.
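To illustrate the recurrence (1) on a toy instance, here is a self-contained sketch in Python. All names and the toy data are illustrative; it recomputes the table Q via the Lemma 5.1 recurrence and assumes an odd number of trees so that ⌈ℓ/2⌉ correct votes is a strict majority (the paper's tie-breaking is not modeled).

```python
from functools import lru_cache
from itertools import chain, combinations
from math import inf

# Toy training data (illustrative): XOR over two features, three trees.
E = [(0, 0), (0, 1), (1, 0), (1, 1)]
lam = {(0, 0): "blue", (1, 1): "blue", (0, 1): "red", (1, 0): "red"}
ELL = 3                      # number of trees (odd)
CAP = (ELL + 1) // 2         # ceil(ELL / 2): threshold for "already correctly classified"

def thresholds(points, i):
    vals = sorted({p[i] for p in points})
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

@lru_cache(maxsize=None)
def tree_size(Eb, Er):
    """Smallest tree labelling exactly Eb blue and Er red (Lemma 5.1 recurrence)."""
    if not Eb or not Er:
        return 0
    pts = Eb | Er
    d, best = len(next(iter(pts))), inf
    for i in range(d):
        for t in thresholds(pts, i):
            bl = frozenset(x for x in Eb if x[i] <= t); br = Eb - bl
            rl = frozenset(x for x in Er if x[i] <= t); rr = Er - rl
            if (bl | rl) and (br | rr):
                best = min(best, tree_size(bl, rl) + tree_size(br, rr) + 1)
    return best

# Q[E']: smallest tree classifying exactly the examples in E' correctly.
blue = frozenset(e for e in E if lam[e] == "blue")
red = frozenset(E) - blue
subsets = [frozenset(c) for c in chain.from_iterable(combinations(E, k) for k in range(len(E) + 1))]
Q = {}
for Ep in subsets:
    gets_blue = (Ep & blue) | ((frozenset(E) - Ep) & red)
    Q[Ep] = tree_size(gets_blue, frozenset(E) - gets_blue)

# Forward dynamic program over count vectors c (recurrence (1)), truncated at CAP.
R = {(tuple(1 if e in Ep else 0 for e in E), 1): q for Ep, q in Q.items()}
for j in range(2, ELL + 1):
    R_next = {}
    for (c, jj), val in list(R.items()):
        if jj != j - 1 or val == inf:
            continue
        for Ep, q in Q.items():
            c2 = tuple(min(CAP, ci + (1 if e in Ep else 0)) for ci, e in zip(c, E))
            R_next[(c2, j)] = min(R_next.get((c2, j), inf), val + q)
    R.update(R_next)

target = tuple(CAP for _ in E)
print("minimum overall size with", ELL, "trees:", R.get((target, ELL), inf))  # 3
```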
Our proof of Theorem 5.4 also implies hardness for the larger parameter S and that this hardness holds even if D = 2, that is, each feature is binary. Corollary 5.5. Solving MINIMUM TREE ENSEMBLE SIZE on instances with binary features in f(S) 2o(log S) n time would contradict the Exponential Time Hypothesis. Our proof of Theorem 5.4 also implies hardness for MMAXTES. For this result the proof is simpler, since no argument is needed that each tree in the ensemble has exactly one inner node. Corollary 5.6. Solving MINIMAX TREE ENSEMBLE SIZE on instances with binary features and s = 1 in f(ℓ) 2o(log ℓ) n time would contradict the ETH. We conclude by mentioning a few avenues for possible future research. In Theorem 4.1, we showed that MINIMUM TREE EN- SEMBLE SIZE and MINIMAX TREE ENSEMBLE SIZE are both fixed-parameter tractable when parameterized by δ (the maximum number of dimensions in which a pair of examples differ), D (the domain size), S and s (the total tree ensemble size and the maximum size of a tree in the ensemble, respectively), and ℓ(the number of trees in the ensemble), respectively. It would be interesting to investigate the problem for strictly smaller parameterizations. Of course, lower bounds for MINIMUM DECISION TREE SIZE also apply to MTES and MMAXTES. Hence, these two problems are W[2]- hard with respect to (D, s, ℓ) and (D, S, ℓ), respectively, and NP-hard for constant values of (δ, D, ℓ). This leaves the parameterized complexity of MTES for (δ, S, ℓ) and (δ, D, S) and of MMAXTES for (δ, s, ℓ) for (δ, D, s) as open problems. Theorem 4.1 naturally raises the question of whether those problems admit polynomial-size kernels for that parameterization. Intuitively, a polynomial-size kernel is a polynomial-time algorithm that takes an instance for the problem at hand as input and shrinks this instance to a size that is bounded by a polynomial in the parameter. For details on kernelization, we refer to (Fomin et al., 2019; Cygan et al., 2015). Essentially, a kernelization is simply a polynomial-time preprocessing procedure which comes with a provable guarantee that the preprocessed instance is small relative to the parameter. Even preprocessing algorithms without a guarantee could be of practical interest. The decision tree and tree ensemble models could be extended in several ways: For instance, one could dispense with the restriction that cuts must be axis-aligned and allow cuts along any hyperplane. Lower bounds for constructing optimum trees in this model were proven by Goodrich et al. (1995) and Grigni et al. (2000), but the parameterized complexity of this problem has not been studied yet. One could also dispense with the requirement that the decision tree or tree ensemble classify the training data set perfectly and instead allow a bounded number of examples to be misclassified. This is a better reflection of practical uses of decision trees and tree ensembles. As we observed, Theorem 5.3 can easily be adapted to this setting. If one additionally includes the bound on the number of misclassifed examples as a parameter, the algorithm in Theorem 4.1 can also be adapted by adding a branch in which the dirty example is misclassifed for any dirty example that is considered in the algorithm. Whether or not one can do without this superpolynomial dependence on that bound, may be an interesting question. Also, one could dispense with the restriction that there are only two classes of examples. 
Theorems 4.1 and 5.3 could likely be adapted, but with a super-polynomial dependence on the number of classes in the running time. An algorithm without this dependence could be a challenge. Finally, a common approach to building tree ensembles is to build them incrementally, with each added tree greedily reducing the number of misclassified examples as much as possible. The parameterized complexity of this problem would be worth investigating as well.

Acknowledgements

Pascal Kunz was supported by the DFG Research Training Group 2434 "Facets of Complexity". Frank Sommer was supported by the DFG, project EAGR (KO 3669/6-1). Manuel Sorge was supported by the Alexander von Humboldt Foundation. This work was initiated at the research retreat of the Algorithmics and Computational Complexity group of TU Berlin, held in Darlingerode in September 2022.

References

Bessiere, C., Hebrard, E., and O'Sullivan, B. Minimising decision tree size as combinatorial optimisation. In Proceedings of the 15th International Conference on Principles and Practice of Constraint Programming (CP '09), pp. 173–187, 2009. doi: 10.1007/978-3-642-04244-7_16.

Bonamy, M., Kowalik, Ł., Pilipczuk, M., Socała, A., and Wrochna, M. Tight lower bounds for the complexity of multicoloring. ACM Transactions on Computation Theory, 11(3):1–19, 2019. doi: 10.1145/3313906.

Breiman, L. Random forests. Machine Learning, 45(1):5–32, 2001. doi: 10.1023/A:1010933404324.

Carrizosa, E., Molero-Río, C., and Romero Morales, D. Mathematical optimization in classification and regression trees. TOP, 29(1):5–33, 2021. doi: 10.1007/s11750-021-00594-1.

Cygan, M., Fomin, F. V., Kowalik, Ł., Lokshtanov, D., Marx, D., Pilipczuk, M., Pilipczuk, M., and Saurabh, S. Parameterized Algorithms. Springer, 2015. doi: 10.1007/978-3-319-21275-3.

Demirović, E., Lukina, A., Hebrard, E., Chan, J., Bailey, J., Leckie, C., Ramamohanarao, K., and Stuckey, P. J. MurTree: Optimal decision trees via dynamic programming and search. Journal of Machine Learning Research, 23:1–47, 2022.

Downey, R. G. and Fellows, M. R. Fundamentals of Parameterized Complexity. Springer, 2013. doi: 10.1007/978-1-4471-5559-1.

Eiben, E., Ordyniak, S., Paesani, G., and Szeider, S. Learning small decision trees with large domain. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI '23). International Joint Conferences on Artificial Intelligence Organization, 2023.

Flum, J. and Grohe, M. Parameterized Complexity Theory. Springer, 2006. doi: 10.1007/3-540-29953-X.

Fomin, F. V., Lokshtanov, D., Saurabh, S., and Zehavi, M. Kernelization: Theory of Parameterized Preprocessing. Cambridge University Press, 2019. doi: 10.1017/9781107415157.

Goodrich, M. T., Mirelli, V., Orletsky, M., and Salowe, J. Decision tree construction in fixed dimensions: Being global is hard but local greed is good. Technical Report TR-95-1, Department of Computer Science, Johns Hopkins University, 1995.

Gottlob, G., Scarcello, F., and Sideri, M. Fixed-parameter complexity in AI and nonmonotonic reasoning. Artificial Intelligence, 138(1-2):55–86, 2002. doi: 10.1016/S0004-3702(02)00182-0.

Grigni, M., Mirelli, V., and Papadimitriou, C. H. On the difficulty of designing good classifiers. SIAM Journal on Computing, 30(1):318–323, 2000. doi: 10.1137/S009753979630814X.

Hyafil, L. and Rivest, R. L. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976. doi: 10.1016/0020-0190(76)90095-8.

Impagliazzo, R. and Paturi, R.
On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001. doi: 10.1006/jcss.2000.1727.

Impagliazzo, R., Paturi, R., and Zane, F. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512–530, 2001. doi: 10.1006/jcss.2001.1774.

Kobourov, S. G., Löffler, M., Montecchiani, F., Pilipczuk, M., Rutter, I., Seidel, R., Sorge, M., and Wulms, J. The influence of dimensions on the complexity of computing decision trees. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI '22), 2022. Accepted for publication.

Kulkarni, V. Y. and Sinha, P. K. Random forest classifiers: A survey and future research directions. International Journal of Advanced Computing, 36:1144–1153, 2013.

Narodytska, N., Ignatiev, A., Pereira, F., and Marques-Silva, J. Learning optimal decision trees with SAT. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI '18), pp. 1362–1368. AAAI Press, 2018. doi: 10.24963/ijcai.2018/189.

Niedermeier, R. Invitation to Fixed-Parameter Algorithms. Oxford University Press, 2006.

Ordyniak, S. and Szeider, S. Parameterized complexity of small decision tree learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI '21), pp. 6454–6462, 2021. doi: 10.1609/aaai.v35i7.16800.

Rokach, L. Decision forest: Twenty years of research. Information Fusion, 27:111–125, 2016. doi: 10.1016/j.inffus.2015.06.005.

Tamon, C. and Xiang, J. On the boosting pruning problem. In Proceedings of the 11th European Conference on Machine Learning (ECML '00), pp. 404–412, 2000. doi: 10.1007/3-540-45164-1_41.

Verikas, A., Gelzinis, A., and Bacauskiene, M. Mining data with random forests: A survey and results of new tests. Pattern Recognition, 44(2):330–349, 2011. doi: 10.1016/j.patcog.2010.08.011.

Vidal, T. and Schiffer, M. Born-again tree ensembles. In Proceedings of the 37th International Conference on Machine Learning (ICML '20), pp. 9743–9753, 2020.