# Scalable Optimal Multiway-Split Decision Trees with Constraints

Shivaram Subramanian*, Wei Sun*
IBM Research, Yorktown Heights, New York, USA
{subshiva, sunw}@us.ibm.com

*These authors contributed equally.
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

There has been a surge of interest in recent years in learning optimal decision trees using mixed-integer programs (MIP), as heuristic-based methods do not guarantee optimality and find it challenging to incorporate constraints that are critical for many practical applications. However, existing MIP methods that build on an arc-based formulation do not scale well, as the number of binary variables is of the order $O(2^d N)$, where $d$ and $N$ refer to the depth of the tree and the size of the dataset. Moreover, they can only handle sample-level constraints and linear metrics. In this paper, we propose a novel path-based MIP formulation in which the number of decision variables is independent of $N$. We present a scalable column generation algorithm to solve the MIP. Our framework produces a multiway-split tree, which is more interpretable than the standard binary-split tree due to its shorter rules. Our method can incorporate nonlinear metrics such as the F1-score and a broader class of constraints. We demonstrate its efficacy with extensive experiments. We present results on datasets containing up to 1,008,372 samples, while existing MIP-based decision tree models do not scale well on data beyond a few thousand points. We report superior or competitive results compared to the state-of-the-art MIP-based methods, with up to a 24X reduction in runtime.

## 1 Introduction

Decision trees are among the most popular machine learning models, as the tree structure is visually easy to understand. Since learning an optimal decision tree is NP-hard (Laurent and Rivest 1976), popular algorithms such as CART (Breiman et al. 1984), ID3 (Quinlan 1986) and C4.5 (Quinlan 2014) rely on greedy heuristics to construct trees. Motivated by the heuristic nature of the traditional methods, there have been many efforts across different fields to learn optimal decision trees (ODT), e.g., dynamic programming (Lin et al. 2020), constraint programming (Verhaeghe et al. 2020), Boolean satisfiability (Narodytska et al. 2018), and itemset mining (Aglin, Nijssen, and Schaus 2020). In particular, recent advances in modern optimization have facilitated a nascent stream of research that leverages mixed-integer programming (MIP) to train globally optimal trees with constraints (Bertsimas and Dunn 2017; Aghaei, Azizi, and Vayanos 2019; Verwer and Zhang 2019; Aghaei, Gómez, and Vayanos 2021); this is the methodology we focus on in this paper.

Figure 1: An example of OMT trained on car-evaluation.

Prior MIP-based methods rely on an arc-based formulation that requires a large number of binary decision variables to identify splitting conditions at branch nodes as well as label and sample assignments to leaf nodes. This approach has several drawbacks: (1) the optimization problem easily becomes intractable, as the number of binary variables and constraints increases linearly with the training data; hence, experiments are typically restricted to datasets with no more than a few thousand samples. (2) Prior MIP frameworks can only handle linear metrics.
In many applications, it is desirable to consider nonlinear metrics; e.g., the F1-score is preferred over accuracy for evaluating machine learning models trained on imbalanced datasets. (3) With the arc-based formulation, it is challenging to impose constraints on individual decision rules and feature combinations. One such example occurs in the medical field, where doctors have to arrive at an appropriate diagnosis while taking into account the costs of medical tests (Lomax and Vadera 2013). (4) The vast majority of the decision tree literature focuses on binary-split trees, where each node can have at most two child nodes. Multiway-split trees (see Figure 1 for an example), whose branching condition may contain several values, are often more intuitive and comprehensible (Fulton, Kasif, and Salzberg 1995).

In this paper, we attempt to address these limitations by proposing a scalable MIP-based framework to train constrained optimal multiway-split trees (OMT). Our contributions are five-fold.

**Novel path-based MIP formulation** Unlike arc-based MIPs, we explicitly model decision rules, or paths, in a tree. The high-level idea is as follows: we first define a feature graph that admits every possible combination of input features as decision rule candidates. The objective of the MIP is to identify a subset of rules from this potentially huge rule space so as to minimize the prediction error. The feature graph naturally embeds the hierarchical structure of a tree, and we add a constraint to model another defining property of decision trees, namely that each sample can only be assigned to a single rule.

**Scalable algorithm** Unlike existing MIP ODTs, whose number of binary decision variables is of the order $O(2^d N)$, where $d$ and $N$ refer to the depth of the tree and the size of the training data, the number of binary decision variables in our formulation is independent of $N$: it equals the number of candidate decision rules defined in the feature graph. While this number can also be excessive in the worst case, we employ column generation (CG) to dynamically generate a relatively small number of candidates that recover a near-optimal solution. While existing MIP approaches can manage no more than a few thousand training samples, an initial implementation of our proposed CG algorithm effectively solved a dataset with $10^6$ samples.

**Flexible framework** Our framework can be applied to both classification and regression settings. The path-based model allows us to incorporate any path-level metric, including nonlinear ones, which enter the MIP as input parameters. We also show how a broader class of constraints, including path-level and attribute-level constraints that are difficult to express within existing arc-based formulations, can be easily incorporated into our framework.

**Enhanced interpretability with multiway splits** To the best of our knowledge, we are the first to train optimal multiway-split trees for prediction tasks, where a node may have more than two child nodes. Since a feature seldom appears more than once in any root-to-leaf path, multiway-split trees are easier to comprehend than their binary counterparts (Fulton, Kasif, and Salzberg 1995).

**Extensive numerical results** We conduct extensive computational studies on public datasets and report the first MIP-based results for large datasets containing up to 1,008,372 samples, which are an order of magnitude larger than the datasets previously analyzed in the MIP-based ODT literature.
We show that our framework is either competitive with or improves upon the state-of-the-art MIP methods in terms of out-of-sample accuracy, and achieves up to a 24-fold reduction in runtime.

## 2 Related Literature

The discrete nature of the decisions involved in training a decision tree and recent algorithmic advances in integer optimization have inspired a burgeoning body of literature that utilizes MIP formulations to construct decision trees (Bertsimas and Dunn 2017; Aghaei, Gómez, and Vayanos 2021; Günlük et al. 2021). Different flavors of MIP-based trees have been proposed in the literature, e.g., discrimination-aware trees (Aghaei, Azizi, and Vayanos 2019), trees with hyperplane splits (Zhu et al. 2020), and prescriptive trees (Biggs, Sun, and Ettl 2021; Subramanian et al. 2022).

Most of the existing MIP methods build on the Optimal Classification Tree (OCT) first introduced by Bertsimas and Dunn (2017), which decides the split condition at each node, the label assignment for each leaf node, and the routing of each data point from the root node to a leaf node. As the computational tractability of the original OCT formulation is limited by data size and tree depth, subsequent efforts sought to improve the efficiency of the approach. The BinOCT approach of Verwer and Zhang (2019) employs a binary encoding method to model the threshold selection at branch nodes, reducing the number of binary decision variables needed. Most notably, a recent paper proposes FlowOCT (Aghaei, Gómez, and Vayanos 2021), a strong max-flow-based MIP formulation with binary data. Its formulation yields a tighter underlying LP relaxation and outperforms prior methods in many instances. Nevertheless, due to the underlying arc-based formulation, all these methods face significant computational difficulties with larger datasets and deeper trees. This motivates us to seek a completely different approach: explicitly modeling the paths used to construct a tree. While the worst-case number of rules can be huge with high-dimensional data, we show that our problem can be solved efficiently via column generation (CG). CG has been successful in solving large-scale discrete optimization models in many domains, including vehicle routing (Chen and Xu 2006), crew scheduling (Subramanian and Sherali 2008; Bront, Méndez-Díaz, and Vulcano 2009), and supply chain management, among others (Xu 2019). Utilizing large-scale optimization techniques for MIP-based ODTs has been attempted previously: Aghaei, Gómez, and Vayanos (2021) show that FlowOCT can be solved by Benders decomposition (row generation). However, its subproblem potentially involves solving a max-flow problem for every sample, limiting its scalability.

## 3 Problem Formulation

We consider a dataset consisting of $N$ samples, $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathcal{X}^k$ are features, which are assumed to be categorical. We discuss how to handle numerical features in Section 5, and present experiments on datasets with both categorical and numerical features in Section 6. $y_i$ is the outcome, which can be a discrete label for classification or a continuous quantity for regression. A binary-split tree of depth $d$ can have at most $2^d$ leaf nodes. In a multiway-split tree, each node may have more than two children. Thus, we use the depth of a tree, $d$, and the number of leaf nodes, $l$, which are user-specified parameters, to define the complexity of a tree. Figure 1 shows a multiway-split tree with $d = 3$ and $l = 8$.
In our framework, we explicitly model individual decision paths from the root to the leaf nodes. We begin by defining a feature graph that contains every possible combination of input features, followed by introducing a MIP optimization problem that identifies a subset of paths to form the tree.

Figure 2: The feature graph for car-evaluation.

### Feature Graph

We consider an acyclic multi-level digraph $G(V, E)$, where each feature corresponds to a level in the graph, represented by multiple nodes corresponding to its distinct feature values. The nodes of a feature are fully connected to the nodes in the next level. The graph also includes a source and a sink node. A decision rule is defined as a path from the source to the sink node. For each feature, we introduce a dummy node SKIP: if a path goes through the SKIP node of a feature, that feature is not used in the decision rule. As SKIP nodes allow paths to ignore features, the paths on this acyclic graph represent all possible feature combinations, defining the full search space $\mathcal{P}$ that one may need to consider to construct an optimal tree. Figure 2 illustrates an example of a feature graph using three features from car-evaluation, a UCI dataset (Dua and Graff 2017), which records the assessment of cars based on criteria such as Persons (number of seats), Buying (purchase cost) and Safety. The two highlighted paths in Figure 2 correspond to the colored decision rules in the OMT shown in Figure 1.

Denote by $\eta$ the maximum number of unique values associated with a feature. The following result implies that enumerating the space $\mathcal{P}$ may become impractical for high-dimensional datasets.

**Proposition 1** *The search space $\mathcal{P}$ is $O(\eta^k)$.*

In Section 4, we describe our algorithm, which uses a large-scale optimization technique to intelligently search through this space: only paths that can improve the current solution are generated on the fly.

### Path-Based MIP Formulation

We denote a decision rule by $j \in \{1, \dots, L\}$, where $L = |\mathcal{P}|$. Let $S_j \subseteq [N]$ be the subset of observations that fall into rule $j$. Denote by $\xi_j$ a user-specified metric associated with rule $j$, e.g., the misclassification error of rule $j$ in a classification task, or the squared or absolute loss for regression. Let $z_j$ be a binary decision variable indicating whether rule $j$ is selected ($1$) or not ($0$), for $j = 1, \dots, L$. We define non-negative slack variables $s_i$ and a positive penalty cost $c_i$ for each sample $i$. A MIP formulation to determine an optimal multiway-split tree (OMT) can be written as follows:

$$
\begin{aligned}
\min_{z,\, s} \quad & \sum_{j=1}^{L} \xi_j z_j + \sum_{i=1}^{N} c_i s_i & \\
\text{s.t.} \quad & \sum_{j=1}^{L} a_{ij} z_j + s_i = 1, \quad i = 1, \dots, N & (1) \\
& \sum_{j=1}^{L} z_j \le l & (2) \\
& z_j \in \{0, 1\}, \quad j = 1, \dots, L; \qquad s_i \ge 0, \quad i = 1, \dots, N &
\end{aligned}
$$

The input parameter $a_{ij} = 1$ if sample $i$ satisfies the conditions specified in rule $j$, and $0$ otherwise. While several rules may contain the same data sample, the set partitioning constraint (1), with a sufficiently large penalty $c_i$, ensures that each sample is ultimately assigned to exactly one rule. The cardinality constraint (2) stipulates that no more than $l$ rules are active in the optimal solution $z^*$.

It is well known that set partitioning problems are NP-hard (Hartmanis 1982). Thus, while optimal solutions can be obtained in practice for moderately sized instances (Atamtürk, Nemhauser, and Savelsbergh 1996), the problem may easily become intractable to solve directly, as the number of feasible rules grows exponentially with the feature space. We present an efficient algorithm to overcome this computational challenge in the next section.
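Before moving on, the following minimal sketch makes the formulation concrete on a toy problem. It enumerates depth-limited rules by brute force, builds the coverage matrix $a_{ij}$ and per-rule errors $\xi_j$, and solves (1)-(2) with SciPy's MILP interface. The toy data, the penalty value $c_i = 10$, and the exhaustive enumeration are illustrative assumptions, not the paper's implementation, which generates columns on the fly instead (Section 4).

```python
# A minimal sketch of the OMT set-partitioning model (1)-(2) on a toy
# dataset, solved by brute-force rule enumeration. The data and the
# penalty c_i = 10 are illustrative assumptions.
from itertools import combinations, product

import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Toy data: two categorical features, binary labels.
X = np.array([["low", "a"], ["low", "b"], ["high", "a"], ["high", "b"]])
y = np.array([0, 0, 1, 0])
N, k = X.shape
d, l = 2, 3  # maximum rule length and maximum number of leaves

# Enumerate every rule with at most d (feature, value) conditions; a rule
# is a dict {feature index: required value}. Features absent from the dict
# are "skipped", mirroring the SKIP nodes of the feature graph.
rules = []
for size in range(1, d + 1):
    for feats in combinations(range(k), size):
        for vals in product(*(sorted(set(X[:, f])) for f in feats)):
            rules.append(dict(zip(feats, vals)))
L = len(rules)

# Coverage matrix a_ij and per-rule misclassification error xi_j
# (each rule predicts the majority label among the samples it covers).
A = np.zeros((N, L))
xi = np.zeros(L)
for j, rule in enumerate(rules):
    cov = np.all([X[:, f] == v for f, v in rule.items()], axis=0)
    A[:, j] = cov
    if cov.any():
        lab = y[cov]
        xi[j] = float((lab != np.bincount(lab).argmax()).sum())

c_pen = np.full(N, 10.0)  # penalty c_i for leaving sample i unassigned

# Variables: z_1..z_L (binary), then s_1..s_N (continuous). Note s_i <= 1
# is implied by (1), so Bounds(0, 1) is harmless for the slacks too.
cost = np.concatenate([xi, c_pen])
integrality = np.concatenate([np.ones(L), np.zeros(N)])
partition = LinearConstraint(np.hstack([A, np.eye(N)]), 1.0, 1.0)      # (1)
cardinality = LinearConstraint(
    np.concatenate([np.ones(L), np.zeros(N)]).reshape(1, -1),
    -np.inf, l)                                                        # (2)
res = milp(cost, constraints=[partition, cardinality],
           integrality=integrality, bounds=Bounds(0.0, 1.0))
print("selected rules:", [rules[j] for j in range(L) if res.x[j] > 0.5])
```

On this toy instance an optimal solution selects three non-overlapping rules with zero training error, e.g., Feature 0 = low plus the two rules that pin down both features when Feature 0 = high.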
## 4 Column Generation

Column generation (CG) is a classical technique for solving large MIPs for which it is not practical to explicitly generate all columns (variables) of the problem (Lübbecke and Desrosiers 2005). Specifically, we first consider a restricted master program (RMP) version of OMT, where we (1) consider only a subset of paths $\hat{L}$, which is typically much smaller than $L$ and is determined dynamically, and (2) relax the integrality constraints on $z_j$ to $0 \le z_j \le 1$ for all $j = 1, \dots, \hat{L}$. Denote the dual variables associated with the set partitioning constraints (1) and the cardinality constraint (2) by $\lambda_i$ and $\mu$, respectively. The dual formulation of the RMP can be written as follows:

$$
\begin{aligned}
\max_{\lambda,\, \mu} \quad & \sum_{i=1}^{N} \lambda_i + l\mu & \\
\text{s.t.} \quad & \sum_{i=1}^{N} a_{ij} \lambda_i + \mu \le \xi_j, \quad j = 1, \dots, \hat{L} & (3) \\
& \mu \le 0; \qquad \lambda_i \le c_i, \quad i = 1, \dots, N &
\end{aligned}
$$

Note that since $z_j \le 1$ is implied by Eq. (1), we are only left with $z_j \ge 0$ in the primal RMP. Based on the dual feasibility constraint for path $j$ in (3), the reduced cost for path $j$ is

$$
rc_j = \xi_j - \left( \sum_{i=1}^{N} a_{ij} \lambda_i + \mu \right). \quad (4)
$$

When the RMP is solved to optimality, dual feasibility is guaranteed only for the rules included in $\hat{L}$. A path violating the dual constraint (3) has a negative reduced cost and must be added to the RMP for the next iteration. To identify the paths that maximally violate Eq. (3), we need to solve $\min_j rc_j$. We define the subproblem as the search for some $K$ paths having the most negative reduced cost, which is a $K$-shortest path problem (KSP) over the feature graph $G$, where the cost of a path is its reduced cost defined in (4). If no violating path exists, we have dual feasibility and the CG has converged.

The high-level view of the CG procedure is as follows. We start by solving the RMP with an empty set of candidate rules, $\hat{L} = \emptyset$, which sets the slack variables $s_i$ to $1.0$ and yields an initial dual solution. Next, we use these dual values to solve the subproblem and identify up to $K$ candidate rules having the most negative reduced cost. They are added to $\hat{L}$, the RMP is re-optimized to obtain a new primal-dual solution, and the process iterates. The CG procedure converges to an optimal solution of the LP relaxation of OMT when dual feasibility is achieved, i.e., $rc_j \ge 0$ for all $j = 1, \dots, L$. After the CG procedure has converged, the binary restrictions are reimposed on $z$ and we solve the resultant Master-MIP.

The CG procedure can be implemented exactly, for provable optimality, or approximately, to obtain a near-optimal decision tree. Prior works (Barnhart et al. 1998) have proposed a branch-and-price technique to solve the Master-MIP to provable optimality, which also requires the subproblem to be solved exactly (Desrosiers and Lübbecke 2005). For simplicity and replicability, we employ a CG heuristic that solves the resultant Master-MIP directly using a standard optimization package (CPLEX 2021). As noted in Lübbecke and Desrosiers (2005), the role of the subproblem is to provide a potential column or to prove that none exists. Since any negative reduced cost path contributes to this aim, we employ a heuristic subproblem algorithm that is a path-dependent adaptation of a K-shortest path method for acyclic graphs with additive arc costs (Horne 1980). Our KSP method is similar to the resource-constrained shortest path heuristics used in CG applications (Desrosiers and Lübbecke 2005; Desaulniers, Desrosiers, and Solomon 2006) and has polynomial-time complexity (additional details are given in the appendix). In our experiments, the CG procedure terminates when the dual solution converges to within a tolerance or a maximum iteration limit is reached.
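The following sketch ties these pieces together on the toy problem from the previous snippet (it reuses `X`, `y`, `N`, `d` and `l` defined there). The RMP is solved as an LP with HiGHS, whose marginals provide $\lambda$ and $\mu$; the subproblem is a simple beam search over the feature levels, a simplified stand-in for the paper's path-dependent KSP heuristic, whose details are in their appendix. The beam width, tolerance, and dual-mass scoring rule are illustrative assumptions.

```python
# A compact sketch of the CG loop, reusing X, y, N, d, l and the rule
# representation from the previous snippet. The beam-search subproblem is
# a simplified stand-in for the paper's path-dependent KSP heuristic.
import numpy as np
from scipy.optimize import linprog

def coverage_and_error(rule):
    """Column a_.j (0/1 coverage) and misclassification error xi_j."""
    if rule:
        cov = np.all([X[:, f] == v for f, v in rule.items()], axis=0)
    else:
        cov = np.ones(N, dtype=bool)  # the all-SKIP path covers everything
    err = 0.0
    if cov.any():
        lab = y[cov]
        err = float((lab != np.bincount(lab).argmax()).sum())
    return cov.astype(float), err

def solve_rmp(columns, errors, c_pen):
    """LP relaxation of the RMP; returns the duals (lambda_i, mu)."""
    cost = np.concatenate([np.asarray(errors), c_pen])
    A_eq = np.hstack(
        [np.array(columns, dtype=float).reshape(len(columns), N).T,
         np.eye(N)])
    A_ub = np.concatenate([np.ones(len(columns)), np.zeros(N)]).reshape(1, -1)
    res = linprog(cost, A_ub=A_ub, b_ub=[l], A_eq=A_eq, b_eq=np.ones(N),
                  bounds=(0, None), method="highs")
    return res.eqlin.marginals, res.ineqlin.marginals[0]

def subproblem(lam, mu, K=5):
    """Beam search for up to K rules with negative reduced cost, Eq. (4)."""
    beam = [dict()]                      # start from the all-SKIP path
    for f in range(X.shape[1]):          # one graph level per feature
        ext = list(beam)                 # taking the SKIP node keeps the rule
        for rule in beam:
            if len(rule) < d:            # depth limit (Proposition 2)
                ext += [{**rule, f: v} for v in sorted(set(X[:, f]))]
        # keep the K partial paths that absorb the most dual mass
        beam = sorted(ext, key=lambda r: -coverage_and_error(r)[0] @ lam)[:K]
    out = []
    for rule in beam:
        if rule:                         # the all-SKIP path is not a rule
            cov, err = coverage_and_error(rule)
            if err - cov @ lam - mu < -1e-9:
                out.append((rule, cov, err))
    return out

c_pen = np.full(N, 10.0)
gen_rules, columns, errors = [], [], []
for it in range(40):                     # CG iteration limit, as in the paper
    lam, mu = solve_rmp(columns, errors, c_pen)
    new = [t for t in subproblem(lam, mu) if t[0] not in gen_rules]
    if not new:                          # dual feasible: LP optimum reached
        break
    for rule, cov, err in new:
        gen_rules.append(rule); columns.append(cov); errors.append(err)
# Re-impose z_j in {0,1} over the generated columns only and solve the
# Master-MIP, e.g. with scipy.optimize.milp as in the previous snippet.
print(f"{len(gen_rules)} columns generated in {it + 1} iterations")
```

With an empty initial column set, the first RMP solve sets every $s_i = 1$ and returns $\lambda_i = c_i$, $\mu = 0$, as described above; the first subproblem call then prices out rules that cover many samples.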
To preserve interpretability, shallow trees with small depth $d$ are often preferred. The following result shows how this restriction dramatically shrinks the CG search space, especially for high-dimensional datasets.

**Proposition 2** *The constrained search space is $O\!\left(\binom{k}{d} \eta^d\right)$.*

Comparing this result to the unconstrained setting in Proposition 1, we see that restricting each path to contain no more than $d$ features significantly reduces the total number of feasible paths, so that in the worst case $\hat{L} \ll L$. For an example with $d = \eta = 2$, $\hat{L}$ is bounded by $O(k^2)$, versus $L = O(2^k)$. Prior MIP models are restricted to a fixed-depth binary tree representation, which requires categorical attributes to be encoded into a generic set of binary features. Doing so prevents them from exploiting the beneficial hierarchical feature graph structure used in our approach.
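As a back-of-the-envelope check of Propositions 1 and 2, the snippet below plugs in the dimensions of the crop-mapping dataset from Section 6 ($k = 175$, $\eta = 4$, $d = 2$). Reporting exact counts rather than $O(\cdot)$ bounds is an illustrative simplification; constants and lower-order terms are dropped in the propositions.

```python
# Back-of-the-envelope comparison of Propositions 1 and 2, using the
# crop-mapping dimensions from Section 6 (k = 175 features, eta = 4
# values per feature, depth limit d = 2).
from math import comb

k, eta, d = 175, 4, 2
unconstrained = eta ** k             # Proposition 1: |P| = O(eta^k)
constrained = comb(k, d) * eta ** d  # Proposition 2: O(C(k, d) * eta^d)
print(f"unconstrained ~ {unconstrained:.2e}, depth-limited ~ {constrained:,}")
# unconstrained ~ 2.29e+105, depth-limited ~ 243,600
```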
## 5 Flexible Framework

### Nonlinear Metrics

Nonlinear metrics such as the F1-score, the Matthews correlation coefficient, and the Fowlkes-Mallows index are often used to evaluate the performance of machine learning models trained on imbalanced data (Demirovic and Stuckey 2021). Existing MIP-based decision trees only consider linear metrics. In our approach, once the feature graph is defined, we know the samples $S_j$ that satisfy a given rule, so we can compute the metric associated with that rule in the subproblem. These nonlinear metrics are represented as $\xi_j$ and enter the RMP as input parameters in the objective, as shown in Section 3. They can also enter the RMP as constraints; we provide an example of incorporating the F1-score in the appendix. This transformative modeling capability of CG is invaluable when solving optimization problems that are highly nonlinear and nonconvex in their original form.

### Constraints Enforcement

Existing optimal classification trees in the literature have been extended to incorporate constraints that address fairness and data imbalance issues. In this section, we show how our method provides an elegant and unified framework for handling constraints, including those which existing arc-based MIP formulations cannot efficiently manage.

**Path-level constraints** Prior MIP-based decision trees typically manage sample-level constraints, e.g., constraining precision or recall conditioned on samples' class labels (Aghaei, Gómez, and Vayanos 2021; Günlük et al. 2021), or fairness metrics such as statistical parity conditioned on sensitive features (Aghaei, Azizi, and Vayanos 2019; Aghaei, Gómez, and Vayanos 2021). However, none of the existing MIP-based methods can efficiently handle constraints at the path level, as the notion of a path is not explicitly defined in the arc-based formulation. Consider the example of cost-sensitive decision trees, motivated by the medical domain. Often, doctors must arrive at a diagnosis while taking into account the economic constraints faced by a patient, when different test options involve a trade-off between accuracy and measurement cost (Lomax and Vadera 2013; Núñez 1991). Denoting the cost associated with each decision rule $j$, which consists of several medical tests, by $\rho_j$ and the budget by $C$, we specify the constraint $\rho_j z_j \le C$ for all $j = 1, \dots, L$. This constraint, which ensures that each of the finally selected diagnostic methods (those with $z_j = 1$) stays within the budget, can be processed within the KSP subproblem. Additional examples can be found in the appendix.

In general, path-level constraints can be expressed as a set of polyhedral inequalities on $z$, i.e., $\sum_{j=1}^{L} \rho_{mj} z_j \le q_m$, $m = 1, \dots, M$. Let $\tau_m \le 0$ denote the dual variables corresponding to these constraints. Then the reduced cost for path $j$ becomes $rc_j = \xi_j - \left( \sum_{i=1}^{N} a_{ij} \lambda_i + \mu + \sum_{m=1}^{M} \rho_{mj} \tau_m \right)$. These constraints influence the subproblem through the resultant reduced costs, ensuring that paths that are more likely to be feasible are added to the RMP.

**Attribute-level constraints** Existing MIP methods are also challenged by attribute-level constraints that represent complex nonlinear conditions involving several features, which cannot be abstracted efficiently into linear constraints, e.g., disallowing certain feature combinations. These constraints are easily handled within the KSP subproblem as a feasibility check while extending a partial path to the next node in $G$. A practical example of an attribute-level requirement is the need to preserve attribute hierarchy. For example, in the medical domain, doctors may perform temperature checks or blood tests before proceeding to more advanced tests. Enforcing such a hierarchy makes the decision tree more reliable and comprehensible for domain experts (Nanfack, Temple, and Frénay 2022), and can be easily achieved by appropriately arranging the nodes in the feature graph.

### Cumulative Binning on Numerical Features

For tree-based approaches, numerical input is typically handled via thresholding. For example, consider a numerical feature with values in $[0, 1]$ divided into 3 intervals, e.g., $[0, 0.33)$, $[0.33, 0.67)$ and $[0.67, 1.0]$. One can transform this numerical feature into a categorical feature with 3 values, and create 3 nodes representing this feature in the graph. This approach may be limiting, as a binary-split tree can branch on conditions such as $x \le 0.67$ or $x > 0.33$. To address this limitation, we employ cumulative binning, where intervals can overlap. We create additional nodes representing the intervals $[0, 0.67)$ and $[0.33, 1.0]$, yielding a total of 5 nodes. The coverage constraints (1) ensure that the final rule set does not contain overlapping samples. In our experiments, quantile discretization is used to create intervals. More generally, with $\kappa$ intervals, cumulative binning results in $O(\kappa^2)$ nodes, i.e., a gain in a rule's expressiveness at the cost of higher computational effort.
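Below is a minimal sketch of cumulative binning via quantile discretization; the function name and the toy data are illustrative assumptions. Each returned interval becomes one node of the numerical feature's level in the feature graph; the full-range interval is omitted since it plays the role of the SKIP node.

```python
# A minimal sketch of cumulative binning with quantile discretization.
# Every pair of cut points spans one (possibly overlapping) interval,
# giving O(kappa^2) nodes instead of kappa disjoint ones; the full range
# is omitted because it corresponds to the feature's SKIP node.
import numpy as np

def cumulative_bins(x, kappa=3):
    cuts = np.quantile(x, np.linspace(0.0, 1.0, kappa + 1))
    return [(cuts[a], cuts[b])
            for a in range(kappa) for b in range(a + 1, kappa + 1)
            if not (a == 0 and b == kappa)]

x = np.random.default_rng(0).uniform(0.0, 1.0, 500)
for lo, hi in cumulative_bins(x, kappa=3):
    print(f"[{lo:.2f}, {hi:.2f})")  # 5 nodes for kappa = 3, as in the text
```

On roughly uniform data this reproduces the worked example above: the three base intervals plus the two overlapping ones around cut points near 0.33 and 0.67.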
## 6 Experiments

While OMT is a general framework that can also be used in regression settings, we focus on classification tasks in our experiments, as most existing ODTs are classification trees. We group our experiments by dataset size, where small/medium datasets are loosely defined as having a few thousand samples, and a dataset is considered large if it has many thousands of samples and several hundred binary features (Dash, Gunluk, and Wei 2018). To benchmark our approach OMT, we implement the following MIP-based methods: FlowOCT (Aghaei, Gómez, and Vayanos 2021), BinOCT (Verwer and Zhang 2019) and OCT (Bertsimas and Dunn 2017). Although CART cannot produce constrained decision rules, we still include it as a baseline. As the benchmarks produce binary-split trees of a given depth $d$, we construct a comparable multiway-split tree via the cardinality constraint, restricting the number of leaf nodes $l$ to at most $2^d$, i.e., $l = 2^d$, and limiting the rule length to $d$.

Figure 3: Runtime performance on a total of 240 instances (x-axis: training time in seconds; y-axis: number of instances solved; one curve each for OMT, OCT, BinOCT and FlowOCT).

CPLEX 20.1 (CPLEX 2021) was used to solve the MIP-based methods. CART was trained using scikit-learn (Pedregosa et al. 2011) with default hyper-parameters. The minimum number of samples per rule for OMT was set to 1% of the training data. The maximum value of $K$ in the KSP was set to 1000 for all instances, except in the large-dataset experiments, where it was reduced to 100 to stay within the RAM limit. We set a maximum CG iteration limit of 40 and an $\hat{L}$ limit of 10,000. All experiments were run on an Intel 8-core i7 PC with 32GB RAM. Details on the experiment setup can be found in the appendix.

### Small/Medium Datasets

We evaluate the same 12 classification datasets from the UCI repository (Dua and Graff 2017) that were used for FlowOCT (Aghaei, Gómez, and Vayanos 2021), which is considered the state-of-the-art ODT. We closely follow the experiment setup in Aghaei, Gómez, and Vayanos (2021) to construct decision trees with $d \in \{2, 3, 4, 5\}$. For each dataset, we create 5 random splits into training (50%), validation (25%), and test (25%) sets. A time limit of 20 minutes is imposed on each experiment, in contrast to the one-hour limit used in Aghaei, Gómez, and Vayanos (2021).

Table 1 reports the out-of-sample accuracy averaged over the five splits for $d \in \{4, 5\}$ (complete results across all depths can be found in the appendix). The best accuracy in each row is reported in bold. Among the 48 (dataset, depth) combinations, OMT dominates the other methods in 56.3% of them, FlowOCT in 33.3%, OCT in 29.2%, BinOCT in 8.3%, and CART in 14.6% (including ties). These results demonstrate that our OMT method achieves competitive and often superior results compared to MIP-based binary-split ODT models.

Next, we analyze the solution time taken by the different MIP-based methods. We define an instance as a unique (dataset, depth, data split) combination. With 5 random data splits, there are $12 \times 4 \times 5 = 240$ instances in total. In Figure 3, the x-axis shows the solution time in seconds, while the y-axis shows the number of instances solved. For the MIP-based ODT benchmarks, a solution time under the time limit implies that the instance was solved to optimality or near-optimality. Figure 3 shows that OMT solves most instances well before the 20-minute time limit, followed by FlowOCT, BinOCT and OCT.
Specifically, 188 out of 240 instances were solved within 1 minute by OMT, in contrast to 137, 95 and 59 instances by FlowOCT, BinOCT and OCT, respectively; 230 out of 240 instances were solved within 10 minutes by OMT, compared to 175, 118 and 90 instances by the benchmarks. The bulk of the instances left unsolved at the time limit are the medium datasets with $d \ge 3$.

| dataset | N | d | OMT | OCT | BinOCT | FlowOCT | CART |
|---|---|---|---|---|---|---|---|
| soybean-small | 47 | 4 | 0.883 ± 0.139 | 0.944 ± 0.048 | 0.833 ± 0.22 | 0.944 ± 0.083 | **1.0 ± 0.0** |
| soybean-small | 47 | 5 | 0.95 ± 0.075 | 0.972 ± 0.096 | 0.722 ± 0.21 | 0.972 ± 0.083 | **1.0 ± 0.0** |
| monks-3 | 122 | 4 | 0.987 ± 0.008 | 0.99 ± 0.015 | 0.988 ± 0.011 | 0.99 ± 0.011 | **0.993 ± 0.007** |
| monks-3 | 122 | 5 | 0.987 ± 0.008 | 0.978 ± 0.014 | 0.983 ± 0.015 | 0.99 ± 0.012 | **0.993 ± 0.007** |
| monks-1 | 124 | 4 | **1.0 ± 0.0** | **1.0 ± 0.029** | **1.0 ± 0.0** | **1.0 ± 0.03** | 0.806 ± 0.064 |
| monks-1 | 124 | 5 | **1.0 ± 0.0** | 0.935 ± 0.142 | **1.0 ± 0.0** | **1.0 ± 0.03** | 0.787 ± 0.042 |
| hayes-roth | 132 | 4 | 0.75 ± 0.075 | 0.75 ± 0.038 | 0.642 ± 0.138 | **0.8 ± 0.066** | 0.55 ± 0.087 |
| hayes-roth | 132 | 5 | 0.77 ± 0.089 | 0.75 ± 0.076 | 0.575 ± 0.066 | **0.817 ± 0.029** | 0.708 ± 0.058 |
| monks-2 | 169 | 4 | 0.603 ± 0.045 | **0.662 ± 0.031** | 0.581 ± 0.027 | **0.662 ± 0.023** | 0.598 ± 0.033 |
| monks-2 | 169 | 5 | **0.779 ± 0.03** | 0.662 ± 0.05 | 0.607 ± 0.03 | 0.662 ± 0.055 | 0.651 ± 0.055 |
| house-votes-84 | 232 | 4 | 0.962 ± 0.046 | **0.971 ± 0.01** | 0.914 ± 0.03 | **0.971 ± 0.01** | 0.96 ± 0.02 |
| house-votes-84 | 232 | 5 | 0.955 ± 0.026 | **0.971 ± 0.01** | 0.96 ± 0.01 | **0.971 ± 0.01** | 0.96 ± 0.02 |
| spect | 267 | 4 | **0.834 ± 0.04** | 0.801 ± 0.086 | 0.746 ± 0.065 | 0.791 ± 0.079 | 0.731 ± 0.026 |
| spect | 267 | 5 | **0.803 ± 0.039** | 0.791 ± 0.06 | 0.721 ± 0.048 | 0.796 ± 0.085 | 0.731 ± 0.03 |
| breast-cancer | 277 | 4 | 0.74 ± 0.046 | 0.724 ± 0.038 | 0.662 ± 0.091 | **0.743 ± 0.079** | 0.69 ± 0.058 |
| breast-cancer | 277 | 5 | 0.669 ± 0.094 | **0.738 ± 0.0** | 0.567 ± 0.157 | 0.676 ± 0.036 | 0.671 ± 0.049 |
| balance-scale | 625 | 4 | **0.781 ± 0.016** | 0.747 ± 0.048 | 0.707 ± 0.006 | 0.699 ± 0.01 | 0.769 ± 0.033 |
| balance-scale | 625 | 5 | **0.772 ± 0.012** | 0.735 ± 0.01 | 0.565 ± 0.1 | 0.72 ± 0.045 | 0.762 ± 0.013 |
| tic-tac-toe | 958 | 4 | **0.802 ± 0.039** | 0.776 ± 0.073 | 0.786 ± 0.021 | 0.757 ± 0.032 | 0.758 ± 0.019 |
| tic-tac-toe | 958 | 5 | **0.82 ± 0.061** | 0.711 ± 0.025 | 0.812 ± 0.029 | 0.788 ± 0.05 | 0.778 ± 0.046 |
| car-evaluation | 1728 | 4 | **0.879 ± 0.009** | 0.796 ± 0.076 | 0.848 ± 0.01 | 0.823 ± 0.016 | 0.842 ± 0.029 |
| car-evaluation | 1728 | 5 | **0.91 ± 0.005** | 0.742 ± 0.041 | 0.815 ± 0.052 | 0.8 ± 0.016 | 0.857 ± 0.019 |
| kr-vs-kp | 3196 | 4 | **0.96 ± 0.006** | 0.847 ± 0.094 | 0.938 ± 0.012 | 0.94 ± 0.011 | 0.94 ± 0.011 |
| kr-vs-kp | 3196 | 5 | **0.968 ± 0.003** | 0.652 ± 0.098 | 0.847 ± 0.164 | 0.946 ± 0.057 | 0.94 ± 0.011 |

Table 1: Mean ± standard deviation of out-of-sample accuracy on the small/medium datasets.

Table 2 reports the solution times of the different MIP methods with respect to tree depth, with the solution time of unsolved instances capped at 20 minutes. Due to the skewed distributions caused by large discrepancies in data size and feature size, we report median values. As CART finishes all instances in seconds, we omit its results from the table; note, however, that CART's solutions may be infeasible, since it ignores the constraints considered in this work. Table 2 shows that the state-of-the-art MIP-based benchmark, FlowOCT, takes the least time per instance among all methods, including OMT, at $d = 2$. However, it experiences a nearly 20X increase in runtime as $d$ increases from 2 to 3, which grows by another 4.5X at $d = 4$. We believe this is an understatement, as the solution times of many unsolved instances at higher depths are capped at 20 minutes. A further increase in runtime is observed for OCT and BinOCT, whose numbers of binary variables increase exponentially with $d$. In contrast, for OMT, the runtime increases are merely 1.4X, 0.68X and 0.4X as $d$ increases from 2 to 5. Overall, our speedup over FlowOCT at $d = 4$ and $5$ is 10.3X and 24.6X, respectively. The exact OMT speedup is likely to be even higher if we compare the runtime needed to achieve the same solution quality.
The strong CG performance can also be explained by examining the gap between the objective value achieved by the Master-MIP ($\nu_{IP}$) and the final RMP ($\nu_{LP}$), i.e., $\Delta = (\nu_{IP} - \nu_{LP})/\nu_{IP}$. A table reporting this MIP-LP gap with respect to $d$ is included in the appendix. In particular, the median gap is no more than 0.3% across all depths, suggesting that the path-based LP relaxation is a relatively strong approximation of the nonconvex discrete OMT model. This tight gap also underscores the advantage of converging to a relatively small subset of high-quality paths from which an effective decision tree can be distilled using a standard MIP solver.

| d | OMT | FlowOCT | BinOCT | OCT |
|---|---|---|---|---|
| 2 | 3.9 | 1.9 | 6.1 | 33.9 |
| 3 | 9.3 | 32.5 | 727.8 | 1200.4 |
| 4 | 15.6 | 177.0 | 1201.9 | 1202.5 |
| 5 | 21.9 | 561.4 | 1203.5 | 1205.9 |

Table 2: Median runtime (seconds) for the MIP-based methods.

### Large Datasets

We test our method on the six largest datasets analyzed in the MIP-based ODT literature (Zhu et al. 2020). We follow the same experiment setup described earlier but limit the runtime to one hour to account for the larger data size. As in Zhu et al. (2020), we focus on $d \in \{2, 3\}$. Table 3 summarizes the average out-of-sample accuracy of the different MIP approaches; the left entries in the CART column show the CART results that we obtained.

| dataset | N | k | d | OMT | OCT | BinOCT | FlowOCT | CART | S1O |
|---|---|---|---|---|---|---|---|---|---|
| pendigits | 7494 | 16 | 2 | 0.395 | 0.348 | 0.291 | 0.329 | 0.352/0.362 | 0.389 |
| pendigits | 7494 | 16 | 3 | 0.685 | 0.517 | 0.347 | 0.459 | 0.558/0.579 | 0.625 |
| avila | 10430 | 10 | 2 | 0.475 | 0.461 | 0.096 | 0.509 | 0.501/0.503 | 0.526 |
| avila | 10430 | 10 | 3 | 0.524 | 0.417 | 0.403 | 0.504 | 0.527/0.535 | 0.558 |
| EEG | 14980 | 14 | 2 | 0.658 | 0.602 | 0.584 | 0.648 | 0.624/0.586 | 0.665 |
| EEG | 14980 | 14 | 3 | 0.690 | 0.572 | 0.493 | 0.649 | 0.659/0.642 | 0.665 |
| HTRU | 17898 | 8 | 2 | 0.973 | 0.977 | 0.705 | 0.956 | 0.978/0.973 | 0.978 |
| HTRU | 17898 | 8 | 3 | 0.977 | 0.978 | 0.552 | 0.973 | 0.979/0.981 | 0.979 |
| shuttle | 43500 | 9 | 2 | 0.968 | 0.821 | 0.285 | 0.920 | 0.939/0.938 | 0.940 |
| shuttle | 43500 | 9 | 3 | 0.984 | 0.793 | 0.390 | 0.914 | 0.996/0.997 | 0.995 |
| skin | 245057 | 3 | 2 | 0.875 | 0.899 | 0.774 | 0.802 | 0.907/0.806 | 0.863 |
| skin | 245057 | 3 | 3 | 0.967 | 0.793 | 0.855 | 0.801 | 0.965/0.871 | 0.949 |

Table 3: Out-of-sample accuracy on the large datasets from Zhu et al. (2020). OMT, OCT, BinOCT, FlowOCT and CART are univariate methods; S1O is multivariate. Entries in the CART column are our result / the result reported by Zhu et al. (2020).

None of the arc-based MIP methods obtain an optimal solution on any of the 12 large instances at $d = 3$ within the time limit, with FlowOCT winning in one instance and no wins for OCT and BinOCT. On the other hand, OMT dominates in 6 out of 12 cases, and CART wins 5 out of 12. While the quality improvement over CART is not dramatic, the latter is unable to handle constraints, which is a key practical feature of our approach. Meanwhile, we improve upon the average misclassification error achieved by FlowOCT by 8.4%.

Table 3 also includes a column S1O, taken from Zhu et al. (2020). For ease of comparison, we include their CART values as the right entries in the CART column of Table 3. We want to point out that the comparison is not entirely fair. Firstly, OMT is a univariate tree with axis-parallel splits, whereas S1O is a multivariate (oblique) tree, in which splits at a node may use multiple variables, or hyperplanes. Such multivariate splits tend to be much stronger than univariate splits (Bertsimas and Dunn 2017), at the expense of losing some interpretability. Furthermore, the time limit for S1O reported in Zhu et al. (2020) was four hours, while we limit OMT to one hour (the average OMT runtime per instance is 6.8 and 13.9 minutes at $d = 2$ and 3, respectively). Finally, instead of running on the raw data as OMT does, S1O employs an LP-based data-selection preprocessing step.
Nevertheless, we benchmark OMT against S1O (and their CART results) in Table 3. Of the 12 cases, OMT still wins 6, while S1O and CART win 5 and 2 cases, respectively. These results highlight our method's ability to produce good-quality results in a relatively short training time, without sacrificing interpretability.

Lastly, we stress-test our method on challenging datasets in the UCI repository that are an order of magnitude larger than those reported for prior optimal ODT methods. We increase the degree of difficulty in two dimensions: more samples (up to one million) and more raw features (up to $k = 175$). For such large datasets, we found a negligible change in solution quality across random seeds, so we solved each instance once using a fixed random seed and a six-hour time limit; the results are reported in Table 4. As none of the prior optimal ODTs could process such large data, we compare the achieved solution quality to CART. Note, via Propositions 1 and 2, that setting $d = 2$ for the crop-mapping dataset ($\eta = 4$) yields $\hat{L} \le 16 \cdot 175^2$, versus $L = 4^{175}$. We improve upon CART's test accuracy in 6 of the 8 instances, with an average runtime of 3.7 hours per large instance.

| dataset | N | k | d | OMT | CART |
|---|---|---|---|---|---|
| MiniBooNE | 130065 | 50 | 2 | 0.850 | 0.835 |
| MiniBooNE | 130065 | 50 | 3 | 0.873 | 0.865 |
| crop-mapping | 325834 | 175 | 2 | 0.769 | 0.679 |
| crop-mapping | 325834 | 175 | 3 | 0.841 | 0.797 |
| covertype | 581012 | 54 | 2 | 0.667 | 0.668 |
| covertype | 581012 | 54 | 3 | 0.684 | 0.677 |
| susy | 1008372 | 18 | 2 | 0.744 | 0.748 |
| susy | 1008372 | 18 | 3 | 0.760 | 0.755 |

Table 4: Out-of-sample accuracy on additional large datasets.

## 7 Conclusion

In this paper, we propose a scalable and flexible mixed-integer optimization framework based on column generation (CG) to identify optimal multiway-split decision trees with constraints. We present a novel path-based MIP formulation for decision trees, in which the number of columns, or binary variables, is independent of the training data size. Our method can generate both classification and regression trees by minimizing any of the commonly used nonlinear error metrics while still solving a linear MIP model. A first, plain-vanilla CG implementation is tested on several public datasets ranging up to a million data samples. We achieve performance that is comparable to or better than that of the state-of-the-art optimal classification tree methods in the literature, while consuming only a small fraction of their runtime. Our CG approach is also able to seamlessly handle a wide variety of practical constraints that cannot be efficiently managed by prior ODT models or by popular heuristics like CART. The current framework has to discretize numerical features; one way to handle numerical features without discretization is to employ a preprocessing step as in BinOCT (Verwer and Zhang 2019), which we leave as future work. For deeper trees (e.g., $d \ge 10$), the computational benefit of Proposition 2 diminishes, and a more sophisticated CG implementation may be required.

## References

Aghaei, S.; Azizi, M. J.; and Vayanos, P. 2019. Learning optimal and fair decision trees for non-discriminative decision-making. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 1418–1426.

Aghaei, S.; Gómez, A.; and Vayanos, P. 2021. Strong optimal classification trees. arXiv preprint arXiv:2103.15965.

Aglin, G.; Nijssen, S.; and Schaus, P. 2020. Learning optimal decision trees using caching branch-and-bound search.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 3146–3153.

Atamtürk, A.; Nemhauser, G. L.; and Savelsbergh, M. W. 1996. A combined Lagrangian, linear programming, and implication heuristic for large-scale set partitioning problems. Journal of Heuristics, 1(2): 247–259.

Barnhart, C.; Johnson, E. L.; Nemhauser, G. L.; Savelsbergh, M. W.; and Vance, P. H. 1998. Branch-and-price: Column generation for solving huge integer programs. Operations Research, 46(3): 316–329.

Bertsimas, D.; and Dunn, J. 2017. Optimal classification trees. Machine Learning, 106(7): 1039–1082.

Biggs, M.; Sun, W.; and Ettl, M. 2021. Model Distillation for Revenue Optimization: Interpretable Personalized Pricing. In International Conference on Machine Learning, 946–956. PMLR.

Breiman, L.; Friedman, J.; Stone, C. J.; and Olshen, R. A. 1984. Classification and Regression Trees. CRC Press.

Bront, J. J. M.; Méndez-Díaz, I.; and Vulcano, G. 2009. A column generation algorithm for choice-based network revenue management. Operations Research, 57(3): 769–784.

Chen, Z.-L.; and Xu, H. 2006. Dynamic column generation for dynamic vehicle routing with time windows. Transportation Science, 40(1): 74–88.

CPLEX, I. I. 2021. Introducing IBM ILOG CPLEX Optimization Studio 20.1.0.

Dash, S.; Gunluk, O.; and Wei, D. 2018. Boolean decision rules via column generation. Advances in Neural Information Processing Systems, 31.

Demirovic, E.; and Stuckey, P. J. 2021. Optimal decision trees for nonlinear metrics. In Thirty-Fifth AAAI Conference on Artificial Intelligence.

Desaulniers, G.; Desrosiers, J.; and Solomon, M. M. 2006. Column Generation, volume 5. Springer Science & Business Media.

Desrosiers, J.; and Lübbecke, M. E. 2005. A primer in column generation. In Column Generation, 1–32. Springer.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

Fulton, T.; Kasif, S.; and Salzberg, S. 1995. Efficient algorithms for finding multi-way splits for decision trees. In Machine Learning Proceedings 1995, 244–251. Elsevier.

Günlük, O.; Kalagnanam, J.; Li, M.; Menickelly, M.; and Scheinberg, K. 2021. Optimal decision trees for categorical data via integer programming. Journal of Global Optimization, 81(1): 233–260.

Hartmanis, J. 1982. Computers and Intractability: A Guide to the Theory of NP-Completeness (Michael R. Garey and David S. Johnson). SIAM Review, 24(1): 90.

Horne, G. 1980. Finding the K least cost paths in an acyclic activity network. Journal of the Operational Research Society, 31(5): 443–448.

Laurent, H.; and Rivest, R. L. 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1): 15–17.

Lin, J.; Zhong, C.; Hu, D.; Rudin, C.; and Seltzer, M. 2020. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, 6150–6160. PMLR.

Lomax, S.; and Vadera, S. 2013. A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys (CSUR), 45(2): 1–35.

Lübbecke, M. E.; and Desrosiers, J. 2005. Selected topics in column generation. Operations Research, 53(6): 1007–1023.

Nanfack, G.; Temple, P.; and Frénay, B. 2022. Constraint Enforcement on Decision Trees: A Survey. ACM Computing Surveys (CSUR).

Narodytska, N.; Ignatiev, A.; Pereira, F.; and Marques-Silva, J. 2018. Learning Optimal Decision Trees with SAT. In IJCAI, 1362–1368.

Núñez, M. 1991. The use of background knowledge in decision tree induction. Machine Learning, 6(3): 231–250.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: 2825–2830.

Quinlan, J. R. 1986. Induction of decision trees. Machine Learning, 1(1): 81–106.

Quinlan, J. R. 2014. C4.5: Programs for Machine Learning. Elsevier.

Subramanian, S.; and Sherali, H. D. 2008. An effective deflected subgradient optimization scheme for implementing column generation for large-scale airline crew scheduling problems. INFORMS Journal on Computing, 20(4): 565–578.

Subramanian, S.; Sun, W.; Drissi, Y.; and Ettl, M. 2022. Constrained Prescriptive Trees via Column Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 4602–4610.

Verhaeghe, H.; Nijssen, S.; Pesant, G.; Quimper, C.-G.; and Schaus, P. 2020. Learning optimal decision trees using constraint programming. Constraints, 25(3): 226–250.

Verwer, S.; and Zhang, Y. 2019. Learning optimal classification trees using a binary linear program formulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 1625–1632.

Xu, Y. 2019. Solving Large Scale Optimization Problems in the Transportation Industry and Beyond Through Column Generation. In Optimization in Large Scale Problems, 269–292. Springer.

Zhu, H.; Murali, P.; Phan, D.; Nguyen, L.; and Kalagnanam, J. 2020. A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees. In Advances in Neural Information Processing Systems, volume 33, 1771–1781.