Parameterized Complexity of Small Decision Tree Learning

Sebastian Ordyniak (University of Leeds, School of Computing, Leeds, UK) and Stefan Szeider (Algorithms and Complexity Group, TU Wien, Vienna, Austria)
sordyniak@gmail.com, stefan@szeider.net

Abstract. We study the NP-hard problem of learning a decision tree (DT) of smallest depth or size from data. We provide the first parameterized complexity analysis of the problem and draw a detailed parameterized complexity map for the natural parameters: size or depth of the DT, maximum domain size of all features, and the maximum Hamming distance between any two examples. Our main result shows that learning DTs of smallest depth or size is fixed-parameter tractable (FPT) parameterized by the combination of all three of these parameters. We contrast this FPT result with various hardness results that underline the algorithmic significance of the considered parameters.

Introduction

Decision trees have proved to be extremely useful tools for the description, classification, and generalization of data (Larose 2005; Murthy 1998; Quinlan 1986). Because of their simplicity, decision trees are particularly attractive for providing interpretable models of the underlying data, an aspect whose importance has been strongly emphasized over recent years (Darwiche and Hirth 2020; Doshi-Velez and Kim 2017; Goodman and Flaxman 2017; Lipton 2018; Monroe 2018). In this context, one prefers small trees (trees of small size or small depth), as they are easier to interpret and require fewer tests to make a classification. Small trees are also preferred in view of the parsimony principle (Occam's Razor), since small trees are expected to generalize better to new data (Bessiere, Hebrard, and O'Sullivan 2009). However, learning small trees is computationally costly: it is NP-hard to decide whether a given data set can be represented by a decision tree of a certain size or depth (Hyafil and Rivest 1976). In view of this complexity barrier, several constraint-based and SAT-based methods have been proposed for learning small decision trees (Bessiere, Hebrard, and O'Sullivan 2009; Narodytska et al. 2018; Avellaneda 2020; Schidler and Szeider 2021).

In this paper, we investigate the problem of finding small decision trees (w.r.t. size or depth) under the framework of Parameterized Complexity (Downey and Fellows 2013; Gottlob, Scarcello, and Sideri 2002; Niedermeier 2006). This framework allows us to achieve a more fine-grained and qualitative analysis, revealing properties of the input data in terms of problem parameters that provide runtime guarantees for decision tree learning algorithms. The key notion of Parameterized Complexity is fixed-parameter tractability (FPT), which generalizes classical polynomial-time tractability by allowing the running time to be exponential in a function of the problem parameters while remaining polynomial in the input size (we provide more detailed definitions in the Preliminaries section). Fixed-parameter tractability captures the scalability of algorithms to large inputs as long as the problem parameters remain small. Several key problems that arise in AI have been studied in terms of their fixed-parameter tractability, including Planning (Bäckström et al. 2012), SAT and CSP (Bessière et al. 2008; Gaspers et al. 2014), Computational Social Choice (Bredereck et al.
2017), Machine Learning (Ganian et al. 2018), and Argumentation (Dvořák, Ordyniak, and Szeider 2012).

For decision tree learning, we consider parameterizations of the following two fundamental NP-hard problems. MINIMUM DECISION TREE SIZE (DTS): we are given a set of examples, each labeled positive or negative and defined over a set of features, where each feature f ranges over a linearly ordered set of possible values, together with an integer s (for size). The task is to decide whether there exists a decision tree with at most s tests or to report that no such tree exists; each test determines whether the value of a certain feature is below a certain threshold or not. MINIMUM DECISION TREE DEPTH (DTD) is defined similarly, where instead of the bound s on the total number of tests, a bound d (for depth) on the number of tests on any root-to-leaf path is provided. For technical reasons, we define the problems as decision problems; however, all our algorithms are constructive in the sense that if they decide DTS or DTD affirmatively, then they can output a witnessing decision tree; otherwise, they report that a decision tree of the required size or depth does not exist.

For both problems, it is natural to include the solution size (i.e., s for DTS and d for DTD, respectively) as a parameter, since our objective is to learn decision trees where these values are small. Another natural parameter is the number of different values a feature ranges over; we refer to this number as the maximum domain size (Dmax). It is reasonable to assume this number to be small, as many features are even Boolean or have a small range (e.g., temperature in a city). As a first result (Theorem 2), we show that taking solution size and maximum domain size as parameters alone does not yield fixed-parameter tractability, even if all the features are Boolean. The hardness follows by a reduction from the HITTING SET (HS) problem. Hence one needs an additional parameter that restricts the input. Following the principle of deconstruction of intractability (Komusiewicz, Niedermeier, and Uhlmann 2011; Fellows, Jansen, and Rosamond 2013), we observe that when the hitting set instance consists of sets of bounded size, then HS becomes fixed-parameter tractable. Bounded set size for HS corresponds to a bound on the number of features in which two examples with a different classification can disagree. We call this number the maximum difference δmax of the classification instance. What makes this parameter even more attractive is that it is small for standard benchmark instances (see Table 1). Our main positive result (Theorem 8) establishes that, indeed, adding the maximum difference as an additional parameter to solution size and maximum domain size renders both problems, DTS and DTD, fixed-parameter tractable.

Name            |E|     |F|    δmax
Append-tis       106     531     14
Australian       690    1164     24
Backache         180     476     31
Car             1728      22     12
Cancer           683      90     18
Colic            368     416     43
Cleve            303     396     23
Haberman         300      93      6
Heart-statlog    270     382     23
Hepatitis        155     362     34
House Votes      435      17     16
Hungarian        294     331     24
New-thyroid      215     335     10
Promoters        106     335    106
Shuttle        14500     692     18
Spect            250      23     22

Table 1: Boolean classification instances from Narodytska et al. (2018) with their number of examples |E|, number of features |F|, and their maximum difference δmax. The instances originate from the UCI Machine Learning Repository (Dua and Graff 2017).
The algorithm consists of two main parts, namely, identifying a small set of features used by an optimal decision tree (feature selection) and finding an optimal decision tree that uses only a small set of features. Both of these steps are nontrivial and provide interesting insights for our understanding of decision trees in general. For instance, the feature selection part shows that only a small subset of all features needs to be considered to find an optimal decision tree. Similarly, the second part of our algorithm shows that the same holds true for the set of thresholds; while the number of thresholds in the input instance can grow with the input size, only a small subset of them needs to be considered to find an optimal decision tree.

The diagram in Figure 1 visualizes the parameterized complexity landscape spanned by the three parameters under consideration (solution size, maximum domain size, and maximum difference). The diagram shows that, without the maximum difference included in the parameterization, neither of the problems is fixed-parameter tractable. Those parameterizations that do not yield fixed-parameter tractability can be split into two categories: one where the problems DTS and DTD are already NP-hard when the parameters are fixed to a constant (para NP-hard, in terms of Parameterized Complexity), and those where the problems are solvable in polynomial time when the parameters are fixed to a constant (XP-tractable, in terms of Parameterized Complexity).

[Figure 1: Diagram of the parameterized complexity of DTS and DTD when parameterized by subsets of {s, δmax, Dmax} and {d, δmax, Dmax}, respectively, where sol = s for DTS and sol = d for DTD. The stated parameterized complexity results apply to all combinations of parameters that are in the same area indicated by the same color.]

Preliminaries

Classification problems

An example e is a function e : feat(e) → D defined on a finite set feat(e) of features and a possibly infinite linearly ordered domain D ⊆ Z. For a set E of examples, we put feat(E) = ⋃_{e∈E} feat(e). We say that two examples e1, e2 agree on a feature f if f ∈ feat(e1), f ∈ feat(e2), and e1(f) = e2(f). If f ∈ feat(e1), f ∈ feat(e2), but e1(f) ≠ e2(f), we say that the examples disagree on f. A classification instance (CI) (or just instance) E = E⁺ ∪ E⁻ is the disjoint union of two sets of examples, where for all e1, e2 ∈ E we have feat(e1) = feat(e2). The examples in E⁺ are called positive, the examples in E⁻ are called negative. A set X of examples is uniform if X ⊆ E⁺ or X ⊆ E⁻; otherwise X is non-uniform. Given a CI E, a subset F ⊆ feat(E) is a support set of E if any two examples e1 ∈ E⁺ and e2 ∈ E⁻ disagree on at least one feature of F. Finding a support set of minimal size, denoted by MSS(E), for a classification instance E is an NP-hard task (Ibaraki, Crama, and Hammer 2011, Theorem 12.2). For two examples e and e′ in E, we denote by δ(e, e′) the set of features on which e and e′ disagree, and we denote by δmax(E) = max_{e⁺∈E⁺, e⁻∈E⁻} |δ(e⁺, e⁻)| the maximum difference between any non-uniform pair of examples. For a feature f ∈ feat(E), we denote by DE(f) the set of domain values for f appearing in any example of E, i.e., DE(f) = { e(f) | e ∈ E }, and we set Dmax to be the maximum size of DE(f) over all features of E, i.e., Dmax = max_{f∈feat(E)} |DE(f)|.
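To make the parameters δmax(E) and Dmax concrete, the following is a minimal illustrative sketch (not from the paper; the dictionary-based representation and the toy data are assumptions for illustration only) that computes both quantities for a small classification instance.

```python
from itertools import product

# A toy classification instance: each example maps feature names to integer values.
E_pos = [{"temp": 68, "rain": 0, "time": 60, "day": 2},
         {"temp": 60, "rain": 0, "time": 60, "day": 5}]
E_neg = [{"temp": 37, "rain": 1, "time": 10, "day": 3},
         {"temp": 53, "rain": 0, "time": 15, "day": 2}]
E = E_pos + E_neg

def delta(e1, e2):
    """The set delta(e1, e2) of features on which the two examples disagree."""
    return {f for f in e1 if e1[f] != e2[f]}

# Maximum difference: largest disagreement between a positive and a negative example.
delta_max = max(len(delta(ep, en)) for ep, en in product(E_pos, E_neg))

# Maximum domain size: largest number of distinct values any single feature takes in E.
D_max = max(len({e[f] for e in E}) for f in E[0])

print(delta_max, D_max)  # 4 and 4 for this toy instance
```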
If Dmax = 2 the CI is called Boolean or a partially defined Boolean function (Ibaraki, Crama, and Hammer 2011). We denote by E[α] the set of examples in E that agree with the assignment α : F → D, where F ⊆ feat(E), i.e., E[α] = { e ∈ E | e(f) = α(f) for all f ∈ F }. For the complexity analysis we set the input size ‖E‖ of a CI E to |E| · (|feat(E)| + 1) · log Dmax.

Decision trees

A decision tree (DT) (or classification tree) is a rooted tree T with vertex set V(T) and arc set A(T), where each non-leaf node (called a test) v ∈ V(T) is labeled with a feature feat(v) and an integer threshold λ(v), each non-leaf node v has exactly two outgoing arcs, a left arc and a right arc, and each leaf is either a positive or a negative leaf. We write feat(T) = { feat(v) | v ∈ V(T) }. Consider a CI E and a decision tree T with feat(T) ⊆ feat(E). For each node v of T we define ET(v) as the set of all examples e ∈ E such that for each left (right, respectively) arc (u, w) on the unique path from the root of T to v we have e(feat(u)) ≤ λ(u) (e(feat(u)) > λ(u), respectively). T correctly classifies an example e ∈ E if e is a positive (negative) example and e ∈ ET(v) for a positive (negative) leaf v (a short programmatic sketch of this classification procedure is given at the end of this Preliminaries section). We say that T classifies E (or simply that T is a DT for E) if T correctly classifies every example e ∈ E. An example of a CI E and a possible DT for E are illustrated in Figure 2. The size of T, denoted by |T|, is the number of non-leaf nodes (tests) in T, and the depth of T, denoted by dep(T), is the maximum number of non-leaf nodes (tests) on any root-to-leaf path of T. DTS (and DTD) are now the problems of deciding whether, for a given CI E and an integer s (d), there is a DT for E of size at most s (depth at most d).

temp  rain  time  day  class
 37    1     10    3    E⁻
 68    0     60    2    E⁺
 53    1     60    4    E⁻
 53    0     15    2    E⁻
 60    0     60    5    E⁺
 51    0     40    3    E⁺
 71    0     35    5    E⁺

[Figure 2: A CI with seven examples over four features (temp, rain, time, day), classifying whether one walks to work or not, and a DT for it (the DT itself is not reproduced here).]

Observation 1. Let T be a DT for a CI E; then feat(T) is a support set of E.

Proof. Suppose for a contradiction that this is not the case and there is an example e⁺ ∈ E⁺ and an example e⁻ ∈ E⁻ such that e⁺ and e⁻ agree on all features in feat(T). Then e⁺ and e⁻ are contained in the same leaf node of T, contradicting our assumption that T is a DT.

Parameterized Complexity

We give some basic definitions of Parameterized Complexity and refer for a more in-depth treatment to other sources (Downey and Fellows 2013). Parameterized complexity considers problems in a two-dimensional setting, where a problem instance is a pair (I, k), where I is the main part and k is the parameter (k can be composed of several parameters, e.g., k = k1 + k2). A parameterized problem is fixed-parameter tractable (FPT-tractable) if there exists a computable function g such that instances (I, k) can be solved in time g(k) · ‖I‖^O(1). A parameterized problem is XP-tractable if instances (I, k) can be solved in time ‖I‖^g(k). There are different shades of parameterized hardness. A problem is para NP-hard if fixing the parameter to a constant gives an NP-hard problem. Many parameterized problems that are XP-tractable but not fixed-parameter tractable are contained in the complexity classes of the Weft hierarchy, W[1] ⊆ W[2] ⊆ ⋯. Hardness for any of these classes provides strong theoretical evidence that a problem is not fixed-parameter tractable. For DTS and DTD we consider mainly the three parameters solution size (s for DTS and d for DTD), maximum domain size Dmax, and maximum difference δmax.
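As referenced above, here is a minimal illustrative sketch (not part of the paper; the Leaf/Test classes and the toy tree are assumptions for illustration) of the classification semantics defined earlier in this section: a test node routes an example to its left child when the example's value for the tested feature is at most the threshold, and to its right child otherwise, until a positive or negative leaf is reached.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    positive: bool                # True for a positive leaf, False for a negative leaf

@dataclass
class Test:
    feature: str                  # feat(v)
    threshold: int                # lambda(v)
    left: Union["Test", "Leaf"]   # followed when e(feature) <= threshold
    right: Union["Test", "Leaf"]  # followed when e(feature) >  threshold

def classify(tree, example):
    """Return True iff the DT routes the example to a positive leaf."""
    node = tree
    while isinstance(node, Test):
        node = node.left if example[node.feature] <= node.threshold else node.right
    return node.positive

# A single test on the feature "time" with threshold 30 (illustration only).
tree = Test("time", 30, Leaf(False), Leaf(True))
print(classify(tree, {"time": 60}))  # True: 60 > 30 routes to the right (positive) leaf
print(classify(tree, {"time": 10}))  # False: 10 <= 30 routes to the left (negative) leaf
```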
Hardness Results

In this section we show our hardness results for DTS and DTD; that is, we show that even when restricted to Boolean instances, both problems are W[2]-hard parameterized by s or d, and that they are para NP-hard parameterized by δmax(E). Underlying our hardness results is the following (polynomial-time) reduction from the well-known HITTING SET problem, which, given a family F of sets over some universe U and an integer k, asks whether F has a hitting set of size at most k, i.e., a subset H of U of size at most k such that F ∩ H ≠ ∅ for every F ∈ F. The maximum arity of a HITTING SET instance is the size of a largest set in F. For an instance I = (F, U, k) of HITTING SET, we let E(I) be the CI that has a (Boolean) feature for every element of U, one positive example p with p(u) = 0 for every u ∈ U, and one negative example n_F for every F ∈ F such that n_F(u) = 1 for every u ∈ F and n_F(u) = 0 otherwise. (A small programmatic sketch of this construction is given further below.)

Theorem 2. DTS parameterized by s and DTD parameterized by d are W[2]-hard, even for Boolean instances.

Proof. We use the polynomial-time reduction from HITTING SET defined by E(I) with I = (F, U, k) together with the fact that HITTING SET is W[2]-hard parameterized by the size k of the hitting set (Downey and Fellows 2013). Note that a set H of features of E(I) is a support set if and only if H is a hitting set for I. Therefore, I has a hitting set of size at most k if and only if E(I) has a support set of size at most k. Because E(I) contains only one positive example, every support set S gives rise to a DT whose depth and size are equal to |S|. Together with Observation 1, this implies that E(I) has a DT of depth/size at most k if and only if E(I) has a support set of size at most k, which as we showed above is the case if and only if I has a hitting set of size at most k.

The same reduction also shows that DTS and DTD are para NP-hard parameterized by δmax(E), because δmax(E(I)) equals the maximum arity of I and HITTING SET is NP-complete even for maximum arity 2.

Theorem 3. DTS and DTD are para NP-hard parameterized by δmax(E).

The reduction also shows para NP-hardness for other parameters such as min#(E) = min{|E⁺|, |E⁻|} and the number of branching nodes of the resulting DT; a node is a branching node if none of its children is a leaf. This is because min#(E(I)) = 1 and therefore it suffices to consider only DTs for E(I) with 0 branching nodes.

Algorithmic Results

In this section, we will provide our algorithmic results for DTS and DTD. Namely, we will show that both problems are fixed-parameter tractable parameterized by the solution size (s or d for DTS and DTD, respectively), δmax, and Dmax. To complete the complexity map given in Figure 1, we will also show that both problems can be solved in polynomial time when searching for a DT of constant size or depth. To simplify the presentation, we provide the result only for DTS and defer the proof for DTD to the supplementary material. The algorithm for our main result, i.e., the FPT result, consists of two main steps. In the first step, which we call feature selection and which is provided in the Feature Selection subsection below, we show that one can efficiently enumerate all sets of features such that every optimal DT only uses the features from one of these sets. We then show that, given such a set of features, which is small since it can have size at most s, it is possible to find an optimal DT that uses only those features. We start with the second step of the algorithm.
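As an aside, the following minimal sketch illustrates the HITTING SET construction E(I) referenced in the Hardness Results section above; the dictionary-based representation and the function name are illustrative assumptions, not from the paper.

```python
def hitting_set_to_ci(universe, family):
    """Construct the CI E(I): one Boolean feature per universe element, a single
    all-zero positive example, and one negative example per set F in the family
    that is 1 exactly on the elements of F."""
    positive = [{u: 0 for u in universe}]
    negative = [{u: (1 if u in F else 0) for u in universe} for F in family]
    return positive, negative

# Example: universe {a, b, c} with sets {a, b} and {b, c}.
E_pos, E_neg = hitting_set_to_ci({"a", "b", "c"}, [{"a", "b"}, {"b", "c"}])
# A support set of E(I) -- a set of features on which the positive example
# disagrees with every negative example -- is exactly a hitting set, e.g. {"b"}.
```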
An Algorithm for Instances with Few Features

This subsection is devoted to a proof of the following theorem, showing that one can efficiently compute a decision tree T of minimum size that uses exactly the features of a given support set S for a CI E. Note that the theorem actually shows that finding such an optimal decision tree is FPT parameterized by sol alone. This is not strictly required to show our main result (Theorem 8); however, we find the result surprising and interesting in its own right, and believe that it provides a first step towards an FPT result solely for sol + δmax(E).

Theorem 4. Let E be a CI, S ⊆ feat(E) be a support set for E, and let s be an integer. Then, there is an algorithm that runs in time 2^O(s²) · ‖E‖^(1+o(1)) · log ‖E‖ and computes a DT of minimum size among all DTs T with feat(T) = S and size at most s if such a DT exists; otherwise nil is returned.

First note that since feat(T) = S, we have k = |S| ≤ s and therefore the size of S, i.e., the number of features that we need to consider for our DT T, is bounded in terms of our parameter s. This implies that we can enumerate both the tree underlying the DT T and the possible assignments of the nodes of T to features (because there are at most k features) by brute force; the details of this procedure can be found in Lemma 5. Unfortunately, the same is not true for obtaining the thresholds employed by the DT, since the number of possible thresholds is potentially as large as the input size. However, perhaps surprisingly, Lemma 6 shows that it is possible to find the thresholds efficiently. We start by introducing the necessary notions and definitions.

We say that a tree T is a pseudo DT if T is a rooted ordered binary tree where every non-leaf node of T has a left and a right child, i.e., T has the same structure as a decision tree. Furthermore, we say that a function α is a feature assignment for T if α assigns a feature in feat(E) to every non-leaf node v of T. Similarly, we say that a function γ is a threshold assignment for T (and α) if γ assigns a threshold in DE(α(v)) to every non-leaf node v of T. Finally, we say that a pair (T, α), where T is a pseudo DT and α is a feature assignment for T, can be extended to a DT for E if there is a threshold assignment γ for T and α such that T together with feat = α and λ = γ is a DT for E.

Our first lemma shows that we can efficiently enumerate all pseudo DTs and feature assignments if the size of the DT and the set of features that the DT is allowed to use are small. The lemma can essentially be shown by analysing the number of pseudo DTs and feature assignments.

Lemma 5. Let E be a CI, S ⊆ feat(E) be a support set for E with k = |S|, and let s be an integer with s ≥ k. Then, there is an algorithm that enumerates all pairs (T, α) such that T is a pseudo DT of size at most s and α is a feature assignment for T with α(T) = S in time O(s^s).

Our next lemma shows that we can efficiently compute a threshold assignment for a given pseudo DT and feature assignment.

Lemma 6. Let E be a CI, let T be a pseudo DT of depth d, and let α be a feature assignment for T. Then, there is an algorithm that runs in time O(2^(d²/2) · ‖E‖^(1+o(1)) · log ‖E‖) and decides whether the pair (T, α) can be extended to a DT for E.

Proof Sketch. Let r be the root of T with f = α(r), having left child cl and right child cr.
Then, (T, α) can be extended to a DT for E if and only if there is a threshold t ∈ DE(f) for f such that: (1) (Tcl, α) can be extended to a DT for E[(f ≤ t)], where Tcl is the subtree of T rooted at cl and E[(f ≤ t)] is the set of all examples e ∈ E with e(f) ≤ t, and (2) (Tcr, α) can be extended to a DT for E[(f > t)], where E[(f > t)] is the set of all examples e ∈ E with e(f) > t. Therefore, the problem can be reduced to finding the threshold t for the feature f of the root r of T. Unfortunately, the number of possible thresholds, i.e., the size of DE(f), can be large, and it is therefore not possible to try every possible threshold. However, as it turns out, this is not necessary, since the influence of the threshold on the existence of an extension is in a certain sense monotone. Namely, it holds that if (Tcl, α) can be extended to a DT for E[(f ≤ t)], then it can also be extended to a DT for E[(f ≤ t′)] for every t′ ≤ t. This is because E[(f ≤ t′)] ⊆ E[(f ≤ t)] and therefore every DT for E[(f ≤ t)] is also a DT for E[(f ≤ t′)]. An analogous property holds for the right child cr of r, i.e., if (Tcr, α) can be extended to a DT for E[(f > t)], then it can also be extended to a DT for E[(f > t′)] for every t′ ≥ t. Therefore, it suffices to find the maximum threshold t ∈ DE(f) such that (Tcl, α) can be extended to a DT for E[(f ≤ t)] and then to check (only for this maximum threshold) that also (Tcr, α) can be extended to a DT for E[(f > t)]. The big advantage of this is that we can now use binary search to find the largest threshold. This idea leads to the recursive algorithm illustrated in Algorithm 1, with the binary search sub-routine given in Algorithm 2.

Algorithm 1: Algorithm to compute the threshold assignment for a pseudo DT and a feature assignment.
Input: CI E, pseudo DT T, feature assignment α for T
Output: threshold assignment λ for T such that T together with feat = α and λ is a DT for E; if no such assignment exists, nil
 1: function FINDTH(E, T, α)
 2:   r ← root of T
 3:   if r is a leaf then
 4:     if E is not uniform then
 5:       return nil
 6:     return ()                      ▹ () is the empty assignment
 7:   f ← α(r); cl, cr ← left child and right child of r
 8:   t ← BINARYSEARCH(E, T, α, f, cl)
 9:   λr ← FINDTH(E[f > t], Tcr, α)
10:   if λr = nil then
11:     return nil
12:   λl ← FINDTH(E[f ≤ t], Tcl, α)
13:   return {r ↦ t} ∪ λl ∪ λr         ▹ assign threshold t to the root r

Algorithm 2: Sub-routine BINARYSEARCH used by Algorithm 1.
Input: CI E, pseudo DT T, feature assignment α for T, feature f of the root of T, left child cl of the root of T
Output: the largest threshold t ∈ DE(f) such that (Tcl, α) can be extended to a DT for E[f ≤ t]; if no such threshold exists, the smallest value in DE(f) minus 1
 1: function BINARYSEARCH(E, T, α, f, cl)
 2:   D ← array containing all elements of DE(f) in ascending order
 3:   L ← 0; R ← |DE(f)| − 1; b ← 0
 4:   while L ≤ R do
 5:     m ← ⌊(L + R)/2⌋
 6:     if FINDTH(E[f ≤ D[m]], Tcl, α) ≠ nil then
 7:       L ← m + 1; b ← 1
 8:     else
 9:       R ← m − 1; b ← 0
10:   if b = 1 then
11:     return D[m]
12:   return D[m − 1]                  ▹ assuming that D[−1] = D[0] − 1

The run-time of Algorithm 1 can be obtained as the number of recursive calls of FINDTH times the time required for one recursive call. The former can be bounded by O((log ‖E‖)^d), given that FINDTH calls itself at most log ‖E‖ + 2 times for a decision tree of smaller depth. Together with the fact that one recursive call needs time at most O(‖E‖ log ‖E‖), we obtain the stated run-time.

Proof Sketch of Theorem 4. We first use Lemma 5 to enumerate all pairs (T, α) such that T is a pseudo DT of size at most s and α is a feature assignment for T with α(T) = S in time O(s^s).
Then, for every such pair (T, α), we employ Lemma 6 to check whether the pair can be extended to a DT for E. Finally, we return the DT T with feat = α and λ = γ, where T is the smallest DT such that (T, α) can be extended to a DT for E using the threshold assignment γ, or we return nil if no such pair (T, α) exists. The total run-time of the algorithm is the time required by the algorithm of Lemma 5 (O(s^s)) times the time required by the algorithm of Lemma 6 (O(2^(d²/2) · ‖E‖^(1+o(1)) · log ‖E‖), where d ≤ s). Since this expression is dominated by the second term, the total run-time is O(2^(s²/2) · ‖E‖^(1+o(1)) · log ‖E‖).

Feature Selection

Because of Theorem 4, we know that the problems can be solved efficiently once we have identified the right set of features. Since there are at most O(|feat(E)|^s) possible sets of features for a DT of size at most s, we can already employ Theorem 4 to solve DTS in polynomial time if s is considered to be constant.

Theorem 7. DTS and DTD are XP-tractable parameterized by s and d, respectively.

We show next that we can perform the feature selection step significantly faster if Dmax and δmax are small, which together with Theorem 4 will establish our FPT result for DTS.

Theorem 8. DTS and DTD are fixed-parameter tractable parameterized by Dmax + δmax(E) + s and Dmax + δmax(E) + d, respectively.

We start by showing that, given a CI E and an integer k, it is possible to enumerate all (inclusion-wise) minimal support sets of size at most k for E. The result is based on a well-known result for HITTING SET.

Corollary 9. Let E be a CI and let k be an integer. Then there is an algorithm that in time O(δmax(E)^k · |E|) enumerates all (of the at most δmax(E)^k) minimal support sets of size at most k for E.

Because of Observation 1, it holds that the set of features of any DT forms a support set. One might be tempted to think that any DT of minimum size or depth uses only features contained in some inclusion-wise minimal support set, and that one could use this to find the set of features used by an optimal decision tree by enumerating all minimal support sets of a certain size using Corollary 9. Unfortunately, the following lemma shows that this is not the case.

Lemma 10. There is a CI ED such that feat(T) is not a minimal support set for any decision tree T of minimum depth. Similarly, there is a CI ES such that feat(T) is not a minimal support set for any decision tree T of minimum size.

Proof Sketch. Figure 3 shows a simple example for ED on five Boolean features a1, a2, b1, b2, and x. Note that the set S = {a1, a2, b1, b2} is the only minimal support set for ED. It is now straightforward to show that every DT for ED that uses only the features in S has depth at least 4; such a DT is illustrated in Figure 3. However, ED has a DT of depth at most 3 that additionally uses the feature x. The supplementary material contains the construction for ES.

a1  a2  b1  b2  x   class
0   0   0   0   0   T
0   1   0   0   0   F
1   0   0   0   0   F
1   1   0   0   0   T
0   0   0   0   1   T
0   0   0   1   1   F
0   0   1   0   1   F
0   0   1   1   1   T

[Figure 3: The CI ED used in Lemma 10 to show that the set of features used by a DT of minimum depth is not necessarily a minimal support set. ED has only one minimal support set, i.e., the set M = {a1, a2, b1, b2}. The DT in the centre of the original figure shows a DT of minimum depth for ED that uses only the features in M, and the DT on the right shows the DT of minimum depth for ED; the two DTs are not reproduced here.]

Therefore, to solve DTS it is not sufficient to enumerate minimal support sets. Nevertheless, because of Observation 1, we know that the feature set of every DT T contains some minimal support set, say S.
Moreover, as we will show next, if T has minimum size, then the set of features that T uses in addition to the features in S (i.e., the features in feat(T) \ S) has a special property, which can be exploited to enumerate all sets of features that can be used by any DT of minimum size or depth. Namely, we say that a set R of features is useful for a given support set S of E if for every assignment β : R → D, there is an assignment α : S → D with E[α] ≠ ∅ but E[α ∪ β] = ∅. Informally, R is not useful for S if at least one assignment of R leaves all equivalence classes of S intact, where an equivalence class contains all examples that have the same value for all features in S.

Lemma 11. Let T be a DT of minimum size for E and let S be a support set contained in feat(T) (which exists due to Observation 1). Then, the set R = feat(T) \ S is useful for S.

Proof. If R = ∅, then there is nothing to show. Hence, assume that R ≠ ∅ and suppose for a contradiction that the statement of the lemma does not hold. Then, there is an assignment β : R → D such that E[α ∪ β] ≠ ∅ for every assignment α : S → D with E[α] ≠ ∅. We will show a contradiction to our assumption that T has minimum size. To achieve this, consider the tree T′′ obtained from T as follows. (1) For every node t in T with f = feat(t) ∈ R and λ = λ(t), let t′ be the left child of t if β(f) > λ and let t′ be the right child of t otherwise. Remove the subtree rooted at t′ from T, and let T′ be the tree obtained from T after applying this step exhaustively. Note that every node t of T′ with feat(t) ∈ R has exactly one child node. (2) Let P be a path of maximum length in T′ consisting only of nodes t with feat(t) ∈ R. We obtain the tree T′|P by contracting the path P in T′, i.e., we remove all nodes on P from T′ and add an edge between the unique parent and the unique child of the two endpoints of P in T′. Then T′′ is the tree obtained from T′ after exhaustively contracting all maximal paths P in T′ that consist only of nodes t with feat(t) ∈ R. Note that every non-leaf node t in T′′ has exactly two children and moreover feat(t) ∈ S.

We now show that T′′ is also a DT for E, which, because R ≠ ∅, contradicts our assumption that T has minimum size. Suppose for a contradiction that T′′ is not a DT. Then there is a leaf l of T′′ such that ET′′(l) is not uniform, i.e., it contains at least one example e⁺ from E⁺ and at least one example e⁻ from E⁻. Let α⁺ : S → D be the unique assignment agreeing with e⁺ on S, i.e., α⁺(f) = e⁺(f) for every f ∈ S, and similarly let α⁻ : S → D be the unique assignment agreeing with e⁻ on S. Then, E[α⁺] and E[α⁻] are non-empty, E[α⁺], E[α⁻] ⊆ ET′′(l), and moreover, because S is a support set, the former contains only positive examples while the latter contains only negative examples. But then E[α⁺ ∪ β] and E[α⁻ ∪ β] are also non-empty and are contained in ET(l′) for the unique leaf l′ in T corresponding to l in T′′. Therefore, ET(l′) is not uniform, contradicting our assumption that T is a DT for E.

We say that a set R0 is a branching set for S if every useful set R for S contains at least one feature in R0. To exploit Lemma 11, we will show next that we can efficiently compute a small branching set for S. Consider an arbitrary assignment β : R → D for some useful set R for S. Then, because R is useful, there must be an assignment α : S → D with E[α] ≠ ∅ such that E[α ∪ β] = ∅. In other words, for every example e in E[α], the set R must contain at least one feature f on which e disagrees with β.
Therefore, every useful set has to contain at least one feature from the set δ(e, β), i.e., the set of features on which e disagrees with β.

Lemma 12. Let E(S) be an arbitrary set of examples that contains exactly one example from E[α] for every assignment α : S → D with E[α] ≠ ∅. Then, every useful set R for S has to contain at least one feature from the set ⋃_{e∈E(S)} δ(e, β) for every β : R → D.

Unfortunately, the set δ(e, β) for some e ∈ E(S) can be arbitrarily large in general (i.e., it cannot be bounded by our parameters). However, the following lemma shows that there is a global assignment γ : feat(E) → D such that every example in E disagrees with γ in at most 2δmax(E) features.

Lemma 13. Let E be a CI. Then, there is an assignment γ : feat(E) → D such that every example e in E disagrees with γ in at most 2δmax(E) features. Moreover, γ can be computed in polynomial time.

Proof Sketch. Setting γ(f) = e(f) for an arbitrary example e ∈ E is sufficient, because any example in E can differ from e in at most 2δmax(E) features.

Therefore, after setting β in Lemma 12 equal to the global assignment γ from Lemma 13, we obtain that each of the sets δ(e, β) for e ∈ E(S) contains at most 2δmax(E) features, for all (of the at most Dmax^|S|) examples in E(S).

Lemma 14. There is a polynomial-time algorithm that, given a support set S, computes a branching set R0 for S of size at most Dmax^|S| · 2δmax(E).

Algorithm 3: Main method for finding a DT of minimum size.
Input: CI E and integer s
Output: DT for E of minimum size (among all DTs of size at most s) if such a DT exists, otherwise nil
 1: function MINDT(E, s)
 2:   compute γ using Lemma 13
 3:   𝒮 ← set of all minimal support sets for E of size at most s
 4:   B ← nil
 5:   for S ∈ 𝒮 do
 6:     T ← MINDTS(E, s, S)
 7:     if (T ≠ nil) and (B = nil or |B| > |T|) then
 8:       B ← T
 9:   if B ≠ nil and |B| ≤ s then
10:     return B
11:   return nil

Algorithm 4: Method for finding a DT of minimum size using at least the features in a given support set S.
Input: CI E, integer s, support set S for E with |S| ≤ s
Output: DT of minimum size among all DTs T for E of size at most s such that S ⊆ feat(T); if no such DT exists, nil
 1: function MINDTS(E, s, S)
 2:   B ← a DT of minimum size for E using exactly the features in S, computed using Theorem 4
 3:   R0 ← the branching set for S computed using Lemma 14
 4:   for f ∈ R0 do
 5:     T ← MINDTS(E, s, S ∪ {f})
 6:     if T ≠ nil and (B = nil or |T| < |B|) then
 7:       B ← T
 8:   if B ≠ nil and |B| ≤ s then
 9:     return B
10:   return nil

Proof Sketch of Theorem 8. We start by presenting the algorithm for DTS, which is illustrated in Algorithm 3 and Algorithm 4. The algorithm for DTD is similar and can be found in the supplementary material. Given a CI E and an integer s, the algorithm returns a DT of minimum size among all DTs of size at most s if such a DT exists; otherwise the algorithm returns nil. The algorithm starts by computing the global assignment γ in polynomial time using Lemma 13. It then computes the set 𝒮 of all minimal support sets for E of size at most s, which because of Corollary 9 results in a set 𝒮 of size at most (δmax(E))^s. In Line 5 the algorithm then iterates over all sets S in 𝒮 and calls the function MINDTS given in Algorithm 4 for E, s, and S, which returns a DT of minimum size among all DTs T for E of size at most s such that S ⊆ feat(T). It then updates the currently best decision tree B, if necessary, with the DT found by the function MINDTS. Moreover, if the best DT found after going through all sets in 𝒮 has size at most s, it is returned (in Line 10); otherwise the algorithm returns nil.
Finally, the function MINDTS given in Algorithm 4 does the following. It first computes a DT B of minimum size that uses exactly the features in S using Theorem 4. It then tries to improve upon B with the help of useful sets. That is, it uses Lemma 14 to compute the branching set R0. It then iterates over all (of the at most Dmax^|S| · 2δmax(E)) features f ∈ R0 (using the for-loop in Line 4) and calls itself recursively on the feature set S ∪ {f}. If this call finds a smaller DT, then the current best DT B is updated. Finally, after the for-loop, the algorithm either returns B if its size is at most s, or nil otherwise.

Towards showing the correctness of Algorithm 3, consider the case that E has a DT of size at most s, and let T be such a DT of minimum size. Because of Observation 1, feat(T) is a support set for E and therefore feat(T) contains a minimal support set S of size at most s. Because the for-loop in Line 5 of Algorithm 3 iterates over all minimal support sets of size at most s for E, it follows that Algorithm 4 is called with parameters E, s, and S. If feat(T) = S, then B is set to a DT for E of size |T| in Line 2 of Algorithm 4 and the algorithm will output a DT of size at most |T| for E. If, on the other hand, feat(T) \ S ≠ ∅, then because T has minimum size and S is a support set for E with S ⊆ feat(T), we obtain from Lemma 11 that the set R = feat(T) \ S is useful for S. Therefore, because of Lemma 14, R has to contain a feature v from the set R0 computed in Line 3. It follows that Algorithm 4 is called with parameters E, s, and S ∪ {v}. From now onwards the argument repeats and, since every such recursive call adds a new feature of feat(T) and |feat(T)| ≤ s, the process stops after at most s − |S| recursive calls, after which a DT for E of size at most |T| will be computed in Line 2 of Algorithm 4. Finally, it is easy to see that if Algorithm 3 outputs a DT T, then it is a valid solution. This is because T must have been computed in Line 2 of Algorithm 4, which implies that T is a DT for E. Moreover, T has size at most s because of Line 9 in Algorithm 3. The run-time of the algorithm can be obtained by noting that the algorithm can be seen as a branching algorithm with recursion depth at most s and at most Dmax^s · 2δmax(E) branches per call, using at most 2^O(s²) · ‖E‖^(1+o(1)) · log ‖E‖ time per call.

Conclusion

We have carried out the first analysis of the parameterized complexity of learning DTs of small depth or size, providing an almost comprehensive map of this problem's parameterized complexity w.r.t. natural parameters. While our algorithmic results open up the opportunity for the development of practically more efficient exact (or heuristic) algorithms, our lower-bound results provide a formal and rigorous justification for the use of heuristics or approximation algorithms. Moreover, the developed techniques provide novel and useful insights into the problem of learning DTs. For future work, we leave the only remaining open question from our analysis: whether our FPT result still holds without the parameter Dmax. We conjecture that this is indeed the case, since the second part of our algorithm (i.e., learning the DT and thresholds for a small set of features) is already FPT parameterized by solution size and δmax alone. We hope that our work stimulates further investigations on the parameterized complexity of DT learning, for instance by considering other structural parameters that have previously been employed for various problems in AI, including decompositional parameters (Gottlob, Pichler, and Wei 2006) and backdoor set size (Gaspers and Szeider 2012).
Acknowledgements

Stefan Szeider acknowledges support by the Austrian Science Fund (FWF, project P32441) and the Vienna Science and Technology Fund (WWTF, project ICT19-065). Sebastian Ordyniak acknowledges support from the Engineering and Physical Sciences Research Council (EPSRC, project EP/V00252X/1).

References

Avellaneda, F. 2020. Efficient Inference of Optimal Decision Trees. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI-20. AAAI Press, Palo Alto, California, USA.

Bäckström, C.; Chen, Y.; Jonsson, P.; Ordyniak, S.; and Szeider, S. 2012. The Complexity of Planning Revisited - A Parameterized Analysis. In Hoffmann, J.; and Selman, B., eds., Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada. AAAI Press.

Bessière, C.; Hebrard, E.; Hnich, B.; Kiziltan, Z.; Quimper, C.-G.; and Walsh, T. 2008. The Parameterized Complexity of Global Constraints. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, 235-240. AAAI Press.

Bessiere, C.; Hebrard, E.; and O'Sullivan, B. 2009. Minimising Decision Tree Size as Combinatorial Optimisation. In Gent, I. P., ed., Principles and Practice of Constraint Programming - CP 2009, 173-187. Berlin, Heidelberg: Springer Berlin Heidelberg.

Bredereck, R.; Chen, J.; Niedermeier, R.; and Walsh, T. 2017. Parliamentary Voting Procedures: Agenda Control, Manipulation, and Uncertainty. J. Artif. Intell. Res. 59: 133-173. doi:10.1613/jair.5407.

Darwiche, A.; and Hirth, A. 2020. On The Reasons Behind Decisions. In Giacomo, G. D.; Català, A.; Dilkina, B.; Milano, M.; Barro, S.; Bugarín, A.; and Lang, J., eds., ECAI 2020 - 24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), volume 325 of Frontiers in Artificial Intelligence and Applications. IOS Press.

Doshi-Velez, F.; and Kim, B. 2017. A Roadmap for a Rigorous Science of Interpretability. CoRR abs/1702.08608.

Downey, R. G.; and Fellows, M. R. 2013. Fundamentals of Parameterized Complexity. Texts in Computer Science. Springer Verlag.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Last accessed: 1.9.2020.

Dvořák, W.; Ordyniak, S.; and Szeider, S. 2012. Augmenting tractable fragments of abstract argumentation. Artificial Intelligence 186: 157-173.

Fellows, M. R.; Jansen, B. M.; and Rosamond, F. 2013. Towards fully multivariate algorithmics: Parameter ecology and the deconstruction of computational complexity. European J. Combin. 34(3): 541-566.

Ganian, R.; Kanj, I.; Ordyniak, S.; and Szeider, S. 2018. Parameterized Algorithms for the Matrix Completion Problem. In Proceedings of ICML, the Thirty-fifth International Conference on Machine Learning, Stockholm, July 10-15, 2018, 1642-1651. JMLR.org. ISSN: 1938-7228.

Gaspers, S.; Misra, N.; Ordyniak, S.; Szeider, S.; and Zivny, S. 2014. Backdoors into Heterogeneous Classes of SAT and CSP. In Brodley, C. E.; and Stone, P., eds., Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, 2652-2658. AAAI Press.

Gaspers, S.; and Szeider, S. 2012. Backdoors to Satisfaction. In Bodlaender, H. L.; Downey, R.; Fomin, F.
V.; and Marx, D., eds., The Multivariate Algorithmic Revolution and Beyond - Essays Dedicated to Michael R. Fellows on the Occasion of His 60th Birthday, volume 7370 of Lecture Notes in Computer Science, 287-317. Springer Verlag.

Goodman, B.; and Flaxman, S. R. 2017. European Union Regulations on Algorithmic Decision-Making and a "Right to Explanation". AI Magazine 38(3): 50-57.

Gottlob, G.; Pichler, R.; and Wei, F. 2006. Bounded Treewidth as a Key to Tractability of Knowledge Representation and Reasoning. In 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference. AAAI Press.

Gottlob, G.; Scarcello, F.; and Sideri, M. 2002. Fixed-parameter complexity in AI and nonmonotonic reasoning. Artificial Intelligence 138(1-2): 55-86.

Hyafil, L.; and Rivest, R. L. 1976. Constructing Optimal Binary Decision Trees is NP-Complete. Information Processing Letters 5(1): 15-17.

Ibaraki, T.; Crama, Y.; and Hammer, P. L. 2011. Partially defined Boolean functions, 511-563. Encyclopedia of Mathematics and its Applications. Cambridge University Press.

Komusiewicz, C.; Niedermeier, R.; and Uhlmann, J. 2011. Deconstructing intractability - A multivariate complexity analysis of interval constrained coloring. J. Discrete Algorithms 9(1): 137-151.

Larose, D. T. 2005. Discovering Knowledge in Data: An Introduction to Data Mining. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ. ISBN 0-471-66657-2.

Lipton, Z. C. 2018. The mythos of model interpretability. Communications of the ACM 61(10): 36-43.

Monroe, D. 2018. AI, explain yourself. Communications of the ACM 61(11): 11-13. doi:10.1145/3276742.

Murthy, S. K. 1998. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Min. Knowl. Discov. 2(4): 345-389. doi:10.1023/A:1009744630224.

Narodytska, N.; Ignatiev, A.; Pereira, F.; and Marques-Silva, J. 2018. Learning Optimal Decision Trees with SAT. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 1362-1368. International Joint Conferences on Artificial Intelligence Organization. doi:10.24963/ijcai.2018/189.

Niedermeier, R. 2006. Invitation to Fixed-Parameter Algorithms. Oxford Lecture Series in Mathematics and its Applications. Oxford: Oxford University Press.

Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning 1(1): 81-106. doi:10.1023/A:1022643204877.

Schidler, A.; and Szeider, S. 2021. SAT-based Decision Tree Learning for Large Data Sets. In Proceedings of AAAI-21, the Thirty-Fifth AAAI Conference on Artificial Intelligence. AAAI Press.