# average_sensitivity_of_decision_tree_learning__58eaa078.pdf Published as a conference paper at ICLR 2023 AVERAGE SENSITIVITY OF DECISION TREE LEARNING Satoshi Hara Osaka University satohara@ar.sanken.osaka-u.ac.jp Yuichi Yoshida National Institute of Informatics yyoshida@nii.ac.jp A decision tree is a fundamental model used in data mining and machine learning. In practice, the training data used to construct a decision tree may change over time or contain noise, and a drastic change in the learned tree structure owing to such data perturbation is unfavorable. For example, in data mining, a change in the tree implies a change in the extracted knowledge, which raises the question of whether the extracted knowledge is truly reliable or is only a noisy artifact. To alleviate this issue, we design decision tree learning algorithms that are stable against insignificant perturbations in the training data. Specifically, we adopt the notion of average sensitivity as a stability measure, and design an algorithm with low average sensitivity that outputs a decision tree whose accuracy is close to the optimal decision tree. The experimental results on real-world datasets demonstrate that the proposed algorithm enables users to select suitable decision trees considering the trade-off between average sensitivity and accuracy. 1 INTRODUCTION A decision tree is a fundamental model in applications such as extracting knowledge in data mining and predicting outcomes in machine learning. Learned decision trees enable the extraction of hidden structures in the data in an interpretable manner using the if-then format. In data mining, the extracted structures are of fundamental interest (Rokach & Maimon, 2007; Gorunescu, 2011). Decision trees also play an essential role in decision making (Zeng et al., 2017; Rudin, 2019; Arrieta et al., 2020) because unlike complex models, such as deep neural networks, the decisions made by decision trees are explainable. With the increase of the utility of machine learning models in realworld problems, decision trees and their variants are widely used particularly for applications such as high-stake decision making, where explainability is crucial and transparency higher than post-hoc explanations (e.g., (Angelino et al., 2018; Rudin, 2019; Arrieta et al., 2020)) are required. Current studies on decision trees and their families mainly focus on developing learning algorithms to improve two aspects of learned trees: accuracy and interpretability. Here, we demonstrate that there is a third essential aspect that is missing in current studies: the stability of the learning algorithm against insignificant perturbations on the training data. Decision trees are typically used to extract knowledge from data and help users make decisions that can be explained. If the learning algorithm is unstable, the structure of the learned trees can vary significantly even for insignificant changes in the training data. In data mining, this implies that the extracted knowledge can be unstable, which raises the question of whether the extracted knowledge is truly reliable or only a noisy artifact induced by the unstable learning algorithm. In model-based decision making, this implies that the decision process can change drastically whenever a few additional data are obtained and the tree is retrained on the new training data. Such noisy decision makers are unacceptable for several reasons. For example, stakeholders may lose their trust in such decision makers, or it may be extremely costly to frequently and drastically update the entire decision making system. Figure 1 shows an illustrative example of sensitive/stable decision tree learning algorithms. In this example, the standard greedy tree learning algorithm induces different trees before and after one data point (large red triangle) is removed (Figure 1(a)). Thus, it can be observed that the greedy algorithm is sensitive to the removal of data points. The objective of this study is to design a tree learning algorithm that can induce (almost) same trees against the removal of a few data points (Figure 1(b)). Published as a conference paper at ICLR 2023 Before Removing After Removing (a) Standard greedy tree learning algorithm Before Removing After Removing (b) Proposed stable tree learning algorithm Data Points Class 1 Class 2 Predictions by Tree Class 1 Class 2 Figure 1: Decision boundaries of the learned decision trees. (a) The standard greedy tree learning algorithm is sensitive to the removal of even a single data point (large red triangle) from the training data. (b) The proposed learning algorithm produces more stable trees. In this study, we design a decision tree learning algorithm that is stable against insignificant perturbations in the training data. Specifically, we consider the change in (the distribution of) the learned tree upon deletion of a random data point from the training data, using the notion of average sensitivity (Varma & Yoshida, 2021). Subsequently, we design a (randomized) decision tree learning algorithm with low average sensitivity while preserving the accuracy of the learned decision tree up to a tolerance parameter. A randomized algorithm may output completely different decision trees on the original training data and on the training data obtained by deleting a random data point even if the output distributions are close. To alleviate this issue, we design a (randomized) decision tree learning algorithm with low expected average sensitivity over random bits used in the algorithm, which implies that the output decision tree on the original training data and that on the training data obtained by deleting a random data point are close with a high probability over the choice of the random bits used. Through real-world data experiments, we demonstrate that our learning algorithm exhibits a lower average sensitivity compared to the standard greedy decision tree learning algorithm, while maintaining the decrease in accuracy within the prescribed tolerance parameter. 2 RELATED WORK Decision Tree Learning Algorithms Generally, learning an optimal decision tree is NPhard (Laurent & Rivest, 1976), and hence we can obtain optimal trees only for small problems (Bertsimas & Dunn, 2017; Angelino et al., 2018; G unl uk et al., 2021). To avoid this issue, recursive greedy splitting is widely used for learning (non-optimal) decision trees (Rivest, 1987; Loh, 2011; Quinlan, 2014), and Bayesian approaches are used to learn a family of decision trees, such as rule lists and rule sets (Wang et al., 2017; Yang et al., 2017). These studies are concerned with learning trees with less computation or better interpretability. This study is orthogonal to them in that our interest is developing stable decision tree learning algorithms, which was not considered before. We stress that the focus of the current study is to learn a stand-alone decision tree and not to learn a collection of decision trees for ensemble models (e.g., (Ho, 1995; Breiman, 2001; Friedman, 2001; Chen & Guestrin, 2016)). For the latter, we want decision trees that make different predictions, and hence somewhat sensitive algorithms are favorable rather than stable ones. Average Sensitivity Varma & Yoshida (2021) introduced the notion of average sensitivity and designed algorithms with low average sensitivity for various graph problems including the minimum spanning tree, minimum cut, and minimum vertex cover problems. Average sensitivity of algorithms are discussed also for various problems including the maximum matching problem (Yoshida & Zhou, 2021), problems that can be solved by dynamic programming (Kumabe & Yoshida, 2022a;b), spectral clustering (Peng & Yoshida, 2020), and Euclidean k-clustering (Yoshida & Ito, 2022). Adversarial Robustness Insignificant human-imperceptible perturbations to the input can mislead trained models, and such perturbations are called adversarial attacks. It is known that adversarial attacks are harmful to decision trees (Chen et al., 2019a;b; Kantchelian et al., 2016). To alleviate this issue, several recent studies have considered the problems of robustness verification (Chen et al., 2019b; T ornblom & Nadjm-Tehrani, 2019; Wang et al., 2020) and adversarial defense (Chen et al., 2019a; Andriushchenko & Hein, 2019; Calzavara et al., 2020; Chen et al., 2021). Adversarial at- Published as a conference paper at ICLR 2023 tacks focus on the change in the predicted label against perturbing a single input data point at the inference time, whereas average sensitivity focuses on the change in the learned decision tree against perturbing the entire training data at the learning time. Stability Bousquet and Elisseeff introduced the notion of the stability of a learning algorithm and discussed its relation to generalization ability (Bousquet & Elisseeff, 2002). Unlike average sensitivity, this notion only concerns the stability of the loss value against data perturbation. Hence, it cannot be used to stabilize the structure of learned decision tree. Differential Privacy Differential privacy (Dwork, 2006) measures the stability of the output against perturbation to the input. Decision trees have been intensively studied from the perspective of differential privacy (see (Fletcher & Islam, 2019) and references therein). The average sensitivity of an algorithm can be bounded by the differential privacy parameter ϵ times the maximum size of the output (Varma & Yoshida, 2021). In the decision tree learning setting, the bound we obtain in this manner is roughly O(ϵ 2B), where B specifies the depth of the output decision tree. By contrast, our bound (Theorem 4.1) is O(B2B/n), where n is the number of points in the training data. Our bound is much smaller because we usually set 2B n to avoid overfitting. 3 PRELIMINARIES 3.1 DECISION TREE Let X and Y be input and output spaces, respectively. We call a Boolean function ω : X {0, 1} a decision rule. Then, a decision tree is a function ϕ : X Y represented by a rooted proper binary tree, that is, it has a special node called the root and each node has either zero or two children, such that each internal node t is associated with a decision rule ωt : X {0, 1} and each leaf is associated with a label yt Y. Then given an input x X, the decision tree ϕ predicts a label y according to PREDICT(ϕ, x) shown in Algorithm 1. Let L = ((x1, y1), . . . , (xn, yn)) be a training data. The total score of a decision tree ϕ : X Y with respect to L is defined to be s(ϕ, L) := Pn i=1 1[ϕ(xi) = yi], where 1[X] is the indicator of the event X. Note that s(ϕ, L)/n is the accuracy of ϕ on L. For a nonnegative integer B, let TB be the set of decision trees of depth B. Then, let opt B(L) := maxϕ TB s(ϕ, L) be the maximum total score of a decision tree of depth B. Also for a decision rule ω : X {0, 1} and a nonnegative integer B, let Tω,B be the set of decision trees of depth B with the root node having the decision rule ω. Then we define optω,B(L) := maxϕ Tω,B s(ϕ, L). We clearly have opt B(L) = maxω optω,B(L). For a decision tree ϕ, we denote by |ϕ| the number of nodes (including the leaves) in ϕ. For two decision trees ϕ and ϕ , we define the distance d DT(ϕ, ϕ ) between them as the output of DISTANCE(ϕ, ϕ ) shown in Algorithm 2. Intuitively, the procedure DISTANCE(ϕ, ϕ ) computes the maximal subtree common to ϕ and ϕ , and then outputs the total number of remaining nodes after subtracting the common subtree from ϕ and ϕ . It is clear that d DT( , ) satisfies triangle inequality. We note that, even if two decision trees ϕ, ϕ : X Y are equal as a function, they may have a large distance in d DT if they have different tree structures. For interpretability, however, we believe this is the advantage of using d DT because it is not easy to verify the equivalence of ϕ and ϕ when they have different tree structures and d DT correctly reflect this situation. 3.2 AVERAGE SENSITIVITY For a training data L = ((x1, y1), . . . , (xn, yn)) and 1 i n, let L(i) = ((x1, y1), . . . , (xi 1, yi 1), (xi+1, yi+1), . . . , (xn, yn)) be the training data obtained from L by dropping the ith data point. Let A be a deterministic algorithm that, given a training data, outputs a decision tree. Then, the average sensitivity of A on L is i=1 d DT(A(L), A(L(i))). Published as a conference paper at ICLR 2023 Algorithm 1: 1 Procedure PREDICT(ϕ, x) 2 Let t be the root node of ϕ; 3 if t is a leaf then return yt; 5 Let ϕL, ϕR be the decision trees rooted at the left and right children of t, respectively; 6 if ωt(x) = 0 then return PREDICT(ϕL, x); 7 else return PREDICT(ϕR, x); Algorithm 2: 1 Procedure DISTANCE(ϕ, ϕ ) 2 Let t and t be the root nodes of ϕ and ϕ , respectively; 3 if both t and t are leaves then 4 return 0 if yt = yt and 2 otherwise. 5 else if either t or t is a leaf then return |ϕ| + |ϕ |; 6 else if ωt = ωt then return |ϕ| + |ϕ |; 8 Let ϕL, ϕR be the decision trees rooted at the left and right children of t, respectively; 9 Let ϕ L, ϕ R be the decision trees rooted at the left and right children of t , respectively; 10 return DISTANCE(ϕL, ϕ L) + DISTANCE(ϕR, ϕ R). Now, we generalize this definition for randomized algorithms. First for two distributions D and D over decision trees, we define their earth mover s distance as d EM(D, D ) = min P E(ϕ,ϕ ) P d DT(ϕ, ϕ ), where P is over distributions of pairs of decision trees such that the marginal distributions on the first and second coordinates are equal to D and D , respectively. Let A be a randomized algorithm that, given a training data, outputs a decision tree. Then, we define the average sensitivity of A on L as i=1 d EM(A(L), A(L(i))), (1) where we regard A( ) as a distribution over decision trees. It is natural to consider a variant of average sensitivity such that we delete k data points instead of a single data point because in practice many data points can be dropped. As discussed in Varma & Yoshida (2021), this variant can be bounded by k times the average sensitivity above, and hence we focus on bounding the latter. 4 DECISION TREE CONSTRUCTION In this section, we provide a decision tree learning algorithm with a high total score and a low average sensitivity. Specifically, we show the following: Theorem 4.1. There exists an (possibly inefficient) randomized algorithm that, given a training data L of size n, a depth bound B, and a parameter ϵ > 0, returns a decision tree ϕ of depth at most B such that Eϕ[s(ϕ, L)] (1 ϵ)Bopt B(L), and for the set of decision rules Ω, its average sensitivity is O B2B log |Ω| We note that we usually choose 2B n to avoid overfitting and ϵ = Θ(1/B) to achieve constant approximation, and hence the average sensitivity bound is essentially O(1/n). We note that the algorithm of Theorem 4.1 makes use of the optimal total score, which is not efficiently computable in general. We discuss practical implementations of the algorithm in Section 6. The algorithm of Theorem 4.1 is based on a procedure called STABLEDR, which selects a decision rule in a stable way (Algorithm 3). This is a simple application of the exponential mechanism (Mc- Published as a conference paper at ICLR 2023 Algorithm 3: 1 Procedure STABLEDR(L, B, ϵ) 2 Select ω Ωwith probability exp(λ optω,B(L)), where λ = 2 log |Ω| ϵ opt B(L) ; 3 return ω. 4 Procedure STABLEDT (L, B, ϵ, d) 5 if |L| 1 or d = B then return an optimal label for L; 6 ω STABLEDR(L, B, ϵ); 7 Partition L into LL LR according to ω; 8 ϕL STABLEDT (LL, B, ϵ, d + 1) and ϕR STABLEDT (LR, B, ϵ, d + 1); 9 Let ϕω be the decision tree such that the root node t has rule ω and the left and right children of t are ϕL and ϕR, respectively; 10 return ϕω. 11 Procedure STABLEDT(L, B, ϵ) 12 return STABLEDT (L, B, ϵ, 0). Sherry & Talwar, 2007) to optω,B(L) (ω Ω). Then, the proposed algorithm, STABLEDT (Algorithm 3), works as follows. At each node, we compute a decision rule ω using STABLEDR, split the training data according to ω, and recursively construct decision trees for each of the split data until the size of the training data becomes at most one or the depth becomes B. To understand why the randomized procedure STABLEDR produces stability, consider the following scenario: Suppose that we have two candidate rules ω1 and ω2 with optimal scores optω1,B(L) = 90 and optω2,B(L) = 89, respectively. The greedy algorithm selects ω1 because it has the larger score. Suppose that the scores have changed upon the removal of a subset S L and now we have optω2,B(L \ S) = 85 and optω2,B(L \ S) = 86. Then, the greedy algorithm selects ω2 because it has now the larger score, and hence we never obtain the same tree for L and L \ S. By contrast, the probability that STABLEDR selects ω1 (resp., ω2) is nearly half for both L and L \ S because ω1 (resp., ω2) is a nearly optimal rule. Hence, the distributions of the trees given by Algorithm 3 are close between L an L \ S. 5 EXPECTED DETERMINISTIC AVERAGE SENSITIVITY Recall that the average sensitivity (1) for randomized algorithms bounds the earth mover s distance between the distributions of A(L) and A(L(i)). This does not immediately imply that, given a decision tree for L, we can compute a similar decision tree for L(i) with a similar total score because the mapping from decision trees for L to those for L(i) that achieves the earth mover s distance is not always available. To address this issue, we consider bounding the expectation of deterministic average sensitivity between output decision trees over random bits: i=1 d DT(Aπ(L), Aπ(L(i))) where Aπ is the deterministic algorithm obtained from a randomized algorithm A by fixing the random bits used in A to π {0, 1} . Then, using the same random bits π, we can obtain similar decision trees for L and L(i). Note that the average sensitivity (1) is bounded from above by the expected deterministic average sensitivity (2) because the pair of random variables (Aπ(L), Aπ(L(i))) induces a joint distribution over pairs of decision trees. By slightly modifying STABLEDT and the proof of Theorem 4.1, we obtain the following: Theorem 5.1. There exists a (possibly inefficient) randomized algorithm that, given a training data L of size n, a depth bound B, and a parameter ϵ > 0, returns a decision tree ϕ of depth at most B such that Eϕ[s(ϕ, L)] (1 ϵ)Bopt B(L), and for the set of decision rules Ω, its expected average sensitivity over random bits is O B2B log |Ω| Published as a conference paper at ICLR 2023 Table 1: Datasets Dataset training data size sampled data size test data size # of features # of classes tree depth breast cancer 546 436 137 10 2 5 diabetes 614 491 154 8 2 1 cod-rna 59535 1000 271617 8 2 10 covtype 400000 1000 181000 54 7 7 higgs 10500000 1000 500000 28 2 3 ijcnn 49900 1000 91701 22 2 1 sensorless 48509 1000 10000 48 11 9 webspam 300000 1000 50000 254 2 7 6 PRACTICAL IMPLEMENTATIONS The algorithm STABLEDR and the algorithm of Theorem 5.1 require the value of optω,B(L), which is NP-hard to compute (Laurent & Rivest, 1976). Although several algorithms based on integer programming were proposed (Bertsimas & Dunn, 2017; G unl uk et al., 2021), they are not fast enough to handle large data. To address this issue, we suggest replacing optω,B(L) with optω,1(L) because the latter can be computed quite efficiently. Note that this strategy is the same as the standard greedy recursive splitting used for training (non-optimal) decision trees (Rivest, 1987; Loh, 2011; Quinlan, 2014). We use this version in our experiments in Section 7. The same analysis goes through and we obtain the following. Theorem 6.1. There exists a randomized algorithm that, given a training data L of size n, a depth bound B, and a parameter ϵ > 0, returns a decision tree ϕ of depth at most B such that E[s(ϕ, L)] (1 ϵ)opt1(L), and for the set of decision rules Ω, its expected average sensitivity over random bits is O B2B log |Ω| ϵn . The time complexity is P log n 1 d=0 2B T(n/2B) |Ω|, where T(m) is the running time required to compute optω,1(L ) for ω Ωwith |L | = m. We note that, although the approximation guarantee of Theorem 6.1 is weaker than those of Theorems 4.1 and 5.1, practically used decision tree learning algorithms also lack approximation guarantees. In Section 7, we empirically confirm that the accuracy of the algorithm of Theorem 6.1 is not much worse than those of baseline algorithms. We also note that, in many cases, computing optω,1(L ) takes linear time, and sometimes we can compute optω,1(L ) for many ω s at once. 7 EXPERIMENTS We demonstrate that the proposed algorithm can output stable decision trees. Datasets We used datasets shown in Table 1.1 We split the datasets into small ones (breast cancer and diabetes) and large ones (cod-rna, covtype, higgs, ijcnn, sensorless, and webspam). We use small datasets to demonstrate the stability of the proposed algorithm, and large datasets to study trade-offs between average sensitivity and accuracy. In the experiments, we used subsamples of these datasets. For training, we randomly sampled 80% of the data points and 1000 data points for small and large datasets, respectively. In the experiments, we evaluated the test accuracy of the learned decision trees using the entire test data. Tree Learning Algorithms In the experiment, we used the practical implementation described in Section 6 for training the decision trees. As we have shown in Theorem 6.1, this practical implementation can output stable decision trees without solving NP-hard problems. We also adopted the following greedy tree learning algorithm as the baseline for comparison. At each node, the greedy 1These datasets are obtained from https://github.com/chenhongge/Robust Trees We omitted MNIST-based datasets because decision trees are typically not used for images. Published as a conference paper at ICLR 2023 algorithm selects the decision rule ω with the highest score, i.e., ω arg maxω Ωoptω,1(L). When multiple decision rules exist in arg maxω Ωoptω,1(L), we select one of them uniformly at random. We set the tree depth shown in Table 1 so that the greedy algorithm exhibits the highest accuracy in cross-validation. We implemented both the greedy and proposed algorithms in Python 3 using the JIT compiler of Numba. We used the equidistant points {xj,1, . . . , xj,Q} within the interval [xj,min, xj,max] for each feature xj as the decision rules, where xj,min = xj,1 and xj,max = xj,Q are the minimum and the maximum of xj in the dataset and we set Q = 500. That is, we set Ω= {u 7 1[uj xj,q]}j,q, Procedure We generated 10 sampled training data from the original training data. For each of the sampled training data, we trained decision trees using the greedy algorithm and proposed algorithms over different values of ϵ. Because the tree learning algorithms are randomized, we trained trees for ten different random bits π1, π2, . . . , π10. Through this procedure, we obtained 100 trees for each tree learning algorithm (over ten sampled training data and ten different random bits). We report the average of the average sensitivity and the training and test accuracy over these 100 trees. To estimate the average sensitivity, we trained additional trees using the sampled training data, with m data points removed at random. Formally, let A(L) be the tree trained using the algorithm A on L. Then, we consider another tree A(L \ S) trained on the set L \ S with |S| = m. We selected the sets S1, S2, . . . , SR at random, and estimated the (normalized) expected deterministic average sensitivity over random bits as 1 R PR r=1 d DT(Aπt(L),Aπt(L\Sr)) d B , where πt is the t-th random bits, and d B = 2B+2 2 is the maximum distance between the trees with depth B, which normalizes the sensitivity within [0, 1]. We interpret this normalized sensitivity as the fraction of different nodes between the two trees. We set R = 100, and varied the number m of removed data points from 1 to 30% of the sampled training data. 7.2 RESULT 1: DEPENDENCY ON ϵ First, we evaluated the dependency of the proposed algorithm on the choice of ϵ for small datasets. In the experiment, we evaluated the algorithm over nine different ϵ values from 10 2 to 100 on a logarithmic scale and over the number of removed data points m = 1, 1%, and 10% of the sampled training data. Figure 2 shows the average sensitivity, training accuracy, and test accuracy. It is evident from Figures 2 (a) and (d) that the average sensitivity decreases as ϵ increases as suggested by Theorem 6.1. Empirically, we observed that the average sensitivity is kept almost constant until ϵ = 0.1 and it began to decrease for larger ϵ values. Figures 2 (b) and (e) show that the training accuracy decreases for ϵ larger than 0.1. These results suggest that the stability of the trees is obtained at the cost of the decrease in training accuracy. This observation is consistent with Theorem 6.1 as well. Based on these results, we can empirically confirm the correctness of Theorem 6.1: the trained decision trees become stable, particularly for larger ϵ, whereas the accuracy decrease. We also note that Figures 2(c) and (f) suggest that this may not always be the case for test accuracy, which is outside the scope of our theorem. 7.3 RESULT 2: EXAMPLES OF TREES Figure 3 shows examples of the trees trained on the breast cancer dataset using the greedy and the proposed algorithms with ϵ = 0.3 and m = 10% of the sampled training data. In the experiment, we trained 100 trees on the set L\Sr for r = 1, . . . , 100 to estimate the expected average sensitivity over random bits. Within these 100 trees, some had identical structures, while others did not. In the figure, we show the original tree trained using all the sampled training data, and the three most frequently identical structures within the 100 trees. In this example, the original tree and the most frequent trees were identical. There are two important implications in Figure 3. Firstly, the second frequent trees of the greedy algorithm changed drastically from the original tree, whereas in the proposed algorithm, the changes appeared only in small substructures. These smaller changes in the proposed algorithm induced a smaller distance d DT(A(L), A(L \ Sr)), resulting in smaller expected deterministic average sensitivity for ϵ > 0.1, as shown in Figure 2. Secondly, the most frequent trees dominated 10/100 and 72/100 of the cases for the greedy and proposed algorithms, respectively. Therefore, the tree can change frequently upon data point removal Published as a conference paper at ICLR 2023 10 2 10 1 100 10 3 Average Sensitivity (a) Sensitivity 10 2 10 1 100 0.960 Training Accuracy (b) Training accuracy 10 2 10 1 100 0.972 Test Accuracy (c) Test accuracy The number m of removed data points 10 2 10 1 100 10 3 Average Sensitivity (d) Sensitivity 10 2 10 1 100 Training Accuracy (e) Training accuracy 10 2 10 1 100 0.66 Test Accuracy (f) Test accuracy Figure 2: Average sensitivity and accuracy of the trained trees over different ϵ on small datasets, breast cancer (a) (c) and diabetes (d) (e). The blue and red lines denote the results for the proposed and greedy tree learning algorithms, respectively. T F x9 9 y = 1 T F x3 3 T F x6 1 y = 1 T F x4 2 T F x7 3 T F x9 9 y = 1 T F x3 3 T F x6 1 y = 1 T F x4 2 T F x7 3 T F x3 10 y = 1 T F x2 8 T F x8 2 T F x3 1 T F x4 2 T F x9 9 y = 1 T F x3 3 T F x2 6 y = 1 T F x4 2 T F x7 3 T F x9 9 y = 1 T F x3 3 T F x6 1 y = 1 T F x4 2 T F x7 3 Original Tree 1st Frequent: 10/100 2nd Frequent: 6/100 3rd Frequent: 6/100 (a) (Greedy) The original tree (leftmost) and the first three frequent trees (the others). T F x3 10 y = 1 T F x2 9 T F x2 6 y = 1 T F x4 2 T F x7 3 T F x3 10 y = 1 T F x2 9 T F x2 6 y = 1 T F x4 2 T F x7 3 T F x3 10 y = 1 T F x2 9 T F x2 9 y = 1 T F x4 2 T F x7 3 T F x3 4 T F x4 4 T F x2 6 y = 1 T F x4 2 T F x7 3 T F x3 10 y = 1 T F x2 9 T F x2 6 y = 1 T F x4 2 T F x7 3 Original Tree 1st Frequent: 72/100 2nd Frequent: 12/100 3rd Frequent: 2/100 (b) (Proposed) The original tree (leftmost) and the first three frequent trees (the others). Figure 3: The three most frequent trees on the breast cancer dataset for (a) the greedy and (b) proposed tree learning algorithms over 100 random data point removals with ϵ = 0.3 and m = 10%. T and F on the edges denote the splitting whether the rule in the parent node is True and False, respectively. In both methods, the most frequent trees are identical to the original tree trained using all the sampled training data. The red nodes denote differences from the original tree. in the greedy algorithm, whereas the change is less frequent in the proposed algorithm, where more than 70% of the trees are identical. These results indicate that the proposed algorithm can induce trees with small distances and less frequent structural changes. This property is favorable in practical situations because in data mining applications, the extracted knowledge is guaranteed to not change drastically, and in machine learning applications, we can continue using almost the same decision-making process. 7.4 RESULT 3: SENSITIVITY-ACCURACY TRADE-OFF The results in Figure 2 suggest that the average sensitivity and accuracy are in a trade-off relation through ϵ. Using a large ϵ, we obtain stable trees with a slight decrease in accuracy, where using a small ϵ, we obtain accurate trees; however they tend to have high sensitivities. Published as a conference paper at ICLR 2023 0.0 0.2 0.4 0.6 0.8 Average Sensitivity Training Accuracy (a) cod-rna 0.0 0.2 0.4 0.6 0.8 1.0 Average Sensitivity Training Accuracy (b) covtype 0.0 0.2 0.4 0.6 0.8 Average Sensitivity Training Accuracy The number m of removed data points m = 1 m = 1% m = 3% m = 10% m = 30% m = 1 m = 1% m = 3% m = 10% m = 30% 0.0 0.1 0.2 0.3 0.4 0.5 Average Sensitivity 0.90525 0.90550 0.90575 0.90600 0.90625 0.90650 0.90675 0.90700 0.90725 Training Accuracy 0.0 0.2 0.4 0.6 0.8 Average Sensitivity Training Accuracy (e) sensorless 0.0 0.2 0.4 0.6 0.8 1.0 Average Sensitivity Training Accuracy (f) webspam Figure 4: Trade-off curves between average sensitivity and training accuracy when ϵ is changed. We varied the number of training data points to be removed from one to 30% of the sampled training data. White markers denote the results of the greedy tree learning. We computed this trade-off for large datasets by varying ϵ from 10 5 to 100 on a logarithmic scale. Figure 4 shows the trade-off curves between the average sensitivity and training accuracy. In the figures, we observe that the average sensitivity and accuracy are in a trade-off relation for most of the datasets. The trees become stable as ϵ increases, while incurring an accuracy decrease in the bottom left of the figures. We note that our theory only guarantees that the decrease in training accuracy is limited. However, the test accuracy results in Figure 7 in Appendix D also confirm that the drop in test accuracy is limited. Thus, it is not necessary to sacrifice a considerable amount of the test accuracy to obtain stable decision trees in practice. The result on cod-rna is an exception that did not exhibit clear trade-off curves. In this dataset, the trees become less stable and less accurate simultaneously for small ϵ, as shown in the bottom right of the figures. In particular, the greedy trees tend to be less accurate than stable ones. One possible reason of this phenomenon is overfitting. In Table 1, the tree depth is set to ten for this dataset. In such deep trees, the number of training data points that reach a deep node tends to be small. Then, the greedy selection of the best decision rule based on a small number of data points can be unstable, some arbitrary rules may be selected by chance, and overfitting occurs. Therefore, the greedily selected rule is not optimal if we consider deeper trees, i.e., optω,B(L), can be smaller than some optω ,B(L) even if optω,1(L) > optω ,1(L). On the contrary, the proposed algorithm randomizes the choice of the decision rule using an exponential mechanism. We conjecture that this randomization can help prevent suboptimal rules and overfitting, and increase the chance of gaining more accurate trees. Finally, we emphasize that the results confirm that we gained the freedom to choose decision trees on these trade-off curves using the proposed algorithm. This will open up a new paradigm for practitioners who have suffered from the high sensitivity of existing learning algorithms. Practitioners can now tune ϵ and obtain stable trees upon their demand. 8 CONCLUSION We proposed decision tree learning algorithms with theoretical guarantees on its average sensitivity and accuracy. We experimentally confirmed that the proposed algorithms performed well on realworld datasets. An obvious open problem is to show some hardness on the trade-off between average sensitivity and accuracy. In practice, it is important that the predictions made by the learned model do not change significantly with slight changes in the training data. Hence, it is also natural to measure the distance between models by the number of mismatches of their predictions and study average sensitivity in terms of this distance even for non-explainable learning models such as random decision forests Ho (1995); Breiman (2001) and deep neural networks. Published as a conference paper at ICLR 2023 REPRODUCIBILITY STATEMENT In the experiments, we only used publicly available data so that all the results to be reproducible. The code is available at https://github.com/sato9hara/Stable Decision Tree ACKNOWLEDGEMENT SH is supported by JST, PRESTO Grant Number JPMJPR20C8. YY is supported by JST, PRESTO Grant Number JPMJPR192B. Maksym Andriushchenko and Matthias Hein. Provably robust boosted decision stumps and trees against adversarial attacks. Neur IPS, 32:13017 13028, 2019. Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists for categorical data. Journal of Machine Learning Research, 18:1 78, 2018. Alejandro Barredo Arrieta, Natalia D ıaz-Rodr ıguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garc ıa, Sergio Gil-L opez, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82 115, 2020. Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039 1082, 2017. Olivier Bousquet and Andr e Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499 526, 2002. Leo Breiman. Random forests. Machine Learning, 45(1):5 32, 2001. Stefano Calzavara, Claudio Lucchese, Gabriele Tolomei, Seyum Assefa Abebe, and Salvatore Orlando. Treant: training evasion-aware decision trees. Data Mining and Knowledge Discovery, 34 (5):1390 1420, 2020. Hongge Chen, Huan Zhang, Duane Boning, and Cho-Jui Hsieh. Robust decision trees against adversarial examples. In ICML, pp. 1122 1131, 2019a. Hongge Chen, Huan Zhang, Si Si, Yang Li, Duane Boning, and Cho-Jui Hsieh. Robustness verification of tree-based models. Neur IPS, 32:12317 12328, 2019b. Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In KDD, pp. 785 794, 2016. Yizheng Chen, Shiqi Wang, Weifan Jiang, Asaf Cidon, and Suman Jana. Cost-aware robust tree ensembles for security applications. In USENIX Security, pp. 2291 2308, 2021. Cynthia Dwork. Differential privacy. In ICALP, pp. 1 12, 2006. Sam Fletcher and Md Zahidul Islam. Decision tree classification with differential privacy: A survey. ACM Computing Surveys (CSUR), 52(4):1 33, 2019. Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of statistics, 29(5):1189 1232, 2001. Florin Gorunescu. Data mining: concepts, models and techniques, volume 12. Springer Science & Business Media, 2011. Oktay G unl uk, Jayant Kalagnanam, Minhan Li, Matt Menickelly, and Katya Scheinberg. Optimal decision trees for categorical data via integer programming. Journal of Global Optimization, pp. 1 28, 2021. Published as a conference paper at ICLR 2023 Tin Kam Ho. Random decision forests. In ICDAR, pp. 278 282, 1995. Alex Kantchelian, J Doug Tygar, and Anthony Joseph. Evasion and hardening of tree ensemble classifiers. In ICML, pp. 2387 2396, 2016. Soh Kumabe and Yuichi Yoshida. Average sensitivity of dynamic programming. In SODA, pp. 1925 1961, 2022a. Soh Kumabe and Yuichi Yoshida. Average sensitivity of the knapsack problem. In ESA, volume 244, pp. 75:1 75:14, 2022b. Hyafil Laurent and Ronald L Rivest. Constructing optimal binary decision trees is np-complete. Information Processing Letters, 5(1):15 17, 1976. Wei-Yin Loh. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):14 23, 2011. Frank Mc Sherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, pp. 94 103, 2007. Pan Peng and Yuichi Yoshida. Average sensitivity of spectral clustering. In KDD, pp. 1132 1140, 2020. J Ross Quinlan. C4. 5: programs for machine learning. Elsevier, 2014. Ronald L Rivest. Learning decision lists. Machine learning, 2(3):229 246, 1987. Lior Rokach and Oded Z Maimon. Data mining with decision trees: theory and applications, volume 69. World scientific, 2007. Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206 215, 2019. John T ornblom and Simin Nadjm-Tehrani. An abstraction-refinement approach to formal verification of tree ensembles. In SAFECOMP, pp. 301 313, 2019. Nithin Varma and Yuichi Yoshida. Average sensitivity of graph algorithms. In SODA, pp. 684 703, 2021. Tong Wang, Cynthia Rudin, Finale Doshi-Velez, Yimin Liu, Erica Klampfl, and Perry Mac Neille. A bayesian framework for learning rule sets for interpretable classification. The Journal of Machine Learning Research, 18(1):2357 2393, 2017. Yihan Wang, Huan Zhang, Hongge Chen, Duane Boning, and Cho-Jui Hsieh. On lp-norm robustness of ensemble decision stumps and trees. In ICML, pp. 10104 10114, 2020. Hongyu Yang, Cynthia Rudin, and Margo Seltzer. Scalable bayesian rule lists. In ICML, pp. 3921 3930, 2017. Yuichi Yoshida and Shinji Ito. Average sensitivity of euclidean k-clustering. In Neur IPS, 2022. to appear. Yuichi Yoshida and Samson Zhou. Sensitivity analysis of the maximum matching problem. In ITCS, pp. 58:1 58:20, 2021. Jiaming Zeng, Berk Ustun, and Cynthia Rudin. Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(3):689 722, 2017. Published as a conference paper at ICLR 2023 A ANALYSIS OF STABLEDR In this section, we show the following performance guarantee for STABLEDR. Theorem A.1. For a training data L, a depth bound B, and a parameter ϵ > 0, let ω = STABLEDR(L, B, ϵ). Then, we have Eω optω,B(L) (1 ϵ)opt B(L). Moreover, we have Pn i=1 d TV(ω, ω(i)) = O log |Ω| ϵ , where ω(i) = A(L(i), B, ϵ), and d TV(ω, ω(i)) denotes the total variation distance between (the distributions of) ω and ω(i). The first inequality claims that we can achieve a nearly optimal total score using the output decision rule ω. The second inequality claims that the distribution of ω does not change significantly when a data point is removed from the training data. Theorem A.1 is obtained by combining Lemmas A.2 and A.4. A.1 APPROXIMATION GUARANTEE First, we show that the selected decision rule does not much deteriorate the total score of an optimal decision tree. Lemma A.2. Let ω = STABLEDR(L, B, ϵ). Then, we have E ω[optω,B(L)] (1 ϵ)opt B(L). Proof. For any c > 0, we have Pr[optω,B(L) opt B(L) c] ψ Ω:optψ,B(L) opt B(L) c exp(λ optψ,B(L)) P ψ Ωexp(λ optψ,B(L)) |Ω| exp(λ (opt B(L) c)) P ψ Ωexp(λ optψ,B(L)) |Ω| exp(λ (opt B(L) c)) exp(λ opt B(L)) |Ω| exp( λc). Therefore, we have E[optω,B(L)] Pr[optω,B(L) opt B(L) c] 0 + Pr[optω,B(L) > opt B(L) c] (opt B(L) c) (1 |Ω| exp( λc)) (opt B(L) c) opt B(L) |Ω| exp( λc) c. By setting c = log |Ω|/λ and the choice of λ, the claim holds. A.2 AVERAGE SENSITIVITY Next, we analyze the average sensitivity of STABLEDR. For notational simplicity, we write optω and opt(i) ω to denote optω,B(L) and optω,B(L(i)), respectively. The following lemma is useful for our analysis. Lemma A.3. For any decision rule ω Ω, we have optω opt(i) ω optω. Similarly, we have opt B(L) opt B(L(i)) opt B(L). Published as a conference paper at ICLR 2023 Proof. We first consider the first statement. Let ϕ be the optimal decision that attains optω. Note that ϕ has depth B and the root node of ϕ has the decision rule ω. Then, we have optω opt(i) ω i=1 (s(ϕ, L) s(ϕ, L(i))) = i=1 1[ϕ(xi) = y] = s(ϕ, L) = optω. The second statement follows by a similar argument. Lemma A.4. Let ω = STABLEDR(L, B, ϵ), ω(i) = STABLEDR(L(i), B, ϵ). Then, we have n X i=1 d TV(ω, ω(i)) = O log |Ω| Proof. Notice that i=1 d TV(ω, ω(i)) = ψ Ω max n 0, Pr[ω = ψ] Pr[ω(i) = ψ] o . Let λ(i) be λ used in STABLEDR(L(i), B, ϵ). Then we have max n 0, Pr[ω = ψ] Pr[ω(i) = ψ] o 0, exp(λ optψ) P ψ Ωexp(λ optψ ) exp(λ(i) opt(i) ψ ) P ψ Ωexp(λ(i) opt(i) ψ ) exp(λ optψ) exp(λ opt(i) ψ ) P ψ Ωexp(λ optψ ) 0, exp(λ opt(i) ψ ) P ψ Ωexp(λ optψ ) exp(λ(i) opt(i) ψ ) P ψ Ωexp(λ(i) opt(i) ψ ) where the equality is from the design of the algorithm and the inequality is from the following inequality max{0, b a} (b x) + max{0, x a} which holds for any x b. Let Ai,ψ and Bi,ψ denote the first and second terms, respectively, of (3). The following two claims bound the sums of the first and the second terms over i and ψ. Claim A.5. n X ψ Ω Ai,ψ λ opt B(L). Claim A.6. n X ψ Ω Bi,ψ O(λ opt B(L)). Before proving these claims, we first complete the proof of the lemma assuming them. Combining (3) and the two claims above, we have i=1 d TV(ω, ω(i)) O(λ opt B(L)) + 1 = O log |Ω| Published as a conference paper at ICLR 2023 Theorem A.1 follows by combining Lemmas A.2 and A.4. Proof of Claim A.5. We have exp(λ optψ) exp(λ opt(i) ψ ) P ψ Ωexp(λ optψ ) = Pr[ω = ψ] 1 exp(λ opt(i) ψ ) exp(λ optψ) = Pr[ω = ψ] 1 exp λ (optψ opt(i) ψ ) λ Pr[ω = ψ](optψ opt(i) ψ ), where the inequality is from 1 e x x for any x R. Therefore, we have exp(λ optψ) exp(λ opt(i) ψ ) P ψ Ωexp(λ optψ ) ψ Ω Pr[ω = ψ](optψ opt(i) ψ ) ψ Ω Pr[ω = ψ]optψ (by Lemma A.3) λ opt B(L). as desired. Proof of Claim A.6. We first note that 0, exp(λ opt(i) ψ ) P ψ Ωexp(λ optψ ) exp(λ(i) opt(i) ψ ) P ψ Ωexp(λ(i) opt(i) ψ ) 0, exp(λ opt(i) ψ ) P ψ Ωexp(λ optψ ) exp(λ(i) opt(i) ψ ) P ψ Ωexp(λ optψ ) 0, exp(λ opt(i) ψ ) P ψ Ωexp(λ optψ ) 1 exp opt(i) ψ (λ λ(i)) ) max n 0, Pr[ω = ψ]opt(i) ψ (λ λ(i)) o opt B(L) Pr[ω = ψ]|λ λ(i)|. Also, we have log |Ω| opt B(L) log |Ω| opt B(L(i)) log |Ω| opt B(L) log |Ω| opt B(L) log |Ω| opt B(L(i)) log |Ω| opt B(L) , log |Ω| opt B(L(i)) log |Ω| opt B(L) log |Ω| opt B(L) ( log |Ω| log |Ω| 1 opt B(L(i)) 1 opt B(L) 1 opt B(L) log |Ω| log |Ω| log |Ω| + λ opt B(L) opt B(L(i)) opt B(L(i)) Published as a conference paper at ICLR 2023 log |Ω| log |Ω| log |Ω| + 2λ opt B(L) opt B(L(i)) opt B(L) (by opt B(L(i)) 1) = O(λ), (by Lemma A.3) Combining the two inequalities above, we obtain ψ Ω opt B Pr[ω = ψ]|λ λ(i)| = O (λ opt B(L)) . B ANALYSIS OF STABLEDT In this section, we analyze STABLEDT and prove Theorem 4.1 Proof of the first claim of Theorem 4.1. Let L0 be the input training data (so that we can use L to denote other sets). We prove the following by backward induction on depth. E[s(STABLEDT (L, B, ϵ, d), L)] (1 ϵ)B dopt B(L). Then, the statement holds by setting d = 0. The claim clearly holds when d = B because we output the optimal label. Suppose that the claim holds for depth more than d. Consider a particular call STABLEDT (L, B, ϵ, d), and let ϕ denote the output decision tree, let ω be the decision rule used in the root node of ϕ, and let Lω L and Lω R denote the two training datas obtained from L by splitting it according to ω. Note that these are random variables. Then, we have E ϕ[s(ϕ, L)] = X ψ Ω Pr[ω = ψ] E[s(STABLEDT (Lψ L, B, ϵ, d + 1), Lψ L)] + E[s(STABLEDT (Lψ R, B, ϵ, d + 1), Lψ R)] ψ Ω Pr[ω = ψ] (1 ϵ)B d 1opt B d 1(Lψ L) + (1 ϵ)B d 1opt B d 1(Lψ R) (1 ϵ)B d 1 X ψ Ω Pr[ω = ψ] optψ,B d(L) ψ Ω Pr[ω = ψ](1 ϵ)opt B d(L) (1 ϵ)B dopt B d(L), where the first inequality is based on the induction hypothesis and the second to last inequality is based on Theorem A.1. Proof of the second claim of Theorem 4.1. For notational simplicity, we drop the arguments B and ϵ when calling STABLEDT (L, B, ϵ, d), because they are fixed in this proof. Additionally, we write STABLEDT instead of STABLEDT . Let L0 = ((x1, y1), . . . , (xn, yn)) be the input training data (so that we can use L to denote other sets). For a subset L of L0 and i {1, 2, . . . , n}, let L(i) := L \ {(xi, yi)}. For 0 d B, let Ld,1, . . . , Ld,2d be the sets on which STABLEDT is called at depth d (if the number of sets on which STABLEDT is called at depth d is less than 2d, we append empty sets). We can order them so that STABLEDT (Ld,j, d) calls STABLEDT (Ld+1,2j 1, d + 1) and STABLEDT (Ld+1,2j, d + 1) (if STABLEDT (Ld,j, d) does not make recursive calls, we set Ld+1,2j 1 = Ld+1,2j = ). Published as a conference paper at ICLR 2023 For fixed {LB,j}j, we have j=1 d EM(STABLEDT (LB,j, B), STABLEDT (L(i) B,j, B)) j=1 1[(xi, yi) LB,j] = i=1 1[(xi, yi) L] = |L| because the output changes only when (xi, yi) L and the output change is bounded by one. Let 0 d < B. Let ωd,j and ω(i) d,j be the ω values used in STABLEDT (Ld,j, d) and STA- BLEDT (L(i) d,j, d), respectively. Note that they are random variables. For a rule ω, Let Lω d+1,2j 1 and Lω d+1,2j be the two sets obtained by partitioning Ld,j according to ω. Then for fixed {Ld,j}j, we have j=1 d EM(STABLEDT (Ld,j, d), STABLEDT (L(i) d,j, d)) d TV(ωd,j, ω(i) d,j) 2B d + E ωd,j d EM(STABLEDT (Lωd,j d+1,2j 1, d + 1), STABLEDT (Lωd,j,(i) d+1,2j 1, d + 1)) + E ωd,j d EM(STABLEDT (Lωd,j d+1,2j, d + 1), STABLEDT (Lωd,j,(i) d+1,2j , d + 1)) C 2B d 2d X j=1 E ωd,j d EM(STABLEDT (Lωd,j d+1,j, d + 1), STABLEDT (Lωd,j,(i) d+1,j , d + 1)) (by Lemma A.1) C 2B log |Ω| j=1 E ωd,j d EM(STABLEDT (Lωd,j d+1,j, d + 1), STABLEDT (Lωd,j,(i) d+1,j , d + 1)), where C > 0 is some universal constant. By backward induction, we obtain for any d and fixed {Ld,j}j j=1 d EM(STABLEDT (Ld,j, d), STABLEDT (L(i) d,j, d)) C 2B(B d)log |Ω| for every 0 d B. By setting d = 0, we obtain the claim. C MISSING PROOFS OF SECTION 5 In this section, we prove Theorem 5.1. We discuss modifications to STABLEDR and STABLEDT in Sections C.1 and C.2, respectively. Published as a conference paper at ICLR 2023 Algorithm 4: 1 Procedure SEEDEDSTABLEDR(L, B, ϵ, π) 2 λ 2 log |Ω| ϵ opt B(L); 3 while true do 4 Sample ω Ωuniformly at random using π; 5 Sample τ [0, 1] uniformly at random using π; 6 pω be the probability of choosing ω as given in STABLEDR; 7 if pω > τ then return ω; 8 Procedure SEEDEDSTABLEDT (L, B, ϵ, d, j, π) 9 if |L| 1 or d = B then 10 return an optimal label for L. 11 ω SEEDEDSTABLEDR(L, B, ϵ, π); 12 Partition L into LL LR according to ω; 13 πL (π1, π3, . . .) and πR (π2, π4, . . .); 14 ϕL SEEDEDSTABLEDT (LL, B, ϵ, d + 1, 2j, πL); 15 ϕR SEEDEDSTABLEDT (LR, B, ϵ, d + 1, 2j + 1, πR); 16 Let ϕω be the decision tree such that the root node t has rule ω and the left and right children of t are ϕL and ϕR, respectively; 17 return ϕω. 18 Procedure SEEDEDSTABLEDT(L, B, ϵ, π) 19 return SEEDEDSTABLEDT (L, B, ϵ, 0, 1, π). C.1 DECISION RULE SELECTION In STABLEDR, we sampled a rule ω Ωby the exponential mechanism Mc Sherry & Talwar (2007). To bound the expected deterministic average sensitivity over random bits, we perform the following rejection sampling. We first sample a rule ω Ωand threshold τ [0, 1] uniformly at random by using π. If the threshold τ is more than the probability pω that we sample ω in the exponential mechanism, then we output ω. Otherwise, we repeat the same process again. The details are given as SEEDEDSTABLEDR in Algorithm 4. The following lemma shows that the distributions of STABLEDR and DERANDOMIZEDSTABLEDR are the same and the derandomized average sensitivity of the latter can be bounded from above by the average sensitivity of the former. Lemma C.1. Let ω = STABLEDR (L, B, ϵ), ω(i) = STABLEDR (L(i), B, ϵ), ωπ = DERANDOMIZEDSTABLEDR (L, B, ϵ, π), ω(i) π = DERANDOMIZEDSTABLEDR (L(i), B, ϵ, π). Then, the distribution of ω and that of ωπ over π are the same. Moreover for any i {1, 2, . . . , n}, we have Pr π [ωπ = ω(i) π ] 2d TV(ω, ω(i)). Proof. The first claim is clear from the design of DERANDOMIZEDSTABLEDR. Now we see the second claim. Let Z = P ψ Ωexp(λ optψ,B(L)) and let Z(i) = P optψ,B(L(i))). For ω Ω, we let pω = exp(λ optω,B(L))/Z. For ω Ω, we let p(i) ω = exp(λ optω,B(L(i)))/Z(i), and for ω Ω\ Ω, we let p(i) ω = 0. Then, we have Pr π [ωπ = ω(i) π ] X ψ Ω Pr τ [min{pψ, p(i) ψ } < τ < max{pψ, p(i) ψ }] Published as a conference paper at ICLR 2023 ψ Ω |pψ p(i) ψ | = 2d TV(ω, ω(i)). By the analysis of STABLEDR and Lemma C.1, we obtain the following: Theorem C.2. Let ωπ = SEEDEDSTABLEDR(L, B, ϵ, π). We have Eπ[optωπ,B(L)] (1 ϵ)opt B(L). Moreover for ω(i) π = SEEDEDSTABLEDR(L(i), B, ϵ, π), we have Eπ h Pn i=1 d DT(ωπ, ω(i) π ) i = O log |Ω| C.2 DECISION TREE CONSTRUCTION We now explain the modification to STABLEDT. Let Ld,1, . . . , Ld,2d be the sets on which our algorithm is called at depth d as defined in the proof in Section B. Then, we want to make sure that the same random bits are used when processing particular Ld,j no matter whether the input training data is L or L(i) (1 i n). To this end, at each node in the decision tree, we split the random bits π = (π1, π2, . . .) into πL = (π1, π3, . . .) and πR = (π2, π4, . . .), and then pass πL and πR on to the nodes for Ld+1,2j and Ld+1,2j+1, respectively. See Algorithm 4 for details. We replace Theorem A.1 with Theorem C.2 in the proof of Theorem 4.1, and we obtain Theorem 5.1. D ADDITIONAL RESULTS D.1 DETAILED RESULTS IN SECTION 7.2 In Section 7.2, we reported the trends of average sensitivity and accuracies over ϵ on small datasets, breast cancer and diabetes. Here, we show the detailed results (i) with error bars, and (ii) with a relaxed version of the tree distance. For (i), in addition to the average results, we also show their variations. More specifically, we report the 25 and 75 percentiles of the results over 10 random realizations of the sampled training data. For (ii), we adopt a relaxed version of the tree distance in Algorithm 5. In the original tree distance in Algorithm 2, we regarded that two trees are (completely) different when their top rules are different (Line 6). In the relaxed version in Algorithm 5, we regard that two trees are completely different only when the features used in the top rules are different. With this relaxation, we regard two subtrees with similar top rules such as ω : u 7 1[u1 1.0] and ω : u 7 1[u1 1.01] as identical. Algorithm 5: 1 Procedure DISTANCE (ϕ, ϕ ) 2 Let t and t be the root nodes of ϕ and ϕ , respectively; 3 Let ω.feature be the feature used in ω; 4 if both t and t are leaves then 5 return 0 if yt = yt and 2 otherwise. 6 else if either t or t is a leaf then return |ϕ| + |ϕ |; 7 else if ωt.feature = ωt .feature then return |ϕ| + |ϕ |; 9 Let ϕL, ϕR be the decision trees rooted at the left and right children of t, respectively; 10 Let ϕ L, ϕ R be the decision trees rooted at the left and right children of t , respectively; 11 return DISTANCE (ϕL, ϕ L) + DISTANCE (ϕR, ϕ R). Figures 5 and 6 show the detailed results on breast cancer and diabetes, respectively. In the figures, we show the 25 and 75 percentiles using colored shades. The figures named Sensitivity and Sensitivity are the average sensitivity computed using the original distance and the relaxed distance, respectively. The figures confirm that the decrease of the average sensitivity for ϵ > 0.1 will be sufficiently significant, in particular for the number of data removal m = 1% and 10%. The figures on Sensitivity and Sensitivity also confirm that the average sensitivity measured by using the original tree distance Published as a conference paper at ICLR 2023 and the relaxed tree distance are almost identical, implying the choice of the tree distance will only have negligible impacts to the results. 10 2 10 1 100 10 3 Average Sensitivity (a) Sensitivity (m = 1) 10 2 10 1 100 10 2 Average Sensitivity (b) Sensitivity (m = 1%) 10 2 10 1 100 10 1 Average Sensitivity (c) Sensitivity (m = 10%) The number m of removed data points 10 2 10 1 100 10 3 Average Sensitivity (d) Sensitivity (m = 1) 10 2 10 1 100 10 2 Average Sensitivity (e) Sensitivity (m = 1%) 10 2 10 1 100 10 1 Average Sensitivity (f) Sensitivity (m = 10%) 10 2 10 1 100 0.00748 0.00750 0.00752 0.00754 0.00756 0.00758 0.00760 0.00762 Training Accuracy (g) Training accuracy (m = 1) 10 2 10 1 100 0.00748 0.00750 0.00752 0.00754 0.00756 0.00758 0.00760 0.00762 Training Accuracy (h) Training accuracy (m = 1%) 10 2 10 1 100 0.00748 0.00750 0.00752 0.00754 0.00756 0.00758 0.00760 0.00762 Training Accuracy (i) Training accuracy (m = 10%) 10 2 10 1 100 Test Accuracy (j) Test accuracy (m = 1) 10 2 10 1 100 Test Accuracy (k) Test accuracy (m = 1%) 10 2 10 1 100 Test Accuracy (l) Test accuracy (m = 10%) Figure 5: Detailed results on average sensitivity and accuracy of the trained trees over different ϵ on breast cancer. The figures named Sensitivity and Sensitivity are the average sensitivity computed using the original distance and the relaxed distance, respectively. D.2 TEST ACCURACY For the experiments in Section 7, Figure 7 shows the trade-off curves between average sensitivity and test accuracy when ϵ is changed. Published as a conference paper at ICLR 2023 10 2 10 1 100 10 4 Average Sensitivity (a) Sensitivity (m = 1) 10 2 10 1 100 10 3 Average Sensitivity (b) Sensitivity (m = 1%) 10 2 10 1 100 10 2 Average Sensitivity (c) Sensitivity (m = 10%) The number m of removed data points 10 2 10 1 100 10 4 Average Sensitivity (d) Sensitivity (m = 1) 10 2 10 1 100 10 3 Average Sensitivity (e) Sensitivity (m = 1%) 10 2 10 1 100 10 2 Average Sensitivity (f) Sensitivity (m = 10%) 10 2 10 1 100 Training Accuracy (g) Training accuracy (m = 1) 10 2 10 1 100 Training Accuracy (h) Training accuracy (m = 1%) 10 2 10 1 100 Training Accuracy (i) Training accuracy (m = 10%) 10 2 10 1 100 Test Accuracy (j) Test accuracy (m = 1) 10 2 10 1 100 Test Accuracy (k) Test accuracy (m = 1%) 10 2 10 1 100 Test Accuracy (l) Test accuracy (m = 10%) Figure 6: Detailed results on average sensitivity and accuracy of the trained trees over different ϵ on diabetes. The figures named Sensitivity and Sensitivity are the average sensitivity computed using the original distance and the relaxed distance, respectively. 0.0 0.2 0.4 0.6 0.8 Average Sensitivity Test Accuracy (a) cod-rna 0.0 0.2 0.4 0.6 0.8 1.0 Average Sensitivity Test Accuracy (b) covtype 0.0 0.2 0.4 0.6 0.8 Average Sensitivity 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 Test Accuracy The number m of removed data points m = 1 m = 1% m = 3% m = 10% m = 30% m = 1 m = 1% m = 3% m = 10% m = 30% 0.0 0.1 0.2 0.3 0.4 0.5 Average Sensitivity Test Accuracy 0.0 0.2 0.4 0.6 0.8 Average Sensitivity Test Accuracy (e) sensorless 0.0 0.2 0.4 0.6 0.8 1.0 Average Sensitivity Test Accuracy (f) webspam Figure 7: Trade-off curves between average sensitivity and test accuracy when ϵ is changed. We varied the number of training data points to be removed from one to 30% of the sampled training data. White markers denote the results for the greedy tree learning.