Fast Provably Robust Decision Trees and Boosting

Jun-Qi Guo 1, Ming-Zhuo Teng 1, Wei Gao 1, Zhi-Hua Zhou 1

Abstract
Learning with adversarial robustness has been a challenge in contemporary machine learning, and recent years have witnessed increasing attention on robust decision trees and ensembles, mostly working with high computational complexity or without guarantees of provable robustness. This work proposes the Fast Provably Robust Decision Tree (FPRDT) with the smallest computational complexity O(n log n), a trade-off between global and local optimization over the adversarial 0/1 loss. We further develop the Provably Robust AdaBoost (PRAdaBoost) based on our robust decision trees, and present a convergence analysis of the training adversarial 0/1 loss. We conduct extensive experiments to support our approaches; in particular, our approaches are superior to unprovably robust methods, and achieve better or comparable performance to provably robust methods with the smallest running time.

1. Introduction
A well-trained model may be heavily misled by examples with adversarial perturbations (Szegedy et al., 2014; Goodfellow et al., 2015), which has been a big challenge in machine learning. A large number of defensive algorithms have been developed to deal with adversarial examples (Goodfellow et al., 2015; Papernot et al., 2016; Madry et al., 2018; Liu et al., 2018; Akhtar et al., 2018; Liao et al., 2018), whereas most are heuristic without theoretical guarantees of robustness. Provably robust training has been an effective way to guarantee robustness against adversarial perturbations, and the basic idea is to exactly optimize the adversarial loss or an upper bound of it (Mirman et al., 2018; Raghunathan et al., 2018a;b; Wong & Kolter, 2018; Wong et al., 2018; Cohen et al., 2019; Li et al., 2019; Balunović & Vechev, 2020).

1 National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China. Correspondence to: Wei Gao.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Table 1. Comparisons of computational complexity and provable robustness for robust decision trees and boosting (n denotes the number of training examples).

| Method | Comp. complexity | Prov. robustness |
| --- | --- | --- |
| Our FPRDT | O(n log n) | ✓ |
| Our PRAdaBoost | O(n log n) | ✓ |
| ROCT (Vos & Verwer, 2021b) | O(exp(n)) | ✓ |
| TREANT (Calzavara et al., 2020) | O(n^2) | ✓ |
| PRB (Andriushchenko & Hein, 2019) | O(n^2) | ✓ |
| GROOT (Vos & Verwer, 2021a) | O(n log n) | ✗ |
| RIGDT-heuristic (Chen et al., 2019a) | O(n log n) | ✗ |
| RGBDT (Chen et al., 2019a) | O(n log n) | ✗ |
| AdvBoost (Kantchelian et al., 2016) | O(n log n) | ✗ |

Recent years have witnessed increasing attention on robust decision trees and ensembles, following the discovery of adversarial examples and attacks on tree models (Kantchelian et al., 2016; Chen et al., 2019b; Cheng et al., 2019). Kantchelian et al. (2016) produced approximate adversarial examples and trained decision trees iteratively on the original data together with the new adversarial examples, and Andriushchenko & Hein (2019) proposed provably robust boosting via an upper bound of the adversarial loss. Chen et al. (2019a) and Vos & Verwer (2021a) constructed robust decision trees and ensembles via adversarial Gini impurity or information gain. Calzavara et al. (2020) presented a provably robust decision tree with additional constraints over examples,
and Vos & Verwer (2021b) suggested globally optimal robust decision trees. Table 1 summarizes the provable robustness and computational complexities of different methods.

This work presents fast algorithms for provably robust decision trees and boosting, and the main contributions can be summarized as follows:

- We propose the Fast Provably Robust Decision Tree (FPRDT), a greedy recursive approach that directly minimizes the adversarial 0/1 loss. Our approach optimizes over all leaves together with the two children of the splitting leaf, which differs both from global optimization over all internal and leaf nodes and from local optimization over the single splitting node with its children. Our approach takes O(n log n) computational complexity.

- We further develop the Provably Robust AdaBoost (PRAdaBoost), a boosting algorithm that minimizes an upper bound on the adversarial exponential loss, whose base learners are constructed with our FPRDT with slight modifications. We present an exponential convergence analysis of the training adversarial 0/1 loss for our approach, and obtain the smallest O(n log n) computational complexity.

- We finally present extensive experiments to validate the effectiveness of our approaches. Specifically, our approaches are clearly superior to unprovably robust methods for decision trees and tree ensembles, and achieve better or comparable performance to provably robust methods with the smallest running time.

The rest of this work is organized as follows: Section 2 presents some preliminaries. Section 3 proposes the FPRDT approach. Section 4 develops the PRAdaBoost approach. Section 5 conducts extensive experiments, and Section 6 concludes with future work.

2. Preliminaries
Let X ⊆ R^d be the instance/input space, and let Y = {-1, +1} denote the output space for binary classification. Suppose that D is an underlying distribution over the product space X × Y. The distribution D is unknown in practice, and what we can observe is a training sample S_n = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each example is drawn i.i.d. from D. Let N_ϵ(x) denote the ball with center x and radius ϵ under the L_∞ metric, that is,
    N_ϵ(x) = { z : ‖z - x‖_∞ ≤ ϵ }.
Let H be a hypothesis space, and let a loss function ℓ(h(x), y) measure the prediction performance of hypothesis h ∈ H on example (x, y). Given example x, an adversarial example x_adv is given by
    x_adv ∈ arg max_{x' ∈ N_ϵ(x)} ℓ(h(x'), y).
Given training sample S_n, adversarial robust learning aims to find a hypothesis ĥ ∈ H (Madry et al., 2018) such that
    ĥ ∈ arg min_{h ∈ H} Σ_{i=1}^{n} max_{x' ∈ N_ϵ(x_i)} ℓ(h(x'), y_i).    (1)
One feasible method is to solve the inner maximization of Eqn. (1) approximately at each step, known as adversarial training (Goodfellow et al., 2015).

We introduce some notation used throughout this work. Let I[·] denote the indicator function, which returns 1 if its argument is true and 0 otherwise. Let ∅ stand for the empty set, and write [k] = {1, 2, ..., k} for an integer k > 0.

3. Fast Provably Robust Decision Trees
A decision tree is constructed from internal and leaf nodes by partitioning the training data recursively, and a prediction follows the path from the root to a leaf. For a decision tree with m leaf nodes, we can associate m corresponding rectangular cells B_1, B_2, ..., B_m as follows:
    B_j = (a_1^j, b_1^j] × ··· × (a_d^j, b_d^j]  for j ∈ [m].
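Whether an adversarial copy of an example can reach a given leaf cell reduces to an axis-aligned interval intersection test between the L_∞ ball and the cell. The following is a minimal sketch of that test; the function name `ball_intersects_cell` is ours and not from the paper.

```python
import numpy as np

def ball_intersects_cell(x, eps, low, high):
    """Return True iff the L-infinity ball N_eps(x) intersects the half-open
    rectangle (low_1, high_1] x ... x (low_d, high_d].

    The intersection is non-empty exactly when, in every dimension, the
    interval [x_t - eps, x_t + eps] overlaps (low_t, high_t].
    """
    x = np.asarray(x, dtype=float)
    return bool(np.all((x + eps > low) & (x - eps <= high)))

# toy example: a point just outside a 2-D cell but within reach of it
low, high = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(ball_intersects_cell([1.05, 0.5], eps=0.1, low=low, high=high))  # True
print(ball_intersects_cell([1.30, 0.5], eps=0.1, low=low, high=high))  # False
```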
Given an example (x_i, y_i) ∈ S_n and a leaf node B_j, we further introduce the set of adversarial examples of x_i falling into the j-th leaf node B_j:
    E_ij = { x ∈ X : x ∈ N_ϵ(x_i) and x ∈ B_j }.
Let h(·) denote the prediction function of a decision tree. We have h(x) = h(x') for every x, x' ∈ B_j, and it follows that h(x) = h(x') for every x, x' ∈ E_ij. By traversing all leaf nodes of a decision tree, optimizing Eqn. (1) is equivalent to minimizing the following objective
    Σ_{i=1}^{n} max_{j ∈ [m]} { ℓ(h(x), y_i) I[x ∈ E_ij] },    (2)
where ℓ(h(x), y_i) I[x ∈ E_ij] = 0 by default when E_ij = ∅.

We directly use the 0/1 loss, rather than traditional surrogate losses such as the square loss or the softmax loss, as the splitting criterion during the construction of the decision tree, that is,
    ℓ(h(x), y_i) = I[h(x) ≠ y_i].
One reason is that the consistency between such surrogate losses and the 0/1 loss does not hold in the adversarial setting (Awasthi et al., 2021). Another reason is efficiency: the robust optimization of a splitting node takes O(1) computational complexity per candidate threshold with the 0/1 loss, while surrogate losses would require O(n) computational complexity via gradient descent. The details can be found in Appendix A.

Following the classic top-down method (Rokach & Maimon, 2005), consider one split with previous optimal loss L̂_pre of Eqn. (2). Suppose the p-th leaf node is to be split, and let t and η be the splitting feature and threshold, respectively. For its left and right child, denote by v_l and v_r their respective values, and let B_l and B_r be their associated rectangular cells; let v_j be the value of the j-th leaf node for j ≠ p. The objective loss of Eqn. (2) at one split can be written as
    L̂(t, η, v_l, v_r) = Σ_{i=1}^{n} max{ I[v_l ≠ y_i] I[E_il ≠ ∅], K_ip, I[v_r ≠ y_i] I[E_ir ≠ ∅] },
with K_ip = max_{j ∈ [m]\{p}} { I[v_j ≠ y_i] I[E_ij ≠ ∅] } ∈ {0, 1}. Here E_il and E_ir denote the sets N_ϵ(x_i) ∩ B_l and N_ϵ(x_i) ∩ B_r, respectively. The one-split goal for robust training is to solve
    {t*, η*, v_l*, v_r*} ∈ arg min_{t, η, v_l, v_r} L̂(t, η, v_l, v_r).

Algorithm 1 Fast Provably Robust Decision Tree (FPRDT)
  Input: dataset S_n, perturbation size ϵ
  Output: a tree T
  Initialize: root of T with the majority class in S_n
    L̂ = min{ Σ_{i=1}^{n} I[y_i = +1], Σ_{i=1}^{n} I[y_i = -1] }
    E_i = 0 for examples of the majority class, and 1 otherwise
  FPRDT-recursive(S_n, root of T, ϵ, L̂)
  Return: T

To calculate K_ip efficiently, let E_i (i ∈ [n]) be the number of leaves whose value differs from y_i and that contain adversarial examples of x_i, i.e.,
    E_i = Σ_{j ∈ [m]} I[v_j ≠ y_i] I[E_ij ≠ ∅],
and it follows that K_ip = 1 if and only if
    E_i - I[v_p ≠ y_i] I[E_ip ≠ ∅] ≥ 1,    (3)
where v_p denotes the value of the p-th leaf node.

Following the robust decision trees of Vos & Verwer (2021a), we traverse all potential features t ∈ [d] and all thresholds in
    W_t = ∪_{i ∈ [n], E_ip ≠ ∅} { x_it - ϵ, x_it, x_it + ϵ }.    (4)
For any fixed splitting feature t' and threshold η', it is easy to observe that
    I[E_il ≠ ∅] = I[E_ip ≠ ∅ and x_it' ≤ η' + ϵ],    (5)
    I[E_ir ≠ ∅] = I[E_ip ≠ ∅ and x_it' > η' - ϵ],    (6)
where x_it' is the value of the t'-th feature of instance x_i. We then solve the following problem
    (v_l', v_r') = arg min_{v_l, v_r ∈ {0,1}} L̂(t', η', v_l, v_r),    (7)
which takes O(1) computational complexity per threshold based on a loss decomposition once W_t is sorted. Note that it would take O(n) computational complexity if we used surrogate losses instead. The detailed solution and discussion can be found in Appendix A.
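To make the split search concrete, the sketch below scans one feature over the candidate thresholds of Eqn. (4), applies the reachability tests of Eqns. (5)-(6), and picks the child values of Eqn. (7) by the adversarial 0/1 loss. It is a simplified O(n) scan per threshold for clarity; the paper's incremental counters (Appendix A) reduce the per-threshold cost to O(1). Names and the {0,1} label encoding are our own illustrative choices.

```python
import numpy as np

def best_split_for_leaf(x_t, y, reach, K, eps):
    """Search one feature for the best robust split of a single leaf.

    x_t   : feature-t values of all training points (shape [n])
    y     : labels encoded in {0, 1}
    reach : reach[i] is True iff E_ip != {} (N_eps(x_i) reaches this leaf)
    K     : K[i] in {0, 1}, worst-case loss of x_i over the *other* leaves
    eps   : perturbation radius

    Returns (best_loss, best_threshold, v_left, v_right).
    """
    # Eqn. (4): candidate thresholds derived from reachable points only
    cands = np.unique(np.concatenate(
        [x_t[reach] - eps, x_t[reach], x_t[reach] + eps]))
    best = (np.inf, None, None, None)
    for eta in cands:
        # Eqns. (5)-(6): can an adversarial copy of x_i land in each child?
        in_left = reach & (x_t <= eta + eps)
        in_right = reach & (x_t > eta - eps)
        # Eqn. (7): try the four possible label assignments for the children
        for vl in (0, 1):
            for vr in (0, 1):
                loss_i = np.maximum(
                    K, np.maximum((vl != y) & in_left, (vr != y) & in_right))
                loss = int(loss_i.sum())
                if loss < best[0]:
                    best = (loss, float(eta), vl, vr)
    return best
```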
After traversing all potential features and thresholds and solving the respective minimizations of Eqn. (7), we obtain the final optimal objective loss
    L̂* = L̂(t*, η*, v_l*, v_r*) = min_{t' ∈ [d], η' ∈ W_t'} min_{v_l, v_r ∈ {0,1}} L̂(t', η', v_l, v_r)
with the optimal solution {t*, η*, v_l*, v_r*}. We split the p-th leaf node only if L̂* < L̂_pre, that is, the current optimal loss L̂* is smaller than the previous optimal loss L̂_pre. For a splitting node, we update
    E_i = E_i + I[v_l* ≠ y_i] I[E_il ≠ ∅] + I[v_r* ≠ y_i] I[E_ir ≠ ∅] - I[v_p ≠ y_i] I[E_ip ≠ ∅].    (8)

Algorithm 2 FPRDT-recursive
  Input: dataset S_n, split leaf p, perturbation size ϵ, previous optimal loss L̂_pre
  Output: none
  Initialize L̂* ← +∞
  Compute K_ip (i ∈ [n]) according to Eqn. (3)
  for t' = 1 to d do
    Compute W_t' according to Eqn. (4)
    for η' in sorted(W_t') do
      Compute I[E_il ≠ ∅] and I[E_ir ≠ ∅] by Eqns. (5)-(6)
      Solve v_l' and v_r' from Eqn. (7) with optimal loss L̂
      if L̂ < L̂* then
        t* = t', η* = η', v_l* = v_l', v_r* = v_r', L̂* = L̂
      end if
    end for
  end for
  if L̂* < L̂_pre then
    Split leaf p via {t*, η*, v_l*, v_r*}, and obtain children l, r
    Update E_i (i ∈ [n]) according to Eqn. (8)
    FPRDT-recursive(S_n, l, ϵ, L̂*)
    FPRDT-recursive(S_n, r, ϵ, L̂*)
  end if

Algorithms 1 and 2 present the detailed description of our Fast Provably Robust Decision Tree (FPRDT) approach, where E_i (i ∈ [n]) are defined as global variables.

We now present the computational complexity analysis of our FPRDT approach at one split. It takes O(n) computational complexity to update E_i via Eqn. (8) for i ∈ [n], as well as O(n) computational complexity to compute K_ip via Eqn. (3). There are at most 3n candidate thresholds per feature from Eqn. (4), and sorting W_t takes O(n log n) computational complexity. In summary, our proposed FPRDT method takes O(n log n) computational complexity.

Comparisons with previous work. Chen et al. (2019a) developed robust decision trees based on the adversarial Gini impurity, and Vos & Verwer (2021a) further introduced robust decision trees with better training efficiency in terms of closed-form solutions. Yang et al. (2019) proved the equivalence between the Gini impurity and the square loss in the construction of regular decision trees, whereas we prove that such an equivalence does not hold in the adversarial setting:

Theorem 3.1. Optimizing the adversarial Gini impurity of robust decision trees is equivalent to minimizing a lower bound of Σ_{i=1}^{n} max_{j ∈ [m]} { (1 - h(x) y_i)^2 I[x ∈ E_ij] }.

From this theorem, we can see that the equivalence between the adversarial Gini impurity and the adversarial square loss does not hold, which shows that algorithms optimizing the adversarial Gini impurity are not provably robust. The detailed proof of Theorem 3.1 is presented in Appendix B.

Calzavara et al. (2020) introduced provably robust decision trees of O(n^2) computational complexity with additional constraints over examples, and Vos & Verwer (2021b) gave robust optimal decision trees based on a mixed-integer linear program, an NP-hard problem with exponential computational complexity. In comparison, our FPRDT approach takes the smallest O(n log n) computational complexity among all provably robust methods.

In our FPRDT approach, we introduce the maximal loss K_ip over all leaves except the splitting leaf. This differs from the robust optimal decision tree (Vos & Verwer, 2021b), which performs global optimization over all internal and leaf nodes. It also differs from local optimization (Chen et al., 2019a; Vos & Verwer, 2021a), which only focuses on the splitting leaf with its children, regardless of the other leaves. Our method can be viewed as a trade-off between global and local optimization. A small sketch of the bookkeeping behind Eqns. (3) and (8) is given below.
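The O(n) bookkeeping that keeps the per-threshold cost constant is captured by Eqns. (3) and (8); a minimal sketch follows, again with illustrative names and a {0,1} label encoding that are our own.

```python
import numpy as np

def compute_K(E, reach_p, v_p, y):
    """Eqn. (3): K_ip = 1 iff some leaf other than p already misclassifies an
    adversarial copy of x_i.  E[i] counts the leaves whose value differs from
    y_i and that N_eps(x_i) can reach."""
    own_error = ((v_p != y) & reach_p).astype(int)
    return (E - own_error >= 1).astype(int)

def update_E(E, y, v_p, reach_p, v_l, reach_l, v_r, reach_r):
    """Eqn. (8): after splitting leaf p into children l and r, replace p's
    contribution to E_i by the contributions of the two children."""
    return (E
            + ((v_l != y) & reach_l).astype(int)
            + ((v_r != y) & reach_r).astype(int)
            - ((v_p != y) & reach_p).astype(int))
```

Both routines touch each example once, matching the O(n) per-split cost stated in the complexity analysis above.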
4. Provably Robust AdaBoost
The AdaBoost algorithm (Freund et al., 1996) has been one of the most influential algorithms in machine learning; the basic idea is to construct a strong classifier from a sequence of weak learners. Essentially, AdaBoost can be viewed as a procedure for optimizing the exponential loss (Schapire, 2013). This section establishes provable robustness guarantees for AdaBoost via an upper bound on the adversarial exponential loss.

Let H_t(x) = Σ_{j=1}^{t} α_j h_j(x) be the output of AdaBoost with base learners h_j ∈ H and corresponding weights α_j. The empirical adversarial exponential loss over training sample S_n is
    L̂(h, S_n) = (1/n) Σ_{i=1}^{n} max_{x' ∈ N_ϵ(x_i)} exp(-H_t(x') y_i).
It is NP-hard to directly optimize this empirical adversarial loss for tree ensembles (Kantchelian et al., 2016), and we resort to an upper bound as follows:
    L̂(h, S_n) = (1/n) Σ_{i=1}^{n} max_{x' ∈ N_ϵ(x_i)} Π_{j=1}^{t} exp(-y_i α_j h_j(x'))
              ≤ (1/n) Σ_{i=1}^{n} Π_{j=1}^{t} max_{x' ∈ N_ϵ(x_i)} exp(-y_i α_j h_j(x')).
Such a relaxation has also been used by Andriushchenko & Hein (2019). Let w_{t,i} denote the weight of instance (x_i, y_i) after the (t-1)-th iteration; motivated by the original AdaBoost, we have
    w_{t,i} = Π_{j=1}^{t-1} max_{x' ∈ N_ϵ(x_i)} exp(-y_i α_j h_j(x')).
In the t-th iteration, we try to solve the following problem
    {α_t, h_t} ∈ arg min_{α, h} Σ_{i=1}^{n} w_{t,i} max_{x' ∈ N_ϵ(x_i)} exp(-y_i α h(x')).
We first minimize the following weighted adversarial 0/1 loss to obtain h_t, for any given α_t > 0,
    h_t = arg min_{h} Σ_{i=1}^{n} w_{t,i} max_{x' ∈ N_ϵ(x_i)} I[h(x') ≠ y_i],    (9)
which can be solved efficiently by our FPRDT approach after changing the objective loss in Eqn. (2) to its weighted version; the details can be found in Appendix C. We then solve the weight α_t by setting the derivative to zero:
    α_t = (1/2) ln( (1 - ϵ_t) / ϵ_t ),    (10)
    ϵ_t = Σ_{i=1}^{n} w_{t,i} max_{x' ∈ N_ϵ(x_i)} I[h_t(x') ≠ y_i]  /  Σ_{i=1}^{n} w_{t,i}.    (11)
We finally update the instance weights according to the recursion
    w_{t+1,i} = w_{t,i} max_{x' ∈ N_ϵ(x_i)} exp(-y_i α_t h_t(x')).    (12)

Algorithm 3 Provably Robust AdaBoost (PRAdaBoost)
  Input: dataset S_n, iteration number T, perturbation size ϵ
  Output: classifier H
  Initialize: instance weights w_{1,i} = 1 (i ∈ [n])
  for t = 1 to T do
    Solve h_t from Eqn. (9) by our modified FPRDT
    Calculate ϵ_t according to Eqn. (11)
    if ϵ_t > 1/2 then break end if
    Calculate classifier weight α_t from Eqn. (10)
    Calculate instance weights w_{t+1,i} from Eqn. (12)
  end for
  Return H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

Algorithm 3 presents the detailed description of Provably Robust AdaBoost (PRAdaBoost). It takes O(n log n) computational complexity in each iteration, since it requires O(n) and O(n log n) computational complexities to compute the instance weights and the base learner, respectively.

We present the following convergence analysis of the adversarial 0/1 loss over the training dataset for PRAdaBoost:

Theorem 4.1. Let H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ) be the final classifier of PRAdaBoost with γ_t = 1/2 - ϵ_t > 0 for each iteration t ∈ [T]. Then the empirical adversarial 0/1 loss over training sample S_n satisfies
    (1/n) Σ_{i=1}^{n} max_{x' ∈ N_ϵ(x_i)} I[H(x') ≠ y_i] ≤ exp( -2 Σ_{t=1}^{T} γ_t^2 ).

This theorem shows an exponential convergence rate of the training adversarial 0/1 loss for PRAdaBoost when every base learner performs slightly better than random guessing in the adversarial setting, that is, γ_t = 1/2 - ϵ_t ≥ γ for some small γ > 0. The detailed proof is presented in Appendix D. A minimal sketch of one boosting round is given below.
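The following sketch renders Eqns. (9)-(12) in code. The functions `fit_weighted_fprdt` and `adv_margin` are placeholders we assume are available: the former stands for the modified FPRDT of Appendix C, and the latter should return the worst-case margin min over N_ϵ(x_i) of y_i h(x'), which for a tree can be obtained from the leaves reachable from the ϵ-box around x_i.

```python
import numpy as np

def pradaboost(X, y, eps, T, fit_weighted_fprdt, adv_margin):
    """One possible rendering of Algorithm 3.  Labels y take values in {-1, +1};
    adv_margin(h, x, label, eps) = min_{x' in N_eps(x)} label * h(x')."""
    n = len(y)
    w = np.ones(n)
    learners, alphas = [], []
    for _ in range(T):
        # Eqn. (9): fit a base tree on the weighted adversarial 0/1 loss
        h = fit_weighted_fprdt(X, y, w, eps)
        margins = np.array([adv_margin(h, X[i], y[i], eps) for i in range(n)])
        # Eqn. (11): weighted adversarial error of the new base learner
        eps_t = float(np.sum(w * (margins <= 0)) / np.sum(w))
        if eps_t > 0.5:          # no better than random guessing, stop
            break
        eps_t = max(eps_t, 1e-12)  # guard against a perfect base learner
        # Eqn. (10): classifier weight, as in standard AdaBoost
        alpha = 0.5 * np.log((1.0 - eps_t) / eps_t)
        # Eqn. (12): re-weight with the worst-case exponential loss
        w = w * np.exp(-alpha * margins)
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas
```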
Comparisons with previous work. Kantchelian et al. (2016) proposed to train a sequence of robust decision trees iteratively and recursively, under the L_0 metric, based on the original data and new approximate adversarial examples. Vos & Verwer (2021a) presented robust random forests following the original idea of random forests (Breiman, 2001). Andriushchenko & Hein (2019) introduced another robust boosting algorithm via an upper bound on the adversarial exponential loss with O(n^2) computational complexity, whereas we use the modified FPRDT to construct base learners by directly optimizing the adversarial 0/1 loss with O(n log n) computational complexity. Our PRAdaBoost can be viewed as a general learning approach that can be used wherever AdaBoost and robust boosting can be applied. An interesting future direction is to study the generalization and consistency of AdaBoost and the convergence rate of random forests in adversarial settings, by analogy with the traditional studies (Bartlett & Traskin, 2006; Gao & Zhou, 2013; 2020).

5. Experiments
We conduct our experiments on 18 datasets [1,2], most of which have been used in previous work on robust decision trees and ensembles. Table 2 presents the detailed statistics, and we take the same perturbation size ϵ as previous robust studies on decision trees or tree ensembles. The features have been scaled to [-1, 1] for all datasets.

[1] https://www.openml.org/
[2] https://www.cs.toronto.edu/~kriz/cifar.html

Table 2. Datasets with perturbation size (Perb.), number of instances and number of features.

| Dataset (Perb.) | # Inst. | # Feat. | Dataset (Perb.) | # Inst. | # Feat. |
| --- | --- | --- | --- | --- | --- |
| ionos (.2) | 351 | 34 | cifar10:0v5 (.1) | 12,000 | 3,072 |
| breast (.3) | 683 | 9 | cifar10:0v6 (.1) | 12,000 | 3,072 |
| diabet (.05) | 768 | 8 | cifar10:4v8 (.1) | 12,000 | 3,072 |
| bank (.1) | 1,372 | 4 | mnist2v6 (.4) | 13,866 | 784 |
| Japan3v4 (.1) | 3,087 | 14 | mnist3v8 (.4) | 13,966 | 784 |
| har1v2 (.1) | 3,266 | 561 | F-mnist2v5 (.2) | 14,000 | 784 |
| spam (.05) | 4,601 | 57 | F-mnist3v4 (.2) | 14,000 | 784 |
| GesDvP (.01) | 4,838 | 32 | F-mnist7v9 (.2) | 14,000 | 784 |
| wine (.05) | 6,497 | 11 | mnist1v7 (.4) | 15,170 | 784 |

We take the adversarial accuracy as the predictive measure in robust classification (Goodfellow et al., 2015; Andriushchenko & Hein, 2019; Vos & Verwer, 2021a;b). For test data T = {(x_1, y_1), ..., (x_m, y_m)}, we define the test adversarial accuracy as
    (1/m) Σ_{i=1}^{m} min_{‖z - x_i‖_∞ ≤ ϵ} I(h(z) = y_i),
following the implementation of Vos & Verwer (2021a). For fair comparisons, we run all experiments on a single core without parallel optimization. Experiments are performed with Python on nodes of a computational cluster with 20 CPUs (Intel Core i9-10900X 3.7GHz), running Ubuntu with 128GB main memory.

5.1. Comparisons on Robust Decision Trees
Besides regular decision trees (Safavian & Landgrebe, 1991), we compare with the following state-of-the-art robust decision trees:
- RIGDT-h [3]: robust decision trees based on adversarial information gain (Chen et al., 2019a);
- TREANT [4]: robust decision trees with constraints over training examples (Calzavara et al., 2020);
- GROOT [5]: robust decision trees based on adversarial Gini impurity (Vos & Verwer, 2021a);
- ROCT [5]: robust decision trees with global optimization over all nodes (Vos & Verwer, 2021b);
- PRB tree [6]: robust decision trees based on an upper bound of the exponential loss (Andriushchenko & Hein, 2019).

[3] The code is taken from https://github.com/chenhongge.
[4] The code is taken from https://github.com/gtolomei/treant.
[5] The code is taken from https://github.com/tudelft-cda-lab.
[6] The code is taken from https://github.com/max-andr.

We take a maximum depth of 4 for TREANT and ROCT due to their high computational complexity, and do not restrict the depth for the other methods.
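For a single decision tree, the test adversarial accuracy defined above can be computed exactly by checking every leaf reachable from the ϵ-box around a test point; the point is robustly correct only if all reachable leaves predict its label. The sketch below uses our own leaf representation, not the authors' code.

```python
import numpy as np

def adversarial_accuracy(leaves, X, y, eps):
    """Fraction of test points whose label is predicted correctly by *every*
    leaf reachable from the L-infinity ball of radius eps around the point.

    leaves : list of (low, high, value) triples describing the leaf cells
    """
    correct = 0
    for x, label in zip(X, y):
        reachable = [v for (low, high, v) in leaves
                     if np.all((x + eps > low) & (x - eps <= high))]
        # robustly correct only if no reachable leaf disagrees with the label
        if all(v == label for v in reachable):
            correct += 1
    return correct / len(y)
```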
Table 3. Comparisons of adversarial accuracies (mean ± std). Pairwise t-tests at the 95% significance level are used to judge whether our FPRDT is significantly better or worse than each compared method. N/A means that no result was obtained within 12 hours.

| Dataset | FPRDT | Decision tree | RIGDT-h | GROOT | TREANT | ROCT | PRB tree |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ionos | .7954±.0302 | .3100±.0549 | .7015±.0874 | .7829±.0325 | .7232±.0434 | .7897±.0330 | .7601±.0346 |
| breast | .8765±.0290 | .2501±.0457 | .8381±.0317 | .8744±.0273 | .8334±.0318 | .8735±.0256 | .8706±.0258 |
| diabet | .6674±.0258 | .6333±.0346 | .5695±.0640 | .6481±.0352 | .6695±.0228 | .6552±.0244 | .6633±.0333 |
| bank | .6577±.0311 | .6333±.0346 | .4685±.0713 | .5410±.0440 | .6092±.0259 | .6539±.0167 | .6299±.0328 |
| Japan3v4 | .6673±.0129 | .5751±.0377 | .5638±.0337 | .5829±.0468 | N/A | .6671±.0170 | .5954±.0112 |
| har1v2 | .8044±.0249 | .2316±.0434 | .7074±.0257 | .8058±.0198 | N/A | .7786±.0165 | .7383±.0148 |
| spam | .7404±.0124 | .0006±.0012 | .4669±.1019 | .7231±.0188 | N/A | .4813±.0469 | .6968±.0110 |
| GesDvP | .7301±.0174 | .4783±.1390 | .5483±.1035 | .7164±.0180 | N/A | .7071±.0126 | .7014±.0190 |
| wine | .6364±.0062 | .3515±.1430 | .4032±.0375 | .6373±.0073 | .6388±.0079 | .6117±.0343 | .6299±.0080 |
| cifar10:0v5 | .6878±.0177 | .2960±.0598 | .3469±.0457 | .4847±.0608 | N/A | .6639±.0113 | .6501±.0106 |
| cifar10:0v6 | .6883±.0102 | .5878±.0568 | .4771±.0168 | .5555±.0495 | N/A | .6684±.0077 | .6833±.0051 |
| cifar10:4v8 | .6613±.0117 | .2561±.0784 | .4882±.0468 | .4727±.0195 | N/A | .6319±.0114 | .6698±.0093 |
| mnist2v6 | .8954±.0025 | .0178±.0320 | .8850±.0079 | .8725±.0110 | N/A | .7842±.0222 | .8385±.0673 |
| mnist3v8 | .8527±.0058 | .0028±.0071 | .8102±.0102 | .7575±.0185 | N/A | .7565±.0280 | .8030±.0235 |
| F-mnist2v5 | .9780±.0027 | .3214±.2143 | .9446±.0080 | .9714±.0054 | N/A | .9394±.0128 | .9761±.0025 |
| F-mnist3v4 | .8652±.0056 | .0166±.0521 | .7928±.0129 | .8193±.0104 | N/A | .8271±.0168 | .8407±.0052 |
| F-mnist7v9 | .8760±.0058 | .2930±.1554 | .8100±.0108 | .8291±.0115 | N/A | .8376±.0188 | .8399±.0106 |
| mnist1v7 | .9633±.0034 | .0036±.0081 | .9325±.0076 | .9457±.0052 | N/A | .8937±.0095 | .9196±.0264 |
| Average | .7802±.1090 | .2922±.2172 | .6530±.1877 | .7233±.1494 | — | .7342±.1134 | .7503±.1075 |
| Win/Tie/Loss | — | 18/0/0 | 18/0/0 | 15/3/0 | 16/1/1 | 15/3/0 | 16/1/1 |

We set 10 as the minimum number of instances required to split a leaf node, and each leaf node has at least 5 instances. We modify the original TREANT by adding the perturbation ϵ to one feature at a time, as in the work of Vos & Verwer (2021a). ROCT is initialized with the output of GROOT, and returns the final result after optimizing its objective loss for at most 30 minutes, as done by Vos & Verwer (2021b). More experimental settings and results can be found in Appendix E.

The performance of the compared methods is evaluated by five trials of 5-fold cross validation, and test adversarial accuracies are averaged over these 25 runs, as summarized in Table 3. These empirical results clearly show the effectiveness of our FPRDT. The regular decision tree unsurprisingly performs worst, since it does not take adversarial examples into account. Our FPRDT is empirically better than the unprovably robust algorithms RIGDT-h and GROOT, particularly on large datasets. The win/tie/loss counts show that FPRDT is clearly superior to these unprovably robust algorithms, as it wins most of the time and never loses. An intuitive explanation is that RIGDT-h and GROOT perform local optimization over splitting nodes without guarantees of provable robustness. In comparison with the provably robust algorithms, our FPRDT approach achieves better performance in most cases where they return results.
One possible reason is the high computational complexity of these methods, which prevents them from reaching optimal solutions: ROCT has exponential computational complexity yet must return a result within 30 minutes as done in (Vos & Verwer, 2021b), and TREANT does not obtain results within 12 hours once the numbers of features and instances exceed 20 and 1,000, respectively, because it requires extra computations for the constraints in its implementation (Calzavara et al., 2020). Another possible reason is local optimization: the PRB tree, for example, only focuses on the splitting node with its children. Our FPRDT approach keeps a good balance between local and global optimization by using more information over the leaf nodes.

[Figure 1: six panels plotting the training adversarial 0/1 loss against the (scaled) number of leaves for FPRDT, RIGDT-h and GROOT; the visible panel titles include cifar10:0v5 and cifar10:0v6.]
Figure 1. Comparisons of training adversarial errors in the learning process, where the number of leaf nodes is scaled to [0, 1].

We further examine the convergence of the adversarial 0/1 error during training for our FPRDT and the unprovably robust methods, as shown in Figure 1. It is clearly observable that our FPRDT keeps the adversarial error decreasing throughout the whole training process, whereas the unprovably robust methods RIGDT-h and GROOT may even increase the training adversarial error on datasets such as F-mnist3v4 and cifar10:0v5. This is because those methods optimize lower bounds of the adversarial loss without guarantees of provable robustness.
We also compare the average CPU time (in seconds) of our FPRDT and the other methods, as shown in Figure 2. Our FPRDT takes running time comparable to the local, unprovably robust methods RIGDT-h and GROOT, and the smallest running time among the provably robust methods. This is in good agreement with the computational complexity analysis in Table 1.

[Figure 2: running time (in seconds, log-scale y-axis) of the decision tree, our FPRDT, RIGDT-h, GROOT, PRB tree, ROCT and TREANT on the datasets, including cifar10:0v5, cifar10:0v6 and cifar10:4v8.]
Figure 2. Comparisons of running time (in seconds) for different methods. Note that the y-axis is in log-scale, and full black columns indicate that no result was obtained within 12 hours.

5.2. Experimental Results for Tree Ensembles
In addition to the original random forests (Breiman, 2001) and AdaBoost (Freund et al., 1996), we also compare with the following robust methods:
- RGBDT [7]: robust tree ensembles based on an approximated worst-case loss of GBDT (Chen et al., 2019a);
- RIGDT forests [8]: random forests with RIGDT-h as the base learner (Vos & Verwer, 2021a);
- GROOT forests [8]: random forests with GROOT as the base learner (Vos & Verwer, 2021a);
- PRBoosting [9]: robust tree ensembles via an upper bound of the exponential loss (Andriushchenko & Hein, 2019).

[7] The code is taken from https://github.com/chenhongge.
[8] The code is taken from https://github.com/tudelft-cda-lab.
[9] The code is taken from https://github.com/max-andr.

We do not compare with AdvBoost (Kantchelian et al., 2016) because it considers attacks under the L_0 metric.

Table 4. Comparisons of adversarial accuracies (mean ± std). Pairwise t-tests at the 95% significance level are used to judge whether our PRAdaBoost is significantly better or worse than each compared method. N/A means that no result was obtained within 12 hours.

| Dataset | PRAdaBoost | AdaBoost | Random forests | RGBDT | RIGDT forests | GROOT forests | PRBoosting |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ionos | .7960±.0329 | .0321±.0137 | .1122±.0353 | .5276±.0472 | .6565±.0414 | .7869±.0346 | .7755±.0370 |
| breast | .8793±.0263 | .0732±.0156 | .2167±.0193 | .7034±.0648 | .8437±.0280 | .8838±.0251 | .8694±.0282 |
| diabet | .6635±.0281 | .1352±.0156 | .4523±.0411 | .4560±.0371 | .5999±.0371 | .6578±.0210 | .6276±.0325 |
| bank | .6680±.0361 | .4019±.0619 | .5087±.0292 | .4427±.0333 | .5087±.0292 | .6407±.0308 | .6195±.0403 |
| Japan3v4 | .6816±.0165 | .4913±.0231 | .5187±.0184 | .5885±.0099 | .6039±.0171 | .6580±.0160 | .6874±.0167 |
| har1v2 | .8601±.0162 | .0092±.0065 | .8326±.0137 | .6417±.0165 | .4998±.0212 | .7917±.0169 | .8653±.0130 |
| spam | .7540±.0129 | .0000±.0000 | .0000±.0000 | .4888±.0349 | .6212±.0202 | .7495±.0130 | .7312±.0101 |
| GesDvP | .7315±.0165 | .1031±.0079 | .1887±.0086 | .2420±.0090 | .6459±.0118 | .7314±.0127 | .7318±.0179 |
| wine | .6397±.0069 | .0002±.0004 | .0910±.0112 | .1068±.0118 | .4161±.0129 | .6329±.0008 | .6356±.0066 |
| cifar10:0v5 | .6906±.0159 | .0083±.0035 | .3015±.0058 | .3137±.0095 | .4413±.0094 | .5262±.0123 | N/A |
| cifar10:0v6 | .6958±.0092 | .0556±.0041 | .3678±.0103 | .3446±.0091 | .5199±.0082 | .5604±.0098 | N/A |
| cifar10:4v8 | .6710±.0096 | .0019±.0014 | .2956±.0074 | .2707±.0103 | .4614±.0074 | .4983±.0068 | N/A |
| mnist2v6 | .9437±.0046 | .0000±.0000 | .0000±.0000 | .8442±.0266 | .8999±.0062 | .9249±.0045 | .9365±.0044 |
| mnist3v8 | .8829±.0106 | .0000±.0000 | .0000±.0000 | .7155±.0147 | .7756±.0081 | .8228±.0057 | .8680±.0082 |
| F-mnist2v5 | .9823±.0028 | .3852±.0552 | .4561±.0041 | .9645±.0050 | .9654±.0040 | .9791±.0029 | .9843±.0018 |
| F-mnist3v4 | .8674±.0055 | .0000±.0000 | .0441±.0311 | .7172±.0131 | .8063±.0064 | .8392±.0063 | .8640±.0054 |
| F-mnist7v9 | .8691±.0066 | .0000±.0000 | .1357±.0358 | .7401±.0164 | .8282±.0067 | .8359±.0068 | .8779±.0069 |
| mnist1v7 | .9752±.0031 | .0000±.0000 | .0000±.0000 | .7897±.1170 | .9600±.0032 | .9668±.0031 | .9781±.0028 |
| Average | .7918±.1134 | .0943±.1545 | .2512±.2279 | .5499±.2276 | .6697±.1763 | .7492±.1422 | — |
| Win/Tie/Loss | — | 18/0/0 | 18/0/0 | 18/0/0 | 18/0/0 | 15/3/0 | 9/7/2 |
We take at most 100 decision trees for each forest ensemble, and randomly select √d candidate features for splitting in random forests. We take a maximum depth of 8 for PRBoosting due to its high computational complexity, and do not restrict the depth for the other methods. We again set 10 as the minimum number of instances required to split a leaf node, and each leaf node has at least 5 instances.

[Figure 3: running time (in seconds, log-scale y-axis) of our PRAdaBoost, AdaBoost, random forests, RGBDT, RIGDT forests, GROOT forests and PRBoosting on the datasets, including cifar10:0v5, cifar10:0v6 and cifar10:4v8.]
Figure 3. Comparisons of running time (in seconds) between different methods. Note that the y-axis is in log-scale, and full black columns indicate that no result was obtained within 12 hours.

Figure 4. Visualization of the necessary minimum perturbations required to change the models' predictions. The larger the minimum perturbation, the more robust the method.

Table 4 presents comparisons of test adversarial accuracies for our PRAdaBoost and the other ensemble methods. Our PRAdaBoost obviously outperforms the original AdaBoost and random forests, which do not consider adversarial examples during training. It is evident that our approach empirically performs better than the unprovably robust methods RGBDT, RIGDT forests and GROOT forests, since the win/tie/loss counts show that our PRAdaBoost wins most of the time and never loses. From Table 4, it is also observable that our PRAdaBoost achieves performance comparable to the provably robust method PRBoosting; an intuitive reason is that the two methods optimize the same upper bound on the adversarial exponential loss, despite using different base learners. It is noteworthy that our PRAdaBoost has smaller computational complexity; for example, PRBoosting returns no results within 12 hours on the cifar10 datasets with 3,072 features.

We also compare the running time of PRAdaBoost with the other methods, as shown in Figure 3. Our PRAdaBoost takes running time comparable to the unprovably robust methods RGBDT, RIGDT forests and GROOT forests, whereas it is about 10 times faster than the provably robust PRBoosting. This is in accordance with the computational complexity analysis in Table 1.

We finally visualize the necessary minimum perturbations required to change the predictions of the robust methods, as shown in Figure 4. Here, the perturbation denotes the distance between the adversarial and the original example under the L_∞ metric, and the larger the minimum perturbation, the more robust the method. As can be seen, our PRAdaBoost requires the largest necessary minimum perturbations, which implies the strongest robustness against adversarial attacks under the L_∞ metric.

6. Conclusion
Recent years have seen increasing attention on robust decision trees and ensembles. This work presents a fast provably robust decision tree, a trade-off between global and local optimization over the adversarial 0/1 loss. We also provide a provably robust boosting method based on our decision trees with theoretical guarantees. Extensive experiments verify the effectiveness and efficiency of our approaches.
An interesting future work is to exploit other adversarial losses for provably robust decision trees, or to look for practical and tractable upper bounds on the adversarial loss for robust tree ensemble methods.

Acknowledgements
The authors want to thank Dr. Daniël Vos and Sicco Verwer for their code, and thank the reviewers for their helpful comments and suggestions. This research was supported by the National Key R&D Program of China (2021ZD0112802) and NSFC (61921006, 61876078).

References
Akhtar, N., Liu, J., and Mian, A. Defense against universal adversarial perturbations. In Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, pp. 3389-3398, Salt Lake City, UT, 2018.
Andriushchenko, M. and Hein, M. Provably robust boosted decision stumps and trees against adversarial attacks. In Advances in Neural Information Processing Systems 32, pp. 12997-13008. MIT Press, Cambridge, MA, 2019.
Awasthi, P., Frank, N., Mao, A., Mohri, M., and Zhong, Y. Calibration and consistency of adversarial surrogate losses. In Advances in Neural Information Processing Systems 34, pp. 9804-9815. MIT Press, Cambridge, MA, 2021.
Balunović, M. and Vechev, M. Adversarial training and provable defenses: Bridging the gap. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
Bartlett, P. and Traskin, M. AdaBoost is consistent. In Advances in Neural Information Processing Systems 19, pp. 2347-2368. MIT Press, Cambridge, MA, 2006.
Breiman, L. Random forests. Machine Learning, 45(1):5-32, 2001.
Calzavara, S., Lucchese, C., Tolomei, G., Abebe, S. A., and Orlando, S. Treant: Training evasion-aware decision trees. Data Mining and Knowledge Discovery, 34(5):1390-1420, 2020.
Chen, H., Zhang, H., Boning, D., and Hsieh, C.-J. Robust decision trees against adversarial examples. In Proceedings of the 36th International Conference on Machine Learning, pp. 1122-1131, Long Beach, CA, 2019a.
Chen, H., Zhang, H., Si, S., Li, Y., Boning, D., and Hsieh, C.-J. Robustness verification of tree-based models. In Advances in Neural Information Processing Systems 32, pp. 12317-12328. MIT Press, Cambridge, MA, 2019b.
Cheng, M., Le, T., Chen, P.-Y., Zhang, H., Yi, J., and Hsieh, C.-J. Query-efficient hard-label black-box attack: An optimization-based approach. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, 2019.
Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, pp. 1310-1320, Long Beach, CA, 2019.
Freund, Y., Schapire, R. E., et al. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pp. 148-156, Bari, Italy, 1996.
Gao, W. and Zhou, Z.-H. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1-18, 2013.
Gao, W. and Zhou, Z.-H. Towards convergence rate analysis of random forests for classification. In Advances in Neural Information Processing Systems 33, pp. 9300-9311. MIT Press, Cambridge, MA, 2020.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, 2015.
Kantchelian, A., Tygar, J. D., and Joseph, A. Evasion and hardening of tree ensemble classifiers. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2387-2396, New York, NY, 2016.
Li, B., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive noise. In Advances in Neural Information Processing Systems 32, pp. 9459-9469. MIT Press, Cambridge, MA, 2019.
Liao, F., Liang, M., Dong, Y., Pang, T., Hu, X., and Zhu, J. Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778-1787, Salt Lake City, UT, 2018.
Liu, X., Li, Y., Wu, C., and Hsieh, C.-J. Adv-BNN: Improved adversarial defense through robust Bayesian neural network. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
Mirman, M., Gehr, T., and Vechev, M. Differentiable abstract interpretation for provably robust neural networks. In Proceedings of the 35th International Conference on Machine Learning, pp. 3578-3586, Stockholm, Sweden, 2018.
Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In Proceedings of the 37th IEEE Symposium on Security and Privacy, pp. 582-597, San Jose, CA, 2016.
Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018a.
Raghunathan, A., Steinhardt, J., and Liang, P. S. Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems 31, pp. 10900-10910. MIT Press, Cambridge, MA, 2018b.
Rokach, L. and Maimon, O. Top-down induction of decision trees classifiers: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 35(4):476-487, 2005.
Safavian, S. R. and Landgrebe, D. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660-674, 1991.
Schapire, R. E. Explaining AdaBoost. In Empirical Inference, pp. 37-52. Springer, Berlin, Germany, 2013.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada, 2014.
Vos, D. and Verwer, S. Efficient training of robust decision trees against adversarial examples. In Proceedings of the 38th International Conference on Machine Learning, pp. 10586-10595, 2021a.
Vos, D. and Verwer, S. Robust optimal classification trees against adversarial examples. arXiv preprint arXiv:2109.03857, 2021b.
Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, pp. 5286-5295, Stockholm, Sweden, 2018.
Wong, E., Schmidt, F. R., Metzen, J. H., and Kolter, J. Z. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems 31, pp. 8410-8419. MIT Press, Cambridge, MA, 2018.
Yang, B.-B., Gao, W., and Li, M. On the robust splitting criterion of random forest. In Proceedings of the 19th IEEE International Conference on Data Mining, pp. 1420-1425, Beijing, China, 2019.
Fast Provably Robust Decision Trees and Boosting A. The Detailed Solution of Eqn. (7) and Computational Complexity of Surrogate Losses The detailed solution of Eqn. (7) We will show how to determine the optimal (v l, v r) in Eqn. (7) with O(1) computational complexity when Wt is sorted. We introduce ˆL1(t, η), ˆL2(t, η), ˆL3(t, η), ˆL4(t, η), which are defined as ˆL1(t, η) = F(t, η, 0, 0), ˆL2(t, η) = F(t, η, 0, 1) ˆL3(t, η) = F(t, η, 1, 0), ˆL4(t, η) = F(t, η, 1, 1) Once we know the values of ˆL1, ˆL2, ˆL3, ˆL4, Eqn. (7) can be solved in O(1) computational complexity. We next show how to obtain the values of ˆL1, ˆL2, ˆL3, ˆL4 in in O(1) computational complexity. We introduce k1(t, η), k2(t, η), k3(t, η), k4(t, η), k5(t, η), k6(t, η), which are defined as i=1 I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 0], i=1 I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 1], i=1 I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 0], i=1 I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 1], i=1 I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 0], i=1 I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 1]. Then we have ˆL1 = k2 + k4 + k6, ˆL2 = k2 + k3 + k5 + k6, ˆL3 = k1 + k4 + k5 + k6, ˆL4 = k1 + k3 + k5, which can be verified by the definition. Given a splitting feature t , for a sorted Wt = {η1, η2, ..., ηq} in ascending order, we can determine the values of ki(t , η1) for i = 1, 2, ..., 6 by the definition. It takes O(n) computational complexity. We will show that ki(t , ηj) for each j = 2, ..., q can be determined by iteration in O(1) computational complexity. Notice that q = O(n). Thus it takes average O(1) computational complexity to determine the values of ki(t , ηj) for each i = 1, 2, ..., 6 and each j = 1, ..., q It can be seen that Wt is composed of W 1 t , W 2 t and W 3 t , where i=1,Eip = {xit ϵ}, W 2 t = [n i=1,Eip = {xit }, W 3 t = [n i=1,Eip = {xit + ϵ} If ηj = xit ϵ W 1 t , for yi = 0, we have k3(t , ηj) k3(t , ηj 1) 1, k5(t , ηj) k5(t , ηj 1) + 1 . And for yi = 1, we have k4(t , ηj) k4(t , ηj 1) 1, k6(t , ηj) k6(t , ηj 1) + 1 . Fast Provably Robust Decision Trees and Boosting If ηj = xit + ϵ W 3 t , for yi = 0, we have k5(t , ηj) k5(t , ηj 1) 1, k1(t , ηj) k1(t , ηj 1) + 1 . And for yi = 1, we have k6(t , ηj) k6(t , ηj 1) 1, k2(t , ηj) k2(t , ηj 1) + 1 . Therefore, ki(t , ηj) for each j = 2, ..., q can be determined by iteration in O(1) computational complexity. And we can solve Eqn. (7) in O(1) computational complexity. The computational complexity of surrogate losses However, once some convex surrogate losses are used in Eqn. (7), we have to solve it by iterative optimization algorithms, e.g., gradient decent. It would take O(n) computational complexity to calculate the gradient at each iteration. Thus the total computational complexity is O(Tn) for T iterations, much higher than O(1) computational complexity. B. Proof of Theorem. 3.1 Lemma B.1. For a leaf p, let n1 and n 1 represent the number of positive and negative samples respectively, the value of the leaf is v. The Gini impurity of the leaf p is defined as Gini(p) = 1 n 1 n 1 + n1 2 n1 n 1 + n1 2 = 2n1n 1 (n 1 + n1)2 . The Mean Square loss of the leaf p is min v ℓMSE(p) = min v i=1 (yi v)2 . Then the Gini impurity is essentially minimizing the Mean Square loss. Proof. We have that min v ℓMSE(p) = min v i=1 (yi v)2 i=1 (yi n1 n 1 n 1 + n1 )2 = n 1( 1 n1 n 1 n 1 + n1 )2 + n1(1 n1 n 1 n 1 + n1 )2 = 2n 2n1n 1 (n 1 + n1)2 = 2n Gini(p) Therefore, the Gini impurity is essentially minimizing the Mean Square loss. Theorem B.2. 
The sum of adversarial Gini impurity over leaves for robust decision trees is essentially minimizing a lower bound of n X i=1 max j [m]{(h(x) yi)2I[x Eij]} . (13) Fast Provably Robust Decision Trees and Boosting Proof. Let vi represents the i-th leaf s value, Eqn. (13) can be written as i=1 max j [m]{(vj yi)2I[Eij = ]} . To optimize the adversarial Gini impurity, at each split of the decision trees, the movement of the examples should make the Gini impurity worst (Chen et al., 2019a). Let g(xi, p) indicate whether (xi, yi) falls into the p-th leaf after such movement, i.e., g(xi, p) = 1 if (xi, yi) falls into the p-th leaf after such movement, otherwise g(xi, p) = 0. We have that g(xi, p) I[Eip = ] . The adversarial Gini impurity for a leaf p can be defined as Adv Gini(p) = 2n1n 1 (n 1 + n1)2 , i=1 g(xi, p)I[yi = 1], n 1 = i=1 g(xi, p)I[yi = 1] . Because the adversarial loss is defined on a whole tree, thus we consider the sum of the adversarial Gini impurity on a whole tree, i.e., the sum of the adversarial Gini impurity over all leaf nodes. We have that min v1,...,vm i=1 max j [m]{(vj yi)2I[Eij = ]} min v1,...,vm j=1 (vj yi)2I[Eij = ] i=1 (vj yi)2I[Eij = ] i=1 (vj yi)2g(xi, j) Let Dj = {i: (xi, yi) Sn and g(xi, j) = 1}, we have that i=1 (vj yi)2g(xi, j) = min vj i Dj (vj yi)2 . And Adv Gini(j) = 2n1n 1 (n 1+n1)2 , where i Dj I[yi = 1], n 1 = X i Dj I[yi = 1] . According to Lemma B.1, the adversarial Gini impurity Adv Gini(j) is essentially minimizing P i Dj(vj yi)2. Take the sum over j [m] and we can see that the sum of adversarial Gini impurity is essentially minimizing a lower bound of the adversarial square loss. C. Modified FPRDT In the r-th iteration of PRAda Boost, the objective loss is a weighted objective loss : i=1 wr,i max j [m] n ℓ(h(x), yi)I[x Eij] o , Fast Provably Robust Decision Trees and Boosting We could write the weighted objective loss at one split: ˆL(t, η, vl, vr) = i=1 wr,i max n I[vl = yi]I[Eil = ], I[vr = yi]I[Eir = ], Kip o , The one-split goal for robust training is to solve {t , η , v l , v r} arg min t,η,vl,vr n ˆL(t, η, vl, vr) o . The term Kip is calculated as FPRDT. We traverse all possible values of t, η. For any fixed t and η , we solve (v l, v r) = arg min vl,vr {0,1} n ˆL(t , η , vl, vr) o , The algorithm to solve it is similar to FPRDT. Following Appendix A, we just need to change the definition of k1, k2, k3, k4, k5 and k6 as i=1 wr,i I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 0], i=1 wr,i I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 1], i=1 wr,i I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 0], i=1 wr,i I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 1], i=1 wr,i I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 0], i=1 wr,i I[Kip = 1]I[Eil = ]I[Eir = ]I[yi = 1]. And the updated rules are changed as If ηj = xit ϵ W 1 t , for yi = 0, we have k3(t , ηj) k3(t , ηj 1) wr,i, k5(t , ηj) k5(t , ηj 1) + wr,i . And for yi = 1, we have k4(t , ηj) k4(t , ηj 1) wr,i, k6(t , ηj) k6(t , ηj 1) + wr,i . If ηj = xit + ϵ W 3 t , for yi = 0, we have k5(t , ηj) k5(t , ηj 1) wr,i, k1(t , ηj) k1(t , ηj 1) + wr,i . And for yi = 1, we have k6(t , ηj) k6(t , ηj 1) wr,i, k2(t , ηj) k2(t , ηj 1) + wr,i . Therefore, ki(t , ηj) for each j = 2, ..., q can be determined by iteration in O(1) computational complexity. And we can solve Eqn. (7) in O(1) computational complexity. Fast Provably Robust Decision Trees and Boosting D. Proof of Theorem. 4.1 Lemma D.1. For functions fi, i = 1, ..., m with S as their domain. If fi(x) > 0 for all x S and i [m], we have that i=1 max x S fi(x) . Proof. 
Suppose x arg maxx S Qm i=1 fi(x), and x i maxx S fi(x). Thus we have that fi(x ) fi(x i ) Therefore, we have that i=1 fi(x) = i=1 fi(x i ) = i=1 max x S fi(x) . Theorem D.2. Let H(x) = sign(PT t=1 αtht(x)) be the final classifier of PRAda Boost, with γt = 1/2 ϵt > 0 for each iteration t [T]. We have the empirical adversarial 0/1 loss over training sample Sn as follows: max x i Nϵ(xi) I[H(x i) = yi] exp Proof. Let F(x) = PT t=1 αtht(x) and define Dt((xi, yi)) = wt,i/ Pn i=1 wt,i. Therefore, Dt is a distribution over the training set in the t-th iteration. Notice that w1,i = 1 for i [n]. i=1 Dt((xi, yi)) max x i Nϵ(xi) n exp( yαtht(x i)) o . (14) And we have that i=1 Dt((xi, yi)) max x i Nϵ(xi) n exp( yαtht(x i)) o wt,i Pn j=1 wt,j max x i Nϵ(xi) n exp( yαtht(x i)) o = Pn i=1 wt+1,i Pn i=1 wt,i . (16) Eqn. (15) holds by the definition of Dt. Eqn. (16) use the update rule Eqn. (12). From Eqn. (16), we have Pn i=1 wt+1,i = Zt Pn i=1 wt,i, and it follows that DT +1((x, y)) = w T +1,i Pn i=1 w T +1,i = w T +1,i ZT Pn i=1 w T,i = = w T +1,i (QT t=1 Zt) Pn i=1 w1,i = 1 n w T +1,i (QT t=1 Zt) . (17) Unraveling the recurrence of Eqn. (12) gives w T +1,i = max x Nϵ(x) exp( yα1h1(x )) max x Nϵ(x) exp( yαT h T (x )) . (18) Combining Equ. (17) and Equ. (18), we obtain that DT +1((x, y)) = 1 n maxx Nϵ(x) exp( yα1h1(x )) maxx Nϵ(x) exp( yαT h T (x )) QT t=1 maxx i Nϵ(xi) n exp( yiαtht(x i)) o n QT t=1 Zt . (19) Fast Provably Robust Decision Trees and Boosting Since H(x) = sign(F(x)), if there exists x Nϵ(x) satisfying H(x ) = y, then minx Nϵ(x) y F(x ) 0, which implies that maxx Nϵ(x) exp( y F(x )) 1. That is max x Nϵ(x) I[H(x ) = y] max x Nϵ(x) exp( y F(x )) . (20) For adversarial training loss, we have that max x i Nϵ(xi) I[H(x i) = yi] i=1 max x i Nϵ(xi) n exp( yi F(x i)) o (21) i=1 max x i Nϵ(xi) t=1 exp( yiαtht(x i)) o t=1 max x i Nϵ(xi) n exp( yiαtht(x i)) o (22) i=1 DT +1((xi, yi)) t=1 Zt (23) t=1 Zt . (24) Eqn. (21), Eqn. (22) and Eqn. (23) holds by Eqn. (20), Lemma D.1 and Eqn. (19), respectively. Eqn. (24) uses the fact that DT +1 is a distribution, which sums to 1. Finally, by the definition Eqn. (14) of Zt , we have that i=1 Dt((xi, yi)) max x i Nϵ(xi) n exp( αtyiht(x i)) o i S1 Dt((xi, yi)) exp( αt) + X i S2 Dt((xi, yi)) exp(αt) (25) = exp( αt)(1 ϵt) + exp(αt)ϵt (26) = exp( αt)(1 2 + γt) + exp(αt)(1 1 4γ2 t (28) exp( 2γ2 t ) . (29) where S1 and S2 is defined as S2 = {i [n]: min x i Nϵ(xi) yiht(x i) = 1} , S1 = [n] \ S2 . Eqn. (25) uses the fact that both yi and h(xi) are { 1, +1}-valued. Eqn. (26) and Eqn. (27) follows from the definition of ϵt and γt, respectively. Eqn. (28) follows from the definition of αt. For Eqn. (29), we simply apply the approximation 1 + x ex for all real x. Thus, we have that max x Nϵ(xi) I[H(x ) = yi] exp Fast Provably Robust Decision Trees and Boosting E. Experiments Experimental settings The used hyperparameters are summarized in Table 5 and Table 6. Table 5. Hyperparameters of all tree models used in our experiments. Parameters that were not applicable were left blank. Parameter FPRDT Decision tree RIGDT-h GROOT TREANT ROCT PRB tree max depth None None None None 4 4 None min samples split 10 10 10 10 10 10 10 min samples leaf 5 5 5 5 5 5 5 affine - - - - - - False Table 6. Hyperparameters of all tree ensemble models used in our experiments. Parameters that were not applicable were left blank. 
| Parameter | PRAdaBoost | AdaBoost | Random forests | RGBDT | RIGDT forests | GROOT forests | PRB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| max depth | None | None | None | None | None | None | 4 |
| min samples split | 10 | 10 | 10 | - | 10 | 10 | 10 |
| min samples leaf | 5 | 5 | 5 | - | 5 | 5 | 5 |
| n estimators | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| η | - | - | - | 0.2 | - | - | 0.2 |
| γ | - | - | - | 1.0 | - | - | - |
| min child weight | - | - | - | 1 | - | - | - |

Except for n estimators and max depth, the values were copied from the original works (Chen et al., 2019a; Andriushchenko & Hein, 2019; Calzavara et al., 2020; Vos & Verwer, 2021a;b). Note that no method prunes its trees except the PRB method. We do not prune our trees because FPRDT generally prefers to grow a shallow tree, which avoids overfitting.

The curve of adversarial loss in the learning process. We present more comparisons of the convergence of adversarial errors during training between FPRDT and the other unprovably robust methods, as shown in Figure 5.

[Figure 5: six panels plotting the training adversarial 0/1 loss against the (scaled) number of leaves for FPRDT, RIGDT-h and GROOT; the visible panel titles include cifar10:4v8.]
Figure 5. Comparisons of training adversarial errors between FPRDT and other unprovably robust methods in the learning process, where the number of leaf nodes is scaled to [0, 1].

As can be seen, our FPRDT still keeps the adversarial error decreasing during the whole training process, whereas the unprovably robust methods RIGDT-h and GROOT may increase the training adversarial error on some datasets.

The depth of the robust trees. We summarize the average depth of the robust tree methods in Table 7.

Table 7. Average depth over 18 datasets, with FPRDT as the baseline.

| Methods | FPRDT | Decision tree | RIGDT-h | GROOT | ROCT | PRB tree |
| --- | --- | --- | --- | --- | --- | --- |
| Average depth | 1x | 0.88x | 1.84x | 1.81x | 1.84x | 0.89x |

We omit the TREANT method because it cannot be trained successfully on most datasets within 12 hours. Note that the depth of FPRDT is similar to that of the PRB tree, which is the only method that considers pruning.