# evasion_and_hardening_of_tree_ensemble_classifiers__b4ba9160.pdf Evasion and Hardening of Tree Ensemble Classifiers Alex Kantchelian AKANT@CS.BERKELEY.EDU J. D. Tygar TYGAR@CS.BERKELEY.EDU Anthony D. Joseph ADJ@CS.BERKELEY.EDU University of California, Berkeley Classifier evasion consists in finding for a given instance x the nearest instance x such that the classifier predictions of x and x are different. We present two novel algorithms for systematically computing evasions for tree ensembles such as boosted trees and random forests. Our first algorithm uses a Mixed Integer Linear Program solver and finds the optimal evading instance under an expressive set of constraints. Our second algorithm trades off optimality for speed by using symbolic prediction, a novel algorithm for fast finite differences on tree ensembles. On a digit recognition task, we demonstrate that both gradient boosted trees and random forests are extremely susceptible to evasions. Finally, we harden a boosted tree model without loss of predictive accuracy by augmenting the training set of each boosting round with evading instances, a technique we call adversarial boosting. 1. Introduction Deep neural networks (DNN) represent a prominent success of machine learning. These models can successfully and accurately address difficult learning problems, including classification of audio, video, and natural language possible where previous approaches have failed. Yet, the existence of evading instances for the current incarnation of DNNs (Szegedy et al., 2013) shows a perhaps surprising brittleness: for virtually any instance x that the model classifies correctly, it is possible to find a negligible perturbation δ such that x+δ evades being correctly classified, that is, receives a (sometimes widely) inaccurate prediction. The general study of the evasion problem matters on both conceptual and practical grounds. First, we expect a high- Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). performance learning algorithm to generalize well and be hard to evade: only a large enough perturbation δ should be able to alter its decision. The existence of small-δ evading instances shows a defect in the generalization ability of the model, and hints at improper model class and/or insufficient regularization. Second, machine learning is becoming the workhorse of security-oriented applications, the most prominent example being unwanted content filtering. In those applications, the attacker has a large incentive for finding evading instances. For example, spammers look for small, cost-effective changes to their online content to avoid detection and removal. While prior work extensively studies the evasion problem on differentiable models by means of gradient descent, those results are reported in an essentially qualitative fashion, implicitly defaulting the choice of metric for measuring δ to the L2 norm. Further, non-differentiable, noncontinuous models have received very little attention. Tree sum-ensembles as produced by boosting or bagging are perhaps the most important models from this class as they are often able to achieve competitive performance and enjoy good adoption rates in both industrial and academic contexts. In this paper, we develop two novel exact and approximate evasion algorithms for sum-ensemble of trees. Our exact (or optimal) evasion algorithm computes the smallest δ according to the Lp norm for p = 0, 1, 2, such that the model misclassifies x+δ. The algorithm relies on a Mixed Integer Linear Program solver and enables precise quantitative robustness statements. We benchmark the robustness of boosted trees and random forests on a concrete handwritten digit classification task by comparing the minimal required perturbation δ across many representative models. Those models include L1 and L2 regularized logistic regression, max-ensemble of linear classifiers (shallow maxout network), a 3-layer deep neural network and a classic RBF-SVM. The comparison shows that for this task, despite their competitive accuracies, tree ensembles are consistently the most brittle models across the board. Finally, our approximate evasion algorithm is based on Evasion and Hardening of Tree Ensemble Classifiers symbolic prediction, a fast and novel method for computing finite differences for tree ensemble models. We use this method for generating more than 11 million synthetic confusing instances and incorporate those during gradient boosting in an approach we call adversarial boosting. This technique produces a hardened model which is significantly harder to evade without loss of accuracy. 2. Related Work From the onset of the adversarial machine learning subfield, evasion is recognized as part of the larger family of attacks occurring at inference time: exploratory attacks (Barreno et al., 2006). While there is a prolific literature considering the evasion of linear or otherwise differentiable models (Dalvi et al., 2004; L., 2005; Lowd & Meek, 2005; Nelson et al., 2012; Br uckner et al., 2012; Fawzi et al., 2014; Biggio et al., 2013; Szegedy et al., 2013; Srndic & Laskov, 2014), we are only aware of a single paper tackling the case of tree ensembles. In Xu et al. (Xu et al., 2016), the authors present a genetic algorithm for finding malicious PDF instances which evade detection. In this paper, we forgo application-specific feature extraction and directly work in feature space. We briefly discuss strategies for modeling the feature extraction step in paragraph additional constraints of section 4.3. We deliberately do not limit the amount of information available for carrying out evasion. In this paper, our goal is to establish the intrinsic evasion robustness of the machine learning models themselves, and thus provide a guaranteed worstcase lower-bound. In contrast to (Xu et al., 2016), our exact algorithm guarantees optimality of the solution, and our approximate algorithm performs a fast coordinate descent without the additional tuning and hyper-parameters that a genetic algorithm requires. We contrast our paper with a few related papers on deep neural networks, as these are the closest in spirit to the ideas developed here. Goodfellow et al. (Goodfellow et al., 2014) hypothesize that evasion in practical deep neural networks is possible because these models are locally linear. However, this paper demonstrates that despite their extreme non-linearity, boosted trees are even more susceptible to evasion than neural networks. On the hardening side, Goodfellow et al. (Goodfellow et al., 2014) introduce a regularization penalty term which simulates the presence of evading instances at training time, and show limited improvements in both test accuracy and robustness. Gu et al. (Gu & Rigazio, 2015) show preliminary results by augmenting deep neural networks with a pre-filtering layer based on a form of contractive auto-encoding. Most recently, Papernot et al. (Papernot et al., 2015) shows the strong positive effect of distillation on evasion robustness for neural networks. In this paper, we demonstrate a large increase in robustness for a boosted tree model hardened by adversarial boosting. We empirically show that our method does not degrade accuracy and creates the most robust model in our benchmark problem. 3. The Optimal Evasion Problem In this section, we formally introduce the optimal evasion problem and briefly discuss its relevance to adversarial machine learning. We follow the definition of (Biggio et al., 2013). Let c : X Y be a classifier. For a given instance x X and a given distance function d : X X R+, the optimal evasion problem is defined as: minimize x X d(x, x ) subject to c(x) = c(x ) (1) In this paper, we focus on binary classifiers defined over an n-dimensional feature space, that is Y = { 1, 1} and X Rn. Setting the classifier c aside, the distance function d fully specifies (1), hence we talk about d-evading instances, or drobustness. In fact, many problems of interest in adversarial machine learning fit under formulation (1) with a judicious choice for d. In the adversarial learning perspective, d can be used to model the cost the attacker has to pay for changing her initial instance x. In this paper, we proceed as if this cost is decomposable over the feature dimensions. In particular, we present results for four representative distances. We briefly describe those and their typical effects on the solution of (1). The L0 distance Pn i=1 Ixi =x i, or Hamming distance encourages the sparsest, most localized deformations with arbitrary magnitude. Our optimal evasion algorithm can also handle the case of non-uniform costs over features. This situation corresponds to minimizing Pn i=1 αi Ixi =x i where αi are non-negative weights. The L1 distance Pn i=1 |xi x i| encourages localized deformations and additionally controls for their magnitude. The L2 distance p Pn i=1(xi x i)2 encourages less localized but small deformations. The L distance maxi |xi x i| encourages uniformly spread deformations with the smallest possible magnitude. Note that for binary-valued features, L1 and L2 reduce to L0 and L results in the trivial solution value 1 for (1). 4. Evading Tree Ensemble Models We start by introducing tree ensemble models along with some useful notation. We then describe our optimal and ap- Evasion and Hardening of Tree Ensemble Classifiers proximate algorithms for generating evading instances on sum-ensembles of trees. 4.1. Tree Ensembles A sum-ensemble of trees model f : Rn R consists of a set T of regression trees. Without loss of generality, a regression tree T T is a binary tree where each internal node n T.nodes holds a logical predicate n.predicate over the feature variables, outgoing node edges are by convention labeled n.true and n.false and finally each leaf l T.leaves holds a numerical value l.prediction R. For a given instance x Rn, the prediction path in T is the path from the tree root to a leaf such that for each internal node n in the path, n.true is also in the path if and only if n.predicate is true. The prediction of tree T is the leaf value of the prediction path. Finally, the signed margin prediction f(x) of the ensemble model is the sum of all individual tree predictions and the predicted label is obtained by thresholding, with the threshold value commonly fixed at zero: c(x) = 1 f(x) > 0. In this paper, we consider the case of single-feature threshold predicates of the form xi < τ or equivalently xi > τ, where 0 i < n and τ R are fixed model parameters. This restriction excludes oblique decision trees where predicates simultaneously involve several feature variables. We however note that oblique trees are seldom used in ensemble classifiers, partially because of their relatively high construction cost and complexity (Norouzi et al., 2015). Before describing our generic approach for solving the optimal evasion problem, we first state a simple worst-case complexity result for problem (1). 4.2. Theoretical Hardness of Evasion For a given tree ensemble model f, finding an x Rn such that f(x) > 0 (or f(x) < 0 without loss of generality) is NP-complete. That is, irrespectively of the choice for d, the optimal evasion problem (1) requires solving an NPcomplete feasibility subproblem. We give a proof of this fact by reduction from 3-SAT in the appendix. While we can not expect an efficient algorithm for solving all instances of problem (1) unless P = NP, it may be the case that tree ensemble models as produced by common learners such as gradient boosting or random forests are practically easy to evade. We now turn to an algorithm for optimally solving the evasion problem when d is one of the distances presented in section 3. 4.3. Optimal Evasion Let f be a sum-ensemble of trees as defined in 4.1 and x Rn an initial instance. We present a reduction of problem (1) into a Mixed Integer Linear Program (MILP). This reduction avoids introducing constraints with so called big-M constants (Griva et al., 2008) at the cost of a slightly more complex solution encoding. We experimentally find that our reduction produces tight formulations and acceptable running times for all common models f. In what follows, we present the mixed integer program by defining three groups of MILP variables: the predicate variables encode the state (true or false) of all predicates, the leaf variables encode which prediction leaf is active in each tree, and the optional objective variable for the case where d is the L norm. We then introduce three families of constraints: the predicates consistency constraints enforce the logical consistency between predicates, the leaves consistency constraints enforce the logical consistency between prediction leaves and predicates, and the model mislabel constraint enforces the condition c(x) = c(x ), or equivalently that f(x ) > 0 or f(x ) < 0 depending on the sign of f(x). Finally we reduce the objective of (1) by relating the predicate variables to the value of d(x, x ) in objective. Program Variables For clarity, MILP variables are bolded and italicized throughout. Our reduction uses three families of variables. T T |T.nodes| binary variables pi {0; 1} (predicates) encoding the state of the predicates. Our implementation sparingly create those variables: if any two or more predicates in the model are logically equivalent, their state is represented by a single variable. For example, the state of x 5 < 0 and x 5 > 0 would be represented by the same variable. T T |T.leaves| continuous variables 0 li 1 (leaves) encoding which prediction leaf is active in each tree. The MILP constraints force exactly one li per tree to be non-zero with li = 1. The l variables are thus implied binary in any solution but are nonetheless typed continuous to narrow down the choice of branching variable candidates during branch-andbound, and hence improve solving time. At most 1 non-negative continuous variable b (bound) for expressing the distance d(x, x ) of problem (1) when d is the L distance. This variable is first used in the objective paragraph. In what follows, we illustrate our reduction by using a model with a single regression tree as represented in figure 1. Predicates consistency Without loss of generality, each predicate variable pi corresponds to the state of a predicate Evasion and Hardening of Tree Ensemble Classifiers Figure 1. Regression tree for the reduction example. Predicate variables p and leaf variables l are shown next to their corresponding internal and leaf nodes. There are n = 2 continuous features. The leaf predictions are -2, 1, 1 and 2. of the form xk < τk. If two variables pi and pj correspond to predicates over the same variable xk < τ1 and xk < τ2, then pi and pj can take inconsistent values without additional constraints. For instance, if τ1 < τ2, then pi = 1 and pj = 0 would be logically inconsistent because xk < τ1 xk < τ2, but any other valuation is possible. For each feature variable x k, we can ensure the consistency of all p variables which reference a predicate over x k by adding K 1 inequalities enforcing the implicit implication constraints between the predicates, where K is the number of p variables referencing xk. For a given x k, let τ1 < < τK be the sorted thresholds of the predicates over x k. Let p1, . . . , p K be the MILP variables corresponding to predicates x k < τ1, . . . , x k < τK. A valuation of (pi)i=1..K is consistent if and only if p1 = 1 p K = 1. Thus the consistency constraints are: When the feature variables x k are binary-valued, there is a single pi variable associated to a feature variable: all predicates x k < τ with 0 < τ < 1 are equivalent. Generally, tree building packages generate a threshold of 0.5 in this situation. This is however implementation dependent and we can simplify the formulation with additional knowledge of the value domain x k is allowed to take. In our toy example in figure 1, variables p0 and p1 refer to the same feature dimension 0 and are not independent. The predicate consistency constraint in this case is: and no other predicate consistency constraint is needed. Leaves consistency These constraints bind the p and l variables so that the semantics of the regression trees are preserved. Each regression tree has its own independent set of leaves consistency constraints. We construct the constraints such that the following properties hold: (i) if lk = 1, then every other li =k variable within the same tree is zero, and (ii) if a leaf variable lk is 1, then all predicate variables pi encountered in the prediction path of the corresponding leaf are forced to be either 0 or 1 in accordance with the semantics of the prediction path, and (iii) exactly one lk variable per tree is equal to 1. This property is needed because (i) does not force any li to be non-zero. Enforcing property (i) is done using a classic exclusion constraint. If l1, . . . , l K are the K leaf variables for a given tree, then the following equality constraint enforces (i): l1 + l2 + + l K = 1 (2) For our toy example, this constraint is: l1 + l2 + l3 + l4 = 1 Enforcing property (ii) requires two constraints per internal node. Let us start at the root node r. Let proot be the variable corresponding to the root predicate. Let l T 1 , . . . , l T i be the variables corresponding to the leaves of the subtree rooted at r.true, and l F 1 , . . . , l F j the variables for the subtree rooted at r.false. The root predicate is true if and only if the active prediction leaf belongs to the subtree rooted at r.true. In terms of the MILP reduction, this means that proot is equal to 1 if and only if one of the leaf variables of the true subtree is set to one. Similarly on the false subtree, proot is 0 if and only if one of the leaf variables of the false subtree is set to one. Because only one leaf can be non-zero, these constraints can be written as: 1 l F 1 + l F 2 + + l F j = proot = l T 1 + l T 2 + + l T i The case of internal nodes is identical, except that if and only ifs are weakened to single side implications. Indeed, unlike the root case, it is possible that no leaf in either subtree might be an active prediction leaf. For an internal node n, let pnode be the variable attached to the node, l T and l F the variables attached to leaves of the true and false subtrees rooted at n.true and n.false. The constraints are: 1 l F 1 + l F 2 + + l F j pnode l T 1 + l T 2 + + l T i In our toy example, we have 3 internal nodes and thus six constraints. The constraints associated with the root, the Evasion and Hardening of Tree Ensemble Classifiers leftmost and rightmost internal nodes are respectively: l1 + l2 = p0 = 1 (l3 + l4) l1 p1 1 l2 l3 p2 1 l4 Finally, property (iii) automatically holds given the previously defined constraints. To see this, one can walk down the prediction path defined by the p variables and notice that at each level, the leaves values of one of the subtree rooted at the current node must be all zero. For instance, if pnode = 1, then we have l F 1 + l F 2 + + l F j 0 l F 1 = l F 2 = = l F j = 0 At the last internal node, exactly two leaf variables remain unconstrained, and one of them is pushed to zero. By the exclusion constraint (2), the remaining leaf variable must be set to 1. Model mislabel Without loss of generality, consider an original instance x such that f(x) < 0. In order for x to be an evading instance, we must have f(x ) 0. Encoding the model output f(x ) is straightforward given the leaf variables l. The output of each regression tree is simply the weighted sum of its leaf variables, where the weight of each variable li corresponds to the prediction value vi of the associated leaf. Hence, f(x ) is the sum of |T | weighted sums over the l variables and the following constraint enforces f(x ) 0: X For our running example, the mislabeling constraint is: 2l1 + l2 l3 + 2l4 0 Objective Finally, we need to translate the objective d(x, x ) of problem (1). We rely on the predicate variables p in doing so. For any distance Lρ with ρ N, there exists weights wi and a constant C such that the MILP objective can be written as: X We leave the complete proof in the appendix. Intuitively, because the predicates effectively discretize the feature values, an optimal distance d(x, x ) can only take a finite number of values. For our toy example, consider (x0 = 0, x1 = 3). In the case of the L0 distance, we have the following objective: For the (squared) L2 distance instead, the objective is essentially: 4 3p0 p1 + 4p2 For the L case, our objective reduces to the variable b and we introduce n additional bounding constraints of the form b where the left hand side measures |xk x k| using the same technique as the ρ = 1 case. Hence, the full MILP reduction of the optimal L0-evasion for our toy instance is: min p,l 1 p1 + p2 s.t. p0, p1 {0; 1}; 0 l1, l2, l3, l4 1 p1 p0 predicates consistency l1 + l2 + l3 + l4 = 1 leaves consistency l1 + l2 = p0 = 1 (l3 + l4) leaves consistency l1 p1 1 l2 leaves consistency l3 p2 1 l4 leaves consistency 2l1 + l2 l3 + 2l4 0 model mislabel Additional Constraints Reducing problem (1) to a MILP allows expressing potentially complex inter-feature dependencies created by the feature extraction step. For instance, consider the common case of K mutually exclusive binary features x1, . . . , x K such that in any well-formed instance, exactly one feature is non-zero. Letting pi be the predicate variable associated with xi < 0.5, mutual exclusivity can be enforced by: i=1 pi = K 1 4.4. Approximate Evasion While the above reduction of problem (1) to an MILP is linear in the size of the model f, the actual solving time can be very significant for difficult models. Thus, as a complement to the exact method, we develop an approximate evasion algorithm to generate good quality evading instances. For this part, we exclusively focus on minimizing the L0 distance. Our approximate evasion algorithm is based on the iterative coordinate descent procedure described in algorithm 1. In essence, this algorithm greedily modifies the single best feature at each iteration until the sign of f(x ) changes. We now present an efficient algorithm for solving the inner optimization subproblem max x: x x 0=1 f( x) (3) Evasion and Hardening of Tree Ensemble Classifiers Algorithm 1 Coordinate Descent for Problem (1) Input: model f, initial instance x (assume f(x) < 0) Output: evading instance x such that f(x ) 0 x x while f(x ) < 0 do x arg max x : x x 0=1 f( x ) The time complexity of a careful brute force approach is high. For balanced regression trees, the prediction time for a given instance is O P T T log |T.nodes| . Further, for each dimension 1 k n, we must compute all possible values of f( x) where x and x only differ along dimension k. Note that because the model predicates effectively discretize the feature space, f( x) takes a finite number of distinct values. This number is no more than one plus the total number of predicates holding over feature k. Hence, we must compute f( x) for a total of P T T |T.nodes| candidates x and the total running time is O P T T |T.nodes| P T T log |T.nodes| . If we denote by |f| the size of the model which is proportional to the total number of predicates, the running time is O |f||T | log |f| |T | . Tree ensembles often have thousands of trees, making the |f||T | dependency prohibitively expensive. We can efficiently solve problem (3) by a dynamic programming approach. The main idea is to visit each internal node no more than once by computing what value of x can land us at each node. We call this approach symbolic prediction in reference to symbolic program execution (King, 1976), because we essentially move a symbolic instance x down the regression tree and keep track of the constraints imposed on x by all encountered predicates. Because we are only interested in x instances that are at most one feature away from x, we can stop the tree exploration early if the current constraints imply that at least two dimensions need to be modified or more trivially, if there is no instance x that can simultaneously satisfy all the constraints. When reaching a leaf, we report the leaf prediction value f( x) along with the pair of perturbed dimension number k and value interval for xk which would reach the given leaf. To simplify the presentation of the algorithm, we introduce a SYMBOLICINSTANCE data structure which keeps track of the constraints on x. This structure is initialized by x and has four methods. For a predicate p, .ISFEASIBLE(p) returns true if and only if there exists an instance x such that x x 0 1 and all constraints including p hold. .UPDATE(p) updates the set of constraints on x by adding predicate p. .ISCHANGED() returns true if and only if the current set of constraints imply x = x. .GETPERTURBATION() returns the index k such that xk = xk and the admissible interval of values for xk It is possible to implement SYMBOLICINSTANCE such that each method executes in constant time. Algorithm 2 presents the symbolic prediction algorithm recursively for a given tree. It updates a list of elements by appending tuples to it. The first element of a tuple is the feature index k where xk = xk, the second element is the allowed right-open interval for xk, and the last element is the prediction score f( x). Algorithm 2 Recursive definition of the symbolic prediction algorithm. For the first call, n is the tree root, s is a fresh SYMBOLICINSTANCE object initialized on x with no additional constraints and l is an empty list. Input: node n (either internal or leaf) Input: s of type SYMBOLICINSTANCE Input/Output: list of tuples l (see description) if n is a leaf then if s.ISCHANGED() then l l {s.GETPERTURBATION(), n.prediction} end if else if s.ISFEASIBLE(n.predicate) then s T COPY(s) s T .UPDATE(n.predicate) SYMBOLICPREDICTION(n.true, s T , l) end if if s.ISFEASIBLE( n.predicate) then s.UPDATE( n.predicate) SYMBOLICPREDICTION(n.false, s, l) end if end if This algorithm visits each node at most once and performs at most one copy of the SYMBOLICINSTANCE s per visit. The copy operation is proportional to the number of constraints in s. For a balanced tree T, the copy cost is O(log |T.nodes|), so that the total running time is O(|T.nodes| log |T.nodes|). For each tree of the model, once the list of dimensioninterval-prediction tuples is obtained, we substract the leaf prediction value for x from all predictions in order to obtain a score variation between x and x instead of the score for x. With the help of an additional data structure, we can use the dimension-interval-variation tuples across all trees to find the dimension k and interval for xk which corresponds to Evasion and Hardening of Tree Ensemble Classifiers the highest variation f( x) f(x). This final search can be done in O(L log L), where L is the total number of tuples, and is no larger than P T T |T.leaves| by construction. To summarize, the time complexity of our method for solving problem (3) is O(|f| log |f|), an exponential improvement over the brute force method. We turn to the experimental evaluation of the robustness of tree ensembles. We start by describing the evaluation dataset and our choice of models for benchmarking purposes before moving to a quantitative comparison of the robustness of boosted trees and random forest models against a garden variety of learning algorithms. We finally show that the brittleness of boosted trees can be effectively addressed by including fresh evading instances in the training set during boosting. Model Parameters Test Error Lin. L1 C = 0.5 1.5% Lin. L2 C = 0.2 1.5% BDT 1,000 trees, depth 4, η = 0.02 0.25% RF 80 trees, max. depth 22 0.20% CPM k = 30, C = 0.01 0.20% NN 60-60-30 sigmoidal (tanh) units 0.25% RBF-SVM γ = 0.04, C = 1 0.25% BDT-R 1,000 trees, depth 6, η = 0.01 0.20% Table 1. The considered models. BDT-R is the hardened boosted trees model introduced in section 5.4. 5.1. Dataset and Method We choose digit recognition over the MNIST (Le Cun et al.) dataset as our benchmark classification task for three reasons. First, the MNIST dataset is well studied and exempt from labeling errors. Second, there is a one-to-one mapping between pixels and features, so that features can vary independently from each other. Third, we can pictorially represent evading instances, and this helps understanding the models robustness or lack of. Our running binary classification task is to distinguish between handwritten digits 2 and 6 . Our training and testing sets respectively include 11,876 and 1,990 images and each image has 28 28 gray scale pixels and our feature space is X = [0, 1]784. As our main goal is not to compare model accuracies, but rather to obtain the best possible model for each model class, we tune the hyper-parameters so as to minimize the error on the testing set directly. In addition to the training and testing sets, we create an evaluation dataset of a hundred instances from the testing set such that every instance is correctly classified by all of the benchmarked models. These correctly classified instances are to serve the purpose of x, the starting point instances in the evasion problem (1). 5.2. Considered Models Table 1 summarizes the 7 benchmarked models with their salient hyper-parameters and error rates on the testing set. For our tree ensembles, BDT is a (gradient) boosted decision trees model in the modern XGBoost implementation (Chen & He) and RF is a random forest trained using scikit-learn (Buitinck et al., 2013). We also include the following models for comparison purposes. Lin. L1 and Lin. L2 are respectively a L1 and L2-regularized logistic regression using the Lib Linear (Fan et al., 2008) implementation. RBF-SVM is a regular Gaussian kernel SVM trained using Lib SVM (Chang & Lin, 2011). NN is a 3 hidden layer neural network with a top logistic regression layer implemented using Theano (Bergstra et al., 2010) (no pre-training, no drop-out). Finally, our last benchmark model is the equivalent of a shallow neural network made of two max-out units (one unit for each class) each made of thirty linear classifiers. This model corresponds to the difference of two Convex Polytope Machines (Kantchelian et al., 2014) (one for each class) and we use the authors implementation (CPM). Two factors motivate the choice of CPM. First, previous work has theoretically considered the evasion robustness of such ensemble of linear classifiers and proved the problem to be NP-hard (Stevens & Lowd, 2013). Second, unlike RBF-SVM and NN, this model can be readily reduced to a Mixed Integer Program, enabling optimal evasions thanks to a MIP solver. As the reduction is considerably simpler than the one presented for tree ensembles above, we omit it here. Except for the two linear classifiers, all models have a comparable, very low error rate on the benchmark task. 5.3. Robustness For each learned model, and for all of the 100 correctly classified evaluation instance, we compute the optimal (or best effort) solution to the evasion problem under all of the deformation metrics. We use the Gurobi (Gurobi Optimization, 2015) solver to compute the optimal evasions for all distances and all models but NN and RBF-SVM. We use a classic projected gradient descent method for solving the L1, L2 and L evasions of NN and RBF-SVM, and address the L0-evasion case by an iterative coordinate descent algorithm and a brute force grid search at each iteration. Figure 2 summarizes the obtained adversarial bounds as one boxplot for each combination of model and distance. Although the tree ensembles BDT and RF have very competitive accuracies, they systematically rank at the bottom for robustness across all metrics. Remarkably, negligible L1 or L2 perturbations suffice to evade those models. RBF-SVM is apparently the hardest model to evade, agreeing with results from (Goodfellow et al., 2014). NN Evasion and Hardening of Tree Ensemble Classifiers Figure 2. Optimal (white boxes) or best-effort (gray boxes) evasion bounds for different metrics on the evaluation dataset. The smallest bounds, 25-50% and 50-75% quartiles and largest bounds are shown. The red line is the median score. Larger scores mean more deformations are necessary to change the model prediction. Figure 3. First 4 rows: examples of optimal or best effort evading 6 instances. Every picture is misclassified as 2 by its column model. Last row: feature importance computed as frequency of pixel modification in the L0-evasions (darker means feature is more often picked). and CPM exhibit very similar performance despite having quite different architectures. Finally, the L1-regularized linear model exhibits significantly more brittleness than its L2 counterpart. This phenomenon is explained by large weights concentrating in specific dimensions as a result of sparsity. Thus, small modifications in the heavily weighted model dimensions can result in large classifier output variations. 5.4. Hardening by Adversarial Boosting We empirically demonstrate how to significantly improve the robustness of the BDT model by adding evading instances to the training set during the boosting process. At each boosting round, we use our fast symbolic prediction-based algorithm to create budgeted adversarial instances with respect to the current model and for all the 11,876 original training instances. For a given training instance x with label y and a modification budget B 1, a budgeted adversarial training instance x is such that x x 0 B and the margin yf(x ) is as small as possible. Here, we use B = 28, the size of the picture diagonal, as our budget. The reason is that modifying 28 pixels over 784 is not enough to perceptually morph a handwritten 2 into 6 . The training dataset for the current round is then formed by appending to the original training dataset these evading instances along with their correct labels, thus increasing the size of the training set by a factor 2. Finally, gradient boosting produces the next regression tree which by definition minimizes the error of the augmented ensemble model on the adversarially-enriched training set. After 1,000 adversarial boosting rounds, our model has encountered more than 11 million adversarial instances, without ever training on more than 24,000 instances at a time. We found that we needed to increase the maximum tree depth from 4 to 6 in order to obtain an acceptable error rate. After 1,000 iterations, the resulting model BDT-R has a slightly higher testing accuracy than BDT (see Table 1). Unlike BDT, BDT-R is extremely challenging to optimally evade using the MILP solver: the branch-andbound search continues to expand nodes after 1 day on a 6 core Xeon 3.2GHz machine. To obtain the tightest possible evasion bound, we warm-start the solver with the solution found by the fast evasion technique and report the best solution found by the solver after an hour. Figure 2 shows that BDT-R is more robust than our previous champion RBF-SVM with respect to L0 deformations. Unfortunately, we found significantly lower scores on all L1, L2 and L distances compared to the original BDT model: hardening against L0-evasions made the model more sensitive to all other types of evasions. 6. Conclusion We have presented two novel algorithms, one exact and one approximate, for systematically computing evasions of tree ensembles such as boosted trees and random forests. On a classic digit recognition task, both gradient boosted trees and random forests are extremely susceptible to evasions. We also introduce adversarial boosting and show that it trains models that are hard to evade, without sacrificing accuracy. One future work direction would be to use these algorithms to generate small evading instances for practical security systems. Another direction would be to better understand the properties of adversarial boosting. In particular, it is not known whether this hardening approach would succeed on all possible datasets. Evasion and Hardening of Tree Ensemble Classifiers Acknowledgements This research is supported in part by Intel s ISTC for Secure Computing, NSF grants 0424422 (TRUST) and 1139158, the Freedom 2 Connect Foundation, US State Dept. DRL, LBNL Award 7076018, DARPA XData Award FA8750-12-2-0331, and gifts from Amazon, Google, SAP, Apple, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, Game On Talis, General Electric, Hortonworks, Huawei, Intel, Microsoft, Net App, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!. The opinions in this paper are those of the authors and do not necessarily reflect those of any funding sponsor or the United States Government. Barreno, M., Nelson, B., Sears, R., Joseph, A. D., and Tygar, J. D. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, ASIACCS 06, 2006. Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (Sci Py), 2010. Biggio, B., Corona, I., Maiorca, D., Nelson, B., rndi, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases, volume 8190 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2013. Br uckner, M., Kanzow, C., and Scheffer, T. Static prediction games for adversarial learning problems. The Journal of Machine Learning Research, 13(1):2617 2654, 2012. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., Vander Plas, J., Joly, A., Holt, B., and Varoquaux, G. API design for machine learning software: experiences from the scikitlearn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pp. 108 122, 2013. Chang, C.C. and Lin, C.J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 2011. Chen, T. and He, T. XGBoost: e Xtreme Gradient Boosting. https://github.com/dmlc/xgboost. Accessed: 2015-06-05. Dalvi, N., Domingos, P., Mausam, Sanghai, S., and Verma, D. Adversarial classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 99 108. ACM, 2004. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., and Lin, C.J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 2008. Fawzi, A., Fawzi, O., and Frossard, P. Analysis of classifiers robustness to adversarial perturbations. ar Xiv preprint, 2014. Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ar Xiv preprint, 2014. Griva, I., Nash, S. G., and Sofer, A. Linear and Nonlinear Optimization (2nd edition). Society for Industrial Mathematics, 2008. Gu, S. and Rigazio, L. Towards deep neural network architectures robust to adversarial examples. ar Xiv preprint, 2015. Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015. URL http://www.gurobi.com. Kantchelian, A., Tschantz, M. C., Huang, L., Bartlett, P. L., Joseph, A. D., and Tygar, J. D. Large-margin convex polytope machine. In Advances in Neural Information Processing Systems (NIPS), 2014. King, James C. Symbolic execution and program testing. Communications of the ACM, 19(7):385 394, 1976. L., Daniel. Good word attacks on statistical spam filters. In Proceedings of the Second Conference on Email and Anti-Spam (CEAS), 2005. Le Cun, Yann, Cortes, Corinna, and Burges, Christopher J.C. MNIST dataset, 1998. URL http://yann. lecun.com/exdb/mnist/. Lowd, D. and Meek, C. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD 05, 2005. Nelson, B., Rubinstein, B. I. P., Huang, L., Joseph, A. D., Lee, S. J., Rao, S., and Tygar, J. D. Query strategies for evading convex-inducing classifiers. Journal of Machine Learning Research, 13, May 2012. Norouzi, M., Collins, M., Johnson, M. A, Fleet, D. J., and Kohli, P. Efficient non-greedy optimization of decision trees. In Advances in Neural Information Processing Systems (NIPS), pp. 1720 1728, 2015. Evasion and Hardening of Tree Ensemble Classifiers Papernot, N., Mc Daniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. ar Xiv preprint ar Xiv:1511.04508, 2015. Srndic, N. and Laskov, P. Practical evasion of a learningbased classifier: A case study. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, S&P 14, 2014. Stevens, D. and Lowd, D. On the hardness of evading combinations of linear classifiers. In Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, AISec 13, 2013. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. ar Xiv preprint, 2013. Xu, W., Qi, Y., and Evans, D. Automatically evading classifiers: A case study on PDF malware classifiers. In Network and Distributed Systems Security Symposium (NDSS), 2016.