# Machine Unlearning for Random Forests

Jonathan Brophy and Daniel Lowd
Department of Computer and Information Science, University of Oregon, Eugene, Oregon. Correspondence to: Jonathan Brophy.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

Responding to user data deletion requests, removing noisy examples, or deleting corrupted training data are just a few reasons for wanting to delete instances from a machine learning (ML) model. However, efficiently removing this data from an ML model is generally difficult. In this paper, we introduce data removal-enabled (DaRE) forests, a variant of random forests that enables the removal of training data with minimal retraining. Model updates for each DaRE tree in the forest are exact, meaning that removing instances from a DaRE model yields exactly the same model as retraining from scratch on updated data. DaRE trees use randomness and caching to make data deletion efficient. The upper levels of DaRE trees use random nodes, which choose split attributes and thresholds uniformly at random. These nodes rarely require updates because they only minimally depend on the data. At the lower levels, splits are chosen to greedily optimize a split criterion such as Gini index or mutual information. DaRE trees cache statistics at each node and training data at each leaf, so that only the necessary subtrees are updated as data is removed. For numerical attributes, greedy nodes optimize over a random subset of thresholds, so that they can maintain statistics while approximating the optimal threshold. By adjusting the number of thresholds considered for greedy nodes, and the number of random nodes, DaRE trees can trade off between more accurate predictions and more efficient updates. In experiments on 13 real-world datasets and one synthetic dataset, we find DaRE forests delete data orders of magnitude faster than retraining from scratch while sacrificing little to no predictive power.

## 1. Introduction

Recent legislation (EU, 2016; California, 2018; Canada, 2018) requiring companies to remove private user data upon request has prompted new discussions on data privacy and ownership (Shintre et al., 2019), and fulfilling this right to be forgotten (Kwak et al., 2017; Garg et al., 2020) may require updating any models trained on this data (Villaronga et al., 2018). However, retraining a model from scratch on a revised dataset becomes prohibitively expensive as dataset sizes and model complexities increase (Shoeybi et al., 2019); the result is wasted time and computational resources, exacerbated as the frequency of data removal requests increases. Decision trees and random forests (Breiman et al., 1984; Friedman, 2001) are popular and widely used machine learning models (Lundberg et al., 2018), mainly due to their predictive prowess on many classification and regression tasks (Kocev et al., 2013; Genuer et al., 2017; Wager & Athey, 2018; Linero & Yang, 2018; Biau et al., 2019).
Current work on deleting data from machine learning models has focused mainly on recommender systems (Cao & Yang, 2015; Schelter, 2020), K-means (Ginart et al., 2019), SVMs (Cauwenberghs & Poggio, 2001), logistic regression (Guo et al., 2020; Schelter, 2020), and deep neural networks (Baumhauer et al., 2020; Golatkar et al., 2020b; Wu et al., 2020); however, there is very limited work addressing the problem of efficient data deletion for tree-based models (Schelter et al., 2021). Thus, we outline our contributions as follows:

1. We introduce DaRE (Data Removal-Enabled) Forests (a.k.a. DaRE RF), a variant of random forests that supports the efficient removal of training instances. DaRE RF works with discrete tree structures, in contrast to many related works on efficient data deletion that assume continuous parameters. The key components of DaRE RF are to retrain subtrees only as needed, to consider only a subset of valid thresholds per attribute at each decision node, and to strategically place completely random nodes near the top of each tree to avoid costly retraining.
2. We provide algorithms for training and subsequently removing data from a DaRE forest.
3. We evaluate DaRE RF's ability to efficiently perform sequences of deletions on 13 real-world binary classification datasets and one synthetic dataset, and find that DaRE RF can typically delete data 2-4 orders of magnitude faster than retraining from scratch while sacrificing less than 1% in terms of predictive performance.

## 2. Problem Formulation

We assume an instance space $\mathcal{X} \subseteq \mathbb{R}^p$ and possible labels $\mathcal{Y} = \{+1, -1\}$. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ be a training dataset in which each instance $x_i \in \mathcal{X}$ is a $p$-dimensional vector $(x_{i,j})_{j=1}^p$ and $y_i \in \mathcal{Y}$. We refer to $\mathcal{P} = \{j\}_{j=1}^p$ as the set of possible attributes.

### 2.1. Unlearning

Our goal is to unlearn specific training examples by updating a trained model to completely remove their influence. We base our definition on prior work by Ginart et al. (2019, Def. 3.1). We define a (possibly randomized) learning algorithm, $A : \mathcal{D} \to \mathcal{H}$, as a function from a dataset $\mathcal{D}$ to a model in hypothesis space $\mathcal{H}$. A removal method, $R : A(\mathcal{D}) \times \mathcal{D} \times (\mathcal{X} \times \mathcal{Y}) \to \mathcal{H}$, is a function from a model $A(\mathcal{D})$, a dataset $\mathcal{D}$, and an instance to remove from the training data $(x, y)$ to a model in $\mathcal{H}$. For exact unlearning (a.k.a. perfect unlearning), the removal method must be equivalent to applying the training algorithm to the dataset with instance $(x, y)$ removed. In the case of randomized training algorithms, we define equivalence as having identical probabilities for each model in $\mathcal{H}$:

$$P(A(\mathcal{D} \setminus (x, y))) = P(R(A(\mathcal{D}), \mathcal{D}, (x, y))) \tag{1}$$

See Section 5 for more related work on unlearning.

The simplest approach to exact unlearning is to ignore the existing model and simply rerun $A$ on the updated dataset, $\mathcal{D} \setminus (x, y)$. We refer to this as the naive retraining approach. Naive retraining is agnostic to virtually all machine learning models, easy to understand, and easy to implement. However, this approach becomes prohibitively expensive as the dataset size, model complexity, and number of deletion requests increase.

### 2.2. Random Forests

A decision tree is a tree-structured model in which each leaf is associated with a binary-valued prediction and each internal node is a decision node associated with an attribute $a \in \mathcal{P}$ and threshold value $v \in \mathbb{R}$. The outgoing branches of the decision node partition the data based on the chosen attribute and threshold.
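As a point of reference for the description that follows, here is a minimal sketch of this tree structure and its prediction traversal; the class and function names are purely illustrative and are not part of DaRE RF itself.

```python
# Minimal, illustrative sketch of a binary decision tree (not the DaRE implementation).
class Leaf:
    def __init__(self, value):
        self.value = value          # leaf prediction, e.g., fraction of positive instances

class DecisionNode:
    def __init__(self, attribute, threshold, left, right):
        self.attribute = attribute  # split attribute index a
        self.threshold = threshold  # split threshold v
        self.left = left            # subtree for x[a] <= v
        self.right = right          # subtree for x[a] > v

def predict(node, x):
    """Route x from the root to a leaf and return the leaf's value."""
    while isinstance(node, DecisionNode):
        node = node.left if x[node.attribute] <= node.threshold else node.right
    return node.value

# Example: a depth-1 tree that splits on attribute 0 at threshold 0.5.
tree = DecisionNode(0, 0.5, Leaf(0.9), Leaf(0.2))
print(predict(tree, [0.3, 1.2]))  # 0.9
```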
Given $x \in \mathcal{X}$, the prediction of a decision tree can be found by traversing the tree, starting at the root and following the branches consistent with the attribute values in $x$. Traversal ends at one of the leaf nodes, where the prediction is equal to the value of the leaf node. (Our methods can easily be generalized to the multi-class setting, $|\mathcal{Y}| = C$, by storing statistics for $C - 1$ classes instead of just one.)

Decision trees are typically learned in a recursive manner, beginning by picking an attribute $a$ and threshold $v$ at the root that optimizes an empirical split criterion such as the Gini index (Breiman et al., 1984):

$$G_{D,\mathcal{Y}}(a, v) = \sum_{b \in \{\ell, r\}} \frac{|D_b|}{|D|} \left( 1 - \sum_{y \in \mathcal{Y}} \left( \frac{|D_{b,y}|}{|D_b|} \right)^2 \right) \tag{2}$$

or entropy (Quinlan, 2014):

$$H_{D,\mathcal{Y}}(a, v) = \sum_{b \in \{\ell, r\}} \frac{|D_b|}{|D|} \sum_{y \in \mathcal{Y}} -\frac{|D_{b,y}|}{|D_b|} \log_2 \frac{|D_{b,y}|}{|D_b|} \tag{3}$$

in which $D \subseteq \mathcal{D}$ is the input dataset to a decision node, $D_\ell = \{(x_i, y_i) \in D \mid x_{i,a} \le v\}$, $D_r = D \setminus D_\ell$, $D_{\ell,y} = \{(x_i, y_i) \in D_\ell \mid y_i = y\}$, and $D_{r,y} = \{(x_i, y_i) \in D_r \mid y_i = y\}$. Once $a$ and $v$ have been chosen for the root node, the data is partitioned into mutually exclusive subsets based on the value of $v$, and a child node is learned for each data subset. The process terminates when the entire subset has the same label or the tree reaches a specified maximum depth $d_{\max}$.

A random forest (RF) is an ensemble of decision trees which predicts the mean value of its component trees. Two sources of randomness are used to increase diversity among the trees. First, each tree in the ensemble is trained from a bootstrap sample of the original training data, with some instances excluded and some included multiple times. Second, each decision node is restricted to a random subset of attributes, and the split criterion is optimized over this subset rather than over all attributes.

We base our methods on a minor variation of a standard RF, one that does not use bootstrapping. Bootstrapping complicates the removal of training instances, since one instance may appear multiple times in the training data for one tree. There is also empirical evidence that bootstrapping does not improve predictive performance (Zaman & Hirose, 2009; Denil et al., 2014; Mentch & Hooker, 2016), which was consistent with our own experiments (Appendix B.2, Table 5). Since predictive performance was already similar, we saw no need to add the extra bookkeeping to handle this complexity.

## 3. DaRE Forests

We now describe DaRE (Data Removal-Enabled) forests (a.k.a. DaRE RF), an RF variant that enables the efficient removal of training instances.

**Theorem 3.1.** Data deletion for DaRE forests is exact (see Eq. 1), meaning that removing instances from a DaRE model yields exactly the same model as retraining from scratch on updated data. This is also equivalent to certified removal (Guo et al., 2020) with $\epsilon = 0$.

Proofs of all theorems are in Appendix A.

A DaRE forest is a tree ensemble in which each tree is trained independently on a copy of the training data, considering a random subset of $\tilde{p}$ attributes at each split to encourage diversity among the trees. In our experiments we use $\tilde{p} = \sqrt{p}$. Since each tree is trained independently, we describe our methods in terms of training and updating a single tree; the extension to the ensemble is trivial. DaRE forests leverage several techniques to make deletions efficient: (1) only retrain portions of the model where the structure must change to match the updated database; (2) consider at most $k$ randomly-selected thresholds per attribute; (3) introduce random nodes at the top of each tree that minimally depend on the data and thus rarely need to be retrained.
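Both the greedy split selection of Section 2.2 and the node statistics described next rely on evaluating a split criterion from simple counts. To make that concrete, the sketch below computes the weighted Gini index of Eq. 2 for a binary-label split from left/right instance and positive-instance counts; the function name is illustrative only and is not taken from the DaRE codebase.

```python
def gini_split_score(n_left, n_left_pos, n_right, n_right_pos):
    """Weighted Gini index (Eq. 2) of one attribute-threshold candidate,
    computed from cached counts; lower is better."""
    def gini(n, n_pos):
        if n == 0:
            return 0.0
        p_pos = n_pos / n
        return 1.0 - (p_pos ** 2 + (1.0 - p_pos) ** 2)

    n = n_left + n_right
    return (n_left / n) * gini(n_left, n_left_pos) + (n_right / n) * gini(n_right, n_right_pos)

# Example: 60 instances go left (10 positive), 40 go right (30 positive).
print(round(gini_split_score(60, 10, 40, 30), 4))
```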
We present abridged versions of the procedures for training and updating a DaRE tree in Algorithms 1 and 2, respectively, with full explanations below. Detailed pseudocode for both operations is in Appendix A.8.

### 3.1. Retraining Minimal Subtrees

We avoid unnecessary retraining by storing statistics at each node in the tree. For decision nodes, we store and update counts for the number of instances $|D|$ and positive instances $|D_{\cdot,1}|$, as well as $|D_\ell|$ and $|D_{\ell,1}|$ for a set of $k$ thresholds per attribute. This information is sufficient to recompute the split criterion of each threshold without iterating through the data. For leaf nodes, we store and update $|D|$ and $|D_{\cdot,1}|$, along with a list of training instances that end at that leaf. These statistics are initialized when training the tree for the first time (Alg. 1). We find this additional overhead has a negligible effect on training time.

When deleting a training instance $(x, y) \in \mathcal{D}$, these statistics are updated and used to check if a particular subtree needs retraining. Specifically, decision nodes affected by the deletion of $(x, y)$ update their statistics and recompute the split criterion for each attribute-threshold pair. If a different threshold obtains an improved split criterion over the currently chosen threshold, then we retrain the subtree rooted at this node. The training data for this subtree can be found by concatenating the instance lists from all leaf-node descendants. If no retraining occurs at any decision node and a leaf node is reached instead, its label counts and instance list are updated and the deletion operation is complete. See Alg. 2 for pseudocode.

**Algorithm 1** Building a DaRE tree / subtree (TRAIN).

- **Input:** data $D$, depth $d$
- **If** the stopping criteria are reached:
  - node ← LEAFNODE()
  - save instance counts(node, $D$)  ($|D|$, $|D_{\cdot,1}|$)
  - save leaf-instance pointers(node, $D$)
  - compute leaf value(node)
- **Else:**
  - **If** $d < d_{\text{rmax}}$:
    - node ← RANDOMNODE()
    - save instance counts(node, $D$)  ($|D|$, $|D_{\cdot,1}|$)
    - $a$ ← randomly sample attribute($D$)
    - $v$ ← randomly sample threshold $\in [a_{\min}, a_{\max})$
    - save threshold statistics(node, $D$, $a$, $v$)
  - **Else:**
    - node ← GREEDYNODE()
    - save instance counts(node, $D$)  ($|D|$, $|D_{\cdot,1}|$)
    - $A$ ← randomly sample $\tilde{p}$ attributes($D$)
    - **For** each $a \in A$: $C$ ← get valid thresholds($D$, $a$); $V$ ← randomly sample $k$ valid thresholds($C$); **for** each $v \in V$: save threshold statistics(node, $D$, $a$, $v$)
  - scores ← compute split scores(node)
  - select optimal split(node, scores)
  - $D_\ell, D_r$ ← split on selected threshold(node, $D$)
  - node.$\ell$ ← TRAIN($D_\ell$, $d + 1$)  (Alg. 1)
  - node.$r$ ← TRAIN($D_r$, $d + 1$)  (Alg. 1)
- **Return** node

### 3.2. Sampling Valid Thresholds

The optimal threshold for a continuous attribute will always lie between two training instances with adjacent feature values containing opposite labels; if the two adjacent training instances have the same label, the split criterion improves by increasing or decreasing $v$. We refer to these as valid thresholds, and any other threshold as invalid. More precisely, a threshold $v$ between two adjacent values $v_1$ and $v_2$ for a given attribute $a$ is valid if and only if there exist instances $(x_1, y_1)$ and $(x_2, y_2)$ such that $x_{1,a} = v_1$, $x_{2,a} = v_2$, and $y_1 \neq y_2$. Only considering valid thresholds substantially reduces the statistics we need to store and compute at each node. We gain further efficiency by randomly sampling $k$ valid thresholds and only considering these thresholds when deciding which attribute-threshold pair to split on. We treat $k$ as a hyperparameter and tune its value when building a DaRE model.
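The following sketch illustrates the valid-threshold definition above for a single numerical attribute: it keeps only boundaries between adjacent distinct values whose instances carry opposite labels, and then samples at most $k$ of them. It is a simplified illustration (midpoint thresholds, plain Python lists) rather than the paper's actual data structures.

```python
import random
from collections import defaultdict

def sample_valid_thresholds(values, labels, k, seed=0):
    """Return at most k valid thresholds for one attribute.

    A boundary between adjacent distinct values v1 < v2 is valid iff some
    instance at v1 and some instance at v2 have opposite labels; here each
    valid threshold is represented by the midpoint (v1 + v2) / 2.
    """
    labels_at = defaultdict(set)
    for v, y in zip(values, labels):
        labels_at[v].add(y)

    distinct = sorted(labels_at)
    valid = []
    for v1, v2 in zip(distinct, distinct[1:]):
        if len(labels_at[v1] | labels_at[v2]) > 1:  # opposite labels exist across the boundary
            valid.append((v1 + v2) / 2.0)

    rng = random.Random(seed)
    return valid if len(valid) <= k else sorted(rng.sample(valid, k))

# Example: only the 2/3 and 5/6 boundaries separate opposite labels.
print(sample_valid_thresholds([1, 2, 3, 5, 6], [-1, -1, +1, +1, -1], k=5))  # [2.5, 5.5]
```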
One might suspect that only considering a subset of thresholds for each attribute may lead to decreased predictive performance; however, our experiments show that relatively modest values of $k$ (e.g., $5 \le k \le 25$) are sufficient to provide accurate predictions, and in some cases lead to improved performance (Appendix B.2, Table 5).

When deleting an instance at a given node, we must determine if any threshold has become invalid. To accomplish this efficiently, at each node we also save and update the number of instances in which attribute $a$ equals $v_1$, the number in which $a$ equals $v_2$, and the number of positive instances matching each of those criteria. When an attribute threshold becomes invalid, we sort and iterate through the node data $D$, resampling the invalid threshold to obtain a new valid threshold.

**Algorithm 2** Deleting a training instance from a DaRE tree (DELETE). Start at the root node.

- **Input:** node, depth $d$, instance to remove $(x, y)$
- update instance counts(node, $(x, y)$)  ($|D|$ and $|D_{\cdot,1}|$)
- **If** node is a LEAFNODE:
  - remove $(x, y)$ from leaf-instance pointers(node)
  - recompute leaf value(node)
  - remove $(x, y)$ from database and return
- **Else:**
  - update decision node statistics(node, $(x, y)$)
  - **If** node is a RANDOMNODE:
    - **If** node.selected threshold is invalid:
      - $D$ ← get data from leaf instances(node) $\setminus (x, y)$
      - **If** node.selected attribute $a$ is not constant: $v$ ← resample threshold $\in [a_{\min}, a_{\max})$; $D_\ell, D_r$ ← split on new threshold(node, $D$, $a$, $v$); node.$\ell$, node.$r$ ← TRAIN($D_\ell$, $d+1$), TRAIN($D_r$, $d+1$)
      - **Else:** node ← TRAIN($D$, $d$)  (Alg. 1)
      - remove $(x, y)$ from database and return
  - **Else** (node is a GREEDYNODE):
    - **If** there are invalid attributes or thresholds:
      - $D$ ← get data from leaf instances(node) $\setminus (x, y)$
      - resample invalid attributes and thresholds(node, $D$)
    - scores ← recompute split scores(node)
    - $a, v$ ← select optimal split(node, scores)
    - **If** the optimal split has changed:
      - $D_\ell, D_r$ ← split on new threshold(node, $D$, $a$, $v$)
      - node.$\ell$, node.$r$ ← TRAIN($D_\ell$, $d + 1$), TRAIN($D_r$, $d + 1$)
      - remove $(x, y)$ from database and return
  - **If** $x_a \le v$: DELETE(node.$\ell$, $d + 1$, $(x, y)$)  (Alg. 2)
  - **Else:** DELETE(node.$r$, $d + 1$, $(x, y)$)  (Alg. 2)

### 3.3. Random Splits

The third technique for efficient model updating is to choose the attribute and threshold for some of the decision nodes at random, independent of the split criterion. Specifically, given the data at a particular decision node $D \subseteq \mathcal{D}$, we sample an attribute $a \in \mathcal{P}$ uniformly at random, and then sample a threshold $v$ in the range $[a_{\min}, a_{\max})$, the minimum and maximum values for $a$ in $D$. We henceforth refer to these decision nodes as random nodes, in contrast to the greedy decision nodes that optimize the split criterion. Random nodes store and update $|D_\ell|$ and $|D_r|$, statistics based on the sampled threshold, and retrain only if $|D_\ell| = 0$ or $|D_r| = 0$ (i.e., $v$ is no longer in the range $[a_{\min}, a_{\max})$); however, since random nodes minimally depend on the statistics of the data, they rarely need to be retrained.

Random nodes are placed in the upper layers of the tree and greedy nodes are used for all other layers (excluding leaf nodes). We introduce $d_{\text{rmax}}$ as another hyperparameter indicating how many layers from the top of the tree should use random nodes (e.g., the top two layers of the tree are all random nodes if $d_{\text{rmax}} = 2$). Intuitively, nodes near the top of the tree contain more instances than nodes near the bottom, making them more expensive to retrain if necessary (a sketch of this random-node bookkeeping is given below).
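As a rough illustration (not the paper's implementation), a random node can be sketched as follows: it draws its attribute and threshold without consulting the split criterion, and after a deletion it only needs rebuilding if one side of the split becomes empty.

```python
import random

def sample_random_split(X_node, seed=0):
    """Pick an attribute uniformly at random, then a threshold drawn
    (approximately) uniformly from [a_min, a_max) for that attribute within
    the node's data X_node (a list of feature vectors). Assumes the chosen
    attribute is not constant at this node."""
    rng = random.Random(seed)
    a = rng.randrange(len(X_node[0]))
    column = [x[a] for x in X_node]
    a_min, a_max = min(column), max(column)
    v = rng.uniform(a_min, a_max)
    return a, v

def random_node_needs_retrain(n_left, n_right):
    """After a deletion, a random node is rebuilt only if one branch becomes
    empty, i.e., its threshold no longer lies inside the node's value range."""
    return n_left == 0 or n_right == 0
```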
Thus, we can significantly increase deletion efficiency by replacing those nodes with random ones. We can also maintain comparable predictive performance to a model with no random nodes by using greedy nodes in all subsequent layers, resulting in a greedy model built on top of a random projection of the input space (Haupt & Nowak, 2006). In our experiments, we compare DaRE RFs with random splits to those without, to evaluate the benefits of adding these random nodes. We refer to DaRE models with random nodes as random DaRE (R-DaRE) and those without as greedy DaRE (G-DaRE). G-DaRE RF can also be viewed as a special case of R-DaRE RF in which $d_{\text{rmax}} = 0$.

### 3.4. Complexity Analysis

The time for training a DaRE forest is identical to that of a standard RF:

**Theorem 3.2.** Given $n = |\mathcal{D}|$, $T$, $d_{\max}$, and $\tilde{p}$, the time complexity to train a DaRE forest is $O(T \, \tilde{p} \, n \, d_{\max})$.

The overhead of storing statistics and instance pointers is negligible compared to the cost of iterating through the entire dataset to score all attributes at each node. The key difference is in the deletion time, which can be much better depending on how much of each tree needs to be retrained:

**Theorem 3.3.** Given $d_{\max}$, $\tilde{p}$, and $k$, the time complexity to delete a single instance $(x, y) \in \mathcal{D}$ from a DaRE tree is $O(\tilde{p} \, k \, d_{\max})$ if the tree structure is unchanged and the attribute thresholds remain valid. If a node with $|D|$ instances has invalid attribute thresholds, then the additional time to choose new thresholds is $O(|D| \log |D|)$. If a node with $|D|$ instances at level $d$ needs to be retrained, then the additional retraining time is $O(\tilde{p} \, |D| \, (d_{\max} - d))$.

When the structure is unchanged, this is much more efficient than naive retraining, especially if the number of thresholds considered ($k$) is much smaller than $n$. In the worst case, if the split changes at the root of every tree, then deletion in a DaRE forest is no better than naive retraining. In practice, this is very unlikely, since different trees in the forest consider different sets of $\tilde{p}$ attributes at the root, and the difference between the best and second-best attribute-threshold pairs is usually bigger than a single instance. Choosing new thresholds also requires iterating through the training instances at a node. Thresholds only become invalid when an instance adjacent to the threshold is removed, so this is an infrequent event when $k$ is much smaller than $n$. To analyze this empirically, we evaluate our methods with both random and adversarially chosen deletions, approximating the average- and worst-case, respectively.

The main storage costs for a DaRE forest come from storing sets of attribute-threshold statistics at each greedy node, and the instance lists for the leaf nodes.

**Theorem 3.4.** Given $n = |\mathcal{D}|$, $d_{\max}$, $k$, $T$, and $\tilde{p}$, the space complexity of a DaRE forest is $O(k \, \tilde{p} \, 2^{d_{\max}} \, T + n \, T)$.

In our experiments, we analyze the space overhead of a DaRE forest by measuring its memory consumption as compared to a standard RF, quantifying the time/space trade-off introduced by DaRE RF to enable efficient data deletion.
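To put Theorems 3.2 and 3.3 side by side, a back-of-the-envelope comparison for a single tree (ignoring constants, and assuming no thresholds become invalid and no subtrees are retrained) is:

```latex
\underbrace{O(\tilde{p}\, n\, d_{\max})}_{\text{naive retraining of one tree}}
\quad \text{vs.} \quad
\underbrace{O(\tilde{p}\, k\, d_{\max})}_{\text{one DaRE deletion, structure unchanged}}
\qquad \Longrightarrow \qquad
\text{gap} \approx \frac{n}{k}, \quad
\text{e.g. } n = 10^{6},\; k = 25 \;\Rightarrow\; 4 \times 10^{4}.
```

In practice, occasional threshold resampling and subtree retraining reduce this gap, which is consistent with the measured speedups reported in Section 4.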
## 4. Experimental Evaluation

**Research Questions** Can we use G-DaRE RF to efficiently delete a significant number of instances as compared to naive retraining (RQ1)? Can we use R-DaRE RF to further increase deletion efficiency while maintaining comparable predictive performance (RQ2)?

**Datasets** We conduct our experiments on 13 publicly available datasets that represent problems well-suited for tree-based models, and one synthetic dataset we call Synthetic. For each dataset, we generate one-hot encodings for any categorical variable and leave all numeric and binary variables as is. For any dataset without a designated train and test split, we randomly sample 80% of the data for training and use the rest for testing. A summary of the datasets is in Table 1, and additional dataset details are in Appendix B.1.

**Hyperparameter Tuning** Due to the range of label imbalances in our datasets (Table 1 and Appendix B.1, Table 4), we measure the predictive performance of our models using average precision (AP) (Zhu, 2004) for datasets with a positive label percentage under 1%, AUC (Hanley & McNeil, 1982) for datasets between 1% and 20%, and accuracy (acc.) for the remaining datasets. Using these metrics and the Gini index as the split criterion, we tune the following hyperparameters: the maximum depth of each tree $d_{\max}$, the number of trees in the forest $T$, and the number of thresholds considered per attribute for greedy nodes $k$. Our protocol for tuning $d_{\text{rmax}}$ is as follows: first, we tune a greedy model (i.e., keeping $d_{\text{rmax}} = 0$ fixed) using 5-fold cross-validation. Once the optimal values for $d_{\max}$, $T$, and $k$ are found, we tune $d_{\text{rmax}}$ by incrementing its value from zero to $d_{\max}$, stopping when the model's cross-validation score exceeds a specified error tolerance as compared to the greedy model; for these experiments, we tune $d_{\text{rmax}}$ using absolute error tolerances of 0.1%, 0.25%, 0.5%, and 1.0%. Selected hyperparameter values are in Appendix B.2, Table 6.

**Table 1.** Dataset summary. $n$ = no. instances, $p$ = no. attributes, Pos. % = positive label percentage, Met. = predictive performance metric.

| Dataset | n | p | Pos. % | Met. |
| --- | --- | --- | --- | --- |
| Surgical | 14,635 | 90 | 25.2% | Acc. |
| Vaccine | 26,707 | 185 | 46.4% | Acc. |
| Adult | 48,842 | 107 | 23.9% | Acc. |
| Bank Mktg. | 41,188 | 63 | 11.3% | AUC |
| Flight Delays | 100,000 | 648 | 19.0% | AUC |
| Diabetes | 101,766 | 253 | 46.1% | Acc. |
| No Show | 110,527 | 99 | 20.2% | AUC |
| Olympics | 206,165 | 1,004 | 14.6% | AUC |
| Census | 299,285 | 408 | 6.2% | AUC |
| Credit Card | 284,807 | 29 | 0.2% | AP |
| CTR | 1,000,000 | 13 | 2.9% | AUC |
| Twitter | 1,000,000 | 15 | 17.0% | AUC |
| Synthetic | 1,000,000 | 40 | 50.0% | Acc. |
| Higgs | 11,000,000 | 28 | 53.0% | Acc. |

### 4.1. Methodology

We measure relative efficiency or speedup as the number of instances a DaRE model deletes in the time it takes the naive retraining approach to delete one instance (i.e., retrain without that instance); the number of instances deleted gives us the speedup over the naive approach. (System hardware specifications are in Appendix B.) We also measure the predictive performance of R-DaRE RF prior to deletion and compare its predictive performance to that of G-DaRE RF. Each experiment is repeated five times.

We determine the order of deletions using two different adversaries: Random and Worst-of-1000. The random adversary selects training instances to be deleted uniformly at random, while the worst-of-1000 adversary selects each instance by first selecting 1,000 candidate instances uniformly at random, and then choosing the instance that results in the most retraining, as measured by the total number of instances assigned to all retrained nodes across all trees.
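A minimal sketch of how this relative-efficiency metric can be computed from wall-clock measurements (the variable names are illustrative; this is not the paper's benchmarking code):

```python
def speedup_vs_naive(naive_retrain_seconds, dare_deletion_seconds):
    """Count how many DaRE deletions fit in the time the naive approach
    needs to service a single deletion (i.e., one full retrain)."""
    elapsed, n_deleted = 0.0, 0
    for t in dare_deletion_seconds:
        if elapsed + t > naive_retrain_seconds:
            break
        elapsed += t
        n_deleted += 1
    return n_deleted

# Example with made-up timings: naive retraining takes 100s per deletion,
# while each DaRE deletion takes 0.5s.
print(speedup_vs_naive(100.0, [0.5] * 1000))  # -> 200
```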
### 4.2. Deletion Efficiency Results

**Random Adversary** We present the results of the deletion experiments using the random adversary in Figure 1 (top). We find that G-DaRE RF is usually at least two orders of magnitude faster than naive retraining, while R-DaRE RF is faster than G-DaRE RF to a varying degree depending on the dataset and error tolerance. R-DaRE RF is also able to maintain comparable predictive performance to G-DaRE RF, typically staying within a test error difference of 1% depending on which tolerance is used to tune $d_{\text{rmax}}$ (Figure 1, bottom).

*Figure 1. Deletion efficiency of DaRE RF. Top & middle: number of instances deleted in the time it takes the naive retraining approach to delete one instance, using the random and worst-of-1000 adversaries, respectively (error bars represent standard deviation). Bottom: the increase in test error when using R-DaRE RF relative to the predictive performance of G-DaRE RF (error bars represent standard error).*

As an example of DaRE RF's utility, naive retraining took 1.3 hours to delete a single instance for the Higgs dataset. R-DaRE RF (tol = 0.25%, resulting in $d_{\text{rmax}} = 3$) deleted over 17,000 instances in that time, an average of 0.283s per deletion, while the average test set error increased by only 0.5%. In this case, R-DaRE RF provides a speedup of over four orders of magnitude, providing a tractable solution for something previously intractable.

**Worst-of-1000 Adversary** Against the more challenging worst-of-1000 adversary (Figure 1, middle), the speedup over naive deletion remains large, but is often an order of magnitude smaller. While R-DaRE models also decrease in efficiency, they maintain a significant advantage over G-DaRE RF, showing very similar trends of increased relative efficiency as when using the random adversary.

**Summary** A summary of the deletion efficiency results is in Table 2. When instances to delete are chosen randomly, G-DaRE RF is more than 250x faster than naively retraining after every deletion (taking the geometric mean over the 14 datasets). By adding randomness, R-DaRE models achieve even larger speedups, from 360x to over 1,200x, depending on the predictive performance tolerance (0.1% to 1.0%). The more sophisticated worst-of-1000 adversary can force more costly retraining. In this case, G-DaRE RF is more than 50x faster than naive retraining, and R-DaRE RF ranges from 80x to 260x depending on the tolerance.

**Table 2.** Summary of the deletion efficiency results: the minimum, maximum, and geometric mean (G. mean) of the speedup vs. the naive retraining method across all datasets.

| Model | Min. | Max. | G. mean |
| --- | --- | --- | --- |
| *Random adversary* | | | |
| G-DaRE | 6x | 12,232x | 257x |
| R-DaRE (tol=0.1%) | 10x | 9,735x | 366x |
| R-DaRE (tol=0.25%) | 13x | 17,044x | 494x |
| R-DaRE (tol=0.5%) | 68x | 22,011x | 681x |
| R-DaRE (tol=1.0%) | 145x | 35,856x | 1,272x |
| *Worst-of-1000 adversary* | | | |
| G-DaRE | 5x | 626x | 52x |
| R-DaRE (tol=0.1%) | 8x | 1,106x | 79x |
| R-DaRE (tol=0.25%) | 8x | 961x | 102x |
| R-DaRE (tol=0.5%) | 33x | 950x | 139x |
| R-DaRE (tol=1.0%) | 47x | 1,476x | 263x |
### 4.3. Effect of $d_{\text{rmax}}$ and $k$ on Deletion Efficiency

Figure 2 details the effect $d_{\text{rmax}}$ has on deletion efficiency under each adversary for the Bank Marketing dataset (other datasets show similar trends; see Appendix B.3). As expected, we see that deletion efficiency increases as $d_{\text{rmax}}$ increases. Predictive performance degrades as $d_{\text{rmax}}$ increases, but initially degrades gracefully, maintaining a low increase in test error even as the top ten layers of each tree are replaced with random nodes (+0.346% test error).

*Figure 2. Effect of increasing $d_{\text{rmax}}$ on deletion efficiency (left), predictive performance (middle), and the cost of retraining (right) using the random (top) and worst-of-1000 (bottom) adversaries for the Bank Marketing dataset. The predictive performance is independent of the adversary, as performance is measured before any deletions occur. Error bars represent standard deviation and standard error for the left and middle plots, respectively. In short, increasing $d_{\text{rmax}}$ increases deletion efficiency but gradually degrades predictive performance. Similar analyses for other datasets are in Appendix B.3.*

Figure 2 also shows the number of instances retrained at each depth, across all trees in the model. We immediately notice the increase in retraining cost when switching from the random (top-right plot) to the worst-of-1000 (bottom-right plot) adversary, especially at larger depths. This matches our intuition: since nodes deeper in the tree have fewer instances, each instance has a larger influence on the resulting split criterion over all attributes at a given node, increasing the likelihood that a chosen attribute may change and resulting in more subtree retraining.

Figure 3 shows the effect increasing $k$ has on predictive performance and deletion efficiency for the Surgical dataset (other datasets show similar trends; see Appendix B.4). In general, we find $k$ introduces a trade-off between predictive performance and deletion efficiency. However, our experiments show that modest values of $k$ can achieve competitive predictive performance while maintaining a high degree of deletion efficiency and incurring low storage costs.

*Figure 3. Effect of increasing $k$ on predictive performance (left) and deletion efficiency (right) for the Surgical dataset using the random adversary; $d_{\text{rmax}}$ is held fixed at 0. Error bars represent standard error and standard deviation for the left and right plots, respectively. Analysis for other datasets is in Appendix B.4.*

### 4.4. Space Overhead

This section shows the space overhead of DaRE forests by breaking the memory usage of G-DaRE RF into three constituent parts: 1) the structure of the model that is needed for making predictions, 2) the additional statistics stored at each decision node, and 3) the additional statistics and training-instance pointers stored at each leaf node. Parts 2) and 3), plus the size of the data, constitute the space needed by G-DaRE RF to enable efficient data removal.
Table 3 shows the space overhead of G-DaRE RF after training. We also show the training set size for each dataset, and the total memory usage of an SKLearn RF model using the same values for $T$ and $d_{\max}$ as G-DaRE RF. As expected, decision-node statistics often make up the bulk of the space overhead for G-DaRE RF; two exceptions are the Credit Card and CTR datasets, in which the size of the training-instance pointers outweighs the relatively low number of decision nodes (an average of 238 and 726 per tree, respectively) for those models. The total memory usage of the G-DaRE RF model is 10-113x larger than that of the SKLearn RF model. However, since both approaches require the training data to enable deletions (G-DaRE RF may need to retrain subtrees; SKLearn RF needs to retrain using the naive approach), the relative overhead of G-DaRE RF is the ratio of (data + G-DaRE RF) to (data + SKLearn RF); this results in an overhead of 6-26x, quantifying the time/space trade-off for efficient data deletion.

**Table 3.** Memory usage (in megabytes) for the training data, G-DaRE RF, and an SKLearn RF trained using the same values of $T$ and $d_{\max}$ as G-DaRE RF. The total memory usage for the G-DaRE RF model is broken down into: 1) the structure of the model needed for making predictions (Structure); 2) the additional statistics stored at each decision node (Decision Stats.); and 3) the additional statistics and training-instance pointers stored at each leaf node (Leaf Stats.). The space overhead for G-DaRE RF to enable efficient data deletion is measured as the ratio of the total memory usage of (data + G-DaRE RF) to (data + SKLearn RF). Results are averaged over five runs and the standard error is shown in parentheses.

| Dataset | Data | Structure | Decision Stats. | Leaf Stats. | Total | SKLearn RF | Overhead |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Surgical | 4 | 15 (0) | 388 (1) | 14 (0) | 417 (1) | 31 (0) | 12.0x |
| Vaccine | 16 | 18 (0) | 426 (1) | 14 (0) | 458 (2) | 37 (0) | 8.9x |
| Adult | 14 | 9 (0) | 227 (1) | 16 (0) | 252 (1) | 18 (0) | 8.3x |
| Bank Marketing | 8 | 23 (0) | 455 (2) | 33 (0) | 511 (2) | 51 (0) | 8.8x |
| Flight Delays | 207 | 37 (0) | 3,030 (4) | 171 (0) | 3,238 (5) | 66 (0) | 12.6x |
| Diabetes | 83 | 125 (0) | 4,968 (12) | 199 (0) | 5,292 (12) | 257 (1) | 15.8x |
| No Show | 35 | 91 (0) | 2,511 (5) | 203 (0) | 2,805 (5) | 187 (1) | 12.8x |
| Olympics | 663 | 27 (0) | 3,196 (22) | 338 (0) | 3,561 (23) | 57 (0) | 5.9x |
| Census | 326 | 33 (0) | 1,737 (14) | 169 (0) | 1,939 (14) | 63 (0) | 5.8x |
| Credit Card | 27 | 5 (0) | 105 (1) | 457 (0) | 567 (0) | 7 (0) | 17.5x |
| CTR | 45 | 6 (0) | 485 (2) | 642 (0) | 1,133 (0) | 10 (0) | 21.4x |
| Twitter | 48 | 186 (1) | 2,450 (11) | 693 (0) | 3,329 (12) | 332 (0) | 8.9x |
| Synthetic | 131 | 128 (1) | 5,661 (36) | 357 (0) | 6,146 (37) | 114 (1) | 25.6x |
| Higgs | 1,021 | 935 (4) | 39,416 (168) | 3,787 (1) | 44,138 (173) | 1,325 (9) | 19.3x |

## 5. Related Work

**Exact Unlearning** There are a number of works that support exact unlearning of SVMs (Cauwenberghs & Poggio, 2001; Tveit et al., 2003; Duan et al., 2007; Romero et al., 2007; Karasuyama & Takeuchi, 2009; Chen et al., 2019), in which the original goal was to accelerate leave-one-out cross-validation (Shao, 1993). More recently, Cao & Yang (2015) developed deletion mechanisms for several models that fall under the umbrella of non-adaptive SQ-learning (Kearns, 1998) in which data deletion is efficient and exact (e.g., naive Bayes, item-item recommendation, etc.); Schelter (2020) has also developed decremental update procedures for similar classes of models. Ginart et al. (2019) introduced a quantized variant of the k-means algorithm (Lloyd, 1982) called Q-k-means that supports exact data deletion.
Bourtoule et al. (2021) and Aldaghri et al. (2020) propose training an ensemble of deep learning models on disjoint shards of a dataset, saving a snapshot of each model for every data point; the biggest drawbacks are the large storage costs, applicability only to iterative learning algorithms, and the significant degradation of predictive performance. Schelter et al. (2021) enable efficient data removal for extremely randomized trees (ERTs) (Geurts et al., 2006) without needing to save the training data by precomputing alternative subtrees for splits sensitive to deletions; aside from only being applicable to ERTs, they assume that only a very small percentage of instances can be deleted.

**Approximate Unlearning** In contrast to exact unlearning, a promising definition of approximate unlearning (a.k.a. statistical unlearning) guarantees

$$\forall\, \mathcal{S} \subseteq \mathcal{H},\ \mathcal{D},\ (x, y) \in \mathcal{D}: \quad e^{-\epsilon} \le \frac{P(R(A(\mathcal{D}), \mathcal{D}, (x, y)) \in \mathcal{S})}{P(A(\mathcal{D} \setminus (x, y)) \in \mathcal{S})} \le e^{\epsilon}$$

($\epsilon$-certified removal: Guo et al. (2020), Eq. 1). Golatkar et al. (2020c;b) propose a scrubbing mechanism for deep neural networks that does not require any retraining; however, the computational complexity of their approach is currently quite high. Guo et al. (2020), Izzo et al. (2020), and Wu et al. (2020) propose different removal mechanisms for linear and logistic regression models that can be applied to the last fully connected layer of a deep neural network. Golatkar et al. (2020a) perform unlearning on a linear approximation of large-scale vision networks in a mixed-privacy setting. Fu et al. (2021) propose an unlearning procedure for models in a Bayesian setting using variational inference.

**Mitigation** Although not specifically designed as unlearning techniques, the following works propose different mechanisms for mitigating the impact of noisy, poisoned, or non-private training data. Baumhauer et al. (2020) propose an output filtering technique that prevents private data from being leaked; however, their approach does not update the model itself, potentially leaking information if the model were still accessible. Wang et al. (2019) and Du et al. (2019) fine-tune their models on corrected versions of poisoned or corrupted training instances to mitigate backdoor attacks (Gu et al., 2017) on image classifiers and anomaly detectors, respectively. Although both approaches show promising empirical performance, they provide no guarantees about the extent to which these problematic training instances are removed from the model (Sommer et al., 2020). Tople et al. (2019) analyze privacy leakage in language model snapshots before and after they are updated.

**Differential Privacy** Differential privacy (DP) (Dwork, 2006; Chaudhuri et al., 2011; Abadi et al., 2016) is a sufficient condition for approximate unlearning (in the case of a single deletion; sequential deletions may require using group DP (Dwork et al., 2014)), but it is an unnecessary and overly strict one, since machine unlearning does not require instances to be private (Guo et al., 2020). Furthermore, differentially-private random forest models often suffer from poor predictive performance (Fletcher & Islam, 2015; 2019); this is because the privacy budget (typically denoted $\epsilon$ or $\beta$) must be split among all the trees in the forest, and among the different layers in each tree. This typically results in a meaningless privacy budget (e.g., $\epsilon > 10$) (Fletcher & Islam, 2019), a relaxed definition of DP (Rana et al., 2015), extremely randomized trees (Geurts et al., 2006; Fletcher & Islam, 2017), or very small forests (e.g., $T = 10$) (Consul & Williamson, 2020).
## 6. Discussion

Since data deletions in DaRE models are exact, membership inference attacks (Yeom et al., 2018; Carlini et al., 2018) are guaranteed to be unsuccessful for instances deleted from the model. DaRE models also reduce the need for deletion verification methods (Shintre & Dhaliwal, 2019; Sommer et al., 2020). However, one must be aware that DaRE models (as well as any unlearning method) can leak which instances are deleted if an adversary has access to the model before and after the deletion (Chen et al., 2020). Although privacy is a strong motivator for this work, there are a number of other useful applications for DaRE forests.

**Instance-Based Interpretability** A popular form of interpretability looks at how much each training instance contributes to a given prediction (Koh & Liang, 2017; Yeh et al., 2018; Sharchilev et al., 2018; Pruthi et al., 2020; Chen et al., 2021). The naive approach to this task involves leave-one-out retraining for every training instance in order to analyze the effect each instance has on a target prediction, but this is typically intractable for most machine learning models and datasets. DaRE models can more efficiently compute the same training-instance attributions as the naive approach, making leave-one-out retraining a potentially viable option for generating instance-attribution explanations for random forest models.

**Dataset Cleaning** Aside from removing user data for privacy reasons, one may also wish to efficiently remove outliers (Rahmani & Li, 2019; Dong et al., 2019) or training instances that are noisy, corrupted, or poisoned (Mozaffari-Kermani et al., 2014; Steinhardt et al., 2017).

**Continual Learning** Our methods can also be used to add data to a random forest model, allowing for continuous updating as data is added and removed. This makes them well-suited to continual learning settings with streaming data (Chrysakis & Moens, 2020; Knoblauch et al., 2020). However, the hyperparameters may need to be periodically retuned as the size or distribution of the data shifts from adding and/or deleting more and more instances.

**Eco-Friendly Machine Learning** Finally, this line of research promotes a more economically and environmentally sustainable approach to building learning systems; if a model can be continuously updated only as necessary, avoiding frequent retraining, significant time and computational resources can be spared (Gupta et al., 2020).

## 7. Conclusion

In this work, we introduced DaRE RF, a random forest variant that supports efficient model updates in response to repeated deletions of training instances. We find that, on average, DaRE models are 2-3 orders of magnitude faster than the naive retraining approach with no loss in accuracy, and additional efficiency can be achieved if slightly worse predictive performance is tolerated. For future work, there are many exciting opportunities and applications of DaRE forests, from maintaining user privacy to building interpretable models to cleaning data, all without retraining from scratch. One could even investigate the possibility of extending DaRE forests to boosted trees (Chen & Guestrin, 2016; Ke et al., 2017; Prokhorenkova et al., 2018). At its best, DaRE RF was more than four orders of magnitude faster than naive retraining, so it has the potential to enable new applications of model updating that were previously intractable.
## Acknowledgments

We would like to thank Zayd Hammoudeh for useful discussions and feedback, and the reviewers for their constructive comments that improved this paper. This work was supported by a grant from the Air Force Research Laboratory and the Defense Advanced Research Projects Agency (DARPA), agreement number FA8750-16-C-0166, subcontract K001892-00-S05. This work benefited from access to the University of Oregon high performance computer, Talapas.

## References

- Abadi, M., Chu, A., Goodfellow, I., et al. Deep learning with differential privacy. In CCS, 2016.
- Aldaghri, N., Mahdavifar, H., and Beirami, A. Coded machine unlearning. arXiv preprint arXiv:2012.15721, 2020.
- Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 2014.
- Baumhauer, T., Schöttle, P., and Zeppelzauer, M. Machine unlearning: Linear filtration for logit-based classifiers. arXiv preprint arXiv:2002.02730, 2020.
- Biau, G., Scornet, E., and Welbl, J. Neural random forests. Sankhya A, 2019.
- Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C., et al. Machine unlearning. In IEEE Symposium on Security and Privacy, 2021.
- Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. CRC Press, 1984.
- Bull, P., Slavitt, I., and Lipstein, G. Harnessing the power of the crowd to increase capacity for data science in the social sector. In ICML #Data4Good Workshop, 2016.
- California. California Consumer Privacy Act. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180AB375, 2018. [Online; accessed 16-April-2020].
- Canada. PIPEDA. https://www.priv.gc.ca/en/opc-news/news-and-announcements/2018/an_181010/, 2018. [Online; accessed 19-August-2020].
- Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In IEEE Symposium on Security and Privacy, 2015.
- Carlini, N., Liu, C., et al. The secret sharer: Measuring unintended neural network memorization & extracting secrets. arXiv preprint arXiv:1802.08232, 2018.
- Cauwenberghs, G. and Poggio, T. Incremental and decremental support vector machine learning. In NeurIPS, 2001.
- Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. JMLR, 2011.
- Chen, M., Zhang, Z., et al. When machine unlearning jeopardizes privacy. arXiv preprint arXiv:2005.02205, 2020.
- Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In KDD, 2016.
- Chen, Y., Xiong, J., Xu, W., and Zuo, J. A novel online incremental and decremental learning algorithm based on variable support vector machine. Cluster Computing, 2019.
- Chen, Y., Li, B., Yu, H., Wu, P., and Miao, C. Hydra: Hypergradient data relevance analysis for interpreting deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
- Chrysakis, A. and Moens, M.-F. Online continual learning from imbalanced data. In ICML, 2020.
- Consul, S. and Williamson, S. Differentially private median forests for regression and classification. arXiv preprint arXiv:2006.08795, 2020.
- Criteo. Criteo click-through rate prediction. https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/, 2015. [Online; accessed 25-January-2021].
- Denil, M., Matheson, D., and De Freitas, N. Narrowing the gap: Random forests in theory and in practice. In ICML, 2014.
- Dong, Y., Hopkins, S., and Li, J. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In NeurIPS, 2019.
- DrivenData. Flu shot learning: Predict H1N1 and seasonal flu vaccines. https://www.drivendata.org/competitions/66/flu-shot-learning/data/, 2019. [Online; accessed 12-August-2020].
- Du, M., Chen, Z., Liu, C., Oak, R., and Song, D. Lifelong anomaly detection through unlearning. In CCS, 2019.
- Dua, D. and Graff, C. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2019.
- Duan, H., Li, H., He, G., and Zeng, Q. Decremental learning algorithms for nonlinear langrangian and least squares support vector machines. In OSB, 2007.
- Dwork, C. Differential privacy. In ICALP, 2006.
- Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 2014.
- EU. Regulation (EU) 2016/679. https://eur-lex.europa.eu/eli/reg/2016/679/oj, 2016. [Online; accessed 16-April-2020].
- Fletcher, S. and Islam, M. Z. A differentially private decision forest. In AusDM, 2015.
- Fletcher, S. and Islam, M. Z. Differentially private random decision forests using smooth sensitivity. Expert Systems with Applications, 2017.
- Fletcher, S. and Islam, M. Z. Decision tree classification with differential privacy: A survey. ACM Computing Surveys, 2019.
- Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 2001.
- Fu, S., He, F., Xu, Y., and Tao, D. Bayesian inference forgetting. arXiv preprint arXiv:2101.06417, 2021.
- Garg, S., Goldwasser, S., and Vasudevan, P. N. Formalizing data deletion in the context of the right to be forgotten. arXiv preprint arXiv:2002.10635, 2020.
- Genuer, R., Poggi, J.-M., Tuleau-Malot, C., and Villa-Vialaneix, N. Random forests for big data. Big Data Research, 9:28-46, 2017.
- Geurts, P., Ernst, D., and Wehenkel, L. Extremely randomized trees. Machine Learning, 2006.
- Ginart, A., Guan, M., Valiant, G., and Zou, J. Y. Making AI forget you: Data deletion in machine learning. In NeurIPS, 2019.
- Golatkar, A., Achille, A., Ravichandran, A., Polito, M., and Soatto, S. Mixed-privacy forgetting in deep networks. arXiv preprint arXiv:2012.13431, 2020a.
- Golatkar, A., Achille, A., and Soatto, S. Forgetting outside the box: Scrubbing deep networks of information accessible from input-output observations. arXiv preprint arXiv:2003.02960, 2020b.
- Golatkar, A., Achille, A., and Soatto, S. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In CVPR, 2020c.
- Gu, T., Dolan-Gavitt, B., and Garg, S. BadNets: Identifying vulnerabilities in the machine learning model supply chain. In Machine Learning and Computer Security Workshop, 2017.
- Guo, C., Goldstein, T., Hannun, A., and van der Maaten, L. Certified data removal from machine learning models. In ICML, 2020.
- Gupta, A., Lanteigne, C., and Kingsley, S. SECure: A social and environmental certificate for AI systems. In ICML Deploying and Monitoring Machine Learning Systems Workshop, 2020.
- Hanley, J. and McNeil, B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.
- Haupt, J. and Nowak, R. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 2006.
- Izzo, Z., Smart, M. A., Chaudhuri, K., and Zou, J. Approximate data deletion from machine learning models: Algorithms and evaluations. arXiv preprint arXiv:2002.10077, 2020.
- Kaggle. Medical appointment no shows. https://www.kaggle.com/joniarroba/noshowappointments, 2016. [Online; accessed 25-January-2021].
- Kaggle. Credit card fraud detection. https://www.kaggle.com/mlg-ulb/creditcardfraud/, 2018a. [Online; accessed 27-July-2020].
- Kaggle. 120 years of Olympic history: Athletes and events. https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results, 2018b. [Online; accessed 28-July-2020].
- Kaggle. Dataset surgical binary classification. https://www.kaggle.com/omnamahshivai/surgical-dataset-binary-classification/version/1#, 2018c. [Online; accessed 29-July-2020].
- Karasuyama, M. and Takeuchi, I. Multiple incremental decremental learning of support vector machines. In NeurIPS, 2009.
- Ke, G., Meng, Q., et al. LightGBM: A highly efficient gradient boosting decision tree. In NeurIPS, 2017.
- Kearns, M. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 1998.
- Knoblauch, J., Husain, H., and Diethe, T. Optimal continual learning has perfect memory and is NP-hard. In ICML, 2020.
- Kocev, D., Vens, C., et al. Tree ensembles for predicting structured outputs. Pattern Recognition, 2013.
- Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In ICML, 2017.
- Kwak, C., Lee, J., et al. Let machines unlearn: Machine unlearning and the right to be forgotten. SIGSEC, 2017.
- Linero, A. R. and Yang, Y. Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society, 2018.
- Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 1982.
- Lundberg, S. M., Erion, G. G., and Lee, S.-I. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
- Mentch, L. and Hooker, G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research, 2016.
- Moro, S., Cortez, P., et al. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 2014.
- Mozaffari-Kermani, M., Sur-Kolay, S., et al. Systematic poisoning attacks on and defenses for machine learning in healthcare. Journal of Biomedical and Health Informatics, 2014.
- Pedregosa, F., Varoquaux, G., Gramfort, A., et al. Scikit-learn: Machine learning in Python. JMLR, 2011.
- Prokhorenkova, L., Gusev, G., et al. CatBoost: Unbiased boosting with categorical features. In NeurIPS, 2018.
- Pruthi, G., Liu, F., Kale, S., and Sundararajan, M. Estimating training data influence by tracing gradient descent. Advances in Neural Information Processing Systems, 33, 2020.
- Quinlan, J. R. C4.5: Programs for Machine Learning. Elsevier, 2014.
- Rahmani, M. and Li, P. Outlier detection and robust PCA using a convex measure of innovation. In NeurIPS, 2019.
- Rana, S., Gupta, S. K., and Venkatesh, S. Differentially private random forest with high utility. In ICDM, 2015.
- Research and Administration, I. T. Airline on-time performance and causes of flight delays. https://catalog.data.gov/dataset/airline-on-time-performance-and-causes-of-flight-delays-on-time-data, 2019. [Online; accessed 16-April-2020].
- Romero, E., Barrio, I., and Belanche, L. Incremental and decremental learning for linear support vector machines. In International Conference on Artificial Neural Networks, 2007.
- Schelter, S. Amnesia - machine learning models that can forget user data very fast. In CIDR, 2020.
- Schelter, S., Grafberger, S., and Dunning, T. HedgeCut: Maintaining randomised trees for low-latency machine unlearning. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, 2021.
- Sedhai, S. and Sun, A. HSpam14: A collection of 14 million tweets for hashtag-oriented spam research. In SIGIR, 2015.
- Shao, J. Linear model selection by cross-validation. Journal of the American Statistical Association, 1993.
- Sharchilev, B., Ustinovskiy, Y., Serdyukov, P., and de Rijke, M. Finding influential training samples for gradient boosted decision trees. In ICML, 2018.
- Shintre, S. and Dhaliwal, J. Verifying that the influence of a user data point has been removed from a machine learning classifier. https://patents.google.com/patent/US10225277B1/en, 2019.
- Shintre, S., Roundy, K. A., and Dhaliwal, J. Making machine learning forget. In Annual Privacy Forum, 2019.
- Shoeybi, M., Patwary, M., et al. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- Sommer, D. M., Song, L., Wagh, S., and Mittal, P. Towards probabilistic verification of machine unlearning. arXiv preprint arXiv:2003.04247, 2020.
- Steinhardt, J., Koh, P. W. W., and Liang, P. S. Certified defenses for data poisoning attacks. In NeurIPS, 2017.
- Strack, B., DeShazo, J. P., et al. Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014.
- Tople, S., Brockschmidt, M., et al. Analyzing privacy loss in updates of natural language models. arXiv preprint arXiv:1912.07942, 2019.
- Tveit, A., Hetland, M. L., and Engum, H. Incremental and decremental proximal support vector classification using decay coefficients. In DaWaK, 2003.
- Villaronga, E. F., Kieseberg, P., and Li, T. Humans forget, machines remember: Artificial intelligence and the right to be forgotten. Computer Law & Security Review, 2018.
- Wager, S. and Athey, S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 2018.
- Wang, B., Yao, Y., et al. Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE Symposium on Security and Privacy, 2019.
- Wu, Y., Dobriban, E., and Davidson, S. B. DeltaGrad: Rapid retraining of machine learning models. In ICML, 2020.
- Yeh, C.-K., Kim, J., et al. Representer point selection for explaining deep neural networks. In NeurIPS, 2018.
- Yeom, S., Giacomelli, I., et al. Privacy risk in machine learning: Analyzing the connection to overfitting. In CSF, 2018.
- Zaman, F. and Hirose, H. Effect of subsampling rate on subbagging and related ensembles of stable classifiers. In PReMI, 2009.
- Zhu, M. Recall, precision and average precision. University of Waterloo, Waterloo, 2004.