# Scalable Bayesian Rule Lists

Hongyu Yang (Massachusetts Institute of Technology, Cambridge, Massachusetts, USA); Cynthia Rudin (Duke University, Durham, North Carolina, USA); Margo Seltzer (Harvard University, Cambridge, Massachusetts, USA). Correspondence to: Hongyu Yang.

Abstract: We present an algorithm for building probabilistic rule lists that is two orders of magnitude faster than previous work. Rule list algorithms are competitors for decision tree algorithms. They are associative classifiers, in that they are built from pre-mined association rules. They have a logical structure that is a sequence of IF-THEN rules, identical to a decision list or one-sided decision tree. Instead of using greedy splitting and pruning like decision tree algorithms, we aim to fully optimize over rule lists, striking a practical balance between accuracy, interpretability, and computational speed. The algorithm presented here uses a mixture of theoretical bounds (tight enough to have practical implications as a screening or bounding procedure), computational reuse, and highly tuned language libraries to achieve computational efficiency. Currently, for many practical problems, this method achieves better accuracy and sparsity than decision trees, with practical running times. The predictions in each leaf are probabilistic.

## 1. Introduction

Our goal is to build a competitor for decision tree and rule learning algorithms in terms of accuracy, interpretability, and computational speed. Decision trees are widely used, particularly in industry, because of their interpretability: their logical IF-THEN structure allows predictions to be explained to users. However, decision tree algorithms have a serious flaw: they are constructed using greedy splitting from the top down, together with greedy pruning of nodes. They do not globally optimize any function; instead, they are composed entirely of local optimization heuristics. If the algorithm makes a mistake in the splitting near the top of the tree, it is difficult to undo, and consequently the trees become long and uninterpretable unless they are heavily pruned, in which case accuracy suffers. In general, decision tree algorithms are computationally tractable, not particularly accurate, and less sparse and interpretable than they could be. This leaves users with no good alternative if they desire an accurate yet sparse logical classifier.

Several important ingredients provide the underpinning for our method:

(i) A principled objective, which is the posterior distribution for the Bayesian Rule List (BRL) model (see Letham et al., 2015). We optimize this objective over rule lists. Our algorithm is called Scalable Bayesian Rule Lists (SBRL).

(ii) A useful statistical approximation that narrows the search space. We assume that each rule in the rule list contains (captures) a number of observations that is bounded below. Because of this approximation, the set of conditions defining each leaf is a frequent pattern; that is, the antecedents within the rule list are all frequent patterns. All of the possible frequent patterns can be pre-mined from the dataset using one of the standard frequent pattern mining methods (a minimal mining sketch follows this list). This leaves us with a much smaller optimization problem: we optimize over the set of possible pre-mined antecedents and their order to create the rule list.

(iii) High performance language libraries to achieve computational efficiency. Optimization over rule lists is done through repeated low-level computations. At every iteration, we make a change to the rule list and need to evaluate the new rule list on the data. High performance calculations (novel to this problem) speed up this evaluation.

(iv) Computational reuse. When we evaluate a rule list that has been modified from a previous rule list, we need only change the evaluation of points below the change in the rule list. Thus we can reuse the computation above the change.

(v) Analytical bounds on BRL's posterior that are tight enough to be used in practice for screening association rules and providing bounds on the optimal solution. These are provided in two theorems in this paper.
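Any standard frequent pattern miner (e.g., Apriori or FP-Growth) can supply the antecedents. As a concrete illustration only, here is a minimal Python sketch that enumerates conjunctions of up to two feature=value conditions and keeps those whose support meets a threshold; the dataset, threshold, and function name are illustrative assumptions, not part of SBRL.

```python
# Minimal antecedent pre-mining sketch (illustrative, not the paper's code).
# Keeps conjunctions of feature=value conditions whose support in the data
# meets a minimum threshold, so they are frequent patterns.
from itertools import combinations

def mine_antecedents(X, min_support, max_cardinality=2):
    """X: list of dicts mapping feature name -> value."""
    singles = {(f, v) for row in X for f, v in row.items()}
    antecedents = []
    for size in range(1, max_cardinality + 1):
        for conj in combinations(sorted(singles), size):
            # Skip conjunctions that reuse a feature (e.g., odor=foul AND odor=none).
            if len({f for f, _ in conj}) < size:
                continue
            support = sum(all(row.get(f) == v for f, v in conj) for row in X)
            if support >= min_support:
                antecedents.append(conj)
    return antecedents

X = [{"bruises": "no", "odor": "foul"},
     {"bruises": "no", "odor": "none"},
     {"bruises": "yes", "odor": "none"}]
print(mine_antecedents(X, min_support=2))
# [(('bruises', 'no'),), (('odor', 'none'),)]
```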
Through a series of controlled experiments, we show that SBRL is over two orders of magnitude faster than the previous best code for this problem. For example, Table 1 presents a rule list that we learned for the UCI Mushroom dataset (see Bache & Lichman, 2013). This rule list is a predictive model for whether a mushroom is edible. It was created in about 9 seconds on a laptop and achieves perfect out-of-sample accuracy.

Table 1. Rule list for the Mushroom dataset from the UCI repository (data available from Bache & Lichman, 2013). PP: the probability that the label is positive (the mushroom is edible).

| RULE LIST | PP |
|---|---|
| IF BRUISES=NO, ODOR=NOT-IN-(NONE,FOUL) | 0.001 |
| ELSE IF ODOR=FOUL, GILL-ATTACHMENT=FREE | 0.001 |
| ELSE IF GILL-SIZE=BROAD, RING-NUMBER=ONE | 0.999 |
| ELSE IF STALK-ROOT=UNKNOWN, STALK-SURFACE-ABOVE-RING=SMOOTH | 0.996 |
| ELSE IF STALK-ROOT=UNKNOWN, RING-NUMBER=ONE | 0.039 |
| ELSE IF BRUISES=NO, VEIL-COLOR=WHITE | 0.995 |
| ELSE IF STALK-SHAPE=TAPERING, RING-NUMBER=ONE | 0.986 |
| ELSE IF HABITAT=PATHS | 0.958 |
| ELSE (DEFAULT RULE) | 0.001 |

## 2. Review of Bayesian Rule Lists

Scalable Bayesian Rule Lists maximizes the posterior distribution of the Bayesian Rule Lists (BRL) model. Our training set is $\{(x_i, y_i)\}_{i=1}^n$, where the $x_i \in \mathcal{X}$ encode features, and the $y_i$ are labels, which in our case are binary, either 0 or 1. A Bayesian rule list has the following form:

if $x$ obeys $a_1$ then $y \sim \text{Binomial}(\theta_1)$, $\theta_1 \sim \text{Beta}(\alpha + N_1)$
else if $x$ obeys $a_2$ then $y \sim \text{Binomial}(\theta_2)$, $\theta_2 \sim \text{Beta}(\alpha + N_2)$
...
else if $x$ obeys $a_m$ then $y \sim \text{Binomial}(\theta_m)$, $\theta_m \sim \text{Beta}(\alpha + N_m)$
else $y \sim \text{Binomial}(\theta_0)$, $\theta_0 \sim \text{Beta}(\alpha + N_0)$.

Here, the antecedents $\{a_j\}_{j=1}^m$ are conditions on the $x$'s that are either true or false; for instance, if $x$ is a patient, $a_j$ is true when $x$'s age is above 60 years old and $x$ has diabetes, and false otherwise. The vector $\alpha = [\alpha_1, \alpha_0]$ has a prior parameter for each of the two labels: for each rule, the prediction obeys $y \sim \text{Binomial}(\theta_j)$ with $\theta_j \mid \alpha \sim \text{Beta}(\alpha)$. The notation $N_j$ is the vector of counts, where $N_{j,l}$ is the number of observations $x_i$ that satisfy condition $a_j$ but none of the previous conditions $a_1, \ldots, a_{j-1}$, and that have label $y_i = l$, where $l$ is either 1 or 0. $N_j$ is added to the prior parameters $\alpha$, following the usual derivation of the posterior for the Beta-Binomial. The default rule is at the bottom; it makes predictions for observations that do not satisfy any of the conditions. When an observation satisfies condition $a_j$ but none of $a_1, \ldots, a_{j-1}$, we say that the observation is captured by rule $j$. Formally:

Definition 1. Rule $j$ captures observation $i$, denoted $\text{Captr}(i) = j$, when $j = \min\{j' : a_{j'}(x_i) = \text{True}\}$, i.e., $a_j$ is the first antecedent in the list that $x_i$ satisfies.
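For concreteness, the captures relation and the counts $N_{j,l}$ can be computed in a few lines. The following Python sketch (illustrative only, not the paper's implementation) assigns each observation to the first rule whose antecedent it satisfies and tallies the label counts, with index 0 reserved for the default rule, matching the notation above.

```python
# Minimal sketch of "captures" (Definition 1) and the counts N_{j,l}.
# Antecedents are boolean predicates on an observation; this is
# illustrative code, not the paper's implementation.

def capture_counts(antecedents, X, y):
    """antecedents: predicates a_1..a_m; X: observations; y: 0/1 labels.
    Returns N with N[j] = [N_{j,0}, N_{j,1}], where j = 0 is the default rule."""
    m = len(antecedents)
    N = [[0, 0] for _ in range(m + 1)]
    for xi, yi in zip(X, y):
        # Captr(i) = smallest j with a_j(x_i) = True, else the default rule (0).
        j = next((k + 1 for k, a in enumerate(antecedents) if a(xi)), 0)
        N[j][yi] += 1
    return N

X = [{"age": 65, "diabetes": True}, {"age": 40, "diabetes": False}]
y = [1, 0]
rules = [lambda x: x["age"] > 60 and x["diabetes"]]
print(capture_counts(rules, X, y))  # [[1, 0], [0, 1]]
```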
Bayesian Rule Lists is an associative classification method, in the sense that the antecedents are first mined from the database, and then the set of rules and their order are learned. The rule mining step is fast, and fast parallel implementations are available. Any frequent pattern mining method will suffice, since the method needs only to produce those conditions with sufficiently high support in the database. The support of antecedent $a_j$, denoted $\text{supp}(a_j)$, is the number of observations that obey condition $a_j$. A condition is a conjunction of expressions over feature values, e.g., "age $\in$ [40,50] and color=white." The hard part is learning the rule list, which is what this paper focuses on. It is an optimization over subsets of rules and their permutations. The likelihood for the model discussed above is:

$$\text{Likelihood} = p(y \mid x, d, \alpha) \propto \prod_{j=0}^{m} \frac{\Gamma(N_{j,0} + \alpha_0)\,\Gamma(N_{j,1} + \alpha_1)}{\Gamma(N_{j,0} + N_{j,1} + \alpha_0 + \alpha_1)},$$

where $d$ denotes the rules in the list and their order, $d = (m, \{a_j, \theta_j\}_{j=0}^m)$. Intuitively, one can see that having more of one class and less of the other class within a rule makes the likelihood larger. To see this, note that if $N_{j,0}$ is large and $N_{j,1}$ is small (or vice versa), the likelihood term for rule $j$ is large.

We next discuss the prior. It has three terms: one governing the number of rules $m$ in the list, one governing the size $c_j$ of each antecedent $j$, and one governing the choice of antecedent condition $a_j$ of rule $j$ given its size. Specifically, $c_j$ is the cardinality of antecedent $a_j$, also written $|a_j|$: the number of conjunctive clauses in antecedent $a_j$. E.g., if $a_j$ is "x1=green and x2<50," this has cardinality 2. Following Letham et al. (2015), the prior is $p(d \mid A, \lambda, \eta) = p(m \mid A, \lambda) \prod_{j=1}^{m} p(c_j \mid c_{<j}, A, \eta)\, p(a_j \mid a_{<j}, c_j, A)$, where $A$ denotes the set of pre-mined antecedents and $p(m \mid A, \lambda)$ is a Poisson distribution with parameter $\lambda$, truncated at $|A|$. Notation $a_{<j}$ denotes the antecedents before $j$ in the rule list.

## 3. Implementation

The search over rule lists repeatedly modifies a candidate list and re-evaluates it on the data. We represent which observations satisfy, and are captured by, each rule as bit vectors. When rule $k$ of a rule list changes, the captures vectors of the rules below it can be updated using logical operators acting upon the rule list's captures vector for $k$ and the rule's captures vector for each rule $j > k$. This shortens the run time of the algorithm in practice by approximately 50%.

A fast algebra for computational reuse: Our use of bit vectors transforms the large number of set operations (performed in a traditional implementation) into a set of boolean operations on bit vectors. We have custom implementations (discussed in the full version of this work, Yang et al., 2017) for the following (a minimal sketch of this bit-vector algebra appears after the ablation study below):

(i) Remove rule $k$ uses boolean operations on bit vectors to redistribute the observations captured by rule $k$ to the rules below it in the list.

(ii) Insert rule $k$ shifts the rules below $k$ down one position, determines which observations are captured by the new rule, and removes those observations from the rules below it.

(iii) Swap consecutive rules updates only which observations are captured for the two swapped rules.

(iv) Generalized swap updates only the observations captured for the rules between the two rules to be swapped.

All operations use only bit vector computations.

Ablation study: Having transformed expensive set operations into bit vector operations, we can now leverage both hardware vector instructions and optimized software libraries. We investigated four alternative implementations, each improving efficiency over the previous one. (i) First, we include the original Python implementation for comparison. (ii) Next, we retained our Python implementation but converted from set operations to bit operations. (iii) Then, we used the Python gmpy library to perform the bit operations. (iv) Finally, we moved the implementation from Python to C, represented the bit vectors as multiprecision integers, used the GMP library, which is faster on large data sets, and implemented the algebra for computational reuse outlined above.
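To make this concrete, here is a minimal Python sketch of the captures algebra using arbitrary-precision integers as bit vectors (in the spirit of the gmpy-based variant; illustrative, not the paper's C/GMP code). Bit $i$ of a rule's mask is 1 iff observation $i$ satisfies its antecedent, and removing rule $k$ redistributes its captured observations to the rules below it using only Boolean operations.

```python
# Minimal sketch of the bit-vector captures algebra (illustrative, not the
# paper's C/GMP implementation). Python ints act as arbitrary-length bit
# vectors: bit i is 1 iff observation i satisfies / is captured by a rule.
# int.bit_count() (Python 3.10+) plays the role of a hardware popcount.

def compute_captures(rule_masks, n):
    """rule_masks[j]: bit vector of observations satisfying antecedent j.
    Returns captures[j]: observations captured by rule j (Definition 1),
    with the default rule's captures appended last."""
    free = (1 << n) - 1          # observations not yet captured
    captures = []
    for mask in rule_masks:
        cap = mask & free        # satisfied here and by no earlier rule
        captures.append(cap)
        free &= ~cap             # mark them as captured
    captures.append(free)        # default rule captures the rest
    return captures

def remove_rule(rule_masks, captures, k):
    """Computational reuse for 'remove rule k': redistribute the observations
    captured by rule k to the rules below it; rules above k are untouched."""
    freed = captures[k]
    del rule_masks[k]
    del captures[k]
    for j in range(k, len(captures) - 1):   # rules below the change only
        newly = rule_masks[j] & freed
        captures[j] |= newly
        freed &= ~newly
    captures[-1] |= freed                   # leftovers go to the default rule
    return captures

# Example: 4 observations, 3 rules; count captured points with popcount.
masks = [0b0011, 0b0110, 0b1100]
caps = compute_captures(list(masks), 4)
print([c.bit_count() for c in caps])        # [2, 1, 1, 0]
caps = remove_rule(list(masks), caps, 0)
print([c.bit_count() for c in caps])        # [2, 1, 1]
```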
To evaluate how each of these steps improved the computation time of the algorithm, we conducted a controlled experiment in which each version of the algorithm (corresponding to the four steps above) was given the same data (the UCI Adult dataset, divided into three folds), the same set of rules, and the same number of MCMC iterations (20,000). We created boxplots of the log10 run times over the different folds, shown in Figure 1. The final code is over two orders of magnitude faster than the original optimized Python code (that of Letham et al., 2015).

Figure 1. Boxplots of log10 runtime (seconds) for the four implementations (py_origin, py_bitOp, py_gmpy, C_bitOp_reuse). From the original Python code, the final code is over two orders of magnitude faster.

## 4. Theoretical Results

We prove two bounds. First, we provide an upper bound on the number of rules in a maximum a posteriori (MAP) rule list. This allows us to narrow our search space to rule lists below a certain size. Second, we provide a constraint that eliminates certain prefixes of rule lists. This prevents our algorithm from searching regions of the space that provably do not contain the maximum a posteriori rule list.

### 4.1. Upper bound on the number of rules in the list

Given the number of features, the parameter $\lambda$ for the size of the list, and parameters $\alpha_0$ and $\alpha_1$, we can derive an upper bound on the size of a maximum a posteriori rule list. This formalizes how the prior on the number of rules is strong enough to overwhelm the likelihood. We are considering binary antecedents and binary features (e.g., $a_j$ is true if female), so the total number of possible antecedents of each size can be calculated directly. When creating the upper bound, within the proof, we hypothetically exhaust the antecedents from each size category in turn, starting with the smallest sizes. We discuss this further below.

Let $|Q_c|$ be the number of antecedents that remain in the pile that have $c$ logical conditions. The sequence of $b$'s that we define next is a lower bound for the possible sequence of $|Q_c|$'s. In particular, $b_0, b_1, b_2$, etc., represents the sequence of sizes of rules that would provide the smallest possible $|Q_c|$'s. Intuitively, the sequence of $b$'s arises when we deplete the antecedents of size 1, then deplete all of size 2, etc. The number of ways to do this is given by the $b_{\text{index}}$ values, computed as in Algorithm 1, where $P$ is the number of features and $b = \{b_0, b_1, \ldots, b_{2^P - 1}\}$ is a vector of length $2^P$ (a Python rendering appears at the end of this subsection). We will use $b$ within the theorem below.

Algorithm 1 Calculating the $b_j$'s
  Initialization: index = 0, $b_0 = 1$
  for $c = 0$ to $\lfloor P/2 \rfloor$ do
    for $j = \binom{P}{c}$ downto 1 do
      index = index + 1
      $b_{\text{index}} = j$
    end for
    if $c + c \neq P$ then
      for $j = \binom{P}{P-c}$ downto 1 do
        index = index + 1
        $b_{\text{index}} = j$
      end for
    end if
  end for

In our notation, rule list $d$ is defined by the antecedents and the probabilities on the right side of the rules, $d = (m, \{a_l, \theta_l\}_{l=1}^m)$.

Theorem 1. The size $m$ of any MAP rule list $d$ (with parameters $\lambda$, $\eta$, and $\alpha = (\alpha_0, \alpha_1)$) obeys $m \le m_{\max}$, where $m_{\max}$ is an explicit function of $\lambda$, the $b_j$ sequence, and the likelihood quantity $\Gamma(N_- + \alpha_0)\Gamma(N_+ + \alpha_1) / \Gamma(N_- + N_+ + \alpha_0 + \alpha_1)$, with $N_-$ and $N_+$ the numbers of negative and positive training observations. With parameters $\alpha_0 = 1$ and $\alpha_1 = 1$, this quantity reduces to $\Gamma(N_- + 1)\Gamma(N_+ + 1) / \Gamma(N_- + N_+ + 2)$. The full expression for $m_{\max}$ and the proof are given in the longer version of this paper (Yang et al., 2017).

Theorem 1 tends to significantly reduce the size of the search space. Without this bound, the search might consider extremely long lists, without knowing that they are provably non-optimal.
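For reference, here is a direct Python rendering of Algorithm 1 as reconstructed above (a sketch; the name `calculate_b` is ours, and math.comb supplies the per-size antecedent counts $\binom{P}{c}$). Sizes $c$ and $P - c$ are handled as a pair because they contain equally many antecedents, $\binom{P}{c} = \binom{P}{P-c}$.

```python
# Illustrative Python rendering of Algorithm 1 as reconstructed above.
# Depletes antecedent size categories in pairs (c, P-c) of equal count.
from math import comb

def calculate_b(P):
    b = [1]                            # b_0 = 1
    for c in range(0, P // 2 + 1):
        for j in range(comb(P, c), 0, -1):
            b.append(j)
        if c + c != P:                 # skip the middle size when P is even
            for j in range(comb(P, P - c), 0, -1):
                b.append(j)
    return b

print(calculate_b(3))  # [1, 1, 1, 3, 2, 1, 3, 2, 1]
```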
### 4.2. Prefix Bound

We next provide a bound that eliminates certain regions of the rule space from consideration. Consider a rule list beginning with antecedents $a_1, \ldots, a_p$. If the best possible rule list starting with $a_1, \ldots, a_p$ cannot beat the posterior of the best rule list we have found so far, then we know that any rule list starting with $a_1, \ldots, a_p$ is suboptimal, and we should stop exploring rule lists that start with $a_1, \ldots, a_p$. This is a type of branch and bound strategy, in that we have now eliminated (bounded) the entire set of lists starting with $a_1, \ldots, a_p$. We formalize this intuition below.

Denote the rule list antecedents at iteration $t$ by $d_t = (a_1^t, a_2^t, \ldots, a_{m_t}^t, a_0)$. The current best posterior probability has value $v_t^*$, that is,

$$v_t^* = \max_{t' \le t} \text{Posterior}(d_{t'}, \{(x_i, y_i)\}_{i=1}^n).$$

Let us consider a rule list with antecedents $d = (a_1, a_2, \ldots, a_m, a_0)$. Let $d_p$ denote a prefix of length $p$ of the rule list $d$, i.e., $d_p = (a_1, a_2, \ldots, a_p)$, where $a_1, a_2, \ldots, a_p$ are the same as the first $p$ antecedents in $d$. We want to determine whether a rule list starting with $d_p$ could be better than the best we have seen so far. Define $\Upsilon(d_p, \{(x_i, y_i)\}_{i=1}^n)$ as follows:

$$\Upsilon(d_p, \{(x_i, y_i)\}_{i=1}^n) := \frac{\lambda^{\max(p, \lambda)} / (\max(p, \lambda))!}{\sum_{j=0}^{|A|} \lambda^j / j!} \prod_{j=1}^{p} p(c_j \mid c_{<j}, A, \eta) \cdots$$
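In use, such a quantity serves as an upper bound that screens prefixes during the search. The following Python sketch shows the generic branch-and-bound pruning pattern only; `posterior` and `upper_bound` are hypothetical stand-ins for the BRL posterior and for an $\Upsilon$-style prefix bound, and the exhaustive stack-based search is illustrative rather than the paper's MCMC-based procedure.

```python
# Generic branch-and-bound pruning over rule-list prefixes (illustrative).
# posterior(d) and upper_bound(prefix) are hypothetical stand-ins for the
# BRL posterior and for an Upsilon-style bound on the best posterior
# achievable by any rule list starting with `prefix`.

def search(antecedents, posterior, upper_bound):
    best_value, best_list = float("-inf"), None
    stack = [()]                          # start from the empty prefix
    while stack:
        prefix = stack.pop()
        # Prune: no completion of this prefix can beat the incumbent.
        if prefix and upper_bound(prefix) <= best_value:
            continue
        value = posterior(list(prefix))   # prefix followed by the default rule
        if value > best_value:
            best_value, best_list = value, list(prefix)
        for a in antecedents:             # extend the prefix by one rule
            if a not in prefix:
                stack.append(prefix + (a,))
    return best_list, best_value
```

The tighter the bound, the earlier whole regions of the search space are discarded, which is exactly the role the prefix bound plays for SBRL.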