# efficient_learning_of_interpretable_classification_rules__c1499d5e.pdf

Journal of Artiﬁcial Intelligence Research 74 (2022) 1823-1863 Submitted 11/2021; published 08/2022

Eﬃcient Learning of Interpretable Classiﬁcation Rules

Bishwamittra Ghosh bghosh@u.nus.edu National University of Singapore

Dmitry Malioutov dmal@alum.mit.edu

Kuldeep S. Meel meel@comp.nus.edu.sg National University of Singapore

Machine learning has become omnipresent with applications in various safety-critical domains such as medical, law, and transportation. In these domains, high-stake decisions provided by machine learning necessitate researchers to design interpretable models, where the prediction is understandable to a human. In interpretable machine learning, rule-based classiﬁers are particularly eﬀective in representing the decision boundary through a set of rules comprising input features. Examples of such classiﬁers include decision trees, decision lists, and decision sets. The interpretability of rule-based classiﬁers is in general related to the size of the rules, where smaller rules are considered more interpretable. To learn such a classiﬁer, the brute-force direct approach is to consider an optimization problem that tries to learn the smallest classiﬁcation rule that has close to maximum accuracy. This optimization problem is computationally intractable due to its combinatorial nature and thus, the problem is not scalable in large datasets. To this end, in this paper we study the triangular relationship among the accuracy, interpretability, and scalability of learning rule-based classiﬁers. The contribution of this paper is an interpretable learning framework IMLI, that is based on maximum satisﬁability (Max SAT) for synthesizing classiﬁcation rules expressible in proposition logic. IMLI considers a joint objective function to optimize the accuracy and the interpretability of classiﬁcation rules and learns an optimal rule by solving an appropriately designed Max SAT query. Despite the progress of Max SAT solving in the last decade, the straightforward Max SAT-based solution cannot scale to practical classiﬁcation datasets containing thousands to millions of samples. Therefore, we incorporate an eﬃcient incremental learning technique inside the Max SAT formulation by integrating mini-batch learning and iterative rule-learning. The resulting framework learns a classiﬁer by iteratively covering the training data, wherein in each iteration, it solves a sequence of smaller Max SAT queries corresponding to each mini-batch. In our experiments, IMLI achieves the best balance among prediction accuracy, interpretability, and scalability. For instance, IMLI attains a competitive prediction accuracy and interpretability w.r.t. existing interpretable classiﬁers and demonstrates impressive scalability on large datasets where both interpretable and non-interpretable classiﬁers fail. As an application, we deploy IMLI in learning popular interpretable classiﬁers such as decision lists and decision sets. The source code is available at https://github.com/meelgroup/mlic.

1. Introduction

Machine learning models are presently deployed in safety-critical and high-stake decisionmaking arising in medical (Erickson et al., 2017; Kaissis et al., 2020; Kononenko, 2001), law (Kumar et al., 2018; Surden, 2014), and transportation (Peled et al., 2019; Zantalis

2022 AI Access Foundation. All rights reserved.

Ghosh, Malioutov, & Meel

et al., 2019). In safety-critical domains, a surge of interest has been observed in designing interpretable, robust, and fair models instead of merely focusing on an accuracy-centric learning objective (Bhagoji et al., 2018; Doshi-Velez & Kim, 2017; Du et al., 2019; Hancox Li, 2020; Holstein et al., 2019; Mehrabi et al., 2021; Molnar, 2020; Murdoch et al., 2019). In this study, our focus is on learning an interpretable machine learning classiﬁer (Rudin, 2019), where the rules governing the prediction are explicit (by design) in contrast to blackbox models. In order to achieve the interpretability of predictions, decision functions in the form of classiﬁcation rules such as decision trees, decision lists, decision sets, etc. are particularly eﬀective (Bessiere et al., 2009; Dash & Goncalves, 2021; Ignatiev et al., 2021b; Izza et al., 2020; Lakkaraju et al., 2017, 2016; Letham et al., 2015; Narodytska et al., 2018; Rivest, 1987; Wang & Rudin, 2015; Yu et al., 2020a). Such models are used to either learn an interpretable model from the start or as proxies to provide post-hoc explanations of pretrained black-box models (Gill et al., 2020; Lundberg & Lee, 2017; Moradi & Samwald, 2021; Ribeiro et al., 2016; Slack et al., 2020). At this point, it is important to acknowledge that interpretability is an informal notion, and formalizing it in full generality is challenging. In our context of rule-based classiﬁers, we will use the sparsity of rules (that is, fewer rules each having fewer Boolean literals), which has been considered a proxy of interpretability in various domains, speciﬁcally in the medical domain (Gage et al., 2001; Lakkaraju et al., 2019; Letham et al., 2015; Malioutov & Varshney, 2013; Myers, 1962). In this paper, we study decision functions expressible in propositional logic. In particular, we focus on learning logical formulas from data as classiﬁers and deﬁne interpretability in terms of the number of Boolean literals that the formula contains. In the following, we illustrate an interpretable classiﬁer that decides if a tumor cell is malignant or benign based on diﬀerent features of tumors.

A tumor is malignant if [(compactness SE < 0.1) OR (0.1 concave points < 0.2)] AND [ (0.2 area < 0.3) OR (0.1 largest symmetry < 0.2)]

Example 1.1. We learn an interpretable classiﬁcation rule in Conjunctive Normal Form for predicting the classiﬁcation of a tumor cell. We consider WDBC (Wisconsin Diagnostic Breast Cancer) dataset (Agarap, 2018) to learn a CNF formula with two clauses and four Boolean literals. In the formula, clauses are connected by Boolean AND , where each clause contains literals connected by Boolean OR . Informally, a tumor cell is malignant if at least one literal in each clause becomes true.

The problem of learning rule-based classiﬁers is known to be computationally intractable. The earliest tractable approaches for classiﬁers such as decision trees and decision lists relied on heuristically chosen objective functions and greedy algorithmic techniques (Clark & Niblett, 1989; Cohen & Singer, 1999; Quinlan, 1993). In these approaches, the size of rules is controlled by early stopping, ad-hoc rule pruning, etc. In recent approaches, the interpretable classiﬁcation problem is reduced to an optimization problem, where the accuracy and the sparsity of rules are optimized jointly (Lakkaraju et al., 2016; Narodytska et al., 2018). Diﬀerent optimization solvers such as linear programming (Malioutov &

Efficient Learning of Interpretable Classification Rules

Varshney, 2013), sub-modular optimizations (Lakkaraju et al., 2016), and Bayesian optimizations (Letham et al., 2015) are then deployed to ﬁnd the best classiﬁer with maximum accuracy and minimum rule-size. In our study, we propose an alternate optimization approach that ﬁts particularly well to rule-learning problems. Particularly, we propose a maximum satisﬁability (Max SAT) solution for learning interpretable rule-based classiﬁers.

The Max SAT problem is an optimization analog to the satisﬁability (SAT) problem, which is complete for the class FPNP. Although the Max SAT problem is NP-hard, dramatic progress has been made in designing solvers that can handle large-scale problems arising in practice. This has encouraged researchers to reduce several optimization problems into Max SAT such as optimal planning (Robinson et al., 2010), automotive conﬁguration (Walter et al., 2013), data analysis in machine learning (Berg et al., 2019), and automatic test pattern generation for cancer therapy (Lin & Khatri, 2012).

Contributions. The primary contribution of this paper is a Max SAT-based formulation, called IMLI, for learning interpretable classiﬁcation rules expressible in propositional logic. For the simplicity of exposition, our initial focus is on learning classiﬁers expressible in CNF (Conjunctive Normal Form) formulas; we later discuss how the CNF learning formulation can be extended to popular interpretable classiﬁers such as decision lists and decision sets. In this paper, IMLI provides precise control to jointly optimize the accuracy and the size of classiﬁcation rules. IMLI constructs and solves a Max SAT query of O(kn) size (number of clauses) to learn a k-clause CNF formula on a dataset containing n samples. Naturally, the na ıve formulation cannot scale to large values of n and k. To scale IMLI to large datasets, we propose an incremental learning technique along with the Max SAT formulation. Our incremental learning is an integration of mini-batch learning and iterative rule-learning, which are studied separately in classical learning problems. In the presented incremental approach, IMLI learns a k-clause CNF formula using an iterative separate-and-conquer algorithm, where in each iteration, a single clause is learned by covering a part of the training data. Furthermore, to eﬃciently learn a single clause in a CNF, IMLI relies on mini-batch learning, where it solves a sequence of smaller Max SAT queries corresponding to each mini-batch.

In our experimental evaluations, IMLI demonstrates the best balance among prediction accuracy, interpretability, and scalability in learning classiﬁcation rules. In particular, IMLI achieves competitive prediction accuracy and interpretability w.r.t. state-of-the-art interpretable classiﬁers. Besides, IMLI achieves impressive scalability by classifying datasets even with 1 million samples, wherein existing classiﬁers, including those of the non-interpretable ones, either fail to scale or achieve poor prediction accuracy. Finally, as an application, we deploy IMLI in learning interpretable classiﬁers such as decision lists and decision sets.

The presented framework IMLI extends our prior papers (Ghosh & Meel, 2019; Malioutov & Meel, 2018) in the following aspects. Malioutov and Meel (2018) presented a Max SATbased formulation for interpretable classiﬁcation rule-learning, which jointly optimizes between the accuracy and size of rules. Ghosh and Meel (2019) improved the Max SAT formulation by unifying mini-batch learning. This uniﬁcation allows learning on moderate-size datasets (about 50 thousand samples). This paper takes a signiﬁcant step in terms of scalability by integrating both iterative learning and mini-batch learning inside the incremental

Ghosh, Malioutov, & Meel

technique. The result is a practical scalable learning framework that can classify datasets with 1 million samples and can potentially scale beyond that. We organize the rest of the paper as follows. We discuss related literature in Section 2 and preliminaries in Section 3. In Section 4, we formally deﬁne our learning problem and provide a Max SAT-based solution in Section 5. We improve the Max SAT-based formulation through an incremental learning technique in Section 6. In Section 7, we apply IMLI in learning diﬀerent interpretable classiﬁers. We then discuss our experimental results in Section 8 and conclude in Section 9.

2. Related Work

The progress in designing interpretable rule-based classiﬁers ﬁnds its root in the development of decision trees (Bessiere et al., 2009; Quinlan, 1986, 1987), decision lists (Rivest, 1987), classiﬁcation rules (Cohen, 1995) etc. In early works, the focus was to improve the eﬃciency and scalability of the model rather than designing models that are interpretable. For example, decision rule approaches such as C4.5 rules (Quinlan, 1993), CN2 (Clark & Niblett, 1989), RIPPER (Cohen, 1995), and SLIPPER (Cohen & Singer, 1999) rely on heuristic-based branch pruning and ad-hoc local criteria e.g., maximizing information gain, coverage, etc. Recently, several optimization frameworks have been proposed for interpretable classiﬁcation, where both accuracy and rule-size are optimized during training. For example, Malioutov and Varshney (2013) proposed exact learning of rule-based classiﬁers based on Boolean compressed sensing using a linear programming formulation. Su, Wei, Varshney, and Malioutov (2016) presented two-level Boolean rules, where the trade-oﬀbetween classiﬁcation accuracy and interpretability is studied. In their work, the Hamming loss is used to characterize prediction accuracy, and sparsity is used to characterize the interpretability of rules. Wang and Rudin (2015) proposed a Bayesian optimization framework for learning falling rule lists, which is an ordered list of if-then rules. Other similar approaches based on Bayesian analysis for learning classiﬁcation rules are (Letham et al., 2015; Wang et al., 2017). Building on custom discrete optimization techniques, Angelino, Larus-Stone, Alabi, Seltzer, and Rudin (2017) proposed an optimal learning technique for decision lists using a branch-and-bound algorithm. In a separate study, Lakkaraju et al. (2016) highlighted the importance of decision sets over decision lists in terms of interpretability and considered a sub-modular optimization problem for learning a near-optimal solution for decision sets. Our proposed method for interpretable classiﬁcation, however, relies on the improvement in formal methods over the decades, particularly the eﬃcient CDCL-based solution for satisﬁability (SAT) problems (Silva & Sakallah, 2003). Formal methods, particularly SAT and its variants, have been deployed in interpretable classiﬁcation recently. In the context of learning decision trees, SAT and Max SAT-based solutions are proposed by (Alos et al., 2021; Janota & Morgado, 2020; Narodytska et al., 2018; Shati et al., 2021). In addition, researchers have applied SAT for learning explainable decision sets (Ignatiev et al., 2018, 2021a; Schidler & Szeider, 2021; Yu et al., 2020b). In most cases, SAT/Max SAT solutions are not suﬃcient in solving large-scale classiﬁcation tasks because of the NP-hardness of the underlying problem. This observation motivates us in combining Max SAT with more practical algorithms such as incremental learning.

Efficient Learning of Interpretable Classification Rules

Incremental learning has been studied in improving the scalability of learning problems, where data is processed in parts and results are combined to use lower computation overhead. In case of non-interpretable classiﬁers such as SVM, several solutions adopting incremental learning are available (Syed et al., 1999; Ruping, 2001). For example, Cauwenberghs and Poggio (2001) proposed an online recursive algorithm for SVM that learns one supportvector at a time. Based on radial basis kernel function, Ralaivola and d Alch e Buc (2001) proposed a local incremental learning algorithm for SVM. In the context of deep neural networks, stochastic gradient descent is a well-known convex optimization technique a variant of which includes computing the gradient on mini-batches (Hinton et al., 2012; Li et al., 2014; Masters & Luschi, 2018). Federated learning, on the other hand, decentralizes training on multiple local nodes based on local data samples with only exchanging learned parameters to construct a global model in a central node (Koneˇcn y et al., 2015, 2016). Another notable technique is Lagrangian relaxation that decomposes the original problem into several sub-problems, assigns Lagrangian multipliers to make sure that sub-problems agree, and iterates by solving sub-problems and adjusting weights based on disagreements (Fisher, 1981; Johnson et al., 2007; Lemar echal, 2001). To the best of our knowledge, our method is the ﬁrst method that uniﬁes incremental learning with Max SAT based formulation to improve the scalability of learning rule-based classiﬁers. Classiﬁers that are interpretable by design can be applied to improve the explainability of complex black-box machine learning classiﬁers. There is rich literature on extracting decompositional and pedagogical rules from non-linear classiﬁers such as support vector machines (Barakat & Diederich, 2004, 2005; Diederich, 2008; Martens et al., 2008; N u nez et al., 2002) and neural networks (Augasta & Kathirvalavakumar, 2012; Hailesilassie, 2016; Setiono & Liu, 1995; Sato & Tsukimoto, 2001; Zilke et al., 2016; Zhou, 2004). In recent years, local model-agnostic approaches for explaining black-box classiﬁers are proposed by learning surrogate simpler classiﬁers such as rule-based classiﬁers (Guidotti et al., 2018; Pastor & Baralis, 2019; Rajapaksha et al., 2020; Ribeiro et al., 2018). The core idea in local approaches is to use a rule-learner that can classify synthetically generated neighboring samples with class labels provided by the black-box classiﬁer. To this end, our framework IMLI can be directly deployed as an eﬃcient rule-learner and can explain the inner-working of black-box classiﬁers by generating interpretable rules.

3. Preliminaries

In this section, we ﬁrst discuss propositional satisﬁability and Max SAT, followed by rulebased classiﬁcation in machine learning. In the paper, we use capital boldface letters such as X to denote matrices while lower boldface letters x are reserved for vectors/sets.

3.1 Propositional Satisﬁability

Let F be a Boolean formula deﬁned over Boolean variables b = {b1, b2, . . . , bm}. A literal v is a variable b or its complement b, and a clause C is a disjunction ( ) or a conjunction ( ) of literals.1 F is in Conjunctive Normal Form (CNF) if F V i Ci is a conjunction

1. While term is reserved to denote disjunction of literals in the literature, we use clause for both disjunction and conjunction.

Ghosh, Malioutov, & Meel

of clauses where each clause Ci W j vj is a disjunction of literals. In contrast, F is in Disjunctive Normal Form (DNF) if F W i Ci is a disjunction of clauses where each clause Ci V j vj is a conjunction of literals. We use σ to denote an assignment of variables in b where σ(bi) {true, false}. A satisfying assignment σ of F is an assignment that evaluates F to true and is denoted by σ |= F. The propositional satisﬁability (SAT) problem ﬁnds a satisfying assignment σ to a CNF formula F such that i, σ |= Ci, wherein σ |= Ci if and only if vj Ci, σ (vj) = true. Informally, σ satisﬁes at least one literal in each clause of a CNF formula. In this paper, we use true as 1 and false as 0 interchangeably. Between two vectors u and v of the same length deﬁned over Boolean variables or constants (such as 0 and 1), we deﬁne u v to refer to their inner product. Formally, u v W i(ui vi) is a disjunction of element-wise conjunction where ui and vi denote the ith variable/constant of u and v, respectively. In this context, the conjunction between a variable and a constant follows the standard interpretation: b 0 = 0 and b 1 = b. For example, if u = [u1, u2, u3] and v = [0, 1, 1], then u v = u2 u3.

3.2 Max SAT

The Max SAT problem is an optimization analog to the SAT problem that ﬁnds an optimal assignment satisfying the maximum number of clauses in a CNF. In this work, we consider a weighted variant of the Max SAT problem more speciﬁcally, a weighted-partial Max SAT problem that optimizes over a set of hard and soft constraints in the form of a weighted CNF formula. In a weighted-CNF formula, a weight wt(Ci) R 0 { } is deﬁned over each clause Ci, wherein Ci is called a hard clause if wt(Ci) = , and Ci is a soft clause, otherwise. To avoid notational clutter, we overload wt( ) to denote the weight of an assignment σ. In particular, we deﬁne wt(σ) as the sum of the weights of unsatisﬁed clauses in a CNF, formally, wt(σ) = Σi|σ |=Ciwt(Ci). Given a weighted-CNF formula F W i Ci with weight wt(Ci), the weighted-partial Max SAT problems ﬁnds an optimal assignment σ that achieves the minimum weight. Formally, σ = Max SAT(F, wt( )) if σ = σ , wt(σ ) wt(σ). The optimal weight of the Max SAT problem wt(σ ) is inﬁnity ( ) when at least one hard clause becomes unsatisﬁed. Therefore, the weighted-partial Max SAT problem ﬁnds σ that satisﬁes all hard clauses and as many soft clauses as possible such that the total weight of unsatisﬁed soft clauses is minimum. In this paper, we reduce the classiﬁcation problem to the solution of a weightedpartial Max SAT problem. The knowledge of the inner workings of Max SAT solvers is not required for this paper.

3.3 Rule-based Classiﬁcation

We consider a binary classiﬁcation problem where X {0, 1}n m is a binary feature matrix2

deﬁned over Boolean features x = {x1, x2, . . . , xm} and y {0, 1}n is a vector of binary class labels. The ith row of X, denoted by Xi X, is called a sample, which is generated from a product distribution X. Thus, Xi is the valuation of features in x and yi is its class label. Xi is called a positive sample if yi = 1, otherwise a negative sample.

2. Our framework allows learning on real-valued and categorical features, as discussed in Section 5.3.

Efficient Learning of Interpretable Classification Rules

A classiﬁer R : x ˆy {0, 1} is a function that takes a Boolean vector as input and outputs the predicted class label ˆy. In a rule-based classiﬁer, we view R as a propositional formula deﬁned over Boolean features x such that, if R becomes true for an input (or an assignment), the prediction ˆy = 1 and ˆy = 0, otherwise. The goal of designing R is to not only approximate the training data, but also generalize to unseen samples arising from the same distribution. Additionally, we prefer to learn a sparse R in order to favor interpretability. We deﬁne the structural complexity of R, referred to as rule-size, in terms of the number of literals it contains. Let clause(R, i) denote the ith clause of R and |clause(R, i)| be the number of literals in clause(R, i). Thus, the size of the classiﬁer is |R| = Σi|clause(R, i)|, which is the sum of literals in all clauses in R. For example, let R (x1 x3) ( x2 x3) be a CNF classiﬁer deﬁned over three Boolean features {x1, x2, x3}. For an input [0, 1, 1], the prediction of R is 1 whereas for an input [1, 1, 0], the prediction is 0 because R is true and false in these two cases, respectively. Moreover, |R| = 2 + 2 = 4 is the rule-size of R.

Decision Lists. A decision list is a rule-based classiﬁer consisting of if-then-else rules. Formally, a decision list (Rivest, 1987) is an ordered list RL of pairs (C1, v1), . . . , (Ck, vk) where Ci is a conjunction of literals (alternately, a single clause in a DNF formula) and vi {0, 1} is a Boolean class label3. Additionally, the last clause Ck true, thereby vk is the default class. A decision list is deﬁned as a classiﬁer with the following interpretation: for a feature vector x, RL(x) is equal to vi where i is the least index such that x |= Ci. Since the last clause is true, RL(x) always exists. Intuitively, whichever clause (starting from the ﬁrst) in RL is satisﬁed for an input, the associated class label is its prediction.

Decision Sets. A decision set is a set of independent if-then rules. Formally, a decision set RS is a set of pairs {(C1, v1), . . . , (Ck 1, vk 1)} and a default pair (Ck, vk), where Ci is similar to a decision list a conjunction of literals and vi {0, 1} is a Boolean class label. In addition, the last clause Ck true and vk is the default class. For a decision set, if an input x satisﬁes one clause, say Ci, then the prediction is vi. If x satisﬁes no clause, then the prediction is the default class vk. Finally, if x satisﬁes 2 clauses, x is assigned a class using a tie-breaking (Lakkaraju et al., 2016).

4. Problem Formulation

Given (i) a binary feature matrix X {0, 1}n m with n samples and m features, (ii) a class-label vector y {0, 1}n, (iii) a positive integer k 1 denoting the number of clauses, and (iv) a regularization parameter λ R+, we learn a classiﬁer R represented as a k-clause CNF formula to separate samples of class 1 from class 0. Our goal is to learn classiﬁers that balance two goals: of being accurate but also interpretable. Various notions of interpretability have been proposed in the context of classiﬁcation problems. A common proxy for interpretability in the context of decision rules is the sparsity of rule. For instance, a rule involving fewer literals is highly interpretable. In this work, we minimize the total number of literals in all clauses, as discussed in Section 3, which motivates us to ﬁnd R with minimum |R|. Let R classify all samples correctly during

3. In a more practical setting, vi {0, . . . , N} can be multi-class for N 1.

Ghosh, Malioutov, & Meel

training. Among all the classiﬁcation rules that classify all samples correctly, we choose the sparsest (most interpretable) such R.

min R |R| such that i, yi = R(Xi)

In practical classiﬁcation tasks, perfect classiﬁcation is unlikely. Hence, we need to balance interpretability with classiﬁcation error. Let EX = {(Xi, yi)|yi = R(Xi)} be the set of samples in X that are misclassiﬁed by R. Therefore, we balance between classiﬁcationaccuracy and rule-sparsity and optimize the following function.4

min R |EX| + λ|R| (1)

Higher values of λ generate a rule with a smaller rule-size but of more training errors, and vice-versa. Thus, λ can be tuned to trade-oﬀbetween accuracy and interpretability for a rule-based classiﬁer, which we experiment extensively in Section 8.

5. Max SAT-based Interpretable Classiﬁcation Rule Learning

In this section, we discuss a Max SAT-based framework for optimal learning of an interpretable rule-based classiﬁer, particularly a CNF classiﬁer R. We ﬁrst describe the decision variables in Section 5.1 and present the Max SAT encoding in Section 5.2. Our Max SAT formulation assumes binary features as input. We conclude this section by learning R with non-binary features in Section 5.3 and discussing more ﬂexible interpretability constraints of R in Section 5.4.

5.1 Description of Variables

We initially preprocess X to account for the negation of Boolean features while learning a classiﬁer.5 In the preprocessing step, we negate each column in X to a new column and append it to X. For example, if age 25 is a Boolean feature, we add another feature age < 25 in X by negating the column age 25 . Hence, in the rest of the paper, we refer m as the modiﬁed number of columns in X. We next discuss the variables in the Max SAT problem. We consider two types of Boolean variables: (i) feature variables b corresponding to input features and (ii) error variables η corresponding to the classiﬁcation error of samples. We deﬁne a Boolean variable bi j that becomes true if feature xj appears in the ith clause of R, thereby contributing to an increase in the rule-size of R, and bi j is assigned false, otherwise. Moreover, we deﬁne an error variable ηl to attribute to whether the sample Xl is classiﬁed correctly or not. Speciﬁcally, ηl becomes true if Xl is misclassiﬁed, and becomes false otherwise. We next discuss the Max SAT encoding to solve the classiﬁcation problem.

4. In our formulation, it is straightforward to add class-conditional weights (e.g., to penalize false-alarms more than mis-detects), and to allow instance weights (per sample). 5. This preprocessing is similarly applied in (Malioutov & Varshney, 2013).

Efficient Learning of Interpretable Classification Rules

5.2 Max SAT Encoding

We consider a weighted-partial Max SAT formula, where we encode the objective function in Eq. (1) as soft clauses. For each sample in X, we construct hard clauses specifying that the sample is either classiﬁed correctly and ηl is false or it is misclassiﬁed and ηl is true. We next discuss the Max SAT encoding in detail.

Soft clauses for maximizing training accuracy: For each training sample, IMLI constructs a soft unit6 clause ηl to account for a penalty for misclassiﬁcation. Since the penalty for misclassiﬁcation of a sample is 1 in Eq. (1), the weight of this soft clause is also 1.

El := ηl; wt(El) = 1 (2)

Intuitively, if a sample is misclassiﬁed, the associated error variable becomes true, thereby dissatisfying the soft clause El.

Soft clauses for minimizing rule-sparsity: To favor rule-sparsity, IMLI tries to learn a classiﬁer with as few literals as possible. Hence, for each feature variable bi j, IMLI constructs a unit clause as bi j. Similar to training accuracy, the weight for this clause is derived as λ from Eq. (1).

Si j := bi j; wt(Si j) = λ (3)

Hard clauses for encoding constraints: In a Max SAT problem, constraints that must be satisﬁed are encoded as hard clauses. In IMLI, we have a learning constraint that, if the error variable is false, the associated sample must be correctly classiﬁed, and vice-versa. Let bi = {bi j | j {1, . . . , m}} be a vector of feature variables corresponding to the ith clause in R. Then, we deﬁne the following hard clause.

Hl := ηl yl

i=1 Xl bi ; wt(Hl) = (4)

In the hard clause, Vk i=1 Xl bi is a CNF formula including variables bi j for which the associated feature-value is 1 in Xl. Since yl {0, 1} is a constant, the constraint to the right of the implication is either Vk i=1 Xl bi or its complement. Therefore, the hard clause enforces that if the sample is correctly classiﬁed (using ηl), either Vk i=1 Xl bi or its complement is true depending on the class-label of the sample. We highlight that the single-implication in the hard clause Hl acts as a doubleimplication due to the soft clause El. Because, according to the deﬁnition of , the left constraint ηl can be false while the right constraint of is true. This, however, incurs unnecessarily dissatisfying the soft clause El, which is a sub-optimal solution and hence this solution is not returned by the Max SAT solver.

We next discuss the translation of soft and hard clauses into a CNF formula, which can be invoked by any Max SAT solver.

6. A unit clause has a single literal.

Ghosh, Malioutov, & Meel

Translating El, Si j, Hl to a CNF Formula. The soft clauses El and Si j are unit clauses and hence, no translation is required for them. In the hard clause, when yl = 1, the simpliﬁcation is Hl := ηl Vk i=1 Xl bi. In this case, we apply the equivalence rule in propositional logic (a b) ( a b) to encode Hl into CNF. In contrast, when yi = 0, we simplify the hard clause as Hl := ηl (Vk i=1 Xl bi) ηl Wk i=1 (Xl bi). Since Xl bi constitutes a disjunction of literals (deﬁned in Section 3), we apply Tseytin transformation to encode (Xl bi) into CNF. More speciﬁcally, we introduce an auxiliary variable zl,i corresponding to the clause (Xl bi). Formally, we replace Hl := ηl Wk i=1 (Xl bi) with Vk i=0 Hl,i, where Hl,0 := ( ηl Wk i=1 zl,i) and Hl,i := zl,i (Xl bi) for i = {1, . . . , k}. Finally, we apply the equivalence of (a b) ( a b) on Hl,i to translate them into CNF. For either value of yl, the weight of each translated hard clause in the CNF formula is . Once we translate all soft and hard clauses into CNF, the Max SAT query Q is the conjunction of all clauses.

i=1,j=1 Si j

Any oﬀ-the-shelf Max SAT solver can output an optimal assignment σ of the Max SAT query (Q, wt( )). We extract σ to construct the classiﬁer R and compute training errors as follows.

Construction 1. Let σ = Max SAT(Q, wt( )). Then xj clause(R, i) if and only if σ (bi j) = 1. Additionally, Xl is misclassiﬁed if and only if σ (ηl) = 1.

Complexity of the Max SAT Query. We analyze the complexity of the Max SAT query in terms of the number of Boolean variables and clauses in the CNF formula Q.

Proposition 1. To learn a k-clause CNF classiﬁer for a dataset of n samples over m boolean features, the Max SAT query Q deﬁnes km + n Boolean variables. Let nneg be the number of negative samples in the training dataset. Then the number of auxiliary variables in Q is knneg.

Proposition 2. The Max SAT query Q has km + n unit clauses corresponding to the constraints El and Si j. For each positive sample, the hard clause Hl is translated into k clauses. For each negative sample, the CNF translation requires at most km + 1 clauses. Let npos and nneg be the number of positive and negative samples in the training dataset. Therefore, the number of clauses in the Max SAT Query Q is km+n+knpos +(km+1)nneg O(kmn) when npos = nneg = n

In the following, we illustrate the encoding for a toy example.

Example of encoding

We illustrate the Max SAT encoding for a toy example consisting of four samples and two binary features. Our goal is to learn a two clause CNF classiﬁer that can approximate the

Efficient Learning of Interpretable Classification Rules

training data.

0 0 0 1 1 0 1 1

x1 x1 x2 x2

0 1 0 1 0 1 1 0 1 0 0 1 1 0 1 0

In the preprocessing step, we negate the columns in Xorig and add complemented features { x1, x2} in X. In the Max SAT encoding, we deﬁne 8 feature variables (4 features 2 clauses in the classiﬁer) and 4 error variables. For example, for feature x2, introduced feature variables are {b1 3, b2 3} and for feature x2, introduced variables are {b1 4, b2 4}. For four samples, error variables are {η1, η2, η3, η4}. We next show the soft and hard clauses in the Max SAT encoding

E1 := ( η1); E2 := ( η2); E3 := ( η3); E4 := ( η4)

S1 1 := ( b1 1); S1 2 := ( b1 2); S1 3 := ( b1 3); S1 4 := ( b1 4);

S2 1 := ( b2 1); S2 2 := ( b2 2); S2 3 := ( b2 3); S2 4 := ( b2 4);

H1 := ( η1 ((b1 2 b1 4) (b2 2 b2 4)));

H2 := ( η2 ( (b1 2 b1 3) (b2 2 b2 3)));

H3 := ( η3 ( (b1 1 b1 4) (b2 1 b2 4)));

H4 := ( η4 ((b1 1 b1 3) (b2 1 b2 3)));

In this example, we consider regularizer λ = 0.1, thereby setting the weight on accuracy as wt(Ei) = 1 and rule-sparsity weight as wt Si j = 0.1. Intuitively, the penalty for misclassifying a sample is 10 times than the penalty for adding a feature in the classiﬁer. For this Max SAT query, the optimal solution classiﬁes all samples correctly by assigning four feature variables {b1 1, b1 4, b2 2, b2 3} to true. Therefore, by applying Construction 1, the classiﬁer is (x1 x2) ( x1 x2).7

5.3 Learning with Non-binary Features

The presented Max SAT encoding requires input samples to have binary features. Therefore, we discretize datasets with categorical and continuous features into binary features. For each continuous feature, we apply equal-width discretization that splits the feature into a ﬁxed number of bins. For example, consider a continuous feature xc [a, b]. In discretization, we split the domain [a, b] into three bins with two split points {a , b } such that a < a < b < b. Therefore, the resulting three discretized features are a xc < a , a xc < b , and b xc b. An alternate to this close-interval discretization is openinterval discretization, where we consider six discretized features with each feature being

7. The presented Max SAT-based formulation does not learn a CNF classiﬁer with both xi and xi in the same clause. The reason is that a clause with both xi and xi connected by OR ( ) does not increase accuracy but increases rule-size and hence, this is not an optimal classiﬁer.

Ghosh, Malioutov, & Meel

compared with one threshold. In that case, the discretized features are: xc a, xc a , xc b , xc < a , xc < b , xc b. Both open-interval and close-interval discretization techniques have their use-cases where one or the other is appropriate. For simplicity, we experiment with close-interval discretization in this paper. After applying discretization on continuous features, the dataset contains categorical features only, which we convert to binary features using one-hot encoding (Lakkaraju et al., 2019; Ghosh & Meel, 2019). In one-hot encoding, a Boolean vector of features is introduced with cardinality equal to the number of distinct categories. For example, consider a categorical feature having three categories red , green , and yellow . In one hot encoding, samples with category-value red , green , and yellow would be converted into binary features by taking values 100, 010, and 001, respectively.

5.4 Flexible Interpretability Objectives

IMLI can be extended to consider more ﬂexible interpretability objectives than the simpliﬁed one in Eq. (1). In Eq. (1), we prioritize all features equally by providing the same weight to the clause Si j for all values of i and j. In practice, users may prefer rules containing certain features. In IMLI, such an extension can be achieved by modifying the weight function and/or the deﬁnition of Si j. For example, to constrain the classiﬁer to never include a feature, the weight of the clause Si j := bi j can be set to . In contrast, to always include a feature, we can deﬁne Si j := bi j with weight . In both cases, we treat Si j as a hard clause. Another use case may be to learn rules where clauses have disjoint set of features. To this end, we consider a pseudo-Boolean constraint Pk i=1 bi j 1, which speciﬁes that the feature xj appears in at-most one of the k clauses. This constraint may be soft or hard depending on the priority in the application domain. In either case, we convert this constraint to CNF using pseudo-Boolean to CNF translation (Philipp & Steinke, 2015). Thus, IMLI allows us to consider varied interpretability constraints by only modifying the Max SAT query without requiring changes in the Max SAT solving. Therefore, the separation between modeling and solving turns out to be the key strength of IMLI.

6. Incremental Learning with Max SAT-based Formulation

The complexity of the Max SAT query in Section 5.2 increases with the number of samples in the training dataset and the number of clauses to be learned in a CNF formula. In this section, we discuss an incremental learning technique that along with Max SAT formulation can generate classiﬁcation rules eﬃciently. Our incremental learning is built on two concepts: (i) mini-batch learning and (ii) iterative learning. In the following, we discuss both concepts in detail.

6.1 Mini-batch Learning

Our ﬁrst improvement is to implement a mini-batch learning technique tailored for rulebased classiﬁers. Mini-batch learning has two-fold advantages. Firstly, instead of solving a large Max SAT query for the whole training data, this approach solves a sequence of smaller Max SAT queries derived from mini-batches of smaller size. Secondly, this approach extends to life-long learning (Chen & Liu, 2018), where the classiﬁer can be updated incrementally

Efficient Learning of Interpretable Classification Rules

with new samples while also generalizing to previously observed samples. In our context of rule-based classiﬁers, we consider the following heuristic in mini-batch learning.

A Heuristic for Mini-Batch Learning. Let R be a classiﬁer learned on the previous batch. In mini-batch learning, we aim to learn a new classiﬁer R that can generalize to both the current batch and previously seen samples. For that, we consider a soft constraint such that R does not diﬀer much from R while training on the current batch. Thus, we hypothesize that by constraining R to be syntactically similar to R , it is possible to generalize well; because samples in all batches originate from the same distribution. Since our study focuses on rule-based classiﬁers, we consider the Hamming distance between two classiﬁers as a notion of their syntactic dissimilarity. In the following, we deﬁne the Hamming distance between two CNF classiﬁers R, R with the same number of clauses.

d H(R, R ) =

v Ci 1(v C i) + X

where Ci and C i are the ith clause in R and R , respectively and 1 is an indicator function that returns 1 if the input is true and returns 0 otherwise. Intuitively, d H(R, R ) calculates the total number of diﬀerent literals in each (ordered) pair of clauses in R and R . For example, consider R = (x1 x2) ( x1) and R = ( x1 x2) ( x1). Then d H(R, R ) = 2+0 = 2, because in the ﬁrst clause, the literals {x1, x1} are absent in either formulas, and the second clause is identical for both R and R . In the following, we discuss a modiﬁed objective function for mini-batch learning.

Objective Function. In mini-batch learning, we design an objective function to penalize both classiﬁcation errors on the current batch and the Hamming distance between new and previous classiﬁers. Let (Xb, yb) be the current mini-batch. The objective function in mini-batch learning is

min R |EXb| + λd H(R, R ). (5)

In the objective function, EXb is the misclassiﬁed subset of samples in the current batch (Xb, yb). Unlike controlling the rule-sparsity in the non-incremental approach in the earlier section, in Eq. (5), λ controls the trade-oﬀbetween classiﬁcation errors and the syntactic diﬀerences between consecutive classiﬁers. Next, we discuss how to encode the modiﬁed objective function as a Max SAT query.

Max SAT Encoding of Mini-batch Learning. In order to account for the modiﬁed objective function, we only modify the soft clause Si j in the Max SAT query Q. In particular, we deﬁne Si j to penalize for the complemented assignment of the feature variable bi j in R compared to R .

( bi j if xj clause(R , i) bi j otherwise ; wt(Si j) = λ

Si j is either a unit clause bi j or bi j depending on whether the associated feature xj appears in the previous classiﬁer R or not. In this encoding, the Hamming distance of

Ghosh, Malioutov, & Meel

R and R is minimized while attaining minimum classiﬁcation errors on the current batch (using soft clause El in Eq. (2)) To this end, mini-batch learning starts with an empty CNF formula as R with no feature, and thus Si j := bi j for the ﬁrst batch. Next, we analyze the complexity of the Max SAT encoding for mini-batch learning.

Proposition 3. Let n |Xb| |X| be the size of a mini-batch, n neg n be the number of negative samples in the batch, and m be the number of Boolean features. According to Proposition 1, to learn a k-clause CNF classiﬁer in mini-batch learning, the Max SAT encoding for each batch has km+n Boolean variables and kn neg auxiliary variables. Let n pos = n n neg be the number of positive samples in the batch. Then, according to Proposition 2, the number of clauses in the Max SAT query for each batch is km+n +kn pos+(km+1)n neg O(kmn ) when n pos = n neg = n

Assessing the Performance of R. Since we apply a heuristic objective in mini-batch learning, R may be optimized for the current batch while generalizing poorly on the full training data. To tackle this, after learning on each batch, we decide whether to keep R or not by assessing the performance of R on the training data and keep R if it achieves higher performance. We measure the performance of R on the full training data (X, y) using a weighted combination of classiﬁcation errors and rule-size. In particular, we compute a combined loss function loss(R) |EX| + λ|R| on the training data (X, y), which is indeed the value of the objective function that we minimize in the non-incremental learning in Section 4. Additionally, we discard the current classiﬁer R when the loss does not decrease (loss(R) > loss(R )). We present the algorithm for mini-batch learning in Algorithm 1.

Algorithm 1 Max SAT-based Mini-batch Learning

1: function Mini Batch Learning(X, y, λ, k)

2: R = Vk i=1 false Empty CNF formula

4: for i 1, . . . , N do N is the total batch-count

5: Xb, yb Get Batch(X, y)

6: Q, wt( ) Max SATEncoding(R, Xb, yb, λ, k) Returns a weighted-partial CNF

7: σ Max SAT(Q, wt( ))

8: Rnew Construct Classifier(σ ) Construction 1

9: loss |EX| + λ|Rnew| Compute loss

10: if loss < lossmax then

11: lossmax loss

13: return R

6.2 Iterative Learning

We discuss an iterative learning algorithm for rule-based classiﬁers. The major advantage of iterative learning is that we solve a smaller Max SAT query because of learning

Efficient Learning of Interpretable Classification Rules

a partial classiﬁer in each iteration. Our iterative approach is motivated by the setcovering algorithm also known as separate-and-conquer algorithm in symbolic rule learning (F urnkranz, 1999). In this approach, the core idea is to deﬁne the coverage of a partial classiﬁer (for example, a clause in a CNF classiﬁer). For a speciﬁc deﬁnition of coverage, this algorithm separates samples covered by the partial classiﬁer and recursively conquers remaining samples in the training data by learning another partial classiﬁer until no sample remains. The ﬁnal classiﬁer is an aggregation of all partial classiﬁers the conjunction of clauses in a CNF formula, for example. Iterative learning is diﬀerent from mini-batch learning in several aspects. In mini-batch learning, we learn all clauses in a CNF formula together, while in iterative learning, we learn a single clause of a CNF in each iteration. Additionally, in mini-batch learning, we improve scalability by reducing the number of samples in the training data using mini-batches, while in iterative learning, we improve scalability by reducing the number of clauses to learn at once. Therefore, an eﬃcient integration of iterative learning and mini-batch learning would beneﬁt scalability from both worlds. In the following, we discuss this integration by ﬁrst stating iterative learning for CNF classiﬁers. In iterative learning, we learn one clause of a CNF classiﬁer in each iteration, where the clause refers to a partial classiﬁer. The coverage of a clause in a CNF formula is the set of samples that do not satisfy the clause. The reason is that if a sample does not satisfy at least one clause in a CNF formula, the prediction of the sample by the full formula is class 0, because CNF is a conjunction of clauses. As a result, considering covered samples in the next iteration does not change their prediction regardless of whatever clause we learn in later iterations. To this end, a single clause learning can be performed eﬃciently by applying mini-batch learning discussed before. In Algorithm 2, we provide an algorithm for learning a CNF classiﬁer iteratively by leveraging mini-batch learning. This algorithm is a double-loop algorithm, where in the outer loop we apply iterative learning and in the inner loop, we apply mini-batch learning.

Algorithm 2 Iterative CNF Classiﬁer Learning

1: procedure Iterative CNFLearning(X, y, λ, k)

2: R true Initial formula

3: for i 1, . . . , k and (X, y) = do

4: Ci Mini Batch Learning(X, y, λ, 1) Single clause learning, k = 1

5: X , y Coverage(X, Ci)

6: if (X , y ) = then Terminating conditions

8: R R Ci 9: X, y (X, y) \ (X , y ) Removing covered samples

10: return R

Terminating Conditions. In Algorithm 2, we terminate iterative learning based on three conditions: (i) when R contains all k clauses , (ii) the training data (X, y) is empty (that is, no sample remains uncovered), and (iii) no new sample is covered by the current partial classiﬁer. Since the ﬁrst two conditions are trivial, we elaborate on the third condition.

Ghosh, Malioutov, & Meel

When clause Ci cannot cover any new sample from the training dataset (X, y), the next iteration will result in the same clause Ci because the training data remains the same. In this case, we do not include clause Ci to classiﬁer R because of zero coverage.

7. Applying IMLI on Learning Other Interpretable Classiﬁers

In earlier sections, we discuss the learning of CNF classiﬁers using IMLI. IMLI can also be applied to learn other interpretable rule-based representations. In this section, we discuss how IMLI can be applied in learning DNF classiﬁers, decision lists, and decision sets.

7.1 Learning DNF Classiﬁers

For learning DNF classiﬁers, we leverage De Morgan s law where complementing a CNF formula results in a DNF formula. To learn a DNF classiﬁer, say R (x), we can trivially show that y = R (x) (y = R (x)) for the feature vector x. Here R (x) is a CNF formula, by deﬁnition. Thus, we learn a DNF classiﬁer by ﬁrst complementing the class-label y to y in the training dataset, learning a CNF classiﬁer on (X, y), and ﬁnally complementing the learned classiﬁer to DNF. For example, the CNF classiﬁer (Male Age < 50) (Education = Graduate Income 1500) is complemented to a DNF classiﬁer as (not Male Age 50) (Education = Graduate Income 1500) . To learn a DNF classiﬁer incrementally, such as through mini-batch and iterative learning, we adopt the following procedure. For learning a DNF classiﬁer using mini-batch learning, we ﬁrst learn a CNF classiﬁer on dataset (X, y) and complement the classiﬁer to a DNF classiﬁer at the end of mini-batch learning. To learn a DNF classiﬁer in the iterative approach, we learn a single clause of the DNF classiﬁer in each iteration, remove covered samples, and continue till no training sample remains. In this context, the coverage of a clause in a DNF formula is the set of samples satisfying the clause.

7.2 Learning Decision Lists

In IMLI, we apply an iterative learning approach for eﬃciently learning a decision list. A decision list RL is a list of pairs (C1, v1), . . . , (Ck, vk), where we propose to learn one pair in each iteration. We note that the clause Ci is a conjunction of literals equivalently, a single clause DNF formula. Hence, our task is to deploy IMLI to eﬃciently learn a single clause DNF formula Ci. In particular, we opt to learn this clause for the majority class, say vi, in the training dataset by setting the majority samples as class 1 and all other samples as class 0. As a result, even if the Max SAT-based learning, presented in this paper, is targeted for binary classiﬁcation, we can learn a multi-class decision list in IMLI. In Algorithm 3, we present the iterative algorithm for learning decision lists. In each iteration, the algorithm learns a pair (Ci, vi), separates the training set based on the coverage of (Ci, vi) and conquers the remaining samples recursively. The coverage of (Ci, vi) is the set of samples that satisﬁes clause Ci. Finally, we add a default rule (Ck, vdefault) to RL where Ck true denoting that the clause is satisﬁed by all samples. We select the default class vdefault in the following order: (i) if any class(s) is not present in the predicted classes {vi}k 1 i=1 of the decision list, vdefault is the majority class among missing classes, and (ii) vdefault is the majority class of the original training set (X, y), otherwise.

Efficient Learning of Interpretable Classification Rules

Algorithm 3 Iterative learning of decision lists

1: procedure Decision List Learning(X, y, λ, k)

3: for i 1, . . . , k 1 and (X, y) = do

4: vi Majority Class(y) vi speciﬁes the target class

5: Ci Mini Batch DNFLearning(X, y, λ, 1, vi) Ref. Section 7.1

6: X , y Coverage(X, Ci)

7: if (X , y ) = then

9: RL RL {(Ci, vi)}

10: X, y (X, y) \ (X , y )

11: RL RL {(true, vdefault)} Default rule

12: return RL

7.3 Learning Decision Sets

We describe an iterative procedure for learning decision sets. A decision set comprises of an individual clause-class pair (Ci, vi) where Ci denotes a single clause DNF formula similar to decision lists. In a decision set, a sample can satisfy multiple clauses simultaneously, which is attributed as an overlapping between clauses (Lakkaraju et al., 2016). Concretely, the overlap between two clauses Ci and Cj with i = j is the set of samples {Xl|Xl |= Ci Xl |= Cj} satisfying both clauses. One additional objective in learning a decision set is to minimize the overlap between clauses, as studied in (Lakkaraju et al., 2016). Therefore, along with optimizing accuracy and rule-sparsity, we propose an iterative procedure for decision sets that additionally minimizes the overlap between clauses.

Algorithm 4 Iterative learning of decision sets

1: procedure Decision Sets Learning(X, y, λ, k)

3: Xcc = {} Contains correctly covered samples

4: for i 1, . . . , k 1 and (X, y) = do

5: vi Majority Class(X, y)

6: Xw X Xcc Correctly covered samples are included

7: yw y { vi}|Xcc| Xcc have complemented class vi to minimize overlap

8: Ci Mini Batch DNFLearning(Xw, yw, λ, 1, vi)

9: X , y Correct Coverage(X, y, Ci, vi)

10: if (X , y ) = then

12: RS RS {(Ci, vi)}

13: X, y (X, y) \ (X , y )

14: Xcc = Xcc X

15: RS RS {(true, vdefault)}

16: return RS

Ghosh, Malioutov, & Meel

In Algorithm 4, we present an iterative algorithm for learning a decision set. The iterative algorithm is a modiﬁcation of separate-and-conquer algorithm by additionally focusing on minimizing overlaps in a decision set. Given a training data (X, y), a regularization parameter λ, and the number of clauses k, the core idea of Algorithm 4 is to learn a pair (Ci, vi) in each iteration, separate covered samples from (X, y), and conquer remaining samples recursively. In contrast to learning decision lists, we have following modiﬁcations in Algorithm 4.

The ﬁrst modiﬁcation is with respect to the deﬁnition of coverage for decision sets. Unlike decision lists, we separate samples that are correctly covered by (Ci, vi) in each iteration. Given (X, y), the correctly covered samples of (Ci, vi) is a set {(Xl, yl)|Xl |= Ci yl = vi} (X, y) of samples that satisfy Ci and have matching class-label as vi.

The second modiﬁcation is related to the training dataset considered in each iteration. Let (X, y) denote the remaining training dataset in the current iteration and vi be the majority class in (X, y). Hence, vi is the target class in the current iteration. In Algorithm 4, we learn clause Ci on a working dataset (Xw, yw), where Xw X Xcc

comprises of both remaining samples and already covered samples in all previous iterations. However, the vector of class labels yw y { vi}|Xcc| comprises of remaining class-label vector y and complemented class label vector { vi}|Xcc| associated with feature-matrix Xcc. Thus, by explicitly labeling covered samples as class vi, the new clause Ci learns to falsify already covered samples. This heuristic allows us to minimize the overlap of Ci compared to previously learned clauses {Cj}i 1 j=1.

Finally, the default clause for decision sets is learned similarly as in decision lists.

8. Empirical Performance Analysis

In this section, we empirically evaluate the performance of IMLI. We ﬁrst present the experimental setup and the objective of the experiments, followed by experimental results.

8.1 Experimental Setup

We implement a prototype of IMLI in Python to evaluate the performance of the proposed Max SAT-based formulation for learning classiﬁcation rules. To implement IMLI, we deploy a state-of-the-art Max SAT solver Open-WBO (Martins et al., 2014), which returns the current best solution upon reaching a timeout. We compare IMLI with state-of-the-art interpretable and non-interpretable classiﬁers. Among interpretable classiﬁers, we compare with RIPPER (Cohen, 1995), BRL (Letham et al., 2015), CORELS (Angelino et al., 2017), and BRS (Wang et al., 2017). Among non-interpretable classiﬁers, we compare with Random Forest (RF), Support Vector Machine with linear kernels (SVM), Logistic Regression classiﬁer (LR), and k-Nearest Neighbors classiﬁer (k NN). We deploy the Scikit-learn library in Python for implementing noninterpretable classiﬁers. We experiment with real-world binary classiﬁcation datasets from UCI (Dua & Graﬀ, 2017), Open-ML (Vanschoren et al., 2013), and Kaggle repository (https://www.kaggle. com/datasets), as listed in Table 1. In these datasets, the number of samples vary from

Efficient Learning of Interpretable Classification Rules

about 200 to 1, 000, 000. The datasets contain both real-valued and categorical features. We process them to binary features by setting the maximum number of bins as 10 during discretization. For non-interpretable classiﬁers such as RF, SVM, LR, and k NN, which take real-valued features as inputs, we only convert categorical features to one-hot encoded binary features. We perform ten-fold cross-validation on each dataset and evaluate the performance of diﬀerent classiﬁers based on the median prediction accuracy on the test data. Additionally, we compare the median size of generated rules among rule-based interpretable classiﬁers. We consider a comparable combination (100) of hyper-parameters choices for all classiﬁers, which we ﬁne-tune during cross-validation. For IMLI, we vary the number of clauses k {1, 2, . . . , 5} and the regularization parameter λ in a logarithmic grid by choosing 5 values between 10 4 and 101. In the mini-batch learning in IMLI, we set the number of samples in each mini-batch, n {50, 100, 200, 400}. Thus, we consider n/n mini-batches, where n denotes the size of training data. To construct mini-batches from a training dataset, we sequentially split the data into n/n batches with each batch having n samples. Furthermore, to ignore the eﬀect of batch-ordering, we perform mini-batch learning in two rounds such that each batch participates twice in the training. For BRL algorithm, we vary four hyper-parameters: the maximum cardinality of rules in {2, 3, 4}, the minimum support of rules in {0.05, 0.175, 0.3}, and the prior on the expected length and width of rules in {2, 4, 6, 8} and {2, 5, 8}, respectively. For CORELS algorithm, we vary three hyper-parameters: the maximum cardinality of rules in {2, 3, 4, 5}, the minimum support of rules in {0.01, 0.17, 0.33, 0.5}, and the regularization parameter in {0.005, 0.01, 0.015, 0.02, 0.025, 0.03}. For BRS algorithm, we vary three hyper-parameters: the number of initial rules in {500, 1000, 1500, 2000, 2500, 3000}, the maximum length of rules in {1, 2, 3, 4}, and the minimum support of rules in {1, 4, 7, 10}. For RF and RIPPER classiﬁers, we vary the cut-oﬀon the number of samples in the leaf node using a linear grid between 3 to 500 and 1 to 300, respectively. For SVM and LR classiﬁers, we discretize the regularization parameter on a logarithmic grid between 10 3 and 103. For k NN, we vary the number of neighbors in a linear grid between 1 and 500. We conduct each experiment on an Intel Xeon E7 8857 v2 CPU using a single core with 16 GB of RAM running on a 64bit Linux distribution based on Debian. For all classiﬁers, we set the training timeout to 1000 seconds.

Objectives of Experiments. In the following, we present the objectives of our experimental study.

1. How are the accuracy and size of classiﬁcation rules generated by IMLI compared to existing interpretable classiﬁers?

2. How is the scalability of IMLI in solving large-scale classiﬁcation problems compared to existing interpretable classiﬁers?

3. How does IMLI perform in terms of accuracy and scalability compared to classiﬁers that are non-interpretable?

4. How does the incremental learning in IMLI perform compared to non-incremental Max SAT-based learning in terms of accuracy, rule-sparsity, and scalability?

Ghosh, Malioutov, & Meel

5. How do diﬀerent interpretable classiﬁcation rules learned using IMLI perform in terms of accuracy and rule-size?

6. What are the eﬀects of diﬀerent hyper-parameters in IMLI?

Summary of Experimental Results. To summarize our experimental results, IMLI achieves the best balance among prediction accuracy, interpretability, and scalability compared to existing interpretable rule-based classiﬁers. Particularly, compared to the most accurate classiﬁer RIPPER, IMLI demonstrates on average 1% lower prediction accuracy, wherein the accuracy of IMLI is higher than BRL, CORELS, and BRS in almost all datasets. In contrast, IMLI generates signiﬁcantly smaller rules than RIPPER, speciﬁcally in large datasets. Moreover, BRL, CORELS, and BRS report comparatively smaller rule-size than IMLI on average, but with a signiﬁcant decrease in accuracy. In terms of scalability, IMLI achieves the best performance compared to other interpretable and non-interpretable classiﬁers by classifying datasets with one million samples. While CORELS also scales to such large datasets, its accuracy is lower than that of IMLI by at least 2% on average. Therefore, IMLI is not only scalable but also accurate in practical classiﬁcation problems while also being interpretable by design. We additionally analyze the comparative performance of diﬀerent formulations presented in this paper, where the incremental approach empirically proves its eﬃciency than the na ıve Max SAT formulation. Furthermore, we learn and compare the performance of diﬀerent interpretable representations: decision lists, decision sets, CNF, and DNF formulas using IMLI and present the eﬃcacy of IMLI in learning varied interpretable classiﬁers. Finally, we study the eﬀect of diﬀerent hyper-parameters in IMLI, where each hyper-parameter provides a precise control among training time, prediction accuracy, and rule-sparsity.

8.2 Experimental Results

In the following, we discuss our experimental results in detail.

8.2.1 Comparing IMLI with Interpretable Classifiers

We compare IMLI with existing interpretable classiﬁers in three aspects: test accuracy, rule-size, and scalability.

Test accuracy and rule-size. We present the experimental results of test accuracy and rule-size among interpretable classiﬁers in Table 1, where the ﬁrst, second, and third columns represent the name of the dataset, the number of samples, and the number of features in the dataset, respectively. In each cell from the fourth to the eighth column in the table, the top value represents the median test accuracy and the bottom value represents the median size of rules measured through ten-fold cross-validation. In Table 1, IMLI and CORELS generate interpretable classiﬁcation rules in all 15 datasets in our experiments. In contrast, within a timeout of 1000 seconds, RIPPER, BRL, and BRS fail to generate any classiﬁcation rule in three datasets, speciﬁcally in large datasets ( 200, 000 samples). We compare IMLI with each interpretable classiﬁer in detail. Compared to RIPPER, IMLI has lower accuracy in 9 out of 12 datasets. More speciﬁcally, the accuracy of IMLI is

Efficient Learning of Interpretable Classification Rules

Table 1: Comparison of accuracy and rule-size among interpretable classiﬁers. Each cell from the fourth to the eighth column contains test accuracy (top) and rule-size (bottom). represents a timeout. Numbers in bold represent the best performing results among diﬀerent classiﬁers.

Dataset Size Features RIPPER BRL CORELS BRS IMLI

Parkinsons 195 202 94.44 94.74 89.74 84.61 94.74 7.0 11.5 2.0 5.0 7.5

WDBC 569 278 98.08 93.81 92.04 92.98 94.74 13.0 22.0 2.0 7.0 11.5

Pima 768 83 77.14 68.18 75.32 75.32 78.43 6.0 13.5 2.0 3.0 23.0

Titanic 1, 043 38 78.72 62.98 81.9 80.86 81.82 6.0 15.0 4.0 4.0 5.5

MAGIC 19, 020 100 82.68 76.95 78.05 77.5 78.26 102.0 81.0 4.0 3.0 8.5

Tom s HW 28, 179 946 85.91 83.27 83.13 85.24 30.0 4.0 18.5 44.5

Credit 30, 000 199 82.39 46.12 81.18 80.45 82.12 32.5 26.5 2.0 7.0 17.5

Adult 32, 561 94 84.37 72.08 79.78 70.75 81.2 115.5 46.5 4.0 4.0 30.0

Bank Marketing 45, 211 82 90.01 84.66 89.62 86.75 89.84 36.5 13.0 2.0 2.0 24.5

Connect-4 67, 557 126 76.72 65.83 68.68 70.49 75.36 118.0 18.5 4.0 11.0 50.5

Weather AUS 107, 696 169 84.22 43.26 83.67 83.78 26.0 22.0 2.0 22.0

Vote 131, 072 16 97.12 94.78 95.86 95.14 96.69 132.0 41.5 3.5 1.0 15.0

Skin Seg 245, 057 30 79.25 91.62 68.48 94.71 6.0 9.0 5.0 30.0

BNG(labor) 1, 000, 000 89 88.56 90.91 2.0 24.0

BNG(credit-g) 1, 000, 000 97 72.08 75.48 2.0 27.5

1% lower on average than RIPPER. The improved accuracy of RIPPER, however, results in the generation of higher size classiﬁcation rules than IMLI in most datasets. In particular, IMLI generates sparser rules than RIPPER in 9 out of 12 datasets, wherein RIPPER times out in 3 datasets. Interestingly, the diﬀerence in rule-size is more signiﬁcant in larger datasets, such as in Vote dataset, where RIPPER learns a classiﬁer with 132 Boolean

Ghosh, Malioutov, & Meel

literals compared to 15 Boolean literals by IMLI. Therefore, IMLI is better than RIPPER in terms of rule-sparsity, but lags slightly in accuracy. IMLI performs better than BRL both in terms of accuracy and rule-sparsity. In particular, IMLI has higher accuracy and lower rule-size than BRL in 12 and 8 datasets, respectively, in a total of 12 datasets, wherein BRL times out in 3 datasets. While comparing with CORELS, IMLI achieves higher accuracy in almost all datasets (14 out of 15 datasets). A similar trend is observed in comparison with BRS, where IMLI achieves higher accuracy in all of 12 datasets and BRS times out in 3 datasets. CORELS and BRS, however, generates sparser rules than IMLI in most datasets, but by costing a signiﬁcant decrease in accuracy. For example, in the largest dataset BNG(credit-g) with 1 Million samples, BRS times out, and CORELS generates a classiﬁer with 72.08% accuracy with rule-size 2. IMLI, in contrast, learns a classiﬁer with 27.5 Boolean literals achieving 75.48% accuracy, which is 3% higher than CORELS. Therefore, IMLI makes a good balance between accuracy and rule-size compared to existing interpretable classiﬁers while also being highly scalable. In the following, we discuss the results on the scalability of all interpretable classiﬁers in detail.

Scalability. We analyze the scalability among interpretable classiﬁers by comparing training time. In Figure 1, we use cactus plots8 to represent the training time (in seconds) of all classiﬁers in 1000 instances (10 folds 100 choices of hyper-parameters) derived for each dataset. In the cactus plot, the number of solved instances (within 1000 seconds) is on the X-axis, whereas the training time is on the Y -axis. A point (x, y) on the plot implies that a classiﬁer yields lower than or equal to y seconds of training in x many instances. In Figure 1, we present results in an increasing number of samples in a dataset (from left to right and top to bottom). In WDBC and Adult datasets presented on the ﬁrst row in Figure 1, CORELS solves lower than 600 instances within a timeout of 1000 seconds. The scalability performance of BRS is even worse, where it solves around 700 and 200 instances in WDBC and Adult datasets, respectively. The other three classiﬁers: IMLI, BRL, and RIPPER solve all 1000 instances, where BRL takes comparatively higher training time than the other two. The performance of IMLI and RIPPER is similar, with RIPPER being comparatively better in the two datasets. However, the eﬃciency of IMLI compared to other classiﬁers becomes signiﬁcant as the number of samples in a dataset increases. In particular, in Weather AUS dataset, BRS cannot solve a single instance, BRL and CORELS solves 400 instances, and RIPPER solves around 600 instances. IMLI, however, solves all 1000 instances in this dataset. Similarly, in BNG(labor) dataset, all other classiﬁers except IMLI and CORELS cannot solve any instance. While IMLI mostly takes the maximum allowable time (1000 seconds) in solving all instances in this dataset, CORELS can solve lower than 400 instances. Thus, IMLI establishes itself as the most scalable classiﬁer compared to other state-of-the-art interpretable classiﬁers.

8.2.2 Comparison with Non-interpretable Classifiers

We compare IMLI with state-of-the-art non-interpretable classiﬁers such as LR, SVM, k NN, and RF in terms of their median test accuracy in Table 2. In the majority of the datasets, IMLI achieves comparatively lower test accuracy than the best performing non-interpretable

8. Cactus plots are often used in (Max)SAT community to present the scalability of diﬀerent solvers/methods (Argelich et al., 2008; Balyo et al., 2017).

Efficient Learning of Interpretable Classification Rules

0 200 400 600 800 1000 Instances solved

Training time (s)

RIPPER CORELS BRS BRL IMLI

0 200 400 600 800 1000 Instances solved

Training time (s)

RIPPER CORELS BRS BRL IMLI

0 200 400 600 800 1000 Instances solved

Training time (s)

Weather AUS

RIPPER CORELS BRS BRL IMLI

0 200 400 600 800 1000 Instances solved

Training time (s)

RIPPER CORELS BRS BRL IMLI

Figure 1: Comparison of scalability among interpretable classiﬁers. The plots are arranged in increasing sizes of datasets (from left to right). In each cactus plot, IMLI solves all 1000 instances for each dataset, while competitive classiﬁers often fail to scale, specially in larger datasets.

classiﬁer. The decrease in the test accuracy of IMLI is attributed to two factors. Firstly, while we train IMLI on discretized data, non-interpretable classiﬁers are trained on nondiscretized data and thus IMLI incurs additional classiﬁcation errors due to discretization. Secondly, IMLI learns a rule-based classiﬁer, whereas non-interpretable classiﬁers can learn more ﬂexible decision boundaries and thus ﬁt data well. In Table 2, we also observe that IMLI achieves impressive scalability than competing classiﬁers by solving datasets with 1, 000, 000 samples where most of the non-interpretable classiﬁers fail to learn any decision boundary on such large datasets. Thus, IMLI, being an interpretable classiﬁer, demonstrates lower accuracy than competing non-interpretable classiﬁers, but higher scalability in practice.

8.2.3 Comparison Among Different Formulations in IMLI

We compare the performance of diﬀerent formulations for learning classiﬁcation rules as presented in this paper. In Figure 2, we show cactus plots for assessing training time (in seconds), test error (in percentage), and rule-size among diﬀerent formulations. In the

Ghosh, Malioutov, & Meel

Table 2: Comparison of IMLI with non-interpretable classiﬁers in terms of test accuracy. In the table, represents a timeout. Numbers in bold represent the best performing results among diﬀerent classiﬁers.

Dataset Size LR SVM k NN RF IMLI

Parkinsons 195 89.74 89.74 97.5 90.0 94.74 WDBC 569 98.25 98.25 98.23 96.49 94.74 Pima 768 78.43 79.08 74.5 79.22 78.43 Titanic 1, 043 80.86 80.38 81.34 82.69 81.82 MAGIC 19, 020 79.18 79.34 84.6 88.2 78.26 Tom s HW 28, 179 96.2 97.13 88.15 97.78 85.24 Credit 30, 000 82.2 81.9 81.83 82.15 82.12 Adult 32, 561 85.26 85.05 83.8 86.69 81.2 Bank Marketing 45, 211 90.09 89.28 89.43 90.27 89.84 Connect-4 67, 557 79.39 85.51 88.11 75.36 Weather AUS 107, 696 85.64 78.59 86.26 83.78 Vote 131, 072 96.43 96.37 97.05 97.38 96.69 Skin Seg 245, 057 91.86 99.96 99.96 94.71 BNG(labor) 1, 000, 000 90.91 BNG(credit-g) 1, 000, 000 80.58 75.48

cactus plot, a point (x, y) denotes that the formulation yields lower than or equal to y training time (similarly, test error and rule-size) in x many instances in each dataset. In Figure 2, we denote the baseline non-incremental Max SAT-based formulation as IMLI-B9, incremental Max SAT-based formulation with only mini-batch learning as IMLI-M, and incremental Max SAT-based formulation with both mini-batch and iterative learning as IMLI. We ﬁrst observe the training time of diﬀerent formulations in the left-most column in Figure 2, where IMLI-B soon times out and solves lower than 300 instances out of 1000 instances in each dataset. This result suggests that the non-incremental formulation cannot scale in practical classiﬁcation task. Comparing between IMLI and IMLI-M, both formulations solve all 1000 instances in each dataset with IMLI-M undertaking signiﬁcantly higher training time than IMLI. Therefore, IMLI achieves better scalability than IMLI-M indicating that an integration of mini-batch and iterative learning achieves a signiﬁcant progress in terms of scalability than mini-batch learning alone. We next focus on the test error of diﬀerent formulations in the middle column in Figure 2. Firstly, IMLI-B has a higher test error than the other two formulations since IMLI-B times out in most instances and learns a sub-optimal classiﬁcation rule with reduced prediction accuracy. In contrast, IMLI has the lowest test error compared to two formulations in all datasets. This result indicates the eﬀectiveness of integrating both iterative and mini-batch learning with Max SAT-based formulation in generating more accurate classiﬁcation rules. Moving focus on the rule-size in the rightmost column in Figure 2, IMLI-B achieves the highest rule-size in WDBC dataset. In contrast, the rule-size of IMLI-B is lowest (zero) in Credit and Adult datasets. In the last two datasets, IMLI-B times out during training

9. Since IMLI-B is a non-incremental formulation and does not involve any mini-batch learning, we consider two hyper-parameters for IMLI-B: the number of clauses in {1, 2, . . . , 10} and the regularization parameter λ as 10 values chosen from a logarithmic grid between 10 4 and 101.

Efficient Learning of Interpretable Classification Rules

0 200 400 600 800 1000 Instances solved

Training time (s)

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

Training time (s)

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

Training time (s)

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

IMLI-B IMLI-M IMLI

0 200 400 600 800 1000 Instances solved

IMLI-B IMLI-M IMLI

Figure 2: Comparison of training time, test error, and rule-size among diﬀerent formulations presented in the paper. In each cactus plot, the incremental formulation IMLI with both mini-batch and iterative learning demonstrates the best performance in training time and test error than compared two formulations: non-incremental Max SAT formulation IMLI-B and incremental formulation with only mini-batch learning IMLI-M. In terms of rule-size, IMLI often generates higher size rules than IMLI-M.

and returns the default rule true by predicting all samples as class 1. The other two formulations IMLI and IMLI-M demonstrate a similar trend in rule-size in all datasets with IMLI-M generating comparatively smaller size rules in Credit and Adult datasets. In this context, the improvement of rule-sparsity of IMLI-M is due to a comparatively higher test error (or lower accuracy) than IMLI as observed in all three datasets. Therefore, IMLI appears to be the best performing formulation w.r.t. training time, test error, and rule-size by balancing between accuracy and rule-size while being more scalable.

8.2.4 Performance Evaluation of Different Interpretable Representations in IMLI

We deploy IMLI to learn diﬀerent interpretable rule-based representations: CNF and DNF classiﬁers, decision lists, and decision sets, and present their comparative performance w.r.t.

Ghosh, Malioutov, & Meel

Table 3: Comparison of test accuracy (top value) and rule-size (bottom value) among diﬀerent rule-based representations learned using IMLI. Numbers in bold denote the best performing results among diﬀerent representations.

Dataset CNF DNF Decision Sets Decision Lists

Parkinsons 94.74 89.47 94.87 89.74 7.5 6.0 15.0 6.5

WDBC 94.74 96.49 95.61 95.61 11.5 15.0 15.5 10.0

Pima 78.43 77.13 76.97 76.97 23.0 9.0 15.0 13.5

Titanic 81.82 82.29 81.82 82.3 5.5 10.5 8.5 8.0

MAGIC 78.26 77.44 75.87 77.79 8.5 41.5 10.0 14.0

Tom s HW 85.24 85.15 85.72 85.95 44.5 26.5 45.0 59.5

Credit 82.12 82.15 82.03 82.22 17.5 14.0 9.5 21.5

Adult 81.2 84.28 80.07 80.96 30.0 34.5 7.0 24.5

Bank Marketing 89.84 89.77 89.67 89.79 24.5 7.5 6.0 10.5

Connect-4 75.36 70.63 68.09 69.83 50.5 42.0 4.5 24.0

Weather AUS 83.78 84.23 83.69 83.85 22.0 14.0 4.0 26.0

Skin Seg 94.71 93.68 87.92 91.17 30.0 15.0 3.0 7.0

test accuracy and rule-size in Table 3. In each cell in this table, the top value represents the test accuracy and the bottom value represents the size of generated rules.

We learn all four interpretable representations on twelve datasets, where the CNF classiﬁer appears to be the most accurate representation by achieving the highest accuracy in ﬁve datasets. In contrast, both DNF and decision lists achieve the highest accuracy in three datasets each; decision sets demonstrate the least performance in test accuracy by being more accurate in one dataset. To this end, the poor accuracy of decision sets is traded oﬀby its rule-size as decision sets generate the sparsest rules compared to other representations. More precisely, decision sets have the smallest rule-size in six datasets, while CNF, DNF, and decision lists have the smallest rule-size in two, three, and one dataset, respectively. These results suggest that CNF classiﬁers are more favored in applications where higher accuracy is preferred, while decision sets are preferred in applications where

Efficient Learning of Interpretable Classification Rules

1 2 3 4 5 Clauses k

Test accuracy

1 2 3 4 5 Clauses k

93.0 93.5 94.0 94.5 95.0 95.5 96.0

Train accuracy

1 2 3 4 5 Clauses k

1 2 3 4 5 Clauses k

Training time (s)

1 2 3 4 5 Clauses k

Test accuracy

1 2 3 4 5 Clauses k

Train accuracy

1 2 3 4 5 Clauses k

4 6 8 10 12 14 16

1 2 3 4 5 Clauses k

Training time (s)

1 2 3 4 5 Clauses k

Test accuracy

1 2 3 4 5 Clauses k

80.2 80.4 80.6 80.8 81.0 81.2 81.4

Train accuracy

1 2 3 4 5 Clauses k

1 2 3 4 5 Clauses k

Training time (s)

Figure 3: Eﬀect of the number of clauses k on accuracy (test and train), rule-size, and training time. As k increases, both train and test accuracy of IMLI increase while generating rules with higher size by incurring higher training time.

higher interpretability is desired. In both cases, one could deploy IMLI for learning varied representations of classiﬁcation rules.

8.2.5 Ablation Study

We experiment the eﬀect of diﬀerent hyper-parameters in IMLI on prediction accuracy, rulesize, and training time in diﬀerent datasets. In the following, we discuss the impact of the number of clauses, regularization parameter, and size of mini-batches in IMLI.

Eﬀect of the number of clauses k. In Figure 3, we vary k while learning CNF classiﬁers in IMLI. As k increases, both training and test accuracy generally increase in diﬀerent datasets (plots in the ﬁrst and second columns). Similarly, the size of rules increases with k by incurring higher training time (plots in the third and fourth columns). The reason is that a higher value of k allows more ﬂexibility in ﬁtting the data well by incurring more training time and generating higher size classiﬁcation rules. Therefore, the number of clauses in IMLI provides control on training-time vs accuracy and also on accuracy vs rule-sparsity.

Eﬀect of regularizer λ. In Figure 4, we vary λ in a logarithmic grid between 10 4 and 101. As stated in Eq. (5), a higher value of λ puts more priority on the minimal changes in rules between consecutive mini-batches in incremental learning while allowing higher mini-

Ghosh, Malioutov, & Meel

10 4 10 3 10 2 10 1 100 101 Regularizer

Test accuracy

10 4 10 3 10 2 10 1 100 101 Regularizer

Train accuracy

10 4 10 3 10 2 10 1 100 101 Regularizer

4 6 8 10 12 14 16

10 4 10 3 10 2 10 1 100 101 Regularizer

Training time (s)

10 4 10 3 10 2 10 1 100 101 Regularizer

Test accuracy

10 4 10 3 10 2 10 1 100 101 Regularizer

Train accuracy

10 4 10 3 10 2 10 1 100 101 Regularizer

2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0

10 4 10 3 10 2 10 1 100 101 Regularizer

Training time (s)

10 4 10 3 10 2 10 1 100 101 Regularizer

77.5 78.0 78.5 79.0 79.5 80.0 80.5 81.0

Test accuracy

10 4 10 3 10 2 10 1 100 101 Regularizer

Train accuracy

10 4 10 3 10 2 10 1 100 101 Regularizer

10 4 10 3 10 2 10 1 100 101 Regularizer

Training time (s)

Figure 4: Eﬀect of regularization λ on accuracy (test and train), rule-size, and training time. As λ increases, lower priority is given to accuracy. As a result, both training and test accuracy decrease with λ by generating smaller rules.

Efficient Learning of Interpretable Classification Rules

50 100 200 400 Batch size

Test accuracy

50 100 200 400 Batch size

92.5 93.0 93.5 94.0 94.5 95.0 95.5

Train accuracy

50 100 200 400 Batch size

10 11 12 13 14 15 16

50 100 200 400 Batch size

Training time (s)

50 100 200 400 Batch size

Test accuracy

50 100 200 400 Batch size

Train accuracy

50 100 200 400 Batch size

9 10 11 12 13 14 15

50 100 200 400 Batch size

Training time (s)

50 100 200 400 Batch size

Test accuracy

50 100 200 400 Batch size

Train accuracy

50 100 200 400 Batch size

17.5 20.0 22.5 25.0 27.5 30.0 32.5

50 100 200 400 Batch size

Training time (s)

Figure 5: Eﬀect of bath size on accuracy (test and train), rule-size, and training time. As we consider more samples in a mini-batch, IMLI generates more accurate and larger size classiﬁcation rules.

batch training errors. Thus, in the ﬁrst and second columns in Figure 4, as λ increases, both training and test accuracy gradually decrease. In addition, the size of rules (plots in the third column) also decreases. Finally, we observe that the training time generally decreases with λ. This observation indicates that higher λ puts lower computational load to the Max SAT solver as a fraction of training examples is allowed to be misclassiﬁed. Thus, similar to the number of clauses, regularization parameter λ in IMLI allows to trade-oﬀ between accuracy and rule-size in a precise manner.

Eﬀect of the size of mini-batch. In Figure 5, we present the eﬀect of mini-batch size in IMLI. As we consider more samples in a batch, both test and training accuracy increase in general as presented in the ﬁrst and second columns in Figure 5. Similarly, the size of generated rules also increases with the number of samples. Due to solving higher size Max SAT queries, the training time also increases in general with an increase in mini-batch size. Therefore, by varying the size of mini-batches, IMLI allows controlling on training time vs the prediction accuracy (and rule-size) of generated rules.

Ghosh, Malioutov, & Meel

9. Conclusion

Interpretable machine learning is gaining more focus with applications in many safetycritical domains. Considering the growing demand for interpretable models, it is challenging to design learning frameworks that satisfy all aspects: being accurate, interpretable, and scalable in practical classiﬁcation tasks. In this paper, we have proposed a Max SATbased framework IMLI for learning interpretable rule-based classiﬁers expressible in CNF formulas. IMLI is built on an eﬃcient integration of incremental learning, speciﬁcally minibatch and iterative learning, with Max SAT-based formulation. In our empirical evaluation, IMLI achieves the best balance among prediction accuracy, interpretability, and scalability. In particular, IMLI demonstrates competitive prediction accuracy and rule-size compared to existing interpretable rule-based classiﬁers. In addition, IMLI achieves impressive scalability than both interpretable and non-interpretable classiﬁers by learning interpretable rules on million-size datasets with higher accuracy. Finally, IMLI generates other popular interpretable classiﬁers such as decision lists and decision sets using the same framework.

Acknowledgments

Dmitry Malioutov was at the T.J. Watson IBM Research Center while performing this work. This work was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0004], Ministry of Education Singapore Tier 2 grant MOE-T2EP20121-0011, and Ministry of Education Singapore Tier 1 Grant [R-252000-B59-114]. The computational work for this article was performed on resources of Max Planck Institute for Software Systems, Germany and the National Supercomputing Centre, Singapore https://www.nscc.sg.

Appendix A. Illustration of Interpretable CNF Classiﬁcation Rules

In the following, we present representative CNF classiﬁers learned in diﬀerent datasets. In each dataset, if an input satisﬁes the CNF formula, it is predicted class 1 and vice-versa.

0.3 Average vocal fundamental frequency < 0.4 OR 0.5 Average vocal fundamental frequency < 0.6 OR 0.1 Maximum vocal fundamental frequency < 0.2 OR 0.5 Minimum vocal fundamental frequency < 0.6 OR 0.1 Shimmer:APQ5 < 0.2 OR 0.5 DFA < 0.6 OR 0.3 spread1 < 0.4 OR 0.8 spread2 < 0.9 OR 0.3 PPE < 0.4 OR NOT - MDVP:APQ < 0.1 AND

NOT 0.8 Average vocal fundamental frequency < 0.9

0.4 perimeter < 0.5 OR 0.8 symmetry < 0.9 OR 0.5 largest concave points < 0.6 OR 0.6 largest concave points < 0.7 OR 0.7 largest concave points < 0.8 OR 0.6 largest symmetry < 0.7 OR NOT - area SE < 0.1 AND

Efficient Learning of Interpretable Classification Rules

0.4 texture < 0.5 OR 0.5 texture < 0.6 OR 0.5 perimeter < 0.6 OR 0.5 smoothness < 0.6 OR 0.4 concave points < 0.5 OR 0.5 concave points < 0.6 OR 0.2 largest concavity < 0.3 OR 0.6 largest concave points < 0.7 OR 0.7 largest concave points < 0.8 OR 0.4 largest symmetry < 0.5 OR 0.6 largest symmetry < 0.7

x1 = 11 OR x1 = 14 OR x1 = 15 OR 0.7 x2 < 0.8 OR 0.8 x2 < 0.9 OR 0.9 x2 < OR 0.9 x3 < OR 0.4 x5 < 0.5 OR 0.7 x5 < 0.8 OR 0.7 x6 < 0.8 OR 0.4 x7 < 0.5 OR 0.5 x7 < 0.6 OR 0.9 x7 < AND

x1 = 7 OR x1 = 8 OR 0.6 x2 < 0.7 OR 0.8 x2 < 0.9 OR 0.9 x2 < OR - x3 < 0.1 OR 0.7 x3 < 0.8 OR 0.9 x3 < OR 0.2 x5 < 0.3 OR 0.7 x6 < 0.8 OR 0.1 x7 < 0.2 OR 0.3 x7 < 0.4 OR 0.4 x8 < 0.5 OR 0.8 x8 < 0.9 AND

x1 = 2 OR x1 = 6 OR x1 = 7 OR x1 = 9 OR - x3 < 0.1 OR 0.8 x3 < 0.9 OR 0.1 x5 < 0.2 OR 0.2 x5 < 0.3 OR 0.7 x5 < 0.8 OR 0.5 x6 < 0.6 OR 0.6 x6 < 0.7 OR 0.4 x7 < 0.5 OR 0.9 x7 < OR 0.1 x8 < 0.2 OR 0.3 x8 < 0.4 OR 0.6 x8 < 0.7 OR 0.8 x8 < 0.9 AND

x1 = 8 OR 0.5 x2 < 0.6 OR 0.8 x2 < 0.9 OR 0.9 x2 < OR 0.7 x3 < 0.8 OR - x4 < 0.1 OR 0.1 x5 < 0.2 OR 0.3 x5 < 0.4 OR 0.5 x7 < 0.6 OR 0.1 x8 < 0.2 OR 0.4 x8 < 0.5 OR 0.8 x8 < 0.9

- age < 0.1 OR 0.9 age < OR 0.9 fare < OR NOT sex AND

passenger-class = 2 OR 0.4 age < 0.5 OR 0.6 age < 0.7 OR siblings-or-spoucesaboard = 0 OR 0.1 fare < 0.2 OR 0.5 fare < 0.6 OR embarked = C AND

- age < 0.1 OR 0.1 age < 0.2 OR 0.7 age < 0.8 OR siblings-or-spouces-aboard = 1 OR parents-or-childred-aboard = 1 OR embarked = C OR NOT passenger-class = 3 AND

- age < 0.1 OR 0.2 age < 0.3 OR 0.3 age < 0.4 OR parents-or-childred-aboard = 0 OR NOT passenger-class = 3 OR NOT siblings-or-spouces-aboard = 0 AND

NOT passenger-class = 2 OR NOT 0.7 age < 0.8

Ghosh, Malioutov, & Meel

- length < 0.1 OR - size < 0.1 OR 0.8 conc < 0.9 OR - alpha < 0.1 OR 0.1 alpha < 0.2 OR 0.2 dist < 0.3

0.4 x10 < 0.5 OR 0.1 x11 < 0.2 OR 0.4 x11 < 0.5 OR 0.4 x12 < 0.5 OR 0.3 x13 < 0.4 OR 0.6 x13 < 0.7 OR 0.3 x15 < 0.4 OR 0.6 x15 < 0.7 OR 0.4 x16 < 0.5 OR 0.1 x57 < 0.2 OR 0.5 x59 < 0.6 OR 0.3 x60 < 0.4 OR 0.2 x62 < 0.3 OR 0.2 x73 < 0.3 OR 0.1 x74 < 0.2 OR NOT - x74 < 0.1 OR NOT - x96 < 0.1 AND

0.4 x9 < 0.5 OR 0.8 x9 < 0.9 OR 0.2 x12 < 0.3 OR 0.3 x12 < 0.4 OR 0.3 x15 < 0.4 OR 0.6 x15 < 0.7 OR 0.1 x60 < 0.2 OR 0.1 x61 < 0.2 OR 0.6 x78 < 0.7 OR NOT - x11 < 0.1 OR NOT - x13 < 0.1 OR NOT - x46 < 0.1 OR NOT - x64 < 0.1 OR NOT - x72 < 0.1 OR NOT - x77 < 0.1 AND

- x50 < 0.1 OR 0.1 x57 < 0.2 AND

- x49 < 0.1 OR NOT - x17 < 0.1 AND

0.3 x14 < 0.4 OR 0.6 x74 < 0.7 OR 0.9 x79 < OR NOT - x9 < 0.1 OR NOT - x15 < 0.1 OR NOT - x73 < 0.1 OR NOT - x80 < 0.1

Repayment-status-in-September = 2 OR Repayment-status-in-September = 3 OR Repaymentstatus-in-August = 3 OR Repayment-status-in-May = 3 OR Repayment-status-in-May = 5 OR Repayment-status-in-April = 3 AND

0.4 Age < 0.5 OR Repayment-status-in-September = 1 OR Repayment-status-in-August = 0 OR Repayment-status-in-July = 2 OR Repayment-status-in-May = 2 OR Repaymentstatus-in-April = 7 OR 0.3 Amount-of-bill-statement-in-April < 0.4 OR NOT Repaymentstatus-in-May = -1 OR NOT - Amount-of-bill-statement-in-May < 0.1 AND

Gender OR - Age < 0.1 OR 0.3 Age < 0.4 OR Repayment-status-in-September = 1 OR 0.1 Amount-of-bill-statement-in-August < 0.2 OR NOT Education = 3 OR NOT 0.2 Amount-of-bill-statement-in-September < 0.3

education = Doctorate OR education = Prof-school OR 0.1 capital-gain < 0.2 OR 0.2 capital-gain < 0.3 OR 0.9 capital-gain < OR 0.4 capital-loss < 0.5 OR 0.5 capital-loss < 0.6 AND

Efficient Learning of Interpretable Classification Rules

relationship = Husband OR relationship = Wife OR 0.5 capital-loss < 0.6 OR 0.6 capital-loss < 0.7 OR 0.8 capital-loss < 0.9 OR NOT - capital-gain < 0.1 AND

0.1 age < 0.2 OR 0.7 age < 0.8 OR workclass = State-gov OR education = Bachelors OR marital-status = Separated OR occupation = Exec-managerial OR occupation = Farming-ﬁshing OR occupation = Prof-specialty OR occupation = Protective-serv OR occupation = Tech-support OR relationship = Unmarried OR 0.4 capital-loss < 0.5 OR 0.6 capital-loss < 0.7 OR 0.2 hours-per-week < 0.3 OR 0.4 hours-per-week < 0.5 OR 0.7 hours-per-week < 0.8 OR 0.9 hours-per-week < OR NOT - capital-gain < 0.1

Bank Marketing

0.7 age < 0.8 OR 0.2 duration < 0.3 OR 0.3 duration < 0.4 OR 0.4 duration < 0.5 OR 0.5 duration < 0.6 OR poutcome = success AND

housing OR 0.8 age < 0.9 OR job = admin. OR job = management OR job = self-employed OR job = services OR job = unemployed OR 0.2 duration < 0.3 OR 0.1 campaign < 0.2 OR poutcome = success AND

job = services OR 0.2 balance < 0.3 OR contact = cellular OR 0.1 duration < 0.2 OR 0.2 duration < 0.3 OR 0.2 pdays < 0.3 OR poutcome = unknown OR NOT - balance < 0.1 OR NOT - previous < 0.1 AND

0.2 age < 0.3 OR 0.3 age < 0.4 OR 0.8 age < 0.9 OR job = management OR job = student OR job = unemployed OR education = secondary OR 0.1 balance < 0.2 OR 0.1 duration < 0.2 OR - pdays < 0.1 OR 0.2 pdays < 0.3 OR poutcome = unknown

b2 = 1 OR c2 = 1 OR d2 = 1 OR d4 = 1 OR d5 = 0 OR e2 = 1 OR f3 = 0 OR NOT d1 = 0 AND

b2 = 1 OR b3 = 1 OR b4 = 1 OR d4 = 1 OR f2 = 1 OR NOT d3 = 0 OR NOT f3 = 2 AND

c1 = 2 OR c2 = 2 OR c3 = 1 OR c3 = 2 OR c4 = 1 OR c6 = 0 OR d2 = 1 OR d4 = 1 OR e2 = 1 OR e3 = 1 OR f4 = 1 OR NOT a3 = 2 OR NOT b6 = 2 AND

a1 = 0 OR a2 = 0 OR a6 = 0 OR b2 = 1 OR b4 = 1 OR b5 = 0 OR c2 = 1 OR c4 = 1 OR c5 = 0 OR d1 = 1 OR d2 = 1 OR e2 = 1 OR g1 = 0 OR g2 = 0 OR NOT d5 = 2 OR NOT d6 = 2 AND

Ghosh, Malioutov, & Meel

b2 = 1 OR b4 = 1 OR c3 = 1 OR d3 = 1 OR e2 = 1 OR f2 = 1 OR f3 = 1 OR g3 = 0 OR g5 = 0 OR NOT c2 = 0 OR NOT c4 = 2 OR NOT d2 = 2

Weather AUS

0.7 Rainfall < 0.8 OR 0.8 Humidity3pm < 0.9 OR 0.9 Humidity3pm < AND

Rain Today OR 0.5 Wind Gust Speed < 0.6 OR 0.7 Humidity9am < 0.8 OR 0.9 Humidity3pm < OR 0.4 Pressure3pm < 0.5 OR NOT 0.9 Humidity9am < AND

0.1 Min Temp < 0.2 OR 0.8 Min Temp < 0.9 OR 0.9 Wind Speed9am < OR 0.9 Humidity9am < OR 0.8 Humidity3pm < 0.9 OR 0.9 Humidity3pm < OR 0.5 Temp3pm < 0.6 OR NOT 0.7 Rainfall < 0.8 AND

Wind Gust Dir = NNW OR Wind Dir3pm = W OR NOT 0.8 Temp3pm < 0.9 AND

Wind Gust Dir = NNW OR Wind Gust Dir = NW OR 0.1 Pressure9am < 0.2 OR NOT 0.7 Temp3pm < 0.8

NOT physician-fee-freeze AND

NOT adoption-of-the-budget-resolution OR NOT anti-satellite-test-ban OR NOT synfuelscorporation-cutback AND

adoption-of-the-budget-resolution OR el-salvador-aid OR NOT duty-free-exports AND

mx-missile OR NOT adoption-of-the-budget-resolution OR NOT el-salvador-aid OR NOT anti-satellite-test-ban AND

adoption-of-the-budget-resolution OR mx-missile OR NOT synfuels-corporation-cutback OR NOT education-spending

0.2 Red < 0.3 OR 0.3 Red < 0.4 OR 0.4 Red < 0.5 OR 0.9 Red < OR 0.9 Green < OR 0.4 Blue < 0.5 OR NOT 0.9 Blue < AND

0.2 Red < 0.3 OR 0.7 Red < 0.8 OR 0.8 Red < 0.9 OR 0.9 Red < OR NOT 0.8 Blue < 0.9 AND

0.7 Red < 0.8 OR 0.8 Red < 0.9 OR 0.9 Red < OR - Green < 0.1

Efficient Learning of Interpretable Classification Rules

OR NOT 0.7 Blue < 0.8

0.5 wage-increase-ﬁrst-year < 0.6 OR pension = empl-contr OR contribution-to-dentalplan = full AND

0.6 wage-increase-ﬁrst-year < 0.7 OR 0.7 wage-increase-second-year < 0.8 OR 0.5 shift-diﬀerential < 0.6 OR NOT longterm-disability-assistance AND

0.2 wage-increase-ﬁrst-year < 0.3 OR 0.3 wage-increase-ﬁrst-year < 0.4 OR 0.4 wage-increase-ﬁrst-year < 0.5 OR 0.5 wage-increase-ﬁrst-year < 0.6 OR 0.6 wageincrease-ﬁrst-year < 0.7 OR 0.7 wage-increase-ﬁrst-year < 0.8 OR 0.6 wage-increasesecond-year < 0.7 OR cost-of-living-adjustment = tcf OR 0.3 working-hours < 0.4 OR 0.4 working-hours < 0.5 OR 0.5 working-hours < 0.6 OR pension = ret-allw OR 0.7 standby-pay < 0.8 OR contribution-to-dental-plan = full OR contribution-to-health-plan = half OR NOT education-allowance AND

0.5 shift-diﬀerential < 0.6 OR contribution-to-dental-plan = full OR contribution-todental-plan = half OR contribution-to-health-plan = half OR contribution-to-health-plan = none AND

0.3 wage-increase-ﬁrst-year < 0.4 OR 0.6 wage-increase-ﬁrst-year < 0.7 OR 0.7 wage-increase-second-year < 0.8 OR 0.4 working-hours < 0.5 OR pension = empl-contr OR pension = ret-allw OR 0.9 statutory-holidays < inf

BNG(credit-g)

foreign-worker OR checking-status = no checking OR checking-status = =200 OR - duration < 0.1 OR 0.1 duration < 0.2 OR credit-history = critical/other existing credit OR purpose = used car OR savings-status = no known savings OR savings-status = =1000 OR other-parties = guarantor AND

foreign-worker OR checking-status = no checking OR checking-status = =200 OR credit-history = delayed previously OR purpose = used car OR purpose = radio/tv OR other-parties = guarantor OR 0.4 age < 0.5 OR NOT other-payment-plans = bank AND

checking-status = no checking OR credit-history = critical/other existing credit OR purpose = used car OR purpose = radio/tv OR purpose = retraining OR employment = 4 X<7 OR property-magnitude = real estate OR NOT checking-status = <0

Ghosh, Malioutov, & Meel

Agarap, A. F. M. (2018). On breast cancer detection: an application of machine learning algorithms on the wisconsin diagnostic dataset. In Proceedings of the 2nd international conference on machine learning and soft computing, pp. 5 9.

Alos, J., Ansotegui, C., & Torres, E. (2021). Learning optimal decision trees using Max SAT..

Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., & Rudin, C. (2017). Learning certiﬁably optimal rule lists for categorical data..

Argelich, J., Li, C.-M., Manya, F., & Planes, J. (2008). The ﬁrst and second Max-SAT evaluations.. Vol. 4, pp. 251 278. IOS Press.

Augasta, M. G., & Kathirvalavakumar, T. (2012). Rule extraction from neural networks a comparative study. In International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME-2012), pp. 404 408. IEEE.

Balyo, T., Heule, M., & Jarvisalo, M. (2017). SAT competition 2016: Recent developments. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, Vol. 31.

Barakat, N., & Diederich, J. (2004). Learning-based rule-extraction from support vector machines.. not found.

Barakat, N., & Diederich, J. (2005). Eclectic rule-extraction from support vector machines.. Vol. 2, pp. 59 62. Citeseer.

Berg, O. J., Hyttinen, A. J., & J arvisalo, M. J. (2019). Applications of Max SAT in data analysis.. Easy Chair Publications.

Bessiere, C., Hebrard, E., & O Sullivan, B. (2009). Minimising decision tree size as combinatorial optimisation. In International Conference on Principles and Practice of Constraint Programming, pp. 173 187. Springer.

Bhagoji, A. N., Cullina, D., Sitawarin, C., & Mittal, P. (2018). Enhancing robustness of machine learning systems via data transformations. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pp. 1 5. IEEE.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. In Proc. of NIPS.

Chen, Z., & Liu, B. (2018). Lifelong machine learning.. Vol. 12, pp. 1 207. Morgan & Claypool Publishers.

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm..

Cohen, W. W., & Singer, Y. (1999). A simple, fast, and eﬀective rule learner. In Proc. of AAAI, Orlando, FL.

Cohen, W. W. (1995). Fast eﬀective rule induction. In Machine learning proceedings 1995, pp. 115 123. Elsevier.

Dash, S., & Goncalves, J. (2021). LPRules: Rule induction in knowledge graphs using linear programming..

Diederich, J. (2008). Rule extraction from support vector machines: An introduction. In Rule extraction from support vector machines, pp. 3 31. Springer.

Efficient Learning of Interpretable Classification Rules

Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning..

Du, M., Liu, N., & Hu, X. (2019). Techniques for interpretable machine learning.. Vol. 63, pp. 68 77. ACM New York, NY, USA.

Dua, D., & Graﬀ, C. (2017). UCI machine learning repository..

Erickson, B. J., Korﬁatis, P., Akkus, Z., & Kline, T. L. (2017). Machine learning for medical imaging.. Vol. 37, pp. 505 515. Radiological Society of North America.

Fisher, M. L. (1981). The lagrangian relaxation method for solving integer programming problems.. Vol. 27, pp. 1 18. INFORMS.

F urnkranz, J. (1999). Separate-and-conquer rule learning.. Vol. 13, pp. 3 54. Springer.

Gage, B. F., Waterman, A. D., Shannon, W., Boechler, M., Rich, M. W., & Radford, M. J. (2001). Validation of clinical classiﬁcation schemes for predicting stroke: results from the national registry of atrial ﬁbrillation.. Vol. 285, pp. 2864 2870. American Medical Association.

Ghosh, B., & Meel, K. S. (2019). Imli: An incremental framework for maxsat-based learning of interpretable classiﬁcation rules. In Proc. of AIES.

Gill, N., Hall, P., Montgomery, K., & Schmidt, N. (2020). A responsible machine learning workﬂow with focus on interpretable models, post-hoc explanation, and discrimination testing.. Vol. 11, p. 137. Multidisciplinary Digital Publishing Institute.

Guidotti, R., Monreale, A., Ruggieri, S., Pedreschi, D., Turini, F., & Giannotti, F. (2018). Local rule-based explanations of black box decision systems..

Hailesilassie, T. (2016). Rule extraction algorithm for deep neural networks: A review..

Hancox-Li, L. (2020). Robustness in machine learning explanations: does it matter?. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 640 647.

Hinton, G., Srivastava, N., & Swersky, K. (2012). Neural networks for machine learning lecture 6a overview of mini-batch gradient descent.. Vol. 14, p. 2.

Holstein, K., Wortman Vaughan, J., Daum e III, H., Dudik, M., & Wallach, H. (2019). Improving fairness in machine learning systems: What do industry practitioners need?. In Proceedings of the 2019 CHI conference on human factors in computing systems, pp. 1 16.

Ignatiev, A., Lam, E., Stuckey, P. J., & Marques-Silva, J. (2021a). A scalable two stage approach to computing optimal decision sets..

Ignatiev, A., Marques-Silva, J., Narodytska, N., & Stuckey, P. J. (2021b). Reasoning-based learning of interpretable ML models. In International Joint Conference on Artiﬁcial Intelligence (IJCAI).

Ignatiev, A., Pereira, F., Narodytska, N., & Marques-Silva, J. (2018). A SAT-based approach to learn explainable decision sets. In International Joint Conference on Automated Reasoning, pp. 627 645. Springer.

Izza, Y., Ignatiev, A., & Marques-Silva, J. (2020). On explaining decision trees..

Ghosh, Malioutov, & Meel

Janota, M., & Morgado, A. (2020). SAT-based encodings for optimal decision trees with explicit paths. In International Conference on Theory and Applications of Satisﬁability Testing, pp. 501 518. Springer.

Johnson, J. K., Malioutov, D. M., & Willsky, A. S. (2007). Lagrangian relaxation for MAP estimation in graphical models..

Kaissis, G. A., Makowski, M. R., R uckert, D., & Braren, R. F. (2020). Secure, privacypreserving and federated machine learning in medical imaging.. Vol. 2, pp. 305 311. Nature Publishing Group.

Koneˇcn y, J., Mc Mahan, B., & Ramage, D. (2015). Federated optimization: Distributed optimization beyond the datacenter..

Koneˇcn y, J., Mc Mahan, H. B., Ramage, D., & Richt arik, P. (2016). Federated optimization: Distributed machine learning for on-device intelligence..

Kononenko, I. (2001). Machine learning for medical diagnosis: history, state of the art and perspective.. Vol. 23, pp. 89 109. Elsevier.

Kumar, R. S. S., O Brien, D. R., Albert, K., & Vilojen, S. (2018). Law and adversarial machine learning..

Lakkaraju, H., Bach, S. H., & Leskovec, J. (2016). Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1675 1684.

Lakkaraju, H., Kamar, E., Caruana, R., & Leskovec, J. (2017). Interpretable & explorable approximations of black box models..

Lakkaraju, H., Kamar, E., Caruana, R., & Leskovec, J. (2019). Faithful and customizable explanations of black box models. In Proc. of AIES.

Lemar echal, C. (2001). Lagrangian relaxation. In Computational combinatorial optimization, pp. 112 156. Springer.

Letham, B., Rudin, C., Mc Cormick, T. H., & Madigan, D. (2015). Interpretable classiﬁers using rules and Bayesian analysis: Building a better stroke prediction model.. Vol. 9, pp. 1350 1371. Institute of Mathematical Statistics.

Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Eﬃcient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 661 670.

Lin, P.-C. K., & Khatri, S. P. (2012). Application of Max-SAT-based ATPG to optimal cancer therapy design.. Vol. 13, pp. 1 10. Springer.

Lundberg, S. M., & Lee, S.-I. (2017). A uniﬁed approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems, pp. 4768 4777.

Malioutov, D., & Meel, K. S. (2018). MLIC: A maxsat-based framework for learning interpretable classiﬁcation rules. In Proceedings of International Conference on Constraint Programming (CP).

Efficient Learning of Interpretable Classification Rules

Malioutov, D., & Varshney, K. (2013). Exact rule learning via boolean compressed sensing. In International Conference on Machine Learning, pp. 765 773. PMLR.

Martens, D., Huysmans, J., Setiono, R., Vanthienen, J., & Baesens, B. (2008). Rule extraction from support vector machines: an overview of issues and application in credit scoring.. pp. 33 63. Springer.

Martins, R., Manquinho, V., & Lynce, I. (2014). Open-WBO: A modular Max SAT solver. In International Conference on Theory and Applications of Satisﬁability Testing, pp. 438 445. Springer.

Masters, D., & Luschi, C. (2018). Revisiting small batch training for deep neural networks..

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning.. Vol. 54, pp. 1 35. ACM New York, NY, USA.

Molnar, C. (2020). Interpretable machine learning. Lulu. com.

Moradi, M., & Samwald, M. (2021). Post-hoc explanation of black-box classiﬁers using conﬁdent itemsets.. Vol. 165, p. 113941. Elsevier.

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Interpretable machine learning: deﬁnitions, methods, and applications..

Myers, I. B. (1962). The Myers-Briggs type indicator: Manual (1962)... Consulting Psychologists Press.

Narodytska, N., Ignatiev, A., Pereira, F., Marques-Silva, J., & RAS, I. (2018). Learning optimal decision trees with SAT. In IJCAI, pp. 1362 1368.

N u nez, H., Angulo, C., & Catal a, A. (2002). Rule extraction from support vector machines.. In Esann, pp. 107 112.

Pastor, E., & Baralis, E. (2019). Explaining black box models by means of local rules. In Proceedings of the 34th ACM/SIGAPP symposium on applied computing, pp. 510 517.

Peled, I., Rodrigues, F., & Pereira, F. C. (2019). Model-based machine learning for transportation. In Mobility patterns, big data and transport analytics, pp. 145 171. Elsevier.

Philipp, T., & Steinke, P. (2015). Pblib a library for encoding pseudo-boolean constraints into cnf. In International Conference on Theory and Applications of Satisﬁability Testing, pp. 9 16. Springer.

Quinlan, J. R. (1986). Induction of decision trees.. Vol. 1, pp. 81 106. Springer.

Quinlan, J. R. (1987). Simplifying decision trees.. Vol. 27, pp. 221 234. Elsevier.

Quinlan, J. R. (1993). C4. 5: Programming for machine learning..

Rajapaksha, D., Bergmeir, C., & Buntine, W. (2020). Lo RMIk A: Local rule-based model interpretability with K-optimal associations.. Vol. 540, pp. 221 241. Elsevier.

Ralaivola, L., & d Alch e Buc, F. (2001). Incremental support vector machine learning: A local approach. In Proc. of ICANN.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). why should i trust you? explaining the predictions of any classiﬁer. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135 1144.

Ghosh, Malioutov, & Meel

Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI conference on artiﬁcial intelligence, Vol. 32.

Rivest, R. L. (1987). Learning decision lists.. Vol. 2, pp. 229 246. Springer.

Robinson, N., Gretton, C., Pham, D. N., & Sattar, A. (2010). Partial weighted Max SAT for optimal planning. In Paciﬁc rim international conference on artiﬁcial intelligence, pp. 231 243. Springer.

Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.. Vol. 1, pp. 206 215. Nature Publishing Group.

Ruping, S. (2001). Incremental learning with support vector machines. In Proc. of ICDM.

Sato, M., & Tsukimoto, H. (2001). Rule extraction from neural networks via decision tree induction. In IJCNN 01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), Vol. 3, pp. 1870 1875. IEEE.

Schidler, A., & Szeider, S. (2021). SAT-based decision tree learning for large data sets. In Proceedings of AAAI, Vol. 21.

Setiono, R., & Liu, H. (1995). Understanding neural networks via rule extraction. In IJCAI, Vol. 1, pp. 480 485. Citeseer.

Shati, P., Cohen, E., & Mc Ilraith, S. (2021). SAT-based approach for learning optimal decision trees with non-binary features. In 27th International Conference on Principles and Practice of Constraint Programming (CP 2021). Schloss Dagstuhl-Leibniz Zentrum f ur Informatik.

Silva, J. P. M., & Sakallah, K. A. (2003). GRASP a new search algorithm for satisﬁability. In The Best of ICCAD, pp. 73 89. Springer.

Slack, D., Hilgard, S., Jia, E., Singh, S., & Lakkaraju, H. (2020). Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp. 180 186.

Su, G., Wei, D., Varshney, K. R., & Malioutov, D. M. (2016). Learning sparse two-level Boolean rules. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1 6. IEEE.

Surden, H. (2014). Machine learning and law.. Vol. 89, p. 87. Hein Online.

Syed, N. A., Huan, S., Kah, L., & Sung, K. (1999). Incremental learning with support vector machines.. Citeseer.

Vanschoren, J., van Rijn, J. N., Bischl, B., & Torgo, L. (2013). Open ML: Networked science in machine learning.. Vol. 15, pp. 49 60, New York, NY, USA. ACM.

Walter, R., Zengler, C., & K uchlin, W. (2013). Applications of Max SAT in automotive conﬁguration.. In Conﬁguration Workshop, Vol. 1, p. 21. Citeseer.

Wang, F., & Rudin, C. (2015). Falling rule lists. In Artiﬁcial Intelligence and Statistics, pp. 1013 1022. PMLR.

Efficient Learning of Interpretable Classification Rules

Wang, T., Rudin, C., Doshi-Velez, F., Liu, Y., Klampﬂ, E., & Mac Neille, P. (2017). A Bayesian framework for learning rule sets for interpretable classiﬁcation.. Vol. 18, pp. 2357 2393. JMLR. org.

Yu, J., Ignatiev, A., Bodic, P. L., & Stuckey, P. J. (2020a). Optimal decision lists using SAT..

Yu, J., Ignatiev, A., Stuckey, P. J., & Le Bodic, P. (2020b). Computing optimal decision sets with SAT. In International Conference on Principles and Practice of Constraint Programming, pp. 952 970. Springer.

Zantalis, F., Koulouras, G., Karabetsos, S., & Kandris, D. (2019). A review of machine learning and Io T in smart transportation.. Vol. 11, p. 94. Multidisciplinary Digital Publishing Institute.

Zhou, Z.-H. (2004). Rule extraction: Using neural networks or for neural networks?.. Vol. 19, pp. 249 253. Springer.

Zilke, J. R., Menc ıa, E. L., & Janssen, F. (2016). Deepred rule extraction from deep neural networks. In International Conference on Discovery Science, pp. 457 473. Springer.