Entropy-Based Logic Explanations of Neural Networks

Pietro Barbiero 1, Gabriele Ciravegna 2,3,4, Francesco Giannini 3, Pietro Liò 1, Marco Gori 3,4, Stefano Melacci 3
1 University of Cambridge (UK), 2 Università di Firenze (Italy), 3 Università di Siena (Italy), 4 Université Côte d'Azur (France)
{pb737, pl213}@cam.ac.uk, gabriele.ciravegna@unifi.it, {francesco.giannini, marco.gori}@unisi.it, mela@diism.unisi.it

Abstract

Explainable artificial intelligence has rapidly emerged since lawmakers have started requiring interpretable models for safety-critical domains. Concept-based neural networks have arisen as explainable-by-design methods as they leverage human-understandable symbols (i.e. concepts) to predict class memberships. However, most of these approaches focus on the identification of the most relevant concepts but do not provide concise, formal explanations of how such concepts are leveraged by the classifier to make predictions. In this paper, we propose a novel end-to-end differentiable approach enabling the extraction of logic explanations from neural networks using the formalism of First-Order Logic. The method relies on an entropy-based criterion which automatically identifies the most relevant concepts. We consider four different case studies to demonstrate that: (i) this entropy-based criterion enables the distillation of concise logic explanations in safety-critical domains from clinical data to computer vision; (ii) the proposed approach outperforms state-of-the-art white-box models in terms of classification accuracy and matches black-box performance.

1 Introduction

The lack of transparency in the decision process of some machine learning models, such as neural networks, limits their application in many safety-critical domains (EUGDPR 2017; Goddard 2017). For this reason, explainable artificial intelligence (XAI) research has focused either on explaining black-box decisions (Zilke, Loza Mencía, and Janssen 2016; Ying et al. 2019; Ciravegna et al. 2020a; Arrieta et al. 2020) or on developing machine learning models that are interpretable by design (Schmidt and Lipson 2009; Letham et al. 2015; Cranmer et al. 2019; Molnar 2020). However, while interpretable models engender trust in their predictions (Doshi-Velez and Kim 2017, 2018; Ahmad, Eckert, and Teredesai 2018; Rudin et al. 2021), black-box models, such as neural networks, are the ones that provide state-of-the-art task performance (Battaglia et al. 2018; Devlin et al. 2018; Dosovitskiy et al. 2020; Xie et al. 2020). Research to address this imbalance is needed for the deployment of cutting-edge technologies.

Most techniques explaining black boxes focus on finding or ranking the most relevant features used by the model to make predictions (Simonyan, Vedaldi, and Zisserman 2013; Zeiler and Fergus 2014; Ribeiro, Singh, and Guestrin 2016b; Lundberg and Lee 2017; Selvaraju et al. 2017). Such feature-scoring methods are very efficient and widely used, but they cannot explain how neural networks compose such features to make predictions (Kindermans et al. 2019; Kim et al. 2018; Alvarez-Melis and Jaakkola 2018). In addition, a key issue of most explanation methods is that explanations are given in terms of input features (e.g. pixel intensities) that do not correspond to high-level categories that humans can easily understand (Kim et al. 2018; Su, Vargas, and Sakurai 2019).
To overcome this issue, concept-based approaches have become increasingly popular as they provide explanations in terms of human-understandable categories (i.e. the concepts) rather than raw features (Kim et al. 2018; Ghorbani et al. 2019; Koh et al. 2020; Chen, Bei, and Rudin 2020). However, fewer approaches are able to explain how such concepts are leveraged by the classifier, and even fewer provide concise explanations whose validity can be assessed quantitatively (Ribeiro, Singh, and Guestrin 2016b; Guidotti et al. 2018; Das and Rad 2020).

Contributions. In this paper, we first propose an entropy-based layer (Sec. 3.1) that enables the implementation of concept-based neural networks providing First-Order Logic explanations (Fig. 1). The proposed approach is not just a post-hoc method, but an explainable-by-design approach, as it embeds additional constraints both in the architecture and in the learning process to allow the emergence of simple logic explanations. This point of view is in contrast with post-hoc methods, which generally do not impose any constraint on classifiers: after the training is completed, the post-hoc method kicks in. Second, we describe how to interpret the predictions of the proposed neural model to distill logic explanations for individual observations and for a whole target class (Sec. 3.3). We demonstrate how the proposed approach provides high-quality explanations according to six quantitative metrics, while matching black-box models and outperforming state-of-the-art white-box models (Sec. 4) in terms of classification accuracy on four case studies (Sec. 5). Finally, we share an implementation of the entropy layer, with extensive documentation and all the experiments, in the public repository: https://github.com/pietrobarbiero/entropy-lens.

Figure 1: The proposed pipeline on one example from the CUB dataset. The neural network f : C → Y maps concepts onto target classes and provides concise logic explanations (yellow; arguments of predicates are dropped for simplicity) of its own decision process. When the input data is non-interpretable (as pixel intensities), a classifier g : X → C maps inputs to concepts.

2 Background

Classification is the problem of identifying a set of categories an observation belongs to. We indicate with Y ⊆ {0, 1}^r the space of binary-encoded targets in a problem with r categories. Concept-based classifiers f are a family of machine learning models predicting class memberships from the activation scores of k human-understandable categories, f : C → Y, where C ⊆ [0, 1]^k (see Fig. 1). Concept-based classifiers improve human understanding as their input and output spaces consist of interpretable symbols. When observations are represented in terms of non-interpretable input features belonging to X ⊆ ℝ^d (such as pixel intensities), a concept decoder g is used to map the input into a concept-based space, g : X → C (see Fig. 1). Otherwise, the input features are simply rescaled from the unbounded space ℝ^d into the unit interval [0, 1]^k, such that they can be treated as logic predicates. In the recent literature, the most similar method to the proposed approach is the ψ network proposed by Ciravegna et al. (2020a,b), an end-to-end differentiable concept-based classifier explaining its own decision process.
The ψ network leverages an intermediate symbolic layer whose output belongs to C to distill First-Order Logic formulas representing the learned map from C to Y. The model consists of a sequence of fully connected layers with sigmoid activations only. An L1 regularization and a strong pruning strategy are applied to each layer of weights in order to allow the computation of logic formulas representing the activation of each node. Such constraints, however, limit the learning capacity of the network and impair the classification accuracy, making standard white-box models, such as decision trees, more attractive.

3 Entropy-Based Logic Explanations of Neural Networks

The key contribution of this paper is a novel linear layer enabling entropy-based logic explanations of neural networks (see Fig. 2 and Fig. 3). The layer input belongs to the concept space C, and the outcomes of the layer computations are: (i) the embeddings h^i (as in any linear layer), and (ii) a truth table T^i explaining how the network leveraged concepts to make predictions for the i-th target class. Each class of the problem requires an independent entropy-based layer, as emphasized by the superscript i.

Figure 2: For each class i, the network leverages one head of the entropy-based linear layer (green) as first layer, and it provides: the class membership predictions f^i and the truth table T^i (Eq. 6) to distill FOL explanations (yellow, top).

For ease of reading and without loss of generality, all the following descriptions concern inference for a single observation (corresponding to the concept tuple c ∈ C) and a neural network f^i predicting the class memberships for the i-th class of the problem. For multi-class problems, multiple heads of this layer are instantiated, with one head per target class (see Sec. 5), and the hidden layers of the class-specific networks could eventually be shared.

3.1 Entropy-Based Linear Layer

When humans compare a set of hypotheses outlining the same outcomes, they tend to have an implicit bias towards the simplest ones, as outlined in philosophy (Soklakov 2002; Rathmanner and Hutter 2011), psychology (Miller 1956; Cowan 2001), and decision making (Simon 1956, 1957, 1979). The proposed entropy-based approach encodes this inductive bias in an end-to-end differentiable model. The purpose of the entropy-based linear layer is to encourage the neural model to pick a limited subset of input concepts, allowing it to provide concise explanations of its predictions.

Figure 3: A detailed view on one head of the entropy-based linear layer for the 1st class, emphasizing the role of the k-th input concept as an example: (i) the scalar γ^1_k (Eq. 1) is computed from the set of weights connecting the k-th input concept to the output neurons of the entropy-based layer; (ii) the relative importance of each concept is summarized by the categorical distribution α^1 (Eq. 2); (iii) the rescaled relevance scores α̃^1 drop irrelevant input concepts out (Eq. 3); (iv) the hidden states h^1 (Eq. 4) and the Boolean-like concepts ĉ^1 (Eq. 5) are provided as outputs of the entropy-based layer.

The learnable parameters of the layer are the usual weight matrix W and bias vector b.
In the following, the forward pass is described by the operations going from Eq. 1 to Eq. 4, while the generation of the truth tables from which explanations are extracted is formalized by Eq. 5 and Eq. 6.

The relevance of each input concept can be summarized, in a first approximation, by a measure that depends on the values of the weights connecting such concept to the upper network. In the case of the network f^i (i.e. predicting the i-th class) and of the j-th input concept, we indicate with W^i_j the vector of weights departing from the j-th input (see Fig. 3), and we introduce

    \gamma^i_j = \|W^i_j\|_1.    (1)

The higher γ^i_j, the higher the relevance of the concept j for the network f^i. In the limit case (γ^i_j → 0) the model f^i drops the j-th concept out. To select only a few relevant concepts for each target class, concepts are set up to compete against each other. To this aim, the relative importance of each concept to the i-th class is summarized in the categorical distribution α^i, composed of coefficients α^i_j ∈ [0, 1] (with Σ_j α^i_j = 1), modeled by the softmax function:

    \alpha^i_j = \frac{e^{\gamma^i_j / \tau}}{\sum_{l=1}^{k} e^{\gamma^i_l / \tau}}    (2)

where τ ∈ ℝ⁺ is a user-defined temperature parameter to tune the softmax function. For a given set of γ^i_j, when using high temperature values (τ → ∞) all concepts have nearly the same relevance. For low temperature values (τ → 0), the probability of the most relevant concept tends to α^i_j → 1, while it becomes α^i_k → 0, for all other concepts k ≠ j. For further details on the impact of τ on the model predictions and explanations, see the Appendix.

As the probability distribution α^i highlights the most relevant concepts, this information is directly fed back to the input, weighting concepts by the estimated importance. To avoid numerical cancellation due to values in α^i close to zero, especially when the input dimensionality is large, we replace α^i with its normalized instance α̃^i, still in [0, 1]^k, and each input sample c ∈ C is modulated by this estimated importance,

    \tilde{c}^i = c \odot \tilde{\alpha}^i \quad \text{with} \quad \tilde{\alpha}^i_j = \frac{\alpha^i_j}{\max_u \alpha^i_u},    (3)

where ⊙ denotes the Hadamard (element-wise) product. The highest value in α̃^i is always 1 (i.e. max_j α̃^i_j = 1) and it corresponds to the most relevant concept. The embeddings h^i are computed as in any linear layer by means of the affine transformation:

    h^i = W^i \tilde{c}^i + b^i.    (4)

Whenever α̃^i_j → 0, the input c̃^i_j → 0. This means that the corresponding concept tends to be dropped out and the network f^i will learn to predict the i-th class without relying on the j-th concept.

In order to get logic explanations, the proposed linear layer generates the truth table T^i, formally representing the behaviour of the neural network in terms of Boolean-like representations of the input concepts. In detail, we indicate with c̄ the Boolean interpretation of the input tuple c ∈ C, while µ^i ∈ {0, 1}^k is the binary mask associated with α̃^i. To encode the inductive human bias towards simple explanations (Miller 1956; Cowan 2001; Ma, Husain, and Bays 2014), the mask µ^i is used to generate the binary concept tuple ĉ^i, dropping the least relevant concepts out of c̄,

    \hat{c}^i = \xi(\bar{c}, \mu^i) \quad \text{with} \quad \mu^i = \mathbb{I}_{\tilde{\alpha}^i \geq \epsilon} \ \text{and} \ \bar{c} = \mathbb{I}_{c \geq \epsilon},    (5)

where \mathbb{I}_{z \geq \epsilon} denotes the indicator function that is 1 for all the components of vector z being ≥ ε and 0 otherwise (considering the unbiased case, we set ε = 0.5). The function ξ returns the vector with the components of c̄ that correspond to 1's in µ^i (i.e. it sub-selects the data in c̄). As a result, ĉ^i belongs to a space Ĉ^i of m_i Boolean features, with m_i < k due to the effects of the sub-selection procedure.
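To make the chain from Eq. 1 to Eq. 5 concrete, the following is a minimal PyTorch sketch of a single head of the entropy-based layer. It is a simplified reading of the equations above, not the authors' released code: the module name, the weight initialization, the default temperature, and the way the Boolean tuple is returned are illustrative choices (the official implementation is available in the entropy-lens repository cited in Sec. 1).

```python
import torch
from torch import nn


class EntropyLinear(nn.Module):
    """Illustrative single head of the entropy-based layer (Eqs. 1-5)."""

    def __init__(self, in_concepts: int, out_features: int, temperature: float = 0.6):
        super().__init__()
        # W^i and b^i of Eq. 4; the initialization scale is an arbitrary choice.
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_concepts))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.temperature = temperature  # tau in Eq. 2

    def forward(self, c: torch.Tensor, eps: float = 0.5):
        # Eq. 1: relevance gamma_j as the L1 norm of the weights leaving the j-th concept.
        gamma = self.weight.norm(p=1, dim=0)                  # shape: (in_concepts,)
        # Eq. 2: temperature-scaled softmax -> categorical distribution alpha over concepts.
        alpha = torch.softmax(gamma / self.temperature, dim=0)
        # Eq. 3: rescale so that the most relevant concept gets weight 1, then mask the input.
        alpha_tilde = alpha / alpha.max()
        c_tilde = c * alpha_tilde                             # Hadamard product
        # Eq. 4: standard affine transformation on the masked concepts.
        h = c_tilde @ self.weight.t() + self.bias
        # Eq. 5: Boolean-like views used to build the truth table T^i.
        mu = alpha_tilde >= eps                               # binary concept mask mu^i
        c_hat = (c >= eps)[:, mu]                             # sub-select the relevant concepts
        return h, alpha, c_hat
```

Because γ^i, α^i, and the Hadamard masking are all differentiable, the concept-selection mechanism is trained jointly with the rest of the network by standard backpropagation.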
The truth table T^i is a particular way of representing the behaviour of the network f^i based on the outcomes of processing multiple input samples collected in a generic dataset C. As the truth table involves Boolean data, we denote with Ĉ^i the set of the Boolean-like representations of the samples in C, computed by ξ (Eq. 5). We also introduce f̄^i(c) as the Boolean-like representation of the network output, f̄^i(c) = \mathbb{I}_{f^i(c) \geq \epsilon}. The truth table T^i is obtained by stacking the data of Ĉ^i into a 2D matrix Ĉ^i (row-wise) and concatenating the result with the column vector f̄^i whose elements are f̄^i(c), c ∈ C, which we summarize as

    T^i = \left( \hat{C}^i \,\middle|\, \bar{f}^i \right).    (6)

To be precise, any T^i is more an empirical truth table than a classic one corresponding to an n-ary Boolean function; indeed, T^i can have repeated rows and missing Boolean tuple entries. However, T^i can be used to generate logic explanations in the same way, as we will explain in Sec. 3.3.

3.2 Loss Function

The entropy of the probability distribution α^i (Eq. 2),

    H(\alpha^i) = -\sum_{j=1}^{k} \alpha^i_j \log \alpha^i_j,    (7)

is minimized when a single α^i_j is one, thus representing the extreme case in which only one concept matters, while it is maximum when all concepts are equally important. When H is jointly minimized with the usual loss function for supervised learning L(f, y) (y being the target labels; we used the cross-entropy in our experiments), it allows the model to find a trade-off between fitting quality and a parsimonious activation of the concepts, allowing each network f^i to predict the i-th class memberships using only a few relevant concepts. Overall, the loss function to train the network f is defined as

    L(f, y, \alpha^1, \dots, \alpha^r) = L(f, y) + \lambda \sum_{i=1}^{r} H(\alpha^i),    (8)

where λ > 0 is the hyperparameter used to balance the relative importance of low-entropy solutions in the loss function. Higher values of λ lead to sparser configurations of α^i, constraining the network to focus on a smaller set of concepts for each classification task (and vice versa), thus encoding the inductive human bias towards simple explanations (Miller 1956; Cowan 2001; Ma, Husain, and Bays 2014). For further details on the impact of λ on the model predictions and explanations, see the Appendix.

It may be pointed out that a similar regularization effect could be achieved by simply minimizing the L1 norm over γ^i. However, as we observed in the Appendix, the L1 loss does not sufficiently penalize the concept scores of those features which are uncorrelated with the predicted category. The entropy loss, instead, correctly shrinks to zero the concept scores associated with uncorrelated features, while the others remain close to one.
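As a small illustration of how Eq. 7 and Eq. 8 fit together, the sketch below pairs the supervised term with the entropy regularizer; the helper names and the default value of λ are placeholders chosen for this example, not values prescribed by the paper.

```python
import torch


def concept_entropy(alpha: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy H(alpha^i) of one categorical distribution over concepts (Eq. 7)."""
    return -(alpha * (alpha + eps).log()).sum()


def total_loss(y_pred, y_true, alphas, lam: float = 1e-4):
    """Overall objective of Eq. 8: supervised loss plus entropy regularization.

    `alphas` collects the distributions alpha^1, ..., alpha^r (one per class head);
    `lam` plays the role of lambda, and its default here is an arbitrary placeholder.
    """
    supervised = torch.nn.functional.cross_entropy(y_pred, y_true)
    regularization = sum(concept_entropy(a) for a in alphas)
    return supervised + lam * regularization
```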
3.3 First-Order Logic Explanations

Any Boolean function can be converted into a logic formula in Disjunctive Normal Form (DNF) by means of its truth table (Mendelson 2009). Converting a truth table into a DNF formula provides an effective mechanism to extract logic rules of increasing complexity, from individual observations to a whole class of samples. The following rule-extraction mechanism is applied to any empirical truth table T^i, for each task i.

FOL extraction. Each row of the truth table T^i can be partitioned into two parts: a tuple of binary concept activations, q̂ ∈ Ĉ^i, and the outcome f̄^i(q̂) ∈ {0, 1}. An example-level logic formula, consisting of a single minterm, can be trivially extracted from each row for which f̄^i(q̂) = 1, by simply connecting with the logic AND (∧) the true concepts and the negated instances of the false ones. The logic formula becomes human understandable whenever the concepts appearing in such a formula are replaced with human-interpretable strings that represent their names (a similar consideration holds for f̄^i, in what follows). For example, the following logic formula ϕ^i_t,

    \phi^i_t = c_1 \wedge \neg c_2 \wedge \dots \wedge c_{m_i},    (9)

is the formula extracted from the t-th row of the table where, in the considered example, only the second concept is false, c_z being the name of the z-th concept. Example-level formulas can be aggregated with the logic OR (∨) to provide a class-level formula,

    \bigvee_{t \in S_i} \phi^i_t,    (10)

S_i being the set of row indices of T^i for which f̄^i(q̂) = 1, i.e. the support of f̄^i. We define with φ^i(ĉ) the function that holds true whenever Eq. 10, evaluated on a given Boolean tuple ĉ, is true. Due to the aforementioned definition of support, we get the following class-level First-Order Logic (FOL) explanation for all the concept tuples,

    \forall \hat{c} \in \hat{C}^i : \varphi^i(\hat{c}) \leftrightarrow \bar{f}^i(\hat{c}).    (11)

We note that in the case of non-concept-like input features, we may still derive the FOL formula through the concept decoder function g (see Sec. 2),

    \forall x \in X : \varphi^i\big(\xi(g(x), \mu^i)\big) \leftrightarrow \bar{f}^i\big(\xi(g(x), \mu^i)\big).    (12)

An example of the above scheme, for both example- and class-level explanations, is depicted in the top-right of Fig. 2.

Remarks. The aggregation of many example-level explanations may increase the length and the complexity of the FOL formula extracted for a whole class. However, existing techniques such as the Quine-McCluskey algorithm can be used to get compact and simplified equivalent FOL expressions (McColl 1878; Quine 1952; McCluskey 1956). For instance, the explanation (person ∧ nose) ∨ (¬person ∧ nose) can be formally simplified into nose. Moreover, the Boolean interpretation of concept tuples may generate colliding representations for different samples. For instance, the Boolean representation of the two samples {(0.1, 0.7), (0.2, 0.9)} is the tuple c̄ = (0, 1) for both of them. This means that their example-level explanations match as well. However, a concept can eventually be split into multiple finer-grain concepts to avoid collisions. Finally, we mention that the number of samples for which an example-level formula holds (i.e. the support of the formula) is used as a measure of the explanation importance. In practice, example-level formulas are ranked by support and iteratively aggregated to extract class-level explanations, as long as the aggregation improves the accuracy of the explanation over a validation set.
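The rule-extraction step of Eqs. 9-10 can be illustrated with a short sketch that turns an empirical truth table into a class-level DNF formula. The function names, the string encoding of literals, and the simple support-based truncation are assumptions made for this example; the actual pipeline additionally validates each aggregation step on a validation set and can simplify the result with the Quine-McCluskey algorithm, as discussed above.

```python
import numpy as np
from collections import Counter


def example_level_minterm(bool_concepts, names):
    """Conjunction of Eq. 9: true concepts appear as-is, false ones negated."""
    return " & ".join(n if v else f"~{n}" for v, n in zip(bool_concepts, names))


def class_level_dnf(truth_table, outputs, names, max_minterms=None):
    """Aggregate example-level minterms into the class-level DNF of Eq. 10.

    `truth_table` is the Boolean matrix of concept tuples (samples x relevant concepts),
    `outputs` the Boolean predictions for the same samples; minterms are ranked by
    support, i.e. by how many samples they cover.
    """
    support = Counter(
        example_level_minterm(row, names)
        for row, out in zip(truth_table, outputs) if out
    )
    ranked = [minterm for minterm, _ in support.most_common(max_minterms)]
    return " | ".join(f"({m})" for m in ranked)


# Toy usage with two relevant concepts for the class "odd digit".
table = np.array([[1, 0], [0, 1], [1, 0]], dtype=bool)
preds = np.array([1, 1, 0], dtype=bool)
print(class_level_dnf(table, preds, ["is_one", "is_three"]))
# -> (is_one & ~is_three) | (~is_one & is_three)
```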
4 Related Work

In order to provide explanations for a given black-box model, most methods focus on identifying or scoring the most relevant input features (Simonyan, Vedaldi, and Zisserman 2013; Zeiler and Fergus 2014; Ribeiro, Singh, and Guestrin 2016b,a; Lundberg and Lee 2017; Selvaraju et al. 2017). Feature scores are usually computed sample by sample (i.e. providing local explanations) by analyzing the activation patterns in the hidden layers of neural networks (Simonyan, Vedaldi, and Zisserman 2013; Zeiler and Fergus 2014; Selvaraju et al. 2017) or by following a model-agnostic approach (Ribeiro, Singh, and Guestrin 2016a; Lundberg and Lee 2017). To enhance human understanding of feature-scoring methods, concept-based approaches have been effectively employed for identifying common activation patterns in the last nodes of neural networks corresponding to human categories (Kim et al. 2018; Kazhdan et al. 2020) or for constraining the network to learn such concepts (Chen, Bei, and Rudin 2020; Koh et al. 2020). Either way, feature-scoring methods are not able to explain how neural networks compose features to make predictions (Kindermans et al. 2019; Kim et al. 2018; Alvarez-Melis and Jaakkola 2018), and only a few of these approaches have been efficiently extended to provide explanations for a whole class (i.e. providing global explanations) (Simonyan, Vedaldi, and Zisserman 2013; Ribeiro, Singh, and Guestrin 2016a).

By contrast, a variety of rule-based approaches have been proposed to provide concept-based explanations. Logic rules are used to explain how black boxes predict class memberships for individual samples (Guidotti et al. 2018; Ribeiro, Singh, and Guestrin 2018) or for a whole class (Sato and Tsukimoto 2001; Zilke, Loza Mencía, and Janssen 2016; Ciravegna et al. 2020a,b). Distilling explanations from an existing model, however, is not the only way to achieve explainability. Historically, standard machine-learning models such as Logistic Regression (McKelvey and Zavoina 1975), Generalized Additive Models (Hastie and Tibshirani 1987; Lou, Caruana, and Gehrke 2012; Caruana et al. 2015), Decision Trees (Breiman et al. 1984; Quinlan 1986, 2014) and Decision Lists (Rivest 1987; Letham et al. 2015; Angelino et al. 2018) were devised to be intrinsically interpretable. However, most of them struggle in solving complex classification problems. Logistic Regression, for instance, in its vanilla definition can only recognize linear patterns; e.g. it cannot solve the XOR problem (Minsky and Papert 2017). Further, only Decision Trees and Decision Lists provide explanations in the form of logic rules. Considering decision trees, each path may be seen as a human-comprehensible decision rule when the height of the tree is reasonably contained. Another family of concept-based XAI methods is represented by rule-mining algorithms, which became popular at the end of the last century (Holte 1993; Cohen 1995). Recent research has led to powerful rule-mining approaches such as Bayesian Rule Lists (BRL) (Letham et al. 2015), where a set of rules is pre-mined using the frequent-pattern tree mining algorithm (Han, Pei, and Yin 2000) and then the best rule set is identified with Bayesian statistics.

In this paper, the proposed approach is compared with methods providing logic-based, global explanations. In particular, we selected one representative approach from different families of methods: Decision Trees (white box, https://scikit-learn.org/stable/modules/tree), BRL (rule mining, https://github.com/tmadl/sklearn-expertsys) and ψ networks (explainable neural models, https://github.com/pietrobarbiero/logic explainer networks).

Figure 4: The four case studies show how the proposed entropy-based networks (green) provide concise logic explanations (yellow) of their own decision process in different real-world contexts. When input features are non-interpretable, as pixel intensities, a concept decoder (ResNet10) maps images into concepts. Entropy-based networks then map concepts into target classes.

5 Experiments

The quality of the explanations and the classification performance of the proposed approach are quantitatively assessed and compared to state-of-the-art white-box models. A visual sketch of each classification problem (described in detail in Sec. 5.1) and a selection of the logic formulas found by the proposed approach are reported in Fig. 4.
Six quantitative metrics are defined and used to compare the proposed approach with state-of-the-art methods. Sec. 5.2 summarizes the main findings. Further details concerning the experiments are reported in the Appendix.

5.1 Classification Tasks and Datasets

Four classification problems, ranging from computer vision to medicine, are considered. Computer vision datasets (e.g. CUB) are annotated with low-level concepts (e.g. bird attributes) used to train concept bottleneck pipelines (Koh et al. 2020). In the other datasets, the input data is rescaled into a categorical space (ℝ^k → C) suitable for concept-based networks. Please notice that this preprocessing step is performed for all white-box models considered in the experiments, for a fair comparison. Further descriptions of each dataset and links to all sources are reported in the Appendix.

Table 1: Classification accuracy (%). Left group: the compared white-box models. Right group: two black-box models. We indicate in bold the best model in each group and with a star the best model overall.

            Entropy net    Tree           BRL            ψ net          Neural Network   Random Forest
MIMIC-II    79.05 ± 1.35   77.53 ± 1.45   76.40 ± 1.22   77.19 ± 1.64   77.81 ± 2.45     78.88 ± 2.25
V-Dem       94.51 ± 0.48   85.61 ± 0.57   91.23 ± 0.75   89.77 ± 2.07   94.53 ± 1.17     93.08 ± 0.44
MNIST       99.81 ± 0.02   99.75 ± 0.01   99.80 ± 0.02   99.79 ± 0.03   99.72 ± 0.03     99.96 ± 0.01
CUB         92.95 ± 0.20   81.62 ± 1.17   90.79 ± 0.34   91.92 ± 0.27   93.10 ± 0.51     91.88 ± 0.36

Will we recover from the ICU? (MIMIC-II). The Multiparameter Intelligent Monitoring in Intensive Care II database (MIMIC-II, (Saeed et al. 2011; Goldberger et al. 2000)) is a public-access intensive care unit (ICU) database consisting of 32,536 subjects (with 40,426 ICU admissions) admitted to different ICUs. The task consists in identifying recovering or dying patients after ICU admission. An end-to-end classifier f : C → Y carries out the classification task.

What kind of democracy are we living in? (V-Dem). The Varieties of Democracy dataset (V-Dem, (Pemstein et al. 2018; Coppedge et al. 2021)) contains a collection of indicators of latent regime characteristics over 202 countries from 1789 to 2020. The database includes k1 = 483 low-level indicators and k2 = 82 mid-level indices. The task consists in identifying electoral democracies from non-electoral ones. We indicate with C1, C2 the spaces associated to the activations of the two levels of concepts. Classifiers f1 and f2 are trained to learn the map C1 → C2 → Y. Explanations are given for classifier f2 in terms of concepts c2 ∈ C2.

What does parity mean? (MNIST Even/Odd). The Modified National Institute of Standards and Technology database (MNIST, (LeCun 1998)) contains a large collection of images representing handwritten digits. The task we consider here is slightly different from the common digit classification. Assuming Y ⊆ {0, 1}^2, we are interested in determining whether a digit is odd or even, and in explaining the assignment to one of these classes in terms of the digit labels (concepts in C). The mapping X → C is provided by a ResNet10 classifier g (He et al. 2016) trained from scratch, while the classifier f learns both the final mapping and the explanation as a function C → Y.

What kind of bird is that? (CUB). The Caltech-UCSD Birds-200-2011 dataset (CUB, (Wah et al. 2011)) is a fine-grained classification dataset. It includes 11,788 images representing r = 200 (Y = {0, 1}^200) different bird species. 312 binary attributes (concepts in C) describe visual characteristics (color, pattern, shape) of particular parts (beak, wings, tail, etc.) for each bird image. The mapping X → C is performed with a ResNet10 model g trained from scratch, while the classifier f learns the final function C → Y.
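For the vision tasks (MNIST Even/Odd and CUB), the overall pipeline composes a concept decoder g with one class-specific entropy-based head, as described in Sec. 2 and Sec. 3. The sketch below shows one way to wire these pieces together; it reuses the illustrative EntropyLinear module sketched after Eq. 5, swaps in a torchvision ResNet-18 as a stand-in for the ResNet10 trained from scratch, and collapses each head to a single logit, whereas the actual architecture stacks further layers on top of h^i.

```python
import torch
from torch import nn
from torchvision.models import resnet18  # stand-in backbone for the ResNet10 trained from scratch


class ConceptBottleneckPipeline(nn.Module):
    """g : X -> C maps images to k concept scores; one entropy-based head per class maps C -> Y."""

    def __init__(self, n_concepts: int, n_classes: int):
        super().__init__()
        backbone = resnet18()
        backbone.fc = nn.Linear(backbone.fc.in_features, n_concepts)
        self.g = nn.Sequential(backbone, nn.Sigmoid())  # concept decoder, outputs in [0, 1]^k
        # One head per target class; EntropyLinear is the illustrative module from Sec. 3.1.
        # Each head is collapsed here to a single logit for brevity.
        self.heads = nn.ModuleList(EntropyLinear(n_concepts, out_features=1) for _ in range(n_classes))

    def forward(self, x: torch.Tensor):
        c = self.g(x)                          # concept activations in C
        logits, alphas = [], []
        for head in self.heads:
            h, alpha, _ = head(c)              # class-specific entropy-based head
            logits.append(h)
            alphas.append(alpha)
        return torch.cat(logits, dim=1), c, alphas
```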
Quantitative metrics. Measuring the classification quality is of crucial importance for models that are going to be applied in real-world environments. On the other hand, assessing the quality of the explanations is required for a safe deployment. In contrast with other kinds of explanations, logic-based formulas can be evaluated quantitatively. Given a classification problem, a set of rules is first extracted for each target category from each considered model. Each explanation is then tested on an unseen set of test samples. The results for each metric are reported in terms of mean and standard error, computed over a 5-fold cross-validation (Krzywinski and Altman 2013). For each experiment and for each model (f : C → Y, mapping concepts to target categories), six quantitative metrics are measured. (i) The MODEL ACCURACY measures how well the explainer identifies the target classes on unseen data (see Table 1). (ii) The EXPLANATION ACCURACY measures how well the extracted logic formulas identify the target classes (Fig. 5). This metric is obtained as the average of the F1 scores computed for each class explanation. (iii) The COMPLEXITY OF AN EXPLANATION is computed by standardizing the explanations in DNF and then counting the number of terms of the standardized formula (Fig. 5): the longer the formula, the harder the interpretation for a human being. (iv) The FIDELITY OF AN EXPLANATION measures how well the extracted explanation matches the predictions obtained using the explainer (Table 2). (v) The RULE EXTRACTION TIME measures the time required to obtain an explanation from scratch (see Fig. 6), computed as the sum of the time required to train the model and the time required to extract the formula from a trained explainer. (vi) The CONSISTENCY OF AN EXPLANATION measures the average similarity of the extracted explanations over the 5-fold cross-validation runs (see Table 3), computed by counting how many times the same concepts appear in a logic formula over the different iterations.
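To make two of these metrics concrete, the sketch below computes fidelity and consistency from their definitions; representing a formula by the set of concept names it mentions, and the exact averaging scheme, are assumptions of this example rather than the authors' evaluation code.

```python
import numpy as np


def fidelity(formula_preds: np.ndarray, model_preds: np.ndarray) -> float:
    """Fraction of test samples where the extracted formula agrees with the explainer."""
    return float((formula_preds == model_preds).mean())


def consistency(concepts_per_fold: list) -> float:
    """Average frequency with which each concept reappears across the cross-validation runs.

    `concepts_per_fold` holds, for each of the k folds, the set of concept names
    appearing in the extracted class-level formula.
    """
    all_concepts = set().union(*concepts_per_fold)
    k = len(concepts_per_fold)
    return float(np.mean([
        sum(name in fold for fold in concepts_per_fold) / k
        for name in all_concepts
    ]))
```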
5.2 Results and Discussion

Experiments show how entropy-based networks outperform state-of-the-art white-box models, such as BRL and decision trees, and interpretable neural models, such as ψ networks, on challenging classification tasks (Table 1). Moreover, the entropy-based regularization and the adoption of a concept-based neural network have only a minor effect on the classification accuracy of the explainer when compared to a standard black-box neural network working directly on the input data and to a Random Forest model applied on the concepts. At the same time, the logic explanations provided by entropy-based networks are better than those of ψ networks and almost as accurate as the rules found by decision trees and BRL, while being far more concise, as demonstrated in Fig. 5. More precisely, the logic explanations generated by the proposed approach represent non-dominated solutions (Marler and Arora 2004) when quantitatively measured in terms of complexity and classification error of the explanation.

Table 2: Out-of-distribution fidelity (%).

            Entropy net    ψ net
MIMIC-II    79.11 ± 2.02   51.63 ± 6.67
V-Dem       90.90 ± 1.23   69.67 ± 10.43
MNIST       99.63 ± 0.00   65.68 ± 5.05
CUB         99.86 ± 0.01   77.34 ± 0.52

Figure 5: Non-dominated solutions (Marler and Arora 2004) (dotted black line) in terms of average explanation complexity and average explanation test error. The vertical dotted red line marks the maximum explanation complexity laypeople can handle (i.e. complexity ≤ 9, see (Miller 1956; Cowan 2001; Ma, Husain, and Bays 2014)). Notice how the explanations provided by the entropy-based network are always among the non-dominated solutions.

Furthermore, the time required to train entropy-based networks is only slightly higher than for Decision Trees, but it is lower than for ψ networks and BRL by one to three orders of magnitude (Fig. 6), making the approach feasible for explaining complex tasks as well. The fidelity (Table 2) of the formulas extracted by the entropy-based network is always higher than 90%, with the only exception of MIMIC-II. This means that almost any prediction made using the logic explanation matches the corresponding prediction made by the model, making the proposed approach very close to a white-box model. These results empirically show that our method represents a viable solution for a safe deployment of explainable cutting-edge models.

The reason why the proposed approach consistently outperforms ψ networks across all the key metrics (i.e. classification accuracy, explanation accuracy, and fidelity) can be explained by observing how entropy-based networks are far less constrained than ψ networks, both in the architecture (our approach does not apply weight pruning) and in the loss function (our approach applies a regularization on the distributions α^i and not on all weight matrices). Likewise, the main reason why the proposed approach provides higher classification accuracy than BRL and decision trees may lie in the smoothness of the decision functions of neural networks, which tend to generalize better than rule-based methods, as already observed by Tavares et al. (2020).

Table 3: Consistency (%).

            Entropy net   Tree     BRL      ψ net
MIMIC-II    28.75         40.49    30.48    27.62
V-Dem       46.25         72.00    73.33    38.00
MNIST       100.00        41.67    100.00   96.00
CUB         35.52         21.47    42.86    41.43

Figure 6: Time required to train the models and to extract the explanations. Our model compares favorably with the competitors, with the exception of Decision Trees. BRL is one to three orders of magnitude slower than our approach.

For each dataset, we report in the Appendix a few examples of logic explanations extracted by each method, as well as in Fig. 4. We mention that the proposed approach is the only one matching the ground-truth explanation for the MNIST even/odd experiment, i.e. ∀x: isOdd(x) ↔ isOne(x) ⊕ isThree(x) ⊕ isFive(x) ⊕ isSeven(x) ⊕ isNine(x) and ∀x: isEven(x) ↔ isZero(x) ⊕ isTwo(x) ⊕ isFour(x) ⊕ isSix(x) ⊕ isEight(x), ⊕ being the exclusive OR. In terms of formula consistency, we observe how BRL is the most consistent rule extractor, closely followed by the proposed approach (Table 3).

6 Conclusions

This work contributes to a safer adoption and greater impact of deep learning by making neural models explainable by design, thanks to an entropy-based approach that yields FOL-based explanations. Moreover, as the proposed approach provides logic explanations of how a model arrives at a decision, it can be effectively used to reverse engineer algorithms and processes, to find vulnerabilities, or to improve the design of systems powered by deep learning models. From a scientific perspective, formal knowledge distillation from state-of-the-art networks may enable scientific discoveries or the falsification of existing theories. However, the extraction of a FOL explanation requires symbolic input and output spaces.
In some contexts, such as computer vision, the use of concept-based approaches may require additional annotations and attribute labels to get a consistent symbolic layer of concepts. Recent works on automatic concept extraction may alleviate the related costs, leading to more cost-effective concept annotations (Ghorbani et al. 2019; Kazhdan et al. 2020).

Acknowledgments

This work was partially supported by the TAILOR and GODS21 European Union's Horizon 2020 research and innovation programmes under GA No. 952215 and 848077.

References

Ahmad, M. A.; Eckert, C.; and Teredesai, A. 2018. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 559–560.
Alvarez-Melis, D.; and Jaakkola, T. S. 2018. Towards robust interpretability with self-explaining neural networks. arXiv preprint arXiv:1806.07538.
Angelino, E.; Larus-Stone, N.; Alabi, D.; Seltzer, M.; and Rudin, C. 2018. Learning Certifiably Optimal Rule Lists for Categorical Data. arXiv:1704.01701.
Arrieta, A. B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58: 82–115.
Barbiero, P.; Ciravegna, G.; Georgiev, D.; and Giannini, F. 2021. PyTorch, Explain! A Python library for Logic Explained Networks. arXiv preprint arXiv:2105.11697.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
Breiman, L.; Friedman, J.; Stone, C. J.; and Olshen, R. A. 1984. Classification and Regression Trees. CRC Press.
Caruana, R.; Lou, Y.; Gehrke, J.; Koch, P.; Sturm, M.; and Elhadad, N. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–1730.
Chen, Z.; Bei, Y.; and Rudin, C. 2020. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2(12): 772–782.
Ciravegna, G.; Giannini, F.; Gori, M.; Maggini, M.; and Melacci, S. 2020a. Human-driven FOL explanations of deep learning. In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI-20), 2234–2240. International Joint Conferences on Artificial Intelligence Organization.
Ciravegna, G.; Giannini, F.; Melacci, S.; Maggini, M.; and Gori, M. 2020b. A Constraint-Based Approach to Learning and Explanation. In AAAI, 3658–3665.
Cohen, W. W. 1995. Fast effective rule induction. In Machine Learning Proceedings 1995, 115–123. Elsevier.
Coppedge, M.; Gerring, J.; Knutsen, C. H.; Lindberg, S. I.; Teorell, J.; Altman, D.; Bernhard, M.; Cornell, A.; Fish, M. S.; Gastaldi, L.; et al. 2021. V-Dem Codebook v11.
Cowan, N. 2001. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24(1): 87–114.
Cranmer, M. D.; Xu, R.; Battaglia, P.; and Ho, S. 2019. Learning symbolic physics with graph networks. arXiv preprint arXiv:1909.05862.
Das, A.; and Rad, P. 2020. Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. arXiv, abs/2006.11371.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Doshi-Velez, F.; and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Doshi-Velez, F.; and Kim, B. 2018. Considerations for evaluation and generalization in interpretable machine learning. In Explainable and Interpretable Models in Computer Vision and Machine Learning, 3–17. Springer.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
EUGDPR. 2017. GDPR. General Data Protection Regulation. https://gdpr.eu/. Accessed: 2021-08-20.
Ghorbani, A.; Wexler, J.; Zou, J.; and Kim, B. 2019. Towards automatic concept-based explanations. arXiv preprint arXiv:1902.03129.
Goddard, M. 2017. The EU General Data Protection Regulation (GDPR): European regulation that has a global impact. International Journal of Market Research, 59(6): 703–705.
Goldberger, A. L.; Amaral, L. A.; Glass, L.; Hausdorff, J. M.; Ivanov, P. C.; Mark, R. G.; Mietus, J. E.; Moody, G. B.; Peng, C.-K.; and Stanley, H. E. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23): e215–e220.
Guidotti, R.; Monreale, A.; Ruggieri, S.; Pedreschi, D.; Turini, F.; and Giannotti, F. 2018. Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820.
Han, J.; Pei, J.; and Yin, Y. 2000. Mining frequent patterns without candidate generation. ACM SIGMOD Record, 29(2): 1–12.
Hastie, T.; and Tibshirani, R. 1987. Generalized additive models: some applications. Journal of the American Statistical Association, 82(398): 371–386.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Holte, R. C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1): 63–90.
Kazhdan, D.; Dimanov, B.; Jamnik, M.; Liò, P.; and Weller, A. 2020. Now You See Me (CME): Concept-based Model Extraction. arXiv preprint arXiv:2010.13233.
Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, 2668–2677. PMLR.
Kindermans, P.-J.; Hooker, S.; Adebayo, J.; Alber, M.; Schütt, K. T.; Dähne, S.; Erhan, D.; and Kim, B. 2019. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, 267–280. Springer.
Koh, P. W.; Nguyen, T.; Tang, Y. S.; Mussmann, S.; Pierson, E.; Kim, B.; and Liang, P. 2020. Concept bottleneck models. In International Conference on Machine Learning, 5338–5348. PMLR.
Krzywinski, M.; and Altman, N. 2013. Error bars: the meaning of error bars is often misinterpreted, as is the statistical significance of their overlap. Nature Methods, 10(10): 921–923.
LeCun, Y. 1998. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Letham, B.; Rudin, C.; McCormick, T. H.; Madigan, D.; et al. 2015. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Annals of Applied Statistics, 9(3): 1350–1371.
Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Lou, Y.; Caruana, R.; and Gehrke, J. 2012. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 150–158.
Lundberg, S.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874.
Ma, W. J.; Husain, M.; and Bays, P. M. 2014. Changing concepts of working memory. Nature Neuroscience, 17(3): 347.
Marler, R. T.; and Arora, J. S. 2004. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization, 26(6): 369–395.
McCluskey, E. J. 1956. Minimization of Boolean functions. The Bell System Technical Journal, 35(6): 1417–1444.
McColl, H. 1878. The calculus of equivalent statements (third paper). Proceedings of the London Mathematical Society, 1(1): 16–28.
McKelvey, R. D.; and Zavoina, W. 1975. A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4(1): 103–120.
Mendelson, E. 2009. Introduction to Mathematical Logic. CRC Press.
Miller, G. A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63: 81–97.
Minsky, M.; and Papert, S. A. 2017. Perceptrons: An Introduction to Computational Geometry. MIT Press.
Molnar, C. 2020. Interpretable Machine Learning. Lulu.com.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12: 2825–2830.
Pemstein, D.; Marquardt, K. L.; Tzelgov, E.; Wang, Y.-t.; Krusell, J.; and Miri, F. 2018. The V-Dem measurement model: latent variable analysis for cross-national and cross-temporal expert-coded data. V-Dem Working Paper, 21.
Quine, W. V. 1952. The problem of simplifying truth functions. The American Mathematical Monthly, 59(8): 521–531.
Quinlan, J. R. 1986. Induction of decision trees. Machine Learning, 1(1): 81–106.
Quinlan, J. R. 2014. C4.5: Programs for Machine Learning. Elsevier.
Rathmanner, S.; and Hutter, M. 2011. A philosophical treatise of universal induction. Entropy, 13(6): 1076–1136.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016a. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016b. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2018. Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Rivest, R. L. 1987. Learning decision lists. Machine Learning, 2(3): 229–246.
Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; and Zhong, C. 2021. Interpretable machine learning: Fundamental principles and 10 grand challenges. arXiv preprint arXiv:2103.11251.
Saeed, M.; Villarroel, M.; Reisner, A. T.; Clifford, G.; Lehman, L.-W.; Moody, G.; Heldt, T.; Kyaw, T. H.; Moody, B.; and Mark, R. G. 2011. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine, 39(5): 952.
Sato, M.; and Tsukimoto, H. 2001. Rule extraction from neural networks via decision tree induction. In IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), volume 3, 1870–1875. IEEE.
Schmidt, M.; and Lipson, H. 2009. Distilling free-form natural laws from experimental data. Science, 324(5923): 81–85.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626.
Simon, H. A. 1956. Rational choice and the structure of the environment. Psychological Review, 63(2): 129.
Simon, H. A. 1957. Models of Man; Social and Rational. New York: John Wiley and Sons, Inc.
Simon, H. A. 1979. Rational decision making in business organizations. The American Economic Review, 69(4): 493–513.
Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
Soklakov, A. N. 2002. Occam's razor as a formal basis for a physical theory. Foundations of Physics Letters, 15(2): 107–135.
Su, J.; Vargas, D. V.; and Sakurai, K. 2019. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 23(5): 828–841.
Tavares, A. R.; Avelar, P.; Flach, J. M.; Nicolau, M.; Lamb, L. C.; and Vardi, M. 2020. Understanding Boolean function learnability on deep neural networks. arXiv preprint arXiv:2009.05908.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
Xie, Q.; Luong, M.-T.; Hovy, E.; and Le, Q. V. 2020. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10687–10698.
Ying, R.; Bourgeois, D.; You, J.; Zitnik, M.; and Leskovec, J. 2019. GNNExplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems, 32: 9240.
Zeiler, M. D.; and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818–833. Springer.
Zilke, J. R.; Loza Mencía, E.; and Janssen, F. 2016. DeepRED – Rule Extraction from Deep Neural Networks. In Calders, T.; Ceci, M.; and Malerba, D., eds., Discovery Science, 457–473. Cham: Springer International Publishing. ISBN 978-3-319-46307-0.