# Learning Interpretable Concept Groups in CNNs

Saurabh Varshneya¹, Antoine Ledent¹, Robert A. Vandermeulen², Yunwen Lei³, Matthias Enders⁴, Damian Borth⁵ and Marius Kloft¹

¹Technical University of Kaiserslautern, Germany; ²Technical University of Berlin, Germany; ³University of Birmingham, United Kingdom; ⁴NPZ Innovation GmbH, Germany; ⁵University of St. Gallen, Switzerland

{varshneya, ledent, kloft}@cs.uni-kl.de, vandermeulen@tu-berlin.de, y.lei@bham.ac.uk, m.enders@npz-innovation.de, damian.borth@unisg.ch

Contact author: varshneya@cs.uni-kl.de. Code will be available at: https://github.com/srb-cv/cgl

**Abstract.** We propose a novel training methodology, Concept Group Learning (CGL), that encourages the training of interpretable CNN filters by partitioning the filters in each layer into concept groups, each of which is trained to learn a single visual concept. We achieve this through a novel regularization strategy that forces filters in the same group to be active in similar image regions for a given layer. We additionally use a regularizer to encourage a sparse weighting of the concept groups in each layer, so that a few concept groups can have greater importance than others. We quantitatively evaluate CGL's model interpretability using standard interpretability evaluation techniques and find that our method increases interpretability scores in most cases. Qualitatively, we compare the image regions that are most active under filters learned with CGL versus filters learned without CGL and find that CGL activation regions concentrate more strongly around semantically relevant features.

## 1 Introduction

There is great interest in understanding the hidden representations produced by convolutional neural networks (CNNs). Understanding these representations has significant implications both theoretically (a better understanding of neural networks in general) and practically (trustworthiness in AI). Toward this end, many methods for interpreting hidden representations have been proposed. We delineate two avenues of research: (1) visualizing the regions of images that lead to high activation responses in a filter [Yosinski et al., 2014; Zhang et al., 2017; Mahendran and Vedaldi, 2016] and (2) interpreting the semantic meaning learned by a filter by finding its alignment to human-interpretable visual concepts, such as object, scene, and color [Bau et al., 2017; Zhou et al., 2018; Zhang et al., 2020].

Much existing work focuses on understanding the hidden representations of CNNs, but there is a lack of work that actively influences the learning process to favor interpretable representations. To address this gap we propose a framework for CNNs that not only assesses interpretability but induces it. Our approach is to induce a group structure in CNNs during training, where groups of filters with a useful semantic meaning emerge during the training process. We achieve this through carefully chosen regularization and auxiliary loss functions. In our method, before any training is performed, we partition the collection of filters in each layer into groups, which we term concept groups. During training we promote filters within the same concept group to learn similar features, so that, after training, they form a group of correlated filters that jointly encode an abstract visual concept (e.g., an object, scene, or color). To learn these concept groups, we introduce three novelties to our training objective (illustrated in Figure 1):
1. We propose a new loss function (entitled group activation loss) that encourages the filters within each concept group to have highly overlapping activation regions for a given sample.
2. We propose another loss function (entitled spatial loss), which promotes the activations of each filter to be concentrated around their mean position.
3. Using an $\ell_{2,1}$ block-norm regularizer on the concept groups, we learn a sparse weighting of the concept groups. These weights can be interpreted as the relative importance of each concept group.

Figure 1: An illustration of our framework for learning concept groups (e.g., object: dog; scene: clouds; texture: grid) in a convolutional layer of a CNN.

A key property of our method is that it is completely unsupervised, automatically finding concepts using zero side-information. In practice one would likely use our losses in conjunction with some other task, e.g. classification, to increase the interpretability of that task, but none of the additional task information, e.g. class labels, is used in our losses.

In addition to the training method described above, it has been observed that the interpretability of CNNs decreases when batch-normalization layers are introduced [Zhou et al., 2018]. To remedy this we incorporate a variation of batch-norm that works together with our regularizer and losses to achieve the desired group structure in the hidden representations and enhance interpretability. We perform a quantitative analysis following the experimental setup of [Bau et al., 2017] and find that our method significantly improves interpretability. We also analyze the filters of our model qualitatively by visualizing the filters' activations and find that our training setup yields representations with noticeably more interpretable filters.

## 2 Related Work

We mention here a few existing works that are related to our training method and outline the major differences with existing approaches.

**Group convolutions and structured sparsity:** Group convolutions were first used in Alexnet [Krizhevsky et al., 2017] by distributing filters into groups to train the model over multiple GPUs. It is important to note that our concept group approach is different from group convolutions, where there is no connection among the activation maps of different partitioned groups. In contrast, concept groups' activation maps are concatenated at each layer before being fed to the next layer (see Figure 3). There exist methods that induce structured sparsity in CNNs [Wen et al., 2016; Li et al., 2017]. Such methods show that inducing sparsity among groups of filters can achieve state-of-the-art accuracy on various datasets with reduced computational cost. However, less is known about the effect of such regularization on the interpretability of networks.

**Disentangled representations:** To improve hidden representations during training, a group of methods endeavours to obtain disentangled representations. The aim of such methods coincides with the main focus of our research: they aim to learn better representations where each filter represents a clear and unique visual concept.
[Zhang et al., 2018] introduce a training method which aims to train each filter in the final convolutional layers of a CNN to represent an object part. The activation loss in their method encourages a filter to activate only for training images that belong to the assigned category. Our own techniques also rely on applying a loss to the activation maps of filters to inject priors into the CNN representation. However, there is a major difference in how we apply the losses on activations. In [Zhang et al., 2018] each filter is pushed towards a known concept corresponding to an object part; ultimately, the concepts that this method extracts are parts of the objects in the training classes. In contrast, our method can capture arbitrary concepts (not only ones corresponding to object parts) and does so without any supervision. Furthermore, the method of [Zhang et al., 2018] filters out activations of low magnitude entirely in the forward pass. We argue that such hard constraints on the activation map can hamper the representation power of the network. To retain representation power, our method applies soft constraints on the activations, which does not require any changes in the forward pass.

The rest of this paper is organized as follows: in Section 3, we mathematically define our algorithm and the precise regularization procedure that makes CNNs more interpretable. In Section 4.1, we explain the evaluation techniques of [Bau et al., 2017], which we later use to evaluate the interpretability of the networks trained with our training methods. Finally, in Section 4.3, we compare the results of our experiments with existing state-of-the-art methods.

## 3 Algorithm

Here we present our method, which induces interpretable filters in all layers of a CNN. We briefly outline the method here and describe it precisely in the sequel. Central to our method is an initial grouping of each convolutional layer's filters into $G$ equal-size groups: for a collection of filters in a layer indexed by $1, \dots, F$, we divide the indices into $G$ groups $(1, \dots, f), (f+1, \dots, 2f), \dots, ((G-1)f+1, \dots, Gf)$ with $F = Gf$. During training we aim to have a single group of filters (e.g. $(f+1, \dots, 2f)$) correspond to a single concept. We assign multiple filters to the same concept in the hope of robustly capturing the concept in a way that would not be possible with a single filter. The number of groups is a hyperparameter that is determined before training. To make the method less restrictive, we can leave some free filters outside the grouping. We treat the filters of each concept group as a block matrix and apply a general block norm over the collection of groups so as to encourage structured sparsity [Wen et al., 2016]. The filters belonging to each group are trained to correspond to similar visual concepts. To achieve this we feed each filter's activation map through a sigmoid to determine a (soft) receptive field, which represents the areas of the image where the filter is active, and then apply a loss to induce filters in the same concept group to have receptive fields that largely overlap and are concentrated in a connected region within an image (illustrated in Figure 2). A concrete sketch of the grouping step is given below; we then describe the regularizers precisely in the following subsections.
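As a minimal illustration of this initial grouping step, the sketch below partitions a PyTorch convolutional layer's output channels into contiguous, equal-sized concept groups, leaving any remainder as free filters. The helper name `concept_group_slices`, the layer sizes, and the number of groups are illustrative choices of ours, not values from the paper.

```python
# Minimal sketch: partitioning a conv layer's F filters into G contiguous concept groups.
import torch
import torch.nn as nn

def concept_group_slices(num_filters: int, num_groups: int):
    """Split filter indices 0..F-1 into G equal-sized contiguous groups (F = G*f).
    Remaining filters, if any, are left as ungrouped "free" filters."""
    f = num_filters // num_groups                      # filters per concept group
    groups = [slice(g * f, (g + 1) * f) for g in range(num_groups)]
    free = slice(num_groups * f, num_filters)          # possibly empty
    return groups, free

# Example (illustrative sizes): 256 filters split into 8 concept groups of 32 filters each.
conv = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, padding=1)
groups, free = concept_group_slices(conv.out_channels, num_groups=8)

x = torch.randn(4, 128, 56, 56)                        # a dummy minibatch
pre_activations = conv(x)                              # (N, 256, H, W) pre-activation maps
group_maps = [pre_activations[:, s] for s in groups]   # one (N, f, H, W) block per group
```

The per-group blocks `group_maps` are what the losses defined in the following subsections operate on.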
### 3.1 Group Activation Loss $L_g$

Here we describe our loss function that encourages filters in the same concept group to learn similar concepts. To do this, our loss function, which we denote by $L_g$, will encourage filters within the same concept group to be active in roughly the same area for each input image. To determine whether a filter is active in a region of the image, we first feed the activation maps before applying any non-linearity (we call these the pre-activation maps) through a linear scaling (which is also learned and shared between layers) followed by a sigmoid. We interpret the output of the sigmoid as the probability of a filter being active at a location.

Figure 2: The training setup for our auxiliary loss functions. At every layer we obtain a soft receptive field by feeding a suitably standardized version of the activation maps through a sigmoid. We apply constraints on the soft receptive fields through our losses.

Let us denote the pre-activation map of the $i$th filter in group $g$ in layer $l$ by $a^l_{ig} \in \mathbb{R}^{A_l}$, where $A_l$ denotes the size of the pre-activation maps in layer $l$. A filter and its activation maps are typically tensors, but we consider a flattened version of them for notational simplicity. From the pre-activation map we use a sigmoid to obtain a soft receptive field $\psi^l_{ig} \in (0, 1)^{A_l}$, defined by

$$\psi^l_{ig} = \sigma\!\left(P_1 \, \frac{a^l_{ig}}{S^l_{ig}} + P_2\right), \qquad (1)$$

where $S^l_{ig}$ is the standard deviation of the $i$th activation map of group $g$, computed over the minibatch. Note that $P_1$ and $P_2$ are a pair of scaling parameters which can be learned during training; these parameters are shared across all activation maps in the network.

We now define the distance between two soft receptive fields $\psi_1$ and $\psi_2$ as

$$D(\psi_1, \psi_2) = \frac{2\,\|\psi_1 - \psi_2\|_1}{\|\psi_1\|_1 + \|\psi_2\|_1 + \|\psi_1 - \psi_2\|_1}, \qquad (2)$$

where $\|\cdot\|_1$ denotes the $L_1$ norm, $\|x\|_1 = \sum_i |x_i|$. The motivation for this distance is to create a soft equivalent of one minus the IoU (intersection over union), a score frequently used in computer vision to assess the performance of tracking or localization algorithms [Tychsen-Smith and Petersson, 2018; Huang et al., 2019]. To see the equivalence, suppose that $\psi_1$ and $\psi_2$ are the indicator functions of two sets $A$ and $B$. Then we have

$$D(\psi_1, \psi_2) = \frac{2\,\|\psi_1 - \psi_2\|_1}{\|\psi_1\|_1 + \|\psi_2\|_1 + \|\psi_1 - \psi_2\|_1} = \frac{2\,\mu(A \,\triangle\, B)}{\mu(A) + \mu(B) + \mu(A \,\triangle\, B)} = \frac{2\,\mu(A \,\triangle\, B)}{2\,\mu(A \cup B)} = 1 - \mathrm{IoU}, \qquad (3)$$

where we use the standard notation $A \,\triangle\, B := (A \setminus B) \cup (B \setminus A) = (A \cup B) \setminus (A \cap B)$ for the symmetric difference between $A$ and $B$, and $\mu(\cdot)$ denotes the Lebesgue measure, or the cardinality in the case of finite sets. Here we used that, for indicator functions, $\|\psi_1 - \psi_2\|_1 = \mu(A \,\triangle\, B)$, and that $\mu(A) + \mu(B) + \mu(A \,\triangle\, B) = 2\,\mu(A \cup B)$.
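The following sketch shows how the soft receptive field of Eq. (1) and the soft 1 − IoU distance of Eq. (2) could be computed in PyTorch. The function names, the small `eps` added for numerical stability, and the use of `Tensor.std` for the per-filter standard deviation are our assumptions rather than details from the paper.

```python
# Hedged sketch of the soft receptive field (Eq. 1) and the soft 1 - IoU distance (Eq. 2).
import torch

def soft_receptive_field(pre_act: torch.Tensor, p1: torch.Tensor, p2: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """pre_act: (N, f, H, W) pre-activation maps of one concept group.
    Divides each map by its standard deviation over the minibatch (S^l_ig), applies
    the shared learned scaling (P1, P2), and squashes through a sigmoid."""
    std = pre_act.std(dim=(0, 2, 3), keepdim=True)          # one S^l_ig per filter
    return torch.sigmoid(p1 * pre_act / (std + eps) + p2)   # psi in (0, 1)

def soft_iou_distance(psi_a: torch.Tensor, psi_b: torch.Tensor) -> torch.Tensor:
    """D(psi_a, psi_b) = 2*||a - b||_1 / (||a||_1 + ||b||_1 + ||a - b||_1),
    with L1 norms taken over the minibatch and spatial dimensions."""
    diff = (psi_a - psi_b).abs().sum()
    return 2.0 * diff / (psi_a.sum() + psi_b.sum() + diff + 1e-12)
```

In a full implementation, `p1` and `p2` would be registered as learnable parameters shared across the whole network, as described above.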
The definition of $\psi$ must be altered slightly to work with batch normalization; we describe this in the following.

#### Definition of $\psi$ with Batch Normalization

If the network is trained with batch-norm, the pre-activations are standardized to have mean 0 and variance 1. The input to the next layer is then $\gamma^l_{ig}\, a^l_{ig} + \beta^l_{ig}$, where $\beta^l_{ig}$ and $\gamma^l_{ig}$ are the learned parameters of a batch-norm layer over the $i$th activation map of group $g$, as in standard batch normalization [Ioffe and Szegedy, 2015]. To make the batch-normalized values compatible with our method, we first define a variable $\tau^l_{ig}$ by

$$\tau^l_{ig} = a^l_{ig} + \frac{\beta^l_{ig}}{\gamma^l_{ig}},$$

which can be thought of as undoing batch normalization for the purposes of finding a soft receptive field, since batch normalization is known to reduce interpretability [Bau et al., 2017]. We can now feed $\tau^l_{ig}$ through a sigmoid to obtain our soft receptive field as follows:

$$\psi^l_{ig} = \sigma\!\left(P_1\, \tau^l_{ig} + P_2\right). \qquad (4)$$

#### Formula for $L_g$

We can now define our group activation loss $L_g$ as follows (note that the second term is new and is described in the next paragraph):

$$L_g = \frac{\sum_{l,g} \sum_{(i,j) \in R} 2\,\|\psi^l_{jg} - \psi^l_{ig}\|_1}{\sum_{l,g} \sum_{(i,j) \in R} \left(\|\psi^l_{jg}\|_1 + \|\psi^l_{ig}\|_1 + \|\psi^l_{jg} - \psi^l_{ig}\|_1\right)} \;+\; \lambda_2\, \frac{\sum_{l,g} \sum_{(i,j) \in R} 2\,\|\psi^l_{jg} - \psi^{l+1}_{ig}\|_1}{\sum_{l,g} \sum_{(i,j) \in R} \left(\|\psi^l_{jg}\|_1 + \|\psi^{l+1}_{ig}\|_1 + \|\psi^l_{jg} - \psi^{l+1}_{ig}\|_1\right)}, \qquad (5)$$

where $R$ is a random subset of cardinality $r$ of $\{1, \dots, N^l_g\} \times \{1, \dots, N^l_g\}$ that changes at each iteration of the gradient descent procedure, and $N^l_g$ denotes the number of activation maps in group $g$ in layer $l$. We use this randomization over $R$ as an approximation to comparing all pairs of filters in each concept group, which is computationally expensive. Here we write $\psi^l_{jg}$ for the vector composed of the components $\psi^l_{jg}(x_i)$ for samples $x_i$ in a minibatch; all $L_1$ norms are computed over the minibatch as well as the spatial dimension. The value of $r$ can be set as a hyperparameter; we find that setting $r = 3 N^l_g$ works well in practice.

Each term in the summation that appears in $L_g$ is a soft version of one minus the intersection-over-union similarity between the receptive fields, as given by $D(\psi^l_{ig}, \psi^l_{jg})$. As some concepts typically change gradually or die out in higher layers, we additionally consider a sliding comparison, the second term in (5), where the filters in group $g$ of layer $l$ are required to be similar to those in group $g$ of layer $l+1$ but are not directly compared to those in group $g$ in layers beyond $l+1$. We found that enforcing this similarity of concepts throughout layers is too strong a requirement and simply causes the filters to go to zero. Because of this we use $\lambda_2 = 0$ for all layers in our experiments.

### 3.2 Spatial Loss $L_s$

In this section we introduce a loss that aims to regularize each filter to activate on a single, connected area of each image. Suppose for instance that an image contains a car and a house, both of which are relevant concepts for the task to which the CNN is being applied. Since both concepts are abstract, they are likely to be represented near the final convolutional layers of a CNN, where the receptive fields are large. We would like to keep these concepts separated between concept groups and prevent them from incorporating large portions of the image corresponding to multiple abstract concepts. To achieve this we encourage each filter to be active only on a small connected region of an image by penalizing activation responses that are far from the center of activation. The center of activation is computed over the soft receptive field $\psi$ for each activation map. We define the spatial loss $L_s(\psi)$ as

$$L_s(\psi) = \frac{\sum_j \psi_j\, \|j - c(\psi)\|^2}{\sum_j \psi_j}, \qquad \text{where} \quad c(\psi) = \frac{\sum_j j\, \psi_j}{\sum_j \psi_j}$$

and $j$ denotes the (spatial) index of an activation of the soft receptive field $\psi$ of a filter. Here $c$ represents the mean position of activation of a soft receptive field, and $L_s$ is analogous to the corresponding variance. The loss is computed over all the activation maps of a filter obtained in a minibatch.
Figure 3: The general block norm induces a group structure in the filters of a layer, where a few groups are learned dominantly over the other groups. $\beta_1$, $\beta_2$ and $\beta_3$ are the relevance factors of each group, which are learned while training the CNN.

### 3.3 General Block Norm Regularizer $R_{bn}$

In this section we define our general block norm regularizer $R_{bn}$, which replaces the conventional weight decay regularizer. This changes the inductive bias of our training algorithm so that it induces a group-sparse structure over the concept groups. The regularizer encourages some concept groups to be of greater importance than others, thereby giving a notion of importance to concept groups and encouraging the concept groups to be parsimonious. This allows us to avoid situations where, for example, multiple concept groups capture the concept of car. Such sparse models also yield reduced memory requirements and improved training speed as side benefits [Mocanu et al., 2018]. An illustration of the result of using this regularization is shown in Figure 3. We denote by $w_{gl}$ the matrix containing all the weights of the filters of group $g$ in layer $l$. The regularizer $R_{bn}$ can then be written as

$$R_{bn} = \sum_{l=1}^{L} \sum_{g=1}^{G} \|w_{gl}\|_{Fr}, \qquad (6)$$

where $L$ denotes the number of convolutional layers in the network, $G$ the number of concept groups per convolutional layer, and $\|\cdot\|_{Fr}$ denotes the Frobenius norm, i.e., $\|M\|^2_{Fr} = \sum_{i,j} M^2_{i,j}$ for a matrix $M$.

An advantage of using the general block norm regularizer $R_{bn}$ is that we can obtain a vector of relevance factors $\beta$ for our concept groups, as shown in Figure 3. These relevance scores can be obtained by summing the norms as follows:

$$\beta_{gl} = \|w_{gl}\|_{Fr} \qquad \text{and} \qquad \beta_l = \sum_{g=1}^{G} \beta_{gl}, \qquad (7)$$

where $\beta_{gl}$ denotes a group's relevance in layer $l$, and $\beta_l$ denotes the relevance of layer $l$ in the network.

**Complete framework:** In summary, our overall framework is obtained by combining the regularization strategy above with the two loss functions. For a simplified view of the combined method, consider a CNN with weight parameters $w$ and denote the soft receptive fields obtained from the activation maps by $\psi$. The final optimization target for an interpretable CNN can be defined as

$$L = L_d(w) + \lambda_{bn}\, R_{bn}(w) + \lambda_g\, L_g(\psi) + \lambda_s\, L_s(\psi), \qquad (8)$$

where $L_d$ denotes the main loss for the task at hand (e.g. cross entropy for classification), $R_{bn}$ is the general block norm regularization, and $L_g$ and $L_s$ are the group activation loss and spatial loss respectively. $\lambda_{bn}$, $\lambda_g$ and $\lambda_s$ weight the regularizer and auxiliary losses relative to the main loss.
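As a minimal sketch of Eqs. (6)–(8), the snippet below accumulates the block-norm term and per-group relevance factors for one convolutional layer and combines all loss terms into the overall objective. The function names and the $\lambda$ values shown are placeholders of ours, not the settings used in the paper.

```python
# Hedged sketch of R_bn (Eq. 6), the relevance factors beta_gl (Eq. 7), and Eq. (8).
import torch
import torch.nn as nn

def block_norm_and_relevance(conv: nn.Conv2d, group_slices):
    """Returns this layer's contribution to R_bn and its relevance factors
    beta_gl = ||w_gl||_Fr (Frobenius norm of each concept group's filter block)."""
    betas = torch.stack([conv.weight[s].norm(p='fro') for s in group_slices])
    return betas.sum(), betas            # sum over groups = this layer's term of Eq. (6)

def total_loss(task_loss, r_bn, l_g, l_s,
               lambda_bn=1e-4, lambda_g=0.1, lambda_s=0.01):
    """Eq. (8): L = L_d + lambda_bn * R_bn + lambda_g * L_g + lambda_s * L_s."""
    return task_loss + lambda_bn * r_bn + lambda_g * l_g + lambda_s * l_s
```

Summing `block_norm_and_relevance` over all convolutional layers gives $R_{bn}$, and the returned `betas` give the per-group relevances visualized in Figures 3 and 4.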
## 4 Experiments

In this section we explain the evaluation strategy we adopt for the quantitative and qualitative analysis of the interpretability of CNNs, and describe the results of our experiments on synthetic and real-world datasets.

### 4.1 Interpretability: Quantitative Assessment

In order to quantify the interpretability of a trained CNN, we exactly replicate the evaluation on the Broden dataset proposed by [Bau et al., 2017]. The Broden dataset contains pixel-wise annotations for a broad range of categories, each of which belongs to one of six human-interpretable visual concepts. We compute the alignment of a filter with a category by comparing its activation maps for a set of images against the available ground truth, using the same threshold and scoring function described by [Bau et al., 2017]. The interpretability of a layer can then be defined as the number of unique categories assigned to the filters in that layer; the authors call this the number of unique detectors.

#### Evaluation on a Synthetic Dataset

In order to validate the efficacy of our methods in a controlled environment, we constructed a synthetic dataset with obvious human-interpretable visual concepts corresponding to shape and color: each image contains two randomly positioned figures from the set {square, circle, triangle}, with repetitions allowed, and each shape is assigned a colour from the set {red, blue, green}. Note that in principle there are 15 high-level concepts in this paradigm: one for each combination of colour and shape, and one for each colour and each shape separately. We consider two training settings: in the first, we have one label for each combination of shapes and colours (there are therefore 45 labels); in the second, we simply assign the label one if a square is present and zero otherwise (the problem is therefore binary). Thus, in the first setting, all the information contained in the high-level concepts shape, color, and color-shape is contained in the label, whereas in the second setting only part of this information is encoded in the label. For training, we use a simple CNN with two convolutional layers, having 128 and 256 filters in the first and second layer respectively; the filter size is 3×3 for both layers. The evaluation is performed using the same metrics as described above. We report the results of the second, binary setting in Table 1.

| Regularizer | Layer | color | shape | color-shape | Total |
|---|---|---|---|---|---|
| Weight-Decay | conv1 | 3 | 0 | 4 | 7 |
| | conv2 | 3 | 0 | 7 | 10 |
| $R_{bn}$ | conv1 | 3 | 2 | 4 | 9 |
| | conv2 | 3 | 2 | 7 | 12 |
| $R_{bn} + L_g + L_s$ | conv1 | 3 | 3 | 5 | 11 |
| | conv2 | 3 | 2 | 9 | 14 |

Table 1: Number of unique detectors per concept type in each layer when training on the synthetic data in the two-class (binary) label setup.

In the multiclass setting, the network is able to recover all concepts very well regardless of regularization. However, in the binary setting the addition of our regularizers has a very strong positive effect on the recovery of the concepts: when we add all three regularizers, 14 out of the 15 unique concepts are recovered at the second layer, compared to only 10 when using weight decay. This illustrates that our method is able to guide the network towards a more complete and disentangled representation of all salient features of the data in the absence of low-level feature annotations.
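For reproducibility, here is an illustrative generator for the synthetic data described above: two randomly placed figures from {square, circle, triangle}, each coloured red, blue, or green, with the binary label indicating whether a square is present. The image size, shape sizes, and the use of PIL are our own choices; the paper does not specify them.

```python
# Illustrative generator for the synthetic shape/colour dataset (binary label setting).
import random
from PIL import Image, ImageDraw

SHAPES = ["square", "circle", "triangle"]
COLORS = {"red": (255, 0, 0), "blue": (0, 0, 255), "green": (0, 255, 0)}

def sample_image(size: int = 64):
    """Draws two randomly positioned, randomly coloured figures (repetitions allowed)
    and returns the image together with the binary label (1 iff a square is present)."""
    img = Image.new("RGB", (size, size), (0, 0, 0))
    draw = ImageDraw.Draw(img)
    shapes = [random.choice(SHAPES) for _ in range(2)]
    for shape in shapes:
        color = COLORS[random.choice(list(COLORS))]
        cx, cy = random.randint(12, size - 12), random.randint(12, size - 12)
        r = random.randint(6, 10)
        if shape == "square":
            draw.rectangle([cx - r, cy - r, cx + r, cy + r], fill=color)
        elif shape == "circle":
            draw.ellipse([cx - r, cy - r, cx + r, cy + r], fill=color)
        else:  # triangle
            draw.polygon([(cx, cy - r), (cx - r, cy + r), (cx + r, cy + r)], fill=color)
    return img, int("square" in shapes)
```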
#### Evaluation on the Real Datasets

We tested our methods on three well-known convolutional architectures: Alexnet, Alexnet-B and VGG, where Alexnet-B is the standard Alexnet with batch-normalization layers added. All networks are trained from scratch on the two well-known image classification datasets Places365 [Zhou et al., 2017] and ImageNet [Deng et al., 2009]. The Places365 dataset is comparable in scale to ImageNet and contains 1.8 million training images spanning 365 scene categories. For evaluation, we exactly replicate the method from [Bau et al., 2017], as explained above, to compute the interpretability of the trained models. As an interpretability baseline, we obtained pretrained models for the ImageNet and Places365 datasets from the PyTorch and MIT CSAIL repositories respectively. Table 2 compares the interpretability scores of the models trained with the conventional weight decay regularizer and with our training method.

| Dataset | Model | conv1 | conv2 | conv3 | conv4 | conv5 | Score |
|---|---|---|---|---|---|---|---|
| Places365 | Alexnet | 6 | 12 | 22 | 36 | 54 | 130 |
| | Alexnet ($R_{bn}$) | 6 | 16 | 36 | 46 | 69 | 173 |
| | Alexnet ($R_{bn} + L_g + L_s$) | 8 | 17 | 35 | 50 | 75 | 185 |
| Places365 | Alexnet-B | 6 | 13 | 27 | 43 | 59 | 148 |
| | Alexnet-B ($R_{bn}$) | 6 | 15 | 30 | 44 | 62 | 157 |
| | Alexnet-B ($R_{bn} + L_g + L_s$) | 5 | 17 | 30 | 51 | 63 | 166 |
| Imagenet | Alexnet | 5 | 20 | 34 | 28 | 49 | 136 |
| | Alexnet ($R_{bn}$) | 5 | 16 | 39 | 32 | 45 | 137 |
| | Alexnet ($R_{bn} + L_g + L_s$) | 7 | 16 | 34 | 36 | 48 | 141 |

| Dataset | Model | conv3-3 | conv4-3 | conv5-1 | conv5-2 | conv5-3 | Score |
|---|---|---|---|---|---|---|---|
| Imagenet | VGG16 | 10 | 48 | 59 | 53 | 89 | 259 |
| | VGG16 ($R_{bn} + L_g + L_s$) | 14 | 48 | 59 | 62 | 83 | 266 |
| Places365 | VGG16 | 9 | 50 | 62 | 75 | 119 | 315 |
| | VGG16 ($R_{bn} + L_g + L_s$) | 10 | 45 | 62 | 86 | 129 | 332 |

Table 2: The number of unique detectors in each layer of the network and their sum (Score). We achieve better interpretability across all tested networks and datasets.

#### Analysis over Concept Groups

For each predefined group, we compute a weighted sum over the number of detectors found in the group and the IoU score. A group is said to be aligned to a visual concept when this weighted sum exceeds a threshold $T$, which was chosen by hand and is the same for all experiments. The relevance of such a concept group can be obtained from Equation (7). In Figure 4, we show an analysis of the emergent concept groups in the convolutional layers of Alexnet.

Figure 4: Concept groups (texture, color, object, part, scene) that emerge in the different layers (conv1–conv5) of a trained Alexnet. The size of each circle represents the relevance of a concept group in a layer.

### 4.2 Interpretability: Qualitative Assessment

We use the method described in [Zhou et al., 2014] for visualizing the filters of trained CNNs. To visualize the receptive field of a filter $f$, its activation maps are computed over a set of images. The maximum activation (max-activation) value over all pixels is extracted for each activation map. The top $K$ images are then selected as those whose activation maps have the $K$ largest max-activation values. For each image, regions with high unit activations are identified using a threshold $T_f$ taken from [Bau et al., 2017]. This activated region is then scaled up to the image resolution, and the receptive field of filter $f$ can be visualized over the selected images. Overall, this qualitative analysis gives a better understanding of a filter by focusing on important areas in the image [Zhou et al., 2014]. We compare the visualizations of the filters of a CNN trained with weight decay against a CNN trained with our proposed losses in Figure 5.

Figure 5: The receptive fields of filters in the last layer of Alexnet trained on the Places365 dataset with the weight decay regularizer (top two rows) compared with the model trained with the spatial loss $L_s$ (bottom two rows).
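The visualization procedure above can be sketched roughly as follows: rank images by a filter's maximum activation, keep the top $K$, threshold each activation map, and upsample the resulting mask to image resolution. The quantile-based threshold below is only a stand-in for the $T_f$ of [Bau et al., 2017], and the function and parameter names are ours.

```python
# Hedged sketch of the top-K receptive-field visualization described in Section 4.2.
import torch
import torch.nn.functional as F

def top_k_receptive_fields(act_maps: torch.Tensor, images: torch.Tensor,
                           k: int = 8, quantile: float = 0.995):
    """act_maps: (N, H, W) activation maps of one filter over N images.
    images: (N, C, Hi, Wi) corresponding inputs.
    Returns the top-k images and their upsampled binary activation masks."""
    max_act = act_maps.flatten(1).max(dim=1).values        # max-activation per image
    top = max_act.topk(k).indices                          # indices of the top-K images
    t_f = torch.quantile(act_maps.flatten(), quantile)     # illustrative threshold
    masks = (act_maps[top] > t_f).float().unsqueeze(1)     # (k, 1, H, W) binary masks
    masks = F.interpolate(masks, size=images.shape[-2:], mode="nearest")
    return images[top], masks
```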
### 4.3 Accuracy Versus Interpretability

Attempts to increase interpretability most often reduce the discrimination power of a CNN, thereby compromising the validation accuracy of the trained network. We show that by applying soft constraints as proposed in this paper, the validation accuracy is only slightly reduced (less than 1 percent in each case). The results in Table 2 show that most of the unique detectors emerge in the last two convolutional layers of every network. We therefore compare the ratio of unique detectors (RUD), i.e. the number of unique detectors divided by the total number of filters in a layer, for the final two convolutional layers, along with the accuracy of the trained network, in Table 3.

| Dataset | Network | Top-1 acc. | Top-5 acc. | RUD |
|---|---|---|---|---|
| Places365 | Alexnet | 51.14% | 81.59% | 0.176 |
| | Alexnet ($R_{bn} + L_g + L_s$) | 50.46% | 80.60% | 0.244 |
| | Alexnet-B* | 51.32% | 81.60% | 0.199 |
| | Alexnet-B ($R_{bn} + L_g + L_s$) | 50.46% | 80.62% | 0.223 |
| | VGG16 | 54.91% | 85.02% | 0.189 |
| | VGG16 ($R_{bn} + L_g + L_s$) | 54.91% | 85.20% | 0.210 |
| Imagenet | Alexnet | 56.30% | 79.04% | 0.150 |
| | Alexnet ($R_{bn} + L_g + L_s$) | 55.09% | 77.93% | 0.164 |
| | VGG16 | 71.79% | 90.45% | 0.139 |
| | VGG16 ($R_{bn} + L_g + L_s$) | 70.55% | 89.87% | 0.142 |

Table 3: A comparison of validation accuracy and RUD score when training with different regularizers. The RUD score is the number of unique detectors divided by the number of filters in the last two convolutional layers of a CNN.

## 5 Conclusion

Our formulated losses and regularizer exploit the hidden structure of the data by inducing a more prominent group structure among the filters of a CNN. Even simply replacing the conventional weight decay regularizer with our block norm $R_{bn}$ increases the interpretability of a network without compromising its discrimination power. We efficiently compute the activation region of a filter over an image in the forward pass, as described in Section 3.1, and then constrain the activation regions so that filters in the same group activate on similar areas. This further induces a group structure in all the layers of a CNN and enhances interpretability.

## Acknowledgements

The authors gratefully acknowledge support by the Carl Zeiss Foundation, by the German Research Foundation (DFG) award KL 2698/2-1, and by the Federal Ministry of Science and Education (BMBF) awards 01IS18051A and 031B0770E.

## References

[Bau et al., 2017] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[Huang et al., 2019] Yifeng Huang, Zhirong Tang, Dan Chen, Kaixiong Su, and Chengbin Chen. Batching soft IoU for training semantic segmentation networks. IEEE Signal Processing Letters, 27:66–70, 2019.

[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 448–456. JMLR.org, 2015.

[Krizhevsky et al., 2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

[Li et al., 2017] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets, 2017.

[Mahendran and Vedaldi, 2016] Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233–255, 2016.

[Mocanu et al., 2018] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1):1–12, 2018.
[Tychsen-Smith and Petersson, 2018] Lachlan Tychsen-Smith and Lars Petersson. Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[Wen et al., 2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems, 29:2074–2082, 2016.

[Yosinski et al., 2014] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3320–3328. Curran Associates, Inc., 2014.

[Zhang et al., 2017] Quanshi Zhang, Wenguan Wang, and Song-Chun Zhu. Examining CNN representations with respect to dataset bias. CoRR, abs/1710.10577, 2017.

[Zhang et al., 2018] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[Zhang et al., 2020] Quanshi Zhang, Xin Wang, Ying Nian Wu, Huilin Zhou, and Song-Chun Zhu. Interpretable CNNs for object classification, 2020.

[Zhou et al., 2014] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. CoRR, abs/1412.6856, 2014.

[Zhou et al., 2017] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[Zhou et al., 2018] Bolei Zhou, David Bau, Aude Oliva, and Antonio Torralba. Interpreting deep visual representations via network dissection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.