# Interpretable Compositional Convolutional Neural Networks

Wen Shen1, Zhihua Wei1, Shikun Huang1, Binbin Zhang1, Jiaqi Fan1, Ping Zhao1, Quanshi Zhang2

1Tongji University, Shanghai, China; 2Shanghai Jiao Tong University, Shanghai, China

{wen_shen,zhihua_wei,hsk,0206zbb,1930795,zhaoping}@tongji.edu.cn, zqs1022@sjtu.edu.cn

**Abstract.** The reasonable definition of semantic interpretability presents the core challenge in explainable AI. This paper proposes a method to modify a traditional convolutional neural network (CNN) into an interpretable compositional CNN, in order to learn filters that encode meaningful visual patterns in intermediate convolutional layers. In a compositional CNN, each filter is supposed to consistently represent a specific compositional object part or image region with a clear meaning. The compositional CNN learns from image labels for classification without any annotations of parts or regions for supervision. Our method can be broadly applied to different types of CNNs. Experiments have demonstrated the effectiveness of our method. The code will be released when the paper is accepted.

## 1 Introduction

Convolutional neural networks (CNNs) have exhibited superior performance in many visual tasks, and the interpretability of CNNs has received increasing attention in recent years. Studies of network interpretability usually focus on the visualization of network features or the extraction of pixel-level correlations between network inputs and outputs. Training a CNN with interpretable features in intermediate layers, which helps people obtain more trustworthy and verifiable features, remains a challenge for state-of-the-art algorithms.

In this paper, we aim to propose a method to modify a CNN so that filters in an intermediate layer encode interpretable and compositional features. Specifically, as Fig. 1 shows, each filter in the intermediate layer is supposed to be consistently activated by the same object part with a specific shape (e.g. the head part of a bird) or the same image region without a specific structure (e.g. the sky in the background) through different images. Besides, different filters in the layer are supposed to be activated by different parts or regions, which ensures the diversity of visual patterns.

Wen Shen and Zhihua Wei have equal contributions. Quanshi Zhang is the corresponding author. He is with the John Hopcroft Center and the MoE Key Lab of Artificial Intelligence, AI Institute, at Shanghai Jiao Tong University, China.

Figure 1: Compared with ICNN [Zhang et al., 2018], the interpretable compositional CNN defines the filter interpretability in a more generic manner, thereby modeling more diverse visual patterns. In a compositional CNN, each filter consistently represents a specific object part (e.g. head, torso) or image region (e.g. background) through different images, and different filters represent different object parts and image regions. In comparison, the ICNN can only represent object parts in ball-like areas.

Given different images, we learn the interpretable compositional CNN in an end-to-end manner without any annotations of object parts or image regions for supervision. To this end, we add a specific loss to the intermediate layer in a CNN.
This loss encourages each filter to be consistently activated by the same object part or the same image region, and pushes different filters to be activated by different object parts or image regions. We notice that a CNN usually uses a set of filters, rather than a single filter, to jointly represent a specific object part or image region, which has been discussed in [Fong and Vedaldi, 2018]. Therefore, we divide the filters in a convolutional layer into different groups. The loss is designed to force filters in the same group to be activated by the same object part or the same image region, and to force filters in different groups to be activated by different parts or regions. Note that each filter in a group is required to represent almost the entire part/region, instead of a random sub-part/sub-region fragment inside it, which ensures the clarity of the meaning of each filter. The mutual verification of visual patterns between filters in the same group ensures the stability of the visual patterns represented by each filter in the group, while the slight differences between their feature maps encode the fine-grained variety of the same type of parts/regions. To this end, we design a metric to measure the similarity between filters, which enables the loss to group filters. Besides, for multi-category classification, we design a loss to force different groups of filters to be activated by object parts or image regions of different categories.

In this study, we evaluate the interpretability of filters in the convolutional layer both qualitatively and quantitatively. We visualize the feature map of a filter to qualitatively show the consistency of a filter's visual patterns through different images, in order to examine the fitness between the visual patterns automatically learned by a compositional CNN and the visual concepts in human cognition. For the quantitative evaluation, the previous metrics in [Zhang et al., 2018] can only evaluate the semantic consistency of object parts in ball-like areas and rely on strong priors of object structures. Therefore, we design two metrics to evaluate the consistency of a filter's visual patterns and the diversity of visual patterns represented by different filters, respectively.

Previous studies also developed CNNs whose filters in an intermediate layer represented meaningful features. Capsule nets [Sabour et al., 2017] encoded different meaningful features, but these features usually did not represent parts or regions. Zhang et al. [2018] proposed interpretable CNNs (ICNNs), which learned filters in intermediate layers to represent object parts. They designed an information-theoretic interpretability loss to force filters to represent specific object parts; filters in the ICNN could only represent object parts in ball-like areas. In comparison, we extend the filter interpretability to both object parts with specific shapes and image regions without clear structures, which presents significant challenges to state-of-the-art algorithms. Thus, the compositional CNN can encode more types of features than the ICNN; please see Fig. 1 for details.

Contributions of this study can be summarized as follows. We propose a method to modify traditional CNNs into compositional CNNs without any annotations of object parts or image regions for supervision.
Each filter in a compositional CNN consistently represents the same object part or image region with a clear meaning. Experiments show that our method can be broadly applied to CNNs with different architectures.

## 2 Related Work

**Learning interpretable features.** Some studies directly trained networks to increase the interpretability of intermediate-layer features. Capsule nets [Sabour et al., 2017] learned capsules to encode meaningful features via a dynamic routing mechanism. InfoGAN [Chen et al., 2016] and β-VAE [Higgins et al., 2017] learned interpretable representations for generative networks. These studies did not make each filter in the CNN encode a specific visual pattern. To this end, some studies [Li et al., 2020; Liang et al., 2020] learned class-specific filters, i.e. each filter only represented a specific category. However, such class-specific filters could not represent fine-grained meaningful visual patterns, such as object parts and image regions. Chen et al. [2019] proposed the ProtoPNet to extract similar object-part regions that were shared for fine-grained classification, but the ProtoPNet did not ensure that each filter represented a clear meaning. Zhang et al. [2018] proposed interpretable CNNs to make each filter in a high convolutional layer represent a specific object part. In comparison, we extend the filter interpretability to both object parts and image regions, which presents significant challenges to state-of-the-art algorithms.

**Compositional models.** Previous studies of compositional models focused on learning hierarchical feature representations [Fidler and Leonardis, 2007; Zhu et al., 2010a; Ommer and Buhmann, 2009], such as graph-based models [Si and Zhu, 2013; Wang and Yuille, 2015] and part-based models [Ott and Everingham, 2011; Zhu et al., 2010b]. These models did not use neural networks to learn features. Other studies learned discriminative compositional parts directly through network training. Stone et al. [2017] manually designed a graphical model to organize CNN modules and represent object structures. Kortylewski et al. [2020] designed a specific compositional layer to enable the network to localize partial occlusions. Huang and Li [2020] learned discriminative object parts for fine-grained recognition based on manually labeled part priors. However, in all the above studies, the compositional information of features was not automatically learned from data. In comparison, the proposed compositional CNN automatically learns compositional features without any annotations of parts or regions, i.e. it automatically regularizes its features into meaningful parts and regions without people supervising its semantic representations.

## 3 Algorithm

In this section, we aim to modify a convolutional layer of a CNN into an interpretable compositional layer. In this layer, each filter is supposed to consistently represent the same object part or the same image region through different images. To ensure the consistency of the visual patterns represented by each filter, we use a group of filters to represent the same object part or the same image region. The set of filters $\Omega = \{1, 2, \dots, d\}$ in the target layer is divided into different groups $A_1, A_2, \dots, A_K$, where $A_1 \cup A_2 \cup \dots \cup A_K = \Omega$ and $A_i \cap A_j = \emptyset$ for $i \neq j$. $A = \{A_1, A_2, \dots, A_K\}$ denotes the partition of filters. Let $\theta$ denote the parameters of the CNN.
Given a set of training images, we aim to simultaneously optimize the parameters $\theta$ and the partition $A$, so that filters in the same group consistently represent the same visual patterns through different images, and filters in different groups represent different visual patterns. To measure whether different filters represent similar visual patterns, we propose a metric for the similarity between filters.

Given an image $I$, let $x_i^I \in \mathbb{R}^m$ denote the feature map of the $i$-th filter in the target convolutional layer after the ReLU operation. Given a set of $n$ training images $\mathbf{I}$, let $X_i = \{x_i^I\}_{I \in \mathbf{I}}$ denote the set of feature maps of the $i$-th filter. Then, we compute the similarity between the $i$-th and the $j$-th filters, which represents whether these two filters consistently represent the same visual patterns through different images. This similarity is formulated as $s_{ij} = K(X_i, X_j) \in \mathbb{R}$, where $K$ is a kernel function. Based on this similarity metric, we design the following loss to learn filters.

$$Loss(\theta, A) = -\sum_{k=1}^{K} \frac{\sum_{i,j \in A_k} s_{ij}}{\sum_{i \in A_k, j \in \Omega} s_{ij}}, \quad (1)$$

where $S_k^{\mathrm{within}} = \sum_{i,j \in A_k} s_{ij} = \sum_{i,j \in A_k} K(X_i, X_j)$ measures the similarity between filters within the same group $A_k$, and $S_k^{\mathrm{all}} = \sum_{i \in A_k, j \in \Omega} s_{ij} = \sum_{i \in A_k, j \in \Omega} K(X_i, X_j)$ measures the similarity between filters in $A_k$ and all filters in $\Omega$. This loss increases $S_k^{\mathrm{within}}$ to ensure that filters in the same group have high similarity, and decreases $S_k^{\mathrm{all}}$ to ensure that filters in different groups have low similarity. Specifically, the similarity metric is implemented as the following kernel function.

$$s_{ij} = K(X_i, X_j) = \rho_{ij} + 1 = \frac{\mathrm{cov}(X_i, X_j)}{\sigma_i \sigma_j} + 1 \geq 0, \quad (2)$$

where $\rho_{ij} \in [-1, 1]$ denotes the Pearson correlation coefficient between the variables $x_i^I$ and $x_j^I$ through different images, and the constant 1 is added to ensure the non-negativity of the similarity. $\mathrm{cov}(X_i, X_j) \in \mathbb{R}$ denotes the covariance between $x_i^I$ and $x_j^I$ through different images, $\mathrm{cov}(X_i, X_j) = \frac{1}{n-1}\sum_{I \in \mathbf{I}} (x_i^I - \mu_i)^\top (x_j^I - \mu_j) \in \mathbb{R}$, where $\mu_i = \frac{1}{n}\sum_{I \in \mathbf{I}} x_i^I \in \mathbb{R}^m$ and $\sigma_i^2 = \frac{1}{n-1}\sum_{I \in \mathbf{I}} \|x_i^I - \mu_i\|^2 \in \mathbb{R}$. The similarity metric can be understood as the sum of similarities between feature maps of the $i$-th and the $j$-th filters over all training images, $s_{ij} = K(X_i, X_j) = \sum_{I \in \mathbf{I}} \phi(x_i^I)^\top \phi(x_j^I)$, where $\phi(x_i^I) = \big[(x_i^I - \mu_i)^\top, \sqrt{1 - 1/n}\,\sigma_i\big]^\top \big/ \big(\sqrt{n-1}\,\sigma_i\big)$.

The proposed loss makes filters in the same group have similar feature maps, which ensures the clarity and the stability of the visual patterns represented by each filter in the group. Meanwhile, the slight differences between these feature maps encode the fine-grained variety of the same type of parts/regions. Besides, the loss also makes filters in different groups have different feature maps, which ensures the diversity of the visual patterns represented by different groups of filters.

**Binary classification of a single category.** We train a compositional CNN in an end-to-end manner by minimizing the following objective function.

$$L(\theta, A) = \lambda\, Loss(\theta, A) + \frac{1}{n}\sum_{I \in \mathbf{I}} L_{cls}(\hat{y}_I, y_I; \theta), \quad (3)$$

where $L_{cls}(\hat{y}_I, y_I; \theta)$ denotes the classification loss on image $I$; $\hat{y}_I, y_I \in \{-1, +1\}$ denote the output of the CNN and the ground-truth label, respectively; $\lambda$ is a positive weight.
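To make Eq. (1) and Eq. (2) concrete, below is a minimal PyTorch-style sketch of the similarity kernel and the group loss. It is our illustration rather than the authors' released implementation; the tensor layout (ReLU feature maps flattened to shape (n, d, m)) and the function names are assumptions.

```python
import torch

def filter_similarity(feat):
    """Pairwise filter similarity s_ij = rho_ij + 1 (Eq. 2).

    feat: (n, d, m) tensor of ReLU feature maps of the d target filters over
    n images, each map flattened to m elements.
    Returns a (d, d) non-negative similarity matrix.
    """
    n, d, m = feat.shape
    centered = feat - feat.mean(dim=0, keepdim=True)   # x_i^I - mu_i
    c = centered.permute(1, 0, 2).reshape(d, n * m)    # stack all images per filter
    cov = c @ c.t() / (n - 1)                          # cov(X_i, X_j)
    sigma = cov.diagonal().clamp_min(1e-8).sqrt()      # sigma_i
    rho = cov / (sigma[:, None] * sigma[None, :])      # Pearson correlation
    return rho + 1.0                                   # shift to be >= 0

def group_loss(sim, groups):
    """Loss(theta, A) in Eq. (1): -sum_k S_k^within / S_k^all.

    sim: (d, d) similarity matrix; groups: list of index tensors (partition A).
    """
    loss = sim.new_zeros(())
    for idx in groups:
        s_within = sim[idx][:, idx].sum()              # similarity inside A_k
        s_all = sim[idx, :].sum()                      # similarity of A_k to all filters
        loss = loss - s_within / s_all
    return loss
```

In training, `sim` would be recomputed from the current feature maps and the returned loss added to the classification loss with weight $\lambda$, as in Eq. (3).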
**Multi-category classification.** For multi-category classification, besides $Loss(\theta, A)$, we design another loss to make different groups of filters be activated by parts or regions of different categories. Given a set of $n$ training images $\mathbf{I}$, let $\mathbf{I}_c \subseteq \mathbf{I}$ represent the subset of images of category $c$ $(c = 1, 2, \dots, C)$. Filters in a certain group are supposed to be mainly activated by a specific object part or image region of very few categories, and to keep silent on other categories. To this end, for each filter, we quantify the distribution of its neural activations over different categories, and we propose a metric to measure the similarity between such distributions of different filters. Given the $p$-th image $I$, let $z_k^{(p)} \in \mathbb{R}$ denote the average activation score of filters in group $A_k$, $z_k^{(p)} = \frac{1}{|A_k|\, m}\sum_{i \in A_k}\sum_{u=1}^{m} x_{i,u}^I$, where $|A_k|$ denotes the number of filters in group $A_k$, and $x_{i,u}^I$ denotes the $u$-th element of $x_i^I \in \mathbb{R}^m$. The similarity between the activation distributions of different groups on the $p$-th and the $q$-th images is computed using the kernel function $s_{pq} = K(z^{(p)}, z^{(q)}) = (z^{(p)})^\top z^{(q)} \in \mathbb{R}$, where $z^{(p)} = [z_1^{(p)}, \dots, z_K^{(p)}]^\top \in \mathbb{R}^K$; since $x_{i,u}^I \geq 0$, we have $s_{pq} \geq 0$. We propose the following loss to learn filters.

$$L_{multi}(\theta) = -\sum_{c=1}^{C} \frac{\sum_{p,q \in \mathbf{I}_c} s_{pq}}{\sum_{p \in \mathbf{I}_c, q \in \mathbf{I}} s_{pq}} = -\sum_{c=1}^{C} \frac{\sum_{p,q \in \mathbf{I}_c} K(z^{(p)}, z^{(q)})}{\sum_{p \in \mathbf{I}_c, q \in \mathbf{I}} K(z^{(p)}, z^{(q)})}. \quad (4)$$

The final objective function for multi-category classification is given as follows.

$$L(\theta, A) = \lambda\, Loss(\theta, A) + \beta\, L_{multi}(\theta) + \frac{1}{n}\sum_{I \in \mathbf{I}} L_{cls}(\hat{y}_I, y_I; \theta), \quad (5)$$

where $\lambda$ and $\beta$ are positive weights.

**Learning.** During the learning process, we need to simultaneously optimize the network parameters $\theta$ and the filter partition $A$. Fortunately, we find that, when $\theta$ is fixed, the minimization of $Loss(\theta, A)$ w.r.t. $A$ is essentially equivalent to the spectral clustering problem in [Shi and Malik, 2000]. I.e., we can rewrite $Loss(\theta, A)$ as the following equation, which is exactly the objective function in [Shi and Malik, 2000].

$$\frac{1}{2}\big(Loss(\theta, A) + K\big) = \frac{1}{2}\sum_{k=1}^{K} \frac{\sum_{i \in A_k,\, j \in \Omega \setminus A_k} s_{ij}}{\sum_{i \in A_k,\, j \in \Omega} s_{ij}}. \quad (6)$$

Here, we regard the set of filters $\Omega$ as data points in spectral clustering that need to be clustered into different groups $A_1, \dots, A_K$, and $s_{ij}$ corresponds to the similarity between two data points. In this way, $A$ can be optimized by applying the clustering technique in [Shi and Malik, 2000]. Therefore, we alternately optimize $\theta$ and $A$ to minimize $Loss(\theta, A)$.

## 4 Experiments

We applied our method to CNNs with six types of architectures to demonstrate its broad applicability. We used object images from four benchmark datasets to learn compositional CNNs for both the binary classification of a single category and multi-category classification. We designed two metrics to measure the inconsistency of a filter's visual patterns and the diversity of visual patterns represented by different filters, and also visualized the feature maps of filters to qualitatively show the consistency of their visual patterns. We further compared the performance of learning interpretable filters in different convolutional layers of a compositional CNN, and discussed the effects of the group number on the performance of learning interpretable filters.

For the binary classification of a single category, we set $\lambda = 1.0$ for most DNNs, except for the VGG-16, for which we set $\lambda = 0.1$. This was because the learning of a residual network could be considered as the optimization of massive parallel shallow networks; from this perspective, the VGG-16, which has no residual connections, was the most difficult to optimize. For multi-category classification, we set $\lambda = 0.1$ and $\beta = 0.1$, because $L_{multi}$ partially takes over the role of $Loss(\theta, A)$. During training, each time we optimized $\theta$ over all training samples, we optimized $A$ once.
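The two pieces described above, the multi-category loss of Eq. (4) and the partition update behind Eq. (6), might look as follows. This is a sketch under our own assumptions: `feat` is the same flattened (n, d, m) feature tensor as before, and scikit-learn's `SpectralClustering` with a precomputed affinity matrix stands in for the normalized-cut solver of [Shi and Malik, 2000].

```python
import numpy as np
import torch
from sklearn.cluster import SpectralClustering

def group_activations(feat, groups):
    """z_k^(p): average activation of group A_k on each image (setup of Eq. 4).
    feat: (n, d, m) ReLU feature maps; groups: list of index tensors."""
    return torch.stack([feat[:, idx, :].mean(dim=(1, 2)) for idx in groups], dim=1)  # (n, K)

def multi_category_loss(z, labels, num_classes):
    """L_multi in Eq. (4). z: (n, K) group activations; labels: (n,) category ids."""
    s = z @ z.t()                                   # s_pq = <z^(p), z^(q)> >= 0
    loss = z.new_zeros(())
    for c in range(num_classes):
        in_c = (labels == c).nonzero(as_tuple=True)[0]
        loss = loss - s[in_c][:, in_c].sum() / s[in_c, :].sum()
    return loss

def update_partition(sim, num_groups):
    """Re-estimate the filter partition A with theta fixed, by clustering the
    similarity matrix s_ij (cf. Eq. 6). sim: (d, d) numpy array."""
    labels = SpectralClustering(n_clusters=num_groups, affinity="precomputed",
                                assign_labels="discretize",
                                random_state=0).fit_predict(sim)
    return [torch.from_numpy(np.flatnonzero(labels == k)) for k in range(num_groups)]
```

Following the alternating schedule above, `update_partition` would be called once after each pass of $\theta$-updates over the training data.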
Figure 2: Comparisons of the inconsistency of visual patterns and the diversity of visual patterns between CNNs (inconsistency-diversity curves; panels cover the PASCAL-Part single-category, Helen, CelebA, CUB200, and PASCAL-Part multi-category settings for the different architectures). For the binary classification of a single category, we show curves of the average inconsistency of visual patterns and the average diversity of visual patterns over each CNN learned for each individual category. Results for each single category are shown in Fig. 7. Note that each inconsistency value in this figure indicates the average inconsistency over all filters of a DNN.

### 4.1 Learning Compositional CNNs

**Binary Classification of a Single Category.** We learned six types of compositional CNNs based on the VGG-13¹, VGG-16¹ [Simonyan and Zisserman, 2015], ResNet-18, ResNet-50 [He et al., 2016], DenseNet-121, and DenseNet-161 [Huang et al., 2017] architectures. As in [Zhang et al., 2018], we added the loss $Loss(\theta, A)$ to a high convolutional layer of the CNN, because the previous study [Bau et al., 2017] had revealed that filters in high convolutional layers are more likely to represent object parts or image regions, rather than detailed patterns (e.g. colors or textures). For the VGG-13, VGG-16, DenseNet-121, and DenseNet-161, we added the proposed loss to the top convolutional layer. For the ResNet-18, we added the loss to layer conv4_4; for the ResNet-50, to layer conv4_18.

All these compositional CNNs were learned on the CUB200-2011 dataset [Wah et al., 2011], the Large-scale CelebFaces Attributes (CelebA) dataset [Liu et al., 2015], the Helen Facial Feature dataset [Smith et al., 2013], and the animal categories in the PASCAL-Part dataset [Chen et al., 2014]. In the field of learning interpretable deep features, animal categories are widely used to evaluate automatically learned interpretable features [Zhang et al., 2018], because animals usually contain deformable parts, which present great challenges for part or region localization. Note that the Helen Facial Feature dataset is usually used for facial landmark localization; in this study, we used it for the classification of faces and non-faces, because it provides segmentation masks for face parts, which we used to evaluate the inconsistency and the diversity of visual patterns. We randomly selected the same number of samples from the PASCAL-Part dataset as negative samples for training and testing. We followed the experimental settings in [Zhang et al., 2018] to learn compositional CNNs for the binary classification of a single category on the CUB200-2011 dataset and the PASCAL-Part dataset. For compositional CNNs learned from the CUB200-2011 dataset, the PASCAL-Part dataset, and the Helen Facial Feature dataset, we set $K = 5$.
For the CelebA dataset, we set $K = 16$, because the compositional CNNs learned from face images usually encoded detailed visual patterns. To compare the performance of learning interpretable filters in different convolutional layers, we learned two compositional CNNs based on the VGG-16 architecture by adding the proposed loss to layer conv4_3 and layer conv5_3, respectively; these two compositional CNNs were learned on the PASCAL-Part dataset. To explore the effects of different values of $K$, we learned two compositional CNNs based on the ResNet-50 architecture using the CelebA dataset, setting $K = 8$ and $K = 16$, respectively.

¹The VGG-13 and VGG-16 used in this paper were slightly revised by adding a batch-normalization [Ioffe and Szegedy, 2015] operation after each convolutional layer.

**Multi-Category Classification.** We learned four types of compositional CNNs based on the VGG-13, VGG-16, DenseNet-121, and DenseNet-161 architectures for classification on the PASCAL-Part dataset, following the experimental settings in [Zhang et al., 2018]. We set $K = 16$.

For all compositional CNNs, we learned traditional CNNs based on the same architectures and datasets as baselines. We replaced zero padding with replication padding for all compositional CNNs. For traditional CNNs based on the DenseNet architectures, we initialized the parameters of the fully-connected layers and loaded the parameters of the other layers from the same architectures pre-trained on the ImageNet dataset [Deng et al., 2009]. For traditional CNNs based on the other architectures, we initialized the parameters of the target layer (i.e. the convolutional layer to be modified into an interpretable compositional layer) and its following layers, and loaded the parameters of the other layers from the same architectures pre-trained on the ImageNet dataset. For all compositional CNNs, we loaded the parameters of all layers from the above well-trained traditional CNNs.
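As a concrete illustration of the training setup described in this subsection, the sketch below shows one way to attach the interpretability loss to the top convolutional layer of a batch-normalized VGG-16 via a forward hook, reusing `filter_similarity` and `group_loss` from the earlier sketch. The hook-based wiring, the 2-way cross-entropy head, the dummy batch, and the per-batch computation of $s_{ij}$ are our simplifications, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16_bn

# The similarity metric is defined on ReLU feature maps, so we hook the ReLU
# that follows the top convolutional layer of the (batch-normalized) VGG-16.
model = vgg16_bn(weights=None, num_classes=2)      # 2-way head, e.g. face vs. non-face
target = [m for m in model.features if isinstance(m, nn.ReLU)][-1]

feats = {}
target.register_forward_hook(lambda mod, inp, out: feats.update(feat=out))

lambda_ = 0.1        # weight reported above for VGG-16
K = 5                # number of filter groups for CUB200/PASCAL-Part/Helen

images = torch.randn(8, 3, 224, 224)               # dummy batch for illustration
labels = torch.randint(0, 2, (8,))

logits = model(images)
f = feats["feat"].flatten(2)                       # (n, d, m) ReLU feature maps
groups = list(torch.arange(f.shape[1]).chunk(K))   # initial even partition of the d filters
# filter_similarity / group_loss: see the sketch after Eq. (2).
sim = filter_similarity(f)                         # per-batch approximation of s_ij
loss = lambda_ * group_loss(sim, groups) + F.cross_entropy(logits, labels)
loss.backward()
```

An even initial partition is used here only for illustration; during actual training the partition would be refreshed by the clustering step sketched earlier after each pass over the training data.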
### 4.2 Quantitative Evaluation of Filter Interpretability

Some previous studies also focused on learning interpretable filters, but their metrics usually have strong limitations and cannot be used in our experiments. The metrics in [Zhang et al., 2018] can only evaluate the semantic consistency of object parts in ball-like areas and rely on strong priors of object structures. Bau et al. [2017] annotated six types of visual semantics for evaluation (including colors and materials), but filters in the compositional CNN were not designed towards such semantics. Therefore, we extended the metric in [Bau et al., 2017] and proposed the inconsistency of visual patterns to evaluate the interpretability of filters. Besides, we evaluated the diversity of visual patterns represented by filters, which is a significant factor neglected in previous studies.

Figure 3: Visualization of feature maps of compositional CNNs and ICNNs [Zhang et al., 2018]. Each column in the figure corresponds to a certain filter (filters represent parts such as the forehead, eyebrows, mouth, head, and torso/leg, or regions such as the background). Visualization results indicate that each filter in a compositional CNN consistently represented the same object part or the same image region, while different filters represented different parts and regions. In comparison, filters in an ICNN could only represent object parts. Note that we manually classified filters into part filters and region filters to help understand the visual patterns represented by each filter. In addition, part filters in the compositional CNN usually encoded more complex shapes than those in the ICNN.

**Evaluation Metric 1: Inconsistency of Visual Patterns.** This metric measures the consistency of the visual patterns represented by a filter through different images. Ideally, an interpretable filter is supposed to have high consistency. We computed the probability of a filter being associated with a ground-truth semantic concept in a specific image (e.g. bird head, bird torso), and then defined the inconsistency of visual patterns as the entropy of such probabilities over different semantic concepts. For simplicity, we only discuss the metric for a single filter below. We first computed the pixel-wise receptive field (RF) of the filter's neural activations on testing images [Zhang et al., 2018]. Let $Q(I) \in \mathbb{R}^M$ denote the activation scores of the target filter projected onto the test image $I$, where $M$ denotes the number of pixels in $I$. We only considered activation scores in the feature map greater than a threshold $\tau$ as valid ones to represent the filter (the setting of $\tau$ is explained later). Then, $\tilde{Q}(I) \in \{0, 1\}^M$ s.t. $\tilde{Q}_u(I) = \mathbb{1}(Q_u(I) \geq \tau)$ denotes the RF. Let $G^j(I) \in \{0, 1\}^M$ denote the ground-truth segmentation mask of the $j$-th concept $(j = 1, \dots, T)$ on the test image $I$. The probability of the target filter being associated with the $j$-th concept was computed as

$$P_j = \frac{\sum_{I \in \mathbf{I}_{test}} \sum_{u=1}^{M} \min\{\tilde{Q}_u(I), G^j_u(I)\}}{\sum_{I \in \mathbf{I}_{test}} \sum_{u=1}^{M} \tilde{Q}_u(I)},$$

where $\mathbf{I}_{test}$ denotes the set of testing images. Then, the inconsistency of the target filter's visual patterns was defined as the entropy $H = -\sum_{j=1}^{T} P_j \log P_j$.

**Binary classification of a single category.** We followed [Zhang et al., 2018] to merge certain areas of each animal category in the PASCAL-Part dataset to obtain stable landmark locations as stable concepts for evaluation. We used five concepts for the bird category, including (head, l/r-eyes, beak, neck), (torso, l/r-wings), (l/r-legs/feet), (tail), and (background), where all parenthesized areas were merged into a new concept. We used four concepts for the cat category, including (head, l/r-eyes, l/r-ears, nose, neck), (torso, tail), (lf/rf/lb/rb-legs, lf/rf/lb/rb-paws), and (background). We used four concepts for the dog category, whose areas were merged in the same way as for the cat category, except that the additional muzzle area was merged into the head concept. We used four concepts for the cow category, which were defined in a similar way as for the dog category, with l/r-horn added to the head concept. We used four concepts for the sheep and horse categories, which were defined in the same way as for the cow category. Note that, in the actual calculations, we only used images with relatively complete areas of each animal category². In the Helen Facial Feature dataset², we used three concepts for the face category: we merged the areas of face skin, l/r-eyebrow, l/r-eye, nose, u/l-lip, and inner mouth as the face concept, and used the areas of hair and background as the second and third concepts, respectively.
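For reference, a NumPy sketch of this entropy-based metric is given below. It assumes the pixel-wise RF masks $\tilde{Q}(I)$ and the concept masks $G^j(I)$ have already been computed and aligned at pixel resolution; the array layout and the function name are our own.

```python
import numpy as np

def inconsistency(rf_masks, concept_masks, eps=1e-12):
    """Entropy-based inconsistency of one filter's visual patterns (Metric 1).

    rf_masks: (N, M) binary RF masks of the filter on N test images
        (activation scores above the threshold tau, projected to pixels).
    concept_masks: (N, T, M) binary ground-truth masks of the T concepts.
    Returns H = -sum_j P_j log P_j.
    """
    # P_j: overlap between the filter's RFs and the j-th concept, normalized
    # by the filter's total RF area over all test images.
    overlap = np.minimum(rf_masks[:, None, :], concept_masks).sum(axis=(0, 2))  # (T,)
    p = overlap / (rf_masks.sum() + eps)
    p = p[p > 0]                                    # drop empty concepts before the log
    return float(-(p * np.log(p)).sum())
```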
**Multi-category classification.** In the PASCAL-Part dataset, for each category, we used the foreground object as one concept and the background as another concept. We considered the visual concepts of all categories equally, i.e. we obtained $T = 2C$ concepts for the classification of $C$ categories. Then, we used the aforementioned entropy $H$ over the $2C$ concepts for evaluation. Note that, for the classification of a large number of categories, each category theoretically obtains only very few filters, which decreases the filter interpretability. Therefore, we only learned compositional CNNs for multi-category classification based on all animal categories in the PASCAL-Part dataset.

²The dataset for the computation of metrics in this paper will be released at https://github.com/ada-shen/icCNN.

Figure 4: (a) Comparison of interpretable filters in different convolutional layers (conv4_3 vs. conv5_3). Results indicate that filters in a high convolutional layer tended to represent parts or regions, while filters in a middle convolutional layer tended to represent textures. (b) Filters learned with different values of $K$ ($K = 8$ vs. $K = 16$). Filters in the compositional CNN with $K = 16$ represented more detailed visual patterns than the CNN learned with $K = 8$. (c) t-SNE visualization of feature maps of a compositional CNN (c1) and a traditional CNN (c2). Each point represents a feature map; different colors of points represent feature maps of filters in different groups.

Figure 5: Visualizing distributions of visual patterns that are encoded in interpretable filters via the method in [Zhang et al., 2018]. Results show that the interpretable filters of a compositional CNN explained many more regions in an image than those of an ICNN.

Figure 6: Comparison of RFs between the center of a group and each filter in the group.

**Randomly shuffled feature maps as baselines.** We constructed feature maps with no consistency of visual patterns at all as a baseline. In the implementation, we randomly shuffled the feature maps of a traditional CNN across different images to approximately construct random feature maps.

**Evaluation Metric 2: Diversity of Visual Patterns.** This metric evaluates whether a CNN learns various visual patterns. In this study, the diversity of visual patterns was approximately quantified as the number of pixels explained by the CNN. A pixel is considered explained by the CNN if it is explained by some of its filters. Recall that we computed the pixel-wise RF of a filter's neural activations on the test image $I$ based on [Zhang et al., 2018]; here, we use $\tilde{Q}^i(I) \in \{0, 1\}^M$ to denote the RF of the $i$-th filter. Then, the $u$-th pixel is determined to be explained by the CNN if $\frac{1}{d}\sum_{i=1}^{d}\tilde{Q}^i_u(I) \geq \gamma$. We set $\gamma = 0.2$.
A higher diversity means that the RFs of the filters cover more pixels, i.e. more diverse concepts are encoded by the CNN. Therefore, the diversity of visual patterns was computed as

$$Diversity = \frac{1}{M}\,\mathbb{E}_{I}\Big[\sum_{u=1}^{M} \mathbb{1}\Big(\frac{1}{d}\sum_{i=1}^{d} \tilde{Q}^i_u(I) \geq \gamma\Big)\Big].$$

Table 1: Comparisons of classification accuracy between ICNNs and compositional CNNs revised from different classic CNNs. "–" denotes entries that are not available.

| Model | PASCAL-Part (single) | CUB200 (single) | CelebA (single) | PASCAL-Part (multi) |
| --- | --- | --- | --- | --- |
| VGG-13 | 97.07 | 99.76 | 87.51 | – |
| + compositional CNN | 96.29 | 99.41 | 86.37 | – |
| VGG-16 | 98.66 | 99.86 | 90.47 | 89.71 |
| + ICNN | 95.39 | 96.51 | 89.11 | 91.60 |
| + compositional CNN | 97.12 | 99.27 | 90.70 | 87.51 |
| ResNet-18 | 97.77 | 99.81 | 89.60 | – |
| + ICNN | 93.30 | 97.12 | – | – |
| + compositional CNN | 96.90 | 98.49 | 89.76 | – |
| ResNet-50 | 97.88 | 99.88 | 90.21 | – |
| + compositional CNN | 97.30 | 99.27 | 89.63 | – |
| DenseNet-121 | 98.29 | 99.92 | 91.28 | – |
| + ICNN | 96.55 | 99.22 | – | – |
| + compositional CNN | 97.52 | 98.83 | 91.75 | – |
| DenseNet-161 | 98.70 | 99.96 | 93.48 | – |
| + compositional CNN | 98.14 | 99.61 | 92.66 | – |

**Curves of Inconsistency and Diversity.** The two metrics of inconsistency and diversity are closely related: generally speaking, the greater the diversity, the lower the consistency. Therefore, in order to fairly compare the inconsistency of visual patterns of different CNNs under different diversity, we report inconsistency-diversity curves, as shown in Fig. 2. To this end, we sampled different values of $\tau$ to obtain different pairs of inconsistency and diversity, thereby obtaining inconsistency-diversity curves. Given $n$ sampled values of $\tau$, $[\tau_1, \tau_2, \dots, \tau_n]$, we calculated $n$ pairs of inconsistency-diversity values, $(p_1, q_1), (p_2, q_2), \dots, (p_n, q_n)$. The sampling of $\tau$ was under the constraint that $q_1, q_2, \dots, q_n$ were evenly distributed in $(0, 1]$.
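A NumPy sketch of the diversity metric and of the inconsistency-diversity curve sampling is given below, reusing the `inconsistency` helper from the previous sketch. The pixel-projected activation tensor `act` is assumed to have been precomputed with the RF projection of [Zhang et al., 2018]; names and shapes are our assumptions.

```python
import numpy as np

def diversity(rf_masks, gamma=0.2):
    """Metric 2: fraction of pixels explained by the CNN.

    rf_masks: (N, d, M) binary RF masks of all d filters on N test images.
    A pixel counts as explained when the average coverage over the d filters
    reaches gamma.
    """
    coverage = rf_masks.mean(axis=1)               # (N, M) average over filters
    return float((coverage >= gamma).mean())       # average over images and pixels

def inconsistency_diversity_curve(act, concept_masks, taus, gamma=0.2):
    """One (inconsistency, diversity) point per threshold tau.

    act: (N, d, M) pixel-projected activation scores of the d filters;
    concept_masks: (N, T, M) binary ground-truth concept masks.
    """
    points = []
    for tau in taus:
        rf = (act >= tau).astype(np.float32)       # binary RFs at this threshold
        avg_h = np.mean([inconsistency(rf[:, i], concept_masks)
                         for i in range(act.shape[1])])
        points.append((avg_h, diversity(rf, gamma)))
    return points
```

In practice, the thresholds `taus` would be chosen so that the resulting diversity values are evenly spread over (0, 1], matching the sampling constraint described above.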
### 4.3 Experimental Results and Analysis

**Inconsistency-diversity curves and classification accuracy.** Fig. 2 shows the inconsistency-diversity curves of different CNNs; each inconsistency value is the average inconsistency over all filters. Under the same diversity of visual patterns, compositional CNNs exhibited higher consistency of visual patterns than traditional CNNs and ICNNs. Besides, compositional CNNs always showed higher diversity than ICNNs. As shown in Fig. 3 and Fig. 5, interpretable filters of compositional CNNs could explain almost the entire region of an image, while filters of ICNNs could only represent small parts in ball-like areas. Note that sometimes we could not obtain large values of diversity for an ICNN, because the RFs of all filters in the ICNN were small, as shown in Fig. 2. Traditional CNNs showed low consistency of visual patterns, close to that of randomly shuffled feature maps; this indicated that, in terms of filter interpretability, features of filters in traditional CNNs did not show significantly better consistency than synthesized random features. As Table 1 and Table 2 show, compositional CNNs exhibited comparable classification performance with traditional CNNs, and achieved higher accuracy than ICNNs in most comparisons.

Figure 7: The inconsistency-diversity curves of CNNs based on different categories of the PASCAL-Part dataset (one panel per architecture and animal category: bird, cat, dog, cow, horse, and sheep).

Figure 8: Very few cases in which filters in compositional CNNs did not represent meaningful patterns.

Table 2: Classification accuracy of CNNs on the Helen Facial Feature dataset. Res indicates ResNet; Dense indicates DenseNet. "–" denotes entries that are not available.

| | VGG-13 | VGG-16 | Res-18 | Res-50 | Dense-121 | Dense-161 |
| --- | --- | --- | --- | --- | --- | --- |
| classic CNN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| ICNN | – | 99.70 | 1.0 | – | 1.0 | – |
| compositional CNN | 1.0 | 1.0 | 99.85 | 1.0 | 99.85 | 1.0 |

**Visualization of filters.** We followed [Zhang et al., 2018] to visualize the RFs corresponding to a filter's feature maps.
Fig. 3 shows the RFs of features of compositional CNNs and ICNNs learned for the binary classification of a single category. In compositional CNNs, given different images, each filter consistently represented the same object part or the same image region, and different filters represented different object parts or image regions. In ICNNs, filters only represented small parts in ball-like areas. In addition, filters in the compositional CNN usually represented more complex shapes than filters in the ICNN. We also show the few failure cases of interpretable filters in compositional CNNs in Fig. 8, and compare the RFs between the center of a group and each filter in the group in Fig. 6.

**Comparison of interpretable filters in different convolutional layers.** As shown in Fig. 4 (a), filters of a high convolutional layer usually represented object parts or image regions, while filters of a middle convolutional layer usually represented local textures or local shapes.

**Comparison of interpretable filters learned with different values of K.** As shown in Fig. 4 (b), filters in the compositional CNN with $K = 16$ represented more visual patterns than filters in the compositional CNN with $K = 8$.

**t-SNE visualization.** We visualized filters in a compositional CNN and a traditional CNN using t-SNE [van der Maaten and Hinton, 2008]. These two CNNs were learned based on the VGG-16 using the bird category in the PASCAL-Part dataset. As Fig. 4 (c) shows, feature maps of the compositional CNN appear more clustered than those of the traditional CNN.

## 5 Conclusion

In this paper, we have proposed a method to modify a traditional CNN into a compositional CNN, in order to make filters in a high convolutional layer encode meaningful visual patterns without any part or region annotations for supervision. Specifically, we design a loss to encourage each filter in the layer to consistently represent the same object part or the same image region through different images, and to encourage different filters in the layer to represent different object parts and image regions. Experiments have demonstrated the effectiveness of our method.

## Acknowledgments

This work is partially supported by the National Nature Science Foundation of China (No. 61976160, 61906120, U19B2043) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

## References

[Bau et al., 2017] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017.

[Chen et al., 2014] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.

[Chen et al., 2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2016.

[Chen et al., 2019] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan K. Su. This looks like that: Deep learning for interpretable image recognition. In NeurIPS, 2019.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[Fidler and Leonardis, 2007] Sanja Fidler and Ales Leonardis. Towards scalable representations of object categories: Learning a hierarchy of parts. In CVPR, 2007.

[Fong and Vedaldi, 2018] Ruth Fong and Andrea Vedaldi. Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In CVPR, 2018.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[Higgins et al., 2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

[Huang and Li, 2020] Zixuan Huang and Yin Li. Interpretable and accurate fine-grained recognition via region grouping. In CVPR, 2020.

[Huang et al., 2017] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.

[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[Kortylewski et al., 2020] Adam Kortylewski, Ju He, Qing Liu, and Alan Yuille. Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. In CVPR, 2020.

[Li et al., 2020] Yuchao Li, Rongrong Ji, Shaohui Lin, Baochang Zhang, Chenqian Yan, Yongjian Wu, Feiyue Huang, and Ling Shao. Interpretable neural network decoupling. arXiv:1906.01166, 2020.

[Liang et al., 2020] Haoyu Liang, Zhihao Ouyang, Yuyuan Zeng, Hang Su, Zihao He, Shu-Tao Xia, Jun Zhu, and Bo Zhang. Training interpretable convolutional neural networks by differentiating class-specific filters. In ECCV, 2020.

[Liu et al., 2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.

[Ommer and Buhmann, 2009] Bjorn Ommer and Joachim M. Buhmann. Learning the compositional nature of visual object categories for recognition. IEEE T-PAMI, 32(3):501–516, 2009.

[Ott and Everingham, 2011] Patrick Ott and Mark Everingham. Shared parts for deformable part-based models. In CVPR, 2011.

[Sabour et al., 2017] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In NeurIPS, 2017.

[Shi and Malik, 2000] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE T-PAMI, 22(8):888–905, 2000.

[Si and Zhu, 2013] Zhangzhang Si and Song-Chun Zhu. Learning and-or templates for object recognition and detection. IEEE T-PAMI, 35(9):2189–2205, 2013.

[Simonyan and Zisserman, 2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[Smith et al., 2013] Brandon M. Smith, Li Zhang, Jonathan Brandt, Zhe Lin, and Jianchao Yang. Exemplar-based face parsing. In CVPR, 2013.

[Stone et al., 2017] Austin Stone, Huayan Wang, Michael Stark, Yi Liu, D. Scott Phoenix, and Dileep George. Teaching compositionality to CNNs. In CVPR, 2017.

[van der Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[Wah et al., 2011] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[Wang and Yuille, 2015] Jianyu Wang and Alan Yuille. Semantic part segmentation using compositional model combining shape and appearance. In CVPR, 2015.
[Zhang et al., 2018] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In CVPR, 2018.

[Zhu et al., 2010a] Long (Leo) Zhu, Yuanhao Chen, Antonio Torralba, William Freeman, and Alan Yuille. Part and appearance sharing: Recursive compositional models for multi-view. In CVPR, 2010.

[Zhu et al., 2010b] Long (Leo) Zhu, Yuanhao Chen, Alan Yuille, and William Freeman. Latent hierarchical structural learning for object detection. In CVPR, 2010.