# Ladder Capsule Network

Taewon Jeong¹, Youngmin Lee¹, Heeyoung Kim¹

¹Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. Correspondence to: Heeyoung Kim.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We propose a new architecture of the capsule network called the ladder capsule network, which has an alternative building block to the dynamic routing algorithm in the capsule network (Sabour et al., 2017). Motivated by the need to use only important capsules during training for robust performance, we first introduce a new layer called the pruning layer, which removes irrelevant capsules. Based on the selected capsules, we construct higher-level capsule outputs. Subsequently, to capture the part-whole spatial relationships, we introduce another new layer called the ladder layer, the outputs of which are regressed lower-level capsule outputs from higher-level capsules. Unlike the capsule network adopting the routing-by-agreement, the ladder capsule network uses backpropagation from a loss function to reconstruct the lower-level capsule outputs from higher-level capsules; thus, the ladder layer implements the reverse directional inference of the agreement/disagreement mechanism of the capsule network. The experiments on MNIST demonstrate that the ladder capsule network learns an equivariant representation and improves the capability to extrapolate or generalize to pose variations.

## 1. Introduction

The convolutional neural network (CNN) has shown superhuman performance in recent years for a wide range of computer vision tasks such as image classification, segmentation, detection, and tracking. In essence, a CNN predicts whether an object (e.g., a face) exists by detecting the existence of features or object parts (e.g., eye, nose, mouth). However, the CNN only captures the existence of features and fails to capture the intrinsic spatial relationship between a part and a whole (e.g., the correct positions of the eye, nose, and mouth to form a face). This problem in the CNN is due to the max-pooling that discards the information about the pose (position, size, orientation) of features, although it contributes to the extraction of translation-invariant features.

As an alternative to the CNN, the capsule network (CapsNet), a new network architecture recently introduced by Sabour et al. (2017), learns an equivariant representation that is more robust to pose variations. The CapsNet captures various pose information of the same feature by replacing the scalar-output neurons in the CNN with vector-output capsules, and captures part-whole spatial relationships by replacing the max-pooling in the CNN with a dynamic routing algorithm. CapsNets have been shown to outperform CNNs on digit recognition, even when using a dataset of highly overlapping digits (Sabour et al., 2017). Although the dynamic routing algorithm has been shown to be effective in capturing part-whole relationships, it is inevitable that information from unnecessary lower-layer capsules is included when constructing higher-layer capsules, owing to the nature of the algorithm, which expresses the higher-layer capsules as a weighted sum of many lower-layer capsules.
Too many unnecessary capsules, similar to other over-parameterized deep learning networks, can cause confusion in delivering the necessary information to the upper layers and can ultimately lead to difficulties in performing the desired tasks (Costa et al., 2002).

In this paper, we propose a new architecture of capsule networks referred to as the ladder capsule network (L-CapsNet), based on an alternative building structure to the dynamic routing algorithm. We first demonstrate via experiments that not all lower-layer capsules are necessarily required to construct higher-layer capsules; only a part of the capsules is sufficient. To reflect this finding in the L-CapsNet, we introduce a new layer called the pruning layer, which is inspired by pruning methods that primarily aim at reducing the size of deep neural networks. The pruning layer removes the capsules with small activities and only uses the capsules with large activities. Subsequently, we can construct the higher-level capsules as a linear combination of the selected capsules from the pruning layer, similar to the CapsNet. Unlike the CapsNet, which uses an iterative routing based on the agreement between the predictions from lower-level capsules for constructing higher-level capsules and capturing the part-whole relationships, our model uses backpropagation from a loss function to reconstruct lower-level capsules from higher-level capsules. This reconstruction is performed in a new layer called the ladder layer. Similar to the ladder networks that proved their effectiveness in semi-supervised learning (Rasmus et al., 2015), the ladder layer can effectively learn representative features for reconstruction using the feedback from lower-level capsules. Through these two building blocks, the pruning and ladder layers, the L-CapsNet is shown to improve the capability to extrapolate or generalize to pose variations on the MNIST digits.

## 2. Background

### 2.1. Dynamic Routing of CapsNets

A capsule is a basic component of the CapsNet, and it is defined as a group of neurons. The activity vector of a capsule represents the probability of existence of an entity (an object or an object part) by its length and the instantiation parameters of the entity by its orientation. Let $\{u_i \in \mathbb{R}^d \mid i = 1, 2, \dots, N_l\}$ be the collection of the vector outputs of capsules $i$ in layer $l$. To construct the vector output $v_j \in \mathbb{R}^D$ of capsule $j$ in layer $(l+1)$, each $u_i$ is first multiplied by the prediction matrix $W_{ij} \in \mathbb{R}^{d \times D}$, producing the prediction vector $\hat{u}_{j|i} = W_{ij} u_i$; subsequently, a weighted sum of all prediction vectors, denoted by $s_j$, is computed as the total input of $v_j$ as follows: $s_j = \sum_i c_{ij}\,\hat{u}_{j|i}$, where $\{0 \le c_{ij} \le 1\}$ are the coupling coefficients, which are determined through the dynamic routing algorithm summarized in Algorithm 1. Finally, $v_j$ is calculated as $v_j = \frac{\|s_j\|^2}{1+\|s_j\|^2}\,\frac{s_j}{\|s_j\|}$ by applying the squashing function to $s_j$, which ensures that the length of the output vector represents the probability of existence of an entity.

**Algorithm 1** Dynamic routing algorithm (Sabour et al., 2017)

Initialize logit parameters $b_{ij} = 0$ for all capsules $i$ in layer $l$ and capsules $j$ in layer $(l+1)$.
1: **for** 1:MaxIter **do**
2: $\quad c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$ for all capsules $i$ in layer $l$.
3: $\quad s_j = \sum_i c_{ij}\,\hat{u}_{j|i}$ and $v_j = \frac{\|s_j\|^2}{1+\|s_j\|^2}\,\frac{s_j}{\|s_j\|}$ for all capsules $j$ in layer $(l+1)$.
4: $\quad b_{ij} = b_{ij} + \langle \hat{u}_{j|i}, v_j \rangle$ for all capsules $i$ in layer $l$ and capsules $j$ in layer $(l+1)$.
5: **end for**
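To make Algorithm 1 concrete, the NumPy sketch below re-implements the routing loop schematically. It is our own illustration rather than the authors' code: the function names (`squash`, `dynamic_routing`) and the toy shapes are assumptions, with `u_hat[i, j]` holding the prediction of capsule $j$ from capsule $i$.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing nonlinearity: shrinks ||s|| into [0, 1) while keeping the orientation."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iter=3):
    """Routing-by-agreement of Algorithm 1 (Sabour et al., 2017).

    u_hat: prediction vectors, shape (N_l, N_{l+1}, D); u_hat[i, j] predicts capsule j from capsule i.
    Returns v of shape (N_{l+1}, D), the higher-level capsule outputs.
    """
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                          # routing logits b_ij
    for _ in range(num_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # c_ij: softmax over j
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                                         # v_j = squash(s_j)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # b_ij += <u_hat_{j|i}, v_j>
    return v

# toy usage: 1152 lower-level predictions routed to 10 digit capsules of dimension 16
u_hat = 0.01 * np.random.randn(1152, 10, 16)
print(dynamic_routing(u_hat).shape)   # (10, 16)
```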
In the dynamic routing algorithm, two points are to be emphasized. One is the linear relationship between the lower-level and higher-level capsules. The prediction vector $\hat{u}_{j|i}$ is the prediction of the pose of entity $j$ based on the pose of entity $i$. For example, if we know where someone's nose is, we can predict where his/her face is. Furthermore, if his/her nose is moved in some direction, the predicted position of his/her face is also moved in the same direction. Hence, it is reasonable to assume a linear relationship between the lower-level and higher-level capsules. This assumption is also used in the L-CapsNet to construct higher-level capsules based on lower-level capsules.

The other is the core idea of the dynamic routing algorithm. When we consider the pose of entity $j$, the core parts of entity $j$ in the layer below should predict the pose of entity $j$ consistently, whereas irrelevant parts are likely to predict the pose differently. Using the example above again, the nose and eyes are core parts of the face; therefore, the predictions of the face pose based on the poses of the nose and eyes should be similar: their predictions should agree. In contrast, the predictions of the face pose from the poses of a pencil or laptop (i.e., irrelevant parts) should disagree. Using this idea, the dynamic routing algorithm is expected to capture the part-whole relationship (the relationship between lower-level and higher-level capsules). In the algorithm, $c_{ij}$ measures the importance of entity $i$ in constructing entity $j$ in the layer above. Moreover, a high value of $c_{ij}$ indicates that the pose of entity $j$ is similar to the predicted pose from entity $i$.

### 2.2. Pruning Techniques

Pruning is a method used to reduce network complexity by removing certain unimportant weights, neurons, or channels of the network (Luo et al., 2017). The primary purpose of pruning is to reduce the size of the network such that it can be used on devices with limited computing or storage capacity. For example, Molchanov et al. (2016) developed a greedy criteria-based pruning method that uses a Taylor expansion to approximate the parameter-importance evaluation. Aghasi et al. (2017) and Dong et al. (2017) proposed layer-wise pruning methods to reduce the complexity of deep neural networks. Structured pruning at various scale levels, such as feature maps or kernels, was used by Anwar et al. (2017) for the real-time application of a deep learning model. Pruning enabled the studies above to achieve similar or even better performance with a much smaller network, as it can improve the generalization ability of deep learning models (Thodberg, 1991; Reed, 1993; Augasta & Kathirvalavakumar, 2013). The overfitting caused by a large number of parameters can be prevented by removing unnecessary connections in the networks.

For a similar objective, to improve the generalization of the capsule network, we use pruning as a layer that removes the information of unimportant capsules and retains only the capsules that contain important information. This approach is similar to the stochastic activation pruning proposed for adversarial defense (Dhillon et al., 2018), in that the pruning is based on the activity levels of neurons in the forward direction of the network. The operation of the pruning layer is similar to k-max pooling (Kalchbrenner et al., 2014), which is a generalization of max pooling.
However, one major difference exists: our pruning layer does not lose spatial information, because the location information of the capsules is transferred to the upper layer by the so-called code vector. More detailed descriptions of the pruning layer and the code vector are presented in Section 3.1.

### 2.3. Ladder Networks

In recent years, many studies have demonstrated that supervised learning with auxiliary unsupervised representation learning can improve the network performance for supervised tasks (Suddarth & Kergosien, 1990). A ladder network, which adds an auxiliary task in the intermediate representation, is one example that is widely used in supervised (or semi-supervised) learning and has shown remarkable performance on several tasks (Pezeshki et al., 2016; Dosovitskiy & Brox, 2016). Most ladder networks add auxiliary decoding layers, each of which corresponds to an encoding layer in the network for supervised learning, and each decoding layer aims to reconstruct the output of the corresponding encoding layer. Previous studies showed that this type of network not only improves representation learning (Sønderby et al., 2016), but also achieves enhanced performance on supervised tasks (Zhang et al., 2016). The CapsNet (Sabour et al., 2017) also uses an additional reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit, which results in improved digit recognition performance compared with using only the margin loss. Inspired by these, we adopt a ladder structure in the capsule network by introducing a new layer called the ladder layer, which forms an alternative building block to the dynamic routing algorithm. Unlike previous studies, our ladder directly links to the supervised task and facilitates capturing the part-whole relationship between capsules. This architecture is similar to the what-where autoencoder (Zhao et al., 2015), for which the "where" component is similar to the code vector introduced in the L-CapsNet. We discuss the ladder layer in more detail in Section 3.3.

## 3. Components of L-CapsNet

In this section, we describe the components of the L-CapsNet. The L-CapsNet consists of three components: the pruning layer, the weight construction and propagation layer, and the ladder layer.

### 3.1. Pruning Layer

The use of pruning in the capsule network was motivated by a simple thought experiment. Let us use the example of a classification task between a human face and a car. The core entities of the face in the lower level could be the nose, eyes, or mouth, whereas the car's core entities could be the wheels, roof, or mirrors. If an input image corresponds to a human face, the routing algorithm is expected to predict the face pose based on the low-level capsules of the nose, eyes, and mouth. However, the predictions based on the wheels, roof, and mirrors are also executed in the dynamic routing, even though their capsules are not activated. In fact, we only need to consider the results of the agreement/disagreement between the core entities of the face; predictions based on the car's capsules are not required.

This argument is supported by a simple experiment in which we implemented the CapsNet with the same architecture as in Sabour et al. (2017) on the MNIST dataset. After training the network, we found that the activities of some capsules were significantly higher than those of others.
In addition, these highly activated capsules ($u_i$) tended to have relatively large coupling coefficients ($c_{ij}$) for the desired parent $v_j$. This indicates that the construction of higher-level capsules is primarily contributed by highly activated lower-level capsules; hence, the capsules with low activities need not be emphasized. Figure 1 illustrates the results for the digit '0'. The graph shows the average length of $u_i$ and the average value of $c_{ij}$ of the 100 most active capsules over 5444 samples (the total number of the digit '0' in the training set) on the left y-axis (blue bars) and the right y-axis (red bars), respectively.

Figure 1: Length of $u_i$ (left y-axis, blue bars) and value of $c_{ij}$ (right y-axis, red bars) of the 100 most active lower-level capsules for predicting the '0' digit capsule.

Inspired by these results, we introduce the pruning layer, which implements the selection of important lower-level capsules so that only the outputs of the important capsules are sent to the layer above. We expect that the pruning layer not only reduces the computational burden, but also improves the network generalization.

The pruning layer collects the outputs of the $K$ most active capsules. More specifically, consider the outputs of the level-$l$ capsules, $U^l = \{u_i \in B^d \mid i = 1, 2, \dots, n_l\}$, where $B^d$ is the unit ball in $\mathbb{R}^d$, and the corresponding activity levels (or probabilities of existence of an entity), $A^l = \{0 \le a_i \le 1 \mid i = 1, 2, \dots, n_l\}$. The orientation of $u_i$ represents the pose of an entity, and the length of $u_i$ represents the activity level, or the probability of existence of the entity, i.e., $a_i = \|u_i\|$. Before propagating the outputs in level $l$ to level $(l+1)$, we select the $K$ most active capsules. Let $a_{(i)}$ denote the $i$th highest activity level in $A^l$, such that $a_{(1)} > a_{(2)} > \dots > a_{(n)}$. Further, let $I(m)$ denote the index of the $m$th lowest activity level within $\{i \mid a_i \ge a_{(K)}\}$. We can then collect the outputs of the $K$ most active capsules as $\{u_{I(m)} \in \mathbb{R}^d \mid m = 1, 2, \dots, K\}$. Based on this collection, we construct the matrix $U^l_K \in \mathbb{R}^{K \times d}$, whose $p$th row is equal to $u_{I(p)}$. We also consider the one-hot encoded vector $c^l \in \{0, 1\}^{n_l}$ obtained from the index set $\{i \mid a_i \ge a_{(K)}\}$, which we call the code vector. The code vector records which capsules are selected, i.e., which capsules in $U^l$ are used to form $U^l_K$. For level $(l+1)$, we only propagate $U^l_K$ and $c^l$, instead of the whole capsule output $U^l$; the higher-level capsules in level $(l+1)$ are then constructed from them. Figure 2 illustrates an example of the pruning layer with $n_l = 6$ and $K = 3$: when the three capsules $u_1$, $u_4$, and $u_6$ are selected among $u_i$, $i = 1, \dots, 6$, the code vector becomes $c^l = (1, 0, 0, 1, 0, 1)$.

Figure 2: Illustration of the pruning and ladder layers.
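As a concrete illustration of the pruning layer just described, the sketch below selects the $K$ most active capsules, stacks them into $U^l_K$, and builds the code vector $c^l$. The function name `pruning_layer` and the toy inputs are our own assumptions; the selected rows are kept here in their original index order, which is one possible reading of the ordering $I(m)$ defined above.

```python
import numpy as np

def pruning_layer(U, K):
    """Keep the K most active capsules of layer l.

    U: capsule outputs, shape (n_l, d); the activity of capsule i is a_i = ||u_i||.
    Returns (U_K, code): U_K of shape (K, d) and the {0,1}^{n_l} code vector c^l.
    """
    activities = np.linalg.norm(U, axis=1)        # a_i = ||u_i||
    keep = np.sort(np.argsort(activities)[-K:])   # indices of the K largest activities,
                                                  # kept here in their original index order
    code = np.zeros(U.shape[0])
    code[keep] = 1.0                              # code vector c^l
    U_K = U[keep]                                 # rows u^l_1, ..., u^l_K
    return U_K, code

# toy usage mirroring Figure 2: n_l = 6 capsules of dimension d = 8, keep K = 3
rng = np.random.default_rng(0)
U = rng.normal(size=(6, 8))
U = U / (1.0 + np.linalg.norm(U, axis=1, keepdims=True))  # crude rescaling into the unit ball
U_K, code = pruning_layer(U, K=3)
print(U_K.shape, code)   # (3, 8) and a 0/1 vector with exactly K ones, e.g. (1, 0, 0, 1, 0, 1)
```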
### 3.2. Weight Construction and Propagation Layer

We propagate each row of $U^l_K$ to higher-level capsules by applying a linear operator, as discussed in Section 2.1. Recall that the CapsNet sets the prediction matrix $W_{ij}$ as a linear operator between all lower-level and higher-level capsules. That is, the predictions $\hat{u}_{j|i}$ from all lower-level capsules, obtained by multiplying by $W_{ij}$, are propagated to the layer above to construct the higher-level capsule outputs. In contrast, the L-CapsNet propagates only the subset of capsules selected in the pruning layer, $U^l_K$, to the next level above; thus, the same operation with $W_{ij}$ cannot be directly used.

Instead of $W_{ij}$, we use the code vector $c^l$, which contains the information about which capsules are highly active. We define a function of the code vector, $\{f_{j|i}(c^l) \in \mathbb{R}^{d \times D}\}$, which is a linear operator for propagating the $i$th row of $U^l_K$ to capsule $j$ in layer $(l+1)$. Moreover, we define another non-negative function of the code vector, $p_j(c^l) = (p_{j|1}(c^l), p_{j|2}(c^l), \dots, p_{j|K}(c^l))$, which determines how much contribution comes from the different output vectors of the layer below. Let $u^l_i$ denote the $i$th row of $U^l_K$ and $u_{j|i}$ denote the prediction made by capsule $i$. We have $u_{j|i} = u^l_i f_{j|i} \in \mathbb{R}^D$. We then compute a partial input of capsule $j$ in layer $(l+1)$, denoted by $s^{l+1}_j$, as follows:

$$s^{l+1}_j = \sum_{i=1}^{K} p_{j|i}(c^l)\, u_{j|i} \in \mathbb{R}^D. \tag{1}$$

We use the word "partial" because other components (discussed in Section 3.3) are also used to construct the output of capsule $j$ in layer $(l+1)$, $v^{l+1}_j$. More precisely, $s^{l+1}_j$ only represents the pose of entity $j$ in layer $(l+1)$. In the layer above, we obtain the activity level and, by multiplying it by $s^{l+1}_j / \|s^{l+1}_j\|$, we compute the total output $v^{l+1}_j$.

In the L-CapsNet, the functions of the code vector, $f_{j|i}$ and $p_j$, are implemented by a deep convolutional neural network architecture; we train all parameters of the network via backpropagation. This differs from the dynamic routing algorithm: we train the contribution rates of propagation, $p_j(c^l)$, whereas the dynamic routing determines $c_{ij}$ using an iterative routing algorithm.

### 3.3. Ladder Layer

The ladder layer is designed to capture the part-whole relationship; it plays a similar role as the dynamic routing algorithm, but is based on a different idea. Recall that in the dynamic routing algorithm, the predictions made from lower-level capsules are examined to see whether they agree. The L-CapsNet focuses on the reverse directional inference: lower-level capsules are regressed from higher-level capsules. The primary idea of the ladder layer is that if the higher-level capsules are well constructed, we can then infer the poses of the core entities in the layer below. Using the simple example of Section 2.1, if we know the face pose, such as its location and orientation, then we can infer the poses of the core entities of the face, the nose and eyes, reasonably well; however, it is difficult to infer the poses of irrelevant entities such as a pencil or laptop.

Recall that $U^l_K$ is constructed by selecting the $K$ most active capsules, which correspond to the entities with a high probability of existence. If the elements of $U^l_K$ are indeed the core entities of capsule $j$ in layer $(l+1)$, then we can expect the regression from $s^{l+1}_j$ to $U^l_K$ to perform well. As the lower-level and higher-level capsules are linked by a linear relationship, the regression from $s^{l+1}_j$ to $U^l_K$ is also assumed to be linear. We introduce a linear regression operator $f^{reg}_j(c^l) \in \mathbb{R}^{K \times d \times D}$, which is another function of the code vector, used for the regression of $U^l_K$ from $v^{l+1}_j$. The regressed $U^l_K$ from $v^{l+1}_j$ is denoted by $\hat{u}^{reg}_j$:

$$\hat{u}^{reg}_j = f^{reg}_j(c^l)\,(s^{l+1}_j)^{\top} \in \mathbb{R}^{K \times d}. \tag{2}$$

As $u^l_i$ lies in the unit ball $B^d$, we apply the squashing function to $\hat{u}^{reg}_j$ for downscaling, resulting in $u^{reg}_j$ in Eq. (3). Although the squashing function is nonlinear, it preserves the input orientation; therefore, the regression of the pose is still valid after applying the squashing function.

$$u^{reg}_j = \frac{\|\hat{u}^{reg}_j\|^2}{1+\|\hat{u}^{reg}_j\|^2}\,\frac{\hat{u}^{reg}_j}{\|\hat{u}^{reg}_j\|} \in B^{K \times d}. \tag{3}$$

We then define an L2-norm-based similarity measure, denoted by $d_j$, between $u^{reg}_j$ and $U^l_K$. The measure $d_j$ allows us to numerically evaluate the part-whole relationship between lower-level and higher-level capsules:

$$d_j = \exp\!\left(-\gamma\,\|u^{reg}_j - U^l_K\|^2\right), \tag{4}$$

where $\|u^{reg}_j - U^l_K\|^2 = \frac{1}{K}\sum_{i=1}^{K}\|u^{reg}_{i|j} - u^l_i\|^2$, $u^{reg}_{i|j}$ denotes the $i$th row of $u^{reg}_j$, and $\gamma$ is a non-negative hyperparameter. A value of $d_j$ close to one indicates a good fit of the regression model, implying that the lower-level entities are appropriately posed for constructing higher-level entities. Hinton et al. (2018) indicated the importance of sensitivity to the difference between good and very good agreement, which is one of the deficiencies of the dynamic routing algorithm. Hence, we choose the hyperparameter $\gamma = 5.12$, which solves $\exp(-0.01\gamma) = 0.95$, to ensure that a good fit of the regression model corresponds to a well-suited relationship between lower-level and higher-level entities.

We let $d_j$ represent the activity of capsule $j$ in layer $(l+1)$. Using $d_j$, the total output of capsule $j$ in layer $(l+1)$ is given by $v^{l+1}_j = d_j\,\frac{s^{l+1}_j}{\|s^{l+1}_j\|}$. Recall that $c^l$ provides the information about which entity's pose is represented in each row of $U^l_K$. Therefore, $c^l$ informs $f^{reg}_j$ about the entities that should be regressed from $s^{l+1}_j$, which makes the $i$th row of $\hat{u}^{reg}_j$ reconstruct $u^l_i$. It is noteworthy that the ladder layer trains all weights via backpropagation, unlike the CapsNet, which uses an iterative routing. This reduces the computational cost of the L-CapsNet. A comparison of computation time between the CapsNet and L-CapsNet is presented in Section 5.2.
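To make the data flow of Eqs. (1)-(4) explicit, the following sketch computes one higher-level capsule from the $K$ selected lower-level capsules. It is only a schematic illustration under our own assumptions: the arrays `f`, `p`, and `f_reg` stand in for the code-vector functions $f_{j|i}(c^l)$, $p_j(c^l)$, and $f^{reg}_j(c^l)$ (fixed toy tensors here rather than the networks of Table 1), and the squash of Eq. (3) is applied over the whole $K \times d$ matrix, which is one possible reading of the equation.

```python
import numpy as np

def squash(x, eps=1e-8):
    """Squashing nonlinearity of Eq. (3), applied over all elements of x."""
    n2 = np.sum(x ** 2)
    return (n2 / (1.0 + n2)) * x / np.sqrt(n2 + eps)

def ladder_capsule_j(U_K, f, p, f_reg, gamma=5.12):
    """Compute one higher-level capsule j from the K selected lower-level capsules.

    U_K:   (K, d)    selected lower-level capsule outputs (rows u^l_i)
    f:     (K, d, D) stand-in for the propagation operators f_{j|i}(c^l)
    p:     (K,)      stand-in for the contribution weights p_{j|i}(c^l), in [0, 1]
    f_reg: (K, d, D) stand-in for the regression operators f^reg_j(c^l)
    Returns (v_j, d_j): the total capsule output and its activity.
    """
    # Eq. (1): partial input s_j = sum_i p_{j|i} * (u^l_i f_{j|i})
    u_pred = np.einsum('kd,kdD->kD', U_K, f)
    s_j = np.einsum('k,kD->D', p, u_pred)

    # Eq. (2): regress the selected lower-level capsules back from s_j
    u_reg_hat = np.einsum('kdD,D->kd', f_reg, s_j)
    # Eq. (3): squash the regressed matrix back toward the unit ball
    u_reg = squash(u_reg_hat)

    # Eq. (4): activity d_j from the mean squared regression error;
    # gamma = 5.12 solves exp(-0.01 * gamma) = 0.95, as chosen in Section 3.3
    err = np.mean(np.sum((u_reg - U_K) ** 2, axis=1))
    d_j = np.exp(-gamma * err)

    # total output v_j = d_j * s_j / ||s_j||
    v_j = d_j * s_j / (np.linalg.norm(s_j) + 1e-8)
    return v_j, d_j

# toy usage: K = 3 selected capsules with d = 8, one digit capsule with D = 16
rng = np.random.default_rng(0)
K, d, D = 3, 8, 16
v_j, d_j = ladder_capsule_j(rng.normal(size=(K, d)) * 0.3,
                            rng.normal(size=(K, d, D)) * 0.1,
                            rng.uniform(size=K),
                            rng.normal(size=(K, d, D)) * 0.1)
print(v_j.shape, float(d_j))   # (16,) and an activity in (0, 1]
```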
## 4. Loss on L-CapsNet

Sabour et al. (2017) proposed the margin loss for classification using the capsule network. We also apply this loss, which is expressed for the L-CapsNet as follows:

$$L^{margin}_k = T_k\,\max(0,\, m^+ - d_k)^2 + \lambda\,(1 - T_k)\,\max(0,\, d_k - m^-)^2,$$

where the $T_k$ are the one-hot encoded labels; $m^+$, $m^-$, and $\lambda$ are non-negative hyperparameters; $d_k$ is the activity level of capsule $k$ in the final layer; and the number of capsules in the final layer is equal to the number of labels. In Section 5, we trained the L-CapsNet with the margin loss using $m^+ = 0.9$, $m^- = 0.1$, and $\lambda = 0.5$. In addition, we found that adding the loss of the difference between the code vector and the lower-level activity levels, $\|c^l - A^l\|^2$, is helpful for training; thus, we trained the L-CapsNet with the loss $L = L^{margin} + \epsilon\,\|c^l - A^l\|^2$ with $\epsilon = 0.0001$. We used the Adam optimizer with an exponentially decaying learning rate starting from 0.001.
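For illustration, the snippet below transcribes the margin loss and the code-vector penalty with the hyperparameters quoted above ($m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$, $\epsilon = 0.0001$). It is a schematic NumPy version under our assumptions, not the authors' implementation, and the toy inputs are invented.

```python
import numpy as np

def margin_loss(d, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss of Section 4.

    d: (num_classes,) activities d_k of the final-layer capsules, in [0, 1].
    T: (num_classes,) one-hot encoded label.
    """
    present = T * np.maximum(0.0, m_pos - d) ** 2
    absent = lam * (1.0 - T) * np.maximum(0.0, d - m_neg) ** 2
    return np.sum(present + absent)

def total_loss(d, T, code, activities, eps=1e-4):
    """L = L_margin + eps * ||c^l - A^l||^2, with eps = 0.0001 as in the paper."""
    return margin_loss(d, T) + eps * np.sum((code - activities) ** 2)

# toy usage: 10 classes, true class 3 predicted with high activity
d = np.full(10, 0.05); d[3] = 0.95
T = np.zeros(10); T[3] = 1.0
code = np.zeros(288); code[:50] = 1.0                  # toy code vector c^l with K = 50 ones
activities = np.random.uniform(0.0, 1.0, size=288)     # toy primary-capsule activities A^l
print(margin_loss(d, T), total_loss(d, T, code, activities))
```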
## 5. L-CapsNet Architecture and Experiments

### 5.1. L-CapsNet Architecture

The general architecture of the L-CapsNet is depicted in Figure 3. We start with a convolutional layer having a 9×9 kernel and 256 channels with a stride of 1 and the ReLU activation function. This layer propagates the activities of the local feature detectors to the primary capsule layer, similar to the work of Sabour et al. (2017), by applying 8 convolution units with a 9×9 kernel, 32 channels, and a stride of 2. Subsequently, for an $I \times I$ pixel image, the primary capsules have a total of $n_I \times n_I \times 32$ 8D capsule outputs, where $n_I = \lceil (I-16)/2 \rceil$. Each primary capsule output sees the outputs of all convolution units whose receptive fields overlap with the location of the center of the capsule.

Figure 3: L-CapsNet architecture.

We first construct $n_l = n_I \times n_I \times 32$ primary capsule inputs $t_i \in \mathbb{R}^8$; we then obtain the primary capsule outputs $u_i = \mathrm{squash}(t_i)$ by applying the squashing function, and the corresponding activities $a_i = \frac{\|t_i\|^2}{1+\|t_i\|^2}$. The primary capsule outputs are then propagated to the pruning layer, which selects the $K$ most active capsules, forming $U^{pri}_K$, which we call the core primary capsules, and the corresponding code vector $c^{pri}$ is obtained. Based on $c^{pri}$, we construct $f_{j|i}(c^{pri})$, $p_j(c^{pri})$, and $f^{reg}_j(c^{pri})$ using a combination of fully connected layers and convolution layers. More detailed structures are presented in Table 1.

Table 1: Architecture of the functions of the code vector (output dimensions per layer).

| Layer | Type and activation | $f_{j\mid i}(c^{pri})$ | $f^{reg}_j(c^{pri})$ | $p_j(c^{pri})$ |
|---|---|---|---|---|
| 1st layer | fully connected, ReLU | 8×16×3 | 8×16×3 | K×3 |
| 2nd layer | fully connected, ReLU | 8×16×2 | 8×16×2 | K×2 |
| 3rd layer | fully connected, ReLU | 8×16 | 8×16 | K |
| 4th layer | fully connected; linear for $f_{j\mid i}(c^{pri})$ and $f^{reg}_j(c^{pri})$, sigmoid for $p_j(c^{pri})$ | 8×16×K | 8×16×K | K |

By applying these linear operators for propagation (i.e., $f_{j|i}(c^{pri})$ and $p_j(c^{pri})$) and for regression (i.e., $f^{reg}_j(c^{pri})$), we obtain the digit capsules and the regressed primary capsules. As mentioned in Section 4, the number of digit capsules equals the number of labels, and we set the dimension of the digit capsules to 16. By computing the similarity between the core primary capsules and the regressed primary capsules using Eq. (4), the activity of each digit capsule is obtained, which is subsequently used to compute the loss in Section 4.

### 5.2. Experiments on MNIST

To evaluate the performance of the L-CapsNet, we performed two experiments. For the first experiment, we trained on 60,000 images and tested on 10,000 images of the 28×28 MNIST. For the second experiment, we trained on 60,000 images of the 40×40 expanded MNIST and tested on 10,000 images of affNIST. We assumed the same experimental environment (i.e., 9×9 kernels for the ReLU Conv1 and PrimaryCaps layers) as in Sabour et al. (2017) in both experiments. This setting produces 1152 primary capsules for the MNIST experiment and 4608 primary capsules for the expanded MNIST and affNIST experiment. Furthermore, to consider various ratios of the number of selected capsules to the total number of capsules, we considered additional experimental settings with a kernel size of 15×15. In the MNIST experiment, we changed the kernel size of PrimaryCaps from 9×9 to 15×15, and in the expanded MNIST and affNIST experiment, we set the kernel size to 15×15 in both layers. These settings produce 288 and 1152 primary capsules on MNIST and affNIST, respectively.

We performed the experiments with several values of $K$ for pruning. Table 2 shows the test error for each case, together with the results of the CNN and CapsNet reported in Sabour et al. (2017). Although the L-CapsNet resulted in higher test errors on MNIST classification, it dramatically outperformed the other methods on the affNIST test set.
Table 2: Test error results of L-CapsNets.

| Method | K | MNIST (%) | affNIST (%) |
|---|---|---|---|
| CNN (Sabour et al., 2017) | - | 0.39 | 34.0 |
| CapsNet (Sabour et al., 2017) | - | 0.25 | 21.0 |
| L-CapsNet (9×9 kernel) | 50 | 0.74 | 13.0 |
| L-CapsNet (9×9 kernel) | 70 | 0.50 | 12.5 |
| L-CapsNet (9×9 kernel) | 100 | 0.80 | 13.2 |
| L-CapsNet (15×15 kernel) | 50 | 0.69 | 12.5 |
| L-CapsNet (15×15 kernel) | 70 | 0.73 | 12.2 |
| L-CapsNet (15×15 kernel) | 100 | 0.79 | 13.1 |

Recall that the expanded MNIST is constructed by translating MNIST, whereas affNIST is constructed by applying affine transformations to MNIST; i.e., affNIST incorporates more variations (e.g., rotation). The outperforming results on affNIST show that the L-CapsNet learns an equivariant representation that is more robust to pose variations of MNIST digits than the CapsNet.

Moreover, we compared the computation time of the L-CapsNet and CapsNet. We varied the value of $K$ in the L-CapsNet as 50, 70, and 100, while varying the number of routing iterations (denoted by $r$) in the dynamic routing algorithm of the CapsNet as 3, 4, and 5. Table 3 presents the average computation time of one training iteration over 100 batch samples on MNIST, with standard errors in parentheses. The results show that the L-CapsNet significantly reduces the computation time. This may be because only the selected lower-level capsules from the pruning layer are sent to the layer above, and all weights are trained via backpropagation through the ladder layer in the L-CapsNet, whereas all lower-level capsules are used in the dynamic routing algorithm with several routing iterations in the CapsNet.

Table 3: Average computation time of one training iteration over 100 batch samples on MNIST. $r$: the number of routing iterations. Standard errors in parentheses.

| Method | Computation time (seconds) |
|---|---|
| L-CapsNet (K = 50) | 0.2034 (0.010) |
| L-CapsNet (K = 70) | 0.2159 (0.008) |
| L-CapsNet (K = 100) | 0.2953 (0.001) |
| CapsNet (r = 3) | 1.732 (0.026) |
| CapsNet (r = 4) | 2.123 (0.041) |
| CapsNet (r = 5) | 2.656 (0.085) |

### 5.3. Analysis of the Effects of K

In the L-CapsNet, only the $K$ most active capsules are used for the classification task. Recall the example in Section 3.1 again: when the input is a human face, we hope that the core entities (e.g., nose, eyes, and mouth) are captured by the $K$ most active capsules, while the capsules representing irrelevant entities (e.g., wheels and mirrors) are deactivated. In other words, we hope that the L-CapsNet clearly discriminates between capsules for core entities and those for irrelevant entities. We found from our experiments that this can be achieved by appropriately selecting $K$.

Figure 4: Activity levels of the primary capsules in the L-CapsNet.

We performed an analysis to examine the effects of the hyperparameter $K$ in the L-CapsNet. Figure 4 illustrates the average activity levels of the 300 most active primary capsules in the L-CapsNet over 5444 samples of the digit '0' in the MNIST training set, for $K$ values of 50, 70, and 100. We can see that for low $K$ values (i.e., K = 50 and 70), the activity levels are either extremely high or extremely low. That is, the active and inactive capsules are distinguished clearly. This supports our adoption of $K$, motivated by the need for using only important capsules for robust performance. When $K = 100$, the distinction between active and inactive capsules is not as clear as in the cases of K = 50 and 70, perhaps because using more capsules than actually needed is not very effective in capturing disentangled entities and their poses.
In contrast, when $K$ is low, each capsule may capture a disentangled entity and its pose, which results in a clear discrimination between active and inactive capsules, representing core and irrelevant entities, respectively. As Lenssen et al. (2018) pointed out the importance of disentangled representations for equivariance, low $K$ values would be preferred for our experiments.

## 6. Conclusion

In this paper, we proposed a new building block of the capsule network that removes irrelevant capsules without losing information about the spatial relationship between lower-level and higher-level entities, based on our finding that only part of the entities (i.e., the core entities) contributes significantly to capturing the part-whole spatial relationships. While the CapsNet captures the part-whole relationships by using iterative routing-by-agreement, the L-CapsNet achieves the same goal by using both the pruning and ladder layers. More specifically, the pruning layer selects relevant lower-level capsules based on the activity level, using the fact that the activity level represents the probability of existence of an entity. In fact, we showed that highly active lower-level capsules tend to have large coupling coefficients for the desired parent (Figure 1) and that the L-CapsNet clearly discriminates between relevant and irrelevant capsules using the activity level (Figure 4). The higher-level capsules can then be constructed as a linear combination of the selected lower-level capsules. Unlike the CapsNet, which takes the inner product between a higher-level capsule and the prediction from the layer below as the agreement rule, the L-CapsNet takes how well the lower-level capsule outputs are regressed from the higher-level capsules as the agreement rule.

We select the $K$ most active capsules through the pruning layer and subsequently propagate the activities of only the selected capsules to the layer above, from which we can expect robust behavior of the network. However, the method for determining an appropriate value of $K$ according to the characteristics of the data or task should be studied further. In our MNIST and affNIST experiments, a value of $K$ of approximately 6% of the total number of capsules produced the best results.

Another important component of the L-CapsNet is the ladder layer. It is motivated by the reverse directional inference of the agreement/disagreement rule of the CapsNet. Based on our outperforming results on the affNIST data, we may conclude that the ladder layer provides the L-CapsNet with extrapolation capability, or robustness. The reverse directional inference was similarly considered by Hinton et al. (2018). In their work, Hinton et al. (2018) approximate the misprediction of lower-level capsule outputs by higher-level capsules using the negative log probability density, avoiding the expensive matrix inversion of the prediction matrix $W_{ij}$. The L-CapsNet learns the inverse linear transformation $f^{reg}$ with a neural network rather than exactly inverting $W_{ij}$ or $f_{j|i}$.

For more extensive research, we plan to develop an algorithm for a bi-directional agreement/disagreement mechanism, based on the idea that we can not only predict higher-level capsule outputs from lower-level capsules, but can also regress lower-level capsule outputs from higher-level capsules. We expect this exchangeability to enable more robust capsule outputs to be extracted and to improve the task performance. The code for the L-CapsNet is available at https://github.com/taewonjeong/L-CapsNet.
## Acknowledgements

The authors thank the reviewers for reviewing the paper and providing valuable comments. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1C1B6004511).

## References

Aghasi, A., Abdi, A., Nguyen, N., and Romberg, J. Net-Trim: Convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems, pp. 3177–3186, 2017.

Anwar, S., Hwang, K., and Sung, W. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3):32, 2017.

Augasta, M. G. and Kathirvalavakumar, T. Pruning algorithms of neural networks: a comparative study. Central European Journal of Computer Science, 3(3):105–115, 2013.

Costa, M. A., Braga, A. P., and de Menezes, B. R. Improving neural networks generalization with new constructive and pruning methods. Journal of Intelligent & Fuzzy Systems, 13(2-4):75–83, 2002.

Dhillon, G. S., Azizzadenesheli, K., Lipton, Z. C., Bernstein, J., Kossaifi, J., Khanna, A., and Anandkumar, A. Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442, 2018.

Dong, X., Chen, S., and Pan, S. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4860–4874, 2017.

Dosovitskiy, A. and Brox, T. Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4829–4837, 2016.

Hinton, G. E., Frosst, N., and Sabour, S. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 655–665, 2014.

Lenssen, J. E., Fey, M., and Libuschewski, P. Group equivariant capsule networks. arXiv preprint arXiv:1806.05086, 2018.

Luo, J.-H., Wu, J., and Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5058–5066, 2017.

Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. In Proceedings of the International Conference on Learning Representations, 2016.

Pezeshki, M., Fan, L., Brakel, P., Courville, A., and Bengio, Y. Deconstructing the ladder network architecture. In International Conference on Machine Learning, pp. 2368–2376, 2016.

Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pp. 3546–3554, 2015.

Reed, R. Pruning algorithms: a survey. IEEE Transactions on Neural Networks, 4(5):740–747, 1993.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3859–3869, 2017.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pp. 3738–3746, 2016.

Suddarth, S. C. and Kergosien, Y. Rule-injection hints as a means of improving network performance and learning time. In Neural Networks, pp. 120–129. Springer, 1990.

Thodberg, H. H. Improving generalization of neural networks through pruning. International Journal of Neural Systems, 1(4):317–326, 1991.
Zhang, Y., Lee, K., and Lee, H. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In International Conference on Machine Learning, pp. 612–621, 2016.

Zhao, J., Mathieu, M., Goroshin, R., and LeCun, Y. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.