# Self-Routing Capsule Networks

Taeyoung Hahn, Myeongjang Pyeon, Gunhee Kim
Seoul National University, Seoul, Korea
{taeyounghahn,mjpyeon,gunhee}@snu.ac.kr
http://vision.snu.ac.kr/projects/self-routing

Capsule networks have recently gained a great deal of interest as a new architecture of neural networks that can be more robust to input perturbations than similar-sized CNNs. Capsule networks have two major distinctions from conventional CNNs: (i) each layer consists of a set of capsules that specialize in disjoint regions of the feature space, and (ii) routing-by-agreement coordinates connections between adjacent capsule layers. Although routing-by-agreement can filter out noisy predictions of capsules by dynamically adjusting their influences, its unsupervised clustering nature causes two weaknesses: (i) high computational complexity and (ii) a cluster assumption that may not hold in the presence of heavy input noise. In this work, we propose a novel and surprisingly simple routing strategy called self-routing, in which each capsule is routed independently by its subordinate routing network. Agreement between capsules is therefore no longer required; instead, both the poses and the activations of upper-level capsules are obtained in a way similar to Mixture-of-Experts. Our experiments on CIFAR-10, SVHN, and smallNORB show that self-routing is more robust against white-box adversarial attacks and affine transformations while requiring less computation.

## 1 Introduction

In the past years, deep convolutional neural networks (CNNs) have become the de-facto standard architecture in image classification tasks, thanks to their high representational power. However, an important yet unanswered question is whether deep networks can truly generalize. Well-trained networks can be catastrophically fooled by images with carefully designed perturbations that are not even recognizable by human eyes [23, 29, 37, 38]. Furthermore, natural, non-adversarial pose changes of familiar objects are enough to trick deep networks [1, 8]. The latter is more concerning, since natural pose changes are ubiquitous in the real world.

Some research [5, 15] has argued that neural networks should aim for equivariance, not invariance. The reasoning is that, by preserving variations of an entity in a group of neurons (equivariance) rather than only detecting its existence (invariance), it becomes easier to learn the underlying spatial relations and thus to generalize better. Following this argument, a new network architecture called capsule networks (CapsNets) and a mechanism called routing-by-agreement have been introduced [16, 34]. In this design, each capsule contains a pose (or instantiation parameters) that encodes patterns of the entity it is responsible for. Active capsules in one layer make pose predictions for capsules in the next layer via transformation matrices. The routing algorithm then finds a center of mass among the predictions via iterative clustering and ensures that only the majority opinion is passed down to the next layer.

While routing-by-agreement [16, 34] has been shown to be effective, its unsupervised clustering nature causes two inherent weaknesses. First, it requires repeatedly computing means and membership scores of prediction vectors. This makes CapsNets computationally much heavier than one-pass feed-forward CNNs.
Second, it makes assumptions on the cluster distributions of predictions. This might not be a problem if training a CapsNet successfully fits its weights to those assumptions. However, routing-by-agreement tends to fail when many prediction vectors become noisy, so that they cluster in an unexpected form.

In this work, we aim to overcome the above limitations by proposing a new and surprisingly simple routing strategy that no longer involves agreement. In our algorithm, the contribution of a lower-level capsule to a higher-level capsule is determined by its activation and the routing decision of its subordinate routing network. We refer to this design as self-routing. To the best of our knowledge, there is no previous work on removing routing-by-agreement from CapsNets.

Our method is motivated by the structural resemblance between CapsNets and Mixture-of-Experts (MoE) [18, 7, 36, 20]. They are similar in that their composing units (i.e., capsules and experts) specialize in different regions of the input space and that their contributions are adjusted per example. One key difference is that, in CapsNets, gating values are dynamically adjusted via routing-by-agreement to suppress potentially unreliable submodules. However, if the robustness of CapsNets can be retained without routing-by-agreement, then we can safely remove the unsupervised clustering that causes the two aforementioned weaknesses.

For evaluation, we compare our self-routing to the two most prominent routing-by-agreement methods, dynamic [34] and EM [16] routing, on the CIFAR-10 [22], SVHN [42], and smallNORB [24] datasets. We compare not only classification accuracies but also robustness under adversarial attacks and viewpoint changes. For fairness, we use the same CNN base architectures (e.g., a 7-layer CNN and ResNet-20) and replace only the last layers of the original networks with the respective capsule layers. Compared to the previous routing-by-agreement methods [16, 34], self-routing achieves better classification performance in most cases while using significantly fewer FLOPs. Moreover, it shows stronger robustness under both adversarial perturbations and viewpoint changes. We also show that our self-routing benefits more from increases in model size (i.e., wider capsule layers), whereas the previous methods often degrade.

## 2 Related Work

Capsule networks. Recently, capsule networks have been actively applied to many domains, such as generative models [19], object localization [27], and graph networks [40], to name a few. Hinton et al. [15] first introduced the idea of capsules and equivariance in neural networks. In their work, autoencoders are trained to generate images with a desired transformation, yet the model requires the transformation parameters to be supplied externally. Later, Sabour et al. [34] proposed a more complete model in which transformations are directly learnable from data. To control the information flow between adjacent capsule layers, they employed a mechanism named dynamic routing. Since then, alternative routing methods have been suggested. Hinton et al. [16] proposed to use Gaussian-mixture clustering. Bahadori et al. [3] made convergence faster via eigendecomposition of prediction vectors. Wang et al. [41] formalized the routing process to suggest a theoretically refined version.
Li et al. [25] approximated the routing process with the interaction between master and aide branches. Compared to all of the previous work, our routing approach is free of agreement and instead focuses on the mixture-of-experts behavior of capsules.

Mixture-of-experts. There have been many attempts to incorporate mixture-of-experts (MoE) into deep network models. Eigen et al. [7] stacked multiple layers of MoE to create an exponentially increasing number of inference paths. Shazeer et al. [36] used a sparsely-gated MoE between stacked LSTM layers to expand model capacity with only a minor loss in computational efficiency. In [30], architectural diversity is added by allowing experts to be an identity function or a pooling operation. Kirsch et al. [21] modularized a network so that neural modules can be selected on a per-example basis. In [33], a routing network is trained to choose appropriate function blocks for the input and task. This work interprets the origin of CapsNets' strengths as MoE-like behavior, and this perspective leads to our self-routing design.

CNN fragility. Despite the great success of CNNs, many recent studies have raised concerns about their robustness [8, 9, 14]. Unlike humans, CNNs easily yield incorrect answers when presented with rotated images [8, 9]. Surprisingly, little improvement in noise robustness is observed even for recent deep CNNs that are highly successful on image classification tasks [14], and their fragility may be hard to overcome with data augmentation techniques [9]. There are also a number of methods, called adversarial attacks, that fool CNNs by creating images whose fabrication is hardly perceivable even to humans [23, 29, 38, 37]. In this work, we evaluate the robustness of CapsNets, which have been proposed as an alternative or a supplement to deep CNNs. In Section 5, we demonstrate that adding only one routing layer structured by capsules can significantly improve robustness against adversarial attacks and affine transformations.

## 3 Preliminaries

We first review the basics of capsule networks and the two most popular routing algorithms: dynamic [34] and EM routing [16].

### 3.1 Capsule Formulation

A capsule network [16, 34] is composed of layers of capsules. Let $\Omega_l$ denote the set of capsules in layer $l$. Each capsule $i \in \Omega_l$ has a pose vector $u_i$ and an activation scalar $a_i$. In addition, a weight matrix $W^{pose}_{ij}$ for every capsule $j \in \Omega_{l+1}$ predicts pose changes: $\hat{u}_{j|i} = W^{pose}_{ij} u_i$. The pose vector of capsule $j$ is a linear combination (possibly followed by an activation function) of the prediction vectors: $u_j = \sum_i c_{ij} \hat{u}_{j|i}$, where $c_{ij}$ is a routing coefficient determined by a routing algorithm. In the convolutional case, the capsules within a $K \times K$ neighborhood in $\Omega_l$ define a capsule in $\Omega_{l+1}$, where $K$ is the kernel size.

The formulation of capsules varies according to the design of the routing algorithm. In dynamic routing [34], the pose is defined as a vector, and its length is used as the activation. In EM routing [16], the pose is defined as a matrix, and the activation scalar is defined separately. In our method, we use a vector pose with a separate activation scalar.
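For concreteness, here is a minimal PyTorch sketch of the prediction step above: each lower-level pose $u_i$ is transformed by $W^{pose}_{ij}$ into a prediction $\hat{u}_{j|i}$, and upper-level poses are formed as combinations weighted by routing coefficients $c_{ij}$ supplied by whatever routing algorithm is in use. All tensor names, shapes, and the uniform coefficients in the example are illustrative choices of ours, not the authors' implementation.

```python
import torch

# Illustrative sizes (not from the paper): n_in lower-level capsules of
# dimension d_in, n_out upper-level capsules of dimension d_out.
n_in, d_in, n_out, d_out = 32, 16, 10, 16

# One transformation matrix W^pose_ij per (lower, upper) capsule pair.
W_pose = 0.05 * torch.randn(n_in, n_out, d_in, d_out)

def predict_poses(u):
    """u: (batch, n_in, d_in) lower-level poses -> (batch, n_in, n_out, d_out) predictions."""
    # u_hat_{j|i} = W^pose_{ij} u_i for every pair (i, j)
    return torch.einsum('bni,nmio->bnmo', u, W_pose)

def combine(u_hat, c):
    """Weighted combination u_j = sum_i c_ij * u_hat_{j|i}.
    u_hat: (batch, n_in, n_out, d_out), c: (batch, n_in, n_out)."""
    return torch.einsum('bnm,bnmo->bmo', c, u_hat)

# Example with uniform routing coefficients (i.e., no routing at all).
u = torch.randn(4, n_in, d_in)
u_hat = predict_poses(u)
c = torch.full((4, n_in, n_out), 1.0 / n_out)
u_upper = combine(u_hat, c)   # (4, n_out, d_out)
```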
### 3.2 Dynamic and EM Routing

A capsule is activated when multiple predictions by lower-level capsules agree; in other words, its activation depends on how tightly the prediction vectors are clustered. The routing coefficients from a lower-level capsule to all upper-level capsules sum to one (i.e., $\sum_j c_{ij} = 1$) and are iteratively adjusted so that a lower-level capsule $i$ has more influence on an upper-level capsule $j$ when the $i$-th prediction is close to the mean of the predictions that $j$ receives. We review below how the two most popular routing methods compute the routing coefficient $c_{ij}$ from capsule $i$ to capsule $j$.

In dynamic routing [34], cosine similarity is used to measure agreement. The routing logits $b_{ij}$ are initialized to 0 and adjusted iteratively by

$$b^{(t+1)}_{ij} \leftarrow b^{(t)}_{ij} + \hat{u}_{j|i} \cdot u^{(t)}_j \quad \text{for } t = 1, \dots, k, \qquad (1)$$

where $t$ is the iteration number. The routing coefficients $c_{ij}$ are obtained by applying a softmax to $b_{ij}$ along the $j$-th dimension, and $u^{(t)}_j$ is obtained by applying the non-linear squash function $f_{\text{squash}}(s) = \frac{\|s\|^2}{1+\|s\|^2} \frac{s}{\|s\|}$ to $\sum_i c^{(t)}_{ij} \hat{u}_{j|i}$. The coefficients $c$ and the poses $u_j$ are updated alternately for $k$ iterations.

In EM routing [16], the prediction vectors $\hat{u}_{j|i}$ are assumed to follow capsule $j$'s Gaussian model with mean $\mu_j$ and variance $\mathrm{diag}[\sigma^2_{j,1}, \sigma^2_{j,2}, \dots, \sigma^2_{j,h}]$, where $h$ is the dimension of $u_j$ and $\hat{u}_{j|i}$:

$$\hat{u}_{j|i} \sim \mathcal{N}(\mu_j, \mathrm{diag}[\sigma^2_{j,1}, \sigma^2_{j,2}, \dots, \sigma^2_{j,h}]) \quad \text{for all } i \in \Omega_l.$$

The iterative routing process is conducted with the routing coefficients $c^{(1)}_{ij}$ initialized uniformly, and for each iteration $t = 1, \dots, k$,

$$a^{(t)}_j, \mu^{(t)}_j, \sigma^{(t)}_j \leftarrow \text{M-step}(a_i, c^{(t)}_{ij}, \hat{u}_{j|i}), \qquad c^{(t+1)}_{ij} \leftarrow \text{E-step}(a^{(t)}_j, \mu^{(t)}_j, \sigma^{(t)}_j), \qquad (2)$$

where $\mathrm{cost}_j = \sum_{i \in \Omega_l} c_{ij} (\beta_u - \log p(\hat{u}_{j|i}))$ and $a^{(t)}_j = \mathrm{sigmoid}\big(\lambda(\beta_a - \mathrm{cost}^{(t)}_j)\big)$, with $a_j = a^{(k)}_j$ and $u_j = \mu^{(k)}_j$. $\beta_a$ and $\beta_u$ are learned discriminatively, and $\lambda$ is a hyperparameter that increases during training on a fixed schedule.
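As a reference point for the iterative cost discussed earlier, the dynamic-routing loop of Eq. (1) can be sketched as follows (EM routing alternates an M-step and an E-step in the same looping pattern). Shapes, helper names, and the default of three iterations are our own illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # f_squash(s) = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: (batch, n_in, n_out, d_out) prediction vectors u_hat_{j|i}."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits b_ij
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                              # softmax over upper capsules j
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)             # s_j = sum_i c_ij u_hat_{j|i}
        v = squash(s)                                        # u_j = squash(s_j)
        # Agreement update: b_ij <- b_ij + u_hat_{j|i} . u_j
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v, c
```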
In Section 4.1, we discuss two distinctive strengths of CapsNets. In Section 4.2, we propose our self-routing approach, whose goal is to maximize the merits of CapsNets while minimizing the undesired side effects.

Figure 1: Comparison between (a) capsule networks and (b) mixture-of-experts. Given routing coefficients, the only computational difference is that capsule networks use disjoint input representations for each capsule unit.

Figure 2: Conceptual comparison between (a) routing-by-agreement and (b) our proposed self-routing. In self-routing, subordinate routing networks $W^{route}_i$ fed with the pose vectors $u_i$ produce the routing coefficients $c_{ij}$, rather than unsupervised clustering on the prediction vectors $\hat{u}_{j|i}$.

### 4.1 Motivation

We view CapsNets as having two distinctive characteristics compared to conventional CNNs: (i) mixture-of-experts behavior and (ii) a noise filtering mechanism.

Behavior of MoE. Capsules specialize in disjoint regions of the feature space and make multiple predictions based on the information available to them in those regions. In each capsule layer, this structure naturally forms an ensemble of submodules that are activated differently per example, in a way similar to Mixture-of-Experts (MoE) [18, 7, 36, 20]. Compared to MoE, the division of labor is more explicit since different capsules do not share the same feature space (see Figure 1). The discriminative power of each capsule may not be as strong as in CNNs, where the entire feature map of a layer is used to produce that layer's single output. However, aggregating predictions from weaker modules with different parameters effectively prevents overfitting and can thus reduce the output variance with respect to small input variations.

Noise filtering mechanism. In CapsNets, the initial gating values (i.e., activation scalars) of capsules are dynamically adjusted. Routing-by-agreement ensures that predictions far from the general consensus have less influence on the output of each layer. In other words, the process can effectively filter out the contribution of submodules carrying possibly noisy information. Additionally, output capsules whose predictions have high variance are further suppressed. CNNs, on the other hand, have no mechanism for separating potentially noisy channels from clean ones.

In this work, we aim to design a routing method that mainly focuses on the first characteristic of CapsNets. The second property is beneficial but comes from agreement-based routing, which unfortunately causes two critical side effects: (i) high computational complexity and (ii) assumptions on the cluster distribution of prediction vectors. Specifically, the previous routing methods assume a spherical or normal distribution of prediction vectors, which is unlikely to hold due to the high variability and noisiness of real-world data; indeed, clustering noisy data remains a challenging task in itself. Hence, our key intuition is to remove the notion of agreement from the routing process and to introduce a learnable routing network for each capsule instead (Section 4.2). Although the clustering in agreement-based routing can help reduce the output variance, we empirically observe that, even without routing-by-agreement, the simple weighted average of the new routing method has a similar effect. That is, the unreliability of a single prediction can be mitigated by ensemble averaging, since the errors of the submodules (capsules) average out to provide a stable combined output.

### 4.2 Self-Routing

We name the CapsNet model with our proposed self-routing SR-CapsNet. Figures 2(a) and (b) illustrate the high-level difference between self-routing and routing-by-agreement. In self-routing, each capsule determines its routing coefficients by itself, without coordinating agreement with peer capsules. Instead, each capsule is endowed with higher modeling power through a subordinate routing network. Following the MoE literature [36], we design the routing network as a single-layer perceptron, although it is straightforward to extend it to an MLP. The routing coefficients also serve as the predicted activations of output capsules: an upper-level capsule is more likely to be activated if more capsules have high routing coefficients to it.

Self-routing involves two learnable weight matrices, $W^{route}$ and $W^{pose}$, which are used to compute the routing coefficients $c_{ij}$ and the predictions $\hat{u}_{j|i}$, respectively. Each layer of the routing network multiplies the pose vector $u_i$ by a trainable weight matrix $W^{route}_i$ and outputs the routing coefficients $c_i$ via a softmax layer. The routing coefficients $c_{ij}$ are then multiplied by the capsule's activation scalar $a_i$ to generate weighted votes. The activation $a_j$ of an upper-level capsule is simply the summation of the weighted votes of the lower-level capsules over the spatial dimensions $H \times W$ (or $K \times K$ when using convolution). In summary,

$$c_{ij} = \mathrm{softmax}(W^{route}_i u_i)_j, \qquad a_j = \frac{\sum_{i \in \Omega_l} c_{ij} a_i}{\sum_{i \in \Omega_l} a_i}. \qquad (3)$$

The pose of the upper-level capsule is determined by the weighted average of the prediction vectors:

$$\hat{u}_{j|i} = W^{pose}_{ij} u_i, \qquad u_j = \frac{\sum_{i \in \Omega_l} c_{ij} a_i \hat{u}_{j|i}}{\sum_{i \in \Omega_l} c_{ij} a_i}. \qquad (4)$$
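Below is a minimal sketch of Eqs. (3) and (4) as a single fully-connected self-routing layer. The class name, layer sizes, initialization, and the small epsilon terms are illustrative assumptions of ours; the convolutional variant would additionally aggregate over the $K \times K$ spatial neighborhood in the same way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfRouting(nn.Module):
    """Fully-connected self-routing layer: n_in capsules (d_in dims) -> n_out capsules (d_out dims)."""

    def __init__(self, n_in, d_in, n_out, d_out):
        super().__init__()
        # W^route_i: one single-layer perceptron per lower-level capsule (Eq. 3).
        self.W_route = nn.Parameter(0.05 * torch.randn(n_in, d_in, n_out))
        self.b_route = nn.Parameter(torch.zeros(n_in, n_out))
        # W^pose_ij: pose transformation matrices (Eq. 4).
        self.W_pose = nn.Parameter(0.05 * torch.randn(n_in, n_out, d_in, d_out))

    def forward(self, u, a):
        """u: (batch, n_in, d_in) poses, a: (batch, n_in) activations."""
        # c_ij = softmax_j(W^route_i u_i)
        logits = torch.einsum('bni,nij->bnj', u, self.W_route) + self.b_route
        c = F.softmax(logits, dim=-1)                        # (batch, n_in, n_out)

        # a_j = sum_i c_ij a_i / sum_i a_i                   (Eq. 3)
        weighted = c * a.unsqueeze(-1)                       # c_ij * a_i
        a_out = weighted.sum(dim=1) / (a.sum(dim=1, keepdim=True) + 1e-8)

        # u_j = sum_i c_ij a_i u_hat_{j|i} / sum_i c_ij a_i  (Eq. 4)
        u_hat = torch.einsum('bni,nmio->bnmo', u, self.W_pose)
        u_out = (weighted.unsqueeze(-1) * u_hat).sum(dim=1) \
                / (weighted.sum(dim=1, keepdim=True).unsqueeze(-1) + 1e-8)
        return u_out, a_out
```

Note that the forward pass involves no iteration, which is where the FLOP savings over dynamic and EM routing reported in Section 5 come from.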
## 5 Experiments

In our experiments, we focus on comparing our self-routing scheme with the two most important agreement-based routing algorithms from multiple perspectives. We first evaluate image classification performance (Section 5.2). We then compare robustness against unseen input perturbations, since such generalization ability has been a key motivation of CapsNets; in particular, we examine robustness against adversarial examples (Section 5.3) and viewpoint changes by affine transformation (Section 5.4). Our full source code is available at http://vision.snu.ac.kr/projects/self-routing.

### 5.1 Experimental Settings

Datasets. Following the CapsNet literature, we mostly use the two classification benchmarks CIFAR-10 [22] and SVHN [42], and additionally smallNORB [24] for the affine transformation tests. During training on CIFAR-10 and SVHN, we augment with random crop, random horizontal flip, and normalization. For smallNORB, we follow the setting of [16]: we downsample training images to 48 × 48, randomly crop 32 × 32 patches, and add random brightness and contrast. Test images are downsampled and then center cropped.

Architectures. For a fair comparison, all routing algorithms share the same base CNN architecture. We choose ResNet-20 [13], designed for CIFAR-10 classification, for two reasons. First, ResNet is one of the best-performing CNN architectures across various computer vision applications [4, 11, 26, 31, 32], so it is interesting to verify whether capsule structures can benefit mainstream CNNs. Second, ResNet is mostly composed of Conv layers, which makes it easy to build CapsNets on top of it, since CapsNets consist only of Conv layers and capsule layers. Given that ResNet-20 [13] consists of 19 Conv layers followed by a final average pooling and FC layer, we build a CapsNet on top of it by replacing these last two layers with a primary capsule (PrimaryCaps) layer and a fully-connected capsule (FCCaps) layer, respectively. Because the smallNORB dataset is much less complex than CIFAR-10/SVHN and its training set is relatively small (16,200 samples), we use a 7-layer CNN for it, a shallower network with no shortcut connections, consisting of 6 Conv layers followed by AvgPool and FC layers.

We also measure performance variations according to the depth and width of the capsule layers. The depth is the number of routing layers inserted after the PrimaryCaps layer. Thus, a depth of 1 means the final layers of the network are PrimaryCaps+FCCaps, between which routing is performed once. A depth of 2 means the final layers are PrimaryCaps+ConvCaps+FCCaps, so that routing is performed twice between consecutive capsule layers. In general, a depth of d involves d - 1 ConvCaps layers inserted between PrimaryCaps and FCCaps. All ConvCaps layers have a kernel size of 3 and a stride of 1, except the first ConvCaps layer, which has a stride of 2. The width is the number of capsules in each capsule layer; for example, a width of 8 means there are 8 capsules in all Primary/ConvCaps layers of the architecture.

Table 1: Comparison of parameter counts (M), FLOPs (M), and error rates (%) between routing algorithms of CapsNets and CNN models. We use ResNet-20 as the base network. DR, EM, and SR stand for dynamic [34], EM [16], and self-routing, respectively. The number following the method name is the number of capsule layers stacked on top of the Conv layers. All CapsNets have 32 capsules in each layer. We test each model 5 times with different random seeds; the error rates reported below are their averages.

| Methods | # Param. (M) | # FLOPs (M) | CIFAR-10 | SVHN |
|---|---|---|---|---|
| AvgPool | 0.3 | 41.3 | 7.94 ± 0.21 | 3.55 ± 0.11 |
| Conv | 0.9 | 61.0 | 10.01 ± 0.99 | 3.98 ± 0.15 |
| DR-1 | 5.8 | 73.5 | 8.46 ± 0.27 | 3.49 ± 0.69 |
| DR-2 | 4.2 | 232.1 | 7.86 ± 0.21 | 3.17 ± 0.09 |
| EM-1 | 0.9 | 76.6 | 10.25 ± 0.45 | 3.85 ± 0.13 |
| EM-2 | 0.8 | 173.8 | 12.52 ± 0.32 | 3.70 ± 0.35 |
| SR-1 | 0.9 | 62.2 | 8.17 ± 0.18 | 3.34 ± 0.08 |
| SR-2 | 3.2 | 140.3 | 7.86 ± 0.12 | 3.12 ± 0.13 |
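To make the head replacement described in the Architectures paragraph concrete, the sketch below converts a backbone feature map into primary capsules (poses plus activations); a routing capsule layer would then act as FCCaps on top of these capsules. The 1×1 convolutions, the sigmoid activations, and all sizes are our own assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PrimaryCaps(nn.Module):
    """Illustrative primary capsule layer: turns a backbone feature map (B, C, H, W)
    into pose vectors and activation scalars for the subsequent routing layer."""

    def __init__(self, in_channels, n_caps=32, d_caps=16):
        super().__init__()
        self.n_caps, self.d_caps = n_caps, d_caps
        self.pose_conv = nn.Conv2d(in_channels, n_caps * d_caps, kernel_size=1)
        self.act_conv = nn.Conv2d(in_channels, n_caps, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        u = self.pose_conv(x).view(b, self.n_caps, self.d_caps, h, w)   # poses u_i
        a = torch.sigmoid(self.act_conv(x)).view(b, self.n_caps, h, w)  # activations a_i
        return u, a

# Example: a ResNet-20-style backbone for CIFAR-10 ends with an 8x8x64 feature map;
# the AvgPool+FC head is replaced by PrimaryCaps followed by a routing (FCCaps) layer.
features = torch.randn(2, 64, 8, 8)        # placeholder backbone output
poses, activations = PrimaryCaps(64)(features)
print(poses.shape, activations.shape)       # (2, 32, 16, 8, 8) and (2, 32, 8, 8)
```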
CNN baselines. We also test two CNN variants to compare CapsNets with conventional CNNs. Since the earlier Conv layers of the base network are shared by all models, we vary only the last two layers: (1) AvgPool+FC, as in the original architecture, and (2) Conv+FC, to verify whether the gains of CapsNets are simply due to using more parameters rather than their structure or routing algorithms. We describe further experimental details, including architectures and optimization, in the Appendix.

### 5.2 Results of Image Classification

We compare the image classification performance of self-routing with the two agreement-based routing algorithms on SVHN [42] and CIFAR-10 [22]. Table 1 summarizes the error rates as well as the memory and computation overheads of each method. Self-routing and dynamic routing [34] show classification accuracies similar to the CNN baselines, while EM routing [16] degrades performance. Importantly, the computation overhead of self-routing in FLOPs is lower than that of the other routing baselines, since it requires no iterative routing computations. In terms of parameter count, EM routing is the most efficient due to its matrix pose representation. However, we find it hard to train networks with stacked EM routing layers on datasets more complex than those initially tested in [16] (e.g., smallNORB). It seems that the constraint of 4 × 4 weight matrix multiplications is too strong to learn good representations from complex data with multi-level routing layers. Dynamic routing is comparable to our self-routing in performance, but it requires many more FLOPs.

We find that CapsNet variants equipped with self-routing outperform the CNN baselines, but the margins are not large. The benefit of using an ensemble of weak submodules appears relatively small here, since the degree of specialization required by these tasks and datasets is small; indeed, MoE-style deep models are typically strongest in tasks that demand clear specialization, such as multi-task learning [2, 33, 30].

### 5.3 Robustness to Adversarial Examples

Adversarial examples are inputs intentionally crafted to trick recognition models into misclassification, and numerous defensive methods against such attacks have been suggested [6, 28]. In [16], CapsNets showed considerable robustness to such attacks without any reactive modification. We therefore evaluate our model's robustness to adversarial examples. We use the targeted and untargeted white-box Fast Gradient Sign Method (FGSM) [10] to generate adversarial examples. FGSM first computes the gradient of the loss with respect to the input pixels and then adds the sign of this gradient to the input pixels, scaled by a fixed amount ϵ. For a fair comparison, we attack only images for which each model's prediction is correct.
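For reference, the untargeted FGSM attack described above amounts to a single signed-gradient step. A minimal sketch is given below; the model and the cross-entropy loss are placeholders (the paper does not specify the attack loss here, and CapsNets are often trained with margin or spread losses instead).

```python
import torch
import torch.nn.functional as F

def fgsm_untargeted(model, x, y, eps=0.1):
    """One-step FGSM: x_adv = x + eps * sign(d loss / d x)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)     # placeholder loss choice
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    # For the targeted variant, subtract the signed gradient of the loss
    # computed with the target label instead of adding it.
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in an assumed [0, 1] range
```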
Figure 3: Success rates (%) of untargeted and targeted FGSM attacks against different routing methods of CapsNets and CNN models on CIFAR-10 and SVHN, for (a) PrimaryCaps+FCCaps (depth of 1) and (b) PrimaryCaps+ConvCaps+FCCaps (depth of 2). All CapsNets have 32 capsules in each layer. We set ϵ = 0.1 for all FGSM attacks. All results are obtained with 5 random seeds.

Figure 4: Variation of the success rates of untargeted and targeted FGSM attacks with the number of capsules per layer. Self-routing improves robustness as the number of capsules increases. We set ϵ = 0.1 for all FGSM attacks. All results are obtained with 5 random seeds.

Figure 3 summarizes the results. The CNN baselines are significantly more vulnerable to both untargeted and targeted FGSM attacks than the CapsNets, among which our SR-CapsNets are the most robust. Figure 4 shows that adding more capsules per layer further improves the robustness of our SR-CapsNets, while stacking more capsule layers helps only against the untargeted attack. We find no similar pattern for the other routing algorithms; their performance often degrades when more capsules are used. This suggests that routing-by-agreement struggles to cluster predictions when a large portion of them carry noisy information. Results for another adversarial attack, BIM [23], are given in the Appendix.

### 5.4 Robustness to Affine Transformation

One of the known strengths of CapsNets is their ability to generalize to novel viewpoints [16]. To demonstrate this, we measure classification performance on smallNORB [24] test sets with novel viewpoints. Following the experimental protocol of [16], we train all models on the 1/3 of the training data with azimuths of 0, 20, 40, 300, 320, and 340 degrees, and test them on the 2/3 of the test data with the other azimuths. In a second experiment, models are trained on the 1/3 of the training data with elevations of 30, 35, and 40 degrees from the horizontal, and tested on the 2/3 of the test data with the other elevations.

Table 2: Comparison of error rates (%) on the smallNORB test set with the 7-layer CNN as the base architecture. "Familiar" and "Novel" denote results on test samples with viewpoints seen and unseen during training, respectively. All CapsNets have 32 capsules in each layer. All results are obtained with 10 random seeds.

| Methods | Azimuth (Familiar) | Azimuth (Novel) | Elevation (Familiar) | Elevation (Novel) |
|---|---|---|---|---|
| AvgPool | 8.49 ± 0.45 | 21.76 ± 1.18 | 5.68 ± 0.72 | 17.72 ± 0.30 |
| Conv | 8.39 ± 0.56 | 22.07 ± 1.02 | 7.51 ± 1.09 | 18.78 ± 0.67 |
| DR-1 | 6.86 ± 0.50 | 20.33 ± 1.32 | 5.78 ± 0.48 | 16.37 ± 0.90 |
| EM-1 | 7.36 ± 0.89 | 20.16 ± 0.96 | 5.97 ± 0.98 | 17.51 ± 1.52 |
| SR-1 | 7.62 ± 0.95 | 19.86 ± 1.03 | 5.96 ± 0.46 | 15.91 ± 1.09 |

Figure 5: Comparison of error rates (%) on the smallNORB test set without a CNN base. All results are obtained with 5 random seeds.

Table 2 shows that capsule-based models generalize better than the CNN baselines. In particular, self-routing performs best on images with novel azimuths and elevations. The results suggest that viewpoint generalization is not a unique strength of routing-by-agreement. We also find a weak correlation between model size (i.e., depth and width of capsule layers) and generalization performance, but the improvement is small.

Although the experiments with the 7-layer CNN base show that CapsNets generalize better than CNNs, the margins between the different routing methods are not significant. We therefore conduct additional experiments on smallNORB with a smaller network consisting of only one convolution layer followed by three consecutive capsule layers, PrimaryCaps+ConvCaps+FCCaps. The capsule layers are composed of 16 capsules, each with 16 neurons.
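Returning to the viewpoint protocol above, the azimuth split (and analogously the elevation split) reduces to a simple filter over viewpoint metadata. The sketch below assumes a NumPy array of azimuth angles in degrees is available for each sample; the file name is a placeholder, not part of the smallNORB distribution.

```python
import numpy as np

# Hypothetical metadata: one azimuth angle (degrees) per sample; placeholder file name.
azimuths = np.load('smallnorb_azimuths_deg.npy')

seen_azimuths = [0, 20, 40, 300, 320, 340]      # the 1/3 of viewpoints used for training
seen_mask = np.isin(azimuths, seen_azimuths)

train_idx = np.where(seen_mask)[0]              # familiar viewpoints (training / "Familiar" test)
novel_idx = np.where(~seen_mask)[0]             # held-out viewpoints ("Novel" test)
```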
Figure 5 shows that our self-routing (SR) outperforms dynamic and EM routing by significant margins in both tasks. That is, with shallow feature extractors, the previous routing techniques struggle to learn good representations.

## 6 Conclusion

We proposed a supervised, non-iterative routing method for capsule-based models with better computational efficiency. We conducted systematic experiments comparing the existing routing methods with our self-routing. The experiments verified that our method achieves competitive performance on adversarial defense and viewpoint generalization, the two proposed strengths of CapsNets. Moreover, our method generally performs better when more capsules are used per layer, while the previous methods often behave unstably. The results suggest that routing-by-agreement may not be a requirement for CapsNets' robustness. As future work, it would be interesting to find a way to bring residual connections to our models, since residual networks have been shown to behave like ensembles of networks with different depths [39]; this could be synergetic with our models, where capsules take different paths of the same depth.

Acknowledgements. This work was supported by Samsung Research Funding Center of Samsung Electronics (No. SRFC-IT1502-51) and the ICT R&D program of MSIT/IITP (No. 2019-0-01309, Development of AI technology for guidance of a mobile robot to its goal with uncertain maps in indoor/outdoor environments). Gunhee Kim is the corresponding author.

## References

[1] M. A. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W.-S. Ku, and A. Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. arXiv preprint arXiv:1811.11553, 2018.
[2] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In CVPR, 2017.
[3] M. T. Bahadori. Spectral capsule networks. In ICLR Workshop, 2018.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848, 2017.
[5] T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, 2016.
[6] F. Croce, M. Andriushchenko, and M. Hein. Provable robustness of ReLU networks via maximization of linear regions. arXiv preprint arXiv:1810.07481, 2018.
[7] D. Eigen, M. Ranzato, and I. Sutskever. Learning factored representations in a deep mixture of experts. In ICLR Workshops, 2014.
[8] L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
[9] R. Geirhos, C. R. Temme, J. Rauber, H. H. Schütt, M. Bethge, and F. A. Wichmann. Generalisation in humans and deep neural networks. In NeurIPS, 2018.
[10] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.
[15] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
[16] G. E. Hinton, S. Sabour, and N. Frosst. Matrix capsules with EM routing. In ICLR, 2018.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[18] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79-87, 1991.
[19] A. Jaiswal, W. AbdAlmageed, Y. Wu, and P. Natarajan. CapsuleGAN: Generative adversarial capsule network. In ECCV, 2018.
[20] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181-214, 1994.
[21] L. Kirsch, J. Kunze, and D. Barber. Modular networks: Learning to decompose neural computation. In NeurIPS, 2018.
[22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[23] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[24] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
[25] H. Li, X. Guo, B. Dai, W. Ouyang, and X. Wang. Neural network encapsulation. In ECCV, pages 252-267, 2018.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[27] W. Liu, E. Barsoum, and J. D. Owens. Object localization and motion transfer learning with capsules. arXiv preprint arXiv:1805.07706, 2018.
[28] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[29] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015.
[30] P. Ramachandran and Q. V. Le. Diversity and depth in per-example routing models. In ICLR, 2019.
[31] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[33] C. Rosenbaum, T. Klinger, and M. Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv preprint arXiv:1711.01239, 2017.
[34] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NeurIPS, pages 3856-3866, 2017.
[35] S. Saito and S. Roy. Effects of loss functions and target representations on adversarial robustness. arXiv preprint arXiv:1812.00181, 2018.
[36] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[37] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
[38] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[39] A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, 2016.
[40] S. Verma and Z.-L. Zhang. Graph capsule convolutional neural networks. arXiv preprint arXiv:1805.08090, 2018.
[41] D. Wang and Q. Liu. An optimization view on dynamic routing between capsules. In ICLR Workshop, 2018.
[42] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS, 2011.