Introducing Routing Uncertainty in Capsule Networks

Fabio De Sousa Ribeiro, Georgios Leontidis, Stefanos Kollias

Machine Learning Group, University of Lincoln, UK
{fdesousaribeiro,skollias}@lincoln.ac.uk

Department of Computing Science, University of Aberdeen, UK
georgios.leontidis@abdn.ac.uk

Abstract

Rather than performing inefficient local iterative routing between adjacent capsule layers, we propose an alternative global view based on representing the inherent uncertainty in part-object assignment. In our formulation, the local routing iterations are replaced with variational inference of part-object connections in a probabilistic capsule network, leading to a significant speedup without sacrificing performance. In this way, global context is also considered when routing capsules, by introducing global latent variables that have a direct influence on the objective function and are updated discriminatively in accordance with the minimum description length (MDL) principle. We focus on enhancing capsule network properties, and perform a thorough evaluation on pose-aware tasks, observing improvements in performance over previous approaches whilst being more computationally efficient.

1 Introduction

Although capsule networks (CapsNets) have taken on a few different forms since their inception [1, 2, 3, 4], they are generally built upon the following core assumptions and premises:
(i) Capturing equivariance w.r.t. viewpoints in neural activities, and invariance in the weights;
(ii) High-dimensional coincidences are effective feature detectors;
(iii) Viewpoint changes have nonlinear effects on pixels, but linear effects on object relationships;
(iv) Object parts belong to a single object, and each location contains at most a single object.

In theory, a perfect instantiation of the above premises could yield more sample-efficient models that leverage robust representations to better generalise to unseen cases. Unlike current methods, humans can extrapolate object appearance to novel viewpoints after a single observation. Evidence suggests that this is because we impose coordinate frames on objects [5, 6]. Capsules imitate this concept by representing neural activities as poses of objects w.r.t. a coordinate frame imposed by an observer, and attempt to disentangle salient features of objects into their composing parts. This is reminiscent of inverse graphics [7], but is not explicitly enforced in capsule formulations, since the learned pose matrices are not constrained to interpretable geometric forms. Another argument for CapsNets is one that views capsules as an extension to the very successful inductive biases already present in CNNs, wiring in some additional complexity to deal with viewpoint changes. One of the desired effects is to align the learned representations with those perceptually consistent with humans, which would also make adversarial examples less effective [8]. The additional complexity comes from replacing scalar neurons with vector-valued neural activities, along with a high-dimensional coincidence filtering algorithm to detect capsule-level features, known as capsule routing [2, 3]. This procedure is typically iterative, local and inefficient, which has prompted further research on the topic [9, 10, 11, 12, 13].

1.1 Motivation & Contribution
Weaknesses of Capsule Networks. The memory bottleneck incurred by vector-valued activations, in addition to the iterative nature of capsule routing algorithms, results in inefficient models. They are also prone to underfitting or overfitting if the number of routing iterations isn't properly set [2, 3]. To address the above weaknesses, one may decide to naively replace the iterative nature of capsule routing with some faster alternative. However, to stay true to the premises of CapsNets, we argue that the four following points are of paramount importance for the research community to consider when proposing algorithmic variants of CapsNets or capsule routing going forward:
(i) Whether viewpoint-invariance and affine transformation robustness properties are retained;
(ii) Whether changes in assumptions about part-object relationships are made explicit;
(iii) Whether capsules are still activated based on high-dimensional coincidences;
(iv) How the intrinsic uncertainty in assembling parts into objects is handled.

Changes in the core assumptions of CapsNets aren't always made clear in recent literature, but emerge incidentally via the proposed modifications. This leads to ambiguities regarding what qualifies as a capsule network, which can make comparisons between methods more difficult and hinder progress. In this paper, we focus on the core premises of capsule networks, and on enhancing their advantages over CNNs: viewpoint-invariance and affine transformation robustness, whilst being more efficient.

Contribution. Rather than performing local iterative routing between adjacent capsule layers, which is inefficient, we propose an alternative global view based on representing the inherent uncertainty in part-object relationships, by approximating a posterior distribution over part-object connections. Uncertainty in assembling objects via a composition of parts can arise from numerous sources, such as: (i) feature occlusions due to observed viewpoints; (ii) sensory noise in captured data; (iii) object symmetries for which poses may be ambiguous, such as spherical objects/parts. In our formulation, the local routing iterations are replaced with variational inference of part-object connections in a probabilistic capsule network, leading to a significant speedup (Figure 4). In this way, we encourage global context to be taken into account when routing information, by introducing global latent variables which have a direct influence on the objective function, and are updated discriminatively in accordance with the minimum description length (MDL) principle [14, 15]. Our experiments demonstrate that local iterative routing can be replaced by variational posterior inference of part-object connections in a global context setting, allowing the model to leverage the inherent uncertainty in assembling objects as a composition of parts to improve performance on pose-aware tasks.

2 Background: Capsule Networks

Capsules. A capsule $c$ is a set of neurons $c = \{a, \mathbf{M}\}$. Each capsule is composed of either a vector $\mathbf{m} \in \mathbb{R}^d$ or a matrix $\mathbf{M} \in \mathbb{R}^{\sqrt{d} \times \sqrt{d}}$ of neurons, and an activation probability $a$. A single capsule is wired to represent a single entity, and its vector/matrix may learn to encode its pose w.r.t. the coordinate frame imposed by an observer. The activation $a$ simply represents an entity's presence. A capsule network is composed of two or more capsule layers, with multiple ($N$) capsules in each layer. Capsule routing takes place between adjacent capsule layers, i.e. $N_i$ capsules in a lower layer $\ell_i$ are routed to $N_j$ capsules in a higher layer $\ell_j$, which can be seen as a form of cluster finding. Contextually, capsules in $\ell_i$ are referred to as parts of objects (datapoints), and capsules in $\ell_j$ are objects (clusters). Each part capsule uses its relationship to the viewer (pose) to posit a vote for what the pose of the object it is part of should be. To achieve this, part capsule poses $\mathbf{M}_i$ are multiplied with trainable viewpoint-invariant, affine transformation weight matrices:

$$\mathbf{V}_{j|i} = \big\{ \mathbf{M}_i \mathbf{W}_{ij} \;\big|\; c_i \in \ell_i, \; c_j \in \ell_j \big\}, \qquad \mathbf{W}_{ij} \in \mathbb{R}^{\sqrt{d} \times \sqrt{d}}, \tag{1}$$

where $\mathbf{V}_{j|i}$ denotes the $i$th part capsule's vote for the $j$th object capsule's pose, and $\mathbf{W}_{ij}$ are the trainable weights. A minimal sketch of this voting step is given below.
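To make the voting step of Eq. (1) concrete, below is a minimal PyTorch sketch of dense capsule voting. The tensor names and shapes are illustrative assumptions (4x4 pose matrices, one trainable transformation per part-object pair), not the authors' released implementation.

```python
import torch

def dense_capsule_votes(poses: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Compute votes V_{j|i} = M_i W_{ij} for all part-object pairs (Eq. 1).

    poses: (B, N_i, 4, 4)   part capsule pose matrices M_i
    W:     (N_i, N_j, 4, 4) viewpoint-invariant transformation weights W_ij
    Returns votes of shape (B, N_i, N_j, 4, 4).
    """
    # Batched matrix product over the trailing 4x4 dims: V[b, i, j] = poses[b, i] @ W[i, j]
    return torch.einsum('bipq,ijqr->bijpr', poses, W)

# Example usage with illustrative sizes.
B, N_i, N_j = 2, 16, 8
votes = dense_capsule_votes(torch.randn(B, N_i, 4, 4), torch.randn(N_i, N_j, 4, 4))
print(votes.shape)  # torch.Size([2, 16, 8, 4, 4])
```

In a convolutional capsule layer the same product is applied within each k x k window, so each part only votes for the objects covered by the bound in Eq. (2) below.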
Inducing Nonlinearity. Capsule poses $\mathbf{M}$ are not directly activated via nonlinear mappings, but are compositions of affine/projective linear transformations that increase in complexity as we traverse the network. Nonlinearity is induced by the choice of routing algorithm [2, 3], and by the vote agreement measure used in calculating the activation probability $a_j$ for each capsule $c_j \in \ell_j$.

Figure 1: Our inference procedure in a given capsule layer (Left). Small example of part-object connections in convolutional voting for $k = 2$, drawn randomly from Dirichlet distributions (Right).

3 Uncertainty in Capsule Routing

Let $\mathcal{D}$ denote a set of data given as $m$ pairs $\{\mathbf{x}_i, y_i\}_{i=1}^{m}$, where $\mathbf{x}_i \in \mathbb{R}^d$ denotes a datapoint, and $y_i \in \{1, \dots, K\}$ its corresponding label. Let $\mathbf{z}$ denote some latent variables associated with our observations $(\mathbf{x}, y)$, that capture underlying structure in our data $\mathcal{D}$ and help govern its distribution.

3.1 Defining Part-Object Connections

Dense & Convolutional Voting. In dense capsule voting, all part capsules are connected to all object capsules in the layer above. That is, each part capsule $c_i \in \ell_i$ votes $N_j$ times, and therefore each object capsule $c_j \in \ell_j$ receives $N_i$ votes. The part-object connections are then $\mathbf{z}_{\ell_i,\ell_j} \in \mathbb{R}^{N_i \times N_j}$. Alternatively, in a convolutional capsule layer with kernel size $k$ and stride $s$, the number of object capsules that each part capsule $c_i$ can vote for, $N_{i \to j}$, is bounded above and below by

$$0 \leq N_{i \to j} \leq t_j \left\lceil \frac{k}{s} \right\rceil^2, \qquad \text{and} \qquad \mathbf{z}^{(i)}_{\ell_i,\ell_j} \in \mathbb{R}^{N_{i \to j}} \;\; \forall c_i \in \ell_i, \tag{2}$$

where $\lceil \cdot \rceil$ denotes the ceiling function, and $t_j$ denotes the number of output object capsule types, which are analogous to output channels in CNNs. Importantly, part capsules on the edge of feature maps vote for fewer objects than those in the middle (Figure 1), a fact which is very often overlooked in capsule research, leading to improper normalisation over objects and competition between capsules.

Stochastic Variational Inference. To represent our uncertainty about part-object relationships in a CapsNet, we look to approximate the (intractable) posterior distribution $p(\mathbf{z}|\mathcal{D})$ over part-object connections $\mathbf{z}$, with a chosen parameterised distribution $q_\phi(\mathbf{z}|\mathcal{D}) \approx p(\mathbf{z}|\mathcal{D})$ via variational inference (VI). In general, $q_\phi(\mathbf{z}|\mathcal{D})$ is optimised by updating the parameters $\phi$ such that the Kullback-Leibler (KL) divergence $D_{\mathrm{KL}}(q_\phi(\mathbf{z}|\mathcal{D}) \,\|\, p(\mathbf{z}|\mathcal{D}))$ is minimised [15, 16, 17]. Next, we discuss and consider the inference of $q_\phi(\mathbf{z}|\mathcal{D})$ under the two main modelling paradigms: generative and discriminative.

Generative. Under generative frameworks, a set of local latent variables $\mathbf{z}$ in models of the form $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x}|\mathbf{z})p(\mathbf{z})$ is often employed, such as in the variational autoencoder (VAE) [18]. Specifically, latent variables $\mathbf{z} = \{\mathbf{z}_i\}_{i=1}^{m}$ are inferred for each $\mathbf{x} = \{\mathbf{x}_i\}_{i=1}^{m}$, and maximum likelihood (ML) or maximum a posteriori (MAP) inference is performed on the global parameters.
The model is fit by maximising the Evidence Lower BOund (ELBO) on the marginal log-likelihood:

$$\log p_\theta(\mathbf{x}) \geq \sum_{i=1}^{m} \Big( -D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_i|\mathbf{x}_i) \,\|\, p(\mathbf{z}_i)\big) + \mathbb{E}_{q_\phi(\mathbf{z}_i|\mathbf{x}_i)}\big[\log p_\theta(\mathbf{x}_i|\mathbf{z}_i)\big] \Big) \triangleq \mathcal{L}_{\mathrm{local}}(\phi, \theta). \tag{3}$$

Discriminative. Under the discriminative framework, global latent variables $\mathbf{z}$ are often utilised and shared among datapoints $\{\mathbf{x}_i\}_{i=1}^{m}$, for instance when inferring the posterior on the weights of a neural network (NN) [15, 17, 19]. The bound is on the conditional marginal log-likelihood:

$$\log p(\mathbf{y}|\mathbf{x}) \geq -D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z})\big) + \sum_{i=1}^{m} \mathbb{E}_{q_\phi(\mathbf{z})}\big[\log p(y_i|\mathbf{x}_i, \mathbf{z})\big] \triangleq \mathcal{L}_{\mathrm{global}}(\phi). \tag{4}$$

To facilitate comparisons with the majority of research on CapsNets, we focus on the development and evaluation of our method in a discriminative setting. Formally, we are interested in estimating the conditional likelihood $p(\mathbf{y}|\mathbf{x}, \mathbf{z}) = \prod_{i=1}^{m} p(y_i|\mathbf{x}_i, \mathbf{z})$ using probabilistic capsule network models.

3.2 Posterior Inference of Part-Object Connections

Inference & Model Assumptions. Using stochastic VI tools, we intend to find the best approximation $q^*_\phi(\mathbf{z})$ that minimises $D_{\mathrm{KL}}(q_\phi(\mathbf{z}) \,\|\, p(\mathbf{z}|\mathcal{D}, \mathbf{W}))$, where $\mathbf{z}$ are global latent part-object connection variables and $\mathbf{W}$ are viewpoint-invariant transformation parameters, in a CapsNet with $L$ layers. We place a prior $p(\mathbf{z}^{(i)})$ over each part capsule $c_i \in \ell$'s connections to the objects $c_j \in \ell+1$ it votes for, and make the following factorised independence assumptions across capsule layers:

$$\mathbf{z}^{(i)} = (z_1, z_2, \dots, z_{N_{i \to j}}) \sim p(\mathbf{z}^{(i)}) \;\; \forall c_i \in \ell_i, \qquad p(\mathbf{z}) = \prod_{\ell=1}^{L} \prod_{i=1}^{N_\ell} p(\mathbf{z}^{(i)}_{\ell}). \tag{5}$$

We then make a variational approximation $q_\phi(\mathbf{z}_{\ell,\ell+1})$ to the posterior on part-object connection variables between adjacent capsule layers $\ell$ and $\ell+1$, for all capsule layers in the network. Our model's likelihood $p(\mathcal{D}|\mathbf{z}, \mathbf{W})$ and mean-field variational family $q_\phi(\mathbf{z})$ are given by

$$p(\mathcal{D}|\mathbf{z}, \mathbf{W}) = \prod_{i=1}^{m} p(y_i|\mathbf{x}_i, \mathbf{z}, \mathbf{W}), \qquad q_\phi(\mathbf{z}) = \prod_{\ell=1}^{L} \prod_{i=1}^{N_\ell} q_\phi(\mathbf{z}^{(i)}_{\ell,\ell+1}). \tag{6}$$

The model is defined hierarchically, where the object capsules in $\ell$ are the parts of $\ell+1$, and so forth.

Free Energy Objective. The model is fit end-to-end by maximising the following lower bound on the conditional marginal log-likelihood $\log p(\mathbf{y}|\mathbf{x})$, which approximates its description length:

$$\log p(\mathbf{y}|\mathbf{x}, \mathbf{W}) \geq -\sum_{\ell=1}^{L} D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_{\ell,\ell+1}) \,\|\, p(\mathbf{z}_\ell)\big) + \sum_{i=1}^{m} \mathbb{E}_{q_\phi(\mathbf{z})}\big[\log p(y_i|\mathbf{x}_i, \mathbf{z}, \mathbf{W})\big]. \tag{7}$$

In the general case, we perform variational inference on the part-object connection latent variables $\mathbf{z}$, and ML/MAP inference on $\mathbf{W}$. We find this to work well enough in practice, whilst significantly reducing the number of parameters needed and assumptions made, which is especially important in CapsNets given that efficiency is a major concern. Nonetheless, for full posterior learning, we can make one further mean-field assumption: $q_{\phi,\theta}(\mathbf{z}, \mathbf{W}) = q_\phi(\mathbf{z})q_\theta(\mathbf{W})$, where $q_\theta(\mathbf{W})$ is Gaussian and factorises similarly across layers, including any convolutional layers preceding the capsule layers.

3.3 Choosing Priors: Reflecting Part-Object Assumptions

Logistic-Normal. Recall from Eq. (2) that each part capsule $c_i$ votes for $N_{i \to j}$ objects; we can introduce randomness in their part-object connections via a Gaussian-softmax parameterisation:

$$\mathrm{softmax}(\mathbf{z}^{(i)})_j = \frac{\exp(z_j)}{\sum_{k=1}^{N_{i \to j}} \exp(z_k)}, \qquad z_j \sim \mathcal{N}(0, 1) \;\; \text{for } j = 1, 2, \dots, N_{i \to j}, \tag{8}$$

with all components $z_j$ sampled independently from standard Gaussian priors. The approximate posterior then takes the form $q_\phi(\mathbf{z}^{(i)}) = \mathcal{N}(\mathbf{z}^{(i)} \,|\, \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{(i)}) \;\; \forall c_i \in \ell_i$. To obtain stochastic gradients of the lower bound w.r.t. the parameters $\phi$, we can parameterise samples from $q_\phi(\mathbf{z}^{(i)})$ by $\mathbf{z}^{(i)} = f(\boldsymbol{\epsilon}, \phi)$, where $f(\cdot)$ is differentiable and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, using the (local) reparameterisation trick [18, 20], as sketched below.
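As a minimal illustration of the Gaussian-softmax parameterisation in Eq. (8), the sketch below draws a reparameterised sample of one part capsule's connections over the objects it votes for; the variable names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sample_connections_logistic_normal(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """Reparameterised sample of part-object connections z^(i) (Eq. 8).

    mu, log_sigma: (N_i_to_j,) variational parameters for one part capsule c_i.
    Returns a point on the probability simplex over the objects c_i votes for.
    """
    eps = torch.randn_like(mu)          # eps ~ N(0, I)
    z = mu + log_sigma.exp() * eps      # z = f(eps, phi) is differentiable w.r.t. phi
    return F.softmax(z, dim=-1)         # squash the Gaussian sample onto the simplex

# Under the standard Gaussian prior of Eq. (8): mu = 0, sigma = 1.
z_prior = sample_connections_logistic_normal(torch.zeros(9), torch.zeros(9))
```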
These priors are generally attractive since reparameterising Gaussian samples is straightforward, and they have been shown to work well in other settings such as topic models [21, 22].

Dirichlet. Alternatively, multi-modality over categorical events is better captured by the Dirichlet distribution [23]. We can also reduce the number of parameters, as we only need to infer $\boldsymbol{\pi}^{(i)}$ rather than $\{\boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{(i)}\}$ for each part capsule $c_i$, which is especially important in CapsNets, as explained in Section 1.1, since efficiency is a major concern. Our Dirichlet priors over $\mathbf{z}$ are defined as

$$\mathbf{z}^{(i)} = (z_1, z_2, \dots, z_{N_{i \to j}}) \sim \mathrm{Dir}(\boldsymbol{\pi}^{(i)}_0), \qquad \boldsymbol{\pi}^{(i)}_0 = (\pi_1, \pi_2, \dots, \pi_{N_{i \to j}}), \tag{9}$$

where $\boldsymbol{\pi}^{(i)}_0$ are the prior concentration parameters for $c_i$, and the approximate posterior is then also Dirichlet distributed: $q_\phi(\mathbf{z}^{(i)}) = \mathrm{Dir}(\boldsymbol{\pi}^{(i)}) \;\; \forall c_i \in \ell_i$. In practice, we draw Dirichlet samples via independent standard Gamma distributions over each part-object connection:

$$\boldsymbol{\gamma}^{(i)} = \big\{\gamma_j\big\}_{j=1}^{N_{i \to j}}, \qquad \gamma_j \sim \mathrm{Gamma}(\pi_j, 1), \tag{10}$$

$$z_j = \frac{\gamma_j}{\sum_{k=1}^{N_{i \to j}} \gamma^{(i)}_k}, \quad \text{then} \quad \mathbf{z}^{(i)} = (z_1, z_2, \dots, z_{N_{i \to j}}) \sim \mathrm{Dir}(\boldsymbol{\pi}^{(i)}_0). \tag{11}$$

This parameterisation enables significantly more efficient normalisation over objects, using a 2D transposed convolution with an identity filter to collect the variable-length vectors $\mathbf{z}^{(i)}$ when using convolutional voting. Unlike the Gaussian, the Gamma and Dirichlet distributions are not directly amenable to the reparameterisation trick [18, 24], so we obtain approximate pathwise gradients via the optimal mass transport (OMT) method [25]. Alternatively, we could obtain implicitly reparameterised gradients as in [26]. Both are readily available in PyTorch and TensorFlow respectively [27, 28].

3.4 Routing & Activating Capsules

Algorithm 1: Capsule Layer with Routing Uncertainty. Returns updated object capsules $c_j = \{a_j, \mathbf{M}_j\} \in \ell+1$, given part capsules $c_i = \{a_i, \mathbf{M}_i\} \in \ell$. Performs ML/MAP inference of transformation weights $\mathbf{W}$, and variational inference of latent part-object connection variables $\mathbf{z}$.

1: function ConvCaps2D($a_i$, $\mathbf{M}_i$)  # input capsules from previous layer
2:   Initialise affine weights: $\mathbf{W}_{ij} \in \mathbb{R}^{\sqrt{d} \times \sqrt{d}}$
3:   Set Dirichlet priors: $\boldsymbol{\pi}^{(i)}_0 \in \mathbb{R}^{N_{i \to j}} \;\; \forall c_i \in \ell$
4:   $\mathbf{V}_{j|i} \leftarrow$ VOTE($\mathbf{M}_i$, $\mathbf{W}_{ij}$)  # Eq. (1): capsules $c_i$ vote for poses of capsules $c_j$
5:   $\mathbf{z}_{\ell,\ell+1} \leftarrow$ SAMPLE$_{q_\phi(\cdot)}$($a_i$, $\boldsymbol{\pi}^{(i)}_0$)  # Eqs. (10-12): sample $\mathbf{z}^{(i)} \;\forall c_i$ from the approximate posterior
6:   $a_j, \mathbf{M}_j \leftarrow$ ROUTE($\mathbf{z}_{\ell,\ell+1}$, $\mathbf{V}_{j|i}$)  # Eqs. (12, 13): aggregate votes and activate capsules $c_j$
7:   return $c_j = \{a_j, \mathbf{M}_j\}$  # output capsules to next layer

Global Routing. Following from Eq. (1), part capsules $c_i \in \ell$ cast votes $\mathbf{V}_{j|i}$ for object capsules $c_j \in \ell+1$, in all layers. During training we fit multivariate Gaussians $\mathbf{M}_j \sim \mathcal{N}(\boldsymbol{\mu}_j, \boldsymbol{\sigma}_j)$ on each object's $d$-dimensional poses, and sample part-object connections from the approximate posterior:

$$\mathbf{z}^{(i)} \sim q_\phi(\mathbf{z}_{\ell,\ell+1}) \;\; \forall c_i \in \ell, \qquad \boldsymbol{\mu}_j = \frac{\sum_i \mathbf{z}^{(i)}_{\ell,\ell+1} \mathbf{V}_{j|i}}{\sum_i \mathbf{z}^{(i)}_{\ell,\ell+1}}, \qquad \boldsymbol{\sigma}_j = \frac{\sum_i \mathbf{z}^{(i)}_{\ell,\ell+1} (\mathbf{V}_{j|i} - \boldsymbol{\mu}_j)^2}{\sum_i \mathbf{z}^{(i)}_{\ell,\ell+1}}. \tag{12}$$

The latent variables $\mathbf{z}^{(i)}$ can act as soft assignments depending on our choice of prior, and one could interpret the training procedure as approximating the true posterior $q^*_\phi(\mathbf{z}|\mathcal{D}) \approx p(\mathbf{z}|\mathcal{D}, \mathbf{W})$ over all layers under the global minimum description length objective in Eq. (7), rather than performing local (iterative) inference of $\mathbf{z}$ in the E-step of EM routing [3] between all adjacent capsule layers. Alternatively, if for instance we let our priors on $\mathbf{z}^{(i)}$ be Beta distributed over each part-object connection, and omit the normalisation over objects, we can allow each part to route information to multiple objects at once. If one normalises over parts rather than objects, then routing closely resembles attention [29]. A sketch of this global routing step is given below.
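The sketch below strings Eqs. (10)-(12) together for a single dense capsule layer: part-object connections are drawn by normalising Gamma samples (a Dirichlet sample), and each object pose is summarised by a z-weighted mean and variance of its votes. It assumes dense connectivity, ignores part activations and the convolutional edge normalisation discussed earlier, and relies on PyTorch's reparameterised Gamma sampler rather than the OMT estimator used in the paper; names and shapes are illustrative.

```python
import torch
from torch.distributions import Gamma

def global_routing_step(votes: torch.Tensor, concentration: torch.Tensor):
    """One routing step under routing uncertainty (Eqs. 10-12), dense connectivity.

    votes:         (B, N_i, N_j, d)  flattened votes V_{j|i}
    concentration: (N_i, N_j)        Dirichlet parameters, one row per part capsule c_i
    Returns per-object pose statistics mu_j and sigma_j, each of shape (B, N_j, d).
    """
    # Eqs. (10)-(11): Dirichlet sample via independent Gamma draws, normalised over objects.
    gamma = Gamma(concentration, torch.ones_like(concentration)).rsample()  # (N_i, N_j)
    z = gamma / gamma.sum(dim=-1, keepdim=True)                             # each row sums to 1
    z = z.unsqueeze(0).unsqueeze(-1)                                        # (1, N_i, N_j, 1)

    # Eq. (12): z-weighted Gaussian fit of each object capsule's pose.
    support = z.sum(dim=1)                                  # amount of data each object receives
    mu = (z * votes).sum(dim=1) / support                   # (B, N_j, d)
    sigma = (z * (votes - mu.unsqueeze(1)) ** 2).sum(dim=1) / support
    return mu, sigma
```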
Agreement & Activation. To measure vote agreement for each object capsule, we compute the average negative entropy of its pose: $H(\mathbf{M}_j) \triangleq -d^{-1}\,\mathcal{H}\big[\mathcal{N}(\mathbf{M}_j \,|\, \boldsymbol{\mu}_j, \boldsymbol{\sigma}_j)\big]$. Averaging yields a scale-invariant measure w.r.t. the number of pose parameters $d$. Agreement is weighted by the support for each object capsule, which is the amount of data received from its parts: $H(\mathbf{M}_j) \cdot \sum_i \mathbf{z}^{(i)}_{\ell,\ell+1}$. Next, consider a Binomially distributed random variable $S_j \sim \mathrm{Bin}(N_i, N_j^{-1})$, describing the assignment of $N_i$ parts to $N_j$ objects, each with probability $N_j^{-1}$. The expected amount of data each object receives in a given layer is then $\mathbb{E}(S_j)$. We can use this value to normalise and offset the entropy term, which automatically scales the logits according to the number of capsules in each layer:

$$a_j \triangleq \frac{\eta_j H(\mathbf{M}_j) - \mathbb{E}(S_j)}{\mathbb{E}(S_j)} = \frac{\eta_j}{\mathbb{E}(S_j)} H(\mathbf{M}_j) - 1, \qquad \eta_j \triangleq \sum_i \mathbf{z}^{(i)}_{\ell,\ell+1}, \tag{13}$$

and $a_j$ is then activated using the logistic function. In simple terms, if the uncertainty among votes is high (i.e. low negative entropy and poor agreement), assigning more data to capsule $j$ decreases its activation. Alternatively, if the uncertainty among votes is low (i.e. high negative entropy and good agreement), assigning more data to capsule $j$ increases its activation significantly. Activating capsules in this way simply encourages the model to meet the agreement and support activation criteria implicitly, but does not enforce them explicitly via learned $\beta$ thresholds as in EM routing [3].

Table 1: Comparing viewpoint-invariance on smallNORB. Performances are matched on familiar viewpoints before testing on novel ones. Results from 3 random seeds on architectures {f0, t1, t2, t3, t4}.

| Method (Viewpoints) | Azimuth Atrain (Acc. %) | Azimuth Atest (Acc. %) | Elevation Etrain (Acc. %) | Elevation Etest (Acc. %) | # Param |
| Baseline CNN [3] | 96.3 | 80.0 | 95.7 | 82.2 | 4.2M |
| CNN (AvgPool) [12] | 91.5 | 78.2 | 94.3 | 82.28 | 0.15M |
| Our EM-Routing | 96.29 ± 0.02 | 87.1 ± 0.42 | 95.71 ± 0.02 | 87.9 ± 0.39 | 0.17M |
| SR-Caps [12] | 92.38 | 80.14 | 94.04 | 84.09 | 0.75M |
| STAR-Caps [11] | 96.3 | 86.3 | - | - | 0.32M |
| EM-Routing [3] | 96.3 | 86.5 | 95.7 | 87.7 | 0.31M |
| VB-Routing [13] | 96.29 | 88.6 | 95.68 | 88.4 | 0.17M |
| {32, 8, 8, 8, 5} | 96.3 ± 0.03 | 89.12 ± 0.7 | 95.68 ± 0.04 | 89.64 ± 0.49 | 0.06M |
| {64, 8, 16, 16, 5} | 96.3 ± 0.02 | 91.06 ± 0.31 | 95.7 ± 0.02 | 91.01 ± 0.26 | 0.14M |
| {64, 16, 16, 16, 5} | 96.29 ± 0.03 | 91.41 ± 0.46 | 95.7 ± 0.03 | 91.36 ± 0.4 | 0.22M |
| {128, 16, 32, 32, 5} | 96.3 ± 0.02 | 91.85 ± 0.42 | 95.71 ± 0.03 | 92.03 ± 0.21 | 0.58M |

Capsule L2 Norm. Alternatively, we can activate capsules by computing the Frobenius norm of the mean votes for object poses $\|\boldsymbol{\mu}_j\|_F$, then squashing it to a sensible $(0, 1)$ range [2]. This encodes agreement in the norm of the poses, and offers a considerable speedup at a performance cost.

4 Experiments

In this study, we focus on demonstrating that our method enhances capsule properties and outperforms previous approaches on challenging pose-aware tasks used in the CapsNet literature (Sections 4.1, 4.2 and 4.4), whilst being more computationally efficient (see Figure 4 for runtime comparisons; code is available at https://github.com/fabio-deep/Routing-Uncertainty-CapsNet).

Network Architecture. To ensure fair and direct comparisons with previous work, we use CapsNets identical to EM routing [3]. A single 5x5 Conv layer with f0 filters and stride 2 precedes four capsule layers. The PrimaryCaps layer transforms the f0 feature maps into t1 capsule types, each having HxW capsules with 4x4 poses. Next, two 3x3 ConvCaps layers follow, with t2 and t3 output capsule types, using strides 2 and 1 respectively. The last ConvCaps layer outputs t4 class capsules, and shares weights across spatial dimensions [3]. Let {f0, t1, t2, t3, t4} denote the complete architecture (sketched below). In all experiments, we use Adam [30] with default parameters and a batch size of 128 for training.
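For orientation, below is a hypothetical, back-of-the-envelope reading of the {f0, t1, t2, t3, t4} notation as a layer-by-layer configuration, shown for the {64, 8, 16, 16, 5} smallNORB model; it is not the released code and omits spatial sizes, which depend on padding choices.

```python
# Illustrative mapping of {f0, t1, t2, t3, t4} = {64, 8, 16, 16, 5} to layers (hypothetical config).
capsnet_config = [
    {"layer": "Conv2d",      "filters": 64, "kernel": 5, "stride": 2},        # f0 feature maps
    {"layer": "PrimaryCaps", "types": 8,    "pose": (4, 4)},                  # t1 capsule types
    {"layer": "ConvCaps2d",  "types": 16,   "kernel": 3, "stride": 2},        # t2, routed via sampled z
    {"layer": "ConvCaps2d",  "types": 16,   "kernel": 3, "stride": 1},        # t3
    {"layer": "ClassCaps",   "types": 5,    "shared_spatial_weights": True},  # t4 = number of classes
]
```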
Priors. To show that our method works well in the general case, we set the priors to be as uninformative as possible for all the benchmark results presented, i.e. flat Dirichlet: $p(\mathbf{z}^{(i)}) = \mathrm{Dir}(\mathbf{1}_{N_{i \to j}}) \;\; \forall c_i \in \ell, \; \forall \ell$. These priors explicitly assume that each part capsule $c_i$ is equally likely to belong to any object it votes for, with any level of certainty. Nonetheless, we conducted experiments to test sensitivity to the choice of prior, as presented in Figure 4. We observe tighter bounds for priors with central peaks, meaning that sampled part-object connections are closer to uniform over objects. Although tighter bounds are not always better [31], this suggests that parts prefer to spread their vote amongst multiple objects in CapsNets, which is reminiscent of Dropout's effect on NN weights [32].

Inference. In all benchmark results, we perform deterministic inference at test time without sampling $\mathbf{z}$, by using the posterior means $\mathbf{z}^* = \mathbb{E}[q^*_\phi(\mathbf{z}_{\ell,\ell+1})] \;\forall \ell$ to compute predictions $\hat{y} = \arg\max_y p(y|\mathbf{x}, \mathbf{z}^*, \mathbf{W})$. Alternatively, we can draw $T$ Monte Carlo samples of part-object connections from the approximate posterior, and calculate the predictive entropy:

$$\mathcal{H}(\hat{\mathbf{y}}|\mathbf{x}, \mathbf{z}, \mathbf{W}) = -\sum_{k=1}^{K} \hat{y}_k \log \hat{y}_k, \qquad \hat{\mathbf{y}} \approx \frac{1}{T} \sum_{t=1}^{T} p(\mathbf{y}|\mathbf{x}, \mathbf{z}_t, \mathbf{W}), \qquad \mathbf{z}_t \sim q^*_\phi(\mathbf{z}|\mathcal{D}). \tag{14}$$

Under full posterior learning, $q_{\phi,\theta}(\mathbf{z}, \mathbf{W})$, the pose transformation matrices $\mathbf{W}$ are also sampled. Although the model is only partially Bayesian, we observe predictive entropies on out-of-distribution dataset samples (AffNIST, Fashion-MNIST) to be consistent with model uncertainty representation, as shown in Figure 3. We also observe entropic predictions on more challenging smallNORB viewpoints as we vary azimuth, whilst holding the lowest/highest elevation viewpoints fixed (see Figure 2).

Figure 2: (Top row) Predictive entropies when varying azimuth viewpoints whilst holding the lowest/highest elevations fixed on smallNORB. Obtained with 10 MC samples using {32, 8, 8, 8, 5}. (Bottom row) Example posterior parameters π from two random penultimate-layer capsules of networks trained under different Dirichlet priors π0; simplex corners represent SVHN digit classes.

4.1 Generalisation to Novel Viewpoints

Table 2: smallNORB test error (%), results from 3 random seeds.

| Method | Error (%) | # Param |
| Baseline CNN [3] | 5.2 | 4.2M |
| Our CNN | 5.6 ± 0.12 | 2.4M |
| Our ResNet-20 | 2.7 ± 0.11 | 0.27M |
| Our EM-Routing | 1.9 ± 0.15 | 0.17M |
| Dynamic [2] | 2.7 | 8.2M |
| FRMS [10] | 2.6 | 1.2M |
| FREM [10] | 2.2 | 1.2M |
| STAR-Caps [11] | 1.8 | 0.25M |
| EM-Routing [3] | 1.8 | 0.31M |
| VB-Routing [13] | 1.6 | 0.17M |
| {32, 8, 8, 8, 5} | 2.2 ± 0.08 | 0.06M |
| {64, 16, 16, 16, 5} | 1.5 ± 0.10 | 0.22M |
| {64, 8, 16, 16, 5} | 1.4 ± 0.09 | 0.14M |

Viewpoint-Invariance. smallNORB [33] consists of grey-level stereo 96x96 images of 5 objects, each given at 18 different azimuths (0-340), 9 elevations and 6 lighting conditions, with 24,300 training and test set examples. As in [3], we standardise the images and resize them to 48x48. During training we take 32x32 random crops, and centre crops at test time. We train on training set images with azimuths Atrain = {300, 320, 340, 0, 20, 40}, denoted as familiar viewpoints, and test on test set images containing novel azimuths Atest = {60, 80, ..., 280}. Similarly, for the elevation viewpoints we train on Etrain = {30, 35, 40} and test on Etest = {45, 50, ..., 70}. This split is sketched below.
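A small sketch of how the familiar/novel azimuth split described above can be constructed, assuming azimuth labels are available in degrees (smallNORB azimuths come in 20-degree steps); the array and function names are illustrative.

```python
import numpy as np

# Familiar azimuths used for training and novel azimuths held out for testing (in degrees).
A_TRAIN = {300, 320, 340, 0, 20, 40}
A_TEST = set(range(0, 360, 20)) - A_TRAIN          # {60, 80, ..., 280}

def viewpoint_masks(azimuths: np.ndarray):
    """Boolean masks selecting familiar- and novel-azimuth examples from smallNORB labels."""
    familiar = np.isin(azimuths, sorted(A_TRAIN))
    novel = np.isin(azimuths, sorted(A_TEST))
    return familiar, novel
```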
As reported in Table 1, we observed notable performance improvements in viewpoint-invariance over previous CapsNets, and significant improvements over CNNs. Additional results on the standard smallNORB train/test splits are found in Table 2.

4.2 Affine Transformation Robustness

Table 3: MNIST to AffNIST generalisation error (%). (†) denotes unsupervised learning.

| Method | MNIST | AffNIST |
| Baseline CNN [3] | 0.8 | 14.1 |
| BCN [34] | 2.5 | 8.4 |
| Dynamic [2] | 0.77 | 21 |
| G-Caps [35] | 1.58 | 10.1 |
| Sparse-Caps [36] (†) | 1.0 | 9.9 |
| SCAE [4] (†) | 1.5 | 7.79 |
| EM-Routing [3] | 0.8 | 6.9 |
| Aff-Caps [37] | 0.77 | 6.79 |
| {32, 8, 8, 8, 10} | 0.8 ± 0.01 | 5.02 ± 0.28 |
| {64, 8, 16, 16, 10} | 0.79 ± 0.01 | 4.17 ± 0.3 |
| {64, 16, 16, 16, 10} | 0.78 ± 0.02 | 3.88 ± 0.34 |
| {128, 16, 32, 32, 10} | 0.8 ± 0.02 | 3.46 ± 0.19 |
| {128, 16, 32, 32, 10} | 0.28 ± 0.01 | 2.31 ± 0.03 |

Out-of-Distribution Generalisation. In this study we demonstrate our model's robustness to affine transformations using the AffNIST dataset. AffNIST consists of MNIST images which have each been uniquely transformed by 32 random affine transformations. Training is performed on the MNIST training set, and we test generalisation performance on the AffNIST test set containing 320,000 examples. AffNIST images are 40x40, so for training we pad MNIST images, randomly placing the digits on 40x40 black backgrounds, as in the works we compare to [2, 13]. Our models were never trained on AffNIST, and no further data augmentation was used. As shown in Table 3, we observed performance improvements over previous CapsNets, and significantly so over CNNs. Increasing the number of capsules used in our method also leads to better generalisation performance.

Figure 3: Histograms of predictive entropies on in- and out-of-distribution test examples (MNIST train vs. test, MNIST vs. AffNIST, MNIST vs. Fashion-MNIST, and smallNORB Atrain vs. Atest). Results obtained with 10 MC samples from $q^*_\phi(\mathbf{z}|\mathcal{D})$ using our {64, 8, 16, 16, 10} model.

Figure 4: Effect of symmetric Dirichlet priors on the tightness of the ELBO over 3 runs on SVHN 10K, and complexity cost (KL) with β weight penalty throughout training (Left). Comparing capsule activation methods (L2 norm $\|\boldsymbol{\mu}_j\|_F$ vs. entropy $\mathcal{H}[\mathcal{N}(\mathbf{M}_j|\boldsymbol{\mu}_j,\boldsymbol{\sigma}_j)]$) on smallNORB viewpoint performance, and runtimes (CIFAR-10) of 5 open-source routing methods run on 2 Titan Xp GPUs, using the same {128, 16, 16, 16, 10} model (Right).
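Below is a minimal sketch of how the Monte Carlo predictive entropies of Eq. (14) and Figure 3 can be computed; the model interface (a forward pass that returns class probabilities and draws a fresh z sample per call) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def predictive_entropy(model, x: torch.Tensor, T: int = 10) -> torch.Tensor:
    """Monte Carlo predictive entropy H(y_hat | x, z, W) from Eq. (14).

    Averages class probabilities over T samples z_t ~ q_phi(z|D) drawn inside the
    model's forward pass, then returns the entropy of the averaged prediction.
    """
    probs = torch.stack([model(x, sample_z=True) for _ in range(T)])  # (T, B, K) class probabilities
    y_hat = probs.mean(dim=0)                                         # MC estimate of p(y|x)
    return -(y_hat * torch.log(y_hat + 1e-12)).sum(dim=-1)            # (B,) entropy per example
```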
4.3 Limited Training Data Regime

Table 4: Comparing SVHN test error (%) with limited training data (10K and 20K examples), from 3 random seed runs.

| Method (#Train) | 10K | 20K | # Param |
| ResNet-18 [38] | 9.83 | 7.90 | 2.7M |
| ResNet-34 [38] | 8.73 | 7.05 | 5.2M |
| Our CNN | 9.4 ± 0.25 | 7.7 ± 0.21 | 2.4M |
| ResNet-18 (STN) | 9.10 | 7.17 | 2.8M |
| ResNet-18 (ETN) | 7.81 | 6.37 | 2.8M |
| ResNet-34 (STN) | 8.60 | 6.91 | 5.3M |
| ResNet-34 (ETN) | 7.72 | 5.98 | 5.3M |
| {32, 8, 8, 8, 10} | 7.7 ± 0.05 | 6.5 ± 0.04 | 0.06M |
| {64, 16, 16, 16, 10} | 7.5 ± 0.21 | 5.9 ± 0.26 | 0.22M |
| {64, 8, 16, 16, 10} | 7.0 ± 0.15 | 5.9 ± 0.11 | 0.15M |

Sample Efficiency. Rather than artificially applying perturbations, we leverage the natural range of geometric variation in the SVHN dataset [39] to verify robustness and generalisation performance on real data. We follow the experimental setup of Equivariant Transformers [38], train models with random 10K and 20K subsets of the original training set of 73,257 examples, and evaluate on the test set (26,032 examples). As shown in Table 4, our CapsNets are quite sample-efficient in the limited training data regime, offering modest improvements over the STN/ETN baselines in [38], and significant improvements over CNNs. Sample efficiency is critical in real-world tasks where data is limited. Interestingly, we observe smaller improvements over baselines as more training data is used, suggesting that model choice is less important given enough data.

4.4 Performance Under Feature Occlusion

Table 5: Comparing MultiMNIST test error and exact match ratio (MR) error. (†) denotes results obtained using DiverseMultiMNIST.

| Method (Test) | Error (%) | MR (%) | # Param |
| Baseline CNN [2] | 8.01 | - | 24.6M |
| Baseline CNN [9] (†) | - | 15.2 | 19.6M |
| Dynamic [2] | 5.2 | - | 8.2M |
| IDP-Attention [9] (†) | - | 8.83 | 42M |
| Aff-Caps [37] | 4.51 | - | 8.2M |
| {64, 8, 16, 16, 10} | 3.3 ± 0.07 | 7.2 ± 0.21 | 0.15M |
| {128, 16, 16, 16, 10} | 2.4 ± 0.11 | 4.7 ± 0.18 | 0.23M |
| {128, 16, 32, 32, 10} | 1.8 ± 0.09 | 3.4 ± 0.17 | 0.58M |

Overlapping Digits. In this study we empirically demonstrate that our method is resilient under feature occlusions (which are a source of uncertainty). To that end, we replicated the experimental setup in [2], and trained our shallow models on the MultiMNIST dataset by generating occluded digit pairs on the fly. Digit pairs are formed by shifting each MNIST digit by up to 4 pixels in each direction, then adding them together. No further data augmentation was used. Our models were trained/validated on 60M overlapping digit pairs, and tested on 10M. Table 5 reports both lower test error and lower exact match ratio (MR) error compared to previous work. See Figure 5 for illustrations.

Figure 5: Explanatory heat maps of predictions by our models trained on smallNORB (Left) and MultiMNIST (Right). Obtained by upsampling the posterior means of the part-object connections z in the class capsule layer up to the input size, yielding attention-like explanations of predictions.

5 Related Work & Conclusion

Variational Inference. Our work lies at the intersection of CapsNets and variational Bayesian learning. Variational inference (VI) has its roots in statistical physics [40, 41], leading to seminal work in the early 1990s [15] which offered an MDL [14] perspective on VI in NNs. VI was later formalised more generally in a series of important works [42, 43, 44, 16]. More recently, practical strategies for calculating biased/unbiased Monte Carlo gradients of variational objectives in deep NNs have been proposed [17, 19], which are complemented by ideas from deep generative modelling such as the reparameterisation trick [18, 24]. NNs with Dropout [32] have also been interpreted as being approximately Bayesian [45, 46], and are widely used to estimate uncertainty [47, 48, 49, 50].

Capsule Networks. Initial work on capsules began with the transforming autoencoder [1]. Other successful variants have since been proposed, notably Dynamic routing [2], EM routing [3], and stacked capsule autoencoders (SCAE) [4], all of which achieved state-of-the-art performance in pose-aware tasks.
Much follow-up work focuses on algorithmic variants of local routing or on scaling up CapsNets: VB routing [13], KDE [10], Spectral [51], Subspace-Caps [52, 53]. Other interesting works improve the equivariance properties of CapsNets directly using group theory [35, 54], in contrast to our approach, as we do not impose any specific equivariance restrictions on the model. Geometric approaches have also been explored by [55, 56, 57], extending CapsNets to work with point clouds and in 3D. Related work on probabilistic interpretations of CapsNets is limited, with the notable exception of [58], which considers a fully generative, unsupervised perspective of SCAE [4], in contrast to the discriminative probabilistic model with capsule structure presented in this paper. Our work builds primarily on local EM/VB routing [3, 13], to which we provide a global alternative view using VI tools, and on other recent non-iterative routing methods: Attention routing [59], STAR-Caps [11], Self-Routing [12], and Inverted Dot-Product routing [9]. Modifications in some of the latter methods have led to ambiguities regarding what qualifies as a CapsNet, as opposed to a CNN with attention. As explained in Section 1.1, this occurs whenever the fundamental premises of CapsNets are implicitly or explicitly altered, and their properties are not carefully verified or retained. With that in mind, we demonstrate empirically that our proposed end-to-end probabilistic approach leads to performance enhancements in benchmark pose-aware tasks commonly used in the CapsNet literature, whilst being more computationally efficient.

5.1 Conclusion

In this paper we propose to replace inefficient local iterative routing with variational inference of a posterior on part-object connections in a probabilistic capsule network, leading to a significant speedup (Figure 4). In this way, we encourage global context to be taken into account when routing information, by introducing global latent variables which have a direct influence on the objective function, and are updated discriminatively in accordance with the minimum description length principle. To facilitate comparisons, we developed our method in a discriminative setting, and performed a thorough evaluation on pose-aware tasks, demonstrating enhanced capsule properties over previous iterative and non-iterative routing methods. We believe further exploration of CapsNets as deep latent variable models (DLVMs) [24, 60, 61] to be a promising future research direction.

Broader Impact

With the advent of deep learning, the computational requirements in the field have increased significantly due to the ever-increasing scale of our models. The environmental impact of training or deploying such models is therefore at an all-time high. This raises concerns regarding the sustainability of our current practices, as the technologies we help develop are slowly integrated into all areas of society. Although it is important to continue on this path of discovery, we feel that an important shift towards efficiency is sorely needed. Concretely, the development of smaller-scale models which are more robust and sample-efficient could significantly reduce the environmental impact of our technology with small sacrifices in performance. In general, we believe this can be achieved by introducing richer inductive priors into our models, which in turn require fewer examples to learn from, i.e. leading to increased sample efficiency.
With that in mind, capsule networks have previously been shown to possess superior generalisation properties to conventional CNNs in certain tasks, and in our work we enhance these properties further whilst being more computationally efficient than previous iterative routing methods. We also demonstrated competitive performance on sample-efficiency tasks, which have broad applicability to limited-data domains such as the medical domain. When these properties are enhanced even further, they have the potential to make a significant positive impact on our societies by increasing the sustainability and efficiency of our machine learning models.

Acknowledgments and Disclosure of Funding

We would like to gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research. We also thank Francesco Caliva and Lewis Smith for fruitful discussions.

References

[1] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44-51. Springer, 2011.
[2] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856-3866, 2017.
[3] Geoffrey Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In 6th International Conference on Learning Representations (ICLR), pages 1-15, 2018.
[4] Adam Kosiorek, Sara Sabour, Yee Whye Teh, and Geoffrey E Hinton. Stacked capsule autoencoders. In Advances in Neural Information Processing Systems, pages 15486-15496, 2019.
[5] I. Rock. Orientation and Form. Academic Press, 1973.
[6] Geoffrey Hinton. Some demonstrations of the effects of structural descriptions in mental imagery. Cognitive Science, 3(3):231-250, 1979.
[7] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539-2547, 2015.
[8] Yao Qin, Nicholas Frosst, Sara Sabour, Colin Raffel, Garrison Cottrell, and Geoffrey Hinton. Detecting and diagnosing adversarial images with class-conditional capsule reconstructions. In International Conference on Learning Representations, 2020.
[9] Yao-Hung Hubert Tsai, Nitish Srivastava, Hanlin Goh, and Ruslan Salakhutdinov. Capsules with inverted dot-product attention routing. In International Conference on Learning Representations, 2020.
[10] Suofei Zhang, Quan Zhou, and Xiaofu Wu. Fast dynamic routing based on weighted kernel density estimation. In International Symposium on Artificial Intelligence and Robotics, pages 301-309. Springer, 2018.
[11] Karim Ahmed and Lorenzo Torresani. STAR-Caps: Capsule networks with straight-through attentive routing. In Advances in Neural Information Processing Systems, pages 9098-9107, 2019.
[12] Taeyoung Hahn, Myeongjang Pyeon, and Gunhee Kim. Self-routing capsule networks. In Advances in Neural Information Processing Systems 32, pages 7658-7667. Curran Associates, Inc., 2019.
[13] Fabio De Sousa Ribeiro, Georgios Leontidis, and Stefanos Kollias. Capsule routing via variational Bayes. In AAAI Conference on Artificial Intelligence, 2020.
[14] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465-471, 1978.
[15] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 5-13, 1993.
[16] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[17] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348-2356, 2011.
[18] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[19] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[20] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575-2583, 2015.
[21] David Blei and John Lafferty. Correlated topic models. Advances in Neural Information Processing Systems, 18:147, 2006.
[22] Akash Srivastava and Charles Sutton. Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488, 2017.
[23] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.
[24] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[25] Martin Jankowiak and Fritz Obermeyer. Pathwise derivatives beyond the reparameterization trick. arXiv preprint arXiv:1806.01851, 2018.
[26] Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441-452, 2018.
[27] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024-8035, 2019.
[28] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265-283, 2016.
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
[30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[31] Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.
[32] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[33] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2, pages II-104. IEEE, 2004.
[34] Simyung Chang, John Yang, Seong Uk Park, and Nojun Kwak. Broadcasting convolutional network for visual relational reasoning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 754-769, 2018.
[35] Jan Eric Lenssen, Matthias Fey, and Pascal Libuschewski. Group equivariant capsule networks. In Advances in Neural Information Processing Systems, pages 8844-8853, 2018.
[36] David Rawlinson, Abdelrahman Ahmed, and Gideon Kowadlo. Sparse unsupervised capsules generalize better. arXiv preprint arXiv:1804.06094, 2018.
[37] Jindong Gu and Volker Tresp. Improving the robustness of capsule networks to image affine transformations. arXiv preprint arXiv:1911.07968, 2019.
[38] Kai Sheng Tai, Peter Bailis, and Gregory Valiant. Equivariant transformer networks. In International Conference on Machine Learning, pages 6086-6095, 2019.
[39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[40] Carsten Peterson. A mean field theory learning algorithm for neural networks. Complex Systems, 1:995-1019, 1987.
[41] G. Parisi. Statistical Field Theory. Basic Books, 1988.
[42] Lawrence K Saul and Michael I Jordan. Exploiting tractable substructures in intractable networks. In Advances in Neural Information Processing Systems, pages 486-492, 1996.
[43] Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.
[44] Tommi Sakari Jaakkola. Variational methods for inference and estimation in graphical models. PhD thesis, Massachusetts Institute of Technology, 1997.
[45] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050-1059, 2016.
[46] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3581-3590, 2017.
[47] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[48] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. arXiv preprint arXiv:1703.02910, 2017.
[49] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574-5584, 2017.
[50] Fabio De Sousa Ribeiro, Francesco Calivá, Mark Swainson, Kjartan Gudmundsson, Georgios Leontidis, and Stefanos Kollias. Deep Bayesian self-training. Neural Computing and Applications, pages 1-17, 2019.
[51] Mohammad Taha Bahadori. Spectral capsule networks. ICLR Workshop, 2018.
[52] Marzieh Edraki, Nazanin Rahnavard, and Mubarak Shah. Subspace capsule network. arXiv preprint arXiv:2002.02924, 2020.
[53] Liheng Zhang, Marzieh Edraki, and Guo-Jun Qi. CapProNet: Deep feature learning via orthogonal projections onto capsule subspaces. In Advances in Neural Information Processing Systems, pages 5814-5823, 2018.
[54] Sai Raam Venkataraman, S. Balasubramanian, and R. Raghunatha Sarma. Building deep equivariant capsule networks. In International Conference on Learning Representations, 2020.
[55] Nitish Srivastava, Hanlin Goh, and Ruslan Salakhutdinov. Geometric capsule autoencoders for 3D point clouds. arXiv preprint arXiv:1912.03310, 2019.
[56] Yongheng Zhao, Tolga Birdal, Haowen Deng, and Federico Tombari. 3D point capsule networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1009-1018, 2019.
[57] Yongheng Zhao, Tolga Birdal, Jan Eric Lenssen, Emanuele Menegatti, Leonidas Guibas, and Federico Tombari. Quaternion equivariant capsule networks for 3D point clouds, 2020.
[58] Lewis Smith, Lisa Schut, Yarin Gal, and Mark van der Wilk. Capsule networks: A probabilistic perspective. arXiv preprint arXiv:2004.03553, 2020.
[59] Jaewoong Choi, Hyun Seo, Suii Im, and Myungjoo Kang. Attention routing between capsules. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[60] Rajesh Ranganath, Linpeng Tang, Laurent Charlin, and David Blei. Deep exponential families. In Artificial Intelligence and Statistics, pages 762-771, 2015.
[61] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738-3746, 2016.