# Topographic VAEs learn Equivariant Capsules

T. Anderson Keller, UvA-Bosch Delta Lab, University of Amsterdam (t.anderson.keller@gmail.com)
Max Welling, UvA-Bosch Delta Lab, University of Amsterdam (m.welling@uva.nl)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

In this work we seek to bridge the concepts of topographic organization and equivariance in neural networks. To accomplish this, we introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables. We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST. Furthermore, through topographic organization over time (i.e. temporal coherence), we demonstrate how predefined latent space transformation operators can be encouraged for observed transformed input sequences, a primitive form of unsupervised learned equivariance. We demonstrate that this model successfully learns sets of approximately equivariant features (i.e. "capsules") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. Equivariance is verified quantitatively by measuring the approximate commutativity of the inference network and the sequence transformations. Finally, we demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.

## 1 Introduction

Many parts of the brain are organized topographically. Famous examples are the ocular dominance maps and the orientation maps in V1. What is the advantage of such organization, and what can we learn from it to develop better inductive biases for deep neural network architectures?

Figure 1: Overview of the Topographic VAE with shifting temporal coherence. The combined color/rotation transformation in input space τ_g becomes encoded as a Roll within the capsule dimension. The model is thus able to decode unseen sequence elements by encoding a partial sequence and Rolling activations within the capsules. We see this resembles a commutative diagram.

One potential explanation for the emergence of topographic organization is provided by the principle of redundancy reduction [1]. In the language of Information Theory, redundancy wastes channel capacity, and thus, to represent information as efficiently as possible, the brain may strive to transform the input to a neural code where the activations are statistically maximally independent. In the machine learning literature, this idea resulted in Independent Component Analysis (ICA), which linearly transforms the input to a new basis where the activities are independent and sparse [2, 13, 29, 44]. It was soon realized that there are remaining higher-order dependencies (such as correlation between absolute values) that cannot be transformed away by a linear transformation. For example, along edges of an image, linear-ICA components (e.g. Gabor filters) still activate in clusters even though the sign of their activity is unpredictable [48, 56]. This led to new algorithms that explicitly model these remaining dependencies through a topographic organization of feature activations [27, 45, 46, 59]. Such topographic features were reminiscent of pinwheel structures observed in V1, encouraging multiple comparisons with topographic organization in the biological visual system [28, 30, 42].
A second, almost independent body of literature developed the idea of equivariance of neural network feature maps under symmetry transformations. The idea of equivariance is that symmetry transformations define equivalence classes as the orbits of their transformations, and we wish to maintain this structure in the deeper layers of a neural network. For instance, for images, asserting that a rotated image contains the same object for all rotations, rotation then defines an orbit whose elements can be interpreted as pose or angular orientation. When an image is processed by a neural network, we want features at different orientations to be able to be combined to form new features, but we want to ensure the relative pose information between the features is preserved for all orientations. This has the advantage that the equivalence class of rotations for the complex composite features is guaranteed to be maintained, allowing for the extraction of invariant features, a unified pose, and increased data efficiency. Such ideas are reminiscent of the capsule networks of Hinton et al. [21, 22, 51], and indeed formal connections to equivariance have been made [39]. Interestingly, by explicitly building neural networks to be equivariant, we additionally see geometric organization of activations into these equivalence classes, and further, the elements within an equivalence class are seen to exhibit higher-order non-Gaussian dependencies [40, 41, 56, 57].

The insight of this connection between topographic organization and equivariance hints at a possibility to encourage approximate equivariance from an induced topology in feature space. To build a model, we must ask: what mechanisms could induce topographic organization of observed transformations specifically? We have argued that removing dependencies between latent variables is a possible mechanism; however, to obtain the more structured organization of equivariant capsule representations, the usual approach is to hard-code this structure into the network, or to encourage it through regularization terms [4, 15]. To achieve this same structure unsupervised, we propose to incorporate another key inductive bias: temporal coherence [18, 24, 52, 60]. The principle of temporal coherence, or slowness, asserts that when processing correlated sequences, we wish for our representations to change smoothly and slowly over space and time. Thinking of time sequences as symmetry transformations on the input, we desire features undergoing such transformations to be grouped into equivariant capsules. We therefore suggest that encouraging slow feature transformations to take place within a capsule could induce such grouping from sequences alone. In the following sections we explain the details of our Topographic Variational Autoencoder, which lies at the intersection of topographic organization, equivariance, and temporal coherence, thereby learning approximately equivariant capsules from sequence data completely unsupervised.

## 2 Related Work

The history of statistical models upon which this work builds is vast, including sparse coding [44], Independent Component Analysis (ICA) [2, 13, 29], Slow Feature Analysis (SFA) [54, 60], and Gaussian scale mixtures [41, 48, 56, 57]. Most related to this work are topographic generative models including Generative Topographic Maps [6], Bubbles [25], Topographic ICA [27], and the Topographic Product of Student's-t [46, 59].
Prior work on learning equivariant and invariant representations is similarly vast and also has a deep relationship with these generative models. Specifically, Independent Subspace Analysis [26, 53], models involving temporal coherence [18, 24, 52, 60], and Adaptive Subspace Self-Organizing Maps [35] have all demonstrated the ability to learn invariant feature subspaces and even disentangle space and time [19, 53]. Our work assumes a similar generative model to these works while additionally allowing for efficient estimation of the model through variational inference [33, 50]. Although our work is not the first to combine Student's-t distributions and variational inference [7], it is the first to provide an efficient method to do so for Topographic Student's-t distributions.

Another line of work has focused on constructing neural networks with equivariant representations separate from the framework of generative modeling. Analytically equivariant networks such as Group Equivariant Neural Networks [11], and other extensions [9, 16, 17, 55, 49, 58, 61, 62], propose to explicitly enforce symmetry to group transformations in neural networks through structured weight sharing. Alternatively, others propose supervised and self-supervised methods for learning equivariance or invariance directly from the data itself [4, 14, 15]. One related example in this category uses a group sparsity regularization term to similarly learn topographic features for the purpose of modeling invariance [31]. We believe the Topographic Variational Autoencoder presented in this paper is another promising step in the direction of learning approximate equivariance, and may even hint at how such structure could be learned in biological neural networks.

Furthermore, the idea of disentangled representations [3] has also been connected to equivariance and representation theory in multiple recent papers [8, 12, 10, 20]. Our work shares a fundamental connection to this distributed-operator definition of disentanglement, where the slow roll of capsule activations can be seen as the latent operator. Recently, the authors of [34] demonstrated that incorporating the principle of slowness in a variational autoencoder (VAE) yields the ability to learn disentangled representations from natural sequences. While similar in motivation, the generative model proposed in [34] is unrelated to topographic organization and equivariance, and is more aligned with traditional notions of disentanglement.

Finally, and importantly, in the neuroscience literature, another popular explanation for topographic organization arises as the solution to the wiring length minimization problem [36]. Recently, models which attempt to incorporate wiring length constraints have been shown to yield topographic organization of higher-level features, ultimately resembling the face patches found in primates [32, 38]. Interestingly, the model presented in this paper organizes activity based on the same statistical property (local correlation) as the wiring length proxies developed in [38], but from a generative modeling perspective, demonstrating a computationally principled explanation for the same phenomenon.

## 3 Background

The model in this paper is a first attempt at bridging two thus far disjoint classes of models: Topographic Generative Models and Equivariant Neural Networks. In this section, we provide a brief background on these two frameworks.
### 3.1 Topographic Generative Models

Inspired by Topographic ICA, the class of topographic generative models can be understood as generative models where the joint distribution over latent variables does not factorize into entirely independent factors, as is commonly done in ICA or VAEs, but instead has a more complex local correlation structure. The locality is defined by arranging the latent variables into an n-dimensional lattice or grid, and organizing variables such that those which are closer together on this grid have greater correlation of activities than those which are further apart. In the related literature, activations which are nearby in this grid are defined to have higher-order correlation, e.g. correlations of squared activations (a.k.a. energy), asserting that all first-order correlations are removed by the initial ICA de-mixing matrix. Such generative models can be seen as hierarchical generative models where there exist higher-level independent variance-generating variables V which are combined locally to generate the variances σ = ϕ(WV) of the lower-level topographic variables T ∼ N(0, σ²I), for an appropriate non-linearity ϕ. The variables T are thus independent conditioned on σ. Other related models which can be described under this umbrella include Independent Subspace Analysis (ISA) [26], where all variables within a predefined subspace (or capsule) share a common variance, and temporally coherent models [24], where the energy of a given variable between time steps is correlated by extending the topographic neighborhoods over the time dimension [25]. The topographic latent variable T can additionally be described as an instance of a Gaussian scale mixture (GSM). GSMs have previously been used to model the observed non-Gaussian dependencies between coefficients of steerable wavelet pyramids (interestingly also equivariant to translation & rotation) [48, 56, 57].

### 3.2 Group Equivariant Neural Networks

Equivariance is the mathematical notion of symmetry for functions. A function is said to be an equivariant map if the result of transforming the input and then computing the function is the same as first computing the function and then transforming the output. In other words, the function and the transformation commute. Formally, f(τρ[x]) = Γρ[f(x)], where τ and Γ denote the (potentially different) operators on the domain and co-domain respectively, but are indexed by the same element ρ. It is well known that convolutional maps in neural networks are translation equivariant, i.e., given a translation Γρ (applied to each feature map separately) and a convolutional map f(·), we have f(Γρ[x]) = Γρ[f(x)]. This can be extended to other transformations (e.g. rotation or mirroring) using Group convolutions (G-convolutions) [11]. As a result of the design of G-convolutions, feature maps that are related to each other by a rotation of the filter/input are grouped together. Moreover, a rotation of the input results in a structured transformation (i.e. a permutation and rotation) of the activations of each of these groups in the output. Hence, we can think of these equivalence-class groups as capsules, where transformations of the input only cause structured transformations within a capsule. As we will demonstrate later, this is indeed analogous to the structure of the representation learned by the Topographic VAE with temporal coherence: a transformation of the input yields a cyclic permutation of activations within each capsule. However, due to the approximate, learned nature of the equivariant representation, the Topographic VAE does not require the transformations τρ to constitute a group.
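As a small, self-contained numerical illustration of this commutativity property (our addition, not part of the paper), the NumPy sketch below checks that a circular convolution, which is exactly equivariant to cyclic shifts, satisfies f(τρ[x]) = Γρ[f(x)] up to floating-point error. The signal, filter, and shift values are arbitrary; the equivariance error of Section 6.1 measures exactly this kind of discrepancy for the learned encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=64)        # arbitrary 1D input signal
k = rng.normal(size=64)        # arbitrary filter

def circ_conv(x, k):
    # Circular convolution computed in the Fourier domain;
    # it is exactly equivariant to cyclic shifts of the input.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

for shift in (1, 5, 17):
    lhs = circ_conv(np.roll(x, shift), k)   # f(tau_rho[x]): transform, then map
    rhs = np.roll(circ_conv(x, k), shift)   # Gamma_rho[f(x)]: map, then transform
    assert np.allclose(lhs, rhs), "commutativity violated"
print("circular convolution commutes with cyclic shifts")
```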
## 4 The Generative Model

The generative model proposed in this paper is based on the Topographic Product of Student's-t (TPoT) model as developed in [46, 59]. In the following, we show how a TPoT random variable can be constructed from a set of independent univariate standard normal random variables, enabling efficient training through variational inference. Subsequently, we construct a new model where topographic neighborhoods are extended over time, introducing temporal coherence and encouraging the unsupervised learning of approximately equivariant subspaces which we call capsules.

### 4.1 The Product of Student's-t Model

We assume that our observed data is generated by a latent variable model where the joint distribution over observed and latent variables x and t factorizes into the product of the conditional and the prior. The prior distribution p_T(t) is assumed to be a Topographic Product of Student's-t (TPoT) distribution, and we parameterize the conditional distribution with a flexible function approximator:

$$p_{X,T}(\mathbf{x}, \mathbf{t}) = p_{X|T}(\mathbf{x}|\mathbf{t})\, p_T(\mathbf{t}), \qquad p_{X|T}(\mathbf{x}|\mathbf{t}) = p_\theta(\mathbf{x}|g_\theta(\mathbf{t})), \qquad p_T(\mathbf{t}) = \mathrm{TPoT}(\mathbf{t}; \nu) \qquad (1)$$

The goal of training is thus to learn the parameters θ such that the marginal distribution of the model pθ(x) matches that of the observed data. Unfortunately, the marginal likelihood is intractable for all but the simplest choices of gθ and p_T [45]. Prior work has therefore resorted to techniques such as contrastive divergence with Gibbs sampling [59] to train TPoT models as energy-based models. In the following section, we instead demonstrate how TPoT variables can be constructed as a deterministic function of Gaussian random variables, enabling the use of variational inference and efficient maximization of the likelihood through the evidence lower bound (ELBO).

### 4.2 Constructing the Product of Student's-t Distribution

First, note that a univariate Student's-t random variable T with ν degrees of freedom can be defined as:

$$T = \frac{Z}{\sqrt{\frac{1}{\nu}\sum_{i=1}^{\nu} U_i^2}} \quad \text{with} \quad Z, U_i \sim \mathcal{N}(0, 1)\ \forall i \qquad (2)$$

where Z and {U_i}_{i=1}^{ν} are independent standard normal random variables. If T is a multidimensional Student's-t random variable, composed of independent Z_i and U_i, then T ∼ PoT(ν), i.e.:

$$\left[\frac{Z_1}{\sqrt{\frac{1}{\nu}\sum_{i=1}^{\nu} U_i^2}},\ \frac{Z_2}{\sqrt{\frac{1}{\nu}\sum_{i=\nu+1}^{2\nu} U_i^2}},\ \ldots,\ \frac{Z_n}{\sqrt{\frac{1}{\nu}\sum_{i=(n-1)\nu+1}^{n\nu} U_i^2}}\right] \sim \mathrm{PoT}(\nu) \qquad (3)$$

Note that the Student's-t variable T is large when most of the {U_i}_i in its set are small. We can therefore think of the {U_i}_i as constraint violations rather than pattern matches: if the input matches all constraints (U_i ≈ 0), the corresponding T variables will activate (see [23] for further discussion).
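The construction in Equations (2) and (3) is easy to verify numerically. The following sketch (our illustration; the sample size and degrees of freedom are arbitrary) draws T = Z / √((1/ν)∑ᵢUᵢ²) from independent standard normals and checks with a Kolmogorov-Smirnov test that its marginal matches a Student's-t distribution with ν degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nu, n_samples = 5, 200_000

# Independent standard normals, as in Equation (2).
z = rng.normal(size=n_samples)
u = rng.normal(size=(n_samples, nu))

# T = Z / sqrt((1/nu) * sum_i U_i^2): the sum of nu squared normals is chi^2_nu,
# so T should follow a Student's-t distribution with nu degrees of freedom.
t = z / np.sqrt(np.mean(u ** 2, axis=1))

# Two-sided Kolmogorov-Smirnov test against the Student's-t CDF.
ks = stats.kstest(t, "t", args=(nu,))
print(f"KS statistic = {ks.statistic:.4f}, p-value = {ks.pvalue:.3f}")
```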
### 4.3 Introducing Topography

To make the PoT distribution topographic, we strive to correlate the scales of the T_j which are nearby in our topographic layout. One way to accomplish this is by sharing some U_i variables between neighboring T_j's. Formally, we define overlapping neighborhoods N(j) for each variable T_j and write:

$$\left[\frac{Z_1}{\sqrt{\frac{1}{\nu}\sum_{i \in N(1)} U_i^2}},\ \frac{Z_2}{\sqrt{\frac{1}{\nu}\sum_{i \in N(2)} U_i^2}},\ \ldots,\ \frac{Z_n}{\sqrt{\frac{1}{\nu}\sum_{i \in N(n)} U_i^2}}\right] \sim \mathrm{TPoT}(\nu) \qquad (4)$$

With some abuse of notation, if we define W to be the adjacency matrix which defines our neighborhood structure, and U and Z to be the vectors of random variables U_i and Z_j, we can write the above succinctly as:

$$\left[\frac{Z_1}{\sqrt{\frac{1}{\nu} \mathbf{W}_1 \mathbf{U}^2}},\ \frac{Z_2}{\sqrt{\frac{1}{\nu} \mathbf{W}_2 \mathbf{U}^2}},\ \ldots,\ \frac{Z_n}{\sqrt{\frac{1}{\nu} \mathbf{W}_n \mathbf{U}^2}}\right] \sim \mathrm{TPoT}(\nu) \qquad (5)$$

where W_j denotes the j-th row of W. Due to non-linearities such as ReLUs which may alter input distributions, it is beneficial to allow the Z variables to model the mean and scale. We found this can be achieved with the following parameterization: T = (Z − µ) ⊙ σ / √((1/ν) W U²). In practice, we found that σ = ν often works well, finally yielding:

$$\mathbf{T} = \frac{\mathbf{Z} - \boldsymbol{\mu}}{\sqrt{\mathbf{W}\mathbf{U}^2}} \qquad (6)$$

Given this construction, we observe that the TPoT generative model can instead be viewed as a latent variable model where all random variables are Gaussian, and the construction of T in Equation 6 is the first layer of the generative decoder: gθ(t) = gθ(u, z). In Section 5 we then leverage this interpretation to show how an approximate posterior for the latent variables Z and U can be trained through variational inference.

### 4.4 Capsules as Disjoint Topologies

Figure 2: An example of a neighborhood structure which induces disjoint topologies (a.k.a. capsules). Lines between variables T_i indicate the sharing of U_i, and thus correlation.

One setting of the neighborhood structure W which is of particular interest is when there exist multiple sets of disjoint neighborhoods. Statistically, the variables of two disjoint topologies are completely independent. An example of such a capsule neighborhood structure is shown in Figure 2. The idea of independent subspaces has previously been shown to learn invariant feature subspaces in the linear setting, and is present in early work on Independent Subspace Analysis [26] and Adaptive Subspace Self-Organizing Maps (ASSOM) [35]. It is also very reminiscent of the transformed sets of features present in a group equivariant convolutional neural network. In the next section, we show how temporal coherence can be leveraged to induce the encoding of observed transformations into the internal dimensions of such capsules, thereby yielding unsupervised approximately equivariant capsules.

### 4.5 Temporal Coherence and Learned Equivariance

We now describe how the induced topographic organization can be leveraged to learn a basis of approximately equivariant capsules for observed transformation sequences. The resulting representation is composed of a large set of capsules where the dimensions inside each capsule are topographically structured, but between the capsules there is independence. To benefit from sequences of input, we encourage topographic structure over time between sequentially permuted activations within a capsule, a property we refer to as shifting temporal coherence.

#### 4.5.1 Temporal Coherence

Temporal coherence can be measured as the correlation of squared activations between time steps. One way we can achieve this in our model is by having T_j share U_i between time steps. Formally, the generative model is identical to Equation 1, factorizing over timesteps denoted by subscript l, i.e. p_{X_l,T_l}(x_l, t_l) = p_{X_l|T_l}(x_l|t_l) p_{T_l}(t_l). However, T_l is now a function of a sequence {U_{l+δ}}_{δ=−L}^{L}:

$$\mathbf{T}_l = \frac{\mathbf{Z}_l - \boldsymbol{\mu}}{\sqrt{\mathbf{W}\left[\mathbf{U}_{l+L}^2;\ \cdots;\ \mathbf{U}_{l-L}^2\right]}} \qquad (7)$$

where [U²_{l+L}; ⋯; U²_{l−L}] denotes vertical concatenation of the column vectors U_l, and 2L can be seen as the window size. We see that the choice of W now defines correlation structure over time. In prior work on temporal coherence (denoted Bubbles [25]), the grouping over time is such that a given variable T_{l,i} has correlated energy with the same spatial location (i) at a previous time step (l − 1), i.e. cov(T²_{l,i}, T²_{l−1,i}) > 0. This can be implemented as:

$$\mathbf{W}\left[\mathbf{U}_{l+L}^2;\ \cdots;\ \mathbf{U}_{l-L}^2\right] = \sum_{\delta=-L}^{L} \mathbf{W}_\delta \mathbf{U}_{l+\delta}^2 \qquad (8)$$

where W_δ defines the topography for a single timestep, and is typically the same for all timesteps.
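To make Equations (7) and (8) concrete, the sketch below (our illustration; the capsule layout, window length, and block-structured choice of W_δ are arbitrary assumptions for the example) builds disjoint capsule neighborhoods, extends them over a temporal window of 2L+1 steps, and computes the center-timestep topographic variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_caps, cap_dim, L = 4, 6, 2          # 4 capsules of 6 dims, window 2L+1 = 5 steps
n = n_caps * cap_dim
mu = 0.0

# W_delta: within each capsule, every T_j pools the squared U's of its whole
# capsule (disjoint neighborhoods => block-diagonal 0/1 adjacency matrix).
block = np.ones((cap_dim, cap_dim))
W_delta = np.kron(np.eye(n_caps), block)          # shape (n, n)

# A short sequence of U variables (time on the first axis) and a single Z.
U = rng.normal(size=(2 * L + 1, n))               # indices 0..2L <-> delta = -L..L
Z = rng.normal(size=n)

# Equation (8): pool squared U's over the temporal window at the same spatial
# locations, then Equation (7): divide (Z - mu) by the square root of the pool.
pooled = sum(W_delta @ (U[step] ** 2) for step in range(2 * L + 1))
T_center = (Z - mu) / np.sqrt(pooled)
print(T_center.shape)                             # (24,)
```

Replacing each `U[step] ** 2` with a cyclic shift of δ steps within every capsule gives the shifting variant introduced in the next subsection (Equation 9).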
#### 4.5.2 Learned Equivariance with Shifting Temporal Coherence

In our model, instead of requiring a single location to have correlated energies over a sequence, we would like variables at sequentially permuted locations within a capsule to have correlated energy between timesteps, i.e. cov(T²_{l,i}, T²_{l−1,i−1}) > 0. Similarly, this can be implemented as:

$$\mathbf{W}\left[\mathbf{U}_{l+L}^2;\ \cdots;\ \mathbf{U}_{l-L}^2\right] = \sum_{\delta=-L}^{L} \mathbf{W}_\delta\, \mathrm{Roll}_\delta\!\left(\mathbf{U}_{l+\delta}^2\right) \qquad (9)$$

where Roll_δ(U²_{l+δ}) denotes a cyclic permutation of δ steps along the capsule dimension. The exact implementation of Roll can be found in Section A.11. As we will show in Section 6.3, TVAE models with such a topographic structure learn to encode observed sequence transformations as Rolls within the capsule dimension, analogous to a group equivariant neural network where τρ and Roll_1 can be seen as the action of the transformation ρ on the input and output spaces respectively.

## 5 Topographic VAE

To train the parameters of the generative model θ, we use the above formulation to parameterize an approximate posterior for t in terms of a deterministic transformation of approximate posteriors over simpler Gaussian latent variables u and z. Explicitly:

$$q_\phi(\mathbf{z}_l|\mathbf{x}_l) = \mathcal{N}\!\left(\mathbf{z}_l;\ \mu_\phi(\mathbf{x}_l),\ \sigma_\phi(\mathbf{x}_l)\mathbf{I}\right), \qquad p_\theta(\mathbf{x}_l|g_\theta(\mathbf{t}_l)) = p_\theta(\mathbf{x}_l|g_\theta(\mathbf{z}_l, \{\mathbf{u}_l\})) \qquad (10)$$

$$q_\gamma(\mathbf{u}_l|\mathbf{x}_l) = \mathcal{N}\!\left(\mathbf{u}_l;\ \mu_\gamma(\mathbf{x}_l),\ \sigma_\gamma(\mathbf{x}_l)\mathbf{I}\right), \qquad \mathbf{t}_l = \frac{\mathbf{z}_l - \boldsymbol{\mu}}{\sqrt{\mathbf{W}\left[\mathbf{u}_{l+L}^2;\ \cdots;\ \mathbf{u}_{l-L}^2\right]}} \qquad (11)$$

We denote this model the Topographic VAE (TVAE) and optimize the parameters θ, ϕ, γ (and µ) through the ELBO, summed over the sequence length S:

$$\sum_{l=1}^{S} \left( \mathbb{E}_{Q_{\phi,\gamma}(\mathbf{z}_l, \mathbf{u}_l|\{\mathbf{x}_l\})}\!\left[\log p_\theta(\mathbf{x}_l|g_\theta(\mathbf{t}_l))\right] - D_{\mathrm{KL}}\!\left[q_\phi(\mathbf{z}_l|\mathbf{x}_l)\,\|\,p_Z(\mathbf{z}_l)\right] - D_{\mathrm{KL}}\!\left[q_\gamma(\mathbf{u}_l|\mathbf{x}_l)\,\|\,p_U(\mathbf{u}_l)\right] \right) \qquad (12)$$

where Q_{ϕ,γ}(z_l, u_l|{x_l}) = q_ϕ(z_l|x_l) ∏_{δ=−L}^{L} q_γ(u_{l+δ}|x_{l+δ}), and {·} denotes a set over time.

## 6 Experiments

In the following experiments, we demonstrate the viability of the Topographic VAE as a novel method for training deep topographic generative models. Additionally, we quantitatively verify that shifting temporal coherence yields approximately equivariant capsules by computing an equivariance loss and a correlation metric inspired by the disentanglement literature. We show that equivariant capsule models yield higher likelihood than baselines on test sequences, and qualitatively support these results with visualizations of sequences reconstructed purely from Rolled capsule activations.

### 6.1 Evaluation Methods

As depicted in Figure 1, we make use of capsule traversals to qualitatively visualize the transformations learned by our network. Simply, these are constructed by encoding a partial sequence into a t_0 variable, and decoding sequentially Rolled copies of this variable. Explicitly, in the top row we show the data sequence {x_l}_l, and in the bottom row we show the decoded sequence {g_θ(Roll_l(t_0))}_l.

To measure equivariance quantitatively, we measure an equivariance error similar to [15]. The equivariance error can be seen as the difference between traversing the two distinct paths of the commutative diagram, and provides some measure of how precisely the function and the transform commute. Formally, for a sequence of length S, and t̂ = t / ‖t‖₂, the error is defined as:

$$E_{\mathrm{eq}}\!\left(\{\mathbf{t}_l\}_{l=1}^{S}\right) = \sum_{l,\delta} \left\| \mathrm{Roll}_\delta(\hat{\mathbf{t}}_l) - \hat{\mathbf{t}}_{l+\delta} \right\|_1 \qquad (13)$$

Additionally, inspired by existing disentanglement metrics, we measure the degree to which observed transformations in capsule space are correlated with input transformations by introducing a new metric we call CapCorr. Simply, this metric computes the correlation between the amount of observed Roll of a capsule's activation at two timesteps l and l + δ, and the shift of the ground-truth generative factors y_l over that same interval. Formally, for a correlation coefficient Corr:

$$\mathrm{CapCorr}(\mathbf{t}_l, \mathbf{t}_{l+\delta}, y_l, y_{l+\delta}) = \mathrm{Corr}\!\left(\operatorname{argmax}\left[\mathbf{t}_l \star \mathbf{t}_{l+\delta}\right],\ |y_l - y_{l+\delta}|\right) \qquad (14)$$

where ⋆ is the discrete periodic cross-correlation across the capsule dimension, and the correlation coefficient is computed across the entire dataset. We see the argmax of the cross-correlation is an estimate of the degree to which a capsule activation has shifted from time l to l + δ. To extend this to multiple capsules, we can replace the argmax function with the mode of the argmax computed over all capsules. We provide additional details and extensions of this metric in Section A.10. For measuring capsule metrics on baseline models which do not naturally have capsules, we simply divide the latent space arbitrarily into a fixed set of corresponding capsules and capsule dimensions, and provide such results as equivalent to random baselines for these metrics.
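A minimal sketch of the two metrics just described (our reading of Equations (13) and (14); the summation over all valid offsets in E_eq and the use of the Pearson coefficient in CapCorr are assumptions on our part): normalize each code, compare Rolled codes against the actual future codes in L1 distance, and estimate the per-capsule roll via the argmax of a periodic cross-correlation.

```python
import numpy as np

def equivariance_error(t_seq):
    """t_seq: (S, n_caps, cap_dim) capsule codes for one sequence.
    L1 distance between Roll_delta of t_l and the actual t_{l+delta},
    after normalizing each code to unit L2 norm (cf. Eq. 13)."""
    S = t_seq.shape[0]
    norms = np.linalg.norm(t_seq.reshape(S, -1), axis=1)
    t_hat = t_seq / norms[:, None, None]
    err = 0.0
    for l in range(S):
        for delta in range(1, S - l):
            rolled = np.roll(t_hat[l], delta, axis=-1)   # Roll within each capsule
            err += np.abs(rolled - t_hat[l + delta]).sum()
    return err

def estimated_roll(t_l, t_ld):
    """Estimate how far a single capsule's activation has shifted between two
    timesteps, via the argmax of a periodic cross-correlation (cf. Eq. 14)."""
    xcorr = np.array([np.dot(np.roll(t_l, s), t_ld) for s in range(len(t_l))])
    return int(np.argmax(xcorr))

def cap_corr(rolls, factor_shifts):
    """Correlate estimated rolls with absolute ground-truth factor shifts
    accumulated over the whole dataset."""
    return np.corrcoef(np.asarray(rolls), np.abs(np.asarray(factor_shifts)))[0, 1]
```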
### 6.2 Topographic VAE without Temporal Coherence

Figure 3: Maximum activating images for a Topographic VAE trained with a 2D torus topography on MNIST.

To validate that the TVAE is capable of learning topographically organized representations with deep neural networks, we first perform experiments on a Topographic VAE without temporal coherence. The model is constructed as in Equations 10 and 11 with L = 0, and is trained to maximize Equation 12. We fix W such that globally the latent variables are arranged in a grid on a 2-dimensional torus (a single capsule), and locally W sums over 5x5 2D groups of variables. In this setting, W can be easily implemented as a 2D convolution with a 5x5 kernel of ones, stride 1, and cyclic padding. We see that training the model with 3-layer MLPs for the encoders and decoder indeed yields a 2D topographic organization of higher-level features. In Figure 3, we show the maximum activating image for each final-layer neuron of the capsule, plotted as a flattened torus. We see that the neurons become arranged according to class, orientation, width, and other learned features.
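A hedged PyTorch sketch of the W just described (the grid size, single-capsule layout, and zero µ are chosen arbitrarily for illustration): circular padding followed by a 5x5 all-ones kernel sums the squared u activations over each toroidal neighborhood, which then scales z as in Equation (6).

```python
import torch
import torch.nn.functional as F

grid = 20                                   # latent variables on a 20x20 torus
u = torch.randn(1, 1, grid, grid)           # posterior sample of U (one capsule)
z = torch.randn(1, 1, grid, grid)           # posterior sample of Z
mu = torch.zeros_like(z)                    # mu fixed to zero for this example

kernel = torch.ones(1, 1, 5, 5)             # 5x5 kernel of ones, stride 1

# Cyclic (toroidal) padding by 2 on every side, then an ordinary convolution:
# each output entry is the sum of squared U's in its 5x5 neighborhood, i.e. W u^2.
u2_padded = F.pad(u ** 2, (2, 2, 2, 2), mode="circular")
w_u2 = F.conv2d(u2_padded, kernel, stride=1)

t = (z - mu) / torch.sqrt(w_u2)             # topographic variable, cf. Eq. (6)
print(t.shape)                              # torch.Size([1, 1, 20, 20])
```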
### 6.3 Learning Equivariant Capsules

In the remaining experiments, we provide evidence that the Topographic VAE can be leveraged to learn equivariant capsules by incorporating shifting temporal coherence into a 1D baseline topographic model. We compare against two baselines: standard normal VAEs, and models that have non-shifting, stationary temporal coherence as defined in Equation 8 (denoted Bubble VAE [25]). In all experiments we use a 3-layer MLP with ReLU activations for both encoders and the decoder. We arrange the latent space into 15 circular capsules each of 15 dimensions for dSprites [43], and 18 circular capsules each of 18 dimensions for MNIST [37]. Example sequences {x_l}_{l=1}^{S} are formed by taking a random initial example and sequentially transforming it according to one of the available transformations: (X-Pos, Y-Pos, Orientation, Scale) for dSprites, and (Color, Scale, Orientation) for MNIST. All transformation sequences are cyclic, such that when the maximum transformation parameter is reached, the subsequent value returns to the minimum. We denote the length of a full transformation sequence by S, and the time-extent of the induced temporal coherence (i.e. the length of the input sequence) by 2L. For simplicity, both datasets are constructed such that the sequence length S equals the capsule dimension (for dSprites this involves taking a subset of the full dataset and looping the scale 3 times for a scale-sequence). Exact details are in Sections A.8 & A.9.

In Figure 4, we show the capsule traversals for TVAE models with L ≥ S/3. We see that despite the t_0 variable encoding only 2/3 of the sequence, the remainder of the transformation sequence can be decoded nearly perfectly by permuting the activation through the full capsule, implying the model has learned to be approximately equivariant to full sequences while only observing partial sequences per training point. Furthermore, we see that the model is able to successfully learn all transformations simultaneously for the respective datasets.

Figure 4: Capsule traversals for TVAE models on dSprites and MNIST. The top rows show the encoded sequences (with greyed-out images held out), and the bottom rows show the images generated by decoding sequentially Rolled copies of the initial activation t_0 (indicated by a grey border). Capsule traversals for the non-equivariant baselines, as well as TVAEs with smaller values of L (which only learn approximate equivariance to partial sequences), are shown in Section D.

We note that the capsule traversal plotted in Figure 1 demonstrates a transformation where color and rotation change simultaneously, differing from how the models in this section are trained. However, as we describe in more detail in Section B.4, we observe that TVAEs trained with individual transformations in isolation (as in this section) are able to generalize, generating sequences of combined transformations when presented with such partial input sequences at test time. We believe this generalization capability to be promising for data efficiency, but leave further exploration to future work. Additional capsule traversals with such unseen combined transformations are shown in Section B.4, and further complex learned transformations (such as perspective transforms) are shown at the end of Section D.

For a more quantitative evaluation, in Table 1 we measure the equivariance error and log-likelihood (reported in nats) of the test data under our trained MNIST models, as estimated by importance sampling with 10 samples. We observe that models which incorporate temporal coherence (Bubble VAE and TVAE with L > 0) achieve low equivariance error, while the TVAE models with shifting temporal coherence achieve the highest likelihood and the lowest equivariance error simultaneously.

Table 1: Log-likelihood and equivariance error on MNIST for different settings of the temporal coherence length L relative to the sequence length S. Mean ± std. over 3 random initializations.

| Model | TVAE | TVAE | TVAE | Bubble VAE | VAE |
|---|---|---|---|---|---|
| L | S/2 | 5S/36 | 0 | 5S/36 | 0 |
| log p(x) | −186.8 ± 0.1 | −186.0 ± 0.7 | −218.5 ± 0.9 | −191.4 ± 0.5 | −189.0 ± 0.8 |
| E_eq | 574 ± 2 | 3247 ± 3 | 3217 ± 105 | 3370 ± 12 | 13274 ± 1 |

To further understand how capsules transform for observed input transformations, in Table 2 we measure E_eq and the CapCorr metric on the dSprites dataset for the four proposed transformations.

Table 2: Equivariance error (E_eq) and correlation of observed capsule Roll with ground-truth factor shift (CapCorr) for the dSprites dataset. Mean ± standard deviation over 3 random initializations.

| Model | TVAE | TVAE | TVAE | TVAE | Bubble VAE | VAE |
|---|---|---|---|---|---|---|
| L |  |  | S/6 | 0 | 1 |  |
| CapCorr X | 1.0 ± 0 | 1.0 ± 0 | 0.67 ± 0.02 | 0.17 ± 0.03 | 0.13 ± 0.01 | 0.18 ± 0.01 |
| CapCorr Y | 1.0 ± 0 | 1.0 ± 0 | 0.66 ± 0.02 | 0.21 ± 0.02 | 0.12 ± 0.01 | 0.16 ± 0.01 |
| CapCorr O | 1.0 ± 0 | 1.0 ± 0 | 0.52 ± 0.01 | 0.09 ± 0.01 | 0.10 ± 0.01 | 0.11 ± 0.00 |
| CapCorr S | 1.0 ± 0 | 1.0 ± 0 | 0.42 ± 0.01 | 0.51 ± 0.01 | 0.50 ± 0.00 | 0.52 ± 0.00 |
| E_eq | 344 ± 5 | 1034 ± 6 | 2549 ± 38 | 2971 ± 9 | 1951 ± 34 | 6934 ± 0 |
We see that the TVAE with L ≥ S/3 achieves perfect correlation, implying the learned representation indeed permutes cyclically within capsules for observed transformation sequences. Further, this correlation gradually decreases as L decreases, eventually reaching the same level as the baselines. We also see that, on both datasets, the equivariance losses for the TVAE with L = 0 and the Bubble VAE are significantly lower than for the baseline VAE, while conversely, the CapCorr metric is not significantly better. We believe this to be due to the fundamental difference between the metrics: E_eq measures continuous L1 similarity, which is still low when a representation is locally smooth (even if the change of the representation does not follow the observed transformation), whereas CapCorr more strictly measures the correspondence between the transformation of the input and the transformation of the representation. In other words, E_eq may be misleadingly low for invariant capsule representations (as with the Bubble VAE), whereas CapCorr strictly measures equivariance.

## 7 Future Work & Limitations

The model presented in this work has a number of limitations in its existing form which we believe to be interesting directions for future research. Foremost, the model is challenging to compare directly with existing disentanglement and equivariance literature since it requires an input sequence which determines the transformations reachable through the capsule roll. Related to this, we note the temporal coherence proposed in our model is not causal (i.e. t_0 depends on future x_l). We believe these limitations could be at least partially alleviated with minor extensions detailed in Section C. We additionally note that some model developers may find the a priori definition of topographic structure burdensome. While true, we know that the construction of appropriate priors is always a challenging task in latent variable models, and we observe that our proposed TVAE achieves strong performance even with improper specification. Furthermore, in future work, we believe adding learned flexibility to the parameters W may alleviate some of this burden. Finally, we note that while this work does demonstrate improved log-likelihood and equivariance error, the study is inherently preliminary and does not examine all important benefits of topographic or approximately equivariant representations. Specifically, further study of the TVAE both with and without temporal coherence in terms of sample complexity, semi-supervised classification accuracy, and invariance through structured topographic pooling would be enlightening.

## 8 Conclusion

In the above work we introduce the Topographic Variational Autoencoder as a method to train deep topographic generative models, and show how topography can be leveraged to learn approximately equivariant sets of features, a.k.a. capsules, directly from sequences of data with no other supervision. Ultimately, we believe these results may shine some light on how biological systems could hard-wire themselves to more effectively learn representations with equivariant capsule structure. In terms of broader impact, it is foreseeable that our model could be used to generate more realistic transformations of "deepfakes", enhancing disinformation. Given that the model learns approximate equivariance, we caution against over-reliance on its equivariant properties, as these have no known formal guarantees.
## References

[1] Horace B. Barlow et al. Possible principles underlying the transformation of sensory messages. Sensory Communication, 1(01), 1961.
[2] Anthony J. Bell and Terrence J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 11 1995.
[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[4] Gregory Benton, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. Learning invariances in neural networks. Advances in Neural Information Processing Systems, December 2020.
[5] Lukas Biewald. Experiment tracking with Weights and Biases, 2020. Software available from wandb.com.
[6] Christopher Bishop, Markus Svensen, and Christopher Williams. GTM: The generative topographic mapping. Neural Computation, 10:215–234, 05 1997.
[7] Benedikt Boenninghoff, Steffen Zeiler, Robert M. Nickel, and Dorothea Kolossa. Variational autoencoder with embedded Student-t mixture model for authorship attribution. arXiv, abs/2005.13930, 2020.
[8] Diane Bouchacourt, Mark Ibrahim, and Stéphane Deny. Addressing the topological defects of disentanglement via distributed operators. arXiv, abs/2102.05623, 2021.
[9] Taco Cohen and M. Welling. Steerable CNNs. arXiv, abs/1612.08498, 2017.
[10] Taco Cohen and Max Welling. Learning the irreducible representations of commutative Lie groups. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1755–1763, Bejing, China, 22–24 Jun 2014. PMLR.
[11] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning, pages 2990–2999, 2016.
[12] Taco S. Cohen and Max Welling. Transformation properties of learned visual representations. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, 2015.
[13] Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
[14] Marissa Connor, Gregory Canal, and Christopher Rozell. Variational autoencoder with learned latent structure. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2359–2367. PMLR, 13–15 Apr 2021.
[15] Nichita Diaconu and Daniel Worrall. Learning to convolve: A generalized weight-tying approach. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1586–1595. PMLR, 09–15 Jun 2019.
[16] Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to Lie groups on arbitrary continuous data. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3165–3176. PMLR, 13–18 Jul 2020.
[17] Marc Finzi, Max Welling, and Andrew Gordon Wilson. A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3318–3328. PMLR, 18–24 Jul 2021.
[18] Peter Földiák. Learning invariance from transformation sequences. Neural Computation, 3:194–200, 06 1991.
[19] Will Grathwohl and Aaron Wilson. Disentangling space and time in video with hierarchical variational auto-encoders. CoRR, abs/1612.04440, 2016.
[20] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv, abs/1812.02230, 2018.
[21] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In Timo Honkela, Włodzisław Duch, Mark Girolami, and Samuel Kaski, editors, Artificial Neural Networks and Machine Learning – ICANN 2011, pages 44–51, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.
[22] Geoffrey E. Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations, 2018.
[23] Geoffrey E. Hinton and Yee-Whye Teh. Discovering multiple constraints that are frequently approximately satisfied. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI'01, pages 227–234, 2001.
[24] Jarmo Hurri and Aapo Hyvärinen. Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3):663–691, 03 2003.
[25] A. Hyvärinen, J. Hurri, and Jaakko J. Väyrynen. A unifying framework for natural image statistics: spatiotemporal activity bubbles. Neurocomputing, 58-60:801–806, 2004.
[26] Aapo Hyvärinen and Patrik Hoyer. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
[27] Aapo Hyvärinen, Patrik O. Hoyer, and Mika Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.
[28] Aapo Hyvärinen, Jarmo Hurri, and Patrik O. Hoyer. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, volume 39. Springer Science & Business Media, 2009.
[29] Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[30] Aapo Hyvärinen and Patrik O. Hoyer. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413–2423, 2001.
[31] Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, and Yann LeCun. Learning invariant features through topographic filter maps. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1605–1612. IEEE, 2009.
[32] T. Anderson Keller, Qinghe Gao, and Max Welling. Modeling category-selective cortical regions with topographic variational autoencoders. In SVRHM 2021 Workshop @ NeurIPS, 2021.
[33] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, 2014.
[34] David A. Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations, 2021.
[35] Teuvo Kohonen. Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biological Cybernetics, 75(4):281–291, 1996.
[36] Alexei A. Koulakov and Dmitri B. Chklovskii. Orientation preference patterns in mammalian visual cortex: a wire length minimization approach. Neuron, 29(2):519–527, 2001.
[37] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[38] Hyodong Lee, Eshed Margalit, Kamila M. Jozwik, Michael A. Cohen, Nancy Kanwisher, Daniel L. K. Yamins, and James J. DiCarlo. Topographic deep artificial neural networks reproduce the hallmarks of the primate inferior temporal cortex face processing network. bioRxiv, 07/2020, 2020.
[39] Jan Eric Lenssen, Matthias Fey, and Pascal Libuschewski. Group equivariant capsule networks. In NeurIPS, pages 8858–8867, 2018.
[40] S. Lyu and E. P. Simoncelli. Nonlinear image representation using divisive normalization. In Proc. Computer Vision and Pattern Recognition, pages 1–8. IEEE Computer Society, Jun 23–28 2008.
[41] S. Lyu and E. P. Simoncelli. Modeling multiscale subbands of photographic images with fields of Gaussian scale mixtures. IEEE Trans. Patt. Analysis and Machine Intelligence, 31(4):693–706, Apr 2009.
[42] Libo Ma and Liqing Zhang. Overcomplete topographic independent component analysis. Neurocomputing, 71(10-12):2217–2223, 2008.
[43] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
[44] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[45] Simon Osindero, Max Welling, and Geoffrey E. Hinton. Topographic product models applied to natural scene statistics. Neural Computation, 18(2):381–414, 02 2006.
[46] Simon Kayode Osindero. Contrastive Topographic Models. PhD thesis, University of London, 2004.
[47] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. 2019.
[48] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. Image Processing, 12(11):1338–1351, Nov 2003. Recipient, IEEE Signal Processing Society Best Paper Award, 2008.
[49] Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. Equivariance through parameter-sharing. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2892–2901. PMLR, 06–11 Aug 2017.
[50] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14, pages II-1278–II-1286. JMLR.org, 2014.
[51] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 3859–3869, Red Hook, NY, USA, 2017. Curran Associates Inc.
[52] James V. Stone. Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8(7):1463–1492, 10 1996.
[53] Jan Stühmer, Richard E. Turner, and Sebastian Nowozin. Independent subspace analysis for unsupervised learning of disentangled representations, 2019.
[54] Richard Turner and Maneesh Sahani. A maximum-likelihood interpretation for slow feature analysis. Neural Computation, 19:1022–1038, 05 2007.
[55] Elise van der Pol, Daniel E. Worrall, Herke van Hoof, Frans A. Oliehoek, and Max Welling. MDP homomorphic networks: Group symmetries in reinforcement learning. CoRR, abs/2006.16908, 2020.
[56] M. J. Wainwright and E. P. Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems (NIPS*99), volume 12, pages 855–861, Cambridge, MA, May 2000. MIT Press.
[57] M. J. Wainwright, E. P. Simoncelli, and A. S. Willsky. Random cascades on wavelet trees and their use in analyzing and modeling natural images. Applied and Computational Harmonic Analysis, 11(1):89–123, Jul 2001.
[58] Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 10402–10413, Red Hook, NY, USA, 2018. Curran Associates Inc.
[59] Max Welling, Simon Osindero, and Geoffrey E. Hinton. Learning sparse topographic representations with products of Student-t distributions. In Advances in Neural Information Processing Systems, pages 1383–1390, 2003.
[60] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[61] Daniel Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[62] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7168–7177, 2017.

## 9 Acknowledgements

We would like to thank Jorn Peters for his invaluable contributions to this work at its earliest stages. We would additionally like to thank Patrick Forré, Emiel Hoogeboom, and Anna Khoreva for their helpful guidance throughout the project. We would like to thank the creators of Weights & Biases [5] and PyTorch [47]. Without these tools our work would not have been possible. Finally, we thank the Bosch Center for Artificial Intelligence for funding, and the reviewers for their helpful comments.