# Neural Expectation Maximization

Klaus Greff (IDSIA, klaus@idsia.ch) · Sjoerd van Steenkiste (IDSIA, sjoerd@idsia.ch) · Jürgen Schmidhuber (IDSIA, juergen@idsia.ch)

Both authors contributed equally to this work. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

## Abstract

Many real world tasks such as reasoning and physical interaction require identification and manipulation of conceptual entities. A first step towards solving these tasks is the automated discovery of distributed symbol-like representations. In this paper, we explicitly formalize this problem as inference in a spatial mixture model where each component is parametrized by a neural network. Based on the Expectation Maximization framework we then derive a differentiable clustering method that simultaneously learns how to group and represent individual entities. We evaluate our method on the (sequential) perceptual grouping task and find that it is able to accurately recover the constituent objects. We demonstrate that the learned representations are useful for next-step prediction.

## 1 Introduction

Learning useful representations is an important aspect of unsupervised learning, and one of the main open problems in machine learning. It has been argued that such representations should be distributed [13, 37] and disentangled [1, 31, 3]. The latter has recently received an increasing amount of attention, producing representations that can disentangle features like rotation and lighting [4, 12].

So far, these methods have mostly focused on the single object case whereas, for real world tasks such as reasoning and physical interaction, it is often necessary to identify and manipulate multiple entities and their relationships. In current systems this is difficult, since superimposing multiple distributed and disentangled representations can lead to ambiguities. This is known as the Binding Problem [21, 37, 13] and has been extensively discussed in neuroscience [33]. One solution to this problem involves learning a separate representation for each object. In order to allow these representations to be processed identically, they must be described in terms of the same (disentangled) features. This would then avoid the binding problem, and facilitate a wide range of tasks that require knowledge about individual objects. This solution requires a process known as perceptual grouping: dynamically splitting (segmenting) each input into its constituent conceptual entities.

In this work, we tackle this problem of learning how to group and efficiently represent individual entities, in an unsupervised manner, based solely on the statistical structure of the data. Our work follows a similar approach as the recently proposed Tagger [7] and aims to further develop the understanding, as well as build a theoretical framework, for the problem of symbol-like representation learning. We formalize this problem as inference in a spatial mixture model where each component is parametrized by a neural network. Based on the Expectation Maximization framework we then derive a differentiable clustering method, which we call Neural Expectation Maximization (N-EM). It can be trained in an unsupervised manner to perform perceptual grouping in order to learn an efficient representation for each group, and naturally extends to sequential data.
## 2 Neural Expectation Maximization

The goal of training a system that produces separate representations for the individual conceptual entities contained in a given input (here: image) depends on what notion of entity we use. Since we are interested in the case of unsupervised learning, this notion can only rely on statistical properties of the data. We therefore adopt the intuitive notion of a conceptual entity as being a common cause (the object) for multiple observations (the pixels that depict the object). This common cause induces a dependency-structure among the affected pixels, while the pixels that correspond to different entities remain (largely) independent. Intuitively this means that knowledge about some pixels of an object helps in predicting its remainder, whereas it does not improve the predictions for pixels of other objects. This is especially obvious for sequential data, where pixels belonging to a certain object share a common fate (e.g. move in the same direction), which makes this setting particularly appealing.

We are interested in representing each entity (object) $k$ with some vector $\theta_k$ that captures all the structure of the affected pixels, but carries no information about the remainder of the image. This modularity is a powerful invariant, since it allows the same representation to be reused in different contexts, which enables generalization to novel combinations of known objects. Further, having all possible objects represented in the same format makes it easier to work with these representations. Finally, having a separate $\theta_k$ for each object (as opposed to one for the entire image) allows $\theta_k$ to be distributed and disentangled without suffering from the binding problem.

We treat each image as a composition of $K$ objects, where each pixel is determined by exactly one object. Which objects are present, as well as the corresponding assignment of pixels, varies from input to input. Assuming that we have access to the family of distributions $P(x \mid \theta_k)$ that corresponds to an object-level representation as described above, we can model each image as a mixture model. Then Expectation Maximization (EM) can be used to simultaneously compute a Maximum Likelihood Estimate (MLE) for the individual $\theta_k$'s and the grouping that we are interested in. The central problem we consider in this work is therefore how to learn such a $P(x \mid \theta_k)$ in a completely unsupervised fashion. We accomplish this by parametrizing this family of distributions by a differentiable function $f_\phi(\theta)$ (a neural network with weights $\phi$). We show that in that case the corresponding EM procedure becomes fully differentiable, which allows us to backpropagate an appropriate outer loss into the weights of the neural network. In the remainder of this section we formalize and derive this method, which we call Neural Expectation Maximization (N-EM).

### 2.1 Parametrized Spatial Mixture Model

We model each image $x \in \mathbb{R}^D$ as a spatial mixture of $K$ components parametrized by vectors $\theta_1, \ldots, \theta_K \in \mathbb{R}^M$. A differentiable non-linear function $f_\phi$ (a neural network) is used to transform these representations $\theta_k$ into parameters $\psi_{i,k} = f_\phi(\theta_k)_i$ for separate pixel-wise distributions. These distributions are typically Bernoulli or Gaussian, in which case $\psi_{i,k}$ would be a single probability or a mean and variance, respectively. This parametrization assumes that, given the representation, the pixels are independent but not identically distributed (unlike in standard mixture models).
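To make the generative reading of this parametrization concrete, the following NumPy sketch samples a single image from such a spatial mixture. The one-layer stand-in for $f_\phi$, the Bernoulli pixel model, and all names and shapes are illustrative assumptions, not the architecture used later in the experiments; the per-pixel assignments it draws are exactly what the latent variables introduced next make explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_image(theta, W, b, pi):
    """Sample one image from the spatial mixture (illustrative shapes/parameters only).

    theta: (K, M) component representations
    W, b:  parameters of a one-layer stand-in for f_phi
    pi:    (K,)   mixing coefficients
    Returns the sampled binary image x (D,) and the true per-pixel assignments z (D,).
    """
    psi = sigmoid(theta @ W.T + b)              # (K, D) pixel-wise Bernoulli means
    D = psi.shape[1]
    z = rng.choice(len(pi), size=D, p=pi)       # each pixel is determined by one component
    x = rng.binomial(1, psi[z, np.arange(D)])   # pixel drawn from its component's distribution
    return x.astype(float), z
```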
A set of binary latent variables $z \in \{0, 1\}^{D \times K}$ encodes the unknown true pixel assignments, such that $z_{i,k} = 1$ iff pixel $i$ was generated by component $k$, and $\sum_k z_{i,k} = 1$. A graphical representation of this model can be seen in Figure 1, where $\pi = (\pi_1, \ldots, \pi_K)$ are the mixing coefficients (or prior for $z$). The full likelihood for $x$ given $\theta = (\theta_1, \ldots, \theta_K)$ is given by:

$$P(x \mid \theta) = \prod_{i=1}^{D} \sum_{z_i} P(x_i, z_i \mid \psi_i) = \prod_{i=1}^{D} \sum_{k=1}^{K} \underbrace{P(z_{i,k} = 1)}_{\pi_k} \, P(x_i \mid \psi_{i,k}, z_{i,k} = 1). \tag{1}$$

### 2.2 Expectation Maximization

Directly optimizing $\log P(x \mid \psi)$ with respect to $\theta$ is difficult due to marginalization over $z$, while for many distributions optimizing $\log P(x, z \mid \psi)$ is much easier. Expectation Maximization (EM; [6]) takes advantage of this and instead optimizes a lower bound given by the expected log likelihood:

$$Q(\theta, \theta^{\text{old}}) = \sum_{z} P(z \mid x, \psi^{\text{old}}) \log P(x, z \mid \psi). \tag{2}$$

Figure 1: left: the probabilistic graphical model that underlies N-EM; right: illustration of the computations for two steps of N-EM.

Iterative optimization of this bound alternates between two steps: in the E-step we compute a new estimate of the posterior probability distribution over the latent variables given $\theta^{\text{old}}$ from the previous iteration, yielding a new soft-assignment of the pixels to the components (clusters):

$$\gamma_{i,k} := P(z_{i,k} = 1 \mid x_i, \psi_i^{\text{old}}). \tag{3}$$

In the M-step we then aim to find the configuration of $\theta$ that would maximize the expected log-likelihood using the posteriors computed in the E-step. Due to the non-linearity of $f_\phi$ there exists no analytical solution to $\arg\max_\theta Q(\theta, \theta^{\text{old}})$. However, since $f_\phi$ is differentiable, we can improve $Q(\theta, \theta^{\text{old}})$ by taking a gradient ascent step:

$$\theta^{\text{new}} = \theta^{\text{old}} + \eta \, \frac{\partial Q}{\partial \theta}, \qquad \text{where} \quad \frac{\partial Q}{\partial \theta_k} \propto \sum_{i=1}^{D} \gamma_{i,k} \, (\psi_{i,k} - x_i) \, \frac{\partial \psi_{i,k}}{\partial \theta_k}. \tag{4}$$

Here we assume that $P(x_i \mid z_{i,k} = 1, \psi_{i,k})$ is given by $\mathcal{N}(x_i; \mu = \psi_{i,k}, \sigma^2)$ for some fixed $\sigma^2$, yet a similar update arises for many typical parametrizations of pixel distributions. The resulting algorithm belongs to the class of generalized EM algorithms and is guaranteed (for a sufficiently small learning rate $\eta$) to converge to a (local) optimum of the data log likelihood [42].

### 2.3 Unrolling

In our model the information about statistical regularities required for clustering the pixels into objects is encoded in the neural network $f_\phi$ with weights $\phi$. So far we have considered $f_\phi$ to be fixed and have shown how we can compute an MLE for $\theta$ alongside the appropriate clustering. We now observe that by unrolling the iterations of the presented generalized EM, we obtain an end-to-end differentiable clustering procedure based on the statistical model implemented by $f_\phi$. We can therefore use (stochastic) gradient descent and fit the statistical model to capture the regularities corresponding to objects for a given dataset. This is implemented by back-propagating an appropriate loss (see Section 2.4) through time (BPTT; [39, 41]) into the weights $\phi$. We refer to this trainable procedure as Neural Expectation Maximization (N-EM), an overview of which can be seen in Figure 1.

Figure 2: RNN-EM illustration. Note the changed encoder and recurrence compared to Figure 1.

Upon inspection of the structure of N-EM we find that it resembles $K$ copies of a recurrent neural network with hidden states $\theta_k$ that, at each timestep, receive $\gamma_k (\psi_k - x)$ as their input. Each copy generates a new $\psi_k$, which is then used by the E-step to re-estimate the soft-assignments $\gamma$. In order to accurately mimic the M-step (4) with an RNN, we must impose several restrictions on its weights and structure: the encoder must correspond to the Jacobian $\partial \psi_k / \partial \theta_k$, and the recurrent update must linearly combine the output of the encoder with $\theta_k$ from the previous timestep.
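Before relaxing these restrictions, it is instructive to write the restricted procedure out directly. The NumPy sketch below unrolls a few generalized-EM iterations, assuming a Bernoulli pixel model and a one-layer stand-in for $f_\phi$; the update line is the gradient-ascent direction of $Q$ for that particular model, and the shapes, initializations, and step size are illustrative. Training of $\phi$ by backpropagating an outer loss through this loop is not shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, K, M = 28 * 28, 3, 250                         # pixels, components, representation size
W, b = rng.normal(0, 0.01, (D, M)), np.zeros(D)   # one-layer stand-in for f_phi (assumption)
x = rng.integers(0, 2, D).astype(float)           # placeholder binary image
theta = rng.normal(0, 1.0, (K, M))                # initial component representations
pi, eta, eps = 1.0 / K, 0.5, 1e-6                 # uniform prior, step size, stabilizer

for _ in range(15):                               # unrolled generalized-EM iterations
    psi = np.clip(sigmoid(theta @ W.T + b), eps, 1 - eps)   # (K, D) Bernoulli means, psi = f_phi(theta)
    # E-step (Eq. 3): per-pixel posterior over components.
    lik = psi ** x * (1 - psi) ** (1 - x)         # P(x_i | psi_{i,k})
    gamma = pi * lik
    gamma /= gamma.sum(axis=0, keepdims=True)
    # Generalized M-step (Eq. 4): one gradient-ascent step on Q w.r.t. theta.
    # For a Bernoulli pixel model with sigmoid output the chain rule gives
    # dQ/dtheta_k = sum_i gamma_{i,k} (x_i - psi_{i,k}) W_i.
    theta += eta * ((gamma * (x - psi)) @ W)
```

In the full method this loop is precisely the piece of the computational graph that RNN-EM, introduced next, replaces with a learned recurrent update.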
Instead, by substituting that part of the computational graph of N-EM with an actual RNN (without imposing any restrictions), we obtain a new algorithm that we name RNN-EM. Although RNN-EM can no longer guarantee convergence of the data log likelihood, its recurrent weights increase the flexibility of the clustering procedure. Moreover, by using a fully parametrized recurrent weight matrix, RNN-EM naturally extends to sequential data. Figure 2 presents the computational graph of a single RNN-EM timestep.

### 2.4 Training Objective

N-EM is a differentiable clustering procedure, whose outcome relies on the statistical model $f_\phi$. We are interested in a particular unsupervised clustering that corresponds to grouping entities based on the statistical regularities in the data. To train our system, we therefore require a loss function that teaches $f_\phi$ to map from representations $\theta$ to parameters $\psi$ that correspond to pixel-wise distributions for such objects. We accomplish this with a two-term loss function that guides each of the $K$ networks to model the structure of a single object independently of any other information in the image:

$$\mathcal{L}(x) = -\sum_{i=1}^{D} \sum_{k=1}^{K} \Big( \underbrace{\gamma_{i,k} \log P(x_i, z_{i,k} \mid \psi_{i,k})}_{\text{intra-cluster loss}} \;-\; \underbrace{(1 - \gamma_{i,k}) \, D_{\mathrm{KL}}\big[P(x_i) \,\big\|\, P(x_i \mid \psi_{i,k}, z_{i,k})\big]}_{\text{inter-cluster loss}} \Big). \tag{5}$$

The intra-cluster loss corresponds to the same expected data log-likelihood $Q$ as is optimized by N-EM. It is analogous to a standard reconstruction loss used for training autoencoders, weighted by the cluster assignment. Similar to autoencoders, this objective is prone to trivial solutions in case of overcapacity, which prevent the network from modelling the statistical regularities that we are interested in. Standard techniques can be used to overcome this problem, such as making $\theta$ a bottleneck or using a noisy version of $x$ to compute the inputs to the network. Furthermore, when RNN-EM is used on sequential data we can use a next-step prediction loss.

Weighing the loss pixel-wise is crucial, since it allows each network to specialize its predictions to an individual object. However, it also introduces a problem: the loss for out-of-cluster pixels ($\gamma_{i,k} = 0$) vanishes. This leaves the network free to predict anything and does not yield specialized representations. Therefore, we add a second term (the inter-cluster loss), which penalizes the KL divergence between out-of-cluster predictions and the pixel-wise prior of the data. Intuitively this tells each representation $\theta_k$ to contain no information regarding non-assigned pixels $x_i$: $P(x_i \mid \psi_{i,k}, z_{i,k}) = P(x_i)$.

A disadvantage of the interaction between $\gamma$ and $\psi$ in (5) is that it may yield conflicting gradients. For any $\theta_k$ the loss for a given pixel $i$ can be reduced by better predicting $x_i$, or by decreasing $\gamma_{i,k}$ (i.e. taking less responsibility), which is (due to the E-step) realized by being worse at predicting $x_i$. A practical solution to this problem is obtained by stopping the $\gamma$ gradients, i.e. by setting $\partial \mathcal{L} / \partial \gamma = 0$ during backpropagation.
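A minimal sketch of this two-term objective is given below, assuming a Bernoulli pixel model and a scalar pixel-wise prior, and dropping the mixing-coefficient term inside $\log P(x_i, z_{i,k} \mid \psi_{i,k})$ since it is constant with respect to $\theta$; treating `gamma` as a constant plays the role of the stopped $\gamma$ gradients.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-6):
    """Elementwise KL( Bern(p) || Bern(q) )."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def nem_loss(x, psi, gamma, prior=0.5, eps=1e-6):
    """Two-term loss in the spirit of Eq. (5) for a Bernoulli pixel model.

    x:     (D,)   target pixels (the current frame, or the next frame for the
                  next-step prediction variant)
    psi:   (K, D) per-component Bernoulli means
    gamma: (K, D) E-step responsibilities, treated as constants here
                  (i.e. gradients on gamma are stopped)
    prior: pixel-wise prior P(x_i), a scalar for simplicity (an assumption)
    """
    psi = np.clip(psi, eps, 1 - eps)
    # Intra-cluster: responsibility-weighted reconstruction (negative log-likelihood).
    nll = -(x * np.log(psi) + (1 - x) * np.log(1 - psi))
    intra = np.sum(gamma * nll)
    # Inter-cluster: push out-of-cluster predictions towards the pixel-wise prior.
    inter = np.sum((1 - gamma) * bernoulli_kl(prior, psi))
    return intra + inter
```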
## 3 Related work

The method most closely related to our approach is Tagger [7], which similarly learns perceptual grouping in an unsupervised fashion using K copies of a neural network that work together by reconstructing different parts of the input. Unlike in the case of N-EM, these copies additionally learn to output the grouping, which gives Tagger more direct control over the segmentation and supports its use on complex texture segmentation tasks. Our work maintains a close connection to EM and relies on the posterior inference of the E-step as a grouping mechanism. This facilitates theoretical analysis and simplifies the task for the resulting networks, which we find can be markedly smaller than in Tagger. Furthermore, Tagger does not include any recurrent connections on the level of the hidden states, precluding it from next-step prediction on sequential tasks. (RTagger [15], a recurrent extension of Tagger that does support sequential data, was developed concurrently with this work.)

The binding problem was first considered in the context of neuroscience [21, 37] and has sparked some early work in oscillatory neural networks that use synchronization as a grouping mechanism [36, 38, 24]. Later, complex-valued activations have been used to replace the explicit simulation of oscillation [25, 26]. By virtue of being general computers, any RNN can in principle learn a suitable mechanism. In practice, however, it seems hard to learn, and adding a suitable mechanism like competition [40], fast weights [29], or perceptual grouping as in N-EM seems necessary.

Figure 3: Groupings by RNN-EM (bottom row) and N-EM (middle row) for six input images (top row). Both methods recover the individual shapes accurately when they are separated (a, b, f), even when confronted with the same shape (b). RNN-EM is able to handle most occlusion (c, d) but sometimes fails (e). The exact assignments are permutation invariant and depend on the γ initialization; compare (a) and (f).

Unsupervised segmentation has been studied in several different contexts [30], ranging from random vectors [14] and texture segmentation [10] to images [18, 16]. Early work in unsupervised video segmentation [17] used generalized Expectation Maximization (EM) to infer how to split frames of moving sprites. More recently, optical flow has been used to train convolutional networks to do figure/ground segmentation [23, 34]. A related line of work under the term of multi-causal modelling [28] has formalized perceptual grouping as inference in a generative compositional model of images. Masked RBMs [20], for example, extend Restricted Boltzmann Machines with a latent mask inferred through Block-Gibbs sampling.

Gradient backpropagation through inference updates has previously been addressed in the context of sparse coding with (Fast) Iterative Shrinkage/Thresholding Algorithms ((F)ISTA; [5, 27, 2]). Here the unrolled graph of a fixed number of ISTA iterations is replaced by a recurrent neural network that parametrizes the gradient computations and is trained to predict the sparse codes directly [9]. We derive RNN-EM from N-EM in a similar fashion and likewise obtain a trainable procedure that has the structure of iterative pursuit built into the architecture, while leaving tunable degrees of freedom that can improve its modeling capabilities [32]. The alternative of further empowering the network by untying its weights across iterations [11] was not considered, for flexibility reasons.

## 4 Experiments

We evaluate our approach on a perceptual grouping task for generated static images and video. By composing images out of simple shapes we have control over the statistical structure of the data, as well as access to the ground-truth clustering.
This allows us to verify that the proposed method indeed recovers the intended grouping and learns representations corresponding to these objects. In particular we are interested in studying the role of next-step prediction as an unsupervised objective for perceptual grouping, the effect of the hyperparameter K, and the usefulness of the learned representations.

In all experiments we train the networks using ADAM [19] with default parameters, a batch size of 64, and 50 000 train + 10 000 validation + 10 000 test inputs. Consistent with earlier work [8, 7], we evaluate the quality of the learned groupings with respect to the ground truth while ignoring the background and overlap regions. This comparison is done using the Adjusted Mutual Information (AMI; [35]) score, which provides a measure of clustering similarity between 0 (random) and 1 (perfect match); a minimal sketch of this evaluation is given at the end of Section 4.1. We use early stopping when the validation loss has not improved for 10 epochs. (Note that we do not stop on the AMI score, as it is not part of our objective function and is only measured to evaluate the performance after training.) A detailed overview of the experimental setup can be found in Appendix A. All reported results are averages computed over five runs. (Code to reproduce all experiments is available at https://github.com/sjoerdvansteenkiste/Neural-EM.)

### 4.1 Static Shapes

To validate that our approach yields the intended behavior we consider a simple perceptual grouping task that involves grouping three randomly chosen regular shapes located in random positions of 28 × 28 binary images [26]. This simple setup serves as a test-bed for comparing N-EM and RNN-EM, before moving on to more complex scenarios.

We implement $f_\phi$ by means of a single-layer fully connected neural network with a sigmoid output $\psi_{i,k}$ for each pixel that corresponds to the mean of a Bernoulli distribution. The representation $\theta_k$ is a real-valued 250-dimensional vector squashed to the (0, 1) range by a sigmoid function before being fed into the network. Similarly, for RNN-EM we use a recurrent neural network with 250 sigmoidal hidden units and an equivalent output layer. Both networks are trained with K = 3 and unrolled for 15 EM steps.

As shown in Figure 3, we observe that both approaches are able to recover the individual shapes as long as they are separated, even when confronted with identical shapes. N-EM performs worse if the image contains occlusion, and we find that RNN-EM is in general more stable and produces considerably better groupings. This observation is in line with findings for Sparse Coding [9]. Similarly we conclude that the tunable degrees of freedom in RNN-EM help speed up the optimization process, resulting in a more powerful approach that requires fewer iterations. The benefit is reflected in the large score difference between the two: 0.826 ± 0.005 AMI compared to 0.475 ± 0.043 AMI for N-EM. In comparison, Tagger achieves an AMI score of 0.79 ± 0.034 (and 0.97 ± 0.009 with layernorm), while using about twenty times more parameters [7].

Figure 4: A sequence of 5 shapes flying along random trajectories (bottom row). The next-step prediction of each copy of the network (rows 2 to 5) and the soft-assignment of the pixels to each of the copies (top row). Observe that the network learns to separate the individual shapes as a means to efficiently solve next-step prediction. Even when many of the shapes are overlapping, as can be seen in time-steps 18-20, the network is still able to disentangle the individual shapes from the clutter.
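The AMI evaluation described above can be sketched as follows, using scikit-learn's `adjusted_mutual_info_score`; the ground-truth labelling convention (0 for background, -1 for overlap regions) is an assumption made for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def grouping_ami(gamma, true_ids):
    """Score a grouping against ground truth, ignoring background and overlap.

    gamma:    (K, D) soft-assignments produced by N-EM / RNN-EM
    true_ids: (D,)   ground-truth object id per pixel; this sketch assumes
                     0 marks background and -1 marks overlap regions
    """
    predicted = gamma.argmax(axis=0)   # hard cluster label per pixel
    keep = true_ids > 0                # drop background and overlap pixels
    return adjusted_mutual_info_score(true_ids[keep], predicted[keep])
```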
### 4.2 Flying Shapes

We consider a sequential extension of the static shapes dataset in which the shapes float along random trajectories and bounce off walls. An example sequence with 5 shapes can be seen in the bottom row of Figure 4. We use a convolutional encoder and decoder inspired by the discriminator and generator networks of InfoGAN [4], with a recurrent neural network of 100 sigmoidal units (for details see Section A.2). At each timestep $t$ the network receives $\gamma_k (\psi_k^{(t-1)} - x^{(t)})$ as input, where $x^{(t)}$ is the current frame corrupted with additional bitflip noise ($p = 0.2$); a sketch of this input construction is given at the end of this subsection. The next-step prediction objective is implemented by replacing $x$ with $x^{(t+1)}$ in (5), and is evaluated at each time-step.

Table 1 summarizes the results on flying shapes, and an example of a sequence with 5 shapes when using K = 5 can be seen in Figure 4. For 3 shapes we observe that the produced groupings are close to perfect (AMI: 0.970 ± 0.005). Even in the very cluttered case of 5 shapes the network is able to separate the individual objects in almost all cases (AMI: 0.878 ± 0.003). These results demonstrate the adequacy of the next-step prediction task for perceptual grouping. However, we find that the converse also holds: the corresponding representations are useful for the prediction task. In Figure 5 we compare the next-step prediction error of RNN-EM with K = 1 (which reduces to a recurrent autoencoder that receives the difference between its previous prediction and the current frame as input) to RNN-EM with K = 5 on this task. To evaluate RNN-EM on next-step prediction we computed its loss using $P(x_i \mid \psi_i) = P(x_i \mid \max_k \psi_{i,k})$ as opposed to $P(x_i \mid \psi_i) = \sum_k \gamma_{i,k} P(x_i \mid \psi_{i,k})$, to avoid including information from the next timestep. The reported BCE loss for RNN-EM is therefore an upper bound to the true BCE loss. From the figure we observe that RNN-EM produces significantly lower errors, especially when the number of objects increases.

Figure 5: Binomial Cross Entropy Error obtained by RNN-EM and a recurrent autoencoder (RNN-EM with K = 1) on the denoising and next-step prediction task. RNN-EM produces significantly lower BCE across different numbers of objects.

Figure 6: Average AMI score (blue line) measured for RNN-EM (trained for 20 steps) across the flying MNIST test-set and corresponding quartiles (shaded areas), computed for each of 50 time-steps. The learned grouping dynamics generalize to longer sequences and even further improve the AMI score.

| # obj. (train) | K (train) | AMI (train) | # obj. (test) | K (test) | AMI (test) | # obj. (generalization) | K (generalization) | AMI (generalization) |
|---|---|---|---|---|---|---|---|---|
| 3 | 3 | 0.969 ± 0.006 | 3 | 3 | 0.970 ± 0.005 | 3 | 5 | 0.972 ± 0.007 |
| 3 | 5 | 0.997 ± 0.001 | 3 | 5 | 0.997 ± 0.002 | 3 | 3 | 0.914 ± 0.015 |
| 5 | 3 | 0.614 ± 0.003 | 5 | 3 | 0.614 ± 0.003 | 3 | 3 | 0.886 ± 0.010 |
| 5 | 5 | 0.878 ± 0.003 | 5 | 5 | 0.878 ± 0.003 | 3 | 5 | 0.981 ± 0.003 |

Table 1: AMI scores obtained by RNN-EM on flying shapes when varying the number of objects and the number of components K, during training and at test time.

Finally, in Table 1 we also provide insight about the impact of choosing the hyper-parameter K, which is unknown for many real-world scenarios. Surprisingly, we observe that training with too large a K is in fact favourable, and that the network learns to leave the excess groups empty. When training with too few components we find that the network still learns about the individual shapes, and we observe only a slight drop in score when correctly setting the number of components at test time. We conclude that RNN-EM is robust towards different choices of K, and specifically that choosing K to be too high is not detrimental.
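A small NumPy sketch of the per-timestep input construction and bitflip corruption described above; the function names and shapes are illustrative, and the recurrent update itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def bitflip(frame, p=0.2):
    """Corrupt a binary frame: flip each pixel independently with probability p."""
    return np.where(rng.random(frame.shape) < p, 1.0 - frame, frame)

def rnn_em_input(gamma, psi_prev, frame):
    """Per-component input gamma_k * (psi_k^{(t-1)} - x^{(t)}) fed to each RNN copy.

    gamma:    (K, D) responsibilities from the previous E-step
    psi_prev: (K, D) previous predictions
    frame:    (D,)   current frame, corrupted here with bitflip noise
    """
    x_noisy = bitflip(frame)
    return gamma * (psi_prev - x_noisy[None, :])   # shape (K, D)
```

Each of the K copies of the recurrent network consumes its own row of this (K, D) input and produces the next $\psi_k$, after which the E-step recomputes $\gamma$.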
### 4.3 Flying MNIST

In order to incorporate greater variability among the objects we consider a sequential extension of MNIST. Here each sequence consists of gray-scale 24 × 24 images containing two down-sampled MNIST digits that start in random positions and float along randomly sampled trajectories within the image for T timesteps. An example sequence can be seen in the bottom row of Figure 7.

We deploy a slightly deeper version of the architecture used in flying shapes. Its details can be found in Appendix A.3. Since the images are gray-scale we now use a Gaussian distribution for each pixel with fixed $\sigma^2 = 0.25$ and $\mu = \psi_{i,k}$ as computed by each copy of the network. The training procedure is identical to flying shapes except that we replace bitflip noise with masked uniform noise: we first sample a binary mask from a multi-variate Bernoulli distribution with $p = 0.2$ and then use this mask to interpolate between the original image and samples from a uniform distribution between the minimum and maximum values of the data (0, 1). A sketch of this corruption is given at the end of this subsection.

We train with K = 2 and T = 20 on flying MNIST having two digits and obtain an AMI score of 0.819 ± 0.022 on the test set, measured across 5 runs. In early experiments we observed that, given the large variability among the 50 000 unique digits, we can boost the model performance by training in stages using 20, 500, 50 000 digits. Here we exploit the generalization capabilities of RNN-EM to quickly transfer knowledge from a less varying set of MNIST digits to unseen variations. We used the same hyper-parameter configuration as before and obtain an AMI score of 0.917 ± 0.005 on the test set, measured across 5 runs.

Figure 7: A sequence of 3 MNIST digits flying along random trajectories in the image (bottom row). The next-step prediction of each copy of the network (rows 2 to 4) and the soft-assignment of the pixels to each of the copies (top row). Although the network was trained (stage-wise) on sequences with two digits, it is able to accurately separate three digits.

We study the generalization capabilities and robustness of these trained RNN-EM networks by means of three experiments. In the first experiment we evaluate them on flying MNIST having three digits (one extra) and likewise set K = 3. Even without further training we are able to maintain a high AMI score of 0.729 ± 0.019 (stage-wise: 0.838 ± 0.008) on the test-set. A test example can be seen in Figure 7. In the second experiment we are interested in whether the grouping mechanism that has been learned can be transferred to static images. We find that using 50 RNN-EM steps we are able to transfer a large part of the learned grouping dynamics and obtain an AMI score of 0.619 ± 0.023 (stage-wise: 0.772 ± 0.008) for two static digits. As a final experiment we evaluate the directly trained network on the same dataset for a larger number of timesteps. Figure 6 displays the average AMI score across the test set as well as the range of the upper and lower quartile for each timestep. The results of these experiments confirm our earlier observations for flying shapes, in that the learned grouping dynamics are robust and generalize across a wide range of variations. Moreover, we find that the AMI score further improves at test time when increasing the sequence length.
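The masked uniform noise can be sketched as follows (NumPy); the parameter names are illustrative, and the data range (0, 1) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_uniform_noise(frame, p=0.2, low=0.0, high=1.0):
    """Replace pixels with uniform noise wherever a Bernoulli(p) mask fires.

    frame: gray-scale image array with values in [low, high]
    """
    mask = rng.random(frame.shape) < p                 # binary corruption mask
    noise = rng.uniform(low, high, size=frame.shape)   # uniform samples over the data range
    return np.where(mask, noise, frame)
```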
## 5 Discussion

The experimental results indicate that the proposed Neural Expectation Maximization framework can indeed learn how to group pixels according to constituent objects. In doing so the network learns a useful and localized representation for individual entities, which encodes only the information relevant to them. Each entity is represented separately in the same space, which avoids the binding problem and makes the representations usable as efficient symbols for arbitrary entities in the dataset. We believe that this is useful for reasoning in particular, and for a potentially wide range of other tasks that depend on interaction between multiple entities. Empirically we find that the learned representations are already beneficial in next-step prediction with multiple objects, a task in which overlapping objects are problematic for standard approaches, but can be handled efficiently when learning a separate representation for each object.

As is typical in clustering methods, in N-EM there is no preferred assignment of objects to groups, and so the group numbering is arbitrary and depends only on initialization. This property renders our results permutation invariant and naturally allows for instance segmentation, as opposed to semantic segmentation where groups correspond to pre-defined categories.

RNN-EM learns to segment in an unsupervised fashion, which makes it applicable to settings with little or no labeled data. On the downside, this lack of supervision means that the resulting segmentation may not always match the intended outcome. This problem is inherent to the task, since in real world images the notion of an object is ill-defined and task dependent. We envision future work to alleviate this by extending unsupervised segmentation to hierarchical groupings, and by dynamically conditioning them on the task at hand using top-down feedback and attention.

## 6 Conclusion

We have argued for the importance of separately representing conceptual entities contained in the input, and suggested clustering based on statistical regularities as an appropriate unsupervised approach for separating them. We formalized this notion and derived a novel framework that combines neural networks and generalized EM into a trainable clustering algorithm. We have shown how this method can be trained in a fully unsupervised fashion to segment its inputs into entities, and to represent them individually. Using synthetic images and video, we have empirically verified that our method can recover the objects underlying the data, and represent them in a useful way. We believe that this work will help to develop a theoretical foundation for understanding this important problem of unsupervised learning, as well as providing a first step towards building practical solutions that make use of these symbol-like representations.

## Acknowledgements

The authors wish to thank Paulo Rauber and the anonymous reviewers for their constructive feedback. This research was supported by the Swiss National Science Foundation grant 200021_165675/1 and the EU project INPUT (H2020-ICT-2015 grant no. 687795). We are grateful to NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research award, and to IBM for donating a Minsky machine.

## References

[1] H. B. Barlow, T. P. Kaushal, and G. J. Mitchison. Finding Minimum Entropy Codes. Neural Computation, 1(3):412–423, September 1989.

[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm with application to wavelet-based image deblurring. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 693–696. IEEE, 2009.

[3] Yoshua Bengio. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1–37. Springer, 2013.
[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv:1606.03657 [cs, stat], June 2016.

[5] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, pages 1–38, 1977.

[7] Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, and Harri Valpola. Tagger: Deep Unsupervised Perceptual Grouping. arXiv:1606.06724 [cs], June 2016.

[8] Klaus Greff, Rupesh Kumar Srivastava, and Jürgen Schmidhuber. Binding via Reconstruction Clustering. arXiv:1511.06418 [cs], November 2015.

[9] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 399–406, 2010.

[10] Jose A. Guerrero-Colón, Eero P. Simoncelli, and Javier Portilla. Image denoising using mixtures of Gaussian scale mixtures. In Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, pages 565–568. IEEE, 2008.

[11] John R. Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.

[12] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[13] Geoffrey E. Hinton. Distributed representations. 1984.

[14] Aapo Hyvärinen and Jukka Perkiö. Learning to Segment Any Random Vector. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 4167–4172. IEEE, 2006.

[15] Alexander Ilin, Isabeau Prémont-Schwarz, Tele Hotloo Hao, Antti Rasmus, Rinu Boney, and Harri Valpola. Recurrent Ladder Networks. arXiv:1707.09219 [cs, stat], July 2017.

[16] Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. Learning visual groups from co-occurrences in space and time. arXiv:1511.06811 [cs], November 2015.

[17] Nebojsa Jojic and Brendan J. Frey. Learning flexible sprites in video layers. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.

[18] Anitha Kannan, John Winn, and Carsten Rother. Clustering appearance and shape by learning jigsaws. In Advances in Neural Information Processing Systems, pages 657–664, 2007.

[19] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593–650, 2011.

[21] P. M. Milner. A model for visual shape recognition. Psychological Review, 81(6):521, 1974.

[22] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and Checkerboard Artifacts. Distill, 2016.
[23] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning Features by Watching Objects Move. arXiv:1612.06370 [cs, stat], December 2016.

[24] R. A. Rao, G. Cecchi, C. C. Peck, and J. R. Kozloski. Unsupervised segmentation with dynamical units. Neural Networks, IEEE Transactions on, 19(1):168–182, 2008.

[25] R. A. Rao and G. A. Cecchi. An objective function utilizing complex sparsity for efficient segmentation in multi-layer oscillatory networks. International Journal of Intelligent Computing and Cybernetics, 3(2):173–206, 2010.

[26] David P. Reichert and Thomas Serre. Neuronal Synchrony in Complex-Valued Deep Networks. arXiv:1312.6115 [cs, q-bio, stat], December 2013.

[27] Christopher J. Rozell, Don H. Johnson, Richard G. Baraniuk, and Bruno A. Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20(10):2526–2563, 2008.

[28] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1):51–71, 1995.

[29] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.

[30] Jürgen Schmidhuber. Learning Complex, Extended Sequences Using the Principle of History Compression. Neural Computation, 4(2):234–242, March 1992.

[31] Jürgen Schmidhuber. Learning Factorial Codes by Predictability Minimization. Neural Computation, 4(6):863–879, November 1992.

[32] Pablo Sprechmann, Alexander M. Bronstein, and Guillermo Sapiro. Learning efficient sparse and low rank models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1821–1833, 2015.

[33] Anne Treisman. The binding problem. Current Opinion in Neurobiology, 6(2):171–178, April 1996.

[34] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. SfM-Net: Learning of Structure and Motion from Video. arXiv:1704.07804 [cs], April 2017.

[35] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. JMLR, 11:2837–2854, 2010.

[36] C. von der Malsburg. Binding in models of perception and brain function. Current Opinion in Neurobiology, 5(4):520–526, 1995.

[37] Christoph von der Malsburg. The Correlation Theory of Brain Function. Departmental technical report, MPI, 1981.

[38] D. Wang and D. Terman. Locally excitatory globally inhibitory oscillator networks. Neural Networks, IEEE Transactions on, 6(1):283–286, 1995.

[39] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356, 1988.

[40] H. Wersing, J. J. Steil, and H. Ritter. A competitive-layer model for feature binding and sensory segmentation. Neural Computation, 13(2):357–387, 2001.

[41] Ronald J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Northeastern University, College of Computer Science, Boston, 1989.

[42] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103, 1983.