# Select-and-Sample for Spike-and-Slab Sparse Coding

Abdul-Saboor Sheikh
Technical University of Berlin, Germany, and Cluster of Excellence Hearing4all, University of Oldenburg, Germany, and SAP Innovation Center Network, Berlin
sheikh.abdulsaboor@gmail.com

Jörg Lücke
Research Center Neurosensory Science and Cluster of Excellence Hearing4all and Dept. of Medical Physics and Acoustics, University of Oldenburg, Germany
joerg.luecke@uol.de

Abstract

Probabilistic inference serves as a popular model for neural processing. It remains unclear, however, how approximate probabilistic inference can be accurate and scalable to very high-dimensional continuous latent spaces, especially as typical posteriors for sensory data can be expected to exhibit complex latent dependencies including multiple modes. Here, we study an approach that can efficiently be scaled while maintaining a richly structured posterior approximation under these conditions. As example model we use spike-and-slab sparse coding for V1 processing, and combine latent subspace selection with Gibbs sampling (select-and-sample). Unlike factored variational approaches, the method can maintain large numbers of posterior modes and complex latent dependencies. Unlike pure sampling, the method is scalable to very high-dimensional latent spaces. Among all sparse coding approaches with non-trivial posterior approximations (MAP or ICA-like models), we report the largest-scale results. In applications we first verify the approach by showing competitiveness in standard denoising benchmarks. Second, we use its scalability to, for the first time, study highly overcomplete settings for V1 encoding using sophisticated posterior representations. More generally, our study shows that very accurate probabilistic inference for multi-modal posteriors with complex dependencies is tractable, functionally desirable, and consistent with models for neural inference.

1 Introduction

The sensory data that enters our brain has a high intrinsic dimensionality, and it is complex and ambiguous. Image patches or small snippets of sound, for instance, often do not contain sufficient information to identify edges or phonemes with a high degree of certainty. Probabilistic models are therefore well suited to maintain encodings of uncertainty. Given an image patch, for instance, a high probability for an edge in one location affects the probabilities of other components, resulting in complex dependencies commonly known as "explaining-away" effects. Such dependencies in general include (anti-)correlations, higher-order dependencies and multiple posterior modes (i.e., alternative interpretations of a patch). Furthermore, sensory data is typically composed of many different elementary constituents (e.g., an image patch contains some of a potentially very large number of components), which has motivated sparse coding models with increasing degrees of overcompleteness [1].

If sensory data gives rise to complex posterior dependencies and has high intrinsic dimensionality, how can we study inference and learning in such settings? To date, most studies, e.g. of V1 encoding models, have avoided the treatment of complex latent dependencies by assuming standard sparse models with Laplace priors [2, 3, 1]; high-dimensional problems can then be addressed by applying maximum a-posteriori (MAP) approximations to the resulting mono-modal posteriors.
Other scalable approaches such as independent component analysis (ICA) or K-singular value decomposition (K-SVD) [4, 5] do not encode data uncertainty, which avoids posterior estimation altogether. For advanced data models, which we expect to be required, e.g., for visual data, neither MAP nor a non-probabilistic treatment can be expected to be sufficient. For example, a number of studies has shown that sparse coding models with more flexible spike-and-slab priors are (A) more closely aligned with the true generative process, e.g. for images, and (B) result in improved functional performance [6, 7]. Spike-and-slab priors do, however, result in posteriors with complex dependencies including many modes [8, 7]. Inference for spike-and-slab sparse coding is therefore well suited for studying, in general, efficient inference and learning with complex posteriors in high dimensions. Results for spike-and-slab sparse coding are, furthermore, of direct interest for other important models such as hierarchical communities of experts [9], deep Boltzmann machines (see [6]), or convolutional neural networks [10]. For these typically deep systems, very high-dimensional inference and learning is also of crucial importance.

So far, intractable inference for spike-and-slab sparse coding has been approximated using sampling or factored variational approaches. While sampling approaches can in principle model any dependencies including multiple modes, they have been found challenging to train at scale, with the largest-scale applications going up to a few hundred latents [11, 12]. Compared to sampling, approaches using factored variational approximations cannot model posterior dependencies as complex because they assume posterior independence (no correlations etc.); however, they can capture multiple modes and are scalable to several hundreds up to thousands of latents [8, 6]. In this work we combine the accuracy of sampling approaches and the scalability of variational approaches by applying select-and-sample [13] to scale spike-and-slab sparse coding to very high latent dimensions. In contrast to using a factored approximation, we here select low-dimensional subspaces of the continuous hidden space and then apply sampling to approximate posteriors within these lower-dimensional spaces.

2 The Spike-and-Slab Sparse Coding Model and Parameter Optimization

The spike-and-slab sparse coding model (see [8, 6] and citations therein) used for our study assumes a Bernoulli prior over all H components of the binary latent vector b ∈ {0,1}^H, and a Gaussian prior (the "slab") for the continuous latent vector z ∈ R^H:

p(b \mid \Theta) = \prod_{h} \pi^{b_h} (1-\pi)^{1-b_h}, \qquad p(z \mid \Theta) = \prod_{h} \mathcal{N}(z_h; \mu_h, \psi_h^2), \qquad (1)

where π defines the probability of b_h being equal to one, and where μ ∈ R^H and ψ ∈ R^H parameterize the Gaussian slab. A spike-and-slab hidden variable s ∈ R^H is then generated by a pointwise multiplication, s = b ⊙ z, i.e., s_h = b_h z_h. Given the hidden variable s, we follow standard sparse coding by linearly superimposing a set of latent components (i.e., W s = Σ_h W_h s_h) to set the mean of a Gaussian noise model:

p(y \mid s, \Theta) = \prod_{d} \mathcal{N}\Big(y_d; \sum_h W_{dh} s_h, \sigma_d^2\Big), \qquad (2)

which then generates the observed data y ∈ R^D. Here the columns of the matrix W ∈ R^{D×H} are the latent components W_h, each associated with a spike-and-slab latent variable s_h. We use σ ∈ R^D to parameterize the observation noise.
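For concreteness, the generative process defined by (1) and (2) can be simulated in a few lines. The following is a minimal NumPy sketch; all function and variable names are ours and not part of the original method:

```python
import numpy as np

def sample_spike_and_slab(W, pi, mu, psi, sigma, N, rng=np.random.default_rng(0)):
    """Draw N data points from the spike-and-slab sparse coding model (1)-(2).

    W       : (D, H) dictionary of latent components (columns W_h)
    pi      : scalar Bernoulli probability of a component being active
    mu, psi : (H,) mean and standard deviation of the Gaussian slab
    sigma   : (D,) standard deviation of the observation noise
    """
    D, H = W.shape
    b = rng.random((N, H)) < pi                          # spikes  b_h ~ Bern(pi)
    z = rng.normal(mu, psi, size=(N, H))                 # slabs   z_h ~ N(mu_h, psi_h^2)
    s = b * z                                            # pointwise product s = b ⊙ z
    y = s @ W.T + rng.normal(0.0, sigma, size=(N, D))    # y_d ~ N(sum_h W_dh s_h, sigma_d^2)
    return y, s, b, z
```

For sparse settings (π ≪ 1), most entries of s are exactly zero; this is the property that the subspace selection of Section 3 exploits.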
The parameters of the generative model (1)–(2) are collectively denoted by Θ = (π, μ, ψ, W, σ). To find the values of Θ, we seek to maximize the data likelihood L = \prod_{n=1}^{N} p(y^{(n)} | Θ) under the spike-and-slab data model, given a set of N data points {y^{(n)}}_{n=1,...,N}. To derive a learning algorithm, we apply expectation maximization (EM) in its free-energy formulation. In our case the free-energy is given by

\mathcal{F}(q, \Theta) = \sum_{n=1}^{N} \Big[ \big\langle \log p(y^{(n)}, s \mid \Theta) \big\rangle_n + H(q^{(n)}) \Big], \qquad \text{where} \quad \langle f(s) \rangle_n = \int q^{(n)}(s)\, f(s)\, ds

is the expectation under q^{(n)}, a distribution over the latent space, and H(·) denotes the Shannon entropy. Given the free-energy, the parameter updates are canonically derived by setting the partial derivatives of F(q, Θ) w.r.t. the parameters to zero. For the spike-and-slab sparse coding model (1)–(2) we obtain (similar to [8, 6, 7]) the following closed-form M-step equations:

\pi = \frac{1}{NH} \sum_n \langle |b| \rangle_n, \qquad \mu_h = \frac{\sum_n \langle s_h \rangle_n}{\sum_n \langle b_h \rangle_n}, \qquad \psi_h^2 = \frac{\sum_n \langle (s_h - \mu_h b_h)^2 \rangle_n}{\sum_n \langle b_h \rangle_n}, \qquad (3)

W = \Big( \sum_n y^{(n)} \langle s \rangle_n^{\mathrm{T}} \Big) \Big( \sum_n \langle s\, s^{\mathrm{T}} \rangle_n \Big)^{-1}, \qquad \sigma_d^2 = \frac{1}{N} \sum_n \Big\langle \big( y_d^{(n)} - \sum_h W_{dh} s_h \big)^2 \Big\rangle_n, \qquad (4)

with s_h = b_h z_h and |x| = Σ_h x_h as defined above.

3 Approximate Inference With Select-and-Sample

The optimal choices for the distributions q^{(n)}(s) for the expectations in (3) and (4) are the posteriors p(s | y^{(n)}, Θ), but neither the posteriors nor their corresponding expectation values are computationally tractable in high dimensions. A crucial observation that we exploit in our work is, however, that for observed data such as natural sensory input, or data generated by a sparse coding model, the activity of latent components (or causes) can be expected to be concentrated in low-dimensional subspaces. In other words, for a given observed data point, all but a very small fraction of the latent components can be assumed to be non-causal or irrelevant, and the corresponding latent space can hence be neglected for the integration over s. For a sparse setting (i.e., π ≪ 1) of the spike-and-slab model (1)–(2), we consider such low-dimensional subspaces to be spanned by a few (approximately πH) of the H latent space coordinates. If we denote by J^{(n)} the subspace containing the large majority of posterior mass for a given data point y^{(n)}, an approximation to p(s | y^{(n)}, Θ) is given by the following truncated distribution:

q^{(n)}(s; \Theta) = \frac{p(s \mid y^{(n)}, \Theta)}{\int_{s' \in J^{(n)}} p(s' \mid y^{(n)}, \Theta)\, ds'}\; \delta(s \in J^{(n)}), \qquad (5)

where δ(s ∈ J^{(n)}) is an indicator function, taking the value 1 only if s ∈ J^{(n)} and zero otherwise. Truncated approximations have previously been shown to work efficiently and accurately for challenging data models [14, 15, 16]. Latents were restricted to be binary, however, and scalability was previously limited by the combinatorics within the selected latent subsets. For our aim of very large-scale applications, we therefore apply the select-and-sample approach [13] and use a sampling approximation that operates within the subspaces J^{(n)}. Unlike [13], who used binary latents, we here apply the approximation to the continuous latent space of spike-and-slab sparse coding. Formally, this means that we first use the posterior approximation q^{(n)}(s) in Eqn. (5) and then approximate the expectation values w.r.t. q^{(n)}(s) using sampling (see illustration of Alg. 1):

\langle f(s) \rangle_n = \int q^{(n)}(s)\, f(s)\, ds \approx \frac{1}{M} \sum_{m=1}^{M} f(s^{(m)}), \qquad s^{(m)} \sim q^{(n)}(s), \qquad (6)

where M is the number of samples and f(s) can be any argument of the expectation values in (3) and (4).
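To illustrate how the sampled expectations of Eq. (6) enter the closed-form updates (3) and (4), the following is a schematic NumPy sketch; the array layout and all names are our own assumptions, and each expectation is simply replaced by an average over the M posterior samples per data point:

```python
import numpy as np

def m_step(Y, S_samples, B_samples):
    """Sketch of the M-step (3)-(4) with sample-based expectations (Eq. 6).

    Y         : (N, D) data points
    S_samples : (N, M, H) posterior samples s^(m) per data point
    B_samples : (N, M, H) corresponding spike samples, b^(m) = (s^(m) != 0)
    Returns updated (pi, mu, psi2, W, sigma2).
    """
    N, M, H = S_samples.shape
    Es   = S_samples.mean(axis=1)                                # <s>_n
    Eb   = B_samples.mean(axis=1)                                # <b>_n
    EssT = np.einsum('nmh,nmk->hk', S_samples, S_samples) / M    # sum_n <s s^T>_n

    pi   = Eb.sum() / (N * H)                                    # (1/NH) sum_n <|b|>_n
    mu   = Es.sum(axis=0) / np.maximum(Eb.sum(axis=0), 1e-8)
    psi2 = ((S_samples - mu * B_samples) ** 2).mean(axis=1).sum(axis=0) \
           / np.maximum(Eb.sum(axis=0), 1e-8)                    # sum_n <(s_h - mu_h b_h)^2>_n / sum_n <b_h>_n
    W    = (Y.T @ Es) @ np.linalg.pinv(EssT)                     # (sum_n y <s>^T)(sum_n <s s^T>)^-1
    R    = Y[:, None, :] - np.einsum('dh,nmh->nmd', W, S_samples)
    sigma2 = (R ** 2).mean(axis=1).mean(axis=0)                  # (1/N) sum_n <(y_d - sum_h W_dh s_h)^2>_n
    return pi, mu, psi2, W, sigma2
```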
It remains to be shown how difficult sampling from q^{(n)}(s) is compared to directly sampling from the full posterior p(s | y^{(n)}, Θ). The indicator function δ(s ∈ J^{(n)}) means that we can clamp all values of s_h with h ∉ I^{(n)} to zero, but we still have to answer the question of how the remaining s_h are sampled. A closer analysis shows that the distribution to sample in the reduced space is given by the posterior w.r.t. a truncated generative model. To show this, let us first introduce some notation. Let I denote a subset of the indices of the latent variables s, i.e., I ⊆ {1, ..., H}, and let H\I abbreviate {1, ..., H}\I. The vector s_I w.r.t. I is then, as customary, the vector in R^{|I|} defined by those entries s_h with h ∈ I. In analogy, we take the matrix W_I ∈ R^{D×|I|} to be defined by the row vectors (w_d^T)_I, where w_d^T are the row vectors of W ∈ R^{D×H}.

Proposition 1. Consider the spike-and-slab generative model (1)–(2) with parameters Θ, and let Θ_{I^{(n)}} = (π, μ_{I^{(n)}}, ψ_{I^{(n)}}, W_{I^{(n)}}, σ) be the parameters of a truncated spike-and-slab model with an H' = dim(I^{(n)})-dimensional latent space. Then sampling from the truncated distribution in (5) is equivalent to sampling from the posterior p(s_{I^{(n)}} | y^{(n)}, Θ_{I^{(n)}}) of the truncated spike-and-slab model while all values s_h with h ∉ I^{(n)} are clamped to zero.

Proof. If I^{(n)} denotes the indices of those latents s_h that span the subspace in which the posterior mass of p(s | y^{(n)}, Θ) is concentrated, then these subspaces are given by J^{(n)} = {s ∈ R^H | s_{H\setminus I^{(n)}} = 0}, i.e., δ(s ∈ J^{(n)}) can be rewritten as \prod_{h \notin I^{(n)}} δ(s_h = 0). Considering (5), we can therefore set the corresponding values s_{H\setminus I^{(n)}} = 0. We now drop the superscript n for readability and first derive:

p(s_I, s_{H\setminus I} = 0, y \mid \Theta) = \mathcal{N}\big(y;\, W_I s_I + W_{H\setminus I}\,0,\, \sigma\big) \prod_{h \in I} \mathrm{Bern}(b_h; \pi)\,\mathcal{N}(z_h; \mu_h, \psi_h) \prod_{h \notin I} \mathrm{Bern}(b_h = 0; \pi)\,\mathcal{N}(z_h; \mu_h, \psi_h) = p(s_I, y \mid \Theta_I)\; U(s_{H\setminus I} = 0, \Theta)

with U(s_{H\setminus I}, Θ) = p(s_{H\setminus I} | Θ_{H\setminus I}), i.e., the joint with s_{H\setminus I} = 0 is given by the joint of the truncated model multiplied by a term that does not depend on s_I, such that

q(s; \Theta) = \frac{p(s_I, s_{H\setminus I} = 0, y \mid \Theta)\, \delta(s \in J)}{\int_{s' \in J} p(s'_I, s'_{H\setminus I} = 0, y \mid \Theta)\, ds'} = \frac{p(s_I, y \mid \Theta_I)\, U(s_{H\setminus I} = 0, \Theta)\, \delta(s \in J)}{\int_{s' \in J} p(s'_I, y \mid \Theta_I)\, ds'\; U(s_{H\setminus I} = 0, \Theta)} = \frac{p(s_I, y \mid \Theta_I)}{\int p(s'_I, y \mid \Theta_I)\, ds'_I} \prod_{h \notin I} \delta(s_h = 0) = p(s_I \mid y, \Theta_I) \prod_{h \notin I} \delta(s_h = 0). \qquad (7)

Following the proof, Proposition 1 applies for any generative model p(s, y | Θ) for which p(s_I, s_{H\setminus I} = 0, y | Θ) = p(s_I, y | Θ_I) U(s_{H\setminus I} = 0, y, Θ) holds. This includes a large class of models such as linear and non-linear spike-and-slab models, and potentially hierarchical models such as SBNs. Proposition 1 does not apply in general, however (we exploit specific model properties).
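In code, Proposition 1 simply amounts to restricting the model parameters to the selected index set I and running inference in the resulting lower-dimensional spike-and-slab model. A minimal sketch (names are ours):

```python
import numpy as np

def truncate_params(theta, I):
    """Restrict model parameters to the index set I (Proposition 1).

    theta : dict with keys 'pi' (scalar), 'mu' (H,), 'psi' (H,), 'W' (D, H), 'sigma' (D,)
    I     : 1-D integer array of selected latent indices, |I| = H'
    Returns the parameters Theta_I of the truncated spike-and-slab model.
    """
    return {
        'pi':    theta['pi'],        # prior probability stays the same
        'mu':    theta['mu'][I],
        'psi':   theta['psi'][I],
        'W':     theta['W'][:, I],   # keep only the columns W_h with h in I
        'sigma': theta['sigma'],
    }
```

Samples s_I drawn under Θ_I are afterwards embedded back into R^H by setting all coordinates outside I to zero, as required by δ(s ∈ J^{(n)}).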
Sampling. In previous work [7], posteriors for spike-and-slab sparse coding were evaluated exhaustively within selected sets I^{(n)}, which caused scalability to be strongly limited by the dimensionality of I^{(n)}. Based on Proposition 1, we can now overcome this bottleneck by using sampling approximations within the subspaces J^{(n)}, and we have shown that such sampling is equivalent to sampling w.r.t. a much lower-dimensional spike-and-slab model. The dimensionality of J^{(n)} is still non-trivial, however, and we use a Markov chain Monte Carlo (MCMC) approach, namely Gibbs sampling, for efficient scalability. Following Proposition 1, we derive a sampler for the spike-and-slab model (1)–(2) and later apply it in the required (low) dimensionality. While the result that one can sample from posteriors of truncated models applies to a broad class of spike-and-slab models (Proposition 1), we can here exploit a further specific property of the model (1)–(2).

As has previously been observed and exploited in different contexts [8, 12, 17], the Gaussian slab and the Gaussian noise model can be combined using Gaussian identities such that integrals over the continuous latents z are solvable analytically. Here we use this observation to derive a Gibbs sampler. For this we first devise a latent variable Markov chain whose target density is given by the following conditional posterior distribution:

p(s_h \mid s_{H\setminus h}, y, \Theta) \propto p(s_h \mid \Theta) \prod_d p(y_d \mid s_h, s_{H\setminus h}, \Theta) = \big[(1-\pi)\,\delta(s_h) + \pi\,\mathcal{N}(s_h; \mu_h, \psi_h^2)\big] \prod_d \mathcal{N}(s_h; \nu_d, \phi_d^2), \qquad (8)

where δ(·) is the Dirac delta representing the spike at zero, and where ν_d = (y_d − Σ_{h'≠h} W_{dh'} s_{h'}) / W_{dh} and φ_d² = σ_d² / W_{dh}². Using Gaussian identities we obtain

p(s_h \mid s_{H\setminus h}, y, \Theta) \propto (1-\pi)\,\mathcal{N}(s_h; \upsilon, \phi^2)\,\delta(s_h) + \pi\,\mathcal{N}(s_h; \tau, \omega^2), \qquad (9)

where υ = φ² Σ_d ν_d/φ_d² and φ² = (Σ_d 1/φ_d²)^{-1}, whereas τ = ω² (υ/φ² + μ_h/ψ_h²) and ω² = (1/φ² + 1/ψ_h²)^{-1}. We observe that the conditional posterior (9) of s_h retains the form of a spike-and-slab distribution. We can therefore simply compute the cumulative distribution function (CDF) of (9) to simulate s_h from the exact conditional distribution, s_h ~ p(s_h | s_{H\setminus h}, y, Θ), by means of inverse transform sampling.

Selecting. The Gibbs sampler can now be applied to generate posterior samples for a truncated spike-and-slab model (defined by the parameters Θ_{I^{(n)}}). We also obtain a valid approximation, of course, without selection (I = {1, ..., H}), but MCMC samplers in very high-dimensional spaces with complex posterior structure are known to be challenging (convergence to the target distribution can be very slow). The problems typically grow superlinearly with the hidden dimensionality, but for intermediate dimensions a Gibbs sampler can be very fast and accurate. Using subspaces J^{(n)} of intermediate dimensionality therefore results in very efficient and accurate sampling approximations within these spaces. An overall very accurate approximation is then obtained if the subspaces are well selected, i.e., if they contain the large majority of the posterior mass. Using exact EM, it was indeed previously shown for spike-and-slab sparse coding [7] that almost all posterior mass, e.g. for naturally mixed sound sources, is concentrated in collections of low-dimensional subspaces (also compare [18]).

To define a subspace J^{(n)} for a given data point y^{(n)}, we follow earlier approaches [15, 14] and first define an efficiently computable selection function to choose those latents that are most likely to have generated the data point. We use the selection function of [7], given by

S_h(y^{(n)}, \Theta) = \prod_d \mathcal{N}\big(y_d^{(n)};\; W_{dh}\mu_h,\; \sigma_d^2 + W_{dh}^2 \psi_h^2\big) \approx p\big(y^{(n)} \mid b = b_h, \Theta\big), \qquad (10)

where b_h denotes the singleton state in which only component h is equal to one. The subsets are then defined as follows: I^{(n)} is the set of H' indices such that

\forall h \in I^{(n)},\; \forall h' \notin I^{(n)}: \quad S_h(y^{(n)}, \Theta) > S_{h'}(y^{(n)}, \Theta). \qquad (11)

We then use J^{(n)} = {s | s_{H\setminus I^{(n)}} = 0} as above.

Algorithm 1: Select-and-sample for spike-and-slab sparse coding (S5C)
    init Θ;
    repeat
        for n = 1, ..., N do
            for h = 1, ..., H do compute S_h(y^{(n)}) as in (10);
            define I^{(n)} as in (11);
            for m = 1, ..., M do draw s^{(m)}_{I^{(n)}} ~ p(s_{I^{(n)}} | y^{(n)}, Θ_{I^{(n)}}) using (9);
            compute ⟨f(s)⟩_n = (2/M) Σ_{m = M/2+1}^{M} f(s^{(m)});
        compute M-step with arguments ⟨f(s)⟩_n as in (3) and (4);
    until Θ has converged;
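The two per-data-point steps of Alg. 1 can be sketched as follows: scoring and selecting the subspace via (10)–(11), and one Gibbs sweep over the truncated model using the conditional (8)–(9). This is a minimal NumPy sketch with names of our own choosing; it samples the spike/slab mixture of (9) directly (drawing the spike-versus-slab indicator first and then the slab Gaussian) instead of the inverse-CDF simulation described above, and the explicit mixture weights in the code follow from the same Gaussian identities:

```python
import numpy as np

def log_gauss(x, mean, var):
    # log of a univariate Gaussian density N(x; mean, var)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def select_subspace(y, theta, H_prime):
    """Selection function S_h (Eq. 10): score each latent by the (factored)
    data likelihood of its singleton state, keep the H' best indices (Eq. 11)."""
    W, mu, psi, sigma = theta['W'], theta['mu'], theta['psi'], theta['sigma']
    var = sigma[:, None] ** 2 + (W ** 2) * (psi[None, :] ** 2)      # per-dim variance
    scores = log_gauss(y[:, None], W * mu[None, :], var).sum(axis=0)  # (H,)
    return np.argsort(scores)[-H_prime:]

def gibbs_sweep(y, s, theta, rng):
    """One Gibbs sweep over a (truncated) spike-and-slab model, sampling each
    s_h from its conditional posterior (Eqs. 8-9). Updates s in place."""
    W, mu, psi = theta['W'], theta['mu'], theta['psi']
    sigma, pi = theta['sigma'], theta['pi']
    for h in range(len(s)):
        r = y - W @ s + W[:, h] * s[h]                    # residual with s_h removed
        phi2 = 1.0 / np.sum(W[:, h] ** 2 / sigma ** 2)    # likelihood variance phi^2
        ups = phi2 * np.sum(W[:, h] * r / sigma ** 2)     # likelihood mean    upsilon
        om2 = 1.0 / (1.0 / phi2 + 1.0 / psi[h] ** 2)      # slab posterior variance omega^2
        tau = om2 * (ups / phi2 + mu[h] / psi[h] ** 2)    # slab posterior mean     tau
        # log mixing weights of the spike (s_h = 0) and slab components
        log_w0 = np.log(1 - pi) + log_gauss(0.0, ups, phi2)
        log_w1 = np.log(pi) + log_gauss(ups, mu[h], psi[h] ** 2 + phi2)
        p_slab = 1.0 / (1.0 + np.exp(log_w0 - log_w1))
        s[h] = rng.normal(tau, np.sqrt(om2)) if rng.random() < p_slab else 0.0
    return s
```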
In contrast to previous approaches with H' typically < 10, H' can be chosen relatively large here because the Gibbs sampler remains very efficient and precise for H' > 10 (we will go up to H' = 40). By combining the selection procedure and the Gibbs sampler using Proposition 1, we obtain the efficient approximate EM algorithm summarized in Alg. 1. It will be referred to as S5C (see Alg. 1 caption). Note that, for all experiments, we always discard the first half of the drawn samples as burn-in.

4 Numerical Experiments

In all experiments, the initial values of π were drawn from a uniform distribution on the interval [0.1, 0.5] (i.e., intermediately sparse), μ was initialized with normally distributed random values, ψ_h was set to 1, and σ_d was initialized with the standard deviation of y_d. The elements of W were drawn iid from a normal distribution with zero mean and a standard deviation of 5.0. We used a multi-core parallelized implementation and executed the algorithm on up to 1000 CPU cores.

Verification of functional accuracy. We first investigate the accuracy and convergence properties of our method on ground-truth data generated by the spike-and-slab model (1)–(2) itself. We used H = 10 hidden variables, D = 5×5 observed dimensions, and generative fields W_h in the form of five horizontal and five vertical bars. As is customary for such bars-like data (e.g., [15] and cites therein), we take each field to contribute to a data point with probability π = 2/H. We then randomly make each of the 5 vertical and 5 horizontal bars positive or negative by assigning it a value of +5 or −5, while the non-bar pixels are assigned the value zero. The parameters of the latent slabs μ_h and ψ_h are set to 0.0 and 1.0, respectively, and we set the observation noise to σ_d = 1.0. We generate N = 5000 data points with this setting (see Fig. 1A for examples; a code sketch of this construction is given below).

Figure 1: Functional accuracy of S5C. A Artificial ground-truth data. B Likelihoods during learning (Alg. 1) for different H'. C Denoising performance of S5C on the "house" benchmark as used for other methods (MTMKL [8], K-SVD [4], Beta process [11] and GSC-ET [7]); bold values highlight the best-performing algorithm (one value is not bold-faced as the noise variance is assumed known a priori [4]). D Top: Noisy image with σ = 25. Bottom: State-of-the-art denoising result after S5C was applied.

We apply the S5C algorithm (Alg. 1) with H = 10 latents and M = 40 samples per data point and use two settings for preselection: (A) no preselection (H' = H = 10) and (B) subspace preselection using H' = 5. We did ten runs per setting with different initializations per run, as above. For setting (A), i.e. pure Gibbs sampling, the algorithm recovered the generating bars in 2 of the 10 runs after 150 EM iterations. For setting (B), convergence was faster, and all bars were recovered in 9 of the 10 runs after 50 EM iterations. Fig. 1B shows the likelihoods during learning for all 20 runs (the likelihood is still tractable for H = 10). These empirical results show the same effect for a continuous latent variable model as was previously reported for non-continuous latents [19, 20]: preselection helps to avoid local optima (presumably because poor non-sparse solutions are destabilized by subspace selection).
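The bars-like ground-truth data described above can be generated along the following lines; this is a minimal sketch under the stated settings (π = 2/H, μ_h = 0, ψ_h = 1, σ_d = 1, bar amplitude ±5), with names of our own choosing:

```python
import numpy as np

def make_bars_data(N=5000, size=5, sigma=1.0, amp=5.0, rng=np.random.default_rng(0)):
    """Generate bars-like ground-truth data: H = 2*size generative fields
    (size vertical and size horizontal bars) with amplitude +amp or -amp,
    combined via the spike-and-slab model with mu_h = 0 and psi_h = 1."""
    H, D = 2 * size, size * size
    pi = 2.0 / H                                       # on average two active bars per data point
    W = np.zeros((D, H))
    signs = rng.choice([-1.0, 1.0], size=H)            # each bar randomly positive or negative
    for i in range(size):
        W[np.arange(size) * size + i, i] = signs[i] * amp                 # vertical bar i
        W[i * size + np.arange(size), size + i] = signs[size + i] * amp   # horizontal bar i
    b = rng.random((N, H)) < pi
    z = rng.normal(0.0, 1.0, size=(N, H))
    Y = (b * z) @ W.T + rng.normal(0.0, sigma, size=(N, D))
    return Y, W
```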
After having verified the functioning of S5C on artificial data, we turned to a denoising benchmark that is standard for sparse coding. We applied S5C to a noisy version of the "house" image [following 11, 4, 8, 7], using three different levels of added Gaussian noise (σ = 15, 25, 50). For each setting we extracted 8×8 patches from the 256×256 noisy image, visiting the whole image by shifting one pixel at a time (vertically and horizontally), which yields N = 62,001 overlapping image patches as data points. We applied the S5C algorithm with H = 256 latents, selected subspaces with H' = 40, and used M = 100 samples per subspace. Fig. 1C,D show the obtained results and a comparison to alternative approaches. As can be observed, S5C is competitive with other approaches and results in higher peak signal-to-noise ratios (PSNRs; see [7] for details) than, e.g., K-SVD or the factored variational EM approach (MTMKL) for σ = 25 and 50. Even though S5C uses an additional sampling approximation within the selected subspaces, it is also competitive with GSC-ET [7], which is less efficient as it sums exhaustively within subspaces. For σ = 25, S5C even outperforms GSC-ET, presumably because S5C allows for selecting larger subspaces. In general, we observed increasing improvements with the number of samples, whereas improvements with H saturated at about H = 256.

Large-scale application and V1 encoding. Since sparse coding was first suggested as a coding model for primary visual cortex [21], a main goal has been its application to very high latent dimensions, because V1 is believed to be highly overcomplete [1]. Furthermore, for very large hidden dimensions, non-standard generative fields have been observed [1], a finding of significant relevance for the ongoing debate about how and where increasingly complex structures are processed in the visual system. Here we applied S5C with H = 10,000 hidden dimensions to demonstrate scalability of the method and to study highly overcomplete V1 encoding based on a posterior approximation that captures rich structure. For our application we used the standard van Hateren database [22], extracted N = 10^6 image patches of size 16×16, and applied pseudo-whitening following [21]. We applied S5C for 50 EM iterations to the data, using H' = 20-dimensional subspaces and M = 50 samples per data point.

Figure 2: Selection of different types of generative fields as learned by S5C using H = 10,000 latent dimensions (see Suppl. for all fields). Gabor-like fields are the most frequent type (Gabors, ridgelets, gratings), followed by globular fields, curved fields and corner-like fields. We also observed textures other than gratings. Gabors, curved and corner fields were almost all among the 30% most frequently activated fields. Ridgelets, globular fields and gratings were typically among the 30-80% most used fields.

After learning, we observed a large number of generative fields specialized to image components. As in recent large-scale applications of standard sparse coding [1], we found fields that did not specialize (about 1% in [1] and about 12% for S5C); the higher percentage for S5C may be due to the five-fold higher dimensionality used here. Among the fields specialized to components, we observed a large number of Gabor-like fields, including ridgelets and gratings (names follow [1]). Furthermore, we observed globular fields, which have been observed experimentally [23] and are the subject of a number of recent theoretical studies (e.g., [14, 3]). Notably, we also observed a number of curved fields and fields sensitive to corner-like structures (Fig. 2 shows some examples).
Curved fields have so far been described to emerge from sparse coding only once before [1], and for convolutional sparse coding in two cases [24, 25] (to the knowledge of the authors), although they were suggested for technical applications much earlier [26] (a link that has so far not been made). Corner-like structures have previously not been observed for sparse coding, presumably because of lower-dimensional latent spaces (also not in [1], but compare convolutional extensions [24, 16, 25]). The numbers of curved fields (a few hundred) and corner-like fields (a few tens) are small, but we find these fields almost exclusively among the 20% most frequently used fields (ordered according to average approximate posterior activation; see supplement). Neural responses with corner-like sensitivities are typically associated with higher-level processing in the visual pathway. Our results may be evidence for such structures emerging together with, e.g., Gabors for very large latent dimensionality (as expected for V1). In general, the statistics of generative field shapes can be influenced by many factors including preprocessing details, sparsity, local optima, and details of the learning algorithm. However, because of the applied approximation, S5C avoids the choice of a sparsity penalty that is required for MAP-based approaches [1]. Instead, we statistically infer the sparsity level, which is well interpretable for hard sparsity and corresponds, for our application, to Hπ = 6.2 components per patch (also compare [14, 20]). In the supplement we provide the full set of the H = 10,000 learned generative fields.

Figure 3: The y-axis shows the highest reported latent dimensionality for different sparse coding algorithms (continuous latents), and the x-axis the accuracy of the posterior approximation. Within each column, entries are ordered (left to right) by publication year. 1st column: sparse coding systems using one latent state for inference (e.g., MAP-like [27, 28, 1], SAILnet [3], or K-SVD [4, 5]). 2nd: approximate posteriors in the form of factored variational distributions that can capture multiple modes but assume posterior independence among the latents s_h (MTMKL [8], S3C [6]). 3rd: sampling-based approximations [11, 12] and truncated approximations (ssMCA [20], GSC-ET [7]) that capture multiple posterior modes and complex latent dependencies. Following [6] we also include ssRBM for comparison. 4th: full posterior with exact EM [17].

5 Discussion

In this study we have applied a select-and-sample approach [13] to derive and study an approximate EM algorithm applicable to models with very large-scale latent spaces. Select-and-sample combines sampling with subspace preselection [15, 14] and has previously been applied as a model for neural inference using binary latents [13]. Furthermore, it has been used to overcome analytical intractabilities of a non-linear sparse coding model [20]. Here, we for the first time apply select-and-sample to scale a standard linear sparse coding model with a spike-and-slab prior up to very large hidden dimensions. Spike-and-slab sparse coding is hereby not only more expressive than standard Laplace or binary priors [8, 12, 7, 20] but has properties that we can exploit for our approximation. We have thus shown analytically (Proposition 1) that select-and-sample is applicable to a large class of models with hard sparsity (giving justification also to earlier applications [20]).
Empirically, we have first shown that select-and-sample for spike-and-slab sparse coding (S5C) maintains the functional competitiveness of alternative approaches (Fig. 1). Second, we demonstrated efficiency by scaling S5C up to very high-dimensional latent spaces (we go up to 10,000). For comparison, Fig. 3 shows the largest reported latent spaces of different sparse coding approaches, arranged by the posterior structure they can capture. Non-probabilistic approaches (e.g., K-SVD [4, 5]) are known to scale relatively well, and, likewise, approaches using MAP approximations [2, 3, 1] have been shown to be applicable at large scale. None of these approaches, however, captures posterior dependencies or multiple posterior modes for a given data point. Factored variational approaches can be scaled to very high-dimensional latent spaces and can capture multiple posterior modes; no latent dependencies in the posterior are modeled, however, which has previously been reported to result in disadvantageous behavior (e.g. [29, 7]). In contrast to MAP-based or factored approaches, sampling approaches can model both multiple posterior modes and complex latent dependencies. Some models additionally include a more Bayesian treatment of the parameters [11, 12] (also compare [8]), which can be considered more general than the approaches followed in other work (see Fig. 3). The scalability of sampling-based approaches has been limited, however. Among the models capturing the crucial posterior structure, S5C shows, to the knowledge of the authors, the largest-scale applicability; this is the case even if approaches using factored posteriors are included. Notably, little has been reported for very large hidden dimensions for MAP-based or deterministic approaches either (compare, e.g., [5]), although scalability should be less of an issue there. In general, it may well be that a method is scalable to latent spaces larger than reported but that such increases do not result in functional benefits. For probabilistic approaches, the requirement for approximations with high accuracy has also been identified in other very promising work [30, 31], which uses different approaches that have so far been applied at much smaller scales. For the select-and-sample method and the spike-and-slab sparse coding model, the high-dimensional applicability means that this or similar approaches are promising candidates for models such as DBNs, SBNs or CNNs, because of their close relation to spike-and-slab models and their typically similarly large-scale settings.

Here we have studied an application of S5C to standard image patches, primarily to demonstrate scalability. The obtained non-standard generative fields may, however, be of relevance for V1 encoding in their own right (Fig. 2), and they show that spike-and-slab models may be very suitable generalized V1 models. From a probabilistic view on neural processing, the accuracy that can be provided by select-and-sample inference is very desirable and is consistent, e.g., with sampling-based interpretations of neural variability [32]. Here we have shown that such probabilistic approximations are also functionally competitive and scalable to very large hidden dimensions.

Acknowledgements. We thank E. Guiraud for help with Alg. 1 (illustration) and acknowledge funding by the DFG: Cluster of Excellence EXC 1077/1 (Hearing4all) and grant LU 1196/5-1.

References

[1] B. Olshausen. Highly overcomplete sparse coding. In Proc. SPIE, 8651, 2013.
[2] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
[3] J. Zylberberg, J. Murphy, and M. DeWeese. A sparse coding model with synaptically local plasticity and spiking neurons can account for the diverse shapes of V1 simple cell receptive fields. PLoS Comp. Bio., 7(10):e1002250, 2011.
[4] H. Li and F. Liu. Image denoising via sparse and redundant representations over learned dictionaries in wavelet domain. In ICIG, pages 754–758, 2009.
[5] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In ICML, 2016.
[6] I. J. Goodfellow, A. Courville, and Y. Bengio. Scaling up spike-and-slab models for unsupervised feature learning. TPAMI, 35(8):1902–1914, 2013.
[7] A. S. Sheikh, J. A. Shelton, and J. Lücke. A truncated EM approach for spike-and-slab sparse coding. JMLR, 15:2653–2687, 2014.
[8] M. Titsias and M. Lazaro-Gredilla. Spike and slab variational inference for multi-task and multiple kernel learning. In NIPS, pages 2339–2347, 2011.
[9] G. E. Hinton, B. Sallans, and Z. Ghahramani. A hierarchical community of experts. In Learning in Graphical Models, pages 479–494. Springer, 1998.
[10] A. B. Patel, T. Nguyen, and R. G. Baraniuk. A probabilistic theory of deep learning. In NIPS, 2016. In press, preprint arXiv:1504.00641.
[11] M. Zhou, H. Chen, J. Paisley, L. Ren, G. Sapiro, and L. Carin. Non-parametric Bayesian dictionary learning for sparse image representations. In NIPS, pages 2295–2303, 2009.
[12] S. Mohamed, K. Heller, and Z. Ghahramani. Evaluating Bayesian and L1 approaches for sparse unsupervised learning. In ICML, 2012.
[13] J. Shelton, J. Bornschein, A.-S. Sheikh, P. Berkes, and J. Lücke. Select and sample: a model of efficient neural inference and learning. In NIPS, pages 2618–2626, 2011.
[14] G. Puertas, J. Bornschein, and J. Lücke. The maximal causes of natural scenes are edge filters. In NIPS, volume 23, pages 1939–1947, 2010.
[15] J. Lücke and J. Eggert. Expectation truncation and the benefits of preselection in training generative models. JMLR, 11:2855–2900, 2010.
[16] Z. Dai, G. Exarchakis, and J. Lücke. What are the invariant occlusive components of image patches? A probabilistic generative approach. In NIPS 26, pages 243–251, 2013.
[17] J. Lücke and A.-S. Sheikh. Closed-form EM for sparse coding and its application to source separation. In LVA, pages 213–221, 2012.
[18] K. Schnass. Local identification of overcomplete dictionaries. JMLR, 16:1211–1242, 2015.
[19] G. Exarchakis, M. Henniges, J. Eggert, and J. Lücke. Ternary sparse coding. In LVA/ICA, pages 204–212, 2012.
[20] J. A. Shelton, A. S. Sheikh, J. Bornschein, P. Sterne, and J. Lücke. Nonlinear spike-and-slab sparse coding for interpretable image encoding. PLoS ONE, 10:e0124088, 2015.
[21] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.
[22] J. H. van Hateren and A. van der Schaaf. Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265:359–366, 1998.
[23] D. L. Ringach. Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88:455–463, 2002.
[24] P. Jost, P. Vandergheynst, S. Lesage, and R. Gribonval. MoTIF: an efficient algorithm for learning translation invariant dictionaries. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing, volume 5, 2006.
[25] J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85–283, 2014.
[26] N. Krüger and G. Peters. Object recognition with banana wavelets. In Eur. Symp. ANNs, 1997.
[27] P. Garrigues and B. A. Olshausen. Learning horizontal connections in a sparse coding model of natural images. In NIPS, 2007.
[28] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proc. ICML, pages 921–928, 2011.
[29] A. Ilin and H. Valpola. On the effect of the form of the posterior approximation in variational learning of ICA models. Neural Processing Letters, 22(2):183–204, 2005.
[30] T. Salimans, D. Kingma, and M. Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
[31] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.
[32] P. Berkes, G. Orban, M. Lengyel, and J. Fiser. Spontaneous cortical activity reveals hallmarks of an optimal internal model of the environment. Science, 331(6013):83–87, 2011.