# Unsupervised Deep Haar Scattering on Graphs

Xu Chen¹·², Xiuyuan Cheng², and Stéphane Mallat²

¹ Department of Electrical Engineering, Princeton University, NJ, USA
² Département d'Informatique, École Normale Supérieure, Paris, France

The classification of high-dimensional data defined on graphs is particularly difficult when the graph geometry is unknown. We introduce a Haar scattering transform on graphs, which computes invariant signal descriptors. It is implemented with a deep cascade of additions, subtractions and absolute values, which iteratively compute orthogonal Haar wavelet transforms. Multiscale neighborhoods of unknown graphs are estimated by minimizing an average total variation, with a pair matching algorithm of polynomial complexity. Supervised classification with dimension reduction is tested on databases of scrambled images, and on signals sampled on unknown irregular grids on a sphere.

## 1 Introduction

The geometric structure of a data domain can be described with a graph [11], where neighboring data points are represented by vertices related by an edge. For sensor networks, this connectivity depends upon the physical locations of the sensors, but in social networks it may correspond to strong interactions or similarities between two nodes. In many applications, the connectivity graph is unknown and must therefore be estimated from data. We introduce an unsupervised learning algorithm to classify signals defined on an unknown graph.

An important source of variability on graphs results from displacements of signal values. These may be due to movements of physical sources in a sensor network, or to propagation phenomena in social networks. Classification problems are often invariant to such displacements: image pattern recognition and the characterization of communities in social networks are examples of invariant problems. They require computing locally or globally invariant descriptors, which are sufficiently rich to discriminate complex signal classes.

Section 2 introduces a Haar scattering transform which builds an invariant representation of graph data by cascading additions, subtractions and absolute values in a deep network. It can be factorized as a product of Haar wavelet transforms on the graph. Haar wavelet transforms are flexible representations which characterize multiscale signal patterns on graphs [6, 10, 11]. Haar scattering transforms are extensions to graphs of wavelet scattering transforms, previously introduced for uniformly sampled signals [1].

For unstructured signals defined on an unknown graph, recovering the full graph geometry is an NP-complete problem. We avoid this complexity by only learning connected multiresolution graph approximations. This is sufficient to compute Haar scattering representations. Multiscale neighborhoods are calculated by minimizing an average total signal variation over training examples, which involves a pair matching algorithm of polynomial complexity. We show that this unsupervised learning algorithm computes sparse scattering representations.

This work was supported by the ERC grant Invariant Class 320959.

Figure 1: A Haar scattering network computes each coefficient of a layer $S_{j+1}x$ by adding or subtracting a pair of coefficients in the previous layer $S_j x$. (The figure shows the successive layers $x$, $S_1x$, $S_2x$, $S_3x$.)

For classification, the dimension of unsupervised Haar scattering representations is reduced with supervised partial least squares regression [12]. It amounts to computing a last layer of reduced dimensionality, before applying a Gaussian kernel SVM classifier.
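As an illustration of this last step (not the authors' implementation), the following scikit-learn sketch reduces hypothetical scattering features with partial least squares against one-hot labels and classifies the reduced representation with a Gaussian kernel SVM; the feature matrix, labels, and all hyperparameters below are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Placeholder inputs: each row of `features` stands for the Haar scattering
# coefficients S_J x of one training signal, `labels` are its class indices.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 512))
labels = rng.integers(0, 10, size=200)

# Supervised dimension reduction: PLS regression against one-hot encoded labels,
# keeping a reduced "last layer" of n_components directions.
pls = PLSRegression(n_components=40, scale=False)
pls.fit(features, label_binarize(labels, classes=np.arange(10)))
reduced = pls.transform(features)

# Gaussian kernel SVM trained on the reduced representation.
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(reduced, labels)
print(clf.score(pls.transform(features), labels))  # training accuracy of this toy run
```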
The performance of a Haar scattering classification is tested on scrambled images, whose graph geometry is unknown. Results are provided for the MNIST and CIFAR-10 image databases. Classification experiments are also performed on scrambled signals whose samples lie on an irregular grid on a sphere. All computations can be reproduced with software available at www.di.ens.fr/data/scattering/haar.

## 2 Orthogonal Haar Scattering on a Graph

### 2.1 Deep Networks of Permutation Invariant Operators

We consider signals $x$ defined on an unweighted graph $G = (V, E)$, with $V = \{1, ..., d\}$. Edges relate neighboring vertices. We suppose that $d$ is a power of 2 to simplify explanations. A Haar scattering is calculated by iteratively applying the following permutation invariant operator:

$$(\alpha, \beta) \longrightarrow (\alpha + \beta, \, |\alpha - \beta|). \qquad (1)$$

Its values are not modified by a permutation of $\alpha$ and $\beta$, and both values are recovered by

$$\max(\alpha, \beta) = \tfrac{1}{2}\big(\alpha + \beta + |\alpha - \beta|\big) \quad \text{and} \quad \min(\alpha, \beta) = \tfrac{1}{2}\big(\alpha + \beta - |\alpha - \beta|\big). \qquad (2)$$

An orthogonal Haar scattering transform computes progressively more invariant signal descriptors by applying this invariant operator at multiple scales. It is implemented along the deep network illustrated in Figure 1. The network layer $j$ is a two-dimensional array $S_j x(n, q)$ of $d = 2^{-j}d \times 2^j$ coefficients, where $n$ is a node index and $q$ is a feature type. The input network layer is $S_0 x(n, 0) = x(n)$. We compute $S_{j+1}x$ by regrouping the $2^{-j}d$ nodes of $S_j x$ into $2^{-j-1}d$ pairs $(a_n, b_n)$, and applying the permutation invariant operator (1) to each pair $(S_j x(a_n, q), S_j x(b_n, q))$:

$$S_{j+1}x(n, 2q) = S_j x(a_n, q) + S_j x(b_n, q) \qquad (3)$$

and

$$S_{j+1}x(n, 2q + 1) = |S_j x(a_n, q) - S_j x(b_n, q)|. \qquad (4)$$

This transform is iterated up to a maximum depth $J \leq \log_2(d)$. It computes $S_J x$ with $Jd/2$ additions, subtractions and absolute values. Since $S_j x \geq 0$ for $j > 0$, one can put an absolute value on the sum in (3) without changing $S_{j+1}x$. It results that $S_{j+1}x$ is calculated from the previous layer $S_j x$ by applying a linear operator followed by a non-linearity, as in most deep neural network architectures. In our case this non-linearity is an absolute value, as opposed to the rectifiers used in most deep networks [4].

For each $n$, the $2^j$ scattering coefficients $\{S_j x(n, q)\}_{0 \leq q < 2^j}$ are calculated from the values of $x$ in a vertex set $V_{j,n}$ of size $2^j$. One can verify by induction on (3) and (4) that $V_{0,n} = \{n\}$ for $0 \leq n < d$, and for any $j \geq 0$

$$V_{j+1,n} = V_{j,a_n} \cup V_{j,b_n}. \qquad (5)$$

Figure 2: A connected multiresolution is a partition of vertices with embedded connected sets $V_{j,n}$ of size $2^j$. (a): Example of a partition for the graph of a square image grid, for $1 \leq j \leq 3$. (b): Example on an irregular graph.

The embedded subsets $\{V_{j,n}\}_{j,n}$ form a multiresolution approximation of the vertex set $V$. At each scale $2^j$, different pairings $(a_n, b_n)$ define different multiresolution approximations. A small graph displacement propagates signal values from a node to its neighbors. To build representations that are nearly invariant to such displacements, a Haar scattering transform must regroup connected vertices. It is thus computed over multiresolution vertex sets $V_{j,n}$ which are connected in the graph $G$. It results from (5) that a necessary and sufficient condition is that each pair $(a_n, b_n)$ regroups two connected sets $V_{j,a_n}$ and $V_{j,b_n}$.

Figure 2 shows two examples of connected multiresolution approximations. Figure 2(a) illustrates the graph of an image grid, where pixels are connected to 8 neighbors. In this example, each $V_{j+1,n}$ regroups two subsets $V_{j,a_n}$ and $V_{j,b_n}$ which are connected horizontally if $j$ is even and vertically if $j$ is odd. Figure 2(b) illustrates a second example of a connected multiresolution approximation on an irregular graph.
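To make the cascade concrete, here is a minimal NumPy sketch of equations (3)-(4) for a given sequence of pairings. The function name, the pairing format, and the toy pairings on a cycle of 8 vertices are illustrative choices of ours, not the paper's released software.

```python
import numpy as np

def haar_scattering(x, pairings):
    """Orthogonal Haar scattering: cascade the operator (3)-(4) over given pairings.

    x        : 1-D array of length d (a power of 2), one value per graph vertex.
    pairings : list of J integer arrays; pairings[j] has shape (d / 2**(j+1), 2)
               and lists the node pairs (a_n, b_n) grouped at layer j.
    Returns the layers [S_0 x, ..., S_J x]; layer j has shape (d / 2**j, 2**j),
    indexed by (node n, feature q).
    """
    S = np.asarray(x, dtype=float).reshape(-1, 1)   # S_0 x(n, 0) = x(n)
    layers = [S]
    for pairs in pairings:
        a, b = pairs[:, 0], pairs[:, 1]
        sums = S[a] + S[b]                  # S_{j+1} x(n, 2q)   = S_j x(a_n, q) + S_j x(b_n, q)
        diffs = np.abs(S[a] - S[b])         # S_{j+1} x(n, 2q+1) = |S_j x(a_n, q) - S_j x(b_n, q)|
        S = np.empty((sums.shape[0], 2 * sums.shape[1]))
        S[:, 0::2] = sums                   # even features: sums
        S[:, 1::2] = diffs                  # odd features: absolute differences
        layers.append(S)
    return layers

# Toy example: d = 8 vertices on a cycle, pairing neighbouring nodes at every scale
# so that each V_{j,n} stays connected (an illustrative pairing, not a learned one).
d = 8
x = np.random.randn(d)
pairings = [np.arange(d).reshape(-1, 2),
            np.arange(d // 2).reshape(-1, 2),
            np.arange(d // 4).reshape(-1, 2)]
for j, S in enumerate(haar_scattering(x, pairings)):
    # Each layer doubles the squared energy, so ||S_j x|| = 2^(j/2) ||x||.
    assert np.isclose(np.linalg.norm(S), 2 ** (j / 2) * np.linalg.norm(x))
```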
There are many different connected multiresolution approximations resulting from different pairings at each scale $2^j$. Different multiresolution approximations correspond to different Haar scattering transforms. In the following, we compute several Haar scattering transforms of a signal $x$ by defining different multiresolution approximations.

The following theorem proves that a Haar scattering preserves the norm, and that it is contractive up to a normalization factor $2^{j/2}$. The contraction is due to the absolute value, which suppresses the sign and hence reduces the amplitude of differences. The proof is in Appendix A.

Theorem 2.1. For any $j \geq 0$, and any $x, x'$ defined on $V$,

$$\|S_j x - S_j x'\| \leq 2^{j/2} \|x - x'\| \quad \text{and} \quad \|S_j x\| = 2^{j/2} \|x\|.$$

### 2.2 Iterated Haar Wavelet Transforms

We show that a Haar scattering transform can be written as a cascade of orthogonal Haar wavelet transforms and absolute value non-linearities. It is a particular example of the scattering transforms introduced in [1]. It computes coefficients measuring signal variations at multiple scales and multiple orders. We prove that the signal can be recovered from Haar scattering coefficients computed over enough multiresolution approximations.

A scattering operator is contractive because of the absolute value. When coefficients have an arbitrary sign, suppressing the sign reduces the volume of the signal space by a factor 2. We say that $S_J x(n, q)$ is a coefficient of order $m$ if its computation includes $m$ absolute values of differences. The amplitude of scattering coefficients typically decreases exponentially as the scattering order $m$ increases, because of the contraction produced by the absolute value. One can verify from (3) and (4) that $S_J x(n, q)$ is a coefficient of order $m = 0$ if $q = 0$, and of order $m > 0$ if

$$q = \sum_{k=1}^{m} 2^{J - j_k} \quad \text{with} \quad 0 \leq j_k < j_{k+1} \leq J.$$

It results that there are $\binom{J}{m} 2^{-J} d$ coefficients $S_J x(n, q)$ of order $m$.

We now show that Haar scattering coefficients of order $m$ are obtained by cascading $m$ orthogonal Haar wavelet transforms defined on the graph $G$. A Haar wavelet at scale $2^j$ is defined over each $V_{j,n} = V_{j-1,a_n} \cup V_{j-1,b_n}$ by

$$\psi_{j,n} = \mathbf{1}_{V_{j-1,a_n}} - \mathbf{1}_{V_{j-1,b_n}}.$$

For any $J \geq 0$, one can verify [10, 6] that

$$\{\mathbf{1}_{V_{J,n}}\}_{0 \leq n < 2^{-J}d} \cup \{\psi_{j,n}\}_{0 \leq n < 2^{-j}d, \; 0 < j \leq J}$$

is a non-normalized orthogonal Haar basis of the space of signals defined on $V$.
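As a small numerical check of this construction, the following sketch builds the vectors $\mathbf{1}_{V_{J,n}}$ and $\psi_{j,n}$ for an assumed dyadic multiresolution on $d = 8$ vertices (an illustrative pairing of consecutive indices, chosen by us) and verifies that their Gram matrix is diagonal, i.e. that the family is orthogonal but not normalized.

```python
import numpy as np

d, J = 8, 3   # toy vertex set of size d with a dyadic multiresolution (illustrative choice)

def indicator(j, n):
    """Indicator vector of V_{j,n} = {n*2^j, ..., (n+1)*2^j - 1}."""
    return np.where(np.arange(d) // 2 ** j == n, 1.0, 0.0)

# Scaling vectors 1_{V_{J,n}} plus wavelets psi_{j,n} = 1_{V_{j-1,a_n}} - 1_{V_{j-1,b_n}},
# with the pairs (a_n, b_n) = (2n, 2n+1) of this dyadic example.
family = [indicator(J, n) for n in range(d // 2 ** J)]
for j in range(1, J + 1):
    family += [indicator(j - 1, 2 * n) - indicator(j - 1, 2 * n + 1)
               for n in range(d // 2 ** j)]

B = np.stack(family)          # d vectors of dimension d
G = B @ B.T                   # Gram matrix of the family
# Off-diagonal entries vanish: the family is an orthogonal (non-normalized) Haar basis.
assert np.allclose(G, np.diag(np.diag(G)))
```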