Published as a conference paper at ICLR 2021

SEPARATION AND CONCENTRATION IN DEEP NETWORKS

John Zarka, Florentin Guth
Département d'informatique de l'ENS, ENS, CNRS, PSL University, Paris, France
{john.zarka,florentin.guth}@ens.fr

Stéphane Mallat
Collège de France, Paris, France
Flatiron Institute, New York, USA

ABSTRACT

Numerical experiments demonstrate that deep neural network classifiers progressively separate class distributions around their mean, achieving linear separability on the training set and increasing the Fisher discriminant ratio. We explain this mechanism with two types of operators. We prove that a rectifier without biases applied to sign-invariant tight frames can separate class means and increase Fisher ratios. In contrast, a soft-thresholding on tight frames can reduce within-class variabilities while preserving class means. Variance reduction bounds are proved for Gaussian mixture models. For image classification, we show that the separation of class means can be achieved with rectified wavelet tight frames that are not learned; this defines a scattering transform. Learning 1×1 convolutional tight frames along scattering channels and applying a soft-thresholding reduces within-class variabilities. The resulting scattering network reaches the classification accuracy of ResNet-18 on CIFAR-10 and ImageNet, with fewer layers and no learned biases.

1 INTRODUCTION

Several numerical works (Oyallon, 2017; Papyan, 2020; Papyan et al., 2020) have shown that deep neural network classifiers (LeCun et al., 2015) progressively concentrate each class around separated means, until the last layer, where within-class variability may nearly collapse (Papyan et al., 2020). The linear separability of a class mixture is characterized by the Fisher discriminant ratio (Fisher, 1936; Rao, 1948), which measures the separation of class means relatively to the variability within each class, as measured by their covariances. The neural collapse appears through a considerable increase of the Fisher discriminant ratio during training (Papyan et al., 2020). No mathematical mechanism has yet been provided to explain this separation and concentration of probability measures.

Linear separability and Fisher ratios can be increased by separating class means without increasing the variability of each class, or by concentrating each class around its mean while preserving the mean separation. This paper shows that these separation and concentration properties can be achieved with one-layer network operators using different pointwise non-linearities. We cascade these operators to define structured deep neural networks with high classification accuracies, which can be analyzed mathematically.

Section 2 studies two-layer networks computed with a linear classifier applied to ρF, where F is linear and ρ is a pointwise non-linearity. First, we show that ρF can separate class means with a ReLU ρ_r(u) = max(u, 0) and a sign-invariant F. We prove that ρ_r F then increases the Fisher ratio. As in Parseval networks (Cisse et al., 2017), F is normalized by imposing that it is a tight frame which satisfies F^T F = Id. Second, to concentrate the variability of each class around its mean, we use a shrinking non-linearity implemented by a soft-thresholding ρ_t. For Gaussian mixture models, we prove that ρ_t F concentrates within-class variabilities while nearly preserving class means, under appropriate sparsity hypotheses.
A linear classifier applied to these ρF defines two-layer neural networks with no learned bias parameters in the hidden layer, whose properties are studied mathematically and numerically. Cascading several convolutional tight frames with ReLUs or soft-thresholdings defines a deep neural network which progressively separates class means and concentrates their variability.

One may wonder if we can avoid learning these frames by using prior information on the geometry of images. Section 3 shows that the class mean separation can be computed with wavelet tight frames, which are not learned. They separate scales, directions and phases, which are known groups of transformations. A cascade of wavelet filters and rectifiers defines a scattering transform (Mallat, 2012), which has previously been applied to image classification (Bruna & Mallat, 2013; Oyallon & Mallat, 2015). However, such networks do not reach state-of-the-art classification results. We show that important improvements are obtained by learning 1×1 convolutional projectors and tight frames, which concentrate within-class variabilities with soft-thresholdings. This defines a bias-free deep scattering network whose classification accuracy reaches that of ResNet-18 (He et al., 2016) on CIFAR-10 and ImageNet. Code to reproduce all experiments of the paper is available at https://github.com/j-zarka/separation_concentration_deepnets.

The main contributions of this paper are:

- A double mathematical mechanism to separate and concentrate distinct probability measures, with a rectifier and a soft-thresholding applied to tight frames. The increase of the Fisher ratio is proved for tight-frame separation with a rectifier. Bounds on within-class covariance reduction are proved for a soft-thresholding on Gaussian mixture models.
- The introduction of a bias-free scattering network which reaches ResNet-18 accuracy on CIFAR-10 and ImageNet. Learning is reduced to 1×1 convolutional tight frames which concentrate variabilities along scattering channels.

2 CLASSIFICATION BY SEPARATION AND CONCENTRATION

The last hidden layer of a neural network defines a representation Φ(x), to which a linear classifier is applied. This section studies the separation of class means and the concentration of class variability for Φ = ρF in a two-layer network.

2.1 TIGHT FRAME RECTIFICATION AND THRESHOLDING

We begin by briefly reviewing the properties of linear classifiers and Fisher discriminant ratios. We then analyze the separation and concentration of Φ = ρF, when ρ is a rectifier or a soft-thresholding and F is a tight frame.

Linear classification and Fisher ratio. We consider a random data vector x ∈ R^d whose class labels are y(x) ∈ {1, ..., C}. Let x_c be a random vector representing the class c, whose probability distribution is the distribution of x conditioned on y(x) = c. We suppose for simplicity that all classes are equiprobable. Ave_c denotes the class average C^{-1} Σ_{c=1}^{C}. We compute a representation of x with an operator Φ which is standardized, so that E(Φ(x)) = 0 and each coefficient of Φ(x) has unit variance. The class means µ_c = E(Φ(x_c)) thus satisfy Σ_c µ_c = 0.

A linear classifier (W, b) on Φ(x) returns the index of the maximum coordinate of WΦ(x) + b ∈ R^C. An optimal linear classifier (W, b) minimizes the probability of a classification error. Optimal linear classifiers are estimated by minimizing a regularized loss function on the training data. Neural networks often use logistic linear classifiers, which minimize a cross-entropy loss.
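As a point of reference for what follows, here is a minimal PyTorch sketch (our own illustration with hypothetical names, not the paper's released code) of such a logistic linear classifier applied to a standardized representation Φ(x):

```python
import torch
import torch.nn as nn

d, C = 512, 10                               # representation dimension and number of classes
classifier = nn.Linear(d, C)                 # (W, b): computes W Phi(x) + b
loss_fn = nn.CrossEntropyLoss()              # logistic (cross-entropy) loss

def predict(phi_x):
    # the predicted class is the index of the maximum coordinate of W Phi(x) + b
    return classifier(phi_x).argmax(dim=1)

def training_step(phi_x, y, optimizer):
    # one stochastic gradient step on the cross-entropy loss; y holds labels in {0, ..., C-1}
    optimizer.zero_grad()
    loss = loss_fn(classifier(phi_x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```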
The standardization of the last layer Φ(x) is implemented with a batch normalization (Ioffe & Szegedy, 2015). A linear classifier can have a small error if the typical sets of each Φ(x_c) have little overlap, and in particular if the class means µ_c = E(Φ(x_c)) are sufficiently separated relatively to the variability of each class. Under a Gaussian hypothesis, the variability of each class is measured by the covariance Σ_c of Φ(x_c). Let Σ_W = Ave_c Σ_c be the average within-class covariance and Σ_B = Ave_c µ_c µ_c^T be the between-class covariance of the means. The within-class covariance can be whitened and normalized to Id by transforming Φ(x) with the square root Σ_W^{-1/2} of Σ_W^{-1}. All classes c, c′ are then highly separated if ‖Σ_W^{-1/2} (µ_c − µ_{c′})‖ ≫ 1. This separation is captured by the Fisher discriminant ratio Σ_W^{-1} Σ_B. We shall measure its trace:

C^{-1} Tr(Σ_W^{-1} Σ_B) = Ave_c ‖Σ_W^{-1/2} µ_c‖².    (1)

Fisher ratios have been used to train deep neural networks as a replacement for the cross-entropy loss (Dorfer et al., 2015; Stuhlsatz et al., 2012; Sun et al., 2019; Wu et al., 2017; Sultana et al., 2018; Li et al., 2016). In this paper, we use their analytic expression to analyze the improvement of linear classifiers.

Linear classification obviously cannot be improved by a linear representation Φ. The following proposition gives a simple condition under which a non-linear representation improves (or maintains) the error of linear classifiers.

Proposition 2.1. If Φ has a linear inverse, then it decreases (or maintains) the error of the optimal linear classifier, and it increases (or maintains) the Fisher ratio (1).

To prove this result, observe that if Φ has a linear inverse Φ^{-1}, then Wx = W′Φ(x) with W′ = WΦ^{-1}. The minimum classification error obtained by optimizing W is thus above the error obtained by optimizing W′. Appendix A proves that the Fisher ratio (1) is also increased or preserved.

There are qualitatively two types of non-linear operators that increase the Fisher ratio Σ_W^{-1} Σ_B. Separation operators typically increase the distance between class means without increasing the within-class covariance Σ_W. We first study such operators having a linear inverse, which guarantees through Proposition 2.1 that they increase the Fisher ratio. We then study concentration operators, which reduce the within-class variability Σ_W with non-linear shrinking operators. These are not invertible, which requires a finer analysis of their properties.

Separation by tight frame rectification. Let Φ = ρF be an operator which computes the first layer of a neural network, where ρ is a pointwise non-linearity and F is linear. We first study separation operators computed with a ReLU ρ_r(u) = max(u, 0) applied to an invertible sign-invariant matrix. Such a matrix has rows that can be regrouped in pairs of opposite signs. It can thus be written F = [F̃^T, −F̃^T]^T where F̃ is invertible. The operator ρF separates coefficients according to their sign. Since ρ_r(u) − ρ_r(−u) = u, it results that Φ = ρ_r F is linearly invertible. According to Proposition 2.1, it increases (or maintains) the Fisher ratio, and we want to choose F to maximize this increase. Observe that ρ_r(αu) = α ρ_r(u) if α ≥ 0. We can thus normalize the rows f_m of F without affecting linear classification performance. To ensure that F ∈ R^{p×d} is invertible with a stable inverse, we impose that it is a normalized tight frame of R^d satisfying F^T F = Id and ‖f_m‖² = d/p for 1 ≤ m ≤ p.
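To make these definitions concrete, here is a minimal PyTorch sketch (our own illustration, not taken from the paper's repository) which builds a sign-invariant normalized tight frame, checks the linear inverse of ρ_r F, and evaluates the Fisher ratio (1) on two synthetic zero-mean classes, anticipating the Gaussian example analyzed next:

```python
import torch

torch.manual_seed(0)

def fisher_ratio(phi, y, C, eps=1e-3):
    """Fisher discriminant ratio (1): C^{-1} Tr(Sigma_W^{-1} Sigma_B).

    phi: (n, d) representation, y: (n,) integer labels in {0, ..., C-1}.
    A small ridge eps stabilizes the inversion of Sigma_W.
    """
    phi = phi - phi.mean(dim=0)                      # standardize the global mean
    d = phi.shape[1]
    Sigma_W = torch.zeros(d, d)
    Sigma_B = torch.zeros(d, d)
    for c in range(C):
        phi_c = phi[y == c]
        mu_c = phi_c.mean(dim=0)
        diff = phi_c - mu_c
        Sigma_W += diff.t() @ diff / len(phi_c)      # within-class covariance of class c
        Sigma_B += torch.outer(mu_c, mu_c)           # between-class covariance of the means
    Sigma_W, Sigma_B = Sigma_W / C, Sigma_B / C
    return torch.trace(torch.linalg.solve(Sigma_W + eps * torch.eye(d), Sigma_B)).item() / C

# Sign-invariant normalized tight frame F = [Ftilde^T, -Ftilde^T]^T / sqrt(2),
# here with Ftilde = Id for clarity (any orthogonal Ftilde works).
d = 16
Ftilde = torch.eye(d)
F = torch.cat([Ftilde, -Ftilde], dim=0) / 2 ** 0.5   # F^T F = Id, ||f_m||^2 = d/p with p = 2d

# rho_r F is linearly invertible: rho_r(u) - rho_r(-u) = u.
x = torch.randn(8, d)
phi = torch.relu(x @ F.t())
assert torch.allclose((phi[:, :d] - phi[:, d:]) * 2 ** 0.5 @ Ftilde, x, atol=1e-5)

# Two zero-mean Gaussian classes with different covariances: the Fisher ratio of x is ~0,
# while rho_r F separates their means.
n = 20000
x0 = torch.randn(n, d) * torch.linspace(0.5, 2.0, d)
x1 = torch.randn(n, d) * torch.linspace(2.0, 0.5, d)
xs = torch.cat([x0, x1])
ys = torch.cat([torch.zeros(n), torch.ones(n)]).long()
print(fisher_ratio(xs, ys, 2))                       # ~0: identical class means
print(fisher_ratio(torch.relu(xs @ F.t()), ys, 2))   # > 0: class means separated by rho_r F
```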
The tight frame can be interpreted as a rotation operator in a higher-dimensional space, which aligns the axes and the directions along which ρ_r performs the sign separation. This rotation must be adapted in order to optimize the separation of class means. The fact that F is a tight frame can be interpreted as a normalization which simplifies the mathematical analysis.

Suppose that all classes x_c of x have a Gaussian distribution with a zero mean µ_c = 0 but different covariances Σ_c. These classes are not linearly separable because they have the same mean, and the Fisher ratio is 0. Applying ρ_r F can separate these classes and improve the Fisher ratio. Indeed, if z is a zero-mean Gaussian random variable, then E(max(z, 0)) = (2π)^{-1/2} E(z²)^{1/2}, so we verify that for F = [F̃^T, −F̃^T]^T,

E(ρ_r F x_c) = (2π)^{-1/2} [diag(F̃ Σ_c F̃^T)^{1/2}, diag(F̃ Σ_c F̃^T)^{1/2}].

The Fisher ratio can then be optimized by maximizing the covariance Σ_B between the mean vector components diag(F̃ Σ_c F̃^T)^{1/2} across all classes c. If we know a priori that x_c and −x_c have the same probability distribution, as in this Gaussian example, then we can replace ρ_r by the absolute value ρ_a(u) = |u| = ρ_r(u) + ρ_r(−u), and ρ_r F by ρ_a F̃, which halves the frame size.

Concentration by tight frame soft-thresholding. If the class means of x are already separated, then we can increase the Fisher ratio with a non-linear Φ that concentrates each class around its mean. The operator Φ must reduce the within-class variance while preserving the class separation. This can be interpreted as a non-linear noise removal if we consider the within-class variability as an additive noise relative to the class mean. It can be done with the soft-thresholding estimators introduced in Donoho & Johnstone (1994). A soft-thresholding ρ_t(u) = sign(u) max(|u| − λ, 0) shrinks the amplitude of u by λ in order to reduce its variance, while introducing a bias that depends on λ. Donoho & Johnstone (1994) proved that soft-thresholding estimators are highly effective for estimating signals that have a sparse representation in a tight frame F.

To evaluate more easily the effect of a tight frame soft-thresholding on the class means, we apply the linear reconstruction F^T to ρ_t Fx, which defines the representation Φ(x) = F^T ρ_t Fx. For a strictly positive threshold, this operator is not invertible, so we cannot apply Proposition 2.1 to prove that the Fisher ratio increases. We thus study directly the impact of Φ on the mean and covariance of each class. Let x_c be the vector representing the class c. The mean µ_c = E(x_c) is transformed into µ̃_c = E(Φ(x_c)), and the covariance Σ_c of x_c into the covariance Σ̃_c of Φ(x_c). The average covariances are Σ_W = Ave_c Σ_c and Σ̃_W = Ave_c Σ̃_c. Suppose that each x_c is a Gaussian mixture, with a potentially large number of Gaussian components centered at µ_{c,k} with a fixed covariance σ²Id:

x_c ∼ Σ_k π_{c,k} N(µ_{c,k}, σ² Id).    (2)

This model is quite general, since it amounts to covering the typical set of realizations of x_c with a union of balls of radius σ centered at the (µ_{c,k})_k. The following theorem relates the reduction of the within-class covariance to the sparsity of the Fµ_{c,k}. It relies on the soft-thresholding estimation results of Donoho & Johnstone (1994). For simplicity, we suppose that the tight frame is an orthogonal basis, but the result can be extended to general normalized tight frames. The sparsity is expressed through the decay of sorted basis coefficients.
For a vector z ∈ R^d, we denote by z_(r) a coefficient of rank r: |z_(r)| ≥ |z_(r+1)| for 1 ≤ r < d. The theorem imposes a condition on the amplitude decay of the (Fµ_{c,k})_(r) as r increases, which is a sparsity measure. We write a(r) ∼ b(r) if C_1 a(r) ≤ b(r) ≤ C_2 a(r), where C_1 and C_2 do not depend upon d nor σ. The theorem derives upper bounds on the reduction of within-class covariances and on the displacements of class means. The constants do not depend upon d as it increases to ∞ nor on σ as it decreases to 0.

Theorem 2.2. Under the mixture model hypothesis (2), we have

Tr(Σ_W) = Tr(Σ_M) + σ² d,  with  Tr(Σ_M) = C^{-1} Σ_{c,k} π_{c,k} ‖µ_c − µ_{c,k}‖².    (3)

If there exists s > 1/2 such that |(Fµ_{c,k})_(r)| ∼ r^{-s}, then a tight frame soft-thresholding with threshold λ = σ √(2 log d) satisfies

Tr(Σ̃_W) ≤ 2 Tr(Σ_M) + O(σ^{2−1/s} log d),    (4)

and all class means satisfy

‖µ̃_c − µ_c‖² = O(σ^{2−1/s} log d).    (5)

Under appropriate sparsity hypotheses, the theorem proves that applying Φ = F^T ρ_t F considerably reduces the trace of the within-class covariance. The Gaussian variance σ²d is dominant in (3) and is reduced to O(σ^{2−1/s} log d) in (4). The upper bound (5) also proves that F^T ρ_t F creates a relatively small displacement of class means, which is proportional to log d. This is important to ensure that all class means remain well separated. These bounds qualitatively explain the increase of Fisher ratios, but they are not sufficient to prove a precise bound on these ratios.

In numerical experiments, the threshold value of the theorem is automatically adjusted as follows. Non-asymptotic optimal threshold values have been tabulated as a function of d by Donoho & Johnstone (1994). For the range of d used in our applications, a nearly optimal threshold is λ = 1.5 σ. We rescale the frame variance σ² by standardizing the input x so that it has a zero mean and each coefficient has unit variance. In high dimension d, the within-class variance typically dominates the variance between class means. Under the unit variance assumption we have Tr(Σ_W) ≈ d. If F ∈ R^{p×d} is a normalized tight frame, then we also verify as in (3) that Tr(Σ_W) ≈ σ²p, so σ² ≈ d/p. It results that we choose λ = 1.5 √(d/p).

A soft-thresholding can also be computed from a ReLU with threshold ρ_rt(u) = max(u − λ, 0), because ρ_t(u) = ρ_rt(u) − ρ_rt(−u). It results that [F^T, −F^T] ρ_rt [F^T, −F^T]^T = F^T ρ_t F. However, a thresholded rectifier has more flexibility than a soft-thresholding, because it may recombine ρ_rt(Fx) and ρ_rt(−Fx) differently, in order to also separate class means, as explained previously. The choice of threshold then becomes a trade-off between separation of class means and concentration of class variability. In numerical experiments, we choose a lower λ = √(d/p) for a ReLU with threshold.

2.2 TWO-LAYER NETWORKS WITHOUT BIAS

We study two-layer bias-free networks that implement a linear classification on ρF, where F is a normalized tight frame and ρ may be a rectifier, an absolute value or a soft-thresholding, with no learned bias parameter. Bias-free networks have been introduced for denoising in Mohan et al. (2019), as opposed to classification or regression. We show that such bias-free networks have a limited expressivity and do not satisfy universal approximation theorems (Pinkus, 1999; Bach, 2017). However, numerical results indicate that their separation and concentration capabilities are sufficient to reach classification results similar to those of two-layer networks with biases on standard image datasets.
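Before analyzing these networks, here is a minimal sketch (our own rendering with hypothetical names, not the authors' implementation) of the hidden-layer operator Φ(x) = F^T ρ(Fx) with the three non-linearities and the thresholds chosen in Section 2.1; the tight frame and row-norm constraints on F are assumed to be maintained during training by the Parseval updates described below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class TightFrameLayer(nn.Module):
    """Sketch of Phi(x) = F^T rho(F x) for rho in {abs, soft-threshold, thresholded ReLU}.

    F in R^{p x d} plays the role of the normalized tight frame; F^T F = Id and the
    row norms ||f_m|| = sqrt(d/p) are assumed to be enforced separately by the
    Parseval step and spherical projection of the optimization paragraph below.
    """
    def __init__(self, d, p, rho="soft"):
        super().__init__()
        self.F = nn.Parameter(torch.randn(p, d) / p ** 0.5)
        self.rho = rho
        lam = (d / p) ** 0.5
        self.lam = 1.5 * lam if rho == "soft" else lam    # thresholds of Section 2.1

    def forward(self, x):                                  # x: (batch, d)
        z = x @ self.F.t()                                 # analysis F x
        if self.rho == "abs":
            z = z.abs()                                    # rho_a(u) = |u|
        elif self.rho == "soft":
            z = Fn.softshrink(z, lambd=self.lam)           # rho_t(u) = sign(u) max(|u| - lam, 0)
        else:
            z = torch.relu(z - self.lam)                   # rho_rt(u) = max(u - lam, 0)
        return z @ self.F                                  # synthesis F^T rho(F x)
```

The frame F is written here as a plain matrix for clarity; in the experiments described below it is a patch convolution, and its constraints are imposed by the optimization procedure presented next.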
Applying a linear classifier on Φ(x) computes

WΦ(x) + b = W ρFx + b.

This two-layer neural network has no learned bias parameters in the hidden layer, and we impose that F^T F = Id with frame rows (f_m)_m of constant norm. As a result, the following theorem proves that it does not satisfy the universal approximation theorem. We define a binary classification problem for which the probability of error remains above 1/4 for any number p of neurons in the hidden layer. The proof is provided in Appendix C for a ReLU ρ_rt with any threshold. The theorem remains valid with an absolute value ρ_a or a soft-thresholding ρ_t, because they are linear combinations of ρ_rt.

Theorem 2.3. Let λ ≥ 0 be a fixed threshold and ρ_rt(u) = max(u − λ, 0). Let F be the set of matrices F ∈ R^{p×d} with bounded rows ‖f_m‖ ≤ 1. There exists a random vector x ∈ R^d which admits a probability density supported on the unit ball, and a C^∞ function h: R^d → R such that, for all p ≥ d,

inf_{W ∈ R^{1×p}, F ∈ F, b ∈ R}  P[ sgn(W ρ_rt Fx + b) ≠ sgn(h(x)) ] ≥ 1/4.

Optimization. The parameters W, F and b are optimized with a stochastic gradient descent that minimizes a logistic cross-entropy loss on the output. To impose F^T F = Id, following the optimization of Parseval networks (Cisse et al., 2017), after each gradient update of all network parameters we insert a second gradient step to minimize (α/2) ‖F^T F − Id‖². This gradient update is

F ← (1 + α) F − α F F^T F.    (6)

We also make sure after every Parseval step that each tight frame row f_m keeps a constant norm ‖f_m‖ = √(d/p), by applying a spherical projection: f_m ← √(d/p) f_m / ‖f_m‖. These steps are performed across all experiments described in the paper, and ensure that all singular values of every learned tight frame remain between 0.99 and 1.01. A short numerical sketch of this projected update is given below, after Table 1.

To reduce the number of parameters of the classification matrix W ∈ R^{C×p}, we factorize W = W̃ F^T with W̃ ∈ R^{C×d}. It amounts to reprojecting ρF into R^d with the semi-orthogonal frame synthesis F^T, and thus defines Φ(x) = F^T ρ Fx. A batch normalization is introduced after Φ to stabilize the learning of W̃.

Image classification by separation and concentration. Image classification is first evaluated on the MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2009) image datasets. Table 1 gives the results of logistic classifiers applied to the input signal x and to Φ(x) = F^T ρFx for 3 different non-linearities ρ: an absolute value ρ_a, a soft-thresholding ρ_t, and a ReLU with threshold ρ_rt. The tight frame F is a convolution on patches of size k × k with a stride of k/2, with k = 14 for MNIST and k = 8 for CIFAR. It maps each patch to a vector of larger dimension, specified in Appendix D. Figure 1 in Appendix D shows examples of learned tight frame filters.

Table 1: For MNIST and CIFAR-10, the first row gives the logistic classification error and the second row the Fisher ratio (1), for different signal representations Φ(x). Results are evaluated with an absolute value ρ_a, a soft-thresholding ρ_t, and a ReLU with threshold ρ_rt.

Φ(x)                x     F^T ρ_a Fx   F^T ρ_t Fx   F^T ρ_rt Fx   ST(x)
MNIST  Error (%)   7.4       1.3          1.4          1.3         0.8
       Fisher       19        68           69           67         130
CIFAR  Error (%)  60.5      28.1         34.8         26.5        27.7
       Fisher      6.7        15           13           16          12

On each dataset, applying F^T ρF to x greatly reduces the linear classification error, which also appears as an increase of the Fisher ratio. For MNIST, all non-linearities produce nearly the same classification accuracy, but on CIFAR the soft-thresholding has a higher error.
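As announced above, here is a minimal sketch (our notation, not the released code) of the Parseval gradient update (6) followed by the spherical row projection:

```python
import torch

def parseval_step(F, alpha=0.01):
    """One Parseval update (6) followed by the row renormalization.

    F: tensor of shape (p, d) holding the tight frame rows f_m.
    """
    with torch.no_grad():
        # gradient update (6): F <- (1 + alpha) F - alpha F F^T F
        F.copy_((1 + alpha) * F - alpha * F @ (F.t() @ F))
        # spherical projection: each row keeps the constant norm sqrt(d / p)
        p, d = F.shape
        F.mul_((d / p) ** 0.5 / F.norm(dim=1, keepdim=True))
```

In the experiments of the paper, this projected step is inserted after every gradient update of the network parameters, which keeps all singular values of the learned frames close to 1.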
Returning to Table 1: the class means of MNIST are distinct averaged digits, which are well separated because all digits are centered in the image. Concentrating variability with a soft-thresholding is then sufficient. On the contrary, the classes of CIFAR images define nearly stationary random vectors because of arbitrary translations. As a consequence, the class means µ_c are nearly constant images, which are only discriminated by their average color. Separating these class means is then important to improve classification. As explained in Section 2.1, this is done by a ReLU ρ_r, or in this case an absolute value ρ_a, which reduces the error. The ReLU with threshold ρ_rt can interpolate between mean separation and variability concentration, and thus usually performs at least as well as the other non-linearities.

The errors of the bias-free networks with a ReLU and an absolute value are similar to the errors obtained by training two-layer networks of similar sizes but with bias parameters: 1.6% error on MNIST (Simard et al., 2003) and 25% on CIFAR-10 (Krizhevsky, 2010). It indicates that the elimination of bias parameters does not affect performance, despite the existence of the counterexamples from Theorem 2.3 that cannot be well approximated by such architectures. This means that image classification problems have more structure than is captured by these counterexamples, and that completeness in high-dimensional linear functional spaces may not be the key mathematical property to explain the performance of neural networks. Figure 1 in Appendix D shows that the learned convolutional tight frames include oriented oscillatory filters, which is also often the case for the first layer of deeper networks (Krizhevsky et al., 2012). They resemble wavelet frames, which are studied in the next section.

3 DEEP LEARNING BY SCATTERING AND CONCENTRATING

To improve classification accuracy, we cascade mean separation and variability concentration operators, implemented by ReLUs and soft-thresholdings on tight frames. This defines deep convolutional networks. However, we show that some tight frames do not need to be learned. Section 3.1 reviews scattering trees, which perform mean separation by cascading ReLUs on wavelet tight frames. Section 3.2 shows that we reach high classification accuracies by learning projectors and tight frame soft-thresholdings, which concentrate within-class variabilities along scattering channels.

3.1 SCATTERING CASCADE OF WAVELET FRAME SEPARATIONS

Scattering transforms have been introduced to classify images by cascading predefined wavelet filters with a modulus or a rectifier non-linearity (Bruna & Mallat, 2013). We write them as a product of wavelet tight frame rectifications, which progressively separate class means.

Wavelet frame. A wavelet frame separates image variations at different scales, directions and phases, with a cascade of filterings and subsamplings. We use steerable wavelets (Simoncelli & Freeman, 1995) computed with Morlet filters (Bruna & Mallat, 2013). There is one low-pass filter g_0, and L complex band-pass filters g_ℓ having an angular direction θ = ℓπ/L for 0 < ℓ ≤ L. These filters can be adjusted (Selesnick et al., 2005) so that the filtering and subsampling

F_w x(n, ℓ) = x ⋆ g_ℓ(2n)

defines a complex tight frame F_w. Fast multiscale wavelet transforms are computed by cascading the filter bank F_w on the output of the low-pass filter g_0 (Mallat, 2008).
Each complex filter g_ℓ is analytic, and thus has a real part and an imaginary part whose phases are shifted by α = π/2. This property is important to preserve equivariance to translation despite the subsampling with a stride of 2 (Selesnick et al., 2005). To define a sign-invariant frame as in Section 2.1, we must incorporate filters of opposite signs, which amounts to shifting their phase by π. We thus associate to F_w a real sign-invariant tight frame F̃_w by considering separately the four phases α = 0, π/2, π, 3π/2. It is defined by F̃_w x(n, ℓ, α) = x ⋆ g_{ℓ,α}(2n), with g_{ℓ,0} = 2^{-1/2} Real(g_ℓ), g_{ℓ,π/2} = 2^{-1/2} Imag(g_ℓ) and g_{ℓ,α+π} = −g_{ℓ,α}. We apply a rectifier ρ_r to the output of all real band-pass filters g_{ℓ,α} but not to the low-pass filter:

ρ_r F̃_w x = ( x ⋆ g_0(2n), ρ_r(x ⋆ g_{ℓ,α}(2n)) )_{ℓ,α}.

The use of wavelet phase parameters with rectifiers is studied in Mallat et al. (2019). The operator ρ_r F̃_w is linearly invertible because F̃_w is a tight frame and the ReLU is applied to band-pass filters, which come in pairs of opposite signs. Since there are 4 phases and a subsampling with a stride of 2, F̃_w x is (L + 1/4) times larger than x.

Scattering tree. A full scattering tree of depth J is computed by iterating J times over ρ_r F̃_w. Since each ρ_r F̃_w has a linear inverse, Proposition 2.1 proves that this separation can only increase the Fisher ratio. However, it also increases the signal size by a factor (L + 1/4)^J, which is typically much too large. This is avoided with orthogonal projectors, which perform a dimension reduction after applying each ρ_r F̃_w. A pruned scattering tree ST of depth J and order o is defined in Bruna & Mallat (2013) as a convolutional tree which cascades J rectified wavelet filter banks, and at each depth prunes the branches with P_j to prevent an exponential growth:

ST = ∏_{j=1}^{J} P_j ρ_r F̃_w.    (7)

After the ReLU, the pruning operator P_j eliminates the branches of the scattering which cascade more than o band-pass filters and rectifiers, where o is the scattering order (Bruna & Mallat, 2013). After J cascades, the remaining channels have thus been filtered by at least J − o successive low-pass filters g_0. We shall use a scattering transform of order o = 2. The operator P_j also averages the rectified output of the filters g_{ℓ,α} along the phase α, for ℓ fixed. This averaging eliminates the phase. It approximately computes a complex modulus and produces a localized translation invariance. The resulting pruning and phase-averaging operator P_j is a 1×1 convolutional operator, which reduces the dimension of scattering channels with an orthogonal projection.

If x has d pixels, then ST(x)[n, k] is an array of images having 2^{-2J} d pixels in each channel k, because of the J subsamplings with a stride of 2. The total number of channels is K = 1 + JL + J(J − 1)L²/2. Numerical experiments are performed with wavelet filters which approximate Gabor wavelets (Bruna & Mallat, 2013), with L = 8 directions. The number of scales J depends upon the image size. It is J = 3 for MNIST and CIFAR, and J = 4 for ImageNet, resulting in K = 217, 651 and 1251 channels respectively.

Each ρ_r F̃_w can only improve the Fisher ratio and the linear classification accuracy, but it is not guaranteed that this remains valid after applying P_j. Table 1 gives the classification error of a logistic classifier applied to ST(x), after a 1×1 orthogonal projection which reduces the number of channels, and a spatial normalization. This error is almost twice smaller than that of a two-layer neural network on MNIST, given in Table 1, but it does not improve the error on CIFAR.
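The phase-averaging property of P_j can be checked numerically. In the following sketch (our own illustration; a random complex vector stands in for the wavelet coefficients x ⋆ g_ℓ), the four phase channels are rectified and averaged along α:

```python
import torch

torch.manual_seed(0)

# Stand-in for the complex band-pass coefficients x * g_l(2n) of one wavelet channel.
w = torch.randn(1024, dtype=torch.cfloat)

# Four phase channels of the sign-invariant frame: alpha = 0, pi/2, pi, 3*pi/2.
phases = torch.stack([w.real, w.imag, -w.real, -w.imag]) / 2 ** 0.5

rectified = torch.relu(phases)           # rho_r applied to the band-pass outputs
phase_avg = rectified.mean(dim=0)        # averaging along alpha, as done by P_j

# The phase average equals (|Re w| + |Im w|) / (4 sqrt(2)), which stays within a
# factor sqrt(2) of a fixed multiple of the modulus |w|:
ratio = phase_avg / w.abs()
print(ratio.min().item(), ratio.max().item())   # between 1/(4*sqrt(2)) ~ 0.177 and 1/4
```

The bounded ratio shows that the phase average is proportional to the complex modulus up to a factor at most √2, which is the approximate modulus mentioned above.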
Coming back to classification performance: on CIFAR, the error obtained by a ResNet-20 is 3 times lower than that of a classifier on ST(x). The main issue is now to understand where this inefficiency comes from.

Table 2: Linear classification errors and Fisher ratios (1) of several scattering representations, on CIFAR-10 and ImageNet. For SC, results are evaluated with a soft-thresholding ρ_t and a thresholded rectifier ρ_rt. The last column gives the error of ResNet-20 for CIFAR-10 (He et al., 2016) and ResNet-18 for ImageNet, taken from https://pytorch.org/docs/stable/torchvision/models.html.

Φ                            ST     SP    SC (ρ_t)  SC (ρ_rt)  ResNet
CIFAR     Error (%)        27.7   12.8      8.0        7.6       8.8
          Fisher             12     20       43         41        -
ImageNet  Error (%) Top-5  54.1   20.5     11.6       10.7      10.9
          Error (%) Top-1  73.0   42.3     31.4       29.7      30.2
          Fisher            2.0     18       51         44        -

3.2 SEPARATION AND CONCENTRATION IN LEARNED SCATTERING NETWORKS

A scattering tree iteratively separates class means with wavelet filters. Its dimension is reduced by predefined projection operators, which may decrease the Fisher ratio and linear separability. To avoid this source of inefficiency, we define a scattering network which learns these projections. The second step introduces tight frame thresholdings along scattering channels, to concentrate within-class variabilities. Image classification results are evaluated on the CIFAR-10 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) datasets.

Learned scattering projections. Beyond scattering trees, the projections P_j of a scattering transform (7) can be redefined as arbitrary orthogonal 1×1 convolutional operators, which reduce the number of scattering channels: P_j P_j^T = Id. Orthogonal projectors acting along the direction index ℓ of wavelet filters can improve classification (Oyallon & Mallat, 2015). We are now going to learn these linear operators together with the final linear classifier. Before computing this projection, the mean and variance of each scattering channel are standardized with a batch normalization BN, by setting the affine coefficients to γ = 1 and β = 0. This projected scattering operator can be written

SP = ∏_{j=1}^{J} P_j BN ρ_r F̃_w.

Applying a linear classifier to SP(x) defines a deep convolutional network whose parameters are the 1×1 convolutional P_j and the classifier weights W, b. The wavelet convolution filters in F̃_w are not learned. The orthogonality of P_j is imposed through the gradient steps (6) applied to F = P_j^T.

Table 2 shows that learning the projectors P_j more than halves the scattering classification error of SP relatively to ST on CIFAR-10 and ImageNet, reaching AlexNet accuracy on ImageNet, while achieving a higher Fisher ratio. The learned orthogonal projections P_j create invariants to families of linear transformations along scattering channels that depend upon scales, directions and phases. They correspond to image transformations which have been linearized by the scattering transform. Small diffeomorphisms which deform the image are examples of operators which are linearized by a scattering transform (Mallat, 2012). The learned projector eliminates within-class variabilities which are not discriminative across classes. Since it is linear, it does not by itself improve linear separability or the Fisher ratio; it takes advantage of the non-linear separation produced by the previous scattering layers. The operator P_j is a projection on a family of orthogonal directions which define new scattering channels, and is followed by a wavelet convolution F̃_w along spatial variables.
It defines separable convolutional filters F̃_w P_j along space and channels. Learning P_j amounts to choosing orthogonal directions so that ρ_r F̃_w P_j optimizes the separation of class means. If the class distributions are invariant to rotations, the separation can be achieved with wavelet convolutions along the direction index ℓ (Oyallon & Mallat, 2015), but better results are obtained by learning these filters. This separable scattering architecture is different from separable approximations of deep network filters in discrete cosine bases (Ulicny et al., 2019) or in Fourier-Bessel bases (Qiu et al., 2018). A wavelet scattering computes ρ_r F̃_w P_j, as opposed to a separable decomposition ρ_r P_j F̃_w, so the ReLU is applied in a higher-dimensional space indexed by the wavelet variables produced by F̃_w. It provides explicit coordinates to analyze the mathematical properties, but it also increases the number of learned parameters, as shown in Table 4, Appendix D.

Concentration along scattering channels. A projected scattering transform can separate class means, but it does not concentrate class variabilities. To further reduce classification errors, following Section 2.1, a concentration is computed with a tight frame soft-thresholding F_j^T ρ_t F_j applied along scattering channels. It increases the dimension of scattering channels with a 1×1 convolutional tight frame F_j, applies a soft-thresholding ρ_t, and reduces the number of channels with the 1×1 convolutional operator F_j^T. The resulting concentrated scattering operator is

SC = ∏_{j=1}^{J} (F_j^T ρ_t F_j) (P_j BN ρ_r F̃_w).    (8)

It has 2J layers, with odd layers computed by separating means with a ReLU ρ_r and even layers computed by concentrating class variabilities with a soft-thresholding ρ_t. According to Section 2.1, the soft-threshold is λ = 1.5 √(d/p). This soft-thresholding may be replaced by a thresholded rectifier ρ_rt(u) = max(u − λ, 0) with the lower threshold λ = √(d/p). A logistic classifier is applied to SC(x). The resulting deep network does not include any learned bias parameter, except in the final linear classification layer. Learning is reduced to the 1×1 convolutional operators P_j and F_j along scattering channels, and to the linear classification parameters.

Table 2 gives the classification errors of this concentrated scattering on CIFAR for J = 4 (8 layers) and ImageNet for J = 6 (12 layers). The layer dimensions are specified in Appendix D, and the numbers of parameters of the scattering networks are given in Table 4, Appendix D. The concentration step reduces the error by about 40% relative to the projected scattering SP. A ReLU with threshold ρ_rt produces an error slightly below a soft-thresholding ρ_t on both CIFAR-10 and ImageNet, and this error is also below the errors of ResNet-20 for CIFAR and ResNet-18 for ImageNet. These errors are also nearly half the classification errors previously obtained by cascading a scattering tree ST with several 1×1 convolutional layers and large MLP classifiers (Zarka et al., 2020; Oyallon et al., 2017). It shows that the separation and concentration learning must be done at each scale rather than only at the largest-scale output.

Table 3 shows the progressive improvement of the Fisher ratio measured at each layer of SC on CIFAR-10.

Table 3: Evolution of the Fisher ratio across layers of the scattering concentration network SC with a ReLU with threshold ρ_rt, on the CIFAR dataset.

Layer          0    1    2    3    4    5    6    7    8
CIFAR Fisher  1.8   11   13   11   15   15   22   25   40
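To make the architecture of (8) more concrete, here is a simplified sketch of one stage (our own rendering, not the released implementation): the wavelet filter bank is assumed to be passed in as a fixed tensor, and the scattering channel bookkeeping of P_j is simplified to a generic 1×1 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class ScatteringConcentrationBlock(nn.Module):
    """Sketch of one stage of SC in (8): (F_j^T rho_t F_j)(P_j BN rho_r F_w).

    `wavelet_filters` is assumed to be a fixed real filter bank of shape
    (K_out, K_in, k, k) standing in for F_w, with the low-pass output in channel 0
    and the band-pass phase channels after it. P_j and F_j are 1x1 convolutions
    whose orthogonality / tight frame constraints are assumed to be maintained
    by the Parseval step of Section 2.2.
    """
    def __init__(self, wavelet_filters, proj_channels, frame_channels):
        super().__init__()
        self.register_buffer("w", wavelet_filters)                        # F_w: fixed, not learned
        k_out = wavelet_filters.shape[0]
        self.bn = nn.BatchNorm2d(k_out, affine=False)                     # BN with gamma = 1, beta = 0
        self.P = nn.Conv2d(k_out, proj_channels, 1, bias=False)           # projector P_j
        self.F = nn.Conv2d(proj_channels, frame_channels, 1, bias=False)  # tight frame F_j
        self.lam = 1.5 * (proj_channels / frame_channels) ** 0.5          # lambda = 1.5 sqrt(d/p)

    def forward(self, x):
        z = Fn.conv2d(x, self.w, stride=2, padding=self.w.shape[-1] // 2)  # F_w with stride 2
        z = torch.cat([z[:, :1], torch.relu(z[:, 1:])], dim=1)             # rho_r on band-pass only
        z = self.P(self.bn(z))                                             # P_j BN
        u = Fn.softshrink(self.F(z), lambd=self.lam)                       # rho_t F_j
        return Fn.conv2d(u, self.F.weight.transpose(0, 1).contiguous())    # F_j^T: 1x1 synthesis
```

Stacking J such stages and applying a logistic classifier to the output gives the overall bias-free structure sketched by (8).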
In Table 3, the transition from an odd layer 2j − 1 to an even layer 2j results from F_j^T ρ_t F_j, which always improves the Fisher ratio by concentrating class variabilities. The transition from layer 2j to layer 2j + 1 is done by P_{j+1} ρ_r F̃_w, which may decrease the Fisher ratio because of the projection P_{j+1}, but globally brings an important improvement.

4 CONCLUSION

We proved that separation and concentration of probability measures can be achieved with rectifiers and thresholdings applied to appropriate tight frames F. We also showed that the separation of class means can be achieved by cascading wavelet frames that are not learned, which defines a scattering transform. By concentrating variabilities with a thresholding along scattering channels, we reach ResNet-18 classification accuracy on CIFAR-10 and ImageNet. A major issue is to understand the mathematical properties of the learned projectors and tight frames along scattering channels. This is necessary to understand the types of classification problems that are well approximated by such architectures, and to prove lower bounds on the evolution of Fisher ratios across layers.

ACKNOWLEDGMENTS

This work was supported by grants from Région Ile-de-France and the PRAIRIE 3IA Institute of the French ANR-19-P3IA-0001 program. We would like to thank the Scientific Computing Core at the Flatiron Institute for the use of their computing resources.

REFERENCES

M. Andreux, T. Angles, G. Exarchakis, R. Leonarduzzi, G. Rochette, L. Thiry, J. Zarka, S. Mallat, J. Andén, E. Belilovsky, J. Bruna, V. Lostanlen, M. J. Hirn, E. Oyallon, S. Zhang, C. E. Cella, and M. Eickenberg. Kymatio: Scattering transforms in Python. Journal of Machine Learning Research, 21(60):1-6, 2020.

F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1-53, 2017.

J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1872-1886, 2013.

M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 854-863, 2017.

D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425-455, 1994.

M. Dorfer, R. Kelz, and G. Widmer. Deep linear discriminant analysis. arXiv preprint arXiv:1511.04707, 2015.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(7):179-188, 1936.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 448-456, 2015.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

A. Krizhevsky. Convolutional deep belief networks on CIFAR-10, 2010.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097-1105. Curran Associates, Inc., 2012.

Y. LeCun, C. Cortes, and C. J. Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

Y. Li, W. Zhao, and J. Pan. Deformable patterned fabric defect detection with Fisher criterion-based deep learning. IEEE Transactions on Automation Science and Engineering, 14(2):1256-1264, 2016.

S. Mallat. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd edition, 2008.

S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.

S. Mallat, S. Zhang, and G. Rochette. Phase harmonic correlations and convolutional neural networks. Information and Inference: A Journal of the IMA, 2019. doi: 10.1093/imaiai/iaz019.

S. Mohan, Z. Kadkhodaie, E. P. Simoncelli, and C. Fernandez-Granda. Robust and interpretable blind image denoising via bias-free convolutional neural networks. arXiv preprint arXiv:1906.05478, 2019.

E. Oyallon. Building a regular decision boundary with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1886-1894, 2017.

E. Oyallon and S. Mallat. Deep roto-translation scattering for object classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 2865-2873. IEEE Computer Society, 2015.

E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering transform: Deep hybrid networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5618-5627, 2017.

V. Papyan. Traces of class/cross-class structure pervade deep learning spectra, 2020.

V. Papyan, X. Y. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 2020.

A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195, 1999.

Q. Qiu, X. Cheng, R. Calderbank, and G. Sapiro. DCFNet: Deep neural network with decomposed convolutional filters. International Conference on Machine Learning, 2018.

C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society: Series B (Methodological), 10(2):159-193, 1948.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.

I. W. Selesnick, R. G. Baraniuk, and N. C. Kingsbury. The dual-tree complex wavelet transform. IEEE Signal Processing Magazine, 22(6):123-151, 2005.

P. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2, 2003.

E. P. Simoncelli and W. T. Freeman. The steerable pyramid: A flexible architecture for multiscale derivative computation. In Proceedings, International Conference on Image Processing, volume 3, pp. 444-447. IEEE, 1995.

A. Stuhlsatz, J. Lippel, and T. Zielke. Feature extraction with deep neural networks by a generalized discriminant analysis. IEEE Transactions on Neural Networks and Learning Systems, 23(4):596-608, 2012.

N. Sultana, B. Mandal, and N. Puhan. Deep residual network with regularised Fisher framework for detection of melanoma. IET Computer Vision, 12(8):1096-1104, 2018.
K. Sun, J. Zhang, H. Yong, and J. Liu. FPCANet: Fisher discrimination for principal component analysis network. Knowledge-Based Systems, 166:108-117, 2019.

M. Ulicny, V. Krylov, and R. Dahyot. Harmonic networks for image classification. In Proceedings of the British Machine Vision Conference, Sep. 2019.

L. Wu, C. Shen, and A. Van Den Hengel. Deep linear discriminant analysis on Fisher networks: A hybrid architecture for person re-identification. Pattern Recognition, 65:238-250, 2017.

J. Zarka, L. Thiry, T. Angles, and S. Mallat. Deep network classification by scattering and homotopy dictionary learning. In International Conference on Learning Representations, 2020.