Published as a conference paper at ICLR 2022

PHASE COLLAPSE IN NEURAL NETWORKS

Florentin Guth, John Zarka
DI, ENS, CNRS, PSL University, Paris, France
{florentin.guth,john.zarka}@ens.fr

Stéphane Mallat
Collège de France, Paris, France
Flatiron Institute, New York, USA

ABSTRACT

Deep convolutional classifiers linearly separate image classes and improve accuracy as depth increases. They progressively reduce the spatial dimension whereas the number of channels grows with depth. Spatial variability is therefore transformed into variability along channels. A fundamental challenge is to understand the role of non-linearities together with convolutional filters in this transformation. ReLUs with biases are often interpreted as thresholding operators that improve discrimination through sparsity. This paper demonstrates that it is a different mechanism called phase collapse which eliminates spatial variability while linearly separating classes. We show that collapsing the phases of complex wavelet coefficients is sufficient to reach the classification accuracy of ResNets of similar depths. However, replacing the phase collapses with thresholding operators that enforce sparsity considerably degrades the performance. We explain these numerical results by showing that the iteration of phase collapses progressively improves separation of classes, as opposed to thresholding non-linearities.

1 INTRODUCTION

CNN image classifiers progressively eliminate spatial variables through iterated filterings and subsamplings, while linear classification accuracy improves as depth increases (Oyallon, 2017). It has also been numerically observed that CNNs concentrate training samples of each class in small separated regions of a progressively lower-dimensional space. It can ultimately produce a neural collapse (Papyan et al., 2020), where all training samples of each class are mapped to a single point. In this case, the elimination of spatial variables comes with a collapse of within-class variability and perfect linear separability.

This increase in linear classification accuracy is obtained in standard CNN architectures like ResNets from the iteration of linear convolutional operators and ReLUs with biases. A difficulty in understanding the underlying mathematics comes from the flexibility of ReLUs. Indeed, a linear combination of biased ReLUs can approximate any non-linearity. Many papers interpret iterations on ReLUs and linear operators as sparse code computations (Sun et al., 2018; Sulam et al., 2018; 2019; Mahdizadehaghdam et al., 2019; Zarka et al., 2020; 2021). We show that it is a different mechanism, called phase collapse, which underlies the increase in classification accuracy of these architectures. A phase collapse is the elimination of the phases of complex-valued wavelet coefficients with a modulus; we show that these phases concentrate spatial variability. This is demonstrated by introducing a structured convolutional neural network with wavelet filters and no biases.

Section 2 introduces and explains phase collapses. Complex-valued representations are used because they reveal the mathematics of spatial variability. Indeed, translations are diagonalized in the Fourier basis, where they become a complex phase shift. Invariants to translations are computed with a modulus, which collapses the phases of this complex representation. Section 2 explains how this can improve linear classification. Phase collapses can also be calculated with ReLUs and real filters.
A CNN with complex-valued filters is indeed just a particular instance of a real-valued CNN, whose channels are paired together to define complex numbers.

Section 3 demonstrates the role of phase collapse in deep classification architectures. It introduces a Learned Scattering network with phase collapses. This network applies a learned 1×1 convolutional complex operator $P_j$ on each layer $x_j$, followed by a phase collapse, which is obtained with a complex wavelet filtering operator $W$ and a modulus:
$$x_{j+1} = |W P_j x_j|. \tag{1}$$
It does not use any bias. This network architecture is illustrated in Figure 1. With the addition of skip-connections, we show that this phase collapse network reaches ResNet accuracy on ImageNet and CIFAR-10.

Section 4 compares phase collapses with other non-linearities such as thresholdings or more general amplitude reduction operators. Such non-linearities can enforce sparsity but do not modify the phase. We show that the accuracy of a Learned Scattering network is considerably reduced when the phase collapse modulus is replaced by soft-thresholdings with learned biases. This is also true of more general phase-preserving non-linearities and architectures.

Section 5 explains the performance of iterated phase collapses by showing that each phase collapse progressively improves linear discriminability. On the contrary, the improvements in classification accuracy of successive sparse code computations are shown to quickly saturate.

The main contribution of this paper is a demonstration that the classification accuracy of deep neural networks mostly relies on phase collapses, which are sufficient to linearly separate the different classes on natural image databases. This is captured by the Learned Scattering architecture, which reaches ResNet-18 accuracy on ImageNet and CIFAR-10. We also show that phase collapses are necessary to reach this accuracy, by demonstrating numerically and theoretically that iterating phase-preserving non-linearities leads to a significantly worse performance.

Figure 1: Architecture of a Learned Scattering network with phase collapses. It has J + 1 layers with J = 11 for ImageNet and J = 8 for CIFAR-10. Each layer is computed with a 1×1 convolutional operator $P_j$ which linearly combines channels. It is followed by a phase collapse, computed with a spatial convolutional filtering with a complex wavelet $W$ and a complex modulus $|\cdot|$. A layer of depth j corresponds to a scale $2^{j/2}$ and a subsampling by 2 is applied every two layers, after $W$. A skip-connection concatenates the outputs of $|W P_j \cdot|$ and $W P_j \cdot$. A final 1×1 operator $P_J$ reduces the dimension before a linear classifier.

2 ELIMINATING SPATIAL VARIABILITY WITH PHASE COLLAPSES

Deep convolutional classifiers achieve linear separation of image classes. We show that linear classification on raw images has a poor accuracy because image classes are invariant to local translations. This geometric within-class variability takes the form of random phase fluctuations, and as a result all classes have a zero mean. To improve classification accuracy, non-linear operators must separate class means, which therefore requires collapsing these phase fluctuations.

Translations and phase shifts. Translations capture the spatial topology of the grid on which the image is defined. These translations are transformed into phase shifts by a Fourier transform. We prove that this remains approximately valid for images convolved with appropriate complex filters. Let x be an image indexed by $u \in \mathbb{Z}^2$.
We write $x_\tau(u) = x(u - \tau)$ the translation of x by τ. It is diagonalized by the Fourier transform $\hat{x}(\omega) = \sum_u x(u)\, e^{-i\omega \cdot u}$, which creates a phase shift:
$$\widehat{x_\tau}(\omega) = e^{-i\omega \cdot \tau}\, \hat{x}(\omega). \tag{2}$$
This diagonalization explains the need to introduce complex numbers to analyze the mathematical properties of geometric within-class variabilities. Computations can however be carried out with real numbers, as we will show.

A Fourier transform is computed by filtering x with complex exponentials $e^{i\omega \cdot u}$. One may replace these by complex wavelet filters ψ that are localized in space and in the Fourier domain. The following theorem proves that small translations can still be approximated by a phase shift in this case. We denote by $\star$ the convolution of images.

Theorem 1. Let $\psi \colon \mathbb{Z}^2 \to \mathbb{C}$ be a filter with $\|\psi\|_2 = 1$, whose center frequency ξ and bandwidth σ are defined by:
$$\xi = \frac{1}{(2\pi)^2} \int_{[-\pi,\pi]^2} \omega\, |\hat{\psi}(\omega)|^2\, d\omega \quad \text{and} \quad \sigma^2 = \frac{1}{(2\pi)^2} \int_{[-\pi,\pi]^2} |\omega - \xi|^2\, |\hat{\psi}(\omega)|^2\, d\omega.$$
Then, for any $\tau \in \mathbb{Z}^2$,
$$\|x_\tau \star \psi - e^{-i\xi \cdot \tau}(x \star \psi)\|_\infty \leq \sigma\, |\tau|\, \|x\|_2. \tag{3}$$

The proof is in Appendix C. This theorem proves that if $|\tau| \ll 1/\sigma$ then $x_\tau \star \psi \approx e^{-i\xi \cdot \tau}\, x \star \psi$. In this case, a translation by τ produces a phase shift by $\xi \cdot \tau$.

Phase collapse and stationarity. We define a phase collapse as the elimination of the phase created by a spatial filtering with a complex wavelet ψ. We now show that phase collapses improve linear classification of classes that are invariant to global or local translations.

The training images corresponding to the class label y may be represented as the realizations of a random vector $X_y$. To achieve linear separation, it is sufficient that class means $\mathbb{E}[X_y]$ are separated and within-class variances around these means are small enough (Hastie et al., 2009). The goal of classification is to find a representation of the input images in which these properties hold. To simplify the analysis, we consider the particular case where each class y is invariant to translations. More precisely, each random vector $X_y$ is stationary, which means that its probability distribution is invariant to translations. Equation (2) then implies that the phases of Fourier coefficients of $X_y$ are uniformly distributed in $[0, 2\pi]$, leading to $\mathbb{E}[\hat{X}_y(\omega)] = 0$ for $\omega \neq 0$. The class means $\mathbb{E}[X_y]$ are thus constant images whose pixel values are determined by $\mathbb{E}[\hat{X}_y(0)]$. A linear classifier can then only rely on the average colors of the classes, which are often equal in practice. It thus cannot discriminate such translation-invariant classes.

Eliminating uniform phase fluctuations of non-zero frequencies is thus necessary to create separated class means, which can be achieved with the modulus of the Fourier transform. It is a translation-invariant representation: $|\widehat{x_\tau}| = |\hat{x}|$. This improves linear discriminability of stationary classes, because $\mathbb{E}[|\hat{X}_y|]$ may be different for different y. However, $|\hat{X}_y|$ has a high variance, because the Fourier transform is unstable to small deformations (Bruna and Mallat, 2013).

Fourier modulus descriptors can be improved by using filters ψ that have a localized support in space. Theorem 1 shows that the phase of $X_y \star \psi$ is also uniformly distributed in $[0, 2\pi]$. It results that $\mathbb{E}[X_y \star \psi] = 0$, and $x \star \psi$ still provides no information for linear classification. Applying a modulus similarly computes approximate invariants to small translations: $|x_\tau \star \psi| \approx |x \star \psi|$, with an error bounded by $\sigma |\tau| \|x\|_2$. More generally, these phase collapses compute approximate invariants to deformations which are well approximated by translations over the support of ψ.
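As a concrete illustration of Theorem 1, the following NumPy sketch builds a simple complex Gabor filter, a hypothetical stand-in for the Morlet wavelets used later in the paper rather than the actual filters, and checks that a one-pixel translation acts on $x \star \psi$ approximately as the phase shift $e^{-i\xi \cdot \tau}$, so that $|x \star \psi|$ is nearly unchanged:

```python
# Numerical check of Theorem 1 with a hypothetical Gabor filter (illustration only).
import numpy as np

rng = np.random.default_rng(0)
N = 128
x = rng.standard_normal((N, N))                  # random real "image"

# Complex Gabor filter on the same periodic grid: oscillation at frequency xi,
# Gaussian envelope of spatial width sigma_space (frequency bandwidth ~ 1/sigma_space).
xi = np.array([np.pi / 4, 0.0])
sigma_space = 8.0
u = np.fft.fftfreq(N, d=1.0 / N)                 # signed integer coordinates 0, 1, ..., -1
U, V = np.meshgrid(u, u, indexing="ij")
psi = np.exp(1j * (xi[0] * U + xi[1] * V)) * np.exp(-(U**2 + V**2) / (2 * sigma_space**2))
psi /= np.linalg.norm(psi)                       # ||psi||_2 = 1

def conv(x, h):
    # circular convolution computed in the Fourier domain
    return np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(h))

tau = (1, 0)                                     # small translation
x_tau = np.roll(x, shift=tau, axis=(0, 1))       # x_tau(u) = x(u - tau)

y = conv(x, psi)
y_tau = conv(x_tau, psi)
phase_shift = np.exp(-1j * (xi[0] * tau[0] + xi[1] * tau[1]))

# Both relative errors scale like sigma * |tau|, with sigma the frequency bandwidth.
print(np.abs(y_tau - phase_shift * y).max() / np.abs(y).max())   # phase-shift approximation
print(np.abs(np.abs(y_tau) - np.abs(y)).max() / np.abs(y).max()) # near-invariance of the modulus
```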
This representation improves linear classification by creating different non-zero class means $\mathbb{E}[|X_y \star \psi|]$ while achieving a lower variance than Fourier coefficients, as it is stable to deformations (Bruna and Mallat, 2013).

Image classes are usually not invariant to global translations, because of e.g. centered subjects or the sky located in the topmost part of the image. However, classes are often invariant to local translations, up to an unknown maximum scale. This is captured by the notion of local stationarity, which means that the probability distribution of $X_y$ is nearly invariant to translations smaller than some maximum scale (Priestley, 1965). The above discussion remains valid if $X_y$ is only locally stationary over a domain larger than the support of ψ. The use of so-called windowed absolute spectra $\mathbb{E}[|X_y \star \psi|]$ for locally stationary processes has previously been studied in Tygert et al. (2016).

Real or complex networks. The use of complex numbers is a mathematical abstraction which allows diagonalizing translations, which are then represented by complex phases. It provides a mathematical interpretation of filtering operations performed on real numbers. We show that a real network can still implement complex phase collapses.

In the first layer of a CNN, one can observe that filters are often oscillatory patterns with small supports, where some filters have nearly the same orientation and frequency but with a phase shifted by some α (Krizhevsky et al., 2012). We reproduce in Appendix A a figure from Shang et al. (2016) which evidences this phenomenon. It shows that real filters may be arranged in groups $(\psi_\alpha)_\alpha$ that can be written $\psi_\alpha = \mathrm{Re}(e^{-i\alpha}\psi)$ for a single complex filter ψ and several phases α. A CNN with complex filters is thus a structured real-valued CNN, where several real filters $(\psi_\alpha)_\alpha$ have been regrouped into a single complex filter ψ. This structure simplifies the mathematical interpretation of non-linearities by explicitly defining the phase, which is otherwise a hidden variable relating multiple filter outputs within each layer.

A phase collapse is explicitly computed with a complex wavelet filter and a modulus. It can also be implicitly calculated by real-valued CNNs. Indeed, for any real-valued signal x, we have:
$$|x \star \psi| = \frac{1}{2} \int_{-\pi}^{\pi} \mathrm{ReLU}(x \star \psi_\alpha)\, d\alpha. \tag{4}$$
Furthermore, this integral is well approximated by a sum over 4 phases, which makes it possible to compute complex moduli with real-valued filters and ReLUs without biases. See Appendix D for a proof of eq. (4) and its approximation.

3 LEARNED SCATTERING NETWORK WITH PHASE COLLAPSES

This section introduces a learned scattering transform, which is a highly structured CNN architecture relying on phase collapses and reaching ResNet accuracy on the ImageNet (Russakovsky et al., 2015) and CIFAR-10 (Krizhevsky, 2009) datasets.

Scattering transform. Theorem 1 proves that a modulus applied to the output of a complex wavelet filter produces a locally invariant descriptor. This descriptor can then be subsampled, depending upon the filter's bandwidth. We briefly review the scattering transform (Mallat, 2012; Bruna and Mallat, 2013), which iterates phase collapses. A scattering transform over J scales is implemented with a network of depth J, whose filters are specified by the choice of wavelet. Let $x_0 = x$. For $0 \leq j < J$, the (j+1)-th layer $x_{j+1}$ is computed by applying a phase collapse on the j-th layer $x_j$. It is implemented by a modulus which collapses the phases created by a wavelet filtering operator W:
$$x_{j+1} = |W x_j|. \tag{5}$$
The operator W is defined with Morlet filters (Bruna and Mallat, 2013). It has one low-pass filter $g_0$, and L zero-mean complex band-pass filters $(g_\ell)_\ell$, having an angular direction $\ell\pi/L$ for $0 < \ell \leq L$. It thus transforms an input image x(u) into L + 1 sub-band images which are subsampled by 2:
$$Wx(u, \ell) = x \star g_\ell(2u). \tag{6}$$
The cascade of j low-pass filters $g_0$ with a final band-pass filter $g_\ell$, each followed by a subsampling, computes wavelet coefficients at a scale $2^j$. One can also modify the wavelet filtering W to compute intermediate scales $2^{j/2}$, as explained in Appendix G. The spatial subsampling is then only computed every other layer, and the depth of the network becomes twice larger.

Applying a linear classifier on such a scattering transform gives good results on simple classification problems such as MNIST (LeCun et al., 2010). However, results are well below ResNet accuracy on CIFAR-10 and ImageNet, as shown in Table 1.

Learned Scattering. The prior work of Zarka et al. (2021) showed that a scattering transform can reach ResNet accuracy by incorporating learned 1×1 convolutional operators and soft-thresholding non-linearities in-between wavelet filters. In contrast, we introduce a Learned Scattering architecture whose sole non-linearity is a phase collapse. It shows that neither biases nor thresholdings are necessary to reach a high accuracy in image classification. A similar result had previously been obtained on image denoising (Mohan et al., 2019).

The Learned Scattering (LScat) network inserts in eq. (5) a learned complex 1×1 convolutional operator $P_j$ which reduces the channel dimensionality of each layer $x_j$ before each phase collapse:
$$x_{j+1} = |W P_j x_j|. \tag{7}$$
Similar architectures which separate space-mixing and channel-mixing operators had previously been studied in the context of basis expansion (Qiu et al., 2018; Ulicny et al., 2019) or to filter scattering channels (Cotter and Kingsbury, 2019). This separation is also a major feature of recent architectures such as Vision Transformers (Dosovitskiy et al., 2021) or MLP-Mixer (Tolstikhin et al., 2021). Each $P_j$ computes discriminative channels whose spatial variability is eliminated by the phase collapse operator. Their role is further discussed in Section 5.

Table 1: Error of linear classifiers applied to a scattering (Scat), learned scattering (LScat) and learned scattering with skip connections (+ skip), on CIFAR-10 and ImageNet. The last column gives the single-crop error of ResNet-20 for CIFAR-10 and ResNet-18 for ImageNet, taken from https://pytorch.org/vision/stable/models.html.

| | Scat | LScat | LScat + skip | ResNet |
|---|---|---|---|---|
| CIFAR-10, Top-1 error (%) | 27.7 | 11.7 | 7.7 | 8.8 |
| ImageNet, Top-5 error (%) | 54.1 | 15.2 | 11.0 | 10.9 |
| ImageNet, Top-1 error (%) | 73.0 | 35.9 | 30.1 | 30.2 |

Table 1 gives the accuracy of a linear classifier applied to the last layer of this Learned Scattering. It provides an important improvement over a scattering transform, but it does not yet reach the accuracy of ResNet-18. Including the linear classifier, the architecture uses a total number of layers J + 1 = 12 for ImageNet and J + 1 = 9 for CIFAR, by introducing intermediate scales. The number of channels of $P_j x_j$ is the same as in a standard ResNet architecture (He et al., 2016) and remains no larger than 512. More details are provided in Appendix G.
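To make the structure of eq. (7) concrete, here is a minimal PyTorch sketch of one phase-collapse layer. It is a simplified illustration rather than the authors' implementation: complex feature maps are stored as (real, imaginary) pairs so that only standard real convolutions are needed, and the wavelet filters are random placeholders standing in for the Morlet filter bank W described above (built with Kymatio in the released code).

```python
# Minimal sketch of one Learned Scattering layer, eq. (7): x_{j+1} = |W P_j x_j|.
import torch
import torch.nn.functional as F

def complex_conv1x1(x_re, x_im, w_re, w_im):
    """Learned complex 1x1 convolution P_j, acting only across channels."""
    y_re = F.conv2d(x_re, w_re) - F.conv2d(x_im, w_im)
    y_im = F.conv2d(x_re, w_im) + F.conv2d(x_im, w_re)
    return y_re, y_im

def wavelet_phase_collapse(x_re, x_im, g_re, g_im, stride=2):
    """Filter every channel with each complex wavelet g_l, subsample, take the modulus."""
    c = x_re.shape[1]
    # depthwise filtering: each input channel produces (L + 1) sub-bands
    w_re = g_re.repeat(c, 1, 1, 1)    # shape (c * (L + 1), 1, k, k)
    w_im = g_im.repeat(c, 1, 1, 1)
    pad = g_re.shape[-1] // 2
    y_re = (F.conv2d(x_re, w_re, padding=pad, stride=stride, groups=c)
            - F.conv2d(x_im, w_im, padding=pad, stride=stride, groups=c))
    y_im = (F.conv2d(x_re, w_im, padding=pad, stride=stride, groups=c)
            + F.conv2d(x_im, w_re, padding=pad, stride=stride, groups=c))
    return torch.sqrt(y_re ** 2 + y_im ** 2)      # phase collapse: |W P_j x_j|

# Toy sizes: c_in input channels, c_out channels after P_j, L + 1 = 5 random "wavelets".
B, c_in, c_out, L, k = 2, 16, 8, 4, 7
x_re, x_im = torch.randn(B, c_in, 32, 32), torch.zeros(B, c_in, 32, 32)
P_re, P_im = torch.randn(c_out, c_in, 1, 1), torch.randn(c_out, c_in, 1, 1)
g_re, g_im = torch.randn(L + 1, 1, k, k), torch.randn(L + 1, 1, k, k)

y_re, y_im = complex_conv1x1(x_re, x_im, P_re, P_im)
x_next = wavelet_phase_collapse(y_re, y_im, g_re, g_im)
print(x_next.shape)   # (B, c_out * (L + 1), 16, 16): real, non-negative coefficients
```

With the skip-connection of eq. (8) below, the raw coefficients (y_re, y_im) would simply be concatenated with x_next along the channel dimension.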
Skip-connections across moduli. Equation (7) imposes that all phases are collapsed at each layer, after computing a wavelet transform. More flexibility is provided by adding a skip-connection which concatenates $W P_j x_j$ with its modulus:
$$x_{j+1} = \big[\, |W P_j x_j|\, ,\, W P_j x_j \,\big]. \tag{8}$$
The skip-connection produces a cascade of convolutional filters W without non-linearities in-between. The resulting convolutional operator $W W \cdots W$ is a wavelet packet transform which generalizes the wavelet transform (Coifman and Wickerhauser, 1992). Wavelet packets are obtained as the cascade of low-pass and band-pass filters $(g_\ell)_\ell$, each followed by a subsampling. Besides wavelets, wavelet packets include filters having a larger spatial support and a narrower Fourier bandwidth. A wavelet packet transform is then similar to a local Fourier transform. Applying a modulus on such wavelet packet coefficients defines local spatial invariants over larger domains. As discussed in Section 2, image classes are usually invariant to local rather than global translations.

Section 2 explains that a phase collapse improves discriminability for image classes that are locally translation-invariant over the filter's support. Indeed, phases of wavelet coefficients are then uniformly distributed over $[0, 2\pi]$, yielding zero-mean coefficients for all classes. At scales where there is no local translation-invariance, these phases are no longer uniformly distributed, and they encode information about the spatial localization of features. Introducing a skip-connection provides the flexibility to choose whether to eliminate phases at different scales or to propagate them up to the last layer. Indeed, the next 1×1 operator $P_{j+1}$ linearly combines $|W P_j x_j|$ and $W P_j x_j$ and may learn to use only one of these. This adds some localization information, which appears to be important.

Table 1 shows that the skip-connection indeed improves classification accuracy. A linear classifier on this Learned Scattering reaches ResNet-18 accuracy on CIFAR-10 and ImageNet. It demonstrates that collapsing appropriate phases is sufficient to obtain a high accuracy on large-scale classification problems. Learning is reduced to 1×1 convolutions $(P_j)_j$ across channels.

4 PHASE COLLAPSES VERSUS AMPLITUDE REDUCTIONS

We now compare phase collapses with amplitude reductions, which are non-linearities that preserve the phase and act on the amplitude. We show that the accuracy of a Learned Scattering network is considerably reduced when the phase collapse modulus is replaced by soft-thresholdings with learned biases. This result remains true for other amplitude reductions and architectures.

Thresholding and sparsity. A complex soft-thresholding reduces the amplitude of its input $z = |z| e^{i\varphi}$ by b while preserving the phase: $\rho_b(z) = \mathrm{ReLU}(|z| - b)\, e^{i\varphi}$. Similarly to its real counterpart, it is obtained as the proximal operator of the complex modulus (Yang et al., 2012):
$$\rho_b(z) = \arg\min_{w \in \mathbb{C}}\ b|w| + \tfrac{1}{2}|w - z|^2. \tag{9}$$
Soft-thresholdings and moduli have opposite properties, since soft-thresholdings preserve the phase while attenuating the amplitude, whereas moduli preserve the amplitude while eliminating the phase. In contrast, ReLUs with biases are more general non-linearities which can act both on phase and amplitude. This is best illustrated over $\mathbb{R}$, where the phase is replaced by the sign, through the even-odd decomposition. If $z \in \mathbb{R}$ and $b \geq 0$, then the even part of $\mathrm{ReLU}(z - b)$ is $\mathrm{ReLU}(|z| - b)$, which is an absolute value with a dead-zone $[-b, b]$. When b = 0, it becomes an absolute value $|z|$. The odd part is a soft-thresholding $\rho_b(z) = \mathrm{sign}(z)\, \mathrm{ReLU}(|z| - b)$. Over $\mathbb{C}$, a similar result can be obtained through the decomposition into phase harmonics (Mallat et al., 2019).
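These identities are easy to check numerically. The following sketch (a simple illustration, not code from the paper) verifies the even/odd decomposition of a biased ReLU, up to the usual factor 1/2 of the decomposition, and contrasts the complex modulus with the complex soft-thresholding of eq. (9):

```python
# Even/odd decomposition of a biased ReLU, and modulus vs. complex soft-thresholding.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)
b = 0.3

# Real case: dead-zone absolute value (sign collapse) and soft-thresholding (sign-preserving).
z = rng.standard_normal(10_000)
dead_zone_abs = relu(np.abs(z) - b)                  # even term, dead-zone [-b, b]
soft_thresh = np.sign(z) * relu(np.abs(z) - b)       # odd term
# ReLU(z - b) is half the sum of the even and odd terms above.
assert np.allclose(relu(z - b), 0.5 * (dead_zone_abs + soft_thresh))

# Complex case: the modulus collapses the phase and keeps the amplitude, while the
# complex soft-thresholding of eq. (9) keeps the phase and shrinks the amplitude.
w = rng.standard_normal(10_000) + 1j * rng.standard_normal(10_000)
modulus = np.abs(w)                                          # phase collapse
rho_b = relu(np.abs(w) - b) * np.exp(1j * np.angle(w))       # amplitude reduction
kept = np.abs(w) > b
assert np.allclose(np.angle(rho_b[kept]), np.angle(w[kept])) # phase is preserved
```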
We have explained how phase collapses can improve the classification accuracy of locally stationary processes by separating class means $\mathbb{E}[|X_y \star \psi|]$. In contrast, since the phase of $X_y \star \psi$ is uniformly distributed for such processes, so is the phase of $\rho_b(X_y \star \psi)$. This implies that $\mathbb{E}[\rho_b(X_y \star \psi)] = 0$ for all b. Class means of locally stationary processes are thus not separated by a thresholding.

When class means $\mathbb{E}[X_y \star \psi]$ are separated, a soft-thresholding of $X_y \star \psi$ may however improve classification accuracy. If $X_y \star \psi$ is sparse, then a soft-thresholding $\rho_b(X_y \star \psi)$ reduces the within-class variance (Donoho and Johnstone, 1994; Zarka et al., 2021). Coefficients below the threshold may be assimilated to unnecessary clutter which is set to 0. To improve classification, convolutional filters must then produce high-amplitude coefficients corresponding to discriminative features.

Phase collapses versus amplitude reductions. A Learned Scattering with phase collapses preserves the amplitudes of wavelet coefficients and eliminates their phases. On the contrary, one may use a non-linearity which preserves the phases of wavelet coefficients but attenuates their amplitudes, such as a soft-thresholding. We show that such non-linearities considerably degrade the classification accuracy compared to phase collapses.

Several previous works made the hypothesis that sparsifying neural responses with thresholdings is a major mechanism for improving classification accuracy (Sun et al., 2018; Sulam et al., 2018; 2019; Mahdizadehaghdam et al., 2019; Zarka et al., 2020; 2021). The dimensionality of sparse representations can then be reduced with random filters which implement a form of compressed sensing (Donoho, 2006; Candes et al., 2006). The interpretation of CNNs as compressed sensing machines with random filters has been studied (Giryes et al., 2015), but it never led to classification results close to e.g. ResNet accuracy.

To test this hypothesis, we replace the modulus non-linearity in the Learned Scattering architecture with thresholdings, or more general phase-preserving non-linearities. A Learned Amplitude Reduction Scattering applies a non-linearity ρ(z) which preserves the phases of wavelet coefficients $z = |z| e^{i\varphi}$: $\rho(z) = e^{i\varphi}\, \rho(|z|)$. Without skip-connections, each layer $x_{j+1}$ is computed from $x_j$ by:
$$x_{j+1} = \rho(W P_j x_j), \tag{10}$$
and with skip-connections:
$$x_{j+1} = \big[\, \rho(W P_j x_j)\, ,\, W P_j x_j \,\big]. \tag{11}$$
A soft-thresholding is defined by $\rho(|z|) = \mathrm{ReLU}(|z| - b)$ for some threshold b. We also define an amplitude hyperbolic tangent $\rho(|z|) = (e^{|z|} - e^{-|z|})/(e^{|z|} + e^{-|z|})$, an amplitude sigmoid $\rho(|z|) = (1 + e^{-a \log|z| - b})^{-1}$, and an amplitude soft-sign $\rho(|z|) = |z|/(1 + |z|)$. The soft-thresholding and sigmoid parameters a and b are learned for each layer and each channel.
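The following sketch implements these phase-preserving non-linearities $\rho(z) = e^{i\varphi}\rho(|z|)$ as they are defined above. It is a simplified reading, not the authors' code; in the Learned Amplitude Reduction Scattering, the parameters a and b would be learned per layer and per channel.

```python
# Phase-preserving amplitude reductions rho(z) = e^{i phi} rho(|z|), stored as (re, im) pairs.
import torch

def amplitude_reduction(x_re, x_im, kind, a=1.0, b=0.0, eps=1e-6):
    amp = torch.sqrt(x_re ** 2 + x_im ** 2)
    if kind == "thresh":          # amplitude soft-thresholding ReLU(|z| - b)
        new_amp = torch.relu(amp - b)
    elif kind == "tanh":          # amplitude hyperbolic tangent
        new_amp = torch.tanh(amp)
    elif kind == "sigmoid":       # amplitude sigmoid (1 + exp(-a log|z| - b))^(-1)
        new_amp = torch.sigmoid(a * torch.log(amp + eps) + b)
    elif kind == "softsign":      # amplitude soft-sign |z| / (1 + |z|)
        new_amp = amp / (1.0 + amp)
    else:
        raise ValueError(kind)
    scale = new_amp / (amp + eps)          # rescale the amplitude, keep the phase
    return x_re * scale, x_im * scale

x_re, x_im = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)
y_re, y_im = amplitude_reduction(x_re, x_im, "thresh", b=0.1)
```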
We evaluate the classification performance of a Learned Amplitude Reduction Scattering on CIFAR-10, by applying a linear classifier on the last layer. Classification results are given in Table 2 for different amplitude reductions, with or without skip-connections. Learned Amplitude Reduction Scatterings yield much larger errors than a Learned Scattering with phase collapses. Without skip-connections, they are even above a scattering transform, which also uses phase collapses but does not have learned 1×1 convolutional projections $(P_j)_j$. It demonstrates that high accuracies result from phase collapses without biases, as opposed to amplitude reduction operators including thresholdings, which learn bias parameters. Similar experiments in the real domain with a standard ResNet-18 architecture on the ImageNet dataset can be found in Appendix B.

Table 2: Top-1 error (in %) on CIFAR-10 with a linear classifier applied to a Scattering network (Scat) and several Learned Scattering networks (LScat) with several non-linearities. They include a modulus (Mod), an amplitude soft-thresholding (AThresh), an amplitude hyperbolic tangent (ATanh), an amplitude sigmoid (ASigmoid), and an amplitude soft-sign (ASign).

| | Scat | LScat Mod | LScat AThresh | LScat ATanh | LScat ASigmoid | LScat ASign |
|---|---|---|---|---|---|---|
| Without skip | 27.7 | 11.7 | 36.7 | 40.7 | 38.5 | 39.9 |
| With skip | – | 7.7 | 22.5 | 19.2 | 17.0 | 19.5 |

ReLUs with biases. Most CNNs, including ResNets, use ReLUs with biases. A ReLU with bias simultaneously affects the sign and the amplitude of its real input. Over complex numbers, it amounts to transforming the phase and the amplitude. These numerical experiments show that accuracy improvements result from acting on the sign or phase rather than the amplitude. Furthermore, this can be constrained to collapsing the phase of wavelet coefficients while preserving their amplitude.

Several CNN architectures have demonstrated a good classification accuracy with iterated thresholding algorithms, which increase sparsity. However, all these architectures also modified the sign of coefficients by computing non-negative sparse codes (Sun et al., 2018; Sulam et al., 2018; Mahdizadehaghdam et al., 2019) or with additional ReLU or modulus layers (Zarka et al., 2020; 2021). It seems that it is the sign or phase collapse of these non-linearities which is responsible for good classification accuracies, as opposed to the calculation of sparse codes through iterated amplitude reductions.

5 ITERATING PHASE COLLAPSES AND AMPLITUDE REDUCTIONS

We now provide a theoretical justification of the above numerical results in simplified mathematical frameworks. This section studies the behavior of phase collapses and amplitude reductions when they are iterated over several layers. It shows that phase collapses benefit from iterations over multiple layers, whereas there is no significant gain in performance when iterating amplitude reductions.

5.1 ITERATED PHASE COLLAPSES

We explain the role of iterated phase collapses with multiple filters at each layer. Classification accuracy is improved through the creation of additional dimensions to separate class means. The learned projectors $(P_j)_j$ are optimized for this separation.

We consider the classification of stationary processes $X_y \in \mathbb{R}^d$, corresponding to different image classes indexed by y. Given a realization x of $X_y$, and because of stationarity, the optimal linear classifier is calculated from the empirical mean $\frac{1}{d}\sum_u x(u)$. It computes an optimal linear estimation of $\mathbb{E}[X_y(u)] = \mu_y$. If all classes have the same mean $\mu_y = \mu$, then all linear classifiers fail.

As explained in Section 2, linear classification can be improved by computing $(|x \star \psi_k|)_k$ for some wavelet filters $(\psi_k)_k$. These phase collapses create additional directions with non-zero means which may separate the classes. If $X_y$ is stationary, then $|X_y \star \psi_k|$ remains stationary for any $\psi_k$.
An optimal linear classifier applied to $(|x \star \psi_k(u)|)_k$ is thus obtained by a linear combination of all empirical means $\big(\frac{1}{d}\sum_u |x \star \psi_k(u)|\big)_k$. They are proportional to the $\ell^1$ norm $\|x \star \psi_k\|_1$, which is a measure of the sparsity of $x \star \psi_k$.

If linear classification on $(|x \star \psi_k(u)|)_k$ fails, it reveals that the means $\mathbb{E}[|X_y \star \psi_k(u)|] = \mu_{y,k}$ are not sufficiently different. Separation can be improved by considering the spatial variations of $|X_y \star \psi_k(u)|$ for different y. These variations can be revealed by a phase collapse on a new set of wavelet filters $\psi_{k'}$, which computes $\big(\big||x \star \psi_k| \star \psi_{k'}\big|\big)_{k,k'}$. This phase collapse iteration is the principle used by scattering transforms to discriminate textures (Bruna and Mallat, 2013; Sifre and Mallat, 2013): each successive phase collapse creates additional directions to separate class means.

However, this may still not be sufficient to separate class means. More discriminant statistical properties may be obtained by linearly combining $(|x \star \psi_k|)_k$ across k before applying a new filter $\psi_{k'}$. In a Learned Scattering with phase collapse, this is done with a linear projector $P_1$ across the channel indices k, before computing a convolution with the next filter $\psi_{k'}$. The 1×1 operator $P_1$ is optimized to improve the linear classification accuracy. It amounts to learning weights $w_k$ such that $\mathbb{E}\big[\big|\big(\sum_k w_k |X_y \star \psi_k|\big) \star \psi_{k'}\big|\big]$ is as different as possible for different y. Because these are proportional to the $\ell^1$ norms $\big\|\big(\sum_k w_k |x \star \psi_k|\big) \star \psi_{k'}\big\|_1$, it means that the images $\big(\sum_k w_k |x \star \psi_k|\big) \star \psi_{k'}$ have different sparsity levels depending upon the class y of x. The weights $(w_k)_k$ of $P_1$ can thus be interpreted as features along channels providing different sparsifications for different classes. A Learned Scattering network learns such $P_j$ at each scale j.

5.2 ITERATED AMPLITUDE REDUCTIONS

Sparse representations and amplitude reduction algorithms may improve linear classification by reducing the variance of class mean estimations, which can be interpreted as clutter removal. Such approaches are studied in Zarka et al. (2021) by modeling the clutter as an additive white noise. Although a single thresholding step may improve linear classification, we show that iterating more than one thresholding does not improve the classification accuracy, if no phase collapses are inserted.

To understand these properties, we consider the discrimination of classes $X_y$ for which class means $\mathbb{E}[X_y] = \mu_y$ are all different. If there exists y′ such that $\mu_{y'} - \mu_y$ is small, then the class y can still be discriminated from y′ if we can estimate $\mathbb{E}[X_y]$ sufficiently accurately from a single realization x of $X_y$. This is a mean estimation problem.

Suppose that $X_y = \mu_y + N(0, \sigma^2)$ is contaminated with Gaussian white noise, where the noise models some clutter. Suppose also that there exists a linear orthogonal operator D such that $D\mu_y$ is sparse for every y, and hence has its energy concentrated in few non-zero coefficients. Such a D may be computed by minimizing the expected $\ell^1$ norm $\sum_y \mathbb{E}[\|D X_y\|_1]$. The estimation of $\mu_y$ can be improved with a soft-thresholding estimator (Donoho and Johnstone, 1994), which sets to zero all coefficients below a threshold b proportional to σ. It amounts to computing $\rho_b(Dx)$, where $\rho_b$ is a soft-thresholding.

However, we explain below why this approach cannot be further iterated without inserting phase collapses. The reason is that a sparse representation $\rho_b(Dx)$ concentrates its entropy in the phases of the coefficients, rather than their amplitude.
We then show that such processes cannot be further sparsified, which means that a second thresholding $\rho_{b'}(D' \rho_b(Dx))$ will not further reduce the variance of class mean estimators. This entails that a model of within-class variability relying on amplitude reductions cannot be the sole mechanism behind the performance of deep networks.

Iterating amplitude reductions may however be useful if it is alternated with another non-linearity which partly or fully collapses phases. Reducing the entropy of the phases of $\rho_b(Dx)$ allows $\rho_{b'} D'$ to further sparsify the process and hence further reduce the within-class variability. As mentioned in Section 4, this is the case for previous work which used iterated sparsification operators (Sun et al., 2018; Sulam et al., 2018; Mahdizadehaghdam et al., 2019). Indeed, these networks compute non-negative sparse codes where sparsity is enforced with a ReLU, which acts both on phases and amplitudes. Our results show that the benefit of iterating non-negative sparse coding comes from the sign collapse due to the non-negativity constraint.

We now qualitatively demonstrate these claims with two theorems. We first show that finding the sparsest representation of a random process (i.e., minimizing its $\ell^1$ norm) is the same as maximizing a lower bound on the entropy of its phases.

Theorem 2. Let X denote a random vector in $\mathbb{C}^d$ with a probability density p. Let H(X) be the entropy of X with respect to the Lebesgue measure: $H(X) = -\int p(x) \log p(x)\, dx$. If $D \in U(d)$ is a unitary operator then:
$$H\big(\varphi(DX) \,\big|\, |DX|\big) \geq H(X) - d - 2d \log\Big(\frac{1}{d}\, \mathbb{E}[\|DX\|_1]\Big),$$
where $\varphi(DX) \in [0, 2\pi]^d$ (resp. $|DX| \in \mathbb{R}_+^d$) is the random process of the entry-wise phases (resp. moduli) of DX.

The proof is in Appendix E. This theorem gives a lower bound on the conditional entropy of the phases of DX with a decreasing function of the expected $\ell^1$ norm of DX. Minimizing this expected $\ell^1$ norm over D amounts to maximizing the lower bound on $H(\varphi(DX) \mid |DX|)$. An extreme situation arises when this entropy reaches its maximal value of $d \log(2\pi)$. In this case, the phase $\varphi(DX)$ has a maximum-entropy distribution and is therefore uniformly distributed in $[0, 2\pi]^d$. Moreover, in this extreme case $\varphi(DX)$ is independent from $|DX|$, since its conditional distribution does not depend on $|DX|$. Such statistical properties have previously been observed on wavelet coefficients of natural images (Rao et al., 2001), where the wavelet transform seems to be a nearly optimal sparsifying unitary dictionary.

The second theorem considers the extreme case of a random process whose phases are conditionally independent and uniform. It proves that such a process cannot be significantly sparsified with a change of basis.

Theorem 3. Assume that $\varphi(\rho_b(DX))$ is uniformly distributed in $[0, 2\pi]^d$ and independent from $|\rho_b(DX)|$. Then there exists a constant $C_d > 0$ which depends on the dimension d, such that for any $D' \in U(d)$,
$$\mathbb{E}\big[\|D' \rho_b(DX)\|_1\big] \geq C_d\, \mathbb{E}\big[\|\rho_b(DX)\|_1\big].$$

The proof is in Appendix F. This theorem shows that random processes with conditionally independent and uniform phases have an $\ell^1$ norm which cannot be significantly decreased by any unitary transformation. Numerical evaluations suggest that the constant $C_d$ may be chosen to be $\sqrt{\pi}/2 \approx 0.886$, independently of the dimension d. This constant arises as the value of $\mathbb{E}[|Z|]$ when Z is a complex normal random variable with $\mathbb{E}[|Z|^2] = 1$.
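A small numerical experiment makes Theorem 3 and this constant concrete (an illustration under the stated assumptions, not an experiment from the paper): a vector with i.i.d. uniform phases keeps essentially the same $\ell^1$ norm under any unitary change of basis, and the lower bound $\sqrt{\pi}/2$ is approached when its moduli are flat.

```python
# Illustration of Theorem 3: with i.i.d. uniform phases, the l1 norm cannot be
# significantly decreased by a unitary change of basis.
import numpy as np

rng = np.random.default_rng(0)
d, n_trials = 256, 200

# Random unitary D' from the QR decomposition of a complex Gaussian matrix.
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
D_prime, _ = np.linalg.qr(A)

def l1_ratio(moduli):
    """Average of ||D' x||_1 / ||x||_1 over random uniform phases, for fixed moduli."""
    ratios = []
    for _ in range(n_trials):
        x = moduli * np.exp(1j * rng.uniform(0.0, 2 * np.pi, size=d))
        ratios.append(np.abs(D_prime @ x).sum() / np.abs(x).sum())
    return np.mean(ratios)

print(l1_ratio(np.ones(d)))                 # ~ 0.886 = sqrt(pi)/2 for flat moduli
print(l1_ratio(rng.exponential(size=d)))    # larger: a sparser vector becomes less sparse
```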
These two theorems explain qualitatively that linear classification on $\rho_b(Dx)$ cannot be improved by another thresholding that would take advantage of another sparsification operator. Indeed, Theorem 2 shows that if $\rho_b(Dx)$ is sparse, then its phases have random fluctuations of high entropy. Theorem 3 indicates that such random phases prevent a further sparsification of $\rho_b(Dx)$ with some linear operator D′. Applying a second thresholding $\rho_{b'}(D' \rho_b(Dx))$ thus cannot significantly reduce the variance of class mean estimators.

6 CONCLUSION

This paper studies the improvement of linear separability for image classification in deep convolutional networks. We show that it mostly relies on a phase collapse phenomenon. Eliminating the phase of wavelet coefficients improves the separation of class means. We introduced a Learned Scattering network with wavelet phase collapses and learned 1×1 convolutional filters $(P_j)_j$, which reaches ResNet accuracy. The learned 1×1 operators $(P_j)_j$ enhance discriminability by computing channels that have different levels of sparsity for different classes.

When class means are separated, thresholding non-linearities can improve classification by reducing the variance of class mean estimators. When used alone, their classification performance is poor over complex datasets such as ImageNet or CIFAR-10, because class means are not sufficiently separated. Furthermore, the iteration of thresholdings on sparsification operators requires intermediary phase collapses.

These results show that linear separation of classes results from acting on the sign or phase of network coefficients rather than their amplitude. Furthermore, this can be constrained to collapsing the phase of wavelet coefficients while preserving their amplitude. The elimination of spatial variability with phase collapses is thus both necessary and sufficient to linearly separate classes on complex image datasets.

REPRODUCIBILITY STATEMENT

The code to reproduce the experiments of the paper is available at https://github.com/FlorentinGuth/PhaseCollapse. All experimental details and hyperparameters are also provided in Appendix G.

ACKNOWLEDGMENTS

This work was supported by a grant from the PRAIRIE 3IA Institute of the French ANR-19-P3IA-0001 program. We would like to thank the Scientific Computing Core at the Flatiron Institute for the use of their computing resources. We also thank Antoine Brochard, Brice Ménard and Rudy Morel for helpful comments.

REFERENCES

E. Oyallon. Analyzing and Introducing Structures in Deep Convolutional Neural Networks. Theses, Paris Sciences et Lettres, October 2017.
V. Papyan, X. Y. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 2020.
X. Sun, N. M. Nasrabadi, and T. D. Tran. Supervised deep sparse coding networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 346-350, 2018.
J. Sulam, V. Papyan, Y. Romano, and M. Elad. Multilayer convolutional sparse modeling: Pursuit and dictionary learning. IEEE Transactions on Signal Processing, 66(15):4090-4104, 2018.
J. Sulam, A. Aberdam, A. Beck, and M. Elad. On multi-layer basis pursuit, efficient algorithms and convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
S. Mahdizadehaghdam, A. Panahi, H. Krim, and L. Dai. Deep dictionary learning: A parametric network approach. IEEE Transactions on Image Processing, 28(10):4790-4802, Oct 2019.
J. Zarka, L. Thiry, T. Angles, and S. Mallat. Deep network classification by scattering and homotopy dictionary learning. In International Conference on Learning Representations, ICLR, 2020.
J. Zarka, F. Guth, and S. Mallat. Separation and concentration in deep networks. In International Conference on Learning Representations, ICLR, 2021.
T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2nd edition, 2009.
J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1872-1886, 2013.
M. B. Priestley. Evolutionary spectra and non-stationary processes. Journal of the Royal Statistical Society: Series B (Methodological), 27(2):204-229, 1965. doi: https://doi.org/10.1111/j.2517-6161.1965.tb01488.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1965.tb01488.x.
M. Tygert, J. Bruna, S. Chintala, Y. LeCun, S. Piantino, and A. Szlam. A mathematical motivation for complex-valued convolutional networks. Neural Computation, 28(5):815-825, 2016.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, NeurIPS, pages 1097-1105, 2012.
W. Shang, K. Sohn, D. Almeida, and H. Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units, 2016.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.
Y. LeCun, C. Cortes, and C. J. Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
S. Mohan, Z. Kadkhodaie, E. P. Simoncelli, and C. Fernandez-Granda. Robust and interpretable blind image denoising via bias-free convolutional neural networks. In International Conference on Learning Representations, 2019.
Q. Qiu, X. Cheng, R. Calderbank, and G. Sapiro. DCFNet: Deep neural network with decomposed convolutional filters. International Conference on Machine Learning, 2018.
M. Ulicny, V. Krylov, and R. Dahyot. Harmonic networks for image classification. In Proceedings of the British Machine Vision Conference, Sep. 2019.
F. Cotter and N. G. Kingsbury. A learnable scatternet: Locally invariant convolutional layers. In 2019 IEEE International Conference on Image Processing, ICIP, pages 350-354. IEEE, 2019.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713-718, 1992. doi: 10.1109/18.119732.
Z. Yang, C. Zhang, and L. Xie. On phase transition of compressed sensing in the complex domain. IEEE Signal Processing Letters, 19(1):47-50, Jan 2012. ISSN 1558-2361. doi: 10.1109/lsp.2011.2177496. URL http://dx.doi.org/10.1109/LSP.2011.2177496.
S. Mallat, S. Zhang, and G. Rochette. Phase harmonic correlations and convolutional neural networks. Information and Inference: A Journal of the IMA, 9(3):721-747, 11 2019.
D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425-455, 09 1994.
D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207-1223, 2006.
R. Giryes, G. Sapiro, and A. M. Bronstein. Deep neural networks with random gaussian weights: A universal classification strategy? CoRR, abs/1504.08291, 2015. URL http://arxiv.org/abs/1504.08291.
L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1233-1240, 2013.
R. Rao, B. Olshausen, M. Lewicki, M. Wainwright, O. Schwartz, and E. P. Simoncelli. Natural image statistics and divisive normalization: Modeling nonlinearities and adaptation in cortical neurons. Statistical Theories of the Brain, 01 2001.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, pages 448-456, 2015.
M. Andreux, T. Angles, G. Exarchakis, R. Leonarduzzi, G. Rochette, L. Thiry, J. Zarka, S. Mallat, J. Andén, E. Belilovsky, J. Bruna, V. Lostanlen, M. J. Hirn, E. Oyallon, S. Zhang, C. E. Cella, and M. Eickenberg. Kymatio: Scattering transforms in Python. Journal of Machine Learning Research, 21(60):1-6, 2020.

A PAIRED ALEXNET FILTERS

Section 2 explains that real networks can still implement phase collapses. This is done with several real filters $\psi_\alpha = \mathrm{Re}(e^{-i\alpha}\psi)$ which correspond to several phases α of the same complex filter ψ. Shang et al. (2016) showed that the filters in e.g. the first layer of AlexNet (Krizhevsky et al., 2012) can indeed be grouped in such a way. For the sake of completeness, we reproduce in Figure 2 a figure from Shang et al. (2016). This suggests that real-valued networks may indeed implement phase collapses using eq. (4).

Figure 2: First-layer filters from AlexNet (Krizhevsky et al., 2012). They have been paired so that they approximately correspond to two different phases of the same complex filter ψ. Figure reproduced from Shang et al. (2016).

B PHASE COLLAPSE VERSUS AMPLITUDE REDUCTION WITH RESNET

We now evaluate the classification error of phase collapses and amplitude reduction non-linearities in the real domain. We use a standard ResNet-18 architecture without biases.
We replace the ReLU non-linearity by an absolute value, or sign collapse, $|x|$, and several sign-preserving (i.e., odd) non-linearities. They include a soft-thresholding $\rho_b(x) = \mathrm{sign}(x)\, \mathrm{ReLU}(|x| - b)$, a hyperbolic tangent $\rho(x) = (e^x - e^{-x})/(e^x + e^{-x})$, and a soft-sign $\rho(x) = x/(1 + |x|)$. We do not report results for an amplitude sigmoid $\rho(x) = \mathrm{sign}(x)\, (1 + e^{-a \log|x| - b})^{-1}$ because of optimization instabilities when learning the parameters a and b.

Classification results on the ImageNet dataset are given in Table 3. The errors of bias-free ReLUs and sign collapses are comparable to a standard ResNet-18, and confirm that sign collapses are sufficient to reach such accuracies. In contrast, the performance of amplitude reduction non-linearities, which preserve the sign of network coefficients, is significantly worse. The conclusions of Section 4 thus still hold in the real domain and when the spatial filters are not constrained to be wavelets.

Table 3: Classification errors on ImageNet of bias-free ResNet-18 (BFResNet) architectures with several non-linearities. They include a ReLU, an absolute value which performs sign collapses (Abs), a soft-thresholding (Thresh), a hyperbolic tangent (Tanh), and a soft-sign (Sign). They are compared to the original ResNet-18 architecture, which uses a ReLU and learns biases.

| | ResNet | BFResNet ReLU | BFResNet Abs | BFResNet Thresh | BFResNet Tanh | BFResNet Sign |
|---|---|---|---|---|---|---|
| Top-5 error (%) | 10.9 | 12.3 | 13.9 | 25.7 | 22.4 | 24.2 |
| Top-1 error (%) | 30.2 | 32.6 | 35.3 | 50.0 | 44.6 | 49.3 |

C PROOF OF THEOREM 1

We have:
$$\|x_\tau \star \psi - e^{-i\xi \cdot \tau}(x \star \psi)\|_\infty = \|x \star (\psi_\tau - e^{-i\xi \cdot \tau}\psi)\|_\infty \qquad \text{by covariance of convolution,}$$
$$\leq \|\psi_\tau - e^{-i\xi \cdot \tau}\psi\|_2\, \|x\|_2 \qquad \text{by Young's inequality.}$$
Moreover,
$$\|\psi_\tau - e^{-i\xi \cdot \tau}\psi\|_2^2 = \frac{1}{(2\pi)^2} \int_{[-\pi,\pi]^2} |\widehat{\psi_\tau}(\omega) - e^{-i\xi \cdot \tau}\hat{\psi}(\omega)|^2\, d\omega \qquad \text{by Plancherel,}$$
$$= \frac{1}{(2\pi)^2} \int_{[-\pi,\pi]^2} |e^{-i\omega \cdot \tau} - e^{-i\xi \cdot \tau}|^2\, |\hat{\psi}(\omega)|^2\, d\omega \qquad \text{since } \psi_\tau(u) = \psi(u - \tau),$$
$$\leq \frac{1}{(2\pi)^2} \int_{[-\pi,\pi]^2} |(\omega - \xi) \cdot \tau|^2\, |\hat{\psi}(\omega)|^2\, d\omega \qquad \text{since } t \mapsto e^{it} \text{ is 1-Lipschitz,}$$
$$\leq \frac{1}{(2\pi)^2} \int_{[-\pi,\pi]^2} |\omega - \xi|^2\, |\tau|^2\, |\hat{\psi}(\omega)|^2\, d\omega = \sigma^2 |\tau|^2 \qquad \text{by Cauchy-Schwarz,}$$
which leads to the desired result of eq. (3): $\|x_\tau \star \psi - e^{-i\xi \cdot \tau}(x \star \psi)\|_\infty \leq \sigma\, |\tau|\, \|x\|_2$.

D PROOF OF EQUATION (4)

We have:
$$\mathrm{ReLU}(x \star \psi_\alpha) = \mathrm{ReLU}\big(x \star \mathrm{Re}(e^{-i\alpha}\psi)\big) = \mathrm{ReLU}\big(\mathrm{Re}(e^{-i\alpha}\, x \star \psi)\big),$$
since x is real. By writing $x \star \psi = |x \star \psi|\, e^{i\varphi(x \star \psi)}$, where $\varphi(x \star \psi)$ is the phase of $x \star \psi$, this leads to:
$$\mathrm{ReLU}\big(\mathrm{Re}(e^{-i\alpha}\, x \star \psi)\big) = \mathrm{ReLU}\big(|x \star \psi| \cos(\varphi(x \star \psi) - \alpha)\big) = |x \star \psi|\, \mathrm{ReLU}\big(\cos(\varphi(x \star \psi) - \alpha)\big),$$
since the ReLU activation is positive-homogeneous of degree 1. With a change of variable, and because cos is 2π-periodic and even:
$$\frac{1}{2}\int_{-\pi}^{\pi} \mathrm{ReLU}(x \star \psi_\alpha)\, d\alpha = \frac{|x \star \psi|}{2} \int_{-\pi}^{\pi} \mathrm{ReLU}(\cos\alpha)\, d\alpha = \frac{|x \star \psi|}{2} \int_{-\pi/2}^{\pi/2} \cos\alpha\, d\alpha = |x \star \psi|,$$
which proves eq. (4).

For the approximation with 4 phases, note that for $z \in \mathbb{C}$ we have $|z| = \sqrt{|\mathrm{Re}(z)|^2 + |\mathrm{Im}(z)|^2} \approx |\mathrm{Re}(z)| + |\mathrm{Im}(z)|$ in the following sense:
$$\frac{1}{\sqrt{2}}\big(|\mathrm{Re}(z)| + |\mathrm{Im}(z)|\big) \leq |z| \leq |\mathrm{Re}(z)| + |\mathrm{Im}(z)|.$$
We can write $|\mathrm{Re}(z)| = \mathrm{ReLU}(\mathrm{Re}(z)) + \mathrm{ReLU}(-\mathrm{Re}(z))$ and $|\mathrm{Im}(z)| = \mathrm{ReLU}(\mathrm{Im}(z)) + \mathrm{ReLU}(-\mathrm{Im}(z))$, and then, using $\mathrm{Im}(z) = \mathrm{Re}(e^{-i\pi/2}z)$ and $e^{-i\pi} = -1$:
$$|z| \approx \mathrm{ReLU}(\mathrm{Re}(z)) + \mathrm{ReLU}(\mathrm{Re}(e^{-i\pi}z)) + \mathrm{ReLU}(\mathrm{Re}(e^{-i\pi/2}z)) + \mathrm{ReLU}(\mathrm{Re}(e^{i\pi/2}z)).$$
Applied to $z = x \star \psi$, this gives:
$$|x \star \psi| \approx \sum_{\alpha \in \{-\pi/2,\, 0,\, \pi/2,\, \pi\}} \mathrm{ReLU}(x \star \psi_\alpha),$$
which shows that the integral of eq. (4) can be well approximated with a sum over 4 phases α of the complex filter ψ.
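This 4-phase approximation is easy to verify numerically. The following short NumPy check (a simple illustration) confirms that the sum of ReLUs over the 4 phases equals $|\mathrm{Re}(z)| + |\mathrm{Im}(z)|$, which stays within a factor $\sqrt{2}$ of $|z|$:

```python
# Check of the 4-phase approximation of the complex modulus.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000) + 1j * rng.standard_normal(10_000)

relu = lambda t: np.maximum(t, 0.0)
alphas = [-np.pi / 2, 0.0, np.pi / 2, np.pi]
four_phase_sum = sum(relu(np.real(np.exp(-1j * a) * z)) for a in alphas)

# The sum over 4 phases equals |Re z| + |Im z| ...
assert np.allclose(four_phase_sum, np.abs(z.real) + np.abs(z.imag))
# ... which approximates |z| within a factor sqrt(2).
ratio = four_phase_sum / np.abs(z)
print(ratio.min(), ratio.max())   # stays within [1, sqrt(2)]
```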
E PROOF OF THEOREM 2

We first use the chain rule for the entropy:
$$H\big(\varphi(DX) \,\big|\, |DX|\big) = H\big(|DX|, \varphi(DX)\big) - H\big(|DX|\big).$$
The first term is rewritten with a change of variables from Cartesian to polar coordinates in each entry:
$$H\big(|DX|, \varphi(DX)\big) = H(DX) - \sum_{k=1}^{d} \mathbb{E}\big[\log |(DX)_k|\big]$$
$$= H(X) - \sum_{k=1}^{d} \mathbb{E}\big[\log |(DX)_k|\big] \qquad \text{as D is unitary and hence } |\det(D)| = 1,$$
$$\geq H(X) - d\, \mathbb{E}\Big[\log\Big(\frac{1}{d}\, \|DX\|_1\Big)\Big] \qquad \text{by concavity,}$$
$$\geq H(X) - d \log\Big(\frac{1}{d}\, \mathbb{E}[\|DX\|_1]\Big) \qquad \text{by concavity.}$$
The second term is bounded using the fact that the exponential distribution $\mathcal{E}(\lambda)$ is the maximum-entropy distribution on $\mathbb{R}_+$ with mean $1/\lambda$:
$$H(|DX|) \leq \sum_{k=1}^{d} H\big(|(DX)_k|\big) \leq \sum_{k=1}^{d} \log\big(e\, \mathbb{E}[|(DX)_k|]\big) \leq d \log\Big(\frac{e}{d}\, \mathbb{E}[\|DX\|_1]\Big) \qquad \text{by concavity.}$$
Combining both inequalities and rearranging terms yields the stated bound:
$$H\big(\varphi(DX) \,\big|\, |DX|\big) \geq H(X) - d - 2d \log\Big(\frac{1}{d}\, \mathbb{E}[\|DX\|_1]\Big).$$

F PROOF OF THEOREM 3

We begin with the following lemma.

Lemma 1. Let $(\theta_1, \ldots, \theta_d)$ be i.i.d. uniform random variables in $[0, 2\pi]$. Then there exists a constant $C_d > 0$ such that for all $(\rho_1, \ldots, \rho_d) \in \mathbb{R}^d$:
$$\mathbb{E}\Big[\Big|\sum_{k=1}^{d} \rho_k e^{i\theta_k}\Big|\Big] \geq C_d \Big(\sum_{k=1}^{d} \rho_k^2\Big)^{1/2}.$$

This is proved by observing that the left-hand side is a norm on $\mathbb{R}^d$ as a function of $(\rho_1, \ldots, \rho_d)$. One can indeed verify that it is positive definite, homogeneous and satisfies the triangle inequality. Since all norms on $\mathbb{R}^d$ are equivalent, there exists a constant $C_d > 0$ such that the stated inequality holds for all $(\rho_1, \ldots, \rho_d) \in \mathbb{R}^d$.

Going back to the proof of Theorem 3, and letting $X' = \rho_b(DX)$, we then have:
$$\mathbb{E}\big[\|D'X'\|_1 \,\big|\, |X'|\big] = \sum_{m=1}^{d} \mathbb{E}\Big[\Big|\sum_{k=1}^{d} D'_{m,k} X'_k\Big| \,\Big|\, |X'|\Big]$$
$$\geq C_d \sum_{m=1}^{d} \Big(\sum_{k=1}^{d} |D'_{m,k}|^2\, |X'_k|^2\Big)^{1/2} \qquad \text{by the above lemma,}$$
$$\geq C_d \sum_{m=1}^{d} \sum_{k=1}^{d} |D'_{m,k}|^2\, |X'_k| \qquad \text{by concavity, because } \sum_{k=1}^{d} |D'_{m,k}|^2 = 1,$$
$$= C_d\, \|X'\|_1 \qquad \text{because } \sum_{m=1}^{d} |D'_{m,k}|^2 = 1.$$
Taking the expectation finishes the proof:
$$\mathbb{E}\big[\|D'X'\|_1\big] \geq C_d\, \mathbb{E}\big[\|X'\|_1\big]. \tag{12}$$

G EXPERIMENTAL DETAILS

Channel operators. In all experiments we set $P_0 = \mathrm{Id}$, and factorize the classifier with an additional complex 1×1 convolutional operator $P_J$, which reduces the dimension before all channels and positions are linearly combined. The architectures implemented can thus also be written as $\prod_{j=1}^{J} P_j\, \rho\, W$, where ρ is the non-linearity. Each operator $(P_j)_{1 \leq j \leq J}$ is preceded by a standardization. It sets the complex mean $\mu = \mathbb{E}[z]$ of every channel to zero, and the real variance $\sigma^2 = \mathbb{E}[|z|^2]$ of every channel to one. This is similar to a complex 2D batch-normalization layer (Ioffe and Szegedy, 2015), but without learned affine parameters. Each operator $(P_j)_{1 \leq j \leq J}$ is additionally followed by a spatial divisive normalization (Rao et al., 2001), similarly to the local response normalization of Krizhevsky et al. (2012). It sets the norm across channels of each spatial position to one. A sketch of these two normalizations is given after Table 5.

The sizes of the $(P_j)_j$ are specified in Table 4. The total numbers of parameters for each architecture are specified in Table 5. Learned Scattering networks with phase collapses have a large number of parameters compared to ResNet, despite the comparable width. This is because the predefined wavelet operator W expands the dimension by a factor of L + 1, which means that the input dimension of the learned $(P_j)_j$ is higher than in ResNet. The skip-connection further increases this input dimension by a factor of 2.

Table 4: Number $c_j$ of complex output channels of $P_j$, $1 \leq j \leq J$. The total number of projectors is J = 8 for CIFAR and J = 11 for ImageNet.

| j | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 $c_j$ | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 512 | – | – | – |
| ImageNet $c_j$ | 32 | 64 | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 512 | 256 |

Table 5: Number of real parameters (in millions) of Learned Scattering network architectures. A complex parameter is counted as two real parameters.

| | PCScat | PCScat + skip | ResNet |
|---|---|---|---|
| CIFAR-10 | 41.6 | 83.1 | 0.27 |
| ImageNet | 36.0 | 62.8 | 11.7 |
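The following is a minimal sketch of the two normalizations of the channel operators described above (a simplified illustration, not the authors' implementation): complex feature maps are stored as complex tensors, statistics are estimated over the batch and spatial dimensions, and the learned operator $P_j$ would be applied between the two functions.

```python
# Channel-wise complex standardization and spatial divisive normalization.
import torch

def complex_standardize(z, eps=1e-6):
    # zero complex mean and unit E[|z|^2] per channel, estimated over batch and space
    mu = z.mean(dim=(0, 2, 3), keepdim=True)
    z = z - mu
    var = (z.abs() ** 2).mean(dim=(0, 2, 3), keepdim=True)
    return z / torch.sqrt(var + eps)

def divisive_normalize(z, eps=1e-6):
    # unit norm across channels at every spatial position
    norm = torch.sqrt((z.abs() ** 2).sum(dim=1, keepdim=True))
    return z / (norm + eps)

z = torch.randn(8, 64, 16, 16, dtype=torch.complex64)
# In a layer, the learned 1x1 operator P_j sits between the two normalizations:
# divisive_normalize(P_j(complex_standardize(z))).
z = divisive_normalize(complex_standardize(z))
```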
Spatial filters. We use elongated Morlet filters for the L complex band-pass filters $(g_\ell)_\ell$, which are rotated versions of a mother wavelet g: $g_\ell(u) = g(r_{-\pi\ell/L}\, u)$, with $r_\theta$ the rotation by angle θ. The mother wavelet g is defined, up to a normalization constant, as
$$g(u) \propto \big(e^{i\xi \cdot u} - K\big)\, e^{-u^\top \Sigma u / 2} \quad \text{with} \quad \Sigma = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 s^2 \end{pmatrix}.$$
Its parameters are its center frequency $\xi = \big((3\pi/4)/2^\gamma,\, 0\big)$, its bandwidth $\sigma = 1.25 \cdot 2^{-\gamma}$, and its slant s = 0.5, where $2^\gamma$ designates the scale of the band-pass filter and is to be adjusted. g is rotated along L = 8 angles for ImageNet and L = 4 angles for CIFAR: $\theta_\ell = \pi\ell/L$ for $1 \leq \ell \leq L$. The $(g_\ell)_\ell$ are then discretized for numerical computations, and K is adjusted so that they have a zero mean. Finally, we use for the low frequency $g_0$ a Gaussian window $g_0(u) \propto e^{-\sigma^2 \|u\|_2^2 / 2}$. The filters are implemented with the Kymatio package (Andreux et al., 2020).

Intermediate scales $2^{j/2}$ are obtained by applying a subsampling by 2 after each block of 2 layers. This introduces intermediate scales and generates a wavelet filterbank with 2 scales per octave: the filters are designed so that when j low-pass filters and one band-pass filter are cascaded, with a subsampling every 2 layers, the scale of the resulting wavelet is $2^{j/2}$. Each block comprises in its first layer a low-frequency filter $g_0^1$ with γ = 1/2 and band-pass filters with γ = 0. In the second layer, we use the same low-frequency filter $g_0^2 = g_0^1$ with γ = 1/2. The band-pass filters $g_\ell^2$ are obtained with parameters $\xi = (\pi/\sqrt{2}, 0)$ and $\sigma = 1.25\sqrt{2/3}$.

For CIFAR experiments, the J = 8 layers are grouped in 4 successive blocks of 2 layers. For ImageNet experiments, the first layer consists of band-pass elongated Morlet filters $g_\ell$ and a low-pass Gaussian window $g_0$ with γ = 0, followed by a subsampling of 2. The 10 following layers are grouped in 5 blocks of 2 layers.

Optimization. We use the SGD optimizer with an initial learning rate of 0.01, a momentum of 0.9, a weight decay of 0.0001, and a batch size of 128. The classifier is preceded by a 2D batch-normalization layer. We use traditional data augmentation: horizontal flips and random crops for CIFAR, random resized crops of size 224 and horizontal flips for ImageNet. Classification error on the ImageNet validation set is computed on a single center crop of size 224. On CIFAR, training lasts for 300 epochs and the learning rate is divided by 10 every 70 epochs. On ImageNet, training lasts for 150 epochs and the learning rate is divided by 10 every 45 epochs. All experiments run during the preparation of this paper, including preliminary ones, required around 10k 32GB NVIDIA V100 GPU-hours.
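To make the elongated Morlet parameterization of the spatial filters in this appendix concrete, the following NumPy sketch builds a plausible filter bank; the normalization, discretization and coordinate conventions are simplified and may differ from the paper and from Kymatio.

```python
# Sketch of an elongated Morlet filter bank (simplified, illustration only).
import numpy as np

def morlet(size, xi, sigma, slant, theta):
    u = np.arange(size) - size // 2
    U, V = np.meshgrid(u, u, indexing="ij")
    # rotate coordinates so that the filter oscillates along direction theta
    Ur = np.cos(theta) * U + np.sin(theta) * V
    Vr = -np.sin(theta) * U + np.cos(theta) * V
    envelope = np.exp(-(sigma ** 2) * (Ur ** 2 + (slant ** 2) * Vr ** 2) / 2)
    wave = np.exp(1j * xi * Ur)
    K = (wave * envelope).sum() / envelope.sum()    # zero-mean correction
    return (wave - K) * envelope

# First-layer parameters as described above (gamma = 0, L = 8 orientations).
gamma, L = 0.0, 8
filters = [morlet(size=33, xi=(3 * np.pi / 4) * 2 ** -gamma, sigma=1.25 * 2 ** -gamma,
                  slant=0.5, theta=np.pi * l / L) for l in range(1, L + 1)]
print(len(filters), filters[0].shape)   # 8 complex band-pass filters of size 33 x 33
```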