# High-dimensional Asymptotics of Denoising Autoencoders

Hugo Cui, Statistical Physics of Computation Lab, Department of Physics, EPFL, Lausanne, Switzerland. hugo.cui@epfl.ch
Lenka Zdeborová, Statistical Physics of Computation Lab, Department of Physics, EPFL, Lausanne, Switzerland.

We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear autoencoder with tied weights and a skip connection. We consider the high-dimensional limit where the number of training samples and the input dimension jointly tend to infinity while the number of hidden units remains bounded. We provide closed-form expressions for the denoising mean-squared test error. Building on this result, we quantitatively characterize the advantage of the considered architecture over the autoencoder without the skip connection, which relates closely to principal component analysis. We further show that our results accurately capture the learning curves on a range of real data sets.

1 Introduction

Machine learning techniques have a long history of success in denoising tasks. The recent breakthrough of diffusion-based generation [1, 2] has further revived the interest in denoising networks, demonstrating how they can also be leveraged, beyond denoising, for generative tasks. However, this rapidly expanding range of applications stands in sharp contrast to the relatively scarce theoretical understanding of denoising neural networks, even for the simplest instance thereof, namely Denoising Auto-Encoders (DAEs) [3].

Theoretical studies of autoencoders have hitherto almost exclusively focused on data compression tasks using Reconstruction Auto-Encoders (RAEs), where the goal is to learn a concise latent representation of the data. A majority of this body of work addresses linear autoencoders [4, 5, 6, 7]. The authors of [8, 9] analyze the gradient-based training of non-linear autoencoders with online stochastic gradient descent or in population, thus implicitly assuming the availability of an infinite number of training samples. Furthermore, two-layer RAEs were shown to learn to essentially perform Principal Component Analysis (PCA) [10, 11, 12], i.e. to learn a linear model. Ref. [13] shows that this is also true for infinite-width architectures. Learning in DAEs has been the object of theoretical investigation only in the linear case [14], while the case of non-linear DAEs remains theoretically largely unexplored.

Main contributions The present work considers the problem of denoising data sampled from a Gaussian mixture by learning a two-layer DAE with a skip connection and tied weights via empirical risk minimization. Throughout the manuscript, we consider the high-dimensional limit where the number of training samples n and the dimension d are large (n, d → ∞) while remaining comparable, i.e. α ≡ n/d = Θ(1). Our main contributions are:

Leveraging the replica method, we provide sharp, closed-form formulae for the mean squared denoising test error (MSE) of DAEs, as a function of the sample complexity α and the problem parameters. We also provide a sharp characterization of other learning metrics, including the weights norms, the skip connection strength, and the cosine similarity between the weights and the cluster means. These formulae encompass as a corollary the case of RAEs.
We show that these formulae also describe quantitatively rather well the denoising MSE on real data sets, including MNIST [15] and Fashion-MNIST [16].

We find that PCA denoising (namely, denoising by projecting the noisy data along the principal component of the training samples) is widely sub-optimal compared to the DAE, leading to an MSE larger by a difference of Θ(d), thereby establishing that DAEs do not simply learn to perform PCA.

Building on the formulae, we quantify the role of each component of the DAE architecture (the skip connection and the bottleneck network) in its overall performance. We find that the two components have complementary effects in the denoising process, namely preserving the data nuances and removing the noise, and discuss how the training of the DAE results from a tradeoff between these effects.

The code used in the present manuscript can be found in the following repository.

Related works

Theory of autoencoders Various aspects of RAEs have been studied, for example memorization [17] or latent space alignment [18]. However, the largest body of work has been dedicated to the analysis of gradient-based algorithms for training RAEs. Ref. [5] established that minimizing the training loss leads to learning the principal components of the data. The authors of [11, 12] have analyzed how a linear RAE learns these components during training. These studies were later extended to non-linear networks by [19, 8, 9], at the price of further assuming an infinite number of training samples to be available, either by considering online stochastic gradient descent or the population loss. Refs. [20, 13] are able to address a finite sample complexity but, in exchange, have to consider infinite-width architectures, which, as [13] further shows, also tend to a large extent to learn to perform PCA.

Exact asymptotics from the replica method The replica method [21, 22, 23, 24] has proven a very valuable gateway to sharp asymptotic characterizations of learning metrics in high-dimensional machine learning problems. Past works have addressed, among others, single- [25, 26, 27, 28] and multi-index models [29], and kernel methods [30, 31, 32, 33]. While the approach has traditionally addressed convex problems, for which its predictions can be proven e.g. using Gordon's convex minimax theorem [34], the replica method allows to average over all the global minimizers of the loss, and therefore also accommodates non-convex settings. Refs. [35, 36] are two recent examples of its application to non-convex losses. In the present manuscript, we leverage this versatility to study the minimization of the empirical risk of DAEs, whose non-convexity represents a considerable hurdle for many other types of analyses.

2 Setting

Data model We consider the problem of denoising data x ∈ R^d corrupted by Gaussian white noise of variance ∆,

x̃ = √(1−∆) x + √∆ ξ,

where we denoted x̃ the noisy data point and ξ ∼ N(0, I_d) the additive noise. The rescaling of the clean data point by a factor √(1−∆) is a practical choice that entails no loss of generality, and allows to easily interpolate between the noiseless case (∆ = 0) and the case where the signal-to-noise ratio vanishes (∆ = 1). Furthermore, it allows us to seamlessly connect with works on diffusion-based generative models, where the rescaling naturally follows from the way the data is corrupted by an Ornstein-Uhlenbeck process [1, 2]. In the present work, we assume the clean data x to be drawn from a Gaussian mixture distribution P with K clusters,

P = Σ_{k=1}^K ρ_k N(µ_k, Σ_k).  (1)
The k-th cluster is thus centered around µ_k ∈ R^d, has covariance Σ_k ⪰ 0, and relative weight ρ_k.

DAE model An algorithmic way to retrieve the clean data x from the noisy data x̃ is to build a neural network taking the latter as an input and yielding the former as an output. A particularly natural choice for such a network is an autoencoder architecture [3]. The intuition is that the narrow hidden layer of an autoencoder forces the network to learn a succinct latent representation of the data, which is robust against noise corruption of the input. In this work, we analyze a two-layer DAE. We further assume that the weights are tied. Additionally, mirroring modern denoising architectures like U-nets [37] or [38, 39, 40, 41], we also allow for a (trainable) skip connection:

f_{b,w}(x̃) = b x̃ + (w^⊤/√d) σ(w x̃/√d).  (2)

The DAE (2) is therefore parametrized by the scalar skip connection strength b ∈ R and the weights w ∈ R^{p×d}, with p the width of the DAE hidden layer. The normalization of the weight w by √d in (2) is the natural choice which ensures, for high-dimensional settings d ≫ 1, that the argument w x̃/√d of the non-linearity σ(·) stays Θ(1). Like [8], we focus on the case p ≪ d. The assumption of weight-tying affords a more concise theoretical characterization and thus clearer discussions. Note that it is also a strategy with substantial practical history, dating back to [3], as it prevents the DAE from functioning in the linear region of its non-linearity σ(·). This choice of architecture is also motivated by a particular case of Tweedie's formula [42] (see eq. (79) in Appendix B), which will be the object of further discussion in Section 4. We also consider two other simple architectures,

u_v(x̃) = (v^⊤/√d) σ(v x̃/√d),  r_c(x̃) = c x̃,  (3)

which correspond to the building blocks of the complete DAE architecture f_{b,w} (2) (hereafter referred to as the full DAE). Note that indeed f_{b,w} = r_b + u_w. The part u_v(·) is a DAE without skip connection (hereafter called the bottleneck network component), while r_c(·) corresponds to a simple single-parameter trainable rescaling of the input (hereafter called the rescaling component).

To train the DAE (2), we assume the availability of a training set D = {x̃^µ, x^µ}_{µ=1}^n, with n clean samples x^µ drawn i.i.d. from P (1) and the corresponding noisy samples x̃^µ = √(1−∆) x^µ + √∆ ξ^µ (with the noises ξ^µ assumed mutually independent). The DAE is trained to recover the clean samples x^µ from the noisy samples x̃^µ by minimizing the empirical risk

R̂(w, b) = Σ_{µ=1}^n ||x^µ − f_{b,w}(x̃^µ)||² + g(w),  (4)

where g : R^{p×d} → R_+ is an arbitrary convex regularizing function. We denote by b̂, ŵ the minimizers of the empirical risk (4) and by f̂ ≡ f_{b̂,ŵ} the corresponding trained DAE (2). For future discussion, we also consider training the components (3) independently via empirical risk minimization, by which we mean replacing f_{b,w} by u_v or r_c in (4). We similarly denote v̂ (resp. ĉ) the learnt weight of the bottleneck network (resp. rescaling) component, and û ≡ u_v̂ (resp. r̂ ≡ r_ĉ). Note that generically v̂ ≠ ŵ and ĉ ≠ b̂, and therefore f̂ ≠ û + r̂, since b̂, ŵ result from their joint optimization as parts of the full DAE f_{b,w}, while ĉ (or v̂) are optimized independently. As we discuss in Section 4, training the sole rescaling r_c does not afford an expressive enough denoiser, while an independently learnt bottleneck network component u_v essentially only learns to implement PCA.
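For concreteness, the tied-weight architecture (2) can be summarized by the short PyTorch sketch below. It is a minimal illustration of the parametrization described above (class and variable names are ours, not the authors' released implementation); the bottleneck component u_v of (3) corresponds to freezing b = 0 and excluding it from the optimizer, while the rescaling component r_c keeps only the skip term.

```python
import math
import torch


class TiedDAE(torch.nn.Module):
    """Two-layer DAE with tied weights and a trainable skip connection, as in eq. (2)."""

    def __init__(self, d, p, activation=torch.tanh):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(p, d))   # weights w in R^{p x d}
        self.b = torch.nn.Parameter(torch.tensor(0.5))   # scalar skip-connection strength b
        self.activation = activation

    def forward(self, x_noisy):
        d = x_noisy.shape[-1]
        # Hidden pre-activation w x / sqrt(d), which stays Theta(1) in high dimension.
        h = self.activation(x_noisy @ self.w.T / math.sqrt(d))
        # Tied decoder w^T sigma(.) / sqrt(d), plus the skip connection b * x_noisy.
        return self.b * x_noisy + h @ self.w / math.sqrt(d)
```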
However, when jointly trained as components of the full DAE f_{b,w} (2), the resulting denoiser f̂ is a genuinely non-linear model which yields a much lower test error than PCA, and learns to leverage its two components flexibly to balance the preservation of the data nuances and the removal of the noise.

Learning metrics The performance of the DAE (2) trained with the loss (4) is quantified by its reconstruction (denoising) test MSE, defined as

mse_f̂ ≡ E_D E_{x∼P} E_{ξ∼N(0,I_d)} || x − f_{b̂,ŵ}(√(1−∆) x + √∆ ξ) ||².  (5)

The expectations run over a fresh test sample x sampled from the Gaussian mixture P (1), and a new additive noise ξ corrupting it. Note that an expectation over the train set D is also included, to make mse_f̂ a metric that does not depend on the particular realization of the train set. The denoising test MSEs mse_û, mse_r̂ are defined similarly as the denoising test errors of the independently learnt components (3). Aside from the denoising MSE (5), another question of interest is how much the DAE manages to learn the structure of the data distribution, as described by the cluster means µ_k. This is measured by the cosine similarity matrix θ ∈ R^{p×K}, where for i ∈ ⟦1, p⟧ and k ∈ ⟦1, K⟧,

θ_ik = ŵ_i^⊤ µ_k / (||ŵ_i|| ||µ_k||).  (6)

In other words, θ_ik measures the alignment of the i-th row ŵ_i of the trained weight matrix ŵ with the mean of the k-th cluster µ_k.

High-dimensional limit We analyze the optimization problem (4) in the high-dimensional limit where the input dimension d and the number of training samples n jointly tend to infinity, while their ratio α = n/d, hereafter referred to as the sample complexity, stays Θ(1). The hidden layer width p, the noise level ∆, the number of clusters K and the norms of the cluster means ||µ_k|| are also assumed to remain Θ(1). This corresponds to a rich limit, where the number of parameters of the DAE is not large compared to the number of samples (as it is in [20, 13]), so that the DAE cannot trivially fit the train set, or simply memorize it [17]. Conversely, the number of samples n is not infinite (as it is in [8, 9, 19]), which importantly allows to study the effect of a finite train set on the representation learnt by the DAE.

3 Asymptotic formulae for DAEs

We now state the main result of the present work, namely the closed-form asymptotic formulae for the learning metrics mse_f̂ (5) and θ (6) for a DAE (2) learnt with the empirical loss (4). These characterizations are obtained by first recasting the optimization problem into the analysis of an associated probability measure, and then carrying out this analysis using the heuristic replica method, which we here employ in its replica-symmetric formulation (see Appendix A).

Assumption 3.1. The covariances {Σ_k}_{k=1}^K admit a common set of eigenvectors {e_i}_{i=1}^d. We further note {λ_i^k}_{i=1}^d the eigenvalues of Σ_k. The eigenvalues {λ_i^k}_{i=1}^d and the projections of the cluster means on the eigenvectors {e_i^⊤ µ_k}_{i,k} are assumed to admit a well-defined joint distribution ν as d → ∞, namely, for γ = (γ_1, ..., γ_K) ∈ R^K and τ = (τ_1, ..., τ_K) ∈ R^K:

(1/d) Σ_{i=1}^d Π_{k=1}^K δ(λ_i^k − γ_k) δ(√d e_i^⊤ µ_k − τ_k) → ν(γ, τ)  as d → ∞.  (7)

Moreover, the marginals ν_γ (resp. ν_τ) are assumed to have a well-defined first (resp. second) moment.

Assumption 3.2. g(·) is an ℓ2 regularizer with strength λ, i.e. g(·) = λ/2 ||·||²_F.

We are now in a position to discuss the main result of this manuscript, which we state under Assumptions 3.1 and 3.2 for definiteness and clarity. These assumptions can actually be relaxed, as we further discuss after the result statement.

Conjecture 3.3.
(Closed-form asymptotics for DAEs trained with empirical risk minimization) Under Assumptions 3.1 and 3.2, in the high-dimensional limit n, d → ∞ with fixed ratio α, the denoising test MSE mse_f̂ (5) admits the expression

mse_f̂ − mse° = Σ_{k=1}^K ρ_k E_z Tr[ q σ(√(1−∆) m_k + (∆ q + (1−∆) q_k)^{1/2} z)^{⊗2} ]
  − 2 Σ_{k=1}^K ρ_k E_{u,v} [ σ(√(1−∆)(m_k + q_k^{1/2} u) + √∆ q^{1/2} v)^⊤ ((1 − b̂ √(1−∆))(m_k + q_k^{1/2} u) − b̂ √∆ q^{1/2} v) ] + o(1),  (8)

where the averages bear over independent Gaussian variables z, u, v ∼ N(0, I_p). We denoted

mse° = ∆ b̂² d + (1 − √(1−∆) b̂)² Σ_{k=1}^K ρ_k [ ∫ dν_τ(τ) τ_k² + d ∫ dν_γ(γ) γ_k ].  (9)

The learnt skip connection strength b̂ is

b̂ = √(1−∆) Σ_{k=1}^K ρ_k ∫ dν_γ(γ) γ_k / [ (1−∆) Σ_{k=1}^K ρ_k ∫ dν_γ(γ) γ_k + ∆ ] + o(1).  (10)

The cosine similarity θ (6) admits the compact formula, for i ∈ ⟦1, p⟧ and k ∈ ⟦1, K⟧,

θ_ik = (m_k)_i / ( √(q_ii) √(∫ dν_τ(τ) τ_k²) ),  (11)

where we have introduced the summary statistics

q = lim_{d→∞} E_D[ ŵ ŵ^⊤/d ],  q_k = lim_{d→∞} E_D[ ŵ Σ_k ŵ^⊤/d ],  m_k = lim_{d→∞} E_D[ ŵ µ_k/√d ].  (12)

Thus q, q_k ∈ R^{p×p} and m_k ∈ R^p. The existence of these limits is an assumption of the replica method. The summary statistics q, q_k, m_k can be determined as solutions of the system of equations

q̂_k = α ρ_k E_{ξ,η}[ V_k^{-1} (prox_y^k − q_k^{1/2} η − m_k)^{⊗2} V_k^{-1} ]
V̂_k = −α ρ_k q_k^{-1/2} E_{ξ,η}[ V_k^{-1} (prox_y^k − q_k^{1/2} η − m_k) η^⊤ ]
m̂_k = α ρ_k E_{ξ,η}[ V_k^{-1} (prox_y^k − q_k^{1/2} η − m_k) ]
q̂ = α Σ_{k=1}^K ρ_k E_{ξ,η}[ V^{-1} (prox_x^k − q^{1/2} ξ)^{⊗2} V^{-1} ]
V̂ = α Σ_{k=1}^K ρ_k E_{ξ,η}[ −q^{-1/2} V^{-1} (prox_x^k − q^{1/2} ξ) ξ^⊤ + σ(√(1−∆) prox_y^k + √∆ prox_x^k)^{⊗2} ]
q_k = ∫ dν(γ, τ) γ_k (λ I_p + V̂ + Σ_{j=1}^K γ_j V̂_j)^{-1} ( q̂ + Σ_{j=1}^K γ_j q̂_j + Σ_{1≤j,l≤K} τ_j τ_l m̂_j m̂_l^⊤ ) (λ I_p + V̂ + Σ_{j=1}^K γ_j V̂_j)^{-1}
V_k = ∫ dν(γ, τ) γ_k (λ I_p + V̂ + Σ_{j=1}^K γ_j V̂_j)^{-1}
m_k = ∫ dν(γ, τ) τ_k (λ I_p + V̂ + Σ_{j=1}^K γ_j V̂_j)^{-1} Σ_{j=1}^K τ_j m̂_j
q = ∫ dν(γ, τ) (λ I_p + V̂ + Σ_{j=1}^K γ_j V̂_j)^{-1} ( q̂ + Σ_{j=1}^K γ_j q̂_j + Σ_{1≤j,l≤K} τ_j τ_l m̂_j m̂_l^⊤ ) (λ I_p + V̂ + Σ_{j=1}^K γ_j V̂_j)^{-1}
V = ∫ dν(γ, τ) (λ I_p + V̂ + Σ_{j=1}^K γ_j V̂_j)^{-1}  (13)

In (13), q̂_k, V̂_k, V̂, V ∈ R^{p×p} and m̂_k ∈ R^p, and the averages bear over finite-dimensional i.i.d. Gaussians ξ, η ∼ N(0, I_p). Finally, prox_x^k, prox_y^k are given as the solutions of the finite-dimensional optimization

prox_x^k, prox_y^k = arginf_{x, y ∈ R^p} { (1/2) Tr[ V^{-1} (x − q^{1/2} ξ)^{⊗2} ] + (1/2) Tr[ V_k^{-1} (y − q_k^{1/2} η − m_k)^{⊗2} ]
  + (1/2) Tr[ q σ(√(1−∆) y + √∆ x)^{⊗2} ] − σ(√(1−∆) y + √∆ x)^⊤ ((1 − √(1−∆) b̂) y − b̂ √∆ x) }.  (14)

Conjecture 3.3 provides a gateway to probe and characterize the asymptotic properties of the model (2) at the global optimum of the empirical risk (4), whereas a purely experimental study would not have been guaranteed to reach the global solution, and would suffer from finite-size effects.

Figure 1: α = 1, K = 2, ρ_{1,2} = 1/2, Σ_{1,2} = 0.09 I_d, p = 1, λ = 0.1, σ(·) = tanh(·); the cluster mean µ_1 = −µ_2 was taken as a random Gaussian vector of norm 1. (left) In blue, the difference in MSE between the full DAE f̂ (2) and the rescaling component r̂ (3). Solid lines correspond to the sharp asymptotic characterization of Conjecture 3.3. Dots represent numerical simulations for d = 700, training the DAE using the PyTorch implementation of full-batch Adam, with learning rate η = 0.05 over 2000 epochs, averaged over N = 10 instances. Error bars represent one standard deviation. For completeness, the MSE of the oracle denoiser is given as a baseline in green, see Section 4. The performance of a linear DAE (σ(x) = x) is represented in dashed red. (right) Cosine similarity θ (6) (green), squared weight norm ||ŵ||²_F/d (red) and skip connection strength b̂ (blue). Solid lines correspond to the formulae (11), (12) and (10) of Conjecture 3.3; dots are numerical simulations. For completeness, the cosine similarity of the first principal component of the clean train data {x^µ}_{µ=1}^n is plotted in dashed black.

Equation (13) provides a closed-form expression for the summary statistics (12), in terms of the solutions of the low-dimensional optimization (14). The latter can be loosely viewed as an effective loss, subsuming the averaged effect of the finite training set.
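The numerical points in Fig. 1 are obtained by training the DAE with full-batch Adam on the regularized risk (4). A schematic reproduction of such an experiment is sketched below; the helper TiedDAE refers to the architecture sketch given above, the data-generation routines are our own minimal stand-ins for the binary mixture of Fig. 1, and the hyperparameters simply mirror the caption. This is an illustrative sketch, not the authors' released code.

```python
import math
import torch


def sample_binary_mixture(n, mu, variance=0.09, generator=None):
    """Draw n clean samples from the binary mixture (1/2) N(+mu, variance I) + (1/2) N(-mu, variance I)."""
    d = mu.shape[0]
    signs = (torch.rand(n, 1, generator=generator) < 0.5).float() * 2 - 1
    return signs * mu + math.sqrt(variance) * torch.randn(n, d, generator=generator)


def corrupt(x, delta, generator=None):
    """Noisy observations x_tilde = sqrt(1 - delta) x + sqrt(delta) xi, with xi ~ N(0, I_d)."""
    return math.sqrt(1 - delta) * x + math.sqrt(delta) * torch.randn(x.shape, generator=generator)


def train_dae(model, x_clean, delta, lam=0.1, lr=0.05, epochs=2000):
    """Full-batch Adam on the empirical risk (4): sum_mu ||x - f(x_tilde)||^2 + (lam/2) ||w||_F^2."""
    x_tilde = corrupt(x_clean, delta)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Only the weights w are regularized; the skip-connection strength b is left unpenalized.
        loss = ((x_clean - model(x_tilde)) ** 2).sum() + 0.5 * lam * model.w.pow(2).sum()
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def denoising_mse(model, x_test, delta):
    """Monte-Carlo estimate of the denoising test MSE (5) on fresh samples and fresh noise."""
    return ((x_test - model(corrupt(x_test, delta))) ** 2).sum(dim=1).mean().item()


@torch.no_grad()
def cosine_similarity(model, mu):
    """Cosine similarity (6) between the rows of the trained weights and a cluster mean mu."""
    return (model.w @ mu) / (model.w.norm(dim=1) * mu.norm())
```

For the setting of Fig. 1, one would take d = 700, n = α d, µ a random vector of norm 1, and compare denoising_mse and cosine_similarity with the predictions (8) and (11).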
Further remark that, while the regularization λ does not explicitly appear in the expression for the MSE (8), the statistics q_k, m_k in (8) depend thereon through (13). In fact, Assumptions 3.1 and 3.2 are not strictly necessary, and can be simultaneously relaxed to address an arbitrary convex regularizer g(·) and generically non-commuting {Σ_k}_{k=1}^K, at the price of more intricate formulae. For this reason, we choose to discuss here Conjecture 3.3, and defer a discussion and detailed derivation of the generic asymptotically exact formulae to Appendix A, see eq. (58). Let us mention that a sharp asymptotic characterization of the train MSE can also be derived; for conciseness, we do not present it here and refer the interested reader to equation (68) in Appendix A.

Conjecture 3.3 encompasses as special cases the asymptotic characterization of the components r̂, û (3):

Corollary 3.4. (MSE of components) The test MSE of r̂ (3) is given by mse_r̂ = mse° (9). Furthermore, the learnt value of its single parameter ĉ is given by (10). The test MSE, cosine similarity and summary statistics of the bottleneck network û (3) follow from Conjecture 3.3 by setting b̂ = 0.

The implications of Corollary 3.4 shall be further discussed in Section 4, and a full derivation is provided in Appendix E. Finally, remark that in the noiseless limit ∆ = 0, the denoising task reduces to a reconstruction task, with the autoencoder being tasked with reproducing the clean data as an output when taking the same clean sample as an input. Therefore Conjecture 3.3 also includes RAEs (by definition, without skip connection) as a special case.

Corollary 3.5. (RAEs) In the n, d → ∞ limit, the MSE, cosine similarity and summary statistics for an RAE follow from Conjecture 3.3 by setting x = 0 in (14), removing the first term in the brackets in the equation for V̂ in (13), and taking the limit ∆, q̂, b̂ → 0.

Corollary 3.5 will be the object of further discussion in Section 4. A detailed derivation is presented in Appendix F. Note that Corollary 3.5 provides a characterization of RAEs as a function of the sample complexity α, whereas previous studies on non-linear RAEs rely on the assumption of an infinite number of available training samples [13, 8, 9].

Equations (10) and (12) of Conjecture 3.3 thus characterize the statistics of the learnt parameters b̂, ŵ of the trained DAE (2). These summary statistics are, in turn, sufficient to fully characterize the learning metrics (5) and (6) via equations (8) and (11). We have thus reduced the high-dimensional optimization (4) and the high-dimensional average over the train set D involved in the definition of the metrics (5) and (6) to a simpler system of equations over 4 + 6K variables (13), which can be solved numerically. It is important to note that all the summary statistics involved in (13) are finite-dimensional as d → ∞, and therefore Conjecture 3.3 is a fully asymptotic characterization, in the sense that it does not involve any high-dimensional object. Finally, let us stress once more that the replica method employed in the derivation of these results should be viewed as a strong heuristic, but does not constitute a rigorous proof. While Conjecture 3.3 is stated in full generality, we focus for definiteness in the rest of the manuscript on the simple case p = 1, K = 2, which is found to already display all the interesting phenomenology discussed in this work, and leave the thorough exploration of p > 1, K > 2 settings to future work.
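In practice, self-consistent systems such as (13) are commonly solved by damped fixed-point iteration, alternating between the update of the hatted statistics and that of the overlaps until convergence. The generic skeleton below illustrates this strategy; the callables update_hat and update_overlaps stand in for the right-hand sides of (13) (with the low-dimensional Gaussian expectations evaluated, e.g., by quadrature or Monte Carlo) and are assumptions of this sketch rather than a transcription of the paper's code.

```python
import numpy as np


def solve_self_consistent(update_hat, update_overlaps, init, damping=0.7, tol=1e-7, max_iter=5000):
    """Damped fixed-point iteration for a system of self-consistent equations.

    update_hat(overlaps) -> dict of hatted statistics (q_hat, V_hat, m_hat_k, ...);
    update_overlaps(hatted) -> dict of overlaps (q, V, m_k, ...);
    init is an initial guess for the overlaps.
    """
    overlaps = {key: np.array(val, dtype=float) for key, val in init.items()}
    for iteration in range(max_iter):
        hatted = update_hat(overlaps)
        proposal = update_overlaps(hatted)
        # Damped update: convex combination of old and proposed iterates, for stability.
        change = 0.0
        for key in overlaps:
            new_val = damping * overlaps[key] + (1.0 - damping) * proposal[key]
            change = max(change, float(np.max(np.abs(new_val - overlaps[key]))))
            overlaps[key] = new_val
        if change < tol:
            return overlaps, iteration  # converged
    return overlaps, max_iter
```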
In the next paragraphs, we give two examples of applications of Conjecture 3.3, to a simple binary isotropic mixture and to real data sets.

Example 1: Isotropic homoscedastic mixture We give as a first example the case of a synthetic binary Gaussian mixture with K = 2, µ_1 = −µ_2, Σ_{1,2} = 0.09 I_d, ρ_{1,2} = 1/2, using a DAE with σ = tanh and p = 1. Since this simple case exhibits the key phenomenology discussed in the present work, we refer to it in future discussions. The MSE mse_f̂ (8), evaluated from the solutions of the self-consistent equations (13), is plotted as the solid blue line in Fig. 1 (left) and compared to numerical simulations corresponding to training the DAE (2) with the PyTorch implementation of the Adam optimizer [43] (blue dots), for sample complexity α = 1 and ℓ2 regularization (weight decay) λ = 0.1. The agreement between theory and simulations is compelling. The green solid line and corresponding green dots in Fig. 1 (right) correspond to the replica prediction (11) and simulations for the cosine similarity θ (6), and again display very good agreement. Note that for large noise levels, the DAE achieves a worse MSE than the rescaling, as shown by the positive value of mse_f̂ − mse_r̂, despite the former being a priori expressive enough to realize the latter. This in fact signals that the DAE overfits the training data. That such an overfitting is captured is a strength of our analysis, which allows to cover the effect of a limited sample complexity. Finally, this overfitting can be mitigated by increasing the weight decay λ, see Fig. 9 in Appendix A.

A particularly striking observation is that, due to the non-convexity of the loss (4), there is a priori no guarantee that an Adam-optimized DAE should find a global minimum, as described by Conjecture 3.3, rather than a local minimum. The compelling agreement between theory and simulations in Fig. 1 temptingly suggests that, at least in this case, the loss landscape of DAEs (2) trained with the loss (4) for the data model (1) should in some way be benign. The authors of [12] have shown, for linear RAEs with the square loss and no regularizer, that the only local minimum is the global one. Ref. [14] offers further insight for a linear DAE in dimension d = 1, and shows that, aside from the global minima, the loss landscape only includes an unstable saddle point from which the dynamics easily escapes. Extending these works and intuitions to non-linear DAEs is an exciting research topic for future work.

Example 2: MNIST, Fashion-MNIST It is reasonable to ask whether Conjecture 3.3 is restricted to Gaussian mixtures (1). The answer is negative: in fact, Conjecture 3.3 also describes well a number of real data distributions. We provide such an example for Fashion-MNIST [16] (from which, for simplicity, we only kept boots and shoes) and MNIST [15] (1s and 7s) in Fig. 2. For each data set, samples sharing the same label were considered to belong to the same cluster. Note that we purposefully chose closely related classes, for which the clusters are expected to be closer, leading to an a priori harder and thus more interesting learning problem. The means and covariances thereof were estimated numerically, and combined with Conjecture 3.3. The resulting denoising MSE predictions mse_f̂ are plotted as solid lines in Fig. 2, and agree very well with numerical simulations of DAEs optimized over the real data sets using the PyTorch implementation of Adam [43]. A full description of this experiment is given in Appendix D.
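The real-data curves of Fig. 2 are thus obtained by matching each class of the data set to one Gaussian cluster, estimating its empirical mean, covariance and relative weight, and feeding these statistics to the asymptotic formulae. A minimal sketch of this estimation step is given below (the arrays images and labels are placeholders; the subsequent evaluation of the formulae with these statistics follows Appendix D).

```python
import numpy as np


def gaussian_mixture_statistics(images, labels):
    """Estimate the weights rho_k, means mu_k and covariances Sigma_k of the Gaussian
    mixture matched to a labelled data set, with one cluster per class.

    images: (n, d) array of flattened samples; labels: (n,) array of class labels.
    Returns a list of (rho_k, mu_k, Sigma_k) tuples, one per class.
    """
    n = len(labels)
    statistics = []
    for k in np.unique(labels):
        cluster = images[labels == k]              # samples belonging to class k
        rho_k = len(cluster) / n                   # relative cluster weight
        mu_k = cluster.mean(axis=0)                # empirical cluster mean
        sigma_k = np.cov(cluster, rowvar=False)    # empirical cluster covariance
        statistics.append((rho_k, mu_k, sigma_k))
    return statistics
```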
The observation that the MSEs of real data sets are captured to such a degree of accuracy by the equivalent Gaussian mixture strongly hints at the presence of Gaussian universality [44]. This opens a gateway to future research, as Gaussian universality has hitherto been exclusively addressed in classification and regression (rather than denoising) settings, see e.g. [44, 45, 46]. Denoising tasks further constitute a particularly intriguing setting for universality results, as Gaussian universality would signify that only second-order statistics of the data can be reconstructed using a shallow autoencoder.

Figure 2: Difference in MSE between the full DAE (2) and the rescaling component (3) for the MNIST data set (middle), of which for simplicity only 1s and 7s were kept, and Fashion-MNIST (right), of which only boots and shoes were kept. In blue, the theoretical predictions resulting from using Conjecture 3.3 with the empirically estimated covariances and means, see Appendix D for further details. In red, numerical simulations of a DAE (p = 1, σ = tanh) trained with n = 784 training points, using the PyTorch implementation of full-batch Adam, with learning rate η = 0.05 and weight decay λ = 0.1 over 2000 epochs, averaged over N = 10 instances. Error bars represent one standard deviation. (left) Illustration of the denoised images: (top left) original image, (top right) noisy image, (bottom left) DAE f̂ (2), (bottom right) rescaling r̂ (3).

4 The role and importance of the skip connection

Conjecture 3.3 for the full DAE f̂ (2) and Corollary 3.4 for its components r̂, û (3) allow to disentangle the contribution of each part, and thus to pinpoint their respective roles in the DAE architecture. We sequentially present a comparison of f̂ with r̂, and of f̂ with û. We remind that f̂, r̂ and û result from independent optimizations over the same train set D, and that while f_{b,w} = u_w + r_b, f̂ ≠ û + r̂.

Full DAE and the rescaling component We start this section by observing that, for noise levels ∆ below a certain threshold, the full DAE f̂ yields a better MSE than the learnt rescaling r̂, as can be seen from the negative value of mse_f̂ − mse_r̂ in Fig. 1 and Fig. 2. The improvement is more sizeable at intermediate noise levels ∆, and is observed for a growing region of ∆ as the sample complexity α increases, see Fig. 3 (a). This lower MSE further translates into visible qualitative changes in the result of the denoising. As can be seen from Fig. 2 (left), the full DAE f̂ (2) (bottom left) yields denoised images with sensibly higher definition and overall contrast, while a simple rescaling r̂ (bottom right) leads to a still largely blurred image. We provide one more comparison: for the isotropic binary mixture (see Fig. 1), the DAE test error mse_f̂ in fact approaches the information-theoretically lowest achievable MSE mse* as the sample complexity α increases. To see this, note that mse* is achieved by the application of Tweedie's formula [42], which requires perfect knowledge of the cluster means µ_k and covariances Σ_k; it is, therefore, an oracle denoiser. A sharp asymptotic characterization of the oracle denoiser is provided in Appendix B. As can be observed from Fig. 3 (a), the DAE MSE (2) approaches, but does not exactly converge to (see Appendix C), the oracle test error mse* as the number of available training samples n grows, and is already sensibly close to the optimal value for α = 8.
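For a Gaussian mixture, the oracle denoiser can be written explicitly as the posterior mean E[x | x̃], which requires the exact cluster weights, means and covariances. The sketch below spells out this baseline as a plain Bayesian computation for generic clusters; it is meant as an illustration of the oracle discussed above, not as a transcription of eq. (79) of Appendix B.

```python
import numpy as np


def oracle_denoiser(x_tilde, delta, rhos, mus, sigmas):
    """Posterior-mean (oracle) denoiser E[x | x_tilde] for a Gaussian mixture.

    x_tilde: (d,) noisy point, with x_tilde = sqrt(1-delta) x + sqrt(delta) xi;
    rhos: cluster weights; mus: list of (d,) means; sigmas: list of (d, d) covariances.
    """
    d = x_tilde.shape[0]
    s = np.sqrt(1.0 - delta)
    log_post, cond_means = [], []
    for rho, mu, sigma in zip(rhos, mus, sigmas):
        cov = (1.0 - delta) * sigma + delta * np.eye(d)   # covariance of x_tilde given cluster k
        diff = x_tilde - s * mu
        sol = np.linalg.solve(cov, diff)
        _, logdet = np.linalg.slogdet(cov)
        log_post.append(np.log(rho) - 0.5 * (logdet + diff @ sol))  # log responsibility, up to a constant
        cond_means.append(mu + s * sigma @ sol)                     # E[x | x_tilde, cluster k]
    log_post = np.array(log_post)
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    return sum(w * m for w, m in zip(weights, cond_means))
```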
DAEs with(out) skip connection We now turn our attention to comparing the full DAE f̂ (2) to the bottleneck network component û (3). It follows from Conjecture 3.3 and Corollary 3.4 that û (3) leads to a higher MSE than the full DAE f̂ (2), with the gap being Θ(d). More precisely,

mse_û − mse_f̂ = d (1−∆) (Σ_{k=1}^K ρ_k ∫ dν_γ(γ) γ_k)² / [ (1−∆) Σ_{k=1}^K ρ_k ∫ dν_γ(γ) γ_k + ∆ ] + Θ(1).  (15)

Figure 3: (left) Solid lines: difference in MSE between the full DAE f̂ (2), with σ = tanh, p = 1, and the rescaling r̂ (3), for sample complexities α = 0.5, 1, 2, 8. Dashed lines: the same curve for the oracle denoiser. Different colours represent different sample complexities α. (right) Difference in MSE between the bottleneck network û (3) and the complete DAE f̂ (2). In blue, the theoretical prediction (15); in red, numerical simulations for the bottleneck network (3) (σ = tanh, p = 1) trained with the PyTorch implementation of full-batch Adam, with learning rate η = 0.05 and weight decay λ = 0.1 over 2000 epochs, averaged over N = 5 instances, for d = 700. In green, the MSE (minus the MSE of the complete DAE (2)) achieved by PCA. Error bars represent one standard deviation. The model and parameters are the same as in Fig. 1.

Figure 4: Illustration of the denoised image for the various networks and algorithms. (a) original image, (b) noisy image, for ∆ = 0.2, (c) trained rescaling r̂ (3), (d) full DAE f̂ (2), (e) bottleneck network û (3), (f) PCA. The DAE and training parameters are the same as in Fig. 2, see also Appendix D.

The theoretical prediction (15) compares excellently with numerical simulations, see Fig. 3 (right). Strikingly, we find that PCA denoising yields an MSE almost indistinguishable from that of û, see Fig. 3, strongly suggesting that û essentially learns, also in the denoising setting, to project the noisy data x̃ along the principal components of the training set. The last two images of Fig. 4 respectively correspond to û and PCA, which can indeed be observed to lead to visually near-identical results. This echoes the findings of [10, 11, 8, 9, 13] in the case of RAEs that bottleneck networks are limited by the PCA reconstruction performance, a conclusion that we also recover from Corollary 3.5, see Appendix F. Crucially however, it also means that, compared to the full DAE f̂ (2), PCA is sizeably suboptimal, since mse_PCA ≈ mse_û = mse_f̂ + Θ(d). This last observation has an important consequence: in contrast to previously studied RAEs [10, 12, 11, 9, 13], the full DAE f̂ does not simply learn to perform PCA. In contrast to bottleneck RAE networks [8, 9, 13], the non-linear DAE hence does not reduce to a linear model after training. The non-linearity is important to improve the denoising MSE, see Fig. 1. We stress this finding: trained alone, the bottleneck network û only learns to perform PCA; trained jointly with the rescaling component as part of the full DAE f_{b,w} (2), it learns a richer, non-linear representation. The full DAE (2) thus offers a genuinely non-linear learning model, and opens exciting research avenues for the theory of autoencoders beyond linear (or effectively linear) cases. In the next paragraph, we explore further the interaction between the rescaling component and the bottleneck network.

A tradeoff between the rescaling and the bottleneck network Conjecture 3.3, alongside Corollary 3.4 and the discussion in Section 4, provides a firm theoretical basis for the well-known empirical intuition (discussed e.g.
in [38]) that skip connections allow to better propagate information from the input to the output of the DAE, thereby contributing to preserving intrinsic characteristics of the input. This effect is clearly illustrated in Fig. 4, where the denoised images of an MNIST 7 obtained by r̂, f̂, û and PCA are presented. While the bottleneck network û perfectly eliminates the background noise and produces an image with a very good resolution, it essentially collapses the image to the cluster mean and yields, like PCA, the average MNIST 7. As a consequence, the denoised image bears little resemblance to the original image; in particular, the horizontal bar of the 7 is lost in the process. Conversely, the rescaling r̂ preserves the nuances of the original image, but the result is still largely blurred and displays overall poor contrast. Finally, the complete DAE (2) manages to preserve the characteristic features of the original data, while enhancing the image resolution by slightly overlaying the average 7 thereupon. The optimization of the DAE (4) is therefore described by a tradeoff between two competing effects, namely the preservation of the input nuances by the skip connection, and the enhancement of the resolution/noise removal by the bottleneck network. This allows us to discuss the curious non-monotonicity of the cosine similarity θ as a function of the noise level ∆, see Fig. 1 (right). While it may at first seem curious that the DAE seemingly does not manage to learn the data structure better for low ∆ than for intermediate ∆ (where the cosine similarity θ is observed to be higher), this is actually due to the aforediscussed tradeoff. Indeed, for small ∆, the data is still substantially clean, and there is therefore no incentive to enhance the contrast by using the cluster means, which are consequently not learnt. This phase is thus characterized by a large skip connection strength b̂, and small cosine similarity θ and weight norm ||ŵ||_F. Conversely, at high noise levels ∆, the nuances of the data are already lost because of the noise. Hence the DAE does not rely on the skip connection component (whence the small values of b̂), and the only way to produce reasonably denoised data is to collapse to the cluster mean using the network component (whence a large ||ŵ||_F).

5 Conclusion

We consider the problem of denoising a high-dimensional Gaussian mixture by training a DAE via empirical risk minimization, in the limit where the number of training samples and the dimension are proportionally large. We provide a sharp asymptotic characterization of a number of summary statistics of the trained DAE weights, of the average MSE, and of the cosine similarity with the cluster means. These results contain as a corollary the case of RAEs. Building on these findings, we isolate the roles of the skip connection and the bottleneck network in the DAE architecture, and characterize the tradeoff between these two components in terms of preservation of the data nuances and noise removal, thereby providing some theoretical insight into a longstanding practical intuition in machine learning. We believe the present work also opens exciting research avenues. First, our real data experiments hint at the presence of Gaussian universality. While this topic has gathered considerable attention in recent years, only classification/regression supervised learning tasks have hitherto been addressed.
Which aspects of universality carry over to denoising tasks, and how they differ from the current understanding of supervised regression/classification, is an important question. Second, the DAE with skip connection (2) provides an autoencoder model which does not just learn the principal components of the training set. It therefore affords a genuinely non-linear network model where richer learning settings can be investigated.

Acknowledgements We thank Eric Vanden-Eijnden for his wonderful lecture at the Les Houches Summer School of July 2022, which inspired parts of this work. We thank Maria Refinetti and Sebastian Goldt for insightful discussions at the very early stages of this project.

References

[1] Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021. [2] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020. [3] Pascal Vincent, H. Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371-3408, 2010. [4] Reza Oftadeh, Jiayi Shen, Zhangyang Wang, and Dylan A. Shell. Eliminating the invariance on the loss landscape of linear autoencoders. In International Conference on Machine Learning, 2020. [5] Daniel Kunin, Jonathan M. Bloom, Aleksandrina Goeva, and Cotton Seed. Loss landscapes of regularized linear autoencoders. In International Conference on Machine Learning, 2019. [6] Xuchan Bao, James Lucas, Sushant Sachdeva, and Roger B. Grosse. Regularized linear autoencoders recover the principal components, eventually. Advances in Neural Information Processing Systems, 33:6971-6981, 2020. [7] Gauthier Gidel, Francis R. Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in deep linear neural networks. In Neural Information Processing Systems, 2019. [8] Maria Refinetti and Sebastian Goldt. The dynamics of representation learning in shallow, non-linear autoencoders. In International Conference on Machine Learning, pages 18499-18519. PMLR, 2022. [9] A. E. Shevchenko, Kevin Kögler, Hamed Hassani, and Marco Mondelli. Fundamental limits of two-layer autoencoders, and achieving them with gradient methods. arXiv, abs/2212.13468, 2022. [10] Carl Eckart and G. Marion Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211-218, 1936. [11] Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59:291-294, 1988. [12] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53-58, 1989. [13] Phan-Minh Nguyen. Analysis of feature learning in weight-tied autoencoders via the mean field lens. arXiv, abs/2102.08373, 2021. [14] Arnu Pretorius, Steve Kroon, and Herman Kamper. Learning dynamics of linear denoising autoencoders. In International Conference on Machine Learning, 2018. [15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86:2278-2324, 1998. [16] Han Xiao, Kashif Rasul, and Roland Vollgraf.
Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Ar Xiv, abs/1708.07747, 2017. [17] Adityanarayanan Radhakrishnan, Mikhail Belkin, and Caroline Uhler. Overparameterized neural networks implement associative memory. Proceedings of the National Academy of Sciences of the United States of America, 117:27162 27170, 2019. [18] Saachi Jain, Adityanarayanan Radhakrishnan, and Caroline Uhler. A mechanism for producing aligned latent spaces with autoencoders. Ar Xiv, abs/2106.15456, 2021. [19] Thanh Van Nguyen, Raymond K. W. Wong, and Chinmay Hegde. On the dynamics of gradient descent for autoencoders. In International Conference on Artificial Intelligence and Statistics, 2019. [20] Thanh Van Nguyen, Raymond K. W. Wong, and Chinmay Hegde. Benefits of jointly training autoencoders: An improved neural tangent kernel analysis. IEEE Transactions on Information Theory, 67:4669 4692, 2019. [21] Giorgio Parisi. Towards a mean field theory for spin glasses. Phys. Lett, 73(A):203 205, 1979. [22] Giorgio Parisi. Order parameter for spin glasses. Phys. Rev. Lett, 50:1946 1948, 1983. [23] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65:453 552, 2015. [24] Marylou Gabrié. Mean-field inference methods for neural networks. Journal of Physics A: Mathematical and Theoretical, 53, 2019. [25] Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and general, 21(1):271, 1988. [26] Manfred Opper and David Haussler. Generalization performance of bayes optimal classification algorithm for learning a perceptron. Physical Review Letters, 66(20):2677, 1991. [27] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences of the United States of America, 116:5451 5460, 2017. [28] Benjamin Aubin, Florent Krzakala, Yue M. Lu, and Lenka Zdeborová. Generalization error in high-dimensional perceptrons: Approaching bayes error with convex optimization. Advances in Neural Information Processing Systems, 33:12199 12210, 2020. [29] Benjamin Aubin, Antoine Maillard, Jean Barbier, Florent Krzakala, Nicolas Macris, and Lenka Zdeborová. The committee machine: computational to statistical gaps in learning a two-layers neural network. Journal of Statistical Mechanics: Theory and Experiment, 2019, 2018. [30] Rainer Dietrich, Manfred Opper, and Haim Sompolinsky. Statistical mechanics of support vector networks. Physical review letters, 82(14):2975, 1999. [31] Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, 2020. [32] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. Journal of Statistical Mechanics: Theory and Experiment, 2021, 2020. [33] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Error rates for kernel classification under source and capacity conditions. Ar Xiv, abs/2201.12655, 2022. [34] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized m-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592 5628, 2018. [35] Jacob A. 
Zavatone-Veth, William Tong, and Cengiz Pehlevan. Contrasting random and learned features in deep bayesian linear regression. Physical review. E, 105 6-1:064118, 2022. [36] Hugo Cui, Florent Krzakala, and Lenka Zdeborová. Optimal learning of deep random networks of extensive-width. Ar Xiv, abs/2302.00375, 2023. [37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234 241. Springer, 2015. [38] Xiao-Jiao Mao, Chunhua Shen, and Yubin Yang. Image restoration using convolutional autoencoders with symmetric skip connections. Ar Xiv, abs/1606.08921, 2016. [39] Tong Tong, Gen Li, Xiejie Liu, and Qinquan Gao. Image super-resolution using dense skip connections. In Proceedings of the IEEE international conference on computer vision, pages 4799 4807, 2017. [40] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646 1654, 2016. [41] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637 1645, 2016. [42] Bradley Efron. Tweedie s formula and selection bias. Journal of the American Statistical Association, 106:1602 1614, 2011. [43] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Co RR, abs/1412.6980, 2014. [44] Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. The gaussian equivalence of generative models for learning with shallow neural networks. In Mathematical and Scientific Machine Learning, 2020. [45] Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69:1932 1964, 2020. [46] Andrea Montanari and Basil N Saeed. Universality of empirical risk minimization. In Conference on Learning Theory, pages 4310 4312. PMLR, 2022. [47] Bruno Loureiro, Cédric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. Journal of Statistical Mechanics: Theory and Experiment, 2022, 2021. [48] Bruno Loureiro, Gabriele Sicuro, Cédric Gerbelot, Alessandro Pacco, Florent Krzakala, and Lenka Zdeborová. Learning gaussian mixtures with generalised linear models: Precise asymptotics in high-dimensions. In Neural Information Processing Systems, 2021. A Derivation of Conjecture 3.3 In this section, we detail the derivation of Conjecture (3.3). A.1 Derivation technique In order to access sharp asymptotic characterizations for the metrics and summary statistics mse (5),θ (6), ˆb and ˆ w /d, the first step is to observe that any test function φ( ˆw,ˆb) of the learnt network parameters ˆw,ˆb can be written as an average over a limit probability distribution ED h φ( ˆw,ˆb) i = lim β ED 1 Z Z dwdbφ(w, b)e β ˆ R(w,b) (16) where ˆR( , ) is the empirical risk (4) and the corresponding probability density Pβ(w, b) = e β ˆ R(w,b) is known as the Boltzmann measure in statistical physics, with the parameter β typically referred to as the inverse temperature. 
The normalization Z is called the partition function of Pβ: Z = Z dwdbe β ˆ R(w,b). (18) The idea of the replica method [21, 22, 23, 24], building on the expression (16), is to compute the moment generating function (also known as the free entropy) as ED ln Z = lim s 0 ln [EDZs] With the moment generating function, it then becomes possible to compute averages like (16). The backbone of the derivation, which we detail in the following subsections, therefore lies in the computation of the replicated partition function EDZs. A.2 Replicated partition function For subsequent clarity, we shall introduce the variable ηµ xµ µk (20) if xµ belongs to the k th cluster. By definition of the Gaussian mixture model, this is just a Gaussian variable: ηµ N(0, Σk), and the average over the data set {xµ} reduces to averaging over the cluster index k and {ηµ}. For notational brevity, we shall name in this appendix ξµ what is called ξµ in the main text (i.e., absorbing the variance in the definition of ξµ). The replicated partition function then reads EDZs = Z s Y a=1 dwadbae β s P k=1 ρk Eη,ξ e β µk+η ba ( 1 (µk+η)+ξ)+ w a d σ wa( 1 (µk+η)+ξ) | {z } ( ) . (21) One can expand the exponent ( )as " waw a d σ wa( 1 (µk+η)+ξ) 2σ wa( 1 (µk+η)+ξ) 1 )(µk+η) baξ) a=1[(1 1 ba)2( µk 2+ η 2+2µ k η)+b2 a ξ 2+2ba( 1 ba 1)(µ k ξ+η ξ)]. (22) Es,η,ξ( ) = a=1 ( 1 ba 1)2 Z dηdξ (2π Σ 1 k +β s P a=1 ( 1 ba 1)2Id 2 ξ 2 1 +β s P a=1((1 1 ba)2µ k η+ba( 1 ba 1)µ k ξ+ba( 1 ba 1)η ξ) | {z } Peff.(η,ξ) " waw a d σ wa( 1 (µk+η)+ξ) 2σ wa( 1 (µk+η)+ξ) 1 )(µk+η) baξ) The effective prior over the noises η, ξ is Gaussian with means µξ, µη and covariance C: Peff.(η, ξ) = N η, ξ; µη, µξ, C 1 ηη C 1 ξη Id C 1 ξη Id C 1 ξξ Id Identifying the covariance components leads to a=1 b2 a, C 1 ηη = Σ 1 k + β 1 ba 1)2Id, C 1 ξη = β 1 ba 1). (25) Note that all the sums indexed by a introduce terms of O(s), which can safely be neglected. Identifying the means leads to the following system of equations: C 1 ξξ µξ + C 1 ξη µη = β s P C 1 ηη µη + C 1 ξη µξ = β s P 1 ba)2µk , (26) which is solved as µξ = β C 1 ηη C 1 ξξ (C 1 ξη )2Id s X C 1 ηη ba(1 1 ba) + C 1 ξη (1 | {z } gµ ξ µη = β C 1 ηη C 1 ξξ (C 1 ξη )2Id s X 1 ba)2 + C 1 ξη ba(1 | {z } gµ η Again, observe that gµ η , gµ ξ = O(s) and will be safely neglected later in the derivation. Therefore Es,η,ξ( ) = a=1 ( 1 ba 1)2+ 1 2 C 1 ξξ µξ 2+ 1 2 µ η C 1 ηη µη+C 1 ξη µ ξ µη det Σk d 2 det(C 1 ηη C 1 ξξ (C 1 ξη )2Id) 1 2 " waw a d σ wa( 1 (µk+η)+ξ) 2σ wa( 1 (µk+η)+ξ) 1 )(µk+η) baξ) Peff.(η,ξ) | {z } (a) We are now in a position to introduce the local fields: λξ a = wa(ξ µξ) d , λη a = wa(η µη) d , hk a = waµk with statistics λη aλη b waΣkw b d , λξ aλη b = Cξη waw b d 0, λξ aλξ b waw b d . (30) We used the leading order of the covariances Cξξ,ηη,ηξ. One therefore has to introduce the summary statistics: Qab = waw b d Rp p, Sk ab = waΣkw b d Rp p, mk a = waµk Note that the local fields λη,ξ a thus follow a Gaussian distribution: (λξ a, λη a)s a=1 N 0, Sk 0 0 Q Going back to the computation, (a) can be rewritten as a=1 Tr Qaaσ(mk a( 1 (1+gµ η )+gµ ξ )+ 1 λη a+λξ a) 2 2σ(mk a( 1 (1+gµ η )+gµ ξ )+ 1 λη a+λξ a) (mk a((1 1 ba)(1+gµ η ) bagµ ξ )+(1 1 ba)λη a baλξ a) {λξ a,λη a}s a=1 . 
(33) A.3 Reformulating as a saddle-point problem Introducing Dirac functions enforcing the definitions of Qab, ma brings the replicated function in the following form: EZs = Z s Y a,b d Qabd ˆQab a=1 dmad ˆma a,b d Sk abd ˆSk ab a b Qab ˆ Qab d K P a b Sk ab ˆSk ab K P a dmk a ˆmk a | {z } eβsdΨt Z s Y a=1 dwae β P a b ˆ Qabwaw b + K P a b ˆSk abwaΣkw b + K P | {z } eβsdΨw k=1 ρk e O(s2)e β a=1 ( 1 ba 1)2 det Σk d 2 det(C 1 ηη C 1 ξξ O(s2)) 1 2 (a) eαsd2Ψquad.+βsdΨy We introduced the trace, entropic and energetic potentials Ψt, Ψw, Ψy. Since all the integrands scale exponentially (or faster) with d, this integral can be computed using a saddle-point method. To proceed further, note that the energetic term encompasses two types of terms, scaling like sd2 and sd. More precisely, a=1 ( 1 ba 1)2 det Σk d 2 det(C 1 ηη C 1 ξξ ) 1 2 (a) k=1 ρk 1 sd ln det Σk d 2 det(C 1 ηη C 1 ξξ ) 1 2 +s k=1 ρk 1 s ln(a) β k=1 ρk µk 2 s X 1 ba 1)2 #αd sd ln det Σk d 2 det(C 1 ηη C 1 ξξ ) 1 2 +sαd K P s ln(a) . (35) Note that the saddle-point method involves an intricate extremization over s s matrices Q, Sk. To make further progress, we assume the extremum is realized at the Replica Symmetric (RS) fixed point a, mk a = mk, ˆmk a = ˆmk a, b, Qab = (r q)δab + q, ˆQab = ˆr 2 + ˆq δab + ˆq a, b, Sk ab = (rk qk)δab + qk, ˆSk ab = ˆrk δab + ˆqk (36) This is a standard assumption known as the RS ansatz [21, 22, 23, 24]. Quadratic potential The quadratic potential Ψquad., which correspond to the leading order term in d in the exponent, needs to be extremized first in the framework of our saddle-point analysis. Its expression can be simplified as k=1 ρk ln det Σk d 2 det(C 1 ηη C 1 ξξ ) 1 2 = 1 k=1 ρk ln det h (1 + βsb2) Id + Σkβs( k=1 ρk Tr h b2Id + Σk( k=1 ρk Tr Σk which is extremized for k=1 ρk Tr Σk k=1 ρk Tr Σk (1 ) + . (38) This fixes the skip connection strength b. Entropic potential We now turn to the entropic potential Ψw. It is convenient to introduce the variance order parameters ˆV ˆr + ˆq, ˆVk ˆrk + ˆqk. (39) The entropic potential can then be expressed as a=1 dwae β P a=1 Tr[ ˆV waw a ]+ 1 a,b Tr[ˆqwaw b ]+ ˆm K P d ˆm k w a µk a=1 Tr[ ˆVkwaΣkw a ]+ 1 a,b Tr[ˆqkwaΣkw b ] Z dwe βg(w) 1 2 Tr ˆV ww + K P k=1 ˆVkwΣkw + K P d ˆmkµ + K P k=1 Ξk (ˆqk Σk) 1 2 +Ξ0 (ˆq Id) 1 2 w Z dwe βg(w) 1 2 w ˆV Id+ K P d ˆmkµ + K P k=1 Ξk (ˆqk Σk) 1 2 +Ξ0 (ˆq Id) 1 2 w Z dwe βg(w) 1 2 w ˆV Id+ K P d ˆmkµ + K P k=1 Ξk (ˆqk Σk) 1 2 +Ξ0 (ˆq Id) 1 2 w For a matrix Ξ Rp d and tensors A, B Rp d Rp d, we denoted (Ξ A)kl = P ij Ξij Aij,kl and (A B)ij,kl = P rs Aij,rs Brs,kl. Energetic potential In order to compute the energetic potential Ψy, one must first compute the inverse of the covariance Ωk, given by Ω 1 k = S 1 k 0 0 1 Q 1 It is straightforward to see that S 1 k , Q 1 share the same block structure as Sk, Q, and we shall name for clarity S 1 k ab = ( rk qk) δab + qk, Q 1 ab = ( r q) δab + q. (43) From the Sherman-Morisson lemma, and noting Vk rk qk Rp p, V r q Rp p, (44) one reaches r = V 1 (V + sq) 1q V 1 q = (V + sq) 1q V 1 , rk = V 1 k (Vk + sqk) 1qk V 1 k qk = (Vk + sqk) 1qk V 1 k . (45) We will also need the expression of the determinants ln det Sk = s det Vk + s Tr V 1 k qk , Q = s det V + s Tr V 1q . 
(46) One can finally proceed in analyzing the term (a): a=1 dλη adλξ a (2π)ps det Q det Sk a=1 (λη a) ( rk qk)λη a 1 a,b (λη a) qkλb 1 a=1 (λξ a) ( r q)λξ a 1 a,b (λξ a) qλb e β (2π)ps det Q det Sk Rp dληdλξe 1 2 λ η V 1 k λη+λ η V 1 k q 1 2 k η 1 2 λ ξ V 1λξ+ 1 λ ξ V 1q 1 2 ξ β Rp DξDη Z N λξ; q 1 2 ξ, V N λη; q 1 2 k ξ, Vk e β Rp DξDη ln Z N λξ; q 1 2 ξ, V N λη; q 1 2 k ξ, Vk e β where we noted with capital D an integral over a p dimensional Gaussian distribution N(0, Ip). Therefore Rp DξDη ln Z N λξ; q 1 2 ξ, V N λη; q 1 2 k ξ, Vk e β 2 ( ) . (48) A.4 zero-temperature limit In this subsection, we take the zero temperature β limit. Rescaling 1 β V V, β ˆV ˆV , β2ˆq ˆq, 1 β Vk Vk, β ˆVk ˆVk, β2ˆqk ˆqk, β ˆmk ˆmk, (49) one has that k=1 ˆqk Σk + d where we introduced the Moreau enveloppe 2 w ˆV Id+ K P 2 (ˆq Id) 1 2 Ξ0+ K P k=1 (ˆqk Σk) 1 2 Ξk+ Note that while this optimization problem is still cast in a space of dimension p d, for some regularizers g( ) including the usual ℓ1,2 penalties, the Moreau enveloppe admits a compact analytical expression. The energetic potential Ψy also simplifies in this limit to k=1 ρk Eη,ξMk(η, ξ), (52) Mk(ξ, η) = 1 Tr V 1 k y q 1 2 k η mk 2 + 1 1 y + x) 2i 2σ( 1 y + x) ((1 It is immediate to see that the trace potential Ψt can be expressed as Ψt = Tr h ˆV q i Tr[ˆq V ] k=1 (Tr h ˆVkqk i Tr[ˆqk Vk]) k=1 mk ˆmk. (54) Putting these results together, the total free entropy reads Φ = extr q,m,V,ˆq, ˆm, ˆV ,{qk,mk,Vk,ˆqk, ˆmk, ˆVk}K k=1 Tr h ˆV q i Tr[ˆq V ] k=1 (Tr h ˆVkqk i Tr[ˆqk Vk]) k=1 ρk Eξ,ηMk(ξ, η) 1 k=1 ˆqk Σk + d where we remind that Mk(ξ, η) = 1 Tr V 1 k y q 1 2 k η mk 2 + Tr V 1 x q 1 2 ξ 2 1 y + x) 2i 2σ( 1 y + x) ((1 2 w ˆV Id+ K P 2 (ˆq Id) 1 2 Ξ0+ K P k=1 (ˆqk Σk) 1 2 Ξk+ A.5 Self-consistent equations The extremization problem (58) can be recast as a system of self-consistent equations by requiring the gradient with respect to each summary statistic involved in (58) to be zero. This translates into: ˆqk = αρk Eξ,ηV 1 k proxk y q 1 2 k η mk 2 V 1 k ˆVk = αρkq 1 2 k Eξ,ηV 1 k proxk y q 1 2 k η mk η ˆmk = αρk Eξ,ηV 1 k proxk y q k=1 ρk Eξ,ηV 1 proxk x q 1 2 ξ 2 V 1 k=1 ρk Eξ,η 2 V 1 proxk x q 1 2 ξ ξ σ 1 proxy + proxx 2 # d EΞ h proxr (ˆqk Σk) 1 2 (Ip Σk) Ξ k i d EΞ proxrΣkprox r d Eξ,η [proxrµk] d EΞ h proxr (ˆq Id) 1 d EΞ proxrprox r We remind that the skip connection b is fixed to k=1 ρk Tr Σk k=1 ρk Tr Σk (1 ) + . (60) A.6 Sharp asymptotic formulae for the learning metrics The previous subsections have allowed to obtain sharp asymptotic characterization for a number of summary statistics of the probability (17). These statistics are in turn sufficient to asymptotically characterize the learning metrics discussed in the main text. We successively address the test MSE (5), cosine similarity (6) and training MSE. MSE The denoising MSE reads mse = Ek,η,ξ 1 (µk + η) + ˆb 1 (µk + η) + k=1 ρk EN(0,1) z h Tr h qσ ( 1 mk+ q+(1 )qkz) 2ii k=1 ρk EN(0,1) u,v h v i ((1 b 1 )mk+(1 b 1 ) qku b q mse = d b2 + (1 µk 2 + Tr Σk ! Cosine similarity By the very definition of the summary statistics mk, q: θik ˆwiµk ˆw µk = (mk)i qii µk (63) 0.0 0.2 0.4 0.6 0.8 1.0 theory simulations 0.0 0.2 0.4 0.6 0.8 1.0 theory simulations Figure 5: (left) Training MSE for the full DAE (2) (p = 1, σ = tanh). Solid lines represent the sharp asymptotic formula (68); dots correspond to simulation, training the DAE with the Pytorch implementation of full-batch Adam, over T = 2000 epochs using learning rate η = 0.05 and weight decay λ = 0.1. 
The data was averaged over N = 5 instances; error bars are smaller than the point size. (right) Generalization gap mse ˆ f ϵt. Solid lines correspond to the asymptotic prediction of Conjecture 3.3 (for the test MSE) and of (68) (for the train MSE), while dots correspond to simulations. Error bars represent one standard deviation. The Gaussian mixture is the isotropic binary mixture, whose parameters are specified in the caption of Fig. 1 in the main text. Training error The replica formalism also allows to compute a sharp asymptotic characterization for the training MSE xµ ˆf( xµ) 2 # We have skipped the discussion of this learning metric in the main text for the sake of conciseness, but will now detail the derivation. Note that the training error ϵt can be deduced from the risk (4) and the average regularization as 1 2ϵt = ED h ˆR( ˆw,ˆb) i ED [r( ˆw)] . (65) Note that in turn, from the definition of the free entropy, the average risk (training loss) can be computed as ED h ˆR( ˆw,ˆb) i = lim β 1 β ln Z β = lim β 1 β Φ Doing so reveals ϵt = 2 lim β (Ψy dΨquad β 1 b 1)2)) β = mse + lim β R N λξ; q 1 2 ξ, Vk N λη; q 1 2 k ξ, V ( )e β R N λξ; q 1 2 ξ, Vk N λη; q 1 2 k ξ, V e β 2 ( ) . (67) Taking the β limit finally leads to k=1 ρk Eξ,η 1 proxy + proxx) 2i 1 proxy + proxx) ((1 1 ˆb)proxy ˆbproxx) The sharp asymptotic formula (68) agrees well with numerical simulations, Fig. 5. As is intuitive, the generalization gap mse ˆ f ϵt grows with the noise level , as the learning problem grows harder. A.7 Generic asymptotically exact formulae This last result ends the derivation of the generic version of Conjecture 3.3, i.e. not assuming Assumptions 3.1 and 3.2. Importantly, note that the characterization (58) is, like the formulae in e.g. [47, 48], asymptotically exact, but not fully asymptotic, as the equations (58) still involve high-dimensional quantities. In practice however, for standard regularizers r( ) like ℓ1 or ℓ2, the proximal proxr admits a simple closed-form expression, and the average over Ξ can be carried out analytically. The only high-dimensional operation left is then taking the trace of linear combinations of Σk matrices (or the inverse of such combinations), which is numerically less demanding, and analytically much simpler, than the high dimensional optimization (4) and averages (5) involved in the original problem. We give in the next paragraph an example of how Assumption 3.1 can be relaxed, for a binary mixture with arbitrary (i.e. not necessarily jointly diagonalizable) covariances Σ1,2. Example: Anisotropic, heteroscedastic binary mixture As an illustration, we provide the equivalent of (58) for a binary mixture, with generically anisotropic and distinct covariances, for p = 1, thereby breaking free from Assumption 3.1. We index the clusters by +, rather than 1, 2 for notational convenience. We remind that M (ξ, η) = 1 ξ)2+ 1 V (y q η m)2+qσ( 1 y+x)2 2σ( 1 y+x)((1 1 b)y bx) o . 
(69) Then the self-consistent equations (58) simplify to d Tr (λ + ˆV )Id + ˆV+Σ+ + ˆV Σ 2 ˆq Id + ˆq+Σ+ + ˆq Σ + ˆm2dµµ m = ˆm Tr (λ + ˆV )Id + ˆV+Σ+ + ˆV Σ 1 µµ d Tr λ + ˆV )Id + ˆV+Σ+ + ˆV Σ 1 d Tr ((λ+ ˆV )Id+ ˆV+Σ++ ˆV Σ ) 1Σ ((λ+ ˆV )Id+ ˆV+Σ++ ˆV Σ ) 1(ˆq Id+ˆq+Σ++ˆq Σ + ˆm2dµµ ) d Tr Σ (λ + ˆV )Id + ˆV+Σ+ + ˆV Σ 1 ˆm = α h ρ 1 V+ Eξ,η prox+ y q+η m (1 ρ) 1 V Eξ,η prox y q η + m i ˆq = α V 2 h ρEξ,η(prox+ x q ξ)2 + (1 ρ)Eξ,η(prox x q ξ(prox+ x q 1 prox+ y + prox+ x )2 # +(1 ρ)Eξ,η h ξ(prox x q 1 prox y + prox x )2i # V 2 + Eξ,η prox+ y q+η m 2 ˆq = α(1 ρ) 1 V 2 Eξ,η prox y q η + m 2 ˆV+ = αρ 1 q+V+ Eξ,η prox+ y q+η m η ˆV = α(1 ρ) 1 q V Eξ,η prox y q η + m η These equations are, as previously discussed, asymptotically exact as d . While they still involve traces over high-dimensional matrices Σk, this is a very simple operation. We have therefore reduced the original high-dimensional optimization problem (4) to the much simpler one of computing traces like (70). Crucially, while these traces cannot be generally simplified without Assumption 3.1, they provide the benefit of a simple and compact expression which bypasses the need of jointly diagonalizable covariances, and which can thus be readily evaluated for any covariance, like real-data covariances. This versatility is leveraged in Appendix D. A.8 End of the derivation: Conjecture 3.3 We finally provide the last step in the derivation of the fully asymptotic characterization of Conjecture 3.3. Assuming r( ) = λ/2 2 F (Assumption 3.2), the Moreau envelope Mr assumes a very simple expression, and no longer needs to be written as a high-dimensional optimization problem. The resulting free energy can be compactly written as: Φ = extr q,m,V,ˆq, ˆm, ˆV ,{qk,mk,Vk,ˆqk, ˆmk, ˆVk}K k=1 Tr h ˆV q i Tr[ˆq V ] k=1 (Tr h ˆVkqk i Tr[ˆqk Vk]) k=1 ρk Eξ,ηMk(ξ, η) λIp Id + ˆV Id + k=1 ˆqk Σk + d ( ) is the only segment which still involves high-dimensional matrices, and is therefore not yet full asymptotic. Assuming Assumption 3.1, ( ) can be massaged into λ + ˆV + λk i ˆVk 1 k=1 λk i ˆqk + X 1 k,j K e i µj ˆmj ˆm k µ k ei Z dν(γ, τ) Tr λ + ˆV + γk ˆVk 1 k=1 γkˆqk + X 1 k,j K τjτk ˆmj ˆm k The fully asymptotic free energy thus becomes Φ = extr q,m,V,ˆq, ˆm, ˆV ,{qk,mk,Vk,ˆqk, ˆmk, ˆVk}K k=1 Tr h ˆV q i Tr[ˆq V ] k=1 (Tr h ˆVkqk i Tr[ˆqk Vk]) k=1 ρk Eξ,ηMk(ξ, η) Z dν(γ, τ) Tr λ + ˆV + γk ˆVk 1 k=1 γkˆqk + X 1 k,j K τjτk ˆmj ˆm k Requiring the free energy to be extremized implies ˆqk = αρk Eξ,ηV 1 k proxk y q 1 2 k η mk 2 V 1 k ˆVk = αρkq 1 2 k Eξ,ηV 1 k proxk y q 1 2 k η mk η ˆmk = αρk Eξ,ηV 1 k proxk y q k=1 ρk Eξ,ηV 1 proxk x q 1 2 ξ 2 V 1 k=1 ρk Eξ,η 2 V 1 proxk x q 1 2 ξ ξ σ 1 proxk y + proxk x 2 # qk = R dν(γ, τ)γk λIp + ˆV + K P j=1 γj ˆqj + P 1 j,l K τjτl ˆmj ˆm l Vk = R dν(γ, τ)γk λIp + ˆV + K P mk = R dν(γ, τ)τk λIp + ˆV + K P q = R dν(γ, τ) λIp + ˆV + K P j=1 γj ˆqj + P 1 j,l K τjτl ˆmj ˆm l V = R dν(γ, τ) λIp + ˆV + K P which is exactly (13). This closes the derivation of Conjecture 3.3. As a final heuristic remark, observe that from (74) it is reasonable to expect that all the hatted statistics ˆqk, ˆVk, ˆmk, ˆq, ˆV should be of order Θ(α) as α 1. As a consequence, Vk = Θ(1/α) and q, m, qk, mk should approach their α limit as Θ(1/α). Therefore, the MSE (61) should also approach its α limit as Θ(1/α). 
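In practice, self-consistent equations of this type are solved by iterating them until convergence. The following sketch illustrates the kind of damped fixed-point scheme we have in mind; the two update callables are hypothetical placeholders standing for the right-hand sides of (70) (or of the fully asymptotic equations (74)) and are not part of the derivation itself.

```python
import numpy as np

def solve_saddle_point(update_hat, update_overlaps, init,
                       damping=0.7, tol=1e-8, max_iter=5000):
    """Damped fixed-point iteration for the self-consistent equations.

    update_hat(overlaps)    -> hatted quantities (the proximal / Gaussian averages)
    update_overlaps(hatted) -> overlaps (the traces over the covariances)
    Both callables are placeholders for the right-hand sides of the equations.
    """
    overlaps = dict(init)
    for _ in range(max_iter):
        hatted = update_hat(overlaps)
        new = update_overlaps(hatted)
        # largest change over all summary statistics (scalars or small matrices)
        err = max(float(np.max(np.abs(np.asarray(new[k]) - np.asarray(overlaps[k]))))
                  for k in new)
        # damped update to stabilise the iteration
        overlaps = {k: damping * overlaps[k] + (1.0 - damping) * new[k] for k in new}
        if err < tol:
            break
    return overlaps
```

For p = 1 and the ℓ2 regularization of Assumption 3.2, the updates only involve scalar summary statistics and low-dimensional Gaussian averages, which can be evaluated by standard quadrature.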
A.9 Additional comparisons

We close this appendix by providing further discussion of some points, including the role of the non-linearity σ(·), the influence of the weight-tying assumption, and of the (an-)isotropy and (homo/hetero-)scedasticity of the clusters.

The activation function The figures in the main text were generated using σ = tanh. Several studies for RAEs, however, have highlighted that an auto-encoder optimally seeks to place itself in the linear region of its non-linearity, so as to learn the principal components of the training data, at least in the non-linear, untied-weights case, see e.g. [3, 8]. In light of these findings, it is legitimate to wonder whether setting a linear activation σ(x) = x would not yield a better MSE. Fig. 6 shows that this is not the case: for the binary isotropic mixture (13), the linear activation yields a worse performance than the tanh activation.

Weight-tying We have assumed that the weights of the DAE (2) were tied. While this assumption is made mainly for technical reasons in the derivation, note that it has also been used and discussed in practical settings [3] as a way to prevent the DAE from going to the linear region of its non-linearity, by making the norm of the first-layer weights very small and that of the second layer very large to compensate [8]. A full extension of Conjecture 3.3 to the case of untied weights is a lengthy theoretical endeavour, which we leave for future work. However, note that Fig. 6 shows, in the binary isotropic case, that untying the weights actually leads to a worse MSE than the DAE with tied weights. This is a very interesting observation, as a DAE with untied weights obviously has the expressive power to implement a DAE with tied weights. Therefore, this effect should be mainly due to the untied loss landscape presenting local minima trapping the optimization. A full landscape analysis would be a very important topic for future research. Finally, remark that this stands in sharp contrast to the observation of [8] for non-linear RAEs, where weight-tying worsened the performance, as it prevents the RAE from implementing a linear principal component analysis.

Figure 6: Same parameters as Fig. 1 in the main text. α = 1, K = 2, ρ1,2 = 1/2, Σ1,2 = 0.3 Id, p = 1; the cluster mean µ1 = −µ2 was taken as a random Gaussian vector of norm 1. Difference in MSE between the full DAE ˆf (2) and the rescaling network ˆr (3) for σ = tanh (blue) or σ(x) = x (red). Solid lines correspond to the sharp asymptotic characterization (13), which is a particular case of Conjecture 3.3. Dots represent numerical simulations for d = 700, training the DAE using the Pytorch implementation of full-batch Adam [43], with learning rate η = 0.05 over 2000 epochs, averaged over N = 2 instances. Error bars represent one standard deviation. In dashed green, numerical simulations for the same architecture, but with untied weights; the learning parameters were left unchanged.

An important conclusion from these two observations is that, in contrast to RAEs, the DAE (2) does not just implement a linear principal component analysis: weight tying and the presence of a non-linearity, which are obstacles for RAEs in reaching the PCA MSE, lead for DAEs (2) to a better MSE. The model (2) therefore constitutes a model of a genuinely non-linear algorithm, where the non-linearity is helpful and is not undone during training. Further discussion can be found in Section 4 of the main text.
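For concreteness, the tied-weight architecture with a skip connection that is simulated throughout this appendix can be sketched in Pytorch as follows. This is a minimal illustration only: the 1/√d normalisations, the initialisation, and the parameter names are choices made here for readability and may differ from the exact conventions of (2) and of the released code.

```python
import torch
import torch.nn as nn

class TiedDAE(nn.Module):
    """Two-layer DAE with tied weights and a trainable skip connection.

    Minimal sketch: the 1/sqrt(d) scalings are illustrative and may differ
    from the exact conventions of the architecture (2).
    """
    def __init__(self, d, p=1):
        super().__init__()
        self.w = nn.Parameter(torch.randn(p, d) / d ** 0.5)  # shared first/second layer weights
        self.b = nn.Parameter(torch.tensor(0.5))              # skip-connection strength

    def forward(self, x_tilde):
        d = x_tilde.shape[-1]
        h = torch.tanh(x_tilde @ self.w.T / d ** 0.5)          # bottleneck, p hidden units
        return self.b * x_tilde + h @ self.w / d ** 0.5        # skip connection + tied decoder

# Training on (clean, noisy) pairs with the denoising square loss, e.g.:
#   opt  = torch.optim.Adam(model.parameters(), lr=0.05, weight_decay=0.1)
#   loss = ((model(x_tilde) - x_clean) ** 2).sum(dim=1).mean()
```

The untied variant used for the dashed green curve of Fig. 6 is obtained by giving the decoder its own, independent weight matrix, all other training parameters being kept identical.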
Heteroscedasticity, anisotropy Figs. 1 and 3 in the main text all represent the isotropic, homoscedastic case. This simple model actually encapsulates all the interesting phenomenology. In this paragraph, we discuss for completeness the generically heteroscedastic, anisotropic case. We consider a binary mixture, with covariances Σ1, Σ2 independently drawn from the Wishart-Laguerre ensemble, with aspect ratios 5/7 and 5/6. Therefore, the clusters are anisotropic, and the eigendirections associated with the largest eigenvalues are more "stretched" (i.e. induce a higher cluster variance). Furthermore, since the sets of eigenvectors of Σ1, Σ2 are independent, the two clusters are stretched in different directions. To ease the comparison with Fig. 1 (isotropic, homoscedastic), these Wishart matrices were further divided by 10, so that the trace is approximately 0.09 as in Fig. 1, i.e. the clusters have the same average extension. Fig. 7 presents the resulting metrics and summary statistics. Again, the agreement between the theory of Conjecture 3.3 and numerical simulations using Pytorch is compelling. Qualitatively, the observations made in Section 4 still hold true in the anisotropic, heteroscedastic case: the curve of the cosine similarity still displays a non-monotonic behaviour, signalling the preferential activation by the DAE of its skip connection at low Δ and of its network component at high Δ; the skip connection strength ˆb is higher at small Δ and decreases with Δ, in connection with the previous remark; the norm of the weight matrix ˆw overall increases with the noise level Δ, signalling an increasing usage by the DAE of its bottleneck network component, again in accordance with the previous remark. Therefore, the generic case does not introduce qualitative changes compared to the isotropic case.

Figure 7: α = 1, K = 2, ρ1,2 = 1/2, p = 1, λ = 0.1, σ = tanh; the cluster mean µ1 = −µ2 was taken as a random Gaussian vector of norm 1. The cluster covariances were independently drawn from the Wishart-Laguerre ensemble, with aspect ratios 5/6 and 5/7, before being normalized by 10 to match the trace of the isotropic case in Fig. 1. In particular, the eigenvectors are independent, so the mixture is totally anisotropic, with different main variance directions, and heteroscedastic. (left) In blue, the difference in MSE between the full DAE ˆf (2) and the rescaling network ˆr (3). Solid lines correspond to the sharp asymptotic characterization of Conjecture 3.3 in its generic formulation (58). Dots represent numerical simulations for d = 500, training the DAE using the Pytorch implementation of full-batch Adam [43], with learning rate η = 0.05 over 2000 epochs, averaged over N = 5 instances. Error bars represent one standard deviation. (right) Cosine similarity θ (6) (green), squared weight norm ‖ˆw‖²/d (red) and skip connection strength ˆb (blue). Solid lines correspond to the formulas (11), (12) and (10) of Conjecture 3.3; dots are numerical simulations. For completeness, the cosine similarity of the first principal component of the train set is plotted in dashed black.

Strength of the weight decay λ A ℓ2 regularization (weight decay) of strength λ = 0.1 was adopted in the main text. In fact, the value of the weight decay strength was not found to sensibly influence the curves, and the qualitative phenomena discussed in Section 4 are observed for any value. Fig. 8 shows, for an isotropic binary mixture, the MSE difference mseˆf − mseˆr (5) and the cosine similarity θ (6) for regularization strengths λ = 0.1 and λ = 0.001. As can be observed, even reducing the regularization a hundredfold does not change the qualitative picture at all (in particular the non-monotonicity of the cosine similarity, discussed in Section 4), and changes the quantitative values very little. On the other hand, observe that, at the level of the MSE, increasing the weight decay helps palliate the overfitting at large noise levels Δ, as we further illustrate in Fig. 9.

Figure 8: Same parameters as Fig. 1 in the main text. α = 1, K = 2, ρ1,2 = 1/2, Σ1,2 = 0.09 Id, p = 1, σ = tanh; the cluster mean µ1 = −µ2 was taken as a random Gaussian vector of norm 1. (left) Difference in MSE between the full DAE ˆf (2) and the rescaling network ˆr (3) for λ = 0.1 (blue) or λ = 0.001 (red). Lines correspond to the sharp asymptotic characterization (13) of Conjecture 3.3. Dots represent numerical simulations for d = 700, training the DAE using the Pytorch implementation of full-batch Adam [43], with learning rate η = 0.05 over 2000 epochs, averaged over N = 2 instances. (right) Cosine similarities, for λ = 0.1 (green) and λ = 0.001 (orange); lines correspond to the asymptotic formulae of Conjecture 3.3, dots represent the numerical simulations.

Figure 9: Like Fig. 1 of the main text, α = 1, K = 2, ρ1,2 = 1/2, Σ1,2 = 0.09 Id, p = 1, σ = tanh; the cluster mean µ1 = −µ2 was taken as a random Gaussian vector of norm 1. (left) Solid lines correspond to the difference in MSE between the full DAE ˆf (2) and the rescaling component ˆr (3), as predicted by Conjecture 3.3. Different colors indicate different regularization strengths λ. For large noise levels Δ and insufficient regularization λ, the DAE overfits the train set, leading to a performance worse than the simple rescaling ˆr, as signalled by positive values of mseˆf − mseˆr. This effect is suppressed for larger regularizations λ.

K = 1 cluster We finally discuss the case of unstructured data, where the Gaussian mixture only has a single cluster. Analytically, this corresponds to setting the norm of the cluster mean µ to zero in Conjecture 3.3. The bottleneck component, whose role is to learn the data structure as given by the cluster mean and leverage it to improve the denoising performance, is no longer needed when the data is unstructured. Its presence actually leads the DAE to overfit, as can be observed in Fig. 10 (left). Similarly, the PCA denoiser performs worse than the DAE in the unstructured case as well, with an associated MSE difference remaining of order Θ(d), see Fig. 10 (right).

Figure 10: α = 1, K = 1, ρ1,2 = 1/2, Σ1 = 0.09 Id, p = 1, σ = tanh. (left) Difference mseˆf − mseˆr between the MSE of the DAE ˆf and the MSE of the rescaling ˆr in the case of a single Gaussian cluster. Solid lines correspond to the theoretical prediction of Conjecture 3.3 and dots to numerical simulations. In the case of a single cluster, the bottleneck component of the DAE is not needed, and its presence leads to detrimental overfitting, as signalled by the positive value of mseˆf − mseˆr. (right) Difference between the MSE of the DAE ˆf and the MSE of the bottleneck network ˆu, divided by d. Like in the K > 1 case (see Fig. 3 (b)), this difference is of order Θ(d), and the MSE of PCA (green dots) is sensibly equal.
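For reference, the anisotropic covariances used for Fig. 7 can be generated along the following lines. This is a sketch only: the Wishart convention Σ ∝ XX⊤/m with aspect ratio d/m, and the overall scale chosen so that Tr Σ/d is close to the 0.09 of the isotropic case, are assumptions of this illustration rather than the exact construction used in the released code.

```python
import numpy as np

def wishart_covariance(d, aspect_ratio, scale=0.1, seed=None):
    """Covariance drawn from the Wishart-Laguerre ensemble.

    Convention (assumed): Sigma = scale * X X^T / m, with X a d x m standard
    Gaussian matrix and aspect_ratio = d / m.
    """
    rng = np.random.default_rng(seed)
    m = int(round(d / aspect_ratio))
    X = rng.standard_normal((d, m))
    return scale * (X @ X.T) / m

d = 500
Sigma_1 = wishart_covariance(d, 5 / 6, seed=0)   # first cluster
Sigma_2 = wishart_covariance(d, 5 / 7, seed=1)   # second cluster, independent eigenbasis
print(np.trace(Sigma_1) / d, np.trace(Sigma_2) / d)  # both close to 0.1
```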
B Tweedie baseline

B.1 Oracle denoiser Tweedie's formula [42] provides the best estimator of a clean sample x given a noisy sample x̃ corrupted by Gaussian noise, as

E[x | x̃] = (1/√(1−Δ)) ( x̃ + Δ ∇_x̃ ln P_x̃(x̃) ), (76)

where P_x̃ denotes the distribution of the noisy samples. Note that this gives the Bayes-optimal estimator for the MSE, assuming perfect knowledge of the clean data distribution P, from which the so-called score ∇ ln P_x̃ has to be computed. Of course, this knowledge is inaccessible in general, where the only information on P is provided in the form of the train set D. Therefore, Tweedie's formula does not give a learning algorithm, but provides an oracle lower bound on the achievable MSEs, as any learning algorithm will yield a higher MSE than this information-theoretic baseline. For a generic Gaussian mixture (1), Tweedie's formula reduces to the expression

E[x | x̃] = (1/√(1−Δ)) · [ Σ_{k=1}^K ρk ( Σ̃k(Σ̃k + Δ Id)⁻¹ x̃ + Δ(Σ̃k + Δ Id)⁻¹ µ̃k ) N(x̃; µ̃k, Σ̃k + Δ Id) ] / [ Σ_{k=1}^K ρk N(x̃; µ̃k, Σ̃k + Δ Id) ], (77)

where we noted

µ̃k = √(1−Δ) µk, Σ̃k = (1−Δ) Σk. (78)

B.2 Oracle test MSE In general, except in the completely homoscedastic case where all clusters share the same covariance, there is no closed-form asymptotic expression for the MSE achieved by the Tweedie denoiser. In the binary homoscedastic case µ1 = −µ2 ≡ µ, Σ1 = Σ2 = σ² Id (see Fig. 1), Tweedie's formula (77) reduces to the compact form

E[x | x̃] = σ²√(1−Δ)/(σ²(1−Δ) + Δ) x̃ + Δ/(σ²(1−Δ) + Δ) tanh( √(1−Δ) x̃⊤µ / (σ²(1−Δ) + Δ) ) µ. (79)

Note that this is of the same form as the DAE architecture (2). The associated MSE reads

mse = E_{x,ξ} ‖ E[x | x̃] − x ‖², with x̃ = √(1−Δ) x + √Δ ξ. (80)

A sharp asymptotic formula (81) can further be derived for (80), involving only σ, Δ and low-dimensional Gaussian averages; this is the theoretical characterization plotted in Figs. 1 and 3 in the main text. As discussed in the main text and illustrated in Fig. 3 (left), the MSE of the DAE approaches the oracle MSE (81) as the sample complexity α grows. On the other hand, the DAE MSE does not exactly converge to the oracle MSE as α → ∞. Intuitively, this is because, while the form of the oracle (79) is that of a DAE, it does not have tied weights, unlike the considered architecture (2). The latter can therefore not perfectly realize the oracle denoiser, whence a small discrepancy, as characterized numerically in Fig. 11.

Figure 11: Like Fig. 1 of the main text, K = 2, ρ1,2 = 1/2, Σ1,2 = 0.09 Id, p = 1, σ = tanh; the cluster mean µ1 = −µ2 was taken as a random Gaussian vector of norm 1. Solid lines correspond to the difference in MSE between the full DAE ˆf and the rescaling component ˆr as predicted by Conjecture 3.3; in dashed lines, the oracle MSE (81), minus the rescaling MSE mseˆr. Different colors correspond to different noise levels Δ. As the sample complexity α becomes large, the DAE MSE approaches the oracle baseline, but does not become exactly equal to it as α → ∞.

C Bayes-optimal MSE

C.1 Assuming imperfect prior knowledge The Tweedie denoiser discussed in Appendix B is an oracle denoiser, in the sense that perfect knowledge of all the parameters {µk, Σk}_{k=1}^K of the Gaussian mixture distribution P (1) is assumed. Therefore, though the comparison of the DAE (2) MSE with the oracle Tweedie MSE does provide useful intuition (see Fig. 3), it importantly does not allow to disentangle which part of the MSE of the DAE is due to the limited expressivity of the architecture, and which is due to the limited availability of training samples, which entails imperfect knowledge of the parameters {µk, Σk}_{k=1}^K of the Gaussian mixture distribution P (1).
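As an aside, the mixture formula (77)-(78) is straightforward to evaluate numerically. The following NumPy/SciPy sketch is our own direct transcription of (77) (not taken from the released code), written for readability rather than numerical stability; for large d one would work with log-densities and a log-sum-exp.

```python
import numpy as np
from scipy.stats import multivariate_normal

def tweedie_denoiser(x_noisy, delta, rhos, mus, Sigmas):
    """Evaluate the Gaussian-mixture Tweedie denoiser (77)-(78) at one point.

    x_noisy is a noisy sample x_tilde = sqrt(1-delta)*x + sqrt(delta)*xi.
    """
    d = x_noisy.shape[0]
    num = np.zeros(d)
    den = 0.0
    for rho, mu, Sigma in zip(rhos, mus, Sigmas):
        mu_t = np.sqrt(1.0 - delta) * mu      # effective mean, Eq. (78)
        Sigma_t = (1.0 - delta) * Sigma       # effective covariance, Eq. (78)
        cov = Sigma_t + delta * np.eye(d)
        resp = rho * multivariate_normal.pdf(x_noisy, mean=mu_t, cov=cov)  # responsibility
        # posterior mean of sqrt(1-delta)*x given x_tilde and the cluster label
        post = Sigma_t @ np.linalg.solve(cov, x_noisy) + delta * np.linalg.solve(cov, mu_t)
        num += resp * post
        den += resp
    return num / (den * np.sqrt(1.0 - delta))
```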
In this appendix, we derive the MSE of the Bayes optimal estimator (with respect to the MSE), assuming only knowledge of the distribution from which the parameters {µk, Σk}K k=1 are drawn. For simplicity and conciseness, we limit ourselves to the binary isotropic and homoscedastic mixture (see e.g. Fig. 1), for which µ1 = µ2 µ and Σ1 = Σ2 = σId. For definiteness, we assume µ Pµ, with Pµ = N(0, Id/d), so that with high probability µ = 1. We shall moreover assume for ease of discussion that σ is perfectly known. Thus, the centroid µ is the only unknown parameter. C.2 Bayes-optimal denoiser without knowledge of cluster means The corresponding Bayes-optimal denoiser then reads b( x) E [x| x, σ, D] = Z dµ E [x| x, µ, σ, D] | {z } t(µ, x) P(µ|σ, D) (82) Where we have identified the oracle denoiser (79), and emphasized the dependence on µ. Note that this is a slight abuse of notation, as the true oracle t(µ , x) involves the ground truth centroid. Further note that P(µ|σ, D) = P(µ|D), since knowledge of σ does not bring information about µ. Furthermore, note that as the noisy part of the dataset D { xµ}n µ=1 brings only redundant information, one further has that P(µ|D) = P(µ| ˇD), (83) where we noted ˇD = {xµ}n µ=1 the clean part of the dataset. Thus b( x) = 1 P( ˇD) Z dµ t(µ, x)P( ˇD|µ)Pµ(µ) Z dµ t(µ, x)Pµ(µ) 1 2 1 (2πσ2) d/2 h e 1 2σ2 xµ µ 2 + e 1 2σ2 xµ+µ 2i Z dµ t(µ, x) 1 (2π/d) d/2 e d 2 µ 2e n 2σ2 µ 2 n Y e 1 2σ2 xµ 2 (2πσ2) d/2 cosh µ xµ d, x) 1 (2π/d) d/2 e 1 2 µ 2e α 2σ2 µ 2 n Y e 1 2σ2 xµ 2 (2πσ2) d/2 cosh µ xµ d, x)Pb(µ), (84) where we noted Pb(µ) e 1 2 σ2 µ 2 n Y µ=1 cosh µ xµ with the effective variance σ2 + α. (86) Finally, the partition function Z is defined as the normalisation of Pb, i.e. Z Z dµ Pb(µ). (87) One is now in a position to compute the MSE associated with the Bayes estimator b( x): = EDEx PEξ N(0,Id) s= 1 Eξ,η N(0,Id) σ2(1 ) + * (1 b 1 )sµ µ d + µ ((1 b 1 )ση b tanh 1 sµ µ 2 + 1 σ µ η s= 1 Eξ,η N(0,Id) d + 1 σ µ 1 η d + 1 σ µ 2 η s= 1 Ez N(0,1) σ2(1 ) + * (1 b 1 )sµ µ d tanh 1 sµ µ s= 1 Eu,v N(0,Ω) d tanh 1 sµ 1 µ tanh 1 sµ 2 µ µ1,µ2 Pb . (88) We have adopted the shortcuts σ2(1 ) + , Ω= (88) shows that the inner averages involved in the Bayes MSE mseb only depend on the overlaps µ 1,2µ1,2/d for µ1, µ2 two independently sampled vectors from Pb. Motivated by similar high-dimensional studies, it is reasonable to expect these quantities to concentrate as n, d in the measure ED Pb. A more principle but much more painstaking way to derive this assumption is to introduce a Fourier representation, and carry out the ED average using the replica method. We refer the interested reader to, for instance, Appendix H of [36], where such a route is taken. C.3 Derivation To compute the summary statistics of Pb, we again resort to the replica method to compute the moment-generating function (free entropy), see also Appendix A. The replicated partition function reads EDZs = Z s Y a=1 dµae 1 2ˆσ2 µa 2 n Y 1 2Eη N(0,Id) d + σ µ a η Introducing the order parameters d , ma µ a µ one reaches a b dqabdˆqab a=1 dmad ˆma e d P a b qab ˆqab d P a ma ˆma | {z } esdΨt a=1 dµae 1 2ˆσ2 µa 2e a b ˆqabµ a µb+P | {z } esdΨw " Z N ({λa}s a=1; 0, (qab)1 a,b s) a=1 cosh sma + σλa | {z } esαdΨy The computation of Ψt, Ψw, Ψy is rather standard and follows the main steps presented in Appendix A. One again assumes the replica symmetric ansatz [21, 22] qab = (r q)δab + q, ma = m, (93) 2 + ˆq δab + ˆq, ˆma = ˆm. 
(94) Introducing as in Appendix A the variances V r q, ˆV ˆr + ˆq, (95) one reaches Ψt = ˆV q V ˆq 2 m ˆm, (96) 2 ln 1 + ˆV ˆσ2 + ˆσ2 1 + ˆV ˆσ2 , (97) Ψy = Eη N(0,1) ln Z dλN(λ; qη, V ) cosh σλ + m Therefore the free entropy reads Φ = extr q,m,V,ˆq, ˆm, ˆV 2 ln 1 + ˆV ˆσ2 + ˆσ2 + αEη N(0,1) ln Z dλN(λ; qη, V ) cosh σλ + m 0.0 0.2 0.4 0.6 0.8 1.0 DAE Bayes oracle PCA oracle 0.0 0.2 0.4 0.6 0.8 1.0 Bayes DAE PCA Figure 12: (Binary isotropic mixture, Σ1,2 = 0.09Id, ρ1,2 = 1/2, and µ1 = µ2, see also Fig. 1). (right) MSE (minus the rescaling baseline mse ) for the DAE (p = 1, σ = tanh) (2) (blue), the oracle Tweedie denoiser (76) (green), the Bayes denoiser (101) (red), and the plug-in oracle-PCA denoiser (103) (black). Solid lines correspond to the theoretical formulae 3.3, (81) and (101); simulations correspond to a DAE optimized using the Pytorch implementation of Adam (weight decay λ = 0.1, learning rate η = 0.05, full batch, 2000 epochs, sample complexity α = 1), averaged over N = 10 instances, with error bars representing one standard deviation. (right) Cosine similarity (6) for the same algorithms, with identical color code. The solution of this extremization problem is given by the set of self-consistent equations, imposing that gradients with respect to all parameters be zero: 1+ ˆV ˆσ2 q = ˆσ4(ˆq+ ˆm2) (1+ ˆV ˆσ2)2 + ˆσ2 1+ ˆV ˆσ2 m = ˆσ2 ˆm 1+ ˆV ˆσ2 ˆV = α 1 σ q Eη N(0,1) h tanh σ qη+m σ2 ˆm = α σ2 Eη N(0,1) h tanh σ qη+m This is a simple optimization problem over 6 variables, which can be solved numerically. One can finally use the obtained summary statistics to evaluate the Bayes optimal MSE: mseb = mse 2EDEz N(0,1) σ2(1 ) + ((1 b 1 )m) tanh 1 m+ (1 )σ2+ q+V z σ2(1 )+ + EDEu,v N(0,Ω) σ2(1 )+ 2 q tanh 1 m+ (1 )σ2+ u σ2(1 )+ tanh 1 m+ (1 )σ2+ v σ2(1 )+ (101) Ω= q + V q q q + V This completes the derivation of the MSE of the Bayes estimator agnostic of the cluster means. C.4 A simple plug-in denoiser Another simple denoiser assuming only perfect knowledge of the variance σ, but imperfect knowledge of µ, is simply to plug the PCA estimate ˆµPCA of µ into the Tweedie oracle (76), i.e. t(ˆµPCA, x) = σ2(1 ) + x + σ2(1 ) + tanh x ˆµPCA ˆµPCA. (103) The performance of this plug-in denoiser is plotted in Fig. 12, and contrasted to the one of the Bayes denoiser b( ) (101), the oracle t( ) (81) and the DAE 3.3. Strikingly, the performance of the PCA plug-in denoiser is sizeably identical to the one of the Bayes-optimal denoiser, both in terms of MSE and cosine similarity. It is important to remind at this point that both the plug-in (103) and the Bayes denoiser (101) still rely on perfect knowledge of the cluster variance σ and the noise level , while the DAE (2) is agnostic to these parameters. Yet, these two denoisers make for a fairer comparison than the oracle (81), as they importantly take into account the finite training set. The DAE is relatively close to the Bayes baseline for small noise levels , while a gap can be seen to open up for larger . 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 3.0 ||w||2/d Figure 13: Cosine similarity θ (6) (green), weight norm ˆ w 2/d and skip connection strength ˆb for the MNIST dataset (left), of which for simplicity only 1s and 7s were kept, and Fashion MNIST (right), of which only boots and shoes were kept. In solid lines, the theoretical predictions resulting from using Conjecture 3.3 in its generic formulation (58) with the empirically estimated covariances and means. 
Dots represent numerical simulations of a DAE (p = 1, σ = tanh) trained with n = 784 training points, using the Pytorch implementation of full-batch Adam, with learning rate η = 0.05 over 2000 epochs, weight decay λ = 0.1, averaged over N = 5 instances. Error bars represent one standard deviation. See Appendix D for full details on the preprocessing.

D Details on real data simulations

In this appendix, we provide further details on the real data experiments presented in Fig. 2 and Fig. 4.

Preprocessing The original MNIST [15] and Fashion MNIST [16] data sets were flattened (vectorized), centered, and rescaled by 400 (MNIST) and 600 (Fashion MNIST). For simplicity and ease of discussion, only two labels were kept for each data set, namely 1s and 7s for MNIST, and boots and shoes for Fashion MNIST. Note that the visual similarity between the two selected classes should make the denoising problem more challenging.

Means and covariance estimation For each data set, data from the same class were considered to belong to the same Gaussian mixture cluster. For simplicity, we kept the same number of training points in each cluster, so as to obtain balanced clusters (ρ1 = ρ2 = 1/2) for definiteness. For each cluster, the corresponding mean µ and covariance Σ were numerically evaluated from the empirical mean and covariance over the 6000 boots (shoes) in the Fashion MNIST training set, and the 6265 1s (7s) in the MNIST training set. The solid lines in Fig. 2 correspond to using those estimates in the asymptotic formulae of Conjecture 3.3, in their generic form (58) (see Appendix A), for sample complexity α = 1 and ℓ2 regularization λ = 0.1.

Pytorch simulations The red points in Fig. 2 were obtained from numerical simulations, by optimizing a p = 1 DAE (2) with σ = tanh activation using the Pytorch implementation of the full-batch Adam [43] optimizer, with weight decay λ = 0.1, learning rate η = 0.05, over T = 2000 epochs. For each value of Δ, n = 784 images were drawn without replacement from the training set, corrupted by noise, and fed to the DAE along with the clean images for training. The denoising test MSE, cosine similarity and summary statistics of the trained DAE were estimated after optimization using ntest = 1000 images randomly drawn from the test set (from which, likewise, only the relevant classes were kept, in balanced proportions). Each value was averaged over N = 10 instances of the train set, noise realization and test set. The obtained values were furthermore found to be robust with respect to the choice of the optimizer, learning rate and number of epochs.

D.1 Complementary simulations

Fig. 2 in the main text addresses the comparison of the theoretical prediction of Conjecture 3.3 with simulations on real data sets, at the level of the denoising test MSE (5). For completeness, we show here the same comparison for the other learning metrics and summary statistics, namely the cosine similarity θ (6), the squared weight norm ‖ˆw‖²/d and the trained skip connection strength ˆb. Theoretical asymptotic characterizations are again provided by Conjecture 3.3, (10), (11) and (13). They are plotted as solid lines in Fig. 13, and contrasted to numerical simulations (dots) on the real data sets, obtained by training a p = 1 DAE (2) using the Pytorch implementation of the Adam optimizer. All experiment details are the same as in Fig. 2 of the main text, and can be found in the previous subsection.
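For concreteness, the preprocessing and parameter-estimation pipeline described above can be sketched as follows. torchvision is used here purely for convenience, and the interpretation of "rescaled by 400" as a division, as well as the global centering, are assumptions of this sketch; the clean samples obtained this way are then corrupted with Gaussian noise of variance Δ as in the data model before being fed to the DAE.

```python
import numpy as np
from torchvision import datasets

def load_two_class_mnist(classes=(1, 7), rescale=400.0, root="./data"):
    """Preprocessing sketch for the MNIST experiment: keep two classes,
    flatten, center, rescale, and estimate the per-class means and covariances
    used in the generic formulation (58) of Conjecture 3.3."""
    train = datasets.MNIST(root=root, train=True, download=True)
    X = train.data.numpy().reshape(len(train), -1).astype(np.float64)
    y = train.targets.numpy()
    keep = np.isin(y, classes)
    X, y = X[keep], y[keep]
    X = (X - X.mean(axis=0)) / rescale                           # center and rescale
    stats = {}
    for c in classes:
        Xc = X[y == c]
        stats[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False))   # empirical mu_k, Sigma_k
    return X, y, stats

X, y, stats = load_two_class_mnist()
mu_1, Sigma_1 = stats[1]
print(X.shape, Sigma_1.shape)   # (n_kept, 784), (784, 784)
```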
As can be observed, there is a gap, for large noise levels , between the theory and simulations of the weights norm ˆ w 2/d (red). On the other hand, the matchings for the cosine similarity θ (green) and skip connection strength ˆb (blue) are perfect. Overall, Conjecture (3.3) therefore captures well (with the exception of ˆ w 2/d at large ) the main metrics of the denoising problem on these two real data-sets. A more thorough investigation of the possible Gaussian universality of the problem is left to future work. E Derivation of Corollary 3.4 In this Appendix, we provide the derivation of Corollary 3.4. Rescaling component ˆh We first provide a succinct derivation of the sharp asymptotic characterizations for the rescaling component ˆh (3). Because this is only a scalar optimization problem over the single scalar parameter c of hc, the derivation is simple and straightforward. The empirical loss reads 1 nd ˆR(c) = 1 d + c2 + O (1/ k=1 ρk µk 2+ Tr Σk + c2 (104) Extremizing this last expression with respect to c leads to k=1 ρk Tr Σk k=1 ρk Tr Σk k=1 ρk R dνγ(γ)γk k=1 ρk R dνγ(γ)γk (1 ) + . (105) which is exactly (10), as stated in Corollary 3.4. The associated MSE can be then readily computed as mseˆr = EDEx,ξ x ˆc k=1 ρk µk 2+ Tr Σk + ˆc2 Z dντ(τ)τ 2 k + d Z dµγ(γ)γk = mse . (106) This completes the proof of the asymptotic characterization of the rescaling component ˆr. Bottleneck network component ˆu For ˆu, one needs to go through the derivation presented in Appendix A by setting b = 0 from the beginning. It is straightforward to realize that the derivation goes through sensibly unaltered, and the only differing step is that the quadratic potential Ψquad. is now a constant. One therefore only needs to set ˆb in all the formulae of Conjecture 3.3 to access sharp asymptotics for ˆu. This concludes the derivation of Corollary 3.4 F Derivation of Corollary 3.5 We now turn to the derivation of Corollary 3.5. This corresponds to the limit = 0, where the input of the auto-encoder is also the clean data point. In terms of equations, essentially, this means that all the integrals over ξ, λξ in the derivation of Conjecture 3.3 as presented in Appendix A should be removed. For the sake of clarity, and because this is an important case of particular interest, we provide here a complete, succinct but self-contained, derivation in the case of RAEs. For technical reasons, we limit ourselves to ℓ2 regularization. Furthermore, note that the model of an RAE with a skip connection is trivial, since such an architecture can achieve perfect reconstruction accuracy simply by setting the skip connection strength to 1 and the weights to 0. Therefore, we consider an RAE without skip connection F.1 Derivation The derivation of asymptotic characterizations for the metrics mse (5), θ (6) and summary statistics for the RAE (107) follow the same lines as for the full DAE (2), as presented in Appendix A. As in Appendix A, the first step is to evaluate the replicated partition function EDZs = Z s Y a=1 dwae β s P a=1 g(wa) n Y k=1 ρk Eη e β d σ wa(µk+η) | {z } ( ) . (108) The exponent ( ) can be expanded as " waw a d σ wa(µk+η) 2σ wa(µk+η) d +( µk 2+ η 2+2µ k η) 2 µk 2 Z dη (2π) d 2 det Σk e 1 2 η (Σ 1 k +βs Id)η βsµ k η | {z } Peff.(η) " waw a d σ wa(µk+η) 2σ wa(µk+η)) The effective prior over the η is therefore Gaussian with dressed mean and covariance Peff.(η) = N (η; µ, C) , (111) C 1 = Σ 1 k + βs Id, µ = βs Cµk. (112) Observe that this mean is O(s) and can be safely neglected. 
Then k=1 ρk e βs det Σk det(C) 1 " waw a d σ wa(µk+η) 2σ wa(µk+η) Peff.(η) | {z } (a) Again, like Appendix A, we introduce the local fields: λη a = wa(η µ) d , hk a = waµk which have correlation λη aλη b waΣkw b d . (115) We used the leading order of the covariance C. One therefore has to introduce the order parameters: Qab = waw b d Rp p, Sk ab = waΣkw b d Rp p, mk a = waµk d Rp. (116) The distribution of the local fields λη a is then simply: (λη a)s a=1 N 0, Sk . (117) Going back to the computation, (a) can be rewritten as D e β a=1 Tr Qaaσ(mk a+λη a) 2 β 2σ(mk a+λη a) (mk aλη a) {λη a}s a=1 . (118) Introducing Dirac functions enforcing the definitions of Qab, ma brings the replicated function in the following form: EZs = Z s Y a,b d Qabd ˆQab a=1 dmad ˆma a,b d Sk abd ˆSkabes a b Qab ˆ Qab d K P a b Sk ab ˆSk ab K P a dmk a ˆmk a | {z } eβsdΨt Z s Y a=1 dwae β P a b ˆ Qabwaw b + K P a b ˆSk abwaΣkw b + K P | {z } eβsdΨw k=1 ρk 1 det Σk det(C 1) 1 2 (a) eαsd2Ψquad.+βsdΨy As in Appendix A, we have introduced the trace, entropic and energetic potentials Ψt, Ψw, Ψy. Since all the integrands scale exponentially (or faster) with d, this integral can be computed using a saddle-point method. To proceed further, note that the energetic term encompasses two types of terms, scaling like sd2 and sd. More precisely, " K X k=1 ρk 1 det Σk det(C 1) 1 2 (a) k=1 ρk 1 sd ln det Σk det(C 1) 1 2 + s k=1 ρk 1 s ln(a) = e sαd2 K P sd ln det Σk det(C 1) 1 2 +sαd K P s ln(a) (120) Remark that in contrast to Appendix A, the quadratic term is constant with respect to the network parameters, and therefore one no longer needs to discuss its extremization. Following Appendix A, we assume the RS ansatz a, mk a = mk, ˆmk a = ˆmk a, b, Sk ab = (rk qk)δab + qk, ˆSk ab = ˆrk a, b, Qab = (r q)δab + q, ˆQab = ˆr 2 + ˆq δab + ˆq. (121) In the following, we sequentially simplify the potentials Ψw, Ψy, Ψt under this ansatz. Entropic potential It is convenient to introduce before proceeding the variance order parameters ˆV ˆr + ˆq, ˆVk ˆrk + ˆqk. (122) The entropic potential can then be expressed as a=1 dwae β P a=1 Tr[ ˆV waw a ]+ 1 a,b Tr[ˆqwaw b ]+ ˆm K P d ˆm k w a µk a=1 Tr[ ˆVkwaΣkw a ]+ 1 a,b Tr[ˆqkwaΣkw b ] Z dwe βg(w) 1 2 Tr ˆV ww + K P k=1 ˆVkwΣkw + K P d ˆmkµ + K P k=1 Ξk (ˆqk Σk) 1 2 +Ξ0 (ˆq Id) 1 2 w Z dwe βg(w) 1 2 w ˆV Id+ K P d ˆmkµ + K P k=1 Ξk (ˆqk Σk) 1 2 +Ξ0 (ˆq Id) 1 2 w Z dwe βg(w) 1 2 w ˆV Id+ K P d ˆmkµ + K P k=1 Ξk (ˆqk Σk) 1 2 +Ξ0 (ˆq Id) 1 2 w Energetic potential The inverse of Sk can be computed like in Appendix A, leading to the following expression for (a): (2π) ms/2 det Sk e 1 a=1 (λη a) ( rk qk)λη a 1 a,b (λη a) qkλb β Rp Dη (2π) ms/2 det Sk 2 λ η V 1 k λη+λ η V 1 k q 1 2 k η β Rp Dη Z N λη; q 1 2 k η, V e β Rp Dη ln Z N λη; q 1 2 k η, V e β 2 ( ) , (125) where we noted with capital D an integral over N(0, Ip). Therefore Rp Dη ln Z N λη; q 1 2 k η, V e β 2 ( ) . (126) Zero-temperature limit In this subsection, we take the zero temperature β limit. 
Rescaling 1 β Vk Vk, β ˆVk ˆVk, β2ˆqk ˆqk, β ˆmk ˆmk (127) one has that k=1 ˆqk Σk + d d EΞMr(Ξ), (128) where we introduced the Moreau envelope 2 w ˆV Id+ K P 2 (ˆq Id) 1 2 Ξ0+ K P k=1 (ˆqk Σk) 1 2 Ξk+ For the case of ℓ2 regularization, the Moreau envelope presents a simple expression, and the entropy potential assumes the simple form λIm Id + ˆV Id + k=1 ˆqk Σk + d The energetic potential also simplifies in this limit to k=1 ρk EηMk(η) (131) Tr V 1 k y q 1 2 k η mk 2 + Tr qσ(y) 2 2σ(y) y Trace potential It is immediate to see that the trace potential can be expressed as Ψt = ˆV q ˆq V k=1 (Tr h ˆVkqk i Tr[ˆqk Vk]) k=1 mk ˆmk. (133) Then the total free entropy for RAEs (107) reads Φ = extr q,m,V,ˆq, ˆm, ˆV ,{qk,mk,Vk,ˆqk, ˆmk, ˆVk}K k=1 k=1 (Tr h ˆVkqk i Tr[ˆqk Vk]) k=1 ρk EηMk(η) + 1 Z dν(γ, τ) Tr λ + γk ˆVk 1 k=1 γkˆqk + X 1 k,j K τjτk ˆmj ˆm k Comparing to the free energy derived in Appendix A, it is immediate to see that the Moreau envelope Mk for the RAE just corresponds to removing the second term from the DAE Mk, and setting x = 0. The disappearance of the second term in the Moreau envelope has the effect to kill the V -dependence, resulting in ˆq = 0 when extremizing with respect to V . These observations, in addition to the already taken limits , b = 0, finish the derivation of Corollary 3.5. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0.2 mse RAE mse theory simulations PCA 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Figure 14: α = 1, K = 2, ρ1,2 = 1/2, Σ1,2 = 0.3 Id, p = 1, σ = tanh; the cluster mean µ1 = µ2 was taken as a random Gaussian vector of norm 1. (left) In blue, difference in test reconstruction MSE between the RAE (Corollary 3.5) and the leading order term mse (135). Solid lines correspond to the sharp asymptotic characterization of Corollary 3.5. Dots represent numerical simulations for d = 700, training the RAE using the Pytorch implementation of full-batch Adam, with learning rate η = 0.05 over 2000 epochs, weight decay λ = 0.1, averaged over N = 10 instances. Error bars represent one standard deviation. (right) Cosine similarity θ (6) (green), squared weight norm ˆ w 2/d (red). Solid lines correspond to the theoretical characterization of Corollary 3.5; dots are numerical simulations. The cosine similarity with the cluster means of the principal component of the training samples is represented in blue crosses, and superimposes perfectly with the analogous curve learnt by the RAE. F.2 Reconstruction MSE Having obtained a sharp asymptotic characterization of the summary statistics for the RAE, we now turn to the learning metrics. It is easy to see that the formula for the cosine similarity θ (6) carries over unchanged from the DAE case, see Appendix A. We therefore focus on the test MSE: mse RAE = Ek,η,ξ d σ ˆw(µk + η) µk 2 + Tr Σk k=1 ρk EN(0,1) z h Tr h qσ (mk + qkz) 2ii k=1 ρk EN(0,1) u,v [σ (mk + qku)] (mk + qku) (135) F.3 RAEs are limited by the PCA MSE We close this Appendix by showing how the RAE Corollary 3.5 recovers the well-known Eckart-Young theorem [10]. The difference mse mse for an RAE learning to reconstruct a binary, isotropic homoscedastic mixture (see Fig. 1) is shown in Fig. 14, from a training set with sample complexity α = 1. Because mse mse is essentially a correction to the leading term mse , estimating this difference is slightly more challenging numerically, whence rather large error bars. 
Nevertheless, it can be observed that the agreement between the theory (solid blue line) and the simulations (dots), gathered from training a p = 1 RAE using the Pytorch implementation of Adam, is still very good. In terms of cosine similarity and learnt weight norm, the agreement is perfect, see Fig. 14 (right). For comparison, the performance of PCA reconstruction (crosses) has been overlaid on the RAE curves. Importantly, observe that in terms of MSE, PCA reconstruction always leads to a smaller MSE, in agreement with the well-known Eckart-Young theorem [10]. From the cosine similarity, it can be seen that the RAE essentially learns the principal component of the train set, see the discussions in e.g. [12, 11, 8, 9]. We finally stress that, in contrast to all previous figures, the x axis in Fig. 14 is the sample complexity α, rather than the noise level Δ. As a matter of fact, while our results recover the well-known fact that the MSE of RAEs is limited (lower-bounded) by the PCA test error, Corollary 3.5 allows us to investigate how the RAE approaches the PCA performance as a function of the number of training samples, a characterization which has importantly been missing from previous studies of non-linear RAEs [8, 9].
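For reference, the rank-one PCA reconstruction baseline against which the RAE is compared in Fig. 14 can be computed along the following lines. This is a sketch under our own conventions (in particular, whether the empirical mean is re-added after the rank-one projection is an assumption); it is not the exact script used to produce the figure.

```python
import numpy as np

def pca_rank_one_reconstruction_mse(X_train, X_test):
    """Rank-one PCA reconstruction baseline (Eckart-Young).

    Projects the (centered) test points onto the leading principal direction of
    the training set and returns the average squared reconstruction error."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    u = Vt[0]                                          # leading principal direction
    X_rec = mean + np.outer((X_test - mean) @ u, u)    # rank-one reconstruction
    return np.mean(np.sum((X_test - X_rec) ** 2, axis=1))
```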