Matrix Inference and Estimation in Multi-Layer Models

Parthe Pandit (Dept. ECE, UC Los Angeles, parthepandit@ucla.edu), Mojtaba Sahraee-Ardakan (Dept. ECE, UC Los Angeles, msahraee@ucla.edu), Sundeep Rangan (Dept. ECE, NYU, srangan@nyu.edu), Philip Schniter (Dept. ECE, The Ohio State Univ., schniter.1@osu.edu), Alyson K. Fletcher (Dept. Statistics, UC Los Angeles, akfletcher@ucla.edu)

Abstract

We consider the problem of estimating the input and hidden variables of a stochastic multi-layer neural network from an observation of the output. The hidden variables in each layer are represented as matrices with statistical interactions along both rows and columns. This problem applies to matrix imputation, signal recovery via deep generative prior models, multi-task and mixed regression, and learning certain classes of two-layer neural networks. We extend a recently developed algorithm, Multi-Layer Vector Approximate Message Passing (ML-VAMP), to this matrix-valued inference problem. It is shown that the performance of the proposed Multi-Layer Matrix VAMP (ML-Mat-VAMP) algorithm can be exactly predicted in a certain random large-system limit, where the dimensions $N\times d$ of the unknown quantities grow as $N\to\infty$ with $d$ fixed. In the two-layer neural-network learning problem, this scaling corresponds to the case where the number of input features and the number of training samples grow to infinity but the number of hidden nodes stays fixed. The analysis enables a precise prediction of the parameter and test error of the learning.

Code available at https://github.com/parthe/ML-Mat-VAMP.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

1 Introduction

Consider an $L$-layer stochastic neural network given by
$$Z^0_\ell = W_\ell Z^0_{\ell-1} + B_\ell + \Xi^0_\ell, \qquad \ell = 1, 3, \ldots, L-1, \qquad (1a)$$
$$Z^0_\ell = \phi_\ell(Z^0_{\ell-1}, \Xi^0_\ell), \qquad \ell = 2, 4, \ldots, L, \qquad (1b)$$
where, for $\ell = 0, 1, \ldots, L$, we have true activations $Z^0_\ell \in \mathbb{R}^{n_\ell\times d}$, weights $W_\ell \in \mathbb{R}^{n_\ell\times n_{\ell-1}}$, bias matrices $B_\ell \in \mathbb{R}^{n_\ell\times d}$, and true noise realizations $\Xi^0_\ell$. The activation functions $\phi_\ell: \mathbb{R}^{n_{\ell-1}\times d}\to\mathbb{R}^{n_\ell\times d}$ are known non-linear functions acting row-wise on their inputs. See Fig. 1 (top). We use the superscript 0 in $Z^0_\ell$ to indicate the true values of the variables, in contrast to the estimated values $\widehat{Z}_\ell$ discussed later. We model the true input $Z^0_0$ as a realization of a random $Z_0$ whose rows $z_{0,i:}^{\mathsf T}$ are i.i.d. with distribution $p_0$, i.e., $p(Z_0) = \prod_{i=1}^{n_0} p_0(z_{0,i:})$. Similarly, we assume that $\Xi^0_\ell$ are realizations of random $\Xi_\ell$ with i.i.d. rows $\xi_{\ell,i:}^{\mathsf T}$. For odd $\ell$, the rows $\xi_{\ell,i:}$ are zero-mean multivariate Gaussian with covariance matrix $N_\ell^{-1}\in\mathbb{R}^{d\times d}$, whereas for even $\ell$, the rows $\xi_{\ell,i:}$ can be arbitrarily distributed but are i.i.d.

Figure 1: (Top) The signal flow graph for the true values of the matrix variables $\{Z^0_\ell\}_{\ell=0}^{3}$ given in (1), where $Z^0_\ell\in\mathbb{R}^{n_\ell\times d}$. (Bottom) Signal flow graph of the ML-Mat-VAMP procedure in Algorithm 1. The variables with superscripts $+$ and $-$ are updated in the forward and backward passes, respectively. ML-Mat-VAMP (Algorithm 1) solves (2) through a sequence of simpler estimation problems over consecutive pairs $(Z_\ell, Z_{\ell-1})$.

Denoting by $Y := Z^0_L \in \mathbb{R}^{n_L\times d}$ the output of the network, we consider the following matrix inference problem:

Estimate $Z := \{Z_\ell\}_{\ell=0}^{L-1}$ given $Y := Z^0_L$ and $\{W_{2k-1}, B_{2k-1}, \phi_{2k}\}_{k=1}^{L/2}$.  (2)
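The following is a minimal sketch of how data can be generated from the model (1), assuming an illustrative $L = 4$ network with ReLU activations and Gaussian noise in every layer; all dimensions, distributions, and names here are our choices for illustration and are not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # number of columns of each matrix variable
n = [500, 300, 300, 100, 100]          # layer widths n_0, ..., n_4

def phi(Z, Xi):
    """Row-wise non-linearity phi_l: here ReLU plus additive noise (illustrative)."""
    return np.maximum(Z, 0.0) + Xi

Z = [rng.standard_normal((n[0], d))]   # true input Z_0 with i.i.d. rows

for ell in range(1, 5):
    if ell % 2 == 1:                   # linear layer, eq. (1a)
        W = rng.standard_normal((n[ell], n[ell - 1])) / np.sqrt(n[ell - 1])
        B = 0.1 * rng.standard_normal((n[ell], d))
        Xi = 0.01 * rng.standard_normal((n[ell], d))   # Gaussian noise rows
        Z.append(W @ Z[-1] + B + Xi)
    else:                              # non-linear layer, eq. (1b)
        Xi = 0.01 * rng.standard_normal((Z[-1].shape[0], d))
        Z.append(phi(Z[-1], Xi))

Y = Z[-1]                              # observed output Y = Z_L; problem (2) is to
                                       # recover Z_0, ..., Z_{L-1} from Y
```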
A key feature of the problem considered here is that the unknowns $Z_\ell$ are matrix-valued, with $d$ columns and statistical dependencies between the columns. As we will see in Section 2, the matrix-valued case covers several problems of broad interest, such as matrix imputation, multi-task and mixed regression, and sketched clustering. We also show that this formulation lets us analyze learning in two-layer neural networks under certain architectural assumptions.

In many applications, inference can be performed by minimizing an appropriate cost function. For example, suppose the network (1) has no noise $\Xi_\ell$ in any layer except the final measurement layer $\ell = L$. In this case, $Z^0_{L-1} = g(Z^0_0)$ for some deterministic function $g(\cdot)$ representing the action of the first $L-1$ layers. Inference can then be conducted via a minimization of the form
$$\widehat{Z}_{L-1} := g\Big(\arg\min_{Z_0}\; H_L(Y, Z_{L-1}) + H_0(Z_0)\Big) \quad\text{subject to } Z_{L-1} = g(Z_0), \qquad (3)$$
where the term $H_L(Y, Z_{L-1})$ penalizes the prediction error and $H_0(Z_0)$ is an (optional) regularizer on the network input. For maximum a posteriori (MAP) estimation one takes $H_L(Y, Z_{L-1}) = -\log p(Y|Z_{L-1})$ and $H_0(Z_0) = -\log p(Z_0)$, where the output probability $p(Y|Z_{L-1})$ is defined by the last layer of model (1b): $Y = Z_L = \phi_L(Z_{L-1}, \Xi_L)$. The minimization (3) can then be solved using a gradient-based method. Encouraging results in image reconstruction have been demonstrated in [46, 4, 15, 18, 38, 42, 29]. Markov-chain Monte Carlo (MCMC) algorithms and Langevin diffusion [7, 45] could also be employed for more complex inference tasks. However, rigorous analysis of these methods is difficult due to the non-convex nature of the optimization problem.

To address this issue, recent works [25, 12, 35] have extended Approximate Message Passing (AMP) methods to provide inference algorithms for multi-layer networks. AMP was originally developed in [9, 10, 3, 17] for compressed sensing. As with other AMP-type results, the performance of multi-layer AMP-based inference can be precisely characterized in certain high-dimensional random instances. In addition, the mean-squared error of these algorithms matches the Bayes-optimal values predicted by various techniques from statistical physics [37, 14, 2]. Thus, AMP-based multi-layer inference provides a computationally tractable estimation framework with precise performance guarantees and testable conditions for optimality in certain high-dimensional random settings.

Prior multi-layer AMP works [16, 26, 12, 35] have considered the case of vector-valued quantities, i.e., $d = 1$. The main contribution of this paper is to consider the matrix-valued case $d > 1$. To handle this case, we extend the Multi-Layer Vector Approximate Message Passing (ML-VAMP) algorithm of [12, 35] to the matrix setting. The ML-VAMP method is based on the VAMP method of [36], which is closely related to expectation propagation (EP) [28, 39], expectation-consistent approximate inference (EC) [32, 13], S-AMP [6], and orthogonal AMP [24]. We use ML-Mat-VAMP to refer to the matrix extension of ML-VAMP.

Contributions: First, as with ML-VAMP, we analyze ML-Mat-VAMP in a large-system limit, where $n_\ell\to\infty$ and $d$ is fixed, under rotationally invariant random weight matrices $W_\ell$. In this large-system limit, we prove that the mean-squared error (MSE) of the ML-Mat-VAMP estimates can be exactly predicted by a deterministic set of equations called the state evolution (SE).
The SE describes how the joint distribution of the true activations and pre-activations of the network and the estimates generated by ML-Mat-VAMP evolves from one iteration of the algorithm to the next. This extension of the SE equations to the matrix case is not trivial and requires tracking correlations across multiple vectors: in ML-VAMP, the SE equations involve scalar quantities and $2\times 2$ matrices, whereas for ML-Mat-VAMP they involve $d\times d$ and $2d\times 2d$ matrices. Second, we show that the method offers precise predictions in important estimation problems that are difficult to analyze by other means. ML-VAMP was focused on deep reconstruction problems [46, 4]. The matrix version here applies to other classes of problems, such as multi-task regression, matrix completion, and learning the input layer of a neural network. Even though these networks are typically shallow (just $L = 2$ layers), no existing methods provide the same type of precise results. For example, in the case of learning the input layer of a neural network, our results exactly predict the test error as a function of the noise statistics, activations, number of training samples, and other key modeling parameters.

Notation: Boldface uppercase letters $X$ denote matrices, and $X_{n:}$ denotes the $n$th row of $X$. Random vectors are row vectors. For a function $f:\mathbb{R}^{1\times m}\to\mathbb{R}^{1\times k}$, its row-wise extension $\mathbf{f}:\mathbb{R}^{N\times m}\to\mathbb{R}^{N\times k}$ is defined by $[\mathbf{f}(X)]_{n:} = f(X_{n:})$. We denote the Jacobian matrix of $f$ by $\frac{\partial f}{\partial x}(x)\in\mathbb{R}^{m\times k}$, with entries $[\frac{\partial f}{\partial x}(x)]_{ij} = \frac{\partial f_i}{\partial x_j}(x)$. For the row-wise extension $\mathbf{f}$, we denote the average Jacobian by $\big\langle\frac{\partial \mathbf{f}}{\partial X}(X)\big\rangle := \frac{1}{N}\sum_{n=1}^{N}\frac{\partial f}{\partial X_{n:}}(X_{n:})\in\mathbb{R}^{m\times k}$.

2 Example Applications

As we describe next, the matrix estimation problem (2) is of broad interest, and several applications can be formulated under this framework. We share a few examples below.

2.1 Multi-task and Mixed Regression Problems

A simple application of the matrix-valued multi-layer inference problem (2) is multi-task regression [31]. Consider a generalized linear model of the form
$$Y = \phi(XF;\,\Xi), \qquad (4)$$
where $Y\in\mathbb{R}^{N\times d}$ is a matrix of measured responses, $X\in\mathbb{R}^{N\times p}$ is a known design matrix, $F\in\mathbb{R}^{p\times d}$ is a set of regression coefficients to be estimated, and $\Xi$ is noise. The problem could be treated as $d$ separate regression problems, one per column. In some applications, however, the tasks are related in such a way that it is beneficial to estimate the predictors jointly. To do this, it is common to solve an optimization problem of the form
$$\min_{F}\;\sum_{i=1}^{N}\sum_{j=1}^{d} L\big(y_{ij}, [XF]_{ij}\big) + \lambda\sum_{k=1}^{p}\rho(F_{k:}), \qquad (5)$$
where $L(\cdot)$ is a loss function and $\rho(\cdot)$ is a regularizer that acts on the rows $F_{k:}$ of $F$ to couple the prediction coefficients across tasks. For example, the multi-task LASSO [31] uses the loss $L(y,z) = (y-z)^2$ and the regularizer $\rho(F_{k:}) = \|F_{k:}\|_2$ to enforce row-sparsity in $F$. In the compressive-sensing context, multi-task regression is known as the multiple measurement vector (MMV) problem, with applications in MEG reconstruction [8], DoA estimation [43], and parallel MRI [22]. An AMP approach to the MMV problem was developed in [48].

The multi-task model (4) can be written directly as a multi-layer network (1) by setting $Z_0 := F$, $W_1 := X$, $Z_1 := W_1 Z_0 = XF$, and $Y = Z_2 := \phi(Z_1, \Xi)$. Moreover, by appropriately choosing the prior $p(Z_0)$, the multi-layer matrix MAP inference (3) matches the multi-task optimization (5). In (5), the regularization couples the columns of $F$ while the loss term couples its rows.
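As a concrete illustration of this mapping, the sketch below generates data from the multi-task model (4), assuming an identity output map with additive Gaussian noise and a row-sparse coefficient matrix; the sizes, sparsity level, and function names are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, d = 400, 200, 3
X = rng.standard_normal((N, p)) / np.sqrt(p)           # known design matrix = W_1
support = rng.random(p) < 0.1                           # active rows (row sparsity)
F = np.where(support[:, None], rng.standard_normal((p, d)), 0.0)   # Z_0 = F

Z1 = X @ F                                              # linear layer Z_1 = X F
Y = Z1 + 0.05 * rng.standard_normal((N, d))             # phi(Z_1, Xi): identity + noise

def multitask_lasso_cost(F_hat, lam=0.1):
    """MAP cost (5) with square loss and a row-coupling (group) regularizer."""
    resid = Y - X @ F_hat
    return 0.5 * np.sum(resid ** 2) + lam * np.sum(np.linalg.norm(F_hat, axis=1))
```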
In mixed regression problems, by contrast, the loss couples the columns of $F$. For example, consider designing predictors $F = [f_1, f_2]$ for mixed linear regression [47], i.e.,
$$y_i = q_i x_i^{\mathsf T} f_1 + (1-q_i) x_i^{\mathsf T} f_2 + v_i, \qquad q_i\in\{0,1\}, \qquad (6)$$
where $i = 1,\ldots,N$ and the $i$th response comes from one of two linear models, but which model is not known. This setting can be handled with a different output mapping: as before, set $Z_0 := F$ and $Z_1 = XF$, and let the noise in the output layer be $\Xi_1 = [q, v]$, which collects the additive noise $v_i$ in (6) and the random selection variables $q_i$. Then (6) can be written via an appropriate function as $y = \phi_1(Z_1, \Xi_1)$.

2.2 Sketched Clustering

A related problem arises in sketched clustering [19], where a massive dataset is nonlinearly compressed down to a short vector $y\in\mathbb{R}^n$, from which cluster centroids $f_k\in\mathbb{R}^p$, $k = 1,\ldots,d$, are then extracted. This problem can be approached via the optimization [20]
$$\min_{\alpha\ge 0}\;\min_{F}\;\sum_{i=1}^{n}\Big| y_i - \sum_{j=1}^{d}\alpha_j e^{\sqrt{-1}\,x_i^{\mathsf T} f_j}\Big|^2,$$
where $x_i\in\mathbb{R}^p$ are known i.i.d. Gaussian vectors. An AMP approach to sketched clustering was developed in [5]. For known $\alpha$, the minimization corresponds to MAP estimation in the multi-layer matrix model with $Z_0 = F$, $W_1 = X$, $Z_1 = XF$, and the output mapping $\phi_1(Z_1, \Xi) := \sum_{j=1}^{d}\alpha_j e^{\sqrt{-1}\,Z_{1,:j}} + \Xi$, where the exponential is applied elementwise and $\Xi$ is i.i.d. Gaussian. The mapping $\phi_1$ operates row-wise on $Z_1$ and $\Xi$.

2.3 Learning the Input Layer of a Two-Layer Neural Network

The matrix inference problem (2) can also be applied to learning the input-layer weights of a two-layer neural network (NN). Let $X\in\mathbb{R}^{N\times N_{\rm in}}$ and $Y\in\mathbb{R}^{N\times N_{\rm out}}$ be training data corresponding to $N$ data samples, and consider the two-layer NN model
$$Y = \sigma(XF_1)F_2 + \Xi, \qquad (7)$$
with weight matrices $(F_1, F_2)$, componentwise activation function $\sigma(\cdot)$, and noise $\Xi$. In (7), bias terms are omitted for simplicity. We use the notation $F_\ell$ for the weights, instead of the standard $W_\ell$, to avoid confusion when (7) is mapped to the multi-layer inference network (2). Our critical assumption is that the weights of the second layer, $F_2$, are known; the goal is to learn only the weights of the first layer, $F_1\in\mathbb{R}^{N_{\rm in}\times N_{\rm hid}}$, from the dataset of $N$ samples $(X, Y)$. If the activation is ReLU, i.e., $\sigma(H) = \max\{H, 0\}$, $Y$ has a single column (i.e., a scalar output per sample), and $F_2$ has all positive entries, we can treat the weights $F_2$ as fixed without loss of generality, since they can always be absorbed into the weights $F_1$. In this case, $y$ and $F_2$ are vectors, and we can write the $i$th entry of $y$ as
$$y_i = \sum_{j=1}^{N_{\rm hid}} F_{2j}\,\sigma\big([XF_1]_{ij}\big) + \xi_i = \sum_{j=1}^{N_{\rm hid}} \sigma\big([XF_1]_{ij} F_{2j}\big) + \xi_i. \qquad (8)$$
Thus, we can assume, without loss of generality, that $F_2$ is all ones. The parameterization (8) is sometimes referred to as the committee machine [41], which has recently been studied via AMP methods [1] and mean-field methods [27] as a way to understand the dynamics of learning.

To pose the two-layer learning problem as multi-layer inference, define $Z_0 := F_1$, $W_1 := X$, $Z_1 := XF_1$, and $\Xi_2 := \Xi$. Then $Y = Z_2$, where $Z_2$ is the output of a two-layer inference network of the form (1):
$$Y = Z_2 = \phi_2(Z_1, \Xi_2) := \sigma(Z_1)F_2 + \Xi_2. \qquad (9)$$
Note that $W_1$ is known and, since we have assumed that $F_2$ is known, the function $\phi_2$ is known as well. Finally, $\phi_2$ is row-wise separable in both of its inputs. Thus, learning the input weights $F_1$ is equivalent to estimating the input $Z_0$ of the network (9).
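The following sketch generates training data from (7) with a known all-ones second layer (the committee-machine setting) and recasts it as the inference network (9). The sigmoid activation, sizes, and names are illustrative assumptions mirroring the experiment of Section 5 rather than the paper's released code.

```python
import numpy as np

rng = np.random.default_rng(2)
N, N_in, N_hid = 1000, 100, 4
sigmoid = lambda H: 1.0 / (1.0 + np.exp(-H))

X = rng.standard_normal((N, N_in)) / np.sqrt(N_in)   # known design; plays the role of W_1
F1 = rng.standard_normal((N_in, N_hid))              # unknown first-layer weights = Z_0
F2 = np.ones((N_hid, 1))                             # known second-layer weights

Z1 = X @ F1                                          # pre-activations Z_1 = X F_1
Xi2 = 0.1 * rng.standard_normal((N, 1))
Y = sigmoid(Z1) @ F2 + Xi2                           # output layer phi_2(Z_1, Xi_2), eq. (9)

# Learning F1 from (X, Y) is exactly estimating the input Z_0 of this two-layer network.
```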
2.4 Model-Based Matrix Completion

Consider an observed matrix $Y = Z_L\in\mathbb{R}^{n_L\times d}$ with missing entries indexed by $\Omega^c\subset[n_L]\times[d]$. The problem is to impute the missing entries of $Y$. This is an important problem in applications ranging from recommendation systems, genomics, and bioinformatics to the analysis of tabular data more broadly. There have been several approaches to this data-imputation problem, from zero imputation and mean imputation to more sophisticated techniques based on generative models. Consider a generative model based on a multi-layer perceptron as in (1), such that the output $Z_{L-1}$ models the uncorrupted data matrix. Then the imputation problem can be posed as the solution of the MAP optimization problem
$$\underset{\{Z_\ell\}_{\ell=0}^{L}}{\text{minimize}}\;\; \|Y - Z_{L-1}\|_\Omega^2 \;-\; \log P(Z_{L-1}, Z_{L-2}, \ldots, Z_0), \qquad (10)$$
where $\|Y - Z_{L-1}\|_\Omega^2 = \sum_{(i,j)\in\Omega}\big((Y)_{ij} - (Z_{L-1})_{ij}\big)^2$. One can similarly construct Bayes estimators such as $\mathbb{E}[Z_{L-1}|Z_L]$. Traditional approaches to matrix completion have used regularized convex minimization schemes of the same form as (10) with $-\log P(Z_{L-1}) = \|Z_{L-1}\|_*$, the nuclear norm, or other structure-inducing convex norms. While the term $-\log P(\cdots)$ in (10) can be viewed as a more general regularizer, this formulation also accommodates applications with heterogeneous variables. For example, when imputing tabular data it is often the case that some columns are continuous-valued while others are discrete-valued, modeling yes/no answers or count data. In such scenarios, $-\log P(Z_{L-1},\ldots)$ allows modeling each column separately using GLMs and other exponential-family distributions. One simple instance of (10) is a generative model $\log P(Z_{L-1},\ldots,Z_0)$ trained on fully observed data $Z_{L-1}$ using unsupervised learning methods such as variational autoencoders (VAEs) or generative adversarial networks (GANs).

3 Multi-layer Matrix VAMP

3.1 MAP and MMSE Inference

Observe that the equations (1) define a Markov chain over the signals, so the posterior $p(Z|Z_L)$ factorizes as
$$p(Z|Z_L) \propto p(Z_0)\prod_{\ell=1}^{L-1} p(Z_\ell|Z_{\ell-1})\; p(Y|Z_{L-1}),$$
where $Z$ is as defined in (2). The transition probabilities $p(Z_\ell|Z_{\ell-1})$ above are implicitly defined by (1) and depend on the statistics of the noise terms $\Xi_\ell$. We consider both maximum a posteriori (MAP) and minimum mean-squared error (MMSE) estimation for this posterior:
$$\widehat{Z}_{\rm map} = \arg\max_{Z}\, p(Z|Z_L), \qquad \widehat{Z}_{\rm mmse} = \mathbb{E}[Z|Z_L] = \int Z\, p(Z|Z_L)\, dZ. \qquad (11)$$

3.2 Algorithm Details

ML-Mat-VAMP approximately computes the MAP and MMSE estimates in a manner similar to the ML-VAMP method of [12, 33]. The specific iterations of ML-Mat-VAMP are shown in Algorithm 1. The algorithm produces estimates via a sequence of forward- and backward-pass updates, denoted by the superscripts $+$ and $-$, respectively. Rather than estimating $Z = \{Z_\ell\}_{\ell=0}^{L-1}$ jointly, the estimates $\widehat{Z}_\ell^{\pm}$ are constructed by solving a sequence of smaller problems, each involving the estimation of a single activation or pre-activation $Z_\ell$ via estimation functions $\{G_\ell^{\pm}(\cdot)\}_{\ell=1}^{L-1}$ that are selected depending on whether one is interested in MAP or MMSE estimation.

To describe the estimation functions, we use the following notation: for a positive definite matrix $\Gamma$, define the inner product $\langle A, B\rangle_\Gamma := \mathrm{Tr}(A^{\mathsf T}B\Gamma)$ and let $\|A\|_\Gamma$ denote the norm induced by this inner product.

Algorithm 1: Multilayer Matrix VAMP (ML-Mat-VAMP)

Require: Estimators $G_0^+$, $G_L^-$, $\{G_\ell^{\pm}\}_{\ell=1}^{L-1}$.
1: Set $R_{0\ell}^- = 0\in\mathbb{R}^{n_\ell\times d}$ and initialize positive definite $\{\Gamma_{0\ell}^-\}_{\ell=0}^{L-1}\subset\mathbb{R}^{d\times d}$.
2: for $k = 0, 1, \ldots, N_{\rm it}-1$ do
3:  // Forward Pass
4:  $\widehat{Z}_{k0}^+ = G_0^+(R_{k0}^-, \Gamma_{k0}^-)$
5:  $\Lambda_{k0}^+ = \big\langle \frac{\partial G_0^+}{\partial R_0^-}(R_{k0}^-, \Gamma_{k0}^-)\big\rangle^{-1}\Gamma_{k0}^-$
6:  $\Gamma_{k0}^+ = \Lambda_{k0}^+ - \Gamma_{k0}^-$
7:  $R_{k0}^+ = (\widehat{Z}_{k0}^+\Lambda_{k0}^+ - R_{k0}^-\Gamma_{k0}^-)(\Gamma_{k0}^+)^{-1}$
8:  for $\ell = 1, \ldots, L-1$ do
9:   $\widehat{Z}_{k\ell}^+ = G_\ell^+(R_{k\ell}^-, R_{k,\ell-1}^+, \Gamma_{k\ell}^-, \Gamma_{k,\ell-1}^+)$
10:   $\Lambda_{k\ell}^+ = \big\langle \frac{\partial G_\ell^+}{\partial R_\ell^-}(\cdots)\big\rangle^{-1}\Gamma_{k\ell}^-$
11:   $\Gamma_{k\ell}^+ = \Lambda_{k\ell}^+ - \Gamma_{k\ell}^-$
12:   $R_{k\ell}^+ = (\widehat{Z}_{k\ell}^+\Lambda_{k\ell}^+ - R_{k\ell}^-\Gamma_{k\ell}^-)(\Gamma_{k\ell}^+)^{-1}$
13:  end for
14:  // Backward Pass
15:  $\widehat{Z}_{k,L-1}^- = G_L^-(R_{k,L-1}^+, \Gamma_{k,L-1}^+)$
16:  $\Lambda_{k,L-1}^- = \big\langle \frac{\partial G_L^-}{\partial R_{L-1}^+}(R_{k,L-1}^+, \Gamma_{k,L-1}^+)\big\rangle^{-1}\Gamma_{k,L-1}^+$
17:  $\Gamma_{k+1,L-1}^- = \Lambda_{k,L-1}^- - \Gamma_{k,L-1}^+$
18:  $R_{k+1,L-1}^- = (\widehat{Z}_{k,L-1}^-\Lambda_{k,L-1}^- - R_{k,L-1}^+\Gamma_{k,L-1}^+)(\Gamma_{k+1,L-1}^-)^{-1}$
19:  for $\ell = L-1, \ldots, 1$ do
20:   $\widehat{Z}_{k+1,\ell-1}^- = G_\ell^-(R_{k+1,\ell}^-, R_{k,\ell-1}^+, \Gamma_{k+1,\ell}^-, \Gamma_{k,\ell-1}^+)$
21:   $\Lambda_{k+1,\ell-1}^- = \big\langle \frac{\partial G_\ell^-}{\partial R_{\ell-1}^+}(\cdots)\big\rangle^{-1}\Gamma_{k,\ell-1}^+$
22:   $\Gamma_{k+1,\ell-1}^- = \Lambda_{k+1,\ell-1}^- - \Gamma_{k,\ell-1}^+$
23:   $R_{k+1,\ell-1}^- = (\widehat{Z}_{k+1,\ell-1}^-\Lambda_{k+1,\ell-1}^- - R_{k,\ell-1}^+\Gamma_{k,\ell-1}^+)(\Gamma_{k+1,\ell-1}^-)^{-1}$
24:  end for
25: end for

For $\ell = 1, \ldots, L-1$, define the approximate belief functions
$$b_\ell(Z_\ell, Z_{\ell-1}\,|\,R_\ell^-, R_{\ell-1}^+, \Gamma_\ell^-, \Gamma_{\ell-1}^+) \propto p(Z_\ell|Z_{\ell-1})\, e^{-\frac{1}{2}\|Z_\ell - R_\ell^-\|^2_{\Gamma_\ell^-}}\, e^{-\frac{1}{2}\|Z_{\ell-1} - R_{\ell-1}^+\|^2_{\Gamma_{\ell-1}^+}}, \qquad (12)$$
where $Z_\ell, R_\ell^{\pm}\in\mathbb{R}^{n_\ell\times d}$ and $\Gamma_\ell^{\pm}\in\mathbb{R}^{d\times d}$ for all $\ell = 0, 1, \ldots, L$. Define $b_0(Z_0|R_0^-, \Gamma_0^-)$ and $b_L(Z_{L-1}|R_{L-1}^+, \Gamma_{L-1}^+)$ similarly. The MAP and MMSE estimation functions are then given by the MAP and MMSE estimates for these belief densities,
$$[G_\ell^+, G_\ell^-]_{\rm map} = (\widehat{Z}_\ell^+, \widehat{Z}_{\ell-1}^-) = \arg\max\; b_\ell(Z_\ell, Z_{\ell-1}), \qquad [G_\ell^+, G_\ell^-]_{\rm mmse} = (\widehat{Z}_\ell^+, \widehat{Z}_{\ell-1}^-) = \mathbb{E}\big[(Z_\ell, Z_{\ell-1})\,|\,b_\ell\big], \qquad (13)$$
where the expectation is with respect to the normalized density proportional to $b_\ell$. Thus, ML-Mat-VAMP reduces the joint estimation of $(Z_0, \ldots, Z_{L-1})$ to a sequence of simpler estimations over sub-problems involving the pairs $(Z_{\ell-1}, Z_\ell)$. We refer to these sub-problems as denoisers and denote their solutions by $G_\ell^{\pm}$, so that $\widehat{Z}_\ell^+ = G_\ell^+$ and $\widehat{Z}_{\ell-1}^- = G_\ell^-$, corresponding to lines 9 and 20 of Algorithm 1. The denoisers $G_0^+$ and $G_L^-$, which provide the updates to $\widehat{Z}_0^+$ and $\widehat{Z}_{L-1}^-$, are defined analogously via $b_0$ and $b_L$, respectively.

The estimation functions (13) can be easily computed for the multi-layer matrix network. An important characteristic of these estimators is that they can be computed using maps that are row-wise separable over their inputs, and hence are easily parallelizable. To simplify notation, we denote the precision parameters of the denoisers $G_\ell^{\pm}$ in the $k$th iteration by
$$\Theta_{k\ell}^+ := (\Gamma_{k\ell}^-, \Gamma_{k,\ell-1}^+), \quad \Theta_{k\ell}^- := (\Gamma_{k+1,\ell}^-, \Gamma_{k,\ell-1}^+), \quad \Theta_{k0}^+ := \Gamma_{k0}^-, \quad \Theta_{kL}^- := \Gamma_{k,L-1}^+. \qquad (14)$$

Non-linear layers: For $\ell$ even, since the rows of $\Xi_\ell$ are i.i.d., the belief density $b_\ell(Z_\ell, Z_{\ell-1}|\cdot)$ from (12) factors as a product across rows, $b_\ell(Z_\ell, Z_{\ell-1}) = \prod_n b_\ell([Z_\ell]_{n:}, [Z_{\ell-1}]_{n:})$. Thus, the MAP and MMSE estimates (13) can be computed over $d$-dimensional variables, where $d$ is the number of entries in each row; there is no joint estimation across the $n_\ell$ rows.
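To make the row-wise structure concrete, the sketch below implements a MAP denoiser for an even layer under the additional assumption that the conditional $p(Z_\ell|Z_{\ell-1})$ comes from additive Gaussian noise, i.e., $Z_\ell = \phi(Z_{\ell-1}) + \Xi_\ell$ with row-wise precision `Noise_prec`. Each row solves a small $2d$-dimensional problem derived from the belief (12); the use of SciPy's BFGS solver and all names are our illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def map_denoiser_row(r_minus, r_plus, Gam_minus, Gam_plus, Noise_prec, phi, d):
    """MAP estimate of (z_l, z_{l-1}) in R^d x R^d for one row of the belief (12)."""
    def cost(u):
        z_l, z_lm1 = u[:d], u[d:]
        e = z_l - phi(z_lm1)                              # residual of the layer model
        return (0.5 * e @ Noise_prec @ e
                + 0.5 * (z_l - r_minus) @ Gam_minus @ (z_l - r_minus)
                + 0.5 * (z_lm1 - r_plus) @ Gam_plus @ (z_lm1 - r_plus))
    u0 = np.concatenate([r_minus, r_plus])                # warm start at the messages
    u = minimize(cost, u0, method="BFGS").x
    return u[:d], u[d:]

def map_denoiser(R_minus, R_plus, Gam_minus, Gam_plus, Noise_prec, phi):
    """Row-wise denoiser [G_l^+, G_l^-] (cf. lines 9 and 20 of Algorithm 1)."""
    d = R_minus.shape[1]
    out = [map_denoiser_row(rm, rp, Gam_minus, Gam_plus, Noise_prec, phi, d)
           for rm, rp in zip(R_minus, R_plus)]
    Z_l = np.vstack([z for z, _ in out])
    Z_lm1 = np.vstack([z for _, z in out])
    return Z_l, Z_lm1
```

Because each row is independent, the loop over rows parallelizes trivially, which is the practical payoff of the row-wise separability noted above.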
Linear layers: When $\ell$ is odd, the density $b_\ell(Z_\ell, Z_{\ell-1}|\cdot)$ in (12) is Gaussian. Hence the MAP and MMSE estimates agree and can be computed via least squares. Although for linear layers $[G_\ell^+, G_\ell^-](R_\ell^-, R_{\ell-1}^+, \Theta_\ell)$ is not row-wise separable over $(R_\ell^-, R_{\ell-1}^+)$, it can be computed using another row-wise denoiser $[\widetilde{G}_\ell^+, \widetilde{G}_\ell^-]$ via the SVD of the weight matrix, $W_\ell = V_\ell\,\mathrm{diag}(S_\ell)\,V_{\ell-1}^{\mathsf T}$, which needs to be computed only once:
$$[G_\ell^+, G_\ell^-](R_\ell^-, R_{\ell-1}^+, \Theta_\ell) = \underset{Z_\ell, Z_{\ell-1}}{\arg\min}\; \|Z_\ell - W_\ell Z_{\ell-1} - B_\ell\|^2_{N_\ell} + \|Z_\ell - R_\ell^-\|^2_{\Gamma_\ell^-} + \|Z_{\ell-1} - R_{\ell-1}^+\|^2_{\Gamma_{\ell-1}^+}$$
$$\overset{(a)}{=} \underset{Z_\ell, Z_{\ell-1}}{\arg\min}\; \|V_\ell^{\mathsf T}Z_\ell - \mathrm{diag}(S_\ell)V_{\ell-1}^{\mathsf T}Z_{\ell-1} - V_\ell^{\mathsf T}B_\ell\|^2_{N_\ell} + \|V_\ell^{\mathsf T}Z_\ell - V_\ell^{\mathsf T}R_\ell^-\|^2_{\Gamma_\ell^-} + \|V_{\ell-1}^{\mathsf T}Z_{\ell-1} - V_{\ell-1}^{\mathsf T}R_{\ell-1}^+\|^2_{\Gamma_{\ell-1}^+}$$
$$\overset{(b)}{=} \big[V_\ell\,\widetilde{G}_\ell^+,\; V_{\ell-1}\,\widetilde{G}_\ell^-\big]\big(V_\ell^{\mathsf T}R_\ell^-,\; V_{\ell-1}^{\mathsf T}R_{\ell-1}^+,\; \Theta_\ell\big),$$
where (a) follows from the rotational invariance of the norm, and (b) follows from the definition of the denoiser $[\widetilde{G}_\ell^+, \widetilde{G}_\ell^-](\widetilde{R}_\ell^-, \widetilde{R}_{\ell-1}^+, \Theta_\ell)$ given by
$$[\widetilde{G}_\ell^+, \widetilde{G}_\ell^-] := \underset{\widetilde{Z}_\ell, \widetilde{Z}_{\ell-1}}{\arg\min}\; \|\widetilde{Z}_\ell - \mathrm{diag}(S_\ell)\widetilde{Z}_{\ell-1} - \widetilde{B}_\ell\|^2_{N_\ell} + \|\widetilde{Z}_\ell - \widetilde{R}_\ell^-\|^2_{\Gamma_\ell^-} + \|\widetilde{Z}_{\ell-1} - \widetilde{R}_{\ell-1}^+\|^2_{\Gamma_{\ell-1}^+}. \qquad (15)$$
Note that the optimization problem in (15) is decomposable across the rows of the variables $\widetilde{Z}_\ell$ and $\widetilde{Z}_{\ell-1}$, and hence $[\widetilde{G}_\ell^+, \widetilde{G}_\ell^-]$ operates row-wise on its inputs.

Fixed points: The fixed points of the ML-Mat-VAMP algorithm can be shown to be KKT points of the variational formulations of (11); we omit the details due to lack of space, as they are a direct extension of the results in Section 3 of [35]. In particular, in the MAP inference case, ML-Mat-VAMP can be shown to be a preconditioned Peaceman-Rachford splitting (ADMM-type) algorithm [40].
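Each row of (15) is a small quadratic problem. Under the additional simplifying assumptions of a square weight matrix (so every row couples only through its scalar singular value $s_i$) and symmetric precision matrices, the stationarity conditions reduce to a $2d\times 2d$ linear system per row, as in the sketch below; the derivation and all names are ours, given as an illustration rather than the paper's exact implementation.

```python
import numpy as np

def linear_denoiser_row(s, r_l, r_lm1, b, N_prec, Gam_l, Gam_lm1):
    """Solve one row of (15): returns (z_l, z_{l-1}), each a length-d vector.

    s       : scalar singular value for this row
    r_l     : row of V_l^T R_l^-        (length d)
    r_lm1   : row of V_{l-1}^T R_{l-1}^+ (length d)
    b       : row of V_l^T B_l           (length d)
    N_prec, Gam_l, Gam_lm1 : symmetric d x d precision matrices
    """
    d = r_l.shape[0]
    # Stationarity of the quadratic cost in the stacked row u = [z_l, z_{l-1}]:
    M = np.block([[N_prec + Gam_l,          -s * N_prec],
                  [-s * N_prec,   s * s * N_prec + Gam_lm1]])
    rhs = np.concatenate([b @ N_prec + r_l @ Gam_l,
                          -s * (b @ N_prec) + r_lm1 @ Gam_lm1])
    u = np.linalg.solve(M, rhs)          # M is symmetric positive definite here
    return u[:d], u[d:]
```

Applying this function independently to every row (and mapping back with $V_\ell$ and $V_{\ell-1}$) recovers the full linear-layer denoiser, so the cost per iteration is dominated by the one-time SVD and a few matrix multiplies.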
4 Analysis in the Large System Limit

We follow the analysis framework of the ML-VAMP work [12, 33], which is itself based on the original AMP analysis in [3]. The analysis considers the asymptotics of certain large random problem instances. We show that, under certain assumptions, as the dimension goes to infinity the behavior of the ML-Mat-VAMP algorithm is characterized by a set of equations describing how the distribution of the rows of the hidden matrices evolves, for every layer, at each iteration of the algorithm. Specifically, we consider a sequence of problems (1) indexed by $N$ such that the dimensions $n_\ell(N)$ grow with $\lim_{N\to\infty} n_\ell/N = \beta_\ell\in(0,\infty)$ for scalar constants $\beta_\ell$. The column dimension $d$ is finite and does not grow with $N$.

Distributions of weight matrices: For $\ell = 1, 3, \ldots, L-1$, we assume that the weight matrices $W_\ell$ are generated via the singular value decomposition $W_\ell = V_\ell\,\mathrm{diag}(S_\ell)\,V_{\ell-1}^{\mathsf T}$, where $V_\ell\in\mathbb{R}^{n_\ell\times n_\ell}$ are Haar-distributed over orthogonal matrices and $S_\ell = (s_{\ell,1},\ldots,s_{\ell,\min\{n_\ell,n_{\ell-1}\}})$. We describe the distribution of the components of $S_\ell$ momentarily.

Assumption on denoisers: We assume that the non-linear denoisers $G_{2k}^{\pm}$ act row-wise on their inputs $(R_{2k}^-, R_{2k-1}^+)$. Further, these operators and their Jacobian matrices $\frac{\partial G_{2k}^+}{\partial R_{2k}^-}$, $\frac{\partial G_{2k}^-}{\partial R_{2k-1}^+}$, $\frac{\partial G_0^+}{\partial R_0^-}$, $\frac{\partial G_L^-}{\partial R_{L-1}^+}$ are uniformly Lipschitz continuous, the definition of which is provided in Appendix B.

Assumption on initialization and true variables: The distribution of the remaining variables is described by a weak limit. For a matrix sequence $X := X(N)\in\mathbb{R}^{N\times d}$, the notation $X\xrightarrow{2}X$ means that there exists a random variable $X$ in $\mathbb{R}^d$ with $\mathbb{E}\|X\|^2 < \infty$ such that $\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\psi(X_{i:}) = \mathbb{E}\,\psi(X)$ almost surely, for any bounded continuous function $\psi:\mathbb{R}^d\to\mathbb{R}$, as well as for the quadratic functions $x\mapsto xPx^{\mathsf T}$ for any positive semidefinite $P\in\mathbb{R}^{d\times d}$. This is also referred to as Wasserstein-2 convergence [30]. For example, this property is satisfied by a random $X$ with i.i.d. rows with bounded second moments, but it is more general, since it applies to deterministic matrix sequences as well. More details on this weak limit are given in Appendix B. Let $\overline{B}_\ell := V_\ell^{\mathsf T}B_\ell$, let $\overline{S}_\ell\in\mathbb{R}^{n_\ell}$ be the zero-padded vector of singular values of $W_\ell$, and let $\tau_{0\ell}^-\in\mathbb{R}^{d\times d}$ be positive semidefinite. We then assume that the following empirical convergences hold: $(\Xi_\ell,\, R_{0\ell}^- - Z_\ell^0)\xrightarrow{2}(\Xi_\ell, Q_{0\ell}^-)$ for even $\ell$, and $(\overline{S}_\ell, \overline{B}_\ell, \Xi_\ell,\, V_\ell^{\mathsf T}(R_{0\ell}^- - Z_\ell^0))\xrightarrow{2}(S_\ell, B_\ell, \Xi_\ell, Q_{0\ell}^-)$ for odd $\ell$. Here $S_\ell\ge 0$ is bounded, $B_\ell\in\mathbb{R}^d$ is bounded, $\Xi_{2\ell-1}\sim N(0, N_{2\ell-1}^{-1})$, and $Q_{0\ell}^-\sim N(0, \tau_{0\ell}^-)$, for $\ell = 0, 1, \ldots, L-1$, and these random variables are pairwise independent. Additionally, we assume that $Z_0^0\xrightarrow{2}Z_0$ and that the sequence of initial matrices $\{\Gamma_{0\ell}^-\}$ satisfies the pointwise convergence
$$\Gamma_{0\ell}^-(N)\to\overline{\Gamma}{}_{0\ell}^-, \qquad \ell = 0, 1, \ldots, L-1. \qquad (16)$$

4.1 Main Result

The main result of this paper concerns the empirical distributions of the rows $[\widehat{Z}_\ell^{\pm}]_{n:}$, $[R_\ell^{\pm}]_{n:}$ of the iterates of Algorithm 1. It characterizes the asymptotic behavior of these empirical distributions in terms of $d$-dimensional random vectors that are either Gaussian or functions of Gaussians. Let $G_\ell^{\pm}$ denote the row-wise maps $\mathbb{R}^{1\times d}\to\mathbb{R}^{1\times d}$ underlying (13), i.e., $[G_\ell^{\pm}(R_\ell^-, R_{\ell-1}^+, \Theta)]_{n:} = G_\ell^{\pm}([R_\ell^-]_{n:}, [R_{\ell-1}^+]_{n:}, \Theta)$. Having stated the requisite definitions and assumptions, we can now state our main result.

Theorem 1. For a fixed iteration index $k\ge 0$, there exist deterministic positive semidefinite matrices $K_{k\ell}^+\in\mathbb{R}^{2d\times 2d}$ and $\tau_{k\ell}^-, \overline{\Gamma}{}_{k\ell}^+, \overline{\Gamma}{}_{k\ell}^-\in\mathbb{R}^{d\times d}$ such that, for even $\ell$,
$$\big(Z_{\ell-1}^0,\; Z_\ell^0,\; \widehat{Z}_{k,\ell-1}^-,\; \widehat{Z}_{k\ell}^+\big)\xrightarrow{2}\big(A,\; \widetilde{A},\; G_\ell^-(C+\widetilde{A}, B+A, \overline{\Gamma}{}_{k\ell}^-, \overline{\Gamma}{}_{k,\ell-1}^+),\; G_\ell^+(C+\widetilde{A}, B+A, \overline{\Gamma}{}_{k\ell}^-, \overline{\Gamma}{}_{k,\ell-1}^+)\big),$$
where $(A, B)\sim N(0, K_{k,\ell-1}^+)$, $C\sim N(0, \tau_{k\ell}^-)$, $\widetilde{A} = \phi_\ell(A, \Xi_\ell)$, and $(A,B)$ and $C$ are independent. For $\ell = 0$, the same result holds with the 1st and 3rd terms dropped, whereas for $\ell = L$, the 2nd and 4th terms are dropped. Similarly, for odd $\ell$,
$$\big(V_{\ell-1}^{\mathsf T}Z_{\ell-1}^0,\; V_\ell^{\mathsf T}Z_\ell^0,\; V_{\ell-1}^{\mathsf T}\widehat{Z}_{k,\ell-1}^-,\; V_\ell^{\mathsf T}\widehat{Z}_{k\ell}^+\big)\xrightarrow{2}\big(A,\; \widetilde{A},\; G_\ell^-(C+\widetilde{A}, B+A, \overline{\Gamma}{}_{k\ell}^-, \overline{\Gamma}{}_{k,\ell-1}^+),\; G_\ell^+(C+\widetilde{A}, B+A, \overline{\Gamma}{}_{k\ell}^-, \overline{\Gamma}{}_{k,\ell-1}^+)\big),$$
where $(A, B)\sim N(0, K_{k,\ell-1}^+)$, $C\sim N(0, \tau_{k\ell}^-)$, $\widetilde{A} = S_\ell A + B_\ell + \Xi_\ell$, and $(A,B)$ and $C$ are independent. Furthermore, for $\ell = 0, 1, \ldots, L-1$, we have $(\Gamma_{k\ell}, \Lambda_{k\ell})\xrightarrow{\rm a.s.}(\overline{\Gamma}_{k\ell}, \overline{\Lambda}_{k\ell})$.

The parameters of the distributions, $\{K_{k\ell}^+, \tau_{k\ell}^-, \overline{\Gamma}_{k\ell}, \overline{\Lambda}_{k\ell}\}$, are deterministic and can be computed via a set of recursive equations called the state evolution (SE), provided in Appendix A. The result is similar to those for ML-VAMP in [12, 35], except that the SE equations for ML-Mat-VAMP involve $d\times d$ and $2d\times 2d$ matrix terms, whereas the ML-VAMP SE only requires scalar and $2\times 2$ matrix terms. The result holds for both MAP and MMSE inference; the only difference is implicit, namely the choice of denoiser $G_\ell(\cdot)$ in (13).

The importance of Theorem 1 is that the rows of the ML-Mat-VAMP iterates ($\widehat{Z}_{k,\ell-1}^-$ and $\widehat{Z}_{k\ell}^+$ in Algorithm 1) and the rows of the corresponding true values, $Z_{\ell-1}^0$ and $Z_\ell^0$, admit a simple asymptotic random-vector description of a typical row, which we call the "row-wise" model. According to this model, for even $\ell$, the rows of $Z_{\ell-1}^0$ converge to a Gaussian $A\in\mathbb{R}^d$, and the rows of $Z_\ell^0$ converge to the output of this Gaussian through the row-wise function $\phi_\ell$, i.e., $\widetilde{A} = \phi_\ell(A, \Xi_\ell)$. The rows of the estimates $\widehat{Z}_{k,\ell-1}^-, \widehat{Z}_{k\ell}^+$ then asymptotically approach the outputs of the row-wise estimation functions $G_\ell^-(\cdot)$ and $G_\ell^+(\cdot)$ supplied with $A$ and $\widetilde{A}$ corrupted by Gaussian noise. A similar convergence holds for odd $\ell$. This row-wise model enables an exact analysis of the performance of the estimates at each iteration. For example, for a weighted mean-squared error (MSE) metric at iteration $k$, the convergence shows that
$$\tfrac{1}{n_\ell}\,\|\widehat{Z}_{k\ell}^+ - Z_\ell^0\|_{H}^2 \xrightarrow{\rm a.s.} \mathbb{E}\,\big\|G_\ell^+(C+\widetilde{A}, B+A, \Theta_{k\ell}) - \widetilde{A}\big\|_{H}^2,$$
for even $\ell$ and any positive semidefinite matrix $H\in\mathbb{R}^{d\times d}$, where the norm on the left-hand side acts row-wise, $\|Z\|_H^2 := \sum_i\|Z_{i:}\|_H^2$. Hence, this asymptotic MSE can be evaluated via expectations of $d$-dimensional variables from the SE. Similarly, one can obtain exact answers for any other row-wise performance metric of $\{(\widehat{Z}_{k\ell}^{\pm}, Z_\ell^0)\}_\ell$ for any $k$.
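A practical way to evaluate such a $d$-dimensional expectation is plain Monte Carlo over the row-wise model of Theorem 1. The sketch below does this for the weighted MSE of an even layer, assuming the SE outputs `K`, `tau`, the precisions, the layer map `phi`, and the row-wise denoiser `g_plus` are supplied; these are placeholders standing in for the quantities computed in Appendix A, and all names are ours.

```python
import numpy as np

def asymptotic_mse(K, tau, Gam_m, Gam_p, phi, g_plus, H, n_mc=100_000, seed=0):
    """Monte-Carlo estimate of E || G_l^+(C + A~, B + A, Theta) - A~ ||_H^2."""
    rng = np.random.default_rng(seed)
    d = tau.shape[0]
    AB = rng.multivariate_normal(np.zeros(2 * d), K, size=n_mc)   # rows (A, B)
    A, B = AB[:, :d], AB[:, d:]
    C = rng.multivariate_normal(np.zeros(d), tau, size=n_mc)      # Gaussian perturbation
    Xi = rng.standard_normal((n_mc, d))                           # layer noise samples
    A_tilde = phi(A, Xi)                                          # rows of true Z_l
    Z_hat = g_plus(C + A_tilde, B + A, Gam_m, Gam_p)              # rows of the estimate
    E = Z_hat - A_tilde
    return np.mean(np.einsum('ij,jk,ik->i', E, H, E))             # row-wise ||.||_H^2
```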
Figure 2: Test error in learning the first layer of a two-layer neural network using Adam-based gradient descent, ML-Mat-VAMP, and the ML-Mat-VAMP state evolution prediction.

5 Numerical Experiments

We consider the problem of learning the input layer of a two-layer neural network, as described in Section 2.3: we learn the weights $F_1$ of the first layer by solving the inference problem defined by the network (9). The large-system-limit analysis in this case corresponds to the input size $N_{\rm in}$ and the number of samples $N$ going to infinity with the number of hidden units fixed. Our experiments take $d = N_{\rm hid} = 4$ hidden units, $N_{\rm in} = 100$ input units, $N_{\rm out} = 1$ output unit, sigmoid activations, and a variable number of samples $N$. The weight matrices $F_1$ and $F_2$ are generated as i.i.d. Gaussians with zero mean and unit variance. The entries of the input $X$ are also i.i.d. Gaussian, with variance $1/N_{\rm in}$ so that the average pre-activation has unit variance. Output noise is added at two levels, 10 and 15 dB, relative to the mean. We generate 1000 test samples and a variable number of training samples ranging from 200 to 4000. For each trial and number of training samples, we compare three methods: (i) MAP estimation, where the MAP loss function is minimized by the Adam optimizer [21] in the Keras package of TensorFlow; (ii) Algorithm 1 run for 20 iterations; and (iii) the state evolution prediction. Adam is run for 100 epochs with a learning rate of 0.01. The expectations in the SE are estimated via Monte Carlo sampling (hence there is some variation).

Given an estimate $\widehat{F}_1$ and the true value $F_1^0$, we compute the test error as follows. Given a new sample $x$, the true and predicted pre-activations are $z_1 = (F_1^0)^{\mathsf T}x$ and $\widehat{z}_1 = \widehat{F}_1^{\mathsf T}x$. Thus, if the new sample satisfies $x\sim N(0, \tfrac{1}{N_{\rm in}}I)$, the true and predicted pre-activations $(z_1, \widehat{z}_1)$ are jointly Gaussian with covariance equal to the empirical $2d\times 2d$ covariance matrix of the rows of $F_1^0$ and $\widehat{F}_1$:
$$K := \frac{1}{N_{\rm in}}\sum_{k=1}^{N_{\rm in}} u_k^{\mathsf T}u_k, \qquad u_k = \big[\,F^0_{1,k:}\;\;\widehat{F}_{1,k:}\,\big]. \qquad (17)$$
From this covariance matrix, we can estimate the test error $\mathbb{E}|y - \widehat{y}|^2 = \mathbb{E}\big|F_2^{\mathsf T}\big(\sigma(z_1) - \sigma(\widehat{z}_1)\big)\big|^2$, where the expectation is taken over the Gaussian $(z_1, \widehat{z}_1)$ with covariance $K$. Also, since (17) is a row-wise operation, it can be predicted from the ML-Mat-VAMP SE; thus the SE can also predict the asymptotic test error.

The normalized test error for Adam-MAP, ML-Mat-VAMP, and the ML-Mat-VAMP SE is plotted in Fig. 2. The normalized test error is defined as the ratio of the MSE on the test samples to the optimal MSE; hence a normalized MSE of one is the minimum value. Note that since Adam and ML-Mat-VAMP solve the same optimization problem, they perform similarly, as expected. The main message of this paper is not an algorithm that outperforms Adam, but rather an algorithm with theoretical guarantees: the key property of ML-Mat-VAMP is that its asymptotic behavior at every iteration can be exactly predicted by the state evolution equations. In this example, Fig. 2 shows that the normalized test MSE predicted via state evolution (plotted in green) matches the normalized MSE of the ML-Mat-VAMP estimates (plotted in orange).
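The test-error computation around (17) can be sketched as follows, using Monte Carlo over the jointly Gaussian pre-activations; the sigmoid activation matches the experiment above, while the function and variable names are our own illustrative choices.

```python
import numpy as np

def test_error(F1_true, F1_hat, F2, n_mc=100_000, seed=0):
    """Estimate E|y - y_hat|^2 = E|F2^T (sigma(z1) - sigma(z1_hat))|^2 via (17)."""
    rng = np.random.default_rng(seed)
    N_in, d = F1_true.shape
    U = np.hstack([F1_true, F1_hat])              # rows u_k = [F1_k:, F1_hat_k:]
    K = U.T @ U / N_in                            # empirical 2d x 2d covariance (17)
    Z = rng.multivariate_normal(np.zeros(2 * d), K, size=n_mc)
    z1, z1_hat = Z[:, :d], Z[:, d:]               # jointly Gaussian pre-activations
    sigma = lambda H: 1.0 / (1.0 + np.exp(-H))
    diff = (sigma(z1) - sigma(z1_hat)) @ F2       # F2 has shape (d, 1)
    return np.mean(diff ** 2)
```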
6 Conclusions

We have developed a general framework for analyzing inference in multi-layer networks with matrix-valued quantities in certain high-dimensional random settings. For learning the input layer of a two-layer network, the method enables precise predictions of the expected test error as a function of the parameter statistics, number of samples, and noise level. This type of analysis can be valuable for understanding key properties such as generalization error; for example, using ML-VAMP, Emami et al. [11] characterize the generalization error of GLMs under a variety of feature distributions and train-test mismatch. Future work will extend these results to more complex networks.

Broader Impact

In statistical physics, systems with a large number of degrees of freedom often admit a simplified macroscopic description. Modern neural networks have thousands of hidden units and billions of free parameters; is there an analogous macroscopic description of the dynamics of multi-layer neural networks? This paper identifies some of these macroscopic descriptions, which can be used to analyze a large class of optimization problems (see Section 2 for examples) arising in signal processing, data science, and machine learning. The techniques developed in this paper allow analyzing and understanding the fundamental limits of learning in one- and two-layer neural networks, which are basic building blocks of modern machine-learning pipelines.

Acknowledgements and Funding Disclosure

The work of P. Schniter was supported by NSF grant 1716388. The work of P. Pandit, M. Sahraee-Ardakan, and A. K. Fletcher was supported in part by NSF Grants 1738285 and 1738286 and ONR Grant N00014-15-1-2677. The work of S. Rangan was supported in part by NSF grants 1116589, 1302336, and 1547332, NIST, SRC, and the industrial affiliates of NYU Wireless.

References

[1] Benjamin Aubin, Antoine Maillard, Florent Krzakala, Nicolas Macris, Lenka Zdeborová, et al. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Advances in Neural Information Processing Systems, pages 3223–3234, 2018.
[2] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proc. Nat. Acad. Sci., 116(12):5451–5460, 2019.
[3] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764–785, February 2011.
[4] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. Proc. ICML, 2017.
[5] Evan Byrne, Antoine Chatalic, Rémi Gribonval, and Philip Schniter. Sketched clustering via hybrid approximate message passing. IEEE Transactions on Signal Processing, 67(17):4556–4569, 2019.
[6] Burak Cakmak, Ole Winther, and Bernard H. Fleury. S-AMP: Approximate message passing for general matrix ensembles. In Proc. IEEE ITW, 2014.
[7] Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.
[8] Shane F. Cotter, Bhaskar D. Rao, Kjersti Engan, and Kenneth Kreutz-Delgado. Sparse solutions to linear inverse problems with multiple measurement vectors. IEEE Transactions on Signal Processing, 53(7):2477–2488, 2005.
[9] D. L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proc. Nat. Acad. Sci., 106(45):18914–18919, Nov. 2009.
[10] D. L. Donoho, A. Maleki, and A. Montanari. Message passing algorithms for compressed sensing. In Proc. Inform. Theory Workshop, pages 1–5, 2010.
[11] Melikasadat Emami, Mojtaba Sahraee-Ardakan, Parthe Pandit, Sundeep Rangan, and Alyson K. Fletcher. Generalization error of generalized linear models in high dimensions. arXiv preprint arXiv:2005.00180, 2020.
[12] Alyson K. Fletcher, Sundeep Rangan, and P. Schniter. Inference in deep networks in high dimensions. Proc. IEEE Int. Symp. Information Theory, 2018.
[13] Alyson K. Fletcher, Mojtaba Sahraee-Ardakan, Sundeep Rangan, and Philip Schniter. Expectation consistent approximate inference: Generalizations and convergence. In Proc. IEEE Int. Symp. Information Theory, pages 190–194, 2016.
[14] Marylou Gabrié, Andre Manoel, Clément Luneau, Jean Barbier, Nicolas Macris, Florent Krzakala, and Lenka Zdeborová. Entropy and mutual information in models of deep neural networks. In Proc. NIPS, 2018.
[15] Paul Hand and Vladislav Voroninski. Global guarantees for enforcing deep generative priors by empirical risk. arXiv:1705.07576, 2017.
[16] Hengtao He, Chao-Kai Wen, and Shi Jin. Generalized expectation consistent signal recovery for nonlinear measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2333–2337. IEEE, 2017.
[17] Yoshiyuki Kabashima. A CDMA multiuser detection algorithm on the basis of belief propagation. Journal of Physics A: Mathematical and General, 36(43):11111, 2003.
[18] Maya Kabkab, Pouya Samangouei, and Rama Chellappa. Task-aware compressed sensing with generative adversarial networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[19] Nicolas Keriven, Anthony Bourrier, Rémi Gribonval, and Patrick Pérez. Sketching for large-scale learning of mixture models. Information and Inference: A Journal of the IMA, 7(3):447–508, 2017.
[20] Nicolas Keriven, Nicolas Tremblay, Yann Traonmilin, and Rémi Gribonval. Compressive k-means. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6369–6373. IEEE, 2017.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] Dong Liang, Leslie Ying, and Feng Liang. Parallel MRI acceleration using M-FOCUSS. In Proc. International Conference on Bioinformatics and Biomedical Engineering, pages 1–4. IEEE, 2009.
[23] Jun S. Liu. Siegel's formula via Stein's identities. Statistics & Probability Letters, 21(3):247–251, 1994.
[24] Junjie Ma and Li Ping. Orthogonal AMP. IEEE Access, 5:2020–2033, 2017.
[25] Andre Manoel, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Multi-layer generalized linear estimation. In Proc. IEEE Int. Symp. Information Theory, pages 2098–2102, 2017.
[26] Andre Manoel, Florent Krzakala, Gaël Varoquaux, Bertrand Thirion, and Lenka Zdeborová. Approximate message-passing for convex optimization with non-separable penalties. arXiv preprint arXiv:1809.06304, 2018.
[27] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
[28] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proc. UAI, pages 362–369, 2001.
[29] Dustin G. Mixon and Soledad Villar. SUNLayer: Stable denoising with generative networks. arXiv preprint arXiv:1803.09319, 2018.
[30] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.
[31] Guillaume Obozinski, Ben Taskar, and Michael Jordan. Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep., 2(2.2), 2006.
[32] Manfred Opper and Ole Winther. Expectation consistent approximate inference. J. Machine Learning Res., 6:2177–2204, December 2005.
[33] P. Pandit, M. Sahraee, S. Rangan, and A. K. Fletcher. Asymptotics of MAP inference in deep networks. In Proc. IEEE Int. Symp. Information Theory, pages 842–846, 2019.
[34] Parthe Pandit, Mojtaba Sahraee-Ardakan, Sundeep Rangan, Philip Schniter, and Alyson K. Fletcher. Inference with deep generative priors in high dimensions, 2019.
[35] Parthe Pandit, Mojtaba Sahraee-Ardakan, Sundeep Rangan, Philip Schniter, and Alyson K. Fletcher. Inference with deep generative priors in high dimensions. IEEE Journal on Selected Areas in Information Theory, 2020.
[36] Sundeep Rangan, Philip Schniter, and Alyson K. Fletcher. Vector approximate message passing. IEEE Trans. Information Theory, 65(10):6664–6684, 2019.
[37] G. Reeves. Additivity of information in multilayer networks via additive Gaussian noise transforms. In Proc. Allerton Conf. Comm. Control & Comput., pages 1064–1070, 2017.
[38] Viraj Shah and Chinmay Hegde. Solving linear inverse problems using GAN priors: An algorithm with provable guarantees. In IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 4609–4613, 2018.
[39] Keigo Takeuchi. Rigorous dynamics of expectation-propagation-based signal recovery from unitarily invariant measurements. In Proc. IEEE Int. Symp. Information Theory, pages 501–505, 2017.
[40] Andreas Themelis and Panagiotis Patrinos. Douglas-Rachford splitting and ADMM for nonconvex optimization: Tight convergence results. SIAM Journal on Optimization, 30(1):149–181, 2020.
[41] Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000.
[42] Subarna Tripathi, Zachary C. Lipton, and Truong Q. Nguyen. Correction by projection: Denoising images with generative adversarial networks. arXiv preprint arXiv:1803.04477, 2018.
[43] George Tzagkarakis, Dimitris Milioris, and Panagiotis Tsakalides. Multiple-measurement Bayesian compressed sensing using GSM priors for DOA estimation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2610–2613. IEEE, 2010.
[44] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[45] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proc. 28th Int. Conf. Machine Learning, pages 681–688, 2011.
[46] Raymond Yeh, Chen Chen, Teck Yian Lim, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv:1607.07539, 2016.
[47] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pages 613–621, 2014.
[48] Justin Ziniel and Philip Schniter. Efficient high-dimensional inference in the multiple measurement vector problem. IEEE Transactions on Signal Processing, 61(2):340–354, 2012.