Asymptotics of Ridge Regression in Convolutional Models

Mojtaba Sahraee-Ardakan 1 2, Tung Mai 3, Anup Rao 3, Ryan Rossi 3, Sundeep Rangan 4, Alyson K. Fletcher 1 2

1Department of Electrical and Computer Engineering, University of California, Los Angeles. 2Department of Statistics, University of California, Los Angeles. 3Adobe Research. 4Department of Electrical and Computer Engineering, New York University. Correspondence to: Mojtaba Sahraee-Ardakan. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Understanding the generalization and estimation error of estimators for simple models such as linear and generalized linear models has attracted a lot of attention recently. This is in part due to an interesting observation made in the machine learning community that highly over-parameterized neural networks achieve zero training error, and yet they are able to generalize well over the test samples. This phenomenon is captured by the so-called double descent curve, where the generalization error starts decreasing again after the interpolation threshold. A series of recent works tried to explain such phenomena for simple models. In this work, we analyze the asymptotics of the estimation error of ridge estimators for convolutional linear models. These convolutional inverse problems, also known as deconvolution, naturally arise in different fields such as seismology, imaging, and acoustics among others. Our results hold for a large class of input distributions that include i.i.d. features as a special case. We derive exact formulae for the estimation error of ridge estimators that hold in a certain high-dimensional regime. We show the double descent phenomenon in our experiments for convolutional models and show that our theoretical results match the experiments.

1. Introduction

Increasingly powerful hardware along with deep learning libraries that efficiently use these computational resources have allowed us to train ever larger neural networks. Modern neural networks are so over-parameterized that they can perfectly fit random noise (Zhang et al., 2016; Li et al., 2020). With enough over-parameterization, they can also achieve zero loss over the training data with their parameters moving only slightly away from the initialization (Allen-Zhu et al., 2018; Soltanolkotabi et al., 2018; Du et al., 2018a;b; Li and Liang, 2018; Zou et al., 2020). Yet, these models generalize well on test data and are widely used in practice (Zhang et al., 2016). In fact, some recent work suggests that it is best practice to use as large a model as possible for the task at hand (Huang et al., 2018). This seems contrary to our classical understanding of generalization, where increasing the complexity of the model space to the extent that the training data can be easily interpolated indicates poor generalization. Most statistical approaches explain generalization by controlling some notion of capacity of the hypothesis space, such as VC dimension, Rademacher complexity, or metric entropy (Anthony and Bartlett, 2009). Such approaches, which do not incorporate the implicit regularization effect of the optimization algorithm, fail to explain the generalization of over-parameterized deep networks (Kalimeris et al., 2019; Oymak and Soltanolkotabi, 2019).
The so-called double descent curve, where the test risk starts decreasing again as neural networks are over-parameterized beyond the interpolation threshold, is widely known by now (Belkin et al., 2019a; Nakkiran et al., 2019). Interestingly, such phenomena are not unique to neural networks and have been observed even in linear models (Le Cun et al., 1991; Dobriban et al., 2018; Advani et al., 2020). Recently, another line of work has also connected neural networks to linear models. In (Jacot et al., 2018), the authors show that infinitely wide neural networks trained by gradient descent behave like their linearization with respect to the parameters around their initialization. The problem of training such wide neural networks with square loss then turns into a kernel regression problem in an RKHS associated with a fixed kernel called the neural tangent kernel (NTK). The NTK results were later extended to many other architectures such as convolutional networks and recurrent neural networks (Li et al., 2019; Alemohammad et al., 2020; Yang, 2020). Trying to understand generalization in deep networks and to explain phenomena such as the double descent curve has attracted a lot of attention to the theoretical properties of kernel methods as well as simple machine learning models. Such models, despite their simplicity, can help us gain a better understanding of machine learning models and algorithms that might be hard to achieve just by looking at deep neural networks due to their complex nature.

Double descent has been shown in linear models (Dobriban et al., 2018; Belkin et al., 2019b; Hastie et al., 2019), logistic regression (Deng et al., 2019), support vector machines (Montanari et al., 2019), generalized linear models (Emami et al., 2020), kernel regression (Liang et al., 2020), random features regression (Mei and Montanari, 2019; Hastie et al., 2019), and random Fourier feature regression (Liao et al., 2020), among others. Most of these works consider the problem in a doubly asymptotic regime where both the number of parameters and the number of observations go to infinity at a fixed ratio. This is in contrast to classical statistics, where either the number of parameters is assumed to be fixed and the number of samples goes to infinity, or vice versa. In practice, the number of parameters and the number of samples are usually comparable, and therefore the doubly asymptotic regime gives a more accurate picture of the performance of different models and algorithms. In this work we study the performance of ridge estimators in convolutional linear inverse problems in this asymptotic regime. Unlike ordinary linear inverse problems, the theoretical properties of the estimation problem in convolutional models have not been studied, despite their wide use in practice, e.g. in solving inverse problems with deep generative priors (Ulyanov et al., 2018). Beyond machine learning, inverse problems involving convolutional measurement models are often called deconvolution and are encountered in many different fields. In astronomy, deconvolution is used for example to deblur, sharpen, and correct for optical aberrations in imaging (Starck et al., 2002). In seismology, deconvolution is used to separate seismic traces into a source wavelet and an impulse response that corresponds to the layered structure of the earth (Treitel and Lines, 1982; Mueller, 1985).
In imaging, it is used to correct for blurs caused by the point spread function, to sharpen out-of-focus areas in 3D microscopy (McNally et al., 1999), and to separate neuronal spikes from calcium traces in calcium imaging (Friedrich et al., 2017), among others. In practical applications, the convolution kernel might not be known and should be estimated either from the physics of the problem or jointly with the unknown signal using the data.

Summary of Contributions. We analyze the performance of ridge estimators for convolutional models in the proportional asymptotics regime. Our main result (Theorem 1) characterizes the limiting joint distribution of the true signal and its ridge estimate in terms of the spectral properties of the data. As a result of this theorem, we can provide an exact formula to compute the mean squared error of the ridge estimator in the form of a scalar integral (Corollary 1). Our assumptions on the data are fairly general and cover many random processes, as opposed to i.i.d. features only. Even though our theoretical results hold only in a certain high-dimensional limit, our experiments show that their predictions match the observed error even for problems of moderate size. We show that our result can predict the double descent curve of the estimation error as we change the ratio of the number of measurements to the number of unknowns.

Prior Work. The asymptotic error of ridge regression for ordinary linear inverse problems (as opposed to the convolutional linear inverse problems considered in this work) is studied in (Dicker et al., 2016) for isotropic features. Asymptotics of ridge regression for correlated features is studied in (Dobriban et al., 2018). The error of ridgeless (minimum $\ell_2$-norm interpolant) regression for data generated from a linear or nonlinear model is obtained in (Hastie et al., 2019). These works use results from random matrix theory to derive closed-form formulae for the estimation or generalization error. For features with a general i.i.d. prior other than the Gaussian distribution, approximate message passing (AMP) (Donoho et al., 2010; Bayati and Montanari, 2011) or vector approximate message passing (VAMP) (Rangan et al., 2019) can be used to obtain the asymptotics of different types of error. Instead of a closed-form formula, these works show that the asymptotic error can be obtained via a recursive equation called the state evolution. In (Deng et al., 2019), the authors use the convex Gaussian min-max theorem to characterize the performance of maximum likelihood as well as SVM classifiers with i.i.d. Gaussian covariates. In (Emami et al., 2020), the problem of learning generalized linear models is reduced to an inference problem in deep networks and the results of (Pandit et al., 2020) are used to obtain the generalization error.

Notation. We use uppercase boldface letters for matrices and tensors, and lowercase boldface letters for vectors. For a matrix $A$, its $i$th row and $i$th column are denoted by $A_{i\cdot}$ and $A_{\cdot i}$ respectively. A similar notation is used for slices of tensors. The submatrix formed by columns $i$ through $j-1$ of $A$ is denoted by $A_{\cdot, i:j}$. The standard inner product for vectors, matrices, and tensors is represented by $\langle \cdot, \cdot \rangle$. $N(0, 1)$ and $CN(0, 1)$ denote the standard normal and standard complex normal distributions respectively. Finally, $[n] = \{1, \ldots, n\}$.
2. Problem Formulation

We consider the inverse problem of estimating $X$ from $Y$ in the convolutional model
$$Y = K \circledast X + \Xi, \quad (1)$$
where $X \in \mathbb{C}^{n_x \times T}$, $Y \in \mathbb{C}^{n_y \times T}$, $K \in \mathbb{C}^{n_y \times n_x \times k}$ with $K_{i\cdot} \in \mathbb{C}^{n_x \times k}$ being the $i$th convolution kernel of width $k$, and $\Xi$ is a noise matrix with the same shape as $Y$ and i.i.d. zero-mean complex normal entries $CN(0, \sigma^2)$. See Appendix A for a brief overview of the complex normal distribution. The circular convolution $\circledast$ in equation (1) is defined as
$$Y_{i\cdot} = K_{i\cdot} \circledast X + \Xi_{i\cdot}, \quad (2)$$
$$Y_{it} = \langle K_{i\cdot}, X_{\cdot, t:t+k} \rangle + \Xi_{it}, \qquad i \in [n_y],\ t \in [T], \quad (3)$$
where the column indices $t:t+k$ are taken modulo $T$. Note that in (3) we are not using the correct definition of the convolution operation, where the kernel (or the signal $X$) is reflected along the time axis, but rather the common definition used in machine learning. We consider the inference problem in the Bayesian setting where the signal $X$ is assumed to have a prior that admits a density (with respect to the Lebesgue measure) $p(X)$. Further, we assume that the rows of $X$ are i.i.d., so that this density factorizes as
$$p(X) = \prod_{i=1}^{n_x} p(X_{i\cdot}). \quad (4)$$
The convolution kernel $K$ is assumed to be known, with i.i.d. entries drawn from $CN(0, 1/(n_y k))$. Given this statistical model, the posterior is
$$p(X \mid Y) \propto p(X, Y) = p(X)\, p(Y \mid X), \quad (5)$$
where, with some abuse of notation, we use $p(\cdot)$ to represent the densities of all random variables to simplify the notation. From the Gaussianity assumption on the noise we have
$$Y_{i\cdot} \mid X \sim CN(K_{i\cdot} \circledast X,\ \sigma^2 I), \quad (6)$$
where $I$ is the identity matrix of size $T \times T$. Given the model in (5), one can consider different types of estimators for $X$. Of particular interest are regularized M-estimators
$$\hat{X}_{\mathrm{m\text{-}est}} = \arg\min_X\ L(X, Y) + R(X), \quad (7)$$
where $L$ is a loss function and $R$ corresponds to the regularization term. Taking the negative log-likelihood as the loss and the negative log-prior as the regularization we get the maximum a posteriori (MAP) estimator
$$\hat{X}_{\mathrm{MAP}} := \arg\min_X\ -\log p(Y \mid X) - \log p(X), \quad (8)$$
which selects the mode of the posterior as the estimate. In this work we are interested in analyzing the performance of the ridge-regularized least squares estimator, which is another special case of the regularized M-estimator in (7) with square loss and $\ell_2$-norm regularization:
$$\hat{X}_{\mathrm{ridge}} = \arg\min_X\ \|Y - K \circledast X\|_F^2 + \lambda \|X\|_F^2, \quad (9)$$
where $\lambda$ is the regularization parameter. By Gaussianity of the noise, this can also be thought of as the $\ell_2$-regularized maximum likelihood estimator. Given an estimate $\hat{X}$, one is usually interested in the performance of the estimator based on some metric. In the Bayesian setting, the metric is usually an average risk (averaged over the prior and the randomness of the data)
$$R = \mathbb{E}\, \ell(X, \hat{X}), \quad (10)$$
where $\ell$ is some loss function between the true parameters and the estimates. The most widely used loss is the squared error, where $\ell(X, \hat{X}) = \|X - \hat{X}\|_F^2$. Our theoretical results exactly characterize the mean squared error (MSE) of the ridge estimator (9) in a certain high-dimensional regime described below.
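To make the model (1)-(3) and the ridge estimator (9) concrete, the following is a minimal NumPy sketch (not part of the paper) that generates data from the convolutional model and computes the ridge estimate by brute force, flattening the circular convolution into one large linear operator. All sizes and parameter values are illustrative assumptions, and the flattening is only practical for tiny problems; an efficient way to solve (9) is discussed in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy instance of the convolutional model (1)-(3); all sizes are illustrative
nx, ny, T, k = 3, 4, 16, 4
sigma2 = 0.1        # noise variance sigma^2
lam = 0.5           # ridge parameter lambda (illustrative)

def crandn(*shape):
    # i.i.d. CN(0, 1) entries
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

K = crandn(ny, nx, k) / np.sqrt(ny * k)     # entries ~ CN(0, 1/(ny*k))
X = crandn(nx, T)                           # i.i.d. rows, the simplest admissible prior
Xi = np.sqrt(sigma2) * crandn(ny, T)

# forward model (3): Y[i, t] = sum_{j, s} K[i, j, s] * X[j, (t + s) mod T] + Xi[i, t]
Y = Xi.copy()
for s in range(k):
    Y += K[:, :, s] @ np.roll(X, -s, axis=1)

# ridge estimator (9): flatten the circular convolution into a (ny*T) x (nx*T) matrix A,
# then solve the ordinary ridge problem; only meant to make the definition concrete
A = np.zeros((ny * T, nx * T), dtype=complex)
for i in range(ny):
    for t in range(T):
        for j in range(nx):
            for s in range(k):
                A[i * T + t, j * T + (t + s) % T] += K[i, j, s]

Xhat = np.linalg.solve(A.conj().T @ A + lam * np.eye(nx * T),
                       A.conj().T @ Y.reshape(-1)).reshape(nx, T)
print("relative squared error:",
      np.linalg.norm(Xhat - X) ** 2 / np.linalg.norm(X) ** 2)
```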
3. Main Result

Similar to other works in this area (Bayati and Montanari, 2011; Rangan et al., 2019; Pandit et al., 2020), our goal is to analyze the average case performance of the ridge estimator in (9) for the convolutional model in (1) in a certain high-dimensional regime that is called the large system limit (LSL).

3.1. Large System Limit

We consider a sequence of problems indexed by $T$ and $n_x$. We assume that $k := k(T)$ and $n_y := n_y(n_x)$ are functions of $T$ and $n_x$ respectively, and
$$\lim_{n_x \to \infty} \frac{n_y(n_x)}{n_x} = \delta \in (0, \infty), \qquad \lim_{T \to \infty} \frac{k(T)}{T} = \beta \in (0, 1].$$
This doubly asymptotic regime, where both the number of parameters and the number of unknowns go to infinity at a fixed ratio, is sometimes called the proportional asymptotic regime in the literature. Note that even though we assume that $k$ and $T$ go to infinity at a fixed ratio throughout this paper, the linear scaling of $k$ with respect to $T$ is not necessary. We only need $k$ to scale at an arbitrary rate with respect to $T$ (or even remain a constant), so long as the variance of the kernel entries is scaled appropriately (see Remark 5 for more details).

3.1.1. ASSUMPTIONS ON THE NOISE AND KERNEL

We assume that the entries of the convolution kernel $K$ and the noise $\Xi$ converge empirically with second order to random variables $K$ and $\Xi$, with distributions $CN(0, \sigma_K^2/(k n_y))$ and $CN(0, \sigma^2)$ respectively. See Appendix B for the definition of empirical convergence of random variables and some useful results. Informally, empirical convergence means that as we take the limit in $T$ and $n_x$, the empirical distributions of the entries of the convolution kernel tensor and of the entries of the noise matrix in (1) converge to the distributions of complex normal random variables. As stated formally in the appendix, the convergence of the distributions is with respect to the Wasserstein-2 metric on the space of distributions (see Proposition 1 for the precise statement).

3.1.2. ASSUMPTIONS ON $X_{i\cdot}$

Next, based on (Peligrad et al., 2010), we state the distributional assumptions on the rows of $X$ that we require for our theory to hold. Let $\{\xi_t\}_{t \in \mathbb{Z}}$ be a stationary ergodic Markov chain defined on a probability space $(S, \mathcal{F}, P)$ and let $\pi(\cdot)$ be the distribution of $\xi_0$. Note that stationarity implies all $\xi_t$ have the same marginal distribution. Define the space of functions
$$L_0^2(\pi) := \big\{ h \ \big|\ \mathbb{E}_\pi[h(\xi_0)] = 0,\ \mathbb{E}_\pi[h^2(\xi_0)] < \infty \big\}. \quad (11)$$
Define $\mathcal{F}_k := \sigma(\{\xi_t\}_{t \le k})$, the $\sigma$-algebra generated by $\xi_t$ up to time $k$, and let $x_t = h(\xi_t)$ for some $h \in L_0^2(\pi)$. We assume that the process $\{x_t\}_{t \in \mathbb{Z}}$ satisfies the regularity condition
$$\mathbb{E}[x_0 \mid \mathcal{F}_{-\infty}] = 0, \quad P\text{-almost surely}. \quad (12)$$
The class of processes that satisfy these conditions is quite large and includes i.i.d. random processes as an example. It also includes causal functions of i.i.d. random variables of the form $x_n = f(\xi_k,\ k \le n)$, where the $\xi_k$ are i.i.d., such as autoregressive (AR) processes and many Markov chains. See (Peligrad et al., 2010) for more examples satisfying these conditions. We assume that each row $X_{i\cdot}$ of $X$ is an i.i.d. sample of a process that satisfies the conditions mentioned above. Let $\tilde{X}_{i\cdot}(\omega)$ be its (normalized) Fourier transform (defined in (18)), and define $g(\omega) := \lim_{T \to \infty} \mathbb{E}|\tilde{X}_{i\cdot}(\omega)|^2$. As shown in (Peligrad et al., 2010), since the rows are i.i.d., this limit is the same for all the rows. Also, $g(\omega)$ is proportional to the spectral density of the process that generates the rows of $X$:
$$g(\omega) = \sum_{t=-\infty}^{\infty} c_t \exp(-i\omega t), \qquad c_t = \mathbb{E}[X_{i0} X_{it}]. \quad (13)$$
As we will see in the next section, $g(\omega)$ plays a key role in the characterization of the estimation error of the ridge estimator in convolutional linear inverse problems that have such processes as inputs.

3.2. Asymptotics of Ridge Estimator

The main result of our paper characterizes the limiting distribution to which the Fourier transform of the true signal $\tilde{X}_0(\omega)$ and the Fourier transform of the estimated signal $\hat{\tilde{X}}_{\mathrm{ridge}}(\omega)$ converge. The proof can be found in Section 4. In the following, $\mathcal{B}$ and $\mu$ represent the Borel $\sigma$-algebra and the Lebesgue measure respectively.
Theorem 1. Under the assumptions in Section 3.1, as $n_x, n_y, T, k \to \infty$, over the product space $([0, 2\pi] \times S^{n_x}, \mathcal{B} \otimes \mathcal{F}^{n_x}, \mu \times P^{n_x})$ the ridge estimator satisfies
$$\begin{bmatrix} \tilde{X}_0(\omega) \\ \hat{\tilde{X}}_{\mathrm{ridge}}(\omega) \end{bmatrix} \xrightarrow{d,\, PL(2)} \begin{bmatrix} \sqrt{g(U)}\, Z_0 \\ \alpha\big(\sqrt{g(U)}\, Z_0 + \tau(g(U))\, Z_1\big) \end{bmatrix},$$
where $U \sim \mathrm{unif}([0, 2\pi])$, $Z_0, Z_1 \sim CN(0, 1)$ with $CN(0, 1)$ the standard complex normal distribution, $U$, $Z_0$ and $Z_1$ are independent of each other, $\alpha$ is the smaller root of the quadratic equation
$$\frac{\lambda\, \alpha}{\sigma_K^2} = (1 - \alpha)\Big(1 - \frac{\alpha}{\delta}\Big), \quad (14)$$
and
$$\tau^2(g(U)) = \frac{\sigma^2/\sigma_K^2 + (1 - \alpha)^2\, g(U)/\delta}{1 - \alpha^2/\delta}. \quad (15)$$

The convergence in this theorem is weak convergence for $\omega$ and $PL(2)$ for $\tilde{X}_0(\omega)$ and $\hat{\tilde{X}}_{\mathrm{ridge}}(\omega)$. See Appendix B for the definition of this convergence. This convergence result allows us to find the asymptotic mean squared error of the ridge estimator for the convolutional model as an integral.

Corollary 1. Under the same assumptions as in Theorem 1, the ridge estimator satisfies
$$\lim_{n_x \to \infty} \lim_{T \to \infty} \frac{1}{n_x T} \big\|\hat{\tilde{X}}_{\mathrm{ridge}} - \tilde{X}\big\|_F^2 = \frac{1}{2\pi} \int_0^{2\pi} \Big[ (\alpha - 1)^2 g(\omega) + \alpha^2 \tau^2(g(\omega)) \Big]\, d\omega. \quad (16)$$

Remark 1. As shown in the proof of Lemma 6, for $\lambda \ge 0$ the quadratic equation (14) always has two real positive solutions, the smaller of which determines the error.

Remark 2. The $1/\sqrt{T}$ scaling in our definition of the Fourier operator in Section 4.2 makes it a unitary operator, i.e. the $\ell_2$ norm is preserved under the Fourier transform and its inverse. This implies
$$\|\hat{X}_{\mathrm{ridge}} - X\|_F = \|\hat{\tilde{X}}_{\mathrm{ridge}} - \tilde{X}\|_F. \quad (17)$$
Therefore, the result of Corollary 1 also holds in the time domain.

Remark 3. If the rows $X_{i\cdot}$ have zero-mean i.i.d. entries, then the correlation coefficients $c_t$ in (13) are all zero except for $c_0$. Therefore, $g(\omega) = g$ where $g$ is a constant. Hence, in this case, the integrand in (16) is a constant and the estimation error across all frequencies is the same as the estimation error in ordinary ridge regression as in Lemma 6, i.e. the error vs. $\delta$ is exactly the same as the double descent curve in ordinary ridge regression with i.i.d. priors. In other words, the double descent curve for ordinary ridge regression carries over to convolutional ridge regression for i.i.d. priors (see Figure 1).

4. Proof of the Main Result

In this section we present the proof of Theorem 1. Before presenting the details of the proof, it is helpful to see an overview.

4.1. Proof overview

We first show that convolutional models turn into ordinary linear models for each frequency in the Fourier domain. We then show that ridge estimators in the time domain can also be written as ridge estimators in the frequency domain. This uses the fact that the Fourier transform matrix, with appropriate normalization, is a unitary matrix, and the $\ell_2$ norm is preserved under unitary transformations, i.e. it is an isometry. Next, we use properties of the Fourier transform of random processes to show that under certain conditions they asymptotically converge to a Gaussian process in the frequency domain that is independent across different frequencies, for almost every frequency. Together, these allow us to turn the ridge estimation in the time domain into multiple ridge estimators in the frequency domain, one for each frequency. We then use theoretical properties of ridge estimators to derive the estimation error for each of these ridge estimators and integrate them over frequencies to derive our main result.
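Before turning to the technical lemmas, here is a minimal numerical sketch (not part of the paper) of how the formulas in Theorem 1 and Corollary 1 can be evaluated: the smaller root of (14) and the variance (15) have closed forms, and for the flat-spectrum case of Remark 3 the integral in (16) reduces to its integrand. The helper names and the parameter values (in particular $\sigma_K^2 = 1$, $g = 1$, $\sigma^2 = 0.1$, $\lambda = 10^{-3}$) are illustrative assumptions.

```python
import numpy as np

def alpha_tau2(g, delta, lam, sigma2, sigma2_K=1.0):
    """Smaller root of (14) and the corresponding tau^2 from (15)."""
    # (14) rearranged: alpha^2/delta - (1 + 1/delta + lam/sigma2_K) * alpha + 1 = 0
    b = 1.0 + 1.0 / delta + lam / sigma2_K
    alpha = (b - np.sqrt(b * b - 4.0 / delta)) * delta / 2.0
    tau2 = (sigma2 / sigma2_K + (1.0 - alpha) ** 2 * g / delta) / (1.0 - alpha ** 2 / delta)
    return alpha, tau2

def mse_flat_spectrum(g, delta, lam, sigma2):
    """Corollary 1 when g(w) = g is constant (Remark 3): the integral equals its integrand."""
    alpha, tau2 = alpha_tau2(g, delta, lam, sigma2)
    return (alpha - 1.0) ** 2 * g + alpha ** 2 * tau2

# sweeping delta = ny/nx with a small lambda traces the double descent curve,
# with the error peaking near the interpolation threshold delta = 1
for delta in [0.25, 0.5, 0.9, 1.0, 1.1, 2.0, 4.0]:
    print(delta, mse_flat_spectrum(g=1.0, delta=delta, lam=1e-3, sigma2=0.1))
```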
Our proof is based on previous results for the asymptotic error of ridge estimators for ordinary linear inverse problems. This has been studied in many works (Dicker et al., 2016; Dobriban et al., 2018; Hastie et al., 2019), where the authors take advantage of the fact that ridge estimators have a closed-form solution that can be analyzed, e.g. using results from random matrix theory. In this work we use approximate message passing (Donoho et al., 2010; Bayati and Montanari, 2011) to derive the asymptotic error of ridge estimators.

4.2. Main Technical Lemmas

In order to prove Theorem 1, we need several lemmas. We first characterize the convolutional model in (1) in the Fourier domain. For $\omega \in \{0, \frac{2\pi}{T}, \ldots, (T-1)\frac{2\pi}{T}\} =: \Omega$, let $\tilde{X}_{j\cdot}(\omega)$ be the discrete (circular) Fourier transform (DFT) of $X_{j\cdot}$,
$$\tilde{X}_{j\cdot}(\omega) = \frac{1}{\sqrt{T}} \sum_{t=0}^{T-1} X_{jt} \exp(-i\omega t). \quad (18)$$
If we let $F$ denote the $T$-point unitary DFT matrix, i.e. $F_{mn} = \frac{1}{\sqrt{T}} \exp(-i 2\pi m n / T)$, then this equation can be written in matrix form as $\tilde{X}_{i\cdot} = X_{i\cdot} F$. Define $\tilde{Y}_{i\cdot}(\omega)$, $\tilde{K}_{ij}(\omega)$, and $\tilde{\Xi}(\omega)$ similarly. Note that in these definitions we have included a $1/\sqrt{T}$ factor which is usually not included in the definition of the Fourier transform, but since this makes the Fourier matrix unitary, it eases our notation slightly. With this definition we have $F^T = F$ and $F^* F = I$.

Lemma 1 (Convolutional models in Fourier domain). The convolutional model in (1) can be written in the Fourier domain as
$$\tilde{Y}(\omega) = \sqrt{T}\, \tilde{K}(\omega)\, \tilde{X}(\omega) + \tilde{\Xi}(\omega), \qquad \omega \in \Omega. \quad (19)$$

Proof. Taking the Fourier transform of Equation (2) and using the convolution theorem we get
$$\tilde{Y}_{i\cdot}(\omega) = \sqrt{T} \sum_{j=1}^{n_x} \tilde{K}_{ij}(\omega)\, \tilde{X}_{j\cdot}(\omega) + \tilde{\Xi}_{i\cdot}(\omega). \quad (20)$$
Note that the $\sqrt{T}$ factor on the right hand side is due to our definition of the Fourier transform, where we have used a $1/\sqrt{T}$ factor to make the Fourier operator unitary. Rewriting this equation in matrix form gives us the desired result.

The next lemma characterizes the ridge estimator in (9) in the frequency domain.

Lemma 2. The ridge estimator in (9) in the frequency domain is equivalent to solving separate ordinary ridge regressions for each $\omega \in \Omega$:
$$\hat{\tilde{X}}_{\mathrm{ridge}}(\omega) = \arg\min_{\tilde{X}(\omega)}\ \big\|\tilde{Y}(\omega) - \sqrt{T}\, \tilde{K}(\omega)\, \tilde{X}(\omega)\big\|_2^2 + \lambda \big\|\tilde{X}(\omega)\big\|_2^2. \quad (21)$$

Proof. Since the Fourier matrix is unitary, we have
$$\|X\|_F = \|XF\|_F = \|\tilde{X}\|_F, \qquad \|Y - K \circledast X\|_F = \|(Y - K \circledast X)F\|_F = \|\tilde{Y} - \sqrt{T}\, \tilde{K} \circ \tilde{X}\|_F,$$
where $\tilde{K} \circ \tilde{X}$ is a tensor-matrix product defined as $(\tilde{K} \circ \tilde{X})(\omega) = \tilde{K}(\omega) \tilde{X}(\omega)$. Then a change of variable $\tilde{X} = XF$ in (9) proves the lemma.

Lemma 3. If the kernel $K$ has i.i.d. $CN(0, \sigma_K^2/(k n_y))$ entries, then for each $\omega \in \Omega$, $\sqrt{T}\, \tilde{K}(\omega)$ has i.i.d. complex normal entries $CN(0, \sigma_K^2/n_y)$.

Proof. The DFT of the kernel is
$$\tilde{K}_{ij}(\omega) = \frac{1}{\sqrt{T}} \sum_{t=0}^{k-1} K_{ijt} \exp(-i\omega t). \quad (22)$$
This is a linear combination of complex Gaussian random variables and therefore $\tilde{K}$ is a tensor with jointly complex Gaussian entries. Clearly, $\mathbb{E}(\tilde{K}) = 0$, and for $(i, j) \ne (i', j')$, using independence of $K_{ij\cdot}$ and $K_{i'j'\cdot}$ we have
$$\mathbb{E}\big[\tilde{K}_{ij}(\omega)\, \tilde{K}_{i'j'}(\omega')\big] = 0, \qquad \mathbb{E}\big[\tilde{K}_{ij}(\omega)\, \overline{\tilde{K}_{i'j'}(\omega')}\big] = 0 \qquad \forall\, \omega, \omega',$$
which proves that for any $\omega$, $\tilde{K}(\omega)$ has independent entries and all the dependence in $\tilde{K}$ is across different frequencies. It remains to find the variance and relation of each entry of $\tilde{K}(\omega)$. Let $k := K_{ij\cdot}$ for some $i$ and $j$ be a row vector (zero-padded to length $T$), and let $\tilde{k} := \tilde{K}_{ij}(\cdot)$. Then we have $\tilde{k} = k F$, and the variance of its $m$th entry satisfies
$$\gamma = T\, \mathbb{E}\big[\tilde{k}_m \overline{\tilde{k}_m}\big] = T\, \mathbb{E}\big[k F_{\cdot m} F_{\cdot m}^* k^*\big] \quad (23)$$
$$= T\, F_{\cdot m}^*\, \mathbb{E}[k^* k]\, F_{\cdot m} \quad (24)$$
$$= T\, \frac{\sigma_K^2}{k n_y} \sum_{t=0}^{k-1} \frac{1}{T} \exp\!\Big(\frac{2\pi i t m}{T}\Big) \exp\!\Big(\frac{-2\pi i t m}{T}\Big) \quad (25)$$
$$= \frac{\sigma_K^2}{n_y}. \quad (26)$$
Similarly, since the entries of $K$ are circularly symmetric complex Gaussians, $\mathbb{E}[K_{ijt}^2] = 0$, and the relation is
$$c(m) = T\, \mathbb{E}\big[\tilde{k}_m \tilde{k}_m\big] = T\, F_{\cdot m}^T\, \mathbb{E}[k^T k]\, F_{\cdot m} = 0.$$
Therefore, for each $\omega$, $\sqrt{T}\, \tilde{K}_{ij}(\omega) \sim CN(0, \sigma_K^2/n_y)$, and the entries are i.i.d. over $i$ and $j$.

Remark 4. Observe that the scaling of the variance of the entries of $K$ with $1/k$ is crucial to get a non-trivial distribution for the entries of $\sqrt{T}\, \tilde{K}(\omega)$ as we take the limit $T \to \infty$.
Remark 5. Here we should emphasize that the linear scaling of $k$ with respect to $T$ is not necessary for this lemma to hold. All we need is for $k$ and the variance of the convolution kernel to scale in such a way that in the Fourier domain the entries of $\sqrt{T}\, \tilde{K}(\omega)$ are i.i.d. $CN(0, \sigma_K^2/n_y)$. This can be achieved by appropriate scaling of the variance in the time domain and of $k$, even in the case where $k$ is constant.

Lemma 4. If the noise $\Xi$ has i.i.d. $CN(0, \sigma^2)$ entries, then $\tilde{\Xi}$ has i.i.d. complex normal entries $\tilde{\Xi}_{ij}(\omega) \sim CN(0, \sigma^2)$, i.e.
$$\tilde{\Xi}_{ij}(\omega) \overset{d}{=} \sqrt{\tfrac{\sigma^2}{2}}\,(Z_1 + i Z_2), \qquad Z_1, Z_2 \sim N(0, 1),\ Z_1 \perp Z_2.$$

Proof. The proof is similar to the proof of Lemma 3 with $k = T$.

Lemma 4 is the complex analogue of the fact that the distribution of vectors with i.i.d. Gaussian entries is invariant under orthogonal transformations. As stated in Appendix B, for a Gaussian random sequence, convergence of the first and second moments implies convergence in Wasserstein distance, which is equivalent to $PL(2)$ convergence. Therefore, Lemmas 3 and 4 also imply that the entries of the kernel and the noise for each frequency converge empirically with second order to i.i.d. complex Gaussian random variables. As shown in the appendix, this convergence is stronger than convergence in distribution. Next, we mention a result about the Fourier transform of random processes from (Peligrad et al., 2010).

Lemma 5 (Fourier transform of random processes (Peligrad et al., 2010)). Let $\{X_t\}_{t \in \mathbb{Z}}$ be a stationary ergodic process that satisfies the assumptions in Section 3.1.2. Let $\tilde{X}(\omega)$ be its (normalized) Fourier transform and $g(\omega) := \lim_{T \to \infty} \mathbb{E}|\tilde{X}(\omega)|^2$. Then on the product space $([0, 2\pi] \times S, \mathcal{B} \otimes \mathcal{F}, \mu \times P)$ we have
$$\tilde{X}(\omega) \overset{d}{=} \sqrt{g(U)}\, CN(0, 1), \quad (27)$$
where $U \sim \mathrm{unif}([0, 2\pi])$ is independent of the $CN(0, 1)$ variable.

Lemma 5 allows us to characterize the limiting distribution of each row $X_{i\cdot}$ of the input in the frequency domain, i.e. the distribution of $\tilde{X}_{i\cdot}(\omega)$ as $T \to \infty$. All the lemmas so far allow us to look at the convolutional model in the frequency domain. We need one last lemma to characterize the asymptotics of ridge regression in high dimensions.

Lemma 6 (Asymptotics of ridge regression). Consider the linear model $y = A x_0 + \xi$, where $x_0 \in \mathbb{R}^{n_x}$, $\xi \in \mathbb{R}^{n_y}$, and $A \in \mathbb{R}^{n_y \times n_x}$ all have i.i.d. components with $x_{0,i} \sim N(0, \sigma_x^2)$, $\xi_i \sim N(0, \sigma^2)$, and $A_{ij} \sim N(0, \sigma_A^2/n_y)$. Then the ridge estimator
$$\hat{x}_{\mathrm{ridge}} = \arg\min_x\ \|y - Ax\|_2^2 + \lambda \|x\|_2^2, \quad (28)$$
as $n_x, n_y \to \infty$ with $n_y/n_x \to \delta$, satisfies
$$\begin{bmatrix} x_0 \\ \hat{x}_{\mathrm{ridge}} \end{bmatrix} \xrightarrow{PL(2)} \begin{bmatrix} X_0 \\ \alpha\big(X_0 + \tau(\sigma_x) Z\big) \end{bmatrix}, \quad (29)$$
where $X_0 \sim N(0, \sigma_x^2)$, $Z \sim N(0, 1)$ is independent of $X_0$, $\alpha$ is the smaller root of the quadratic equation
$$\frac{\lambda\, \alpha}{\sigma_A^2} = (1 - \alpha)\Big(1 - \frac{\alpha}{\delta}\Big), \quad (30)$$
and
$$\tau^2(\sigma_x) = \frac{\sigma^2/\sigma_A^2 + (1 - \alpha)^2 \sigma_x^2/\delta}{1 - \alpha^2/\delta}. \quad (31)$$
Therefore, we almost surely have
$$\lim_{n_x \to \infty} \frac{1}{n_x} \|\hat{x}_{\mathrm{ridge}} - x_0\|_2^2 = (\alpha - 1)^2 \sigma_x^2 + \alpha^2 \tau^2(\sigma_x).$$

Proof. This is a consequence of using the approximate message passing (AMP) algorithm to solve (28). We mentioned in the overview of the proof of our main theorem (Section 4.1) that AMP is only one of the ways to obtain the asymptotic error of ridge estimators in ordinary linear problems. There are several other methods, mostly based on random matrix theory, that also derive such asymptotic errors, but in this work we adopt the AMP approach. See Appendix D for a brief introduction to AMP and Appendix D.1 for the proof of this lemma. Here we briefly sketch the proof. AMP is a recursive algorithm that was originally proposed to solve linear inverse problems but has since been applied to many different problems.
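To make this concrete, the following is a minimal sketch (not the paper's appendix) of the AMP recursions (32)-(33) stated in the next paragraph, specialized to the ridge problem (28) under the simplifying assumption $\sigma_A^2 = 1$; the problem sizes, $\lambda$, and the noise level are illustrative. With $\alpha$ set to the smaller root of (30), the iteration is expected to converge to the ridge solution, and the final error can be compared with the state-evolution prediction (31).

```python
import numpy as np

rng = np.random.default_rng(1)

# sketch of the AMP recursions (32)-(33) below for the ridge problem (28),
# assuming sigma_A^2 = 1; sizes, lambda, and noise level are illustrative
nx, ny = 400, 800
delta = ny / nx
lam, sigma_x, sigma = 0.5, 1.0, 0.5

A = rng.standard_normal((ny, nx)) / np.sqrt(ny)      # A_ij ~ N(0, 1/ny)
x0 = sigma_x * rng.standard_normal(nx)
y = A @ x0 + sigma * rng.standard_normal(ny)

# alpha: smaller root of the quadratic (30)
b = 1.0 + 1.0 / delta + lam
alpha = (b - np.sqrt(b * b - 4.0 / delta)) * delta / 2.0

x, z_old = np.zeros(nx), np.zeros(ny)
for _ in range(100):
    z = y - A @ x + (alpha / delta) * z_old          # (33), with the Onsager term
    x = alpha * (A.T @ z + x)                        # (32), linear "denoiser"
    z_old = z

x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(nx), A.T @ y)   # closed form for (28)
tau2 = (sigma ** 2 + (1 - alpha) ** 2 * sigma_x ** 2 / delta) / (1 - alpha ** 2 / delta)
print("AMP vs closed-form ridge:", np.linalg.norm(x - x_ridge) / np.linalg.norm(x_ridge))
print("empirical MSE:", np.mean((x - x0) ** 2),
      " SE prediction (31):", (alpha - 1) ** 2 * sigma_x ** 2 + alpha ** 2 * tau2)
```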
The key property of AMP algorithms is that a set of equations called the state evolution (SE) exactly characterizes the performance of AMP at each iteration. As shown in the appendix, the AMP recursions to solve the ridge regression in (28) are
$$x^{t+1} = \alpha\big(A^T z^t + x^t\big), \quad (32)$$
$$z^t = y - A x^t + \frac{\alpha}{\delta} z^{t-1}, \quad (33)$$
where $\alpha$ is the smaller root of the quadratic equation in (30). In the appendix we prove that for $\lambda \ge 0$ the roots of this equation are real and positive, and only for the smaller root does the AMP algorithm converge. Finally, using the state evolution, we obtain that the ridge estimator and the true values of $x_0$ jointly converge as in (29), and $\tau^2(\sigma_x)$ in (31) is the fixed point value of the state evolution recursion in (69). This lemma allows us to find the asymptotics of ridge regression for real-valued linear inverse problems. Even though $\tau$ in (31) also depends on $\alpha$, $\delta$, and the noise variance, we have only made the dependence on $\sigma_x$ explicit, as all the other parameters are fixed for the ridge regression problem at each frequency, whereas $\sigma_x$ changes as a function of frequency.

Remark 6. The exact same result holds for complex-valued ridge regression mutatis mutandis, i.e. by changing the normal distributions $N(0, \cdot)$ to complex normal distributions $CN(0, \cdot)$.

Remark 7. The requirements of Lemma 6 can be relaxed. Rather than requiring $x_0$ or $\xi$ to have i.i.d. Gaussian entries, we only need them to converge $PL(2)$ to random variables with these distributions.

4.3. Proof of Theorem 1

We now have all the ingredients to prove Theorem 1. Lemma 2 allows us to find the Fourier transform of the ridge estimator in (9) by solving a series of ridge regressions in the Fourier domain. Lemmas 3 and 4 show that the Fourier transforms of the convolution kernel and the noise asymptotically have complex Gaussian distributions with i.i.d. components for each frequency. Then, Lemma 5 shows that
$$\tilde{X}_{i\cdot}(\omega) \overset{d}{=} \sqrt{g(U)}\, CN(0, 1), \quad (34)$$
where $g(\cdot)$ is defined in the lemma. Next, the complex version of Lemma 6 gives us the asymptotic error of the ridge estimator on the product space $([0, 2\pi] \times S, \mathcal{B} \otimes \mathcal{F}, \mu \times P)$ in the limit:
$$\begin{bmatrix} \tilde{X}_0(\omega) \\ \hat{\tilde{X}}_{\mathrm{ridge}}(\omega) \end{bmatrix} \to \begin{bmatrix} \sqrt{g(U)}\, Z_0 \\ \alpha\big(\sqrt{g(U)}\, Z_0 + \tau(g(U))\, Z_1\big) \end{bmatrix}, \quad (35)$$
where $U \sim \mathrm{unif}([0, 2\pi])$, $Z_0, Z_1 \sim CN(0, 1)$, and $\tau(\cdot)$ is the function in (31). Note that the variance of $\tilde{X}_0(\omega)$ is $g(\omega)$, hence the term $\tau(g(U))\, Z_1$. As mentioned earlier, this variance is the only variable that changes with frequency, while all the other parameters are the same for all frequencies. Using this convergence, in the limit the error is
$$\lim_{n_x \to \infty} \lim_{T \to \infty} \frac{1}{n_x T} \big\|\hat{\tilde{X}}_{\mathrm{ridge}} - \tilde{X}\big\|_F^2 = \frac{1}{2\pi} \int_0^{2\pi} \Big[ (\alpha - 1)^2 g(\omega) + \alpha^2 \tau^2(g(\omega)) \Big]\, d\omega. \quad (36)$$
As mentioned in Section 4.2, our scaling of the Fourier operator makes it a unitary operator. Therefore, the $\ell_2$ norm is preserved under our definition of the Fourier transform and its inverse. This implies that the same result as in (36) also holds in the time domain.

5. Experiments

In this section we validate our theoretical results on simulated data. We generate data using a ground truth convolutional model of the form (3). We use an i.i.d. complex normal convolution kernel and noise with different variances. For the data matrix $X$, we consider two different models: i) i.i.d. complex normal data; and ii) a non-Gaussian autoregressive process of order 1 (an AR(1) process). In both cases we take $T = 256$, $n_y = 500$ and use different values of $n_x$ to create plots of the estimation error with respect to $\delta = n_y/n_x$. An AR(1) process is a process that evolves in time as
$$x_t = a x_{t-1} + \xi_t, \quad (37)$$
where $\xi_t$ is some zero-mean random noise and $a$ is a fixed constant.
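As an illustration (not the paper's code), rows of $X$ for the AR(1) experiments can be generated as below, and $g(\omega)$ from Section 3.1.2 can be estimated empirically by averaging the squared magnitude of the normalized DFT over the i.i.d. rows. The values of $a$, $s$, the burn-in length, and the sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def ar1_rows(nx, T, a=0.9, s=0.1, burn_in=500):
    """i.i.d. rows, each a path of the AR(1) process (37) with xi_t ~ unif({-s, +s})."""
    X = np.zeros((nx, T + burn_in))
    xi = s * rng.choice([-1.0, 1.0], size=(nx, T + burn_in))
    for t in range(1, T + burn_in):
        X[:, t] = a * X[:, t - 1] + xi[:, t]
    return X[:, burn_in:]          # drop the transient so the rows are near-stationary

# empirical estimate of g(w) = lim E|X~_i(w)|^2 by averaging over the i.i.d. rows
nx, T = 2000, 256
X = ar1_rows(nx, T)
Xf = np.fft.fft(X, axis=1) / np.sqrt(T)        # normalized DFT, as in (18)
g_hat = np.mean(np.abs(Xf) ** 2, axis=0)       # one estimate per frequency bin 2*pi*m/T
print(g_hat[:4])
```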
Note that our assumptions on the process require it to be stationary and ergodic. These assumptions are only satisfied when the noise is i.i.d. and $|a| < 1$. The parameter $a$ controls how fast the process mixes. The case where $a = 0$ results in an i.i.d. process, whereas values with magnitude close to 1 result in a process that has strong correlations over a long period of time. This process has the form of one of the examples given in Section 3.1.2 and hence satisfies all the assumptions required for our theory to hold. If the noise $\xi_t$ has a zero-mean Gaussian distribution, the process is a centered Gaussian process which is completely characterized by its auto-correlation function
$$R[t] = \mathbb{E}[x_{t'+t}\, x_{t'}], \quad (38)$$
where we have used the fact that stationarity of the process implies this auto-correlation only depends on the time difference and not on the actual time. Since for Gaussian processes the Fourier transform is another Gaussian process, we do not need to use the results of (Peligrad et al., 2010) to analyze them. As such, a more interesting example is to use a non-Gaussian noise $\xi_t$. Therefore, besides the Gaussian AR process, we also take $\xi_t \sim \mathrm{unif}(\{-s, s\})$, where $s$ is a step size that controls the variance of the process. The variance and auto-correlation of a univariate AR(1) process can be found as follows. Squaring both sides of (37) and taking the expectation, we obtain
$$\mathbb{E}[x_t^2] = \frac{\mathbb{E}[\xi_t^2]}{1 - a^2}. \quad (39)$$
Similarly, it is also easy to show that the auto-correlation of this process is
$$R[t] = \frac{\mathbb{E}[\xi_0^2]}{1 - a^2}\, a^{|t|}. \quad (40)$$
The auto-correlation function of the AR process only depends on the variance of the noise and not on its distribution. Our main result, Theorem 1, shows that the asymptotic error of the ridge estimator depends on the underlying process only through the function $g(\omega)$, which, as stated earlier, is proportional to the spectral density of the process, i.e. the Fourier transform of the auto-correlation function. Therefore, so long as the zero-mean noise has the same variance, irrespective of its distribution, we expect to see the same asymptotic error in convolutional ridge regression when the rows of $X$ are i.i.d. samples of such processes.
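For reference, this is also how the theory curves for the AR(1) setting below can be evaluated: summing the series (13) with $c_t = R[t]$ from (40) gives the closed-form spectrum $g(\omega) = \mathrm{Var}(\xi_t)/(1 - 2a\cos\omega + a^2)$, which can be plugged into Corollary 1 and normalized by the signal power $R[0]$. The following minimal sketch (not the paper's code) does this numerically; the regularization value and the innovation variance used here are illustrative assumptions and should be set to the values used in the experiments.

```python
import numpy as np

def alpha_tau2(g, delta, lam, sigma2, sigma2_K=1.0):
    # smaller root of (14) and tau^2 from (15), as in the earlier sketch
    b = 1.0 + 1.0 / delta + lam / sigma2_K
    alpha = (b - np.sqrt(b * b - 4.0 / delta)) * delta / 2.0
    tau2 = (sigma2 / sigma2_K + (1.0 - alpha) ** 2 * g / delta) / (1.0 - alpha ** 2 / delta)
    return alpha, tau2

def nmse_ar1_theory(delta, lam, sigma2=0.1, a=0.9, var_xi=0.01, n_grid=4096):
    """Predicted NMSE from Corollary 1 for AR(1) rows.

    Summing (13) with c_t = R[t] from (40) gives g(w) = var_xi / (1 - 2*a*cos(w) + a^2),
    whose average over w equals R[0], the per-sample signal power.
    """
    w = 2.0 * np.pi * (np.arange(n_grid) + 0.5) / n_grid
    g = var_xi / (1.0 - 2.0 * a * np.cos(w) + a ** 2)
    alpha, tau2 = alpha_tau2(g, delta, lam, sigma2)
    mse = np.mean((alpha - 1.0) ** 2 * g + alpha ** 2 * tau2)   # (1/2pi) * integral in (16)
    return mse / np.mean(g)                                     # normalize by R[0]

for delta in [0.5, 0.9, 1.0, 1.1, 2.0, 4.0]:
    print(delta, np.log10(nmse_ar1_theory(delta, lam=1e-3)))
```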
To show this, we use both a Gaussian AR(1) process with $\mathrm{Var}(\xi_t) = 0.1$ and a process with $\xi_t \sim \mathrm{unif}(\{-s, s\})$, $s = 0.1$, to match the variances. In both cases we take $a = 0.9$ and measurement noise variance $\sigma^2 = 0.1$. We first present the results for the i.i.d. Gaussian covariates. In this case the variances of the signal and the noise are 0.004 and 1 respectively. Figure 1 shows the log of the normalized estimation error with respect to $\delta = n_y/n_x$ for three different values of the regularization parameter $\lambda$. The normalized estimation error is defined as
$$\mathrm{NMSE} = \frac{\mathbb{E}\|\hat{X}_{\mathrm{ridge}} - X_0\|_F^2}{\mathbb{E}\|X_0\|_F^2}. \quad (41)$$
The solid curves correspond to what our theory predicts and the dots correspond to what we observe on synthetic data. Even though our results hold in the limit of $n_x, n_y, k, T \to \infty$ at proportional ratios, we can see that already at this problem size there is an almost perfect match between our predictions and the error observed in practice. This suggests that the errors concentrate around these asymptotic values. The figure also shows the double descent phenomenon, where as the number of parameters increases beyond the interpolation threshold, the error starts decreasing again.

Figure 1: Log of normalized error for i.i.d. Gaussian features with respect to $\delta = n_y/n_x$ for three different values of $\lambda$. Solid lines show the predictions of our theory whereas the dots show the observed error on synthetic data.

It can be seen that regularization helps with pulling the estimation error down in the vicinity of the interpolation threshold. The interpolation threshold is where we have just enough parameters to fit the observations perfectly. This happens at $\delta = 1$, i.e. $n_x = n_y$. As mentioned in Remark 3, an i.i.d. process has a white spectrum in the frequency domain, meaning that $g(\omega)$ is a constant. Therefore, for these processes, the integral in Corollary 1 is proportional to its integrand. The integrand in turn is the asymptotic error of an ordinary ridge regression problem with i.i.d. Gaussian features. As such, this figure is essentially the same as the figures in papers that have looked at the asymptotics of ridge regression, some of which we have mentioned in the Introduction and prior work. Figure 2 shows the same plot for the case where the rows of $X$ are i.i.d. samples of an AR(1) process with process noise $\mathrm{unif}(\{-s, s\})$. The asymptotic error for processes that have dependencies over time can be significantly different from that for i.i.d. random features. The red curve is similar to the red curve in Figure 1, but the other curves show very different behavior. The double descent phenomenon is still present here. The plot for the AR(1) process with Gaussian noise is essentially identical to Figure 2 and we have moved it to the appendix (Figure 3). This supports our theoretical result that the error in this asymptotic regime only depends on the spectral density of the process.

Figure 2: Log of normalized error for AR(1) features with process noise $\mathrm{unif}(\{-s, s\})$, with respect to $\delta = n_y/n_x$ for three different values of $\lambda$. Solid lines show the predictions of our theory and the dots show the observed error on synthetic data. The plot for the Gaussian AR process is essentially identical to this plot.

6. Conclusion

Summary. We characterized the performance of the ridge estimator for convolutional models in the proportional asymptotics regime. By looking at the problem in the Fourier domain, we showed that the asymptotic mean squared error of the ridge estimator can be found from a scalar integral that depends on the spectral properties of the true signal. Our experiments show that our theoretical predictions match what we observe in practice even for problems of moderate size.

Future work. The results of this work only apply to the ridge regression estimator for the convolutional linear inverse problem. The key property of ridge regularization is that it is invariant under unitary transforms, and hence we could instead solve the problem in the frequency domain. Proving such a result for general estimators and regularizers would allow us to extend this work to inference in deep convolutional neural networks similar to (Pandit et al., 2020). Such a result would also allow us to obtain the estimation error of inverse problems that use deep convolutional generative priors.

Acknowledgements

This work was done as part of an internship at Adobe Research under the mentorship of T. Mai, A. Rao, and R. Rossi. The work of M. Sahraee-Ardakan and A. K. Fletcher was supported in part by the National Science Foundation under Grants 1254204 and 1738286, and the Office of Naval Research under Grant N00014-15-1-2677. S. Rangan was supported in part by the National Science Foundation under Grants 1116589, 1302336, and 1547332, NIST, the industrial affiliates of NYU WIRELESS, and the SRC.

References
Advani, M. S., Saxe, A. M., and Sompolinsky, H. (2020). High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428-446.
Alemohammad, S., Wang, Z., Balestriero, R., and Baraniuk, R. (2020). The recurrent neural tangent kernel. arXiv preprint arXiv:2006.10246.
Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962.
Anthony, M. and Bartlett, P. L. (2009). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
Bayati, M. and Montanari, A. (2011). The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764-785.
Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019a). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849-15854.
Belkin, M., Hsu, D., and Xu, J. (2019b). Two models of double descent for weak features. arXiv preprint arXiv:1903.07571.
Chafaï, D., Guédon, O., Lecué, G., and Pajor, A. (2009). Singular values of random matrices. Lecture Notes.
Deng, Z., Kammoun, A., and Thrampoulidis, C. (2019). A model of double descent for high-dimensional binary linear classification. arXiv preprint arXiv:1911.05822.
Dicker, L. H. (2016). Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1-37.
Dobriban, E. and Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247-279.
Donoho, D. L., Maleki, A., and Montanari, A. (2010). Message passing algorithms for compressed sensing: I. Motivation and construction. In 2010 IEEE Information Theory Workshop on Information Theory (ITW 2010, Cairo), pages 1-5. IEEE.
Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. (2018a). Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804.
Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2018b). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
Emami, M., Sahraee-Ardakan, M., Pandit, P., Rangan, S., and Fletcher, A. K. (2020). Generalization error of generalized linear models in high dimensions. arXiv preprint arXiv:2005.00180.
Friedrich, J., Zhou, P., and Paninski, L. (2017). Fast online deconvolution of calcium imaging data. PLoS Computational Biology, 13(3):e1005423.
Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. (2018). GPipe: Efficient training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965.
Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571-8580.
Jury, E. I. (1963). On the roots of a real polynomial inside the unit circle and a stability criterion for linear discrete systems. IFAC Proceedings Volumes, 1(2):142-153.
Kalimeris, D., Kaplun, G., Nakkiran, P., Edelman, B., Yang, T., Barak, B., and Zhang, H. (2019). SGD on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems, pages 3496-3506.
Le Cun, Y., Kanter, I., and Solla, S. A. (1991). Second order properties of error surfaces: Learning time and generalization. Advances in Neural Information Processing Systems, 3:918-924.
Li, M., Soltanolkotabi, M., and Oymak, S. (2020). Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pages 4313-4324. PMLR.
Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157-8166.
Li, Z., Wang, R., Yu, D., Du, S. S., Hu, W., Salakhutdinov, R., and Arora, S. (2019). Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809.
Liang, T. and Rakhlin, A. (2020). Just interpolate: Kernel ridgeless regression can generalize. Annals of Statistics, 48(3):1329-1347.
Liao, Z., Couillet, R., and Mahoney, M. W. (2020). A random matrix analysis of random Fourier features: Beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent. arXiv preprint arXiv:2006.05013.
Maleki, A., Anitori, L., Yang, Z., and Baraniuk, R. G. (2013). Asymptotic analysis of complex LASSO via complex approximate message passing (CAMP). IEEE Transactions on Information Theory, 59(7):4290-4308.
McNally, J. G., Karpova, T., Cooper, J., and Conchello, J. A. (1999). Three-dimensional imaging by deconvolution microscopy. Methods, 19(3):373-385.
Mei, S. and Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355.
Montanari, A., Ruan, F., Sohn, Y., and Yan, J. (2019). The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544.
Mueller, C. S. (1985). Source pulse enhancement by deconvolution of an empirical Green's function. Geophysical Research Letters, 12(1):33-36.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2019). Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292.
Oymak, S. and Soltanolkotabi, M. (2019). Overparameterized nonlinear learning: Gradient descent takes the shortest path? In International Conference on Machine Learning, pages 4951-4960.
Pandit, P., Sahraee-Ardakan, M., Rangan, S., Schniter, P., and Fletcher, A. K. (2020). Inference with deep generative priors in high dimensions. IEEE Journal on Selected Areas in Information Theory.
Peligrad, M. and Wu, W. B. (2010). Central limit theorem for Fourier transforms of stationary processes. The Annals of Probability, 38(5):2009-2022.
Rangan, S., Schniter, P., and Fletcher, A. K. (2019). Vector approximate message passing. IEEE Transactions on Information Theory, 65(10):6664-6684.
Soltanolkotabi, M., Javanmard, A., and Lee, J. D. (2018). Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742-769.
Starck, J.-L., Pantin, E., and Murtagh, F. (2002). Deconvolution in astronomy: A review. Publications of the Astronomical Society of the Pacific, 114(800):1051.
Treitel, S. and Lines, L. (1982). Linear inverse theory and deconvolution. Geophysics, 47(8):1153-1159.
Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446-9454.
Villani, C. (2008). Optimal Transport: Old and New, volume 338. Springer Science & Business Media.
Yang, G. (2020). Tensor programs II: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
Zou, D., Cao, Y., Zhou, D., and Gu, Q. (2020). Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109(3):467-492.