# Universality in Learning from Linear Measurements

Ehsan Abbasi, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125, eabbasi@caltech.edu
Fariborz Salehi, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125, fsalehi@caltech.edu
Babak Hassibi, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125, hassibi@caltech.edu

Abstract

We study the problem of recovering a structured signal from independently and identically drawn linear measurements. A convex penalty function $f(\cdot)$ is considered which penalizes deviations from the desired structure, and signal recovery is performed by minimizing $f(\cdot)$ subject to the linear measurement constraints. The main question of interest is to determine the minimum number of measurements that is necessary and sufficient for the perfect recovery of the unknown signal with high probability. Our main result states that, under some mild conditions on $f(\cdot)$ and on the distribution from which the linear measurements are drawn, the minimum number of measurements required for perfect recovery depends only on the first- and second-order statistics of the measurement vectors. As a result, the required number of measurements can be determined by studying measurement vectors that are Gaussian (and have the same mean vector and covariance matrix), for which a rich literature and comprehensive theory exists. As an application, we show that the minimum number of random quadratic measurements (also known as rank-one projections) required to recover a low-rank positive semi-definite matrix is $3nr$, where $n$ is the dimension of the matrix and $r$ is its rank. As a consequence, we settle the long-standing open question of determining the minimum number of measurements required for perfect signal recovery in phase retrieval using the celebrated PhaseLift algorithm, and show it to be $3n$.

This work was supported in part by the National Science Foundation under grants CNS-0932428, CCF-1018927, CCF-1423663 and CCF-1409204, by a grant from Qualcomm Inc., by a grant from Futurewei Inc., by NASA's Jet Propulsion Laboratory through the President and Director's Fund, and by King Abdullah University of Science and Technology.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Recovering a structured signal from a set of linear observations appears in many applications in areas ranging from finance to biology, and from imaging to signal processing. More formally, the goal is to recover an unknown vector $x_0 \in \mathbb{R}^n$ from observations of the form $y_i = a_i^T x_0$, for $i = 1, \ldots, m$. In many modern applications, the ambient dimension of the signal, $n$, is often (overwhelmingly) larger than the number of observations, $m$. In such cases, there are infinitely many solutions that satisfy the linear equations arising from the observations, and therefore to obtain a unique solution one must assume some prior structure on the unknown vector. Common examples of structured signals are sparse and group-sparse vectors [13, 6], low-rank matrices [24, 5], and simultaneously-structured matrices [8, 21]. To this end, we use a convex penalty function $f : \mathbb{R}^n \to \mathbb{R}$ that captures the structure of the signal, in the sense that signals that do not adhere to the desired structure will have a higher cost.
Therefore, the following estimator is used to recover $x_0$:
$$\hat{x} = \arg\min_{x} f(x) \quad \text{subject to} \quad y_i = a_i^T x, \; i = 1, \ldots, m. \qquad (1)$$
Popular choices of $f(\cdot)$ include the $\ell_1$-norm for sparse vectors [31] and the nuclear norm for low-rank matrices [24]. A canonical question in this area is: how many measurements are needed to recover $x_0$ via this estimator? This question has been extensively studied in the literature (see [28, 1, 9] and the references therein). The answer depends on the $a_i$ and is very difficult to determine for any given set of measurement vectors. As a result, it is common to assume that the measurement vectors are drawn randomly from a given distribution and to ask whether the unknown vector can be recovered with high probability. In the special case where the entries of the measurement matrix are drawn i.i.d. from a Gaussian distribution, the minimum number of measurements for the recovery of $x_0$ with high probability is known (and is related to the concept of the Gaussian width [28, 1, 9]). For instance, it has been shown that $2k\log(n/k)$ linear measurements are required to recover a $k$-sparse signal [12], and $3rn$ measurements suffice for the recovery of a symmetric $n \times n$ rank-$r$ matrix [20, 9]. Recently, Oymak et al. [22] showed that these thresholds remain unchanged as long as the entries of each $a_i$ are i.i.d. and drawn from a "well-behaved" distribution. It has also been shown that similar universality holds in the case of noisy measurements [23]. Although these works are of great interest, the independence assumption on the entries of the measurement vectors can be restrictive. In certain applications, such as communications, phase retrieval, and covariance estimation, the entries of the measurement vectors $a_i$ have correlations. In this paper, we show a much stronger universality result which holds for a broader class of measurement distributions. Here is an informal description of our result: Assume the measurement vectors $a_i$ are drawn i.i.d. from some given distribution. In other words, the measurement vectors are i.i.d. random, but their entries are not necessarily so. Then the minimum number of observations needed to recover $x_0$ from (1) with high probability depends only on the first two statistics of the $a_i$, i.e., their mean vector $\mu$ and covariance matrix $\Sigma$.

We anticipate that this universality result will have many practical ramifications. In this paper we focus on the ramifications for the problem of recovering a structured matrix, $X_0 \in \mathbb{R}^{n\times n}$, from quadratic measurements (a.k.a. rank-one projections). In this problem, we are given observations of the form $y_i = a_i^T X_0 a_i = \mathrm{Tr}(X_0(a_i a_i^T)) = \mathrm{vec}(X_0)^T \mathrm{vec}(a_i a_i^T)$ for $i = 1, \ldots, m$ (the reader should pardon the abuse of notation, as the measurement vectors are now $\mathrm{vec}(a_i a_i^T)$). Such measurement schemes appear in a variety of problems [11, 3, 35, 19, 18]. An interesting application of learning from quadratic measurements is the PhaseLift algorithm [7] for phase retrieval. In phase retrieval, the goal is to recover the signal $x_0$ from quadratic measurements of the form $y_i = |a_i^T x_0|^2 = a_i^T (x_0 x_0^T) a_i$. Note that $x_0 x_0^T$ is a low-rank (in this case rank-1) matrix, and PhaseLift relaxes this constraint to a positive semi-definiteness constraint and minimizes the nuclear norm to encourage a low-rank solution. Quadratic measurements also appear in non-coherent energy measurements in communications and signal processing [33, 2], sparse covariance estimation [11, 35], and sparse phase retrieval [18, 26]. Recently, Chen et al. [11] proved sufficient bounds on the number of measurements for various structures on the matrix $X_0$.
However, to the best of our knowledge, prior to this work, the precise number of required measurements for perfect recovery was unknown. For example, when the $a_i$ have i.i.d. Gaussian entries (note that the measurement vectors, which are now $\mathrm{vec}(a_i a_i^T)$, are no longer i.i.d. Gaussian), we show that $3nr$ measurements are necessary and sufficient for the perfect recovery of a rank-$r$ matrix from quadratic measurements. In the special case of phase retrieval, we therefore demonstrate that $3n$ measurements are necessary and sufficient for perfect recovery of $x_0$, which settles the long-standing open question of the recovery threshold for PhaseLift. In particular, this indicates that $2n$ extra phaseless measurements are all that is needed to compensate for the missing phase information.

The remainder of the paper is structured as follows. The problem setup and definitions are given in Section 2. In Section 3, we introduce our universality framework, which states that the number of required observations for the recovery of an unknown model depends only on the first two statistics of the measurement vectors. As an application, in Section 4, we apply this universality theorem to derive tight bounds (i.e., necessary and sufficient conditions) on the required number of observations for matrix recovery via quadratic measurements.

2 Preliminaries

2.1 Notations

We start by introducing some notations that are used throughout the paper. Bold lower-case letters $x, y, \ldots$ are used to denote vectors, and bold upper-case letters $X, Y, \ldots$ are used for matrices. For a matrix $X \in \mathbb{R}^{m\times n}$, $\mathrm{Vec}(X) \in \mathbb{R}^{mn}$ returns the vectorized form of the matrix. $\|X\|_2$, $\|X\|_F$, $\|X\|_\star$ and $\mathrm{Tr}(X)$ represent the operator norm, the Frobenius norm, the nuclear norm and the trace of the matrix $X$, respectively. $\|x\|_{\ell_p}$ denotes the $\ell_p$-norm of the vector $x$, and for matrices, $\|X\|_{\ell_p} = \|\mathrm{Vec}(X)\|_{\ell_p}$. For both vectors and matrices, $\|\cdot\|_0$ indicates the number of non-zero entries. The sets of $n \times n$ positive definite matrices and positive semi-definite matrices are denoted by $\mathbb{S}^n_{++}$ and $\mathbb{S}^n_{+}$, respectively. The letters $g$ and $G$ are reserved for a Gaussian random vector and matrix with i.i.d. standard normal entries. The letter $H$ is reserved for a random Gaussian Wigner matrix, that is, a symmetric matrix whose off-diagonal entries are drawn independently from $N(0, 1)$ and whose diagonal entries are drawn independently from $N(0, 2)$. Finally, the letter $I$ is reserved for the identity matrix. For a random vector $a$, $\mathbb{E}[a]$ and $\mathrm{Cov}[a]$ represent the expected value and the covariance matrix of $a$.

2.2 Problem Setup

We consider the problem of recovering the unknown vector $x_0 \in S \subseteq \mathbb{R}^n$ from $m$ observations of the form $y_i = a_i^T x_0$, $i = 1, \ldots, m$. Here, the known measurement vectors $a_i \in \mathbb{R}^n$ are drawn independently and identically from a random distribution. These observations can be reformulated as
$$y = A x_0, \qquad (2)$$
where $y = [y_1, \ldots, y_m]^T \in \mathbb{R}^m$ and $A = [a_1, \ldots, a_m]^T \in \mathbb{R}^{m\times n}$. We focus on the high-dimensional setting where both $n$ and $m$ grow large. We use the notation $m = \theta(n)$ to fix the rate at which $m$ grows compared to $n$. Of special interest is the underdetermined case where the number of measurements is smaller than the ambient dimension. In this case, the problem of signal reconstruction is generally ill-posed unless some prior information is available regarding the structure of $x_0$. Some popular cases of structures include sparse vectors, low-rank matrices, and simultaneously-structured matrices.
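To make this setup concrete before formalizing the estimator, here is a minimal numerical sketch (assuming the numpy and cvxpy packages are available; the dimensions n, m, k and the random seed are illustrative choices, not values from the paper). It draws a k-sparse $x_0$, forms the underdetermined system (2) with i.i.d. Gaussian rows, and attempts recovery by $\ell_1$-minimization, i.e., program (1) with $f(\cdot) = \|\cdot\|_{\ell_1}$.

```python
# A minimal sketch: underdetermined linear measurements of a sparse vector,
# recovered by l1-minimization as in program (1). Dimensions are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m, k = 200, 80, 10                      # ambient dim, measurements, sparsity (assumed values)

x0 = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x0[support] = rng.standard_normal(k)       # k-sparse ground truth

A = rng.standard_normal((m, n))            # i.i.d. Gaussian measurement vectors a_i as rows
y = A @ x0                                 # y_i = a_i^T x_0

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y])
prob.solve()                               # any LP-capable solver works

print("relative recovery error:", np.linalg.norm(x.value - x0) / np.linalg.norm(x0))
```

With these illustrative dimensions, m is comfortably above the $2k\log(n/k)$ scaling quoted in the introduction, so the printed relative error is typically at numerical-precision level.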
Convex estimator: To recover the structured vector $x_0$, we minimize a convex function $f : \mathbb{R}^n \to \mathbb{R}$ that enforces this structure. We do this minimization over all feasible points $x \in S$ that satisfy $y = Ax$. We formally define such estimators as follows.

Definition 1. Let $x_0 \in S$, where $S \subseteq \mathbb{R}^n$ is a convex set. For a convex function $f : \mathbb{R}^n \to \mathbb{R}$ and a measurement matrix $A \in \mathbb{R}^{m\times n}$, we define the convex estimator $\mathcal{E}\{x_0, A, S, f(\cdot)\}$ as follows:
$$\hat{x} = \arg\min_{\substack{x \in S \\ Ax = Ax_0}} f(x). \qquad (3)$$
We say $\mathcal{E}\{x_0, A, S, f(\cdot)\}$ has perfect recovery iff $\hat{x} = x_0$.

Note that we are given the observation vector $y = Ax_0$ in the constraint of (3). We aim to characterize the perfect recovery criteria for this estimator. Given a structured vector $x_0$, the perfect recovery of an estimator $\mathcal{E}\{x_0, A, S, f(\cdot)\}$ depends on three factors: the number of observations $m$ compared to the dimension of the ambient space $n$, the properties of the measurement vectors $\{a_i\}_{i=1}^m$, and the penalty function $f(\cdot)$. We briefly explain each factor below.

The rate function $\theta(\cdot)$: We work in the high-dimensional regime where both $n$ and $m$ grow to infinity with a fixed rate $m = \theta(n)$. Finding the minimum number of measurements to recover $x_0$ via (3) translates to finding the smallest rate function $\theta^\star(\cdot)$ for which our estimator has perfect recovery. This optimal rate function depends on the problem setting and varies in different problems. For instance, in order to recover a rank-$r$ matrix in $\mathbb{S}^n_+$, we will need the measurements to be of order $m = O(n)$, while in the case of $k$-sparse matrices, the measurements will be of order $m = O(k\log(n^2/k))$, where in many applications $k$ is a fraction of $n^2$.

The penalty function: We use a convex function $f(\cdot)$ that promotes the particular structure of $x_0$. Exploiting a convex penalty for the recovery of structured signals has been studied extensively [9, 1, 28, 14, 4, 29]. Chandrasekaran et al. [9] introduced the concept of the atomic norm, which is a convex surrogate defined based on a set of (so-called) "atoms". For instance, the corresponding atomic norm for sparse recovery is the $\ell_1$-norm, and for low-rank matrix recovery it is the nuclear norm. Another interesting scenario is when the underlying parameter $x_0$ simultaneously exhibits multiple structures, such as being low-rank and sparse. For simultaneously structured signals, building the set of atoms is often intractable. Therefore, it has been proposed [21, 10] to use a weighted sum of the corresponding atomic norms for each structure as the penalty.

The measurement vectors: We consider a random ensemble, where the vectors $\{a_i\}_{i=1}^m$ are drawn independently and identically from a random distribution. Later, in Section 2.3, we formally present the required assumptions on this distribution. It has been observed that the estimator (3) exhibits a phase transition phenomenon, i.e., there exists a phase transition rate $\theta^\star(n)$ such that when $m > \theta^\star(n)$ the optimization program (3) successfully recovers $x_0$ with high probability, while when $m < \theta^\star(n)$ it fails with high probability [1, 9]. The question is how this phase transition is related to the properties of the measurement vectors $a_i$.

Universality in learning: Directly calculating the precise phase transition behavior of the estimator $\mathcal{E}(x_0, A, S, f(\cdot))$ for a general random distribution on the measurement vectors is very challenging.
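The transition can, of course, be probed empirically: the following rough Monte Carlo sketch (all numerical choices are illustrative assumptions) sweeps the number of measurements m for a fixed sparsity and records how often program (3), with $S = \mathbb{R}^n$ and $f = \|\cdot\|_{\ell_1}$, recovers $x_0$ up to a small tolerance. The value of m at which the empirical success rate jumps from 0 to 1 localizes $\theta^\star(n)$.

```python
# A rough Monte Carlo probe of the phase transition of estimator (3) with f = l1-norm:
# sweep the number of measurements m and record the empirical success frequency.
# All numerical choices (n, k, trials, success tolerance) are illustrative assumptions.
import numpy as np
import cvxpy as cp

def l1_recover(A, y):
    n = A.shape[1]
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y]).solve()
    return x.value

rng = np.random.default_rng(1)
n, k, trials = 100, 5, 20

for m in range(10, 60, 10):
    successes = 0
    for _ in range(trials):
        x0 = np.zeros(n)
        x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
        A = rng.standard_normal((m, n))
        xhat = l1_recover(A, A @ x0)
        if np.linalg.norm(xhat - x0) <= 1e-5 * np.linalg.norm(x0):
            successes += 1
    print(f"m = {m:3d}: empirical success rate = {successes / trials:.2f}")
```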
Recently, as an extension of Gaussian comparison lemmas due to Gordon [16, 17] and earlier work in [27, 28, 9, 1], a new framework, known as CGMT [29, 30], has been developed which makes this analysis possible when the measurement vectors $\{a_i\}_{i=1}^m$ are independently drawn from the Gaussian distribution $N(0, I_n)$. Another parallel line of work that makes this analysis possible under the same conditions is known as AMP [14]. However, the Gaussian assumption is critical in the analysis through these frameworks, which prevents us from investigating a vast variety of practical problems. As our main result, we show that, for a broad class of distributions, the phase transition of $\mathcal{E}(x_0, A, S, f(\cdot))$ depends only on the first two statistics of the distribution of the measurement vectors $\{a_i\}_{i=1}^m$. As a result, the phase transition of the estimator remains unchanged when we replace the measurement vectors with ones drawn from a Gaussian distribution with the same mean vector and covariance matrix. As the phase transition is the same as the one with Gaussian measurements, we can use the CGMT framework to analyze the latter and get the desired result.

Equivalent Gaussian Problem: Let $\mu := \mathbb{E}[a_i]$ and $\Sigma := \mathrm{Cov}[a_i]$ for $i = 1, 2, \ldots, m$, and consider the following problem:

1. We are given $m$ observations of the form $y_i = g_i^T x_0$ and the measurement vectors $\{g_i\}_{i=1}^m$.
2. The rows of the measurement matrix $G = [g_1, \ldots, g_m]^T \in \mathbb{R}^{m\times n}$ are independently drawn from the multivariate Gaussian distribution $N(\mu, \Sigma)$.
3. We use the estimator $\mathcal{E}(x_0, G, S, f(\cdot))$, as in Definition 1, to recover $x_0$.

In Theorem 1, we show that under certain conditions, the two estimators $\mathcal{E}(x_0, A, S, f(\cdot))$ and $\mathcal{E}(x_0, G, S, f(\cdot))$ asymptotically exhibit the same phase transition behavior. Before stating our main result in Section 3, we discuss the assumptions needed for our universality to hold.

2.3 Assumptions

We show universality for a wide range of distributions on the measurement vectors as well as a broad class of convex penalties. Here, we give the conditions needed for the measurement matrix.

Assumption 1. [The Measurement Vectors] We say the measurement matrix $A = [a_1, \ldots, a_m]^T \in \mathbb{R}^{m\times n}$ satisfies Assumption 1 with parameters $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n\times n}$, if the following hold true.

1. [Sub-Exponential Tails] The vectors $a_i$ are independently drawn from a sub-exponential distribution with mean $\mu$ and covariance $\Sigma \succ 0$.
2. [Bounded Mean] For some constants $c_1, \tau_1 > 0$, we have $\frac{\|\mu\|_2^2}{\mathbb{E}[\|a_i - \mu\|^2]} \le c_1 n^{-\tau_1}$, for all $i$.
3. [Bounded Power] For some constants $c_2, \tau_2 > 0$, we have $\frac{\mathrm{Var}(\|a_i\|^2)}{\mathbb{E}^2[\|a_i - \mu\|^2]} \le c_2 n^{-\tau_2}$, for all $i$.

Assumption 1 summarizes the technical conditions that are essential in the proof of our main theorem. The first assumption on the tail of the distribution enables us to exploit concentration inequalities for sub-exponential distributions. We allow the vector $a_i$ to have a non-zero mean in Assumption 1.2, yet we require the power of its mean to be small compared to the power of the random part of the vector. Intuitively, one would like the measurement vectors to sample diversely from all directions in $\mathbb{R}^n$, and not be biased towards a specific direction. Finally, Assumption 1.3 is meant to control the dependencies among the entries of $a_i$ and is used to prove concentration of $\frac{1}{n} a_i^T M a_i$ around its mean, for a matrix $M$ with bounded operator norm. For instance, for a Gaussian vector $g \sim N(0, I)$, we have $\mathrm{Var}[\|g\|^2] = 2n$ and $\mathbb{E}^2[\|g\|^2] = n^2$, so Assumption 1.3 is satisfied with $c_2 = 2$ and $\tau_2 = 1$.
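As a rough numerical illustration of Assumptions 1.2 and 1.3, the sketch below draws correlated measurement vectors of the form $a_i = \mu + M z_i$, where $z_i$ has i.i.d. Rademacher ($\pm 1$) entries, and estimates the two ratios by Monte Carlo; both should shrink as n grows. The particular base distribution, the scaling of M, and the choice of $\mu$ are illustrative assumptions, not requirements of the paper.

```python
# Numerically illustrating the ratios in Assumptions 1.2 and 1.3 for one simple ensemble:
# a_i = mu + M z, with z having i.i.d. Rademacher (+-1) entries. The base distribution,
# the scaling of M, and the choice of mu are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)

for n in (100, 400, 1600):
    M = rng.standard_normal((n, n)) / np.sqrt(n)         # covariance factor, Sigma = M M^T
    mu = np.ones(n) / n                                   # small mean vector
    num_samples = 2000

    Z = rng.choice([-1.0, 1.0], size=(num_samples, n))    # i.i.d. Rademacher entries
    A = mu + Z @ M.T                                      # rows are the measurement vectors a_i

    centered_sq = np.sum((A - mu) ** 2, axis=1)           # ||a_i - mu||^2
    full_sq = np.sum(A ** 2, axis=1)                      # ||a_i||^2

    mean_ratio = np.dot(mu, mu) / centered_sq.mean()          # ratio in Assumption 1.2
    power_ratio = full_sq.var() / centered_sq.mean() ** 2     # ratio in Assumption 1.3

    print(f"n = {n:5d}: bounded-mean ratio ~ {mean_ratio:.2e}, "
          f"bounded-power ratio ~ {power_ratio:.2e}")
```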
We will examine these assumptions for the applications discussed in Section 4. In addition, we need to enforce a few conditions on the penalty function $f(\cdot)$, as follows.

Assumption 2. [The Penalty Function] We say the function $f(\cdot)$ satisfies Assumption 2, if the following hold true.

1. [Separability] $f(\cdot)$ is continuous, convex and separable, i.e., $f(x) = \sum_{i=1}^{n} f_i(x_i)$.
2. [Smoothness] The functions $\{f_i(\cdot)\}$ are three times differentiable everywhere, except at a finite number of points.
3. [Bounded Third Derivative] For any $C > 0$, there exists a constant $c_f > 0$ such that for all $i$, $\left|\frac{\partial^3 f_i(x)}{\partial x^3}\right| \le c_f$ at all smooth points in the domain of $f_i(\cdot)$ with $|x| < C$.

As stated in Assumption 2.1, we only consider the special (yet popular) case of separable penalty functions. Common choices include $\|x\|_{\ell_1}$ and $\|x\|_{\ell_2}^2$ for vectors, and $\|X\|_{\ell_1}$, $\|X\|_F$ and $\mathrm{Tr}(X)$ (which is equivalent to the nuclear norm of $X$ when $X \in \mathbb{S}^n_+$) for matrices. We can also apply our theorem to the $\ell_p$-norm. This is due to the fact that replacing $\|\cdot\|_{\ell_p}$ with $\|\cdot\|_{\ell_p}^p$ does not change our estimate, and the latter is a separable function.

3 Main Result

In this section, we state our main theorem, which shows that the performance of the convex estimator $\mathcal{E}(x_0, A, S, f(\cdot))$ is independent of the distribution of the measurement vectors, so we can replace them with Gaussian random vectors with the same mean and covariance. Next, using the CGMT framework [29, 30], we analyze the phase transition in the case of Gaussian measurements in Corollary 1. Later, we will apply this result to some well-known problems in Section 4.

3.1 Universality Theorem

Theorem 1. [non-Gaussian = Gaussian] Consider the problem of recovering $x_0 \in S \subseteq \mathbb{R}^n$ from the measurements $y = Ax_0 \in \mathbb{R}^m$, using a convex penalty function $f(\cdot)$ in the estimator $\mathcal{E}\{x_0, A, S, f(\cdot)\}$ in (3). Assume $S$ is a convex set and $m$ and $n$ grow to infinity at a fixed rate $m = \theta(n)$. Also assume that

1. $f : \mathbb{R}^n \to \mathbb{R}$ is a convex function that satisfies Assumption 2.
2. The measurement matrix $A = [a_1, \ldots, a_m]^T$ satisfies Assumption 1, with $\mu := \mathbb{E}[a_i]$ and $\Sigma := \mathrm{Cov}[a_i]$ for all $i = 1, \ldots, m$.
3. $G = [g_1, \ldots, g_m]^T \in \mathbb{R}^{m\times n}$ is a random Gaussian matrix with independent rows drawn from the Gaussian distribution $N(\mu, \Sigma)$.

Then the estimator $\mathcal{E}\{x_0, A, S, f(\cdot)\}$ (introduced in Definition 1) succeeds in recovering $x_0$ with probability approaching one (as $m$ and $n$ grow large), if and only if the estimator $\mathcal{E}\{x_0, G, S, f(\cdot)\}$ succeeds with probability approaching one.

Theorem 1 shows that only the mean and covariance of the measurement vectors $a_i$ affect the required number of measurements for perfect recovery in (3). Although Theorem 1 holds for $n$ and $m$ growing to infinity, the results of our numerical simulations in Section 3.2 indicate the validity of universality for values of $m$ and $n$ in the order of hundreds.

3.1.1 Analysis of the Gaussian Estimator

Theorem 1 shows the equivalence of the convex estimator $\mathcal{E}\{x_0, A, S, f(\cdot)\}$ and the Gaussian estimator $\mathcal{E}\{x_0, G, S, f(\cdot)\}$. We can utilize the CGMT framework to analyze the perfect recovery conditions for $\mathcal{E}\{x_0, G, S, f(\cdot)\}$. Before doing so, we need the definition of the descent cone.

Definition 2. [Descent Cone] The descent cone of a convex function $f(\cdot)$ at a point $x_0$ is defined as
$$\mathcal{D}_f(x_0) = \mathrm{Cone}\left(\{y - x_0 : f(y) \le f(x_0)\}\right), \qquad (4)$$
which is a convex cone. Here, $\mathrm{Cone}(S)$ denotes the conic hull of the set $S$.

Corollary 1. Consider the problem of recovering the vector $x_0 \in S$, given the observations $y = Gx_0 \in \mathbb{R}^m$, via the estimator $\mathcal{E}\{x_0, G, S, f(\cdot)\}$ introduced earlier.
Assume that the rows of $G$ are independent Gaussian random vectors with mean $\mu$ and covariance $\Sigma = MM^T$. Let $\delta := m/n$, and let the set $S$ and the penalty function $f(\cdot)$ be convex. $\mathcal{E}\{x_0, G, S, f(\cdot)\}$ succeeds in recovering $x_0$ with probability approaching one (as $m$ and $n$ grow to infinity), if and only if
$$m \;>\; \mathbb{E}\left[\max_{\substack{w \,\in\, (S - x_0)\,\cap\,\mathcal{D}_f(x_0) \\ \frac{1}{\sqrt{n}} M^T w \,\in\, \mathcal{S}^{n-1}}} \frac{1}{\sqrt{n}}\, g^T w\right]^2, \qquad (5)$$
where $\mathcal{S}^{n-1}$ is the $n$-dimensional unit sphere, and the expected value is over the Gaussian vector $g \sim N(0, \Sigma)$.

["Pseudo Gaussian Width"] When $\mu = 0$ and $\Sigma = I$, the expected value in (5) resembles the definition of the Gaussian width [25]. It has been shown that when the measurements are i.i.d. Gaussian, the square of the Gaussian width indicates the phase transition for linear inverse problems [9, 1, 28]. The Gaussian width has been computed for several interesting examples, such as sparse recovery and low-rank matrix recovery. Using our universality result in Theorem 1, we can state that the square of the Gaussian width indicates the phase transition in the non-Gaussian setting as well.

3.2 Numerical Results

To validate the result of Theorem 1, we performed numerical simulations under various distributions for the measurement vectors. For our simulations in Figure 1, we use the estimator $\mathcal{E}\{x_0, A, \mathbb{R}^n, \ell_1\}$ to recover a $k$-sparse signal $x_0$ under three random ensembles for the measurement vectors $\{a_i\}_{i=1}^m$. In each of the three plots, we computed the norm of the estimation error of $\mathcal{E}\{x_0, A, \mathbb{R}^n, \ell_1\}$ for different oversampling ratios $\delta = m/n$ and multiple sparsity factors $s = k/n$. We generated the measurement vectors $\{a_i\}_{i=1}^m$ for each figure as follows:

- For each trial, we generate a random matrix $M \in \mathbb{R}^{n\times n}$ with i.i.d. standard Gaussian entries. $\Sigma = MM^T$ plays the role of the covariance matrix of the measurement vectors.
- For Figure 1a, $\{a_i\}_{i=1}^m$ are drawn independently from the Gaussian distribution $N(0, \Sigma)$.
- For the measurement vectors of Figure 1b, we first generate i.i.d. centered Bernoulli vectors Ber(0.8), and multiply each vector by $M$.
- For the measurement vectors of Figure 1c, we first generate i.i.d. centered $\chi_1$ vectors, and multiply each vector by $M$.

The blue line in the figures shows the theoretical phase transition derived as a result of Corollary 1. It can be observed that the phase transition for all three random schemes is the same, as predicted by Theorem 1. It also matches the theoretical phase transition derived from Corollary 1.

Figure 1: Phase transition regimes for the estimator $\mathcal{E}\{x_0, A, \mathbb{R}^n, \ell_1\}$, in terms of the oversampling ratio $\delta = \frac{m}{n}$ and $s = \frac{\|x_0\|_0}{n}$, for the cases of (a) Gaussian measurements, (b) Bernoulli measurements and (c) $\chi^2$ measurements. The blue lines indicate the theoretical estimate for the phase transition derived from Corollary 1. In the simulations we used vectors of size $n = 256$. The data is averaged over 10 independent realizations of the measurements.

Next, to illustrate the applicability and the implications of the results, we present some examples where our universality theorem can be applied.

4 Applications: Quadratic Measurements

In this section we consider the problem of recovering a matrix from (so-called) quadratic measurements. The goal is to reconstruct a symmetric matrix $X_0 \in \mathbb{R}^{n\times n}$ in a convex set $S$, given $m$ measurements of the form
$$y_i = a_i^T X_0 a_i = \mathrm{Tr}\left(X_0\, a_i a_i^T\right), \qquad i = 1, \ldots, m. \qquad (6)$$
Depending on the application, the matrix $X_0$ may exhibit various structures.
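As a quick sanity check of the measurement model (6), the following sketch builds a rank-r PSD matrix $X_0$ (an illustrative construction, with assumed dimensions) and verifies numerically that the quadratic form $a_i^T X_0 a_i$ and the trace form $\mathrm{Tr}(X_0 a_i a_i^T)$ produce the same measurement.

```python
# A quick sanity check of the quadratic-measurement model (6): for a rank-r PSD matrix X0,
# the expressions a_i^T X0 a_i and Tr(X0 a_i a_i^T) coincide. The construction of X0
# and the dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n, r, m = 30, 2, 5

U = rng.standard_normal((n, r))
X0 = U @ U.T                                  # rank-r, positive semi-definite

for i in range(m):
    a = rng.standard_normal(n)                # a_i with i.i.d. standard normal entries
    y_quadratic = a @ X0 @ a                  # a_i^T X0 a_i
    y_trace = np.trace(X0 @ np.outer(a, a))   # Tr(X0 a_i a_i^T)
    print(f"measurement {i}: {y_quadratic:.6f}  vs  {y_trace:.6f}")
```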
Similar to (3), we use the convex penalty function $f : \mathbb{R}^{n\times n} \to \mathbb{R}$ to enforce this structure via the following convex estimator:
$$\hat{X} = \arg\min_{X \in S} f(X) \quad \text{subject to:} \quad a_i^T X a_i = a_i^T X_0 a_i, \; i = 1, \ldots, m. \qquad (7)$$
Note that the measurements in (6) are linear with respect to the matrix $X_0$, yet quadratic with respect to the measurement vectors $a_i$. We can define $\tilde{x}_0 := \mathrm{Vec}(X_0) \in \mathbb{R}^{n^2}$ and $\tilde{a}_i := \mathrm{Vec}(a_i a_i^T) \in \mathbb{R}^{n^2}$, such that the measurements take the familiar form $y_i = \tilde{a}_i^T \tilde{x}_0$. In order to apply the result of Theorem 1, one should check whether the vectors $\{\tilde{a}_i\}_{i=1}^m$ satisfy Assumption 1. It can be shown that if the vectors $\{a_i\}_{i=1}^m$ satisfy the following conditions, then Assumption 1 holds true for $\{\tilde{a}_i = \mathrm{Vec}(a_i a_i^T)\}_{i=1}^m$.

Assumption 3. We say the vectors $\{a_i\}_{i=1}^m$ satisfy Assumption 3, if

1. the $a_i$ are drawn independently from a sub-Gaussian distribution, and
2. for each $i$, the entries of $a_i$ are independent, zero-mean and unit-variance.

In particular, this assumption is valid when the $a_i$ have i.i.d. standard normal entries. Therefore, when Assumption 3 holds, we can apply Theorem 1 to show that the required number of measurements for perfect recovery in (7) is equal to the required number of measurements for the success of the following estimator:
$$\hat{X} = \arg\min_{X \in S} f(X) \quad \text{subject to:} \quad \mathrm{Tr}\left((H_i + I)X\right) = \mathrm{Tr}\left((H_i + I)X_0\right), \; i = 1, \ldots, m, \qquad (8)$$
where $I$ is the $n \times n$ identity matrix and the $H_i$ are independent Gaussian Wigner matrices (defined in Section 2). Corollary 2 presents a formal statement.

Corollary 2. Consider the problem of recovering the matrix $X_0 \in S \subseteq \mathbb{R}^{n\times n}$ from $m$ quadratic measurements of the form (6), using the estimator (7). Let $S$ be a convex set and $f(\cdot)$ a convex function satisfying Assumption 2. Assume that the measurement vectors $\{a_i\}_{i=1}^m$ satisfy Assumption 3, and that $\{H_i \in \mathbb{R}^{n\times n}\}_{i=1}^m$ is a set of independent Gaussian Wigner matrices. Then, as $m$ and $n$ grow to infinity at a fixed rate $m = \theta(n)$, the estimator (7) perfectly recovers $X_0$ with probability approaching one if and only if the estimator (8) perfectly recovers $X_0$ with probability approaching one.

Therefore, in order to find the phase transition, it is sufficient to analyze the equivalent optimization (8), which is possible via the CGMT framework. Proceeding onward, we exploit the CGMT framework along with Corollary 1 to find the required number of measurements for the recovery of $X_0$ in two specific applications.

4.1 Low-rank Matrix Recovery

Assume the unknown matrix $X_0 \succeq 0$ has rank $r$, where $r$ is a constant (i.e., $r$ does not grow with the problem dimensions $n, m$). Such matrices appear in many applications such as traffic data monitoring, array signal processing and phase retrieval. The nuclear norm, $\|\cdot\|_\star$, is often used as the convex surrogate for low-rank matrix recovery [24]. Hence, we are interested in analyzing the optimization (7) with the choice of $f(X) = \|X\|_\star$, where the optimization is over the set of PSD matrices. Note that $\mathrm{Tr}(\cdot) = \|\cdot\|_\star$ within this set, which satisfies Assumption 2. According to Corollary 2, perfect recovery in (7) is equivalent to perfect recovery in (8), with the same choice of $f(X) = \mathrm{Tr}(X)$. The analysis of the latter through CGMT yields the following corollary.

Corollary 3. Consider the optimization program (7), where the matrix $X_0 \succeq 0$ has rank $r$, $f(X) = \mathrm{Tr}(X)$, the set $S$ is the PSD cone, and the measurement vectors $\{a_i\}_{i=1}^m$ satisfy Assumption 3. Assume $m, n \to \infty$ at the proportional rate $\delta := \frac{m}{n} \in (0, +\infty)$. The estimator perfectly recovers $X_0$ if $\delta > 3r$.
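The following minimal cvxpy sketch mirrors the comparison behind Corollaries 2 and 3 for a rank-1 PSD matrix: trace minimization over the PSD cone, once with quadratic measurements as in (7) and once with the Gaussian Wigner surrogate measurements as in (8). The dimensions, the number of measurements (chosen slightly above the predicted threshold 3nr), and the solver defaults are illustrative assumptions; this is a sketch of the two estimators, not the paper's simulation code.

```python
# A minimal cvxpy sketch of the two programs compared in Corollary 2: trace minimization over
# the PSD cone from quadratic measurements (7), and from the Gaussian Wigner surrogate (8).
# Dimensions are kept small and are illustrative assumptions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, r = 15, 1
m = int(3 * n * r) + 5                        # slightly above the predicted threshold 3nr

U = rng.standard_normal((n, r))
X0 = U @ U.T                                  # unknown rank-r PSD matrix

def trace_min(measurement_mats, y):
    X = cp.Variable((n, n), PSD=True)
    constraints = [cp.trace(B @ X) == yi for B, yi in zip(measurement_mats, y)]
    cp.Problem(cp.Minimize(cp.trace(X)), constraints).solve()
    return X.value

# Estimator (7): quadratic measurements y_i = a_i^T X0 a_i = Tr((a_i a_i^T) X0).
quad_mats = [np.outer(a, a) for a in rng.standard_normal((m, n))]
y_quad = [np.trace(B @ X0) for B in quad_mats]

# Estimator (8): Gaussian Wigner surrogate measurements Tr((H_i + I) X0).
def wigner(dim):
    G = rng.standard_normal((dim, dim))
    return (G + G.T) / np.sqrt(2)             # off-diagonal N(0,1), diagonal N(0,2)

wig_mats = [wigner(n) + np.eye(n) for _ in range(m)]
y_wig = [np.trace(B @ X0) for B in wig_mats]

for name, mats, y in [("quadratic (7)", quad_mats, y_quad), ("Wigner (8)", wig_mats, y_wig)]:
    Xhat = trace_min(mats, y)
    err = np.linalg.norm(Xhat - X0, "fro") / np.linalg.norm(X0, "fro")
    print(f"{name}: relative Frobenius error = {err:.2e}")
```

At this small size, finite-dimensional effects can blur the transition; the asymptotic statement of Corollary 3 concerns m and n growing to infinity.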
Corollary 3 indicates that $3rn$ measurements are needed to perfectly recover a rank-$r$ PSD matrix $X_0$ from quadratic measurements. We note, however, that the estimation error becomes extremely small well before the threshold $m = 3nr$. To the best of our knowledge, this is the first work that precisely computes the phase transition of low-rank matrix recovery from quadratic measurements. Figure 2 depicts the result of numerical simulations. For different values of $r$ and $\delta$, the Frobenius norm of the error of the estimators (7) and (8) has been computed, which shows the same phase transition in both cases.

Figure 2: Phase transition regimes for both estimators (7) and (8), with $f(X) = \mathrm{Tr}(X)$, in terms of the oversampling ratio $\delta = \frac{m}{n}$ and $r = \mathrm{Rank}(X_0)$, for the cases of (a) estimator (7) with quadratic measurements and (b) estimator (8) with Gaussian measurements. In the simulations we used matrices of size $n = 40$. The data is averaged over 20 independent realizations of the measurements.

4.1.1 Phase Transition of PhaseLift in Phase Retrieval

An important application of the result of Corollary 3 is when the underlying matrix $X_0$ is of rank 1. This appears in the problem of phase retrieval, where $X_0 = x_0 x_0^T$ is the lifted version of the signal. The optimization program (7) with $f(X) = \mathrm{Tr}(X)$ in this case is known as PhaseLift [7]. Corollary 3 states that the phase transition of the PhaseLift algorithm happens at $\delta = 3$, i.e., $m > 3n$ measurements are needed for perfect signal reconstruction in PhaseLift. We should emphasize the significance of this result, as establishing the exact phase transition of the PhaseLift algorithm was a long-standing open problem.

4.2 Sparse Matrix Recovery

Let $X_0 \succeq 0$ represent the covariance matrix of a set of random variables. In certain applications, the covariance matrix has many near-zero entries, as the correlations are small for many pairs of random variables. Such matrices arise in applications in spectrum estimation, biology and finance [15, 11]. We are interested in analyzing estimator (7), where $f(X) = \|X\|_{\ell_1}$ promotes sparsity in the optimization. As the $\ell_1$-norm satisfies Assumption 2, applying the result of Corollary 2, perfect recovery in (7) is equivalent to perfect recovery in the estimator (8) with the same penalty function. Analyzing the optimization (8) via CGMT leads to the following result.

Corollary 4. Let $\delta := \frac{m}{n^2}$ and $s := \frac{\|X_0\|_0}{n^2}$. As $n \to \infty$, the optimization program (7) with $f(X) = \|X\|_{\ell_1}$ can successfully recover the signal iff $\delta > \delta^\star$, where $\delta^\star$ is the unique solution of the nonlinear equation (9), which is stated in terms of $s$, the Gaussian density $\phi(x) = \exp(-x^2/2)/\sqrt{2\pi}$, and the inverse of the Q-function, $Q^{-1}(\cdot)$.

Model | Penalty function $f(\cdot)$ | No. of required measurements
--- | --- | ---
$k$-sparse matrix | $\ell_1$ | $n^2\delta^\star$, defined in (9)
Rank-$r$ PSD matrix | $\mathrm{Tr}(\cdot)$ | $3nr$
S&L $(k, r)$ matrix | $\mathrm{Tr}(\cdot) + \lambda\|\cdot\|_{\ell_1}$ | $O(\min(k^2, rn))$

Table 1: Summary of the parameters that are discussed in this section. The last row is for an $n \times n$ rank-$r$ matrix whose smallest sub-matrix containing the non-zero entries is $k \times k$. The third column shows the number of required quadratic measurements for perfect recovery.

Figure 3b compares the empirical result with the theoretical phase transition derived from Corollary 4. Each plot shows the norm of the error with respect to the sparsity of the matrix $X_0$ and the ratio $\delta = \frac{m}{n^2}$. A comparison between the two plots indicates that the phase transitions of the two estimators (7) and (8) with $f(X) = \|X\|_{\ell_1}$ match.
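For the $\ell_1$ penalty, theoretical curves of the kind that appear as blue lines in the figures can, in the zero-mean i.i.d. Gaussian case, be computed from the statistical-dimension (squared Gaussian width) expression of the type computed in [1, 9]. The sketch below evaluates that standard expression as a function of the sparsity level; it is offered only as a reference computation under the identity-covariance assumption, and we do not claim it reproduces the exact form of equation (9), which Corollary 4 states via the inverse Q-function.

```python
# One standard way to compute a theoretical l1 phase-transition curve in the i.i.d. Gaussian
# setting: the normalized statistical dimension of the l1 descent cone at an s-sparse point
# (the expression of the type computed in [1, 9]). Offered as a reference computation only;
# not claimed to be identical to equation (9).
import math

def normal_pdf(t):
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def normal_tail(t):                       # Q(t) = P(g > t) for a standard normal g
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def l1_threshold(s, grid=4000, tau_max=8.0):
    """Normalized threshold: min over tau of  s (1 + tau^2) + (1 - s) E[(|g| - tau)_+^2]."""
    best = 1.0
    for j in range(grid + 1):
        tau = tau_max * j / grid
        excess = 2.0 * ((1.0 + tau * tau) * normal_tail(tau) - tau * normal_pdf(tau))
        best = min(best, s * (1.0 + tau * tau) + (1.0 - s) * excess)
    return best

for s in (0.01, 0.05, 0.1, 0.2, 0.3):
    print(f"sparsity s = {s:0.2f}  ->  normalized threshold ~ {l1_threshold(s):0.3f}")
```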
Figure 3: Phase transition regimes for both estimators (7) and (8), with $f(X) = \|X\|_{\ell_1}$, in terms of the oversampling ratio $\delta = \frac{m}{n^2}$ and $s = \frac{\|X_0\|_0}{n^2}$, for the cases of (a) estimator (7) with quadratic measurements and (b) estimator (8) with Gaussian measurements. The blue lines indicate the theoretical estimate for the phase transition derived from equation (9). In the simulations we used matrices of size $n = 40$. The data is averaged over 20 independent realizations of the measurements.

4.3 Conclusion

We have investigated an estimation problem under linear observations. We aimed to characterize the minimum number of observations needed for perfect recovery of the unknown model. Our main result indicated that this phase transition only depends on the first two statistics of the measurement vectors. Therefore, it remains unchanged when we replace these vectors with Gaussian vectors having the same mean vector and covariance matrix. The latter can be analyzed through existing frameworks such as CGMT. As one application of this universality, we investigated the case of matrix recovery via the so-called quadratic measurements, and derived the minimum number of observations required for the recovery of a structured matrix. Due to the space constraint, we moved the discussion of the case of simultaneously structured matrices to the appendix. Table 1 summarizes these results for the cases of three structures.

References

[1] Dennis Amelunxen, Martin Lotz, Michael B. McCoy, and Joel A. Tropp. Living on the edge: Phase transitions in convex programs with random data. Information and Inference: A Journal of the IMA, 3(3):224–294, 2014.
[2] Dyonisius Dony Ariananda and Geert Leus. Compressive wideband power spectrum estimation. IEEE Transactions on Signal Processing, 60(9):4775–4789, 2012.
[3] T. Tony Cai, Anru Zhang, et al. ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138, 2015.
[4] Emmanuel J. Candès. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592, 2008.
[5] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
[6] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
[7] Emmanuel J. Candès, Thomas Strohmer, and Vladislav Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.
[8] Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky. Latent variable graphical model selection via convex optimization. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1610–1613. IEEE, 2010.
[9] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[10] Yuxin Chen, Yuejie Chi, and Andrea J. Goldsmith. Estimation of simultaneously structured covariance matrices from quadratic measurements. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7669–7673. IEEE, 2014.
[11] Yuxin Chen, Yuejie Chi, and Andrea J. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034–4059, 2015.
[12] David Donoho and Jared Tanner. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4273–4293, 2009.
[13] David L. Donoho et al. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[14] David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
[15] Noureddine El Karoui et al. Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics, 36(6):2717–2756, 2008.
[16] Yehoram Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265–289, 1985.
[17] Yehoram Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric Aspects of Functional Analysis, pages 84–106. Springer, 1988.
[18] Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019–3033, 2013.
[19] Yuanxin Li, Yue Sun, and Yuejie Chi. Low-rank positive semidefinite matrix recovery from corrupted rank-one measurements. IEEE Transactions on Signal Processing, 65(2):397–408, 2016.
[20] Samet Oymak and Babak Hassibi. New null space results and recovery thresholds for matrix rank minimization. arXiv preprint arXiv:1011.6326, 2010.
[21] Samet Oymak, Amin Jalali, Maryam Fazel, Yonina C. Eldar, and Babak Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. IEEE Transactions on Information Theory, 61(5):2886–2908, 2015.
[22] Samet Oymak and Joel A. Tropp. Universality laws for randomized dimension reduction, with applications. Information and Inference: A Journal of the IMA, 7(3):337–446, 2017.
[23] Ashkan Panahi and Babak Hassibi. A universal analysis of large-scale regularized least squares solutions. In Advances in Neural Information Processing Systems, pages 3381–3390, 2017.
[24] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[25] Mark Rudelson and Roman Vershynin. Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements. In 2006 40th Annual Conference on Information Sciences and Systems, pages 207–212. IEEE, 2006.
[26] Yoav Shechtman, Amir Beck, and Yonina C. Eldar. GESPAR: Efficient phase retrieval of sparse signals. IEEE Transactions on Signal Processing, 62(4):928–938, 2014.
[27] Mihailo Stojnic. Various thresholds for ℓ1-optimization in compressed sensing. 2009.
[28] Mihailo Stojnic. Upper-bounding ℓ1-optimization weak thresholds. arXiv preprint arXiv:1303.7289, 2013.
[29] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592–5628, 2018.
[30] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory, pages 1683–1709, 2015.
[31] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[32] Joel A. Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
[33] Joel A. Tropp, Jason N. Laska, Marco F. Duarte, Justin K. Romberg, and Richard G. Baraniuk. Beyond Nyquist: Efficient sampling of sparse bandlimited signals. arXiv preprint arXiv:0902.0026, 2009.
[34] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[35] Chris D. White, Sujay Sanghavi, and Rachel Ward. The local convexity of solving systems of quadratic equations. arXiv preprint arXiv:1506.07868, 2015.