# Uniform Convergence with Square-Root Lipschitz Loss

Lijia Zhou, University of Chicago, zlj@uchicago.edu
Zhen Dai, University of Chicago, zhen9@uchicago.edu
Frederic Koehler, Stanford University, fkoehler@stanford.edu
Nathan Srebro, Toyota Technological Institute at Chicago, nati@ttic.edu

Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)

We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipschitz losses, which also includes non-smooth loss functions appropriate for studying phase retrieval and ReLU regression, as well as to rederive and better understand optimistic rate and interpolation learning guarantees.

1 Introduction

The phenomenon of "interpolation learning", where a learning algorithm perfectly interpolates noisy training labels while still generalizing well to unseen data, has attracted significant interest in recent years. As Shamir (2022) pointed out, one of the difficulties in understanding interpolation learning is that we need to determine which loss function we should use to analyze the test error. In linear regression, interpolating the square loss is equivalent to interpolating many other losses (such as the absolute loss) on the training set. Similarly, in the context of linear classification, many works (Soudry et al. 2018; Ji and Telgarsky 2019; Muthukumar et al. 2021) have shown that optimizing the logistic, exponential, or even square loss would all lead to the maximum margin solution, which interpolates the hinge loss. Even though there are many possible loss functions to choose from to analyze the population error, consistency is often possible with respect to only one loss function because different loss functions generally have different population error minimizers. In linear regression, it has been shown that the minimal norm interpolant can be consistent with respect to the square loss (Bartlett et al. 2020; Muthukumar et al. 2020; Negrea et al. 2020; Koehler et al. 2021), while the appropriate loss for benign overfitting in linear classification is the squared hinge loss (Shamir 2022; Zhou et al. 2022).

In this line of work, uniform convergence has emerged as a general and useful technique to understand interpolation learning (Koehler et al. 2021; Zhou et al. 2022). Though the Lipschitz contraction technique has been very useful for establishing uniform convergence in classical learning theory, the appropriate loss functions for studying interpolation learning (such as the square loss and the squared hinge loss) are usually not Lipschitz. In fact, many papers (Nagarajan and Kolter 2019; Zhou et al. 2020; Negrea et al. 2020) have shown that the traditional notion of uniform convergence implied by Lipschitz contraction fails in the setting of interpolation learning. Instead, we need to find other properties of the loss function to establish a different type of uniform convergence. Zhou et al.
(2022) recently showed that the Moreau envelope of the square loss or the squared hinge loss is proportional to the loss itself, and that this property is sufficient to establish a type of uniform convergence guarantee known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010). Roughly speaking, if this condition holds, then the difference between the square roots of the population error and the training error can be bounded by the Rademacher complexity of the hypothesis class. Specializing to predictors with zero training error, the test error can then be controlled by the square of the Rademacher complexity, which exactly matches the Bayes error for the minimal norm interpolant when it is consistent (Koehler et al. 2021). However, the Moreau envelope of a loss function is not always easy to compute in closed form, and it is unclear whether this argument can be applied to problems beyond linear regression and max-margin classification. Moreover, considering the difference of square roots seems like a mysterious choice without a good intuitive explanation.

In this paper, by showing that an optimistic rate holds for any square-root-Lipschitz loss, we argue that the class of square-root-Lipschitz losses is the more natural class of loss functions to consider. In fact, any loss whose Moreau envelope is proportional to itself is square-root Lipschitz. Our result also provides an intuitive explanation for optimistic rates: when the loss is Lipschitz, the difference between the test and training error can be bounded by the Rademacher complexity; therefore, when the loss is square-root Lipschitz, the difference between the square roots of the test and training error can be bounded by the Rademacher complexity. In addition to avoiding any hidden constant and logarithmic factor, our uniform convergence guarantee substantially generalizes previous results based on smoothness (Lipschitz constant of the derivative, such as Srebro et al. 2010). The generality of our results allows us to handle non-differentiable losses. In the problems of phase retrieval and ReLU regression, we identify the consistent loss through a norm calculation, and we apply our theory to prove novel consistency results. In the context of noisy matrix sensing, we also establish benign overfitting for the minimal nuclear norm solution. Finally, we show that our results extend to fully optimized neural networks with weight sharing in the first layer.

2 Related Work

Optimistic rates. In the context of generalization theory, our results are related to a phenomenon known as optimistic rates: generalization bounds which give strengthened guarantees for predictors with smaller training error. Such bounds can be contrasted with pessimistic bounds, which do not attempt to adapt to the training error of the predictor; see for example Vapnik 1982; Panchenko 2003; Panchenko 2002. The work of Srebro et al. 2010 showed that optimistic rates control the generalization error of function classes learned with smooth losses, and recently the works Zhou et al. 2021; Koehler et al. 2021; Zhou et al. 2020 showed that a much sharper version of optimism can naturally explain the phenomenon of benign overfitting in Gaussian linear models. (These works are in turn connected with a celebrated line of work on proportional asymptotics for M-estimation such as Stojnic 2013; see references therein.)
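As an informal schematic of the optimistic-rate shape referred to above and throughout (our own paraphrase, not a verbatim statement from any of the cited papers): if the square root of the loss is H-Lipschitz, the guarantee roughly reads as follows, where $\mathcal{R}_n$ denotes the Rademacher complexity of the predictor class.

```latex
% Informal schematic only; constants and the failure probability are omitted.
\sqrt{L(w)} \;\lesssim\; \sqrt{\hat{L}(w)} \;+\; H\,\mathcal{R}_n
\qquad\Longrightarrow\qquad
L(w) \;\lesssim\; H^2\,\mathcal{R}_n^2 \quad \text{whenever } \hat{L}(w)=0 .
```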
In the present work, we show that the natural setting for optimistic rates is actually square-root-Lipschitz losses and this allows us to capture new applications such as phase retrieval where the loss is not smooth. Phase retrieval, Re LU Regression, Matrix sensing. Recent works (Maillard et al. 2020; Barbier et al. 2019; Mondelli and Montanari 2018; Luo et al. 2019) analyzed the statistical limits of phase retrieval in a high-dimensional limit for certain types of sensing designs (e.g. i.i.d. Real or Complex Gaussian), as well as the performance of Approximate Message Passing in this problem (which is used to predict a computational-statistical gap). In the noiseless case, the works Andoni et al. 2017; Song et al. 2021 gave better computationally efficient estimators for phase retrieval based upon the LLL algorithm. Our generalization bound can naturally be applied to any estimator for phase retrieval, including these methods or other ones, such as those based on non-convex optimization (Sun et al. 2018; Netrapalli et al. 2013). For the same reason, they can also naturally be applied in the context of sparse phase retrieval (see e.g. Li and Voroninski 2013; Candes et al. 2015). Similarly, there have been many works studying the computational tractability of Re LU regression. See for example the work of Auer et al. 1995 where the matching loss technique was developed for learning well-specified GLMs including Re LUs. Under misspecification, learning Re LUs can be hard and approximate results have been established, see e.g. Goel et al. 2017; Diakonikolas et al. 2020. In particular, learning a Re LU agnostically in squared loss, even over Gaussian marginals, is known to be hard under standard average-case complexity assumptions (Goel et al. 2019). Again, our generalization bound has the merit that it can be applied to any estimator. For matrix sensing, we consider nuclear norm minimization (as in e.g. Recht et al. 2010) which is a convex program, so it can always be computed in polynomial time. Weight-tied neural networks. The work of Bietti et al. 2022 studies a similar two-layer neural network model with weight tying. Their focus is primarily on understanding the non-convex optimization of this model via gradient-based methods. Again, our generalization bound can be straightforwardly combined with the output of their algorithm. The use of norm-based bounds as in our result is common in the generalization theory for neural networks, see e.g. Bartlett (1996) and Anthony, Bartlett, et al. (1999). Compared to existing generalization bounds, the key advantage of our bound is its quantitative sharpness (e.g. small constants and no extra logarithmic factors). 3 Problem Formulation In most of this paper, we consider the setting of generalized linear models. Given any loss function f : R Y R 0 and independent sample pairs (xi, yi) from some data distribution D over Rd Y, we can learn a linear model ( ˆw,ˆb) by minimizing the empirical loss ˆLf with the goal of achieving small population loss Lf: ˆLf(w, b) = 1 i=1 f( w, xi + b, yi), Lf(w, b) = E(x,y) D[f( w, x + b, y)]. (1) In the above, Y is an abstract label space. For example, Y = R for linear regression and Y = R 0 for phase retrieval (section 5.1) and Re LU regression (section 5.2). If we view xi as vectorization of the sensing matrices, then our setting (1) also includes the problem of matrix sensing (section 5.3). For technical reasons, we assume that the distribution D follows a Gaussian multi-index model (Zhou et al. 2022). 
(A) d-dimensional Gaussian features with arbitrary mean and covariance: x N(µ, Σ) (B) a generic multi-index model: there exist a low-dimensional projection W = [w 1, ..., w k] Rd k, a random variable ξ Dξ independent of x (not necessarily Gaussian), and an unknown link function g : Rk+1 Y such that ηi = w i , x , y = g(η1, ..., ηk, ξ). (2) We require x to be Gaussian because the proof of our generalization bound depends on the Gaussian Minimax Theorem (Gordon 1985; Thrampoulidis et al. 2014), as in many other works (Koehler et al. 2021; Zhou et al. 2022; Wang et al. 2021; Donhauser et al. 2022). In settings where the features are not Gaussian, many works (Goldt et al. 2020; Mei and Montanari 2022; Misiakiewicz 2022; Hu and Lu 2023; Han and Shen 2022) have established universality results showing that the test and training error are asymptotically equivalent to the error of a Gaussian model with matching covariance matrix. However, we show in section 7 that Gaussian universality is not always valid and we discuss potential extensions of the Gaussian feature assumption there. The multi-index assumption is a generalization of a well-specified linear model y = w , x + ξ, which corresponds to k = 1 and g(η, ξ) = η + ξ in (2). Since g is not even required to be continuous, the assumption on y is quite general. It allows nonlinear trend and heteroskedasticity, which pose significant challenges for the analyses using random matrix theory (but not for uniform convergence). In Zhou et al. (2022), the range of g is taken to be Y = { 1, 1} in order to study binary classification. In section 5.2, we allow the conditional distribution of y to have a point mass at zero, which is crucial for distinguishing Re LU regression from standard linear regression. 4 Optimistic Rate In order to establish uniform convergence, we need two additional technical assumptions because we have not made any assumption on f and g. (C) hypercontractivity: there exists a universal constant τ > 0 such that uniformly over all (w, b) Rd R, it holds that E[f( w, x + b, y)4]1/4 E[f( w, x + b, y)] τ. (3) (D) the class of functions on Rk+1 Y defined below has VC dimension at most h: {(x, y) 1{f( w, x , y) > t} : (w, t) Rk+1 R}. (4) Crucially, the quantities τ and h only depend on the number of indices k in assumption (B) instead of the feature dimension d. This is a key distinction because k can be a small constant even when the feature dimension is very large. For example, recall that k = 1 in a well-specified model and it is completely free of d. The feature dimension plays no role in equation (3) because conditioned on W T x Rk and ξ R, the response y is non-random and the distribution of w, x for any w Rd only depends on w T µ and w T Σw by properties of the Gaussian distribution. The hypercontractivity (or norm equivalence ) assumption (3) is one of a few different common assumptions made in the statistical learning theory literature to rule out the possibility of a very heavy-tailed loss function. The reason is that if the loss function has a very tiny chance of being extremely large, it is not possible to state any meaningful learning guarantee from a finite number of samples. Hypercontractivity was used as an assumption already in Vapnik 1982. Another common and incomparable assumption made in the literature is boundedness, but this is not suitable for us because, e.g., the squared loss under the Gaussian distribution will not be bounded almost surely. 
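To make assumptions (A)-(D) concrete, here is a minimal simulation sketch (our own illustration; the covariance, the link function g, and the single test point (w, b) are arbitrary choices, not taken from the paper). It draws data from the Gaussian multi-index model and Monte-Carlo-estimates the hypercontractivity ratio in (3) for the square loss at one fixed predictor; assumption (C) itself asks for this ratio to be bounded uniformly over (w, b).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 200, 1, 5000

# Assumption (A): Gaussian features x ~ N(mu, Sigma) (illustrative covariance).
mu = np.zeros(d)
Sigma = np.diag(1.0 / np.arange(1, d + 1))
X = rng.multivariate_normal(mu, Sigma, size=n)

# Assumption (B): multi-index model y = g(<w_1*, x>, ..., <w_k*, x>, xi); here k = 1
# with a well-specified linear link g(eta, xi) = eta + xi (hypothetical choice).
w_star = np.zeros(d)
w_star[0] = 1.0
xi = rng.normal(scale=0.5, size=n)       # noise independent of x
y = X @ w_star + xi

# Assumption (C), checked at a single fixed (w, b) for the square loss.
w, b = rng.normal(size=d) / np.sqrt(d), 0.0
loss = (X @ w + b - y) ** 2
ratio = np.mean(loss ** 4) ** 0.25 / np.mean(loss)
print("estimated E[f^4]^(1/4) / E[f] =", round(ratio, 2))  # a small constant, independent of d
```

In this well-specified example the residual is Gaussian, so the ratio stays bounded by a small constant no matter how large d is, in line with the discussion above.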
On the other hand, if $f$ is the squared loss then (3) holds with $\tau = 105^{1/4}$ by standard properties of the Gaussian distribution. We can also state other versions of our main theorem by using other results in statistical learning theory to handle the low-dimensional concentration; in our proof we simply use a result from Vapnik 1982 in a contained and black-box fashion. (See e.g. Mendelson 2017; Vapnik 1982 for more discussion of the different possible assumptions.) Similar assumptions are made in Zhou et al. (2022) and more background on these assumptions can be found there, but we note that our assumptions (C) and (D) are weaker than theirs because we do not require (3) and (4) to hold uniformly over all Moreau envelopes of f.

We are now ready to state the uniform convergence results for square-root-Lipschitz losses.

Theorem 1. Assume that (A), (B), (C), and (D) hold, and let $Q = I - W(W^T \Sigma W)^{-1} W^T \Sigma$. For any $\delta \in (0, 1)$, let $C_\delta : \mathbb{R}^d \to [0, \infty]$ be a continuous function such that with probability at least $1 - \delta/4$ over $x \sim N(0, \Sigma)$, uniformly over all $w \in \mathbb{R}^d$,
$\langle w, Q^T x \rangle \le C_\delta(w)$. (5)
If for each $y \in Y$, $f$ is non-negative and $\sqrt{f}$ is $H$-Lipschitz with respect to the first argument, then with probability at least $1 - \delta$, it holds uniformly over all $(w, b) \in \mathbb{R}^d \times \mathbb{R}$ that
$(1 - \epsilon)\sqrt{L_f(w, b)} \;\le\; \sqrt{\hat{L}_f(w, b)} + \frac{H\, C_\delta(w)}{\sqrt{n}}$, (6)
where $\epsilon = O\big(\tau \sqrt{\tfrac{h \log(n/h) + \log(1/\delta)}{n}}\big)$.

Since the label y only depends on x through $W^T x$, the matrix Q is simply a (potentially oblique) projection such that $Q^T x$ is independent of y. In the proof, we separate x into a low-dimensional component related to y and the independent component $Q^T x$. We establish a low-dimensional concentration result using VC theory, which is reflected in the $\epsilon$ term that does not depend on d, and we control the remaining high-dimensional component using a scale-sensitive measure $C_\delta$. The complexity term $C_\delta(w)/\sqrt{n}$ should be thought of as a (localized form of) Rademacher complexity $R_n$. For example, for any norm $\|\cdot\|$, we have $\langle w, Q^T x \rangle \le \|w\| \, \|Q^T x\|_*$ (with $\|\cdot\|_*$ the dual norm), and the Rademacher complexity of linear predictors with norm bounded by B is $B\, \mathbb{E}\|x\|_* / \sqrt{n}$ (Shalev-Shwartz and Ben-David 2014). More generally, if we only care about linear predictors in a set K, then $C_\delta$ can be taken to be roughly $\mathbb{E} \sup_{w \in K} \langle w, Q^T x \rangle$, which is simply the Gaussian width (Bandeira 2016; Gordon 1988; Koehler et al. 2021) of K with respect to $\Sigma' = Q^T \Sigma Q$ and is exactly equivalent to the expected Rademacher complexity of the class of functions (e.g., Proposition 1 of Zhou et al. 2021).

5 Applications

We now go on to analyze several different problems using optimistic rates and square-root-Lipschitz losses. One of the key insights we want to highlight is the identification of the consistent loss for many of these problems, in other words, the test loss which is implicitly being minimized. When we consider interpolators/overfit models, this is not obvious because an interpolator simultaneously minimizes many different training losses (and when we look at the test loss, each different loss function may correspond to a different minimizer). Nevertheless, we are able to analyze phenomena like benign overfitting by identifying the particular square-root-Lipschitz losses which the interpolator is consistent in, i.e. for which it approaches the minimum of the test loss in the appropriate limit.

5.1 Benign Overfitting in Phase Retrieval

In this section, we study the problem of phase retrieval, where only the magnitude of the label y can be measured. If the number of observations n < d, the minimal norm interpolant is
$\hat{w} = \arg\min_{w \in \mathbb{R}^d:\ \forall i \in [n],\ \langle w, x_i \rangle^2 = y_i^2} \|w\|_2$
(7) We are particularly interested in the generalization error of (7) when yi are noisy labels and so ˆw overfits to the training set. It is well-known that gradient descent initialized at zero converges to a KKT point of the problem (7) when it reaches a zero training error solution (see e.g. Gunasekar et al. 2018). While is not always computationally feasible to compute ˆw, we can study it as a theoretical model for benign overfitting in phase retrieval. Due to our uniform convergence guarantee in Theorem 1, it suffices to analyze the norm of ˆw and we can use the norm calculation to find the appropriate loss for phase retrieval. The key observation is that for any w Rd, we can let I = {i [n] : w , xi 0} and the predictor w = w + w satisfies | w, xi | = yi where w = arg min w Rd: i I, w,xi =yi | w ,xi | i/ I, w,xi =| w ,xi | yi Then it holds that ˆw 2 w 2 w 2 + w 2. By treating yi | w , xi | and | w , xi | yi as the residuals, the norm calculation for w 2 is the same as the case of linear regression (Koehler et al. 2021; Zhou et al. 2022). Going through this analysis yields the following norm bound (here R(Σ) = Tr(Σ)2/ Tr(Σ2) is a measure of effective rank, as in (Bartlett et al. 2020)): Theorem 2. Under assumptions (A) and (B), let f : R Y R be given by f(ˆy, y) := (|ˆy| y)2 with Y = R 0. Let Q be the same as in Theorem 1 and Σ = QT ΣQ. Fix any w Rd such that Qw = 0 and for some ρ (0, 1), it holds that ˆLf(w ) (1 + ρ)Lf(w ). (9) Then with probability at least 1 δ, for some ϵ ρ + log 1 n + n R(Σ ) min w Rd: i [n], w,xi 2=y2 i w 2 w 2 + (1 + ϵ) Tr(Σ ) . (10) Note that in this result we used the loss f(ˆy, y) = (|ˆy| y)2, which is 1-square-root Lipschitz and naturally arises in the analysis. We will now show that this is the correct loss in the sense that benign overfitting is consistent with this loss. To establish consistency result, we can pick w to be the minimizer of the population error. The population minimizer always satisfies Qw = 0 because Lf((I Q)w) Lf(w) for all w Rd by Jensen s inequality. Condition (9) can also be easily checked because we just need concentration of the empirical loss at a single non-random parameter. Applying Theorem 1, we obtain the following. Corollary 1. In the same setting as in Theorem 2, with probability at least 1 δ, it holds that for some k log(n/k) + log(1/δ) n + log (1/δ) n + n R(Σ ) the minimal norm interpolant ˆw given by (7) enjoys the learning guarantee: L( ˆw,ˆb) (1 + ρ) L(w , b ) + w 2 Therefore, ˆw is consistent if the same benign overfitting conditions from Bartlett et al. (2020) and Zhou et al. (2022) hold: k = o(n), R(Σ ) = ω(n) and w 2 2 Tr(Σ ) = o(n). 5.2 Benign Overfitting in Re LU Regression Since we only need f to be Lipschitz, for any 1-Lipschitz function σ : R R, we can consider (i) f(ˆy, y) = (σ(ˆy) y)2 when Y = R (ii) f(ˆy, y) = (1 σ(ˆy)y)2 + when Y = { 1, 1}. We can interpret the above loss function f as learning a neural network with a single hidden unit. Indeed, Theorem 1 can be straightforwardly applied to these situations. However, we do not always expect benign overfitting under these loss functions for the following simple reason, as pointed out in Shamir (2022): when σ is invertible, interpolating the loss f(ˆy, y) = (σ(ˆy) y)2 is the same as interpolating the loss f(ˆy, y) = (ˆy σ 1(y))2 and so the learned function would be the minimizer of E[( w, x +b σ 1(y))2] which is typically different from the minimizer of E[(σ( w, x +b) y)2]. 
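To make the preceding observation concrete, here is a small numerical sketch (our own, not from the paper) with the invertible, 1-Lipschitz choice sigma = arcsinh: interpolating the loss (sigma(y_hat) - y)^2 imposes exactly the linear constraints <w, x_i> = sigma^{-1}(y_i), so the minimal-norm interpolator reduces to an ordinary least-norm linear solve against sigma^{-1}(y).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200                          # overparameterized regime: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                  # arbitrary (noisy) labels

sigma, sigma_inv = np.arcsinh, np.sinh  # invertible and 1-Lipschitz link

# Interpolating (sigma(<w, x_i>) - y_i)^2 = 0 is the same constraint as
# <w, x_i> = sigma^{-1}(y_i), so the minimal-norm interpolator is the
# least-norm solution of the linear system X w = sigma^{-1}(y).
w_hat = np.linalg.lstsq(X, sigma_inv(y), rcond=None)[0]

print("max |sigma(Xw) - y| =", np.max(np.abs(sigma(X @ w_hat) - y)))  # ~ 0
# Consequently w_hat targets the minimizer of E[(<w, x> - sigma^{-1}(y))^2],
# which generally differs from the minimizer of E[(sigma(<w, x>) - y)^2].
```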
The situation of Re LU regression, where σ(ˆy) = max{ˆy, 0}, is more interesting because σ is not invertible. In order to be able to interpolate, we must have y 0. If y > 0 with probability 1, then σ(ˆy) = y is the same as ˆy = y and we are back to interpolating the square loss f(ˆy, y) = (ˆy y)2. From this observation, we see that f(ˆy, y) = (σ(ˆy) y)2 cannot be the appropriate loss for consistency even though it s 1 square-root Lipschitz. In contrast, when there is some probability mass at y = 0, it suffices to output any non-positive value and the minimal norm to interpolate is potentially smaller than requiring ˆy = 0. Similar to the previous section, we can let I = {i [n] : yi > 0} and for any (w , b ) Rd+1, the predictor w = w + w satisfies σ( w, xi + b ) = yi where w = arg min w Rd: i I, w,xi =yi w ,xi b i/ I, w,xi = σ( w ,xi +b ) Our analysis will show that the consistent loss for benign overfitting with Re LU regression is f(ˆy, y) = (ˆy y)2 if y > 0 σ(ˆy)2 if y = 0. . (13) We first state the norm bound below. Theorem 3. Under assumptions (A) and (B), let f : R Y R be the loss defined in (13) with Y = R 0. Let Q be the same as in Theorem 1 and Σ = QT ΣQ. Fix any (w , b ) Rd+1 such that Qw = 0 and for some ρ (0, 1), it holds that ˆLf(w , b ) (1 + ρ)Lf(w , b ). (14) Then with probability at least 1 δ, for some ϵ ρ + log 1 n + n R(Σ ) min (w,b) Rd+1: i [n],σ( w,xi +b)=yi w 2 w 2 + (1 + ϵ) n Lf(w , b ) Tr(Σ ) . (15) Since the loss defined in (13) is also 1-square-root Lipschitz, it can be straightforwardly combined with Theorem 1 to establish benign overfitting in Re LU regression under this loss. The details are exactly identical to the previous section and so we omit it here. 5.3 Benign Overfitting in Matrix Sensing We now consider the problem of matrix sensing: given random matrices A1, ..., An (with i.i.d. standard Gaussian entries) and independent linear measurements y1, ..., yn given by yi = Ai, X + ξi where ξi is independent of Ai, and Eξ = 0 and Eξ2 = σ2, we hope to reconstruct the matrix X Rd1 d2 with sample size n d1d2. To this end, we assume that X has rank r. In this setting, since the measurement matrices have i.i.d. standard Gaussian entries, the test error is closely related to the estimation error: L(X) = E( A, X y)2 = X X 2 F + σ2. The classical approach to this problem is to find the minimum nuclear norm solution: ˆX = arg min X Rd1 d2: Ai,X =yi X . (16) Gunasekar et al. (2017) also shows that gradient descent converges to the minimal nuclear norm solution in matrix factorization problems. It is well known that having low nuclear norm can ensure generalization (Foygel and Srebro 2011; Srebro and Shraibman 2005) and minimizing the nuclear norm ensures reconstruction (Candès and Recht 2009; Recht et al. 2010). However, if the noise level σ is high, then even the minimal nuclear norm solution ˆX can have large nuclear norm. Since our result can be adapted to different norms as regularizer, our uniform convergence guarantee can be directly applied. The dual norm of the nuclear norm is the spectral norm, and it is well-known that the spectrum norm of a Gaussian random matrix is approximately d1 + d2 (Vershynin 2018) . It remains to analyze the minimal nuclear norm required to interpolate. For simplicity, we assume that ξ are Gaussian below, but we can extend it to be sub-Gaussian. Theorem 4. Suppose that d1d2 > n, then there exists some ϵ q n + n d1d2 such that with probability at least 1 δ, it holds that min i [n], Ai,X =yi X r X F + (1 + ϵ) d1 d2 . 
(17) Without loss of generality, we will assume d1 d2 from now on because otherwise we can take the transpose of A and X. Similar to assuming n/R(Σ ) 0 in linear regression, we implicitly assume that n/d1d2 0 in matrix sensing. Such scaling is necessary for benign overfitting because of the lower bound on the test error for any interpolant (e.g., Proposition 4.3 of Zhou et al. 2020). Finally, we apply the uniform convergence guarantee. Theorem 5. Fix any δ (0, 1). There exist constants c1, c2, c3 > 0 such that if d1d2 > c1n, d2 > c2d1, n > c3r(d1 + d2), then with probability at least 1 δ that ˆX X 2 F X 2 F r(d1 + d2) d1 d2 + n d1d2 X 2 F . (18) From Theorem 5, we see that when the signal to noise ratio X 2 F σ2 is bounded away from zero, then we obtain consistency ˆ X X 2 F X 2 F 0 if (i) r(d1 + d2) = o(n), (ii) d1d2 = ω(n), and (iii) d1/d2 {0, }. This can happen for example when r = Θ(1), d1 = Θ(n1/2), d2 = Θ(n2/3). As discussed earlier, the second condition is necessary for benign overfitting, and the first consistency condition should be necessary even for regularized estimators. It is possible that the final condition is not necessary and we leave a tighter understanding of matrix sensing as future work. 6 Single-Index Neural Networks We show that our results extend a even more general setting than (1). Suppose that we have a parameter space Θ Rp and a continuous mapping w from θ Θ to a linear predictor w(θ) Rd. Given a function f : R Y Θ R and i.i.d. sample pairs (xi, yi) drawn from distribution D, we consider training and test error of the form: i=1 f( w(θ), xi , yi, θ) and L(θ) = E[f( w(θ), x , y, θ)]. (19) We make the same assumptions (A) and (B) on D. Assumptions (C) and (D) can be naturally extended to the following: (E) there exists a universal constant τ > 0 such that uniformly over all θ Θ, it holds that E[f( w(θ), x , y, θ)4]1/4 E[f( w(θ), x , y, θ)] τ. (20) (F) bounded VC dimensions: the class of functions on Rk+1 Y defined by {(x, y) 1{f( w, x , y, θ) > t} : (w, t, θ) Rk+1 R Θ} (21) has VC dimension at most h. Now we can state the extension of Theorem 1. Theorem 6. Suppose that assumptions (A), (B), (E) and (F) hold. For any δ (0, 1), let Cδ : Rd [0, ] be a continuous function such that with probability at least 1 δ/4 over x N (0, Σ), uniformly over all θ Θ, w(θ), QT x Cδ(w(θ)). (22) If for each θ Θ and y Y, f is non-negative and f is Hθ-Lipschitz with respect to the first argument, and Hθ is continuous in θ, then with probability at least 1 δ, it holds that uniformly over all θ Θ, we have Hθ Cδ(w(θ))2 Next, we show that the generality of our result allows us to establish uniform convergence bound for two-layer neural networks with weight sharing. In particular, we let σ(x) = max(x, 0) be the Re LU activation function and θ = (w, a, b) Rd+2N, where N is the number of hidden units. Consider the loss function f(ˆy, y, θ) = i=1 aiσ(ˆy bi) y or f(ˆy, y, θ) = i=1 aiσ(ˆy bi)y then L(θ) is the test error of a neural network of the form hθ(x) := PN i=1 aiσ( w, x bi). Since our uniform convergence guarantee holds uniformly over all θ, it applies to networks whose first and second layer weights are optimized at the same time. Without loss of generality, we can assume that b1 ... b N are sorted, then it is easy to see that f is maxj [N] Pj i=1 ai Lipschitz. Applying Theorem 6, we obtain the following corollary. Corollary 2. Fix an arbitrary norm and consider f as defined in (24). Assume that the data distribution D satisfy (A), (B), and (E). 
Then with probability at least 1 δ, it holds that uniformly over all θ = (w, a, b) Rd+2N, we have ˆL(θ) + maxj [N] Pj i=1 ai w (E QT x + ϵ ) n where ϵ is the same as in Theorem 6 with h = O(k +N) and ϵ = O sup u 1 u Σ p The above theorem says given a network parameter θ, after sorting the bi s, a good complexity measure to look at is maxj [N] Pj i=1 ai w and equation (25) precisely quantify how the complexity of a network controls generalization. 7 On Gaussian Universality In this section, we discuss a counterexample to Gaussian universality motivated by Shamir (2022). In particular, we consider a data distribution D over (x, y) given by: (G) x = (x|k, x|d k) where x|k N(0, Σ|k) and there exists a function h : Rk R such that x|d k = h(x|k) z with z N 0, Σ|d k independent of x|k (26) (H) there exists a function g : Rk+1 R such that y = g(x|k, ξ) (27) where ξ Dξ is independent of x (but not necessarily Gaussian). By the independence between z and x|k, we can easily see that x|k and x|d k are uncorrelated. Moreover, the covariance matrix of x|d k is E[x|d kx T |d k] = E[h(x|k)2] Σ|d k which is just a re-scaling of Σ|d k and therefore has the same effective rank as R(Σ|d k). Even though the feature x in D is non-Gaussian, its tail x|d k is Gaussian conditioned on x|k. Therefore, our proof technique is still applicable after a conditioning step. However, it turns out that the uniform convergence bound in Theorem 1 is no longer valid. Instead, we have to re-weight the loss function. The precise theorem statements can be found in the appendix. We briefly describe our theoretical results below, which follow the same ideas as in section 4 and 5. Uniform Convergence. Under a similar low-dimensional concentration assumption such as (C), if we let Cδ be such that w, z Cδ(w) for all w Rd k with high probability, it holds that E f( w, x , y) f( w, xi , yi) h(xi|k)2 + Cδ(w|d k) n Note that the distribution of z is specified in (G) and different to the distribution of x|d k. Norm Bound. Next, we focus on linear regression and compute the minimal norm required to interpolate. If we pick Σ|d k to satisfy the benign overfitting conditions, for any w |k Rk, we have min w Rd: i [n], w,xi =yi w 2 2 w |k 2 2 + (1 + o(1)) n E y w |k,x|k h(x|k) 2 Tr(Σ|d k) (29) It is easy to see that the w that minimizes the population weighted square loss satisfy w |d k = 0, and so we can let w |k to be the minimizer. Benign Overfitting. Let ˆw = arg minw Rd:Xw=Y w 2 be the minimal norm interpolant. Plugging in the norm bound (29) into the uniform convergence guarantee (28), we show that E ( ˆw, x y)2 (1 + o(1)) ˆw 2 2 Tr(Σ|d k) n E ( w , x y)2 which recovers the consistency result of Shamir (2022). Contradiction. Suppose that Gaussian universality holds, then our optimistic rate theory would predict that E[( ˆw, x y)2] (1 + o(1)) ˆw 2 2 E[h(x|k)2] Tr(Σ|d k) Combining (30) with (31), we obtain that min w E[( w, x y)2] (1 + o(1)) min w E[h(x|k)2] E which cannot always hold. For example, let s consider the case where k = 1, x1 N(0, 1), h(x1) = 1 + |x1| and y = h(x1)2. Then it is straightforward to verify that the left hand side of (32) equals E[h(x1)4] and the right hand side equals E[h(x1)2]2, but this is impossible because E[h(x1)4] E[h(x1)2]2 = Var(h(x1)2) > 0. In the counterexample above, we see that it is possible to introduce strong dependence between the signal and the tail component of x while ensuring that they are uncorrelated. 
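A quick Monte Carlo sanity check of this counterexample (our own sketch, using the stated choices k = 1, h(x_1) = 1 + |x_1|, and an identity tail covariance for illustration): it confirms that E[h(x_1)^4] strictly exceeds E[h(x_1)^2]^2, and that the squared norm of the tail h(x|_k) z fluctuates at the scale of its mean rather than concentrating, unlike the Gaussian model with matching covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d_tail = 100_000, 2000               # Monte Carlo size, tail dimension

x1 = rng.normal(size=m)
h = 1.0 + np.abs(x1)                    # h(x_1) = 1 + |x_1|

# Left vs right side of (32) for this choice: E[h^4] vs E[h^2]^2.
print("E[h^4]   =", np.mean(h ** 4))    # ~ 19.6
print("E[h^2]^2 =", np.mean(h ** 2) ** 2)  # ~ 12.9, strictly smaller

# Tail x|_{d-k} = h(x|_k) * z with z ~ N(0, I): its squared norm is
# h(x_1)^2 * ||z||^2, which does not concentrate because h(x_1)^2 does not.
z_sq = rng.chisquare(d_tail, size=m)    # ||z||^2 draws
tail_sq = h ** 2 * z_sq
gauss_sq = np.mean(h ** 2) * z_sq       # Gaussian model with matching covariance
print("relative std of ||tail||^2 (non-Gaussian model):",
      np.std(tail_sq) / np.mean(tail_sq))      # O(1), does not vanish with d_tail
print("relative std of ||tail||^2 (Gaussian-matched):",
      np.std(gauss_sq) / np.mean(gauss_sq))    # ~ sqrt(2/d_tail), vanishing
```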
The dependence will prevent the norm of the tail from concentrating around its mean, no matter how large the sample size is. In contrast, the norm of the tail will concentrate for Gaussian features with a matching covariance such discrepancy results in an over-optimistic bound for non-Gaussian data. 8 Conclusion In this paper, we extend a type of sharp uniform convergence guarantee proven for the square loss in Zhou et al. (2021) to any square-root Lipschitz loss. Uniform convergence with square-root Lipschitz loss is an important tool because the appropriate loss to study interpolation learning is usually square-root Lipschitz instead of Lipschitz. Compared to the prior work of Zhou et al. 2022, our view significantly simplify the assumptions to establish optimistic rate. Since we don t need to explicitly compute the Moreau envelope for each application, our framework easily leads to many novel benign overfitting results, including low-rank matrix sensing. In the applications to phase retrieval and Re LU, we identify the appropriate loss function f and our norm calculation overcomes the challenge that f is non-convex and so CGMT cannot be directly applied. Furthermore, we explore new extensions of the uniform convergence technique to study single-index neural networks and suggest a promising research direction to understand the generalization of neural networks. Finally, we argue that Gaussian universality cannot always be taken for granted by analyzing a model where only the weighted square loss enjoys an optimistic rate. Our results highlight the importance of tail concentration and shed new lights on the necessary conditions for universality. An important future direction is to extend optimistic rate beyond Gaussian data, possibly through worst-case Rademacher complexity. Understanding the performance of more practical algorithms in phase retrieval and deriving the necessary and sufficient conditions for benign overfitting in matrix sensing are also interesting problems. Acknowledgements. F.K. was supported in part by NSF award CCF-1704417, NSF award IIS1908774, and N. Anari s Sloan Research Fellowship. Andoni, Alexandr, Daniel Hsu, Kevin Shi, and Xiaorui Sun (2017). Correspondence retrieval. Conference on Learning Theory. PMLR, pp. 105 126. Anthony, Martin, Peter L. Bartlett, et al. (1999). Neural network learning: Theoretical foundations. Vol. 9. cambridge university press Cambridge. Auer, Peter, Mark Herbster, and Manfred KK Warmuth (1995). Exponentially many local minima for single neurons. Advances in neural information processing systems 8. Bandeira, Afonso S. (2016). Ten Lectures and Forty-Two Open Problems in the Mathematics of Data Science. Lecture notes, MIT. URL: https://people.math.ethz.ch/~abandeira/ Ten Lectures Forty Two Problems.pdf. Barbier, Jean, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová (2019). Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences 116.12, pp. 5451 5460. Bartlett, Peter L. (1996). For valid generalization the size of the weights is more important than the size of the network. Advances in neural information processing systems 9. Bartlett, Peter L., Nick Harvey, Christopher Liaw, and Abbas Mehrabian (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research 20.1, pp. 2285 2301. Bartlett, Peter L., Philip M. Long, Gábor Lugosi, and Alexander Tsigler (2020). 
Benign overfitting in linear regression. Proceedings of the National Academy of Sciences 117.48, pp. 30063 30070. Bauschke, Heinz H, Patrick L Combettes, et al. (2011). Convex analysis and monotone operator theory in Hilbert spaces. Vol. 408. Springer. Bietti, Alberto, Joan Bruna, Clayton Sanford, and Min Jae Song (2022). Learning single-index models with shallow neural networks. ar Xiv preprint ar Xiv:2210.15651. Blumer, Anselm, Andrzej Ehrenfeucht, David Henry Haussler, and Manfred Klaus Warmuth (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (4), pp. 929 965. Boyd, Stephen, Stephen P Boyd, and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press. Candes, Emmanuel J, Xiaodong Li, and Mahdi Soltanolkotabi (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory 61.4, pp. 1985 2007. Candès, Emmanuel J and Benjamin Recht (2009). Exact matrix completion via convex optimization. Foundations of Computational mathematics 9.6, pp. 717 772. Diakonikolas, Ilias, Surbhi Goel, Sushrut Karmalkar, Adam R Klivans, and Mahdi Soltanolkotabi (2020). Approximation schemes for relu regression. Conference on learning theory. PMLR, pp. 1452 1485. Donhauser, Konstantin, Nicolo Ruggeri, Stefan Stojanovic, and Fanny Yang (2022). Fast rates for noisy interpolation require rethinking the effects of inductive bias. ar Xiv: 2203.03597. Foygel, Rina and Nathan Srebro (2011). Concentration-based guarantees for low-rank matrix reconstruction. Proceedings of the 24th Annual Conference on Learning Theory. JMLR Workshop and Conference Proceedings, pp. 315 340. Goel, Surbhi, Varun Kanade, Adam Klivans, and Justin Thaler (2017). Reliably learning the relu in polynomial time. Conference on Learning Theory. PMLR, pp. 1004 1042. Goel, Surbhi, Sushrut Karmalkar, and Adam Klivans (2019). Time/accuracy tradeoffs for learning a relu with respect to gaussian marginals. Advances in neural information processing systems 32. Goldt, Sebastian, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc Mezard, and Lenka Zdeborova (2020). The gaussian equivalence of generative models for learning with shallow neural networks. ar Xiv:2006.14709. Gordon, Yehoram (1985). Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics 50.4, pp. 265 289. Gordon, Yehoram (1988). On Milman s inequality and random subspaces which escape through a mesh in Rn. Geometric Aspects of Functional Analysis. Vol. 1317. Lecture Notes in Mathematics. Springer, pp. 84 106. Gunasekar, Suriya, Jason Lee, Daniel Soudry, and Nathan Srebro (2018). Characterizing implicit bias in terms of optimization geometry. International Conference on Machine Learning. PMLR, pp. 1832 1841. Gunasekar, Suriya, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro (2017). Implicit Regularization in Matrix Factorization. Advances in Neural Information Processing Systems. Han, Qiyang and Yandi Shen (2022). Universality of regularized regression estimators in high dimensions. ar Xiv: 2206.07936. Hu, Hong and Yue M. Lu (2023). Universality Laws for High-Dimensional Learning With Random Features. IEEE Transactions on Information Theory 69.3, pp. 1932 1964. Ji, Ziwei and Matus Telgarsky (2019). The implicit bias of gradient descent on nonseparable data. Conference on Learning Theory, pp. 1772 1798. Koehler, Frederic, Lijia Zhou, Danica J. Sutherland, and Nathan Srebro (2021). 
Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting. Advances in Neural Information Processing Systems. ar Xiv: 2106.09276. Li, Xiaodong and Vladislav Voroninski (2013). Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis 45.5, pp. 3019 3033. Luo, Wangyu, Wael Alghamdi, and Yue M Lu (2019). Optimal spectral initialization for signal recovery with applications to phase retrieval. IEEE Transactions on Signal Processing 67.9, pp. 2347 2356. Maillard, Antoine, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová (2020). Phase retrieval in high dimensions: Statistical and computational phase transitions. Advances in Neural Information Processing Systems 33, pp. 11071 11082. Mei, Song and Andrea Montanari (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics 75.4, pp. 667 766. Mendelson, Shahar (2017). Extending the scope of the small-ball method. ar Xiv: 1709.00843. Misiakiewicz, Theodor (2022). Spectrum of inner-product kernel matrices in the polynomial regime and multiple descent phenomenon in kernel ridge regression. ar Xiv:2204.10425. Mondelli, Marco and Andrea Montanari (2018). Fundamental limits of weak recovery with applications to phase retrieval. Conference On Learning Theory. PMLR, pp. 1445 1450. Muthukumar, Vidya, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai (2021). Classification vs regression in overparameterized regimes: Does the loss function matter? Journal of Machine Learning Research 22, pp. 1 69. ar Xiv: 2005.08054. Muthukumar, Vidya, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai (2020). Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory. ar Xiv: 1903.09139. Nagarajan, Vaishnavh and J. Zico Kolter (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems. ar Xiv: 1902.04742. Negrea, Jeffrey, Gintare Karolina Dziugaite, and Daniel M. Roy (2020). In Defense of Uniform Convergence: Generalization via derandomization with an application to interpolating predictors. International Conference on Machine Learning. ar Xiv: 1912.04265. Netrapalli, Praneeth, Prateek Jain, and Sujay Sanghavi (2013). Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems 26. Panchenko, Dmitry (2002). Some Extensions of an Inequality of Vapnik and Chervonenkis. Electronic Communications in Probability 7, pp. 55 65. ar Xiv: 0405342. Panchenko, Dmitry (2003). Symmetrization approach to concentration inequalities for empirical processes. The Annals of Probability 31.4, pp. 2068 2081. Recht, Benjamin, Maryam Fazel, and Pablo A Parrilo (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review 52.3, pp. 471 501. Rockafellar, R Tyrrell (1970). Convex analysis. Vol. 18. Princeton university press. Shalev-Shwartz, Shai and Shai Ben-David (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press. Shamir, Ohad (2022). The Implicit Bias of Benign Overfitting. ar Xiv: 2201.11489. Song, Min Jae, Ilias Zadik, and Joan Bruna (2021). On the cryptographic hardness of learning single periodic neurons. Advances in neural information processing systems 34, pp. 29602 29615. 
Soudry, Daniel, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19, pp. 2822 2878. Srebro, Nathan and Adi Shraibman (2005). Rank, trace-norm and max-norm. International Conference on Computational Learning Theory, pp. 545 560. Srebro, Nathan, Karthik Sridharan, and Ambuj Tewari (2010). Optimistic Rates for Learning with a Smooth Loss. ar Xiv: 1009.3896. Stojnic, Mihailo (2013). A framework to characterize performance of LASSO algorithms. ar Xiv: 1303.7291. Sun, Ju, Qing Qu, and John Wright (2018). A geometric analysis of phase retrieval. Foundations of Computational Mathematics 18, pp. 1131 1198. Thrampoulidis, Christos, Samet Oymak, and Babak Hassibi (2014). The Gaussian min-max theorem in the presence of convexity. ar Xiv: 1408.4837. Thrampoulidis, Christos, Samet Oymak, and Babak Hassibi (2015). Regularized linear regression: A precise analysis of the estimation error. Conference on Learning Theory. Vapnik, Vladimir (1982). Estimation of dependences based on empirical data. Springer Science & Business Media. Vershynin, Roman (2010). Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027. Vershynin, Roman (2018). High-dimensional probability: An introduction with applications in data science. Vol. 47. Cambridge University Press. Wang, Guillaume, Konstantin Donhauser, and Fanny Yang (2021). Tight bounds for minimum l1-norm interpolation of noisy data. ar Xiv: 2111.05987. Zhou, Lijia, Frederic Koehler, Pragya Sur, Danica J. Sutherland, and Nathan Srebro (2022). A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models. Advances in Neural Information Processing Systems. ar Xiv: 2210.12082. Zhou, Lijia, Frederic Koehler, Danica J. Sutherland, and Nathan Srebro (2021). Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression. ar Xiv: 2112.04470. Zhou, Lijia, Danica J. Sutherland, and Nathan Srebro (2020). On Uniform Convergence and Low Norm Interpolation Learning. Advances in Neural Information Processing Systems. ar Xiv: 2006. 05942. A Organization of the Appendices In the Appendix, we give proofs of all results from the main text. In Appendix B, we study properties of square-root-Lipschitz functions and introduce some technical tools that we use throughout the appendix. In Appendix C, we prove our main uniform convergence guarantee (Theorem 1 and the more general version Theorem 6). In Appendix D, we obtain bounds on the minimal norm required to interpolate in the settings studied in section 5. In Appendix E, we provide details on the counterexample to Gaussian universality described in section 7. B Preliminaries B.1 Properties of Square-root Lipschitz Loss In this section, we prove that square-root Lipschitzness can be equivalently characterized by a relationship between a function and its Moreau envelope, which can be used to establish uniform convergence results based on the recent work of Zhou et al. 2022. We formally define Lipschitz functions and Moreau envelope below. Definition 1. A function f : R R is M-Lipschitz if for all x, y in R, |f(x) f(y)| M|x y|. (33) Definition 2. The Moreau envelope of a function f : R R associated with smoothing parameter λ R+ is defined as fλ(x) := inf y R f(y) + λ(y x)2. 
(34) Though we define Lipschitz functions and Moreau envelope for univariate functions from R to R above, we can easily extend definitions 1 and 2 to loss functions f : R Y R or f : R Y Θ R. We say a function f : R Y R is M-Lipschitz if for any y Y and ˆy1, ˆy2 R, we have |f(ˆy1, y) f(ˆy2, y)| M|ˆy1 ˆy2|. Similarly, we say a function f : R Y Θ R is M-Lipschitz if for any y Y, θ Θ and ˆy1, ˆy2 R, we have |f(ˆy1, y, θ) f(ˆy2, y, θ)| M|ˆy1 ˆy2|. We can also define the Moreau envelope of a function f : R Y R by fλ(ˆy, y) := inf u R f(u, y) + λ(u ˆy)2, and the Moreau envelope of a function f : R Y Θ R is defined as fλ(ˆy, y, θ) := inf u R f(u, y, θ) + λ(u ˆy)2. The proof of all results in this section can be straightforwardly extended to these settings. For simplicity, we ignore the additional arguments in Y and Θ in this section. The Moreau envelope is usually viewed as a smooth approximation to the original function f; its minimizer is known as the proximal operator. It plays an important role in convex analysis (see e.g. Boyd et al. 2004; Bauschke, Combettes, et al. 2011; Rockafellar 1970), but is also useful and well-defined when f is nonconvex. The canonical example of a H-square-root-Lipschitz function is f(x) = Hx2, for which we can easily check fλ(x) = λ λ + H f(x). In proposition 1 below, we show that the condition fλ λ λ+H f is exactly equivalent to H-squareroot-Lipschitzness. Proposition 1. A function f : R R is non-negative and H-square-root-Lipschitz if and only if for any x R and λ 0, it holds that fλ(x) λ λ + H f(x). (35) Proof. Suppose that equation (35) holds, then by taking λ = 0 and the definition in equation (2), we see that f must be non-negative. For an non-negative function f, we observe for any x R, it holds that λ 0, fλ(x) λ λ + H f(x) λ > 0, fλ(x) λ λ + H f(x) since fλ 0 inf λ>0 λ + H λ fλ(x) f(x) inf λ>0 λ + H λ inf y R f(y) + λ(y x)2 f(x) by equation (2) inf y R inf λ>0 f(y) + (λ + H)(y x)2 f(x) inf y R f(y) + H(y x)2 + 2 p f(y)H(y x)2 f(x) by λ = Hf(y) (y x)2 H|y x|)2 f(x) f(y) since f 0. Therefore, f must be H-square-root-Lipschitz as well. Conversely, if f is non-negative and H-square-root-Lipschitz, then the above implies that (2) must hold and we are done. Interestingly, there is a similar equivalent characterization for Lipschitz functions as well. Proposition 2. A function f : R R is M-Lipschitz if and only if for any x R and λ > 0, it holds that fλ(x) f(x) M 2 Proof. Observe that for any x R, it holds that λ > 0, fλ(x) f(x) M 2 inf λ>0 fλ(x) + M 2 inf λ>0,y R f(y) + λ(y x)2 + M 2 4λ f(x) by equation (2) inf y R f(y) + M|y x| f(x) by λ = M 2|y x| y R, M|y x| f(x) f(y) and we are done. Finally, we show that any smooth loss is square-root-Lipschitz. Therefore, the class of square-root Lipschitz losses is more general than the class of smooth losses studied in Srebro et al. 2010. Definition 3. A twice differentiable1 function f : R R is H-smooth if for all x in R The following result is similar to to Lemma 2.1 in Srebro et al. 2010: Proposition 3. Let f : R R be a H-smooth and non-negative function. Then for any x R, it holds that |f (x)| p Therefore, f is p H/2-Lipschitz. 1The definition of smoothness can be stated without twice differentiability, by instead requiring the gradient to be Lipschitz. We make this assumption here simply for convenience. Proof. Since f is H-smooth and non-negative, by Taylor s theorem, for any x, y R, we have = f(x) + f (x)(y x) + f (a) f(x) + f (x)(y x) + H where a [min(x, y), max(x, y)]. 
Setting y = x f (x) H yields the desired bound. To show that f is Lipschitz, we observe that for any x R d dx and so we apply Taylor s theorem again to show that which is the desired definition. B.2 Properties of Gaussian Distribution We will make use of the following results without proof. Gaussian Minimax Theorem. Our proof of Theorem 1 and 6 will closely follow prior works that apply Gaussian Minimax Theorem (GMT) to uniform convergence (Koehler et al. 2021; Zhou et al. 2021; Zhou et al. 2022; Wang et al. 2021; Donhauser et al. 2022). The following result is Theorem 3 of Thrampoulidis et al. 2015 (see also Theorem 1 in the same reference). As explained there, it is a consequence of the main result of Gordon (1985), known as Gordon s Theorem. Theorem 7 (Thrampoulidis et al. 2015; Gordon 1985). Let Z : n d be a matrix with i.i.d. N(0, 1) entries and suppose G N(0, In) and H N(0, Id) are independent of Z and each other. Let Sw, Su be compact sets and ψ : Sw Su R be an arbitrary continuous function. Define the Primary Optimization (PO) problem Φ(Z) := min w Sw max u Su u, Zw + ψ(w, u) (37) and the Auxiliary Optimization (AO) problem ϕ(G, H) := min w Sw max u Su w 2 G, u + u 2 H, w + ψ(w, u). (38) Under these assumptions, Pr(Φ(Z) < c) 2 Pr(ϕ(G, H) c) for any c R. Furthermore, if we suppose that Sw, Su are convex sets and ψ(w, u) is convex in w and concave in u, then Pr(Φ(Z) > c) 2 Pr(ϕ(G, H) c). GMT is an extremely useful tool because it allows us to convert a problem involving a random matrix into a problem involving only two random vectors. In our analysis, we will make use of a slightly more general version of Theorem 7, introduced by Koehler et al. (2021), to include additional variables which only affect the deterministic term in the minmax problem. Theorem 8 (Variant of GMT). Let Z : n d be a matrix with i.i.d. N(0, 1) entries and suppose G N(0, In) and H N(0, Id) are independent of Z and each other. Let SW , SU be compact sets in Rd Rd and Rn Rn respectively, and let ψ : SW SU R be an arbitrary continuous function. Define the Primary Optimization (PO) problem Φ(Z) := min (w,w ) SW max (u,u ) SU u, Zw + ψ((w, w ), (u, u )) (39) and the Auxiliary Optimization (AO) problem ϕ(G, H) := min (w,w ) SW max (u,u ) SU w 2 G, u + u 2 H, w + ψ((w, w ), (u, u )). (40) Under these assumptions, Pr(Φ(Z) < c) 2 Pr(ϕ(G, H) c) for any c R. Theorem 8 requires SW and SU to be compact. However, we can usually get around the compactness requirement by a truncation argument. Lemma 1 (Zhou et al. 2022, Lemma 6). Let f : Rd R be an arbitrary function and Sd r = {x Rd : x 2 r}, then for any set K, it holds that lim r sup w K Sd r f(w) = sup w K f(w). (41) If f is a random function, then for any t R Pr sup w K f(w) > t = lim r Pr sup w K Sd r f(w) > t Lemma 2 (Zhou et al. 2022, Lemma 7). Let K be a compact set and f, g be continuous real-valued functions on Rd. Then it holds that lim r sup w K inf 0 λ r λf(w) + g(w) = sup w K:f(w) 0 g(w). (43) If f and g are random functions, then for any t R sup w K:f(w) 0 g(w) t = lim r Pr sup w K inf 0 λ r λf(w) + g(w) t . (44) Concentration inequalities. Let σmin(A) denote the minimum singular value of an arbitrary matrix A, and σmax the maximum singular value. We use A op = σmax(A) to denote the operator norm of matrix A. The following concentration results for Gaussian vector and matrix are standard. Lemma 3 (Special case of Theorem 3.1.1 of Vershynin 2018). Suppose that Z N(0, In). Then Pr( Z 2 n t) 4e t2/4. (45) Lemma 4 (Koehler et al. 2021, Lemma 10). 
For any covariance matrix Σ and H N(0, Id), with probability at least 1 δ, it holds that 1 Σ1/2H 2 2 Tr(Σ) log(4/δ) p and ΣH 2 2 log(4/δ) Tr(Σ2). (47) Therefore, provided that R(Σ) log(4/δ)2, it holds that ΣH 2 2 log(4/δ)Tr(Σ2) Tr(Σ) . (48) Theorem 9 (Vershynin 2010, Corollary 5.35). Let n, N N. Let A RN n be a random matrix with entries i.i.d. N(0, 1). Then for any t > 0, it holds with probability at least 1 2 exp( t2/2) that N n t σmin(A) σmax(A) N + n + t. (49) Conditional Distribution of Gaussian. To handle arbitrary multi-index conditional distributions of y given by assumption (B), we will apply a conditioning argument. After conditioning on W T x and ξ, the response y is no longer random. Importantly, the conditional distribution of x remains Gaussian (though with a different mean and covariance) and so we can still apply GMT. In the lemma below, Z Rn d is a random matrix with i.i.d. N(0, 1) entries and X = ZΣ1/2. Lemma 5 (Zhou et al. 2022, Lemma 4). Fix any integer k < d and any k vectors w 1, ..., w k in Rd such that Σ1/2w 1, ..., Σ1/2w k are orthonormal. Denoting i=1 (Σ1/2w i )(Σ1/2w i )T , (50) the distribution of X conditional on Xw 1 = η1, ..., Xw k = ηk is the same as that of i=1 ηi(Σw i )T + ZPΣ1/2. (51) B.3 Vapnik-Chervonenkis (VC) theory By the conditioning step mentioned above, we will separate x into a low-dimensional component W T x and the independent component QT x. Concentration results for the low-dimensional component can be easily established using VC theory. As mentioned in Zhou et al. 2022, low-dimensional concentration can be established using alternative results (e.g., Vapnik 1982; Panchenko 2002; Panchenko 2003; Mendelson 2017). Recall the following definition of VC-dimension from Shalev-Shwartz and Ben-David (2014). Definition 4. Let H be a class of functions from X to {0, 1} and let C = {c1, ..., cm} X. The restriction of H to C is HC = {(h(c1), ..., h(cm)) : h H}. A hypothesis class H shatters a finite set C X if |HC| = 2|C|. The VC-dimension of H is the maximal size of a set that can be shattered by H. If H can shatter sets of arbitrary large size, we say H has infinite VC-dimension. Also, we have the following well-known result for the class of nonhomogenous halfspaces in Rd (Theorem 9.3 of Shalev-Shwartz and Ben-David (2014)), and the result on VC-dimension of the union of two hypothesis classes (Lemma 3.2.3 of Blumer et al. (1989)): Theorem 10. The class {x 7 sign( w, x + b) : w Rd, b R} has VC-dimension d + 1. Theorem 11. Let H a hypothesis classes of finite VC-dimension d 1. Let H2 := {max(h1, h2) : h1, h2 H} and H3 := {min(h1, h2) : h1, h2 H}. Then, both the VC-dimension of H2 and the VC-dimension of H3 are O(d). By combining Theorem 10 and 11, we can easily verify the VC assumption in Corollary 1 for the phase retrieval loss f(ˆy, y) = (|ˆy| y)2. Similar results can be proven for Re LU regression. To verify the VC assumption for single-index neural nets in Corollary 2, we can use the following result (equation 2 of Bartlett et al. (2019)): Theorem 12. The VC-dimension of a neural network with piecewise linear activation function, W parameters, and L layers has VC-dimension O(WL log W). We can easily establish low-dimensional concentration due to the following result: Theorem 13 (Vapnik 1982, Special case of Assertion 4 in Chapter 7.8; see also Theorem 7.6). 
Suppose that the loss function l : Z Θ R 0 satisfies (i) for every θ Θ, the function l( , θ) is measurable with respect to the first argument (ii) the class of functions {z 7 1{l(z, θ) > t} : (θ, t) Θ R} has VC-dimension at most h and the distribution D over Z satisfies for every θ Θ Ez D[l(z, θ)4]1/4 Ez D[l(z, θ)] τ, (52) then for any n > h, with probability at least 1 δ over the choice of (z1, . . . , zn) Dn, it holds uniformly over all θ Θ that i=1 l(zi, θ) h(log(2n/h) + 1) + log(12/δ) Ez D[l(z, θ)]. (53) C Proof of Theorem 6 It is clear that Theorem 1 is a special case of Theorem 6. Therefore, we will prove the more general result here. Notation. Following the tradition in statistics, we denote X = (x1, ..., xn)T Rn d as the design matrix. In the proof section, we slightly abuse the notation of ηi to mean Xw i and ξ to mean the n-dimensional random vector whose i-th component satisfies yi = g(η1,i, ..., ηk,i, ξi). We will write X = ZΣ1/2 where Z is a random matrix with i.i.d. standard normal entries if µ = 0. Throughout this section, we can first assume µ = 0 in Assumption (A) without loss of generality because if we define f : R Y Θ R by f(ˆy, y, θ) := f(ˆy + w(θ), µ , y, θ), (54) then by definition, it holds that f( w(θ), x , y, θ) = f( w(θ), x µ , y, θ) and so we can apply the theory on f first and then translate to the problem on f. Similarly, we can also assume Σ1/2w 1, ..., Σ1/2w k are orthonormal without loss of generality. This is because we can denote W Rd k by W = [w 1, ..., w k] and let W = W(W T ΣW) 1/2. By definition, it holds that W T Σ W = I and so the columns of W = [ w 1, ..., w k] satisfy Σ1/2 w 1, ..., Σ1/2 w k are orthonormal. If we define g : Rk+1 R by g(η1, ..., ηk, ξ) = g([η1, ..., ηk](W T ΣW)1/2 + µT W, ξ), (55) then y = g(x T W, ξ) and so we can apply the theory on g. We will write the generalization problem as a Primary Optimization problem in Theorem 8. For generality, we will let F be any deterministic function and then choose it in the end. Lemma 6. Fix an arbitrary set Θ Rp and let F : Θ R be any deterministic and continuous function. Consider dataset (X, Y ) drawn i.i.d. from the data distribution D according to (A) and (B) with µ = 0 and orthonormal Σ1/2w 1, ..., Σ1/2w k. Then conditioned on Xw 1 = η1, ..., Xw k = ηk and ξ, if we define Φ := sup (w,u,θ) Rd Rn Θ w=P Σ1/2w(θ) inf λ Rn λ, Zw + ψ(u, θ, λ | η1, ..., ηk, ξ) (56) where P is defined in (50) and ψ is a deterministic and continuous function given by ψ(u, θ, λ | η1, ..., ηk, ξ) = F(θ) 1 i=1 f(ui, g(η1,i, ..., ηk,i, ξi), θ) i=1 ηi(Σw i )T ! then it holds that for any t R, we have Pr sup θ Θ F(θ) ˆL(θ) > t η1, ..., ηk, ξ = Pr(Φ > t). (58) Proof. By introducing a variable u = Xw(θ), we have sup θ Θ F(θ) ˆL(θ) = sup θ Θ F(θ) 1 i=1 f( w(θ), xi , yi, θ) = sup θ Θ,u Rn inf λ Rn λ, Xw(θ) u + F(θ) 1 i=1 f(ui, yi, θ). Conditioned on Xw 1 = η1, ..., Xw k = ηk and ξ, the above is only random in X by our multi-index model assumption on y. By Lemma 5, the above is equal in law to sup θ Θ,u Rn inf λ Rn λ, i=1 ηi(Σw i )T + ZPΣ1/2 ! w(θ) u + F(θ) 1 i=1 f(ui, yi, θ) = sup θ Θ,u Rn inf λ Rn λ, ZPΣ1/2 w(θ) + ψ(u, θ, λ | η1, ..., ηk, ξ) = sup (w,u,θ) Rd Rn Θ w=P Σ1/2w(θ) inf λ Rn λ, Zw + ψ(u, θ, λ | η1, ..., ηk, ξ) The function ψ is continuous because we require F, f and w to be continuous in the definitions. Next, we are ready to apply Gaussian Minimax Theorem. Although the domains in (56) are not compact, we can use the truncation lemmas 1 and 2 in Appendix B. Lemma 7. 
In the same setting as Lemma 6, define the auxiliary problem as Ψ := sup (u,θ) Rn Θ H,P Σ1/2w(θ) P Σ1/2w(θ) 2G+Pk i=1 w(θ),Σw i ηi u 2 i=1 f(ui, yi, θ) (59) then for any t R, it holds that Pr sup θ K F(θ) ˆL(θ) > t 2 Pr(Ψ t). (60) where the randomness in the second probability is taken over G, H, η1, ..., ηk and ξ. Proof. Denote Sr = {(w, u, θ) Rd Rn Θ : w = PΣ1/2w(θ) and w 2 + u 2 + θ 2 r}. The set Sr is bounded by definition and closed by the continuity of w. Hence, it is compact. Next, we denote the truncated problems: Φr := sup (w,u,θ) Sr inf λ Rn λ, Zw + ψ(u, θ, λ | η1, ..., ηk, ξ) (61) Φr,s := sup (w,u,θ) Sr inf λ 2 s λ, Zw + ψ(u, θ, λ | η1, ..., ηk, ξ). (62) By definition, we have Φr Φr,s and so Pr(Φr > t) Pr(Φr,s > t). The corresponding auxiliary problems are Ψr,s := sup (w,u,θ) Sr inf λ 2 s λ 2 H, w + w 2 G, λ + ψ(u, θ, λ | η1, ..., ηk, ξ) = sup (w,u,θ) Sr inf λ 2 s λ 2 H, w + λ, w 2G + i=1 ηi w(θ), Σw i u i=1 f(ui, g(η1,i, ..., ηk,i, ξi), θ) = sup (w,u,θ) Sr inf 0 λ s λ i=1 ηi w(θ), Σw i u i=1 f(ui, g(η1,i, ..., ηk,i, ξi), θ) and the limit of s : Ψr := sup (w,u,θ) Sr H,w w 2G+Pk i=1 ηi w(θ),Σw i u 2 i=1 f(ui, g(η1,i, ..., ηk,i, ξi), θ) By definition, it holds that Ψr Ψ and so Pr(Ψr t) Pr(Ψ t). Thus, it holds that Pr(Φ > t) = lim r Pr(Φr > t) by Lemma 1 lim r lim s Pr(Φr,s > t) 2 lim r lim s Pr(Ψr,s t) by Theorem 8 = 2 lim r Pr(Ψr t) by Lemma 2 2 Pr(Ψ t). The proof concludes by applying Lemma 6 and the tower law. The following two simple lemmas will be useful to analyze the auxiliary problem. Lemma 8. For a, b, H > 0, we have sup λ 0 λa + λ H + λb = ( Proof. Observe that sup λ 0 λa + λ H + λb = b inf λ 0 λa + H H + λb. Define f(λ) = λa + H H+λb, then f (λ) = a Hb (H + λ)2 0 (H + λ)2 Hb Since we require λ 0, we only need to consider whether q a H 0 b Ha. If b < Ha, the infimum is attained at λ = 0. Otherwise, the infimum is attained at λ = q a H, at which point f(λ ) = 2 Plugging in, we see that the expression is equivalent to ( Ha)2 + in both cases. Lemma 9. For a, b 0, we have sup λ 0 λa b Proof. Define f(λ) = λa b f (λ) = a + b and so in the domain λ 0, the optimum is attained at λ = p b/a at which point f(λ ) = 2 We are now ready to analyze the auxiliary problem. Lemma 10. In the same setting as in Lemma 6, assume that for every δ > 0 (A) Cδ : Rd [0, ] is a continuous function such that with probability at least 1 δ/4 over H N(0, Id), uniformly over all w Rd, we have that Σ1/2PH, w Cδ(w) (63) (B) ϵδ is a positive real number such that with probability at least 1 δ/4 over {( xi, yi)}n i=1 drawn i.i.d. from D, it holds uniformly over all θ Θ that i=1 f( ϕ(w(θ)), xi , yi, θ) 1 1 + ϵδ E( x, y) D[f( ϕ(w(θ)), x , y, θ)]. (64) where the distribution D over ( x, y) is given by x N(0, Ik+1), ξ Dξ, y = g( x1, ..., xk, ξ) and the mapping ϕ : Rd Rk+1 is defined as ϕ(w) = ( w, Σw 1 , ..., w, Σw k , PΣ1/2w 2)T . Then the following is true: (i) suppose for some choice of Mθ that is continuous in θ, it holds for every y Y and θ Θ, f is Mθ-Lipschitz with respect to the first argument, then with probability at least 1 δ, uniformly over all θ Θ, we have L(θ) (1 + ϵδ) (ii) suppose for some choice of Hθ that is continuous in θ, it holds for every y Y and θ Θ, f is non-negative and f is Hθ-Lipschitz with respect to the first argument, then with probability at least 1 δ, uniformly over all θ Θ, we have L(θ) (1 + ϵδ) HθCδ(w(θ))2 Proof. First, let s simplify the auxiliary problem (59). 
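As a brief aside before carrying out this simplification: the scalar optimizations of Lemmas 8 and 9 are what produce, respectively, the square-root-Lipschitz and Lipschitz bounds of Lemma 10, and their closed forms, $\sup_{\lambda \ge 0} \{-\lambda a + \tfrac{\lambda}{H+\lambda} b\} = ((\sqrt{b} - \sqrt{Ha})_+)^2$ and $\sup_{\lambda \ge 0} \{-\lambda a - b/\lambda\} = -2\sqrt{ab}$, can be checked numerically. The snippet below is a minimal sanity check (plain grid search in NumPy); the grids, function names, and random test values are our own choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
lam8 = np.concatenate(([0.0], np.geomspace(1e-6, 1e6, 400_000)))  # grid for lambda >= 0
lam9 = np.geomspace(1e-8, 1e8, 400_000)                           # grid for lambda > 0

def lemma8_numeric(a, b, H):
    # sup_{lambda >= 0} [ -lambda*a + lambda/(H + lambda) * b ] by grid search
    return np.max(-lam8 * a + lam8 / (H + lam8) * b)

def lemma8_closed(a, b, H):
    # closed form stated in Lemma 8: ((sqrt(b) - sqrt(H*a))_+)^2
    return max(np.sqrt(b) - np.sqrt(H * a), 0.0) ** 2

def lemma9_numeric(a, b):
    # sup_{lambda > 0} [ -lambda*a - b/lambda ] by grid search
    return np.max(-lam9 * a - b / lam9)

def lemma9_closed(a, b):
    # closed form stated in Lemma 9: -2*sqrt(a*b)
    return -2.0 * np.sqrt(a * b)

for _ in range(5):
    a, b, H = rng.uniform(0.1, 5.0, size=3)
    assert abs(lemma8_numeric(a, b, H) - lemma8_closed(a, b, H)) < 1e-3
    assert abs(lemma9_numeric(a, b) - lemma9_closed(a, b)) < 1e-3
print("Lemma 8 and Lemma 9 closed forms match grid search")
```

We now return to the simplification of (59).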
Changing variables to subtract the quantity Gi PΣ1/2w(θ) 2 + Pk l=1 w(θ), Σw l ηl,i from each of the former ui, we have that Ψ = sup (u,θ) Rn Θ u 2 H,P Σ1/2w(θ) ui + Gi PΣ1/2w(θ) 2 + l=1 w(θ), Σw l ηl,i, yi, θ and separating the optimization problem in u and θ, we obtain Ψ = sup θ Θ F(θ) n inf u Rn: u 2 H,P Σ1/2w(θ) ui + Gi PΣ1/2w(θ) 2 + l=1 w(θ), Σw l ηl,i, yi, θ Next, we will lower bound the infimum term by weak duality to obtain upper bound on Ψ: inf u Rn: u 2 H,P Σ1/2w(θ) ui + Gi PΣ1/2w(θ) 2 + l=1 w(θ), Σw l ηl,i, yi, θ = inf u Rn sup λ 0 λ( u 2 2 Σ1/2PH, w(θ) 2) ui + Gi PΣ1/2w(θ) 2 + l=1 w(θ), Σw l ηl,i, yi, θ sup λ 0 λ Σ1/2PH, w(θ) 2 ui + Gi PΣ1/2w(θ) 2 + l=1 w(θ), Σw l ηl,i, yi, θ = sup λ 0 λ Σ1/2PH, w(θ) 2 i=1 inf ui R f ui + Gi PΣ1/2w(θ) 2 + l=1 w(θ), Σw l ηl,i, yi, θ = sup λ 0 λ Σ1/2PH, w(θ) 2 + Gi PΣ1/2w(θ) 2 + l=1 w(θ), Σw l ηl,i, yi, θ Suppose that for every y Y and θ Θ, f is Mθ-Lipschitz with respect to the first argument, then by Proposition 2, the above can be further lower bounded by the following quantity: sup λ 0 λ Σ1/2PH, w(θ) 2 n M 2 θ 4λ + l=1 w(θ), Σw l ηl,i + PΣ1/2w(θ) 2 Gi, yi, θ On the other hand, suppose that for every y Y and θ Θ, f is non-negative and f is HθLipschitz with respect to the first argument, then by Proposition 1, the above can be further lower bounded by: sup λ 0 λ Σ1/2PH, w(θ) 2 + λ Hθ + λ l=1 w(θ), Σw l ηl,i + PΣ1/2w(θ) 2 Gi, yi, θ Notice that if we write xi = (η1,i, ..., ηk,i, Gi), then ( xi, yi) are independent with distribution exactly equal to D. Moreover, we have l=1 w(θ), Σw l ηl,i + PΣ1/2w(θ) 2 Gi, yi, θ = f( ϕ(w(θ)), xi , yi, θ) and it is easy to see that the joint distribution of ( ϕ(w(θ)), x , y) with ( x, y) D is exactly the same as ( w(θ), x , y) with (x, y) D. As a result, we have that E( x,y) D[f( ϕ(w(θ)), x , y, θ)] = L(θ). By our assumption (63), (64) and a union bound, we have with probability at least 1 δ/2 | Σ1/2PH, w(θ) | Cδ(w(θ)) l=1 w(θ), Σw l ηl,i + PΣ1/2w(θ) 2 Gi, yi, θ 1 1 + ϵδ L(θ). Therefore, if f is Mθ-Lipschitz, then by by Lemma 9, we have Ψ sup θ Θ F(θ) sup λ 0 λCδ(w(θ))2 n M 2 θ 4λ + 1 1 + ϵδ L(θ) = sup θ Θ F(θ) + M 2 θ Cδ(w(θ))2 n 1 1 + ϵδ L(θ) Consequently, by taking F(θ) = 1 1+ϵδ L(θ) Mθ q n and Lemma 7, we have shown that with probability at least 1 δ, we have sup θ K F(θ) ˆL(θ) 0 = 1 1 + ϵδ L(θ) ˆL(θ) + Mθ If f is Hθ-Lipschitz, then by Lemma 8 Ψ sup θ K F(θ) sup λ 0 λCδ(w(θ))2 n + λ Hθ + λ 1 1 + ϵδ L(θ) = sup θ K F(θ) L(θ) 1 + ϵδ HθCδ(w(θ))2 Consequently, by taking F(θ) = q L(θ) 1+ϵδ q HθCδ(w(θ))2 + and Lemma 7, we have shown that with probability at least 1 δ, we have sup θ K F(θ) ˆL(θ) 0. Rearranging, either we have s L(θ) 1 + ϵδ HθCδ(w(θ))2 n < 0 = L(θ) < (1 + ϵδ)HθCδ(w(θ))2 or we have s L(θ) 1 + ϵδ HθCδ(w(θ))2 L(θ) 1 + ϵδ HθCδ(w(θ))2 = L(θ) (1 + ϵδ) HθCδ(w(θ))2 In either case, the desired bound holds. Finally, we are ready to prove Theorem 6. In the version below, we also provide uniform convergence guarantee (with sharp constant) for Lipschitz loss. Theorem 14. Suppose that assumptions (A), (B), (E) and (F) hold. For any δ (0, 1), let Cδ : Rd [0, ] be a continuous function such that with probability at least 1 δ/4 over x N (0, Σ), uniformly over all θ Θ, w(θ), QT x Cδ(w(θ)). 
(67) Then it holds that (i) if for each θ Θ and y Y, f is Mθ-Lipschitz with respect to the first argument and Mθ is continuous in θ, then with probability at least 1 δ, it holds that uniformly over all θ Θ, we have (1 ϵ) L(θ) ˆL(θ) + Mθ (ii) if for each θ Θ and y Y, f is non-negative and f is Hθ-Lipschitz with respect to the first argument, and Hθ is continuous in θ, then with probability at least 1 δ, it holds that uniformly over all θ Θ, we have Hθ Cδ(w(θ))2 where ϵ = O τ q h log(n/h)+log(1/δ) Proof. We apply the reduction argument at the beginning of the appendix. Given D that satisfies assumptions (A) and (B), we define [ w 1, ..., w k] = W = W(W T ΣW) 1/2 and f, g as in (54) and (55). For {(xi, yi)}n i=1 sampled independently from D, we observe that the joint distribution of (xi µ, yi) can also be described by D as follows: (A ) x N(0, Σ) (B ) y = g(η1, ..., ηk, ξ) where ηi = x, wi . Indeed, we can check that y = g(x T W, ξ) = g((x µ)T W(W T ΣW)1/2 + µT W, ξ) = g((x µ)T W, ξ). Moreover, by construction, we have i=1 f( w(θ), xi µ , yi, θ) L(θ) = ED f( w(θ), xi , yi, θ) and D satisfies assumptions (A) and (B) with µ = 0 and orthonormal Σ1/2 w 1, ..., Σ1/2 w 1 and falls into the setting in Lemma 6. We see that f being Lipschitz or square-root Lipschitz is equivalent to f being Lipschitz or square-root Lipschitz. It remains to check assumptions (63) and (64) and then apply Lemma 10. Observe that Σ 1/2PΣ1/2 = Σ 1/2 Id Σ1/2 W W T Σ1/2 Σ1/2 = Id W W T Σ = I W(W T ΣW) 1W T Σ = Q and so Σ1/2P = QT Σ1/2. To check that (63) holds, observe that Σ1/2PH, w has the same distribution as Qw, x . To check that (64) holds, we will apply Theorem 13. Note that the joint distribution of ( ϕ(w(θ)), x , y) with ( x, y) D is exactly the same as ( w(θ), x , y) with (x, y) D and so E D[ f( ϕ(w(θ)), x , y, θ)4]1/4 E D[ f( ϕ(w(θ)), x , y, θ)] = ED [ f( w(θ), x , y, θ)4]1/4 ED [ f( w(θ), x , y, θ)] = ED[f( w(θ), x , y, θ)4]1/4 ED[f( w(θ), x , y, θ)] . Therefore, the assumption (E) is equivalent to the hypercontractivity condition in Theorem 13. Note that {(x, y) 7 1{ f( ϕ(w(θ)), x , y, θ) > t} : (θ, t) Θ R} is a subclass of {(x, y) 7 1{f( w, x + b, y, θ) > t} : (w, b, t, θ) Rk+1 R R Θ}. Therefore, by assumption (F), we can apply Theorem 13 and (64) holds. D Norm Bounds The following lemma is a version of Lemma 7 of Koehler et al. (2021) and follows straightforwardly from CGMT (Theorem 7), though it requires a slightly different truncation argument compared to the proof Theorem 6. For simplicity, we won t repeat the proof here and simply use it for our applications. Lemma 11 (Koehler et al. 2021, Lemma 7). Let Z : n d be a matrix with i.i.d. N(0, 1) entries and suppose G N(0, In) and H N(0, Id) are independent of Z and each other. Fix an arbitrary norm , any covariance matrix Σ, and any non-random vector ξ Rn, consider the Primary Optimization (PO) problem: Φ := min w Rd: ZΣ1/2w=ξ and the Auxiliary Optimization (AO) problem: Ψ := min w Rd: G Σ1/2w 2 ξ 2 Σ1/2H,w Then for any t R, it holds that Pr(Φ > t) 2 Pr(ϕ t). (73) The next lemma analyzes the AO in Lemma 11. Our proof closely follows Lemma 8 of Koehler et al. 2021, but we don t make assumptions on ξ yet to allow more applications. Lemma 12. Let Z : n d be a matrix with i.i.d. N(0, 1) entries. Fix any δ > 0, covariance matrix Σ and non-random vector ξ Rn, then there exists ϵ log(1/δ) 1 n + 1 R(Σ) + n R(Σ) with probability at least 1 δ, it holds that min w Rd: ZΣ1/2w=ξ w 2 2 (1 + ϵ) ξ 2 2 Tr(Σ). (74) Proof. 
By a union bound, there exists a constant C > 0 such that the following events occur together with probability at least 1 δ/2: 1. Since G, ξ N(0, ξ 2 2), by the standard Gaussian tail bound Pr(|Z| t) 2e t2/2, we have | G, ξ | ξ 2 p 2 log(32/δ) 2. Using subexponential Bernstein s inequality (Theorem 2.8.1 of Vershynin (2018)), requiring n = Ω(log(1/δ)), we have G 2 2 2n 3. Using the first part of Lemma 4, we have Σ1/2H 2 2 Tr(Σ) 1 C log(32/δ) p 4. Using the last part of Lemma 4, requiring R(Σ) log(32/δ)2 ΣH 2 2 Σ1/2H 2 2 C log(32/δ)Tr(Σ2) Therefore, by the AM-GM inequality, it holds that G Σ1/2w 2 ξ 2 2 = G 2 2 Σ1/2w 2 2 + ξ 2 2 2 G, ξ Σ1/2w 2 2n Σ1/2w 2 2 + ξ 2 2 + 2 ξ 2 p 2 log(32/δ) Σ1/2w 2 3n Σ1/2w 2 2 + 1 + 2 log(32/δ) To apply lemma 11, we will consider w of the form w = α Σ1/2H Σ1/2H 2 for some α > 0. Then we have G Σ1/2w 2 ξ 2 2 3n C log(32/δ)Tr(Σ2) Tr(Σ) α2 + 1 + 2 log(32/δ) Σ1/2H, w 2 = α2 Σ1/2H 2 2 α2 Tr(Σ) 1 C log(32/δ) p So it suffices to choose α such that 1 + 2 log(32/δ) Tr(Σ) 1 C log(32/δ) 3n C log(32/δ) Tr(Σ2) = 1 + 2 log(32/δ) 1 C log(32/δ) 1 R(Σ) + 3 n R(Σ) ξ 2 2 Tr(Σ) and we are done. A challenge for analyzing the minimal norm to interpolate is that the projection matrix Q is not necessarily an orthogonal projection. However, the following lemma suggests that if Σ = QT ΣQ has high effective rank, then we can let R be the orthogonal projection matrix onto the image of Q and RΣR is approximately the same as Σ in terms of the quantities that are relevant to the norm analysis. Lemma 13. Consider Q = I Pk i=1 w i (w i )T Σ where Σ1/2w 1, ..., Σ1/2w k are orthonormal and we let R be the orthogonal projection matrix onto the image of Q. Then it holds that rank(R) = d k and RΣw i = 0 for any i = 1, ..., k. Moreover, we have QR = R and RQ = Q, and so 1 Tr(RΣR) 1 k n n R(QT ΣQ) 1 1 Tr(QT ΣQ) n R(RΣR) 1 k n n R(QT ΣQ) 2 n R(QT ΣQ). Proof. It is obvious that rank(R) = rank(Q) and by the rank-nullity theorem, it suffices to show the nullity of Q is k. To this end, we observe that Qw = 0 Σ 1/2 i=1 (Σ1/2w i )(Σ1/2w i )T ! i=1 (Σ1/2w i )(Σ1/2w i )T ! Σ1/2w span{Σ1/2w 1, ..., Σ1/2w k} w span{w 1, ..., w k}. It is also straightforward to verify that Q2 = Q and QT Σw i = 0 for i = 1, ..., k. For any v Rd, Rv lies in the image of Q and so there exists w such that Rv = Qw. Then we can check that v T RΣw i = Rv, Σw i = Qw, Σw i = w, QT Σw i = 0 (QR)v = Q(Rv) = Q(Qw) = Q2w = Qw = Rv. Since the choice of v is arbitrary, it must be the case that RΣw i = 0 and QR = R. For any v Rd, we can check (RQ)v = R(Qv) = Qv by the definition of orthogonal projection. Therefore, it must be the case that RQ = Q. Finally, we use R = QR = RQT to show that Tr(RΣR) = Tr(RQT ΣQR) = Tr(QT ΣQR) = Tr(QT ΣQ) Tr(QT ΣQ(I R)) Tr(QT ΣQ) q Tr((QT ΣQ)2) Tr((I R)2) = Tr(QT ΣQ) = Tr(QT ΣQ) 1 k n n R(QT ΣQ) Tr((RΣR)2) = Tr(ΣRΣR) = Tr(ΣQRQT ΣQRQT ) = Tr((RQT ΣQ)R(QT ΣQR)) Tr((RQT ΣQ)(QT ΣQR)) = Tr((QT ΣQ)2R) Tr((QT ΣQ)2). Rearranging concludes the proof. D.1 Phase Retrieval Theorem 2. Under assumptions (A) and (B), let f : R Y R be given by f(ˆy, y) := (|ˆy| y)2 with Y = R 0. Let Q be the same as in Theorem 1 and Σ = QT ΣQ. Fix any w Rd such that Qw = 0 and for some ρ (0, 1), it holds that ˆLf(w ) (1 + ρ)Lf(w ). (9) Then with probability at least 1 δ, for some ϵ ρ + log 1 n + n R(Σ ) min w Rd: i [n], w,xi 2=y2 i w 2 w 2 + (1 + ϵ) Tr(Σ ) . (10) Proof. Without loss of generality, we assume that µ lies in the span of {Σw 1, ..., Σw k} because otherwise we can simply increase k by one. 
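Before continuing with the reduction, here is a small simulation of the interpolant construction analyzed in this proof: take $\hat{w} = w^* + \tilde{w}$, where $\tilde{w}$ is the minimal-norm solution of $X\tilde{w} = \xi$ and $\xi$ carries signs according to $\mathrm{sign}(\langle w^*, x_i\rangle)$, and compare $\|\tilde{w}\|_2^2$ with the Lemma 12 quantity $\|\xi\|_2^2/\mathrm{Tr}(\Sigma)$. All concrete choices below are illustrative and not taken from the paper: isotropic $\Sigma = I_d$, a single index direction $w^* = e_1$, a noise model $y = |\langle w^*, x\rangle + \sigma\epsilon|$, and small dimensions; the low-dimensional conditioning is ignored, so $\mathrm{Tr}(\Sigma') \approx d$. This is a sketch of the idea, not the formal argument.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200, 2000, 0.5                 # illustrative sizes with d >> n
w_star = np.zeros(d); w_star[0] = 1.0        # ||w*||_2 = 1, single index direction (illustrative)

X = rng.standard_normal((n, d))              # x_i ~ N(0, I_d), i.e. Sigma = I_d
s = X @ w_star
y = np.abs(s + sigma * rng.standard_normal(n))   # y = |<w*, x> + noise| >= 0 (illustrative g)

# xi_i = y_i - |<w*, x_i>| when <w*, x_i> >= 0, and |<w*, x_i>| - y_i otherwise,
# so that <w* + w_tilde, x_i> = +/- y_i once X w_tilde = xi.
xi = np.where(s >= 0, y - np.abs(s), np.abs(s) - y)

# minimal-norm solution of the underdetermined linear system X w_tilde = xi
w_tilde = np.linalg.lstsq(X, xi, rcond=None)[0]
w_hat = w_star + w_tilde

assert np.allclose(np.abs(X @ w_hat), y)     # w_hat interpolates the phase retrieval constraints
train_loss = np.mean((np.abs(s) - y) ** 2)   # empirical loss of w*, equal to ||xi||^2 / n

print("||w_hat||_2                      :", np.linalg.norm(w_hat))
print("||w*||_2 + sqrt(n L_hat / Tr(S)) :", 1.0 + np.sqrt(n * train_loss / d))
print("Lemma 12: ||w_tilde||^2 vs ||xi||^2 / Tr(Sigma):",
      np.linalg.norm(w_tilde) ** 2, np.sum(xi ** 2) / d)
```

We now return to the proof.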
Moreover, we can assume that {Σ1/2w 1, ..., Σ1/2w k} are orthonormal because otherwise we let W = W(W T ΣW) 1 and conditioning on W T (x µ) is the same as conditioning on W T (x µ). By Lemma 5, conditioned on ηT 1 ... ηT k = [W T (x1 µ), ..., W T (xn µ)] the distribution of X is the same as i=1 ηi(Σw i )T + ZΣ1/2Q where Z has i.i.d. standard normal entries. Furthermore, conditioned on W T (x µ) and the noise of variable in y (which is independent of x), by the multi-index assumption (B), the label y is non-random. Since Qw = 0, we have w = Pk i=1 w i , Σw w i and so w , x = w , µ + i=1 w i , Σw w i , x µ . Therefore, w , x also becomes non-random after conditioning. We can let I = {i [n] : w , xi 0} and define ξ Rn by ξi = yi | w , xi | if i I | w , xi | yi if i / I and ξ is non-random after conditioning. Following the construction discussed in the main text, for any w Rd, the predictor w = w + w satisfies | w, xi | = yi where w = arg min w Rd: Xw=ξ by the definition of ξ. Hence, we have min w Rd: i [n], w,xi 2=y2 i w 2 w 2 + w 2 and it suffices to control w 2. Let R be the orthogonal projection matrix onto the image of Q and we consider w of the form Rw to upper bound w 2. By Lemma 13, we know QR = R and RΣw i = 0. By the assumption that µ lies in the span of {Σw 1, ..., Σw k}, we have i=1 ηi(Σw i )T + ZΣ1/2Q Rw = ZΣ1/2Rw. Since R is an orthogonal projection, it holds that Rw 2 w 2. Finally, we observe that the distribution of ZΣ1/2R is the same as Z(RΣR)1/2 and so w 2 min w Rd: Z(RΣR)1/2w=ξ We are now ready to apply Lemma 12 to the covariance RΣR. We are allowed to replace the dependence on RΣR by the dependence on Σ by the last two inequalities of Lemma 13. The desired conclusion follows by the observation that ξ 2 2 = nˆLf(w ) and the assumption that ˆLf(w ) (1 + ρ)Lf(w ). D.2 Re LU Regression The proof of Theorem 3 will closely follow the proof of Theorem 2. Theorem 3. Under assumptions (A) and (B), let f : R Y R be the loss defined in (13) with Y = R 0. Let Q be the same as in Theorem 1 and Σ = QT ΣQ. Fix any (w , b ) Rd+1 such that Qw = 0 and for some ρ (0, 1), it holds that ˆLf(w , b ) (1 + ρ)Lf(w , b ). (14) Then with probability at least 1 δ, for some ϵ ρ + log 1 n + n R(Σ ) min (w,b) Rd+1: i [n],σ( w,xi +b)=yi w 2 w 2 + (1 + ϵ) n Lf(w , b ) Tr(Σ ) . (15) Proof. We let I = {i [n] : yi > 0} and for any (w , b ) Rd+1, we define ξ Rn by ξi = yi w , xi b if i I σ( w , xi + b ) if i / I. By the definition of ξ, the predictor (w, b) = (w + w , b ) satisfies σ( w, xi + b) = yi where w = arg min w Rd: Xw=ξ Hence, we have min (w,b) Rd+1: i [n],σ( w,xi +b)=yi w 2 w 2 + w 2 and it suffices to control w 2. Similar to the proof of Theorem 2, we make the simplifying assumption that µ lies in the span of {Σw 1, ..., Σw k} and {Σ1/2w 1, ..., Σ1/2w k} are orthonormal. Conditioned on W T (xi µ) and the noise variable in yi, both yi and w , xi are non-random, and so ξ is also non-random. The distribution of X is the same as i=1 ηi(Σw i )T + ZΣ1/2Q. If we consider w of the form Rw, then we have w 2 min w Rd: Z(RΣR)1/2w=ξ We are now ready to apply Lemma 12 to the covariance RΣR. We are allowed to replace the dependence on RΣR by the dependence on Σ by the last two inequalities of Lemma 13. The desired conclusion follows by the observation that ξ 2 2 = nˆLf(w , b ) due to the definition (13) and the assumption that ˆLf(w ) (1 + ρ)Lf(w , b ). D.3 Low-rank Matrix Sensing Theorem 4. 
Suppose that d1d2 > n, then there exists some ϵ q n + n d1d2 such that with probability at least 1 δ, it holds that min i [n], Ai,X =yi X r X F + (1 + ϵ) d1 d2 . (17) Proof. Without loss of generality, we will assume that d1 d2. We will vectorize the measurement matrices and estimator A1, ..., An, X Rd1 d2 as a1, ..., an, x Rd1d2 and define x = X . Denote A = [a1, ..., an]T Rn d1d2. We define the primary problem Φ by Φ := min i [n], Ai,X =ξ X = min Ax=ξ x . By Lemma 11, it suffices to consider the auxiliary problem Ψ := min G x 2 ξ 2 H,x x . We will pick x of the form x = αH for some α 0, which needs to satisfy α H 2 2 αG H 2 ξ 2. By a union bound, the following events occur simultaneously with probability at least 1 δ/2: 1. by Lemma 3, it holds that G 2 n + 2 p log(32/δ) ξ 2 2. Condition on ξ, we have 1 ξ G, ξ N(0, 1) and so by standard Gaussian tail bound Pr(|Z| > t) 2e t2/2 | G, ξ | 2 log(16/δ) Then we can use AM-GM inequality to show for sufficiently large n αG H 2 ξ 2 2 = α2 G 2 2 H 2 2 + ξ 2 2α H 2 G, ξ + ξ 2 + 2 nα H 2 ξ 2 2 log(16/δ) 2 log(16/δ) and it suffices to let α2 H 4 2 nα2 H 2 2 2 log(16/δ) Rearranging the above inequality, we can choose and since H as a matrix can have at most rank d1, by Cauchy-Schwarz inequality on the singular values of H, we have H d1 H 2 and x = α H α p d1 H 2 (1 + ϵ) d1d2 = (1 + ϵ) for some ϵ q n + n d1d2 . The desired conclusion follows by the observation that X r X F because X has rank r. Theorem 5. Fix any δ (0, 1). There exist constants c1, c2, c3 > 0 such that if d1d2 > c1n, d2 > c2d1, n > c3r(d1 + d2), then with probability at least 1 δ that ˆX X 2 F X 2 F r(d1 + d2) d1 d2 + n d1d2 X 2 F . (18) Proof. Note that A, X N(0, X 2 F ) and so by the standard Gaussian tail bound Pr(|Z| t) 2e t2/2, Theorem 9 and a union bound, it holds with probability at least 1 δ/8 that 2 log(32/δ) X F 2 log(32/δ). Then it holds that A A, X X 2 F X op A op + | A, X | 2 log(32/δ) + X op 2 log(32/δ) 8 log(32/δ). Therefore, we can choose Cδ in Theorem 1 by 8 log(32/δ) X and applying Theorem 1 and Theorem 4, we have (1 ϵ)L( ˆX) Cδ(X)2 d1 + d2 + p 8 log(32/δ) 2 r X F + (1 + ϵ) 8 log(32/δ) n + (1 + ϵ) σ X F where ϵ is the maximum of the two ϵ in Theorem 1 and Theorem 4. Finally, recall that L( ˆX) = σ2 + ˆX X 2 F . Assuming that d1 d2, then the above implies that ˆX X 2 F X 2 F (1 ϵ) 1(1 + ϵ)2 8 log(32/δ) d1 d2 + n d1d2 and we are done. E Counterexample to Gaussian Universality By assumption (G), we can write xi|d k = h(xi|k) Σ1/2 |d kzi where zi N(0, Id k). We will denote the matrix Z = [z1, ..., zn]T Rn (d k). Following the notation in section 7, we will also write X = [X|k, X|d k] where X|k Rn k and X|d k Rn (d k). The proofs in this section closely follows the proof of Theorem 6. Theorem 15. Consider dataset (X, Y ) drawn i.i.d. from the data distribution D according to (G) and (H), and fix any f : R Y R 0 such that f is 1-Lipschitz for any y Y. Fix any δ > 0 and suppose there exists ϵδ < 1 and Cδ : Rd k [0, ] such that (i) with probability at least 1 δ/2 over (X, Y ) and G N(0, In), it holds uniformly over all w|k Rk and w|d k Σ|d k R 0 that f( w|k, xi|k + h(xi|k) w|d k Σ|d k Gi, yi) h(xi|k)2 (1 ϵδ)ED f( w, x , y) (ii) with probability at least 1 δ/2 over z|d k N(0, Σ|d k), it holds uniformly over all w|d k Rd k that w|d k, z|d k Cδ(w|d k) (75) then with probability at least 1 δ, it holds uniformly over all w Rd that (1 ϵδ)E f( w, x , y) f( w, xi , yi) h(xi|k)2 + Cδ(w|d k) n Proof. 
Note that w|d k, xi|d k = h(xi|k) w|d k, Σ1/2 |d kzi and so for any f : R Y Rk R, we can write Φ := sup w Rd F(w) 1 i=1 f( w, xi , yi, xi|k) = sup w Rd,u Rn u=ZΣ1/2 |d kw|d k i=1 f( w|k, xi|k + h(xi|k)ui, yi, xi|k) = sup w Rd,u Rn inf λ Rn λ, ZΣ1/2 |d kw|d k u + F(w) 1 i=1 f( w|k, xi|k + h(xi|k)ui, yi, xi|k). By the same truncation argument used in Lemma 7, it suffices to consider the auxiliary problem: Ψ := sup w Rd,u Rn inf λ Rn λ 2 H, Σ1/2 |d kw|d k + G Σ1/2 |d kw|d k 2 u, λ i=1 f( w|k, xi|k + h(xi|k)ui, yi, xi|k) = sup w Rd,u Rn inf λ 0 λ H, Σ1/2 |d kw|d k G Σ1/2 |d kw|d k 2 u 2 i=1 f( w|k, xi|k + h(xi|k)ui, yi, xi|k) Therefore, it holds that Ψ = sup w Rd,u Rn H,Σ1/2 |d kw|d k G Σ1/2 |d kw|d k 2 u 2 i=1 f( w|k, xi|k + h(xi|k)ui, yi, xi|k) = sup w Rd F(w) 1 H,Σ1/2 |d kw|d k G Σ1/2 |d kw|d k 2 u 2 i=1 f( w|k, xi|k + h(xi|k)ui, yi, xi|k). Next, we analyze the infimum term: H,Σ1/2 |d kw|d k G Σ1/2 |d kw|d k 2 u 2 i=1 f( w|k, xi|k + h(xi|k)ui, yi, xi|k) u 2 H,Σ1/2 |d kw|d k i=1 f( w|k, xi|k + h(xi|k) ui + Σ1/2 |d kw|d k 2Gi , yi, xi|k) = inf u Rn sup λ 0 λ( u 2 H, Σ1/2 |d kw|d k 2) i=1 f( w|k, xi|k + h(xi|k) ui + Σ1/2 |d kw|d k 2Gi , yi, xi|k) sup λ 0 inf u Rn λ( u 2 H, Σ1/2 |d kw|d k 2) i=1 f( w|k, xi|k + h(xi|k) ui + Σ1/2 |d kw|d k 2Gi , yi, xi|k) = sup λ 0 λ H, Σ1/2 |d kw|d k 2 i=1 inf ui R f( w|k, xi|k + ui + Σ1/2 |d kw|d k 2h(xi|k)Gi, yi, xi|k) + λ h(xi|k)2 u2 i . Now suppose that f takes the form f(ˆy, y, x|k) = 1 h(x|k)2 f(ˆy, y) for some 1 square-root Lipschitz f and by a union bound, it holds with probability at least 1 δ that Σ1/2 |d k H, w|d k 2 Cδ(w|d k)2 1 h(xi|k)2 f( w|k, xi|k + Σ1/2 |d kw|d k 2h(xi|k)Gi, yi) (1 ϵδ)E 1 h(x|k)2 f( w, x , y) , then the above becomes sup λ 0 λ Σ1/2 |d k H, w|d k 2 + 1 h(xi|k)2 fλ( w|k, xi|k + Σ1/2 |d kw|d k 2h(xi|k)Gi, yi) sup λ 0 λ Σ1/2 |d k H, w|d k 2 + λ λ + 1 1 h(xi|k)2 f( w|k, xi|k + Σ1/2 |d kw|d k 2h(xi|k)Gi, yi) sup λ 0 λCδ(w|d k)2 + λ λ + 1(1 ϵ)n E 1 h(x|k)2 f( w, x , y) (1 ϵδ)E 1 h(x|k)2 f( w, x , y) Cδ(w|d k) n + where we apply Lemma 8 in the last step. Then if we take (1 ϵδ)E 1 h(x|k)2 f( w, x , y) Cδ(w|d k) n + then we have Ψ 0. To summarize, we have shown s (1 ϵδ)E 1 h(x|k)2 f( w, x , y) Cδ(w|d k) n 1 h(xi|k)2 f( w, xi , yi) 0 which implies E 1 h(x|k)2 f( w, x , y) (1 ϵδ) 1 1 n 1 h(xi|k)2 f( w, xi , yi) + Cδ(w|d k) n Theorem 16. Under assumptions (G) and (H), fix any w |k Rk and suppose for some ρ (0, 1), it holds with probability at least 1 δ/8 yi w |k, xi|k y w |k, x|k Then with probability at least 1 δ, for some ϵ ρ + log 1 R(Σ|d k) + n R(Σ|d k) min w Rd: i, w,xi =yi w 2 2 w |k 2 2 + (1 + ϵ) n E y w |k,x|k h(x|k) 2 Tr(Σ|d k) (78) Proof. Fix any w |k Rk, we observe that min w Rd: i, w,xi =yi w 2 2 = min w Rd: i, w|k,xi|k + w|d k,xi|d k =yi w|k 2 2 + w|d k 2 2 w |k 2 2 + min w|d k Rd k: i, w|d k,xi|d k =yi w |k,xi|k Therefore, it is enough analyze Φ := min w|d k Rd k: i, w|d k,xi|d k =yi w |k,xi|k w|d k 2 = min w|d k Rd k: i, w|d k,Σ1/2 |d kzi = yi w |k,xi|k By introducing the Lagrangian, we have Φ = min w|d k Rd k max λ Rn Σ1/2 |d kw|d k, zi yi w |k, xi|k = min w|d k Rd k max λ Rn λ, ZΣ1/2 |d kw|d k yi w |k, xi|k Similarly, the above is only random in Z after conditioning on X|kw |k and ξ and the distribution of Z remains unchanged after conditioning because of the independence. 
By the same truncation argument as before and CGMT, it suffices to consider the auxiliary problem: min w|d k Rd k max λ Rn λ 2 H, Σ1/2 |d kw|d k + Σ1/2 |d kw|d k 2Gi yi w |k, xi|k = min w|d k Rd k max λ Rn λ 2 H, Σ1/2 |d kw|d k + Σ1/2 |d kw|d k 2Gi yi w |k, xi|k + w|d k 2 and so we can define Ψ := min w|d k Rd k: s Σ1/2 |d kw|d k 2Gi yi w |k,xi|k 2 Σ1/2 |d k H,w|d k To upper bound Ψ, we consider w|d k of the form α Σ1/2 |d k H Σ1/2 |d k H 2 , then we just need α Σ|d k H 2 Σ1/2 |d k H 2 Gi yi w |k, xi|k α2 Σ1/2 |d k H 2 2. By a union bound, the following occur together with probability at least 1 δ/2 for some absolute constant C > 0: 1. Using the first part of Lemma 4, we have Σ1/2 |d k H 2 2 Tr(Σ|d k) 1 C log(32/δ) p 2. Using the last part of Lemma 4, requiring R(Σ|d k) log(32/δ)2 Σ|d k H 2 2 Σ1/2 |d k H 2 2 C log(32/δ) Tr(Σ2 |d k) 3. Using subexponential Bernstein s inequality (Theorem 2.8.1 of Vershynin (2018)), requiring n = Ω(log(1/δ)), 1 n 4. Using standard Gaussian tail bound Pr(|Z| t) 2e t2/2, we have 1 n Gi(yi w |k, xi|k ) yi w |k, xi|k 2 log(32/δ) 5. By assumption, it holds that yi w |k, xi|k y w |k, x|k Then we use the above and the AM-GM inequality to show that α Σ|d k H 2 Σ1/2 |d k H 2 Gi yi w |k, xi|k 2α2 Σ|d k H 2 2 Σ1/2 |d k H 2 2 + (1 + ρ) E y w |k, x|k + 2α Σ|d k H 2 Σ1/2 |d k H 2 v u u u t(1 + ρ) E y w |k, x|k 2 log(32/δ) C log(32/δ) 2 log(32/δ) α2 Tr(Σ2 |d k) 2 log(32/δ) y w |k, x|k After some rearrangements, it is easy to see that we can choose 2 log(32/δ) 1 C log(32/δ) R(Σ|d k) C log(32/δ) 2 + q 2 log(32/δ) n E y w |k,x|k h(x|k) 2 Tr(Σ|d k) . and the proof is complete.
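To complement Theorem 16, the leading term of (78) can be compared in simulation against the prediction one would obtain from a Gaussian design with matched covariance. Every concrete choice below is illustrative and not the construction used in the paper: $k = 1$, $x_{|k} \sim N(0,1)$, $h(t) = 1 + t^2$, $\Sigma_{|d-k} = I_{d-k}$, $w^*_{|k} = 1$, and $y = \langle w^*_{|k}, x_{|k}\rangle + \sigma\epsilon$. The sketch illustrates that the minimal-norm interpolation of the residuals scales with the weighted second moment $E[((y - \langle w^*_{|k}, x_{|k}\rangle)/h(x_{|k}))^2]$ appearing in (78), rather than with the matched-covariance quantity $E[(y - \langle w^*_{|k}, x_{|k}\rangle)^2]/E[h(x_{|k})^2]$, which is the kind of discrepancy this section exploits.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k, sigma = 300, 2000, 1, 0.5        # illustrative sizes with d - k >> n

# Illustrative instance of model (G): x_{|k} ~ N(0,1), h(t) = 1 + t^2, Sigma_{|d-k} = I,
# and a linear target y = <w*_{|k}, x_{|k}> + sigma * eps with w*_{|k} = 1.
x_k = rng.standard_normal(n)
h = 1.0 + x_k ** 2
Z = rng.standard_normal((n, d - k))
X_rest = h[:, None] * Z                   # x_{i,|d-k} = h(x_{i,|k}) * Sigma^{1/2} z_i
y = x_k + sigma * rng.standard_normal(n)

r = y - x_k                               # residuals after fixing the first block at w*_{|k}
w_rest = np.linalg.lstsq(X_rest, r, rcond=None)[0]   # min-norm solution of <w_{|d-k}, x_{i,|d-k}> = r_i

# Empirical version of the leading term in (78): sum_i (r_i / h_i)^2 / Tr(Sigma_{|d-k})
term_78 = np.sum((r / h) ** 2) / (d - k)

# Matched-covariance Gaussian heuristic: ||r||^2 / (E[h^2] * (d - k)),
# using E[(1 + Z^2)^2] = 6 for Z ~ N(0,1); in this example it is noticeably smaller than term_78.
term_gaussian = np.sum(r ** 2) / (6.0 * (d - k))

print("||w_rest||_2^2               :", np.linalg.norm(w_rest) ** 2)
print("term from (78)               :", term_78)
print("matched-Gaussian heuristic   :", term_gaussian)
print("||w*_{|k}||^2 + ||w_rest||^2 =", 1.0 + np.linalg.norm(w_rest) ** 2)
```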