# Generalization Error of Spectral Algorithms

Published as a conference paper at ICLR 2024

Maksim Velikanov^{1,2}, Maxim Panov^{3}, Dmitry Yarotsky^{4}
^{1}Technology Innovation Institute, ^{2}École Polytechnique, ^{3}MBZUAI, ^{4}Skoltech
maksim.velikanov@tii.ae, maxim.panov@mbzuai.ac.ae, d.yarotsky@skoltech.ru

ABSTRACT

The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of spectral algorithms specified by a profile $h(\lambda)$ and including KRR and GD as special cases. We then derive the generalization error as a functional of the learning profile $h(\lambda)$ for two data models: a high-dimensional Gaussian model and a low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give the full loss asymptotics for both noisy and noiseless observations, (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon, and (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in the case of noisy observations.

1 INTRODUCTION

A quantitative description of various aspects of neural networks, most notably their generalization performance after training, is an important but challenging question of deep learning theory. One of the central approaches to this question is built on the connection between a neural network and its neural tangent kernel, first established for infinitely wide networks (Jacot et al., 2018; Lee et al., 2020; Chizat et al., 2019), and then further taken to the rich realm of finite practical networks (Fort et al., 2020; Maddox et al., 2021; Long, 2021; Kopitkov & Indelman, 2020; Vyas et al., 2023).

Consider the task of learning a target function $f^*(x)$ from a training dataset $D_N = \{x_i\}_{i=1}^N$ and (possibly) noisy observations $y_i = f^*(x_i) + \sigma\varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, 1)$, using a given kernel $K(x, x')$. Then, many authors (Bordelon et al., 2020; Jacot et al., 2020a; Wei et al., 2022) derive the asymptotic ($N \to \infty$) generalization error for the kernel ridge regression (KRR) algorithm with regularization $\eta$:

$$L_{\mathrm{KRR}}(\eta) = \Big(1 - \tfrac{1}{N}\textstyle\sum_l \tfrac{\lambda_l^2}{(\eta_{\mathrm{eff}} + \lambda_l)^2}\Big)^{-1} \sum_l \frac{(\eta_{\mathrm{eff}}\, c_l)^2 + \lambda_l^2 \tfrac{\sigma^2}{N}}{(\eta_{\mathrm{eff}} + \lambda_l)^2}, \qquad 1 = \frac{\eta}{\eta_{\mathrm{eff}}} + \frac{1}{N}\sum_l \frac{\lambda_l}{\lambda_l + \eta_{\mathrm{eff}}}, \quad (1)$$

where $\lambda_l$ are the population eigenvalues of $K(x, x')$ and $c_l$ are the respective eigencoefficients of $f^*(x)$ (see the definition in (4)). The main motivation of our work is what happens with (1) when, as required by the association with neural networks, KRR is replaced with GD or an even more general learning algorithm.

Importantly, a result of the type (1) may give precise insights for the family of power-law spectral distributions, closely related to the capacity and source assumptions of non-parametric statistics:

$$\lambda_l \asymp l^{-\nu} \ (\text{with } \nu > 1), \qquad c_l^2 \asymp l^{-\kappa - 1} \ (\text{with } \kappa > 0). \quad (2)$$

Power-law conditions (2) exhibit a rich picture of convergence rates. For the case of noisy observations $\sigma^2 > 0$, Caponnetto & De Vito (2007) give the minimax rate $O(N^{-\frac{\kappa}{\kappa+1}})$. For noiseless observations $\sigma^2 = 0$, the optimal estimation rate significantly improves (Bordelon et al., 2020; Cui et al., 2021), becoming $O(N^{-\kappa})$. However, the optimal rates are not always achievable by some classical algorithms.
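To make the KRR estimate (1) concrete, the following is a minimal numerical sketch that solves the fixed-point equation for $\eta_{\mathrm{eff}}$ and evaluates the loss under the power-law spectrum (2). The truncation `P`, the chosen exponents and regularization scaling, and the helper names `eta_eff` and `krr_risk` are our own illustrative assumptions, not part of the paper's setup.

```python
# Illustrative numerical sketch of the KRR risk estimate (1) under the power-law spectrum (2).
import numpy as np

def eta_eff(eta, lam, N, iters=200):
    """Solve 1 = eta/eta_eff + (1/N) * sum_l lam_l / (lam_l + eta_eff) by bisection."""
    lo, hi = 1e-12, eta + lam.sum() / N + 1.0          # the root is bracketed in (lo, hi)
    g = lambda k: eta / k + (lam / (lam + k)).sum() / N - 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def krr_risk(eta, lam, c2, N, sigma2):
    """Evaluate the right-hand side of (1) for population eigenvalues lam and coefficients c_l^2 = c2."""
    k = eta_eff(eta, lam, N)
    gamma = (lam**2 / (lam + k)**2).sum() / N
    return ((k**2 * c2 + lam**2 * sigma2 / N) / (lam + k)**2).sum() / (1.0 - gamma)

nu, kappa, P = 1.5, 1.0, 100_000
l = np.arange(1, P + 1)
lam, c2 = l**(-nu), l**(-kappa - 1.0)                  # power-law spectrum (2)
for N in (100, 1_000, 10_000):
    eta = N**(-nu / (kappa + 1.0))                     # regularization on the scale s* = nu/(kappa+1)
    print(N, krr_risk(eta, lam, c2, N, sigma2=1.0))
```

Running the loop over $N$ makes the $N^{-\kappa/(\kappa+1)}$ trend of the noisy rate visible directly from (1).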
For example, when $\kappa/\nu > 2$ in the noisy case, the rate of KRR becomes $O(N^{-\frac{2\nu}{2\nu+1}})$, i.e. KRR does not attain the minimax lower bound (Li et al., 2023). Such an effect is usually called saturation and is well known in various non-parametric problems (Mathé, 2004; Bauer et al., 2007). However, the saturation effect can be removed with algorithms other than KRR, for example spectral cut-off (Bauer et al., 2007) or gradient descent (Pillaud-Vivien et al., 2018). In the noiseless case, Bordelon et al. (2020) and Cui et al. (2021) show saturation at the same point $\kappa/\nu > 2$, with the rate changing to $O(N^{-2\nu})$. Whether noiseless saturation can be removed by algorithms other than KRR was, to the best of our knowledge, not studied in the literature.

                                   Noisy observations σ² > 0          Noisy observations σ² > 0    Noiseless observations
                                   KRR                                GF & Optimal                 KRR & GF & Optimal
Exponent # in L = O(N^{-#})        κ/(κ+1) | 2ν/(2ν+1)                κ/(κ+1)                      κ | 2ν
Spectral localization scale s      ν/(κ+1) | 0, ν/(2ν+1)              ν/(κ+1)                      ν | 0
Universality                       yes                                yes                          no

Table 1: Our results for power-law spectral distributions (2) and three algorithms: optimally regularized KRR, optimally stopped Gradient Flow (GF), and the optimal algorithm (see Section 3). For quantities exhibiting saturation at $\kappa/\nu = 2$, the vertical line separates the non-saturated (left) and saturated (right) values. Spectral localization at scale $s$ means that the error is accumulated at eigenvalues $\lambda$ of the order $N^{-s}$ (see Section 5.1). In the last line, by universality we mean the asymptotic equality of the errors for different problems with the same population spectrum $\lambda_l, c_l$. While we show universality only for our two data models, we would expect it to hold for a broader class of data models.

Our contribution. In this work, we augment the above picture in several directions, as summarized in Table 1 and the following three blocks.

Loss functional. In Section 3, we introduce a new kernel learning framework by specifying the learning algorithm with a spectral profile $h(\lambda)$ and expressing the generalization error as a quadratic functional of this profile. While specific choices of $h(\lambda)$ can give KRR, Gradient Flow (GF), or any iterative first-order algorithm, we go beyond these classical examples and consider the optimal learning algorithm as the minimizer of the loss functional for a given problem.

The models. As the loss functional is problem-specific, we consider two different models: a Wishart-type model with Gaussian features and a translation-invariant model on a circle. In Section 4, we derive the loss functionals for these models. In addition, we introduce a simple Naive Model of Noisy Observations (NMNO). In the presence of observation noise, for reasonable learning algorithms, and at least for power-law spectra, the NMNO model gives asymptotically exact loss values for both the Wishart and Circle models despite their differences. This suggests a possible universality property for a larger family of problems, including our Wishart and Circle models as special cases.

Results for power-law spectrum. In Section 5, we reach specific conclusions by restricting both the kernel eigenvalues and the target function eigencoefficients to be described by the power laws (2). For both noisy and noiseless observations, we derive the full loss asymptotics. While the resulting rates are indeed consistent with prior works, the precise characterization of the leading-order asymptotic term gives access to finer details, such as the shape $h^*(\lambda)$ of the optimal learning algorithm.
We introduce the notion of spectral localization - the scale of kernel eigenvalues over which the loss is accumulated - and quantify it for algorithms under consideration. In particular, this perspective provides a simple explanation of KRR saturation phenomenon as the inability to sufficiently fit the main features of the target function. By characterizing the shape of the optimal algorithm, we point to the cases where it is optimal to overlearn the training data, akin to KRR with negative regularization. Moreover, we specify the values of power-law exponents at which overlearning is beneficial. Published as a conference paper at ICLR 2024 Generalization error. We evaluate the estimator bf(x) with its squared prediction error | bf(x) f (x)|2, averaged over inputs x drawn from a population density p(x), and then all the randomness in training dataset DN: 2EDN,ε bf f 2 = 1 Z EDN,ε h bf(x) f (x) 2i p(x)dx, (3) where ε = (ε1, . . . , εN) N(0, I), f 2 f, f , and angle brackets denote the scalar product f, g R f(x)g(x)p(x)dx. Population spectrum. Central to our approach is the connection between generalization error L b f and spectral distributions λl, cl of the problem, defined by Mercer theorem as K(x, x ) = X λlϕl(x)ϕl(x ), f (x) = X clϕl(x), ϕl, ϕl = δll . (4) Here, λl are the kernel eigenvalues, cl are the target function coefficients, and ϕl(x) are the eigenfeatures of the kernel. In the most interesting scenarios, the number P of features ϕl is infinite, and the respective eigenvalues λl 0 as l . In this work, we aim at two levels of results w.r.t. the population spectrum λl, cl. First, we want to obtain a characterization of L b f for a general λl, cl, similar to what is done in the classic result (1). Second, we will assume the power-laws (2) to obtain a more detailed description of the generalization error L b f. An important object is the empirical kernel matrix K RN N composed of evaluation of the kernel on training points (K)ij = K(xi, xj). Let us additionally denote by y = f + ε RN the observation vector with components (y)i = f (xi) + εi; by Λ RP P the diagonal matrix with (Λ)ll = λl; and by Φ RP N the matrix of kernel features evaluated at training points, Φ li = ϕl(xi). Then, spectral decomposition (4) allows to write empirical kernel matrix as K = ΦT ΛΦ. (5) Data models. A standard approach to analyzing the generalization error consists in considering general families of kernels K(x, x ) and targets f (x), typically defined by regularity assumptions, and deriving upper and lower generalization bounds (e.g., see (Caponnetto & De Vito, 2007)). We adopt a different approach that allows us to go beyond just the bounds and describe generalization error L b f with more quantitative detail. To this end, we consider two particular models, Circle and Wishart, that represent extreme lowand high-dimensional cases of the kernel learning setting. Wishart model. This model is a common choice for our setting: it was explicitly assumed in (Cui et al., 2021; Jacot et al., 2020b; Simon et al., 2023) and is closely related to settings of (Jacot et al., 2020a; Bordelon et al., 2020; Canatar et al., 2021; Wei et al., 2022). Specifically, assume that the kernel features ϕl(x) and data distribution p(x) are such that feature matrix Φ has i.i.d. Gaussian entries: ϕl(xi) N(0, 1). Then, the resulting empirical kernel matrix in (5) is called Wishart matrix in Random Matrix Theory (RMT) context. 
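As a concrete illustration of the Wishart model just defined, here is a minimal Monte-Carlo sketch: the features $\phi_l(x_i)$ are drawn i.i.d. $\mathcal{N}(0,1)$, so the empirical kernel matrix is $K = \Phi^T \Lambda \Phi$ as in (5), and the KRR test error can be measured directly in coefficient space. The truncation `P`, sample sizes, number of trials, and helper names are our own choices for illustration only.

```python
# Monte-Carlo estimate of the KRR test error in the Wishart data model.
import numpy as np

rng = np.random.default_rng(0)
nu, kappa, sigma = 1.5, 1.0, 1.0
P, N, n_trials = 4000, 200, 20
l = np.arange(1, P + 1)
lam, c = l**(-nu), l**(-(kappa + 1.0) / 2.0)   # lambda_l ~ l^-nu, c_l^2 ~ l^-(kappa+1)
eta = N**(-nu / (kappa + 1.0))

errors = []
for _ in range(n_trials):
    Phi = rng.standard_normal((P, N))          # feature matrix, (Phi)_{li} = phi_l(x_i)
    K = Phi.T @ (lam[:, None] * Phi)           # empirical kernel matrix (5)
    y = Phi.T @ c + sigma * rng.standard_normal(N)
    alpha = np.linalg.solve(K + N * eta * np.eye(N), y)
    c_hat = lam * (Phi @ alpha)                # learned population coefficients of the KRR predictor
    errors.append(np.sum((c_hat - c)**2))      # population test error sum_l (c_hat_l - c_l)^2
print("Monte-Carlo KRR error:", np.mean(errors))
```

The printed value can then be compared with the closed-form estimate (1) evaluated on the same $(\lambda_l, c_l^2, N, \sigma^2, \eta)$.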
Intuitively, we can think about Wishart model as a model of some high-dimensional data with all the structural information about data distribution and kernel being wiped out by high-dimensional fluctuations. Circle model. To describe the generalization of a kernel estimator trained by completely fitting (i.e. interpolating) training data, Spigler et al. (2020) and Beaglehole et al. (2023) used a model with translations-invariant kernel and training inputs forming a regular lattice in a hypercube. In this work, we consider, for transparency, a one-dimensional version of this model. Yet, the onedimensional version will display all the important phenomena we plan to discuss. Specifically, let inputs come from the circle x S = R mod 2π, and the training set DN = {u + 2πi N }N 1 i=0 . Here, u is an overall shift of the training data lattice, which we sample uniformly u U([0, 2π]) to introduce some randomness in otherwise deterministic training set DN. Then, the kernel and the target are defined in the basis of Fourier harmonics as l= λleil(x x ), f (x) = l= cleilx. (6) We assume λl = λ l R and c l = cl to ensure that both the kernel and the target are real-valued. Published as a conference paper at ICLR 2024 3 SPECTRAL ALGORITHMS AND THEIR GENERALIZATION ERROR Several authors, e.g. Bauer et al. (2007); Rastogi & Sampath (2017); Lin et al. (2020), considered Spectral Algorithms to generalize and extend the type of regularization performed by classical methods such as KRR and GD. Indeed, both KRR and GD fit observation vector y to a different extent in different spectral subspaces of the empirical kernel matrix K. For spectral algorithms, this fitting degree is specified by a profile h(λ) so that the estimator is given by bf(x) = k(x)T K 1h 1 Here k(x) RN has components k(x) i = K(x, xi). Function h(λ) is applied to the kernel matrix 1 N K in the operator sense: h( ) acts element-wise on diagonal matrices, and for an arbitrary positive semi-definite matrix with eigendecomposition A = UT DU we have h(A) = UT h(D)U. Let us show how classical algorithms can be written in the form (7) with a specific choice of h(λ): 1. Kernel Ridge Regression with regularization η is obtained with hη(λ) = λ λ+η. Then, (7) transforms into the classical formula for KRR predictor: bf(x) = k(x)T K + NηI 1y. 2. Gradient Flow by time t is obtained with ht(λ) = 1 e tλ. (For this and the next example, we provide the respective derivations in Section B.1.) 3. For an arbitrary general first-order iterative algorithm at iteration t, ht(λ) is given by the associated degree-t polynomial with ht(λ = 0) = 0 (see Section B.1). For example, GD with a learning rate α is given by ht(λ) = 1 (1 αλ)t. Now, we make a simple observation that is crucial to the current work. Note that generalization error (3) is quadratic in the estimator bf, while bf is linear in h according to (7). Thus, for any problem, the generalization error is quadratic in the profile h. This observation is formalized (see the proof in Section A) in Proposition 1. There exist signed measures ρ(2)(dλ1, dλ2), ρ(1)(dλ) and ρ(ε)(dλ) (given in equations (37)-(39)) such that the map h 7 bf 7 L b f given by (7), (3) is expressed as the quadratic functional Z Z h(λ1)h(λ2)ρ(2)(dλ1, dλ2) 2 Z h(λ)ρ(1)(dλ) + f (x) 2 Z h2(λ)ρ(ε)(dλ). (8) We will refer to the measures ρ(1), ρ(2) and ρ(ε) as the learning measures. Proposition 1 shows that the loss functional is completely specified by these measures. 
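Before moving on, here is a short sketch of the generic spectral-algorithm predictor (7): the profile $h$ is applied to the eigenvalues of $K/N$, and KRR, gradient flow (GF) and gradient descent (GD) differ only in the choice of $h$. The RBF kernel, the toy target, the parameter values, and all helper names are our own illustrative assumptions, not part of the paper's setup.

```python
# A sketch of the spectral-algorithm predictor (7) for several profiles h.
import numpy as np

def spectral_predict(h, K, X_train, y, X_test, kernel):
    """Return \hat f(x) = k(x)^T K^{-1} h(K/N) y on the test inputs."""
    N = len(y)
    evals, U = np.linalg.eigh(K / N)                    # eigendecomposition of K/N
    evals = np.clip(evals, 1e-12, None)                 # guard against round-off in tiny eigenvalues
    weights = U @ (h(evals) / (N * evals) * (U.T @ y))  # K^{-1} h(K/N) y
    k_test = kernel(X_test[:, None], X_train[None, :])  # k(x)_i = K(x, x_i)
    return k_test @ weights

rbf = lambda a, b: np.exp(-(a - b) ** 2 / 0.1)          # an arbitrary example kernel
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, 64)
y = np.sin(3 * X_train) + 0.1 * rng.standard_normal(64)
X_test = np.linspace(-1, 1, 200)
K = rbf(X_train[:, None], X_train[None, :])

eta, t, alpha = 1e-3, 1e3, 0.5
profiles = {
    "KRR": lambda lam: lam / (lam + eta),                  # h_eta
    "GF":  lambda lam: 1.0 - np.exp(-t * lam),             # h_t for gradient flow
    "GD":  lambda lam: 1.0 - (1.0 - alpha * lam) ** 200,   # 200 GD steps with learning rate alpha
}
preds = {name: spectral_predict(h, K, X_train, y, X_test, rbf) for name, h in profiles.items()}
for name, p in preds.items():
    print(name, "test MSE:", np.mean((p - np.sin(3 * X_test)) ** 2))
```

The design point here is that only the one-dimensional function $h(\lambda)$ changes between algorithms; the kernel-side computation is shared.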
The first line in (8) describes the estimation of the target function from the signal part f of the observation vector y, which is hindered by insufficiency of N observations to capture fine details of f (x). Similarly, the second line in (8) describes the effect of (unwanted) learning of the noise part ε of observations y. The functional (8) makes the relation between the learning algorithm h(λ) and the generalization error maximally explicit. However, properties of the underlying kernel and data are reflected in the learning measures ρ(2)(dλ1, dλ2), ρ(1)(dλ) and ρ(ε)(dλ) in a fairly complicated way. In Section A, we show some general connections between the kernel eigenvalues λl, the features ϕl(x), and the learning measures. Yet, the explicit characterization of learning measures is challenging even for our two data models, and constitutes the main technical step of our work. Optimal algorithm. Consider a regression problem and its associated loss functional (8). Since the loss functional is positive semi-definite, under suitable regularity assumptions it has a (possibly non-unique) minimizer h (λ) = arg min L[h] (9) that achieves the minimal possible generalization error in a given problem. We refer to the spectral algorithm with profile h (λ) as optimal. In the context of models with power-law spectra, we will also speak of optimal algorithms in a broader sense, as those providing the optimal error scaling with N. We will analyze the conditions of optimality in the Circle and Wishart models and show that in the noisy setting they have the same structure, easily understood using a simplified loss model. Published as a conference paper at ICLR 2024 4 EXPLICIT FORMS OF THE LOSS FUNCTIONAL Circle model. The main advantage of this model is that it admits an exact and explicit solution. Below, we describe its main properties, with derivations and proofs given in Section D. Due to the fact that training inputs xi form a regular lattice, the eigenvalues bλk of empirical kernel matrix K become deterministic: bλk = P n= λl+Nn. Behind this relation is the learning picture based on aliasing. For a given k 0, N 1, the information about the target function contained in all Fourier harmonics with frequencies l = k + Nn, n Z is compressed into the single k-th harmonic, and then projected back to the original l = k + Nn harmonics of the estimator bf. This leads to a transformation of population quantities λa l |cl|2b that we call N-deformation: λa l |cl|2b n= λa l+Nn|cl+Nn|2b. (10) It is periodic: λa l |cl|2b N = λa l+N|cl+N|2b N. Also, bλk = λk N. Then, we have Theorem 1. Loss functional of the Circle model is given by N λk 2 N h2(bλk) 2 N h(bλk) + |ck|2 The special feature of the loss functional (11) compared to the general form (8) in that there are no off-diagonal λ1 = λ2 contributions to the loss. Then, the optimal algorithm is found by a simple point-wise minimization: h (bλk) = [λk|ck|2]N[λk]N σ2 N [λ2 k]N . (12) Wishart model. This model, although more common in the literature, does not enjoy an exact solution like the Circle model. However, using two approximations, in Section E we derive an explicit form of the measures ρ(2), ρ(1), ρ(ε) describing the loss by equation (8). We give now an outline of our derivation. First, we point out that the learning measures from (8) can be reduced to the first and second moment of the imaginary part of the resolvent b R(z) = ( K N z I) 1 computed at the points z = λ + i0+ near the real line. 
Then, we make standard RMT assumptions to describe the resolvent in terms of the Stieltjes transform $r(z)$ of the spectral measure of $K$, which satisfies the fixed-point equation
$$1 = -z\, r(z) + \frac{1}{N}\sum_l \frac{r(z)\lambda_l}{r(z)\lambda_l + 1}. \qquad (13)$$
The first resolvent moment is computed straightforwardly, and the second moment at the same point $z_1 = z_2$ can be computed with a differentiation trick (Simon et al., 2023), but for the second moment at $z_1 \neq z_2$ a new tool is required. For that, we employ Wick's theorem for computing averages over Gaussian fields, where we take into account the leading-order pairings and neglect subleading $O(N^{-1})$ terms. The above procedure expresses the moments of $\widehat{R}(z)$ in terms of three auxiliary functions
$$v(z) = \sum_l \frac{c_l^2}{\lambda_l + r^{-1}(z)}, \qquad u(z) = \sum_l \frac{\lambda_l c_l^2}{\lambda_l + r^{-1}(z)}, \qquad w(z) = \sum_l \frac{\lambda_l^2}{\lambda_l + r^{-1}(z)}. \qquad (14)$$
Finally, we obtain the loss functional (8) by specifying each of the learning measures. Using the notation $\Im\{z\}$ for the imaginary part of $z$, we find the first moment of the signal measure $\rho^{(1)}(d\lambda)$ and the noise measure $\rho^{(\varepsilon)}(d\lambda)$ in (15); both have densities expressed through the imaginary parts of $r^{-1}$ and of the auxiliary functions (14) evaluated at $z = \lambda + i0^+$, with an overall $\frac{1}{\pi\lambda}$ factor in $\rho^{(1)}$. The second moment of the signal measure has a diagonal ($\lambda_1 = \lambda_2$) and an off-diagonal ($\lambda_1 \neq \lambda_2$) part,
$$\frac{\rho^{(2)}(d\lambda_1, d\lambda_2)}{d\lambda_1 d\lambda_2} = \frac{|r^{-1}(\lambda_1)|^2}{\pi \lambda_1^2}\,\Im\{v(\lambda_1)\}\,\delta(\lambda_1 - \lambda_2) + \rho^{(2)}_{\mathrm{off}}(\lambda_1, \lambda_2), \qquad (16)$$
where the off-diagonal density $\rho^{(2)}_{\mathrm{off}}$ is built from the products $\Im\{u(\lambda_1)\}\,\Im\{r^{-1}(\lambda_2)\}$ and $\Im\{u(\lambda_2)\}\,\Im\{r^{-1}(\lambda_1)\}$.

[Figure 1: four log-log panels of generalization error versus $N$ with panel titles "KRR with $\eta = N^{-\nu/(\kappa+1)}$", "GF with $t = N^{\nu/(\kappa+1)}$", "KRR with $\eta = N^{-\nu/(2\nu+1)}$ (saturation)", and "GF with $t = N^{\nu/(\kappa+1)}$", at $\kappa = 0.5$ and $\kappa = 5$; the saturated panel also shows dashed asymptotics $C_{\mathrm{wishart}} N^{-2\nu/(2\nu+1)}$ and $C_{\mathrm{circle}} N^{-2\nu/(2\nu+1)}$; legend: Wishart, Cosine Wishart, NMNO (Wishart & Cosine Wishart), Circle, NMNO (Circle).]

Figure 1: Generalization error of different data models in the presence of observation noise converges to our NMNO model (solid) as $N \to \infty$, which in turn converges to its $O(N^{-\#})$ asymptotic (dashed). All plots have $\nu = 1.5$. Cosine Wishart is an additional data model not covered by our theory yet converging to NMNO. The difference between the Circle and Wishart asymptotics in the third plot is due to localization of the error on the scale $s = 0$ at saturation. For details and extended discussion see Sec. F.

[Figure 2: three scale diagrams over the eigenvalue scale $s$, with panel titles "Non-saturated KRR: $s_\eta > s^*$", "Non-saturated KRR: $s_\eta = s^*$", and "Saturated KRR"; legend: noise/signal before learning, noise/signal after learning, total, error localization.]

Figure 2: Scale diagrams of different KRR regimes for noisy observations. All plots have $\nu = 1.2$, while $\kappa = 1.0$ in the non-saturated case (left and center) and $\kappa = 5.0$ in the saturated case (right). The dotted lines represent the noise $1 - \frac{s}{\nu}$ and signal $\frac{\kappa}{\nu} s$ terms in equation (18). The solid lines show the same terms with the added components $2S^{(h)}$ and $2S^{(1-h)}$. Left: the sub-optimal ($s_\eta > s^*$) non-saturated case. Center: the optimal ($s_\eta = s^*$) non-saturated case. Right: the saturated ($\kappa > 2\nu$) case with the choice $s_\eta = \frac{\nu}{2\nu+1}$, optimal for KRR but sub-optimal among general algorithms $h$.

Naive Model of Noisy Observations. As we can see from our results for the Circle and Wishart models above, the loss functional of a given model can be quite complicated. We will see, however, that in the presence of observation noise $\sigma^2 > 0$ the asymptotic ($N \to \infty$) generalization properties of both these models are well described by a simple artificial model (NMNO), introduced below. We conjecture this match to be a universal phenomenon, valid for a wide range of models. For large dataset sizes $N$, we expect all the complex details of the problem to be concentrated at eigendirections $l$ with small $\lambda_l$, whose features $\phi_l(x)$ cannot be well captured by the empirical kernel matrix $K$ of size $N$.
On the contrary, $K$ should succeed in capturing the $\phi_l(x)$ with moderate $\lambda_l$, and the empirical and population eigendecompositions should be close to each other. Let us therefore assume that we can ignore the small eigenvalues and determine the generalization error using only the components $l$ with moderate $\lambda_l$ (later, we will explain this by loss localization at moderate spectral scales). Then, the contribution of the $l$-th component to the generalization error can be approximated by $(1 - h(\lambda_l))^2 c_l^2$ for the signal fitting part, and by $\frac{\sigma^2}{N} h^2(\lambda_l)$ for the learned observation noise. This completely determines the associated loss functional. Let us describe the population spectral data $\lambda_l, c_l$ by the spectral eigenvalue measure $\mu_\lambda(d\lambda) = \sum_l \delta_{\lambda_l}(d\lambda)$ and the coefficient measure $\mu_c(d\lambda) = \sum_l c_l^2\, \delta_{\lambda_l}(d\lambda)$. Then, we define the NMNO model by the functional
$$L^{(\mathrm{nmno})}[h] = \frac{\sigma^2}{N} \int_{\lambda_{\min}}^{1} h^2(\lambda)\,\mu_\lambda(d\lambda) + \int_{\lambda_{\min}}^{1} \big(1 - h(\lambda)\big)^2\,\mu_c(d\lambda). \qquad (17)$$
Here, $\lambda_{\min}$ is a reference minimal population eigenvalue defined by the condition $\mu_\lambda([\lambda_{\min}, 1]) = N$ (i.e., such that the segment $[\lambda_{\min}, 1]$ contains exactly $N$ population eigenvalues).

5 RESULTS UNDER POWER-LAW SPECTRAL DISTRIBUTIONS

In this section we perform a deeper study of the Circle, Wishart and NMNO models of Section 4 in the setting of the power-law distributions (2). Prioritizing transparency over generality, we assume $\lambda_l, c_l$ to be exact power laws in the form convenient for a given model. Specifically, for the Circle model we take $\lambda_l = (|l| + 1)^{-\nu}$ and $|c_l|^2 = (|l| + 1)^{-\kappa - 1}$, while for the Wishart model we assume continuous analogs of these spectral distributions, namely assume them to have smooth densities supported on $[0, 1]$: $\mu_\lambda(d\lambda) = \frac{1}{\nu}\lambda^{-\frac{1}{\nu} - 1} d\lambda$ and $\mu_c(d\lambda) = \frac{1}{\nu}\lambda^{\frac{\kappa}{\nu} - 1} d\lambda$.

5.1 SCALING ANALYSIS AND ITS APPLICATION TO THE NMNO MODEL

Under power-law spectral assumptions, it is key to observe that various important quantities scale as suitable powers of the training set size $N$. In general, given a sequence $a_N$, we will say that it has scale $s$ if for any $\epsilon > 0$ we have $|a_N| = o(N^{-s + \epsilon})$ and $|a_N| = \omega(N^{-s - \epsilon})$. If only the first or the second condition holds, we say that the scale of $a_N$ is not less or not greater than $s$ (respectively). Suppose that $g_N(\lambda)$ is a sequence of functions on the spectral interval $(0, 1]$. We say that this sequence has a scaling profile $S^{(g)}(s)$, $s \ge 0$, if for any sequence $\lambda_N \in (0, 1]$ that has scale $s$ the sequence $|g_N(\lambda_N)|$ has scale $S^{(g)}(s)$. It is easy to check that the scaling profile, if it exists, is a continuous function of $s$ (see Lemma 1). The basic example of a sequence of functions having a scaling profile is $g_N(\lambda) = N^{-a}\lambda^{b}$ with constants $a, b$; in this case $S^{(g)}(s) = a + b s$. Integrals of functions with a scaling profile also have a specific scaling:

Proposition 2 (see proof in Section C). Let $g_N(\lambda)$ be a sequence of functions with a scaling profile $S^{(g)}(s)$, and let $a_N > 0$ be a sequence of scale $a > 0$. Then, the sequence of integrals $\int_{a_N}^{1} |g_N(\lambda)|\, d\lambda$ has scale $s^* = \min_{0 \le s \le a}\big(S^{(g)}(s) + s\big)$.

When Proposition 2 can be applied to the functional (8), we call the set $S_{\mathrm{loc}} = \arg\min_{0 \le s \le a}\big(S^{(g)}(s) + s\big)$ of scales which dominate the loss integral the spectral localization scales of the generalization error. In the rest of the paper, we reserve the letter $s$ to denote the scale of eigenvalues $\lambda$.

Application to NMNO. Now we apply the scaling arguments to the NMNO model (17), for which it will be easy to find explicit optimality conditions.
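Before the scaling analysis, here is a small sketch that evaluates (17) numerically for the continuous power-law measures above, with the KRR profile at its saturated-optimal regularization scale and the GF profile at the scale $s^* = \nu/(\kappa+1)$ (both scalings are discussed below). The grid sizes, the choice $\kappa = 4$, and all helper names are our own illustrative assumptions; the point is only to watch the rates of Table 1 emerge.

```python
# Numerical evaluation of the NMNO functional (17) under continuous power-law measures.
import numpy as np

def nmno_loss(h, N, nu, kappa, sigma2=1.0, n_grid=200_000):
    lam_min = float(N) ** (-nu)                       # mu_lambda([lam_min, 1]) = N
    lam = np.exp(np.linspace(np.log(lam_min), 0.0, n_grid))   # log grid over [lam_min, 1]
    w = np.gradient(lam)                              # quadrature weights ~ d(lambda)
    d_mu_lam = lam ** (-1.0 / nu - 1.0) / nu
    d_mu_c = lam ** (kappa / nu - 1.0) / nu
    noise = sigma2 / N * np.sum(h(lam) ** 2 * d_mu_lam * w)
    signal = np.sum((1.0 - h(lam)) ** 2 * d_mu_c * w)
    return noise + signal

nu, kappa = 1.5, 4.0                                  # kappa > 2*nu: KRR saturates, GF does not
Ns = np.array([10**3, 10**4, 10**5, 10**6])
krr = [nmno_loss(lambda l: l / (l + N**(-nu / (2*nu + 1))), N, nu, kappa) for N in Ns]
gf  = [nmno_loss(lambda l: 1 - np.exp(-N**(nu / (kappa + 1)) * l), N, nu, kappa) for N in Ns]
print("target rates: kappa/(kappa+1) =", kappa/(kappa+1), ", 2nu/(2nu+1) =", 2*nu/(2*nu+1))
for name, L in (("KRR, eta ~ N^{-nu/(2nu+1)}", krr), ("GF, t ~ N^{nu/(kappa+1)}", gf)):
    slope = np.polyfit(np.log(Ns), np.log(L), 1)[0]
    print(name, "-> empirical exponent:", round(-slope, 3))
```

The fitted exponents are finite-$N$ estimates and carry subleading corrections; they are meant only to visualize the KRR saturation versus the unsaturated GF rate.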
Suppose that the problem has either discrete or continuous power-law spectrum with exponents ν, κ as described above for the Circle and Wishart model. Then, λmin in equation (17) has finite scale ν. Suppose that the functions h and 1 h have scaling profiles S(h) and S(1 h). Then, applying Proposition 2 in the continuous case or analogous Proposition 5 in the discrete case, we obtain Proposition 3. The NMNO loss L(nmno)[h] has scaling Snmno[h] = min 0 s ν ν + 2S(h)(s) κ ν s + 2S(1 h)(s) i . (18) Here, = min; the first and second arguments in come from the noise and signal terms in (17), respectively. In the continuous case we use the fact that N 1λ 1 1/νh(λ)2 has scaling profile 1 s(1 + 1 ν ) + 2S(h) while λκ/ν 1(1 h(λ))2 has scaling profile s( κ ν 1) + 2S(1 h). Clearly, for any particular s only one of the values S(h)(s) and S(1 h)(s) can be strictly positive. This implies a bound on feasible loss scalings: Snmno[h] min 0 s ν ν s = κ κ+1. (19) The minimum here is attained at the scale s = ν κ+1. Moreover, an algorithm h attains the optimal scale κ κ+1 exactly when for each s [0, ν] we have 1 s ν +2S(h)(s) κ ν s+2S(1 h)(s) κ κ+1: Proposition 4. Snmno[h] κ κ+1, and the equality occurs when 1) S(h)(s) 1 2 s ν 1 κ+1 for s s = ν κ+1 and 2) S(1 h)(s) 1 ν s for s s . These results provide a simple picture of spectral algorithms close to optimality for the NMNO model: one should choose the spectral function h so that |h(λ)| N 1 2(κ+1) λ 1 2ν for λ N ν κ+1 Published as a conference paper at ICLR 2024 and so that |1 h(λ)| N κ 2(κ+1) λ κ 2ν for λ N ν κ+1 . The value s = ν κ+1 can be referred to as the loss localization scale for the optimal algorithm. Let us apply the obtained conditions to KRR and GF. KRR with regularization η has hη(λ) = λ λ+η. Suppose that η has scale sη, then hη has the scaling profile S(h)(s) = (s sη) 0, while 1 hη equals η λ+η and has the scaling profile S(1 h)(s) = (sη s) 0. Recalling that ν > 1, we see that condition 1) in Proposition 4 is satisfied iff sη s = ν κ+1. Condition 2) is more subtle: if κ 2ν 1, then it is satisfied iff sη s , but in the case κ 2ν > 1 it is rather satisfied iff sη κ 2(κ+1). We see, in particular, that in the case κ 2ν 1 conditions 1) and 2) are simultaneously satisfied iff sη = s , while in the case κ 2ν > 1 they cannot be simultaneously satisfied (and so KRR cannot achieve the optimal scaling κ κ+1) an effect called saturation (Math e, 2004; Bauer et al., 2007). See Figure 2 for an illustration. In contrast, GF ht(λ) = 1 e tλ has no saturation: choosing t to be of scale s satisfies both conditions of Proposition 4 for all ν > 1 and κ > 0. 5.2 NOISY OBSERVATIONS AND MODEL EQUIVALENCE For our two data models, Circle and Wishart, the intuition behind the NMNO ansatz can be rigorously justified by showing that this ansatz represents the leading contribution to the true loss. For instance, consider the circle model with loss functional[(11) specified by (10). The empirical eigenvalues bλk = [λk]N = λk + O(N ν) for |k| < N/2, so the scale ν of the correction |bλk λk| is higher than the scale s of the eigenvalues λk except for the eigenvalues of the highest scale s = ν. This shows that the empirical and population quantities are significantly different only on the highest spectral scale s = ν. Continuing this line of arguments, we get Theorem 2. Assume that the learning algorithm h(λ) is such that h(λ) and 1 h(λ) have scaling profiles S(h)(s) and S(1 h)(s). 
Assume that the maps $\log\lambda \mapsto \log|h(\lambda)|$ and $\log\lambda \mapsto \log|1 - h(\lambda)|$ are globally Lipschitz, uniformly in $N$. Then, if $s = \nu$ is not a localization point of the NMNO functional $L^{(\mathrm{nmno})}[h]$, the Circle model specified by (11) and the Wishart model specified by (16), (15) are equivalent to the NMNO model in the limit $N \to \infty$:
$$L^{(\mathrm{nmno})}[h] = L^{(\mathrm{circle})}[h]\,\big(1 + o(1)\big) = L^{(\mathrm{wishart})}[h]\,\big(1 + o(1)\big). \qquad (20)$$
We prove the theorem separately for the Circle and Wishart models in Sections D.2 and E.3.3. Note that the condition of equivalence is specified using only the NMNO model. Thus, if satisfied, it allows analyzing the simple functional (17) instead of the more complicated ones (16), (15) and (11). The requirement that $s = \nu$ is not a localization point is reasonable, as the heuristic derivation of the NMNO model in Section 4 included an assumption that the loss is localized at moderate scales.

5.3 NOISELESS OBSERVATIONS

We focus on the Circle model to describe the main observations, with derivations deferred to Section D and the Wishart model results deferred to Section E. Let us write the loss functional perturbatively around the optimal profile,
$$L[h] = \sum_k \Big[ |c_k|^2\big(1 + o(\tau)\big)\big(h(\hat\lambda_k) - h^*(\hat\lambda_k)\big)^2 + N^{-\kappa - 1}\big(O(1) + O(\tau^{2\nu - \kappa - 1})\big)\Big], \qquad (21)$$
with $\tau \equiv \frac{|k|+1}{N}$ as a small parameter. From this, we make several observations. First, take $h = h^*$. Then, if $\kappa < 2\nu$, the loss localizes on $s = \nu$ (i.e. the sum is accumulated at $|k| \sim N$) and has the rate $O(N^{-\kappa})$. This rate is natural and reflects that we are able to learn the target function everywhere except at the inaccessible scales $\lambda \lesssim N^{-\nu}$. Moreover, by examining the first term in (21), we see that this rate is not destroyed by learning algorithms sufficiently close to the optimal one: $|h(\hat\lambda_k) - h^*(\hat\lambda_k)|^2 = o(\tau^{\kappa})$.

The situation changes dramatically for $\kappa > 2\nu$: the optimal loss $L[h^*]$ becomes dominated by the $O(\tau^{2\nu - \kappa - 1})$ term in (21) and localizes at $s = 0$ (i.e. the sum is accumulated at $|k| \sim 1$), changing the rate to $O(N^{-2\nu})$. We call this behavior saturation since it has features similar to the KRR saturation effect for noisy observations: a transition at $\kappa = 2\nu$; a change of the error rate; localization at $s = 0$. However, noisy KRR saturation is algorithm-driven and can be removed by replacing KRR with GF, while the saturation in (21) persists even for the optimal algorithm $h^*(\lambda)$. Interestingly, the optimal loss in the saturated phase can be achieved (asymptotically) by KRR with a negative regularization: $L[h^*] = L[h_{\eta^*}](1 + o(1))$, $\eta^* < 0$. Denoting the Riemann zeta function by $\zeta(\alpha)$, we have
$$L[h_\eta] = \Big[\big(\eta N^{\nu} + 2\zeta(\nu)\big)^2 + 2\zeta(2\nu)\,(\cdots)\Big]\,(\cdots)\; N^{-2\nu}\,\big(1 + o(1)\big), \qquad (22)$$
with the minimum at $\eta^* = -2\zeta(\nu) N^{-\nu}$. The benefit of negative regularization was empirically observed in (Kobak et al., 2020) and theoretically approached in (Tsigler & Bartlett, 2023). Within our framework, KRR with negative regularization can be thought of as a special case of overlearning the observations.

[Figure 3: left, generalization error at finite $N$ and the constant in the error asymptotic $L \simeq C N^{-\#}$ as a function of $\kappa$, with the overlearning transition $\kappa = \nu - 1$ and the saturation transition $\kappa = 2\nu$ marked; right, profiles of the algorithms for $\kappa = 0.25$ and $\kappa = 1.5$; legend: optimally stopped GF, optimally regularized KRR, the optimal algorithm, interpolation.]

Figure 3: Generalization error (left) and profiles $h(\lambda)$ (right) of various algorithms applied to the Circle model with $\nu = 1.5$ and noiseless observations with different $\kappa$. Before the overlearning transition $\kappa = \nu - 1$, optimal algorithms underlearn the observations ($h(\lambda) < 1$), while they start to overlearn them ($h(\lambda) > 1$) after the transition. For details and extended discussion see Section F.
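To probe the negative-regularization claim numerically, the following is a small Monte-Carlo sketch on the Circle model (6) with noiseless observations in the saturated regime $\kappa > 2\nu$. It only runs the comparison of KRR test errors at $\eta \in \{-2\zeta(\nu)N^{-\nu}, 0, +2\zeta(\nu)N^{-\nu}\}$; the spectrum truncation `L_max`, the grid sizes, the number of lattice shifts, and all helper names are our own illustrative choices.

```python
# Monte-Carlo probe of negative-regularization KRR on the Circle model (noiseless, kappa > 2*nu).
import numpy as np

nu, kappa, N, L_max = 1.5, 4.0, 128, 2048
ls = np.arange(-L_max, L_max + 1)
lam = (np.abs(ls) + 1.0) ** (-nu)
c = (np.abs(ls) + 1.0) ** (-(kappa + 1.0) / 2.0)             # real, symmetric target coefficients
zeta_nu = np.sum(np.arange(1, 10**6, dtype=float) ** (-nu))  # Riemann zeta(nu), truncated sum

def test_error(eta, u, n_test=512):
    x_tr = u + 2 * np.pi * np.arange(N) / N                  # regular training lattice, shifted by u
    x_te = np.linspace(0, 2 * np.pi, n_test, endpoint=False)
    E_tr = np.exp(1j * np.outer(x_tr, ls))                   # E[i, l] = exp(i * l * x_i)
    E_te = np.exp(1j * np.outer(x_te, ls))
    K = ((E_tr * lam) @ E_tr.conj().T).real                  # K(x_i, x_j) = sum_l lam_l e^{il(x_i - x_j)}
    k_te = ((E_te * lam) @ E_tr.conj().T).real               # k(x)_i = K(x, x_i)
    y = (E_tr @ c).real                                      # noiseless observations y_i = f*(x_i)
    f_hat = k_te @ np.linalg.solve(K + N * eta * np.eye(N), y)
    return np.mean((f_hat - (E_te @ c).real) ** 2)

rng = np.random.default_rng(0)
for eta in (-2 * zeta_nu * N ** (-nu), 0.0, 2 * zeta_nu * N ** (-nu)):
    errs = [test_error(eta, u) for u in rng.uniform(0, 2 * np.pi, 5)]
    print(f"eta = {eta:+.2e}:  test MSE = {np.mean(errs):.3e}")
```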
In fact, overlearning also occurs in the κ < 2ν phase. In Section D.3 we show that the loss functional can be asymptotically written (see (138)) using τ = |k|+1 N for (continuous) eigenspace indexing, for example bλτ bλk. Then, denoting Hurwitz zeta function as ζ(α, x), the optimal algorithm becomes h (bλτ) = ζ(ν+κ+1) τ ζ(ν) τ ζ(κ+1) τ ζ(2ν) τ , ζ(α) x ζ(α, x) + ζ(α, 1 x). (23) The optimal algorithm (23) has an intriguing property (see also Figure 3): it interpolates the data h (λ) = 1 when κ = ν 1, and, otherwise, the sign of 1 h (λ) coincides with the sign of ν 1 κ. In other words, for hard targets with κ < ν 1, classical regularization by underfitting the observation is required for optimal performance. But, for easier targets with κ > ν 1 it becomes optimal to overlearn the training data in contrast to conventional wisdom. The same holds for Wishart model (see Sec. E.3.4). Thus, we identify the point κ = ν 1 as overlearning transition. 6 DISCUSSION We have extended results of type (1) to general spectral algorithms, as given by our loss functional (11) for Circle model and (15),(16) for Wishart model. It allows to address questions that require going beyond specific (KRR,GF) algorithms. For example, we show that the nature of saturation at κ = 2ν is different for noisy and noiseless observations, with the latter being an intrinsic property of the given kernel and data model, and can not be removed by any choice of the learning algorithm. Our formalism of spectral localization and scaling, while being compact, provides a simple and transparent picture of the variety of convergence rates under power-law spectra distributions. Also, the equivalence result between our two data models and naive model of noisy observations (17) relies on the straightforward estimation of the scale of perturbation of population quantities by finite size N of the training dataset. Thus, an interesting direction for future research would be to check whether the equivalence holds for other data and kernel models. Finally, let us mention the advantage of full loss asymptotic L = CN #(1 + o(1)) compared to the rates L = O(N #). In this work, we used the full asymptotic to obtain the shape of optimal algorithm h (λ). In the noiseless case, the knowledge of h (λ) allowed us to characterize the overlearning transition at κ = ν 1, which otherwise would be invisible on the level of the rate O(N κ). Investigating whether κ = ν 1 remains the point of overlearning phase transition for more general data models is an interesting direction for future research. Published as a conference paper at ICLR 2024 ACKNOWLEDGMENTS We thank Eric Moulines for insightful discussions during the initial stage of the work. Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52 72, 2007. Daniel Beaglehole, Mikhail Belkin, and Parthe Pandit. On the inconsistency of kernel ridgeless regression in fixed dimensions. SIAM Journal on Mathematics of Data Science, 5(4):854 872, 2023. doi: 10.1137/22M1499819. URL https://doi.org/10.1137/22M1499819. Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024 1034. PMLR, 2020. Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1), may 2021. 
doi: 10.1038/s41467-021-23103-1. URL https: //doi.org/10.1038%2Fs41467-021-23103-1. Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331 368, 2007. Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019. Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborov a. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. Advances in Neural Information Processing Systems, 34:10131 10143, 2021. Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Advances in Neural Information Processing Systems, volume 33, pp. 5850 5861, 2020. Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_ files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf. Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clement Hongler, and Franck Gabriel. Kernel alignment risk estimator: Risk prediction from training data. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 15568 15578. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper_files/paper/2020/ file/b367e525a7e574817c19ad24b7b35607-Paper.pdf. Arthur Jacot, Berfin Simsek, Francesco Spadaro, Clement Hongler, and Franck Gabriel. Implicit regularization of random feature models. In Hal Daum e III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 4631 4640. PMLR, 13 18 Jul 2020b. URL https://proceedings. mlr.press/v119/jacot20a.html. Hui Jin, Pradeep Kr. Banerjee, and Guido Montufar. Learning curves for gaussian process regression with power-law priors and targets. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Ke I9E-gso B. Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization. Journal of Machine Learning Research, 21(1):6863 6878, 2020. Published as a conference paper at ICLR 2024 Dmitry Kopitkov and Vadim Indelman. Neural spectrum alignment: Empirical study. In Artificial Neural Networks and Machine Learning ICANN 2020: 29th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 15 18, 2020, Proceedings, Part II 29, pp. 168 179. Springer, 2020. Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2020(12): 124002, dec 2020. doi: 10.1088/1742-5468/abc62b. URL https://doi.org/10.1088% 2F1742-5468%2Fabc62b. Yicheng Li, Haobo Zhang, and Qian Lin. On the saturation effect of kernel ridge regression. 
In The Eleventh International Conference on Learning Representations, 2023. URL https:// openreview.net/forum?id=t Fvr-k YWs_Y. Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over hilbert spaces. Applied and Computational Harmonic Analysis, 48(3):868 890, 2020. ISSN 1063-5203. doi: https://doi.org/10.1016/j.acha. 2018.09.009. URL https://www.sciencedirect.com/science/article/pii/ S1063520318300174. Philip M. Long. Properties of the after kernel. Ar Xiv, abs/2105.10585, 2021. Wesley Maddox, Shuai Tang, Pablo Moreno, Andrew Gordon Wilson, and Andreas Damianou. Fast adaptation with linearized neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 2737 2745. PMLR, 2021. Peter Math e. Saturation of regularization methods for linear ill-posed problems in hilbert spaces. SIAM journal on numerical analysis, 42(3):968 973, 2004. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003. Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Advances in Neural Information Processing Systems, 31, 2018. Abhishake Rastogi and Sivananthan Sampath. Optimal rates for the regularized learning algorithms under general source condition. Frontiers in Applied Mathematics and Statistics, 3:3, 2017. James B Simon, Madeline Dickens, Dhruva Karkada, and Michael Deweese. The eigenlearning framework: A conservation law perspective on kernel ridge regression and wide neural networks. Transactions on Machine Learning Research, 2023. Stefano Spigler, Mario Geiger, and Matthieu Wyart. Asymptotic learning curves of kernel methods: empirical data versus teacher student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124001, dec 2020. doi: 10.1088/1742-5468/abc61d. URL https:// dx.doi.org/10.1088/1742-5468/abc61d. Alexander Tsigler and Peter L Bartlett. Benign overfitting in ridge regression. Journal of Machine Learning Research, 24:123 1, 2023. Nikhil Vyas, Yamini Bansal, and Preetum Nakkiran. Empirical limitations of the ntk for understanding scaling laws in deep learning. Transactions on Machine Learning Research, 2023. Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning, pp. 23549 23588. PMLR, 2022. Published as a conference paper at ICLR 2024 1 Introduction 1 2 Setting 3 3 Spectral algorithms and their generalization error 4 4 Explicit forms of the loss functional 5 5 Results under power-law spectral distributions 7 5.1 Scaling analysis and its application to the NMNO model . . . . . . . . . . . . . . 7 5.2 Noisy observations and model equivalence . . . . . . . . . . . . . . . . . . . . . . 8 5.3 Noiseless observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 6 Discussion 9 References 10 A Spectral perspectives 13 A.1 Population perspective: transfer matrix . . . . . . . . . . . . . . . . . . . . . . . . 13 A.2 Empirical perspective: learning measure . . . . . . . . . . . . . . . . . . . . . . . 14 A.3 Joint population-empirical perspective: transfer measure . . . . . . . . . . . . . . 15 B Gradient-based algorithms 16 B.1 Kernel form of predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
16 B.2 Implementing spectral algorithms with a pair of Gradient Flows . . . . . . . . . . 18 C Scaling statements 18 D Circle model 20 D.1 Loss functional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 D.2 Power-law ansatz: noisy observations . . . . . . . . . . . . . . . . . . . . . . . . 23 D.3 Power-law ansatz: noiseless observations . . . . . . . . . . . . . . . . . . . . . . 25 D.3.1 Perturbative expansion of the loss functional . . . . . . . . . . . . . . . . 25 D.3.2 Optimally scaled learning algorithms . . . . . . . . . . . . . . . . . . . . 26 D.3.3 Saturated phase κ > 2ν . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 D.3.4 Non-saturated phase κ < 2ν . . . . . . . . . . . . . . . . . . . . . . . . . 30 D.3.5 The overlearning transition point . . . . . . . . . . . . . . . . . . . . . . . 30 E Wishart model 31 E.1 Resolvent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 E.1.1 Computing the resolvent moments . . . . . . . . . . . . . . . . . . . . . . 33 Published as a conference paper at ICLR 2024 E.2 Loss functional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 E.3 Power-law ansatz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 E.3.1 Solving fixed point equation . . . . . . . . . . . . . . . . . . . . . . . . . 40 E.3.2 Learning measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 E.3.3 Noisy observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 E.3.4 Noiseless observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 F Experiments 47 F.1 Figure 1: details and discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 F.2 Figure 3: details and discussion. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 F.3 Cosine Wishart model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 A SPECTRAL PERSPECTIVES In this section, we show general relations between generalization error (3), population spectra λl, cl, and learning algorithms h(λ). Recall that profile h(λ) is applied to the eigenvalues of the kernel matrix K (i.e. values of the kernel function evaluated on the training dataset DN). Therefore, relating generalization error and h(λ) involves the properties of the empirical spectrum: eigenvalues of K and the related eigendecomposition of the observation vector y. With these remarks in mind, we may say that we are dealing with population and empirical perspectives on generalization error (3). We start with population perspective in Section A.1, which is behind the classical result (1). Then, we proceed with the empirical perspective in Section A.2, which basically amounts to proving Proposition 1 and introducing learning measures ρ(2), ρ(1), ρ(ε). Finally, In Section A.3, we combine population and empirical perspectives. While probably less conceptual than the first two perspectives, the joint population-empirical perspective is an essential step in our derivation of the loss functional for the Wishart model. A.1 POPULATION PERSPECTIVE: TRANSFER MATRIX A central object for the population perspective is the transfer matrix b Tll introduced explicitly, for example, in (Simon et al., 2023) in the context of KRR. Specifically, let us decompose the prediction (7) over kernel eigenfunctions ϕl(x) as bf(x) = P l bclϕl(x). 
Then, the prediction coefficients bcl can be written as l b Tll cl + σbεl, b Tll = λlϕT l K 1h 1 N K ϕl , bεl = λlϕT l K 1h 1 N K ε, (24) where ϕl is the vector of eigenfunctions computed at the dataset inputs (ϕl)i = ϕl(xi), and (ε)i = εi is the vector of observation noise. Note that the transfer matrix b Tll has a clear interpretation of the rate at which the information cl contained in spectral component l is transferred to spectral component l. The population noise component bεl describes how much of the of the observation noise ε was learned in the l-th population spectral component. The population loss (3) (sometimes we use this term as a synonym to generalization error) is straightforwardly expressed through the first and second moments of the transfer matrix, and the variance of population noise components bcl cl 2 = 1 l1,l2 cl1 X l T (2) l1l l l2 2T (1) l1l2 + δl1l2 cl2 + σ2 l ε(2) l , (25) Published as a conference paper at ICLR 2024 T (1) ll = EDN h b Tll i , (26) T (2) l1l 1l 2l2 = EDN h b Tl 1l1 b Tl 2l2 i , (27) ε(2) l = NEDN,ε[(bεl)2] = EDN λ2 l 1 N ϕT l 1 The representation (25) makes the most explicit dependence of the loss on population coefficients cl, while the dependence on the learning algorithm h(λ) and population eigenvalues λl is hidden inside moments T (1) ll , T (2) l1l 1l 2l2 of the transfer matrix and noise variance ε(2) l . Yet, for the case of KRR the results (1) shows that the dependence on λl can be made fairly explicit. A.2 EMPIRICAL PERSPECTIVE: LEARNING MEASURE As this perspective focuses on empirical spectrum of kernel matrix and observation vector, we start with writing eigendecomposition of K, f = y σε and ε as k=1 bλkuku T k , u T k uk = δkk , (29) k=1 bckuk, (30) k=1 bεkuk, (31) where in the last line bεk are i.i.d. normal Gaussian because orthogonal transformation to empirical eigenbasis {uk}N k=1 leaves the distribution of isotropic Gaussian vectors ε N(0, I) unchanged. Then, inserting spectral decomposition (29) of the empirical kernel matrix into the prediction (7) gives bfh(x) = k(x) T " 1 N k=1 uku T k h(bλk) k=1 bεk h(bλk) = Z h(λ)bρ(f)(x, dλ) + σ Z h(λ)bρ(ε)(x, dλ), where bρ(f) N (x, dλ) and bρ(ε) N (x, dλ) are target and noise learning measures: bρ(f)(x, dλ) = k(x) T " 1 N uku T k λ δbλk bρ(ε)(x, dλ) = 1 k(x) T uk λ δbλk. (34) The target learning measure defines what pattern is learned from the target function at the neighborhood dλ of the empirical spectral position λ. Similarly, the noise learning measure defines the patterns of the noise learned in the neighborhood of λ. Published as a conference paper at ICLR 2024 As for the population perspective, we substitute the expression of prediction in terms of learning measures (32) into the population loss (3) Z h(λ)bρ(f)(x, dλ) + σ Z h(λ)bρ(ε)(x, dλ) f (x) " Z Z h(λ1)h(λ2)EDN bρ(f)(x, dλ1), bρ(f)(x, dλ2) 2 Z h(λ)EDN f (x), bρ(f)(x, dλ) + f (x), f (x) f (x) Z h(λ)ρ(f)(x, dλ), Z h(λ)Eερ(ε)(x, dλ) Z Z h(λ1)h(λ2)EDN,ε bρ(ε)(x, dλ1), bρ(ε)(x, dλ2) # Now, observe that the term in the second-to-last line in (35) is linear in noise learning measure averaged over ε which is zero: Eερ(ε)(x, dλ) = 1 k(x) T uk λ δbλk Eεbεk = 0 since Eεbεk = 0. Similarly, taking the expectation over observation noise ε helps to simplify the last term Eε bρ(ε)(x, dλ1), bρ(ε)(x, dλ2) = X k1,k2 δbλk1(dλ1)δbλk2(dλ2) u T k1k(x), u T k2k(x) Nλ1λ2 Eεbεk1bεk2 = δλ1(dλ2) 1 k δbλk(dλ2) u T k k(x) 2 where we have used Ebεk1bεk2 = δk1k2. 
Now, one can recognize the loss functional stated in Proposition 1: the first 3 terms and the last term of (35) correspond to the respective terms of (8). In other words, the learning measures announced in Proposition 1 are given by ρ(1)(dλ) = EDN h f (x), bρ(f)(x, dλ) i , (37) ρ(2)(dλ1, dλ2) = EDN h bρ(f)(x, dλ1), bρ(f)(x, dλ2) i , (38) ρ(ε)(dλ) = EDN u T k k(x) 2 Again, the loss functional (8) represents the empirical perspective on the generalization error, making the dependence on the learning algorithm h(λ) very explicit. But, the dependence on the problem s kernel structure and target function is hidden inside measures ρ(1)(dλ) and ρ(2)(dλ1, dλ2). A.3 JOINT POPULATION-EMPIRICAL PERSPECTIVE: TRANSFER MEASURE To combine to perspective described above, consider a l-th spectral component of learning measure bρ(f) l (dλ) ϕl(x), bρ(f)(x, dλ) . Then, inserting decomposition (4) of the target function into target learning measure (33) allows to write bρ(f) l (dλ) ϕl(x), bρ(f)(x, dλ) = X l cl bρ(f) ll (dλ), (40) bρ(f) ll (dλ) = λl ϕT l uk u T k ϕl δbλk (41) can be naturally called a transfer measure. Now, we insert decomposition (40) into (37) and (38), as well as population eigendecomposition (4) into (39). The scalar products in (37) and (38) become f (x), bρ(f)(x, dλ) = X l clbρ(f) l (dλ) = X l,l clcl bρ(f) ll (dλ), bρ(f)(x, dλ1), bρ(f)(x, dλ2) = X l bρ(f) l (dλ1)bρ(f) l (dλ2) = X l,l1,l2 cl1cl2 bρ(f) ll1 (dλ1)bρ(f) ll2 (dλ2). (42) Published as a conference paper at ICLR 2024 As for the norm u T k k(x) 2 in (39), we can use D k(x), k(x) T E = X l λ2 l ϕlϕT l . (43) Combining the expressions above and noting that f (x) 2 = P l c2 l gives yet another representation of the population loss in terms of the first and second moment of the transfer measure and population decomposition of noise variance measure. l1,l2 cl1 Z h(λ1)h(λ2) X l ρ(2) l1l l l2(dλ1, dλ2) 2 Z h(λ)ρ(1) l1l2(dλ) + δl1l2 cl2 (44) Z h(λ)2ρ(ε) l (dλ), (45) ρ(1) ll (dλ) = EDN h bρ(f) ll (dλ) i , (46) ρ(2) l1l 1l 2l2(dλ1, dλ2) = EDN h bρ(f) l 1l1(dλ1)bρ(f) l 2l2(dλ2) i , (47) ρ(ε) l (dλ) = EDN " λ2 l λ2 1 N ϕT l uk 2δbλk If the moments of transfer measure ρ(1) ll (dλ), ρ(2) l1l 1l 2l2(dλ1, dλ2) and noise variance measure ρ(ε) l (dλ) are known, the representation (44) connects spectral distribution λl, cl of the problem and learning algorithm h(λ) with the population loss, thus justifying the name joint population-empirical spectral perspective. B GRADIENT-BASED ALGORITHMS The purpose of this section is two-fold. First, in Section B.1, we support our examples of h(λ) provided in Section 3 with the respective derivations. This amounts to show that for linear models trained with a gradient-based algorithm, the predictor during optimization can be written in the form (7) with a specific choice of h(λ). Second, in Section B.2 try to connect general spectral algorithms specified by some profile h(λ) with gradient-based optimization, which was not included in the main paper due to the space constraints. For that, we provide a simple construction based on a pair of GF processes. B.1 KERNEL FORM OF PREDICTORS To consider gradient-based optimization for the kernel method setting discussed in the main paper, we need to introduce a linear parametric model bf(w, x) whose parameters w will be updated during the optimization process. Starting with a kernel K(x, x ) with population decomposition (4), let us define the model features ψl(x) = λlϕl(x). 
Then, combining the features in a vector ψ(x) RP , ψ(x) l = ψl(x), the linear model is defined as bf(w, x) = w, ψ(x) , w RP . (49) For positive definite kernels P = , and both model s features and parameters belong to RKHS HK of the kernel K: w, ψ(x) HK. The (neural) tangent kernel (Jacot et al., 2018) of the model (49) is given by NTK b f(x, x ) = ψ(x), ψ(x ) = P l λlϕl(x)ϕl(x ) = K(x, x ), thus reproducing our original kernel we have started with. Note that one can go in the opposite direction: start from the linear (49) and then define a kernel method specified by the tangent kernel (NTK) of the linear model. An especially interesting example of the latter direction is a (non-linear) neural network f(θ, x) linearized at θ0 resulting in flin(θ, x) = f(θ0, x) + θ θ0, θf(θ0, x) . If constant prediction f(θ0, x) is ignored, the linearized neural network is also described by (49) with gradients as the model features ψ(x) = θf(θ0, x) and the displacement from θ0 as model parameters w = θ θ0. Published as a conference paper at ICLR 2024 To finalize the connection between the parameter-based setting and kernel-based setting from the main paper, linear model (49) needs to be trained by minimizing quadratic loss on train dataset DN LDN (w) 1 2N bf(w, xi) yi 2 w, ψ(xi) w , ψ(xi) 2 = 1 2(w w )T H(w w ), where H = 1 N PN i=1 ψ(xi) ψ(xi) is the Hessian of the train loss. In the following, it will be convenient to denote Ψ the matrix of features calculated on the training dataset Ψ li = ψl(xi), and use finite-dimensional notation for inner and outer product in the parameter space: i.e. the Hessian H = 1 N ΨΨT and empirical kernel matrix K = ΨT Ψ. In (50), we have assumed that there exists a parameter value w so that the model (49) completely fits the observations ΨT w = y. Considering the typical case P > N, this amounts to a feature matrix having full rank rank(Ψ) = N 1. Now, let us proceed with showing how gradient-based optimization fits into the family of spectral algorithms given by (7). We start with the basic example of vanilla Gradient Descent with learning rate α, having parameter update rule wt+1 = wt α w LDN (wt). For the quadratic loss (50), this reduces to wt+1 = wt αH(wt w ) = w + (I αH)(wt w ), or, equivalently, wt+1 w = (I αH)t(w0 w ) = pt(H)(w0 w ). (51) Here we introduced the polynomial pt(λ) = (1 αλ)t that will prove a useful notation in the following and is related to the profile ht(λ) as pt(λ) = 1 ht(λ). To obtain the representation (7) for the learned prediction bft(x) = wt, ψ(x) we additionally need to set w0 = 0. Then, bft(x) = w + pt(H)(w0 w ), ψ(x) = I pt(H) w , ψ(x) = ht(H)w , ψ(x) (52) Next, note that polynomial pt(λ) is often called residual polynomial due to its normalization at λ = 0 as pt(0) = 1, or equivalently ht(0) = 0. The latter implies that we can write ht(λ) = λqt(λ) with some polynomial qt(λ) of degree t 1. Using an algebraic identity JT Jq(JT J) = JT q(JJT )J for arbitrary matrix J and polynomial q allows us to finally obtain (7) bft(x) = Hqt(H)w , ψ(x) = 1 N ΨΨT qt( 1 N ΨΨT )w , ψ(x) N ΨT Ψ)ΨT w , ψ(x) (1) = k(x)T 1 N )y = k(x)T K 1 1 N K)y = k(x)T K 1ht( 1 where in (1) we have used that ΨT w = y and Ψ, ψ(x) = k(x)T . Thus, we have shown that for GD with learning rate α representation (7) holds with h(λ) = 1 (1 αλ)t. The argument above can be easily extended to the case of Gradient Flow (GF). First note that under GF dynamics d dtwt = LDN (wt) the parameters are wt w = e Ht(w0 w ), thus implying pt(λ) = e λt and ht(λ) = 1 e λt. 
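As a quick sanity check of the GD case just derived (before the argument is extended to GF below), the following sketch compares $t$ steps of plain GD on a toy instance of the linear model (49)-(50), started from $w_0 = 0$, with the kernel-form prediction (7) for $h_t(\lambda) = 1 - (1 - \alpha\lambda)^t$ applied to $K/N$. The random features, the sizes, the step count, and all helper names are our own illustrative choices.

```python
# Numerical check: GD on the linear model reproduces the spectral-algorithm prediction (7).
import numpy as np

rng = np.random.default_rng(1)
P, N, N_test, alpha, t_steps = 500, 60, 20, 0.5, 300
lam = np.arange(1, P + 1) ** -1.5
Psi = np.sqrt(lam)[:, None] * rng.standard_normal((P, N))        # psi_l(x_i) = sqrt(lam_l) * phi_l(x_i)
Psi_test = np.sqrt(lam)[:, None] * rng.standard_normal((P, N_test))
w_star = rng.standard_normal(P)
y = Psi.T @ w_star                                               # observations exactly fitted by w_*

# (a) gradient descent on L(w) = 1/(2N) ||Psi^T w - y||^2, starting from w_0 = 0
w = np.zeros(P)
for _ in range(t_steps):
    w -= alpha / N * Psi @ (Psi.T @ w - y)
pred_gd = Psi_test.T @ w

# (b) kernel form (7) with h_t(lambda) = 1 - (1 - alpha*lambda)^t applied to K/N
K = Psi.T @ Psi
k_test = Psi_test.T @ Psi                                        # k(x)_i = <psi(x), psi(x_i)> = K(x, x_i)
evals, U = np.linalg.eigh(K / N)
evals = np.clip(evals, 1e-12, None)                              # guard against round-off
h = 1.0 - (1.0 - alpha * evals) ** t_steps
pred_spec = k_test @ (U @ (h / (N * evals) * (U.T @ y)))
print("max |GD - spectral| =", np.abs(pred_gd - pred_spec).max())
```

Up to floating-point error, the two predictions coincide, which is exactly the statement of the derivation above.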
Then, for qt(λ) = ht(λ) λ we also have JT Jq(JT J) = JT q(JJT )J, which can be seen, for example, by from the Taylor expansion qt(λ) = t P n=0 ( tλ)n (n+1)! . The rest of the argument is unchanged. Gradient descent is also easily extended to arbitrary first-order iterative optimization algorithms. For all such algorithms, the parameter change wt w0 on iteration t belongs to an order t Krylov subspace: wt w0 span{H(w0 w ), H2(w0 w ), . . . , Ht(w0 w )} (see, e.g. (Nesterov, 2003), page 42). This is equivalent to saying that wt+1 w = pt(H)(w0 w ) with pt(λ) being an arbitrary residual polynomial (i.e. normalized as pt(0) = 1). Since in our GD argument we did not use any property of its pt(λ) except for residual normalization, the argument continues to hold, making the representation (7) for all first-order iterative optimization algorithms. 1A similar assumption was implicitly made in the main paper in the definition of the spectral algorithm (7). Indeed, the existence of inverse K 1 also requires rank(Ψ) = N. This is a natural assumption for positive definite kernels K: empirical kernel matrix has full rank if evaluated on a set of distinct inputs xi, which in turn happens almost surely for typical generation processes of DN such as i.i.d. drawn xi. In principle, one can also consider the case of non-full rank of K, or alternatively non-existence of w completely fitting the observations, and replace K 1 in (7) with pseudoinverse. For simplicity, we leave such cases to future work. Published as a conference paper at ICLR 2024 B.2 IMPLEMENTING SPECTRAL ALGORITHMS WITH A PAIR OF GRADIENT FLOWS Recall that in the section above we get GD residual pt(λ) = 1 ht(λ) = e λt. An alternative way to get this would be to declare wt w = pt(H)(w0 w ) and then rewrite original Gradient Flow ODE d dtwt = w LDN (wt) in terms of the residual pt(λ) as tpt(λ) = λpt(λ), p0(λ) = 1, (54) with an immediate solution pt(λ) = e λt. Now, suppose we are given some target profile eh(λ) with respective residual ep(λ) = 1 eh(λ) = 0 that needs to be implemented. Our strategy is to design such GF process with residual qt(λ) that converge to the desired profile at long training times q (λ) limt qt(λ) = ep(λ). The basic GF process given by (54) always converges to full interpolation of the training data: limt pt(λ) = limt e λt = 0. We propose to overcome this interpolation property by using two optimization processes the first process is standard GF (54) converging to 0, and the second GF process will use gradients of the first GF to converge to ep(λ) = 0. These two process are defined by a pair of ODEs d dtwt = w LDN (wt), d dtut = 1 + g(t) w LDN (wt), (55) where wt and ut are the parameters of the first and the second process respectively, and the initial conditions are assumed to be identical w0 = u0. Associating residuals pt(λ), qt(λ) to parameters w, u, ODE system (55) is rewritten as tpt(λ) = λpt(λ), tqt(λ) = 1 + g(t) λpt(λ), p0(λ) = q0(λ) = 1. (56) Here g(t) is a function controlling the final solution q (λ), and therefore needs to be chosen based on the desired solution ep(λ). At a given control function g(t) the final solution q (λ) can be easily found by integrating the second equation as R 0 tqt(λ) = q (λ) q0(λ) and substituting the solution of the basic GF pt(λ) = e λt into tqt(λ). Then, setting the final solution to the desired value q (λ) = ep(λ) leads to an integral equation on the control g(t) Z 1 + g(t) λe λtdt = 1 ep(λ). 
(57) Since R 0 λe λtdt = 1, we cancel 1 from both sides, arriving at Laplace transform of g(t) on the left-hand side Z 0 g(t)e λtdt = ep(λ) Thus, choosing g(t) as an inverse Laplace transform of ep(λ) λ implements the desired spectral algorithm eh(λ). C SCALING STATEMENTS In the section, we give rigorous versions of the scaling statements outlined in Section 5.1. In the end of the section we also provide Proposition 5 as a discrete analog of Proposition 2, which will be required for the Circle model. Intuitive derivation. Before proceeding with rigorous proofs, let us give a simple intuition behind the scale of sums and integrals stated in Propositions 2 and 5. We start with the integral case, following notations and assumptions from Proposition 2. To find the scale of the integral R 1 a N |g N(λ)|dλ, let us divide the range of scales s [0, a] into many small segments and look at the contribution from a single segment [s0, s0+ε], corresponding to the interval of eigenvalues Λs0 = [N ελ0, λ0], λ0 = N s0. Due to the continuity of scaling profile Sg(s) (see Lemma 1 below), we can neglect the change of the g N(λ) on Λs0. Approximating the length of Published as a conference paper at ICLR 2024 eigenvalues interval as |Λs0| λ0, we can estimate the contribution to the integral from [s0, s0 + ε] as |g N(λ)| |Λs0| |g N(λ)| λ0 N Sg(s0) s0, (59) which is exactly the expression under the minimum in Proposition 2. To see that only the scales which minimize G(s) = Sg(s) + s give a non-vanishing contribution to the final result, take two scales s1, s2 such that G(s2) G(s1) = δ > 0. Then, according to (59), the contribution from scale segment [s2, s2 +ε] will be N δ times smaller than from the scale segment [s1, s1 +ε], and therefore will vanish in the limit N . We can summarize the above with the following simple heuristic: replace dλ with λ under the integral and maximize the resulting expression to get an estimation of the integral scale. The case of a discrete sum goes along the same lines, leading to the heuristic of replacing the sum P kwith current index k: P k |g N(λ(N) k )| k |g N(λ(N) k )|, and then maximizing the resulting expression. Continuity of scaling profiles. Lemma 1. The scaling profile Sg(s), if exists, is a continuous function of s. Proof. Suppose that Sg exists but is not continuous. Then there exists s 0 and a sequence sm s such that |Sg(sm) Sg(s )| > c for all m and some positive constant c. Suppose w.l.o.g. that Sg(sm) < Sg(s ) c for all m. Consider some fixed m and choose some sequence λ(m) N of scale sm. In particular, we then have λ(m) N < N sm+1/m and λ(m) N > N sm 1/m (60) for N > Nm with some Nm. By definition of the scaling profile, we also have |g N(λ(m) N )| > N S(g)(sm) c/2 > N S(g)(s )+c/2 (61) for N large enough; say for N > Nm with the same Nm as before. We can assume w.l.o.g. that Nm is monotone increasing. Now define the sequence λN by λN = λ(m) N , Nm < N Nm+1. (62) This sequence has scale s , but |g N(λN)| > N S(g)(s )+c/2 for all sufficiently large N, contradicting the fact that that g N(λN) must have scale S(g)(s ). Proof of Proposition 2. Part 1. Let us first show the part of the statement that says that for any ϵ > 0 Z 1 a N |g N(λ)|dλ = o(N s +ϵ). (63) Suppose that this is not the case, and there is ϵ > 0 and a subsequence Nm such that Z 1 a N |g N(λ)|dλ > N s +ϵ m . (64) Divide the interval [0, a] into finitely many subintervals Ir = [br, br+1] of length less than ϵ/2. For each subinterval Ir, define λm,r = arg max λ: log Nm λ Ir |g Nm(λ)|. 
(65) Note that for each r the sequence m 7 log Nm λm,r takes values in the compact interval Ir, so it has a limit point s r Ir. By going to a subsequence, we can assume w.l.o.g. that the limit point is unique, i.e. is the limit. Then we can define for each r the sequence λ(r) N by setting λ(r) N = λm,r if N = Nm and somehow complementing it for N = Nm so that the sequence λ(r) N has scale s r. By scaling assumption, we then have g N(λ(r) N ) = o(N S(g)(s r)+ϵ/2) (66) and in particular g Nm(λm,r) = o(N S(g)(s r)+ϵ/2 m ). (67) Published as a conference paper at ICLR 2024 By definition of λm,r, Z 1 a Nm |g Nm(λ)|dλ = Z 0 log Nm a Nm |g Nm(N q m )|d N q m dq dq X r g Nm(λm,r)N br m . (68) It follows that Z 1 a N |g Nm(λ)|dλ = o X r N S(g)(s r)+ϵ/2 m N br m (69) r N (S(g)(s r)+s r)+ϵ m (70) = o(N s +ϵ m ) (71) contradicting assumption (64). Part 2. Now we prove the opposite inequality: Z 1 a N |g N(λ)|dλ = ω(N s ϵ). (72) Let q = arg min0 s a(S(g)(s) + s). By continuity of S(g), there exists an interval I = [q δ, q + δ] where S(g)(s) < S(g)(q ) + ϵ/2. Arguing as in Part 1, we then deduce from the scaling assumption on S(g) that min λ: log N λ I |g N(λ)| = ω(N S(g)(q ) ϵ). (73) It follows that Z 1 a N |g N(λ)|dλ (N q +δ N q δ) min λ: log Nm λ I |g N(λ)| (74) = ω(N S(g)(q ) ϵN (q δ)) (75) = ω(N S(g)(q ) q ϵ) (76) = ω(N s ϵ) (77) as desired. This completes the proof of Proposition 2. Discrete spectrum. Suppose that {λ(N) k (0, 1]}N k=1 is a N-dependent, size-N multiset (with possibly repeated elements) such that the sequence of the respective distribution functions FN(λ) = |{k 1, N : λ < λk 1}| has a scaling profile S(F )(s). Observe, in particular, that in the Circle model with the population eigenvalues λl = (|l|+1) ν the distribution of the empirical eigenvalues bλk (as well as of the population eigenvalues λk for k restricted to the interval ( N/2, N/2]) has the scaling profile S(F )(s) = s Proposition 5. Assuming that min{(λ(N) k )N k=1} = ω(N a) with some a > 0 and S(F )(s) is strictly monotone decreasing, the sequence of sums PN k=1 |g N(λ(N) k )| has scale s = min0 s a(S(g)(s)+ S(F )(s)). The proof of this proposition is analogous to the proof of Proposition 2. D CIRCLE MODEL Here, we give all our derivations related to Circle model. D.1 LOSS FUNCTIONAL In this section we provide the proof of Theorem 1. Published as a conference paper at ICLR 2024 The main technical motivation behind our Circle model is to simplify the empirical kernel matrix K. Indeed, K becomes a symmetric circulant matrix l= λlei 2πl(i j) To establish the relationship between the (complex) eigendecomposition of empirical kernel matrix (29) and observation vector (y)i = yion one side, and the population spectra λl, c+,l, c ,l on the other side, we write n= λk+Nnei 2π(k+Nn)(i j) k=0 bλkei 2πk(i j) This leads to empirical eigenvalues n= λk+Nn, (uk)i = 1 Note that empirical eigenvalues bλk turned out to be non-random, which is a consequence of the regularity of training dataset DN. Observe that each empirical eigenvalue except bλ0 is twice degenerate: bλk = bλN k for 0 < k < N/2. This is the consequence of the fact that we took the kernel function K(x x ) to be even. Turning to the target function, we have yi = σεi + X l cleil(u+ 2πi The respective empirical coefficients in the decomposition 1 k bykuk are i=0 yie i 2πki n= ck+Nsei(k+Nn)u. (82) Here εk = 1 i εie i 2πki N C are complex Gaussian random variables. They are i.i.d. up to a few dependencies, for example, εk = εN k (overline denotes complex conjugation here). 
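Before continuing, the aliasing relation for the empirical eigenvalues derived above from the circulant structure (29) is easy to verify numerically. The following is a minimal NumPy sketch (parameters are illustrative, the population series is truncated, and we read the empirical eigenvalues $\hat\lambda_k$ as the eigenvalues of $\mathbf{K}/N$, consistently with the normalization used for the spectral algorithm): it diagonalizes the kernel matrix on the regular grid and compares with the aliased sums $\sum_n \lambda_{k+Nn}$.

```python
import numpy as np

nu, N, L = 2.0, 16, 4000                            # power-law exponent, grid size, truncation
l = np.arange(-L, L + 1)
lam = (np.abs(l) + 1.0) ** -nu                      # population eigenvalues lambda_l = (|l|+1)^-nu

x = 2 * np.pi * np.arange(N) / N                    # regular grid; K(x_i - x_j) does not depend on u
diff = x[:, None] - x[None, :]
K = (np.exp(1j * diff[..., None] * l) * lam).sum(-1).real   # K_ij = sum_l lam_l e^{il(x_i - x_j)}

emp = np.sort(np.linalg.eigvalsh(K / N))            # empirical eigenvalues of K/N
aliased = np.sort(np.array([lam[l % N == k].sum() for k in range(N)]))
print(np.allclose(emp, aliased))                    # True: hat-lambda_k = sum_n lambda_{k+Nn}
```

The double degeneracy $\hat\lambda_k = \hat\lambda_{N-k}$ for $0 < k < N/2$ is also visible in the sorted output.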
Therefore, later we will use the definition of εk in terms of εi to avoid accurate formulation of its statistics. Finally, let us use the obtained eigendecomposition of the empirical kernel and target to write an expression for the prediction components bcl 0 bf(x)e ilxdx = λl i=0 eik 2πi N e il(u+ 2πi bλkl h(bλkl)e ilubykl, where kl = l mod N. Note that representations (83) and (82) define transfer measure bρ(f) ll (dλ) introduced in Section A.3. Due to the regular structure of the training dataset in our setting, the transfer measure also has a specific regular structure. 1. In the basis of Fourier harmonics {eilx} l= , the information is transferred from l to l only if l l is divisible by N (we can say that such l, l are compatible). 2. If l, l are compatible, the information is transferred only through a single empirical eigenvalue λN,kl. Now, we start to derive the specific form of (44) for our translation-invariant setting. It will be convenient to divide the contribution to the final loss LN[h] into a bias term and two variance terms responsible for randomness w.r.t. to u and εi: L[h] = LBias[h] + LVar,u[h] + LVar,ε[h]. (84) Published as a conference paper at ICLR 2024 Bias term. This term is the loss of the mean prediction with population coefficients Eu,ε[bcl] = λl bλkl h(bλkl) Z 2π n= cl+Nnei(l+Nn)u du bλkl h(bλkl)cl, (85) whose substitution into f (x) Eu,ε[ bf(x)] 2 gives LBias N [h] = 1 bλkl h(bλkl) |cl|2. (86) Here we see that how well the component l can be learned depends on the closeness between λl and bλkl: for l N we have kl = l and typically λl bλl (assuming that λl decay fast with l). Thus, the target function can be learned well by setting h(λ) = 1. Noise variance term. Since the prediction is linear in the noise εi, its contribution to the loss comes only from the terms which are quadratic in εi. Then, we define the noise variance by the contribution of such terms to the loss. Denoting the contribution of the noise to the prediction as bf (ε), we calculate its second moment Eu,ε h bf (ε) bf (ε) i = i1,i2 e ikl 2π(i1 i2) N Eε[εi1εi2] = where we used that | bf (ε)|2 is independent of u making expectation Eu trivial. The respective contribution to the loss is LVar,ε[h] = 1 Dataset variance term. This part is simply the contribution of the rest of the variance prediction. The respective second moment is Eu,ε h (bcl bc(ε) l )(bcl bc(ε) l ) i = n1,n2= cl+Nn1cl+Nn2ei N(n1 n2)u du n= |cl+Nn|2. Subtracting the mean (85) from the second moment, we get the dataset variance loss term LVar,u[h] = 1 n =0 |cl+Nn|2. (90) Final expression. Let us First combine bias and dataset variance terms. LVar,u[h] + LBias N [h] = 1 |cl|2 2|cl|2 h(bλkl) bλkl λl + λl h(bλkl) n= |cl+Nn|2 (|cl|2 2|cl|2 h(bλkl) bλkl λl + |cl|2 h(bλkl) where we have rearranged the sum over l with fixed kl in the quadratic term. Adding the noise variance term, we are now able to write the final expression for the generalization error Published as a conference paper at ICLR 2024 where we have used N-deformations (10) notation for the sum P n= λ2 l+Nn λ2 l N. As a last step, we observe that the sum over indices l the same fixed kl = k mod N again leads to N-deformations (10), allowing to rewrite the full sum over l Z into the sum over kl 0, 1, . . . , N 1. Then, denoting kl simply as k, we have N λk 2 N h2(bλk) 2 N h(bλk) + |ck|2 which proves Theorem 1. Let us now describe the optimal learning algorithm (9). 
Note that the functional (93) is fully local, and therefore the optimal algorithm is well defined and is given by pointwise minimization at each bλk. The resulting optimal algorithm h (bλk) is h (bλk) = [λk|ck|2]N[λk]N σ2 N [λ2 k]N . (94) It may also be convenient to give a completed square form of the loss functional L[h] = δL[h h ] + L[h ], δL[h ] 0 that separates the minimal possible error L[h ] with an excess positive error if the algorithm is non-optimal h h = 0 h(bλk) h (bλk) 2 + [|ck|2]N [λk|ck|2]2 N σ2 Finally, we note that one can use translation symmetry k k + N of N-deformations (10) in order to shift the summation in (93) to values N 2 , for example, k { N 2 , . . . , N Like in (21), we denote such summation range simply as P N 2 . The purpose of this shift is to put all high empirical spectral quantities λa k|ck|2b N in the region |k| N, allowing to write, for example, bλk = O (|k| + 1) ν . D.2 POWER-LAW ANSATZ: NOISY OBSERVATIONS Now, we turn to a more detailed analysis of the Circle model. As in the main paper, we separately consider the noisy σ2 > 0 case in this section and the noiseless σ2 = 0 case in the next Section D.3. Recall that we adapt basic power-law spectrum λl = l ν, c2 l = l κ 1, l 1 since for the circle model the population is naturally indexed by the whole integer set Z, leading to λl = (|l| + 1) ν, c2 l = (|l| + 1) κ 1, l Z. (96) The purpose of the current section is to show the equivalence of circle model and NMNO models, thus proving the respective part of Theorem 2. The intuition behind NMNO relies on the closeness of population and empirical spectral distributions in the eigenspaces distant from the spectrum edge |l| N. Thus, we need to compare N-deformations [λa k|ck|2b]N with their population counterparts λa k|ck|2b, considering the values |k| N 2 relevant for the loss functional (11). From the definition (10) we get λa k|ck|2b N = λa k|ck|2b + X n =0 λa k+Nn|ck+Nn|2b = λa k|ck|2b + N aν b(κ+1) n1 + 1+k N aν+b(κ+1) + N aν b(κ+1) n2 + 1 k N aν+b(κ+1) = λa k|ck|2b + O(N aν b(κ+1)) = λa k|ck|2b 1 + O(τ aν+b(κ+1)) , where, in the last line, we have assumed that aν + b(κ + 1) > 1 so that both n1 and n2 series are converging. Also, we have recalled the notation τ = |k|+1 N introduced in Section 5.3. Published as a conference paper at ICLR 2024 It turns out that the relation λa k|ck|2b N = λa k|ck|2b + O(N aν b(κ+1)) is sufficient to establish equivalence to NMNO model. For that, write the Circle model loss functional (11) as NMNO functional L(nmno)[h], defined in (17), plus two corrections terms L[h] = L(nmno)[h] + δLalg[h] + δLcoeff[h], (98) where the correction due to the displacements between the population and empirical eigenvalues in the argument of the learning algorithm is δLalg[h] = 1 N h2(bλk) h2(λk) + c2 l 1 h(bλk) 2 1 h(λk) 2 , (99) and the correction due to difference between population λa k|ck|2b and empirical λa k|ck|2b N coefficients in the loss functional δLcoeff[h] = 1 N λk 2 N λk 2 N h2(bλk) + |ck|2 N λk 2 N |ck|2 h2(bλk) N |ck|2 h(bλk) + |ck|2 From this point, our strategy is to specify the scales of all the terms of NMNO functional L(nmno)[h], and of the two corrections δLalg[h], δLcoeff[h]. Then, we can invoke the scaling argument of Proposition 2 (actually its discrete version in Proposition 5) to show that all the correction terms give negligible contribution to the loss. First, recall from Section 5.1 that the scaling of NMNO terms is given by N h2(λk) i = 1 + 2S(h)(s), (101) S h c2 k 1 h(λk) 2i = κ+1 ν s + 2S(1 h)(s). 
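The estimate (97) is straightforward to probe numerically. Below is a minimal sketch (the aliasing sum is truncated; the exponents and the probed mode $k$ are illustrative): it evaluates several N-deformations $[\lambda_k^a |c_k|^{2b}]_N$ and compares the relative correction with $\tau^{a\nu + b(\kappa+1)}$.

```python
import numpy as np

nu, kappa, N = 1.5, 1.0, 512
n = np.arange(-200, 201)                            # truncated aliasing index n

def n_deformation(k, a, b):
    """[lambda_k^a |c_k|^{2b}]_N = sum_n (|k+Nn|+1)^(-a*nu - b*(kappa+1))."""
    return np.sum((np.abs(k + N * n) + 1.0) ** (-a * nu - b * (kappa + 1)))

k = N // 8                                          # a mode with tau = (|k|+1)/N well below 1
tau = (abs(k) + 1) / N
for a, b in [(1, 0), (2, 0), (1, 1), (0, 1)]:
    pop = (abs(k) + 1.0) ** (-a * nu - b * (kappa + 1))
    rel = n_deformation(k, a, b) / pop - 1.0        # relative correction to the population value
    print(a, b, rel, tau ** (a * nu + b * (kappa + 1)))   # same order: rel / tau^alpha stays O(1)
```

In all four cases the relative deviation is of the order $\tau^{a\nu + b(\kappa+1)}$, as claimed in (97).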
(102) Next, we proceed to the δLalg[h] correction. To bound its terms one needs a certain smoothness assumption on h(λ). Currently, in Theorem 2, we require that the maps log λ 7 log |h(λ)| and log λ 7 log |1 h(λ)| are globally Lipschitz, but maybe a weaker smoothness condition is possible. To understand the application of this condition, take some function g(x) such that a mapping log x log g(x) is Lipschitz with constant C, i.e. log g(x+ x) g(x) C log x+ x x . Then, taking any constant C > C, there is δ > 0 such that for all | x x | < δ we have g(x+ x) g(x) < C g(x) x x . Coming back to the difference between empirical and population eigenvalues and using (97) gives bλk λk λk = O(τ ν), and therefore |h2(bλk) h2(λk)| = O h2(λk)τ ν (and similar estimate for (1 h(λ))2). Recalling the scale S[τ] = S |k|+1 ν , we get a bound on the scale of the respective correction terms h2(bλk) h2(λk) 1 + ν s + 2S(h)(s), (103) S h c2 k 1 h(bλk) 2 1 h(λk) 2 i ν + κ+1 ν ν s + 2S(1 h)(s). (104) Finally, repetitive application of (97) to all the terms of δLcoeff[h] gives the remaining scales N λk 2 N λk 2 N h2(bλk) 1 + (ν s) + 2S(h)(s), (105) N λk 2 N |ck|2 h2(bλk) (κ + 1) ν + κ+1 ν ν s + 2S(h)(s), (106) N |ck|2 h(bλk) ν s + S(h)(s), (107) N |ck|2i κ + 1. (108) Published as a conference paper at ICLR 2024 Here, all the estimations are relatively straightforward, except for, possibly, the bound (106) that involves two values depending on whether κ + 1 > ν or the opposite (a reminiscent of overlearning transition of Section 5.3 here!). We demonstrate the bounding of this term in detail N λk 2 N = |ck|2 1 + O(τ κ+1) λ2 k 1 + O(τ 2ν) λk 1 + O(τ ν) 2 = |ck|2 1 + O(τ κ+1) + O(τ ν) = |ck|2 1 + O τ (κ+1) ν = |ck|2 + O (|k| + 1)0 (ν κ 1) This immediately implies the bound on the scale of [|ck|2]N [λ2 k]N [λk]2 N |ck|2 used in (106). Now, having specified the scale of all the corrections, our task is to show that they contribute negligibly to the loss. In other words, the scale of the total contribution of all the corrections has to be strictly lower than that of NMNO loss. Using Proposition 5 this is written as ν + 2S(h)(s) κ ν s + 2S(1 h)(s) i (110) < min 0 s ν h 1 + (ν s) + 2S(h)(s) (111) (κ + 1) ν + κ+1 ν ν s + 2S(h)(s) (112) ν s + S(h)(s) (113) ν s + 2S(h)(s) (114) ν i . (115) It is easy to see that this strict inequality can only be violated (by becoming equality) when the mins are attained at s = ν so that the r.h.s. of (105) is equal to the r.h.s. of (101)). However, this possibility is excluded by hypothesis of the theorem. This completes the proof of Theorem 2 for the Circle model. We remark that the Lipschitz condition in Theorem 2, used to compare algorithm h(λ) evaluated at the population and empirical eigenvalues, is required only for the discrete (Circle) problem. It is easy to check that this condition holds for the KRR algorithm with hη(λ) = λ λ+η. However, it is violated for GF with ht(λ) = 1 e λt: for time t N st the mapping log λ log(1 ht(λ)) = λt is not Lipschitz on the scales s < st. Fortunately, (1 ht(λ))2 on such scales is exponentially small, and the contribution to the loss from corresponding terms, both NMNO and its corrections, can be ignored. Thus, the equivalence between Circle and NMNO models still holds for GF algorithm. We expect that the Lipschitz condition of Theorem 2 can be weakened to take into account GF algorithm but leave it for future work. 
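The Circle-NMNO equivalence established in this section can also be observed directly at moderate $N$. The sketch below is only an illustration under stated assumptions: the Circle loss is written through N-deformations following the bias/variance decomposition (86)-(90), the NMNO functional is discretized over the modes $|k| \le N/2$, and the KRR regularization is given the $\eta \propto N^{-\nu/(\kappa+1)}$ scaling suggested by the noisy localization scale. It prints the ratio of the two losses, which slowly approaches 1 as $N$ grows (the corrections decay only polynomially).

```python
import numpy as np

nu, kappa, sigma2 = 1.5, 1.0, 1.0                   # noisy power-law setup (illustrative)

def n_def(N, alpha, n_alias=300):
    """N-deformation [ (|k|+1)^(-alpha) ]_N for representatives k in [-N/2, N/2)."""
    k = np.arange(-(N // 2), N - N // 2)
    n = np.arange(-n_alias, n_alias + 1)[:, None]
    return np.sum((np.abs(k + N * n) + 1.0) ** -alpha, axis=0)

def circle_loss(N, h):
    """Circle-model loss via N-deformations (bias + dataset variance + noise variance)."""
    lam_N, lam2_N = n_def(N, nu), n_def(N, 2 * nu)
    c2_N, lamc2_N = n_def(N, kappa + 1), n_def(N, nu + kappa + 1)
    hk = h(lam_N)                                   # the algorithm is evaluated at hat-lambda_k = [lambda_k]_N
    return np.sum((c2_N + sigma2 / N) * lam2_N / lam_N ** 2 * hk ** 2
                  - 2 * lamc2_N / lam_N * hk + c2_N)

def nmno_loss(N, h):
    k = np.arange(-(N // 2), N - N // 2)
    lam = (np.abs(k) + 1.0) ** -nu
    c2 = (np.abs(k) + 1.0) ** (-kappa - 1)
    return np.sum(sigma2 / N * h(lam) ** 2 + (1 - h(lam)) ** 2 * c2)

for N in [128, 512, 2048, 8192]:
    h = lambda lam: lam / (lam + N ** (-nu / (kappa + 1)))   # KRR at the noisy localization scale
    print(N, circle_loss(N, h) / nmno_loss(N, h))            # ratio drifts towards 1
```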
D.3 POWER-LAW ANSATZ: NOISELESS OBSERVATIONS In this section, we derive the results presented in Section 5.3, while also giving full versions of the results that were discussed only partially in the main paper. For the convenience of exposition, the order in this section repeats that of Section 5.3. D.3.1 PERTURBATIVE EXPANSION OF THE LOSS FUNCTIONAL In the main paper, we relied on equation (21) as a starting point to quite easily conclude that the loss localizes on the smallest spectral scale s = 0 in the saturated phase κ > 2ν while localizing on the highest scale s = ν in the non-saturated phase κ < 2ν. Essentially, equation (21) ignores the details of the problem at subspaces corresponding to small eigenvalues bλk bλN/2 (and |k| N), while providing a simple estimation of different loss functional terms for large eigenvalue subspaces with bλk bλN/2 (and |k| N). Alternatively, if one starts from exact loss functional (11), there seems to be no clear path to deducing the existence of saturated and non-saturated phases, and obtaining their convergence rates. Published as a conference paper at ICLR 2024 The above discussion illuminates the importance of perturbative expansion of the loss functional in the small parameter τ = |k|+1 N , or, in other words, perturbative corrections bλk λk and bck ck of population spectral distributions in the presence of finite dataset size N. For the circle model, the effect of finite dataset size N is captured by deviation of N-deformations [λa k|ck|2b]N from their population counterparts λa k|ck|2b. The simplest form of this deviation was already obtained in equation (97), which we will use below to derive equation (21). Let us start by writing down the loss functional in completed square from (95) and in the absence of observation noise N 2 h(bλk) h (bλk) 2 + [|ck|2]N [λk|ck|2]N 2 |ck|2 Note that the first term here, i.e. the factor in front of h(bλk) h (bλk) 2, was already estimated in the second line of (109): it agrees with the factor |ck|2 2 1+o(τ) appearing in (21) up to replacement O(τ (κ+1) ν) = o(τ) valid due to (κ + 1) ν > 1. We use this simplification in (21) because we do not need a more accurate estimation of the correction term at this moment. The second term of (21) corresponds to the generalization error of the optimal algorithm, which allows to denote it as Lk[h ] since L[h ] = PN/2 k= N/2 Lk[h ]. This term is estimated as follows Lk[h ] = [|ck|2]N [λk|ck|2]N 2 |ck|2 = |ck|2 1 + O(τ κ+1) |ck|4λ2 k 1 + O(τ κ+1+ν) |ck|2λ2 k 1 + O(τ κ+1) + O(τ 2ν) = |ck|2 O(τ κ+1) + O(τ κ+1+ν) + O(τ 2ν) = (τN) κ 1 O(τ κ+1) + O(τ 2ν) = N κ 1 O(1) + O(τ 2ν κ 1) . D.3.2 OPTIMALLY SCALED LEARNING ALGORITHMS In Section 5.3 we mentioned that the rate O(N κ) of the optimal algorithm h (λ) in the nonsaturated phase also holds for learning algorithms h(λ) within a suitable neighbourhood of h (λ), characterized by a condition |h(bλk) h (bλk)|2 = o(τ κ). In this section, we derive this result while also providing a more systematic discussion of learning algorithms that do not destroy the rate of the optimal algorithm. Let us call an algorithm h(λ) optimally scaled if the scale S L[h] of associated generalization error L[h] is the same as that of the optimal algorithm h (λ) S L[h] = S L[h ] . (118) While one might try to find all algorithms satisfying (118), here we take a less ambitious approach by considering a simple family of conditions on |h(λ) h (λ)| and then choosing the weakest condition within the family. 
Specifically, for any two constants a, b, consider the following bound on the scale of deviation of an algorithm h(λ) from the optimal one: S |h(λ) h (λ)|2 as + b, s [0, ν], (119) which is a slightly weaker (e.g. up to a log N factors) version of |h(λ) h (λ)|2 = O(λ a N b). Here we limited the scale of λ to s [0, ν] because this is precisely the interval of scales occupied by the set of empirical eigenvalues {bλk}N 1 k=0 passed as an input to algorithms h, h . It is easy to check whether all algorithms satisfying condition (119) are optimally scaled. First, we can equivalently rewrite (118) as S L[h] L[h ] S L[h ] . Then, applying Proposition 5 to (21) yields Published as a conference paper at ICLR 2024 S L[h] L[h ] = min 0 s ν S |ck|2 + S |h(bλk) h (bλk)|2 s ν + as + b s ν b + κ + aν, a < κ According to the above calculation, the set of all pairs a, b that guarantee algorithms under condition (119) to be optimally scaled is given by A = {(a, b) R2 | b + 0 (κ + aν) S L[h ] }. Now, if we wish to pick a pair (a, b) A such that condition (119) is the weakest at a spectral scale s, we have to minimize as + b. Fortunately, there is a unique pair (a , b ) that provides the weakest condition across all relevant spectral scales (a , b ) = κ ν , S L[h ] = arg min (a,b) A as + b s [0, ν]. (121) Applying this result to saturated and non-saturated phases, with respective convergence rates O(N 2ν) and O(N κ), gives the desired conditions for optimally scaled algorithms S |h(λ) h (λ)|2 κ ν s + κ in the non-saturated phase κ < 2ν, (122) S |h(λ) h (λ)|2 κ ν s + 2ν in the saturated phase κ > 2ν. (123) A stronger version of optimally scaled algorithm condition. Recall that the loss of the optimal algorithm h in saturated and non-saturated phases localize at s = 0 and s = ν respectively. However, conditions (122), (123) do not provide the same localization property for the excess error L[h] L[h ]. To see this, take |h(λ) h (λ)|2 = O(λ κ ν N κ 2ν), which corresponds to values (a , b ) given in (121). Substituting these values in excess error scale calculation (120) makes the function under the minimum s-independent, implying that the excess loss localizes on all spectral scales s [0, ν]. Such spread of localization scales introduces a logarithmic factor in the error rate. Taking, for simplicity, |h(bλk) h (bλk)|2 = (|k| + 1) κN κ 2ν, and using (21) gives L[h] L[h ] = c2 k(1 + o(τ))|h(bλk) h (bλk)|2 |k| + 1 = O(N κ 2ν log N). To avoid these issues, let us introduce a slightly stronger version of conditions (122),(123), specified by picking a small parameter ε > 0: |h(λ) h (λ)|2 = O (λ 1 ν N) κ ε in the non-saturated phase κ < 2ν, (125) |h(λ) h (λ)|2 = O λ κ ν +εN 2ν in the saturated phase κ > 2ν. (126) For any ε > 0, the above conditions guarantee localization of the excess error either at s = 0 (saturated phase) or s = ν (non-saturated phase), as can be seen from (120). Also, using these conditions in computation like (124) produces the rate of the optimal algorithm without log N factor L[h] L[h ] = O(N κ 2ν). Finally, let us comment on |h(bλk) h (bλk)|2 = o(τ κ) condition for non-saturated phase, mentioned in the main paper. Since τ = (λ 1 ν k N) 1, in many cases one can replace o(τ κ) with O(τ κ+ε) with some ε and thus satisfy condition (125). When no ε > 0 can provide such replacement, it is possible to show that the rate of the excess error remains L[h] L[h ] = O(N κ), although with a more technically involved version of computation (124). 
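As an illustration of optimally scaled algorithms in the noiseless non-saturated phase, the sketch below (same truncated N-deformation machinery as above; the scalings $\eta \propto N^{-\nu}$ for KRR and $t \propto N^{\nu}$ for GF are chosen so that both act non-trivially on the highest spectral scale) evaluates the exact Circle loss (116) for the optimal algorithm, KRR, and GF, and prints $N^{\kappa}L$, which stays bounded for all three, consistent with the $O(N^{-\kappa})$ rate discussed above.

```python
import numpy as np

nu, kappa = 1.5, 1.0                                # non-saturated phase: kappa < 2*nu

def n_def(N, alpha, n_alias=300):
    k = np.arange(-(N // 2), N - N // 2)
    n = np.arange(-n_alias, n_alias + 1)[:, None]
    return np.sum((np.abs(k + N * n) + 1.0) ** -alpha, axis=0)

def noiseless_loss(N, h=None):
    lam_N, lam2_N = n_def(N, nu), n_def(N, 2 * nu)
    c2_N, lamc2_N = n_def(N, kappa + 1), n_def(N, nu + kappa + 1)
    if h is None:                                   # optimal algorithm: the completed-square minimum of (116)
        return np.sum(c2_N - lamc2_N ** 2 / (c2_N * lam2_N))
    hk = h(lam_N)
    return np.sum(c2_N * lam2_N / lam_N ** 2 * hk ** 2 - 2 * lamc2_N / lam_N * hk + c2_N)

for N in [256, 1024, 4096]:
    krr = lambda lam, e=N ** -nu: lam / (lam + e)           # eta ~ N^-nu
    gf = lambda lam, t=N ** nu: 1.0 - np.exp(-t * lam)      # stopping time t ~ N^nu
    print(N, N ** kappa * noiseless_loss(N),
          N ** kappa * noiseless_loss(N, krr),
          N ** kappa * noiseless_loss(N, gf))               # all three columns stay O(1)
```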
Published as a conference paper at ICLR 2024 D.3.3 SATURATED PHASE κ > 2ν When discussing the saturated phase in Section 5.3, we first deduced from O(τ #) terms in equation (21) that the loss will localize on the scale s = 0 (corresponding to |k| 1), and then stated that the loss of the optimal algorithm can be achieved by optimally regularized KRR. In this section, we derive an asymptotic expression of the loss functional in the saturated phase, which both explains the statements made in the main paper and gives a more systematic picture of the loss-algorithm relations in the saturated phase. Generalization error of optimal algorithm L[h ]. The asymptotic of L[h ] originates from O(τ 2ν κ 1) term in equation (21), which dominates the whole loss functional for κ > 2ν. To verify that this is indeed the case, we need a more accurate characterization of N-deformation correction terms than in (97). For that, we recognize that the sums over n1, n2 in the second line of (97) are Hurwitz zeta functions ζ(α, x) P n=0(n + x) α evaluated at α = aν + b(κ + 1) and two specific values of x: λa k|ck|2b α=aν+b(κ+1) = λa k|ck|2b + N α ζ α, 1 + 1+k N + ζ α, 1 + 1 k Substituting Taylor expansion ζ(α, 1 + ϵ) = ζ(α) αζ(α + 1)ϵ + O(ϵ2) of Hurwitz zeta function at x = 1 into (127) we get a more detailed version of (97) N = λa k|ck|2b + 2ζ(α)N α + O(N α 1) + O(τ 2N α). (128) Equipped with (128), we turn to derivation of the leading asymptotic of O(τ 2ν κ 1) term dominating the loss. By looking at (117), we can see that the term of interest comes from finite N corrections of λ2 k N. Then, mimicking the derivation (117) and using (128) for λ2 k Lk[h ] = |ck|2 1 + O(τ κ+1) |ck|2λ2 k 1 + O(τ κ+1) λ2 k + 2ζ(2ν)N 2ν 1 + O(N 1) + O(τ 2) = 2ζ(2ν)|ck|2 λ2 k N 2ν 1 + O(N 1) + O(τ 2) + O(N κ 1). Substituting the above in the loss functional sum results gives h 2ζ(2ν)|ck|2 λ2 k N 2ν 1 + O(N 1) + O(τ 2) + O(N κ 1) i 22ζ(2ν)N 2ν 1 + o(1) 22ζ(2ν)N 2ν 1 + o(1) X Here in the last equality we extended the summation to all population eigenspaces due to convergence of the series P l= |cl|2 λ2 l at κ > 2ν, which reflects localization of the error on scale s = 0, corresponding to |l| 1. Note that in (130) we could have evaluated the sum over the population spectra as l= (|l| + 1)2ν κ 1 = 2ζ(κ + 1 2ν) 1. (131) However, we view the version with the sum as being more general. Intuitively, it stays valid in the case of population spectra having power-law form only asymptotically: λl = (|l| + 1) ν(1 + o(1)) and |cl|2 = (|l| + 1) κ 1(1 + o(1)). In that case, the value of the sum is not fixed by the asymptotics of λl, |cl|2 and may vary substantially. In contrast, we can see from computation (129) and corrections to N-deformations (97) and (128) that the factor 2ζ(2ν) is determined by population spectrum λl with indices l near the values l = N, N, 2N, 2N, . . .. Therefore, the factor 2ζ(2ν) Published as a conference paper at ICLR 2024 in the loss is determined purely by asymptotic behavior of λl. Overall, this creates an interesting situation where the loss (130) is localized on the smallest scale s = 0, but the asymptotic shape of the population spectrum almost fully specifies the generalization error asymptotic L[h ] = CN 2ν(1+ o(1)), with only the constant factor P λ2 l determined by the details living on the localization scale s = 0. Full generalization error L[h]. The full generalization error is obtained by combining L[h ] with the part of (116) associated with deviation from the optimal algorithm h(λ) h (λ). 
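Before turning to general $h(\lambda)$, the leading asymptotic (130)-(131) for $L[h^*]$ itself can be checked numerically. A minimal sketch (truncated N-deformations as before, with illustrative exponents in the saturated regime $\kappa > 2\nu$): it compares $N^{2\nu}L[h^*]$ with the predicted prefactor $2\zeta(2\nu)\big(2\zeta(\kappa+1-2\nu)-1\big)$.

```python
import numpy as np
from scipy.special import zeta                      # Riemann zeta: zeta(x)

nu, kappa = 1.5, 4.0                                # saturated phase: kappa > 2*nu

def n_def(N, alpha, n_alias=400):
    k = np.arange(-(N // 2), N - N // 2)
    n = np.arange(-n_alias, n_alias + 1)[:, None]
    return np.sum((np.abs(k + N * n) + 1.0) ** -alpha, axis=0)

def optimal_loss(N):                                # noiseless loss of h*, i.e. the minimum of (116)
    c2_N, lam2_N = n_def(N, kappa + 1), n_def(N, 2 * nu)
    lamc2_N = n_def(N, nu + kappa + 1)
    return np.sum(c2_N - lamc2_N ** 2 / (c2_N * lam2_N))

# predicted prefactor: 2*zeta(2*nu) * sum_l |c_l|^2 / lambda_l^2, evaluated via (131)
predicted = 2 * zeta(2 * nu) * (2 * zeta(kappa + 1 - 2 * nu) - 1)
for N in [64, 256, 1024]:
    print(N, N ** (2 * nu) * optimal_loss(N), predicted)    # first column approaches the second
```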
We will consider only the algorithms h(λ) with mild deviation from the optimal, as given by condition (125). As demonstrated in Section D.3.2, this condition guarantees the rate not worse than the rate O(N 2ν) of the optimal algorithm, and also localization of the excess error on the same scale s = 0. Having ensured localization of the excess error on the scale s = 0, let us first look at the perturbative expression of the optimal algorithm h (λ) at this scale. Substituting (128) into (12), we get h (bλk) = 1 + 2ζ(ν)N ν 1 + O(N 1) + O(τ) . (132) Using the above expression of the optimal algorithm, we compute the full loss functional as follows |ck|2 1 + o(τ) h(bλk) 1 2ζ(ν)N ν N ) + O(τ) 2 λ2 k 2ζ(2ν)N 2ν(1 + o(1)) 2N 2ν(1 + o(1)) λk N ν h(λk) 1 2ζ(ν) 2 + 2ζ(2ν) While the above expression for L[h] is already a valid asymptotic for the loss in the saturated phase, we make a few further refinements. First, note that (133) relies on condition (125) for the algorithm h(λ). This condition might be quite difficult to verify in practice, as it requires the knowledge of the optimal algorithm h(λ). However, perturbative expansion (132) shows that optimal algorithm satisfies |h (λ) 1|2 = O(λ 2N 2ν). Since in saturated phase κ ν > 2, we can always make λ 2N 2ν smaller than λ κ ν +εN 2ν by choosing ε < κ ν 2. Thus, employing triangular inequality shows that the original condition (125) is equivalent to |h(λ) 1|2 = O(λ κ ν +εN 2ν), 0 < ε < κ The main advantage of condition (134) compared to (126) is that it is easily verifiable. For instance, KRR has 1 hη(λ) = η λ+η which satisfies (134) for |η| = O(N ν). Similarly, for gradient flow we have 1 ht(λ) = e tλ which satisfies (134) with t = Ω(N 2ν2 κ + ε ) and some ε > 0 depending on ε. Note that KRR and GF examples of h(λ) considered above satisfy (134) not only on the scales s [0, ν] but for all s 0, or equivalently λ 1. Based on this observation, let us impose condition (134) on all λ 1. Then, the summation indices in (133) to all population values l Z, similarly to how it was done for optimal algorithm loss in (130). To summarize, through the derivations and discussions above we have proved Theorem 3. Consider the Circle model in saturated phase κ > 2ν. Then, if an algorithm h(λ) satisfies (134) for all λ 1, the loss functional is given by 2N 2ν(1 + o(1)) λl N ν h(λl) 1 2ζ(ν) 2 + 2ζ(2ν) . (135) Published as a conference paper at ICLR 2024 Finally, let us comment on the KRR result (22) presented in the main paper. Observe that the combination λ(h(λ) 1) entering the loss functional (135) is given, in case of KRR, by λ(hη(λ) 1) = η 1 + η λ = η 1 + O(η Substitution of the above into loss functional (135) leads to the KRR expression (22). According to condition (134), this expression is applicable only for small enough regularization strengths |η| = O(N ν). D.3.4 NON-SATURATED PHASE κ < 2ν In this section, we obtain a limiting form of loss functional in the non-saturated phase. While this form was not discussed explicitly in the main text, it was used to produce the first two plots in Figure 3. An important feature of the non-saturated phase is the localization of the loss on the highest spectral scale s = ν, which is ensured for algorithms h(λ) satisfying (125). Such localization corresponds to values τ 1, so we can not rely on perturbative expansion in small parameter τ 0, as we did in the sections above. Instead, for all terms of the loss functional (116) we need to take into account their limiting N form at fixed τ. 
For N-deformations, such limit leads to symmetrized Hurwitz zeta function ζ(α) τ = ζ(α, τ) + ζ(α, 1 τ) introduced in equation (23) of the main text. Indeed, slightly rearranging (127), we get N = N α ζ α, τ + ζ α, 1 τ + 2 N N , τ=const N αζ(α) τ . (137) Next, observe that on the scale s = ν the density of eigenvalues λk is very high, in a sense that λk = O(τ 1N 1) N , τ=const 0. Thus, in the limit N the summation over spectral index k in the loss functional (116) translates into an integral: P N 2 dk 2N 1 R 1 where for the last transition recall that τ = |k|+1 N and N-deformations [λa k|ck|2b]N are even functions of k. This leads to the following continuous form of the loss functional (116) L(cont)[h] = N κ " ζ(κ+1) τ ζ(2ν) τ ζ(ν) τ 2 h(bλτ) h (bλτ) 2 + ζ(κ+1) τ ζ(ν+κ+1) τ 2 ζ(κ+1) τ ζ(2ν) τ " ζ(κ+1) τ ζ(2ν) τ ζ(ν) τ 2 h2(bλτ) 2ζ(ν+κ+1) τ ζ(ν) τ h(bλτ) + ζ(κ+1) τ where bλτ = N νζ(ν) τ and the optimal algorithm h (bλτ) is given by (23). D.3.5 THE OVERLEARNING TRANSITION POINT Observe that if κ = ν 1, then in the case of zero noise σ2 = 0 the optimal algorithm (12) becomes h (bλk) 1, and the same holds for the limiting (N ) version (23). We prove now that for larger (smaller) κ the optimal algorithm becomes overlearning (underlearning). We give the proof for the limiting (N ) version, but it is easy to see that the proof extends without change to the original discrete version as well. Lemma 2. Consider the N limit of the optimal algorithm given by (23): h (bλτ) = ζ(ν+κ+1) τ ζ(ν) τ ζ(κ+1) τ ζ(2ν) τ , ζ(α) x ζ(α, x) + ζ(α, 1 x). (139) Then for any τ (0, 1) < 1, κ + 1 < ν, = 1, κ + 1 = ν, > 1, κ + 1 > ν. (140) Published as a conference paper at ICLR 2024 Proof. We have ln h (bλτ) = (ln ζ(ν+κ+1) τ ln ζ(κ+1) τ ) (ln ζ(2ν) τ ln ζ(ν) τ ) = Z κ+1 α dβ d2 dβ2 ln ζ(β) τ . (141) To prove the lemma, it suffices to show that the function f(α) = ζ(α) τ is strictly log-convex for α > 1, i.e. ff > (f )2. Note that dm dαm f(α) = n=0 ( ln(n + τ))m(n + τ) α + n=0 (ln(n + 1 τ))m(n + 1 τ) α. (142) Then, the inequality ff > (f )2 is just the Cauchy inequality for the vectors (. . . , (n + τ) α/2, . . . , (n + 1 τ) α/2, . . . ) (143) and (. . . , ln(n + τ)(n + τ) α/2, . . . , ln(n + 1 τ)(n + 1 τ) α/2, . . . ). (144) The inequality is strict because the vectors are not collinear. The over/under-learning property h (λτ) 1 is clearly visible in the right two subfigures of Figure 3. At κ > ν 1, the optimal regularized KRR is also overlearning (with a negative regularization). The optimally stopped GF in this regime is GF continued to infinity, i.e. h(λ) = limt (1 e tλ) = 1. E WISHART MODEL As for the circle model, in this section we collect all our derivations for the Wishart model. Our strategy in computing loss functional (8) relies on the resolvent b R(z) of empirical kernel matrix K = ΦT ΛΦ: N ΦT ΛΦ z I 1 = l λlϕlϕT l z I , z C, (145) where ϕl are the columns of feature matrix Φ. We note that the projections on empirical eigenvalues and eigenvectors, appearing in empirical transfer measure (41) are related to the resolvent via the limit in the complex plane k=1 δbλk(dλ)uku T k = 1 π lim y 0+ ℑ X k bλkuku T k (λ + iy)I 1 dλ π lim y 0+ ℑb R(λ + iy)dλ, where ℑz denotes the imaginary part of z C. Now, denote the resolvent projected on features ϕl as b Rll (z) = 1 N ϕT l b R(z)ϕl , (147) and the first two moments of the latter Rll (z) = EΦ h b Rll (z) i , (148) Rl1l 1l 2l2(z1, z2) = EΦ h b Rl1l 1(z1) b Rl 2l2(z2) i . 
(149) Then, substituting (146) into the transfer measure (41) immediately connects it to the projected resolvent Published as a conference paper at ICLR 2024 bρ(f) ll (dλ) = λl πλ lim y 0+ℑ b Rll (λ + iy) dλ. (150) Carrying the relation (150) over to the first and second moments of the transfer measure gives ρ(1) ll (dλ) = λl πλ lim y 0+R(im) ll (λ + iy)dλ, (151) ρ(2) l1l 1l 2l2(dλ1, dλ2) = λl 1λl 2 π2λ1λ2 lim y1 0+ y2 0+ R(im) l1l 1l 2l2(λ1 + iy1, λ2 + iy2)dλ1dλ2, (152) ρ(ε) l (dλ) = λ2 l πλ2 lim y 0+R(im) ll (λ + iy)dλ, (153) where R(im)) ll and R(im) l1l 1l 2l2 are the moments of the imaginary part of the resolvent projections R(im) ll (z) = EDN h ℑ b Rll (z) i , (154) R(im) l1l 1l 2l2(z1, z2) = EDN h ℑ b Rl1l 1(z1) ℑ b Rl 2l2(z2) i . (155) Then, the learning measures defining the loss functional (8) are obtained by summation over population indices as in (44). E.1 RESOLVENT Given the connection between resolvent and learning measures ρ(2), ρ(1), ρ(ε) described above, we see that the resolvent moments (148) and (149) are fundamental building blocks for the loss functional (8). In this section, we try to calculate these moments and simplify the result as much as possible. To start, recall that the 1 N Tr[ b R(z)] gives the Stieltjes transform of the empirical spectral measure bµ = 1 N PN k=1 δbλk. A central quantity in our calculations will be Stieltjes transform of the average spectral measure µ = E [bµ] N E h Tr[ b R(z)] i . (156) Next, denote resolvents of kernel matrices with one or two spectral components removed as b R l(z) = b R 1(z) 1 m =l λmϕmϕT m z I b R l l (z) = b R 1 l (z) 1 N λl ϕl ϕT l 1 = m =l,l λmϕmϕT m z I and the respective Stieltjes transforms as r l(z), r l l (z). In our calculations, we will use two assumptions Assumption 1. (Self-averaging property). For a random Gaussian vector ϕ N(0, I) independent from Φ 1 N ϕT b R(z)ϕ 1 N E h ϕT b R(z)ϕ i = r(z). Assumption 2. (Stability under eigenvalue removal) r l l (z) r l(z) r(z). In classical RMT settings, e.g. that of Marchenko-Pastur law, both of these assumptions can be rigorously shown. Specifically, both the statistical fluctuations of 1 N ϕT b R(z)ϕ and the change of Published as a conference paper at ICLR 2024 r(z) after eigenvalue removal give no contribution to the limiting spectral measure as N . We leave for the future work the validity of these assumptions in our settings. As a final preparation step, let us write down a quick way of deriving the self-consistent (fixed-point) equation for r(z) using the above assumptions. Starting with the trivial relation 1 = 1 N Tr[I], we get Tr h b R(z) 1 N X l λlϕlϕT l z I i# l λl E h 1 N ϕT l b R(z)ϕl i zr(z). (159) Next, we relate ϕT l b R(z)ϕl to ϕT l b R l(z)ϕl via Sherman-Morrison formula in order to break the dependence between ϕl and the matrix inside. 1 N ϕT l b R(z)ϕl = 1 N ϕT l b R l(z)ϕl λl 1 N ϕT l b R l(z)ϕl 2 N ϕT l b R l(z)ϕl = 1 N ϕT l b R l(z)ϕl 1 + λl 1 N ϕT l b R l(z)ϕl (1) = r l(z) 1 + λlr l(z) (2) = r(z) 1 + λlr(z), where in (1) and (2) we used the assumptions 1 and 2 respectively. Substituting this into (159) gives the self-consistent equation 1 = zr(z) + 1 r(z)λl 1 + r(z)λl , (161) which is often written in a fixed-point form as λl 1 + r(z)λl For z = η < 0, we have r(z) = η 1 eff,N for effective regularization ηeff,N defined in (1). E.1.1 COMPUTING THE RESOLVENT MOMENTS Reflection symmetry. Here, we mainly repeat Simon et al. (2023) to establish useful exact relations for resolvent expectations. 
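Before doing so, the self-consistent equation (161)-(162) derived above admits a direct Monte-Carlo check. The sketch below is illustrative (finite truncation $P$ of the population spectrum, arbitrary sizes) and uses the sign convention matching the definition of $\eta_{\mathrm{eff}}$ in (1), i.e. $1 = \eta/\eta_{\mathrm{eff}} + \frac{1}{N}\sum_l \lambda_l/(\lambda_l + \eta_{\mathrm{eff}})$: it solves the fixed point by iteration and compares $r(-\eta) = 1/\eta_{\mathrm{eff}}$ with the empirical $\frac{1}{N}\mathrm{Tr}\,(\mathbf{K} + \eta\mathbf{I})^{-1}$ for sampled Gaussian features, with $\mathbf{K} = \frac{1}{N}\Phi^T\Lambda\Phi$ as in (145).

```python
import numpy as np

rng = np.random.default_rng(0)
nu, P, N, eta = 1.5, 4000, 400, 1e-3
lam = np.arange(1, P + 1.0) ** -nu                  # truncated power-law population spectrum

# fixed-point iteration for r(-eta) = 1/eta_eff
r = 1.0 / eta
for _ in range(2000):
    r = 1.0 / (eta + np.sum(lam / (1 + r * lam)) / N)
eta_eff = 1.0 / r                                   # solves 1 = eta/eta_eff + (1/N) sum lam/(lam + eta_eff)

# Monte-Carlo check: r(-eta) ~= (1/N) Tr (K + eta I)^{-1} with K = Phi^T Lambda Phi / N
Phi = rng.standard_normal((P, N))
K = Phi.T @ (lam[:, None] * Phi) / N
r_mc = np.trace(np.linalg.inv(K + eta * np.eye(N))) / N
print(r, r_mc)                                      # close for moderately large N
```

The same iteration can be reused for any $z = -\eta < 0$; inside the support of the spectral measure, the complex solution discussed in Section E.3.1 is needed instead.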
Recall that distribution of a Gaussian random vector z N(0, I) remains invariant under orthogonal transformations: Uz N(0, I), where U is an arbitrary orthogonal matrix. Now, we notice that our averages of interest (154) and (155) have a form EΦ f(K, Φ) for some function f( , ) of empirical kernel matrix K = ΦΛΦT and features Φ. Applying transformation Φ UΦ to perturbed kernel matrix ΦT UΛUT Φ gives E f(ΦT UΛUT Φ, Φ) = E f(ΦT ΛΦ, UΦ) . (163) If we take U to be a reflection along one of the basis axes, e.g. (U(m))ll = δll (1 2δlm) for reflection along axis m, then U(m)Λ(U(m))T = Λ. This implies E f(ΦT ΛΦ, Φ) = E h f(ΦT ΛΦ, U(m)Φ) i . (164) Applying (164) to (154) and (155) gives their non-zero elements are only those with paired indices: Rll(z) for the first moment and Rlll l (z1, z2), Rll ll (z1, z2), Rll l l(z1, z2) for the second moment. Moreover, note that we are interested only in those components of the resolvent second moment that contribute to the loss (44). This leaves Rll l l(z1, z2) to be the only relevant components. First moment. Here, all the necessary computations were already done in (160), leading to Rll(z) = r(z) 1 + λlr(z). (165) Published as a conference paper at ICLR 2024 Second moment at z1 = z2. In this case, the second moment is connected to the derivative of the first moment, which was utilized by Simon et al. (2023) and Bordelon et al. (2020) to obtain the generalization error of KRR. Indeed, Rll l l(z, z) = E 1 N ϕT l b R(z) ϕl ϕT l N b R(z)ϕl m λm ϕmϕT m N z I 1 = λl Rll(z). (166) As prerequisite for computing the derivative λl Rll(z), we compute similar derivative of inverse Stieltjes transform r 1(z). Differentiating the fixed point equation (161) in the form r 1 + z = 1 N P λl+r 1 w.r.t. to either z or λl gives λl λl + r 1 + 1 (λl + r 1)2 λl λl + r 1 + 1 (λl + r 1)2 (λl + r 1)2 , (168) (λl + r 1)2 , (169) N P l λlr 2 (λl+r 1)2 . (170) Using this, we second moment becomes Rll l l(z, z) = λl Rll(z) = δll (λl + r 1)2 1 (λl + r 1)2(λl + r 1)2 . (171) Second moment at z1 = z2 with l = l . Here and in the next case l = l we will again apply Sherman-Morrison formula, including two subsequent applications to break dependence with ϕl, ϕl . Denoting b R#(z) the resolvent with, possibly, removed eigenvalue, removing an extra eigenvalue can be written in a simplified form under assumptions 1 and 2 b R#(z) = b R# l(z) λl 1 + λlr(z) b R# l(z) ϕlϕT l N b R# l(z) = q b R# l(z), ϕlϕT l N , al(z) , (172) where q(X, Y, a) = X a XYX is a polynomial in two matrices X, Y and a scalar variable a, and al(z) is a shorthand notation for λl/(1 + λlr(z)). Using representation (172) we can write our second moment element as Rllll(z1, z2) = E [Tr [Yq(X1, Y, a1)Yq(X2, Y, a2)]] , (173) where we have denoted X1 = b R l(z1), X2 = b R l(z2), Y = ϕlϕT l N , a1 = al(z1), a2 = al(z2). (174) Now we exploit the independence between X1, X2 and Y by first taking the expectations w.r.t. Y. Since Y is product of two Gaussian random variables and (173) is polynomial in Y containing monomials of degree from 2 to 4, we need to compute Gaussian moments of the order from 4 to 8. This can be conveniently done using Wick s theorem for computing moments of Gaussian random variables, which equates 2m-th moment to the sum over all pairings of 2m variables of the products of m intra-pair covariances. Specifically, for normal vector x N(0, I), it reduces to the products of Dirac deltas E xi1xi2 . . . 
xi2m 1xi2m = 1 j=1 E xiσ(2j 1)xiσ(2j) = 1 j=1 δiσ(2j 1)iσ(2j), Published as a conference paper at ICLR 2024 where S2m denotes the group of permutations of the index set {1, 2, . . . , 2m 1, 2m}. Now we apply Wick s theorem separately to each monomial from (173). For the simplest order-two monomial, we have EY [Tr[YX1YX2]] = N 2 X i+i j+j Eϕ N(0,I) ϕi+ϕi (X1)i j+ϕj+ϕj (X2)j i+ = N 2 (Tr[X1] Tr[X2] + 2 Tr[X1X2]) = N 2 Tr[X1] Tr[X2] + 2Tr[X1] Tr[X2] = r(z1)r(z2) + 2 N r(z1) r(z2) z1 z2 = r1r2 + O(N 1), where r1, r2 is a shorthand for r(z1), r(z2). Here we first applied Wick s theorem, then transformed the product X1X2 = (X1 X2)/(z1 z2). Then, we wrote the resolvent traces in terms of Stieltjes transform Tr[X1,2] = r(z1,2) using assumption (1), thus removing the need to take expectation with respect X1, X2 in the remaining calculation of the second moment Rllll(z1, z2). An important observation is that when computing the moment of order n, the leading term in 1 N has to be the product of traces Tr[X1], Tr[X2], while all the other terms (containing at least one trace of a product) will give O(N 1) contribution. Using the above observation, we can easily compute leading terms for all the other averages: EY [Tr[YX1YX1YX2]] = N 3 Tr[X1]2 Tr[X2] + O(N 1) = r2 1r2 + O(N 1), EY [Tr[YX1YX2YX2]] = N 3 Tr[X1] Tr[X2]2 + O(N 1) = r1r2 2 + O(N 1), EY [Tr[YX1YX1YX2YX2]] = N 4 Tr[X1]2 Tr[X2]2 + O(N 1) = r2 1r2 2 + O(N 1). Combining all the monomials, we can summarize the whole computation as follows Rllll(z1, z2) = q r1, 1, a1 q r2, 1, a2 + O(N 1) = 1 (λl + r 1 1 )(λl + r 1 2 ) + O(N 1), (178) where in the last equality we used q(r(z), 1, al(z)) = 1 λl+r 1(z). Note that in the limit z1 z2, the expressions above coincide with previously derived (171) up to subleading O(N 1) terms. Second moment at z1 = z2 with l = l . Here we need to remove two eigenvalues λl and λl . The respective resolvent expression is b R(z) =q( b R l(z), ϕlϕT l N , al(z)) = q(q( b R l l (z), ϕl ϕT l N , al (z)), ϕlϕT l N , al(z))) =eq( b R l l (z), ϕlϕT l N , ϕl ϕT l N , al(z), al (z)), (179) eq(X, Y, Y , a, a ) = q(q(X, Y , a ), Y, a) = X a XYX a XY X + aa XYXY X + XY XYX . (180) Substituting (179) into (149) will produce an expression of the form Rll l l(z1, z2) = E [Tr [Y eq(X1, Y, Y , a1, a 1)Yeq(X2, Y, Y , a2, a 2)]] , (181) where in addition to (174) we have denoted Y = ϕl ϕT l N , a 1 = al (z1), and a 2 = al (z2). Now, let us again calculate the expectations over Y, Y using Wick s theorem. The difference with the l = l case is that the pairings between ϕl and ϕl produce zeros due to their independence. For example, the expectation for the simplest monomial is EY,Y [Tr[Y X1YX2]] = 1 N 2 X i+i j+j Eϕ,ϕ N(0,I) h ϕ i+ϕ i (X1)i j+ϕj+ϕj (X2)j i+ i = N 2 Tr[X1X2] = 1 N r(z1) r(z2) Published as a conference paper at ICLR 2024 Observe that the monomial above has 1 N magnitude. This is the consequence that N 0 terms from the l = l case arose from pairings between Y and Y, which are zero in the l = l case. Taking into account the symmetry Rll ll (z1, z2) = Rll ll (z2, z1), we proceed similarly and calculate the leading terms for all the other independent monomials. Terms with X2: EY,Y [Tr[Y X1Y X1YX2]] =N 3 Tr[X1X2] Tr[X1] + O(N 2), EY,Y [Tr[Y X1YX1YX2]] =N 3 Tr[X1X2] Tr[X1] + O(N 2), EY,Y [Tr[Y X1Y X1YX1YX2]] =N 4 Tr[X1X2] Tr[X1] Tr[X2] + O(N 2), EY,Y [Tr[Y X1YX1Y X1YX2]] =3N 43 Tr[X1X2] Tr[X2 1] + O(N 3) = O(N 2). 
New terms with X2YX2: EY,Y [Tr[Y X1Y X1YX2YX2]] = N 4 Tr[X1X2] Tr[X1] Tr[X2] + O(N 2), EY,Y [Tr[Y X1YX1YX2YX2]] = N 4 Tr[X1X2] Tr[X1] Tr[X2] + O(N 2), EY,Y [Tr[Y X1Y X1YX1YX2YX2]] = N 5 Tr[X1X2] Tr[X1]2 Tr[X2] + O(N 2), EY,Y [Tr[Y X1YX1Y X1YX2YX2]] = = 3N 5 Tr[X1X2] Tr[X2 1] Tr[X2] + O(N 3) = O(N 2). New terms with X2Y X2: EY,Y [Tr[Y X1YX1YX2Y X2]] = N 4 Tr[X1X2] Tr[X1] Tr[X2] + O(N 2), EY,Y [Tr[Y X1Y X1YX1YX2Y X2]] = N 5 Tr[X1X2] Tr[X1]2 Tr[X2] + O(N 2), EY,Y [Tr[Y X1YX1Y X1YX2Y X2]] = = 3N 5 Tr[X1X2] Tr[X2 1] Tr[X2] + O(N 3) = O(N 2). New terms with X2YX2Y X2: EY,Y [Tr[Y X1Y X1YX1YX2YX2Y X2]] = = N 6 Tr[X1X2] Tr[X1]2 Tr[X2]2 + O(N 2), EY,Y [Tr[Y X1YX1Y X1YX2YX2Y X2]] = = 3N 6 Tr[X1X2] Tr[X1]2 Tr[X2]2 + O(N 3) = O(N 2). A new term with X2Y X2YX2: EY,Y [Tr[Y X1YX1Y X1YX2Y X2YX2]] = = N 6 6 Tr[X1X2]3 + 9 Tr[X1X2] Tr[X1]2 Tr[X2]2 + O(N 4) = O(N 3). (187) To summarize, we observe two types of monomials. The first type has the order O(N 2) and is composed of those monomials which take X1YX1Y X1 from eq(X1, Y, Y , a1, a 1) and/or X2Y X2YX2 from eq(X2, Y, Y , a2, a 2). The rest of the monomials g( , , , ), contain exactly one factor Tr[X1X2] = N r1 r2 z1 z2 , have the order O(N 1), and satisfy EY,Y g X1, X2, Y, Y = 1 N r1 r2 z1 z2 g r1, r2, 1, 1 r1r2 + O(N 2) N r 1 1 r 1 2 z1 z2 g r1, r2, 1, 1 + O(N 2). Thus, the leading O(N 1) term for the second moment with l = l is given by Rll l l(z1, z2) = 1 N r 1 1 r 1 2 z1 z2 rs (as + a s)r2 s + asa sr3 s + O(N 2) N r 1 1 r 1 2 z1 z2 r 1 1 r 1 2 (λl + r 1 1 )(λl + r 1 1 )(λl + r 1 2 )(λl + r 1 2 ) + O(N 2). Again, in the limit z1 z2, the result above coincides with previously derived (171) up to subleading O(N 2) terms. Published as a conference paper at ICLR 2024 Summations over l and l in (44). For the loss functional we will need not the bare second moment Rll l l(z1, z2) but the sum P l λ2 l Rll l l(z1, z2). First, let us start with the case z1 = z2 where the second moment is given by (171). The essential sum to compute is P l λ2 l (λl +r 1)2 . Using the fixed-point equation (161) gives zr 1λ2 l (λl + r 1)2 = 1 z r 1λl λl + r 1 = z (r 1 + z) = 1 + zr 1. (190) We can substitute this result into the second moment to get the desired second moment sum. l λ2 l Rll l l(z, z) = λ2 l (1 + zr 1)r 2 (λl + r 1)2 . (191) Now, we proceed to the case z1 = z2. Similarly, from (189) we see that the essential sum to compute is 1 N r 1 1 r 1 2 z1 z2 λ2 l (λl + r 1 1 )(λl + r 1 1 ) = 1 λl r 1 1 λl + r 1 1 λl r 1 2 λl + r 1 2 = 1 + r 1 1 r 1 2 z1 z2 , which is basically a finite difference version of the sum for z1 = z2. Again, we substitute this result into second moment sum to get l λ2 l Rll l l(z1, z2) = λ2 l r 1 1 r 1 2 1 + r 1 1 r 1 2 z1 z2 (λl + r 1 1 )(λl + r 1 2 ) + O(N 1). (193) Now let us consider the sum over l in (44). First, observe that the first-moment term from (44) enters in the form X l c2 l λl Rll(z) = X λlc2 l λl + r 1(z) = u(z). (194) Here we encountered the first auxiliary function u(z) introduced (14). However, directly performing the respective sum for the second-moment term does not automatically reduce it to an expression with decoupled factors (depending only on z1 or z2 but not jointly on z1, z2). Yet, we can perform an additional transformation to express the result in terms of decoupled factors. 
Representation of fractions product as a difference, similar to (192), gives for two parts of P l,l c2 l λ2 l Rll l l(z1, z2) c2 l (λ2 l r 1 1 r 1 2 ) (λl + r 1 1 )(λl + r 1 2 ) = X λlc2 l λl + r 1 1 + λlc2 l λl + r 1 2 c2 l = u1 + u2 X l c2 l , (195) c2 l r 1 1 r 1 2 r 1 1 r 1 2 z1 z2 (λl + r 1 1 )(λl + r 1 2 ) = r 1 1 r 1 2 l c2 l λl+r 1 1 P l c2 l λl+r 1 2 z1 z2 = r 1 1 r 1 2 v1 v2 z1 z2 , (196) where we have encountered second auxiliary function v(z) defined in (14). Summarizing our computation of the second moment, we have obtained X l,l c2 l λ2 l Rll l l(z1, z2) = u(z1) + u(z2) + r 1(z1)r 1(z2)v(z1) v(z2) l c2 l . (197) Thus, we fully expressed the population sums of resolvent moments in terms of two auxiliary functions v(z) and u(z), which would later define our final result for the loss functional. E.2 LOSS FUNCTIONAL In this section, we using the resolvent moments derived in Section E.1.1 to compute the loss functional (44). Published as a conference paper at ICLR 2024 KRR case. As a sanity check, let s first compute the loss for the case of KRR learning algorithm hη(z) = z z+η, with the expectation to recover the original expression (1). In the absence of target noise σ2 = 0, we have Ll[hη] σ2=0 = X l λ2 l Rll l l( η, η) 2λl Rll( η) + 1 = λ2 l η2 eff(1 ηηeff) (λl + ηeff)2 2 λl λl + ηeff + 1 = η2 eff ηηeff (λl + ηeff)2 . For the contribution of noise term we have ε(2) l = λ2 l 1 N E h ϕT l b R2( η)ϕl i = λ2 l η 1 N E h ϕT l b R( η)ϕl i = λ2 l ηRll( η) = λ2 l ηηeff (λl + ηeff)2 . (199) Combining noiseless and noise contributions, we obtain LKRR(η) = 1 2 η2 eff ηηeff (λl + ηeff)2 + σ2 2N λ2 l ηηeff (λl + ηeff)2 , (200) which is the same as (1). General case. Now we turn to our main goal of describing learning measures ρ(2), ρ(1), ρ(ε). Denote, u(λ) = limy 0+ u(λ + iy) and v(λ) = limy 0+ v(λ + iy). Then, for the first moment of the learning measure, we have ρ(1)(dλ) = X l,l clcl ρ(1) ll (dλ) = 1 πλ lim y 0+ l c2 l λlℑRll(λ + iy)dλ = ℑu(λ) πλ dλ. (201) For noise measure ρ(ε) l (dλ) we similarly obtain ρ(ε)(dλ) = 1 πλ2 lim y 0+ l λ2 l ℑRll(λ + iy)dλ = ℑw(λ) πλ2 dλ, (202) where w(λ) = limy 0+ w(λ + iy) for the last auxiliary function w(z) defined in (14). Now we proceed to the computation for the second moment of the learning measure ρ(2)(dλ1, dλ2) using the relation (152). We have ρ(2)(dλ1, dλ2) = 1 π2λ1λ2 lim y1 0+ y2 0+ l,l c2 l λ2 l R(im) l1l 1l 2l2(λ1 + iy1, λ2 + iy2)dλ1dλ2. (203) Note that while for the first moment we simply have R(im) ll (z) = ℑRll (z), the relation between R(im) l1l 1l 2l2(z1, z2), and previously computed Rl1l 1l 2l2(z1, z2) is less straightforward since the product of two imaginary parts has to be taken out of expectation. This can be done with the following trick: for two random variables w1, w2 we have E [ℑw1ℑw2] = E ℜw1w2 w1w2 = ℜE [w1w2] E [w1w2] Taking w1 = b Rll (z1), w2 = b Rl l(z2), and noting that b R(z) = b R(z), we get the desired second moment of imaginary parts R(im) ll l l(z1, z2) = E h ℑb Rll (z1)ℑb Rl l(z2) i = ℜ Rll l l(z1, z2) Rll l l(z1, z2) Observe that the part u(z1) + u(z2) P l c2 l of P l,l c2 l λ2 l Rl ll l(z1, z2) gives no contribution when substituted into (205): " u(z1) + u(z2) X l c2 l u(z1) + u(z2) X u(z2) u(z2) = 0. (206) Published as a conference paper at ICLR 2024 Indeed, the remaining part has a non-trivial contribution that we will compute below in the limit z1 λ1 + i0, z2 λ2 + i0. 
lim y1 0+ y2 0+ l,l c2 l λ2 l R(im) l1l 1l 2l2(λ1 + iy1, λ2 + iy2) 2ℜlim y1 0+ y2 0+ r 1(λ1 + iy1)r 1(λ2 + iy2)v(λ1 + iy1) v(λ2 + iy2) λ1 λ2 + i(y1 y2) 2ℜlim y1 0+ y2 0+ r 1(λ1 + iy1)r 1(λ2 iy2)v(λ1 + iy1) v(λ2 iy2) λ1 λ2 + i(y1 + y2) . The first limit here can be easily taken in joint way y1 = y2 = y 0+, leading to lim y1 0+ y2 0+ r 1(λ1 + iy1)r 1(λ2 + iy2)v(λ1 + iy1) v(λ2 + iy2) λ1 λ2 + i(y1 y2) = v(λ1) v(λ2) r(λ1)r(λ2)(λ1 λ2). (208) Note that this expression is regular at λ1 = λ2 since we assume v(λ) to be differentiable. The second limit might have a non-vanishing singularity at λ1 = λ2, for which we will need to use Sokhotski Plemelj formula limε 0+ 1 x iε = iπδ(x)+P( 1 x), where P denotes the Cauchy principal value. lim y1 0+ y2 0+ r 1(λ1 + iy1)r 1(λ2 iy2)v(λ1 + iy1) v(λ2 iy2) λ1 λ2 + i(y1 + y2) = lim y1 0+ r 1(λ1 + iy1)r 1(λ2)v(λ1 + iy1) v(λ2) λ1 λ2 + iy1 =|r 1(λ1)|22πℑ{v(λ1)}δ(λ1 λ2) + r 1(λ1)r 1(λ2)P v(λ1) v(λ2) Note that the singularity at λ1 = λ2 under the Cauchy principal value is purely imaginary and, therefore, will disappear after taking the real part in (207). Next, let us combine (208) with the second term from (209) 1 2ℜ nr 1(λ1)r 1(λ2) v(λ1) v(λ2) r 1(λ1)r 1(λ2) v(λ1) v(λ2) = ℑ r 1(λ2) ℑ r 1(λ1)v(λ1) ℑ r 1(λ1) ℑ r 1(λ2)v(λ2) = ℑ u(λ2) ℑ r 1(λ1) ℑ u(λ1) ℑ r 1(λ2) where in the last line we have used the relation r 1v = P l c2 l u. Finally, we combine all the terms into the learning measure second moment ρ(2)(dλ1, dλ2) dλ1dλ2 = |r 1(λ1)|2 πλ2 1 ℑ{v(λ1)}δ(λ1 λ2) ℑ u(λ2) ℑ r 1(λ1) ℑ u(λ1) ℑ r 1(λ2) Thus, we have derived the expressions (15) and (16) of the main paper. E.3 POWER-LAW ANSATZ The analysis of the loss functional given by (15) and (16) is much more tractable for continuous approximation of our basic power distributions (2) l δλl(dλ) 1 ν 1dλ = µλ(dλ), X l c2 l δλl(dλ) 1 ν λ κ ν 1dλ = µc(dλ). (212) Published as a conference paper at ICLR 2024 In particular, the fixed-point equation (161) for Stieltjes transform r(z) becomes z = r 1(z) + 1 ν dλ λ + r 1(z) . (213) We will encounter many integrals similar to the one above, with the general form λadλ λ + x, a > 1, x / [ 1, 0]. (214) Such integral can be expressed in terms of Hypergeometric function 2F1. This can be immediately used to get asymptotic x 0 expansion, which will be very useful later Fa(x) = 2F1 1, 1 + a, 2 + a, 1 (1 + a)x = π sin(πa)xa + 1 a + x 1 a + O(x2). (215) For this asymptotic expansion, we assume that x has a cut along R and a / Z. Below we describe the essential steps in computing the loss functional for the Wishart model. E.3.1 SOLVING FIXED POINT EQUATION In this section, we will analyze asymptotic N solutions of fixed-point equation (213). Note that we are interested in the solution of this equation when z approaches the real line from above: r(λ) = lim ε 0+ r(λ + iε). (216) First, let us find values of λ when ℑr(λ) = πµ(λ) > 0, which corresponds to support of the empirical spectral density µ(λ) of K. For this, let us write r 1(λ) = τ iυ, and rewrite (213) in the limit ε 0 as a pair of real equations Z λ (τ 2 + υ2) (λ )2τ (λ τ)2 + υ2 µλ(dλ ) Z (λ )2υ (λ τ)2 + υ2 µλ(dλ ) (λ τ)2 + υ2 µλ(dλ ) . Now, let us fix value of r corresponding to the point outside of the support of µ, where υ = 0, and therefore τ / supp(µλ) = [0, 1] to ensure convergence of the integral. Since the solution of the fixed point equation for z C+ is unique, there should be no value of λ and υ satisfying equations (217). But since λ can be defined for any value of τ, υ, the second equation should have no solutions with υ = 0. 
Due to the monotonicity of the expression in the brackets, this gives a necessary condition for τ corresponding to the point outside the support: (λ τ)2 µλ(dλ ) < 1. (218) Additionally, it is easy to see that for the values of τ not satisfying the inequality above, there is a solution of the second equation in (217) with υ < 0, which induces the solution of the first equation with some λ, meaning that the triple (λ, τ, υ) corresponds the point inside of the support of µ. The argument above fully characterizes the support of µ: there two support edges λ , λ+ with the respective values τ < 0, τ+ > 1 given by the equality version of (218). The right edge τ+ should be at a distance N 1 from λ = 1, where we have (λ τ+)2 µλ(dλ ) = 1 + o(1) µλ(1) (λ τ+)2 dλ (1 + o(1)) = 1. (219) Published as a conference paper at ICLR 2024 From the calculation above and the first equation in (217), the right edge is given by τ+ = 1 + µλ(1) N (1 + o(1)) λ+ = τ+ µλ(1) N log τ+ 1 (1 + o(1)) = 1 + µλ(1)log N N (1 + o(1)). (220) Turning to the left edge of the support, we note that it has the order τ N ν. Thus, we can use asymptotic expansion of Hypergeometric function (215) with a = 1 ν . leading to asymptotic form of fixed point equation z = r 1(z) + Cν ν (z), Cν = π/ν sin(π/ν), (221) where we, for simplicity, do not write (1 + o(1)) correction factors in the asymptotic. Then, for the left edge of the support we have N ( τ iυ)1 1 which gives the respective left edge values ν Cν ν N ν, λ = 1 ν 1( τ ) = (ν 1)ν 1 Cν ν νν N ν. (223) Now, let us give more explicit solutions of fixed-point equation in different scenarios. To the left of the support (λ < λ ). In this region the solution of fixed point equation is real, and will in fact has a lot of parallels with KRR applied to the Wishart model. Thus, we translate λ and r(λ) to their KRR notations: η = λ is the KRR regularization and ηeff = r 1 is the effective implicit regularization, appearing in (1). In these notations, the asymptotic form of fixed point equation becomes 1 = η ηeff + Cν ν eff . (224) When ηeff has the scaling s ν, we can write ηeff = eηeff N ν and η = eηN ν, which satisfies N-independent equation 1 = eη eηeff + Cνeη 1 ν eff . (225) The equation above gives a nontrivial relations between eη and eηeff. However, when N ν ηeff 1, or in other words it has the scaling 0 < s < ν, the relation simplify to an explicit one: ηeff = η + Cν Inside the support (λ < λ < λ+). In this region it is convenient to write r(λ) = r0(λ)eiϕ(λ), with the phase taking values in the upper half-circle: 0 < ϕ < π. Substituting r0(λ)eiϕ(λ) into the limit form (221) of fixed point equation we get a pair of real equations N (r0) 1 ν cos (1 1 0 = r 1 0 sin ϕ Cν ν 0 sin (1 1 ν )ϕ . (228) Let us rewrite the second equation here using the left edge value r = ( τ ) 1 r 1 0 = r 1 ν ) sin ϕ sin (1 1 = N ν sin ϕ Cν sin (1 1 Published as a conference paper at ICLR 2024 Thus, we see that the solution of fixed point inside the support is mostly conveniently described by using the phase ϕ as a free variable, and then specify the rest of the variables by the mappings ϕ 7 r0 and (r0, ϕ) 7 λ given by equations (229) and (227) respectively. In this representation, ϕ = 0 corresponds to the left edge λ of the support, while ϕ π corresponds to the right edge of the support. As for outside of the support case, we see that on the scale N ν the fixed-point equation admits Nindependent form. Specifically, set λ = eλN ν and r0 = er0N ν. 
Inside the support (\lambda_- < \lambda < \lambda_+). In this region it is convenient to write r(\lambda) = r_0(\lambda)e^{i\phi(\lambda)}, with the phase taking values in the upper half-circle: 0 < \phi < \pi. Substituting r_0(\lambda)e^{i\phi(\lambda)} into the limiting form (221) of the fixed-point equation, we get a pair of real equations

\lambda = -r_0^{-1}\cos\phi + \frac{C_\nu}{N}\,r_0^{-1+\frac{1}{\nu}}\cos\big((1-\tfrac{1}{\nu})\phi\big),   (227)

0 = r_0^{-1}\sin\phi - \frac{C_\nu}{N}\,r_0^{-1+\frac{1}{\nu}}\sin\big((1-\tfrac{1}{\nu})\phi\big).   (228)

Let us rewrite the second equation here using the left-edge value r_- = (-\tau_-)^{-1}:

r_0^{-1} = \Big(\frac{C_\nu\sin\big((1-\frac{1}{\nu})\phi\big)}{N\sin\phi}\Big)^{\nu} = (-\tau_-)\Big(\frac{\nu}{\nu-1}\,\frac{\sin\big((1-\frac{1}{\nu})\phi\big)}{\sin\phi}\Big)^{\nu}.   (229)

Thus, we see that the solution of the fixed point inside the support is most conveniently described by using the phase \phi as a free variable, and then specifying the rest of the variables by the mappings \phi \mapsto r_0 and (r_0, \phi) \mapsto \lambda given by equations (229) and (227), respectively. In this representation, \phi = 0 corresponds to the left edge \lambda_- of the support, while \phi \to \pi corresponds to the right edge of the support. As in the outside-of-the-support case, we see that on the scale N^{-\nu} the fixed-point equation admits an N-independent form. Specifically, set \lambda = \tilde\lambda N^{-\nu} and r_0 = \tilde r_0 N^{\nu}. Then, the triplet (\tilde r_0, \phi, \tilde\lambda) satisfies

\tilde\lambda = -\tilde r_0^{-1}\cos\phi + C_\nu\,\tilde r_0^{-1+\frac{1}{\nu}}\cos\big((1-\tfrac{1}{\nu})\phi\big), \qquad \tilde r_0^{-1} = \Big(\frac{C_\nu\sin\big((1-\frac{1}{\nu})\phi\big)}{\sin\phi}\Big)^{\nu}.   (230)

Finally, we turn to the values N^{-\nu} \ll \lambda \ll 1, corresponding to the scaling 0 < s < \nu. In this case, the pair of equations (227), (229) becomes

\lambda = r_0^{-1}\big(1 + O(\pi-\phi)\big), \qquad r_0^{-1} = \Big(\frac{\pi/\nu}{N(\pi-\phi)}\Big)^{\nu}\Big(1 + (\pi-\phi)\cot\big(\tfrac{\pi}{\nu}\big) + O\big((\pi-\phi)^2\big)\Big).   (231)

Noting that to leading asymptotic order r_0 = \lambda^{-1}, the second equation can be rewritten as

\pi - \phi = \frac{\pi/\nu}{N\lambda^{1/\nu}}\,\big(1+o(1)\big).   (232)

We can summarize the above equations in a single complex equation

r(\lambda) = -\frac{1}{\lambda}\Big(1 - \frac{C_\nu\cos(\pi/\nu)\,\lambda^{-\frac{1}{\nu}}}{N}\Big) + \frac{i\pi}{N}\,\mu_\lambda(\lambda).   (233)

In particular, taking the imaginary part of the Stieltjes transform above gives \Im r(\lambda) = \frac{\pi}{N}\mu_\lambda(\lambda). This implies that for N^{-\nu} \ll \lambda \ll 1 the (normalized) empirical eigenvalue density N\mu(\lambda) coincides with the population density \mu_\lambda(\lambda), as expected.

E.3.2 LEARNING MEASURES

In this section, we specify the form of the empirical learning measures \rho^{(2)}(d\lambda_1, d\lambda_2), \rho^{(1)}(d\lambda) and \rho^{(\varepsilon)}(d\lambda), derived in (201), (202) and (16) respectively, for the case of the power-law distributions (212). Note that all three learning measures are expressed in terms of r(\lambda) and the imaginary parts of 3 auxiliary functions v(\lambda), u(\lambda), w(\lambda), whose asymptotic form we will now analyze. In the continuous approximation (212) these functions are given by

v(\lambda) = \frac{1}{\nu}\int\frac{(\lambda')^{\frac{\kappa}{\nu}-1}\,d\lambda'}{\lambda'+r^{-1}(\lambda)}, \qquad u(\lambda) = \frac{1}{\nu}\int\frac{(\lambda')^{\frac{\kappa}{\nu}}\,d\lambda'}{\lambda'+r^{-1}(\lambda)}, \qquad w(\lambda) = \frac{1}{\nu}\int\frac{(\lambda')^{1-\frac{1}{\nu}}\,d\lambda'}{\lambda'+r^{-1}(\lambda)}.   (234)

Since all of these functions have the same functional form \int\frac{(\lambda')^a d\lambda'}{\lambda'+r^{-1}(\lambda)}, let us write an asymptotic expansion of the imaginary part of this integral, using (215) as a basis. First, we consider \lambda \ll 1 and use the leading term of the asymptotic expansion (215) together with the asymptotic formulas (227) and (229):

\Im\int\frac{(\lambda')^a\,d\lambda'}{\lambda'+r^{-1}(\lambda)} = -\Gamma(1+a)\Gamma(-a)\,\Im\big\{r_0^{-a}e^{-ia\phi}\big\} = \frac{\pi\sin(a\phi)}{\sin(\pi a)}\Big(\frac{C_\nu\sin\big((1-\frac{1}{\nu})\phi\big)}{N\sin\phi}\Big)^{a\nu}.   (235)

Importantly, away from the left edge, \lambda \gg \lambda_-, or equivalently \pi - \phi \ll 1, the above expression significantly simplifies:

\Im\int\frac{(\lambda')^a\,d\lambda'}{\lambda'+r^{-1}(\lambda)} = \pi\Big(\frac{\pi/\nu}{N(\pi-\phi)}\Big)^{a\nu}\big(1 + O(\pi-\phi)\big) = \pi\lambda^a\Big(1 + O\big(\tfrac{\lambda^{-1/\nu}}{N}\big)\Big).   (236)

The leading term above is expected: since \Im r^{-1}(\lambda) \ll \lambda for \lambda \gg \lambda_-, the fraction can be approximated with the Sokhotski formula \Im\frac{1}{\lambda'+r^{-1}(\lambda)} \approx \pi\delta_\lambda(\lambda'). However, our approach with the asymptotic expansion of the hypergeometric function in (215) also provides an estimate of the correction term to the Sokhotski formula, which would be difficult to obtain directly. Let us now take into account the omitted subleading terms in the asymptotic expansion (215). These terms form a regular power series in x, and therefore their imaginary part can be estimated as O(\Im x) = O\big(\Im r^{-1}(\lambda)\big) = O\big(\tfrac{\lambda^{1-1/\nu}}{N}\big). Thus, we can summarize our computation for \lambda_- \ll \lambda \ll 1 as

\Im\int\frac{(\lambda')^a\,d\lambda'}{\lambda'+r^{-1}(\lambda)} = \pi\lambda^a + O\Big(\tfrac{\lambda^{a-\frac{1}{\nu}}}{N}\Big) + O\Big(\tfrac{\lambda^{1-\frac{1}{\nu}}}{N}\Big).   (237)

Note that while the asymptotic form above is mostly meaningful for \lambda \gg \lambda_-, when the corrections are small, we can formally use it for \lambda \sim \lambda_-, but there the corrections become of the same order as the leading terms. The values of the functions \Im v(\lambda), \Im u(\lambda), \Im w(\lambda) can be obtained by using either (235) or (237), depending on the scale of \lambda. In particular, when \lambda_- \ll \lambda \ll 1 we have

\Im v(\lambda) = \pi\mu_c(\lambda) + O\Big(\tfrac{\lambda^{\frac{\kappa-1}{\nu}-1}}{N}\Big) + O\Big(\tfrac{\lambda^{1-\frac{1}{\nu}}}{N}\Big), \qquad \Im u(\lambda) = \pi\lambda\mu_c(\lambda) + O\Big(\tfrac{\lambda^{\frac{\kappa-1}{\nu}}}{N}\Big) + O\Big(\tfrac{\lambda^{1-\frac{1}{\nu}}}{N}\Big), \qquad \Im w(\lambda) = \pi\lambda^2\mu_\lambda(\lambda) + O\Big(\tfrac{\lambda^{1-\frac{2}{\nu}}}{N}\Big) + O\Big(\tfrac{\lambda^{1-\frac{1}{\nu}}}{N}\Big).   (238)

As an important application of the expressions above, let us estimate the scale of the off-diagonal part \rho^{(2)}_{\mathrm{off}}(\lambda_1, \lambda_2) of the learning measure \rho^{(2)}(\lambda_1, \lambda_2) given in (16). Using (238) and (233) we get

\rho^{(2)}_{\mathrm{off}}(\lambda_1,\lambda_2) = \frac{\lambda_1^{1-\frac{1}{\nu}}\lambda_2^{\frac{\kappa}{\nu}-1}\Big(1+O\big(\tfrac{\lambda_1^{\frac{1}{\nu}-2}}{N}\big)+O\big(\tfrac{\lambda_1^{\frac{1}{\nu}-1}}{N}\big)\Big) - \lambda_2^{1-\frac{1}{\nu}}\lambda_1^{\frac{\kappa}{\nu}-1}\Big(1+O\big(\tfrac{\lambda_2^{\frac{1}{\nu}-2}}{N}\big)+O\big(\tfrac{\lambda_2^{\frac{1}{\nu}-1}}{N}\big)\Big)}{\pi^2\lambda_1\lambda_2 N(\lambda_1-\lambda_2)}.   (239)

Now we will estimate the scale S[\rho^{(2)}_{\mathrm{off}}] of \rho^{(2)}_{\mathrm{off}}(\lambda_1, \lambda_2), assuming that \lambda_1 and \lambda_2 have the scales s_1 and s_2, respectively. First, assume that \lambda_1 - \lambda_2 has the same scale as their maximum \lambda_1 \vee \lambda_2.
Then the corrections to the leading terms in the numerator can be neglected, and the scales of both the denominator and the numerator are given by the minimal scale of the two subtracted terms. Computing these minimal scales in (239) gives

S[\rho^{(2)}_{\mathrm{off}}](s_1, s_2) \ge 1 - s_1 - s_2.   (240)

Yet, the scale derived above might not be valid when \lambda_1 and \lambda_2 are too close to each other. This does not happen, which can be seen, for example, by writing out the difference in the numerator of \rho^{(2)}_{\mathrm{off}}(\lambda_1, \lambda_2) and observing that the whole expression behaves regularly as \lambda_1 \to \lambda_2. Then, a similar argument together with (235) shows that the scale derived in (240) also holds when both s_1 = s_2 = \nu. Thus, for all \lambda_1, \lambda_2 inside the support of \mu we have S[\rho^{(2)}_{\mathrm{off}}](s_1, s_2) \ge 1 - s_1 - s_2.

Finally, we estimate the contribution of the off-diagonal part of the second moment to the loss, L_{\mathrm{off}}[h] = \tfrac{1}{2}\int h(\lambda_1)h(\lambda_2)\,\rho^{(2)}_{\mathrm{off}}(d\lambda_1, d\lambda_2). For that, we can use a two-dimensional analog of Proposition 2, which can be proven similarly.

Proposition 6. Let g_N(\lambda_1, \lambda_2) be a sequence of functions with a scaling profile S(g)(s_1, s_2), and let a_N > 0 be a sequence of scale a > 0. Then, the sequence of integrals \int_{a_N}^{1}\int_{a_N}^{1}|g_N(\lambda_1,\lambda_2)|\,d\lambda_1 d\lambda_2 has the scale

S\Big[\int_{a_N}^{1}\int_{a_N}^{1}|g_N(\lambda_1,\lambda_2)|\,d\lambda_1 d\lambda_2\Big] = \min_{0\le s_1, s_2\le a}\big[S(g)(s_1,s_2) + s_1 + s_2\big].   (241)

Applying this statement to L_{\mathrm{off}}[h] = \tfrac{1}{2}\int h(\lambda_1)h(\lambda_2)\,\rho^{(2)}_{\mathrm{off}}(d\lambda_1, d\lambda_2), we get

S[L_{\mathrm{off}}] \ge \min_{0\le s_1, s_2\le\nu}\big[1 - s_1 - s_2 + S(h)(s_1) + S(h)(s_2) + s_1 + s_2\big] \ge 1.   (242)

In other words, we have shown that

\int h(\lambda_1)h(\lambda_2)\,\rho^{(2)}_{\mathrm{off}}(d\lambda_1, d\lambda_2) = O(N^{-1}).   (243)

Actually, the estimate above is tight, which can be shown, for example, by taking h(\lambda) \equiv 1 and computing the leading-order term, which turns out to be exactly of order N^{-1}. Note that the presence of an N^{-1} term in the loss, regardless of the values of \kappa, \nu, is unnatural for our problem, because we expect the rate of the optimal algorithm to be N^{-\kappa\wedge 2\nu}, which can be smaller than N^{-1}. The reason for this unnatural term can be traced back to our calculation of the resolvent second moment in (189) and (178), where in the application of the Wick theorem we took into account only the pairings producing leading-order terms in N. Thus, we might expect that taking into account subleading-order pairings in the Wick theorem would lift (243) from O(N^{-1}) to O(N^{-2}). What the role of the off-diagonal part \rho^{(2)}_{\mathrm{off}} would be if we took into account all pairings (i.e. performed a non-perturbative computation) is not clear at the moment and is an interesting direction for future work.

E.3.3 NOISY OBSERVATIONS

For noisy observations, our goal is to show the equivalence between the full loss functional and the NMNO functional (Theorem 2). In terms of the population densities \mu_\lambda(\lambda) and \mu_c(\lambda), the NMNO functional is written as

L^{(\mathrm{nmno})}[h] = \int\Big[\frac{\sigma^2}{N}h^2(\lambda)\,\mu_\lambda(\lambda) + \big(1-h(\lambda)\big)^2\mu_c(\lambda)\Big]d\lambda.   (244)

Let us decompose the difference between the two functionals as

L[h] - L^{(\mathrm{nmno})}[h] = \int_{\lambda_-}^{\lambda_+}\Big[\frac{\sigma^2}{N}\Big(\frac{\Im w(\lambda)}{\pi\lambda^2}-\mu_\lambda(\lambda)\Big)h^2(\lambda) + \Big(\frac{|r^{-1}(\lambda)|^2\,\Im v(\lambda)}{\pi\lambda^2}-\mu_c(\lambda)\Big)h^2(\lambda) - 2\Big(\frac{\Im u(\lambda)}{\pi\lambda}-\mu_c(\lambda)\Big)h(\lambda)\Big]d\lambda - \int_{0}^{\lambda_-}\Big[\frac{\sigma^2}{N}h^2(\lambda)\,\mu_\lambda(\lambda) + \big(1-h(\lambda)\big)^2\mu_c(\lambda)\Big]d\lambda + O(N^{-1}).   (245)

Now, similarly to the translation-invariant model on the circle, we write down the scales of all the terms in the difference between the two functionals, assuming that \lambda has the scale s. For that, we use the asymptotic expressions (238) for the functions v, u, w, and also expression (233) for r^{-1}(\lambda).
S\Big[\frac{\sigma^2}{N}\Big(\frac{\Im w(\lambda)}{\pi\lambda^2}-\mu_\lambda(\lambda)\Big)h^2(\lambda)\Big] = \Big(1-s-\frac{s}{\nu}\Big) + \Big(1-\frac{s}{\nu}\Big) + 2S(h)(s),   (246)

S\Big[\Big(\frac{|r^{-1}(\lambda)|^2\,\Im v(\lambda)}{\pi\lambda^2}-\mu_c(\lambda)\Big)h^2(\lambda)\Big] = \frac{\kappa}{\nu}s\wedge s + \Big(1-\frac{s}{\nu}\Big) + 0\wedge\frac{(2\nu-\kappa)s}{\nu} + 2S(h)(s),   (247)

S\Big[\Big(\frac{\Im u(\lambda)}{\pi\lambda}-\mu_c(\lambda)\Big)h(\lambda)\Big] = \frac{\kappa}{\nu}s\wedge s + \Big(1-\frac{s}{\nu}\Big) + 0\wedge\frac{(\nu-\kappa)s}{\nu} + S(h)(s),   (248)

S\Big[\frac{\sigma^2}{N}h^2(\lambda)\,\mu_\lambda(\lambda)\Big] = 1 - s - \frac{s}{\nu} + 2S(h)(s),   (249)

S\Big[\big(1-h(\lambda)\big)^2\mu_c(\lambda)\Big] = \frac{\kappa}{\nu}s\wedge s + 2S(1-h)(s).   (250)

For the last two terms of (245), we do this at the level of the whole integral, which is taken over the single scale s = \nu:

S\Big[\int_0^{\lambda_-}\frac{\sigma^2}{N}h^2(\lambda)\,\mu_\lambda(\lambda)\,d\lambda\Big] = 2S(h)(\nu),   (251)

S\Big[\int_0^{\lambda_-}\big(1-h(\lambda)\big)^2\mu_c(\lambda)\,d\lambda\Big] = \kappa + 2S(1-h)(\nu).   (252)

Using the scales derived above, we need to obtain the conditions on the learning algorithm scales S(1-h), S(h) under which the scale of all corrections to the loss is greater than the scale of the NMNO loss, given, as usual, by

S\big[L^{(\mathrm{nmno})}[h]\big] = \min_{0\le s\le\nu}\Big[\Big(\frac{\kappa}{\nu}s + 2S(1-h)(s)\Big)\wedge\Big(1-\frac{s}{\nu}+2S(h)(s)\Big)\Big] \le \frac{\kappa}{\kappa+1}.   (253)

Again, this means that we can neglect all corrections whose scale (after integration with d\lambda) is greater than \frac{\kappa}{\kappa+1}. In particular, we can neglect the O(N^{-1}) correction coming from the off-diagonal part of the second moment (243). First, note that the terms (251) and (252) exactly equal the contributions of the noise (249) and signal (250) parts at the scale s = \nu. Thus, if these terms are to be neglected, s = \nu should not be a localization scale of L^{(\mathrm{nmno})}[h]. Second, observe that the right parts of 0\wedge\frac{(2\nu-\kappa)s}{\nu} and 0\wedge\frac{(\nu-\kappa)s}{\nu} in (247) and (248), when activated, give at most an O(N^{-1}) contribution to the total loss and therefore can also be neglected. Then, after choosing the left 0 option, we see that the scale (248) never exceeds (247). Thus, the total contribution of these two terms can be effectively described by the scale \frac{\kappa}{\nu}s\wedge s + (1-\frac{s}{\nu}) + S(h)(s). Lastly, we can see that (246) is always larger than the corresponding NMNO noise scale (249), except for s = \nu where they are equal. But since we already require s = \nu not to be a localization scale of L^{(\mathrm{nmno})}[h], the term (246) can be neglected. Thus, all requirements, in addition to s = \nu not being a localization scale of L^{(\mathrm{nmno})}[h], can be summarized as

\min_{0\le s\le\nu}\Big[\frac{\kappa}{\nu}s\wedge s + \Big(1-\frac{s}{\nu}\Big) + S(h)(s)\Big] > \min_{0\le s\le\nu}\Big[\Big(\frac{\kappa}{\nu}s + 2S(1-h)(s)\Big)\wedge\Big(1-\frac{s}{\nu}+2S(h)(s)\Big)\Big].   (254)

However, we have

\min_{0\le s\le\nu}\Big[\frac{\kappa}{\nu}s\wedge s + \Big(1-\frac{s}{\nu}\Big) + S(h)(s)\Big] \ge 1\wedge\kappa > \frac{\kappa}{\kappa+1},   (255)

therefore the requirement (254) is satisfied automatically. Summarizing, the corrections to the NMNO functional are asymptotically vanishing whenever s = \nu is not a localization scale of L^{(\mathrm{nmno})}[h].
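The N^{-\kappa/(\kappa+1)} behavior predicted by (253) can be observed directly from the NMNO functional (244). The Python sketch below evaluates (244) numerically for the gradient-flow profile with the optimal time scaling; it is an illustration with arbitrarily chosen constants (including \sigma^2 = 1 and the lower cutoff), not a computation from the paper.

```python
# Sketch: numerical evaluation of the NMNO functional (244) for the GF profile
# h_t(l) = 1 - exp(-t*l), with t scaled as N^{nu/(kappa+1)}. The rescaled loss
# L * N^{kappa/(kappa+1)} should approach a constant as N grows.
import numpy as np
from scipy.integrate import quad

nu, kappa, sigma2 = 1.5, 1.0, 1.0
mu_lam = lambda l: l**(-1.0/nu - 1.0) / nu     # population eigenvalue density
mu_c   = lambda l: l**(kappa/nu - 1.0) / nu    # population target density

def L_nmno_gf(t, N, lam_min):
    pts = [1.0 / t]  # help the quadrature around the transition scale
    noise  = quad(lambda l: sigma2/N * (1 - np.exp(-t*l))**2 * mu_lam(l),
                  lam_min, 1.0, points=pts)[0]
    signal = quad(lambda l: np.exp(-2*t*l) * mu_c(l), lam_min, 1.0, points=pts)[0]
    return noise + signal

if __name__ == "__main__":
    for N in [10**3, 10**4, 10**5, 10**6]:
        t = N**(nu / (kappa + 1.0))        # localization scale s = nu/(kappa+1)
        lam_min = N**(-float(nu))          # roughly the left edge of the spectrum
        L = L_nmno_gf(t, N, lam_min)
        print(N, L, L * N**(kappa / (kappa + 1.0)))   # last column should stabilize
```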
E.3.4 NOISELESS OBSERVATIONS

Now, we take \sigma^2 = 0 and analyze the respective loss functional. Recall that our current calculations are accurate up to O(N^{-1}), as discussed before. Therefore, we will neglect all the terms that give contributions of order O(N^{-1}), since they are beyond our current level of accuracy. This means several things:

1. We ignore the contribution to the loss of the off-diagonal part \int h(\lambda_1)h(\lambda_2)\,\rho^{(2)}_{\mathrm{off}}(d\lambda_1, d\lambda_2) of the functional.
2. We ignore the O\big(\tfrac{\lambda^{1-1/\nu}}{N}\big) corrections to \Im v(\lambda) and \Im u(\lambda) in (238): in the previous section we have shown that they give at most an O(N^{-1}) contribution to the loss.
3. We restrict ourselves to values of the signal exponent \kappa < 1. Since the optimal scaling of the loss in the noiseless case is expected to be N^{-\kappa}, we will not be able to capture it for \kappa \ge 1.
4. The previous point also implies that we cannot access values \kappa > 2\nu > 2 corresponding to the noiseless saturation phase, since the eigenvalue exponent is confined to the physical region \nu > 1.

With the above remarks taken into account, we start deriving the loss functional and, for compactness, ignore all the O(N^{-1}) terms. First, due to the diagonality of the remaining part of the learning measure \rho^{(2)}(d\lambda_1, d\lambda_2), we can immediately specify the optimal algorithm and the respective decomposition of the functional:

L[h] = \int\Big[\frac{|r^{-1}(\lambda)|^2\,\Im v(\lambda)}{\pi\lambda^2}\big(h(\lambda)-h^*(\lambda)\big)^2 + \mu_c(\lambda) - \frac{\big(\Im u(\lambda)\big)^2}{\pi\,|r^{-1}(\lambda)|^2\,\Im v(\lambda)}\Big]d\lambda,   (256)

h^*(\lambda) = \frac{\lambda\,\Im u(\lambda)}{|r^{-1}(\lambda)|^2\,\Im v(\lambda)} = 1 + O\Big(\frac{\lambda^{-\frac{1}{\nu}}}{N}\Big),   (257)

where we have used the asymptotics (238) to estimate the deviation of the optimal algorithm from h(\lambda) = 1. Again using (238), we see that the free term in (256) has the order O\big(\tfrac{\lambda^{\frac{\kappa-1}{\nu}}}{N}\big), which, given \kappa < 1, always localizes on the maximal scale s = \nu. If we additionally assume that the learning algorithm fits the higher eigenvalues well enough, |1 - h(\lambda)| = O\big(\sqrt{\lambda^{-1/\nu}/N}\big), the first term in (256) will also always localize on the maximal scale s = \nu.

Now, relying on the fact that the loss localizes on s = \nu, we can express all the terms in the functional through the phase \phi. The eigenvalue \lambda is defined by the explicit form of (227):

\lambda = \Big(\frac{N\sin\phi}{C_\nu\sin\big(\frac{\nu-1}{\nu}\phi\big)}\Big)^{-\nu}\sin\phi\,\Big(\cot\big(\tfrac{\nu-1}{\nu}\phi\big)-\cot\phi\Big).   (258)

Using (235) and (229), the terms of the functional become

\frac{|r^{-1}(\lambda)|^2\,\Im v(\lambda)}{\pi\lambda^2} = \Big(\frac{N\sin\phi}{C_\nu\sin\big(\frac{\nu-1}{\nu}\phi\big)}\Big)^{\nu-\kappa}\frac{\sin\big(\phi-\frac{\kappa}{\nu}\phi\big)}{\nu\,\sin\big(\frac{\kappa}{\nu}\pi\big)\,\sin^2\!\phi\,\big(\cot\big(\frac{\nu-1}{\nu}\phi\big)-\cot\phi\big)^2},   (259)

\mu_c(\lambda) = \frac{1}{\nu}\Big(\frac{N\sin\phi}{C_\nu\sin\big(\frac{\nu-1}{\nu}\phi\big)}\Big)^{\nu-\kappa}\Big[\sin\phi\,\Big(\cot\big(\tfrac{\nu-1}{\nu}\phi\big)-\cot\phi\Big)\Big]^{\frac{\kappa}{\nu}-1},   (260)

\frac{\big(\Im u(\lambda)\big)^2}{\pi\,|r^{-1}(\lambda)|^2\,\Im v(\lambda)} = \frac{1}{\nu}\Big(\frac{N\sin\phi}{C_\nu\sin\big(\frac{\nu-1}{\nu}\phi\big)}\Big)^{\nu-\kappa}\frac{\sin^2\big(\frac{\kappa}{\nu}\phi\big)}{\sin\big(\phi-\frac{\kappa}{\nu}\phi\big)\,\sin\big(\frac{\kappa}{\nu}\pi\big)},   (261)

and the optimal algorithm is

h^*(\lambda) = \frac{\cot\big(\frac{\nu-1}{\nu}\phi\big)-\cot\phi}{\cot\big(\frac{\kappa}{\nu}\phi\big)-\cot\phi}.   (262)

Substituting everything into the loss functional gives

L[h] = \int_0^{\phi_+}\Big(\frac{N\sin\phi}{C_\nu\sin\big(\frac{\nu-1}{\nu}\phi\big)}\Big)^{\nu-\kappa}\Big[\frac{\sin\big(\phi-\frac{\kappa}{\nu}\phi\big)\,\big(h(\lambda)-h^*(\lambda)\big)^2}{\nu\,\sin\big(\frac{\kappa}{\nu}\pi\big)\,\sin^2\!\phi\,\big(\cot\big(\frac{\nu-1}{\nu}\phi\big)-\cot\phi\big)^2} + \frac{\big[\sin\phi\,\big(\cot\big(\frac{\nu-1}{\nu}\phi\big)-\cot\phi\big)\big]^{\frac{\kappa}{\nu}-1}}{\nu} - \frac{\sin^2\big(\frac{\kappa}{\nu}\phi\big)}{\nu\,\sin\big(\phi-\frac{\kappa}{\nu}\phi\big)\,\sin\big(\frac{\kappa}{\nu}\pi\big)}\Big]\frac{d\lambda}{d\phi}\,d\phi.   (263)

Finally, we note that the \phi \to \pi asymptotics of the expressions under the integral above can be inferred from the respective \frac{\lambda^{-1/\nu}}{N} \to 0 asymptotics of the original functional, since \pi-\phi = \frac{\pi/\nu}{N\lambda^{1/\nu}}\,(1+o(1)). Thus, we can write

L[h] = N^{-\kappa}\int_0^{\phi_+}\Big[O\big((\pi-\phi)^{-\kappa-1}\big)\,\big(h(\lambda_\phi)-h^*(\lambda_\phi)\big)^2 + O\big((\pi-\phi)^{-\kappa}\big)\Big]d\phi.   (264)

From the above expression we see that, if |h(\lambda_\phi) - h^*(\lambda_\phi)| = O(|h(\lambda_\phi) - 1|) = O(\pi-\phi), the expression under the integral is integrable at \phi = \pi. Therefore, the upper integration limit can be shifted to \pi without changing the leading-order asymptotics of the loss.

Overlearning transition. As discussed above, our analysis of the Wishart model in the noiseless case is restricted to values of the target exponent \kappa < 1 due to the O(N^{-1}) accuracy of the calculations. Thus, if \nu < 2, we can access the overlearning transition point \kappa = \nu - 1 < 1 and verify whether the transition holds for the Wishart model. Proceeding similarly to the respective Lemma 2 for the Circle model, we have

Lemma 3. Consider the optimal algorithm (262):

h^*(\lambda) = \frac{\cot\big(\frac{\nu-1}{\nu}\phi\big)-\cot\phi}{\cot\big(\frac{\kappa}{\nu}\phi\big)-\cot\phi},   (265)

with eigenvalue \lambda = \lambda(\phi), \phi \in (0, \pi), parameterized according to (258). Then, assuming \kappa < 1, for any \phi \in (0, \pi)

h^*(\lambda) \;\; \begin{cases} < 1, & \kappa + 1 < \nu, \\ = 1, & \kappa + 1 = \nu, \\ > 1, & \kappa + 1 > \nu. \end{cases}   (266)

Proof. Observe that both the numerator and the denominator of (265) have the form g(a, \phi) = \cot(a\phi) - \cot(\phi) with a \in (0, 1), \phi \in (0, \pi). Since \cot(\phi) is strictly decreasing on (0, \pi), the function g(a, \phi) is 1) positive and 2) at fixed \phi, strictly decreasing in a. Thus, for the ratio of two such functions we have

\frac{g(a,\phi)}{g(b,\phi)} \;\; \begin{cases} < 1, & a > b, \\ = 1, & a = b, \\ > 1, & a < b. \end{cases}   (267)

The above is equivalent to the statement of the lemma once we set a = \frac{\nu-1}{\nu} and b = \frac{\kappa}{\nu}.

The above result shows that the behavior of the sign of h^*(\lambda) - 1, which indicates overlearning/underlearning, is exactly the same in the Wishart and Circle models.
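Lemma 3 is easy to check numerically. The Python sketch below evaluates the ratio (265) on a grid of phases for three values of \kappa around the transition; it is an illustrative check (our own choice of \nu, \kappa and grid), not part of the original experiments.

```python
# Numerical illustration of Lemma 3: the sign of h*(lambda) - 1 is controlled
# by the comparison of kappa + 1 with nu, uniformly in the phase phi.
import numpy as np

def h_star(phi, nu, kappa):
    # h* = (cot((nu-1)/nu * phi) - cot(phi)) / (cot(kappa/nu * phi) - cot(phi)),  cf. (265)
    cot = lambda x: 1.0 / np.tan(x)
    return (cot((nu - 1.0) / nu * phi) - cot(phi)) / (cot(kappa / nu * phi) - cot(phi))

if __name__ == "__main__":
    nu = 1.5
    phis = np.linspace(0.05, np.pi - 0.05, 7)
    for kappa in [0.3, 0.5, 0.9]:        # kappa + 1 <, =, > nu for nu = 1.5
        vals = h_star(phis, nu, kappa)
        print(kappa, np.round(vals, 4))  # all < 1, all == 1, all > 1, respectively
```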
This makes us believe that the overlearning transition can be a quite general phenomenon going beyond the two data models considered in the current work.

F EXPERIMENTS

F.1 FIGURE 1: DETAILS AND DISCUSSION.

Let us start by describing the experiment setting and details. Both KRR and GF plots use optimally scaled regularization \eta and time t, as derived in Section C. For all three data models, we consider an ideal power-law population spectrum: \lambda_l = l^{-\nu}, c_l^2 = l^{-\kappa-1} (truncated at P = 4\cdot 10^4 due to computational limitations), and an adapted version \lambda_l = (2(|l|+1))^{-\nu}, |c_l|^2 = (2(|l|+1))^{-\kappa-1}, l \in Z, for the Circle model. The extra factor of 2 here is needed to asymptotically align the population spectra at small \lambda: for all three models we have \mu_c([0, \lambda]) \approx \tfrac{1}{\kappa}\lambda^{\frac{\kappa}{\nu}} and \mu_\lambda([\lambda, \infty)) \approx \lambda^{-\frac{1}{\nu}}.

The figure contains 3 types of data that are computed in different ways. The first type is the scatter-plot markers, which correspond to the estimation of the generalization loss via direct simulation. For the Wishart and Cosine Wishart (see Section F.3) models, this amounts to sampling the empirical kernel matrix K and the observation vector y, calculating the generalization error for the resulting sampled realization, and finally averaging the result over n = 100 repetitions of the above procedure to estimate the expectation over the training dataset D_N in (3). Due to computational limitations, we were able to execute this procedure only up to empirical kernel matrices of size N = 4\cdot 10^3. For the Circle model, Theorem 1 gives an exact value of the expected loss (3) (i.e. no approximations are made during the derivation). Since for the considered type of population spectra the N-perturbations (10) can be expressed in terms of the Hurwitz zeta function (see (127)), we compute all the terms in (11) analytically. This way, we are able to reach much larger values of N for the Circle model.

Let us also comment on the estimation of the generalization error (3) of a given realization of the Wishart or Cosine Wishart models. The expectation over x requires the scalar products

\langle K(x_i, \cdot), K(\cdot, x_j)\rangle = \sum_l \lambda_l^2\,\phi_l(x_i)\phi_l(x_j), \qquad \langle f, K(\cdot, x_j)\rangle = \sum_l \lambda_l c_l\,\phi_l(x_j)   (268)

for points x_i, x_j from the sampled dataset D_N. The expressions above can be used to calculate these scalar products experimentally, since we have access to \lambda_l, c_l and the already sampled feature values \phi_l(x_i).

The second type of data is depicted with solid lines and corresponds to the direct calculation of the NMNO loss (17), which becomes a sum for a discrete population spectrum. Assuming the \lambda_l are sorted in descending order, the NMNO loss becomes

L^{(\mathrm{nmno})}[h] = \sum_{l=1}^{N}\Big[c_l^2\big(1-h(\lambda_l)\big)^2 + \frac{\sigma^2}{N}\,h(\lambda_l)^2\Big],   (269)

which we can easily compute for quite large values of N.
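For concreteness, the Python sketch below evaluates the discrete sum (269) for the KRR and GF profiles, with \eta and t chosen by a simple grid search around their optimal scalings. The truncation, noise level and grids are our own illustrative assumptions and do not reproduce the exact settings behind the figure.

```python
# Sketch: discrete NMNO loss (269) for an ideal power-law spectrum
# lambda_l = l^{-nu}, c_l^2 = l^{-kappa-1}, evaluated for the KRR and GF
# profiles with a grid search over eta / t.
import numpy as np

nu, kappa, sigma2, P = 1.5, 1.0, 1.0, 4 * 10**4
l = np.arange(1, P + 1)
lam, c2 = l**(-float(nu)), l**(-(kappa + 1.0))

def nmno(h_of_lam, N):
    # Sum (269) over the N largest population eigenvalues.
    h = h_of_lam(lam[:N])
    return np.sum(c2[:N] * (1 - h)**2 + sigma2 / N * h**2)

def best_over_grid(profile, grid, N):
    return min(nmno(lambda x: profile(x, p), N) for p in grid)

if __name__ == "__main__":
    krr = lambda x, eta: x / (x + eta)          # KRR profile h_eta
    gf  = lambda x, t: 1.0 - np.exp(-t * x)     # GF profile h_t
    for N in [10**2, 10**3, 10**4]:
        eta_grid = N**(-nu / (kappa + 1.0)) * np.logspace(-2, 2, 41)
        t_grid   = N**( nu / (kappa + 1.0)) * np.logspace(-2, 2, 41)
        print(N, best_over_grid(krr, eta_grid, N), best_over_grid(gf, t_grid, N))
```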
Finally, the third type of data is the N \to \infty loss asymptotic L = CN^{-\#}(1 + o(1)), depicted with dashed lines. The constant C is computed analytically, similarly to the limiting expressions (138), (135) for the noiseless Circle model and (263) for the Wishart model. For that, we 1) recall (see Section C) the loss localization scale s_{\mathrm{loc}} for the considered algorithms: s_{\mathrm{loc}} = \frac{\nu}{\kappa+1} for GD and non-saturated KRR, and s_{\mathrm{loc}} \in \{0, \frac{\nu}{2\nu+1}\} for saturated KRR, and 2) then take the limiting expressions of the NMNO loss functional (17).

For GF, the algorithm profile is h_t(\lambda) = 1 - e^{-t\lambda}, and the loss localization scale is s_{\mathrm{loc}} = \frac{\nu}{\kappa+1}. Thus, we make the change of variables \lambda = \lambda' N^{-\frac{\nu}{\kappa+1}} and t = t' N^{\frac{\nu}{\kappa+1}}, leading to the following limiting loss:

L^{(\mathrm{nmno})}[h_t] = \frac{N^{-\frac{\kappa}{\kappa+1}}}{\nu}\int_0^\infty\Big[e^{-2t'\lambda'}(\lambda')^{\frac{\kappa}{\nu}} + \sigma^2\big(1-e^{-t'\lambda'}\big)^2(\lambda')^{-\frac{1}{\nu}}\Big]\frac{d\lambda'}{\lambda'}\,\big(1+o(1)\big).   (270)

Note that the integral above converges both at \lambda' \to 0 and \lambda' \to \infty, which reflects that our change of variables has used the correct loss localization scale. The integral in (270) can be computed either numerically or analytically by reducing it to Gamma functions.

For KRR, the algorithm profile is h_\eta(\lambda) = \frac{\lambda}{\lambda+\eta}, and the loss localization changes between the saturated and non-saturated phases. In the non-saturated phase, we make the change of variables \lambda = \lambda' N^{-\frac{\nu}{\kappa+1}} and \eta = \eta' N^{-\frac{\nu}{\kappa+1}}, leading to

L^{(\mathrm{nmno})}[h_\eta] = \frac{N^{-\frac{\kappa}{\kappa+1}}}{\nu}\int_0^\infty\frac{(\eta')^2(\lambda')^{\frac{\kappa}{\nu}} + \sigma^2(\lambda')^{2-\frac{1}{\nu}}}{(\lambda'+\eta')^2}\,\frac{d\lambda'}{\lambda'}\,\big(1+o(1)\big), \qquad \kappa < 2\nu.   (271)

In the saturated phase, the noise part localizes at s_{\mathrm{loc}} = \frac{\nu}{2\nu+1}, which coincides with the scale of the regularization \eta = \eta' N^{-\frac{\nu}{2\nu+1}}. The signal part localizes at s_{\mathrm{loc}} = 0, where the loss functional remains discrete, and we can use a perturbative expression for the learning algorithm, 1 - h_\eta(\lambda) \approx \frac{\eta}{\lambda}. This leads to

L^{(\mathrm{nmno})}[h_\eta] = N^{-\frac{2\nu}{2\nu+1}}\Big[\frac{\sigma^2}{\nu}\int_0^\infty\frac{(\lambda')^{2-\frac{1}{\nu}}}{(\lambda'+\eta')^2}\,\frac{d\lambda'}{\lambda'} + (\eta')^2\sum_l\frac{c_l^2}{\lambda_l^2}\Big]\big(1+o(1)\big), \qquad \kappa > 2\nu.   (272)

As for GF, the integrals in (271) and (272) can be computed either numerically or analytically by reducing them to Beta-function integrals with the change of integration variable z = \frac{\eta'}{\lambda'+\eta'}.
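As an illustration of the reduction to Gamma functions mentioned after (270), the Python sketch below evaluates the limiting GF integral (the bracket in (270), without the overall N^{-\kappa/(\kappa+1)} factor) both by quadrature and via a closed form. The closed form is our own reduction under the assumptions \nu > 1 and \kappa > 0, stated in the comments; it is not a formula quoted from the main text.

```python
# Sketch: the limiting GF integral from (270), numerically and via Gamma functions.
# Assumed closed forms (our reduction):
#   int_0^inf e^{-2 t l} l^{kappa/nu - 1} dl        = Gamma(kappa/nu) (2t)^{-kappa/nu}
#   int_0^inf (1 - e^{-t l})^2 l^{-1/nu - 1} dl     = nu Gamma(1 - 1/nu) (2 - 2^{1/nu}) t^{1/nu}
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def limiting_gf_loss(t, nu, kappa, sigma2):
    signal = quad(lambda l: np.exp(-2*t*l) * l**(kappa/nu - 1.0), 0.0, np.inf)[0]
    noise  = quad(lambda l: (1 - np.exp(-t*l))**2 * l**(-1.0/nu - 1.0), 0.0, np.inf)[0]
    return (signal + sigma2 * noise) / nu

def limiting_gf_loss_closed(t, nu, kappa, sigma2):
    signal = gamma(kappa/nu) * (2*t)**(-kappa/nu)
    noise  = nu * gamma(1.0 - 1.0/nu) * (2.0 - 2.0**(1.0/nu)) * t**(1.0/nu)
    return (signal + sigma2 * noise) / nu

if __name__ == "__main__":
    nu, kappa, sigma2 = 1.5, 1.0, 1.0
    for t in [0.5, 1.0, 3.0]:
        print(t, limiting_gf_loss(t, nu, kappa, sigma2),
                 limiting_gf_loss_closed(t, nu, kappa, sigma2))
```

Minimizing such a one-dimensional expression over t (or \eta for KRR) is how the dashed asymptotic curves can be obtained in practice.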
Discussion. The main conclusion of the figure is, indeed, the match between the actual values of the loss for a given model (scatter markers) and the NMNO values (solid lines) for large enough N. This validates experimentally the statement of Theorem 2. Importantly, the Cosine Wishart model is not covered by our theory but still demonstrates the equivalence. We interpret this as an indication that the equivalence holds for a broader class of models. At the moment, we are not ready to give any potential candidates for such a class of models, since that would require a different set of tools than the one used in this work.

Observe that the match between the actual loss of a data model and its NMNO analog happens at quite small N for the large value of the target exponent \kappa = 5 (two right subfigures), whereas much larger N is required for the small value \kappa = 0.5. This is a general manifestation of the fact that slower power-laws require much larger sizes N for the asymptotic N \to \infty behavior to set in. For instance, the NMNO loss (17) ignores the error associated with the unlearned signal at eigenvalues \lambda < \lambda_{\min}. The contribution of this part can be roughly estimated as \mu_c([0, \lambda_N])/\mu_c([0, 1]) \approx N^{-\kappa} = e^{-\kappa\log N}, which becomes negligible only at exponentially large values of the dataset size N \gg e^{1/\kappa}.

Finally, we note that only the third subplot of the figure (corresponding to saturated KRR) has different loss asymptotics for the Circle model on one side and the Wishart and Cosine Wishart models on the other side. This difference is due to the respective population spectra, which, although matching asymptotically at \lambda \to 0, are different at the head of the spectrum \lambda \sim 1. Since the loss for saturated KRR localizes at the scale s = 0 (i.e. \lambda \sim 1), as demonstrated in Figure 2, the difference between the population spectra at \lambda \sim 1 starts to affect the loss values. On the contrary, when there is no saturation (the rest of the plots in the figure), the loss localizes on some scale s > 0 where the two population spectra become (asymptotically) the same. This is the reason why subplots 1, 2 and 4 have a single common loss asymptotic, colored in grey.

F.2 FIGURE 3: DETAILS AND DISCUSSION.

This figure's primary focus is the comparison of different algorithms. The profiles of each of the 4 considered algorithms were obtained as follows: interpolation is simply h(\lambda) = 1; the optimal algorithm was calculated from (23); optimally stopped GF was obtained by first evaluating the asymptotic loss (138) on a wide and dense grid of t values and subsequently picking the optimal t among those values; and for optimally regularized KRR the same procedure (including both negative and positive regularization values) was used. On the first plot, we validate the asymptotic loss value L^{(\mathrm{asym})} = CN^{-\kappa} (dashed lines) computed from (138) against the exact loss value at finite N given by (11) (scatter markers), computed similarly to the respective values from Figure 1. On the remaining 3 plots we exclusively use the asymptotic expression (138).

Discussion. The first plot indeed confirms that the asymptotic expression L^{(\mathrm{asym})} = CN^{-\kappa} given by (138) accurately describes the loss values with the correct constant C. The second plot examines the behavior of the constant C across the range of target exponents \kappa \in (0, 2\nu); saturated values \kappa > 2\nu have a different rate N^{-2\nu} and are thus not considered in the figure. In particular, in the vicinity of the saturation transition the constant starts to blow up: C \to \infty as \kappa \to 2\nu - 0. The overlearning transition at \kappa = \nu - 1 is also quite visible: for \kappa < \nu - 1 all the considered algorithms have very close loss values, while for \kappa > \nu - 1 a significant gap appears between optimal KRR and the optimal algorithm on one side, and interpolation and optimal GF on the other side. This demonstrates that significant overlearning h(\lambda) > 1 is required for strong performance. Note that the GF profile satisfies h_t(\lambda) = 1 - e^{-t\lambda} \le 1, which forces t = \infty for \kappa > \nu - 1, thus explaining the exact match between the green and grey lines for \kappa > \nu - 1.

Actually, it is quite difficult to distinguish the loss curves of the different algorithms (except for the overlearning gap discussed above) on the two left plots. However, there is a well-defined (i.e. not due to numerical errors) difference between the algorithms, although high zoom is required to see it. Such a small difference seems to be an intrinsic property of the Circle model: we have tried wide ranges of \kappa and \nu and observed that in all of them the relative difference between the algorithms was of the order of 1% or smaller.

F.3 COSINE WISHART MODEL

The model is constructed as follows. Given a population spectrum \lambda_l, c_l \in R, l = 0, 1, 2, 3, \ldots, the features \phi_l(x) with x \in S are defined as

\phi_l(x) = \begin{cases} 1, & l = 0, \\ \sqrt{2}\cos(lx), & l \ge 1, \end{cases}   (273)

and the training dataset inputs x_i \in D_N are sampled i.i.d. from the uniform distribution on the circle S. In particular, (273) ensures that E_x[\phi_l(x)\phi_{l'}(x)] = \delta_{ll'}.

The Cosine Wishart model can be thought of as something intermediate between our main Wishart and Circle models. On the one side, the population quantities of the Cosine Wishart model are the same as those of the Circle model given in (6), except for the presence of e^{il(x+x')} terms in the kernel K(x, x'), which make it non-translation-invariant. This last aspect does not move the Cosine Wishart model too far from the Circle model, as the respective empirical kernel matrix K would still be almost diagonalized by the discrete Fourier harmonics e^{i\frac{2\pi k i}{N}} if the inputs formed a regular lattice. The more important difference is that the inputs x_i of the Cosine Wishart model are sampled i.i.d., which completely eliminates the possibility to analytically diagonalize K, a core step of the Circle model solution.
On the other side, for a given population spectrum \lambda_l, c_l, the structures of the Cosine Wishart and Wishart models are similar in the sense that, in both cases, random realizations of the empirical kernel matrix K are obtained by sampling the components of the feature matrix \Phi with first two moments E\Phi_{li} = 0 and E\Phi_{li}^2 = 1, except for E\Phi_{0i} = 1 in the Cosine Wishart case, which only amounts to a single spike in K. However, the most important difference is that for the Wishart model all entries of \Phi are independent, while for the Cosine Wishart model there is a very strong correlation between the entries \Phi_{li} = \sqrt{2}\cos(lx_i) with different l but the same i. This correlation turns out to be crucial and makes the Stieltjes transform r(z) = \mathrm{Tr}\big[(K - zI)^{-1}\big] no longer deterministic, which was the core assumption of our analysis of the Wishart model based on the fixed-point equation (13).
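For completeness, the Python sketch below samples the Cosine Wishart features (273), checks their orthonormality numerically, and assembles one realization of the empirical kernel matrix. The sizes and the power-law spectrum used here are arbitrary illustrative choices, not the settings of the experiments above.

```python
# Sketch: sampling Cosine Wishart features (273), verifying
# E_x[phi_l(x) phi_l'(x)] = delta_{ll'}, and building one realization of K.
import numpy as np

rng = np.random.default_rng(0)

def features(x, L):
    # phi_0(x) = 1, phi_l(x) = sqrt(2) cos(l x) for l >= 1; x uniform on the circle.
    Phi = np.ones((L, x.size))
    ls = np.arange(1, L)[:, None]
    Phi[1:] = np.sqrt(2.0) * np.cos(ls * x[None, :])
    return Phi

if __name__ == "__main__":
    L, N, nu = 64, 4000, 1.5
    x = rng.uniform(0.0, 2.0 * np.pi, size=N)
    Phi = features(x, L)                       # shape (L, N)
    gram = Phi @ Phi.T / N                     # should approach the identity as N grows
    print("max deviation from identity:", np.abs(gram - np.eye(L)).max())
    lam = np.arange(1, L + 1)**(-float(nu))    # an illustrative power-law spectrum
    K = Phi.T @ (lam[:, None] * Phi)           # K_ij = sum_l lambda_l phi_l(x_i) phi_l(x_j)
    print("K shape:", K.shape)
```

Note how the columns of \Phi share the same input x_i across all l, which is exactly the strong correlation discussed above.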