# 43_phases_of_computeoptimal_neural_scaling_laws__b62ddf67.pdf 4+3 Phases of Compute-Optimal Neural Scaling Laws Elliot Paquette Mc Gill University elliot.paquette@mcgill.ca Courtney Paquette Google Deep Mind & Mc Gill University courtney.paquette@mcgill.ca Lechao Xiao Google Deep Mind Jeffrey Pennington Google Deep Mind We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scalinglaw exponents in all of these phases, in particular computing the optimal modelparameter-count as a function of floating point operation budget. We include a colab notebook nano Chinchilla3 that reproduces some key results of the paper. 1 Introduction The advent of large language models (LLMs) has changed our perceptions of the landscape of optimization and is resulting in the emergence of new interesting questions related to scaling. Prior to LLMs and other large models, we often viewed the large-scale optimization problems as being limited by the amount of data. In training language models, in contrast, data can be effectively infinite. Thus, compute budgets can be the limitation. This leads to the following natural question: given an architecture, given a fixed compute budget, and having unlimited data, how should one select the model size to minimize loss? To formally address this question, let us consider the general learning problem, min θ Rd P(θ) = Ex[R(θ; x)] , where R : Rd R, (1) the number of parameters d is large, and the data vector x is drawn from an unknown distribution. We solve (1) using stochastic algorithms, such as stochastic gradient descent (SGD) with batch size B, under various parameter sizes d, that produce a sequence of iterates {θr}. A standard formula used in practice to measure compute is the "6ND" formula [26], that is, Compute (flops4f) = (iterations of alg. (r) batch size (B)) parameters (d). (2) Corresponding author; website: https://elliotpaquette.github.io/. The authors contributed equally to the paper. 3Available at: https://tinyurl.com/2saj6bkj 38th Conference on Neural Information Processing Systems (Neur IPS 2024). Therefore, we can plot the loss curve P(θr; d) = P(r; d) = P(f/(d B); d) as a function of flops (see Fig. 1). The question now is: given a fixed number of flops f and given batch size B, how should we choose the parameters d so that we get the best loss, i.e. find d solving the constrained problem d (f) arg mind P f d B ; d = arg mind P(θr; d) subj. to f = (r B) d . (3) 104 106 108 1010 1012 flops loss, P(flops) alpha =0.4, beta =0.4 D = 1,600 D = 3,200 D = 6,400 D = 12,800 D = 25,600 D = 51,200 compute optimal fixed compute Figure 1: Toy scaling problem. We plot the loss function, P(θr; d) as a function of flops f using (2). Consider a fixed number of flops f = 107 (dashed line). If we had chosen, e.g., d = 1600, we can run for a long time, but our model does not have a lot of capacity and thus the value of the loss function remains high. On the hand, we can increase capacity by choosing a large number of parameters (e.g., d = 51, 200), but because our compute is fixed we can not run our algorithm for very long. Thus the loss value is still large. The optimal choice is d 6, 400. When done for every choice of f gives the compute-optimal curve (red line). This choice of (α, β) (Phase I) is an example of where model capacity controls the compute-optimal curve, but it is not the only behavior we show. In other phases the compute-optimal is controlled by poor model embedding (Phase II, III) and SGD noise (Phase III, IV). Main contributions. In this work, we analyze a three parameter simple model, which we call powerlaw random features (PLRF) [30, 5]. The three parameters in the PLRF are the data complexity (α), target complexity (β) and model-parameter count d. Using this model, we derive a deterministic equivalent for the expected loss, as a function of α, β, and d, that captures the training dynamics of one-pass SGD. This can be used to derive numerical predictions for the scaling laws. We also extract exact expressions for the compute-optimal scaling laws and the optimal parameter d (f) argmind P( f d B ; d) for large5 d, and give some estimates on the order of d necessary for these scaling laws to take hold. We also observe for a large portion of the (α, β)- phase plane, the optimal parameter is d (f) = f1/2, suggesting a regime of universal scaling behavior (see Fig. 4a and Table 2). This verifies theoretically the Chinchilla scaling [24]. The PLRF is not only analyzable, but also exhibits a rich behavior of compute-optimal curves/loss curves, which are qualitatively and quantitatively different depending on the strengths of the data (α) vs. target (β) complexity. Particularly, we show that there are 4 distinct (+3 sub phases) compute-optimal curve/loss curve behaviors. Model constrained compute-optimal curves. In two of the phases (Phase Ia,b,c and Phase II), it is the underlying model that dictates the curves. The algorithm has little/no impact. This appears in two forms. The first behavior are compute-optimal curves controlled by the capacity of the model (Phase Ia,b,c). Here once the algorithm reaches the limiting risk value possible (capacity), it is better to increase the model-parameter d. Another type of loss dynamics is due to poor model feature embedding (Phase II). Here the features are embedded in a way which is difficult to train. After an initial large decrease in the loss value, this feature embedding distortion frustrates the algorithm and training slows, but it continues to solve. However, solving to capacity wastes compute, in that it is compute-favored to increase the model parameter count d. Algorithm constrained compute-optimal curves. For some choices of (α, β) (Phase III and IV), it is the noise produced by the SGD algorithm that ultimately controls the tradeoff. Here the algorithm matters. Indeed, another algorithm could change the compute-optimal curves for these phases. Related work. The key source of inspiration for this work are [24, 26], which identified compute optimality as a fundamental concept in scaling large language models and made a substantial empirical exploration of it. The problem setup was formulated by [30], where additionally data-limited scalings were considered, but compute optimality was not (nor indeed any algorithmic considerations); see also [8] where gradient flow is considered in the same setting. 4Here and throughout we use flops to mean floating point operations and not as the rate floating point operations per second. We also drop the pre-factor 6 in "6ND" formula for simplicity. 5We discuss how large is large, but the truth is somewhat complicated and also quite dependent on the desired precision. If 0.05 on the achieved scaling laws is tolerable, a flat d > 1000 seems to suffice across all phases. Phase Ia, Ib, Ic no power law high-dim. line pentuple point (a) Phase Diagram Figure 2: Phase Diagram and Cartoon Plots of Loss Curves in Different Phases. (a) Phase Diagram. Colored regions represent where the training of the risk/compute-optimal curves look qualitatively and quantitatively different depending on α and β. This, in term, yields different scaling law (η) and parameter count (ξ) exponents for each of the phases. Critical point at α = β = 1/2 where all behaviors are observed. The other plots illustrate the components of F (via F0, Fpp, Fac) and Kpp which dominate the loss curve for each phase (see Sec. C.4.1 & Sec. C.4.1 for proofs); tradeoff between the functions where the compute-optimal point occurs is also indicated (see Sec. 2.1 for definitions and Sec. 3.1 & Sec. D for proofs). There is a substantial body of work considering scaling laws of losses (trained to minimum-loss) of dataset size vs parameter count, in a variety of settings (linear, random features, deep networks). See especially: [5, 39, 41], wherein a hidden-manifold model is considered for the data. We note that as we consider one-pass SGD, some dataset/parameter-count scaling laws are implicit from the results here; however, the training method (one-pass SGD) is, in some regimes, suboptimal given unlimited compute. For additional related work on random features models (and sample complexity), random matrix theory in machine learning, and other deterministic equivalents for SGD, see Section A. We note that while this paper is fundamentally about computation, but the novel mathematical contributions could also be recast in terms of generalization bounds of one-pass SGD, some of which are new. For a detailed comparison of the convergence rates and sample complexity, see Table 4. 1.1 Problem Setup: SGD on Power-law Random Features In this work, we analyze the three parameter power-law random features (PLRF) model, that is, min θ Rd P(θ) def = Ex[( W T x, θ x, b )2] . (4) We embed the data vector x Rv in Rd through the matrix W Rv d and construct noiseless targets6 by dotting a fixed b Rv with the sample x. The use of the matrix W allows the model to have variable capacity (d) independent of the data set size. The samples x Rv and labels b Rv have power law dependence, whereas the matrix W has entries distributed as N(0, 1/d). 6With label noise, the scaling laws are the same as we report here, up to a scale at which the label noise is the limiting factor in the optimization and further increase of compute-budget or d does not yield any benefits. (a) Compute-optimal Front (b) Iso FLOP (c) Optimal Model Size Figure 3: Compute-Optimal Front in Phase II-III boundary. (a) The Volterra equations perfectly captures the training dynamics of SGD when model-parameter count ranges from d = 200 12800. (b) We apply Iso FLOP approach [24] to our toy model to extract the optimal-compute front: (computeoptimal loss) (highlighted in red in (a)) and the optimal model size: (compute-optimal model size) (scattered in purple in (c)). Power-law fitting compute-optimal front gives a measurement of the scaling law exponent 0.648 (vs. theoretical prediction 0.643 in Table 2). In (c), we power-law fit the relation between compute and (empirical) optimal model size via Approach 1 and 2 used in [24]: d f0.508 and d f0.525, resp. (vs. theory, d f0.5). See Sec. J for details. Assumption 1 (Data and labels, α and β). The samples x Rv are distributed according to (xj) j αzj for all 1 j v and {zj}v j=1 N(0, 1). The labels are scalars constructed by dotting the sample x with a signal b Rv whose entries (bj) = j β. Without the random matrix W, the α, β are related to what is known in the literature as source and capacity conditions [10, 11, 20, 36]. For a detailed comparison of the parameters and related work, see Section A and Table 3. The dimensions we consider throughout are always such that v Cd for C > 1. Throughout both v and d need to be large, but for some choices of α and β, the v will need to be comparable to d. Definition 1.1 (Admissible v and d). We assume that v Cd with C > 1 and v, d . Above the high-dimensional line, which is when 2α > 1, we suppose v/d r (1, ) { }.7 On the other hand, below the high-dimensional line (2α < 1) we limit v to be v/d r (1, ).8 One can rewrite the expression in (4) using the convenient form: min θ Rd P(θ) = D(Wθ b), (Wθ b) , where D = diag(j 2α) Rv v. (5) Algorithmic set-up. To solve the minimization problem in (5), we use one-pass SGD with minibatches of size B (independent of d)9 and constant learning rate γ > 0: letting θ0 = 0, we iterate drawing {xi r}B i=1 fresh iid samples and θr+1 = θr γ i=1 W T xi r W T xi r, θr xi r, b . (6) The learning rate and batch size will need to satisfy a condition to ensure convergence (Prop. 2.1). Main goal. Under this setup, our main goal is to characterize the compute-optimal frontier. Precisely, we want to find the parameter count exponent ξ and scaling law exponent η, such that, d (f) fξ and P f d B ; d f η. Notation. We use P(θr) = P(r) when we want to emphasize the iteration counter r. We say A (r, v, d) A(r, v, d) for functions A (r, v, d), A(r, v, d) > 0 if for every ε > 0 and for all admissible v and d, there exists an r0, d0 such that for all d > d0 and r r0 (1 ε)A(r, v, d) A (r, v, d) (1 + ε)A(r, v, d). 7In fact, we may take v = for 2α > 1. 8Indeed one can, in the former case, take d v d1/(1 2α), but for simplicity of presentation we focus on the proportional regime when 2α < 1. 9One can study batch size B growing with d, but for simplicity we let B = 1. Thus we only consider B independent of d setting. Table 1: Large d behavior of the forcing function and kernel function. See Sec. H for proofs. Function Γ(x) is the Gamma function F0(r) d 2α+max{0,1 2β} Fpp(r) (2α) 1 Γ β α 1 2α + 1 (2γB r) (1+β/α)+1/(2α) Fac(r) C F0(r), if 2β > 1, 2α < 1 0, if 2β < 1 for C > 0, independent of d If 2β > 1, 2α > 1, Fac(r) Pv j=1 j 2β (2α) 1Γ 1 1 2α (2γB r) 1+1/(2α) d 1 Kpp(r) (2α) 1 Γ 2 1 2α (2γB r) 2+1/(2α) We write if the upper and lower bounds hold with some constants c, C in place of 1 ε respectively and , if only one inequality holds. 2 Learning Dynamics of SGD Compute-optimal curves (3) for the random features model (4) rely on accurate predictions for the learning trajectory of SGD. Similar to the works of [33, 35], we show that the expected loss under SGD satisfies a convolution-type Volterra equation (for background on Volterra equations, see Section C.3) E[P(θr) | W] = forcing func. F(r) | {z } grad. descent + K E [P(θr) | W] | {z } SGD noise , where (K f)(r) = s=0 K (r 1 s)f(s). (7) The forcing function F(r) and kernel function K (r) are explicit functions of the matrix ˆK = D1/2WW T D1/2, where D = Diag(j 2α, 1 j v), and Γ C a contour enclosing the spectrum of ˆK [0, 1], F(r) def = 1 Γ ( ˆK z) 1(D1/2b), (D1/2b) (1 2γBz + γ2B(B + 1)z2)r dz and K (r) def = 1 Γ ( ˆK z) 1z2(1 2γBz + γ2B(B + 1)z2)r dz . (8) Intuitively, the forcing function is gradient descent on the random features model and the kernel function is the excess risk due to 1 unit of SGD noise. Deterministic equivalent. The forcing function F(r) and kernel function K (r) are random functions depending on the random matrix ˆK. Indeed, it is the resolvent of ˆK, ( ˆK z) 1, which plays a significant role in F and K . We remove this randomness from the expression by using a deterministic equivalent a technique from random matrix theory. Formally, we define the deterministic equivalent for the resolvent of ˆK, denoted by R(z), implicitly via a fixed point equation m(z) def = 1 d Pv j=1 j 2α j 2αm(z) z where R(z) = Diag 1 j 2αm(z) z : 1 j v . (9) This deterministic equivalent R(z) is viewed, roughly, as EW [( ˆK z) 1] R(z); though it is not formally the expectation over W. By replacing the resolvent of ˆK with R(z), there exists a (a) Scaling Law Exponent (b) Empirical vs Theory (α = 0.7) Figure 4: (a) Scaling Law Exponents. The heatmap displays scaling law exponents (η) in the (α, β)-plane. Hatched lines represent region with universal scaling behavior, d f0.5, independent of (α, β). (b) Exponent Measurements. Compare empirical exponents (following [24]; see Sec.J for details) to theoretical predictions, traversing the phase diagram horizontally at α = 0.7 from Phases Ia II III as β . deterministic function P(r) which solves a convolution Volterra equation, matching (7): P(r) = forcing func. F(r) |{z} grad. descent + (K P)(r) | {z } SGD noise where F(r) def = 1 Γ (R(z)(D1/2b), (D1/2b) (1 2γBz + γ2B(B + 1)z2)r dz (11) and K(r) def = γ2B Tr 1 Γ R(z)z2(1 2γBz + γ2B(B + 1)z2)r dz . (12) The solution to the Volterra equation with deterministic equivalent (10) numerically exactly matches the training dynamics of SGD, see Fig. 3. A discussion of the deterministic equivalent for ( ˆK z) 1 can be found in Sec. E. All our mathematical analysis will be for the deterministic equivalents, going forward.10 The derivation of the Volterra equation for the expected loss can be found in Sec. B. An immediate consequence of (10) is that for convolution Volterra equations bounded solutions occur if and only if the forcing function is bounded and the kernel norm K def = P s=0 K(s) < 1. This directly translates into a sufficient condition on the batch size and learning rate of SGD. Proposition 2.1 (Sufficient conditions on learning rate and batch). Suppose learning rate γ and batch B satisfy K < 1 and γ(B + 1) < 2. Then P(r) is bounded. Remark 2.1. Below the line 2α = 1, the kernel norm diverges with v for fixed constant γ, and so we must take γ 0 to ensure bounded solutions. Thus, provided γ v2α 1, then j=1 j 2α γ 2(1 2α)v1 2α is order 1. Thus, the kernel norm, K , is always constant order for all α. The batch B and γ can depend on d. For simplicity, we only consider B order 1 in this work. For a proof of the necessary and sufficient conditions on γ and B, see Prop. C.2, and see Cor. G.1 for the asymptotic on K . The Volterra equation in (10) can be analyzed to give a more explicit formula for P (see Section C.3.2 for proof). 10There is good numerical evidence that the deterministic equivalent captures all interesting features of the PLRF. There is a vast random matrix theory literature on making precise comparisons between resolvents and their deterministic equivalents. It seems a custom analysis will be needed for this problem, given the relatively high precision required, and we do not attempt to resolve this mathematically here. 104 108 1012 1016 1020 1024 Asymptotic Prediction P/(F0+Fpp+Fac+Kpp/γ) 1 10 100 1000 104 105 106 Compute-optimal slope SL exp Data exp 1000 105 107 109 0.45 Num. params (d) (a) (b) (c) Num. params (d) Compute-optimal slope Figure 5: Finite-size effects. (a) The ratio of the exact solution of eq. (10) to the estimate in eq. (17) is bounded by constants for all r, confirming the validity of eq. (17); shown here is (α, β) = (0.7, 1.2). (b) For non-asymptotic d, the estimate in eq. (17) (solid curves) predicts both the magnitudes and trends of the measured exponents of the empirical compute-optimal frontier (points), shown here for (α, β) = (0.7, 1.2) computed using Approach 0 (see Appendix J) to capture the instantaneous slope; the dashed lines show the asymptotic exponents from Table 2. (c) The finite-size behavior relaxes to the asymptotic predictions over horizons whose length can grow exceedingly large, especially in the vicinity of the phase transition, shown here for β = 0.7 approaching the Phase 4a 4b boundary. Theorem 2.1 (Approximation solution for P). Suppose γ and B are at most half the convergence threshold and 2α + 2β > 1, α > 1 4.11 There exists an M > 0 large and a constant C = C(α, β, M), independent of d, so that for all admissible v and d, for all γBr > M, F(r) + (K F)(r) P(r) F(r) + C (K F)(r). (13) The convolution K F further simplifies γB K(r) (K F)(r) C forcing func. F(r) |{z} grad. descent γB kernel func. K(r) |{z} SGD noise for some constants c = c(α, β, M) and C = C(α, β, M) > 0 independent of d. Remark 2.2. If we were to run gradient descent instead of SGD (i.e., γ small), then we would only have the forcing term, that is, P(r) = F(r). The measurable effect of SGD comes from the second term that contains the kernel function. For this reason, we refer to SGD noise as 1 γB K(r). In light of (13) and (14), we have trapped the training loss between the sum of F and K, so it suffices now to understand the forcing and kernel functions. 2.1 Forcing function and kernel function We decompose the forcing function (11), F, and the kernel function, (12), K, into F(r) = F0(r) + Fac(r) + Fpp(r) + errors F(r) and K(r) = Kpp(r) + errors K(r). (15) Each term is explicit and has an asymptotic equivalence (when 1 γBr d2α) given by Fi(r, d), Kpp(r, d) c r τ d σ for some constants c, τ, σ > 0 (see Table 1). (16) The two error terms are such that for large d with 1 γBr d2α, |errors F(r)| C (F0(r) + Fac(r) + Fpp(r)) and |errors K(r)| C Kpp(r), for some constant C > 0. For γBr d2α, the forcing function F(r) F0(r), the limiting risk value. The terms arise from different parts of the spectrum of the deterministic equivalent for ˆK (see Fig. 6). 1. Point mass at 0: F0(0) = F0(r) is the limiting value of P(r) d 2α+max{0,1 2β} as r . It occurs because the loss is irreducible (d < v), that is a component of the target is not in the image of the RF model (or equivalently that ˆK has a kernel). 11In spite of Theorem 2.1 holding only for α > 1 4, we expect this to hold for all 2α + 2β > 1 as supported numerically. When α < 1 4, the kernel function stops being power law and, as a result, requires a different set of tools to prove the result. 2. Aligned features: The function Fpp(r) represents gradient descent on the components of features which are aligned to the underlying population features. Indeed, if we ran gradient descent on the population loss without a random features map (or a diagonal W), this would be the loss curve. 3. Distorted features: The function Fac(r) is the result of feature distortion, where the matrix W leads to an embedding where a small component of the leading features is distributed across many different eigenmodes. These are still solvable, and given enough compute these will eventually be used, but they are much slower to solve. 4. Aligned kernel: Kpp(r) is the excess risk due to 1 unit of SGD noise, which is then solved according to population gradient descent. Out of brevity, we relegate the exact definitions of F0, Fpp, Fac, and Kpp and all proofs of the asymptotics in Table 1 and analyses of the functions to Section F, G, and H. 3 The 4 Phases We now put together a coherent picture of the effect of different choices of α (data complexity) and β (target complexity) and their impact on the compute-optimal frontier. By Theorem 2.1, we estimate P(r, d) Fpp(r) + Fac(r) + F0(r) + 1 γB Kpp(r). (17) Explicitly, we show that the functional form, or scaling law for the PLRF model is P(r, d) r σ1 |{z} Fpp(r) + d τ1 |{z} F0(r) + d τ2r σ2 | {z } Fac(r) + r σ3 |{z} 1 γB Kpp(r) , where σi, τi > 0 and explicit, see Table 1. (18) Fig. 5a. shows empirically that this equivalence of P(r) is quite good. The first two terms Fpp(r) (i.e., r σ1) and F0(r) (i.e., d τ1) are often called in the literature as time and model bottlenecks respectively. The functional form using only these two components, i.e., P(r, d) r σ1 + d τ1 were used to find compute-optimal exponents in [24, 9] and in concurrent work [28]. Because the functional form considered in [9, 28] are missing the two other terms (cross-term Fac and SGD noise Kpp), the compute-optimal curves in [9, 28] agree only in Phase Ia with our results. Importantly, we show that the cross-term, i.e., Fac(r), and SGD noise, Kpp(r), can indeed affect the compute-optimal exponents. (The cross-term also appeared in concurrent work on ridge regression [18].) The 4 distinct phases (see Fig. 2a) decompose the (α, β)-plane based on the shape of the loss curve P(r), that is, which of the distinct components of the forcing function (i.e., F0, Fpp, Fac,) and/or kernel function (i.e., Kpp) dominate the loss curve at a given iteration r. See Table 2 for loss description in each phase. Cartoon pictures of the different features of the loss curves are shown in Fig. 2. For each phase, we derive a compute-optimal curve in Section 3.1. The high-dimensional line, which occurs where 2α = 1, distinguishes the phases where the vdimension can be big and independent of d (Phase Ia, II, III, 2α > 1) and the phases where d and v must be related to each other (Phase Ib, Ic, IVa, IVb, 2α < 1). When 2α + 2β < 1, the loss does not exhibit any power-law decay as the limit level stops going to 0 as d (purely as a consequence of having selected the regime v > d). Moreover, there exists an interesting critical point α = β = 1 2 where all the parts of the forcing function and kernel mix and interact with each other. The behavior of the loss at the pentuple point (see Fig 2a) we leave for future research. Across each of the phase boundaries the compute-optimal curves are continuous, but not necessarily differentiable; in contrast, d is discontinuous across some phase boundaries. 3.1 Compute-optimal Curves To simplify the computations for compute-optimal curves, we introduce the following curve P(r) def = max Fpp(r), Fac(r), F0(r), 1 γB Kpp(r) . (19) The function P(r, d) achieves the same power law behavior as the original compute-optimal curve P(r, d) (i.e., the slope of the compute-optimal curve is correct) and deviates from the true curve by an absolute constant (independent of d and f). Note that some of the terms in the max function Table 2: Loss description for P(r) and compute-optimal curves for P( f d B , d) across the 4 phases. Loss P(r) Trade off Compute-optimal Curves Phase I Fpp(r) + F0(r) Fpp = F0 Ia P Phase Ia(f) f 1 2α+1 1 (1+β/α 1/(2α)) d Phase Ia f1/(2α+1) Ib P Phase Ib(f) f 1 2 α β d Phase Ib f 1 2 Ic P Phase Ic(f) f α(2α+2β 1) α(2β 3) 2β+1 d Phase Ic f 1 2(α+β) 2(α(2β 3) 2β+1) Phase II Fpp(r) + Fac(r) +F0(r) Fpp = Fac P Phase II(f) f 2α+2β 1 d Phase II f(β/α)/(1+β/α) Phase III Fac(r) + F0(r) 1 γB Kpp = Fac P Phase III(f) f(1 4α)/(4α) d Phase III f1/2 Fpp(r) + F0(r) IVa 1 γB Kpp = F0 P Phase IVa(f) f α d Phase IVa f1/2 IVb 1 γB Kpp = Fpp P Phase IVb(f) f (1 2α)(2α+2β 1) (2(2αβ+α 2β)) d Phase IVb f(α β)/(2αβ+α 2β) (19) should be taken to be 0 when not defined for the different phases. Therefore, we derive the compute-optimal curves by solving the problem min d P f d B , d , and if d (f) def = arg mind P f d B , d , then the compute-optimal curve is P (f) def = P f d (f) B , d (f) . (20) See Table 2 for the exact expressions for d (f) and the compute-optimal curve P (f) for each phase. A more detailed description with proofs can be found in Section C.4 and Section D. Now to derive d and P , we recall that the functions F0, Fpp, Fac, Kpp take the form c d σi ( f d B ) τi (16). Therefore, P (f/(d B), d ) must occur at corner point where two functions meet. These tradeoffs between the two functions for which the compute-optimal point occurs are shown in Fig. 2 and Table 2. Details for each phase. We describe the qualitative and quantitative properties of compute-optimal curves for each phase. These are broken down into model constrained (Phase I, II) vs. algorithm constrained (Phase III, IV), i.e., whether the PLRF model or SGD is the constraining feature. Phase Ia, Ib, Ic. Capacity constrained. Phase Ia (2α > 1, 2β < 1), Ib (2α < 1, 2β < 1, 2(α + β) > 1), Ic are characterized by having the simplest loss description, P(r) Fpp(r) + F0(r). Here the SGD noise is irrelevant and one would have the same loss (and thus compute-optimal curve) as gradient descent on the population loss. Compute optimality is characterized by training the model completely (to its limit loss) and choosing the model parameter count large enough so that at the end of training, the smallest loss is attained. The main distinctions between Phase Ia, Ib, Ic are the model capacities (i.e., F0(r, d) = d 2α+1 2β in Ia, Ib, and F0(r, d) = d 2α in Ic) and the dependence of dimension in the learning rate due to Ib,Ic being below the high-dimensional line. Consequently, while the qualitative features of the loss curve are the same for Ia, Ib, and Ic, the actual values of the compute-optimal curve vary across the different regions. Notably, in Phase Ib, the compute-optimal parameter is d = f1/2 and it is independent of α and β. Phase II. Distortion constrained. Phase II (2α > 1, 2β > 1, β < α) has a loss curve where the Fac is important, that is, P(r) Fpp(r) + Fac(r) + F0(r). The Fac term becomes the dominant term after running for some intermediate amount of time dc; in fact it is compute-optimal to stop at this point, and then select the number of model parameters so to minimize the loss with this early stopping criterion. It transpires that across all phases, it never pays to solve through the Fac part of the loss curve it is always better to just increase the number of model parameters. Phase III. SGD frustrated, distortion constrained. In this phase (2α > 1, 2β > 1, β > α), SGD noise is important. The loss curve is P(r) Fac(r) + F0(r) + 1 γB Kpp(r). Notably, in this phase, the compute-optimal parameter is d (f) = f1/2, which is independent of α and β. PLRF that fall within this phase have the same scaling law regardless of data complexity and target complexity. Moreover, the tradeoff occurs, like in Phase II, once the optimizer reaches the Fac-dominated part of the loss curve. Unlike in Phase II, the optimization is slowed by SGD noise (Kpp) leading up to that point. We note that there is a dimension-independent burn-in period required for SGD noise to dominate, and for small numerical simulations, one may actually observe an (Fpp, Fac) tradeoff. Phase IV. SGD frustrated, capacity constrained. Like Phase III, SGD noise is important. The SGD algorithm in Phase IV will be distinguished from gradient descent. As one approaches the high-dimensional line (2α = 1) in Phase III, the Fac(r) disappears. It becomes too small relative to Fpp and Kpp. Moreover at the high-dimensional line, Fpp becomes important again. Thus, the loss curve in Phase IV (a and b) look like P r, d Fpp(r, d) + F0(r, d) + 1 γB Kpp(r, d). The distinction between Phase IVa (1 1 2 < α < 0.5, 2β > 1) and Phase IVb ( 1 4 < α < 1 1 2, 2β > 1) is where the compute-optimal tradeoff occurs. It changes from Kpp = F0 (Phase IVa) to Fpp = Kpp (Phase IVb). In particular it can be (Phase IVb) the SGD noise is so large that increasing the model parameter count is compute-optimal. We note that in this phase d must be taken very large (in particular larger than we could numerically attain) to get quantitative agreement between the exponents and theory. Other observations. In Phase III, Ib, and IVa, the optimal parameter d = f1/2 (see dashed lines in Fig. 4a). These phases, taken together, encompass a large section of the (α, β)-phase plane. This suggests that there is a potential universal scaling law. Moreover using 1 A100-GPU-day of compute, one reaches scales of d where the observed exponents in the scaling laws SGD, the theoretically-derived Volterra equation eq. (10), and the equivalence of P(r) eq. (17) are still changing (see Fig. 5b and c). This serves as a potential warning for empirically derived scaling laws. Additionally, although we have identified the lower-left of the phase diagram (α + β < 1/2) as "no power-law", this designation relies on the assumption v > d, which could be relaxed to interesting effect in more realistic (e.g. non-linear) models. Compute-optimal learning rate and batch. Previously, we have used B = 1 and the maximal learning rate allowed. One can also consider finding the compute-optimal curves with respect to batch size and learning rate, i.e., find d , γ , B such that (d , γ , B ) arg mind,γ,B arg min P( f d B , γ, d) s.t. γB < 1 and Kpp < 1. (21) In Section I, we show that P( f d B , γ, d) is monotonic in B and therefore B = 1 is optimal. Similarly for γ, in Phases I, II, III, the loss P( f d B , γ, d) is monotonic and thus the maximally stable learning rate is optimal. For Phase IV, this is not true. There does exist an optimal γ (with B = 1), 4α(α β) 4αβ+2α+2β 1 , d (f) f 2α+2β 1 4αβ+2α+2β 1 , and P (f) f 2α(2α+2β 1) 4αβ+2α+2β 1 , where the tradeoff occurs between 1 γ Kpp(r) = F0(r) and Phase IVa and IVb collapse to a single phase. This is proven in Proposition I.1. Conclusion. We analyze a simple three parameter model, PLRF, and derive deterministic expressions for the training dynamics (see Volterra equation (10)). We then extract compute-optimal scaling laws for large d. We identify 4 phases (+3 subphases) in the (α, β)-phase plane, corresponding to different compute-optimal curve/loss behaviors. These phase boundaries are determined by the relative importance of model capacity (Phase I, IV), poor embedding of the features (Phase II, III), and the noise produced by the SGD algorithm (Phase III, IV). The latter suggesting that another stochastic algorithm might change the compute-optimal curve; we leave this interesting direction to future research. We also show evidence of a universal scaling law which we also leave for future research to explore. Acknowledgments and Disclosure of Funding C. Paquette is a Canadian Institute for Advanced Research (CIFAR) AI chair, Quebec AI Institute (MILA) and a Sloan Research Fellow in Computer Science (2024). C. Paquette was supported by a Discovery Grant from the Natural Science and Engineering Research Council (NSERC) of Canada, NSERC CREATE grant Interdisciplinary Math and Artificial Intelligence Program (INTERMATH-AI), Google research grant, and Fonds de recherche du Québec Nature et technologies (FRQNT) New University Researcher s Start-Up Program. Research by E. Paquette was supported by a Discovery Grant from the Natural Science and Engineering Council (NSERC). Additional revenues related to this work: C. Paquette has 20% part-time employment at Google Deep Mind. [1] Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From highdimensional & mean-field dynamics to dimensionless odes: A unifying approach to sgd in two-layers networks. In The Thirty Sixth Annual Conference on Learning Theory (COLT), pages 1199 1227. PMLR, 2023. [2] Søren Asmussen, Soren Asmussen, and Sren Asmussen. Applied probability and queues, volume 2. Springer, 2003. [3] Krishna B Athreya, Peter E Ney, and PE Ney. Branching processes. Courier Corporation, 2004. [4] Francis Bach. High-dimensional analysis of double descent for linear regression with random projections. SIAM Journal on Mathematics of Data Science, 6(1):26 50, 2024. [5] Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. Proc. Natl. Acad. Sci. USA, 121(27):Paper No. e2311878121, 8, 2024. [6] Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010. [7] Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You. Chinchilla Scaling: A replication attempt. ar Xiv preprint ar Xiv:2404.10102, 2024. [8] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A Dynamical Model of Neural Scaling Laws. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of Proceedings of Machine Learning Research, pages 4345 4382. PMLR, 2024. [9] Blake Bordelon and Cengiz Pehlevan. Learning Curves for SGD on Structured Features. In International Conference on Learning Representations (ICLR), 2022. [10] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331 368, 2007. [11] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. Advances in Neural Information Processing Systems, 31, 2018. [12] Chen Cheng and Andrea Montanari. Dimension free ridge regression. ar Xiv preprint ar Xiv:2210.08571, 2022. [13] Elizabeth Collins-Woodfin, Courtney Paquette, Elliot Paquette, and Inbar Seroussi. Hitting the high-dimensional notes: an ODE for SGD learning dynamics on GLMs and multi-index models. Inf. Inference, 13(4):Paper No. iaae028, 107, 2024. [14] Romain Couillet and Zhenyu Liao. Random Matrix Methods for Machine Learning. Cambridge University Press, 2022. [15] D. Cruz-Uribe and C. J. Neugebauer. An Elementary Proof of Error Estimates for the Trapezoidal Rule. Math. Mag., 76(4):303 306, 2003. [16] Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime. Advances in Neural Information Processing Systems, 34, 2021. [17] Stéphane d Ascoli, Marylou Gabrié, Levent Sagun, and Giulio Biroli. On the interplay between data structure and loss function in classification problems. Advances in Neural Information Processing Systems, 34:8506 8517, 2021. [18] Leonardo Defilippis, Bruno Loureiro, and Theodor Misiakiewicz. Dimension-free deterministic equivalents for random feature regression. ar Xiv preprint ar Xiv:2405.15699, 2024. [19] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction Using Mini-Batches. Journal of Machine Learning Research, 13(1), 2012. [20] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363 1399, 2016. [21] Cedric Gerbelot, Emanuele Troiani, Francesca Mignacco, Florent Krzakala, and Lenka Zdeborova. Rigorous dynamical mean-field theory for stochastic gradient descent methods. SIAM Journal on Mathematics of Data Science, 6(2):400 427, 2024. [22] Gustaf Gripenberg. On the resolvents of nonconvolution Volterra kernels. Funkcial. Ekvac., 23(1):83 95, 1980. [23] Walid Hachem, Philippe Loubaton, and Jamal Najim. Deterministic equivalents for certain functionals of large random matrices. Ann. Appl. Probab., 17(3):875 930, 2007. [24] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, volume 35, 2022. [25] Hong Hu, Yue M. Lu, and Theodor Misiakiewicz. Asymptotics of Random Feature Regression Beyond the Linear Scaling Regime. ar Xiv preprint ar Xiv:2403.08160, 2024. [26] Jared Kaplan, Sam Mc Candlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alex Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ar Xiv preprint ar Xiv:2001.08361, 2020. [27] Kiwon Lee, Andrew Cheng, Elliot Paquette, and Courtney Paquette. Trajectory of mini-batch momentum: batch size saturation and convergence in high dimensions. Advances in Neural Information Processing Systems, 35:36944 36957, 2022. [28] Licong Lin, Jingfeng Wu, Sham M Kakade, Peter L Bartlett, and Jason D Lee. Scaling Laws in Linear Regression: Compute, Parameters, and Data. ar Xiv preprint ar Xiv:2406.08466, 2024. [29] Bruno Loureiro, Cedric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mezard, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. Advances in Neural Information Processing Systems, 34:18137 18151, 2021. [30] Alexander Maloney, Daniel A. Roberts, and James Sully. A Solvable Model of Neural Scaling Laws. ar Xiv preprint ar Xiv:2210.16859, 2024. [31] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667 766, 2022. [32] Gabriel Mel and Jeffrey Pennington. Anisotropic random feature regression in high dimensions. In International Conference on Learning Representations, 2021. [33] Courtney Paquette, Kiwon Lee, Fabian Pedregosa, and Elliot Paquette. SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality. In Proceedings of Thirty Fourth Conference on Learning Theory (COLT), volume 134, pages 3548 3626, 2021. [34] Courtney Paquette and Elliot Paquette. Dynamics of stochastic momentum methods on largescale, quadratic models. Advances in Neural Information Processing Systems, 34:9229 9240, 2021. [35] Courtney Paquette, Elliot Paquette, Ben Adlam, and Jeffrey Pennington. Homogenization of SGD in high-dimensions: exact dynamics and generalization properties. ar Xiv preprint ar Xiv:2205.07069, 2022. [36] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Advances in Neural Information Processing Systems, 31, 2018. [37] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. Advances in Neural Information Processing Systems, 30, 2017. [38] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning, pages 71 79. PMLR, 2013. [39] Utkarsha Sharma and Jared Kaplan. A neural scaling law from the dimension of the data manifold. ar Xiv preprint ar Xiv:2004.10802, 2020. [40] Jack W Silverstein and Zhi Dong Bai. On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate analysis, 54(2):175 192, 1995. [41] James B. Simon, Dhruva Karkada, Nikhil Ghosh, and Mikhail Belkin. More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory. In International Conference on Learning Representations (ICLR), 2024. [42] Aditya Vardhan Varre, Loucas Pillaud-Vivien, and Nicolas Flammarion. Last iterate convergence of SGD for Least-Squares in the Interpolation regime. In Advances in Neural Information Processing Systems (Neur IPS), volume 34, pages 21581 21591, 2021. [43] Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham M Kakade. Benign overfitting of constant-stepsize SGD for linear regression. Journal of Machine Learning Research, 24(326):1 58, 2023. 4+3 Phases of Compute-Optimal Neural Scaling Laws Supplementary material Broader Impact Statement. The work presented in this paper is foundational research and it is not tied to any particular application. The set-up is on a simple well-studied random features model with synthetic data and solved using a commonly deployed algorithm stochastic gradient descent. We present (theoretical) compute-optimal curves for this model. The results are theoretical and we do not anticipate any direct ethical and societal issues. We believe the results will be used by machine learning practitioners and we encourage them to use it to build a more just, prosperous world. Outline of the paper. The remainder of the article is structured as follows: in Section A, we provide additional related work. In Section B, we derive the convolution-type Volterra equation for the expected risk under SGD, (7). In Section C.1, we analyze the Volterra equation under the deterministic equivalent. A discussion on the convergence threshold for P(r) including a necessary and sufficient condition for bounded solutions of (10) (Proposition C.2) and a proof of Proposition 2.1 are provided in Section C.2. Some background on Volterra equations and their solutions are provided in Section C.3.1 followed by the proof of Theorem 2.1 in Section C.3.2. We finish this section with a detailed description and proofs for the risk curves in all phases, Section C.4. Section D is devoted to deriving and proving the compute-optimal curves in Table 2. We follow this by Section E which analyzes the deterministic equivalent for the resolvent of ˆK. Here we examine the spectrum of ˆK from a random matrix point of view. In particular, in this section, we prove estimates on the fixed point equation, m, see eq. (9). We then give explicit descriptions of the components of the forcing function, F0, Fpp, Fac, as contour integrals and show that the error terms error F are small, see Section F. We do the same with the kernel function K and kernel norm in Section G. In Section H, we derive the asymptotic formulas for the components of the forcing and kernel functions (see Table 1) used in the compute-optimal curve derivations. Finally, we end with some additional numerical experiments (and their experimental setups) as well as detailed descriptions of the different approaches to estimating the exponents in the scaling law and optimal model-parameter, Section J. 1 Introduction 1 1.1 Problem Setup: SGD on Power-law Random Features . . . . . . . . . . . . . . . . 3 2 Learning Dynamics of SGD 5 2.1 Forcing function and kernel function . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 The 4 Phases 8 3.1 Compute-optimal Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 A Additional Related Work. 16 B Derivation of Volterra equation 18 C Analysis of Volterra Equation 21 C.1 Deterministic equivalent of the loss under SGD . . . . . . . . . . . . . . . . . . . 22 C.2 Convergence threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 C.3 Simplification of the Volterra Equation . . . . . . . . . . . . . . . . . . . . . . . . 24 C.3.1 Background on Volterra equations . . . . . . . . . . . . . . . . . . . . . . 24 C.3.2 Proof of Theorem 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 C.4 Details of risk curves for the phases . . . . . . . . . . . . . . . . . . . . . . . . . 27 C.4.1 Above the high-dimensional line (Phases Ia, II, III) . . . . . . . . . . . . . 28 C.4.2 Below the high-dimensional line (Phases IVa, IVb, Ib, Ic) . . . . . . . . . . 31 D Compute-optimal curves 33 D.1 Compute-optimal curves: Above the high-dimensional line (Phases Ia, II, III). . . . 34 D.1.1 Phase Ia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 D.1.2 Phase II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 D.1.3 Phase III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 D.2 Compute-optimal curves: Below the high-dimensional line (Phase IV, Ib, Ic), 2α < 1 37 D.2.1 Phase IV (a) and (b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 D.2.2 Phase Ib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 D.2.3 Phase Ic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 E Spectrum of ˆK: random matrix theory 40 E.1 Self-consistent approximation for ( ˆK z) 1 = (D1/2WW T D1/2 z) 1. . . . . 40 E.2 The region near 0 and the spectral bulk . . . . . . . . . . . . . . . . . . . . . . . . 44 E.3 The mesoscopic region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 E.4 The large z region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 F Approximation of the forcing function 54 G Estimation of Kernel function 58 H Asymptotics of forcing function and kernel function 61 H.1 Pure point forcing term, Fpp(r) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 H.2 Model misspecification, point mass at 0, F0(r) . . . . . . . . . . . . . . . . . . . 63 H.3 Absolutely continuous forcing function, Fac(r) . . . . . . . . . . . . . . . . . . . 64 H.4 Kernel function asymptotic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 I Optimizing over batch and learning rate 69 I.1 Optimal batch size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 I.2 Optimal learning rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 J Experimental Results 70 J.1 Measuring the Scaling Law Exponent . . . . . . . . . . . . . . . . . . . . . . . . 71 J.2 Measuring Parameter Count Exponent: Approach 0 . . . . . . . . . . . . . . . . . 71 J.3 Measuring Parameter Count Exponent: Approach 1 . . . . . . . . . . . . . . . . . 72 J.4 Measuring Parameter Count Exponent: Approach 2 . . . . . . . . . . . . . . . . . 72 J.5 Exponents comparison: Theory vs Measurement . . . . . . . . . . . . . . . . . . . 73 J.6 Instantaneous slope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 J.7 Negative β. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 J.8 Additional plots for different phases . . . . . . . . . . . . . . . . . . . . . . . . . 74 A Additional Related Work. Table 3: Comparison of the source/capacity parameters across various related work. We note this table is taken from Table 1 in [DLM24]1 with the addition of [Lin+24]5. We note that both [DLM24] and [Lin+24] appeared concurrently with this article. This work [DLM24]1 [Bahri+21]2 [MRS22]3 [BAP24]4 [Lin+24]5 Input dimension d d d M M M # of features v p P N N - Iterations/samples r n D T P N Capacity 2α α 1 + α 1 + α b a Source 2α+2β 1 4α r 1 2(1 1 Target decay (in L2) α + β αr + 1 2 0 0 a 2 b 2 1 [DLM24] L. Defilippis, B. Loureiro, T. Misiakiewicz. Dimension-free deterministic equivalents for random feature regression. 2024 2 [Bahri21] Y. Bahri, D. Dyer, J. Kaplan, J. Lee, and U. Sharma. Explaining neural scaling laws. 2024. 3 [MRS22] A. Maloney, D.A. Roberts, J. Sully. A solvable model of neural scaling laws 4 [BAP24] B. Bordelon, A. Atanasov, C. Pehlevan. A dynamical model of neural scaling laws. 2024. 5 [Lin24] L. Lin, J. Wu, S. Kakade, P. Barlett, J.D. Lee. Scaling Laws in Linear Regression: Compute, Parameters, and Data. 2024 Random features and random matrices. This paper uses random matrix theory to analyze a random features problem, which in statistical language would be the generalization error of the one-pass SGD estimator. Random matrix theory has played an increasingly large role in machine learning (see for [14] for a modern introduction). The input we need for our random matrix analysis is for sample covariance matrices with power-law population covariance (i.e. linear random features). The analysis of sample covariance matrices precedes their usage in machine learning (see e.g. [6]), but to our knowledge, a detailed study of all parts of the spectrum of sample covariance matrices with power-law population covariances has not appeared before. The narrower study of ridge regression has been extensively investigated (see for e.g. [4, 12]), and the concurrent work [18] provides a complete analysis of the ridge regression problem when 2α > 1. However, while (roughly speaking) ridge regression requires analysis of resolvent (A z Id) 1 statistics for negative spectral parameter z (which might be very close to 0), the analysis in this work requires resolvent statistics for essentially all z. There is a larger theory of nonlinear random features regression, mostly in the case of isotropic random features. Including nonlinearities in this model is a natural future direction; for isotropic random features with proportional dimension asymptotics this has been explored in works such as [31] and for some classes of anisotropic random features in [32, 17, 29] (we mention that lots of the complexity of the analysis of power-law random features arises from the analysis of the self-consistent equations indeed the self-consistent equations we use date to [40], but the analysis of these equations may still be novel). This strongly motivates non-proportional scalings (which would be inevitable in power-law random features with nonlinearities); in the isotropic case, the state of the art is [25]. Random features regression, source/capacity conditions, and SGD. A large body of kernel regression and random features literature is formulated for source/capacity conditions, which are power-law type assumptions that contain the problem setup here, when 2α > 1 (the low-dimensional regime). For convenience, we record the parameters αsource = 2α and r = 2α + 2β 1 Table 4: (Nonexhaustive) Comparison of sample-complexity results. Let ρ def = 2α + 2β 1. We use our Phases with n = sample size, d = parameters. We will include derivation of these results in the appendix of our paper. [DLM24]1 can also be done with RR+optimal-ridge, which yields same in Phase Ia, but different in Phase II/III. [VPF21]9 obtain P n min{1/(2α),(2α+2β 1)/(2α)}, that is, they capture the Fpp, but not Fac. The minimax optimal rates never achieve any of the rates (always worse), which can be connected to overly conservative, small stepsizes. For derivation of the minimax rates, we used Cor. 2 from [DB18]7. [Lin+24]5 requires label noise order 1 and also a very small learning rate. This work [DLM24]1 Algorithm one-pass SGD RR + O(1)-ridge Phase Ia Risk, Phase II P(n) Phase III Θ(n ρ/(2α) d ρ) Θ(n ρ/(2α) d 1n 1+1/(2α) d 2α) Θ(n 2+1/(2α) d 1n 2α 1 same as ours same as ours Θ(n 2 d 1n 2α 1 Minimax optimal678 [Lin+24]5 Algorithm one-pass SGD, very small stepsize one-pass SGD, very small stepsize Phase Ia Risk, Phase II P(n) Phase III O n ρ/(2α+2β) O n ρ/(2α+2β) O n 4α/(4α+1) Θ(d ρ+n ρ/(2α)+min{ d n, n 1+1/(2α)}) does not cover does not cover 6 Carratino, Rudi, Rosasco. Learning with sgd and random features. 2018 7 Dieuleveut and Bach. Nonparametric stochastic approximation with large stepsizes. 2016. 8 Pillaud-Vivien, Rudi, Bach. Statistical optimality of SGD on hard learning problems through multiple passes. 2018. 9 Varre, Pillaud-Vivien, Flammarion. Last iterate convergence of SGD for least squares in the interpolation regime 2021. Here we have taken r as the limit of those r s for which the source/capacity conditions hold (see Table 3). We note that in this language r is often interpreted as hardness (lower is harder), and that r (0, 0.5), r (0.5, 1.0) and r (1.0, ) correspond to 3 regimes of difficulty which have appeared previously (see the citations below); they are also precisely the 3 phases Ia, II, and III. The authors of [37] establish generalization bounds for random feature regression with power-law structures in 2α > 1 case. These bounds were sharpened in [16] and extended in [18] (see also the earlier [10] which shows kernel ridge regression is minimax optimal under various source-capacity conditions ); we give a comparison to these bounds in Table 4, but we note that the problem setup we have is not captured by minimax optimality (in particular minimax optimality is worst-case behavior over a problem class, and our problem setup is not worst-case for the traditional source/capacity conditions) We note that this paper is fundamentally about computation, but the novel mathematical contributions could also be recast in terms of generalization bounds of one-pass SGD, some of which are new. The work of [11] compares SGD to kernel ridge regression, showing that one-pass SGD can attain the same bounds as kernel ridge regression and hence is another minimax optimal method (again under source-capacity conditions). See also [20] which considers similar statements for SGD with iterate averaging and [36] for similar statements for multipass SGD; see also [38, 19] which also prove the single-batch versions of these. These bounds attain the minimax-optimal rate, which are worse than the rates attained in this paper (see Table 4 for a comparison). Dynamical deterministic equivalents, Volterra equations and ODEs. Using the deterministic equivalents for random matrix resolvents [23], we in turn derive deterministic equivalents for the risk curves of SGD. The method of analysis of the risk curves in this paper is by formulation of a convolutional Volterra equation [33]. This can be equivalently formulated as a system of coupled difference equations for weights of the SGD residual in the observed data covariance, which generalizes beyond the least-squares context [13]; in isotropic instances, this simplifies to a finite-dimensional family of ODES [1]. This can also be generalized to momentum SGD methods [34] and large batch SGD methods [27]. Convolution-Volterra equations are convenient tools, as they are well-studied parts of renewal theory [2] and branching process theory [3]. Another method of analysis is dynamical mean field theory. The closest existing work to this one in scientific motivations is [8], which uses this technique. This formally can be considered as a type of Gaussian process approximation, but for a finite family of observables ( order parameters ). In instances of one-pass SGD (including in anisotropic cases), this is rigorously shown to hold in [21]. The analysis of the resulting self-consistent equations is nontrivial, and [8] does some of this analysis under simplifying assumptions on the structure of the solutions of these equations. Besides these works, there is a large theory around generalization error of SGD. The work of [42] gives a direct analysis of risks of SGD under source/capacity type assumptions which formally capture the Fpp parts of the Phase Ia/II loss curves. The risk bounds of [43] give non-asymptotic estimates which again reproduce tight estimates for the Fpp parts of the loss (note that to apply these bounds to this case, substantial random matrix theory needs to be worked out first); see also concurrent work [28] where some of this is done. B Derivation of Volterra equation We begin by deriving a Volterra equation for the population loss P(θ), (5). Fix a quadratic q : Rd R, i.e., a function q(x) = x T Ax + e T x + c for fixed matrix A Rd d, vector e Rd and constant c R. Let us consider the filtration Fr = σ(W, θ0, . . . , θr) which conditions on W and the past iterates. Then we have from Taylor theorem, E[q(θr+1) q(θr) | Fr] = E[ q(θr), θr+1 θr | Fr] + 1 2E[ 2q, (θr+1 θr) 2 |Fr]. (22) We need to plug the expression for SGD in (6) into the above. The first thing we observe is that we need moments of Gaussians via Wick s formula: for fixed vectors vi Rv, i = 1, 2, 3, 4, and x N(0, D) Ex[x x, v1 ] = Ex[Tr(x T x)]v1 = Dv1 Ex[ x, v1 x, v2 x, v3 x, v4 ] = D, v1 v2 D, v3 v4 + D, v1 v3 D, v2 v4 + D, v1 v4 D, v2 v3 . (23) Here we recall that the (v v)-matrix D def = Diag(j 2α : 1 j v). Using these moment calculations, we can compute explicitly each of the terms in (22). Gradient term. First, we consider the gradient term in (22). A simple computation yields E[ q(θr), θr+1 θr | Fr] = γ q(θr), E X j Br W T xj W T xj, θr xj, b | Fr = γB q(θr), W T DWθr W T Db . (24) Quadratic term. We now turn to the quadratic term in (22). Supposing x and ˆx are independent samples, 1 2E[ 2q, (θr+1 θr) 2 | Fr] j Br W T xj[ W T xj, θr xj, b ] X k Br W T xk[ W T xk, θr xk, b ] |Fr j Br W T xj[ W T xj, θr xj, b ] 2 |Fr 2 E 2q, 2 X W T xj[ W T xj, θr xj, b ] W T xk[ W T xk, θr xk, b ] |Fr . Continuing, we have 1 2E[ 2q, (θr+1 θr) 2 | Fr] 2 E 2q, (W T x[ W T x, θr x, b ]) 2 |Fr 2 E 2q, (W T x[ W T x, θr x, b ]) (W T ˆx[ W T ˆx, θr ˆx, b ]) |Fr 2 E ( W T x, θr x, b )2 2q, (W T x) 2 | Fr 2 E 2q, (W T x[ W T x, θr x, b ]) (W T ˆx[ W T ˆx, θr ˆx, b ]) |Fr . Let 2q = Pv j=1 vj vj. Now we note the following ( W T x, θr x, b )2 2q, (W T x) 2 = x T W 2q W T x [ W T x, θr 2 2 W T x, θr x, b + x, b 2] j x, Wvj x, W vj [ W T x, θk 2 2 x, Wθr x, b + x, b 2]. (27) This, after taking expectations, is in the form for us to apply the moment computations in (23). Using these moments, we get the following expression: j x, Wvj x, W vj [ W T x, θr 2 2 x, Wθr x, b + x, b 2] | Fr = 2q, W T DW D1/2(Wθr b) 2 D, Wvj Wθr D, W vj Wθr D, Wvj Wθr D, W vj b + D, Wvj b D, W vj b D, W vj Wθr D, Wvj b = 2q, W T DW D1/2(Wθk b) 2 j D, Wvj (Wθr b) D, W vj (Wθr b) . Now we simplify the 2nd term in the summand X j D, Wvj (Wθr b) D, W vj (Wθr b) j W T D, vj (Wθr b) W T D, vj (Wθr b) i,n,ℓ,m (W T D)nivjn(Wθr b)i(W T D)mℓ vjm(Wθr b)ℓ = 2 DW( 2q)W T D, (Wθr b) 2 . Moreover, as x and ˆx are independent, we see that E 2q, (W T x[ W T x, θr x, b ]) (W T ˆx[ W T ˆx, θr ˆx, b ]) |Fr = E [(Wθr b)T xx T W 2q W T ˆxˆx T (Wθr b)|Fr] = (Wθr b)T DW 2q W T D(Wθr b). As a result, we deduce by combining (27), (28), (29), and (30) with (26) gives the following representation for the expected quadratic term 1 2E[ 2q, (θr+1 θr) 2 | Fk] = γ2B 2 2q, W T DW D1/2(Wθr b) 2 2 DW( 2q)W T D, (Wθr b) 2 . (31) Volterra equation. Using the simplified gradient and quadratic terms, we can now state the expected change in any quadratic q : Rd R evaluated at an iterate of SGD (6): E[q(θr+1) q(θr) | Fr] = γB q(θr), W T DWθr W T Db 2 2q, W T DW D1/2(Wθr b) 2 2 DW( 2q)W T D, (Wθr b) 2 . We can write Rv = Im(W) W . Thus, there exists ˇb Rd and b Rv such that one can write b = Wˇb + b, that is, something in the image of W and something in the co-ker of W, i.e., b = Wˇb + b, where W T D b = 0. (33) One-step update formula. Using this observation, we have a formula for the expectation of the quadratic q : Rd R, E[q(θr+1) q(θr) | Fr] = γB q(θr), W T DW(θr ˇb) 2 2q, W T DW D1/2(Wθr b) 2 2 W T DW( 2q)W T DW, (θr ˇb) 2 . We observe that all the terms on the right hand side of the above (34) involve the matrix W T DW Rd d. Consequently, let (λj, wj) for j = 1, . . . , d be the eigenvalue-eigenvector of W T DW with wj = 1. Now define ρ2 j(r) def = w 2 j , (θr ˇb) 2 , for all j = 1, . . . , d. (35) We will write our Volterra equation in terms of ρj s. Note we can express the loss P(θr) = D1/2(Wθr b) 2 by P(θr) = D1/2(Wθr b) 2 = j=1 λ2 jρ2 j(r) + D1/2 b 2. (36) We can now plug ρ2 j into (34). For this, we need to compute ρ2 j and 2ρ2 j: ρ2 j(r) = w 2 j , θr ˇb 2 , θρ2 j(r) = 2wj wj, θr ˇb , and 2ρ2 j(r) = 2wj wj. Then we have that dρ2 j(r) = 2γB wj, θr ˇb wj, W T DW(θr ˇb) + Bγ2 wj wj, W T DW D1/2(Wθr b) 2 + B(B + 1)γ2 W T DW(wj wj)W T DW, (θr ˇb) 2 = 2γBλjρ2 j(r) + γ2Bλj D1/2(Wθr b) 2 + γ2B(B + 1)λ2 jρ2 j(r) Using an integrating factor, we can implicitly solve this expression dρ2 j(k) = 2γBλj + γ2B(B + 1)λ2 j ρ2 j + γ2Bλj D1/2(Wθk b) 2, (38) and thus, we have a discrete Volterra equation ρ2 j(r) = ρ2 j(0)(1 2γBλj + γ2B(B + 1)λ2 j)r s=0 (1 2γBλj + γ2B(B + 1)λ2 j)r 1 sλj D1/2(Wθs b) 2. (39) Let us define ˇK def = W T DW. Using the expression in (36), E[P(θr) |W] = j=1 λjρ2 j(0)(1 2γBλj + γ2B(B + 1)λ2 j)r + D1/2 b 2 j=1 γ2Bλ2 j s=0 (1 2γBλj + γ2B(B + 1)λ2 j)r 1 s E[P(θs) | W]. Let us define the kernel K (r) def = γ2B j=1 λ2 j(1 2γBλj+γ2B(B+1)λ2 j)r = γ2B Tr ˇK2(I 2γB ˇK+γ2B(B+1) ˇK2)r . Discrete volterra equation for the loss for ˇK = W T DW. Let r be the number of iterates of SGD. Then E[P(θr) | W] = ˇK(I 2γB ˇK + γ2B(B + 1) ˇK2)r, (θ0 ˇb) 2 + D1/2 b 2 s=0 K (r 1 s) E[P(θs) | W], where K (s) = γ2B j=1 λ2 j(1 2γBλj + γ2B(B + 1)λ2 j)s = γ2B Tr ˇK2(I 2γB ˇK + γ2B(B + 1) ˇK2)s and D = Diag(j 2α : 1 j v). We can also write (40) in terms of ˆK def = D1/2WW T D1/2. To see this, set D1/2W = V where ˇK = UΩU T and ˆK = V ΩV T . Then we see that poly( ˇK) ˇK, (θ0 ˇb) 2 = poly(Ω)Ω, (U(θ0 ˇb) 2 = V poly(Ω)V T , (V ΩU(θ0 ˇb)) 2 = poly( ˆK), (D1/2W(θ0 ˇb)) 2 . Discrete volterra equation for the loss with ˆK = D1/2WW T D1/2. Let r be the number of iterates of SGD. Then E[P(θr) | W] = (I 2γB ˆK + γ2B(B + 1) ˆK2)r, (D1/2(Wθ0 b)) 2 + D1/2 b 2 s=0 K (r 1 s) E[P(θs) | W], where K (s) = γ2B j=1 λ2 j(1 2γBλj + γ2B(B + 1)λ2 j)s = γ2B Tr ˆK2(I 2γB ˆK + γ2B(B + 1) ˆK2)s and D = Diag(j 2α : 1 j v). (41) C Analysis of Volterra Equation From now on, we consider the setting where the initialization of SGD is θ0 = 0. Let us introduce the forcing function: F(r) def = (I 2γB ˆK + γ2B(B + 1) ˆK2)r, (D1/2(Wθ0 b)) 2 + D1/2 b 2 (42) and recall the kernel function K (s): K (s) = γ2B Tr ˆK2(I 2γB ˆK + γ2B(B + 1) ˆK2)s . While these representations are easy to see from the derivation of the Volterra equation, a more useful representation of the forcing function and the kernel function is through contour integrals over the spectrum of ˆK. With this in mind, let Γ be a contour containing [0, 1]. Note that by the assumptions on ˆK, the largest eigenvalue is normalized to be 1; hence Γ contains the spectrum of ˆK. Then the forcing function takes the form Γ ( ˆK z) 1, (D1/2b) 2 (1 2γBz + γ2B(B + 1)z2)r dz. (43) and the kernel function K (r) = γ2B Tr 1 Γ z2 (1 2γBz + γ2B(B + 1)z2)r ( ˆK z) 1 dz . (44) Then one can write the Volterra equation (41) as the forcing function plus a convolution with the kernel and the expected loss, i.e., E[P(θr) | W] = F(r) + K E[P(θs) | W] , where (K E[P(θs) | W])(r) = s=0 K (r 1 s)E[P(θs) | W]. (45) C.1 Deterministic equivalent of the loss under SGD The forcing functions F(r) and kernel function K (r) are random functions as they depend on the random matrix W. Moreover the expressions via contour integration show that both of these functions can be described in terms of the random matrix ˆK = D1/2WW T D1/2. Indeed it is the resolvent of ˆK, R( ˆK, z) def = ( ˆK z) 1, which plays a significant role in F and K and thus in the expected loss E[P(θr) | W]. To analyze the power law behavior of the expected loss, it would be helpful to remove the randomness in ˆK, i.e., W. We do this by finding a deterministic equivalent for the resolvent of ˆK, R( ˆK, z) = ( ˆK z) 1, using techniques from random matrix theory. Intuitively, we want to take the expectation over the random matrix W; though not formally true. Formally, we define the deterministic equivalent for the resolvent R( ˆK, z), denoted by R(z) implicitly via a fixed point equation m(z) def = 1 d Pv j=1 j 2α j 2αm(z) z where R(z) = Diag 1 j 2αm(z) z : 1 j v . (46) As mentioned earlier, this deterministic equivalent, R(z) can be viewed, roughly as, EW [( ˆK z) 1] = EW [R( ˆK, z)] R(z); though it is not formally the expectation over W. Using this deterministic expression for the resolvent of ˆK, we defined deterministic expressions for the forcing function via the contour representation of F(r) in (43) forcing function deterministic equivalent F(r) def = 1 Γ R(z), (D1/2b) 2 (1 2γBz+γ2B(B+1)z2)r dz, (47) and the kernel function in (44) kernel function deterministic equivalent K(r) def = γ2B Tr 1 Γ z2(1 2γBz+γ2B(B+1)z2)r R(z) dz . (48) Using the deterministic expressions for the forcing function F and kernel function K, we define the deterministic function P(r) : N R as the solution to the (discrete) convolution-type Volterra equation: P(r) = F(r) + (K P)(r), where (K P)(r) = s=0 K(r 1 s)P(s). (49) We note the similarity with the Volterra equation for SGD. We conjecture that the two processes are close: for {θr} the sequence of iterates generated by SGD with θ0 = 0 and any ε > 0, (1 ε) sup r N for all admissible v, d with probability going to 1 as d . We leave this for future research; we suspect it is true based upon existing of deterministic equivalence theory for random matrices and numerical evidence. C.2 Convergence threshold A natural question is: for what choices of batch B and learning rate γ does P converge? To answer this, we introduce an additional quantity, the kernel norm defined as (kernel norm) K def = s=0 K(s). (50) Proposition C.1 (Kernel norm). The kernel norm is satisfies If 2α > 1, then v be taken to equal , that is, In the case that 2α < 1, we have j=1 j 2α γ 2(1 2α)v1 2α. In all cases, we choose γ so that the kernel norm is asymptotic to a strictly positive constant. A well-known result about convolution-type Volterra such as (49) is that the solution of convolutiontype Volterra equation is bounded if and only the forcing function F(r) is bounded and the kernel norm K < 1. This naturally leads to conditions for our specific forcing function and kernel function. Remark C.1 (Convergence threshold conditions.). The forcing function F is bounded and the kernel norm K < 1 for (47) and (48), respectively, if and only if (i). |1 2γBλj + γ2B(B + 1)λ2 j| < 1, for all λj [0, 1] and (ii). kernel norm K < 1. (51) The first term ensures that the forcing function of the Volterra equation in (49) goes to 0 (i.e., bounded) and the second condition is the same kernel norm bound. Moreover, we can think of condition (i). as the same condition needed for gradient descent to converge while the kernel norm is the effect of noise from SGD. We also note that in light of Proposition C.1 the kernel norm does not involve the batch size B. Therefore the condition K < 1 only places a condition on the learning rate (see below). We now state necessary/sufficient conditions on the batch size and learning rate (51) (Proof of Prop. 2.1). Proposition C.2 (Necessary/Sufficient conditions on learning rate and batch size). The learning rate, γ > 0 and batch size, B > 0, satisfy K < 1, γ(B + 1) < 2. (52) if and only if the solution P(r) to the convolution-type Volterra equation (10) is bounded. Proof. From (51), we need that |1 2γBλj + γ2B(B + 1)λ2 j| < 1, for all λj [0, 1]. For this, we consider two cases. First, suppose that 1 2γBx + γ2B(B + 1)x2 < 1 for all x [0, 1]. We have that 2γBx + γ2B(B + 1)x2 < 0 x( 2γB + γ2B(B + 1)x) < 0. (53) The roots are precisely x = 0 and x = 2 γ(B+1). If 2/(γ(B + 1)) > 1, then the inequality in (53) always holds. Therefore, we need that γ(B + 1) < 2. Now suppose 1 + 2γBx γ2B(B + 1)x2 < 1. Then we have 2 + 2γBx γ2B(B + 1)x2 < 0, for all x [0, 1]. (54) The roots of the left-hand side are complex and thus the inequality always holds. Remark C.2. Below the high-dimensional line, 2α < 1, the kernel norm diverges with v for fixed constant γ, and so we must take γ 0 to ensure bounded solutions. Furthermore, with γ 0 (at any rate depending on v) we have the asymptotic equivalence j=1 j 2α γ 2(1 2α)v1 2α. For a proof of the asymptotic for K , see Corollary G.1. Remark C.3. Similar results hold for the expected SGD loss (via the Volterra equation (45)) by replacing K with K . C.3 Simplification of the Volterra Equation While convolution-type Volterra equation such as (49) are quite nice and well studied in the literature (e.g., [22]), we need an approximation of the solution to it to have better understanding of computeoptimal curves. In this section, we show that we can bound (above and below) P(r) by a constant multiple of the forcing function F and kernel function K. C.3.1 Background on Volterra equations To do this, we need some background on general convolution-type Volterra equations of the form: P(t) = f(t) + (K P)(t), where (K P)(t) = s=0 K(s)P(t s). (55) where f(t) is a non-negative forcing function and K(t) is a monotonically decreasing non-negative kernel function. Let us define K n def = (K K . . . K K | {z } n times )(t), the n-fold convolution of K where K 1 = K(t). Under mild assumptions such as K = P t=0 K(t) < 1 and the forcing function f is bounded, then there exists a unique (bounded) solution P(t) to (55) and the solution is given by repeatedly convolving the forcing function with K (see, e.g., [22, Theorem 3.5]), P(t) = f(t) + j=1 K j f(t) = f(t) + (K f)(t) + (K K f)(t) + (K K K f)(t) + . . . . This representation of the solution to (55) enables us to get good bounds on P(t). First, we state and prove a lemma attributed to Kesten s Lemma [3, Lemma IV.4.7]. Lemma C.1 (Kesten s Lemma). Suppose the kernel function K is positive and monotonically decreasing and K < . Moreover suppose for some ε > 0, there exists a T(ε) > 0 such that s=0 K(s)K(t s) 2(1 + ε) K K(t) for all t T. (56) Then for all n 0, K(T) + 1 2 K (1 + ε) n. Proof. Define an def = sup t>0 K n(t) K(t)(2 K )n 1 . Then a1 = 1, and we are trying to prove K(T) + 1 (1 + ε)n 1. By definition of the convolution, we have that 2 K K n(t s) K(t s)(2 K )n 1 an By the hypothesis (56), for t T, K (n+1)(t) (2 K )n an(1 + ε)K(t). (57) For t < T, we have K(s)K (n)(t s) (K monotonically decreasing) K(0) (2 K )n = K(0) 1 where the last equality follows by the equality K n = K n, [22, Theorem 2.2(i)]. In conclusion, by monotonicity, we have that K (n+1)(t) (2 K )n K(t) ( K(0) 2n K(T ), t T an(1 + ε), t T. Hence we have that an+1 K(0) K(T)2n + (1 + ε)an. Developing the recursion, (1 + ε)j K(0) n j + (1 + ε)n (1 + ε)n 1 1 1/2 1 K(0) K(T) + (1 + ε)n. The result is proven. Remark C.4. If the assumption (56) holds only for ˆT > t > T, then the statement of Lemma C.1 still holds with K(T) + 1 2 K (1 + ε) n. We now give a non-asymptotic bound for the general convolution-type Volterra equation. Lemma C.2 (Non-asymptotic Volterra bound). Let K and f be non-negative functions. Suppose K is monotonically decreasing and for some ε > 0, there exists a T(ε) > 0 such that s=0 K(s)K(t s) 2(1 + ε) K K(t), for all t T. Moreover, suppose the convergence threshold condition 2(1 + ε) K < 1 holds. Then f(t) + (K f)(t) P(t) f(t) + C (K f)(t), where C = K(0) K(T ) + 1 1 1 2 K (1+ε) . Proof. We consider the upper and lower bound separately. Lower bound: Since K and f is are non-negative, then P j=1(K j f)(t) (K 1 f)(t) (K f)(t). Recall the solution to the convolution-type Volterra equation takes the form, P(t) = f(t) + j=1 (K j f)(t). It immediately follows from P j=1(K j f)(t) (K f)(t) the lower bound. Upper bound: The solution to a Volterra equation (in L1) is P(t) = f(t) + j=1 (K j f)(t). By Lemma C.1 and the hypothesis, there exists a T > 0 and ε > 0 such that K j(s) K(s) K(0) K(T) + 1 (2 K (1 + ε))j 1, and (2 K (1 + ε))j 1 < 1. Hence, we have that j=1 (K j f)(t) = s=0 K j(s)f(t s) j=1 (2 K (1 + ε))j 1(K f)(t) K(T) + 1 1 1 2 K (1 + ε) The result is shown. C.3.2 Proof of Theorem 2.1 We are now ready to show one of the main tools used to analyze the loss function, Theorem 2.1. The result relies on approximations for the kernel and forcing functions found in Section F and Section G. We restate the theorem statement to remind the reader of the result. Theorem C.1 (Approximation solution for P). Suppose γ and B is at most half the convergence threshold and α > 1 4. There exists an M > 0 large and a constant C = C(α, β, M), independent of d, so that for all admissible v and d, for all M < γBr, F(r) + (K F)(r) P(r) F(r) + C (K F)(r). (58) The convolution further simplifies. For any ϵ > 0, there exists an M > 0 and a constant C = C(α, β, M) independent of d so that for all M < γBr, (1 ϵ) K F(r) + 1 C γB K(r) (K F)(r) C K F(r) + 1 γB K(r) . (59) Proof of Theorem C.1 / Theorem 2.1. Note for all γBr > 1/Md2α, we have that c F0 F(r), K(r) CF0(r) for some C, c > 0. This is where the limiting level starts to dominate. We begin by showing (58). Fix ε > 0. From Proposition G.2, we have that there exists an M > 0 sufficiently large so that the hypothesis for Kesten s Lemma, i.e., s=0 K(s)K(r s) 2(1 + ε) K K(r), for all d2α/M > γBr > M. Therefore, we get (58) by Lemma C.2. To prove (59), we begin by s=0 K(r s)F(s) = s=0 K(r s)F(s) + s=r/2 K(r s)F(s) K( r s=0 F(s) + F( r where we used monotonicity of F and K. Using Proposition H.2 and Proposition H.4, for large d2α/M γBr M, we have that F( r 2) F(r) since F is power law for large r (see also Corollary F.1). The same holds for K, using Proposition H.5 and Proposition G.2, K( r 2) K(r) for d2α/M γBr M for some M > 0. For small γBr M, we have that F(r/2) C and K(r/2) C for some C > 0. Since F and K are monotonic, we can choose a constant so that F(r/2) F(r) and K(r/2) K(r) for γBr M. Now using Proposition C.1 and Proposition H.6, we have that s=0 K(r s)F(s) K( r s=0 F(s) + F( r s=0 K(s) 1 γB K(r) + F(r) K . For the lower bound, we have that s=0 K(r s)F(s) = s=0 K(r s)F(s) + s=r/2 K(r s)F(s) K(r) s=0 F(s) + F(r) where we used monotonicity of K and F. We note that F(s) C for γBs M for all M > 0. Therefore, s=0 F(s) 1 γB . On the other hand, by Proposition C.1, for any ϵ > 0, there is an M so that for any γBr M, s=0 K(s) (1 ϵ) K . This proves the lower bound. C.4 Details of risk curves for the phases We can now put together a coherent picture of the effect of different choices of α and β and their impact on the Pareto frontier. We will have 4 distinct phases where the expected loss will exhibit a power law decay and 1 region (α + β 0.5) for which the expected loss has no power law decay (see Figure 2a). We will describe each of the 4 power law phases below marked by their boundaries. First, we recall the forcing function F and kernel function K introduced in Section 2.1. Table 5: Decomposition of the forcing and kernel functions. We express the forcing function F(r) as the sum of three functions, Fpp, F0, Fac, up to errors and kernel function K(r) as Kpp(r), up to errors. These functions arise from the different parts of the spectrum of the deterministic equivalent for the resolvent of ˆK. Function Part of spectrum F0(r) def = 1 Γ0 R(z), (D1/2 ˆβ) 2 (1 2γBz + γ2B(B + 1)z2)r dz; see (84) 1 + j 2αd2ακ(v/d) 1 + O(d 1) , where κ solves Z v/d κ dx κ + x2α = 1 κ Pv j=1 j 2β , if 2β > 1 d1 2(α+β) R v/d 0 u 2β κ+u2α du, if 2β < 1; (Prop. H.3) Point mass at z = 0 (Prop. F.1) Fpp(r) def = 1 0 u(2β 1)/(2α) exp( 2γBru) du; see (85) Fpp(r) (2α) 1(2γB)1/(2α) β/α 1 Γ β 2α +1 r (1+β/α)+1/(2α); (Prop. H.2) Fac(r) def = cβ d 2α u 1/(2α)d 1 exp( 2γBru) du, where cβ = Pv j=1 j 2β if 2β > 1 and otherwise 0; see (85) If 2α > 1 and 2β > 1, Fac(r) cβ(2γB) 1+1/(2α)(2α) 1Γ 1 1 2α r 1+1/(2α) d 1; (Prop. H.4) Kpp(r) def = γ2B 2α R 1 0 u1 1/(2α) exp( 2γBur) du, if α > 1/4; see (86) Kpp(r) γ2B (2α) 1(2γB) 2+1/(2α) Γ 2 1 2α r 2+1/(2α); (Prop. H.5) Forcing function. For the forcing function, F(r) = F0(r) + Fpp(r) + Fac(r) + errors F. (60) The function F0(r) is the component of the forcing function corresponding to the point mass at 0, Fpp(r) is the component of the forcing function corresponding to the pure point part of the spectrum, and lastly, the most complicated part of the spectrum, the forcing function corresponding to the distorted features. In particular, we will show in Section F the exact definitions of F0, Fpp, Fac and |error F| are small, and, in Section H, we derive asymptotic-like behaviors for these functions. See Table 5 for definitions and asymptotics. Kernel function. Similarly, the kernel function K is K(r) = Kpp(r) + errors K. Note here that the kernel function has a multiplication by the eigenvalue of ˆK and so the point mass at 0 will not contribute. In Section G, we will give an explicit definition of Kpp and show error terms are small and, in Proposition H.5, we give the asymptotic-like behavior of Kpp. Now we describe in detail the different risk curves that arise for the different phases. C.4.1 Above the high-dimensional line (Phases Ia, II, III) This setting is commonly known as the trace class. It is characterized by four components: learning rate γ can be picked independent of dimension; loss curve does not self average, that is, the loss curve does not concentrate around a deterministic function; v d, but v has no upper bound and so we can take v ; batch, B, is constrained to be small (see Proposition C.2). When 2α > 1, or the trace class phase, the loss will exhibit 3 different phases. We described these phases in detail below. Phase Ia: (2β < 1, 2α > 1). In this phase, it notable for three characteristics: absolutely continuous part of the forcing function does not participate; level at which SGD saturates is affected by β; SGD noise does not participate. In this case, the loss curve is just a constant multiple of gradient flow. Hence, we have that P(r) Fpp(r) + F0(r). Proposition C.3 (Phase Ia: 2β < 1, 2α > 1). Suppose 2β < 1 and 2α > 1. Suppose the learning rate γ and batch B > 0 satisfy at most half the convergence threshold in Proposition C.2. Then there exists an M > 0 large and constants C = C(α, β, M) and c = c(α, β, M), independent of d, so that for all admissible v and d, for all γBr > M c Fpp(r) + F0(r) P(r) C Fpp(r) + F0(r) . (61) Proof. By Theorem C.1, we know that it suffices to look at the forcing function F and kernel function K. Moreover, in this regime, we have that γ and B are constant (see Proposition C.2). The rest of the argument relies on the bounds found in Proposition H.2, (Fpp), Proposition H.4 (Fac), Proposition H.3, (F0), and Proposition H.5 (Kpp). For the forcing function, Fac(r) = 0 as 2β < 1 (Proposition H.4). Therefore the forcing function is composed of Fpp(r) and F0(r). First, we have that (γBr) 2+1/(2α) < (γBr) (1+β/α)+1/(2α) as β < α in this phase. Thus, using Proposition H.2 and Proposition H.5, for γBr > M, where M is some constant, we have that 1 γB Kpp(r) C Fpp(r) for some C > 0. Hence the result is shown. As a consequence of the argument above, we know that P(r) Fpp(r), if γBr D0 F0(r), if γBr D0 for some D0 that depends on d. Phase II: (2β > 1, 2α > 1, β < α) For this phase, we see that limit level is unaffected by β; absolutely continuous spectrum takes over for r (Mdα, d2α/M) for some M; SGD noise does not participate. Therefore, in this case, we have that P(r) Fpp(r) + Fac(r) + F0(r). Proposition C.4 (Phase II: 2β > 1, 2α > 1, β < α). Suppose 2β > 1, 2α > 1, and β < α. Suppose the learning rate γ and batch B > 0 satisfy at most half the convergence threshold in Proposition C.2. Then there exists an M > 0 large and constants C = C(α, β, M) and c = c(α, β, M), independent of d, so that for all admissible v and d, for all γBr > M c Fpp(r) + Fac(r) + F0(r) P(r) C Fpp(r) + Fac(r) + F0(r) . (62) Proof. By Theorem C.1, we know that it suffices to look at the forcing function F and kernel function K. Moreover, in this regime, we have that γ and B are constant (see Proposition C.2). The rest of the argument relies on the bounds found in Proposition H.2, (Fpp), Proposition H.4 (Fac), Proposition H.3, (F0), and Proposition H.5 (Kpp). γBr M0, for some M0: First, we have that 1 γB Kpp(r) C0 Fpp(r) (Proposition H.5) and Fac(r) C0 Fpp(r) (Proposition H.4) for some constant C0 > 0. The constant M0 is where the asymptotic of Fpp starts to apply. M0 γBr M1, for some M0 and for all M1 > M0: We see that (γBr) 2+1/(2α) < (γBr) (1+β/α)+1/(2α) as β < α in this phase. Thus, using Proposition H.2 and Proposition H.5, we have that 1 γB Kpp(r) C1 Fpp(r) for some C1 > 0. A quick computation shows that Fac(r) Fpp(r). M1 γBr M2d2α, for any M1 and some M2: The M2 is the smallest of the two endpoints for the asymptotics of Fpp and Fac. As in the previous regime, we have that 1 γB Kpp(r) Fpp(r). In this region, Fac(r) d 1(γBr) 1+1/(2α) and Fpp(r) (γBr) (1+β/α)+1/(2α). We see at (γBr) = d2α that (γBr) β/α (d2α) β/α = d 2β d 1 as 2β > 1. Therefore, at γBr = d2α, Fpp(r) Fac(r) and we started, i.e., when r = M1 with Fac(r) Fpp(r). Therefore, we must change in this regime to being Fac dominate. M2d2α γBr for all M2: In this case, all terms are bounded above by F0(r). As a consequence of the argument above, we know that Fpp(r), if γBr D0 Fac(r), if D0 γBr D1 F0(r), if γBr D1 for some D0, D1 that depend on d. (63) Phase III: SGD noise appears, (2β > 1, 2α > 1, β > α) In this case, we see that SGD changes the dynamics over gradient flow. In particular, limit level is unaffected by β; absolutely continuous forcing function takes over for iterations r (Md, d2α/M) for some M; SGD noise regulates convergence. Thus, we have that P(r) Fac(r) + F0(r) + 1 γB Kpp(r). (64) Proposition C.5 (Phase III: 2β > 1, 2α > 1, β > α). Suppose 2β > 1, 2α > 1, and β > α. Suppose the learning rate γ and batch B > 0 satisfy at most half the convergence threshold in Proposition C.2. Then there exists an M > 0 large and constants C = C(α, β, M) and c = c(α, β, M), independent of d, so that for all admissible v and d, for all γBr > M c Fac(r) + F0(r) + 1 γB Kpp(r) P(r) C Fac(r) + F0(r) + 1 γB Kpp(r) . (65) Proof. By Theorem C.1, we know that it suffices to look at the forcing function F and kernel function K. Moreover, in this regime, we have that γ and B are constant (see Proposition C.2). The rest of the argument relies on the bounds found in Proposition H.2, (Fpp), Proposition H.4 (Fac), Proposition H.3, (F0), and Proposition H.5 (Kpp). γBr M0, for some M0: First, we have that Fpp(r) C0 1 γB Kpp(r) (Proposition H.5) and Fac(r) C0 1 γB Kpp(r) (Proposition H.4) for some constant C0 > 0. The constant M0 is where the asymptotic of Kpp starts to apply. M0 γBr M1, for some M0 and for all M1 > M0: We see that (γBr) 2+1/(2α) > (γBr) (1+β/α)+1/(2α) as β > α in this phase. Thus, using Proposition H.2 and Proposition H.5, we have that Fpp C1 1 γB Kpp(r) for some C1 > 0. A quick computation shows that Fac(r) Kpp(r). M1 γBr M2d2α, for any M1 and some M2: The M2 is the smallest of the two endpoints for the asymptotics of Kpp and Fac. As in the previous regime, we have that 1 γB Kpp(r) Fpp(r). In this region, Fac(r) d 1r 1+1/(2α) and Kpp(r) r 2+1/(2α). We see at γBr = d2α that (γBr) 1 (d2α) 1 = d 2α d 1 as 2α > 1. Therefore, at (γBr) d2α, Kpp(r) Fac(r) and we started, i.e., when r = M1 with Fac(r) Kpp(r). Therefore, we must change in this regime to being Fac dominate. M2d2α rγB for all M2: In this case, all terms are bounded above by F0(r). As a consequence of the argument above, we know that Kpp(r), if γBr D0 Fac(r), if D0 γBr D1 F0(r), if γBr D1 for some D0, D1 that depend on d. (66) C.4.2 Below the high-dimensional line (Phases IVa, IVb, Ib, Ic) One of the main differences between the previous regime and this regime is that V can not be taken to independent of d. As a result, we call this below the high-dimensional line and it is precisely bounded by whether 2α is summable or not. The four main characteristics of this regime are: learning rate γ scales like v 1+2α; SGD loss, i.e., E [P(θr)] self-concentrates; v can not be too large, i.e., d and v are proportional; batch can be large (i.e., γB 1) since the learning rate is small (γ v 1+2α). In Phases IV, Ib, and Ic, because j 2α is not summable, the summation of the j depends on the dimension v. Thus, the kernel norm is j=1 j 2α γ 2(1 2α)v1 2α, where the learning rate γ is chosen so that K is constant, i.e., γ = γ K where γ > 0 is a constant. Phase IV, (2β > 1, 1 2). In this phase, we have the following limiting value of the loss that SGD converges to is unaffected by β; pure point forcing function plays a role; absolutely continuous part of the spectrum does not contribute to the forcing function; SGD noise affect the loss curves. In this phase, the loss curve is P(r) Fpp(r) + F0(r) + 1 γB Kpp(r). The following gives the precise statement. Proposition C.6 (Phase IV: 2β > 1, 1 2). Suppose 2β > 1 and 1 2. Suppose the learning rate γ and batch B > 0 satisfy at most half the convergence threshold in Proposition C.2. Then there exists an M > 0 large and constants C = C(α, β, M) and c = c(α, β, M), independent of d, so that for all admissible v and d, for all γBr > M c Fpp(r) + F0(r) + 1 γB Kpp(r) P(r) C Fpp(r) + F0(r) + 1 γB Kpp(r) . (67) Proof. By Theorem C.1, we know that it suffices to look at the forcing function F and kernel function K. Moreover, in this regime, we have that γ decreases like d2α 1 (see Proposition C.2). The rest of the argument relies on the bounds found in Proposition H.2, (Fpp), Proposition H.4 (Fac), Proposition H.3, (F0), and Proposition H.5 (Kpp). We first note there is no Fac(r) F0 and therefore it is too small to contribute. γBr M0, for some M0: First, we have that 1 γB Kpp(r) C0 Fpp(r) for some constant C0 > 0. The constant M0 is where the asymptotic of Fpp starts to apply. M0 γBr M1, for some M0 and for all M1 > M0: We see that γ(γBr) 2+1/(2α) < (γBr) (1+β/α)+1/(2α) since γ d2α 1 and 2α < 1 in this phase. Thus, using Proposition H.2 and Proposition H.5, we have that 1 γB Kpp(r) C1 Fpp for some C1 > 0. M1 γBr M2d2α, for any M1 and some M2: The M2 is the smallest of the two endpoints for the asymptotics of Kpp and Fpp. In this region, Fpp(r) (γBr) (1+β/α)+1/(2α) and γ 1 γB Kpp(r) γ (γBr) 2+1/(2α) d2α 1 (γBr) 2+1/(2α). We see at r = d2α that d2α 1(γBr) 1 = d 1 d 2β = (d2α) β/α. Thus Fpp(r) 1 γB Kpp(r) and we started, i.e., when r = M1 with Kpp(r) Fpp(r). Therefore, we must change in this regime to being Kpp dominate. M2d2α γBr for all M2: In this case, all terms are bounded above by F0(r). As a consequence of the argument above, we know that Fpp(r), if γBr D0 1 γB Kpp(r), if D0 γBr D1 F0(r), if γBr D1 for some D0, D1 that depend on d. (68) Phase Ib, (2β < 1, 0.25 < α < 0.5, 2(α + β) > 1). Phase Ia, Ib, and Ic are quite similar as the dynamics of SGD only depend on the forcing function pure point and limiting value. In this phase, the learning rate γ is dimension dependent, unlike Phase Ia, and the following hold limiting value of the loss that SGD converges to is d 2α+1 2β; absolutely continuous part of the spectrum does not contribute to the forcing function; SGD noise not does affect the loss curves. In this phase, the loss curve is P(r) Fpp(r) + F0(r). Although we did not prove the statement for α < 0.25 as we do not have estimates for the kernel function, we believe that statement still holds. We believe that the kernel function stops becoming power law when α < 0.25, but the forcing function is still power law. The following gives the precise statement. Proposition C.7 (Phase Ib: 2β < 1, 1 2, 2(α + β) > 1). Suppose 2β < 1, 2(α + β) > 1, and 1 2. Suppose the learning rate γ and batch B > 0 satisfy at most half the convergence threshold in Proposition C.2. Then there exists an M > 0 large and constants C = C(α, β, M) and c = c(α, β, M), independent of d, so that for all admissible v and d, for all γBr > M c Fpp(r) + F0(r) P(r) C Fpp(r) + F0(r) . (69) Proof. By Theorem C.1, we know that it suffices to look at the forcing function F and kernel function K. Moreover, in this regime, we have that γ decreases like d2α 1 (see Proposition C.2). The rest of the argument relies on the bounds found in Proposition H.2, (Fpp), Proposition H.4 (Fac), Proposition H.3, (F0), and Proposition H.5 (Kpp). We first note there is no Fac(r). γBr M0, for some M0: First, we have that 1 γB Kpp(r) C0 Fpp(r) for some constant C0 > 0. The constant M0 is where the asymptotic of Fpp starts to apply. M0 γBr M1d2α, for any M0 and some M1: The M1 is the smallest of the two endpoints for the asymptotics of Kpp and Fpp. In this region, Fpp(r) (γBr) (1+β/α)+1/(2α) and γ 1 γB Kpp(r) γ (γBr) 2+1/(2α) d2α 1 (γBr) 2+1/(2α). We see at r = d2α that d2α 1(γBr) 1 = d 1 d 2β = (d2α) β/α. Thus 1 γB Kpp(r) Fpp(r) and we started, i.e., when r = M1 with Kpp(r) Fpp(r). Therefore, Fpp must dominate. M1d2α γBr for all M1: In this case, all terms are bounded above by F0(r). We expect Prop. C.7 to hold with the same conclusions for 2β < 1, 2α < 1, and 2(α + β) > 1. As a consequence of the argument above, we know that P(r) Fpp(r), if γBr D0 F0(r), if γBr D0 for some D0 that depends on d. (70) Phase Ic, (2β > 1, 0 < α < 1 4. Lastly, we consider Phase Ic, which is similar to Phases Ia and Ib. The following holds in this phase. limiting value of the loss that SGD converges to is d 2α+1 2β; absolutely continuous part of the spectrum does not contribute to the forcing function; SGD noise not does affect the loss curves. In this phase, the loss curve is P(r) Fpp(r) + F0(r). Under the assumption that Theorem C.1 holds for α > 1/4, we get the following. Proposition C.8 (Phase Ic: 2β > 1, 0 < α < 1 4). Suppose 2β > 1 and 0 < α < 1 4 and Theorem C.1 holds. Suppose the learning rate γ and batch B > 0 satisfy at most half the convergence threshold in Proposition C.2. Then there exists an M > 0 large and constants C = C(α, β, M) and c = c(α, β, M), independent of d, so that for all admissible v and d, for all γBr > M c Fpp(r) + F0(r) P(r) C Fpp(r) + F0(r) . (71) We can not prove this statement as we do not have sharp bounds on the kernel function in this region. We believe that the kernel function stops becoming power law, but the forcing function is still power law. Thus, it should become even more forcing function dominate. We believe the loss curve follows similar behavior to Phase Ia and Phase Ib, that is, P(r) Fpp(r), if γBr D0 F0(r), if γBr D0 for some D0 that depends on d. (72) D Compute-optimal curves Throughout this section, consider the deterministic equivalent loss function P(r) = P(r, d). Moreover as batch size B is order 1, it only effects the compute-optimal curves by a constant. Therefore, we can set B = 1. For each iteration r, the SGD costs d flops, or equivalently r/d = flops, f. The goal is to find the optimal compute line as a function of the number of flops f: If d (f) def = arg mind P( f d, d), the optimal compute line is precisely P f d (f), d (f) . To do this, we simplify the loss curve P( f d, d). While it is possible to minimize this as a function of d, an alternative function considered is the following P(r, d) def = Fpp(r, d) Fac(r, d) F0(r, d) 1 γB Kpp(r, d), (73) which achieves the right power law behavior as the true compute-optimal curve and deviates from this true curve by an absolutely constant (independent of d, f) (see Theorem C.1). Note here some of the terms should be taken as 0 when not defined for the different phases. Using this alternative loss function, P(r, d), the compute-optimal line must occur at one of the corner points, i.e., where any pair of functions equal each other. The following lemma gives a useful characterization of these points. Table 6: Summary of the compute-optimal curves for P( f d, d) for above the high-dimensional line, 2α > 1. This includes Phases Ia, II, and III. Trade off Compute-optimal Curves Phase Ia (Prop. D.1) Fpp = F0 P Phase Ia(f) f 1 2α+1 1 (1+β/α 1/(2α)) d Phase Ia f1/(2α+1) Phase II (Prop. D.2) Fpp = Fac P Phase II(f) f 2α+2β 1 2(α+β) d Phase II f(β/α)/(1+β/α) Phase III (Prop. D.3) 1 γB Kpp = Fac P Phase III(f) f(1 4α)/(4α) d Phase III f1/2 Lemma D.1. Suppose C0, C1 > 0 are constants and γ0, γ1, p0, p1 > 0 exponents such that a function ˆP(r, d) equals ˆP(r, d) = max C0r γ0d p0, C1r γ1d p1 . Then replacing r 7 f d the minimizer in d satisfies d def = arg mind {ˆP(f, d)} = C0 C1 1/(γ1 p1 γ0+p0) f( γ0+γ1)/(γ1 p1 γ0+p0) and the optimal value is min d ˆP(f, d) = C0 f γ0 (d )γ0 p0. Proof. The proof is a straightforward computation. The minimizer of ˆP(f, d) in d must occur where the two terms in the maximum are equal, i.e., d γ0d p0 = C1 f Solving for this d gives d . Plugging in the value of d into ˆP(f, d) gives the optimal value. Remark D.1. The possible minimal values of (73), i.e., where pairs of functions in the max are equal, can be reduced further. For instance, if Fac(r, d) exist for the phase, then for some 0 < r0 < r1 < r2 Fpp(r, d), 0 < r r0 1 γB Kpp(r, d) r0 < r r1 Fac(r, d), r1 < r < r2 F0(r, d), r2 < r Thus, there are only a maximum of three points to check in order to find the optimal compute curve. Remark D.2. In view of Lemma D.1, to find the optimal compute curves, we first find the potential curves (i.e., all the possible combinations of two functions in the loss curve are equal while still lying on the loss curve). Then the curve which has the smallest exponent on the flops, f, is the optimal compute curve. D.1 Compute-optimal curves: Above the high-dimensional line (Phases Ia, II, III). To ease notation, we introduce several constants that will be used only in this Section D.1: Fpp(r, d) (γBr) (1+β/α)+1/(2α), Fac(r, d) d 1(γBr) 1+1/(2α), 1 γB Kpp(r, d) γ (γBr) 2+1/(2α), and F0(r, d) d 2α+max{0,1 2β}, where the asymptotics only hold in specific regions of the space of γBr. For additional details on the derivation of these asymptotics and the constraints on γBr where asymptotics hold, see Section H. Remark D.3. The constants in the asymptotics are dimension independent and only depend on α, β. The compute-optimal curves are summarized in Table 6. D.1.1 Phase Ia In this case, the approximate loss curve is given by d, d) = max{Fpp( f d, d), F0( f d, d)} max{ f d (1+β/α)+1/(2α), d 2α+1 2β}. (74) With this, we give a description of the optimal compute curve. Proposition D.1 (Phase Ia: Compute-optimal Curve). Suppose we are in Phase Ia, that is, 2β < 1 and 2α > 1. The compute-optimal curve using P( f d, d) in (74) occurs when Fpp( f d, d) = F0( f d, d). Precisely, the optimal d which minimizes P( f d Phase Ia f1/(2α+1), and the compute-optimal curve is P Phase Ia(f) f 1 2α+1 1 (1+β/α 1/(2α)). Proof. We apply Lemma D.1 with C0 = 1, γ0 = 1 + β/α 1/(2α), p0 = 0 and C1 = 1, γ1 = 0, p1 = 2α 1 + 2β. D.1.2 Phase II In this case, the approximate loss curve has three terms (Proposition C.4 with (63)) d, d) = max Fpp( f d, d), Fac( f d, d), F0( f d (1+β/α)+1/(2α), f d 1+1/(2α) d 1, d 2α (75) Proposition D.2 (Phase II: Compute-optimal Curve). Suppose we are in Phase II, that is, 2β > 1, 2α > 1, and β < α. The compute-optimal curve using P( f d, d) in (75) occurs when Fpp( f d, d) = Fac( f d, d). Precisely, the optimal d which minimizes P( f d Phase II f(β/α)/(1+β/α), and the compute-optimal curve is P Phase II(f) f 2α+2β 1 Proof. Using the Remark D.2 after Lemma D.1 and Proposition C.4 with (63), we only need to check two intersections: Fpp = Fac and Fac = F0. The curve which has the smallest (i.e., largest negative) exponent (i.e, steepest curve on a log-log plot) is the compute-optimal curve. Case 1: Consider Fpp( f d, d) = Fac( f d, d). We apply Lemma D.1 with C0 = 1, γ0 = 1 + β/α 1/(2α), p0 = 0 and C1 = 1, γ1 = 1 1/(2α), p1 = 1 to get that the minimum is d 1 f(β/α)/(1+β/α) and the optimal value is P 1(f) f 2α+2β 1 Case 2: Consider Fac( f d, d) = F0( f d, d). As before, we apply Lemma D.1 with C0 = 1, γ0 = 0, p0 = 2α and C1 = 1, γ1 = 1 1/(2α), p1 = 1 to get that the minimum is d 2 f1/(2α+1) and the optimal value is P 2(f) f 2α/(2α+1). One can check that 2(α + β) < 2α 2α + 1, for all 2β > 1, 2α > 1, β < α. Therefore, Case 1 is the optimal overall. D.1.3 Phase III In this case, the approximate loss curve has three terms (Proposition C.5 with (66)) d, d) = max 1 d, d), Fac( f d, d), F0( f d 2+1/(2α), f d 1+1/(2α) d 1, d 2α . (76) Proposition D.3 (Phase III: Compute-optimal Curve). Suppose we are in Phase III, that is, 2β > 1, 2α > 1, and β > α. The compute-optimal curve using P( f d, d) in (76) occurs when 1 γB Kpp( f d, d). Precisely, the optimal d which minimizes P( f d Phase III f1/2, and the compute-optimal curve is P Phase III(f) f(1 4α)/(4α). Proof. Using the Remark D.2 after Lemma D.1 and Proposition C.5 with (66), we only need to check two curves: 1 γB Kpp = Fac and Fac = F0. The curve which has the smallest (i.e., largest negative) exponent (i.e, steepest curve on a log-log plot) is the compute-optimal curve. Case 1: Consider Fac( f d, d) = F0( f d, d). We did this for Phase II in the proof of Proposition D.2. Thus, we have C0 = 1, γ0 = 0, p0 = 2α and C1 = 1, γ1 = 1 1/(2α), p1 = 1 to get that the minimum is d 1 f1/(2α+1) and the optimal value is P 1(f) f 2α/(2α+1). Case 2: Consider 1 γB Kpp( f d, d) = Fac( f d, d). We apply Lemma D.1 with C0 = 1, γ0 = 2 1/(2α), p0 = 0 and C1 = 1, γ1 = 1 1/(2α), p1 = 1 to get that the minimum is d 2 f1/2 and the optimal value is P 2(f) f(1 4α)/(4α). One can check that 1 4α 4α < 2α 2α + 1, for all 2β > 1, 2α > 1, β > α. Therefore, Case 2 is the optimal overall. Table 7: Summary of the compute-optimal curves for P( f d, d) for below the highdimensional line, 2α < 1. This includes Phases IV, Ib, and Ic. Trade off Compute-optimal Curves Phase IVa (Prop. D.4) 1 γB Kpp = F0 P Phase IVa(f) f α d Phase IVa f1/2 Phase IVb (Prop. D.5) 1 γB Kpp = Fpp P Phase IVb(f) f(1 2α)(2α+2β 1)/(2(2αβ+α 2β)) d Phase IVb f(α β)/(2αβ+α 2β) Phase Ib (Prop. D.6) Fpp = F0 P Phase Ib(f) f1/2 α β d Phase Ib f1/2, Phase Ic (Prop. D.7) Fpp = F0 P Phase Ic(f) f α(2α+2β 1) α(2β 3) 2β+1 d Phase Ic f 1 2(α+β) 2(α(2β 3) 2β+1) D.2 Compute-optimal curves: Below the high-dimensional line (Phase IV, Ib, Ic), 2α < 1 This section main distinction with above the high-dimensional line section is the dependency of the learning rate on v. In deed, we have that v/d c (0, ) and the learning rate is chosen so that the kernel norm is constant, i.e., γ 2(1 2α) K v2α 1 γ def = γ v2α 1, where γ is the positive constant so that γ = γ v2α 1. Consequently, we also need to keep track of the learning rate in the various terms. In this section, we assume that v/d r (1, ) as v, d . We state for completeness the d and r dependency on the forcing and kernel function, including the learning rate γ. We note that these asymptotics only hold for a set of γBr values which depend on the spectral properties of K (see the propositions listed next to the terms for details). Fpp(r) (γ r) (1+β/α)+1/(2α) v2α 1 d2α 1 (1+β/α)+1/(2α) (d2α 1 r) (1+β/α)+1/(2α) (d2α 1 r) (1+β/α)+1/(2α), (see Proposition H.2) 1 γB Kpp(r) γ 1+1/(2α) r 2+1/(2α) γ V 2α 1 d2α 1 1+1/(2α) d2 1/(2α) 2α r 2+1/(2α) d2 1/(2α) 2α r 2+1/(2α), (see Proposition H.5) F0(r) (v1 1/(2α) d1/(2α)) 2α+max{0,1 2β} d1 1/(2α) 2α+max{0,1 2β} d 2α+max{0,1 2β} d 2α+max{0,1 2β}. (see Proposition H.3) D.2.1 Phase IV (a) and (b) In these cases, the approximation loss curve is given by (Proposition C.6 with (68)) d, d) max{Fpp( f d, d), 1 γB Kpp( f d, d), F0( f d (1+β/α)+1/(2α) d(2α 1)( (1+β/α)+1/(2α)), d2 1/(2α) 2α f d 2+1/(2α), d 2α . As one crosses the 2α = 1, line the Fac disappears and Fpp emerges. Consequently, there leaves two possible corners where the compute-optimal value could occur at. When α goes below α = 1/4, the Kpp decreases. The difference between IVa and IVb is simply where the compute-optimal occurs. In IVa, the tradeoff occurs between Kpp and F0, whereas in IVb, the tradeoff occurs at Fpp and Kpp. We give the compute-optimal curve for Phase IVa. Proposition D.4 (Phase IVa: Compute-optimal Curve). Suppose we are in Phase IVa, that is, 2β > 1 and 2 < α < 1 2. The compute-optimal curve using P( f d, d) in (77) occurs when d1 2αKpp( f d, d) = F0( f d, d). Precisely, the optimal d which minimizes P( f d Phase IVa f1/2, and the compute-optimal curve is P Phase IVa(f) f α. Proof. Using the Remark D.2 after Lemma D.1 and Proposition C.6 with (68), we only need to check two curves: Fpp = d1 2α Kpp and d1 2α Kpp = F0. The curve which has the smallest (i.e., largest negative) exponent (i.e, steepest curve on a log-log plot) is the compute-optimal curve. Case 1: Consider Fpp( f d, d) = d1 2αKpp( f d, d). We apply Lemma D.1 with C0 = 1, γ0 = 1 + β/α 1/(2α), p0 = (2α 1)(1 + β/α 1/(2α)) and C1 = 1, γ1 = 2 1/(2α), p1 = 2 + 1/(2α) + 2α, to get that the minimum is d 1 f(α β)/(2αβ+α 2β) and the optimal value is P 1(f) = cfcpp cf cpp ck cpp (1 α)(2α+2β 1)/(2αβ+α 2β) f(1 2α)(2α+2β 1)/(2(2αβ+α 2β)). Case 2: Consider d1 2αKpp( f d, d) = F0( f d, d). We apply Lemma D.1 with C0 = 1, γ0 = 0, p0 = 2α and C1 = 1, γ1 = 2 1/(2α), p1 = 2 + 1/(2α) + 2α, to get that the minimum is and the optimal value is P 1(f) f α. In this region of α s and β s, Case 2 has a smaller exponent on f than in Case 1. As for Phase IVb, we have the following. Proposition D.5 (Phase IVb: Compute-optimal Curve). Suppose we are in Phase IVb, that is, 2β > 1 and 1 2 . The compute-optimal curve using P( f d, d) in (77) occurs when ckd1 2αKpp( f d, d) = cf Fpp( f d, d). Precisely, the optimal d which minimizes P( f d Phase IVb f(α β)/(2αβ+α 2β), and the compute-optimal curve is P Phase IVb(f) f(1 2α)(2α+2β 1)/(2(2αβ+α 2β)). Proof. The computations are exactly the same as in Proposition D.4. For the α s and β s in this region, we see that Case 1 has the smaller exponent on f than Case 2. D.2.2 Phase Ib In this case, the approximate loss curve is given by (Proposition C.7 with (70)) d, d) max{Fpp( f d, d), F0( f d (1+β/α)+1/(2α) d(2α 1)( (1+β/α)+1/(2α)), d 2α+1 2β}. (78) Note this is true for α > 1/4, but we expect this to hold without this extra assumption. With this, we give a description of the optimal compute curve. Proposition D.6 (Phase Ib: Compute-optimal Curve). Suppose we are in Phase Ib, that is, 2β < 1, 1 4 < α < 1 2, and α + β > 1 2. The compute-optimal curve using P( f d, d) in (78) occurs when Fpp( f d, d) = F0( f d, d). Precisely, the optimal d which minimizes P( f d Phase Ib f 1 2 , and the compute-optimal curve is P Phase Ib(f) f 1 2 α β. Proof. We apply Lemma D.1 with C0 = 1, γ0 = 1 + β/α 1/(2α), p0 = (2α 1)(1 + β/α 1/(2α)) and C1 = 1, γ1 = 0, p1 = 2α 1 + 2β. We expect the conclusions of Prop. D.6 to hold for the (α, β) pairs where 2β < 1, 2α < 1, and 2(α + β) > 1. D.2.3 Phase Ic In this case, the approximate loss curve is given by (Proposition C.8 with (72)) d, d) max{Fpp( f d, d), F0( f d (1+β/α)+1/(2α) d(2α 1)( (1+β/α)+1/(2α)), d 2α}. (79) Note again that this is speculative as we do not have the bounds for the kernel. However we do believe that this is correct. With this, we give a description of the compute-optimal curve. Proposition D.7 (Phase Ic: Compute-optimal Curve). Suppose we are in Phase Ic, that is, 2β > 1 and 0 < α < 1 4 and suppose (79) is true. The compute-optimal curve using P( f d, d) in (77) occurs when Fpp( f d, d) = F0( f d, d). Precisely, the optimal d which minimizes P( f d Phase Ic f 1 2(α+β) 2(α(2β 3) 2β+1) , and the compute-optimal curve is P Phase Ic(f) f α(2α+2β 1) α(2β 3) 2β+1 . Proof. We apply Lemma D.1 with C0 = 1, γ0 = 1 + β/α 1/(2α), p0 = (2α 1)(1 + β/α 1/(2α)) and C1 = 1, γ1 = 0, p1 = 2α. 10 6 10 5 10 4 10 3 10 2 10 1 100 spectra weighted in D (1/2) hat(beta) alpha =1.0, beta =2.0, D = 1,000, V = 10,000 empirical density fixed point Figure 6: Spectra of empirical and theory weighted by D1/2 ˆβ. Empirical spectra (blue) averaged over 100 randomly generated matrices W Rv d. Point mass at z = 0 was manually removed. Theory (orange) computed using the resolvent formula (9) and solved with Newton method (10 iterations for each z; z-values were spaced at 0.1d 2α with an imaginary part at d 2α). There is a continuous part that evolves into a pure point outliers. E Spectrum of ˆK: random matrix theory In this section, we analyze the spectrum of ˆK = D1/2WW T D1/2. For this, we use standard tools from random matrix theory to derive a fixed point equation for the Stieljes transform of ˆK. Indeed, by knowing the Stieljes transform of ˆK, one can recover the spectral properties. In particular, we will need the spectra of ˆK decomposes into 3 parts: 1. Point mass at z = 0: There will be a point mass at z = 0 of mass v d for trivial reasons since v d. 2. Pure point outliers: There will be a set of outliers, the pure point spectra, which are at constant order and nearly equal to j 2α for j = 1, 2, . . . , 3. Absolutely continuous part: The spectral bulk, the absolutely continuous part, which form a density on a shrinking window. In fact, we will not need to give a complete picture about all the spectra. E.1 Self-consistent approximation for ( ˆK z) 1 = (D1/2WW T D1/2 z) 1. In this section, we state the deterministic equivalent for the random matrix ( ˆK z) 1 and give some properties of its self-consistent spectra. The starting point for this is the self-consistent equation d PV j=1 j 2α j 2αm(z) z where (D1/2WW T D1/2 z) 1 jj 1 j 2αm(z) z . (80) The identification can be made rigorous by showing that |Tr A(D1/2WW T D1/2 z) 1 Tr A Diag(j 2αm(z) z : j) 1 | Pr d 0, for deterministic sequences of test matices {A} with bounded nuclear norm and generally with very high probability in d. We note that we would need a more precise quantification of errors to be useful for establishing the scaling law for the actual random matrices. In Figure 6, we solve the theoretical spectrum by solving the fixed point equation for m(z) using a Newton method on a grid of complex z. The function m can also be related to the trace of ( b K z) 1. From the definition of m, we can derive the explicit representation theorem. Lemma E.1. For any v, d and z C with Im(m(z)) > 0 1 j 2αm(z) z = v + (1 m(z))d Proof. Multiplying through by z we should evaluate v X z j 2αm(z) z . Adding and subtracting j 2αm(z) z j 2αm(z) z = v + m(z) j 2αm(z) z . Using the definition of m, v X j 2αm(z) z = d 1 m(z) 1 . Substituting this in completes the proof. While an explicit solution of m is unavailable, we can derive many properties of m, starting with: Proposition E.1. Suppose X is a real random variable compactly supported in [0, ). For every z H = {z C : Im z > 0}, there is a unique solution of (80) satisfying Im m(z) < 0. Moreover, this solution is an analytic function, and it can be solved by iterating the fixed point map (m(z) : z H) 7 1 d E X Xm(z) z : z H initialized with m 1. Furthermore, if we consider the equation for z H F(m; z) def = m + v d E Xm Xm z then this is solved uniquely by m(z) and moreover it is stable in that m F = 0 in a neighborhood of the solution. Proof. Let G be the mapping G(m; z) def = 1 d E X Xm(z) z . For each fixed z H, this is a strict self map into the lower half-plane of m. Self-maps are strict contractions in the hyperbolic metric (from the Schwarz-Pick lemma), and thus there is a unique solution of m(z) = G(m(z); z). We now introduce F according to the formula F(m; z) = m + v d E Xm Xm z so that F(m; z) = m/G(m; z). Introduce the stability operator d E Xz (Xm z)2 Then we have m F = G m m G By the Schwarz-Pick lemma (in the half-plane version), we have | m G| < | Im G(m; z)| and hence in some sufficiently small neighborhood of G(m(z)) = m(z), we therefore have | m G| < 1. Hence also, in a sufficiently small neighborhood of m(z) m F = m1 m G While this proposition does not state what happens on the real line, in regions of the line where m has a finite, real-valued limit, it agrees with its reflection in the lower half-plane. Hence in any open subset of R where limη 0 m(x + iη) exists and is real, m will be analytic. From Proposition E.1, we can derive some explicit estimates on m, which will be sufficient for deriving the estimates on the forcing and kernel functions. We summarize these properties in the following: Proposition E.2. Let X be any random variable with support in (0, 1]. Then the following hold: 1. Near 0: Suppose that a < 0 is a real-valued solution of v d E Xa Xa 1 Then m is analytic in a neighborhood of z = 0 and m(z) = az + O(z2). If furthermore if for some interval [0, z0] the equation d E Xa Xa 1 is solvable for a < 0, then m is analytic in a complex neighborhood of the whole interval [0, z0]. 2. Far away: Let ϱ0(z) be the distance of z to [0, 1]. Suppose that z is such that 16 v d E X ϱ0(z)2. Then for some absolute constant C > 0, |m(z) 1| 8v d E X ϱ0(z). Moreover suppose that v d E X X z ϵ < 1 4, and suppose that δ def = sup m:|m 1| 2ϵ 4 and sup m:|m 1| 2ϵ v d E X|z| |Xm z|2 Then m(z) 1 + v 2δ v d E X X z We need a lemma on stability of solutions. Lemma E.2. Suppose F is an analytic map of H C, Suppose there is a c, δ > 0 and m0 H so that for all m H with |m m0| δ and δ < | Im m0|, | m F(m, z)| c > 0 If m0 is an approximate solution of F(m) = 1 in that it satisfies |1 F(m0)| < cδ, then there is an solution m H with F(m) = 1 so |m m0| δ. Proof. We introduce an ODE (which is the continuous limit of the damped Newton s method) dt = 1 F(mt) m F(mt) , where mt|t=0 = m0. Then we note that along this ODE dt = m F(mt)dmt dt = (1 F(mt, z)). Hence we have (1 F(mt)) = (1 F(m0))e t for however long the ODE exists. In particular, if we have that on some open set U of admissible m containing m0 that | m F(m)| c, then provided the ODE does not exit U, |m m0| = |m m0| Z 0 |(1 F(m0))|e t/c = |(1 F(m0))|/c. Hence as there is a neighborhood of m0 of size δ on which m F(m)| c then as |(1 F(m0, z))| < cδ we have |m m0| |(1 F(m0, z))|/c = δ. Proof of Proposition. Part 1, Near 0: For the component near 0, we consider a change of variables and look at q(z) = m(z)/z which is therefore the unique solution of the fixed point equation: F(q(z), z) := zq(z) + v d E Xq(z) Xq(z) 1 Then by hypothesis, we have a solution at z = 0 given by F(a, 0) = 1. We wish to continue this solution to a neighborhood of (a, 0), and so it would suffice to know that the differential equation q F(q, z)dq dz + z F = 0 has a solution in a neighborhood of the point. Solving for the derivative of q, dq dz = q q F(q, z), where q F(q, z) = z v d E X (Xq(z) 1)2 We note that if we solve the equation along the real line, then q stays real valued. Furthermore, for all q with q < 0 we have q F(q, z) = z v d E X (Xq(z) 1)2 By analyticity, we can extend this solution into a neighborhood in the upper half plane and on an interval of the real line where this solvable. Hence m is analytic in this neighborhood and has boundary values given by m(z)/z a. Part 2, Far away: For the other parts, we use that m is the solution of F(m(z), z) := m(z) + v d E Xm(z) Xm(z) z Hence we have dz = z F(m, z) m F(m, z) = v d E Xm(z) (Xm(z) z)2 d E Xz (Xm(z) z)2 Define a distance fϱ(z) := ϱδ(z) := inf{|Xm z| : |m 1| δ, X [0, 1]}. Provided that |m 1| δ and v d(E X)|z| ϱ(z)2/2 (which is trivially satisfied if v d(E X)/ϱ(z) 1 d E(X) ϱ(z)2 . For any z with ϱ(z) > 0, we can find a unit speed geodesic σ(t) connecting z to so that ϱ(σ(t)) = ϱ(z) + t (this will just be a straight line). Then along this geodesic, v d(E X)|σ(t)| ϱ(σ(t))2/2, since v d(E X)|σ(t)| v d(E X)(|z|+t) v d(E X)(|z|+t|z|/ϱ(z)) (ϱ(z))2(1+t/ϱ(z))/2 (ϱ(z)+t)2/2. Hence along the geodesic, and provided that |m 1| δ, we conclude dm dz (σ(t)) 2(1 + δ)v d E(X) (ϱ(z) + t)2 . Integrating along this line segment from infinity, we conclude that provided the right hand side is less than δ, m(z) 1 Z dz (σ(t)) dt 2(1 + δ)v d E X ϱ(z). For the second conclusion, we use Lemma E.2 with this F (holding z fixed). We let m0 = 1 v d E X (X z) , and we consider an 2ϵ neighborhood of 1, U By assumption, on U m F(m, z) = 1 v d E Xz (Xm(z) z)2 satisfies | m F| 3 4. We also have that |1 F(m0, z)| = v d E X2(1 m0) (Xm0(z) z)(X z) Hence applying Cauchy-Schwarz |1 F(m0, z)| |1 m0|δ = ϵδ. The conclusion now follows from Lemma E.2. E.2 The region near 0 and the spectral bulk We now bound the contribution of the region near 0. Proposition E.3. The function m(z) is analytic in a neighborhood of z = 0 of radius c(α)d 2α for some c(α) > 0. Furthermore, m is negative on (0, cd 2α), vanishes at 0, and has |m (0) + κ(v/d)d2α| Cd2α 1 for all d sufficiently large where κ(v/d) solves Z v/d κ dx κ + x2α = 1. Moreover, we introduce f(z; V/d) where f : H H which solves f(z; a) + Z a f(z; a) dx f(z; a) x2αz = 1. This extends analytically to the interval [0, c). Then f and m are close in that for any compact subset K C \ ([c, ] [ , 0]) , we have |m(zd 2α) f(z)| C(K)/d. We furthermore have that in the case 2β < 1 j 2αm(zd 2α) zd 2α d1 2β Z a x 2β dx f(z) zx2α and in the case 2β > 1 j 2αm(zd 2α) zd 2α cβ C(K)d min{1,2α,2β 1}. Proof. For the first part, we look to apply Proposition E.2 part 1. The equation we need to solve is j 2αa j 2αa 1 = 1, for a < 0. We change variables by setting a = d2ακ and z = d 2αz, in terms of which (j/d) 2ακ (j/d) 2ακ + 1 = 1. (81) By monotonicity, for κ positive, Z v/d κ dx κ + x2α 1 (j/d) 2ακ (j/d) 2ακ + 1 Z v/d κ dx κ + x2α , and moreover the lower bound is only less than the upper bound by at most 1 d uniformly in κ > 0. In the case that 2α < 1, we can bound for κ [0, 1] κ (v/d) 1 + (v/d)2α Z v/d κ dx κ + x2α κ (v/d)1 2α Hence there there is an interval [0, c0] (bounded solely in terms of (v/d)) on which (81) is solvable and is uniformly bounded away from 0 over all d. Hence, the solution of κ of R v/d 0 κ dx κ+x2α = 1 satisfies 1 d (j/d) 2ακ (j/d) 2ακ + 1 [1 1 Following the same bounds on R V/d 0 dx κ+x2α shown above, we conclude that the true solution κ of (81) with z = 0 satisfies |κ κ| = O(1/d). This concludes the proof when 2α < 1. In the case that 2α > 1, Z v/d κ dx κ + x2α Z κ dx κ + x2α = κ1/(2α) Z dx 1 + x2α . On the other hand for κ [0, 1] Z v/d κ dx κ + x2α κ1/(2α) Z (v/d)κ 1/(2α) dx 1 + x2α κ1/(2α) Z (v/d) and hence once more there is an interval [0, c0] independent of V/d on which this is solvable and moreover the conclusions now follow in the same way as in the case that 2α < 1. Convergence to f. The existence and uniqueness of f follows from Proposition E.1, where we define F(f; z) def = f + Z a f dx f x2αz = 1, (making appropriate choices of v/d and X). We further have, from the previous part, that f takes negative values on an the interval (0, c). In what follows we fix a compact set K C \ ([c, ] [ , 0]) . Further, we claim the stability operator f F = 1 Z a x2αz dx (f(z; a) x2αz)2 is nonvanishing on K in a neighborhood of f. Off of the real line, this follows from Proposition E.1. On the real line, it follows from monotonicity of F for f < 0. Hence it follows that on K, there is a constant C(K) and a δ0 > 0 so that if z K and m satisfies |F(m; z) 1| < δ0, then |m f| C(K)|F(m; z) 1|. S(m; z, β) def = j 2αm zd 2α . Then with β = 0 1 d S(m; z, 0) = 1 j 2αm zd 2α . We will define a = v/d, and we will define as the separation def = min min{t [0, (v/d)2α] : |m zt|}, 1 Then by bounding the errors in a trapezoid rule approximation 1 d S(m; z, 0) Z a where we have used a bound on the x derivative of the integrand Z a |z|2αx2α 1 dx |m zx2α|2 1 (which relies on z being bounded away from 0 and on z being bounded in modulus). Hence if m(zd 2α) is the solution of m + m d S(m; z, 0) = 1, then we have |F(m; z) 1| = m + Z a m dx m zx2α 1 |m|d 1 provided m is bounded. To see that m remains bounded, we let m F be the stability operator of the equation 1 = F(m; z) = m + m d S(m; z, 0). Then once more m F = 1 + 1 d S(m; z, 0) + m d m S(m; z, 0). Approximating the sum for m S | m F(m; z) m F(m; z)| d 1 1 Differentiating the fixed point equation, we have the differential equation for m z 1 m F(m; z) m F(m; z) . As the same equation holds for f and is non-degenerate in a neighborhood of f (as the stability operator does not vanish in a neighborhood of the solution), we conclude that dm d(zd 2α) = O(1) and |F(m; z) 1| d 1 uniformly on compact sets for all d sufficiently large Sum formula. Hence having approximated f, we can turn to estimating S(m; z, β). When 2β < 1, we may repeat the Riemann sum approximation argument. Specifically, we have d2β 1S(m; z, β) Z a x 2β dx m zx2α where to bound the errors, we now must estimate d dx x 2α 2β x 2αm zd 2α dx d2β 1 Z v x 2αm zd 2α + x 4α 2β 1m (x 2αm zd 2α)2 Setting x = wd, we arrive at d2β 1S(m; z, β) Z a x 2β dx m zx2α + w 2β 1m (m zw2α)2 We may subsequently replace in this expression m by f. In the case that 2β > 1, we subtract from S the divergence cβ/m and then express S(m; z, β) 1/m j=1 j 2β v X j 2αm zd 2α j 2α 2β Bounding the difference leads to |S(m; z, β) cβ/m| cβd 2α/ + d1 2β/|m|. Remark E.1. There is an exactly solvable case where even more can be said. Note that when 2α > 1 and v/d = , the equation for f becomes f dx f x2αz = 1. Changing variables (which requires a contour deformation which restricts the branches considered) by letting x2αz = fy2α, so that x = ( f/z)1/(2α)y. Then f + ( f/z)1/(2α) Z dx 1 + x2α = 1 Hence with cα = R 0 dx 1+x2α , we have that f is the solution of f + ( f/z)1/(2α)cα = 1. If for example α = 1, then with g = ( f)1/2 we have g satisfies the quadratic equation g2 + gz 1/2c1 = 1, g = z 1/2 c1 c2 1z 1/4 1, with chosen so that Im g 0 and Re g > 0. We note that c1 = π 2 and conclude that (π/4)2 z 2 , with the branch chosen to ensure Im f < 0 when Im z > 0. E.3 The mesoscopic region We will need the following technical estimate on sums over lattice points. Lemma E.3. Suppose that z and w are complex numbers and z/w Z 1 wn + z = π w cot(πz/w). (82) Moreover, if we suppose | Im(z/w)| | Re(z/w)| then there is an aboslute constant C > 0 so that for any N N pv X Proof. Note that we may remove a factor 1 w from all statements and instead look at the case (with y = z/w) 1 wn + z = 1 Then by a residue computation (applied to the function π cot(πz) 1 z y) wπ cot(πy) = π w cot(πz/w), where we have used that cot is odd. Now by pairing terms, we have pv X 1 wn + z + 1 wn + z Making a common fraction, we have pv X 2z (wn)2 + z2 1 n2 + (z/w)2 Now Re(z/w)2 < 0, and hence the claim follows. Proposition E.4. Let α, β 0 with neither equal to 1 2. We further assume 2α + β = 1 2. For u, η, a, b 0 consider with m = a ib u iη + j 2αm for real A, B. We suppose that the a, b, u, η and ϵ satisfy 2. η + ub < ϵ4u, 4. log(1/ϵ)u1+1/(2α) cη, 6. 0 < u < c. Then there is an ϵ0 > 0, c > 0, and C > 0 so that if ϵ (0, ϵ0) such that B π(u/a)β/α 1/(2α) 2αa cβ b a2 oα+β (η + ub)v1 2α 2β u2(1 2α 2β) C ϵuβ/α 1/(2α) + cβ(η + b)u log(1/u) + ϵoα+β (η + ub)v1 2α 2β where cβ = P j=1 j 2β (if β > 1 2) or cβ = 0 otherwise and where oα+β is the indicator function of α + β < 1 2. Furthermore, let A = A(u + iη) be the same sum with m 1. Then |A A| C|1 a| 1 ϵ uβ/α 1/(2α) + cβ + oα+βϵv1 2α 2β and moreover |A| C uβ/α 1/(2α) + cβ + oα+β v1 2α 2β Proof. We look to estimate the expression u iη + j 2αm, on the regime considered, where m = a ib. The dominant contribution of the sum will either come from j 2αa u or possibly, when 2β > 1, from small j. So the analysis will be done by separately considering windows around the transition window j 2αa u, and another analysis for large/small j. We use the notation for I {1, 2, . . . , v} that AI and BI are the restrictions of this sum to the range of j I. The transition window. We begin by setting j0 to be the integer which minimizes |j 2α 0 a u|. We can estimate this difference, noting that |j 2α 0 a u| max{|j 2α 0 (j0 1) 2α|a} C(α)j 2α 1 0 a C(α) u1+1/(2α). (83) We can estimate j 2α by Taylor approximation, giving j 2α = j 2α 0 2α(j j0)j 2α 1 0 + O((j j0)2j 2α 2 0 ). Now we divide j according to whether (j 2α 0 a j 2αa)2 M(η + j 2α 0 b)2 or if not, for a large M = M(ϵ) 1/ϵ2. Let I the largest possible symmetric interval of j around j0 that satisfies the above display. On this interval, we would like to justify that the Taylor approximation holds. For this, we shall require that M(η + j 2α 0 b)j2α 0 ϵ. Note under this condition |j j0|/j0 + O((j j0)2j 2 0 ) |j 2α 0 j 2α|j2α 0 M(η + j 2α 0 b)j2α 0 ϵ. Thus the largest difference of |j j0| on I is bounded above, up to constants, by M(η + ub)u 1 1/(2α). Hence the Taylor approximation is justified in that on I |j 2α j 2α 0 | = (2α)|j j0|u1+1/(2α)(1 + O(ϵ)), with the implied constants bounded in terms of |1 a| and c. It follows that for terms outside of I, we have (j 2α 0 a j 2αa)2 > c M(η + j 2α 0 b)2 for some absolute c (provided c(α) was picked sufficiently small). The contribution of I terms now follows the same path as was done in the first case: X (a ib)j 2α (u + iη) = X (a ib) (u + iη)(j2α 0 + (j j0)j2α 1 0 2α) + ξ1. The error terms ξ1 are bounded by |ξ1| j 2β 0 X |u + iη|j2α 2 0 |j j0|2 (b + j2α 0 η)2 M 3/2ub/α+1/αu 3 3/(2α)(η + ub)3 (b + j2α 0 η)2 M 3/2(η + ub)uβ/α 1 1/(2α). We then do a second replacement, freezing the j 2β in the numerator, and so we need to estimate j 2β j 2β 0 (a ib) (u + iη)(j2α 0 + (j j0)j2α 1 0 2α), which we do simply by |ξ2| j 2β 1 0 max j I |j j0|2 b + η/u M(η + ub)uβ/α 1 1/(2α). Thus with M 1/ϵ2 and using the second assumption of the lemma, we get |ξ1| + |ξ2| ϵuβ/α 1/(2α), where we have expressed (a ib)j 2α (u + iη) = X j 2β 0 z + w(j j0)+ξ1+ξ2 where ( z = a ib (u + iη)j2α 0 , w = (u + iη)j2α 1 0 2α The sum we can now evaluate using Lemma E.3. Note this makes z nearly i(b + η/u) and w nearly u1/(2α)(2α), and hence z/w is almost purely imaginary. Thus the error estimate in the Lemma applies and we have (using |(η + bu)u 1 1/(2α)| log(1/ϵ) and η < ϵu) X (a ib)j 2α (u + iη) iπ(u/a)β/α 1/(2α) Cϵuβ/α 1/(2α). The small j regime, imaginary part. Recall the terms of small j, which is to say those with j less than those in I, are denoted S. For these terms, we have j 2αa u c M(η + ub). For the real and imaginary parts of the sum we have AS + i BS = X u + j 2αa i(η + j 2αb) = X j 2α 2β( u + j 2αa + i(η + j 2αb)) ( u + j 2αa)2 + (η + j 2αb)2 . We shall focus on the imaginary part first. We introduce an approximation for this sum, coming from approximating the denominator by j 4αa2. Thus we introduce i B S def = 1 j S j2α 2β(i(η + j 2αb)). Let cβ be as in the statement of the Proposition. Then |i B S cβ(ib/a2)| j1+2α 2β 0 |η| + j1 2β 0 (b) ϵuβ/α 1/2α. We turn to estimating the difference of BS i B S. Using that (j 2αa)2 ( u + j 2αa)2 2u(j 2αa), we can estimate uj 2β(η + j 2αb) ( u + j 2αa)2 . To estimate these differences, we break these sums into scales. We let Sk to be those j for which Sk = (η + ub)2k 1 j 2αa u (η + ub)2k . Then we can estimate the number of terms in each of these k by |Sk| C(α)(η + ub)2k(u + (η + ub)2k) 1 1/(2α). For small k, i.e. those for which (η+ub)2k u, we can estimate |Sk| C(α)(η+ub)2ku 1 1/(2α). Call the small k terms S and the remainder S . Then for larger S terms, |Sk| C(α)((η + ub)2k) 1/(2α). For the difference of the imaginary parts on small k, we may bound j 2β as a multiple of j 2β 0 and so we arrive at |BS B S | (uβ/α+1) X |Sk|(η + ub) 22k(η + ub)2 (uβ/α 1/(2α)) X 1 2k ϵuβ/α 1/(2α). Then for the difference of the imaginary parts on large k |BS B S | X k u|Sk|(η((η + ub)2k)β/α + b((η + ub)2k)β/α+1) 22k(η + ub)2 . This we further estimate |BS B S | u X k η((η + ub)2k)β/α 2 1/(2α) + b((η + ub)2k)β/α 1 1/(2α). In the event that the exponents are non-negative, which can only occur when 2β > 1, we may lose a factor which is boundable by the largest k term (which is constant order) or by a logarithm in the case the exponent is 0. If either exponent is negative, the expression is dominated by its smallest k term, for which (η + ub)2k u. In all we have |BS B S| ϵuβ/α 1/(2α) + (η + ub)uβ/α 1 1/(2α) + cβ(η + b)u log(1/u). The large j regime, imaginary part. We break the sum into two parts L and L , those with j < 1.1j0 and those with j 1.1j0. For the terms in L we again break into scales, much like in the small j regime. We let Lk to be those j for which Lk = (η + ub)2k 1 u j 2αa (η + ub)2k . Then we can estimate the number of terms in each of these k by |Lk| C(α)(η + ub)2k(u) 1 1/(2α). Then for the imaginary part uβ/α+1|Lk|(η + ub) 22k(η + ub)2 uβ/α 1/(2α) This sum is always dominated by the smallest k, and so we have |BL | ϵuβ/α 1/(2α). As for larger j, we first remove a potentially divergent term, and so define j 2α 2β(η + ub) In the case that α + β < 1/2, we have that (comparing to an integral and using monotonicity) |B L (η + ub)V 1 2α 2β u2(1 2α 2β) | (η + ub) j 2α 2β 0 u2 + j1 2α 2β 0 (η + ub)u 1+β/α 1/(2α) ϵuβ/α 1/(2α). Otherwise, |B L | (η + ub)u 1+β/α 1/(2α) ϵuβ/α 1/(2α). As for comparing the this divergence with the sum, we have |BL B L | X j 2α 2β(j 2α(η/u + b)) ( u + j 2αa)2 X j 4α 2β((η + ub)) Then if 2α + β > 1/2, this leaves |BL | (η + ub)uβ/α 1 1/(2α) ϵuβ/α 1/(2α) or in the case 2α + β < 1/2 |BL | (η + ub)u 3v1 4α 2β (η + ub) v 2α/u u 2v1 2α 2β. The real part. For the real part, we shall prove a comparison with which we note is a special case of A with a ib = 1. The arguments are now very similar in all regimes to the imaginary parts, and so we just give a summary of the arguments. The main difference is for j j0. Note that using the previous bounds on the transition window, we may discard an interval of |j j0| Mηj1+2α 0 from A and incur an error of only ϵuβ/α 1/(2α). On a larger interval, J, given by those j with ηj1+2α 0 |j j0| ϵj0, by pairing j0 + r with j0 r, we can bound |AJ| + |AJ| ϵj1 2β 0 ϵuβ/α 1/(2α). Moreover, the difference we can bound by |AJ AJ| |1 a| X |(j 2αa u)(j 2α u)| |1 a| ϵ uβ/α 1/(2α). For small j, where we redefine S as those j smaller than those in J, we further divide to hose j with |j j0| j0/2 and those S which are further from j0. |AS AS| |1 a| ϵ j1 2β 0 + |1 a| X Hence we arrive at |AS AS| ϵuβ/α 1/(2α) + cβ|1 a|. For the large j terms, we redefine L as those j larger than those in J. Again dividing to those with |j j0| j0/2 and otherwise, we arrive at |AL AL| |1 a| ϵ j1 2β 0 + |1 a| X This, as in the large terms for the imaginary part, leads to |AL AL| |1 a| ϵ j1 2β 0 + |1 a|ϵuβ/α 1/(2α) + |1 a|oα+βϵv1 2α 2β Finally, we observe that A satisfies an estimate of the form |A| uβ/α 1/(2α) + cβ + oα+β v1 2α 2β which arise from the transitionary region, the small j region and the large j region. Proposition E.5. Assume α = 1 4 and α = 1 2. With z = u + iη(u), with η = (log(1/ϵ)/c) max u1+1/(2α), π 2α u1 1/(2α) there is a c > 0 and an ϵ0 so that for all ϵ (0, ϵ0) there is a cϵ > 0 so for all u [d 2α/cϵ, cϵ] (with A as in Proposition E.4) |m(z(u)) 1 + d 1A(u + iη) + i π 2αu 1/(2α)d 1| C(α)ϵu 1/(2α)d 1. Proof. We claim that m is approximately equal to m0 def = 1 A(u + iη) d iπu 1/(2α) where m is the solution of F(m(z), z) := m(z) + v d E Xm(z) Xm(z) z with Im m < 0. Hence the result boils down to checking: |F(m0, z) 1| Cϵu 1/(2α) and secondly that |1 m(F)| 1 2 in a neighborhood of m0, using Lemma E.2. For showing that |F(m0, z) 1| we first observe that on the contour selected, if α < 1 2 and ϵ0, cϵ is chosen sufficiently small v1 2α u2d (η + ub) ϵπu 1/(2α) Moreover the claimed estimates on 1 F now follow directly from Proposition E.4. For the stability, we have that j 2αz (j 2αm z)2 . Taking modulus, we have j 2α(u + η) (j 2αa u)2 + (j 2αb + η)2 =: X Now we break the estimation of the sum into regions of j, as in the proof of Proposition E.4. We let j0 be the integer which minimizes (j 2α 0 a u)2. We define S, L, I, J to be the sets where j < δj0, j > j0/δ, (j 2αa u)2 (ub + η)2, and the rest in J, we let XA be the restriction of the sum X to the set of indices A. For the terms in S, j2α(u + η) a2(1 O(δ)) u 1/2α with the final sum holding for all δ > 0 sufficiently small. For the terms in L, j 2α(u + η) u2(1 O(δ)) u 1/2α d δ1+2α + oα v1 2α d u u2 + η2 , where oα is the indicator of 2α < 1. For the terms in I we have j 2α(u + η) (ub + η)2 1 d u 1/(2α)(u + η) where we have used that the number of terms in this regions is on order of j2α+1 0 (ub + η). Now taking η a sufficiently large multiple of uβ, we conclude that the terms in XI 1 8. For the terms in J j 2α(u + η) (j 2αa u)2 C(δ)1 j1 0(u + η) j 2α 1 0 (r)2 ; here the range of r is such that at its smallest value j 2α 1 0 (r) (ub + η), and so we arrive at d u 1/(2α)(u + η) Hence picking δ sufficiently small that XS, XL are both less than 1 8, and subsequently increasing the lower bound on η/(ub) sufficiently far, we conclude that all four components can be made less than 1 8 and hence that |1 m F| 1 E.4 The large z region Proposition E.6. For any compact set U C of distance at least δ > 0 from [0, 1] and any α = 1/2 there is a C(α) such that |m(z) 1| C(α) δ min{d, d2α} and such that m(z) 1 + 1 C(α) δ min{d2, d4α} Furthermore, on the same set j 2αm (u + iη) j 2α (u + iη) C(α) δ min{d, d2α}. Proof. We apply Proposition E.2 part 2. We have v d E X = 1 j=1 j 2α 1 min{d, d2α}, and the result follows directly from Proposition E.2. We turn to evaluating the sum S(m, u + iη) def = j 2αm (u + iη). Then taking the partial derivative in m, m S(m, u + iη) = (j 2αm (u + iη))2 , which is uniformly bounded on U and on the set m so |m 1| < δ/2. It follows that on U |S(m, u + iη) S(1, u + iη)| 1 min{d, d2α}. For the second part, we start by observing that we can estimate v d E X X z 1 δ min{d, d2α}. We further have v d E X2 j=1 j 4α/δ2 1 δ2 min{d, d4α}. Hence, combining all these errors we conclude the claim. F Approximation of the forcing function We now apply the technical estimate to find good approximations for the function F. Recall F(r) def = 1 Γ+Γ0 R(z), (D1/2b) 2 (1 2γBz + γ2B(B + 1)z2)r dz, where R(z) = Diag j 2α z + j 2αm(z) : 1 j v and m(z) = 1 d Pv j=1 j 2α j 2αm(z) z . We decompose the forcing function into a sum of three functions F(r) = Fpp(r) + Fac(r) + F0(r) + errors, which will be introduced in the course of the approximation. d 1 log(1/ϵ) d 1 log(1/ϵ) Γcaps = ΓL + ΓR ΓC = Γ1 + Γ2 Γ = ΓC + Γcaps ΓL spectrum of ˆK Re(z) d 2α M d 2α 1 1 M d α 0 Figure 7: Contour of Γ + Γ0. This is used to estimate the m and derive expressions for the forcing function and kernel function. The important part of the contour is Γ0, which contains the point mass at 0 (blue) and ΓC (purple) which contains the bulk of the spectrum of deterministic equivalent of ˆK. There is a left spectral gap which occurs at d 2α. Moreover we have a change of behavior at d α in the contour to account for the change of behavior from pure point to absolutely continuous bulk part of the spectrum. The contour we select will come in three parts. The contour Γ0 is an arbitrarily small contour enclosing 0. The contour Γ will be in three parts which is symmetric under the reflection z 7 z. The main part will be ΓC parameterized by z = u + iη(u) with η(u) as in Proposition E.5 for u [u0, u1] where u0 = u0(d) = Cd 2α for some large C > 0 and u1 is a small positive constant. This is connected by two curves, one which is a smooth curve ΓL which is on scale d 2α and which is reflection symmetric, connects u0 + iη(u0) to its conjugates and crosses the imaginary axis on [0, cd 2α] (with c as in Proposition E.3). The other ΓR connects u1 + iη(u1) to its conjugate by a smooth curve which avoids an ϵ neighborhood of [0, 1]. For Γ0, using Proposition E.3, we have F0(r) def = 1 Γ0 R(z), (D1/2b) 2 (1 2γBz + γ2B(B + 1)z2)r dz. (84) This can be evaluated explicitly in terms of a residue at 0. Proposition F.1. The function F0(r) is constant and F0(0) 1 + j 2αd2ακ(v/d) Cd 2α+(2β 1)+ 1. Proof. From Proposition E.3, we can apply the residue formula. Evaluating the residue and bounding the sum produces the statement. The contours ΓR and ΓL both contribute error terms. Define the sum of the two as Fcaps def = 1 ΓR+ΓL R(z), (D1/2b) 2 (1 2γBz + γ2B(B + 1)z2)r dz. Proposition F.2. There are positive functions f(r) and g(r) satisfying f(r) C exp( cγBrd 2α) and g(r) C exp( cγBr) so that Fcaps(r) Cf(r)d 2α+(1 2β)+ + Cg(r). Furthermore, for any M > 1 we can choose u0 = Td 2α and u1 = 1/T with T sufficiently large that Fcaps(r) satisfies for γBr M and γBr Md2α and some other C > 0 Fcaps(r) f(Cr)d 2α+(1 2β)+/C + g(r)/C. Hence this will appear as essentially constant on the loss curves. When combined with F0 we have that F0(r) + Fcaps(r) is bounded above and below by constants times d 2α+(2β 1)+ 1. Proof. Both the contributions from ΓL and ΓR give exponentially decaying errors, albeit at much different scales and lead to the f and g terms respectively. For the ΓR terms, we simply bound, using Proposition E.6, |m(z) 1| d min{2α,1}. On the ΓR contour, having picked the contour sufficiently close to [0, 1] (independent of v, d), we have for some δ > 0 (1 2γBz + γ2B(B + 1)z2)r e 2γB Re zr(1 δ). Then we have R(z), (D1/2b) 2 (1 2γBz + γ2B(B + 1)z2)r dz d min{2α,1} ΓR e 2γB Re zr(1 δ) | dz| . Hence this decays exponentially. By construction of the ΓR contour, we can close the contour with an additional (nearly vertical) segment ΓV with real part u and height ϵu. Moreover this can be chosen to evenly divide two poles {j 2α}, by adding small horizontal segments. Then we can estimate on ΓV (essentially by Proposition E.4, with an extension for very small imaginary part when we split two poles) uβ/α 1/(2α) 1 . Then integrating over ΓV , we get (1 2γBz + γ2B(B + 1)z2)r dz ϵu1+β/α 1/(2α) 1 e 2γBu1r(1 δ). Having enclosed the poles, we can apply the residue formula, and we have (1 2γBz + γ2B(B + 1)z2)r dz j=1 j 2α 2β(1 2γBj 2α + γ2B(B + 1)j 4α)r, for some j0 with j0 u 1/(2α) 1 . Hence both contributions of ΓV and ΓR decay like g(r) for an appropriate choice of δ, C. For ΓL, we use similar arguments. We use Proposition E.3 to replace the summation by a dindependent quantity, which also requires rescaling the contour by d 2α. Then we have 1 2πi R(z), (D1/2b) 2 d1 2β Z a x 2β dx f(zd2α) zd2αx2α (1 2γBz + γ2B(B + 1)z2)r dz ΓLd2α e 2γBcd 2αr |dz| . Hence we are left with a dominant contribution of x 2β dx f(z) zx2α exp( 2γBrzd 2α) dz In the case that 2β > 1 we instead are left with exp( 2γBrzd 2α) dz. As the spectral support of f has a left edge, these decay exponentially. In either case, we can then deform the contour to run twice along the real axis and then vertically to the ends of the ΓL contour. The component along the vertical portion can be estimated by O(d(1 2β)+(u1 1/(2α) 0 /d)) exp( (2 δ)γBru0) (and using the boundedness of f, 1/f). This can be made to decay faster than the contribution from f. Finally, the dominant contributions arise from the contour ΓC. We define: Fpp(r) def = 1 0 u(2β 1)/(2α) exp( 2γBru) du, Fac(r) def = cβ d 2α u 1/(2α)d 1 exp( 2γBru) du, where cβ is as in Proposition E.5. Then this gives us the principal contribution to the limit: Proposition F.3. Set for r 0 FC(r) def = 1 ΓC R(z), (D1/2b) 2 (1 2γBz + γ2B(B + 1)z2)r dz. Then FC(r) is real-valued and satisfies for some constant C independent of u1, u0, α, β |FC(r)| C(Fpp(r) + Fac(r)). Moreover, there is an M = M(u0, u1) > 0 and a positive bounded function C(r) so that if γBr [M, d2α/M] then 1 C(r)(Fpp(r) + Fac(r)) FC(r) C(r)(Fpp(r) + Fac(r)) Furthermore, for any ϵ > 0 there is a M(ϵ, u0, u1) large enough that C(r) 1 + ϵ for γBr [M, d2α/M]. Proof. These follow in a similar way to the earlier Propositions, and so we do not enter the details. Instead, we give a brief overview, using the estimates given in Proposition E.5 and Proposition E.4. Along FC, we can approximate m uniformly by m(z(u)) 1 (c(u) + i) π 2αu 1/(2α)d 1 ϵu 1/(2α)d 1, where c is real-valued and bounded and M = M(ϵ). Hence using Proposition E.4, u iη + j 2αm(z(u)) = (1 + O(ϵ))A(u) + i(1 + O(ϵ))πuβ/α 1/(2α) + i(1 + O(ϵ))cβ π 2αu 1/(2α)d 1 for real valued A and cβ = P j=1 j 2β if β > 1/2 or 0 otherwise as in Proposition E.4. Integrating each of these imaginary terms over ΓC produces Fpp and Fac respectively. The real part is negligible, as the contour is close to the real axis (in particular the imaginary part of the contour is smaller than the real part by a factor of ϵ). Combining all of these propositions, we have the following conclusion Corollary F.1. For any α, β with α, β = 1/2 and α + β > 1 2 there is a function C(r) bounded above for all r so that 1 C(r)(Fpp(r) + Fac(r) + F0(r)) F(r) C(r)(Fpp(r) + Fac(r) + F0(r)), Moreover, for any ϵ > 0 there is a M(ϵ) large enough that C(r) 1 + ϵ for γBr [M, d2α/M] and for γBr > Md2α. Proof. This follows directly from Proposition F.3, F.2 and F.1, and needs that the F curve is monotone to fill the gaps on which the approximations are made. ( There are potentially two windows on which the various approximations do not overlap: when γBr is a large constant and when it is on order d2α) G Estimation of Kernel function We can now give the approximation of the kernel function, which is represented by K(r) def = γ2B Γ+Γ0 z2Tr(R(z))(1 2γBz + γ2B(B + 1)z2)r dz, with the same contours that were used for the forcing function. Using Lemma E.1, we therefore can represent the kernel function as K(r) def = γ2B Γ+Γ0 z( v + (1 m(z))d)(1 2γBz + γ2B(B + 1)z2)r dz. By the residue theorem, the contribution from Γ0 disappears, as does the V z term. Hence we are left with the representations K(r) def = γ2Bd Γ z(1 m(z))(1 2γBz + γ2B(B + 1)z2)r dz. 4, the dominant contribution comes once more from the contour ΓC for which we get Kpp(r) def = γ2B 0 u1 1/(2α) exp( 2γBur) du. (86) It now follows swiftly from the estimates on m: Proposition G.1. Suppose α > 1 4. There is a positive function C(r) so that 1 C(r)Kpp(r) K(r) C(r)Kpp(r), and C(r) is bounded independent of d by a function of M for all rγB < d2αM. Moreover for any ϵ > 0 there is an M sufficiently large so that for rγB [M, d2α/M], C(r) < 1 + ϵ. Proof. By reflection symmetry, the real part of contour integral for K(r) vanishes. Also, for a given contour ΓA, we will define KA(r) = γ2B ΓA z(1 m(z))d(1 2γBz + γ2B(B + 1)z2)r dz, and we will estimate each piece of Γ = ΓR + ΓC + ΓL separately. We begin with the contributions from ΓR. Using Proposition E.6, we have that on ΓR (1 2γBz + γ2B(B + 1)z2)r dz d min{1,4α 1}γ2B exp( δγBr). Following the same steps as in the proof of Proposition F.2, we can extend the ΓR contour by a straight line ΓV to enclose some residues, which leads to (1 2γBz + γ2B(B + 1)z2)r dz j=1 j 4α(1 2γBj 2α + γ2B(B + 1)j 4α)r, with j0 u 1/(2α) 1 . Moreover, the contribution of the ΓV contour can be estimated (for some δ > 0 which can be made small by increasing u1) by O(u2 1/(2α) 1 exp( 2γBu1r(1 δ))). Meanwhile making a Riemann sum approximation (and changing variables by j 2α = u) j=1 j 4α(1 2γBj 2α + γ2B(B + 1)j 4α)r 1 u1 u1 1/(2α) exp( 2γBru) du for 2γBr < 1 u1 . Hence by taking u1 sufficiently small, we conclude that for 2γBr < 1 u1 , u1 u1 1/(2α) exp( 2γBru) du, and also that |KR(r)| e 2γBru1(1 δ) for larger (2γBr). The contributions from ΓC give, in a similar way for 2γBr > 1 u1 and 2γBr < 1 u1 KC(r) γ2B Z u1 u0 u1 1/(2α) exp( 2γBru) du Moreover for any ϵ > 0 there is an M > 0 sufficiently large that when (2γBr) is in [M, d2α/M], 1 ϵ < γ2B 2αKC(r) u0 u1 1/(2α) exp( 2γBru) du < 1 + ϵ, by first choosing ϵ > 0, then choosing the contour as in Proposition E.5 sufficiently far, and then possibly shrinking [u0, u1]. For larger r, it further satisfies an estimate that for some δ > 0, which can be made smaller by increasing u0, KL(r) e 2γBru0(1 δ). Finally, the contributions from ΓL, we have after changing variables KL(r) = γ2B π d1 4α Im I ΓLd2α z(1 m(zd2α))(1 2γBzd 2α + γ2B(B + 1)z2d 4α)r dz. This can be compared to the same expression with m(zd2α) f(z) and replacing (1 2γBzd 2α + γ2B(B + 1)z2d 4α)r exp( 2γBr(Re z)d 2α). This gives for 2γBr d2α KL(r) γ2Bd1 4α Z u0d2α 0 uf(u) exp( 2γBrud 2α) du, where f(u) = 1 π limϵ 0 f(u + iϵ) is the spectral density corresponding to f. Hence it follows that for 2γBr d2α, KL(r) γ2B Z u0 d 2α u1 1/(2α) exp( 2γBru) du. Remark G.1. In contrast, when α < 1 4, the dominant contribution to K is from KL, so that K(r) γ2Bd1 4α Z 0 uf(u) exp( 2γBrud 2α) du, and moreover the density f(u) u 1/(2α) so that the integral is convergent (which is implicit in Proposition E.5) We conclude with noting that for ther norm of K we can directly evaluate it using a contour integral. Summing the contour integral expression r=0 K(r) = γ2Bd z(1 m(z)) 2γBz γ2B(B + 1)z2 dz = γd (1 m(z)) 1 1 2γ(B + 1)z dz. We additionally can more generally evaluate a partial norm s=r K(s) = γd (1 m(z)) 1 1 2γ(B + 1)z (1 2γBz + γ2B(B + 1)z2)r dz. Combining this with Proposition E.6, this leads directly to tight estimates for the kernel norm. Corollary G.1. When 2α > 1 and γ(B + 1) < 2, 2j 2αγ(B + 1)(1 + o(1)) When 2α < 1 (and recalling that we take γ on the order d2α 1 in this case), 1 2α(1 + o(1)). Furthermore, for any ϵ > 0 there is an M > 0 so that if γBr > M then s=r K(s) < ϵ. Proof. For the first case with 2α > 1, Proposition E.6 gives on ΓR j 2α z + o(1/d) Using this, and completing the ΓR contour via a vertical line, we get a residue contribution which matches the claim, up to some number of terms j0 (which can be made as large as desired). Proposition E.5 and E.3 can be used to control the parts of the contour near 0 and in the middle. For the second case 2α < 1, since γ is small, we may deform the contour to be at a fixed distance from [0, 1], and then once more we can use Proposition E.6 which gives in 1 step that 2j 2αγ(B + 1) + O(γd1 4α). For the final statement, under the conditions given on r |(1 2γBz + γ2B(B + 1)z2)r| < ϵ uniformly over the contours, and the estimate follows directly. From here, we can derive the sub-exponential property of K. Proposition G.2. Suppose α > 1 4. For any ϵ > 0, there is an M sufficiently large so that for γBr [M, d2α/M] r X s=0 K(s)K(r s) (2 + ϵ) K K(r) Proof. We note that in the range of r given, we can conclude that for any δ, ϵ > 0, by increasing M rδ K(s) < ϵ K , furthermore that for s > r/2 (1 ϵ)Kpp(s) < K(s) < (1 + ϵ)Kpp(s), and that finally for s > r/2 (1 ϵ)Kpp(s) < γ2B 2α Γ(2 (1/(2α)))(2γBs) 2+(1/(2α)) < (1 + ϵ)Kpp(s), where the final estimate follows by estimating Z 1 0 u1 1/(2α) exp( 2γBur) du Z 0 u1 1/(2α) exp( 2γBur) du, once γBr > M, and moreover their ratio tends to 1 as M . With these estimates in place, we can break the estimate up as s=0 K(s)K(r s) < 2 s=0 K(s)K(r s) + s=rδ K(s)K(r s). The final sum is bounded by rδ K(s)K(r s) ϵ(1 + ϵ) K Kpp(r/2). Meanwhile, for the first sum, s=0 K(s)K(r s) 2(1 + O(ϵ))(1 + O(δ)) K Kpp(r), where we have used that Kpp(r(1 δ)) 1+ϵ 1 ϵ(1 + O(δ))Kpp(r). H Asymptotics of forcing function and kernel function With this, we now analyze the asymptotics of each of these terms individually. These asymptotics often rely on a result about how close a Riemann sum is to its integral. We state below the main result of this nature that we used: Proposition H.1 (Trapezoidal Rule, [15]). If f is continuous, then for each integer n > 0, the integral of f on [a, b] is approximated by Tn(f) def = b a 2n f(x0) + 2f(x1) + . . . + 2f(xn 1) + f(xn) where xi = a + i(b a)/n, 0 i n. Define the error in the trapezoid rule ET n (f) def = If f has an integrable first derivative as an improper integral, thens ET n (f) b a a |f (t)| dt. H.1 Pure point forcing term, Fpp(r) In this section, we prove an asymptotic for the pure point forcing term, see (85), 0 u(2β 1)/(2α) exp( 2γBru) du. Proposition H.2 (Pure point forcing term). Suppose 2α+2β > 1. For any ϵ > 0, there is an M > 0 so that for γBr M, |Fpp(r) g(r)| ϵ g(r) where g(r) def = (2α) 1(2γB)1/(2α) β/α 1 Γ β α 1 2α + 1 r (1+β/α)+1/(2α). Furthermore, for any M > 0, there exists some constants C, C, c > 0 independent of d so that c Fpp(r) C if γBr < M, and if r > Md2α, Fpp(r) C F0(r). Proof. First, a simple computation shows that g(r) = (2α) 1(2γBr) (1+β/α)+1/(2α) Z 0 w(2β 1)/(2α) exp( w) dw. Let ρ = (1 + β/α) + 1/(2α). A simple computation, using the change of variables w = 2γBru, yields Fpp(r) = (2α) 1(2γBr)ρ Z 2γBr 0 w(2β 1)/(2α) exp( w) dv. Then we have that |Fpp(r) g(r)| (2α) 1(2γBr)ρ Z 2γBr w(2β 1)/(2α) exp( w) dw (2α) 1(2γBr)ρ Z 2M w(2β 1)/(2α) exp( w) dw. Since R 0 w(2β 1)/(2α) exp( w) dw, there exists a M large so that Z 2M w(2β 1)/(2α) exp( w) dw < ϵ. Thus the first result is shown. If γBr < M, then Fpp(r) (2α) 1 Z 1 0 u(2β 1)/(2α) du C. Moreover, we have that exp( 2γBru) exp( 2 M). Therefore, we get that Fpp(r) exp( 2 M) 2α 0 u(2β 1)/(2α) du = c. Now suppose γBr > Md2α. By the previous part, we know that Fpp(r) (1 + ϵ)g(r). Moreover, we see that g is decreasing. As a result, we see that up to constants g(r) g( Md2α) = C d 2α 2β+1 C F0(r). for some constants C, C > 0. Hence the result is shown. H.2 Model misspecification, point mass at 0, F0(r) Recall, from Proposition F.1, the forcing function point mass at 0, satisfies 1 + j 2αd2ακ(v/d) 1 + O(d 1) where κ(v/d) solves 1 = Z v/d κ κ + u2α du. (87) In this section, we provide an asymptotic for F0(r) (see Proposition H.3) which represents the limiting value the loss obtains as r . Unlike the pure point process above, this asymptotic depends on whether 2β > 1. We begin by showing that the κ defined implicitly in (87) is uniquely determined and dimensionless. Lemma H.1. Suppose v and d are admissible such that the ratio v d > 1. Then the equation κ κ + u2α du has a unique solution κ such that 0 < κ < . Proof. Let w def = κ and F(w) def = R v/d 0 1 w+u2α du and set G(w) = w F(w). To solve the fixed point equation, we want to find G(w) = 1. First, it is clear that limw 0 G(w) = 0. Second, we see that as limw G(w) = v d > 1. As G(w) is continuous, it follows that there exists a solution κ to G(κ) = 1. To show that κ is unique, amounts to showing that G(w) is strictly increasing for w 0. First, we see that G(w) = Z v/d w + u2α u2α w + u2α du = Z v/d w + u2α du. We note that w 7 u2α w+u2α is strictly decreasing in w. So w 7 1 u2α w+u2α is strictly increasing in w. Hence G(v) is strictly increasing and there is a unique solution to G(κ) = 1. Now we give an asymptotic for F0. Proposition H.3 (Asymptotic for F0). Suppose v and d are admissible such that the ratio v/d > 1 and suppose 2α + 2β > 1. Let 0 < κ(v/d) < be the unique solution to κ κ + u2α du. κ Pv j=1 j 2β , if 2β > 1 d1 2(α+β) R v/d 0 u 2β κ+u2α du, if 2β < 1. Proof. We consider 2 cases. Let κ = κ(v/d). Case 1: Suppose 2β > 1: Let C def = Pv j=1 j 2β, which is finite as 2β > 1. Consider the following E1(r) def = Pv j=1 j 2(α+β) j 2ακd2α+1 d 2α κ Pv j=1 j 2β Pv j=1 j 2β To handle the large j values, we see that there exists a j0 large so that Pv j=j0 j 2β j j0 j 2β < ϵ, where we used that j 2ακd2α + 1 > 1. For the small j, we use that d can be large. Hence, j 2ακd2α + 1 j 2α 0 κd2α + 1 j0 j 2α 0 κd2α + 1. For sufficiently large d, we can make the right-hand-side small. Therefore, E1(r) is small for sufficiently large d and hence, the result holds. Case 2: Suppose 2β < 1: To show this case, we define the following errors E21(r) def = Pv j=1 j 2(α+β) j 2ακd2α+1 d1 2(α+β) R v/d 1/d u 2β du d1 2(α+β) R v/d 0 u 2β du E22(r) def = d1 2(α+β) R v/d 1/d u 2β du κ+u2α d1 2(α+β) R v/d 0 u 2β du d1 2(α+β) R v/d 0 u 2β du It is clear, for sufficiently large d, E22(r) is small. For the first error term, we use a Riemann sum approximation, that is, j 2ακd2α + 1 = d1 2(α+β) 1 (j/d) 2(α+β) (j/d) 2ακ + 1. Letting a = 1/d, b = v/d, n = v 1, xj = 1/d + j/d, and f(x) = x 2(α+β) x 2ακ+1, we can approximate the summation with an integral. Using Prop. H.1, 1 d R v/d 1/d |f (x)| dx R v/d 0 u 2β du One can check that R v/d 1/d |f (x)| dx R v/d 0 u 2β du κ+u2α < C where C is independent of d. For sufficiently large d, 1 d R v/d 1/d |f (x)| dx R v/d 0 u 2β du κ+u2α < C 1 Hence, Case 2 is shown. H.3 Absolutely continuous forcing function, Fac(r) We now turn to the absolutely continuous forcing function, defined as Fac(r) = cβ d 2α u 1/(2α)d 1 exp( 2γBru) du, where cβ = Pv j=1 j 2β if 2β > 1 and 0 otherwise. From this, we derive a simple asymptotic formula. Proposition H.4. There exists a constant C(α, β) > 0 such that Fac(r) C F0(r), if 2β > 1, 2α < 1 0, if 2β < 1. (88) Suppose now 2α > 1 and 2β > 1. For any ϵ > 0, there is an M > 0 so that for γBr [M, d2α/M], |Fac(r) g(r)| ϵ g(r) where g(r) def = v X j=1 j 2β (2γB) 1+1/(2α)(2α) 1Γ 1 1 2α r 1+1/(2α) d 1. (89) Furthermore, for any M > 0, these exists some constants C, c > 0 independent of d so that Fac(r) C d 1, if γBr M c F0(r), if γBr Md2α. Proof. We proceed by cases. The case 2β < 1 is immediate as cβ is only non-zero for 2β > 1. Case: 2β > 1 and 2α < 1. In this case, we just bound directly bound Fac(r). Dropping the exponential, we get d 2α u 1/(2α)d 1 du = cβ 2αd 1 d1 2α d1/2 α 2α 1) d 2α. We know that F0(r) d 2α+max{0,1 2β} and thus the result is shown. Next we show (89). Case: 2β > 1 and 2α > 1. First, we make the following observation. The integral is 0 u 1/(2α)d 1 exp( 2γBru) du = g(r). Define C = cβ 2α. Let us consider the following E def = C Z d α d 2α e 2γBuru 1/(2α)d 1 du C Z 0 e 2γBuru 1/(2α)d 1 du . First, we see 0 e 2γBuru 1/(2α)d 1 du 1 e 2γBuru 1/(2α)d 1 du | {z } E2 Case: E1. Suppose γBr 1/Md2α. Here we can just use directly the u and disregard the exponential: Z d 2α 0 e 2γBuru 1/(2α)d 1 du Z d 2α 0 u 1/(2α)d 1 du = c d 2α. for some constant c. Now we have that d 1(γBr) 1+1/(2α) = d 2α+1(γBr)1 1/(2α) d 2α+1(1/M)1 1/(2α)d2α 1 = M 1+1/(2α) By choosing M large, this can be made small. Case: E2. Suppose γBr M. Let us consider 1 d 1e 2γBur du d 1(γBr) 1 exp( 2γBr). It follows that d 1(γBr) 1 exp( 2γBr) d 1(γBr) 1+1/(2α) = (γBr) 1/(2α) exp( 2γBrd α) exp( 2M)(M) 1/(2α) = exp( 2M)M 1/(2α). Therefore, by choosing M large, we have that this can be small. This proves (89). To finish the proof, let us first suppose that γBr M. Then we have that d 2α u 1/(2α)d 1 du d 1. When γBr Md2α, we have that d 2α exp( 2γBru) du d 2α exp( 2γB Md2αd 2α) F0(r). H.4 Kernel function asymptotic. We recall the term Kpp defined as Kpp(r) = γ2B 0 u1 1/(2α) exp( 2γBur) du We now give an asymptotic for such a function. Proposition H.5 (Kpp asymptotic). Suppose α > 1/4. For any ϵ > 0, there is an M > 0 so that for γBr M, |Kpp(r) g(r)| ϵ g(r) (90) where g(r) def = (2α) 1γ2B(2γB) 2+1/(2α) Γ 2 1 2α r 2+1/(2α). Moreover, for any M > 0, there exists constants c, C, ˆC > 0, such that when 2α > 1, c Kpp(r) C, if γBr M and when 2α < 1, Kpp(r) ˆC d2α 1, if γBr M. Furthermore, for any M > 0, there exist a constant C > 0, such that Kpp(r) C F0(r), if γBr Md2α. Proof. The first part of the argument, (90), follows immediately from the proof of Fpp, Prop. H.2. If 2α > 1, then we always have γ2B is constant order. Therefore, using the same argument as Fpp (see Prop. H.2), we get that there exists constant c, C > 0 such that c Kpp(r) C for γBr M. If 2α < 1, then γ d2α 1. Therefore, for γBr M, we have that 0 u1 1/(2α) exp( 2γBur) du d2α 1 γB 0 u1 1/(2α) exp( 2γBur) du d2α 1C. The later inequality follows using the same bounding argument as Fpp(r). As γ2B C, then the same argument in Fpp(r) shows for γBr Md2α that Kpp(r) CF0(r). We now turn to the last quantity that appears in the Volterra equation. Proposition H.6 (Forcing function norm). Provided 2β > 1, we have that Md2α/(γB) X s=0 F(s) 1 γB , (91) for some constant M > 0. Next, suppose 2β < 1. Then there exists an M, M > 0 such that Md2α/(γB) X s= M/(γB) F(s) F(r) for all M γBr Md2α, (92) and it follows for all γBr Md2α, Md2α/(γB) X s=0 F(s) F(r) + 1 γB K(r). (93) Proof. Suppose 2β > 1. First note that in this region PMd2α/(γB) s=0 F0(r) 1 γB for any fixed M > 0, Proposition H.3. Next, let us consider the pure point part of F, i.e., Fpp. Choose M and M so that Fpp is behaving like the asymptotic in Proposition H.2, i.e., for γBr M, Fpp(r) (γBr) (1+β/α)+1/(2α). and for γBr M, Fpp(r) C for some constant C > 0 independent of d. It follows then that r=0 Fpp(r) 1 γB . To handle the rest of the sum, we see that Md2α/(γB) X r= M/(γB) Fpp(r) Md2α/(γB) X r= M/(γB) (γBr) (1+β/α)+1/(2α) 1 γB r= M r (1+β/α)+1/(2α) = (d2α) β/α+1/(2α) d2α (1+β/α)+1/(2α) (d2α) β/α+1/(2α)M M/d2α x (1+β/α)+1/(2α) dx = (d2α) β/α+1/(2α)M2α γB x β/α+1/(2α) Here we use that the Riemann sum approximation with a = M d2α , b = M + M d2α , n = Md2α and f(x) = x (1+β/α)+1/(2α). Using a similar argument for Fac(r) (Proposition H.4)when 2α > 1 (otherwise we do not need to worry about Fac), we have that Md2α/(γB) X r= M/(γB) Fac(r) d 1 r= M s 1+1/(2α) = M γB d 2α Md2α X 0 x 1+1/(2α) dx 1 γB . When r M/(γB), we have that Fac(r) d 1. Hence, P M/(γB) r=0 Fac(r) 1 γB . The first result, (91), then follows from Corollary F.1. Consider 2β < 1 and 2α < 1. We do not need to worry about Fac in this region because it does not exist in this region. Choose M and M so that both Kpp and Fpp are in their asymptotic regions and, using Proposition G.1, K(r) Kpp(r). Using a similar argument as above, we estimate the summation of Fpp as an integral. For any r [ M/(γB), Md2α/(γB)] s= M/(γB) Fpp(s) (d2α) β/α+1/(2α)M2α γB x β/α+1/(2α) rγB/d2α+ M/d2α 1 γB (γBr) β/α+1/(2α). Using the asymptotic for Kpp (Proposition H.5), s= M/(γB) Fpp(s) γ (γBr) β/α+1/(2α) (γBr) 2+1/(2α). We will show that this is less than Fpp(r). Using the asymptotic for Fpp(r) (Proposition H.2), let us suppose γ (γBr) β/α+1/(2α) (γBr) 2+1/(2α) (γBr) 1 β/α+1/(2α) γ (γBr)1 1/(2α). In this region, the learning rate is γ d2α 1. Thus, we see that d2α 1 (γBr)(2α 1)/(2α) d(2α 1) 2α 2α 1 (γBr) 2α 1 This is true and so we have that s= M/γB Fpp(r) Fpp(r), for all r [ M/(γB), Md2α/(γB)]. For F0, with r [ M/(γB), Md2α/(γB)] s= M/(γB) F0 (γBr) d1 2β 2α 1 Therefore, we get that s= M/(γB) F0(r) (γBr) d1 2β 2α γ (γBr) 2+1/(2α) d 2β(γBr) 1+1/(2α). We will show that this is less than Fpp. For this, we see d 2β(γBr) 1+1/(2α) (γr B) β/α 1+1/(2α) d 2β (γBr) β/α (γBr)β/α d2β Hence, we have that s= M/(γB) F0 Fpp(r), for all r [ M/(γB), Md2α/(γB)]. Since there is no Fac in this region, we immediately get from Corollary F.1 s= M/(γB) F(r) Fpp(r), for all r [ M/(γB), Md2α/(γB)]. For s [0, M/(γB)], we have that F(s) C. Thus we immediately get that s=0 F(s) K(r) 1 for all r. This proves the result for 2β < 1 and 2α < 1. Consider 2β < 1 and 2α > 1. As in the previous case, we do not need to consider Fac as it does not exist here. The proof will be similar to the previous case. Choose M and M so that both Kpp and Fpp are in their asymptotic regions and, using Proposition G.1, K(r) Kpp(r). First, by the same argument as in 2β < 1 and 2α > 1, we immediately have for s [0, M/(γB)], we have that F(s) C, s=0 F(s) K(r) 1 As before, we have for any r [ M/(γB), Md2α/(γB)] s= M/(γB) Fpp(s) γ (γBr) β/α+1/(2α) (γBr) 2+1/(2α). Note here that γ is constant. We will show that this is less than Fpp(r). Using the asymptotic for Fpp(r) (Proposition H.2), let us suppose (γBr) β/α+1/(2α) (γBr) 2+1/(2α) (γBr) 1 β/α+1/(2α) 0 (γBr)1 1/(2α). Hence, we have that s= M/γB Fpp(r) Fpp(r), for all r [ M/(γB), Md2α/(γB)]. For F0, with r [ M/(γB), Md2α/(γB)] s= M/(γB) F0 (γBr) d1 2β 2α 1 Therefore, we get that s= M/(γB) F0(r) (γBr) d1 2β 2α (γBr) 2+1/(2α) d 2α+1 2β(γBr) 1+1/(2α). We will show that this is less than Fpp. For this, we see d 2α+1 2β(γBr) 1+1/(2α) (γr B) β/α 1+1/(2α) d 2α+1 2β (γBr) β/α (γBr)β/α d2β+2α 1. Now we see that (γBr)β/α d2β d2β+2α 1. Hence, we have that s= M/(γB) F0 Fpp(r), for all r [ M/(γB), Md2α/(γB)]. The result is thus shown in this case. I Optimizing over batch and learning rate The previous sections use batch size B = 1 and the maximal learning rate allowed. In this section, we consider optimizing compute-optimal curves with respect to batch size and learning rate, i.e., find d , γ , B such that (d , γ , B ) arg mind,γ,B arg min P( f d B , d, γ) s.t. γB < 1 and Kpp < 1. (94) I.1 Optimal batch size We see that batch essentially scales out of the problem and therefore the batch has no effect on the compute-optimal curves. To see this, we observe from Table 5 that Fpp(r) (γBr) (1+β/α)+1/(2α) Fpp( f d ) (1+β/α)+1/(2α) Fac(r) (γBr) 1+1/(2α) d 1 Fac( f d ) 1+1/(2α) d 1 1 γB Kpp(r) γ(γBr) 2+1/(2α) 1 γB Kpp( f d B ) γ( γf d ) 2+1/(2α). It immediately follows that the batch size has no effect on the compute-optimal curves. Therefore any batch size (e.g., B = 1) that satisfies the necessary and sufficient condition for convergence (Prop. C.2), that is, γ(B + 1) < 2, will yield the same compute-optimal curves. This is not necessarily true for the learning rate as we will see in the next section. I.2 Optimal learning rate Without loss of generality, we let the batch size B = 1. From the expressions in (95), we see that Fpp and Fac are monotonically decreasing in learning rate γ. Moreover in Phase III, 1 γB Kpp, is also monotonically decreasing in the learning rate. Therefore, in Phases I, II, and III, the optimal learning rate choice is to choose γ maximally. In the cases of Phase Ia, II, and III, this would mean γ constant (see Prop C.2) and in Phase Ib, Ic, γ d2α 1. It remains to understand the effect of the learning rate in Phase IV. From Proposition C.6, we know that the loss is given by d, d, γ) F0( f d) + Fpp( f d) + 1 γB Kpp( f d) d 2α + ( γf d )ρ + γ( γf d ) 2+1/(2α), where ρ def = 1 2α β α 1. By taking derivatives, we see that d )α/β 1 and d (f) fρ/(ρ 2β). We need to check that γ is feasible, i.e., γ < 1 (which it is) and γ < d2α 1. For the later, a simple check shows that γ d 4α2 4αβ 2α+2β 1 < d2α 1 when α > 1/4 and 2β > 1, i.e., precisely Phase IV. The compute-optimal curve in Phase IV with optimal stepsize is the following. Proposition I.1 (Phase IV, optimal γ, compute-optimal curve). Suppose 1/4 < α < 1/2 and 2β > 1. Then 4α(α β) 4αβ+2α+2β 1 , d (f) f 2α+2β 1 4αβ+2α+2β 1 , and P (f) f 2α(2α+2β 1) 4αβ+2α+2β 1 . The trade off occurring where 1 γB Kpp = F0. We note that there is only one Phase IV (and no sub-phases). J Experimental Results To measure the exponents of the scaling law and parameter count, we follow approach12 1 and 2 from [24]. We explain the method below using (α, β) = (0.5, 0.7) as an example. The theoretical prediction of the scaling law and parameter count exponents for this example are η = 0.5 and ξ = 0.5, resp. (see Table 2). We then repeat this procedure for total of 32 pairs of (α, β) in the phase diagram; see Fig. 15 and Fig. 16. The theoretical predictions of these two exponents are shown in the heatmaps Fig. 8. First, we run SGD for parameter counts d [200, 300, 400, 600, 800, 1200, 1600, 2400, 3200, 4800, 6400, 9600, 12800]. The SGD learning curves for (α, β) = (0.5, 0.7) with parameters d [800, 1600, 3200, 6400, 12800] are shown in Fig. 9a. 12We did not use approach 3 in [24], which is more subtle than the other two; see [7]. (a) Parameter Count Exponents (ξ) (b) Scaling Law Exponents (η) Figure 8: Theoretical predictions of parameter count and scaling law exponents. J.1 Measuring the Scaling Law Exponent We follow Sec 3.1 in [24]. First, we choose an Iso FLOP window [fmin, fmax] and construct fj s using a geometric spacing between f1 = fmin and fn = fmax. For each Iso FLOP slice, fj, (e.g., fj = 2e7 is the vertical line in Fig. 9a), we find the minimum loss across all d. We denote this minimum value by P (fj) and the associated optimal parameter by d (fj). As an example, in Fig. 9a, P (fj) = 1.6e 3 and the associated optimal parameter d (fj) = 6400. We obtain the compute-optional frontier (highlighted in red in Fig. 9b) by plotting [(fj, P (fj)]1 j n, (96) and the optimal parameter count [(fj, d (fj)]1 j n. (97) We then fit a power-law curve P (f) = a f ˆη to predict the relationship between the compute f and optimal loss P . This is shown as the dashed line in Fig. 9b. For (α, β) = (0.5, 0.7), this gives P (f) = 22.61 f 0.515, whereas our theoretical result predicts P (f) f 0.5 . J.2 Measuring Parameter Count Exponent: Approach 0 One benefit of our theoretical framework is that the solution of the Volterra equation (eq. 10) is deterministic. As such, precise numerical evaluation can determine the instantaneous slope of the compute-optimal curves using a new approach that is not necessarily feasible when dealing with noisy (a) Iso FLOP (b) Compute-Optimal Frontier Figure 9: Measuring the scaling law exponent for (α, β) = (0.5, 0.7). (a) Iso FLOP Window [1e6, 1e8] (b) Iso FLOP Window [2e6, 0.5e8] Figure 10: 2 different Iso FLOP windows for measuring the parameter count exponent with Approach 1 for (α, β) = (0.5, 0.7). SGD curves. Specifically, we search for the unique tangent line that intersects the loss-versus-flops curves for two adjacent values of d, i.e. we numerically solve the following system for f1 and f2: P 1(f1) = P 2(f2) = P2(f2) P1(f1) f2 f1 , (98) where P1 is the loss curve for d = d1 and P2 is the loss curve for d = d2. When d1 and d2 are close, we obtain an accurate estimate of the parameter count exponent by measuring the discrete logarithmic derivative, (log(d2) log(d1))/(log(f 2 ) log(f 1 )). J.3 Measuring Parameter Count Exponent: Approach 1 To predict the optimal parameter count exponent, we fit the function d = a fb, a, b constants, to the measurements in (97) (see e.g., Fig. 10a). For the example (α, β) = (0.5, 0.7) (Fig. 10a), this approach gives d = .25 f0.551. (99) Note that the fit of the exponent is very sensitive to the choice of Iso FLOP window. When we change the window from [1e6, 1e8] to [2e6, 0.5e8], the parameter count exponent changes from 0.51 to 0.58, as shown in Fig.10b. The theoretical prediction of this exponent is 0.5. J.4 Measuring Parameter Count Exponent: Approach 2 For each Iso FLOP slice, fj, we obtain a set of training loss values depending on d, {P(fj, di)}1 i m. In our running example, d1 = 200 and dm = 12800 (see Fig. 9a). We then fit a parabola (quadratic (a) Iso FLOP Quadratic fit (b) Approach 2 Figure 11: Measuring parameter count exponent with Approach 2 for (α, β) = (0.5, 0.7). (a) Iso FLOP Window [1e6, 5e8] (b) Iso FLOP Window [1e6, 5e8] (c) Iso FLOP Window [1e6, 5e8] (d) Iso FLOP Window [1e6, 1e8] (e) Iso FLOP Window [1e6, 1e8] (f) Iso FLOP Window [1e6, 1e8] Figure 12: The empirical measurements of the scaling law exponent and parameter count exponent are sensitive to the choice of the Iso FLOP windows. Top: larger Iso FLOP window [1e6, 5e8] Bottom: smaller Iso FLOP window [1e6, 1e8]. function) to {(log P(fj, di), log di)}1 i m, i.e., we find (a, b, c) such that log P(fj, di) = a log2 di + b log di + c. This is shown in Fig. 11a. After solving for (a, b, c), we find the d (fj) that minimizes a2 log2 d + b log d + c. Repeating this procedure for all fj s gives a set of pairs {(fj, d j)}1 j n. In the final step, we power-law fit this set. For the example (α, β) = (0.5, 0.7) (see Fig. 11b), this gives d = 0.17 f0.565. J.5 Exponents comparison: Theory vs Measurement We compare the empirical measurements of the exponents against their theoretical predictions in Fig. 12. We chose three slices across the phase diagram 1. α = 0.7 Slice (Fig. 12a), in which (α, β) goes from Phase Ia, II and III. 2. α = 0.27 Slice (Fig. 12b), in which (α, β) goes from Phase Ib to Phase IVb. 3. β = 0.7 Slice (Fig. 12c) in which (α, β) goes from Phase Ic, IVb, IVa, III and to II. For the scaling law exponents, the empirical measurement agrees with the theoretical prediction quite well. For the parameter count, the agreement is good but not as good as that of the scaling law exponents. Noticeably, there is disagreement between Approach 1 and Approach 2. Such disagreement is not surprising, as empirical measurements are sensitive to the choice of the Iso FLOP windows and we use the same Iso FLOP window [1e6, 5e8] for all (α, β). This is clearly suboptimal. We briefly discuss this in the next subsection. J.6 Instantaneous slope In this section, we demonstrate that there can be strong finite-size d effects in the measurements of the scaling law and parameter count exponents. We measure the instantaneous slope as a function of parameter count d for the Volterra equation (10). See Fig. 13. To do so, we generate the Volterra solutions for a geometrically spaced sequence of d s with ratio 1.05. We then apply Approach 1 with a very dense Iso FLOP window (100 Iso FLOP Slices between [2e4, 2e7]). We then slide a smaller Iso FLOP window (20 Iso FLOP slices) from left to right to generate a sequence of scaling law (parameter count) exponent, as shown in the middle (right) plot in Fig. 13. These exponents varying slowly when the window slides from a small flops regime to a large flops regime. For example, the scaling law exponent η changes from η = 0.440 to η = 0.413 from left to right, while the parameter count exponent changes from ξ = 0.450 0.575. Using the global window (100 Iso FLOP) to measure these exponents, we have η = 0.42 and ξ = 0.52 which are very close to their average over all small windows: η = 0.418 and ξ = 0.526. Figure 13: Instantaneous Slope. (Left) Volterra equation (10) dynamics for a highly dense grid of d. We also plot the compute-optimal front obtained from using the left small window (small flops regime) and the right window (larger flops regime). (Middle) Shows the evolving measurements of scaling exponents when the flops increases. (Right) Shows the evolving measurements of parameter count exponents when the flops increases. J.7 Negative β. In Fig. 14, we show that our theoretical results work well for β < 0. Figure 14: Negative β. Sweeping β = 0.2 to β = 0.2. We see good agreement between Volterra (theory) dynamics and SGD dynamics. J.8 Additional plots for different phases We summarize the measurement of the scaling law exponents and optimal parameter count exponents in Fig. 15 and Fig. 16, resp. Figure 15: Theory vs. empirical scaling law across different phases. Figure 16: Theory vs. empirical optimal parameter count across different phases. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: All the statements stated in the abstract and in the introduction are proven explicitly in the remainder of the paper with most proven in the Appendix. We clearly state that we are studying a simple model with the exact description described in Section 1.1. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We clearly state in Section 1.1 that we are working on a simple quadratic power law random features model with assumptions on the data and targets. The algorithmic set-up follows that section. We indicate that we are only considering fixed stepsize SGD. All the results are proven in the Appendix. We also make note that we only prove our main result for α > 1/4. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All theorems and results are proven in the Appendix. We make sure to clearly state any assumptions and limitations. For example, we do not prove the setting for α < 1 4, but we clearly state this (see e.g., remark after Theorem 2.1). 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: In each of the figures, we provide an explicit description of how the image was generated including the numerical set-up. In this case, the experiments are simple and they do not require significant coding or datasets. This is mainly a theory paper and the data was synthetically generated. The model has already been used before in other papers and the data comes from Gaussians. This is all clearly stated in our set-up, Section 1.1. We intend to release the code for the numerical computation of the Volterra equation and how it matches the loss curves under SGD. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The paper does not include experiments that require significant code. The model we analyze is a simple random features model with power law applied to synthetic data. As such, the code can be readily reproduced by following the set-up seen in the caption of the figures. We intend to release the code for the numerical computation of the Volterra equation and how it matches the loss curves under SGD. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: The experimental design, including the power law exponents, fixed stepsizes, choices of d and v, and the numerical simulations for solving Volterra equations are all written in the captions of the figures. We also intend to release the code for numerically computing the Volterra equation. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We provide (and explain) in the figures/captions the error bars across multiple runs of SGD. We record how we generate the empirical compute-optimal exponents using statistical tools that were first deployed in other papers such as [24]. We are careful to explain when and why the theory deviates from the numerical simulations. Often this is due to finite d and v effects and the slow behavior of the theory to the asymptotics. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: As this paper is about compute-optimal curves, we provide details on the exact number of flops required to perform the experiments. Since the model we study is a simple least squares model, the compute resources are also well known in by the community. 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We have reviewed the code of ethics. We have made our utmost attempt to adhere to the guidelines provided by Neur IPS. We do not use any human subjects nor any datasets. We did our best to cite all the relevent related work. Given that our work is in the foundational research, it is difficult to mitigate all the risks as the downstream effects of theory are long, but we have done our best. The model is completely synthetic using the standard SGD algorithm; thus we don t, to the best of our knowledge, anticipate any risks. 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: The work presented is purely foundational research and is not directly tied to any particular application. We study a simple random features model with power law data and target and we solve the model using a common algorithm SGD. Given the theoretical nature of this work, we do not anticipate any direct ethical and societal issues. See our Broader Impact Statement in the appendix. 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper poses no such risks. The work is purely theoretical and uses only synthetic data generated from a normal distribution. The models employed are standard statistical model (e.g., least squares) which are textbook learning problems. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We acknowledge via citations that the model we study was introduced before by others (e.g., Maloney, Roberts, and Sully paper). We are not using any datasets or other assets beyond numpy and thus do not need to name any license or cite any dataset. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: We do not intend to release any new assets. The work is a theoretical analysis of SGD on a random features model. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. The work is purely theoretical on a simple model. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The work does not involve crowdsourcing nor research with human subjects. The data used in this work is generated synthetically.