# Implicit Regularization of Random Feature Models

Arthur Jacot*¹, Berfin Şimşek*¹ ², Francesco Spadaro¹, Clément Hongler¹, Franck Gabriel¹

Random Feature (RF) models are used as efficient parametric approximations of kernel methods. We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). For a Gaussian RF model with $P$ features, $N$ data points, and a ridge $\lambda$, we show that the average (i.e. expected) RF predictor is close to a KRR predictor with an effective ridge $\tilde{\lambda}$. We show that $\tilde{\lambda} > \lambda$ and that $\tilde{\lambda}$ decreases monotonically to $\lambda$ as $P$ grows, thus revealing the implicit regularization effect of finite RF sampling. We then compare the risk (i.e. test error) of the $\tilde{\lambda}$-KRR predictor with the average risk of the $\lambda$-RF predictor and obtain a precise and explicit bound on their difference. Finally, we empirically find an extremely good agreement between the test errors of the average $\lambda$-RF predictor and the $\tilde{\lambda}$-KRR predictor.

*Equal contribution. ¹Chair of Statistical Field Theory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. ²Laboratory of Computational Neuroscience, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. Correspondence to: Arthur Jacot. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## 1. Introduction

In this paper, we consider the Random Feature (RF) model, an approximation of kernel methods (Rahimi & Recht, 2008) which has seen many recent theoretical developments. Conventional wisdom suggests that to ensure good generalization performance, one should choose a model class that is complex enough to learn the signal from the training data, yet simple enough to avoid fitting spurious patterns therein (Bishop, 2006). This view has been questioned by recent developments in machine learning. First, Zhang et al. (2016) observed that modern neural network models can perfectly fit randomly labeled training data while still generalizing well. Second, the test error as a function of the number of parameters exhibits a so-called double-descent curve for many models, including neural networks, random forests, and random feature models (Advani & Saxe, 2017; Spigler et al., 2018; Belkin et al., 2018; Mei & Montanari, 2019; Belkin et al., 2019; Nakkiran et al., 2019).

The above models share the feature that, for a fixed input, the learned predictor $\hat{f}$ is random: for neural networks, this is due to the random initialization of the parameters and/or to the stochasticity of the training algorithm; for random forests, to the random branching; for random feature models, to the sampling of random features. The somewhat surprising generalization behavior of these models has recently been the subject of increasing attention. In general, the risk (i.e. test error) is a random variable with two sources of randomness: the usual one due to the sampling of the training set, and a second one due to the randomness of the model itself. We consider the Random Feature (RF) model (Rahimi & Recht, 2008) with features sampled from a Gaussian Process (GP) and study the RF predictor $\hat{f}$ minimizing the regularized least squares error, isolating the randomness of the model by considering fixed training data points.
RF models have been the subject of intense research activity: they are (randomized) approximations of kernel methods aimed at easing the computational burden of kernel methods while being asymptotically equivalent to them (Rahimi & Recht, 2008; Yang et al., 2012; Sriperumbudur & Szabó, 2015; Yu et al., 2016). Unlike the asymptotic behavior, which is well studied, RF models with a finite number of features are much less understood.

### 1.1. Contributions

We consider a model of Random Features (RF) approximating a kernel method with kernel $K$. This model consists of $P$ Gaussian features, sampled i.i.d. from a (centered) Gaussian process with covariance kernel $K$. For a given training set of size $N$, we study the distribution of the RF predictor $\hat{f}^{(RF)}_\lambda$ with ridge parameter $\lambda > 0$ ($L^2$ penalty on the parameters), which we call the $\lambda$-RF predictor. We show the following:

- The distribution of $\hat{f}^{(RF)}_\lambda$ is that of a mixture of Gaussian processes.
- The expected RF predictor is close to the $\tilde{\lambda}$-KRR (Kernel Ridge Regression) predictor for an effective ridge parameter $\tilde{\lambda} > \lambda$. The effective ridge $\tilde{\lambda}$ is determined by the number of features $P$, the ridge $\lambda$, and the Gram matrix of $K$ on the dataset; $\tilde{\lambda}$ decreases monotonically to $\lambda$ as $P$ grows, revealing the implicit regularization effect of finite RF sampling. Conversely, when using random features to approximate a kernel method with a specific ridge $\lambda^*$, one should choose a smaller ridge $\lambda < \lambda^*$ to ensure $\tilde{\lambda}(\lambda) = \lambda^*$.
- The test errors of the expected $\lambda$-RF predictor and of the $\tilde{\lambda}$-KRR predictor $\hat{f}^{(K)}_{\tilde{\lambda}}$ are numerically found to be extremely close, even for small $P$ and $N$.
- The RF predictor's concentration around its expectation can be explicitly controlled in terms of $P$ and of the data; this yields in particular $\mathbb{E}[L(\hat{f}^{(RF)}_\lambda)] = L(\hat{f}^{(K)}_{\tilde{\lambda}}) + O(P^{-1})$ as $N, P \to \infty$ with a fixed ratio $\gamma = P/N$, where $L$ is the MSE risk.

Since we compare the behavior of the $\lambda$-RF and $\tilde{\lambda}$-KRR predictors on the same fixed training set, our result does not rely on any probabilistic assumption on the training data (in particular, we do not assume that the training data is sampled i.i.d.). While our proofs currently require the features to be Gaussian processes, we are confident that they could be generalized to a more general setting (Louart et al., 2017; Benigni & Péché, 2019).

### 1.2. Related works

**Generalization of Random Features.** The generalization behavior of Random Feature models has been studied intensively in the Statistical Learning Theory framework. Rahimi & Recht (2009) find that $O(N)$ features are sufficient to ensure the $O(1/\sqrt{N})$ decay of the generalization error of Kernel Ridge Regression (KRR). Rudi & Rosasco (2017) improve on this result and show that $O(\sqrt{N}\log N)$ features are actually enough to obtain the $O(1/\sqrt{N})$ decay of the KRR error. Hastie et al. (2019) use random matrix theory tools to compute the asymptotic risk when both $P, N \to \infty$ with $P/N \to \gamma > 0$. When the training data is sampled i.i.d. from a Gaussian distribution, the variance is shown to explode at $\gamma = 1$. In the same linear regression setup, Bartlett et al. (2019) establish general upper and lower bounds on the excess risk. Mei & Montanari (2019) prove that the double-descent (DD) curve also arises for random ReLU features, and that adding a ridge suppresses the explosion around $\gamma = 1$.
**Double-descent and the effect of regularization.** For the cross-entropy loss, Neyshabur et al. (2014) observed that for two-layer neural networks the test error exhibits the double-descent (DD) curve as the network width increases (without regularizers, without early stopping). For MSE and hinge losses, the DD curve was also observed in multilayer networks on the MNIST dataset (Advani & Saxe, 2017; Spigler et al., 2018). Neal et al. (2018) study the variance due to stochastic training in neural networks and find that it increases until a certain width, but then decreases down to 0. Nakkiran et al. (2019) establish the DD phenomenon across various models, including convolutional and recurrent networks, on more complex datasets (e.g. CIFAR-10, CIFAR-100). Belkin et al. (2018; 2019) find that the DD curve is not peculiar to neural networks and observe the same for random Fourier features and decision trees. In Geiger et al. (2019), the DD curve for neural networks is related to the variance associated with the random initialization of the Neural Tangent Kernel (Jacot et al., 2018); as a result, ensembling is shown to suppress the DD phenomenon in this case, and the test error stays constant in the overparameterized regime. Recent theoretical work (d'Ascoli et al., 2020) studies the same setting and derives formulas for the asymptotic error, relying on the so-called replica method.

**General Wishart Matrices.** Our theoretical analysis relies on the study of the spectrum of so-called general Wishart matrices of the form $W \Sigma W^T$ (for an $N \times N$ matrix $\Sigma$ and a $P \times N$ matrix $W$ with i.i.d. standard Gaussian entries), and in particular of their Stieltjes transform
$$
m_P(z) = \frac{1}{P}\operatorname{Tr}\left[\left(W \Sigma W^T - z I_P\right)^{-1}\right].
$$
A number of asymptotic results (Silverstein, 1995; Bai & Wang, 2008) about the spectrum and Stieltjes transform of such matrices can be understood using the asymptotic freeness of $W^T W$ and $\Sigma$ (Gabriel, 2015; Speicher, 2017). In this paper, we provide non-asymptotic variants of these results for an arbitrary matrix $\Sigma$ (which in our setting is the kernel Gram matrix); the proofs in our setting are detailed in the Supp. Mat. (a short numerical illustration of $m_P$ is given at the end of this section).

### 1.3. Outline

The rest of this paper is organized as follows. In Section 2, the setup (linear regression, Gaussian RF model, $\lambda$-RF predictor, and $\lambda$-KRR predictor) is introduced. In Section 3, preliminary results on the distribution of the $\lambda$-RF model are provided: the RF predictors are Gaussian mixtures (Proposition 3.1) and the $\lambda \to 0$ RF model is unbiased in the overparameterized regime (Corollary 3.2). Graphical illustrations of the RF predictors in various regimes are presented (Figure 1). In Section 4, the first main theorem is stated (Theorem 4.1): the average (expected) $\lambda$-RF predictor is close to the $\tilde{\lambda}$-KRR predictor for an explicit $\tilde{\lambda} > \lambda$. As a consequence (Corollary 4.3), the test errors of these two predictors are close. Numerical experiments show that the test errors are in fact virtually identical (Figure 2). In Section 5, the second main theorem is stated (Theorem 5.1): a bound on the variance of the $\lambda$-RF predictor is given, which shows that it concentrates around the average $\lambda$-RF predictor. As a consequence, the test error of the $\lambda$-RF predictor is shown to be close to that of the $\tilde{\lambda}$-KRR predictor (Corollary 5.2). The ridgeless $\lambda \to 0$ case is then investigated (Section 5.2): a lower bound on the variance of the $\lambda$-RF predictor is given, suggesting an explanation for the double-descent curve in the ridgeless case. In Section 6, we summarize our results and discuss potential implications and extensions.
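As a small numerical illustration of the Stieltjes transform introduced in the paragraph on general Wishart matrices above, the following sketch (ours, not part of the original paper; the spectrum of $\Sigma$ and all sizes are arbitrary toy choices) computes $m_P(-\lambda)$ for the normalized matrix $\frac{1}{P} W \Sigma W^T$ used later in the proofs and shows that it concentrates over independent draws of $W$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not from the paper): an N x N covariance Sigma with a
# decaying spectrum, and a P x N matrix W with i.i.d. standard Gaussian entries.
N, P = 200, 300
d = 1.0 / np.arange(1, N + 1)            # toy eigenvalues of Sigma
Sigma = np.diag(d)                        # w.l.o.g. diagonal (change of basis)

def stieltjes(W, Sigma, z):
    """Empirical Stieltjes transform m_P(z) of the general Wishart matrix (1/P) W Sigma W^T."""
    P = W.shape[0]
    M = (W @ Sigma @ W.T) / P
    return np.trace(np.linalg.inv(M - z * np.eye(P))) / P

lam = 0.1
samples = [stieltjes(rng.standard_normal((P, N)), Sigma, -lam) for _ in range(20)]
print("mean of m_P(-lambda):", np.mean(samples))
print("std of m_P(-lambda): ", np.std(samples))   # small: m_P concentrates around a deterministic value
```

The small spread across draws is the finite-size concentration that the non-asymptotic arguments of the paper quantify.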
## 2. Setup

Linear regression is a parametric model consisting of linear combinations $\theta_1 \phi^{(1)} + \dots + \theta_P \phi^{(P)}$ of (deterministic) features $\phi^{(1)}, \dots, \phi^{(P)} : \mathbb{R}^d \to \mathbb{R}$. We consider an arbitrary training dataset $(X, y)$ with $X = [x_1, \dots, x_N] \in \mathbb{R}^{d \times N}$ and $y = [y_1, \dots, y_N] \in \mathbb{R}^N$, where the labels could be noisy observations. For a ridge parameter $\lambda > 0$, the linear estimator corresponds to the parameters $\hat{\theta} = [\hat{\theta}_1, \dots, \hat{\theta}_P] \in \mathbb{R}^P$ minimizing the (regularized) Mean Square Error (MSE) functional $\hat{L}_\lambda$ defined by
$$
\hat{L}_\lambda(f_\theta) = \sum_{i=1}^{N} \left(f_\theta(x_i) - y_i\right)^2 + \lambda \|\theta\|^2. \tag{1}
$$
The data matrix $F$ is defined as the $N \times P$ matrix with entries $F_{ij} = \frac{1}{\sqrt{P}} \phi^{(j)}(x_i)$. The minimization of (1) can be rewritten in terms of $F$ as
$$
\hat{\theta} = \operatorname*{argmin}_{\theta}\; \|F\theta - y\|^2 + \lambda \|\theta\|^2. \tag{2}
$$
The optimal solution $\hat{\theta}$ is then given by
$$
\hat{\theta} = F^T \left(F F^T + \lambda I_N\right)^{-1} y \tag{3}
$$
and the optimal predictor $\hat{f} = f_{\hat{\theta}}$ by
$$
\hat{f}(x) = \frac{1}{\sqrt{P}} \sum_{j=1}^{P} \phi^{(j)}(x)\, F_{:,j}^T \left(F F^T + \lambda I_N\right)^{-1} y. \tag{4}
$$

In this paper, we consider linear models of Gaussian random features associated with a kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. We take $\phi^{(j)} = f^{(j)}$, where $f^{(1)}, \dots, f^{(P)}$ are sampled i.i.d. from a Gaussian process with zero mean (i.e. $\mathbb{E}[f^{(j)}(x)] = 0$ for all $x \in \mathbb{R}^d$) and covariance $K$ (i.e. $\mathbb{E}[f^{(j)}(x) f^{(j)}(x')] = K(x, x')$ for all $x, x' \in \mathbb{R}^d$). In our setup, the optimal parameter $\hat{\theta}$ still satisfies (3), where $F$ is now a random matrix. The associated predictor, called the $\lambda$-RF predictor, is given by the following definition.

**Definition 2.1** (Random Feature Predictor). Consider a kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, a ridge $\lambda > 0$, and random features $f^{(1)}, \dots, f^{(P)}$ sampled i.i.d. from a centered Gaussian process of covariance $K$. Let $\hat{\theta}$ be the optimal solution to (1) taking $\phi^{(j)} = f^{(j)}$. The Random Feature predictor with ridge $\lambda$ is the random function $\hat{f}^{(RF)}_\lambda : \mathbb{R}^d \to \mathbb{R}$ defined by
$$
\hat{f}^{(RF)}_\lambda(x) = \frac{1}{\sqrt{P}} \sum_{j=1}^{P} \hat{\theta}_j f^{(j)}(x). \tag{5}
$$

The $\lambda$-RF predictor can be viewed as an approximation of kernel ridge predictors: observing from (4) that $\hat{f}^{(RF)}_\lambda$ only depends on the scalar products $K_P(x, x') = \frac{1}{P} \sum_{j=1}^{P} f^{(j)}(x) f^{(j)}(x')$ between datapoints, we see that as $P \to \infty$, $K_P \to K$ and hence $\hat{f}^{(RF)}_\lambda$ converges (Rahimi & Recht, 2008) to a kernel predictor with ridge $\lambda$ (Schölkopf et al., 1998), which we call the $\lambda$-KRR predictor.

**Definition 2.2** (Kernel Predictor). Consider a kernel function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and a ridge $\lambda > 0$. The Kernel Predictor is the function $\hat{f}^{(K)}_\lambda : \mathbb{R}^d \to \mathbb{R}$,
$$
\hat{f}^{(K)}_\lambda(x) = K(x, X)\left(K(X, X) + \lambda I_N\right)^{-1} y,
$$
where $K(X, X)$ is the $N \times N$ matrix with entries $(K(X, X))_{ij} = K(x_i, x_j)$ and $K(\cdot, X) : \mathbb{R}^d \to \mathbb{R}^N$ is the map $(K(x, X))_i = K(x, x_i)$.

### 2.1. Bias-Variance Decomposition

Let us assume that there exists a true regression function $f^* : \mathbb{R}^d \to \mathbb{R}$ and a data-generating distribution $\mathcal{D}$ on $\mathbb{R}^d$. The risk of a predictor $f : \mathbb{R}^d \to \mathbb{R}$ is measured by the MSE, defined as $L(f) = \mathbb{E}_{\mathcal{D}}\left[(f(x) - f^*(x))^2\right]$. Let $\pi$ denote the joint distribution of the i.i.d. sample $f^{(1)}, \dots, f^{(P)}$ from the centered Gaussian process with covariance kernel $K$. The risk of $\hat{f}^{(RF)}_\lambda$ can be decomposed into a bias-variance form as
$$
\mathbb{E}_\pi\left[L(\hat{f}^{(RF)}_\lambda)\right] = L\left(\mathbb{E}_\pi[\hat{f}^{(RF)}_\lambda]\right) + \mathbb{E}_{\mathcal{D}}\left[\operatorname{Var}_\pi(\hat{f}^{(RF)}_\lambda(x))\right].
$$
This decomposition into the risk of the average RF predictor and the $\mathcal{D}$-expectation of its variance will play a crucial role in the next sections.

*Figure 1. Distribution of the RF Predictor. Panels: (a) $P = 2$, $\lambda = 10^{-4}$; (b) $P = 4$, $\lambda = 10^{-4}$; (c) $P = 4$, $\lambda = 0.1$; (d) $P = 100$, $\lambda = 10^{-4}$. Red dots represent a sinusoidal dataset $y_i = \sin(x_i)$ for $N = 4$ points $x_i$ in $[0, 2\pi)$. For selected $P$ and $\lambda$, we sample ten RF predictors (blue dashed lines) and compute empirically the average RF predictor (black lines) with 2-standard-deviation intervals (shaded regions).*
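To make Definitions 2.1 and 2.2 concrete, here is a minimal sketch (ours, not the authors' code) of one draw of the $\lambda$-RF predictor of (5) next to the $\lambda$-KRR predictor; the RBF kernel, the sinusoidal data, and all sizes are illustrative choices in the spirit of Figure 1. The Gaussian features are sampled jointly on the training and test points via a Cholesky factorization of the kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(X1, X2, ell=1.0):
    """RBF kernel K(x, x') = exp(-||x - x'||^2 / ell)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / ell)

# Toy 1-d sinusoidal data (illustrative choice).
N, P, lam = 8, 50, 1e-2
X = np.linspace(0, 2 * np.pi, N)[:, None]
y = np.sin(X[:, 0])
Xtest = np.linspace(0, 2 * np.pi, 100)[:, None]

# Sample P features f^(1),...,f^(P) i.i.d. from the centered GP with covariance K,
# restricted to the train + test points (Cholesky with a small jitter).
Z = np.vstack([X, Xtest])
chol = np.linalg.cholesky(rbf(Z, Z) + 1e-8 * np.eye(len(Z)))
feats = chol @ rng.standard_normal((len(Z), P))      # columns are the f^(j) on Z
F = feats[:N] / np.sqrt(P)                           # data matrix, F_ij = f^(j)(x_i) / sqrt(P)

# lambda-RF predictor: theta = F^T (F F^T + lam I)^{-1} y, f_RF(x) = sum_j theta_j f^(j)(x) / sqrt(P)
theta = F.T @ np.linalg.solve(F @ F.T + lam * np.eye(N), y)
f_rf = feats[N:] @ theta / np.sqrt(P)

# lambda-KRR predictor: K(x, X)(K(X, X) + lam I)^{-1} y
f_krr = rbf(Xtest, X) @ np.linalg.solve(rbf(X, X) + lam * np.eye(N), y)

print("max |RF - KRR| over test points:", np.abs(f_rf - f_krr).max())
```

Increasing $P$ shrinks the gap between a single draw and the kernel predictor, in line with the $P \to \infty$ limit discussed above; for small $P$ a single draw can deviate substantially.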
The decomposition above is in contrast with the classical bias-variance decomposition of Geman et al. (1992),
$$
\mathbb{E}_{\mathcal{D}^N}[L(f)] = L\left(\mathbb{E}_{\mathcal{D}^N}[f]\right) + \mathbb{E}_{\mathcal{D}}\left[\operatorname{Var}_{\mathcal{D}^N}[f(x)]\right],
$$
where $\mathcal{D}^N$ denotes the joint distribution of $x_1, \dots, x_N$ sampled i.i.d. from $\mathcal{D}$. Note that in our decomposition no probabilistic assumption is made on the data, which is fixed.

### 2.2. Additional Notation

In this paper, we consider a fixed dataset $(X, y)$ with distinct data points and a kernel $K$ (i.e. a positive definite symmetric function $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$). We denote by $\|y\|_{K^{-1}}$ the inverse kernel norm of the labels, defined by $\|y\|_{K^{-1}}^2 = y^T (K(X, X))^{-1} y$. Let $U D U^T$ be the spectral decomposition of the kernel matrix $K(X, X)$, with $D = \operatorname{diag}(d_1, \dots, d_N)$. Let $D^{\frac{1}{2}} = \operatorname{diag}(\sqrt{d_1}, \dots, \sqrt{d_N})$ and set $K^{\frac{1}{2}} = U D^{\frac{1}{2}} U^T$. The law of the (random) data matrix $F$ is then that of $\frac{1}{\sqrt{P}} K^{\frac{1}{2}} W^T$, where $W$ is a $P \times N$ matrix of i.i.d. standard Gaussian entries, so that $\mathbb{E}[F F^T] = K(X, X)$. We denote by $\gamma = \frac{P}{N}$ the parameter-to-datapoint ratio: the underparameterized regime corresponds to $\gamma < 1$, while the overparameterized regime corresponds to $\gamma \geq 1$. In order to stress the dependence on the ratio $\gamma$, we write $\hat{f}^{(RF)}_{\lambda,\gamma}$ instead of $\hat{f}^{(RF)}_\lambda$.

## 3. First Observations

The distribution of the RF predictor features a variety of behaviors depending on $\gamma$ and $\lambda$, as displayed in Figure 1. In the underparameterized regime $P < N$, sample RF predictors induce some implicit regularization and do not interpolate the dataset (1a); at the interpolation threshold $P = N$, RF predictors interpolate the dataset, but the variance explodes when there is no ridge (1b), whereas adding some ridge suppresses the variance explosion (1c); in the overparameterized regime $P \geq N$ with large $P$, the variance vanishes and the RF predictor converges to its average (1d). We investigate the average RF predictor (solid lines) in detail in Section 4 and study its variance in Section 5.

We start by characterizing the distribution of the RF predictor as a Gaussian mixture:

**Proposition 3.1.** Let $\hat{f}^{(RF)}_{\lambda,\gamma}(x)$ be the random feature predictor as in (5) and let $\hat{y} = F\hat{\theta}$ be the prediction vector on the training data, i.e. $\hat{y}_i = \hat{f}^{(RF)}_{\lambda,\gamma}(x_i)$. The process $\hat{f}^{(RF)}_{\lambda,\gamma}$ is a mixture of Gaussians: conditioned on $F$, $\hat{f}^{(RF)}_{\lambda,\gamma}$ is a Gaussian process. The mean and covariance of $\hat{f}^{(RF)}_{\lambda,\gamma}$ conditioned on $F$ are given by
$$
\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}(x) \mid F] = K(x, X) K(X, X)^{-1} \hat{y}, \tag{6}
$$
$$
\operatorname{Cov}[\hat{f}^{(RF)}_{\lambda,\gamma}(x), \hat{f}^{(RF)}_{\lambda,\gamma}(x') \mid F] = \frac{\|\hat{\theta}\|^2}{P} \tilde{K}(x, x'), \tag{7}
$$
with $\tilde{K}(x, x') = K(x, x') - K(x, X) K(X, X)^{-1} K(X, x')$ denoting the posterior covariance kernel.

The proof of Proposition 3.1 relies on the fact that $f^{(j)}$ conditioned on $\left(f^{(j)}(x_i)\right)_{i=1,\dots,N}$ is a Gaussian process. Note that (6) and (7) depend on $\lambda$ and $P$ through $\hat{y}$ and $\|\hat{\theta}\|^2$; in fact, as the proof shows, these identities extend to the ridgeless case $\lambda \to 0$. In the ridgeless case, when one is in the overparameterized regime ($P \geq N$), one can (with probability one) fit the labels $y$ and hence $\hat{y} = y$:

**Corollary 3.2.** When $P \geq N$, the average ridgeless RF predictor is equal to the ridgeless KRR predictor:
$$
\mathbb{E}\left[\hat{f}^{(RF)}_{\lambda \to 0, \gamma}(x)\right] = K(x, X) K(X, X)^{-1} y = \hat{f}^{(K)}_{\lambda \to 0}(x).
$$

This corollary shows that in the overparameterized case, the ridgeless RF predictor is an unbiased estimator of the ridgeless kernel predictor. The difference between the expected loss of the ridgeless RF predictor and that of the ridgeless KRR predictor is hence equal to the variance of the RF predictor.
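The unbiasedness statement of Corollary 3.2 is easy to probe numerically. The sketch below (ours; the kernel, data, and sample sizes are arbitrary toy choices) averages many ridgeless, overparameterized RF predictors and compares the average with the ridgeless KRR predictor; the gap should be of the order of the Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(X1, X2, ell=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / ell)

def rf_predictor(X, y, Xtest, P, lam, rng):
    """One draw of the lambda-RF predictor of Definition 2.1 (Gaussian features with covariance K)."""
    N = len(X)
    Z = np.vstack([X, Xtest])
    chol = np.linalg.cholesky(rbf(Z, Z) + 1e-8 * np.eye(len(Z)))
    feats = chol @ rng.standard_normal((len(Z), P))
    F = feats[:N] / np.sqrt(P)
    theta = F.T @ np.linalg.solve(F @ F.T + lam * np.eye(N), y)
    return feats[N:] @ theta / np.sqrt(P)

N, P, lam = 6, 60, 1e-8                   # overparameterized (P >= N), essentially ridgeless
X = np.linspace(0, 2 * np.pi, N)[:, None]
y = np.sin(X[:, 0])
Xtest = rng.uniform(0, 2 * np.pi, (50, 1))

avg_rf = np.mean([rf_predictor(X, y, Xtest, P, lam, rng) for _ in range(2000)], axis=0)
krr0 = rbf(Xtest, X) @ np.linalg.solve(rbf(X, X) + lam * np.eye(N), y)   # (ridgeless) KRR predictor
print("max |E[RF] - ridgeless KRR|:", np.abs(avg_rf - krr0).max())       # small, up to Monte Carlo noise
```

Increasing the number of draws tightens the comparison, since only Monte Carlo noise separates the two curves in this regime.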
As will be demonstrated in this article, outside of this specific regime a systematic bias appears, which reveals the implicit regularizing effect of random features.

*Figure 2. Comparison of the test errors of the average $\lambda$-RF predictor and the $\tilde{\lambda}$-KRR predictor. Panels: (a) evolution of $\tilde{\lambda}$; (b) average $\lambda$-RF predictor vs. $\tilde{\lambda}$-KRR. We train the RF predictors on $N = 100$ MNIST data points, where $K$ is the RBF kernel, i.e. $K(x, x') = \exp(-\|x - x'\|^2 / \ell)$. We approximate the average $\lambda$-RF predictor on 100 random test points for various ridges $\lambda$. In (a), given $\gamma$ and $\lambda$, the effective ridge $\tilde{\lambda}$ is computed numerically using (9). In (b), the test errors of the $\tilde{\lambda}$-KRR predictor (blue lines) and the empirical average of the $\lambda$-RF predictor (red dots) agree perfectly.*

## 4. Average Predictor

In this section, we study the average RF predictor $\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}]$. As shown by Corollary 3.2 above, in the ridgeless overparameterized regime the RF predictor is an unbiased estimator of the ridgeless kernel predictor. However, in the presence of a non-zero ridge, we observe the following implicit regularization effect: the average $\lambda$-RF predictor is close to the $\tilde{\lambda}$-KRR predictor for an effective ridge $\tilde{\lambda} > \lambda$ (in other words, sampling a finite number $P$ of features amounts to taking a larger kernel ridge $\tilde{\lambda}$).

**Theorem 4.1.** For $N, P > 0$ and $\lambda > 0$, we have
$$
\left|\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}(x)] - \hat{f}^{(K)}_{\tilde{\lambda}}(x)\right| \leq \frac{c}{P} \sqrt{K(x, x)}\, \|y\|_{K^{-1}}, \tag{8}
$$
where the effective ridge $\tilde{\lambda}(\lambda, \gamma) > \lambda$ is the unique positive number satisfying
$$
\tilde{\lambda} = \lambda + \frac{\tilde{\lambda}}{\gamma} \frac{1}{N} \sum_{i=1}^{N} \frac{d_i}{\tilde{\lambda} + d_i}, \tag{9}
$$
and where $c > 0$ depends on $\lambda$, $\gamma$, and $\frac{1}{N}\operatorname{Tr} K(X, X)$ only.

*Proof (sketch; see Supp. Mat. for details).* Set $A_\lambda = F (F^T F + \lambda I_P)^{-1} F^T$. The vector of predictions on the training set is given by $\hat{y} = A_\lambda y$, and the expected predictor is given by
$$
\mathbb{E}\left[\hat{f}^{(RF)}_{\lambda,\gamma}(x)\right] = K(x, X) K(X, X)^{-1} \mathbb{E}[A_\lambda]\, y.
$$
By a change of basis, we may assume the kernel Gram matrix to be diagonal, i.e. $K(X, X) = \operatorname{diag}(d_1, \dots, d_N)$. In this basis, $\mathbb{E}[A_\lambda]$ turns out to be diagonal too. For each $i = 1, \dots, N$ we can isolate the contribution of the $i$-th row of $F$: by the Sherman-Morrison formula, we have
$$
(A_\lambda)_{ii} = \frac{d_i g_i}{1 + d_i g_i}, \qquad g_i = \frac{1}{P}\, W_i^T \left(F_{(i)}^T F_{(i)} + \lambda I_P\right)^{-1} W_i,
$$
with $W_i$ denoting the $i$-th column of $W$ and $F_{(i)}$ obtained by removing the $i$-th row of $F$. The $g_i$'s are all within $O(1/\sqrt{P})$ distance of the Stieltjes transform
$$
m_P(-\lambda) = \frac{1}{P} \operatorname{Tr}\left[\left(F^T F + \lambda I_P\right)^{-1}\right].
$$
By a fixed-point argument, the Stieltjes transform $m_P(-\lambda)$ is itself within $O(1/\sqrt{P})$ distance of the deterministic value $m(-\lambda)$, where $m$ is the unique positive solution of
$$
\frac{1}{N} \sum_{i=1}^{N} \frac{d_i\, m(z)}{1 + d_i\, m(z)} - \gamma z\, m(z) = \gamma.
$$
(The detailed proof in the Supp. Mat. uses non-asymptotic variants of arguments found in Bai & Wang (2008); in particular, the constants in the $O$ bounds are made explicit.) As a consequence, we obtain
$$
\mathbb{E}[(A_\lambda)_{ii}] = \mathbb{E}\left[\frac{d_i g_i}{1 + d_i g_i}\right] \approx \frac{d_i m}{1 + d_i m} = \frac{d_i}{\tilde{\lambda} + d_i},
$$
revealing the effective ridge $\tilde{\lambda} = 1 / m(-\lambda)$. This implies that $\mathbb{E}[A_\lambda] \approx K(X, X)(K(X, X) + \tilde{\lambda} I_N)^{-1}$ and hence
$$
\mathbb{E}\left[\hat{f}^{(RF)}_{\lambda,\gamma}(x)\right] \approx K(x, X)(K(X, X) + \tilde{\lambda} I_N)^{-1} y = \hat{f}^{(K)}_{\tilde{\lambda}}(x),
$$
yielding the desired result.

Note that asymptotic forms of equations similar to the ones in the above proof appear in different settings (Dobriban & Wager, 2018; Mei & Montanari, 2019; Liu & Dobriban, 2020), related to the study of the Stieltjes transform of the product of asymptotically free random matrices.
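The effective ridge $\tilde{\lambda}$ of Theorem 4.1 is only defined implicitly by (9), and the caption of Figure 2 mentions computing it numerically. A minimal sketch (ours, with an arbitrary toy spectrum) solves (9) by fixed-point iteration and cross-checks the result against the reciprocal of the empirical Stieltjes transform $m_P(-\lambda)$ from the proof sketch, since $\tilde{\lambda} = 1/m(-\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(3)

def effective_ridge(lam, gamma, d, n_iter=1000, tol=1e-12):
    """Solve eq. (9), tilde_lam = lam + (tilde_lam/gamma) * mean_i d_i/(tilde_lam + d_i),
    by fixed-point iteration (d = eigenvalues of the kernel Gram matrix K(X, X))."""
    tl = lam + np.mean(d) / gamma          # any positive starting point works here
    for _ in range(n_iter):
        new = lam + (tl / gamma) * np.mean(d / (tl + d))
        if abs(new - tl) < tol:
            break
        tl = new
    return tl

# Toy spectrum for K(X, X) (illustrative choice) and a check against the
# empirical Stieltjes transform used in the proof sketch: tilde_lam ~ 1/m_P(-lam).
N, gamma, lam = 200, 2.0, 1e-2
P = int(gamma * N)
d = 1.0 / np.arange(1, N + 1) ** 2
tl = effective_ridge(lam, gamma, d)

W = rng.standard_normal((P, N))
FtF = (W * d) @ W.T / P                    # F^T F = (1/P) W K W^T in the diagonal basis
m_P = np.trace(np.linalg.inv(FtF + lam * np.eye(P))) / P
print("effective ridge (fixed point):", tl)
print("1 / m_P(-lambda) (one draw):  ", 1.0 / m_P)
```

The two printed values agree up to finite-size fluctuations, which is exactly the concentration that the proof quantifies.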
While the above theorem makes no assumptions on $P$, $N$, and $K$, the case of interest is when the right-hand side $\frac{c}{P}\sqrt{K(x, x)}\,\|y\|_{K^{-1}}$ of (8) is small. The constant $c > 0$ is uniformly bounded whenever $\gamma$ and $\lambda$ are bounded away from $0$ and $\frac{1}{N}\operatorname{Tr} K(X, X)$ is bounded from above. As a result, to bound the right-hand side of (8), the two quantities we need to control are $T = \frac{1}{N}\operatorname{Tr} K(X, X)$ and $\|y\|_{K^{-1}}$.

The boundedness of $T$ is guaranteed for kernels that are translation-invariant, i.e. of the form $K(x, y) = k(x - y)$: in this case, one has $T = k(0)$. If we assume $\mathbb{E}_{\mathcal{D}}[K(x, x)] < \infty$ (as is commonly done in the literature (Rudi & Rosasco, 2017)), $T$ converges to $\mathbb{E}_{\mathcal{D}}[K(x, x)]$ as $N \to \infty$ (assuming i.i.d. data points). For $\|y\|_{K^{-1}}$, under the assumption that the labels are of the form $y_i = f^*(x_i)$ for a true regression function $f^*$ lying in the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ of the kernel $K$ (Schölkopf et al., 1998), we have $\|y\|_{K^{-1}} \leq \|f^*\|_{\mathcal{H}}$.

Our numerical experiments in Figure (2b) show excellent agreement between the test error of the expected $\lambda$-RF predictor and that of the $\tilde{\lambda}$-KRR predictor, suggesting that the two functions are indeed very close, even for small $N$ and $P$.

Thanks to the implicit definition (9) of the effective ridge $\tilde{\lambda}$ (which depends on $\lambda$, $\gamma$, $N$, and on the eigenvalues $d_i$ of $K(X, X)$), we obtain the following:

**Proposition 4.2.** The effective ridge $\tilde{\lambda}$ satisfies the following properties:
1. for any $\gamma > 0$, we have $\lambda < \tilde{\lambda}(\lambda, \gamma) \leq \lambda + \frac{1}{\gamma}\, \frac{1}{N}\operatorname{Tr} K(X, X)$;
2. the function $\gamma \mapsto \tilde{\lambda}(\lambda, \gamma)$ is decreasing;
3. for $\gamma > 1$, we have $\tilde{\lambda} \leq \frac{\gamma}{\gamma - 1}\, \lambda$;
4. for $\gamma < 1$, we have $\tilde{\lambda} \geq \frac{1 - \gamma}{\gamma}\, \min_i d_i$.

The above proposition shows the implicit regularization effect of the RF model: sampling fewer features (i.e. decreasing $\gamma$) increases the effective ridge $\tilde{\lambda}$. Furthermore, as $\lambda \to 0$ (the ridgeless case), the effective ridge $\tilde{\lambda}$ behaves as follows: in the overparameterized regime ($\gamma > 1$), $\tilde{\lambda}$ goes to $0$; in the underparameterized regime ($\gamma < 1$), $\tilde{\lambda}$ goes to a limit $\tilde{\lambda}_0 > 0$. These observations match the profile of $\tilde{\lambda}$ in Figure (2a).

**Remark.** When $\lambda \to 0$, the constant $c$ in our bound (8) explodes (see Supp. Mat.). As a result, this bound is not directly useful when $\lambda = 0$. However, we know from Corollary 3.2 that in the ridgeless overparameterized case ($\gamma > 1$), the average RF predictor is equal to the ridgeless KRR predictor. In the underparameterized case ($\gamma < 1$), our numerical experiments suggest that the ridgeless RF predictor is an excellent approximation of the $\tilde{\lambda}_0$-KRR predictor.

### 4.1. Effective Dimension

The effective ridge $\tilde{\lambda}$ is closely related to the so-called effective dimension appearing in statistical learning theory. For a linear (or kernel) model with ridge $\lambda$, the effective dimension $\mathcal{N}(\lambda) \leq N$ is defined as $\mathcal{N}(\lambda) = \sum_{i=1}^{N} \frac{d_i}{\lambda + d_i}$ (Zhang, 2003; Caponnetto & De Vito, 2007). It measures the effective complexity of the Hilbert space in the presence of a ridge. For a given $\lambda > 0$, the effective ridge $\tilde{\lambda}$ introduced in Theorem 4.1 is related to the effective dimension $\mathcal{N}(\tilde{\lambda})$ by
$$
\mathcal{N}(\tilde{\lambda}) = P\left(1 - \frac{\lambda}{\tilde{\lambda}}\right).
$$
In particular, we have $\mathcal{N}(\tilde{\lambda}) \leq \min(N, P)$: the choice of a finite number of features corresponds to an automatic lowering of the effective dimension of the related kernel method. Note that in the ridgeless underparameterized case ($\lambda \to 0$ and $\gamma < 1$), the effective dimension $\mathcal{N}(\tilde{\lambda})$ equals precisely the number of features $P$.
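As a quick numerical check of the identity of Section 4.1 (again with an arbitrary toy spectrum; this sketch is ours, not the authors' code), one can compare $\mathcal{N}(\tilde{\lambda})$ computed directly from its definition with $P(1 - \lambda/\tilde{\lambda})$:

```python
import numpy as np

# Effective dimension N(lam) = sum_i d_i / (lam + d_i) and the identity
# N(tilde_lam) = P (1 - lam / tilde_lam) from Section 4.1 (toy spectrum, illustrative).
N, gamma, lam = 200, 0.5, 1e-2
P = int(gamma * N)
d = 1.0 / np.arange(1, N + 1) ** 2

tl = lam + 1.0                                    # fixed-point iteration for eq. (9)
for _ in range(10000):
    tl = lam + (tl / gamma) * np.mean(d / (tl + d))

eff_dim = np.sum(d / (tl + d))
print("N(tilde_lam)         :", eff_dim)
print("P (1 - lam/tilde_lam):", P * (1 - lam / tl))
print("min(N, P)            :", min(N, P))        # N(tilde_lam) <= min(N, P)
```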
### 4.2. Risk of the Average Predictor

A corollary of Theorem 4.1 is that the loss of the expected RF predictor is close to the loss of the KRR predictor with ridge $\tilde{\lambda}$:

**Corollary 4.3.** If $\mathbb{E}_{\mathcal{D}}[K(x, x)] < \infty$, the difference of errors $\delta E = L(\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}]) - L(\hat{f}^{(K)}_{\tilde{\lambda}})$ is bounded from above by
$$
|\delta E| \leq \frac{C\, \|y\|_{K^{-1}}}{P}\left(2\sqrt{L(\hat{f}^{(K)}_{\tilde{\lambda}})} + \frac{C\, \|y\|_{K^{-1}}}{P}\right),
$$
where $C = c\,\sqrt{\mathbb{E}_{\mathcal{D}}[K(x, x)]}$, with $c$ the constant appearing in (8) above.

As a result, $\delta E$ can be bounded in terms of $\lambda$, $\gamma$, $T$, $\|y\|_{K^{-1}}$, which are discussed above, and of the kernel generalization error $L(\hat{f}^{(K)}_{\tilde{\lambda}})$. Such a generalization error can be controlled in a number of settings as $N$ grows: in (Caponnetto & De Vito, 2007; Marteau-Ferey et al., 2019), for instance, the loss is shown to vanish as $N \to \infty$. Figure (2b) shows that the two test losses are indeed very close.

*Figure 3. Average test error of the ridgeless vs. ridge $\lambda$-RF predictors. Panels: (a) ridgeless vs. ridge; (b) variance of the $\lambda$-RF predictor; (c) evolution of $\partial_\lambda \tilde{\lambda}$. In (a), the average test errors of the ridgeless and ridge RF predictors (solid lines) and the effect of ensembling (dashed lines) for $N = 100$ MNIST data points. In (b), the variance of the RF predictors, and in (c), the evolution of $\partial_\lambda \tilde{\lambda}$ in the ridgeless and ridge cases. The experimental setup is the same as in Figure 2.*

## 5. Variance

In the previous sections, we analyzed the loss of the expected predictor $\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}]$. In order to analyze the expected loss of the RF predictor $\hat{f}^{(RF)}_{\lambda,\gamma}$, it remains to control its variance: this follows from the bias-variance decomposition
$$
\mathbb{E}\left[L(\hat{f}^{(RF)}_{\lambda,\gamma})\right] = L\left(\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}]\right) + \mathbb{E}_{\mathcal{D}}\left[\operatorname{Var}(\hat{f}^{(RF)}_{\lambda,\gamma}(x))\right]
$$
introduced in Section 2.1. The variance $\operatorname{Var}(\hat{f}^{(RF)}_{\lambda,\gamma}(x))$ of the RF predictor can itself be written as the sum
$$
\operatorname{Var}\left(\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}(x) \mid F]\right) + \mathbb{E}\left[\operatorname{Var}(\hat{f}^{(RF)}_{\lambda,\gamma}(x) \mid F)\right].
$$
By Proposition 3.1, we have
$$
\mathbb{E}[\hat{f}^{(RF)}_{\lambda,\gamma}(x) \mid F] = K(x, X) K(X, X)^{-1} \hat{y}, \qquad \operatorname{Var}(\hat{f}^{(RF)}_{\lambda,\gamma}(x) \mid F) = \frac{\|\hat{\theta}\|^2}{P}\, \tilde{K}(x, x).
$$

### 5.1. RF Predictor Concentration

The following theorem allows us to bound both terms:

**Theorem 5.1.** There are constants $c_1, c_2 > 0$ depending on $\lambda$, $\gamma$, $T$ only such that
$$
\operatorname{Var}\left(K(x, X) K(X, X)^{-1} \hat{y}\right) \leq \frac{c_1 K(x, x)\, \|y\|_{K^{-1}}^2}{P},
$$
$$
\left|\mathbb{E}\left[\|\hat{\theta}\|^2\right] - \partial_\lambda \tilde{\lambda}\; y^T M_{\tilde{\lambda}}\, y\right| \leq \frac{c_2\, \|y\|_{K^{-1}}^2}{P},
$$
where $\partial_\lambda \tilde{\lambda}$ is the derivative of $\tilde{\lambda}$ with respect to $\lambda$ and $M_{\tilde{\lambda}} = K(X, X)(K(X, X) + \tilde{\lambda} I_N)^{-2}$. As a result,
$$
\operatorname{Var}\left(\hat{f}^{(RF)}_{\lambda,\gamma}(x)\right) \leq \frac{c_3 K(x, x)\, \|y\|_{K^{-1}}^2}{P},
$$
where $c_3 > 0$ depends on $\lambda$, $\gamma$, $T$.

Putting the pieces together, we obtain the following bound on the difference $\mathcal{E} = \left|\mathbb{E}[L(\hat{f}^{(RF)}_{\lambda,\gamma})] - L(\hat{f}^{(K)}_{\tilde{\lambda}})\right|$ between the expected RF loss and the KRR loss:

**Corollary 5.2.** If $\mathbb{E}_{\mathcal{D}}[K(x, x)] < \infty$, we have
$$
\mathcal{E} \leq \frac{C_1\, \|y\|_{K^{-1}}}{P}\left(\sqrt{L(\hat{f}^{(K)}_{\tilde{\lambda}})} + C_2\, \|y\|_{K^{-1}}\right),
$$
where $C_1$ and $C_2$ depend on $\lambda$, $\gamma$, $T$, and $\mathbb{E}_{\mathcal{D}}[K(x, x)]$ only.

### 5.2. Double Descent Curve

We now investigate the neighborhood of the frontier $\gamma = 1$ between the under- and overparameterized regimes, which is known empirically to exhibit a double-descent curve: the test error explodes at $\gamma = 1$ (i.e. when $P \approx N$), as exhibited in Figure 3. Thanks to Theorem 5.1, we get a lower bound on the variance of $\hat{f}^{(RF)}_{\lambda,\gamma}$:

**Corollary 5.3.** There exists $c_4 > 0$ depending on $\lambda$, $\gamma$, $T$ only such that
$$
\operatorname{Var}\left(\hat{f}^{(RF)}_{\lambda,\gamma}(x)\right) \geq \frac{\partial_\lambda \tilde{\lambda}\; y^T M_{\tilde{\lambda}}\, y}{P}\, \tilde{K}(x, x) - \frac{c_4 K(x, x)\, \|y\|_{K^{-1}}^2}{P^2}.
$$

If we assume the second term of Corollary 5.3 to be negligible, the only term which depends on $P$ is $\frac{\partial_\lambda \tilde{\lambda}\; y^T M_{\tilde{\lambda}}\, y}{P}$. The derivative $\partial_\lambda \tilde{\lambda}$ has an interesting behavior as a function of $\lambda$ and $\gamma$:

**Proposition 5.4.** For $\gamma > 1$, as $\lambda \to 0$, the derivative $\partial_\lambda \tilde{\lambda}$ converges to $\frac{\gamma}{\gamma - 1}$. As $\lambda\gamma \to \infty$, we have $\partial_\lambda \tilde{\lambda}(\lambda, \gamma) \to 1$.

The explosion of $\partial_\lambda \tilde{\lambda}$ near $(\gamma = 1, \lambda = 0)$ is displayed in Figure (3c).
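The behavior of $\partial_\lambda \tilde{\lambda}$ described in Proposition 5.4 can be visualized by differentiating (9) implicitly: writing $\tilde{\lambda}' = \partial_\lambda \tilde{\lambda}$, one gets $\tilde{\lambda}' = \big(1 - \frac{1}{\gamma N}\sum_i \frac{d_i^2}{(\tilde{\lambda} + d_i)^2}\big)^{-1}$ (our derivation; the closed form is not stated this way in the paper). The sketch below (ours, with a toy spectrum bounded away from zero so that the $\lambda \to 0$ limits are visible at small but nonzero $\lambda$) shows $\tilde{\lambda}'$ approaching $\gamma/(\gamma - 1)$ for $\gamma > 1$ and blowing up as $\gamma \to 1$:

```python
import numpy as np

def effective_ridge(lam, gamma, d, n_iter=20000, tol=1e-14):
    """Fixed-point solution of eq. (9)."""
    tl = lam + np.mean(d) / gamma
    for _ in range(n_iter):
        new = lam + (tl / gamma) * np.mean(d / (tl + d))
        if abs(new - tl) < tol:
            break
        tl = new
    return tl

def d_effective_ridge(lam, gamma, d):
    """d(tilde_lam)/d(lam), obtained by differentiating eq. (9) implicitly."""
    tl = effective_ridge(lam, gamma, d)
    return 1.0 / (1.0 - np.mean(d ** 2 / (tl + d) ** 2) / gamma)

N = 200
d = 1.0 + np.arange(N) / N                 # toy kernel spectrum in [1, 2) (illustrative)
lam = 1e-6
for gamma in [4.0, 2.0, 1.2, 1.05, 1.01]:
    dl = d_effective_ridge(lam, gamma, d)
    print(f"gamma={gamma:5.2f}  d(tilde_lam)/d(lam)={dl:10.2f}  gamma/(gamma-1)={gamma/(gamma-1):8.2f}")
# For gamma > 1 and small lam, the derivative approaches gamma/(gamma-1) and blows up as gamma -> 1,
# which is the term driving the variance lower bound of Corollary 5.3 near the interpolation threshold.
```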
Corollary 5.3 can be used to explain the double-descent curve numerically observed for small $\lambda > 0$: by Proposition 5.4, it is natural to expect that in this regime $\partial_\lambda \tilde{\lambda} \gg 1$ around $\gamma = 1$, so that this term dominates the lower bound of Corollary 5.3 and the variance of $\hat{f}^{(RF)}_{\lambda,\gamma}$ becomes large. Finally, by the bias-variance decomposition, we obtain a sharp increase of the test error around $\gamma = 1$, in line with the results of (Hastie et al., 2019; Mei & Montanari, 2019).

## 6. Conclusion

In this paper, we have identified the implicit regularization arising from the finite sampling of Random Features (RF): using a Gaussian RF model with ridge parameter $\lambda > 0$ ($\lambda$-RF) and feature-to-datapoint ratio $\gamma = \frac{P}{N}$ is essentially equivalent to using Kernel Ridge Regression with an effective ridge $\tilde{\lambda} > \lambda$ ($\tilde{\lambda}$-KRR), which we characterize explicitly. More precisely, we have shown the following:

- The expectation of the $\lambda$-RF predictor is very close to the $\tilde{\lambda}$-KRR predictor (Theorem 4.1).
- The $\lambda$-RF predictor concentrates around its expectation when $\lambda$ is bounded away from zero (Theorem 5.1); this implies in particular that the test errors of the $\lambda$-RF and $\tilde{\lambda}$-KRR predictors are close to each other (Corollary 5.2).

Both theorems are proven using tools from random matrix theory, in particular finite-size results on the concentration of the Stieltjes transform of general Wishart matrix models. While our current proofs require the assumption that the RF model is Gaussian, it seems natural to postulate that the results and the proofs extend to more general setups, along the lines of (Louart et al., 2017; Benigni & Péché, 2019).

Our numerical verifications show that the expected $\lambda$-RF predictor and the $\tilde{\lambda}$-KRR predictor are in excellent agreement. This shows in particular that in order to use RF predictors to approximate KRR predictors with a given ridge, one should choose both the number of features and the explicit ridge appropriately.

*Figure 4. Average test error of the $\lambda$-RF predictor for two values of $N$ and $\lambda = 10^{-4}$ (panel (a): $N = 100$ vs. $N = 1000$). For $N = 1000$, the test error is naturally lower and the cusp at $\gamma = 1$ is narrower than for $N = 100$. The experimental setup is the same as in Figure 2.*

Finally, we investigated the ridgeless limit $\lambda \to 0$. In this case, we see a sharp transition at $\gamma = 1$: in the overparameterized regime $\gamma > 1$, the effective ridge goes to zero, while in the underparameterized regime $\gamma < 1$, it converges to a positive value. At the interpolation threshold $\gamma = 1$, the variance of the $\lambda$-RF predictor explodes, leading to the double-descent curve emphasized in (Advani & Saxe, 2017; Spigler et al., 2018; Belkin et al., 2018; Nakkiran et al., 2019). We investigate this numerically and prove a lower bound yielding a plausible explanation for this phenomenon.

### Thanks and Acknowledgements

The authors would like to thank Andrea Montanari, Song Mei, Lénaïc Chizat, and Alessandro Rudi for helpful discussions. Clément Hongler acknowledges support from the ERC SG CONSTAMIS grant, the NCCR SwissMAP grant, the Minerva Foundation, the Blavatnik Family Foundation, and the Latsis Foundation.

### References

Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.

Bai, Z. and Wang, Z. Large sample covariance matrices without independence structures in columns. Statistica Sinica, 18:425–442, 2008.

Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. arXiv preprint arXiv:1906.11300, 2019.
Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.

Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.

Benigni, L. and Péché, S. Eigenvalue distribution of nonlinear models of random matrices. arXiv preprint arXiv:1904.03090, 2019.

Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

d'Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. Double trouble in double descent: Bias and variance(s) in the lazy regime. arXiv preprint arXiv:2003.01054, 2020.

Dobriban, E. and Wager, S. High-dimensional asymptotics of prediction: Ridge regression and classification. Annals of Statistics, 46(1):247–279, 2018. doi:10.1214/17-AOS1549.

Gabriel, F. Combinatorial theory of permutation-invariant random matrices II: Cumulants, freeness and Lévy processes. arXiv preprint arXiv:1507.02465, 2015.

Geiger, M., Jacot, A., Spigler, S., Gabriel, F., Sagun, L., d'Ascoli, S., Biroli, G., Hongler, C., and Wyart, M. Scaling description of generalization with number of parameters in deep learning. arXiv preprint arXiv:1901.01608, 2019.

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.

Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.

Liu, S. and Dobriban, E. Ridge regression: Structure, cross-validation, and sketching. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HklRwaEKwB.

Louart, C., Liao, Z., and Couillet, R. A random matrix approach to neural networks. The Annals of Applied Probability, 28, 2017. doi:10.1214/17-AAP1328.

Marteau-Ferey, U., Ostrovskii, D., Bach, F., and Rudi, A. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. CoRR, abs/1902.03046, 2019.

Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.

Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien, S., and Mitliagkas, I. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.

Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.

Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pp. 1313–1320, 2009.

Rudi, A. and Rosasco, L. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pp. 3215–3225, 2017.

Schölkopf, B., Smola, A., and Müller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

Silverstein, J. Strong convergence of the empirical distribution of eigenvalues of large dimensional random matrices. Journal of Multivariate Analysis, 55(2):331–339, 1995. doi:10.1006/jmva.1995.1083.

Speicher, R. Free probability and random matrices. 2017.

Spigler, S., Geiger, M., d'Ascoli, S., Sagun, L., Biroli, G., and Wyart, M. A jamming transition from under- to over-parametrization affects loss landscape and generalization. arXiv preprint arXiv:1810.09665, 2018.

Sriperumbudur, B. and Szabó, Z. Optimal rates for random Fourier features. In Advances in Neural Information Processing Systems, pp. 1144–1152, 2015.

Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., and Zhou, Z.-H. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pp. 476–484, 2012.

Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann-Rice, D. N., and Kumar, S. Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, T. Effective dimension and generalization of kernel learning. In Advances in Neural Information Processing Systems, pp. 471–478, 2003.