Published as a conference paper at ICLR 2023

STRONG INDUCTIVE BIASES PROVABLY PREVENT HARMLESS INTERPOLATION

Michael Aerni*1, Marco Milanta*1, Konstantin Donhauser1,2, Fanny Yang1
1Department of Computer Science, ETH Zurich   2ETH AI Center
*Equal contribution; correspondence to research@michaelaerni.com

Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise, a phenomenon often called benign overfitting or harmless interpolation. This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.

1 INTRODUCTION

According to classical wisdom (see, e.g., Hastie et al. (2001)), an estimator that fits noise suffers from overfitting and cannot generalize well. A typical solution is to prevent interpolation, that is, to stop the estimator from achieving zero training error and thereby fit less noise. For example, one can use ridge regularization, or early stopping for iterative algorithms, to obtain a model whose training error is close to the noise level. However, large overparameterized models such as neural networks seem to behave differently: even on noisy data, they may achieve optimal test performance at convergence after interpolating the training data (Nakkiran et al., 2021; Belkin et al., 2019a), a phenomenon referred to as harmless interpolation (Muthukumar et al., 2020) or benign overfitting (Bartlett et al., 2020) and often discussed in the context of double descent (Belkin et al., 2019a). To date, we lack a general understanding of when interpolation is harmless for overparameterized models.

In this paper, we argue that the strength of an inductive bias critically influences whether an estimator exhibits harmless interpolation. An estimator with a strong inductive bias heavily favors simple solutions that structurally align with the ground truth (such as sparsity or rotational invariance). Based on well-established high-probability recovery results for sparse linear regression (Tibshirani, 1996; Candes, 2008; Donoho & Elad, 2006), we expect that models with a stronger inductive bias generalize better than ones with a weaker inductive bias, particularly from noiseless data. In contrast, the effects of inductive bias are much less studied for interpolators of noisy data. Recently, Donhauser et al. (2022) provided a first rigorous analysis of the effects of inductive bias strength on the generalization performance of linear max-ℓp-margin/min-ℓp-norm interpolators. In particular, the authors prove that a stronger inductive bias (p close to 1) not only enhances a model's ability to generalize on noiseless data, but also increases a model's sensitivity to noise, eventually harming generalization when interpolating noisy data.
As a consequence, their result suggests that interpolation might not be harmless when the inductive bias is too strong. In this paper, we confirm this hypothesis and show that strong inductive biases indeed prevent harmless interpolation, while also moving away from sparse linear models. As one example, we consider data where the true labels depend nonlinearly only on input features in a local neighborhood, and vary the strength of the inductive bias via the filter size of convolutional kernels or shallow convolutional neural networks: small filter sizes encourage functions that depend nonlinearly only on local neighborhoods of the input features. As a second example, we also investigate classification for rotationally invariant data, where we encourage different degrees of rotational invariance for neural networks. In particular:

- We prove a phase transition between harmless and harmful interpolation that occurs by varying the strength of the inductive bias via the filter size of convolutional kernels for kernel regression in the high-dimensional setting (Theorem 1).
- We further show that, for a weak inductive bias, not only is interpolation harmless but partially fitting the observation noise is in fact necessary (Theorem 2).
- We show the same phase transition experimentally for neural networks with two common inductive biases: varying convolution filter size, and rotational invariance enforced via data augmentation (Section 4).

From a practical perspective, empirical evidence suggests that large neural networks do not necessarily benefit from early stopping. Our results match those observations for typical networks with a weak inductive bias; however, we caution that strongly structured models must avoid interpolation, even if they are highly overparameterized.

2 RELATED WORK

We now discuss three groups of related work and explain why their theoretical results cannot reflect the phase transition between harmless and harmful interpolation for high-dimensional kernel learning.

Low-dimensional kernel learning: Many recent works (Bietti et al., 2021; Favero et al., 2021; Bietti, 2022; Cagnetta et al., 2022) prove statistical rates for kernel regression with convolutional kernels in low-dimensional settings, but crucially rely on ridge regularization. In general, one cannot expect harmless interpolation for such kernels in the low-dimensional regime (Rakhlin & Zhai, 2019; Mallinar et al., 2022; Buchholz, 2022); positive results exist only for very specific adaptive spiked kernels (Belkin et al., 2019b). Furthermore, techniques developed for low-dimensional settings (see, e.g., Schölkopf et al. (2018)) usually suffer from a curse of dimensionality, that is, the bounds become vacuous in high-dimensional settings where the input dimension grows with the number of samples.

High-dimensional kernel learning: One line of research (Liang et al., 2020; McRae et al., 2022; Liang & Rakhlin, 2020; Liu et al., 2021) tackles high-dimensional kernel learning and proves non-asymptotic bounds using advanced high-dimensional random matrix concentration tools from El Karoui (2010). However, those results heavily rely on a bounded Hilbert norm assumption. This assumption is natural in the low-dimensional regime, but misleading in the high-dimensional regime, as pointed out in Donhauser et al. (2021b).
Another line of research (Ghorbani et al., 2021; 2020; Mei et al., 2021; Ghosh et al., 2022; Misiakiewicz & Mei, 2021; Mei et al., 2022) asymptotically characterizes the precise risk of kernel regression estimators in specific settings with access to a kernel's eigenfunctions and eigenvalues. However, these asymptotic results are insufficient to investigate how varying the filter size of a convolutional kernel affects the risk of a kernel regression estimator. In contrast to both lines of research, we prove tight non-asymptotic matching upper and lower bounds for high-dimensional kernel learning which precisely capture the phase transition described in Section 3.2.

Overfitting of structured interpolators: Several works question the generality of harmless interpolation for models that incorporate strong structural assumptions. Examples include structures enforced via data augmentation (Nishi et al., 2021), adversarial training (Rice et al., 2020; Kamath et al., 2021; Sanyal et al., 2021; Donhauser et al., 2021a), neural network architectures (Li et al., 2021), pruning-based sparsity (Chang et al., 2021), and sparse linear models (Wang et al., 2022; Muthukumar et al., 2020; Chatterji & Long, 2022). In this paper, we continue that line of research and offer a new theoretical perspective to characterize when interpolation is expected to be harmless.

3 THEORETICAL RESULTS

For convolutional kernels, a small filter size induces a strong bias towards estimators that depend nonlinearly on the input features only via small patches. This section analyzes the effect of filter size (as an example inductive bias) on the degree of harmless interpolation for kernel ridge regression. For this purpose, we derive and compare tight non-asymptotic bias and variance bounds as a function of filter size for min-norm interpolators and optimally ridge-regularized estimators (Theorem 1). Furthermore, we prove for large filter sizes that not only does harmless interpolation occur (Theorem 1), but fitting some degree of noise is even necessary to achieve optimal test performance (Theorem 2).

3.1 SETTING

We study kernel regression with a (cyclic) convolutional kernel in a high-dimensional setting where the number of training samples n scales with the dimension of the input data d as n ∈ Θ(d^ℓ). We use the same setting as in previous works on high-dimensional kernel learning such as Misiakiewicz & Mei (2021): we assume that the training samples {x_i, y_i}_{i=1}^n are i.i.d. draws from the distributions x_i ∼ U({−1, 1}^d) and y_i = f*(x_i) + ε_i, with ground truth f* and noise ε_i ∼ N(0, σ²). For simplicity of exposition, we further assume that f*(x) = x_1 x_2 ⋯ x_{L*}, with L* specified in Theorem 1. While the assumptions on the noise and ground truth can easily be extended by following the proof steps in Section 5, generalizing the feature distribution is challenging. Indeed, existing results that establish precise risk characterizations (see Section 2) crucially rely on hypercontractivity of the feature distribution, an assumption so far only proven for few high-dimensional distributions, including the hypersphere (Beckner, 1992) and the discrete hypercube (Beckner, 1975), which we use in this paper. Hypercontractivity is essential to tightly control the empirical kernel matrix within Lemma 3 in Section 5. Generalizations beyond this assumption require the development of new tools in random matrix theory, which we consider important future work.
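The data model above is straightforward to simulate. The following minimal numpy sketch (our own illustration, with toy sizes and function names that do not appear in the paper) draws features uniformly from the hypercube and generates noisy labels from the monomial ground truth f*(x) = x_1 ⋯ x_{L*}.

```python
import numpy as np

def sample_dataset(n, d, L_star=2, sigma=0.5, rng=None):
    """Draw n i.i.d. samples from the setting of Section 3.1:
    x ~ Uniform({-1, +1}^d) and y = f*(x) + eps with f*(x) = x_1 * ... * x_{L*}."""
    rng = np.random.default_rng(rng)
    X = rng.choice([-1.0, 1.0], size=(n, d))      # uniform hypercube features
    f_star = np.prod(X[:, :L_star], axis=1)       # ground truth: degree-L* monomial
    y = f_star + sigma * rng.standard_normal(n)   # additive Gaussian label noise
    return X, y, f_star

# Example (toy sizes only): n roughly d^ell with ell = 2
X, y, f_star = sample_dataset(n=400, d=20, L_star=2, sigma=0.5, rng=0)
```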
We consider (cyclic) convolutional kernels with filter size q ∈ {1, . . . , d} of the form

K(x, x') = \frac{1}{d} \sum_{k=1}^{d} \kappa\!\left(\frac{\langle x^{(k,q)}, x'^{(k,q)} \rangle}{q}\right), \quad \text{where } x^{(k,q)} := [x_{\mathrm{mod}(k,d)}, \ldots, x_{\mathrm{mod}(k+q-1,d)}],   (1)

and κ: [−1, 1] → R is a nonlinear function satisfying standard regularity assumptions (see Assumption 1 in Appendix B) that hold, for instance, for the exponential function. Decreasing the filter size q restricts kernel regression solutions to depend nonlinearly only on local neighborhoods instead of the entire input x. We analyze the kernel ridge regression (KRR) estimator, which is the minimizer of the following convex optimization problem:

\hat{f}_\lambda = \arg\min_{f \in \mathcal{H}} \; \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2,   (2)

where H is the reproducing kernel Hilbert space (RKHS) over {−1, 1}^d generated by the convolutional kernel K in Equation (1), ‖·‖_H the corresponding norm, and λ > 0 the ridge regularization penalty.¹ In the interpolation limit (λ → 0), we obtain the min-RKHS-norm interpolator

\hat{f}_0 = \arg\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}} \quad \text{s.t.} \quad \forall i: f(x_i) = y_i.   (3)

For simplicity, we refer to f̂_0 as the kernel ridge regression estimator with λ = 0. We evaluate all estimators with the expected population risk over the noise, defined as

\mathrm{Risk}(\hat{f}_\lambda) := \underbrace{\mathbb{E}_x\big(\mathbb{E}_\epsilon[\hat{f}_\lambda(x)] - f^*(x)\big)^2}_{:=\mathrm{Bias}^2(\hat{f}_\lambda)} + \underbrace{\mathbb{E}_x \mathbb{E}_\epsilon\big(\hat{f}_\lambda(x) - \mathbb{E}_\epsilon[\hat{f}_\lambda(x)]\big)^2}_{:=\mathrm{Variance}(\hat{f}_\lambda)}.

3.2 MAIN RESULT

We now present tight upper and lower bounds for the prediction error of kernel regression estimators in the setting from Section 3.1. The resulting rates hold for the high-dimensional regime, that is, when both the ambient dimension d and the filter size q scale with n.² We defer the proof to Section 5.

¹Note that previous works show how early-stopped gradient methods on the square loss behave statistically similarly to kernel ridge regression (Raskutti et al., 2014; Wei et al., 2017).
²We hide positive constants that depend at most on ℓ and β (defined in Theorem 1) using the standard Bachmann-Landau notation O(·), Ω(·), Θ(·), as well as ≲, ≳, and use c, c1, . . . as generic positive constants.

Theorem 1 (Non-asymptotic prediction error rates). Let ℓ > 0, β ∈ (0, 1), ℓσ ∈ R. Assume a dataset and a kernel as described in Section 3.1, with the kernel satisfying Assumption 1. Assume further n ∈ Θ(d^ℓ), the filter size q ∈ Θ(d^β), and σ² ∈ Θ(d^{−ℓσ}). Lastly, define

\bar{\delta} := \frac{\ell - 1}{\beta} - \left\lfloor \frac{\ell - 1}{\beta} \right\rfloor \quad \text{and} \quad \delta := \frac{\ell - \ell_\lambda - 1}{\beta} - \left\lfloor \frac{\ell - \ell_\lambda - 1}{\beta} \right\rfloor \quad \text{for any } \ell_\lambda.

Then, with probability at least 1 − c d^{−β min{δ̄, 1−δ̄}} uniformly over all ℓλ ∈ [0, ℓ − 1), the KRR estimate f̂_λ in Equation (2) with max{λ, 1} ∈ Θ(d^{ℓλ}) satisfies

\mathrm{Variance}(\hat{f}_\lambda) \in \Theta\!\left(\sigma^2\, n^{-\frac{\ell_\lambda + \beta \min\{\delta,\, 1-\delta\}}{\ell}}\right).

Further, for a ground truth f*(x) = x_1 x_2 ⋯ x_{L*} with L* ≤ ⌈(ℓ − ℓλ − 1)/β⌉, with probability at least 1 − c d^{−β min{δ̄, 1−δ̄}}, we have

\mathrm{Bias}^2(\hat{f}_\lambda) \in \Theta\!\left(n^{-2 + \frac{2}{\ell}\left(\ell_\lambda + 1 + \beta(L^*-1)\right)}\right).

Finally, by setting ℓλ = 0, both rates also hold for the min-RKHS-norm interpolator f̂_0 in Equation (3).

Note how the theorem reflects the usual intuition for the effects of noise and ridge regularization strength on bias and variance via the parameter ℓλ: with increasing ridge regularization ℓλ (and thus increasing λ), the bias increases and the variance decreases. Similarly, as the noise increases (and thus ℓσ decreases), the variance increases.

Phase transition as a function of β: In the following, we focus on the impact of the filter size q ∈ Θ(d^β) on the risk (sum of bias and variance) via the growth rate β. Recalling that a small filter size (small β) corresponds to a strong inductive bias, and vice versa, Figure 1 demonstrates how the strength of the inductive bias affects generalization. For illustration, we choose the ground truth f*(x) = x_1 x_2 so that the assumption on L* is satisfied for all β.
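Assuming the rates exactly as reconstructed in Theorem 1 above (including our reading of how the σ² factor enters the variance), the exponents plotted in Figure 1 can be recovered numerically. The sketch below uses the interpolator (ℓλ = 0) and the parameters reported in the Figure 1 caption (ℓ = 2, ℓσ = 0.6, f*(x) = x_1 x_2); the function names and the grid are ours.

```python
import numpy as np

def frac(x):
    """Fractional part x - floor(x), as in the definition of delta."""
    return x - np.floor(x)

def interpolator_exponents(beta, ell=2.0, ell_sigma=0.6, L_star=2):
    """Rate exponents (of n) for the min-RKHS-norm interpolator (ell_lambda = 0),
    following the rates of Theorem 1 as stated above."""
    delta = frac((ell - 1.0) / beta)
    alpha_var = -(ell_sigma + beta * min(delta, 1.0 - delta)) / ell   # variance incl. sigma^2
    alpha_bias = -2.0 + (2.0 / ell) * (1.0 + beta * (L_star - 1))     # squared bias
    return alpha_bias, alpha_var

betas = np.linspace(0.05, 0.95, 181)
exps = np.array([interpolator_exponents(b) for b in betas])
risk_exp = exps.max(axis=1)   # risk exponent of Figure 1a: the slower-decaying term dominates

# beta*: smallest beta at which the bias exponent is no longer dominated by the variance
beta_star = betas[np.argmax(exps[:, 0] >= exps[:, 1])]
print(f"estimated phase-transition threshold beta* ~ {beta_star:.2f}")
```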
Specifically, Figure 1a shows the rates for the min-RKHS-norm interpolator f̂_0 and the optimally ridge-regularized estimator f̂_λopt, where we choose λopt to minimize the expected population risk Risk(f̂_λopt). Furthermore, Figure 1b depicts the (statistical) bias and variance of the interpolator f̂_0. At the threshold β* ∈ (0, 1), implicitly defined as the β at which the rates of statistical bias and variance in Theorem 1 match, we observe the following phase transition:

- For β < β*, that is, for a strong inductive bias, the rates in Figure 1a for the optimally ridge-regularized estimator f̂_λopt are strictly better than the ones for the corresponding interpolator f̂_0. In other words, we observe harmful interpolation.
- For β > β*, that is, for a weak inductive bias, the rates in Figure 1a of the optimally ridge-regularized estimator f̂_λopt and the min-RKHS-norm interpolator f̂_0 match. Hence, we observe harmless interpolation.

In the following theorem, we additionally show that interpolation is not only harmless for β > β*, but the optimally ridge-regularized estimator f̂_λopt necessarily fits part of the noise and has a training error strictly below the noise level. In contrast, we show that when interpolation is harmful in Figure 1a, that is, when β < β*, the training error of the optimally ridge-regularized model approaches the noise level.

Theorem 2 (Training error (informal)). Let λopt be such that the expected population risk Risk(f̂_λopt) is minimal, and let β* be the unique threshold³ where the bias and variance bounds in Theorem 1 are of the same order for the interpolator f̂_0 (setting ℓλ = 0). Then, the expected training error converges in probability:

\lim_{n,d \to \infty} \frac{1}{n\sigma^2}\, \mathbb{E}_\epsilon\!\left[\sum_{i=1}^{n}\big(\hat{f}_{\lambda_{\mathrm{opt}}}(x_i) - y_i\big)^2\right] =
\begin{cases} 1 & \beta < \beta^*, \\ c_\beta & \beta \ge \beta^*, \end{cases}

where c_β < 1 for any β > β*. We refer to Appendix D.2 for the proof and a more general statement.

³See Theorem 4 for a more general statement that does not rely on a unique β*.

Figure 1: Illustration of the rates in Theorem 1 for high-dimensional kernel ridge regression as a function of β, the rate of the filter size q ∈ Θ(d^β). (a) Rate exponent α of the Risk ∈ Θ(n^α) for the interpolator f̂_0 vs. the optimally ridge-regularized estimator f̂_λopt. (b) Rate exponent of the variance and bias for the interpolator f̂_0. For both illustrations, we choose ℓ = 2, ℓσ = 0.6, and the ground truth f*(x) = x_1 x_2. Lastly, β* denotes the threshold where the bias and variance terms in Theorem 1 match, and where we observe a phase transition between harmless and harmful interpolation. See Appendix D.1 for technical details.

Bias-variance trade-off: We conclude by discussing how the phase transition arises from a (statistical) bias and variance trade-off for the min-RKHS-norm interpolator as a function of β, reflected in Theorem 1 when setting ℓλ = 0 and illustrated in Figure 1b. While the statistical bias monotonically decreases with decreasing β (i.e., increasing strength of the inductive bias), the variance follows a multiple descent curve with increasing minima as β decreases. Hence, analogous to the observations in Donhauser et al. (2022) for linear max-ℓp-margin/min-ℓp-norm interpolators, the interpolator achieves its optimal performance at a β ∈ (0, 1), and therefore at a moderate inductive bias.
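For concreteness, Equations (1) to (3) can also be instantiated numerically. The following numpy sketch (our illustration, using κ = exp as one choice satisfying Assumption 1 and toy sizes) implements the cyclic convolutional kernel and the KRR predictor in its standard closed form f̂_λ(x) = y⊤(K + λI_n)^{−1}k(x), the same form used in Appendix A.1; letting λ → 0 approximates the min-RKHS-norm interpolator.

```python
import numpy as np

def conv_kernel(X1, X2, q, kappa=np.exp):
    """Cyclic convolutional kernel of Equation (1):
    K(x, x') = (1/d) * sum_k kappa(<x^(k,q), x'^(k,q)> / q),
    where x^(k,q) is the cyclic patch of q coordinates starting at position k."""
    d = X1.shape[1]
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for k in range(d):
        idx = [(k + j) % d for j in range(q)]      # cyclic patch indices
        K += kappa(X1[:, idx] @ X2[:, idx].T / q)
    return K / d

def krr_predict(X_train, y_train, X_test, q, lam):
    """KRR estimate f_hat_lambda(x) = y^T (K + lam*I)^{-1} k(x) from Equation (2);
    lam -> 0 approximates the min-RKHS-norm interpolator of Equation (3)."""
    K = conv_kernel(X_train, X_train, q)
    k_test = conv_kernel(X_train, X_test, q)       # shape (n_train, n_test)
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return alpha @ k_test

# Example usage with the data sketch from Section 3.1 (toy sizes):
# X, y, f_star = sample_dataset(n=400, d=20, L_star=2, sigma=0.5, rng=0)
# preds = krr_predict(X, y, X, q=5, lam=1e-8)      # near-interpolating solution
```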
Finally, we note that Liang et al. (2020) previously observed a multiple descent curve for the variance, but as a function of the input dimension and without any connection to structural biases.

4 EXPERIMENTS

We now empirically study whether the phase transition phenomenon that we prove for kernel regression persists for deep neural networks with feature learning. More precisely, we present controlled experiments to investigate how the strength of a CNN's inductive bias influences whether interpolating noisy data is harmless. In practice, the inductive bias of a neural network varies by way of design choices such as the architecture (e.g., convolutional vs. fully-connected vs. graph networks) or the training procedure (e.g., data augmentation, adversarial training). We focus on two examples: convolutional filter size, which we vary via the architecture, and rotational invariance via data augmentation. To isolate the effects of inductive bias and provide conclusive results, we use datasets where we know a priori that the ground truth exhibits a simple structure that matches the networks' inductive bias. See Appendix E for experimental details.

Analogous to ridge regularization for kernels, we use early stopping as a mechanism to prevent noise fitting. Our experiments compare optimally early-stopped CNNs to their interpolating versions. This highlights a trend that mirrors our theoretical results: the stronger the inductive bias of a neural network grows, the more harmful interpolation becomes. These results suggest exciting future work: proving this trend for models with feature learning.

4.1 FILTER SIZE OF CNNS ON SYNTHETIC IMAGES

In a first experiment, we study the impact of filter size on the generalization of interpolating CNNs. As a reminder, small filter sizes yield functions that depend nonlinearly only on local neighborhoods of the input features. To clearly isolate the effects of filter size, we choose a special architecture on a synthetic classification problem such that the true label function is indeed a CNN with small filter size. Concretely, we generate images of size 32 × 32 containing scattered circles (negative class) and crosses (positive class) of size at most 5 × 5. Thus, decreasing the filter size down to 5 × 5 corresponds to a stronger inductive bias. Motivated by our theory, we hypothesize that interpolating noisy data is harmful with a small filter size, but harmless when using a large filter size.

Figure 2: Convolutional neural network experiments with varying filter size (5, 13, 22, 31) on synthetic image data. (a) Test error: For noisy data, small filter sizes (strong inductive bias) induce a gap between the generalization performance of interpolating (blue) vs. optimally early-stopped models (yellow). The gap vanishes as the inductive bias decreases (i.e., the filter size increases). For noiseless data (dashed), interpolation is always harmless. (b) Training error of the optimally early-stopped model (optimized for test error) on the noisy and clean subsets of a training set with 20% label noise. Under optimal early stopping, models with a strong inductive bias ignore all noisy samples (100% error on the noisy subset), while models with a weak inductive bias fit all noisy training samples (0% error on the noisy subset).
All lines show the mean over five random datasets, and shaded areas the standard error; see Section 4.1 for the experiment setup.

Training setup: In the experiments, we use CNNs with a single convolutional layer, followed by global spatial max pooling and two dense layers. We train those CNNs with different filter sizes on 200 training samples (either noiseless or with 20% label flips) to minimize the logistic loss and achieve zero training error. We repeat all experiments over 5 random datasets with 15 optimization runs per dataset and filter size, and report the average 0-1 error on 100k test samples per dataset. For a detailed discussion of the choice of hyperparameters and more experimental details, see Appendix E.1.

Results: First, the noiseless error curves (dashed) in Figure 2a confirm the common intuition that the strongest inductive bias (matching the ground truth) at filter size 5 yields the lowest test error. More interestingly, for 20% training noise (solid), Figure 2a reveals a phase transition similar to the one in Theorem 1 and confirms our hypothesis: models with weak inductive biases (large filter sizes) exhibit harmless interpolation, as indicated by the matching test error of interpolating (blue) and optimally early-stopped (yellow) models. In contrast, as the filter size decreases, models with a strong inductive bias (small filter sizes) suffer from an increasing gap in test error when interpolating versus using optimal early stopping. Furthermore, Figure 2b reflects the dual perspective of the phase transition as presented in Theorem 2 under optimal early stopping: models with a small filter size entirely avoid fitting training noise, such that the training error on the noisy training subset equals 100%, while models with a large filter size interpolate the noise.

Difference to double descent: One might suspect that our empirical observations simply reflect another form of double descent (Belkin et al., 2019a). As a CNN's filter size increases (the inductive bias becomes weaker), so does the number of parameters and the degree of overparameterization. Thus, double descent predicts vanishing benefits of regularization due to model size for weak inductive biases. Nevertheless, we argue that the phenomenon we observe here is distinct, and provide an extended discussion in Appendix E.3. In short, we choose sufficiently large networks and tune their hyperparameters to ensure that all models interpolate and yield small training loss, even for filter size 5 and 20% training noise. To justify this approach, we repeat a subset of the experiments while significantly increasing the convolutional layer width. As the number of parameters increases for a fixed filter size, double descent would predict that the benefits of optimal early stopping vanish. However, we observe that our phenomenon persists. In particular, for filter size 5 (strongest inductive bias), the test error gap between interpolating and optimally early-stopped models remains large.
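The architecture described above (one convolutional layer, global spatial max pooling, two dense layers) can be sketched as follows. This is a minimal PyTorch illustration, not the paper's exact implementation: the channel and hidden-layer widths are placeholders rather than the tuned hyperparameters of Appendix E.1, and single-channel 32 × 32 inputs are assumed.

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Single conv layer + global spatial max pooling + two dense layers,
    mirroring the architecture described in Section 4.1 (widths are illustrative)."""

    def __init__(self, filter_size: int = 5, channels: int = 128, hidden: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=filter_size, padding="same")
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.AdaptiveMaxPool2d(1),   # global spatial max pooling
            nn.Flatten(),
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),      # single logit for the binary circles-vs-crosses labels
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x))

# Example: one batch of 32x32 grayscale images, trained with the logistic (BCE-with-logits) loss
model = SmallConvNet(filter_size=5)
logits = model(torch.randn(8, 1, 32, 32))
loss = nn.functional.binary_cross_entropy_with_logits(logits.squeeze(1), torch.ones(8))
```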
4.2 ROTATIONAL INVARIANCE OF WIDE RESIDUAL NETWORKS ON SATELLITE IMAGES

In a second experiment, we investigate rotational invariance as an inductive bias for CNNs whenever the true labels are independent of an image's rotation. Our experiments control the inductive bias strength by fitting models on multiple rotated versions of an original training dataset, effectively performing varying degrees of data augmentation.⁴ As an example dataset with a rotationally invariant ground truth, we classify satellite images from the EuroSAT dataset (Helber et al., 2018) into 10 types of land usage. Because the true labels are independent of image orientation, we expect rotational invariance to be a particularly fitting inductive bias for this task.

Figure 3: Varying degrees of rotational invariance when fitting Wide Residual Networks on satellite images. (a) Test error: For noisy data (solid), a strong bias towards rotational invariance (via the number of augmented rotations) induces a gap between the generalization performance of interpolating (blue) vs. optimally early-stopped models (yellow). The gap decreases as the inductive bias decreases (i.e., as the number of rotations decreases). (b) Training error of optimally early-stopped models (w.r.t. the test error) on the noisy and clean subsets of a training set with 20% label noise: for maximum rotational invariance (12 rotations), optimally early-stopped models avoid fitting noisy data (close to 100% training error on the noisy subset), yet no rotational invariance (1 rotation) requires fitting noise (less than 50% training error on the noisy subset) for optimal generalization. All lines show the mean over five optimization runs, and shaded areas the standard error; see Section 4.2 for the experiment setup.

Training and test setup: For computational reasons, we subsample the original EuroSAT training set to 7680 raw training and 10k raw test samples. In the noisy case, we replace 20% of the raw training labels with a wrong label chosen uniformly at random. We then vary the strength of the inductive bias towards rotational invariance by augmenting the dataset with an increasing number k of rotated versions of itself. For each sample, we use k equally spaced angles spanning 360°, plus a random offset. Note that the training noise is applied before the rotations, so that all rotated versions of the same image share the same label. We then center-crop all rotated images such that they only contain valid pixels. In all experiments, we fit Wide Residual Networks (Zagoruyko & Komodakis, 2016) on the augmented training set for 5 different network initializations. We evaluate the 0-1 error on the randomly rotated test samples to avoid distribution shift effects from image interpolation. All random rotations are the same for all experiments and stay fixed throughout training. See Appendix E.2 for more experimental details. Lastly, we perform additional experiments with larger models to differentiate from double descent; see Appendix E.3 for the results and further discussion.

Results: Similar to the previous subsection, Figure 3a corroborates our hypothesis under rotational invariance: stronger inductive biases result in lower test errors on noiseless data, but an increased gap between the test errors of interpolating and optimally early-stopped models. In contrast to filter size, the phase transition is more abrupt; invariance to 180° rotations already prevents harmless interpolation. Figure 3b confirms this from a dual perspective, since all models with some rotational invariance cannot fit noisy samples for optimal generalization.
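The augmentation scheme above can be sketched as follows. This is a minimal torchvision-based illustration under our own assumptions: `labels` is taken to be a sequence of integer class indices, and the crop size that keeps only valid pixels is left as a parameter rather than derived from the EuroSAT image geometry.

```python
import torch
from torchvision.transforms import functional as TF

def rotation_augment(images, labels, k, crop_size, seed=0):
    """Augment a dataset with k rotated copies of each image: k equally spaced
    angles spanning 360 degrees plus a fixed per-image random offset (Section 4.2).
    Labels are copied unchanged, so label noise applied beforehand is shared
    across all rotated versions of the same image."""
    g = torch.Generator().manual_seed(seed)
    offsets = 360.0 * torch.rand(len(images), generator=g)        # fixed random offsets
    aug_images, aug_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        for j in range(k):
            angle = (offsets[i] + 360.0 * j / k) % 360.0
            rotated = TF.rotate(img, angle.item())
            aug_images.append(TF.center_crop(rotated, crop_size))  # keep only valid pixels
            aug_labels.append(lab)
    return torch.stack(aug_images), torch.tensor(aug_labels)
```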
5 PROOF OF THE MAIN RESULT

The proof of the main result, Theorem 1, proceeds in two steps: First, Section 5.1 presents a fixed-design result that yields matching upper and lower bounds for the prediction error of general kernels under additional conditions. Second, in Section 5.2, we show that the setting of Theorem 1 satisfies those conditions with high probability over dataset draws.

⁴Data augmentation techniques can efficiently enforce rotational invariance; see, e.g., Yang et al. (2019).

Notation: Assuming that inputs are draws from a data distribution ν (i.e., x, x' ∼ ν), we can decompose and divide any continuous, positive semi-definite kernel function as

K(x, x') = \sum_{k=1}^{\infty} \lambda_k \psi_k(x)\psi_k(x') = \underbrace{\sum_{k=1}^{m} \lambda_k \psi_k(x)\psi_k(x')}_{:=K_{\le m}(x,x')} + \underbrace{\sum_{k=m+1}^{\infty} \lambda_k \psi_k(x)\psi_k(x')}_{:=K_{>m}(x,x')},   (4)

where {ψ_k}_{k≥1} is an orthonormal basis of the RKHS induced by ⟨f, g⟩_ν := E_{x∼ν}[f(x)g(x)], and the eigenvalues λ_k are sorted in descending order. In the following, we write [·]_{i,j} to refer to the entry in row i and column j of a matrix. Then, we define the empirical kernel matrix for K as K ∈ R^{n×n} with [K]_{i,j} := K(x_i, x_j), and analogously the truncated versions K_{≤m} and K_{>m} for K_{≤m} and K_{>m}, respectively. Next, we utilize the matrices Ψ_{≤m} ∈ R^{n×m} with [Ψ_{≤m}]_{i,l} := ψ_l(x_i), and D_{≤m} := diag(λ_1, . . . , λ_m). We further use the squared kernel S(x, x') := E_{z∼ν}[K(x, z)K(z, x')], its truncated versions S_{≤m} and S_{>m}, as well as the corresponding empirical kernel matrices S, S_{≤m}, S_{>m} ∈ R^{n×n}. Next, for a symmetric positive-definite matrix, we write µ_min(·) and µ_max(·) (or ‖·‖) to indicate the minimum and maximum eigenvalue, respectively, and µ_i(·) for the i-th eigenvalue in decreasing order. Finally, we use ⟨·, ·⟩ for the Euclidean inner product in R^d.

5.1 GENERALIZATION BOUND FOR FIXED-DESIGN

First, Theorem 3 provides tight fixed-design bounds for the prediction error.

Theorem 3 (Generalization bound for fixed-design). Let K be a kernel that under a distribution ν decomposes as K(x, x') = Σ_k λ_k ψ_k(x)ψ_k(x') with E_{x∼ν}[ψ_k(x)ψ_{k'}(x)] = δ_{k,k'}, and let {(x_i, y_i)}_{i=1}^n be a dataset with y_i = f*(x_i) + ε_i for zero-mean σ²-variance i.i.d. noise ε_i and ground truth f*. Define

\tau_1 := \min\left\{\frac{n\lambda_m}{\max\{\lambda,1\}},\, 1\right\}, \quad \tau_2 := \max\left\{\frac{n\lambda_{m+1}}{\max\{\lambda,1\}},\, 1\right\}, \quad r_1 := \frac{\mu_{\min}(K_{>m})+\lambda}{\max\{\lambda,1\}}, \quad r_2 := \frac{\|K_{>m}\|+\lambda}{\max\{\lambda,1\}}.

Then, for any m ∈ N such that r_1 > 0 and

\left\|\frac{\Psi_{\le m}^{\top}\Psi_{\le m}}{n} - I_m\right\| \le \frac{1}{2},   (5)

the KRR estimate f̂_λ in Equation (2) for any λ ≥ 0 has a variance upper and lower bounded by

\frac{r_1^2 \tau_1^2}{2 r_2^2 (1.5 + r_1)^2}\,\frac{m}{n} + \frac{\sum_{i=m+1}^{n}\mu_i(S_{>m})}{r_2^2 \max\{\lambda,1\}^2} \;\le\; \mathrm{Variance}(\hat{f}_\lambda)/\sigma^2 \;\le\; \frac{6 r_2^2}{r_1^2}\,\frac{m}{n} + \frac{\mathrm{Tr}(S_{>m})}{r_1^2 \max\{\lambda,1\}^2}.

Furthermore, for any ground truth that can be expressed as f*(x) = Σ_{k=1}^m a_k ψ_k(x) with a ∈ R_+^m and ψ_k as defined in Equation (4), the bias is upper and lower bounded by

\frac{r_1^2 \tau_1^2}{(1.5 + r_1)^2}\,\frac{\max\{\lambda,1\}^2 \|D_{\le m}^{-1} a\|^2}{n^2} \;\le\; \mathrm{Bias}^2(\hat{f}_\lambda) \;\le\; 4\left(r_2^2 + \frac{1.5\, r_2^3}{r_1^2}\right)\tau_2\,\frac{\max\{\lambda,1\}^2 \|D_{\le m}^{-1} a\|^2}{n^2}.

See Appendix A.1 for the proof. Note that this result holds for any fixed-design dataset. We derive the main ideas from the proofs in Bartlett et al. (2020); Tsigler & Bartlett (2020), where the authors establish tight bounds for the min-ℓ2-norm interpolator on independent sub-Gaussian features.

Remark 1 (Comparison with McRae et al. (2022)). The upper bound for the bias in Theorem 1 of McRae et al. (2022) depends on the suboptimal term ‖D_{≤m}^{-1/2} a‖/n, but also applies to more general ground truths. We improve that upper bound in Theorem 3 and present a matching lower bound.
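The fixed-design bias and variance bounded in Theorem 3 have simple exact expressions, namely Equations (6) and (7) in the proof of Appendix A.1: with H = K + λI_n, the noise-averaged prediction is f̂_λ(x) = f*^⊤H^{−1}k(x) and the variance equals σ² Tr(H^{−1}SH^{−1}). The sketch below (our illustration) evaluates both by Monte Carlo in the hypercube setting; it assumes a kernel-matrix function such as the `conv_kernel` sketch from Section 3, and hard-codes f*(x) = x_1 x_2 for simplicity.

```python
import numpy as np

def fixed_design_bias_variance(X, kernel, lam, sigma2, n_mc=20000, rng=0):
    """Monte Carlo evaluation of the exact fixed-design bias and variance of KRR:
      Bias^2  = E_x[(f*(x) - f*_train^T H^{-1} k(x))^2],
      Variance = sigma^2 * Tr(H^{-1} S H^{-1}),  with S_ij = E_z[K(x_i, z) K(z, x_j)],
    where H = K + lam * I and kernel(A, B) returns the kernel matrix between rows of A and B."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    H_inv = np.linalg.inv(kernel(X, X) + lam * np.eye(n))

    Z = rng.choice([-1.0, 1.0], size=(n_mc, d))    # fresh points from U({-1, 1}^d)
    K_xz = kernel(X, Z)                            # shape (n, n_mc)
    S = K_xz @ K_xz.T / n_mc                       # Monte Carlo squared-kernel matrix
    variance = sigma2 * np.trace(H_inv @ S @ H_inv)

    f_star = lambda A: np.prod(A[:, :2], axis=1)   # f*(x) = x_1 x_2 (illustrative choice)
    preds_noiseless = f_star(X) @ H_inv @ K_xz     # E_eps[f_hat(z)] at the fresh points
    bias2 = np.mean((f_star(Z) - preds_noiseless) ** 2)
    return bias2, variance
```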
5.2 PROOF OF THEOREM 1

Throughout the remainder of the proof, all statements hold for the setting in Section 3.1 and under the assumptions of Theorem 1, in particular Assumption 1 on K, n ∈ Θ(d^ℓ), and max{λ, 1} ∈ Θ(d^{ℓλ}). Furthermore, as in Theorem 1, we use δ, δ̄ > 0 with

\delta := \frac{\ell - \ell_\lambda - 1}{\beta} - \left\lfloor \frac{\ell - \ell_\lambda - 1}{\beta} \right\rfloor \quad \text{and} \quad \bar{\delta} := \frac{\ell - 1}{\beta} - \left\lfloor \frac{\ell - 1}{\beta} \right\rfloor.

We show that there exists a particular m for which the conditions in Theorem 3 hold and we can control the terms in the bias and variance bounds.

Step 1: Conditions for the bounds in Theorem 3. We first derive sufficient conditions on m such that the conditions on Ψ_{≤m} and f* in Theorem 3 hold with high probability. The following standard concentration result shows that all m of order at most n q^{−δ̄} satisfy Equation (5) with high probability.

Lemma 1 (Corollary of Theorem 5.44 in Vershynin (2012)). For d large enough, with probability at least 1 − c d^{−β δ̄}, all m ∈ O(n q^{−δ̄}) satisfy Equation (5).

See Appendix C.1 for the proof. Simultaneously, to ensure that f* is contained in the span of the first m eigenfunctions, m must be sufficiently large. We formalize this in the following lemma.

Lemma 2 (Bias bound condition). Consider a kernel as in Theorem 1 satisfying Assumption 1, and a ground truth f*(x) = ∏_{j=1}^{L*} x_j with 1 ≤ L* ≤ ⌈(ℓ − ℓλ − 1)/β⌉. Then, for any m with λ_m ∈ o(1/(d q^{L*−1})) and d sufficiently large, f* is in the span of the first m eigenfunctions and ‖D_{≤m}^{-1} a‖ ∈ Θ(d q^{L*−1}).

See Appendix B.3 for the proof. Note that L* ≥ 1 follows from ℓλ < ℓ − 1 in Theorem 1, and allows us to focus on non-trivial ground truth functions.

Step 2: Concentration of the (squared) kernel matrix. In a second step, we show that there exists a set of m that satisfy the sufficient conditions, and for which the spectra of the kernel matrix K_{>m} and the squared kernel matrix S_{>m} concentrate.

Lemma 3 (Tight bound conditions). With probability at least 1 − c d^{−β min{δ̄, 1−δ̄}} uniformly over all ℓλ ∈ [0, ℓ − 1), for d sufficiently large and any λ ∈ R, m ∈ N with max{λ, 1} ∈ Θ(d^{ℓλ}), nλ_m ∈ Θ(max{λ, 1}), and m ∈ O(n q^{−δ}/max{λ, 1}), we have

r_1, r_2 \in \Theta(1), \quad \mathrm{Tr}(S_{>m}) \in O\!\left(d^{\ell_\lambda} q^{-\delta} + d^{\ell_\lambda} q^{-(1-\delta)}\right), \quad \sum_{i=m+1}^{n}\mu_i(S_{>m}) \in \Omega\!\left(d^{\ell_\lambda} q^{-(1-\delta)}\right).

We refer to Appendix C.3 for the proof, which heavily relies on the feature distribution. Note that the results for m in Lemma 3 also imply τ_1, τ_2 ∈ Θ(1).

Step 3: Completing the proof. Finally, we complete the proof by showing the existence of a particular m that simultaneously satisfies all conditions of Lemmas 1 to 3.

Lemma 4 (Eigendecay). There exists an m such that nλ_m ∈ Θ(max{λ, 1}) and m ∈ Θ(n q^{−δ}/max{λ, 1}). Furthermore, assuming L* ≤ ⌈(ℓ − ℓλ − 1)/β⌉, we have λ_m ∈ o(1/(d q^{L*−1})).

We refer to Appendix B.4 for the proof. As a result, we can use Lemmas 1 to 4 to instantiate Theorem 3 for the setting in Theorem 1, resulting in the following tight high-probability bounds for variance and bias:

\mathrm{Variance}(\hat{f}_\lambda)/\sigma^2 \in \Theta\!\left(d^{-\ell_\lambda} q^{-\delta} + d^{-\ell_\lambda} q^{-(1-\delta)}\right), \qquad \mathrm{Bias}^2(\hat{f}_\lambda) \in \Theta\!\left(d^{-2(\ell - \ell_\lambda - 1 - \beta(L^*-1))}\right).

Reformulating the bounds in terms of n then concludes the proof of Theorem 1.

6 SUMMARY AND OUTLOOK

In this paper, we highlight how the strength of an inductive bias impacts generalization. Concretely, we study when the gap in test error between interpolating models and their optimally ridge-regularized or early-stopped counterparts is zero, that is, when interpolation is harmless.
In particular, we prove a phase transition for kernel regression using convolutional kernels with different filter sizes: a weak inductive bias (large filter size) yields harmless interpolation, and even requires fitting noise for optimal test performance, whereas a strong inductive bias (small filter size) suffers from suboptimal generalization when interpolating noise. Intuitively, this phenomenon arises from a bias-variance trade-off, captured by our main result in Theorem 1: with increasing inductive bias, the risk on noiseless training samples (bias) decreases, while the sensitivity to noise in the training data (variance) increases. Our empirical results on neural networks suggest that this phenomenon extends to models with feature learning, which opens up an avenue for exciting future work.

ACKNOWLEDGMENTS

K.D. is supported by the ETH AI Center and the ETH Foundations of Data Science. We would further like to thank Afonso Bandeira for insightful discussions.

REFERENCES

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117:30063-30070, 2020.

William Beckner. Inequalities in Fourier Analysis. Annals of Mathematics, 102:159-182, 1975.

William Beckner. Sobolev inequalities, the Poisson semigroup, and analysis on the sphere S^n. Proceedings of the National Academy of Sciences, 89:4816-4819, 1992.

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116:15849-15854, 2019a.

Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In Proceedings of the Conference on Learning Theory (COLT), 2019b.

Alberto Bietti. Approximation and Learning with Deep Convolutional Models: a Kernel Perspective. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.

Alberto Bietti, Luca Venturi, and Joan Bruna. On the Sample Complexity of Learning under Geometric Stability. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Simon Buchholz. Kernel interpolation in Sobolev spaces is not consistent in low dimensions. In Proceedings of the Conference on Learning Theory (COLT), 2022.

Francesco Cagnetta, Alessandro Favero, and Matthieu Wyart. What can be learnt with wide convolutional networks? arXiv preprint arXiv:2208.01003, 2022.

Emmanuel J. Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathématique, 346:589-592, 2008.

Xiangyu Chang, Yingcong Li, Samet Oymak, and Christos Thrampoulidis. Provable benefits of overparameterization in model compression: From double descent to pruning neural networks. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2021.

Niladri S. Chatterji and Philip M. Long. Foolish Crowds Support Benign Overfitting. Journal of Machine Learning Research (JMLR), 23:1-12, 2022.

Konstantin Donhauser, Alexandru Tifrea, Michael Aerni, Reinhard Heckel, and Fanny Yang. Interpolation can hurt robust generalization even when there is no noise. In Advances in Neural Information Processing Systems (NeurIPS), 2021a.

Konstantin Donhauser, Mingqi Wu, and Fanny Yang. How rotational invariance of common kernels prevents generalization in high dimensions. In Proceedings of the International Conference on Machine Learning (ICML), 2021b.
Konstantin Donhauser, Nicolò Ruggeri, Stefan Stojanovic, and Fanny Yang. Fast rates for noisy interpolation require rethinking the effect of inductive bias. In Proceedings of the International Conference on Machine Learning (ICML), 2022.

David L. Donoho and Michael Elad. On the stability of the basis pursuit in the presence of noise. Signal Processing, 86:511-532, 2006.

Noureddine El Karoui. The spectrum of kernel random matrices. Annals of Statistics, 38:1-50, 2010.

Alessandro Favero, Francesco Cagnetta, and Matthieu Wyart. Locality defeats the curse of dimensionality in convolutional teacher-student scenarios. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? In Advances in Neural Information Processing Systems (NeurIPS), 2020.

Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. Annals of Statistics, 49:1029-1054, 2021.

Nikhil Ghosh, Song Mei, and Bin Yu. The Three Stages of Learning Dynamics in High-dimensional Kernel Methods. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York, 2001.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Introducing EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. In Proceedings of the IEEE International Symposium on Geoscience and Remote Sensing (IGARSS), 2018.

Sandesh Kamath, Amit Deshpande, Subrahmanyam Kambhampati Venkata, and Vineeth N Balasubramanian. Can We Have It All? On the Trade-off between Spatial and Adversarial Robustness of Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Jingling Li, Mozhi Zhang, Keyulu Xu, John Dickerson, and Jimmy Ba. How Does a Neural Network's Architecture Impact Its Robustness to Noisy Labels? In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel ridgeless regression can generalize. Annals of Statistics, 48:1329-1347, 2020.

Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In Proceedings of the Conference on Learning Theory (COLT), pp. 2683-2711, 2020.

Fanghui Liu, Zhenyu Liao, and Johan Suykens. Kernel regression in high dimensions: Refined analysis beyond double descent. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2021.

Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, and Preetum Nakkiran. Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. arXiv preprint arXiv:2207.06569, 2022.

Andrew D. McRae, Santhosh Karnik, Mark Davenport, and Vidya K. Muthukumar. Harmless interpolation in regression and classification with structured features. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Learning with invariances in random features and kernel models. In Proceedings of the Conference on Learning Theory (COLT), 2021.

Song Mei, Theodor Misiakiewicz, and Andrea Montanari.
Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration. Applied and Computational Harmonic Analysis, 59:3-84, 2022.

Theodor Misiakiewicz and Song Mei. Learning with convolution and pooling operations in kernel methods. arXiv preprint arXiv:2111.08308, 2021.

Vidya K. Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1:67-83, 2020.

Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021:124003, 2021.

Kento Nishi, Yi Ding, Alex Rich, and Tobias Hollerer. Augmentation Strategies for Learning With Noisy Labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

Alexander Rakhlin and Xiyu Zhai. Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. In Proceedings of the Conference on Learning Theory (COLT), 2019.

Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research (JMLR), 15:335-366, 2014.

Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In Proceedings of the International Conference on Machine Learning (ICML), 2020.

Amartya Sanyal, Puneet K. Dokania, Varun Kanade, and Philip Torr. How Benign is Benign Overfitting? In Proceedings of the International Conference on Learning Representations (ICLR), 2021.

Bernhard Schölkopf, Alexander J. Smola, and Francis Bach. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2018.

Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, 58:267-288, 1996.

Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286, 2020.

Roman Vershynin. Compressed Sensing: Theory and Applications, chapter Introduction to the non-asymptotic analysis of random matrices, pp. 210-268. Cambridge University Press, 2012.

Guillaume Wang, Konstantin Donhauser, and Fanny Yang. Tight bounds for minimum ℓ1-norm interpolation of noisy data. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

Yuting Wei, Fanny Yang, and Martin J. Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Fanny Yang, Zuowen Wang, and Christina Heinze-Deml. Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016.

A GENERALIZATION BOUND FOR FIXED-DESIGN

We prove Theorem 3 by deriving a closed-form expression for the bias and variance, and bounding them individually. Hence, the proof does not rely on any matrix concentration results.
A.1 PROOF OF THEOREM 3 It is well-known that the KRR problem defined in Equation (2) yields the estimator ˆfλ(x) = y H 1k(x), where x, xi ν, y := f +ϵ = [f (x1)+ϵ1, . . . , f (xn)+ϵn] , k(x) := [K(x1, x), . . . , K(xn, x)] , and H := K + λIn. For this estimator, both bias and variance exhibit a closed-form expression: Bias2( ˆfλ) = Ex ν[ f (x) Eϵ[(f + ϵ) H 1k(x)] 2] = Ex ν[ f (x) f H 1k(x) 2] = Ex ν f (x)2 2f H 1Ex ν[f (x)k(x)] + f H 1Ex ν[k(x)k(x) ]H 1f (i) = a a 2a Ψ m H 1Ψ m D ma + a Ψ m H 1SH 1Ψ ma, (6) Variance( ˆfλ)/σ2 = 1 σ2 Ex νEϵ[ (f + ϵ) H 1k(x) Eϵ[(f + ϵ) H 1k(x)] 2] σ2 Ex νEϵ[ϵ H 1k(x)k(x) H 1ϵ] i,j E[ϵiϵj] H 1Ex ν[k(x)k(x) ]H 1 (i) = Tr H 1SH 1 . (7) Step (i) uses the definition of the squared kernel S, that f (x) = Pm k=1 akψk(x) as f is in the span of the first m eigenfunctions, and the following consequences of the eigenfunctions orthonormality: Ex ν f (x)2 = k,k =1 akak Ex ν [ψk(x)ψk (x)] = k,k =1 akak δk,k = a 2 2, Ex ν [[f (x)k(x)]i] = k =1 akλk Ex ν [ψk(x)ψk (x)] ψk (xi) k=1 akλkψk(xi) = [Ψ m D ma]i . We now bound the closed-form expressions of bias and variance individually. Bias Lemma 5 below yields S = Ψ m D2 mΨ m + S>m. Hence, the bias decomposes into Bias2( ˆfλ) = a a 2a Ψ m H 1Ψ m D ma + a Ψ m H 1Ψ m D2 mΨ m H 1Ψ ma + a Ψ m H 1S>m H 1Ψ ma = Im D mΨ m H 1Ψ m a 2 | {z } :=B1 + a Ψ m H 1S>m H 1Ψ ma | {z } :=B2 First, we rewrite B1 as B1 = Im D mΨ m H 1Ψ m a 2 D m D mΨ m Ψ m D mΨ m + H>m 1 Ψ m D m (ii) = D 1 m + Ψ m H 1 >mΨ m 1 D 1 ma Published as a conference paper at ICLR 2023 where (i) uses the decomposition H = K m + H>m with H>m := K>m + λIn, and (ii) applies the Woodbury matrix identity. Second, we can upper-bound B2 as follows: B2 S>m a Ψ m H 2Ψ ma = S>m D 1 ma D mΨ m H 2Ψ m D m D 1 ma (i) = S>m a D 1 m D 1 m + Ψ m H 1 >mΨ m 1 Ψ m H 2 >mΨ m D 1 m + Ψ m H 1 >mΨ m 1 D 1 ma S>m Ψ m H 2 >mΨ m | {z } :=C1 D 1 m + Ψ m H 1 >mΨ m 1 D 1 ma where (i) uses Lemma 6. Thus, the bias can be bounded by B1(1 + C1) with C1 0. We proceed by upper bounding C1: 1 + C1 1 + n S>m µmin (H>m)2 (i) 1 + 1.5 nλm+1 K>m (µmin (K>m) + λ)2 1 + 1.5nλm+1 ( K>m + λ) (µmin (K>m) + λ)2 1 + 1.5 nλm+1 max{λ, 1} max{λ, 1}2 (µmin (K>m) + λ)2 K>m + λ nλm+1 max{λ, 1} (ii) 1 + 1.5r2 max nλm+1 max{λ, 1}, 1 = 1 + 1.5r2 where (i) uses Equation (5) to bound Ψ mΨ m/n and Lemma 7 to bound S>m , and (ii) follows from cx + 1 (c + 1) max{x, 1} for c 0. Hence, to conclude the bias bound, we need to bound B1 in Equation (8) from above and below. Upper bound: D 1 m n + Ψ m H 1 >mΨ m n µmin Ψ mΨ m n 2 D 1 ma 2 (i) 4 ( K>m + λ)2 D 1 ma 2 = 4r2 2 max{λ, 1}2 D 1 ma 2 where (i) follows from Equation (5). Combining Equation (9) in (i) and Equation (10) in (ii) yields the desired upper bound on the bias: Bias2 B1+B2 (1+C1)B1 (i) 1 + 1.5r2 (ii) 4 1 + 1.5r2 r2 2τ2 max{λ, 1}2 D 1 ma 2 Published as a conference paper at ICLR 2023 Lower bound: D 1 m n + Ψ m H 1 >mΨ m n 1 1 nλm + 1 µmin(H>m) Ψ mΨ m n 2 D 1 ma 2 nλm max{λ,1} 1.5 max{λ,1} µmin(K>m)+λ nλm max{λ,1} + 1 max{λ, 1}2 D 1 ma 2 (ii) r1 1.5 + r1 2 min nλm max{λ, 1}, 1 2 max{λ, 1}2 D 1 ma 2 = r1 1.5 + r1 2 τ 2 1 max{λ, 1}2 D 1 ma 2 where (i) follows from Equation (5), and (ii) from the fact that x cx+1 1 1+c min{x, 1} for x, c 0. Since Bias2 B1, this concludes the lower bound for the bias. 
Variance As for the bias bound, we first apply Lemma 5 to write S = Ψ m D2 mΨ m + S>m and decompose the variance in Equation (7) into Variance( ˆfλ)/σ2 = Tr H 1Ψ m D2 mΨ m H 1 | {z } :=V1 + Tr H 1S>m H 1 | {z } :=V2 Next, we rewrite V1 as follows: V1 = Tr H 1Ψ m D2 mΨ m H 1 = Tr D mΨ m H 2Ψ m D m (i) = Tr D 1 m + Ψ m H 1 >mΨ m 1 Ψ m H 2 >mΨ m D 1 m + Ψ m H 1 >mΨ m 1 D 1 m n + Ψ m H 1 >mΨ m n ! 1 Ψ m H 2 >mΨ m n D 1 m n + Ψ m H 1 >mΨ m n where (i) follows from Lemma 6. To bound V1 and V2, we use the fact that the trace is the sum of all eigenvalues. Therefore, the trace is bounded from above and below by the size of the matrix times the largest and smallest eigenvalue, respectively. This yields the following bounds for V1: D 1 m n + Ψ m H 1 >mΨ m n ! 1 Ψ m H 2 >mΨ m n D 1 m n + Ψ m H 1 >mΨ m n n 4r2 2 max{λ, 1}2 Ψ mΨ m/n µmin (H>m)2 n r2 2 max{λ, 1}2 (µmin (K>m) + λ)2 Published as a conference paper at ICLR 2023 D 1 m n + Ψ m H 1 >mΨ m n ! 1 Ψ m H 2 >mΨ m n D 1 m n + Ψ m H 1 >mΨ m n r1 1.5 + r1 2 τ 2 1 max{λ, 1}2 µmin Ψ mΨ m/n ( K>m + λ)2 r1 1.5 + r1 2 τ 2 1 max{λ, 1}2 ( K>m + λ)2 = 1 r1 r2(1.5 + r1) For (i) and (iii), we bound the terms analogously to the upper and lower bound of B1, and use Equation (5) in (ii) and (iv). Next, the upper bound on V2 follows from a special case of Hölder s inequality as follows: V2 = Tr H 1S>m H 1 Tr (S>m) H 2 µmin (K m + H>m)2 Tr (S>m) µmin (H>m)2 = Tr (S>m) r2 1 max{λ, 1}2 . For the lower bound, we need a more accurate analysis. First, we apply the identity H 1 = (H>m + K m) 1 = H 1 >m H 1 >m K m(H>m + K m) 1 | {z } :=A which is valid since H and H>m are full rank. Next, note that the rank of A is bounded by the rank of K m, which can be written as Ψ m D mΨ m and therefore has itself rank at most m. Furthermore, Equation (5) implies that Ψ mΨ m has full rank, and hence m n. Let now {v1, . . . , vrank(A)} be an orthonormal basis of col(A), and let {vrank(A)+1, . . . , vn} be an orthonormal basis of col(A) . Hence, {v1, . . . , vn} is an orthonormal basis of Rn, and similarity invariance of the trace yields V2 = Tr H 1S>m H 1 = i=1 v i H 1S>m H 1vi i=rank(A)+1 v i H 1S>m H 1vi i=rank(A)+1 v i (H 1 >m A)S>m(H 1 >m A)vi i=rank(A)+1 v i H 1 >m S>m H 1 >mvi i=rank(A)+1 v i S>mvi, = 1 H>m 2 Tr (P S>m P) , where P is the projection matrix of Rn onto col(A) , and (i) follows from Equation (12). Step (ii) uses that, for all i > rank(A), vi is orthogonal to the column space of A, and hence v i (H 1 >m A)S>m(H 1 >m A)vi =v i H 1 >m S>m H 1 >mvi v i A |{z} =0 S>m H 1 >mvi v i H 1 >m S>m Avi |{z} =0 + v i AS>m Avi | {z } =0 Published as a conference paper at ICLR 2023 Finally, let µi( ) be the i-th eigenvalue of its argument with respect to a decreasing order. Then, the Cauchy interlacing theorem yields µrank(A)+i(S m) µi(P S>m P) for all i = 1, . . . , n rank(A). This implies Tr (P S>m P) = n rank(A) X i=1 µi(P S>m P) n rank(A) X i=1 µrank(A)+i(S>m) i=1+rank(A) µi(S>m) (i) i=1+m µi(S>m), where (i) uses that the rank of A is bounded by m. This concludes the lower bound on V2 as follows: V2 = Tr H 1S>m H 1 Pn i=m+1 µi(S m) Pn i=m+1 µi(S m) r2 2 max{λ, 1}2 . A.2 TECHNICAL LEMMAS Lemma 5 (Squared kernel decomposition). Let K be a kernel function that under a distribution ν can be decomposed as K(x, x ) = P k λkψk(x)ψk(x ), where Ex ν[ψk(x)ψk (x)] = δk,k . Then, the squared kernel S(x, x ) = Ez ν[K(x, z)K(x , z)] can be written as S(x, x ) = X k λ2 kψk(x)ψk(x ), and for any m > 0, the corresponding kernel matrix can be written as S = Ψ m D2 mΨ m + S>m. Proof. 
The statement simply follows from S(x, x ) = X k,k λkλk ψk(x) Ez ν [ψk(z)ψk (z)] | {z } δk,k ψk (x ) = X k λ2 kψk(x)ψk(x ). Lemma 6 (Corollary of Lemma 20 in Bartlett et al. (2020)). D mΨ m H 2Ψ m D m = D 1 m + Ψ m H 1 >mΨ m 1 Ψ m H 2 >mΨ m D 1 m + Ψ m H 1 >mΨ m 1 D mΨ m H 2Ψ m D m = D1/2 m D1/2 mΨ m H 2Ψ m D1/2 m D1/2 m (i) = D1/2 m Im + D1/2 mΨ m H 1 >mΨ m D1/2 m 1 D1/2 mΨ m H 2 >mΨ m D1/2 m Im + D1/2 mΨ m H 1 >mΨ m D1/2 m 1 D1/2 m = D 1 m + Ψ m H 1 >mΨ m 1 Ψ m H 2 >mΨ m D 1 m + Ψ m H 1 >mΨ m 1 , where (i) applies Lemma 20 from Bartlett et al. (2020). Lemma 7 (Squared kernel tail). For m > 0, let S>m be the kernel matrix of the truncated squared kernel S>m = P k>m λ2 kψk(x)ψk(x ), and let K>m be the kernel matrix of the truncated original kernel K>m = P k>m λkψk(x)ψk(x ). Then, S>m λm+1 K>m . Published as a conference paper at ICLR 2023 Proof. We show that for any vector v, v S>mv λm+1v K>mv, which implies the claim. To do so, we define Ψk Rn n with [Ψk]i,j = ψk(xi)ψk(xj) for all k > m. Then we can write K>m = P k>m λkΨk and S>m = P k>m λ2 kΨk. Since the eigenvalues are in decreasing order, we have λk λm+1 for any k > m, and thus k>m λ2 kv Ψkv λm+1 X k>m λkv Ψkv = λm+1v K>mv. B CONVOLUTIONAL KERNELS ON THE HYPERCUBE First, Appendix B.1 provides a way to decompose general functions for features distributed uniformly on the hypercube. Next, Appendix B.2 uses those results to characterize the eigenfunctions and eigenvalues of cyclic convolutional kernels. Finally, Appendices B.3 and B.4 apply this characterization to prove Lemmas 2 and 4, respectively. B.1 GENERAL FUNCTIONS ON THE HYPERCUBE This subsection focuses on the main setting in our paper: the hypercube domain { 1, 1}d together with the uniform probability distribution, previously studied in Misiakiewicz & Mei (2021). For any S {1, . . . , d}, we define the polynomial j S [x]j (13) of degree |S|, where [x]j is the j-th entry of x { 1, 1}d. It is easy to see that {YS}S {1,...,d} is set of orthonormal functions with respect to the inner product f, g { 1,1}d := Ex U({ 1,1}d)[f(x)g(x)]. Those functions play a key role in the remainder of our proof; as it turns out, they are the eigenfunctions of the kernel in Section 3.1. Towards a formal statement, define the polynomials := 1 B(l, d) |S|=l YS(x)YS(x ), (14) where B(l, d) := |{S {1, . . . , d} | |S| = l}| = d l Note that G(d) l only depends on the (Euclidean) inner product of x and x . Furthermore, {G(d) l }d l=0 is a set of orthonormal polynomials with respect to the distribution of x, x / d. The following lemma shows how such polynomials form an eigenbasis for functions that only depend on the inner product between points in the unit hypercube. Lemma 8 (Local kernel decomposition). Let κ: R R be any function and d N>0. Then, for any x, x { 1, 1}d, we can decompose κ( x, x /d) as l=0 ξ(d) l 1 B(l, d) |S|=l YS(x)YS(x ). (16) Proof. Note that the decomposition only needs to hold at the evaluation of κ in the values that x, x /d can take, that is, κ computed in { 1, 1 + 2/d, . . . , 2/d + 1, 1}. Since that set has a cardinality d+1, we can write κ as a linear combination of d+1 uncorrelated functions. In particular, {G(d) l }d l=0 is a set of such functions with respect to the distribution of x, x / d, and hence l=0 cl G(d) l for some (unknown) coefficients cl. Finally, the proof follows by expanding the definition of G(d) l in Equation (14) and choosing ξ(d) l = cl. 
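As a quick sanity check (our own illustration, not from the paper), the orthonormality of the monomials Y_S from Equation (13) with respect to the uniform hypercube distribution can be verified numerically: for S ≠ S', the product Y_S Y_{S'} is itself a non-constant monomial and averages to zero.

```python
import itertools
import numpy as np

def Y(S, X):
    """Monomial eigenfunction Y_S(x) = prod_{j in S} x_j from Equation (13);
    X has one point of {-1, 1}^d per row, S is a tuple of coordinate indices."""
    return np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))

# Monte Carlo check of orthonormality: E[Y_S(x) Y_S'(x)] = 1 if S = S', else 0.
rng = np.random.default_rng(0)
d, n_mc = 8, 200_000
X = rng.choice([-1.0, 1.0], size=(n_mc, d))
subsets = [S for r in range(3) for S in itertools.combinations(range(d), r)]
for S in subsets[:4]:
    for T in subsets[:4]:
        inner = np.mean(Y(S, X) * Y(T, X))      # approximates <Y_S, Y_T> on {-1,1}^d
        assert abs(inner - float(S == T)) < 0.02
```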
Published as a conference paper at ICLR 2023 B.2 CONVOLUTIONAL KERNELS ON THE HYPERCUBE While the previous subsection considers general functions on the hypercube, we now focus on convolutional kernels and their eigenvalues. This yields the tools to prove Lemmas 2 and 4. In order to characterize eigenvalues, we first follow existing literature such as Misiakiewicz & Mei (2021) and introduce useful quantities. Let S {1, . . . , d}. The diameter of S is γ(S) := max i,j S min {mod (j i, d) + 1, mod (i j, d) + 1} for S = , and γ( ) = 0. Furthermore, we define C(l, q, d) := |{S {1, . . . , d} | |S| = l, γ(S) q}|. (17) Intuitively, the diameter of S is the smallest number of contiguous feature indices that fully contain S. The following lemma yields an explicit formula for C(l, q, d), that is, the number of sets of size l with diameter at most q. Lemma 9 (Number of overlapping sets). Let l, q, d N with l q < d/2. Then, C(l, q, d) = ( d q 1 l 1 l > 0, 1 l = 0. Proof. Since the result holds trivially for l = 0 and l = 1, we henceforth focus on l 2. Let C(l, γ, d) be the number of subsets S {1, . . . , d} of cardinality |S| = l with diameter exactly γ(S) = γ. First, consider C(2, γ, d). For each set, we can choose the first element i from d different values, and the second as mod ((i 1) (γ 1), d) + 1. In this way, since q < d/2, we count each set exactly twice. Thus, C(2, γ, d) = d 2 Next, consider l > 2. We can build all the possible sets by starting with one of the C(2, γ, d) = d sets, and adding the remaining l 2 elements from γ 2 possible indices. Hence, every fixed set of size 2 and diameter γ yields γ 2 l 2 different sets of size l. Furthermore, by construction, every set of size l and diameter γ results from exactly one set of size 2 and diameter γ. Therefore, C(l, γ, d) = d γ 2 l 2 The result for l 2 then follows from summing C(l, γ, d) over all diameters γ q: C(l, q, d) = d (i) = d q 1 l 1 where (i) follows from the hockey-stick identity. Now, we focus on cyclic convolutional kernels K as in Equation (1). First, we restate Proposition 1 from Misiakiewicz & Mei (2021). This proposition establishes that YS for S {1, . . . , d} are indeed the eigenfunctions of K, and yields closed-form eigenvalues λS up to factors ξ(q) |S| that depend on the inner nonlinearity κ. Next, Lemma 10 uses additional regularity assumptions on κ to eliminate the dependency on ξ(q) |S|. This characterization of the eigenvalues then enables the proof Lemmas 2 and 4. Proposition 1 (Proposition 1 from Misiakiewicz & Mei (2021)). Let K be a cyclic convolutional kernel over the unit hypercube as defined in Equation (1). Then, K(x, x ) := 1 k=1 κ x(k,q), x (k,q) γ(S) q |S|=l λSYS(x)YS(x ), Published as a conference paper at ICLR 2023 λS = ξ(q) |S| q + 1 γ(S) d B(|S|, q) where ξ(q) |S| are the coefficients of the κ-decomposition (Equation (16)) over { 1, 1}q. Alternatively, K(x, x ) = X k λSk YSk(x)YSk(x ) where we order all Sk {1, . . . , d} with γ(Sk) q such that λSk λSk+1. In particular, λSk = λk. We refer to Proposition 1 from Misiakiewicz & Mei (2021) for a formal proof. Intuitively, the result follows from applying Lemma 8 for each subset S {1, . . . , d} of contiguous elements. This is possible because crucially, any subset of q features is again distributed uniformly on the q-dimensional unit hypercube. Lastly, the factor q + 1 γ(S) stems from the fact that each eigenfunction YS appears as many times as there are contiguous index sets of size γ(S) supported in a fixed contiguous index set of size q. 
In other words, the term is the number of shifted instances of S supported in a contiguous subset of q features. As mentioned before, Proposition 1 characterizes the eigenvalues of cyclic convolutional kernels K up to factors ξ(q) |S| that depend on the inner nonlinearity κ. To avoid the additional factors, we require the following regularity assumptions: Assumption 1 (Regularity). Let T := l 4 + 4ℓ β m . A cyclic convolutional kernel K(x, x ) = 1 d Pd k=1 κ x(k,q),x (k,q) q from the setting of Section 3.1 with inner function κ satisfies the regular- ity assumption if there exist constants c T, c , c > 0 and a series of constants {cl > 0}T l=0 such that, for any q c, the decomposition γ(S) q |S|=l ξ(q) |S| q + 1 γ(S) d B(|S|, q) YS(x)YS(x ) from Proposition 1 over inputs x, x { 1, 1}d satisfies ξ(q) l cl l {0, . . . , T}, (18) ξ(q) l 0 l > T, (19) q T l+1 l {0, . . . , T}, (20) X l 0 ξ(q) l c . (21) For sufficiently high-dimensional inputs x, x , Equations (18) and (19) ensure that the convolutional kernel K(x, x ) in Equation (1) is a valid kernel, and that it can learn polynomials of degree up to T. Indeed, if ξ(q) l = 0 for some l, then there are no polynomials of degree l among the eigenfunctions of K. Furthermore, Equations (18), (20) and (21) guarantee that the eigenvalue tail is sufficiently bounded. This allows us to bound K>m and µmin (K>m) in Appendix C. Our assumption resembles Assumption 1 by Misiakiewicz & Mei (2021): For one, Equations (20) and (21) are equivalent to Equations 43 and 44 in Misiakiewicz & Mei (2021). Furthermore, Equation (18) above is a slightly stronger version of Equation 42, where strengthening is necessary due to the non-asymptotic nature of our results. We still argue that many standard κ, for example, the Gaussian kernel, satisfy Assumption 1 with our convolution kernel K. Because such κ satisfy Assumption 1 in Misiakiewicz & Mei (2021), we only need to check that they additionally satisfy our Equation (18). If κ is a smooth function, we have ξ(q) l = κ(l)(0) + o(1) for all l T, where κ(l) is the l-th derivative of κ. In particular, all derivatives of the exponential function at 0 are strictly positive, implying Equation (18) for the Gaussian kernel if d is large enough. Published as a conference paper at ICLR 2023 The final lemma of this section is a corollary that characterizes the eigenvalues solely in terms of |S|, and further shows that, for d large enough, the eigenvalues decay as |S| grows. Lemma 10 (Corollary of Proposition 1). Consider a cyclic convolutional kernel as in Proposition 1 that satisfies Assumption 1 with q Θ(dβ) for some β (0, 1). Then, for any S {1, . . . , d} such that γ(S) q and |S| < T, the eigenvalue λS corresponding to the eigenfunction YS(x) satisfies λS Ω 1 d q|S| and λS O 1 d q|S| 1 Furthermore, max |S| T γ(S) q λS O 1 d q T 1 Proof. Without loss of generality, assume d is large enough such that d > q/2 c/2, where c is a constant from Assumption 1. Let S {1, . . . , d} with γ(S) q be arbitrary and define l := |S| and r := q + 1 γ(S). Since l γ(S) q, we have 1 r q + 1 l. Furthermore, since B(l, q) = q l , we use the following classical bound on the binomial coefficient throughout the proof: 1 l ql B(l, q) e For the first part of the lemma, assume |S| < T. Then, using Assumption 1, we have λS = ξ(q) l r d B(l, q) d cl,1 1 dql 1 c T,1 1 dql 1 , d cl,2 1 dql c T,2 1 dql for some positive constants cl,1, cl,2 that depend on l, c T,1 := maxl {0,...,T 1} cl,1, and c T,2 := minl {0,...,T 1} cl,2. 
Step (i) follows from the upper bound in Equation (21) with non-negativity in Equations (18) and (19), and (ii) follows from the lower bound in Equation (18). Since c T,1 and c T,2 do not depend on l, this concludes the first part of the proof. For the second part of the proof, we consider two cases depending on whether |S| [T, q T] or |S| > q T. Hence, first assume T |S| q T. Then, λS (i) = ξ|S| q + 1 γ(S) d q |S| (ii) ξ|S| q + 1 γ(S) d q T (iii) c q T T = c T T dq T 1 , (24) where (i) follows from Proposition 1. In step (ii), we use that q |S| is minimized when |S| is has the largest difference to q/2. Lastly, step (iii) applies the upper bound from Equation (21) together with non-negativity of ξ|S| in Assumption 1, and the classical bound on the binomial coefficient. Now assume|S| > q T. Then, λS (i) = ξ|S| q + 1 γ(S) d q |S| ξq (q |S|) q d q q (q |S|) (ii) = ξq (q |S|) q d q q |S| (iii) ξq (q |S|) (q |S|)q |S|q (iv) c (q |S|)q |S|q dq T (q |S|)+1qq |S| dq T , (25) Published as a conference paper at ICLR 2023 where (i) follows from Proposition 1, (ii) from the fact that n n k = n k , (iii) from the classical bound for the binomial coefficient, (iv) from Equation (20) in Assumption 1, and (v) from the fact that q |S| < T in the current case. Combining Equations (24) and (25) from the two cases finally yields max |S| T γ(S) q λS max max T |S| q T γ(S) q λS, max |S|>q T γ(S) q λS dq T 1 , c T T which concludes the second part of the proof. B.3 PROOF OF LEMMA 2 First, note that f = YS for S = {1, . . . , L }. Since |S | = γ(S ) = L , Proposition 1 yields λS = ξ(q) L q + 1 L (i) Θ q dq L = Θ 1 dq L 1 (ii) ω(λm), where (i) uses Equations (18) and (21) in Assumption 1 for L T and d large enough to get ξ(q) L Θ(1), and (ii) uses λm o 1 dq L 1 . Hence, for d sufficiently large, λS > λm. Since the eigenvalues are in decreasing order, this implies that f is in the span of the first m eigenfunctions. This further yields D 1 ma = λ 1 S Θ dq L 1 , since the entry of a corresponding to YS is 1 while all others are 0. B.4 PROOF OF LEMMA 4 Before proving Lemma 4, we introduce the following quantity: L := ℓ ℓλ 1 Intuitively, L corresponds to the degree of the largest polynomial that a cyclic convolutional kernel as defined in Equation (1) can learn. This quantity plays a key role throughout the proof of Lemma 4, and Lemma 3 later. Finally, note that δ as defined in Theorem 1 can be written as δ = ℓ ℓλ 1 Proof of Lemma 4. First, we use Proposition 1 to write the cyclic convolutional kernel as γ(S) q |S|=l λSYS(x)YS(x ) = X k λSk YSk(x)YSk(x ) where the λSk are ordered such that λSk+1 λSk. For the first part of the proof, we need to pick an m N such that nλm Θ (max{λ, 1}) and max{λ,1} . We will equivalently choose m Θ(dq L) with λm Θ 1 d q L+δ ; since n Θ(dℓ) and max{λ, 1} Θ(dℓλ), we have Θ max{λ, 1} d1+ℓλ+β(L+δ) = Θ 1 d q L+δ = Θ d1+ℓλ+β(L+δ)q δ The remainder of the proof proceeds in five steps: we first construct a candidate Sm {1, . . . , d} with γ(Sm) q, show that the rate of the eigenvalue corresponding to YSm satisfies λSm = λm Θ 1 d q L+δ , show that the rate of m Θ(dq L), establish Θ nq δ max{λ,1} O(n q δ), and finally show that λm o 1 d q L 1 for appropriate L . Published as a conference paper at ICLR 2023 Construction of m We consider two different Sm depending on δ: Sm = {1, . . . , L, q + 1 q1 δ } δ (0, 1) {1, . . . , L, q/2 } δ = 0. (27) For d and hence q Θ(dβ) large enough, Sm is well-defined, |Sm| = L + 1, and the diameter is γ(Sm) = q + 1 q1 δ δ (0, 1) q/2 δ = 0. 
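As a quick numeric sanity check (not needed for the argument), the sketch below computes the cyclic diameter by brute force, confirms the counting formula of Lemma 9, and verifies the size and diameter of a set built in the spirit of the construction in Equation (27); all parameter values and the rounding used for $q^{1-\delta}$ are illustrative assumptions.

```python
# Quick sanity check of the cyclic diameter, Lemma 9, and a construction in the
# spirit of Equation (27) (illustrative parameters; indices are 1-based).
import itertools
from math import ceil, comb

def gamma(S, d):
    """Cyclic diameter of S over {1, ..., d}, with gamma(emptyset) = 0."""
    if not S:
        return 0
    return max(min((j - i) % d + 1, (i - j) % d + 1) for i in S for j in S)

d, q, l = 12, 5, 3                          # satisfies l <= q < d / 2
count = sum(1 for S in itertools.combinations(range(1, d + 1), l)
            if gamma(S, d) <= q)
assert count == d * comb(q - 1, l - 1)      # Lemma 9: C(l, q, d) = d * binom(q - 1, l - 1)

L, delta = 2, 0.4                           # illustrative rates
r = ceil(q ** (1 - delta))                  # rounding here is an illustrative choice
S_m = tuple(range(1, L + 1)) + (q + 1 - r,)
assert len(set(S_m)) == L + 1               # the set has L + 1 distinct elements
assert gamma(S_m, d) == q + 1 - r           # and the claimed diameter
```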
For the rest of the proof, assume that d is sufficiently large. Rate of λSm Using Proposition 1 and |Sm| = L + 1, we can write λSm = ξ(q) L+1 q + 1 γ(Sm) d B(L + 1, q) . First, we show that the numerator is in Θ q1 δ for both definitions of Sm. In the case where δ (0, 1), we have q + 1 q + 1 q1 δ = q1 δ = q1 δ (i) Θ q1 δ , where (i) follows from δ < 1 and q sufficiently large. In the case where δ = 0, we have q + 1 q/2 q + 1 O(q), q + 1 q/2 q/2 Ω(q). Thus, since δ = 0 in this case, the numerator is in Θ(q) = Θ(q1 δ). As the denominator does not depend on δ, we use the same technique for both δ = 0 and δ (0, 1). The classical bound on B(L + 1, q) = q L+1 yields q L+1 q L + 1 L+1 q L + 1 e L+1 q L + 1 Therefore, d B(L + 1, q) Θ dq L+1 . Finally, since L + 1 T = 4 + 4ℓ/β , we have ξ(q) L+1 Θ(1) by Equations (18) and (21) in Assumption 1 for d sufficiently large. Combining all results then yields the desired rate of λSm as follows: = Θ 1 d q L+δ Rate of m To establish m Θ(dq L), we bound m individually from above and below. Upper bound: Since the eigenvalues are in decreasing order, we can bound m from above by counting how many eigenvalues are larger than λm. To do so, we use |Sm| = L + 1, and show that for d sufficiently large, all Sk with |Sk| > L + 1 correspond to eigenvalues λSk < λSm. We first decompose max k:|Sk|>L+1 γ(Sk) q λSk = max max k:L+1<|Sk|L+1,γ(Sk) q λSk = max {M1, M2} o(λSm). Thus, for d sufficiently large and |Sk| > L + 1, we have λSk < λSm. For this reason, m is at most the number of eigenfunctions with degree no larger than L + 1: l=0 C(l, q, d) (i) O(dq L), where (i) uses Lemma 9 for d large enough with q l Θ(ql). Lower bound: By construction of Sm in Equation (27), we have γ(Sm) q/2 . This, combined with Proposition 1, implies that the indices of all polynomials with degree L + 1 but diameter at most q/2 1 are smaller than m. Hence, for large enough d, Lemma 9 yields the following lower bound: m C(L + 1, q/2 1, d) = d q/2 2 L The upper and lower bound together then imply m Θ(dq L). This concludes the existence of an m N such that λm and m exhibit the desired rates. Rate of m with respect to n We can write n as n Θ(dℓ) = Θ d dβ ℓ 1 β = Θ dq ℓ 1 β ) = Θ dq ℓ 1 Combining this with L j ℓ 1 β k , we directly get Θ(dq L) O(dq ℓ 1 β ) = O(nq δ). Rate of λm for appropriate L Since nλm Θ(max{λ, 1}), we have λm Θ max{λ, 1} (i) = Θ dℓλ = Θ 1 dq L+δ where (i) uses the identity ℓ= 1 + ℓλ + β(L + δ). Assume now L l ℓ ℓλ 1 β m . For the remainder, we need to consider two cases depending on δ. If δ > 0, then L L + 1, and we have If δ = 0, then L L, and we have In both cases, λm o 1 dq L 1 , concluding the proof. C MATRIX CONCENTRATION This section considers the random matrix theory part of our main result. First, Appendix C.1 focuses on the large eigenvalues of our kernel, and proves Lemma 1. Next, Appendices C.2 and C.3 focus on the tail of the eigenvalues, culminating in the proof of Lemma 3. Lastly, Appendices C.4 and C.5 establishes some technical tools that we use throughout the proofs. Published as a conference paper at ICLR 2023 C.1 PROOF OF LEMMA 1 In this proof, we show that the matrix Ψ mΨ m/n concentrates around the identity matrix for all m O(nd β δ), thereby establishing Equation (5). Let ˆm be the largest m O(nq δ). The proof consists of applying Theorem 5.44 from Vershynin (2012) to the matrix Ψ ˆm, and extending the result to all suitable choices of m simultaneously. 
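As an aside (and not part of the argument), the following small simulation illustrates the concentration that this theorem provides in our setting: for uniform hypercube inputs, the empirical second-moment matrix of low-degree monomial features is close to the identity once $n$ is large relative to the number of features. The concrete sizes are illustrative assumptions.

```python
# Aside: small simulation of the concentration used here (illustrative sizes;
# not part of the proof). Columns of Psi are low-degree monomials Y_S, which
# play the role of the leading eigenfunctions in our setting.
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, max_degree, n = 20, 2, 20_000

subsets = [S for r in range(max_degree + 1)
           for S in itertools.combinations(range(d), r)]
X = rng.choice([-1.0, 1.0], size=(n, d))
Psi = np.stack(
    [np.prod(X[:, list(S)], axis=1) if S else np.ones(n) for S in subsets],
    axis=1)                                      # shape (n, m_hat)

M = Psi.T @ Psi / n                              # empirical second-moment matrix
print(len(subsets), np.linalg.norm(M - np.eye(len(subsets)), ord=2))
```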
More precisely, let c be the implicit constant of the O(nq δ)-notation, and define ˆm to be the largest m N with m c nq δ. Note that ˆm exists, because d is large enough and fixed. Bound for ˆm To apply Theorem 5.44 from Vershynin (2012), we need to verify the theorem s conditions on the rows of Ψ ˆm. In particular, we show that the rows are independent, have a common second moment matrix, and that their norm is bounded. Let [Ψ ˆm]i,: indicate the i-th row of Ψ ˆm Rn ˆm. We may write each row entry-wise as [Ψ ˆm]i,: = [Y1(xi) Y2(xi) Y ˆm(xi)] . First, the rows of Ψ ˆm are independent, since each row depends on a different xi, and we assume the data to be i.i.d.. Second, since the eigenfunctions are orthonormal w.r.t. the data distribution, the second moment of the rows is E [Ψ ˆm]i,:[Ψ ˆm] i,: = I ˆm for all rows i {1, . . . , n}. Third, to show that each row has a bounded norm, we use the fact that the eigenfunctions Yk in Equation (13) over { 1, 1}d satisfy Yk(xi)2 = 1 for all k. Thus, the norm of each row is [Ψ ˆm]i,: 2 = k=1 Yk(xi)2 = We can now apply Theorem 5.44 from Vershynin (2012). For any t 0, this yields the following inequality with probability 1 ˆm exp{ ct2}, where c is an absolute constant: Ψ ˆmΨ ˆm max n I ˆm 1 2 , 2o , where = t The choice t = 1 ˆm yields max n I ˆm 1 2 , 2o = 1/2, and the following error probability for large enough d: ˆm exp n c n o (i) nq δ exp c n nq δ q δ dℓexp{ c q δ} q δ, where (i) follows from ˆm O(nq δ). Bound for any m < ˆm Note that Ψ mΨ m is a submatrix of Ψ ˆmΨ ˆm. Thus, n Im is also a submatrix of Ψ ˆmΨ ˆm Therefore, Ψ mΨ m with probability at least 1 cd β δ uniformly over all m ˆm. C.2 FURTHER DECOMPOSITION OF THE TERMS AFTER m In this section, we focus on the concentration of the smallest and largest eigenvalue of the kernel matrix K>m to prove Lemma 3. However, this proof is involved, and requires additional tools. In particular, we further decompose K>m into two kernels K1 and K2. Published as a conference paper at ICLR 2023 In the following, we consider the setting of Theorem 1 with a convolutional kernel K that satisfies Assumption 1. We define the additional notation L := ℓ ℓλ 1 and L := ℓ 1 Intuitively, L is the maximum polynomial degree that K can learn with regularization, and L is the analogue without regularization. Finally, note that δ and δ as defined in Theorem 1 can be written as δ = ℓ ℓλ 1 β L and δ = ℓ 1 β L, respectively. We now introduce the two additional kernels, and then show in Lemma 11 that K>m = K1 + K2. First, applying Proposition 1 to K yields K(x, x ) = X k λSk YSk(x)YSk(x ), (29) where {Sk}k>0 is a sequence of all subsets Sk {1, . . . , d} with γ(Sk) q, ordered such that λSk λSk+1. Next, let m N be such that nλSm Θ(max{λ, 1}), and define the index sets I1 := {k N | k > m and |Sk| L + 1}, I2 := {k N | |Sk| L + 2}. Those sets induce the following kernels: K1(x, x ) := X k I1 λSk YSk(x)YSk(x ), S1(x, x ) := X k I1 λ2 Sk YSk(x)YSk(x ), K2(x, x ) := X k I2 λSk YSk(x)YSk(x ), S2(x, x ) := X k I2 λ2 Sk YSk(x)YSk(x ), where S1 and S2 are the squared kernels corresponding to K1 and K2, respectively. The empirical kernel matrices K1, K2, S1, S2 Rn n are [K1]i,j = K1(xi, xj), [K2]i,j = K2(xi, xj), [S1]i,j = S1(xi, xj), and [S2]i,j = S2(xi, xj). Furthermore, as in the original kernel decomposition, we define the matrices Ψ1 Rn |I1|, [Ψ1]i,j = YSkj (xi), D1 R|I1| |I1|, D1 = diag(λSk1, . . . , λSk|I1| ), where {kj}|I1| j=1 is a sequence of all indices in I1 ordered such that λSkj λSkj+1. 
Intuitively, Ψ1, D1 are the analogue to Ψ m, D m in the original decomposition K = K m + K>m. Lastly, we define m as the largest eigenvalue corresponding to an eigenfunction YS of degree |S| L + 2, that is, m := min I2. Using the previous definitions, the following lemma establishes that K1 and K2 indeed constitute a decomposition of K>m. Lemma 11 (1-2 decomposition). For d sufficiently large, we have K>m(x, x ) = K1(x, x ) + K2(x, x ) and S>m(x, x ) = S1(x, x ) + S2(x, x ). Proof. For the decomposition of K>m, we have to show that exactly the eigenfunctions with index larger than m appear in either K1 or K2, that is, I1 I2 = {k > m}, and that no eigenfunction appears in both K1 or K2, that is, I1 I2 = . Furthermore, since we can write S>m(x, x ) = P k>m λ2 Sk YSk(x)YSk(x ) by Lemma 5, the same argument implies the 1-2 decomposition of S>m. First, from the definition of I1 and I2, it follows directly that I1 I2 = , that I1 I2 {k > m}, and that I1 {k > m}. Hence, to conclude the proof, we only need to show that I2 {k > m}. Since the eigenvalues are sorted in decreasing order, we equivalently show that, for d sufficiently large, all eigenvalues λSk with k I2 are smaller than λSm Θ(max{λ, 1}/n). Published as a conference paper at ICLR 2023 More precisely, we show that maxk I2 nλSk o(nλSm) = o(max{λ, 1}). Using T from Assumption 1, we have max k I2 nλSk = max max k I2:|Sk| m. Using the 1-2 decomposition, we now prove Lemma 3. We defer the auxiliary Lemmas 12 to 15 to Appendix C.4, and concentration-results to Appendix C.5. C.3 PROOF OF LEMMA 3 Throughout the proof, we assume d to be large enough such that all quantities are well-defined and all necessary lemmas apply. In particular, we assume the conditions of Lemma 11 to be satisfied, and that c < q/2 < q < d/2 for c in Assumption 1. Hence, L + 2 L + 2 < T < q/2 , and we can apply Lemmas 9 to 15, the setting of Appendix C.2, as well as Assumption 1 throughout the proof. We will mention additional implicit lower bounds on d as they arise. The proof proceeds in three steps: we first bound r1 and r2, then bound Tr(S>m), and finally Pn i=1+m µi(S>m). We do not establish the required matrix concentration results directly, but apply various auxiliary lemmas. All corresponding statements hold with either probability at least 1 cq δ or at least 1 cq (1 δ) for context-dependent constants c. We hence implicitly choose a c > 0 such that collecting all error probabilities yields the statement of Lemma 3 with probability at least 1 cd β min{ δ,1 δ}. To start, let m N as in the statement of Lemma 3, and instantiate Appendix C.2 with that m. In particular, Lemma 11 yields the 1-2 decomposition K>m = K1 + K2 and S>m = S1 + S2, which we will henceforth use. Finally, we define Q(d,q) l (x, x ) := X γ(S) q |S|=l d B(l, q) YS(x)YS(x ) (30) with the corresponding kernel matrix Q(d,q) l Rn n. Published as a conference paper at ICLR 2023 Bound on r1 and r2 Remember the definition of r1 and r2: r1 = µmin (K>m) + λ max{λ, 1} , r2 = K>m + λ max{λ, 1} . To bound those quantities, we have to bound µmin (K>m) and K>m . For the upper bound on K>m , we use the triangle inequality on K>m = K1 + K2 , and then bound K1 and K2 individually. Note that we can write K1 = Ψ1D1Ψ 1 by definition. Hence, K1 = Ψ1D1Ψ 1 = n D1 Ψ 1Ψ1 (i) 1.5n D1 (ii) 1.5nλm (iii) O (max{λ, 1}) , where (i) follows from Lemma 12 with probability at least 1 c1q δ, (ii) uses that all eigenvalues of K1 are at most λm by definition, and (iii) follows from nλm Θ(max{λ, 1}). 
Next, Lemma 13 directly yields with probability at least 1 c2q (1 δ) that K2 Θ(1) and µmin (K2) Θ(1). Hence, with probability at least 1 ( c1q δ + c2q (1 δ)) 1 cq min{ δ,1 δ}, we have K1 , K2 O (max{λ, 1}) , µmin (K2) Ω(1). This implies K>m + λ K1 + K2 + λ O(max{λ, 1}), µmin (K>m) + λ µmin (K2) + λ Ω(max{λ, 1}), and subsequently r2 = K>m + λ max{λ, 1} O(1), r1 = µmin (K>m) + λ max{λ, 1} Ω(1). Finally, since r1 r2, this yields r1, r2 Θ(1). Bound on Tr(S>m) We need to show that Tr(S>m) dℓλq δ + dℓλqδ 1 (31) with high probability, where the two terms correspond to the 1-2 decomposition Tr(S>m) = Tr(S1) + Tr(S2). We differentiate between L = L and L > L. Intuitively, the case L = L corresponds to interpolation or weak regularization, because the maximum degree of learnable polynomials with regularization equals the one without regularization. Conversely, L > L corresponds to strong regularization. Case L = L (interpolation or weak regularization): In this setting, First, Lemma 14 yields Tr(S2) Θ(q (1 δ)) with probability at least 1 c3q (1 δ). Therefore, Tr(S2) Θ(q (1 δ)) (i) = Θ(q (1 δ)+ ℓλ β ) = Θ dℓλq (1 δ) , where (i) follows from Equation (32). We now consider Tr(S1): Tr(S1) = n Tr Ψ 1Ψ1 Tr(D2 1) (i) 1.5n X k I1 λ2 Sk, Published as a conference paper at ICLR 2023 where (i) follows from Lemma 12 with probability at least 1 c1q δ. The bound continues as (i) nλ2 m|I1| (ii) O max{λ, 1}2 |I1| = O d2ℓλq δ (iv) = O dℓλq δ , where (i) uses λk λm for all k I1 by definition, (ii) uses nλm Θ(max{λ, 1}), and (iv) follows from Equation (32). Furthermore, (iii) uses the following bound of |I1|: l=0 C(l, q, d) (i) = 1 + d l 1 (iii) O(d q L), where C(l, q, d) is defined in Equation (17), (i) follows from Lemma 9, (ii) is a classical bound on the binomial coefficient, and (iii) follows from the fact that the term corresponding to L + 1 dominates the polynomial. Finally, collecting the upper bounds on Tr(S1) and Tr(S2) yields Tr(S m) = Tr(S1) + Tr(S2) O dℓλq δ + dℓλq (1 δ) with probability at least 1 ( c3q (1 δ) + c1q δ) 1 cq min{ δ,1 δ}. Case L > L (strong regularization): In this setting, the dominating rate will arise from Tr(S1). We start by linking δ and δ in analogy to Equation (32): β L + L L = δ ℓλ β + L L. (33) Next, as in the previous case, Lemma 14 yields Tr(S2) Θ(q (1 δ)) with probability at least 1 c3q (1 δ), and therefore Tr(S2) Θ(q (1 δ)) = Θ(q δ 1) (i) = Θ dℓλqδ ( L L) 1 (ii) o dℓλq (1 δ) , where (i) follows from Equation (33), and (ii) from L > L. To bound Tr(S1), we start as in the previous case: Tr(S1) = n Tr Ψ 1Ψ1 Tr(D2 1) (i) 1.5n X k I1 λ2 Sk, where (i) follows from Lemma 12 with probability at least 1 c1q δ. We then decompose the sum over all squared eigenvalues with index in I1 as k I1 λ2 Sk = n X k I1 |Sk| L+1 | {z } =:E1 k I1 |Sk|=L+2 | {z } =:E2 k I1 |Sk| L+3 | {z } =:E3 and bound the three terms individually. First, we upper-bound E1 as follows: k I1 |Sk| L+1 (i) n2λ2 m n k I1 |Sk| L+1 l=0 C(l, q, d) (ii) O max{λ, 1}2 d dℓλ q L+δ dq L = O dℓλq δ , Published as a conference paper at ICLR 2023 where (i) follows from λk λm for all k > m due to the decreasing order of eigenvalues. Step (ii) applies nλm Θ(max{λ, 1}), as well as PL+1 l=0 C(l, q, d) O(d q L), which follows as in the other case from Lemma 9 and the classical bound on the binomial coefficient. 
Second, the upper bound of E2 arises as follows: k I1 |Sk|=L+2 λ2 Sk (i) = n(ξ(q) L+2)2 X k I1 |Sk|=L+2 q + 1 γ(Sk) d B(L + 2, q) (ii) (ξ(q) L+2)2 n q d B(L + 2, q) γ(S) q |S|=L+2 d B(L + 2, q) (iii) n q d q L+2 X γ(S) q |S|=L+2 d B(L + 2, q) (iv) = n q d q L+2 Q(d,q) L+2 (x, x) (v) O d dℓλq L+δ q = O dℓλq (1 δ) , where (i) follows from Proposition 1, and (ii) uses q + 1 γ(Sk) q. Next, (iii) uses that Equations (18) and (21) in Assumption 1 imply ξ(q) L+2 Θ(1), and applies the bound B(L + 2, q) = q L+2 (eq/(L + 2))L+2. Step (iv) uses YS(x)YS(x) = 1 for all S and x { 1, 1}d, together with the definition of Q(d,q) L+2 . Lastly, (v) applies Lemma 15 and n Θ(dℓ) = Θ(d dℓλq L+δ). Third, we upper-bound E3: k I1 |Sk| L+3 λ2 Sk n max k I1,|Sk| L+3 λS k I1 |Sk| L+3 λSk n max k I1,|Sk| L+3(λSk). The last step follows from k I1 |Sk| L+3 γ(S) q |S|=l γ(S) q |S|=l ξ(q) l q + 1 γ(S) l=L+3 ξ(q) l X γ(S) q |S|=l d B(l, q) YS(x)YS(x) l=L+3 ξ(q) l where (i) follows from Proposition 1, (ii) uses YS(x)YS(x) = 1 for all S and x { 1, 1}d, (iii) applies the definition of Q(d,q) l and Lemma 15, and (iv) follows from Equations (18) and (21) in Assumption 1 since L + 1 T. For maxk I1,|Sk| L+3(λSk), we bound each element individually: (i) O 1 dq|Sk| 1 O 1 dq(L+3) 1 Published as a conference paper at ICLR 2023 where (i) uses Equation (22) in Lemma 10 since k L + 1 < T by definition of I1. Hence, we obtain the following bound on E3: E3 n max k I1,|Sk| L+3(λSk) O n dq L+2 = O d dℓλq L+δ O dℓλq (1 δ) . Finally, we can bound Tr(S1) as Tr(S1) 1.5n X k I1 λ2 Sk = E1 + E2 + E3 O dℓλq δ + dℓλq (1 δ) , which yields Tr(S m) = Tr(S1) + Tr(S2) O dℓλq δ + dℓλq (1 δ) as desired with probability at least 1 ( c3q (1 δ) + c1q δ) 1 cq min{ δ,1 δ}. Bound on Pn i=1+m µi(S>m) As before, we differentiate between no/weak and strong regularization, that is, between L = L and L > L: Case L = L (interpolation or weak regularization): In this case, we start by directly bounding i=1+m µi(S>m) = i=1+m µi(S1 + S2) i=1+m µi(S2) i=1 µi(S2) (i) Tr(S2) m S2 (34) (ii) Ω q (1 δ) m dq L+1 (iii) = Ω dℓλq (1 δ) m dq L+1 where (i) bounds each of the first m eigenvalues of S2 with the largest one, (ii) follows from Lemma 14 with probability at least 1 c3q (1 δ), and (iii) from Equation (32) since L = L. To conclude the lower bound, it suffices to show that m dq L+1 o(dℓλq (1 δ)): max{λ, 1}dq L+1 (ii) = O q δdℓλq (1 δ) (iii) o(dℓλq (1 δ)), where (i) follows from m O nq δ max{λ,1} , and (ii) from Equation (32). For (iii), note that δ = 0 for a sufficiently large c yields a vacuous result. We hence assume without loss of generality that δ > 0, which justifies the step. This concludes the proof for the current case with probability at least 1 c3q (1 δ) 1 cq min{ δ,1 δ}. Case L > L (strong regularization): In this case, we define the additional index set I3 := {k I1 | |Sk| = L + 2} with S3, Ψ3, D3 analogously to S1, Ψ1, D1 in Appendix C.2, but using I3 instead of I1. Since I3 I1, it follows that Ψ 3Ψ3 is a submatrix of Ψ 1Ψ1, and thus n I|I3| is a submatrix of Ψ 1Ψ1 This particularly implies Ψ 3Ψ3 (i) 1/2, (35) where (i) follows from Lemma 12 with probability at least 1 c1q δ. Published as a conference paper at ICLR 2023 We now move our focus back to the lower bound of Pn i=1+m µi(S>m): i=1+m µi(S>m) (i) i=1+m µi(S1) (ii) i=1+m µi(S3) (iii) Tr(S3) m S3 , (36) where (i) follows from the 1-2 decomposition S>m = S1 + S2, (ii) from the fact that I3 I1, and (iii) analogously to Equation (34). 
Similar to the previous case, we conclude the proof by first showing that Tr(S3) Ω dℓλq (1 δ) , and then m S3 o(Tr(S3)). For the lower bound of Tr(S3), we start with Tr(S3) = n Tr 1 n Tr D2 3 µmin (ii) = n(ξ(q) L+2)2 X |S|=L+2 γ(S) q d B(L + 2, q) where (i) follows with high probability from Equation (35). Step (ii) applies Proposition 1, and the fact that I3 = {k N | |Sk| = L + 2 and γ(S) q} for d sufficiently large. To show this, we use = Θ 1 dq L+δ Since the eigenvalues are in decreasing order and L + 2 L + 1 in the current case, we only need to show that λS < λm for all S {1, . . . , d} with |S| = L + 2 and γ(S) q: λS (i) O 1 dq L+1 = o 1 dq L+δ where (i) applies Equation (22) in Lemma 10 since L+2 < T. Thus, λS < λm for all S {1, . . . , d} with |S| = L + 2 and γ(S) q if d is sufficiently large, which we additionally assume from now on. The lower bound of Tr(S3) continues as follows: Tr(S3) n(ξ(q) L+2)2 X |S|=L+2 γ(S) q d B(L + 2, q) n(ξ(q) L+2)2 X |S|=L+2 γ(S) q/2 d B(L + 2, q) (iii) (ξ(q) L+2)2 n q/2 d B(L + 2, q) |S|=L+2 γ(S) q/2 q/2 + 1 γ(S) d B(L + 2, q) (iv) (ξ(q) L+2)2 n q/2 d B(L + 2, q) |S|=L+2 γ(S) q/2 q/2 + 1 γ(S) d B(L + 2, q/2 ) , where (iii) follows from q q/2 and q + 1 γ(S) q/2. Step (iv) follows from the fact that B(L + 2, q) and B(L + 2, q/2 ) are of the same order; this follows from a classical bound on the binomial coefficient: B(L + 2, q) eq L + 2 L+2 B(L + 2, q/2 ). Published as a conference paper at ICLR 2023 We conclude the lower bound on Tr(S3) as follows: Tr(S3) (ξ(q) L+2)2 n q/2 d B(L + 2, q) |S|=L+2 γ(S) q/2 q/2 + 1 γ(S) d B(L + 2, q/2 ) (v) = (ξ(q) L+2)2 n q/2 d B(L + 2, q)Q(d, q/2 ) L+2 (x, x) (vi) = (ξ(q) L+2)2 n q/2 d B(L + 2, q) (vii) Ω n q d B(L + 2, q) (viii) = Ω d dℓλq L+δ q = Ω dℓλq (1 δ) , (37) where (v) uses YS(x)YS(x) = 1 for all S and x { 1, 1}d with the definition of Q(d, q/2 ) L+2 in Equation (30), (vi) applies Lemma 15 with q/2 as filter size, (vii) follows from Equation (18) in Assumption 1, and (viii) uses the classical bound on the binomial coefficient. Finally, for the upper bound on m S3 , we have m S3 = m Ψ3D3Ψ 3 mn D3 Ψ 3Ψ3 (i) 1.5mn D3 = mn max k I3 λ2 k (ii) = mn max k I3 ξ(q) L+2 q + 1 γ(Sk) d B(L + 2, q) n (ξ(q) L+2)2 n q d B(L + 2, q) ddℓλq L+δ q 2! (iv) O q δd ℓλ dℓλq (1 δ) 2 = O q δq (1 δ) dℓλq (1 δ) (v) o dℓλq (1 δ) (vi) o(Tr(S3)), where (i) follows with high probability from Equation (35), and (ii) from Proposition 1. Step (iii) uses that Equations (18) and (21) in Assumption 1 yield ξ(q) L+2 Θ(1), and B(L + 2, q) = q L+2 O(q L+2). Furthermore, (iv) follows from m O nq δ max{λ,1} , (v) from q δq (1 δ) o(1), and (vi) from the lower bound on Tr(S3) in Equation (37). Finally, combining this result with Equations (36) and (37), we have i=1+m µi(S>m) Tr(S3) m S3 Ω(Tr(S3)) Ω dℓλq (1 δ) with probability at least 1 c1q δ 1 cq min{ δ,1 δ}. C.4 TECHNICAL LEMMAS We use the following technical lemmas in the proof of Lemma 3. All results assume the setting of Appendix C.2, particularly a kernel as in Theorem 1 that satisfies Assumption 1. Lemma 12 (Bound on K1). In the setting of Appendix C.2, for d/2 > q L + 1, we have Ψ 1Ψ1 with probability at least 1 cq δ uniformly over all choices of m and λ. Proof. The proof follows a very similar argument to Lemma 1 with minor modifications. First, define ˆI1 := {k N | |Sk| L + 1}. Published as a conference paper at ICLR 2023 Note that ˆI1 does not depend on m or λ, and I1 ˆI1 for any m. 
Furthermore, l=0 C(l, q, d) (i) = 1 + l=1 d q 1 l 1 l 1 (iii) O(d q L), where (i) follows from Lemma 9, (ii) is a classical bound on the binomial coefficient, and (iii) follows from the fact that the largest degree monomial dominates the others. Next, we define ˆΨ1 Rn |ˆI1| with [ ˆΨ1]i,j = YSkj (xi) where kj is the j-th element in ˆI1. Using the same arguments as in the proof of Lemma 1, it follows that the rows of ˆΨ1 are independent, that their norm is bounded by |ˆI1|, and that they have an expected outer product equal to I|ˆI1|. Hence, as in the proof of Lemma 1, we can apply Theorem 5.44 from Vershynin (2012). Choosing t = 1 |ˆI1| yields with probability at least 1 cq δ for some absolute constant c. Finally, since I1 ˆI1 for all choices of λ and m, Ψ 1Ψ1 with probability at least 1 cq δ uniformly over all λ and m. Lemma 13 (Bound on K2). In the setting of Appendix C.2, if c < q/2 < q < d/2 for c as in Assumption 1, we have c1 µmin (K2) K2 c2 for some positive constants c1, c2 with probability at least 1 cq (1 δ). Proof. First, the condition on d implies q > T and ensures that we can apply Lemmas 16 and 17, and Assumption 1. Furthermore, note that T 1 > L + 2. Proposition 1 and the definition of K2 yield the following decomposition: K2(x, x ) = l= L+2 ξ(q) l X γ(S) q |S|=l d B(l, q) YS(x)YS(x ) = l= L+2 ξ(q) l Q(d,q) l (x, x ), where Q(d,q) l (x, x ) is defined in Equation (30). Then, using the triangle inequality with non-negativity of the ξ(q) l from Equations (18) and (19) in Assumption 1, we have l= L+2 ξ(q) l Q(d,q) l l= L+2 ξ(q) l Q(d,q) l + l=T ξ(q) l Q(d,q) l l= L+2 ξ(q) l Q(d,q) l + l=T ξ(q) l Q(d,q) l l= L+2 ξ(q) l + c3 (ii) c2 with probability 1 cq (1 δ). Q(d,q) l is the kernel matrix corresponding to Q(d,q) l (x, x ), (i) uses Lemmas 16 and 17, and (ii) uses non-negativity and additionally Equation (21) in Assumption 1. Published as a conference paper at ICLR 2023 The lower bound follows similarly from l= L+2 ξ(q) l µmin Q(d,q) l ξ(q) L+2µmin Q(d,q) L+2 where (i) follows from Lemma 16 with probability at least 1 cq (1 δ), and (ii) from Equation (18) in Assumption 1. Since Lemma 16 yields both the upper and lower bound for all l uniformly with probability 1 cq (1 δ), this concludes the proof. Lemma 14 (Bound on Tr(S2)). In the setting of Appendix C.2, if c < q/2 < q < d/2 for c as in Assumption 1, we have with probability at least 1 cq (1 δ) that Tr(S2) Θ(q (1 δ)) and S2 O 1 dq L+1 where S2 is defined in Appendix C.2. Proof. Throughout the proof, the conditions on d and hence q Θ(dβ) ensure that we can apply Assumption 1 and Lemma 16, as well as L + 2 T < q/2 . First, Lemma 13 yields K2 Θ(1) with probability at least 1 c q (1 δ), which we will use throughout the proof. Next, we bound S2 in two steps. For this, remember that λ m is the largest eigenvalue corresponding to an eigenfunction YS of degree |S| L + 2. Proof that S2 λ m K2 Define Ψk Rn n with [ Ψk]i,j := YSk(xi)YSk(xj) for all i, j {1, . . . , n}, k I2, and let v be any vector in Rn. Then, k I2 λ2 k Ψk max k I2{λk} X k I2 λk Ψkv = λ m K2v , where (i) follows from the definition of S2 and (ii) from the definition of m. Proof that λ m Θ 1 dq L+1 We show that λk O 1 dq L+1 for all k I2, and that there exists m I2 with λ m Ω 1 dq L+1 . Since λ m = maxk I2 λSk, those two facts imply λ m Θ 1 dq L+1 . Let k I2 be arbitrary. Lemma 10 yields O 1 dq|Sk| 1 |Sk| < T O 1 dq T 1 |Sk| T (i) O 1 dq L+1 where (i) follows from |Sk| L + 2 and T L + 2. Now we show that there exists m with λ m Ω 1 dq L+1 . 
We choose m with S m = {1, 2, . . . , L+2}. Note that L + 2 = γ(S m) q and thus m I2. Next, Proposition 1 yields λS m = ξ(q) L+2 (q + 1 γ(S m)) d B( L + 2, q) (i) Θ q dq L+2 Published as a conference paper at ICLR 2023 where (i) follows from Equations (18) and (21) in Assumption 1. Finally, combining the previous two results and K2 Θ(1), we have S2 λ m K2 O (λ m) = O 1 dq L+1 Upper bound of Tr(S2) The upper bound also follows directly from the last two results: Tr(S2) n S2 O n dq L+1 = O(q (1 δ)). Lower bound of Tr(S2) The lower bound requires a more refined argument. S2(x, x ) = X k I2 λ2 Sk YSk(x)YSk(x ) X γ(S) q |S|= L+2 λ2 SYS(x)YS(x ) γ(S) q/2 |S|= L+2 λ2 SYS(x)YS(x ) (i) = (ξ(q) L+2)2 X γ(S) q/2 |S|= L+2 (q + 1 γ(S))2 d2B( L + 2, q)2 YS(x)YS(x ) = (ξ(q) L+2)2 d B( L + 2, q) γ(S) q/2 |S|= L+2 (q + 1 γ(S))q + 1 γ(S) d B( L + 2, q) YS(x)YS(x ) (ii) (ξ(q) L+2)2q/2 d B( L + 2, q) γ(S) q/2 |S|= L+2 q/2 + 1 γ(S) d B( L + 2, q) YS(x)YS(x ), where (i) follows from Proposition 1. In (ii), we use that, as long as γ(S) q/2 , we have q + 1 γ(S) q 2 and q q/2 . Continuing the bound, we have S2(x, x ) (ξ(q) L+2)2q/2 d B( L + 2, q) γ(S) q/2 |S|= L+2 q/2 + 1 γ(S) d B( L + 2, q) YS(x)YS(x ) (iii) c L (ξ(q) L+2)2q/2 d B( L + 2, q) γ(S) q/2 |S|= L+2 q/2 + 1 γ(S) d B( L + 2, q/2 ) YS(x)YS(x ) (iv) = c L (ξ(q) L+2)2q/2 d B( L + 2, q)Q(d, q/2 ) L+2 (x, x ). In (iii), we use the classical bound on the binomial coefficient q L+2 = B( L + 2, q) as follows: B( L + 2, q) (2e) L+2 q/2 L+2 (2e) L+2c L L+2 = c LB( L + 2, q/2 ), where c L is a constant that depends only on L. Finally, (iv) follows from the definition of Q(d, q/2 ) L+2 in Equation (30). The bound on S2(x, x ) implies µmin (S2) c L (ξ(q) L+2)2q/2 d B( L + 2, q)µmin Q(d, q/2 ) L+2 Published as a conference paper at ICLR 2023 and allows us to ultimately lower-bound Tr(S2) as follows: Tr(S2) nµmin (S2) c Ln (ξ(q) L+2)2q/2 d B( L + 2, q)µmin Q(d, q/2 ) L+2 (i) c Ln (ξ(q) L+2)2q/2 d B( L + 2, q) (ii) dℓ q dq L+2 = dq L+ δ dq L+1 = q (1 δ), where (i) follows from the lower bound on µmin Q(d, q/2 ) L+2 in Lemma 16 with probability at least 1 c q/2 (1 δ) 1 c q (1 δ), and (ii) follows from Equation (18) in Assumption 1 and the classical lower bound on the binomial coefficient. This yields the desired lower bound Tr(S2) Ω(q (1 δ)). Finally, collecting all error probabilities concludes the proof. Lemma 15 (Diagonal elements of Q(d,q)). Let l, q, d N with 0 < l q < d/2 and x { 1, 1}d. Then, Q(d,q) l (x, x) = 1. Proof. First, Q(d,q) l (x, x) = X γ(S) q |S|=l d B(l, q) YS(x)2 (i) = 1 d B(l, q) γ(S) q |S|=l = 1 d B(l, q) γ=l (q + 1 γ) X γ(S)=γ |S|=l where (i) follows from the fact that YS(x)2 = 1. Note that P γ(S)=γ |S|=l 1 matches the definition of C(l, γ, d) in the proof of Lemma 9. Next, we use the following recurrence: γ=l (q + 1 γ) C(l, γ, d) = C(l, γ, d) + ((q 1) + 1 γ) C(l, γ, d) = C(l, q, d) + 0 + γ=l ((q 1) + 1 γ) C(l, γ, d), where the last step uses the fact that C(l, q, d) = Pq γ=l C(l, γ, d) by definition, and that the term corresponding to γ = q in the second sum is zero. Recursively applying this formula q l times yields q X γ=l (q + 1 γ) C(l, γ, d) = γ=l C(l, γ, d). Using this identity, we finally get Q(d,q) l (x, x) = 1 d B(l, q) γ=l C(l, γ, d) (i) = 1 d B(l, q) γ=l d γ 1 l 1 (ii) = 1 d B(l, q)d q l where (i) follows from Lemma 9, (ii) from the hockey-stick identity, and (iii) from the definition of B(l, q) in Equation (15). 
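The identity of Lemma 15 is easy to confirm numerically; the sketch below evaluates the diagonal of $Q_l^{(d,q)}$ by direct summation over all $S$ with $|S| = l$ and $\gamma(S) \le q$, using that $Y_S(x)^2 = 1$ on the hypercube (the parameter values are illustrative assumptions).

```python
# Numeric confirmation of Lemma 15 (illustrative parameters; uses Y_S(x)^2 = 1).
import itertools
from math import comb, isclose

def gamma(S, d):
    """Cyclic diameter of S over {1, ..., d}."""
    return max(min((j - i) % d + 1, (i - j) % d + 1) for i in S for j in S)

def Q_diag(l, q, d):
    """Q_l^{(d,q)}(x, x): sum of (q + 1 - gamma(S)) / (d * B(l, q)) over |S| = l, gamma(S) <= q."""
    return sum((q + 1 - gamma(S, d)) / (d * comb(q, l))
               for S in itertools.combinations(range(1, d + 1), l)
               if gamma(S, d) <= q)

d, q = 12, 5                                # satisfies q < d / 2
for l in range(1, q + 1):
    assert isclose(Q_diag(l, q, d), 1.0)    # the diagonal entries equal 1
```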
Published as a conference paper at ICLR 2023 C.5 RANDOM MATRIX THEORY LEMMAS We use the following lemmas related to random matrix theory in the proof of Lemma 3. The first two results bound the kernel s intermediate and late eigenvalues. Lemma 16 (Bound on the kernel s intermediate tail). In the setting of Appendix C.2, for T 1 q < d/2, with probability at least 1 cq (1 δ), all l N with L + 2 l < T satisfy Q(d,q) l In 1 where T is defined in Assumption 1 and Q in Equation (30). This lemma particularly implies Q(d,q) l 1.5 and µmin Q(d,q) l 1/2 for all L + 2 l < T with high probability. Lemma 17 (Bound on the kernel s late tail). In the setting of Appendix C.2, if c q/2 1 < q < d/2, for c and T as in Assumption 1, we have l=T ξ(q) l Q(d,q) l with probability at least 1 cd 1. Proof of Lemma 16. First, Lemma 15 yields that the diagonal elements of Q(d,q) l are just 1. Hence, we define W(d,q) l := Q(d,q) l In, and want to show that, with high probability, W(d,q) l 1/2 for all L + 2 l < T at the same time. The proof makes use of Lemma 18. Therefore, we need to find an appropriate M(l,q) for each considered l, and show that the conditions of the lemma hold. The first condition follows directly from the construction of W(d,q) l and diag(Q(d,q) l ) = In. To establish the second condition, we have for all i = j, k = j Exj h [W(d,q) l ]i,j[W(d,q) l ]j,k i 1 (d B(l, q))2 X γ(S) q |S|=l γ(S ) q |S |=l (q + 1 γ(S))(q + 1 γ(S ))YS(xi)E [YS(xj)YS (xj)] YS (xk) (i) = 1 (d B(l, q))2 γ(S) q |S|=l (q + 1 γ(S))2YS(xi)YS(xk) q d B(l, q) 1 d B(l, q) γ(S) q |S|=l (q + 1 γ(S))YS(xi)YS(xk) = d B(l, q) 1 h W(d,q) l i 1 h W(d,q) l i where (i) follows from orthogonality of the eigenfunctions. Hence, M(l,q) = d B(l, q)/lq satisfies the second condition in Lemma 18 for all L + 2 l < T. Published as a conference paper at ICLR 2023 The extra l factor is necessary for the third condition to hold. As in Lemma 18, let p N, p 2. Then, for all i = j, E h |[W(d,q) l ]i,j|pi 1 p (i) (p 1)l/2 r E h [W(d,q) l ]2 i,j i (ii) v u u u tpl X γ(S) q |S|=l (q + 1 γ(S))2 (d B(l, q))2 (d B(l, q))2 C(l, q, d) = pl lq d B(l, q) q C(l, q, d) where (i) follows from hypercontractivity in Lemma 19, (ii) from orthogonality of the eigenfunctions, and (iii) from the definition of C(l, q, d) as well as 1 γ(S) q. Step (iv) follows from Lemma 9, the definition of B(l, q) in Equation (15), q 1 l 1 q l = q l , and the definition of M(l,q). Since all conditions are satisfied, Lemma 18 yields for all p N, p > 2 Pr W(d,q) l > 1/2 cp l,1p3pn n M(l,q) p + cp l,2 n M(l,q) where cl,1, cl,2 are positive constants that depend on l. In particular, if p 2 + 1 δ , then we get Pr W(d,q) l > 1/2 clq 2(1 δ), where cl is a positive constant that depends on l. To avoid this dependence, we can take the union bound over all l = L + 2, . . . , T 1: Pr l { L + 2, . . . , T 1}: W(d,q) l > 1/2 q 2(1 δ) T 1 X l= L+2 cl c Lq 2(1 δ), where c L only depends on L and T, which are fixed in our setting. Finally, additionally note that neither L nor T depend on ℓλ. Proof of Lemma 17. First, Lemma 15 yields that the diagonal elements of Q(d,q) l are just 1 for all l {T, . . . , q}. Hence, we define W(d,q) l := Q(d,q) l In, and decompose the kernel matrix as l=T ξ(q) l Q(d,q) l = l=T ξ(q) l W(d,q) l + l=T ξ(q) l In. 
We can hence apply the triangle inequality to bound the norm as follows: l=T ξ(q) l Q(d,q) l l=T ξ(q) l W(d,q) l l=T ξ(q) l In l=T ξ(q) l W(d,q) l + l=T ξ(q) l In i =j [W(d,q) l ]2 i,j + n2 max i =j [W(d,q) l ]2 i,j | {z } =:ϖl q + c 1, with probability 1 cd 1, where (i) uses non-negativity of the ξ(q) l from Equations (18) and (19) in Assumption 1, (ii) bounds the operator norm with the Frobenius norm, and (iii) additionally bounds the sum of the ξ(q) l using Published as a conference paper at ICLR 2023 Equation (21) in Assumption 1. Step (iv) use a bound that we show in the remainder of the proof: with probability at least 1 cd 1, we have ϖl c /q uniformly over all l {T, . . . , q}. We first bound ϖl for a fixed T l q as follows: Pr(ϖl > 1/q) = Pr ξ(q) l n2 max i =j [W(d,q) l ]2 i,j > 1/q max i =j [W(d,q) l ]2 i,j > 1 n2(ξ(q) l )2q2 i = j : [W(d,q) l ]2 i,j > 1 n2(ξ(q) l )2q2 (i) n(n 1) Pr [W(d,q) l ]2 1,2 > 1 n2(ξ(q) l )2q2 (ii) n4(ξ(q) l )2q2E h [W(d,q) l ]2 1,2 i (iii) = n4(ξ(q) l )2q2 (d B(l, q))2 X γ(S),γ(S ) q |S|,|S |=l (q + 1 γ(S))(q + 1 γ(S )) E [YS(x1)YS(x2)YS (x1)YS (x2)] | {z } δS,S = n4(ξ(q) l )2q2 (d B(l, q))2 X γ(S) q |S|=l (q + 1 γ(S))2 (iv) (q + 1 l)n4(ξ(q) l )2q2 γ(S) q |S|=l (q + 1 γ(S)) d B(l, q) YS(x1)2 n4(ξ(q) l )2q3 d B(l, q) Q(d,q) l (x1, x1) (v) = n4q3 d (ξ(q) l )2 where (i) follows from the union bound and the distribution of the off-diagonal entries in W(d,q), and (ii) from the Markov inequality. In step (iii), we use orthogonality of the eigenfunctions, as well as the fact that W(d,q) l and Q(d,q) l coincide on off-diagonal entries by construction. Step (iv) follows from YS(x)2 = 1 for all S and x { 1, 1}d. Finally, step (v) applies Lemma 15. Next, we use the union bound over all ϖl of interest: Pr ( l {T, . . . , q}: ϖl > 1/q) l=T Pr (ϖl > 1/q) d (ξ(q) l )2 (ξ(q) l )2 q l + (ξ(q) l )2 q l + (ξ(q) l )2 q l (ξ(q) l )2 q l + (ξ(q) q l )2 q q l + (ξ(q) q l )2 q q l (ξ(q) l )2 + (ξ(q) q l)2 q l | {z } =:E1 (ξ(q) q l )2 | {z } =:E2 where (i) substitutes l = q l, and (ii) uses q q/2 1 q/2 as well as the fact that q q l = q l . We bound both E1 and E2 using Assumption 1. Published as a conference paper at ICLR 2023 For E1 in particular, Equations (18), (19) and (21) imply that all ξ(q) l 1. Hence, (ξ(q) l )2 + (ξ(q) q l)2 q l 1 q T = q/2 T + 1 q T (ii) T T q q T 1 q T 1 , where (i) exploits that T is the value in {T, . . . , q/2 } the furthest away from q/2, and thus min l {T,..., q/2 } and (ii) follows from the classical lower bound on the binomial coefficient. For E2, we have (ξ(q) q l )2 l l (ξ(q) q l )2 q2T 2l +2+l T T T q T +2 1 q T 1 , where (i) uses the classical bound on the binomial coefficient, and (ii) Equation (20) in Assumption 1. Combining the bounds on E1 and E2 finally yields Pr ( l {T, . . . , q}: ϖl > 1/q) l=T Pr (ϖl > 1/q) n4q3 d (E1 + E2) dq T 4 d4ℓ 1 β(T 4) (i) 1 where (i) follows from the definition of T = 4 + 4ℓ β in Assumption 1. The next statement is a non-asymptotic version of Proposition 3 from Ghorbani et al. (2021). Lemma 18 (Graph argument). Let W Rn n be a random matrix that satisfies the following conditions: 1. [W]i,i = 0, i {1, . . . , n}. 2. There exists M > 0 such that, for all i, j, k {1, . . . , n} with i = j and j = k, we have Exj [[W]i,j[W]j,k] 1 3. There exists l N such that, for all p N, p 2 and all i, j {1, . . . , n}, i = j, we have E [|[W]i,j|p]1/p Then, for all p N, p > 2, Pr ( W > 1/2) cp 1p3pn n where c1 and c2 are positive constants that depend on l. Proof. 
Repeating the steps in the proof of Proposition 3 from Ghorbani et al. (2021), we get E[ W 2p] E Tr(W2p) (cp)3p np+1 Note that the proof in Ghorbani et al. (2021) assumes M to be in the order of dl. We get rid of this assumption and keep M explicit. Furthermore, Ghorbani et al. (2021) use their Lemma 4 during their proof, but we use our Lemma 19 instead. Ultimately, we apply the Markov inequality to get a high-probability bound: Pr( W 1/2) = Pr W 2p (1/2)2p (1/2)2p = c3 Renaming the constants concludes the proof. Published as a conference paper at ICLR 2023 Lemma 19 (Hypercontractivity). For all l, q, d N and p 2, we have Ex,x U({ 1,1}d) h |Q(d,q) l (x, x )|pi1/p (p 1)l/2 r Ex,x U({ 1,1}d) h (Q(d,q) l (x, x ))2 i , where Q(d,q) l (x, x ) is defined in Equation (30). Proof. Let x, x U({ 1, 1}d) and let z be the entry-wise product of x and x . Then, for all S {1, . . . , d}, YS(x)YS(x ) depends only on z: YS(x)YS(x ) = i S [x]i[x ]i = Y Hence, Q(d,q) l (x, x ) also only depends on x and x via z. Furthermore, note that z U({ 1, 1}d). Therefore, we can use hypercontractivity (Beckner, 1975) as for instance in Lemma 4 from Misiakiewicz & Mei (2021) to conclude the proof. D OPTIMAL REGULARIZATION AND TRAINING ERROR D.1 OPTIMAL REGULARIZATION In the main text we often refer to the optimal regularization λopt, defined as the minimizer of the risk Risk( ˆfλ). While we cannot calculate λopt directly, we only need the rate ℓλopt such that max{λopt, 1} Θ dℓλopt . Furthermore, it is not a priori clear that such a ℓλopt minimizes the rate exponent of the risk in Theorem 1. The current subsection establishes that this is indeed the case, and provides a way to determine ℓλopt. We introduce some shorthand notation for the rate exponents in Theorem 1: ηv(ℓλ; ℓ, ℓσ, β) := ℓσ ℓλ ℓmin{δ, 1 δ}, ηb(ℓλ; ℓ, L , β) := 2 2 ℓ( ℓλ 1 β(L 1)), η(ℓλ; ℓ, ℓσ, L , β) := max {ηv(ℓλ; ℓ, ℓσ, β), ηb(ℓλ; ℓ, L , β)} . We highlight that ηb and η depend on ℓλ also through δ = ℓ ℓλ 1 β k . Hence, in the setting of Theorem 1, we have with high probability that Variance( ˆfλ) Θ(nηv(ℓλ;ℓ,ℓσ,β)), Bias2( ˆfλ) Θ(nηb(ℓλ;ℓ,L ,β)), Risk2( ˆfλ) Θ(nη(ℓλ;ℓ,ℓσ,L ,β)). In the following, we view those quantities as functions of ℓλ, with all other parameters fixed. Next, we additionally define λopt := arg min λ 0|max {λ,1} O(d ℓ) Risk( ˆfλ), ℓλmin := arg min ℓλ [0, ℓ] η(ℓλ; ℓ, ℓσ, L , β), ηmin := min ℓλ [0, ℓ] η(ℓλ; ℓ, ℓσ, L , β), ℓ:= ℓ 1 β(L 1). First, we remark that ℓλmin the set of regularization rates that minimize the risk rate might have cardinality larger than one. However, it cannot be empty: [0, ℓ] is a closed set, and Lemma 21 below shows that η is a continuous function. Published as a conference paper at ICLR 2023 Second, ℓdefines the scope of the minimization domain, guaranteeing that the constraint on L in Theorem 1 holds for all candidate ℓλ. Rate of optimal regularization ℓλopt vs. optimal rate ℓλmin: Let ℓλopt be the rate of the optimal regularization strength such that max {λopt, 1} Θ(dℓλopt ). It is a priori not clear that ℓλopt minimizes η. However, Lemma 20 bridges the two quantities, and guarantees with high probability that the rate of λopt minimizes the rate of the risk. Lemma 20 (Optimal regularization and optimal rate). In the setting of Theorem 1, assume ℓ> 0, β (0, 1), ℓσ ℓ, L h 1, ℓ 1 β i N. Then, for d sufficiently large, with probability at least 1 cd β min{ δ,1 δ} there exists l ℓλmin such that max{λopt, 1} Θ dl . Hence, we only need to obtain a minimum rate l ℓλmin instead of ℓλopt. 
In order to propose a method for this, we first establish properties of η, ηb, ηv in the following lemma. Lemma 21 (Properties of η). Assume ℓ> 0, β (0, 1), ℓσ ℓ, L h 1, ℓ 1 1. Over [0, ℓ], ηv( ; ℓ, ℓσ, β) is continuous and non-increasing, and ηb( ; ℓ, L , β) is continuous and strictly increasing. 2. ℓλmin := arg minℓλ [0, ℓ] η(ℓλ; ℓ, ℓσ, L , β) is a closed interval. 3. If there exists l 0, ℓ with ηv( l; ℓ, ℓσ, β) = ηb( l; ℓ, L , β), then ηmin = η( l; ℓ, ℓσ, L , β). Otherwise, ηmin = η(0; ℓ, ℓσ, L , β). 4. Every l [0, ℓ] with l l for all l ℓλmin satisfies η(l; ℓ, ℓσ, L , β) = ηb(l; ℓ, L , β). 5. Let l [0, ℓ] with l l for all l ℓλmin. If η(l; ℓ, ℓσ, L , β) ηmin c where c > 0 is constant 5 and depends only on ℓ, ℓσ, L , β, then η(l; ℓ, ℓσ, L , β) = ηmin + 2 ℓ(min (ℓλmin) l) . Finding an optimal rate: Lemma 21 suggests a simple strategy to find a l ℓλmin numerically: search the intersection of ηv and ηb in [0, ℓ]; if found, then the intersection point is optimal, otherwise l = 0 is optimal. Note that, if the intersection point exists, it is unique and easy to numerically approximate, since ηv is non-increasing, and ηb is strictly increasing. Calculating numerical solutions: However, Lemma 21 also shows that ℓλmin is an interval and thus might contain multiple values. In that case, the proposed strategy might not necessarily retrieve the rate of λopt, but a different l ℓλmin. Yet, Theorem 1 guarantees that both the optimally regularized estimator and any estimator regularized with max {λ, 1} dl for any l ℓλmin have a risk vanishing with the same rate nηmin with high probability. In particular, this allows us to exhibit the rate of the optimally regularized estimator in Figure 1a. Finally, because of the multiple descent phenomenon (see, for example, Figure 1), we do not expect either ℓλopt or β to attain an easily readable closed-form expression. Nevertheless, simple optimization procedures allow us to calculate accurate numerical approximations. Finally, we prove Lemmas 20 and 21. Proof of Lemma 20. Let ℓλopt be such that max {λopt, 1} Θ dℓλopt . 5We omit an explicit definition of c here for brevity and refer to the proof instead. Published as a conference paper at ICLR 2023 Combining the bounds on bias and variance from Theorem 1, we have Risk2( ˆfλ) = Bias2( ˆfλ) + Variance2( ˆfλ) ℓ( ℓλ 1 β(L 1)) + n ℓmin{δ,1 δ} Θ nη(ℓλ;ℓ,ℓσ,L ,β) with probability 1 cd min{ δ,1 δ} (38) uniformly for all ℓλ 0, ℓ with max{λ, 1} Θ(dℓλ). Throughout the remainder of this proof, we drop the dependencies on ℓ, ℓσ, L , β in the notation of η and simply write η(ℓλ). We further omit repeating that each step is true with probability at least 1 cd min{ δ,1 δ}, but imply it throughout. The goal of the proof is to show that there exists l ℓλmin sufficiently close to ℓλopt; formally, we need to show that dℓλopt l Θ(1). (39) If this is the case, then the definition of ℓλopt yields the conclusion as follows: max {λopt, 1} Θ dℓλopt = Θ dldℓλopt l (i) = Θ dl , where (i) uses Equation (39). Towards an auxiliary result, we first apply Equation (38) to ℓλopt and l ℓλmin with max{λ, 1} Θ(dl): Risk2( ˆfλ) c2nη(l) = c2nηmin, Risk2( ˆfλopt) c1nη(λopt), where c1, c2 > 0 are the constants hidden by the Θ-notation in Equation (38). Next, the optimality of λopt yields c1 < c2, and c1nη(ℓλopt) Risk2( ˆfλopt) Risk2( ˆfl) c2nηmin. This implies nη(ℓλopt) ηmin c2 c1 η(ℓλopt) ηmin log(c2/c1) log n , (40) where the second implication uses that η(ℓλopt) ηmin 0, since ηmin is the minimum rate. 
With this result, we finally focus on establishing Equation (39) which yields the claim of this lemma. Lemma 21 shows that ℓλmin is an interval; hence, we distinguish three cases: Case ℓλopt ℓλmin: Picking ℓλopt = l ℓλmin directly yields dℓλopt ℓλopt = d0 Θ(1). Case ℓλopt < min (ℓλmin): Let l := min(ℓλmin). Equation (40) yields η(ℓλopt) ηmin c for any c > 0 if d is large enough. Hence, for d fixed but sufficiently large, Lemma 21 yields η(ℓλopt) = ηmin + 2 ℓ l ℓλopt . Applying Equation (40), we get l ℓλopt = ℓ 2(η(ℓλopt) ηmin) ℓlog(c2/c1) Now, since ℓλopt l, dℓλopt l O(1). Furthermore, ℓlog(c2/c1) 2 log n (i) = d 2 log d = p c2/c1 O(1), where (i) follows from n Θ(dℓ). Since both dl ℓλopt and 1 d l ℓλopt = dℓλopt l are in O(1), dℓλopt l Θ(1), Published as a conference paper at ICLR 2023 which establishes Equation (39) for l ℓλmin and thereby concludes this case. Case ℓλopt > max (ℓλmin): Let l := max(ℓλmin). Then, Lemma 21 yields η(ℓλopt) = 2 2 ℓ( ℓλopt 1 β(L 1)) ℓ( ℓλopt l + l 1 β(L 1)) ℓ( l 1 β(L 1)) 2 (i) = ηmin + 2 ℓ(ℓλopt l), where (i) uses that l ℓλmin. Applying Equation (40), we get ℓλopt l = ℓ 2(η(ℓλopt) ηmin) ℓlog(c2/c1) Analogously to the previous case, this implies dℓλopt l O(1), and the fact that ℓλopt > l implies dℓλopt l Ω(1). Together, this yields dℓλopt l Θ(1), which establishes Equation (39) for l ℓλmin and thereby concludes this case. Proof of Lemma 21. Throughout this proof, we drop the dependencies on ℓ, ℓσ, L , β in the notation of η, ηv, ηb and simply write η(l), ηv(l), ηb(l). Continuity and monotonicity (Item 1) We first show that ηv(l) and ηb(l) are continuous functions. ηb(l) is an affine function of l, hence continuous. ηv(l), however, additionally depends on l via δ, which is not linear. Hence, to show that ηv(l) is continuous, we need to show that min{δ, 1 δ} is continuous. Consider the triangle wave function ϖ(t) := min{t t , 1 (t t )}, which is well-known to be continuous. Because min{δ, 1 δ} = ϖ ℓ l 1 β , and ℓ l 1 β is a linear function of l, we get that ηv(l) is also continuous. For monotonicity, we consider the derivatives of ηv(l) and ηb(l). For ηb(l), we have Since ηb is an affine function, and 2 ℓ> 0, this also implies that ηb is strictly increasing. For ηv, we need to distinguish two cases: ( l ℓσ l βδ ℓ δ < 1 δ l ℓσ l β(1 δ) = 0 δ < 1 δ, 2 Since ηv is a continuous function with non-positive derivatives, this implies that ηv is non-increasing. η decomposition (Item 3) First assume that there exists l 0, ℓ with ηv( l) = ηb( l). Since ηv is non-increasing and ηb is strictly increasing, η(l) := max{ηb(l), ηv(l)} = ηv(l) l < l, ηb(l) otherwise. In particular, η(l) > η( l) for all l > l as ηb is strictly increasing, and η(l) η( l) for all l < l since ηv is non-increasing. Combined, this yields ηmin = η( l; ℓ, ℓσ, L , β). Published as a conference paper at ICLR 2023 Next, assume l does not exist, that is, ηv(l) = ηb(l) for all l 0, ℓ . Then, due to continuity, either ηb(l) > ηv(l) for all l 0, ℓ , or ηv(l) > ηb(l) for all l 0, ℓ . However, a closer analysis shows that the latter is not possible: the rates at the boundary ℓare ηb( ℓ) = 2 2 ℓ( (ℓ 1 β(L 1)) 1 β(L 1)) = 0, ηv( ℓ) = ℓσ + (ℓ 1 β(L 1)) ℓmin{δ, 1 δ} (i) 0, where (i) follows from the assumption ℓσ ℓand the fact that min{δ, 1 δ} 0. Hence, if l does not exist, ηv(l) < ηb(l), l [0, ℓ]. In particular, ηmin is the minimum of the strictly increasing function ηb(l) over [ 0, ℓ , and therefore attained only at 0. 
Lastly, combining both cases of l yields the following convenient expression: η(l) = ηv(l) l exists and l < l, ηb(l) otherwise. (41) Closed interval of solutions (Item 2) We again differentiate whether l as in the previous step exists or not. If l does not exist, then the previous step already yields ℓλmin = {0}, which is a closed interval. Next, assume l exists. Then, all l ℓλmin satisfy l l, since l ℓλmin and ηb is strictly increasing. Since further ηv is continuous and non-increasing over h 0, l i ℓλmin, ℓλmin is an interval. Finally, η(l) = max{ηb(l), ηv(l)} is the maximum of two continuous functions, hence itself also continuous. Therefore, the minimizers ℓλmin of η are a closed set. This concludes that ℓλmin is a closed interval. Proof of Items 4 and 5 Item 4 follows straightforwardly from Equation (41): If l does not exist, then η(l) = ηb(l) for all l 0, ℓ . Similarly, if l exists, η(l) = ηb(l) for all l l, in particular for l l ℓλmin. Item 5 requires additional considerations. In the case where l does not exist, we have ℓλmin = {0}, and the result follows directly. Otherwise, using Equation (41), we have η(l) = ηv(l) for any l l, as l min(ℓλmin) l. As shown in the proof of Item 1, ηv(l) alternates between derivatives 0 and 2/ℓ. We claim that there exists a left neighborhood of min (ℓλmin) where the derivative is 2/ℓ. Assume towards a contradiction that no such left neighborhood exists. Then, there must be a left neighborhood of min (ℓλmin) where the derivative is 0, since ηv(l) alternates between only two derivatives. However, in that left neighborhood, ηv(l) is constant with all values equal to ηmin. Hence, there exists l < min (ℓλmin) with η(l) = ηv(l) = ηmin and thus l ℓλmin, which is a contradiction. Thus, there exists a left neighborhood of min (ℓλmin) with diameter ε > 0 throughout which the derivative is 2 ℓ. Then, for all l [min (ℓλmin) ε, min (ℓλmin)], ηv(l) = ηmin 2 ℓ(l min (ℓλmin)). Finally, since ηv(l) is non-increasing, as long as ηv(l) ηmin + 2 ℓε we have l min (ℓλmin) ε. Hence, choosing c = 2 ℓε yields the statement of Item 5. D.2 PROOF OF THEOREM 2 The informal Theorem 2 in the main text relies on a β , defined as the intersection of the variance and bias rates from Theorem 1 for the interpolator ˆf0 (setting ℓλ = 0). Whenever β is unique, the fact that Bias2( ˆf0) in Theorem 1 strictly increases as a function of β induces a phase transition: for β > β , the bias dominates the rate of the risk in Theorem 1, while for β β , the variance dominates. In particular, Lemmas 20 and 21 imply that interpolation is harmless whenever the bias dominates, and harmful if the variance dominates. Intuitively, Theorem 2 considers optimally regularized estimators, and varies the inductive bias strength via β. The formal Theorem 4 below presents a different perspective: it considers β fixed, Published as a conference paper at ICLR 2023 and instead differentiates whether the optimal risk rate in Theorem 1 results from ℓλ > 0 (harmful interpolation) or ℓλ = 0 (harmless interpolation). Whenever β is well-defined and unique, as for example in the setting of Figure 1, the two perspectives coincide. However, one can construct pathological edge cases where variance and bias rates intersect on an interval of β values, or the dominating quantity of the risk rate in Theorem 1 as a function of β alternates between the variance and bias. While Theorem 2 fails to capture such settings, Theorem 4 still applies. We hence present Theorem 4 as a more general result. 
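The numerical strategy suggested by Lemma 21 (search for the intersection of the variance and bias rate exponents over the admissible range of regularization rates, and fall back to zero regularization if no intersection exists) is straightforward to implement. The sketch below is an illustration rather than the authors' implementation: it takes the two exponents as callables, whose closed forms from Theorem 1 and Appendix D.1 are not reproduced here, and exploits that the variance exponent is non-increasing while the bias exponent is strictly increasing, so the intersection, when it exists, is unique and bisection suffices.

```python
# Minimal sketch: minimize max(eta_v, eta_b) on [0, l_bar] by bisection.
# eta_v (non-increasing) and eta_b (strictly increasing) are assumed to be
# callables implementing the rate exponents of Theorem 1 (not reproduced here).
from typing import Callable

def optimal_rate(eta_v: Callable[[float], float],
                 eta_b: Callable[[float], float],
                 l_bar: float,
                 tol: float = 1e-10) -> float:
    """Return an l in [0, l_bar] minimizing max(eta_v(l), eta_b(l))."""
    diff = lambda l: eta_v(l) - eta_b(l)   # non-increasing in l
    if diff(0.0) <= 0.0:
        # eta_b dominates on the whole interval, so its minimum sits at 0.
        return 0.0
    if diff(l_bar) > 0.0:
        # No intersection: eta_v dominates and is non-increasing, so take l_bar.
        return l_bar
    lo, hi = 0.0, l_bar                     # diff(lo) > 0 >= diff(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```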
Theorem 4 (Formal version of Theorem 2). In the setting of Theorem 1 using the notation from Appendix D.1, let ℓ> 0, β (0, 1), ℓσ ℓ, and L h 1, ℓ 1 β i N. Let further λopt = arg minλ 0|max {λ,1} O(d ℓ) Risk( ˆfλ). Then, the expected training error behaves as follows: 1. If all l arg minl [0, ℓ] η(l; ℓ, ℓσ, L , β) satisfy η(l; ℓ, ℓσ, L , β) < η(0; ℓ, ℓσ, L , β), then i ( ˆfλopt(xi) yi)2 # σ2 O(d l+Risk2( ˆfλopt)), w.p. 1 cd β min{ δ,1 δ}. 2. If all l (0, ℓ] satisfy η(l; ℓ, ℓσ, L , β) > η(0; ℓ, ℓσ, L , β), then i ( ˆfλopt(xi) yi)2 # cσ2 + O Risk2( ˆfλopt) , w.p. 1 cd β min{ δ,1 δ} for some constant c < 1. Intuitively, the two cases in Theorem 4 correspond to harmful and harmless interpolation, where the optimal rate in Theorem 1 is for ℓλ > 0 and ℓλ = 0, respectively. Then, Lemma 20 yields with high probability that also ℓλopt > 0 and ℓλopt = 0 in the first and second case, respectively. Finally, we remark that Theorem 4 lacks an edge case: if both some ℓλ > 0 and ℓλ = 0 minimize the risk rate in Theorem 1 simultaneously, Lemma 20 fails to differentiate whether ℓλopt is zero or positive. However, that edge case corresponds to either interpolation or very weak regularization. Hence, we conjecture the corresponding model s training error to behave similar to the second case in Theorem 4. Proof of Theorem 4. First, Lemma 20 yields with probability at least 1 cd β min{ δ,1 δ} that max{λopt, 1} Θ(dl), where l is a minimizer of η(l; ℓ, ℓσ, L , β). The condition in Item 1 guarantees that all optimal l are positive, while the condition in Item 2 ensures that the optimal l = 0. Harmful interpolation setting (Item 1) In this case, l > 0, and thus λopt Θ(dl) for d sufficiently large. We start by applying Theorem 1 with ℓλ = l in this setting. Within the proof of Theorem 1 in Section 5.2, we pick a m N such that for d sufficiently large, with probability at least 1 cd β min{ δ,1 δ}, m satisfies the conditions of Lemmas 1 to 4 and Theorem 3. For the remainder of this proof, let m be the same as in the proof of Theorem 1 for ℓλ = l. This m satisfies the conditions of Lemma 23, which hence yields n Tr H 2 Eϵ i ( ˆfλopt(xi) yi)2 # n Tr H 2 + 6λ2 opt r2 2 r2 1 where H := K + λIn. Then, using that a b a + d implies |b c| |a c| + d, we further get Eϵ i ( ˆfλopt(xi) yi)2 # σ2 σ2 λ2 opt n Tr H 2 1 | {z } :=T1 + 6λ2 opt r2 2 r2 1 n2 | {z } :=T2 Published as a conference paper at ICLR 2023 We first bound T2 as follows: T2 = 6λ2 opt r2 2 r2 1 d2ℓ r2 2 r2 1 D 1 ma 2 d2ℓ d2q2(L 1) d2d2ld2(ℓ l 1) d2q2(L 1) d 2(ℓ l 1 β(L 1)) (iii) O Bias2( ˆfλopt) O Risk2( ˆfλopt) , (43) where (i) uses n Θ(dℓ) and λopt Θ(dl), (ii) applies Lemma 2 for the rate of D 1 ma and Lemma 3 for r2 2, r2 1 Θ(1), and (iii) matches the expression to the rate of the bias in Theorem 1. For T1, we will bound Tr H 2 from above and below using Lemma 22. We hence introduce the following notation: K 2 := K m + K1, H2 := K2 + λopt In, so that H = H2 + K 2, and where K1 and K2 are defined in Appendix C.2. Furthermore, the rank of K 2 is at most |N \ I2| = |{k N | |Sk| < L + 2}|, that is, the number of eigenfunctions that contribute to K 2. The rate of |N \ I2| is |N \ I2| (i) = l=0 C(l, q, d) (ii) = 1 + l=1 d q 1 l 1 (iii) O(d q L), where (i) uses the definition of C(l, q, d) in Equation (17), (ii) applies Lemma 9 with d sufficiently large, and (iii) uses the classical bound q 1 l 1 e q 1 l 1 l 1 as well as the fact that the largest monomial dominates the sum. Finally, since n Θ(d q L+ δ), we have rank(K 2) |N \ I2| O(nq δ) o(n). 
Therefore, for $d$, and hence $n \in \Theta(d^{\ell})$, sufficiently large, $\operatorname{rank}(\widetilde{K}_2) < n$, and Lemma 13 yields $c_1 \le \mu_{\min}(K_2) \le \|K_2\| \le c_2$ for some constants $c_1, c_2 > 0$ with probability at least $1 - c\,q^{-(1-\delta)}$; consequently, $c_1 + \lambda_{\mathrm{opt}} \le \mu_{\min}(H_2) \le \|H_2\| \le c_2 + \lambda_{\mathrm{opt}}$. We can thus instantiate Lemma 22 with $M = H$, $M_1 = \widetilde{K}_2$, and $M_2 = H_2$. This implies that
$$\operatorname{Tr}\big(H^{-2}\big) \ge \frac{n - \operatorname{rank}(\widetilde{K}_2)}{\|H_2\|^2} \ge \frac{n - \operatorname{rank}(\widetilde{K}_2)}{(c_2 + \lambda_{\mathrm{opt}})^2}.$$
The upper bound on $\operatorname{Tr}(H^{-2})$ simply follows from
$$\operatorname{Tr}\big(H^{-2}\big) \le \frac{n}{\mu_{\min}(H)^2} = \frac{n}{\mu_{\min}(K_{\le m} + K_1 + K_2 + \lambda_{\mathrm{opt}} I_n)^2} \le \frac{n}{\mu_{\min}(K_2 + \lambda_{\mathrm{opt}} I_n)^2} \overset{(i)}{\le} \frac{n}{(c_1 + \lambda_{\mathrm{opt}})^2}, \quad (45)$$
where (i) uses the previous lower bound on $\mu_{\min}(K_2)$ from Lemma 13. Combining the upper and lower bounds on $\operatorname{Tr}(H^{-2})$ yields
$$\frac{n - \operatorname{rank}(\widetilde{K}_2)}{n}\,\frac{\lambda_{\mathrm{opt}}^2}{(c_2 + \lambda_{\mathrm{opt}})^2} \;\le\; \frac{\lambda_{\mathrm{opt}}^2}{n}\operatorname{Tr}\big(H^{-2}\big) \;\le\; \frac{\lambda_{\mathrm{opt}}^2}{(c_1 + \lambda_{\mathrm{opt}})^2},$$
and hence, subtracting $1$ on all sides,
$$-\frac{\operatorname{rank}(\widetilde{K}_2)}{n}\,\frac{\lambda_{\mathrm{opt}}^2}{(c_2 + \lambda_{\mathrm{opt}})^2} - \frac{2c_2\lambda_{\mathrm{opt}} + c_2^2}{(c_2 + \lambda_{\mathrm{opt}})^2} \;\le\; \frac{\lambda_{\mathrm{opt}}^2}{n}\operatorname{Tr}\big(H^{-2}\big) - 1 \;\le\; -\frac{2c_1\lambda_{\mathrm{opt}} + c_1^2}{(c_1 + \lambda_{\mathrm{opt}})^2}.$$
Taking absolute values, a simple case distinction, and $c_1 < c_2$ yield the following bound:
$$\left|\frac{\lambda_{\mathrm{opt}}^2}{n}\operatorname{Tr}\big(H^{-2}\big) - 1\right| \le \frac{2c_2\lambda_{\mathrm{opt}} + c_2^2}{(c_2 + \lambda_{\mathrm{opt}})^2} + \frac{\operatorname{rank}(\widetilde{K}_2)}{n}\,\frac{\lambda_{\mathrm{opt}}^2}{(c_2 + \lambda_{\mathrm{opt}})^2} \overset{(i)}{\lesssim} d^{-l^*} + q^{-\delta} \in O(d^{-l^*}),$$
where (i) uses the rate of $\operatorname{rank}(\widetilde{K}_2)$ from Equation (44) and $\lambda_{\mathrm{opt}} \in \Theta(d^{l^*})$. Combining this bound on $T_1$ with the bound on $T_2$ in Equation (43) and collecting all error probabilities concludes the current case.

Harmless interpolation setting (Item 2): In this setting, the only minimizer of the risk rate is $l^* = 0$, and hence $\lambda_{\mathrm{opt}} \le \max\{\lambda_{\mathrm{opt}}, 1\} \in O(d^0) = O(1)$. As for the previous case, let $m$ be the same as in the proof of Theorem 1 for $\ell_\lambda = 0$. With probability at least $1 - c\,d^{-\beta\min\{\delta, 1-\delta\}}$, this $m$ again satisfies the conditions of Lemma 23, which yields
$$\mathbb{E}_\epsilon\Big[ \tfrac{1}{n}\sum_i \big(\hat{f}_{\lambda_{\mathrm{opt}}}(x_i) - y_i\big)^2 \Big] \le \underbrace{\frac{\lambda_{\mathrm{opt}}^2\sigma^2}{n}\operatorname{Tr}\big(H^{-2}\big)}_{=:T_3} + \underbrace{6\lambda_{\mathrm{opt}}^2 \frac{r_2^2}{r_1^2}\frac{\|D_{\le m}^{-1}a\|^2}{n^2}}_{T_2}.$$
Furthermore, we apply the same steps as in the previous case (Equation (43)), now with $l^* = 0$, to bound $T_2 \in O\big(\mathrm{Risk}^2(\hat{f}_{\lambda_{\mathrm{opt}}})\big)$. For $T_3$, we use the same bound on $\operatorname{Tr}(H^{-2})$ as in Equation (45), holding with the same probability, as follows:
$$T_3 = \frac{\lambda_{\mathrm{opt}}^2\sigma^2}{n}\operatorname{Tr}\big(H^{-2}\big) \le \frac{\lambda_{\mathrm{opt}}^2\sigma^2}{n}\cdot\frac{n}{(c_1 + \lambda_{\mathrm{opt}})^2} = \sigma^2\frac{\lambda_{\mathrm{opt}}^2}{(c_1 + \lambda_{\mathrm{opt}})^2} \overset{(i)}{\le} \sigma^2\frac{(c^*)^2}{(c_1 + c^*)^2},$$
where (i) follows for $d$ sufficiently large from $\lambda_{\mathrm{opt}} \in O(1)$, that is, $\lambda_{\mathrm{opt}} \le c^*$ for some constant $c^* > 0$. Since $c_1 > 0$, we have $\frac{(c^*)^2}{(c_1 + c^*)^2} < 1$. Hence, combining the bounds on $T_2$ and $T_3$, as well as collecting all probabilities, we get the desired result for this case.

D.3 TECHNICAL LEMMAS

Lemma 22 (Trace of the inverse). Let $M, M_1, M_2 \in \mathbb{R}^{n \times n}$ be symmetric positive semi-definite matrices with $M = M_1 + M_2$. Furthermore, assume that $\mu_{\min}(M_2) > 0$ and $\operatorname{rank}(M_1) < n$. Then,
$$\operatorname{Tr}\big(M^{-2}\big) \ge \frac{n - \operatorname{rank}(M_1)}{\|M_2\|^2}.$$
Proof. First, we apply the identity
$$M^{-1} = (M_2 + M_1)^{-1} = M_2^{-1} - \underbrace{M_2^{-1} M_1 (M_2 + M_1)^{-1}}_{=:A},$$
which holds since $M_2$, and thus $M$, are full rank. Next, $A$ is a product of matrices including $M_1$; hence, the rank of $A$ is bounded by $\operatorname{rank}(M_1) < n$. Let now $\{v_1, \ldots, v_{\operatorname{rank}(A)}\}$ be an orthonormal basis of $\operatorname{col}(A)$, and let $\{v_{\operatorname{rank}(A)+1}, \ldots, v_n\}$ be an orthonormal basis of $\operatorname{col}(A)^{\perp}$. Thus, $\{v_1, \ldots, v_n\}$ is an orthonormal basis of $\mathbb{R}^n$, and similarity invariance of the trace yields
$$\operatorname{Tr}\big(M^{-2}\big) = \sum_{i=1}^{n} v_i^{\top} M^{-2} v_i = \sum_{i=1}^{\operatorname{rank}(A)} v_i^{\top} M^{-2} v_i + \sum_{i=\operatorname{rank}(A)+1}^{n} v_i^{\top} M^{-2} v_i \overset{(i)}{\ge} \sum_{i=\operatorname{rank}(A)+1}^{n} v_i^{\top} \big(M_2^{-1} - A\big)^2 v_i$$
$$= \sum_{i=\operatorname{rank}(A)+1}^{n} \Big( v_i^{\top} M_2^{-2} v_i - \underbrace{v_i^{\top} A}_{=0} M_2^{-1} v_i - v_i^{\top} M_2^{-1} \underbrace{A v_i}_{=0} + \underbrace{v_i^{\top} A A v_i}_{=0} \Big) = \sum_{i=\operatorname{rank}(A)+1}^{n} v_i^{\top} M_2^{-2} v_i$$
$$\overset{(ii)}{\ge} (n - \operatorname{rank}(A))\,\mu_{\min}\big(M_2^{-2}\big) = (n - \operatorname{rank}(A))\,\frac{1}{\|M_2\|^2} \overset{(iii)}{\ge} \frac{n - \operatorname{rank}(M_1)}{\|M_2\|^2},$$
where (i) uses that $M^{-2}$ is positive semi-definite, that $M^{-1} = M_2^{-1} - A$, and that, for all $i > \operatorname{rank}(A)$, $v_i$ is orthogonal to the column space of $A$, which is symmetric, so that $A v_i = 0$ and $v_i^{\top} A = 0$. Furthermore, (ii) uses that all $v_i$ have norm $1$, and (iii) that $\operatorname{rank}(A) \le \operatorname{rank}(M_1)$.
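Lemma 22 is also easy to verify numerically. The following sanity-check script is ours and not part of the paper; it draws a random full-rank PSD matrix $M_2$ and a random low-rank PSD matrix $M_1$ and checks the trace bound.

```python
# Numerical sanity check (ours) for Lemma 22: for symmetric PSD M = M1 + M2 with
# M2 full rank and rank(M1) < n, we should have Tr(M^{-2}) >= (n - rank(M1)) / ||M2||^2.
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 7                                    # dimension and rank of M1

A = rng.standard_normal((n, n))
M2 = A @ A.T + np.eye(n)                        # full-rank symmetric PSD matrix
B = rng.standard_normal((n, r))
M1 = B @ B.T                                    # symmetric PSD matrix of rank r < n
M = M1 + M2

M_inv = np.linalg.inv(M)
lhs = np.trace(M_inv @ M_inv)                   # Tr(M^{-2})
rhs = (n - np.linalg.matrix_rank(M1)) / np.linalg.norm(M2, ord=2) ** 2
assert lhs >= rhs, (lhs, rhs)
print(f"Tr(M^-2) = {lhs:.4e} >= (n - rank(M1)) / ||M2||^2 = {rhs:.4e}")
```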
Lemma 23 (Fixed-design training error). In the setting of Theorem 3, let $m \in \mathbb{N}$ be such that $r_1 > 0$, Equation (5) holds, and the ground truth satisfies $f^*(x) = \sum_{k=1}^{m} a_k \psi_k(x)$. Then,
$$\frac{\lambda^2\sigma^2}{n}\operatorname{Tr}\big(H^{-2}\big) \le \mathbb{E}_\epsilon\Big[ \tfrac{1}{n}\sum_i \big(\hat{f}_{\lambda}(x_i) - y_i\big)^2 \Big] \le \frac{\lambda^2\sigma^2}{n}\operatorname{Tr}\big(H^{-2}\big) + 6\lambda^2 \frac{r_2^2}{r_1^2}\frac{\|D_{\le m}^{-1}a\|^2}{n^2},$$
where $H := K + \lambda I_n$.

Proof. The kernel ridge regression estimator is $\hat{f}_\lambda(x^*) = y^{\top} H^{-1} k(x^*)$, where $y := [y_1, \ldots, y_n]^{\top}$ with $y_i = f^*(x_i) + \epsilon_i$, and $k(x^*) := [K(x_1, x^*), \ldots, K(x_n, x^*)]^{\top}$. Thus, the estimator evaluated at $x_1, \ldots, x_n$ is $K H^{-1} y$, which yields the following training error:
$$\frac{1}{n}\sum_i \big(\hat{f}_\lambda(x_i) - y_i\big)^2 = \frac{1}{n}\big\|(I_n - K H^{-1}) y\big\|^2 \overset{(i)}{=} \frac{\lambda^2}{n}\big\|H^{-1} y\big\|^2,$$
where (i) follows from
$$I_n - K H^{-1} = I_n - K(K + \lambda I_n)^{-1} = I_n - (K + \lambda I_n - \lambda I_n)(K + \lambda I_n)^{-1} = \lambda(K + \lambda I_n)^{-1}.$$
Next, by the assumptions on the ground truth, we can write $y = \epsilon + \Psi_{\le m} a$. Thus, the expected training error with respect to the noise is
$$\mathbb{E}_\epsilon\Big[ \tfrac{1}{n}\sum_i \big(\hat{f}_\lambda(x_i) - y_i\big)^2 \Big] = \frac{\lambda^2}{n}\,\mathbb{E}_\epsilon\Big[ (\epsilon + \Psi_{\le m} a)^{\top} H^{-2} (\epsilon + \Psi_{\le m} a) \Big] = \underbrace{\frac{\lambda^2\sigma^2}{n}\operatorname{Tr}\big(H^{-2}\big)}_{=:T_1} + \underbrace{\frac{\lambda^2}{n}\, a^{\top}\Psi_{\le m}^{\top} H^{-2} \Psi_{\le m} a}_{=:T_2}.$$
First, $T_2 \ge 0$ since $H^{-2}$ is positive semi-definite. Therefore, $T_1$ already yields the desired lower bound on the expected training error. For the upper bound, we bound $T_2$ as follows:
$$T_2 = \frac{\lambda^2}{n}\,(D_{\le m}^{-1}a)^{\top} D_{\le m}\Psi_{\le m}^{\top} H^{-2} \Psi_{\le m} D_{\le m} (D_{\le m}^{-1}a) \overset{(i)}{=} \frac{\lambda^2}{n}\,(D_{\le m}^{-1}a)^{\top} \big(D_{\le m}^{-1} + \Psi_{\le m}^{\top} H_{>m}^{-1} \Psi_{\le m}\big)^{-1} \Psi_{\le m}^{\top} H_{>m}^{-2} \Psi_{\le m} \big(D_{\le m}^{-1} + \Psi_{\le m}^{\top} H_{>m}^{-1} \Psi_{\le m}\big)^{-1} (D_{\le m}^{-1}a)$$
$$\overset{(ii)}{\le} \frac{\lambda^2}{n}\,\frac{B_1}{(\mu_{\min}(K_{>m}) + \lambda)^2} \overset{(iii)}{\le} 6\lambda^2 r_2^2\,\frac{\max\{\lambda, 1\}^2}{(\mu_{\min}(K_{>m}) + \lambda)^2}\,\frac{\|D_{\le m}^{-1}a\|^2}{n^2} = 6\lambda^2 \frac{r_2^2}{r_1^2}\frac{\|D_{\le m}^{-1}a\|^2}{n^2},$$
where $H_{>m} := K_{>m} + \lambda I_n$. Step (i) follows from Lemma 6, step (ii) uses Equation (5) and matches the term $B_1$ from the proof of Theorem 3, and step (iii) applies the bound on $B_1$ achieved in Theorem 3. This upper-bounds the expected training error and thereby concludes the proof.

E EXPERIMENTAL DETAILS

This section describes our experimental setup and includes additional details. We provide the code to replicate all experiments and plots at https://github.com/michaelaerni/iclr23-InductiveBiasesHarmlessInterpolation.

E.1 SETUP FOR FILTER SIZE EXPERIMENTS

The following describes the main filter size experiments presented in Section 4.1.

Network architecture: We use a special CNN architecture that amplifies the role of filter size as an inductive bias. Each model of the main filter size experiments in Figure 2 has the following architecture (a minimal sketch follows Equation (46) below):
1. Convolutional layer with 128 filters of varying size and no padding
2. Global max pooling over the spatial feature dimensions
3. ReLU activation
4. Linear layer with 256 output features
5. ReLU activation
6. Linear layer with 1 output feature
All convolutional and linear layers use a bias term. Since we employ a single convolutional layer before global max pooling, the convolutional filter size directly determines the maximum size of an input patch that can influence the CNN's output. Note that this architecture reduces to an MLP if the filter size equals the input image size.

Optimization procedure: We use the same training procedure for all settings in Figure 2. Optimization minimizes the logistic loss for 300 epochs of mini-batch SGD with momentum 0.9 and batch size 100. We linearly increase the learning rate from $10^{-6}$ to a peak value of 0.2 during the first 50 epochs, and then reduce the learning rate according to an inverse square-root decay every 20 epochs. For a peak learning rate $\gamma_0$ and a decay rate $L$, the inverse square-root decay schedule at epoch $t \ge 0$ is
$$\gamma_0 \big/ \sqrt{1 + t/L}. \quad (46)$$
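To make the setup above concrete, the following is a minimal PyTorch-style sketch of the architecture and learning-rate schedule, not the authors' reference implementation. The names (`FilterSizeCNN`, `lr_at_epoch`), the single input channel, the exact warm-up interpolation, and the choice of decay rate $L$ equal to the 20-epoch decay interval are our assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of the CNN and learning-rate schedule of
# Appendix E.1. Names, the single-channel input, and schedule details are assumptions.
import torch
import torch.nn as nn


class FilterSizeCNN(nn.Module):
    """Single conv layer + global max pooling; the filter size controls locality."""

    def __init__(self, filter_size: int, width: int = 128, in_channels: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, width, kernel_size=filter_size, padding=0)
        self.fc1 = nn.Linear(width, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)               # (batch, width, H', W')
        h = torch.amax(h, dim=(2, 3))  # global max pooling over spatial dimensions
        h = torch.relu(h)
        h = torch.relu(self.fc1(h))
        return self.fc2(h)             # single logit for the logistic loss


def lr_at_epoch(t: int, peak: float = 0.2, warmup_epochs: int = 50,
                decay_every: int = 20, start: float = 1e-6) -> float:
    """Linear warm-up to the peak, then the inverse square-root decay of Equation (46),
    applied in steps of `decay_every` epochs (assuming decay rate L = decay_every)."""
    if t < warmup_epochs:
        return start + (peak - start) * t / warmup_epochs
    t_decay = ((t - warmup_epochs) // decay_every) * decay_every
    return peak / (1.0 + t_decay / decay_every) ** 0.5
```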
Learning rate warm-up helps to capture the early-stopped test error more precisely. Whenever possible, we use deterministic training algorithms, so that our results are as reproducible as possible. We selected all hyperparameters to minimize the training loss of the strongest inductive bias (filter size 5) on noisy training data, under the constraint that all other settings still converge and interpolate. Note that we do not use data augmentation, dropout, or weight decay.

Evaluation: We observed that all models achieve their minimum test error either at the beginning or at the very end of training. Hence, our experiments evaluate the test error every 2 epochs during the first 150 epochs, and every 10 epochs afterwards to save computation time. We use an oracle, that is, the true test error, to determine the optimal early stopping epoch in retrospect. The optimal early stopping training error is always computed over the entire training set (including potential noise) for a fixed model, not as an average over mini-batches. To mitigate randomness in both the training data and the optimization procedure, we average over multiple dataset and training seeds. More precisely, we sample 5 different pairs of training and test datasets. For each dataset, we fit 15 randomly initialized models per filter size on the same dataset and calculate average metrics. The plots then display the mean and standard error over the 5 datasets.

Dataset: All filter size experiments use synthetic images. For a fixed seed, the experiments generate 200 training and 100k test images, both containing an equal number of positive and negative samples. Given a class, the sampling procedure iteratively scatters 10 shapes on a black 32x32 image. A single shape is either a circle (negative class) or a cross (positive class), has a uniformly random size in [3, 5], and a uniformly random center such that all shapes end up completely inside the target image. We use a conceptual line width of 0.5 pixels, but discretize the shapes into the pixel grid. See Figure 4 for examples. A single dataset seed fully determines the training data, test data, and all scattered shapes.

Figure 4: Example synthetic images used in the filter size experiments. (a) Negative samples. (b) Positive samples.

Noise model: In the noisy case, we select 20% of all training samples uniformly at random without replacement and flip their labels. The noise is deterministic per dataset seed and does not change between different optimization runs. Note that we never apply noise to the test data.

E.2 SETUP FOR ROTATIONAL INVARIANCE EXPERIMENTS

The following describes the rotational invariance experiments presented in Section 4.2.

Dataset: We use the EuroSAT (Helber et al., 2018) training split and subsample it into 7680 raw training and 10k raw test samples in a stratified way. For a fixed number of rotations $k$, we generate a training dataset as follows (a minimal sketch follows this list):
1. In the noisy case, select a random 20% subset of all training samples without replacement; for each, change the label to one of the other 9 classes uniformly at random.
2. For the $i$-th training sample ($i \in \{1, \ldots, 7680\}$):
   (a) Determine a random offset angle $\alpha_i$.
   (b) Rotate the original image by each angle in $\{\alpha_i + j \cdot (360^\circ/k) \mid j = 0, \ldots, k-1\}$.
   (c) Crop each of the $k$ rotated 64x64 images to 44x44 so that no network sees black borders from image interpolation.
3. Concatenate all $k \cdot 7680$ samples into a single training dataset.
4. Shuffle this final dataset (at the beginning of training and every epoch).
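The sketch below illustrates the list above with a hypothetical helper, `make_rotated_train_set`, written with torchvision-style transforms. It is ours and not the authors' code; the function name, the interpolation defaults, and the data layout are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of the k-fold rotation augmentation in
# Appendix E.2. Function name and the use of torchvision transforms are assumptions.
import random
import torchvision.transforms.functional as TF


def make_rotated_train_set(images, labels, k, noise_frac=0.2, num_classes=10, crop=44):
    """images: list of 64x64 PIL images or tensors; labels: list of class indices."""
    labels = list(labels)

    # Step 1: flip a random 20% of labels to one of the other classes (noisy case).
    noisy_idx = random.sample(range(len(labels)), int(noise_frac * len(labels)))
    for i in noisy_idx:
        labels[i] = random.choice([c for c in range(num_classes) if c != labels[i]])

    # Steps 2-3: rotate every image k times with a random per-image offset angle,
    # then center-crop to remove black borders introduced by the rotation.
    dataset = []
    for img, label in zip(images, labels):
        offset = random.uniform(0.0, 360.0)
        for j in range(k):
            rotated = TF.rotate(img, angle=offset + j * 360.0 / k)
            cropped = TF.center_crop(rotated, [crop, crop])
            dataset.append((cropped, label))  # all rotations share the (possibly noisy) label

    # Step 4: shuffle once here; during training, reshuffle every epoch (e.g. via a DataLoader).
    random.shuffle(dataset)
    return dataset
```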
To generate the actual test dataset, we apply a random rotation to each raw test sample independently and crop the rotated images to the same size as the training samples. This procedure rotates every image exactly once and uses random angle offsets to avoid distribution shift effects from image interpolation. Note that all random rotations are independent of the label noise and of the number of training rotations. Hence, all experiments share the same test dataset. Furthermore, since we apply label noise before rotating images, all rotations of an image consistently share the same label.

Network architecture: All experiments use a Wide Residual Network (Zagoruyko & Komodakis, 2016) with 16 layers, widen factor 6, and default PyTorch weight initialization. We chose the width and depth such that all networks are sufficiently overparameterized while still being manageable in terms of computational cost.

Optimization procedure: We use the same training procedure for all settings in Figure 3. Optimization minimizes the softmax cross-entropy loss using mini-batch SGD with momentum 0.9 and batch size 128. Since the training set size grows with the number of rotations, all experiments fix the number of gradient updates to 144k; this corresponds to 200 epochs over a dataset with 12 rotations. Similar to the filter size experiments, we linearly increase the learning rate from zero to a peak value of 0.15 during the first 4800 steps, and then reduce the learning rate according to an inverse square-root decay (Equation (46)) every 960 steps. Whenever possible, we use deterministic training algorithms, so that our results are as reproducible as possible. We selected all hyperparameters to minimize the training loss of the strongest inductive bias (12 rotations) on noisy training data, under the constraint that all other settings still converge and interpolate. As for all experiments in this paper, we do not use additional data augmentation, dropout, or weight decay.

Evaluation: Similar to the filter size experiments, we evaluate the test error more frequently during early training iterations: every 480 steps for the first 9600 steps, and every 1920 steps afterwards. The experiments again use the actual test error to determine the best step for early stopping, and calculate the corresponding training error over the entire training dataset, including all rotations and potential noise. Due to the larger training set size and increased computational cost, we only sample a single training and test dataset, and report the mean and standard error of all metrics over five training seeds.

E.3 DIFFERENCE TO DOUBLE DESCENT

As mentioned in Section 4.1, our empirical observations resemble the double descent phenomenon. This subsection expands on that discussion and provides additional details on how the phenomenon in this paper differs from double descent. While all models in all experiments interpolate the training data, we observe that both noisy labels and stronger inductive biases increase the final training loss of an interpolating model: a smaller filter size results in a smaller number of model parameters, and enforcing invariance to more rotations requires a model to interpolate more (correlated) training samples. Thus, in both cases, increasing the inductive bias strength decreases a model's overparameterization relative to the number of training samples, shifting the setting closer to the interpolation threshold.
We argue that our choice of architecture and hyperparameter tuning ensures that no model in any experiment is close to the corresponding interpolation threshold. If that is the case, then double descent predicts that increasing the number of model parameters has a negligible effect on whether regularization benefits generalization, and therefore does not explain our observations.

Figure 5: Training losses of the models in this paper's experiments as a function of (a) filter size and (b) the number of training set rotations. Models with a stronger inductive bias generally exhibit larger losses; however, in all instances, the numerical difference is small. Lines show the mean loss over 5 training set samples in (a) and 5 different optimization runs in (b), shaded areas the corresponding standard error.

In the following, we first describe how our hyperparameter and model selection procedure ensures that all models in all experiments are sufficiently overparameterized, so that double descent predicts negligible effects from increasing the number of parameters. Then, we provide additional experimental evidence that supports our argument: we repeat a subset of the experiments in Section 4 while scaling up the number of parameters in all models. For a fixed model scale and varying inductive bias, we observe that all phenomena in Section 4 persist. For a fixed inductive bias strength, we further see that the test error of interpolating models saturates at a value that matches our hypothesis. In particular, for strong inductive biases, the gap in test error between interpolating models and their optimally early-stopped versions, that is, harmful interpolation, persists.

Hyperparameter tuning: We mitigate differences in model complexity for different inductive bias strengths by tuning all hyperparameters on worst-case settings, that is, maximum inductive bias with noisy training samples. To avoid optimizing on test data, we tune on dataset seeds and network initializations that differ from the ones used in the actual experiments. Figure 5 displays the final training loss for all empirical settings in this paper. While models with a stronger inductive bias exhibit larger training losses, all values are close to zero, and the numerical difference is small. Finally, we want to stress again that this discussion is only about the training loss; all models in all experiments have zero training error and perfectly fit the corresponding training data.

Increasing model complexity for varying filter size: As additional evidence, we repeat the main filter size experiments from Figure 2 in Section 4.1 using the same setup as before (see Appendix E.1), but increase the convolutional layer width to 256, 512, 1024, and 2048. For computational reasons, we evaluate a reduced number of filter sizes for widths 256 and 512, and only the smallest filter size 5 for widths 1024 and 2048. Since we found the original learning rate 0.2 to be too unstable for the larger model sizes, we use a decreased peak learning rate of 0.13 for widths 256 and 512, and 0.1 for widths 1024 and 2048. Figures 6a and 6b show the test errors for 20% and 0% training noise, respectively.
Figure 6: An increase in convolutional layer width by factors 2 to 16 does not significantly alter the behavior of (a) the test error when training on 20% label noise, (b) the test error when training on 0% label noise, and (c) the training error when using optimal early stopping on 20% label noise. Despite the significantly larger model sizes, the phase transition between harmless and harmful interpolation persists. Lines show the mean over five random datasets, shaded areas the standard error.

With noisy training data (Figure 6a), larger interpolating models yield a slightly smaller test error, but the overall trends remain: the gap in test error between converged and optimally early-stopped models increases with inductive bias strength, and the phase transition between harmless and harmful interpolation persists. In particular, Figure 6a provides strong evidence that the number of model parameters does not drive our phenomenon: for example, models with filter size 5 (strong inductive bias) and width 512 (red) have more parameters than models with filter size 27 (weak inductive bias) and width 128 (blue). Nevertheless, models with filter size 5 benefit significantly from early stopping, while interpolation is harmless for models with filter size 27. In the noiseless case (Figure 6b), increasing model complexity neither harms nor improves generalization, and all models achieve their optimal performance after interpolating the entire training dataset. Similarly, Figure 6c reveals that the fraction of training noise fitted by optimally early-stopped models stays the same for larger models. Finally, for a fixed inductive bias strength, the test errors saturate as the model size increases, making a different trend for models with more than 2048 filters unlikely. To increase legibility, Table 1 lists the numerical results for the two largest widths; a parameter-count sanity check for these models appears at the end of this appendix.

Table 1: Test errors for filter size 5 (strongest inductive bias) and very large widths under 20% training noise.

                              width 1024    width 2048
  early-stopped test error    0.0062%       0.0049%
  interpolating test error    3.6251%       3.3664%
  # parameters                289281        578049

Increasing model capacity for varying rotational invariance: For completeness, we also repeat the rotational invariance experiments from Figure 3 in Section 4.2 with twice as wide Wide Residual Networks on a reduced number of rotations. More precisely, we increase the network widen factor from 6 to 12 and otherwise use the same setting as in the main experiments (see Appendix E.2). Note that this corresponds to an increase from around 6 million to around 24 million parameters. The results in Figure 7 provide additional evidence that our phenomenon is distinct from double descent: both the test error (Figures 7a and 7b) and the fraction of fitted noise under optimal early stopping (Figure 7c) exhibit the same trend, despite the significant difference in the number of parameters.
Figure 7: Doubling the Wide Residual Network width (WRN16-6 vs. WRN16-12) does not significantly alter the behavior of (a) the test error when training on 20% label noise, (b) the test error when training on 0% label noise, and (c) the training error on the noisy and clean training subsets when using optimal early stopping on 20% label noise. Despite the significantly larger model size, the phase transition between harmless and harmful interpolation persists. Lines show the mean over five random network initializations, shaded areas the standard error.
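Finally, the parameter counts reported in Table 1 can be recomputed directly from the architecture of Appendix E.1. The short sketch below is ours, assuming single-channel 32x32 inputs and bias terms in every layer as described; the function name is hypothetical.

```python
# Sanity check (ours, not from the paper's code) for the parameter counts in Table 1,
# assuming single-channel inputs and the architecture of Appendix E.1.
def cnn_param_count(width: int, filter_size: int = 5, in_channels: int = 1) -> int:
    conv = width * (in_channels * filter_size ** 2 + 1)  # conv filters + biases
    fc1 = width * 256 + 256                              # linear layer to 256 features
    fc2 = 256 * 1 + 1                                     # final linear layer to 1 output
    return conv + fc1 + fc2


assert cnn_param_count(1024) == 289_281  # matches Table 1, width 1024
assert cnn_param_count(2048) == 578_049  # matches Table 1, width 2048
```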