# Simplicity Bias in Overparameterized Machine Learning

Yakir Berchenko
Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, P.O. Box 653, Beer-Sheva 84105, Israel. berchenk@bgu.ac.il

Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

A thorough theoretical understanding of the surprising generalization ability of deep networks (and other overparameterized models) is still lacking. Here we demonstrate that simplicity bias is a major phenomenon to be reckoned with in overparameterized machine learning. In addition to explaining the outcome of simplicity bias, we also study its source: following concrete rigorous examples, we argue that (i) simplicity bias can explain generalization in overparameterized learning models such as neural networks; (ii) simplicity bias and excellent generalization are optimizer-independent, as our examples show, and although the optimizer affects training, it is not the driving force behind simplicity bias; (iii) simplicity bias in pre-training models, and subsequent posteriors, is universal and stems from the subtle fact that uniformly-at-random constructed priors are not uniformly-at-random sampled; and (iv) in neural network models, the biasing mechanism in wide (and shallow) networks is different from the biasing mechanism in deep (and narrow) networks.

## Introduction

Contemporary practice in deep learning has challenged conventional approaches to machine learning. Specifically, deep neural networks are highly overparameterized models with respect to the number of data points and are often trained without explicit regularization. Yet they achieve state-of-the-art generalization performance. Notably, these observations are not limited to neural network models: qualitatively similar behaviour has been observed when boosting with decision trees and random forests (Wyner et al. 2017) and in other non-network learning problems, with some examples dating back to the early 1990s (Loog et al. 2020).

A thorough theoretical understanding of the unreasonable effectiveness of deep networks (and other overparameterized models) is still lacking. Previous work (e.g., Neyshabur, Tomioka, and Srebro 2014; Neyshabur et al. 2018) has suggested that an implicit regularization occurs in neural networks via an implicit norm minimization; in particular, the minimization of the (generalized) norm was conjectured to be a by-product of the optimizer, the method by which the network is trained (i.e., stochastic gradient descent, SGD). However, this has been questioned by theoretical and empirical work showing strong evidence to the contrary (Razin and Cohen 2020; Huang et al. 2019; Kawaguchi, Kaelbling, and Bengio 2017). Furthermore, although the reasoning behind SGD as an implicit regularizer (Hochreiter and Schmidhuber 1997; Keskar et al. 2016) is insightful (solutions that do not generalize well correspond to sharp minima, and added noise prevents convergence to such solutions), there are examples where good generalization is obtained irrespective of the optimizer used (Huang et al. 2019; Kawaguchi, Kaelbling, and Bengio 2017; Wu, Zhu et al. 2017).
Here we propose an entirely different approach: instead of implicitly assuming that learning models are uniformly probable random objects (prior to training), we suggest that the probability space over models is in fact biased towards simple functions. Thus, the trained model will likely extrapolate well without fluctuating wildly in between the training data points. Despite some preliminary results (Arpit et al. 2017; Valle-Perez, Camargo, and Louis 2018; Mingard et al. 2019; De Palma, Kiani, and Lloyd 2019), a theoretical treatment with a rigorous proof of such a bias is still lacking. Here, in addition to explaining the outcome of simplicity bias, we also study its source.

This work suggests that, for typical overparameterized models (in addition to explaining generalization), simplicity bias is in fact universal and nearly unavoidable. The crux of the matter is that the functions obtained (at initialization) are uniformly-at-random constructed but not uniformly-at-random sampled. In other words, the naive default probability space over functions, in which each unique function is sampled with equal probability, is conceptually wrong and irrelevant for this domain. Instead, it is well known that if a construction process is used to sample complex objects then the resulting probability space is not uniform (even if the construction itself is uniformly random, i.e., in each sub-step we choose uniformly from several construction options). Moreover, different (spaces of) construction processes yield different probability spaces over functions. However, several properties can be expected to be universal in some sense, with simplicity bias being a case in point (Chauvin et al. 2004).

Below we present a concrete rigorous example; we then identify two distinct mechanisms that bias towards simplicity and provide further theoretical evidence to dispute the relevance of focusing on shallow (and wide) networks while researching deep (and narrow) networks. We now describe three concrete examples: overparameterized learning of a Boolean function, infinitely wide neural networks (where we revisit known results), and infinitely deep, non-wide neural networks.

**Learning a Boolean function.** Let $F_n$ be the set of all Boolean functions on $n$ variables. There are $2^n$ possible binary inputs, and thus there are $2^{2^n}$ such Boolean functions, $|F_n| = 2^{2^n}$. For $f \in F_n$, let $(X, Y_f)$ denote a uniformly random sample of observations regarding $f$, and $|(X, Y_f)|$ denote the sample size (i.e., the number of input-output pairs observed). How likely are we to overfit when fitting $\hat{f}$ based on a small sample? Consider the following negative claim:

**Claim 1.** The vast majority of functions that agree with $f$ over the sample obtain only chance agreement with $f$ out-of-sample (i.e., agreement $1/2$ out-of-sample).

More precisely, for a fixed sample size and a fixed $\epsilon > 0$, the law of large numbers immediately entails that if we choose a function $\hat{f}$ completely at random (subject to fitting the sample) then the probability of agreement on more than $1/2 + \epsilon$ out-of-sample goes to $0$ as $n \to \infty$ (note that $f$ need not be fixed). However, our main claim here is that this negative result is actually misleading and demonstrates a misguided way of thinking.
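Claim 1 is easy to see numerically. The following minimal Monte Carlo sketch (our illustration; the target function, $n$, and the sample size are arbitrary demo choices) draws functions uniformly at random subject to fitting a small sample and measures their out-of-sample agreement with $f$, which stays near chance level:

```python
import itertools
import random

# Demo parameters (arbitrary choices for illustration): n variables, sample size s.
n, s, trials = 10, 20, 200
inputs = list(itertools.product([0, 1], repeat=n))      # all 2^n binary inputs
f = {x: x[0] & x[1] for x in inputs}                    # some target Boolean function

random.seed(0)
sample = random.sample(inputs, s)                       # the observed inputs
sample_set = set(sample)
out_of_sample = [x for x in inputs if x not in sample_set]

agreements = []
for _ in range(trials):
    # "Choose a function uniformly at random subject to fitting the sample":
    # outside the sample the constraint is vacuous, so each value is a fair coin flip.
    f_hat = {x: (f[x] if x in sample_set else random.randint(0, 1)) for x in inputs}
    agree = sum(f_hat[x] == f[x] for x in out_of_sample) / len(out_of_sample)
    agreements.append(agree)

print("mean out-of-sample agreement:", sum(agreements) / trials)   # ~ 0.5 (chance)
```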
It is true that when fitting by choosing a function at random we are likely to overfit; however, in practice we seldom choose a function at random; instead, we construct a function at random, or choose a representation at random. In particular, Boolean functions are typically implemented using circuits or binary AND/OR trees. Thus, a more nuanced question would be: if we choose a Boolean circuit $t$ completely at random (subject to fitting the sample), what is the probability of agreement on more than $1/2 + \epsilon$ out-of-sample?

Let $T_{m,n}$ denote the set of all binary Boolean trees with $m$ internal nodes over $n$ input variables (and consider large, overparameterized trees with $m \gg n$). Formally, a uniform probability space over $T_{m,n}$ was introduced in (Chauvin et al. 2004) in the following manner: choose uniformly at random a rooted binary tree and label its $m$ internal nodes randomly with AND and OR, and its $m+1$ external nodes with a literal, i.e., a variable or its negation. Each of the $m$ inner nodes is labeled with AND or OR with equal probability $1/2$ and independently of the other nodes; each leaf is labeled with a literal, chosen according to the uniform distribution on the $2n$ literals and independently of the labeling of all other nodes.

Again, if we construct a tree $t_i \in T_{m,n}$, we can find many candidates that agree with $f$ perfectly over the sample $(X, Y_f)$ while agreeing with $f$ over only $1/2$ of the out-of-sample data (i.e., chance agreement). Therefore, one might surmise that the following learning algorithm is (statistically) ill-advised:

**Naive Algorithm.** Given $(X, Y_f)$:
1. Construct uniformly at random a tree $t_i \in T_{m,n}$.
2. Check if $f_i$, the function corresponding to $t_i$, agrees with $f$ over $(X, Y_f)$. If so, return $f_i$. If not, go to step 1.
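The following sketch (ours; the target, $n$, $m$, and the sample size are hypothetical demo choices) implements this construct-then-check loop: tree shapes are drawn uniformly via Catalan-weighted subtree splits, and gates and literals are labeled as in the construction above.

```python
import random

AND, OR = "AND", "OR"

def catalan(k, memo={0: 1}):
    """Number of binary tree shapes with k internal nodes (Catalan number)."""
    if k not in memo:
        memo[k] = sum(catalan(i) * catalan(k - 1 - i) for i in range(k))
    return memo[k]

def random_tree(m, n):
    """Uniform random binary AND/OR tree with m internal nodes over n variables."""
    if m == 0:                                   # a leaf: a literal (variable or its negation)
        return ("LIT", random.randrange(n), random.random() < 0.5)
    # choose the left subtree size proportionally to the number of shapes it admits
    weights = [catalan(i) * catalan(m - 1 - i) for i in range(m)]
    i = random.choices(range(m), weights=weights)[0]
    gate = AND if random.random() < 0.5 else OR
    return (gate, random_tree(i, n), random_tree(m - 1 - i, n))

def evaluate(t, x):
    if t[0] == "LIT":
        _, var, neg = t
        return (not x[var]) if neg else bool(x[var])
    left, right = evaluate(t[1], x), evaluate(t[2], x)
    return (left and right) if t[0] == AND else (left or right)

def naive_algorithm(sample, m, n, max_attempts=200_000):
    """Resample random trees until one fits every (x, y) pair in the sample."""
    for attempt in range(1, max_attempts + 1):
        t = random_tree(m, n)
        if all(evaluate(t, x) == y for x, y in sample):
            return t, attempt
    return None, max_attempts

if __name__ == "__main__":
    random.seed(1)
    n, m, s = 5, 15, 8
    f = lambda x: bool(x[0]) or bool(x[1])       # a simple target, so L_f is small
    xs = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(s)]
    sample = [(x, f(x)) for x in xs]
    tree, attempts = naive_algorithm(sample, m, n)
    print("fit found after", attempts, "attempts")
```

For a simple target (small $L_f$) the loop terminates after a modest number of attempts, in line with the geometric-attempts argument used in the proof of Theorem 1 below.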
Nevertheless, we have the following surprising result linking the complexity of $f$ (following Chauvin et al. (2004), we define the complexity of a Boolean function $f$ as $L_f$, the minimal size of a tree computing $f$, where the size of a tree is the number of internal nodes it has), the sample size, and $\epsilon$ (generalization):

**Theorem 1 (Random trees interpolate without overfitting).** Let $L_f$ denote the complexity of $f$, and $s$ denote the sample size, $s = |(X, Y_f)|$. For a given $0 < \epsilon < 1/2$, fix $b$ such that $\log(1/2 + \epsilon) < -b$. As $n \to \infty$, if $L_f \le b\,s/\log n$ then with high probability the output of the naive algorithm: (i) agrees with $f$ completely over the sample, and (ii) agrees with $f$ over more than $1/2 + \epsilon$ of the out-of-sample data.

Remark: note that $\epsilon$ and $b$ are fixed, but $L_f$ and $s$ (and $m$) are not.

Proof. According to Theorem 3.1 of (Lefmann and Savický 1997) (we use the notation of (Chauvin et al. 2004), not that of (Lefmann and Savický 1997); see also Theorem 1 in (Chauvin et al. 2004)), the probability of randomly constructing a tree that corresponds to $f$ is $\ge \tfrac{1}{4}(8n)^{-L_f}$. Therefore, the number of attempts by the naive algorithm is dominated by a geometric random variable with mean $4(8n)^{L_f}$, and a perfect fit to the data is found in $O\big((8n)^{L_f}\big)$ attempts.

The functions in $F_n$ can be partitioned according to their agreement with $f$:

- Poor or chance agreement. Each function in this class has a probability of less than $1/2 + \epsilon$ of agreeing with $f$ for a random input.
- Adequate agreement (or better). Each function in this class has a probability of more than $1/2 + \epsilon$ of agreeing with $f$ for a random input. (It is possible to further add here an "almost sure agreement" class, agreeing with $f$ with probability $1 - o(1)$. A similar proof, with additional bookkeeping beyond the scope of this note, shows that for a large enough sample size, or a simple enough $f$, the naive algorithm will return an $\hat{f}$ that almost surely agrees with $f$.)

The naive algorithm results in overfitting if it samples a function $\hat{f}$ from the first class (the poor/chance agreement class) which agrees with $f$ over the sample data; however, for a given such $\hat{f}$, since the sample data is sampled uniformly at random, the probability of such an agreement is $\le (1/2 + \epsilon)^s$. Finally, since the naive algorithm performs $O\big((8n)^{L_f}\big)$ attempts, an application of the union bound provides the following upper bound on the probability of overfitting:

$$P(\text{overfit}) = O\big((1/2 + \epsilon)^s (8n)^{L_f}\big), \tag{1}$$

and

$$(1/2 + \epsilon)^s (8n)^{L_f} \le (1/2 + \epsilon)^s (8n)^{b s/\log n} = \Big((1/2 + \epsilon)(8n)^{b/\log n}\Big)^s;$$

bearing in mind that $\lim_n n^{b/\log n} = e^b$, and that $(1/2+\epsilon)e^{b} < 1$ by the choice of $b$, we get that the right-hand side is $o(1)$ for large $s$. ∎

The following example illustrates the point:

**Example 1.** Consider a binary classification task of black/white images with $n = 28 \times 28 = 784$ pixels (with $2^{2^{784}} \approx 10^{10^{235}}$ classifiers to choose from). If we provide the naive algorithm with a sample of $s = 10^6$ images, when will it avoid overfitting? Theorem 1 says that overfitting is avoided if $L_f = O(s/\log n)$; in other words, if $f$ can be implemented using on the order of $s/\log n = 10^6/\log 784 \approx 0.35 \times 10^6$ gates, which is quite a lot and very permissive.

**Wide neural networks.** In a very wide network (see Fig. 1A) with random weights, the situation is in fact straightforward: under the appropriate conditions, the output layer simply sums up a large number of random variables and thus the central limit theorem kicks in. The result is that the network is a Gaussian process (at initialization), which is a simple and well-behaved function of the input. Moreover, previous work regarding very wide networks (Jacot, Gabriel, and Hongler 2018) demonstrated this Gaussian process and further found that after training via gradient descent the result is akin to Gaussian process regression (Jacot, Gabriel, and Hongler 2018), i.e., a Gaussian process conditioned on interpolating the sample data; see (Williams and Rasmussen 2006). Denote by $B_{\text{train}}(\cdot)$ a Gaussian process (whose mean function and covariance are training-data dependent) conditioned on passing through the sample points (the notation $B$ is due to the analogy with the Brownian bridge process). From ref. (Jacot, Gabriel, and Hongler 2018) we can summarize the following proposition:

**Proposition 1.** In the setting of ref. (Jacot, Gabriel, and Hongler 2018), a wide network trained on noise-free data $(X, Y_f)$ via gradient descent after a random initialization is distributed as $B_{\text{train}}(\cdot)$.

Remark: Needless to say, the mean function and covariance are training-data dependent, but given $(X, Y_f)$ they are deterministic. The only randomness is due to the random Gaussian initialization of the network weights.
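As a quick numerical illustration of the central-limit-theorem mechanism (a sketch of ours, not the construction of Jacot, Gabriel, and Hongler (2018); the width, input dimension, and ReLU nonlinearity are arbitrary demo choices): across random initializations, the output of a wide one-hidden-layer network at a fixed input is approximately Gaussian, and the outputs at nearby inputs are strongly correlated, i.e., the random function is smooth and well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n_inits = 16, 2048, 1000            # arbitrary demo sizes

x1 = rng.normal(size=d); x1 /= np.linalg.norm(x1)
x2 = x1 + 0.1 * rng.normal(size=d); x2 /= np.linalg.norm(x2)   # a nearby input

outs = np.zeros((n_inits, 2))
for i in range(n_inits):
    W = rng.normal(size=(width, d)) / np.sqrt(d)    # random first-layer weights
    a = rng.normal(size=width) / np.sqrt(width)     # random readout, CLT scaling
    outs[i] = [a @ np.maximum(W @ x1, 0.0), a @ np.maximum(W @ x2, 0.0)]

o = outs[:, 0]
print("mean (should be ~0):", o.mean())
print("fraction within one std (Gaussian gives ~0.68):",
      np.mean(np.abs(o - o.mean()) < o.std()))
print("corr(f(x1), f(x2)) across initializations:",
      np.corrcoef(outs[:, 0], outs[:, 1])[0, 1])
```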
Consider again the following naive training algorithm:

**Naive Algorithm (for neural networks).** Given $(X, Y_f)$:
1. Initialize the weights of a network ($\hat{f}$) at random from a normal distribution.
2. Check if the network $\hat{f}$ agrees with $f$ over $(X, Y_f)$. If so, return $\hat{f}$. If not, go to step 1.

Remark: This algorithm is obviously non-constructive; in particular, the probability of exact agreement is zero. However, (i) it could be modified to check whether the disagreement is smaller than a predefined threshold, and (ii) efficiency is not the issue here; rather, the question is: what can be said about the result if the algorithm does stop?

**Proposition 2.** In the setting of ref. (Jacot, Gabriel, and Hongler 2018), a wide network trained on noise-free data $(X, Y_f)$ via the naive algorithm (given that the algorithm has stopped) is distributed as $B_{\text{train}}(\cdot)$.

Remark: Here too the randomness is due to the random Gaussian initialization of the network weights.

This highlights that it is not gradient-based training which contributes to the statistical efficiency and generalization of the outcome (although GD is indeed important for computational efficiency and for speeding up training). Rather, there is an inherent simplicity bias due to the random construction of the network via random weight initialization. However, the main driving force for the emergence of these simple functions in the setting above is the large width of the network, and not its depth (Lee et al. 2019). For a narrow but deep neural network we show below an entirely different mechanism which nevertheless produces simplicity bias.

**Deep neural networks.** In a multi-layered network (see Fig. 1B) each layer can be viewed as an operator in a dynamical system that acts on the output of its preceding layer. Under the appropriate conditions, this should lead to convergence to a fixed point regardless of the initial input, i.e., a simple constant function. (The notion of a fixed point is more subtle in the case of random dynamical systems, but similar behavior is expected nonetheless (Boxler 1995; Bhattacharya and Majumdar 2007).)

**Example 2.** Consider a standard fully connected ReLU network with no bias and Gaussian weight initialization; i.e., prior to training, the output of $x_{j,l}$, the $j$th unit in the $l$th layer, is $R\big(\sum_{k=1}^{w} a^{(l)}_{j,k} x_{k,l-1}\big)$, where $w$ is the width of the network, each weight $a^{(l)}_{j,k}$ is a standard normal random variable independent of the rest, and $R(x) = \max(0, x)$. Let $X_l$ denote the output of the $l$th layer, and $0$ denote the zero vector. Notice the following properties: (i) if $X_l = 0$ then the same is true for all $X_k$ with $k > l$; (ii) since $P(x_{j,l} = 0) \ge P(\forall k: a^{(l)}_{j,k} < 0) = 0.5^w > 0$, we get $P(X_l = 0) > 0$. Taken together we conclude the following: as $l \to \infty$ we have $P(X_l = 0) \to 1$, and the output is a simple constant regardless of the input.
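A quick simulation of Example 2 (the width, depth, and number of trials are arbitrary demo choices) shows how often a random bias-free ReLU network of this kind has already collapsed to the constant-zero output by a given depth:

```python
import numpy as np

rng = np.random.default_rng(0)
w, depth, n_trials = 4, 60, 1000

collapsed = 0
for _ in range(n_trials):
    x = rng.normal(size=w)                                  # a random input
    for l in range(depth):
        x = np.maximum(rng.normal(size=(w, w)) @ x, 0.0)    # standard-normal weights, ReLU
        if not x.any():                                     # zero is absorbing: deeper layers stay 0
            collapsed += 1
            break

print("fraction of runs that hit the constant-zero output:", collapsed / n_trials)
```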
Example 2 above presents the core ideas and notation while keeping the analysis straightforward (thanks to the positive mass on $P(x_{j,l} = 0)$ afforded by the activation function). We now show that networks with other activation functions (with zero mass on $P(x_{j,l} = 0)$, but a finite derivative at zero) exhibit similar phenomena.

**Definition 1 (Asymptotic stability).** Let $X_0, X_1, X_2, \ldots$ denote the states of a dynamical system, with $\|X\|$ the Euclidean norm. We say that a point $X^*$ is asymptotically stable if there exists a $\delta > 0$ such that for all $X_0$ in the $\delta$-neighborhood of $X^*$ we have $X_l \to X^*$ as $l \to \infty$.

**Theorem 2 (Asymptotic stability in deep networks).** In the setting above, consider a deep network with units $x_{j,l} = \sigma\big(\sum_{k=1}^{w} a^{(l)}_{j,k} x_{k,l-1}\big)$, where $\sigma(0) = 0$ and $\sigma$ has a finite derivative at zero, $\sigma'(0) < \infty$. If the weights $\{a^{(l)}_{j,k}\}_{j,k,l \ge 1}$ are i.i.d. zero-mean with standard deviation smaller than $\frac{1}{\sqrt{w}\,\sigma'(0)}$, then $0$ is asymptotically stable.

Proof (sketch). Consider the dynamical system linearized at $0$. Its Jacobian matrix, $J$, is a $w \times w$ random matrix with i.i.d. zero-mean entries with variance smaller than $1/w$.

Large width case: for large $w$, according to the circular law (see Theorem 1.10 in (Tao, Vu, and Krishnapur 2010)), the eigenvalues of $J$ lie in the unit disk with high probability. Standard theory (see, e.g., (Boxler 1995) for additional details) entails that $0$ is a (local) attractor and that for any initial point (close to $0$) the system will converge to $0$.

Fixed width case: although less elegant, it is still possible to bound the top Lyapunov exponent of the system, $\lambda_1 \le \frac{1}{2}\big(\ln\frac{2}{w} + \psi(w/2)\big)$, where $\psi$ is the digamma function (see details and definitions in chap. 2, sec. 4 of (Crisanti, Paladin, and Vulpiani 1993), and (Cohen and Newman 1984)). Since $\psi(z) = \ln z - \frac{1}{2z} + O(1/z^2)$, we have $\lambda_1 \le -\frac{1}{2w} + O(1/w^2) < 0$. Thus, again we conclude from standard theory that $0$ is a (local) attractor and that for any initial point (close to $0$) the system will converge to $0$. ∎

Remark 1: A similar result was obtained by (Xiao, Pennington, and Schoenholz 2020) for infinitely wide networks (the crux of the methods in ref. (Xiao, Pennington, and Schoenholz 2020), which requires the large width, is the use of the central limit theorem to approximate the pre-activations as Gaussian). Our result, in contrast, is not limited to wide networks and applies to narrow networks as well; and although the proof sketch mentioned the circular law, which suggests a wide network with large $w$, this is not essential. It is merely the most immediate way to bound the eigenvalues of $J$ below $1$ in magnitude, and any other way would suffice (see, for example, the second part addressing fixed width). Furthermore, this sheds light on the $O(1/\sqrt{w})$ scaling of $\{a^{(l)}_{j,k}\}_{j,k,l \ge 1}$ required.

Remark 2: In the next section we also discuss more relaxed and realistic versions of Theorem 2, in particular including biases and finite-depth networks. Theorem 2, however, primarily acts as an initial starting point, mainly for didactic purposes.

A more interesting question to consider is: can we establish for deep neural networks a counterpart to Theorem 1 and Proposition 2 that sheds light on the behavior after training with examples? Our naive algorithm for training deep neural networks is similar to before: initialize the weights from a normal distribution and check whether the outcome fits the sample data. However, we now also allow a tolerance for disagreement smaller than a predefined threshold:

**Naive Algorithm (for neural networks).** Given $(X, Y_f)$ and a tolerance $\tau$:
1. Initialize the weights of a network ($\hat{f}$) at random from a normal distribution with variance $\frac{1}{w}$.
2. Check if the network $\hat{f}$ agrees with $f$ over $(X, Y_f)$ up to $\tau$; i.e., each output given by the network on the sample data is within distance $\tau$ of the correct output. If so, return $\hat{f}$. If not, go to step 1.

Note that for typical networks, with a continuous activation function, the resulting network is continuous both with respect to its input and, moreover, with respect to its weights. (For brevity, we focus on regression, though a continuity argument could be made for classification as well.) From this follows:

**Proposition 3.** If $f$ can be described over the sample data by the neural network (i.e., there is a choice of weights for the neural network such that it fits the sample data perfectly), then the naive algorithm terminates successfully in finite time with probability 1.

Proof. Due to continuity with respect to the weights, the set of weights fitting the sample data to within $\tau$ contains a ball of radius $\delta_\tau > 0$ around a perfect-fit configuration, i.e., it has strictly positive mass. Thus, the number of attempts by the naive algorithm is dominated by a geometric random variable. ∎
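The following is a minimal sketch of this tolerance-based naive algorithm (the layer sizes, the tanh nonlinearity standing in for an activation that is linear near the origin, and the regression targets are all hypothetical demo choices): weights are repeatedly drawn from $N(0, 1/w)$ and the first network whose sample outputs are within $\tau$ of the targets is accepted.

```python
import numpy as np

rng = np.random.default_rng(0)
w, depth, tau = 3, 2, 0.25                    # arbitrary demo sizes and tolerance

# two orthonormal sample inputs and their scalar regression targets (made up)
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
Y = np.array([0.3, -0.2])

def random_net():
    """All weights drawn i.i.d. from N(0, 1/w)."""
    layers = [rng.normal(scale=1/np.sqrt(w), size=(w, w)) for _ in range(depth)]
    readout = rng.normal(scale=1/np.sqrt(w), size=w)
    return layers, readout

def forward(net, x):
    layers, readout = net
    for W in layers:
        x = np.tanh(W @ x)                    # an activation that is linear near 0
    return readout @ x

for attempt in range(1, 100_000):
    net = random_net()
    preds = np.array([forward(net, x) for x in X])
    if np.all(np.abs(preds - Y) <= tau):      # agreement with the targets up to tau
        print("accepted after", attempt, "attempts; predictions:", preds)
        break
```

Because the acceptable weight sets have positive mass (Proposition 3), the number of attempts is geometrically distributed and the loop stops quickly for this toy problem.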
Remark: throughout the rest of the paper we assume that $f$ can indeed be described over the sample data by the neural network.

Consider the following conditions regarding the network structure and the data:

- N1: the network is a fully connected convolutional neural network. Let $d$ denote the dimension of the input $x_j$; each layer in the network also has width $d$.
- N2: the activation function is linear near the origin; for example, a shifted hard sigmoid: $\max(-1, \min(1, x))$.
- S1: small sample size. We assume the sample size is smaller than $d$, i.e., $d > |(X, Y_f)|$.
- S2: orthonormal samples. For each $x_i, x_j$ in the sample, $x_i$ and $x_j$ are orthogonal (and have unit length).

Assume the existence of an interpolating solution (i.e., a set of weights with which the network fits the data perfectly) possessing the following properties, and that the naive algorithm succeeds in finding it:

- A1: Similar length mapping. Assume that for each $x_i, x_j$ in the sample, each of the layers in the network produces for $x_j$ an output of roughly the same length, $c_l$, as for $x_i$ (this will be made clear below, and further discussed in the discussion section).
- A2: Quasi-linear prologue. Assume that for each input near the origin, the first layers of the network act on it linearly; only at layer $l_b$, with $1 \ll l_b$, does the non-linearity first take effect (this too will be made clear below, and further discussed in the discussion section).

Remark. Notice that if a network with depth $m$ is an interpolating solution, then there is an interpolating solution with depth $m + l_b$ satisfying assumptions A1-A2; for example, a solution whose first $l_b$ layers act as the identity function on the training data.

**Theorem 3 (Probabilistic nearest-neighbor classification).** Denote the $s$ samples $(X, Y_f)$ by $X = \{x_1, x_2, \ldots, x_s\}$ and $Y_f = \{y_1, y_2, \ldots, y_s\}$. Assume a neural network satisfying conditions N1-S2 is trained with the naive algorithm, and a solution satisfying assumptions A1-A2 is found. Let $x_{\text{new}}$ denote an out-of-sample data point and $y_{\text{new}}$ the corresponding output of the network. With high probability (tending to 1 as $l_b \to \infty$) we have: (i) $y_{\text{new}} \in \{y_1, y_2, \ldots, y_s\}$; (ii) for $y_j \in \{y_1, y_2, \ldots, y_s\}$, $P(y_{\text{new}} = y_j)$ is an increasing function of the cosine similarity between $x_{\text{new}}$ and $x_j$.

Proof. Let $x^{(1)}_j$ denote the output of the first layer of the network (after training) given $x_j$ as input; more generally, let $x^{(k)}_j$ denote the output of the $k$th layer of the network given $x_j$ as input.

**Lemma 1.** Let $x_{\text{ort}}$ be a new input, orthogonal to the training samples $\{x_1, x_2, \ldots, x_s\}$. Then the output of the first layer, $x^{(1)}_{\text{ort}}$, is a zero-mean normally distributed random variable, independent of $\{x_1, x_2, \ldots, x_s\}$ and independent of $\{x^{(1)}_1, x^{(1)}_2, \ldots, x^{(1)}_s\}$ (see proof in the appendix).

Consider the orthogonal decomposition of $x_{\text{new}}$:

$$x_{\text{new}} = \sum_{j=1}^{s} h_j x_j + h_{\text{ort}} x_{\text{ort}}. \tag{2}$$

Based on A1-A2 and Lemma 1 we have

$$x^{(1)}_{\text{new}} = \sum_{j=1}^{s} h_j c_1 x^{(1)}_j + z, \tag{3}$$

where $z$ is a zero-mean normally distributed random variable independent of $\{x^{(1)}_1, x^{(1)}_2, \ldots, x^{(1)}_s\}$.
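Before continuing, a small numerical sketch of the decomposition in Eq. (2) (the dimensions and inputs are arbitrary demo choices): with orthonormal training inputs, the coefficient $h_j$ is simply the inner product (cosine similarity) $\langle x_{\text{new}}, x_j \rangle$, and the remainder $h_{\text{ort}} x_{\text{ort}}$ is the component of $x_{\text{new}}$ orthogonal to the training set.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 8, 3

# random orthonormal training inputs (rows taken from an orthogonal matrix)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = Q[:, :s].T                     # x_1 .. x_s, orthonormal rows

x_new = rng.normal(size=d)
x_new /= np.linalg.norm(x_new)

h = X @ x_new                      # h_j = <x_new, x_j> = cosine similarity with x_j
residual = x_new - X.T @ h         # = h_ort * x_ort, orthogonal to every x_j
h_ort = np.linalg.norm(residual)

print("h:", h)
print("reconstruction error:", np.linalg.norm(X.T @ h + residual - x_new))
print("residual orthogonal to training inputs:", np.allclose(X @ residual, 0))
```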
Clearly, if for some $j$ we have $x^{(1)}_{\text{new}}$ equal to $x^{(1)}_j$, or in the $\delta_\tau$ neighborhood of $x^{(1)}_j$, then $y_{\text{new}} = y_j$. For which $j$ is this most likely?

**Proposition 4.** The larger the magnitude of $h_j$, the more likely it is that $x^{(1)}_{\text{new}}$ is in the $\delta_\tau$ neighborhood of $x^{(1)}_j$.

Without loss of generality, assume that $c_1 = 1$ (just to simplify notation). Pick some index $k$ and rewrite (3) in the following manner:

$$x^{(1)}_{\text{new}} = \sum_{j=1}^{s} h_j x^{(1)}_j + z = \sum_{j \ne k} h_j x^{(1)}_j + h_k x^{(1)}_k + z = x^{(1)}_k + \Big(\sum_{j \ne k} h_j x^{(1)}_j + (h_k - 1)\, x^{(1)}_k\Big) + z.$$

In order for $x^{(1)}_{\text{new}}$ to be in the $\delta_\tau$ neighborhood of $x^{(1)}_k$, the term inside the parentheses needs to be canceled out by $z$. Since the Euclidean norm of the (vector) term in the parentheses is $\sqrt{\sum_{j=1}^{s} h_j^2 + 1 - 2h_k}$, while $z$ is concentrated at the origin (i.e., a zero-mean unimodal spherically symmetric random variable), we conclude that the larger the magnitude of $h_k$, the more likely it is that $x^{(1)}_{\text{new}}$ is in the $\delta_\tau$ neighborhood of $x^{(1)}_k$.

Next, if $x^{(1)}_{\text{new}}$ is not within the $\delta_\tau$-neighborhood of any $x^{(1)}_k$, we should examine the probabilities of $x^{(2)}_{\text{new}}$ being within the $\delta_\tau$-neighborhood of a particular $x^{(2)}_k$. Unfortunately, unlike the case covered by Lemma 1, $x^{(2)}_{\text{ort}}$ is not normally distributed. (Interestingly, when the widths of the layers are rapidly decreasing, similarly to autoencoders, subsequent outputs are again near-normal (Li and Woodruff 2021).) However, $x^{(2)}_{\text{ort}}$ is nevertheless a zero-mean unimodal spherically symmetric random variable; thus, similarly to Proposition 4, we conclude that the larger the magnitude of $h_k$, the more likely it is that $x^{(2)}_{\text{new}}$ is in the $\delta_\tau$ neighborhood of $x^{(2)}_k$. Furthermore, based on A1-A2, the same applies to layers $3, 4, \ldots, l_b$, and thus as $l_b \to \infty$ we conclude that with high probability: (i) $y_{\text{new}} \in \{y_1, y_2, \ldots, y_s\}$; (ii) for $y_j \in \{y_1, y_2, \ldots, y_s\}$, $P(y_{\text{new}} = y_j)$ is an increasing function of the cosine similarity between $x_{\text{new}}$ and $x_j$. ∎

**Corollary 1.** If a large ensemble of networks is trained under the conditions of Theorem 3, then the ensemble average of their outputs for $y_{\text{new}}$ is a kernel machine.

## Discussion

Although Boolean networks and the deep neural networks in Theorems 2-3 are perhaps the simplest overparameterized models, and are much simpler than other practical models, we feel there are a few lessons to be learnt from Theorem 1 and Theorems 2-3.

**Shannon effect vs. the "no Shannon effect".** Unlike the intuitive Shannon effect (i.e., that most Boolean functions are complex (Riordan and Shannon 1942)), the so-called "no Shannon effect" in randomly constructed functions (Genitrini, Gittenberger, and Mailler 2014) is virtually unknown in the disciplines of machine learning and statistics. Nevertheless, we hypothesize that it plays a key role not only in explaining the generalization of binary neural networks, but also, more broadly, for other overparameterized models (albeit through analogous phenomena). Indeed, in (Valle-Perez, Camargo, and Louis 2018) the authors rightly argued that general Kolmogorov complexity reasoning entails simplicity bias; however, their argument was non-constructive and abstract, whereas here we argue that simplicity bias is not much more than a nearly inevitable outcome of the central limit theorem (for wide networks) or a dynamical-system fixed-point theorem (for deep networks).

**Wide vs. deep networks.** There are two very different mechanisms that bias towards simplicity in neural networks. In a very wide network (see Fig. 1A) with random weights, the situation is governed by the central limit theorem, and thus the network is a Gaussian process (at initialization), which is a simple and well-behaved function of the input. The main driving force for the emergence of these simple functions in this setting is the large width of the network, and not its depth (Lee et al. 2019).
For a narrow but deep neural network we present an entirely different mechanism that nevertheless produces simplicity bias. In a multi-layered network (see Fig. 1B) each layer can be viewed as an operator in a dynamical system, or Markov chain, that acts on the output of its preceding layer; this should lead to convergence to a fixed point regardless of the initial input, i.e., a simple constant function.

The aforementioned viewpoint of a Markov chain forgetting its initial condition (i.e., its input) also elucidates why adding biases in Theorem 2 would not change the result substantially. Adding external biases, unrelated to the initial input, should not prevent the chain from forgetting its initial condition and converging to an output independent of the input (although now, instead of having $0$ as the output, the output is drawn from the stationary distribution). Similarly, the asymptotic phrasing of Theorem 2 does not imply that it is irrelevant for finite-depth networks; on the contrary, as with many other asymptotic convergence results, here too convergence to within an $\epsilon$-tolerance of the fixed point occurs within a finite depth (possibly exponentially fast).

Note also that our new conjecture is different from, and complementary to, the edge-of-chaos hypothesis (Langton 1990), which states that in order for a (post-training) dynamical system to carry out computations, it needs to be between ordered and chaotic, i.e., at the edge of chaos (e.g., with a Lyapunov exponent close to 0 in absolute value). Our new hypothesis states that a pre-training system needs to be in an ordered state (e.g., with a negative Lyapunov exponent) for it to generalize well after seeing the data (see also (Xiao, Pennington, and Schoenholz 2020; Schoenholz et al. 2016) for additional discussion in a specific setting).

Regardless of the universality and veracity of our conjecture above, the following (known) cautionary tale regarding research on generalization in wide networks should be reiterated: generalization in deep networks is likely to be driven by a different mechanism, and thus insights from shallow-and-wide networks might not be relevant.

**Training and optimizers.** In addition to providing a concrete and rigorous example of simplicity bias and its contribution to learning, Theorem 1 also suggests a lack of optimizer-dependence. The continuous neural-network analogue of the (computationally inefficient) naive algorithm would be "many initializations plus early stopping" (note that for classification, unlike regression, even if the weights are continuous random variables, there is a non-zero probability that random weights yield a network fitting the training data perfectly, albeit possibly a very small probability, and subject to the existence of such a fit), suggesting that the role of the specific optimizer is not crucial (and Theorem 3 demonstrates this for regression with an almost-perfect fit). As Proposition 2 also implies, the optimizer clearly affects which representation will be sampled (i.e., which function will be obtained after training), but it is not the driving force behind simplicity bias.

Our results complement the existing literature that examines the effects of training with SGD, as demonstrated in studies such as (Rahaman et al. 2019). Using Fourier-analysis terminology, these works have shown that SGD has a stronger impact on lower frequencies. If we adopt the definition of "simple" as having a fast-decreasing Fourier transform, our work introduces a crucial additional factor: the initial conditions, following random weight initialization, are simple, and training with SGD maintains this simplicity throughout the process, resulting in a simple outcome.
It is important to note that prior findings related to training dynamics alone are insufficient; we must also establish the simplicity of the initial conditions. Nevertheless, there is certainly merit in research studying specific optimizers. Here we did not address training with the standard optimizers; in Theorem 1 we only addressed learning via "toy training", mostly as a proof of concept that good learning can be performed in overparameterized models merely via the built-in simplicity bias (and without the use of an optimizer). Additionally, Theorem 2 does not address learning at all; it only serves to show that (randomly initialized) deep networks are biased towards simple functions. Nevertheless, there is reason to think that this prior bias will affect the posterior obtained after training. This seems even more likely when considering training which starts at a random initialization and is updated via iterations of a local search (as most common optimizers do). Indeed, ref. (Xiao, Pennington, and Schoenholz 2020) demonstrated similar results for infinitely wide deep networks (while here we also address finite-width, or narrow, networks). Theorem 3 shows that random constructions that fit the sample act as simple nearest-neighbor classifiers or kernel machines, suggesting that perhaps the vast majority of interpolating solutions behave so as well. After a random initialization, SGD merely changes the weights greedily to find the closest solution, without any ingenious hidden addition; thus, we conjecture that the solutions SGD finds are of a similar nature (see (Domingos 2020)), while sophisticated solutions are very rare.

Figure 1: Limiting simplicity for large neural networks. (A) A wide (and shallow) neural network. When the weights are random, summing over the output of the large hidden layer yields a Gaussian process for the output, and thus simplicity. (B) A deep (and narrow) neural network. Here the hidden layers might act as an operator in a dynamical system, driving the initial input towards a fixed point, given as the output of the penultimate layer. Thus, in a deep network the driving force for the emergence of a simple function (the constant fixed point) and for generalization is different from the one in a wide network. [Schematic of the two architectures; only the caption is reproduced here.]

## Appendix

Here we provide the proof of Lemma 1. Recall the following affine transformation theorem:

**Theorem 4 (Affine Transformation Theorem).** Let $X$ be a $p$-dimensional multivariate normal random variable, i.e., $X \sim N(\mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix of $X$. Let $A$ be a constant $q \times p$ matrix. Then the random variable $Y = AX$ is also multivariate normally distributed, and its distribution is given by $Y \sim N(A\mu, A\Sigma A^T)$.

Recall that for the first layers, under A2 and our previous notation, the output of $x_{j,l}$, the $j$th unit in the $l$th layer, is $\sum_{k=1}^{w} a^{(l)}_{j,k} x_{k,l-1}$. Focusing on the first layer, we drop the superscript notation and write $a_j = (a_{j,1}, a_{j,2}, a_{j,3}, \ldots, a_{j,w})^T$. Arrange the (transposed) inputs $x_j^T$ and $x_{\text{ort}}^T$ in the following block matrix (where each $0$ is a row vector of the appropriate dimension), concatenate the weight vectors $a_1, \ldots, a_w$ into a single column vector, and note that multiplying the two yields the first-layer outputs:

$$
\begin{pmatrix}
x_1^T & 0 & \cdots & 0 \\
0 & x_1^T & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & x_1^T \\
x_2^T & 0 & \cdots & 0 \\
\vdots & & & \vdots \\
0 & 0 & \cdots & x_2^T \\
\vdots & & & \vdots \\
x_s^T & 0 & \cdots & 0 \\
\vdots & & & \vdots \\
0 & 0 & \cdots & x_s^T \\
x_{\text{ort}}^T & 0 & \cdots & 0 \\
\vdots & & & \vdots \\
0 & 0 & \cdots & x_{\text{ort}}^T
\end{pmatrix}
\begin{pmatrix}
a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_w
\end{pmatrix}
=
\begin{pmatrix}
x^{(1)}_1 \\ x^{(1)}_2 \\ x^{(1)}_3 \\ \vdots \\ x^{(1)}_{\text{ort}}
\end{pmatrix}
$$

Recall that $(a_1, a_2, a_3, \ldots, a_w)^T$ is normally distributed with a diagonal covariance matrix, and that it is here multiplied by a matrix with orthogonal rows; thus, by Theorem 4 their product is normally distributed with a diagonal covariance matrix. In particular, $x^{(1)}_{\text{ort}}$ is a zero-mean normally distributed random variable, and conditioning on $\{x^{(1)}_1, x^{(1)}_2, \ldots, x^{(1)}_s\}$ does not change its distribution. ∎
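A small numerical sanity check of this argument (ours; the dimensions and number of draws are arbitrary demo choices), at the level of a single first-layer unit: multiplying i.i.d. Gaussian weights by orthonormal inputs yields jointly Gaussian outputs with an (empirically) diagonal covariance, so the output on $x_{\text{ort}}$ is uncorrelated with, and hence independent of, the outputs on the training inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, n_draws = 6, 2, 50_000

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
x = [Q[:, j] for j in range(s)] + [Q[:, s]]        # orthonormal x_1..x_s and x_ort

# one unit's first-layer output on each input is a . x_j with a ~ N(0, I);
# draw many independent weight vectors a and collect the outputs
A = rng.normal(size=(n_draws, d))
outs = np.stack([A @ xi for xi in x], axis=1)      # columns: outputs on x_1..x_s, x_ort

print("empirical covariance (should be close to the identity):")
print(np.round(np.cov(outs.T), 2))
```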
## References

Arpit, D.; Jastrzebski, S.; Ballas, N.; Krueger, D.; Bengio, E.; Kanwal, M. S.; Maharaj, T.; Fischer, A.; Courville, A.; Bengio, Y.; et al. 2017. A Closer Look at Memorization in Deep Networks. In International Conference on Machine Learning, 233-242.

Bhattacharya, R.; and Majumdar, M. 2007. Random Dynamical Systems: Theory and Applications. Cambridge University Press.

Boxler, P. 1995. Lyapunov exponents indicate stability and detect stochastic bifurcations. In Probabilistic Methods in Applied Physics, 97-119. Springer.

Chauvin, B.; Flajolet, P.; Gardy, D.; and Gittenberger, B. 2004. And/Or trees revisited. Combinatorics, Probability and Computing, 13(4-5): 475-497.

Cohen, J. E.; and Newman, C. M. 1984. The stability of large random matrices and their products. The Annals of Probability, 283-310.

Crisanti, A.; Paladin, G.; and Vulpiani, A. 1993. Products of Random Matrices: In Statistical Physics. NATO ASI Series F: Computer and System Sciences. Springer Berlin Heidelberg. ISBN 9783540565758.

De Palma, G.; Kiani, B.; and Lloyd, S. 2019. Random deep neural networks are biased towards simple functions. In Advances in Neural Information Processing Systems, 1964-1976.

Domingos, P. 2020. Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152.

Genitrini, A.; Gittenberger, B.; and Mailler, C. 2014. No Shannon effect induced by And/Or trees. In 25th International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, 109-120.

Hochreiter, S.; and Schmidhuber, J. 1997. Flat minima. Neural Computation, 9(1): 1-42.

Huang, W. R.; Emam, Z.; Goldblum, M.; Fowl, L.; Terry, J. K.; Huang, F.; and Goldstein, T. 2019. Understanding generalization through visualizations. arXiv preprint arXiv:1906.03291.

Jacot, A.; Gabriel, F.; and Hongler, C. 2018. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 8571-8580.

Kawaguchi, K.; Kaelbling, L. P.; and Bengio, Y. 2017. Generalization in deep learning. arXiv preprint arXiv:1710.05468.
Keskar, N. S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; and Tang, P. T. P. 2016. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.

Langton, C. G. 1990. Computation at the edge of chaos: Phase transitions and emergent computation. Physica D: Nonlinear Phenomena, 42(1-3): 12-37.

Lee, J.; Xiao, L.; Schoenholz, S.; Bahri, Y.; Novak, R.; Sohl-Dickstein, J.; and Pennington, J. 2019. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, 8572-8583.

Lefmann, H.; and Savický, P. 1997. Some typical properties of large And/Or Boolean formulas. Random Structures & Algorithms, 10(3): 337-351.

Li, Y.; and Woodruff, D. P. 2021. The Product of Gaussian Matrices Is Close to Gaussian. arXiv preprint arXiv:2108.09887.

Loog, M.; Viering, T.; Mey, A.; Krijthe, J. H.; and Tax, D. M. 2020. A brief prehistory of double descent. Proceedings of the National Academy of Sciences, 117(20): 10625-10626.

Mingard, C.; Skalse, J.; Valle-Pérez, G.; Martínez-Rubio, D.; Mikulik, V.; and Louis, A. A. 2019. Neural networks are a priori biased towards Boolean functions with low entropy. arXiv preprint arXiv:1909.

Neyshabur, B.; Li, Z.; Bhojanapalli, S.; LeCun, Y.; and Srebro, N. 2018. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations.

Neyshabur, B.; Tomioka, R.; and Srebro, N. 2014. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.

Rahaman, N.; Baratin, A.; Arpit, D.; Draxler, F.; Lin, M.; Hamprecht, F.; Bengio, Y.; and Courville, A. 2019. On the spectral bias of neural networks. In International Conference on Machine Learning, 5301-5310. PMLR.

Razin, N.; and Cohen, N. 2020. Implicit regularization in deep learning may not be explainable by norms. arXiv preprint arXiv:2005.06398.

Riordan, J.; and Shannon, C. E. 1942. The number of two-terminal series-parallel networks. Journal of Mathematics and Physics, 21(1-4): 83-93.

Schoenholz, S. S.; Gilmer, J.; Ganguli, S.; and Sohl-Dickstein, J. 2016. Deep Information Propagation. stat, 1050: 4.

Tao, T.; Vu, V.; and Krishnapur, M. 2010. Random matrices: Universality of ESDs and the circular law. The Annals of Probability, 38(5): 2023-2065.

Valle-Perez, G.; Camargo, C. Q.; and Louis, A. A. 2018. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations.

Williams, C. K.; and Rasmussen, C. E. 2006. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA.

Wu, L.; Zhu, Z.; et al. 2017. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239.

Wyner, A. J.; Olson, M.; Bleich, J.; and Mease, D. 2017. Explaining the success of AdaBoost and random forests as interpolating classifiers. Journal of Machine Learning Research, 18: 1-33.

Xiao, L.; Pennington, J.; and Schoenholz, S. 2020. Disentangling trainability and generalization in deep neural networks. In International Conference on Machine Learning, 10462-10472. PMLR.