Published as a conference paper at ICLR 2023

ADAPTIVE OPTIMIZATION IN THE ∞-WIDTH LIMIT

Etai Littwin, Apple, elittwin@apple.com
Greg Yang, Microsoft Research, gregyang@microsoft.com

ABSTRACT

Recent works have developed a detailed understanding of the behavior of large neural networks via their infinite-width limits, e.g., the neural tangent kernel (NTK) and the feature learning (µ) limits. These theories were developed for stochastic gradient descent. Yet, in practice, all large NNs are trained using Adam or other adaptive gradient optimizers (AGOs), which are not covered by such previous works. Here, we close this gap via the Tensor Programs framework. Specifically, for deep MLPs, we derive the NTK and µ parametrizations as well as their infinite-width limits. We find 1) the NTK limit of AGOs, in contrast to that of SGD, now depends nonlinearly on the loss derivative but nevertheless still fails to learn features; 2) this is fixed by the µ limit of AGOs (as in the case of SGD). To obtain these results, we extend the Tensor Programs language with a new instruction that allows one to express the gradient processing done by AGOs.

1 INTRODUCTION

Infinite-width limits of neural networks have been a major focus of study in the last several years, underlying some of the most profound recent breakthroughs in our theoretical understanding of deep learning. Specifically, two types of limits have garnered the lion's share of attention from the research community. The kernel limit, popularized by the seminal work of Jacot et al. (2018), refers to a regime of training where the weights remain roughly at their initialized values, and training can be entirely characterized in function space by a constant kernel of a particular form, which depends on the network architecture. While easier to analyze, this limit does not permit updates to the internal representation of the network, hence it cannot account for data-dependent feature learning, a staple of deep learning in practice. In contrast, the µ limit (of which the well-known mean field limit is a specific case in 1-hidden-layer perceptrons) refers to a regime of training where the weights adapt to the data during training in a nonlinear fashion, facilitating representation learning. It was recently shown in Yang & Hu (2020) that, under vanilla gradient-based training, the precise setting of various hyperparameters relating to initialization scale and learning rate determines the type of infinite-width limit one can associate with a trained neural network. Notably, the µ parameterization was identified as the unique parameterization which gives rise to "maximal" feature learning dynamics in the infinite-width limit, where maximal refers to the fact that every layer learns features. However, quite remarkably, no such limits have yet been formally established for adaptive gradient-based optimization of neural networks, which we make the focus of the present paper.

Our main results in this paper are the identification and prescription of two types of infinite-width limits for popular AGOs, the counterparts of the kernel and feature learning limits for vanilla GD. For the kernel limit counterpart, we uncover fundamentally different dynamics for adaptive optimization, referred to as the adaptive neural tangent kernel (ANTK) regime. In this limit, the training dynamics can no longer be described by kernel gradient descent, since the kernel function itself depends nonlinearly on the loss derivative.
Our results lay a clear path to theoretically analyzing the implicit biases of AGOs in the infinite-width limit.

Key Technical Contribution: Analyzing the dynamics of adaptive optimization of arbitrary neural network architectures in the infinite-width limit presents a major technical challenge. As our main technical tool, we build upon the TP framework introduced and developed in a series of recent papers Yang (2019; 2020a;b). At a high level, the mechanics of the TP technique involve 1) writing down the relevant neural network computation (e.g., the first forward pass in the NNGP case) as a principled composition of matrix multiplications and coordinatewise nonlinearities, called a Tensor Program, and 2) recursively calculating the distribution of the coordinates of each vector via what is called the Master Theorem. However flexible, the "language" of TP is not expressive enough to represent the necessary computations involving adaptive optimization, since it does not support the application of nonlinear functions to higher-order tensors. In the present paper, we solve this issue by expanding the TP framework with additional functionalities, and proving a new Master Theorem which enables our analysis. While we present a simple application of our new framework to MLPs in Theorem 4.1 and Theorem 4.2, it is applicable in a much wider setting, including most practical architectures and algorithms. As an additional technical contribution, we prove an $O(n^{-1/2})$ convergence rate guarantee (where $n$ denotes the width) for all variables produced by the program, which might be of independent interest.

Our Contributions: This paper presents the following major contributions:

1. We present the first rigorous infinite-width analysis of adaptive optimization of MLPs parameterized using the ANTK and µ parameterizations. Our results rigorously equate the training of such networks to discrete-time dynamical equations.

2. We develop a new tensor program framework along with convergence rate guarantees, unlocking the infinite-width analysis of adaptive optimization in an architecturally universal sense.

Paper Organization: This paper is organized as follows. We survey related work in Section 2. In Section 3 we set up preliminaries and notations used extensively in Section 4. In Section 4 we illustrate the ANTK and µ limits for MLPs. Section 5 is dedicated to a formal introduction of the new TP framework. Although it is used as the main tool to prove our results in Section 4, Section 5 is more general and can be read as a standalone.

2 RELATED WORK

A large body of literature exists on both the kernel (NTK) limit Arora et al. (2019); Jacot et al. (2018); Lee et al. (2019); Yang (2020c); Yang & Littwin (2021) and the mean field limit for 2-layer neural networks Chizat & Bach (2018); Mei et al. (2018b); Nguyen & Pham (2020); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2020). Various papers describe the kernel and feature learning regimes more generally without taking an infinite-width limit. Chizat et al. (2019) describe the "lazy training" regime in arbitrary differentiable programs, which is controlled by a single parameter α that scales the output. It is shown that when α is large, the weights need only move slightly to fit the training data, and the network essentially performs kernel learning. Many papers Allen-Zhu et al. (2019); Huang & Yau (2020); Mei et al.
(2018a) view the kernel and feature learning regimes as learning at different timescales, explicitly incorporating the time dependence in the infinite-width limit, and others derive finite-width corrections to the NTK for finite-width networks Hanin & Nica (2020); Littwin et al. (2020a). In this paper, we consider training time to be constant, and take only the width to infinity. This way, kernel and feature learning behavior are separated by the parameterization employed at initialization, and not by width or training time. TPs, first introduced in Yang (2019) and expanded upon in Yang (2020a;b), were developed as a theoretical framework to analyze the infinite-width limits of any architecture expressible in the TP language, in an attempt to get rid of the per-architecture analysis prevalent in the literature Alemohammad et al. (2021); Du et al. (2019); Hron et al. (2020); Littwin et al. (2020b). Yang & Hu (2020) defined a natural space of neural network parametrizations (abc-parametrizations), and classified all resulting infinite-width limits into two possible categories: 1) the kernel limit, in which weights and activations remain roughly in their initialization state, and 2) the feature learning limit, in which weights move substantially and adapt to the data. The µ parameterization was then identified as the "optimal" parameterization for arbitrary architectures in which all layers learn features, and was later heuristically extended to AGOs Yang et al. (2021). Separately, AGOs Duchi et al. (2010); Kingma & Ba (2015); Zhou et al. (2018) and their variants were developed to accelerate learning by adapting the learning rate on a per-parameter basis, and currently serve as a prerequisite for training large-scale transformer models Huang et al. (2020); Liu et al. (2020); Zhang et al. (2019). Crucially, no previous work has yet developed a theory for infinite-width neural networks trained with AGOs.

3 PRELIMINARIES

Adaptive Optimizers: Generically, if $g_0, g_1, \dots, g_t \in \mathbb{R}$ denote the gradients of some scalar parameter $w \in \mathbb{R}$ at steps $0, 1, \dots, t$, an adaptive update $\Delta w_t = w_{t+1} - w_t$ at step $t$ takes the form $\Delta w_t = -\eta \frac{m}{\sqrt{v} + \epsilon}$, where $\eta$ is the learning rate and $m$ and $v$ are both functions of the past gradients $g_0, \dots, g_t$. For example, in Adam, $m$ and $v$ are exponential moving averages of $g_i$ and $g_i^2$. Here, we consider an even more general notion of adaptive updates, encompassing all modern AGOs.

Definition 3.1. We say an update $\Delta w_t \propto Q_t(g_0, g_1, \dots, g_t; \epsilon)$ to a weight $w$ at time $t$ is adaptive if it is proportional (up to a constant factor) to a function $Q_t : \mathbb{R}^{t+1} \to \mathbb{R}$ such that for all $c \neq 0$, $Q_t(cg_0, cg_1, \dots, cg_t; c\epsilon) = Q_t(g_0, g_1, \dots, g_t; \epsilon)$. Moreover, if $Q_t(g_0, g_1, \dots, g_t; \epsilon) = Q(g_t; \epsilon)$ (i.e., it only depends on $g_t$), then we furthermore say $\Delta w_t$ is memoryless.

To maximize clarity, we focus on the simpler case of memoryless adaptive updates in the main text. For example, in Adam this amounts to setting $\beta_1 = \beta_2 = 0$. This simplification will already highlight the key differences between the adaptive and non-adaptive cases. We provide an extension of these results to the case of AGOs with memory in Appendix C, and provide numerical verification of our results in Appendix D.
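To make Definition 3.1 concrete, here is a minimal NumPy sketch (ours, not from the paper) of one memoryless adaptive update, taking $Q(g; \epsilon) = g / (|g| + \epsilon)$, i.e., the $\beta_1 = \beta_2 = 0$ special case of Adam mentioned above; the function name and the scale-invariance check are illustrative only.

```python
import numpy as np

def memoryless_adaptive_update(g, eps):
    """Memoryless adaptive update Q(g; eps) = g / (|g| + eps).

    This is the beta1 = beta2 = 0 special case of Adam: with no moving
    averages, m = g and sqrt(v) = |g|, so the update direction depends
    only on the current gradient g and on eps.
    """
    return g / (np.abs(g) + eps)

# Scale invariance from Definition 3.1, checked for a positive rescaling c:
# Q(c*g; c*eps) == Q(g; eps).
rng = np.random.default_rng(0)
g = rng.standard_normal(5)
eps, c = 1e-3, 7.5
assert np.allclose(memoryless_adaptive_update(c * g, c * eps),
                   memoryless_adaptive_update(g, eps))
```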
MLPs and ABC(D) Parameterization: We use a standard scalar-output MLP $f$ with $L$ hidden layers as a working example to illustrate the adaptive kernel and feature learning limits. Given an input sample $\xi \in \mathbb{R}^{d_{in}}$, weight matrices $W^{L+1} \in \mathbb{R}^{1 \times n}$, $\{W^l\}_{l=2}^{L} \subset \mathbb{R}^{n \times n}$, $W^1 \in \mathbb{R}^{n \times d_{in}}$, and an activation function $\phi$ which we assume has a pseudo-Lipschitz first derivative, the output $f(\xi) \in \mathbb{R}$ is given by:

$$f(\xi) = W^{L+1} x^L(\xi), \qquad x^l(\xi) = \phi(h^l(\xi)), \qquad h^l(\xi) = W^l x^{l-1}(\xi) \quad \text{for } 1 \le l \le L, \qquad x^0(\xi) = \xi \qquad (1)$$

We adopt the abc-parameterization convention from Yang & Hu (2020). Namely, for any layer $l$, each weight matrix is parameterized as $W^l = n^{-a_l} w^l$, where $w^l$ are the learnable weights, which are initially sampled iid from a normal distribution $\mathcal{N}(0, n^{-2b_l})$. Finally, the learning rate is parameterized as $\eta n^{-c_l}$, where we plug in $\eta = 1$ for simplicity. In this paper, we assign specific values to $\{a_l\}_l$, $\{b_l\}_l$, $\{c_l\}_l$ for the ANTK and µ parameterizations. Additionally, we parameterize the $\epsilon$ parameter in the AGO as $\epsilon_l = n^{-d_l} \epsilon$, where $\epsilon > 0$. The per-layer scaling of $\epsilon_l$ will turn out to be crucial to prevent the adaptive gradients from collapsing to either 0 or a step function as $n \to \infty$. We summarize the two parameterizations in the following table:

| Parameterization | $a_l$ | $b_l$ | $c_l$ | $d_l$ |
|---|---|---|---|---|
| ANTK | $\tfrac{1}{2}$ if $l > 1$, $0$ else | $0$ | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else |
| µ | $-\tfrac{1}{2}$ if $l = 1$, $\tfrac{1}{2}$ if $l = L+1$, $0$ else | $\tfrac{1}{2}$ | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else |

Table 1: ANTK and µ parameterizations.
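To ground Eq. (1) and Table 1, the sketch below builds the abc-parameterized forward pass, with effective weights $W^l = n^{-a_l} w^l$ and $w^l$ initialized entrywise from $\mathcal{N}(0, n^{-2b_l})$. The helper names, the tanh activation, and the $L = 3$ example are our own illustrative choices; the example exponents are read off the µ row of Table 1 as reconstructed above.

```python
import numpy as np

def init_mlp(d_in, n, L, b, rng):
    """Sample trainable weights w^l with iid N(0, n^{-2 b_l}) entries."""
    shapes = {1: (n, d_in), **{l: (n, n) for l in range(2, L + 1)}, L + 1: (1, n)}
    return {l: rng.standard_normal(shapes[l]) * n ** (-b[l]) for l in shapes}

def forward(w, xi, n, L, a, phi=np.tanh):
    """Forward pass of Eq. (1), using the effective weights W^l = n^{-a_l} w^l."""
    x = xi
    for l in range(1, L + 1):
        h = (n ** (-a[l]) * w[l]) @ x          # h^l = W^l x^{l-1}
        x = phi(h)                             # x^l = phi(h^l)
    return (n ** (-a[L + 1]) * w[L + 1] @ x).item()  # f = W^{L+1} x^L

# Example with the mu exponents from Table 1 for L = 3 hidden layers.
L, n, d_in = 3, 1024, 10
a = {1: -0.5, 2: 0.0, 3: 0.0, 4: 0.5}
b = {l: 0.5 for l in range(1, L + 2)}
rng = np.random.default_rng(0)
w = init_mlp(d_in, n, L, b, rng)
print(forward(w, rng.standard_normal(d_in), n, L, a))
```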
Representing (Pre)Activation Vectors via Random Variables: As we will see, as the width becomes large, the entries of the activation and preactivation vectors become roughly iid (just like in the SGD case), both at initialization (which is easy to see) and during training (which is harder to see). Hence a vector's behavior can be tracked via a random variable that reflects the distribution of its entries. Concretely, if $x \in \mathbb{R}^n$ is one such vector, then we write $Z^x$ for such a random variable, such that $x$'s entries look like iid samples from $Z^x$. When $x$ is scaled to have typical entry size independent of $n$ (i.e., $\|x\|^2/n = \Theta(1)$ as $n \to \infty$), then $Z^x$ can be taken to be a random variable independent of $n$ as well. In general, given two such vectors $x, y \in \mathbb{R}^n$, their random variables $Z^x$ and $Z^y$ will be correlated, in such a way that $\lim_{n\to\infty} \frac{x^\top y}{n} = \mathbb{E}\, Z^x Z^y$. Generally, inference with networks at initialization entails computing expectations involving Gaussian $Z$ variables, which take a relatively simple form. However, a fundamental question is how the $Z$ variables evolve during training, which we address next.

4 ADAPTIVE OPTIMIZATION OF AN MLP

In the following section we illustrate the infinite-width limits of adaptive optimization for simple MLPs. For each parameterization, we begin by laying out the basic intuition, culminating in Theorem 4.1 and Theorem 4.2. For a cleaner presentation, we assume the first and last layers are fixed; however, our results are easily extended to the general case. In our setup we assume the network is trained using an AGO according to Definition 3.1, with a batch size of 1.

Notations: Slightly abusing notation, we use subscripts to denote both the step index $t$ and the coordinates of vectors with $\alpha, \beta$. We assume $\xi_t$ is the training sample fed to the neural network at step $t$ (starting from $\xi_0$), and we use $y_t(\xi)$ for any input-dependent vector/scalar $y$ to denote its evaluation given $\xi$ at step $t$. To reduce clutter, we drop the explicit dependency on the input when it is implied by the step index (i.e., $y_t = y_t(\xi_t)$ and $y = y_0(\xi_0)$). We use $\bar{y}_t = y_t(\bar{\xi})$ to express the dependency of $y$ on an arbitrary input $\bar{\xi}$ at step $t$. We also denote $\Delta y_t(\xi) = y_{t+1}(\xi) - y_t(\xi)$ and $\delta y_t(\xi) = \sqrt{n}\, \Delta y_t(\xi)$.

We assume the network is trained using a generic loss function $\mathcal{L}$, with loss derivative $\mathcal{L}'_t = \partial_{f_t} \mathcal{L}_t$. We use the notation $dh^l$ based on context: for the ANTK parameterization, we set $dh^l(\xi) \stackrel{\text{def}}{=} \sqrt{n}\, \frac{\partial f}{\partial h^l}(\xi) \in \mathbb{R}^n$, whereas for the µ parameterization, we set $dh^l(\xi) \stackrel{\text{def}}{=} n\, \frac{\partial f}{\partial h^l}(\xi) \in \mathbb{R}^n$. This context-dependent notation is convenient since it ensures that the components of $dh^l(\xi)$ are roughly of order $\Theta(1)$ for both parameterizations. Finally, we use $\mathring{\ }$ to denote the infinite-width limit of a (possibly random) scalar (i.e., $\lim_{n\to\infty} f_t = \mathring{f}_t$). Using the above notation, we can express the gradient of any intermediate layer $w^l$ at step $t$ for both parameterizations as $\frac{1}{n}\, \mathcal{L}'_t\, dh^l_t\, x^{l-1\top}_t$. Using Definition 3.1, the adaptive weight update $\Delta w^l_t$ for both parameterizations is obtained by applying $Q$ entrywise to this gradient with the scaled parameter $\epsilon_l$ and the learning rate $n^{-c_l}$:

$$\Delta w^l_t \propto n^{-c_l}\, Q\!\left(\tfrac{1}{n}\, \mathcal{L}'_t\, dh^l_t\, x^{l-1\top}_t;\ \epsilon_l\right).$$
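Combining the gradient expression above with the memoryless update of Section 3, the following sketch (ours; the variable names and the concrete choice $Q(g; \epsilon_l) = g/(|g| + \epsilon_l)$ are illustrative) performs one adaptive step on a hidden layer's trainable weights, using $\epsilon_l = n^{-d_l}\epsilon$ and learning rate $n^{-c_l}$ as in Table 1.

```python
import numpy as np

def adaptive_hidden_layer_step(w_l, dh_l, x_prev, loss_deriv, n, c_l, d_l,
                               eta=1.0, eps=1e-8):
    """One memoryless adaptive step on the trainable hidden-layer weights w^l.

    The gradient of w^l (both parameterizations) is (1/n) * L'_t * dh^l_t x^{l-1,T}_t;
    Q(g; eps_l) = g / (|g| + eps_l) is applied entrywise with eps_l = n^{-d_l} * eps,
    and the result is scaled by the learning rate eta * n^{-c_l}.
    """
    grad = (loss_deriv / n) * np.outer(dh_l, x_prev)   # (n, n) gradient of w^l
    eps_l = eps * n ** (-d_l)                          # layer-wise epsilon scaling
    step = grad / (np.abs(grad) + eps_l)               # entrywise memoryless Q
    return w_l - eta * n ** (-c_l) * step

# Usage with hypothetical backward quantities and the hidden-layer exponents c_l = d_l = 1.
n = 1024
rng = np.random.default_rng(0)
w_l = rng.standard_normal((n, n)) * n ** -0.5          # b_l = 1/2 initialization
w_l = adaptive_hidden_layer_step(w_l, rng.standard_normal(n), rng.standard_normal(n),
                                 loss_deriv=0.3, n=n, c_l=1.0, d_l=1.0)
```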