Published as a conference paper at ICLR 2023

ADAPTIVE OPTIMIZATION IN THE ∞-WIDTH LIMIT

Etai Littwin, Apple, elittwin@apple.com
Greg Yang, Microsoft Research, gregyang@microsoft.com

ABSTRACT

Recent works have developed a detailed understanding of the behavior of large neural networks via their infinite-width limits, e.g., the neural tangent kernel (NTK) and the feature learning (µ) limits. These theories were developed for stochastic gradient descent. Yet, in practice, all large NNs are trained using Adam or other adaptive gradient optimizers (AGOs), which are not covered by such previous works. Here, we close this gap via the Tensor Programs framework. Specifically, for deep MLPs, we derive the NTK and µ parametrizations as well as their infinite-width limits. We find 1) the NTK limit of AGOs, in contrast to that of SGD, now depends nonlinearly on the loss derivative but nevertheless still fails to learn features; 2) this is fixed by the µ limit of AGOs (as in the case of SGD). To obtain these results, we extend the Tensor Programs language with a new instruction that allows one to express the gradient processing done by AGOs.

1 INTRODUCTION

Infinite-width limits of neural networks have been a major focus of study in the last several years, underlying some of the most profound recent breakthroughs in our theoretical understanding of deep learning. Specifically, two types of limits have garnered the lion's share of attention from the research community. The kernel limit, popularized by the seminal work of Jacot et al. (2018), refers to a regime of training where the weights remain roughly at their initialized values, and training can be entirely characterized in function space by a constant kernel of a particular form, which depends on the network architecture. While easier to analyze, this limit does not permit updates to the internal representation of the network, hence it cannot account for data-dependent feature learning, a staple of deep learning in practice. In contrast, the µ limit (of which the well-known mean field limit is a specific case in 1-hidden-layer perceptrons) refers to a regime of training where the weights adapt to the data during training in a nonlinear fashion, facilitating representation learning. It was recently shown in Yang & Hu (2020) that, under vanilla gradient-based training, the precise setting of various hyperparameters relating to initialization scale and learning rate determines the type of infinite-width limit one can associate with a trained neural network. Notably, the µ parameterization was identified as the unique parameterization which gives rise to "maximal" feature learning dynamics in the infinite-width limit, where maximal refers to the fact that every layer learns features. However, quite remarkably, no such limits have yet been formally established for adaptive gradient-based optimization of neural networks, which we make the focus of the present paper.

Our main results in this paper are the identification and prescription of two types of infinite-width limits for popular AGOs, the counterparts of the kernel and feature learning limits for vanilla GD. For the kernel limit counterpart, we uncover fundamentally different dynamics for adaptive optimization, referred to as the adaptive neural tangent kernel (ANTK) regime. In this limit, the training dynamics can no longer be described by kernel gradient descent, since the kernel function itself depends nonlinearly on the loss derivative.
Our results lay a clear path to theoretically analyzing the implicit biases of AGOs in the infinite-width limit.

Key Technical Contribution: Analyzing the dynamics of adaptive optimization of arbitrary neural network architectures in the infinite-width limit presents a major technical challenge. As our main technical tool, we build upon the TP framework introduced and developed in a series of recent papers Yang (2019; 2020a;b). At a high level, the mechanics of the TP technique involve 1) writing down the relevant neural network computation (e.g., the first forward pass in the NNGP case) as a principled composition of matrix multiplications and coordinatewise nonlinearities, called a Tensor Program, and 2) recursively calculating the distribution of the coordinates of each vector via what is called the Master Theorem. However flexible, the "language" of TP is not expressive enough to represent the necessary computations involving adaptive optimization, since it does not support the application of nonlinear functions to higher-order tensors. In the present paper, we solve this issue by expanding the TP framework with additional functionalities, and proving a new Master Theorem which enables our analysis. While we present a simple application of our new framework to MLPs in Theorem 4.1 and Theorem 4.2, it is applicable in a much wider setting, including most practical architectures and algorithms. As an additional technical contribution, we prove an $O(n^{-1/2})$ convergence rate guarantee (where $n$ denotes the width) for all variables produced by the program, which might be of independent interest.

Our Contributions: This paper presents the following major contributions:

1. We present the first rigorous infinite-width analysis of adaptive optimization of MLPs parameterized using the ANTK and µ parameterizations. Our results rigorously equate the training of such networks to discrete-time dynamical equations.

2. We develop a new tensor program framework along with convergence rate guarantees, unlocking the infinite-width analysis of adaptive optimization in an architecturally universal sense.

Paper Organization: This paper is organized as follows. We survey related work in Section 2. In Section 3 we set up preliminaries and notations used extensively in Section 4. In Section 4 we illustrate the ANTK and µ limits for MLPs. Section 5 is dedicated to a formal introduction of the new TP framework. Although it is used as the main tool to prove our results in Section 4, Section 5 is more general and can be read as a standalone.

2 RELATED WORK

A large body of literature exists on both the kernel (NTK) limit Arora et al. (2019); Jacot et al. (2018); Lee et al. (2019); Yang (2020c); Yang & Littwin (2021) and the mean field limit for 2-layer neural networks Chizat & Bach (2018); Mei et al. (2018b); Nguyen & Pham (2020); Rotskoff & Vanden-Eijnden (2018); Sirignano & Spiliopoulos (2020). Various papers describe the kernel and feature learning regimes more generally without taking an infinite-width limit. Chizat et al. (2019) describe the "lazy training" regime in arbitrary differentiable programs, which is controlled by a single parameter α that scales the output. It is shown that when α is large, the weights need only move slightly to fit the training data, and the network essentially performs kernel learning. Many papers Allen-Zhu et al. (2019); Huang & Yau (2020); Mei et al.
(2018a) view the kernel and feature learning regimes as learning at different timescales, explicitly incorporating the time dependence in the infinite-width limit, and others derive finite-width corrections to the NTK for finite-width networks Hanin & Nica (2020); Littwin et al. (2020a). In this paper, we consider training time to be constant, and take only the width to infinity. This way, kernel and feature learning behavior are separated by the parameterization employed at initialization, and not by width or training time. TPs, first introduced in Yang (2019) and expanded upon in Yang (2020a;b), were developed as a theoretical framework to analyze the infinite-width limits of any architecture expressible in the TP language, in an attempt to get rid of the per-architecture analysis prevalent in the literature Alemohammad et al. (2021); Du et al. (2019); Hron et al. (2020); Littwin et al. (2020b). Yang & Hu (2020) defined a natural space of neural network parametrizations (abc-parametrizations), and classified all resulting infinite-width limits into two possible categories: 1) the kernel limit, in which weights and activations remain roughly in their initialization state, and 2) the feature learning limit, in which weights move substantially and adapt to the data. The µ parameterization was then identified as the "optimal" parameterization for arbitrary architectures in which all layers learn features, and was later heuristically extended to AGOs Yang et al. (2021). Separately, AGOs Duchi et al. (2010); Kingma & Ba (2015); Zhou et al. (2018) and their variants were developed to accelerate learning by adapting the learning rate on a per-parameter basis, and currently serve as a prerequisite for training large-scale transformer models Huang et al. (2020); Liu et al. (2020); Zhang et al. (2019). Crucially, no previous work has yet developed a theory for infinite-width neural networks trained with AGOs.

3 PRELIMINARIES

Adaptive Optimizers: Generically, if $g_0, g_1, \dots, g_t \in \mathbb{R}$ denote the gradients of some scalar parameter $w \in \mathbb{R}$ at steps $0, 1, \dots, t$, an adaptive update $\Delta w_t = w_{t+1} - w_t$ at step $t$ takes the form $\Delta w_t = -\eta \frac{m}{\sqrt{v} + \epsilon}$, where $\eta$ is the learning rate and $m$ and $v$ are both functions of the past gradients $g_0, \dots, g_t$. For example, in Adam, $m$ and $v$ are exponential moving averages of $g_i$ and $g_i^2$. Here, we consider an even more general notion of adaptive updates, encompassing all modern AGOs.

Definition 3.1. We say an update $\Delta w_t \propto Q_t(g_0, g_1, \dots, g_t; \epsilon)$ to a weight $w$ at time $t$ is adaptive if it is proportional (up to a constant factor) to a function $Q_t : \mathbb{R}^{t+1} \to \mathbb{R}$ such that for all $c \neq 0$, $Q_t(cg_0, cg_1, \dots, cg_t; c\epsilon) = Q_t(g_0, g_1, \dots, g_t; \epsilon)$. Moreover, if $Q_t(g_0, g_1, \dots, g_t; \epsilon) = Q(g_t; \epsilon)$ (i.e., it only depends on $g_t$), then we furthermore say $\Delta w_t$ is memoryless.

To maximize clarity, we focus on the simpler case of memoryless adaptive updates in the main text. For example, in Adam this amounts to setting $\beta_1 = \beta_2 = 0$. This simplification will already highlight the key differences between the adaptive and non-adaptive cases. We provide an extension of these results to the case of AGOs with memory in Appendix C, and provide numerical verification of our results in Appendix D.
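To make Definition 3.1 concrete, here is a minimal NumPy sketch (ours, not from the paper) of one memoryless adaptive update, taking $Q(g; \epsilon) = g / (|g| + \epsilon)$, i.e., the $\beta_1 = \beta_2 = 0$ special case of Adam mentioned above; the function name and the scale-invariance check are illustrative only.

```python
import numpy as np

def memoryless_adaptive_update(g, eps):
    """Memoryless adaptive update Q(g; eps) = g / (|g| + eps).

    This is the beta1 = beta2 = 0 special case of Adam: with no moving
    averages, m = g and sqrt(v) = |g|, so the update direction depends
    only on the current gradient g and on eps.
    """
    return g / (np.abs(g) + eps)

# Scale invariance from Definition 3.1, checked for a positive rescaling c:
# Q(c*g; c*eps) == Q(g; eps).
rng = np.random.default_rng(0)
g = rng.standard_normal(5)
eps, c = 1e-3, 7.5
assert np.allclose(memoryless_adaptive_update(c * g, c * eps),
                   memoryless_adaptive_update(g, eps))
```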
MLPs and ABC(D) Parameterization: We use a standard scalar-output MLP $f$ with $L$ hidden layers as a working example to illustrate the adaptive kernel and feature learning limits. Given an input sample $\xi \in \mathbb{R}^{d_{in}}$, weight matrices $W^{L+1} \in \mathbb{R}^{1 \times n}$, $\{W^l\}_{l=2}^{L} \subset \mathbb{R}^{n \times n}$, $W^1 \in \mathbb{R}^{n \times d_{in}}$, and an activation function $\phi$ which we assume has a pseudo-Lipschitz first derivative, the output $f(\xi) \in \mathbb{R}$ is given by:

$$f(\xi) = W^{L+1} x^L(\xi), \qquad x^l(\xi) = \phi(h^l(\xi)), \qquad h^l(\xi) = W^l x^{l-1}(\xi) \quad \text{for } 1 \le l \le L, \qquad x^0(\xi) = \xi \qquad (1)$$

We adopt the abc-parameterization convention from Yang & Hu (2020). Namely, for any layer $l$, each weight matrix is parameterized as $W^l = n^{-a_l} w^l$, where $w^l$ are the learnable weights, which are initially sampled iid from a normal distribution $\mathcal{N}(0, n^{-2b_l})$. Finally, the learning rate is parameterized as $\eta n^{-c_l}$, where we plug in $\eta = 1$ for simplicity. In this paper, we assign specific values to $\{a_l\}_l$, $\{b_l\}_l$, $\{c_l\}_l$ for the ANTK and µ parameterizations. Additionally, we parameterize the $\epsilon$ parameter in the AGO as $\epsilon_l = n^{-d_l} \epsilon$, where $\epsilon > 0$. The per-layer scaling of $\epsilon_l$ will turn out to be crucial to prevent the adaptive gradients from collapsing to either 0 or a step function as $n \to \infty$. We summarize the two parameterizations in the following table:

| Parameterization | $a_l$ | $b_l$ | $c_l$ | $d_l$ |
|---|---|---|---|---|
| ANTK | $\tfrac{1}{2}$ if $l > 1$, $0$ else | $0$ | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else |
| µ | $-\tfrac{1}{2}$ if $l = 1$, $\tfrac{1}{2}$ if $l = L+1$, $0$ else | $\tfrac{1}{2}$ | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else | $1$ if $L+1 > l > 1$, $\tfrac{1}{2}$ else |

Table 1: ANTK and µ parameterizations.
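To ground Eq. (1) and Table 1, the sketch below builds the abc-parameterized forward pass, with effective weights $W^l = n^{-a_l} w^l$ and $w^l$ initialized entrywise from $\mathcal{N}(0, n^{-2b_l})$. The helper names, the tanh activation, and the $L = 3$ example are our own illustrative choices; the example exponents are read off the µ row of Table 1 as reconstructed above.

```python
import numpy as np

def init_mlp(d_in, n, L, b, rng):
    """Sample trainable weights w^l with iid N(0, n^{-2 b_l}) entries."""
    shapes = {1: (n, d_in), **{l: (n, n) for l in range(2, L + 1)}, L + 1: (1, n)}
    return {l: rng.standard_normal(shapes[l]) * n ** (-b[l]) for l in shapes}

def forward(w, xi, n, L, a, phi=np.tanh):
    """Forward pass of Eq. (1), using the effective weights W^l = n^{-a_l} w^l."""
    x = xi
    for l in range(1, L + 1):
        h = (n ** (-a[l]) * w[l]) @ x          # h^l = W^l x^{l-1}
        x = phi(h)                             # x^l = phi(h^l)
    return (n ** (-a[L + 1]) * w[L + 1] @ x).item()  # f = W^{L+1} x^L

# Example with the mu exponents from Table 1 for L = 3 hidden layers.
L, n, d_in = 3, 1024, 10
a = {1: -0.5, 2: 0.0, 3: 0.0, 4: 0.5}
b = {l: 0.5 for l in range(1, L + 2)}
rng = np.random.default_rng(0)
w = init_mlp(d_in, n, L, b, rng)
print(forward(w, rng.standard_normal(d_in), n, L, a))
```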
Representing (Pre)Activation Vectors via Random Variables: As we will see, as the width becomes large, the entries of the activation and preactivation vectors become roughly iid (just like in the SGD case), both at initialization (which is easy to see) and during training (which is harder to see). Hence a vector's behavior can be tracked via a random variable that reflects the distribution of its entries. Concretely, if $x \in \mathbb{R}^n$ is one such vector, then we write $Z^x$ for such a random variable, such that $x$'s entries look like iid samples from $Z^x$. When $x$ is scaled to have typical entry size independent of $n$ (i.e., $\|x\|^2/n = \Theta(1)$ as $n \to \infty$), then $Z^x$ can be taken to be a random variable independent of $n$ as well. In general, given two such vectors $x, y \in \mathbb{R}^n$, their random variables $Z^x$ and $Z^y$ will be correlated, in such a way that $\lim_{n\to\infty} \frac{x^\top y}{n} = \mathbb{E}\, Z^x Z^y$. Generally, inference with networks at initialization entails computing expectations involving Gaussian $Z$ variables, which take a relatively simple form. However, a fundamental question is how the $Z$ variables evolve during training, which we address next.

4 ADAPTIVE OPTIMIZATION OF AN MLP

In the following section we illustrate the infinite-width limits of adaptive optimization for simple MLPs. For each parameterization, we begin by laying out the basic intuition, culminating in Theorem 4.1 and Theorem 4.2. For a cleaner presentation, we assume the first and last layers are fixed; however, our results are easily extended to the general case. In our setup we assume the network is trained using an AGO according to Definition 3.1, with a batch size of 1.

Notations: Slightly abusing notation, we use subscripts to denote both the step index $t$ and the coordinates of vectors with $\alpha, \beta$. We assume $\xi_t$ is the training sample fed to the neural network at step $t$ (starting from $\xi_0$), and we use $y_t(\xi)$ for any input-dependent vector/scalar $y$ to denote its evaluation given $\xi$ at step $t$. To reduce clutter, we drop the explicit dependency on the input when it is implied by the step index (i.e., $y_t = y_t(\xi_t)$ and $y = y_0(\xi_0)$). We use $\bar{y}_t = y_t(\bar{\xi})$ to express the dependency of $y$ on an arbitrary input $\bar{\xi}$ at step $t$. We also denote $\Delta y_t(\xi) = y_{t+1}(\xi) - y_t(\xi)$ and $\delta y_t(\xi) = \sqrt{n}\, \Delta y_t(\xi)$.

We assume the network is trained using a generic loss function $\mathcal{L}$, with loss derivative $\mathcal{L}'_t = \partial_{f_t} \mathcal{L}_t$. We use the notation $dh^l$ based on context: for the ANTK parameterization, we set $dh^l(\xi) \stackrel{\text{def}}{=} \sqrt{n}\, \frac{\partial f}{\partial h^l}(\xi) \in \mathbb{R}^n$, whereas for the µ parameterization, we set $dh^l(\xi) \stackrel{\text{def}}{=} n\, \frac{\partial f}{\partial h^l}(\xi) \in \mathbb{R}^n$. This context-dependent notation is convenient since it ensures that the components of $dh^l(\xi)$ are roughly of order $\Theta(1)$ for both parameterizations. Finally, we use $\mathring{\ }$ to denote the infinite-width limit of a (possibly random) scalar (i.e., $\lim_{n\to\infty} f_t = \mathring{f}_t$). Using the above notation, we can express the gradient of any intermediate layer $w^l$ at step $t$ for both parameterizations as $\frac{1}{n}\, \mathcal{L}'_t\, dh^l_t\, x^{l-1\top}_t$. Using Definition 3.1, the adaptive weight update $\Delta w^l_t$ for both parameterizations is obtained by applying $Q$ entrywise to this gradient with the scaled parameter $\epsilon_l$ and the learning rate $n^{-c_l}$:

$$\Delta w^l_t \propto n^{-c_l}\, Q\!\left(\tfrac{1}{n}\, \mathcal{L}'_t\, dh^l_t\, x^{l-1\top}_t;\ \epsilon_l\right).$$
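Combining the gradient expression above with the memoryless update of Section 3, the following sketch (ours; the variable names and the concrete choice $Q(g; \epsilon_l) = g/(|g| + \epsilon_l)$ are illustrative) performs one adaptive step on a hidden layer's trainable weights, using $\epsilon_l = n^{-d_l}\epsilon$ and learning rate $n^{-c_l}$ as in Table 1.

```python
import numpy as np

def adaptive_hidden_layer_step(w_l, dh_l, x_prev, loss_deriv, n, c_l, d_l,
                               eta=1.0, eps=1e-8):
    """One memoryless adaptive step on the trainable hidden-layer weights w^l.

    The gradient of w^l (both parameterizations) is (1/n) * L'_t * dh^l_t x^{l-1,T}_t;
    Q(g; eps_l) = g / (|g| + eps_l) is applied entrywise with eps_l = n^{-d_l} * eps,
    and the result is scaled by the learning rate eta * n^{-c_l}.
    """
    grad = (loss_deriv / n) * np.outer(dh_l, x_prev)   # (n, n) gradient of w^l
    eps_l = eps * n ** (-d_l)                          # layer-wise epsilon scaling
    step = grad / (np.abs(grad) + eps_l)               # entrywise memoryless Q
    return w_l - eta * n ** (-c_l) * step

# Usage with hypothetical backward quantities and the hidden-layer exponents c_l = d_l = 1.
n = 1024
rng = np.random.default_rng(0)
w_l = rng.standard_normal((n, n)) * n ** -0.5          # b_l = 1/2 initialization
w_l = adaptive_hidden_layer_step(w_l, rng.standard_normal(n), rng.standard_normal(n),
                                 loss_deriv=0.3, n=n, c_l=1.0, d_l=1.0)
```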