# Scalable Optimization in the Modular Norm

Tim Large (Columbia University), Yang Liu (Lawrence Livermore National Lab), Minyoung Huh (MIT CSAIL), Hyojin Bahng (MIT CSAIL), Phillip Isola (MIT CSAIL), Jeremy Bernstein (MIT CSAIL)

Equal contribution. Correspondence to {jbernstein,minhuh}@mit.edu.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

To improve performance in contemporary deep learning, one is interested in scaling up the neural network in terms of both the number and the size of the layers. When ramping up the width of a single layer, graceful scaling of training has been linked to the need to normalize the weights and their updates in the natural norm particular to that layer. In this paper, we significantly generalize this idea by defining the modular norm, which is the natural norm on the full weight space of any neural network architecture. The modular norm is defined recursively in tandem with the network architecture itself. We show that the modular norm has several promising applications. On the practical side, the modular norm can be used to normalize the updates of any base optimizer so that the learning rate becomes transferable across width and depth. This means that the user does not need to compute optimizer-specific scale factors in order to scale training. On the theoretical side, we show that for any neural network built from well-behaved atomic modules, the gradient of the network is Lipschitz-continuous in the modular norm, with the Lipschitz constant admitting a simple recursive formula. This characterization opens the door to porting standard ideas in optimization theory over to deep learning. We have created a Python package called Modula that automatically normalizes weight updates in the modular norm of the architecture. The package is available via pip install modula with source code here.

1 Introduction

Given the practical impact of deep learning systems trained at the largest scale, there is a need for training algorithms that scale gracefully: without instability and, if possible, without manual tuning. However, current best practices for training have developed somewhat organically and do not live on a bedrock of sound numerical analysis. For example, while the Adam optimizer [1] is ubiquitous in the field, errors have been found in its proof of convergence [2], and empirically Adam has been found to scale poorly as either the width [3] or the depth [4] of the network is ramped up.

To remedy this situation, a patchwork of learning rate correction factors has recently been proposed [3–6]. The general idea is to retrofit a base optimizer such as Adam or SGD with special correction factors intended to render the optimizer's optimal learning rate invariant to scale. But this situation is not ideal: the correction factors are reportedly difficult to use. Lingle [7] suggests that this may be due to their "higher implementation complexity, many variations, or complex theoretical background". What's more, the correction factors are optimizer-specific, meaning that if one switches to a different optimizer one must either look up or recalculate a separate set of correction factors.

The goal of this paper is to simplify matters. We show that both Adam and SGD can be made to scale gracefully with width and depth by simply normalizing their updates in a special norm associated with
the network architecture; see Figure 1. We call this norm the modular norm, and provide a Python package called Modula that constructs this norm automatically and in tandem with the architecture.

Figure 1: Learning rate transfer in the modular norm. We train GPT with context length 128 for 10k steps on OpenWebText. Left: Learning rate sweeps for normed Adam (Adam with updates normalized in the modular norm) with three transformer blocks and varying width. The optimal learning rate (marked by red dots) transfers well across scales. Mid-left: The same, but varying the number of blocks at width 128. Mid-right: Comparing normed versus unnormed Adam and SGD at fixed learning rate and varying width. For each method, we tune the learning rate at the scale marked by the dotted line. The normed methods scale better. Right: The same, but scaling the number of blocks.

The modular norm is constructed recursively, leveraging the module tree perspective on neural architectures. It is enough to define how the modular norm propagates through only two elementary operations: composition and concatenation. We show how other basic operations on modules, such as addition and scalar multiplication, can be implemented through composition and concatenation. And then higher-order structures, such as residual networks, can be built using these basic operations.

Beyond its practical relevance, the modular norm may also prove useful to theoreticians. Various optimization-theoretic quantities are accessible and efficiently calculable in the modular norm. For instance, we show that the gradient of any neural network built from well-behaved atomic modules is Lipschitz-continuous in the modular norm of the architecture. This opens the door to porting several more-or-less textbook optimization theory analyses [8] over to the world of deep learning.

1.1 Related work

Metrization. It is by now well-known that deep networks do not easily or naturally admit Lipschitz-continuity or smoothness guarantees in the Euclidean norm [9–13]. Researchers have attempted to address this problem: for instance, Bernstein et al. [12] propose a distance function called deep relative trust, which combines Frobenius norms across network layers. However, deep relative trust is only constructed for the multilayer perceptron and, when used to normalize updates, its employment of the Frobenius norm precludes good width scaling. In contrast, Yang et al. [14] equip individual layers with the RMS→RMS operator norm, finding this to enable good width scaling. Researchers have also looked at building neural net distance functions outside the context of scalability [15–17].

Asymptotics. The metrization-based approach to scaling developed in this paper contrasts with the tradition of asymptotic scaling analyses (the study of infinite width and depth limits) more common in the deep learning theory literature [3–5, 18, 19]. These asymptotic analyses follow an old observation of Neal [20] that interesting properties of the neural network function space are exactly calculable in the infinite width limit and at initialization. This tradition has continued with asymptotic studies of the neural tangent kernel [21] as well as infinite depth limits [4, 5, 22]. However, there is increasing recognition of the limits of these limits, with researchers now often trying to relax limiting results [23–25].
And ultimately, from a practitioner's perspective, these results can be difficult to make sense of [7]. In contrast, our framework eschews any kind of limiting or probabilistic analysis. As a consequence, we believe our framework is simpler, more easily relatable to basic mathematical concepts, and ultimately more relevant to what one may encounter in, say, a PyTorch [26] program.

Majorization. In recent work, Streeter and Dillon [27] propose a universal majorize-minimize algorithm [28]: a method that automatically computes and minimizes a majorizer for any computational graph. Despite its generality, current downsides to the method include its overhead, which can be 2× per step [29], as well as the risk that use of a full majorization may be overly pessimistic. Indeed, Cho and Shin [30] find that an optimization approach leveraging second-order information converges significantly faster than a majorization-inspired approach. Related ideas appear in [31, 32].

2 Descent in Normed Spaces

We define the modular norm in §3. This section is intended to prime the reader for what is to come. In this section, and the rest of the document, the diamond operator ⋄ denotes tensor contraction.

2.1 What's in a norm?

Suppose that we wish to use gradient descent to minimize a loss function L : W → ℝ over a weight space W = ℝ^N. What properties of the loss L and weight space W would we desire for this to be sensible? Three such properties are:

(i) the loss function is differentiable, meaning that the gradient map ∇_w L : W → W exists;
(ii) the weight space W carries a norm ‖·‖ : W → ℝ, which need not be the Euclidean norm;
(iii) the loss is Lipschitz smooth in the norm ‖·‖, with sharpness constant λ > 0, meaning that:

L(w + Δw) ≤ L(w) + ∇_w L(w) ⋄ Δw + (λ/2) ‖Δw‖².    (2.1)

Under these conditions, the weight update given by Δw = argmin_{Δw} [∇_w L(w) ⋄ Δw + (λ/2) ‖Δw‖²] is guaranteed to reduce the loss. The particular norm ‖·‖ influences the direction of this weight update, while the sharpness constant λ influences the size of the update. In deep learning, we would ideally like the optimal step-size to remain invariant as we scale, say, the width and the depth of the network. Thus, a fundamental problem is to design a norm ‖·‖ such that, first, Inequality (2.1) actually holds (and is not hopelessly lax), and second, the corresponding sharpness constant λ is invariant to the relevant architectural dimensions. If the norm is chosen poorly, the practitioner may end up having to re-tune the step size as the network is scaled up. In this paper, we design a norm for neural networks that meets these requirements: the modular norm.

2.2 Preview of the modular norm

The weight space of a deep neural network is a Cartesian product W = W₁ × ⋯ × W_L, where W_k is the weight space at layer k. Yang et al. [14] consider the problem of metrizing individual layers. For instance, if layer k is a linear layer with weight space W_k = ℝ^{d_out × d_in}, then they equip this layer with the RMS→RMS operator norm ‖·‖_{RMS→RMS}. This is the matrix norm induced by equipping the input and output space of the layer with the root-mean-square (RMS) vector norm, ‖x‖²_RMS := (1/d) Σ_i x_i² for x ∈ ℝ^d. The advantage of this non-standard matrix norm is that it allows one to estimate the amount of feature change induced by a gradient update. In other words, the inequality

‖ΔW x‖_RMS ≤ ‖ΔW‖_{RMS→RMS} ‖x‖_RMS    (2.2)

turns out to hold quite tightly when ΔW is a gradient update and x is a corresponding layer input. This is because gradient updates to a layer are (sums of) outer products that align with layer inputs.
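The tightness claim for Inequality (2.2) is easy to check numerically. Below is a small NumPy sketch (ours, not from the paper's code; the dimensions are arbitrary) that builds a rank-one, gradient-like update aligned with a layer input and compares both sides of the inequality. The RMS→RMS operator norm of a d_out × d_in matrix equals √(d_in/d_out) times its spectral norm.

```python
import numpy as np

def rms(v):
    # root-mean-square norm: ||v||_RMS = ||v||_2 / sqrt(dim)
    return np.linalg.norm(v) / np.sqrt(v.size)

def rms_to_rms_norm(W):
    # induced RMS->RMS operator norm: sqrt(d_in / d_out) times the spectral norm
    d_out, d_in = W.shape
    return np.sqrt(d_in / d_out) * np.linalg.norm(W, 2)

rng = np.random.default_rng(0)
d_in, d_out = 1024, 512
x = rng.standard_normal(d_in)        # a layer input
g = rng.standard_normal(d_out)       # a backpropagated "error" vector
delta_W = np.outer(g, x)             # gradient-like rank-one update, aligned with x

lhs = rms(delta_W @ x)               # feature change induced by the update
rhs = rms_to_rms_norm(delta_W) * rms(x)
print(f"||dW x||_RMS = {lhs:.3f}, bound = {rhs:.3f}, ratio = {lhs / rhs:.3f}")
# For an outer product aligned with x the ratio is 1.0: the bound is exactly tight.
```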
Once we know how to metrize individual layers, a natural question is: can we combine layer-wise norms to produce a norm on the full weight space W = ∏_k W_k of the network? Naïvely, there are many ways to do this: one could take any positive linear combination of the layer-wise norms (L1 combination), the square root of any combination of the squared layer-wise norms (L2 combination), and so on. But we want the norm to be useful by the criteria of §2.1. To this end, we propose the modular norm ‖·‖_W, which ends up as a max (L∞ combination) of scaled layer-wise norms ‖·‖_{W_k}:

‖(w₁, …, w_L)‖_W := max(s₁ ‖w₁‖_{W₁}, …, s_L ‖w_L‖_{W_L}).    (2.3)

The positive scalar constants s₁, …, s_L are determined by both the architecture of the network and a set of user-specified mass parameters. The precise construction of the modular norm, working recursively over the module tree of the network, is given in §3; there, we also explain how the modular norm satisfies the criteria of §2.1, and the role played by the mass parameters. For now, let us explain what good the modular norm yields in practice.

Figure 2: Modules and trees of modules. A module is an object that maps an input and a weight vector to an output. Left: In addition to the standard forward function, our modules are endowed with two numbers (a mass and a sensitivity) and a norm. Middle: New compound modules are built via the binary operations of composition and concatenation. We provide rules for composing and concatenating all module attributes. Right: Compound modules are binary trees, where the leaves are modules and the internal nodes compose and concatenate their children. Here we illustrate a sum of modules, which leverages a special utility module Add; see Table 1 for more on this.

2.3 Normed optimization

The main practical use of the modular norm is to normalize weight updates. With reference to Equation (2.3), we define the following operation on weight updates Δw = (Δw₁, …, Δw_L) ∈ W:

normalize(Δw) := ( Δw₁ / (s₁ ‖Δw₁‖_{W₁}), …, Δw_L / (s_L ‖Δw_L‖_{W_L}) ).

Provided none of the Δw_k are zero, normalize(Δw) is a unit vector in the modular norm. We propose using normalize as a wrapper, along with an explicit learning rate schedule, for any base optimizer such as Adam or SGD. The resulting normed optimizer is thus made architecture-aware via the normalize function. In pseudo-code, and in actual Modula code, this amounts to:

```python
delta_w = optim(w.grad())      # get update from base optimizer
net.normalize(delta_w)         # normalize update in the modular norm
w -= eta(step) * delta_w       # apply update with learning rate eta
```

We find this wrapper to significantly improve the scalability of the base optimizer. It renders the optimal learning rate roughly invariant to width and depth, with seemingly no cost to accuracy. In some instances, it enables training with a simpler optimizer (for example, training GPT with SGD rather than Adam), thus incurring a smaller memory footprint.

Normalization in the modular norm essentially forces individual layers to learn at specified, regulated rates. We view this as balancing learning across the network; no individual layer can learn too fast and destabilize training. This balance is determined by the architecture, along with user-specified mass parameters that provide precise control over the relative learning speed in different submodules. For a variety of experiments with normed optimization, see §4 and Appendix D.
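To make the normalize operation above concrete, here is a minimal NumPy sketch (ours, not the Modula implementation): each per-layer update is divided by its scaled layer norm, so the result has unit modular norm. The layer norms and scale factors s_k are passed in as plain callables and floats.

```python
import numpy as np

def modular_norm(deltas, layer_norms, scales):
    # ||(dw_1, ..., dw_L)||_W = max_k s_k * ||dw_k||_{W_k}
    return max(s * norm(d) for d, norm, s in zip(deltas, layer_norms, scales))

def normalize(deltas, layer_norms, scales, eps=1e-12):
    # divide each layer's update by its scaled layer norm
    return [d / (s * norm(d) + eps) for d, norm, s in zip(deltas, layer_norms, scales)]

# toy example: two "layers", both measured in the spectral norm
spectral = lambda W: np.linalg.norm(W, 2)
rng = np.random.default_rng(0)
deltas = [rng.standard_normal((64, 32)), rng.standard_normal((10, 64))]
norms, scales = [spectral, spectral], [1.0, 2.0]

unit = normalize(deltas, norms, scales)
print(modular_norm(unit, norms, scales))  # ~1.0: a unit vector in the modular norm
```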
But first, we detail the construction of the modular norm along with its core properties.

3 Constructing the Modular Norm

Our strategy is to first define the abstract notion of a module, which includes a norm as an attribute. We depict this concept in Figure 2. Then, by providing rules for composing and concatenating modules, we recursively define a norm for any module built via an arbitrary sequence of compositions and concatenations: the modular norm!

3.1 Modules

A module is a re-usable, composable object useful for building complicated neural networks. Our definition of a module augments the PyTorch module [26] with two real numbers and a norm:

Definition 1 (Module). Given input vector space X, output vector space Y and weight vector space W, a module M is an object with the following four attributes:

(a) a function, M.forward : W × X → Y, which maps an input and a weight vector to an output; we often abbreviate this attribute to just M, writing M ≡ M.forward;
(b) a number, M.mass ≥ 0, which will turn out to set the proportion of feature learning that this module contributes to any supermodule;
(c) a number, M.sensitivity ≥ 0, which estimates the module's sensitivity to input perturbations;
(d) a norm over the weight space, M.norm : W → ℝ≥0, sometimes abbreviated to just ‖·‖_M.

Before we say more about the intended roles of these attributes, let us mention the three kinds of modules that we will care about in practice:

(i) atomic modules, whose attributes are hand-declared, and which have weights. Examples include linear modules, embedding modules, and convolution modules.
(ii) bond modules, whose attributes are hand-declared, but which have no weights. Formally, their weight space is the zero vector space W = 0. An example is the ReLU non-linearity module.
(iii) compound modules, built out of other modules, with automatically inferred attributes.

Note that the space of objects that type-check as a module by Definition 1 is vast. Since we need to hand-declare atomic and bond modules in order to build interesting compound modules, we should have an idea of what makes for a good module. Simply put, a module is good when its attributes are predictive of its behaviour. To formalize this idea, we say that a module is well-normed if its forward function, sensitivity, and norm satisfy the following two relationships:

Definition 2 (Well-normed). Let M be a module on (X, Y, W), where the input and output spaces have respective norms ‖·‖_X and ‖·‖_Y. M is well-normed if, for all inputs x ∈ X and weights w ∈ W:

‖∇_w M.forward(w, x) ⋄ Δw‖_Y ≤ M.norm(Δw)   for all Δw ∈ W;    (3.1)
‖∇_x M.forward(w, x) ⋄ Δx‖_Y ≤ M.sensitivity · ‖Δx‖_X   for all Δx ∈ X.    (3.2)

Well-normed-ness means that the norm function and sensitivity are a good match for the forward function. The first inequality says that a well-normed module is Lipschitz-continuous over its weight space with constant one, when weight perturbations are measured in M.norm. The second inequality says that a well-normed module is Lipschitz-continuous over its input space with constant M.sensitivity. In practice, we will be interested in well-normed modules where these inequalities hold fairly tightly, since then M.sensitivity and M.norm will let us estimate the sensitivity of the module to input and weight perturbations. Appendix B provides many examples of well-normed atomic and bond modules.

The remaining attribute M.mass will turn out to control the proportion of feature learning that a module contributes to any compound module in which it participates. We formalize this concept in §3.3. But before that, we need to understand how to build compound modules.
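As a minimal illustration of Definition 1, here is a self-contained Python sketch (a toy of ours, not the Modula Module class from Appendix A): a module simply bundles a forward function with a mass, a sensitivity, and a weight-space norm. A Linear-style atom and a ReLU-style bond are given as examples.

```python
from dataclasses import dataclass
from typing import Any, Callable

import numpy as np

@dataclass
class ToyModule:
    # Definition 1: a forward map (weights, input) -> output, a mass,
    # a sensitivity, and a norm on the weight space.
    forward: Callable[[Any, Any], Any]
    mass: float
    sensitivity: float
    norm: Callable[[Any], float]

# an atom in the spirit of Linear: weights are a matrix, the norm is the spectral norm
# (the dimensional scale factor of Table 2 is omitted in this toy)
linear = ToyModule(
    forward=lambda W, x: W @ x,
    mass=1.0,
    sensitivity=1.0,
    norm=lambda W: float(np.linalg.norm(W, 2)),
)

# a bond in the spirit of ReLU: no weights, so the weight space is the zero vector space
# (the paper hand-declares ReLU.sensitivity = 1/sqrt(2); see Appendix B.2)
relu = ToyModule(
    forward=lambda w, x: np.maximum(x, 0.0),
    mass=0.0,
    sensitivity=2 ** -0.5,
    norm=lambda w: 0.0,
)
```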
3.2 Compound modules: Building new modules from old

We consider building new modules from old ones via the binary operations of composition and concatenation, illustrated in Figure 2. Composition is denoted via the serial combination M₂ ∘ M₁, and concatenation via the parallel combination (M₁, M₂), alternatively referred to as a module tuple. These simple binary combinations will let us build basic algebraic operations on modules (Table 1) as well as complex neural network architectures. We start by defining module composition:

Definition 3 (Module composition). Consider module M₁ with input, output and weight space (X₁, Y₁, W₁) and module M₂ with input, output and weight space (X₂, Y₂, W₂). M₁ and M₂ are composable if X₂ = Y₁. Their composite M = M₂ ∘ M₁ lives on (X₁, Y₂, W₁ × W₂) with attributes:

(a) M.forward((w₁, w₂), x) = M₂.forward(w₂, M₁.forward(w₁, x));
(b) M.mass = M₁.mass + M₂.mass;
(c) M.sensitivity = M₁.sensitivity · M₂.sensitivity;
(d) M.norm((w₁, w₂)) given by:

max( M₂.sensitivity · (M.mass / M₁.mass) · M₁.norm(w₁), (M.mass / M₂.mass) · M₂.norm(w₂) ),

where if M₁.mass or M₂.mass is zero, the corresponding term in the max is set to zero.

| Operation | Shorthand | Definition | Modula expression |
|---|---|---|---|
| module addition | M₁ + M₂ | Add ∘ (M₁, M₂) | M_1 + M_2 |
| scalar multiplication | a · M | Mul_a ∘ M | a * M |
| iterated composition | M^L | M ∘ M^{L−1} with M⁰ := Identity | M ** L |

Table 1: Arithmetic with modules. Composition and concatenation let us define an extended arithmetic on modules. The utility modules Add, Mul_a and Identity are defined in Appendix B.2.

At this stage, we make two comments about this definition. First, in the definition of the composite norm, notice that the norm of the first module couples with the sensitivity of the second module. This reflects the fact that the output of the first module is fed into the second module, and not vice versa. Second, observe that the masses of the submodules are involved in setting the balance of the composite norm. Before we further motivate this definition, let us first define module concatenation:

Definition 4 (Module concatenation). Consider module M₁ with input, output and weight space (X₁, Y₁, W₁) and module M₂ with input, output and weight space (X₂, Y₂, W₂). We say that M₁ and M₂ are concatenatable if their input spaces match: X₁ = X₂. The tuple M = (M₁, M₂) has input, output and weight space (X₁, Y₁ × Y₂, W₁ × W₂) and attributes:

(a) M.forward((w₁, w₂), x) = (M₁.forward(w₁, x), M₂.forward(w₂, x));
(b) M.mass = M₁.mass + M₂.mass;
(c) M.sensitivity = M₁.sensitivity + M₂.sensitivity;
(d) M.norm((w₁, w₂)) given by:

max( (M.mass / M₁.mass) · M₁.norm(w₁), (M.mass / M₂.mass) · M₂.norm(w₂) ),

where if M₁.mass or M₂.mass is zero, the corresponding term in the max is set to zero.

Concatenation is simpler than composition in the sense that neither module is fed through the other, and therefore sensitivity does not appear in the concatenated norm.
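The attribute rules of Definitions 3 and 4 translate directly into code. The sketch below (ours, not the Modula source) works on any objects exposing the four attributes of Definition 1, such as the toy modules above; the weights of a compound are stored as a pair (w₁, w₂).

```python
from types import SimpleNamespace

def compose(M2, M1):
    """M = M2 o M1 (Definition 3): M1 feeds its output into M2."""
    mass = M1.mass + M2.mass
    def norm(w):
        w1, w2 = w
        t1 = M2.sensitivity * (mass / M1.mass) * M1.norm(w1) if M1.mass > 0 else 0.0
        t2 = (mass / M2.mass) * M2.norm(w2) if M2.mass > 0 else 0.0
        return max(t1, t2)
    return SimpleNamespace(
        forward=lambda w, x: M2.forward(w[1], M1.forward(w[0], x)),
        mass=mass,
        sensitivity=M1.sensitivity * M2.sensitivity,
        norm=norm,
    )

def concatenate(M1, M2):
    """M = (M1, M2) (Definition 4): both modules read the same input."""
    mass = M1.mass + M2.mass
    def norm(w):
        w1, w2 = w
        t1 = (mass / M1.mass) * M1.norm(w1) if M1.mass > 0 else 0.0
        t2 = (mass / M2.mass) * M2.norm(w2) if M2.mass > 0 else 0.0
        return max(t1, t2)
    return SimpleNamespace(
        forward=lambda w, x: (M1.forward(w[0], x), M2.forward(w[1], x)),
        mass=mass,
        sensitivity=M1.sensitivity + M2.sensitivity,
        norm=norm,
    )
```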
To further motivate these definitions, observe that two basic and desirable properties follow as immediate consequences:

Proposition 1 (Composition and concatenation are associative). If modules M₁, M₂, M₃ are successively composable, then M₃ ∘ (M₂ ∘ M₁) equals (M₃ ∘ M₂) ∘ M₁ in all attributes. If modules M₁, M₂, M₃ are mutually concatenatable, then ((M₁, M₂), M₃) equals (M₁, (M₂, M₃)) in all attributes.

Proposition 2 (Composition and concatenation preserve well-normedness). If modules M₁ and M₂ are well-normed and composable, then their composite M₂ ∘ M₁ is also well-normed. If modules M₁ and M₂ are well-normed and concatenatable, then their tuple (M₁, M₂) is also well-normed with respect to the L1 combination norm on the output space: ‖(y₁, y₂)‖_{Y₁×Y₂} = ‖y₁‖_{Y₁} + ‖y₂‖_{Y₂}.

The proofs follow directly from the definitions and the chain rule. Proposition 1 implies that one may build complicated compound modules without worrying in which order successive combinations are taken. Proposition 2 implies that complicated compounds automatically inherit Lipschitz guarantees. Taken together, Definitions 3 and 4 define the modular norm M.norm of any compound module M.

Figure 3: Exploring mass allocation. We tune the total mass of the hidden layers, training with normed Adam. Left group: Learning rate sweeps for ResMLP on CIFAR-10, for varying depth and mass. The bottom-right subplot reports the best train loss at each mass and depth. Mass 0.5 was best at all depths. Right group: Learning rate sweeps for GPT on OpenWebText, for varying mass. Both optimal mass and learning rate transferred from the small model (top: width 128, 3 blocks) to the large model (bottom: width 512, 6 blocks).

3.3 Mass allocation in compound modules

Suppose we wish to train a network with an input layer, an output layer, and L blocks in between:

Network = OutputLayer ∘ HiddenLayers ∘ InputLayer    (3.3)
        = OutputLayer ∘ Block^L ∘ InputLayer.    (3.4)

Then how much learning should happen in the output layer, compared to the blocks, compared to the input layer? And what if we scale the number of blocks L: do we want relatively less learning to occur in the network's extremities? Or do we want the input and output layers to learn non-trivially even in the L → ∞ limit? Since answering these questions is difficult a priori, we introduced the mass parameter to allow a user to set the proportional contribution each module has toward learning:

Proposition 3 (Feature learning is apportioned by mass). Consider a compound module M derived in any fashion from L well-normed modules M₁, …, M_L. Given weight setting w = (w₁, …, w_L), where w_k denotes the weights of module M_k, let us perturb w by Δw = (Δw₁, …, Δw_L). If we decompose the linearized change in the output of module M into one contribution per sub-module:

∇_w M(w, x) ⋄ Δw = ∇_{w₁} M(w, x) ⋄ Δw₁ + ⋯ + ∇_{w_L} M(w, x) ⋄ Δw_L,    (3.5)

then the kth term in this decomposition satisfies:

‖∇_{w_k} M(w, x) ⋄ Δw_k‖_Y ≤ (M_k.mass / M.mass) · M.norm(Δw).    (3.6)

In words: module mass provides the flexibility needed to build complicated compound modules involving many sub-modules, while maintaining precise control over how much learning any sub-module can contribute to the overall compound. Proposition 3 is proved in Appendix E.

In practice, we obtained the best training performance by maintaining a constant amount of learning in the input and output layers even as the number of blocks is scaled (Figure 6). In other words, it seems to be a good idea to assign OutputLayer.mass : HiddenLayers.mass : InputLayer.mass in proportion 1 : m : 1, where m is independent of the number of blocks L. The exact mass m of the hidden layers needs to be tuned on a new architecture, just as one needs to tune separate learning rates in the input and output layers in µP [18]; this tuning can be done on a small model prior to scaling (Figure 3). We further discuss mass allocation in Appendix D.6.
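To see what the 1 : m : 1 allocation buys, note that when every downstream module has unit sensitivity, the recursive norm rules of §3.2 collapse so that each part's norm enters the modular norm scaled by (total mass)/(part mass). The tiny runnable toy below (ours) prints these scale factors: a unit-modular-norm update can move each part by at most its mass fraction in that part's own norm, which is exactly the apportionment of Proposition 3.

```python
# Mass ratio 1 : m : 1 between input layer, hidden blocks and output layer.
# Assumes all downstream sensitivities are 1, so the composite norm scales
# each part's norm by total_mass / part_mass.
m = 0.5
masses = {"input": 1.0, "hidden": m, "output": 1.0}
total = sum(masses.values())

for part, mass in masses.items():
    scale = total / mass      # factor multiplying this part's norm inside the modular norm
    max_move = 1.0 / scale    # largest ||dw_part|| compatible with unit modular norm
    print(f"{part:>6}: scale {scale:.2f}, max per-step movement {max_move:.3f} "
          f"(= mass fraction {mass / total:.3f})")
```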
3.4 Smoothness in the modular norm

In this section, we study the second derivatives of a module, using the modular norm as a measuring stick. Let us start by defining the notion of sharpness that we will consider:

Definition 5 (Module sharpness). Let M be a module on (X, Y, W), where the input and output spaces have respective norms ‖·‖_X and ‖·‖_Y. We say that M is (α, β, γ)-sharp for constants α, β, γ ≥ 0 if, at all inputs x ∈ X and weights w ∈ W, the second derivatives of M are bounded as:

‖Δw ⋄ ∇²_{ww} M(w, x) ⋄ Δw̃‖_Y ≤ α · ‖Δw‖_M · ‖Δw̃‖_M   for all Δw, Δw̃ ∈ W;    (3.7)
‖Δw ⋄ ∇²_{wx} M(w, x) ⋄ Δx‖_Y ≤ β · ‖Δw‖_M · ‖Δx‖_X   for all Δw ∈ W and Δx ∈ X;    (3.8)
‖Δx ⋄ ∇²_{xx} M(w, x) ⋄ Δx̃‖_Y ≤ γ · ‖Δx‖_X · ‖Δx̃‖_X   for all Δx, Δx̃ ∈ X.    (3.9)

While one may ultimately be interested in the sharpness of a module with respect to weight perturbations, Definition 5 also tracks sharpness with respect to input perturbations. In fact, tracking this extra information is essential for propagating sharpness bounds up the module tree. Appendix C details the procedure for automatically calculating the sharpness constants of a compound module starting from the sharpness constants of all its submodules; see Propositions 8 and 9 for the specific formulae. Here we highlight one major corollary of these formulae, proved in Appendix E: for a specific choice of block multipliers, the sharpness constant of a residual network is independent of depth:

Proposition 4. Suppose M is a well-normed, (α, β, γ)-sharp module on (X, X, W) with unit sensitivity. Define the depth-L residual module Res_L(M) via the module arithmetic of Table 1 as:

Res_L(M) := ( (L−1)/L · Identity + (1/L) · M )^L.    (3.10)

Then this residual module Res_L(M) is in fact (α + β + γ/2, γ)-sharp, independent of the depth L.

For optimization purposes, one may be more interested in the sharpness of the loss function rather than the sharpness of the neural network. Fortunately, it is possible to convert sharpness bounds on modules into sharpness bounds on loss functions, provided a little is known about the error measure:

Proposition 5 (Loss functions are smooth in the modular norm). Let M be a module on (X, Y, W) and let ℓ : Y × T → ℝ measure the error between a module output and a target in target space T. The loss L : W → ℝ records the module's average error on data distribution D over X × T:

L(w) := E_{(x,t)∼D} ℓ(M(w, x), t).    (3.11)

Suppose that the error measure ℓ is σ-Lipschitz and τ-smooth in the module output, in the sense that:

|∇_y ℓ(y, t) ⋄ Δy| ≤ σ · ‖Δy‖_Y   for all Δy ∈ Y and t ∈ T;    (3.12)
|Δy ⋄ ∇²_{yy} ℓ(y, t) ⋄ Δỹ| ≤ τ · ‖Δy‖_Y · ‖Δỹ‖_Y   for all Δy, Δỹ ∈ Y and t ∈ T.    (3.13)

If the module M is well-normed and (α, β, γ)-sharp, then the loss function L satisfies the following three inequalities at all weight settings w ∈ W and for all weight perturbations Δw, Δw̃ ∈ W:

(i) |Δw ⋄ ∇²_{ww} L(w) ⋄ Δw̃| ≤ (σα + τ) · ‖Δw‖_M · ‖Δw̃‖_M;
(ii) ‖∇_w L(w + Δw) − ∇_w L(w)‖_M† ≤ (σα + τ) · ‖Δw‖_M, where ‖·‖_M† is the dual norm of ‖·‖_M;
(iii) |L(w + Δw) − [L(w) + ∇_w L(w) ⋄ Δw]| ≤ ½ (σα + τ) · ‖Δw‖²_M.

The proof is given in Appendix E, and we present estimates for σ and τ for common error measures in Appendix C.4. Notice that inequalities (i), (ii) and (iii) are the standard inequalities of smooth optimization [8], albeit expressed in the modular norm. In fact, (i) implies (ii), which implies (iii). In words, inequality (ii) says that the gradient of the loss is Lipschitz-continuous in the modular norm. The Lipschitz constant depends on the module only through the module's first sharpness coefficient α.
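For completeness, here is the standard one-line argument (ours, not reproduced from Appendix E) for why (ii) implies (iii): write the Taylor remainder as an integral of gradient differences along the segment from w to w + Δw, and bound each integrand with the dual-norm pairing.

```latex
\begin{aligned}
\bigl| L(w+\Delta w) - L(w) - \nabla_w L(w) \diamond \Delta w \bigr|
  &= \Bigl| \int_0^1 \bigl[ \nabla_w L(w + t\,\Delta w) - \nabla_w L(w) \bigr] \diamond \Delta w \,\mathrm{d}t \Bigr| \\
  &\le \int_0^1 \bigl\| \nabla_w L(w + t\,\Delta w) - \nabla_w L(w) \bigr\|_{\mathsf{M}}^{\dagger}\, \|\Delta w\|_{\mathsf{M}} \,\mathrm{d}t \\
  &\le \int_0^1 (\sigma\alpha + \tau)\, t\, \|\Delta w\|_{\mathsf{M}}^{2} \,\mathrm{d}t
   \;=\; \tfrac{1}{2}\,(\sigma\alpha + \tau)\, \|\Delta w\|_{\mathsf{M}}^{2}.
\end{aligned}
```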
4 Experiments

Our experiments aimed to test the scalability of training with normed versions of Adam and SGD: whether one can tune the learning rate on a small model and expect that learning rate to remain close to optimal on models of much larger width and depth. In addition to the learning rate, normed optimization in Modula requires a mass parameter to apportion feature learning between the input, output and hidden layers; we also tested the sensitivity of this parameter, whether it affects learning rate transfer, and to what extent the optimal mass itself transfers across width and depth.

Figure 4: Learning rate transfer on CIFAR-10. We tune the learning rate on a small model at the scale marked by the dotted line and test the performance on models of increasing width and depth at this fixed learning rate. We find that normed Adam and SGD scale better than their unnormed counterparts on both ResMLPs and ResNets. See Figure 1 for the same experiment on GPT.

All SGD experiments were done with momentum β = 0.9, and all Adam experiments used β₁ = 0.9 and β₂ = 0.99. No weight decay was used in any experiment. Every experiment was done with a linear decay learning rate schedule. As for initialization, we used orthogonal initialization for Linear and Conv2D modules, and Gaussian weights projected to a unit norm ball for our Embed module. This was to ensure all modules were well-normed at initialization. Precise versions of our architectures are described in Appendices B.5 and B.7. We compare with nanoGPT using standard initialization in Appendix D.4 to make sure our changes recover standard performance. We actually found that unnormed Adam transferred learning rate better with our GPT architecture than with nanoGPT.

We found that normed optimization, with both Adam and SGD as the base optimizer, allows for successful learning rate transfer across width and depth for GPT training on OpenWebText (Figure 1), as well as ResMLP and ResNet training on CIFAR-10 (Figure 4). We present expanded results in Appendix D.5, including results on test loss. We reproduce the standard finding that train and test loss are remarkably similar in large language model pretraining. As for mass allocation, Figure 3 shows that the optimal mass transfers with depth when training a ResMLP on CIFAR-10 with normed Adam, and also that both mass and learning rate transfer quite well from a smaller GPT on OpenWebText to a larger one. We detail more experiments on mass allocation in Appendix D.6.

5 Discussion: Limitations and Future Work

This paper was influenced by four main streams of work: first, the Tensor Programs series, starting at TP-IV [3, 4, 18]; second, the papers on universal majorize-minimize algorithms [27, 28]; third, work on deep network metrization [12, 14, 31]; and fourth, the open source deep learning ecosystem [26, 33, 34], including the PyTorch module tree and Karpathy's YouTube video on autograd [35]. We have distilled and synthesized key ideas from these sources, creating a framework that we believe to be simpler than Tensor Programs, computationally lighter than universal majorization-minimization, more general than prior work on metrization and more scalable than the PyTorch module tree. We have packaged these ideas into a (soon-to-be) open-source library called Modula.

Inevitably, Modula has limitations. We highlight some of them here, along with associated avenues for future work.

Loss of well-normed-ness.
We have emphasized well-normed-ness (Definition 2) as an important criterion in module design. We show in Appendix B.1 that, for example, the Linear module is well-normed when its weights lie within a spectral norm ball. In our experiments, we initialize all weights so that all modules are well-normed, but we do not enforce this property throughout training. Future work could explore regularization as a means to enforce well-normed-ness throughout training, with the hope of attaining better scalability or improved generalization.

Overhead of normalization. As discussed in Appendix A.3, we implement normalization for Linear and Conv2D modules using two steps of online power iteration. While online power iteration is an established and fast primitive in deep learning (in fact, coming from the GAN literature [36]), it does add a modest overhead to training time, as discussed in Appendix A.4. We think it may be possible to mitigate this overhead by constructing atomic modules with more exotic operator norms. For example, if one equips feature vectors with the L∞ norm rather than the RMS norm, then the induced L∞→L∞ matrix norm is cheaper to compute than the RMS→RMS operator norm. In fact, L∞→L∞ operator normalization has the convenient feature that it decouples over matrix rows, making it more local than spectral normalization and, dare we say, more biologically plausible.

Automatic step-size selection. Beyond scalability, recent work has explored the question of automatic learning rate selection [31, 37–39], with the Prodigy optimizer [37] serving as a popular example. We tested the Adam version of Prodigy and found that it performs well at small scales, essentially working by an implicit form of line search. However, Prodigy will always break at large enough widths, since it requires a lower bound (d₀) on Adam's initial learning rate; Yang et al. [3] showed that no such lower bound exists. We believe this issue could be fixed by rebuilding Prodigy on top of Modula. More broadly, we think that designing line search methods in a properly-normed space is a good idea.

Acknowledgements

We are grateful to Chris Mingard, Virgile Richard and Evan Kiely for useful discussions early in the project. Tongzhou Wang and Jyo Pari provided helpful feedback on the writing and figures. The work was supported by a Packard Fellowship and a Sloan Research Fellowship to PI, by the MIT-IBM Watson AI Lab, by ONR MURI grant N00014-22-1-2740 and the MIT Quest for Intelligence. TL was supported by a Simons Junior Fellowship. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. We are grateful to the anonymous reviewers for their helpful and constructive feedback on this manuscript. Sadly, we have not had time to integrate much of their feedback into this camera-ready version of the paper. We will integrate the feedback into the arXiv version of the paper.

Contribution Statement

All authors were involved in project conception and discussions, which were initiated by JB. TL and JB developed the theory. MH and YL made core experimental observations. YL, MH, JB, and HB ran experiments. TL and JB did most of the writing, while JB, MH and YL made the figures. PI contributed guidance and helpful feedback throughout the course of the project. JB wrote the Modula package with help from MH.

References

[1] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015. Cited on page 1.
[2] Sashank J. Reddi, Satyen Kale and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. Cited on page 1.
[3] Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu et al. Tuning large neural networks via zero-shot hyperparameter transfer. In Neural Information Processing Systems, 2021. Cited on pages 1, 2, 9, and 10.
[4] Greg Yang, Dingli Yu, Chen Zhu and Soufiane Hayou. Tensor programs VI: Feature learning in infinite depth neural networks. In International Conference on Learning Representations, 2024. Cited on pages 1, 2, 9, and 21.
[5] Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin and Cengiz Pehlevan. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit. In International Conference on Learning Representations, 2024. Cited on pages 1 and 2.
[6] Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli et al. Depth dependence of µP learning rates in ReLU MLPs. arXiv:2305.07810, 2023. Cited on page 1.
[7] Lucas Lingle. A large-scale exploration of µ-transfer. arXiv:2404.05728, 2024. Cited on pages 1 and 2.
[8] Hamza Fawzi. Topics in convex optimisation. University of Cambridge, Lent 2023. Lecture 3. Cited on pages 2 and 8.
[9] Jeremy Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021. Cited on page 2.
[10] Haochuan Li, Jian Qian, Yi Tian, Alexander Rakhlin and Ali Jadbabaie. Convex and non-convex optimization under generalized smoothness. In Neural Information Processing Systems, 2023. Cited on page 2.
[11] Jingzhao Zhang, Tianxing He, Suvrit Sra and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020. Cited on page 2.
[12] Jeremy Bernstein, Arash Vahdat, Yisong Yue and Ming-Yu Liu. On the distance between two neural networks and the stability of learning. In Neural Information Processing Systems, 2020. Cited on pages 2 and 9.
[13] Michael Vernon Nelson. Gradient conditioning in deep neural networks. Master's thesis, Brigham Young University, 2022. Cited on page 2.
[14] Greg Yang, James B. Simon and Jeremy Bernstein. A spectral condition for feature learning. arXiv:2310.17813, 2023. Cited on pages 2, 3, 9, and 16.
[15] Nikita Dhawan, Sicong Huang, Juhan Bae and Roger Grosse. Efficient parametric approximations of neural network function space distance. In International Conference on Machine Learning, 2023. Cited on page 2.
[16] Ari Benjamin, David Rolnick and Konrad Kording. Measuring and regularizing networks in function space. In International Conference on Learning Representations, 2019. Cited on page 2.
[17] Behnam Neyshabur, Ruslan Salakhutdinov and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. Neural Information Processing Systems, 2015. Cited on page 2.
[18] Greg Yang and J. Edward Hu. Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, 2021. Cited on pages 2, 7, and 9.
[19] Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz et al. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. Cited on page 2.
[20] Radford M. Neal. Bayesian Learning for Neural Networks. Ph.D. thesis, Department of Computer Science, University of Toronto, 1994. Cited on page 2.
[21] Arthur Jacot, Franck Gabriel and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Neural Information Processing Systems, 2018. Cited on page 2.
[22] Mufan Bill Li, Mihai Nica and Daniel M. Roy. The neural covariance SDE: Shaped infinite depth-and-width networks at initialization. In Advances in Neural Information Processing Systems, 2022. Cited on page 2.
[23] Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein and Guy Gur-Ari. The large learning rate phase of deep learning, 2021. Cited on page 2.
[24] Daniel A. Roberts, Sho Yaida and Boris Hanin. The Principles of Deep Learning Theory. Cambridge University Press, 2022. Cited on page 2.
[25] Chaoyue Liu, Libin Zhu and Mikhail Belkin. On the linearity of large non-linear models: When and why the tangent kernel is constant. Neural Information Processing Systems, 2020. Cited on page 2.
[26] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury et al. PyTorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems, 2019. Cited on pages 2, 5, and 9.
[27] Matthew J. Streeter and Joshua V. Dillon. Automatically bounding the Taylor remainder series: Tighter bounds and new applications. arXiv:2212.11429, 2022. Cited on pages 3 and 9.
[28] Matthew J. Streeter. Universal majorization-minimization algorithms. arXiv:2308.00190, 2023. Cited on pages 3 and 9.
[29] Matthew Streeter. Beyond automatic differentiation, 2023. URL https://research.google/blog/beyond-automatic-differentiation/. Cited on page 3.
[30] Namhoon Cho and Hyo-Sang Shin. Automatic optimisation of normalised neural networks. arXiv:2312.10672, 2023. Cited on page 3.
[31] Jeremy Bernstein, Chris Mingard, Kevin Huang, Navid Azizan and Yisong Yue. Automatic gradient descent: Deep learning without hyperparameters. arXiv:2304.05187, 2023. Cited on pages 3, 9, 10, and 16.
[32] Dung T. Tran, Nobutaka Ono and Emmanuel Vincent. Fast DNN training based on auxiliary function technique. International Conference on Acoustics, Speech and Signal Processing, 2015. Cited on page 3.
[33] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary et al. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax. Cited on page 9.
[34] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv:1605.02688, 2016. Cited on page 9.
[35] Andrej Karpathy. The spelled-out intro to neural networks and backpropagation: Building micrograd, 2018. URL https://www.youtube.com/watch?v=VMj-3S1tku0. Cited on page 9.
[36] Takeru Miyato, Toshiki Kataoka, Masanori Koyama and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. Cited on page 9.
[37] Konstantin Mishchenko and Aaron Defazio. Prodigy: An expeditiously adaptive parameter-free learner. arXiv:2306.06101, 2024. Cited on page 10.
[38] Aaron Defazio and Konstantin Mishchenko. Learning-rate-free learning by D-adaptation. In International Conference on Machine Learning, 2023. Cited on page 10.
[39] Maor Ivgi, Oliver Hinder and Yair Carmon. DoG is SGD's best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning, 2023. Cited on page 10.
[40] Kaiming He, X. Zhang, Shaoqing Ren and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. International Conference on Computer Vision, 2015. Cited on page 19.
[41] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415, 2016. Cited on page 19.
[42] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv:2205.14135, 2022. Cited on page 22.
[43] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019. Cited on pages 23 and 27.
[44] Andrej Karpathy. nanoGPT code repository, 2022. URL https://github.com/karpathy/nanoGPT. Cited on pages 23, 27, and 28.
[45] Kaiming He, X. Zhang, Shaoqing Ren and Jian Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition, 2015. Cited on page 27.
[46] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Cited on page 27.
[47] Andrej Karpathy. Tiny Shakespeare. https://huggingface.co/datasets/karpathy/tiny_shakespeare, 2022. Cited on page 27.
[48] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv:2305.07759, 2023. Cited on page 27.
[49] Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019. Cited on page 27.

Contents of the Appendices

Appendix A The Modula Package
A.1 The Vector class
A.2 The Module class
A.3 Normalization in Modula
A.4 Overhead

Appendix B Module and Network Design
B.1 Atomic modules
B.2 Bond modules
B.3 Module broadcasting
B.4 Mass taring
B.5 Compound modules and neural networks
B.6 Case study I: Attention
B.7 Case study II: GPT

Appendix C More on Smoothness and Sharpness
C.1 Underlying every estimate: The Gauss-Newton decomposition
C.2 Sharpness under composition and concatenation
C.3 Sharpness under module broadcasting
C.4 Smoothness estimates for common error measures

Appendix D Experimental Details
D.1 Datasets
D.2 Architectures
D.3 Hardware
D.4 Comparing to standard nanoGPT architecture
D.5 Full sweeps
D.6 Mass allocation
D.7 Context length
D.8 Full sweep results

Appendix E Proofs

Appendix A The Modula Package

We created a Python package called Modula that realizes our module framework in code. Modula supplements PyTorch's Tensor class with two new classes: Vector and Module.

A.1 The Vector class

The Vector class is used to store the weights of a module. It allows for basic algebraic operations to be performed on module weights without needing to write for-loops over lists of tensors. For example, if v_1 and v_2 are vectors with the same sub-structure, then one may write expressions such as v_1 + v_2 for the vector sum, or v_1 * v_2 for the elementwise product. Internally, a Vector stores a list of tensors and implements operations using efficient PyTorch foreach primitives.

A.2 The Module class

The most significant aspect of the Modula package is the Module class. A Module must have six attributes: two float attributes, namely mass and sensitivity, and four methods:

```python
forward(w: Vector, x: Tensor) -> Tensor   # returns an output tensor
initialize() -> Vector                    # randomly samples a weight vector
normalize(w: Vector)                      # normalizes w to have unit modular norm
regularize(w: Vector, strength: float)    # regularizes w in-place
```

The norm of a module is not implemented directly; instead we use the normalize method, which is how the norm is actually used in optimization. We refer to modules with hand-specified attributes as bonds if they have no weights and atoms if they have weights. Modules formed by combining existing modules are called compounds. Modula automatically constructs the attributes of compound modules. We provide reference implementations for many common modules; see Appendix B. We equip atoms with their natural operator norm, and compute spectral norms via online power iteration. Reference modules may be imported as follows:

```python
from modula.bond import Identity, ReLU, Abs, FunctionalAttention
from modula.atom import Linear, Embed, Conv2D
from modula.compound import ResMLP, ResCNN, Attention, GPT
```

To make building new compounds easier, Modula overloads the following operations on modules:

```python
M_2 @ M_1    # composes module M_2 with module M_1
(M_1, M_2)   # acts as a tuple module in any further composition
M_1 + M_2    # returns the module sum
a * M        # multiplies module M by scalar a
M ** L       # returns the Lth iterate of module M
```

For example, the compound ((L-1)/L * Identity() + 1/L * M()) ** L builds an L-layer residual network from base module M. Comparing with Equation (3.10), we see that Modula expressions closely resemble their mathematical counterparts. Finally, all modules come with a convenience method tare(m: float), which resets the module mass to m, with default m=1.

A.3 Normalization in Modula

We can normalize any base optimizer in the modular norm using the following pattern:

```python
delta_w = optim(w.grad())   # get update from base optimizer
net.normalize(delta_w)      # normalize update in the modular norm
w -= lr * delta_w           # apply update to weights
```

Computation of net.normalize(delta_w) requires an efficient estimate of the spectral matrix norm, in the last two dimensions, of the constituent tensors of delta_w; this can be done very quickly to reasonable accuracy using power iteration. We implement this by storing a running estimate of the top singular vector u for each constituent tensor of delta_w. At initialization, u is sampled Gaussian, and each time we normalize a weight update, the previous update's estimated singular vector is used as the starting value for the power iteration. This enables us to use just two steps of power iteration per weight update. Indeed, for any base optimizer with momentum, successive weight updates should be fairly close; for training without momentum, more steps of power iteration may be required.
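As an illustration of this scheme, the following function (our own sketch, not the Modula source) estimates the spectral norm of an update matrix with a small number of power-iteration steps, warm-started from a singular-vector estimate carried over from the previous optimization step.

```python
import torch

def spectral_norm_estimate(delta_w: torch.Tensor, u: torch.Tensor, n_steps: int = 2):
    # Power iteration warm-started from u, a running estimate of the top right
    # singular vector. Returns (estimated spectral norm, updated u).
    for _ in range(n_steps):
        v = delta_w @ u
        v = v / (v.norm() + 1e-12)
        u = delta_w.T @ v
        u = u / (u.norm() + 1e-12)
    sigma = torch.dot(v, delta_w @ u)   # Rayleigh-quotient-style estimate of sigma_max
    return sigma, u

# usage: keep u across optimizer steps, so that two iterations per step suffice
delta_w = torch.randn(256, 128)
u = torch.randn(128)
u = u / u.norm()
sigma, u = spectral_norm_estimate(delta_w, u)
print(float(sigma), float(torch.linalg.matrix_norm(delta_w, ord=2)))  # estimate vs exact
```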
A.4 Overhead

To test the overhead of normalization in the modular norm, we trained a width-64 ResMLP with 8 blocks and block-depth 2 for 10k steps on the CIFAR-10 dataset. We repeated the experiment with and without normalization, and in each case with three different random seeds. Without normalization, the training took 101 ± 1 seconds, and with normalization the training took 124 ± 1 seconds. So in this experiment, the overhead of modular normalization was around 23%.

We note that the user of the Modula package is free to write new atomic modules with cheaper or more efficient normalize functions. For instance, the Frobenius norm can be used as a proxy for the spectral norm whenever the weight updates have low stable rank [14, 31]. And we note in §5 that one could explore more exotic norms such as the L∞→L∞ operator norm, which is cheaper to compute than the standard spectral norm. Beyond these suggestions, one could explore CUDA-level optimizations to spectral norm computation, which is something that we have not explored.

| Module M | M.forward | M.mass | M.sensitivity | M.norm |
|---|---|---|---|---|
| Linear | (W, x) ↦ √(d_out/d_in) · W x | 1 | 1 | W ↦ ‖W‖_∗ |
| Embed | (E, x) ↦ √d · E x | 1 | 1 | E ↦ max_i ‖E_{·i}‖₂ |
| Conv2D | (C, x) ↦ (1/K²) · √(d_out/d_in) · C ⊛ x | 1 | 1 | C ↦ max_{i,j} ‖C_{·,·,i,j}‖_∗ |

Table 2: Three atomic modules. These are the three atoms implemented in Modula, enough to build ResNet and GPT networks. By including explicit dimensional scale factors in the forward functions, we are able to use the standard spectral norm ‖·‖_∗ and Euclidean norm ‖·‖₂, rather than their rescaled versions. d_in and d_out denote the input and output dimension of the Linear module. d denotes the embedding dimension of the Embed module. K denotes the kernel size of a Conv2D module with d_out output channels and d_in input channels. ⊛ denotes convolution.

Appendix B Module and Network Design

In this appendix, we list the basic, hand-declared modules that serve as building blocks for more complicated neural networks. Then we go on to show how these modules may be combined to yield interesting neural networks. This includes discussion of module broadcasting (Appendix B.3) and mass taring (Appendix B.4). The appendix culminates with case studies on attention (Appendix B.6) and transformers (Appendix B.7).

B.1 Atomic modules

An atomic module, or atom for short, is a module with nonzero mass and nonzero parameter space, whose attributes are specifically declared rather than derived. Setting an atom's mass to zero has the effect of freezing its weights under normed optimization.

Atom 1 (Linear). For positive integers d_out and d_in, the linear module Linear(d_out, d_in) corresponds to the standard linear layer with d_in input features and d_out output features. As a module, it has input space X = ℝ^{d_in}, output space Y = ℝ^{d_out} and weight space W = ℝ^{d_out × d_in}, the space of d_out × d_in matrices. Its four attributes (forward function, mass, sensitivity, norm) are given in Table 2.
Note the presence of the √(d_out/d_in) factor in the forward function: this convention means that we can work with the standard L2 operator norm rather than the RMS→RMS operator norm. Writing f = Linear(d_out, d_in).forward, its derivative and second derivative at (W, x) are given by:

∇f ⋄ (ΔW, Δx) = √(d_out/d_in) · ((ΔW) x + W (Δx)),    (B.1)
(ΔW, Δx) ⋄ ∇²f ⋄ (ΔW̃, Δx̃) = √(d_out/d_in) · ((ΔW)(Δx̃) + (ΔW̃)(Δx)),    (B.2)

from which we conclude that Linear(d_out, d_in) is well-normed, using the RMS norms on its input and output, so long as its arguments satisfy:

‖W‖_∗ ≤ 1,   ‖x‖_RMS ≤ 1.    (B.3)

These conditions will be automatically satisfied for many neural networks under orthogonal initialization of the weights, and especially if a linear module is immediately preceded by something like a LayerNorm module. Moreover, orthogonal initialization guarantees that the well-normed inequality

‖∇_x f ⋄ Δx‖_RMS ≤ ‖Δx‖_RMS    (B.4)

holds tightly for nearly-square matrices at initialization, which is important for getting good signal propagation through the whole network. Moreover, inspection of the second derivative formula above shows that Linear is always (0, 1, 0)-sharp with respect to the RMS norms on the input and output spaces.

Atom 2 (Embed). For positive integers n and d, the embedding module Embed(n, d) corresponds to n class, token, or positional embeddings in a d-dimensional embedding space. As a module, it has input space ℝ^n, output space ℝ^d and weight space W = ℝ^{d × n}, the space of d × n matrices. Its attributes are listed in Table 2. This is at first sight similar to the linear module, the key difference being that in applications we expect the inputs of Embed(n, d) to be one-hot vectors; as such, we consider its input space to carry the L1 norm. As for the linear module, Embed(n, d) is well-normed and (0, 1, 0)-sharp with respect to the L1 norm on the input space ℝ^n and the RMS norm on the output space ℝ^d.

Atom 3 (Conv2D). For positive integers d_out, d_in, K as well as H, W, the 2D-convolution module Conv2D(d_out, d_in, K) corresponds to a convolutional layer with a K × K kernel; d_in and d_out are the number of channels for the input and output respectively (we suppress optional stride and padding arguments here for simplicity). Its input space is X = ℝ^{d_in × H × W}, its output space is Y = ℝ^{d_out × H × W} and its weight space is W = ℝ^{d_out × d_in × K × K}. Its attributes are listed in Table 2. In fact, one could alternatively build Conv2D(d_out, d_in, K) starting from K² different Linear(d_out, d_in) modules (of mass 1/K² each), concatenating them, and composing with a (parameter-less) convolution module. As such, Conv2D is well-normed and (0, 1, 0)-sharp. However, in our Modula package, we choose to explicitly declare Conv2D so as to take advantage of PyTorch's efficient implementation of convolution; the presentation here reflects this.

B.2 Bond modules

A bond module, or bond, is a module with zero mass and zero parameter space. Bonds are the glue between the atomic modules, needed to construct complex neural networks. Note that we need not specify a weight space, mass or norm argument for a bond module. Moreover, when discussing whether a bond module is (α, β, γ)-sharp, the inequalities for α and β are vacuous; thus for bond modules we will abbreviate this notion to γ-sharp.

To begin, we need two bond modules that are essentially utilities, as they are crucial for defining basic secondary module operations. These modules are also type-polymorphic in the sense that they work with any underlying vector space.

Bond 1 (Add).
For any vector space Y, the adder module Add has input space Y × Y and output space Y. It has forward function

Add.forward : (y₁, y₂) ↦ y₁ + y₂    (B.5)

and sensitivity 1. Its significance is that it allows for concatenatable modules to be added:

M₁ + M₂ := Add ∘ (M₁, M₂).    (B.6)

For any norm ‖·‖_Y on the vector space Y, Add is well-normed with respect to the L1 combination norm ‖(y₁, y₂)‖_{Y×Y} := ‖y₁‖_Y + ‖y₂‖_Y on its input space. Furthermore, Add is 0-sharp.

Bond 2 (Mul_λ). For any normed vector space Y and real number λ, the scalar multiplier module Mul_λ has input space Y and output space Y. Its forward function is

Mul_λ.forward : y ↦ λ · y    (B.7)

and its sensitivity is |λ|. Its significance is that it allows for scalar multiplication of modules:

λ · M := Mul_λ ∘ M.    (B.8)

It is well-normed with respect to any norm on Y, and 0-sharp. When λ = 1, we call this the identity module Identity = Mul₁. Note that λ · Identity = Mul_λ for any λ.

The remaining bond modules are used explicitly as non-linearities in neural networks.

Bond 3 (Abs). In any dimension d, the absolute value bond module Abs has inputs and outputs ℝ^d, forward function

Abs.forward : (x₁, …, x_d) ↦ (|x₁|, …, |x_d|)    (B.9)

and sensitivity 1. It is well-normed for any norm on ℝ^d.

Bond 4 (ReLU and ScaledReLU). In any dimension d, we define the rectified linear unit bond module ReLU to have input space X ⊆ ℝ^d, output space Y = ℝ^d, forward function

ReLU.forward : (x₁, …, x_d) ↦ (max(0, x_i))_{i=1,…,d}    (B.10)

and sensitivity 1/√2. For this choice of sensitivity, ReLU is not well-normed with input space X set to the full ℝ^d. However, it is well-normed if the input space is, informally, a set of dense vectors with balanced signs. For illustration, ReLU is rigorously well-normed with respect to the input space

X = {sign(t) : t ∈ ℝ^d with # positive entries = # negative entries}    (B.11)

and RMS norm on inputs and outputs. For more on this design decision, see [40]. We also define ScaledReLU := √2 · ReLU to be the unit-sensitivity counterpart to ReLU.

Bond 5 (GELU and ScaledGELU). The Gaussian error linear unit bond module GELU [41] is essentially a smoothed version of ReLU. In any dimension d, GELU has inputs X ⊆ ℝ^d, outputs Y = ℝ^d and forward function

GELU.forward : (x₁, …, x_d) ↦ (x_i Φ(x_i))_{i=1,…,d},    (B.12)

where Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−t²/2} dt is the cumulative distribution function of the standard Gaussian. GELU is well-normed in the same sense as ReLU. We similarly set ScaledGELU := √2 · GELU.

Bond 6 (MeanSubtract). For any dimension d, the mean subtraction module MeanSubtract has inputs and outputs ℝ^d. It centers its input to have mean zero. The forward function is given by

MeanSubtract.forward : (x₁, …, x_d) ↦ (x₁ − x̄, …, x_d − x̄),    (B.13)

where x̄ = (1/d) Σ_i x_i, and it has sensitivity 1. It is well-normed, and since it is a linear mapping, it is 0-sharp.

Bond 7 (RMSDivide). For any dimension d, the RMS division bond module RMSDivide has inputs and outputs ℝ^d. It normalizes its input to have unit RMS norm. The forward function is given by

RMSDivide.forward : x ↦ x / ‖x‖_RMS = √d · x / ‖x‖₂,    (B.14)

and it has sensitivity 1. While it is not automatically well-normed, as long as its inputs have ‖x‖_RMS ≈ 1, the required inequality is not very far off. Similarly, it is approximately 1-sharp.

Bond 8 (LayerNorm). For any positive integer d, the layer normalization bond module LayerNorm has inputs and outputs ℝ^d, and is defined as the composition of modules

LayerNorm := RMSDivide ∘ MeanSubtract.    (B.15)

As with RMSDivide, it is approximately well-normed and approximately 1-sharp.
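As a quick numerical illustration (ours) of why 1/√2 is the natural sensitivity for ReLU on the balanced sign vectors of (B.11): zeroing out the negative half of a ±1 vector shrinks its RMS norm by exactly a factor of √2.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024                                   # even dimension, so signs can balance exactly
x = np.array([1.0] * (d // 2) + [-1.0] * (d // 2))
rng.shuffle(x)                             # a sign vector with equally many + and - entries

rms = lambda v: np.linalg.norm(v) / np.sqrt(v.size)
y = np.maximum(x, 0.0)                     # ReLU zeroes the negative half

print(rms(y) / rms(x))                     # 0.7071... = 1/sqrt(2), matching ReLU.sensitivity
```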
B.3 Module broadcasting

Let us briefly discuss a supplementary module operation, which we refer to as module broadcasting.

Definition 6. Suppose M is a module with inputs X, outputs Y and weights W. Then for any h ≥ 1, the h-times-broadcast of M is the module M^(h) with the same weight space W, mass, sensitivity and norm as M, but with inputs the Cartesian power X^h = X × · · · × X, outputs Y^h = Y × · · · × Y, and forward function

(w, (x1, . . . , xh)) ↦ (M.forward(w, x1), . . . , M.forward(w, xh)).   (B.16)

Since this does not define a module with a new set of weights, we will usually refer to the broadcast module by the same name M, and regard broadcasting as an extension of its forward function. For example, this allows us to define the action of linear modules Linear(d_out, d_in) on inputs x ∈ R^{ℓ×d_in}, giving outputs y ∈ R^{ℓ×d_out}, where ℓ is the context length parameter of a transformer (see Appendix B.6 and Appendix B.7, where broadcasting is also crucial for the construction of multi-headed attention). Additionally, one can view the basic Abs, ReLU and GELU modules as broadcasts of the usual one-variable functions to inputs and outputs in R^d. Let us briefly note:

Proposition 6. If M is well-normed, then so is any broadcast of M taking X^h to Y^h, as long as the norms on X^h and Y^h are taken to be the mean ℓp norms

‖(x1, . . . , xh)‖_{X^h} = ((1/h)(‖x1‖_X^p + · · · + ‖xh‖_X^p))^{1/p}   (B.17)
‖(y1, . . . , yh)‖_{Y^h} = ((1/h)(‖y1‖_Y^p + · · · + ‖yh‖_Y^p))^{1/p}   (B.18)

for 1 ≤ p ≤ ∞; when p = ∞ this is just the max norm. In the case that M is a bond module (so W = 0), any scalar multiple of the mean ℓp norm can be used (including the standard ℓp norm).

The situation for sharpness is a bit more complicated; we discuss this in Appendix C.3.

B.4 Mass taring

To make working with the mass parameter of modules a bit easier, let us introduce an auxiliary operation:

Definition 7 (Tare). For any module M and positive real number m_new, the module tare(M, m_new) has exactly the same inputs, outputs and weights as M, the same forward function, the same sensitivity and the same norm, but has mass

tare(M, m_new).mass = m_new.   (B.19)

This resets the mass of M. If M is a compound module, one can also reset the masses of all its submodules, by taking tare(M_k, m_new · M_k.mass / M.mass) for every submodule M_k, so as to reconstruct the computation graph for tare(M, m_new). In this way, one can build complex modules starting from atomic modules with unit masses, and then use tare later to reset their masses to desired values, for better feature learning with normed descent as in Proposition 3.

B.5 Compound modules and neural networks

Composition, concatenation and the secondary operations of addition, scalar multiplication and iterated concatenation allow us to build a wide variety of neural networks, which thus come automatically endowed with the modular norm. Deep neural networks are typically built as long series of compositions. Let us introduce some terminology:

Definition 8 (Blocks and deep networks). A deep neural network is a module M formed by a composition

M = OutputLayer ∘ Block_L ∘ · · · ∘ Block_1 ∘ InputLayer   (B.20)

where InputLayer, Block_1, . . . , Block_L, OutputLayer are modules; the number of blocks L ≥ 1 is the depth of the network. Typically, each of Block_1, . . . , Block_L will be a copy of the same module (allowing them to take different weight values, of course), so that the network can be written as an iterated composition

M = OutputLayer ∘ Block^{∘L} ∘ InputLayer.   (B.21)
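As a concrete illustration of Definition 7, here is a minimal sketch of how tare might recursively rescale the masses of the submodules of a compound module such as the hidden blocks of a deep network (Definition 8). The Node class is a hypothetical stand-in, not the Modula implementation.

```python
# A minimal sketch of mass taring (Definition 7) on a toy module tree.
# This is not the Modula source; "Node" and its fields are hypothetical
# stand-ins for a compound module and its submodules.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    mass: float
    children: list = field(default_factory=list)

def tare(module: Node, m_new: float) -> None:
    """Rescale the module's mass to m_new, propagating the same
    proportional rescaling down to every submodule."""
    scale = m_new / module.mass
    module.mass = m_new
    for child in module.children:
        tare(child, child.mass * scale)

# Example: hidden blocks built from unit-mass atoms, then tared to total mass 20.
blocks = Node("hidden", mass=8.0,
              children=[Node(f"block{i}", mass=1.0) for i in range(8)])
tare(blocks, 20.0)
print([round(c.mass, 2) for c in blocks.children])  # eight submodules of mass 2.5 each
```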
InputLayer, Block and OutputLayer can in principle be any modules one likes, but usually InputLayer is some form of embedding module and OutputLayer is a linear module. As for the form of Block, we found the following design principle to be quite useful in practice: arrange for each Block to have unit sensitivity. This ensures that the sensitivity of the whole network stays bounded as L → ∞ (this would also be the case if we ensured Block.sensitivity = 1 + O(1/L), but unit sensitivity has the advantage that the modular norm becomes very explicit). With this in mind:

Compound 1 (Residual network). Suppose that M is a module of unit sensitivity whose inputs and outputs are the same space X. For any L ≥ 1, consider the residual block

Block = (L−1)/L · Identity + (1/L) · M   (B.22)

and write Res_L(M) = Block^{∘L}. This has unit sensitivity, is well-normed if M is, and moreover by Proposition 4 is sharp with O(1) sharpness if M is. A general residual network with residue M is any neural network of the form

OutputLayer ∘ Res_L(M) ∘ InputLayer.   (B.23)

In practice, we will want to apply one more operation: taring the mass of the residual blocks. To this end, the residual network with residue M, depth L and total block mass m > 0 is

OutputLayer ∘ tare(Res_L(M), m) ∘ InputLayer.   (B.24)

Let us give two basic examples of residual networks.

Compound 2 (ResMLP). This is a simple residual variation on the multilayer perceptron. For a width d ≥ 1, consider the unit-sensitivity module

M(d) = MeanSubtract ∘ Abs ∘ Linear(d, d) ∘ RMSDivide.   (B.25)

This particular order of operations is inspired by a recent paper of Yang et al. [4]. We invite the reader to compare this to something like ReLU ∘ Linear(d, d) ∘ LayerNorm: the same three core operations are being performed (in a different order in the two cases): the inputs are normalized; the inputs are centered; and the inputs are passed through a nonlinearity that mutates just the negative coordinates. The ResMLP network has as its residue an iterated composition of M(d), where the number of copies of M(d) in each residue is called the block depth and denoted B. It has linear initial and final modules. Thus the ResMLP network of depth L, width d, block depth B and total block mass m > 0 is

ResMLP = Linear(d_out, d) ∘ tare(Res_L(M(d)^{∘B}), m) ∘ Linear(d, d_in)   (B.26)

where d_in is the number of features of the data, and d_out the desired number of output features of the network. We usually suggest taking B = 1 or 2, and m ≈ 1.

Compound 3 (ResNet). This is a version of ResNet for image classification tasks. For a width d ≥ 1 and kernel size K, consider similarly to above the unit-sensitivity module

M(d, K) = MeanSubtract ∘ Abs ∘ Conv2D(d, d, K) ∘ RMSDivide.   (B.27)

As in the ResMLP, the ResNet network is a residual network whose residue is an iterated composition of B copies of M(d, K), where B is the block depth. Its initial and final modules are given by

InputLayer = Conv2D(d, c_in, K)   (B.28)
OutputLayer = Linear(d_out, d_total) ∘ AvgPool   (B.29)

where AvgPool is an additional bond module implementing adaptive average pooling. Here, c_in is the number of channel dimensions of the input image, d_total = d · H · W is the total dimension of the hidden representation, and d_out is the desired number of output features (note that in Modula we include an additional dummy module Flatten to change the tensor shape before passing through the final layer). The ResNet network of depth L, width d, block depth B, kernel size K and total block mass m is thus

ResNet = OutputLayer ∘ tare(Res_L(M(d, K)^{∘B}), m) ∘ InputLayer.   (B.30)

As defaults, we suggest taking B = 2, K = 3 and m ≈ 20.
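To see why the convex (L−1)/L, 1/L weighting of Compound 1 is helpful, here is a small NumPy experiment that pushes a random input through L residual blocks built from the ResMLP residue M(d) of (B.25) with random orthogonal weights, and compares the activation RMS against an unscaled Identity + M residual. This is our own illustration of the signal-propagation point, not an experiment from the paper.

```python
# Sketch: why the (L-1)/L, 1/L residual weighting keeps activations well scaled.
# We compare the convex weighting of (B.22) against a plain "Identity + M"
# residual, using the ResMLP residue M(d) of (B.25) with random orthogonal
# weights (untrained; illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, L = 128, 64
rms = lambda x: np.sqrt(np.mean(x ** 2))

def residue(x, W):
    """M(d) = MeanSubtract ∘ Abs ∘ Linear(d, d) ∘ RMSDivide, cf. (B.25)."""
    x = x / rms(x)                      # RMSDivide
    x = W @ x                           # Linear(d, d) with orthogonal weights
    x = np.abs(x)                       # Abs
    return x - x.mean()                 # MeanSubtract

Ws = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(L)]

x_convex = x_plain = rng.standard_normal(d)
for W in Ws:
    x_convex = (L - 1) / L * x_convex + 1 / L * residue(x_convex, W)
    x_plain  = x_plain + residue(x_plain, W)

print(rms(x_convex))   # stays O(1) across depth
print(rms(x_plain))    # grows with depth
```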
B.6 Case study I: Attention

Let us now focus on the construction of a single multi-headed attention module in this framework. The attention module should have, as both input and output, X = R^{ℓ×d}, where ℓ is the context length and d is the embedding dimension. The attention module itself depends on three additional dimensional arguments: h, the number of heads; d_Q, the key/query dimension; and d_V, the value dimension; as well as an ℓ × ℓ matrix mask, which we usually take to be either

mask_ij = 0 if i ≥ j, and −∞ otherwise,   (B.31)

for causal attention, or mask = 0 for non-causal attention. The core of the attention module is a bond module which we call functional attention.

Bond 9 (FuncAttention). Take positive integers ℓ, d_Q, d_V and a mask matrix mask. The corresponding functional attention is the bond module with inputs X = R^{ℓ×d_Q} × R^{ℓ×d_Q} × R^{ℓ×d_V}, outputs Y = R^{ℓ×d_V}, and forward function

FuncAttention.forward(q, k, v) = softmax(q kᵀ / d_Q + mask) v.   (B.32)

Moreover, we set FuncAttention.sensitivity = 1.

In theory, one could try to break up attention further into more basic constituent modules (such as scaled dot product, softmax, and so on), but keeping FuncAttention as the basic unit allows one to leverage efficient implementations of attention such as FlashAttention [42]. In fact, a perhaps surprising result is that with the above 1/d_Q scaling of the dot product, we can estimate the sensitivity and sharpness of FuncAttention. This relies on choosing norms for the input and output spaces; these norms are taken to be

‖(q, k, v)‖_X = ‖q‖_{∞RMS} + ‖k‖_{∞RMS} + ‖v‖_{∞RMS},   ‖y‖_Y = ‖y‖_{∞RMS},   (B.33)

where ‖·‖_{∞RMS} is the infinity-RMS norm on R^{ℓ×d}, defined from the standard root-mean-square norm ‖·‖_RMS on R^d by

‖x‖_{∞RMS} := max_{i=1,...,ℓ} ‖x_i‖_RMS.   (B.34)

Proposition 7. Over the space of inputs q, k, v with ‖q‖_{∞RMS}, ‖k‖_{∞RMS}, ‖v‖_{∞RMS} ≤ 1, the functional attention module FuncAttention is well-normed, and moreover is sharp with sharpness constant γ = 3.

The proof is given in Appendix E. We thus choose to adopt a 1/d_Q dot-product scaling in our implementation of attention: a rigorous bound as above is not possible for 1/√d_Q scaling, for instance. We can then immediately define a single head of attention.

Compound 4 (Single-headed attention). For positive integers ℓ, d, d_Q, d_V and a choice of mask, take four instances of the linear module, for the query, key, value and exit parameters:

Query = Linear(d_Q, d)   (B.35)
Key = Linear(d_Q, d)   (B.36)
Value = Linear(d_V, d)   (B.37)
Exit = Linear(d, d_V)   (B.38)

which by broadcasting we consider to have inputs of shape R^{ℓ×d}. The single-headed attention module Attention is then the composition

Attention = Exit ∘ (1/3) FuncAttention ∘ (Query, Key, Value).   (B.39)

The scalar multiplication factor of 1/3 ensures that Attention has unit sensitivity.

For multiple heads of attention, we simply take advantage of module broadcasting (Definition 6):

Compound 5 (Multi-headed attention). For positive integers ℓ, d, h, d_Q, d_V and a choice of mask, take four instances of the linear module:

Query = Linear(h d_Q, d)   (B.40)
Key = Linear(h d_Q, d)   (B.41)
Value = Linear(h d_V, d)   (B.42)
Exit = Linear(d, h d_V)   (B.43)

which by broadcasting we consider to have inputs of shape R^{ℓ×d}. The multi-headed attention module MultiHeadAttention is then the composition

MultiHeadAttention = Exit ∘ (1/3) FuncAttention^(h) ∘ (Query, Key, Value)   (B.44)

where FuncAttention is broadcast over the heads dimension. Note that in Modula, we do this by creating dummy bond modules called AddHeads and RemoveHeads to reshape the tensors and create/remove the explicit head dimension. As in the single-headed case, the scalar multiplication factor of 1/3 ensures unit sensitivity.
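For concreteness, here is a minimal PyTorch sketch of the FuncAttention forward function (B.32) with the 1/d_Q dot-product scaling and an optional causal mask. It is illustrative only and not the Modula implementation, which can instead defer to an efficient kernel such as FlashAttention.

```python
# A minimal PyTorch sketch of FuncAttention.forward (B.32) with the 1/d_Q
# dot-product scaling and an optional causal mask. Illustrative only; not
# the Modula implementation.
import torch

def func_attention(q, k, v, causal=True):
    """q, k: (ell, d_Q); v: (ell, d_V). Returns a (ell, d_V) tensor."""
    ell, d_Q = q.shape
    scores = q @ k.T / d_Q                      # note: 1/d_Q, not 1/sqrt(d_Q)
    if causal:
        mask = torch.full((ell, ell), float("-inf")).triu(diagonal=1)
        scores = scores + mask                  # mask_ij = -inf for j > i, 0 otherwise
    return torch.softmax(scores, dim=-1) @ v

ell, d_Q, d_V = 16, 32, 32
q, k, v = torch.randn(ell, d_Q), torch.randn(ell, d_Q), torch.randn(ell, d_V)
out = func_attention(q, k, v)
print(out.shape)  # torch.Size([16, 32])
```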
B.7 Case study II: GPT

Let us now build an auto-regressive transformer similar to GPT-2 [43] or nanoGPT [44] in this framework. Fix positive integers ℓ, d, h, d_Q, d_V (usually h divides d and d_Q = d_V = d/h). In addition to Compound 5 from earlier, consider the 2-layer MLP

MLP = Linear(d, 4d) ∘ √2 GELU ∘ Linear(4d, d)   (B.45)

where we use the scalar correction √2 so that GELU has unit sensitivity, and module broadcasting so that the MLP takes inputs and outputs in R^{ℓ×d}. Fix a depth L ≥ 1, and consider the following two modules, whose input and output spaces are R^{ℓ×d}:

Block_MLP := (2L−1)/(2L) · Identity + (1/(2L)) · MLP ∘ LayerNorm_d   (B.46)
Block_Attn := (2L−1)/(2L) · Identity + (1/(2L)) · MultiHeadAttention ∘ LayerNorm_d   (B.47)

where LayerNorm_d refers to taking LayerNorm in the embedding dimension (i.e. over the rows of matrices in R^{ℓ×d}, as distinct from normalizing all ℓ·d coordinates together). This can alternatively be thought of as taking the usual LayerNorm on R^d and broadcasting it to inputs and outputs in R^{ℓ×d}.

Suppose that N is the number of tokens. For the initial module, take two embeddings, one for the N tokens and one for the ℓ context positions,

Embed_tok = Embed(N, d),  Embed_pos = Embed(ℓ, d),   (B.48)

and form the mass-one, sensitivity-one module

InputLayer = tare(½ Embed_tok + ½ Embed_pos, 1).   (B.49)

The final module is just

OutputLayer = Linear(N, d) ∘ LayerNorm_d.   (B.50)

The depth L ≥ 1, width d, total block mass m > 0 GPT module is thus

GPT = OutputLayer ∘ tare((Block_MLP ∘ Block_Attn)^{∘L}, m) ∘ InputLayer.   (B.51)

We suggest, as a default value, a total block mass of m ≈ 5.

Appendix C  More on Smoothness and Sharpness

C.1 Underlying every estimate: the Gauss–Newton decomposition

All our estimates of sharpness for compound modules, as well as the smoothness estimate of Proposition 5 for loss functions, depend on an application of the chain rule to compute second derivatives, which in the optimization context is sometimes called the Gauss–Newton decomposition. Namely, if f : R^{d_0} → R^{d_1} and g : R^{d_1} → R^{d_2}, then the second derivative of the composite h = g ∘ f is computed by

v ⋄ ∇²h ⋄ w = (∇f ⋄ v) ⋄ ∇²g ⋄ (∇f ⋄ w) + ∇g ⋄ (v ⋄ ∇²f ⋄ w)   (C.1)

for any v, w ∈ R^{d_0}, or for short

∇²h(·, ·) = ∇²g(∇f(·), ∇f(·)) + ∇g(∇²f(·, ·)).   (C.2)

Indeed, this amounts to the following expression for partial derivatives:

∂²h/∂x_i ∂x_j = Σ_{k,l} (∂²g/∂y_k ∂y_l)(∂f_k/∂x_i)(∂f_l/∂x_j) + Σ_k (∂g/∂y_k)(∂²f_k/∂x_i ∂x_j).   (C.3)

C.2 Sharpness under composition and concatenation

Here we state the two formulae for computing the sharpness of a composition and of a concatenation of two modules. The proofs are given in Appendix E.

Proposition 8 (Sharpness under composition). Suppose that M2 and M1 are well-normed, composable modules that are respectively (α2, β2, γ2)-sharp and (α1, β1, γ1)-sharp. Under the shorthand p_k := M_k.mass / (M1.mass + M2.mass) and µ_k := M_k.sensitivity, the composite M2 ∘ M1 is (α, β, γ)-sharp for:

α = (1/µ2) p1² α1 + p2² α2 + (2/µ2) p1 p2 β2 + (1/µ2²) p1² γ2,   (C.4)
β = p1 β1 + µ1 p2 β2 + (µ1/µ2) p1 γ2,   (C.5)
γ = µ2 γ1 + µ1² γ2.   (C.6)

Proposition 9 (Sharpness under concatenation).
Suppose that M1 and M2 are well-normed, concatenatable modules that are respectively (α1, β1, γ1)-sharp and (α2, β2, γ2)-sharp. Under the shorthand that pk Mk.mass M1.mass+M2.mass and µk Mk.sensitivity, the tuple (M1, M2) is (α, β, γ)-sharp for: α = p2 1α1 + p2 2α2, (C.7) β = p1β1 + p2β2, (C.8) γ = γ1 + γ2. (C.9) Taken together, Propositions 8 and 9 specify a recursive procedure for computing the sharpness of any compound module that is built from a set of well-normed modules of known sharpness. Remark 1. These two sets of formulas are actually associative, as the reader may verify using their favorite computer algebra package. This means, for instance, that if M1, M2, M3 are successively composable, well-normed and each (αk, βk, γk)-sharp, then the two sets of sharpness estimates coming from applying the above formulas for M3 (M2 M1) and (M3 M2) M1 actually coincide. C.3 Sharpness under module broadcasting Suppose M is a well-normed module with inputs X, outputs Y and weights W, and suppose moreover that it is (α, β, γ)-sharp. The broadcast module M(h) has the same weights, mass, sensitivity and norm, but takes X h to Yh. By Proposition 6, M(h) is well-normed, as long as the norms on X h and Yh are taken to be (x1, . . . , xh) X h = S ( x1 p X + . . . + xh p X )1/p (C.10) (y1, . . . , yh) Yh = S y1 p Y + . . . + yh p Y 1/p (C.11) for 1 p ; unless M is a bond module (and thus weight-less), we must take S = h 1/p, otherwise S can be any positive scalar. A natural question is whether M(h) is also sharp, and if so what its sharpness constants are, with respect to these norms. More or less the same proof as for Proposition 6 shows that the α and β bounds for sharpness are always true, with the same α, β. The γ bound is trickier however, and depends subtly on the chosen S, p. We highlight three cases where one can say something interesting. Case 1. p = , S = 1. For the L norm, we have that M(h) is (α, β, γ)-sharp with the same α, β, γ by a more or less immediate proof. Case 2. p < , S = 1. For the standard Lp-norms, we have that M(h) is (α, β, γ)-sharp with the same α, β, γ. The proof is direct, using the inequality (xp 1exp 1 + . . . + xp hexp h)1/p (xp 1 + . . . + xp h)1/p(exp 1 + . . . + exp h)1/p (C.12) for any positive reals x1, . . . , xh, x1, . . . , xh; however this is a very weak inequality and so leads to very pessimistic sharpness estimates for large h. Case 3. p = 2, S = 1/ h. This is the RMS norm case. As in Case 2, one could use a very weak inequality to obtain the pessimistic result that M(h) is (α, β, h γ)-sharp. However, one could also make the following observation: if h is large, and x1, . . . , xh are sampled from any normal distribution N(µ, σ2), then 1 h(x4 1 + . . . + x4 h) 1/2 h(x2 1 + . . . + x2 h) . (C.13) In particular, this justifies the statement that for large h, the broadcast module M(h) is approximately (α, β, 3 γ)-sharp . While in actual deep learning contexts, the assumption that x1, . . . , xh are sampled from a normal distribution may not be valid, one should still expect the ratio between the two sides of Equation (C.13) to stay O(1) as h , and so even if the 3 rule is insufficient, the effective sharpness of the broadcast module should not blow up as h . C.4 Smoothness estimates for common error measures Suppose ℓ: Y T R is an error measure. In Proposition 5, we showed that smoothness estimates on ℓtogether with sharpness of a neural network imply smoothness of the corresponding average error loss function. 
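Before turning to concrete error measures, here is a quick numerical check of the Gaussian heuristic used in Case 3 of Appendix C.3 above, reading Equation (C.13) as ((1/h) Σ_i x_i⁴)^{1/2} ≈ √3 · (1/h) Σ_i x_i² for large h. This is our own illustration, not part of the Modula package.

```python
# Numerical check of the Gaussian heuristic behind Case 3 of Appendix C.3,
# read as ((1/h) * sum(x_i**4))**0.5  ≈  sqrt(3) * (1/h) * sum(x_i**2)
# for large h and x_i drawn i.i.d. from a standard normal distribution.
# Our own illustration; not part of the Modula package.
import numpy as np

rng = np.random.default_rng(0)
for h in [10, 1_000, 100_000]:
    x = rng.normal(loc=0.0, scale=1.0, size=h)
    lhs = np.sqrt(np.mean(x ** 4))
    rhs = np.sqrt(3) * np.mean(x ** 2)
    print(h, lhs / rhs)   # ratio tends to 1 as h grows (and stays O(1) in general)
```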
The precise estimates are that ℓis σ-Lipschitz and τ-smooth in the module output, in the sense that: | yℓ(y, t) y| σ y Y for all y Y and t T ; (C.14) | y 2 yyℓ(y, t) ey| τ y Y ey Y for all y, ey Y and t T . (C.15) We now present estimates on σ and τ for square and cross-entropy error. Both estimates will be in terms of the value of the average loss function L itself, rather than being truly global over the entire output space Y. Thus, to apply them to real learning problems, one should measure the average loss L at initialization, and use this for estimates for σ and τ; we are implicitly making the assumption that under gradient descent the loss decreases. Square error Consider square error for a d-class classification problem. Thus, Y = Rd and T = {1, . . . , d}. Consider the RMS norm on Y, and define the error function ℓ(y, t) = 1 y2 1 + . . . + (yt d)2 + . . . + y2 d for y, t Y T . (C.16) (the slightly non-standard scalings are due to the choice of RMS norm on Y). The first and second partial derivatives of ℓare given by ℓ yi (y, t) = 1 d), 2ℓ yi yj (y, t) = 1 dδij (C.17) The desired constants σ, τ can then be computed as maxima: σ = max z RMS=1 ℓ yi zi, τ = max z RMS=1 2ℓ yi yj zizj (C.18) which from the above formulas amounts exactly to ℓ(y, t), τ = 1. (C.19) To translate this into a bound for the average loss function L, note that square root is a concave function. Thus if we have outputs y1, . . . , y B with true classes t1, . . . , t B, Jensen s inequality yields ℓ(yb, tb) q 1 B X ℓ(yb, tb) = allowing us to use σ = L as our estimate for Proposition 5. Cross-entropy error Consider cross-entropy error for a d-class classification problem. Thus, Y = Rd and T = 1, . . . , d. For y Rd and t T , write pt(y) = eyt Σjeyj (C.21) and consider the error function ℓ(y, t) = log(pt(y)). (C.22) The first and second partial derivatives of ℓare given by ℓ yi (y, t) = pi δit, 2ℓ yiyj (y, t) = δijpi pipj. (C.23) Consider again the RMS norm on Y. An estimate on σ can thus be computed as max z RMS=1 using the basic fact that if p1, . . . , pd are non-negative numbers that sum to 1, then p2 1 + . . . + (pt 1)2 + . . . + p2 d log(pt). (C.25) (Indeed, for fixed pt, the left hand side is maximized at p1 = 1 pt and all other pi = 0; one then easily checked that 2(p 1)2 log(p) for all 0 < p 1.) A similar concavity argument to the square error case then enables us to use σ = L as the first derivative bound for average cross-entropy loss. The second derivative bound depends on more subtle information geometry. Indeed, τ can be computed to be τ = d λ (C.26) where λ is the largest eigenvalue of the matrix diag(p) pp T . It is possible for this eigenvalue to be quite large (for instance, if p1 = p2 = 1/2 and all other pi = 0, then λ = 1/2). However, the average eigenvalue is 1 d 1 X p2 i d 1 If we presumed that, in the course of a gradient descent optimizing the weights of a module M, the output perturbations M w are only generically aligned with the eigenvectors of diag(p) pp T , then we could use the effective smoothness bound τ = 1. Perhaps this is a dubious assumption however. A more conservative, but perhaps still dubious, assumption comes from assuming that the logits y have roughly N(0, 1) entries at least this could be more or less true at initialization. 
In this case, the largest eigenvalue λ is with high probability much smaller than its worst-case value, justifying a correspondingly milder approximate smoothness bound on τ.

Figure 5: Comparing to a standard transformer implementation. Since we used our own well-normed GPT implementation for the experiments in this paper (here referred to as modula GPT), we wanted to check that its performance was on par with a standard nanoGPT implementation. These plots show learning rate sweeps for varying width and depth for Adam on nanoGPT, as well as Adam and normed Adam on modula GPT. Even without normed updates, the architectural changes and orthogonal initialization used in Modula seem to already improve transfer compared to nanoGPT.

Appendix D  Experimental Details

D.1 Datasets

All experiments with ResMLP and ResNet [45] are done with the CIFAR-10 [46] image dataset with standard train and test splits. For data augmentation on the training set, we use random crop, random horizontal flip and PyTorch AutoAugment. For the GPT [43] transformer experiments, we compared three different datasets: (a) the Shakespeare corpus, using character-level tokens [47]; (b) the TinyStories database [48], using sub-word level tokenization; (c) OpenWebText, using sub-word level tokenization [49]. No data augmentation was used on the language data. We used data splitting code from [44].

D.2 Architectures

Full details of the ResMLP, ResNet and GPT architectures we used are given in Appendix B.5 and Appendix B.7. In every experiment, we used: (a) cross-entropy loss with no weight decay; (b) block depth B = 2 for ResMLP and ResNet; (c) kernel size K = 3 for ResNet; (d) h = 8 heads for GPT, with query and value dimensions d_Q = d_V = d/h, where d is the embedding dimension (width); (e) context length 128 for GPT, except for the experiment in Appendix D.7.

Figure 6: Comparing mass allocation strategies. We train a ResMLP with width 64 and 2 layers per block on CIFAR-10. In the first sub-plot, titled "free mass", we set every atomic module to have unit mass, so that as depth is scaled the masses of the input and output layer become insignificant relative to the total mass of the hidden layers. In the other four subplots, we tare the total mass of the hidden layers to the value indicated in the subplot title. As can be seen, the taring strategy seems to work much better than the free mass strategy. So, at least in this experiment, it is good to keep a constant fraction of learning in the input and output layers even as depth is scaled.

D.3 Hardware

All experiments were run on NVIDIA GPUs using float32 precision. We used a combination of TITAN-RTX, RTX-3090, V100, Ada6000 and H100 devices. Each data point in the experiments takes up to 5 hours, depending on the computing device used. We ran over 1000 training runs in total.

D.4 Comparing to the standard nanoGPT architecture

Our implementation of GPT in Modula has certain differences from off-the-shelf architectures such as nanoGPT [44]. We would summarize the overall changes to the transformer architecture and training in the following three points: (I) the mathematical architecture has slightly different coefficients; (II) we initialize weight matrices to be orthogonal rather than Gaussian; (III) we train using normalized weight updates.
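Point (II) above refers to orthogonal initialization of the weight matrices. As a rough illustration (our own sketch, not the Modula source), one might initialize a Linear(d_out, d_in) weight orthogonally in PyTorch as follows; combined with the √(d_out/d_in) forward convention of Appendix B.1, this keeps the RMS norm of activations approximately preserved for nearly-square matrices.

```python
# Rough illustration of point (II): orthogonal (rather than Gaussian)
# initialization for a Linear(d_out, d_in) weight matrix in PyTorch.
# This is our own sketch, not the Modula source code.
import torch

def orthogonal_weight(d_out: int, d_in: int) -> torch.Tensor:
    """Return a d_out x d_in matrix with orthonormal rows or columns,
    so that x -> sqrt(d_out/d_in) * W x preserves RMS norms tightly
    for nearly-square shapes (cf. inequality B.4)."""
    W = torch.empty(d_out, d_in)
    torch.nn.init.orthogonal_(W)   # semi-orthogonal: W W^T = I or W^T W = I
    return W

W = orthogonal_weight(256, 128)
x = torch.randn(128)
y = (256 / 128) ** 0.5 * W @ x     # forward convention with sqrt(d_out/d_in)
print(x.pow(2).mean().sqrt(), y.pow(2).mean().sqrt())  # comparable RMS values
```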
The architectural choices we made were entirely informed by the desire for the network to be well-normed and have unit sensitivity: in particular, this means that the network enjoys favorable signal propagation properties. In the language of modules, these architectural changes can be summarized as follows:

(a) Each residual block in our architecture is of the form

(2L−1)/(2L) · Identity + (1/(2L)) · Block   (D.1)

where Block = Block_MLP or Block_Attn, compared to the Identity + (1/L) · Block form suggested for nanoGPT;
(b) we use scaled dot-product attention with 1/d_Q scaling, rather than 1/√d_Q;
(c) the forward functions of our Linear and Embed modules (see Appendix B.1) include explicit dimension-dependent scale factors (√(d_out/d_in) in the case of Linear);
(d) we use several additional scalar multiplications to keep the network at unit sensitivity: each Attention module (B.44) has a scalar factor of 1/3; each MLP module (B.45) has a scalar factor of √2; and the token and position embeddings (B.49) have scalar factors of 1/2.

In Figure 5, we ran a comparison of the performance of the standard (unnormed) Adam optimizer trained on OpenWebText with:

1. the nanoGPT architecture with Gaussian initialization;
2. our implementation of GPT with orthogonal initialization.

We found that, even without using the normed optimizer, our implementation with orthogonal initialization transferred learning rate better. We suggest that even the base Adam optimizer benefits from the above architectural changes.

Figure 7: Mass and learning rate sweeps across datasets of increasing difficulty. A small GPT architecture of width 128 and 3 transformer blocks was trained on the Shakespeare, TinyStories and OpenWebText datasets. We varied the learning rate as well as the total mass of the blocks. Optimal mass and learning rate seem to transfer reasonably well from TinyStories to OpenWebText, and less well from the much smaller Shakespeare dataset.

D.5 Full sweeps

In Figures 9 to 12, at the end of this Appendix, we report on full learning rate sweep experiments, across width and depth, for GPT on OpenWebText and TinyStories, and for ResMLP and ResNet on CIFAR-10. We consistently find that the normed Adam optimizer matches or outperforms unnormed Adam in both test and training loss, all the while exhibiting significantly better transfer across width. The difference in depth transfer is less stark; however, we posit that, in part, unnormed Adam is already benefiting from the architectural changes we made to improve depth scaling. Notice too that normed SGD consistently and significantly outperforms ordinary SGD, often coming close to or matching the performance of Adam. We would like to highlight this, since SGD has a significantly lower memory requirement than Adam, and does not require any tuning of β2.

D.6 Mass allocation

A novel feature of our normed optimization framework is the need to choose a mass parameter for each atomic module. We work in the context of networks of the form

Network = OutputLayer ∘ HiddenLayers ∘ InputLayer   (D.2)

where HiddenLayers = Block^{∘L}. We typically assume that InputLayer and OutputLayer have mass 1, and reset the mass of HiddenLayers by hand to a fixed total mass m > 0 by calling tare(HiddenLayers, m). In this Appendix, we explore some different aspects of the choice of m.
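As a toy illustration of why the choice of m matters, the following sketch computes the mass fractions of the input layer, hidden blocks and output layer, with and without taring; by Proposition 3, these fractions bound each part's share of feature learning under normed updates. The numbers here are hypothetical and not taken from Modula.

```python
# Quick illustration of how taring reallocates mass fractions between the
# input layer, output layer, and hidden blocks as depth L grows. By
# Proposition 3, these fractions bound each sub-module's share of feature
# learning under normed updates. Sketch only; numbers are not from Modula.
def mass_fractions(L, block_mass=1.0, tared_total=None):
    hidden = L * block_mass if tared_total is None else tared_total
    total = 1.0 + hidden + 1.0          # input layer + hidden blocks + output layer
    return {"input": 1.0 / total, "hidden": hidden / total, "output": 1.0 / total}

for L in [2, 8, 32]:
    print(L, mass_fractions(L), mass_fractions(L, tared_total=4.0))
# With "free mass" the input/output fractions vanish as L grows;
# with tare(HiddenLayers, 4.0) they stay at a constant fraction.
```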
First, we tested whether or not calling tare is necessary in the first place. Not using tare would leave HiddenLayers with its free mass of HiddenLayers.mass = L · Block.mass; accordingly, as L grows large, the feature learning allotment (see Proposition 3) for InputLayer and OutputLayer would shrink. Indeed, as the reader can see in Figure 6, this free mass arrangement for a ResMLP network on CIFAR-10, in which the mass of HiddenLayers grows with L, is very undesirable, and for good learning rate transfer with depth we must fix a mass.

Figure 8: Context length transfer. We trained GPTs of various context lengths using normed Adam. As can be seen, the learning rate transferred quite well across context length.

The mass m is thus left as a tunable parameter. We then tested the transferability of mass tuning. Specifically, we wanted to know:

1. whether one can tune m on a network of small width/depth, and expect that same m to be close to optimal on a larger network;
2. whether learning rate transfer across width/depth is itself dependent on selecting a good mass m;
3. how sensitive the tuning of m is: whether there is a broad range of acceptable masses, or whether certain precise values lead to big improvements in train or test loss.

Figures 3 and 6 answer Question 1 in the affirmative, in the context of ResMLP on CIFAR-10 and GPT on OpenWebText. Moreover, in the context of ResMLP on CIFAR-10, they give an answer to Questions 2 and 3: learning rate transfer occurs across a range of values of m. Figure 7 addresses Question 3 in the context of transformers, on three different datasets. Across all three datasets, a mass in the region m ≈ 5 to 10 is reasonable.

D.7 Context length

Additionally, we tested the dependence of the optimal learning rate for GPT training on OpenWebText on the context length; the results are in Figure 8. Interestingly, we report good transfer of the optimal learning rate from small contexts to long contexts.

D.8 Full sweep results

The next four pages of the Appendix list the results of our full learning rate sweeps over width/depth for GPT on OpenWebText and TinyStories, and for ResMLP and ResNet on CIFAR-10.

Figure 9: Learning rate transfer for GPT on OpenWebText. Training is done for 10k steps, at batch size 128, with SGD, Adam, and their normed versions. The total block mass for normed SGD/Adam is m = 5. Width scaling experiments are done at fixed depth 3, and depth scaling experiments are done at fixed width 128.

Figure 10: Learning rate transfer for GPT on TinyStories. Training is done for 10k steps, at batch size 128, with SGD, Adam, and their normed versions. The total block mass for normed SGD/Adam is m = 5. Width scaling experiments are done at fixed depth 3, and depth scaling experiments are done at fixed width 128.

Figure 11: Learning rate transfer for ResMLP. ResMLP architectures on CIFAR-10 are trained for 10k steps, at batch size 128, with SGD, Adam, and their normed versions. The total block mass for normed SGD/Adam is m = 1.
Width scaling experiments are done at fixed depth 3, and depth scaling experiments are done at fixed width 128. SGD normed SGD Adam normed Adam 32 64 128 256 512 1024 10 1100 101 10 310 210 1 10 1100 101 2 4 8 16 32 64 SGD normed SGD Adam normed Adam 32 64 128 256 512 1024 10 1100 101 10 310 210 1 10 1100 101 2 4 8 16 32 64 Figure 12: Learning rate transfer for Res Net. Res Net architectures on CIFAR-10 are trained for 10k steps, at batch size 128, with SGD, Adam, and their normed versions. The total block mass for normed SGD/Adam is m = 20. Width scaling experiments are done at fixed depth 3, and depth scaling experiments are done at fixed width 128. Appendix E Proofs Proposition 3: Feature learning is apportioned by mass To prove Proposition 3, it suffices to induct on the construction of a compound module M by composition and concatenation, with the atomic modules (where the inequality is just part of wellnormed-ness) as the base case. Indeed, suppose either M = M2 M1 or M = (M1, M2). Suppose wk is a weight for one of the atomic modules of M, and write m for the mass of this atomic module. Then wk is must be a weight of either M1 or M2; the inductive assumption is that wk Mi wk m Mi.mass w Mi (E.1) where i = 1 or 2 accordingly. Case 1. M = M2 M1 and wk is a weight of M1. From the chain rule we then must have: wk M wk = x M2 wk M1 wk (E.2) M2.sensitivity wk M1 wk by well-normed-ness (E.3) M2.sensitivity m M1.mass w M1 by assumption (E.4) m M.mass w M (E.5) where the last line is by the definition of the norm of module composition. Case 2. M2 M1 and wk is a weight of M2. The chain rule is not needed in this case, and we proceed straight from the inductive assumption: wk M wk = wk M2 wk (E.6) m M2.mass w M2 (E.7) m M.mass w M. (E.8) Case 3. M = (M1, M2). Given the symmetric roles of M1, M2, without loss of generality assume wk is a weight of M1. Then, wk M wk = wk M1 wk (E.9) m M1.mass w M1 (E.10) m M.mass w M. (E.11) This completes the proof. Proposition 4: Sharpness of residual networks Suppose M is a well-normed module of unit sensitivity on (X, X, W) and is (α, β, γ)-sharp. Then, by Proposition 8 for any L 1, the module 1 L M is well-normed, sensitivity 1 L, and (Lα, β, 1 The module L 1 L Identity is also well-normed, sensitivity L 1 L , and (0, 0, 0)-sharp. In particular, the sum Mres = L 1 L Identity + 1 L M (E.12) is well-normed, unit sensitivity, and (Lα, β, 1 Lγ)-sharp; it has the same mass as the original module. We induct on the statement for k = 1, 2, . . . that Mk res is (αk, βk, γk)-sharp where k α + 2 (1 + 2 + . . . + (k 1)) 12 + 22 + . . . + (k 1)2 Lk2 γ (E.13) βk = β + (1 + 2 + . . . + (k 1)) Lk γ (E.14) The base case is clearly true, and given the statement for Mk res, which has exactly k times the mass as Mres, we see that Mk+1 res = Mres Mk res is (αk+1, βk+1, γk+1)-sharp by applying Proposition 8 with p1 = k k+1 and p2 = 1 k+1, where (k + 1)2 αk + 1 (k + 1)2 Lα + 2k (k + 1)2 β + k2 L(k + 1)2 γ (E.16) βk+1 = k k + 1βk + 1 k + 1β + k L(k + 1)γ (E.17) γk+1 = γk + 1 which yields the induction. Setting k = L, observe that 1 + 2 + . . . + (L 1) = 1 2L(L 1) and 12 + 22 + . . . + (L 1)2 = 1 6L(L 1)(2L 1), giving αL = α + L 1 L β + L(L 1)(2L 1) 6L3 γ α + β + 1 βL = β + L 1 γL = γ (E.21) which proves the result. 
Proposition 5: Smoothness in the modular norm To establish the first inequality, we start by applying the Gauss-Newton decomposition (C.1) of the Hessian, followed by the fact that the error ℓis σ-Lipschitz and τ-smooth, followed by the well-normedness and (α, β, γ)-sharpness of the module M: | w 2 ww L e w| (E.22) = Ex,y D yℓ w 2 ww M e w + ( w M w) 2 yyℓ ( w M e w) (E.23) Ex,y D σ w 2 ww M e w Y + τ w M w Y w M e w Y (E.24) (σα + τ) w M e w M. (E.25) The second inequality follows from the first via the fundamental theorem of calculus: w L(w + w) w L(w) M = max e w M=1 |[ w L(w + w) w L(w)] e w| (E.26) max e w M=1 0 | w 2 ww L(w + t w) e w| dt (E.27) max e w M=1(σα + τ) w M e w M 0 dt (E.28) = (σα + τ) w M. (E.29) The third inequality follows from the second by again applying the fundamental theorem of calculus, followed by the Cauchy-Schwarz inequality: |L(w + w) [L(w) + w L(w) w]| (E.30) 0 [ w L(w + t w) w L(w)] w dt (E.31) 0 w L(w + t w) w L(w) M w M dt (E.32) (σα + τ) w 2 M 0 t dt (E.33) 2 (σα + τ) w 2 M. (E.34) This completes the proof. Proposition 6: Broadcast modules are well-normed Suppose M is a module with inputs X, outputs Y and weights W, broadcast to take X h to Yh. We take norms on these spaces to be (x1, . . . , xh) X h = S ( x1 p X + . . . + xh p X )1/p (E.35) (y1, . . . , yh) Yh = S y1 p Y + . . . + yh p Y 1/p (E.36) where S = h 1/p unless M is a bond module. Write µ = M.sensitivity. Then, for perturbations in the weight direction, which only occur if M is not a bond module: w M(w, x1, . . . , xh) w Yh = j w M(w, xj) w p Y w M applying well-normed-ness. (E.38) For perturbations in the input direction, we have: x1,...,xh M ( x1, . . . , xh) Yh = S j xj M xj p Y j µp xj p Y = µ ( x1, . . . , xh) Yh (E.41) which proves the proposition. Proposition 7: Sensitivity of attention We prove that the functional attention module Func Attention of Bond 9 is well-normed and of unit sensitivity. Recall we use the following norms on the inputs X = Rℓ d Q Rℓ d Q Rℓ d V and outputs Y = Rℓ d V : (q, k, v) X = q RMS + k RMS + v RMS, y Y = y RMS. (E.42) We will also make use of the L -operator norm for ℓ ℓmatrices, which we write as B op = max i=1,...,ℓ observe that for B Rℓ ℓand x Rℓ d we have B x RMS B op x RMS. (E.44) Writing F = Func Attention.forward for short, recall that F(q, k, v) = softmax( 1 d Q qk T + M)v (E.45) where M is the mask (our proof will apply equally for the standard causal mask and also the non-causal M 0). We will prove that at any (q, k, v) satisfying q RMS, k RMS, v RMS 1, for any ( q, k, v) we have F(q, k, v) ( q, k, v) Y ( q, k, v) X . (E.46) For short, write A = softmax( 1 d Q qk T + M) for the attention matrix and its derivative as A = (q,k) softmax( 1 d Q qk T + M) ( q, k). (E.47) Now, the derivative of F splits into two terms F ( q, k, v) = A( v) + ( A)v. (E.48) To complete the proof, we claim that A op = 1 and A op q RMS + k RMS. (E.49) The calculation of the norm of A follows by definition from its construction by softmax. For the calculation of the norm of A, a direct calculation yields that Aij = 1 d Q Aij qi, kj Σm Aimkm + 1 d Q Aij qi, kj Σm Aim km (E.50) where we are writing qi = qi and so on. Taking absolute values, applying the Cauchy-Schwarz inequality and summing over j we have Σj| Aij| qi RMS (Σj Aij kj Σm Aimkm RMS) (E.51) + qi RMS (Σj Aij kj Σm Aim km RMS) . (E.52) We now use the following inequality: given any non-negative reals p1, . . . , pℓwhich sum to 1, and any vectors x1, . . . 
, xℓin an inner product space with norm , we have by Jensen s inequality Σjpj xj Σmpmxm Σjpj xj Σmpmxm 2 1 = Σjpj xj 2 Σjpjxj 2 1 Σjpj xj 2 1 max j xj . (E.56) Applying to the matrix A, we thus have Σj|Aij| qi RMS max j kj RMS + qi RMS max j kj RMS. (E.57) Taking the max over i, this shows the L -operator-norm of A is at most q RMS k RMS + q RMS k RMS (E.58) which, since q RMS, k RMS 1, completes the proof. Proposition 7: Sharpness of functional attention In this section, we estimate the second derivative of the forward function F of functional attention at (q, k, v) in perturbation directions ( q, k, v) and ( eq, ek, ev): 2F := ( eq, ek, eq) 2F ( q, k, v). (E.59) We will prove that functional attention is γ-sharp where in fact γ = 3; this amounts to proving that 2F 3 ( q, k, v) X ( eq, ek, ev) X . (E.60) We continue with all the notation of the previous section. Moreover, to simplify the calculation, we suppress all factors of 1 d Q (indeed, one can absorb them as a rescaled inner product , ). We also, in addition to the shorthand xi = xi for ℓ d matrices x, we adopt the shorthand for an ℓ ℓmatrix B and a ℓ d matrix x, and any i, j = 1, . . . , ℓ: [B, x]ij := xj Σm Bimxm. (E.61) We note three crucial inequalities regarding [B, x], for any ℓ ℓmatrix B with non-negative entries whose rows sum to 1, and ℓ d matrices x, y:: Σj Bij [B, x]ij max j xj ; (E.62) Σj Bij [B, x]ij 2 max j xj 2; (E.63) Σj Bij [B, x]ij [B, y]ij (max j xj )(max j yj ). (E.64) All three inequalities follow from standard expectation/variance inequalities for random variables on the finite set {1, . . . , ℓ} with distributions given by Bi1, . . . , Biℓ. With these conventions, the expression for A is thus Aij = Aij qi, [A, k]ij + Aij qi, [A, k]ij . (E.65) Let us also write e A : = (q,k) softmax( 1 d Q qk T + M) ( eq, ek) (E.66) e Aij = Aij eqi, [A, k]ij + Aij qi, [A, ek]ij . (E.67) as well as 2A := ( eq, ek) 2F ( q, k). (E.68) In these terms, the second derivative 2F is just 2F = ( e A)( v) + ( A)( ev) + ( 2A)v. (E.69) From the estimates of the previous section, we have ( e A)( v) RMS ( eq RMS + ek RMS) v RMS (E.70) ( A)( ev) RMS ( q RMS + k RMS) ev RMS (E.71) so our task is to estimate the L -operator-norm of 2A. Thus, we calculate 2A: 2Aij = Aij qi, [A, ek]ij (E.72) + Aij eqi, [A, k]ij (E.73) + e Aij qi, [A, k]ij (E.74) + e Aij qi, [A, k]ij (E.75) + Aij qi, Σm( e A)imkm (E.76) + Aij qi, Σm( e A)im km (E.77) We estimate the L -operator-norm of these six terms one by one. The first (E.72), (E.73) are the simplest, using inequality (E.62): max i Σj|Aij qi, [A, ek]ij | max i Σj Aij qi [A, ek]ij (E.78) max i qi max j kj (E.79) = q RMS ek RMS (E.80) max i Σj|Aij eqi, [A, k]ij | eq RMS k RMS likewise. (E.81) For the term (E.74), we have e Aij qi, [A, k]ij = Aij eqi, [A, k]ij + Aij qi, [A, ek]ij qi, [A, k]ij (E.82) Take absolute values, sum over j, and apply Cauchy-Schwarz and inequalities (E.63),(E.64): Σj| e Aij qi, [A, k]ij | Σj Aij qi eqi [A, k]ij 2 + qi qi [A, k]ij [A, ek]ij qi eqi max j kj 2 + qi qi (max j kj )(max j ekj ). Taking the max over i and applying q RMS, k RMS, v RMS 1: max i Σj| e Aij qi, [A, k]ij | q RMS eq RMS + q RMS ek RMS. (E.85) The term (E.75) is similar: max i Σj| e Aij qi, [A, k]ij | k RMS eq RMS + k RMS ek RMS (E.86) For term (E.76), observe that max i Σm( e A)imkm e A op k RMS (E.87) eq RMS + ek RMS (E.88) and so by Cauchy-Schwarz and the fact that the rows of A sum to 1: max i Σj|Aij qi, Σm( e A)imkm | max i qi Σm( e A)imkm (E.89) q RMS eq RMS + q RMS ek RMS. 
(E.90) By a similar argument, for term (E.77) we have: Aij qi, Σm( e A)im km k RMS eq RMS + k RMS ek RMS (E.91) Thus, we have an estimate on the L -operator-norm of 2A: 2A op 2 q eq + 3 q ek + 3 k eq + 2 k ek (E.92) where all the norms on the right hand side are RMS. Adding this together with (E.70) and (E.71), we obtain (all norms being RMS: 2F 2 q eq + 3 q ek + 3 k eq + 2 k ek (E.93) + v eq + v ek + q ev + k ev (E.94) 3( q + k + v )( eq + ek + ev ) (E.95) which is the desired result. Proposition 8: Sharpness under composition Suppose M = M2 M1 where M1, M2 are well-normed modules on respectively (Xk, Yk, Wk) and moreover (αk, βk, γk)-sharp for k = 1, 2. If pk = Mk.mass M.mass for k = 1, 2, note that by the definition of the modular norm on the composite M, we have for any w = ( w1, w2) W1 W2: w1 M1 1 µ2 p1 w M and w2 M2 p2 w M. (E.96) We must prove that M is (α, β, γ) sharp where: α = 1 µ2 p2 1α1 + p2 2α2 + 2 µ2 p1p2β2 + 1 µ2 2 p2 1γ2, (E.97) β = p1β1 + µ1p2β2 + µ1 µ2 p1γ2, (E.98) γ = µ2γ1 + µ2 1γ2. (E.99) Turning to the second derivative of M( , ), we prove the first Inequality (E.97). The Gauss-Newton decomposition (C.1) for any w = ( w1, w2) and e w = e w1, e w2 yields w 2M e w = M2 ( w1 2M1 e w1) (E.100) + ( w2, M1 w1) 2M2 ( e w2, M1 e w1) (E.101) Applying the well-normed and sharpness inequalities, the norm of the first (E.100) of these terms is bounded by µ2 w1 2M1 e w1 Y1 µ2α1 w1 M1 e w1 M1 (E.102) 1 µ2 p2 1α1 w M e w M. (E.103) The second term (E.101) breaks into four separate terms: w2 2 ww M2 e w2 (E.104) +( M1 w1) 2 xw M2 f w2 (E.105) + w2 2 wx M2 ( M1 e w1) (E.106) +( M1 w1) 2 xx M2 ( M1 e w1). (E.107) In particular, applying the well-normed and sharpness inequalities, this is bounded by α2 w2 M2 e w2 M2 (E.108) +β2 w1 M1 e w2 M2 (E.109) +β2 w2 M2 e w1 M1 (E.110) +γ2 w1 M1 e w1 M1, (E.111) which is less than p2 2α2 + 2 µ2 p1p2β2 + 1 µ2 2 p2 1γ2 w M w M (E.112) which completes the proof of Inequality (C.4). Inequalities (E.98) and (E.99) are simpler. For the first of these, note we have w 2M x = M2 ( w1 2M1 x) (E.113) + ( w2, M1 w1) 2M2 ( M1 x). (E.114) Term (E.113) is bounded by µ2 w1 2M1 x Y1 µ2β1 w1 M1 x X1 (E.115) p1β1 w M x X1 (E.116) Term (E.114) breaks into two separate terms w2 2 wx M2 ( M1 x) + ( M1 w1) 2 xx M2 ( M1 x). (E.117) In particular this is bounded by β2 w2 M2µ1 x X1 + γ2 w1 M1µ1 x X1 µ1p2β2 + µ1 µ2 p1γ2 w M x X1 (E.118) which completes the proof of Inequality (E.98). Finally, for (E.99), we have x 2M ex = M2 ( x 2M1 ex) (E.119) + ( M1 x) 2M2 ( M1 ex). (E.120) Term (E.119) is bounded by µ2 x 2M1 ex Y1 µ2γ1 x X1 ex X1 (E.121) while Term (E.120) is bounded by γ2 M1 x X2 M1 ex X2 µ2 1γ2 x X1 ex X1 (E.122) which together give Inequality (E.99). Proposition 9: Sharpness under concatenation Suppose M = (M1, M2) where M1, M2 are well-normed modules on respectively (Xk, Yk, Wk) and moreover (αk, βk, γk)-sharp for k = 1, 2. If pk = Mk.mass M.mass for k = 1, 2, as in the previous proof we have for any w = ( w1, w2) W1 W2: w1 M1 1 µ2 p1 w M and w2 M2 p2 w M. (E.123) We must prove that M is (α, β, γ)-sharp where α = p2 1α1 + p2 2α2, (E.124) β = p1β1 + p2β2, (E.125) γ = γ1 + γ2. (E.126) Now, for the first of these identities, we have for w = ( w1, w2) and e w = ( e w1, e w2): w 2M e w Y1 Y2 = ( w1 2M1 e w1, w2 2M2 e w2) Y1 Y2 (E.127) = w1 2M1 e w1 Y1 + w2 2M2 e w2) Y2 (E.128) α1 w1 2 M1 + α2 w2 2 M2 (E.129) (p2 1α1 + p2 2α2) w 2 M (E.130) which shows α = p2 1α1 + p2 2α2. The expressions for β, γ follow similarly. 
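Remark 1 invites the reader to verify associativity with a computer algebra package; the following sympy sketch does exactly this for the composition formulas of Proposition 8 (our own check, not part of Modula).

```python
# A sympy sketch of the associativity check suggested in Remark 1: composing
# three modules as M3∘(M2∘M1) or (M3∘M2)∘M1 yields the same (alpha, beta, gamma)
# under the formulas of Proposition 8. Our own check, not part of Modula.
import sympy as sp

a1, b1, g1, a2, b2, g2, a3, b3, g3 = sp.symbols("a1 b1 g1 a2 b2 g2 a3 b3 g3", positive=True)
m1, m2, m3 = sp.symbols("m1 m2 m3", positive=True)          # masses
u1, u2, u3 = sp.symbols("u1 u2 u3", positive=True)          # sensitivities

def compose(inner, outer):
    """Sharpness of outer∘inner per Proposition 8; each module is a dict."""
    mass = inner["m"] + outer["m"]
    p1, p2 = inner["m"] / mass, outer["m"] / mass
    mu1, mu2 = inner["mu"], outer["mu"]
    a = p1**2 * inner["a"] / mu2 + p2**2 * outer["a"] \
        + 2 * p1 * p2 * outer["b"] / mu2 + p1**2 * outer["g"] / mu2**2
    b = p1 * inner["b"] + mu1 * p2 * outer["b"] + (mu1 / mu2) * p1 * outer["g"]
    g = mu2 * inner["g"] + mu1**2 * outer["g"]
    return {"a": a, "b": b, "g": g, "m": mass, "mu": mu1 * mu2}

M1 = {"a": a1, "b": b1, "g": g1, "m": m1, "mu": u1}
M2 = {"a": a2, "b": b2, "g": g2, "m": m2, "mu": u2}
M3 = {"a": a3, "b": b3, "g": g3, "m": m3, "mu": u3}

left, right = compose(compose(M1, M2), M3), compose(M1, compose(M2, M3))
print([sp.simplify(left[k] - right[k]) for k in ("a", "b", "g")])   # [0, 0, 0]
```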
Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: Our main claim is that normalizing Adam and SGD updates in the modular norm leads to good learning rate transfer across width and depth. We believe this claim is supported by the experiments in our paper. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We discuss limitations in the discussion section ( 5). Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: We state theoretical results as formal propositions and provide their proofs in Appendix E. Guidelines: The answer NA means that the paper does not include theoretical results. 
All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: See Appendix A for an overview of our code, Appendix B for the detailed network architectures and Appendix D for the parameters of our experiments. In addition, we provide the source code for our experiments and the Modula package. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. 
Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We make use of standard datasets and provide our code in the supplemental materials. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: See Appendix D for the full details of our experiments. Also, see our code. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [No] Justification: Due to computational resource constraints, we opted to run a large number of experiments to check that trends hold across several distinct architectures and datasets, rather than running repeats on each individual experiment. Each hyperparameter sweep involves on the order of 100 training runs, and we are working under academic resource limits. We believe that the fact the reported trends hold across varied experimental settings supports the significance of our results. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. 
The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We report on this in Appendix D.3. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We believe that no ethics guidelines were violated. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: No societal impact of the work was discussed. Potentially the work could have a positive impact in terms of reducing carbon emissions caused by sweeping hyperparameters for large-scale models. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. 
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We are not releasing any new datasets or pre-trained models.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We only use public and open-source resources, and we have cited these works. Licenses were not provided by the original sources.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., a website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We provide code and instructions on how to use the new modules that we define.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We did not crowdsource and we did not use human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We did not use human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.