
Modular Duality in Deep Learning

Jeremy Bernstein¹, Laker Newhouse¹

An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers; the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.

1 Introduction

This paper pursues a rigorous and first-principles theoretical framework for designing neural network training algorithms. We hope that such a framework will facilitate the design of a next generation of fast and scalable optimizers that are automatically tailored to different neural architectures.

While gradient descent is the workhorse of modern machine learning, the most vanilla form of the algorithm does not, in our view, pass a basic type check. For a gradient update to type check, we insist that the gradient must be passed through a duality map before being multiplied by a learning rate and applied to the weights:

¹MIT CSAIL, United States. Correspondence to: Jeremy Bernstein <jbernstein@mit.edu>, Laker Newhouse <lakern@mit.edu>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

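$$\begin{aligned}
\texttt{weight} - \texttt{LR} * \texttt{weight.grad} &\qquad \text{type error!} \\
\texttt{weight} - \texttt{LR} * \texttt{dualize(weight.grad)} &\qquad \text{all good!}
\end{aligned}$$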

Why? The reason is that the loss function may not be equally smooth in all directions in weight space, and there is no reason for the sizes of different components of the raw gradient vector¹ to respect this heterogeneity. In other words, the geometry of the loss function may be non-isotropic. Insisting on a type check should force the user to become cognizant of this issue and to find a suitable duality map. A good duality map should adjust the size and direction of the gradient to respect the smoothness structure of the loss function.

Duality maps on vector spaces are commonplace in physics and applied math. Examples include the musical isomorphism in differential geometry (Grosse, 2022), raising and lowering indices in general relativity (Carroll, 2019) and the bra-ket notation in quantum mechanics (Sakurai & Napolitano, 2020). Duality maps are also central to several optimization theories including mirror descent (Nemirovsky & Yudin, 1983), natural gradient descent (Amari, 2016) and steepest descent on a normed space (Boyd & Vandenberghe, 2004). Despite the efforts of some prescient papers (Carlson et al., 2015b; Flynn, 2017), the latter kind of norm-based duality map is yet to puncture the deep learning mainstream.

We believe that duality is a key theoretical concept that will help in building performant large-scale machine learning systems. To support this belief, we show in this paper that two important and seemingly disparate methods in contemporary optimization research may be seen as approximations to a single duality map. These methods are maximal update parameterization (Yang & Hu, 2021, µP), which is aimed at scalable training, and Shampoo (Shi et al., 2023), which is targeted at fast training. We show in Section 4.1 that both methods emerge as partial approximations to a single duality map induced by the RMS → RMS operator norm.

¹In this paper, we use the term gradient to mean the partial derivative of the loss function. This is in accordance with the terminology used in machine learning software libraries such as PyTorch (Paszke et al., 2019) and JAX (Bradbury et al., 2018). We chose this verbiage to help make the paper easily accessible to a broad machine learning audience. However, it is worth noting that this terminology is at odds with common mathematical parlance (Blondel & Roulet, 2024) where a gradient is a primal vector obtained by applying some form of duality map to the partial derivative of the loss function.


1.1 Summary of contributions

First contribution. We describe a procedure for constructing duality maps for general neural architectures. The procedure, called modular dualization, works in three steps:

Step 1: Operator norms are assigned to individual layers based on the input-output semantics of each layer;

Step 2: Based on these operator norms, duality maps are constructed for individual layers;

Step 3: Given the layerwise duality maps and the structure of the neural architecture, a single duality map is recursively induced on the full weight space of the architecture.

Second contribution. We instantiate modular dualization for a rich family of neural architectures including convolutional networks and transformers by writing down duality maps for Linear, Embed and Conv2D layers. We also provide GPU-friendly algorithms for computing these duality maps efficiently. One of these methods has already been applied in the Muon optimizer (Jordan et al., 2024b).

Third contribution. We run two experiments. In the first, we find that duality-based optimizers are fast and scalable across width (see Figure 1). In the second, we see that duality-based training exhibits novel numerical properties: the weights move substantially further from their initial values than for non-dualized training (see Figure 2).

2 Related Work

This paper constructs a duality map for general neural architectures. Our approach is based on assigning operator norms to individual network layers and using these layerwise norms to recursively induce a duality map on the full neural architecture. The most closely related prior work is a series of papers on spectral descent (Carlson et al., 2015a;b; 2016) and a paper on duality structure gradient descent (Flynn, 2017).

Spectral descent has been applied to restricted Boltzmann machines (Carlson et al., 2015a) and discrete graphical models (Carlson et al., 2016), but let us focus on the more closely related paper on spectral descent for deep learning (Carlson et al., 2015b). In that paper, the authors propose assigning the Schatten-$\infty$ norm (a.k.a. spectral norm) to individual linear layers. This assignment is based on the observation that neural networks admit natural majorization bounds in the Schatten-$\infty$ norm. The authors call the corresponding duality map for linear layers the #-operator, a name presumably inspired by the musical isomorphism (Grosse, 2022). The authors propose a cheap approximation to the #-operator based on sketching (Martinsson & Tropp, 2020), and they also propose a way to mix RMSprop-style preconditioning information (Tieleman & Hinton, 2012) into the weight updates. In contrast to our work, the authors only derive duality maps for single linear layers, and these maps are then heuristically extended to all-layer updates. Nonetheless, the authors achieve substantial wall clock speedups using variants of spectral descent to train small networks.

Now, let us turn our attention to duality structure gradient descent (Flynn, 2017), which constructs a duality map on the full weight space of the neural architecture based on identifying a Finsler structure (Deimling, 1985) inherent to neural networks. Similar to modular dualization, Flynn (2017)'s duality map works by assigning duality maps to each layer and then inducing a duality map on the full weight space. The substantial difference to our approach is that Flynn (2017) leverages a weighted sum ($\ell_1$ combination) of layerwise norms to construct his full duality map. This leads to optimization methods that only update a single layer at each iteration, and the methods need to be heuristically extended to achieve all-layer updates. In contrast, we leverage the modular norm (Large et al., 2024), which takes a weighted max ($\ell_\infty$ combination) of layerwise norms. In turn, our duality map leads directly to more conventional all-layer optimizers.

Another important difference between our work on modular duality and prior work on duality structure gradient descent is that we fully modularize our theory, meaning that our construction is explicitly recursive and as such is easy to code up into a software package. In this regard, we are inspired by a line of work that attempts to build optimization algorithms that automatically adapt to the structure of general computation graphs. The earliest work we know of in this category is the PhD thesis of Grant (2004) on disciplined convex programming, which aims to infer the convexity properties of general functions by breaking them up into subexpressions and applying composition theorems from convex analysis. More recent progress in this vein includes work on universal majorization-minimization algorithms (Streeter & Dillon, 2022; Streeter, 2023) and related papers on automatic majorization (Tran et al., 2015; Bernstein et al., 2023).

3 Theoretical Preliminaries

In this section, we introduce duality maps, a means of constructing duality maps based on norms, and finally a norm called the modular norm that is well-suited to describe the geometry of general neural architectures.

3.1 Duality Maps

Given a vector space $V$, we say that a function $f : V \to \mathbb{R}$ is a linear functional on $V$ if $f$ is linear. We define the dual space $V^*$ to be the set of linear functionals on $V$. The dual space is itself a vector space provided that addition is defined pointwise $(f + g)(x) := f(x) + g(x)$ and scalar multiplication is defined pointwise $(\alpha f)(x) := \alpha f(x)$ for any scalar $\alpha$. By duality map we mean any function that sends a member of the dual vector space $V^*$ to the primal vector space $V$. The function need not be an involution.

Let $\mathcal{L} : \mathcal{W} \to \mathbb{R}$ denote the loss of a differentiable machine learning model with weight space $\mathcal{W} = \mathbb{R}^n$. The Taylor expansion of the loss at weight setting $w \in \mathcal{W}$ is given by:

$$\mathcal{L}(w + \Delta w) = \mathcal{L}(w) + \nabla_w \mathcal{L}(w)^\top \Delta w + \text{higher-order terms}. \tag{1}$$

Observe that, in the first-order term, the gradient $\nabla_w \mathcal{L}(w)$ is acting as a linear functional: it is pairing with the weight vector $\Delta w \in \mathcal{W}$ in a linear way to produce a real number. As such, we shall say that the gradient belongs to the dual weight space: $\nabla_w \mathcal{L}(w) \in \mathcal{W}^*$. We shall forbid ourselves from directly subtracting a member of the dual weight space $\mathcal{W}^*$ from the weight space $\mathcal{W}$. If we would like to conduct a gradient descent update, then we had better find a duality map to send the gradient back to the primal space $\mathcal{W}$.

This restriction may seem absurd! After all, here the weight space $\mathcal{W}$ and its dual $\mathcal{W}^*$ are both just $\mathbb{R}^n$. However, insisting upon this type check serves to remind us that the curvature of the loss function may be highly heterogeneous. The next section will show one way to construct duality maps to account for this.

3.2 Steepest Descent on a Normed Space

Suppose that we have found a norm $\|\cdot\| : \mathcal{W} \to \mathbb{R}$ and a sharpness parameter $\lambda > 0$ that serve as a good model of the higher-order terms in the Taylor expansion of the loss function given in Equation (1):

$$\mathcal{L}(w + \Delta w) \approx \mathcal{L}(w) + \nabla_w \mathcal{L}(w)^\top \Delta w + \frac{\lambda}{2} \|\Delta w\|^2. \tag{2}$$

In other words, the norm provides a good characterization of the heterogeneity in curvature of the loss function. Then it makes sense to solve for a weight update $\Delta w$ by minimizing the right-hand side of Equation (2). We will show that the minimizer can be expressed in terms of a dual norm and a duality map:

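Definition 1 (Dual norm). Given a norm $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$, the dual norm $\|\cdot\|^\dagger$ of a vector $g \in \mathbb{R}^n$ is given by:

$$\|g\|^\dagger := \max_{t \in \mathbb{R}^n : \|t\| = 1} g^\top t.$$

Definition 2 (Duality map based on a norm). Given a norm $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$, we consider the duality map:

$$\operatorname{dualize}_{\|\cdot\|}\, g := \operatorname*{arg\,max}_{t \in \mathbb{R}^n : \|t\| = 1} g^\top t,$$

where, if the arg max is not unique, $\operatorname{dualize}_{\|\cdot\|}$ returns any maximizer.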

Given these definitions, minimizing the expression in the right-hand side of Equation (2) can be done using the following standard proposition, for which Bernstein & Newhouse (2024) provide a proof:

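Proposition 1 (Steepest descent under a norm). For any vector $g \in \mathbb{R}^n$ thought of as "the gradient", any $\lambda \ge 0$ thought of as "the sharpness", and any norm $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ with dual norm $\|\cdot\|^\dagger$ and duality map $\operatorname{dualize}_{\|\cdot\|}$:

$$\operatorname*{arg\,min}_{\Delta w \in \mathbb{R}^n} \left[ g^\top \Delta w + \frac{\lambda}{2} \|\Delta w\|^2 \right] = -\frac{\|g\|^\dagger}{\lambda} \times \operatorname{dualize}_{\|\cdot\|}\, g.$$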

In words: to find the minimizer of a linear term penalized by a squared norm, we need only evaluate the dual norm and a duality map. In this paper, we focus on constructing a duality map for the modular norm, which is a norm tailored to the weight space of general neural architectures. But first, we shall cover duality maps for more standard norms.
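The following is a brief sketch of why Proposition 1 holds. It is the standard argument written out under the assumptions $\lambda > 0$ and $g \neq 0$, and is not a reproduction of the cited proof. Write the update as $\Delta w = c\,t$ with magnitude $c \ge 0$ and unit-norm direction $t$:

```latex
\begin{align*}
\min_{\Delta w \in \mathbb{R}^n} \left[ g^\top \Delta w + \tfrac{\lambda}{2} \|\Delta w\|^2 \right]
&= \min_{c \ge 0} \left[ c \min_{\|t\| = 1} g^\top t + \tfrac{\lambda}{2} c^2 \right] \\
&= \min_{c \ge 0} \left[ -c\, \|g\|^\dagger + \tfrac{\lambda}{2} c^2 \right].
\end{align*}
```

The inner minimum is attained at $t = -\operatorname{dualize}_{\|\cdot\|}\, g$ because a norm is symmetric under $t \mapsto -t$. The remaining quadratic in $c$ is minimized at $c = \|g\|^\dagger / \lambda$, which gives $\Delta w = -(\|g\|^\dagger / \lambda) \times \operatorname{dualize}_{\|\cdot\|}\, g$.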

3.3 Basic Norms and Duality Maps

Many basic norms and duality maps are already covered in prior work (Carlson et al., 2016; 2015a;b; Flynn, 2017). For some warmup examples, the following duality maps for vector norms are standard:

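Example 1 (Duality map for the Euclidean norm). For a nonzero vector $g \in \mathbb{R}^d$, we have $\operatorname{dualize}_{\|\cdot\|_2}\, g = g / \|g\|_2$. For the zero vector, we take $\operatorname{dualize}_{\|\cdot\|_2}\, 0 = 0$.

Example 2 (Duality map for the infinity norm). For a vector $g \in \mathbb{R}^d$, we have $\operatorname{dualize}_{\|\cdot\|_\infty}\, g = \operatorname{sign}(g)$, where the sign function is applied entrywise and we take $\operatorname{sign}(0) = 0$.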
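Examples 1 and 2 can be combined with Proposition 1 in a few lines of NumPy. This is a minimal sketch: the function names `dualize_l2`, `dualize_linf` and `steepest_step`, and the particular values of `g` and `lam`, are illustrative choices and do not come from the paper.

```python
import numpy as np

def dualize_l2(g):
    """Example 1: rescale to unit Euclidean norm; the zero vector maps to zero."""
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else np.zeros_like(g)

def dualize_linf(g):
    """Example 2: entrywise sign, with sign(0) = 0."""
    return np.sign(g)

def steepest_step(g, lam, dual_norm, dualize):
    """Proposition 1: the minimizer of g^T dw + (lam / 2) * ||dw||^2 (assumes lam > 0)."""
    return -(dual_norm(g) / lam) * dualize(g)

g = np.array([3.0, -4.0, 0.0])   # illustrative gradient
lam = 2.0                        # illustrative sharpness

# The Euclidean norm is its own dual, so the step is plain gradient descent: -g / lam.
step_l2 = steepest_step(g, lam, np.linalg.norm, dualize_l2)

# The dual of the infinity norm is the l1 norm, so the step is a scaled sign update.
step_linf = steepest_step(g, lam, lambda v: np.abs(v).sum(), dualize_linf)
```

Under the Euclidean norm the step reduces to gradient descent with learning rate $1/\lambda$, while under the infinity norm every nonzero coordinate moves by the same amount $\|g\|_1 / \lambda$.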

In neural networks, the weight spaces of individual layers tend to have matrix structure. And layers with the same shape weight matrix may have semantically different input and output spaces (think embedding versus linear layers in a transformer). As such, we will need duality maps for different induced operator norms:

Definition 3 (Induced operator norm). Given a matrix $M \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ and two normed vector spaces $(\mathbb{R}^{d_\text{in}}, \|\cdot\|_\alpha)$ and $(\mathbb{R}^{d_\text{out}}, \|\cdot\|_\beta)$, the "$\alpha$ to $\beta$" induced operator norm is:

$$\|M\|_{\alpha \to \beta} = \max_{x \in \mathbb{R}^{d_\text{in}}} \frac{\|Mx\|_\beta}{\|x\|_\alpha}.$$

For tensors, we define the duality map via

$$\operatorname{dualize}_{\|\cdot\|}\, G := \operatorname*{arg\,max}_{\|T\| = 1} \operatorname{flatten}(G)^\top \operatorname{flatten}(T).$$

For linear layers, we will need the duality map for the RMS → RMS induced operator norm. This ends up as a rescaled version of the spectral norm duality map from prior work (Carlson et al., 2015b; Flynn, 2017).

Example 3 (Duality map for the RMS → RMS operator norm). For a vector $v \in \mathbb{R}^d$, we define the RMS norm to be the normalized Euclidean norm: $\|v\|_\text{RMS} = \|v\|_2 / \sqrt{d}$. Given a matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$, the RMS → RMS induced operator norm resolves to a rescaled spectral norm: $\|W\|_{\text{RMS} \to \text{RMS}} = \sqrt{d_\text{in} / d_\text{out}} \times \|W\|_*$, where $\|\cdot\|_*$ denotes the standard spectral norm. For a matrix $G \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ with reduced singular value decomposition $G = U \Sigma V^\top$, the corresponding duality map is given by $\operatorname{dualize}_{\|\cdot\|_{\text{RMS} \to \text{RMS}}}\, G = \sqrt{d_\text{out} / d_\text{in}} \times U V^\top$.

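And for embedding layers, we will need the duality map for the $\ell_1$ → RMS operator norm:

Example 4 (Duality map for the $\ell_1$ → RMS operator norm). Given a matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$, the $\ell_1$ → RMS induced operator norm resolves to the max RMS norm of the columns: $\|W\|_{\ell_1 \to \text{RMS}} = \max_i \|\operatorname{col}_i(W)\|_\text{RMS}$. For a matrix $G \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$, the corresponding duality map $\operatorname{dualize}_{\|\cdot\|_{\ell_1 \to \text{RMS}}}\, G$ simply normalizes each column of $G$ to have unit RMS norm: $\operatorname{col}_i(G) \mapsto \operatorname{col}_i(G) / \|\operatorname{col}_i(G)\|_\text{RMS}$ for each $i = 1, \ldots, d_\text{in}$.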
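Examples 3 and 4 can be written out directly with an exact SVD. The sketch below is a reference implementation for small matrices, not one of the fast methods discussed later in the paper. The function names and the `eps` guard against all-zero columns are assumptions made for illustration.

```python
import numpy as np

def rms_to_rms_norm(W):
    """Example 3: sqrt(d_in / d_out) times the spectral norm of W."""
    d_out, d_in = W.shape
    return np.sqrt(d_in / d_out) * np.linalg.norm(W, ord=2)

def dualize_rms_to_rms(G):
    """Example 3: sqrt(d_out / d_in) * U V^T, with U and V from the SVD of G.
    For a full-rank G, NumPy's thin SVD coincides with the reduced SVD."""
    d_out, d_in = G.shape
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return np.sqrt(d_out / d_in) * (U @ Vt)

def l1_to_rms_norm(W):
    """Example 4: the largest RMS norm over the columns of W."""
    return np.sqrt(np.mean(W**2, axis=0)).max()

def dualize_l1_to_rms(G, eps=1e-12):
    """Example 4: rescale each column of G to unit RMS norm.
    The eps guard for all-zero columns is an assumption, not part of Example 4."""
    col_rms = np.sqrt(np.mean(G**2, axis=0, keepdims=True))
    return G / np.maximum(col_rms, eps)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))      # illustrative gradient with d_out = 4, d_in = 6
D_linear = dualize_rms_to_rms(G)
D_embed = dualize_l1_to_rms(G)
```

Both outputs have unit norm in their respective operator norms. In the RMS → RMS case, the pairing $\sum_{ij} G_{ij} D_{ij}$ equals $\sqrt{d_\text{out}/d_\text{in}}$ times the sum of the singular values of $G$.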

3.4 The Modular Norm

The modular norm (Large et al., 2024) is intended to help characterize the heterogeneous curvature of general neural architectures. The construction first defines an abstract module type along with a notion of what is a good, or well-normed, module. Then combination rules are given for constructing new well-normed modules from a library of existing well-normed modules. So modules are a special case of the combinator pattern from functional programming (Haskell Wiki Contributors, 2007) and are related to a monoidal category from category theory (Fong & Spivak, 2019). Large et al. (2024) begin by defining an abstract module:

Definition 4 (Module). Given input vector space $\mathcal{X}$, output vector space $\mathcal{Y}$ and weight vector space $\mathcal{W}$, a module $\mathsf{M}$ is an object with the following four attributes:

(a) a function, $\mathsf{M}.\mathrm{forward} : \mathcal{W} \times \mathcal{X} \to \mathcal{Y}$, which maps an input and a weight vector to an output;

(b) a number, $\mathsf{M}.\mathrm{mass} \ge 0$, which is used to set the proportion of feature learning that this module contributes to any supermodule;

(c) a number, $\mathsf{M}.\mathrm{sensitivity} \ge 0$, which estimates the module's sensitivity to input perturbations;

(d) a norm over the weight space, $\mathsf{M}.\mathrm{norm} : \mathcal{W} \to \mathbb{R}_{\ge 0}$, sometimes abbreviated to just $\|\cdot\|_\mathsf{M}$.

We shall care most about modules that are well-normed (Large et al., 2024), which amounts to requiring that the forward function is Lipschitz-continuous in the weights with constant 1 and in the inputs with constant $\mathsf{M}.\mathrm{sensitivity}$:

Definition 5 (Well-normed module). Let $\mathsf{M}$ be a module on $(\mathcal{X}, \mathcal{Y}, \mathcal{W})$, where the input and output spaces have respective norms $\|\cdot\|_\mathcal{X}$ and $\|\cdot\|_\mathcal{Y}$. $\mathsf{M}$ is well-normed if for all inputs $x \in \mathcal{X}$, weights $w \in \mathcal{W}$, weight perturbations $\Delta w \in \mathcal{W}$ and input perturbations $\Delta x \in \mathcal{X}$, we have both:

$$\|\nabla_w \mathsf{M}.\mathrm{forward}(w, x) \diamond \Delta w\|_\mathcal{Y} \le \mathsf{M}.\mathrm{norm}(\Delta w);$$

$$\|\nabla_x \mathsf{M}.\mathrm{forward}(w, x) \diamond \Delta x\|_\mathcal{Y} \le \mathsf{M}.\mathrm{sensitivity} \times \|\Delta x\|_\mathcal{X}.$$

The operator $\diamond$ denotes summation over any shared tensor indices². This definition of well-normed-ness can be used as a guiding principle in the design of a library of atomic (i.e. handwritten) modules. First, norms should be assigned to the input and output space of each module based on the semantics of $\mathsf{M}.\mathrm{forward}$. Then a norm $\mathsf{M}.\mathrm{norm}$ should be assigned to the module's weight space and a number $\mathsf{M}.\mathrm{sensitivity}$ should be chosen to make the module well-normed. Examples are given in Section 4.1.

Given such a library of well-normed atomic modules, a compound module built through any arbitrary sequence of module compositions and module concatenations is automatically well-normed (Large et al., 2024). And if the atomic modules are not only well-normed but are also smooth (Large et al., 2024, Definition 5), then there is an automatic procedure for computing sharpness coefficients for any compound module built from the library (Large et al., 2024, Appendix C). The relevant definition of module composition is as follows:

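Definition 6 (Module composition). Consider module $\mathsf{M}_1$ with input, output and weight space $(\mathcal{X}_1, \mathcal{Y}_1, \mathcal{W}_1)$ and module $\mathsf{M}_2$ with input, output and weight space $(\mathcal{X}_2, \mathcal{Y}_2, \mathcal{W}_2)$. $\mathsf{M}_1$ and $\mathsf{M}_2$ are composable if $\mathcal{X}_2 = \mathcal{Y}_1$. Their composite module $\mathsf{M} = \mathsf{M}_2 \circ \mathsf{M}_1$ has input, output and weight space $(\mathcal{X}_1, \mathcal{Y}_2, \mathcal{W}_1 \times \mathcal{W}_2)$ and attributes:

(a) $\mathsf{M}.\mathrm{forward}((w_1, w_2), x) = \mathsf{M}_2.\mathrm{forward}(w_2, \mathsf{M}_1.\mathrm{forward}(w_1, x))$;

(b) $\mathsf{M}.\mathrm{mass} = \mathsf{M}_1.\mathrm{mass} + \mathsf{M}_2.\mathrm{mass}$;

(c) $\mathsf{M}.\mathrm{sensitivity} = \mathsf{M}_1.\mathrm{sensitivity} \times \mathsf{M}_2.\mathrm{sensitivity}$;

(d) $\mathsf{M}.\mathrm{norm}((w_1, w_2)) = \max(\alpha, \beta)$, where:

1) $\alpha = \mathsf{M}_2.\mathrm{sensitivity} \times \dfrac{\mathsf{M}.\mathrm{mass}}{\mathsf{M}_1.\mathrm{mass}} \times \mathsf{M}_1.\mathrm{norm}(w_1)$

2) $\beta = \dfrac{\mathsf{M}.\mathrm{mass}}{\mathsf{M}_2.\mathrm{mass}} \times \mathsf{M}_2.\mathrm{norm}(w_2)$

and if $\mathsf{M}_1.\mathrm{mass}$ or $\mathsf{M}_2.\mathrm{mass}$ is zero, the corresponding term in the max is set to zero.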

²The expressions inside the norms on the left-hand side are examples of "Jacobian-vector products", or JVPs for short (Blondel & Roulet, 2024).


So the composite norm is taken to be a weighted max over the norms of the two sub-modules, where the weight space of the first module is coupled to the input sensitivity of the second module. The module masses provide freedom to tune the importance of each sub-module in the norm, and Large et al. (2024) prove that module mass provides precise control over the amount of feature learning that can happen in each sub-module.

Concatenation is defined in a similar way to composition:

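Definition 7 (Module concatenation). Consider module $\mathsf{M}_1$ with input, output and weight space $(\mathcal{X}_1, \mathcal{Y}_1, \mathcal{W}_1)$ and module $\mathsf{M}_2$ with input, output and weight space $(\mathcal{X}_2, \mathcal{Y}_2, \mathcal{W}_2)$. We say that $\mathsf{M}_1$ and $\mathsf{M}_2$ are concatenatable if their input spaces match: $\mathcal{X}_1 = \mathcal{X}_2$. The tuple module $\mathsf{M} = (\mathsf{M}_1, \mathsf{M}_2)$ has input, output and weight space $(\mathcal{X}_1, \mathcal{Y}_1 \times \mathcal{Y}_2, \mathcal{W}_1 \times \mathcal{W}_2)$ and the following list of attributes:

(a) $\mathsf{M}.\mathrm{forward}((w_1, w_2), x) = (\mathsf{M}_1.\mathrm{forward}(w_1, x), \mathsf{M}_2.\mathrm{forward}(w_2, x))$;

(b) $\mathsf{M}.\mathrm{mass} = \mathsf{M}_1.\mathrm{mass} + \mathsf{M}_2.\mathrm{mass}$;

(c) $\mathsf{M}.\mathrm{sensitivity} = \mathsf{M}_1.\mathrm{sensitivity} + \mathsf{M}_2.\mathrm{sensitivity}$;

(d) $\mathsf{M}.\mathrm{norm}((w_1, w_2)) = \max(\alpha, \beta)$, where:

1) $\alpha = \dfrac{\mathsf{M}.\mathrm{mass}}{\mathsf{M}_1.\mathrm{mass}} \times \mathsf{M}_1.\mathrm{norm}(w_1)$

2) $\beta = \dfrac{\mathsf{M}.\mathrm{mass}}{\mathsf{M}_2.\mathrm{mass}} \times \mathsf{M}_2.\mathrm{norm}(w_2)$

and if $\mathsf{M}_1.\mathrm{mass}$ or $\mathsf{M}_2.\mathrm{mass}$ is zero, the corresponding term in the max is set to zero.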
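As a small worked instance of the norm in Definition 6, suppose $\mathsf{M}_1$ has mass 1, $\mathsf{M}_2$ has mass 3 and $\mathsf{M}_2.\mathrm{sensitivity} = 1$. These numbers are chosen purely for illustration and do not appear in the paper.

```latex
\begin{align*}
\mathsf{M}.\mathrm{mass} &= 1 + 3 = 4, \\
\mathsf{M}.\mathrm{norm}((w_1, w_2)) &= \max\left( 1 \times \tfrac{4}{1} \times \mathsf{M}_1.\mathrm{norm}(w_1),\ \tfrac{4}{3} \times \mathsf{M}_2.\mathrm{norm}(w_2) \right).
\end{align*}
```

The unit ball of the composite norm is therefore the set where $\mathsf{M}_1.\mathrm{norm}(w_1) \le 1/4$ and $\mathsf{M}_2.\mathrm{norm}(w_2) \le 3/4$, so each sub-module receives a share of the unit budget in proportion to its mass. With these masses, Definition 7 yields the same expression for the tuple module, since the only difference in the norm is the absence of the sensitivity factor.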

A shortcoming of the paper by Large et al. (2024) is that the power of the modular norm is not fully leveraged. In particular, the authors do modular normalization of training, where weight updates to certain modules are naïvely divided by their norm. This paper makes fuller use of the geometry of the modular norm by constructing the corresponding duality map, which we call modular dualization.

4 Modular Dualization

In this section, we construct a duality map for general neural architectures. Our strategy is to first write down duality maps for atomic modules, i.e. individual layers. We then extend to arbitrary compound modules, i.e. full neural networks, by showing how duality maps should pass through composition and concatenation.

4.1 Duality Maps for Atomic Modules

To construct a duality map for an atomic module $\mathsf{A}$, the idea is to first fix norms on the input and output spaces that respect the semantics of $\mathsf{A}.\mathrm{forward}$. We should select norms that describe both how large we would like the inputs and outputs to be, and in what geometry we would like the outputs to evolve. Then we place a norm on the weight space such that $\mathsf{A}$ is well-normed: this is typically the operator norm (Definition 3) induced by the input and output norms. Finally we are in position to solve for the duality map, which we shall call $\mathsf{A}.\mathrm{dualize}$. We now give some examples of this procedure for the basic layer types of Linear, Embed and Conv2D. The results are summarized in Table 1.

We start with the canonical example of an atomic module:

Example 5 (The Linear module). The Linear module sends inputs from $\mathcal{X} = \mathbb{R}^{d_\text{in}}$ to outputs in $\mathcal{Y} = \mathbb{R}^{d_\text{out}}$. The weight space is given by the matrix space $\mathcal{W} = \mathbb{R}^{d_\text{out} \times d_\text{in}}$. We endow the Linear module with attributes:

1. Linear.forward$(W, x) = Wx$, matrix-vector product;

2. Linear.sensitivity $= 1$;

3. Linear.mass $= \mu$, where $\mu \ge 0$ is a hyperparameter;

4. Linear.norm$(W) = \|W\|_{\text{RMS} \to \text{RMS}}$, the RMS → RMS induced operator norm.

Since the Linear module is intended to map to and from vectors of roughly unit RMS norm, we place the RMS norm on both the input and output space: $\|\cdot\|_\mathcal{X} = \|\cdot\|_\text{RMS}$ and $\|\cdot\|_\mathcal{Y} = \|\cdot\|_\text{RMS}$. Then Linear is well-normed if the inputs and weights belong to the unit balls $\{x \in \mathbb{R}^{d_\text{in}} : \|x\|_\mathcal{X} \le 1\}$ and $\{W \in \mathbb{R}^{d_\text{out} \times d_\text{in}} : \text{Linear.norm}(W) \le 1\}$. Referring back to Section 3.3, the duality map corresponding to Linear.norm is then given by:

5. Linear.dualize$(G) = \sqrt{d_\text{out} / d_\text{in}} \times U V^\top$, where the gradient $G \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ has reduced SVD $G = U \Sigma V^\top$.
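The well-normed conditions of Definition 5 can be checked numerically for the Linear module. In the sketch below, the helper names, the dimensions and the random draws are illustrative assumptions. For the forward map $Wx$, the two Jacobian-vector products are $\Delta W\, x$ and $W\, \Delta x$.

```python
import numpy as np

def rms(v):
    """RMS norm: Euclidean norm divided by the square root of the dimension."""
    return np.linalg.norm(v) / np.sqrt(v.size)

def rms_to_rms_norm(W):
    """Linear.norm: sqrt(d_in / d_out) times the spectral norm."""
    d_out, d_in = W.shape
    return np.sqrt(d_in / d_out) * np.linalg.norm(W, ord=2)

rng = np.random.default_rng(0)
d_out, d_in = 5, 7                        # illustrative layer shape

# Draw an input and a weight matrix on the boundary of the unit balls of Example 5.
x = rng.standard_normal(d_in)
x /= rms(x)                               # now ||x||_RMS = 1
W = rng.standard_normal((d_out, d_in))
W /= rms_to_rms_norm(W)                   # now Linear.norm(W) = 1

dW = rng.standard_normal((d_out, d_in))   # arbitrary weight perturbation
dx = rng.standard_normal(d_in)            # arbitrary input perturbation

# First condition of Definition 5: || dW x ||_RMS <= Linear.norm(dW).
weight_jvp_ok = rms(dW @ x) <= rms_to_rms_norm(dW) + 1e-12

# Second condition, with Linear.sensitivity = 1: || W dx ||_RMS <= 1 * ||dx||_RMS.
input_jvp_ok = rms(W @ dx) <= 1.0 * rms(dx) + 1e-12
```

Both flags hold for any draw, since each inequality is an instance of the defining property of the induced operator norm in Definition 3.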

This single duality map recovers essential features of both maximal update parameterization (Yang & Hu, 2021, µP) and Shampoo (Gupta et al., 2018). In particular, the factor of $\sqrt{d_\text{out} / d_\text{in}}$ in Linear.dualize recovers spectral update scaling (Yang et al., 2023) that leads to µP. (Initializing such that Linear.norm$(W) = 1$ also recovers µP initialization scaling.) And the mapping $G \mapsto U V^\top$ is equivalent to Shampoo without accumulation (Bernstein & Newhouse, 2024). As such, we believe duality maps may help reconcile different strands of deep learning research and provide a unifying basis for fast and scalable training algorithms.

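The Embed module provides a useful counterpoint to the Linear module. The difference between the two modules stems from the fact that the input spaces of Embed and Linear have different semantics.

Example 6 (The Embed module). The Embed module sends inputs from $\mathcal{X} = \mathbb{R}^{d_\text{in}}$ to outputs in $\mathcal{Y} = \mathbb{R}^{d_\text{out}}$. The weight space is given by the matrix space $\mathcal{W} = \mathbb{R}^{d_\text{out} \times d_\text{in}}$. We endow the Embed module with attributes:

1. Embed.forward$(W, x) = Wx$, matrix-vector product;

2. Embed.sensitivity $= 1$;

3. Embed.mass $= \mu$, where $\mu \ge 0$ is a hyperparameter;

4. Embed.norm$(W) = \|W\|_{\ell_1 \to \text{RMS}}$, the $\ell_1$ → RMS induced operator norm.

Embed is intended to map from one-hot vectors to vectors of roughly unit RMS norm, so we place the $\ell_1$ norm on the input space and the RMS norm on the output space: $\|\cdot\|_\mathcal{X} = \|\cdot\|_{\ell_1}$, $\|\cdot\|_\mathcal{Y} = \|\cdot\|_\text{RMS}$. Then Embed is well-normed if the inputs and weights belong to the unit balls $\{x \in \mathbb{R}^{d_\text{in}} : \|x\|_\mathcal{X} \le 1\}$ and $\{W \in \mathbb{R}^{d_\text{out} \times d_\text{in}} : \text{Embed.norm}(W) \le 1\}$. Referring back to Section 3.3, the duality map for Embed.norm is:

5. Embed.dualize$(G)$ performs the mapping $\operatorname{col}_j(G) \mapsto \operatorname{col}_j(G) / \|\operatorname{col}_j(G)\|_\text{RMS}$ for each column index $j = 1, \ldots, d_\text{in}$.

| Module | Weight Space $\mathcal{W}$ | Module.norm | Module.dualize |
|---|---|---|---|
| Linear | $\mathbb{R}^{d_\text{out} \times d_\text{in}}$ | $W \mapsto \lVert W \rVert_{\text{RMS} \to \text{RMS}}$ | $G \mapsto \sqrt{d_\text{out} / d_\text{in}} \times U V^\top$ |
| Embed | $\mathbb{R}^{d_\text{out} \times d_\text{in}}$ | $W \mapsto \lVert W \rVert_{\ell_1 \to \text{RMS}}$ | $\operatorname{col}_j(G) \mapsto \operatorname{col}_j(G) / \lVert \operatorname{col}_j(G) \rVert_\text{RMS}$ |
| Conv2D | $\mathbb{R}^{d_\text{out} \times d_\text{in} \times k \times k}$ | $W \mapsto k^2 \max_{i,j=1}^{k} \lVert W_{\cdot\cdot ij} \rVert_{\text{RMS} \to \text{RMS}}$ | $G_{\cdot\cdot ij} \mapsto \frac{1}{k^2} \sqrt{d_\text{out} / d_\text{in}} \times U_{ij} V_{ij}^\top$ |

Table 1: Duality maps for three atomic modules: Linear, Embed, and Conv2D. These atomic modules are sufficient to build CNNs and transformers. In Linear.dualize, $U \Sigma V^\top$ denotes the reduced SVD of the gradient matrix $G$. In Conv2D.dualize, $U_{ij} \Sigma_{ij} V_{ij}^\top$ denotes the reduced SVD of the slice of the gradient tensor $G_{\cdot\cdot ij}$ at kernel indices $i, j$. Section 5 provides GPU-friendly algorithms for computing these duality maps using a family of Newton-Schulz iterations.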

Finally, we consider a Conv2D module with a $k \times k$ kernel. Conv2D has a more involved tensor structure than Linear and Embed. The calculations work by slicing up the weight tensor into a collection of $k^2$ matrices.

Example 7 (The Conv2D module). The Conv2D module sends inputs from $\mathcal{X} = \mathbb{R}^{W_\text{in} \times H_\text{in} \times d_\text{in}}$ to outputs in $\mathcal{Y} = \mathbb{R}^{W_\text{out} \times H_\text{out} \times d_\text{out}}$. We think of this as mapping an input image of width $W_\text{in}$, height $H_\text{in}$ and with $d_\text{in}$ color channels to an output image of width $W_\text{out}$, height $H_\text{out}$ and with $d_\text{out}$ color channels. The weight space is given by the tensor space $\mathcal{W} = \mathbb{R}^{d_\text{out} \times d_\text{in} \times k \times k}$, where $k$ is the kernel size. We endow Conv2D with attributes:

1. Conv2D.forward$(W, x) = W \circledast x$, where $\circledast$ denotes 2D convolution;

2. Conv2D.sensitivity $= 1$;

3. Conv2D.mass $= \mu$, where $\mu \ge 0$ is a hyperparameter;

4. Conv2D.norm$(W) = k^2 \max_{i,j=1}^{k} \|W_{\cdot\cdot ij}\|_{\text{RMS} \to \text{RMS}}$, the max RMS → RMS norm over kernel indices.

We would like pixel intensities in the inputs and outputs to be order one and undergo order one change. We formalize this by taking the input and output norms to be the spatial maximum of the RMS norms of all the color channel vectors: $\|x\|_\mathcal{X} = \max_{w=1}^{W_\text{in}} \max_{h=1}^{H_\text{in}} \|x_{wh\cdot}\|_\text{RMS}$ and $\|y\|_\mathcal{Y} = \max_{w=1}^{W_\text{out}} \max_{h=1}^{H_\text{out}} \|y_{wh\cdot}\|_\text{RMS}$. Then Conv2D is well-normed if the inputs and weights belong to the unit balls $\{x \in \mathbb{R}^{W_\text{in} \times H_\text{in} \times d_\text{in}} : \|x\|_\mathcal{X} \le 1\}$ and $\{W \in \mathbb{R}^{d_\text{out} \times d_\text{in} \times k \times k} : \text{Conv2D.norm}(W) \le 1\}$. Since the duality map for a max of norms decouples into one duality map per sub-norm, the duality map corresponding to Conv2D.norm is given by:

5. Conv2D.dualize$(G)$ does $G_{\cdot\cdot ij} \mapsto \frac{1}{k^2} \sqrt{d_\text{out} / d_\text{in}} \times U_{ij} V_{ij}^\top$, where $G_{\cdot\cdot ij}$ has reduced SVD $U_{ij} \Sigma_{ij} V_{ij}^\top$.
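The slice-by-slice structure of Conv2D.dualize can be sketched with a batched SVD. This is a reference implementation using an exact SVD rather than a fast method, and the function names and the tensor shape are illustrative assumptions.

```python
import numpy as np

def conv2d_norm(W):
    """k^2 times the max over kernel indices (i, j) of the RMS->RMS norm of W[:, :, i, j]."""
    d_out, d_in, k, _ = W.shape
    scale = np.sqrt(d_in / d_out)
    return k**2 * max(
        scale * np.linalg.norm(W[:, :, i, j], ord=2)
        for i in range(k) for j in range(k)
    )

def conv2d_dualize(G):
    """Map each slice G[:, :, i, j] to (1 / k^2) * sqrt(d_out / d_in) * U_ij V_ij^T."""
    d_out, d_in, k, _ = G.shape
    # Move the kernel axes to the front so np.linalg.svd batches over the k * k slices.
    slices = np.transpose(G, (2, 3, 0, 1))
    U, _, Vt = np.linalg.svd(slices, full_matrices=False)
    D = np.sqrt(d_out / d_in) / k**2 * (U @ Vt)
    # Restore the (d_out, d_in, k, k) layout.
    return np.transpose(D, (2, 3, 0, 1))

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3, 2, 2))    # illustrative shape (d_out, d_in, k, k)
D = conv2d_dualize(G)
```

Each dualized slice has RMS → RMS norm $1/k^2$, so the factor of $k^2$ in Conv2D.norm brings the norm of the full update to one.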

4.2 Duality Maps for Bond Modules

Large et al. (2024) define another class of basic modules: bond modules. Bonds are handwritten modules without weights. An example of a bond is the ReLU nonlinearity. For a bond $\mathsf{B}$, the weight space is the zero vector space $\mathcal{W} = \{0\}$ and the modular norm is $\mathsf{B}.\mathrm{norm} = 0 \mapsto 0$. As such, the corresponding duality map is also $\mathsf{B}.\mathrm{dualize} = 0 \mapsto 0$. In a software package, one need not write norms or duality maps for bond modules.

4.3 Duality Maps for Compound Modules

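Given two composable modules $\mathsf{M}_1$ and $\mathsf{M}_2$, the duality map for the composite $\mathsf{M} = \mathsf{M}_2 \circ \mathsf{M}_1$ is given by:

$$\mathsf{M}.\mathrm{dualize}(g_1, g_2) = \left( \frac{\mathsf{M}_1.\mathrm{mass}}{\mathsf{M}.\mathrm{mass}} \times \frac{\mathsf{M}_1.\mathrm{dualize}(g_1)}{\mathsf{M}_2.\mathrm{sensitivity}},\ \frac{\mathsf{M}_2.\mathrm{mass}}{\mathsf{M}.\mathrm{mass}} \times \mathsf{M}_2.\mathrm{dualize}(g_2) \right).$$

Given two concatenatable modules $\mathsf{M}_1$ and $\mathsf{M}_2$, the duality map for the tuple $\mathsf{M} = (\mathsf{M}_1, \mathsf{M}_2)$ is:

$$\mathsf{M}.\mathrm{dualize}(g_1, g_2) = \left( \frac{\mathsf{M}_1.\mathrm{mass}}{\mathsf{M}.\mathrm{mass}} \times \mathsf{M}_1.\mathrm{dualize}(g_1),\ \frac{\mathsf{M}_2.\mathrm{mass}}{\mathsf{M}.\mathrm{mass}} \times \mathsf{M}_2.\mathrm{dualize}(g_2) \right).$$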

The proofs of these two duality maps follow in a straightforward manner from Definitions 6 and 7.
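The recursion of Section 4.3 is easy to express in code. The sketch below is a hypothetical miniature and not the API of the Modula package: the `Module` dataclass and the `linear`, `compose` and `concatenate` helpers are names invented for illustration. It assumes strictly positive masses and sensitivities, so the zero-mass special case of Definitions 6 and 7 is not handled, and it omits forward functions and bond modules.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class Module:
    mass: float
    sensitivity: float
    norm: Callable       # weights -> float
    dualize: Callable    # gradient -> update direction of unit modular norm

def linear(mass=1.0):
    """Atomic Linear module with the RMS->RMS norm and duality map of Example 5."""
    def norm(W):
        d_out, d_in = W.shape
        return np.sqrt(d_in / d_out) * np.linalg.norm(W, ord=2)
    def dualize(G):
        d_out, d_in = G.shape
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        return np.sqrt(d_out / d_in) * (U @ Vt)
    return Module(mass, 1.0, norm, dualize)

def compose(m1, m2):
    """Composite of m2 after m1: norm from Definition 6, duality map from Section 4.3."""
    mass = m1.mass + m2.mass
    def norm(w):
        return max(m2.sensitivity * mass / m1.mass * m1.norm(w[0]),
                   mass / m2.mass * m2.norm(w[1]))
    def dualize(g):
        return (m1.mass / mass * m1.dualize(g[0]) / m2.sensitivity,
                m2.mass / mass * m2.dualize(g[1]))
    return Module(mass, m1.sensitivity * m2.sensitivity, norm, dualize)

def concatenate(m1, m2):
    """Tuple module (m1, m2): norm from Definition 7, duality map from Section 4.3."""
    mass = m1.mass + m2.mass
    def norm(w):
        return max(mass / m1.mass * m1.norm(w[0]), mass / m2.mass * m2.norm(w[1]))
    def dualize(g):
        return (m1.mass / mass * m1.dualize(g[0]), m2.mass / mass * m2.dualize(g[1]))
    return Module(mass, m1.sensitivity + m2.sensitivity, norm, dualize)

# Two stacked Linear layers with illustrative masses 1 and 3.
rng = np.random.default_rng(0)
net = compose(linear(mass=1.0), linear(mass=3.0))
grads = (rng.standard_normal((8, 4)), rng.standard_normal((2, 8)))
update = net.dualize(grads)
```

Because weights and gradients are nested tuples, `compose` and `concatenate` can be applied repeatedly to build deeper architectures, and the returned update always has unit modular norm. In this example the first layer's update has Linear.norm 1/4 and the second has Linear.norm 3/4, in proportion to the masses.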

5 Fast Duality Maps

For modular dualization to be practically feasible, we need ways of computing duality maps quickly. Inspecting the duality maps listed in Table 1, we see that Embed.dualize is easy to implement since it just involves computing vector norms of matrix columns. But Linear.dualize and Conv2D.dualize involve the projection:

$$G = U \Sigma V^\top \mapsto U V^\top, \tag{3}$$

where $U \Sigma V^\top$ is the reduced SVD of the matrix $G$. Since computing SVDs can be slow (Carlson et al., 2015b; Flynn, 2017), we discuss three approximations to this map via sketching, iterations for inverse matrix roots, and a family of rectangular Newton-Schulz iterations.

5.1 Sketching

Sketching is a randomized method (Martinsson & Tropp, 2020) that can be used to build low-rank approximations to the SVD. Carlson et al. (2015b) used sketching to provide a fast approximation to their #-operator. More recent papers have experimented with sketching in the context of Shampoo-type algorithms (Feinberg et al., 2023). A potential downside of approximating Equation (3) via sketching is that randomized SVD methods usually try to accurately approximate the largest singular values of a matrix (Martinsson & Tropp, 2020, Section 11.2) while the value of Equation (3) may lie in its action on small singular values.

5.2 Iterations for Inverse Matrix Roots

If $G$ is a full rank matrix with reduced SVD $U \Sigma V^\top$, then:

$$U V^\top = (G G^\top)^{-1/4}\, G\, (G^\top G)^{-1/4} = (G G^\top)^{-1/2}\, G = G\, (G^\top G)^{-1/2}.$$

This equation provides a route to approximating the map $U \Sigma V^\top \mapsto U V^\top$ since one can compute inverse matrix roots such as $(G G^\top)^{-1/2}$ via Newton iteration (Lakić, 1998). This is discussed in Chapter 7 of Higham (2008)'s book; see also Anil et al. (2020)'s paper. Care must be taken with inverses when the matrix $G$ is ill-conditioned.
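The identity above can be verified numerically on a small, well-conditioned square matrix. Note that this sketch computes matrix powers with a symmetric eigendecomposition, not with the Newton iteration mentioned in the text. The helper `psd_power` and the test matrix are illustrative assumptions.

```python
import numpy as np

def psd_power(A, p):
    """A^p for a symmetric positive definite matrix A, via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(A)
    return (eigvecs * eigvals**p) @ eigvecs.T

# Build a square full-rank G with known orthogonal factors and singular values in [0.5, 2].
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 5)))
Q2, _ = np.linalg.qr(rng.standard_normal((5, 5)))
G = Q1 @ np.diag(np.linspace(0.5, 2.0, 5)) @ Q2.T
target = Q1 @ Q2.T                        # this is U V^T for G

two_sided = psd_power(G @ G.T, -0.25) @ G @ psd_power(G.T @ G, -0.25)
left = psd_power(G @ G.T, -0.5) @ G       # (G G^T)^{-1/2} G
right = G @ psd_power(G.T @ G, -0.5)      # G (G^T G)^{-1/2}
```

All three expressions agree with `target` up to floating-point error. As the singular values of $G$ approach zero, the negative powers amplify rounding error, which is the ill-conditioning caveat mentioned above.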

5.3 Rectangular Newton-Schulz Iteration

We developed a rectangular Newton-Schulz iteration for computing $U V^\top$ by adapting Equation 5.22 in Higham (2008)'s book for computing the matrix sign function. We later discovered this iteration has a long history (Kovarik, 1970; Björck & Bowie, 1971). The method works by first normalizing the matrix $G$ according to $X_0 = G / \|G\|_{\ell_2 \to \ell_2}$ (or alternatively $X_0 = G / \|G\|_F$) and then iterating:

$$X_{t+1} = \frac{3}{2} X_t - \frac{1}{2} X_t X_t^\top X_t.$$

As $t \to \infty$, the sequence $X_t \to U V^\top$. To see this, one can plot the univariate cubic function $f(x) := \frac{3}{2} x - \frac{1}{2} x^3$ and see that, for $0 < x < \sqrt{3}$, iterating this cubic will push $x$ closer and closer to 1. The final step is to realize that the effect of the matrix iteration is to apply this cubic $f(x)$ to each singular value of $X_t$. This shows that the spectral normalization $X_0 = G / \|G\|_{\ell_2 \to \ell_2}$ is stronger than what is required: we need only ensure that $X_0$ has singular values no greater than $\sqrt{3}$ for the iteration to converge.

This iteration has the advantage over sketching that it always works on all singular values, and since it does not compute inverse matrix roots the iteration is well-behaved even on low-rank matrices.
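A minimal NumPy sketch of the cubic iteration follows. The function name `newton_schulz`, the default of 30 steps and the test matrix are illustrative assumptions, and the sketch assumes a nonzero input since it divides by the Frobenius norm.

```python
import numpy as np

def newton_schulz(G, steps=30):
    """Approximate U V^T by iterating X <- 1.5 X - 0.5 X X^T X from X_0 = G / ||G||_F."""
    # The Frobenius norm upper-bounds the spectral norm, so every singular value
    # of X_0 lies in (0, 1], inside the convergence region (0, sqrt(3)).
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Rectangular test matrix with known factors and singular values in [0.5, 2].
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))
V, _ = np.linalg.qr(rng.standard_normal((16, 8)))
G = U @ np.diag(np.linspace(0.5, 2.0, 8)) @ V.T

error = np.abs(newton_schulz(G) - U @ V.T).max()
```

The iteration uses only matrix multiplications, which is what makes it GPU-friendly. Small singular values grow by a factor of roughly 1.5 per step until they approach 1, so poorly conditioned inputs need more steps than this example.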

Finally, there is in fact a family of degree $2n + 1$ polynomial iterations of the form

$$X_{t+1} = a X_t + b (X_t X_t^\top) X_t + \cdots + z (X_t X_t^\top)^n X_t$$

for suitable $a, b, \ldots, z$ instead of $a, b = \frac{3}{2}, -\frac{1}{2}$. One should choose coefficients $a, b, \ldots, z$ so that the univariate polynomial $g(x) = a x + b x^3 + \cdots + z x^{2n+1}$ is a suitable approximation to $\operatorname{sign}(x)$. One may further accelerate the iteration by tuning the coefficients $a, b, \ldots, z$ empirically.
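The polynomial family can be sketched with a single routine that takes the coefficient tuple as an argument. The quintic coefficients below, $(15/8, -10/8, 3/8)$, are an assumption chosen for illustration because the resulting polynomial satisfies $g(1) = 1$ and $g'(1) = g''(1) = 0$. They are not coefficients given in the paper, and they are not empirically tuned.

```python
import numpy as np

def polynomial_iteration(G, coeffs, steps):
    """Iterate X <- a X + b (X X^T) X + ... + z (X X^T)^n X with coeffs = (a, b, ..., z)."""
    X = G / np.linalg.norm(G)             # Frobenius normalization; assumes G is nonzero
    for _ in range(steps):
        A = X @ X.T
        term, X_next = X, np.zeros_like(X)
        for c in coeffs:
            X_next = X_next + c * term    # add c * (X X^T)^j X
            term = A @ term               # advance to (X X^T)^(j+1) X
        X = X_next
    return X

CUBIC = (1.5, -0.5)                       # the iteration of Section 5.3
QUINTIC = (15 / 8, -10 / 8, 3 / 8)        # assumed example, not from the paper

# Same style of test matrix as before: known factors, singular values in [0.5, 2].
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))
V, _ = np.linalg.qr(rng.standard_normal((16, 8)))
G = U @ np.diag(np.linspace(0.5, 2.0, 8)) @ V.T

err_cubic = np.abs(polynomial_iteration(G, CUBIC, steps=8) - U @ V.T).max()
err_quintic = np.abs(polynomial_iteration(G, QUINTIC, steps=8) - U @ V.T).max()
```

For the same number of steps the quintic ends closer to $U V^\top$ on this matrix, but each quintic step costs more matrix multiplications, so the comparison of step counts alone does not settle which iteration is faster in wall-clock terms.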

6 Discussion

This paper develops the theory of modular duality and the procedure of modular dualization as means to construct duality maps for general neural architectures. Here, we comment on implications and connections.

6.1 Neural Network Speedrunning

We believe that the ideas in this paper can help in the design of faster training methods. In fact, based on our work, a new NanoGPT training speed record was recently set using a Newton-Schulz-based duality map, packaged into an open-source optimizer called Muon (Jordan et al., 2024b).

6.2 A Type System for Deep Learning

Part of the inspiration for this work is to build a fully-fledged type system for deep learning. We think that activation spaces should be typed by their intended norm and the intended size of activations in that norm. This information would help to construct well-normed modules (see Section 4.1). Modules should be typed according to Definition 4. And, as suggested in the introduction, gradients should be explicitly typed as dual vectors. A duality map should flip the type of a dual vector to a primal vector. We plan to use the Modula deep learning package (Large et al., 2024) as a testbed for these ideas.

6.3 Modular Duality: A Unifying Theoretical Framework for Fast and Scalable Training

An important topic in contemporary optimization research is the design of fast and scalable training methods for neural networks. Two popular methods in this research space are maximal update parameterization (Yang & Hu, 2021, µP), which allows increasing network width without changing the optimal learning rate, and Shampoo (Gupta et al., 2018), a variant of which (Shi et al., 2023) won a speed challenge at the AlgoPerf optimizer competition (Dahl et al., 2023).

Figure 1: Learning rate transfer with dualization. To test learning rate transfer across width, we train an MLP on CIFAR-10 for 20 epochs at a range of widths and learning rates. We plot the final training loss and mark the best learning rate at each width with a red dot. Left: In standard parameterization (SP), Adam's optimal learning rate drifts to the left. Middle: Maximal update parameterization (Yang & Hu, 2021, µP) mostly corrects this drift. Right: Our duality-based method has a fairly stable optimal learning rate and also reaches much lower loss. More experimental details are given in Appendix A.

We showed in Section 4.1 that essential features of both µP and Shampoo are recovered from the single duality map Linear.dualize. We think that, on a basic theoretical level, µP and Shampoo should be viewed as partial approximations to this duality map. This observation helps put µP and Shampoo on a consistent theoretical footing, orients the methods with respect to overlooked prior work on spectral descent (Carlson et al., 2015b) and duality structure gradient descent (Flynn, 2017), and suggests new ways to generalize these methods to arbitrary layer types and network architectures via the modular norm and modular dualization.

Figure 1 shows that our duality-based optimizer is both scalable and fast: it transfers learning rate across width like µP, and it reaches lower loss in the same number of steps.

6.4 On the Alignment of Activations and Updates

Recent work (Yang et al., 2023; Everett et al., 2024; Large et al., 2024) has singled out the following question as important to the design of scalable deep learning systems: to what extent do gradient updates to neural network layers align with incoming activation vectors? This question is important since it helps inform how large weight updates need to be to induce a certain amount of change in layer outputs. Duality maps such as Linear.dualize and Conv2D.dualize may help simplify the answer to this question, since they project gradients to scaled semi-orthogonal matrices for which all singular values have the same magnitude. For the case of a square weight matrix, the weight update is simply an orthogonal matrix, which acts as an isometry on input vectors, meaning that feature learning happens trivially.

Figure 2: Erasure of watermarked initial weights. It is commonly held that the weights stay close to initialization in very wide networks (Lee et al., 2019; Jesus et al., 2021). To visualize the change in weights, we watermark the hidden layer weights of an MLP of width 1024 at initialization by zeroing out matrix entries in the shape of the letter "a". We then train for 1000 steps on CIFAR-10, across ten different learning rates. For each run, we plot the final training accuracy along with an image of the learned weight matrix. Not only does dualized gradient descent reach higher training accuracies than the non-dualized method, but dualized gradient descent also erases the watermark at the highest stable learning rate, constituting substantial weight change. More experimental details are given in Appendix A.
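The isometry claim for square weight matrices in Section 6.4 can be illustrated with a short NumPy check. The matrix `G` and vector `x` below are arbitrary stand-ins for a gradient and an incoming activation, and the orthogonal update is formed directly from an SVD rather than by any particular dualization routine.

```python
import numpy as np

G = np.array([[2.0, -1.0, 0.5],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, -2.0]])       # stand-in for a square gradient
U, _, Vt = np.linalg.svd(G)
update = U @ Vt                         # all singular values set to 1

x = np.array([0.3, -1.2, 2.0])          # stand-in for an incoming activation
print(np.linalg.norm(update @ x), np.linalg.norm(x))
print(np.linalg.svd(update, compute_uv=False))
```

Because every singular value of the update equals one, the norm of `update @ x` matches the norm of `x` for any choice of `x`, so no input direction is favored over another.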

6.5 A Numerical Paradox: The Weights Don't Change!

Past work (Lee et al., 2019; Jesus et al., 2021) has pointed out an apparent paradox in deep learning: the weights seem to move a vanishing amount from initialization in the limit of large network width. This finding led to substantial interest in linearized training dynamics (Jacot et al., 2018). Prior work attempted to resolve this paradox by showing that the weights move a roughly constant amount at any width when the change is measured in spectral norm (Yang et al., 2023). But duality maps lead to a new story: Linear.dualize ramps up the stable rank of updates, causing the weights to move at large width even in the Frobenius norm provided the batch size is not too small. This result, shown in Figure 2, challenges the belief that very wide neural networks cannot stray from their initialization; instead the numerical movement of the weights depends on the choice of optimizer.
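The role of stable rank can be seen in a toy computation. The sketch below uses the usual definition of stable rank (squared Frobenius norm divided by squared spectral norm) and a hypothetical gradient built from one dominant rank-one direction plus small noise; the sizes and noise level are arbitrary choices, and the dualized matrix is again formed from an SVD for simplicity.

```python
import numpy as np

def stable_rank(M):
    """Squared Frobenius norm divided by squared spectral norm."""
    return np.linalg.norm(M, "fro") ** 2 / np.linalg.norm(M, 2) ** 2

rng = np.random.default_rng(0)
d = 64
# Hypothetical low-stable-rank gradient: one dominant direction plus small noise.
u, v = rng.standard_normal(d), rng.standard_normal(d)
G = np.outer(u, v) + 1e-3 * rng.standard_normal((d, d))

U, _, Vt = np.linalg.svd(G)
dualized = U @ Vt                       # every singular value set to 1

print(stable_rank(G), stable_rank(dualized))
# Frobenius norms after scaling both matrices to unit spectral norm:
print(np.linalg.norm(G / np.linalg.norm(G, 2), "fro"), np.linalg.norm(dualized, "fro"))
```

At equal spectral norm, the dualized matrix has Frobenius norm $\sqrt{d}$ while the raw gradient's is close to one, which is the sense in which raising the stable rank lets the weights move in Frobenius norm.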


7 Conclusion

This paper has proposed a recursive procedure called modular dualization for building duality maps for general neural architectures. The procedure unifies past strands of optimization research on Shampoo (Gupta et al., 2018) and µP (Yang & Hu, 2021). Duality-based optimizers have already led to significant wall-clock speedups in transformer training ranging from 124M to 1.5B parameters (Jordan et al., 2024b). The rectangular Newton-Schulz iteration provides a GPU-friendly and numerically stable means of dualizing under the RMS → RMS operator norm, while avoiding some of the downsides of sketching-based approaches (Carlson et al., 2015b). Overall, we hope that our theory of modular duality provides a clarifying toolkit for the design and analysis of deep learning systems.

Impact Statement

The methods developed in this paper may be used to make the training of machine learning systems more efficient, for either good or ill.

Acknowledgements

Many ideas in this paper, including Section 5.3, were developed jointly with Tim Large before he left to work at a tech company. We are grateful to Phillip Isola for invaluable discussions and for granting us the freedom to pursue this work. We also thank Jack Gallagher, Keller Jordan, Simo Ryu, Rogier Brussee, Tongzhou Wang, Victor Butoi, Jeffrey Cider and the anonymous reviewers for helpful conversations.

References

Amari, S. Information Geometry and Its Applications. Springer, 2016. Cited on page 1.

Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. Scalable second order optimization for deep learning. arXiv:2002.09018, 2020. Cited on page 7.

Bernstein, J. and Newhouse, L. Old optimizer, new norm: An anthology. In Workshop on Optimization for Machine Learning, 2024. Cited on pages 3 and 5.

Bernstein, J., Mingard, C., Huang, K., Azizan, N., and Yue, Y. Automatic Gradient Descent: Deep Learning without Hyperparameters. arXiv:2304.05187, 2023. Cited on page 2.

Björck, Å. and Bowie, C. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 1971. Cited on page 7.

Blondel, M. and Roulet, V. The Elements of Differentiable Programming. arXiv:2403.14606, 2024. Cited on pages 1 and 4.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004. Cited on page 1.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander Plas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax. Cited on page 1.

Carlson, D., Cevher, V., and Carin, L. Stochastic spectral descent for restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2015a. Cited on pages 2 and 3.

Carlson, D., Hsieh, Y.-P., Collins, E., Carin, L., and Cevher, V. Stochastic spectral descent for discrete graphical models. Selected Topics in Signal Processing, 2016. Cited on pages 2 and 3.

Carlson, D. E., Collins, E., Hsieh, Y.-P., Carin, L., and Cevher, V. Preconditioned spectral descent for deep learning. In Neural Information Processing Systems, 2015b. Cited on pages 1, 2, 3, 4, 7, 8, and 9.

Carroll, S. M. Spacetime and Geometry: An Introduction to General Relativity. Cambridge University Press, 2019. Cited on page 1.

Dahl, G. E., Schneider, F., Nado, Z., Agarwal, N., Sastry, C. S., Hennig, P., Medapati, S., Eschenhagen, R., Kasimbeg, P., Suo, D., Bae, J., Gilmer, J., Peirson, A. L., Khan, B., Anil, R., Rabbat, M., Krishnan, S., Snider, D., Amid, E., Chen, K., Maddison, C. J., Vasudev, R., Badura, M., Garg, A., and Mattson, P. Benchmarking neural network training algorithms. arXiv:2306.07179, 2023. Cited on page 8.

Deimling, K. Nonlinear Functional Analysis. Springer Berlin, Heidelberg, 1985. Cited on page 2.

Everett, K. E., Xiao, L., Wortsman, M., Alemi, A. A., Novak, R., Liu, P. J., Gur, I., Sohl-Dickstein, J., Kaelbling, L. P., Lee, J., and Pennington, J. Scaling exponents across parameterizations and optimizers. In International Conference on Machine Learning, 2024. Cited on page 8.

Feinberg, V., Chen, X., Sun, Y. J., Anil, R., and Hazan, E. Sketchy: Memory-efficient adaptive regularization with frequent directions. In Neural Information Processing Systems, 2023. Cited on page 7.

Flynn, T. The duality structure gradient descent algorithm: Analysis and applications to neural networks. arXiv:1708.00523, 2017. Cited on pages 1, 2, 3, 4, 7, and 8.

Fong, B. and Spivak, D. I. An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press, 2019. Cited on page 4.

Grant, M. C. Disciplined Convex Programming. PhD dissertation, Stanford University, 2004. Cited on page 2.

Grosse, R. Metrics. Lecture 3 of CSC2541: Neural Net Training Dynamics, 2022. Cited on pages 1 and 2.

Gupta, V., Koren, T., and Singer, Y. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, 2018. Cited on pages 5, 8, and 9.

Haskell Wiki Contributors. Combinator pattern. Haskell Wiki, 2007. URL https://wiki.haskell.org/Combinator_pattern. Cited on page 4.

Higham, N. J. Functions of Matrices. Society for Industrial and Applied Mathematics, 2008. Cited on page 7.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Neural Information Processing Systems, 2018. Cited on page 8.

Jesus, R. J., Antunes, M. L., da Costa, R. A., Dorogovtsev, S. N., Mendes, J. F. F., and Aguiar, R. L. Effect of initial configuration of weights on training and function of artificial neural networks. Mathematics, 2021. Cited on page 8.

Jordan, K., Bernstein, J., Rappazzo, B., @fernbear.bsky.social, Vlado, B., Jiacheng, Y., Cesista, F., Koszarsky, B., and @Grad62304977. modded-nanogpt: Speedrunning the NanoGPT baseline, 2024a. URL https://github.com/KellerJordan/modded-nanogpt. Cited on page 11.

Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., and Bernstein, J. Muon: An optimizer for hidden layers in neural networks, 2024b. URL https://web.archive.org/web/20250106014946/https://kellerjordan.github.io/posts/muon/. Cited on pages 2, 7, and 9.

Kovarik, Z. Some iterative methods for improving orthonormality. SIAM Journal on Numerical Analysis, 1970. Cited on page 7.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Cited on page 11.

Lakić, S. On the computation of the matrix k-th root. Journal of Applied Mathematics and Mechanics, 1998. Cited on page 7.

Large, T., Liu, Y., Huh, M., Bahng, H., Isola, P., and Bernstein, J. Scalable optimization in the modular norm. In Neural Information Processing Systems, 2024. Cited on pages 2, 4, 5, 6, 7, and 8.

Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. In Neural Information Processing Systems, 2019. Cited on page 8.

Martinsson, P.-G. and Tropp, J. A. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 2020. Cited on pages 2 and 7.

Nemirovsky, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley, 1983. Cited on page 1.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems, 2019. Cited on page 1.

Qiu, S., Potapczynski, A., Finzi, M., Goldblum, M., and Wilson, A. G. Compute better spent: Replacing dense layers with structured matrices. In International Conference on Machine Learning, 2024. Cited on page 11.

Sakurai, J. J. and Napolitano, J. Modern Quantum Mechanics. Cambridge University Press, 2020. Cited on page 1.

Shi, H.-J. M., Lee, T.-H., Iwasaki, S., Gallego-Posada, J., Li, Z., Rangadurai, K., Mudigere, D., and Rabbat, M. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale. arXiv:2309.06497, 2023. Cited on pages 1 and 8.

Streeter, M. Universal majorization-minimization algorithms. arXiv:2308.00190, 2023. Cited on page 2.

Streeter, M. J. and Dillon, J. V. Automatically bounding the Taylor remainder series: Tighter bounds and new applications. arXiv:2212.11429, 2022. Cited on page 2.

Tieleman, T. and Hinton, G. RMSprop. Coursera: Neural Networks for Machine Learning, Lecture 6.5, 2012. Cited on page 2.

Tran, D. T., Ono, N., and Vincent, E. Fast DNN training based on auxiliary function technique. International Conference on Acoustics, Speech and Signal Processing, 2015. Cited on page 2.


Yang, G. and Hu, E. J. Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, 2021. Cited on pages 1, 5, 8, and 9.

Yang, G., Simon, J. B., and Bernstein, J. A spectral condition for feature learning. arXiv:2310.17813, 2023. Cited on pages 5 and 8.

A Experimental Details

Datasets. The dataset for all experiments is CIFAR-10 (Krizhevsky & Hinton, 2009). We use the standard train and test splits with no data augmentation.

Architectures. The architecture for all experiments is a 3-layer MLP with a ReLU nonlinearity, one hidden layer, and no biases. Having three layers allows all three types of weight matrix shape to be present: square, wide rectangular, and tall rectangular. In the learning rate transfer experiment, the width of the hidden layer varies from 32 to 4096. In the weight erasure experiment, the width is fixed to 1024.

Precision. All experiments use the default precision float32. It is well demonstrated in the NanoGPT speedruns that Newton-Schulz iterations can run in bfloat16 (Jordan et al., 2024a). We believe it is a promising future direction to explore whether dualization leads to improved speed and scalability in further reduced precisions.

A.1 Learning Rate Transfer

We run experiments with hidden layer widths 32, 64, 128, 256, 512, 1024, 2048, and 4096. For each width, we sweep between 10 and 20 different learning rates. We train for 20 epochs with batch size 128.

Adam (SP). The hyperparameters for Adam are the default ones in PyTorch, except for the sweep over the learning rate. Weight matrices are initialized according to the default PyTorch initialization (Kaiming uniform).

Adam (µP). In µP, the learning rate of each layer is equal to the global learning rate (which we sweep) divided by that layer's input dimension $d_{\text{in}}$, and weight matrices are initialized as zero-mean Gaussians with standard deviation $\sigma = \sqrt{\min(d_{\text{in}}, d_{\text{out}}) / d_{\text{in}}^2}$ (Qiu et al., 2024). All other hyperparameters are the default ones in PyTorch.
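The µP scaling rules above amount to two one-line formulas. The sketch below writes them out in plain Python; the helper names are illustrative, and the layer shapes assume a width-1024 MLP acting on flattened CIFAR-10 images of dimension 3072 with 10 output classes, which is our assumption about how the inputs are presented.

```python
import math

def mup_layer_lr(global_lr, d_in):
    """Per-layer Adam learning rate: global rate divided by the layer's input dimension."""
    return global_lr / d_in

def mup_init_std(d_in, d_out):
    """Initialization scale sigma = sqrt(min(d_in, d_out) / d_in**2)."""
    return math.sqrt(min(d_in, d_out) / d_in ** 2)

# Illustrative (d_in, d_out) shapes; 3072 = 32 * 32 * 3 is an assumed flattened input size.
width = 1024
shapes = [(3072, width), (width, width), (width, 10)]
for d_in, d_out in shapes:
    print(d_in, d_out, mup_layer_lr(0.1, d_in), mup_init_std(d_in, d_out))
```

For a square layer the initialization scale reduces to $1/\sqrt{d_{\text{in}}}$, while for a layer with fewer outputs than inputs it is smaller by a factor of $\sqrt{d_{\text{out}}/d_{\text{in}}}$.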

Dualization. We use orthogonal weight initialization. Concretely, we create weight matrices with unit Gaussian entries and then iterate them through Newton-Schulz for 30 steps. Our duality-based optimizer in this experiment uses no momentum. It passes the raw gradient through the duality map $U \Sigma V^\top \mapsto \sqrt{d_{\text{out}}/d_{\text{in}}}\, U V^\top$, implemented via 5 steps of Newton-Schulz iteration followed by multiplication by the dimensional constant. We use a quintic Newton-Schulz iteration with coefficients $(2, -1.5, 0.5)$.
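A rough NumPy sketch of the dualized update described in this paragraph is given below. It is not the authors' code: the function names are hypothetical, the gradient is assumed to be stored with shape $(d_{\text{out}}, d_{\text{in}})$, the starting point is the spectrally normalized gradient, and the quintic coefficients are read as $(2, -1.5, 0.5)$, with the negative middle coefficient being our assumption about the sign.

```python
import numpy as np

def newton_schulz_quintic(G, steps=5, coeffs=(2.0, -1.5, 0.5)):
    """Quintic iteration X <- a X + b (X X^T) X + c (X X^T)^2 X (coefficients assumed)."""
    a, b, c = coeffs
    X = G / np.linalg.norm(G, ord=2)        # spectral normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + b * A @ X + c * A @ A @ X
    return X

def dualize_linear(G, steps=5):
    """Approximate U S V^T -> sqrt(d_out / d_in) * U V^T."""
    d_out, d_in = G.shape                   # assumes weights stored as (d_out, d_in)
    return np.sqrt(d_out / d_in) * newton_schulz_quintic(G, steps)

def dualized_sgd_step(W, G, lr):
    """Momentum-free update: subtract lr times the dualized gradient."""
    return W - lr * dualize_linear(G)

G = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])             # toy gradient with d_out = 2, d_in = 3
W = np.zeros_like(G)
W_new = dualized_sgd_step(W, G, lr=0.1)
print(np.linalg.svd(dualize_linear(G), compute_uv=False))
```

With only five steps, very small singular values of the gradient may not yet have reached the common target value, so the map is approximate for poorly conditioned gradients.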

A.2 Erasure of Watermarked Initial Weights

We run experiments with hidden layer width 1024 and learning rate ranging across $2^{-6}, \ldots, 2^{3}$. We train for 1000 steps with batch size 1024.

The dualization uses ten steps of a quintic Newton-Schulz iteration with coefficients $(3.0, -3.2, 1.2)$, followed by multiplication by the dimensional constant. These quintic coefficients are different from above but lead to the same duality map $U \Sigma V^\top \mapsto \sqrt{d_{\text{out}}/d_{\text{in}}} \times U V^\top$.
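That two different coefficient sets target the same map can be checked on scalars, since the matrix iteration acts on each singular value separately. The sketch below assumes the coefficients are $(2, -1.5, 0.5)$ and $(3.0, -3.2, 1.2)$, with the negative middle signs being our reading; under that reading each set sums to one, so $x = 1$ is a fixed point of both polynomials. The step count of 40 is chosen generously for illustration and is not the setting used in the experiments.

```python
def quintic(x, coeffs):
    """Evaluate g(x) = a*x + b*x**3 + c*x**5."""
    a, b, c = coeffs
    return a * x + b * x ** 3 + c * x ** 5

def iterate(x, coeffs, steps):
    """Apply the quintic `steps` times to a scalar standing in for a singular value."""
    for _ in range(steps):
        x = quintic(x, coeffs)
    return x

COEFFS_A1 = (2.0, -1.5, 0.5)    # Appendix A.1, signs assumed
COEFFS_A2 = (3.0, -3.2, 1.2)    # Appendix A.2, signs assumed

for s in (0.05, 0.5, 1.0):
    print(s, iterate(s, COEFFS_A1, 40), iterate(s, COEFFS_A2, 40))
```

The second set has slope 3 at the origin instead of 2, so it inflates small singular values faster, at the cost of approaching 1 in an oscillating rather than monotone fashion.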

To align the maximum stable learning rate between dualized and non-dualized training, we spectrally normalize the regular gradient descent update as $g \mapsto g / \|g\| / 3$. The division by 3 matches the scalar division that occurs in Linear.dualize due to the module masses for a 3-layer MLP. This way both the dualized and non-dualized methods become unstable at the same learning rate, approximately 1.1, as seen in Figure 2. Disabling spectral normalization does not change the qualitative findings.
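For comparison with the dualized update, the non-dualized baseline can be sketched as follows. The function name is hypothetical, and the norm is taken to be the spectral norm, following the phrase "spectrally normalize" in the text.

```python
import numpy as np

def normalized_gd_update(g, divisor=3.0):
    """Spectrally normalize g, then divide by `divisor` (3 for the 3-layer MLP)."""
    return g / np.linalg.norm(g, ord=2) / divisor

g = np.array([[4.0, 0.0],
              [0.0, 1.0]])              # toy gradient with singular values 4 and 1
update = normalized_gd_update(g)
print(np.linalg.svd(update, compute_uv=False))
```

Unlike dualization, this normalization only rescales the gradient: the largest singular value becomes 1/3, but the ratios between singular values are left unchanged.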