# Named Tensor Notation

Published in Transactions on Machine Learning Research (1/2023)

David Chiang (University of Notre Dame) · Alexander M. Rush (Cornell University) · Boaz Barak (Harvard University)

Reviewed on OpenReview: https://openreview.net/forum?id=hVT7SHlilx

We propose a notation for tensors with named axes, which relieves the author, reader, and future implementers of machine learning models from the burden of keeping track of the order of axes and the purpose of each. The notation makes it easy to lift operations on low-order tensors to higher-order ones, for example, from images to minibatches of images, or from an attention mechanism to multiple attention heads. After a brief overview and formal definition of the notation, we illustrate it through several examples from modern machine learning, from building blocks like attention and convolution to full models like Transformers and LeNet. We then discuss differential calculus in our notation and compare with some alternative notations. Our proposals build on ideas from many previous papers and software libraries. We hope that our notation will encourage more authors to use named tensors, resulting in clearer papers and more precise implementations.

## 1 Introduction

Formal descriptions of neural networks primarily adopt the notation of vectors and matrices from applied linear algebra (Goodfellow et al., 2016). When used to describe vector spaces, this notation is both concise and unambiguous. However, when applied to neural networks, these properties are lost. Consider the equation for attention as notated in the Transformer paper (Vaswani et al., 2017):

$$\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

The equation relates $Q$, $K$, and $V$ (for query, key, and value, respectively) as sequences of feature vectors, packed into possibly identically-sized matrices. While concise, this equation is ambiguous. Does the product $QK^\top$ sum over the sequence, or over the features?
We know that it sums over columns, but there is not enough information to know what the columns represent. Is the softmax taken over the query sequence or the key sequence? The usual notation does not offer an answer. Perniciously, the implementation of an incorrect interpretation might still run without errors. With the addition of more axes, like multiple attention heads or multiple sentences in a minibatch, the notation becomes even more cumbersome.

We propose an alternative mathematical notation for tensors with named axes.¹ The notation has a formal underpinning, but is hopefully intuitive enough that machine learning researchers can understand it without much effort. In named tensor notation, the above equation becomes

$$\mathrm{Attention} \colon \mathbb{R}^{\mathsf{key}} \times \mathbb{R}^{\mathsf{seq} \times \mathsf{key}} \times \mathbb{R}^{\mathsf{seq} \times \mathsf{val}} \to \mathbb{R}^{\mathsf{val}}$$

$$\mathrm{Attention}(Q, K, V) = \operatorname{softmax}_{\mathsf{seq}}\!\left(\frac{Q \odot_{\mathsf{key}} K}{\sqrt{|\mathsf{key}|}}\right) \odot_{\mathsf{seq}} V$$

¹We follow NumPy in using the term *axis*. Other possible terms would be *index*, *dimension*, *way*, or *mode* (Tucker, 1964), but we felt that *axis* had the least potential for confusion.

The type signature introduces three named axes: the $\mathsf{key}$ axis is for features of queries and keys, the $\mathsf{val}$ axis is for features of values, and the $\mathsf{seq}$ axis is for tokens in a sequence. (Please see Section 2.2 for an explanation of our naming convention.) This notation makes the types of each input tensor explicit. Tensor $Q$ is a query vector that is compared with key vectors, so it has a $\mathsf{key}$ axis. Tensor $K$ is a sequence of key vectors, so it has $\mathsf{seq}$ and $\mathsf{key}$ axes. Tensor $V$ is a sequence of value vectors, so it has $\mathsf{seq}$ and $\mathsf{val}$ axes. Unlike with matrix notation, the reader is not required to remember whether $\mathsf{seq}$ corresponds to rows or columns in either of these tensors. The function itself uses the named axes to precisely apply operations. The expression $Q \odot_{\mathsf{key}} K$ is a dot product over the $\mathsf{key}$ axis shared between $K$ and $Q$; there is no ambiguity about rows or columns.
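As a sanity check, the named attention equation above can be sketched in NumPy, with the axis names tracked through einsum subscripts (the variable names, toy sizes, and values below are ours, not the paper's):

```python
import numpy as np

def attention(Q, K, V):
    # Q: (key,); K: (seq, key); V: (seq, val).
    # Contract Q and K over the key axis (einsum letters: k=key, s=seq, v=val),
    # softmax over seq, then contract the weights with V over seq.
    scores = np.einsum('k,sk->s', Q, K) / np.sqrt(Q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return np.einsum('s,sv->v', weights, V)

Q = np.array([1.0, 0.0])                              # key=2
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])    # seq=3, key=2
V = np.array([[1.0], [2.0], [3.0]])                   # seq=3, val=1
out = attention(Q, K, V)                              # shape: (val,)
```

Because every contraction names its axis, there is no opportunity to confuse the sequence and feature dimensions.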
Similarly, the softmax function is annotated with the axis along which it is applied, removing any ambiguity or reliance on convention. Furthermore, named tensor notation naturally extends to *lifting* (also known as vectorizing and/or broadcasting) a function to tensors with more axes. For example, if instead of being a tensor with the single axis $\mathsf{key}$, $Q$ has the three axes $\mathsf{key}$, $\mathsf{seq}$, and $\mathsf{batch}$ (the latter two corresponding to tokens of a sequence and examples in a minibatch, respectively), then the Attention function works as written, acting on each example in the minibatch in parallel. Similarly, we can add a $\mathsf{heads}$ axis to the inputs to get multiple attention heads. These additional axes are often elided in neural network papers, possibly avoiding notational complexity, but possibly also hiding critical model details.

**Our contributions.** This work proposes a mathematical notation for named tensors and a fully specified semantic interpretation for the notation. Through examples, we demonstrate that this notation enables specifying machine learning models and operations in a succinct yet precise manner.

The need for named tensors has been recognized by several software packages, including xarray (Hoyer & Hamman, 2017), Nexus (Chen, 2017), tsalib (Sinha, 2018), AxisArrays (Bauman, 2018), NamedTensor (Rush, 2019), PyTorch (Torch Contributors, 2019), Dex (Paszke et al., 2021), JAX (JAX authors, 2021), einops (Rogozhnikov, 2022), and torchdim (DeVito, 2023). While our notation is inspired by these efforts, our focus is on mathematical notation to be used in papers, whereas previous efforts have focused on code. Our hope is that our notation will be adopted by authors, leading to clearer, more replicable papers, and that this, in turn, will encourage more implementers to adopt named tensor libraries, leading to clearer, more correct code.

## 2 Named Tensors

In standard notation, a vector, matrix, or tensor is indexed by an integer or sequence of integers; if it has dimensions $n_1, \ldots$
$\ldots, n_r$, it can be thought of as a map that takes as input $(i_1, \dots, i_r) \in [n_1] \times \cdots \times [n_r]$ and outputs a real number (or an element of a different field). For example, if $A \in \mathbb{R}^{3 \times 3}$, then the order of the two axes matters: $A_{1,3}$ and $A_{3,1}$ are not the same element. It is up to the reader to remember what each axis of each tensor stands for. This problem is exacerbated in modern machine learning, where tensors have multiple axes with different meanings (batches, channels, etc.), and different operations act on different axes.

In contrast, we propose *named tensors*, in which each axis has a name that describes it and ensures there is no confusion between axes. We write $\mathsf{ax}[n]$ for an axis with name $\mathsf{ax}$ and size $n$, and we write $\mathsf{ax}(i)$ to index the $i$-th element along axis $\mathsf{ax}$. So if a tensor has axes $\mathsf{ax}_1[n_1], \dots, \mathsf{ax}_r[n_r]$ (with $\mathsf{ax}_1, \dots, \mathsf{ax}_r$ being distinct names), it can be thought of as a map that takes as input a record $\{\mathsf{ax}_1(i_1), \dots, \mathsf{ax}_r(i_r)\}$, with $i_1 \in [n_1], \dots, i_r \in [n_r]$, and outputs a field element. In summary, the key difference is that, while a tensor in standard notation takes as input an ordered tuple of indices, a named tensor takes as input a *record*, which is an unordered set of named indices. We illustrate with some examples below, then give formal definitions.

### 2.1 By example

For example, if $A$ represents a $3 \times 3$ grayscale image, we can make it a named tensor like so (writing it two equivalent ways to show that the order of axes does not matter):

$$A \in \mathbb{R}^{\mathsf{height}[3] \times \mathsf{width}[3]} = \mathbb{R}^{\mathsf{width}[3] \times \mathsf{height}[3]}, \qquad A = \begin{bmatrix} 3 & 1 & 4 \\ 1 & 5 & 9 \\ 2 & 6 & 5 \end{bmatrix} \;(\mathsf{height} \times \mathsf{width}) \;=\; \begin{bmatrix} 3 & 1 & 2 \\ 1 & 5 & 6 \\ 4 & 9 & 5 \end{bmatrix} \;(\mathsf{width} \times \mathsf{height}) \tag{1}$$

We access elements of $A$ using named indices, whose order again does not matter: $A_{\mathsf{height}(1), \mathsf{width}(3)} = A_{\mathsf{width}(3), \mathsf{height}(1)} = 4$. We also allow partial indexing:

$$A_{\mathsf{height}(1)} = \begin{bmatrix} 3 & 1 & 4 \end{bmatrix} \;(\text{over } \mathsf{width}) \qquad A_{\mathsf{width}(3)} = \begin{bmatrix} 4 & 9 & 5 \end{bmatrix} \;(\text{over } \mathsf{height}).$$

It does not matter whether we write $A_{\mathsf{height}(1)}$ or $A_{\mathsf{width}(3)}$ as row or column vectors. In many contexts, an axis name is used with only one size.
If so, we can simply write $\mathsf{height}$ for the unique axis with name $\mathsf{height}$, as in $\mathbb{R}^{\mathsf{height} \times \mathsf{width}}$. We can leave the size of an axis unspecified at first and specify it later (e.g., deferring it to an appendix on experimental details). For example, we can specify $|\mathsf{height}| = |\mathsf{width}| = 28$ if we want to prescribe the precise size of an image, or just write $|\mathsf{height}| = |\mathsf{width}|$ to specify that it's a square image.

### 2.2 What's in a name?

Although users of this notation are free to choose any names for axes, we offer the following recommendations. First, we recommend words instead of single letters, to better communicate the meaning of each axis. More subtly, we recommend words that describe a *whole* rather than its *parts*. For example, to represent a minibatch of examples, we would name the axis $\mathsf{batch}$; to represent a sequence of tokens, we would name the axis $\mathsf{seq}$. One reason for this choice is that there are cases, like $\mathsf{height}$ and $\mathsf{width}$, where there is a name for the whole but no unambiguous name for the part. By contrast, in cases where there is a name for the part but not the whole, it's always possible to use the plural form of the name of the part. For example, if we wanted $A$ to have red, green, and blue channels, we would name the axis $\mathsf{chans}$. Section 4 contains many more examples of axis names.

### 2.3 Formal definition

We now define formally the notation we use.

**Definition 1** (Names, indices, and axes). An *axis* is a pair, written $\mathsf{ax}[I]$, where $\mathsf{ax}$ is the *name* of the axis, which is simply a string of letters (we write both names and variables ranging over names using sans-serif font), and $I$ is a set of *indices*. In this paper, $I$ is always of the form $\{1, \dots, n\}$ for some $n$, so we abbreviate $\mathsf{ax}[\{1, \dots, n\}]$ as $\mathsf{ax}[n]$. In many contexts, there is only one axis with name $\mathsf{ax}$, and so we refer to the axis simply as $\mathsf{ax}$. The context always makes it clear whether $\mathsf{ax}$ is a name or an axis. If $\mathsf{ax}$ is an axis, we write $\operatorname{ind}(\mathsf{ax})$ for its index set, and we write $|\mathsf{ax}|$ as shorthand for $|\operatorname{ind}(\mathsf{ax})|$.
**Definition 2** (Named indices and records). If $\mathsf{ax}[I]$ is an axis and $i \in I$, then a *named index* is a pair, written $\mathsf{ax}(i)$. A *record* is a set of named indices $\{\mathsf{ax}_1(i_1), \dots, \mathsf{ax}_r(i_r)\}$, where $\mathsf{ax}_1, \dots, \mathsf{ax}_r$ are pairwise distinct names.

**Definition 3** (Shapes). A *shape* is a set of axes, written $\mathsf{ax}_1[I_1] \times \cdots \times \mathsf{ax}_r[I_r]$, where $\mathsf{ax}_1, \dots, \mathsf{ax}_r$ are pairwise distinct names. We write $\emptyset$ for the empty shape. A shape defines a set of records:

$$\operatorname{rec}(\mathsf{ax}_1[I_1] \times \cdots \times \mathsf{ax}_r[I_r]) = \bigl\{ \{\mathsf{ax}_1(i_1), \dots, \mathsf{ax}_r(i_r)\} \mid i_1 \in I_1, \dots, i_r \in I_r \bigr\}.$$

We say two shapes $\mathcal S$ and $\mathcal T$ are *compatible* if whenever $\mathsf{ax}[I] \in \mathcal S$ and $\mathsf{ax}[J] \in \mathcal T$, then $I = J$. We say that $\mathcal S$ and $\mathcal T$ are *orthogonal* if there is no $\mathsf{ax}$ such that $\mathsf{ax}[I] \in \mathcal S$ and $\mathsf{ax}[J] \in \mathcal T$ for any $I$, $J$. If $t \in \operatorname{rec} \mathcal T$ and $\mathcal S \subseteq \mathcal T$, then we write $t|_{\mathcal S}$ for the unique record in $\operatorname{rec} \mathcal S$ such that $t|_{\mathcal S} \subseteq t$.

**Definition 4** (Named tensors). Let $F$ be a field and let $\mathcal S$ be a shape. Then a *named tensor over $F$ with shape $\mathcal S$* is a mapping from $\operatorname{rec} \mathcal S$ to $F$. If $X$ has shape $\mathcal S$, then we write $\operatorname{shp} X = \mathcal S$. We write the set of all named tensors with shape $\mathcal S$ as $F^{\mathcal S}$. We don't make any distinction between a scalar (an element of $F$) and a named tensor with empty shape (an element of $F^{\emptyset}$).

If $X \in F^{\mathcal S}$, then we access an element of $X$ by applying it to a record $s \in \operatorname{rec} \mathcal S$; but we write this using the usual subscript notation: $X_s$ rather than $X(s)$. To avoid clutter, in place of $X_{\{\mathsf{ax}_1(i_1), \dots, \mathsf{ax}_r(i_r)\}}$, we usually write $X_{\mathsf{ax}_1(i_1), \dots, \mathsf{ax}_r(i_r)}$. When a named tensor is an expression like $(X + Y)$, we index it by surrounding it with square brackets, like this: $[X + Y]_{\mathsf{ax}_1(i_1), \dots, \mathsf{ax}_r(i_r)}$.

We also allow partial indexing. If $X$ is a tensor with shape $\mathcal T$ and $s \in \operatorname{rec} \mathcal S$ where $\mathcal S \subseteq \mathcal T$, then we define $X_s$ to be the named tensor with shape $\mathcal T \setminus \mathcal S$ such that, for any $t \in \operatorname{rec}(\mathcal T \setminus \mathcal S)$, $[X_s]_t = X_{s \cup t}$. (For the edge case $\mathcal T = \emptyset$, our definitions for indexing and partial indexing coincide: one gives a scalar and the other gives a tensor with empty shape, but we don't distinguish between the two.)
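The record-based definitions above can be mimicked with a toy wrapper around NumPy, where a tensor is an array plus one name per axis and indexing takes an (unordered) record of named indices. This class and its method names are our own illustration, not an API from the paper:

```python
import numpy as np

class NamedTensor:
    # Toy named tensor: a NumPy array plus a name for each axis.
    # Hypothetical sketch of Definitions 2-4.
    def __init__(self, data, names):
        self.data = np.asarray(data)
        self.names = tuple(names)

    def get(self, **record):
        # Index by a record of named indices (1-based, as in the paper);
        # keyword-argument order does not matter. Axes not mentioned in the
        # record are kept, which implements partial indexing.
        key = tuple(record[n] - 1 if n in record else slice(None)
                    for n in self.names)
        rest = tuple(n for n in self.names if n not in record)
        out = self.data[key]
        return out.item() if not rest else NamedTensor(out, rest)

A = NamedTensor([[3, 1, 4], [1, 5, 9], [2, 6, 5]], ['height', 'width'])
full = A.get(height=1, width=3)   # = A.get(width=3, height=1) = 4
row = A.get(height=1)             # partial indexing: a vector over width
```

Because the record is a set of named indices, `A.get(height=1, width=3)` and `A.get(width=3, height=1)` are the same element, exactly as in Section 2.1.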
## 3 Operations

A significant benefit of named tensor notation is that it allows one to unambiguously specify operations that map tensors to tensors, and defines precisely how operations can be *lifted* when an operation is applied to tensors with more axes than are present in its signature, and how *broadcasting* happens when different arguments add different axes. We start with the formal definition of named tensor operations and lifting, then show how this definition leads to many common operations.

### 3.1 Formal definition

By (named) tensor *function* or (named) tensor *operation*, we mean not only functions from tensors to tensors, but also operators like negation ($-$), addition ($+$), and so on. We extend the standard function/operator notation by allowing tensor operations to be lifted to higher-order tensors.

**Definition 5** (lifting, unary). Let $f \colon F^{\mathcal S} \to G^{\mathcal T}$ be a function from tensors to tensors. For any shape $\mathcal S'$ orthogonal to both $\mathcal S$ and $\mathcal T$, we can define the lift of $f$ by $\mathcal S'$ to be the map

$$f^{\mathcal S'} \colon F^{\mathcal S \cup \mathcal S'} \to G^{\mathcal T \cup \mathcal S'}, \qquad \bigl[f^{\mathcal S'}(X)\bigr]_{s'} = f(X_{s'}) \quad \text{for all } X \in F^{\mathcal S \cup \mathcal S'} \text{ and } s' \in \operatorname{rec} \mathcal S'.$$

Usually, we simply write $f$ instead of $f^{\mathcal S'}$. That is, for every tensor $X$ with shape $\mathcal R \supseteq \mathcal S$, we let $f(X) = f^{\mathcal R \setminus \mathcal S}(X)$.

If $f$ is a multary function, we can lift each of its arguments to larger shapes, and we don't have to add the same axes to all the arguments; an axis present in one argument but not another is broadcast from the former to the latter. We consider just the case of two arguments; three or more arguments are analogous.

**Definition 6** (lifting, binary). Let $f \colon F^{\mathcal S} \times G^{\mathcal T} \to H^{\mathcal U}$ be a binary function from tensors to tensors. For any shapes $\mathcal S'$ and $\mathcal T'$ that are compatible with each other and orthogonal to $\mathcal S$ and $\mathcal T$, respectively, and such that $\mathcal S' \cup \mathcal T'$ is orthogonal to $\mathcal U$, we can lift $f$ to:

$$f^{\mathcal S', \mathcal T'} \colon F^{\mathcal S \cup \mathcal S'} \times G^{\mathcal T \cup \mathcal T'} \to H^{\mathcal U \cup \mathcal S' \cup \mathcal T'}$$

$$\bigl[f^{\mathcal S', \mathcal T'}(X, Y)\bigr]_{s'} = f\bigl(X_{s'|_{\mathcal S'}}, Y_{s'|_{\mathcal T'}}\bigr) \quad \text{for all } X \in F^{\mathcal S \cup \mathcal S'},\ Y \in G^{\mathcal T \cup \mathcal T'},\ s' \in \operatorname{rec}(\mathcal S' \cup \mathcal T').$$

Again, we usually write $f$ instead of $f^{\mathcal S', \mathcal T'}$.
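Binary lifting with broadcasting by name, as in Definition 6, can be sketched in NumPy by aligning the two arguments' axes by name before applying the operator (the helper names `align` and `add_named` are ours):

```python
import numpy as np

def align(a, names, all_names):
    # Insert length-1 axes for missing names, then order axes as in all_names.
    a, names = np.asarray(a), list(names)
    for i, n in enumerate(all_names):
        if n not in names:
            a = np.expand_dims(a, axis=i)
            names.insert(i, n)
    return np.transpose(a, [names.index(n) for n in all_names])

def add_named(x, x_names, y, y_names):
    # Lifted addition: an axis present in one argument but not the other
    # is broadcast to it, regardless of axis order.
    all_names = list(x_names) + [n for n in y_names if n not in x_names]
    return align(x, x_names, all_names) + align(y, y_names, all_names), all_names

A = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])  # axes: height, width
x = np.array([2, 7, 1])                          # axis: height
y = np.array([1, 4, 1])                          # axis: width
Ax, _ = add_named(A, ['height', 'width'], x, ['height'])
Ay, _ = add_named(A, ['height', 'width'], y, ['width'])
```

Note that positional NumPy broadcasting would treat `A + x` and `A + y` identically; aligning by name recovers the intended distinction.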
In the following sections, we present some consequences of the above lifting rules. In particular, we show how they allow one to lift some common operations from operating on scalars, vectors, or matrices to operating on tensors with more axes, and how they correspond to vectorizing and broadcasting (as implemented by NumPy and related packages).

### 3.2 Elementwise operations and broadcasting

Any function from a scalar to a scalar corresponds to a tensor function with signature $F^{\emptyset} \to F^{\emptyset}$. Hence lifting it to any tensor shape, by Definition 5, corresponds to elementwise application. For example, given the logistic sigmoid function

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

we can lift it to tensors. If $A \in \mathbb{R}^{\mathsf{height}[3] \times \mathsf{width}[3]}$ is the example tensor (1), then

$$\sigma(A) = \begin{bmatrix} \frac{1}{1+\exp(-3)} & \frac{1}{1+\exp(-1)} & \frac{1}{1+\exp(-4)} \\ \frac{1}{1+\exp(-1)} & \frac{1}{1+\exp(-5)} & \frac{1}{1+\exp(-9)} \\ \frac{1}{1+\exp(-2)} & \frac{1}{1+\exp(-6)} & \frac{1}{1+\exp(-5)} \end{bmatrix} \;(\mathsf{height} \times \mathsf{width}).$$

Similarly for rectified linear units ($\operatorname{relu}(x) = \max(0, x)$), negation, and so on.

Any function with signature $\mathbb{R} \times \mathbb{R} \to \mathbb{R}$, including binary operators like addition ($+$), can be applied to two named tensors with the same shape. But if we apply a binary function or operator to tensors with different shapes, then, by Definition 6, broadcasting applies. For example, let

$$x \in \mathbb{R}^{\mathsf{height}[3]}, \quad x = \begin{bmatrix} 2 \\ 7 \\ 1 \end{bmatrix}, \qquad y \in \mathbb{R}^{\mathsf{width}[3]}, \quad y = \begin{bmatrix} 1 & 4 & 1 \end{bmatrix}.$$

(We write $x$ as a column just to make the broadcasting easier to visualize.) Then, to evaluate $A + x$, we effectively replace $x$ with a new tensor containing a copy of $x$ for every index of axis $\mathsf{width}$. Likewise for $A + y$:

$$A + x = \begin{bmatrix} 3+2 & 1+2 & 4+2 \\ 1+7 & 5+7 & 9+7 \\ 2+1 & 6+1 & 5+1 \end{bmatrix} \qquad A + y = \begin{bmatrix} 3+1 & 1+4 & 4+1 \\ 1+1 & 5+4 & 9+1 \\ 2+1 & 6+4 & 5+1 \end{bmatrix}$$

Similarly for other operations. We write elementwise multiplication (Hadamard product) as $\odot$.

### 3.3 Reductions

The same rules apply to functions from vectors to scalars, called *reductions*.
We specify which axis a reduction applies to using a subscript (equivalent to the `axis` argument in NumPy and `dim` in PyTorch), so that even after lifting, we know which axis to reduce. For example, we can define summation:

$$\sum_{\mathsf{ax}} \colon \mathbb{R}^{\mathsf{ax}[I]} \to \mathbb{R}$$

and use it on $A$ from Eq. (1):

$$\sum_{\mathsf{height}} A = \sum_i A_{\mathsf{height}(i)} = \begin{bmatrix} 3+1+2 & 1+5+6 & 4+9+5 \end{bmatrix} \;(\text{over } \mathsf{width})$$

$$\sum_{\mathsf{width}} A = \sum_j A_{\mathsf{width}(j)} = \begin{bmatrix} 3+1+4 & 1+5+9 & 2+6+5 \end{bmatrix} \;(\text{over } \mathsf{height}).$$

We can also write multiple names to sum over multiple axes:

$$\sum_{\mathsf{height}, \mathsf{width}} A = \sum_i \sum_j A_{\mathsf{height}(i), \mathsf{width}(j)} = 3+1+4+1+5+9+2+6+5.$$

But a summation with an index variable (like $i$ or $j$ above) is a standard summation over values of that variable, and a summation with no subscript is a standard summation over a set. Other examples of reductions include:

$$\operatorname{norm}_{\mathsf{ax}} X = \sqrt{\textstyle\sum_{\mathsf{ax}} X^2} \qquad \operatorname{norm}^p_{\mathsf{ax}} X = \Bigl(\textstyle\sum_{\mathsf{ax}} |X|^p\Bigr)^{1/p}$$

$$\min_{\mathsf{ax}} X = \min\{X_{\mathsf{ax}(i)} \mid i \in I\} \qquad \max_{\mathsf{ax}} X = \max\{X_{\mathsf{ax}(i)} \mid i \in I\}$$

$$\operatorname{mean}_{\mathsf{ax}} X = \frac{1}{|\mathsf{ax}|} \sum_{\mathsf{ax}} X \qquad \operatorname{var}_{\mathsf{ax}} X = \frac{1}{|\mathsf{ax}|} \sum_{\mathsf{ax}} \bigl(X - \operatorname{mean}_{\mathsf{ax}} X\bigr)^2$$

### 3.4 Contraction

The vector dot-product is a function from two vectors to a scalar. We write it as follows:

$$\odot_{\mathsf{ax}} \colon \mathbb{R}^{\mathsf{ax}[I]} \times \mathbb{R}^{\mathsf{ax}[I]} \to \mathbb{R}, \qquad X \odot_{\mathsf{ax}} Y = \sum_{i \in I} X_{\mathsf{ax}(i)} Y_{\mathsf{ax}(i)}.$$

When lifted to higher-order tensors, the dot-product generalizes to the ubiquitous *contraction* operator, which can also be thought of as elementwise multiplication followed by summation over one axis, that is,

$$X \odot_{\mathsf{ax}} Y = \sum_{\mathsf{ax}} X \odot Y. \tag{2}$$

For example,

$$A \odot_{\mathsf{height}} x = \sum_{\mathsf{height}} A \odot x = \begin{bmatrix} 3 \cdot 2 + 1 \cdot 7 + 2 \cdot 1 & 1 \cdot 2 + 5 \cdot 7 + 6 \cdot 1 & 4 \cdot 2 + 9 \cdot 7 + 5 \cdot 1 \end{bmatrix} \;(\text{over } \mathsf{width})$$

$$A \odot_{\mathsf{width}} y = \sum_{\mathsf{width}} A \odot y = \begin{bmatrix} 3 \cdot 1 + 1 \cdot 4 + 4 \cdot 1 & 1 \cdot 1 + 5 \cdot 4 + 9 \cdot 1 & 2 \cdot 1 + 6 \cdot 4 + 5 \cdot 1 \end{bmatrix} \;(\text{over } \mathsf{height}).$$

Again, we can write multiple names to contract multiple axes at once:

$$A \odot_{\mathsf{height}, \mathsf{width}} A = \sum_{\mathsf{height}, \mathsf{width}} A \odot A = 3 \cdot 3 + 1 \cdot 1 + 4 \cdot 4 + 1 \cdot 1 + 5 \cdot 5 + 9 \cdot 9 + 2 \cdot 2 + 6 \cdot 6 + 5 \cdot 5.$$

A $\odot$ with no axis name under it contracts zero axes and is equivalent to elementwise multiplication, which is the reason we use the same symbol for elementwise multiplication and contraction.
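Both named reductions and contraction reduce to locating an axis by name and delegating to an ordinary NumPy call; the helper functions below are our own sketch, not a library API:

```python
import numpy as np

def reduce_named(X, names, op, ax):
    # Apply a NumPy reduction along the axis with the given *name*,
    # mirroring the subscripted reductions above.
    return op(X, axis=names.index(ax)), [n for n in names if n != ax]

def contract(X, x_names, Y, y_names, ax):
    # X (.)_ax Y: elementwise multiply, then sum over the shared named axis.
    # Assumes the remaining axes of X and Y are disjoint.
    out = np.tensordot(X, Y, axes=(x_names.index(ax), y_names.index(ax)))
    return out, [n for n in x_names if n != ax] + [n for n in y_names if n != ax]

A = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])  # axes: height, width
x = np.array([2, 7, 1])                          # axis: height
sums, rest = reduce_named(A, ['height', 'width'], np.sum, 'height')
Ax, names = contract(A, ['height', 'width'], x, ['height'], 'height')
```

The returned name lists make the result shapes explicit: reducing or contracting over `height` leaves a tensor over `width` alone.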
The contraction operator can be used for many multiplication-like operations:

$$x \odot_{\mathsf{height}} x = \sum_{\mathsf{height}} x \odot x \qquad \text{inner product}$$

$$x \odot y = \begin{bmatrix} 2 \cdot 1 & 2 \cdot 4 & 2 \cdot 1 \\ 7 \cdot 1 & 7 \cdot 4 & 7 \cdot 1 \\ 1 \cdot 1 & 1 \cdot 4 & 1 \cdot 1 \end{bmatrix} \qquad \text{outer product}$$

$$A \odot_{\mathsf{width}} y = \sum_{\mathsf{width}} A \odot y \qquad \text{matrix-vector product}$$

$$x \odot_{\mathsf{height}} A = \sum_{\mathsf{height}} x \odot A \qquad \text{vector-matrix product}$$

$$A \odot_{\mathsf{width}} B = \sum_{\mathsf{width}} A \odot B \qquad \text{matrix-matrix product} \quad (B \in \mathbb{R}^{\mathsf{width} \times \mathsf{width}'})$$

A contraction of three or more tensors can be written as a sum. For example, the three-way inner product of vectors $x, y, z \in \mathbb{R}^{\mathsf{width}}$ can be written as $\sum_{\mathsf{width}} x \odot y \odot z$. Like the dot-product from which it is lifted, but unlike matrix multiplication, the contraction operator is commutative, but not associative. However, contraction does obey the following associativity-like law:

$$X \odot_{\mathcal S \cup \mathcal T} (Y \odot_{\mathcal U} Z) = (X \odot_{\mathcal S} Y) \odot_{\mathcal T \cup \mathcal U} Z \quad \text{if } \mathcal S \cap \operatorname{shp} Z = \mathcal U \cap \operatorname{shp} X = \emptyset. \tag{3}$$

The special case

$$X \odot_{\mathcal S} (Y \odot_{\mathcal U} Z) = (X \odot_{\mathcal S} Y) \odot_{\mathcal U} Z \quad \text{if } \mathcal S \cap \operatorname{shp} Z = \mathcal U \cap \operatorname{shp} X = \emptyset \tag{4}$$

will be useful in Section 5 for moving $Z$ from inside one or more sets of parentheses to the outside.

### 3.5 Vectors to vectors and beyond

Functions from vectors to vectors ($\mathbb{R}^{\mathsf{ax}[I]} \to \mathbb{R}^{\mathsf{ax}[I]}$) lift to functions on tensors that operate along one axis, but leave the tensor shape unchanged. Such functions are particularly problematic in standard notation, which does not provide any way (to our knowledge) of specifying which axis the operation should be performed over.
Such functions include:

$$\operatorname{softmax}_{\mathsf{ax}} X = \frac{\exp X}{\sum_{\mathsf{ax}} \exp X} \tag{5a}$$

$$\operatorname{argmax}_{\mathsf{ax}} X = \lim_{\alpha \to \infty} \operatorname{softmax}_{\mathsf{ax}} \alpha X \tag{5b}$$

$$\operatorname{argmin}_{\mathsf{ax}} X = \lim_{\alpha \to -\infty} \operatorname{softmax}_{\mathsf{ax}} \alpha X \tag{5c}$$

For example, we can clearly distinguish between two ways of performing a softmax on $A$:

$$\operatorname{softmax}_{\mathsf{height}} A = \begin{bmatrix} \frac{\exp 3}{\exp 3 + \exp 1 + \exp 2} & \frac{\exp 1}{\exp 1 + \exp 5 + \exp 6} & \frac{\exp 4}{\exp 4 + \exp 9 + \exp 5} \\ \frac{\exp 1}{\exp 3 + \exp 1 + \exp 2} & \frac{\exp 5}{\exp 1 + \exp 5 + \exp 6} & \frac{\exp 9}{\exp 4 + \exp 9 + \exp 5} \\ \frac{\exp 2}{\exp 3 + \exp 1 + \exp 2} & \frac{\exp 6}{\exp 1 + \exp 5 + \exp 6} & \frac{\exp 5}{\exp 4 + \exp 9 + \exp 5} \end{bmatrix}$$

$$\operatorname{softmax}_{\mathsf{width}} A = \begin{bmatrix} \frac{\exp 3}{\exp 3 + \exp 1 + \exp 4} & \frac{\exp 1}{\exp 3 + \exp 1 + \exp 4} & \frac{\exp 4}{\exp 3 + \exp 1 + \exp 4} \\ \frac{\exp 1}{\exp 1 + \exp 5 + \exp 9} & \frac{\exp 5}{\exp 1 + \exp 5 + \exp 9} & \frac{\exp 9}{\exp 1 + \exp 5 + \exp 9} \\ \frac{\exp 2}{\exp 2 + \exp 6 + \exp 5} & \frac{\exp 6}{\exp 2 + \exp 6 + \exp 5} & \frac{\exp 5}{\exp 2 + \exp 6 + \exp 5} \end{bmatrix}$$

### 3.6 Renaming and reshaping

It's often useful to rename an axis (analogous to a transpose operation in standard notation). We can think of this as the lift of a function from vectors to vectors, but with different input and output axes:

$$[\cdot]_{\mathsf{ax} \to \mathsf{ax}'} \colon \mathbb{R}^{\mathsf{ax}[I]} \to \mathbb{R}^{\mathsf{ax}'[I]}, \qquad [X_{\mathsf{ax} \to \mathsf{ax}'}]_{\mathsf{ax}'(i)} = X_{\mathsf{ax}(i)}.$$

For example,

$$A_{\mathsf{height} \to \mathsf{height}'} = \begin{bmatrix} 3 & 1 & 4 \\ 1 & 5 & 9 \\ 2 & 6 & 5 \end{bmatrix} \;(\mathsf{height}' \times \mathsf{width}).$$

We can also define notation for reshaping two or more axes into one axis:

$$A_{(\mathsf{height}, \mathsf{width}) \to \mathsf{layer}} = \begin{bmatrix} 3 & 1 & 4 & 1 & 5 & 9 & 2 & 6 & 5 \end{bmatrix} \;(\text{over } \mathsf{layer}).$$

The order of elements in the new axis is undefined. Authors who need a particular ordering may write a more specific definition.

### 3.7 Indexing²

NumPy and its derivatives provide various ways to recombine elements of a tensor to form a new tensor: integer array indexing, and functions like `numpy.take`, `numpy.take_along_axis`, `torch.index_select`, and `torch.gather`. Using named tensors, we can write nearly all of these operations as lifts of a single function:

$$\operatorname{index}_{\mathsf{ax}} \colon \mathbb{R}^{\mathsf{ax}[n]} \times [n] \to \mathbb{R}, \qquad \operatorname{index}_{\mathsf{ax}}(X, i) = X_{\mathsf{ax}(i)}.$$

For example, suppose we have

$$E \in \mathbb{R}^{\mathsf{vocab}[n] \times \mathsf{emb}} \qquad I \in [n]^{\mathsf{seq}} \qquad P \in \mathbb{R}^{\mathsf{seq} \times \mathsf{vocab}[n]}$$

²We are grateful to Tongfei Chen and Chu-Cheng Lin for contributing the original idea behind this section, as well as the example.
Tensor $E$ contains word embeddings for all the words in the vocabulary. Integer $i$ is the numeric identifier of a word, while tensor $I$ is a sequence of numeric identifiers of words. Tensor $P$ contains a sequence of probability distributions over the vocabulary (e.g., the predictions of a language model). Then:

- $\operatorname{index}_{\mathsf{vocab}}(E, i)$ broadcasts the $\mathsf{emb}$ axis from $E$ to $i$, giving the word embedding of word $i$. This is the same as partial indexing ($E_{\mathsf{vocab}(i)}$).
- $\operatorname{index}_{\mathsf{vocab}}(E, I)$ also broadcasts the $\mathsf{seq}$ axis from $I$ to $E$, giving a sequence of word embeddings. This is the same as integer array indexing (`E[I]`), `numpy.take(E, I, 0)`, or `torch.index_select(E, 0, I)`.
- $\operatorname{index}_{\mathsf{vocab}}(P, I)$ aligns $P$'s and $I$'s $\mathsf{seq}$ axes, giving a sequence of probabilities. This is the same as `numpy.take_along_axis(P, I, 0)` or `torch.gather(P, 0, I)`.
- If $P$ and $I$ additionally had a $\mathsf{batch}$ axis (before the other axes), then $\operatorname{index}_{\mathsf{vocab}}(P, I)$ would be the same as `tensorflow.gather(P, I, axis=1, batch_dims=1)`.

In NumPy, indexing using two or more integer arrays requires a special definition with some surprising special cases. With named tensors, we simply apply the indexing function twice. For example, if we wanted to get the probabilities of words $J$ at a subset $I$ of positions, we could let:

$$I \in [m]^{\mathsf{subseq}} \quad \text{(positions)} \qquad J \in [n]^{\mathsf{subseq}} \quad \text{(numeric identifiers)}$$

$$S = \operatorname{index}_{\mathsf{vocab}}\bigl(\operatorname{index}_{\mathsf{seq}}(P, I), J\bigr) \in \mathbb{R}^{\mathsf{subseq}}, \qquad S_{\mathsf{subseq}(k)} = P_{\mathsf{seq}(I_{\mathsf{subseq}(k)}),\, \mathsf{vocab}(J_{\mathsf{subseq}(k)})}.$$

## 4 Worked Examples: Neural Networks

In this section we give a series of worked examples illustrating how standard neural network model components can be written using named tensors. Appendix A builds some of these components into complete specifications of the Transformer and LeNet.
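The embedding and probability-lookup cases of the index function above can be checked directly in NumPy (0-based identifiers here, unlike the paper's 1-based axes; the toy values are ours):

```python
import numpy as np

# E: (vocab, emb) word embeddings; I: (seq,) word ids; P: (seq, vocab).
E = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])     # vocab=3, emb=2
I = np.array([2, 0, 1])                                 # seq=3
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])                         # seq=3, vocab=3

# index_vocab(E, I): broadcast seq from I, giving a (seq, emb) embedding table.
emb_seq = E[I]

# index_vocab(P, I): align the shared seq axis, giving one probability per position.
probs = np.take_along_axis(P, I[:, None], axis=1)[:, 0]
```

The two calls look different in positional NumPy (fancy indexing vs. `take_along_axis`) precisely because positional notation cannot express whether the `seq` axes should be broadcast or aligned; in named notation both are the same lifted function.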
### 4.1 Feedforward neural networks

A multi-layer, feedforward neural network with different-sized layers can be written as:

$$X^1 = \sigma(W^1 \odot_{\mathsf{input}} X^0 + b^1) \qquad W^1 \in \mathbb{R}^{\mathsf{hidden1} \times \mathsf{input}}, \; b^1 \in \mathbb{R}^{\mathsf{hidden1}}$$

$$X^2 = \sigma(W^2 \odot_{\mathsf{hidden1}} X^1 + b^2) \qquad W^2 \in \mathbb{R}^{\mathsf{hidden2} \times \mathsf{hidden1}}, \; b^2 \in \mathbb{R}^{\mathsf{hidden2}}$$

$$X^3 = \sigma(W^3 \odot_{\mathsf{hidden2}} X^2 + b^3) \qquad W^3 \in \mathbb{R}^{\mathsf{out} \times \mathsf{hidden2}}, \; b^3 \in \mathbb{R}^{\mathsf{out}}$$

The layer sizes can be specified by writing $|\mathsf{hidden1}| = n_1$, etc. As noted above, $\sigma$ is applied elementwise and does not require additional annotation.

Alternatively, the layer equation can be abstracted by defining:

$$\mathrm{FullConn}^l(x) = \sigma\bigl(W^l \odot_{\mathsf{layer}} x + b^l\bigr)_{\mathsf{layer}' \to \mathsf{layer}} \qquad W^l \in \mathbb{R}^{\mathsf{layer}'[n_l] \times \mathsf{layer}[n_{l-1}]}, \; b^l \in \mathbb{R}^{\mathsf{layer}'[n_l]}.$$

The function $\mathrm{FullConn}^l$ encapsulates both the equation for layer $l$ as well as its parameters $W^l$, $b^l$ (analogous to what TensorFlow and PyTorch call modules). Since we chose to use the same axis name $\mathsf{layer}$ for all the layers (with different sizes $n_l$), $\mathrm{FullConn}^l$ temporarily computes its output over axis $\mathsf{layer}'$, then renames it back to $\mathsf{layer}$. The network can be defined like this:

$$X^0 \in \mathbb{R}^{\mathsf{layer}[n_0]} \qquad X^1 = \mathrm{FullConn}^1(X^0) \qquad X^2 = \mathrm{FullConn}^2(X^1) \qquad X^3 = \mathrm{FullConn}^3(X^2).$$

### 4.2 Recurrent neural networks

As a second example, we consider a simple (Elman) RNN. This model is similar to the feedforward network, except that the number of timesteps is variable and the parameters are shared over time. At each time step, it produces a tensor with a new axis $\mathsf{hidden}'$, which is then renamed to $\mathsf{hidden}$ for the next step in the recursion.

$$x^t \in \mathbb{R}^{\mathsf{input}} \quad t = 1, \dots, n$$

$$W^{\mathrm h} \in \mathbb{R}^{\mathsf{hidden}' \times \mathsf{hidden}} \quad |\mathsf{hidden}| = |\mathsf{hidden}'| \qquad W^{\mathrm i} \in \mathbb{R}^{\mathsf{input} \times \mathsf{hidden}'} \qquad b \in \mathbb{R}^{\mathsf{hidden}'}$$

$$h^t = \sigma\bigl(W^{\mathrm h} \odot_{\mathsf{hidden}} h^{t-1} + W^{\mathrm i} \odot_{\mathsf{input}} x^t + b\bigr)_{\mathsf{hidden}' \to \mathsf{hidden}} \quad t = 1, \dots, n$$

### 4.3 Attention

In the introduction (§1), we described difficulties in interpreting the equation for attention as used with Transformers (Vaswani et al., 2017).
In our notation, it looks like this:

$$\mathrm{Attention} \colon \mathbb{R}^{\mathsf{key}} \times \mathbb{R}^{\mathsf{seq} \times \mathsf{key}} \times \mathbb{R}^{\mathsf{seq} \times \mathsf{val}} \to \mathbb{R}^{\mathsf{val}} \tag{6}$$

$$\mathrm{Attention}(Q, K, V) = \operatorname{softmax}_{\mathsf{seq}}\!\left(\frac{Q \odot_{\mathsf{key}} K}{\sqrt{|\mathsf{key}|}}\right) \odot_{\mathsf{seq}} V$$

This definition takes a single query vector $Q$ and returns a single result vector (and actually could be reduced further to a scalar value, as $\mathsf{val}$ is not strictly necessary). To apply it to a sequence, we can give $Q$ a $\mathsf{seq}'$ axis, and the function will compute an output sequence. Providing $Q$, $K$, and $V$ with a $\mathsf{heads}$ axis lifts the function to compute multiple attention heads.

For Transformers we often need to apply a mask to ensure attention is only applied to valid keys (e.g., for causal language models). We can modify the equation to include this mask:

$$\mathrm{Attention} \colon \mathbb{R}^{\mathsf{key}} \times \mathbb{R}^{\mathsf{seq} \times \mathsf{key}} \times \mathbb{R}^{\mathsf{seq} \times \mathsf{val}} \times \mathbb{R}^{\mathsf{seq}} \to \mathbb{R}^{\mathsf{val}}$$

$$\mathrm{Attention}(Q, K, V, M) = \operatorname{softmax}_{\mathsf{seq}}\!\left(\frac{Q \odot_{\mathsf{key}} K}{\sqrt{|\mathsf{key}|}} + M\right) \odot_{\mathsf{seq}} V \tag{7}$$

Appendix A.1 includes a full specification of the complete Transformer model using the named tensor notation.

### 4.4 Convolution

Standard neural network convolutions can be specified by unrolling a vector and then applying a standard dot product. We define an axis-parameterized unrolling function that converts a one-axis tensor to a sequence of kernel-sized vectors:

$$\operatorname{unroll}^{\mathsf{kernel}}_{\mathsf{seq}} \colon \mathbb{R}^{\mathsf{seq}[n]} \to \mathbb{R}^{\mathsf{seq}[n - |\mathsf{kernel}| + 1] \times \mathsf{kernel}}$$

$$\operatorname{unroll}^{\mathsf{kernel}}_{\mathsf{seq}} X = Y, \quad \text{where } Y_{\mathsf{seq}(i), \mathsf{kernel}(j)} = X_{\mathsf{seq}(i + j - 1)}.$$

A 1d convolution with input channels $\mathsf{chans}$ and output channels $\mathsf{chans}'$ consists of unrolling along the $\mathsf{seq}$ axis and then taking a dot product:

$$\mathrm{Conv1d} \colon \mathbb{R}^{\mathsf{chans} \times \mathsf{seq}[n]} \to \mathbb{R}^{\mathsf{chans}' \times \mathsf{seq}[n']}$$

$$\mathrm{Conv1d}(X; W, b) = W \odot_{\mathsf{chans}, \mathsf{kernel}} \operatorname{unroll}^{\mathsf{kernel}}_{\mathsf{seq}} X + b \qquad W \in \mathbb{R}^{\mathsf{chans}' \times \mathsf{chans} \times \mathsf{kernel}}, \; b \in \mathbb{R}^{\mathsf{chans}'}$$

Unrolling easily generalizes to higher-dimensional convolutions:

$$\mathrm{Conv2d} \colon \mathbb{R}^{\mathsf{chans} \times \mathsf{height}[h] \times \mathsf{width}[w]} \to \mathbb{R}^{\mathsf{chans}' \times \mathsf{height}[h'] \times \mathsf{width}[w']}$$

$$\mathrm{Conv2d}(X; W, b) = W \odot_{\mathsf{chans}, \mathsf{kh}, \mathsf{kw}} \operatorname{unroll}^{\mathsf{kh}}_{\mathsf{height}} \operatorname{unroll}^{\mathsf{kw}}_{\mathsf{width}} X + b \qquad W \in \mathbb{R}^{\mathsf{chans}' \times \mathsf{chans} \times \mathsf{kh} \times \mathsf{kw}}, \; b \in \mathbb{R}^{\mathsf{chans}'}$$

### 4.5 Pooling

Pooling is similar to convolutions. We first define a function to partition a tensor into windows:

$$\operatorname{pool}_{\mathsf{seq}, \mathsf{kernel}} \colon \mathbb{R}^{\mathsf{seq}[n]} \to \mathbb{R}^{\mathsf{seq}[n / |\mathsf{kernel}|] \times \mathsf{kernel}}$$

$$\operatorname{pool}_{\mathsf{seq}, \mathsf{kernel}} X = Y, \quad \text{where } Y_{\mathsf{seq}(i), \mathsf{kernel}(j)} = X_{\mathsf{seq}((i-1) \cdot |\mathsf{kernel}| + j)}.$$
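Unrolling followed by a named contraction is all a convolution needs; a minimal NumPy sketch of `unroll` and `Conv1d` (0-based indexing, our variable names, and a toy moving-sum kernel):

```python
import numpy as np

def unroll(x, k):
    # unroll_seq^kernel: (..., n) -> (..., n-k+1, k); the 0-based version of
    # Y[seq(i), kernel(j)] = X[seq(i+j)].
    n = x.shape[-1]
    return np.stack([x[..., i:i + k] for i in range(n - k + 1)], axis=-2)

def conv1d(X, W, b):
    # X: (chans, seq); W: (chans', chans, kernel); b: (chans',).
    # Unroll along seq, then contract over chans and kernel (d = chans').
    U = unroll(X, W.shape[-1])                  # (chans, seq', kernel)
    return np.einsum('dck,csk->ds', W, U) + b[:, None]

X = np.arange(5.0)[None, :]        # chans=1, seq=5
W = np.ones((1, 1, 2))             # width-2 moving-sum kernel
b = np.zeros(1)
C = conv1d(X, W, b)
```

The einsum subscripts play the role of the named contraction $\odot_{\mathsf{chans},\mathsf{kernel}}$: both contracted axes are named explicitly, so no axis-order convention is needed.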
Then we can define aggregations over $\mathsf{kernel}$. We define max-pooling as:

$$\mathrm{MaxPool1d}_k \colon \mathbb{R}^{\mathsf{seq}[n]} \to \mathbb{R}^{\mathsf{seq}[n/k]}$$

$$\mathrm{MaxPool1d}_k(X) = \max_{\mathsf{kernel}} \operatorname{pool}_{\mathsf{seq}, \mathsf{kernel}} X \qquad |\mathsf{kernel}| = k$$

$$\mathrm{MaxPool2d}_{k_h, k_w} \colon \mathbb{R}^{\mathsf{height}[h] \times \mathsf{width}[w]} \to \mathbb{R}^{\mathsf{height}[h/k_h] \times \mathsf{width}[w/k_w]}$$

$$\mathrm{MaxPool2d}_{k_h, k_w}(X) = \max_{\mathsf{kh}, \mathsf{kw}} \operatorname{pool}_{\mathsf{height}, \mathsf{kh}} \operatorname{pool}_{\mathsf{width}, \mathsf{kw}} X$$

### 4.6 Normalization layers

Normalization layers are used in all large-scale deep learning models, with different architectures requiring different types of normalization. However, despite their importance, the differences between them are often not clearly communicated. For example, the PyTorch documentation (PyTorch Contributors, 2022) describes all of them using the same equation (where $\epsilon > 0$ is a small constant for numerical stability):

$$Y = \frac{X - \operatorname{mean}(X)}{\sqrt{\operatorname{var}(X) + \epsilon}} \odot \gamma + \beta$$

Wu & He (2018) give essentially the same equation and explain the differences using a combination of equations, words, and pictures. But they do not capture differences in $\gamma$ and $\beta$ among different normalization layers. Critically, the layers do differ by which axes are standardized as well as by their parameters. We define a single named standardization function:

$$\operatorname{standardize}_{\mathsf{ax}} \colon \mathbb{R}^{\mathsf{ax}} \to \mathbb{R}^{\mathsf{ax}}, \qquad \operatorname{standardize}_{\mathsf{ax}}(X) = \frac{X - \operatorname{mean}_{\mathsf{ax}}(X)}{\sqrt{\operatorname{var}_{\mathsf{ax}}(X) + \epsilon}}$$

Then, we can define the three kinds of normalization layers, all with type $\mathbb{R}^{\mathsf{batch} \times \mathsf{chans} \times \mathsf{layer}} \to \mathbb{R}^{\mathsf{batch} \times \mathsf{chans} \times \mathsf{layer}}$. While superficially similar, these functions differ in their standardized axes and their parameter shapes:

$$\mathrm{BatchNorm}(X; \gamma, \beta) = \operatorname{standardize}_{\mathsf{batch}, \mathsf{layer}}(X) \odot \gamma + \beta \qquad \gamma, \beta \in \mathbb{R}^{\mathsf{chans}}$$

$$\mathrm{InstanceNorm}(X; \gamma, \beta) = \operatorname{standardize}_{\mathsf{layer}}(X) \odot \gamma + \beta \qquad \gamma, \beta \in \mathbb{R}^{\mathsf{chans}}$$

$$\mathrm{LayerNorm}(X; \gamma, \beta) = \operatorname{standardize}_{\mathsf{layer}, \mathsf{chans}}(X) \odot \gamma + \beta \qquad \gamma, \beta \in \mathbb{R}^{\mathsf{chans} \times \mathsf{layer}}$$

## 5 Differential Calculus

In many machine learning applications, we need to compute derivatives of functions from tensors to tensors. In standard vector/matrix notation, this can become complicated.
For example, if $f$ maps from vectors to vectors, then the partial derivatives of $f$ form a matrix (the Jacobian). It has an input axis for the directions in which $X$ could change, and an output axis for the directions in which $f(X)$ could change. But there are conflicting conventions about whether the first axis is the input axis ("denominator layout") or the output axis ("numerator layout"). The derivative of a function from vectors to matrices, or from matrices to vectors, cannot be represented as a matrix at all, so one must resort to flattening the matrices into vectors.

With non-named tensor index notation, taking derivatives is not difficult (Laue et al., 2018), but again a convention must be adopted, namely that the input axes come after the output axes, separated by a comma. With named tensors, axes are not ordered, so we don't need to remember whether the input or output axes come first. But we do need to ensure that the input and output axes have different names.

### 5.1 Definition

**Definition 7.** Let $\mathcal S = \mathsf{ax}_1 \times \cdots \times \mathsf{ax}_r$ be a shape. Then we write $\mathcal S' = \mathsf{ax}_1' \times \cdots \times \mathsf{ax}_r'$, and if $s = \{\mathsf{ax}_1(i_1), \dots, \mathsf{ax}_r(i_r)\} \in \operatorname{rec} \mathcal S$, then we write $s' = \{\mathsf{ax}_1'(i_1), \dots, \mathsf{ax}_r'(i_r)\}$. Furthermore, if $X \in \mathbb{R}^{\mathcal S}$, then we write $X' = X_{\mathcal S \to \mathcal S'}$.

**Definition 8.** Let $f \colon \mathbb{R}^{\mathcal S} \to \mathbb{R}^{\mathcal T}$. The derivative of $f(X)$ with respect to $X$ is the tensor $\frac{\partial f(X)}{\partial X} \in \mathbb{R}^{\mathcal S' \times \mathcal T}$ such that

$$\left[\frac{\partial f(X)}{\partial X}\right]_{s', t} = \frac{\partial f(X)_t}{\partial X_s} \quad \text{for all } s \in \operatorname{rec} \mathcal S \text{ and } t \in \operatorname{rec} \mathcal T.$$

The above definition assumes that $\mathcal S'$ and $\mathcal T$ are orthogonal; if not, the axes in $\mathcal S'$ should be renamed to something else. For example, the second derivative (the Hessian) could be

$$\frac{\partial^2 f(X)}{\partial X \, \partial X'} \in \mathbb{R}^{\mathcal S' \times \mathcal S'' \times \mathcal T}, \qquad \left[\frac{\partial^2 f(X)}{\partial X \, \partial X'}\right]_{r', s'', t} = \frac{\partial^2 f(X)_t}{\partial X_r \, \partial X_s} \quad \text{for all } r, s \in \operatorname{rec} \mathcal S \text{ and } t \in \operatorname{rec} \mathcal T.$$

### 5.2 Differentials

We could derive rules like the chain rule and the sum and product rules, and use them to compute derivatives; however, ensuring that input and output shapes are orthogonal is inconvenient. Instead, we recommend the method of differentials (Magnus & Neudecker, 1985), which reduces renaming to a minimum.
The first-order Taylor approximation of $f$ around $X \in \mathbb{R}^{\mathcal S}$ is

$$f(X + H) \approx f(X) + \frac{\partial f(X)}{\partial X} \odot_{\mathcal S'} H' \qquad H \in \mathbb{R}^{\mathcal S}.$$

The *differential* of $f(X)$ with increment $H$, written $\partial f(X; H)$, is the second term of this approximation; we defer a formal definition to Appendix B. For example:

- If $\mathrm{id}$ is the identity function, then $\mathrm{id}(X + H) = X + H$, so
$$\partial\,\mathrm{id}(X; H) = H. \tag{8a}$$
- If $f(X) = X \odot_{\mathsf{ax}} X$ where $X \in \mathbb{R}^{\mathsf{ax}}$, then
$$f(X + H) = (X + H) \odot_{\mathsf{ax}} (X + H) = X \odot_{\mathsf{ax}} X + 2 X \odot_{\mathsf{ax}} H + H \odot_{\mathsf{ax}} H$$
$$\partial f(X; H) = 2 X \odot_{\mathsf{ax}} H. \tag{8b}$$

It's often more convenient to work directly with the expression $X \odot_{\mathsf{ax}} X$ instead of $f(X)$, and to write $\partial(X \odot_{\mathsf{ax}} X)$ for $\partial f(X; H)$. Then, since $\partial X = \partial\,\mathrm{id}(X; H) = H$, we can write Eq. (8b) simply as

$$\partial(X \odot_{\mathsf{ax}} X) = 2 X \odot_{\mathsf{ax}} \partial X$$

so that the $H$ has been hidden inside $\partial X$. More generally, we can derive rules like the following:

$$\partial(U + V) = \partial U + \partial V \tag{9a}$$
$$\partial(U \odot V) = \partial U \odot V + \partial V \odot U \tag{9b}$$
$$\partial(U \odot_{\mathsf{ax}} V) = \partial U \odot_{\mathsf{ax}} V + \partial V \odot_{\mathsf{ax}} U \tag{9e}$$
$$\partial(U_s) = [\partial U]_s \tag{9f}$$
$$\partial(U_{\mathsf{ax} \to \mathsf{ax}'}) = [\partial U]_{\mathsf{ax} \to \mathsf{ax}'} \tag{9g}$$

The chain rule for differentials is

$$\partial f(U) = \left.\frac{\partial f(X)}{\partial X}\right|_{X = U} \odot_{\mathcal S'} \partial U_{\mathcal S \to \mathcal S'} \qquad f \colon \mathbb{R}^{\mathcal S} \to \mathbb{R}^{\mathcal T}. \tag{9h}$$

Recall that $f$ can be lifted to shapes larger than $\mathcal S$. In that case, the rule above still applies, but note that the contraction will still be over $\mathcal S'$. A special case of this is when $\mathcal S = \mathcal T = \emptyset$:

$$\partial f(U) = \left.\frac{d f(x)}{d x}\right|_{x = U} \odot \partial U \qquad f \colon \mathbb{R} \to \mathbb{R}. \tag{9i}$$

For example, letting $f(x) = \exp x$ gives the rule

$$\partial(\exp U) = \exp U \odot \partial U. \tag{9j}$$

Using these rules we can compute the differential of a wide variety of expressions. For example, the softmax operator:

$$\partial\bigl(\operatorname{softmax}_{\mathsf{ax}} X\bigr) \overset{(5a)}{=} \partial\!\left(\frac{\exp X}{\sum_{\mathsf{ax}} \exp X}\right) = \frac{\partial(\exp X)}{\sum_{\mathsf{ax}} \exp X} - \frac{\exp X \odot \partial\bigl(\sum_{\mathsf{ax}} \exp X\bigr)}{\bigl(\sum_{\mathsf{ax}} \exp X\bigr)^2}$$

$$= \frac{\exp X \odot \partial X}{\sum_{\mathsf{ax}} \exp X} - \frac{\exp X \odot \sum_{\mathsf{ax}} (\exp X \odot \partial X)}{\bigl(\sum_{\mathsf{ax}} \exp X\bigr)^2}$$

$$\overset{(5a)}{=} \operatorname{softmax}_{\mathsf{ax}} X \odot \partial X - \operatorname{softmax}_{\mathsf{ax}} X \odot \bigl(\operatorname{softmax}_{\mathsf{ax}} X \odot_{\mathsf{ax}} \partial X\bigr)$$

$$= \operatorname{softmax}_{\mathsf{ax}} X \odot \bigl(\partial X - \operatorname{softmax}_{\mathsf{ax}} X \odot_{\mathsf{ax}} \partial X\bigr). \tag{10}$$

We stop when the only differentials left are $\partial X$.

### 5.3 Derivatives via differentials

If we can get $\partial f(X)$ into so-called canonical form,

$$\partial f(X) = D \odot_{\mathcal S'} \partial X' + \text{const.} \tag{11}$$

where const.
stands for terms not depending on $\partial X$, then by Magnus & Neudecker's first identification theorem (Theorem 1 in Appendix B), we can conclude that
$$\frac{\partial f(X)}{\partial X'} = D.$$
When trying to get expressions into canonical form, one helpful fact is that renaming can be thought of as contraction with an identity matrix. First we define the identity matrix with shape $\mathsf{ax} \times \mathsf{ax}'$:
$$[I_{\mathsf{ax},\mathsf{ax}'}]_{\mathsf{ax}(i),\mathsf{ax}'(j)} = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$
Then for any tensor $A$ with an axis $\mathsf{ax}$,
$$A_{\mathsf{ax} \to \mathsf{ax}'} = I_{\mathsf{ax},\mathsf{ax}'} \odot_{\mathsf{ax}} A. \tag{12}$$
More specifically, if $X \in \mathbb{R}^{\mathcal{S}}$, then
$$\partial X = I_{\mathcal{S},\mathcal{S}'} \odot_{\mathcal{S}'} \partial X' \tag{13}$$
and then Eq. (4) can usually be used to move the $\odot_{\mathcal{S}'} \partial X'$ to the outermost level of the expression.

Above, we found the differential of the softmax function; now let us find its derivative.
$$\begin{aligned}
\partial\Bigl(\operatorname*{softmax}_{\mathsf{ax}} X\Bigr) &\overset{(10)}{=} \operatorname*{softmax}_{\mathsf{ax}} X \odot \Bigl(\partial X - \operatorname*{softmax}_{\mathsf{ax}} X \odot_{\mathsf{ax}} \partial X\Bigr) \\
&\overset{(13)}{=} \operatorname*{softmax}_{\mathsf{ax}} X \odot \Bigl(I_{\mathsf{ax},\mathsf{ax}'} \odot_{\mathsf{ax}'} \partial X' - \operatorname*{softmax}_{\mathsf{ax}} X \odot_{\mathsf{ax}} \bigl(I_{\mathsf{ax},\mathsf{ax}'} \odot_{\mathsf{ax}'} \partial X'\bigr)\Bigr) \\
&\overset{(4)}{=} \Bigl(\operatorname*{softmax}_{\mathsf{ax}} X \odot \bigl(I_{\mathsf{ax},\mathsf{ax}'} - \operatorname*{softmax}_{\mathsf{ax}} X \odot_{\mathsf{ax}} I_{\mathsf{ax},\mathsf{ax}'}\bigr)\Bigr) \odot_{\mathsf{ax}'} \partial X' \\
&\overset{(12)}{=} \Bigl(\operatorname*{softmax}_{\mathsf{ax}} X \odot \bigl(I_{\mathsf{ax},\mathsf{ax}'} - (\operatorname*{softmax}_{\mathsf{ax}} X)'\bigr)\Bigr) \odot_{\mathsf{ax}'} \partial X'.
\end{aligned}$$
This is in canonical form, so we have
$$\frac{\partial}{\partial X'}\Bigl(\operatorname*{softmax}_{\mathsf{ax}} X\Bigr) = \operatorname*{softmax}_{\mathsf{ax}} X \odot \Bigl(I_{\mathsf{ax},\mathsf{ax}'} - (\operatorname*{softmax}_{\mathsf{ax}} X)'\Bigr). \tag{14}$$

5.4 Lifting

Recall that $f^{\mathcal{S}''}$ is the lift of $f : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{T}}$ with $\mathcal{S}''$, and in most contexts we can simply write $f$ instead of $f^{\mathcal{S}''}$. However, derivatives are one place where some extra caution is in order. To lighten notation, let's write $g$ for the derivative of $f$:
$$g(X) = \frac{\partial f(X)}{\partial X'}.$$
Recall that the chain rule (9h) works under lifting, so
$$\partial f^{\mathcal{S}''}(X) = g^{\mathcal{S}''}(X) \odot_{\mathcal{S}'} \partial X_{\mathcal{S} \to \mathcal{S}'}.$$
But the contraction is only over $\mathcal{S}'$, so it would be incorrect to conclude that $\frac{\partial f^{\mathcal{S}''}(X)}{\partial X'} = g^{\mathcal{S}''}(X)$. The derivative of a lift is not the lift of a derivative. We must rename and contract $\mathcal{S}''$ as well:
$$\begin{aligned}
\partial f^{\mathcal{S}''}(X) &\overset{(9h)}{=} g^{\mathcal{S}''}(X) \odot_{\mathcal{S}'} \partial X_{\mathcal{S} \to \mathcal{S}'} \\
&\overset{(13)}{=} g^{\mathcal{S}''}(X) \odot_{\mathcal{S}'} \Bigl(I_{\mathcal{S}'',\mathcal{S}'''} \odot_{\mathcal{S}'''} \partial X_{(\mathcal{S}'' \times \mathcal{S}) \to (\mathcal{S}'' \times \mathcal{S})'}\Bigr) \\
&\overset{(3)}{=} \Bigl(g^{\mathcal{S}''}(X) \odot I_{\mathcal{S}'',\mathcal{S}'''}\Bigr) \odot_{(\mathcal{S}'' \times \mathcal{S})'} \partial X_{(\mathcal{S}'' \times \mathcal{S}) \to (\mathcal{S}'' \times \mathcal{S})'} \\
\frac{\partial f^{\mathcal{S}''}(X)}{\partial X'} &= g^{\mathcal{S}''}(X) \odot I_{\mathcal{S}'',\mathcal{S}'''}.
\end{aligned}$$
In general, then, the derivative of a lift is the lift of the derivative, multiplied by the identity matrix for the new axes.
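Writing the $\mathsf{ax}$ and $\mathsf{ax}'$ axes of Eq. (14) as the rows and columns of an ordinary matrix, the derivative can be verified against finite differences (a NumPy sketch of our own, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n = 4
x = rng.normal(size=n)
s = softmax(x)

# Eq. (14) in matrix form: J[i, j] = s[i] * (I[i, j] - s[j]),
# with axis ax as rows i and renamed axis ax' as columns j.
J = s[:, None] * (np.eye(n) - s[None, :])

# Compare against central finite differences.
eps = 1e-6
J_fd = np.empty((n, n))
for j in range(n):
    e_j = np.zeros(n)
    e_j[j] = eps
    J_fd[:, j] = (softmax(x + e_j) - softmax(x - e_j)) / (2 * eps)
assert np.allclose(J, J_fd, atol=1e-8)
```

Because named axes are unordered, the same tensor also answers the denominator-layout question: renaming, not transposition, distinguishes input from output.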
Intuitively, this is because before lifting, the derivative is a linear transformation from $\mathbb{R}^{\mathcal{S}}$ to $\mathbb{R}^{\mathcal{T}}$. When lifting to $\mathcal{S}'' \times \mathcal{S}$, this transformation must also be lifted, which is what multiplication by $I_{\mathcal{S}'',\mathcal{S}'''}$ accomplishes.

5.5 Extended example

As a more elaborate example, we find the derivative of self-attention. For brevity, we omit the factor $\frac{1}{\sqrt{|\mathsf{key}|}}$ and we write
$$\alpha = \operatorname*{softmax}_{\mathsf{seq}} \bigl(Q \odot_{\mathsf{key}} K\bigr).$$
Then
$$\partial \operatorname{Att}(Q, K, V) \overset{(7)}{=} \partial\bigl(\alpha \odot_{\mathsf{seq}} V\bigr) \overset{(9e)}{=} \alpha \odot_{\mathsf{seq}} \partial V + V \odot_{\mathsf{seq}} \partial\alpha. \tag{15}$$
Focus first on the first term, which is the only term depending on $\partial V$:
$$\begin{aligned}
\alpha \odot_{\mathsf{seq}} \partial V &\overset{(13)}{=} \alpha \odot_{\mathsf{seq}} \Bigl(\bigl(I_{\mathsf{seq},\mathsf{seq}'} \odot I_{\mathsf{val},\mathsf{val}'}\bigr) \odot_{\mathsf{seq}' \times \mathsf{val}'} \partial V'\Bigr) \\
&\overset{(4)}{=} \Bigl(\bigl(\alpha \odot_{\mathsf{seq}} I_{\mathsf{seq},\mathsf{seq}'}\bigr) \odot I_{\mathsf{val},\mathsf{val}'}\Bigr) \odot_{\mathsf{seq}' \times \mathsf{val}'} \partial V' \\
&\overset{(12)}{=} \bigl(\alpha_{\mathsf{seq} \to \mathsf{seq}'} \odot I_{\mathsf{val},\mathsf{val}'}\bigr) \odot_{\mathsf{seq}' \times \mathsf{val}'} \partial V' \\
\frac{\partial \operatorname{Att}(Q, K, V)}{\partial V'} &= \alpha_{\mathsf{seq} \to \mathsf{seq}'} \odot I_{\mathsf{val},\mathsf{val}'}.
\end{aligned}$$
Next, focus on the second term of Eq. (15):
$$\begin{aligned}
V \odot_{\mathsf{seq}} \partial\alpha &\overset{(10)}{=} V \odot_{\mathsf{seq}} \Bigl(\alpha \odot \bigl(\partial(Q \odot_{\mathsf{key}} K) - \alpha \odot_{\mathsf{seq}} \partial(Q \odot_{\mathsf{key}} K)\bigr)\Bigr) \\
&\overset{(9e)}{=} V \odot_{\mathsf{seq}} \Bigl(\alpha \odot \bigl(\partial Q \odot_{\mathsf{key}} K + \partial K \odot_{\mathsf{key}} Q - \alpha \odot_{\mathsf{seq}} (\partial Q \odot_{\mathsf{key}} K + \partial K \odot_{\mathsf{key}} Q)\bigr)\Bigr). \tag{16}
\end{aligned}$$
Keeping only terms depending on $\partial Q$, we get
$$\begin{aligned}
& V \odot_{\mathsf{seq}} \Bigl(\alpha \odot \bigl(K \odot_{\mathsf{key}} \partial Q - \alpha \odot_{\mathsf{seq}} (K \odot_{\mathsf{key}} \partial Q)\bigr)\Bigr) \\
&\overset{(13)}{=} V \odot_{\mathsf{seq}} \Bigl(\alpha \odot \bigl(K \odot_{\mathsf{key}} (I_{\mathsf{key},\mathsf{key}'} \odot_{\mathsf{key}'} \partial Q') - \alpha \odot_{\mathsf{seq}} (K \odot_{\mathsf{key}} (I_{\mathsf{key},\mathsf{key}'} \odot_{\mathsf{key}'} \partial Q'))\bigr)\Bigr) \\
&\overset{(4)}{=} \Bigl(V \odot_{\mathsf{seq}} \bigl(\alpha \odot (K \odot_{\mathsf{key}} I_{\mathsf{key},\mathsf{key}'} - \alpha \odot_{\mathsf{seq}} (K \odot_{\mathsf{key}} I_{\mathsf{key},\mathsf{key}'}))\bigr)\Bigr) \odot_{\mathsf{key}'} \partial Q' \\
&\overset{(12)}{=} \Bigl(V \odot_{\mathsf{seq}} \bigl(\alpha \odot (K_{\mathsf{key} \to \mathsf{key}'} - \alpha \odot_{\mathsf{seq}} K_{\mathsf{key} \to \mathsf{key}'})\bigr)\Bigr) \odot_{\mathsf{key}'} \partial Q' \\
\frac{\partial \operatorname{Att}(Q, K, V)}{\partial Q'} &= V \odot_{\mathsf{seq}} \Bigl(\alpha \odot \bigl(K_{\mathsf{key} \to \mathsf{key}'} - \alpha \odot_{\mathsf{seq}} K_{\mathsf{key} \to \mathsf{key}'}\bigr)\Bigr).
\end{aligned}$$
Similarly, keeping only terms from Eq. (16) depending on $\partial K$, we get
$$\begin{aligned}
& V \odot_{\mathsf{seq}} \Bigl(\alpha \odot \bigl(Q \odot_{\mathsf{key}} \partial K - \alpha \odot_{\mathsf{seq}} (Q \odot_{\mathsf{key}} \partial K)\bigr)\Bigr) \\
&= \Bigl(V \odot_{\mathsf{seq}} \bigl(\alpha \odot ((Q' \odot I_{\mathsf{seq},\mathsf{seq}'}) - \alpha \odot_{\mathsf{seq}} (Q' \odot I_{\mathsf{seq},\mathsf{seq}'}))\bigr)\Bigr) \odot_{\mathsf{seq}' \times \mathsf{key}'} \partial K' \\
\frac{\partial \operatorname{Att}(Q, K, V)}{\partial K'} &= V \odot_{\mathsf{seq}} \Bigl(\alpha \odot \bigl((Q' \odot I_{\mathsf{seq},\mathsf{seq}'}) - \alpha \odot_{\mathsf{seq}} (Q' \odot I_{\mathsf{seq},\mathsf{seq}'})\bigr)\Bigr).
\end{aligned}$$

6 Alternatives and Related Work

6.1 Index notations

Among alternatives to standard vector and matrix notation, the most common one is index notation as used in physics (Ricci & Levi-Civita, 1900). Related notations are used in other fields as well (Harshman, 2001). In this notation, axes are ordered, and every equation is written in terms of tensor components.
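The first of these results is easy to confirm numerically, since attention is linear in V. In the sketch below (our own NumPy illustration, with $Q \in \mathbb{R}^{\mathsf{key}}$, $K \in \mathbb{R}^{\mathsf{seq} \times \mathsf{key}}$, $V \in \mathbb{R}^{\mathsf{seq} \times \mathsf{val}}$ laid out as ordered arrays), the Jacobian with respect to V matches $\alpha_{\mathsf{seq} \to \mathsf{seq}'} \odot I_{\mathsf{val},\mathsf{val}'}$:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def att(Q, K, V):
    # Q: (key,)  K: (seq, key)  V: (seq, val); contract key, softmax over seq.
    alpha = softmax(K @ Q, axis=0)          # (seq,)
    return alpha @ V                        # (val,)

rng = np.random.default_rng(2)
nseq, nkey, nval = 3, 4, 2
Q = rng.normal(size=nkey)
K = rng.normal(size=(nseq, nkey))
V = rng.normal(size=(nseq, nval))

alpha = softmax(K @ Q, axis=0)
# Section 5.5 result, as an array with axes (seq', val', val).
J = alpha[:, None, None] * np.eye(nval)[None, :, :]

# Central finite differences over every entry of V.
eps = 1e-6
J_fd = np.empty((nseq, nval, nval))
for s in range(nseq):
    for v in range(nval):
        dV = np.zeros_like(V)
        dV[s, v] = eps
        J_fd[s, v] = (att(Q, K, V + dV) - att(Q, K, V - dV)) / (2 * eps)
assert np.allclose(J, J_fd, atol=1e-8)
```

Because α does not depend on V, the finite-difference Jacobian agrees with the closed form to rounding error.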
If an index appears on both sides of an equation, then the equation must hold for each value of the index, and the Einstein summation convention (Einstein, 1916) is that if an index appears twice on one side and not on the other, there is an implicit summation over that index.
$$\operatorname{Attention} : \mathbb{R}^{d_k} \times \mathbb{R}^{n \times d_k} \times \mathbb{R}^{n \times d_v} \to \mathbb{R}^{d_v}$$
$$[\operatorname{Attention}(Q, K, V)]_k = \operatorname*{softmax}_i \left(\frac{Q_j K_{ij}}{\sqrt{d_k}}\right) V_{ik}$$
Because $k$ appears on both sides, the equation must hold over all values of this index. But because $i$ and $j$ occur twice on only the right-hand side, they are both summed over. We would have to define exactly what the $i$ under the softmax means ($i$ is bound inside the softmax and free outside it), and since softmax doesn't distribute over addition, we would need to modify the summation convention so that the summation over $j$ occurs inside the softmax. Aside from these correctable issues, this notation scales very well to more than two axes and is concise and unambiguous. But it does not solve the main problem we set out to solve, which is that ordered axes force the author and reader to remember the purpose of each axis. The indices do act as symbolic names for axes (indeed, in abstract index notation (Penrose & Rindler, 1984), they really are symbols, not variables), but they are temporary names; they could be totally different in the next equation. It would be up to the author to choose to use consistent names, and to do so correctly. A second issue is that because it depends on repetition of indices to work, index notation can be more verbose than our notation, particularly for reductions and contractions:
$$C = \max_i A_i \qquad\qquad C = \max_{\mathsf{ax}} A$$
$$C = A_i B_i \qquad\qquad C = A \odot_{\mathsf{ax}} B.$$
Finally, index notation requires us to write out all indices explicitly.
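NumPy's einsum makes the "temporary names" point concrete: its subscript strings are exactly such per-call index names, chosen afresh at every call site (a small illustration of our own):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B = rng.normal(size=5), rng.normal(size=5)

# Index notation: C = A_i B_i (implicit sum over the repeated index i).
# The name 'i' is local to this one einsum call.
C = np.einsum('i,i->', A, B)
assert np.isclose(C, A @ B)

# With extra axes, every batch index must be repeated by hand:
nb, nh, n, dk = 2, 3, 4, 5
Q = rng.normal(size=(nb, nh, dk))
K = rng.normal(size=(nb, nh, n, dk))
scores = np.einsum('bhj,bhij->bhi', Q, K)   # b, h carried through; j summed
assert scores.shape == (nb, nh, n)
```

Nothing ties the letter `b` in one call to the letter `b` in the next; keeping the names consistent is entirely up to the author, which is the same burden the text describes for index notation on paper.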
So if we wanted to lift attention to minibatches ($b \in [B]$), multiple heads ($h \in [H]$), and multiple query tokens ($i' \in [n']$), we would write:
$$\operatorname{Attention} : \mathbb{R}^{B \times H \times n' \times d_k} \times \mathbb{R}^{B \times H \times n \times d_k} \times \mathbb{R}^{B \times H \times n \times d_v} \to \mathbb{R}^{B \times H \times n' \times d_v}$$
$$[\operatorname{Attention}(Q, K, V)]_{bhi'k} = \operatorname*{softmax}_i \left(\frac{Q_{bhi'j} K_{bhij}}{\sqrt{d_k}}\right) V_{bhik}$$
We could adopt a convention that lifts a function on tensors to tensors that have extra axes to the left, but such conventions tend to lead to messy reordering and squeezing/unsqueezing of axes. Named axes make such conventions unnecessary.

6.2 Graphical notations

In the graphical notation of Penrose (1971), a node in the graph stands for a tensor, and its incident edges stand for its indices. The edges are ordered from left to right. An edge connecting two nodes denotes contraction. The notation of Alsberg (1997) is similar, except that edges are named, not ordered. Graphs are commonly used in machine learning for representing probability models (Koller & Friedman, 2009). A node in the graph stands for a random variable, and an edge or hyperedge stands for a dependency between variables. If random variables have finite domains, then a (hyper)edge with r endpoints can be thought of as an r-th order tensor. A graph can then be thought of as a product and contraction. Extensions that allow for a choice between two subgraphs (e.g., Minka & Winn, 2008) can be thought of as addition. Our assessment of graphical notations like these is that, on the positive side, they have obvious value for visualization, and they at least have the potential to represent indices in a purely unordered way. On the negative side, these notations seem best suited for representing linear functions; even for this purpose, some other practical considerations are that drawing pictures requires more effort from the author, and that pictures have a less transparent relationship with their implementation in most programming languages.
6.3 Relational algebra

In relational algebra (Codd, 1970), the basic objects are sets of r-tuples, which could be thought of as tensors of order r with Boolean-valued entries. In the original formulation, the members of the tuples, which correspond to axes, were both named and ordered, although later definitions (e.g., Pirotte, 1982) made them unordered. Probabilistic variants of relational algebra also exist (e.g., Dey & Sarkar, 1996; Fuhr & Rölleke, 1997), whose relations would correspond to tensors of probabilities. While relational algebra and tensor notations are designed for totally different purposes, the notation of relational algebra generally has a similar flavor to ours (for example, our contraction operator is similar to the natural-join operator ⋈, and our renaming operator is the same as the ρ operator).

6.4 Programming languages

One response to the notation presented here, as well as the alternative notations mentioned in this section, is that research papers in machine learning should simply present models as code rather than equations. But we argue that a model's mathematical specification should abstract away from details of its implementation. Conceptually, it is important to have a distinct specification to define what makes an implementation (both the original implementation and any reimplementations) correct or incorrect. If the implementation is its own specification, it cannot be correct or incorrect; it will be "not even wrong." Practically, abstracting away from implementation is important because we do not want the interpretation of research papers to be subject to differences across programming languages and libraries, or versions thereof. For example, PyTorch's Dropout2d on order-3 tensors has one behavior in versions 1.11 and 1.13, but another behavior in 1.10, 1.12, and future versions. It would be problematic for correct understanding of a paper to depend on such differences.
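To illustrate how little machinery a named contraction needs on top of an ordered-axis library, here is a hypothetical helper of our own devising (the function `ndot` and its list-of-names convention are illustrative, not an API of any existing library; real named-tensor libraries such as xarray or PyTorch named tensors offer richer versions of the same idea):

```python
import numpy as np

def ndot(a, a_names, b, b_names, axis):
    """Contract the named axis `axis` between a and b.

    a_names and b_names give a name for each positional axis of a and b.
    Returns the result and its axis names; the caller never needs to know
    which position the contracted axis occupied.
    """
    # Move the contracted axis last in both operands, then tensordot.
    a = np.moveaxis(a, a_names.index(axis), -1)
    b = np.moveaxis(b, b_names.index(axis), -1)
    out_names = ([n for n in a_names if n != axis] +
                 [n for n in b_names if n != axis])
    out = np.tensordot(a, b, axes=([a.ndim - 1], [b.ndim - 1]))
    return out, out_names

# Q odot_key K from the introduction, regardless of axis order in K.
Q = np.ones(4)                  # axes: ('key',)
K = np.ones((3, 4))             # axes: ('seq', 'key')
scores, names = ndot(Q, ['key'], K, ['seq', 'key'], 'key')
assert names == ['seq'] and scores.shape == (3,)
```

This is, of course, exactly the kind of implementation detail the notation abstracts away from; the sketch only shows that the specification and the code can line up axis name by axis name.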
7 Conclusions

Named tensor notation is a system of formal notation for representing operations between tensors in a non-ambiguous way while remaining intuitive for practitioners. The system is motivated by challenges that arise from taking notation designed for applied linear algebra and using it to represent neural networks, as demonstrated through examples of canonical deep-learning components such as attention and layer normalization. However, named tensors are not limited to specifying neural networks. We have also explained how to integrate our notation with Magnus & Neudecker's (1985) method of differentials for matrix calculus. While there are other conventions, such as index notation, that have some usage in the machine learning community, these conventions either lack the conciseness of named tensors or are not well-suited to non-linear operations. For these reasons, we encourage members of the machine learning community to try out named tensor notation for teaching, research, and software documentation.

Acknowledgements

We would like to thank Ekin Akyürek, Justin Bayer, Tongfei Chen, Chu-Cheng Lin, Colin McDonald, Adam Poliak, Matt Post, Chung-chieh Shan, Nishant Sinha, and Yee Whye Teh for their input to this document (or the ideas in it). We also thank the anonymous TMLR reviewers for their feedback, which substantially improved the quality of the paper, especially Section 5. This material is based upon work supported by the National Science Foundation under Grants No. CCF-2019291 and DMS-2134157, as well as a Sloan Fellowship, Simons Investigator Fellowship, DARPA grant W911NF2010021, and DOE grant DE-SC0022199. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

Bjørn K. Alsberg. A diagram notation for n-mode array equations. Journal of Chemometrics, 11:251–266, 1997.
doi: 10.1002/(SICI)1099-128X(199705)11:3<251::AID-CEM472>3.0.CO;2-Q.

Matt Bauman. AxisArrays, 2018. URL https://github.com/JuliaArrays/AxisArrays.jl. Open-source software.

Tongfei Chen. Typesafe abstractions for tensor operations. In Proc. 8th ACM SIGPLAN International Symposium on Scala, pp. 45–50, 2017. doi: 10.1145/3136000.3136001.

E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, June 1970. doi: 10.1145/362384.362685.

Zachary DeVito. Named tensors using first-class dimensions in PyTorch, 2023. URL https://github.com/facebookresearch/torchdim. Open-source software.

Debabrata Dey and Sumit Sarkar. A probabilistic relational model and algebra. ACM Transactions on Database Systems, 21(3):339–369, September 1996. doi: 10.1145/232753.232796.

A. Einstein. Die Grundlage der allgemeinen Relativitätstheorie. Annalen der Physik, 354(7):769–880, 1916. doi: 10.1002/andp.19163540702.

Norbert Fuhr and Thomas Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 15(1):32–66, January 1997. doi: 10.1145/239041.239045.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Richard A. Harshman. An index formalism that generalizes the capabilities of matrix notation and algebra to n-way arrays. Journal of Chemometrics, 15:689–714, 2001. doi: 10.1002/cem.665.

Stephan Hoyer and Joe Hamman. xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1):10, 2017. doi: 10.5334/jors.148.

JAX authors. Named axes and easy-to-revise parallelism, 2021. URL https://jax.readthedocs.io/en/latest/notebooks/xmap_tutorial.html.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Soeren Laue, Matthias Mitterreiter, and Joachim Giesen.
Computing higher order derivatives of matrix and tensor expressions. In Proc. NeurIPS, pp. 2750–2759, 2018. URL https://proceedings.neurips.cc/paper/2018/file/0a1bf96b7165e962e90cb14648c9462d-Paper.pdf.

Jan R. Magnus and H. Neudecker. Matrix differential calculus with applications to simple, Hadamard, and Kronecker products. Journal of Mathematical Psychology, 29(4):474–492, 1985. doi: 10.1016/0022-2496(85)90006-9.

Tom Minka and John Winn. Gates. In Proc. NeurIPS, pp. 1073–1080, 2008. URL https://papers.nips.cc/paper/3379-gates.

Adam Paszke, Daniel D. Johnson, David Duvenaud, Dimitrios Vytiniotis, Alexey Radul, Matthew J. Johnson, Jonathan Ragan-Kelley, and Dougal Maclaurin. Getting to the point: Index sets and parallelism-preserving autodiff for pointful array programming. Proc. ACM on Programming Languages, 5(ICFP), August 2021. doi: 10.1145/3473593.

R. Penrose and W. Rindler. Spinors and space-time, volume 1. Cambridge University Press, 1984.

Roger Penrose. Applications of negative dimensional tensors. In D. J. A. Welsh (ed.), Combinatorial Mathematics and its Applications, pp. 221–244. Academic Press, 1971.

Alain Pirotte. A precise definition of basic relational notations and of the relational algebra. ACM SIGMOD Record, 13(1):30–45, 1982. doi: 10.1145/984514.984516.

PyTorch Contributors. PyTorch documentation, 2022. URL https://pytorch.org/docs/1.12/. Version 1.12.

M. M. G. Ricci and T. Levi-Civita. Méthodes de calcul différentiel absolu et leurs applications. Mathematische Annalen, 54:125–201, 1900. doi: 10.1007/BF01454201.

Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with Einstein-like notation. In Proc. ICLR, 2022. URL https://openreview.net/pdf?id=oapKSVM2bcj.

Alexander Rush. Named tensors, 2019. URL https://github.com/harvardnlp/NamedTensor. Open-source software.

Nishant Sinha. Tensor shape (annotation) library, 2018. URL https://github.com/ofnote/tsalib.
Open-source software.

Torch Contributors. Named tensors, 2019. URL https://pytorch.org/docs/stable/named_tensor.html. PyTorch documentation.

L. R. Tucker. The extension of factor analysis to three-dimensional matrices. In H. Gulliksen and N. Frederiksen (eds.), Contributions to Mathematical Psychology, pp. 110–127. Holt, Rinehart and Winston, New York, 1964.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NeurIPS, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Yuxin Wu and Kaiming He. Group normalization. In Proc. ECCV, 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/papers/Yuxin_Wu_Group_Normalization_ECCV_2018_paper.pdf.

A Extended Examples

A.1 Transformer

We define a Transformer used autoregressively as a language model. The input is a sequence of one-hot vectors, from which we compute word embeddings and positional encodings:
$$I \in \{0, 1\}^{\mathsf{seq} \times \mathsf{vocab}} \qquad \sum_{\mathsf{vocab}} I = 1$$
$$W = (E \odot_{\mathsf{vocab}} I)\sqrt{|\mathsf{layer}|} \qquad E \in \mathbb{R}^{\mathsf{vocab} \times \mathsf{layer}}$$
$$P \in \mathbb{R}^{\mathsf{seq} \times \mathsf{layer}} \qquad P_{\mathsf{seq}(p),\mathsf{layer}(i)} = \begin{cases} \sin\bigl((p-1)/10000^{(i-1)/|\mathsf{layer}|}\bigr) & i \text{ odd} \\ \cos\bigl((p-1)/10000^{(i-2)/|\mathsf{layer}|}\bigr) & i \text{ even} \end{cases}$$
$$X^0 = W + P.$$
Then we use L layers of self-attention and feed-forward neural networks:
$$T^1 = \operatorname{LayerNorm}^1(\operatorname{SelfAtt}^1(X^0) + X^0)$$
$$X^1 = \operatorname{LayerNorm}^{1'}(\operatorname{FFN}^1(T^1) + T^1)$$
$$\vdots$$
$$T^L = \operatorname{LayerNorm}^L(\operatorname{SelfAtt}^L(X^{L-1}) + X^{L-1})$$
$$X^L = \operatorname{LayerNorm}^{L'}(\operatorname{FFN}^L(T^L) + T^L)$$
$$O = \operatorname*{softmax}_{\mathsf{vocab}}(E \odot_{\mathsf{layer}} X^L)$$
where LayerNorm, SelfAtt and FFN are defined below. Layer normalization ($l = 1, 1', \ldots, L, L'$):
$$\operatorname{LayerNorm}^l : \mathbb{R}^{\mathsf{layer}} \to \mathbb{R}^{\mathsf{layer}}$$
$$\operatorname{LayerNorm}^l(X) = \operatorname*{standardize}_{\mathsf{layer}}(X) \odot \gamma^l + \beta^l \qquad \beta^l, \gamma^l \in \mathbb{R}^{\mathsf{layer}}$$
We defined attention in Section 4.3; the Transformer uses multi-head self-attention, in which queries, keys, and values are all computed from the same sequence.
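The positional encoding P defined at the start of this appendix can be transcribed directly (a NumPy sketch of our own; p and i are 1-based as in the equation, while the array is 0-based):

```python
import numpy as np

def positional_encoding(nseq, nlayer):
    """P[seq(p), layer(i)]: sin for odd i, cos for even i (1-based p, i)."""
    P = np.empty((nseq, nlayer))
    for p in range(1, nseq + 1):
        for i in range(1, nlayer + 1):
            if i % 2 == 1:
                P[p - 1, i - 1] = np.sin((p - 1) / 10000 ** ((i - 1) / nlayer))
            else:
                P[p - 1, i - 1] = np.cos((p - 1) / 10000 ** ((i - 2) / nlayer))
    return P

P = positional_encoding(4, 6)
# First position (p = 1): sin(0) = 0 on odd layers, cos(0) = 1 on even layers.
assert np.allclose(P[0], [0, 1, 0, 1, 0, 1])
```

Note that in the named-tensor specification nothing says whether seq or layer comes first; the ordering here is an arbitrary implementation choice.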
$$\operatorname{SelfAtt}^l : \mathbb{R}^{\mathsf{seq} \times \mathsf{layer}} \to \mathbb{R}^{\mathsf{seq} \times \mathsf{layer}}$$
$$\operatorname{SelfAtt}^l(X) = Y$$
where
$$|\mathsf{seq}'| = |\mathsf{seq}| \qquad |\mathsf{key}| = |\mathsf{val}| = |\mathsf{layer}|/|\mathsf{heads}|$$
$$Q = W^{l,Q} \odot_{\mathsf{layer}} X_{\mathsf{seq} \to \mathsf{seq}'} \qquad W^{l,Q} \in \mathbb{R}^{\mathsf{heads} \times \mathsf{layer} \times \mathsf{key}}$$
$$K = W^{l,K} \odot_{\mathsf{layer}} X \qquad W^{l,K} \in \mathbb{R}^{\mathsf{heads} \times \mathsf{layer} \times \mathsf{key}}$$
$$V = W^{l,V} \odot_{\mathsf{layer}} X \qquad W^{l,V} \in \mathbb{R}^{\mathsf{heads} \times \mathsf{layer} \times \mathsf{val}}$$
$$M_{\mathsf{seq}(i),\mathsf{seq}'(j)} = \begin{cases} 0 & i \le j \\ -\infty & \text{otherwise} \end{cases}$$
$$Y = W^{l,O} \odot_{\mathsf{heads} \times \mathsf{val}} \operatorname{Attention}(Q, K, V, M)_{\mathsf{seq}' \to \mathsf{seq}} \qquad W^{l,O} \in \mathbb{R}^{\mathsf{heads} \times \mathsf{val} \times \mathsf{layer}}$$
Feed-forward neural networks:
$$\operatorname{FFN}^l : \mathbb{R}^{\mathsf{layer}} \to \mathbb{R}^{\mathsf{layer}}$$
$$\operatorname{FFN}^l(X) = X^2$$
where
$$X^1 = \operatorname{relu}(W^{l,1} \odot_{\mathsf{layer}} X + b^{l,1}) \qquad W^{l,1} \in \mathbb{R}^{\mathsf{hidden} \times \mathsf{layer}} \qquad b^{l,1} \in \mathbb{R}^{\mathsf{hidden}}$$
$$X^2 = W^{l,2} \odot_{\mathsf{hidden}} X^1 + b^{l,2} \qquad W^{l,2} \in \mathbb{R}^{\mathsf{layer} \times \mathsf{hidden}} \qquad b^{l,2} \in \mathbb{R}^{\mathsf{layer}}.$$

A.2 LeNet

$$X^0 \in \mathbb{R}^{\mathsf{batch} \times \mathsf{chans}[c_0] \times \mathsf{height} \times \mathsf{width}}$$
$$T^1 = \operatorname{relu}(\operatorname{Conv}^1(X^0))$$
$$X^1 = \operatorname{MaxPool}^1(T^1)$$
$$T^2 = \operatorname{relu}(\operatorname{Conv}^2(X^1))$$
$$X^2 = \operatorname{MaxPool}^2(T^2)_{(\mathsf{height},\mathsf{width},\mathsf{chans}) \to \mathsf{layer}}$$
$$X^3 = \operatorname{relu}(W^3 \odot_{\mathsf{layer}} X^2 + b^3) \qquad W^3 \in \mathbb{R}^{\mathsf{hidden} \times \mathsf{layer}} \qquad b^3 \in \mathbb{R}^{\mathsf{hidden}}$$
$$O = \operatorname*{softmax}_{\mathsf{classes}}(W^4 \odot_{\mathsf{hidden}} X^3 + b^4) \qquad W^4 \in \mathbb{R}^{\mathsf{classes} \times \mathsf{hidden}} \qquad b^4 \in \mathbb{R}^{\mathsf{classes}}$$
As an alternative to the flattening operation in the equation for $X^2$, we could have written
$$X^2 = \operatorname{MaxPool}^2(T^2)$$
$$X^3 = \operatorname{relu}(W^3 \odot_{\mathsf{height} \times \mathsf{width} \times \mathsf{chans}} X^2 + b^3) \qquad W^3 \in \mathbb{R}^{\mathsf{hidden} \times \mathsf{height} \times \mathsf{width} \times \mathsf{chans}}.$$
The convolution and pooling operations are defined as follows:
$$\operatorname{Conv}^l(X) = \operatorname{Conv2d}(X; W^l, b^l)_{\mathsf{chans}' \to \mathsf{chans}} \qquad W^l \in \mathbb{R}^{\mathsf{chans}'[c_l] \times \mathsf{chans}[c_{l-1}] \times \mathsf{kh}[kh_l] \times \mathsf{kw}[kw_l]} \qquad b^l \in \mathbb{R}^{\mathsf{chans}'[c_l]}$$
$$\operatorname{MaxPool}^l(X) = \operatorname{MaxPool2d}_{ph_l,pw_l}(X).$$

B Differentiation: Formal Definitions

The following definition and theorem come directly from the paper by Magnus & Neudecker (1985), but generalized to named tensors. For any $X \in \mathbb{R}^{\mathcal{S}}$, we write $\|X\| = \operatorname{norm}_{\mathcal{S}} X$.

Definition 9. Let $f : S \to \mathbb{R}^{\mathcal{T}}$ where $S \subseteq \mathbb{R}^{\mathcal{S}}$. Let $A$ be an interior point of $S$, that is, for some $r > 0$, $B(A; r) = \{X \mid \|X - A\| < r\} \subseteq S$. If there is a tensor $D(A) \in \mathbb{R}^{\mathcal{S}' \times \mathcal{T}}$ and $R(A, H) \in \mathbb{R}^{\mathcal{T}}$ such that
$$f(A + H) = f(A) + D(A) \odot_{\mathcal{S}'} H_{\mathcal{S} \to \mathcal{S}'} + R(A, H)$$
for all $H \in \mathbb{R}^{\mathcal{S}}$ with $\|H\| < r$, and
$$\lim_{H \to 0} \frac{R(A, H)}{\|H\|} = 0,$$
then $f$ is said to be differentiable at $A$; the tensor
$$\partial f(A; H) = D(A) \odot_{\mathcal{S}'} H_{\mathcal{S} \to \mathcal{S}'}$$
is then called the (first) differential of $f$ at $A$ with increment $H$.
Magnus & Neudecker give their (first) identification theorem twice, once for vector-to-vector functions and once for matrix-to-matrix functions (but omitting vector-to-matrix and matrix-to-vector functions). Here, we only need one version, which works for functions from tensors to tensors of any shape.

Theorem 1. Let $f : S \to \mathbb{R}^{\mathcal{T}}$, where $S \subseteq \mathbb{R}^{\mathcal{S}}$, be differentiable at $A \in S$. Let $D(X) \in \mathbb{R}^{\mathcal{S}' \times \mathcal{T}}$. Then for all $H$,
$$\partial f(A; H) = D(A) \odot_{\mathcal{S}'} H_{\mathcal{S} \to \mathcal{S}'}$$
if and only if
$$\left.\frac{\partial f(X)}{\partial X'}\right|_{X=A} = D(A).$$

C LaTeX Macros

Many of the LaTeX macros used in this document are available in the style file namedtensor.sty, available on CTAN at https://ctan.org/pkg/namedtensor. To use it, put

\usepackage{namedtensor}

in the preamble of your LaTeX source file (after \documentclass{article} but before \begin{document}). We write axis names in sans-serif font. To make this easier, \name{ax} prints an axis name (like this: ax), and \ndef{\ax}{ax} defines a macro \ax that does the same.

Binary operators: Use A \ndot{\ax} B for contraction: $A \odot_{\mathsf{ax}} B$. You can use \\ to stack up several names. In general, you can use \nbin to make a new binary operator with a name under it: A \nbin{\ax}{\star} B gives you $A \star_{\mathsf{ax}} B$.

Use \nsum{\ax} A for summation: $\sum_{\mathsf{ax}} A$. In general, you can use \nfun to make a function with a name under it: \nfun{\ax}{qux} A gives you $\operatorname{qux}_{\mathsf{ax}} A$.
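As a small end-to-end sketch (untested here, and assuming only the macros documented above plus amsmath), the named attention equation from Section 1 might be typeset as:

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{namedtensor}
% Define macros for the two axis names used below.
\ndef{\key}{key}
\ndef{\seq}{seq}
\begin{document}
% Attention(Q, K, V) = softmax_seq( (Q .key K) / sqrt(|key|) ) .seq V
\[
  \mathrm{Attention}(Q, K, V) =
    \nfun{\seq}{softmax}
      \left( \frac{Q \ndot{\key} K}{\sqrt{|\key|}} \right)
    \ndot{\seq} V
\]
\end{document}
```

The axis names appear under the operators exactly as in the body of the paper, so the typeset equation carries the same shape information as the type signature.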