# Transformers from an Optimization Perspective

Yongyi Yang (University of Michigan, yongyi@umich.edu), Zengfeng Huang (Fudan University, huangzf@fudan.edu.cn), David Wipf (Amazon Web Services, davidwipf@gmail.com)

## Abstract

Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass? By finding such a function, we can view Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has been frequently adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms like the Transformer has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.

## 1 Introduction

Although deep learning has achieved extensive practical success [21, 22], model architectures are often motivated by heuristics and at times remain difficult to understand. As such, significant effort has recently been devoted to various ways of analyzing and reinterpreting popular deep network structures. Of particular note for our purposes here are assorted attempts to create a one-to-one association between deep model layers and the unfolded iterations of some optimization process designed to minimize an interpretable energy function [10, 11, 47, 49, 35]. In this way the forward pass can be viewed as computing, at least approximately, a representation with minimal energy, where model parameters dictate the form of this energy and can be trained via the backward pass for a downstream goal of interest, e.g., classification or regression.

This so-called unfolded optimization perspective can provide insights into the nature of the inductive biases introduced by different architectures, while potentially serving as a guide for bespoke design choices instantiated through properties of the underlying energy function involved. In fact, modifications of the latter can lead to predictable model behaviors, which have in the past been tailored to incorporate useful regularization factors [53], avoid subtle degenerate solutions [49], or design effective new models altogether [1, 10]. Despite these successes, the majority of prior work in this area has either addressed relatively simple network structures such as MLPs [47], or else restricted consideration to a single model component in isolation from a larger system [35]. Consequently, complex models such as the widely-used Transformer [43] have been mostly ignored, in large part because of the difficulty involved in simultaneously mapping both the self-attention and feed-forward network (FFN) modules to an integrated optimization process that minimizes a single, unified energy function.

(Work completed during an internship at the AWS Shanghai AI Lab. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).)
We take significant steps towards extending the unfolded optimization perspective to a full Transformer stack via the following contributions: After providing background details and existing examples of unfolded optimization, Section 2 formalizes the four key challenges to mapping Transformer layers to optimization steps descending a well-specified energy function. These include: (i) handling self-attention, (ii) integrating heterogeneous Transformer layer modules (i.e., self-attention and FFN), (iii) accounting for non-linear activations, and (iv) allowing for arbitrary/unstructured weight matrices. Later, Section 3 derives an energy function that closely reproduces self-attention per Challenge (i), Section 4 rigorously derives convergence results which demonstrate how to approximately handle Challenge (ii) via a novel form of alternating minimization, and Section 5 provides explicit details for solving Challenge (iii) via proximal methods; consideration of Challenge (iv) is deferred to the supplementary along with additional supporting analyses.

The above contributions culminate in an energy function that is minimized by Transformer layers when certain technical conditions hold. To provide further motivation, Section 6 empirically demonstrates that this energy is indeed minimized with real-world data, even when the technical conditions are difficult to formally verify. Collectively, these results provide a complementary foundation for understanding Transformer models, and suggest plausible avenues for future enhancements.

## 2 Brief Intro to Deep Architectures Formed from Unfolded Optimization

The basic idea of the unfolded optimization perspective is to create a one-to-one correspondence between the layers of a deep model and the iterations of some algorithmic process designed to minimize a parameterized, model-specific energy function. In this way, the forward pass can be viewed as computing an approximation to the minimum energy solution. Meanwhile, assuming each forward iteration is differentiable w.r.t. the energy function parameters, we can pass gradients of some meta objective function (e.g., classification, regression, etc.) through the approximate minimization steps for the backward pass, which forms an interpretable bilevel optimization process [7, 10, 20]. Overall then, unfolded optimization produces minimal energy representations, whereby the specific structure of the energy function is trainable per the relevant downstream application domain.

### 2.1 Mathematical Formulation

The bilevel optimization process described above can be expressed more formally as:

$$Y^*(W, X) = \arg\min_Y E(Y; W, X), \quad (1)$$

$$W^* = \arg\min_W \frac{1}{S} \sum_{i=1}^{S} \ell_i\left(\psi\left[Y^*(W, X_i)\right]\right). \quad (2)$$

Here $S$ is the size of the dataset, $\ell_i$ is the $i$-th loss function, and $X_i \in \mathbb{R}^{n \times d}$ is the $i$-th input instance (e.g., an input sentence to a language model), where $n$ is the number of tokens and $d$ is the token dimensionality. We refer to $Y \in \mathbb{R}^{n \times d}$ as the hidden representation of the model, where $Y^*(W, X)$ is the (ideal) model output (e.g., the set of word representations output by a Transformer encoder) that minimizes the lower-level energy function $E : \mathbb{R}^{n \times d} \to \mathbb{R}$, whose form is dictated by parameters $W$ (e.g., these could correspond with the trainable parameters in a Transformer model). And finally, $\ell_i$ is a meta-loss function (e.g., the loss function used to train a Transformer), while $\psi : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ is an arbitrary transformation that converts hidden representations to the model output, which can also potentially have its own learnable parameters.
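To make the bilevel structure (1)-(2) concrete, the following minimal NumPy sketch unrolls gradient descent on an illustrative quadratic energy; the specific energy, step size, and layer count are assumptions for demonstration only, not the Transformer energy developed later.

```python
import numpy as np

def energy(Y, W, X):
    # Illustrative lower-level energy E(Y; W, X): a simple quadratic that
    # trades off fidelity to the input X against a penalty shaped by W.
    return 0.5 * np.sum((Y - X) ** 2) + 0.5 * np.sum((Y @ W) ** 2)

def energy_grad(Y, W, X):
    # Gradient of the quadratic energy above w.r.t. Y.
    return (Y - X) + (Y @ W) @ W.T

def forward_pass(X, W, num_layers=10, step=0.1):
    # "Unfolded optimization" forward pass: each layer is one descent step
    # on E(Y; W, X), so the output approximates argmin_Y E(Y; W, X) from (1).
    Y = X.copy()
    for _ in range(num_layers):
        Y = Y - step * energy_grad(Y, W, X)
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4                      # tokens x feature dimension
    X = rng.normal(size=(n, d))      # one input instance X_i
    W = 0.3 * rng.normal(size=(d, d))
    Y = forward_pass(X, W)
    print("E at input :", energy(X, W, X))
    print("E at output:", energy(Y, W, X))   # should be (much) lower
    # In the upper-level problem (2), W would be trained by differentiating a
    # meta-loss applied to psi(Y) through these unrolled steps (e.g., autograd).
```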
Assuming $\partial Y^*(W, X)/\partial W$ and $\partial \ell / \partial Y^*(W, X)$ are computable, where the former involves passing gradients through the iterative steps used to approximate $Y^*(W, X)$, the combined bilevel system can be optimized w.r.t. $W$. Note that hereinafter, if the energy function does not include inter-token interactions, we view it as a function of vectors, and we use lower-case characters $y$ and $x$ to represent an arbitrary row of $Y$ and $X$ for simplicity. Also, when clear from context, we omit writing the parameters of a function; for example, we may write $E(Y)$ to reference $E(Y; W, X)$. (Note that although we frequently use the term energy, the concept is different from what has been called energy-based learning in the past, where the energy is defined on the joint input and output spaces [2, 23]; in our case the energy is just a function for explaining the forward pass.)

### 2.2 Related Work and Limitations

In this section we present related work on unfolded optimization, and in particular, analyze some of the relevant limitations that serve as motivation for our efforts and underscore the challenges involved.

**Optimization-Induced Feed-Forward Networks** A wide variety of prior work has investigated feed-forward structures from an unfolded optimization perspective [10, 12, 16, 18, 41, 47]. Here we take a closer look at [47], which is based on proximal methods. The basic energy function in [47] is

$$E(y) = \mathbf{1}^\top \sigma^*(Wy) - \langle f(x), Wy \rangle, \quad (3)$$

where $\mathbf{1}$ is a vector with all entries equal to 1, $\sigma^*$ is the convex conjugate [36] of $\tilde{\sigma} : a \mapsto \int_0^a \sigma(r)\,dr$, $\sigma$ is an activation function, and $f(x)$ is some transformation of the input representation $x$. (Here $\sigma$ is a scalar function, but for simplicity, we sometimes apply a scalar function to a vector or matrix in an entry-wise fashion for convenience.) The proximal operator [8] of (3) is

$$y^{(t+1)} = \operatorname{prox}_E\left(y^{(t)}\right) = W^\top \sigma\left(Wy^{(t)} + f(x)\right), \quad (4)$$

which is analogous to the feed-forward layer of an MLP. Note that although $W$ can in principle be an arbitrary matrix as long as the dimensionality is properly aligned, after stacking multiple layers of this model, the effective transformation is actually forced to be constrained:

$$y^{(t+2)} = W^\top \sigma\left[ WW^\top \sigma\left(Wy^{(t)} + f(x)\right) + f(x) \right], \quad (5)$$

where clearly after the first layer the actual feed-forward transformation becomes $WW^\top$, which is necessarily positive semi-definite (PSD). Hence unconstrained layer weights are not actually possible within this paradigm.

**Optimization-Induced Attention Mechanisms** There is also limited work that attempts to interpret attention mechanisms from the unfolded optimization perspective [11, 35]. For example, the energy function

$$E(y) = -\beta^{-1} \log\left( \sum_{i=1}^{n} e^{\beta S_i y} \right) + \frac{1}{2}\|y\|^2 \quad (6)$$

was proposed in [35], with subsequent updating using the concave-convex procedure [51] leading to iterations of the form

$$y^{(t+1)} = S^\top \operatorname{softmax}\left(\beta S y^{(t)}\right), \quad (7)$$

where $S \in \mathbb{R}^{n \times d}$ corresponds with the attention keys while $y$ maps to the attention query, and $\beta$ is a constant scalar. Critically though, this attention mechanism is actually cross-attention (using $y$ to attend to $S$), which does not align well with the typical self-attention use-case of Transformers. Moreover, this work does not account for the aggregated Transformer feed-forward network module and attendant nonlinearity.

**Optimization-Induced Graph Neural Networks** A variety of graph neural network architectures have also been developed from a similar optimization perspective [5, 26, 29, 32, 48, 49, 52, 54]. For example, graph attention mechanisms were derived using the iterative reweighted least squares (IRLS) algorithm in [49].
While this result is related to self-attention, which can be viewed as graph attention on fully-connected graphs [4], it fails to produce the Transformer softmax term or the combined self-attention/feed-forward Transformer stack. Nonetheless, we will later demonstrate how ideas from [49] can be leveraged to derive self-attention with softmax.

### 2.3 Key Challenges Extending to General Transformers

In this paper, we focus on a more complex model than prior work, namely, the Transformer encoder. A Transformer layer [43] is typically composed of two primary components: the self-attention layer and the feed-forward network (FFN) layer. If we simplify the FFN to a single linear transformation with non-linear activation, and ignore the layer-norm and residual connections (this is for simplicity of exposition; the analysis we present herein can naturally be generalized to accommodate layer-norm, and also certain forms of residual connections with additional assumptions; see supplementary for details), one Transformer layer reduces to the basic form

$$Y^{(t+1)} = \operatorname{ReLU}\left[ \operatorname{softmax}\left( Y^{(t)} W_a Y^{(t)\top} \right) Y^{(t)} W_f \right], \quad (8)$$

where $W_a$ and $W_f$ represent self-attention and FFN layer weight matrices respectively (a minimal reference implementation of this simplified layer is sketched after the list below). Regarding (8) and the limitations of prior work discussed above, we propose four major challenges in formally applying unfolded optimization to the Transformer setting:

1. **Token-Level Interactions:** As mentioned previously, Transformer models include not only a feed-forward process but also cross-token interactions, as reflected in the self-attention mechanism considered to be the sine qua non of Transformers. Thus far, no prior work has derived Transformer-style self-attention, instead either relaxing to cross-attention [35], or failing to produce the ubiquitous softmax operator [49].

2. **Heterogeneous Layer Types:** Each Transformer encoder layer is composed of two totally different components: self-attention and an FFN. We refer to this structure as a "heterogeneous layer type," where each component has its own parameters and can be viewed as a distinct unfolded optimization process. However, it remains unknown whether or not it is possible to combine the corresponding energies of these two components to obtain a unified objective with any kind of convergence guarantees (even approximately) during the aggregated forward pass.

3. **Non-Linear Activations:** Many activation functions used in common neural network architectures can be understood from the perspective of proximal operators [24, 49], at least within the isolated context of some relatively simple feed-forward schema. However, when integrated within the heterogeneous Transformer layer types mentioned above, the optimization process becomes rather complex, and it is unknown if any convergence properties still hold after the inclusion of a proximal step.

4. **Asymmetry of Weights:** As alluded to in the discussion of (5), the scope of most work addressing feed-forward networks is actually limited to models with symmetric (or even tighter, PSD) weight transformations [1, 10, 47], which restricts the resulting universality. Additionally, related models from the graph neural network literature are generally predicated on undirected graphs [48, 49], where the graph propagation is also symmetric. Consequently, although to our knowledge not addressed/discussed previously, it remains unknown how to construct energy functions for more general asymmetric transformations.
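For reference, the simplified layer (8) can be transcribed directly into NumPy as below; the random weights and dimensions are placeholders for illustration only.

```python
import numpy as np

def softmax(A, axis=-1):
    # Row-wise softmax with the usual max-shift for numerical stability.
    A = A - A.max(axis=axis, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=axis, keepdims=True)

def simplified_transformer_layer(Y, Wa, Wf):
    # One layer of the simplified model (8):
    # Y <- ReLU( softmax(Y Wa Y^T) Y Wf )
    attn = softmax(Y @ Wa @ Y.T)           # self-attention coefficients
    return np.maximum(attn @ Y @ Wf, 0.0)  # single-linear FFN + ReLU

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 6, 8
    Y = rng.normal(size=(n, d))
    Wa = rng.normal(size=(d, d)) / np.sqrt(d)
    Wf = rng.normal(size=(d, d)) / np.sqrt(d)
    for _ in range(3):                     # stack a few weight-tied layers
        Y = simplified_transformer_layer(Y, Wa, Wf)
    print(Y.shape)                         # (6, 8)
```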
In this paper, we propose techniques to at least partially solve Challenges 1, 2 and 3, while for Challenge 4 we defer preliminary discussion to the supplementary, leaving formal investigation as a future direction.

## 3 A New Derivation of Transformer Self-Attention

We now tackle Challenge 1 by constructing an energy function whose iterative optimization steps match Transformer-style softmax self-attention on fully-connected tokens, as is customary.

### 3.1 Basic Softmax Self-Attention via Unfolded Optimization Steps

Consider the energy function

$$E_1(Y) = \sum_{i=1}^{n} \sum_{j=1}^{n} \rho\left( \tfrac{1}{2}\left\| y_i - y_j \right\|^2 \right) + R(Y), \quad (9)$$

where $y_i \in \mathbb{R}^{d \times 1}$ is the $i$-th row of the matrix $Y \in \mathbb{R}^{n \times d}$, $\rho : \mathbb{R}_+ \to \mathbb{R}$ is a concave non-decreasing function, and $R : \mathbb{R}^{n \times d} \to \mathbb{R}$ is a convex function. Interestingly, although $E_1$ is not necessarily convex because of the non-convexity of $\rho$, under specific choices of $\rho$ and $R$ it can be optimized through a softmax-like structure as follows:

**Theorem 3.1.** Assume $\rho'(z) = \exp\{-z\}$, $R(Y) = \tfrac{1}{2}\|Y\|_F^2$, and $\beta_i = \exp\left\{ -\tfrac{1}{2}\left\| y_i^{(t)} \right\|^2 \right\}$, and let $Y^{(t)}$ represent any fixed value for $Y$. Then the update rule

$$y_i^{(t+1)} = \frac{ \sum_{j=1}^{n} \beta_j \exp\left\{ y_i^{(t)\top} y_j^{(t)} \right\} y_j^{(t)} }{ \sum_{j=1}^{n} \beta_j \exp\left\{ y_i^{(t)\top} y_j^{(t)} \right\} }, \quad \forall i, \quad (10)$$

satisfies $E_1\left(Y^{(t+1)}\right) \le E_1\left(Y^{(t)}\right)$, with equality iff $Y^{(t)}$ is a stationary point of $E_1$.

It is worth mentioning that, although perhaps not obvious, the update step (10) can be generated via a majorization-minimization (MM) algorithm [42], where the majorization step produces a convex upper-bound, and the minimization step descends along the gradient of the upper-bound. Therefore, the core of this update is essentially a gradient step on a convex function, which will be relevant to the discussion in subsequent sections; see the proof and further details in the supplementary.

**Remark 3.2.** Several notable generalizations of Theorem 3.1 are possible. First, although in Theorem 3.1 we adopt a specific form for $\rho$ and $R$ to recover the softmax operator, with other selections of these functions the corresponding unfolded optimization algorithm can generate different/novel types of attention mechanisms. Secondly, obtaining (10) relies on a particular choice of gradient step size; however, for broader choices the resulting convergent updates interestingly lead to residual connections as a natural byproduct of the optimization trajectory (see supplementary). And finally, Theorem 3.1 can be easily modified to accommodate situations whereby full Transformer connectivity is constrained by some graph structure as in [6, 14] (again, see supplementary for details).

### 3.2 Extension to Include Trainable Parameters

After aggregating into matrix form, we have thus far shown that the iteration

$$Y^{(t+1)} = \operatorname{softmax}_\beta\left( Y^{(t)} Y^{(t)\top} \right) Y^{(t)} \quad (11)$$

will reduce (or leave unchanged) the energy from (9), where $\operatorname{softmax}_\beta(y)_i = \frac{\beta_i \exp\{y_i\}}{\sum_j \beta_j \exp\{y_j\}}$ denotes a softmax operator with reweighting coefficient vector $\beta$. If $\beta_i$ is independent of $i$, i.e., $\left\| y_i^{(t)} \right\|$ is constant (which is enforceable via layer normalization that can also be included within our framework as mentioned previously), then this reweighted softmax is equivalent to the canonical softmax used in Transformers.
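The descent property of Theorem 3.1 can be checked numerically with a short script. The sketch below assumes the reconstruction $\rho(z) = -\exp(-z)$ (so that $\rho'(z) = \exp(-z)$) and $\beta_j = \exp\{-\tfrac{1}{2}\|y_j\|^2\}$ as stated above, and simply applies the matrix-form update (11) to random toy data while printing $E_1$.

```python
import numpy as np

def E1(Y):
    # Energy (9) under the Theorem 3.1 assumptions: rho(z) = -exp(-z)
    # (so rho'(z) = exp(-z)) and R(Y) = 0.5 * ||Y||_F^2.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return -np.sum(np.exp(-0.5 * sq_dists)) + 0.5 * np.sum(Y ** 2)

def attention_update(Y):
    # Matrix form (11): Y <- softmax_beta(Y Y^T) Y, with the reweighting
    # beta_j = exp(-0.5 * ||y_j||^2) folded into the logits.
    logits = Y @ Y.T - 0.5 * np.sum(Y ** 2, axis=1)[None, :]
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)
    return A @ Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(6, 4))
    for t in range(5):
        print(f"E1 at step {t}: {E1(Y):.4f}")  # non-increasing per Theorem 3.1
        Y = attention_update(Y)
```

On typical random inputs the printed values are non-increasing, consistent with the theorem; this script is only a sanity check, not a proof.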
Now consider the reparameterization $Y = ZW_a$, where $W_a \in \mathbb{R}^{d \times d}$ is an invertible matrix. It follows that $Z^{(t+1)} W_a = \operatorname{softmax}_\beta\left( Z^{(t)} W_a W_a^\top Z^{(t)\top} \right) Z^{(t)} W_a$, leading to the revised update

$$Z^{(t+1)} = \operatorname{softmax}_\beta\left( Z^{(t)} W_a^s Z^{(t)\top} \right) Z^{(t)}, \quad (12)$$

where $W_a^s = W_a W_a^\top$ and we have adopted the superscript $s$ to indicate that this matrix is symmetric. Collectively then, the results of this section directly address Challenge 1, closely reproducing softmax-styled self-attention both with and without trainable parameters (the primary lingering limitation of symmetric weights being relegated to Challenge 4).

## 4 Combining Transformer Components via Alternating Inexact Minimization

We next address Challenge 2 and the heterogeneous Transformer layer types. With this goal in mind, we first introduce a general optimization scenario. Specifically, we ask the following question: Given two (convex) objectives $f(y)$ and $g(y)$, under what conditions, and/or to what extent, will alternately taking separate gradient steps w.r.t. $f$ and $g$ optimize the aggregated function $f + g$? We refer to this optimization strategy as alternating inexact minimization (AIM), and we will shortly demonstrate conditions under which it converges to a ball with finite radius containing the optimal point of $f + g$. Later we discuss how these results contribute towards the resolution of Challenge 2.

We emphasize upfront that AIM as so-defined is quite different from what is commonly referred to as alternating minimization in the literature [15, 28]. The latter refers to scenarios whereby a unified objective function with multiple variables is minimized in an alternating fashion over one variable at a time with the others fixed, such that descent can be trivially achieved. In contrast, our AIM scenario involves multiple objective terms with a shared variable, and we alternate minimization over each term in isolation using the same variable, a much more challenging process to analyze.

Figure 1: A two-dimensional illustration of $S(C')$ with different values of $C'$. Left: $c_h C'/L_f = 0.7$. Right: $c_h C'/L_f = 1.5$. The blue area is $S(C')$.

Figure 2: The value of the energy $E\left(Y^{(t)}\right)$ with respect to iteration $t$. Note the y-axis is translated by a constant $\log E_{\min}$ to make the minimum align with zero.

### 4.1 General Alternating Inexact Minimization

Given two objectives $f, g : \mathbb{R}^d \to \mathbb{R}$, AIM is formalized as Algorithm 1, where $\alpha_1$ and $\alpha_2$ constitute step sizes. We will now investigate how Algorithm 1 relates to minimization of the combined objective defined by $h(y) = f(y) + g(y)$.

**Algorithm 1** For the $t$-th iteration, execute

$$u^{(t)} = y^{(t)} - \alpha_1 \nabla f\left(y^{(t)}\right); \qquad y^{(t+1)} = u^{(t)} - \alpha_2 \nabla g\left(u^{(t)}\right).$$

In the following, we assume that $f$ and $g$ are both Lipschitz smooth and strongly convex [36], with Lipschitz constants $L_f$ and $L_g$ and convexity parameters $c_f$ and $c_g$ respectively. Therefore, $h$ will also be Lipschitz smooth and strongly convex, with Lipschitz constant $L_h$ and convexity parameter $c_h$. We denote the optimal points of $f$, $g$, and $h$ as $y_f^*$, $y_g^*$, and $y_h^*$ respectively. Combining the two steps of Algorithm 1 gives

$$y^{(t+1)} = y^{(t)} - \alpha_1 \nabla f\left(y^{(t)}\right) - \alpha_2 \nabla g\left[ y^{(t)} - \alpha_1 \nabla f\left(y^{(t)}\right) \right], \quad (13)$$

while a canonical gradient descent step on the objective $h$ is $y^{(t+1)} = y^{(t)} - \alpha_2 \left[ \nabla f\left(y^{(t)}\right) + \nabla g\left(y^{(t)}\right) \right]$. Comparing the two update rules, it is apparent that Algorithm 1 can be viewed as a noisy gradient descent step with step size $\alpha_2$ and noise factor

$$\epsilon^{(t)} = \nabla h\left(y^{(t)}\right) - \frac{\alpha_1}{\alpha_2} \nabla f\left(y^{(t)}\right) - \nabla g\left[ y^{(t)} - \alpha_1 \nabla f\left(y^{(t)}\right) \right]. \quad (14)$$

Noisy gradient descent is a well-studied problem in certain contexts [3, 31]. However, in our specific scenario we require a novel, different (actually tighter) bound than one can get from simply applying existing results.
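Before stating the formal result, here is a small self-contained sketch of Algorithm 1 on two strongly convex quadratics (an assumed toy instance, not one of the paper's experiments), together with the noise ratio $\delta(y) = \|\epsilon\| / \|\nabla h(y)\|$ implied by (14).

```python
import numpy as np

# Two strongly convex quadratics f and g sharing the variable y, as in
# Section 4.1; this particular (f, g) pair is an illustrative assumption.
rng = np.random.default_rng(0)
d = 5
Af = np.eye(d) + 0.3 * rng.normal(size=(d, d)); Af = Af @ Af.T   # PD Hessian of f
Ag = np.eye(d) + 0.3 * rng.normal(size=(d, d)); Ag = Ag @ Ag.T   # PD Hessian of g
bf, bg = rng.normal(size=d), rng.normal(size=d)

f = lambda y: 0.5 * y @ Af @ y - bf @ y
g = lambda y: 0.5 * y @ Ag @ y - bg @ y
grad_f = lambda y: Af @ y - bf
grad_g = lambda y: Ag @ y - bg
h = lambda y: f(y) + g(y)
grad_h = lambda y: grad_f(y) + grad_g(y)

def aim_step(y, a1, a2):
    # One iteration of Algorithm 1 (alternating inexact minimization):
    # a gradient step on f followed by a gradient step on g.
    u = y - a1 * grad_f(y)
    return u - a2 * grad_g(u)

def noise_ratio(y, a1, a2):
    # delta(y) = ||eps|| / ||grad h(y)|| with eps from (14): the gap between
    # the AIM step and an exact gradient step on h with step size a2.
    u = y - a1 * grad_f(y)
    eps = grad_h(y) - (a1 / a2) * grad_f(y) - grad_g(u)
    return np.linalg.norm(eps) / np.linalg.norm(grad_h(y))

if __name__ == "__main__":
    a1 = a2 = 0.03
    y = rng.normal(size=d)
    for t in range(20):
        if t % 5 == 0:
            print(f"t={t:2d}  h={h(y):8.4f}  delta={noise_ratio(y, a1, a2):.3f}")
        y = aim_step(y, a1, a2)
```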
Specifically, we demonstrate that when $\delta\left(y^{(t)}\right) = \left\| \epsilon^{(t)} \right\| / \left\| \nabla h\left(y^{(t)}\right) \right\|$ is bounded, (13) is guaranteed to descend the objective $h$ as follows:

**Theorem 4.1.** When $\alpha_1 \le \alpha_2 \le L_h^{-1}$, suppose $y^{(t)}$ and $y^{(t+1)}$ are related by (13), with $\delta\left(y^{(t)}\right) \le C'$ and $C' = \frac{\alpha_2}{\alpha_2 - \alpha_1 + \alpha_1 \alpha_2 L_g}$. Then $h\left(y^{(t+1)}\right) \le h\left(y^{(t)}\right)$.

**Remark 4.2.** Although the definition of $C'$ may appear rather complicated, when one of $\alpha_1$, $\alpha_2$, $L_f$, or $L_g$ is sufficiently small, $C'$ behaves as $\Omega\left( \left[\alpha_1 \alpha_2 L_f L_g\right]^{-1} \right)$.

We further consider how to interpret the constraint $\delta(y) \le C'$ and the region within which (13) can optimize $h$ as follows:

**Lemma 4.3.** Let $S(C') = \left\{ y \;\middle|\; \frac{\|y - y_f^*\|}{\|y - y_h^*\|} \le \frac{c_h C'}{L_f} \right\}$. When $y \in S(C')$, $\delta(y) \le C'$.

Combining Theorem 4.1 and Lemma 4.3, we can therefore conclude that when $y^{(t)} \in S(C')$, $h\left(y^{(t+1)}\right) \le h\left(y^{(t)}\right)$. Note that the boundary of $S(C')$ given in Lemma 4.3 is called an Apollonian circle [37]. When $C' \le L_f / c_h$, $S(C')$ is a ball centered on $y_f^*$, and when $C' \ge L_f / c_h$, $S(C')$ is the whole space excluding a ball centered on $y_h^*$. Figure 1 provides a 2D visualization for each case respectively. Additionally, note that the same analysis can be done by switching the roles of $f$ and $g$ (since the two processes are alternating) to further restrict the exception region, although here we omit this for simplicity.

**Remark 4.4.** Our findings above can be summarized as follows: For sufficiently small values of $\alpha_1, \alpha_2$, Algorithm 1 reduces the combined objective $h$, at least provided that $y$ is a certain distance away from the optimal point $y_h^*$.

To illustrate this conclusion, we present a synthetic example with $f(Y) = \|SY\|_F^2 + \|Y - B_1\|_F^2$ and $g(Y) = \|YW\|_F^2 + \|Y - B_2\|_F^2$. Note that here we expand the variable $y$ to a matrix $Y$ and let $f$ and $g$ consist of left and right transformations of $Y$ respectively, with an extra bias term to prevent the degenerate solution ($Y = 0$). We randomly set each entry of $\{S, W, B_1, B_2\}$, execute Algorithm 1 with fixed step sizes, and project the trace of $Y^{(t)}$ to a two-dimensional space with PCA for visualization. The trace of $Y^{(t)}$ is displayed in Figure 3 and the combined objective $h$ across iterations in Figure 2. From these plots, the behaviour predicted by our theory can be verified: When $Y^{(t)}$ is a sufficient distance away from $Y_h^*$ (the energy is relatively high), the optimization trajectory moves closer to $Y_h^*$ (descending the energy), and after $Y^{(t)}$ is close enough to $Y_h^*$ (and $h$ is at a relatively low level), the energy oscillates in a certain range about the optimal solution.

Figure 3: The trace of $Y^{(t)}$ projected to a plane. The figure on the right displays a zoomed local region around $Y_h^*$ taken from the figure on the left.

### 4.2 Unfolded Optimization of Heterogeneous Layer Type Models

We now return to Transformer-specific models. Let $E_2(Y) = -\frac{1}{2} \operatorname{Tr}\left( Y W_f Y^\top \right) + \frac{1}{2}\|Y\|_F^2$, consider the combined energy $E(Y) = E_1(Y W_a) + E_2(Y)$ (analogous to $h$ from the previous section), and alternately perform the following AIM updates:

$$U^{(t)} = \operatorname{softmax}_\beta\left( Y^{(t)} W_a^s Y^{(t)\top} \right) Y^{(t)}; \quad (15)$$

$$Y^{(t+1)} = U^{(t)} - \alpha_2 \left. \nabla E_2(Y) \right|_{Y = U^{(t)}} = U^{(t)} W_f^s, \quad (16)$$

where $W_f^s = (1 - \alpha_2) I + \alpha_2 \frac{W_f + W_f^\top}{2}$ provides an additional linear transformation after the softmax. From Section 3, we know that (15) is optimizing $E_1$ using essentially a row-wise gradient descent step after majorization. Likewise, (16) is also a gradient step with step size $\alpha_2$.
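Concretely, one combined AIM iteration (15)-(16) can be sketched as follows; the plain row-wise softmax below omits the $\beta$ reweighting (i.e., it implicitly assumes normalized token representations as discussed in Section 3.2), and the weights, dimensions, and step size are illustrative placeholders.

```python
import numpy as np

def softmax_rows(L):
    # Row-wise softmax with a stability shift.
    L = L - L.max(axis=1, keepdims=True)
    P = np.exp(L)
    return P / P.sum(axis=1, keepdims=True)

def aim_transformer_step(Y, Wa, Wf, alpha2):
    # (15): attention step descending E1(Y Wa), with W_a^s = Wa Wa^T.
    Was = Wa @ Wa.T
    U = softmax_rows(Y @ Was @ Y.T) @ Y
    # (16): gradient step on E2(Y) = -0.5*Tr(Y Wf Y^T) + 0.5*||Y||_F^2,
    # which collapses to a right-multiplication by the symmetric W_f^s.
    Wfs = (1.0 - alpha2) * np.eye(Wf.shape[0]) + alpha2 * 0.5 * (Wf + Wf.T)
    return U @ Wfs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 6, 4
    Y = rng.normal(size=(n, d))
    Wa = rng.normal(size=(d, d)) / np.sqrt(d)
    Wf = rng.normal(size=(d, d)) / np.sqrt(d)
    for _ in range(3):
        Y = aim_transformer_step(Y, Wa, Wf, alpha2=0.1)
    print(Y.shape)   # (6, 4)
```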
Therefore, the combined rules fall into the scope of Algorithm 1, and from the analysis in Section 4.1 we can draw the following conclusion:

**Corollary 4.5.** If $Y^{(t+1)}$ is computed via (15) and (16) with input $Y^{(t)}$, then $E\left(Y^{(t+1)}\right) \le E\left(Y^{(t)}\right)$ when $Y^{(t)} \notin S$, where $S$ is a ball with finite radius containing $Y_{\hat{E}}^*$, the optimal point of $\hat{E}$, the convex upper bound of $E$.

Note that after combining (15) and (16), the aggregated update rule is already very similar to our target from (8), the only differences being the missing non-linearity and the symmetric weights (note we use the symmetric weight matrices $W_a^s$ and $W_f^s$ instead of $W_a$ and $W_f$ here), which correspond with Challenges 3 and 4 in Section 2.3 respectively; we address the former next.

## 5 Non-linear Activations Integrated within the Unfolding Paradigm

Handling non-linear activations within an unfolded optimization setting has been considered previously [12, 41, 45, 49]. However, prior work has largely relied on proximal operators to create non-linear activations paired with simple linear filters, and the analysis does not transfer to the more complex Transformer case under investigation here. While we will restrict our attention to adding ReLU activations in this section, both for simplicity of exposition and because of their ubiquity in Transformer models, our results generalize to broader selections. Similar to Section 4, we first study a general form of the optimization problem by introducing a ReLU activation into Algorithm 1, formulating this modification as AIM with proximal steps, and then analyzing convergence under appropriate constraints. Later we apply these results to the Transformer.

### 5.1 Proximal Alternating Inexact Minimization

**Algorithm 2** For the $t$-th iteration, execute

$$u^{(t)} = y^{(t)} - \alpha_1 \nabla f\left(y^{(t)}\right); \qquad v^{(t)} = u^{(t)} - \alpha_2 \nabla g\left(u^{(t)}\right); \qquad y^{(t+1)} = \operatorname{ReLU}\left(v^{(t)}\right).$$

It has already been established [24, 41] that ReLU activations can be modeled as a proximal operator via

$$\operatorname{ReLU}(y) = \arg\min_z \frac{1}{2\lambda} \|z - y\|^2 + \phi(z), \quad (17)$$

where $\phi$ is the indicator function of the first quadrant given by $\phi(z) = +\infty$ if $z < 0$ and $\phi(z) = 0$ if $z \ge 0$. Moreover, in Section 4 we demonstrated how the steps of Algorithm 1 together form an inexact gradient descent iteration on the loss $h(y) = f(y) + g(y)$ with noise term $\epsilon^{(t)}$ defined in (14). Here, with the addition of the proximal step in Algorithm 2, we obtain an inexact version of proximal-gradient descent. In fact, one turn of Algorithm 2 is equivalent to

$$y^{(t+1)} = \arg\min_z \frac{1}{2\lambda} \left\| z - y^{(t)} + \alpha_2 \nabla h\left(y^{(t)}\right) - \alpha_2 \epsilon^{(t)} \right\|^2 + \phi(z), \quad (18)$$

which is an inexact version of the canonical proximal-gradient step with noise $\epsilon^{(t)}$, as compared to the exact version $y^{(t+1)} = \arg\min_z P\left(z; y^{(t)}\right)$, where $P(z; y)$ is the proximal problem

$$P\left(z; y^{(t)}\right) = \frac{1}{2\lambda} \left\| z - y^{(t)} + \alpha_2 \nabla h\left(y^{(t)}\right) \right\|^2 + \phi(z). \quad (19)$$

While various forms of inexact proximal gradient descent have been studied in the past [9, 13, 38, 46], existing work either assumes constant noise [9], stochastic noise [46], or decreasing/convergent noise [13, 38]. Critically, no prior work that we are aware of applies to our case, where the noise can potentially increase with iterations. Moreover, existing analysis in the literature is primarily concerned with convergence to a fixed point, while in our scenario we instead consider entering a specific region formed around certain points.

In addition to bounding $\delta\left(y^{(t)}\right)$ as in Section 4, we also need to bound the similarity between the current position $y^{(t)}$ and the gradient $\alpha_2 \nabla h\left(y^{(t)}\right)$, which is defined as $D(\xi_1, \xi_2) = \frac{1}{\|\xi_1\|^2} \sum_{i=1}^{d} \min\left( \xi_{2,i}^2 - \xi_{1,i}^2, 0 \right)$. Intuitively, $D(\xi_1, \xi_2)$ is defined such that each term of the summation is non-positive, but only close to 0 when $\xi_{1,i}$ and $\xi_{2,i}$ are both large.
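As a small illustration of Algorithm 2, the sketch below runs the two inexact gradient steps followed by the proximal step (17), which for the nonnegative-orthant indicator $\phi$ is exactly an entry-wise ReLU. The quadratic $f$ and $g$ are assumed toy objectives, and monotone descent of $h + \phi$ is only expected while the iterate stays within the kind of region formalized in Theorem 5.1 below.

```python
import numpy as np

# A small strongly convex pair (f, g); the specific quadratics and biases
# are illustrative assumptions, not the paper's Transformer energies.
rng = np.random.default_rng(1)
d = 5
A = np.eye(d) + 0.2 * rng.normal(size=(d, d)); A = A @ A.T
B = np.eye(d) + 0.2 * rng.normal(size=(d, d)); B = B @ B.T
bf = np.abs(rng.normal(size=d))   # nonnegative biases keep the minimizer
bg = np.abs(rng.normal(size=d))   # from being pushed deep into z < 0

f = lambda y: 0.5 * y @ A @ y - bf @ y
g = lambda y: 0.5 * y @ B @ y - bg @ y
grad_f = lambda y: A @ y - bf
grad_g = lambda y: B @ y - bg

def phi(y):
    # Indicator of the nonnegative orthant from (17).
    return 0.0 if np.all(y >= 0) else np.inf

def algorithm2_step(y, a1, a2):
    # Algorithm 2: gradient step on f, gradient step on g, then the proximal
    # step (17), which for this phi reduces to an entry-wise ReLU.
    u = y - a1 * grad_f(y)
    v = u - a2 * grad_g(u)
    return np.maximum(v, 0.0)

if __name__ == "__main__":
    y = np.abs(rng.normal(size=d))
    a1 = a2 = 0.03
    for t in range(15):
        if t % 5 == 0:
            # h + phi typically decreases until the iterate nears the
            # exception region around the constrained optimum.
            print(f"t={t:2d}  h+phi = {f(y) + g(y) + phi(y):8.4f}")
        y = algorithm2_step(y, a1, a2)
```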
We then have the following:

**Theorem 5.1.** If $\alpha_1 \le \alpha_2 \le L_h^{-1}$, $D\left( \alpha_2 \nabla h\left(y^{(t)}\right); y^{(t)} \right) \ge -\kappa$ for some $\kappa \in (0, 1)$, and $\delta\left(y^{(t)}\right) \le C''$, where $C'' = \frac{\alpha_2 c_P \lambda (1 - \kappa)}{2\left(\alpha_2 - \alpha_1 + \alpha_1 \alpha_2 L_g\right)}$, then $h\left(y^{(t+1)}\right) + \phi\left(y^{(t+1)}\right) \le h\left(y^{(t)}\right) + \phi\left(y^{(t)}\right)$.

Intuitively, Theorem 5.1 shows that the region where descent of $h(y) + \phi(y)$ is guaranteed is the intersection of $S(C'')$, with $S(\cdot)$ defined in Lemma 4.3, and the area $T(\kappa) = \{ y \mid D(\alpha_2 \nabla h(y); y) \ge -\kappa \}$. While the shape of $T(\kappa)$ remains difficult to specify in general, we note that $T(\kappa)$ tends to the whole space when $\alpha_2 \to 0$ or $\kappa \to 1$. For example, we illustrate the area $T(\kappa)$ using a 2D synthetic example, where the energy function is $h(y) = \|Wy\|^2 + \|y - b\|^2$ and the entries of $W$ and $b$ are randomly generated. See Figure 4 for the visualization using different values of $\kappa$, which shows that when $\kappa$ is sufficiently small, $T(\kappa)$ is nearly the whole space (except a small area around the origin), which ensures the descent of $h(y) + \phi(y)$ in most cases.

Figure 4: $T(\kappa)$ region (blue area) of a 2D synthetic example with different values of $\kappa$.

### 5.2 Embedding the Non-Linearity within Transformer Models

By adding the penalty term $\phi(Y)$ to the previous energy, with $E_1$ and $E_2$ obtained from Sections 3.1 and 4.2 respectively, we finally arrive at the total Transformer energy $E(Y) = E_1(Y W_a) + E_2(Y) + \phi(Y)$. The unfolded optimization of $E(Y)$ falls into the scope of Algorithm 2, such that the former analysis and Theorem 5.1 apply, and we may conclude that the aggregated update rule

$$Y^{(t+1)} = \operatorname{ReLU}\left[ \operatorname{softmax}_\beta\left( Y^{(t)} W_a^s Y^{(t)\top} \right) Y^{(t)} W_f^s \right] \quad (20)$$

is a descent algorithm for $E$, except within a region of finite measure. Moreover, if we modify the underlying gradient descent algorithm to make the step sizes $\alpha_1$ and $\alpha_2$ sufficiently small, the size of the exception region will tend to zero, and the update rule will be equipped with a residual term (see supplementary; similarly for a discussion about $\beta$). Beyond these issues, the unfolded update (20) only differs from (8) in its reliance on symmetric weights. This corresponds with Challenge 4, which we have not fully solved, although it has been shown that symmetric weights can mimic asymmetric weights if we enlarge the representation dimensions [19, 50].

## 6 Practical Verification

Although we have rigorously derived convergence criteria whereby Transformer layers descend a well-specified energy function to a region around optimal solutions, the analysis admittedly relies on conditions that would be difficult to formally verify on real-world datasets. However, our results are nonetheless amenable to targeted empirical corroboration, whereby we can check whether the proposed energy does in fact descend during the Transformer forward pass on typical benchmarks. To this end, we implement a Transformer model as described previously, up to known limitations like symmetric weights. We apply this model to two benchmarks, IMDB [30] and SST2 [40], which are both commonly-used sentiment classification datasets, using GloVe-840B-300d [33] word embeddings. Figures 5 and 6 display the energy of the output of each layer of a Transformer (as defined in (8)) averaged over 200 randomly chosen samples from the test set. Figure 5 uses randomly initialized weights, while Figure 6 involves weights trained for 2000 steps with SGD and learning rate 0.01.
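The kind of check reported in Figures 5 and 6 can be mimicked on synthetic data with a few lines of NumPy: stack the unfolded layer (20) and evaluate the derived energy at every layer. The sketch below uses random (untrained) weights, toy dimensions, and the reconstructed $\rho(z) = -\exp(-z)$; it only illustrates the bookkeeping, and the energy is expected to decrease while the technical conditions of Theorem 5.1 hold rather than being guaranteed to do so here.

```python
import numpy as np

def E1(Z):
    # Attention energy (9) with rho(z) = -exp(-z) and R(Z) = 0.5*||Z||_F^2.
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return -np.sum(np.exp(-0.5 * sq)) + 0.5 * np.sum(Z ** 2)

def E2(Y, Wf):
    # Feed-forward energy from Section 4.2.
    return -0.5 * np.trace(Y @ Wf @ Y.T) + 0.5 * np.sum(Y ** 2)

def total_energy(Y, Wa, Wf):
    # Total energy E(Y) = E1(Y Wa) + E2(Y) + phi(Y); after a ReLU we have
    # Y >= 0, so the indicator phi contributes zero.
    return E1(Y @ Wa) + E2(Y, Wf)

def layer(Y, Wa, Wf, alpha2):
    # Unfolded update (20), with the beta reweighting of Theorem 3.1
    # (computed on Z = Y Wa) folded into the attention logits.
    Z = Y @ Wa
    logits = Z @ Z.T - 0.5 * np.sum(Z ** 2, axis=1)[None, :]
    logits -= logits.max(axis=1, keepdims=True)
    A = np.exp(logits); A /= A.sum(axis=1, keepdims=True)
    Wfs = (1.0 - alpha2) * np.eye(Wf.shape[0]) + alpha2 * 0.5 * (Wf + Wf.T)
    return np.maximum(A @ Y @ Wfs, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 10, 8
    Y = np.abs(rng.normal(size=(n, d)))          # nonnegative token features
    Wa = rng.normal(size=(d, d)) / np.sqrt(d)    # random (untrained) weights
    Wf = rng.normal(size=(d, d)) / np.sqrt(d)
    for t in range(12):                          # a 12-layer stack
        print(f"layer {t:2d}  E = {total_energy(Y, Wa, Wf):9.4f}")
        Y = layer(Y, Wa, Wf, alpha2=0.1)
    print(f"output     E = {total_energy(Y, Wa, Wf):9.4f}")
```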
Additionally, for the trained model we change the term $\|Y\|_F^2$ in $E_1$ to $\|Y - X\|_F^2$ in order to avoid degenerate representations (as can sometimes occur in trained Transformers [39]), noting that this modification is equally well covered by our theory and leads to a commonly-used form of residual connection in the resulting Transformer architecture. See supplementary for details.

Figure 5: Box plots of the energy curves from randomly initialized 12-layer Transformers on IMDB (left) and SST2 (right) datasets; results are averaged over samples.

Figure 6: Box plots of the energy curves from trained 12-layer Transformers on IMDB (left) and SST2 (right) datasets; results are averaged over samples.

From these figures it is clear that even with real-world data, the Transformer energy we have derived is (on average) monotonically decreasing across layers, matching the predictions of our analysis. Moreover, with 12 layers, which represents a depth range not uncommon in practice (e.g., BERT), the model has not entered the fluctuation region. This observation holds even with trained Transformer models composed of many/most of the typical components used in practice, namely, layers with self-attention, a linear transform followed by a nonlinearity, residual connections, etc. Hence, although seemingly complicated, the conditions adopted by Lemma 4.3 and Theorem 5.1 are nonetheless likely to hold in many practical settings.

## 7 Conclusion

While our contributions here have been primarily of a theoretical nature, there are nonetheless take-home messages that may be of practical relevance. First, as our overriding goal was to reproduce Transformer layers as closely as possible using an unfolded optimization perspective, very specific design choices were made in constructing the core underlying energy function. However, in practice we are free to choose alternative energies that can lead to different forms of tailored self-attention that may be advantageous on an application-specific basis. As a brief/simple representative example of the latter, consider the following: The canonical Transformer uses softmax to normalize the attention coefficients. However, multiple works have questioned the appropriateness of softmax normalization, e.g., [27, 34]. Under our optimization unfolding perspective, we can see that the softmax normalization in the Transformer is derived from a specific choice of the $\rho$ function, namely $\rho(z) = -e^{-z}$ (Section 3.1 and Theorem 3.1). But if we choose a different $\rho$, it would lead to a new normalization method other than softmax with interpretable properties. For instance, if $\rho(z) = \log(z + 2)$, then the attention coefficients would behave as (for simplicity we assume all $y_i$'s have unit norm below)

$$a_{i,j} \propto \frac{1}{2 - y_i^\top y_j},$$

where $a_{i,j}$ is the attention coefficient between the $i$-th and $j$-th token. Since $\log(z + 2)$ grows slower than $-e^{-z}$ for $z \in [0, 1]$, the associated attention formulation would tend to allow more dissimilar representations between tokens. This is because in the new energy, a large value of $\|y_i - y_j\|$ (which means dissimilar $y_i$ and $y_j$) contributes less to the energy compared to before, and thus the optimization process has less motivation to reduce it. Conversely, if we choose a $\rho$ that grows faster than $-e^{-z}$ in the stated range (e.g., $\rho(z) = \log(z + 1)$), then the derived model would likely encourage more similar representations between tokens.
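Following the recipe of Section 3, where the unnormalized attention weight between tokens $i$ and $j$ is proportional to $\rho'\left(\tfrac{1}{2}\|y_i - y_j\|^2\right)$, the effect of swapping $\rho$ can be inspected directly. The sketch below contrasts the softmax-inducing choice $\rho'(z) = e^{-z}$ with $\rho(z) = \log(z + 2)$, i.e., $\rho'(z) = 1/(z + 2)$, on unit-norm toy tokens (the data and dimensions are assumptions for illustration).

```python
import numpy as np

def attention_from_rho(Y, rho_prime):
    # Generic attention induced by a concave non-decreasing rho (Section 3):
    # the unnormalized weight between tokens i and j is
    # rho'(0.5 * ||y_i - y_j||^2), normalized row-wise.
    sq = 0.5 * np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = rho_prime(sq)
    return W / W.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(5, 8))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # unit-norm tokens

    softmax_like = attention_from_rho(Y, lambda z: np.exp(-z))     # rho'(z)=e^{-z}
    log_based    = attention_from_rho(Y, lambda z: 1.0 / (z + 2))  # rho(z)=log(z+2)

    # The log-based weights are flatter: large pairwise distances are
    # penalized less, so dissimilar tokens retain more attention mass.
    print("softmax-like row 0:", np.round(softmax_like[0], 3))
    print("log-based    row 0:", np.round(log_based[0], 3))
```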
There are other ramifications of this overall framework with practical relevance as well. For example, the actual distribution of attention weights can potentially be better understood or influenced by properties of the energy function that produces them. Moreover, especially for regimes with limited data and therefore fewer free model parameters, the unfolding perspective can be used to devise architectures with inductive biases aligned with downstream tasks to help compensate for less model flexibility. We conclude by noting that the techniques introduced in this paper can in some sense be viewed as universal tools for constructing a broader family of unfolding architectures. In fact, many/most deep learning models, including those with feed-forward structure [25], token-level interactions [44], residual connections [17], and heterogeneous layer types, can be reinterpreted through the lens of unfolded optimization using the framework we have introduced, at least up to some potential constraints like symmetric weights (which may be handled in other ways).

## References

[1] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, ICML, 2017.

[2] Brandon Amos, Lei Xu, and J. Zico Kolter. Input convex neural networks. In International Conference on Machine Learning, ICML, 2017.

[3] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223-311, 2018.

[4] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Velickovic. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. CoRR, abs/2104.13478, 2021.

[5] Siheng Chen and Yonina C. Eldar. Graph signal denoising via unrolling networks. In International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2021.

[6] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019.

[7] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Ann. Oper. Res., 153(1):235-256, 2007.

[8] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185-212. Springer, 2011.

[9] Olivier Devolder, François Glineur, and Yurii E. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Math. Program., 146(1-2):37-75, 2014.

[10] Jordan Frécon, Gilles Gasso, Massimiliano Pontil, and Saverio Salzo. Bregman neural networks. In International Conference on Machine Learning, ICML, 2022.

[11] Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? In International Conference on Learning Representations, ICLR, 2021.

[12] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In International Conference on Machine Learning, ICML, 2010.

[13] Bin Gu, De Wang, Zhouyuan Huo, and Heng Huang. Inexact proximal gradient methods for non-convex and non-smooth optimization. In AAAI Conference on Artificial Intelligence, 2018.

[14] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. Star-Transformer. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, 2019.

[15] Moritz Hardt. Understanding alternating minimization for matrix completion. In Annual Symposium on Foundations of Computer Science, FOCS, 2014.
[16] Hao He, Bo Xin, Satoshi Ikehata, and David Wipf. From Bayesian sparsity to gated recurrent nets. In Annual Conference on Neural Information Processing Systems, NeurIPS, 2017.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016.

[18] John Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. CoRR, abs/1409.2574, 2014.

[19] Shell Xu Hu, Sergey Zagoruyko, and Nikos Komodakis. Exploring weight symmetry in deep neural networks. Comput. Vis. Image Underst., 187, 2019.

[20] Simon Jenni and Paolo Favaro. Deep bilevel learning. In European Conference on Computer Vision, ECCV, 2018.

[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Annual Conference on Neural Information Processing Systems, NeurIPS, 2012.

[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436-444, 2015.

[23] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

[24] Jia Li, Mingqing Xiao, Cong Fang, Yue Dai, Chao Xu, and Zhouchen Lin. Training neural networks by lifted proximal operator machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[25] Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. Pay attention to MLPs. CoRR, abs/2105.08050, 2021.

[26] Xiaorui Liu, Wei Jin, Yao Ma, Yaxin Li, Hua Liu, Yiqi Wang, Ming Yan, and Jiliang Tang. Elastic graph neural networks. In International Conference on Machine Learning, ICML, 2021.

[27] Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, and Li Zhang. SOFT: Softmax-free transformer with linear complexity. In Annual Conference on Neural Information Processing Systems, NeurIPS, 2021.

[28] Songtao Lu, Mingyi Hong, and Zhengdao Wang. PA-GD: On the convergence of perturbed alternating gradient descent to second-order stationary points for structured nonconvex optimization. In International Conference on Machine Learning, ICML, 2019.

[29] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. CoRR, abs/2010.01777, 2020.

[30] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Annual Meeting of the Association for Computational Linguistics, ACL, 2011.

[31] S. Deepak Narayanan, Aditya Sinha, Prateek Jain, Purushottam Kar, and Sundararajan Sellamanickam. IGLU: Efficient GCN training via lazy updates. CoRR, abs/2109.13995, 2021.

[32] Xuran Pan, Shiji Song, and Gao Huang. A unified framework for convolution-based graph neural networks, 2021.

[33] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, EMNLP, 2014.

[34] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosFormer: Rethinking softmax in attention. In The Tenth International Conference on Learning Representations, ICLR, 2022.
[35] Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Thomas Adler, David P. Kreil, Michael K. Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. Hopfield networks is all you need. In International Conference on Learning Representations, ICLR, 2021.

[36] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

[37] Pierre Samuel. Projective Geometry. Springer, 1988.

[38] Mark Schmidt, Nicolas Le Roux, and Francis R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Annual Conference on Neural Information Processing Systems, NeurIPS, 2011.

[39] Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen Lee, and James T. Kwok. Revisiting over-smoothing in BERT from the perspective of graph. In International Conference on Learning Representations, ICLR, 2022.

[40] Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional vector grammars. In Annual Meeting of the Association for Computational Linguistics, ACL, 2013.

[41] Pablo Sprechmann, Alex Bronstein, and Guillermo Sapiro. Learning efficient sparse and low rank models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 2015.

[42] Ying Sun, Prabhu Babu, and Daniel P. Palomar. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE Transactions on Signal Processing, 65(3):794-816, 2016.

[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Annual Conference on Neural Information Processing Systems, NeurIPS, 2017.

[44] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, ICLR, 2018.

[45] Zhangyang Wang, Qing Ling, and Thomas Huang. Learning deep ℓ0 encoders. In AAAI Conference on Artificial Intelligence, 2016.

[46] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim., 24(4):2057-2075, 2014.

[47] Xingyu Xie, Qiuhao Wang, Zenan Ling, Xia Li, Yisen Wang, Guangcan Liu, and Zhouchen Lin. Optimization induced equilibrium networks. CoRR, abs/2105.13228, 2021.

[48] Chaoqi Yang, Ruijie Wang, Shuochao Yao, Shengzhong Liu, and Tarek F. Abdelzaher. Revisiting "over-smoothing" in deep GCNs. CoRR, abs/2003.13663, 2020.

[49] Yongyi Yang, Tang Liu, Yangkun Wang, Jinjing Zhou, Quan Gan, Zhewei Wei, Zheng Zhang, Zengfeng Huang, and David Wipf. Graph neural networks inspired by classical iterative algorithms. In International Conference on Machine Learning, ICML, 2021.

[50] Yongyi Yang, Yangkun Wang, Zengfeng Huang, and David Wipf. Implicit vs unfolded graph neural networks. CoRR, abs/2111.06592, 2021.

[51] Alan L. Yuille and Anand Rangarajan. The concave-convex procedure (CCCP). In Annual Conference on Neural Information Processing Systems, NeurIPS, 2001.

[52] Hongwei Zhang, Tijin Yan, Zenjun Xie, Yuanqing Xia, and Yuan Zhang. Revisiting graph convolutional network on semi-supervised node classification from an optimization perspective. CoRR, abs/2009.11469, 2020.

[53] Kaixiong Zhou, Xiao Huang, Daochen Zha, Rui Chen, Li Li, Soo-Hyun Choi, and Xia Hu. Dirichlet energy constrained learning for deep graph neural networks. CoRR, abs/2107.02392, 2021.

[54] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. CoRR, abs/2101.11859, 2021.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] Our work is mostly theoretical and devoted to the better understanding of an existing, well-established modeling framework; we do not see any clear pathway to negative societal impact.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] They are in the supplementary.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]