# How Smooth Is Attention?

Valérie Castin¹  Pierre Ablin²  Gabriel Peyré¹ ³

¹École Normale Supérieure PSL, Paris, France. ²Apple, Paris, France. ³CNRS. Correspondence to: Valérie Castin.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties, which are key when it comes to analyzing robustness and expressive power, is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor, and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.

1. Introduction

Introduced by Vaswani et al. (2017), Transformers and their multi-head attention mechanism (Bahdanau et al., 2014) have significantly changed the machine learning landscape in just a few years, by becoming state-of-the-art models on a wide variety of tasks, from natural language processing (Brown et al., 2020; Radford et al., 2019; Wolf et al., 2019) to computer vision (Dosovitskiy et al., 2020; Zhao et al., 2020; Zhai et al., 2022; Lee et al., 2019). Despite this great empirical success, however, little is known from a theoretical perspective about the smoothness of Transformer architectures, particularly of self-attention, their main building block. We tackle this problem by focusing on the Lipschitz properties of self-attention, especially on its local Lipschitz constant, which controls how fast the output can change with respect to the input in the neighborhood of each point of the domain.

Studying the Lipschitz continuity of neural networks is particularly relevant for various questions (Rosca et al., 2020). It provides guarantees of adversarial robustness, in an attack-agnostic way (Szegedy et al., 2013; Cisse et al., 2017; Tsuzuku et al., 2018; Anil et al., 2019; Weng et al., 2018). Identifying inputs with a high local Lipschitz constant and understanding which local perturbations trigger the biggest change in the output also allows for robustifying the network, for example using adversarial training (Goodfellow et al., 2014; Miyato et al., 2015; Moosavi-Dezfooli et al., 2016; Kurakin et al., 2016). The Lipschitz constant is also involved in generalization bounds (Sokolić et al., 2017; Neyshabur et al., 2017; Bartlett et al., 2017; von Luxburg & Bousquet, 2004). From a different perspective, Lipschitz-constrained neural networks can be used to estimate Wasserstein distances (Peyré et al., 2019), enhance expressivity and improve the performance of deep models (Miyato et al., 2018; Dasoulas et al., 2021), and build invertible neural networks (Behrmann et al., 2019; Chen et al., 2019).
Finally, bounding the Lipschitz constant of a neural network is an important step in the study of the associated neural ODE (Chen et al., 2018), in particular of its well-posedness (Lu et al., 2019; Geshkovski et al., 2023a;b).

Lipschitz continuity of feed-forward neural networks (FFNs) has been extensively studied and remains a hard problem. Estimating numerically the Lipschitz constant of a FFN is indeed NP-hard (Virmaux & Scaman, 2018), and theoretical bounds appear to be much larger than the actual Lipschitz constant (Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020). The main theoretical difficulty here is to handle the composition of several layers more accurately than just bounding it by the product of spectral norms of weight matrices, as done by Szegedy et al. (2013). Still, taken independently, each linear map or activation function has a known tight Lipschitz constant. This is not the case for Transformers: the self-attention map has an involved non-linear structure, which makes the estimation of its local Lipschitz constant challenging and brings into play completely different approaches than for FFNs (Kim et al., 2021; Vuckovic et al., 2021).

1.1. Contributions

We make the following contributions.

We derive a bound on the local Lipschitz constant of self-attention, which takes the form $C\sqrt{n}$ with $n$ the sequence length of inputs and $C$ a constant factor that depends on the parameters of self-attention and on an upper bound $R$ on the magnitude of tokens (Theorem 3.3). We show that our bound is tight in $n$ for reasonable (i.e., not too large) sequence lengths (Proposition 3.4). Moreover, in most Transformer architectures, the magnitude $R$ only depends on the parameters of the network because of normalization layers (Subsection 2.3).

We identify a large radius regime that is easier to analyze theoretically, with $n$ fixed and $R$ very large. In this regime, and except for a measure-zero set of pathological configurations, we show that the local Lipschitz constant of self-attention is bounded by $C\sqrt{n}$ with $C$ a constant that does not depend on $R$ anymore (Theorem 3.7).

We also study the mean-field regime, where self-attention is modeled as a map on probability measures, which corresponds to the limit $n \to +\infty$. In this framework, we show that the upper bound obtained by Geshkovski et al. (2023a), which is of the form $CR^2e^{CR^2}$, cannot be significantly improved (Proposition 3.6), by finding an $R$-indexed family of two-Dirac probability measures supported in the closed ball $B_R$ of center 0 and of radius $R$ whose local Lipschitz constant grows like $C'R^2e^{C'R^2}$ with $C' \geq C/16$.

We are the first to study the Lipschitz constant of masked self-attention. We introduce a novel mean-field framework for masked self-attention, where the order of points in input measures is encoded in a supplementary coordinate, and show, in the general regime, the large radius regime and the mean-field regime alike, that similar bounds hold for masked self-attention as for unmasked self-attention (Section 4).

We compute numerically the local Lipschitz constant of unmasked and masked self-attention in a BERT model and a GPT-2 model, where inputs are text extracts, and observe a growth rate of $n^{1/4}$ up to a constant factor, with $n$ the sequence length. Then, with the same networks, we build adversarial data in the input space of self-attention whose Lipschitz constant grows like $\sqrt{n}$, which evidences the tightness of our bounds (Section 5).

1.2. Related Work

Robustness and local Lipschitz constant estimation.
Neural networks are vulnerable to adversarial attacks (Szegedy et al., 2013), and most of the methods proposed to measure and increase their robustness focus on specific attacks (Goodfellow et al., 2014; Papernot et al., 2016). It turns out, however, that such methods can be defeated by well-chosen unseen attacks (Carlini & Wagner, 2017). Measures of robustness that are agnostic to attack methods have therefore been proposed, often relying on the notion of Lipschitz constant of networks (Szegedy et al., 2013; Leino et al., 2021; Tsuzuku et al., 2018). As robustness lower bounds that rely on the (global) Lipschitz constant tend to be too loose, tighter constraints have been proposed involving the local Lipschitz constant (Hein & Andriushchenko, 2017; Weng et al., 2018). The problem of evaluating the local Lipschitz constant of deep networks is now at the heart of several recent articles (Tsuzuku et al., 2018; Leino et al., 2021), in particular for Transformers (Kim et al., 2021; Vuckovic et al., 2020; Geshkovski et al., 2023a; Catellier et al., 2023). From a more practical viewpoint, several Lipschitz-constrained variants of the Transformer architecture have been proposed, to increase robustness and reliability (Jia et al., 2023; Gupta & Verma, 2023; Ye et al., 2023; Qi et al., 2023).

Neural networks acting on measures. De Bie et al. (2019) and Pevny & Kovarik (2019) are the first to define neural networks whose inputs are probability measures, followed by several other articles (Vuckovic et al., 2020; Zweig & Bruna, 2021; Sander et al., 2022; Geshkovski et al., 2023a). Modeling neural networks as maps on probability measures has multiple applications, such as studying Wasserstein regularity (Vuckovic et al., 2020; Geshkovski et al., 2023a), proving generalization bounds (Zweig & Bruna, 2021) and doing a mean-field limit analysis of the dynamics of particles as they go through the network (Geshkovski et al., 2023a). The mean-field approach is particularly suited to the case of Encoder-only Transformers (Devlin et al., 2018), as the self-attention map is permutation equivariant, i.e., ignores the order of vectors in its input. This property can be leveraged to model any infinitely deep Encoder as a partial differential equation (PDE) on the space of measures (Sander et al., 2022), following the principle of neural ODEs (Chen et al., 2018). Analyzing this PDE then provides information about the dynamics of tokens as they go through the Transformer, showing for instance the emergence of clusters (Geshkovski et al., 2023a;b). In contrast, masked self-attention, which is crucial in Decoder-only (Liu et al., 2018) and Encoder-Decoder (Vaswani et al., 2017) architectures, is not permutation equivariant, so it cannot be cast as naturally into a mean-field framework.

Regularity of self-attention and its variants. Kim et al. (2021) show that the self-attention map is not globally Lipschitz continuous by deriving a lower bound on its Lipschitz constant restricted to $B_R^n$. Their lower bound grows quadratically with $R$. To gain regularity, they define a new self-attention map, called L2 self-attention, which is globally Lipschitz continuous on the set of inputs of length $n$, for all $n \geq 1$. Dasoulas et al. (2021) enforce the Lipschitz continuity of self-attention modules by normalizing the attention scores with a well-chosen normalization function.
Geshkovski et al. (2023a) and Vuckovic et al. (2020) prove a mean-field upper bound on the Lipschitz constant of self-attention on $B_R$, by viewing self-attention as a map acting on probability measures. Their upper bound grows more than exponentially with $R$, so that the quadratic lower bound and the exponential upper bound put together provide a very loose estimation of the Lipschitz constant of self-attention on compact sets. Finally, Sander et al. (2022) propose a modification of the attention kernel that builds on the Sinkhorn-Knopp algorithm, and provide empirical evidence of the better properties of this new choice of kernel with respect to the classical one.

1.3. Notations

The Euclidean norm on $\mathbb{R}^d$ is denoted $|\cdot|$. For any vector $w \in \mathbb{R}^n$, we denote $\mathrm{softmax}(w) := \big(\exp(w_i)/\sum_{j=1}^{n}\exp(w_j)\big)_{1 \leq i \leq n}$ the softmax operator, and $\mathrm{diag}(w)$ the diagonal matrix such that $\mathrm{diag}(w)_{ii} = w_i$. For any function $g\colon E \to F$ and any subset $\mathcal{X} \subset E$, the restriction of $g$ to $\mathcal{X}$ is denoted $g|_{\mathcal{X}}$. The closed ball centered at 0 and of radius $R > 0$ is denoted $B_R$. For $\varphi, \psi\colon \mathbb{R} \to \mathbb{R}$ and $a \in \mathbb{R} \cup \{+\infty\}$, we write $\varphi(x) \sim_{x \to a} \psi(x)$ if $\varphi(x)/\psi(x)$ is well-defined for $x$ close enough to $a$, and $\varphi(x)/\psi(x) \to_{x \to a} 1$.

2. Standard and Mean-Field Self-Attention

2.1. Unmasked Self-Attention

Unmasked self-attention, usually just called self-attention, is central in the architecture of Transformer Encoders (Vaswani et al., 2017), which are nowadays widely used for computer vision tasks (Dosovitskiy et al., 2020). It maps sequences of $n$ vectors to sequences of $n$ vectors, for any integer $n$.

Definition 2.1 (Single-head self-attention). Let $k, d \in \mathbb{N}$. Let $Q, K, V \in \mathbb{R}^{k \times d}$ be three matrices. For any integer $n \in \mathbb{N}$ and any vectors $x_1, \dots, x_n \in \mathbb{R}^d$, self-attention with parameters $(Q, K, V)$ maps the sequence $(x_1, \dots, x_n) \in (\mathbb{R}^d)^n$ to
$$f(x_1, \dots, x_n) := \Big(V \sum_{j=1}^{n} P_{ij} x_j\Big)_{1 \leq i \leq n} \in (\mathbb{R}^k)^n, \quad \text{with } P_i := \mathrm{softmax}\big((x_i^\top Q^\top K x_j/\sqrt{k})_{1 \leq j \leq n}\big).$$

To alleviate notations, we will denote $A := K^\top Q/\sqrt{k} \in \mathbb{R}^{d \times d}$ in what follows. In Encoders, several self-attention heads are usually combined to obtain multi-head self-attention.

Definition 2.2 (Multi-head self-attention). Let $d \in \mathbb{N}$ and $H$ a divisor of $d$. For $1 \leq h \leq H$, let $Q^{(h)}, K^{(h)}, V^{(h)} \in \mathbb{R}^{k \times d}$ with $k := d/H$, and $W^{(h)} \in \mathbb{R}^{d \times k}$. Multi-head self-attention with parameters $(Q^{(h)}, K^{(h)}, V^{(h)}, W^{(h)})_{1 \leq h \leq H}$ maps any sequence $(x_1, \dots, x_n) \in (\mathbb{R}^d)^n$ to
$$f^{\mathrm{MH}}(x_1, \dots, x_n) := \sum_{h=1}^{H} W^{(h)} f^{(h)}(x_1, \dots, x_n) \in (\mathbb{R}^d)^n,$$
where $f^{(h)}$ denotes single-head self-attention with parameters $(Q^{(h)}, K^{(h)}, V^{(h)})$, and $W^{(h)}$ acts on each vector of the sequence.
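To fix ideas, here is a minimal PyTorch sketch of Definitions 2.1 and 2.2; the function names are ours and the code is only an illustration under the shapes above, not code from the paper. It forms $A = K^\top Q/\sqrt{k}$ explicitly and computes $f(X)_i = V\sum_j P_{ij}x_j$.

```python
import torch

def self_attention(X, Q, K, V):
    """Single-head self-attention of Definition 2.1 (no residual, no masking).

    X: (n, d) sequence of tokens; Q, K, V: (k, d) parameter matrices.
    Returns the (n, k) sequence (V sum_j P_ij x_j)_i.
    """
    k = Q.shape[0]
    A = K.T @ Q / k ** 0.5             # A = K^T Q / sqrt(k)
    scores = X @ A.T @ X.T             # scores[i, j] = x_i^T A^T x_j
    P = torch.softmax(scores, dim=-1)  # row-stochastic attention matrix
    return P @ X @ V.T                 # i-th row: V sum_j P_ij x_j

def multi_head_self_attention(X, heads):
    """Multi-head self-attention of Definition 2.2.

    heads: list of tuples (Q_h, K_h, V_h, W_h) with Q_h, K_h, V_h in R^{k x d},
    W_h in R^{d x k} and k = d / H.  Returns sum_h W_h f_h(X), of shape (n, d).
    """
    return sum(self_attention(X, Qh, Kh, Vh) @ Wh.T for Qh, Kh, Vh, Wh in heads)

# toy usage with arbitrary dimensions
n, d, H = 5, 8, 2
k = d // H
X = torch.randn(n, d)
heads = [tuple(torch.randn(*s) for s in [(k, d), (k, d), (k, d), (d, k)]) for _ in range(H)]
print(multi_head_self_attention(X, heads).shape)  # torch.Size([5, 8])
```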
When $n$ is very large, it can be convenient to model self-attention as a map between probability measures (Sander et al., 2022; Geshkovski et al., 2023a). Indeed, the self-attention map $f$ is permutation equivariant, which means that for all permutations $\sigma$ of the set $\{1, \dots, n\}$ and all inputs $X = (x_1, \dots, x_n) \in (\mathbb{R}^d)^n$, it holds that $f(x_{\sigma(1)}, \dots, x_{\sigma(n)}) = (f(X)_{\sigma(1)}, \dots, f(X)_{\sigma(n)})$. Informally, this means that self-attention is blind to the order of vectors, so that it naturally induces a map between empirical measures, by replacing ordered sequences $X = (x_1, \dots, x_n)$ with their associated empirical measure $m(X) := \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$, which does not encode the order of points anymore. To extend self-attention to more general probability measures, following Sander et al. (2022), let us introduce the notion of pushforward.

Definition 2.3 (Santambrogio (2015)). For a probability measure $\mu$ on $\mathbb{R}^d$ and a measurable map $\varphi\colon \mathbb{R}^d \to \mathbb{R}^d$, the pushforward of $\mu$ by $\varphi$, denoted $\varphi\sharp\mu$, is the probability measure given by $(\varphi\sharp\mu)(B) := \mu(\varphi^{-1}(B))$ for any Borel set $B \subset \mathbb{R}^d$, where $\varphi^{-1}(B) := \{x \in \mathbb{R}^d : \varphi(x) \in B\}$.

Intuitively, $\varphi\sharp\mu$ is obtained by transporting each element of mass $\mu(dx)$ from $x$ to $\varphi(x)$. We are now ready to define mean-field self-attention.

Definition 2.4 (Mean-field self-attention). Let $Q, K, V \in \mathbb{R}^{k \times d}$, and denote $A := K^\top Q/\sqrt{k}$. Mean-field self-attention with parameters $(A, V)$ is defined as $F\colon \mu \in \mathcal{P}_c(\mathbb{R}^d) \mapsto (\Gamma_\mu)\sharp\mu$ with
$$\Gamma_\mu\colon x \in \mathbb{R}^d \mapsto \frac{\int \exp(x^\top A^\top y)\, V y\, d\mu(y)}{\int \exp(x^\top A^\top y)\, d\mu(y)}.$$

Mean-field self-attention $F$ generalizes discrete self-attention $f$ in the sense that for any input $X \in (\mathbb{R}^d)^n$, we have $F(m(X)) = m(f(X))$ (see Appendix B.1).

2.2. Masked Self-Attention

In Decoder-only architectures, typically used for text generation (Liu et al., 2018; OpenAI, 2023), unmasked self-attention is replaced by masked self-attention.

Definition 2.5 (Masked self-attention). Let $Q, K, V \in \mathbb{R}^{k \times d}$, and $A := K^\top Q/\sqrt{k}$. For any integer $n \in \mathbb{N}$ and any vectors $x_1, \dots, x_n \in \mathbb{R}^d$, masked self-attention with parameters $(A, V)$ maps the sequence $X = (x_1, \dots, x_n) \in (\mathbb{R}^d)^n$ to $(f^m(X)_1, \dots, f^m(X)_n)$, where $f^m(X)_i := f(x_1, \dots, x_i)_i$ with $f$ unmasked self-attention (see Definition 2.1).

Masked self-attention processes inputs sequentially, so it is not permutation equivariant and the map $f^m$ does not directly induce a map on empirical measures as for unmasked self-attention. To overcome this limitation and still give meaning to masked self-attention when the sequence length goes to infinity, we introduce the following novel mean-field framework. Instead of viewing inputs $(x_1, \dots, x_n) \in (\mathbb{R}^d)^n$ as empirical measures $\frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$, we add a coordinate $s_i \in [0, 1]$ to each $x_i$, to encode its position in the sequence. We then define mean-field masked self-attention as a map between probability measures on the product space $[0, 1] \times \mathbb{R}^d$.

Definition 2.6 (Mean-field masked self-attention). For any probability measure $\bar\mu \in \mathcal{P}_c([0, 1] \times \mathbb{R}^d)$, denote $\mu$ its second marginal, i.e. $\mu(A) := \int_{s=0}^{1}\int_{x \in A} d\bar\mu(s, x)$ for all Borel sets $A \subset \mathbb{R}^d$. We define mean-field masked self-attention on $\mathcal{P}_c([0, 1] \times \mathbb{R}^d)$ as $F^m\colon \bar\mu \mapsto (\Gamma_{\bar\mu})\sharp\bar\mu$ where
$$\Gamma_{\bar\mu}(s, x) := \Big(s,\ \frac{\int_{[0,1] \times \mathbb{R}^d} \exp(x^\top A^\top y)\, V y\, \mathbf{1}_{\tau \leq s}\, d\bar\mu(\tau, y)}{\int_{[0,1] \times \mathbb{R}^d} \exp(x^\top A^\top y)\, \mathbf{1}_{\tau \leq s}\, d\bar\mu(\tau, y)}\Big).$$

This generalizes Definition 2.5 in the following sense: given any increasing sequence $0 \leq s_1 < \dots < s_n \leq 1$, denoting $\mathrm{ord}$ the transformation $\mathrm{ord}\colon X = (x_1, \dots, x_n) \in (\mathbb{R}^d)^n \mapsto \frac{1}{n}\sum_{i=1}^{n}\delta_{(s_i, x_i)} \in \mathcal{P}_c([0, 1] \times \mathbb{R}^d)$, we have $F^m(\mathrm{ord}(X)) = \mathrm{ord}(f^m(X))$ for all $X \in (\mathbb{R}^d)^n$.

Beyond the mean-field analysis of the Lipschitz continuity of masked self-attention, the map $F^m$ can be used in future work to model Decoders as partial differential equations on probability measures and study the dynamics of tokens as they go through the network, as Geshkovski et al. (2023a;b) do for Encoders.¹

¹However, to do so, one should rather set the first coordinate of $\Gamma_{\bar\mu}(s, x)$ to 0 instead of $s$, so that the residual map $(\mathrm{Id} + \Gamma_{\bar\mu})\sharp\bar\mu$ preserves the first marginal.
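The sequential definition $f^m(X)_i = f(x_1, \dots, x_i)_i$ coincides with the usual causal-mask implementation; the minimal sketch below checks this equivalence numerically (our own illustration with a toy single-head attention and arbitrary parameters, not code from the paper).

```python
import torch

def self_attention(X, A, V):
    # single-head self-attention with A = K^T Q / sqrt(k) already formed
    P = torch.softmax(X @ A.T @ X.T, dim=-1)
    return P @ X @ V.T

def masked_self_attention(X, A, V):
    """Masked self-attention of Definition 2.5: f^m(X)_i = f(x_1, ..., x_i)_i."""
    return torch.stack([self_attention(X[: i + 1], A, V)[i] for i in range(X.shape[0])])

def masked_self_attention_causal(X, A, V):
    """Equivalent formulation with a causal mask on the attention scores."""
    n = X.shape[0]
    scores = X @ A.T @ X.T
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))   # forbid attending to future tokens
    return torch.softmax(scores, dim=-1) @ X @ V.T

d, n = 6, 4
X, A, V = torch.randn(n, d), torch.randn(d, d), torch.randn(d, d)
assert torch.allclose(masked_self_attention(X, A, V),
                      masked_self_attention_causal(X, A, V), atol=1e-5)
```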
2.3. Normalization

Normalization is a key part of the Transformer architecture. The most common choice of normalization is Layer Norm (Ba et al., 2016), which has two learnable parameters $\gamma, \beta \in \mathbb{R}^d$. It acts on each input of the sequence individually with the formula
$$\mathrm{norm}(x) = \gamma \odot \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} + \beta,$$
where $\mathrm{mean}(x) := \frac{1}{d}\sum_{j=1}^{d} x_j$ and $\mathrm{std}(x) := \big(\frac{1}{d}\sum_{j=1}^{d}(x_j - \mathrm{mean}(x))^2\big)^{1/2}$ are two scalars that depend on $x$. Each vector of the output of Layer Norm is on an ellipsoid $S$ of center $\beta$ and of covariance $\mathrm{diag}(\gamma)^2 d$. Another popular and simpler normalization is RMSNorm (Zhang & Sennrich, 2019), which has one learnable parameter $\gamma \in \mathbb{R}^d$ and acts on each input of the sequence individually with the formula $\mathrm{norm}(x) = \gamma \odot \frac{x}{|x|}\sqrt{d}$. RMSNorm is used in recent large language (Jiang et al., 2023; Touvron et al., 2023) and vision (Dehghani et al., 2023) models. Each vector of the output of RMSNorm is on an ellipsoid $S$ centered at 0 and of covariance $\mathrm{diag}(\gamma)^2 d$.

There are two ways to place the normalization layers in the Transformer. The original Transformer uses post-normalization: normalization is applied after each residual connection. Letting $X = (x_1, \dots, x_n)$, the output of residual attention is therefore $\mathrm{norm}(X + f(X))$. However, pre-normalization (Xiong et al., 2020), where normalization is applied before the attention layer, giving $X + f(\mathrm{norm}(X))$, is now more widespread. Although the two formulations are not equivalent, the input of self-attention $f$ is normalized in both cases: by definition for pre-normalization, and because the previous layer was normalized for post-normalization. Hence, in practice, the input of self-attention is not any sequence in $(\mathbb{R}^d)^n$, but a sequence in $B_R^n$ for $R$ depending only on the parameters of norm.

It is also worth noticing that for RMSNorm, the parameter $\gamma$ can be absorbed in the parameters $\theta := (Q, K, V)$ of self-attention:
$$f_\theta \circ \mathrm{norm}(x_1, \dots, x_n) = f_{\theta'}\Big(\frac{x_1}{|x_1|}, \dots, \frac{x_n}{|x_n|}\Big)$$
with $\theta' := (Q\,\mathrm{diag}(\lambda), K\,\mathrm{diag}(\lambda), V\,\mathrm{diag}(\lambda))$ and $\lambda := \sqrt{d}\,\gamma$. In other words, RMSNorm followed by self-attention is equivalent to a projection on the unit sphere followed by self-attention with different parameters. This provides a simple way to bound the Lipschitz constant of the composition $f_\theta \circ \mathrm{norm}$, by directly applying our bounds on the Lipschitz constant of $f_{\theta'}$ for $R = 1$.

3. Tight Bounds on the Lipschitz Constant of Self-Attention

3.1. Lipschitz Constants and Self-Attention

Lipschitz constants provide a natural way of controlling the regularity of a function. Their definition depends on the structure that is chosen for the input and output spaces.

Euclidean framework. Let $d, n \in \mathbb{N}$ and $f\colon (\mathbb{R}^d)^n \to (\mathbb{R}^d)^n$. We equip the input and output spaces of $f$ with the Frobenius norm $\|X\|_F := \big(\sum_{i=1}^{n}|x_i|^2\big)^{1/2}$ for any sequence of vectors $X = (x_1, \dots, x_n)$, and assume that $f$ is differentiable. The local Lipschitz constant of $f$ at an input $X = (x_1, \dots, x_n)$ is defined as
$$\mathrm{Lip}_X f := \|D_X f\|_2,$$
where $D_X f$ is the differential of the function $f$ at $X$, and $\|\cdot\|_2$ denotes the operator norm induced by $\|\cdot\|_F$. We can also define, for any subset $\mathcal{X} \subset (\mathbb{R}^d)^n$, the Lipschitz constant of $f$ on $\mathcal{X}$, as
$$\mathrm{Lip}\, f|_{\mathcal{X}} := \sup_{\substack{X, Y \in \mathcal{X},\ X \neq Y}} \frac{\|f(X) - f(Y)\|_F}{\|X - Y\|_F}.$$
The local Lipschitz constant tells us how fast the output of $f$ can vary locally, while the Lipschitz constant on $\mathcal{X}$ controls how fast the output of $f$ can vary on the whole set $\mathcal{X}$. We have the following connection between the two.

Lemma 3.1 (Federer (2014)). Let $\mathcal{X}$ be an open and connected subset of $(\mathbb{R}^d)^n$. Then $\mathrm{Lip}(f|_{\mathcal{X}}) = \sup_{X \in \mathcal{X}} \|D_X f\|_2$.

Mean-field framework. Let $d \in \mathbb{N}$ and denote $\mathcal{P}_c(\mathbb{R}^d)$ the set of compactly supported probability measures on $\mathbb{R}^d$. We equip this set with the 2-Wasserstein distance, defined as
$$W_2(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \Big(\int |x - y|^2\, d\pi(x, y)\Big)^{1/2}$$
for $\mu, \nu \in \mathcal{P}_c(\mathbb{R}^d)$, where $\Pi(\mu, \nu)$ is the set of couplings between $\mu$ and $\nu$, i.e. of probability measures $\pi \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ such that $\int \pi(\cdot, y)dy = \mu$ and $\int \pi(x, \cdot)dx = \nu$ (see for example Santambrogio (2015) for more details on the subject). Consider a map $F\colon \mathcal{P}_c(\mathbb{R}^d) \to \mathcal{P}_c(\mathbb{R}^d)$. For any subset $\mathcal{X} \subset \mathcal{P}_c(\mathbb{R}^d)$, the Lipschitz constant of $F$ on $\mathcal{X}$ is defined as
$$\mathrm{Lip}(F|_{\mathcal{X}}) := \sup_{\substack{\mu, \nu \in \mathcal{X},\ \mu \neq \nu}} \frac{W_2(F(\mu), F(\nu))}{W_2(\mu, \nu)}.$$
A notion of local Lipschitz constant can also be defined in the mean-field framework. We defer it to Appendix B.2 as it only appears in some of our proofs.
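For small $n$, the local Lipschitz constant $\mathrm{Lip}_X f = \|D_X f\|_2$ can be evaluated directly by materializing the Jacobian; here is a minimal sketch with PyTorch autograd (our own helper names, with a toy single-head attention standing in for $f$, so only an illustration of the definition above).

```python
import torch

def self_attention(X, A, V):
    P = torch.softmax(X @ A.T @ X.T, dim=-1)
    return P @ X @ V.T

def local_lipschitz_constant(f, X):
    """||D_X f||_2: spectral norm of the (nd x nd) Jacobian of f at X."""
    J = torch.autograd.functional.jacobian(f, X)       # shape (n, d, n, d)
    nd = X.numel()
    return torch.linalg.matrix_norm(J.reshape(nd, nd), ord=2).item()

d, n, R = 4, 8, 2.0
A, V = torch.randn(d, d), torch.randn(d, d)
X = torch.randn(n, d)
X = R * X / X.norm(dim=1, keepdim=True)                # tokens on the sphere of radius R
lip = local_lipschitz_constant(lambda Y: self_attention(Y, A, V), X)
print(f"local Lipschitz constant at X: {lip:.3f}")
```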
Measuring the regularity of self-attention in Wasserstein distance is the natural generalization to the mean-field case of the Euclidean regularity in the case of a finite sequence length. Indeed, when $f$ is self-attention and $F$ is its mean-field generalization, we can connect the two frameworks as follows (see Appendix B.3).

Lemma 3.2. Let $R > 0$. We have $\mathrm{Lip}_F(f|_{B_R^n}) \leq \mathrm{Lip}_{W_2}(F|_{\mathcal{P}(B_R)})$.

3.2. Lipschitz Bounds for Self-Attention in Different Regimes

General upper bound. Kim et al. (2021) show that self-attention is not globally Lipschitz continuous. Let us therefore restrict to sequences $(x_1, \dots, x_n) \in B_R^n$, where $B_R \subset \mathbb{R}^d$ is the closed ball centered at 0 and of radius $R$. We have the following general bound (see Appendix C.2).

Theorem 3.3. Let $Q, K, V \in \mathbb{R}^{k \times d}$ and $A := K^\top Q/\sqrt{k}$. Let $R > 0$ and $n \in \mathbb{N}$. Unmasked self-attention $f$ with parameters $(A, V)$ is Lipschitz continuous on $B_R^n$, with
$$\mathrm{Lip}(f|_{B_R^n}) \leq \sqrt{3}\,\|V\|_2\big(\|A\|_2^2 R^4 (4n + 1) + n\big)^{1/2}.$$

Theorem 3.3 shows that when tokens are restricted to the compact set $B_R$, the Lipschitz constant of self-attention grows at most like $\sqrt{n}$ with the sequence length $n$, up to a constant factor. On the other side, the growth rate $\sqrt{n}$ in Theorem 3.3 is tight as long as $n$ is not too large, a statement made rigorous by the following result (see Appendix C.4).

Proposition 3.4. Let $Q, K \in \mathbb{R}^{k \times d}$ and $V := \mathrm{Id}$. Let $A := K^\top Q/\sqrt{k}$. Denote $f$ unmasked self-attention with parameters $(A, V)$. Let $\gamma_1 \geq \dots \geq \gamma_\delta$ be the real eigenvalues of $A$. Then, for any $R > 0$, and denoting $\gamma := \max(-\gamma_\delta, \gamma_1/8)$, it holds
$$\mathrm{Lip}(f|_{B_R^n}) \geq \frac{\sqrt{n-1}}{1 + (n-1)e^{-2R^2\gamma}}.$$

Proposition 3.4 shows that for any radius $R > 0$, the sequence of Lipschitz constants $(\mathrm{Lip}\, f|_{B_R^n})_{n \in \mathbb{N}}$ grows at least like $\sqrt{n}$ up to a constant factor in a certain range of sequence lengths $n$. For example, if $n \leq 1 + e^{2R^2\gamma}$, then $\mathrm{Lip}(f|_{B_R^n}) \geq \sqrt{n-1}/2$, and we check that for real data and pretrained GPT-2 and BERT, the factor $R^2\gamma$ is of the order of $10^2$ to $10^3$ (see Appendix C.5), so that the inequality $n \leq 1 + e^{2R^2\gamma}$ is always satisfied in practice. Note that Proposition 3.4 gives a worst-case lower bound: there are inputs with a large sequence length and a small local Lipschitz constant, such as $X := (x, \dots, x) \in (\mathbb{R}^d)^n$ for any $x \in \mathbb{R}^d$ and $n$, which satisfies $\|D_X f\|_2 = 1$.
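As an illustration of how the two results interact, the following sketch evaluates the upper bound of Theorem 3.3 and the lower bound of Proposition 3.4, in the forms stated above, for arbitrary illustrative values of $\|A\|_2$, $\|V\|_2$, $\gamma$ and $R$ (these numbers are ours, not the paper's); both quantities scale like $\sqrt{n}$ as long as $n \leq 1 + e^{2R^2\gamma}$, with a gap coming only from the constant factors.

```python
import numpy as np

def upper_bound(norm_A, norm_V, R, n):
    # Theorem 3.3: sqrt(3) * ||V||_2 * (||A||_2^2 R^4 (4n + 1) + n)^(1/2)
    return np.sqrt(3) * norm_V * np.sqrt(norm_A ** 2 * R ** 4 * (4 * n + 1) + n)

def lower_bound(gamma, R, n):
    # Proposition 3.4 (with V = Id): sqrt(n - 1) / (1 + (n - 1) exp(-2 R^2 gamma)),
    # where gamma = max(-gamma_delta, gamma_1 / 8) depends on the real eigenvalues of A.
    return np.sqrt(n - 1) / (1 + (n - 1) * np.exp(-2 * R ** 2 * gamma))

# illustrative values only: ||A||_2 = 1, ||V||_2 = 1, gamma = 1/8, R = 5
norm_A, norm_V, gamma, R = 1.0, 1.0, 1.0 / 8.0, 5.0
threshold = 1 + np.exp(2 * R ** 2 * gamma)   # lower bound grows like sqrt(n) while n <= threshold
for n in [10, 100, 500]:
    print(f"n={n:4d}  lower >= {lower_bound(gamma, R, n):7.2f}"
          f"  upper <= {upper_bound(norm_A, norm_V, R, n):9.2f}  (threshold ~ {threshold:.0f})")
```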
Mean-field regime. What happens when the sequence length $n$ is extremely large? As explained above, the bound provided by Theorem 3.3 becomes too loose: it even goes to $+\infty$ when $n \to +\infty$ with $R$ fixed. For very large sequence lengths, this bound can be refined by leveraging the mean-field framework, as follows.

Theorem 3.5. Let $R > 0$ and $A, V \in \mathbb{R}^{k \times d}$. The mean-field self-attention map $F$ with parameters $(A, V)$ is $W_2$-Lipschitz continuous on the set $\mathcal{P}(B_R)$, and its Lipschitz constant is bounded by
$$\mathrm{Lip}_{W_2}(F|_{\mathcal{P}(B_R)}) \leq \|V\|_2(1 + 3\|A\|_2 R^2)e^{2\|A\|_2 R^2}.$$

Theorem 3.5 follows from computations made by Geshkovski et al. (2023a). We state it to draw a full picture of the regularity of self-attention on compact sets. Let us highlight the following connection between Theorem 3.5 and Theorem 3.3. For any radius $R > 0$, sequence length $n \in \mathbb{N}$ and input $X \in B_R^n$, it holds, according to Lemma 3.2 and Theorem 3.5:
$$\|D_X f\|_2 \leq \|V\|_2(1 + 3\|A\|_2 R^2)e^{2\|A\|_2 R^2}. \tag{1}$$
On the other hand, Theorem 3.3 tells us that
$$\|D_X f\|_2 \leq \sqrt{3}\,\|V\|_2\big(\|A\|_2^2 R^4 (4n + 1) + n\big)^{1/2}.$$
When $n$ is relatively small, Theorem 3.3 is more relevant than Equation (1), and vice versa for $n$ very large. In the following Proposition, we identify an edge regime where both bounds have a similar growth in $R$, which corresponds to $n \sim_{R \to +\infty} e^{cR^2}$ for some constant factor $c > 0$. In this edge regime, the bound of Theorem 3.3 appears to be tight both in $n$ and $R$, and the bound of Theorem 3.5 is almost tight in $R$, up to a constant factor in the exponential.

Proposition 3.6. Let $R > 0$. Assume that $k = d$ and $V = \mathrm{Id}$, and denote $\gamma_1 \geq \dots \geq \gamma_\delta$ the real eigenvalues of $A$. Let $\gamma := \max(-\gamma_\delta, \gamma_1/8)$. Then, if $n \sim_{R \to +\infty} \exp(2\gamma R^2)$, there exists a function $\theta\colon [0, +\infty) \to [0, +\infty)$ such that $\theta(R) \sim_{R \to +\infty} 1$ and
$$\mathrm{Lip}(f|_{B_R^n}) \geq \theta(R)\,\gamma R^2 e^{\gamma R^2}.$$

One sees that the dependency in $R$ of the lower bound in Proposition 3.6 is catastrophic. Fortunately, in practical cases, $n$ is significantly smaller than $e^{2\gamma R^2}$, and the mean-field regime bound is over-pessimistic: one should rather use Theorem 3.3. It is also interesting to note that configurations that lead to the explosion of the right-hand side in Proposition 3.6 are made of two extremely unbalanced clusters, one with 1 vector, and the other with the other $n - 1$ vectors of the sequence (see Appendix C.6).

Large radius regime. Let us now analyze a third regime: the large radius regime, where $R$ goes to infinity with a fixed $n$. This complements the mean-field analysis, where $R$ is fixed and $n$ goes to infinity. Let $n \in \mathbb{N}$ be a fixed sequence length. We show, drawing inspiration from Kim et al. (2021), that there exist configurations with $n$ particles supported in $B_R$ whose local Lipschitz constant grows like $R^2$ up to constant factors (see Appendix C.7). However, if we rule out a measure-zero set of pathological configurations, we get in the large radius regime that the Lipschitz constant grows at most like $\sqrt{n}$ up to a constant factor.

Theorem 3.7. Let $A \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{k \times d}$ be two nonzero matrices. Denote $E_A \subset B(0, 1)^n$ the set of sequences $(x_1, \dots, x_n)$ such that for all $i \in \{1, \dots, n\}$, the maximum $\max_{1 \leq j \leq n} x_i^\top A^\top x_j$ is reached at a unique index $j$, denoted $m_i$. The complement of $E_A$ in $B(0, 1)^n$ has zero Lebesgue measure. Moreover, for any $X \in E_A$, there exists a function $\theta\colon [0, +\infty) \to [0, +\infty)$ such that $\theta(R)$ converges exponentially fast to 1 when $R \to +\infty$ and
$$\|D_{RX} f\|_2 \leq \theta(R)\,\|V\|_2 \sqrt{n}.$$

The proof of this result is interesting, as it provides a better understanding of the Jacobian of self-attention in this limiting regime.

Proof. The sequences $(x_1, \dots, x_n) \in (\mathbb{R}^d)^n$ that are not in $E_A$ are such that either there exists an index $i \in \{1, \dots, n\}$ for which $Ax_i = 0$, or the $x_i$ are not distinct. Both cases are measure-zero situations, as $A \neq 0$. Now let $X \in E_A$. For all perturbations $\epsilon := (\epsilon_1, \dots, \epsilon_n) \in (\mathbb{R}^d)^n$ and all $i \in \{1, \dots, n\}$, we have (see Appendix C.1)
$$(D_{RX} f)(\epsilon)_i = V R^2 \sum_{j=1}^{n} P_{ij}\Big(x_j - \sum_{k=1}^{n} P_{ik} x_k\Big) x_i^\top A^\top \epsilon_j + V \sum_{j=1}^{n} P_{ij} \epsilon_j + V R^2 \sum_{j=1}^{n} P_{ij}\Big(x_j - \sum_{k=1}^{n} P_{ik} x_k\Big) x_j^\top A \epsilon_i,$$
with $P_{ij} := e^{R^2 x_i^\top A^\top x_j}/\sum_{k=1}^{n} e^{R^2 x_i^\top A^\top x_k}$. By definition of $m_i$, we have $P_{ij} \to_{R \to +\infty} \mathbf{1}_{j = m_i}$, and the convergence is exponentially fast, so the first and third terms vanish in the limit despite the $R^2$ factor. As a consequence, in the large radius regime, the Jacobian of self-attention has a much simpler form: $(D_{RX} f)(\epsilon)_i$ converges to $V \sum_{j=1}^{n} \mathbf{1}_{j = m_i} \epsilon_j = V \epsilon_{m_i}$ when $R \to +\infty$, and the operator norm of the limit is bounded by $\|V\|_2 \sqrt{n}$. Moreover, when $V = \mathrm{Id}$ for example, this bound is reached up to a constant factor if there exists an index $j$ such that $j = m_i$ for a constant fraction of the indices $i$, i.e. if a token grasps the attention of a constant fraction of all tokens.
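The hardening described in the proof is easy to observe numerically; the sketch below (our own toy example, reusing the same toy attention and Jacobian helpers as earlier) scales a fixed generic configuration by a growing radius $R$ and prints $\|D_{RX}f\|_2$ together with the smallest row maximum of the attention matrix, which approaches 1 as rows become one-hot.

```python
import torch

def self_attention(X, A, V):
    P = torch.softmax(X @ A.T @ X.T, dim=-1)
    return P @ X @ V.T

def jac_spectral_norm(f, X):
    J = torch.autograd.functional.jacobian(f, X)
    nd = X.numel()
    return torch.linalg.matrix_norm(J.reshape(nd, nd), ord=2).item()

torch.manual_seed(0)
d, n = 4, 6
A, V = torch.randn(d, d), torch.randn(d, d)
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)            # configuration in B(0, 1)^n, generically in E_A

for R in [1.0, 5.0, 20.0, 50.0]:
    lip = jac_spectral_norm(lambda Y: self_attention(Y, A, V), R * X)
    P = torch.softmax((R * X) @ A.T @ (R * X).T, dim=-1)
    hardest = P.max(dim=1).values.min().item()  # -> 1 when every row of P is one-hot
    print(f"R={R:5.1f}  ||D_RX f||_2 = {lip:9.3f}  min row max of P = {hardest:.3f}")

print("reference value ||V||_2 * sqrt(n) =",
      (torch.linalg.matrix_norm(V, ord=2) * n ** 0.5).item())
```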
In practice, the large radius regime on general configurations (i.e., configurations that belong to the set $E_A$) provides an oversimplified model for self-attention. Indeed, in this regime, attention matrices are 1-sparse row-wise, i.e. have exactly one nonzero coefficient, equal to 1, on each row $i \in \{1, \dots, n\}$, which corresponds to the index $m_i$. If we look at real data, attention matrices indeed tend to have sparse rows, but with more than one non-zero coefficient on each row (Vaswani et al., 2017; Vyas et al., 2020; Likhosherstov et al., 2021), which is expected, as otherwise the representation given by self-attention would be too rough. Still, Theorem 3.7 gives some nice intuition about the growth rate of $\sqrt{n}$ obtained in Theorem 3.3 and observed in practice (see Figure 1 in the experiments).

Multi-head self-attention. Bounding the Lipschitz constant of single-head self-attention provides the following bound on the Lipschitz constant of multi-head self-attention.

Lemma 3.8 (Kim et al. (2021)). Let $R > 0$. With the notations of Definition 2.2, it holds
$$\mathrm{Lip}(f^{\mathrm{MH}}|_{B_R^n}) \leq \sum_{h=1}^{H} \|W^{(h)}\|_2\, \mathrm{Lip}(f^{(h)}|_{B_R^n}).$$

In the whole paper, we focus on single-head self-attention and avoid tackling the possibly tedious dependencies between the matrices $W^{(1)}V^{(1)}, \dots, W^{(H)}V^{(H)}$. Deriving a finer estimation of the Lipschitz constant of multi-head attention than what Lemma 3.8 gives us is left for future work.

4. Tight Bounds on the Lipschitz Constant of Masked Self-Attention

4.1. Measuring Distances With Conditional Optimal Transport

To study the Lipschitz properties of masked self-attention as defined in Definition 2.5, the Euclidean framework introduced in Section 3.1 still applies. In contrast, the Wasserstein framework used for mean-field unmasked self-attention is not suited to measuring the regularity of mean-field masked self-attention. Indeed, in the standard case, the distance between the outputs $f^m(X)$ and $f^m(Y)$ for two inputs $X, Y \in (\mathbb{R}^d)^n$ is measured by $\big(\sum_{i=1}^{n}|f(x_1, \dots, x_i)_i - f(y_1, \dots, y_i)_i|^2\big)^{1/2}$, so that $i$-th coordinates are compared to each other, separately for each index $i$. On the contrary, when looking at $W_2(F^m(\bar\mu), F^m(\bar\nu))$, the optimal transport plan may transport mass from a point $(s, x)$ to a point $(s', y)$ with $s \neq s'$, which contradicts the sequential nature of masked self-attention. It is therefore natural to introduce the following distance on $\mathcal{P}_c([0, 1] \times \mathbb{R}^d)$, which comes from the theory of conditional optimal transport (Hosseini et al., 2023).

Definition 4.1. Let $\bar\mu$ and $\bar\nu$ be two probability measures in $\mathcal{P}_c([0, 1] \times \mathbb{R}^d)$, and $p \geq 1$. If $\bar\mu$ and $\bar\nu$ have the same marginal with respect to $s$, i.e.
$$\int_{s_1}^{s_2}\int_{\mathbb{R}^d} d\bar\mu(s, x) = \int_{s_1}^{s_2}\int_{\mathbb{R}^d} d\bar\nu(s, x)$$
for all $0 \leq s_1 < s_2 \leq 1$, denote $\theta$ this marginal distribution, and write with the disintegration theorem (Bogachev & Ruas, 2007) $d\bar\mu(\tau, x) =: d\theta(\tau)d\mu_\tau(x)$ and $d\bar\nu(\tau, x) =: d\theta(\tau)d\nu_\tau(x)$. The measures $\mu_\tau$ and $\nu_\tau$ can be seen intuitively as the restriction of $\bar\mu$ and $\bar\nu$ to the mass located at position $\tau$, rescaled to obtain probability measures. We then measure the distance between $\bar\mu$ and $\bar\nu$ with
$$d_p(\bar\mu, \bar\nu) := \Big(\int_0^1 W_p(\mu_\tau, \nu_\tau)^p\, d\theta(\tau)\Big)^{1/p}.$$
If $\bar\mu$ and $\bar\nu$ do not have the same first marginal, we set $d_p(\bar\mu, \bar\nu) := +\infty$.

Considering $d_p(\bar\mu, \bar\nu)$ amounts to minimizing the transport cost between $\bar\mu$ and $\bar\nu$ under the constraint that points must keep the first coordinate constant along their trajectory. Equivalently, allowed transport plans must be the identity on the first marginal. As for unmasked self-attention, we have the following connection between the Euclidean framework and the mean-field framework for masked self-attention.

Lemma 4.2. Let $R > 0$. We have $\mathrm{Lip}_F(f^m|_{B_R^n}) \leq \mathrm{Lip}_{d_2}(F^m|_{\mathcal{P}([0,1] \times B_R)})$.

We do not detail the proof, which follows the same steps as for Lemma 3.2.
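For discrete measures, the distance of Definition 4.1 can be computed slice by slice; the sketch below is our own illustration (not code from the paper), uses an assignment solver for each slice and assumes matching point counts per slice. It also checks that for measures of the form $\mathrm{ord}(X)$ and $\mathrm{ord}(Y)$, where each slice carries a single point, $d_2$ reduces to $\|X - Y\|_F/\sqrt{n}$.

```python
import numpy as np
from itertools import groupby
from scipy.optimize import linear_sum_assignment

def wasserstein_p(xs, ys, p=2):
    """W_p between two uniform empirical measures with the same number of points."""
    cost = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1) ** p
    rows, cols = linear_sum_assignment(cost)          # optimal matching
    return cost[rows, cols].mean() ** (1 / p)

def d_p(mu, nu, p=2):
    """Conditional OT distance of Definition 4.1 for lists of equally weighted (s, x) pairs.

    The two lists must share the same multiset of s values (same first marginal),
    otherwise the distance is +inf; slices are assumed to have matching sizes.
    """
    key = lambda pair: pair[0]
    mu_slices = {s: np.array([x for _, x in g]) for s, g in groupby(sorted(mu, key=key), key)}
    nu_slices = {s: np.array([x for _, x in g]) for s, g in groupby(sorted(nu, key=key), key)}
    if mu_slices.keys() != nu_slices.keys():
        return float("inf")
    n = len(mu)  # theta puts mass |slice|/n on each time s
    total = sum(len(xs) / n * wasserstein_p(xs, nu_slices[s], p) ** p for s, xs in mu_slices.items())
    return total ** (1 / p)

# ord(X) vs ord(Y): one point per slice, so d_2 = ||X - Y||_F / sqrt(n)
n, d = 5, 3
s = np.linspace(0.1, 0.9, n)
X, Y = np.random.randn(n, d), np.random.randn(n, d)
print(d_p(list(zip(s, X)), list(zip(s, Y))), np.linalg.norm(X - Y) / np.sqrt(n))
```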
4.2. Lipschitz Bounds for Masked Self-Attention in Different Regimes

General upper bound. We show in Appendix D.1 that the bound provided by Theorem 3.3 still holds for masked self-attention.

Theorem 4.3. Let $Q, K, V \in \mathbb{R}^{k \times d}$ and $A := K^\top Q/\sqrt{k}$. Let $R > 0$ and $n \in \mathbb{N}$. Masked self-attention $f^m$ with parameters $(A, V)$ is Lipschitz continuous on the set $B_R^n$, and
$$\mathrm{Lip}(f^m|_{B_R^n}) \leq \sqrt{3}\,\|V\|_2\big(\|A\|_2^2 R^4 (n + 1) + n\big)^{1/2}.$$

Mean-field regime. Let us now bound from above the $d_p$-Lipschitz constant of mean-field masked self-attention.

Theorem 4.4. Let $R > 0$ and $p \geq 1$. The mean-field masked self-attention map $F^m$ is Lipschitz continuous on the space of measures supported in $[0, 1] \times B_R$, with a Lipschitz constant upper-bounded by $\|V\|_2(1 + 3\|A\|_2 R^2)e^{2\|A\|_2 R^2}$.

Note that in Theorem 4.4, we consider that two measures $\bar\mu$ and $\bar\nu$ with different first marginals induce an infinite Lipschitz ratio $d_p(F^m(\bar\mu), F^m(\bar\nu))/d_p(\bar\mu, \bar\nu)$. The proof can be found in Appendix D.2.

Large radius regime. We have a similar result as for unmasked attention in the large radius regime. Except for a measure-zero set of pathological configurations, when $R$ is sufficiently large, the local Lipschitz constant of $f^m$ at the input $RX$ does not depend on $R$ anymore and grows at most like $\sqrt{n}$ up to a constant factor.

Theorem 4.5. Let $A \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{k \times d}$ be two nonzero matrices, and denote $f^m$ the masked self-attention map with parameters $(A, V)$. Denote $E_A^m \subset B(0, 1)^n$ the set of sequences $(x_1, \dots, x_n)$ such that for all $i \in \{1, \dots, n\}$, the maximum $\max_{1 \leq j \leq i} x_i^\top A^\top x_j$ is reached at a unique index $j$, denoted $m_i$. The complement of $E_A^m$ in $B(0, 1)^n$ has zero Lebesgue measure. Moreover, for any $X \in E_A^m$, there exists a function $\theta\colon [0, +\infty) \to [0, +\infty)$ such that $\theta(R)$ converges exponentially fast to 1 when $R \to +\infty$ and
$$\|D_{RX} f^m\|_2 \leq \theta(R)\,\|V\|_2 \sqrt{n}.$$

5. Experiments

The bound stated in Theorem 3.3 corresponds to a worst-case scenario. In practice, does it reflect the evolution of the Lipschitz constant of a self-attention layer of a Transformer on real data? We perform numerical experiments on BERT (Devlin et al., 2018), using the pretrained Huggingface model bert-base-uncased first as an Encoder and then in its Decoder version, and on GPT-2 (Radford et al., 2019), both pretrained and randomly initialized. Both models have 12 multi-head attention layers, and 12 attention heads per layer, with an embedding dimension $d = 768$. We perform two different experiments, first with real data, and then with synthetic adversarial data.

5.1. Experiments With Real Data

We take our data from two test datasets, Alice in Wonderland from the NLTK corpus Gutenberg (Bird et al., 2009), and AG NEWS from the PyTorch package torchtext (Zhang et al., 2015). The aim is, for various multi-head self-attention layers $f^{\mathrm{model}}$ of BERT and GPT-2, and for a batch of inputs of varying length taken from the two datasets mentioned above, to get a scatter plot of the local Lipschitz constant of $f^{\mathrm{model}}$ at each input $(x_1, \dots, x_n)$ as a function of the sequence length $n$.

Construction of the datasets. Given some raw text from Alice in Wonderland or AG NEWS, we first tokenize it and then split the resulting sequence of tokens into subsequences with a fixed sequence length. For each even integer $n$ in $\{2, \dots, 100\}$, we build 10 sequences with $n$ tokens, so that none of the constructed sequences $(s_1, \dots, s_n)$ overlap. Then, for each input sequence $(s_1, \dots, s_n)$, we do a forward pass of the model, and get with a forward hook the intermediate activations just before the attention layer of interest $f^{\mathrm{model}}$. This gives us a batch of sequences $(x_1, \dots, x_n) \in (\mathbb{R}^d)^n$ that are fed to $f^{\mathrm{model}}$ when $(s_1, \dots, s_n)$ goes through the model. Note that except for the inputs of the first attention layer, all the $x_i$ are the result of normalization with Layer Norm, and therefore belong to an ellipsoid, which depends on the parameters of Layer Norm.

Computation of the local Lipschitz constants. The local Lipschitz constant of $f^{\mathrm{model}}$ at an input sequence $X = (x_1, \dots, x_n)$ is equal to $\|D_X f^{\mathrm{model}}\|_2$. As $D_X f^{\mathrm{model}}$, which we denote $J_X$ to alleviate notations, is of shape $nd \times nd$ with $d = 768$, we do not compute it explicitly but use a power method on the matrix $J_X^\top J_X$ by alternating Jacobian-vector products and vector-Jacobian products (see Appendix E.1). The power method converges to the largest eigenvalue of $J_X^\top J_X$, which is equal to $\|J_X\|_2^2$.
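Here is a minimal sketch of this matrix-free computation (our own implementation of the procedure described above, not the code of Appendix E.1): the power iteration only queries Jacobian-vector and vector-Jacobian products, so the $nd \times nd$ Jacobian is never materialized.

```python
import torch

def spectral_norm_of_jacobian(f, X, n_iter=50):
    """Power method on J_X^T J_X using only jvp / vjp calls.

    Returns an estimate of ||D_X f||_2 = ||J_X||_2.
    """
    v = torch.randn_like(X)
    v = v / v.norm()
    for _ in range(n_iter):
        _, Jv = torch.autograd.functional.jvp(f, X, v)     # J_X v
        _, JtJv = torch.autograd.functional.vjp(f, X, Jv)   # J_X^T (J_X v)
        v = JtJv / JtJv.norm()
    _, Jv = torch.autograd.functional.jvp(f, X, v)
    return Jv.norm().item()                                  # ~ largest singular value of J_X

# toy self-attention layer to differentiate through
def self_attention(X, A, V):
    P = torch.softmax(X @ A.T @ X.T, dim=-1)
    return P @ X @ V.T

d, n = 8, 20
A, V, X = torch.randn(d, d), torch.randn(d, d), torch.randn(n, d)
f = lambda Y: self_attention(Y, A, V)
approx = spectral_norm_of_jacobian(f, X)
exact = torch.linalg.matrix_norm(
    torch.autograd.functional.jacobian(f, X).reshape(n * d, n * d), ord=2
).item()
print(approx, exact)  # the two estimates agree for enough iterations
```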
5.2. Experiments With Adversarial Data

Adversarial data. To check numerically the tightness in $n$ of the bound provided by Theorem 3.3, we build adversarial data in the input space of each self-attention layer $f^{\mathrm{model}}$. More precisely, for each sequence length $n$, and for a radius $R > 0$ to be discussed later, we look for an input $X \in B_R^n$ where $f^{\mathrm{model}}$ has a large (ideally maximal) local Lipschitz constant. Unfortunately, performing a gradient ascent on the local Lipschitz constant of $f^{\mathrm{model}}$ gives poor results, as the optimization landscape is highly nonconvex. We therefore build $X$ as follows, using the heuristics provided by Proposition 3.6. For $h \in \{1, \dots, 12\}$, denote $A^{(h)} := K^{(h)\top} Q^{(h)}/\sqrt{k}$, with the notations of Definition 2.2 applied to the multi-head self-attention layer $f^{\mathrm{model}}$. Choose any head $h$, and denote $\gamma_1$ (resp. $\gamma_\delta$) the largest (resp. the smallest) real eigenvalue of $A^{(h)}$, and $u_1$ (resp. $u_\delta$) an associated unit eigenvector. If $\gamma_1 \geq -8\gamma_\delta$, define $X = (x_1, \dots, x_n)$ with $x_1 := Ru_1$ and $x_2 = \dots = x_n := (R/2)u_1$. If $\gamma_1 < -8\gamma_\delta$, define $X = (x_1, \dots, x_n)$ with $x_1 := Ru_\delta$ and $x_2 = \dots = x_n := -Ru_\delta$. This way of defining adversarial inputs does not exactly maximize the local Lipschitz constant for each choice of $n$ and $R$, but leads, for $R$ large enough, to a growth rate of $\sqrt{n}$ for the local Lipschitz constant of $f^{\mathrm{model}}$ (see Figure 1), which is exactly what we need to recover tightness.
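A sketch of this construction follows; the function name is ours, we treat a single head whose matrix $A$ is given, and we assume $A$ has at least one real eigenvalue (a sketch of the heuristic described above, not the authors' code).

```python
import numpy as np

def build_adversarial_input(A, n, R):
    """Two-cluster input of length n in B_R, following the heuristics of Prop. 3.6.

    One token is isolated and the remaining n - 1 tokens form a single cluster,
    chosen according to the largest / smallest real eigenvalue of A.
    """
    eigvals, eigvecs = np.linalg.eig(A)
    idx = np.where(np.abs(eigvals.imag) < 1e-10)[0]   # assumes at least one real eigenvalue
    order = np.argsort(eigvals[idx].real)
    i_min, i_max = idx[order[0]], idx[order[-1]]
    gamma_1, gamma_delta = eigvals[i_max].real, eigvals[i_min].real
    u1 = eigvecs[:, i_max].real
    udelta = eigvecs[:, i_min].real
    u1, udelta = u1 / np.linalg.norm(u1), udelta / np.linalg.norm(udelta)
    if gamma_1 >= -8 * gamma_delta:
        x1, cluster = R * u1, (R / 2) * u1
    else:
        x1, cluster = R * udelta, -R * udelta
    return np.vstack([x1] + [cluster] * (n - 1))      # shape (n, d), all rows in B_R

d, n, R = 16, 64, 15.5
A = np.random.randn(d, d) / np.sqrt(d)
X = build_adversarial_input(A, n, R)
print(X.shape, np.linalg.norm(X, axis=1).max())       # max token norm is at most R
```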
Influence of the scaling factor. In Figure 1, the scaling factor $R$ is equal to 15.5 and 21.5 for the first two columns, which corresponds to an approximation of the mean radius of real data used with the same models in the first row. In other words, our adversarial data for the first two models have tokens with a magnitude similar to tokens obtained with real data. In contrast, for the third column, corresponding to GPT-2 randomly initialized, we take $R = 100$, which is much larger than the magnitude of tokens generated with real data (which is 27.7). Indeed, we observed that smaller scaling factors induce a growth rate that is slower than $n^{1/2}$. Studying this aspect more in-depth is an interesting perspective for future work.

5.3. Discussion

Figure 1 gives the following insights.

Figure 1. Scatter plots of the local Lipschitz constant of self-attention (column 1) and masked self-attention (columns 2 and 3) on text data (upper row) and adversarial data (lower row), as a function of the sequence length $n$. In the upper row, the color encodes the mean radius of inputs $X = (x_1, \dots, x_n)$, defined as $R := \big(\frac{1}{n}\sum_{i=1}^{n}|x_i|^2\big)^{1/2}$. Lighter points have a smaller mean radius. The first two columns correspond to two different pretrained BERT models, an Encoder-only and a Decoder-only, on the same dataset Alice in Wonderland, respectively for attention layers 0 and 6. The third column is obtained with the masked self-attention layer 6 of GPT-2 randomly initialized, on the dataset AG NEWS. We see that the Lipschitz constant of self-attention on real data grows approximately like $n^{1/4}$ with the sequence length $n$, and that the growth rate is $\sqrt{n}$ for adversarial data, which shows the tightness of Theorems 3.3, 3.7, 4.3 and 4.5.

The Lipschitz constant of self-attention on real data grows significantly with the sequence length, in all considered cases, independently of the dataset, the depth of the attention layer, and of whether self-attention is masked or not. The observed growth rate is close to $n^{1/4}$, which is smaller than the worst-case rate $\sqrt{n}$.

The Lipschitz constant of self-attention on our adversarial data grows like $\sqrt{n}$, which is the worst-case rate according to Theorem 3.3. This evidences the tightness of the bound with respect to the sequence length.

Let us make a few remarks. First, the architecture of BERT adds biases to the traditional formula for self-attention. This does not affect the theoretical analysis too much (see Appendix E.3). Second, the same experiments as in Figure 1 performed with GPT-2 pretrained (Radford et al., 2019) lead to a different behavior of the Lipschitz constant. In particular, the growth rate of the Lipschitz constant can be faster than $\sqrt{n}$, which seems to come from a correlation between the sequence length of inputs and the magnitude of their tokens after going through Layer Norm (see Appendix E.2). Finally, our results point out the difficulty of designing Lipschitz-constrained self-attention layers independently of the sequence length. Indeed, dividing a self-attention layer by the mean-field bound of Theorem 3.5 to enforce its 1-Lipschitz continuity would induce a dramatic loss of expressive power on smaller sequence lengths. However, when the sequence length is fixed, for example with Vision Transformers (Dosovitskiy et al., 2020), dividing the output of the self-attention layer by the bound in Theorem 3.3 is a promising option.

In this thorough study of the Lipschitz constant of self-attention, we have identified sharp bounds in different regimes, the most relevant from a practical viewpoint being the general bounds stated in Theorems 3.3 and 4.3. Our theoretical and numerical analyses show that the Lipschitz constant of self-attention grows with the sequence length, the worst-case rate being $\sqrt{n}$, and the rate on real data being at least $n^{1/4}$, and possibly larger for learned positional encoding. This insight is new and represents an obstruction to designing robust Transformers without modifying the architecture or fixing the sequence length of inputs, which opens interesting avenues for future work. We have also introduced a novel mean-field framework for masked self-attention, which overcomes the lack of permutation equivariance and paves the way for a study of neural PDEs on Decoders, as Sander et al. (2022) and Geshkovski et al. (2023a;b) do for Encoders.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Anil, C., Lucas, J., and Grosse, R. Sorting out Lipschitz function approximation. In International Conference on Machine Learning, pp. 291-301. PMLR, 2019.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017.

Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In International Conference on Machine Learning, pp. 573-582. PMLR, 2019.

Bird, S., Klein, E., and Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.

Bogachev, V. I. and Ruas, M. A. S. Measure Theory, volume 1. Springer, 2007.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39-57. IEEE, 2017.

Catellier, R., Vaiter, S., and Garreau, D. On the robustness of text vectorizers. arXiv preprint arXiv:2303.07203, 2023.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32, 2019.

Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning, pp. 854-863. PMLR, 2017.

Dasoulas, G., Scaman, K., and Virmaux, A. Lipschitz normalization for self-attention layers with application to graph neural networks. In International Conference on Machine Learning, pp. 2456-2466. PMLR, 2021.

De Bie, G., Peyré, G., and Cuturi, M. Stochastic deep networks. In International Conference on Machine Learning, pp. 1556-1565. PMLR, 2019.

Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., Caron, M., Geirhos, R., Alabdulmohsin, I., et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pp. 7480-7512. PMLR, 2023.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. Efficient and accurate estimation of Lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Federer, H. Geometric Measure Theory. Classics in Mathematics. Springer Berlin Heidelberg, 2014. ISBN 9783642620102. URL https://books.google.fr/books?id=jld-BgAAQBAJ.

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. The emergence of clusters in self-attention dynamics. arXiv preprint arXiv:2305.05465, 2023a.

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. A mathematical perspective on transformers. 2023b.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Gupta, K. and Verma, S. CertViT: Certified robustness of pre-trained vision transformers. arXiv preprint arXiv:2302.10287, 2023.

Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in Neural Information Processing Systems, 30, 2017.

Hosseini, B., Hsu, A. W., and Taghvaei, A. Conditional optimal transport on function spaces. arXiv preprint arXiv:2311.05672, 2023.

Jia, X., Chen, Y., Mao, X., Duan, R., Gu, J., Zhang, R., Xue, H., and Cao, X. Revisiting and exploring efficient fast adversarial training via LAW: Lipschitz regularization and auto weight averaging. arXiv preprint arXiv:2308.11443, 2023.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Kim, H., Papamakarios, G., and Mnih, A. The Lipschitz constant of self-attention, 2021.

Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

Latorre, F., Rolland, P., and Cevher, V. Lipschitz constant estimation of neural networks via sparse polynomial optimization. arXiv preprint arXiv:2004.08688, 2020.

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744-3753. PMLR, 2019.

Leino, K., Wang, Z., and Fredrikson, M. Globally-robust neural networks. In International Conference on Machine Learning, pp. 6212-6222. PMLR, 2021.

Likhosherstov, V., Choromanski, K., and Weller, A. On the expressive power of self-attention matrices. arXiv preprint arXiv:2106.03764, 2021.

Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.

Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view. arXiv preprint arXiv:1906.02762, 2019.

Miyato, T., Maeda, S.-i., Koyama, M., Nakae, K., and Ishii, S. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574-2582, 2016.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. Advances in Neural Information Processing Systems, 30, 2017.

OpenAI. GPT-4 technical report, 2023.

Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582-597. IEEE, 2016.

Pevny, T. and Kovarik, V. Approximation capability of neural networks on spaces of probability measures and tree-structured domains. arXiv preprint arXiv:1906.00764, 2019.

Peyré, G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.
Qi, X., Wang, J., Chen, Y., Shi, Y., and Zhang, L. LipsFormer: Introducing Lipschitz continuity to vision transformers. arXiv preprint arXiv:2304.09856, 2023.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Rosca, M., Weber, T., Gretton, A., and Mohamed, S. A case for new neural network smoothness constraints. PMLR, 2020.

Sander, M. E., Ablin, P., Blondel, M., and Peyré, G. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515-3530. PMLR, 2022.

Santambrogio, F. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.

Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265-4280, 2017.

Sra, S., Nowozin, S., and Wright, S. J. Optimization for Machine Learning. MIT Press, 2012.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Tsuzuku, Y., Sato, I., and Sugiyama, M. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Virmaux, A. and Scaman, K. Lipschitz regularity of deep neural networks: analysis and efficient estimation. Advances in Neural Information Processing Systems, 31, 2018.

von Luxburg, U. and Bousquet, O. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669-695, 2004.

Vuckovic, J., Baratin, A., and des Combes, R. T. A mathematical theory of attention. arXiv, abs/2007.02876, 2020.

Vuckovic, J., Baratin, A., and Combes, R. T. d. On the regularity of attention. arXiv preprint arXiv:2102.05628, 2021.

Vyas, A., Katharopoulos, A., and Fleuret, F. Fast transformers with clustered attention. Advances in Neural Information Processing Systems, 33:21665-21674, 2020.

Weng, T.-W., Zhang, H., Chen, P.-Y., Yi, J., Su, D., Gao, Y., Hsieh, C.-J., and Daniel, L. Evaluating the robustness of neural networks: An extreme value theory approach. arXiv preprint arXiv:1801.10578, 2018.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524-10533. PMLR, 2020.

Ye, W., Ma, Y., Cao, X., and Tang, K. Mitigating transformer overconfidence via Lipschitz regularization. arXiv preprint arXiv:2306.06849, 2023.

Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104-12113, 2022.
Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

Zhang, X., Zhao, J. J., and LeCun, Y. Character-level convolutional networks for text classification. In NIPS, 2015.

Zhao, H., Jiang, L., Jia, J., Torr, P., and Koltun, V. Point transformer. arXiv preprint arXiv:2012.09164, 2020.

Zweig, A. and Bruna, J. A functional perspective on learning symmetric functions with neural networks. In International Conference on Machine Learning, pp. 13023-13032. PMLR, 2021.

A. Optimal Transport Toolbox

This section gathers some useful definitions and lemmas from optimal transport. In what follows, $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$.

A.1. Pushforward, Wasserstein Distance

Let us start with the notion of pushforward.

Definition A.1 (Pushforward). Set $\mu$ a probability measure on $\mathcal{X}$ and $\varphi\colon \mathcal{X} \to \mathcal{X}$ a measurable function. The pushforward of $\mu$ by $\varphi$, denoted $\varphi\sharp\mu$, is the probability measure given by $(\varphi\sharp\mu)(B) = \mu(\varphi^{-1}(B))$ for any Borel set $B \subset \mathcal{X}$, where $\varphi^{-1}(B) := \{x \in \mathbb{R}^d : \varphi(x) \in B\}$.

The pushforward measure $\varphi\sharp\mu$ can be seen as the result of a transportation of the mass of $\mu$ by $\varphi$. Intuitively, $\varphi\sharp\mu$ is obtained by transporting each element of mass $\mu(dx)$ from $x$ to $\varphi(x)$. Another crucial tool is the notion of Wasserstein distance.

Definition A.2 (Wasserstein space, Wasserstein distance). Let $p \geq 1$. Denote
$$\mathcal{P}_p(\mathcal{X}) := \Big\{\mu \in \mathcal{P}(\mathcal{X}) : \int_{\mathcal{X}} |x|^p\, d\mu(x) < \infty\Big\}$$
the $p$-Wasserstein space. Then, the $p$-Wasserstein distance between two probability measures $\mu, \nu \in \mathcal{P}_p(\mathcal{X})$ is defined as
$$W_p(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \Big(\int |x - y|^p\, d\pi(x, y)\Big)^{1/p}$$
with $\Pi(\mu, \nu)$ the set of all couplings between $\mu$ and $\nu$, i.e. of all probability measures $\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X})$ such that $\int \pi(\cdot, y)dy = \mu$ and $\int \pi(x, \cdot)dx = \nu$.

Wasserstein distances have the following nice property, which is a direct consequence of Jensen's inequality.

Lemma A.3. For every $p \geq 1$, it holds $W_1 \leq W_p$.

The distance $W_1$ also has a simple dual formulation.

Lemma A.4 ($W_1$ duality formula). The distance $W_1$ can be rewritten with the so-called duality formula: for all $\mu, \nu \in \mathcal{P}_1(\mathcal{X})$, it holds
$$W_1(\mu, \nu) = \sup_{\mathrm{Lip}(\varphi) \leq 1} \int \varphi\, d(\mu - \nu), \tag{2}$$
where the supremum is taken over all functions $\varphi\colon \mathcal{X} \to \mathbb{R}$ with a Lipschitz constant bounded by one.

The following result is useful to bound the Wasserstein distance between two probability measures that are pushed forward by the same map.

Lemma A.5. Let $p \geq 1$. Consider a measurable function $\varphi\colon \mathcal{X} \to \mathcal{X}$, and probability measures $\mu, \nu \in \mathcal{P}_p(\mathcal{X})$ such that $\varphi\sharp\mu \in \mathcal{P}_p(\mathcal{X})$ and $\varphi\sharp\nu \in \mathcal{P}_p(\mathcal{X})$. Then, it holds
$$W_p(\varphi\sharp\mu, \varphi\sharp\nu) \leq \mathrm{Lip}(\varphi)\, W_p(\mu, \nu).$$

Proof. We have
$$W_p(\varphi\sharp\mu, \varphi\sharp\nu)^p = \inf_{\pi' \in \Pi(\varphi\sharp\mu, \varphi\sharp\nu)} \int |x - y|^p\, d\pi'(x, y) \leq \inf_{\pi \in \Pi(\mu, \nu)} \int |\varphi(x) - \varphi(y)|^p\, d\pi(x, y) \leq \mathrm{Lip}(\varphi)^p \inf_{\pi \in \Pi(\mu, \nu)} \int |x - y|^p\, d\pi(x, y) = \mathrm{Lip}(\varphi)^p\, W_p(\mu, \nu)^p,$$
where the first inequality derives from the fact that every $\pi \in \Pi(\mu, \nu)$ induces a coupling $\pi' \in \Pi(\varphi\sharp\mu, \varphi\sharp\nu)$ by setting $\pi'(B_1 \times B_2) := \pi(\varphi^{-1}(B_1) \times \varphi^{-1}(B_2))$ for all Borel sets $B_1, B_2 \subset \mathcal{X}$, and that with this choice of $\pi'$ it holds
$$\int |x - y|^p\, d\pi'(x, y) = \int |\varphi(x) - \varphi(y)|^p\, d\pi(x, y).$$

Let us now bound the Wasserstein distance between two different pushforwards of the same probability measure.

Lemma A.6. Let $p \geq 1$. Consider two measurable functions $\varphi, \psi\colon \mathcal{X} \to \mathcal{X}$, and a probability measure $\mu \in \mathcal{P}_p(\mathcal{X})$ such that $\varphi\sharp\mu \in \mathcal{P}_p(\mathcal{X})$ and $\psi\sharp\mu \in \mathcal{P}_p(\mathcal{X})$. Then, it holds
$$W_p(\varphi\sharp\mu, \psi\sharp\mu) \leq \|\varphi - \psi\|_{L^p(\mu)}.$$

Proof. Recall that
$$W_p(\varphi\sharp\mu, \psi\sharp\mu)^p = \inf_{\pi' \in \Pi(\varphi\sharp\mu, \psi\sharp\mu)} \int |x - y|^p\, d\pi'(x, y).$$
Now consider the following coupling between $\varphi\sharp\mu$ and $\psi\sharp\mu$, defined by the relation $\pi'(B \times C) := \mu(\varphi^{-1}(B) \cap \psi^{-1}(C))$ for every Borel sets $B, C \subset \mathcal{X}$. In other words, we set $d\pi'(y, z) := \int_{\varphi^{-1}(y) \cap \psi^{-1}(z)} d\mu$, and $d\pi'(y, z) = 0$ if $\varphi^{-1}(y) \cap \psi^{-1}(z) = \emptyset$.
With this definition of $\pi'$, we have
$$W_p(\varphi\sharp\mu, \psi\sharp\mu)^p \leq \int |x - y|^p\, d\pi'(x, y) = \int |\varphi(x) - \psi(x)|^p\, d\mu(x).$$

A.2. Geodesics

The notion of a geodesic is useful for the following section of the Appendix.

Definition A.7 (Geodesic). Let $(E, d_E)$ be a metric space. A curve $\gamma\colon [0, 1] \to E$ is called a geodesic if there exists a constant $v \geq 0$ such that for all $t_1, t_2 \in [0, 1]$ we have $d_E(\gamma(t_1), \gamma(t_2)) = v|t_2 - t_1|$. We say that the space $E$ is geodesic if for any $x, y \in E$, there exists a geodesic between $x$ and $y$.

One important example of a geodesic space is the 2-Wasserstein space $\mathcal{P}_2(\mathbb{R}^d)$.

Lemma A.8 (Santambrogio (2015)). The space $\mathcal{P}_2(\mathbb{R}^d)$ is a geodesic space.

B. Local Lipschitz Constant in the Mean-Field Setting

B.1. Mean-Field Self-Attention Generalizes Self-Attention

For any input $X \in (\mathbb{R}^d)^n$, we have $F(m(X)) = m(f(X))$. Indeed, we can rewrite $f$ as
$$f(X) = (\Gamma_X(x_1), \dots, \Gamma_X(x_n)), \tag{3}$$
$$\text{with } \Gamma_X(x) := \frac{\sum_{i=1}^{n} \exp\big(\tfrac{1}{\sqrt{k}} x^\top Q^\top K x_i\big)\, V x_i}{\sum_{i=1}^{n} \exp\big(\tfrac{1}{\sqrt{k}} x^\top Q^\top K x_i\big)}.$$
Seeing the ratio of sums as a ratio of integrals against the empirical measure $m(X)$, and Equation (3) as a pushforward, leads precisely to the formula of mean-field self-attention.

B.2. Local Lipschitz Constant in a General Metric Framework

The local Lipschitz constant can be defined for any function between two metric spaces, as follows.

Definition B.1 (Local Lipschitz constant). Let $\varphi\colon E \to F$ be a map between two metric spaces. We define the local Lipschitz constant of $\varphi$ at point $x \in E$ as
$$\mathrm{Lip}_x(\varphi) := \lim_{\varepsilon \to 0^+} \mathrm{Lip}\, \varphi|_{B(x, \varepsilon)}.$$
The limit exists, as $\mathrm{Lip}\, \varphi|_{B(x, \varepsilon)}$ decreases with $\varepsilon$.

Definition B.1 is interesting, as it captures more information than the global Lipschitz constant. More precisely, we have the following connection between the two notions.

Lemma B.2. Let $\varphi\colon E \to F$ be a map between two metric spaces. We have
$$\mathrm{Lip}(\varphi) \geq \sup_{x \in E} \mathrm{Lip}_x(\varphi).$$
Assume moreover that the space $E$ admits geodesics, which is the case for $\mathbb{R}^d$ and $\mathcal{P}_2(\mathbb{R}^d)$ equipped with $W_2$ (see Appendix A). Then, this inequality becomes an equality.

B.3. Proof of Lemma 3.2

Let us prove the following slightly stronger result.

Lemma B.3. Let $p \geq 1$. For any matrix $X \in \mathbb{R}^{n \times d}$, we have
$$\mathrm{Lip}^{F,p}_X(f) = \mathrm{Lip}^{W_p}_{m(X)}(F|_{\mathcal{M}_n(\mathbb{R}^d)}) \leq \mathrm{Lip}^{W_p}_{m(X)}(F).$$

Proof. Set $X \in \mathbb{R}^{n \times d}$. One can choose $\varepsilon_1 > 0$ small enough (for example $\varepsilon_1 < \min_{x_i \neq x_j} |x_i - x_j|/2$) to have
$$\|X - Y\|_{F,p} \leq \varepsilon_1 \implies \|X - Y\|_{F,p} = W_p(m(X), m(Y)).$$
Indeed, if $\|X - Y\|_{F,p} \leq \varepsilon_1 < \min_{x_i \neq x_j} |x_i - x_j|/2$, then for all $i \in \{1, \dots, n\}$, we have $|x_i - y_i| \leq \varepsilon_1$, and thus $x_i$ is the nearest neighbor (or one of the nearest neighbors) of $y_i$ among the $x_j$. Similarly, one can choose $\varepsilon_2 > 0$ small enough to have
$$\|f(X) - f(Y)\|_{F,p} \leq \varepsilon_2 \implies \|f(X) - f(Y)\|_{F,p} = W_p(m(f(X)), m(f(Y))).$$
Then, we can set $\varepsilon \leq \varepsilon_1$ small enough to have
$$\|X - Y\|_{F,p} \leq \varepsilon \implies \|f(X) - f(Y)\|_{F,p} \leq \varepsilon_2,$$
and it holds, for all $\eta \leq \varepsilon$ and all $Y$ such that $\|X - Y\|_{F,p} \leq \eta$, that $\|X - Y\|_{F,p} = W_p(m(X), m(Y))$ and $\|f(X) - f(Y)\|_{F,p} = W_p(m(f(X)), m(f(Y)))$. Now for all $\eta \leq \varepsilon$, we have
$$\mathrm{Lip}^{F,p} f|_{B^{F,p}(X, \eta)} = \sup_{Y \in B^{F,p}(X, \eta)} \frac{\|f(X) - f(Y)\|_{F,p}}{\|X - Y\|_{F,p}} = \sup_{Y \in B^{F,p}(X, \eta)} \frac{W_p(m(f(X)), m(f(Y)))}{W_p(m(X), m(Y))} = \sup_{Y \in B^{F,p}(X, \eta)} \frac{W_p(F(m(X)), F(m(Y)))}{W_p(m(X), m(Y))},$$
by definition of $\varepsilon$ and $F$. We conclude the proof by noticing that: first, $Y \in B^{F,p}(X, \eta)$ implies $m(Y) \in B^{W_p}(m(X), \eta)$, which shows that $\mathrm{Lip}^{F,p} f|_{B^{F,p}(X, \eta)} \leq \mathrm{Lip}^{W_p} F|_{\mathcal{M}_n(\mathbb{R}^d)}$; second, $\mu \in B^{W_p}(m(X), \eta)$ with $\mu \in \mathcal{M}_n(\mathbb{R}^d)$ implies the existence of $Y \in \mathbb{R}^{n \times d}$ such that $\mu = m(Y)$ and $\|X - Y\|_{F,p} = W_p(m(X), m(Y))$, so that $Y \in B^{F,p}(X, \eta)$. Indeed, take $Y$ such that $\mu = m(Y)$ and then permute its coordinates so that $x_i$ becomes the nearest neighbor (or one of the nearest neighbors) of $y_i$. This shows the reverse inequality and concludes the proof.
C. Proofs of Section 3

We have the following useful lemma.

Lemma C.1. Let $\mu$ be a probability measure supported in $B(x_0, R) \subset \mathbb{R}^d$, with any $x_0 \in \mathbb{R}^d$ and $R > 0$. Then, denoting $\mathrm{Var}\,\mu := \mathbb{E}[(Z - \mathbb{E}Z)(Z - \mathbb{E}Z)^\top]$ with $Z$ a random variable distributed according to $\mu$, we have
$$\|\mathrm{Var}\,\mu\|_2 \le R^2,$$
with equality when $\mu = \tfrac12(\delta_{x_0+x} + \delta_{x_0-x})$ for any $x \in \mathbb{R}^d$ such that $|x| = R$.

Proof. Let us assume without loss of generality that $x_0 = 0$. It is straightforward to check that if $\mu = \tfrac12(\delta_x + \delta_{-x})$ with $|x| = R$, then $\|\mathrm{Var}\,\mu\|_2 = R^2$. To show that this is the maximal value the variance can take, we use the triangle inequality:
$$\big\|\mathbb{E}[(Z - \mathbb{E}Z)(Z - \mathbb{E}Z)^\top]\big\|_2 \le \mathbb{E}\,\big\|(Z - \mathbb{E}Z)(Z - \mathbb{E}Z)^\top\big\|_2 = \mathbb{E}\big[|Z - \mathbb{E}Z|^2\big] = \mathbb{E}[|Z|^2] - |\mathbb{E}Z|^2,$$
using that $\|vv^\top\|_2 = |v|^2$ for any vector $v$. Now let us pick any $x \in \mathbb{R}^d$ with $|x| = R$. We have $\mathbb{E}[(Z - x)^\top(Z + x)] \le 0$, as the angle between the vectors $Z - x$ and $Z + x$ is at least $\pi/2$ for $Z$ in $B(0, R)$. Expanding this relation gives $\mathbb{E}[|Z|^2] - |x|^2 \le 0$, which yields the result.

C.1. The Jacobian of Self-Attention

Lemma C.2. Let $X = (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$. For all perturbations $\epsilon := (\epsilon_1, \ldots, \epsilon_n) \in (\mathbb{R}^d)^n$ and all $i \in \{1, \ldots, n\}$, we have
$$(D_Xf)(\epsilon)_i = V\sum_{j=1}^n P_{ij}\Big(x_j - \sum_{k=1}^n P_{ik}x_k\Big)\, x_i^\top A^\top \epsilon_j + V\sum_{j=1}^n P_{ij}\,\epsilon_j + V\sum_{j=1}^n P_{ij}\Big(x_j - \sum_{k=1}^n P_{ik}x_k\Big)\, x_j^\top A\, \epsilon_i,$$
with $P_{ij} := e^{x_i^\top A^\top x_j}/\sum_{k=1}^n e^{x_i^\top A^\top x_k}$. (A numerical sanity check of this formula is sketched at the end of Subsection C.2.)

C.2. Proof of Theorem 3.3

Let $X \in B_R^n$ and $\epsilon := (\epsilon_1, \ldots, \epsilon_n) \in (\mathbb{R}^d)^n$ such that $\|\epsilon\|_F = 1$. According to Lemma C.2, for all $i \in \{1, \ldots, n\}$ we have
$$(D_Xf)(\epsilon)_i = V\sum_{j=1}^n P_{ij}\Big(x_j - \sum_{k=1}^n P_{ik}x_k\Big)\, x_i^\top A^\top \epsilon_j + V\sum_{j=1}^n P_{ij}\,\epsilon_j + V\sum_{j=1}^n P_{ij}\Big(x_j - \sum_{k=1}^n P_{ik}x_k\Big)\, x_j^\top A\, \epsilon_i,$$
with $P_{ij} := e^{x_i^\top A^\top x_j}/\sum_{k=1}^n e^{x_i^\top A^\top x_k}$. The triangle inequality gives
$$|(D_Xf)(\epsilon)_i| \le \|V\|_2\Big(\|A\|_2\, R \sum_{j=1}^n P_{ij}\Big|x_j - \sum_{k=1}^n P_{ik}x_k\Big|\,|\epsilon_j| + \sum_{j=1}^n P_{ij}|\epsilon_j| + \|A\|_2\,\|\mathrm{Var}(i)\|_2\, |\epsilon_i|\Big),$$
where $\mathrm{Var}(i) := \sum_{j=1}^n P_{ij}\big(x_j - \sum_{k=1}^n P_{ik}x_k\big)x_j^\top$ is the variance of the probability measure $\sum_{j=1}^n P_{ij}\,\delta_{x_j}$. Lemma C.1 gives us that $\|\mathrm{Var}(i)\|_2 \le R^2$. We can also apply the Cauchy-Schwarz inequality to get
$$\sum_{j=1}^n P_{ij}\Big|x_j - \sum_{k=1}^n P_{ik}x_k\Big|\,|\epsilon_j| \le \Big(\sum_{j=1}^n P_{ij}\Big|x_j - \sum_{k=1}^n P_{ik}x_k\Big|^2\Big)^{1/2}\Big(\sum_{j=1}^n P_{ij}|\epsilon_j|^2\Big)^{1/2}.$$
The proof of Lemma C.1 allows us to bound $\sum_{j=1}^n P_{ij}\big|x_j - \sum_{k=1}^n P_{ik}x_k\big|^2 \le R^2$. Collecting terms, we get
$$|(D_Xf)(\epsilon)_i| \le \|V\|_2\Big(\|A\|_2\, R^2 \Big(\sum_{j=1}^n P_{ij}|\epsilon_j|^2\Big)^{1/2} + \sum_{j=1}^n P_{ij}|\epsilon_j| + \|A\|_2\, R^2\, |\epsilon_i|\Big).$$
Then, using the inequality $(a + b + c)^2 \le 3(a^2 + b^2 + c^2)$ for any $a, b, c \in \mathbb{R}$, we obtain
$$\sum_{i=1}^n |(D_Xf)(\epsilon)_i|^2 \le 3\,\|V\|_2^2\,\big(\|A\|_2^2\, R^4(n + 1) + n\big),$$
where we have used that, by the triangle and Cauchy-Schwarz inequalities,
$$\sum_{j=1}^n P_{ij}|\epsilon_j| \le \Big(\sum_{j=1}^n P_{ij}^2\Big)^{1/2}\|\epsilon\|_F \le 1$$
as $\|\epsilon\|_F^2 = 1$ and $\sum_{j=1}^n P_{ij}^2 \le \sum_{j=1}^n P_{ij} = 1$, and that
$$\sum_{j=1}^n P_{ij}|\epsilon_j|^2 \le \sum_{j=1}^n |\epsilon_j|^2 = 1.$$

The proof above allows us to recover the following tighter bound, but with perhaps less natural assumptions on the tokens.

Theorem C.3. Let $Q, K, V \in \mathbb{R}^{k \times d}$ and $A := K^\top Q/\sqrt{k}$. Denote by $f$ unmasked self-attention with parameters $(A, V)$. Let $X \in (\mathbb{R}^d)^n$ and $\rho, r > 0$ such that (i) $\max_{1 \le i,j \le n} |x_i - x_j| \le r$, and (ii) $\max_{1 \le i \le n} |A x_i| \le \rho$. Then, the local Lipschitz constant of $f$ at $X$ is bounded by
$$\sqrt{3}\,\|V\|_2\,\big((\rho^2 r^2 + 1)\,n + \rho^2 r^2\big)^{1/2}.$$

Moreover, in practical Transformer architectures, inputs $X$ of self-attention can be written as $\mathrm{norm}(\tilde X)$ for some $\tilde X \in (\mathbb{R}^d)^n$, where norm stands for LayerNorm or RMSNorm (see Subsection 2.3). In both cases, all tokens lie on an ellipsoid whose shape depends on the parameters of LayerNorm or RMSNorm, so that there is a choice of parameters $(r, \rho)$ such that for all $\tilde X \in (\mathbb{R}^d)^n$, the input $X = \mathrm{norm}(\tilde X)$ satisfies assumptions (i) and (ii) of Theorem C.3.
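The closed-form differential of Lemma C.2 can be checked against finite differences. The following is a minimal sketch (ours, not the authors' code), assuming numpy; it implements a single attention head with the parameterization $A = K^\top Q/\sqrt{k}$ used throughout, and compares the directional derivative of Lemma C.2 with a central finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
A = rng.normal(size=(d, d)) / np.sqrt(d)   # plays the role of K^T Q / sqrt(k)
V = rng.normal(size=(d, d))

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    return P / P.sum(axis=-1, keepdims=True)

def attention(X):
    # f(X)_i = V sum_j P_ij x_j with P_ij = softmax_j(x_i^T A^T x_j)
    P = softmax_rows(X @ A.T @ X.T)
    return P @ X @ V.T

def jvp(X, eps):
    """Directional derivative (D_X f)(eps), written term by term as in Lemma C.2."""
    P = softmax_rows(X @ A.T @ X.T)
    xbar = P @ X                            # xbar[i] = sum_k P_ik x_k
    out = np.zeros_like(X)
    for i in range(len(X)):
        c = X - xbar[i]                     # c[j] = x_j - sum_k P_ik x_k
        t1 = sum(P[i, j] * c[j] * (X[i] @ A.T @ eps[j]) for j in range(len(X)))
        t2 = P[i] @ eps
        t3 = sum(P[i, j] * c[j] * (X[j] @ A @ eps[i]) for j in range(len(X)))
        out[i] = V @ (t1 + t2 + t3)
    return out

eps = rng.normal(size=(n, d))
h = 1e-6
fd = (attention(X + h * eps) - attention(X - h * eps)) / (2 * h)
print(np.max(np.abs(jvp(X, eps) - fd)))     # should be tiny (finite-difference error, ~1e-8 or below)
```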
C.3. Weighted Self-Attention

In the whole subsection, $\Sigma_n := \{a \in [0,1]^n : \sum_{i=1}^n a_i = 1\}$ is the simplex. In view of proving Propositions 3.4 and 3.6, let us introduce a framework for self-attention that extends the Euclidean framework, where the local Lipschitz constant is given by the operator norm of the differential, to probability measures with a finite number of Diracs. We call this framework weighted self-attention.

Definition C.4 (Weighted self-attention). For any vector $a \in \Sigma_n$, denote $\mathcal{P}_a(\mathbb{R}^d) := \{\sum_{i=1}^n a_i\delta_{x_i} : x_1, \ldots, x_n \in \mathbb{R}^d\}$. We define the Euclidean version of the restriction of self-attention with parameters $(A, V)$ to $\mathcal{P}_a(\mathbb{R}^d)$ in the following way:
$$f_a\colon (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n \mapsto \Big(V\sum_{j=1}^n P^a_{ij}\, x_j\Big)_{1 \le i \le n} \quad\text{with}\quad P^a_i := \mathrm{softmax}_a\big((x_i^\top A^\top x_j)_{1 \le j \le n}\big),$$
where $\mathrm{softmax}_a(w_1, \ldots, w_n)_i := a_i e^{w_i}/\sum_{1 \le j \le n} a_j e^{w_j}$. The function $f_a$ is called weighted self-attention associated to the coefficients $a$.

Definition C.4 is designed so that for any sequence $X \in (\mathbb{R}^d)^n$, it holds $F(m_a(X)) = m_a(f_a(X))$, with $m_a(X) := \sum_{i=1}^n a_i\delta_{x_i}$. Weighted self-attention provides a representation that is very convenient to study, both theoretically and numerically, the local Lipschitz constant of self-attention at measures that have a finite number of Diracs, but with weights such that they would require a massive sequence length to be approximated well by an empirical measure. This is for example the case of measures of the form $e^{-2R^2}\delta_{R} + (1 - e^{-2R^2})\delta_{-R}$, which are at stake in the proof of Proposition 3.6. (A small numerical illustration of weighted self-attention is sketched at the end of Subsection C.4.)

To study the Lipschitz continuity of $f_a$, we equip the space $(\mathbb{R}^d)^n$ with the norm
$$\|X\|_a := \Big(\sum_{i=1}^n a_i\,|x_i|^2\Big)^{1/2},$$
so that we have the following connection with the local Lipschitz constant of mean-field self-attention in the Wasserstein-2 sense.

Lemma C.5. Let $X \in (\mathbb{R}^d)^n$ be an input sequence, and $a \in \Sigma_n$ a vector of coefficients. Then, we have
$$\mathrm{Lip}^{W_2}_{m_a(X)}\big(F|_{\mathcal{P}_a(\mathbb{R}^d)}\big) = \big\|D^a_X f_a\big\|_{2,a},$$
where $D^a$ is the Jacobian in the space $((\mathbb{R}^d)^n, \|\cdot\|_a)$ and $\|\cdot\|_{2,a}$ is the corresponding operator norm.²

² As we are in finite dimension, all norms are equivalent, so the linear operator $D^a$ is just the usual notion of differential.

This is a nice property, as it provides an optimized way to compute numerically the local Lipschitz constant of $f_a$, just as for Euclidean self-attention. Lemma C.5 can be proven with the same steps as Lemma 3.2. Moreover, the differential of $f_a$ has the following expression.

Lemma C.6. Let $X = (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$. For all perturbations $\epsilon := (\epsilon_1, \ldots, \epsilon_n) \in (\mathbb{R}^d)^n$ and all $i \in \{1, \ldots, n\}$, we have
$$(D^a_Xf_a)(\epsilon)_i = V\sum_{j=1}^n P^a_{ij}\Big(x_j - \sum_{k=1}^n P^a_{ik}x_k\Big)\, x_i^\top A^\top \epsilon_j + V\sum_{j=1}^n P^a_{ij}\,\epsilon_j + V\sum_{j=1}^n P^a_{ij}\Big(x_j - \sum_{k=1}^n P^a_{ik}x_k\Big)\, x_j^\top A\, \epsilon_i,$$
with $P^a_{ij} := a_j e^{x_i^\top A^\top x_j}/\sum_{k=1}^n a_k e^{x_i^\top A^\top x_k}$.

C.4. Proof of Proposition 3.4

Proposition 3.4 follows from the following lemma.

Lemma C.7. Let $\gamma$ be a real eigenvalue of $A$ and $u \in \mathbb{R}^d$ an associated unit eigenvector.

1. If $\gamma \ge 0$, denote $X := (u, u/2, \ldots, u/2) \in (\mathbb{R}^d)^n$. Then, for any scaling factor $R > 0$, it holds
$$\|D_{RX}f\|_2 \ge \frac{1}{1 + (n-1)e^{-R^2\gamma/4}}.$$

2. If $\gamma < 0$, denote $X := (u, -u, \ldots, -u) \in (\mathbb{R}^d)^n$. Then, for any scaling factor $R > 0$, it holds
$$\|D_{RX}f\|_2 \ge \frac{1}{1 + (n-1)e^{-2R^2|\gamma|}}.$$

Proof. Let us start with the case $\gamma < 0$. Let $\tilde X := (u, -u)$. Using the weighted self-attention $f_a$ introduced in Subsection C.3, with $a := (1/n, 1 - 1/n)$, we have for any $\epsilon \in (\mathbb{R}^d)^2$ and $i \in \{1, 2\}$ that
$$D_{R\tilde X}f_a(\epsilon)_i = \sum_{j} P^a_{ij}\,\epsilon_j + \sum_{j} P^a_{ij}\Big(x_j - \sum_{k} P^a_{ik}x_k\Big)\, x_j^\top A\, \epsilon_i + \sum_{j} P^a_{ij}\Big(x_j - \sum_{k} P^a_{ik}x_k\Big)\, x_i^\top A^\top \epsilon_j,$$
where the $P^a_{ij}$ are evaluated at $R\tilde X$. Setting $\epsilon_1 := u$ and $\epsilon_2 := 0$, we get
$$D_{R\tilde X}f_a(\epsilon)_2 = P^a_{21}\big(1 + 2P^a_{22}R^2|\gamma|\big)\,u, \quad\text{so that}\quad \big|D_{R\tilde X}f_a(\epsilon)_2\big| \ge P^a_{21}.$$
Moreover,
$$P^a_{21} = \frac{1}{1 + (n-1)e^{-2R^2|\gamma|}},$$
and since $\|\epsilon\|_a = \sqrt{1/n}$ while $\|D_{R\tilde X}f_a(\epsilon)\|_a \ge \sqrt{(n-1)/n}\,|D_{R\tilde X}f_a(\epsilon)_2|$, we obtain
$$\|D_{RX}f\|_2 \ge \big\|D_{R\tilde X}f_a\big\|_{2,a} \ge P^a_{21},$$
which proves the result.

When $\gamma \ge 0$, the same method applies with $\epsilon_1 = u$ and $\epsilon_2 = 0$. Denoting $\tilde X := (u, u/2)$ and $a := (1/n, 1 - 1/n)$, one obtains $\|D_{RX}f\|_2 \ge \|D_{R\tilde X}f_a\|_{2,a} \ge P^a_{21}$ and
$$P^a_{21} = \frac{1}{1 + (n-1)e^{-R^2\gamma/4}},$$
which proves the result.
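As a small illustration of Definition C.4 (ours, not from the paper, assuming numpy): weighted self-attention with uniform weights coincides with standard self-attention, and a measure with rational weights can equivalently be represented by duplicating atoms, which is what makes $f_a$ convenient for the two-Dirac configurations above.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
A = rng.normal(size=(d, d)) / np.sqrt(d)
V = rng.normal(size=(d, d))

def weighted_attention(X, a):
    """f_a from Definition C.4: (f_a(X))_i = V sum_j P^a_ij x_j with P^a_ij proportional to a_j exp(x_i^T A^T x_j)."""
    scores = X @ A.T @ X.T
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax_a
    W = a * np.exp(scores)
    P = W / W.sum(axis=-1, keepdims=True)
    return P @ X @ V.T

def attention(X):
    """Standard self-attention f with the same parameters (A, V): uniform weights recover f."""
    n = len(X)
    return weighted_attention(X, np.full(n, 1.0 / n))

# The measure (2/3) delta_{x1} + (1/3) delta_{x2} can be represented either with weights
# a = (2/3, 1/3) on (x1, x2), or as the empirical measure of (x1, x1, x2): the induced
# map on the support is the same in both cases.
x1, x2 = rng.normal(size=d), rng.normal(size=d)
out_weighted = weighted_attention(np.stack([x1, x2]), np.array([2 / 3, 1 / 3]))
out_duplicated = attention(np.stack([x1, x1, x2]))
print(np.allclose(out_weighted[0], out_duplicated[0]),
      np.allclose(out_weighted[1], out_duplicated[2]))   # True True
```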
C.5. About the Scaling Factor in Proposition 3.4

Let us investigate numerically the scaling factor $2R^2\gamma$ that appears in Proposition 3.4. With pretrained BERT, and with a batch of five text extracts from Alice in Wonderland, we plot $2R^2\gamma^{(h)}$ for each layer $0 \le \ell \le 11$ and for $h \in \{0, 5, 10\}$, where $\gamma^{(h)}$ is the parameter $\gamma$ defined in Proposition 3.4 associated with $A^{(h)}$, the weight matrix of head $h$, and $R$ is the mean magnitude of tokens as they enter the attention block of layer $\ell$, defined as $R := \big(\tfrac{1}{n}\sum_{i=1}^n |x_i|^2\big)^{1/2}$. We obtain Figure 2.

Figure 2. Plot of the scaling factor $2R^2\gamma$ across the layers of pretrained BERT, for three different heads ($h \in \{0, 5, 10\}$) and 5 text extracts of Alice in Wonderland (50 tokens for each extract).

C.6. Proof of Proposition 3.6

Let us prove the following result, which implies Proposition 3.6.

Proposition C.8. Let $R > 0$. Assume that $k = d$ and $V = I_d$, and denote by $\gamma_1 \ge \cdots \ge \gamma_\nu$ the real eigenvalues of $A$, with $u_1, u_\nu$ unit eigenvectors associated with $\gamma_1, \gamma_\nu$. Then, the following claims hold.

1. If $\gamma_1 \ge -8\gamma_\nu$, denoting $C := \gamma_1/8$, there exists a function $\theta\colon [0, +\infty) \to [0, +\infty)$ such that $\theta(R) \to 1$ as $R \to +\infty$ and
$$\mathrm{Lip}^{W_2}\big(F|_{\mathcal{P}(B_R)}\big) \ge \theta(R)\,\frac{C}{2}\,R^2 e^{C R^2}.$$
Moreover, the right-hand side is equivalent to the local Lipschitz constant of mean-field self-attention at the probability measure $e^{-2CR^2}\delta_{Ru_1} + (1 - e^{-2CR^2})\delta_{(R/2)u_1}$.

2. If $\gamma_1 < -8\gamma_\nu$, denoting $C' := |\gamma_\nu|$, there exists a function $\theta\colon [0, +\infty) \to [0, +\infty)$ such that $\theta(R) \to 1$ as $R \to +\infty$ and
$$\mathrm{Lip}^{W_2}\big(F|_{\mathcal{P}(B_R)}\big) \ge \theta(R)\,\frac{C'}{2}\,R^2 e^{C' R^2}.$$
Moreover, the right-hand side is equivalent to the local Lipschitz constant of mean-field self-attention at the probability measure $e^{-2C'R^2}\delta_{Ru_\nu} + (1 - e^{-2C'R^2})\delta_{-Ru_\nu}$.

Proof. Let us detail the proof in the second case only, as the first case is very similar. According to Lemma C.6 with $V = I_d$, for any $X = (x_1, x_2) \in (\mathbb{R}^d)^2$ and $a = (a_1, a_2) \in \Sigma_2$, and for any $\epsilon_2 \in \mathbb{R}^d$, denoting $\epsilon := (0, \epsilon_2) \in (\mathbb{R}^d)^2$, we have
$$D^a_Xf_a(\epsilon)_1 = \big(P^a_{12}\, I_d + 2P^a_{11}P^a_{12}\, x_2 x_1^\top A^\top\big)\,\epsilon_2.$$
Let $u_1, \ldots, u_\nu$ be a family of unit eigenvectors of $A$ associated with the eigenvalues $\gamma_1 \ge \cdots \ge \gamma_\nu$. Let $R > 0$. Set $a_2 := e^{-2|\gamma_\nu|R^2}$, so that $a_1 = 1 - e^{-2|\gamma_\nu|R^2}$, and choose $x_1 := Ru_\nu$ and $x_2 := -Ru_\nu$. Then
$$\big\|D^a_Xf_a(\epsilon)\big\|_a \ge \sqrt{a_1}\;\big|P^a_{12}\big(I_d + 2P^a_{11}R^2|\gamma_\nu|\, u_\nu u_\nu^\top\big)\epsilon_2\big|. \qquad (4)$$
Now let us notice that
$$P^a_{12} = \frac{a_2 e^{|\gamma_\nu|R^2}}{a_1 e^{-|\gamma_\nu|R^2} + a_2 e^{|\gamma_\nu|R^2}} = \frac{e^{-|\gamma_\nu|R^2}}{(1 - e^{-2|\gamma_\nu|R^2})\,e^{-|\gamma_\nu|R^2} + e^{-|\gamma_\nu|R^2}} \;\longrightarrow\; \frac12 \quad\text{as } R \to +\infty,$$
and $P^a_{11} = 1 - P^a_{12} \to 1/2$ as well. Going back to Equation (4) and setting $\epsilon_2 := u_\nu$, we get
$$\big\|D^a_Xf_a(\epsilon)\big\|_a \gtrsim \frac12\big(1 + R^2|\gamma_\nu|\big),$$
where $f(R) \gtrsim g(R)$ means that there exists a function $\theta\colon \mathbb{R} \to \mathbb{R}_+$ such that $\theta(R) \to 1$ as $R \to +\infty$ and $f(R) \ge \theta(R)\,g(R)$. Finally, since $\|\epsilon\|_a = \sqrt{a_2}$, we get
$$\frac{\big\|D^a_Xf_a(\epsilon)\big\|_a}{\|\epsilon\|_a} \gtrsim \frac12\,|\gamma_\nu|\,R^2\, a_2^{-1/2} = \frac12\,|\gamma_\nu|\,R^2\, e^{|\gamma_\nu|R^2},$$
which proves the result, as $\|D^a_Xf_a\|_{2,a} = \sup_{\epsilon \in (\mathbb{R}^d)^2 \setminus \{0\}} \|D^a_Xf_a(\epsilon)\|_a/\|\epsilon\|_a$, and by Lemma C.5 this operator norm is the local Lipschitz constant of $F$ at $m_a(X) = e^{-2C'R^2}\delta_{Ru_\nu} + (1 - e^{-2C'R^2})\delta_{-Ru_\nu}$, which lower bounds $\mathrm{Lip}^{W_2}(F|_{\mathcal{P}(B_R)})$.

C.7. An Example of Quadratic Growth with the Radius

For simplicity, assume that $A = I_d$ and $V = I_d$. Let $X = (0, x_2, \ldots, x_n) \in B_R^n$. Kim et al. (2021) show that, as all norms are equivalent in finite dimension, the local Lipschitz constant of $f$ at $X$ grows like the empirical variance of $X$. Taking $x_2 = \cdots = x_j = Re_1$ and $x_{j+1} = \cdots = x_n = -Re_1$, with $e_1 = (1, 0, \ldots, 0) \in \mathbb{R}^d$ and $j$ such that $|2j - 1 - n| \le 1$, and denoting $X_R$ the resulting configuration, we obtain that the empirical variance of $X_R$ grows like $R^2$, up to a constant factor that is close to 1. This behavior can easily be checked numerically (see Figure 3, and the sketch below).

Figure 3. Linear growth of the square root of the local Lipschitz constant of self-attention at the configuration $X_R(n)$, plotted against the radius $R$ for $n \in \{5, 10, 50\}$, together with the line $y = x$.
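A minimal numerical illustration of this quadratic growth (ours, assuming numpy): with $A = V = I_d$, we build the configuration $X_R$ above, assemble the Jacobian of $f$ by finite differences, and compare its spectral norm with the empirical variance of $X_R$ as $R$ grows.

```python
import numpy as np

d = 3
e1 = np.zeros(d); e1[0] = 1.0

def attention(X):
    # Self-attention with A = V = I_d.
    scores = X @ X.T
    scores -= scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ X

def jacobian_spectral_norm(X, h=1e-5):
    """Spectral norm of D_X f, with the Jacobian assembled column by column by central differences."""
    n, d = X.shape
    J = np.zeros((n * d, n * d))
    for col in range(n * d):
        E = np.zeros(n * d); E[col] = 1.0
        E = E.reshape(n, d)
        J[:, col] = ((attention(X + h * E) - attention(X - h * E)) / (2 * h)).ravel()
    return np.linalg.norm(J, 2)

n = 7
j = (n + 1) // 2
for R in [1.0, 2.0, 4.0]:
    X_R = np.concatenate([np.zeros((1, d)),
                          np.tile(R * e1, (j - 1, 1)),
                          np.tile(-R * e1, (n - j, 1))])
    variance = np.mean(np.sum((X_R - X_R.mean(axis=0)) ** 2, axis=1))
    print(R, jacobian_spectral_norm(X_R), variance)   # both grow roughly like R^2
```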
D. Proofs of Section 4

We have the following formula for the Jacobian of masked self-attention.

Lemma D.1. Let $X = (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$. For all perturbations $\epsilon := (\epsilon_1, \ldots, \epsilon_n) \in (\mathbb{R}^d)^n$ and all $i \in \{1, \ldots, n\}$, we have
$$(D_Xf^m)(\epsilon)_i = V\sum_{j=1}^i P_{ij}\Big(x_j - \sum_{k=1}^i P_{ik}x_k\Big)\, x_i^\top A^\top \epsilon_j + V\sum_{j=1}^i P_{ij}\,\epsilon_j + V\sum_{j=1}^i P_{ij}\Big(x_j - \sum_{k=1}^i P_{ik}x_k\Big)\, x_j^\top A\, \epsilon_i,$$
with $P_{ij} := e^{x_i^\top A^\top x_j}/\sum_{k=1}^i e^{x_i^\top A^\top x_k}$.

D.1. Proofs of Theorem 4.3 and Theorem 4.5

In view of Lemma D.1, following the same steps as in the proof of Theorem 3.3 leads to Theorem 4.3. Likewise, the proof of Theorem 4.5 is the same as for Theorem 3.7.

D.2. Proof of Theorem 4.4

Let $\bar\mu$ and $\bar\nu$ be two distinct measures in $\mathcal{P}([0, 1] \times B_R)$. Assume that $\bar\mu$ and $\bar\nu$ have the same first marginal, denoted $\theta$ (otherwise, we consider that they are associated with an infinite Lipschitz ratio). We have
$$d_p\big(F^m(\bar\mu), F^m(\bar\nu)\big) \le d_p\big((\Gamma_{\bar\mu})_\#\bar\mu, (\Gamma_{\bar\mu})_\#\bar\nu\big) + d_p\big((\Gamma_{\bar\mu})_\#\bar\nu, (\Gamma_{\bar\nu})_\#\bar\nu\big) \le \Big(\int_0^1 W_p\big((\Gamma^s_{\bar\mu})_\#\bar\mu_s, (\Gamma^s_{\bar\mu})_\#\bar\nu_s\big)^p\, d\theta(s)\Big)^{1/p} + \Big(\int_0^1 W_p\big((\Gamma^s_{\bar\mu})_\#\bar\nu_s, (\Gamma^s_{\bar\nu})_\#\bar\nu_s\big)^p\, d\theta(s)\Big)^{1/p},$$
where we denote
$$\Gamma^s_{\bar\mu}(x) := \frac{\int V y\, G(x, y)\, \mathbf{1}_{\tau \le s}\, d\bar\mu(\tau, y)}{\int G(x, y)\, \mathbf{1}_{\tau \le s}\, d\bar\mu(\tau, y)}.$$
Using Lemma A.5, we get
$$W_p\big((\Gamma^s_{\bar\mu})_\#\bar\mu_s, (\Gamma^s_{\bar\mu})_\#\bar\nu_s\big) \le \mathrm{Lip}(\Gamma^s_{\bar\mu})\, W_p(\bar\mu_s, \bar\nu_s),$$
where the Lipschitz constant is taken on $B_R$. A similar computation as for traditional self-attention shows that for all $0 \le s \le 1$ and $x \in B_R$ we have $\|D_x\Gamma^s_{\bar\mu}\|_2 \le \|V\|_2\,\|A\|_2\, R^2$, so that $\mathrm{Lip}(\Gamma^s_{\bar\mu}) \le \|V\|_2\,\|A\|_2\, R^2$. To bound the second term in the previous inequality, we use Lemma A.6, to get
$$W_p\big((\Gamma^s_{\bar\mu})_\#\bar\nu_s, (\Gamma^s_{\bar\nu})_\#\bar\nu_s\big) \le \big\|\Gamma^s_{\bar\mu} - \Gamma^s_{\bar\nu}\big\|_{L^\infty(B_R)}.$$
Again, a similar computation as for traditional self-attention shows that
$$\big\|\Gamma^s_{\bar\mu} - \Gamma^s_{\bar\nu}\big\|_{L^\infty(B_R)} \le \|V\|_2\,\big(2\|A\|_2 R^2\big)\, e^{2\|A\|_2 R^2}\, W_p(\bar\mu_s, \bar\nu_s),$$
which concludes the proof.

E. Experiments

E.1. Power Iteration

Let $X \in (\mathbb{R}^d)^n$. To compute $\|D_Xf\|_2$ numerically, we pick an initialization $u_0 \sim \mathcal{N}(0, 1)^{n \times d} \in \mathbb{R}^{n \times d}$, normalize it as $u_0 \leftarrow u_0/\|u_0\|_F$, and then repeat the following steps until convergence:
$$v_k = (D_Xf)^\top(D_Xf)\, u_k, \qquad \mu_k = v_k^\top u_k, \qquad u_{k+1} = \frac{v_k}{\|v_k\|_F},$$
where $v_k$ is computed by performing successively a Jacobian-vector product and a vector-Jacobian product. It is well known (Sra et al., 2012) that with this method, $\mu_k$ converges to $\|D_Xf\|_2^2$. A minimal sketch of this procedure is given below.
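The following sketch (ours, not the authors' implementation) runs this power iteration with PyTorch, using `torch.autograd.functional.jvp` and `vjp` for the Jacobian-vector and vector-Jacobian products, on a randomly initialized single attention head; the parameter shapes and the final dense-Jacobian comparison are only for illustration.

```python
import torch

torch.manual_seed(0)
n, d = 10, 8
X = torch.randn(n, d)
A = torch.randn(d, d) / d ** 0.5          # plays the role of K^T Q / sqrt(k)
V = torch.randn(d, d)

def f(Z):
    # Single-head self-attention with parameters (A, V).
    P = torch.softmax(Z @ A.T @ Z.T, dim=-1)
    return P @ Z @ V.T

u = torch.randn(n, d)
u = u / u.norm()
for _ in range(200):
    _, Ju = torch.autograd.functional.jvp(f, X, u)     # (D_X f) u_k
    _, v = torch.autograd.functional.vjp(f, X, Ju)     # v_k = (D_X f)^T (D_X f) u_k
    mu = (v * u).sum()                                 # Rayleigh quotient, converges to ||D_X f||_2^2
    u = v / v.norm()

# Sanity check against the dense Jacobian (only tractable for small n * d).
J = torch.autograd.functional.jacobian(f, X).reshape(n * d, n * d)
print(mu.sqrt().item(), torch.linalg.svdvals(J)[0].item())   # both approximate ||D_X f||_2
```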
E.2. Experiments on Pretrained GPT-2

We perform the same experiments as in Section 5 on pretrained GPT-2 (Radford et al., 2019). We see in Figure 4 that the behavior is different from what we observe in Figure 1. This is explained by the fact that the magnitude of tokens depends on the sequence length for pretrained GPT-2, due to the learned positional encoding (see Figures 5 and 6). Therefore, the bound of Theorem 3.3 still holds, but the growth rate can be faster than $\sqrt{n}$.

Figure 4. Scatter plots of the local Lipschitz constant of masked self-attention for pretrained GPT-2 as a function of the sequence length, on the dataset Alice in Wonderland. The first column corresponds to masked self-attention at layer 0, and the second column to layer 6.

Figure 5. Norm of the positional embeddings of pretrained GPT-2, ordered by position. The very first tokens are associated with positional embeddings of much larger magnitude, which makes $n$ and $R$ dependent from the very beginning of the architecture.

Figure 6. Scatter plot of the average magnitude of tokens $R := \big(\tfrac{1}{n}\sum_{i=1}^n |x_i|^2\big)^{1/2}$ as a function of the sequence length, for the same dataset as in Figure 4, right column.

E.3. Self-Attention with Biases

Some Transformer architectures such as BERT (Devlin et al., 2018) add biases $(b_Q, b_K, b_V) \in (\mathbb{R}^k)^3$ in the formula of self-attention:
$$f^b\colon (x_1, \ldots, x_n) \mapsto \Big(V\sum_{j=1}^n P_{ij}\, x_j + b_V\Big)_{1 \le i \le n} \in (\mathbb{R}^k)^n, \qquad (5)$$
with $P_i := \mathrm{softmax}\big(\big((Qx_i + b_Q)^\top(Kx_j + b_K)\big)_{1 \le j \le n}\big)$, where we absorbed the factor $1/\sqrt{k}$ into $Q$, $K$, $b_Q$ and $b_K$ to alleviate notations.

Remark E.1. Note that $b_K$ has no influence on the value of $f^b$:
$$\mathrm{softmax}\big(\big((Qx_i + b_Q)^\top(Kx_j + b_K)\big)_{1 \le j \le n}\big) = \mathrm{softmax}\big(\big((Qx_i + b_Q)^\top Kx_j\big)_{1 \le j \le n}\big),$$
as $b_K$ is only involved in terms that are independent of $j$.

How do biases in self-attention affect the bound of Theorem 3.3? We start by computing the Jacobian of self-attention with biases.

Lemma E.2. Let $X = (x_1, \ldots, x_n) \in (\mathbb{R}^d)^n$. Let $f^b$ be a self-attention module with biases, defined as in Equation (5). For all perturbations $\epsilon := (\epsilon_1, \ldots, \epsilon_n) \in (\mathbb{R}^d)^n$ and all $i \in \{1, \ldots, n\}$, we have
$$(D_Xf^b)(\epsilon)_i = V\sum_{j=1}^n P_{ij}\Big(x_j - \sum_{k=1}^n P_{ik}x_k\Big)\,(Qx_i + b_Q)^\top K\epsilon_j + V\sum_{j=1}^n P_{ij}\,\epsilon_j + V\sum_{j=1}^n P_{ij}\Big(x_j - \sum_{k=1}^n P_{ik}x_k\Big)\, x_j^\top A\,\epsilon_i,$$
with $P_{ij} := e^{(Qx_i + b_Q)^\top(Kx_j + b_K)}/\sum_{k=1}^n e^{(Qx_i + b_Q)^\top(Kx_k + b_K)}$.

The same steps as in the proof of Theorem 3.3 lead to the following bound.

Theorem E.3. Let $Q, K, V \in \mathbb{R}^{k \times d}$ and $A := K^\top Q/\sqrt{k}$. Let $R > 0$ and $n \in \mathbb{N}$. Unmasked self-attention with biases $f^b$, defined in Equation (5), is Lipschitz continuous on the set $B_R^n$, with Lipschitz constant at most
$$\sqrt{3}\,\|V\|_2\Big(\|A\|_2^2\, R^4 + n\big[\|K\|_2^2\big(\|Q\|_2 R + |b_Q|\big)^2 R^2 + 1\big]\Big)^{1/2}.$$
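As a small numerical check of Remark E.1 (ours, assuming numpy): with self-attention with biases implemented as in Equation (5), changing $b_K$ leaves the output unchanged, since $b_K$ only shifts every score in row $i$ by the same constant $(Qx_i + b_Q)^\top b_K$, which the softmax normalization cancels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 4, 4
X = rng.normal(size=(n, d))
Q, K, V = (rng.normal(size=(k, d)) for _ in range(3))
b_Q, b_V = rng.normal(size=k), rng.normal(size=k)

def attention_with_biases(X, b_K):
    """f^b from Equation (5); the factor 1/sqrt(k) is assumed absorbed into Q, K, b_Q, b_K."""
    scores = (X @ Q.T + b_Q) @ (X @ K.T + b_K).T   # (Q x_i + b_Q)^T (K x_j + b_K)
    scores -= scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ X @ V.T + b_V

out_zero_bias = attention_with_biases(X, b_K=np.zeros(k))
out_random_bias = attention_with_biases(X, b_K=rng.normal(size=k))
print(np.allclose(out_zero_bias, out_random_bias))   # True: the output does not depend on b_K
```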