# global_convergence_in_training_largescale_transformers__7230f1e7.pdf

Global Convergence in Training Large-Scale Transformers

Cheng Gao1 Yuan Cao2 Zihao Li1 Yihan He1 Mengdi Wang1 Han Liu3 Jason M. Klusowski1 Jianqing Fan1

1Princeton University 2The University of Hong Kong 3Northwestern University {chenggao,zihaoli,yihan.he,mengdiw,jason.klusowski,jqfan}@princeton.edu yuancao@hku.hk hanliu@northwestern.edu

Despite the widespread success of Transformers across various domains, their optimization guarantees in large-scale model settings are not well-understood. This paper rigorously analyzes the convergence properties of gradient flow in training Transformers with weight decay regularization. First, we construct the mean-field limit of large-scale Transformers, showing that as the model width and depth go to infinity, gradient flow converges to the Wasserstein gradient flow, which is represented by a partial differential equation. Then, we demonstrate that the gradient flow reaches a global minimum consistent with the PDE solution when the weight decay regularization parameter is sufficiently small. Our analysis is based on a series of novel mean-field techniques that adapt to Transformers. Compared with existing tools for deep networks [47] that demand homogeneity and global Lipschitz smoothness, we utilize a refined analysis assuming only partial homogeneity and local Lipschitz smoothness. These new techniques are of independent interest.

1 Introduction

Transformers have revolutionized the field of deep learning since their introduction in [66]. These models are distinguished by their immense scales, often comprising billions of parameters to achieve state-of-the-art performance. Notably, this massive parameterization enables them to excel in a variety of domains, notably in natural language processing [21, 55, 65] and vision tasks [20, 36], where they have significantly advanced the frontiers of machine learning.

Despite the widespread adoption of Transformer models, our understanding of their optimization guarantees is still in its early stages. One particularly intriguing phenomenon is that as the size of model increases, training algorithms typically converges globally despite the highly nonconvex landscape of the training objective function. Remarkably, it remains somewhat enigmatic how gradient-based approaches can consistently succeed when training large-scale Transformers.

Notably, there have been several recent works showing the global convergence of training overparameterized neural networks [51, 16, 28, 14, 47, 22, 23, 35, 3, 25, 76]. In particular, several works [47, 22, 23] studied the setting with deep neural networks with skip connections. By studying the connections between the network with discretization in the parameter space and a corresponding ordinary differential equation system [71, 12, 43], these works demonstrated global convergence guarantees of wide and deep neural networks based on a mean-field analysis. However, these results are established based on certain homogeneity and/or global Lipschitz smoothness properties of the

Equal Contribution.

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

neural network, which are not applicable to Transformer models. Therefore, it remains an open question how gradient-based methods can effectively train large-scale Transformers.

1.1 Our contribution

In this work, we bridge the gap between Transformer theory and practice by demonstrating the global convergence of Transformer training optimization via gradient flow in a large-scale model regime. We analyze the mean-field limit of the Transformer model, which is characterized by the distribution of model parameters, shifting the focus from parameter space to distributional dynamics in the Wasserstein metric [16]. This approach yields two key theorems:

i. We show the closeness between practical discrete Transformers trained by gradient flow and continuous Transformers whose parameter distribution follows a partial differential equation of the Wasserstein gradient flow (Theorem 3.1). Our result demonstrates that large-scale discrete Transformers can be approximated by its mean-field limit and the approximation error can be expressed in terms of the width and depth of the Transformer models.

ii. This approximation facilitates our analysis of the global convergence (Theorem 4.1) of discrete Transformer models. By leveraging the universal approximation capabilities of either the self-attention or feed-forward layers, we demonstrate that a basic gradient flow method can reliably find a global optimum, despite the highly non-convex landscape of the training objective.

We also highlight our novel contributions to Transformer theory through the development of these two core results:

i. The assumption on activation regularity conditions (Assumption 2) is less stringent compared to those usually found in studies of two-layer neural networks [51, 16, 28, 14] or deep Res Net networks [22, 23, 47]. In particular, many existing approximation guarantees reply on a Lipschitz continuity property of the network gradients, which limits the mean-filed study to neural networks with smooth activation functions. In comparison, our analysis relaxes this assumption and only requires local Lipschitz continuity of the gradient in expectation. This relaxation broadens the applicability of our approach and ensures that our result can cover more practical Transformer architectures.

ii. Our model differs from the Res Net models in [47, 22, 23, 13], as those models incorporate only a single identical encoder within each evolutionary block. Unlike the typical theoretical configurations, our model employs two distinct encoders f and h that alternate throughout the network s depth. More importantly, despite the distinct encoders used, the continuous limit of our model uniformly interprets the encoder as an average of f and h, providing a rigorous validation of concepts proposed in [47] and [67] from a new perspective.

iii. Our global convergence guarantee for training Transformer models is also broadly applicable: our assumption (Assumption 4) ensures global convergence by relying on the universal approximation capabilities of either the self-attention or the feed-forward encoder. Additionally, we incorporate a more flexible framework by adopting partial 1-homogeneity for only a subset of the parameters, in contrast to the full parameter homogeneity required in studies such as [47]. This modification enables the use of softmax and sigmoid activation, expanding beyond the hardmax and Re LU restricted by full homogeneity.

Additional related works. See Appendix B for a detailed discussion.

Notations. For any α Rd, dim(α) refers to its dimension d. For any B Rd d, its trace is denoted by Tr(B). For any positive integer n, Let [n] = {1, 2, . . . , n}. Let 0d denote the d-dimension vector of all zeros. Let Wp(µ, ν) denote the Wasserstein-p distance between two probability measures µ, ν P(Rd) for p 1. For a matrix A = (a1, a2, . . . , an), define its vectorization version as vec[A] := (a 1 , a 2 , . . . , a n ) . Let δ( ) denote the Dirac mass and 1{ } be the indicator function. Let supp( ) denote the support of any distribution. Let = 2 denote the l2 norm and max denote the maximum norm. For any subsets D1, D2 in Euclidean space, define C(D1, D2) as the collection of functions that map D1 to D2 and are continuous over D1. Define the Bounded Lipschitz norm for any measure µ M(Rd) as µ BL := sup{ R fdµ : f : Rd R, sup|f| 1, f is 1 Lipschitz}.

2 Transformer model

In this section, we describe our deep Transformer model with each data input as a sequence, and the gradient flow algorithm used for training.

2.1 Data setting

In our paper, the data input is both general and straightforward: an input sequence H RD (N+1) consisting of N + 1 tokens, each with dimension D. We consider the setting where each input sequence H is associated with a label y(H) R, where y(H) is the target function we aim to learn. Furthermore, we assume that each instance H is i.i.d. drawn from a population distribution µ.

Relation to in-context learning (ICL) Our data setting is versatile and applicable to any task involving sequential input. It particularly suits the in-context learning (ICL) scenario [6, 10, 75], where models are capable of making accurate predictions on new data when prompted with training examples from the same pool. For clarity, consider the input sequence H RD (N+1) formatted as follows:

H = [h1, h2, . . . , h N+1] =

"x1 x2 . . . x N x N+1 y1 y2 . . . y N 0 p1 p2 . . . p N p N+1

i.i.d. µ, y N+1 = y(H).

Here, {xi}i [N] are the input vectors, each associated with a corresponding label {yi}i [N]. The last token, x N+1 is the test input for which a prediction is made. The third row contains the customized and fixed positional encoding vectors {pi}i [N], which typically include ones, zeros, and indicators denoting the token for prediction. The label for the query point x N+1 is then given by y N+1 = y(H) in our terminology. ICL operates in a zero-shot fashion, without any updates to the model s parameters, highlighting a unique and powerful capability of these systems to adapt and generalize based on the provided context alone. In [6], the authors demonstrate that fixed Transformers can approximate in-context penalized generalized linear regression to any desired degree.

We follow a common configuration of Transformer architectures [6, 38, 40, 48, 73] where each Transformer block consists of two distinct layers: a self-attention mechanism layer and a token-wise feed-forward neural network layer, both equipped with skip connections. We assume that both layers consist of the average of M heads, treated uniformly as the width across all blocks for simplicity. The formulation for a matrix input Z RD (N+1) and a given residual step size η > 0 is as follows: Each residual self-attention layer is represented by

Attnθ1,θ2,...,θM (Z, η) = Z + ηM 1 M X

j=1 f(Z, θj), (2.1)

and each residual feed-forward neural network layer is defined by

MLPw1,w2,...,w M (Z, η) = Z + ηM 1 M X

j=1 h(Z, wj) (2.2)

for parameter vectors θ and w in the Euclidean space. The encoders for the self-attention and feed-forward layers are denoted as f : RD (N+1) RD (N+1) and h : RD (N+1) RD (N+1), respectively. The self-attention encoder f formulation, commonly adopting a multiplicative or dot-product approach as detailed in [8, 38, 48, 64, 66, 73], can be exemplified by

f(Z, θ) = WOWV ZσA h (WKZ) WQZ i ,

where WV , WK, WQ Rs D, and WO RD s. This formulation can be reparametrized to

f(Z, θ) = V ZσA h Z WZ i , (2.3)

where V, W RD D, θ = vec[V, W]. The activation σA typically uses column-wise softmax, but component-wise Re LU is also viable, as in [6]. For the feed-forward layer, an example of the encoder is h(Z, w) = W2σM(W1Z), as detailed in [6, 38, 73], where w = vec[W1, W2] and the activation σM is component-wise Re LU. Alternatively, setting h 0 results in a Transformer block that comprises only the self-attention layer, referred to as attention-only Transformers, as discussed in [6, 46, 49, 66].

Next, we analyze a Transformer network composed of L Transformer blocks, referring to L as the depth of the model. In this paper, we introduce an additional term, η, in (2.1) and (2.2) to simulate the model s evolution in a residual manner. We set the step size η as t/2, where t = 1/L. As L increases, t approaches zero, allowing Transformer blocks to incrementally contribute to the model s overall progression. The structure of the network is then defined as follows: ( b TΘ(H, t + t/2) = Attnθt,1,...,θt,M ( b TΘ(H, t), t/2) b TΘ(H, t + t) = MLPwt,1,...,wt,M ( b TΘ(H, t + t/2), t/2) (2.4)

for each t = 0, t, . . . , (L 1) t with b TΘ(H, 0) = H. We abbreviate the subscript t = 0, t, . . . , (L 1) t by t and j = 1, 2, . . . , M by j for simplicity. Here, Θ = {θt,j, wt,j}t,j denotes all parameters in the Transformer model.

Throughout this paper, we treat D and N as bounded finite values, while M and L are treated as diverging, aligning with the setting of large-scale Transformers.

2.3 Gradient flow

For the l2 regularization with λ > 0, we consider training the constructed Transformer model using the following λ-regularized risk objective:

b Q(Θ) = b R(Θ) + λ 2ML

j=1 ( θt,j 2 2 + wt,j 2 2), (2.5)

with the population squared risk function defined as

b R(Θ) = Eµ h1

Read[ b TΘ(H, 1)] y(H) 2i .

In Section 3.2, we will show that l2-regularization on the parameter norms is essential for the wellposedness of the (Wasserstein) gradient flow to control parameter growth under our mild assumptions, even with a very small λ > 0. Similar strategies that consider necessary l2 regularization are employed in [23] and [70]. Then, drawing on the methodologies in [6, 32, 46], our model processes the final output through a simple read-out function, Read[ ], extracting the (d + 1, N + 1)-th entry of its input. We propose that this read-out layer can be expanded to any linear mapping with bounded parameter norm without affecting the validity of our theoretical results.

To minimize the objective function (2.5), we implement the standard gradient flow method as follows:

Step 1. Initially, for each t = 0, t, . . . , (L 1) t, we sample M particles θ(0) t,j , w(0) t,j with j [M] independently from ρ0(θ, w|t), where ρ0 is a pre-defined distribution with bounded support.

Step 2. Then, we update all parameters θ(τ) t,j , w(τ) t,j in the set Θ(τ) = {θ(τ) t,j , w(τ) t,j }t,j using gradient flow (scaled by ML), which is defined as follows:

dθ(τ) t,j dτ = ML θt,j[ b Q(Θ(τ))], dw(τ) t,j dτ = ML wt,j[ b Q(Θ(τ))]. (2.6)

Define the function b R(H; Θ) = 1

2(Read[ b TΘ(H, 1)] y(H))2, and the partial derivative bpΘ(H, t) = b R(H; Θ)/ b TΘ(H, t) for each t = 0, t/2, t, . . . , (L 1) t, (L 1/2) t. Refer to Appendix C.4 for the explicit formula of bpΘ(H, t). Using the chain rule, we derive the explicit form of the gradient flow as follows:

dθ(τ) t,j dτ = b Gf(θ(τ) t,j , Θ(τ), t), dw(τ) t,j dτ = b Gh(w(τ) t,j , Θ(τ), t). (2.7)

b Gf(θ, Θ, t) = 1

2Eµ h θTr f( b TΘ(H, t), θ) bpΘ(H, t + t/2) i + λθ,

b Gh(w, Θ, t) = 1

2Eµ h w Tr h( b TΘ(H, t + /2), w) bpΘ(H, t + t) i + λw

for t = 0, t, . . . , (L 1) t.

3 Approximation by the mean-field limit

In this section, we present a rigorous approximation result that bridges Transformer models in (2.4) with their mean-field limit as continuous Transformers. Thus, the width M and depth L in our proposed model are treated as discretization of this continuous limit in the parameter space.

3.1 Assumptions

In addition, we introduce the norm 2 col as the maximum l2 norm across all columns of a matrix. We proceed under several mild assumptions related to the data distribution and the encoders f and h. Assumption 1 (Data regularity). There exists some universal constant B > 0 such that, for any H supp(µ), we have max{ H 2 col, y(H)} B. In addition, a universal constant Ky > 0 ensures that y(H) is Ky-Lipschitz continuous for F over H supp(µ).

Remark Assumption 1 is irrelevant to the Transformer model, and is only a fairly mild assumption on the data. Assumption 2 (Transformer particle growth bound). We assume that the gradient of f(T, θ) and h(T, w) exists. Furthermore, we have

i. f(T, θ) 2 col K T 2 col(1 + θ + θ 2).

ii. For every i [N + 1], we have θf(T, θ):,i 2 ϕP ( T 2 col)(1 + θ ).

iii. vec[T ]vec[f(T, θ)] 2 ϕT (N, D, T F )(1 + θ + θ 2).

for some continuous, monotonically increasing functions ϕP , ϕT for every coordinate, and a universal constant K > 0. Similarly, if we replace f with h and θ with w, the same conditions apply.

Remark There are three key observations for Assumption 2. Firstly, it incorporates the 2 col norm, which is particularly useful for handling sequential inputs where each column represents a token. Secondly, as we consider higher-order multiplications between data and parameters, this assumption accommodates a broader range of self-attention encoders, such as the one in (2.3) with softmax or Re LU activation (where the derivative is defined as Re LU (x) = 1{x > 0}). Lastly, a particularly interesting and frontier question is identifying the function ϕT , and we have listed related literature in Appendix B. Assumption 3 (Locally Lipschitz continuous gradient in expectation). Besides Assumption 2, for any LT > 0 and any LT -Lipschitz continuous functions T1 = T1(H) and T2 = T2(H), for every i [N + 1], we have

i. Eµ θf(T1, θ):,i θf(T2, θ):,i 2 ϕP T ( θ , KT , LT ) sup H T1 T2 2 col,

ii. Eµ vec[T ]vec[f(T1, θ)] vec[T ]vec[f(T1, θ )] 2 ϕT P (N, D, sup H T1 F , KP , LT ) θ θ

iii. Eµ θf(T1, θ):,i θf(T1, θ ):,i 2 ϕP P (KP , sup H T1 2 col, LT ) θ θ ,

iv. Eµ vec[T ]vec[f(T1, θ)] vec[T ]vec[f(T2, θ)] 2 ϕT T (N, D, KT , θ , LT ) sup H T1 T2 F

for KT = max{sup H T1 2 col, sup H T2 2 col}, KP = max{ θ , θ }, and some continuous functions ϕP T , ϕT P , ϕP P , ϕT T that are monotonically increasing for every coordinate. Similarly, if we replace f with h and θ with w, the same conditions apply.

Remark Assumption 3 states that functions are locally Lipschitz continuous in expectation, suitable for encoders that utilize Re LU functions and have second-order derivatives almost everywhere. This assumption is naturally satisfied if the activation has a locally Lipschitz continuous gradient.

Define P2 as the set of probability measures endowed with the Wasserstein-2 distance, where the Lipschitz continuity with respect to the depth holds, i.e. there exists some universal constant Cρ > 0 such that ρ( , t) ρ( , t ) BL Cρ|t t | for any t, t [0, 1].

Choice of ρ0 Suppose ρ0 P2 satisfies that for any t [0, 1], the support of ρ0( , , t) is contained within the set {(θ, w) : θ 2 + w 2 R2} for a universal constant R. Additionally, for each t [0, 1], it holds that R

θ,w ρ0(θ, w, t)d(θ, w) = 1. This condition suits common bounded support distributions, and a natural choice is a uniform distribution across a disk with radius R for each t [0, 1].

We would like to clarify that verifying Assumptions 2 and 3 for concrete examples of Transformer architectures with smooth activation functions is fairly intuitive, and the proof is mainly based on a series of tedious calculations. We give a concrete proposition with its brief proof in Appendix G.

3.2 Continuous Transformer and Wasserstein gradient flow

Drawing inspiration from [47] and [67], which suggest that deep residual networks behave like ensembles of residual networks locally, we apply a similar manipulation to formulate the continuous version of (2.4). Consider the following continuous version Tρ(H, t) RD (N+1), governed by the following continuous ODE that averages the two encoders:

Tρ(H, t) = Z

f(Tρ(H, t), θ) + h(Tρ(H, t), w)

2 ρ(θ, w, t)d(θ, w), Tρ(H, 0) = H (3.1)

In (3.1), each encoder f or h is conceptualized as a particle, and we consider the distribution of these particles denoted as ρ(θ, w, t). For any ρ P2 that have a bounded support, the well-posedness of Tρ(H, t) that satisfies the Transformer ODE (3.1) is shown in Proposition C.1. Transitioning to the framework with continuous Transformers, our objective shifts to minimizing the l2 risk function with regularization on the second moment of ρ as follows:

Q(ρ) = R(ρ) + λ

θ,w ( θ 2 2 + w 2 2)ρ(θ, w, t)d(θ, w)dt, (3.2)

R(ρ) = Eµ h1

Read[Tρ(H, 1)] y(H) 2i . (3.3)

Define pρ(H, t) RD (N+1), the partial derivative of R(ρ) relative to Tρ(H, t) at a local query point H, as the solution derived in Appendix C.4 using the classical adjoint sensitivity method [58]:

vec[pρ(H, t)] = Read[Tρ(H, 1)] y(H) exp Z 1

β vec[T ]vec[g(Tρ(H, t), β)ρ(β, t)dβdt]

Using this, we can compute the functional derivative to ρ as follows:

δρ (θ, w, t) = Eµ h Tr hf(Tρ(H, t), θ) + h(Tρ(H, t), w)

i pρ(H, t) i + λ

2 ( θ 2 2 + w 2 2). (3.4)

The following Proposition claims that δQ

δρ is indeed the derivative with respect to ρ (specifically, the Fréchet derivative [30]) for the functional Q(ρ).

Proposition 3.1 (Functional derivative to ρ). Under Assumptions 1 and 2, for any pair ρ, ν P2 that have bounded supports, we have

Q(ρ + η(ν ρ)) = Q(ρ) + η DδQ

δρ , ν ρ E + o(η),

δρ is defined in (3.4), and δQ

δρ , ν ρ = R 1 0 R

δρ (ν ρ)d(θ, w)dt R.

Now, we are in a position to display the gradient flow of ρ in the Wasserstein metric [16], given by a Mc Kean-Vlasov type equation [4, 37, 54, 56]. Specifically, we study the following partial differential equation of the distribution ρ(τ)(θ, w, t):

dρ(τ)(θ, w, t)

dτ = div(θ,w) ρ(τ) (θ,w) δQ

= divθ ρ(τ)Gf(θ, ρ(τ), t) + divw ρ(τ)Gh(w, ρ(τ), t) ,

where ρ(0) = ρ0, div is the divergence operator, and the gradient functions are defined as

Gf(θ, ρ, t) = 1

2Eµ h θTr f(Tρ(H, t), θ) pρ(H, t) i + λθ,

Gh(w, ρ, t) = 1

2Eµ h w Tr h(Tρ(H, t), w) pρ(H, t) i + λw.

Propositions D.1 and 3.2 provide the well-posedness of both gradient flow and Wasserstein gradient flow respectively. In both propositions, a λ > 0 is essential to stabilize the optimization process by controlling both the maximum and average norms across all parameters. If λ is set to 0, it is only possible to establish the well-posedness of (3.5) over a finite maximal interval [47]. Similar adjustments to regularize the risk function are also noted in [23]. Proposition 3.2 (Existence and uniqueness of Wasserstein gradient flow). Under Assumptions 1 and 2, there exists a unique solution (ρ(τ))τ 0 P2 R with ρ(0) = ρ0 for (3.5). Additionally, for any τ 0, we have

i. ρ(τ) has a bounded support {θ, w : θ 2 + w 2 Rτ} [0, 1], where Rτ = (R + 1) exp(CRτ) 1 for some constant CR that only depends on N, D, λ and the parameters of the assumptions.

ii. R 1 0 ( θ 2+ w 2)ρ(τ)(θ, w, t)d(θ, w)dt A2 0, where A0 := R2+λ 1(2B2+2B2 exp(K(1+ R + R2))2).

(θ,w) ρ(τ)(θ, w, t)d(θ, w) = 1 for any t [0, 1].

3.3 Approximation of large-scale Transformer

In this section, we discuss the general results associated with approximating our discrete Transformer model to its mean-field limit. First, we highlight that the minimization of the risk function with discretization, whether or not regularization is included, closely approximates the minimal risk achievable by continuous models. Proposition 3.3 (Global minimum approximation of discretization). Under Assumptions 1 and 2, we define P2,r as the set of distributions in P2 concentrated on {(θ, w) : θ 2 + w 2 r2} [0, 1]. for any r > 0. Then there exists a constant C dependent on N, D, r and the parameters of the assumptions such that

inf Θ b R(Θ) inf ρ P2,r R(ρ) + C L 1 +

inf Θ b Q(Θ) inf ρ P2,r Q(ρ) + C(1 + λ) L 1 +

Proposition 3.3 specifies that the distributions under consideration must have bounded support. While it is typically challenging to confirm whether the minimal risk is indeed achieved on a distribution with bounded support, this assumption is justified as λ regulates parameter norms, implicitly encourages solutions residing in a compact region of the parameter space.

We now present the main theorem concerning the convergence of the gradient flow process to the Wasserstein gradient flow as outlined in (3.5). The proof with detailed explanation of the techniques used in Theorem 3.1 is provided in Appendix D. Theorem 3.1 (Gradient flow approximation of discretization). Define the empirical distribution as ˆρ(τ) := 1 ML P

t PM j=1 δ(θ(τ) t,j , w(τ) t,j , t) for any τ 0. Under Assumptions 1-3, we

have that (ˆρ(τ))τ 0 weakly converges to (ρ(τ))τ 0 almost surely along any sequence such that L , M/log L . Moreover, for any fixed τ > 0 and any δ > 0, with probability at least 1 3 exp( δ) with respect to the parameter initialization Θ(0), we have

i. sups [0,τ]|Read[ b TΘ(s)(H, t)] Read[Tρ(s)(H, t)]| C L 1 + q

ii. sups [0,τ]| b R(Θ(s)) R(ρ(s))| C L 1 + q

iii. sups [0,τ]| b Q(Θ(s)) Q(ρ(s))| C L 1 + q

for some constant C that depends on on N, D, τ, λ and the parameters of the assumptions.

Theorem 3.1 significantly advances our understanding by controlling the difference regarding both the Transformer output, the risk function, and the regularized risk function. It s noted that the difference bound in the model s approximation may increase, possibly exponentially [22, 23, 51], as the time horizon extends. As argued in [51], such behavior may be inherent to the systems being modeled.

Additionally, the technical uniqueness and innovation of this theorem contrast sharply with previous results from overparametrized Res Net models. Our analysis distinguishes itself in two ways. First, our discrete Transformer model (2.4) uniquely splits the averaged encoder (f +h)/2 into two distinct blocks with encoders f and h. Second, we demonstrate uniform error control over any finite time interval [0, τ], enabling continuous monitoring of maximum error across the gradient flow s trajectory. In contrast, models in prior studies such as [22, 23] restricts the error analysis to a specific s [0, τ].

4 Global convergence of gradient flow

In this section, we explore the optimization problem for gradient flow in the context of the discrete Transformer model, focusing on our general global convergence results.

4.1 An additional assumption

To ensure the global convergence of gradient flow for our discrete Transformer model, we introduce the following assumption. While influenced by the work in [16, 22, 23, 47], our assumption is uniquely tailored to the context of Transformers: Assumption 4. There exists a pair (g, α) {(f, θ), (h, w)} with a partition α = (α1, α2) such that

i. (Partial 1-homogeneity) for any T RD (N+1) and c R, we have g(T, cα1, α2) = cf(T, α1, α2).

ii. (Universal kernel) a compact set K Rdim(α2) ensures that the span of n g( , α) : α

Rdim(α1) K o is dense in C( T 2 col B, RD (N+1)) for any B > 0.

We emphasize that the universal kernel property, as discussed in [52], closely relates to the universal approximation abilities. Under our assumption, we require the universal approximation capabilities of either the self-attention encoder or the feed-forward encoder. In Appendix G, we provide a concrete example of Transformer architectures and verify the validity of Assumption 4.

The universal kernel property of the feed-forward layer encoder h is well-established, particularly in two-layer neural network contexts [74]. Conversely, the universal approximation abilities of self-attention layers is a frontier research area, which, while not extensively covered in this paper, holds significant potential. Often labeled as memorization capacity", this area is recently explored across multiple studies [27, 31, 38, 39, 49, 63, 73]. The interconnection between approximation abilities and memorization capacities is established in [38]. Notably, [49] investigated the expressive capabilities of one single multi-head softmax self-attention layer, thereby potentially validating our assumptions.

Finally, we posit that the universal kernel applies to α2 within a compact set, as the function s scale can be moderated by the homogeneous part α1. In scenarios where α2 and K are absent, our

assumption simplifies to that in [47], characterized by complete homogeneity. Conversely, in the absence of the α1 component, our framework aligns with [23] which necessitates a more stringent support condition for K, as detailed later in Theorem 4.1.

4.2 Global convergence result

In this section, we establish the convergence properties of the optimization task for discrete Transformers through gradient flow dynamics.

Theorem 4.1 (Global convergence up to λ). Suppose that Assumptions 1-4 hold, and the Wasserstein gradient flow (ρ(τ))τ 0 weakly converges to some ρ P2. If for some universal constant R > 1, the following two conditions hold:

i. (ρ(τ))τ 0 is concentrated on {θ, w : θ 2 + w 2 R2 } [0, 1] when τ is sufficiently large.

ii. If Assumption 4 holds with (g, α) = (f, θ), we assume there exists a t [0, 1] such that the connected set supp(ρ ( , t )) D K {w0}, for some w0 Rdim(w) and D Rdim(θ1) that separates {θ1 : θ1 = 1/R } and {θ1 : θ1 = R }.

ii . If Assumption 4 holds with (g, α) = (h, w), we assume there exists a t [0, 1] such that the connected set supp(ρ ( , t )) {θ0} K D, for some θ0 Rdim(θ) and D Rdim(w1) that separates {w1 : w1 = 1/R } and {w1 : w1 = R }.

Then, for any ϵ > 0, there exists some τ0 > 0 such that

sup τ τ0 b R(Θ(τ)) ϵ + C1 L 1 +

δ + log(L + 1)

with probability at least 1 3 exp( δ) with respect to the parameter initialization Θ(0) for any δ > 0. Here, C1 is some constant dependent only on N, D, τ0, λ and the parameters of the assumptions, while C2 depends only on N, D, R and the parameters of the assumptions.

Theorem 4.1 depicts the behavior of the risk function b R(Θ(τ)) as the training duration τ is sufficiently large. Specifically, b R(Θ(τ)) asymptotically approaches zero as both L and M/log L , with an additional term that scales with λ. This additional term attributes to the incorporation of a λ-weighted penalty on the norm of the parameters in our training objective b Q. Consequently, by selecting an appropriately small λ > 0, the risk approximates zero, demonstrating global convergence to the minimum of b R.

In addition, Theorem 4.1 posits some additional assumptions: the weak convergence of ρ(τ), the longtime uniform boundedness, and the separation property for α1 with the support expansion of α2 to K. Similar assumptions are made in the literature of deep model optimization theory [22, 23, 47]. While these types of assumptions are typically challenging to justify, we provide high-level justifications for them in Appendix C.5, deferring detailed verification to future research.

We then present a corollary that directly follows from Theorem 4.1:

Corollary 4.1. Continuing with the notations and assumptions from Theorem 4.1, suppose λ Cλϵ for some universal constant Cλ > 0. Then, for any δ > 0, constants τ0, L0, K0 > 0 can be found such that: sup τ τ0,L>L0,M/log L>K0 b R(Θ(τ)) (1/2 + C2Cλ)ϵ (4.1)

with probability at least 1 3 exp( δ) with respect to the parameter initialization Θ(0). Notably, if Cλ (2C2) 1, the upper bound in (4.1) is less than or equal to ϵ.

Corollary 4.1 claims that with a fixed δ > 0, for any ϵ > 0, we can achieve an order of ϵ-close approximation with sufficiently large L and M. Though our result is asymptotic and does not involve an explicit rate, it is the first of its kind and lays the groundwork for future theoretical optimization guarantees for Transformers.

5 Proof ideas of main theorems

Given the technical nature of this paper, this section presents the key ideas behind the proof of our main novel results, along with an outline of the proof preparation.

Idea for Theorem 3.1 This convergence is described in two parts. First, the finite-time result (points (i)-(iii)) uses propagation of chaos [62] to analyze how differences evolve over time, comparing the evolution of parameter particles in discrete and continuous dynamics. The approximation bound is derived using a third auxiliary dynamic ("nonlinear dynamics"), involving the triangle inequality and Grönwall s inequality, which allows us to bound output differences over time.

Second, weak convergence of the empirical distribution process relies on optimal transport theory and stability results for Wasserstein gradient flows [4], focusing on the convergence of momentum fields [4, 60]. This also requires bounding the gradient differences between discrete and continuous Transformers as they approach the mean-field limit. See Appendix D for a detailed illustration, including a description of each main step.

Idea for Theorem 4.1 We first establish the continuity of the functional gradient δQ

δρ |ρ , ensuring it remains constant if the derivative with respect to β is constant over a region. Next, we derive the key bound for Q(ρ ), which is proportional to λ, by analyzing the functional energy Q s landscape through its derivatives.

Finally, we show that the finite-time risk can approach this bound. Achieving ϵ-level loss requires Q(ρ(τ0)) ϵ for some large τ0. Applying Theorem 3.1, we show b Q(τ0) becomes sufficiently small, and since b Q(ρ(τ)) is non-increasing, it remains small for τ τ0. See Appendix E for a detailed illustration, including a description of each main step.

Proof preparation for the main theorems Appendix C.3 lists several useful lemmas essential to the main results. Specifically, Lemmas C.1 C.6 ensure the boundedness of key components and bound the output differences between discrete and continuous Transformers under different parameter settings. This boundedness is non-trivial due to the mild Assumptions 2 and 3 that fit the Transformer architecture. The technical lemmas in Appendix C.3 form the foundation for all subsequent proofs. Before introducing the nonlinear dynamics used to bound parameter differences under non-i.i.d. settings, these lemmas first establish an important oracle approximation bound result (Lemma D.6) with i.i.d. parameter settings. Additionally, they serve as key tools for bounding the (functional) gradient differences between Transformer dynamics, as shown in Lemmas D.1 D.3, which are essential for proving the approximation bound in Theorem 3.1.

In Theorem 4.1, Lemma E.1 plays a key role by demonstrating that for any ρ, there is always a nearly descent direction around ρ for Q(ρ), implying that all local minima are nearly global. This motivates further landscape analysis for bounding δQ

δρ |ρ in the main theorem.

6 Conclusion

We conclude by summarizing our key contributions and suggesting future research directions. This paper establishes the global convergence of large-scale Transformer models through gradient flow dynamics, providing a thorough theoretical foundation. Our analysis, focused on the mean-field limit with infinite width and depth, shifts optimization from parameter space to distributional probability measures. We present two main theorems: one confirming the close approximation between discrete and continuous gradient flows, and another demonstrating global convergence, highlighting that basic optimization methods can successfully navigate complex landscapes to find optimal solutions. The techniques and results from this study lay the groundwork for further exploration into Transformer optimization. Future work could explore direct gradient descent with specific focus on step sizes, and expand on the in-context learning approximation capabilities of Transformers, as initiated by [6]. Additionally, it s crucial to rigorously assess under what conditions can self-attention layers serve as universal kernels to enhance our theoretical understanding, and to determine the generalization error bounds of Transformers trained on finite samples. These directions promise to deepen the theoretical and practical insights into Transformer models.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their helpful comments. Yuan Cao is partially supported by NSFC 12301657 and Hong Kong RGC-ECS 27308624. Mengdi Wang acknowledges the support by NSF IIS-2107304, NSF CPS-2312093, ONR 1006977 and Genmab. Han Liu s research is partially supported by the NIH R01LM01372201. Jason M. Klusowski was supported in part by the National Science Foundation through CAREER DMS-2239448, DMS-2054808 and HDR TRIPODS CCF-1934924. Jianqing Fan s research was partially supported by NSF grants DMS-2210833, DMS-2053832, and ONR grant N00014-22-1-2340.

[1] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations, 2022.

[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, 2019.

[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242 252, 2019.

[4] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. In metric spaces and in the space of probability measures. In Gradient Flows, 2005.

[5] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322 332, 2019.

[6] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. In Workshop on Efficient Systems for Foundation Models @ ICML2023, 2023.

[7] Raphaël Barboni, Gabriel Peyré, and François-Xavier Vialard. On global convergence of resnets: From finite to infinite width using linear parameterization. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

[8] James Bernhard. Alternatives to the scaled dot product for attention in the transformer neural network architecture, 2023.

[9] Blake Bordelon, Hamza Tahir Chaudhry, and Cengiz Pehlevan. Infinite limits of multi-head transformer dynamics, 2024.

[10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877 1901. Curran Associates, Inc., 2020.

[11] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, 2019.

[12] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.

[13] Yihang Chen, Fanghui Liu, Yiping Lu, Grigorios Chrysos, and Volkan Cevher. Generalization of scaled deep resnets in the mean-field regime. In The Twelfth International Conference on Learning Representations, 2024.

[14] Zixiang Chen, Yuan Cao, Quanquan Gu, and Tong Zhang. A generalized neural tangent kernel analysis for two-layer neural networks. ar Xiv: Learning, 2020.

[15] Jingpu Cheng, Qianxiao Li, Ting Lin, and Zuowei Shen. Interpolation, approximation and controllability of deep neural networks. ar Xiv preprint ar Xiv:2309.06015, 2023.

[16] Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for overparameterized models using optimal transport. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS 18, page 3040 3050, Red Hook, NY, USA, 2018. Curran Associates Inc.

[17] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, 2019.

[18] George V. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2:303 314, 1989.

[19] George Dasoulas, Kevin Scaman, and Aladin Virmaux. Lipschitz normalization for selfattention layers with application to graph neural networks. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2456 2466. PMLR, 18 24 Jul 2021.

[20] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd Van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetic, Dustin Tran, Thomas Kipf, Mario Lucic, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen, and Neil Houlsby. Scaling vision transformers to 22 billion parameters. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 7480 7512. PMLR, 23 29 Jul 2023.

[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019.

[22] Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. On the global convergence of gradient descent for multi-layer resnets in the mean-field regime, 2021.

[23] Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. Overparameterization of deep resnet: Zero loss and mean-field analysis. Journal of Machine Learning Research, 23:48 1, 2022.

[24] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020.

[25] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675 1685, 2019.

[26] Weinan E, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1 41, 2019.

[27] Benjamin L. Edelman, Surbhi Goel, Sham M. Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms, 2022.

[28] Cong Fang, Hanze Dong, and Tong Zhang. Over parameterized two-level neural networks can learn near optimal feature representations. Ar Xiv, abs/1910.11508, 2019.

[29] Cong Fang, Yihong Gu, Weizhong Zhang, and Tong Zhang. Convex formulation of overparameterized deep neural networks. IEEE Transactions on Information Theory, 68(8):5340 5352, 2022.

[30] Bela A. Frigyik, Santosh Srivastava, and Maya R. Gupta. Introduction to functional derivatives. UWEE Tech Report 2008-0001, University of Washington Department of Electrical Engineering, 2008.

[31] Hengyu Fu, Tianyu Guo, Yu Bai, and Song Mei. What can a single attention layer learn? a study through the random features lens. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[32] Tianyu Guo, Wei Hu, Song Mei, Huan Wang, Caiming Xiong, Silvio Savarese, and Yu Bai. How do transformers learn in-context beyond simple functions? a case study on learning with representations. In The Twelfth International Conference on Learning Representations, 2024.

[33] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. ar Xiv preprint ar Xiv:2310.05249, 2023.

[34] Ken ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183 192, 1989.

[35] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571 8580, 2018.

[36] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 37822 37836. Curran Associates, Inc., 2022.

[37] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokkerplanck equation. SIAM Journal on Mathematical Analysis, 29, 04 2000.

[38] Tokio Kajitsuka and Issei Sato. Are transformers with one layer self-attention using low-rank weight matrices universal approximators?, 2024.

[39] Hyunjik Kim, George Papamakarios, and Andriy Mnih. The lipschitz constant of self-attention, 2021.

[40] Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations, 2023.

[41] Juno Kim and Taiji Suzuki. Transformers learn nonlinear features in context: Nonconvex mean-field dynamics on the attention landscape. In Forty-first International Conference on Machine Learning, 2024.

[42] Qianxiao Li, Long Chen, Cheng Tai, and Weinan E. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18(165):1 29, 2018.

[43] Qianxiao Li and Shuji Hao. An optimal control approach to deep learning and applications to discrete-weight neural networks. In International Conference on Machine Learning, pages 2985 2994. PMLR, 2018.

[44] Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 25(5):1671 1709, 2022.

[45] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning, pages 19689 19729. PMLR, 2023.

[46] Licong Lin, Yu Bai, and Song Mei. Transformers as decision makers: Provable in-context reinforcement learning via supervised pretraining. In The Twelfth International Conference on Learning Representations, 2024.

[47] Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean field analysis of deep Res Net and beyond: Towards provably optimization via overparameterization from depth. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6426 6436. PMLR, 13 18 Jul 2020.

[48] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412 1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics.

[49] Sadegh Mahdavi, Renjie Liao, and Christos Thrampoulidis. Memorization capacity of multihead attention in transformers. In The Twelfth International Conference on Learning Representations, 2024.

[50] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, 2019.

[51] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665 E7671, 2018.

[52] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651 2667, 2006.

[53] Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles, 2017.

[54] Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Particle dual averaging: optimization of mean field neural network with global convergence rate analysis*. Journal of Statistical Mechanics: Theory and Experiment, 2022(11):114010, nov 2022.

[55] Open AI. Gpt-4 technical report. Ar Xiv, abs/2303.08774, 2023.

[56] Felix Otto. The geometry of dissipative evolution equations: The porous medium equation. Communications in Partial Differential Equations, 26(1-2):101 174, 2001.

[57] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8:143 195, 1999.

[58] L. S. Pontryagin, V. G. Boltyanskii, R. V. Gamkrelidze, and E. F. Mishechenko. The mathematical theory of optimal processes. Zamm-zeitschrift Fur Angewandte Mathematik Und Mechanik, 43:514 515, 1963.

[59] H. Risken. The Fokker-Planck Equation: Methods of Solution and Applications. Springer, 1996.

[60] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(5863):94, 2015.

[61] Filippo Santambrogio. {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bulletin of Mathematical Sciences, 7:87 154, 2016.

[62] Alain-Sol Sznitman. Topics in propagation of chaos. In Ecole d Eté de Probabilités de Saint-Flour XIX 1989, pages 165 251. Springer, 1991.

[63] Shokichi Takakura and Taiji Suzuki. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 33416 33447. PMLR, 23 29 Jul 2023.

[64] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention for transformer models. In International Conference on Machine Learning, 2020.

[65] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. Ar Xiv, abs/2302.13971, 2023.

[66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[67] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in neural information processing systems, 29, 2016.

[68] C. Villani. Optimal Transport, Old and New, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, 2008.

[69] James Vuckovic, Aristide Baratin, and Rémi Tachet des Combes. A mathematical theory of attention. Ar Xiv, abs/2007.02876, 2020.

[70] Colin Wei, J. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets v.s. their induced kernel. In Neural Information Processing Systems, 2018.

[71] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 1(5):1 11, 2017.

[72] E Weinan, Chao Ma, and Lei Wu. Machine learning from a continuous viewpoint, i. Science China Mathematics, 63:2233 2266, 2019.

[73] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? Ar Xiv, abs/1912.10077, 2019.

[74] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021.

[75] Ruiqi Zhang, Spencer Frei, and Peter Bartlett. Trained transformers learn linear models incontext. In R0-Fo Mo:Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.

[76] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes overparameterized deep Re LU networks. Machine Learning, Oct 2019.

A Overview of Appendix

The appendix is organized as follows:

Appendix B: Additional related works are discussed.

Appendix C:

In Appendix C.1, additional notations and preliminary details are introduced. In Appendix C.2, we show the proof of Proposition C.1, concerning the existence and uniqueness of the continuous Transformer ODE (3.1). In Appendix C.3, useful lemmas for the main proofs are detailed. In Appendix C.4, the explicit formulas for pρ(H, t) and bpΘ(H, t) are explored via the adjoint sensitivity method. In Appendix C.5, high-level explanations are provided to substantiate the assumptions made in Theorem 4.1.

Appendix D: Includes proofs of main results from Section 3.

Appendix E: Includes proofs of main results from Section 4.

Appendix F: Includes proofs of all auxiliary technical results mentioned in Appendix C-E.

Appendix G: The verification of Assumptions 2 4 for a concrete example of Transformer architectures is provided.

Appendix H: We provide simple experiment results and discuss the widths and depths of Vision Transformers that guarantee convergence, achieving near-zero training loss and 100% training accuracy on the CIFAR-10 dataset.

B Additional related work

Theory of Transformers. Some very recent works have studied theoretical properties of Transformer models from different aspects. [75, 33] studied the in-context learning guarantees for single-layer Transformers to perform linear regression predictions after being trained with linear regression example tasks. [1, 6, 32] studied the in-context learning capability of Transformers through the function approximation point of view, and demonstrated that there exists Transformers with specific parameter configurations that can perform particular in-context learning tasks. [36, 45] investigated how single-layer Transformers can be trained to learn simple image models and topic models respectively.

The most closely related work to ours is [41], which is the only study we are aware of that addresses the general and universal in-context learning capability of large-scale Transformers through optimization dynamics. It shows that a two-layer MLP followed by a linear attention layer can approximate functions in a general Barron space sufficiently well as the Transformer width increases. Additionally, its corresponding mean-field dynamics, via Wasserstein gradient flow, converges to global minima for in-context feature learning. Another work exploring the mean-field limit of Transformers is [9], which examines the limit as the depth, key-query length, and number of heads increase to infinity.

In addition, we have noticed a lot of theoretical interest in identifying the optimal choice of ϕT in Assumption 2, i.e., the Lipschitz constant of the Jacobian matrix of the self-attention term. For instance, [19] suggest that ϕT can be bounded by p

N/D + poly( T F ), where poly( ) denotes a polynomial function. In the context of l2 self-attention, [39] find ϕT to be p

N log N/D, which notably does not depend on T F , and [69] demonstrate that for l1 distance metrics in attention layers, ϕT could be D log N.

Global convergence of fully connected neural networks. A line of recent works have studied the global convergence of (stochastic) gradient descent in training overparameterized neural networks in the mean-field regime [16, 51, 50, 70, 28, 29]. They consider the limit of the neural network as the width of the network at each layer goes to infinity, and models the limit of the network as a functional of the distribution of network parameters. A separate line of works also established the global convergence guarantees for training overparameterized neural networks in the neural tangent kernel regime [35, 3, 25, 76, 17, 2, 5, 11], where the gradient descent training iterates are

asymptotically equivalent to the training iterates of kernel regression based on the neural tangent kernel.

Connection between ordinary differential equation models and infinite-depth Res Nets. Our work is also closely related to the recent literature aiming to understand Res Nets by analyzing their connections to ordinary differential equations [71, 12, 43, 42, 72, 26, 22, 47, 23, 44, 7, 15, 13]. Specifically, [71, 12, 44, 15] studied the approximation of flow-based networks via discrete networks. [43, 42, 72, 26, 22, 47, 23, 44, 7] studied the optimization of the infinite-depth and infinite-width Res Nets. [13] studied the generalization properties of the Res Net trained in the mean-field regime.

C Proof setup

C.1 Additional technical notations

β := (θ, w) , g(T, β) := f(T, θ) + h(T, w)

δρ could be expressed as

δρ (β, t) = Eµ h Tr h g(Tρ(H, t), β) i pρ(H, t) i + λ

2 β 2 2. (C.1)

Additionally, we can combine Gf with Gh, and b Gf with b Gh to reformulate as

G(β, ρ, t) = Eµ h βTr g(Tρ(H, t), β) pρ(H, t) i + λβ, (C.2)

b G(β, Θ, t) = Eµ h βTr n f( b TΘ(H, t), θ)/2 h( b TΘ(H, t + t/2), w)/2

o nbpΘ(H, t + t/2) bpΘ(H, t + t)

o i + λβ. (C.3)

Remark 1. To facilitate the proof, we restate Assumptions 2 and 3 for g(T, β). Under Assumption 2, the gradient of g(T, β) respect to T and β exists. Additionally, we have

Under Assumption 2:

i. g(T, β) 2 col K T 2 col(1 + β + β 2)

ii. For every i [N + 1], we have βg(T, β):,i 2 ϕP ( T 2 col)(1 + β )

iii. vec[T ]vec[g(T, β)] 2 ϕT (N, D, T F )(1 + β + β 2)

Under Assumption 3: For any LT > 0 and any LT -Lipschitz continuous functions T1 = T1(H) and T2 = T2(H), for every i [N + 1], we have

i. Eµ θg(T1, β):,i θg(T2, β):,i 2 ϕP T ( θ , KT , LT ) sup H T1 T2 2 col,

ii. Eµ vec[T ]vec[g(T1, β)] vec[T ]vec[g(T1, β )] 2 ϕT P (N, D, sup H T1 F , KP , LT ) θ θ

iii. Eµ θg(T1, β):,i θg(T1, β ):,i 2 ϕP P (KP , sup H T1 2 col, LT ) θ θ ,

iv. Eµ vec[T ]vec[g(T1, β)] vec[T ]vec[g(T2, β)] 2 ϕT T (N, D, KT , θ , LT ) sup H T1 T2 F

Verifying all these results above only needs the basic triangle inequality of general norms, so we omit the trivial proof. We will apply them directly throughout the proofs of results. Additionally, we omit writing LT for simplicity in the proof, as all functions applied to Assumption 3 will be Lipschitz continuous with some universally bounded Lipschitz constant.

Next, we introduce some additional technical notations. Denote the identity matrix with d-dimension as Id. Define the sample space Ω:= Rdimβ [0, 1], and P(Ω) as the probability measure space

defined on Ω. For any Θ = {βt,j}t/ t+1 [L],j [M] and eΘ = {eβt,j}t/ t+1 [L],j [M], define d(Θ, eΘ) = P

t PM j=1 βt,j eβt,j . Define the local risk function as

R(H; ρ) = 1

Read[Tρ(H, 1)] y(H) 2 .

Define the nested family of compact subsets (Pr)r>0 as

Pr := {β : β r} [0, 1], r > 0.

For any ρ, ν P(Ω) and p 1, define the lp distance ρ ν p as

ρ ν p = Z 1

β |ρ(x) ν(x)|pdβdt 1/p .

Specifically, when p = 1, we have

W1(ρ, ν) = sup n Z 1

β f(ρ ν)dβdt : f is 1 Lipschitz, f(0, 0) = 0 o

β |f||ρ ν|dβdt : f is 1 Lipschitz, f(0, 0) = 0 o

(r + 1) ρ ν 1

for any ρ, ν P2 concentrated on Pr. For simplicity, any H discussed throughout this paper is assumed to lie within supp(µ).

C.2 Transformer ODE existence and uniqueness

In this section, we establish the existence and uniqueness of the solution Tρ(H, t) to the ODE presented in (3.1) for any H, given that ρ P2 is concentrated on a bounded support, specifically, Pr for some r > 0. This following proposition forms the cornerstone of the subsequent technical analyses:

Proposition C.1 (Existence and uniqueness of Transformer ODE). Under Assumptions 1 and 2, for any ρ P2 that has a bounded support, there exists a unique solution of (3.1) on t [0, 1] that is Lipschitz continuous with respect to (H, t).

Initially, we demonstrate that the integral R

β ρ(β, t)dβ is bounded. According to the definition P2, it follows that (C.4)

β ρ(β, t)dβ Z

β ρ(β, t )dβ| Cρ|t t |

for any t, t [0, 1]. Integrating (C.4) over t2 [0, 1] obtains Z

β ρ(β, t)dβ =1 + Z

β ρ(β, t)dβ Z 1

β ρ(β, t )dβdt

For the remainder of the technical proof, we will employ (C.5) without additional elaboration.

Proof of Proposition C.1. Step I: Create a small neighboring area with local Lipschitz continuity

Consider any vector β such that β r. Define F(T, t) := R

β g(T, β)ρ(β, t)dβ. For T within the rectangle {T : T H max δ}, where δ > 0 is bounded, both T 2 col and T F are also bounded. Given Assumption 2 (i), g(T, β) is universally bounded by some constant Kδ,r. Moreover,

under Assumption 2 (ii) and (iii), g(T, β) is Lipschitz continuity with some constant Lδ,r. Hence, within the rectangle {T : T H max δ} [0, 1] the following properties hold:

|F(T, t1) F(T, t2)| max{Kδ,r, Lδ,r} ρ( , t1) ρ( , t2) BL Cρ max{Kδ,r, Lδ,r}|t1 t2|. (C.6) which indicates that F(T, t) is continuous with respect to t within the rectangle {T : T H max δ} [0, 1].

Moreover, within the bounded region {T : T H max δ}, Assumption 2 (iii) ensures that vec[T ]vec[g(T, β)] 2 bounded. Consequently, g(T, β) is Lipschitz-continuous with respect to T for F . Denote this Lipschitz constant by L δ,r. Therefore, for any T, T {T : T H max δ} and t [0, 1], it follows that

|F(T, t) F(T , t)| L δ,r T T F

β ρ(β, t)dβ L δ,r(1 + Cρ/2) T T F , (C.7)

which deduces the Lipschitz continuity of F(T, t) with respect to T for F .

Step II: Show that the maximal existence interval is infinite by repeatedly using the Picard Lindelöf Theorem

Invoking the Picard-Lindelöf Theorem, there exists some ϵ > 0 such that the initial value problem

T(H, t) = F(T, t), T(H, 0) = H

has a unique solution on t [0, ϵ]. Given that this claim holds for any H, the standard ODE Extensibility Theorem guarantees a continuation of T(t) to a maximal interval of existence, denoted as [0, tmax].

Assume by contradiction that tmax < 1. From (3.1) and Assumption 2(i), for any t [0, tmax] we see that

d dt Tρ(H, t) 2 col Tρ(H, t) 2 col = Z

β g(Tρ(H, t), β)ρ(β, t)dβ 2 col

β g(Tρ(H, t), β) 2 colρ(β, t)dβ

β K(1 + ||β||2+||β||2 2)ρ(β, t) Tρ(H, t) 2 coldβ.

(C.8) Therefore, by the Grönwall s inequality, we have

Tρ(H, tmax) 2 col Tρ(H, 0) 2 col exp( Z

β K(1 + ||β||2+||β||2 2)ρ(β, s)dβds)

H 2 col exp(K(1 + Cρ/2 + r + r2)tmax) < .

This presents a contradiction to the notion that tmax < 1. This is because, By reapplying the local Picard-Lindelöf Theorem using the state T(H, tmax) as the new initial condition, we can extend the interval of existence beyond tmax. Consequently, we must conclude that tmax = 1, and the existence and uniqueness follows.

Step III: Show that the Lipschitz continuity with respect to (H, t)

In the final part of our proof, we demonstrate that Tρ(H, t) is Lipschitz continuous with respect to (H, t) for H supp(µ) and any t [0, 1]. Given that Tρ(H, t) is universally bounded within H supp(µ) and any t [0, 1], we only need to focus on establishing its Lipschitz continuity with respect to H and t separately. The Lipschitz continuity with respect to t is derived from

Tρ(H, t1) Tρ(H, t2) 2 col Z t2

β g(Tρ(H, t), β 2 colρ(β, t)dβdt (1+Cρ/2)KC(1+r+r2)(t2 t1),

for any t1, t2 [0, 1]. Given Assumption 2 (iii), we have

Tρ(H, t) Tρ(H , t) F Z

β g(Tρ(H, s), β) g(Tρ(H , s), β) F ρ(β, s)dβds

β ϕT (N, D, B exp(K(1 + Cρ/2 + r + r2)))(1 + r + r2) Tρ(H, s) Tρ(H , s) F ρ(β, s)dβds

(C.10) for any H, H supp(µ). Define LH := ϕT (N, D, B exp(K(1 + Cρ/2 + r + r2)))(1 + r + r2). Utilizing Grönwall s inequality, we establish:

Tρ(H, t) Tρ(H , t) F H H F exp(LH

β ρ(β, s)dβds) = exp(LH) H H F

given that Tρ( , 0) serves as the identity mapping. Consequently, Tρ(H, t) demonstrates Lipschitz continuity with respect to H supp(µ).

C.3 Useful technical lemmas

Lemma C.1 (Continuous Transformer output bound). Under Assumption 2, for any distribution ρ P(Ω) where R 1 0 R

β||β||2 2ρ(β, t)dβdt A2 for some constant A > 0 and for any t [0, 1], we have Tρ(H, t) 2 col H 2 col exp(K(1 + A + A2)).

Lemma C.2 (Continuous Transformer difference bound). Under Assumption 2 and given H, for any ρ, ν P2 that satisfy R 1 0 R

β||β||2 2ρ(β, t)dβdt A2 and have bounded supports Pr for some constants A, r > 0, we have that

sup t [0,1] Tρ(H, t) Tν(H, t) F Cr W1(ρ, ν).

Here, the universal constant Cr only depends on N, D, r, A, and the parameters of the assumptions.

Lemma C.3 (Continuous Transformer gradient component bound). Under Assumption 1 and 2, for any ρ P(Ω) where supp(ρ) Pr and R 1 0 R

β β 2ρ(β, t)dβdt A2, we have

sup (β,t) Pr g(Tρ(H, t), β) F

N + 1KB exp(K(1 + A + A2))(1 + r + r2),

sup (β,t) Pr pρ(H, t) F (B + B exp(K(1 + A + A2))) exp ϕT (N, D,

N + 1KB exp(K(1 + A + A2))(1 + A + A2) ,

sup (β,t) Pr

N + 1KB exp(K(1 + A + A2))(1 + r + r2)(B + B exp(K(1 + A + A2)))

exp ϕT (N, D,

N + 1KB exp(K(1 + A + A2))(1 + A + A2) + λ

Lemma C.4 (Discrete Transformer bound). Under Assumptions 2, for any Θ where 1 ML P t PM j=1 β 2 A2 for some universal constant A > 0 and at any t = 0, t/2, t, . . . , (L 1/2) t, 1, we have

b TΘ(H, t) 2 col H 2 col exp(K(1 + A + A2)).

Lemma C.5 (Discrete Transformer difference bound). Under Assumption 1 and 2, for any H, let Θ = {βt,j}t/ t+1 [L],j [M], eΘ = {eβt,j}t/ t+1 [L],j [M] such that max{ βt,j , eβt,j } r for any t = 0, . . . , (L 1) t, j = 1, . . . , M. Then, we have that

b TΘ(H, t) b TeΘ(H, t) F Cr 1 MLd(Θ, eΘ).

Here the universal constant Cr only depends on N, D, r, and the parameters of the assumptions.

Lemma C.6 (Discrete Transformer gradient component bound). Under Assumption 1 and 2, for any Θ such that supt,j βt,j 2 r2 and 1 ML P

t PM j=1 β 2 A2, we have, for any t = 0, t/2, . . . , (L 1/2) t, 1

sup (θ,t) Pr max{ f( b TΘ(H, t), θ) F , h( b TΘ(H, t), w) F , g( b TΘ(H, t), β) F }

N + 1KBT (1 + r + r2),

sup (β,t) Pr sup i [N+1] max{ θf( b TΘ(H, t), θ):,i , wh( b TΘ(H, t), w):,i , βg( b TΘ(H, t), β):,i } ϕP (BT )(1 + r),

sup (β,t) Pr bpΘ(H, t) F (B + BT ) exp ϕT (N, D,

N + 1KBT )(1 + A + A2) ,

where BT = B exp(K(1 + A + A2)). Lemma C.7 (Norm average concentration). Under Assumption 2, consider a parameter setting Θ = {βt,j}t/ t+1 [L],j [M] i.i.d. drawn from {ρ(β|t)}t/ t+1 [L],j [M] where ρ P2 is concentrated on Pr and satisfies R

β ρ(β, t)dβ = 1 for every t [0, 1]. Then, with probability at least 1 exp( δ) with respect to the parameter initialization Θ(0), we have

j=1 βt,j 2 Z 1

β β 2ρ(β, t)dβdt| L 1 +

δ + log(L + 1)

for any δ > 0. Here, the notation hides the dependencies on r and the parameters specified in the assumption. Lemma C.8 (Matrix product difference bound). Suppose that for some d > 0, the matrices A1, A2, . . . , AL and B1, B2, . . . , BL satisfy the following conditions:

1. For each l = 1, . . . , L, the norms of the matrices are bounded as Al 1 + al, Bl 1 + bl, where al, bl > 0.

2. The product of the increments for each matrix is bounded by QL l=1 1 + max{al, bl} C for some constant C > 0.

Under these conditions, it holds that

l=1 Al Bl .

C.4 Solution of adjoint ODE

In this section, we define the partial derivative

pρ(H, t) := R(H; ρ)

Tρ(H, t) RD (N+1)

without specifying its explicit formula. Denote the derivative of Tρ(H, 1) to Tρ(H, t) (after vectorization) by the Jacobian Jρ(H, t) R(N+1)D (N+1)D, and assume that Jρ(H, t) exists for any t [0, 1]. Then [58] shows that Jρ(H, t) satisfies the adjoint equation of the ODE.

Jρ(H, t) = Jρ(H, t) vec[T ] n vec[ Z

β g(Tρ(H, t), β)ρ(β, t)dβ] o (C.11)

for any t [0, 1]. By applying the chain rule and exchanging the order of the derivative and integral, we have, for any t [0, 1], that

vec[pρ(H, t)] = R(H; ρ) vec[Tρ(H, 1)] vec[Tρ(H, 1)]

vec[Tρ(H, t)]

= vec[ R(H; ρ)

Tρ(H, 1) ] vec[Tρ(H, 1)]

vec[Tρ(H, t)]

= vec[pρ(H, 1)] Jρ(H, t)

Hence, by taking the derivative with respect to t, we obtain that

vec[ pρ(H, t)] = vec[pρ(H, t)] Z

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ

with the solution

vec[pρ(H, t)] = vec[pρ(H, 1)] exp Z 1

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβds . (C.12)

On the other hand, we have

pρ(H, 1) = R(H; ρ)

Tρ(H, 1) = (Read[Tρ(H, 1)] y(H))Eread, (C.13)

where Eread is a D (N + 1) zero matrix except 1 at the (d + 1, N + 1) th entry. Moreover, from (C.12) and (C.13), we see that

vec[pρ(H, t)] = (Read[Tρ(H, 1)] y(H)) exp Z 1

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβds

Additionally, we could explicitly derive the formula for bpΘ(H, t) = b R(H;Θ) b TΘ(H,t) for the discrete Transformer. By applying the chain rule multiple times across each layer with the encoder either f or h, for any t = 0, t, . . . , (L 1) t, 1, we have

vec[bpΘ(H, t)] = (Read[ b TΘ(H, 1)] y(H)) n Y

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)] o

vec[bpΘ(H, t + t/2)] = (Read[ b TΘ(H, 1)] y(H)) n Y

(s t)/ t+2 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)] o

C.5 Explanation for assumptions made in Theorem 4.1

The justification of the two assumptions outlined in Theorem 4 warrants careful consideration. While we provide only high-level justifications, they underpin significant aspects of our theoretical framework.

For the first assumption, we argue that the regularization parameter λ, which penalizes the magnitude of the parameter norms, implicitly promotes solutions that are confined to a compact subset of the parameter space. This rationale is conceptual and requires that regularization effectively constrains the growth of the parameter norms, thereby localizing the solutions.

The second assumption concerns the separation property. It is naturally satisfied as long as the origin 0dimθ+dimw remains an interior point of supp(ρ ( , t)). This condition is relatively mild and

is generally satisfied. The challenge arises in verifying that supp(ρ ( , t)) for the α2 component extends to encompass the entire space K. While direct confirmation is elusive, it is suggested by [23] initially spans K, this expansive support property is maintained at any finite time. Thus, we conjecture that the condition holds under these circumstances, providing a basis for this assumption.

D Proofs of main results in Section 3

D.1 Proof of Theorem 3.1

This convergence is detailed in two parts. First, the finite time result, as stated in points (i)-(iii), utilizes a concept in probability theory known as propagation of chaos [62] to examine how differences evolve uniformly across a given time interval. In the context of our model, this involves comparing how parameter particles evolve under discrete versus continuous dynamics.

Specifically, the approximation bound is derived using a third auxiliary dynamic, termed the "nonlinear dynamics," by bounding the dynamic difference over the entire finite time interval. This process involves applying the triangle inequality to each component and concluding with a Grönwall s inequality. Since the Transformer output can be bounded by the dynamic difference, we can then bound the output difference at any specific time, along with the difference regarding different time for the same dynamic. By applying a probability union bound on a dense set of L2 points, we can extend this to bound the maximal difference over any time interval.

Secondly, the weak convergence of the empirical distribution process leverages optimal transport theory alongside abstract stability results for Wasserstein gradient flows [4]. This argument involves detailed analysis of the discretization of particle distributions in space, particularly focusing on obtaining the convergence of the sequence of momentum fields [4, 60] that could directly leads to the result. To obtain the convergence of the momentum field sequence, we also need to bound the parameter gradient difference between discrete and continuous Transformers as the mean-field limit.

Preparatory Step: Nonlinear dynamics

We first define some auxiliary quantities and differential equations that are useful for the proof. For any gradient flow parameter setting Θ(τ) = {β(τ) t,j }t,j, from its definition (2.7), we could rewrite the dynamics as

β(τ) t,j = β(0) Z τ

0 b G(β(s) t,j , Θ(s), t)ds (D.1)

for any gradient flow time τ > 0, depth index t = 0, t, . . . , (L 1) t and width index j = 1, . . . , M. For simplicity, any mentioned constant only depends on N, D, τ, λ and the parameters of the assumptions, and we abbreviate the subscript t = 0, t, . . . , (L 1) t, j = 1, . . . , M by t, j throughout the proof.

Inspired by the propagation of chaos" idea [62], we could define the nonlinear dynamics" with the same initialization setting eΘ(τ) = {eβ(τ) t,j }t,j, i.e. ( eβ(τ) t,j = eβ(0) R τ 0 G(eβ(s) t,j , ρ(s), t)ds, eβ(0) t,j = β(0) t,j (D.2)

for any t, j. Here, (ρ(s))s 0 is the solution to the Wasserstein gradient flow (3.5), of which the uniqueness is implied by Proposition 3.2. Since (D.2) is just the particle flow of (3.5), its existence and uniqueness are guaranteed by Proposition 3.2.

Observing that {βt,j}t,j are independent due to the dynamics only involving (ρ(s))s 0 with i.i.d initialization over ρ0, we can consider {eβ(s) t,j }t,j as i.i.d. samples drawn from {ρ(s)}t,j. In addition,

from Propositions 3.2 and D.1, for any t, j we have max{ βt,j , eβt,j } Rτ, where Rτ is defined as in these propositions and does not depend on M and L.

Preparatory Step: Bound the gradient difference regarding parameter settings

As the second preparatory step for the proof of Theorem 3.1, we present the following three lemmas that will be helpful:

Lemma D.1 (Continuous gradient difference bound). Suppose Assumptions 1-3 hold. If we have ρ, ν P2 concentrated on Pr for some r > 0, and β, eβ such that max{ β , eβ } r, then G(β, ρ, t) G(eβ, ν, t) CG exp(CGW1(ρ, ν)) 1 + (1 + λ) β eβ .

for any t [0, 1]. Here, the constant CG only depends on N, D, r, and the parameters of the assumptions. Lemma D.2 (Discrete gradient difference bound). Under Assumption 1-3, for any Θ = {βt,j}t/ t+1 [L],j [M], eΘ = {eβt,j}t/ t+1 [L],j [M] such that max{ βt,j , eβt,j } r for any t = 0, . . . , (L 1) t, j = 1, . . . , M. Then we have that

b G(β, Θ, t) b G(eβ, eΘ, t) CG 1

MLd(Θ, eΘ) + (1 + λ) β eβ .

for any β, eβ {β : β r} and t = 0, t, . . . , (L 1) t. Here, the constant CG only depends on N, D, r and the parameters of the assumptions and d(Θ, eΘ) is defined as P

t PM j=1 βt,j eβt,j .

Lemma D.3 (Oracle gradient approximation with discretization). Under Assumptions 1-3, suppose that the parameter setting Θ is i.i.d. drawn from {ρ(β|t)}t/ t+1 [L],j [M] for some ρ P2

concentrated on Pr and satisfies that R

β ρ(β, t)dβ = 1 for any t [0, 1]. Then with probability at least 1 4 exp( δ) with respect to the parameter initialization Θ(0), we have b G(β, Θ, t) G(β, ρ, t) L 1 +

δ + log(L + 1)

G(β, ˆρ, t) G(β, ρ, t) L 1 +

δ + log(L + 1)

M , b G(β, Θ, t) G(β, ˆρ, t) L 1

for any β {β : β r}, t = 0, t, . . . , (L 1) t, 1 and any δ > 0. Here, hides the dependencies on N, D, r and the parameters of the assumptions.

Proof of Theorem 3.1. Our proof consists of several steps outlined below:

Step I: Show the W2 continuity of parameter (sample) distributions

Our analysis commences with the bound for 0 < s1 < s2 < τ, we have

W2(ˆρ(s1), ˆρ(s2))2 1 ML

j=1 |β(s1) t,j β(s2) t,j |2 (s2 s1)

s1 b G(β(s) t,j , Θ(s), t) 2ds,

where each particle at time s1 is paired with its position at time s2, leveraging the Jensen s inequality. Recalling the identity d b Q(Θ(s))

j=1 b G(β(s) t,j , Θ(s), t) 2

shown in Proposition D.1, it follows that

W2(ˆρ(s1), ˆρ(s2)) (s2 s1)1/2 b Q1/2(Θ(0)) λ

where the last inequality uses (D.30). Since A2 0 1 + λ 1, we see that W2(ˆρ(s1), ˆρ(s2)) C(1 + λ)(s2 s1)1/2 for some constant C dependent on the parameters listed in the result. Similarly, we have

W2(ρ(s1), ρ(s2))2 E eβ(s2) eβ(s1) 2 (s2 s1) Z s2

G(β, ρ(s), t) 2 dβdtds (s2 s1)Q(ρ0) (s2 s1)C(1+λ

where eβ(s1) ρ(s1)(β, t) is embedded with its future position at eβ(s2). The last step is feasible by setting C large enough, noticing that Q(ρ0) λA2 0/2. To summarize, we have

max n W2(ˆρ(s1), ˆρ(s2)), W2(ρ(s1), ρ(s2)) o C(1 + λ) s2 s1 (D.3)

for some constant C > 0 dependent on the parameters listed in the result.

Step II: Bound the difference between gradient flow dynamics and non-linear dynamics

Next, we aim to bound (s) := sups [0,s] supt,j β(s ) t,j eβ(s ) t,j for any s [0, τ]. Taking the difference of (D.1) and (D.2), we obtain that

β(τ) t,j eβ(τ) t,j Z τ

b G(β(s) t,j , Θ(s), t) G(eβ(s) t,j , ρ(s), t) ds

b G(β(s) t,j , Θ(s), t) b G(eβ(s) t,j , eΘ(s), t) ds

b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t) ds

MLd(Θ(s), eΘ(s)) + (1 + λ) β(s) eβ(s) ds + Z τ

b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t) ds

MLd(Θ(s), eΘ(s)) + (1 + λ) β(s) eβ(s) ds + sup s [0,τ]

b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t)

(D.4) where the final inequality stems from Lemma D.2, employing a constant CG dependent on the parameters listed in the result. By taking the supremacy over t, j in (D.4), and considering 1 MLd(Θ(s), eΘ(s)) supt,j β(τ) t,j eβ(τ) t,j for any s 0, we derive

sup t,j β(τ) t,j eβ(τ) t,j CG(2+λ) Z τ

0 sup t,j β(s) eβ(s) ds+sup t,j sup s [0,τ]

b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t)

Further supremacy taken over s [0, τ] yields:

(τ) e (τ) + CG(2 + λ) Z τ

0 (s)ds, (D.5)

where we define e (τ) := supt,j sups [0,τ] b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t) for simplicity of notation. Apply the Grönwall s inequality to (D.5) yields

(τ) exp CG(2 + λ)τ e (τ). (D.6)

It remains to bound e (τ) to bound (τ). It s worth noting that by Lemmas D.1 and D.2, for any s1, s2 [0, τ] and t, j, we have b G(eβ(s2) t,j , eΘ(s2), t) G(eβ(s2) t,j , ρ(s2), t) b G(eβ(s1) t,j , eΘ(s1), t) G(eβ(s1) t,j , ρ(s1), t)

b G(eβ(s1) t,j , eΘ(s1), t) b G(eβ(s2) t,j , eΘ(s2), t) + G(eβ(s1) t,j , ρ(s1), t) G(eβ(s2) t,j , ρ(s2), t)

C (exp(C W1(ρ(s1), ρ(s2))) 1) + C 1 MLd(eΘ(s1), eΘ(s2)) + C (1 + λ) eβ(s1) t,j eβ(s2) t,j

exp(C C s2 s1) 1 + (1 + λ) sup t,j eβ(s1) t,j eβ(s2) t,j

s2 s1 + sup t,j eβ(s1) t,j eβ(s2) t,j

(D.7) for some constant C dependent on the parameters listed in the result. Moreover, for any t, j, we have

eβ(s1) t,j eβ(s2) t,j Z s2

s1 G(eβ(s) t,j , eΘ(s), t) s2 s1

s1 G(eβ(s) t,j , eΘ(s), t) 2 (1+λ2) s2 s1,

(D.8) where the universal boundedness of G(eβ(s) t,j , eΘ(s), t) λeβ(s) t,j can be readily derived from the third result of Lemma C.3 alongside Assumption 2 (ii). Therefore, we obtain b G(eβ(s2) t,j , eΘ(s2), t) G(eβ(s2) t,j , ρ(s2), t) b G(eβ(s1) t,j , eΘ(s1), t) G(eβ(s1) t,j , ρ(s1), t) (1+λ+λ2+λ3) s2 s1 (D.9)

for any s1, s2 [0, τ].

Define τn = nτ/L2 for n = 1, 2, . . . , L2. Leveraging Lemma D.3 and applying the union bound over n [L2] (updating the δ in the lemma to δ + 2 log L), we obtain, with a probability of at least 1 exp( δ) with respect to the parameter initialization Θ(0):

sup s {τn}n [L2]

b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t) (1 + λ + λ2 + λ3)L 1 +

δ + log L + log(L + 1)

δ + log(L + 1)

M Furthermore, leveraging (D.9), we deduce

sup t,j sup s [0,τ]

b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t) sup t,j sup s {τn}n [L2]

b G(eβ(s) t,j , eΘ(s), t) G(eβ(s) t,j , ρ(s), t)

+ sup |s1 s2| τ/L2(1 + λ + λ2 + λ3) s2 s1

δ + log(L + 1)

Returning to (D.6), we establish that with a probability of at least 1 exp( δ) with respect to the parameter initialization Θ(0),

sup s [0,τ] sup t,j β(s) t,j eβ(s) t,j = (τ) L 1 +

δ + log(L + 1)

We denote the event where the above inequality holds as E1, thus we have P(E1) 1 exp( δ).

Step III: Prove the finite time results

Now, we are poised to demonstrate the results in Theorem 3.1 that concern supremacy over s [0, τ]. The verification of Lemmas C.4 reveals the existence of a universal constant Bτ := B exp(K(1 + Rτ + R2 τ)) such that

max{ Tρ(τ)(H, t) 2 col, b TΘ(τ)(H, t) 2 col, b TeΘ(τ)(H, t) 2 col} Bτ

for any H and t [0, 1].

Utilizing Lemma D.6 and applying the union bound over n [L2], we observe

sup s {τn}n [L2] b TeΘ(s)(H, t) Tρ(s)(H, t) F L 1 +

δ + log(L + 1)

Additionally, note that for any s1, s2 [0, τ], we derive from Lemmas C.2 and C.5 that b TeΘ(s1)(H, t) Tρ(s1)(H, t) F b TeΘ(s2)(H, t) Tρ(s2)(H, t) F

b TeΘ(s1)(H, t) b TeΘ(s2)(H, t) F + Tρ(s1)(H, t) Tρ(s2)(H, t) F

sup t,j eβ(s1) t,j eβ(s2) t,j + W1(ρ(s1), ρ(s2))

where the last inequality is derived utilizing (D.3) and (D.8). Consequently, we have

sup s [0,τ] b TeΘ(s)(H, t) Tρ(s)(H, t) F sup s {τn}n [L2] b TeΘ(s)(H, t) Tρ(s)(H, t) F + sup |s2 s1| τ/L2 s2 s1

δ + log(L + 1)

M (D.11) with probability at by 1 exp( δ). We denote the event where the above inequality holds as E2, thus we have P(E2) 1 exp( δ). Considering that {eβ(s) t,j }t,j could be regarded as i.i.d. samples

drawn from {ρ(s)}t,j, employing a similar method with the concentration guarantee from Lemma C.7, we can readily deduce the existence of an event E3 with P(E3) 1 exp( δ) such that, under E3, we have

sup s [0,τ] | 1

j=1 eβ(s) t,j 2 Z 1

β β 2ρ(s)(β, t)dβdt| L 1 +

δ + log(L + 1)

Now, let s analyze the scenario under the probability event E1 E2 E3 with P(E1 E2 E3) 1 3 exp( δ). Lemma C.5 demonstrates that

sup s [0,τ] |Read[ b TΘ(s)(H, t)] Read[ b TeΘ(s)(H, t)]| sup s [0,τ] b TΘ(s)(H, t) b TeΘ(s)(H, t) F

sup s [0,τ] sup t,j β(s) t,j eβ(s) t,j L 1 +

δ + log(L + 1)

This further implies, by (D.11), that

sup s [0,τ] |Read[ b TΘ(s)(H, t)] Read[Tρ(s)(H, t)]|

sup s [0,τ] |Read[ b TΘ(s)(H, t)] Read[ b TeΘ(s)(H, t)]|+ sup s [0,τ] b TeΘ(s)(H, t) Tρ(s)(H, t) F

δ + log(L + 1)

Since max{ Tρ(τ)(H, t) 2 col, b TΘ(τ)(H, t) 2 col, b TeΘ(τ)(H, t) 2 col} is universally bounded, (D.13) immediately indicates that

sup s [0,τ] | b R(Θ(s)) R(ρ(s))| sup s [0,τ] |Read[ b TΘ(s)(H, t)] Read[Tρ(s)(H, t)]| L 1+

δ + log(L + 1)

sup s [0,τ] | b Q(Θ(s)) Q(ρ(s))| sup s [0,τ] | b R(Θ(s)) R(ρ(s))|+ sup s [0,τ] | 1

j=1 eβ(s) t,j 2 Z 1

β β 2ρ(s)(β, t)dβdt|

δ + log(L + 1)

Step IV: Prove the weakly convergence

For the remainder of the proof, we adopt a similar approach as in the proof of Theorem 2.6 in [16]. We denote ˆρ as ˆρM,L for any given M and L. It s essential to note that we treat ˆρM,L as probability measures in this step of the proof.

Recalling (D.3), for any s1, s2 [0, τ], we have

W2(ˆρ(s1) M,L, ˆρ(s2) M,L) C(1 + λ) s2 s1

for some constant C dependent on the parameters listed in the result. We observe that the family of curves (s 7 ˆρ(s) M,L)M,L is equicontinuous in W2 on [0, τ], uniformly in M, L. Additionally, the family (ˆρM,L)M,L lies within a W2 ball, thus weakly precompact. As the weak topology is weaker than the topology induced by W2, according to the Arzelà Ascoli theorem, along any sequence where L and log L/M , we can identify a subsequence that converges weakly to a certain process (ν(s))s 0 P2 R, concentrated on PRτ at all times. In the subsequent analysis, we solely focus on this subsequence, still denoted as (ˆρM,L)M,L.

For any t [0, 1], let s define the sequence (E M,L)M,L of momentum fields, which is a vector-

valued measure on [0, τ] Ω, denoted by EM,L := b G(β, Θ(s) M,L, t)ˆρ(s)(β, t)ds. We also define E := G(β, ν(s), t)ν(s)(β, t)ds.

Considering that both ˆρM,L and ν are concentrated on PRτ , we also have uniform convergence in the Bounded Lipschitz metric. Hence, for any bounded and Lipschitz function φ : [0, τ] Rdimβ Rdimβ, it holds ˆρ(s) νs BL 0 uniformly among s [0, τ] along the sequence.

Note that Z τ

β φ d(EM,L E) φ max

b G(β, Θ(s) M,L, t) G(β, ν(s), t) ˆρ(s)(β, t)dβdtds

β φ (ˆρ(s) νs)(β, t)dβdtds

b G(β, Θ(s) M,L, t) G(β, ˆρ(s) M,L, t) ˆρ(s)(β, t)dβdtds

G(β, ˆρ(s), t) G(β, ν(s), t) ˆρ(s)(β, t)dβdtds

+ sup s [0,τ] ˆρ(s) νs BL

L 1 + sup s [0,τ] ˆρ(s) νs BL + sup s [0,τ], β Rτ

G(β, ˆρ(s), t) G(β, ν(s), t)

δ + log(L + 1)

M + sup s [0,τ] ˆρ(s) νs BL

(D.16) for some constant CE dependent on the parameters listed in the result and with probability at least 1 exp( δ) with respect to the parameter initialization Θ(0) for any δ > 0. Here, the third inequality of (D.16) utilizes the third result of Lemma D.3, and the fourth inequality uses the second result of Lemma D.3, following a similar process in Step II to achieve supremacy over s [0, τ].

From (D.16), we infer that R τ 0 R 1 0 R

β φ d(EM,L E) 0 almost surely along the sequence. Hence, EM,L converges weakly to E almost surely along the sequence, and the particle gradient flow for (ν(τ))τ 0 almost surely satisfies (2.7) on [0, τ] for any arbitrarily given τ > 0. According to the Fokker-Planck equation without noise involved [59], we conclude that (ν(τ))τ 0 almost surely satisfies (3.5). Consequently, the uniqueness stated in Proposition 3.2 ensures that (ν(τ))τ 0 = (ρ(τ))τ 0 almost surely.

D.2 Proof of Proposition 3.1

Proof. Suppose that the Fréchet derivative δR

δρ indeed exists, we establish

δρ (θ, w, t) = δR

δρ (θ, w, t) + λ

2 ( θ 2 2 + w 2 2).

Therefore, it suffices to show that the Fréchet derivative of L with respect to ρ is δR

δρ (β, t) = Eµ h Tr g(Tρ(H, t), β) pρ(H, t) i . (D.17)

Denote ρη = ρ + η(ν ρ). We provide the following lemma to bound Tρη(H, 1) Tρ(H, 1) by expanding the first-order derivative as follows

Lemma D.4 (First-order derivative of Transformer output). Under Assumption 2, for any H and ρ, ν P2 that have bounded supports, we have

vec[Tρη(H, 1) Tρ(H, 1)] =η Z 1

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβ)

vec[g(Tρ(H, t), β)](ν ρ)(β, t)dβdt + o(η), (D.18) where ρη := ρ + η(ν ρ).

Given (D.18), we observe from the solution of pρ (C.13) and (C.14) that Read[Tρ(H, 1)] y(H) Read[Tρη(H, 1) Tρ(H, 1)]

=vec[pρ(H, 1)] vec[Tρη(H, 1) Tρ(H, 1)]

β vec[pρ(H, t)] vec[g(Tρ(H, t), β)](ρ ν)(β, t)dβdt + o(η)

β Tr g(Tρ(H, t), β) pρ(H, t) (ρ ν)(β, t)dβdt + o(η),

Hence, by applying (D.19) to the risk function, we obtain

R(ρη) R(ρ) = 1

2Eµ h Read[Tρη(H, 1)] y(H) 2 Read[Tρ(H, 1)] y(H) 2i

= Eµ h Read[Tρ(H, 1)] y(H) Read[Tρ(H, 1) Tρη(H, 1)] i

+ Read[Tρ(H, 1) Tρη(H, 1)]O(Read[Tρη(H, 1) Tρ(H, 1)])

δρ , ν ρ E + o(η),

which indicates (D.17) and concludes the proof.

D.3 Proof of well-posedness of Wasserstein gradient flow

Proof. Following a similar idea as Proposition 2.5 of [16], we leverage the general theory of Wasserstein gradient flow developed in [4]. Define the functional family Qr(ρ) as

Qr(ρ) = Q(ρ) ρ(Pr) = 1, otherwise.

For any r > 0, let s consider any admissible transport γ PΩ Ωconcentrated on Pr. By definition, both of its marginals, denoted by ρ1 and ρ2, are concentrated on Pr. We define the transport cost for γ as

Cp(γ) := ( Z |x y|pdγ(x, y))1/p

for p 1. Additionally, we denote the transport interpolation as ργ α := ((1 α)ρ1 + αρ2)#γ. Our proof consists of several steps outlined below.

Step I: Show that Qr is proper and continuous for W2 on its closed domain:

Note that the parameters r, D, N, λ remain fixed throughout this proof step, so we hide the constant dependencies on them. For any (β, t) Pr, we have Qr(δ(β,t)) = 1

2Eµ[(Read[H + tg(H, β)] y(H))2] + λ

2 β 2 < . This indicates that Qr is proper. Moreover, for any ρ, ν P2 whose bounded support belong to Pr, Lemma C.1 ensures that

Tρ(H, t) + Tν(H, t) F = O(1),

and Lemma C.2 guarantees that

Tρ(H, t) Tν(H, t) F = O(W1(ρ, ν)) = O(W2(ρ, ν))

for any H. Therefore, we have

R(ν) R(ρ) = 1

2Eµ h Read[Tν(H, 1)] y(H) 2 Read[Tρ(H, 1)] y(H) 2i

= Eµ h Read[Tρ(H, 1)] y(H) Read[Tρ(H, 1) Tν(H, 1)] i

+ Read[Tρ(H, 1) Tν(H, 1)]O(Read[Tν(H, 1) Tρ(H, 1)]) = O(W2(ρ, ν)).

Furthermore, since both ρ and ν have bounded support, β 2 is Lipschitz continuous with respect to (β, t). Therefore, by the Kantorovich-Rubinstein Theorem (see Theorem 5.10 of [68], for example), we have λ

β β 2(ρ ν)dβdt = O(W1(ρ, ν)) = O(W2(ρ, ν)). (D.21)

Combining (D.20) and (D.21), we obtain that Q(ρ) Q(ν) = O(W2(ρ, ν)). Therefore, Qr is continuous for W2 on its closed domain.

Step II: Show that α 7 Q(ργ α)/C2 2(γ) is differentiable and has a Lipschitz continuous derivative

Let s denote h(α) := Qr(ργ α). Lemma C.3 ensures that for any ρ P2 with bounded support belonging to Pr, we have bounded δQ

δρ ( , ) F on Pr. Therefore, Qr(ργ α) is differentiable with respect to t, and the derivative reads

(1 α)(β1, t1) + α(β2, t2) ih (β1, t1) (β2, t2) dγ((β1, t1), (β2, t2)) o .

(D.22) Then, it suffices to show that h (α) is Lipschitz continuous. To accomplish this, we first propose the following lemma for later use:

Lemma D.5 (Locally Lipschitz of ρ for the gradient). Under Assumptions 1 and 2, for any ρ, ν P2 concentrated on Pr, there exists some constant Lr depending on r, N, D and parameters of the assumptions such that sup t [0,1] pρ(H, t) pν(H, t) F Lr ρ ν 1,

sup (β,t) Pr

ρ (β, t) δQ

ν (β, t) Lr ρ ν 1.

Returning to the lemma proof, for α1, α2 [0, 1], by the triangle inequality we have |h (α1) h (α2)| J1 + J2 where

J1 := Z dδQ

(1 α1)(β1, t1) + α1(β2, t2) h (β1, t1) (β2, t2) i dγ((β1, t1), (β2, t2))

(1 α1)(β1, t1) + α1(β2, t2) h (β1, t1) (β2, t2) i dγ((β1, t1), (β2, t2))

sup (β,t) Pr

ρ (β, t) δQ

ν (β, t) Z (β1, t1) (β2, t2) 1dγ((β1, t1), (β2, t2))

Lr ργ α1 ργ α2 1 C1(γ)

Lr C2 1(γ)|α1 α2|

Lr C2 2(γ)|α1 α2| (D.23) by Lemma D.5. The final inequality of (D.23) applies Hölder s inequality to obtain that C2 1(γ) C2 2(γ). Furthermore,

J2 := Z dδQ

nh (1 α1)(β1, t1) + α1(β2, t2) ih (β1, t1) (β2, t2) i dγ((β1, t1), (β2, t2)) o

nh (1 α2)(β1, t1) + α2(β2, t2) ih (β1, t1) (β2, t2) i dγ((β1, t1), (β2, t2)) o

sup (β,t) Pr

ρ=ργ α2 (β, t)

α1 α2 Z (β1, t1) (β2, t2) 2dγ((β1, t1), (β2, t2))

L r C2 2(γ)|α1 α2| (D.24)

where L r := sup(β,t) Pr δQ

δρ |ρ=ργ α2(β, t) < from Lemma C.3. Combining (D.23) and (D.24) leads us to the result that h (α)/C2 2(γ) is Lipschitz continuous.

Step III: Show the well-posedness of Wasserstein gradient flow at some finite time

We follow a similar approach to the proof of Proposition 2.5 in [16]. Since h (α) is λh C2 2(γ)- Lipschitz continuous with respect to α for some λh, the well-posedness of the Wasserstein gradient flow for Qr with the velocity field constrained on Pr is a corollary of Theorem 11.2.2 of [4]. Specifically, there exists a unique curve (ρ(τ) r )τ 0 continuous in P2 such that:

dρ(τ) r dτ = divβ(ρ(τ) r v(τ) r )

v(τ) r (β, t) =

( G(β, ρ(τ) r , t), (β, t) Pr, 0, otherwise.

for ρ(τ) r -a.e. Given the initialization ρ0 concentrated on PR, for any r > R, the unique ρ(τ) r exhibits a first exit time denoted as

τr := inf{τ > 0 : ρ(τ) r (Pr) < 1}.

By defining this exit time, for any r > r and τ [0, τr], we observe vr(τ)(β, t) = G(β, ρ(τ) r , t) and v r(τ)(β, t) = G(β, ρ(τ) r , t). Due to uniqueness, we infer ρ(τ) r = ρ(τ) r on τ [0, τr]. Considering ρ(τ) r as the solution to (3.5), we establish the existence and uniqueness of the Wasserstein gradient flow for Q over [0, τr].

Step IV: Show the well-posedness of Wasserstein gradient flow at all time

To establish the Wasserstein gradient flow s definition for τ 0, it s necessary to demonstrate that limr τr = . For any r > R, according to the energy identity in Theorem 11.2.1 of [4], on [0, tr], we observe that τ 7 Q(ρ(τ)) is non-increasing. Specifically, this represents

ρ=ρ(τ), divβ(ρ(τ)G(β, ρ(τ), t)) E

D G(β, ρ(τ), t), divβ(ρ(τ)G(β, ρ(τ), t)) E dβdt

β ρ(τ) G(β, ρ(τ), t) 2 2dβdt 0.

Therefore, for any τ [0, τr], utilizing Lemma C.1, we have

Q(ρ(τ)) Q(ρ0) =Eµ h1

Read[Tρ0(H, 1)] y(H) 2i + λ

β β 2ρ0(β, t)dβdt

2Eµ[( Tρ0(H, 1) 2 col + B)2] + λR2

Eµ[( Tρ0(H, 1) 2 2 col + B2] + λR2

B2 + B2 exp K(1 + R + R2) 2 + λR2

Thus, we have R 1 0 R

β β 2ρ(τ)(β, t) 2 λQ(ρ(τ)) R2 + λ 1 2B2 + 2B2 exp K(1 + R +

R2) 2 = A2 0. According to Assumption (ii) and Lemma C.3, for any (β, t) Pr, we have

v(τ) r (β, t) λβ = G(β, ρ(τ), t) λβ

β n g(Tρ(H, t), β):,i o pρ(H, t):,i

sup i [N+1] βg(Tρ(H, t), β):,i

i=1 pρ(H, t):,i

N + 1 sup i [N+1] βg(Tρ(H, t), β):,i pρ(H, t) F

N + 1ϕP ( Tρ(H, t) 2 col) pρ(H, t) F (1 + β ) CR(1 + β ), (D.27) where CR :=

N + 1ϕP (B exp(K(1 + A0 + A2 0)))(B + B exp(K(1 + A0 +

A2 0))) exp ϕT (N, D,

N + 1KB exp(K(1 + A0 + A2 0))(1 + A0 + A2 0) . Applying (D.27) to the gradient flow equation dβ(τ)

dτ = v(τ) r (β, t), β(0) = β

for τ 0, we obtain d β(τ)

dτ = v(τ) r (β,t), β(τ) β(τ) v(τ) r (β, t) λβ CR(1 + β(τ) ). This indicates β(τ) ( β + 1) exp(CRτ) 1 (R + 1) exp(CRτ) 1 (D.28)

by the Grönwall s inequality. Therefore, for any T > 0, ρ(T ) is concentrated on P(R+1) exp(CRT ) 1, implying that for r > (R + 1) exp(CRT), we have τr > T. Hence, we conclude limr τr = , establishing the existence of a unique Wasserstein gradient flow from (3.5) over τ > 0.

Eventually, we establish the three properties listed in Proposition 3.2 for ρ(τ). By (3.5), for any t [0, 1], we have Z

β ρ(τ)(β, t)dβ = Z

β ρ(0)(β, t)dβ + Z τ

β divβ(ρ(s)G(s)(β, ρ(s), t))dβ ds

indicated by the Divergence Theorem as ρ(s) has bounded support. Next, for any τ 0, (D.28) shows that ρ(τ) is concentrated on PRτ . Moreover, (D.25) now holds for any τ > 0, implying R 1 0 R

β β 2ρ(τ)(β, t) A2 0 for any τ 0.

D.4 Proof of well-posedness of gradient flow

Proposition D.1 (Existence and uniqueness of gradient flow). Under Assumptions 1-3, for any initialization of Θ(0) i.i.d. drawn from {ρ0(θ, w|t)}t,j, there exists a unique solution (Θ(τ))τ 0 for (2.7). Additionally, for any τ 0, we have

i. Θ(τ) has a bounded support, meaning supt,j( θ(τ) t,j 2 2 + w(τ) t,j 2 2) Rτ.

t PM j=1( θ(τ) t,j 2 2 + w(τ) t,j 2 2) A2 0.

Here, Rτ and A0 are defined as in Proposition 3.2.

Proof. The local Lipschitz continuity established in Lemma D.2 directly implies the continuity of b Gβ(βt,j, Θ, t) with respect to Θ. Since {ML b Gβ(βt,j, Θ, t)}t/ t+1 [L],j [M] serves as the gradient of b Q(Θ), it follows that b Q(Θ) is continuously differentiable, indicating the local semiconvexity of b Q(Θ). Specifically, for any Θ, there exists some κ > 0 such that b Q(Θ) + κ P

t PM j=1( θt,j 2 2 +

wt,j 2 2) is convex within a small neighborhood of Θ. The existence and uniqueness of a gradient flow over the maximal interval [0, τmax] is a standard result (see Section 2.1 of [61]).

For any τ [0, τmax], it holds that

b Q(Θ(0)) b Q(Θ(0)) b Q(Θ(τ)) = Z τ

j=1 d b Q(Θ)

Θ=Θ(τ), MLd b Q(Θ)

j=1 b Gβ(β(τ) t,j , Θ(τ), t) 2dτ

0 b Gβ(β(τ) t,j , Θ(τ), t) dτ 2 ,

(D.29) where the last inequality follows from Jensen s inequality. (D.29) establishes that b Q(Θ(τ)) is both upper and lower bounded, and Θ(τ) exhibits a bounded curve length over any the time interval [0, τmax]. By compactness, if τmax is finite, then Θ(τmax) exists and thus must exist beyond τmax, which leads to contradiction. Therefore, τmax = , and the well-posedness of the gradient flow for τ 0 consequently follows. Additionally, (D.29) shows that for any τ 0,

j=1 βt,j 2 2 2λ 1 b Q(Θ(τ)) 2λ 1 b Q(Θ(0)) =λ 1Eµ h Read[ b TΘ(0)(H, 1)] y(H) 2i

j=1 β(0) t,j 2 2

λ 1Eµ[( b TΘ(0)(H, 1) 2 col + B)2] + R2

2λ 1Eµ[( b TΘ(0)(H, 1) 2 2 col + B2] + R2

R2 + λ 1 2B2 + 2B2 exp K(1 + R + R2) 2

=A2 0 (D.30) The last inequality of (D.30) follows from Lemma C.4, thereby showing that 1 ML P

t PM j=1 βt,j 2 2 A2 0 for any τ 0.

As the final part of our proof, we demonstrate that the norm of any entry of Θ is bounded at any given time τ 0. Note that b G(β, Θ(τ), t) λβ

θ n f( b TΘ(H, t), θ):,i o bpΘ(H, t):,i /2 + w n h( b TΘ(H, t + /2), w):,i o bpΘ(H, t + t/2):,i /2

sup i [N+1] ( θf(Tρ(H, t), θ):,i

i=1 pρ(H, t):,i /2 + wh(Tρ(H, t), w):,i

i=1 pρ(H, t + t/2):,i /2)

N + 1 sup i [N+1]

θf(Tρ(H, t), θ):,i pρ(H, t) F /2

+ wh(Tρ(H, t + t/2), w):,i pρ(H, t + t/2) F /2

N + 1 max{ϕP ( Tρ(H, t) 2 col), ϕP ( Tρ(H, t + t/2) 2 col)} max{ pρ(H, t) F , pρ(H, t + t/2) F }(1 + β ) CR(1 + β ), (D.31) Applying (D.31) to the gradient flow

dβ(τ) t,j dτ = b G(β(τ) t,j , Θ(τ), t)

for τ 0, we have d β(τ) t,j dτ = b G(β(τ) t,j ,Θ(τ),t), β(τ)

β(τ) b G(β(τ) t,j , Θ(τ), t) λβ CR(1+ β(τ) ). This indicates

β(τ) ( β + 1) exp(CRτ) 1 (R + 1) exp(CRτ) 1 = Rτ. (D.32)

D.5 Proof of Proposition 3.3

Before commencing the proof, we introduce two proxy Transformer procedures in addition to TΘ and e Tρ. The first proxy, denoted as TΘ, involves moving the layers with the encoder h slightly forward by t/2 in the depth index. This adjustment results in a discrete Transformer with only L layers, where each layer has a step size of t and an encoder of (f + h)/2, represented by g. Specifically, TΘ can be written as

TΘ(H, t + t) = TΘ(H, t) + t M 1 M X

f( TΘ(H, t), θt,j) +

j=1 h( TΘ(H, t), wt,j)

= TΘ(H, t) + t M 1 M X

j=1 g( TΘ(H, t), βt,j).

The second proxy, denoted as e Tρ, extends the width to infinity by letting M , effectively replacing the average with an integral:

e Tρ(H, t + t) = e Tρ(H, t) + t Z

β g( e Tρ(H, t), β)ρ(β|t)dβ. (D.34)

We let all four Transformers share the same initial state b TΘ(H, 0) = TΘ(H, 0) = e Tρ(H, 0) = Tρ(H, 0) = H.

We first present the following lemma, considering parameters i.i.d. drawn from some distribution ρ P2 with bounded support: Lemma D.6 (Oracle approximation of discretization). Under Assumptions 1 and 2, suppose that the parameter setting Θ is i.i.d. drawn from {ρ(θ, w|t)}t,j for some ρ P2 concentrated on {(θ, w) : θ 2 + w 2 r2} [0, 1] and satisfies that R

θ,w ρ(θ, w, t)d(θ, w) = 1 for any t [0, 1]. Then with probability at least 1 exp( δ) with respect to the parameter initialization Θ(0), we have

b TΘ(H, t) Tρ(H, t) F L 1 +

δ + log(L + 1)

for any H, t = 0, t, . . . , (L 1) t, 1 and any δ > 0. Here, hides the dependencies on N, D, r and the parameters of the assumptions.

Proof of Proposition 3.3. Since ρ P2,r has a bounded support for any ρ, there exists some ρ P2,r such that R(ρ ) = infρ P2,r R(ρ). According to Lemma D.6, we can find a specific Θ such that

b TΘ(H, t) Tρ (H, t) F C L 1 +

where C depends on N, D, r, and the parameters of the assumptions. Moreover, from Lemma D.6, we ensure that each entry βt,j of Θ satisfies βt,j r. Verification of Lemmas C.1 and C.4 on Tρ and b TΘ respectively leads to their uniform boundedness, i.e., supt Tρ(H, t) 1 and supt b TΘ(H, t) 1. Therefore, we have

| b R(Θ) R(ρ )| Eµ[|Read[ b TΘ(H, 1) TΘ(H, 1)]| ||Read[ b TΘ(H, 1) + TΘ(H, 1)] + 2y(H)|]

Here, hides the dependencies on N, D, r and the parameters of the assumptions. The result then follows.

The proof for the energy functional Q (and b Q) follows a similar approach. There exists some ρ P2,r such that Q(ρ ) = infρ P2,r Q(ρ). From Lemmas D.6 and C.7, we can find a specific Θ such that

b TΘ(H, t) Tρ (H, t) F C L 1 + q

t PM j=1 βt,j 2 R 1 0 R

β β 2ρ(β, t)dβdt| C L 1 + q

by setting C large enough. Verification of Lemmas C.1 and C.4 on Tρ and b TΘ respectively leads to their uniform boundedness. Hence, we have

| b Q(Θ) Q(ρ )| | b R(Θ) R(ρ )|+λ| 1

j=1 βt,j 2 Z 1

β β 2ρ(β, t)dβdt| C(1+λ) L 1+

The result thus follows.

E Proofs of main results in Section 4

For simplicity, we assume that Assumption 4 holds with (g, α) = (f, θ). The proof for the case of (g, α) = (h, w) is symmetric, involving a simple substitution of f with h and θ with w.

E.1 Proofs of Theorem 4.1 and Corollary 4.1

Our proof of Theorem 4.1 consists of three parts, each focusing on bounding differences related to the energy functional Q or b Q.

The first step establishes the continuity of the functional gradient δQ

δρ |ρ . This ensures that if the derivative with respect to β for the functional gradient is constant over a region, then the functional gradient remains constant within that region.

The second step provides the key bound for Q(ρ ), which is proportional to λ. This involves a detailed analysis of Q s landscape by bounding its derivatives.

After obtaining the bound for Q(ρ ), the final steps are to show that the finite-time risk can approach this bound. Achieving a loss as small as ϵ requires Q(ρ(τ0)) ϵ for some sufficiently large τ0. We then apply Theorem 3.1, with constant dependency on τ0, to show that b Q(τ0) becomes sufficiently small. Since b Q(ρ(τ)) is non-increasing, b Q(τ) remains small for all τ τ0.

Preparatory Step: Landscape analysis

First, the following lemma suggests that as long as the risk R(ρ) remains positive, a descent direction for Q(ρ) can be constructed at any depth index, provided that λ is sufficiently small. This implies that by adjusting λ, one can influence the gradient flow to effectively reduce Q(ρ). Lemma E.1 (Landscape of Q(ρ)). Suppose that Assumptions 1-4 hold. For any ρ P2 concentrated on Pr with some r > 0, any w0 Rdimw, and any t [0, 1] such that R

β ρ(β, t ) 1/2, there exists a ν P(Rdimβ) such that

i. for any β supp(ν), we have 1/Br β Br

ii. for any (θ1, θ2, w), we have θ2 K and w = w0.

δρ (β, t ) ν(β) ρ(β|t ) dβ C1λ C2R(ρ)

Here, Br, C1, C2 are constants that depends on N, d, r and the parameters of the assumptions.

Given the t and w0 specified in the theorem, Lemma E.1 indicates that there exists some

ν P Bdimθ1(0, BR )/Bdimθ1(0, 1/BR ) K {w0}

such that R

δρ |ρ (β, t ) ν(β) ρ (β|t ) dβ C1λ C2R(ρ), where BR , C1, C2 are constants dependent on N, d, R and the parameters of the assumptions.

In addition, for any ρ P2, we define the following two functional derivatives:

δρ (θ, w, t) = Eµ h Tr h f(Tρ(H, t), θ) i pρ(H, t) i + λ θ 2 2

δρ (θ, w, t) = Eµ h Tr h h(Tρ(H, t), w) i pρ(H, t) i + λ w 2 2.

It is obvious that δQ

Proof of Theorem 4.1. Step I: Show that δQ

δρ |ρ (β, t) is continuous with respect to (β, t)

In Step I of the proof of Lemma E.1, we establish that ρ P2, with a bounded support PR , implies pρ (H, t) is Cp-Lipschitz continuous, where Cp is a constant dependent solely on N, D, R , and the parameters in our assumptions.

Next, we would like to show that δQ

δρ ρ (β, t) is continuous with respect to (β, t) Rdivβ [0, 1].

Let s focus on the region (β, t) Pr for any r > R , so that ρ is also concentrated on Pr. It s noteworthy that for any bounded support (β, t) Pr, Lemma C.1 and Assumption 2 (i) ensure the universal boundedness of g(Tρ(H, t), β), and Lemma C.3 ensures the universal boundedness of pρ(H, t) for any H supp(µ), with the constants depending solely on N, D, r, and the parameters of the assumptions.

Combining the Lipschitz continuity of pρ (H, t) and Tρ (H, t) with respect to (H, t), as shown in Proposition C.2, along with the Lipschitz continuity of g(T, β) with respect to (T, β) when T F is universally bounded (as guaranteed by Assumption 2 (ii) and (iii)), and their universal boundedness,

we derive that Tr h g(Tρ (H, t), β) i pρ (H, t) is CG-Lipschitz continuous for 2 with respect to (β, t) Pr for some constant CG that depends only on N, D, r and the parameters of the assumptions. Since the Lipschitz constant CG is independent of the choice of H, we see that

Tr h g(Tρ (H, t), β) i pρ (H, t) is uniformly continuous across all H supp(µ) with respect

to (β, t) Pr. Consequently, we have that

δρ (β, t)|ρ = Eµ h Tr h g(Tρ (H, t), β) i pρ (H, t) i + λ

is continuous with respect to (β, t) Pr. Since the choice of r is arbitrary, we conclude that δQ

δρ |ρ (β, t) is continuous with respect to (β, t) Rdivβ [0, 1].

Step II: Show that Q(ρ ) λ with further landscape analysis

In the first part of the proof, we will adopt a similar approach to Theorem 3.9 of [47] to demonstrate that the stationary point of the Wasserstein gradient flow, denoted ρ , satisfies Q(ρ ) λ. It s worth noting that in [47], the authors assume λ = 0 and conclude R(ρ ) = Q(ρ ) = 0, but this claim relies on assuming the global existence of the Wasserstein gradient flow rather than proving it directly.

Based on the pivotal findings from [53] regarding the stationary points in the Wasserstein space, we infer that the stationary point ρ of the Wasserstein gradient flow (3.5), i.e.

dτ = divβ ρ β δQ

must satisfy β δQ

δρ |ρ = 0 almost everywhere over supp(ρ ). This further indicates that

δρ |ρ (θ1, θ2, w, t ) = 0 almost everywhere over supp(ρ ( , t )). The fact ρ( , t ) is a con-

nected set, coupled with the continuity of the Frechét differential δQ

δρ |ρ with respect to β, implies

δρ |ρ (θ1, θ2, w, t ) = C for some constant C over (β, t ) supp(ρ( , t )).

Given the separation assumption on the support of ρ ( , t ), we ensure that for any (θ1, θ2) Bdimθ1(0, BR )/Bdimθ1(0, 1/BR ) K, there exists c R, 1/R BR |c| R BR

such that (cθ1, θ2, w0, t ) supp(ρ ). Combined with Assumption 4 (i), which implies the 1homogeneity of f(T, θ1, θ2) with respect to θ1, we have

ρ (cθ1, θ2, w0, t ) = cδQf

ρ (θ1, θ2, w0, t ) + (|c| 1)λ θ1 2.

ρ (cθ1, θ2, w0, t ) = δQh

ρ (θ1, θ2, w0, t ). (E.1)

Hence, given that β δQ

δρ |ρ ( , t ) = 0 almost everywhere over supp(ρ ( , t )), it also holds that

(θ1,θ2,w) δQ

ρ (θ1, θ2, w0, t ) = (θ1,θ2,w) δQf

ρ (θ1, θ2, w0, t ) + (θ1,θ2,w) δQh

ρ (θ1, θ2, w0, t ) /2

c λθ1, 0dimθ2, 0w ,

which implies (θ1,θ2,w) δQ

ρ (θ1, θ2, w0, t )

(R2 BR + R )λ. (E.2)

Given the condition that (cθ1, θ2, w0, t ) (θ1, θ2, w0, t ) R2 BR , and recalling that δQ

δρ |ρ (θ1, θ2, w, t ) C across (β, t ) supp(ρ( , t )), (E.2) further indicates that

ρ (θ1, θ2, w0, t ) C (R4 B2 R + R3 BR )λ. (E.3)

for any (θ1, θ2) Bdimθ1(0, BR )/Bdimθ1(0, 1/BR ) K. Hence, by Lemma E.1 we have

C1λ C2R(ρ ) Z

ρ (β, t ) ν(β) ρ (β|t ) dβ

ρ (β, t ) C ν(β) ρ (β|t ) dβ

β (R4 B2 R + R3 BR )λ ν(β) + ρ (β|t ) dβ

2(R4 B2 R + R3 BR )λ.

Therefore, we have R(ρ ) C1+2(R4 B2 R +R3 BR ) C2 λ, and

Q(ρ ) R(ρ ) + Z 1

β β 2ρ(β, t)dβdt C1 + 2(R4 B2 R + R3 BR ) C2 + R2 λ, (E.5)

which completes the first part of our proof.

Step III: Bound the difference between Q(ρ(τ)) and Q(ρ ) when τ is large

Proposition 3.2 establishes that the second moment for ρ(τ) is uniformly bounded across all τ 0: Z 1

β β 2ρ(τ)(β, t)dβdt A2 0,

where A0 is defined as in Proposition 3.2. Therefore, the weak convergence of probability measures (ρ(τ))τ 0 is equivalent to the convergence in the Wasserstein-2 distance, i.e.

lim τ W2(ρ(τ), ρ ) = 0. (E.6)

When τ is sufficiently large, ρ(τ) concentrates on PR . Therefore, according to Lemma C.2, there exists a constant C0 depending solely on N, D, R , and the parameters of the assumptions that

Tρ (H, t) Tρ(τ)(H, t) F C0W2(ρ(τ), ρ ) (E.7)

for any H and t [0, 1] when τ is sufficiently large. Note that Lemma C.1 shows that

max{ Tρ(τ)(H, t) 2 col, Tρ (H, t) 2 col} B exp(K(1 + R + R2 )) =: BT

for any H and t [0, 1]. Thus, from (E.7), we have

|Q(ρ ) Q(ρ(τ))| 1

2Eµ h |Read[|Tρ(τ)(H, 1)]|+|Tρ (H, 1)]|+2|y(H)| i Tρ (H, 1) Tρ(τ)(H, 1) F

β β 2(ρ ρ(τ))(β, t)dβdt

(BT + B)C0W2(ρ(τ), ρ ) + λ

2 R W1(ρ(τ), ρ )

((BT + B)C0 + λ

2 R )W2(ρ(τ), ρ ), (E.8) where the second inequality incorporates the Kantorovich-Rubinstein Theorem (see Theorem 5.10 of [68], for example) and the 2R -Lipschitz continuity of β 2 over the region (β, t) PR . Combining equations (E.6) and (E.8), we deduce that for any ϵ > 0, there exists some τ0 > 0 such that |Q(ρ ) Q(ρ(τ0))| ϵ.

Step IV: Complete the proof by bounding the difference between b Q(Θ(τ)) and Q(ρ(τ)) when τ is large

The final step can be seen as a direct corollary of the approximation result in Theorem 3.1. According to Theorem 3.1, there exists a constant C1 dependent on N, D, τ0, λ, and the parameters specified in the assumptions, such that

| b Q(Θ(τ0)) Q(ρ(τ0)))| C1 L 1 +

δ + log(L + 1)

with probability at least 1 3 exp( δ) with respect to the parameter initialization Θ(0) for any δ > 0. Combining the outcomes from the preceding steps, we obtain

b Q(Θ(τ0)) ϵ + C1 L 1 +

δ + log(L + 1)

+ C2λ. (E.9)

d dτ b Q(Θ(τ)) = X

j=1 d b Q(Θ)

Θ=Θ(τ), MLd b Q(Θ)

j=1 b Gβ(β(τ) t,j , Θ(τ), t) 2 0,

so the sequence ( b Q(Θ(τ)))τ 0 is non-decreasing. Hence, for any τ τ0,

b R(Θ(τ)) b Q(Θ(τ)) b Q(Θ(τ0)) ϵ + C1 L 1 +

δ + log(L + 1)

which completes the proof, recalling that C1 depends only on N, D, τ0, λ and the parameters of the assumptions, and C2 depends only on N, D, R , and the parameters of the assumptions.

Proof of Corollary 4.1. Given the choice of λ By Theorem 4.1, there exists some τ0 > 0 such that

sup τ τ0 b R(Θ(τ)) ϵ/2 + C1 L 1 +

δ + log(L + 1)

(1/4 + C2Cλ)ϵ + C1 L 1 +

δ + log(L + 1)

The result holds by setting L and M/log L sufficiently large to ensure that

δ + log(L + 1)

2(1 + δ) log(L + 1)

F Proofs of auxiliary results

F.1 Proof of Lemma D.1

Proof. Lemma C.1 confirms that Tρ(H, t) 2 col and Tν(H, t) 2 col are bounded uniformly by BT = B exp(K(1 + r + r2)). Considering the definition (C.2), it suffices to demonstrate that

Eµ h βTr g(Tρ(H, t), β) pρ(H, t) βTr g(Tν(H, t), eβ) pν(H, t) i J1 + J2,

J1 := Eµ h βTr (g(Tν(H, t), eβ) g(Tρ(H, t), β)) pν(H, t) i β eβ ,

J2 := Eµ h βTr g(Tρ(H, t), β) (pν(H, t) pρ(H, t)) i exp(CGW1(ρ, ν)) 1.

Here, the symbol hides dependencies on N, D, r, and the parameters of the assumptions. To bound J1, consider that

J1 sup i [N+1] Eµ h g(Tν(H, t), eβ):,i g(Tρ(H, t), β):,i

i=1 pν(H, t):,i i

N + 1 sup i [N+1] Eµ h g(Tν(H, t), eβ):,i g(Tρ(H, t), β):,i pν(H, t) F i

sup i [N+1] Eµ h g(Tν(H, t), eβ):,i g(Tρ(H, t), β):,i i

ϕP T (r, BT ) Tρ(H, t) Tρ(H, t) 2 col + ϕP P(r, BT ) β eβ

W1(ρ, ν) + β eβ .

The third inequality in Equation (F.1) is derived from Lemma C.3, while the fourth inequality relies on Assumption 3 (i) and (iii). Lastly, bounding Tρ(H, t) Tρ(H, t) 2 col by W1(ρ, ν) is achieved with Lemma C.2.

On the other hand, to bound J2, we have

N + 1 sup i [N+1] Eµ h g(Tρ(H, t), β):,i pν(H, t) pρ(H, t) F i

N + 1 (Tρ(H, t), β) 2 col pν(H, t) pρ(H, t) F pν(H, t) pρ(H, t) F .

In (F.2), the third inequality relies on Assumption 2 (i). Consequently, to establish J2 exp(CGW1(ρ, ν)) 1, it is adequate to demonstrate that pν(H, t) pρ(H, t) F I1 + I2 W1(ρ, ν), where

I1 =|Read[Tρ(H, 1) Tν(H, 1)]| exp Z 1

β vec[T ]vec[g(Tν(H, s), β)]ρ(β, s)dβds ,

I2 =|Read[Tρ(H, 1)] y(H)| exp Z 1

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβds exp Z 1

β vec[T ]vec[g(Tν(H, s), β)]ρ(β, s)dβds .

From Lemma C.2, it is trivial that |Read[Tρ(H, 1) Tν(H, 1)]| Tρ(H, 1) Tν(H, 1) F W1(ρ, ν). Thus, I1 W1(ρ, ν) given the boundedness of vec[T ]vec[g(Tν(H, t), β)] as provided

in Assumption 2 (iii). To bound I2, we have

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβds exp Z 1

β vec[T ]vec[g(Tν(H, s), β)]ρ(β, s)dβds

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβds Z 1

β vec[T ]vec[g(Tν(H, s), β)]ρ(β, s)dβds Idimvec[T ]

vec[T ]vec[g(Tρ(H, s), β)] vec[T ]vec[g(Tν(H, s), β)] ρ(β, s)dβds 1

exp ϕT T (N, D, BT , r) g(Tρ(H, t), β) g(Tν(H, t), β) F 1

exp CrϕT T (N, D, BT , r)ϕT (N, D,

N + 1BT )(1 + r + r2)W1(ρ, ν) 1 (F.3) for some constant Cr dependent on the parameters listed in the result setting. Here, the first inequality in (F.3) stems from |Read[Tρ(H, 1)] y(H)| BT + B, the second inequality is ensured by the boundedness of vec[T ]vec[g(Tν(H, t), β)] as stated in Assumption 2 (iii), and the fourth inequality is provided by Assumption 3 (iv). The last inequality in (F.3) arises from Assumption 2 (iii) and Lemma C.2. By combining Equation (F.3) with the bounds of J1 and I1, we deduce that J1 + J2 exp(CGW1(ρ, ν)) 1 + β eβ for some constant CG dependent on the parameters listed in the result, thereby completing the proof.

F.2 Proof of Lemma D.2

Proof. Lemma C.4 demonstrates that b TΘ(H, t) 2 col and b TeΘ(H, t) 2 col are bounded by BT = B exp(K(1 + A + A2)) for any H and t [0, 1]. We begin by bounding

b G(β, Θ, t) b G(β, eΘ, t) = n1

2Eµ h θTr f( b TΘ(H, t), θ) f( b TeΘ(H, t), θ) bpΘ(H, t + t/2) i

2Eµ h θTrf( b TeΘ(H, t), θ) bpΘ(H, t + t/2) bpeΘ(H, t + t/2) i ,

1 2Eµ h w Tr h( b TΘ(H, t + t/2), w) h( b TeΘ(H, t + t/2), w) bpΘ(H, t) i

2Eµ h w Trh( b TeΘ(H, t + t/2), w) bpΘ(H, t) bpeΘ(H, t) i o .

To demonstrate that b G(β, Θ, t) b G(β, eΘ, t) CG 1 MLd(Θ, eΘ), it suffices to show

J1 := Eµ h θTr f( b TΘ(H, t), θ) f( b TeΘ(H, t), θ) bpΘ(H, t + t/2) i CG 1 MLd(Θ, eΘ),

J2 := Eµ h θTrf( b TeΘ(H, t), θ) bpΘ(H, t + t/2) bpeΘ(H, t + t/2) i CG 1 MLd(Θ, eΘ), (F.5) as the other part for h and w follows a similar proof approach.

To bound J1, by Assumption 3 (i) we have

i=1 Eµ h θf( b TΘ(H, t), θ):,i f( b TeΘ(H, t), θ):,i bpΘ(H, t + t/2):,i i

Eµ h sup i [N+1]

θf( b TΘ(H, t), θ):,i f( b TeΘ(H, t), θ):,i

i=1 bpΘ(H, t + t/2):,i i

ϕT (r, BT ) b TΘ(H, t) b TeΘ(H, t) 2 col

N + 1 bpΘ(H, t + t/2) F

ϕT (r, BT )

N + 1 bpΘ(H, t + t/2) F 1 MLd(Θ, eΘ)

MLd(Θ, eΘ),

where the fourth inequality utilizes Lemma C.5 and the last inequality uses Lemma C.6.

To bound J2, by Assumption 2 (ii), we have

N + 1Eµ h sup i [N+1]

f( b TeΘ(H, t), θ):,i bpΘ(H, t + t/2) bpeΘ(H, t + t/2) F i

N + 1ϕ(BT )(1 + r)Eµ h bpΘ(H, t + t/2) bpeΘ(H, t + t/2) F i . (F.7)

Hence, it suffices to show that bpΘ(H, t + t/2) bpeΘ(H, t + t/2) F 1 MLd(Θ, eΘ) to establish J2 1 MLd(Θ, eΘ). Recalling the formula bpΘ in (C.16), we have bpΘ(H, t + t/2) bpeΘ(H, t + t/2) F I1 + I2, where

I1 =(Read[ b TΘ(H, 1) b TeΘ(H, 1)] n Y

(s t)/ t+2 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)] o

(s t)/ t+2 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)]

exp ( t/2) X

(s t)/ t+1 [(1 t)/ t] M 1 M X

vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] M 1 M X

vec[T ]vec[h( b TΘ(H, t), ws,j)]

MLd(Θ, eΘ) exp ϕT (N, D,

N + 1KBT )(1 + r + r2)

MLd(Θ, eΘ),

and I2 =|Read[ b TeΘ(H, 1)] + y(H)| n Y

(s t)/ t+2 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)]

(s t)/ t+2 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TeΘ(H, t), eθs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TeΘ(H, t + /2), ews,j)] o

exp ( t/2) X

(s t)/ t+1 [(1 t)/ t] M 1 M X

j=1 max{ vec[T ]vec[f( b TΘ(H, s), θs,j)] , vec[T ]vec[f( b TeΘ(H, t), eθs,j)]}

(s t)/ t+1 [(1 t)/ t] M 1 M X

j=1 max{ vec[T ]vec[h( b TΘ(H, t), ws,j)] , vec[T ]vec[h( b TeΘ(H, t), ews,j)]}

(s t)/ t+1 [(1 t)/ t] j [M]

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)] M 1 M X

j=1 vec[T ]vec[f( b TeΘ(H, t), eθs,j)]

j=1 vec[T ]vec[h( b TΘ(H, t), ws,j)] M 1 M X

j=1 vec[T ]vec[h( b TeΘ(H, t), ews,j)]

(B + BT ) exp ϕT (N, D,

N + 1KBT )(1 + r + r2) ϕT P (N, D,

N + 1BT , r) 1

MLd(Θ, eΘ).

where the first inequality applies Lemma C.8, and the second inequality relies on Assumption 2 (iii) and Assumption 3 (ii). Therefore, we conclude that I2 1 MLd(Θ, eΘ). By combining the bounds of I1 and I2, we observe that Equation (F.5) holds, thereby establishing b G(β, Θ, t) b G(β, eΘ, t) CG 1 MLd(Θ, eΘ) for some CG dependent on N, d, r, and the parameters of the assumptions.

It remains to prove that b G(β, eΘ, t) b G(eβ, eΘ, t) CG(1 + λ) β eβ . Note that

b G(β, eΘ, t) b G(eβ, eΘ, t) λ β eβ

2Eµ h θTr f( b TeΘ(H, t), θ) f( b TeΘ(H, t), eθ) bpΘ(H, t + t/2) i ,

1 2Eµ h w Tr h( b TeΘ(H, t + t/2), w) h( b TeΘ(H, t + t/2), ew) bpΘ(H, t) i o .

Therefore, we only need to show that Eµ h θTr f( b TeΘ(H, t), θ) f( b TeΘ(H, t), eθ) bpΘ(H, t + t/2) i CG θ eθ ,

and Eµ h w Tr h( b TeΘ(H, t), w) h( b TeΘ(H, t), ew) bpΘ(H, t) i CG w ew ,

to obtain b G(β, eΘ, t) b G(eβ, eΘ, t) CG(1 + λ) β eβ . Here, we only establish the inequality above for f and θ, as the proof of the other inequality follows a similar pattern. Note that by Assumption 3 (iii), we have Eµ h θTr f( b TeΘ(H, t), θ) f( b TeΘ(H, t), eθ) bpΘ(H, t + t/2) i

sup i [N+1] Eµ h θ f( b TeΘ(H, t), θ) f( b TeΘ(H, t), eθ)

N + 1 bpΘ(H, t + t/2) F

ϕP P (r, BT ) θ eθ

N + 1 bpΘ(H, t + t/2) F

where the last inequality applies Lemma C.6. Therefore, we conclude that b G(β, eΘ, t) b G(eβ, eΘ, t) CG(1 + λ) β eβ , completing the proof.

F.3 Proof of Lemma D.3

Proof. Lemmas C.1 and C.4 establish that max Tρ(H, t) 2 col, b TΘ(H, t) 2 col BT := B exp(K(1+ r + r2)). According to Lemma D.6, there exists an event E with P(E) 1 exp( δ) such that under E, we have

b TΘ(H, t) Tρ(H, t) F L 1 +

δ + log(L + 1)

for any H and t = 0, t, . . . , (L 1) t, 1. Following the same proof procedure as in Lemma D.6, with ρ replaced by bρ, and bounding only J1 and J3 in the proof (as there is no need to utilize Hoeffding s inequality to bridge the difference due to a finite width M), we could obtain the bound

b TΘ(H, t) Tbρ(H, t) F L 1. (F.9)

We present the proof only for the case involving ρ. The bounding of b G(β, Θ, t) G(β, bρ, t) F can be derived analogously by substituting ρ with bρ using Equation (F.9), and skipping the process of bounding D1 D2 where D1 and D2 will be defined later. The bounding of G(β, bρ, t) G(β, ρ, t) can be straightforwardly achieved by combining the results obtained from the other two cases.

By the definitions of the gradients in Equations (C.2) and (C.3), we observe that b G(β, Θ, t) G(β, ρ, t) b Gf(θ, Θ, t) Gf(θ, ρ, t) + b Gh(w, Θ, t) Gh(w, ρ, t) . We will

focus on showing that 2( b Gf(θ, Θ, t) Gf(θ, ρ, t)) L 1 + q

M , as the other part of the proof follows a similar approach.

Let s define the following quantities A1, A2, B1, B2, C1, C2:

A1 := θvec[f( b TΘ(H, t), θ)], A2 := θvec[f(Tρ(H, t), θ)]

B1 :=Read[ b TΘ(H, 1) y(H)], B2 := Read[Tρ(H, 1) y(H)]

(s t)/ t+2 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)] o

C2 := exp Z 1

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβdt

The universal boundedness of A1 and A2 is implied by Assumption 2 (ii), while the universal boundedness of |B1| and |B2| is implied by Assumption 1. Additionally, C1 and C2 can be

bounded via Assumption 2 (iii), and one can refer to the proofs of Lemmas C.3 and C.6 for detailed explanations.

From (C.14) and (C.16), we could rewrite J := 2( b Gf(θ, Θ, t) Gf(θ, ρ, t)) as

J = A1B1C1 A2B2C2 (A1 A2)B2C2 + A1(B1 B2)C2 + A1B1(C1 C2) A1 A2 B2 C2 + A1 B1 B2 C2 + A1 B1 C1 C2 A1 A2 + B1 B2 + C1 C2 .

We claim that to obtain the result, it suffices to show that

i. A1 A2 L 1 + q

M under event E.

ii. B1 B2 L 1 + q

M under event E.

iii. There exists some event E2 with P(E2) 1 exp( δ) such that C1 C2 L 1 + q

M under E E2.

This is because if we can establish the above statements, then under the event E E2 with P(E E2)

1 2 exp( δ), we obtain J = 2( b Gf(θ, Θ, t) Gf(θ, ρ, t)) L 1 + q

M . Given the

similarity in proof for 2( b Gh(w, Θ, t) Gh(w, ρ, t)) , we deduce that with probability at least

1 4 exp( δ) with respect to the parameter initialization Θ(0), we have b G(β, Θ, t) G(β, ρ, t)

M . The remainder of the proof focuses on bounding the quantities in statements (i)-(iii).

Proof for statement (i): By Assumption 3 (iv), we have A1 A2

ϕT T (N, D,

N + 1BT , r) b TΘ(H, t) Tρ(H, t) F L 1 + q

M under event E.

Proof for statement (ii): Under event E, it is obvious to see B1 B2 b TΘ(H, t) Tρ(H, t) F

Proof for statement (iii): We further define the following quantities

(s t)/ t+2 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimvec[T ] + ( t/2)M 1 M X

j=1 vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)]

(s t)/ t+2 [(1 t)/ t]

Idimvec[T ] + ( t/2) Z

β vec[T ]vec[f( b TΘ(H, s), θ)]ρ(β|s)dβ

(s t)/ t+1 [(1 t)/ t]

Idimvec[T ] + ( t/2) Z

β vec[T ]vec[h( b TΘ(H, s + t/2), w)]ρ(β|s)dβ

D3 := exp X

(s t)/ t+2 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[f( b TΘ(H, s), θ)]ρ(β, s)dβ

(s t)/ t+1 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[h( b TΘ(H, s + t/2), w)]ρ(β, s)dβ

D4 := exp Z 1

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβdt

Note that C1 C2 D1 D2 + D2 D3 + D3 D4 . Assumption 2 (iii) indicates that for any s = 0, t, . . . , (L 1) and j = 1, . . . , M, we have

max{ vec[T ]vec[f( b TΘ(H, s), θs,j)] , vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)] } BJ for some constant BJ dependent on the parameters listed in the result. This implies that each column of vec[T ]vec[f( b TΘ(H, s), θs,j)] or vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)] has l2 norm upper bounded by BJ as well. Applying Hoeffding s inequality to each column of vec[T ]vec[f( b TΘ(H, s), θs,j)] and vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)], and subsequently calculating the union bound across all columns yields:

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)] Z

β vec[T ]vec[f( b TΘ(H, s), θs,j)]ρ(β|s)dβ

2(N + 1)D exp( z2

j=1 vec[T ]vec[h( b TΘ(H, t), ws,j)] Z

β vec[T ]vec[h( b TΘ(H, t), ws,j)]ρ(β|s)dβ

2(N + 1)D exp( z2

(F.11) for any z > 0. For (F.10) and (F.11), we further the consider the union bound across all s = 0, t, . . . , (L 1) , and let z = BJ p

2M(δ + log(2(N + 1)DL)), which implies that with probability at least 1 exp( δ) with respect to the parameter initialization Θ(0), we have

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)] Z

β vec[T ]vec[f( b TΘ(H, s), θs,j)]ρ(β|s)dβ

j=1 vec[T ]vec[h( b TΘ(H, t), ws,j)] Z

β vec[T ]vec[h( b TΘ(H, t), ws,j)]ρ(β|s)dβ

bounded by BJ p

2M(N + 1)D(δ + log(2(N + 1)DL)) CJ q

M for any s = 0, t, . . . , (L 1) . Here, CJ is some constant that only depends on N, D, r and the parameters of the assumptions. Denote this probability event by E2, and we have P(E2) 1 exp( δ). Under E2, by Lemma C.8, we have

(s t)/ t+2 [(1 t)/ t]

j=1 vec[T ]vec[f( b TΘ(H, s), θs,j)] Z

β vec[T ]vec[f( b TΘ(H, s), θ)]ρ(β, s)dβ

(s t)/ t+1 [(1 t)/ t]

j=1 vec[T ]vec[h( b TΘ(H, s), ws,j)] Z

β vec[T ]vec[h( b TΘ(H, s + t/2), w)]ρ(β, s)dβ

δ + log(L + 1)

(F.12) For any s = 0, t, . . . , (L 1) , we define

As,j = ( t/2) Z

β vec[T ]vec[f( b TΘ(H, s), θ)]ρ(β|s)dβ

and Bs,j = ( t/2) Z

β vec[T ]vec[h( b TΘ(H, s + t/2), w)]ρ(β|s)dβ.

Since Assumption 2 (iii) indicates that max{ As,j , Bs,j } t = L 1, we have

exp(As,j) I As,j As,j 2, exp(Bs,j) I Bs,j Bs,j 2.

Applying Lemma C.8 once more, we have

(s t)/ t+2 [(1 t)/ t] As,j 2 + X

(s t)/ t+1 [(1 t)/ t] Bs,j 2 L 1. (F.13)

Since Assumption 2 (iii) ensures the boundedness of D4 , we have

D3 D4 exp X

(s t)/ t+2 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[f( b TΘ(H, s), θ)]ρ(β, s)dβ

(s t)/ t+1 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[h( b TΘ(H, s + t/2), w)]ρ(β, s)dβ

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβdt 1 .

Therefore, to show that D3 D4 L 1 + q

M , it suffices to show that

(s t)/ t+2 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[f( b TΘ(H, s), θ)]ρ(β, s)dβ

(s t)/ t+1 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[h( b TΘ(H, s + t/2), w)]ρ(β, s)dβ

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβdt L 1 +

δ + log(L + 1)

By Assumption 3 (iv) and Lemma D.6, we have

vec[T ]vec[f( b TΘ(H, s), θ)] vec[T ]vec[f(Tρ(H, s), θ)] b TΘ(H, s) Tρ(H, s) F L 1+

δ + log(L + 1)

vec[T ]vec[h( b TΘ(H, s + t/2), w)] vec[T ]vec[h(Tρ(H, s), w)] b TΘ(H, s + t/2) Tρ(H, s) F

b TΘ(H, s + t/2) b TΘ(H, s) F + b TΘ(H, s) Tρ(H, s) F

δ + log(L + 1)

M + L 1 M 1 M X

j=1 h( b TΘ(H, s), ws,j) F

δ + log(L + 1)

where the last inequality employs Assumption 2 (i). Therefore, we conclude that

D3 D4 J34 L 1 +

δ + log(L + 1)

M + ( t/2) Z

β vec[T ]vec[f(Tρ(H, t), θ)]ρ(β, t)dβ

(s t)/ t+1 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[f(Tρ(H, s), θ)]ρ(β, s)dβ

(s t)/ t+1 [(1 t)/ t] ( t/2) Z

β vec[T ]vec[h(Tρ(H, s)(H, s), w)]ρ(β, s)dβ

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβdt

δ + log(L + 1)

M + sup |s1 s2| t

vec[T ]vec[g(Tρ(H, s1), β)] vec[T ]vec[g(Tρ(H, s2), β)]

δ + log(L + 1)

M + sup |s1 s2| t Tρ(H, s1) Tρ(H, s2) F

δ + log(L + 1)

M (F.15) where the third inequality uses Assumption 3 (iv), and the last inequality relies on the Lipschitz continuity as demonstrated in Proposition C.1. Combining (F.12), (F.13), and (F.15) yields C1

C2 D1 D2 + D2 D3 + D3 D4 L 1 + q

F.4 Proof of Lemma D.4

Proof. Define Fρ(T ρ, t) = R

β vec[g(T ρ(H, t), β)]ρ(β, t)dβ for any ρ, ρ P2. From Taylor s expansion, we have

vec[ Tρη(H, t) Tρ(H, t)] = Fρ(Tρη, t) Fρ(Tρ, t) + Fρη(Tρη, t) Fρ(Tρη, t)

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ vec[Tρη(H, t) Tρ(H, t)]

β vec[g(Tρη(H, t), β)](ν ρ)(β, t)dβ + o(η)

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ vec[Tρη(H, t) Tρ(H, t)]

β vec[g(Tρ(H, t), β)](ν ρ)(β, t)dβ

β vec[T ]vec[g(Tρ(H, t), β)](ν ρ)(β, t)dβ vec[Tρη(H, t) Tρ(H, t)] + o(η)

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ vec[Tρη(H, t) Tρ(H, t)]

β vec[g(Tρ(H, t), β)](ν ρ)(β, t)dβ + o(η),

of which the last equality holds as Lemma C.2 shows that Tρη(H, t) Tρ(H, t) F = O(W2(ρη, ρ)) = O(η), where we hide the constant dependence on B, K, N, r. Therefore, we have

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβ) vec[Tρη(H, t) Tρ(H, t)] o

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβ)

n vec[ Tρη(H, t) Tρ(H, t)] Z

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ vec[Tρη(H, t) Tρ(H, t)] o

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβ) n η Z

β vec[g(Tρ(H, t), β)](ν ρ)(β, t)dβ + o(η) o ,

(F.16) which leads to (D.18).

F.5 Proof of Lemma D.5

Proof. Fix (β, t) Pr. Lemma C.2 implies that

pρ(H, 1) pν(H, 1) F = |Read(Tρ(H, 1) Read(Tν(H, 1)| Cr W1(ρ, ν) Cr(r+1) ρ ν 1

for some constant Cr dependent on the parameters listed in the result. Our goal is to regulate the difference between pρ(H, t) and pν(H, t) to control the the propagation of pρ(H, ) pν(H, ) .

Note that by (C.14) and Assumption 2,

d dt pρ(H, t) pν(H, t) F pρ(H, t) pν(H, t) F

= vec[pρ(H, t)] Z

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ

vec[pν(H, t)] Z

β vec[T ]vec[g(Tν(H, t), β)]ν(β, t)dβ

pρ(H, t) pν(H, t) F

vec[T ]vec[g(Tρ(H, t), β)] ρ(β, t)dβ

+ pν(H, t) F

vec[T ]vec[g(Tν(H, t), β)] (ρ ν)(β, t)dβ

Lr,1 pρ(H, t) pν(H, t) F

β ρ(β, t)dβ + pν(H, t) F ρ( , t) ν( , t) 1

Lr,1 pρ(H, t) pν(H, t) F

β ρ(β, t)dβ + Lr,2 ρ( , t) ν( , t) 1

(F.17) where

Lr,1 = ϕT (N, D,

N + 1B exp(K(1 + r + r2)))(1 + r + r2)

Lr,2 = (B+B exp(K(1+r+r2))) exp ϕT (N, D,

N + 1KB exp(K(1+r+r2))(1+r+r2) .

The third inequality of (F.17) uses Lemma C.1 to obtain

N + 1 Tρ(H, t) 2 col

N + 1B exp(K(1 + r + r2))

N + 1 Tν(H, t) 2 col

N + 1B exp(K(1 + r + r2))

with Assumption 2(iii) to bound the norm of the Jacobian matrix with Lr,1. The last inequality of (F.17) employs Lemma C.3 to bound pν(H, t) F with Lr,2. Applying the Grönwall s inequality to (F.17), we obtain

pρ(H, t) pν(H, t) F Cr(r + 1) exp(Lr,1) ρ ν 1 + Z 1

0 Lr,1Lr,2 exp(Lr,1

β ρ(β, t)dβds) ρ( , t) ν( , t) 1dt

Cr(r + 1) exp(Lr,1) ρ ν 1 + Lr,1Lr,2 exp(Lr,1) Z 1

0 ρ( , t) ν( , t) 1dt

=(Cr + Lr,1Lr,2) exp(Lr,1)) ρ ν 1 (F.18) Since Z 1

0 ρ( , t) ν( , t) 1dt = Z 1

β |ρ(β, t) ν(β, t)|dβdt = ρ ν 1.

Thus, we complete the proof of the first result.

By Lemma C.3, under Assumption 1 we have

pρ(H, t) F Lr,3 := (B+B exp(K(1+r+r2))) exp ϕT (N, D,

N + 1KB exp(K(1+r+r2))(1+r+r2) .

In addition, by Lemma C.1, under Assumption 2 (i) we have

g(Tν(H, t), β) F

N + 1 g(Tν(H, t), β) 2 col Lr,4 := KB exp(K(1+r+r2))(1+r+r2).

Therefore, for the gradient function δQ

δρ , by Lemma C.3 we have

ρ (β, t) δQ

ν (β, t) =Eµ[Tr(g(Tρ(H, t), β) pρ(H, t)) Tr(g(Tν(H, t), β) pν(H, t))]

Eµ[ g(Tρ(H, t), β) g(Tν(H, t), β) F pρ(H, t) F ] + Eµ[ g(Tν(H, t), β) F pρ(H, t) pν(H, t) F ] Lr,3Eµ[ g(Tρ(H, t), β) g(Tν(H, t), β) F ] + Lr,4(Cr + Lr,1Lr,2) ρ ν 1

N + 1Lr,3Eµ[ g(Tρ(H, t), β) g(Tν(H, t), β) 2 col] + Lr,4(Cr + Lr,1Lr,2) ρ ν 1. (F.19) Hence, it suffices to show that for any H such that H 2 col B, we have g(Tρ(H, t), β) g(Tν(H, t), β) 2 col Lr,5 ρ ν 1 for some Lr,5 > 0 in order to obtain the second result of this lemma. By Assumption 2 (iii), we see that

g(Tρ(H, t), β) g(Tν(H, t), β) 2 col ϕT (N, D, max{ Tρ(H, t) F , Tν(H, t) F })(1 + r + r2) Tρ(H, t) Tρ(H, t) 2

N + 1KB exp(K(1 + r + r2)))(1 + r + r2)Cr W1(ρ, ν)

N + 1KB exp(K(1 + r + r2)))(1 + r + r2)Cr(1 + r) ρ ν 1, (F.20) where the second inequality again uses Lemma (C.2). Combining (F.19) and F.20 completes the proof of the second result.

F.6 Proof of Lemma D.6

Proof. Denote the empirical distribution of TΘ and e Tρ by

j=1 δ(βt,j, t), eρ = 1

t δt(t)ρ(β|t),

respectively. It s straightforward to verify that ρ and eρ meet the conditions outlined in Lemma C.1, and T ρ = TΘ and Teρ = e Tρ. Hence, Lemmas C.1 and C.4 indicate that max{ b TΘ(H, t) 2 col, TΘ(H, t) 2 col, e Tρ(H, t) 2 col} B exp(K(1 + r + r2)) for any H and t [0, 1]. We then define BT := B exp(K(1 + r + r2)).

The following decomposition equation holds:

b TΘ(H, t) Tρ(H, t) F b TΘ(H, t) TΘ(H, t) F | {z } J1

+ TΘ(H, t) e Tρ(H, t) F | {z } J2

+ e Tρ(H, t) Tρ(H, t) F | {z } J3

(F.21) Our proof will bound J1, J2 and J3, possibly in a probabilistic manner, to obtain the desired result.

Bounding J1: Note that according to Assumption 2 (i),

b TΘ(H, t) + ( t/2)M 1 M X

j=1 f( b TΘ(H, t), θt,j) 2 col

b TΘ(H, t) 2 col + ( t/2)M 1 M X

j=1 f( b TΘ(H, t), θt,j) 2 col

b TΘ(H, t) 2 col + (K t/2)M 1 M X

j=1 b TΘ(H, t) 2 col(1 + θt,j + θt,j 2)

BT (1 + (K t/2)(1 + r + r2))

BT (1 + (K/2)(1 + r + r2))

Denote BT (1 + (K/2)(1 + r + r2)) by BT . Combining the two equations in (2.4) gives us

b TΘ(H, t + t) = MLPwt,1,...,wt,M Attnθt,1,...,θt,M ( b TΘ(H, t), t/2), t/2

=Attnθt,1,...,θt,M ( b TΘ(H, t), t/2) + ( t/2)M 1 M X

j=1 h Attnθt,1,...,θt,M ( b TΘ(H, t), t/2), wt,j

= b TΘ(H, t) + ( t/2)M 1 M X

j=1 f( b TΘ(H, t), θt,j) + ( t/2)M 1 M X

j=1 h b TΘ(H, t)

+( t/2)M 1 M X

j=1 f( b TΘ(H, t), θt,j), wt,j

(F.22) Then, from the formula of (F.22) and Assumption 2 (iii), we see that for any t = 0, t, . . . , (L 1) ,

b TΘ(H, t + t) TΘ(H, t + t) F

b TΘ(H, t) TΘ(H, t) F 1 + ( t/2)M 1 M X

j=1 ϕT (N, D,

N + 1BT )(1 + θt,j + θt,j 2)

+( t/2)M 1 M X

j=1 ϕT (N, D,

N + 1 BT )(1 + wt,j + wt,j 2)

b TΘ(H, t) + ( t/2)M 1 M X

j=1 f( b TΘ(H, t), θt,j) TΘ(H, t) F

b TΘ(H, t) TΘ(H, t) F 1 + t M 1 M X

j=1 ϕT (N, D,

N + 1 BT )(1 + βt,j + βt,j 2)

N + 1BT (K t/2)(1 + r + r2)( t/2)M 1 M X

j=1 ϕT (N, D,

N + 1 BT )(1 + wt,j + wt,j 2)

b TΘ(H, t) TΘ(H, t) F 1 + tϕT (N, D,

N + 1 BT )(1 + r + r2)

N + 1ϕT (N, D,

N + 1 BT )BT (K t2/4)(1 + r + r2)2

(F.23) Therefore, applying (F.23), we deduce that for any t = 0, t, . . . , (L 1) t, 1, we have:

b TΘ(H, t) TΘ(H, t) 2 col

N + 1BT (K t2/4)(1+r+r2)exp(ϕT (N, D,

N + 1 BT )(1 + r + r2))

(F.24) where C1 :=

N + 1BT K(1 + r + r2) exp(ϕT (N, D,

N + 1 BT )(1 + r + r2))/4.

Bounding J2: For any t = 0, t, . . . , (L 1) , i [N + 1], j [M], we have g( TΘ(H, t), βt,j):,i BT . Hence, by the Hoeffding s inequality, for any z > 0 we have

j=1 g( TΘ(H, t), βt,j):,i Z

β g( TΘ(H, t), βt,j):,iρ(β|t)dβ z) 2 exp( z2

By the union bound over i [N + 1] and t = 0, t, . . . , (L 1) , the above inequality implies

P(sup t M 1 M X

j=1 g( TΘ(H, t), βt,j) Z

β g( TΘ(H, t), βt,j)ρ(β|t)dβ 2 col z) 2(N+1)L exp( z2

(F.25) We let z = BT p

2M 1(δ + log((N + 1)L)). Then, (F.25) turns into

P(sup t M 1 M X

j=1 g( TΘ(H, t), βt,j) Z

β g( TΘ(H, t), βt,j)ρ(β|t)dβ 2 col BT p

2M 1(δ + log((N + 1)L)) exp( δ).

Denote the event such that

sup t M 1 M X

j=1 g( TΘ(H, t), βt,j) Z

β g( TΘ(H, t), βt,j)ρ(β|t)dβ 2 col BT p

2M 1(δ + log((N + 1)L))

by Eδ. (F.26) directly indicates P(Eδ) 1 exp( δ).

Suppose that the high probability event Eδ occurs. Let s denote BT p

2M 1(δ + log((N + 1)L)) by Bδ for brevity. From Assumption 2 (iii), it follows that for any t = 0, t, . . . , (L 1) ,

TΘ(H, t + t) e Tρ(H, t + t) F TΘ(H, t) e Tρ(H, t) F

+ t M 1 M X

j=1 g( TΘ(H, t), βt,j) Z

β g( TΘ(H, t), βt,j)ρ(β|t)dβ F

β (g( TΘ(H, t), βt,j) g( e Tρ(H, t), βt,j))ρ(β|t)dβ F

TΘ(H, t) e Tρ(H, t) F + t

β g( TΘ(H, t), βt,j) g( e Tρ(H, t), βt,j) F ρ(β|t)dβ

TΘ(H, t) e Tρ(H, t) F (1 + tϕT (N, D,

N + 1BT )(1 + r + r2))

N + 1Bδ. (F.27) Repeatedly applying Equation (F.27) yields

b TΘ(H, t) TΘ(H, t) F

N + 1Bδ ϕT (N, D,

N + 1BT )(1 + r + r2) exp(ϕT (N, D,

N + 1BT )(1 + r + r2))

M 1(δ + log((N + 1)L)). (F.28) for some constant C2 > 0 dependent on the parameters listed in the result.

Bounding J3: It s worth noting that the convergence proof with a convergence rate of O( t) = O( 1

L) for e Tρ(H, t), the first-order Euler method for Tρ(H, t), is non-standard. This departure from

convention arises because we do not assume the boundedness of the second-order derivative d2Tρ(H,t)

dt2 , instead relying on the continuity of ρ( , t) with respect to the depth index t. In this proof, O( ) hides dependencies on N, D, r, and the parameters of the assumptions.

From the definition of e Tρ and Tρ, we have

e Tρ(H, t + t) Tρ(H, t + t) F e Tρ(H, t) Tρ(H, t) F

+ Tρ(H, t + t) Tρ(H, t) t Z

β g(Tρ(H, t), β)ρ(β|t)dβ F | {z } I1

β (g(Tρ(H, t), β) g( e Tρ(H, t), β))ρ(β|t)dβ F | {z } I2

To bound I1, we use (3.1) to get

β g(Tρ(H, s), β)ρ(β|s)dβds t Z

β g(Tρ(H, t), β)ρ(β|t)dβ F

t sup s [t,t+ t] Z

β g(Tρ(H, s), β)ρ(β|s)dβ Z

β g(Tρ(H, t), β)ρ(β|t)dβ F

t sup s [t,t+ t] Z

β (g(Tρ(H, s), β) g(Tρ(H, t), β))ρ(β|s)dβ F

+ t sup s [t,t+ t] Z

β g(Tρ(H, t), β)(ρ(β|s) ρ(β|t))dβ F

t sup s [t,t+ t]

β g(Tρ(H, s), β) g(Tρ(H, t), β) F ρ(β|s)dβ

+ t sup s [t,t+ t] Z

β g(Tρ(H, t), β)(ρ(β|s) ρ(β|t))dβ F

Given that Proposition C.1 establishes the Lipschitz continuity of Tρ(H, t) with respect to t under the condition that ρ P2 has a bounded support, we can conclude:

sup s [t,t+ t] g(Tρ(H, s), β) g(Tρ(H, t), β) F C3,1|t s| C3,1 t (F.31)

for some constant C3,1 > 0 dependent on the parameters listed in the result. Furthermore, Lemma C.1 and Proposition C.1 demonstrate that g(Tρ(H, t), β) is both bounded and Lipschitz continuous with respect to β. Thus, we have

sup s [t,t+ t] Z

β g(Tρ(H, t), β)(ρ(β|s) ρ(β|t))dβ F = sup s [t,t+ t] Z

β g(Tρ(H, t), β)(ρ(β, s) ρ(β, t))dβ F

sup s [t,t+ t] C3,2 ρ( , s) ρ( , t) BL

C3,2Cρ t (F.32) for some constant C3,2 > 0 dependent on the parameters listed in the result. Substituting (F.31) and (F.32) into (F.30), we find that there exists a constant C3,5 dependent on the parameters listed in the result such that I1 C3,5 t2.

Additionally, Assumption 2 (iii) implies that

β g(Tρ(H, t), β) g( e Tρ(H, t), β) F ρ(β|t)dβ

N + 1BT )(1 + r + r2) Tρ(H, t), β) e Tρ(H, t), β) F

Therefore, by bounding I1 + I2, we obtain the following inequality

Tρ(H, t+ t), β) e Tρ(H, t+ t), β) F C3,5 t2+(1+C3,6 t) Tρ(H, t), β) e Tρ(H, t), β) F , (F.34) which implies, after being used multiple times, that

Tρ(H, t), β) e Tρ(H, t), β) F C3,5 t2 (1 + C3,6 t)L+1 1

C3,6 t C3L 1 (F.35)

for any t = 0, t, . . . , (L 1) t, 1 and some constant C3 dependent on the parameters listed in the result. Combining (F.24), (F.28), and (F.35) yields the desired result.

F.7 Proof of Lemma E.1

Proof. Let eµ(t) be the measure induced by Tρ(H, t) with H µ, and eµ be eµ(t ). By verifying Lemma C.1, we establish that Tρ(H, t) 2 col BT := B exp(K(1 + r + r2)) for any H and

t [0, 1]. Consequently, supp(eµ(t)) {T : T 2 col BT } for any t [0, 1]. The remainder of the proof involves four steps:

Step I: Show that pρ(H, t) is Lipschitz continuous with respect to (H, t)

In the proof of this proposition, when referring to the Lipschitz continuity of a function, we imply its Lipschitz continuity for H within the support of µ. Recall that we have shown in Lemma C.3 that pρ(H, t) is universally bounded and Lipschitz continuous for F and any t [0, 1].

From the formula of pρ in (C.14), for any H, H , we have

pρ(H, t) pρ(H , t) F = vec[pρ(H, t)] vec[pρ(H , t)]

|Read[Tρ(H, 1) Tρ(H , 1)] (y(H) y(H ))| | {z } I1

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβds | {z } I2 + |Read[Tρ(H , 1)] y(H )| | {z } I3

β vec[T ]vec[g(Tρ(H, s), β)]ρ(β, s)dβds exp Z 1

β vec[T ]vec[g(Tρ(H , s), β)]ρ(β, s)dβds | {z } I4

(F.36) From Proposition C.1, we observe that pρ(H, t) is Lipschitz continuous. Since y( ) is Ky-Lipschitz continuous, as given in Assumption 4 (iii), we obtain that I1 H H F . Moreover, from Assumption 1 and Lemma C.1, we see that I3 is universally bounded. Therefore, to show that pρ(H, t) is Lipschitz continuous for F , it suffices to demonstrate that I2 1 and I4 H H F .

From Assumption 2 (iii), we have

vec[T ]vec[g(Tρ(H, s), β)] ρ(β, s)dβds exp ϕT (N, D,

N + 1BT )(1+r+r2) .

Thus, dividing I4 by the uniformly bounded part I2, we obtain

β vec[T ] vec[g(Tρ(H , s), β)] vec[T ]vec[g(Tρ(H, s), β)] ρ(β, s)dβds Idivvec[T ]

vec[T ] vec[g(Tρ(H , s), β)] vec[T ]vec[g(Tρ(H, s), β)] ρ(β, s)dβds 1

exp ϕT T (N, D,

N + 1BT , r) sup t [0,1] Tρ(H, t) Tρ(H , t) F 1,

(F.37) where the final inequality holds by Assumption 3 (iv).

By the Lipschitz continuity of Tρ(H, t) for F as stated in Proposition C.1, we have supt [0,1] Tρ(H, t) Tρ(H , t) F H H F . Hence, utilizing the universal boundedness of the last equation in (F.37), we derive I4 H H F .

Considering all assertions regarding I1, I2, I3 and I4, we conclude that pρ(H, t) is Cp-Lipschitz continuous with respect to H for some constant Cp dependent on the parameters listed in the result that is sufficiently large. The Lipschitz continuity of pρ(H, t) with respect to t could be easily derived from the boundedness of the Jacobian matrix, as asserted in Assumption 2 (iii). Moreover, since pρ(H, t) is universally bounded shown in Lemma C.3, we conclude that pρ(H, t) is Lipschitz continuous with respect to (H, t) for some universal Lipschitz constant Cp dependent on the parameters listed in the result that is sufficiently large.

Step II: Prepare bounds related to pρ for later use

(C.14) implies that pρ solves the adjoint equation

vec[ pρ(H, t)] = vec[pρ(H, t)] Z

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ (F.38)

β vec[T ]vec[g(Tρ(H, t), β)]ρ(β, t)dβ (1+Cρ/2)ϕT (N, D,

N + 1BT )(1+r+r2) =: C0

implied by Assumption 2 (iii). Therefore, the Grönwall s inequality directly indicates that

Eµ pρ(H, t) 2 F exp( 2C0t)Eµ pρ(H, 1) 2 F exp( 2C0)Eµ[|Read[Tρ(H, 1)] y(H)|2] 2 exp( 2C0)R(ρ),

Eµ pρ(H, t) 2 F exp(2C0t)Eµ pρ(H, 1) 2 F exp(2C0)Eµ[|Read[Tρ(H, 1)] y(H)|2] 2 exp(2C0)R(ρ),

for any t [0, 1].

Step III: Construct the descent direction ν

By the well-posedness of the ODE solution to (3.1) as shown in Proposition C.1, the solution map Tρ(H, ) is invertible. Hence, for any ρ P2, there exists a continuous inverse map T 1 t such that T 1 t (Tρ(H, t)) = H for any H and t [0, 1]. Let s define the following function F(T) to approximate F(T) = pρ(T 1 t (T), t ) + Z

β g(T, β)ρ(β|t )dβ h(T, w0)

The function F(T), arising from the composition of pρ( , t) and T 1 t ( ), exhibits continuity over T supp(eµ) owing to the continuous nature of pρ( , t) and T 1 t ( ). Therefore, Assumption 4 (ii) could be applied to F. Since f( , θ) is a universal kernel constrained on θ Rdimθ1 K [52], and F(T) is continuous with respect to T, there exists a sequence {(ck, θk)}k 0 (R Rdimθ1 K)N such that F(T)

k=1 ckf(T, θk)

max ϵ/3 (F.39)

given some ϵ > 0 and any T such that T 2 col BT . Notably, since f(T, θk 1, θk 2) = f(T, θk 1, θk 2), we could assume without loss of generality that ck 0 for any k. Furthermore, there exists a constant kϵ such that F(T)

k=1 ckf(T, θk)

max 2ϵ/3. (F.40)

We define C(ϵ) := Pkϵ k=1 ck, and ν(β) P(Rdimβ) as the probability distribution such that, given β ν and for any k 0, β has probability ck/C(ϵ) of being (2C(ϵ)θk 1, θ2, 0). Then, (F.40) transforms into F(T) Z

β g(T, β) ν(β)dβ max 2ϵ/3, (F.41)

where F(T) == pρ(T 1 t (T), t ) + Z

β g(T, β)ρ(β|t )dβ.

From (F.41), we claim that there exists some R(ϵ) such that

ν({β : 1/R(ϵ) β R(ϵ)}) 1/2,

β >R(ϵ) g(T, β) ν(β)dβ max ϵ/3.

We are now in the position to define the descent direction ν. By defining ν P(Rdimβ) as the measure obtained by truncating any part outside 1/R(ϵ) β R(ϵ) from ν, and scaling the measure function by 1/ ν({β : 1/R(ϵ) β R(ϵ)}) 2, we can establish that F(T) Z

β g(T, β)ν(β)dβ max ϵ. (F.42)

for any T such that T 2 col BT . A straightforward deduction from (F.42) is that for any T such that T 2 col BT , we have F(T) R

β g(T, β)ν(β)dβ F ϵ p

(N + 1)D. It is clear that ν has

a bounded support as {β : 1/R(ϵ) β R(ϵ)}. We will determine the value of ϵ later, ensuring it based only on N, D, r, and the parameters of the assumptions.

Step IV: Upper bound R

δρ (β, t ) ν(β) ρ(β|t ) dβ to complete the proof

Utilizing the gradient definition in (3.4) and R

β g(T, β)ρ(β|t )dβ = F(T) + pρ(T 1 t (T), t ), we obtain Z

δρ (β, t ) ν(β) ρ(β|t ) dβ

β Tr h g(Tρ(H, t ), β) pρ(H, t ) i ν(β) ρ(β|t ) dβ + λ

β β 2 ν(β) ρ(β|t ) dβ

ET eµ(t)Tr h g(T, β) ev(β) ρ(β|t ) pρ(T 1 t (T), t ) i + λ

2 (R(ϵ) + r)

= ET eµ(t)Tr h F(T) Z

β g(T, β)ν(β)dβ pρ(T 1 t (T), t ) i

ET eµ(t)Tr h pρ(T 1 t (T), t ) pρ(T 1 t (T), t ) i

2 (R(ϵ) + r)

For ρ P2 concentrated on Pr, we observe

2Eµ[ |Read[Tρ(H, 1)]|+|y(H)| 2 ] 1

2(BT + B)2.

Hence, to bound J1, we have

J1 ET eµ(t) F(T) Z

β g(T, β)ν(β)dβ F pρ(T 1 t (T), t ) F dt

(N + 1)DϵET eµ(t) pρ(T 1 t (T), t ) F dt

(N + 1)Dϵ ET eµ(t) pρ(T 1 t (T), t ) 2 F 1/2 dt

=(N + 1)Dϵ Eµ pρ(H, t) 2 F 1/2 dt

2(N + 1)D exp(C0)R(ρ)1/2ϵ (N + 1)(BT + B)(N + 1)D exp(C0)R(ρ)ϵ =C3R(ρ)ϵ,

where C3 := (N + 1)(BT + B)(N + 1)D exp(C0).

To bound J2, we have J2 =ET eµ(t) pρ(T 1 t (T), t) 2 F dt

=Eµ pρ(H, t) 2 F dt

Eµ exp( 2C0) pρ(H, 1) 2 F dt

2 exp( 2C0)R(ρ).

Combining (F.43) and (F.44), by choosing ϵ = 1

4 exp( 2C0)/C3, which only depends on N, D, r, and the parameters of the assumptions, we have Z

δρ (β, t ) ν(β) ρ(β|t ) dβ 1

4 exp( 2C0)R(ρ) + λ

4 exp( 2C0)/C3) + r .

Setting Br = R( 1

4 exp( 2C0)/C3), C1 = R( 1

4 exp( 2C0)/C3)+r /2, and C2 = 1

4 exp( 2C0) completes the proof.

F.8 Proof of Lemma C.1

Proof. By applying the Cauchy-Schwarz inequality, we trivially obtain R 1 0 R

β||β||2ρ(β, t)dβdt A. Thus, leveraging (3.1) and Assumption 2 (i), we can infer

d dt Tρ(H, t) 2 col Tρ(H, t) 2 col = Z

β g(Tρ(H, t), β)ρ(β, t)dβ 2 col

β g(Tρ(H, t), β) 2 colρ(β, t)dβ

β K(1 + ||β||2+||β||2 2)ρ(β, t) Tρ(H, t) 2 coldβ.

(F.45) Therefore, by the Grönwall s inequality, we have

Tρ(H, t) 2 col Tρ(H, 0) 2 col exp( Z 1

β K(1 + ||β||2+||β||2 2)ρ(β, t)dβdt)

H 2 col exp(K(1 + A + A2)).

F.9 Proof of Lemma C.2

Proof. As per Lemma C.1, the boundedness of Tν(H, t) 2 col and Tρ(H, t) 2 col is established by a constant C := H 2 col exp(K(1 + A + A2)) > 0 for all t [0, 1]. Consequently, from (3.1), this implies

Tν(H, t1) Tν(H, t2) 2 col Z t2

t1 Tν(H, t) 2 col

β g(Tν(H, t), β 2 colν(β, t)dβdt

(1 + Cρ/2)KC(1 + r + r2)(t2 t1).

for any t1, t2 [0, 1]. Therefore, Tν(H, t) is (1 + Cρ/2)KC(1 + r + r2)-Lipschitz with respect to t for 2 col, and thus

N + 1(1 + Cρ/2)KC(1 + r + r2)-Lipschitz with respect to t for F . Note that by (3.1),

(H, t) := Tρ(H, t) Tν(H, t) F

0 Tρ(H, s) Tν(H, s)ds F

β g(Tρ(H, s), β)ρ(β, t)dβds Z

β g(Tν(H, s), β)ν(β, s)dβds F

β g(Tρ(H, s), β) g(Tν(H, s), β) F ρ(β, s)dβds | {z } J1

β g(Tν(H, s), β) ρ ν (β, s)dβds F | {z } J2

We then bound J1 and J2 using the following two lemmas separately. Firstly, since Tρ F

N + 1C and Tν F

N + 1C, we have by Assumption 2 that

g(Tρ(H, s), β) g(Tν(H, s), β) F sup T F N+1C vec[T ]vec[g(T, β)] 2 Tρ(H, s) Tν(H, s) F

N + 1C)(1 + r + r2) (H, s) (F.48)

Therefore, by (F.48) we have

J1 ϕT (N, D,

N + 1C)(1 + Cρ/2)(1 + r + r2) Z

0 (H, s)ds. (F.49)

Secondly, we aim to bound the integral J2 given Assumption 2 and Tν(H, t) 2 col C on t [0, 1]. Again by Assumption 2 we have

βvec[g(Tν(H, s), β)] F

i=1 βg(Tν(H, s), β):,i 2 2 col

N + 1ϕP (C)(1+r) (F.50)

T vec[g(Tν(H, s), β)] F ϕT (N, D,

N + 1C)(1 + r + r2) (F.51) Since Tν(H, s) is

N + 1(1+Cρ/2)KC(1+r+r2)-Lipschitz with respect to s for F (as shown in (F.46)), by (F.51), we obtain that g(Tν(H, s), β) is

N + 1(1+Cρ/2)KCϕT (N, D,

N + 1C)(1+ r + r2)2-Lipschitz with respect to s. Thus, g(Tν(H, s), β) is C -Lipschitz with respect to (s, β), where C =

N + 1(1 + Cρ/2)KCϕT (N, D,

N + 1C)(1 + r + r2)2 +

N + 1ϕP (C)(1 + r). This indicates, by the Kantorovich-Rubinstein Theorem (see Theorem 5.10 of [68], for example), that

D g(Tν(H, s), β) ρ ν (β, s)dβds F C W1(ρ, ν). (F.52)

Define C := max{C , ϕT (N, D,

N + 1C)(1 + r + r2)}. By combining (F.49) and (F.52), we have

0 (H, s)ds + C W1(ρ, ν). (F.53)

Applying the Grönwall s inequality then shows (H, t) C exp(C t)W1(ρ, ν). (F.54) Specifically, we have Tρ(H, 1) Tν(H, 1) F C exp(C )W1(ρ, ν).

F.10 Proof of Lemma C.3

Proof. Let s consider a fixed (β, t) pair within Pr. Given Lemma C.1, we establish Tρ(H, t) 2 col B exp(K(1 + A + A2)). Consequently, under Assumption 1, we deduce

g(Tρ(H, t), β) F

N + 1 g(Tρ(H, t), β) 2 col

N + 1K Tρ(H, t) 2 col(1 + r + r2)

N + 1KB exp(K(1 + A + A2))(1 + r + r2).

On the other hand, from (C.14), we have

pρ(H, t) F |Read(Tρ(H, 1)) y(H)| exp Z 1

β vec[T ]vec[g(Tρ(H, s), β)ρ(β, s)dβds] 2

(B + B exp(K(1 + A + A2))) exp Z 1

vec[T ]vec[g(Tρ(H, s), β)] ρ(β, s)dβds

(B + B exp(K(1 + A + A2))) exp Z 1

vec[T ]vec[g(Tρ(H, s), β)] ρ(β, s)dβds

(B + B exp(K(1 + A + A2))) exp Z 1

β ϕT (N, D, Tρ(H, s) F )(1 + β + β 2)ρ(β, s)dβds

(B + B exp(K(1 + A + A2))) exp ϕT (N, D,

N + 1KB exp(K(1 + A + A2))(1 + A + A2)

(F.56) Combining (F.55) and (F.56), we could obtain δQ

δρ (β, t) Eµ g(Tρ(H, t), β) F pρ(H, t) F + λ

N + 1KB exp(K(1 + A + A2))(1 + r + r2)(B + B exp(K(1 + A + A2)))

exp ϕT (N, D,

N + 1KB exp(K(1 + A + A2))(1 + A + A2) + λ

F.11 Proof of Lemma C.4

Proof. By employing the Cauchy-Schwarz inequality, we readily observe that 1 ML P

t PM j=1 β A. Consequently, leveraging (2.4) and Assumption 2, we ascertain that for any t = 0, t, . . . , (L 1) t, we obtain:

b TΘ(H, t + t/2) 2 col b TΘ(H, t) 2 col n 1 + 1 2ML

j=1 K(1 + θt,j + θt,j 2) o

b TΘ(H, t + t) 2 col b TΘ(H, t + t/2) 2 col n 1 + 1 2ML

j=1 K(1 + wt,j + wt,j 2) o .

(F.57) Therefore, by applying (F.57) multiple times, we obtain that for any t = 0, t/2, . . . , (L 1/2) t, 1,

b TΘ(H, t) 2 col b TΘ(H, 0) 2 col Y

n 1 + 1 2ML

j=1 K(1 + θt,j + θt,j 2) on 1 + 1 2ML

j=1 K(1 + wt,j + wt,j 2)

H 2 col exp K

j=1 1 + βt,j + βt,j 2

H 2 col exp K(1 + A + A2) .

F.12 Proof of Lemma C.5

Proof. Lemma C.4 shows that e TΘ(H, t) 2 col and e TeΘ(H, t) 2 col are bounded by BT := B exp(K(1 + A + A2)) for any H and t [0, 1]. From (2.4), for any t = 0, . . . , (L 1) t, from Assumption 2 (ii) and (iii) we have

b TΘ(H, t + t/2) b TeΘ(H, t + t/2) F

b TΘ(H, t) b TeΘ(H, t) F + t/2

j=1 f( b TΘ(H, t), θt,j) f( b TeΘ(H, t), eθt,j) F

b TΘ(H, t) b TeΘ(H, t) F + ( t/2)

N + 1ϕP (BT )(1 + r)M 1 M X

j=1 θt,j eθt,j

+( t/2)ϕT (N, D,

N + 1BT )(1 + r + r2) b TΘ(H, t) b TeΘ(H, t) F

b TΘ(H, t) b TeΘ(H, t) F (1 + C1( t/2)) + ( t/2)C2M 1 M X

j=1 θt,j eθt,j .

for some constant C1 and C2 depending only N, D, r and assumptions. Similarly, we have

b TΘ(H, t + t) b TeΘ(H, t + t) F b TΘ(H, t + t/2) b TeΘ(H, t + t/2) F (1 + C1( t/2))

+( t/2)C2M 1 M X

j=1 wt,j ewt,j .

(F.59) Combining (F.58) and (F.59), we derive

b TΘ(H, t+ t) b TeΘ(H, t+ t) F b TΘ(H, t) b TeΘ(H, t) F (1+C3 t)+ t C3M 1 M X

j=1 βt,j eβt,j .

(F.60) where C3 is a constant depending solely on N, D, r, and the parameters of the assumptions. Iterating (F.60) multiple times yields

b TΘ(H, t) b TeΘ(H, t) F exp(C3) 1

for any t = 0, t, . . . , 1.

F.13 Proof of Lemma C.6

Proof. By verifying that Θ satisfies the conditions outlined in Lemma C.4, we establish b TΘ(H, t) 2 col B exp(K(1 + A + A2)). The first two results stem from Assumption 2 (i) and (ii), with recognition that T F

N + 1 T 2 col for any T RD (N+1). As for the third result, consider t = 0, t, . . . , (L 1) t, 1, where max{ bpΘ(H, t) F , bpΘ(H, t + t/2) F } = max{ vec[bpΘ(H, t)] , vec[bpΘ(H, t + t/2)] }

|Read[ b TΘ(H, 1)] y(H)| n Y

(s t)/ t+1 [(1 t)/ t] j [M]

Idimθ + ( t/2) vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] j [M]

Idimw + ( t/2) vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)] o

|Read[ b TΘ(H, 1)] y(H)| Y

(s t)/ t+1 [(1 t)/ t] j [M]

1 + ( t/2) vec[T ]vec[f( b TΘ(H, s), θs,j)]

1 + ( t/2) vec[T ]vec[h( b TΘ(H, s + t/2), ws,j)]

exp ( t/2) X

(s t)/ t+1 [(1 t)/ t] M 1 M X

vec[T ]vec[f( b TΘ(H, s), θs,j)]

(s t)/ t+1 [(1 t)/ t] M 1 M X

vec[T ]vec[h( b TΘ(H, t), ws,j)]

(B + BT ) exp ϕT (N, D,

N + 1BT ) 1

j=1 (1 + β + β 2)

(B + BT ) exp ϕT (N, D,

N + 1KBT )(1 + A + A2) ,

where the first inequality arises from the fact that the matrix 2-norm is greater equal than the norm of any of its columns, and the fourth inequality follows from Assumption 2 (iii).

F.14 Proof of Lemma C.7

Proof. Note that βt,j r for any t = 0, t, . . . , (L 1) t and j [M] with its expectation denoted as R

β βt,j ρ(β, t)dβ. Applying Hoeffding s inequality yields, for any z > 0 and t = 0, t, . . . , (L 1) t

j=1 βt,j 2 Z

β β 2ρ(β, t)dβ| z) 2 exp( z2

By applying the union bound over t = 0, t, . . . , (L 1) t, the inequality above implies

j=1 βt,j 2 1

β β 2ρ(β, t)dβ| z) P(sup t |M 1 M X

j=1 βt,j 2 Z

β β 2ρ(β, t)dβ| z)

2r2 M). (F.62)

In addition, we have

β β 2ρ(β, t)dβ Z 1

β β 2ρ(β, t)dβdt| sup |t s| t | Z

β β 2ρ(β, t)dβ Z

β β 2ρ(β, s)dβ|

r2L 1 sup |t s| t ρ( , t) ρ( , s) BL

r2CρL 1. (F.63) Combining (F.62) and (F.63), and setting z = r p

2M 1(δ + log(2L)) completes the proof.

F.15 Proof of Lemma C.8

Proof. The proof will be trivial by noting the equality

s=1 Bs(Al Bl)

Hence, we have

s=1 Bs Al Bl

l=1 Al Bl .

G Assumption verification for a concrete example

In this section, we consider f(Z, θ) = V Zsoftmax(Z WZ) (G.1)

with the collection of parameters θ = vec[V, W], where softmax denotes the column-wise softmax function. Moreover, consider

h(Z, w) = W2Huberized Re LU(W1H) (G.2)

with the collection of parameters w = vec[W1, W2]. Here, Huberized Re LU denotes the entry-wise Huberized Re LU activation function defined as

Huberized Re LU(x) =

z2/2, if z [0, 1]; z 1/2, if z 1.

Then, we can consider a Transformer model defined by equations (2.1), (2.2), and (2.4) in the paper, where the functions f and h are specified above. We suppose that this Transformer model is applied to a learning task with data that satisfies Assumption 1. We have the following proposition.

Proposition G.1. Consider the Transformer model defined by equations (2.1), (2.2), and (2.4), with f(Z, θ) and h(Z, w) defined in (G.1) and (G.2) respectively. Then Assumptions 2-4 all hold.

Proof. We omit the detailed derivations for the function h(Z, w), which corresponds to the MLP part, in our verification of Assumptions 2 and 3, as h(Z, w) satisfies Assumptions 2 and 3 is relatively more intuitive, especially given the proofs for f(Z, θ).

Denote Z = (z1, . . . , z N+1) RD N+1. Then the function f can be rewritten as

f(Z, θ) = V Zsoftmax(Z WZ) = (f(Z, θ):,i)1 i N+1,

where f(Z, θ):,i = PN+1 j=1 Pij V zj and Pi,: = softmax(Z Wzi). Next, we calculate the derivatives of f(Z, θ):,i with respect to Z and θ as follows:

For Z: the Jacobian J R(N+1)D (N+1)D is J = (Jij)1 i,j N+1, where Jij = f:,i

zj (Z, θ) RD D. After calculation, we obtain

Jij = V ZQi[Eji Z W + Z W δij] + Pij V,

where Qi := diag(Pi:) P i: Pi:, Eij is the matrix with zeros everywhere except one the (i, j)-th entry, and δij is the Kronecker delta (1 if i = j, 0 otherwise).

For θ: Define Ai = Z Wzi. After calculation, we have Pij Akl = Pij(δik Pil), Aij Wkl = Zki Zjl (G.3)

Thus, by the chain rule, we have

Wklf(Z, θ):,i =

j=1 Zki Zjl Pij(δik Pil)V zj. (G.4)

Moreover, we have

vec[V ]f(Z, θ):,i =

j=1 Pij(z j , . . . , z j ) , (G.5)

where (z j , . . . , z j ) contains D copies of zj. We then verify the assumptions one by one.

For Assumption 2 (i), we have

f(T, θ) 2 col = V Tsoftmax(T WT) 2 col V 2 T 2 softmax(T WT) 2 col θ 2 T 2 col softmax(T WT) 1 col θ 2 T 2 col,

where the second-to-the-last inequality follows by the fact that ℓ2-norm can be upper bounded by the ℓ1-norm, and the last inequality follows by the fact that each column of the softmax output has an ℓ1-norm equaling one. Therefore, the first condition in Assumption 2 with K = 1 is verified for the function f in (G.1).

For h in (G.2), we have

h(T, w) 2 col = W2Huberized Re LU(W1T) 2 col W2 2 Huberized Re LU(W1T) 2 col W2 2 W1T 2 col 2 w 2 2 T 2 col,

where the second inequality follows by the property of Huberized Re LU that |Huberized Re LU(x)| |x|. This demonstrates that Assumption 2 (i) with K = 1 holds for h in (G.2) as well.

For Assumption 2 (ii), (G.5) leads to

vec[V ]f(T, θ):,i 2

j=1 Pij T :,j 2

j=1 Pij T:,j 2 T 2 col.

Moreover, (G.4) leads to

vec[W ]f(T, θ):,i 2 X

j=1 Zki Zjl Pij(δik Pil)V T:,j 2

j=1 Pij Tki Tjl V T:,j 2

2 max 1 j N+1

1 k,l D Tki Tjl V T:,j 2

2 max 1 j N+1

1 k,l D |Tki Tjl| V 2 T:,j 2

2 T 2 2 col θ 2

Combining the two equations above gives Assumption 2 (ii) with ϕP ( T 2 col) = T 2 col + 2 T 2 2 col.

For Assumption 2 (iii), we have

1 i,j N+1 Jij 2 2 (N + 1) max 1 i,j N+1 Jij 2.

For any 1 i, j N + 1, we have

Jij 2 Pij V 2 + T 2 Qi 2 Eji T W + T W δij 2 V 2

V 2 1 + 2 T 2 2 W 2

θ 2 + 2 T 2 F θ 2 2 (1 + 2 T 2 F )(1 + θ 2 + θ 2 2).

Hence, we have

J 2 2N T 2 F (N + 1)(1 + 2 T 2 F )(1 + θ 2 + θ 2 2).

The above equation demonstrates that for f, Assumption 2 (iii) holds with ϕT (N, D, T F ) = (N + 1)(1 + 2 T 2 F ). We have verified Assumption 2 for the attention layer encoder f. The verification for h is similar and easier.

Next, we verify Assumption 3. Given that we are currently considering the example where the encoder employs a smooth univariate activation function, we can prove stronger results by removing the expectation Eµ.

(i) and (iii): Given the calculation of derivatives in (G.4) and (G.5) we have presented above, we first show that Pij = softmax(Z Wzi) is locally Lipschitz continuous with respect to Z and θ. By (G.3) and the chain rule, we can derive that

j=1 Zki Zjl Pij(δik Pil), Pij

zj = Qi[Ejiz j W + z j W δij].

and the local Lipschitz continuity is then obvious given the boundedness of Wkl Pij and Pij

zj with respect to necessary parameters.

As we prove that Pij is locally Lipschitz continuous, given that then each component in (G.4) and (G.5) is locally Lipschitz continuity with respect to both Z and θ, and is obviously bounded by an increasing function of N, D, θ , KT , LT . Then the local Lipschitz continuity is straightforward as they are all sufficiently smooth.

(ii) and (iv): Because the norm of the difference of two Jacobian matrices J1 J2 2 is bounded by q P

1 i,j N+1 J1 ij J2 ij 2 2, it suffices to show that Jij is locally Lipschitz continuous with respect to both θ and Z. Again each component of Jij that depends on Z or θ, i.e. Z, Qi, W, Pij, is bounded by an increasing function of N, D, KP , LT , KT , θ , and is locally Lipschitz continuous given sufficient smoothness. Hence, (ii) and (iv) also hold.

For Assumption 4, we consider the pair (g, α) = (h, w), and the partition α = (α1, α2) with α1 = W2, α2 = W1. We also let a compact set K = {W1 : W1 1}. Then Assumption 4 (i) on the partial 1-homogeneity property straightforwardly holds:

h(T, W1, c W2) = c W2Huberized Re LU(W1H) = c h(T, W1, W2).

Regarding Assumption 4 (ii) on the universal kernel property, we first note that according to the choice (g, α) = (h, w), this assumption is purely an assumption on the MLP part of the Transformer. Here we give the detailed proof as follows.

First of all, according to the classic universal approximation theory (see the wiki page of universal approximation theorem and [34, 18, 57] for more details), we know that two-layer fully-connected

networks with non-polynomial activation functions and without any constraints on its parameters are universal approximates.

Therefore, we know that the function class span{W2Re LU2(W1T) : W1 Rdim(W1), W2 Rdim(W2)} is dense in C( T 2 col B, RD (N+1)). Moreover, by the definition of Huberized Re LU, for any B > 0 and any c W1, c W2, there exist small constant c such that c c W1 K, c c W 1 B 1, and

c 2 c W2Hyberized Re LU(c c W1T) =c 2 c W2Re LU2(c c W1T)

=c2 c 2 c W2Re LU2(c W1T)

=c W2Re LU2(c W1T),

where the second equation follows by the positive 2-homogeneity of Re LU2 activation. This implies that

{W2Re LU2(W1T) : W1 Rdim(W1), W2 Rdim(W2)} {W2Hyberized Re LU(W1T) : W2 Rdim(W2) K}.

Therefore, we conclude that span{W2Hyberized Re LU(W1T) : W2 Rdim(W2) K} is dense in C( T 2 col B, RD (N+1)). This finishes the validation of Assumption 4.

H Experiments

As discussed in Sections 3 and 4, our mean-field approximation results and global convergence results are asymptotic guarantees requiring exponentially large number of heads M and number of layers L. Such results are due to the nature of mean-field type analysis. In practice, we frequently observe that global convergence can be achieved by Transformer models of reasonable sizes. In this section, we run simple experiments on training Vision Transformers (Vi T) [24] on the CIFAR-10 datasets to demonstrate global convergence in practical applications.

We train Vision Transformers with different numbers of heads and layers. In all our experiments, we split each CIFAR-10 image into four patches and then pass the patches into Vision Transformer models. We keep the dimension of each attention head to be 128. The output of each self-attention layer is passed through a single-hidden-layer feedforward component with 128 hidden neurons and Ge LU activation. Both the self-attention and feedforward components include skip connections. We implement dropout in the self-attention layers as well as the feedforward layers with a dropout probability of 0.1. The model is attached to a linear classifier.

In all experiments, we train the Vi T models using Adam for 200 epochs with a mini-batch size 512. We set the initial learning rate to be 1e 4, and implement a cosine annealing learning rate schedule. We do not use any data augmentation or explicit regularization techniques, so that global convergence for large enough models implies close-to-zero training loss and close to 100% training accuracy.

In the first set of experiments, we fix the depth of the Vi T to 6 layers (i.e., there are six self-attention layers, each followed by a single-hidden-layer feedforward component). We train such Vision Transformers with the numbers of heads per layer ranging from 4 to 40, and record the training loss and training accuracy throughout training. The results are given in Figure 1. Based on the results, it is clear that for Vi T models with more than 20 heads can achieve close-to-zero training loss and close to 100% training accuracy, demonstrating global convergence on the CIFAR-10 training data.

In the second set of experiments, we fix number of heads in the Vi T model per layer to 8. We train such Vision Transformers with depths ranging from 2 to 20, and record the training loss and training accuracy throughout training. The results are given in Figure 2. Again, the results indicate that Vi T models with more than 16 layers can achieve close-to-zero training loss and close to 100% training accuracy on the CIFAR-10 dataset, implying global convergence.

We note that all these experiments are conducted on a standard GPU card. We can observe clear global convergence when the Vision Transformer is sufficiently wide or deep, but still within reasonable scales. This indicates that, although our theoretical guarantees require extremely large numbers of heads and layers due to the limitations of the mean-field technical tools, global convergence can be achieved by Transformers of reasonable sizes in practice.

(a) Training loss

(b) Training accuracy Figure 1: Training loss and training accuracy of Vision Transformers with different numbers of heads. (a) gives the curves of training loss, while (b) gives the curves of training accuracy.

(a) Training loss

(b) Training accuracy Figure 2: Training loss and training accuracy of Vision Transformers with different depths. (a) gives the curves of training loss, while (b) gives the curves of training accuracy.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The abstract and introduction accurately summarize the main contributions and scope of the paper (Section 1.1), clearly outlining the theoretical advancements while aligning with the stated assumptions. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: We operate under noiseless label settings, as noisy conditions typically preclude achieving zero risk. The limitations of Assumption 4 and the assumptions in Theorem 4.1 are discussed in the respective sections. Although some assumptions are currently untestable, we provide high-level justifications for their validity. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All necessary assumptions (Assumptions 1-4) are stated in the main content prior to presenting the theoretical results, with complete and correct proofs included in the supplementary material. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We clearly disclose all the information for reproducing experimental results. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [No]

Justification: This paper includes only simple experiments in the supplemental material to verify the main theoretical results. Therefore, the access to the data and code is not central to the paper s contribution.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: This paper specifies all necessary details to understand the results.

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: This paper reports appropriate information about the statistical significance.

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: This paper provides sufficient details on the computer resources used for the simulation studies. The task size is small and can be easily reproduced using a standard GPU.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: The research presented in this paper fully complies with the Neur IPS Code of Ethics. All ethical guidelines pertinent to the theoretical nature of the work have been adhered to rigorously.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: The paper does not discuss societal impacts as it strictly addresses theoretical optimization guarantees for Transformer models, focusing solely on providing rigorous proofs without exploring practical applications. Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: The paper is theoretical and does not involve the release of data or models. Therefore, there are no risks for misuse or dual-use associated with this research. Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: The public CIFAR-10 datasets are explicitly mentioned and properly respected. Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset.

The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper is theoretical and does not involve the use of existing assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper is theoretical and does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper is theoretical and does not involve crowdsourcing nor research with human subjects.

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.