Published as a conference paper at ICLR 2025

TRANSFORMERS ARE UNIVERSAL IN-CONTEXT LEARNERS

Takashi Furuya¹, Maarten V. de Hoop², Gabriel Peyré³
¹Doshisha Univ. takashi.furuya0101@gmail.com
²Rice Univ. mvd2@rice.edu
³CNRS, ENS, PSL Univ. gabriel.peyre@ens.fr

ABSTRACT

Transformers are deep architectures that define "in-context mappings" which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In this work, we study in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically and uniformly address their expressivity, we consider the case where the mappings are conditioned on a context represented by a probability distribution of tokens, which becomes discrete for a finite number of these. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. This result implies, as a special case, that transformers are universal approximators for continuous permutation-invariant mappings over a fixed number of tokens. It also establishes the universal approximation capability of transformers for certain in-context learning tasks, demonstrating in particular their ability to perform regression within context. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLPs between multi-head attention layers is also explicitly controlled. We consider both unmasked attention (as used for the vision transformer) and masked causal attention (as used for NLP and time series applications). We tackle the causal setting by leveraging a space-time lifting to analyze causal attention as a mapping over probability distributions of tokens.

1 INTRODUCTION

Transformers have revolutionized the field of machine learning with their powerful attention mechanisms, as introduced by Vaswani et al. (2017). The exceptional performance and expressivity of large-scale transformers have been empirically well established for both NLP (Brown et al., 2020) and vision applications (Dosovitskiy et al., 2020). One key property of these architectures is their ability to leverage contexts of arbitrary length, which enables the parameterization of in-context mappings with an arbitrarily large complexity. In this paper, we present a rigorous formalism to model inputs and the associated context with an arbitrarily large number of tokens, defining a notion of continuity that enables the analysis of their expressivity.

Universality, from neural networks to neural operators. Multilayer Perceptrons (MLP) with two layers are universal approximators, as shown decades ago in Cybenko (1989); Hornik et al. (1989), with a comprehensive review in Pinkus (1999). The significance of depth in enhancing expressivity is explored in Hanin & Sellke (2017); Yarotsky (2017).
These results have been extended to cover a variety of architectural constraints on the networks, for instance, invoking weight sharing in Convolutional Neural Networks (CNN) (Zhou, 2020) and skip connections in ResNets (Cuchiero et al., 2020; Tabuada & Gharesifard, 2022). It is also possible to design equivariant architectures, in particular for graph neural networks (Kratsios & Papon, 2022; Keriven & Peyré, 2019; Xu et al., 2018) and neural networks operating on sets of points (Qi et al., 2017; De Bie et al., 2019). The connection between transformers and graph neural networks is exposed in Müller et al. (2023). Here, we take a different point of view, with transformers operating on probability distributions rather than on sets of points. Related to this setup are extensions of neural networks acting on finite-dimensional vector spaces to infinite-dimensional function spaces, resulting in the notion of neural operators (Kovachki et al., 2023), the universality of which is studied in Furuya et al. (2023). Neural operators can be generalized to cope with data in metric spaces, addressing topological obstructions, in Kratsios et al. (2023b).

Mathematical modeling of transformers. It is now customary to describe transformers as performing in-context prediction, meaning that they map token to token, while this map depends on a set of previously seen tokens. This context might be very long, possibly arbitrarily long, which is the focus of this article. The ability of trained transformers to effectively perform in-context computation has been supported by both empirical studies (von Oswald et al., 2023) and theoretical results (Ahn et al., 2024; Mahankali et al., 2023; Sander et al., 2024; Zhang et al., 2023) on simplified architectures (typically linear attention) and specific data-generation processes. To make a rigorous analysis of arbitrarily long token lengths, and also to describe a mean-field limit of an infinite number of tokens, it is convenient to view attention as operating over probability distributions of tokens (Vuckovic et al., 2020; Sander et al., 2022). The smoothness (Lipschitz continuity) of the resulting attention layers is analyzed in Castin et al. (2024). Deep transformers (with the residual connection) can be described by a coupled system of particles evolving across the layers. The analysis of the clustering properties of such an evolution is studied in Geshkovski et al. (2023a;b).

Universality of transformers. Yun et al. (2019) provides, to the best of our knowledge, the most detailed account of the universality of transformers. The authors rely on shallow transformers with only 2 heads, but require that the transformers operate over an embedding dimension which grows with the number of tokens. This result is refined in Nath et al. (2024), which highlights the difficulty of attention mechanisms to capture smooth functions. Our focus is different, since we consider deep transformers with a fixed embedding dimension, but which are universal for an arbitrary number of tokens. We note that there exist variations over the original transformer's architecture which enjoy universality results, for instance, the Sumformer (Alberti et al., 2023) and stochastic deep networks (De Bie et al., 2019), which also require an embedding dimension that grows with the number of tokens. We furthermore mention the introduction of probabilistic transformers (Kratsios et al., 2023a), which can approximate embeddings of metric spaces.
The work of Agrachev & Letrouit (2024) provides an abstract universal interpolation result for equivariant architectures under genericity conditions, but it is not known whether there exist generic attention maps. While this is not directly related to our results, a line of works studies the expressivity of transformers operating on a discrete set of tokens, as formal systems (Chiang et al., 2023; Merrill & Sabharwal, 2023; Strobl et al., 2024; Elhage et al., 2021). Another line of work studies the impact of positional encoding on their expressivity (Luo et al., 2022).

1.1 OUR CONTRIBUTIONS

Our work provides a rigorous formalization of transformer expressivity and continuity as operating over the space of probability distributions. The main mathematical results are the universality statements presented in Theorems 1 and 2, respectively for the unmasked and the masked settings. Our approach effectively handles an arbitrary number of tokens and leverages deep architectures without requiring arbitrary width. The embedding dimension and the number of heads are proportional to the dimension of the input tokens and are independent of the precision. It is interesting to note that the masked setting requires a stronger regularity hypothesis on the contexts, namely that they are Wasserstein-Lipschitz with respect to time, which is needed to cope with the causality constraint in the relevant mappings.

1.2 NOTATION

For a natural number $N \in \mathbb{N}$, we denote $[N] := \{1, \dots, N\}$. For a vector $x \in \mathbb{R}^d$, the Euclidean norm of $x$ is denoted by $|x|$. For two vectors $x, y \in \mathbb{R}^d$, the Euclidean inner product of $x$ and $y$ is denoted by $\langle x, y \rangle$ and the component-wise multiplication of $x$ and $y$ is denoted by $x \odot y$. The vector $1_d$ is the vector of dimension $d$ with all coordinates equal to 1, that is, $1_d := (1, \dots, 1)^\top \in \mathbb{R}^d$. We denote by $\mathcal{P}(\Omega)$ the space of probability measures on $\Omega$, and by $\mathcal{C}(\Omega)$ the space of continuous functions from $\Omega$ to $\mathbb{R}$, where $\Omega \subset \mathbb{R}^d$ is a compact domain for token embeddings. In what follows, we frequently utilize notions such as the push-forward operator $T_\sharp$, the weak* topology (with convergence denoted by $\rightharpoonup$), and the Wasserstein distance $W_p$ for $1 \le p < +\infty$. For further details on these notions, we refer to Appendix A.

2 MEASURE-THEORETIC IN-CONTEXT MAPPINGS

Transformers are defined by alternating multi-head attention layers (which compute interactions between tokens) with MLP and normalization layers (which operate independently over each token). For the sake of simplicity, we omit normalization in the following analysis. We first recall their definition and then explain how they can be equivalently re-written using in-context mappings. This definition provides new insights and can also be generalized to an infinite number of tokens encoded in a probability measure.

2.1 ATTENTION AS IN-CONTEXT MAPPINGS ON TOKEN ENSEMBLES

Classical definition. A set of $n$ tokens, $x_i \in \mathbb{R}^{d_{\mathrm{tok}}}$, is denoted by $X = (x_i)_{i=1}^n \in \mathbb{R}^{d_{\mathrm{tok}} \times n}$. An attention head maps these $n$ tokens to the same number $n$ of tokens in $\mathbb{R}^{d_{\mathrm{head}}}$ through

$$\forall X \in \mathbb{R}^{d_{\mathrm{tok}} \times n}, \quad \mathrm{Att}_\theta(X) := V X \, \mathrm{SoftMax}\!\left(X^\top Q^\top K X / \sqrt{k}\right) \in \mathbb{R}^{d_{\mathrm{head}} \times n},$$

where the parameters are the (Key, Query, Value) matrices $\theta := (K, Q, V) \in \mathbb{R}^{k \times d_{\mathrm{tok}}} \times \mathbb{R}^{k \times d_{\mathrm{tok}}} \times \mathbb{R}^{d_{\mathrm{head}} \times d_{\mathrm{tok}}}$. Here, the (possibly masked) SoftMax function operates in a row-wise manner:

$$\forall Z \in \mathbb{R}^{n \times n}, \quad \mathrm{SoftMax}(Z) := \left(\frac{M_{i,j}\, e^{Z_{i,j}}}{\sum_{\ell=1}^n M_{i,\ell}\, e^{Z_{i,\ell}}}\right)_{i,j=1}^n \in \mathbb{R}^{n \times n}_+,$$

where $M_{i,j} = 1$ in the unmasked setting (bidirectional encoding transformers) and $M_{i,j} = 1_{j \le i}$ in the masked setting (causal decoding transformers).
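For readers who prefer code, the following minimal NumPy sketch (ours, not from the paper; the row-as-query convention and variable names are our own choices) implements one attention head following the definition above, with the optional causal mask $M_{i,j} = 1_{j \le i}$.

```python
import numpy as np

def attention_head(X, K, Q, V, masked=False):
    """One attention head on column-stacked tokens X of shape (d_tok, n).

    Z[i, j] = <Q x_i, K x_j> / sqrt(k); the (optionally causal-masked) softmax is
    taken row-wise, and output token i is the weighted sum of the values V x_j.
    """
    k = K.shape[0]
    Z = X.T @ Q.T @ K @ X / np.sqrt(k)        # (n, n) score matrix
    M = np.tril(np.ones_like(Z)) if masked else np.ones_like(Z)
    W = M * np.exp(Z)
    W = W / W.sum(axis=1, keepdims=True)      # row-wise (masked) softmax
    return V @ X @ W.T                         # (d_head, n): column i = sum_j W[i, j] V x_j

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                    # 6 tokens of dimension d_tok = 4
K, Q = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
V = rng.normal(size=(3, 4))
print(attention_head(X, K, Q, V).shape)        # (3, 6), unmasked
print(attention_head(X, K, Q, V, masked=True).shape)
```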
Multiple heads with different parameters $\theta := (W^h, \theta^h)_{h=1}^H$ are combined in a linear way in a multi-head attention

$$\mathrm{MAtt}_\theta(X) := X + \sum_{h=1}^{H} W^h\, \mathrm{Att}_{\theta^h}(X), \qquad (1)$$

where $W^h \in \mathbb{R}^{d_{\mathrm{tok}} \times d_{\mathrm{head}}}$ and $\theta^h := (K^h, Q^h, V^h)$. In the following, we denote the various dimensions of a multi-head attention layer by $d_{\mathrm{tok}}(\theta)$ and $d_{\mathrm{head}}(\theta)$ for the token and head dimensions, respectively, $k(\theta)$ for the key/query dimension, and $H(\theta)$ for the number of heads.

In-context mapping form. For the unmasked setting, the mapping $X \mapsto \mathrm{MAtt}_\theta(X)$ can be re-written as the application of an in-context function $G_\theta(X, \cdot)$ to each token, $x_i \mapsto G_\theta(X, x_i)$, i.e. $\mathrm{MAtt}_\theta(X) = (G_\theta(X, x_i))_{i=1}^n$, where the in-context mapping is

$$\forall (X, x) \in \mathbb{R}^{d_{\mathrm{tok}} \times n} \times \mathbb{R}^{d_{\mathrm{tok}}}, \quad G_\theta(X, x) := x + \sum_{h=1}^{H} W^h \sum_{j=1}^{n} \frac{\exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h x_j \rangle\right)}{\sum_{\ell=1}^{n} \exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h x_\ell \rangle\right)}\, V^h x_j. \qquad (2)$$

In the masked setting, due to the lack of permutation equivariance, it is required to also track the index $i$ of the token. The mapping $X \mapsto \mathrm{MAtt}_\theta(X)$ can then be re-written as $\mathrm{MAtt}_\theta(X) = (G_\theta(X, x_i, i))_{i=1}^n$, where the in-context mapping is

$$G_\theta(X, x, i) := x + \sum_{h=1}^{H} W^h \sum_{j=1}^{i} \frac{\exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h x_j \rangle\right)}{\sum_{\ell=1}^{i} \exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h x_\ell \rangle\right)}\, V^h x_j. \qquad (3)$$

Here, the terminology "in-context" refers to the fact that $G_\theta(X, \cdot)$ depends on the tokens $X$ themselves, and can thus be seen as a parametric map that is modified for each token depending on its interactions with the other tokens. While this re-writing is equivalent to the original one, it highlights the fact that transformers define spatial mappings. This also allows us to clearly state the associated mathematical question at the core of this paper, which is the approximation of arbitrary in-context mappings by (compositions of) such parametric maps. Another interest of this reformulation is that it enables the definition of generalized attention operating over a possibly infinite number of tokens, as explained in Section 2.2.

Composition of in-context mappings. A transformer (ignoring normalization layers at this moment) is a composition of $L$ attention layers and Multi-Layer Perceptrons (MLP):

$$\mathrm{MLP}_{\xi_L} \circ \mathrm{MAtt}_{\theta_L} \circ \cdots \circ \mathrm{MLP}_{\xi_1} \circ \mathrm{MAtt}_{\theta_1}. \qquad (4)$$

Here, the $\mathrm{MLP}_\xi$ functions process each token independently from one another, $\mathrm{MLP}_\xi(X) = (F_\xi(x_i))_{i=1}^n$, i.e., they are context-free mappings (in the above notation, $F_\xi(X, x) = F_\xi(x)$), while the attention maps $G_\theta(X, \cdot)$ depend on the context $X$. At the level of in-context mappings, the composition of layers in (4) induces a new in-context composition rule, which we denote by $\diamond$:

$$(G_2 \diamond G_1)(X, x) := G_2(X_1, G_1(X, x)) \quad \text{where} \quad X_1 := (G_1(X, x_i))_{i=1}^n, \qquad (5)$$

for the unmasked case, and

$$(G_2 \diamond G_1)(X, x, i) := G_2(X_1, G_1(X, x, i), i) \quad \text{where} \quad X_1 := (G_1(X, x_i, i))_{i=1}^n, \qquad (6)$$

for the masked case. This rule can be applied whether $G_1(X, \cdot)$ or $G_2(X, \cdot)$ depends on the context $X$ or not (such as for the $F_\xi$ mappings above). Using this rule, the transformer's definition in (4) translates into a composition of in-context and context-free maps, i.e.,

$$F_{\xi_L} \diamond G_{\theta_L} \diamond \cdots \diamond F_{\xi_1} \diamond G_{\theta_1}. \qquad (7)$$

The core question this paper addresses is the uniform approximation of continuous (in a suitable topology) in-context maps $(X, x) \mapsto G(X, x)$ or $(X, x, i) \mapsto G(X, x, i)$ by transformer in-context mappings of the form of (4), with clear control of the dimensions and the number of heads involved in the different layers. The main originality of our approach is that we aim to do so for an arbitrary number $n$ of tokens, as we now explain.

2.2 MEASURE-THEORETIC IN-CONTEXT MAPPINGS: UNMASKED SETTING

A first key observation is that the definition in (2) makes sense irrespective of the number $n$ of tokens.
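The per-token reading of (1)-(2) can be made concrete with a short NumPy sketch (ours; tokens are stored as rows and all names are illustrative): the same multi-head parameters define a map for any context length $n$, and $\mathrm{MAtt}_\theta(X)$ is recovered by applying $G_\theta(X, \cdot)$ to each token.

```python
import numpy as np

def in_context_map(X, x, heads):
    """Unmasked in-context map G_theta(X, x) of equation (2).

    X : (n, d_tok) context tokens (rows), x : (d_tok,) query token,
    heads : list of (W, K, Q, V) with W (d_tok, d_head), K, Q (k, d_tok), V (d_head, d_tok).
    """
    out = x.copy()
    for W, K, Q, V in heads:
        k = K.shape[0]
        scores = (X @ K.T @ Q @ x) / np.sqrt(k)   # <Q x, K x_j> / sqrt(k) for each context token
        w = np.exp(scores - scores.max())
        w = w / w.sum()                            # softmax over the context
        out = out + W @ (V @ (w @ X))              # + W^h sum_j w_j V^h x_j
    return out

def multi_head_attention(X, heads):
    """MAtt_theta(X) of equation (1), written as (G_theta(X, x_i))_i."""
    return np.stack([in_context_map(X, x, heads) for x in X])

rng = np.random.default_rng(0)
d_tok, d_head, k = 4, 3, 2
heads = [(rng.normal(size=(d_tok, d_head)), rng.normal(size=(k, d_tok)),
          rng.normal(size=(k, d_tok)), rng.normal(size=(d_head, d_tok))) for _ in range(2)]
# The same parameters define a map for any context length n:
for n in (3, 10, 100):
    X = rng.normal(size=(n, d_tok))
    print(n, multi_head_attention(X, heads).shape)   # (n, d_tok)
```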
The second key observation is that, in the unmasked case, $M_{i,j} = 1$, the attention mapping is permutation equivariant. To make this more explicit, and also to handle the limit of an infinite number of tokens, we represent a set $X$ of tokens using a probability distribution $\mu \in \mathcal{P}(\mathbb{R}^{d_{\mathrm{tok}}})$ over $\mathbb{R}^{d_{\mathrm{tok}}}$. A finite number of tokens is encoded using a discrete empirical measure,

$$\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} \in \mathcal{P}(\mathbb{R}^{d_{\mathrm{tok}}}). \qquad (8)$$

This encoding is not only for notational convenience; it also allows us to define clearly a correct notion of smoothness for the in-context mappings. This smoothness corresponds to the displacement of the tokens and is quantified through the optimal transport distance, as presented in Section 1.2. This enables us to compare contexts with different sizes and, for instance, to compare a set of tokens with a large (but finite) $n$ to a continuous distribution. The in-context mapping in (2) is now defined as

$$\forall (\mu, x) \in \mathcal{P}(\mathbb{R}^{d_{\mathrm{tok}}}) \times \mathbb{R}^{d_{\mathrm{tok}}}, \quad \Gamma_\theta(\mu, x) := x + \sum_{h=1}^{H} W^h \int \frac{\exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h y \rangle\right)}{\int \exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h z \rangle\right) \mathrm{d}\mu(z)}\, V^h y \, \mathrm{d}\mu(y). \qquad (9)$$

The discrete case is contained in this more general definition in the sense that, for $X = (x_i)_{i=1}^n$,

$$G_\theta(X, x) = \Gamma_\theta\!\left(\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i},\, x\right).$$

In the following, we will invoke, whenever convenient, the following slight abuse of notation, $\Gamma_\theta(\mu, x) = \Gamma_\theta(\mu)(x)$, so that $\Gamma_\theta(\mu) : \mathbb{R}^{d_{\mathrm{tok}}} \to \mathbb{R}^{d_{\mathrm{tok}}}$ defines a map between Euclidean spaces. Using this general definition, the attention map $X \mapsto \mathrm{MAtt}_\theta(X)$ can be rewritten as displacing the token positions, which corresponds to applying a push-forward to the measure as defined in (22),

$$\mu \in \mathcal{P}(\mathbb{R}^{d_{\mathrm{tok}}}) \;\mapsto\; \Gamma_\theta(\mu)_\sharp\, \mu \in \mathcal{P}(\mathbb{R}^{d_{\mathrm{tok}}}).$$

This formulation of transformers as a mapping between probability measures was introduced in Sander et al. (2022) and also used in Castin et al. (2024) to prove a convergence result of deep transformers. We re-use it here but put the emphasis on the in-context mapping itself, which is the object of interest of this paper (rather than on studying the mapping between measures).

Composition of in-context unmasked measure-theoretic mappings. The definition of composition in (5) generalizes to the measure-theoretic setting in the unmasked case as

$$(\Gamma_2 \diamond \Gamma_1)(\mu, x) := \Gamma_2(\mu_1, \Gamma_1(\mu, x)), \quad \text{where} \quad \mu_1 := \Gamma_1(\mu)_\sharp\, \mu, \qquad (10)$$

i.e., $(\Gamma_2 \diamond \Gamma_1)(\mu) = \Gamma_2(\mu_1) \circ \Gamma_1(\mu)$. Transformers operating over an arbitrary (possibly infinite) number of tokens are then obtained by replacing the original definition (7) by

$$F_{\xi_L} \diamond \Gamma_{\theta_L} \diamond \cdots \diamond F_{\xi_1} \diamond \Gamma_{\theta_1}. \qquad (11)$$

Here, the $F_\xi$ are context-free MLP mappings, i.e., $F_\xi(\mu, x) = F_\xi(x)$ is independent of $\mu$. It is important to keep in mind that, when restricted to finite discrete empirical measures of the form (8), the definitions in (7) and (11) coincide. Our theory encompasses classical transformers as well as their mean-field limits operating over arbitrary measures.
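As a small illustration of (9), the following sketch (ours; names are illustrative) evaluates $\Gamma_\theta(\mu, x)$ for a discrete measure $\mu = \sum_j w_j \delta_{x_j}$. With uniform weights $w_j = 1/n$ it coincides with $G_\theta(X, x)$ of (2), and the very same code accepts non-uniform weights, which is the measure-theoretic generalization.

```python
import numpy as np

def gamma_theta(points, weights, x, heads):
    """Measure-theoretic in-context map Gamma_theta(mu, x) of equation (9),
    for a discrete measure mu = sum_j weights[j] * delta_{points[j]}.

    points : (n, d_tok), weights : (n,) nonnegative and summing to 1,
    heads  : list of (W, K, Q, V) as in the multi-head attention above.
    """
    out = x.copy()
    for W, K, Q, V in heads:
        k = K.shape[0]
        scores = (points @ K.T @ Q @ x) / np.sqrt(k)
        num = weights * np.exp(scores - scores.max())
        w = num / num.sum()                      # exp(...) dmu / integral of exp(...) dmu
        out = out + W @ (V @ (w @ points))
    return out

rng = np.random.default_rng(1)
d_tok, d_head, k, n = 4, 3, 2, 8
heads = [(rng.normal(size=(d_tok, d_head)), rng.normal(size=(k, d_tok)),
          rng.normal(size=(k, d_tok)), rng.normal(size=(d_head, d_tok))) for _ in range(2)]
X = rng.normal(size=(n, d_tok))
x = rng.normal(size=d_tok)

# With the empirical measure (uniform weights 1/n), Gamma_theta recovers G_theta(X, x):
print(gamma_theta(X, np.full(n, 1.0 / n), x, heads))

# The same map also accepts an arbitrary weighted (e.g. mean-field) context:
weights = rng.random(n); weights /= weights.sum()
print(gamma_theta(X, weights, x, heads))
```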
2.3 MEASURE-THEORETIC IN-CONTEXT MAPPINGS: MASKED SETTING

In the masked setting (for NLP or time series applications), $M_{i,j} = 1_{j \le i}$, the attention mappings are no longer permutation equivariant. To restore this invariance, and to be able to write in-context mappings using measures, we introduce a space-time lifting so that the input tokens are of the form $\{(x_i, t_i)\}_{i=1}^n$, where $t_i \in [0, 1]$. For instance, assuming an upper bound $N$ on the number of tokens, one can use $t_i = i/N$, but it is also possible to assume that the $t_i$ are positioned arbitrarily in $[0, 1]$, which enables considering an arbitrarily large (and even infinite) number of tokens. We thus let the context be encoded as a space-time measure $\mu \in \mathcal{P}(\mathbb{R}^{d_{\mathrm{tok}}} \times [0, 1])$. Similarly to Castin et al. (2024, Definition 2.6), we introduce the in-context map,

$$\forall (x, t) \in \mathbb{R}^{d_{\mathrm{tok}}} \times [0, 1], \quad \Gamma_\theta(\mu, x, t) := x + \sum_{h=1}^{H} W^h \int \frac{\exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h y \rangle\right) 1_{[0,t]}(r)}{\int \exp\!\left(\frac{1}{\sqrt{k}} \langle Q^h x, K^h z \rangle\right) 1_{[0,t]}(s)\, \mathrm{d}\mu(z, s)}\, V^h y \, \mathrm{d}\mu(y, r), \qquad (12)$$

where $1_{[0,t]}(s)$ is a masking function equal to 1 if $0 \le s \le t$ and 0 otherwise. For a finite number of tokens, using a discrete measure $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{(x_i, i/n)}$, one retrieves the initial definition (3) of the masked case. The space-time lifting can incorporate positional information; however, recent positional encoding methods, such as Rotary Positional Embedding (RoPE) (Kazemnejad et al., 2024), are not included in the current formulation. This extension is left for future work.

Composition of in-context masked measure-theoretic mappings. The composition rule in the masked setting is similar to the one in the unmasked setting (cf. (10)), except that the time position of the token is kept unchanged while the push-forward acts in space,

$$(\Gamma_2 \diamond \Gamma_1)(\mu, x, t) := \Gamma_2(\mu_1, \Gamma_1(\mu, x, t), t), \quad \text{where} \quad \mu_1 := (\Gamma_1(\mu), \mathrm{Id}_{\mathbb{R}})_\sharp\, \mu. \qquad (13)$$

Here, $(\Gamma_1(\mu), \mathrm{Id}_{\mathbb{R}}) : (x, t) \in \mathbb{R}^{d_{\mathrm{tok}}+1} \mapsto (\Gamma_1(\mu)(x, t), t) \in \mathbb{R}^{d_{\mathrm{tok}}+1}$ (the time is kept unchanged). Equipped with this definition, one retrieves the composition rule (6) when the measure $\mu$ is discrete.

3 UNIVERSALITY IN THE UNMASKED CASE

3.1 STATEMENT OF THE RESULT AND DISCUSSION

Our first main result is the following universal approximation theorem for unmasked in-context mappings.

Theorem 1. Let $\Omega \subset \mathbb{R}^d$ be a compact set and $\Lambda^\star : \mathcal{P}(\Omega) \times \Omega \to \mathbb{R}^{d'}$ be continuous, where $\mathcal{P}(\Omega)$ is endowed with the weak* topology. Then, for all $\varepsilon > 0$, there exist $L$ and parameters $(\theta_\ell, \xi_\ell)_{\ell=1}^L$ such that

$$\forall (\mu, x) \in \mathcal{P}(\Omega) \times \Omega, \quad \left|F_{\xi_L} \diamond \Gamma_{\theta_L} \diamond \cdots \diamond F_{\xi_1} \diamond \Gamma_{\theta_1}(\mu, x) - \Lambda^\star(\mu, x)\right| \le \varepsilon,$$

with $d_{\mathrm{tok}}(\theta_\ell) \le d + 3d'$, $d_{\mathrm{head}}(\theta_\ell) = k(\theta_\ell) = 1$, and $H(\theta_\ell) \le d'$.

Appendix D demonstrates that this result can be applied to two key settings: permutation-invariant mappings over a fixed number of tokens, and in-context learning of regression operators. The two strengths of the result above are (i) the approximating architecture performs the approximation independently of $n$ (it even works for an infinite number of tokens), and (ii) the number of heads and the embedding dimension do not depend on $\varepsilon$. Moreover, its proof technique is notable, particularly because, unlike classical MLPs with cosine activation functions, shallow attention architectures lack an algebraic structure, as they cannot be multiplied together. To address this, we rely on depth to accommodate this limitation. A weakness is that the theorem is "non-quantitative", meaning that we have no explicit control over the dependency of the number of MLP parameters $\xi_\ell$ on $\varepsilon$. A limitation of our proof technique is that the number of heads grows proportionally to the output dimension, while each head only outputs a scalar, $d_{\mathrm{head}}(\theta_\ell) = 1$. Obtaining a better balance between these two parameters is an interesting problem. As explained in the proof, these MLPs approximate a real-valued squaring operator $\mathbb{R} \ni a \mapsto a^2$, so we expect this dependency to be well-behaved in common situations; however, our construction does not provide any a priori bound on how the magnitude of the tokens grows through the layers. The main hypothesis of Theorem 1 is that the underlying map $\Lambda^\star$ is continuous for the weak* topology over measures (see Section 1.2 for some background). However, this setting might not be a proper one for conducting further quantitative studies; we leave these for future work.

3.2 PROOF OF THEOREM 1

We first consider elementary in-context mappings, which map $(\mu, x)$ to a real variable,

$$\gamma_\lambda(\mu, x) := \langle x, a \rangle + b + \int \frac{e^{c(\langle x, a \rangle + b)(\langle y, a \rangle + b)}\, v\, (\langle a, y \rangle + b)}{\int e^{c(\langle x, a \rangle + b)(\langle z, a \rangle + b)}\, \mathrm{d}\mu(z)}\, \mathrm{d}\mu(y), \qquad (14)$$

where $\lambda := (a, b, c, v) \in \mathbb{R}^d \times \mathbb{R} \times \mathbb{R} \times \mathbb{R}$. These elementary mappings are built by composing an affine scalar-valued MLP with a single-head attention (with skip connection) as in (9), operating in 1-D. Indeed, defining $F_\xi(x) = \langle a, x \rangle + b \in \mathbb{R}$ as an affine MLP, where $\xi = (a, b)$, we have $\gamma_\lambda(\mu, x) = (\Gamma_\theta \diamond F_\xi)(\mu, x)$, where $\theta = (k, q, v) \in \mathbb{R}^3$ (recalling that this attention operates in 1-D), and we let $c = qk$.
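To make this building block concrete, here is a small numerical sketch (ours; the parameter values are illustrative) of the elementary map $\gamma_\lambda$ of (14) on a discrete context, computed both directly from (14) and as the composition $\Gamma_\theta \diamond F_\xi$ of a scalar affine map with a one-dimensional single-head attention, with $c = qk$.

```python
import numpy as np

def gamma_lambda(points, x, a, b, c, v):
    """Elementary map gamma_lambda(mu, x) of (14), mu = (1/n) sum_j delta_{points[j]}."""
    u = points @ a + b                     # projected context tokens <a, x_j> + b
    ux = x @ a + b                         # projected query token <a, x> + b
    w = np.exp(c * ux * u)
    w = w / w.sum()                        # softmax weights over the projected context
    return ux + v * (w @ u)

def gamma_as_composition(points, x, a, b, q, kk, v):
    """Same map written as (Gamma_theta o F_xi)(mu, x): affine 1-D MLP, then 1-D attention."""
    u = points @ a + b                     # F_xi pushed forward onto the context
    ux = x @ a + b                         # F_xi applied to the query token
    w = np.exp((q * ux) * (kk * u))        # scalar query q, scalar key kk, so c = q * kk
    w = w / w.sum()
    return ux + v * (w @ u)                # skip connection + value v applied to u_j

rng = np.random.default_rng(2)
points = rng.normal(size=(7, 3))           # 7 context tokens in R^3
x = rng.normal(size=3)
a, b, v = rng.normal(size=3), 0.5, -1.3
q, kk = 0.8, 0.4
print(gamma_lambda(points, x, a, b, q * kk, v))
print(gamma_as_composition(points, x, a, b, q, kk, v))   # identical up to rounding
```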
We now define $\mathcal{A}$, the algebra spanned by these elementary functions:

$$\mathcal{A} := \left\{ \mathcal{P}(\Omega) \times \Omega \ni (\mu, x) \mapsto \sum_{n=1}^{N} \gamma_{\lambda_{1,n}}(\mu, x) \cdots \gamma_{\lambda_{T,n}}(\mu, x) \in \mathbb{R} \;:\; N, T \in \mathbb{N} \right\}.$$

The first main ingredient of the proof is to show that this algebra is dense. Elements of this algebra are sums of products of elementary functions, which are often referred to as "cylindrical functions".

Proposition 1. $\mathcal{A}$ is dense in the space of (weak* $\times$ $\ell^2$)-continuous functions from $\mathcal{P}(\Omega) \times \Omega$ to $\mathbb{R}$.

Proof. We apply the Stone-Weierstrass theorem. First, we note that $\mathcal{P}(\Omega) \times \Omega$ is compact for the (weak* $\times$ $\ell^2$) topology (Aliprantis & Border, 2006, Theorem 15.11). We then check the three key hypotheses needed to apply the Stone-Weierstrass theorem:

1. The functions $\gamma_\lambda$ are continuous because the denominator $\int e^{c(\langle x, a \rangle + b)(\langle z, a \rangle + b)}\, \mathrm{d}\mu(z)$ in the elementary mapping is never zero, for any $\mu \in \mathcal{P}(\Omega)$ and $x \in \Omega$.
2. When setting $a = 0$, $b = v = 1$, one has that $\gamma_\lambda(\mu, x)$ is a constant function.
3. We need to show that the set $(\gamma_\lambda)_\lambda$ separates points, which is more challenging.

For the last one, we need to show that if

$$\forall \lambda, \quad \gamma_\lambda(\mu, x) = \gamma_\lambda(\mu', x'), \qquad (15)$$

then $(\mu, x) = (\mu', x')$. First setting $v = 0$, this implies that $\langle x, a \rangle = \langle x', a \rangle$ for all $a \in \mathbb{R}^d$ and, hence, $x = x'$. Then, setting $b = 1 - \langle a, x \rangle$, (15) reads $L(\mu)(a, c) = L(\mu')(a, c)$, where we defined a generalized Laplace-like transform

$$L(\mu)(a, c) := \frac{\int e^{c \langle a, y \rangle}\, \langle a, y \rangle\, \mathrm{d}\mu(y)}{\int e^{c \langle a, z \rangle}\, \mathrm{d}\mu(z)}. \qquad (16)$$

We conclude that $\mu = \mu'$ using the following key lemma.

Lemma 1. The map $\mu \mapsto L(\mu)$ defined in (16) is injective.

See Appendix B.1 for a detailed proof. To approximate vector-valued in-context mappings, we use the previous algebra of cylindrical functions along each dimension. Since the elementary mapping $\gamma_\lambda$ is built by composing an affine MLP with a single-head attention, we arrive at the following lemma.

Lemma 2. For any $\varepsilon > 0$, there exist $T, N \in \mathbb{N}$ and $(\bar\theta_{t,n}, \bar\xi_{t,n})_{t \in [T], n \in [N]}$ such that

$$\forall (\mu, x) \in \mathcal{P}(\Omega) \times \Omega, \quad |G(\mu, x) - \Lambda^\star(\mu, x)| \le \varepsilon, \quad \text{where} \quad G(\mu, x) := \sum_{n=1}^{N} (\Gamma_{\bar\theta_{1,n}} \diamond F_{\bar\xi_{1,n}})(\mu, x) \odot \cdots \odot (\Gamma_{\bar\theta_{T,n}} \diamond F_{\bar\xi_{T,n}})(\mu, x), \qquad (17)$$

with $d_{\mathrm{tok}}(\bar\theta_{t,n}) = d'$, $d_{\mathrm{head}}(\bar\theta_{t,n}) = k(\bar\theta_{t,n}) = 1$, and $H(\bar\theta_{t,n}) = d'$.

See Appendix B.2 for the details of the proof. We finally need to approximate $G$ in (17) by a deep transformer of the form of (7). To do so, we further approximate the component-wise multiplication maps, $(x, y) \in \mathbb{R}^{2d'} \mapsto x \odot y \in \mathbb{R}^{d'}$, in (17) by MLPs. This yields the following lemma.

Lemma 3. For any $\varepsilon > 0$, there exist $L$ and parameters $(\theta_\ell, \xi_\ell)_{\ell=1}^L$ such that

$$\forall (\mu, x) \in \mathcal{P}(\Omega) \times \Omega, \quad |G(\mu, x) - F_{\xi_L} \diamond \Gamma_{\theta_L} \diamond \cdots \diamond F_{\xi_1} \diamond \Gamma_{\theta_1}(\mu, x)| \le \varepsilon,$$

with $d_{\mathrm{tok}}(\theta_\ell) \le d + 3d'$, $d_{\mathrm{head}}(\theta_\ell) = k(\theta_\ell) = 1$, and $H(\theta_\ell) \le d'$.

See Appendix B.3 for a detailed proof.

4 UNIVERSALITY IN THE MASKED CASE

4.1 CAUSAL MAPS, LIPSCHITZ CONTEXTS, AND MAIN RESULT

As before, $\Omega \subset \mathbb{R}^d$ is a compact set, and we define $\bar\Omega := \Omega \times [0, 1]$ as the space-time domain. Throughout this section, $\bar\mu \in \mathcal{P}([0, 1])$ is the marginal with respect to the time variable of some space-time measure $\mu \in \mathcal{P}(\bar\Omega)$, i.e.,

$$\bar\mu := P_\sharp\, \mu \quad \text{where} \quad P : \bar\Omega \ni (x, t) \mapsto t \in [0, 1]. \qquad (18)$$
Approximation in the masked setting is more subtle than in the unmasked setting because causal attentions are typically less regular due to the masking. To cope with this difficulty, we impose additional smoothness constraints on the context, which still allow for an arbitrary number of tokens.

Definition 1 (Lipschitz contexts). A map $t \in [0, 1] \mapsto \mu(\cdot|t) \in \mathcal{P}(\Omega)$ is $C$-Lipschitz if

$$\forall (s, t) \in [0, 1]^2, \quad W_2(\mu(\cdot|s), \mu(\cdot|t)) \le C|s - t|.$$

The set of space-time $C$-Lipschitz measures is

$$\mathrm{Lip}_C(\bar\Omega) := \left\{\mu \in \mathcal{P}(\bar\Omega) \;:\; \exists\, \mu(\cdot|t) \text{ such that } \mathrm{d}\mu(x, s) = \mathrm{d}\mu(x|s)\,\mathrm{d}\bar\mu(s) \text{ and } t \mapsto \mu(\cdot|t) \text{ is } C\text{-Lipschitz}\right\}.$$

The conditional measure $t \mapsto \mu(\cdot|t) \in \mathcal{P}(\Omega)$ is any valid disintegration of $\mu$ against the marginal $\bar\mu$, and must be $C$-Lipschitz. We also define, for $\sigma \in (0, 1)$,

$$\mathrm{Lip}^\sigma_C(\bar\Omega) := \{\mu \in \mathrm{Lip}_C(\bar\Omega) : \bar\mu(\{0\}) \ge \sigma\}.$$

It is worth noting that these conditions are automatically satisfied by a discrete measure $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{(x_i, t_i)}$ with distinct times $\delta := \min_{i \ne j} |t_i - t_j| > 0$ (using $C = \mathrm{Radius}(\Omega)/\delta$). More generally, if $t \in [0, 1] \mapsto \phi(t) \in \Omega$ is $C$-Lipschitz and $\nu \in \mathcal{P}([0, 1])$, then $\mu = (\phi, \mathrm{Id})_\sharp\, \nu \in \mathrm{Lip}_C(\bar\Omega)$. Masked attention can be conveniently re-expressed as operating over masked contexts, which are defined as follows.

Definition 2 (Masked measure). For $\mu \in \mathrm{Lip}^\sigma_C(\bar\Omega)$, the masked probability measure $\mu_t \in \mathrm{Lip}^\sigma_C(\bar\Omega)$ is defined as

$$\mu_t := \frac{1_{[0,t]}}{\bar\mu([0, t])}\, \mu. \qquad (19)$$

Thanks to $\mu \in \mathrm{Lip}^\sigma_C(\bar\Omega)$, where the starting point of $\bar\mu$ is fixed at 0 (i.e., $0 \in \mathrm{supp}(\bar\mu)$), the masked probability measure $\mu_t$ is well-defined for $t \in (0, 1]$. We note that we can define $\mu_t$ at $t = 0$ as the limit (in the weak* topology), and such a limit exists (see Lemma 10). Thus, the map $[0, 1] \ni t \mapsto \mu_t \in \mathrm{Lip}^\sigma_C(\bar\Omega)$ is continuous (for the weak* topology). In the masked setting, the aim is to approximate causal mappings, defined as follows; these are maps where the output at time $t$ only depends on tokens with smaller times.

Definition 3 (Causal identifiable map). A space-time in-context map $\Lambda$ is said to be causal if

$$\forall (\mu, x, t) \in \mathrm{Lip}^\sigma_C(\bar\Omega) \times \bar\Omega, \quad \Lambda(\mu, x, t) = \Lambda(\mu_t, x, t). \qquad (20)$$

Such a map $\Lambda$ is said to be identifiable if, for any context $\mu \in \mathrm{Lip}^\sigma_C(\bar\Omega)$,

$$\mu_t = \mu_{t'} \implies \Lambda(\mu_t, \cdot, t) = \Lambda(\mu_{t'}, \cdot, t'). \qquad (21)$$

By construction, the masked attention map $\Gamma_\theta$ defined in (12) is a causal identifiable in-context map (see Lemma 11). Since the composition of such maps preserves both causality and identifiability (see Lemma 12), a deep transformer, formed by composing causal identifiable in-context maps, retains these properties. The following theorem mimics Theorem 1, but is restricted to approximation over Lipschitz contexts to cope with the causality constraint.

Theorem 2. Let $\Lambda^\star$ be a continuous (where $\mathrm{Lip}^\sigma_C(\bar\Omega)$ is endowed with the weak* topology) and causal identifiable in-context mapping. Then, for all $\varepsilon > 0$, there exist $L$ and parameters $(\theta_\ell, \xi_\ell)_{\ell=1}^L$ such that

$$\forall (\mu, x, t) \in \mathrm{Lip}^\sigma_C(\bar\Omega) \times \bar\Omega, \quad \left|F_{\xi_L} \diamond \Gamma_{\theta_L} \diamond \cdots \diamond F_{\xi_1} \diamond \Gamma_{\theta_1}(\mu, x, t) - \Lambda^\star(\mu, x, t)\right| \le \varepsilon,$$

with $d_{\mathrm{tok}}(\theta_\ell) \le d + 3d'$, $d_{\mathrm{head}}(\theta_\ell) = k(\theta_\ell) = 1$, and $H(\theta_\ell) \le d'$.

Remark 1 (Sharpness of the identifiability and Lipschitz hypotheses). The masked approximation result in Theorem 2 shares the same conclusion as the unmasked Theorem 1, but it requires stronger assumptions. In particular, the map $\Lambda^\star$ is assumed to be identifiable. This hypothesis is sharp and cannot be weakened: transformers define identifiable maps, and, as proved in Lemma 13, identifiable maps can only uniformly approximate identifiable maps. Identifiability is also crucial since our proof technique involves recasting the approximation over $(\mu, x, t)$ as an approximation over a reduced space $(\mu_t, x)$. Another important assumption is that we restrict our approximation to Lipschitz contexts. This limitation is essential for ensuring that the set of masked contexts $\mu_t$ is compact, which allows us to apply the Stone-Weierstrass theorem.
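To make Definitions 1 and 2 concrete, the following sketch (ours; the context is an illustrative discrete measure with a token at time 0, so that $\bar\mu(\{0\}) \ge \sigma$ with $\sigma = 1/n$) computes the masked measure $\mu_t$ of (19) by keeping the tokens with time at most $t$ and renormalizing by the retained time-marginal mass.

```python
import numpy as np

def masked_measure(points, times, t):
    """Masked measure mu_t of Definition 2 for a discrete space-time context
    mu = (1/n) sum_i delta_{(points[i], times[i])}: keep tokens with time <= t
    and renormalize by the retained time-marginal mass mu_bar([0, t])."""
    keep = times <= t
    if not np.any(keep):
        raise ValueError("at t = 0, mu_t is defined through a limit; this sketch needs a token with time <= t")
    w = np.full(len(times), 1.0 / len(times))[keep]
    return points[keep], times[keep], w / w.sum()

rng = np.random.default_rng(3)
n, d = 6, 2
times = np.linspace(0.0, 1.0, n)        # t_i = i/N-style grid, with a token at time 0
points = rng.normal(size=(n, d))

for t in (0.0, 0.4, 1.0):
    _, kept_times, w = masked_measure(points, times, t)
    print(f"t = {t}: kept times {kept_times.round(2)}, weights {w.round(3)}")
```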
Remark 2 (Fixing the time marginal). We impose that contexts have a Dirac mass at 0, namely $\bar\mu(\{0\}) \ge \sigma$. While this is not restrictive for discrete measures, it prevents, for instance, measures with a density with respect to the Lebesgue measure. It is possible to lift this constraint, and instead impose that the time marginal $\bar\mu$ is fixed, i.e., to replace the set $\mathrm{Lip}^\sigma_C(\bar\Omega)$ by $\{\mu \in \mathrm{Lip}_C(\bar\Omega) : \bar\mu = \nu\}$ for any $\nu \in \mathcal{P}([0, 1])$ satisfying $0 \in \mathrm{supp}(\nu)$. One can check that the proof of Theorem 2 carries over to this setting with minor modifications.

4.2 PROOF OF THEOREM 2

To translate into a setting that can exploit the Stone-Weierstrass theorem, we first introduce the following operation, which can be defined for any probability measure.

Definition 4 (Reduced mapping). For $\Lambda : \mathcal{P}(\bar\Omega) \times \bar\Omega \to \mathbb{R}^{d'}$, we define the reduced map, which takes two arguments $(\mu, x)$, as

$$\forall (\mu, x) \in \mathcal{P}(\bar\Omega) \times \Omega, \quad \Lambda(\mu, x) := \Lambda(\mu, x, e(\bar\mu)),$$

where $e(\bar\mu)$ is the end point of $\mathrm{supp}(\bar\mu)$, that is, $e(\bar\mu) := \max\{r \in \mathrm{supp}(\bar\mu)\} \in [0, 1]$. We introduce the reduced space on which we consider the approximation,

$$\mathcal{X}^\sigma_C := \{(\mu_t, x) : \mu \in \mathrm{Lip}^\sigma_C(\bar\Omega),\ x \in \Omega,\ t \in [0, 1]\}.$$

Lemma 4. Let $\Lambda : \mathrm{Lip}^\sigma_C(\bar\Omega) \times \bar\Omega \to \mathbb{R}^{d'}$ be a causal identifiable in-context mapping as in Definition 3. Then the following holds true. (i) For any $(\mu, x, t) \in \mathrm{Lip}^\sigma_C(\bar\Omega) \times \bar\Omega$, $\Lambda(\mu, x, t) = \Lambda(\mu_t, x)$. (ii) If $\Lambda$ is continuous, then the reduced map $\mathcal{X}^\sigma_C \ni (\mu_t, x) \mapsto \Lambda(\mu_t, x)$ is (weak* $\times$ $\ell^2$)-continuous.

See Appendix C.2 for the details of the proof. As the target map $\Lambda^\star$ is a continuous and causal identifiable in-context mapping, by applying Lemma 4, $\Lambda^\star$ has the form (i) and is continuous on $\mathcal{X}^\sigma_C$. The masked attention map $\Gamma_\theta$ enjoys the same properties (see Lemma 11). Thus, it suffices to show the following proposition.

Proposition 2. For all $\varepsilon > 0$, there exist $L$ and parameters $(\theta_\ell, \xi_\ell)_{\ell=1}^L$ such that

$$\forall (\mu_t, x) \in \mathcal{X}^\sigma_C, \quad \left|F_{\xi_L} \diamond \Gamma_{\theta_L} \diamond \cdots \diamond F_{\xi_1} \diamond \Gamma_{\theta_1}(\mu_t, x) - \Lambda^\star(\mu_t, x)\right| \le \varepsilon,$$

with $d_{\mathrm{tok}}(\theta_\ell) \le d + 3d'$, $d_{\mathrm{head}}(\theta_\ell) = k(\theta_\ell) = 1$, and $H(\theta_\ell) \le d'$.

The proof of Proposition 2 is essentially the same as in the unmasked case, replacing the arguments for $\mu \in \mathcal{P}(\Omega)$ with those for $\mu_t \in \mathrm{Lip}^\sigma_C(\bar\Omega)$. For the application of the Stone-Weierstrass theorem, we need to check the compactness of $\mathcal{X}^\sigma_C$, which is not obvious. Thus, we show the following lemma.

Lemma 5. $\mathcal{X}^\sigma_C$ is compact for the (weak* $\times$ $\ell^2$) topology.

Sketch of proof. Assume that $\mu_n \in \mathrm{Lip}^\sigma_C(\bar\Omega)$ and $(x_n, t_n) \in \bar\Omega$. We denote $\mathcal{P}_\sigma([0, 1]) := \{\nu \in \mathcal{P}([0, 1]) : \nu(\{0\}) \ge \sigma\}$, so that $\bar\mu_n \in \mathcal{P}_\sigma([0, 1])$. As $\mathcal{P}_\sigma([0, 1])$ and $\bar\Omega$ are compact, there exist $\bar\mu \in \mathcal{P}_\sigma([0, 1])$ and $(x, t) \in \bar\Omega$ such that (passing to a subsequence if necessary) $\bar\mu_n \rightharpoonup \bar\mu$ and $(x_n, t_n) \to (x, t)$. Since $[0, 1] \ni s \mapsto \mu_n(\cdot|s) \in \mathcal{P}(\Omega)$ is $C$-Lipschitz, applying the general Arzelà-Ascoli theorem (see e.g., Kelley (2017, Chapter 7, Theorem 17)), there exists $\mu(\cdot|s) \in \mathcal{P}(\Omega)$ such that (again up to a subsequence)

$$\sup_{s \in [0, 1]} W_2(\mu_n(\cdot|s), \mu(\cdot|s)) \to 0 \quad \text{as } n \to \infty.$$

Thus, we define $\mu$ by $\mathrm{d}\mu(x, s) := \mathrm{d}\mu(x|s)\,\mathrm{d}\bar\mu(s)$, which belongs to $\mathrm{Lip}^\sigma_C(\bar\Omega)$. Using this convergence and the fact that the starting points of $\bar\mu_n, \bar\mu \in \mathcal{P}_\sigma([0, 1])$ are always fixed at $t = 0$, we can prove the convergence of the masked probability measures, i.e., $(\mu_n)_{t_n} \rightharpoonup \mu_t$ as $n \to \infty$. The non-obvious situation is when $t_n > 0$ and $t = 0$, since the masked probability measure $\mu_t$ is defined differently at $t = 0$ and for $t \in (0, 1]$. However, this can be resolved by the continuity of the map $[0, 1] \ni t \mapsto \mu_t \in \mathrm{Lip}^\sigma_C(\bar\Omega)$ (see Lemma 10). See Appendix C.3 for more details of the proof.
CONCLUSION AND DISCUSSION

In this work, we have presented a unified analysis of the expressivity of both unmasked and masked transformers in settings with an arbitrarily large number of tokens. A limitation of our method is that it is not quantitative. Using, for instance, the Wasserstein distance between token distributions could be a way to impose smoothness on the map to obtain quantitative bounds. Our proof relies on the approximation of the map along each dimension and the use of a commuting architecture (the transformer layers are multiplied together to obtain the output). This results in a growth of the number of heads proportional to the dimension. Lowering this dependency would require the development of new proof techniques beyond the use of the Stone-Weierstrass theorem. It is important to note that universality results like ours do not directly translate into conclusions about the learning capabilities of transformers. However, our proof techniques, which leverage ideas from optimal transport, share similarities with those used in the analysis of two-layer MLPs (Chizat & Bach, 2018). Thus, we believe that future work could build on this approach to investigate convergence properties of the training of transformers. We also leave extending our approach to more recent positional encodings, such as Rotary Positional Embedding (RoPE) (Kazemnejad et al., 2024), to future work. RoPE modifies all attention layers to account for relative positional information, which would require slight adjustments to the proof to accommodate the different formulas.

ACKNOWLEDGEMENTS

T. Furuya was supported by JSPS KAKENHI Grant Number JP24K16949, JST CREST JPMJCR24Q5, and JST ASPIRE JPMJAP2329. M.V. de Hoop carried out the work while he was an invited professor at the Centre Sciences des Données at École Normale Supérieure, Paris. He acknowledges the support of the Department of Energy, BES under grant DE-SC0020345, the Simons Foundation under the MATH + X Program, Oxy, and the corporate members of the Geo-Mathematical Imaging Group at Rice University. The work of G. Peyré was supported by the European Research Council (ERC project WOLF) and the French government under the management of Agence Nationale de la Recherche as part of the France 2030 program, reference ANR-23-IACL0008 (PRAIRIE-PSAI).

REFERENCES

Andrei Agrachev and Cyril Letrouit. Generic controllability of equivariant systems and applications to particle systems and neural networks. arXiv preprint arXiv:2404.08289, 2024.

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems, 36, 2024.

Silas Alberti, Niclas Dern, Laura Thesing, and Gitta Kutyniok. Sumformer: Universal approximation for efficient transformers. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 72-86. PMLR, 2023.

Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis. Springer, 2006.

Jan Boman and Filip Lindskog. Support theorems for the Radon transform and Cramér-Wold theorems. Journal of Theoretical Probability, 22:683-710, 2009.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
Valérie Castin, Pierre Ablin, and Gabriel Peyré. How smooth is attention? In ICML 2024, 2024.

David Chiang, Peter Cholak, and Anand Pillay. Tighter bounds on the expressivity of transformer encoders. In International Conference on Machine Learning, pp. 5544-5562. PMLR, 2023.

Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31, 2018.

Christa Cuchiero, Martin Larsson, and Josef Teichmann. Deep neural networks, generic universal interpolation, and controlled ODEs. SIAM Journal on Mathematics of Data Science, 2(3):901-919, 2020.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.

Gwendoline De Bie, Gabriel Peyré, and Marco Cuturi. Stochastic deep networks. In International Conference on Machine Learning, pp. 1556-1565. PMLR, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Dennis Elbrächter, Philipp Grohs, Arnulf Jentzen, and Christoph Schwab. DNN expression rate analysis of high-dimensional PDEs: Application to option pricing. Constructive Approximation, 55(1):3-71, 2022.

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.

Takashi Furuya, Michael Puthawala, Matti Lassas, and Maarten V de Hoop. Globally injective and bijective neural operators. arXiv preprint arXiv:2306.03982, 2023.

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. arXiv preprint arXiv:2305.05465, 2023a.

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794, 2023b.

Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024.

John L Kelley. General Topology. Courier Dover Publications, 2017.

Nicolas Keriven and Gabriel Peyré. Universal invariant and equivariant graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89):1-97, 2023.

Anastasis Kratsios and Léonie Papon. Universal approximation theorems for differentiable geometric deep learning. Journal of Machine Learning Research, 23(196):1-73, 2022.
Anastasis Kratsios, Valentin Debarnot, and Ivan Dokmanić. Small transformers compute universal metric embeddings. Journal of Machine Learning Research, 24(170):1-48, 2023a.

Anastasis Kratsios, Chong Liu, Matti Lassas, Maarten V de Hoop, and Ivan Dokmanić. An approximation theory for metric space-valued functions with a view towards deep learning. arXiv preprint arXiv:2304.12231, 2023b.

Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, and Di He. Your transformer may not be as powerful as you expect. Advances in Neural Information Processing Systems, 35:4301-4315, 2022.

Arvind Mahankali, Tatsunori B Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. arXiv preprint arXiv:2307.03576, 2023.

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023.

Luis Müller, Mikhail Galkin, Christopher Morris, and Ladislav Rampášek. Attending to graph transformers. arXiv preprint arXiv:2302.04181, 2023.

Swaroop Nath, Harshad Khadilkar, and Pushpak Bhattacharyya. Transformers are expressive, but are they expressive enough for regression? arXiv preprint arXiv:2402.15478, 2024.

Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195, 1999.

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652-660, 2017.

Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515-3530. PMLR, 2022.

Michael E Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, and Gabriel Peyré. How do transformers perform in-context autoregressive learning? arXiv preprint arXiv:2402.05787, 2024.

Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 55(58-63):94, 2015.

Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal languages can transformers express? A survey. Transactions of the Association for Computational Linguistics, 12:543-561, 2024.

Paulo Tabuada and Bahman Gharesifard. Universal approximation power of deep residual neural networks through the lens of control. IEEE Transactions on Automatic Control, 2022.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151-35174. PMLR, 2023.

Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Max Vladymyrov, Razvan Pascanu, et al. Uncovering mesa-optimization algorithms in transformers. arXiv preprint arXiv:2309.05858, 2023.

James Vuckovic, Aristide Baratin, and Remi Tachet des Combes. A mathematical theory of attention. arXiv preprint arXiv:2007.02876, 2020.

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.

Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927, 2023.

Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787-794, 2020.

A BASIC NOTIONS

In this section, we briefly review basic notions used in the main text. Let $\Omega \subset \mathbb{R}^d$ be a compact domain. For a measurable map $T : \Omega \subset \mathbb{R}^d \to \Omega' \subset \mathbb{R}^{d'}$, the push-forward $T_\sharp\mu \in \mathcal{P}(\Omega')$ of $\mu \in \mathcal{P}(\Omega)$ is given by

$$\forall A \subset \Omega', \quad T_\sharp\mu(A) := \mu\!\left(T^{-1}(A)\right).$$

The push-forward operator $T_\sharp$ operates on discrete measures by simply displacing their supports,

$$T_\sharp\!\left(\frac{1}{n}\sum_{i=1}^n \delta_{x_i}\right) := \frac{1}{n}\sum_{i=1}^n \delta_{T(x_i)}.$$

For a general measure, $\nu = T_\sharp\mu$ is defined by a change of variables in integration, i.e.,

$$\int_{\Omega'} g(y)\,\mathrm{d}\nu(y) := \int_\Omega g(T(x))\,\mathrm{d}\mu(x). \qquad (22)$$

We employ the weak* topology on $\mathcal{P}(\Omega)$. This induces the following notion of convergence of sequences:

$$\mu_k \rightharpoonup \mu \iff \left(\forall f \in \mathcal{C}(\Omega), \ \int f(x)\,\mathrm{d}\mu_k(x) \to \int f(x)\,\mathrm{d}\mu(x)\right).$$

Intuitively, this corresponds to a soft notion of convergence where the support of $\mu_k$ approaches that of $\mu$. In the special case of discrete measures with a fixed number $n$ of points, this corresponds, up to relabeling of the points, to the usual convergence of points in finite dimension:

$$\frac{1}{n}\sum_{i=1}^n \delta_{x_i^k} \rightharpoonup \frac{1}{n}\sum_{i=1}^n \delta_{x_i} \iff \left(X_k = (x_{i,k})_i \to X = (x_i)_i \in \mathbb{R}^{d \times n}\right).$$

It is possible to metrize this weak* topology using the Wasserstein optimal transport distance, which is defined, for $1 \le p < +\infty$, as

$$W_p(\mu, \nu)^p := \min_{\pi \in \mathcal{P}(\Omega^2)} \left\{ \int |x - y|^p \,\mathrm{d}\pi(x, y) \;:\; \pi_1 = \mu, \ \pi_2 = \nu \right\},$$

where $\pi_i = (P_i)_\sharp \pi$ are the marginals of $\pi$, with $P_1(x, y) = x$ and $P_2(x, y) = y$. One has $\mu_k \rightharpoonup \mu \iff W_p(\mu_k, \mu) \to 0$; see e.g., Santambrogio (2015, Theorem 5.10). This Wasserstein distance is used in the masked case to impose a Lipschitz regularity with respect to the time of the contexts.

B PROOFS IN SECTION 3

B.1 PROOF OF LEMMA 1

First, we show the one-dimensional case of Lemma 1.

Lemma 6. Let $\Omega \subset \mathbb{R}$ be a compact set, and let $\mu, \nu \in \mathcal{P}(\Omega)$. Then,

$$L_1(\mu)(c) = L_1(\nu)(c) \ \ \forall c \in \mathbb{R} \implies \mu = \nu,$$

where, for $k \in \mathbb{N}$,

$$L_k(\mu)(c) := \frac{\int e^{cy}\, y^k \,\mathrm{d}\mu(y)}{\int e^{cz} \,\mathrm{d}\mu(z)}.$$
, d , we apply Proposition 1 to Λ h and conclude that there exist T, N N and (λh t,n)t [T ],n [N] such that (µ, x) P(Ω) Ω, n=1 γλh 1,n(µ, x) γλh T,n(µ, x) This implies that (µ, x) P(Ω) Ω, n=1 γλh 1,n(µ, x) γλh T,n(µ, x) n=1 γλh 1,n(µ, x) γλh T,n(µ, x) Published as a conference paper at ICLR 2025 Here, we wrote λh t,n = (ah t,n, bh t,n, ch t,n, vh t,n) Rd R R R. We introduce n=1 γλh 1,n(µ, x) γλh T,n(µ, x) n=1 γλ1,n(µ, x) γλT,n(µ, x), (24) in which γλt,n(µ, x) := γλ1 t,n(µ, x), ..., γλd t,n(µ, x) Rd . We define self-attentions by Γ θt,n(µ, x) := x + h=1 W h t,n Z exp Qh t,nx, Kh t,ny R exp Qh t,nx, Kh t,nz dµ(z) V h t,ny dµ(y), x Rd , (25) where θt,n := ( W h t,n, V h t,n, Qh t,n, Kh t,n)h=1,...,d Rd 1 R1 d R1 d R1 d , i.e., dtok( θt,n) = d , dhead( θt,n) = k( θt,n) = 1 and W h t,n = (0, ..., 0, 1 |{z} h th , 0, ..., 0) = eh, V h t,n = (0, ..., 0, vh t,n |{z} h th , 0..., 0), Qh t,n = (0, ..., 0, ch t,n |{z} h th , 0..., 0), Kh t,n = (0, ..., 0, 1 |{z} h th , 0..., 0). We define affine transforms, F ξt,n : Rd Rd , according to F ξt,n(x) := At,nx + bt,n, (26) where ξt,n = (At,n, bt,n) Rd d Rd in which At,n = (a1 t,n, ..., ad t,n), bt,n = (b1 t,n, ..., bd t,n). Then we have the composition, Γ θt,n F ξt,n(µ, x) = Γ θt,n((F ξt,n) µ, F ξt,n(x)) = At,nx + bt,n (27) h=1 W h t,n Z exp Qh t,n(At,nx + bt,n), Kh t,n(At,ny + bt,n) R exp Qh t,n(At,nx + bt,n), Kh t,n(At,nz + bt,n) dµ(z) V h t,n(At,ny + bt,n) dµ(y) = ah t,n, x + bh t,n + Z exp ( ah t,n, x + bh t,n)ch t,n( ah t,n, y + bh t,n) R exp ( ah t,n, x + bh t,n)ch t,n( ah t,n, z + bh t,n) dµ(z) vh t,n( ah t,n, y + bh t,n) dµ(y) h=1 γλh t,n(µ, x)eh = γλt,n(µ, x). With (24), we find that n=1 (Γ θ1,n F ξ1,n)(µ, x) (Γ θT,n F ξT,n)(µ, x). (28) Therefore, with the estimate in (23), we obtain the statement in Lemma 2. Published as a conference paper at ICLR 2025 B.3 PROOF OF LEMMA 3 We first establish the following lemma. Lemma 8. For any ε > 0, there exists an MLP Φ : R2d Rd such that (µ, x) P(Ω) Ω, |G(µ, x) GΦ(µ, x)| ε, where GΦ(µ, x) := n=1 Φ (Γ θT,n F ξT,n)(µ, x), Φ (Γ θT 1,n F ξT 1,n)(µ, x), Φ (Γ θ2,n F ξ2,n)(µ, x), Φ (Γ θ1,n F ξ1,n)(µ, x), 1d . Proof. We note that (Γ θ1,n F ξ1,n)(µ, x) (Γ θT,n F ξT,n)(µ, x) = (Γ θT,n F ξT,n)(µ, x) (Γ θT 1,n F ξT 1,n)(µ, x) (Γ θ2,n F ξ2,n)(µ, x) Γ θ1,n F ξ1,n)(µ, x) 1d . Because the component-wise multiplication map (x, y) R2d 7 x y Rd is continuous, by the universality of MLPs, for any ε > 0 and R > 0, there exists an MLP Φ : R2d Rd , such that (x, y) BR2d (0, R), |x y Φ(x, y)| ε. (29) Since Ω Rd is compact then 0 CΩ:= supx Ω x 2 is finite. Thus, using (27), we obtain the estimate, (Γ θt,n F ξt,n)(µ, x) At,n 2CΩ+ bt,n 2 + h=1 ( At,n 2CΩ+ bt,n 2) W h t,n V h t,n 2 max t [T ],n [N] ( At,n 2CΩ+ bt,n 2)(1 + h=1 W h t,n V h t,n 2) =:C Γ for all (µ, x) P(Ω) Ω, (30) where the constant, C Γ > 0, depends on Ω, W h t,n, V h t,n, At,n, bt,n, but is independent of t, n, µ and x. Thus, using the universality in (29), choosing a large radius R > 0 depending on the constant C Γ > 0, we can show that (Γ θ1,n F ξ1,n)(µ, x) (Γ θT,n F ξT,n)(µ, x) Φ (Γ θT,n F ξT,n)(µ, x), Φ (Γ θT 1,n F ξT 1,n)(µ, x), Φ (Γ θ2,n F ξ2,n)(µ, x), Φ (Γ θ1,n F ξ1,n)(µ, x), 1d ε Upon summing from n = 1, . . . , N and using the form (17), we complete the proof of Lemma 8. Remark 3. (The challenge to derive quantitative estimates.) The key is to approximate and capture the mentioned multiplicity by MLPs, for which quantitative estimates have been studied, e.g., Elbr achter et al. (2022, Lemma 6.2), which is a variant of Yarotsky (2017, Proposition 2). 
However, the depth and width of MLPs depend on the bound of input variables. Specifically, an existential Φ in the above depends on the bound C Γ (see (30)), which in turn depends on parameters in Γ θt,n F ξt,n that are chosen to approximate Γ within ε through the application of the Stone-Weierstrass theorem (see Proposition 1). Thus, providing the quantitative estimate for the MLP, Φ, is challenging. Published as a conference paper at ICLR 2025 Finally in this appendix we prove, by construction, the following result. Lemma 9. Let Φ : R2d Rd be an MLP. There exist ξ0, (ξt,n)t [T ],n [N], ξ , and θ0, (θt,n)t [T ],n [N], θ such that (µ, x) P(Ω) Ω, GΦ(µ, x) = Fξ Γθ N n=1 T t=1 Fξt,n Γθt,n Fξ0 Γθ0(µ, x). with following sizes: dtok(θ0) = d, dhead(θ0) = k(θ0) = H(θ0) = 1, dtok(θt,n) = d + 3d , dhead(θt,n) = k(θt,n) = 1, H(θt,n) = d , dtok(θ ) = d + 3d , dhead(θ ) = k(θ ) = H(θ ) = 1. Proof. The proof is based on the following scheme: x Fξ0 Γθ0 [Step A] x F ξ1,1 (x) φ1,1(x) f1(x) Fξ1,1 Γθ1,1 [Step B] x F ξ2,1 (x) φ2,1(x) f1(x) Fξ2,1 Γθ2,1 [Step B] FξT 1,1 ΓθT 1,1 [Step B] x F ξT,1 (x) φT,1(x) f1(x) FξT,1 ΓθT,1 [Step C] x F ξ2,1 (x) φ1,2(x) f2(x) Fξ1,2 Γθ1,2 [Step B] x F ξ2,2 (x) φ2,2(x) f2(x) Fξ2,2 Γθ2,2 [Step B] FξT 1,2 ΓθT 1,2 [Step B] x F ξT,2 (x) φT,2(x) f2(x) FξT,2 ΓθT,2 [Step C] x F ξ1,3 (x) φ1,3(x) f3(x) Fξ1,N Γθ1,N [Step B] x F ξ2,N (x) φ2,N (x) f N (x) Fξ2,N Γθ2,N [Step B] FξT 1,N ΓθT 1,N [Step B] x F ξT,N (x) φT,N (x) f N (x) FξT,N ΓθT,N [Step C] x F ξ1,N+1 (x) φ1,N+1(x) f N+1(x) = Fξ Γθ [Step D] f N+1(x) = GΦ(µ, x) where φt,n : Rd Rd is given by Φ (Γ θt,n F ξt,n)(µ, x), Φ (Γ θt 1,n F ξt 1,n)(µ, x) Φ (Γ θ2,n F ξ2,n)(µ, x), Φ (Γ θ1,n F ξ1,n)(µ, x), 1d ), t 2 1d , t = 1 and fn : Rd Rd by fn(x) := Pn 1 i=1 φT,i(x), n 2 0 n = 1 , where Γ θt,n and F ξt,n : Rd Rd are the self-attention and affine maps chosen in (25) and (26), respectively. Here, Γθ0, Γθt,n, Γθ , Fξ0, Fξt,n and Fξ will be specified below, in the following steps: [Step A] Let Γθ0(µ) : Rd Rd be Γθ0(µ, x) = x, and let Fξ0 : Rd Rd+3d be the affine transform defined by Fξ0(x) := (x, A1,1x + b1,1, 1d , 0) = (x, F ξ1,1(x), φ1,1(x), f1(x)). Then we see that Fξ0 Γθ0(µ, x) = (x, F ξ1,1(x), φ1,1(x), f1(x)), and µ1,1 := (Fξ0 Γθ0(µ)) µ = µ, (F ξ1,1) µ, (φ1,1) µ, (f1) µ . We proceed with [Step B] in which we handle the case when n = t = 1. Published as a conference paper at ICLR 2025 [Step B] Let t = 1, ..., T 1 and n = 1, ..., N. We already have that t 1 j=1Fξj,n Γθj,n n 1 i=1 T s=1 Fξs,i Γθs,i Fξ0 Γθ0(µ, x) = x, F ξt,n(x), φt,n(x), fn(x) , µt,n := t 1 j=1Fξj,n Γθj,n n 1 i=1 T s=1 Fξs,i Γθs,i Fξ0 Γθ0(µ) = µ, (F ξt,n) µ, (φt,n) µ, (fn) µ . When n = 1 or t = 1, the above reduces to n 1 i=1 T s=1 Fξs,i Γθs,i = Id+3d or t 1 j=1Fξj,n Γθj,n = Id+3d . Let Γθt,n(µt,n) : Rd+3d Rd+3d be given by Γθt,n(µt,n, (x, u, p, w)) = (x, u, p, w) h=1 W h t,n Z exp Qh t,n(x, u, p, w), Kh t,n(y , v , q , z ) R exp Qh t,n(x, u, p, w), Kh t,n(y, v, q, z) dµt,n(y, v, q, z) V h t,n(y , v , q , z ) dµt,n(y , v , q , z ) h=1 W h t,n Z exp Qh t,nu, Kh t,nv R exp Qh t,nu, Kh t,nv dµt,n(y, v, q, z) V h t,nv dµt,n(y , v , q , z ), p, w = x, Γ θt,n((F ξt,n) µ, u), p, w , where x, y, y Rd, u, v, v Rd , p, q, q Rd , and w, z, z Rd . Here, θt,n is given by θt,n := (W h t,n, V h t,n, Qh t,n, Kh t,n)h=1,...,d Rd+3d 1 R1 d+3d R1 d+3d R1 d+3d , that is, dtok(θt,n) = d + 3d , dhead(θt,n) = k(θt,n) = 1, H(θt,n) = d , and W h t,n := (O, W h t,n, O, O), V h t,n := (O, V h t,n, O, O), Qh t,n := (O, Qh t,n, O, O), Kh t,n := (O, Kh t,n, O, O). 
Let Fξt,n : Rd+3d Rd+3d be defined by Fξt,n(x, u, p, w) = (x, At+1,nx + bt+1,n, Φ(u, p), w) = (x, F ξt+1,n(x), Φ(u, p), w). (31) Then we have ( t j=1Fξj,n Γθj,n) ( n 1 i=1 T s=1 Fξs,i Γθs,i) Fξ0 Γθ0(µ, x) = Fξt,n Γθt,n ( t 1 j=1Fξj,n Γθj,n) ( n 1 i=1 T s=1 Fξs,i Γθs,i) Fξ0 Γθ0(µ, x) = Fξt,n Γθt,n(µt,n, (x, F ξt,n(x), φt,n(x), fn(x))) = Fξt,n(x, Γ θt,n((F ξt,n) µ, F ξt,n(x)), φt,n(x), fn(x)) = Fξt,n(x, (Γ θt,n F ξt,n)(µ, x), φt,n(x), fn(x)) = x, F ξt+1,n(x), φt+1,n(x), fn(x)) , µt+1,n := t j=1Fξj,n Γθj,n n 1 i=1 T s=1 Fξs,i Γθs,i Fξ0 Γθ0(µ) = µ, (F ξt+1,n) µ, (φt+1,n) µ, (fn) µ . Published as a conference paper at ICLR 2025 We repeat [Step B] until obtaining µT,n. Once µT,n is obtained, we proceed with [Step C]. [Step C] Let ΓθT,n(µT,n) : Rd+3d Rd+3d be given by ΓθT,n(µT,n, (x, u, p, w)) h=1 W h T,n Z exp Qh T,nu, Kh T,nv R exp Qh T,nu, Kh T,nv dµT,n(y, v, q, z) V h T,nv dµT,n(y , v , q , z ), p, w = x, Γ θT,n((F ξT,n) µ, u), p, w . Let FξT,n : Rd+3d Rd+3d be defined by FξT,n(x, u, p, w) = (x, A1,n+1x+b1,n+1, 1d , w+Φ(u, p)) = (x, F ξ1,n+1(x), φ1,n+1(x), w+Φ(u, p)). (32) When n = N, we define by F ξ1,N+1(x) := 0 and φ1,N+1 := 0 in the above. We find that ( T j=1Fξj,n Γθj,n) ( n 1 i=1 T s=1 Fξs,i Γθs,i) Fξ0 Γθ0(µ, x) = FξT,n(x, (Γ θT,n F θT,n)(µ, x), φT,n(x), fn(x)) = x, F ξ1,n+1(x), φ1,n+1(x), fn+1(x)) , µT +1,n := T j=1Fξj,n Γθj,n n 1 i=1 T s=1 Fξs,i Γθs,i Fξ0 Γθ0(µ) = µ, (F ξ1,n+1) µ, (φ1,n+1) µ, (fn+1) µ . Denoting µ1,n+1 := µT +1,n, we return to [Step B], and repeat [Step B] and [Step C] until obtaining µT +1,N. Once µT +1,N is obtained, we proceed with [Step D]. [Step D] Let Γθ (µT +1,N) : Rd+3d Rd+3d be given by Γθ (µT +1,N, (x, u, p, w)) = (x, u, p, w), and let Fξ : Rd+3d Rd be the affine transform defined by Fξ (x, u, p, w) := w. Then we conclude that Fξ Γθ N n=1 T s=1 Fξs,n Γθs,n Fξ0 Γθ0(µ, x) = Fξ Γθ µT +1,N, x, F ξ1,N+1(x), φ1,n+1(x), f N+1(x)) = f N+1(x) n=1 Φ (Γ θT,n F ξT,n)(µ, x), Φ (Γ θT 1,n F ξT 1,n)(µ, x), Φ (Γ θ2,n F ξ2,n)(µ, x), Φ (Γ θ1,n F ξ1,n)(µ, x), 1d = GΦ(µ, x). Remark 4. Note that if the MLPs represent the identity map, such as when using Re LU activation functions, then context-free maps Fξt,n in (31) and (32) can be represented by MLPs. If this is not the case, it is sufficient to further approximate FξT,n using MLPs. Published as a conference paper at ICLR 2025 C PROOFS IN SECTION 4 C.1 BASIC PROPERTIES FOR THE MASKED CASE Lemma 10. The map [0, 1] t 7 µt Lipσ C( Ω) is continuous (for the weak topology). Proof. Let µ Lipσ C( Ω). We re-define the masked probability measure (including at t = 0) by µt := 1[0,t] µ([0,t]) µ t (0, 1] µ( |s)δs=0 t = 0 . (33) Thus, the continuity on t (0, 1] is obvious. We now show that as t 0 µt µ0. For f C( Ω), we see that Z f(x, s) dµt Z f(x, s) dµ0 Z f(x, s) dµ(x|s)1[0,t](s) µ([0, t]) d µ(s) Z f(x, 0) dµ(x|0) Z f(x, s) dµ(x|s)1[0,t](s) µ([0, t]) d µ(s) Z f(x, 0) dµ(x|0) d1[0,t](s) µ([0, t]) d µ(s) Z 1[0,t](s) µ([0, t])(F(s) F(0)) d µ(s) sup s [0,t] |F(s) F(0)|, where F(s) := Z f(x, s) dµ(x|s), and s 7 F(s) is continuous as µ Lipσ C( Ω). Thus we have, as t 0, Z f(x, s) dµt Z f(x, s) dµ0, which implies that µt µ0. Lemma 11. Let Γθ be the masked in-context map defined in (12). Then we have the following: (a) Γθ is a causal identifiable in-context map in the sense of Definition 3. (b) For any (µ, x, t) Lipσ C( Ω) Ω, Γθ(µ, x, t) = Γθ(µt, x). (c) The reduced map of X σ C (µt, x) 7 Γθ(µt, x) is (weak ℓ2)-continuous. Proof. 
To show (a), we observe that Γθ(µ, x, t) = x + Z exp 1 k Qhx, Khy 1[0,t](r) R exp 1 k Qhx, Khz 1[0,t](s) dµ(z, s) V hy dµ(y, r) = x + Z exp 1 k Qhx, Khz 1[0,t](s) µ([0,t]) dµ(z, s) V hy 1[0,t](r) µ([0, t]) dµ(y, r) = x + Z exp 1 k Qhx, Khy 1[0,t](r) R exp 1 k Qhx, Khz 1[0,t](s) dµt(z, s) V hy dµt(y, r) = Γθ(µt, x, t). Published as a conference paper at ICLR 2025 This proves the causality. To show the identifiability, we assume that µt = µt where µ Lipσ C( Ω) and t, t [0, 1]. Without loss of generality, we assume that t < t . Then we have that µ = 0 on [t, t ], so that Γθ(µt, x, t) = x + Z exp 1 k Qhx, Khy 1[0,t](r) R exp 1 k Qhx, Khz 1[0,t](s) dµt(z, s) V hy dµt(y, r) = x + Z exp 1 k Qhx, Khy 1[0,t](r) R exp 1 k Qhx, Khz 1[0,t](s) dµt (z, s) V hy dµt (y, r) = x + Z exp 1 k Qhx, Khy 1[0,t ](r) R exp 1 k Qhx, Khz 1[0,t ](s) dµt (z, s) V hy dµt (y, r) = Γθ(µt , x, t ). Thus we obtain (a). From Lemma 11 (a), Γθ is a causal identifiable in-context map in the sense of Definition 3. By applying Lemma 4 (i), as Λ = Γθ, we obtain (b). Using Lemma 11 (b), we find that Γθ(µt, x) = Γθ(µt, x, t) = x + Z exp 1 k Qhx, Khy 1[0,t](r) R exp 1 k Qhx, Khz 1[0,t](s) dµt(z, s) V hy dµt(y, r) = x + Z exp 1 k Qhx, Khz dµt(z, s) V hy dµt(y, r). (35) We can show the continuity of the map P( Ω) Ω (µ, x) 7 x + Z exp 1 k Qhx, Khz dµ(z, s) V hy dµ(y, r) Rd , which, in fact, follows from the continuity of the unmasked self-attention. Thus, with (35), we obtain (c). Lemma 12. Let Γ1 and Γ2 be causal identifiable in-context maps in the sense of Definition 3. Then, the composition Γ2 Γ1 in the sense of (13) is a causal identifiable in-context map. Proof. Assume that (µ, x, t) Lipσ C( Ω) Ω. We first show that [(Γ1(µ), Id R) µ]t = (Γ1(µt), Id R) µt. (36) Indeed, we see that for all f C( Ω) Z f(x, s) d [(Γ1(µ), Id R) µ] (x, s) = Z f (Γ1(µ)(x, s), s) dµ(x, s) = Z f (Γ1(µ)(x, s), s) dµ(x|s) d µ(s) = Z f (x, s) d [Γ1(µ)( , s) µ( |s)] (x) d µ(s), which obtains that (Γ1(µ), Id R) µ = [Γ1(µ)( , s) µ( |s)] µ(s). (37) Published as a conference paper at ICLR 2025 This implies that by using the causality of Γ1 Z f(x, s) d [(Γ1(µ), Id R) µ]t (x, s) = Z f (x, s) d [Γ1(µ)( , s) µ( |s)] (x)1[0,t](s) µ([0, t]) d µ(s) = Z f (Γ1(µ, x, s), s) dµ(x|s) 1[0,t](s) µ([0, t]) d µ(s) = Z f (Γ1(µs, x, s), s) dµ(x|s) 1[0,t](s) µ([0, t]) d µ(s) = Z f (Γ1(µt, x, s), s) dµ(x|s) 1[0,t](s) µ([0, t]) d µ(s) = Z f (Γ1(µt, x, s), s) dµt(x, s) = Z f (x, s) d [(Γ1(µt), Id R) µt] (x, s), where we have used that µs = µt when s t. This shows (36). We see that by using the causality of Γ1 and Γ2, and (36) Γ2 Γ1(µ, x, t) = Γ2 ((Γ1(µ), Id R) µ, Γ1(µ, x, t), t) = Γ2 [(Γ1(µ), Id R) µ]t , Γ1(µt, x, t), t = Γ2 ((Γ1(µt), Id R) µt, Γ1(µt, x, t), t) = Γ2 Γ1(µt, x, t). These discussions apply for t (0, 1], and the case when t = 0 follows by the same argument. Thus, we obtains the causality of Γ2 Γ1. Assume that µt = µt where µ Lipσ C( Ω) and t, t [0, 1]. Without loss of generality, assume that t < t . We have that by the identifiability of Γ1 and Γ2, and (36) Γ2 Γ1(µt, x, t) = Γ2 ((Γ1(µt), Id R) µt, Γ1(µt, x, t), t) = Γ2 ((Γ1(µt), Id R) µt, Γ1(µt , x, t ), t) = Γ2 [(Γ1(µ), Id R) µ]t , Γ1(µt , x, t ), t = Γ2 [(Γ1(µ), Id R) µ]t , Γ1(µt , x, t ), t = Γ2 ((Γ1(µt ), Id R) µt , Γ1(µt , x, t ), t ) = Γ2 Γ1(µt , x, t ), where we have used the following fact from (37) [(Γ1(µ), Id R) µ]t = [Γ1(µ)( , s) µ( |s)] 1[0,t] µ([0, t]) µ(s) = [Γ1(µ)( , s) µ( |s)] 1[0,t ] µ([0, t ]) µ(s) = [(Γ1(µ), Id R) µ]t . 
These arguments apply for $t \in (0, 1]$, and the case $t = 0$ follows by the same argument. Thus, we obtain the identifiability of $\Gamma_2 \circ \Gamma_1$.

Lemma 13. Identifiability is stable in the following sense. Let the $\Lambda_n$ be continuous, causal and identifiable in-context mappings, and let $\Lambda$ be a continuous and causal in-context mapping. Assume that, as $n \to \infty$,
$$\sup_{(\mu, x, t) \in \mathrm{Lip}^\sigma_C(\tilde\Omega) \times \tilde\Omega} |\Lambda_n(\mu, x, t) - \Lambda(\mu, x, t)| \to 0. \qquad (38)$$
Then the map $\Lambda$ is identifiable.

Proof. Assume that $\mu_t = \mu_{t'}$ where $\mu \in \mathrm{Lip}^\sigma_C(\tilde\Omega)$ and $t, t' \in [0, 1]$. As $\Lambda_n(\mu_t, x, t) = \Lambda_n(\mu_{t'}, x, t')$ and $\mu_t, \mu_{t'} \in \mathrm{Lip}^\sigma_C(\tilde\Omega)$, we see that
$$\begin{aligned}
|\Lambda(\mu_t, x, t) - \Lambda(\mu_{t'}, x, t')| &\le |\Lambda(\mu_t, x, t) - \Lambda_n(\mu_t, x, t)| + |\Lambda_n(\mu_t, x, t) - \Lambda(\mu_{t'}, x, t')| \\
&\le \sup_{(\mu, x, t) \in \mathrm{Lip}^\sigma_C(\tilde\Omega) \times \tilde\Omega} |\Lambda(\mu, x, t) - \Lambda_n(\mu, x, t)| + |\Lambda_n(\mu_{t'}, x, t') - \Lambda(\mu_{t'}, x, t')| \\
&\le 2 \sup_{(\mu, x, t) \in \mathrm{Lip}^\sigma_C(\tilde\Omega) \times \tilde\Omega} |\Lambda(\mu, x, t) - \Lambda_n(\mu, x, t)| \to 0,
\end{aligned}$$
which implies that $\Lambda(\mu_t, x, t) = \Lambda(\mu_{t'}, x, t')$.

C.2 PROOF OF LEMMA 4

For the representation (i), we find, by using (20), (21) and $\mu_t = \mu_{e(\bar\mu_t)}$, that
$$\Lambda(\mu, x, t) = \Lambda(\mu_t, x, t) = \Lambda\big(\mu_{e(\bar\mu_t)}, x, e(\bar\mu_t)\big) = \Lambda\big(\mu_{e(\bar\mu_t)}, x\big) = \Lambda(\mu_t, x),$$
where we have used, for the second and fourth equalities, that $\mu_t = \mu_{e(\bar\mu_t)}$.

The continuity (ii) follows from (20). Indeed, we observe that
$$\Lambda(\mu_t, x) = \Lambda\big(\mu_t, x, e(\bar\mu_t)\big) = \Lambda\big(\mu_{e(\bar\mu_t)}, x, e(\bar\mu_t)\big). \qquad (39)$$
Viewing
$$\mu_{e(\bar\mu_t)} = \big(\mu_{e(\bar\mu_t)}\big)_{e(\bar\mu_t)} = \big(\mu_{e(\bar\mu_t)}\big)_1 = \mu_t, \qquad (40)$$
where $(\mu_{e(\bar\mu_t)})_{e(\bar\mu_t)}$ and $(\mu_{e(\bar\mu_t)})_1$ are regarded as the masked probability measures of $\mu_{e(\bar\mu_t)}$ at $t = e(\bar\mu_t)$ and $t = 1$, respectively, we obtain, using (20), (39) and (40), that
$$\Lambda(\mu_t, x) = \Lambda\big((\mu_{e(\bar\mu_t)})_{e(\bar\mu_t)}, x, e(\bar\mu_t)\big) = \Lambda\big((\mu_{e(\bar\mu_t)})_1, x, 1\big) = \Lambda(\mu_t, x, 1).$$
Thus, by the continuity of $\Lambda$, we conclude that the map $(\mu_t, x) \mapsto \Lambda(\mu_t, x) = \Lambda(\mu_t, x, 1)$ is continuous.

C.3 PROOF OF LEMMA 5

Assume that $\mu_n \in \mathrm{Lip}^\sigma_C(\tilde\Omega)$ and $(x_n, t_n) \in \tilde\Omega$. The family $\{s \mapsto \mu_n(\cdot|s)\}_{n \in \mathbb{N}} \subset C([0, 1]; \mathcal{P}(\Omega))$ is equicontinuous, as each $s \mapsto \mu_n(\cdot|s)$ is $C$-Lipschitz. Moreover, for each $s \in [0, 1]$, the set $\{\mu_n(\cdot|s)\}_{n \in \mathbb{N}}$ lies in $\mathcal{P}(\Omega)$, which is compact in the $W_2$ topology (see e.g. Aliprantis & Border, 2006, Theorem 15.11). By the Arzelà-Ascoli theorem (Kelley, 2017, Chapter 7, Theorem 17), there exists $\mu(\cdot|s) \in \mathcal{P}(\Omega)$ such that the map $s \mapsto \mu(\cdot|s)$ is continuous and (passing to a subsequence if needed)
$$\sup_{s \in [0, 1]} W_2\big(\mu_n(\cdot|s), \mu(\cdot|s)\big) \to 0 \quad \text{as } n \to \infty. \qquad (41)$$
As $(\bar\mu_n)_{n \in \mathbb{N}} \subset \mathcal{P}_\sigma([0, 1]) := \{\nu \in \mathcal{P}([0, 1]) : \nu(\{0\}) \ge \sigma\}$ and $\mathcal{P}_\sigma([0, 1])$ is compact, there exists $\bar\mu \in \mathcal{P}_\sigma([0, 1])$ such that (passing to a further subsequence if needed) $\bar\mu_n \rightharpoonup \bar\mu$ as $n \to \infty$. We set $\mu := \mu(\cdot|s) \otimes \bar\mu(s)$. Then $\mu \in \mathrm{Lip}^\sigma_C(\tilde\Omega)$: indeed,
$$W_2\big(\mu(\cdot|s), \mu(\cdot|s')\big) \le W_2\big(\mu(\cdot|s), \mu_n(\cdot|s)\big) + W_2\big(\mu_n(\cdot|s), \mu_n(\cdot|s')\big) + W_2\big(\mu_n(\cdot|s'), \mu(\cdot|s')\big) \le 2 \sup_{s \in [0, 1]} W_2\big(\mu(\cdot|s), \mu_n(\cdot|s)\big) + C|s - s'|,$$
and taking the limit as $n \to \infty$ shows that $s \mapsto \mu(\cdot|s) \in \mathcal{P}(\Omega)$ is $C$-Lipschitz.

Note that, from (41),
$$\forall g \in C(\Omega), \qquad \sup_{s \in [0, 1]} \Big|\int g(x)\, d\mu_n(x|s) - \int g(x)\, d\mu(x|s)\Big| \to 0. \qquad (42)$$
Indeed, since the set $\mathrm{Lip}(\Omega)$ of all Lipschitz functions on $\Omega$ is dense in $C(\Omega)$, for any $g \in C(\Omega)$ and any $\epsilon \in (0, 1)$ we may choose $h \in \mathrm{Lip}(\Omega)$ such that $\sup_{x \in \Omega} |g(x) - h(x)| \le \epsilon$. Using $W_1 \le W_2$ and the dual formulation of $W_1$, we see that
$$\Big|\int g(x)\, d\mu_n(x|s) - \int g(x)\, d\mu(x|s)\Big| \le 2 \sup_{x \in \Omega} |g(x) - h(x)| + \mathrm{Lip}(h)\, \Big|\int \frac{h(x)}{\mathrm{Lip}(h)}\, d\mu_n(x|s) - \int \frac{h(x)}{\mathrm{Lip}(h)}\, d\mu(x|s)\Big| \le 2\epsilon + \mathrm{Lip}(h)\, W_1\big(\mu_n(\cdot|s), \mu(\cdot|s)\big) \le 2\epsilon + \mathrm{Lip}(h)\, W_2\big(\mu_n(\cdot|s), \mu(\cdot|s)\big).$$
Taking the supremum over $s \in [0, 1]$ and the limit as $n \to \infty$, and then letting $\epsilon \to 0$, we obtain (42).

As $(x_n, t_n) \in \tilde\Omega$ and $\tilde\Omega$ is compact, there exists $(x, t) \in \tilde\Omega$ such that (passing to a further subsequence if needed) $(x_n, t_n) \to (x, t)$ in $\tilde\Omega$. We finally need to show that $(\mu_n)_{t_n} \rightharpoonup \mu_t$ as $n \to \infty$, which is equivalent to
$$\forall f \in C(\tilde\Omega), \qquad \int f\, d(\mu_n)_{t_n} \to \int f\, d\mu_t. \qquad (43)$$
It is enough to check (43) on functions $f$ which are separable, i.e.
of the form $f(x, s) = g(x)h(s)$, because linear combinations of separable functions, $\sum_i g_i(x) h_i(s)$, are dense in $C(\tilde\Omega)$. To prove (43), we distinguish three cases (again passing to a subsequence where appropriate): (i) $t_n \in (0, 1]$ and $t \in (0, 1]$, (ii) $t_n \in (0, 1]$ and $t = 0$, and (iii) $t_n = t = 0$.

CASE (i): We see that, as $n \to \infty$,
$$\Big|\int f\, d(\mu_n)_{t_n} - \int f\, d\mu_t\Big| = \Big|\int \frac{\mathbf{1}_{[0,t_n]}(s)\, h(s)}{\bar\mu_n([0,t_n])} \int_\Omega g(x)\, d\mu_n(x|s)\, d\bar\mu_n(s) - \int \frac{\mathbf{1}_{[0,t]}(s)\, h(s)}{\bar\mu([0,t])} \int_\Omega g(x)\, d\mu(x|s)\, d\bar\mu(s)\Big| \to 0,$$
because $\bar\mu([0, t]) > 0$, $t_n \to t$, equation (42) holds, and $\bar\mu_n \rightharpoonup \bar\mu$.

CASE (ii): We see that, as $n \to \infty$,
$$\begin{aligned}
\Big|\int f\, d(\mu_n)_{t_n} - \int f\, d\mu_0\Big| &= \Big|\int \frac{\mathbf{1}_{[0,t_n]}(s)\, h(s)}{\bar\mu_n([0,t_n])} \int_\Omega g(x)\, d\mu_n(x|s)\, d\bar\mu_n(s) - h(0)\int g(x)\, d\mu(x|0)\Big| \\
&= \Big|\int \frac{\mathbf{1}_{[0,t_n]}(s)\, h(s)}{\bar\mu_n([0,t_n])} \int g(x)\, d\mu_n(x|s)\, d\bar\mu_n(s) - \int \frac{\mathbf{1}_{[0,t_n]}(s)}{\bar\mu_n([0,t_n])}\, h(0)\int g(x)\, d\mu(x|0)\, d\bar\mu_n(s)\Big| \\
&\le \sup_{s \in [0,t_n]} \Big|h(s)\int g(x)\, d\mu_n(x|s) - h(0)\int g(x)\, d\mu(x|0)\Big| \int \frac{\mathbf{1}_{[0,t_n]}(s)}{\bar\mu_n([0,t_n])}\, d\bar\mu_n(s) \\
&\le \sup_{s \in [0,t_n]} \Big|h(s)\int g(x)\, d\mu_n(x|s) - h(s)\int g(x)\, d\mu(x|s)\Big| + \sup_{s \in [0,t_n]} \Big|h(s)\int g(x)\, d\mu(x|s) - h(0)\int g(x)\, d\mu(x|0)\Big| \\
&\le \sup_{s \in [0,1]} |h(s)|\, \sup_{s \in [0,1]} \Big|\int g(x)\, d\mu_n(x|s) - \int g(x)\, d\mu(x|s)\Big| + \sup_{s \in [0,t_n]} \Big|h(s)\int g(x)\, d\mu(x|s) - h(0)\int g(x)\, d\mu(x|0)\Big| \to 0,
\end{aligned}$$
where we have used (42), the continuity of the map $s \mapsto h(s)\int g(x)\, d\mu(x|s)$, and $t_n \to 0$.

CASE (iii): We see that, as $n \to \infty$,
$$\Big|\int f\, d(\mu_n)_{t_n} - \int f\, d\mu_t\Big| = \Big|\int f\, d(\mu_n)_0 - \int f\, d\mu_0\Big| = \Big|h(0)\int g(x)\, d\mu_n(x|0) - h(0)\int g(x)\, d\mu(x|0)\Big| \le |h(0)|\, \Big|\int g(x)\, d\mu_n(x|0) - \int g(x)\, d\mu(x|0)\Big| \to 0,$$
by (42). Therefore, we obtain (43).

D EXAMPLES OF OUR THEORY

D.1 DISCRETE CASE

We consider a fixed $n$ and an in-context map $G(X, x)$, with $X \in \mathbb{R}^{d_{\mathrm{tok}} \times n}$, which is continuous for the $\ell^2$ topology and permutation invariant with respect to the tokens. This defines a map on discrete probability measures,
$$\Gamma\Big(\frac{1}{n}\sum_{i=1}^{n}\delta_{x_i},\, x\Big) := G\big((x_i)_i, x\big),$$
which is continuous for the weak topology on the set $\mathcal{P}_n(\Omega) \subset \mathcal{P}(\Omega)$ of $n$-point measures (that is, uniform distributions supported on $n$ points). The map $\Gamma$ is continuous on $\mathcal{P}_n(\Omega)$ (because, on point sets, the weak topology coincides with the $\ell^2$ topology up to permutations), and $\mathcal{P}_n(\Omega)$ is compact (because it is a closed subset of the compact set $\mathcal{P}(\Omega)$). Hence, our theorem applies on $\mathcal{P}_n(\Omega)$, and we obtain that $\Gamma$ can be approximated by a transformer on this space. This in turn implies the approximation of $G$ by a transformer.

D.2 LINEAR REGRESSION

In the discrete case, tokens are assumed to be of the form $x_i = (u_i, v_i)$, where the $u_i$ are features and the $v_i$ are labels to be predicted. Simplified (linear-attention) transformers have been shown to learn in context a linear relation $v_i \approx W u_i$; the in-context (I-C) prediction then maps some $(u, v)$ to $(u, Wu)$ (that is, the value of $v$ is discarded). Adding a ridge penalty $\lambda$ to make the problem well-posed, this corresponds to the I-C map
$$G(X, (u, v)) := (u, W(X)u) \quad \text{where} \quad W(X) := \operatorname*{argmin}_W \sum_{i=1}^{n} \|W u_i - v_i\|^2 + \lambda \|W\|^2.$$
We note that von Oswald et al. (2023) consider, in fact, a single attention layer and replace this minimization by a single step of gradient descent for simplicity; this is, however, just a modification of the in-context map. Thanks to our framework, which operates over measures, the above regression can be written, for any $n$, upon considering a data distribution $\mu$ over the space of (feature, label) pairs $(u, v)$, in terms of the more general in-context map
$$\Gamma(\mu, (u, v)) := (u, W(\mu)u) \quad \text{where} \quad W(\mu) := \operatorname*{argmin}_W \int \|W u - v\|^2\, d\mu(u, v) + \lambda \|W\|^2.$$
This map has the closed form
$$W(\mu) = \Big(\int v u^\top\, d\mu(u, v)\Big)\Big(\int u u^\top\, d\mu(u, v) + \lambda\, \mathrm{Id}\Big)^{-1},$$
and is weakly continuous as long as $\lambda > 0$. Hence, our theorem states that it can be learned in context.
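For concreteness, the following is a minimal numerical sketch (not taken from the paper) of the in-context ridge map above, evaluated on an empirical measure as in the discrete case of Section D.1. The helper names `ridge_in_context` and `ic_predict`, the regularization value, and the toy data are illustrative assumptions; only the closed form for $W(\mu)$ comes from the text.

```python
import numpy as np

def ridge_in_context(U, V, lam=0.1):
    """Closed-form W(mu) for the empirical measure mu = (1/n) * sum_i delta_{(u_i, v_i)}.

    U: (n, d_u) array of features u_i, V: (n, d_v) array of labels v_i.
    Returns W of shape (d_v, d_u) minimizing int ||W u - v||^2 dmu + lam * ||W||^2.
    """
    n, d_u = U.shape
    Suu = U.T @ U / n          # int u u^T dmu(u, v)
    Svu = V.T @ U / n          # int v u^T dmu(u, v)
    # W(mu) = (int v u^T dmu) (int u u^T dmu + lam * Id)^{-1}
    return Svu @ np.linalg.inv(Suu + lam * np.eye(d_u))

def ic_predict(U, V, u_query, lam=0.1):
    """In-context map Gamma(mu, (u, v)) = (u, W(mu) u): the label slot holds the prediction."""
    W = ridge_in_context(U, V, lam)
    return u_query, W @ u_query

# Toy usage: context tokens x_i = (u_i, v_i) drawn from a noisy linear relation v = W_true u.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((2, 3))
U = rng.standard_normal((50, 3))
V = U @ W_true.T + 0.01 * rng.standard_normal((50, 2))
u, v_hat = ic_predict(U, V, rng.standard_normal(3))
# Permuting the context tokens leaves v_hat unchanged: W(mu) only sees the empirical measure.
```

Because $W(\mu)$ depends on the context only through the empirical second moments, permuting the context tokens leaves the prediction unchanged, which is exactly the invariance that lets the discrete map of Section D.1 factor through the empirical measure.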