# In-Context Deep Learning via Transformer Models

Weimin Wu *1,2, Maojiang Su *1,2, Jerry Yao-Chieh Hu *1,2, Zhao Song 3, Han Liu 1,2,4

*Equal contribution. 1 Center for Foundation Models and Generative AI, Northwestern University, USA. 2 Department of Computer Science, Northwestern University, USA. 3 University of California, Berkeley, USA. 4 Department of Statistics and Data Science, Northwestern University, USA. Correspondence to: Weimin Wu, Maojiang Su, Jerry Yao-Chieh Hu, Zhao Song, Han Liu. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). Full version and future updates are on arXiv.

**Abstract.** We investigate the transformer's capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give theoretical guarantees for the approximation within any given error and for the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting of Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.

## 1 Introduction

We study transformers' ability to simulate the training process of deep models. This analysis is not only practical but also timely. On one hand, transformers and deep models (Brown et al., 2020; Radford et al., 2019) are powerful and popular, and form a new machine learning paradigm: foundation models. These large-scale machine learning models, trained on vast data, provide a general-purpose foundation for various tasks with minimal supervision (Team et al., 2023; Touvron et al., 2023; Zhang et al., 2022). On the other hand, the high cost of pretraining these models often makes them prohibitive outside certain industrial labs (Jiang et al., 2024; Bi et al., 2024; Achiam et al., 2023). In this work, we aim to advance the one-for-many modeling philosophy of the foundation model paradigm (Bommasani et al., 2021) by considering the following research problem:

**Question 1.** Is it possible to train one deep model with the ICL of another foundation model?

The implication of Question 1 is profound: if true, one foundation model could lead to many others without pretraining. In this work, we provide an affirmative example for Question 1. Specifically, we show that transformers are capable of simulating the training of a deep ReLU-based feedforward neural network with provable guarantees through ICL. Our analysis assumes that we have well-pretrained the transformer using data generated by the deep network. We require the deep network to maintain consistent hyperparameters (e.g., model width and depth) during pretraining and testing. However, during testing, we vary the parameter distribution and the input data distribution of the deep network used to generate data for the transformer.
In ICL, models learn to solve new tasks at inference time from task-specific examples provided in the input prompt, rather than through parameter updates (Shi et al., 2023b; Bubeck et al., 2023; Achiam et al., 2023; Bai et al., 2023; Min et al., 2022; Garg et al., 2022). Unlike standard supervised learning, ICL enables models to adapt to new tasks during inference using only the provided examples. In this work, the new task of interest is algorithmic approximation via ICL (Bai et al., 2023; Zhang et al., 2024; Wang et al., 2024). Specifically, we aim to use the transformer's ICL capability to replace/simulate standard supervised training algorithms for $N$-layer networks. To be concrete, we formalize the learning problem of how transformers learn (i) a given function and (ii) a machine learning algorithm (e.g., gradient descent) via ICL, following (Bai et al., 2023).

**(i) ICL for Function $f$.** Let $f: \mathbb{R}^d \to \mathbb{R}$ be the function of interest. Suppose we have a dataset $D_n := \{(x_i, y_i)\}_{i \in [n]}$, where $\{x_i\}_{i \in [n]} \subset \mathbb{R}^d$ and $\{y_i\}_{i \in [n]} \subset \mathbb{R}$ are the inputs and outputs of $f$, respectively. Let $x_{n+1}$ be the test input. The goal of ICL is to use a transformer, denoted by $\mathcal{T}$, to predict $y_{n+1}$ based on the test input and the in-context dataset autoregressively: $\hat{y}_{n+1} \leftarrow \mathcal{T}(D_n, x_{n+1})$. The goal is for the prediction $\hat{y}_{n+1}$ to be close to $y_{n+1} = f(x_{n+1})$.

**(ii) ICL for Gradient Descent of a Parametrized Model $f(w, \cdot)$.** Bai et al. (2023) generalize (i) to include algorithmic approximations of gradient descent (GD) training algorithms and explore how transformers simulate gradient descent during inference without parameter updates. They term the simulated GD algorithm In-Context Gradient Descent (ICGD). In essence, ICGD enables transformers to approximate gradient descent on a loss function $L_n(w)$ for a parameterized model $f(w, \cdot)$ based on a dataset $D_n$. Traditional gradient descent updates $w$ iteratively as $w^{t+1} = w^t - \eta \nabla L_n(w^t)$. In contrast, ICGD uses a transformer $\mathcal{T}$ to simulate these updates within a forward pass. Given example data $D_n$ and test input $x_{n+1}$, the transformer performs gradient steps in an implicit fashion by inferring parameter updates through its internal representations, using the input context without explicit weight changes. Please see Section 2 for the explicit formulation.

In this work, we investigate the case where $f(w, \cdot)$ is a deep feed-forward neural network. We defer the detailed problem setting to Section 2. In comparison to standard ICGD (Bai et al., 2023), ICGD for deep feed-forward networks is nontrivial, due to two technical challenges: (C1) the analytical feasibility of gradient computation for such deep networks; (C2) an explicit construction capable of approximating ICGD for such layers and their gradients. To this end, we present the first explicit expression for gradient computation of an $N$-layer feed-forward network (Lemma 1). Importantly, its term-by-term tractability provides key insights for the detailed construction of a specific transformer to train this network via ICGD (Theorem 1).

**Contributions.** Our contributions are threefold:

- **Approximation by ReLU-Transformer.** For simplicity, we begin with the ReLU-based transformer. For a broad class of smooth empirical risks, we construct a $(2N+4)L$-layer transformer to approximate $L$ steps of in-context gradient descent on $N$-layer feed-forward networks with the same input and output dimensions (Theorem 1). We then extend this to accommodate varying dimensions (Theorem 4). We also provide theoretical guarantees for the approximation within any given error (Corollary 1.1) and for the convergence of the ICL gradient descent (Lemma 14).
- **Approximation by Softmax-Transformer.** We extend our analysis to the Softmax-transformer to better reflect realistic applications. The key technique is to ensure a qualified approximation error at each point, so as to achieve the universal approximation capability of the Softmax-based transformer (Lemma 16). We give a construction of a $4L$-layer Softmax transformer that approximates $L$ steps of gradient descent, and we guarantee both the approximation and the convergence (Theorem 2).

- **Experimental Validation.** We validate our theory with ReLU- and Softmax-transformers, specifically ICGD for $N$-layer networks (Theorem 1, Theorem 2, and Theorem 4). We assess the ICL capabilities of transformers by training 3-, 4-, and 6-layer networks in Section 5. The numerical results show that the performance of ICL matches that of training $N$-layer networks. However, a minor limitation is that the trained transformers do not always achieve the theoretical construction.

**Organization.** We present our main results in Theorem 1. Section 2 covers the preliminaries. Section 3 presents the problem setup and the ICL approximation of GD steps for an $N$-layer feed-forward network with both the ReLU-Transformer and the Softmax-Transformer. Section 5 presents the experimental results, with additional details in Appendix F. The appendix includes related work (Appendix A.1), detailed proofs for Section 3 (Appendix C), and an application to training diffusion models via ICL (Appendix G).

**Notations.** We use lower-case letters to denote vectors and upper-case letters to denote matrices. The index set $\{1, \dots, I\}$ is denoted by $[I]$, where $I \in \mathbb{N}^+$. For any matrix $A \in \mathbb{R}^{n \times n}$, let the $\ell_p$-norm of $A$ be induced by the vector $\ell_p$-norm, defined as $\|A\|_p := \sup\{\|Ax\|_p : x \in \mathbb{R}^n \text{ with } \|x\|_p = 1\}$. We use $A[i, j]$ to denote the element in the $i$-th row and $j$-th column of $A$. For any matrices $A, B \in \mathbb{R}^{m \times n}$, let $\odot$ denote the Hadamard product: $(A \odot B)[i, j] := A[i, j] \cdot B[i, j]$. For any matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{p \times q}$, let $\otimes$ denote the Kronecker product:

$$
A \otimes B := \begin{bmatrix} A[1,1]B & \cdots & A[1,n]B \\ \vdots & \ddots & \vdots \\ A[m,1]B & \cdots & A[m,n]B \end{bmatrix}.
$$

## 2 Preliminaries: ICL and ICGD

We present the ideas we build upon: In-Context Gradient Descent (ICGD).

**(i) ICL for Function $f$.** Let $f: \mathbb{R}^d \to \mathbb{R}$ be the function of interest. Suppose we have a dataset $D_n := \{(x_i, y_i)\}_{i \in [n]}$, where $\{x_i\}_{i \in [n]} \subset \mathbb{R}^d$ and $\{y_i\}_{i \in [n]} \subset \mathbb{R}$ are the inputs and outputs of $f$, respectively. Let $x_{n+1}$ be the test input. The goal of ICL is to use a transformer, denoted by $\mathcal{T}$, to predict $y_{n+1}$ based on the test input and the in-context dataset autoregressively: $\hat{y}_{n+1} \leftarrow \mathcal{T}(D_n, x_{n+1})$. For convenience in our analysis, we adopt the ICL notation from (Bai et al., 2023). Specifically, we shorthand $(D_n, x_{n+1})$ into an input sequence (i.e., prompt) of length $n+1$ and represent it as a compact matrix $H = [h_1, \dots, h_{n+1}] \in \mathbb{R}^{D \times (n+1)}$ in the form:

$$
H := \begin{bmatrix} x_1 & x_2 & \cdots & x_n & x_{n+1} \\ y_1 & y_2 & \cdots & y_n & 0 \\ q_1 & q_2 & \cdots & q_n & q_{n+1} \end{bmatrix}, \qquad q_i := \begin{bmatrix} 0_{D-(d+3)} \\ 1 \\ t_i \end{bmatrix} \in \mathbb{R}^{D-(d+1)}. \tag{2.1}
$$

We use $q_i$ to fill in the remaining $D-(d+1)$ entries in addition to $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The last entry $t_i := 1\{i < n+1\}$ of $q_i$ is the position indicator distinguishing the $n$ in-context examples from the test data. The problem of ICL for $f$ is to show the existence of a transformer $\mathcal{T}$ that, when given $H$, outputs $\mathcal{T}(H) \in \mathbb{R}^{D \times (n+1)}$ of the same shape, whose $(d+1, n+1)$ entry provides the prediction $\hat{y}_{n+1}$. The goal is for the prediction $\hat{y}_{n+1}$ to be close to $y_{n+1} = f(x_{n+1})$, measured by some proper loss.
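To make the prompt encoding (2.1) concrete, the following minimal NumPy sketch assembles $H$ from in-context examples and a test input. It is an illustration under our own layout assumptions: the helper name `build_prompt` and the exact placement of the constant $1$ and indicator $t_i$ at the bottom of each column are ours; the paper only fixes the block structure $[x_i; y_i; q_i]$ with $t_i = 1\{i < n+1\}$.

```python
import numpy as np

def build_prompt(X: np.ndarray, y: np.ndarray, x_test: np.ndarray, D: int) -> np.ndarray:
    """Assemble the ICL prompt H of (2.1) as a D x (n+1) matrix.

    Column i stacks [x_i; y_i; q_i], where q_i = [0_{D-(d+3)}; 1; t_i]
    and t_i = 1{i < n+1} flags in-context examples vs. the test token.
    """
    n, d = X.shape
    H = np.zeros((D, n + 1))
    for i in range(n):
        H[:d, i] = X[i]
        H[d, i] = y[i]
        H[D - 2, i] = 1.0   # constant entry of q_i
        H[D - 1, i] = 1.0   # position indicator t_i = 1 (example token)
    H[:d, n] = x_test       # test column: y-slot stays 0, t_{n+1} = 0
    H[D - 2, n] = 1.0
    return H

# Example: n = 4 examples in dimension d = 2, embedding dimension D = 8.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(4, 2)), rng.normal(size=4)
H = build_prompt(X, y, rng.normal(size=2), D=8)
print(H.shape)  # (8, 5)
```

In the constructions below, the zero padding in $q_i$ is the workspace where the transformer stores the network parameters $w$ and all intermediate gradient terms.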
**(ii) ICL for Gradient Descent of a Parametrized Model $f(w, \cdot)$.** We aim to use ICL to simulate the standard supervised training procedure for $N$-layer neural networks. To achieve this, we introduce the concept of In-Context Gradient Descent (ICGD) for a parameterized model. Consider a machine learning model $f(w, \cdot): \mathbb{R}^{D_w} \times \mathbb{R}^d \to \mathbb{R}^d$, parametrized by $w \in \mathbb{R}^{D_w}$. Given a dataset $D_n := \{(x_i, y_i)\}_{i \in [n]} \overset{\mathrm{iid}}{\sim} P$, a typical learning task is to find parameters $w^\star$ such that $f(w^\star, \cdot)$ becomes closest to the true data distribution $P$. Then, for any test input $x_{n+1}$, we predict $\hat{y}_{n+1} = f(w^\star, x_{n+1})$. To find $w^\star$, Bai et al. (2023) configure a transformer to implement gradient descent on $f(w, \cdot)$ through ICL, simulating optimization algorithms during inference without explicit parameter updates. We formalize this In-Context Gradient Descent (ICGD) problem: using a pretrained model to simulate gradient descent on $f(w, \cdot)$ w.r.t. the provided context $(D_n, x_{n+1})$.

**Problem 1** (In-Context Gradient Descent (ICGD) on Model $f(w, \cdot)$ (Bai et al., 2023)). Let $\epsilon > 0$ and $L \ge 1$. Consider a machine learning model $f(w, x): \mathbb{R}^{D_w} \times \mathbb{R}^d \to \mathbb{R}^d$ parameterized by $w \in \mathbb{R}^{D_w}$. Given a dataset $D_n := \{(x_i, y_i)\}_{i \in [n]} \overset{\mathrm{iid}}{\sim} P$ with $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}^d$, define the empirical risk function

$$
L_n(w) := \frac{1}{2n} \sum_{i=1}^{n} \ell(f(w, x_i), y_i), \tag{2.2}
$$

where $\ell: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a loss function. Let $\mathcal{W} \subseteq \mathbb{R}^{D_w}$ be a closed domain, and let $\mathrm{Proj}_{\mathcal{W}}$ denote the projection onto $\mathcal{W}$. The problem of ICGD on model $f(w, \cdot)$ is to find a transformer $\mathcal{T}$ with $L$ blocks, each approximating one step of gradient descent using $T$ layers. For any input $H^{(0)} \in \mathbb{R}^{D \times (n+1)}$ in the form of (2.1), the transformer $\mathcal{T}(H^{(0)})$ approximates $L$ steps of gradient descent. Specifically, for $l \in [L]$ and $i \in [n+1]$, the output of layer $Tl$ is $h_i^{(Tl)} = [x_i; y_i; w^{(l)}; 0; 1; t_i]$, where, with $w^{(0)} = 0$,

$$
w^{(l)} = \mathrm{Proj}_{\mathcal{W}}\big( w^{(l-1)} - \eta \nabla L_n(w^{(l-1)}) + \epsilon^{(l-1)} \big)
$$

is updated recursively, and $\epsilon^{(l-1)}$ with $\|\epsilon^{(l-1)}\|_2 \le \epsilon$ represents the approximation error at step $l-1$.

Problem 1 asks for a transformer $\mathcal{T}$ that performs $L$ gradient descent steps on the loss $L_n(w)$ in an implicit fashion (i.e., with no explicit parameter update). More precisely, Bai et al. (2023) configure $\mathcal{T}$ with $L$ identical blocks, each approximating one gradient descent step using $T$ layers. In this work, we investigate the case where $f(w, \cdot)$ is an $N$-layer neural network.

**Transformer.** We defer the standard definition of the transformer to Appendix B.1.

## 3 In-Context Gradient Descent on N-Layer Neural Networks

We now show that transformers are capable of implementing gradient descent on $N$-layer neural networks through ICL. In Section 3.1, we define the $N$-layer ReLU neural network and state its ICGD problem. In Section 3.2, we derive an explicit gradient descent expression for the $N$-layer NN. In Section 3.3, we construct a ReLU-Transformer that executes gradient descent on the $N$-layer NN via ICL. In Section 4, we show the existence of a Softmax-Transformer capable of performing in-context gradient descent on the $N$-layer NN.

### 3.1 Problem Setup: ICGD for N-Layer Neural Networks

To begin, we introduce the construction of the $N$-layer neural network on whose empirical loss function we aim to implement gradient descent.

**Definition 1** ($N$-Layer Neural Network). An $N$-layer neural network comprises $N-1$ hidden layers and 1 output layer, all constructed similarly. Let $r: \mathbb{R} \to \mathbb{R}$ be the activation function. For the hidden layers: for any $i \in [n+1]$, $j \in [N-1]$, and $k \in [K]$, the output of the first $j$ layers w.r.t. input $x_i \in \mathbb{R}^d$, denoted by $\mathrm{pred}_h(x_i; j) \in \mathbb{R}^K$, is defined recursively as

$$
\mathrm{pred}_h(x_i; 1)[k] := r(v_{1k}^\top x_i), \qquad \mathrm{pred}_h(x_i; j)[k] := r(v_{jk}^\top \mathrm{pred}_h(x_i; j-1)),
$$

where $v_{1k} \in \mathbb{R}^d$ and $v_{jk} \in \mathbb{R}^K$ for $j \in \{2, \dots, N-1\}$ are the $k$-th parameter vectors in the first layer and the $j$-th layer, respectively.
For the output layer (the $N$-th layer), the output of the full network w.r.t. input $x_i \in \mathbb{R}^d$, denoted by $\mathrm{pred}_o(x_i; w, N) \in \mathbb{R}^d$, is defined for any $k \in [d]$ as

$$
\mathrm{pred}_o(x_i; w, N)[k] := r(v_{Nk}^\top \mathrm{pred}_h(x_i; N-1)), \tag{3.1}
$$

where $v_{Nk} \in \mathbb{R}^K$ is the $k$-th parameter vector in the $N$-th layer and $w \in \mathbb{R}^{2dK + (N-2)K^2}$ denotes the vector containing all parameters of the neural network,

$$
w := \big[ v_{11}^\top, \dots, v_{1K}^\top, \dots, v_{jk}^\top, \dots, v_{N1}^\top, \dots, v_{Nd}^\top \big]^\top. \tag{3.2}
$$

**Remark 1** (Prediction Function of the $j$-th Layer on the $i$-th Data Point: $p_i(j)$). For simplicity, we abbreviate the output of the first $j$ layers of the $N$-layer neural network with input $x_i$ as

$$
p_i(j) := \begin{cases} x_i \in \mathbb{R}^d, & j = 0, \\ \mathrm{pred}_h(x_i; j) \in \mathbb{R}^K, & j \in [N-1], \\ \mathrm{pred}_o(x_i; w, N) \in \mathbb{R}^d, & j = N. \end{cases} \tag{3.3}
$$

Additionally, we define $p_i := [p_i(1); \dots; p_i(N)] \in \mathbb{R}^{(N-1)K + d}$.

We formalize the problem of using a transformer to simulate gradient descent algorithms for training the $N$-layer NN of Definition 1 by optimizing loss (2.2). Specifically, we consider ICGD (Problem 1) with the parameterized model $f(w, \cdot) := \mathrm{pred}_o(\cdot; w, N)$.

**Problem 2** (ICGD on $N$-Layer Neural Networks). Let the $N$-layer neural network, activation function $r$, and prediction functions $p_i(j)$ for all layers follow Definition 1 and Remark 1. Assume we are under the identical setting as Problem 1, considering the model $f(w, \cdot) := \mathrm{pred}_o(\cdot; w, N)$ and specifying $\mathcal{W}$ as a closed domain such that for any $j \in [N-1]$ and $k \in [K]$,

$$
\mathcal{W} \subseteq \big\{ w = [v_{jk}] \in \mathbb{R}^{D_N} : \|v_{jk}\|_2 \le B_v \big\}. \tag{3.4}
$$

The problem of ICGD on $N$-layer neural networks is to find a $TL$-layer transformer $\mathcal{T}$ capable of implementing $L$ steps of gradient descent as in Problem 1.

**Remark 2** (Why a Bounded Domain $\mathcal{W}$?). To use a sum of ReLUs to approximate functions like $r$, as illustrated in the subsequent section, we need to avoid exploding gradients. Therefore, we require $\mathcal{W}$ to be a bounded domain and utilize $\mathrm{Proj}_{\mathcal{W}}$ to project $w$ into the bounded domain $\mathcal{W}$.

### 3.2 Explicit Gradient Descent of N-Layer Neural Networks

Intuitively, Problem 2 asks whether there exists a transformer capable of simulating the gradient descent algorithm on the loss function of an $N$-layer neural network. We answer Problem 2 by providing an explicit construction of such a transformer $\mathcal{T}$ in Theorem 1. To facilitate our proof, we first introduce the necessary notation for the explicit expression of the gradient $\nabla_w L_n(w)$.

**Definition 2** (Abbreviations). Fix $i \in [n+1]$, and consider an $N$-layer neural network with activation function $r$ and prediction functions $p_i(j)$ as defined in Definition 1. Let $D_j$ denote the total number of parameters in the first $j$ layers. By (3.2), we have

$$
D_j = \begin{cases} 0, & j = 0, \\ dK, & j = 1, \\ (j-1)K^2 + dK, & 2 \le j \le N-1, \\ (N-2)K^2 + 2dK, & j = N. \end{cases}
$$

The parameter vector $w := [v_{11}^\top, \dots, v_{1K}^\top, \dots, v_{N-1,1}^\top, \dots, v_{N-1,K}^\top, v_{N1}^\top, \dots, v_{Nd}^\top]^\top$ follows (3.2). Define

$$
\phi_i := \Big[ \frac{\partial p_i(N)}{\partial w} \Big]^\top \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \in \mathbb{R}^{D_N}.
$$

For any $j \in [N]$, let $A_i(j)$ denote the derivative of $\ell(p_i(N), y_i)$ with respect to the parameters in the $j$-th layer: $A_i(j) = \phi_i[D_{j-1} : D_j]$, where $\phi_i[a : b]$ selects the elements from the $a$-th to the $b$-th position of $\phi_i$. For the activation function $r(t)$, let $r'(t)$ be its derivative. Define $r_i'(j) \in \mathbb{R}^K$ as $r_i'(j)[k] := r'(v_{j+1,k}^\top p_i(j))$. Define $r_i' := [r_i'(0); \dots; r_i'(N-1)]$ and

$$
R_i(j) := \begin{cases} \mathrm{diag}\{ r'(v_{j+1,1}^\top p_i(j)), \dots, r'(v_{j+1,K}^\top p_i(j)) \}, & j \le N-2, \\ \mathrm{diag}\{ r'(v_{j+1,1}^\top p_i(j)), \dots, r'(v_{j+1,d}^\top p_i(j)) \}, & j = N-1, \end{cases}
$$

where $R_i(j) \in \mathbb{R}^{K \times K}$ for $j \in \{0, \dots, N-2\}$ and $R_i(j) \in \mathbb{R}^{d \times d}$ for $j = N-1$.
For any $j \in [N]$, let $V_j$ denote the parameters in the $j$-th layer:

$$
V_j := \begin{cases} [v_{11}, \dots, v_{1K}]^\top \in \mathbb{R}^{K \times d}, & j = 1, \\ [v_{j1}, \dots, v_{jK}]^\top \in \mathbb{R}^{K \times K}, & j \in \{2, \dots, N-1\}, \\ [v_{N1}, \dots, v_{Nd}]^\top \in \mathbb{R}^{d \times K}, & j = N. \end{cases}
$$

Definition 2 splits the gradient of $L_n(w)$ into $N$ parts. This makes $\nabla_w L_n(w)$ more interpretable and tractable, since all parts follow a recursion formula given by the chain rule. With the above notation, we calculate the gradient descent step (2.3) of the $N$-layer neural network as follows.

**Lemma 1** (Decomposition of One Gradient Descent Step). Fix any $B_v, \eta > 0$. Suppose the loss function $L_n(w)$ on $n$ data points $\{(x_i, y_i)\}_{i \in [n]}$ follows (2.2). Suppose the closed domain $\mathcal{W}$ and projection $\mathrm{Proj}_{\mathcal{W}}(w)$ follow (3.4). Let $A_i(j)$, $r_i'(j)$, $R_i(j)$, $V_j$ be as defined in Definition 2. Then the explicit form of the gradient $\nabla L_n(w)$ becomes

$$
\nabla_w L_n(w) = \frac{1}{2n} \sum_{i=1}^{n} \big[ A_i(1); \dots; A_i(N) \big], \tag{3.5}
$$

where $A_i(j)$, the derivative of $\ell(p_i(N), y_i)$ with respect to the parameters in the $j$-th layer, takes the form

$$
A_i(j) = \begin{cases} \Big( R_i(N-1) V_N \cdots R_i(j-1) \big[ I_{K \times K} \otimes p_i(j-1) \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big), & j \ne N, \\ \Big( R_i(N-1) \big[ I_{d \times d} \otimes p_i(N-1) \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big), & j = N. \end{cases}
$$

*Proof Sketch.* Using the chain rule and product rule, we decompose the gradient as $\nabla_w L_n(w) = \frac{1}{2n} \sum_{i=1}^{n} [\frac{\partial p_i(N)}{\partial w}]^\top [\frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)}]$. Thus, we only need to compute $\frac{\partial p_i(N)}{\partial w}$. By Definition 1 and the chain rule, we prove that $\frac{\partial p_i(N)}{\partial w}$ satisfies the recursive formulation (C.4). Combining these, we derive the explicit form of the gradient $\nabla_w L_n(w)$, and the gradient step follows directly. Please see Appendix C.1 for a detailed proof.

Since it is hard to calculate the elements of $A_i(j)$ in a straightforward manner, we calculate each part of it successively. We define the intermediate terms $s_i(j)$ and $u$ as follows.

**Definition 3** (Intermediate Terms). Let $A_i(j)$, $r_i'(j)$, $R_i(j)$, $V_j$ be as defined in Definition 2. For any $t, y \in \mathbb{R}^d$, we define the vector function $u(t, y) := \frac{\partial \ell(t, y)}{\partial t} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$. Moreover, for any $j \in [N]$ and $i \in [n+1]$, we define

$$
s_i(j) := \begin{cases} R_i(j-1) V_{j+1}^\top \cdots R_i(N-2) V_N^\top R_i(N-1)\, u(p_i(N), y_i) \in \mathbb{R}^K, & j \ne N, \\ R_i(N-1)\, u(p_i(N), y_i) \in \mathbb{R}^d, & j = N. \end{cases}
$$

Let $\odot$ denote the Hadamard product. For any $j \in [N-1]$ and $i \in [n+1]$, Definition 3 leads to

$$
s_i(j) = r_i'(j-1) \odot \big( V_{j+1}^\top s_i(j+1) \big). \tag{3.6}
$$

Moreover, by Definition 3, it holds that

$$
A_i(j) = \begin{cases} \big[ I_{K \times K} \otimes p_i(j-1) \big]\, s_i(j), & j \ne N, \\ \big[ I_{d \times d} \otimes p_i(N-1) \big]\, s_i(N), & j = N. \end{cases} \tag{3.7}
$$

### 3.3 Transformers Approximate Gradient Descent of N-Layer Neural Networks In-Context

To use neural networks to approximate (2.2), which involves varying smooth functions, we need to approximate these smooth functions by simple combinations of activation functions. Our key approximation tool is the sum-of-ReLUs approximation of smooth functions (Bai et al., 2023).

**Definition 4** (Approximability by Sum of ReLUs, Definition 12 of (Bai et al., 2023)). Let $z \in \mathbb{R}^k$. We say a function $g: \mathbb{R}^k \to \mathbb{R}$ is $(\epsilon_{\mathrm{approx}}, R, H, C)$-approximable by sum of ReLUs if there exists an $(H, C)$-sum-of-ReLUs function

$$
f_{H,C}(z) = \sum_{h=1}^{H} c_h \sigma(a_h^\top [z; 1]) \quad \text{with} \quad \sum_{h=1}^{H} |c_h| \le C, \ \max_{h \in [H]} \|a_h\|_1 \le 1, \ a_h \in \mathbb{R}^{k+1}, \ c_h \in \mathbb{R},
$$

such that $\sup_{z \in [-R, R]^k} |g(z) - f_{H,C}(z)| \le \epsilon_{\mathrm{approx}}$.

**Overview of Our Proof Strategy.** Lemma 1 and Definition 4 motivate the following strategy: term-by-term approximation of our gradient descent step (3.5). Please see Figure 1 for a high-level visualization.

**Step 1.** Given $(x_i, w)$, we use $N$ attention layers to approximate the output of the first $j$ layers with input $x_i$, i.e., $p_i(j) \in \mathbb{R}^K$ (Definition 1) for any $j \in [N]$. Then we use one attention layer to approximate the chain-rule intermediate terms $r_i'(j-1)[k] := r'(v_{jk}^\top p_i(j-1))$ (Definition 2) for any $i \in [n]$, $j \in [N]$, and $k \in [K]$: Lemma 2 and Lemma 3.
**Step 2.** Given $(r_i', p_i, w)$, we use an MLP layer to approximate $u(p_i(N), y_i)$ (Definition 3) for $i \in [n]$, and use $N$ element-wise multiplication layers to approximate $s_i(j)$ (Definition 3) for any $j \in [N]$: Lemma 4 and Lemma 5. Moreover, Lemma 6 shows the closeness result for approximating $s_i(j)$, which leads to the final error accumulation in Theorem 1.

**Step 3.** Given $(p_i, r_i', g_i, \bar{s}_i(j), w)$, we use an attention layer to approximate $w - \eta \nabla L_n(w)$. Then we use an MLP layer to approximate $\mathrm{Proj}_{\mathcal{W}}(w)$, and implement $L$ gradient descent steps by a $(2N+4)L$-layer neural network $\mathrm{NN}_\theta$ constructed from Steps 1 and 2. Finally, we arrive at our main result: Theorem 1. Furthermore, Lemma 14 shows closeness results to the true gradient descent path.

**Step 1.** We start with the approximation of $p_i(j)$.

**Lemma 2** (Approximate $p_i(j)$). Let upper bounds $B_v, B_x > 0$ be such that for any $k \in [K]$, $j \in [N]$, and $i \in [n]$, $\|v_{jk}\|_2 \le B_v$ and $\|x_i\|_2 \le B_x$. For any $j \in [N]$ and $i \in [n]$, define $B_r^j := \max_{|t| \le B_v B_r^{j-1}} |r(t)|$, $B_r^0 := B_x$, and $B_r := \max_j B_r^j$. Let the function $r(t)$ be $(\epsilon_r, R_1, M_1, C_1)$-approximable by sum of ReLUs for $R_1 = \max\{B_v B_r, 1\}$ and $M_1 \le \tilde{O}(C_1^2 \epsilon_r^{-2})$, where $C_1$ depends only on $R_1$ and the $C^2$-smoothness of $r$. Then, for any $\epsilon_r > 0$, there exist $N$ attention layers $\mathrm{Attn}_{\theta_1}, \dots, \mathrm{Attn}_{\theta_N}$ such that for any input $h_i \in \mathbb{R}^D$ taken from (2.1), they map

$$
h_i = [x_i; y_i; w; \bar{p}_i(1); \dots; \bar{p}_i(j-1); 0; 1; t_i] \ \xrightarrow{\ \mathrm{Attn}_{\theta_j}\ } \ \tilde{h}_i = [x_i; y_i; w; \bar{p}_i(1); \dots; \bar{p}_i(j); 0; 1; t_i],
$$

where $\bar{p}_i(j)$ is the approximation of $p_i(j)$ (Definition 1). In the expressions of $h_i$ and $\tilde{h}_i$, the dimensions of the $0$ paddings differ: the $0$ in $h_i$ is larger than that in $\tilde{h}_i$, and the dimensional difference equals the dimension of $\bar{p}_i(j)$. Suppose the function $r$ is $L_r$-smooth in the bounded domain $\mathcal{W}$. Then for any $i \in [n+1]$ and $j \in [N]$, $\bar{p}_i(j)$ satisfies

$$
\bar{p}_i(j) = p_i(j) + \epsilon(i, j), \tag{3.8}
$$

$$
\|\epsilon(i, j)\|_2 \le \begin{cases} \big( \sum_{l=0}^{j-1} K^{l/2} L_r^l B_v^l \big) \sqrt{K}\, \epsilon_r, & 1 \le j \le N-1, \\ \big( \sum_{l=0}^{N-1} K^{l/2} L_r^l B_v^l \big) \sqrt{d}\, \epsilon_r, & j = N. \end{cases} \tag{3.9}
$$

Additionally, for any $j \in [N]$, the parameter norm $B_{\theta_j}$ defined in (B.1) satisfies $B_{\theta_j} \le 1 + K C_1$.

*Proof.* Please see Appendix C.2 for a detailed proof.

[Figure 1 (diagram omitted): inputs $\{(x_i, y_i)\}_{i \in [n]}, w$ pass through $\mathrm{TF}^N_\theta \to \mathrm{TF}^1_\theta \to \mathrm{TF}^1_\theta \to \mathrm{EWML}^N_\theta \to \mathrm{TF}^2_\theta$, producing $p_i(j)$, $r_i'(j)$, $u(p_i(N), y_i)$, $s_i(j)$, and $\mathrm{Proj}_{\mathcal{W}}(w - \eta \nabla L_n(w))$, via Lemma 2, Lemma 3, Lemma 4, Lemma 5, and Theorem 1.]

**Figure 1:** One-Step In-Context Gradient Descent (ICGD) with a $(2N+4)$-Layer Transformer. This illustration presents the backpropagation process within ICGD in a transformer model with $2N+4$ layers. It simulates a single gradient descent step for an $N$-layer neural network, trained with loss $L_n$ and dataset $\{(x_i, y_i)\}_{i \in [n]}$. The term $p_i(j)$ denotes the output after the $j$-th layer for input $x_i$. The terms $r_i'(j)$, $u(p_i(N), y_i)$, and $s_i(j)$ are intermediate terms of the gradient $\nabla L_n(w)$ from the chain rule. The expression $\mathrm{Proj}_{\mathcal{W}}(w - \eta \nabla L_n(w))$ shows one gradient descent step. Here, $\eta$ is the learning rate, and $\mathcal{W}$ denotes the bounded domain for the $N$-layer NN parameters $w$.

Notice that the form of the error accumulation in Lemma 2 is complicated. For ease of later presentation, we define the upper bound on the coefficients in (3.9) as

$$
E_r := \max_{j \in [N]} \Big\{ \Big( \sum_{l=0}^{j-1} K^{l/2} L_r^l B_v^l \Big) \sqrt{K},\ \Big( \sum_{l=0}^{N-1} K^{l/2} L_r^l B_v^l \Big) \sqrt{d} \Big\}, \tag{3.11}
$$

such that (3.8) becomes

$$
\bar{p}_i(j) = p_i(j) + \epsilon(i, j), \qquad \|\epsilon(i, j)\|_2 \le E_r \epsilon_r. \tag{3.12}
$$

Moreover, we abbreviate $\bar{p}_i := [\bar{p}_i(1); \dots; \bar{p}_i(N)] \in \mathbb{R}^{(N-1)K + d}$, such that the output of $\mathrm{Attn}_{\theta_1} \circ \cdots \circ \mathrm{Attn}_{\theta_N}$ is

$$
h_i = [x_i; y_i; w; \bar{p}_i; 0; 1; t_i]. \tag{3.13}
$$

Then, the next lemma approximates $r_i'(j)$ based on the $\bar{p}_i(j)$ obtained in Lemma 2.

**Lemma 3** (Approximate $r_i'(j)$). Let upper bounds $B_v, B_x > 0$ be such that for any $k \in [K]$, $j \in [N]$, and $i \in [n]$, $\|v_{jk}\|_2 \le B_v$ and $\|x_i\|_2 \le B_x$.
For any $j \in [N]$ and $i \in [n]$, define $B_{r'}^j := \max_{|t| \le B_v B_r^{j-1}} |r'(t)|$, $B_{r'}^0 := B_x$, and $B_{r'} := \max_j B_{r'}^j$. Suppose the function $r'(t)$ is $(\epsilon_{r'}, R_2, M_2, C_2)$-approximable by sum of ReLUs for $R_2 = \max\{B_v B_r, 1\}$ and $M_2 \le \tilde{O}(C_2^2 \epsilon_{r'}^{-2})$, where $C_2$ depends only on $R_2$ and the $C^2$-smoothness of $r'$. Then, for any $\epsilon_{r'} > 0$, there exists an attention layer $\mathrm{Attn}_{\theta_{N+1}}$ such that for any input $h_i \in \mathbb{R}^D$ taken from (3.13), it maps

$$
h_i = [x_i; y_i; w; \bar{p}_i; 0; 1; t_i] \ \xrightarrow{\ \mathrm{Attn}_{\theta_{N+1}}\ } \ \tilde{h}_i = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; 0; 1; t_i],
$$

where $\bar{r}_i'(j)$ is the approximation of $r_i'(j)$ (Definition 2) and $\bar{r}_i' := [\bar{r}_i'(0); \dots; \bar{r}_i'(N-1)] \in \mathbb{R}^{(N-1)K + d}$. Similar to Lemma 2, the dimensions of the $0$ paddings in $h_i$ and $\tilde{h}_i$ differ. In addition, let $E_r$ be defined as in (3.11); then for any $i \in [n+1]$, $j \in [N]$, and $k \in [K]$, $\bar{r}_i'$ satisfies

$$
\bar{r}_i'(j-1)[k] = r_i'(j-1)[k] + \epsilon(i, j, k), \qquad |\epsilon(i, j, k)| \le \epsilon_{r'} + L_{r'} B_v E_r \epsilon_r, \tag{3.14}
$$

where $\epsilon_{r'}$ denotes the error generated in approximating $r'$ by the sum of ReLUs $\bar{r}'$ following (C.5). Additionally, the parameter norm $B_{\theta_{N+1}}$ defined in (B.1) satisfies $B_{\theta_{N+1}} \le 1 + K(N-1) C_2$.

*Proof Sketch.* By Lemma 2, we obtain $\bar{p}_i(j)$, the approximation of $p_i(j)$ in (3.3). Using $\bar{p}_i(j)$, we construct an attention layer to approximate $r_i'(j)$. We then establish upper bounds on the errors $|\bar{r}_i'(j)[k] - r_i'(j)[k]|$ by applying the Cauchy–Schwarz inequality and Lemma 2. Finally, we present the norms (B.1) of the constructed transformers. Please see Appendix C.3 for a detailed proof.

Let $\mathrm{Attn}_{\theta_j}$ ($j \in [N]$) be as defined in Lemma 2. Then Lemma 3 implies that for input taken from Problem 2, the output of $\mathrm{Attn}_{\theta_1} \circ \cdots \circ \mathrm{Attn}_{\theta_{N+1}}$ is

$$
h_i = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; 0; 1; t_i]. \tag{3.15}
$$

**Step 2.** Now, we construct an approximation of $u(p_i(N), y_i) = \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)}$.

**Lemma 4** (Approximate $u(p_i(N), y_i)$). Let upper bounds $B_v, B_x, B_y > 0$ be such that for any $k \in [K]$, $j \in [N]$, and $i \in [n]$, $\|v_{jk}\|_2 \le B_v$, $\|x_i\|_2 \le B_x$, and $\|y_i\|_2 \le B_y$. For any $k \in [d]$, suppose the function $u(t, y)[k]$ is $(\epsilon_l, R_3, M_3^k, C_3^k)$-approximable by sum of ReLUs for $R_3 = \max\{B_v B_r, B_y, 1\}$ and $M_3 \le \tilde{O}((C_3^k)^2 \epsilon_l^{-2})$, where $C_3^k$ depends only on $R_3$ and the $C^3$-smoothness of $u(t, y)[k]$. Then, there exists an MLP layer $\mathrm{MLP}_{\theta_{N+2}}$ such that for any input sequence $h_i \in \mathbb{R}^D$ taken from (3.15), it maps

$$
h_i = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; 0; 1; t_i] \ \xrightarrow{\ \mathrm{MLP}_{\theta_{N+2}}\ } \ \tilde{h}_i = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; g_i; 0; 1; t_i],
$$

where $g_i \in \mathbb{R}^d$ is an approximation of $u(p_i(N), y_i)$. For any $k \in [d]$, assume $u(p_i(N), y_i)$ is $L_l$-Lipschitz continuous. Then the approximation $g_i$ satisfies

$$
g_i[k] = u(p_i(N), y_i)[k] + \epsilon(i, k), \qquad |\epsilon(i, k)| \le \epsilon_l + L_l E_r \epsilon_r. \tag{3.16}
$$

Additionally, the parameters $\theta_{N+2}$ satisfy $B_{\theta_{N+2}} \le \max\{R_3 + 1, C_3\}$.

*Proof Sketch.* By Definition 1, we provide term-by-term approximations of $p_i(j)$ as forward propagation. Specifically, we construct attention layers to implement the forward propagation algorithm. Then we establish upper bounds on the errors $\|\bar{p}_i(j) - p_i(j)\|_2$ inductively. Finally, we present the norms (B.1) of the constructed transformers. Please see Appendix C.4 for a detailed proof.

Let $\mathrm{Attn}_{\theta_j}$ ($j \in [N+1]$) be as defined in Lemma 2 and Lemma 3. Then for any input sequence $h_i \in \mathbb{R}^D$ taken from (2.1), the output of $\mathrm{Attn}_{\theta_1} \circ \cdots \circ \mathrm{Attn}_{\theta_{N+1}} \circ \mathrm{MLP}_{\theta_{N+2}}$ is

$$
h_i = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; g_i; 0; 1; t_i]. \tag{3.17}
$$

Before introducing our next approximation lemma, we define an element-wise multiplication layer, since attention mechanisms and MLPs are unable to compute self-products (e.g., output $xy$ from input $[x; y]$). To enable self-multiplication, we introduce a function $\gamma$ which, for any square matrix, preserves the diagonal elements and sets all others to zero.

**Definition 5** (Operator Function $\gamma$). For any square matrix $A \in \mathbb{R}^{n \times n}$, define $\gamma(A) := \mathrm{diag}(A[1,1], \dots, A[n,n]) \in \mathbb{R}^{n \times n}$.
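As a quick illustration of Definition 5, the sketch below implements $\gamma$ in NumPy and checks the identity that makes it useful: the $(i, i)$ entry of $(QH)^\top (KH)$ equals $\langle Q h_i, K h_i \rangle$, a product of two linear maps of the same token, which is exactly the self-product that plain attention and MLP layers cannot form. This is a hypothetical illustration, not part of the paper's construction.

```python
import numpy as np

def gamma(A: np.ndarray) -> np.ndarray:
    """Definition 5: keep the diagonal of a square matrix, zero the rest."""
    return np.diag(np.diag(A))

# The (i, i) entry of (QH)^T (KH) is <Q h_i, K h_i>: a product of two
# transforms of the *same* token h_i, i.e., a self-product.
rng = np.random.default_rng(1)
Q, K = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
H = rng.normal(size=(5, 4))                       # 4 tokens of dimension 5
scores = (Q @ H).T @ (K @ H)                      # 4 x 4 score matrix
assert np.allclose(np.diag(gamma(scores)),
                   [(Q @ H)[:, i] @ (K @ H)[:, i] for i in range(4)])
```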
By Definition 5, we introduce the following element-wise multiplication layer, capable of performing self-multiplication operations such as the Hadamard product.

**Definition 6** (Element-wise Multiplication Layer). Let $\gamma$ be defined as in Definition 5. An element-wise multiplication layer with $M$ heads is denoted $\mathrm{EWML}_\theta(\cdot)$ with parameters $\theta = \{Q_m, K_m, V_m\}_{m \in [M]}$. On any input sequence $H \in \mathbb{R}^{D \times n}$,

$$
\mathrm{EWML}_\theta(H) = H + \sum_{m=1}^{M} (V_m H)\, \gamma\big( (Q_m H)^\top (K_m H) \big), \tag{3.18}
$$

where $Q_m, K_m, V_m \in \mathbb{R}^{D \times D}$ and $\gamma(\cdot)$ is the operator function of Definition 5. In vector form, for each token $h_i \in \mathbb{R}^D$ in $H$, it outputs $[\mathrm{EWML}_\theta(H)]_i = h_i + \sum_{m=1}^{M} \langle Q_m h_i, K_m h_i \rangle\, V_m h_i$. In addition, we define the $L$-layer network $\mathrm{EWML}^L_\theta := \mathrm{EWML}_{\theta_1} \circ \cdots \circ \mathrm{EWML}_{\theta_L}$.

**Remark 3** (Necessity of the Element-Wise Multiplication Layer). As we show in subsequent sections, EWML is capable of implementing multiplication within $h_i$. Specifically, it allows us to multiply certain elements of $h_i$ in Lemma 5. By Definition 7, it is impossible for transformer layers to achieve our goal without any other assumptions.

Similar to (B.1), we define the norm of an $L$-layer network $\mathrm{EWML}^L_\theta$ as

$$
B_\theta := \max_{m \in [M],\, l \in [L]} \big\{ \|Q_m^l\|_1, \|K_m^l\|_1, \|V_m^l\|_1 \big\}. \tag{3.19}
$$

Then, given the approximations of $p_i(j)$ and $r_i'(j)$, we use $N$ element-wise multiplication layers (Definition 6) to approximate $s_i(j)$, the chain-rule intermediate terms of Definition 3.

**Lemma 5** (Approximate $s_i(j)$). Recall that $s_i(j) = r_i'(j-1) \odot (V_{j+1}^\top s_i(j+1))$ follows Definition 3. Let the initial input be taken from (3.17). Then, there exist $N$ element-wise multiplication layers $\mathrm{EWML}_{\theta_{N+3}}, \dots, \mathrm{EWML}_{\theta_{2N+2}}$ such that for input sequences, for $j \in [N]$,

$$
h_i = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; g_i; \bar{s}_i(N); \dots; \bar{s}_i(j+1); 0; 1; t_i],
$$

they map

$$
\mathrm{EWML}_{\theta_{2N+3-j}}(h_i) = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; g_i; \bar{s}_i(N); \dots; \bar{s}_i(j); 0; 1; t_i],
$$

where the approximation $\bar{s}_i(j)$ is defined recursively: for any $i \in [n+1]$ and $j \in [N]$,

$$
\bar{s}_i(j) := \begin{cases} \bar{r}_i'(j-1) \odot \big( V_{j+1}^\top \bar{s}_i(j+1) \big), & j \in [N-1], \\ \bar{r}_i'(N-1) \odot g_i, & j = N. \end{cases} \tag{3.20}
$$

Additionally, for any $j \in [N]$, $B_{\theta_{N+2+j}}$ defined in (B.1) satisfies $B_{\theta_{N+2+j}} \le 1$.

*Proof.* Please see Appendix C.5 for a detailed proof.

Let $\mathrm{Attn}_{\theta_j}$ ($j \in [N+1]$) and $\mathrm{MLP}_{\theta_{N+2}}$ be as defined in Lemma 2, Lemma 3, and Lemma 4, respectively. Define $\bar{s}_i := [\bar{s}_i(N); \dots; \bar{s}_i(1)] \in \mathbb{R}^{(N-1)K + d}$. Then for any input sequence $h_i \in \mathbb{R}^D$ taken from Problem 2, the output of the network

$$
\mathrm{Attn}_{\theta_1} \circ \cdots \circ \mathrm{Attn}_{\theta_{N+1}} \circ \mathrm{MLP}_{\theta_{N+2}} \circ \mathrm{EWML}_{\theta_{N+3}} \circ \cdots \circ \mathrm{EWML}_{\theta_{2N+2}} \tag{3.21}
$$

is

$$
h_i = [x_i; y_i; w; \bar{p}_i; \bar{r}_i'; \bar{s}_i; 0; 1; t_i]. \tag{3.22}
$$

For simplicity, we regard a ReLU-attention layer and an MLP layer each as a special kind of transformer layer. In this way, by Definition 9, (3.21) becomes $\mathrm{TF}^{N+2}_\theta \circ \mathrm{EWML}^N_\theta$. Next we calculate the error accumulation $|\bar{s}_i(j)[k] - s_i(j)[k]|$ based on Lemma 3 and Lemma 4.

**Lemma 6** (Error for $s_i(j)$). Suppose the upper bounds $B_v, B_x > 0$ are such that for any $k \in [K]$, $j \in [N]$, and $i \in [n]$, $\|v_{jk}\|_2 \le B_v$ and $\|x_i\|_2 \le B_x$. Let $r_i'(j) \in \mathbb{R}^K$ with $r_i'(j)[k] := r'(v_{j+1,k}^\top p_i(j))$ follow Definition 2. Let $s_i(j) = R_i(j-1) V_{j+1}^\top \cdots R_i(N-2) V_N^\top R_i(N-1)\, u$ follow Definition 3. Let $\bar{r}_i'(j)$, $g_i$, $\bar{s}_i(j)$ be the approximations of $r_i'(j)$, $u(p_i(N), y_i)$, $s_i(j)$ following Lemma 3, Lemma 4, and Lemma 5, respectively. Let $B_{r'}$ be the upper bound on $r_i'(j)[k]$ and $\bar{r}_i'(j)[k]$ as defined in Lemma 3, and let $B_l$ be the upper bound on $g_i$ and $u(p_i(N), y_i)$ as defined in Lemma 4. Then for any $i \in [n+1]$, $j \in [N]$, and $k \in [K]$,

$$
|\bar{s}_i(j)[k]| \le B_s, \qquad |\bar{s}_i(j)[k] - s_i(j)[k]| \le E_s^r \epsilon_r + E_s^{r'} \epsilon_{r'} + E_s^l \epsilon_l,
$$

where $B_s$ is the upper bound on $\bar{s}_i(j)[k]$, and $E_s^r, E_s^{r'}, E_s^l$ are the coefficients of $\epsilon_r, \epsilon_{r'}, \epsilon_l$ in the upper bound on $|\bar{s}_i(j)[k] - s_i(j)[k]|$, respectively.

*Proof.* Please see Appendix C.6 for a detailed proof.
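Below is a minimal NumPy sketch of the element-wise multiplication layer in (3.18), computing the per-token scalars $\langle Q_m h_i, K_m h_i \rangle$ (the diagonal that $\gamma$ retains) directly. The function name and toy shapes are illustrative assumptions; the actual construction in Lemma 5 picks $Q_m, K_m, V_m$ so that these scalar-times-value updates realize the Hadamard products in (3.20).

```python
import numpy as np

def ewml(H: np.ndarray, Qs, Ks, Vs) -> np.ndarray:
    """Element-wise multiplication layer of Definition 6, eq. (3.18):
    EWML(H) = H + sum_m (V_m H) gamma((Q_m H)^T (K_m H)).

    Token-wise this adds <Q_m h_i, K_m h_i> * V_m h_i to h_i, so the layer
    can form Hadamard-type products such as r'_i(j-1) ⊙ (V_{j+1}^T s_i(j+1)).
    """
    out = H.copy()
    for Q, K, V in zip(Qs, Ks, Vs):
        # Per-column inner products = diagonal of (QH)^T (KH).
        scores = np.einsum("di,di->i", Q @ H, K @ H)
        out += (V @ H) * scores[None, :]   # scale each token's value vector
    return out

# One head acting on a D x n prompt.
rng = np.random.default_rng(2)
D, n = 6, 5
H = rng.normal(size=(D, n))
Q, K, V = (rng.normal(size=(D, D)) for _ in range(3))
print(ewml(H, [Q], [K], [V]).shape)  # (6, 5)
```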
Lemma 6 offers the explicit form of the error $|\bar{s}_i(j)[k] - s_i(j)[k]|$, which is crucial for calculating the error $\|\overline{\nabla_w L_n}(w) - \nabla_w L_n(w)\|_2$ in Theorem 1.

**Step 3.** Combining the above, we prove the existence of a neural network that implements $L$ in-context GD steps on our $N$-layer neural network, and we arrive at our main result: a neural network $\mathcal{T}$ for Problem 2.

**Theorem 1** (In-Context Gradient Descent on $N$-Layer NNs). Fix any $B_v, \eta, \epsilon > 0$ and $L \ge 1$. For any input sequence taken from (2.1), there exist upper bounds $B_x, B_y$ such that for any $i \in [n]$, $\|y_i\|_2 \le B_y$ and $\|x_i\|_2 \le B_x$. Assume the functions $r(t)$, $r'(t)$, and $u(t, y)[k]$ are $L_r$-, $L_{r'}$-, and $L_l$-Lipschitz continuous, respectively. Suppose $\mathcal{W}$ is a closed domain such that for any $j \in [N-1]$ and $k \in [K]$, $\mathcal{W} \subseteq \{ w = [v_{jk}] \in \mathbb{R}^{D_N} : \|v_{jk}\|_2 \le B_v \}$, and $\mathrm{Proj}_{\mathcal{W}}$ projects $w$ into the bounded domain $\mathcal{W}$. Assume $\mathrm{Proj}_{\mathcal{W}} = \mathrm{MLP}_\theta$ for some MLP layer with hidden dimension $D_w$ and parameter norm bounded by $C_w$. If the functions $r(t)$, $r'(t)$, and $u(t, y)[k]$ are $C^4$-smooth, then for any $\epsilon > 0$, there exists a transformer model $\mathrm{NN}_\theta$ with $(2N+4)L$ hidden layers, consisting of $L$ neural network blocks of the form $\mathrm{TF}^{N+2}_\theta \circ \mathrm{EWML}^N_\theta \circ \mathrm{TF}^2_\theta$,

$$
\mathrm{NN}_\theta := \mathrm{TF}^{N+2}_\theta \circ \mathrm{EWML}^N_\theta \circ \mathrm{TF}^2_\theta,
$$

such that the head numbers $M^l$, parameter dimensions $D^l$, and parameter norms $B_{\theta^l}$ satisfy

$$
\max_{l \in [(2N+4)L]} M^l \le \tilde{O}(\epsilon^{-2}), \qquad \max_{l \in [(2N+4)L]} D^l \le O(N K^2) + D_w, \qquad \max_{l \in [(2N+4)L]} B_{\theta^l} \le O(\eta) + C_w + 1,
$$

where $\tilde{O}(\cdot)$ hides constants that depend on $d$, $K$, $N$, the radius parameters $B_x$, $B_y$, $B_v$, and the smoothness of $r$ and $\ell$. Moreover, for any input sequence $H^{(0)}$ taken from (2.1), $\mathrm{NN}_\theta(H^{(0)})$ implements $L$ steps of in-context gradient descent on the risk (2.2): for every $l \in [L]$, the $(2N+4)l$-th layer outputs $h_i^{((2N+4)l)} = [x_i; y_i; w^{(l)}; 0; 1; t_i]$ for every $i \in [n+1]$, with approximate iterates $w^{(l)}$ such that

$$
w^{(l)} = \mathrm{Proj}_{\mathcal{W}}\big( w^{(l-1)} - \eta \nabla L_n(w^{(l-1)}) + \epsilon^{(l-1)} \big),
$$

where $w^{(0)} = 0$ and $\epsilon^{(l-1)}$ with $\|\epsilon^{(l-1)}\|_2 \le \eta \epsilon$ is an error term.

*Proof Sketch.* Let the first $2N+2$ layers of $\mathrm{NN}_\theta$ be the transformer and EWML layers constructed in Lemma 2, Lemma 3, Lemma 4, and Lemma 5. We explicitly design the last two layers to implement the gradient descent step (Lemma 1). We then establish upper bounds on the error $\|\overline{\nabla_w L_n}(w) - \nabla_w L_n(w)\|_2$, where $\overline{\nabla_w L_n}(w)$, derived from the outputs of $\mathrm{NN}_\theta$, approximates $\nabla_w L_n(w)$. Next, for any $\epsilon > 0$, we select appropriate parameters $\epsilon_l$, $\epsilon_r$, and $\epsilon_{r'}$ to ensure that $\|\overline{\nabla_w L_n}(w^{(l-1)}) - \nabla_w L_n(w^{(l-1)})\|_2 \le \epsilon$ holds for any $l \in [L]$. Please see Appendix C.7 for a detailed proof.

We summarize and visualize the backpropagation process within ICGD in a transformer model with $2N+4$ layers in Figure 1. As a direct result, the neural network $\mathrm{NN}_\theta$ constructed above is able to approximate the true gradient descent trajectory $\{w^l_{\mathrm{GD}}\}_{l \ge 0}$, defined by $w^0_{\mathrm{GD}} = 0$ and $w^{l+1}_{\mathrm{GD}} = w^l_{\mathrm{GD}} - \eta \nabla_w L_n(w^l_{\mathrm{GD}})$ for any $l \ge 0$. Consequently, Theorem 1 motivates us to investigate the error accumulation under the setting $w^{(l)} = \mathrm{Proj}_{\mathcal{W}}(w^{(l-1)} - \eta \nabla L_n(w^{(l-1)}) + \epsilon^{(l-1)})$, where $w^{(0)} = 0$ and $\epsilon^{(l-1)}$ with $\|\epsilon^{(l-1)}\|_2 \le \eta \epsilon$ represents error terms. Moreover, Corollary 1.1 shows that the $\mathrm{NN}_\theta$ constructed in Theorem 1 implements $L$ steps of ICGD with exponential error accumulation relative to the true GD path.

**Corollary 1.1** (Error for Implementing ICGD on $N$-Layer Neural Networks). Fix $L \ge 1$. Under the same setting as Theorem 1, the $(2N+4)L$-layer neural network $\mathrm{NN}_\theta$ approximates the true gradient descent trajectory $\{w^l_{\mathrm{GD}}\}_{l \ge 0} \subset \mathbb{R}^{D_N}$ with error accumulation

$$
\|w^{(l)} - w^l_{\mathrm{GD}}\|_2 \le L_f^{-1} (1 + \eta L_f)^l \epsilon,
$$

where $L_f$ denotes the Lipschitz constant of $\nabla L_n(w)$ within $\mathcal{W}$.

*Proof.* Please see Appendix C.8 for a detailed proof.
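To illustrate the update rule guaranteed by Theorem 1 and the error accumulation bounded in Corollary 1.1, the following sketch runs the idealized inexact projected gradient descent $w^{(l)} = \mathrm{Proj}_{\mathcal{W}}(w^{(l-1)} - \eta \nabla L_n(w^{(l-1)}) + \epsilon^{(l-1)})$ on a toy quadratic loss; the injected noise of norm at most $\eta\epsilon$ stands in for the transformer's per-step approximation error. All function names and the toy loss are our hypothetical choices for illustration.

```python
import numpy as np

def icgd_reference(grad_fn, project, w0, eta, L, eps, rng):
    """Reference trajectory of the update simulated in Theorem 1:
    w^(l) = Proj_W( w^(l-1) - eta * grad L_n(w^(l-1)) + eps^(l-1) ),
    with ||eps^(l-1)||_2 <= eta * eps modeling the per-step ICGD error."""
    w = w0.copy()
    for _ in range(L):
        noise = rng.normal(size=w.shape)
        noise *= eta * eps / max(np.linalg.norm(noise), 1e-12)
        w = project(w - eta * grad_fn(w) + noise)
    return w

# Toy quadratic loss L_n(w) = ||w||^2 / 2 with projection onto an
# l2-ball of radius B, mirroring the bounded domain W in Problem 2.
grad_fn = lambda w: w
project = lambda w, B=5.0: w * min(1.0, B / np.linalg.norm(w))
w_hat = icgd_reference(grad_fn, project, np.ones(8), 0.1, 50, 1e-3,
                       np.random.default_rng(3))
w_gd  = icgd_reference(grad_fn, project, np.ones(8), 0.1, 50, 0.0,
                       np.random.default_rng(3))
print(np.linalg.norm(w_hat - w_gd))  # deviation from the exact GD path
```

Running both trajectories with and without noise shows the deviation staying small over many steps, consistent with the geometric growth factor $(1 + \eta L_f)^l$ in Corollary 1.1.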
## 4 In-Context Deep Learning with Softmax Transformers

In this section, we extend our analysis from ReLU-transformers to the more practical Softmax-transformers for ICGD of $N$-layer neural networks (Appendix E). Specifically, we establish the existence of Softmax-transformers capable of performing ICGD for $N$-layer neural networks in Theorem 2 and give more details in Appendix E.

[Figure 2 (plots omitted): panels (a) ReLU-Transformer and (b) Softmax-Transformer; each plots performance against the number of in-context examples (0–75) for a 6-layer NN under test input distributions $N(-2, I)$, $0.9N(-2, I) + 0.1N(2, I)$, $0.7N(-2, I) + 0.3N(2, I)$, and $0.5N(-2, I) + 0.5N(2, I)$.]

**Figure 2:** Performance of ICL in the ReLU-Transformer and Softmax-Transformer: ICL learns a 6-layer NN and achieves R-squared values comparable to those from training with the prompt samples.

**Theorem 2** (In-Context Gradient Descent with Softmax-Transformer). Fix any $B_w, \eta, \epsilon > 0$ and $L \ge 1$. For any input sequence taken from (2.1), there exist upper bounds $B_x, B_y$ such that for any $i \in [n]$, $\|y_i\|_{\max} \le B_y$ and $\|x_i\|_{\max} \le B_x$. Suppose $\mathcal{W}$ is a closed domain such that $\|w\|_{\max} \le B_w$, and $\mathrm{Proj}_{\mathcal{W}}$ projects $w$ into the bounded domain $\mathcal{W}$. Assume $\mathrm{Proj}_{\mathcal{W}} = \mathrm{MLP}_\theta$ for some MLP layer. Define $\ell(w, x_i, y_i)$ as a loss function with $L$-Lipschitz gradient, and let $L_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, x_i, y_i)$ denote the empirical loss function. Then there exists a Softmax-transformer $\mathrm{NN}_\theta$ such that for any input sequence $H^{(0)}$ taken from (2.1), $\mathrm{NN}_\theta(H^{(0)})$ implements $L$ steps of in-context gradient descent on $L_n(w)$: for every $l \in [L]$, the $4l$-th layer outputs $h_i^{(4l)} = [x_i; y_i; w^{(l)}; 0; 1; t_i]$ for every $i \in [n+1]$, with approximate iterates $w^{(l)}$, $w^{(0)} = 0$, such that

$$
w^{(l)} = \mathrm{Proj}_{\mathcal{W}}\big( w^{(l-1)} - \eta \nabla L_n(w^{(l-1)}) + \epsilon^{(l-1)} \big),
$$

where $\epsilon^{(l-1)}$ with $\|\epsilon^{(l-1)}\|_2 \le \eta \epsilon$ is an error term.

*Proof Sketch.* By our assumption $\mathrm{Proj}_{\mathcal{W}} = \mathrm{MLP}_\theta$, we only need to find a transformer implementing the gradient descent step $w^+ := w - \eta \nabla L_n(w)$. For any input of the form (E.2), let the function $f: \mathbb{R}^{D \times n} \to \mathbb{R}^{D \times n}$ map $w$ to $w - \eta \nabla L_n(w)$ while preserving all other entries. By Lemma 7, there exists a transformer block $f_{\mathrm{Softmax}}$ capable of approximating $f$ to any desired small error. Therefore, $f_{\mathrm{Softmax}} \circ \mathrm{MLP}$ suffices for our requirements. Please see Appendix E.2 for a proof.

**Remark 4** (Note on the Use of the Universal Approximation Lemma 16). We assume the transformer parameters are independent of the inputs when invoking the universal approximation lemma (Lemma 16). This is because the lemma guarantees the existence of a transformer that can approximate any $L$-Lipschitz permutation-equivariant function over a bounded input domain. Although we could invoke the lemma to show that a transformer can represent global minimizers of the target $N$-layer neural network, such an approach only proves existence without providing an explicit construction. Our goal is to show how the transformer performs gradient descent step by step, as explicitly as possible.

## 5 Numerical Studies

In this section, we conduct experiments to verify the capability of ICL to learn feed-forward neural networks; details are given in Appendix F. We conduct experiments on 3-, 4-, and 6-layer NNs using both the ReLU- and Softmax-Transformer. The main objective is to validate that the performance of ICL matches that of training $N$-layer networks, i.e., the results in Theorem 1, Theorem 2, and Theorem 4. However, a minor limitation is that the trained transformers do not always achieve the theoretical construction.
Specifically, we sample the input of the feed-forward network $x \in \mathbb{R}^d$ from the Gaussian mixture distribution $\omega_1 N(-2, I_d) + \omega_2 N(2, I_d)$, where $\omega_1, \omega_2 \in \mathbb{R}$ and $d = 20$. We consider the network $f: \mathbb{R}^d \to \mathbb{R}$ as a 3-, 4-, or 6-layer NN and generate the true output by $y = f(x)$. For the pretraining data, we use 50 in-context examples and sample them from $N(-2, I_d)$. For the testing data, we use 75 in-context examples and sample them from four distributions: (i) $\omega_1 = 1, \omega_2 = 0$; (ii) $\omega_1 = 0.9, \omega_2 = 0.1$; (iii) $\omega_1 = 0.7, \omega_2 = 0.3$; (iv) $\omega_1 = 0.5, \omega_2 = 0.5$. We show the results for the 6-layer NN in Figure 2.

## 6 Conclusion

We provide an explicit characterization of the ICL capabilities of both the ReLU- and Softmax-transformer in approximating the gradient descent training process of an $N$-layer feedforward neural network. Our results include approximation (Theorem 1 and Theorem 2) and convergence (Corollary 1.1) guarantees. We also provide experimental validation.

**Extensions.** We further extend our analysis from $N$-layer networks with the same input and output dimensions to scenarios with arbitrary dimensions (Appendix D).

**Applications.** We apply our results to learn the score function of the diffusion model through ICL in Appendix G.

**Related Work and Limitations.** Please see the related work, a detailed comparison with (Wang et al., 2024), broader impact, and limitations in Appendix A.

## Acknowledgments

The authors would like to thank Zhijia Li, Mimi Gallagher, Sara Sanchez, Dino Feng and Andrew Chen for helpful discussions; Hude Liu, Hong-Yu Chen, Jennifer Zhang, and Teng-Yun Hsiao for collaborations on related topics; and Jiayi Wang for facilitating experimental deployments. JH also thanks the Red Maple Family for their support. The authors also thank the anonymous reviewers and program chairs for their constructive comments. JH is partially supported by the Walter P. Murphy Fellowship. HL is partially supported by NIH R01LM1372201, AbbVie, and Dolby. This research was supported in part through the computational resources and staff contributions provided for the Quest high performance computing facility at Northwestern University, which is jointly supported by the Office of the Provost, the Office for Research, and Northwestern University Information Technology. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

## Impact Statement

Given the theoretical nature of this paper, we do not expect immediate negative social impact.

## References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 35:38546–38556, 2022.

Bai, Y., Chen, F., Wang, H., Xiong, C., and Mei, S. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Advances in Neural Information Processing Systems, 36:57125–57211, 2023.

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. DeepSeek LLM: Scaling open-source language models with longtermism. CoRR, 2024.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

Chan, S. et al. Tutorial on diffusion models for imaging and vision. Foundations and Trends in Computer Graphics and Vision, 16(4):322–471, 2024.

Chen, M., Du, J., Pasunuru, R., Mihaylov, T., Iyer, S., Stoyanov, V., and Kozareva, Z. Improving in-context few-shot learning via self-supervised training. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3558–3573, 2022.

Chen, M., Huang, K., Zhao, T., and Wang, M. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning, pp. 4672–4712. PMLR, 2023.

Chen, M., Mei, S., Fan, J., and Wang, M. An overview of diffusion models: Applications, guided generation, statistical rates and optimization. arXiv preprint arXiv:2404.07771, 2024.

Dai, D., Sun, Y., Dong, L., Hao, Y., Ma, S., Sui, Z., and Wei, F. Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2022.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.

Garg, S., Tsipras, D., Liang, P. S., and Valiant, G. What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.

Gu, Y., Dong, L., Wei, F., and Huang, M. Pre-training to learn in context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4849–4870, 2023.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 30016–30030, 2022.

Hu, J. Y.-C., Wang, W.-P., Gilani, A., Li, C., Song, Z., and Liu, H. Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency. In The Thirteenth International Conference on Learning Representations, 2025a.

Hu, J. Y.-C., Wu, W., Lee, Y.-C., Huang, Y.-C., Chen, M., and Liu, H. On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality. In The Thirteenth International Conference on Learning Representations, 2025b.

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Kajitsuka, T. and Sato, I. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? In The Twelfth International Conference on Learning Representations (ICLR), 2024.
Li, S., Song, Z., Xia, Y., Yu, T., and Zhou, T. The closeness of in-context learning and weight shifting for softmax regression. Advances in Neural Information Processing Systems, 37:62584–62616, 2024.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11048–11064, 2022.

Panigrahi, A., Malladi, S., Xia, M., and Arora, S. Trainable transformer in transformer. In International Conference on Machine Learning, pp. 39448–39492. PMLR, 2024.

Panwar, M., Ahuja, K., and Goyal, N. In-context learning through the Bayesian prism. In The Twelfth International Conference on Learning Representations, 2023.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Safran, I. and Shamir, O. Depth-width tradeoffs in approximating natural functions with neural networks. In International Conference on Machine Learning, pp. 2979–2987. PMLR, 2017.

Shi, W., Min, S., Lomeli, M., Zhou, C., Li, M., Lin, X. V., Smith, N. A., Zettlemoyer, L., Yih, W.-t., and Lewis, M. In-context pretraining: Language modeling beyond document boundaries. In The Twelfth International Conference on Learning Representations, 2023a.

Shi, Z., Wei, J., Xu, Z., and Liang, Y. Why larger language models do in-context learning differently? In Forty-first International Conference on Machine Learning, 2023b.

Shin, S., Lee, S. W., Ahn, H., Kim, S., Kim, H. S., Kim, B., Cho, K., Lee, G., Park, W., Ha, J. W., et al. On the effect of pretraining corpora on in-context learning by a large-scale language model. In 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, pp. 5168–5186, 2022.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pp. 574–584. PMLR, 2020a.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020b.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151–35174. PMLR, 2023.

Wang, Z., Jiang, B., and Li, S. In-context learning on function classes unveiled for transformers. In Forty-first International Conference on Machine Learning, 2024.

Wies, N., Levine, Y., and Shashua, A. The learnability of in-context learning. Advances in Neural Information Processing Systems, 36, 2024.
Xie, S. M., Raghunathan, A., Liang, P., and Ma, T. An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080, 2021.

Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.

Yoo, K. M., Kim, J., Kim, H. J., Cho, H., Jo, H., Lee, S.-W., Lee, S.-G., and Kim, T. Ground-truth labels matter: A deeper look into input-label demonstrations. In 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, pp. 2422–2437, 2022.

Zhang, R., Frei, S., and Bartlett, P. L. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

## Supplementary Material

- A Related Work, Broader Impact, Further Discussion and Limitations
  - A.1 Related Work
  - A.2 Further Discussion
  - A.3 Limitations
- B Supplementary Theoretical Backgrounds
  - B.1 Transformers
  - B.2 ReLU Provably Approximates Smooth k-Variable Functions
- C Proofs of Main Text
  - C.1 Proof of Lemma 1
  - C.2 Proof of Lemma 2
  - C.3 Proof of Lemma 3
  - C.4 Proof of Lemma 4
  - C.5 Proof of Lemma 5
  - C.6 Proof of Lemma 6
  - C.7 Proof of Theorem 1
  - C.8 Proof of Corollary 1.1
- D Extension: Different Input and Output Dimensions
- E Extension: Softmax Transformer
  - E.1 Auxiliary Lemma: Universal Approximation of Softmax Transformer
  - E.2 In-Context Gradient Descent with Softmax Transformer
  - E.3 Proof of Theorem 2
  - E.4 Proof of Lemma 16
- F Experimental Details
  - F.1 Experiments for Objectives 1 and 2
    - F.1.1 Performance of ReLU Transformer
    - F.1.2 Performance of Softmax Transformer
  - F.2 Experiments for Objective 3
  - F.3 Experiments for Objective 4
- G Application: ICL for Diffusion Score Approximation
  - G.1 Score Matching Generative Diffusion Models
  - G.2 ICL for Score Approximation
## A Related Work, Broader Impact, Further Discussion and Limitations

In this section, we present the related work, broader impact, and limitations.

### A.1 Related Work

**In-Context Learning.** Large language models (LLMs) demonstrate the in-context learning (ICL) ability (Brown et al., 2020): an ability to flexibly adjust their predictions based on additional data given in context. In recent years, a number of studies have investigated enhancing ICL capabilities (Chen et al., 2022; Gu et al., 2023; Shi et al., 2023a), exploring influencing factors (Shin et al., 2022; Yoo et al., 2022), and interpreting ICL theoretically (Xie et al., 2021; Wies et al., 2024; Panwar et al., 2023; Li et al., 2024; Bai et al., 2023; Dai et al., 2022). The works most relevant to ours are as follows. Von Oswald et al. (2023) show that linear attention-only transformers with manually set parameters closely resemble models trained via gradient descent. Bai et al. (2023) provide a more efficient construction for in-context gradient descent and establish quantitative error bounds for simulating multi-step gradient descent. However, these results focus on simple ICL algorithms or specific tasks like least squares, ridge regression, and gradient descent on two-layer neural networks. These algorithms are inadequate for practical applications. For example: (i) approximating the diffusion score function requires neural networks with multiple layers (Chen et al., 2023); (ii) approximating the indicator function requires at least 3-layer networks (Safran & Shamir, 2017). Therefore, the explicit construction of transformers implementing in-context gradient descent (ICGD) on deep models is necessary to better align with real-world in-context settings. Our work achieves this by analyzing gradient descent on $N$-layer neural networks through the use of ICL. We provide a more efficient construction for in-context gradient descent. Furthermore, we extend our analysis to the Softmax-transformer in Appendix E to better align with real-world uses.

**In-Context Gradient Descent on Deep Models (Wang et al., 2024; Panigrahi et al., 2024).** A work similar to ours is (Wang et al., 2024). It constructs a family of transformers with flexible activation functions to implement multiple steps of ICGD on deep neural networks. This work emphasizes the generality of activation functions and demonstrates the theoretical feasibility of such constructions. Our work adopts a different approach by enhancing the efficiency of transformers and better aligning with practical applications. Here we highlight the key differences:

- **More Structured and Efficient Transformer Architecture.** While Wang et al. (2024) use an $O(N^2 L)$-layer transformer to approximate $L$ gradient descent steps on $N$-layer neural networks, our approach achieves a more efficient simulation of ICGD. We approximate specific terms in the gradient expression to reduce computational costs, requiring only a $(2N+4)L$-layer transformer for $L$ gradient descent steps. Our method focuses on selecting and approximating the most impactful intermediate terms in the explicit gradient descent expression (Lemmas 3 to 5), optimizing layer complexity to $O(NL)$.
- **Less Restrictive Input and Output Dimensions for $N$-Layer Neural Networks.** Wang et al. (2024) simplify the output of $N$-layer networks to a scalar. Our work expands on this by considering cases where the output dimension exceeds one, as detailed in Appendix D. This includes scenarios where input and output dimensions differ.

- **More Practical Transformer Model.** Wang et al. (2024) discuss activation functions in the attention layer that meet a general decay condition (Wang et al., 2024, Definition 2.3) without considering the Softmax activation function. We extend our analysis to include Softmax-transformers. Our analysis reflects more realistic applications, as detailed in Appendix E.

- **More Advanced and Complicated Applications.** Wang et al. (2024) discuss applications to functions including indicators, linear functions, and smooth functions. We explore more advanced and complicated scenarios, i.e., the score function in diffusion models, discussed in Appendix G. The score function (Chen et al., 2023) falls outside the smooth function class. This enhancement broadens the applicability of our results.

Another work similar to ours is (Panigrahi et al., 2024). It proposes a new efficient construction, Transformer in Transformer (TINT), to allow a transformer to simulate and finetune more complex models (e.g., one transformer). The main distinction between ours and (Panigrahi et al., 2024) lies in the different aims: our approach focuses on using a standard transformer for the simulator (with a minor modification, the element-wise multiplication layer), and we provide a theoretical understanding of how a standard transformer can learn the ICGD of an $N$-layer network using ICL. In contrast, (Panigrahi et al., 2024) aims to build even stronger transformers by introducing several structural modifications that enable running gradient descent on auxiliary transformers. While it demonstrates in-context gradient descent for a more advanced model, i.e., one transformer, our work offers the following potential advantages:

- **Explicit Transformer Construction.** We provide an explicit construction of the transformer, whereas (Panigrahi et al., 2024) does not detail the explicit construction of model parameters within their transformer.

- **Exact Gradient Descent.** We compute the exact and explicit gradient descent for an $N$-layer network (Lemma 1). Building on this, we employ the transformer's ICL to perform gradient descent on all parameters. In contrast, (Panigrahi et al., 2024) stops the gradient computation through attention scores in the self-attention layer and only updates the value parameters in the self-attention module. Additionally, it uses Taylor expansion to approximate the gradient.

- **Rigorous Error and Convergence Guarantees.** We provide rigorous gradient descent approximation errors (for multiple steps) and convergence guarantees for ICGD on an $N$-layer network (Corollary 1.1 and Lemma 14). In contrast, (Panigrahi et al., 2024) only presents the gradient approximation error for each specific part of the parameters in a single step.
In contrast, (Panigrahi et al., 2024) only presents the gradient approximation error for each specific part of the parameters in a single step.

Attention Layer Better Aligned with Practice. Our analysis is based on ReLU-attention (Theorem 1) or Softmax-attention (Theorem 5), whereas (Panigrahi et al., 2024) utilizes linear attention. Our choice of attention layer better aligns with practical applications.

A.2 Further Discussion

We provide an interpretation and an example of how to explicitly instantiate the constants for the ReLU approximations in Lemmas 2 to 4. The key reason this is possible is that the functions approximated by sums of ReLUs are simple in our context, e.g., the Sigmoid activation function. For such simple functions, it is straightforward to derive an explicit construction. Here, we take the Sigmoid activation function as an example and propose one explicit construction method. Let $r(z)$ denote the Sigmoid function.

Segment the Input Domain. For example, divide the domain $[-10, 10]$ into smaller intervals such as $[-10, -9], [-9, -8], \ldots, [9, 10]$.

Approximate Each Segment Locally Using a Linear Function via Linear Interpolation. For instance, on the interval $[9, 10]$, approximate $r(z)$ by a linear function $a_1 z + c_1$, where $a_1$ and $c_1$ are calculated as $a_1 = (r(10) - r(9))/(10 - 9)$ and $c_1 = r(9) - 9a_1$.

Approximate the Linear Function $a_1 z + c_1$ ($z \in [9, 10]$) Using a Sum of ReLU Terms. This step involves two straightforward substeps: (i) approximate the indicator function for $z \in [9, 10]$ using a sum of ReLU terms; (ii) approximate the constant $c_1$ using a sum of ReLU terms. The latter is needed because bias terms are not included in the sum of ReLUs in Definition 4, so the bias term $c_1$ must be approximated by an additional sum of ReLU terms.

Combine the Sum-of-ReLU Approximators Across All Segments. Finally, integrate the approximations for all segments to construct the complete approximation.

Estimation of the Parameters in Definition 4. This construction yields $\epsilon_{\mathrm{approx}} = 0.625$, $R = 10$, $H = 80$, and $C = 25$. Furthermore, to achieve higher precision in the approximation, it suffices to use finer segmentations.
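To make this recipe concrete, the following is a minimal NumPy sketch of the segment-and-interpolate idea, assuming knots at the integers of $[-10, 10]$. It exploits the fact that $\sigma(\langle a, [z; 1]\rangle) = \mathrm{ReLU}(a[1]\,z + a[2])$, so interval shifts and the constant term can be encoded through the second coordinate; as a result it uses fewer ReLU terms than the indicator-based construction above ($H = 21$ rather than $H = 80$). The knot placement and the error check are illustrative choices, not the exact construction used in the proofs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Knots at the integers of [-10, 10]; the piecewise-linear interpolant of the
# Sigmoid through these knots is written as a sum of ReLU terms
# c_m * relu(a_m1 * z + a_m2), i.e., the form sigma(<a_m, [z; 1]>) of Definition 4.
knots = np.arange(-10, 11, dtype=float)
vals = sigmoid(knots)
slopes = np.diff(vals)                    # interval width is 1

coeffs = [(vals[0], 0.0, 1.0)]            # constant term: vals[0] * relu(0*z + 1)
prev = 0.0
for k, s in zip(knots[:-1], slopes):
    coeffs.append((s - prev, 1.0, -k))    # slope change at knot k: relu(z - k)
    prev = s

def relu_sum(z):
    z = np.asarray(z, dtype=float)
    return sum(c * np.maximum(a1 * z + a2, 0.0) for c, a1, a2 in coeffs)

zs = np.linspace(-10, 10, 2001)
err = np.abs(sigmoid(zs) - relu_sum(zs)).max()
print(f"H = {len(coeffs)} ReLU terms, max error on [-10, 10]: {err:.4f}")
```

The measured error is far below the loose estimate $\epsilon_{\mathrm{approx}} = 0.625$ quoted above, illustrating that finer segmentations (or interpolation in place of indicators) tighten the approximation.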
A.3 Limitations

Our work has the following limitations:

Although we provide a theoretical guarantee for the ICL of the Softmax-transformer to approximate gradient descent on $N$-layer NNs, characterizing the weight-matrix construction in the Softmax-transformer remains challenging. This motivates us to rethink transformer universality and to explore more precise proof techniques for ICL in the Softmax-transformer, which we leave for future work.

The hidden dimension and MLP dimension of the transformer in Theorem 1 are both $\widetilde{O}(NK^2) + D_w$, which is very large. The reason is that if we use ICL to perform ICGD on the $N$-layer network, the transformer must carry the $N$-layer network parameters, so it is reasonable for the input dimension to be this large. However, it may be possible to reduce the hidden and MLP dimensions of the transformer through a smarter construction. We leave this for future work.

The generalization capabilities are limited compared with traditional transformers. In our setting, the pretraining task refers to using in-context examples generated by an $N$-layer network for a given $N$. Specifically, during pretraining, the distribution of the $N$-layer network parameters is predetermined (e.g., $N(0, I)$), and the input data distribution of the $N$-layer network used to generate the in-context examples is also predetermined (e.g., $N(-2, I)$). The generalization capabilities include the following two aspects: (i) varying the input data distribution of the $N$-layer network used to generate the in-context examples, e.g., changing the input data distribution from $N(-2, I)$ to $0.9N(-2, I) + 0.1N(2, I)$ during testing in Appendix F.1; (ii) varying the distribution of the $N$-layer network parameters, e.g., changing the distribution from $N(0, I)$ to $N(0.5, I)$ in Appendix F.2. These points lead to differences between the distributions of in-context examples during pretraining and testing. However, we must generate the in-context examples with an $N$-layer network with the same hyperparameters, including the network width and depth. We leave the theoretical analysis of broader generalization capabilities for future work.

In theory, the FLOPs (Hoffmann et al., 2022) required to perform one forward pass of the transformer exceed those required for direct training of an $N$-layer network. (i) For the forward pass of the transformer, the FLOPs for in-context learning (ICL) are $O(nLN^3K^5/\epsilon^2)$, where $\epsilon$ is the approximation error in the sum of ReLUs. (ii) For direct training of the $N$-layer network, the FLOPs without ICL are $O(nLNK^2)$. Therefore, the FLOPs required for ICL exceed those needed for direct training of the $N$-layer network. However, experimental results in Appendix F demonstrate that the transformer with ICL can match the performance of a trained 6-layer network using fewer FLOPs in practice (3.3 billion vs. 7.6 billion FLOPs). This finding encourages further exploration of more efficient architectures. We also leave this topic for future research.

The empirically trained transformer differs from the transformer constructed in our theoretical analysis. Our experiments confirm the existence of a transformer capable of simulating gradient descent (GD) steps for $N$-layer neural networks through in-context learning (ICL). Despite this discrepancy, the limitation does not affect the primary contribution: establishing the theoretical existence of this transformer by explicit construction. There are two minor differences between the transformer used in the theoretical analysis and a standard transformer: (i) the transformer used in the theoretical analysis incorporates an element-wise multiplication layer, a specialized variant of self-attention that retains only the diagonal score and allows efficient implementation; (ii) it does not alternate self-attention and MLP layers. We emphasize that it still qualifies as a standard transformer because, due to the residual connections, either an attention layer or an MLP layer alone is equivalent to an attention-plus-MLP layer.

B Supplementary Theoretical Backgrounds

Here we present some ideas we build on.

B.1 Transformers

We introduce the key components for constructing a transformer for ICGD: ReLU-attention, MLP, and element-wise multiplication layers. We begin with the ReLU-attention layer.

Definition 7 (ReLU-Attention Layer). For any input sequence $H \in \mathbb{R}^{D \times n}$, an $M$-head ReLU-attention layer with parameters $\theta = \{Q_m, K_m, V_m\}_{m \in [M]}$ outputs
\[
\mathrm{Attn}_\theta(H) := H + \frac{1}{n}\sum_{m=1}^{M} (V_m H)\,\sigma\big((K_m H)^\top (Q_m H)\big),
\]
where $Q_m, K_m, V_m \in \mathbb{R}^{D \times D}$ and $\sigma(\cdot)$ is the element-wise ReLU activation function. In vector form, for each token $h_i \in \mathbb{R}^D$ in $H$, it outputs
\[
[\mathrm{Attn}_\theta(H)]_i = h_i + \frac{1}{n}\sum_{m=1}^{M}\sum_{s=1}^{n} \sigma(\langle Q_m h_i, K_m h_s\rangle)\, V_m h_s.
\]
Notably, Definition 7 uses the normalized ReLU activation $\sigma/n$ instead of the standard Softmax. We adopt this for technical convenience, following (Bai et al., 2023).
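As a sanity check on shapes and the $1/n$ normalization, the following is a minimal NumPy sketch of the ReLU-attention layer above, together with the MLP layer defined next (Definition 8); the sizes are illustrative, not taken from the paper's construction.

```python
import numpy as np

def relu_attention(H, Qs, Ks, Vs):
    """M-head ReLU-attention layer of Definition 7.

    H: (D, n) input sequence, one token per column.
    Qs, Ks, Vs: lists of M matrices, each (D, D).
    Token i receives h_i + (1/n) sum_{m,s} relu(<Q_m h_i, K_m h_s>) V_m h_s.
    """
    n = H.shape[1]
    out = H.copy()
    for Q, K, V in zip(Qs, Ks, Vs):
        scores = (K @ H).T @ (Q @ H)          # entry (s, i) = <Q h_i, K h_s>
        out = out + (V @ H) @ np.maximum(scores, 0.0) / n
    return out

def mlp_layer(H, W1, W2):
    """MLP layer of the upcoming Definition 8: H + W2 relu(W1 H)."""
    return H + W2 @ np.maximum(W1 @ H, 0.0)

rng = np.random.default_rng(0)
D, n, M, d_hidden = 16, 8, 2, 32
H = rng.normal(size=(D, n))
Qs, Ks, Vs = ([rng.normal(size=(D, D)) for _ in range(M)] for _ in range(3))
W1, W2 = rng.normal(size=(d_hidden, D)), rng.normal(size=(D, d_hidden))

# One transformer layer in the sense of Definition 9: Attn followed by MLP.
print(mlp_layer(relu_attention(H, Qs, Ks, Vs), W1, W2).shape)   # (16, 8)
```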
Next we define the MLP layer.

Definition 8 (MLP Layer). For any input sequence $H \in \mathbb{R}^{D \times n}$, a $d'$-hidden-dimension MLP layer with parameters $\theta = (W_1, W_2)$ outputs
\[
\mathrm{MLP}_\theta(H) := H + W_2\,\sigma(W_1 H),
\]
where $W_1 \in \mathbb{R}^{d' \times D}$, $W_2 \in \mathbb{R}^{D \times d'}$, and $\sigma(\cdot): \mathbb{R} \to \mathbb{R}$ is the element-wise ReLU activation function. In vector form, for each token $h_i \in \mathbb{R}^D$ in $H$, it outputs $[\mathrm{MLP}_\theta(H)]_i := h_i + W_2\,\sigma(W_1 h_i)$.

Then, we consider a transformer architecture with $L \ge 1$ transformer layers, each consisting of a self-attention layer followed by an MLP layer.

Definition 9 (Transformer). For any input sequence $H \in \mathbb{R}^{D \times n}$, an $L$-layer transformer with parameters $\theta = \{\theta_{\mathrm{Attn}}, \theta_{\mathrm{MLP}}\}$ outputs
\[
\mathrm{TF}^L_\theta(H) := \mathrm{MLP}_{\theta^{(L)}_{\mathrm{mlp}}} \circ \mathrm{Attn}_{\theta^{(L)}_{\mathrm{attn}}} \circ \cdots \circ \mathrm{MLP}_{\theta^{(1)}_{\mathrm{mlp}}} \circ \mathrm{Attn}_{\theta^{(1)}_{\mathrm{attn}}}(H),
\]
where $\theta$ consists of the attention layers $\theta_{\mathrm{Attn}} = \{(Q^l_m, K^l_m, V^l_m)\}_{l \in [L], m \in [M^l]}$ and the MLP layers $\theta_{\mathrm{MLP}} = \{(W^l_1, W^l_2)\}_{l \in [L]}$. Above, for any $l \in [L]$ and $m \in [M^l]$, $Q^l_m, K^l_m, V^l_m \in \mathbb{R}^{D \times D}$ and $(W^l_1, W^l_2) \in \mathbb{R}^{d' \times D} \times \mathbb{R}^{D \times d'}$. In this section, we regard a ReLU-attention layer or an MLP layer alone as a special kind of 1-layer transformer; this is for technical convenience. For later use in the proofs, we define the norm of an $L$-layer transformer $\mathrm{TF}_\theta$ as
\[
B_\theta := \max_{l \in [L]} \Big\{ \max_{m \in [M^l]} \big\{ \|Q^l_m\|_1, \|K^l_m\|_1 \big\} + \sum_{m=1}^{M^l} \|V^l_m\|_1 + \|W^l_1\|_1 + \|W^l_2\|_1 \Big\}. \tag{B.1}
\]
The choice of operator norm and max/sum operations is for convenience in the later proofs only, as our results depend only on $B_\theta$.

B.2 ReLU Provably Approximates Smooth k-Variable Functions

The following lemma expresses that smoothness enables approximability by a sum of ReLUs.

Lemma 7 (Approximating Smooth k-Variable Functions, modified from Proposition A.1 of (Bai et al., 2023)). Fix any $\epsilon, C_l > 0$ and $R \ge 1$. Suppose the function $g: \mathbb{R}^k \to \mathbb{R}$ is such that, for $s := \lceil (k-1)/2 \rceil + 1$, $g$ is a $C^s$ function on $B^k_\infty(R)$, and for all $i \in \{0, 1, \ldots, s\}$, $\sup_{z \in B^k_\infty(R)} \|\nabla^i g(z)\|_\infty \le L_i$ with $\max_{0 \le i \le s} L_i R^i \le C_l$. Then $g$ is $(\epsilon, R, H, C)$-approximable by a sum of ReLUs (Definition 4) with $H \le C(k)\, C_l^2 \log(1 + C_l/\epsilon)/\epsilon^2$ and $C \le C(k)\, C_l$, where $C(k)$ is a constant that depends only on $k$.

C Proofs of Main Text

C.1 Proof of Lemma 1

Lemma 8 (Lemma 1 Restated: Decomposition of One Gradient Descent Step). Fix any $B_v, \eta > 0$. Suppose the loss function $L_n(w)$ on $n$ data points $\{(x_i, y_i)\}_{i \in [n]}$ follows (2.2). Suppose the closed domain $\mathcal{W}$ and projection function $\mathrm{Proj}_{\mathcal{W}}(w)$ follow (3.4). Let $A_i(j)$, $r'_i(j)$, $R_i(j)$, $V_j$ be as defined in Definition 2. Then the explicit form of the gradient $\nabla L_n(w)$ becomes
\[
\nabla L_n(w) = \frac{1}{n}\sum_{i=1}^{n} \begin{bmatrix} A_i(1) \\ \vdots \\ A_i(N) \end{bmatrix},
\]
where $A_i(j)$ denotes the derivative of $\ell(p_i(N), y_i)$ with respect to the parameters in the $j$-th layer,
\[
A_i(j) =
\begin{cases}
\Big( R_i(N-1)\, V_N \cdots R_i(j-1)\big[ I_{K \times K} \otimes p_i(j-1)^\top \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big), & j \ne N, \\[4pt]
\Big( R_i(N-1)\big[ I_{d \times d} \otimes p_i(N-1)^\top \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big), & j = N.
\end{cases}
\]
Proof of Lemma 1. We start by calculating $\nabla_w L_n(w)$. By the chain rule and (2.2),
\[
\underbrace{\nabla_w L_n(w)}_{\mathbb{R}^{D_N \times 1}} = \frac{1}{n}\sum_{i=1}^{n} \underbrace{\Big( \frac{\partial p_i(N)}{\partial w} \Big)^\top}_{\mathbb{R}^{D_N \times d}} \underbrace{\big[ \nabla_{p_i(N)} \ell(p_i(N), y_i) \big]}_{\mathbb{R}^{d \times 1}}. \tag{C.1}
\]
Thus we only need to calculate $\frac{\partial p_i(N)}{\partial w}$. For a vector $x$ and a function $r: \mathbb{R} \to \mathbb{R}$, we use $r(x)$ to denote the vector whose $i$-th coordinate is $r(x_i)$. Let $R_i(j)$, $V_j$ follow Definition 2. Then it holds that
\[
\underbrace{\frac{\partial p_i(N)}{\partial w}}_{\mathbb{R}^{d \times D_N}} = \frac{\partial\, r(V_N\, p_i(N-1))}{\partial w} \quad \text{(by Definition 1)} \quad = \mathrm{diag}\{ r'(v_{N1}^\top p_i(N-1)), \ldots, r'(v_{Nd}^\top p_i(N-1)) \}\, \frac{\partial (V_N\, p_i(N-1))}{\partial w} = R_i(N-1)\, \frac{\partial (V_N\, p_i(N-1))}{\partial w}, \quad \text{by Definition 2.}
\]
Notice that for any $k \in [d]$, $v_{Nk}$ is a part of $w$; thus
\[
\frac{\partial v_{Nk}}{\partial w} = \big[ \underbrace{0}_{K \times (D_{N-1} + (k-1)K)} \;\; I_{K \times K} \;\; \underbrace{0}_{K \times (D_N - D_{N-1} - kK)} \big] \in \mathbb{R}^{K \times D_N}. \tag{C.2}
\]
Therefore, letting $\otimes$ denote the Kronecker product, it holds that
\[
\frac{\partial (V_N\, p_i(N-1))}{\partial w} = \begin{bmatrix} v_{N1}^\top \frac{\partial p_i(N-1)}{\partial w} + p_i(N-1)^\top \frac{\partial v_{N1}}{\partial w} \\ \vdots \\ v_{Nd}^\top \frac{\partial p_i(N-1)}{\partial w} + p_i(N-1)^\top \frac{\partial v_{Nd}}{\partial w} \end{bmatrix} \quad \text{(by the chain rule and product rule)}
\]
\[
= V_N\, \frac{\partial p_i(N-1)}{\partial w} + \big[ 0_{D_{N-1}};\; I_{d \times d} \otimes p_i(N-1)^\top \big], \tag{C.3}
\]
where the last step follows from the definition of $V_N$ (i.e., Definition 2) and (C.2). Substituting (C.3) into (C.1), we obtain
\[
\frac{\partial p_i(N)}{\partial w} = R_i(N-1)\Big( V_N\, \frac{\partial p_i(N-1)}{\partial w} + \big[ 0_{D_{N-1}};\; I_{d \times d} \otimes p_i(N-1)^\top \big] \Big).
\]
Similarly, for any $j \in [N]$, we prove
\[
\frac{\partial p_i(j)}{\partial w} = R_i(j-1)\Big( V_j\, \frac{\partial p_i(j-1)}{\partial w} + \big[ 0_{D_{j-1}};\; I_{K \times K} \otimes p_i(j-1)^\top;\; 0_{D_N - D_j} \big] \Big). \tag{C.4}
\]
By the recursion formula (C.4), for any $j \in [N-1]$, we calculate $A_i(j)$ as follows:
\[
A_i(j) = \Big( \frac{\partial p_i(N)}{\partial w} \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big)\,[D_{j-1} : D_j] \quad \text{(by Definition 2)}
\]
\[
= \Big( \Big( \frac{\partial p_i(N)}{\partial w} \Big)[\,\cdot\,, D_{j-1} : D_j] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big) \quad \text{(by the transpose property)}
\]
\[
= \Big( R_i(N-1)\, V_N \cdots R_i(j-1)\big[ I_{K \times K} \otimes p_i(j-1)^\top \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big),
\]
where $M[\,\cdot\,, a : b]$ denotes the sub-matrix of $M$ that includes all columns but only the rows from the $a$-th to the $b$-th. Similarly, for $j = N$, it holds that
\[
A_i(N) = \Big( R_i(N-1)\big[ I_{d \times d} \otimes p_i(N-1)^\top \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big).
\]
This completes the proof.

C.2 Proof of Lemma 2

Lemma 9 (Lemma 2 Restated: Approximate $p_i(j)$). Let the upper bounds $B_v, B_x > 0$ be such that for any $k \in [K]$, $j \in [N]$, and $i \in [n]$, $\|v_{jk}\|_2 \le B_v$ and $\|x_i\|_2 \le B_x$. For any $j \in [N]$ and $i \in [n]$, define $B^j_r := \max_{|t| \le B_v B^{j-1}_r} |r(t)|$, $B^0_r := B_x$, and $B_r := \max_j B^j_r$. Let the function $r(t)$ be $(\epsilon_r, R_1, M_1, C_1)$-approximable for $R_1 = \max\{B_v B_r, 1\}$ and $M_1 \le \widetilde{O}(C_1^2 \epsilon_r^{-2})$, where $C_1$ depends only on $R_1$ and the $C^2$-smoothness of $r$. Then, for any $\epsilon_r > 0$, there exist $N$ attention layers $\mathrm{Attn}_{\theta_1}, \ldots, \mathrm{Attn}_{\theta_N}$ such that, for any input $h_i \in \mathbb{R}^D$ taken from (2.1), they map
\[
h_i = [x_i; y_i; w; \bar{p}_i(1); \ldots; \bar{p}_i(j-1); 0; 1; t_i] \;\xrightarrow{\mathrm{Attn}_{\theta_j}}\; \tilde{h}_i = [x_i; y_i; w; \bar{p}_i(1); \ldots; \bar{p}_i(j); 0; 1; t_i],
\]
where $\bar{p}_i(j)$ is an approximation of $p_i(j)$ (Definition 1). In the expressions for $h_i$ and $\tilde{h}_i$, the dimensions of the $0$ vectors differ: the $0$ in $h_i$ is larger than in $\tilde{h}_i$, and the dimensional difference equals the dimension of $\bar{p}_i(j)$. Suppose the function $r$ is $L_r$-smooth on the bounded domain $\mathcal{W}$. Then, for any $i \in [n+1]$ and $j \in [N]$, $\bar{p}_i(j)$ satisfies $\bar{p}_i(j) = p_i(j) + \epsilon(i, j)$ with
\[
\|\epsilon(i, j)\|_2 \le
\begin{cases}
\big( \sum_{l=0}^{j-1} K^{l/2} L_r^l B_v^l \big) \sqrt{K}\,\epsilon_r, & 1 \le j \le N-1, \\
\big( \sum_{l=0}^{N-1} K^{l/2} L_r^l B_v^l \big) \sqrt{d}\,\epsilon_r, & j = N.
\end{cases}
\]
Additionally, for any $j \in [N]$, the norm of the parameters $B_{\theta_j}$ defined in (B.1) satisfies $B_{\theta_j} \le 1 + KC_1$.

Proof of Lemma 2. First we need to give an approximation of the activation function $r(t)$. By our assumption and Definition 4, $r(t)$ is $(\epsilon_r, R_1, M_1, C_1)$-approximable by a sum of ReLUs; that is, there exists
\[
\bar{r}(t) = \sum_{m=1}^{M_1} c^1_m\, \sigma(\langle a^1_m, [t; 1] \rangle) \quad \text{with} \quad \sum_m |c^1_m| \le C_1, \;\; \|a^1_m\|_1 \le 1, \;\; m \in [M_1], \tag{C.5}
\]
such that $\sup_{t \in [-R_1, R_1]} |\bar{r}(t) - r(t)| \le \epsilon_r$. Let $\bar{p}_i(0) := p_i(0) = x_i$. In analogy with $p_i(j)$ in Definition 1, we pick $\bar{p}_i(j)$ such that for any $j \in [N]$,
\[
\bar{p}_i(j)[k] := \bar{r}(v_{jk}^\top \bar{p}_i(j-1)). \tag{C.6}
\]
Fix any $j \in [N]$, and suppose the input sequence is $h_i = [x_i; y_i; w; \bar{p}_i(1); \ldots; \bar{p}_i(j-1); 0; 1; t_i]$. Then, for every $m \in [M_1]$ and $k \in [K]$ (or $k \in [d]$ if $j = N$), we define matrices $Q^j_{m,k}, K^j_{m,k}, V^j_{m,k} \in \mathbb{R}^{D \times D}$ such that for all $i \in [n+1]$,
\[
Q^j_{m,k} h_i = \begin{bmatrix} a^1_m[1]\, \bar{p}_i(j-1) \\ a^1_m[2] \\ 0 \end{bmatrix}, \quad K^j_{m,k} h_i = \begin{bmatrix} v_{jk} \\ 1 \\ 0 \end{bmatrix}, \quad V^j_{m,k} h_i = c^1_m\, e^1_{j,k}, \tag{C.7}
\]
where $e^1_{j,k}$ denotes the position unit vector of the element $\bar{p}_i(j)[k]$; this position depends only on $j$ and $k$. Since the input is $h_i = [x_i; y_i; w; \bar{p}_i(1); \ldots; \bar{p}_i(j-1); 0; 1; t_i]$, those matrices indeed exist.
In fact, it is simple to check that 0 a1 m[1]IK(j) 0 0 0 0 0 0 a1 m[2] 0 0 0 0 0 0 0 IK(j, k) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 c1 m(j, k) 0 0 0 0 0 are suffice to (C.7). IK(j), IK(j, k), c1 m(j, k) represents their positions are related to variables in parentheses. In Addition, by (B.1), notice that they have operator norm bounds max j,m,k Qj m,k 1 1, max j,m,k Kj m,k 1 1, max j k,m V j m,k 1 KC1. Consequently, for any j [N], Bθj 1 + C1. By our construction follows (C.7), a simple calculation shows that X m [M1],k [K] σ( Qj m,khi, Kj m,khs )V j m,khs m=1 c1 mσ( a1 m, [v jkpi(j 1); 1] )e1 j,k By our construction (C.7) k=1 (r(v jkpi(j 1)))e1 j,k By definition of r follows (C.5) = [0; pi(j); 0]. By definition of pi(j) follows (C.6) In-Context Deep Learning via Transformer Models Therefore, by definition of Re LU Attention layer follows Definition 7, the output ehi becomes ehi = [Attnθj(hi)] = hi + 1 n + 1 m [M1],k [K] σ( Qj m,khi, Kj m,khs )V j m,khs = hi + 1 n + 1 s=1 (n + 1)[0; pi(j); 0] = [xi; yi; w; pi(1); . . . ; pi(j 1); 0; 1; ti] + [0, pi(j), 0] = [xi; yi; w; pi(1); . . . ; pi(j 1); pi(j); 0; 1; ti]. Therefore, let the attention layer θj = {(Qj m,k, Kj m,k, V j m,k)}(k,m), we construct Attnθj such that hi = [xi; yi; w; pi(1); . . . ; pi(j 1); 0; 1; ti] Attnθj ehi = [xi; yi; w; pi(1); . . . ; pi(j); 0; 1; ti]. In addition, by setting R1 = max{Bv Br, 1} , the lemma then follows directly by induction on j. For the base case j = 1, it holds |pi(1)[k] pi(1)[k]| = ri(v 1kxi)[k] r(v 1kxi) By Definition 1 By definition of r follows (C.5) Suppose the claim holds for iterate j 1 and function r is Lr-smooth in bounded domain W. Then for iterate j, |pi(j)[k] pi(j)[k]| pi(j)[k] r(v jkpi(j 1)) + r(v jkpi(j 1)) pi(j)[k] By triangle inequality ϵr + Lr v jk 2 pi(j 1) pi(j 1) 2 By (C.5) and Cauchy Schwarz inequality l=0 Kl/2Ll r Bl v) By inductive hypothesis l=0 Kl/2Ll r Bl v, Thus, it holds pi(j) pi(j) 2 = k=1 |pi(j)[k] pi(j)[k]|2 l=0 Kl/2Ll r Bl v). This finishes the induction. Then for the output layer j = N, it holds pi(N) pi(N) 2 = k=1 |pi(N)[k] pi(N)[k]|2 l=0 Kl/2Ll r Bl v). Thus we complete the proof. In-Context Deep Learning via Transformer Models C.3 Proof of Lemma 3 Lemma 10 (Lemma 3 Restated: Approximate r i(j)). Let upper bounds Bv, Bx > 0 such that for any k [K], j [N] and i [n], vjk 2 Bv, and xi 2 Bx. For any j [N], i [n], define B j r := max |t| Bv Bj 1 r |r (t)|, B0 r := Bx, and Br := max j Bj r . Suppose function r (t) is (ϵr , R2, M2, C2)-approximable for R2 = max{Bv Br , 1}, M2 e O(C2 2ϵ 2 r ), where C2 depends only on R2 and the C2-smoothness of r . Then, for any ϵr > 0, there exist an attention layer AttnθN+1 such that for any input hi RD takes from (3.13), it maps hi = [xi; yi; w; pi; 0; 1; ti] AttnθN+1 ehi = [xi; yi; w; pi; r i; 0; 1; ti], where r i(j) is approximation for r i(j) (Definition 2) and r i := [r i(0); . . . ; r i(N 1)] R(N 2)K+d. Similar to Lemma 2, in the expressions of hi and ehi, the dimension of 0 differs. In addition, let Er be defined in (3.12), for any i [n + 1], j [N], k [K], r i(j) such that r i(j 1)[k] = r i(j 1)[k] + ϵ(i, j, k), |ϵ(i, j, k)| ϵr + Lr Bv Erϵr, where ϵr denotes the error generated in approximating r by sum of Re LUs r follows (C.5). Additionally, the norm of parameters BθN+1 defined as (B.1) such that BθN+1 1 + K(N 1)C2. Proof of Lemma 3. By Definition 2, recall that for any j [N], i [n + 1], k [K], r i(j)[k] = r (v j+1kpi(j)). (C.9) Therefore we need to give a approximation for r . 
By our assumption and Definition 4, r (t) is (ϵr , R2, M2, C2)- approximable by sum of relus. In other words, there exists: m=1 c2 mσ( a2 m, [t; 1] ) with c2 m C2, a2 m 2 1, m [M2], (C.10) such that supt [ R2,R2] |r (t) r (t)| ϵr . Similar to (C.9), we pick r i(j) such that r i(j)[k] := r (v j+1kpi(j)). (C.11) To ensure (C.11), we construct our attention layer as follows: for every j [N], m [M2], k [K], we define matrices QN+1 j,m,k, KN+1 j,m,k, V N+1 j,m,k RD D such that QN+1 j,m,khi = a2 m[1] pi(j 1) a2 m[2] 0 , KN+1 j,m,khi = , V N+1 j,m,khi = c2 me2 j,k , (C.12) for all i [n + 1] and e2 j,k denotes the position unit vector of element r i(j)[k]. Since input hi = [xi; yi; w; pi; 0; 1; ti], similar to (C.8), those matrices indeed exist. In addition, they have operator norm bounds max j,m,k QN+1 j,m,k 1 1, max j,m,k KN+1 j,m,k 1 1, X j,m,k V N+1 j,m,k 1 K(N 1)C2. Consequently, by definition of parameter norm follows (B.1), BθN+1 1 + K(N 1)C2. In-Context Deep Learning via Transformer Models A simple calculation shows that X j [N],m [M2],k [K] σ( QN+1 j,m,khi, KN+1 j,m,khs )V N+1 j,m,khs m=1 c2 mσ( a2 m, [v jkpi(j 1); 1] )e2 j,k By our construction follows (C.12) k=1 (r (v jkpi(j 1)))e2 j,k By definition of r follows (C.5) = [0; r i(0); . . . ; r i(N 1); 0] By definition of r i(j) follows (C.11) = [0; r i; 0], By definition of r i Therefore, by definition of Re LU Attention layer follows Definition 7, the output ehi becomes ehi = [AttnθN (hi)] = hi + 1 n + 1 j [N 1],m [M2],k [K] σ( QN j,m,khi, KN j,m,khs )V N j,m,khs = hi + 1 n + 1 s=1 (n + 1)[0; r i; 0] = [xi; yi; w; pi; 0; 1; ti] + [0; r i; 0] = [xi; yi; w; pi; r i; 0; 1; ti]. Next, we calculate the error accumulation in this approximation layer. By our assumption, R2 = max{Bv Br , 1}. Thus, for any j [N], k [K], i [n + 1], it holds v jkpi(j 1) R2. As our assumption, we suppose function r is Lr-smooth in bounded domain W. Combining above, the upper bound of error accumulation |r i(j)[k] r i(j)[k]| becomes |r i(j)[k] r i(j)[k]| r i(j)[k] r (v jkpi(j 1)) + r (v jkpi(j 1)) r i(j)[k] By triangle inequality ϵr + Lr v jk 2 pi(j 1) pi(j 1) 2 By (C.10) and Cauchy Schwarz inequality ϵr + Lr Bv Erϵr. By definition of Er follows (3.12) Thus we complete the proof. C.4 Proof of Lemma 4 Lemma 11 (Lemma 4 Restated: Approximate 1ℓ(pi(N), yi)). Let upper bounds Bv, Bx, > 0 such that for any k [K], j [N] and i [n], vjk 2 Bv, and xi 2 Bx. For any k [d], suppose function u(t, y)[k] be (ϵl, R3, M k 3 , Ck 3 )- approximable for R3 = max{Bv Br, By, 1}, M3 e O((Ck 3 )2ϵ 2 l ), where Ck 3 depends only on Rk 3 and the C3-smoothness of u(t, y)[k]. Then, there exists an MLP layer MLPθN+2 such that for any input sequences hi RD takes from (3.15), it maps hi = [xi; yi; w; pi; r i; 0; 1; ti] MLPθN+2 ehi = [xi; yi; w; pi; r i; gi; 0; 1; ti], where gi Rd is an approximation for u(pi(N), yi). For any k [d], assume u(pi(N), yi) is Ll Lipschitz continuous. In-Context Deep Learning via Transformer Models Then the approximation gi such that, gi[k] = u(pi(N), yi)[k] + ϵ(i, k), with |ϵ(i, k)| ϵl + Ll Erϵr. Additionally, the parameters θN+2 such that BθN+2 max{R3 + 1, C3}. Proof of Lemma 4. By our assumption and Definition 4, for any k [d], function u[k](t, y) is (ϵl, R3, M k 3 , Ck 3 )- approximable by sum of relus, there exists : m=1 c3,k m σ( a3,k m , [t; y; 1] ) with c3,k m C3, a3,k m 2 1, m [M k 3 ], (C.13) such that sup(t,y) [ R3,R3]2 |gk(t, y) u[k](t, y)| ϵl. Then we construct our MLP layer. 
Let M3 := Pd k=1 M k 3 , we pick matrices W N+1 1 RM3 D, W N+1 2 RD M3 such that for any i [n + 1], m [M3], W N+1 1 hi = a3,1 1 [1] pi(N) + a3,1 1 [2] yi + a3,1 1 [3] R3(1 ti) ... a3,1 M 1 3 [1] pi(N) + a3,1 M 1[2] yi + a3,1 M 1 3 [3] R3(1 ti) ... a3,d 1 [1] pi(N) + a3,d 1 [2] yi + a3,d 1 [3] R3(1 ti) ... a3,d M d 3 [1] pi(N) + a3,d M d[2] yi + a3,d M d 3 [3] R3(1 ti) W N+1 2 [j, m] = c3,k m 1{j = Dk g, M k 1 3 < m M k 3 }, (C.14) where Dk g denotes the position of element gi[k]. Since input hi = [xi; yi; w; pi; r i; 0; 1; ti], similar to (C.8), those matrices indeed exist. Furthermore, by (B.1), they have operator norm bounds W N+1 1 1 R3 + 1, W N+1 2 1 C3 Consequently, BθN+2 max{R3 + 1, C3}. By our construction (C.14), a simple calculation shows that W N+1 2 σ(W N+1 1 hi) = m=1 σ( a3,k m , [pi(N); yi; 1] R3(1 ti)) c3,k m e Dk g = 1{tj = 1} 0 g1(pi(N), yi) ... gd(pi(N), yi) 0 For k [d], we let gi[k] = 1{tj = 1} gk(pi(N), yi)e Dk g for i [n + 1]. Hence, MLPθN+2 maps hi = [xi; yi; w; pi; r i; 0; 1; ti] MLPθN+2 ehi = [xi; yi; w; pi; r i; gi; 0; 1; ti], Next, we calculate the error generated in this approximation. By setting R3 = max{Bv Br, By, 1}, for any i [n + 1], it holds pi(N) R3, yi R3 In-Context Deep Learning via Transformer Models Moreover, as our assumption, we suppose function 1ℓis Ll-smooth in bounded domain W. Therefore, by the definition of the function g, for each i [n], the error becomes |gi[k] u(pi(N), yi)[k]| |gi[k] u(pi(N), yi)[k]| + |u(pi(N), yi)[k] u(pi(N), yi)[k]| By triangle inequality ϵl + Ll pi(N) pi(N) 2 By the definition of gk follows (C.13) and Ll-smooth assumption ϵl + Ll Erϵr, . By the definition of Er follows (3.12) Combining above, we complete the proof. C.5 Proof of Lemma 5 Lemma 12 (Lemma 5 Restated: Approximate st(j)). Recall that si(j) = r i(j 1) (V j+1 si(j + 1)) follows Definition 3. Let the initial input takes from (3.17). Then, there exist N element-wise multiplication layers: EWMLθN+3, . . . , EWMLθ2N+2 such that for input sequences, j [N], hi = [xi; yi; w; pi; r i; gi; si(N); . . . ; si(j + 1); 0; 1; ti], they map EWMLθ2N+3 j(hi) = [xi; yi; w; pi; r i; gi; si(N); . . . ; si(j); 0; 1; ti], where the approximation si(j) is defined as recursive form: for any i [n + 1], j [N 1], r i(j 1) (V j+1 si(j + 1)), j [N 1] r i(N 1) gi, j = N. Additionally, for any j [N], BθN+2+j defined in (B.1) satisfies BθN+2+j 1. Proof of Lemma 5. By Lemma 2 and Lemma 3, we obtain pi(j) and r i(j), the approximation for pi(j) (3.3) and r i(j) respectively. Using pi(j) and r i(j), we construct N element-wise multiplication layers to approximate si(j). We give the construction of parameters directly. For every j [N 1], k [K], we define matrices Q2N+3 j k , K2N+3 j k , V 2N+3 j k RD D such that for all i [n + 1], Q2N+3 j k hi = vj+11[k] ... vj+1K[k] 0 , K2N+3 j k hi = si(j + 1) 0 , V 2N+3 j k hi = r i(j 1)[k] e3 j,k , (C.15) where e3 j,k denotes the position unit vector of element si(j)[k]. Since input hi = [xi; yi; w; pi; r i; gi; si(N); . . . ; si(j + 1); 0; 1; ti], similar to (C.8), those matrices indeed exist. Thus, it is straightforward to check that X k [K] γ( Q2N+3 j k hi, K2N+3 j k hi )V 2N+3 j k hi k=1 (V j+1[k, ] si(j + 1))r i(j 1)[k]e3 j,k By definition of EWML layer follows Definition 6 0 ri(j 1)[1]V j+1[1, ] si(j + 1) ... r i(j 1)[k]V j+1[K, ] si(j + 1) 0 By definition of e3 j,k In-Context Deep Learning via Transformer Models 0 r i(j 1) (V j+1 si(j + 1)) 0 By definition of hadamard product = [0; si(j); 0]. 
By definition of si(j) follows (3.20) Therefore, by the definition of EWML layer follows Definition 6, the output ehi becomes ehi = [Attnθ2N+3 j(hi)] m [2],k [K] σ( Q2N+3 j m,k hi, K2N+3 j m,k hs )V 2N+3 j m,k hs = hi + [0; s(j); 0] = [xi; yi; w; pi; r i; gi; si(N 1); . . . ; si(j + 1); 0; 1; ti] + [0; si(j); 0] = [xi; yi; w; pi; r i; gi; si(N 1); . . . ; si(j); 0; 1; ti]. Finally we come back to approximate the initial approximation si(N) = r i(N 1) gi. Notice that gi and r i(N 1) are already in the input hi = [xi; yi; w; pi; r i; gi; 0; 1; ti], thus it is simple to construct EWMLN+3 , similar to (C.15), such that it maps, [xi; yi; w; pi; r i; gi; 0; 1; ti] EWMLN+3 [xi; yi; w; 0; 1; pi; r i; gi; si(N); 0; 1; ti]. Since we don t using the sum of Re LU to approximate any variables, these step don t generate extra error. Besides, by (3.19), matrices have operator norm bounds max j,k QN+2+j k 1 1, max j,k KN+2+j k 1 1, max j,k V N+2+j k 1 1. Consequently, for any j [N], BθN+2+j 1. Thus we complete the proof. C.6 Proof of Lemma 6 Lemma 13 (Lemma 6 Restated: Error for gisi(j)). Suppose the upper bounds Bv, Bx > 0 such that for any k [K], j [N] and i [n], vjk 2 Bv, and xi 2 Bx. Let r i(j) RK such that r i(j)[k] := r (v j+1kpi(j)) follows Definition 2. Let si(j) = Ri(j 1)V j+1 . . . Ri(N 2)V N Ri(N 1)u follows Definition 3. Let r i(j), gi, si(j) be the approximations for r i(j), u(pi(N), yi), si(j) follows Lemma 3, Lemma 4 and Lemma 5 respectively. Let Br be the upper bound of r i(j)[k] and r i(j)[k] as defined in Lemma 3. Let Bl be the upper bound of gi[k] and u(pi(N), yi)[k] as defined in Lemma 4. Then for any i [n + 1], j [N], k [K], si(j)[k] Bs, |si(j)[k] si(j)[k]| Er sϵr + Er s ϵr + El sϵl, Bs := max j [N]{(P Br Bv)N j Br Bl}, Er s := max j [N]{Lr Er PBs B2 v[ l=0 (Br Bv P)l] + (Br Bv P)N j(Bl Lr Bv Er + Br Ll Er)}, Er s := max j [N]{PBs Bv[ l=0 (Br Bv P)l] + (Br Bv P)N j Bl}, El s := max j [N]{(Br Bv P)N j Br }. In-Context Deep Learning via Transformer Models Above, Bs is the upper bound of si(j)[k] and Er s, Er s , El s are the coefficients of ϵr, ϵ r, ϵl in the upper bounds of |si(j)[k] si(j)[k]|, respectively. Proof of Lemma 6. By Lemma 5, we manage to approximate si(j) by si(j). By triangle inequality, we have |si(j)[k] si(j)[k]| |r i(n 1)[k] r i(n 1)[k]| v n+1ksi(n + 1) + |r i(n 1)[k]| (v n+1ksi(n + 1)) (v n+1ksi(n + 1)) . We bound these four terms separately. By Lemma 3, |r i(n 1)[k] r i(n 1)[k]| is bounded by ϵr + Lr Bv Erϵr. We then use induction to establish upper bounds for si(j)[k] and |si(j)[k] si(j)[k]|. We first use induction to prove the first two statements. To begin with, we illustrate the recursion formula for si(j). By (3.20), recall that for any j [N], r i(j 1) (V j+1 si(j + 1)), j [N 1] r i(N 1) gi, j = N. We consider applying induction to prove the first statement: si(j)[k] (P Br Bv)N n Br Bl. As for the base case, j = N: si(N)[k] = r i(N 1)[k] gi[k] Br Bl. Therefore, if the statement holds for j = n + 1, by (3.20) and our assumption, it holds si(n)[k] = r i(n 1)[k] (v j+1ksi(n + 1)) By recursion formula (3.20) r i(n 1)[k] vn+1k 2 si(n + 1) 2 By Cauchy-schwarz inequality r i(n 1)[k] vn+1k 2 max{ d} max k |si(n + 1)[k]| (Br Bv) max{ d} Br Bv)N n 1Br Bl By inductive hypothesis = (P Br Bv)N n Br Bl. By definition of P follows Lemma 6 Thus, by the principle of induction, the first statement is true for all integers j [N]. Moreover, by the definition of Bs follows Lemma 6, we know Bs is the upper bound of si(j)[k]. 
Next we apply induction to prove the second statement: |si(j)[k] si(j)[k]| (ϵr + Lr Bv Erϵr)PBv Bs[ l=0 (Br Bv P)l] + (Br Bv P)N n[(Bl Lr Bv Er + Br Ll Er)ϵr + Blϵr + Br ϵl]. For the base case, j = N: |si(N)[k] si(N)[k]| = |r i(N 1)[k] gi[k] r i(N 1)[k] u(pi(N), yi)[k]| By definition (3.20) and (3.6) |r i(N 1)[k] r i(N 1)[k]| |gi[k]| + |r i(N 1)[k]| |gi[k] u(pi(N), yi)[k]| By triangle inequality (ϵr + Lr Bv Erϵr)Bl + Br (ϵl + Ll Erϵr). By (3.14) and (3.16) = (Bl Lr Bv Er + Br Ll Er)ϵr + Blϵr + Br ϵl Therefore, if the statement holds for j = n + 1, by (3.20) and our assumption, it holds |si(n)[k] si(n)[k]| In-Context Deep Learning via Transformer Models = r i(n 1)[k] (v n+1ksi(n + 1)) r i(n 1)[k] (v n+1ksi(n + 1)) By the recursion formula (3.6) and (3.20) |r i(n 1)[k] r i(n 1)[k]| v n+1ksi(n + 1) + |r i(n 1)[k]| (v n+1ksi(n + 1)) (v n+1ksi(n + 1)) By triangle inequality (ϵr + Lr Bv Erϵr)PBv Bs + Br Bv si(n + 1) si(n + 1) 2 By error accumulation of approximating r follows (3.14) (ϵr + Lr Bv Erϵr)PBv Bs + Br Bv P max k |si(n + 1)[k] si(n + 1)[k]| (ϵr + Lr Bv Erϵr)PBv Bs + Br Bv P n (ϵr + Lr Bv Erϵr)PBv Bs[ l=0 (Br Bv P)l] + (Br Bv P)N n 1[(Bl Lr Bv Er + Br Ll Er)ϵr + Blϵr + Br ϵl] o By inductive hypothesis (ϵr + Lr Bv Erϵr)PBv Bs[ l=0 (Br Bv P)l] + (Br Bv P)N n[(Bl Lr Bv Er + Br Ll Er)ϵr + Blϵr + Br ϵl]. Thus, by the principle of induction, the second statement is true for all integers j [N 1]. By the definition of Es follows Lemma 6, it is simple to check that |si(j)[k] si(j)[k]| Er sϵr + Er s ϵr + El sϵl. Thus we complete the proof. C.7 Proof of Theorem 1 Theorem 3 (Theorem 1 Restated: In-context gradient descent on N-layer NNs). Fix any Bv, η, ϵ > 0, L 1. For any input sequences takes from (2.1), their exist upper bounds Bx, By such that for any i [n], yi 2 By, xi 2 Bx. Assume functions r(t), r (t) and u(t, y)[k] are Lr, Lr , Ll-Lipschitz continuous. Suppose W is a closed domain such that for any j [N 1] and k [K], W w = [vjk] RDN : vjk 2 Bv , and Proj W project w into bounded domain W. Assume Proj W = MLPθ for some MLP layer with hidden dimension Dw parameters θ Cw. If functions r(t), r (t) and u(t, y)[k] are C4-smoothness, then for any ϵ > 0, there exists a transformer model NNθ with (2N + 4)L hidden layers consists of L neural network blocks TFN+2 θ EWMLN θ TF2 θ, NNθ := TFN+2 θ EWMLN θ TF2 θ . . . TFN+2 θ EWMLN θ TF2 θ, such that the heads number M l, embedding dimensions Dl, and the parameter norms Bθl suffice max l [(2N+4)L] M l e O(ϵ 2), max l [(2N+4)L] Dl O(NK2) + Dw, max l [(2N+4)L] Bθl O(η) + Cw + 1, where e O( ) hides the constants that depend on d, K, N, the radius parameters Bx, By, Bv and the smoothness of r and ℓ. And this neural network such that for any input sequences H(0), take from (2.1), NNθ(H(0)) implements L steps in-context gradient descent on risk Eqn (2.2): For every l [L], the (2N + 4)l-th layer outputs h((2N+4)l) i = [xi; yi; w(l); 0; 1; ti] for every i [n + 1], and approximation gradients w(l) such that w(l) = Proj W(w(l 1) η Ln(w(l 1)) + ϵ(l 1)), w(0) = 0, where ϵ(l 1) 2 ηϵ is an error term. Proof Sketch. Let the first 2N + 2 layers of NNθ are Transformers and EWMLs constructed in Lemma 2, Lemma 3, Lemma 4, and Lemma 5. Explicitly, we design the last two layers to implement the gradient descent step (Lemma 1). In-Context Deep Learning via Transformer Models We then establish the upper bounds for error w Ln(w) w Ln(w) 2, where w Ln(w), derived from the outputs of NNθ, approximates w Ln(w). 
Next, for any ϵ > 0, we select appropriate parameters ϵl, ϵr and ϵr to ensure that w Ln(w(l 1)) w Ln(w(l 1)) 2 ϵ holds for any l [L]. Proof of Theorem 1. We consider the first N + 2 transformer layers TFN+2 θ are layers in Lemma 2 ,Lemma 3 and Lemma 4. Then we let the middle N element-wise multiplication layers EWMLN θ be layers in Lemma 5. We only need to check approximability conditions. By Lemma 7 and our assumptions, for any ϵr, ϵr , ϵl, it holds Function r(t) is (ϵr, R1, M1, C1)-approximable for R1 = max{Bv Br, 1}, M1 e O(C2 1ϵ 2 r ), where C1 depends only on R1 and the C2-smoothness of r(t). Function r (t) is (ϵr , R2, M2, C2)-approximable for R2 = max{Bv Br , 1}, M2 e O(C2 2ϵ 2 r ), where C2 depends only on R2 and the C2-smoothness of r (t). Function 1ℓ(t, y) is (ϵl, R3, M3, C3)-approximable for R3 = max{Bv Br, 1}, M3 e O(C2 3ϵ 2 l ), where C3 depends only on R3 and the C3-smoothness of u(t, y)[k]. which suffice approximability conditions in Lemma 2, Lemma 3 and Lemma 4. Now we construct the last two layers to implement w η Ln(w) and Proj W(w). First we construct a attention layer to approximate w η Ln(w). For every m [2], j [N], k [K], we consider matrices Q2N+3 m,j,k , j2N+3 m,j,k , V 2N+3 m,j,k RD D Q2N+3 1,j,k hi = 1 0 , K2N+3 1,j,k hi = si(j)[k] 0 , V 2N+3 1,j,k hi = η(n + 1) 0 pi(j 1) 0 Q2N+3 2,j,k hi = 1 0 , K2N+3 2,j,k hi = si(j)[k] 0 , V 2N+3 2,j,k hi = η(n + 1) 0 pi(j 1) 0 Furthermore, we define approximation gradient w Ln(w) as follows, w Ln(w) := 1 η(n + 1) m [2],j [N],k [K] σ( Q2N+3 m,j,k hi, K2N+3 m,j,k ht )V 2N+3 m,j,k ht j=1 (σ(st(j)[k]) σ( st(j)[k])) 0 pt(j 1) 0 By our construction (C.16) 0 pt(j 1) 0 By f(x) = σ(x) σ( x) 0 IK K pt(j 1) st(j) 0 By definition of Kronecker product 0 At(1) ... At(N) 0 By sn+1(j) = 0 follows Lemma 5 where At(j) := IK K pt(j 1) st(j) denotes the approximation for At(j). Therefore, by the definition of Re LU attention layer follows Definition 7, for any i [n + 1], ehi = [Attnθ2N+3(hi)] In-Context Deep Learning via Transformer Models = hi + 1 n + 1 m [2],j [N],k [K] σ( Q2N+3 m,j,k hs, K2N+3 m,j,k hi )V 2N+3 m,j,k hi = [xi; yi; w; pi; r i; gi; si; 0; 1; ti] η 0 At(1) ... At(N) 0 = [xi; yi; w η w Ln(w); pi; r i; gi; 0; 1; ti]. By definition of w Ln(w) Since we do not use approximation technique like Definition 4, this step do not generate extra error. Besides,by (B.1), matrices have operator norm bounds max j,m,k Q2N+3 j,m,k 1 1, max j,m,k K2N+3 j,m,k 1 1, X j,m,k V 2N+3 j,m,k 1 2ηNK. Consequently, Bθ2N+3 1 + 2ηNK. Fix any ϵ > 0, then we pick appropriate ϵr, ϵ r, ϵl such that ϵ(l 1) 2 = η w Ln(w(l 1)) w Ln(w(l 1)) 2 ηϵ. By Definition 3 and Lemma 6, for any j [N 1], i [n], it holds Ai(j) Ai(j) 2 k=1 si(j)[k]pi(j 1) si(j)[k]pi(j 1) 2 By Definition 2 and definition of Ai(j) k=1 |si(j)[k] si(j)[k]| pi(j 1) 2 + |si(j)[k]| pi(j 1) pi(j 1) 2 By triangle inequality P[(Er sϵr + Er s ϵr + El sϵl) PBr + Bs Erϵr], By (3.12) and Lemma 6 where Bs is the upper bound of si(j)[k] and Er s, Er s , El s are the coefficients of ϵr, ϵ r, ϵl in the upper bounds of |si(j)[k] si(j)[k]| follow Lemma 6, respectively. We can drive similar results as j = N. Actually, by P = max{ d} follows Lemma 6, above inequality also holds for j = N. Therefore, the error in total such that for any w, w Ln(w) w Ln(w) 2 0 At(1) ... At(N) 0 0 At(1) ... At(N) 0 By definition of Ln(w) and Ln(w) 2 max 1 t n{ j=1 At(j) At(j) 2} 2 P[(Er sϵr + Er s ϵr + El sϵl) PBr + Bs Erϵr]. 
By the error accumulation results derived above. Let $C_l, C_r, C_{r'}$ denote the coefficients in front of $\epsilon_l, \epsilon_r, \epsilon_{r'}$, respectively. Then it holds that
\[
C_l = N P^{3/2} B_{r'} E^l_s, \quad C_r = N P^{3/2} B_{r'} E^r_s + N P B_s E_r, \quad C_{r'} = N P^{3/2} B_{r'} E^{r'}_s.
\]
Thus, to ensure $\|\nabla L_n(w) - \overline{\nabla} L_n(w)\|_2 \le \epsilon$, we only need to select $\epsilon_l, \epsilon_r, \epsilon_{r'}$ as
\[
\epsilon_l = \frac{\epsilon}{3 C_l}, \quad \epsilon_r = \frac{\epsilon}{3 C_r}, \quad \epsilon_{r'} = \frac{\epsilon}{3 C_{r'}}.
\]
Therefore, we only need to pick the last MLP layer $\mathrm{MLP}_{2N+4}$ such that it maps
\[
[x_i; y_i; w - \eta \overline{\nabla}_w L_n(w); \bar{p}_i; \bar{r}'_i; g_i; \bar{s}_i; 0; 1; t_i] \;\xrightarrow{\mathrm{MLP}_{2N+4}}\; [x_i; y_i; \mathrm{Proj}_{\mathcal{W}}(w - \eta \overline{\nabla}_w L_n(w)); 0; 1; t_i].
\]
By our assumption on the map $\mathrm{Proj}_{\mathcal{W}}$, this is easy. Finally, we analyze how many embedding dimensions of the transformer are needed to implement the above ICGD. Recall that $x_i, y_i \in \mathbb{R}^d$, $w \in \mathbb{R}^{2dK + (N-2)K^2}$, $\bar{p}_i \in \mathbb{R}^{(N-1)K + d}$, $\bar{r}'_i \in \mathbb{R}^{(N-2)K + d}$, $g_i \in \mathbb{R}^d$, and $\bar{s}_i \in \mathbb{R}^{(N-1)K + d}$. Therefore, $\max\{\Omega(NK^2), D_w\}$ embedding dimensions of the transformer are required to implement ICGD on deep models. Combining the above, we complete the proof.

Remark 5 (Modest Assumptions). Our assumptions remain modest. For example, we require that the loss function $\ell(\cdot)$, the activation function $r(\cdot)$, and its derivative $r'(\cdot)$ are $C^4$-smooth. Many settings meet these conditions, including those using the Sigmoid activation function as $r(\cdot)$ and the squared loss function.

C.8 Proof of Corollary 1.1

Corollary 3.1 (Corollary 1.1 Restated: Error for Implementing ICGD on N-Layer Neural Networks). Fix $L \ge 1$. Under the same setting as Theorem 1, the $(2N+4)L$-layer neural network $\mathrm{NN}_\theta$ approximates the true gradient descent trajectory $\{w^l_{\mathrm{GD}}\}_{l \ge 0} \subset \mathbb{R}^{D_N}$ with the error accumulation
\[
\|w^l - w^l_{\mathrm{GD}}\|_2 \le L_f^{-1} (1 + \eta L_f)^l\, \epsilon,
\]
where $L_f$ denotes the Lipschitz constant of $\nabla L_n(w)$ within $\mathcal{W}$.

First we introduce a helper lemma.

Lemma 14 (Error for Approximating GD, Lemma G.1 of (Bai et al., 2023)). Let $\mathcal{W} \subset \mathbb{R}^d$ be a convex bounded domain and let $\mathrm{Proj}_{\mathcal{W}}$ project all vectors into $\mathcal{W}$. Suppose $f: \mathcal{W} \to \mathbb{R}$ is such that $\nabla f$ is $L_f$-Lipschitz on $\mathcal{W}$. Fix any $\epsilon > 0$, and let the sequences $\{w^l\}_{l \ge 0} \subset \mathbb{R}^d$ and $\{w^l_{\mathrm{GD}}\}_{l \ge 0} \subset \mathbb{R}^d$ be given by $w^0 = w^0_{\mathrm{GD}} = 0$ and, for all $l \ge 0$,
\[
w^l = \mathrm{Proj}_{\mathcal{W}}\big( w^{l-1} - \eta \nabla f(w^{l-1}) + \epsilon^{l-1} \big), \quad \|\epsilon^{l-1}\|_2 \le \eta\epsilon, \qquad w^l_{\mathrm{GD}} = \mathrm{Proj}_{\mathcal{W}}\big( w^{l-1}_{\mathrm{GD}} - \eta \nabla f(w^{l-1}_{\mathrm{GD}}) \big).
\]
To show convergence, define the gradient mapping at $w$ with step size $\eta$ as
\[
G^f_{\mathcal{W},\eta}(w) := \frac{1}{\eta}\big( w - \mathrm{Proj}_{\mathcal{W}}(w - \eta \nabla f(w)) \big).
\]
Then, if $\eta \le 1/L_f$, for all $L \ge 1$ the convergence bound holds:
\[
\min_{l \in [L-1]} \|G^f_{\mathcal{W},\eta}(w^l)\|_2^2 \le \frac{1}{L}\sum_{l=0}^{L-1} \|G^f_{\mathcal{W},\eta}(w^l)\|_2^2 \le \frac{8\big(f(0) - \inf_{w \in \mathcal{W}} f(w)\big)}{\eta L} + O(\epsilon^2).
\]
Moreover, for any $l \ge 0$, the error accumulation is $\|w^l - w^l_{\mathrm{GD}}\|_2 \le L_f^{-1}(1 + \eta L_f)^l \epsilon$.

Lemma 14 shows that Theorem 1 leads to exponential error accumulation in the general case. Moreover, Lemma 14 also provides convergence of the approximate GD. We now prove Corollary 1.1.

Proof. For any small $\epsilon$, by Theorem 1, the neural network $\mathrm{NN}_\theta$ implements each gradient descent step with per-step error term $\epsilon^{(l-1)}$ satisfying $\|\epsilon^{(l-1)}\|_2 \le \eta\epsilon$. We then apply Lemma 14 to complete the proof.
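Before moving on to the extension in Appendix D, the following is a minimal NumPy sketch making the machinery above concrete. Assuming a Sigmoid activation and squared loss (choices consistent with Remark 5, but not fixed by the theory), it computes the gradient blocks of Lemma 1 through the backward recursion behind Lemma 5, namely $s(j) = r'(V_j p(j-1)) \odot (V_{j+1}^\top s(j+1))$ with $\partial \ell / \partial V_j = s(j)\, p(j-1)^\top$, and checks one entry against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

d, K, N = 4, 5, 3                          # input dim, width, depth (illustrative)
Vs = [rng.normal(size=(K, d))] + \
     [rng.normal(size=(K, K)) for _ in range(N - 2)] + \
     [rng.normal(size=(d, K))]             # V_1, ..., V_N as in Definition 2
x, y = rng.normal(size=d), rng.normal(size=d)

def forward(Vs, x):
    ps, zs = [x], []
    for V in Vs:
        zs.append(V @ ps[-1])              # pre-activations v_jk^T p(j-1)
        ps.append(sigmoid(zs[-1]))         # p(j) as in Definition 1
    return ps, zs

def loss(Vs, x, y):
    ps, _ = forward(Vs, x)
    return 0.5 * np.sum((ps[-1] - y) ** 2)

# Backward recursion: s(N) = r'(z_N) * dl/dp(N); s(j) = r'(z_j) * (V_{j+1}^T s(j+1)).
ps, zs = forward(Vs, x)
s = dsigmoid(zs[-1]) * (ps[-1] - y)
grads = [None] * len(Vs)
grads[-1] = np.outer(s, ps[-2])
for j in range(len(Vs) - 2, -1, -1):
    s = dsigmoid(zs[j]) * (Vs[j + 1].T @ s)
    grads[j] = np.outer(s, ps[j])          # rows are the blocks A_i(j+1)

# Finite-difference check of one coordinate of the first-layer gradient block.
eps = 1e-6
Vp = [V.copy() for V in Vs]; Vp[0][2, 1] += eps
print(grads[0][2, 1], (loss(Vp, x, y) - loss(Vs, x, y)) / eps)  # should agree
```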
D Extension: Different Input and Output Dimensions

In this section, we explore ICGD on $N$-layer neural networks under the setting where the dimensions of the input $x_i$ and the label $y_i$ can differ. Specifically, we consider prompt datasets $\{(x_i, y_i)\}_{i \in [n]}$ where $x_i \in \mathbb{R}^{d_x}$ and $y_i \in \mathbb{R}^{d_y}$. We start with the new $N$-layer neural network.

Definition 10 (N-Layer Neural Network). An $N$-layer neural network comprises $N-1$ hidden layers and one output layer, all constructed similarly. Let $r: \mathbb{R} \to \mathbb{R}$ be the activation function. For the hidden layers: for any $i \in [n+1]$, $j \in [N-1]$, and $k \in [K]$, the output of the first $j$ layers w.r.t. input $x_i \in \mathbb{R}^{d_x}$, denoted by $\mathrm{pred}_h(x_i; j) \in \mathbb{R}^K$, is defined recursively: $\mathrm{pred}_h(x_i; 1)[k] := r(v_{1k}^\top x_i)$ and $\mathrm{pred}_h(x_i; j)[k] := r(v_{jk}^\top \mathrm{pred}_h(x_i; j-1))$, where $v_{1k} \in \mathbb{R}^{d_x}$ and $v_{jk} \in \mathbb{R}^K$ for $j \in \{2, \ldots, N-1\}$ are the $k$-th parameter vectors in the first layer and the $j$-th layer, respectively. For the output layer (the $N$-th layer), the output of the entire network w.r.t. input $x_i \in \mathbb{R}^{d_x}$, denoted by $\mathrm{pred}_o(x_i; w, N) \in \mathbb{R}^{d_y}$, is defined for any $k \in [d_y]$ as $\mathrm{pred}_o(x_i; w, N)[k] := r(v_{Nk}^\top \mathrm{pred}_h(x_i; N-1))$, where $v_{Nk} \in \mathbb{R}^K$ are the $k$-th parameter vectors in the $N$-th layer and $w \in \mathbb{R}^{(d_x + d_y)K + (N-2)K^2}$ denotes the vector containing all parameters of the neural network,
\[
w := \big[ v_{11}^\top, \ldots, v_{1K}^\top, \ldots, v_{jk}^\top, \ldots, v_{N-1,1}^\top, \ldots, v_{N-1,K}^\top, v_{N1}^\top, \ldots, v_{N d_y}^\top \big]^\top.
\]

Notice that the new $N$-layer neural network modifies only the output layer compared with Definition 1. Intuitively, this results in a minimal change to the analysis, which allows our framework in Section 3.3 to function across varying input/output dimensions. Theoretically, we derive the explicit form of the gradient $\nabla L_n(w)$.

Lemma 15 (Decomposition of One Gradient Descent Step). Fix any $B_v, \eta > 0$. Suppose the empirical loss function $L_n(w)$ on $n$ data points $\{(x_i, y_i)\}_{i \in [n]}$ is defined as
\[
L_n(w) := \frac{1}{n}\sum_{i=1}^{n} \ell(f(w, x_i), y_i),
\]
where $\ell: \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \to \mathbb{R}$ is a loss function and $f(w, x_i)$ is the output of the $N$-layer neural network (Definition 10) with the modified output layer. Suppose the closed domain $\mathcal{W}$ and projection function $\mathrm{Proj}_{\mathcal{W}}(w)$ follow (3.4). Let $A_i(j)$, $r'_i(j)$, $R_i(j)$, $V_j$ be as defined in Definition 2 (with modified dimensions). Then the explicit form of the gradient $\nabla L_n(w)$ becomes
\[
\nabla L_n(w) = \frac{1}{n}\sum_{i=1}^{n} \begin{bmatrix} A_i(1) \\ \vdots \\ A_i(N) \end{bmatrix},
\]
where $A_i(j)$ denotes the derivative of $\ell(p_i(N), y_i)$ with respect to the parameters in the $j$-th layer,
\[
A_i(j) =
\begin{cases}
\Big( R_i(N-1)\, V_N \cdots R_i(j-1)\big[ I_{K \times K} \otimes p_i(j-1)^\top \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big), & j \ne N, \\[4pt]
\Big( R_i(N-1)\big[ I_{d_y \times d_y} \otimes p_i(N-1)^\top \big] \Big)^\top \Big( \frac{\partial \ell(p_i(N), y_i)}{\partial p_i(N)} \Big), & j = N.
\end{cases}
\]
Proof. Simply follow the proof of Lemma 1. We show the terms that differ from Definition 2. Let $D_j \in \mathbb{R}$ denote the total number of parameters in the first $j$ layers:
\[
D_j =
\begin{cases}
0, & j = 0, \\
d_x K, & j = 1, \\
(j-1)K^2 + d_x K, & 2 \le j \le N-1, \\
(N-2)K^2 + (d_x + d_y)K, & j = N.
\end{cases}
\]
The intermediate term $R_i(N-1)$ becomes
\[
R_i(N-1) = \mathrm{diag}\{ r'(v_{N1}^\top p_i(N-1)), \ldots, r'(v_{N d_y}^\top p_i(N-1)) \} \in \mathbb{R}^{d_y \times d_y}.
\]
The parameter matrices of the first and last layers are
\[
V_1 = [v_{11}, \ldots, v_{1K}]^\top \in \mathbb{R}^{K \times d_x}, \qquad V_N = [v_{N1}, \ldots, v_{N d_y}]^\top \in \mathbb{R}^{d_y \times K}.
\]
This completes the proof.

Lemma 15 shows that the explicit form of the gradient $\nabla L_n(w)$ has the same structure as in Lemma 1. Therefore, it is straightforward to follow our framework in Section 3.3 to approximate $\nabla L_n(w)$ term by term. Finally, we introduce the generalized version of our main result, Theorem 1.
Theorem 4 (In-Context Gradient Descent on N-Layer NNs). Fix any $B_v, \eta, \epsilon > 0$ and $L \ge 1$. For any input sequence taken from (2.1), with $\{(x_i, y_i)\}_{i \in [n]}$, $x_i \in \mathbb{R}^{d_x}$, and $y_i \in \mathbb{R}^{d_y}$, there exist upper bounds $B_x, B_y$ such that for any $i \in [n]$, $\|y_i\|_2 \le B_y$ and $\|x_i\|_2 \le B_x$. Assume the functions $r(t)$, $r'(t)$, and $u(t, y)[k]$ are $L_r$-, $L_{r'}$-, and $L_l$-Lipschitz continuous, respectively. Suppose $\mathcal{W}$ is a closed domain such that for any $j \in [N-1]$ and $k \in [K]$, $\mathcal{W} \subset \{ w = [v_{jk}] \in \mathbb{R}^{D_N} : \|v_{jk}\|_2 \le B_v \}$, and $\mathrm{Proj}_{\mathcal{W}}$ projects $w$ into the bounded domain $\mathcal{W}$. Assume $\mathrm{Proj}_{\mathcal{W}} = \mathrm{MLP}_{\theta}$ for some MLP layer with hidden dimension $D_w$ and parameters $\|\theta\| \le C_w$. If the functions $r(t)$, $r'(t)$, and $u(t, y)[k]$ are $C^4$-smooth, then for any $\epsilon > 0$ there exists a transformer model $\mathrm{NN}_\theta$ with $(2N+4)L$ hidden layers, consisting of $L$ neural network blocks $\mathrm{TF}^{N+2}_\theta \circ \mathrm{EWML}^{N}_\theta \circ \mathrm{TF}^{2}_\theta$,
\[
\mathrm{NN}_\theta := \mathrm{TF}^{N+2}_\theta \circ \mathrm{EWML}^{N}_\theta \circ \mathrm{TF}^{2}_\theta \circ \cdots \circ \mathrm{TF}^{N+2}_\theta \circ \mathrm{EWML}^{N}_\theta \circ \mathrm{TF}^{2}_\theta,
\]
such that the head number $M^l$, parameter dimensions $D^l$, and parameter norms $B_{\theta^l}$ satisfy
\[
\max_{l \in [(2N+4)L]} M^l \le \widetilde{O}(\epsilon^{-2}), \quad \max_{l \in [(2N+4)L]} D^l \le O(K^2 N) + D_w, \quad \max_{l \in [(2N+4)L]} B_{\theta^l} \le O(\eta) + C_w + 1,
\]
where $\widetilde{O}(\cdot)$ hides constants that depend on $d$, $K$, $N$, the radius parameters $B_x, B_y, B_v$, and the smoothness of $r$ and $\ell$. Moreover, for any input sequence $H^{(0)}$ taken from (2.1), $\mathrm{NN}_\theta(H^{(0)})$ implements $L$ steps of in-context gradient descent on the risk $L_n(w)$ of Lemma 15: for every $l \in [L]$, the $(2N+4)l$-th layer outputs $h^{((2N+4)l)}_i = [x_i; y_i; w^{(l)}; 0; 1; t_i]$ for every $i \in [n+1]$, with approximate gradient updates
\[
w^{(l)} = \mathrm{Proj}_{\mathcal{W}}\big( w^{(l-1)} - \eta \nabla L_n(w^{(l-1)}) + \epsilon^{(l-1)} \big), \quad w^{(0)} = 0,
\]
where $\|\epsilon^{(l-1)}\|_2 \le \eta\epsilon$ is an error term.

E Extension: Softmax Transformer

In this part, we demonstrate the existence of pretrained Softmax transformers capable of implementing ICGD on an $N$-layer neural network. First, we introduce our main technique, the universal approximation property of Softmax transformers, in Appendix E.1. Then, we prove the existence of pretrained Softmax transformers that implement ICGD on $N$-layer neural networks in Appendix E.2.

E.1 Auxiliary Lemma: Universal Approximation of Softmax Transformer

Softmax-Attention Layer. We replace the normalized ReLU activation $\sigma/n$ in the ReLU-attention layer (Definition 7) by the standard Softmax. Thus, for any input sequence $H \in \mathbb{R}^{D \times n}$, a single-head attention layer outputs
\[
\mathrm{Attn}(H) = H + W^{(O)} (V H)\, \mathrm{Softmax}\big( (KH)^\top (QH) \big), \tag{E.1}
\]
where $W^{(O)}, Q, K, V \in \mathbb{R}^{D \times D}$ are the weight matrices. Then we introduce the Softmax transformer block, which consists of two feed-forward neural network layers and a single-head self-attention layer with the Softmax function.

Definition 11 (Transformer Block $\mathcal{T}_{\mathrm{Softmax}}$). For any input sequence $H \in \mathbb{R}^{d \times n}$, let $\mathrm{FF}(H) := H + W_2\, \mathrm{ReLU}(W_1 H + b_1 \mathbf{1}_n^\top) + b_2 \mathbf{1}_n^\top$ be the feed-forward layer, where $d'$ is the hidden dimension, $W_1 \in \mathbb{R}^{d' \times d}$, $W_2 \in \mathbb{R}^{d \times d'}$, $b_1 \in \mathbb{R}^{d'}$, and $b_2 \in \mathbb{R}^{d}$. We configure a transformer block with a Softmax-attention layer as $\mathcal{T}_{\mathrm{Softmax}} := \{\mathrm{FF} \circ \mathrm{Attn} \circ \mathrm{FF} : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}\}$.

Universal Approximation of Softmax-Transformer. We show a universal approximation theorem for transformer blocks (Definition 11). Specifically, transformer blocks $\mathcal{T}_{\mathrm{Softmax}}$ are universal approximators for continuous permutation-equivariant functions on a bounded domain.

Lemma 16 (Universal Approximation of $\mathcal{T}_{\mathrm{Softmax}}$). Let $f: \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ be any $L$-Lipschitz permutation-equivariant function supported on $[0, B_x]^{d \times n}$. We denote the discrete input domain of $[0, B_x]^{d \times n}$ by a grid $\mathbb{G}_D$ with granularity $D \in \mathbb{N}$, defined as $\mathbb{G}_D = \{B_x/D, 2B_x/D, \ldots, B_x\}^{d \times n} \subset \mathbb{R}^{d \times n}$. For any $\kappa > 0$, there exists a transformer network $f_{\mathrm{Softmax}} \in \mathcal{T}_{\mathrm{Softmax}}$ such that for any $Z \in [0, B_x]^{d \times n}$, it approximates $f(Z)$ as $\|f_{\mathrm{Softmax}}(Z) - f(Z)\|_2 \le \kappa$.

Proof Sketch. First, we use a piecewise-constant function to approximate $f$ and derive an upper bound based on its $L$-Lipschitz property. Next, we demonstrate how the feed-forward layer $F^{(FF)}_1$ quantizes the continuous input domain into the discrete domain $\mathbb{G}_D$ through a multi-step function, using ReLU functions to create a piecewise-linear approximation. Then, we apply the self-attention layer $F^{(SA)}$ to $F^{(FF)}_1$, establishing a bounded output region for $F^{(SA)}_S \circ F^{(FF)}_1$. Finally, we employ a second feed-forward network $F^{(FF)}_2$ to predict $f_{\mathrm{Softmax}}(Z)$ and assess the approximation error relative to the true output $f(Z)$. See Appendix E.4 for a detailed proof.
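For concreteness, the following is a minimal NumPy sketch of the single-head Softmax-attention layer (E.1) and one block $\mathrm{FF} \circ \mathrm{Attn} \circ \mathrm{FF}$ of Definition 11; the sizes, random weights, and weight sharing between the two FF layers are illustrative choices, not part of the construction in the proofs.

```python
import numpy as np

def softmax_cols(S):
    # Column-wise softmax: one attention distribution per query token.
    S = S - S.max(axis=0, keepdims=True)      # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def softmax_attention(H, WO, Q, K, V):
    """Single-head Softmax attention (E.1): H + WO (V H) softmax((K H)^T (Q H))."""
    return H + WO @ (V @ H) @ softmax_cols((K @ H).T @ (Q @ H))

def feed_forward(H, W1, b1, W2, b2):
    """FF layer of Definition 11: H + W2 relu(W1 H + b1 1^T) + b2 1^T."""
    return H + W2 @ np.maximum(W1 @ H + b1[:, None], 0.0) + b2[:, None]

rng = np.random.default_rng(0)
d, n, dh = 8, 5, 16
H = rng.normal(size=(d, n))
mats = lambda: rng.normal(size=(d, d))
W1, W2 = rng.normal(size=(dh, d)), rng.normal(size=(d, dh))
b1, b2 = rng.normal(size=dh), rng.normal(size=d)

# One T_Softmax block: FF, then Softmax attention, then FF.
out = feed_forward(
    softmax_attention(feed_forward(H, W1, b1, W2, b2), mats(), mats(), mats(), mats()),
    W1, b1, W2, b2)
print(out.shape)  # (8, 5)
```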
E.2 In-Context Gradient Descent with Softmax Transformer

In-Context Gradient Descent with Softmax Transformer. By applying the universal approximation theory (Lemma 16), we now illustrate how to use the transformer block $\mathcal{T}_{\mathrm{Softmax}}$ (Definition 11) and MLP layers (Definition 8) to implement ICGD on a general risk function $L_n(w)$.

Theorem 5 (Theorem 2 Restated: In-Context Gradient Descent on a General Risk Function). Fix any $B_w, \eta, \epsilon > 0$ and $L \ge 1$. For any input sequence taken from (2.1), there exist upper bounds $B_x, B_y$ such that for any $i \in [n]$, $\|y_i\|_{\max} \le B_y$ and $\|x_i\|_{\max} \le B_x$. Suppose $\mathcal{W}$ is a closed domain such that $\|w\|_{\max} \le B_w$, and $\mathrm{Proj}_{\mathcal{W}}$ projects $w$ into the bounded domain $\mathcal{W}$. Assume $\mathrm{Proj}_{\mathcal{W}} = \mathrm{MLP}_\theta$ for some MLP layer. Define $\ell(w, x_i, y_i)$ as a loss function with an $L$-Lipschitz gradient, and let $L_n(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, x_i, y_i)$ denote the empirical loss function. Then there exists a Softmax-transformer $\mathrm{NN}_\theta$ such that, for any input sequence $H^{(0)}$ taken from (2.1), $\mathrm{NN}_\theta(H^{(0)})$ implements $L$ steps of in-context gradient descent on $L_n(w)$: for every $l \in [L]$, the $4l$-th layer outputs $h^{(4l)}_i = [x_i; y_i; w^{(l)}; 0; 1; t_i]$ for every $i \in [n+1]$, with approximate gradient updates
\[
w^{(l)} = \mathrm{Proj}_{\mathcal{W}}\big( w^{(l-1)} - \eta \nabla L_n(w^{(l-1)}) + \epsilon^{(l-1)} \big), \quad w^{(0)} = 0,
\]
where $\|\epsilon^{(l-1)}\|_2 \le \eta\epsilon$ is an error term.

E.3 Proof of Theorem 2

Proof of Theorem 2. We only need to construct a 4-layer transformer capable of implementing a single gradient descent step. Without loss of generality, we assume $w \in \mathbb{R}^{D_w}$. Recall that the input sequence $H \in \mathbb{R}^{D \times (n+1)}$ takes the form
\[
H = \begin{bmatrix} x_1 & x_2 & \cdots & x_n & x_{n+1} \\ y_1 & y_2 & \cdots & y_n & 0 \\ q_1 & q_2 & \cdots & q_n & q_{n+1} \end{bmatrix} \in \mathbb{R}^{D \times (n+1)}, \quad q_i := \begin{bmatrix} w \\ 0 \\ 1 \\ t_i \end{bmatrix} \in \mathbb{R}^{D - (d+1)}. \tag{E.2}
\]
Let the function $f: \mathbb{R}^{D \times (n+1)} \to \mathbb{R}^{D \times (n+1)}$ output
\[
\begin{bmatrix} x_1 & \cdots & x_{n+1} \\ y_1 & \cdots & 0 \\ q_1 & \cdots & q_{n+1} \end{bmatrix} \;\longmapsto\; \begin{bmatrix} x_1 & \cdots & x_{n+1} \\ y_1 & \cdots & 0 \\ \tilde{q}_1 & \cdots & \tilde{q}_{n+1} \end{bmatrix}, \quad \tilde{q}_i := \begin{bmatrix} w - \eta \nabla L_n(w) \\ 0 \\ 1 \\ t_i \end{bmatrix}.
\]
By Lemma 16, for any $\kappa > 0$ there exists a transformer network $f_{\mathrm{Softmax}} \in \mathcal{T}_{\mathrm{Softmax}}$ such that for any input $H \in [-B, B]^{D \times (n+1)}$, we have $\|f_{\mathrm{Softmax}}(H) - f(H)\|_2 \le \kappa$. Therefore, by the equivalence of matrix norms, $\|f_{\mathrm{Softmax}}(H) - f(H)\|_{\max} \le \kappa$ holds without loss of generality. Above, $B := \max\{B_x, B_y, B_w, 1\}$ denotes the upper bound for every element of $H$. Thus, we obtain $\bar{w}$ from the position of $w$ in $f_{\mathrm{Softmax}}(H)$. Suppose we choose $\kappa = \epsilon/\sqrt{D_w}$; then it holds that
\[
\|\bar{w} - (w - \eta \nabla L_n(w))\|_2 \le \sqrt{D_w}\, \|\bar{w} - (w - \eta \nabla L_n(w))\|_{\max} \le \sqrt{D_w}\, \|f_{\mathrm{Softmax}}(H) - f(H)\|_{\max} \le \epsilon.
\]
Finally, by our assumption, there exists an MLP layer such that for any $i \in [n+1]$ it maps
\[
[x_i; y_i; w - \eta \nabla L_n(w); 0; 1; t_i] \;\xrightarrow{\mathrm{MLP}}\; [x_i; y_i; \mathrm{Proj}_{\mathcal{W}}(w - \eta \nabla L_n(w)); 0; 1; t_i].
\]
Therefore, a four-layer transformer $\mathrm{MLP} \circ f_{\mathrm{Softmax}}$ (i.e., $f_{\mathrm{Softmax}}$ followed by the projection MLP) is capable of implementing one step of gradient descent through ICL. As a direct corollary, there exists a $4L$-layer transformer consisting of $L$ identical blocks $\mathrm{MLP} \circ f_{\mathrm{Softmax}}$ that approximates $L$ steps of the gradient descent algorithm; each block approximates one gradient descent step on the general risk function $L_n(w)$.

E.4 Proof of Lemma 16

In this section, we introduce a helper lemma (Lemma 17) to prove Lemma 16. To begin, we assume all input sequences are separated by a certain distance.

Definition 12 (Token-Wise Separateness, Definition 1 of (Kajitsuka & Sato, 2024)). Let $N \ge 1$ and let $Z^{(1)}, \ldots, Z^{(N)} \in \mathbb{R}^{d \times n}$ be input sequences. Then $Z^{(1)}, \ldots, Z^{(N)}$ are called token-wise $(r_{\min}, r_{\max}, \delta)$-separated if the following three conditions hold: (i) for any $i \in [N]$ and $k \in [n]$, $\|Z^{(i)}_{:,k}\|_2 > r_{\min}$; (ii) for any $i \in [N]$ and $k \in [n]$, $\|Z^{(i)}_{:,k}\|_2 < r_{\max}$; (iii) for any $i, j \in [N]$ and $k, l \in [n]$ with $Z^{(i)}_{:,k} \ne Z^{(j)}_{:,l}$, $\|Z^{(i)}_{:,k} - Z^{(j)}_{:,l}\|_2 > \delta$. Note that we refer to $Z^{(1)}, \ldots, Z^{(N)}$ as token-wise $(r_{\max}, \epsilon)$-separated instead if the sequences satisfy only the last two conditions. Then we introduce the definition of a contextual mapping.
Intuitively, a contextual mapping can provide every input sequence with a unique id, which enables us to construct approximation for labels. In-Context Deep Learning via Transformer Models Definition 13 (Contextual mapping, Definition 2 of (Kajitsuka & Sato, 2024)). Let input sequences Z(1), . . . , Z(N) Rd n. Then, a map q : Rd n Rd n is called an (r, δ)-contextual mapping if the following two conditions hold: For any i [N] and k [n], q Z(i) 2 < r holds. For any i, j [N] and k, l [n], if Z(i) :,k = Z(j) :,l , then q Z(i) 2 > δ holds. In particular, q Z(i) for i [N] is called a context id of Z(i). Next, we show that a softmax-based 1-layer attention block with low-rank weight matrices is a contextual mapping for almost all input sequences. Lemma 17 (Softmax attention is contextual mapping, Theorem 2 of (Kajitsuka & Sato, 2024)). Let Z(1), . . . , Z(N) Rd n be input sequences with no duplicate word token in each sequence, that is, Z(i) :,k = Z(i) :,l , for any i [N] and k, l [n]. Also assume that Z(1), . . . , Z(N) are token-wise (rmin, rmax, ϵ) separated. Then, there exist weight matrices W (O) Rd s and V, K, Q Rs d such that the ranks of V, K and Q are all 1, and 1-layer single head attention with softmax, i.e., F(SA) S with h = 1 is an (r, δ)-contextual mapping for the input sequences Z(1), . . . , Z(N) Rd n with r and δ defined by r = rmax + ϵ δ = 2(log n)2ϵ2rmin r2max(|V| + 1)4(2 log n + 3)πd exp (|V| + 1)4 (2 log n + 3)πdr2 max 4ϵrmin Applying Lemma 17, we extends Proposition 1 of (Kajitsuka & Sato, 2024) to our Lemma 161. We provide explicit upper bound of error f Softmax(Z) f(Z) 2 and analysis with function f of a broader supported domain. Lemma 18 (Lemma 16 Restated: Universal Approximation of TSoftmax). Let f( ) := Rd n Rd n be any L-Lipschitz permutation equivariant function supported on [0, Bx]d n. We denote the discrete input domain of [0, Bx]d n by a grid GD with granularity D N defined as GD = {Bx/D, 2Bx/D, . . . , Bx}d n Rd n. For any κ > 0, there exists a transformer network f Softmax TSoftmax (Definition 11), such that for any Z [0, Bx]d n, it approximate f(Z) as: f Softmax(Z) f(Z) 2 κ. Proof. We begin our 3-step proof. Approximation of f by piece-wise constant function. Since f is a continuous function on a compact set, f has maximum and minimum values on the domain. By scaling with F(F F ) 1 and F(F F ) 2 , f is assumed to be normalized: for any Z Rd n \ [0, Bx]d n and for any Z [0, Bx]d n By f(Z) By. Let D N be the granularity of a grid GD: D , . . . , Bx}d n Rd n, 1This extension builds on the results of (Hu et al., 2025a), which extend the rank-1 requirement to any rank for attention weights. Additionally, Hu et al. (2025b) apply similar techniques to analyze the statistical rates of diffusion transformers (Di Ts). In-Context Deep Learning via Transformer Models where each coordinate only take discrete value Bx/D, 2Bx/D, ..., Bx. Now with a continuous input Z, we approximate f by using a piece-wise constant function f evaluating on the nearest grid point L of Z in the following way: L GD f (L) 1Z L+[ Bx/D,0)d n. (E.3) Additionally if Z L + [ 1/D, 0)d n, denote it as Q(Z) = L. Now we bound the piece-wise constant approximation error f f as follows. Define set PD = {L + [ Bx/D, 0)d n|L GD}. It is a set of regions of size ( Bx D )d n, whose vertexes are the points in GD. 
For any subset U PD, the maximal difference of f and f in this region is: max Z U f(Z) f(Z) 2 = max Z U f(Z) f(Q(Z)) 2 max Z,Z U f(Z) f(Z ) 2 L max Z,Z U Z Z 2 By f is a L-Lipschitz function D )2 Z, Z are in the same Bx D -wide (d n)-dimension U. dn Bx D . (E.4) Quantization of input using F(F F ) 1 . In the second step, we use F(F F ) 1 to quantize the continuous input domain into GD. This process is achieved by a multiple-step function, and we use Re LU functions to approximate this multiple-step functions. This Re LU function can be easily implemented by a one-layer feed-forward network. First for any small δ > 0 and z R, we construct a δ-approximated step function using Re LU functions: 0 z < 0 z δD 0 z < δBx Bx D δBx z , (E.5) where a one-hidden-layer feed-forward neural network is able to implement this. By shifting (E.5) by Bx, for any t [D 1], we have: D z δD t Bx D z < δBx + t Bx D δBx + t Bx D z , (E.6) when δ is small the above function approximates to a step function: quant(t) D (z) = By adding up (E.6) at every t [D 1], we have an approximated multiple-step function In-Context Deep Learning via Transformer Models t=0 quant(t) D (z) when δ is small. = quant D(z) D ... ... Bx Bx Bx Note that the error of approximation at z here estimated as: D quant D(z) and for matrix Z Rd n: δD i σR h Z δ Bx E Q(Z) D quant D(Z) 2 D )2 Z Rd n Subtract the last step function from (E.7) we get the desired result: This equation approximate the quantization of input domain [0, Bx] into {Bx/D, . . . , Bx} and making R \ [0, Bx] to 0. In addition to the quantization of input domain [0, Bx], we add a penalty term for input out of [0, Bx] in the following way: penalty(z) = Bx, z 0 0, 0 < z Bx Bx, Bx < z. . Both (E.10) and (E.11) can be realized by the one-layer feed-forward neural network. Also, it is straightforward to show that generate both of them to input Z Rd n. Combining both components together, the fırst feed-forward neural network layer F(F F ) 1 approximates the following function F (F F ) 1 (Z): F(F F ) 1 F (F F ) 1 (Z) = quantd n D (Z) + k=1 penalty(Zt,k). (E.12) Note how we generalize penalty( ) to multi-dimensional occasions in the above equation. Whenever an input sequence Z has one entry Zt,k out of [0, Bx]d n, we penalize the whole input sequence by adding a Bx to all entries. This makes all In-Context Deep Learning via Transformer Models entries of this quantization lower bounded by dn Bx. (E.12) quantizes inputs in [0, Bx]d n with granularity D, while every element of the output is non-positive for inputs outside [0, Bx]d n. In particular, the norm of the output is upper-bounded when every entry in Z is out of [0, Bx], this adds dn Bx penalties to all entries: F(F F ) 1 (Z):,k 2 = p d ( dn Bx)2 One column is d dimension. d Bx, (E.13) for any k [n]. Estimating the Influence of Self-Attention F(SA). Define e GD GD as: e GD = {L GD | k, l [n], L:,k = L:,l} . (E.14) It is a set of all the input sequences that don t have have identical tokens after quantization. Within this set, the elements are at least Bx D separated by the quantization. Thus Lemma 17 allows us to construct a self-attention F(SA) to be a contextual mapping for such input sequences. Since when D is sufficiently large, originally different tokens will still be different after quantization. In this context, we omit GD/e GD for simplicity. 
From the proof of Lemma 17 in (Kajitsuka & Sato, 2024), we follow their way to construct self-attention and have following equation: F(SA) S (Z):,k Z:,k 2 < 1 4 d D max k [n] Z:,k 2, (E.15) for any k [n] and Z Rd n. Combining this upper-bound with (E.13) we have F(SA) S F(F F ) 1 (Z):,k F(F F ) (Z):,k 2 < 1 4 d D max k [n] F(F F )(Z:,k) 2 4D . (E.16) We show that if we take large enough D, every element of the output for Z Rd n\[0, Bx]d n is upper-bounded by F(SA) S F(F F ) 1 (Z)t,k < Bx 4D ( t [d], k [n]). (E.17) To show (E.17) holds, we consider the opposite occasion that there exists a F(SA) S F(F F ) 1 (Z)t0,k0 Bx/4D. Then we divide the case into two sub cases: 1. The whole F(F F ) 1 (Z) receives no less than 2 penalties. In this occasion, since every entry consists of two counterparts in (E.12): the quantization part quantd n D (Z) [0, Bx] and aggregated with a penalty part Pd t=1 Pn k=1 penalty(Zt,k) 2Bx, for every entry we have F(F F ) (Z)t,k Bx. In-Context Deep Learning via Transformer Models This yields that: F(SA) S F(F F ) 1 (Z):,k0 F(F F ) (Z):,k0 2 F(SA) S F(F F ) 1 (Z)t0,k0 F(F F ) (Z)t0,k0 2 for a large enough D thus we derive a contradiction towards (E.16) from the assumption, proving it to be incorrect. 2. The whole F(F F ) 1 (Z) receives only one penalty. In this case all entries in Z is penalized by Bx and satisfies: F(F F ) 1 (Z)t,k [ Bx, 0]d n. (E.18) By (E.15), this further denotes: F(SA) S F(F F ) 1 (Z):,k F(F F ) 1 (Z):,k 2 < 1 4 d D max k [n] F(F F ) 1 (Z):,k 2 Yet by our assumption, there exists such an entry F(SA) S F(F F ) (Z)t0,k0 Bx/4D, which since F(F F ) 1 (Z)t0,k0 0, yields: F(SA) S F(F F ) 1 (Z):,k0 F(F F ) 1 (Z):,k0 2 F(SA) S F(F F ) 1 (Z)t0,k0 F(F F ) 1 (Z)t0,k0 The final conclusion contradict the former result, suggesting the prerequisite to be fallacious. Joining the incorrectness of the two sub-cases of the opposite occasion, we confirm the upper bound when input Z is outside [0, Bx]d n in (E.17). For the input Z inside [0, Bx]d n, we now show it is lower-bounded by F(SA) S F(F F ) 1 (Z)t,k > 3Bx 4D ( t [d], k [n]). (E.20) By our construction, every entry Z in [0, Bx]d n satisfies: F(F F ) 1 (Z)t,k [Bx D , Bx]. (E.21) By (E.15): F(SA) S F(F F ) 1 (Z):,k F(F F ) 1 (Z):,k 2 d D max k [n] F(F F ) 1 (Z):k 2 In-Context Deep Learning via Transformer Models d-dimensional vector with each entry has maximum value Bx. This yields: |F(SA) S F(F F ) 1 (Z)t,k F(F F ) 1 (Z)t,k | F(SA) S F(F F ) 1 (Z):,k F(F F ) 1 (Z):,k 2 Finally, we have: F(SA) S F(F F ) 1 (Z)t,k > F(F F ) 1 (Z)t,k F(SA) S F(F F ) 1 (Z)t,k F(F F ) 1 (Z)t,k 2 D F(SA) S F(F F ) 1 (Z)t,k F(F F ) 1 (Z)t,k 2 Hence we finally finish the proof for the upper bound of F(SA) S F(F F ) 1 (Z)t,k for Z outside [0, Bx] in (E.17) and lower bound for Z inside [0, Bx] in (E.20). Approximation Error. Now, we can conclude our work by constructing the final feed-forward network F(F F ) 2 . It receives the output of the self-attention layer and maps the ones in e GD (3Bx/4D, )d n to the corresponding value of the target function, and the rest in ( , Bx/4D)d n to 0. In order to adapt to the L2 norm, we use a continuous and Lipschitz function to map the input Z to its targeted corresponding output f(Q(Z)). According to piece-wise linear approximation, function F(F F ) 2 exists such that for any input L GD, it maps it to corresponding f(L), and for an arbitrary input Z, its output suffices: F(F F ) 2 (Z) [ min L Z max Bx 2D f(L), max L Z max Bx 2D f(L)]. 
Next, we estimate the difference between $\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}^{(\mathrm{FF})}_1$ and $\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}'^{(\mathrm{FF})}_1$. The difference is caused by the difference between $\mathcal{F}'^{(\mathrm{FF})}_1$ and $\mathcal{F}^{(\mathrm{FF})}_1$, which by (E.9) is bounded by $B_x/D$ in every dimension; hence, for any input $Z\in\mathbb{R}^{d\times n}$:

$$\|\mathcal{F}'^{(\mathrm{FF})}_1(Z)-\mathcal{F}^{(\mathrm{FF})}_1(Z)\|_2<\sqrt{dn}\,\frac{B_x}{D}.$$

By the triangle inequality,

$$\big\|\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}'^{(\mathrm{FF})}_1(Z)-\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}^{(\mathrm{FF})}_1(Z)\big\|_2\le\big\|\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}'^{(\mathrm{FF})}_1(Z)-\mathcal{F}'^{(\mathrm{FF})}_1(Z)\big\|_2+\big\|\mathcal{F}'^{(\mathrm{FF})}_1(Z)-\mathcal{F}^{(\mathrm{FF})}_1(Z)\big\|_2+\big\|\mathcal{F}^{(\mathrm{FF})}_1(Z)-\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}^{(\mathrm{FF})}_1(Z)\big\|_2,$$

where the first and third terms are controlled via $\|A\|_2\le\|A\|_F$ and (E.19).

In the section on quantization of the input, we used the piecewise-linear functions (E.7) to approximate the piecewise-constant functions (E.8); this creates a deviation for inputs on the boundaries of the constant regions. Consider $Z$ to be one of these inputs whose value deviates from $\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}'^{(\mathrm{FF})}_1(Q(Z))$, and let $f(L_1)$ denote the value assigned to $\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}^{(\mathrm{FF})}_1(Z)$. Because the deviation takes the output to a grid point at most $\sqrt{dn}\,B_x/D+\sqrt{n}\,B_x/2D$ away from its original grid point, under the quantization of the output, $f(L_1)$ deviates from the original output $\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}'^{(\mathrm{FF})}_1(Z)$ by at most this distance aggregated with twice the maximal distance within a grid cell. By the Lipschitzness of $f$, these sum up to:

$$\big\|\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}^{(\mathrm{FF})}_1-\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}'^{(\mathrm{FF})}_1\big\|_2\le L\,\frac{2\sqrt{dn}\,B_x+\sqrt{n}\,B_x}{D}.$$

Lastly, by assumption, we neglect the $\mathcal{G}_D\setminus\widetilde{\mathcal{G}}_D$ part. This yields $\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}'^{(\mathrm{FF})}_1=\overline{f}$. Thus, adding up the errors via the triangle inequality together with (E.4) yields:

$$\big\|f-\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}^{(\mathrm{FF})}_1\big\|_2\le\|f-\overline{f}\|_2+\big\|\overline{f}-\mathcal{F}^{(\mathrm{FF})}_2\circ\mathcal{F}^{(\mathrm{SA})}_S\circ\mathcal{F}^{(\mathrm{FF})}_1\big\|_2\le L\,\frac{3\sqrt{dn}\,B_x+\sqrt{n}\,B_x}{D}.$$

For any $\kappa>0$, we can select a large enough $D$ such that the right-hand side is at most $\kappa$. This completes the proof.

F Experimental Details

In this section, we conduct experiments to verify the capability of ICL to learn deep feed-forward neural networks. We conduct the experiments on 3-layer, 4-layer, and 6-layer NNs, using both the ReLU-Transformer and the Softmax-Transformer based on the GPT-2 backbone.

Experimental Objectives. Our objectives include the following four parts:

Objective 1. Validating that the performance of ICL matches that of training N-layer networks, i.e., the results in Theorem 1, Theorem 4, and Theorem 5.

Objective 2. Validating the ICL performance in scenarios where the testing distribution diverges from the pretraining one or where prompt lengths exceed those used in pretraining.

Objective 3. Validating the ICL performance in scenarios where the distribution of parameters in the N-layer network diverges from that of the pretraining phase.

Objective 4. Validating that a deeper transformer achieves better ICL performance, supporting the idea that scaling up the transformer enables it to perform more ICGD steps.

Computational Resource. We conduct all experiments using 1 NVIDIA A100 GPU with 80GB of memory. Our code is based on the PyTorch implementation of in-context learning for the transformer (Garg et al., 2022) at https://github.com/dtsip/in-context-learning.

F.1 Experiments for Objectives 1 and 2

In this section, we conduct experiments to validate Objectives 1 and 2. We sample the input of the feed-forward network $x\in\mathbb{R}^d$ from the Gaussian mixture distribution $\omega_1 N(-2,I_d)+\omega_2 N(2,I_d)$, where $\omega_1,\omega_2\in\mathbb{R}$. We consider three kinds of network $f:\mathbb{R}^d\to\mathbb{R}$: (i) 3-layer NN, (ii) 4-layer NN, and (iii) 6-layer NN. We generate the true output by $y=f(x)$. In our setting, we use $d=20$; a data-generation sketch follows.
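Below is a minimal sketch of this data-generation pipeline, for illustration only. The function names are ours, and the 200-unit hidden width of the data-generating networks is an assumption carried over from the baseline described in the next subsection:

```python
import torch
import torch.nn as nn

d, n, hidden = 20, 51, 200  # input dim, prompt length, hidden width (hidden is assumed)

def sample_inputs(batch, w1=1.0, w2=0.0):
    """Draw x ~ w1*N(-2, I_d) + w2*N(2, I_d), the Gaussian mixture used for prompts."""
    comp = torch.bernoulli(torch.full((batch, n, 1), w2))  # 1 selects the N(2, I) component
    return (-2.0 + 4.0 * comp) + torch.randn(batch, n, d)

def make_target_net(num_layers):
    """N-layer ReLU feed-forward net f: R^d -> R with standard Gaussian N(0, 1) parameters."""
    dims = [d] + [hidden] * (num_layers - 1) + [1]
    layers = []
    for i in range(num_layers):
        lin = nn.Linear(dims[i], dims[i + 1])
        nn.init.normal_(lin.weight)
        nn.init.normal_(lin.bias)
        layers.append(lin)
        if i < num_layers - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

f = make_target_net(num_layers=4)
x = sample_inputs(batch=64)   # shape (64, 51, 20)
y = f(x).squeeze(-1)          # shape (64, 51): true outputs y = f(x)
```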
Model Architecture. The sole difference between the ReLU-Transformer and the Softmax-Transformer is the activation function in the attention layer. Both models comprise 12 transformer blocks, each with 8 attention heads, and share the same hidden and MLP dimensions of 256.

Transformer Pretraining. We pretrain the ReLU-Transformer and the Softmax-Transformer based on the GPT-2 backbone. In our setting, we sample the pretraining data from $N(-2,I_d)$, i.e., $\omega_1=1$ and $\omega_2=0$. Following the pretraining method in (Garg et al., 2022), we use a batch size of 64. To construct each sample in a batch, we use the following steps (taking the generation of the $i$-th sample as an example):

1. Initialize the parameters in $f_i$ with a standard Gaussian distribution, i.e., $N(0,I)$.

2. Generate $n$ queries $\{x_{i,j}\}_{j=1}^n$ (i.e., inputs of $f_i$) from the Gaussian mixture model $\omega_1 N(-2,I_d)+\omega_2 N(2,I_d)$. Here we take $n=51$.

3. For each query $x_{i,j}$, use $y_{i,j}=f_i(x_{i,j})$ to calculate the true output.

This generates a training sample for the transformer model with input $[x_{i,1},y_{i,1},\dots,x_{i,50},y_{i,50},x_{i,51}]$ and training target $o_i=[y_{i,1},\dots,y_{i,50},y_{i,51}]$. We use the MSE loss between the prediction and the true value of $o_i$. The pretraining process iterates for 500k steps.

Testing Method. We generate samples similarly to the pretraining process. The batch size is 64, and the number of batches is 100, i.e., we have 6,400 samples in total. For each sample, we extend the value of $n$ from 51 to 76 to examine the performance of in-context learning when the prompt length is longer than that used in pretraining. The input to the model becomes $[x_{i,1},y_{i,1},\dots,x_{i,75},y_{i,75},x_{i,76}]$. We assess performance using the mean R-squared value over all 6,400 samples.

Baseline. We use 3-layer, 4-layer, and 6-layer feed-forward neural networks with 200 hidden dimensions as baselines, training them directly on the in-context examples; a sketch of this protocol follows Figure 3. Specifically, given a testing sample (take the $i$-th sample as an example) that includes prompts $\{x_{i,j},y_{i,j}\}_{j=1}^{k-1}$ and a test query $x_{i,k}$, we use $\{x_{i,j},y_{i,j}\}_{j=1}^{k-1}$ to train the network with the MSE loss for 100 epochs. We select the highest R-squared value across epochs as the testing measure and average it over all 6,400 samples.

F.1.1 PERFORMANCE OF RELU TRANSFORMER.

We use four different Gaussian mixture distributions $\omega_1 N(-2,I_d)+\omega_2 N(2,I_d)$ for the testing data: (i) $\omega_1=1,\ \omega_2=0$, (ii) $\omega_1=0.9,\ \omega_2=0.1$, (iii) $\omega_1=0.7,\ \omega_2=0.3$, (iv) $\omega_1=0.5,\ \omega_2=0.5$. Here the distribution in the first setting matches the distribution in pretraining. We show the results in Figure 3.

Figure 3: Performance of ICL in ReLU-Transformer (panels: (a) 3-Layer NN, (b) 4-Layer NN, (c) 6-Layer NN; x-axis: number of in-context examples): ICL learns 3-layer, 4-layer, and 6-layer NNs and achieves R-squared values comparable to those from training with prompt samples. The results also show the ICL performance declines as the testing distribution diverges from the pretraining one.
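The baseline protocol above, as a minimal sketch. The Adam optimizer and its learning rate are our assumptions (the optimizer settings are not specified here), and R-squared is computed over a batch of query points rather than a single query:

```python
import torch
import torch.nn as nn

def r_squared(pred, target):
    """Coefficient of determination over a batch of query points."""
    ss_res = ((target - pred) ** 2).sum()
    ss_tot = ((target - target.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def train_baseline(net, x_ctx, y_ctx, x_qry, y_qry, epochs=100, lr=1e-3):
    """Fit the baseline NN on the in-context examples {(x_j, y_j)}_{j<k} with MSE loss
    and report the best query R-squared across epochs, as in the baseline protocol."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)  # optimizer choice is an assumption
    best = -float("inf")
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x_ctx).squeeze(-1), y_ctx)
        loss.backward()
        opt.step()
        with torch.no_grad():
            best = max(best, r_squared(net(x_qry).squeeze(-1), y_qry).item())
    return best
```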
F.1.2 PERFORMANCE OF SOFTMAX TRANSFORMER.

We use four different Gaussian mixture distributions $\omega_1 N(-2,I_d)+\omega_2 N(2,I_d)$ for the testing data: (i) $\omega_1=1,\ \omega_2=0$, (ii) $\omega_1=0.9,\ \omega_2=0.1$, (iii) $\omega_1=0.7,\ \omega_2=0.3$, (iv) $\omega_1=0.5,\ \omega_2=0.5$. Here the distribution in the first setting matches the distribution in pretraining. We show the results in Figure 4.

Figure 4: Performance of ICL in Softmax-Transformer (panels: (a) 3-Layer NN, (b) 4-Layer NN, (c) 6-Layer NN; x-axis: number of in-context examples): ICL learns 3-layer, 4-layer, and 6-layer NNs and achieves R-squared values comparable to those from training with prompt samples. The results also show the ICL performance declines as the testing distribution diverges from the pretraining one.

Note that performance decreases when the prompt length exceeds the pretraining length (i.e., 50), a well-known issue (Dai et al., 2019; Anil et al., 2022). We believe this is due to the absolute positional encodings in GPT-2, as noted in (Zhang et al., 2024).

The results in Appendix F.1.1 and Appendix F.1.2 show that the performance of ICL in the transformer matches that of training N-layer networks, regardless of whether the prompt lengths are within or exceed those used in pretraining. Furthermore, the ICL performance declines as the testing distribution diverges from the pretraining one.

F.2 Experiments for Objective 3

In this section, we conduct experiments to validate Objective 3. For these experiments, we use testing data identical to the training data, which follows the distribution $N(-2,I_d)$. We vary the distribution of the parameters in the N-layer network: during pretraining, we set the distribution to $N(0,I)$; during testing, we examine different distributions, including $N(0,I)$, $N(-0.5,I)$, and $N(0.5,I)$ (a re-initialization sketch follows Figure 6). All other model hyperparameters and experimental details remain consistent with those described in Appendix F.1. We evaluate the ICL performance of both the ReLU-Transformer and the Softmax-Transformer for 4-layer networks, as shown in Figure 5 and Figure 6. The results demonstrate that the ICL performance of the transformer matches that of training N-layer networks, even when the parameter distribution of the N-layer network diverges from that of the pretraining phase.

Figure 5: Performance of ICL Across Various N-layer Network Parameter Distributions for the ReLU-Transformer (panels: (a) Parameters ~ N(0, I), (b) Parameters ~ N(-0.5, I), (c) Parameters ~ N(0.5, I); x-axis: number of in-context examples): ICL learns a 4-layer NN and achieves R-squared values comparable to those from training with prompt samples, even when the parameter distribution in the N-layer network during testing diverges from that in the pretraining phase (N(0, I)).

Figure 6: Performance of ICL Across Various N-layer Network Parameter Distributions for the Softmax-Transformer (panels: (a) Parameters ~ N(0, I), (b) Parameters ~ N(-0.5, I), (c) Parameters ~ N(0.5, I); x-axis: number of in-context examples): ICL learns a 4-layer NN and achieves R-squared values comparable to those from training with prompt samples, even when the parameter distribution in the N-layer network during testing diverges from that in the pretraining phase (N(0, I)).
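For Objective 3, only the test-time parameter distribution of the data-generating network changes. A small sketch, reusing `make_target_net` from the earlier data-generation sketch (the helper name `shift_parameters` is ours):

```python
import torch.nn as nn

def shift_parameters(net, mu):
    """Re-sample every weight and bias from N(mu, 1) for the test-time settings
    of Objective 3, with mu in {0.0, -0.5, 0.5}."""
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=mu, std=1.0)
            nn.init.normal_(m.bias, mean=mu, std=1.0)
    return net

for mu in (0.0, -0.5, 0.5):
    f_test = shift_parameters(make_target_net(num_layers=4), mu)
    # ... generate prompts with f_test and evaluate ICL as in Appendix F.1 ...
```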
F.3 Experiments for Objective 4

In this section, we conduct experiments to validate Objective 4. For these experiments, we use testing data identical to the pretraining data, drawn from $N(-2,I_d)$. We vary the number of layers in the transformer architecture, testing configurations with 4, 6, 8, and 10 layers. All other model hyperparameters and experimental details remain consistent with those described in Appendix F.1. We evaluate the ICL performance of both the ReLU-Transformer and the Softmax-Transformer with 15, 30, and 45 in-context examples, as shown in Figure 7. The results show that a deeper transformer achieves better ICL performance, supporting the idea that scaling up the transformer enables it to perform more ICGD steps.

Figure 7: Performance of ICL Across Varying Transformer Depths (panels: (a) ReLU-Transformer, (b) Softmax-Transformer; x-axis: transformer layer depth in {4, 6, 8, 10}; curves for 15, 30, and 45 in-context examples): The results show that a deeper transformer achieves better ICL performance, supporting the idea that scaling up the transformer enables it to perform more ICGD steps.

G Application: ICL for Diffusion Score Approximation

In this part, we give an important application of our work: learning the score function of diffusion models via the in-context learning of transformer models. We give the preliminaries on score-matching generative diffusion models in Appendix G.1. Then, we give the analysis of ICL approximating the diffusion score function in Appendix G.2.

G.1 Score Matching Generative Diffusion Models

Diffusion Model. Let $x_0\in\mathbb{R}^d$ be initial data following the target data distribution $x_0\sim P_0$. In essence, a diffusion generative model consists of two stochastic processes in $\mathbb{R}^d$: a forward process that gradually adds noise to the initial data (e.g., images), $x_0\to x_1\to\cdots\to x_T$, and a backward process that gradually removes noise from pure noise, $y_T\to y_{T-1}\to\cdots\to y_0$. Importantly, the backward process is the reversed forward process, i.e., $y_t\overset{d}{=}x_{T-t}$ for $t\in\{0,\dots,T\}$, where $\overset{d}{=}$ denotes distributional equivalence. This allows the backward process to reconstruct the initial data from noise, and is hence generative. To achieve this time-reversal, a diffusion model learns the reverse process by ensuring the backward conditional distributions mirror the forward ones. The most prevalent technique for aligning these conditional dynamics is score matching: a strategy that trains a model to match the score function, i.e., the gradient of the log marginal density of the forward process (Song et al., 2020b;a; Vincent, 2011). To be precise, let $P_t$ and $p_t(\cdot)$ denote the distribution function and density function of $x_t$. The score function is given by $\nabla\log p_t(\cdot)$. (An illustrative sketch of a concrete forward process and its conditional score follows.)
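To make the forward process and its conditional score concrete, here is a minimal sketch using an Ornstein-Uhlenbeck (variance-preserving) forward process. This particular dynamics, and all names, are our illustration; the text above does not fix a specific forward process:

```python
import torch

def forward_marginal(x0, t):
    """OU forward process at time t (our illustrative choice):
    x_t = e^{-t} * x0 + sqrt(1 - e^{-2t}) * eps, with eps ~ N(0, I)."""
    alpha = torch.exp(-t)
    sigma = torch.sqrt(1.0 - torch.exp(-2.0 * t))
    eps = torch.randn_like(x0)
    return alpha * x0 + sigma * eps, alpha, sigma

def conditional_score(xt, x0, alpha, sigma):
    """grad_x log p(x_t | x_0) for the Gaussian transition above: -(x_t - alpha*x0)/sigma^2."""
    return -(xt - alpha * x0) / sigma ** 2
```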
In this work, we focus on leveraging the in-context learning (ICL) capability of transformers to emulate the score-matching training process.

Score Matching Loss. We introduce the basic setting of score matching as follows (see also Appendix A.1 and (Chen et al., 2024; Chan et al., 2024; Yang et al., 2023) for overviews). To estimate the score function, we use the following loss to train a score network $s_W(\cdot,t)$ with parameters $W$:

$$\int_{T_0}^{T}\gamma(t)\,\mathbb{E}_{x_t\sim P_t}\Big[\|s_W(x_t,t)-\nabla\log p_t(x_t)\|_2^2\Big]\,dt,\tag{G.1}$$

where $\gamma(t)$ is a weight function and $T_0$ is a small value for stabilizing training and preventing the score function from diverging. In practice, as $\nabla\log p_t(\cdot)$ is unknown, we minimize the following equivalent loss (Vincent, 2011):

$$\int_{T_0}^{T}\gamma(t)\,\mathbb{E}_{x_0\sim P_0}\Big[\mathbb{E}_{x_t|x_0}\big[\|s_W(x_t,t)-\nabla\log p(x_t\,|\,x_0)\|_2^2\big]\Big]\,dt,\tag{G.2}$$

where $p(x_t|x_0)$ is the distribution of $x_t$ conditioned on $x_0$.

G.2 ICL for Score Approximation

We first give the problem setup for ICL for score approximation as follows:

Problem 3 (In-Context Learning (ICL) for Score Function $\nabla\log p_t(\cdot)$). Consider the score function $\nabla\log p_t(\cdot)$ for any $t\ge 0$. Given a dataset $\mathcal{D}_n:=\{(x_i,y_i)\}_{i\in[n]}$, where $\{x_i\}_{i\in[n]}\subset\mathbb{R}^d$ and $y_i=\nabla\log p_{t_i}(x_i)\in\mathbb{R}^d$ ($t_i\ge 0$), and a test input $x_{n+1}$, the goal of ICL for the score function is to find a transformer $\mathcal{T}$ to predict $y_{n+1}$ based on $x_{n+1}$ and the in-context dataset $\mathcal{D}_n$. In essence, the desired transformer $\mathcal{T}$ serves as the trained score network $s_W(\cdot,t)$.

To solve Problem 3, we follow two steps: (i) approximate the diffusion score function $\nabla\log p_t(\cdot)$ with a multi-layer feed-forward network with ReLU activation under the given training dataset $\mathcal{D}_n$; (ii) approximate the gradient descent used to train this network by the in-context learning of the transformer until convergence, using the same training set $\mathcal{D}_n$ as the prompts of ICL. For the first step, we invoke the score approximation results based on a multi-layer feed-forward network with ReLU activation from (Chen et al., 2023), stated as the next lemma.

Lemma 19 (Score Approximation by Feed-Forward Networks, Theorem 1 of (Chen et al., 2023)). Given an approximation error $\epsilon>0$, for any initial data distribution $P_0$, there exists a multi-layer feed-forward network with ReLU activation $f(w,x,t):\mathbb{R}^{D_w}\times\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ such that, for any $t\in[T_0,T]$, we have $\|f(w,\cdot,t)-\nabla\log p_t(\cdot)\|_{L^2(P_t)}=O(\epsilon)$.

With this approximation result, we reduce Problem 3 to Problem 2, where the loss function is (G.1). Following Theorem 1, we show that the in-context learning of transformer models can approximate the score function of diffusion models. (A sketch of one Monte Carlo step of (G.2) follows.)
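For concreteness, a minimal sketch of one Monte Carlo step of the denoising loss (G.2), reusing `forward_marginal` and `conditional_score` from the sketch in Appendix G.1; the uniform sampling of $t$ and the choice $\gamma(t)\equiv 1$ are our simplifications:

```python
import torch

def dsm_loss(score_net, x0, T0=1e-3, T=1.0):
    """Single Monte Carlo estimate of (G.2): sample t ~ U[T0, T], diffuse x0 forward,
    and regress the score network onto grad log p(x_t | x_0)."""
    t = T0 + (T - T0) * torch.rand(x0.shape[0], 1)
    xt, alpha, sigma = forward_marginal(x0, t)
    target = conditional_score(xt, x0, alpha, sigma)  # equals -eps / sigma here
    pred = score_net(xt, t)                           # any network R^d x R -> R^d
    return ((pred - target) ** 2).sum(dim=-1).mean()  # gamma(t) == 1 simplification
```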