Published as a conference paper at ICLR 2025

CHAIN-OF-THOUGHT PROVABLY ENABLES LEARNING THE (OTHERWISE) UNLEARNABLE

Chenxiao Yang, Zhiyuan Li, David Wipf
Toyota Technological Institute at Chicago, Amazon Web Services
{chenxiao,zhiyuanli}@ttic.edu, davidwipf@gmail.com

ABSTRACT

Modern language models have demonstrated remarkable reasoning capabilities by using chain-of-thought (CoT). One hypothesis about the inner workings of CoT is that it breaks down originally complex tasks into smaller subtasks that are more amenable to learning. We formalize this by showing possibility and impossibility results for learning from in-context demonstrations with and without CoT. In particular, with CoT, we examine a family of learning algorithms that learn a task step-by-step, capable of composing simpler functions from individual reasoning steps to form an overall complex function. This process reduces the difficulty of learning a task to that of the hardest reasoning step in the chain. Moreover, we prove Transformers can express this algorithm, and thus they can efficiently in-context learn arbitrary tasks as long as these tasks can be decomposed into a finite number of subtasks, each of which is efficiently learnable. In contrast, without CoT, we demonstrate that there exist tasks that are inherently unlearnable by the same algorithm. Overall, our results suggest several provably effective ways of decomposing target problems to instantiate CoT. Empirically, we demonstrate that our proposed CoT construction significantly enhances the reasoning capabilities of real-world LLMs in solving challenging arithmetic reasoning tasks, including learning polynomials and Boolean formulas.
1 INTRODUCTION

Complex problem solving often involves breaking down an originally challenging task into smaller, more manageable subtasks, learning from these subtasks, and then composing the acquired skills to address the overall task, a strategy that reflects how humans naturally solve problems. One empirically successful method that mimics this process is called Chain-of-Thought (CoT) (Wei et al., 2022; Reynolds & McDonell, 2021; Nye et al., 2021), whereby a model is provided with demonstrations involving detailed reasoning steps and subsequently instructed to generate thoughts step-by-step before yielding the final answer. Modern language models rely on CoT or variants thereof (Yao et al., 2024; Besta et al., 2024) to tackle complex tasks ranging from commonsense reasoning to mathematical proofs (Cobbe et al., 2021; Rae et al., 2021; Srivastava et al., 2022), sometimes even exceeding the capabilities of human experts.

Despite the empirical success of CoT, theoretical investigation thus far has remained relatively sparse. Some previous efforts have made strides from the perspective of Transformer expressiveness (Li et al., 2024; Feng et al., 2024) or through case studies of in-context learning MLPs (Li et al., 2023b). However, these works either focus on mostly generic cases without specifying concrete/actionable ways of actually decomposing a task, or involve quite restricted scenarios where both the task and the CoT are predetermined. Since in practice not all intermediate steps in CoT are equally useful, e.g. Zhang et al. (2022); Press et al. (2022), and the target tasks are typically diverse, there still exists a considerable gap between current theory and practical scenarios. In targeting this gap, we ask the following core question:

How do task decompositions affect the ability of a model to learn complex reasoning tasks, and what principles can guide the corresponding optimal CoT design?
* Work was done during the author's internship at Amazon Web Services.

Figure 1: An overview of our analysis. Arrow length indicates the difficulty of learning a step.

One possible explanation for the efficacy of CoT is that it reduces the difficulty of learning a complex task to the level of learning a series of subtasks. The implication here is simply that, assuming these subtasks are suitably designed and orchestrated, the necessary skills acquired at the subtask level through CoT will suffice to outperform attempts at learning directly from the overall task. In this work, we formalize these intuitions in a mathematically rigorous way, and take initial steps towards quantifying the benefits of different task decomposition schemes.

Problem Setup. To define a notion of learnability in the context of language modeling, we adopt classic PAC learnability (Valiant, 1984) in the setting where the learner is associated with a parametric model, a setting known as in-context learning (Brown et al., 2020; Garg et al., 2022). Unlike conventional supervised learning, the learning considered here occurs at test time, which is a highly useful feature enabling the model to adapt to new tasks that were not explicitly seen in pre-training. Concretely, we say a task is in-context learnable by a parametric model M if there exists a parameter configuration θ such that, for an arbitrary distribution from the task, the model can, upon receiving a set of i.i.d. demonstrations/samples D and a query/test sample x, output a prediction for the query, i.e.
Mθ(D, x), that is close to the ground-truth y with high probability. A learning algorithm A is implicit in this process, which takes D as input and outputs a predictor or hypothesis h_D : X → Y. The relation between the algorithm and the model Mθ can be written as

Mθ(D, x) = A(D)(x) = h_D(x). (1)

To account for the effects of CoT, we also formalize the notion of task decomposition as transforming the distribution from which D is generated into a sequence of distributions that generate D with detailed intermediate steps. (Section 2)

1.1 CONTRIBUTIONS

Main Results. To formally answer whether or not CoT can help a model learn complex reasoning tasks, we investigate the in-context learnability of tasks w.r.t. different decomposition schemes. We find the answer is often affirmative depending on the task decomposition, and summarize our findings via the following two informal statements:

1. Regardless of how complex a reasoning task is, it can be efficiently learned by Transformers in-context as long as it can be decomposed into a finite number of reasoning steps, each of which is efficiently learnable by a learner Transformers can perform. (Section 3)

2. There exist inherently hard tasks that are not learnable without CoT regardless of the sample size, and yet nonetheless become learnable by using CoT with specific decomposition schemes we introduce. (Section 4)

In aggregate, our results suggest a broad range of scenarios where CoT can indeed effectively reduce the hardness of learning, from that of the overall task to that of the hardest reasoning step in the chain, or even from an unlearnable level to a learnable one. These results also suggest several actionable ways to form effective intermediate steps of CoT. To obtain the above results, we introduce a class of learning algorithms A_CoT enabled by CoT, dubbed step-by-step learning.
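To make the learner view of Eq. (1) concrete, here is a toy sketch (our illustration, not the paper's construction: the algorithm A is instantiated as ordinary least squares rather than the kernel gradient descent analyzed later, and the demonstration data is hypothetical):

```python
import numpy as np

def model_as_learner(demos, x_query):
    """Toy illustration of Eq. (1): the model M_theta(D, x) acts as a
    learner A applied to demonstrations D, producing a predictor h_D
    that is then evaluated at the query x. Here A is least squares
    (a stand-in for the paper's kernel gradient descent)."""
    X = np.array([x for x, _ in demos], dtype=float)  # shape (N, d)
    Y = np.array([y for _, y in demos], dtype=float)  # shape (N,)
    # h_D = A(D): fit weights on the in-context examples
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # M_theta(D, x) = A(D)(x) = h_D(x)
    return np.asarray(x_query, dtype=float) @ w

# Hypothetical demonstrations of the target y = 2*x1 - x2
demos = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
pred = model_as_learner(demos, [2.0, 1.0])
```

Note that no weights of the "model" are updated here: the learning happens entirely inside the forward evaluation on the demonstrations, which is the defining feature of ICL.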
This class of algorithms takes as input CoT examples, and outputs a complex predictor h_D by composing predictors {h_i}_{i∈[k]} obtained from k individual algorithms, where k is the number of reasoning steps (Algorithm 1). Given a fixed k, the expected overall prediction error made by h_D can be upper bounded by the individual errors made by the predictors {h_i}_{i∈[k]} on their respective reasoning steps (Lemma 2). Leveraging this fact, we prove that the difficulty of learning the overall task can be reduced to that of the hardest constituent step of the CoT, since the sample complexity of this step determines the sample complexity of learning the overall task (Theorem 3).

Furthermore, we establish that the capabilities of A_CoT described above can be achieved by Transformers, the de facto parametric model M used in language modeling. Specifically, we show that a linear-depth (w.r.t. k) and constant-embedding-size Transformer is sufficient to express A_CoT in an end-to-end manner, if the individual algorithms for learning each step are instantiated as kernel gradient descent producing a predictor h_i : x ↦ Wϕ(x) with a non-linear feature map ϕ(·) (Lemma 1). Combining these results suggests that a reasoning task is efficiently learnable by Transformers in-context as long as each subtask is efficiently learnable by these individual algorithms.

Within the same analytical framework, we further present impossibility results illustrating the hardness of learning without CoT. In particular, we consider the task of learning a Boolean function class called sparse parity. Without CoT, this task is impossible to learn by the variant of algorithm A_CoT with k = 1, due to the limited approximation power of the hypothesis class used to learn a single step; whereas if we enable CoT with specific intermediate reasoning steps, the task becomes learnable with guarantees of smaller errors (Theorem 4 & Corollary 5).

Experiments.
For empirical verification, we consider new arithmetic reasoning tasks and test whether the decomposition schemes from our analysis are practically effective (Section 5). Specifically, we propose to construct complex reasoning tasks with varying overall hardness and hardness of subtasks. We observe that the performance of real-world LLMs improves significantly as the difficulty of the hardest step in the CoT is reduced, regardless of the overall task complexity. Additionally, on learning two inherently hard Boolean functions, we demonstrate that our introduced CoT significantly enhances reasoning performance, sometimes improving accuracy from nearly random guessing to nearly perfect.

1.2 RELATED WORK

CoT has been analyzed through the lens of the expressiveness of finite-precision Transformers (Feng et al., 2024; Li et al., 2024). In particular, Feng et al. (2024) proves CoT can enable Transformers to solve some specific tasks such as basic arithmetic and linear equations, while Li et al. (2024) proves that increasing the number of steps in the CoT allows them to emulate circuits of increasing depth. More closely related to ours, Li et al. (2023b) studies a scenario in which the intermediate steps of CoT are intermediate layers of an MLP, showing that Transformers can in-context learn the MLP. We also note that there are many recent papers that empirically analyze CoT, e.g. (Wang et al., 2022; Fu et al., 2023b; Madaan & Yazdanbakhsh, 2022; Turpin et al., 2024; Prabhakar et al., 2024) among others. However, previous works are generally limited in one of two ways: 1) the problem setting is overly restricted, focusing on fixed tasks and specific forms of decomposition, such as solving arithmetic equations (Feng et al., 2024) or learning MLPs (Li et al., 2023b); or 2) in a more complex regime (Li et al., 2024), though the possibility for improvement is shown, the effects of specific intermediate steps are not quantified.
In contrast, our work introduces a new setting, one which is general enough to encompass learning all distribution families, while also explicitly accounting for the effects of specific decomposition schemes. Our contribution is also related to a recent line of work on ICL (Akyürek et al., 2023; Von Oswald et al., 2023; Dai et al., 2023; von Oswald et al., 2023; Cheng et al., 2024; Li et al., 2023a), particularly demonstrations of how Transformers with certain weight constructions perform ICL similarly to optimization algorithms like gradient descent (Von Oswald et al., 2023) or its variants (Giannou et al., 2024; Fu et al., 2023a). Notably, Cheng et al. (2024) proves that Transformers could implement kernel gradient descent, which is closely related to ours. The idea of task decomposition and generalizing from easier to harder tasks has also been explored in other works of learning theory, e.g. Natarajan (1989); Tadepalli (2008); Wies et al. (2022).

2 PRELIMINARIES

Chain-of-Thought (CoT). We follow the commonly adopted few-shot CoT setting, where demonstrations are augmented with intermediate steps and the prediction is also in the format of CoT. In particular, for k reasoning steps, we denote each demonstration as

e = (x, z_1, …, z_{k−1}, y) ∈ X × Z_1 × ⋯ × Z_{k−1} × Y, (2)

where z_t ∈ Z_t represents the t-th intermediate reasoning step. For convenience, we also let z_0 = x and z_k = y. Let d(Z_i) be the dimension of Z_i, and d = Σ_{i=0}^k d(Z_i).

In-Context Learning (ICL). In ICL, the base model is provided with N demonstrations or in-context examples e^(i) = (x^(i), y^(i)) for i ∈ [N], where x ∈ X and y = f(x) ∈ Y. We denote a learning algorithm or learner as A : (X × Y)^N → H, which takes a set of demonstrations as input and outputs a predictor or hypothesis h : X → Y from a hypothesis class H.
Given a set of demonstrations and a query x^(N+1) (which we assume is drawn from the same distribution as the demonstrations), the goal of the base model is to learn a predictor h and use this predictor to make predictions on the query x^(N+1). For a base-model Transformer TFθ(·) with parameters θ, we have

TFθ({(x^(i), y^(i)) : i ∈ [N]}, x^(N+1)) = A({(x^(i), y^(i)) : i ∈ [N]})(x^(N+1)) = h(x^(N+1)). (3)

Transformers. Let E = [e^(i)]_{i∈[N]} ∈ R^{d×N} be the concatenation of samples, and let e^(N+1) = (x^(N+1), 0) ∈ R^d, whose dimension aligns with the other examples. Following prior work (Von Oswald et al., 2023; Ahn et al., 2023; Cheng et al., 2024; Zhang et al., 2023), a single-head self-attention layer with weights W_K, W_Q, W_V ∈ θ updates e^(N+1) as

e^(N+1) ← e^(N+1) + W_V E σ(Eᵀ W_Kᵀ W_Q e^(N+1)), (4)

where σ is a non-linearity that could be specified as softmax, ReLU, or certain kernel functions, e.g. (Choromanski et al., 2021; Katharopoulos et al., 2020; Wang et al., 2020; Peng et al., 2021). Stacking multiple self-attention layers (with or without an MLP module applied between self-attention layers) gives us the Transformer considered in this paper. Following (Von Oswald et al., 2023), and for subtle technical reasons related to the construction in Sec 3.2, we exclude the query token when computing the attention.

Our Setup. We are now ready to formally define the problem setup. Let D be a distribution over X, and f : X → Y a target function. An input distribution and target function pair (f, D) defines the generating process of in-context examples, namely examples are drawn via x ∼ D, y = f(x). Let P be a family of distributions defined as a set of (f, D) pairs, representing a certain task a model aims to solve. For instance, the target functions in P could be defined as all polynomials, Boolean functions, etc.
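The single-head update of Eq. (4) can be sketched directly (a minimal version; the weight matrices and the choice of σ are illustrative placeholders, and the query token is excluded from keys/values as in the paper's construction):

```python
import numpy as np

def attention_update(E, e_query, W_K, W_Q, W_V, sigma=np.tanh):
    """One self-attention update as in Eq. (4):
    e <- e + W_V E sigma(E^T W_K^T W_Q e).
    E holds one column per demonstration token; only demonstration
    tokens contribute keys/values. sigma stands in for softmax, ReLU,
    or a kernel feature map."""
    scores = E.T @ W_K.T @ W_Q @ e_query        # (N,) attention scores
    return e_query + W_V @ (E @ sigma(scores))  # skip connection + attention

rng = np.random.default_rng(0)
d, N = 4, 5
E = rng.standard_normal((d, N))                 # N demonstration embeddings
e = rng.standard_normal(d)                      # query embedding
W_K = W_Q = W_V = np.eye(d)                     # placeholder weights
out = attention_update(E, e, W_K, W_Q, W_V)
```

Stacking this update layer by layer (optionally interleaved with MLPs) yields the Transformer analyzed in the rest of the paper.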
The following error quantifies how successfully an algorithm can learn the task:

ε(P, A) := max_{(f,D)∈P} E_{x∼D}[ℓ(h(x), f(x))], (5)

where ℓ is the squared loss (which could be extended to other convex loss functions) and h is the predictor given by a learner A on N i.i.d. examples from (f, D). Minimizing this error guarantees successful learning of all target functions within the family P.

Definition 1 (In-Context Learnability). We say a parametric model M : θ ↦ Mθ (i.e. a functional mapping from parameter space to function space) can learn task P if there exists a learning algorithm A and a function N_A : (0, 1)² → N, such that for any confidence and accuracy parameters δ, ϵ ∈ (0, 1):

1. Under a certain parameter choice θ, we have Mθ(D, x) = A(D)(x) for any demonstration set D and query x.

2. Given N_A(δ, ϵ) i.i.d. examples, the algorithm A returns a predictor h such that, with probability at least 1 − δ, ε(P, A) ≤ ϵ.

Moreover, we say P can be efficiently in-context learned by M if both the running time of A and the sample size N_A(δ, ϵ) are polynomial in δ⁻¹ and ϵ⁻¹.

Note that the notion of learnability here differs from classic PAC learnability (Shalev-Shwartz & Ben-David, 2014) in the sense that, per our definition, the model itself acts not as a predictor, but as a learner. This is a highly desirable property for modern language models as it allows them to meta-learn out-of-distribution tasks that were not available during pre-training (Brown et al., 2020). In the rest of this paper, we stipulate the model to be a Transformer TFθ as defined via (4), unless otherwise stated.

Moving forward, we focus on whether or not CoT can improve the in-context learnability of different reasoning tasks. In principle this can be studied from multiple vantage points, such as the effect of CoT on the sample efficiency N_A(δ, ϵ), or the existence of cases where a task is initially not learnable but becomes so once CoT is enabled, etc.
Notably, addressing these issues depends critically on the specific forms of the intermediate CoT steps involved. To accommodate this aspect, we next formalize the notion of a task decomposition.

Definition 2 (Task Decomposition). A decomposition operator T is such that, for every (f, D) ∈ P, a target function can be decomposed as T(f) = (f_2, f_1) subject to f = f_2 ∘ f_1, where f_1 : X → Z and f_2 : Z → Y for another space Z. This operation induces two new distribution families

P_1 = {{(f_1, D) : (f_2 ∘ f_1, D) ∈ P}} and P_2 = {{(f_2, D′) : (f_2 ∘ f_1, D) ∈ P}}, (6)

where {{·}} is a multiset allowing repeated elements, and D′ : Z → R is determined by f_1 and D.

The decomposition always exists and is not unique (e.g. f_1 can be an arbitrary bijection). In particular, when f_1 is specified as the identity mapping, we have P_2 = P, in which case CoT is unlikely to work. With each decomposition, we can associate the generating process of demonstrations with intermediate step z, i.e. x ∼ D, z = f_1(x), y = f_2(z). This affects how examples are generated in Definition 1, and thus could impact the learnability of a task. Note that this definition can be generalized to CoT with k > 2 steps by sequentially applying the decomposition.

3 IMPROVED LEARNABILITY BY TASK DECOMPOSITION

To demonstrate how CoT improves learnability by task decomposition, we will first define a learning algorithm A_CoT (Sec 3.1) and prove that it can be expressed by Transformers with linear depth (Sec 3.2). Then, we derive an upper bound for the overall error, which allows us to show CoT can adjust the learnability of a complex task to the learnability of simpler subtasks (Sec 3.3).
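As a concrete illustration of Definition 2, the induced generating process x ∼ D, z = f_1(x), y = f_2(z) can be sketched as follows (the particular decomposition f(x) = (2x)² into f_1(x) = 2x and f_2(z) = z² is a hypothetical example of ours):

```python
import random

def make_cot_demos(f1, f2, sample_x, n, seed=0):
    """Generate CoT demonstrations under a decomposition f = f2 ∘ f1
    (Definition 2): each demonstration records the intermediate step z,
    so the learner sees (x, z, y) rather than only (x, y)."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        x = sample_x(rng)  # x ~ D
        z = f1(x)          # first reasoning step
        y = f2(z)          # final answer
        demos.append((x, z, y))
    return demos

# Hypothetical example: f(x) = (2x)^2 decomposed as f1(x) = 2x, f2(z) = z^2
demos = make_cot_demos(lambda x: 2 * x, lambda z: z * z,
                       lambda rng: rng.randint(-5, 5), n=3)
```

Choosing f1 as the identity recovers the degenerate case P_2 = P discussed above, where the intermediate step carries no information.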
3.1 LEARNING ALGORITHMS ENABLED BY CHAIN-OF-THOUGHTS

Consider a class of learning algorithms A_CoT enabled by CoT, which involves several individual algorithms {A_i}_{i=1}^k and learns increasingly complex compositional functions with more CoT steps:

Algorithm 1: Step-by-Step Learning with CoT (A_CoT)
  Input: Demonstrations {(z_0^(j), z_1^(j), …, z_k^(j))}_{j=1}^N, individual learning algorithms {A_i}_{i=1}^k
  Output: Predictor h : X → Y
  for i = 1, …, k do
      h_i ← A_i({(z_{i−1}^(j), z_i^(j))}_{j=1}^N)
  h ← h_k ∘ ⋯ ∘ h_2 ∘ h_1

This class of algorithms takes as input a set of demonstrations, where each demonstration contains k reasoning steps, and outputs a predictor h : X → Y. The learning is performed in a step-by-step manner; that is, for each reasoning step i ∈ [k], an individual algorithm A_i is used to learn a predictor h_i : Z_{i−1} → Z_i from a hypothesis class H_i. The learned predictors h_1, h_2, …, h_k are then composed to obtain the desired overall predictor. Note that the algorithm can be naturally extended to scenarios where each step is a function of all preceding steps, by redefining z_i in the algorithm as a concatenation of {z_j}_{j≤i} in the initial CoT. Therefore, without loss of generality, we assume that the CoT satisfies the Markov property, meaning that each step is conditionally dependent only on the previous step.

3.2 EXPRESSIVENESS OF TRANSFORMERS

Next, we demonstrate parameter choices θ that connect Transformers TFθ and the algorithm A_CoT instantiated in a certain way.

Instantiation of A_CoT. While the algorithm could have many different instantiations, in this paper we define A_i as empirical risk minimization: using gradient descent to minimize a squared loss L_i over in-context examples to learn a predictor from a hypothesis class H_i, which is defined as a linear model on fixed non-linear features.
Specifically,

L_i = Σ_{j=1}^N ‖h_i(z_{i−1}^(j)) − z_i^(j)‖²₂, h_i ∈ H_i = {z_{i−1} ↦ W_i ϕ_i(z_{i−1}) : ‖W_i‖₂ ≤ B}, (7)

where ϕ_i : Z_{i−1} → R^K is a non-linear feature map to a K-dimensional space, and W_i ∈ R^{d(Z_i)×K} are learnable weights initialized to zero and subsequently with norm bounded by B. Therefore, the overall predictor h obtained from this composition can be written as a stacked sequence of multiple non-linearities and linear transformations, i.e.

h = W_k ϕ_k(⋯(W_2 ϕ_2(W_1 ϕ_1(x)))) ∈ H = H_k ∘ ⋯ ∘ H_2 ∘ H_1. (8)

For example, if ϕ_1 is specified as the identity mapping whereas the ϕ_i for i ≠ 1 are conventional activation functions, (8) could represent a k-layer deep neural network. Beyond this, for generic feature maps, h could represent more powerful functions. As the step number k increases, the predictor h also becomes more complex. Below, we connect the algorithm with Transformers.

Lemma 1 (Transformers Learn Step-by-Step). For any k > 1, given a set of CoT demonstrations with k reasoning steps as per (2) and a query x^(N+1), Transformers with linear depth kt and constant embedding size 2d can express A_CoT, where A_i is t steps of GD on a squared loss and H_i is the hypothesis class defined in (7) whose feature map aligns with the attention as specified in Appendix A.1.

Proof Sketch. Consider the simplified case k = 1. The loss is L = Σ_{i∈[N]} ‖Wϕ(x^(i)) − y^(i)‖²₂ / 2, and GD with a fixed step size updates the weights as W ← W − η ∇_W L. This process also induces dynamics in function space, i.e. the evolution of the learned predictor h as the weights update. For the query x^(N+1), the function-space dynamics can be written as (see derivation in the proof)

(GD Dynamics) h(x^(N+1)) ← h(x^(N+1)) + η (Y − Ŷ) ϕ(X)ᵀ ϕ(x^(N+1)), (9)

where Y = [y^(i)]_{i=1}^N ∈ R^{d(Y)×N} are the targets, Ŷ = [h(x^(i))]_{i=1}^N the predictions, so that Y − Ŷ are the residuals, and ϕ(X) = [ϕ(x^(i))]_{i=1}^N ∈ R^{K×N}, so that ϕ(X)ᵀ ϕ(x^(N+1)) is a kernel function. The residuals Y − Ŷ are equal to Y at initialization since the weights are initialized to zero. The last term represents a kernel function w.r.t.
the feature map ϕ, quantifying the similarity between the test example (i.e. the query) and the training examples (i.e. the demonstrations). For comparison, we also rewrite the self-attention layer, where e^(i) is reinterpreted as a concatenation of the input and the residual, (x^(i), y^(i) − h(x^(i))), which at initialization equals (x^(i), y^(i)):

(Transformer Layer) e^(N+1) ← e^(N+1) + W_V E σ(Eᵀ W_Kᵀ W_Q e^(N+1)), (10)

where the last term is the attention module, E is the embedding matrix, and the first summand is the skip connection. As specified in Appendix A.1, with simple choices of W_V, W_K, W_Q ∈ θ, one can show that (10) subsumes (9); that is, Transformers can perform kernel regression in their forward pass. The key here is the connection between the kernel function and the attention matrix, which has been widely studied in previous literature, e.g. (Tsai et al., 2019; Wright & Gonzalez, 2021; Chen et al., 2024; Choromanski et al., 2021; Katharopoulos et al., 2020; Wang et al., 2020; Peng et al., 2021) and also discussed in the setting of ICL (Von Oswald et al., 2023; Cheng et al., 2024; Guo et al., 2024). In Appendix B, we provide a comprehensive discussion of their connections and how the feature map ϕ relates to Transformer parameter and architectural choices.

Extension to k > 1. To extend this result to CoT, we define k loss functions {L_i}_{i∈[k]} associated with the k reasoning steps. Each loss function is convex w.r.t. the weights of the corresponding predictor. In the forward pass, similar to (9), Transformers implement (kernel) GD dynamics in function space to minimize these loss functions. One challenge in retaining the connection between (9) and (10) while further incorporating CoT is that, for the compositional non-linear predictors h in (8), updating the weights in a prior-step predictor (e.g. W_1 in h_1) could introduce non-linear dynamics in the final prediction h(x) from (8).
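As a numerical aside, the function-space GD dynamics of Eq. (9) can be sketched as follows (a minimal scalar-output version of our own; the feature map phi is user-supplied, and with an identity feature map and enough steps it recovers linear regression):

```python
import numpy as np

def kernel_gd_predict(X, Y, x_query, phi, eta=0.1, steps=500):
    """Sketch of Eq. (9): with weights initialised to zero, each GD
    step on the squared loss updates the query prediction by
    eta * (residuals) @ Phi.T @ phi(x_query), i.e. kernel regression
    between the query and the demonstrations."""
    Phi = np.stack([phi(x) for x in X])  # (N, K) demonstration features
    W = np.zeros(Phi.shape[1])           # zero-initialised weights
    for _ in range(steps):
        resid = Y - Phi @ W              # residuals Y - Yhat
        W = W + eta * resid @ Phi        # weight-space GD step
    return W @ phi(x_query)              # prediction at the query

# Identity feature map: reduces to linear regression on the demos
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([1.0, 2.0])
pred = kernel_gd_predict(X, Y, np.array([1.0, 1.0]), phi=lambda x: x)
```

Replacing phi with the attention-aligned feature map of Lemma 1 is exactly what lets a stack of layers of the form (10) reproduce these dynamics.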
We show that this issue can be circumvented if the learning is done in a step-by-step manner as in A_CoT, namely Transformers first learn a preceding reasoning step using t layers, then proceed to learn the next step using another t layers. This results in a total of kt layers for k steps. Note that the construction here is not unique, and similar conclusions could be drawn from other setups, such as recurrently making k predictions (Li et al., 2023b) in k forward passes, which would additionally require the Transformer to perform a so-called filtering process but could potentially reduce the depth requirement to a constant. Our construction differs from Von Oswald et al. (2023); Cheng et al. (2024) by extending their proofs to the case where CoT is used.

3.3 EFFECTS OF TASK DECOMPOSITION

We now proceed to answer when and how CoT improves the learnability of a task by studying the in-context learnability of a distribution family P w.r.t. different task decomposition schemes. Before presenting the main result, we analyze how well the predictor h generalizes to unseen queries. This is accomplished by studying the final error ε(P, A) made by the learning algorithm A_CoT and its relation to the individual errors made by the individual algorithms at each step.

Consider a fixed number of steps k. According to Definition 2, applying a sequence of decomposition operators {T_i}_{i∈[k−1]} on P produces k new distribution families {P_i}_{i∈[k]}, where P_i is the induced distribution family that generates the i-th reasoning step. As per equation (5), ε(P_i, A_i) refers to the error of learning P_i using the individual algorithm A_i. The following lemma upper bounds the final error ε(P, A) by the individual errors.

Lemma 2 (Error Upper Bound). Fix k > 1.
Then for any distribution family P and any decomposition operators {T_i}_{i∈[k−1]} applied on P, the predictor returned by A_CoT has an error upper bound

ε(P, A) ≤ c_k Σ_{i=1}^k ε(P_i, A_i), c_k = 2 max_j Π_{i=j+1}^k 2B² Lip(ϕ_i)², (11)

where c_k is a constant depending on the hypothesis classes H_i and the step number k.

Note that this result does not rely on the specific individual algorithm we choose in A_CoT, as long as the predictor for each step has a Lipschitz constant and the loss function is convex. The upper bound suggests that, to minimize ε(P, A), it suffices for each individual algorithm to minimize the error made at its corresponding step. In fact, it actually suffices to minimize the largest error made at the hardest reasoning step, argmax_i {ε(P_i, A_i)}, as revealed by the following result.

Theorem 3 (Improved Learnability with CoT). With CoT, any distribution family P is efficiently in-context learnable by the Transformer in Lemma 1 if there exists a finite sequence of decomposition operators {T_i}_{i∈[k−1]} such that each induced P_i is efficiently PAC learnable by the individual algorithm A_i. In particular, the sample complexity is N_A(δ, ϵ) = max_{i∈[k]} N_{A_i}(δ/k, ϵ/c_k), where N_{A_i} is the sample complexity for learning P_i by A_i.

Theoretical / Practical Implications. This result indicates that, regardless of how complex the original reasoning task P is, in order for a Transformer to efficiently learn this task, it is sufficient to make each subtask P_i efficiently learnable. In other words, CoT can reduce the difficulty of learning a task to the difficulty of learning the hardest subtask in the chain; here, the hardest step refers to the least sample-efficient one. This result also leads to a simple practical lesson for designing CoT: an effective way to form a CoT is to decompose the hardest reasoning step into smaller steps that are easier to learn. Such a result aligns with existing empirical practices of decomposing challenging tasks, e.g.
(Zhou et al., 2022; Khot et al., 2022; Zhang et al., 2022), and will be validated in greater depth by new experiments in Section 5.1. To understand this result, notice that in A_CoT each subtask shares the same sample size, and thus one has to choose this size based on the hardest step to ensure each individual step can be successfully learned, so that the overall task can be successfully learned. Another way to view this is through the lens of the error bound (Lemma 2): in the limit of large sample size N, the individual error at the step with the worst rate will dominate the overall error, regardless of the constant coefficient associated with each ε(P_i, A_i); therefore, the hardest step becomes the bottleneck.

Compounding Error Issue. Nevertheless, we note one caveat of Lemma 2: it is not asymptotic in k, which has been treated as a constant. Therefore, the result in this section does not hold if k scales up, in which case errors could accumulate over reasoning steps and grow exponentially in the horizon. This implies a trade-off between the step number and the hardness of subtasks: decomposing the hardest subtask improves learnability, but also introduces the risk of compounding error, which renders scaling up k a practical challenge despite the theoretical merit. Aligned with our theory, the compounding error issue has indeed been widely observed in practice, and approaches such as self-correction or refinement have been proposed to mitigate it, e.g. (Wang et al., 2023; Yao et al., 2024; Madaan et al., 2024). From our experiments in Section 5.1, we also find that when the hardness of each reasoning step is approximately the same, increasing k increases the overall task hardness and can hurt performance; however, for fixed overall task hardness, reducing the hardness of the hardest step can significantly improve performance, which aligns with Theorem 3.
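The compounding-error intuition can be made concrete with a small worst-case calculation (a toy sketch of ours mirroring the structure behind Lemma 2: an error made at step i is amplified by the Lipschitz constants of all downstream steps, so per-step constants above 1 compound exponentially in k):

```python
def worst_case_compound(lipschitz, step_errors):
    """Toy worst-case error propagation through h_k ∘ ... ∘ h_1:
    the error of step i passes through steps i+1..k, each of which can
    amplify it by at most its Lipschitz constant."""
    total = 0.0
    for i, err in enumerate(step_errors):
        amp = 1.0
        for L in lipschitz[i + 1:]:
            amp *= L            # amplification by downstream steps
        total += amp * err
    return total

# Equal per-step errors of 0.1, each step 2-Lipschitz:
bound3 = worst_case_compound([2.0] * 3, [0.1] * 3)  # 0.1 * (4 + 2 + 1)
bound6 = worst_case_compound([2.0] * 6, [0.1] * 6)  # grows with k
```

Holding the per-step error fixed while increasing k makes the total bound blow up, which is exactly the trade-off discussed above: finer decompositions ease each step but lengthen the chain the errors must survive.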
We also note that the required number of GD steps t for each reasoning step depends on the parameters ϵ and δ, and the depth of the constructed Transformer in Lemma 1 could increase for harder tasks.

4 HARDNESS OF LEARNING WITHOUT CHAIN-OF-THOUGHT

In this section, we present impossibility results demonstrating that there exist inherently hard tasks that are not learnable without CoT but learnable after being decomposed.

4.1 LOWER BOUND

We begin by presenting a general lower bound on the error ε(P, A). This bound applies both to the case where there is no CoT (k = 1) and to the case of CoT with one intermediate step (k = 2), where each demonstration is of the form (x, z, y). The proof is based on the limited power of the hypothesis class H to approximate the target functions in P (see Appendix A.4).

Theorem 4 (Error Lower Bound). For any distribution family P and decomposition operator T, suppose A_CoT returns a first-step predictor h_1 from a finite set H′_1 ⊆ H_1. Then the overall error has a lower bound given by¹

ε(P, A) ≥ 1/2 − B √(K |H′_1| Var(P)), (12)

where B and K are constants depending on the definition of the hypothesis class as per (7), and Var(P) is a certain quantity depending on P; we defer its definition to Appendix A.4.

We discuss the implications of this theorem in two separate cases: when CoT is used and when it is not.

w/ CoT. In this case, the bound is generally loose or even uninformative, because the term |H′_1| is typically expected to be large. This makes sense since, when CoT is used, the hypothesis class H = H_2 ∘ H_1 per equation (8) can potentially approximate a wide range of functions, analogous to what a two-layer neural network can approximate. Despite this, the bound is still useful for identifying cases where the CoT is suboptimal. For instance, consider a dummy CoT where the first step is the identity mapping (i.e., z = x), and assume the learner has successfully learned this step, which indicates that h_1 is also an identity mapping, and hence |H′_1| = 1.
In this case, the lower bound would increase and potentially become positive depending on Var(P), which will be discussed below, suggesting this dummy CoT is suboptimal.

w/o CoT. A more interesting scenario is when there is no CoT. In this case the lower bound from (12) reduces to 1/2 − B √(K Var(P)) (Malach & Shalev-Shwartz, 2022), quantifying how hard it is to approximate P using the hypothesis class H = {x ↦ Wϕ(x) : ‖W‖₂ ≤ B} per the definition in (7). The key quantity in the lower bound is Var(P), which indicates the intrinsic complexity of the task; the more complex P is, the smaller Var(P) is. In the next subsection, we will discuss learning a specific family of Boolean functions called parities; these functions underpin a concrete scenario whereby the task is unlearnable without CoT but becomes learnable when CoT is used.

4.2 ILLUSTRATIVE EXAMPLE: LEARNING PARITIES

Boolean functions are mappings from an input space X = {±1}^n of n binary bits to an output space Y = {±1}. In particular, parities are a family of functions that compute the exclusive-or (XOR) of bits at some predefined positions in the input, and are notoriously hard to learn (Kearns, 1998; Shalev-Shwartz et al., 2017; Daniely & Malach, 2020). The specific form of a parity function is determined by a subset S ⊆ [n]. For each S, the corresponding parity function is defined as χ_S(x) = Π_{i∈S} x[i], where x[i] is the i-th bit of the input. The distribution family is defined as

P_XOR(n) := {(χ_S, D) : S ⊆ [n]}, (13)

where D is a fixed input distribution, uniform over {±1}^n. In particular, applying the lower bound to learning parities, we have the following result:

Corollary 5 (Learning Parities).
For any hypothesis class defined in (7), there exists some sufficiently large $n \in \mathbb{N}$ such that $P_{XOR}(n)$ is not learnable by the corresponding algorithm $A_{CoT}$ when k = 1, but is learnable after using the following decomposition to form the CoT (k = 2):

1st Step: $z[i] = \chi_{1,S}(x)[i] = \begin{cases} x[i] & \text{for } i \in S \\ 1 & \text{for } i \notin S \end{cases}$,   2nd Step: $y = \chi_{2,S}(z) = \prod_i z[i]$, (14)

where the first step $\chi_{1,S}(x)$ learns to select relevant features from x while masking irrelevant ones, and the second step $\chi_{2,S}(z)$ computes the XOR of all bits in z.

The proof is deferred to Appendix A.5. Intuitively, parities are hard to learn without CoT because the $2^n$ target functions in the family form an orthogonal basis for the space of all Boolean functions. Consequently, a linear function class with a fixed feature space dimension K cannot approximate all parities, as this is as hard as approximating all Boolean functions. Concretely, we have $\mathrm{Var}(P_{XOR}(n)) = 2^{-n}$, and thus the lower bound becomes positive for sufficiently large n; this implies there always exists at least one target function in $P_{XOR}(n)$ and some $\epsilon_0 > 0$ such that $\epsilon \ge \epsilon_0$ holds, and thus the task is not learnable per the definition in Section 2, regardless of the sample size. CoT resolves this issue, since the first-step predictor with learnable weights can adapt to the relevant bits in the input x (determined by S), such that the second step only needs to learn a fixed XOR function. In particular, with CoT, both steps or subtasks become learnable by a linear function class with a fixed feature space dimension. Therefore, we have illustrated a setting where successful learning is impossible without CoT. The effectiveness of the designed CoT is verified in Section 5.2.

¹Note that in the case of no CoT, by default $|H'_1| = 1$ and the bound still applies.

Figure 2: Success rate (%) of (a) GPT-4o and (b) GPT-3.5-turbo for learning compositional functions; rows correspond to H = 1, ..., 8 (top to bottom) and columns to Hmax = 1, ..., 8 (entries with Hmax > H are undefined). H denotes the number of elementary functions used to construct the target function; Hmax denotes the maximal number of elementary functions used to construct a reasoning step.

(a) GPT-4o:
97
92 35
90 31 18
88 22 16  6
82 21 12  9  3
71 27 12  6  5  1
69 23  6  4  1  0  0
56 16  4  3  1  1  1  0

(b) GPT-3.5-turbo:
65
43 27
34 17 15
45 24 14  7
35 13  8  4  3
30  9  6  2  6  1
15 12  5  4  3  1  1
13  5  2  1  0  2  1  1

5 EXPERIMENTS

As empirical verification, we consider new arithmetic reasoning tasks and evaluate the performance of real-world LLMs, including GPT-4o and GPT-3.5-turbo. Sec. 5.1 studies the connection between the hardest step and the overall reasoning performance; Sec. 5.2 tests whether the specific forms of CoT we introduced can improve performance. See detailed experimental setups in Appendix C. Code is available at https://github.com/chr26195/CoT-ICL.

5.1 INCREASINGLY COMPLEX FUNCTIONS

Since benchmarks are lacking in which one can precisely control the hardness of tasks and CoT steps, we first present a method to construct such tasks by incrementally building upon challenging subtasks.

Constructing Highly Challenging Tasks. Consider a class of elementary functions $F_e$ where each function maps from the input space X to itself. In general, these elementary functions should be considered equally easy to learn. We then sample a sequence of these functions $f_1, f_2, \ldots, f_H \sim F_e$; composing them gives a target function $f = f_H \circ \cdots \circ f_2 \circ f_1 : X \to X$ whose complexity increases as H increases. We consider an instantiation in which the input space is the space of two integers, $x \in \mathbb{Z}^2$. The elementary functions are defined as choosing one integer and using it to perform a basic arithmetic operation (drawn from +, −, or ×) with the other.
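As a concrete illustration, this construction can be sketched in a few lines of Python. The helper names below are ours (not from the paper's released code); the worked values follow the paper's own H = 3 example.

```python
# A minimal sketch (function names are ours, not from the paper's code) of the
# Sec. 5.1 task construction: the state is a pair of integers z = (z[0], z[1]),
# and each elementary function updates one coordinate using the other.

def make_elementary(target, op):
    """Elementary function: z[target] <- op(z[target], z[1 - target])."""
    def f(z):
        z = list(z)
        z[target] = op(z[target], z[1 - target])
        return tuple(z)
    return f

OPS = [lambda a, b: a + b, lambda a, b: a - b, lambda a, b: a * b]
F_E = [make_elementary(t, op) for t in (0, 1) for op in OPS]  # the six maps in (15)

def compose(fs):
    """Compose elementary functions in application order: f = f_H o ... o f2 o f1."""
    def f(z):
        for g in fs:
            z = g(z)
        return z
    return f

# A step built from H = 3 elementary functions (the paper's worked example):
# f1: z[0] <- z[0] + z[1];  f2: z[1] <- z[1] * z[0];  f3: z[1] <- z[1] - z[0]
f1 = make_elementary(0, lambda a, b: a + b)
f2 = make_elementary(1, lambda a, b: a * b)
f3 = make_elementary(1, lambda a, b: a - b)
step = compose([f1, f2, f3])
print(step((2, 5)))  # → (7, 28): z[0] = 2+5 = 7, z[1] = (2+5)*(5-1) = 28
```

Note that the composed step matches the closed form $z_i[1] = (z_{i-1}[0] + z_{i-1}[1])(z_{i-1}[1] - 1)$, which is why a single "hard" step can encode a high-order polynomial of the inputs.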
Concretely, $F_e$ consists of

$z[0] \mapsto z[0] + z[1]$, $z[0] \mapsto z[0] - z[1]$, $z[0] \mapsto z[0] \times z[1]$, $z[1] \mapsto z[1] + z[0]$, $z[1] \mapsto z[1] - z[0]$, $z[1] \mapsto z[1] \times z[0]$. (15)

While each elementary function in (15) is simple, the overall target function f can become highly complex, possibly representing polynomial functions of z[0] and z[1] up to an arbitrary order and number of terms. Moreover, to quantify the hardest step, we do not reveal all intermediate steps of f in the demonstrations provided to LLMs. Instead, we stipulate that there exists at least one step $i \in [k]$ where the function from $z_{i-1}$ to $z_i$ is constructed from Hmax elementary functions, whereas all other steps use fewer of them. For example, given H = 3 elementary functions $f_1 : z[0] \mapsto z[0] + z[1]$, $f_2 : z[1] \mapsto z[1] \times z[0]$, and $f_3 : z[1] \mapsto z[1] - z[0]$, the hardest step can be expressed as

$f_3 \circ f_2 \circ f_1 : \quad z_i[0] = z_{i-1}[0] + z_{i-1}[1], \quad z_i[1] = (z_{i-1}[0] + z_{i-1}[1]) \times (z_{i-1}[1] - 1)$. (16)

Results. We test the performance of real-world LLMs on the reasoning task described above with respect to different overall hardness H and hardest step Hmax. We report their success rates across 100 i.i.d. sampled target functions for each H and Hmax.

(n, k)-Parities        (10,1) (10,2) (10,3) (10,4) (10,5) (10,6) (10,7) (10,8) (10,9) (10,10)
GPT-4o w/o CoT           87     69     63     54     51     48     50     52     47     51
GPT-4o w/ CoT            92     95     97     94     87     73     66     58     62     50
GPT-3.5-turbo w/o CoT    75     62     60     47     59     58     51     56     55     54
GPT-3.5-turbo w/ CoT     80     76     72     74     75     69     57     63     57     57

Table 1: Success rate (%) of GPT-4o and GPT-3.5-turbo for learning (n, k)-parities.

3-Term DNF             Width 3 Width 4 Width 5 Width 6 Width 7 Width 8 Width 9 Width 10
GPT-4o w/o CoT            85      81      77      73      68      66      62      74
GPT-4o w/ CoT             96      87      86      88      81      80      80      84
GPT-3.5-turbo w/o CoT     74      68      67      62      55      58      64      53
GPT-3.5-turbo w/ CoT      90      78      87      81      65      73      70      73

Table 2: Success rate (%) of GPT-4o and GPT-3.5-turbo for learning 3-term DNF.
For each target function, the LLMs are provided with 10 demonstrations and asked to infer the computation process and apply it to derive the output for an unseen input. As shown in Fig. 2 (with more results in Appendix C.1), the success rate of LLMs drops quickly as Hmax increases. In particular, GPT-4o can successfully learn the target function in most cases when H = 1; however, it performs significantly worse as Hmax increases from 1 to 4, and then fails as Hmax becomes even larger. These phenomena corroborate our result that reducing the complexity of the hardest step is critical to successfully handling the task.

5.2 CANONICAL BOOLEAN FUNCTIONS

We further evaluate LLMs on two families of Boolean functions: parities and disjunctive normal form (DNF) formulas, which are known to be hard to learn (Daniely & Vardi, 2021; Malach & Shalev-Shwartz, 2022).

Task Descriptions. An (n, k)-parity function computes the XOR (⊕) of a subset of k variables from a total of n input binary bits (n = 10 in our experiments). It outputs 1 if an odd number of the k relevant variables are 1, and 0 otherwise. Meanwhile, a DNF function is a disjunction (logical OR) of conjunctions (logical ANDs) of literals; in the experiments, we consider a family of 3-term DNFs $f(x) = \bigvee_{i=1}^{3} \bigwedge_{j=1}^{w} (x_{ij} \vee m_{ij})$, where w is the width and $m \in \{\pm 1\}^{3w}$ is a latent variable whose value determines the target function (i.e., $m_{ij} = 1$ invalidates $x_{ij}$).

Results. For each k in parities and w in DNFs, we similarly sample 100 i.i.d. target functions.² For each function, we provide the LLMs with 100 in-context examples, ask them to find patterns in these examples, and return the output for each query. Tables 1 and 2 report the success rates. In particular, for parities, we find that even GPT-4o generally performs no better than random guessing (with an expected accuracy of 50%) when k > 3.
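To make the two task families and the masking-based intermediate step concrete, here is a minimal sketch in the 0/1 encoding used by the experiments; the helper names are ours, not from the paper's code.

```python
from functools import reduce

# Hedged sketch (names are ours) of the two Boolean task families in Sec. 5.2,
# using the 0/1 encoding from the experiments.

def parity(x, S):
    """(n, k)-parity: XOR of the bits of x at the k positions in S."""
    return reduce(lambda a, b: a ^ b, (x[i] for i in S), 0)

def parity_cot_step(x, S):
    """Intermediate CoT step: keep relevant bits, mask irrelevant ones with 0
    (the XOR identity), so the 2nd step is a fixed XOR over all positions."""
    return [x[i] if i in S else 0 for i in range(len(x))]

def dnf3(x, m, w):
    """3-term DNF: OR over 3 terms, each an AND of w literals; m[i][j] = 1
    invalidates literal x[i][j] (forces it to true)."""
    return int(any(all(x[i][j] or m[i][j] for j in range(w)) for i in range(3)))

S = {1, 3, 6, 8}
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
z = parity_cot_step(x, S)
assert parity(z, range(len(x))) == parity(x, S)  # 2nd step recovers the label
print(parity(x, S))  # → 1 (XOR of x[1], x[3], x[6], x[8] = 0 ^ 1 ^ 1 ^ 1)
```

The assertion checks the key property of the decomposition: after masking, a single fixed XOR over all positions reproduces the target label, independent of S.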
Then, we provide the LLMs with CoT examples containing intermediate steps that are provably effective per our results in Section 4: for parities, the intermediate step is defined as z[i] = x[i] if i ∈ S, otherwise 0; for DNFs, z[i, j] = x[i, j] if m[i, j] = 0, otherwise 1. Results in Tables 1 and 2 clearly demonstrate that the designed CoT significantly improves performance; e.g., GPT-4o achieves an almost perfect success rate of 94% on (10, 4)-parities, while without CoT its success rate is 54%, close to random guessing.

6 CONCLUSION AND DISCUSSION

In this paper, we quantify the benefits of task decompositions within the setting of learning tasks in-context by Transformers. A limitation of our work is that, despite quantifying the potential for a Transformer to efficiently learn a complex task via CoT, it remains unclear on a case-by-case basis whether real-world LLMs will actually achieve success in practice. This is because of confounding issues related to model training procedures, including dataset properties, the actual optimization process, and fine-tuning. Hence an interesting future direction is to delve more deeply into these issues.

²For parities, we sample from a uniform distribution; for DNF, we sample from a non-uniform distribution to ensure the label (0/1) is balanced for w ≥ 3.

Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. arXiv preprint arXiv:2306.00297, 2023.

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. International Conference on Learning Representations, 2023.

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al.
Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Yingyi Chen, Qinghua Tao, Francesco Tonin, and Johan Suykens. Primal-attention: Self-attention through asymmetric kernel SVD in primal representation. Advances in Neural Information Processing Systems, 36, 2024.

Xiang Cheng, Yuxin Chen, and Suvrit Sra. Transformers implement functional gradient descent to learn non-linear functions in context. International Conference on Machine Learning, 2024.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. International Conference on Learning Representations, 2021.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 4005–4019, 2023.

Amit Daniely and Eran Malach. Learning parities with neural networks. Advances in Neural Information Processing Systems, 33:20356–20365, 2020.

Amit Daniely and Gal Vardi. From local pseudorandom generators to hardness of learning. In Conference on Learning Theory, pp. 1358–1394. PMLR, 2021.

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang.
Towards revealing the mystery behind chain of thought: A theoretical perspective. Advances in Neural Information Processing Systems, 36, 2024.

Deqing Fu, Tian-Qi Chen, Robin Jia, and Vatsal Sharan. Transformers learn higher-order optimization methods for in-context learning: A study with linear models. arXiv preprint arXiv:2310.17086, 2023a.

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations, 2023b.

Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.

Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, and Jason D Lee. How well can transformers emulate in-context Newton's method? arXiv preprint arXiv:2403.03183, 2024.

Tianyu Guo, Wei Hu, Song Mei, Huan Wang, Caiming Xiong, Silvio Savarese, and Yu Bai. How do transformers learn in-context beyond simple functions? A case study on learning with representations. International Conference on Learning Representations, 2024.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.

Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.

Juno Kim and Taiji Suzuki. Transformers learn nonlinear features in context: Nonconvex mean-field dynamics on the attention landscape.
International Conference on Machine Learning, 2024.

Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pp. 19565–19594. PMLR, 2023a.

Yingcong Li, Kartik Sreenivasan, Angeliki Giannou, Dimitris Papailiopoulos, and Samet Oymak. Dissecting chain-of-thought: A study on compositional in-context learning of MLPs. arXiv preprint arXiv:2305.18869, 2023b.

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. International Conference on Learning Representations, 2024.

Aman Madaan and Amir Yazdanbakhsh. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.

Eran Malach and Shai Shalev-Shwartz. When hardness of approximation meets hardness of learning. Journal of Machine Learning Research, 23(91):1–24, 2022.

BK Natarajan. Learning from exercises. In Proceedings of the 1989 Workshop on Computational Learning Theory, pp. 72–86, 1989.

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.

Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. Random feature attention. arXiv preprint arXiv:2103.02143, 2021.

Akshara Prabhakar, Thomas L Griffiths, and R Thomas McCoy.
Deciphering the factors influencing the efficacy of chain-of-thought: Probability, memorization, and noisy reasoning. arXiv preprint arXiv:2407.01687, 2024.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.

Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.

Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7, 2021.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, pp. 3067–3075. PMLR, 2017.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Prasad Tadepalli. Learning to solve problems from exercises. Computational Intelligence, 24(4):257–291, 2008.

Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: A unified understanding of transformer's attention via the lens of kernel. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019.

Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman.
Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 2024.

Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pp. 35151–35174. PMLR, 2023.

Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Max Vladymyrov, Razvan Pascanu, et al. Uncovering mesa-optimization algorithms in transformers. arXiv preprint arXiv:2309.05858, 2023.

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001, 2022.

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Noam Wies, Yoav Levine, and Amnon Shashua. Sub-task decomposition enables learning in sequence to sequence tasks. arXiv preprint arXiv:2204.02892, 2022.

Matthew A Wright and Joseph E Gonzalez. Transformers are deep infinite-dimensional non-Mercer binary kernel machines. arXiv preprint arXiv:2106.01506, 2021.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.

Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Advances in Neural Information Processing Systems, 2023.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

1 Introduction
  1.1 Contributions
  1.2 Related Work
2 Preliminaries
3 Improved Learnability by Task Decomposition
  3.1 Learning Algorithms Enabled by Chain-of-Thoughts
  3.2 Expressiveness of Transformers
  3.3 Effects of Task Decomposition
4 Hardness of Learning without Chain-of-Thought
  4.1 Lower Bound
  4.2 Illustrative Example: Learning Parities
5 Experiments
  5.1 Increasingly Complex Functions
  5.2 Canonical Boolean Functions
6 Conclusion and Discussion
A Proofs
  A.1 Lemma 1: Expressiveness of Transformers
  A.2 Lemma 2: Upper Bound
  A.3 Theorem 3: CoT Improves Learnability
  A.4 Theorem 4: Lower Bound
  A.5 Corollary 5: Learning Parities
B Background: Connection between Attention and Kernel
C Experimental Details and Additional Results
  C.1 Increasingly Complex Functions
  C.2 Canonical Boolean Functions

A.1 LEMMA 1: EXPRESSIVENESS OF TRANSFORMERS

Given demonstrations $\{(z_0^{(j)}, z_1^{(j)}, \ldots, z_{k-1}^{(j)}, z_k^{(j)})\}_{j=1}^{N}$, we can create k training sets, each of which defines a loss function quantifying the error of a particular predictor $h_i$. These loss functions are

$\Big\{ L_i = \frac{1}{2} \sum_{j=1}^{N} \big\| h_i(z_{i-1}^{(j)}) - z_i^{(j)} \big\|_2^2 : i \in [k] \Big\}$. (17)

Note that while the overall predictor

$h = h_k \circ \cdots \circ h_2 \circ h_1 = W_k \phi_k(\cdots(W_2 \phi_2(W_1 \phi_1(x))))$ (18)

is a non-linear function, the loss functions in (17) are convex with respect to the weights of their corresponding linear predictors $\{W_i : i \in [k]\}$. Let us also denote $h'_{k'} = h_{k'} \circ \cdots \circ h_2 \circ h_1$ for $k' \le k$. Using gradient descent to minimize $L_i$ with fixed step size η induces the following training dynamics in weight space:

$W_i \leftarrow W_i - \eta \nabla_{W_i} L_i = W_i + \eta \, (Z_i - h_i(Z_{i-1})) \, \phi_i(Z_{i-1})^\top$ (19)

where $h_i(Z_{i-1}) = [h_i(z_{i-1}^{(j)})]_{j \in [N]} \in \mathbb{R}^{d(Z_i) \times N}$ and $\phi_i(Z_{i-1}) = [\phi_i(z_{i-1}^{(j)})]_{j \in [N]} \in \mathbb{R}^{K \times N}$. The difference between our setup and the conventional supervised learning setup is that the latter is interested in the variation of the output under different weights, whereas here we are also interested in the dynamics of intermediate steps. In particular:

For $i' < i$, the update on $W_i$ leaves the intermediate steps unchanged,

$h'_{i'}(x^{(N+1)}) \leftarrow h'_{i'}(x^{(N+1)})$, (20)

namely the variation of upper-layer weights does not affect lower-layer representations (i.e., intermediate steps).
For $i' = i$, the dynamics is

$h'_i(x^{(N+1)}) = W_i \phi_i(h'_{i-1}(x^{(N+1)})) \leftarrow (W_i - \eta \nabla_{W_i} L_i) \, \phi_i(h'_{i-1}(x^{(N+1)})) = h'_i(x^{(N+1)}) + \eta \, (Z_i - h_i(Z_{i-1})) \, \phi_i(Z_{i-1})^\top \phi_i(h'_{i-1}(x^{(N+1)}))$. (21)

Letting $\kappa_i$ be the kernel function defined by the feature map $\phi_i$, we have

$h'_i(x^{(N+1)}) \leftarrow h'_i(x^{(N+1)}) + \eta \, (Z_i - h_i(Z_{i-1})) \, \kappa_i(Z_{i-1}, h'_{i-1}(x^{(N+1)}))$. (22)

For $i' > i$, the dynamics is

$h'_{i'}(x^{(N+1)}) = h_{i'} \circ \cdots \circ h_{i+1} \circ h'_i(x^{(N+1)}) \leftarrow h_{i'} \circ \cdots \circ h_{i+1} \big( h'_i(x^{(N+1)}) + \eta \, (Z_i - h_i(Z_{i-1})) \, \kappa_i(Z_{i-1}, h'_{i-1}(x^{(N+1)})) \big)$, (23)

which is in general intractable since $h_{i'} \circ \cdots \circ h_{i+1}$ is non-linear. However, if the upper-layer weights in $h_{i'}, \ldots, h_{i+1}$ are 0, then $h'_{i'}(x^{(N+1)})$ becomes 0 as well and we can circumvent (23). Recall also that, based on the definition in Section 2, the self-attention layer can be written as

(Self-Attention) $\quad e^{(N+1)} \leftarrow e^{(N+1)} + W_V E \, \sigma\big(E^\top W_K^\top W_Q \, e^{(N+1)}\big)$, (24)

where $e = (x, z_1, \ldots, z_{k-1}, z_k) \in \mathbb{R}^d$ at the input layer, with $d = \sum_{i=0}^{k} d(Z_i)$. In the following construction, we show that in the forward pass of a Transformer, (24) can express the dynamics of all intermediate steps, including (20), (22) and (23), based on a certain order in which minimization of the losses in (17) is performed. In particular, in our construction, the Transformer sequentially minimizes the loss functions in (17): the lower layers of the Transformer learn earlier reasoning steps, while the upper layers learn later reasoning steps.

Expanded Embedding Construction. Let $d = \sum_{i=0}^{k} d(Z_i)$ be the total dimension of the original demonstration $(z_0, z_1, \ldots, z_k)$, where $z_0 = x \in Z_0$ and $z_k \in Z_k$, and let $d' = 2d - d(Z_0) - d(Z_k)$ be the dimension of our expanded embedding. We construct a projection matrix $P \in \mathbb{R}^{d' \times d}$ to map each demonstration into an expanded embedding that keeps track of inputs and residuals in a single vector. Concretely, P is partitioned as

$P = \begin{pmatrix} I_{d(Z_0)} & 0 & 0 & \cdots & 0 \\ 0 & -I_{d(Z_1)} & 0 & \cdots & 0 \\ 0 & I_{d(Z_1)} & 0 & \cdots & 0 \\ 0 & 0 & -I_{d(Z_2)} & \cdots & 0 \\ 0 & 0 & I_{d(Z_2)} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & -I_{d(Z_k)} \end{pmatrix}$ (25)

where each $\pm I_{d(Z_i)}$ denotes a $d(Z_i) \times d(Z_i)$ identity block (or negative identity block). Applying this P to a vector $(z_0, z_1, \ldots, z_k) \in \mathbb{R}^d$ yields an expanded embedding

$e = P(z_0, z_1, \ldots, z_k) \in \mathbb{R}^{d'}$. (26)

At initialization, each $h_i(z_{i-1})$ is zero, so the rows with $-I_{d(Z_i)}$ produce residuals $h_i(z_{i-1}) - z_i$, which initially reduce to $0 - z_i = -z_i$. Hence, for a demonstration $(x, z_1, z_2, \ldots, z_k)$ with $x = z_0$, we explicitly obtain

$e = \big(x,\; h_1(x) - z_1,\; z_1,\; \ldots,\; h_{k-1}(z_{k-2}) - z_{k-1},\; z_{k-1},\; h_k(z_{k-1}) - z_k\big) \in \mathbb{R}^{d'}$. (27)

Similarly, for the query $x^{(N+1)} \in Z_0$, we have the expanded embedding

$e^{(N+1)} = \big(x^{(N+1)},\; h'_1(x^{(N+1)}),\; 0,\; \ldots,\; h'_{k-1}(x^{(N+1)}),\; 0,\; h'_k(x^{(N+1)})\big) \in \mathbb{R}^{d'}$, (28)

reflecting that for the query we store the prediction $h'_i(x^{(N+1)})$ at each layer in its own coordinate, with zeros in place of $z_i$ or differences as needed to run gradient-based updates in the Transformer. Here we leverage the fact that $z_1, \ldots, z_{k-1}, z_k$ are all-0 vectors for the query since they are unknown.

Weight Matrices Construction. For the layers that minimize the i-th step's loss function $L_i$, we construct the Transformer weights in the corresponding self-attention layer as follows:

$W_V = \begin{pmatrix} 0_{d_l(i)} & 0 & 0 \\ 0 & -\eta I_{d(Z_i)} & 0 \\ 0 & 0 & 0_{d_r(i)} \end{pmatrix}$, with $d_l(i) = 2\sum_{j=0}^{i-1} d(Z_j) - d(\mathcal{X})$, $\; d_r(i) = 2\sum_{j=i}^{k} d(Z_j) - d(Z_k) - d(Z_i)$,

selecting the residual coordinates in the expanded embedding (i.e., $z_i - h_i(z_{i-1})$). This extracts those residuals, multiplies them by η, and writes them into the sublayer prediction coordinate. We let $W_K$ and $W_Q$ select the appropriate coordinates for $z_{i-1}$ (from demonstrations) and $h'_{i-1}(x^{(N+1)})$ (from the query):

$W_K = \begin{pmatrix} 0_{\tilde{d}_l(i)} & 0 & 0 \\ 0 & I_{d(Z_{i-1})} & 0 \\ 0 & 0 & 0_{\tilde{d}_r(i)} \end{pmatrix}$, $\quad W_Q = \begin{pmatrix} 0_{\tilde{d}'_l(i)} & 0 & 0 \\ 0 & I_{d(Z_{i-1})} & 0 \\ 0 & 0 & 0_{\tilde{d}'_r(i)} \end{pmatrix}$,

where $\tilde{d}_l(i) = 2\sum_{j=0}^{i-1} d(Z_j) - d(\mathcal{X}) - d(Z_{i-1})$, $\; \tilde{d}_r(i) = 2\sum_{j=i}^{k} d(Z_j) - d(Z_k)$, (32)

and similarly $\tilde{d}'_l(i)$, $\tilde{d}'_r(i)$ define the row/column slices for the query coordinate. Stacking t self-attention layers (each with the above $W_V$, $W_K$, $W_Q$ for subtask i) carries out t gradient steps on $L_i$. Finally, iterating over $i = 1, \ldots$
, k in ascending order completes minimization of all losses in equation (17), yielding the final predictor that includes all intermediate and final predictions. The extension to kernel regression follows from Proposition 2 in Von Oswald et al. (2023) and Proposition 1 in Cheng et al. (2024). Refer also to Appendix B for a discussion of the connection between kernels and attention.

A.2 LEMMA 2: UPPER BOUND

Given a target function f and an input distribution D(x), the algorithm A returns a predictor h based on N i.i.d. samples, whose expected error is defined as $E_{x \sim D}[l(h(x), f(x))]$. Let us first consider a single step of task decomposition.

Lemma 6. For any distribution family P and decomposition operator T, the predictor returned by $A_{CoT}$ on demonstrations sampled from the corresponding distributions in $P_1$ and $P_2$ has the error upper bound

$\epsilon(P, A) \le 2 \max\{1, c_{B,\phi}\} \big( \epsilon(P_1, A_1) + \epsilon(P_2, A_2) \big)$, (33)

where $c_{B,\phi} = B^2 \mathrm{Lip}(\phi)^2$ is a constant determined by the hypothesis class H in (7).

Proof. For a family of distributions P, the error is defined as

$\epsilon(P, A) \triangleq \max_{(f,D) \in P} E_{x \sim D}[l(h(x), f(x))]$, (34)

where we slightly abuse notation here, as h is also dependent on the distribution (f, D) and the learning algorithm. For a given decomposition operator T, the target function can be expressed as $f = f_2 \circ f_1$ and the predictor as $h = h_2 \circ h_1$. We have

$\epsilon(P, A) = \max_{(f,D) \in P} E_{x \sim D}\Big[\tfrac{1}{2}\big((h_2 \circ h_1)(x) - (h_2 \circ f_1)(x) + (h_2 \circ f_1)(x) - f(x)\big)^2\Big]$. (35)

Suppose the feature map $\phi(x)$ for $h_2$ has Lipschitz constant

$\mathrm{Lip}(\phi) = \sup_{x \ne x'} \frac{\|\phi(x) - \phi(x')\|_2}{\|x - x'\|_2}$; (36)

then, by Jensen's inequality, we have

$\epsilon(P, A) \le \max_{(f,D) \in P} E_{x \sim D}\Big[\big((h_2 \circ h_1)(x) - (h_2 \circ f_1)(x)\big)^2 + \big((h_2 \circ f_1)(x) - f(x)\big)^2\Big]$ (37)

$\le \max_{(f,D) \in P} E_{x \sim D}\Big[B^2 \mathrm{Lip}(\phi)^2 \|h_1(x) - f_1(x)\|_2^2 + (h_2(z) - f_2(z))^2\Big]$ (38)

$\le B^2 \mathrm{Lip}(\phi)^2 \max_{(f,D) \in P} E_{x \sim D}\big[\|h_1(x) - f_1(x)\|_2^2\big] + \max_{(f,D) \in P} E_{x \sim D}\big[(h_2(z) - f_2(z))^2\big]$, (39)

where $B^2 \mathrm{Lip}(\phi)^2$ is a constant determined by the definition of the hypothesis class. Given the decomposition operator T, the distribution family can be decomposed into $P_1$ and $P_2$.
It follows that

$\epsilon(P, A) \le 2 B^2 \mathrm{Lip}(\phi)^2 \, \epsilon(P_1, A_1) + 2 \epsilon(P_2, A_2)$. (40)

Letting $c_{B,\phi} = \max\{1, B^2 \mathrm{Lip}(\phi)^2\}$, we get the desired upper bound.

To extend this lemma to the case with more reasoning steps, i.e., k > 2, let

Individual Error: $\epsilon(P_i, A_i) = \epsilon_i \triangleq \max_{(f,D) \in P_i} E_{z_{i-1} \sim D}[l(h_i(z_{i-1}), z_i)]$

Accumulated Error: $\epsilon(P, \{A_j\}_{j \in [i]}) = \bar{\epsilon}_i \triangleq \max_{(f,D) \in P} E_{x \sim D}[l((h_i \circ \cdots \circ h_1)(x), z_i)]$.

Here $\epsilon_i$ is the error of learning an individual reasoning step i, consistent with its definition in Section 2, and $\bar{\epsilon}_i$ is the accumulated error from the first step up to step i. Note that we slightly abuse notation since $\epsilon(P, A)$ is equivalent to $\bar{\epsilon}_k$; we also have $\bar{\epsilon}_1 = \epsilon_1$. Therefore,

$\bar{\epsilon}_k = \max_{(f,D) \in P} E_{x \sim D}\Big[\tfrac{1}{2}\big\|(h_k \circ \cdots \circ h_1)(x) - z_k\big\|_2^2\Big]$ (43)

$= \max_{(f,D) \in P} E_{x \sim D}\Big[\tfrac{1}{2}\big\|(h_k \circ \cdots \circ h_1)(x) - h_k(z_{k-1}) + h_k(z_{k-1}) - z_k\big\|_2^2\Big] \le 2 B^2 \mathrm{Lip}(\phi_k)^2 \, \bar{\epsilon}_{k-1} + 2 \epsilon_k$ (44)

$\le \sum_{j=1}^{k} \Big( \prod_{i=j+1}^{k} 2 B^2 \mathrm{Lip}(\phi_i)^2 \Big) 2 \epsilon_j$, (45)

where we use the lemma to pass from (43) to (44) and unroll the recursion to obtain (45). Therefore, we have the upper bound

$\epsilon(P, A) = \bar{\epsilon}_k \le c_k \sum_{i=1}^{k} \epsilon(P_i, A_i)$, (46)

where $c_k = 2 \max_j \prod_{i=j+1}^{k} 2 B^2 \mathrm{Lip}(\phi_i)^2$ is a constant depending on the hypothesis classes $H_i$ and k. Note that in the derivation we could remove $\max_{(f,D) \in P}$ and directly analyze the expected error $E_{x \sim D}[l(h(x), f(x))]$, which gives a similar result: the final error is upper bounded by the sum of the individual errors up to a constant.

A.3 THEOREM 3: COT IMPROVES LEARNABILITY

Recall the definition of in-context learnability: we say a parametric model M can learn a task P if there exists a learning algorithm A and a function $N : (0,1)^2 \to \mathbb{N}$ such that for any confidence and accuracy parameters $\delta, \epsilon \in (0, 1)$:

1. Under a certain parameter choice θ, we have $M_\theta(\cdot, x) = A(\cdot)(x)$ for any query x.

2. Given $N_A(\delta, \epsilon)$ i.i.d. examples, the learning algorithm A returns a predictor h such that, with probability at least 1 − δ, $\epsilon(P, A) \le \epsilon$.

Moreover, we say P can be efficiently in-context learned by M if both the running time of A and the sample size $N_A(\delta, \epsilon)$ are polynomial in $\delta^{-1}$ and $\epsilon^{-1}$.
To see how this can be reduced to the learnability of subtasks after using CoT, recall the definition of PAC learnability:

Definition 3. We say a subtask $P_i$ can be efficiently learned by an algorithm $A_i$ if its sample complexity and time complexity scale as $\mathrm{poly}(\delta_i^{-1}, \epsilon_i^{-1})$, where $\delta_i$ and $\epsilon_i$ are the confidence and accuracy parameters for the subtask.

Since by Lemma 1 we know there exists a parameter choice such that $TF_\theta(\cdot, x) = A_{CoT}(\cdot)(x)$, our goal is to show: if every subtask $P_i$ is efficiently learnable by its corresponding individual algorithm $A_i$, then the overall task is also efficiently learnable by $A_{CoT}$. By Lemma 2, we know that for fixed k, the accuracy parameter ϵ of the overall task is upper bounded by the sum of the accuracy parameters of the subtasks $\{\epsilon_i\}_{i=1}^k$ up to a constant coefficient. Additionally, notice that, given n (not necessarily independent) events $A_1, A_2, \ldots, A_n$, each occurring with probability $P(A_i) = 1 - \delta_i$, we have (by the union bound)

$P(A_1 \cap A_2 \cap \cdots \cap A_n) \ge 1 - \sum_{i=1}^{n} \delta_i$. (47)

This means the confidence parameter of the overall task δ is also upper bounded by the sum of the confidence parameters of all subtasks $\{\delta_i\}_{i=1}^k$, namely

$\delta \le \sum_{i=1}^{k} \delta_i$. (48)

In terms of the time complexity of $A_{CoT}$, since each individual algorithm $A_i$ runs in time polynomial in $\delta_i^{-1}$ and $\epsilon_i^{-1}$, we have $\mathrm{Time}(A_i) = \mathrm{poly}(\delta_i^{-1}, \epsilon_i^{-1})$. Moreover, one can choose $\delta_1 = \delta_2 = \cdots = \delta_k = \delta'$ and $\epsilon_1 = \epsilon_2 = \cdots = \epsilon_k = \epsilon'$ with $\delta' = \delta/k$ and $\epsilon' = \epsilon/c_k$. Since $A_{CoT}$ is a simple combination of all individual algorithms $\{A_i\}_{i=1}^k$, we have

$\mathrm{Time}(A_{CoT}) = \sum_{i=1}^{k} \mathrm{Time}(A_i) \le \mathrm{poly}(\delta^{-1}, \epsilon^{-1})$, (49)

meaning the learning algorithm is also computationally efficient. In terms of the sample complexity of $A_{CoT}$, we again let $\delta_1 = \cdots = \delta_k = \delta' = \delta/k$ and $\epsilon_1 = \cdots = \epsilon_k = \epsilon' = \epsilon/c_k$. Note that in the case of CoT, all individual algorithms use the same number of samples as the algorithm $A_{CoT}$. To show the algorithm is sample-efficient, we can simply choose $N_A(\delta, \epsilon)$ to be the largest among $\{N_{A_i}(\delta', \epsilon')\}_{i=1}^k$, i.e.,
$N_A(\delta, \epsilon) = \max_{i\in[k]} N_{A_i}(\delta', \epsilon')$, (50)
which ensures that $A_{\mathrm{CoT}}$ can achieve high accuracy with high probability. Since each $N_{A_i}(\delta_i, \epsilon_i)$ is polynomial in $\delta^{-1}$ and $\epsilon^{-1}$, it holds that $N_A(\delta, \epsilon)$ is also polynomial in $\delta^{-1}$ and $\epsilon^{-1}$. However, if $N_A(\delta, \epsilon) < \max_{i\in[k]} N_{A_i}(\delta', \epsilon')$, there is no guarantee that each step can successfully learn its corresponding reasoning step. This also shows that the sample efficiency of the overall task is dictated by the hardest step in the CoT. Since $A_{\mathrm{CoT}}$ can efficiently learn $P$, and a Transformer with linear depth and constant embedding size can express $A_{\mathrm{CoT}}$, the proof is complete.

A.4 THEOREM 4: LOWER BOUND

The in-context learning error is lower bounded by the approximation error, i.e., the minimum error achievable by a predictor in the hypothesis class $H = \{h_2 \circ h_1 : h_1 \in H'_1, h_2 \in H_2\}$:
$\varepsilon(P, A) = \max_{(f,D)\in P} \mathbb{E}_{x\sim D}\big[\ell(h(x), f(x))\big] \ge \max_{(f,D)\in P} \min_{(h_1,h_2)\in H'_1\times H_2} \mathbb{E}_{x\sim D}\big[\ell(h(x), f(x))\big]$. (51)
Thus it suffices to lower bound the approximation error. To do so, notice that the learned first-step predictor $h_1$ comes from a finite function class $H'_1$, which is a subset of the initial hypothesis class $H_1$. We show that the approximation power of $h(x)$ with finite-sized $H'_1$ is lower bounded by that of a linear class whose size depends on $|H'_1|$. In particular, suppose the hypothesis class is
$H'_1 = \{h_{1,1}, h_{1,2}, \dots, h_{1,|H'_1|}\}$, (52)
and, based on the index $j$ of $h_1$ in $H'_1$, the predictor $h(x)$ can be re-written as
$h(x, j) = W_2 (\phi \circ h_{1,j})(x) = \sum_{i=1}^{K} W_{2,i}\, \phi_{i,j}(x)$ (53)
$= \sum_{i=1}^{K} \sum_{i'=1}^{|H'_1|} U_{i,i'}\, \phi_{i,i'}(x)$, (54)
where $\phi_{i,j} = \phi_i \circ h_{1,j}$ and $U_{i,i'} = W_{2,i}$ if $i' = j$ and $0$ otherwise. As (54) is an inner product between a weight vector $U \in \mathbb{R}^{K|H'_1|}$ and a feature vector $\phi(x) \in \{\pm 1\}^{K|H'_1|}$ in an expanded space, $h(x, j)$ reduces to a linear function. Since the squared loss $\ell(h(x, j), f(x))$ is convex w.r.t. $U$ for arbitrary $x$ and $j$, its expectation $L_{f,D}(h) \triangleq \mathbb{E}_{x\sim D}[\ell(h(x, j), f(x))]$ is also convex w.r.t. $U$. Moreover, $h(x, j) = 0$ when $W_2$, and hence $U$, goes to $0$.
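The rewrite from (53) to (54) can be checked concretely. The sketch below uses random toy values for $W_2$ and the features at a single input $x$ (all names are illustrative): selecting predictor index $j$ and embedding $W_2$ into the $j$-th block of $U$ reproduces $h(x, j)$ as a linear function in the expanded $K|H'_1|$-dimensional space.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 4                                # feature dim K and class size |H'_1|
W2 = rng.normal(size=K)
Phi = rng.choice([-1, 1], size=(K, M))     # Phi[i, j] = phi_{i,j}(x) at a fixed x
j = 2                                      # index of the selected first-step predictor

h_direct = W2 @ Phi[:, j]                  # h(x, j) per (53)

U = np.zeros((K, M))
U[:, j] = W2                               # expanded weights per (54): j-th block = W2
h_linear = np.sum(U * Phi)                 # <U, phi(x)> in the K*M-dimensional space

assert np.isclose(h_direct, h_linear)
```

The two evaluations agree exactly, which is why the convexity argument can be run entirely in the expanded linear parameterization $U$.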
Therefore, given any $(f, D) \in P$ and any predictor $h$ with fixed $W_2$ and $j$ (and thus fixed $U$), the first-order condition of convexity gives
$L_{f,D}(h) \ge L_{f,D}(0) + \langle U - 0, \nabla_U L_{f,D}(h)|_{U=0}\rangle$ (55)
$\ge \tfrac{1}{2} - \|U\|_2\, \big\|\nabla_U L_{f,D}(h)|_{U=0}\big\|_2$, (56)
where the last inequality uses the fact $L_{f,D}(0) = \mathbb{E}_{x\sim D}[\ell(0, f(x))] = \tfrac{1}{2}$ and the inequality $\langle v, u\rangle \ge -\|u\|_2\|v\|_2$. Notice $U$ has the same norm as $W_2$, and thus $\|U\|_2 \le B$. Moreover, defining $\Delta(f, D) \triangleq \big\|\nabla_U L_{f,D}(h)|_{U=0}\big\|_2^2$, we have
$\Delta(f, D) = \big\|\nabla_U\, \mathbb{E}_{x\sim D}[\ell(h(x, j), f(x))]\,|_{U=0}\big\|_2^2$ (57)
$= \big\|\mathbb{E}_{x\sim D}[\nabla_U\, \ell(h(x, j), f(x))\,|_{U=0}]\big\|_2^2$ (58)
$= \big\|\mathbb{E}_{x\sim D}\big[\nabla_U\, \tfrac{1}{2}(\langle U, \phi(x)\rangle - f(x))^2\,|_{U=0}\big]\big\|_2^2$ (59)
$= \big\|\mathbb{E}_{x\sim D}[\phi(x) f(x)]\big\|_2^2$ (60)
$= \sum_{i=1}^{K} \sum_{j=1}^{|H'_1|} \mathbb{E}_{x\sim D}[\phi_{i,j}(x) f(x)]^2$. (61)
Substituting this into (56) gives, for any $(f, D) \in P$ and any predictor $h(x)$ obtained from in-context learning,
$\min_{(h_1,h_2)\in H'_1\times H_2} L_{f,D}(h) \ge \tfrac{1}{2} - B\, \Delta(f, D)^{\frac{1}{2}}$. (62)
Following from (51), the in-context learning error has lower bound
$\varepsilon(P, A) \ge \max_{(f,D)\in P} \min_{(h_1,h_2)\in H'_1\times H_2} L_{f,D}(h)$ (63)
$\ge \max_{(f,D)\in P} \Big(\tfrac{1}{2} - B\, \Delta(f, D)^{\frac{1}{2}}\Big)$ (64)
$\ge \mathbb{E}_{(f,D)\sim P}\Big[\tfrac{1}{2} - B\, \Delta(f, D)^{\frac{1}{2}}\Big]$ (65)
$= \tfrac{1}{2} - B\, \mathbb{E}_{(f,D)\sim P}\big[\Delta(f, D)^{\frac{1}{2}}\big]$. (66)
By noting $\mathbb{E}[X^{\frac{1}{2}}] \le \mathbb{E}[X]^{\frac{1}{2}}$, we have
$\varepsilon(P, A) \ge \tfrac{1}{2} - B\, \mathbb{E}_{(f,D)\sim P}[\Delta(f, D)]^{\frac{1}{2}}$ (67)
$= \tfrac{1}{2} - B\Big(\sum_{i=1}^{K}\sum_{j=1}^{|H'_1|} \mathbb{E}_{(f,D)\sim P}\big[\mathbb{E}_{x\sim D}[\phi_{i,j}(x) f(x)]^2\big]\Big)^{\frac{1}{2}}$. (68)
Let
$\mathrm{Var}(P) \triangleq \sup_\phi\, \mathbb{E}_{(f,D)\sim P}\big[\mathbb{E}_{x\sim D}[\phi(x) f(x)]^2\big]$, (69)
which is determined by the target functions and distributions in $P$, and can be understood as the intrinsic complexity of the distribution family (i.e., the more complex $P$ is, the smaller $\mathrm{Var}(P)$ is). It follows that
$\sum_{i=1}^{K}\sum_{j=1}^{|H'_1|} \mathbb{E}_{(f,D)\sim P}\big[\mathbb{E}_{x\sim D}[\phi_{i,j}(x) f(x)]^2\big]$ (70)
$\le K|H'_1|\,\mathrm{Var}(P)$, (71)
so that $\varepsilon(P, A) \ge \tfrac{1}{2} - B\sqrt{K|H'_1|\,\mathrm{Var}(P)}$, completing the proof.

A.5 COROLLARY 5: LEARNING PARITIES

w/o CoT. The distribution family of parities with input size $n$ is defined as
$P_{\mathrm{XOR}}(n) \triangleq \{(\chi_S, D) : S \subseteq [n]\}$ where $\chi_S(x) = \prod_{i\in S} x[i]$. (72)
Based on (69), we can derive an upper bound on $\mathrm{Var}(P)$ for parities, i.e.
$\mathrm{Var}(P_{\mathrm{XOR}}(n)) = \sup_\phi\, \mathbb{E}_{(\chi,D)\sim P_{\mathrm{XOR}}(n)}\big[\mathbb{E}_{x\sim D}[\phi(x)\chi(x)]^2\big]$ (73)
$= \sup_\phi \frac{1}{|P_{\mathrm{XOR}}(n)|} \sum_{S\subseteq[n]} \mathbb{E}_{x\sim D}[\phi(x)\chi_S(x)]^2$, (74)
where $|P_{\mathrm{XOR}}(n)| = 2^n$, with $2^{[n]}$ denoting the power set of $[n]$. Note that $\{\chi_S : S \subseteq [n]\}$ forms a Fourier basis of Boolean functions, meaning that for any pair of distinct subsets $S_1 \ne S_2 \subseteq [n]$, we have
$\mathbb{E}_{x\sim D}[\chi_{S_1}(x)\chi_{S_2}(x)] = 0$. (75)
Therefore $\mathbb{E}_{x\sim D}[\phi(x)\chi_S(x)]$ is exactly the Fourier coefficient of the Boolean function $\phi(x)$ corresponding to $S$, which we denote as $\hat\phi(S)$. Therefore
$\mathrm{Var}(P_{\mathrm{XOR}}(n)) = \sup_\phi \frac{1}{2^n} \sum_{S\subseteq[n]} \hat\phi(S)^2 \le \frac{1}{2^n}$, (76)
where the inequality follows from Parseval's identity, $\sum_{S} \hat\phi(S)^2 = \mathbb{E}_x[\phi(x)^2] \le 1$. Since, when there is no CoT, the error lower bound is
$\varepsilon(P_{\mathrm{XOR}}(n), A) \ge \tfrac{1}{2} - B\sqrt{K\,\mathrm{Var}(P_{\mathrm{XOR}}(n))} = \tfrac{1}{2} - B\sqrt{\tfrac{K}{2^n}}$, (77)
we can choose $n > 2 + \log_2(B^2 K)$ such that this lower bound is always positive (regardless of the sample size).

w/ CoT. Consider the following CoT:
1st Step: $z[i] = \chi_{1,S}(x)[i] = \begin{cases} x[i] & \text{for } i \in S \\ 1 & \text{for } i \notin S \end{cases}$, 2nd Step: $y = \chi_{2,S}(z) = \prod_i z[i]$. (78)
We show each step can be approximated by the hypothesis class defined in (7),
$H_i = \{z_{i-1} \mapsto W_i\,\phi_i(z_{i-1}) : \|W_i\|_2 \le B\}$, (79)
where $\phi_i : Z_{i-1} \to \mathbb{R}^K$ is a non-linear feature map to a $K$-dimensional space, and $W_i \in \mathbb{R}^{d(Z_i)\times K}$ are learnable weights. We do this by construction. For the first-step predictor $\chi_{1,S}(x)$, we let $\phi_1(x) = (x, 1)$ be the concatenation of $x$ and $1$; for the second-step predictor $\chi_{2,S}(z)$, we let $\phi_2(z) = \prod_i z[i]$. With these features, both steps can be learned by a simple linear model $Wx$, which also means the expected error can be made arbitrarily small with a sufficiently large sample size. The sample complexity follows the standard analysis for linear regression. In contrast, when there is no CoT, $\varepsilon(P, A)$ has a positive lower bound, meaning there always exists at least one $(f, D) \in P$ for which the expected error exceeds that bound.

B BACKGROUND: CONNECTION BETWEEN ATTENTION AND KERNEL

As background knowledge for understanding the construction of Transformers in Lemma 1, here we provide a non-exhaustive summary of the connections between the attention matrix in Transformers and the kernel method from previous works (Von Oswald et al., 2023; Cheng et al., 2024; Guo et al., 2024; Tsai et al., 2019; Wright & Gonzalez, 2021; Chen et al., 2024).
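Before turning to this background, note that the Fourier computation behind (73)–(74) in Appendix A.5 is easy to verify numerically: over the uniform distribution on $\{\pm 1\}^n$, the parities $\chi_S$ are orthonormal, so the squared Fourier coefficients of any $\pm 1$-valued function sum to one (Parseval), and averaging over all $2^n$ parities gives exactly $2^{-n}$. A minimal check follows; the random function $\phi$ is an arbitrary illustration.

```python
import itertools
import numpy as np

n = 4
X = np.array(list(itertools.product([-1, 1], repeat=n)))   # all 2^n inputs
rng = np.random.default_rng(0)
phi = rng.choice([-1, 1], size=len(X))                     # arbitrary ±1-valued function

coeffs = []
for r in range(n + 1):
    for S in itertools.combinations(range(n), r):
        chi_S = np.prod(X[:, list(S)], axis=1)             # parity chi_S(x); empty S -> 1
        coeffs.append(np.mean(phi * chi_S))                # Fourier coefficient phi_hat(S)

parseval = sum(c ** 2 for c in coeffs)     # Parseval: equals E[phi(x)^2] = 1
var_bound = parseval / 2 ** n              # average over all 2^n parities: exactly 2^{-n}
assert np.isclose(parseval, 1.0)
```

Since the enumeration over inputs is exhaustive, the identity holds exactly rather than approximately, matching the $2^{-n}$ rate used in the lower bound for parities.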
We rewrite the kernel gradient descent dynamics and the Transformer layer here for reference:
(GD Dynamics) $h(x^{(N+1)}) \leftarrow h(x^{(N+1)}) + \eta\, (Y - \hat{Y})^\top \phi(X)\, \phi(x^{(N+1)})$, (80)
where $\phi$ is the feature map inducing the kernel, $Y - \hat{Y}$ are the residuals, and $\hat{Y}$ are the predictions;
(Transformer Layer) $e^{(N+1)} \leftarrow e^{(N+1)} + W_V E\, \sigma(E^\top W_K^\top W_Q\, e^{(N+1)})$, (81)
where $W_V E\,\sigma(\cdot)$ is the attention module, $E$ is the embedding matrix, and the addition implements the skip connection. Here, the kernel is induced by $\kappa(x, y) = \phi(x)^\top \phi(y)$. The definition of $\sigma$ can be flexibly chosen depending on the practical implementation of Transformers.

In-Context Learning. Von Oswald et al. (2023) (Proposition 1) demonstrate that, in the simplest case where both the non-linearity $\sigma$ in the Transformer and the feature map $\phi$ are identity mappings, the weight construction
$W_V = \begin{pmatrix} 0_{d(X)} & 0 \\ 0 & \eta I_{d(Y)} \end{pmatrix}$ and $W_K^\top W_Q = \begin{pmatrix} I_{d(X)} & 0 \\ 0 & 0_{d(Y)} \end{pmatrix}$
yields
$E^\top W_K^\top W_Q\, e^{(N+1)} = X^\top x^{(N+1)}$, $\quad W_V E = \eta\,(0_{d(X)}, Y - \hat{Y})$. (82)
In this case, $\kappa(x, y) = x^\top y$ is the inner-product kernel. Their Proposition 2 further discusses a case that accommodates the role of the MLP module in the Transformer architecture. Specifically, the MLP module transforms the token $e$ as $\mathrm{MLP}_\theta(e)$, and thus, with the same $W_V, W_K, W_Q$ and identity mapping $\sigma$, the kernel is
$\kappa(x, y) = \mathrm{MLP}_\theta(x)^\top \mathrm{MLP}_\theta(y)$. (83)
Guo et al. (2024) also leverage the representation power of MLPs and consider the case where $\phi : \mathbb{R}^d \to \mathbb{R}^K$ is a fixed representation function, which can in principle be chosen arbitrarily as long as the features can be represented by an MLP. Theorem 1 of that paper provides a construction where $\sigma$ is the normalized ReLU and proves that Transformers can perform in-context ridge regression. Following this work, Kim & Suzuki (2024) more recently concluded that the MLP layer can extend the class of functions learnable by ICL to the Barron space. Cheng et al. (2024) consider the case where $\sigma$ is non-linear, and their Proposition 1 proves that if the non-linearity $\sigma$ matches the kernel $\kappa$, Transformers can perform functional (kernel) gradient-descent regression w.r.t.
the reproducing kernel Hilbert space metric of this kernel; the proof is exactly based on the relation between (80) and (81). For example, they demonstrate that softmax attention corresponds to the exponential kernel
$\kappa(x, y) = \exp\big(\tfrac{1}{\sigma^2}\, x^\top y\big)$. (84)
One caveat is that the attention matrix in this case is asymmetric; to address this, one can treat the normalization term as a pre-conditioner of the optimization algorithm and modify the construction of the Transformer accordingly. We refer readers to their paper for further discussion.

General Discussions. Even before ICL, the connection between kernels and attention was widely discussed (e.g., Tsai et al., 2019; Wright & Gonzalez, 2021; Chen et al., 2024, and references therein). For instance, Wright & Gonzalez (2021) prove that the standard attention matrix is a reproducing kernel for a reproducing kernel Banach space and give an explicit formulation of the feature maps (their Proposition 1). Their Theorem 2 further demonstrates that Transformers can learn any binary non-Mercer reproducing kernel Banach space pair.

In practice, variants of Transformers with kernelized attention (e.g., Choromanski et al., 2021; Katharopoulos et al., 2020; Wang et al., 2020; Peng et al., 2021), such as those relying on random Fourier features, are popular and effective. These implementations are often adopted for efficiency: once the attention is kernelized, one can switch the order of matrix multiplication to accelerate the computation and reduce the time complexity from quadratic to linear w.r.t. the input length.

C EXPERIMENTAL DETAILS AND ADDITIONAL RESULTS

C.1 INCREASINGLY COMPLEX FUNCTIONS

Sampling.
For this task, we sample a sequence of functions from a set of elementary functions with the following probabilities:
$P(z[0] \leftarrow z[0] + z[1]) = \tfrac{1}{8}$, $P(z[0] \leftarrow z[0] - z[1]) = \tfrac{1}{8}$, $P(z[0] \leftarrow z[0] \cdot z[1]) = \tfrac{1}{4}$,
$P(z[1] \leftarrow z[1] + z[0]) = \tfrac{1}{8}$, $P(z[1] \leftarrow z[1] - z[0]) = \tfrac{1}{8}$, $P(z[1] \leftarrow z[1] \cdot z[0]) = \tfrac{1}{4}$.
Here, the multiplication operation is more likely to be sampled than addition or subtraction. For the input values, we uniformly select two unique integers from the set $\{2, 3, \dots, 10\}$. This range is chosen to avoid excessively large numbers, which are difficult to handle, and trivial cases, such as $x[0] = x[1]$, which could result in zero values that are easily predictable by LLMs. For this task, we explicitly ensure that each sample is unique to prevent repeated samples across training and testing. This avoids scenarios where LLMs could simply memorize the results from the demonstrations and use them to answer the query.

CoT Prompting. Given a certain $H_{\max}$, namely the maximal number of elementary functions composing a reasoning step, we implement it by randomly masking $H - 1$ consecutive intermediate steps. For example, when $T = 6$ and $H = 3$, an example prompt is given as follows:

Given two numbers, sequentially apply predefined arithmetic operations (addition, subtraction, multiplication) to transform them. Each step involves a specific predefined operation on one of the numbers. If any operations or intermediate results are missing, deduce these to complete the transformation and arrive at the final output.
Input: 7, 5
Step1: 7, -2
Step2: 5, -2
Step3: missing
Step4: missing
Step5: 5, 75
Output: -70, 75
Input: 2, 3
Step1: 2, 1
Step2: 3, 1
Step3: missing
Step4: missing
Step5: 3, 36
Output: -33, 36
Input: 5, 8
What is the output? Your answer should end in the format 'Step1: ?, Step2: ?, ..., Output: ?'.

Note that the missing steps are consistent within one trial. We test 100 times to compute the success rate.

Additional Results.
In addition to reporting the success rate of LLMs for predicting the final output, as presented in Section 5, we also evaluate their success rate for predicting intermediate steps. This provides a more comprehensive assessment of LLM performance, since a model might fail to predict the final step yet still succeed in predicting the intermediate steps.

[Figure 3: Success rate (%) for predicting the last step (namely the output); heatmap panels, with panel (b) showing GPT-3.5-turbo.]

[Figure 4: Success rate (%) for predicting the second-to-last step; heatmap panels, with panel (b) showing GPT-3.5-turbo.]

[Figure 5: Success rate (%) for predicting the third-to-last step; heatmap panels, with panel (b) showing GPT-3.5-turbo.]
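For concreteness, the sampling procedure for this task (described under C.1 above) can be sketched as follows. This is an illustrative reconstruction; the $1/4$ weight on multiplication is an inferred value chosen so that the six probabilities sum to one, and all function and variable names are ours.

```python
import random

# (target slot, operation, probability); multiplication is weighted more heavily
OPS = [(0, '+', 1/8), (0, '-', 1/8), (0, '*', 1/4),
       (1, '+', 1/8), (1, '-', 1/8), (1, '*', 1/4)]

def sample_chain(T, rng=random):
    z = rng.sample(range(2, 11), 2)            # two unique integers from {2, ..., 10}
    steps = [tuple(z)]
    for tgt, op, _ in rng.choices(OPS, weights=[p for *_, p in OPS], k=T):
        other = z[1 - tgt]
        z[tgt] = z[tgt] + other if op == '+' else \
                 z[tgt] - other if op == '-' else z[tgt] * other
        steps.append(tuple(z))
    return steps                                # steps[0] = input, steps[-1] = output

chain = sample_chain(6, random.Random(0))       # a length-6 chain: 7 states in total
```

Masking $H - 1$ consecutive entries of `steps` (marking them "missing") then yields prompts of the form shown in the CoT Prompting paragraph.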
[Figure 6: Success rate (%) for predicting the fourth-to-last step; heatmap panels, with panel (b) showing GPT-3.5-turbo.]

[Figure 7: Success rate (%) for predicting the fifth-to-last step; heatmap panels, with panel (b) showing GPT-3.5-turbo.]

[Figure 8: Success rate (%) for predicting the sixth-to-last step; heatmap panels, with panel (b) showing GPT-3.5-turbo.]

C.2 CANONICAL BOOLEAN FUNCTIONS

Examples of standard and CoT prompts for (10, 4)-parity and DNF with width 6 are given as follows:

Standard prompt for parity:
Predict the output based on a pattern in the input binary string.
Input: 1010101100 Output: 0
Input: 0100000011 Output: 0
Input: 0110010000 Output: 1
Input: 1010100001 Output: 0
Input: 1010100001 Output: 0
Input: 1011000111
What is the output? Directly answer the question in the format 'Output:'.
CoT prompt for parity:
Replace some bits located at specific predefined positions in the binary string with 0 to form a new string. Then, based on some patterns in the new string, predict the output.
Input: 1000010001 New string: 0000010001 Output: 0
Input: 0100110111 New string: 0000110001 Output: 1
Input: 0101001000 New string: 0001000000 Output: 1
Input: 0100000010 New string: 0000000000 Output: 0
Input: 1011010000 New string: 0001010000 Output: 0
Input: 0100100110
What is the output? Directly answer the question in the format 'New string:, Output:'.

Standard prompt for DNF:
Predict the output based on a pattern in the input binary string.
Input: 001010 000001 010011 Output: 1
Input: 100111 011001 010111 Output: 1
Input: 010010 011010 101011 Output: 1
Input: 010101 001101 001001 Output: 1
Input: 111110 011001 010111 Output: 1
Input: 010001 110101 011011
What is the output? Directly answer the question in the format 'Output:'.

CoT prompt for DNF:
Replace some bits located at specific predefined positions in the binary string with 1 to form a new string. Then, based on some patterns in the new string, predict the output.
Input: 000101 011111 011010 New string: 100111 111111 011111 Output: 1
Input: 001101 100111 000101 New string: 101111 110111 000101 Output: 0
Input: 100001 011001 001010 New string: 100011 111011 001111 Output: 0
Input: 001100 010100 101011 New string: 101111 110110 101111 Output: 0
Input: 010101 111011 100101 New string: 110111 111011 100101 Output: 0
Input: 000100 011110 100000
What is the output? Directly answer the question in the format 'New string:, Output:'.
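The parity CoT demonstrations above can be generated programmatically. The sketch below assumes, consistent with the demonstrations shown and with the 0/1 analogue of the decomposition in (78), that the first step zeroes out the bits outside the hidden support $S$ and that the label is the parity (sum mod 2) of the surviving bits; the function and variable names are illustrative.

```python
import random

def parity_cot_step(x_bits, S):
    """First CoT step (0/1 analogue of (78)): zero out bits outside S;
    the second step is the parity (sum mod 2) of the surviving bits."""
    new = ''.join(b if i in S else '0' for i, b in enumerate(x_bits))
    return new, sum(int(b) for b in new) % 2

rng = random.Random(0)
n, k = 10, 4
S = set(rng.sample(range(n), k))                   # hidden support of the parity
x = ''.join(rng.choice('01') for _ in range(n))    # random input string
new_string, label = parity_cot_step(x, S)
```

One can verify against the demonstrations in the prompt: for instance, masking 1011010000 down to 0001010000 leaves two ones, giving output 0 as shown.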