# ehrenfeuchthaussler_rank_and_chain_of_thought__c36445fc.pdf

Ehrenfeucht-Haussler Rank and Chain of Thought

Pablo Barcel o * 1 2 3 Alexander Kozachinskiy * 3 Tomasz Steifer * 4

The notion of rank of a Boolean function has been a cornerstone in PAC learning, enabling quasipolynomial-time learning algorithms for polynomial-size decision trees. We present a novel characterization of rank, grounded in the well-known Transformer architecture. We show that the rank of a function f corresponds to the minimum number of Chain of Thought (Co T) steps required by a single-layer Transformer with hard attention to compute f. Based on this characterization we establish tight bounds on the number of Co T steps required for specific problems, showing that ℓ-fold function composition necessitates exactly ℓCo T steps. Furthermore, we analyze the problem of identifying the position of the k-th occurrence of 1 in a Boolean sequence, proving that it requires k Co T steps.

1. Introduction

Ehrenfeucht & Haussler introduced the notion of the rank of a Boolean function and showed that, for any constant r, the class of Boolean functions with rank at most r is properly PAC-learnable in polynomial time. As a corollary, they derived their renowned quasipolynomial-time PAC-learning algorithm for polynomial-size decision trees. Pudl ak & Impagliazzo further characterized the rank not only for Boolean functions but also for Boolean relations through Prover-Delayer games. Since its introduction, this concept has played a significant role in proof complexity (Kullmann, 1999; Esteban & Tor an, 2003).

In this paper, we present a new characterization of the notion of rank. Surprisingly, this characterization is grounded in the

*Equal contribution 1Institute for Mathematical and Computational Engineering, Pontifical Catholic University of Chile 2Millennium Institute for Foundational Research on Data (IMFD Chile) 3National Center for Artificial Intelligence (CENIA Chile) 4Institute of Fundamental Technological Research, Polish Academy of Sciences. Correspondence to: Alexander Kozachinskiy <alexander.kozachinskyi@cenia.cl>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Transformer architecture (Vaswani et al., 2017), which has recently revolutionized the field of NLP and facilitated the development of LLMs. In essence, we show that the rank of a function f corresponds to the minimum number of Chain of Thought (Co T) steps required by a single-layer Transformer to compute f. The Transformers used in our characterization are based on the hard attention mechanism a theoretical abstraction of the soft attention mechanism employed in practice. Hard attention has been widely used in theoretical studies (Hahn, 2020b; Hao et al., 2022b; Barcel o et al., 2024; Yang et al., 2024) due to its amenability to formal analysis, while still effectively capturing the essence of practical models (Clark et al., 2019; Voita et al., 2019).

The Transformer architecture is built upon attention layers and a decoder. An attention layer performs attention on the input sequence, mapping a sequence of input vectors to another sequence of vectors of the same length. Attention layers are used to generate vector representations of sentences in natural language. However, a more common application of Transformers is sequence generation, where the input sequence is mapped to an unbounded sequence of output vectors, generated iteratively, one at a time. This task is carried out by the decoder. In the first iteration, the decoder processes the input sequence through the attention layers and outputs the vector in the last position. This output is then appended to the input sequence. During subsequent iterations, the decoder applies its attention layers to the extended sequence, computes the next output, and appends it to the sequence. These are the Co T steps mentioned earlier (Merrill & Sabharwal, 2024; Liu et al., 2024).

Below we summarize our main results:

We show that the rank of a function f, denoted by rk(f), is the minimal number of iterations of a singlelayer decoder with one hard-attention head that computes f. We establish our result not only for Boolean functions, generalizing the notion of the rank to the non-Boolean case (as far as we know, for the first time).

In practice, Transformers are equipped with multiple attention heads, which enhance their computational capabilities. We show that the ability of such Transformers to compute functions can also be characterized using the notion of rank. Specifically, we define the

Ehrenfeucht-Haussler Rank and Chain of Thought

H-head rank of a function f, denoted as rk(H)(f), for H 1. We prove that rk(H)(f) equals the minimum number of iterations required by a single-layer decoder with H hard-attention heads to compute f.

We then explore methods for obtaining tight bounds on the multi-head rank. We begin by observing that rk(H)(f) is at most a factor of H smaller than rk(f). While computing rk(f) is typically straightforward, it does not always provide an accurate bound for rk(H)(f). To address this limitation, we propose a general communication complexity lower bound for rk(H)(f). Using this technique, we derive a tight bound on the H-head rank for the t-fold iterated composition, a function whose complexity has been previously studied for single-layer decoders with soft attention (Peng et al., 2024). The function t-Comp takes as input a sequence of n integers from {1, . . . , n}, interpreted as the values of a function ϕ: {1, . . . , n} {1, . . . , n}. The output of t-Comp is the value of ϕ, composed with itself t times, evaluated at 1.

It is easy to see that rk(t-Comp) t for any input length n. A decoder, establishing this upper bound works by computing ϕ(1) in the first iteration, then ϕ(ϕ(1)) in the second iteration, and so on. We prove that this is optimal even if we increase the number of attention heads. Namely, for any H, we show that rk(H)(t-Comp) = t for all large enough input lengths.

Finally, we study the k-th One function. This function takes as input a Boolean sequence of length n, and it returns the position of the k-th one in it. It is easy to see that rk(k-th One) k for any input length. In terms of decoders, in the first iteration we can compute the position of the first one, then of the second one in the second iteration, and so on. We prove that for any H and for large enough n, we have rk(H)(k-th One) = k, showing that even increasing the number of attention heads we cannot improve upon the trivial solution for large enough input lengths. Interestingly, this result cannot be obtained via the communication complexity techniques used for iterated composition. Instead, our proof relies on a purely combinatorial argument.

Related work. Numerous studies have sought to explore the expressive power of Transformers by treating them as a computational model and investigating what they can compute (Hahn, 2020a; P erez et al., 2021; Hao et al., 2022a; Angluin et al., 2023; Chiang et al., 2023; Merrill & Sabharwal, 2023; Barcel o et al., 2024; Merrill & Sabharwal, 2024; Liu et al., 2024; Yang & Chiang, 2024; Peng et al., 2024). In particular, several works have investigated how the capability of decoders depends on the number of iterations. To start with, P erez et al. showed that decoders

based on hard attention with an unbounded number of iterations are capable of computing any decidable language (with the parameters of the decoder not depending on the input length). Afterwards, the computation power of decoders with polynomially many iterations was addressed. Merrill & Sabharwal have shown that in the uniform-regime (when, as in (P erez et al., 2021), parameters do not depend on the input length), such decoders with constant number of layers and softmax attention are capable of computing any polynomial-time language. Similarly, for the non-uniform regime, (Liu et al., 2024) have shown that such decoders are capable of computing any language recognizable by a polynomial-size family of Boolean circuits.

Our result is the first exact characterization of the expressive power of decoders with a given fixed number of iterations, although just for a single layer and for hard attention. Recently, Peng et al. have shown that any single-layer decoder with soft attention requires Ω(t) iterations to compute t-Comp for t = p

n/(d Hp), where n is the input length, d is the dimension of vectors, H is the number of attention heads, and p is the number of bits of precision. We point out that our results instead do not require any assumptions on the dimension and the number of bits of precision.

Organization of the paper. An introduction to decision trees and the notion of rank is found in Section 2, with basic concepts of Transformers being discussed in Section 3. The main results about single-head Transformers are presented in Section 4, with extensions to multi-head Transformers covered in Section 5. Final remarks are given in Section 6. Missing proofs can be found in the ar Xiv version of this paper.

2. Decision Trees and Rank

Consider n + 1 finite sets Σ1, . . . , Σn, O, for n > 0. We are interested in decision trees that compute functions:

f : Σ1 Σ2 . . . Σn O.

To do this, we consider decision trees over arbitrary families of queries, where a query is a function q whose domain is Σ1 . . . Σn. We write Im(q) for the image of query Q. If F is a set of queries, a decision tree over F is a rooted tree T such that:

Every non-leaf node v is labeled by some query qv F and has exactly |Im(qv)| out-going edges, each one of them labeled by a different element from Im(qv).

Every leaf ℓis labeled by some element oℓ O.

Given an input w = (σ1, . . . , σn) Σ1 . . . Σn, the output of decision tree T on w is computed by descending

Ehrenfeucht-Haussler Rank and Chain of Thought

from the root to one of the leaves. At each intermediate nonleaf node v, the tree computes the value qv( w) Im(qv) and descends to the unique child of v that is linked to v through an edge labeled q( w). In this way, we reach some leaf ℓ, where T outputs the element oℓas its result on w. We denote this output as T( w).

The function f : Σ1 . . . Σn O is computed by T, if T( w) = f( w) for every input w Σ1 . . . Σn.

Boolean case. Decision trees are often defined for Boolean functions, i.e., functions of the form f : {0, 1}n {0, 1}. In our notation, this corresponds to the case Σ1 = . . . = Σn = O = {0, 1}. Boolean decision trees are decision trees over a family {p1, . . . , pn} of queries, where for i = 1, . . . , n the function pi : {0, 1}n {0, 1} is defined as follows on input (b1, . . . , bn) {0, 1}n:

pi(b1, . . . , bn) = bi.

That is, at every node, a Boolean decision tree queries the value of some coordinate of the input.

Ehrenfeucht & Haussler defined the rank of a Boolean decision tree T by inductively defining the rank of its nodes as follows:

the rank of a leaf is 0, and

the rank of a non-leaf v, whose two children have ranks r0, r1, is r = max{min{r0, r1} + 1, max{r0, r1}}.

The rank of T is then the rank of its root, and the rank of a Boolean function f : {0, 1}n {0, 1} is the minimum rank of a Boolean decision tree that computes f.

Rank in the non-boolean case and a-queries. We extend the notion of rank to the non-Boolean case through decision trees over assignment queries. We start by introducing some terminology. Pairs of the form (i, σ), where i [n] and σ Σi, are called assignments. We denote by

A = {1} Σ1 {n} Σn

the set of assignments. An assignment (i, σ) is consistent with an input w = (σ1, . . . , σn) Σ1 . . . Σn if and only if σi = σ. By a permutation of a finite set B we mean a bijection τ : {1, . . . , |B|} B.

An assignment query (a-query from now on) is a function of the form qτ : Σ1 . . . Σn A, where τ is a permutation of the set of assignments A. For w Σ1 . . . Σn, we let k w be the minimal element k {1, . . . , |A|} such that τ(k) is consistent with w. We then define qτ( w) = τ(k w).

It is sometimes convenient to view the computation of an a-query qτ on an input w as follows. Assume that

τ(j) = (ij, σj), for each j = 1, . . . , |A|. Imagine that we do not know w, and we start asking a person who knows w questions: is the i1-th letter of w equal to σ1? , is the i2-th letter of w equal to σ2? , and so on. We stop once we receive the first YES answer. If this happens at the kth step, we return qτ( w) = (ik, σk).

We define the rank of an arbitrary function f : Σ1 . . . Σn O in terms of the class of decision trees over assignment queries that compute f.

Definition 2.1. Let f : Σ1 . . . Σn O. We define rk(f) as the minimal depth of a decision tree over a-queries that computes f.

As we show below, the notion of rank we have just introduced for arbitrary functions aligns, in the case of Boolean functions, with the definition we previously provided for that class of functions.

Proposition 2.2. For any Boolean function f : {0, 1}n {0, 1}, its rank, as defined by Ehrenfeucht and Haussler, is equal to rk(f).

An example: Iterated composition We consider the iterated composition function. We use a notation [n] = {1, . . . , n} for n N. For positive integer numbers t, n, we define:

t-Compn : [n]n [n],

t-Compn : (f(1), . . . , f(n)) 7 f(f(. . . f | {z } t times

A clarification for the second line: an input to t-Compn is an n-length word, where every letter is a number from 1 to n. This input is interpreted as a function f : [n] [n], with f(1) being the first letter of the word, f(2) being the second letter of the word, and so on. Sometimes, we also use the following notation:

f (ℓ) = f f . . . f | {z } ℓtimes .

In particular, we let f (0) be the identity function.

We claim that the rank of t-Compn does not exceed t. Recall that the input is interpreted as a word (f(1), . . . , f(n)), for some f : [n] [n], and our task is to compute f (t)(1). Consider a decision tree that first tries to guess the value of the first letter, that is, of f(1) by going is f(1) = 1? , is f(1) = 2? , and so on. Once the tree gets it right, receiving the first YES-answer, it already knows f(1), and now it starts guessing the f(1)st letter, that is, f (2)(1) = f(f(1))). It costs the second YES-answer to get it right. Continuing in this way, the tree will find out f (t)(1) after t YES-answers.

By means of a combinatorial argument, it is possible to show that this is the best one can do if n is large enough.

Ehrenfeucht-Haussler Rank and Chain of Thought

Proposition 2.3. For any t and for all n > 2t, we have rk(t-Compn) = t.

Proof. Assume for contradiction that we have a decision tree T of depth t 1 over a-queries for t-Compn, for some n > 2t. We start answering questions for T, descending to one of its leafs, in the following manner. We maintain a set F [n] of forbidden numbers . Initially, F = {1}. When we receive an a-query with a permutation τ of assignments, we select the first assignment (i, j) such that j / F and f(i) is not fixed yet. We fix f(i) = j and continue along the tree as if this was the first consistent assignment. After that, we put j into F. Note that after k values of f have been fixed this way, F consists of precisely k + 1 distinct elements. Indeed, every a-query we consider adds exactly one new element to F.

Let ℓdenote the leaf of T where we come in this way by answering a-queries. Suppose that oℓ [n] is the value that T outputs in this leaf. We obtain a contradiction by showing that some function g: [n] [n] with g(t)(1) = oℓalso gets to ℓ.

Observe that, since T is of depth t 1, there are k t 1 a-queries on the path to ℓand the same number of values of f have been fixed:

f(i1) = j1, f(i2) = j2, . . . , f(ik) = jk. (1)

Note that i1, . . . , ik are distinct because we never fix the same value twice. Numbers j1, . . . , jk are distinct too, and they define the evolution of the set F. Initially, F = {1} after the first a-query, F = {1, j1} after the second a-query, F = {1, j1, j2} after the third one, and so on.

Take any y [n] \ {1, i1, . . . , ik, j1, . . . , jk, oℓ} (it exists because n > 2t 2(k+1)). Define a function g: [n] [n] by

g(i1) = j1, . . . , g(ik) = jk,

g(x) = y for x [n] \ {i1, . . . , ik}.

We first show that g arrives to ℓin T. For that, we show that g is consistent with all answers to questions on the path to ℓ. All the assignments corresponding to our answers to a-queries on the path to ℓare as in (1), and g is consistent with all of them by definition. Next, take an assignment (i, j) and suppose it appears at the m-th a-query along the path to ℓ, ordered before the assignment (im, jm) (which we chose to be the first consistent one). Hence, in our descent along the tree we ignored this assignment and decided to fix the assignment (im, jm) instead. Hence, we need to observe that g is not consistent with it, that is, that g(i) = j. Indeed, we could have ignored (i, j) in two cases. Firstly, it could have happened that g(i) was already fixed to some value different to j. Secondly, we could have ignored it when g(i)

was not yet fixed, because j already belonged to the set of forbidden numbers F. But by definition of g that means that either g(i) = y or g(i) = js for some s > m. The first case is not possible since y was chosen to be outside of F, and the second case gives us g(i) = j.

To finish the proof, we show that g(t)(1) = y. Consider a directed graph with vertex set {1, . . . , n}, where for every i {1, . . . , n} there is a directed edge from i to g(i). The image of the function g consists of j1, . . . , jk and y. In the graph, these are the only nodes with incoming edges. Observe that each of j1, . . . , jk has exactly one incoming edge. Namely, for s = 1, . . . , k, the node js has a unique incoming edge from is. To compute g(t)(1), we start moving from 1 along the edges for t steps. We will be moving over j1, . . . , jk and y. Note that g(y) = y because y / {i1, . . . , ik}. Hence, it is enough to show that y is reached from 1 in at most t steps because then we stay at y forever. Now, if we do not reach y within the first t steps, then we travel over j1, . . . , jk for t steps. Since k t 1, it means that we come into some of j1, . . . , jk two times, but this would mean that one of them has two distinct incoming edges, which is impossible.

An example: Position of the k-th one. We define a function k-th Onen : {0, 1}n [n + 1] such that:

k-th Onen(σ1, . . . , σn) =

min ({n + 1} {i [n] : σ1 + . . . + σi = k}) .

In other words, given w = (σ1, . . . , σn) {0, 1}n, the function k-th Onen returns the position of the k-th one in w (counting from the left). If there are fewer than k ones in w, we return n + 1. We can then show the following by means of a combinatorial argument:

Proposition 2.4. For any n, k, we have rk(k-th Onen) k, and for n k2 + k, we have rk(k-th Onen) = k.

3. Attention Layers and Decoders

Attention layer. We consider layers with unique hard attention, and possibly multiple attention heads, where the output of the layer is computed in the last token. By unique hard attention we refer to the mechanism in which each position attends to the element with the highest attention score (breaking ties arbitrarily).

Formally, a unique hard-attention layer (or, simply, attention layer) with H heads and embedding dimension d is a function L: (Rd) Rd, which is defined by

H query matrices Q(h) Rd d and H key matrices K(h) Rd d, for h = 1, . . . , H,

two matrices W1, W2 Rd d, and

Ehrenfeucht-Haussler Rank and Chain of Thought

a matrix WO Rd (d H).

Consider an input sequence of vectors (x1, . . . , xm) (Rd)m. The output of L on (x1, . . . , xm) is computed as follows. For every h = 1, . . . , H, we compute the value of the h-th head on (x1, . . . , xm), which is a vector from Rd

denoted by headh Rd. Namely, we start by computing attention scores

a(h) i,m = K(h)xi, Q(h)xm , (2)

defining, for every i = 1, . . . , m, the attention from the last token to the i-th token with respect to the h-th head. The vector K(h)xi is called the key of the i-th token, and the vector Q(h)xm is called the query of the m-th token.

For every h = 1, . . . , H, we let ih {1, . . . , m} to be the index maximizing (2). If there are multiple indices achieving the maximum, we let ih be the leftmost one. We then set headh = xih, for h = 1, . . . , H, and define:

multihead = WO

head1 ... head H

Finally, we define:

L(x1, . . . , xm)

= W2 Re LU (W1(multihead + xm)) Rd. (4)

Recall that Re LU(x) = max {0, x}, for every x R, and if x Rd then Re LU(x) is obtained by applying Re LU to each one of its components.

Decoders. A decoder, defined by the d-dimensional attention layer L, is a function that takes on input a sequence of vectors (x1, . . . , xm) (Rd)m and in the output produces an infinite sequence of vectors {yt Rd} t=1, defined by:

y1 = L(x1, . . . , xm),

yt = L(x1, . . . , xm, y1, . . . , yt 1), t 2.

That is, the decoder works in iterations: first, it computes the output of L, adds it to the end of the input sequence, computes the output of L on the new sequence, adds this output to the end, and so on. We refer to yt as the output of the decoder after t iterations (sometimes these iterations are called chain of thought steps ).

Computation of functions by decoders. Fix n and n + 1 finite sets Σ1, . . . , Σn, O. We want to define how a decoder computes functions of the form:

f : Σ1 . . . Σn O.

Inputs to f are interpreted as words with n letters, with the i-th letter coming from the alphabet Σi, for i = 1, . . . , n (alphabets are possibly different at different positions). We put this word as an input to a decoder using n + 1 tokens, one per letter plus a special token at the end for the end of line symbol. Input tokens can use arbitrary encodings of letters by d-dimensional vectors, potentially different at different positions of the input word, utilizing in this form a positional information. We then run the decoder on the resulting input for some number t of iterations. The output of f is computed by applying an output function to the decoder s output yt from the final iteration. Definition 3.1 (Computation of functions by decoders). Let n be a natural number and Σ1, . . . , Σn, O be n + 1 finite sets. A function f : Σ1 . . . Σn O can be computed by t iterations of a decoder with H heads, if there exist:

d N and an attention layer L of embedding dimension d with H heads,

a positional encoding p, i.e. a function p: Σ1 {1} . . . Σn {n} {Eo L} Rd, where Eo L denotes a special end-of-line symbol, and

an output function α : Rd O,

such that for any w = (σ1, . . . , σn) Σ1 . . . Σn, the value f( w) is determined by the following procedure:

1. Define a sequence (x1, . . . , xn, y0) of d-dimensional vectors by:

x1 = p(σ1, 1), . . . , xn = p(σn, n), y0 = p(Eo L).

2. Place (x1, . . . , xn, y0) as an input to the the decoder defined by L, and let yt for t 1 denote the output of this decoder after t iterations.

3. Set f( w) = α(yt).

Next, we define the following important notion. Definition 3.2 (Decoder depth of a function). The decoder depth with H heads of f : Σ1 . . . Σn O, denoted dd(H)(f), is the minimum t 0 such that f can be computed by t iterations of a decoder with H heads.

4. One-Head Decoder Depth vs Tree Rank

In this section, we show that the rank of a function is equivalent to its decoder depth in the single-head setting. Theorem 4.1. For any function f : Σ1 . . . Σn O, we have rk(f) = dd(1)(f).

As a corollary to Theorem 4.1 and Proposition 2.3, we obtain that for suitable n the decoder depth with one head of the iterated composition function t-Compn is precisely t:

Ehrenfeucht-Haussler Rank and Chain of Thought

Corollary 4.2. For each t and for all n > 2t, we have dd(1)(t-Compn) = t.

Also, as a corollary to Theorem 4.1 and Proposition 2.4, we obtain that for suitable n the decoder depth with one head of the kth one function k-th Onen is precisely k:

Corollary 4.3. For each k, and for every n k2 + k, we have dd(1)(k-th Onen) = k.

We now prove our main theorem.

Proof of Theorem 4.1. We first show the inequality rk(f) dd(1)(f). Assume that f can be computed by a decoder with one head in r iterations, for some r N. We deduce that rk(f) r. For that, we show that at the cost of t a-queries one can compute the outputs of the decoder in the first t iterations on a given input. Hence, in r a-queries, we can compute the rth output of the decoder, which uniquely determines the value of f, implying that rk(f) r.

Consider any input w = (σ1, . . . , σn) Σ1 . . . Σn. Define then:

x1 = p(1, σ1), . . . , xn = p(n, σn), y0 = p(Eo L) Rd,

where d is the dimension of our decoder and p is its positional encoding function. Let {yt Rd} t=1 be the sequence of the outputs of our decoder on input (x1, . . . , xn, y0). Assume that we have already computed y1, . . . , yt for some t 0 (if t = 0, we just know y0 = p(Eo L)). We explain how to compute yt+1 using one a-query. By definition,

yt+1 = L(x1, . . . , xn, y0, y1, . . . , yt),

where L is the attention layer defining our decoder. It is enough to compute s {x1, . . . , xn, y0, y1, . . . , yt} with the maximal value of Ks, Qyt for the key and query matrices K, Q Rd d of our attention layer. If there are multiple vectors s {x1, . . . , xn, y0, . . . , yt} with the maximal value of this scalar product, we need to compute the leftmost one among them. Since we already have computed y0, y1, . . . , yt, it suffices to find this maximal s over {x1, . . . , xn} = {p(1, σ1), . . . , p(n, σn)}.

Consider the following linear order of the set A of assignments. Given two different assignments a = (i, σ), a = (i , σ ), we say that a is larger than a if either Kp(a), Qyt > Kp(a ), Qyt or Kp(a), Qyt = Kp(a ), Qyt and i < i . We arbitrarily order assignments with Kp(a), Qyt = Kp(a ), Qyt and i = i . Our task is to find the maximal assignment from {p(1, σ1), . . . , p(n, σn)} in this order. For that, we ask the a-query qτ for a permutation τ, where the first assignment is the maximal in our linear order, the second one is the second maximal, and so on.

We now show the inequality dd(1)(f) rk(f). Assume that T is an r-depth decision tree over a-queries that computes f. We transform into a decoder with one head that computes f in r iterations. We assume that T is a complete r-depth |A|-ary tree, where A is the set of assignments.

The embedding dimension of our decoder will be:

d = 1 + |A| + . . . + |A|r 1

+ 1 + |A| + . . . + |A|r

The coordinates will be split into 4 groups:

the first 1 + |A| + . . . + |A|r 1 coordinates are called positional coordinates and are indexed by non-leaf nodes of T;

the second 1 + |A| + . . . + |A|r coordinates are called output coordinates and are indexed by nodes of T;

the third |A| coordinates are called assignment coordinates and are indexed by assignments;

the last coordinate will be called special.

Our goal is to construct a decoder that simulates T in the following sense. On input w Σ1 . . . Σn, for any t 0, we want the t-th output of the decoder, denoted by yt Rd, to be the one-hot encoding of the node where T comes on w at depth t. More specifically, this one-hot encoding will take place in output coordinates, the remaining coordinates of yt will all be 0.

To achieve this, we start with defining y0 = p(Eo L) Rd as follows. In the restriction to the output coordinates it is the one-hot encoding of the root of T; all the other coordinates of y0 are 0. Next, we define the positional encoding p(a) Rd for an assignment a = (i, σ) A. In the restriction to the assignment coordinates, it is the one-hot encoding of a. Now, for each non-leaf node v of T and its corresponding positional coordinate p(a)v, we set p(a)v = 1/τ 1 v (a), where τv : {1, . . . , |A|} A is the permutation defining the a-query asked at v. We let the special coordinate of p(a) to be 1. Finally, all output coordinates of p(a) are set to 0.

Having our positional encoding defined, we move to the construction of the attention layer and define the query matrix Q Rd d by the following linear transformation α 7 Qα for α Rd: for every non-leaf node v of T, the v-th positional coordinate of Qα is equal the v-th output coordinate of α; the remaining coordinates of Qα are 0. The key matrix K Rd d is set to be the identity matrix.

Ehrenfeucht-Haussler Rank and Chain of Thought

Assume that, as an input, for some t < r, we give to a layer the following sequence of vectors:

x1, . . . , xn, y0, y1, . . . , yt Rd,

where xi = p(i, σi) for i = 1, . . . , n and for some w = (σ1, . . . , σn) Σ1 . . . Σn, y0 = p(Eo L), and for every i = 1, . . . , t, the vector yi is the one-hot encoding, inside the output coordinates, of some depth-i node vi of T, and has 0 in the remaining coordinates. Let q = qvt be the a-query asked at vt, and let τ = τvt be the corresponding permutation of the set of assignments (the node vt is a nonleaf node because t < r). We claim that the attention on this input will be maximized for the position with the assignment which is the output of q on w.

Indeed, the vector yt has the unique 1 at the vt-th output coordinate, with the remaining coordinates of vt being 0. The matrix Q moves this 1 into the vt-th positional coordinate, and the rest of the coordinates of Qyt are 0. Thus, for any α Rd, the product Kα, Qyt equals the value of α in the vt-th positional coordinate. If α = p(i, σi) for i [n], this value is 1/τ 1(i, σi). The maximum of this value is attained for (i, σi) {(1, σ1), . . . , (n, σn)} with the minimal value of τ 1(i, σi), i.e, for (i, σi) = q( w). Now, for α {y0, y1, . . . , yt}, the value of the vt-th positional coordinate, as well as any other positional coordinate, is 0. Hence, the output of the head will be the vector p(q( w)).

The output of the layer is now computed as:

yt+1 = W2 Re LU (W1 β) , (5)

β = p(q( w)) + yt. (6)

We need to choose W1, W2 Rd d such that the resulting yt+1 will encode the node vt+1 where the tree goes from vt by following the q( w)-labeled edge. More specifically, we want yt+1 to be the one-hot encoding of vt+1 in the output coordinates, and we want all the other coordinates of yt+1 to be 0. We will set W2 to be the identity matrix. To define W1, we fix the following notation. For a non-root node v of T, let parent(v) denote the parent node of v, and let label(v) A denote the label of the edge from parent(v) to v. We define W1 by the following linear transformation α 7 W1α, α Rd: for every non-root node v of T, we define the v-th output coordinate of Wα as

the parent(v)-th output coordinate of α (7)

+ the label(v)-th assignment coordinate of α (8)

the special coordinate of α. (9)

We set all the other coordinates of W1α to 0.

We have to show now that Re LU(W1 β), with β as in (5 6) has 1 in the vt+1-th output coordinate and 0 in all the other coordinates. Indeed, W1 β has 0 in any coordinate which

is not an output coordinate for a non-root node of T. Now, consider any non-root node v of T. It is enough to show that the v-th output coordinate of W1 β is 1 for v = vt and is 0 or -1 for v = vt (applying Re LU to 0 and 1, we get 0).

To calculate the v-th output coordinate of W1β, as stated in (7 9), we calculate the parent(v)-th output coordinate of β, the label(v)-th assignment coordinate of β, and the special coordinate of β. Recall that positional encodings of assignments have 0 in the output coordinates. Hence, the sum β = p(q( w)) + yt, in the restriction to the output coordinates, is the one-hot encoding of vt. In other words, the parent(v)-th output coordinate of β is the indicator I{parent(v) = vt}. Likewise, since yt has only 0 in the non-output coordinates, the sum β = p(q( w)) + yt, in the restriction to the assignment coordinates, is the onehot encoding of the assignment q( w). Again, this means that the label(v)-th assignment coordinate of β is equal to the indicator I{label(v) = q( w)}. Finally, the special coordinates of p(q( w)) and yt are 1 and 0, respectively, meaning that the special coordinate of β is 1. Plugging these equalities into (7 9) for α = β, we obtain that the v-th output coordinate of W1β equals:

I{parent(v) = vt} + I{label(v) = q( w)} 1.

This expression takes values in { 1, 0, 1} and it is equal to 1 if and only if both indicators are 1. It remains to note that vt+1 is the only node whose parent is vt and such that the label of the edge from vt to this node is q( w).

The r-th output of the decoder, yr, in restriction to the output coordinates, will be the one-hot encoding of the leaf to which we come while computing T on input w. Since this leaf uniquely determines f( w), we are done.

5. Multihead Rank

In order to generalize Theorem 4.1 to decoders with many heads, we define the notion of H-head rank for a function f : Σ1 . . . Σn O. For that we require a notion of the product of two functions with the same domain. Namely, by the product of f : A B and g: A C, we mean a function (f g): A B C, defined by:

(f g)(a) = (f(a), g(a)).

An H-degree a-query is a product of H a-queries.

Definition 5.1. The H-head rank of a function f : Σ1 . . . Σn O, denoted rk(H)(f), is the minimal depth of a decision tree over H-degree a-queries that computes f.

A simple generalization of the construction of Theorem 4.1 allows us to obtain the following result.

Theorem 5.2. For any H N and for any function f : Σ1 . . . Σn O, we have rk(H)(f) = dd(H)(f).

Ehrenfeucht-Haussler Rank and Chain of Thought

We observe that the H-head rank can be at most H times smaller than the normal rank. Specifically, each H-degree a-query can be computed by performing H individual aqueries sequentially.

Proposition 5.3. For f : Σ1 . . . Σn O and H 1, we have rk(f) H rk(H)(f).

Proposition 5.3 allows us to reduce, up to a factor of H, lower bounds on rk(H)(f) to lower bounds on rk(f). However, this proposition is sometimes unable to provide tight bounds on rk(H)(f). This occurs, for instance, when rk(H)(f) is not smaller at all than rk(f). We present two examples of this phenomenon in this section.

To establish precise lower bound on the decoder depth of a function with H heads, it suffices to derive a lower bound on its H-head rank (Theorem 5.2). However, this task proves to be significantly more challenging than for the single-head rank. Specifically, for the iterated composition function, combinatorial arguments alone, as employed in the proof of Proposition 2.3, are no longer sufficient. Instead, we must rely on techniques from communication complexity to address the problem. For the k-th Onen function, we develop a combinatorial argument that is notably more intricate than the one used in the proof of Proposition 2.4.

5.1. Multihead decoder depth of iterated composition

In this section, we show a method for lower bounding the multihead rank of a function based on communication complexity (Kushilevitz & Nisan, 1996). Let X, Y, Z be finite sets and f : X Y Z be a function. Imagine that there are two players, Alice and Bob. Alice is given x X and Bob is given y Y. Their goal is to cooperatively compute f(x, y). For that, they can send each other messages that are binary words. They want to minimize the number of messages and their total length in bits.

Formally, a k-round Alice-first communication protocol Π is given by:

k positive integer numbers ℓ1, . . . , ℓk (messages lengths);

a function Mi : {0, 1}ℓ1+...+ℓi 1 X {0, 1}ℓi for every odd i {1, . . . , k};

a function Mi : {0, 1}ℓ1+...+ℓi 1 Y {0, 1}ℓi for every even i {1, . . . , k}; and

the output function out: {0, 1}ℓ1+...+ℓk Z.

The communication complexity of Π is the sum ℓ1+. . .+ℓk.

On input (x, y) X Y, the output of Π on (x, y) is computed as follows. We inductively define a sequence of

binary words m1 {0, 1}ℓ1, . . . , mk {0, 1}ℓk by setting

mi = Mi(m1 . . . mi 1, x) for odd i {1, . . . , k},

mi = Mi(m1 . . . mi 1, y) for even i {1, . . . , k}.

Intuitively, m1 = M1(ε, x) is the first message of Alice that she sends to Bob in the protocol on input x. Upon receiving m1, Bob replies with the second message m2 = M2(m1, y) that depends on his input and the first of Alice s messages. Then Alice sends the third message m3 = M3(m1m2, x), and so on. The output of the protocol is defined as out(m1 . . . mk) Z.

By Ck,A(f) we mean the minimal communication complexity of a k-round Alice-first protocol that computes f. By reversing the roles of Alice and Bob, we define k-round Bobfirst protocols, and Ck,B(f), the minimal communication complexity of a k-round Bob-first protocol for a function f.

Assume we have a function f : Σ1 . . . Σn O and a subset S [n]. Suppose that positions of an input word w Σ1 . . . Σn are split between Alice and Bob like this: Alice knows letters of w at positions i S, and Bob knows letter of w at positions i [n] \ S. Their goal is to find out f( w). This defines a function:

where the two inputs correspond to the parts of w that Alice and Bob knows, respectively, and the output of is f( w).

Assuming that the H-head rank of f is r, we construct low-communication (r + 1)-round Alice-first and Bob-first protocols for f S, for any S [n]. This gives a method for lower bounding the multihead rank of f: by showing that either Cr+1,A(f) and Cr+1,B is large enough, we conclude that the H-head rank of f is larger than r.

Lemma 5.4. For every f : Σ1 . . . Σn {0, 1}, for every S [n], and for every H 1, denoting r = rk(H)(f) and |A| the number of assignments for f, we have:

Cr+1,A(f S) 2Hr log2 |A| and

Cr+1,B(f S) 2Hr log2 |A| .

Proof. We first notice that Alice and Bob can compute the value of any H-degree a-query qτ1 . . . qτH by exchanging messages of length H log2 a . In fact, for a given input w Σ1 . . . Σn there are exactly n assignments consistent with w. A part of them is known to Alice (for positions in S) and the other part to Bob (for positions in [n] \ S). For each h = 1, . . . , H, Alice and Bob have to calculate the first assignment in the permutation τh which is consistent with w. Alice can see which w-consistent assignment, known to her, goes first in τh, and the same for Bob.

Ehrenfeucht-Haussler Rank and Chain of Thought

Among these two assignments, the one that goes first is the answer to qτh. Alice and Bob just have to exchange the indices of these assignments. For both Alice and Bob it is thus enough to send a H log2 a -bit message with indices of H assignments.

We already see that an r-depth decision tree over H-degree a-queries can be simulated by a communication protocol with 2Hr log2 a bits. We need to explain how to arrange this communication in r + 1 rounds. For that, Alice and Bob have to alternate the order in which they exchange their messages in a computation of the H-degree a-queries. For example, for the Alice-first protocol, in the computation of the first query Alice has to send her message first and then Bob. Now, for the second query, Bob has to send his message first and then Alice. In this way, Bob s messages for the first and for the second query merge into a single round of communication. Similarly, for the third query, Alice has to send first, and then Bob, and so on, getting overall r + 1 rounds. The Bob-first protocol is constructed in an analogous fashion.

As a corollary, we obtain the following: Corollary 5.5. For every H and t, for all but finitely many n, we have rk(H)(t-Compn) = t.

Proof. We reduce from a communication problem called pointer chasing. In this problem, Alice is given f A : {1, . . . , m} {1, . . . , m} and Bob is given f B : {1, . . . , m} {1, . . . , m}. In the k-step pointer chase, denoted here by PCm k , the goal of Alice and Bob is to compute:

. . . f A(f B(f A | {z } k times

(1)) . . .)

It is easy to see that Ck,A(PCm k ) = O(k log m) (Alice starts by sending m1 = f A(1), Bob replies by sending m2 = f B(m1), and so on). It is known that this task requires much longer communication for k-round Bobfirst protocols. Namely, for any constant k, we have Ck,B(PCk) = Ω(m) (Duris et al., 1987).

It remains to notice that PCn/2 t is a special case of the problem t-Comp S n, for S = {1, . . . , n/2}, where Alice gets (ϕ(1), . . . , ϕ(n/2)) and Bob gets (ϕ(n/2 + 1), . . . , ϕ(n)), for some function ϕ: {1, . . . , n} {1, . . . , n}, and the task is to compute ϕ(k)(1). Namely, we obtain PCn/2 t as a special case when ϕ maps the first half of the inputs into the second half, and the second half into the first half. Assuming that rk(H)(t-Compn) < t, by Lemma 5.4 we obtain:

Omega(n) = Ct,B(PCn/2 t )

Ct,B(t-Comp S n) 2Ht log2 n2 .

For any fixed H, t this is true only for finitely many n.

5.2. Multihead decoder depth of kth One

In this section, we establish a tight lower bound on the multi-head rank of k-th One.

Theorem 5.6. For any k, H N, for all but finitely many n N, we have rk(H)(k-th Onen) = k.

We observe that our communication complexity tool is not applicable in this case, as for any partition of the input positions between Alice and Bob, there exists a 2-round protocol with logarithmic communication that computes the position of the k-th one: Alice sends the positions of the first k ones in her part of the input, and Bob does the same.

Proposition 5.7. For any k, n and S [n]:

C2,A(k-th One S n) = C2,B(k-th One S n) = O(k log n).

If we wanted to use Lemma 5.4 to obtain a lower on rk(H)(k-th Onen), we would have needed C2,A(k-th One S n) or C2,B(k-th One S n) to grow super-logarithmically with n for some S [n]. Instead, we use a self-reducibility technique by means of partial fixations.

6. Final Remarks

We have shown that the expressive power of single-layer Transformers with hard attention is tightly connected to the notion of rank of functions. Extending this characterization to more layers or to soft attention is a challenging future direction. In a contemporaneous manuscript, Chen et al. have proved unconditional lower bounds on the embedding dimension of multilayer decoder-only Transformers with soft attention that compute iterated function composition. However, their version of the problem differs significantly from the one considered here: they have several functions to compose, and each function is completely given in a single token. We plan to explore whether the techniques used in their work can be applied to strengthen our results.

Acknowledgements

Kozachinskiy is supported by ANID Fondecyt Iniciaci on grant 11250060. Barcel o and Kozachinskiy are funded by the National Center for Artificial Intelligence CENIA FB210017, Basal ANID. Barcel o is also funded by ANID Millennium Science Initiative Program Code ICN17002.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Ehrenfeucht-Haussler Rank and Chain of Thought

Angluin, D., Chiang, D., and Yang, A. Masked hardattention transformers and boolean RASP recognize exactly the star-free languages. Co RR, abs/2310.13897, 2023.

Barcel o, P., Kozachinskiy, A., Lin, A. W., and Podolskii, V. V. Logical languages accepted by transformer encoders with hard attention. In ICLR. Open Review.net, 2024.

Chen, L., Peng, B., and Wu, H. Theoretical limitations of multi-layer transformer. Co RR, abs/2412.02975, 2024.

Chiang, D., Cholak, P., and Pillay, A. Tighter bounds on the expressivity of transformer encoders. In ICML, volume 202, pp. 5544 5562, 2023.

Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. What does BERT look at? an analysis of bert s attention. In ACL Workshop Blackbox NLP, pp. 276 286, 2019.

Duris, P., Galil, Z., and Schnitger, G. Lower bounds on communication complexity. Information and Computation, 73(1):1 22, 1987.

Ehrenfeucht, A. and Haussler, D. Learning decision trees from random examples. Information and Computation, 82(3):231 246, 1989.

Esteban, J. L. and Tor an, J. A combinatorial characterization of treelike resolution space. Information Processing Letters, 87(6):295 300, 2003.

Hahn, M. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguistics, 8: 156 171, 2020a.

Hahn, M. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156 171, 2020b.

Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Trans. Assoc. Comput. Linguistics, 10:800 810, 2022a.

Hao, Y., Angluin, D., and Frank, R. Formal language recognition by hard attention transformers: Perspectives from circuit complexity. Transactions of the Association for Computational Linguistics, 10:800 810, 2022b.

Kullmann, O. Investigating a general hierarchy of polynomially decidable classes of cnf s based on short tree-like resolution proofs. Citeseer, 1999.

Kushilevitz, E. and Nisan, N. Communication Complexity. Cambridge University Press, 1996.

Liu, Z., Liu, H., Zhou, D., and Ma, T. Chain of thought empowers transformers to solve inherently serial problems. In ICLR, 2024.

Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Trans. Assoc. Comput. Linguistics, 11:531 545, 2023.

Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought. In ICLR, 2024.

Peng, B., Narayanan, S., and Papadimitriou, C. H. On limitations of the transformer architecture. Co RR, abs/2402.08164, 2024.

P erez, J., Barcel o, P., and Marinkovic, J. Attention is turingcomplete. J. Mach. Learn. Res., 22:75:1 75:35, 2021.

Pudl ak, P. and Impagliazzo, R. A lower bound for dll algorithms for k-sat (preliminary version). In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, pp. 128 136, 2000.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Neur IPS, pp. 5998 6008, 2017.

Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, 2019.

Yang, A. and Chiang, D. Counting like transformers: Compiling temporal counting logic into softmax transformers. Co RR, abs/2404.04393, 2024.

Yang, A., Chiang, D., and Angluin, D. Masked hardattention transformers recognize exactly the star-free languages. In Neur IPS, 2024.