Transformers in DLOGTIME-Uniform TC0

David Chiang (dchiang@nd.edu), University of Notre Dame

Published in Transactions on Machine Learning Research (01/2025). Reviewed on OpenReview: https://openreview.net/forum?id=ZA7D4nQuQF

Previous work has shown that the languages recognized by average-hard attention transformers (AHATs) and softmax-attention transformers (SMATs) are within the circuit complexity class TC0. However, these results assume limited-precision arithmetic: using floating-point numbers with $O(\log n)$ bits (where $n$ is the length of the input string), Strobl showed that AHATs can be approximated in L-uniform TC0, and Merrill & Sabharwal showed that SMATs can be approximated in DLOGTIME-uniform TC0. Here, we improve these results, showing that AHATs with no approximation, SMATs with $O(\mathrm{poly}(n))$ bits of floating-point precision, and SMATs with at most $2^{-O(\mathrm{poly}(n))}$ absolute error are all in DLOGTIME-uniform TC0.

1 Introduction

Previous work (summarized in Table 1) has shown that the languages recognized by average-hard attention transformers (AHATs) and softmax-attention transformers (SMATs) are within the circuit complexity class TC0. This places some interesting computational problems beyond the power of these transformers. In particular, if TC0 ≠ NC1 (as is widely believed; Williams, 2022), then these transformers cannot solve any NC1-complete problems. For example, consider Boolean formulas with constants 0 and 1 and no variables, like $(0 \lor 1) \land (\lnot 0 \lor 1)$. Checking the syntax of such formulas is equivalent to the Dyck language, which is recognizable by both AHATs (Yao et al., 2021) and SMATs (Yang & Chiang, 2024). But computing the semantics of such formulas, that is, deciding whether a formula is true, is NC1-complete (Buss, 1987) and therefore not solvable by these transformers (unless TC0 = NC1). A toy sketch contrasting the two tasks appears at the end of this section.

However, these non-solvability results assume limited-precision arithmetic. The best results that we are aware of use floating-point numbers with $O(\log n)$ bits (where $n$ is the length of the input string): Strobl (2023) showed that AHATs can be approximated in L-uniform TC0, and Merrill & Sabharwal (2023b) showed that SMATs can be approximated in DLOGTIME-uniform TC0. These results leave open the possibility that AHATs and SMATs, as defined on paper using real numbers, might not be subject to the same limitations. Here, we improve these results, showing that:

- AHATs (without any approximation) are in DLOGTIME-uniform TC0.
- SMATs with $O(\mathrm{poly}(n))$ bits of floating-point precision are in DLOGTIME-uniform TC0.

Furthermore, because there are many different ways to approximate a transformer using limited precision, and different ways appear to lead to different results, we propose an alternative assumption, which is that the final output is approximated up to a certain (absolute) error. Thus, we show:

- SMATs with at most $2^{-O(\mathrm{poly}(n))}$ absolute error are in DLOGTIME-uniform TC0.

This can also be rephrased as a statement, not about the expressivity of approximations of SMATs, but about the expressivity of exact SMATs themselves:

- Any language that is recognized by a SMAT with margin $2^{-O(\mathrm{poly}(n))}$ is in DLOGTIME-uniform TC0.
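To make the formula example above concrete, the following sketch (ours, not from the paper; the grammar and helper names are illustrative, with `|`, `&`, `~` standing for ∨, ∧, ¬) contrasts the two tasks: `check_parens` performs the Dyck-style syntax check that AHATs and SMATs can do, while `evaluate` solves the NC1-complete Boolean formula value problem.

```python
# Syntax-checking vs. evaluating a variable-free Boolean formula such as
# "(0|1)&(~0|1)". Names and grammar are our illustration only.

def check_parens(s: str) -> bool:
    """Dyck-style syntax check: are the parentheses balanced?"""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def evaluate(s: str) -> bool:
    """Evaluate the formula: the Boolean formula value problem,
    which is NC1-complete (Buss, 1987)."""
    def expr(i):       # expr := term ('|' term)*
        v, i = term(i)
        while i < len(s) and s[i] == "|":
            w, i = term(i + 1)
            v = v or w
        return v, i
    def term(i):       # term := factor ('&' factor)*
        v, i = factor(i)
        while i < len(s) and s[i] == "&":
            w, i = factor(i + 1)
            v = v and w
        return v, i
    def factor(i):     # factor := '~' factor | '(' expr ')' | '0' | '1'
        if s[i] == "~":
            v, i = factor(i + 1)
            return not v, i
        if s[i] == "(":
            v, i = expr(i + 1)
            return v, i + 1            # skip the closing ')'
        return s[i] == "1", i + 1
    v, _ = expr(0)
    return v

assert check_parens("(0|1)&(~0|1)")
assert evaluate("(0|1)&(~0|1)") is True
```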
Table 1: Summary of results on transformer encoders in previous work and in this paper. Our results show that even when (average-hard attention or softmax-attention) transformers are computed to very high precision, they remain limited to DLOGTIME-uniform TC0.

| | attention | approximation | class |
|---|---|---|---|
| Merrill et al. (2022) | average | $O(\log n)$ precision | non-uniform TC0 |
| Liu et al. (2023) | softmax | $O(\log n)$ precision | non-uniform TC0 |
| Strobl (2023) | average | $O(\log n)$ precision | L-uniform TC0 |
| Merrill & Sabharwal (2023a) | softmax | $O(\log n)$ precision | L-uniform TC0 |
| Merrill & Sabharwal (2023b) | softmax | $O(\log n)$ precision | DLOGTIME-uniform TC0 |
| This paper, Theorem 7 | average | none | DLOGTIME-uniform TC0 |
| This paper, Theorem 13 | softmax | $O(\mathrm{poly}(n))$ precision | DLOGTIME-uniform TC0 |
| This paper, Theorem 14 | softmax | $2^{-O(\mathrm{poly}(n))}$ error | DLOGTIME-uniform TC0 |

2 Background

We write $[n]$ for the set $\{1, 2, \ldots, n\}$. We write $\lfloor x \rfloor$ for the floor of $x$ (the greatest integer less than or equal to $x$), and $\lceil x \rceil$ for the ceiling of $x$ (the least integer greater than or equal to $x$). We write $O(\mathrm{poly}(n))$ for the family of functions $\bigcup_{k \ge 0} O(n^k)$.

2.1 Transformers

We assume familiarity with transformers (Vaswani et al., 2017) and describe a few concepts briefly. For more detailed definitions, please see the survey by Strobl et al. (2024), whose notation and terminology we follow.

In standard attention, attention weights are computed from attention scores using a softmax:

$$\alpha_{i,j} = [\operatorname{softmax} s_{i,*}]_j = \frac{\exp s_{i,j}}{\sum_{j'} \exp s_{i,j'}}.$$

We call a transformer with standard attention a softmax-attention transformer (SMAT). An average-hard attention transformer (AHAT; Pérez et al., 2019; Merrill et al., 2022) is one where the softmax is replaced by

$$\operatorname{ahardmax} s_{i,*} = \lim_{\tau \to 0^+} \operatorname{softmax}(s_{i,*}/\tau).$$

In other words, each position $i$ attends to those positions $j$ that maximize the score $s_{i,j}$. If there is more than one such position, attention is divided equally among them. (A numerical sketch of this contrast appears at the end of this subsection.)

Layer normalization (Ba et al., 2016) scales and shifts the components of a vector so that their mean and standard deviation are determined by the parameters $\beta$ and $\gamma$:

$$\mathrm{LayerNorm}(x) = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + c}} \odot \gamma + \beta \quad (1)$$

where $\odot$ is componentwise multiplication and $c \ge 0$ is a constant. When layer normalization is used, we require that $c > 0$ (as is standard in practice).

We assume that a transformer has a single scalar output, computed from the last position. For simplicity, we assume that the output is used for binary classification, as follows:

Definition 1. A transformer $T : \Sigma^* \to \mathbb{R}$ recognizes a language $L$ if, for every string $w \in \Sigma^*$: if $w \in L$ then $T(w) > 0$, and if $w \notin L$ then $T(w) < 0$.
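As a minimal numerical sketch of the two attention functions just defined (ours, with illustrative names): ahardmax splits weight equally among the score-maximizing positions, and softmax with temperature $\tau \to 0$ converges to it.

```python
import math

def softmax(scores, tau=1.0):
    """Standard softmax over one row of attention scores, with a
    temperature tau; shift by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def ahardmax(scores):
    """Average-hard attention: split weight equally among the maxima."""
    m = max(scores)
    ties = [1.0 if s == m else 0.0 for s in scores]
    k = sum(ties)
    return [t / k for t in ties]

scores = [2.0, 5.0, 5.0, 1.0]
print(ahardmax(scores))            # [0.0, 0.5, 0.5, 0.0]
print(softmax(scores, tau=0.01))   # ~[0.0, 0.5, 0.5, 0.0]
print(softmax(scores))             # spreads weight over all positions
```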
2.2 Complexity classes

A TC0 circuit is one made from the usual AND, OR, and NOT gates, as well as MAJORITY gates, which are true iff a strict majority of their inputs are true. A TC0 circuit family is a set of circuits indexed by lengths $n > 0$, such that the circuit for length $n$ has polynomial size, bounded depth, and unbounded fan-in. DLOGTIME-uniform TC0 is the set of TC0 circuit families for which queries about the circuit for length $n$ can be decided in deterministic $O(\log n)$ time. Throughout this paper, whenever we say TC0, we mean DLOGTIME-uniform TC0.

The class TC0 is also the class of languages definable in first-order logic with majority quantifiers ($\mathsf{M}x.\,\phi(x)$ is true iff $\phi(x)$ is true for a majority of positions $x$) and the BIT predicate ($\mathrm{BIT}(x, y)$ is true iff the $y$-th bit of $x$ is 1) (Barrington et al., 1990). Depending on the context, it may be easier to think about TC0 in terms of circuits or in terms of logical formulas. Our descriptions of functions in TC0 abstract away from the details of either circuits or formulas, making use of functions already known to be in TC0 together with the fact that functions in TC0 are closed under serial and parallel composition (Jeřábek, 2012).

Theorem 2. The following operations on $O(\mathrm{poly}(n))$-bit integers are in TC0:
(a) Addition of two numbers
(b) Comparison of two numbers
(c) Maximum of $n$ numbers
(d) Truncated base-2 logarithm $\lfloor \log_2 x \rfloor$
(e) Iterated addition of $n$ numbers
(f) Multiplication of two numbers
(g) Iterated multiplication of $n$ numbers
(h) Truncated division of two numbers.

Proof. Addition (a) is shown by Immerman (1999, Prop. 1.9) for $n$ bits and is easy to extend to $O(\mathrm{poly}(n))$ bits. Comparison (b), maximum (c), and truncated base-2 logarithm (d) are also easy; these cases do not require majority gates. Iterated addition (e) is shown, for example, by Barrington & Maciel (2000, Lecture 7, Section 2), and multiplication (f) is closely related. Iterated multiplication (g) was proven to be in TC0 by Hesse et al. (2002, Theorem 5.1), and it can be used for truncated division (h).

2.3 Approximation error

We will define various numeric representations and associated concepts as they are needed, but will make use of the following definitions throughout.

Definition 3. For functions $\hat f : X \to \mathbb{R}$ and $f : X \to \mathbb{R}$, we say that $\hat f$ approximates $f$ with absolute error at most $\epsilon$ if for all $x \in X$, we have $|\hat f(x) - f(x)| \le \epsilon$, and that $\hat f$ approximates $f$ with relative error at most $\epsilon$ if for all $x \in X$, we have $\left| \frac{\hat f(x) - f(x)}{f(x)} \right| \le \epsilon$.

3 Arbitrary-precision AHATs

In this section, we prove that AHATs without layer normalization, even with arbitrary precision, are in TC0. We do this by representing rational numbers exactly as pairs of integers. This turns out to need only a polynomial number of bits, so it can be computed in TC0.

Definition 4. A $p$-bit rational number is a pair $\langle a, b \rangle$, where $a$ is an integer in $[-2^p, 2^p)$ and $b$ is an integer in $[1, 2^p)$. The value of $\langle a, b \rangle$ is $a/b$. (According to this definition, a $p$-bit rational number actually requires $2p + 1$ bits: 1 for the sign, $p$ for the numerator, and $p$ for the denominator.)

Lemma 5. The following operations on $O(\mathrm{poly}(n))$-bit rational numbers are in TC0:
(a) Addition, multiplication, division, and comparison of two numbers
(b) Iterated multiplication of $n$ numbers
(c) Iterated addition and maximum of $n$ numbers.

Proof. The operations (a, b) can be expressed in terms of operations on $O(\mathrm{poly}(n))$-bit integers, which are in TC0 (Theorem 2):

$$\langle a_1, b_1 \rangle + \langle a_2, b_2 \rangle = \langle a_1 b_2 + b_1 a_2,\ b_1 b_2 \rangle \quad (2)$$
$$\langle a_1, b_1 \rangle \times \langle a_2, b_2 \rangle = \langle a_1 a_2,\ b_1 b_2 \rangle \quad (3)$$
$$\langle a_1, b_1 \rangle \div \langle a_2, b_2 \rangle = \langle a_1 b_2,\ b_1 a_2 \rangle \quad (4)$$
$$\langle a_1, b_1 \rangle \le \langle a_2, b_2 \rangle \iff a_1 b_2 \le b_1 a_2 \quad (5)$$
$$\prod_{i \in [n]} \langle a_i, b_i \rangle = \Big\langle \prod_{i \in [n]} a_i,\ \prod_{i \in [n]} b_i \Big\rangle. \quad (6)$$

To find the sum or maximum of $n$ rational numbers (c), we precompute the product of the denominators, $B = \prod_{i \in [n]} b_i$:

$$\sum_{i \in [n]} \langle a_i, b_i \rangle = \Big\langle \sum_{i \in [n]} a_i B / b_i,\ B \Big\rangle \quad (7)$$
$$\max_{i \in [n]} \langle a_i, b_i \rangle = \max_{i \in [n]} \langle a_i B / b_i,\ B \rangle. \quad (8)$$

Lemma 6. Let $T$ be an AHAT with rational weights, $p$-bit position embeddings, and no layer normalization. Let $L$ be the depth of $T$. Then the computation of $T$ needs $O(pn^L)$ bits for each intermediate and final value.

Proof. First, note that if $\langle a_1, b_1 \rangle$ uses $O(n^k)$ bits and $\langle a_2, b_2 \rangle$ uses $O(n^k)$ bits, then their sum, product, and quotient (Eqs. (2) to (4)) also use $O(n^k)$ bits. But if $\langle a_i, b_i \rangle$ for $i \in [n]$ use $O(n^k)$ bits each, then their sum (Eq. (7)) uses $O(n^{k+1})$ bits. We prove the lemma by induction on $L$. If $L = 0$, we just look up the embeddings, which need $O(p)$ bits per value. If $L > 0$, assume that layer $L-1$ required $O(pn^{L-1})$ bits per value. In the self-attention, the queries, keys, values, and scores need $O(pn^{L-1})$ bits. The sum of the maximum-scoring values, of which there could be up to $n$, needs $O(pn^L)$ bits, as does the average. Finally, the activations of the FFNN also need $O(pn^L)$ bits.
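The following minimal sketch (ours) mirrors Eqs. (2) to (8), with Python's unbounded integers standing in for the $O(\mathrm{poly}(n))$-bit integers of Theorem 2; note how Eq. (7) puts all summands over the common denominator $B$ in one step.

```python
from math import prod

# A rational is a pair (a, b) with value a/b and b >= 1, as in Definition 4.

def add(x, y):
    """Eq. (2): <a1,b1> + <a2,b2> = <a1*b2 + b1*a2, b1*b2>."""
    (a1, b1), (a2, b2) = x, y
    return (a1 * b2 + b1 * a2, b1 * b2)

def mul(x, y):
    """Eq. (3)."""
    (a1, b1), (a2, b2) = x, y
    return (a1 * a2, b1 * b2)

def leq(x, y):
    """Eq. (5): compare by cross-multiplying (denominators positive)."""
    (a1, b1), (a2, b2) = x, y
    return a1 * b2 <= b1 * a2

def iter_add(xs):
    """Eq. (7): put all summands over the common denominator B."""
    B = prod(b for _, b in xs)
    return (sum(a * (B // b) for a, b in xs), B)

def iter_max(xs):
    """Eq. (8): same common denominator, then compare numerators."""
    B = prod(b for _, b in xs)
    return max(((a * (B // b), B) for a, b in xs), key=lambda r: r[0])

a, b = iter_add([(1, 2), (1, 3), (1, 6)])   # 1/2 + 1/3 + 1/6 = 1
assert a == b
```

The factor-of-$B$ blowup in the denominator at each iterated addition is exactly what drives the $O(pn^L)$ bound in Lemma 6.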
Theorem 7. Let $T$ be an AHAT with rational weights, $O(\mathrm{poly}(n))$-bit position embeddings, and no layer normalization. Then the language recognized by $T$ is in TC0.

Proof. AHATs use only the operations in Lemma 5 on rational numbers with $O(\mathrm{poly}(n))$ bits (Lemma 6). Since these operations are all computable in TC0 and can be composed in TC0, the language recognized by $T$ is in TC0.

Remark 8. We now have a more or less complete characterization of which regular languages can be recognized by AHATs. Barrington et al. (1992) showed that every regular language $L$ is either in ACC0 or NC1-complete. If $L$ is in ACC0, then it can be defined in linear temporal logic with modular counting (Baziramwabo et al., 1999), and therefore it can be recognized by an AHAT with suitable position encodings (Barceló et al., 2024). If $L$ is NC1-complete, then by Theorem 7 it cannot be recognized by an AHAT unless TC0 = NC1.

4 Polynomial-precision SMATs

Next, we turn to SMATs, extending Merrill & Sabharwal's (2023b) proof from $O(\log n)$ bits to $O(\mathrm{poly}(n))$ bits.

Definition 9. A $p$-bit floating-point number is a pair $\langle m, e \rangle$ where $m$ (called the significand) and $e$ (called the exponent) are integers, $|m| \in \{0\} \cup [2^{p-1}, 2^p)$, and $e \in [-2^p, 2^p)$. The value of $\langle m, e \rangle$ is $m \cdot 2^e$. We write $\mathrm{round}_p(x)$, where $x$ is either a real number or a floating-point number, for the $p$-bit floating-point number nearest to $x$. If there are two such numbers, we call $x$ a breakpoint and define $\mathrm{round}_p(x)$ to be the one with an even significand. (According to this definition, a $p$-bit floating-point number actually requires $2p + 2$ bits: $p + 1$ for the significand and its sign, and $p + 1$ for the exponent and its sign.)

To compute a SMAT with $p$-bit floating-point numbers means to approximate the operations in the SMAT with operations on floating-point numbers. In typical floating-point implementations, addition, multiplication, division, and square root are rounded to the nearest floating-point number, but exp is only approximated, with a relative error of about $2^{-p}$. We also assume that summation of $n$ numbers is performed exactly and then rounded (following Liu et al., 2023; Chiang et al., 2023; Merrill & Sabharwal, 2023a; but pace Li et al., 2024, who argue that rounding should be performed after each addition).

Lemma 10. The following operations on floating-point numbers with $p \in O(\mathrm{poly}(n))$ bits are computable in TC0, with exact rounding to the nearest $p$-bit floating-point number:
(a) Addition, multiplication, division, and comparison of two numbers
(b) Iterated multiplication of $n$ numbers.

Proof. These operations on $O(\mathrm{poly}(n))$-bit integers are in TC0 (Theorem 2). We just have to show that they are also definable on floating-point numbers. This is not a new result, but we try to fill in some details here that are missing elsewhere.

First, $\mathrm{round}_p(\langle m, e \rangle)$ can be computed in TC0 as follows. Count the number of significand bits $q = \lfloor \log_2 |m| \rfloor + 1$ (Theorem 2d), shift $m$ right by $(q - p)$ bits, and increment $e$ by $(q - p)$. Round $m$ to the nearest integer, and if $|m| = 2^p$, shift $m$ and increment $e$ once more.

For the operations (a), we have

$$\langle m_1, e_1 \rangle + \langle m_2, e_2 \rangle = \begin{cases} \mathrm{round}_p(\langle m_1 + m_2 \mathbin{//} 2^{e_1 - e_2},\ e_1 \rangle) & \text{if } e_1 \ge e_2 \\ \mathrm{round}_p(\langle m_1 \mathbin{//} 2^{e_2 - e_1} + m_2,\ e_2 \rangle) & \text{if } e_1 \le e_2 \end{cases}$$

$$\langle m_1, e_1 \rangle \times \langle m_2, e_2 \rangle = \mathrm{round}_p(\langle m_1 m_2,\ e_1 + e_2 \rangle)$$

$$\langle m_1, e_1 \rangle \div \langle m_2, e_2 \rangle = \mathrm{round}_p(\langle m_1 2^{p+1} \mathbin{//} m_2,\ e_1 - e_2 - (p+1) \rangle)$$

$$\langle m_1, e_1 \rangle \le \langle m_2, e_2 \rangle \iff \begin{cases} m_1 \le m_2 \mathbin{//} 2^{e_1 - e_2} & \text{if } e_1 \ge e_2 \\ m_1 \mathbin{//} 2^{e_2 - e_1} \le m_2 & \text{if } e_1 \le e_2. \end{cases}$$

The operation $\mathbin{//}$ is defined as

$$a \mathbin{//} b = \begin{cases} a/b & \text{if } a/b \text{ is a multiple of } 1/4 \\ \lfloor 4a/b \rfloor / 4 + 1/8 & \text{otherwise.} \end{cases}$$

The result has three fractional bits (called the guard, round, and sticky bits), which ensure that the result is correctly rounded to the nearest floating-point number (Goldberg, 2017). Note that $a \mathbin{//} b$ can be computed efficiently even if $b$ is a large power of 2.

For iterated multiplication (b), we have

$$\prod_{i \in [n]} \langle m_i, e_i \rangle = \mathrm{round}_p\Big(\Big\langle \prod_{i \in [n]} m_i,\ \sum_{i \in [n]} e_i \Big\rangle\Big).$$
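Here is a small sketch (ours, with our own helper names) of the two primitives just defined: round-to-nearest with ties to the even significand, and the $\mathbin{//}$ operation, represented in units of $1/8$ so that the guard/round/sticky information is kept exactly.

```python
def round_p(m: int, e: int, p: int):
    """Round <m, e> to the nearest p-bit float, ties to the even
    significand. m may be any integer (e.g., an unnormalized sum)."""
    sign, m = (-1, -m) if m < 0 else (1, m)
    if m == 0:
        return (0, 0)
    q = m.bit_length()                  # q = floor(log2 m) + 1 (Thm 2d)
    shift = q - p
    if shift > 0:
        low = m & ((1 << shift) - 1)    # the bits to be discarded
        m >>= shift
        e += shift
        half = 1 << (shift - 1)
        if low > half or (low == half and m % 2 == 1):
            m += 1                      # round up; ties go to even
        if m == 1 << p:                 # carry out of the top bit
            m >>= 1
            e += 1
    else:                               # too few bits: normalize upward
        m <<= -shift
        e += shift
    return (sign * m, e)

def div_grs(a: int, b: int) -> int:
    """The paper's a // b for a, b > 0, returned in units of 1/8:
    exact when a/b is a multiple of 1/4; otherwise truncated to a
    multiple of 1/4 with the sticky bit (1/8) set."""
    q, r = divmod(4 * a, b)
    return 2 * q if r == 0 else 2 * q + 1

assert round_p(0b10110, 0, 3) == (0b110, 2)   # 22 -> 24: tie, even wins
assert div_grs(5, 2) == 20                    # 5/2 is exact: 20/8
assert div_grs(1, 3) == 3                     # 1/3 -> 1/4 + 1/8 = 3/8
```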
[Figure 1: Overview of the algorithm for iterated addition of $p$-bit floating-point numbers. The summands are grouped into blocks that each span $O(\mathrm{poly}(n))$ bits. The blocks are separated by at least $p + \lceil \log_2 n \rceil$ bits, so that the block-sums are separated by at least $p$ bits.]

Lemma 11. Iterated addition of $n$ floating-point numbers, each with $p \in O(\mathrm{poly}(n))$ bits, is in TC0.

Proof. We are given $p$-bit floating-point numbers $\langle m_1, e_1 \rangle, \ldots, \langle m_n, e_n \rangle$. Without loss of generality, assume $m_i \ne 0$ for all $i$. We need to compute the sum $\sum_{i \in [n]} \langle m_i, e_i \rangle$.

Step 1. Define the relation $i \sim j$ just in case $|e_i - e_j| < 2p + \lceil \log_2 n \rceil$. The transitive closure of $\sim$ partitions the (indices of the) summands into blocks $B_1, \ldots, B_k \subseteq [n]$ (that is, if $i \sim j$, then $i$ and $j$ are in the same block). The intuition (Fig. 1) is that, in the binary representation, the numbers within each block are close enough together that we can sum them by brute force, while numbers in different blocks are far enough apart that we can ignore all but the two leftmost blocks. The partitioning into blocks can be computed in TC0 as follows. Call $i \in [n]$ block-minimal iff there is no $j \in [n]$ such that $e_j < e_i$ and $i \sim j$. Then $i$ and $j$ belong to the same block if and only if there is no block-minimal $k \in [n]$ such that $e_i < e_k \le e_j$ or $e_j < e_k \le e_i$.

Step 2. For each block $B$, we compute the sum of all the numbers in $B$. Let $e$ be the minimal exponent in $B$ (that is, $e = \min \{ e_i \mid i \in B \}$). Since all the exponents in $B$ are bigger than $e$ by at most $n(2p + \lceil \log_2 n \rceil) \in O(\mathrm{poly}(n))$, we can perform this sum exactly (Merrill & Sabharwal, 2023a):

$$\sum_{i \in B} \langle m_i, e_i \rangle = \Big\langle \sum_{i \in B} m_i 2^{e_i - e},\ e \Big\rangle.$$

Step 3. Let $s^{(i)} = \langle m^{(i)}, e^{(i)} \rangle$ be the sum of the block with the $i$-th largest absolute sum. Then the first block-sum $s^{(1)}$ dominates the whole sum; any number not in the first block has absolute value less than $2^p \cdot 2^{e^{(1)} - 2p - \lceil \log_2 n \rceil}$. So we can bound the rest of the sum as

$$\Big| \sum_{i=2}^{k} s^{(i)} \Big| < n \cdot 2^p \cdot 2^{e^{(1)} - 2p - \lceil \log_2 n \rceil} \le 2^{e^{(1)} - p}. \quad (9)$$

In other words, in the binary representation, there is a gap of at least $p$ zero bits between the first block-sum and the remaining block-sums. It is not necessary to sort all the block-sums; it is enough to find the maximal block-sum $s^{(1)}$ and the second-largest block-sum $s^{(2)}$. Then we consider three cases (see Fig. 2).

[Figure 2: In Case 2, $s^{(1)}$ is a breakpoint, so the sum $s$ depends on the sign (and only the sign) of $s^{(2)}$. In Case 3, even if $m^{(1)}$ has only a single bit, the remaining block-sums do not affect the whole sum. We've left the block-sums unnormalized; that is, their significands could have more or fewer than $p$ bits.]

Case 1: If $m^{(1)} = 0$, then the whole sum is zero, and we are done.

Case 2: If $s^{(1)}$ is a breakpoint, then we need to look at the remainder $r = \sum_{i=2}^{k} s^{(i)}$ to see which way to round. Since $|r| < 2^{e^{(1)} - p}$ (Eq. (9)), it is enough to look at the sign of $r$, which is the sign of $m^{(2)}$.

Case 3: Otherwise, $s^{(1)}$ is sufficiently far (on the number line) from a breakpoint that the addition of $r$ cannot change the result. Due to cancellation, $m^{(1)}$ could have fewer than $p$ bits, down to just one bit, so the distance to the nearest breakpoint could be as small as $2^{e^{(1)} - p}$. But $|r| < 2^{e^{(1)} - p}$ by Eq. (9).
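A compact model of Steps 1 to 3 (ours) may help make the block decomposition concrete. It uses sorting where the paper uses TC0-computable predicates, and for simplicity it combines all the block-sums exactly instead of applying Step 3's case analysis and rounding.

```python
def blocked_sum(nums, p):
    """Sum floats given as (m, e) pairs, following Lemma 11.

    nums: nonempty list of (m, e) with m != 0; p: precision in bits.
    Returns the exact sum as an unnormalized pair (M, E) = M * 2**E.
    """
    gap = 2 * p + (len(nums) - 1).bit_length()   # 2p + ceil(log2 n)
    # Step 1: blocks are runs of exponents whose consecutive gaps are
    # < gap (equivalent to the transitive closure of the relation ~).
    nums = sorted(nums, key=lambda me: me[1])
    blocks, cur = [], [nums[0]]
    for m, e in nums[1:]:
        if e - cur[-1][1] < gap:
            cur.append((m, e))
        else:
            blocks.append(cur)
            cur = [(m, e)]
    blocks.append(cur)
    # Step 2: sum each block exactly over its minimum exponent.
    sums = []
    for B in blocks:
        e0 = min(e for _, e in B)
        sums.append((sum(m << (e - e0) for m, e in B), e0))
    # Step 3 (simplified): combine the block-sums exactly.
    E = min(e for _, e in sums)
    return (sum(m << (e - E) for m, e in sums), E)

xs = [(5, 10), (7, 11), (3, -10)]   # two nearby summands, one tiny one
print(blocked_sum(xs, p=3))         # (3 + 19 * 2**20, -10)
```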
Lemma 12. Given a floating-point number $x$ with $O(\mathrm{poly}(n))$ bits, the following functions can be computed in TC0:
(a) $\sqrt{x}$, rounded to the nearest floating-point number
(b) $\exp x$, with a relative error of at most $2^{-p}$.

Proof. The basic idea is to use a truncated Taylor series (Merrill, p.c.; Hesse et al., 2002, Cor. 6.5). This is not a new result, but we try to fill in some details here that are missing elsewhere. Let $p \in O(\mathrm{poly}(n))$.

For $\sqrt{x}$: Find $r \in [\frac14, 1]$ and an even integer $k$ such that $x = r \cdot 2^k$, as follows. If $x = \langle m, e \rangle$ and $e + p$ is even, let $r = m \cdot 2^{-p} \in [\frac12, 1)$ and $k = e + p$; if $e + p$ is odd, let $r = m \cdot 2^{-p-1} \in [\frac14, \frac12)$ and $k = e + p + 1$. Then compute $\sqrt{r}$ using the Taylor series about 1:

$$\sqrt{r} = \sum_{i=0}^{N-1} \binom{1/2}{i} (r-1)^i + O(|r-1|^N).$$

Since the error term is in $O(|r-1|^N)$ and $r \ge \frac14$, there is some $a$ such that the error is at most $a |r-1|^N \le a (3/4)^N$. To make this less than $2^{-p-1}$, we set $N = \big\lceil \frac{(p+1) \log 2 + \log a}{\log(4/3)} \big\rceil \in O(p)$. Then we decide which way to round by squaring the breakpoint nearest to the approximation of $\sqrt{r}$ and comparing it with $r$. Finally, $\sqrt{x} = \sqrt{r} \cdot 2^{k/2}$.

For $\exp x$: Let $k = \lfloor x / \log 2 \rfloor$ and $r = x - k \log 2$, where $\log 2$ is computed using the series

$$\log 2 = \sum_{i=1}^{N} \frac{1}{i\, 2^i} + O(2^{-N}).$$

Compute $\exp r$ using the Taylor series about 0:

$$\exp r = \sum_{i=0}^{N-1} \frac{r^i}{i!} + O(r^N).$$

Since the error term is in $O(r^N)$ and $r \in [0, \log 2)$, there is some $a$ such that the relative error is at most $\frac{a r^N}{\exp r} \le a (\log 2)^N$. So to get a relative error of $2^{-p}$, we set $N = \big\lceil \frac{p \log 2 + \log a}{-\log \log 2} \big\rceil \in O(p)$. Finally, $\exp x = (\exp r) \cdot 2^k$. (A numerical sanity check of these two series appears at the end of this section.)

Theorem 13. Any language that is recognizable by an $O(\mathrm{poly}(n))$-bit-precision SMAT is in TC0.

Proof. SMATs use only the operations in Lemmas 10 to 12. Since these operations are all computable in TC0 and can be composed in TC0, the language recognized by an $O(\mathrm{poly}(n))$-bit-precision SMAT is in TC0.
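As a sanity check on the two Taylor series (ours, using exact rational arithmetic via `Fraction` in place of the paper's fixed-point computation), truncating each series at $N \in O(p)$ terms gives the required accuracy:

```python
from fractions import Fraction

def sqrt_series(r: Fraction, N: int) -> Fraction:
    """First N terms of sqrt(r) = sum_i binom(1/2, i) (r-1)^i about 1."""
    total, coeff, power = Fraction(0), Fraction(1), Fraction(1)
    for i in range(N):
        total += coeff * power
        # binom(1/2, i+1) = binom(1/2, i) * (1/2 - i) / (i + 1)
        coeff *= (Fraction(1, 2) - i) / (i + 1)
        power *= r - 1
    return total

def exp_series(r: Fraction, N: int) -> Fraction:
    """First N terms of exp(r) = sum_i r^i / i! about 0."""
    total, term = Fraction(0), Fraction(1)
    for i in range(N):
        total += term
        term *= r / (i + 1)
    return total

s = sqrt_series(Fraction(1, 2), 60)          # r in [1/4, 1]
assert abs(s * s - Fraction(1, 2)) < Fraction(1, 2**40)

t = exp_series(Fraction(1, 2), 30)           # r in [0, log 2)
# exp(1/2) = 1.6487212707...
assert abs(t - Fraction(16487212707, 10**10)) < Fraction(1, 10**9)
```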
5 Approximating SMATs with $2^{-O(\mathrm{poly}(n))}$ error

Defining transformers with $p$-bit precision and characterizing the class of languages they recognize is complicated, because there are many different ways to perform rounding, which can lead to differences in expressive power (Li et al., 2024). In this section, we propose an alternative approach, which is to limit the error of the final result of a transformer approximation and abstract away from the details (like precision and rounding) of how that level of error is achieved. We show that approximating a SMAT with absolute error at most $2^{-O(\mathrm{poly}(n))}$ can be done in TC0. This has two advantages. First, it has a simple and unambiguous definition. Second, it will allow us to say something about the expressivity of a large subclass of exact SMATs, namely, those that accept or reject strings with margin $2^{-O(\mathrm{poly}(n))}$.

Theorem 14. For any SMAT $T : \Sigma^* \to \mathbb{R}$ and for any $\epsilon(n) \in 2^{-O(\mathrm{poly}(n))}$, there is a function $\hat T : \Sigma^* \to \mathbb{R}$ in TC0 such that for all $w \in \Sigma^*$ with $n = |w|$, $|\hat T(w) - T(w)| \le \epsilon(n)$.

Proof. We construct $\hat T$ out of the following operations, where $C > 0$ and $c > 0$ do not depend on $n$:
(a) Addition of two numbers
(b) Multiplication $xy$ where $|x|, |y| \le C$
(c) Comparison of two numbers
(d) Inverse square root $1/\sqrt{x}$ where $x \ge c$
(e) Iterated addition of $n$ numbers
(f) Softmax of $n$ numbers.

The upper bound $C$ on all activations was shown by Hahn (2020), and in operation (d), the lower bound $c$ exists because we defined layer normalization to add a constant to the variance (Eq. (1)).

To simplify the error analysis, all of the above operations are performed on $O(\mathrm{poly}(n))$-bit rational numbers. In TC0, all of these operations can be computed exactly (Lemma 5), except $\sqrt{x}$ and $\exp x$, which can be approximated with relative error $\epsilon$ for any $\epsilon \in 2^{-O(\mathrm{poly}(n))}$, by Lemma 12. In that lemma, the case for square root asks for $r \in [\frac14, 1]$ and an even integer $k$ such that $x = r \cdot 2^k$. For a rational $x = \langle a, b \rangle$, we do this as follows. If $a \ge b$, compute $\lfloor a/b \rfloor$ using truncated division (Theorem 2h), then count the number of bits (Theorem 2d) to get $k = \lfloor \log_2 \lfloor a/b \rfloor \rfloor + 1$. Similarly, if $a < b$, compute $k = -\lfloor \log_2 \lfloor b/a \rfloor \rfloor$. Finally, if $k$ is odd, increment it by 1.

Fix $\epsilon_{\mathrm{final}} > 0$. We show by induction that, for each operation $i$ in the computation of $\hat T$, there is a $\delta_i \in \Theta(\epsilon_{\mathrm{final}}/\mathrm{poly}(n))$ such that if we compute operation $i$ with error $\delta_i$, then the final answer has error at most $\epsilon_{\mathrm{final}}$. In particular, it is possible to compute $\hat T$ using $O(\mathrm{poly}(n))$-bit rationals and achieve a final error of at most $2^{-O(\mathrm{poly}(n))}$.

For each operation, we will show that for any $\epsilon > 0$, there is a $\delta \in \Omega(\epsilon/n)$ such that if the inputs to the operation are approximated with error at most $\delta$, then the output is approximated with error at most $\epsilon$. If a function $f : \mathbb{R}^d \to \mathbb{R}$ is $\rho$-Lipschitz continuous, then for any $\epsilon > 0$, if $\|h\| \le \epsilon/\rho$, then $|f(x+h) - f(x)| \le \rho \|h\| \le \epsilon$. Operations (a-d) are $\rho$-Lipschitz continuous with $\rho$ not depending on $n$, while iterated addition of $n$ numbers (e) is $n$-Lipschitz continuous, and softmax of $n$ numbers (f) is $\rho$-Lipschitz continuous with $\rho$ not depending on $n$. We show the cases of inverse square root and softmax, as these also have error due to the Taylor approximations.

For inverse square root $y = 1/\sqrt{x}$ with $x \ge c$: For any $\epsilon > 0$, let $\delta = \min\big(\frac{c}{2}, \frac12, \frac{c\sqrt{c}}{2(2c+1)}\,\epsilon\big)$. Suppose that $x$ has been approximated as $\hat x = x + h$ where $|h| \le \delta$. Because $\sqrt{x}$ for $x \ge \frac{c}{2}$ is $\frac{1}{\sqrt{c}}$-Lipschitz continuous, $|\sqrt{\hat x} - \sqrt{x}| \le \delta/\sqrt{c}$. Furthermore, we approximate $\sqrt{\hat x}$ with relative error $\eta$ where $|\eta| \le \delta$. So we approximate $y$ as $\hat y = \frac{1}{\sqrt{\hat x}\,(1+\eta)}$, and the error is

$$\Big| \frac{1}{\sqrt{\hat x}(1+\eta)} - \frac{1}{\sqrt{x}} \Big| \le \Big| \frac{1}{\sqrt{\hat x}(1+\eta)} - \frac{1}{\sqrt{\hat x}} \Big| + \Big| \frac{1}{\sqrt{\hat x}} - \frac{1}{\sqrt{x}} \Big| \qquad \text{(triangle inequality)}$$
$$\le \frac{2\sqrt{2}}{\sqrt{c}}\,|\eta| + \frac{\sqrt{2}}{c\sqrt{c}}\,\delta \qquad \big(|\eta| \le \delta \le \tfrac12,\ \hat x \ge \tfrac{c}{2},\ x \ge c\big)$$
$$\le \frac{2(2c+1)}{c\sqrt{c}}\,\delta \le \epsilon.$$

For softmax of $n$ numbers: For any $\epsilon > 0$, let $\delta = \min\big(\frac{1}{16}, \frac{\epsilon}{16}\big)$. Suppose that for all $i \in [n]$, $x_i$ has been approximated as $x_i + h_i$ where $|h_i| \le \delta$, and let $\eta_i$, where $|\eta_i| \le \delta$, be the relative error of approximating $\exp(x_i + h_i)$. Then the softmax and its approximation are

$$y_i = \frac{\exp x_i}{\sum_j \exp x_j}, \qquad \hat y_i = \frac{(\exp(x_i + h_i))(1 + \eta_i)}{\sum_j (\exp(x_j + h_j))(1 + \eta_j)},$$

and $\hat y$ overestimates $y$ by at most

$$\hat y_i - y_i \le \frac{(\exp(x_i + \delta))(1 + \delta)}{\sum_j (\exp(x_j - \delta))(1 - \delta)} - y_i = \Big( \frac{(\exp 2\delta)(1 + \delta)}{1 - \delta} - 1 \Big)\, y_i$$
$$\le \frac{(\exp 2\delta)(1 + \delta)}{1 - \delta} - 1 \qquad (y_i \in [0, 1])$$
$$\le \frac{(1 + 4\delta)(1 + \delta)}{1 - \delta} - 1 \qquad (\exp 2\delta \le 1 + 4\delta \text{ for } 2\delta \le 1)$$
$$\le 16\delta \le \epsilon \qquad \big(\delta \le \tfrac{1}{16},\ \delta \le \tfrac{\epsilon}{16}\big).$$

Similarly, we can show that $\hat y$ underestimates $y$ by at most $y_i - \hat y_i \le 8\delta \le \epsilon$. (This bound is checked numerically in the sketch at the end of this section.)

The above is a statement about the expressivity of SMAT approximations, but as mentioned at the beginning of this section, it also makes it possible to say something about the expressivity of a large subclass of exact SMATs.

Definition 15. A transformer $T : \Sigma^* \to \mathbb{R}$ recognizes a language $L$ with margin $\epsilon(n)$ if, for every string $w \in \Sigma^*$ with $n = |w|$: if $w \in L$ then $T(w) > \epsilon(n)$, and if $w \notin L$ then $T(w) < -\epsilon(n)$.

Corollary 16. Any language that is recognizable by a SMAT with margin $2^{-O(\mathrm{poly}(n))}$ is in TC0.

Proof. Let $L$ be a language recognized by SMAT $T$ with margin $\epsilon \in 2^{-O(\mathrm{poly}(n))}$. By Theorem 14, there is a function $\hat T$ in DLOGTIME-uniform TC0 such that for all $w$, we have $-\epsilon \le \hat T(w) - T(w) \le \epsilon$. If $w \in L$, then $T(w) > \epsilon$, so $\hat T(w) \ge T(w) - \epsilon > 0$. Similarly, if $w \notin L$, then $T(w) < -\epsilon$, so $\hat T(w) \le T(w) + \epsilon < 0$. Therefore, $\hat T$ also recognizes $L$.
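The softmax bound above is easy to probe numerically. The following sketch (ours) perturbs each input by at most $\delta$ and applies a relative error of at most $\delta$ to each exponential, as in the proof, and checks that the observed error stays under $16\delta$.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def perturbed_softmax(xs, delta, rng):
    """Softmax with each x_i shifted by h_i and each exp scaled by
    (1 + eta_i), where |h_i|, |eta_i| <= delta (proof of Theorem 14)."""
    es = [math.exp(x + rng.uniform(-delta, delta))
          * (1 + rng.uniform(-delta, delta)) for x in xs]
    z = sum(es)
    return [e / z for e in es]

rng = random.Random(0)
delta = 1 / 64
worst = 0.0
for _ in range(1000):
    xs = [rng.uniform(-5, 5) for _ in range(8)]
    y, yhat = softmax(xs), perturbed_softmax(xs, delta, rng)
    worst = max(worst, max(abs(a - b) for a, b in zip(y, yhat)))
print(worst, "<=", 16 * delta)   # observed worst error vs. the bound
```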
6 Limitations and Conclusions

The levels of precision considered here go far beyond what is practical to compute with. Nevertheless, these results are valuable because they further strengthen the case that transformers cannot compute any function outside of TC0. Moreover, Section 5 offers an alternative approach to limited-precision transformers that may be useful in more realistic settings. In particular, an analogous argument shows that it takes $O(\log n)$ bits of precision to achieve an error of $1/O(\mathrm{poly}(n))$, which may make SMATs with margin $1/O(\mathrm{poly}(n))$ an interesting target for future research.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. In NIPS 2016 Deep Learning Symposium, 2016. URL https://arxiv.org/abs/1607.06450.

Pablo Barceló, Alexander Kozachinskiy, Anthony Widjaja Lin, and Vladimir Podolskii. Logical languages accepted by transformer encoders with hard attention. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=gbrHZq07mq.

David A. Barrington, Kevin Compton, Howard Straubing, and Denis Thérien. Regular languages in NC1. Journal of Computer and System Sciences, 44(3):478-499, 1992. doi:10.1016/0022-0000(92)90014-A.

David A. Mix Barrington, Neil Immerman, and Howard Straubing. On uniformity within NC1. Journal of Computer and System Sciences, 41(3):274-306, 1990. doi:10.1016/0022-0000(90)90022-D.

David Mix Barrington and Alexis Maciel. Advanced course on computational complexity, 2000. URL https://people.clarkson.edu/~alexis/PCMI/. CMI-PCMI Undergraduate Program.

Augustin Baziramwabo, Pierre McKenzie, and Denis Thérien. Modular temporal logic. In Proceedings of the 14th Symposium on Logic in Computer Science (LICS), pp. 344-351, 1999. doi:10.1109/LICS.1999.782629.

Samuel R. Buss. The Boolean formula value problem is in ALOGTIME. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing (STOC), pp. 123-131, 1987. doi:10.1145/28395.28409.

David Chiang, Peter Cholak, and Anand Pillay. Tighter bounds on the expressivity of transformer encoders. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, pp. 5544-5562, 2023. URL https://proceedings.mlr.press/v202/chiang23a.html.

David Goldberg. Computer arithmetic, 2017. URL https://www.elsevier.com/books-and-journals/book-companion/9780128119051. Appendix J of John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 6th ed.

Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156-171, 2020. doi:10.1162/tacl_a_00306.

William Hesse, Eric Allender, and David A. Mix Barrington. Uniform constant-depth threshold circuits for division and iterated multiplication. Journal of Computer and System Sciences, 65(4):695-716, 2002. doi:10.1016/S0022-0000(02)00025-9.

Neil Immerman. Descriptive Complexity. Springer, 1999. doi:10.1007/978-1-4612-0539-5.

Emil Jeřábek. Root finding with threshold circuits. Theoretical Computer Science, 462:59-69, 2012. doi:10.1016/j.tcs.2012.09.001. URL https://www.sciencedirect.com/science/article/pii/S0304397512008006.

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=3EWTEy9MTM.
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=De4FYqjFueZ.

William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531-545, 2023a. doi:10.1162/tacl_a_00562.

William Merrill and Ashish Sabharwal. A logic for expressing log-precision transformers. In Advances in Neural Information Processing Systems 36 (NeurIPS), pp. 52453-52463, 2023b. URL https://papers.neurips.cc/paper_files/paper/2023/hash/a48e5877c7bf86a513950ab23b360498-Abstract-Conference.html.

William Merrill, Ashish Sabharwal, and Noah A. Smith. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843-856, 2022. doi:10.1162/tacl_a_00493.

Jorge Pérez, Javier Marinković, and Pablo Barceló. On the Turing completeness of modern neural network architectures. In Proceedings of the Seventh International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=HyGBdo0qFm.

Lena Strobl. Average-hard attention transformers are constant-depth uniform threshold circuits, 2023. URL https://arxiv.org/abs/2308.03212. arXiv:2308.03212.

Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal languages can transformers express? A survey. Transactions of the Association for Computational Linguistics, 12:543-561, 2024. doi:10.1162/tacl_a_00663.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS), 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

R. Ryan Williams. Some estimated likelihoods for computational complexity. In Bernhard Steffen and Gerhard Woeginger (eds.), Computing and Software Science: State of the Art and Perspectives, pp. 9-26. Springer-Verlag, 2022. doi:10.1007/978-3-319-91908-9_2.

Andy Yang and David Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers. In Proceedings of the First Conference on Language Modeling (CoLM), 2024. URL https://openreview.net/forum?id=FmhPg4UJ9K.

Shunyu Yao, Binghui Peng, Christos Papadimitriou, and Karthik Narasimhan. Self-attention networks can process bounded hierarchical languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), pp. 3770-3785, 2021. doi:10.18653/v1/2021.acl-long.292.