Published as a conference paper at ICLR 2023

TRANSFORMERS LEARN SHORTCUTS TO AUTOMATA

Bingbin Liu^1, Jordan T. Ash^2, Surbhi Goel^3, Akshay Krishnamurthy^2, Cyril Zhang^2
^1 Carnegie Mellon University, ^2 Microsoft Research NYC, ^3 University of Pennsylvania

ABSTRACT

Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far fewer layers than the number of reasoning steps. This raises the question: what solutions are these shallow and non-recurrent models finding? We investigate this question in the setting of learning automata, discrete dynamical systems naturally suited to recurrent modeling and expressing algorithmic tasks. Our theoretical results completely characterize shortcut solutions, whereby a shallow Transformer with only o(T) layers can exactly replicate the computation of an automaton on an input sequence of length T. By representing automata using the algebraic structure of their underlying transformation semigroups, we obtain O(log T)-depth simulators for all automata and O(1)-depth simulators for all automata whose associated groups are solvable. Empirically, we perform synthetic experiments by training Transformers to simulate a wide variety of automata, and show that shortcut solutions can be learned via standard training. We further investigate the brittleness of these solutions and propose potential mitigations.

1 INTRODUCTION

Modern deep learning pipelines demonstrate an increasing capability to perform combinatorial reasoning: pretrained on large, diverse distributions of natural language, math, and code, they are nascently solving tasks which seem to require a rigid understanding of syntax, entailment, and state inference. How do these neural networks represent the primitives of logic and the algorithms they execute internally? When considering this question, there is an immediate mismatch between classical sequential models of computation (e.g., Turing machines) and the Transformer architecture, which has delivered many of the recent breakthroughs in reasoning domains. If we are to think of an algorithm as a set of sequentially-executed computational rules, why would we use a shallow,^1 non-recurrent network?

We study this question through the lens of finite semiautomata, which compute state sequences q_1, ..., q_T from inputs σ_1, ..., σ_T by application of a transition function δ (and an initial state q_0):

    q_t = δ(q_{t-1}, σ_t).

Semiautomata are the underlying structures governing the computations realizable by automata (such as regular expression parsers or finite-state transducers), which are simply semiautomata equipped with mappings from states to outputs. Thus, one natural motivation for studying them comes from the question of whether Transformers can subsume the structures found in classical NLP pipelines. Another motivation comes from the perspective of reinforcement learning and control, where Transformers are beginning to be used as world models: semiautomata specify deterministic discrete-state dynamical systems.

We perform a theoretical and empirical investigation of whether (and how) non-recurrent Transformers learn semiautomata. We characterize and analyze how shallow Transformers find shortcut solutions, which correctly and efficiently simulate the transition dynamics of semiautomata with far fewer sequential computations than required for iteratively inferring each state q_t.

* The majority of this work was completed while B. Liu was an intern at Microsoft Research NYC. This work was completed while S. Goel was at Microsoft Research NYC.
^1 Compared to the number of symbols it can process. For example, DistilBERT (Sanh et al., 2019) can handle thousands of tokens with 6 sequential layers.

Figure 1: Various examples of semiautomata. From left to right: a mod-2 parity counter (Q = {even, odd}, Σ = {0, 1}); a 2-state memory unit; Grid_4, a 1D gridworld (Q = {1, 2, 3, 4}, Σ = {move left, move right}); a 2-dimensional gridworld constructible via the direct product Grid_3 × Grid_4 (Q = {1..3} × {1..4}, Σ = four directional moves); and a Rubik's Cube (Q = {54 stickers}, Σ = {6 face rotations}), whose transformation semigroup is a very large non-abelian group.

Our contributions. Our theoretical results provide structural guarantees for the representability of semiautomata by shallow, non-recurrent Transformers. In particular, we show that:

- Shortcut solutions, with depth logarithmic in the sequence length, always exist (Theorem 1).
- Constant-depth shortcuts exist for solvable semiautomata (Theorem 2).
- There do not exist constant-depth shortcuts for non-solvable semiautomata, unless TC^0 = NC^1 (Theorem 4).
- For a natural class of semiautomata corresponding to path integration in a gridworld with boundaries, we show that there are even shorter shortcuts (Theorem 3), beyond those guaranteed by the general structure theorems above.

We accompany these theoretical findings with an extensive set of experiments:

- End-to-end learnability of shortcuts via SGD (Section 4). The theory shows that shortcut solutions exist; is the non-convexity of the optimization problem an obstruction to learning them in practice? For a variety of semiautomaton simulation problems, we find empirically that there is no such obstruction. Shallow non-recurrent Transformers are able to learn shortcuts which generalize near-perfectly in-distribution.
- More challenging settings (Section 5). We compare non-recurrent and recurrent models in the presence of additional considerations: out-of-distribution generalization (including to unseen sequence lengths) and limited supervision. This reveals the brittleness of non-recurrent models, in line with prior "spurious representation" notions of shortcuts in deep learning. Toward mitigating these drawbacks and obtaining the best of both worlds, we show that with recency-biased scratchpad training, Transformers can be guided to learn the robust recurrent solutions.

1.1 RELATED WORK

Emergent reasoning in neural sequence models. Neural sequence models, both recurrent (Wu et al., 2016; Peters et al., 2018; Howard & Ruder, 2018) and non-recurrent (Vaswani et al., 2017; Devlin et al., 2018), have ushered in an era of broadly-applicable and (with pretraining) sample-efficient natural language understanding. Building on this, large-scale non-recurrent Transformer models have demonstrated capabilities in program synthesis, mathematical reasoning, and in-context multi-task adaptation. A nascent frontier is to leverage neural dynamics models, again both recurrent (Hafner et al., 2019) and non-recurrent (Chen et al., 2021a; Janner et al., 2021), for decision making. At the highest level, the present work seeks to idealize and understand the mechanisms by which deep learning solves tasks requiring combinatorial and algorithmic reasoning.

Computational models of neural networks.
In light of the above, it is empirically evident that neural networks are successfully learning circuits which generalize on some combinatorial tasks. Many efforts in the theory and empirical science of deep learning are dedicated towards the rigorous analysis of this phenomenon. Various perspectives map self-attention to bounded-complexity circuits (Hahn, 2020; Elhage et al., 2021; Merrill et al., 2021; Edelman et al., 2022), declarative programs (Weiss et al., 2021), and Turing machines (Dehghani et al., 2019). The research program of BERTology (Clark et al., 2019; Vig, 2019; Tenney et al., 2019) interprets trained models in terms of known linguistic and symbolic primitives.

The most relevant theoretical work to ours is (Barrington & Thérien, 1988), which acts as a Rosetta Stone between classical circuit complexity and semigroup theory. The core technical ideas for Theorems 1 (NC^1 prefix sum), 2 (Krohn-Rhodes), and 4 (Barrington) are inspired by the results and discussions therein. In the language of circuit complexity, our work establishes that shallow, non-recurrent Transformers can efficiently represent all of the constructions involved in the (simple) NC^1 and (significantly more complex) ACC^0 solutions to sequential multiplication in semigroups. On the other hand, the shorter shortcut from Theorem 3 carefully leverages self-attention to improve upon these results; we were unable to find an analogous refinement in the circuit complexity literature.

Synthetic combinatorial tasks. Our problem setting of simulating finite-state semiautomata unifies the settings of several recent investigations of whether (and how) Transformers learn bounded-depth Dyck languages (Yao et al., 2021), parities (Anil et al., 2022), adders (Nogueira et al., 2021; Nanda & Lieberum, 2022), regular languages (Bhattamishra et al., 2020), and sparse logical predicates (Edelman et al., 2022; Barak et al., 2022). Zhang et al. (2022) empirically analyze the behavior and inner workings of Transformers on random-access group operations and note shortcuts (which skip over explicit program execution) similar to those we study. We provide an expanded discussion of related work in Appendix A.5.

2 PRELIMINARIES

2.1 SEMIAUTOMATA AND THEIR ALGEBRAIC STRUCTURE

A semiautomaton A := (Q, Σ, δ) consists of a set of states Q, an input alphabet Σ, and a transition function δ : Q × Σ → Q. In this work, Q and Σ will always be finite sets. For every positive integer T and starting state q_0 ∈ Q, A defines a map from input sequences (σ_1, ..., σ_T) ∈ Σ^T to state sequences (q_1, ..., q_T) ∈ Q^T:

    q_t := δ(q_{t-1}, σ_t)  for t = 1, ..., T.

This is a deterministic Markov model, in the sense that at time t, the future states q_{t+1}, ..., q_T only depend on the current state q_t and the future inputs σ_{t+1}, ..., σ_T. We define the task of simulation: given a semiautomaton A, starting state q_0, and input sequence (σ_1, ..., σ_T), output the state trajectory A_{T,q_0}(σ_1, ..., σ_T) := (q_1, ..., q_T). Let f : Σ^T → Q^T be a function (which in general can depend on A, T, q_0). We say that f simulates A_{T,q_0} if f(σ_{1:T}) = A_{T,q_0}(σ_{1:T}) for all input sequences σ_{1:T}. Finally, for a positive integer T, we say that a class F of functions from Σ^T to Q^T simulates A at length T if, for each q_0 ∈ Q, there is a function in F which simulates (A, T, q_0).
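For concreteness, the simulation task can be written out as a short reference routine. The sketch below is our own plain-Python illustration (the dictionary-based transition table and the parity example are illustrative choices, not code from the paper):

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Symbol = Hashable
# A semiautomaton is a transition table delta: (state, symbol) -> state.
Transition = Dict[Tuple[State, Symbol], State]

def simulate(delta: Transition, q0: State, sigma: List[Symbol]) -> List[State]:
    """Return the state trajectory (q_1, ..., q_T) for the input (sigma_1, ..., sigma_T)."""
    states, q = [], q0
    for s in sigma:
        q = delta[(q, s)]   # q_t = delta(q_{t-1}, sigma_t)
        states.append(q)
    return states

# Example: the parity counter (Figure 1, left). States {0, 1}; input 1 toggles, 0 does nothing.
parity = {(q, s): q ^ s for q in (0, 1) for s in (0, 1)}
assert simulate(parity, 0, [1, 0, 1, 1]) == [1, 1, 0, 1]
```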
Every semiautomaton induces a transformation semigroup T(A) of functions ρ : Q → Q under composition, generated by the per-input-symbol state mappings δ(·, σ) : Q → Q. When T(A) contains the identity function, it is called a transformation monoid. When all of the functions are invertible, T(A) is a permutation group. See Figure 1 for some examples which appear both in our theory and experiments; additional background (including a self-contained tutorial on the relevant concepts in finite group and semigroup theory) is provided in Appendix A.2. An elementary but interesting example is a parity counter (Figure 1, left): the state is a bit, and the inputs are {"toggle the bit", "do nothing"}; the transformation semigroup is C_2, the cyclic group of order 2. Parity has been studied in previous synthetic experiments (Zhang et al., 2022; Anil et al., 2022).

2.2 RECURRENT AND NON-RECURRENT NEURAL SEQUENCE MODELS

A sequence-to-sequence neural network of length T and dimension d is a function f_nn : R^{T×d} × Θ → R^{T×d}, with trainable parameters θ ∈ Θ. Equipped with an encoding layer E : Σ → R^d and a decoding layer W : R^d → Q (applied position-wise), the function (W ∘ f_nn ∘ E) : Σ^T → Q^T has the same input and output types as A_{T,q_0}. This work will investigate when the functions defined by neural networks can simulate semiautomata.

A recurrent neural network (RNN) is a sequence-to-sequence neural network defined by iterated composition of a recurrent unit g : R^d × R^d × Θ → R^d. For a given initial hidden state h_0 ∈ R^d and input sequence u_1, ..., u_T ∈ R^d, it produces an output hidden state sequence h_t := g(h_{t-1}, u_t; θ), t = 1, ..., T. Thus, for any fixed θ, an RNN defines a semiautomaton with infinitely many states and inputs: Q = Σ = R^d. In particular, as long as g can represent δ, RNNs can simulate all semiautomata. In this sense, the computational models of RNNs and semiautomata naturally coincide.

An L-layer Transformer is another sequence-to-sequence network, consisting of alternating self-attention blocks and feedforward MLP blocks:

    f_tf := (id + f_mlp^{(L)}) ∘ (id + f_attn^{(L)}) ∘ (id + f_mlp^{(L-1)}) ∘ ... ∘ (id + f_attn^{(1)}) ∘ (id + P).

Briefly, an attention layer performs ℓ_1-normalized mixing operations across positions t, while a constant-layer MLP block performs position-wise function approximation (with no mixing between positions); id denotes the identity function (residual connections), and P encodes the position t.^2 We use fairly standard positional encodings in both theory and experiments. Importantly, the standard Transformer is convolutional (in that the weights in f_attn and f_mlp are shared across positions t), but is non-recurrent: parameters are not shared across blocks.

All architectures have a notion of computational depth D (succinctly, depth) when processing inputs of length T, which is the longest path in the computational graph. For RNNs, this is Θ(T), while an L-layer Transformer (with constant-layer MLPs) has depth Θ(L). For Transformers, since they coincide up to constant factors, we use depth and number of layers interchangeably. We will also track the number of layers L, embedding dimension d, attention width (the largest number of parallel attention head outputs), and MLP width (the largest number of parallel hidden activations in the MLP blocks).^3

3 THEORY: SHORTCUTS ABOUND

A T-layer Transformer can trivially simulate a semiautomaton at length T sequentially: like an RNN, the t-th layer can implement (an embedding of) the state transition q_{t-1} ↦ q_t.
Yet, Transformers succeed in practice with long contexts (≈ 10^3) and fewer layers (as few as 6). A natural theoretical question is that of representability: can Transformers efficiently simulate semiautomata with parallel shortcut solutions, whose depths are much smaller than the sequence length T?

Definition 1 (Shortcut solution). Let A be a semiautomaton. Suppose that for every T ≥ 1, there is a sequence-to-sequence neural network f_T which simulates A at length T. We call this sequence {f_T}_{T≥1} a shortcut solution to the problem of simulating A if its depth D satisfies D = o(T).

By this definition, shortcuts are quite general and some are less interesting than others. For example, it is always possible to construct a constant-depth neural network which memorizes all |Σ|^T values of A_{T,q_0}, but these networks must be exceptionally wide. We could also "fast-forward" state simulation, letting each of (say) √T layers simulate √T consecutive state transitions, but, without exploiting the structure of the semiautomaton, this would require width Ω(2^√T). To rule out these cases and focus on interesting shortcuts for Transformers, we want the other size parameters (attention and MLP width) to be small: say, scaling polynomially with T, or even dependent only on |Q|, |Σ|. To construct such shortcuts, we need ideas beyond explicit iteration of state transitions.

3.1 SEMIAUTOMATA ADMIT SHALLOW PARALLEL SHORTCUTS

We begin by noting that polynomial-width shortcuts always exist. This may be counterintuitive if we restrict ourselves to viewing a network's intermediate activations as representations of the states q_t. When we instead view them as encoding state transformations δ(·, σ) : Q → Q and their compositions, a divide-and-conquer construction is evident (see Figure 2a), detailed in Appendix C.2:

Theorem 1 (Simulation is parallelizable; informal). Transformers can simulate all semiautomata A = (Q, Σ, δ) at length T, with depth O(log T), embedding dimension O(|Q|), attention width O(|Q|), and MLP width O(|Q|^2).

If we assume that an attention head can only select a constant number of indices, Theorem 1 is unimprovable: the receptive field of a sublogarithmic-depth Transformer is not large enough. However, it is known in theory and practice that soft-attention heads are capable of attending broadly, representing certain non-sparse dependencies (Clark et al., 2019; Yao et al., 2021). Thus, we can ask a more challenging question: can the dense operations of attention enable even shallower shortcuts?

^2 We omit layer normalization. This discrepancy is superficial; see the discussion in Appendix A.4.
^3 Full statements and proofs also track ℓ∞-weight norms (the largest absolute value of any parameter) and the bit precision of each floating-point computation. We defer precise definitions and discussion to Appendix A.4.

Figure 2: Intuitions for the theoretical constructions. (a) Divide-and-conquer function composition yields logarithmic-depth shortcuts (Theorem 1). (b) The two atoms of the constant-depth Krohn-Rhodes decomposition (Theorem 2) of a solvable semiautomaton: modular addition and sequentially resettable memory. (c) Information flow of the cascade product, which is used to glue these atoms together, and is easily implemented with residual connections. (d) An even shorter shortcut solution for gridworld simulation (Theorem 3; see Appendix C.4).
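To illustrate the divide-and-conquer idea of Theorem 1 and Figure 2a outside of the Transformer itself, here is a minimal NumPy sketch that composes per-symbol transition matrices with a parallel prefix scan in O(log T) rounds; it is our own reference illustration, not the network construction from Appendix C.2:

```python
import numpy as np

def transition_matrices(delta, states, alphabet):
    """M[s][:, q] is the one-hot encoding of delta(q, s), i.e., e_{delta(q,s)} = M[s] @ e_q."""
    idx = {q: i for i, q in enumerate(states)}
    M = {}
    for s in alphabet:
        m = np.zeros((len(states), len(states)), dtype=int)
        for q in states:
            m[idx[delta[(q, s)]], idx[q]] = 1
        M[s] = m
    return M, idx

def prefix_states_parallel(M, idx, q0, sigma):
    """All states q_1..q_T via an inclusive scan over matrix composition (O(log T) rounds)."""
    P = [M[s].copy() for s in sigma]   # P[t] starts as the single-step map at position t
    d = 1
    while d < len(P):
        for t in range(len(P) - 1, d - 1, -1):  # later segment multiplies on the left
            P[t] = P[t] @ P[t - d]
        d *= 2
    e0 = np.zeros(len(idx), dtype=int)
    e0[idx[q0]] = 1
    inv = {i: q for q, i in idx.items()}
    return [inv[int(np.argmax(p @ e0))] for p in P]

# Sanity check against sequential simulation on the parity counter.
parity = {(q, s): q ^ s for q in (0, 1) for s in (0, 1)}
M, idx = transition_matrices(parity, [0, 1], [0, 1])
assert prefix_states_parallel(M, idx, 0, [1, 0, 1, 1]) == [1, 1, 0, 1]
```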
The key to resolving this question comes from Krohn-Rhodes theory, which gives us tools to reason about the structure of arbitrary semiautomata and their transformation semigroups. A landmark result (Krohn & Rhodes, 1965), a vast generalization of the uniqueness of prime factorizations for integers, shows that to simulate any semiautomaton, we only need to handle two types of elementary objects: simple groups, and a memory unit (Figure 1b). When the Krohn-Rhodes decomposition contains no non-abelian groups (we call such a semiautomaton solvable^4), there exist constant-depth circuits for simulation, which we manifest as neural networks.

It turns out that positional weight sharing (a.k.a. "width-1 convolutions"), non-recurrence, and self-attention are particularly well-suited for efficiently representing the Krohn-Rhodes decomposition of a semiautomaton: uniform-sum attention heads perform abelian group operations, proximity-based selection heads implement memory units, and the rest of the architecture (MLPs and residual connections) implements the cascade product (Definition 4) which combines these atomic operations. Overall, we conclude:

Theorem 2 (Transformer Krohn-Rhodes; informal). Transformers can simulate all solvable semiautomata A = (Q, Σ, δ), with depth O(|Q|^2 log |Q|), embedding dimension 2^{O(|Q| log |Q|)}, attention width 2^{O(|Q| log |Q|)}, and MLP width |Q|^{O(2^{|Q|})} + 2^{O(|Q| log |Q|)} · T.^5

It is quite counterintuitive^6 that as T → ∞, no additional depth is needed for such a large class of problems. We provide background and details (including the definition and implementation of this notion of semigroup product) in Appendices A.2 and C.3. In Figures 2b and 2c, we illustrate the three key ingredients: efficient implementations of the two atoms (modular counting and memory lookups), and the procedure for gluing them together (building a transformation cascade).

What does each layer do? The construction in Theorem 1 recursively composes functions, as opposed to the naive solution of directly emulating states. Theorem 2 takes a very different approach: it relies on the holonomy decomposition variant of the Krohn-Rhodes theorem (Eilenberg, 1974). Rather than simulating q_t or composing functions, the computational paths correspond to a |Q|-level tree of nested coarsenings of the semiautomaton's dynamics: which subset of states could q_t be in right now? Within each level of this tree, the network must implement (generally non-commutative) group operations. This can be done with O(|Q| log |Q|) layers, by leveraging Jordan-Hölder decompositions and the universal embedding theorem (Krasner & Kaloujnine, 1951).

^4 See Definition 6. Among the solvable groups are the dihedral groups D_{2n}, the permutation groups S_n and A_n for n ≤ 4, the quaternion group Q_8, all groups of order < 120 except A_5, and all groups of odd order.
^5 Perhaps surprisingly, the only place where a width of T is used is to implement a mod-n gate. This dependence can be removed entirely if we allow for periodic activation functions such as x ↦ sin(x).
^6 From the back cover of Rhodes et al. (2010): the underlying theorem "launched a theory which reveals deep and unexpected connections between algebra (semigroups) and areas of science and engineering."
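To make the two atoms of Figure 2b concrete, the sketch below gives plain-Python reference implementations of a modular counter and a sequentially resettable memory unit, annotated with the parallel-friendly identities the construction exploits (a prefix sum for the counter, a most-recent-write lookup for the memory). These are our own illustrations of the objects being composed, not the attention circuits of Appendix C.3.

```python
from itertools import accumulate

# Atom 1: modular counter (the mod-2 parity counter generalizes to mod p).
# Sequentially: q_t = (q_{t-1} + sigma_t) mod p.  In parallel: q_t = (q_0 + sum(sigma_{1:t})) mod p,
# which a uniform-attention head can compute as a prefix sum followed by a position-wise mod.
def mod_counter(p, q0, sigma):
    return [(q0 + s) % p for s in accumulate(sigma)]

# Atom 2: memory unit ("flip-flop"). Inputs are either a write of a value v or a no-op (None).
# Sequentially: q_t = sigma_t if sigma_t is a write, else q_{t-1}.
# In parallel: q_t is the most recent write at or before position t (q_0 if there is none),
# which an attention head can implement by attending sharply to the latest write position.
def memory_unit(q0, sigma):
    out, last = [], q0
    for s in sigma:
        if s is not None:
            last = s
        out.append(last)
    return out

assert mod_counter(2, 0, [1, 0, 1, 1]) == [1, 1, 0, 1]
assert memory_unit("a", [None, "b", None, "c", None]) == ["a", "b", "b", "c", "c"]
```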
Can we get even shallower shortcuts? Finally, we show that on a natural class of problems, the computational model of self-attention leads to further fine-grained improvements over the guarantees of Krohn-Rhodes theory. Motivated by the application of Transformers in modeling environment dynamics, we consider the semiautomaton Grid_n corresponding to a "gridworld": n states on a line, with input symbols "move left if possible" and "move right if possible" (see Figure 1, middle). We show that self-attention enables an extremely concise solution, with depth independent of both T and |Q| = n:

Theorem 3 (Depth-2 shortcut for gridworld; informal). For all positive integers n, T, Transformers can simulate Grid_n at length T, with depth 2,^7 embedding dimension O(1), attention width O(n), and MLP width O(T).^8

^7 This requires max-pooling. If we do not use max-pooling, we can instead use an MLP with width 2^{O(n)} and depth O(1), or width O(n) and depth O(log n).
^8 As with Theorem 2, the width can be reduced to O(n) if we employ periodic activation functions.

The proof builds a concise parallel nearest boundary detector, and can be found in Appendix C.4. We note that this particular setting is known to be an extremal case for the holonomy construction in Krohn-Rhodes theory (Maler (2010) discusses this, calling it the "elevator automaton"). It would be interesting to generalize our improvement and characterize the class of problems for which self-attention affords O(1)-depth instead of poly(|Q|)-depth solutions.

Aren't neural networks universal function approximators? Sufficiently wide neural networks with sufficiently expressive nonlinearities can fit arbitrary functions (Hornik et al., 1989; Cybenko, 1989). However, if we constrain complexity measures such as depth and width, one cannot hope to apply universality directly. It is true that one can take the discrete circuit constructions in (Barrington & Thérien, 1988), compile every gate to a constant-depth network, and recover shortcut solutions with o(T) depth and poly(T) width. However, our constructions go far beyond black-box reductions: the roles of self-attention and positional parameter sharing allow for such efficient constructions that no parameter count depends on T (except the MLP width, which is removable with a periodic activation function). Furthermore, the constructions are so simple and natural that they are corroborated by the preliminary reverse-engineering investigation in Section 4.

3.2 LOWER BOUNDS

Can Theorem 2 be improved to handle non-solvable semiautomata? (Equivalently: can Theorem 1 be improved to constant depth?) It turns out that, as a consequence of a classic result in circuit complexity (Barrington, 1986), this question is equivalent to the major open question of whether TC^0 = NC^1 (thus: conjecturally, no). Unless these complexity classes collapse, Theorems 1 and 2 are optimal. In summary, simulating non-solvable semiautomata with constant depth is provably hard:

Theorem 4 (Transformer Barrington). Let A be a non-solvable semiautomaton. Then, for sufficiently large T, no O(log T)-precision Transformer with depth independent of T and width polynomial in T can continuously simulate A at length T, unless TC^0 = NC^1.

This is proven in Appendix C.5. The smallest example of a non-solvable semiautomaton is the one on |Q| = 5 states whose transitions generate A_5 (all of the even permutations).

Finally, we note that although our width bounds might be improvable, an exponential-in-|Q| number of hypotheses (and hence a network with poly(|Q|) parameters) is unavoidable if one wishes to learn an arbitrary |Q|-state semiautomaton from data: there are |Q|^{|Q|·|Σ|} of them, which generate |Q|^{Ω(|Q|^2)} distinct semigroups (Kleitman et al., 1976). If we wish to study how machine learning models can efficiently identify large algebraic structures, we will need finer-grained inductive biases to specify which semiautomata to prefer, a direction for future work.
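Returning to the gridworld of Theorem 3, the following sketch (our own, and not the depth-2 attention construction of Appendix C.4) illustrates one reason Grid_n admits very short shortcuts: every transition is a clamped shift x ↦ min(hi, max(lo, x + c)), and clamped shifts are closed under composition, so any prefix of the dynamics is summarized by just three numbers, independent of both n and T.

```python
# Each Grid_n move is a "clamped shift" represented as (lo, hi, c): x -> min(hi, max(lo, x + c)).
def compose(f, g):
    """Return the clamped shift g∘f (f applied first)."""
    lo1, hi1, c1 = f
    lo2, hi2, c2 = g
    c = c1 + c2
    hi = min(hi2, max(lo2, hi1 + c2))
    lo = min(max(lo2, lo1 + c2), hi)
    return (lo, hi, c)

def apply(f, x):
    lo, hi, c = f
    return min(hi, max(lo, x + c))

def grid_states(n, q0, moves):
    """States of Grid_n (positions 0..n-1) under 'L'/'R' moves, via prefix composition."""
    step = {"L": (0, n - 1, -1), "R": (0, n - 1, +1)}
    prefix, out = (0, n - 1, 0), []   # the identity on the grid, as a clamped shift with c = 0
    for m in moves:                   # a sequential fold here; any associative scan works too
        prefix = compose(prefix, step[m])
        out.append(apply(prefix, q0))
    return out

# Check against direct (clamped) simulation.
def grid_states_direct(n, q0, moves):
    out, q = [], q0
    for m in moves:
        q = max(0, q - 1) if m == "L" else min(n - 1, q + 1)
        out.append(q)
    return out

moves = list("RRLRRRLLLLRR")
assert grid_states(4, 1, moves) == grid_states_direct(4, 1, moves)
```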
4 EXPERIMENTS: CAN SGD FIND THE SHORTCUTS?

Our theorems are limited to representability: concise shallow solutions exist, but whether gradient-based local search (i.e., standard training) finds them is another matter entirely. For example, embedded within the problem of learning to simulate the 2-state parity semiautomaton is a well-known non-convex optimization problem (Daniely & Malach, 2020; Edelman et al., 2022; Nichani et al., 2022). In general, even detecting whether T(A) contains a cycle is PSPACE-hard (Cho & Huynh, 1991). Theoretically understanding how the training dynamics of deep learning transcend the worst-case hardness of non-convex optimization is a major frontier of research that we do not attempt to address here. Instead, we approach the question of optimization through an empirical lens. Our primary goal is to understand if gradient-based training can find shortcut solutions at all, rather than whether such training is stable. Accordingly, unless otherwise noted, we report the performance of the best model among 20 replicates; the median performance is provided in Appendix B.

Figure 3: Overview of the empirical results in Section 4, on in-distribution learnability of shortcuts by standard Transformer training. (a) Truncated table of results (in-distribution accuracy) across tasks (rows) and network depths of 1-16 layers (columns). (b) Attention heatmaps for Grid_8: attention heads implement a nearest boundary detector (top); training curves for C_2 and S_5: training is highly unstable (bottom).

For a selection of 19 semiautomata corresponding to various groups and semigroups, we train shallow Transformers to output their state sequences given random inputs. Specifically, we apply GPT-2-like models (Radford et al., 2019) with 1-16 layers on freshly-sampled sequences of length T = 100.^9 Strikingly, we obtain positive results (> 99% in-distribution accuracy) for all of them, including ones which generate the non-solvable groups A_5 and S_5.^10 Figure 3a gives a selection of our full results (in Appendix B.1). We find that more complex semiautomata (corresponding to non-abelian groups) require deeper networks to learn, in agreement with our theoretical constructions.

^9 Using freshly-sampled data ensures that the model cannot achieve good performance by brute-force memorization in a number of training steps we could ever execute computationally (for sufficiently large T such as T = 100), since there are exponentially many sequences.
^10 Explanations of why certain groups are harder to learn are provided in Appendix B.1.1.

Which shallow solutions are learned? Our theoretical results identify shortcut solutions which follow multiple, mutually incompatible paradigms. In general, we do not attempt a full investigation of mechanistic interpretability of the trained models. As preliminary evidence, we visualize some of the attention patterns in Figure 3b (top) within successfully-trained models, finding attention heads which perform flat summations (with uniform attention) and conditional resets.
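For reference, the training distribution used throughout this section (freshly sampled input sequences, labeled position-wise with the ground-truth state trajectory) can be sketched as follows; the token encoding and the flip-flop example are illustrative choices, not the paper's exact data pipeline.

```python
import random

def sample_batch(delta, alphabet, q0, T=100, batch_size=8, seed=None):
    """Freshly sample (input, label) pairs: uniform random symbols; labels are the state trajectory."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for _ in range(batch_size):
        sigma = [rng.choice(alphabet) for _ in range(T)]
        q, traj = q0, []
        for s in sigma:
            q = delta[(q, s)]
            traj.append(q)
        inputs.append(sigma)
        labels.append(traj)
    return inputs, labels

# Example: the flip-flop memory unit with a no-op symbol "_" and write symbols "1", "2".
flipflop = {(q, "_"): q for q in "12"}
flipflop.update({(q, w): w for q in "12" for w in "12"})
xs, ys = sample_batch(flipflop, alphabet=["_", "1", "2"], q0="1", T=10, seed=0)
```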
Optimization quirks. Although sufficiently deep networks find the solutions with non-negligible probability, the training dynamics are unstable; Figure 3b (bottom) shows some training curves, exhibiting high variance, negative progress, or accuracy that decays with continued training. In the same vein as the synthetic reasoning tasks introduced by Zhang et al. (2022), we hope that semiautomaton simulation will be useful as a clean, nontrivial testbed (with multiple difficulty knobs) for debugging and improving training algorithms, and perhaps the neural architectures themselves.

5 FURTHER EXPERIMENTS: MORE CHALLENGING SETTINGS

For a wide family of algebraic structures, we have proven that the function class of shallow non-recurrent networks subsumes deeper finite-state recurrent models. Furthermore, the experiments in Section 4 have shown that despite the non-convexity of the optimization problem, standard training works: Transformers can learn shortcuts to semiautomaton simulation, end-to-end. While encouraging, the experiments in Section 4 are idealized in several ways, and it is natural to ask whether Transformers perform similarly in more challenging semiautomaton simulation scenarios. Towards answering this question, in this section we consider some challenges that may arise in practice and an associated set of experimental results; further details are deferred to Appendix B.2.

Figure 4: Overview of the empirical results in Section 5. (a) Learning in the latent-state setting, with various observation maps φ(q_t); accuracies with indirect supervision (the LSTM baseline gets 100% on all tasks):

    Task:         Dyck_{4,8}   Grid_9        S_5          C_4          D_8       (abab)
    Observation:  stack top    1{boundary}   π_{1:t}(1)   1{0 mod 4}   location  accept
    Accuracy:     100.0        99.7          98.0         99.7         99.8      100.0

(b) Learning from incomplete state sequences: final accuracy vs. the position-wise probability p_reveal of a hidden token (log spacing). (c) OOD generalization on C_2 (parity): accuracy at different Pr[σ = 1] and sequence lengths T; Transformers fail to generalize to different distributions and lengths.

5.1 INCOMPLETE AND INDIRECT SUPERVISION

Automata are partially observable semiautomata. Consider the case of partial observability. For any semiautomaton A = (Q, Σ, δ) and a (generally non-invertible) observation function φ : Q → Q̄, we can define the problem of predicting q̄_t := φ(q_t). If we can only obtain the observations q̄_t (i.e., the state is latent), this fully captures the problem of learning a finite-state automaton from data. The results in this paper have shown that this is equivalent to the fully-observable case in terms of representation. However, the learning problem can be much harder; indeed, this may account for Bhattamishra et al. (2020)'s negative results on learning regular languages with constant-depth Transformers. Note that this also captures autoregressive next-token prediction tasks induced by distributions (e.g., generating Dyck languages (Yao et al., 2021)) where the sequence's continuations depend on a latent semiautomaton's state (e.g., the current stack for Dyck).
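As a concrete instance of this latent-state setup, the sketch below wraps a semiautomaton with an observation map φ and emits only φ(q_t) as the supervision target; the gridworld boundary indicator mirrors the 1{boundary} task in Figure 4a, and the helper names are our own.

```python
def observed_trajectory(delta, phi, q0, sigma):
    """Labels for the partially observable setting: only phi(q_t) is revealed, not q_t itself."""
    out, q = [], q0
    for s in sigma:
        q = delta[(q, s)]
        out.append(phi(q))
    return out

# Grid_9 with the non-invertible observation "am I at a boundary?" (cf. Figure 4a).
n = 9
grid = {(q, "L"): max(0, q - 1) for q in range(n)}
grid.update({(q, "R"): min(n - 1, q + 1) for q in range(n)})

def at_boundary(q):
    return int(q in (0, n - 1))

print(observed_trajectory(grid, at_boundary, 0, list("RRLLRRRRRRRRR")))
```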
Despite these potential challenges, we find that Transformers are able to find a solution with good in-distribution performance for all partially observable settings we consider; see Figure 4(a).

Learning from incomplete state sequences. Next, we consider a setting identical to that described in Section 4, except that each state q_t is revealed in the training data only with some probability 0 ≤ p_reveal ≤ 1. As with partial observability, this does not affect representation issues, but it can make learning/optimization much harder. Figure 4b shows the accuracy on S_5 for models trained on length-100 sequences for various values of p_reveal. It can be seen that Transformers may be unable to find good solutions when the labels become sparser, whereas the LSTM's performance stays robust across all choices of p_reveal.

5.2 OUT-OF-DISTRIBUTION SHORTCOMINGS OF SHORTCUT SOLUTIONS

The theoretical construction of modular counters (Lemma 6) suggests a possible failure mode: if attention performs prefix addition and the MLP computes the sum modulo n, the MLP could fail on sums unseen during training. This suggests that if the distribution over σ_{1:T} shifts between training and testing (but the semiautomaton remains the same), a non-recurrent shortcut solution might map inputs into an intermediate latent variable space (like the sum) which fails to generalize. Indeed, we observe that with the same models which obtain the positive in-distribution results in Section 4, accuracy degrades as distribution shift increases; see Figure 4(c) (left), where the performance drops as the probability of seeing input σ = 1 deviates from the training distribution (Pr[σ = 1] = 0.5). From the viewpoint of mechanistic interpretation, this is further (but not absolutely conclusive) evidence that with standard training, Transformers learn implementations similar to those predicted by the theory. We provide details and further empirical evidence in Appendix B.2.3.

More ambitiously, we could try to use these models to extrapolate to longer sequence lengths T than those seen in the training data. Promoting this difficult desideratum of length generalization is an intricate problem in its own right; see Yao et al. (2021) and Anil et al. (2022) for more experiments similar to ours. Figure 4(c) (right) shows the performance on sequences of various lengths, where the Transformer's accuracy drops sharply as we move to lengths unseen during training. In contrast, the LSTM performs perfectly in both out-of-distribution scenarios. Details are deferred to Appendix B.2.4.

Shortcuts as unintended solutions. Throughout the deep learning literature, the term "shortcut" is often used in a statistical sense to connote undesired (i.e., misleading, spurious, or overfitting) learned representations (Geirhos et al., 2020; Robinson et al., 2021). The experiments in this section show why our circuit-depth shortcuts are statistical shortcuts. Specifically, we have identified a problem with learning relaxations to sequential state simulation: the models may hallucinate statistically suboptimal latent variables. The positive results in Sections 3 and 4 suggest that this may only be robustly diagnosable via out-of-distribution evaluation. Finally, we empirically show that this flaw is circumventable.
Using a combination of scratchpad (a.k.a. "chain-of-thought") training (Nye et al., 2021; Wei et al., 2022) and recency bias (Press et al., 2022), we demonstrate that Transformers can be guided towards learning recurrent (depth-T) solutions that generalize out-of-distribution and to longer sequence lengths (Figure 4(c), yellow curves).

Computational-statistical tradeoffs. The experiments in this section highlight a statistical price for learning shortcuts to semiautomaton simulation. On the other hand, the shallowness of these shortcuts is computationally appealing: leveraging parallel computation, they enjoy much lower latency (O(log T) or O(1), compared to O(T)), in both training and inference. Whether the best of both worlds is attainable is an interesting avenue for future work.

6 CONCLUSIONS AND FUTURE WORK

We have conducted a theoretical and empirical analysis of how shallow Transformers can learn shortcut solutions to the problem of simulating the transitions of semiautomata (and thus, the algebraic structures which underlie regular expressions, finite-state transducers, and deterministic MDPs). Using tools from semigroup theory and circuit complexity, we have constructed explicit logarithmic-depth and constant-depth shortcuts for semiautomaton simulation. Experimentally, we have shown that gradient-based optimization finds shortcut solutions which generalize near-perfectly in-distribution (Section 4), but are brittle out-of-distribution (Section 5). We hope that these results shed new light on the power and limitations of applying shallow non-recurrent models, even when the dynamics we wish to represent are deep and recurrent.

Beyond Transformers? The theory and experiments in this work are specialized to the Transformer architecture, to provide a concrete and timely setting. However, we note that the underlying themes (continuous arithmetic circuits; parameter sharing across input positions and/or iterated function compositions; local vs. global computational units) are not specialized to any particular neural architecture,^11 nor to the field of deep learning at all. The question of which architectures are even more natural for learning discrete automata, while remaining optimizable by gradient-based search, is extremely open-ended. We believe that the themes of sufficient depth and recurrent vs. non-recurrent function composition are relevant to the study of other (and future) deep learning methods.

Future topics. In terms of theory, we have only scratched the surface of the possible interplay between neural architectures and classical ideas from the complexity theories of circuits and automata. One salient direction is to generalize the "shorter shortcut" constructions in Theorem 3. Also, we have made no attempt to treat stochastic environments, which would fully capture probabilistic Markov models and MDPs. Section 5 alludes to a landscape of algorithm design challenges in the presence of distribution shift and limited supervision. The latter (i.e., latent state inference) is known to lead to worst-case computational hardness (Papadimitriou & Tsitsiklis, 1987), but yields powerful empirical tools when tractable. Towards fully understanding and leveraging the circumstances which allow learning algorithms to decode and simulate q_t, there is much work to be done.

^11 In fact, the divide-and-conquer construction of Theorem 1 is almost recurrent with log(T) depth, and resembles WaveNet-like hierarchical pooling (van den Oord et al., 2016; Larsson et al., 2016) more than Transformers.
Published as a conference paper at ICLR 2023 ACKNOWLEDGEMENTS We are very grateful to Abhishek Shetty for helpful discussions about circuit complexity. We also thank Ashwini Pokle for thoughtful comments and suggestions towards improving clarity and readability. REPRODUCIBILITY STATEMENT Complete proofs of the theoretical results are provided in Appendix C, with a self-contained tutorial of relevant group-theoretic concepts in Appendix A.2. For the empirical results, all our datasets are derived from synthetic distributions, which are clearly described in Appendix B.1 and B.2. The architectures, implementations (with references to popular base repositories), and hyperparameters (including training procedure) are documented in Appendix B.3. We intend to release our code as open source prior to publication. Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. ar Xiv:2207.04901, 2022. Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge University Press, 2009. Arpit Bansal, Avi Schwarzschild, Eitan Borgnia, Zeyad Emam, Furong Huang, Micah Goldblum, and Tom Goldstein. End-to-end algorithm synthesis with recurrent networks: Logical extrapolation without overthinking. ar Xiv:-2202.05826, 2022. Boaz Barak, Benjamin L Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. ar Xiv:2207.08799, 2022. David A. Mix Barrington. Bounded-width polynomial-size branching programs recognize exactly those languages in NC1. In Symposium on the Theory of Computing, 1986. David A. Mix Barrington and Denis Th erien. Finite monoids and the fine structure of NC1. Journal of the ACM, 1988. Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. On the ability and limitations of transformers to recognize formal languages. In Conference on Empirical Methods in Natural Language Processing, 2020. Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv:2104.13478, 2021. Ashok K Chandra, Steven Fortune, and Richard Lipton. Unbounded fan-in circuits and associative functions. In Symposium on Theory of Computing, 1983. Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, 2021a. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philipp Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. 
Carr, Jan Leiki, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Published as a conference paper at ICLR 2023 Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob Mc Grew, Dario Amodei, Sam Mc Candlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. ar Xiv:2107.03374, 2021b. Sang Cho and Dung T Huynh. Finite-automaton aperiodicity is PSPACE-complete. Theoretical Computer Science, 1991. Noam Chomsky and Marcel P Sch utzenberger. The algebraic theory of context-free languages. In Studies in Logic and the Foundations of Mathematics. 1959. Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Conditional positional encodings for vision transformers. ar Xiv preprint ar Xiv:2102.10882, 2021. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT s attention. In ACL Workshop Blackbox NLP: Analyzing and Interpreting Neural Networks for NLP, 2019. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989. Amit Daniely. Depth separation for neural networks. In Conference on Learning Theory, pp. 690 696. PMLR, 2017. Amit Daniely and Eran Malach. Learning parities with neural networks. Advances in Neural Information Processing Systems, 2020. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019. Gr egoire Del etang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Marcus Hutter, Shane Legg, and Pedro A Ortega. Neural networks and the chomsky hierarchy. ar Xiv preprint ar Xiv:2207.02098, 2022. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ar Xiv:1810.04805, 2018. Iddo Drori, Sarah Zhang, Reece Shuttleworth, Leonard Tang, Albert Lu, Elizabeth Ke, Kevin Liu, Linda Chen, Sunny Tran, Newman Cheng, Roman Wang, Nikhil Singh, Taylor L. Patti, Jayson Lynch, Avi Shporer, Nakul Verma, Eugene Wu, and Gilbert Strang. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 2022. Javid Ebrahimi, Dhruv Gelda, and Wei Zhang. How can self-attention networks recognize Dyck-n languages? In Findings of the Association for Computational Linguistics: EMNLP, 2020. Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, 2022. Attila Egri-Nagy and Chrystopher L Nehaniv. Computational holonomy decomposition of transformation semigroups. ar Xiv:1508.06345, 2015. Samuel Eilenberg. Automata, languages, and machines. Academic Press, 1974. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pp. 907 940. PMLR, 2016. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Das Sarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam Mc Candlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. 
URL https://transformer-circuits.pub/2021/framework/index.html. Merrick Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. Mathematical Systems Theory, 1984. Published as a conference paper at ICLR 2023 Robert Geirhos, J orn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2020. Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the Re LU in polynomial time. In Conference on Learning Theory, 2017. Alex Graves. Adaptive computation time for recurrent neural networks. ar Xiv preprint ar Xiv:1603.08983, 2016. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. ar Xiv preprint ar Xiv:1410.5401, 2014. Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. Non-autoregressive neural machine translation. ar Xiv:1711.02281, 2017. Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. ar Xiv:1912.01603, 2019. Michael Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 2020. Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models without positional encodings still learn positional information. ar Xiv:2203.16634, 2022. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016. Christoph Hertrich, Amitabh Basu, Marco Di Summa, and Martin Skutella. Towards lower bounds on the depth of Re LU neural networks. In Advances in Neural Information Processing Systems, 2021. W Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, 1986. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 1989. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. ar Xiv:1801.06146, 2018. De Lesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, and Behnam Neyshabur. Blockrecurrent transformers. ar Xiv:2203.07852, 2022. Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems, 2021. Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A Smith. Finetuning pretrained transformers into rnns. ar Xiv:2103.13076, 2021. Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. ar Xiv preprint ar Xiv:2006.15595, 2020. Daniel J Kleitman, Bruce R Rothschild, and Joel H Spencer. The number of semigroups of order n. Proceedings of the American Mathematical Society, 1976. L aszl o Kov acs and Cheryl Praeger. Finite permutation groups with large abelian quotients. Pacific Journal of Mathematics, 1989. Marc Krasner and L eo Kaloujnine. Produit complet des groupes de permutations et probleme d extension de groupes II. Acta Scientiarum Mathematicarum, 1951. Kenneth Krohn and John Rhodes. Algebraic theory of machines, I: Prime decomposition theorem for finite semigroups and machines. Transactions of the American Mathematical Society, 1965. Published as a conference paper at ICLR 2023 Guillaume Lample and Franc ois Charton. 
Deep learning for symbolic mathematics. ar Xiv:1912.01412, 2019. Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractal Net: Ultra-deep neural networks without residuals. ar Xiv:1605.07648, 2016. Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, and Sanjeev Arora. On the ability of neural nets to express distributions. In Conference on Learning Theory, pp. 1271 1296. PMLR, 2017. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with alphacode. ar Xiv:2203.07814, 2022. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ar Xiv:1711.05101, 2017. Oded Maler. On the Krohn-Rhodes cascaded decomposition theorem. In Time for Verification. 2010. Oded Maler and Amir Pnueli. On the cascaded decomposition of automata, its complexity and its application to logic (Draft). 1994. Carlo Mereghetti and Beatrice Palano. Threshold circuits for iterated matrix product and powering. RAIRO-Theoretical Informatics and Applications, 2000. William Merrill, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. On the power of saturated Transformers: A view from circuit complexity. ar Xiv:2106.16213, 2021. Vincent Micheli, Eloi Alonso, and Franc ois Fleuret. Transformers are sample efficient world models. ar Xiv:2209.00588, 2022. Anirbit Mukherjee and Amitabh Basu. Lower bounds over boolean inputs for deep neural networks with Re LU gates. ar Xiv:1711.03073, 2017. Neel Nanda and Tom Lieberum. A mechanistic interpretability analysis of grokking. Alignment Forum, 2022. URL https: //www.alignmentforum.org/posts/N6WM6hs7RQMKDh Yj B/ a-mechanistic-interpretability-analysis-of-grokking. Benjamin Newman, John Hewitt, Percy Liang, and Christopher D. Manning. The EOS decision and length extrapolation. In Blackbox NLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020. Eshaan Nichani, Yu Bai, and Jason D Lee. Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials. ar Xiv:2206.03688, 2022. Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Investigating the limitations of transformers with simple arithmetic tasks. ar Xiv:2102.13019, 2021. Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. ar Xiv:2112.00114, 2021. Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 1987. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K opf, Edward Yang, Zach De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py Torch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 2019. Published as a conference paper at ICLR 2023 Matthew E. 
Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. ar Xiv:1802.05365, 2018. Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving. ar Xiv:2009.03393, 2020. Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Open AI blog, 2019. John H. Reif and Stephen R. Tate. On threshold circuits and polynomial computation. SIAM Journal on Computing, 1992. John Rhodes, Chrystopher L Nehaniv, and Morris W Hirsch. Applications of automata theory and algebra: via the mathematical theory of complexity to biology, physics, psychology, philosophy, and games. World Scientific, 2010. Joshua Robinson, Li Sun, Ke Yu, Kayhan Batmanghelich, Stefanie Jegelka, and Suvrit Sra. Can contrastive learning avoid shortcut solutions? Advances in Neural Information Processing Systems, 2021. Itay Safran, Ronen Eldan, and Ohad Shamir. Depth separations in neural networks: what is actually being separated? In Conference on Learning Theory, pp. 2664 2666. PMLR, 2019. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distil BERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ar Xiv:1910.01108, 2019. Tal Schuster, Ashwin Kalyan, Alex Polozov, and Adam Kalai. Programming puzzles. In Advances in Neural Information Processing Systems Track on Datasets and Benchmarks, 2021. Marcel Paul Sch utzenberger. On finite monoids having only trivial subgroups. Information and Control, 1965. Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks. In Advances in Neural Information Processing Systems, 2021. Hava T Siegelmann and Eduardo D Sontag. On the computational power of neural nets. In Conference on Learning Theory, 1992. Matus Telgarsky. Benefits of depth in neural networks. In Conference on learning theory, pp. 1517 1539. PMLR, 2016. Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. ar Xiv:1905.05950, 2019. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wave Net: A generative model for raw audio. ar Xiv:1609.03499, 2016. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017. Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake A. Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In IEEE Conference on Computer Vision and Pattern Recognition, 2021. Jesse Vig. Visualizing attention in transformer-based language representation models. ar Xiv:1904.02679, 2019. Published as a conference paper at ICLR 2023 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ar Xiv:2201.11903, 2022. Gail Weiss, Yoav Goldberg, and Eran Yahav. Thinking like Transformers. 
In International Conference on Machine Learning, 2021.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv:1910.03771, 2019.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.

Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, and Tie-yan Liu. A survey on non-autoregressive generation for neural machine translation and beyond. arXiv:2204.09269, 2022.

Keyulu Xu, Mozhi Zhang, Jingling Li, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. How neural networks extrapolate: From feedforward to graph neural networks. arXiv:2009.11848, 2020.

Shunyu Yao, Binghui Peng, Christos H. Papadimitriou, and Karthik Narasimhan. Self-attention networks can process bounded hierarchical languages. In Association for Computational Linguistics, 2021.

Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, and Yang Gao. Mastering Atari games with limited data. Advances in Neural Information Processing Systems, 2021.

H Paul Zeiger. Cascade synthesis of finite-state machines. Information and Control, 1967.

Chiyuan Zhang, Maithra Raghu, Jon Kleinberg, and Samy Bengio. Pointer value retrieval: A new benchmark for understanding the limits of neural network generalization. arXiv:2107.12580, 2021a.

Linjun Zhang, Zhun Deng, Kenji Kawaguchi, Amirata Ghorbani, and James Zou. How does mixup help with robustness and generalization? In International Conference on Learning Representations, 2021b.

Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. Unveiling Transformers with LEGO: a synthetic reasoning task. arXiv:2206.04301, 2022.

Karl-Heinz Zimmermann. On Krohn-Rhodes theory for semiautomata. arXiv:2010.16235, 2020.

Table of Contents

A Additional background and notation
  A.1 Notation
  A.2 Automata, semigroups, and groups
  A.3 Shallow circuit complexity classes
  A.4 The Transformer architecture
  A.5 Additional discussion of related work
B Experiments
  B.1 Section 4: SGD finds the shortcuts, under ideal supervision
  B.2 Section 5: Failures of shortcuts in more challenging settings
  B.3 Additional details
C Proofs
  C.1 Useful definitions and lemmas
  C.2 Proof of Theorem 1: Logarithmic-depth shortcuts via parallel prefix sum
  C.3 Proof of Theorem 2: Constant-depth shortcuts via Krohn-Rhodes decomposition
  C.4 Proof of Theorem 3: Even shorter shortcuts for gridworld
  C.5 Proof of Theorem 4: Depth lower bound for non-solvable semiautomata

A ADDITIONAL BACKGROUND AND NOTATION

A.1 NOTATION

Below, we list our notational conventions for indices, vectors, matrices, and functions.
For a natural number n, [n] denotes the index set {1, 2, . . . , n}.
For a vector v ∈ ℝ^n and i ∈ [n], v_i denotes the i-th entry. When v is an expression, we use [v]_i for clarity. Vectors can be instantiated by square brackets (like [1 2 3] ∈ ℝ^3). They can be indexed by slices: v_{a:b} denotes [v_a v_{a+1} . . . v_b]. We adhere to the convention that all vectors are column vectors.
For a matrix M ∈ ℝ^{m×n}, i ∈ [m], j ∈ [n]: M_{ij} (or [M]_{ij}) denotes the (i, j)-th entry. M_{i,:} and M_{:,j} denote the i-th row and j-th column, respectively. Importantly, we note the convention that this slice notation converts all vectors into column vectors. When the first dimension of a matrix M ∈ ℝ^{T×d} is to be interpreted as a sequence length, we will implicitly convert a sequence of vectors v_1, . . . , v_T ∈ ℝ^d into a matrix (v_1, . . . , v_T) ∈ ℝ^{T×d} whose rows are the vectors v_t. This is an arbitrary choice (compared to concatenating columns and obtaining a matrix in ℝ^{d×T}), selected to adhere to previously standardized notation for the Transformer. We will sometimes index vectors and matrices with named indices (such as for padding tokens) instead of integers, for clarity.
e_i denotes the i-th elementary (one-hot) unit vector. Likewise as above, we sometimes use non-integer indices (e.g. for padding tokens).
For vectors u, v ∈ ℝ^d, ⟨u, v⟩ = u^⊤ v both denote the inner product.
For a function f : X × Y → Z and all y ∈ Y, we will let f(·, y) : X → Z denote the restriction of f to y (and similarly for other restrictions). This appears in the per-input state transition functions δ(·, σ) : Q → Q, as well as the functions represented by neural networks for a particular choice of weights.
For functions f, g, f ∘ g denotes composition: (f ∘ g)(x) := f(g(x)). When we compose neural networks f : X × Θ_f → Y, g : Y × Θ_g → Z with parameter spaces Θ_f, Θ_g, we will use f ∘ g : X × (Θ_f × Θ_g) → Z to indicate the composition f(g(x; θ_g); θ_f).
(·)_+ : ℝ → ℝ denotes the ReLU (a.k.a. positive part) function: (x)_+ = max{x, 0}. In function compositions, we use σ to denote the entry-wise ReLU (e.g. f ∘ σ ∘ g).

A.2 AUTOMATA, SEMIGROUPS, AND GROUPS

Recall that a semiautomaton A = (Q, Σ, δ) has a state space Q, an input alphabet Σ, and a transition function δ : Q × Σ → Q. For any natural number T and a starting state q_0 ∈ Q, by repeated composition of the transition function δ, one can use A to define a map from a sequence of inputs (σ_1, . . . , σ_T) ∈ Σ^T to a sequence of states (q_1, . . . , q_T) ∈ Q^T via: q_t := δ(q_{t−1}, σ_t), t ∈ [T].
Here and below, it is helpful to use a matrix-vector notation to express the computation of semiautomata. For a given semiautomaton we can always identify the state space Q with the index set {1, . . . , |Q|} and use a one-hot encoding of states into {0, 1}^{|Q|}. For each input symbol σ ∈ Σ, we associate a transition matrix δ(·, σ) ∈ {0, 1}^{|Q|×|Q|} with entries [δ(·, σ)]_{q′,q} = 1{δ(q, σ) = q′}. This implies that for all q, σ, we have e_{δ(q,σ)} = δ(·, σ) e_q, so that the computation of the semiautomaton amounts to repeated matrix-vector multiplication.
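To make the matrix-vector view concrete, here is a minimal NumPy sketch (ours, not code from the paper; the helper names are illustrative) that builds one-hot transition matrices and checks that repeated matrix-vector multiplication reproduces the iterated transition function, using the parity semiautomaton of Example 1 below.

```python
import numpy as np

def transition_matrices(num_states, alphabet, delta):
    """One-hot transition matrices: M[sigma][q', q] = 1 iff delta(q, sigma) = q'."""
    mats = {}
    for sigma in alphabet:
        M = np.zeros((num_states, num_states), dtype=int)
        for q in range(num_states):
            M[delta(q, sigma), q] = 1
        mats[sigma] = M
    return mats

def simulate(delta, q0, inputs):
    """Reference simulation: iterate q_t = delta(q_{t-1}, sigma_t)."""
    states, q = [], q0
    for sigma in inputs:
        q = delta(q, sigma)
        states.append(q)
    return states

# Parity semiautomaton: Q = Sigma = {0, 1}, delta(q, sigma) = q XOR sigma.
delta = lambda q, sigma: q ^ sigma
mats = transition_matrices(2, [0, 1], delta)

inputs, q0 = [1, 0, 1, 1, 0], 0
e = np.eye(2, dtype=int)[q0]           # one-hot encoding of the initial state
matrix_states = []
for sigma in inputs:
    e = mats[sigma] @ e                # e_{delta(q, sigma)} = delta(., sigma) e_q
    matrix_states.append(int(np.argmax(e)))

assert matrix_states == simulate(delta, q0, inputs)
```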
While semiautomata are remarkably expressive, we discuss a few simple examples throughout this background section to elucidate the key concepts.

Example 1 (Parity). Let Q = Σ = {0, 1} and let δ(q, 0) = q and δ(q, 1) = 1 − q. Then, starting with q_0 = 0, the state at time t, q_t, is 1 if the binary sequence (σ_1, . . . , σ_t) has an odd number of 1s.

Example 2 (Flip-flop). Let Q = {1, 2}, Σ = {⊥, 1, 2} and let δ be given by
δ(·, ⊥) = I_{2×2},  δ(·, 1) = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix},  δ(·, 2) = \begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}.
As the name suggests, this semiautomaton implements a simple memory operation where the state at time t is the value of the most recent non-⊥ input symbol.

Example 3 (1D gridworld). Let S be a natural number, Q = {0, 1, . . . , S} and Σ = {L, ⊥, R}. Then the transition matrices are given by δ(·, ⊥) = I_{(S+1)×(S+1)}, and the (S+1)×(S+1) matrices with entries
[δ(·, L)]_{q′,q} = 1{q′ = max(q − 1, 0)},  [δ(·, R)]_{q′,q} = 1{q′ = min(q + 1, S)}.
This semiautomaton describes the movement of an agent along a line segment, where the inputs −1 and +1 correspond to decrementing and incrementing the state respectively, except that the decrement input has no effect at state 0 and the increment input has no effect at state S. Note that we have chosen a convention which differs slightly from the main paper (i.e. Figure 1): we enumerate the indices starting from 0 rather than 1. This is because the proofs are stated more naturally when the boundaries of the gridworld are identified with the indices 0 and S.

For a semiautomaton A = (Q, Σ, δ), each input symbol σ ∈ Σ defines a function δ(·, σ) : Q → Q. These functions can be composed in the standard way, and we use δ(·, σ_{1:t}) to denote the t-fold function composition. Note that δ(q_0, σ_{1:t}) is precisely the value of the state at time t on input σ_{1:t}. Thus, the set of all functions that can be obtained by composition of the transition operator, formally
T(A) := {δ(·, σ_{1:t}) : t ∈ ℕ, σ_{1:t} ∈ Σ^t},
plays a central role in describing the computation of the semiautomaton. This object is a transformation semigroup.

We now turn to describing the necessary algebraic background. Recall that a group (G, ·) is a set G equipped with a binary operation · : G × G → G such that
(identity) There exists an identity element e ∈ G such that e · g = g · e = g for all g ∈ G.
(invertibility) Every element g ∈ G has an inverse g^{−1} ∈ G such that g · g^{−1} = g^{−1} · g = e.
(associativity) The binary operation is associative: (g_1 · g_2) · g_3 = g_1 · (g_2 · g_3).
A monoid is less structured than a group; there must be an identity element and the binary operation must be associative, but invertibility is relaxed. A semigroup is even less structured: the only requirement is that the binary operation is associative.
It is common to let G be a subset of functions from Q → Q, where Q is some ground set, and let the binary operation be function composition. In this case, the structure is called a permutation group or transformation monoid/semigroup depending on which subset of the above properties hold. For transformation groups, since every element has an inverse under function composition, it is immediate that every element is some permutation over the ground set. In fact, taking G to be a subset of functions as above is without loss of generality: by Cayley's theorem every group is isomorphic (equivalent after renaming elements) to a transformation group on some ground set, and we can take the ground set to have the same number of elements as the original group (for finite groups).
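As a small illustration of T(A), the following sketch (ours; the state labels and input names are arbitrary relabelings, not the paper's) enumerates the transformation semigroup of the flip-flop of Example 2 by closing its per-symbol maps under composition; the result has exactly three elements: the identity and the two constant maps.

```python
from itertools import product

def transformation_semigroup(states, alphabet, delta):
    """Enumerate T(A): all maps Q -> Q of the form delta(., sigma_{1:t}) with t >= 1.
    Each map is stored as a tuple f with f[q] = image of state q."""
    gens = {tuple(delta(q, s) for q in states) for s in alphabet}
    closure = set(gens)
    while True:
        # appending one more input symbol composes a generator after an existing map
        new = {tuple(g[f[q]] for q in states) for f, g in product(closure, gens)}
        if new <= closure:
            return closure
        closure |= new

# Flip-flop relabeled to Q = {0, 1}, inputs: 'idle', 'write0', 'write1'.
Q = [0, 1]
def delta(q, s):
    return {'idle': q, 'write0': 0, 'write1': 1}[s]

T = transformation_semigroup(Q, ['idle', 'write0', 'write1'], delta)
print(sorted(T))   # [(0, 0), (0, 1), (1, 1)]: the two constant maps and the identity
```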
Analogously, all semigroups are isomorphic to a transformation semigroup, but the ground set may need one additional element (for the identity); this is Cayley s theorem for semigroups. It is also clear that every transformation semigroup can be realized by some semiautomaton by trivially having the input symbols correspond to the functions in G.12 Therefore we have lost no structure when passing from finite semiautomata to finite semigroups. Before discussing the compositional structure of semigroups, we give one more canonical example. 12More succinctly, inputs can correspond to a generating set of the group, but this is not relevant for our results. Published as a conference paper at ICLR 2023 Example 4 (Cyclic group). Let S be a natural number let Q = {0, 1, . . . , S 1} and let Σ = {1} have only one element. The dynamics are given by δ(q, 1) = (q + 1) mod S. Clearly this semiautomaton implements counting modulo S. The underlying group is the cyclic group, denoted CS, which is isomorphic to the integers mod S with addition as the binary operation. Note that in this case, the operation is commutative, which makes the group abelian. Let us now turn to the compositional structure of groups and semigroups. Since it is without loss of generality to consider transformation (semi)groups, we always take the binary operation to be function composition. A subgroup H of a group G is a subset of the elements of G that is also a group, denoted as H G. In particular it must be closed under the binary operation. N is a normal subgroup, denoted N G, if in addition to being a subgroup, it satisfies that {gn : n N} = {ng : n N}. (These sets are known as the left and right cosets of N in G and denoted g N and Ng respectively.) 13 Normal subgroups can also arise as the kernel of a mapping from G to a subgroup H of G. Let ϕ : G H be a mapping that preserves the group operation (i.e., a group homomorphism) and let ker(ϕ) := {g : ϕ(g) = id}. Then ker(ϕ) is a normal subgroup of G. We will see below that normal subgroups provide a weak form of commutativity, that allows us to construct more complex groups out of simpler ones. Direct products. The most natural way to compose larger groups from smaller ones is via the direct product. Given two groups G and H, we can form a new group with elements {(g, h) : g G, h H} with a binary operation that is applied component-wise (g, h) (g , h ) = (g g , h h ) (here, is overloaded to be the group operation for all three groups). This direct product group is denoted G H. In the context of permutation groups, say G is a permutation group over ground set QG and H is over ground set QH. Then G H has ground set QG QH and every function in G H factorizes component-wise, i.e., every element in G H is identified with a permutation (q G, q H) 7 (g(q G), h(q H)) where g G, h H. Observe that G H contains normal subgroups which are isomorphic to both G and H. To see this, take N = {(e G, h) : h H} where e G is the identity element in G. Then since ge G = e Gg and since H is closed under its group operation, we have (g, h)N = N(g, h) for all (g, h) G H. A symmetric argument shows that G is also a normal subgroup of the direct product. Note that we can analogously define direct products in the absence of the group axioms, and thus for monoids and semigroups. This gives a natural construction of the semigroup corresponding to moving around both axes of a 2-dimensional rectangular gridworld, as a concatenation of two non-interacting 1-dimensional gridworlds: Example 5 (2D gridworld). 
If G_S is the transformation semigroup of the 1-d gridworld with S + 1 states, then G_S × G_S corresponds to a 2-dimensional gridworld. A semiautomaton that yields this transformation semigroup has state space Q = {(i, j) : i, j ∈ {0, . . . , S}} and 5 actions: increment or decrement i or j, subject to boundary effects, or do nothing.
The definition of direct product extends straightforwardly to more than two terms G_1 × G_2 × · · · × G_n; we identify the items with tuples (g_1, g_2, . . . , g_n).

Semidirect products. However, it is possible to compose larger groups so that one of the subgroups is not a normal subgroup. This operation is called a semidirect product, with the group law (g, h) · (g′, h′) = (g · ϕ_h(g′), h · h′) for some ϕ_h to be defined later. Observe that in the direct product G × H, we have constructed the elements from ordered pairs (g ∈ G, h ∈ H), lifting G and H into a shared product space (i.e., the Cartesian product of the underlying sets of G and H), and defining the group operation as simply applying those of G and H separately. In fact, there are other ways to define the group operation in the product space, but a difficulty arises: we need to find other nontrivial multiplication rules on pairs (g, h), and we cannot take for granted that an arbitrary binary operation satisfies the group axioms.
We would like to define other operations (g, h) · (g′, h′) which output an element of G and an element of H. An attempt would be to pick two arbitrary injective homomorphisms ϕ_G, ϕ_H which embed G and H into a shared space, so that elements of G and H can be multiplied together:
(g, h) · (g′, h′) := ϕ_G(g) · ϕ_H(h) · ϕ_G(g′) · ϕ_H(h′).
However, we need to ensure that this group operation is closed. Since all elements of the group are of the form (g, h) where g ∈ G and h ∈ H, we must find a pair (g̃, h̃) that yields the right hand side of the above display when embedded into the shared space via ϕ_G, ϕ_H. For this, the most natural choice is g̃ = g · g′ and h̃ = h · h′, and thus we must check that:
ϕ_G(g · g′) · ϕ_H(h · h′) = ϕ_G(g) · ϕ_G(g′) · ϕ_H(h) · ϕ_H(h′) = ϕ_G(g) · ϕ_H(h) · ϕ_G(g′) · ϕ_H(h′) = (g, h) · (g′, h′).
However, the middle equality may not hold, because ϕ_G(g′) and ϕ_H(h) are not guaranteed to commute. (Observe that for the special case of g ↦ (g, e_H), h ↦ (e_G, h), these two elements always commute, giving rise to the direct product.) Eliding ϕ_G, ϕ_H and simply using g, h as elements of the shared space, a sufficient condition for this to hold is that h g′ h^{−1} ∈ G, since then, for some g̃ ∈ G,
(g, h) · (g′, h′) = g h g′ (h^{−1} h) h′ = g (h g′ h^{−1}) h h′ = g g̃ h h′,
which is of the form ϕ_G(·) · ϕ_H(·) since both G and H are themselves closed. This condition is precisely that G is a normal subgroup. There is a degree of freedom here: for each pair h and g′, we can choose which element of G is given by h g′ h^{−1}. When we make this choice we must ensure all of the group axioms are preserved, e.g., when h = e_H we should always have e_H g′ e_H^{−1} = g′. Suppose we make this choice and define ϕ_h : g ↦ h g h^{−1} ∈ G (this ϕ is a homomorphism from H to Aut(G), where Aut(·) denotes the automorphism group, the group of bijections on G that preserve the group structure, under composition). Then, these ordered pairs do indeed form a group, but the group operation is
(g, h) · (g′, h′) = (g · ϕ_h(g′), h · h′).
This object is the semidirect product, and it is denoted G ⋊ H.
13 An equivalent definition of a normal subgroup is a subgroup N such that g^{−1} n g ∈ N for all g ∈ G, n ∈ N.
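As a sanity check on the group law just derived, the following sketch (ours, anticipating the dihedral example below) realizes C_n ⋊ C_2 concretely: each pair (g, h) is identified with the map x ↦ g + (−1)^h x (mod n), and composing these maps is verified to agree with (g, h) · (g′, h′) = (g + ϕ_h(g′), h + h′), where ϕ_h(g′) = (−1)^h g′.

```python
import itertools

n = 5  # C_5 semidirect C_2: the dihedral group of order 10

def as_map(g, h):
    """Identify (g, h) with the bijection x -> g + (-1)^h * x (mod n) on {0, ..., n-1}."""
    return tuple((g + (-1) ** h * x) % n for x in range(n))

def semidirect_mul(a, b):
    """(g, h) * (g', h') = (g + phi_h(g'), h + h'), with phi_h(g') = (-1)^h g'."""
    (g, h), (gp, hp) = a, b
    return ((g + (-1) ** h * gp) % n, (h + hp) % 2)

elements = list(itertools.product(range(n), range(2)))
for a, b in itertools.product(elements, repeat=2):
    # composing the induced maps (apply b first, then a) matches the semidirect law
    composed = tuple(as_map(*a)[as_map(*b)[x]] for x in range(n))
    assert composed == as_map(*semidirect_mul(a, b))
print("semidirect product law verified for all", len(elements) ** 2, "pairs")
```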
Note that the choice of mapping ϕ is unspecified in the notation, and, in general, different choices of ϕ will yield different structures for the semidirect product.
Finally, when G = N ⋊ H, both N and H are subgroups of G, but N is also a normal subgroup. To see this, we need to check that hN = Nh for any h ∈ H. This is equivalent to h n h^{−1} ∈ N for each h, n, but we defined the group operation so that h n h^{−1} = ϕ_h(n) ∈ N, specifically so this would hold. On the other hand, H may not be a normal subgroup, and in this sense the semidirect product is a generalization of the direct product (for which both subgroups are normal). However, when the mapping ϕ is trivial, that is, ϕ_h(n) = n, then both N and H are normal subgroups, and one can verify that in this case the semidirect product and direct product coincide.

Example 6 (Dihedral group). Consider a semiautomaton with Q = {0, . . . , S − 1} × {−1, +1} and input alphabet Σ = {advance, reverse}. The transitions are given by:
δ((s, b), advance) = (s + b mod S, b),  δ((s, b), reverse) = (s, −b).
The transformation semigroup for this semiautomaton is C_S ⋊ C_2, where C_S is the cyclic group on S elements (cf. Example 4). C_2 has two elements, the identity e and one element h such that hh = e. C_S has S elements, where each element g is a function that adds some number k ∈ {0, . . . , S − 1} to the input modulo S. The inverse g^{−1} is naturally to subtract k from the input, modulo S. The homomorphism ϕ in the semidirect product is such that ϕ_e(g) = g and ϕ_h(g) = g^{−1}.

Wreath products. We define one more type of product between groups N and H: the wreath product N ≀ H := (N × · · · × N) ⋊ H. This is a group containing |N|^{|H|} · |H| elements (rather than |N| · |H|, like the direct and semidirect products). Intuitively, it is defined by creating one copy of N per element in H via the direct product, then letting H specify a way to exchange these copies. Formally, N ≀ H is the unique group generated by
(g_1, . . . , g_{|H|}, h),  g_i ∈ N, h ∈ H,
(g_1, . . . , g_{|H|}, e_H) · (g′_1, . . . , g′_{|H|}, e_H) := (g_1 g′_1, . . . , g_{|H|} g′_{|H|}, e_H),  g_i, g′_i ∈ N,
and
(g_1, . . . , g_{|H|}, e_H) · (e_N, . . . , e_N, h) := (g_{π_h(1)}, . . . , g_{π_h(|H|)}, h),  g_i ∈ N, h ∈ H,  (A.1)
where we have enumerated the elements of H in arbitrary order, such that each π_h : [H] → [H] is the permutation defined by right multiplication h′ ↦ h′h (by convention). To write this explicitly as a semidirect product N ≀ H := (N × · · · × N) ⋊ H, the homomorphism into the direct product's automorphism group ϕ : H → Aut(N × · · · × N) is given by (A.1): for each h ∈ H, ϕ_h is the automorphism defined by permuting the indices between the terms in the direct product, according to the permutation induced by right multiplication by h.

Example 7 (Rubik's Cube). A naive way to construct the Rubik's Cube is to assign labels {1, . . . , 54} to the stickers on the cube, and define the Rubik's Cube group G via the sticker configurations reachable by the 6 face turns (which each specify a permutation δ_L, δ_R, δ_U, δ_D, δ_B, δ_F : [54] → [54] of the stickers). This establishes G as a subgroup of S_54. First, notice that the 6 central stickers never move (so this is really improvable to S_48). Next, notice that the 24 = 8 · 3 vertex stickers never switch places with the 24 = 12 · 2 edge stickers. The vertex stickers form a subset of the wreath product C_3 ≀ S_8, while the edge stickers form a subset of the wreath product C_2 ≀ S_12. In all, this realizes G as a subgroup of a direct product of wreath products: G ≤ (C_3 ≀ S_8) × (C_2 ≀ S_12).
Among other consequences towards solving the Rubik's Cube, this gives an improved upper bound on the size of G (which turns out to still be off by a factor of 12, because of nontrivial invariants preserved by the face rotations, a.k.a. unreachable configurations).

Quotients, simple groups, and maximal subgroups. When N is a normal subgroup of G, the quotient group G/N is defined as {gN : g ∈ G} with binary operation (gN)(g′N) = (gg′)N. The fact that N is a normal subgroup implies that this is a well-defined group. We can also check that if G = N ⋊ H, then the quotient group G/N is isomorphic to H, which matches the intuition for multiplication and division.
A group G is simple if it has no non-trivial normal subgroups. Intuitively, a simple group cannot be factorized into components; this generalizes the fact that a prime number admits no non-trivial factorization. When G is not simple, it has a non-trivial normal subgroup, say N. We call N proper if N ≠ G. We call a proper normal subgroup N maximal if there is no other proper normal subgroup N′ ⊴ G such that N ⊊ N′. Equivalently, N is a maximal proper normal subgroup if and only if G/N is simple. This is akin to extracting a prime factor from a number, since the quotient group G/N cannot be further factorized. We will revisit this idea of factorization when defining composition series and solvable groups in Section C.3.2.

Group extensions. Finally, we provide some additional terminology related to these different notions of products, which provides a cleaner unifying language in which to state our constructions. Let N, H be arbitrary groups. Which groups G contain a normal subgroup isomorphic to N, such that the quotient G/N is isomorphic to H? Such a group G is said to be an extension of N over H. The direct product G = N × H is known as the trivial extension. A semidirect product G = N ⋊ H is known as a split extension. However, not all extensions are split extensions; the smallest example is the quaternion group Q_8, the group of unit quaternions {±1, ±i, ±j, ±k} under multiplication (i² = j² = k² = ijk = −1), which cannot be realized as a semidirect product of smaller groups. In general, it is very hard to derive interesting properties of a group extension based on the properties of N and H. Fortunately, there is a characterization of general extensions. The Krasner-Kaloujnine universal embedding theorem (Krasner & Kaloujnine, 1951) states that all extensions G can be found as subgroups of the wreath product N ≀ H. The proof of Theorem 2 essentially shows how to implement the different kinds of group extensions, given constructions which implement the substructures N, H. In the worst case, we will have to implement a wreath product.

A.3 SHALLOW CIRCUIT COMPLEXITY CLASSES

We provide an extremely abridged selection of relevant concepts in circuit complexity. For a systematic introduction, refer to (Arora & Barak, 2009). In particular, we discuss each circuit complexity class and inclusion below:
NC0 ⊆ AC0 ⊆ ACC0 ⊆ TC0 ⊆ NC1.
NC0 is the class of constant-depth, constant-fan-in, polynomial-sized AND/OR/NOT circuits. If a constant-depth Transformer only uses the constant-degree sparse selection constructions in (Edelman et al., 2022), it can be viewed as representing functions in this class. However, the representational power of these circuits is extremely limited: they cannot express any function which depends on a number of inputs growing with T.
AC0 is the class of constant-depth, unbounded-fan-in, polynomial-sized AND/OR circuits, allowing NOT gates only at the inputs. A classic result is that the parity of T bits is not in AC0 (Furst et al., 1984); Hahn (2020) concludes the same for bounded-norm (and thus bounded Lipschitz-constant) constant-depth Transformers.
ACC0 extends AC0 with an additional type of unbounded-fan-in gate known as MOD_p for any prime number p, which checks if the sum of the input bits is a multiple of p. Theorem 2 comes from the fact that the semigroup word problem (which is essentially identical to semiautomaton simulation) is in this class; see (Barrington & Thérien, 1988).
TC0 extends AC0 with an additional type of unbounded-fan-in gate called MAJ, which computes the majority of an odd number of input bits (a threshold gate). It is straightforward to simulate modular counters using a polynomial number of parallel thresholds (i.e. ACC0 ⊆ TC0). Whether this inclusion is strict (can you simulate a threshold in constant depth with modular counters?) is a salient open problem in circuit complexity. Threshold circuits are a very natural model for objects of interest in machine learning like decision trees and neural networks (Merrill et al., 2021).
NC1 is the class of O(log T)-depth, constant-fan-in, polynomial-sized AND/OR/NOT circuits. It is an extremely popular and natural complexity class capturing efficiently parallelizable algorithms. It is unknown whether any of the inclusions in the larger classes TC0 ⊆ NC1 ⊆ L ⊆ P are strict.

A.4 THE TRANSFORMER ARCHITECTURE

In this section, we define the Transformer function class used in our theoretical results, and discuss remaining discrepancies with the true architecture. An L-layer Transformer is a sequence-to-sequence network f_tf : ℝ^{T×d} × Θ_tf → ℝ^{T×d}, consisting of alternating self-attention blocks and feedforward blocks
f_tf := f_mlp^{(L)} ∘ f_attn^{(L)} ∘ f_mlp^{(L−1)} ∘ · · · ∘ f_attn^{(1)}.
The parameter space Θ_tf is the Cartesian product of those of the individual blocks (without recurrent weight sharing across layers, by default). We define these two types of blocks below.

Attention. A single-headed (H = 1) self-attention block is a sequence-to-sequence network f_attn : ℝ^{T×d} × Θ_attn → ℝ^{T×d}, parameterized by θ_attn = (W_Q, W_K, W_V, W_C). With an inner embedding dimension k, the shapes of these matrices are as follows: W_Q, W_K, W_V, W_C ∈ ℝ^{d×k}. For each position t ∈ [T], the head computes pairwise query-key alignment scores ⟨W_Q^⊤ x_t, W_K^⊤ x_{t′}⟩ for each position t′ ∈ [T], normalizes them with a T-dimensional causally-masked softmax (forcing weights for positions t′ > t to be 0), and uses these attention weights α ∈ ℝ^T to mix value embeddings: ∑_{t′∈[T]} α_{t′} W_V^⊤ x_{t′}. This mixture is mapped back into ℝ^d by multiplying by W_C, to form the t-th row of the output matrix. In a single equation:
f_attn(X; W_Q, W_K, W_V, W_C) := CausalAttn(X W_Q W_K^⊤ X^⊤) X W_V W_C^⊤,
where CausalAttn : ℝ^{T×T} → ℝ^{T×T} applies a row-wise causally-masked T-dimensional softmax function. The standard softmax function softmax(z) : ℝ^T → ℝ^T is defined by [softmax(z)]_t := e^{z_t} / ∑_{t′∈[T]} e^{z_{t′}}; the causally-masked softmax at row t is defined to be softmax(z_{1:t}) on the first t coordinates, and 0 on the rest. To implement the causal masking operation, it is customary to set the entries above the diagonal of the attention score matrix X W_Q W_K^⊤ X^⊤ to −∞, then obtaining CausalAttn(X W_Q W_K^⊤ X^⊤) via a row-wise softmax (letting e^{−∞} evaluate to 0).
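The attention block above translates directly into a few lines of NumPy. The following is a minimal sketch (ours, not the training code used in the experiments; W_C is applied as W_C^⊤ so the shapes match the convention W_C ∈ ℝ^{d×k}).

```python
import numpy as np

def causal_softmax(S):
    """Row-wise causally masked softmax of a T x T score matrix."""
    T = S.shape[0]
    S = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, S)  # mask t' > t
    S = S - S.max(axis=1, keepdims=True)   # numerical stability; exp(-inf) = 0
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def f_attn(X, WQ, WK, WV, WC):
    """Single-head block: CausalAttn(X WQ WK^T X^T) X WV WC^T, mapping (T x d) -> (T x d)."""
    scores = X @ WQ @ WK.T @ X.T           # pairwise query-key alignment scores
    return causal_softmax(scores) @ X @ WV @ WC.T

# Shapes: X is (T x d); WQ, WK, WV, WC are (d x k).
T, d, k = 6, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
WQ, WK, WV, WC = (rng.normal(size=(d, k)) for _ in range(4))
Y = f_attn(X, WQ, WK, WV, WC)
assert Y.shape == (T, d)
```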
In general, for any positive integer H, a multi-headed self-attention block consists of a sum of H copies of the above construction, each with its own parameters. This component is often called soft attention: the softmax performs continuous selection, taking a convex combination of its inputs. In contrast, hard attention refers to attention heads which perform truly sparse selection (putting weight 1 on the position with the highest score, and 0 on all others).

Feedforward MLP. An L′-layer position-wise feedforward MLP block is a sequence-to-sequence network f_mlp : ℝ^{T×d} × Θ_mlp → ℝ^{T×d}, parameterized by θ_mlp = (W_1, b_1, . . . , W_{L′}, b_{L′}). For a choice of activation function σ : ℝ → ℝ (which is always ReLU in our theoretical constructions, for simplicity), f_mlp applies the same nonlinear map
(x ↦ W_{L′} x + b_{L′}) ∘ σ ∘ · · · ∘ σ ∘ (x ↦ W_1 x + b_1)
to each row t of the input matrix X ∈ ℝ^{T×d} (with the same parameters per position t); here, σ is applied pointwise. Finally, an extra term P ∈ ℝ^{T×d} is added to the first layer's input, the matrix of position encodings.

Residual connections. It is typical to add residual connections which bypass each block. That is, letting id denote the identity function on ℝ^{T×d}, the network (with position encodings) becomes
f_tf := (id + f_mlp^{(L)}) ∘ (id + f_attn^{(L)}) ∘ (id + f_mlp^{(L−1)}) ∘ · · · ∘ (id + f_attn^{(1)}) ∘ (id + P).
At the level of granularity of the results in this paper (up to negligible changes in the width and weight norms), this changes very little from the viewpoint of representation. A residual connection can be implemented (or negated) by appending two ReLU activations to a non-residual network: x = (x)_+ − (−x)_+. Similarly, a residual connection can be implemented with one attention head (with internal embedding dimension k = d), as long as it is able to select its own position (which will be true in all of our constructions). In some of our constructions, we choose to use residual connections (sometimes restricted to certain dimensions); it will be very natural to view the embedding space ℝ^d as a "workspace", where residual connections ensure that downstream layers can access the input (and position) embeddings, as well as outputs of all earlier layers. We will specify whether to use residual connections in each construction, to make the proofs as clear as possible. When we do so, we do not add the extra weights explicitly to f_attn and f_mlp.

Layer normalization. For simplicity of presentation, we omit the normalization layers which are usually present after each attention and MLP block. It would be straightforward (but an unnecessary complication) to modify the function approximation gadgets in our constructions to operate with unit-norm embeddings.

Padding tokens. Finally, it will greatly simplify the constructions to add padding tokens: to simulate a semiautomaton at length T, we will choose to prepend τ tokens, with explicitly chosen embeddings, which do not depend on the input σ_{1:T}. Theorem 1 uses τ = Θ(T) padding, and Theorem 2 uses τ = 1. In both cases, padding is not strictly necessary (the same functionality could be implemented by the MLPs without substantially changing our results), but we find that it leads to the most intuitive and concise constructions.

Complexity measures.
We define the following quantities associated with a Transformer network, and briefly outline their connection to familiar concepts in circuit complexity: The dimensions according to the definition of a sequence-to-sequence network: sequence length T and embedding dimension d. Up to a factor of bit precision, this corresponds to the number of inputs in a classical Boolean circuit. We will exclusively define architectures where d is independent of T. Its depth L, the number of repeated fmlp fattn blocks. When each of these modules contains a constant number of sequential computations, this coincides with the usual notion of circuit depth, up to a constant factor. This is true in practice and our theoretical treatment (the attention and MLP have a constant number of layers). Published as a conference paper at ICLR 2023 The other shape parameters from the definition of the architecture: number of heads (per layer and position) H14, and internal embedding dimension k. When fmlp is an L -layer MLP, it has MLP intermediate widths d1, . . . , d L 1. We will exclusively think of L as a small constant, so that the number of sequential matrix multiplications in the entire network is within a constant factor of L. Its attention width wattn is defined to be the maximum of {d, Hk}, and its MLP width wmlp is defined as the maximum of {d1, . . . , d L 1}. Taking w = max(wattn, wmlp) as a coarse upper bound we will use to summarize the number of per-position trainable embeddings in our constructions. To map this to the usual notion of circuit size, note that the computations are repeated position-wise. Thus, Transformers induce a computational graph with O(T L w) gates and O(T L w2) wires. The position-wise parameter-sharing induces a special notion of circuit uniformity. A bound on its -weight norms: the largest absolute value of any trainable parameter. These can be converted into norm-based generalization bounds via the results in Edelman et al. (2022). Note that the results in this paper go beyond the sparse variable creation constructions of bounded-norm attention heads; in general, the norms scale with T. The attention heads express meaningful non-sparse functions. Aside from the positive experimental results, we do not directly investigate generalization in this paper. The bit precision (length of finite-precision truncation of real numbers in a computational graph implementing ftf), which lets us implement approximate real-valued computations as Boolean (or discrete arithmetic) circuits. With infinite-precision real numbers, there are pathological constructions for RNNs (Siegelmann & Sontag, 1992) and Transformers (Merrill et al., 2021) which give single parameters of neural networks infinite representational power. Throughout this work, our circuits will work with O(log T) bit precision, which can represent real numbers (as integers x 2c in their binary representation, for some choice of c = Θ(log T)) with magnitude up to O(poly(T)), with O(1/poly(T)) approximation error. Since this is far from the focus of our results, we will elide details for the remainder of this paper, returning to these considerations only to make Theorem 4 more concrete. All of our constructions are robust up to this noise level: this is because the internal weight norms and activations are bounded by a quantity at most polynomial in T, and the function approximation construction in Lemmas 1 and 2 can tolerate 1/poly(T) perturbations using poly(T) weight norms. 
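The depth accounting above is easiest to appreciate on the log-depth construction itself: Theorem 1 (proved in Appendix C.2) is, at heart, a parallel prefix composition of transition matrices. The following standalone sketch (ours, with illustrative names) shows why ⌈log₂ T⌉ rounds of pairwise composition recover every prefix product, in contrast to the T sequential steps of naive simulation.

```python
import numpy as np

def prefix_compositions(mats):
    """prefix[t] = mats[t] @ ... @ mats[0], computed in ceil(log2 T) doubling rounds.
    Each round composes pairs in parallel, mirroring one 'layer' of the shortcut."""
    T, prefix, k = len(mats), list(mats), 1
    while k < T:
        prefix = [prefix[t] @ prefix[t - k] if t >= k else prefix[t] for t in range(T)]
        k *= 2
    return prefix

# Parity semiautomaton: delta(q, 0) = q, delta(q, 1) = 1 - q, as 2 x 2 transition matrices.
M = {0: np.eye(2, dtype=int), 1: np.array([[0, 1], [1, 0]])}
rng = np.random.default_rng(1)
sigmas = [int(s) for s in rng.integers(0, 2, size=100)]

q, sequential = 0, []                     # T sequential applications of delta
for s in sigmas:
    q ^= s
    sequential.append(q)

e0 = np.eye(2, dtype=int)[0]              # one-hot encoding of q0 = 0
shortcut = [int(np.argmax(P @ e0)) for P in prefix_compositions([M[s] for s in sigmas])]
assert shortcut == sequential
```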
A.5 ADDITIONAL DISCUSSION OF RELATED WORK Relevant applications. We first provide references for the reasoning-like applications of neural networks mentioned in the main paper. Program synthesis: (Chen et al., 2021b; Schuster et al., 2021; Li et al., 2022). Mathematical reasoning: (Lample & Charton, 2019; Polu & Sutskever, 2020; Drori et al., 2022). Neural dynamics models for decision-making: recurrent (Hafner et al., 2019; Ye et al., 2021; Micheli et al., 2022) and non-recurrent (Chen et al., 2021a; Janner et al., 2021). Synthetic combinatorial experiments (and relations). We provide an expanded discussion of empirical analyses of neural networks trained on synthetic combinatorial tasks. Pointer Value Retrieval: Zhang et al. (2021a) propose a benchmark of tasks based on pointer value retrieval (PVR) to study the generalization ability (in-distribution as well as distribution shift) of different neural network architectures. Their key idea behind the task is indirection through a pointer rule , that is, a specific position of the input acts as a pointer to the relevant position (window) of the input which contains the answer. Using our results, we can implement a certain sub-class of PVR tasks: (1) we use the first attention layer to identify the pointer, and (2) we use the second attention layer to select the window between the pointer value and the width. (2) is doable with O(1) attention heads if we are computing a function that is based on the sum (for example, mod n). Otherwise it would require the window size number of attention heads similar to our grid-world construction. 14There will be a notational collision between h, H denoting attention heads, and h H denoting an element in a group. We keep the overloaded notation for clarity, and this will certainly be unambiguous. Published as a conference paper at ICLR 2023 LEGO: Zhang et al. (2022) propose a task based on solving a simple chain-of-reasoning problem based on group-based equality constraints. They study the ability of transformers to generalize the entire chain of reasoning given only part of the chain while training. A direct comparison to our setting is not clear since this task is not modelled as a sequence-to-sequence task, however it serves as another example of the emergence of shortcut solutions: transformers solve certain variables without resolving the chain of reasoning. Dyck: Several works (Hahn, 2020; Ebrahimi et al., 2020; Newman et al., 2020; Yao et al., 2021) have studied the ability of Transformers to represent Dyck languages, both for generation and closing bracket prediction. The most closely related to our work is Yao et al. (2021), which constructs a clever depth-2 as well as a depth-D solution for bounded-depth D Dyck languages. Bounded-depth Dyck can be captured by our semiautomata formalism and our main construction would recover the depth-2D solution by default. Their depth-2 construction bears semblance to the constructions we use in Theorem 3: they implement a counter in the first layer similar to our mod n construction, and implement a proximity-based depth matching in the second layer. Our grid-world construction generalizes their construction to a significantly more complex problem. We view our work as a generalization of their results to a wider class of semiautomata. Parity: Another commonly studied synthetic setup is the task of learning parities. Edelman et al. (2022); Barak et al. 
(2022) perform a theoretical and empirical study of the ability of Transformers (and other architectures) to learn sparse parities where the support size k T. Bhattamishra et al. (2020); Schwarzschild et al. (2021) study the task of computing prefix sum in the binary basis (which is essentially parity of the prefix sum) for Transformers and recurrent models, repsectively. Anil et al. (2022); Wei et al. (2022) study essentially the same problem however they model the task as a natural language task and use pretrained Transformers. Modular addition: In the pursuit of understanding grokking, Nanda & Lieberum (2022) focus on the task of adding two 5 digit numbers modulo a large prime (113 in their setting). They take the viewpoint of mechanistic interpretability and attempt to reverse engineer what the Transformer is learning on this task in the low sample regime. They claim that the trained model learns sinusoidal encodings that we also use in our theoretical constructions. Note that our setting of modular counters performs a T-way summation, while their setting involves only a 2-way summation (with carryover). Inspired by their work, we do some preliminary investigation into interpreting the trained Transformer on the grid world (see Figure 7). Formal languages and neural networks. Dyck languages are particularly interesting for their completeness property: the Chomsky-Sch utzenberger representation theorem (Chomsky & Sch utzenberger, 1959) states that all context-free languages can be (homomorphically) represented by the intersection of a Dyck language and a regular language. For more on this topic, see the discussion in Yao et al. (2021). In the context of regular languages (which in general induce finitestate automata), our findings imply that O(log T)-depth networks can simulate all context-free languages (Theorem 1), and O(1)-depth networks can represent some of them. The obstructing regular languages are the ones whose associated syntactic monoids are non-solvable. We further note that the gridworld semigroups are aperiodic and thus simulable by star-free regular expressions (Sch utzenberger, 1965) and AC0 circuits (Chandra et al., 1983; Barrington & Th erien, 1988). We did not see a way for this to generically entail O(1)-depth shortcuts with self-attention. For the relation between the Chomsky hierarchy and various neural networks in practice, Del etang et al. (2022) provide an extensive empirical study for memory-augmented RNNs and Transformers on tasks spanning all 4 levels of the hierarchy, and conclude the Transformers lack the ability to even recognize regular languages. Their results do not contradict with ours, since they measure performance on inductive inference , which is similar to our length generalization setup where we also see the failure of Transformer. Different axes of generalization: length, size, and algorithmic. There has been much recent interest in quantifying out of distribution generalization of trained models under distribution shifts that maintain some notion of logical invariance. Wei et al. (2022); Anil et al. (2022) empirically investigate the ability of pre-trained Transformers to generalize to longer sequence length for parity-like problems modelled as language tasks. Xu et al. (2020) study size generalization in graph neural networks where they train on small graphs and evaluate on larger sized graphs with similar structural properties. Schwarzschild et al. (2021); Bansal et al. 
(2022) focus on length and algorithmic generalization for recurrent models where they train on simple/easy instances of the underlying Published as a conference paper at ICLR 2023 problem and evaluate on harder/complex instances using the power of recurrence to simulate extra computational steps, inspired by the ideas of Neural Turing Machines (Graves et al., 2014) and Adaptive Computation Time (Graves, 2016). We view our results as complementing those of Yao et al. (2021); Anil et al. (2022) for a richer class of problems. Our use of scratchpad is inspired by Nye et al. (2021); Wei et al. (2022); Anil et al. (2022). Recurrent Transformers. Our work is not the first to notice that Transformer architectures make brittle predictions out-of-distribution. Indeed, even the seminal paper introducing the architecture (Vaswani et al., 2017) notes that length generalization is promoted by a subtle hyperparameter choice (namely, the positional encoding scheme). Furthermore, there have been several attempts to reconcile this gap by modifying Transformers to behave more like RNNs; (Dehghani et al., 2019; Nye et al., 2021; Wei et al., 2022; Anil et al., 2022; Hutchins et al., 2022). Kasai et al. (2021) consider training a non-recurrent Transformer, and finetuning it into an RNN. All of these works have some element of natural language experiments: either the task is end-to-end language modeling, or the synthetic reasoning task is framed as a natural language problem, for a pretrain-finetune pipeline. We view our work as strengthening the foundations of these lines of inquiry. Theoretically, we provide structural guarantees for how shallow non-recurrent models can (perhaps deceptively) fit recurrent dynamics over long sequences. Empirically, we perform a pure (no confounds arising from the influence of a natural langauge corpus) analogue of the experiments seeking to help neural networks follow long chains of reasoning. Recurrent vs. non-recurrent sequence transduction. As mentioned briefly towards the end of Section 5, the setting of indirectly-supervised semiautomata matches that of autoregressive generative modeling (a.k.a. next-token prediction), if the continuations of the sequence depend on the state of a latent semiautomaton. This is the case in (for example) generating Dyck languages (Yao et al., 2021), where the possible continuations are {all possible open brackets, if the stack qt is not full} {close bracket which pairs with the top of the stack qt}. We note that when an autoregressive model is used for sequence generation via a token-by-token inference procedure, this amounts to a special case of scratchpad inference (with a naive 1-step training procedure): the constant-depth network is used as a single iteration of a recurrent network, whose state is the completed prefix of the current generated sequence. Non-autoregressive natural language generation and transduction are an exciting area of research (Gu et al., 2017); for a recent survey, see Xiao et al. (2022). Our results are relevant to this line of work, suggesting that there may not be an expressivity barrier to expressing deep recurrent linguistic primitives, but there may be issues with out-of-distribution robustness. Algebraic structures in deep learning. Another area where tools from abstract algebra are used to reason about neural networks is geometric deep learning, a research program which seeks to understand how to specify inductive biases stemming from algebraic invariances. For a recent survey, see Bronstein et al. (2021). 
In contrast, this work studies the ability of a fixed architecture to learn a wide variety of algebraic operations, in the absence of special priors (but with a large amount of data). There are certainly possible connections (e.g. how do you bias an architecture to perform operations in a known group, when there is limited data?) to explore in future work.

Theoretical role of depth. Our theoretical results can be interpreted as a depth separation result: contingent on TC0 ≠ NC1, it takes strictly more layers to simulate non-solvable semiautomata, compared to their solvable counterparts. In a similar spirit, there have been several works establishing depth separation for feed-forward neural networks (mostly using ReLU activations) (Telgarsky, 2016; Eldan & Shamir, 2016; Daniely, 2017; Lee et al., 2017; Safran et al., 2019). These results are usually constructive in nature, that is, they show the existence of functions that can be represented by depth L but would require exponential width at depth L − 1 (or L, depending on the result).

B EXPERIMENTS

B.1 SECTION 4: SGD FINDS THE SHORTCUTS, UNDER IDEAL SUPERVISION

This section contains a full description and discussion of the in-distribution simulation experiments from Section 4.

B.1.1 SHALLOW TRANSFORMERS SIMULATE SMALL GROUPS AND SEMIGROUPS

The main experiments in this paper investigate whether gradient-based training of Transformers finds low-depth solutions to the problem of simulating semiautomata. In these experiments, we consider a wide variety of semiautomata A, corresponding to various groups and semigroups, and construct a distribution D_A over input sequences (σ_1, . . . , σ_T) and their corresponding state sequences (q_1, . . . , q_T) = A_{T,q_0}(σ_{1:T}). In each setting, the σ_t are chosen uniformly at random from the set of valid tokens in Σ.15 Given this distribution D_A, and a sequence-to-sequence neural network (with a token embedding and a linear classification head) which maps Σ^T to token predictions Y ∈ ℝ^{T×|Q|} (such that Y_{t,q} := \widehat{Pr}_θ(q_t = q | σ_{1:t})), we establish the task of minimizing the cross-entropy loss ∑_{t=1}^{T} log(1/Y_{t,q_t}). This defines a supervised learning problem over sequences. Note that without intermediate states in the input, these problems exhibit long-range dependencies: for example, in the parity semiautomaton (and for any semiautomaton whose transformation semigroup is a group), every q_t depends on every preceding input {σ_{t′} : t′ ≤ t}. Indeed, this is why previous studies have used group operations as a benchmark for reasoning (Anil et al., 2022; Zhang et al., 2022).

Settings. We proceed to enumerate the semiautomata considered in these simulation experiments.
Cyclic groups C_2, C_3, . . . , C_8. For each cyclic group C_n (realized as Q := {0, 1, . . . , n − 1} under mod-n addition), we choose the generator set Σ to be the full set of group elements {0, . . . , n − 1}. An alternative could be to let Σ be a minimal16 set {0, 1}, which we do not use in the experiments.
Direct products of cyclic groups C_2 × C_2, C_2 × C_2 × C_2, realized as concatenated copies of the component semiautomata. Note that C_6 (which is isomorphic to C_2 × C_3), included in the above set, is another example.
Dihedral groups D_6, D_8. Our realization of D_{2n} chooses Q = {0, 1, . . . , n − 1} × {0, 1} and Σ = {(1, 0), (0, 1)}. Since these groups are non-abelian, it is already not so straightforward (compared to parity) to see why constant-depth shortcuts should exist.
Permutation groups A_4, S_4, A_5, S_5. We choose Q to be the set of n!
permutations for Sn (symmetric group), and Q to be the set of n! 2 even permutations for An (alternating group on n elements). The generator set for Sn consists of the minimal generators, a transposition and an n-cycle, as well as 6 other permutations. 17 For An, we choose the 3-cycles of the form (12i) for i {3, 4, , n}. Note that A4, S4 are solvable (leading to constant-depth shortcuts), while A5, S5 are not. Also, note that to learn a constant-depth shortcut for A4, a model needs to discover the wondrous fact that A4 has a nontrivial normal subgroup, that of its double transpositions. The quaternion group Q8. This is the smallest example of a non-abelian solvable group which is not realizable as a semidirect product of smaller groups, thus requiring the full wreath product construction (Lemma 10) in our theory. The Dyck language Dyckn,k (correctly nested brackets of k types, with depth at most n). We take n = 4, k = 2 in the experiments. To realize Dyckn,k as a semiautomaton simulation problem, the state Q is the state of the stack which implements Dyck language recognition (there are thus Pn i=0 ki distinct states); 18 Σ is the set of 2k opening and closing brackets. The 15Take for instance the Dyck language, if the current stack is empty, then σt is chosen uniformly from the choices of open parentheses but not the closing parentheses. This is in accordance with Yao et al. (2021). 16In the sense that it induces a non-trivial learning problem on this group.If we only pick the generator {1}, the output sequence is deterministic, and there is no learning problem. 17These other permutations are chosen following the ordering given by the sympy.combinatorics package. They are not necessary for covering the state space (since the minimal set of 2 permutations already suffice to cover Q), but can help speed up the mixing of the states. 18In the experiments we use (k + 1)n classes (i.e. each of the n positions can take (k + 1) possible values), Pn i=0 ki of which are reachable. Published as a conference paper at ICLR 2023 distribution in inputs is slightly different, since there is a notion of illegal inputs: if the stack is empty, then the set of feasible inputs contain all the opening brackets; if the stack is full (i.e. reaching depth n), then the only feasible input is the closing bracket for the opening bracket at the top of the stack. Gridworld semiautomata Grid4, Grid9, where Q = {0, 1, , n 1} (for n = 4 or 9) and Σ = { 1}.19 For this special case, we have a constant-depth solution as stated in Theorem 3. Training. We focus on the online learning setting for all experiments in this paper: at training iteration i, draw a fresh minibatch of samples from DA, compute the network s loss and gradients on this minibatch, and update the model s weights using a standard first-order optimizer (we use Adam W (Loshchilov & Hutter, 2017)). This is to mitigate the orthogonal challenge of overfitting; note that the purpose of these experiments is to determine whether standard gradient-based training finds shortcut solutions in these combinatorial settings (in a reasonable amount of time), not how efficiently. We do not investigate how to improve sample efficiency in this paper. The results in the paper are based on sinusoidal positional encodings (Vaswani et al., 2017) unless otherwise specified. Sequence length. 
We report our main results with sequence length T = 100, which is large enough to rule out memorization: for this choice of T, the inputs come from a uniform distribution over |Σ|100 > 1030 sequences, rendering it overwhelmingly unlikely for a sample to appear twice between training and evaluation. We observed positive results in most of the settings for larger T, but training became prohibitively unstable and computationally expensive; mitigating this is an interesting direction for future empirically-focused studies. Depth. We seek to investigate the sufficient depth for learning to simulate each semiautomaton. Thus, for each problem setting, we vary the number of layers L in the Transformer between 1 and 16. Note that we do not attempt in this work to distinguish between depths O(log T) and O(1), nor do we attempt to tackle the problem of exhaustively enumerating and characterizing the shortcut solutions for any particular semiautomaton. Results. For each task and number of layers, we report the highest (Figure 5) and median (Figure 6) accuracies over 20 runs. The accuracy is calculated at token level (i.e. 1 T P t [T ] 1[ˆqt = qt]), as opposed to the sequence-level accuracy (i.e. 1[ˆq1:T = q1:T ]) as reported in Bhattamishra et al. (2020). We evaluate in-distribution accuracy on independent (unseen) samples of DA, which contain 2048 sequences of length T = 100. 20 As shown in Figure 5, Transformers, trained with standard gradient-based methods, are able to find solutions which generalize well (in-distribution) on all of the tasks. Performance tends to improve as the number of layers increases (there is a small amount of non-monotonicity in some settings due to training instability); the sufficient depth to achieve high accuracy varies depending on the problem setting, as discussed below. Trends in sufficient depth. The minimum number of layers required to achieve 99%+ performance reflects our beliefs on the difficulty of the task: a high-level trend is that the semigroups which don t contain groups (which only require memory lookups) are the easiest to learn, and among the groups, the larger non-abelian groups require more layers to learn, with the non-solvable group S5 requiring the largest depth.21 Between the non-abelian groups, the difficulty of learning Q8 compared to D8 (which has the same cardinality) agrees with our theoretical characterizations of the respective constant-depth shortcuts for these groups: D8 can be written as a semidirect product of smaller groups, while Q8 cannot, so our theoretical construction of a constant-depth shortcut must embed Q8 in a larger structure (i.e. the wreath product). Improving training stability. Throughout these experiments, we observe the following forms of training instability: high variance in training curves (based on initialization and random seeds for 19 1 for L, 1 for R. We omit the no-op in the experiment which does not change the difficulty of the task. 20This size is sufficient for evaluating the model performance: for example, for C2 (i.e. parity), evaluating a model on 10 evaluation sets of this size gives a standard deviation of 0.031% in the accuracy. 21However, we stress that these experiments do not control for the fact that larger groups have richer supervision (for example, A5 has more informative labels than A4), possibly accounting for the counterintuitive result that the latter requires more layers, despite being a subgroup of the former. 
the gradient-based optimization algorithm), and negative progress (i.e. non-monotonic loss curves), even for training runs which eventually converge successfully.
Figure 5: A complete version of Figure 3, for various tasks (rows) and numbers of network layers (columns). Reported performance is the maximum test accuracy over 20 runs.
Figure 6: The median accuracy for various tasks (rows) and numbers of network layers (columns). Reported performance is the median test accuracy over 20 runs.
This is evident in Figure 3(b),(c) and in the significant difference between the maximum accuracies in Figure 5 and the median in Figure 6. To stabilize training, we experiment with dropout and exponential moving average (EMA)22. The effectiveness of dropout varies across datasets; for example, we find using a dropout of 0.1 (the best among {0, 0.1, 0.2, 0.3}) to be helpful for Dihedral and Quaternion, while such dropout hurts the training of Dyck and Gridworld. We find EMA to be generally useful, and fix the decay parameter γ = 0.9 in the experiments since the performance of the EMA model does not seem to be sensitive to the choice of γ {0.85, 0.9, 0.95}. Further, increasing the patience of the learning rate scheduler can be helpful. B.1.2 VISUALIZING AND INTERPRETING ATTENTION HEADS Although we defer a fine-grained mechanistic interpretability study ( which group/semigroup factorizations did these shallow Transformers discover, if any? ) to future work, we provide some preliminary visualizations of attention heatmaps which strongly corroborate their theoretical counterparts. In particular, consider the gridworld setup in Theorem 3. The theoretical construction consists of two steps: the first attention layer calculates the prefix sum of the actions (i.e. the sum of { σi}i [T ] {0, 1}T ), and the second attention layer identifies the last time the process is at a boundary state (i.e. 0 or S) where the process can be reset (i.e. the model can ignore the history before the boundary state and only needs to calculate the sum of subsequent actions). We have seen in Figure 3 that the network indeed learns to 1) compute the prefix sum, as evidenced by the uniform attention in the first layer, and 2) detect boundary states, as highlighted by large attention scores in the last layer. Figure 7 provides more examples of attention patterns, which are taken from the last layer of a 4-layer GPT-2 model on two randomly selected Grid9 sequences. We highlight the locations where the process is at a boundary state (white strips for state 0 or gray strips for state S = 8), which align well with the highly activated positions of the attention heads, showing that the model learns to locate the closest boundary states. Moreover, when processing tokens appearing later in the sequence than these highly activated positions, no attention weight is put on tokens before these positions. This suggests that these highly activated locations reset the state so that the model does not need to look further back past them. B.2 SECTION 5: FAILURES OF SHORTCUTS IN MORE CHALLENGING SETTINGS Our theoretical and main empirical findings have shown that not only do shallow non-recurrent networks subsume deeper finite-state recurrent models in theory, these shallow solutions can also be found empirically via standard gradient-based training. However, experiments in Section 4 and Appendix B are in an idealized setting, with full state supervision during training and in-distribution evaluation at test time. This section studies more challenging settings where these assumptions are relaxed. We consider training under indirect (Section B.2.1) or incomplete (Section B.2.2) state supervision, and evaluation on sequences that is out-of-distribution (Section B.2.3) or of longer lengths (Section B.2.4). B.2.1 CHALLENGES FROM INDIRECT SUPERVISION One type of limited supervision is that the observations may not provide full information of the underlying state. 
To model this, we consider the case where, instead of observing the state q directly, we get a function of the state, denoted φ(q), where φ : Q → Q̃ is non-injective (i.e. |Q̃| < |Q|). In each of the experiments involving partially-observable semiautomata, we specify the underlying semiautomaton as well as the observation function φ.

22 We use the EMA implementation from https://github.com/fadel/pytorch_ema.

Figure 7: Visualization of the entire set of 8 attention heads on two randomly selected length-128 sequences for models trained on Grid9. The lower triangles visualize the attention patterns, while the upper triangles are ignored by the attention head because of the causal mask. We use the upper triangles to visualize the positions of state 0 and state S = 8 in the output: white strips mark the positions of state 0 and gray strips mark state S = 8. Example heads that clearly detect state 0 and S are highlighted with blue and red frames, respectively. Note that in many cases, the white/gray strips align with the locations of high attention scores (the bright yellow patterns). This suggests that the model indeed learns to identify the boundary states and that the construction in Theorem 3 agrees with solutions found in practice.

Figure 8: Accuracies with indirect supervision, extending the results in Figure 4(a). The numbers are the maximum over 25 runs. As a reference, LSTM gets 100% on all tasks.
Task:        Dyck4,8   | Grid9       | S5       | C4       | D8             | (abab)∗, (1) | (abab)∗, (2)
Observation: stack top | 1[boundary] | π1:t(1)  | 1[q = 0] | mod-4 location | accept       | accept
Accuracy:    100.0     | 100.0       | 99.6     | 99.9     | 100.0          | 100.0        | 100.0

Dyck language with stack-top observations: For Dyckn,k, the state is the contents of the stack, which can take Σ_{i=0}^{n} k^i values. We take φ to be the function that takes in a stack and returns the element at the top of the stack, which is either one of the k open brackets if the stack is non-empty, or a special token indicating an empty stack, i.e. Q̃ consists of {1, 2, . . . , k} together with an empty-stack token. We consider k = 8 (as opposed to k = 2 in Section 4) to make the prediction task more challenging.

Gridworld with boundary observations: We consider the case where the underlying semiautomaton is Grid9 with Q = {0, 1, . . . , 8}. The observation function φ : Q → {0, 1} outputs whether the current state is one of the two boundary states, i.e. state 0 or state S = 8.

Permutations with single-element observations: We take the permutation group S5, with Q being the set of 5! permutations. The observation function φ : Q → {1, 2, 3, 4, 5} returns the first value of the permutation. For example, φ((2, 1, 4, 3, 5)) = 2. We use a set of 5 generators for the experiments.

Cyclic group with "0 mod 4" observations: We take C4 as the underlying group with Q = {0, 1, 2, 3}. The observation function computes whether the current state is state 0, i.e. φ(q) = 1[q = 0].

Dihedral group with rotation component only: Recall that D2n = Cn ⋊ C2. We take n = 4 with Q = {0, 1, 2, 3} × {0, 1}, and let the observation function output only the first component (i.e. Q̃ = {0, 1, 2, 3}).

(abab)∗: We consider one semiautomaton which is not featured in Section 4: the one which recognizes the regular expression (abab)∗, which is also studied in Bhattamishra et al. (2020). The underlying semiautomaton has 5 states: 4 states are visited cyclically when the input consists of repeated patterns of abab, and a fifth absorbing failure state is entered if any other pattern is seen.
For example, the input sequence abababaaabab corresponds to the state sequence 012301244444, where the 5th a leads to the absorbing state. The observation function φ computes whether the current state is the accepting state (i.e. state 3), with Q̃ = {0, 1}. For example, the output of φ for the input sequence ababababa is 000100010, and the output for the input sequence ababbabab is 000111111, i.e. the sequence enters the absorbing state at position 5 and never recovers. We consider two distributions on the input sequences: (1) the input is always a sequence of the form abababa (i.e. the process is never in the absorbing state), which is the setup in Bhattamishra et al. (2020); and (2) the input is of the form abababa with probability 0.5, and is a randomly drawn string over {a, b} otherwise. Note that case (1) can be solved purely based on the positional encoding, since the label is 1 when the position is a multiple of 4 and 0 otherwise, while case (2) is more difficult since the model needs to take the input tokens into account.

Results. We train GPT-2-like models on sequences of length 40. We use 16 layers for S5 and 8 layers for the other tasks, with embedding dimension d = 512 and H = 8 attention heads. As shown in Figure 8, the model is able to achieve near-perfect in-distribution accuracy on all tasks. An interesting side finding is that the choice of positional encoding turns out to be important for both cases of (abab)∗: learning is challenging with the linear encoding (i.e. pi ∝ i) but easy with the sinusoidal positional encoding, likely because the sinusoidal encoding naturally matches the periodicity in (abab)∗. In all other experiments, we use sinusoidal positional encodings unless otherwise noted.

B.2.2 CHALLENGES FROM INCOMPLETE SUPERVISION

Another challenge of limited supervision is that the observation sequence may be incomplete; that is, we may not be able to get supervision on the states at every time step. We consider the task of learning length-100 sequences, where the state at each position is revealed with some probability preveal ∈ (0, 1].

Figure 9: Learning from incomplete state sequences, extending results from Figure 4(b): accuracy vs. the position-wise probability preveal that a state label is revealed, for GPT and LSTM. While LSTM is able to maintain perfect accuracy across different values of preveal, GPT's performance may degrade as labels get sparser. The mean and standard deviation are taken over 25 runs.

Figure 10: OOD generalization performance on C2: (Left) Accuracy on sequences of the same length as training, with varying Pr[σ = 1] ≠ 0.5. GPT fails at OOD generalization, whereas the recurrent solutions implemented by LSTM and by Scratchpad with recency bias are robust to different distributions. (Right) Accuracy on sequences of the same length as training, with a varying number of 1s in each sequence. GPT has worse performance on counts less frequently seen during training. The lines show the mean accuracy (with shadows showing standard error) over 25 replicates.

Results. Figure 9 shows the accuracy against preveal, for S5 and C2 (i.e. parity). The Transformer training pipeline is worse than the LSTM's at tolerating incomplete supervision: while the Transformer is able to maintain its performance across preveal for C2, its performance degrades significantly at lower preveal for S5. We leave improving the robustness to sparse supervision to future work.
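To make the masking in this setup concrete, the following is a minimal PyTorch sketch (our own illustration, not the paper's training code) of a loss in which each position's state label is revealed with probability preveal and hidden positions are excluded from the loss via ignore_index; the shapes and the helper name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # positions with this target are skipped by F.cross_entropy

def masked_state_loss(logits, states, p_reveal):
    """logits: (B, T, |Q|) per-position state predictions;
    states: (B, T) ground-truth state indices;
    p_reveal: probability that each position's label is observed."""
    revealed = torch.rand_like(states, dtype=torch.float) < p_reveal
    targets = torch.where(revealed, states, torch.full_like(states, IGNORE))
    # Hidden positions (target == IGNORE) contribute nothing to the loss.
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=IGNORE)

# Example: 16 sequences of length 100 over a 5-state semiautomaton, 30% of labels revealed.
logits = torch.randn(16, 100, 5)
states = torch.randint(0, 5, (16, 100))
loss = masked_state_loss(logits, states, p_reveal=0.3)
```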
B.2.3 OUT-OF-DISTRIBUTION GENERALIZATION

The previous subsections show positive results on learning shallow non-recurrent shortcuts with limited supervision during training, either in the form of indirect observations or incomplete observation sequences. In this section, we study challenges at test time, and evaluate Transformers on their out-of-distribution generalization performance. For this and the next subsection, the models are trained in the standard way with full state supervision. The training sequences are of length 40, where each position has an equal probability of being 0 or 1, i.e. Pr[σ = 1] = 0.5. At test time, the sequences are of the same length as in training, but the Bernoulli parameter Pr[σ = 1] varies over {0.05, 0.1, 0.15, . . . , 0.9, 0.95}.

Negative results for vanilla Transformers. Figure 10 (left) shows the accuracy as Pr[σ = 1] varies. The performance of the Transformer degrades sharply as the test distribution moves away from the training distribution, failing at out-of-distribution generalization. Given the theoretical construction of modular counters (Lemma 6), our hypothesis is that the Transformer may be learning a shortcut solution that computes the parity by counting the number of 1s, and that counts less frequently seen during training will cause the model to fail. The experimental results agree with this hypothesis: as Pr[σ = 1] deviates from 0.5, it is less likely for the value of the count (which concentrates around T · Pr[σ = 1]) to have been seen during training, hence the performance degrades. In contrast, an LSTM recurrent network maintains perfect accuracy when evaluated on all values of Pr[σ = 1].

We further test this hypothesis by checking how the accuracy changes as we vary the count (i.e. the number of 1s) in the input sequence. As shown in Figure 10 (right), the Transformer's performance degrades as the count moves away from the number expected during training, agreeing with the hypothesis. It might appear strange that GPT fails more on lower counts than on higher counts. However, this may be because the shortcut learns a correlation between the count and the position: during training, a lower count is more likely to appear early in an input sequence, as opposed to the testing scenario where a lower count is equally likely to appear at a later part of a sequence. This is further supported by the observation that training the model with randomly shifted positions significantly improves the performance at lower counts.

Guiding the Transformer to learn the recurrent solution. We investigate one established mitigation for the out-of-distribution brittleness of non-recurrent Transformers: scratchpad training and inference. Given a sequence of inputs (σ1, . . . , σT) and states (q1, . . . , qT), in the standard (non-recurrent) sequence-to-sequence learning pipeline, the network receives σ1:T as input, and outputs the sequence of predictions for qt. In scratchpad training (Nye et al., 2021; Wei et al., 2022), we instead feed the network an interleaved sequence of inputs and states (σ1, q1, σ2, q2, σ3, q3, . . . , qT−1, σT) (with an appropriately expanded token vocabulary), and define the network's state predictions to be those at the appropriately aligned positions: (q̂1, ∗, q̂2, ∗, . . . , ∗, q̂T) (where ∗ denotes a position whose prediction is ignored by the loss function). During inference, we iteratively fill in the state predictions; a minimal sketch of this interleaving is given below.
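As a concrete illustration of the interleaving, here is a hedged Python sketch; the helper name, token-id conventions, and the use of −100 as an ignored label are our own illustrative choices, not the paper's implementation.

```python
def build_scratchpad_example(sigmas, states, ignore_id=-100):
    """Interleave inputs and states as (s1, q1, s2, q2, ..., q_{T-1}, sT);
    the target asks for q_t at each sigma position and is ignored elsewhere.
    `sigmas` and `states` are equal-length lists of token ids, drawn from an
    expanded vocabulary in which input symbols and states are disjoint."""
    assert len(sigmas) == len(states)
    inputs, targets = [], []
    for t, (sig, q) in enumerate(zip(sigmas, states)):
        inputs.append(sig)
        targets.append(q)              # predict q_t right after reading sigma_t
        if t < len(sigmas) - 1:
            inputs.append(q)           # ground-truth state fed back as input (teacher forcing)
            targets.append(ignore_id)  # loss is ignored at scratchpad positions
    return inputs, targets

# Example with input symbols and states already mapped to token ids:
inp, tgt = build_scratchpad_example([10, 11, 10, 11], [3, 4, 3, 4])
# inp = [10, 3, 11, 4, 10, 3, 11], tgt = [3, -100, 4, -100, 3, -100, 4]
```

At inference time, the ground-truth q's are unavailable, so the model's own prediction at each σ position is appended to the sequence before the next input symbol is processed.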
This removes the need for the network to learn long-range dependencies in a single non-recurrent pass, by splitting it into T sequential state prediction problems, each of which can depend on the previously predicted state q̂t−1; one can think of this as a way to guide a shallow Transformer to learn the recurrent solution (i.e. explicit depth-Θ(T) iteration of the state transition function), rather than a shortcut. We note that introducing the scratchpad by itself is not sufficient to remove the parallel solution, since the model can simply ignore the scratchpad positions and find the same parallel shortcut as before. The good news is that we can couple the scratchpad with an explicit recency bias in the attention mechanism (Press et al., 2022), which biases the model towards putting more attention weight on nearby inputs. Intuitively, if the model is only allowed to attend to the current input token and the current scratchpad (which is simply the current state), then the model is forced to be recurrent; recency bias can be considered a soft relaxation of the same idea. Combining scratchpad and recency bias, we are able to train a Transformer to learn the recurrent solution, which is resilient to distribution shift; see Figure 10 (left). Notice that this mitigation completely foregoes the computational advantage of a shallow shortcut; we leave it to future work to obtain shortcuts which are resilient to distribution shift. Towards this, the constructions used in the proof of Theorem 1 may be helpful. Finally, as a side note, even though the state transitions are Markov, the dependence on the input sequence can still be long-range, so we do not expect recency bias to help without the scratchpad, since in this case the output can depend uniformly on each input position (e.g. consider parity).

B.2.4 LENGTH GENERALIZATION

Settings for length generalization. Our final setup is length generalization, where the model is evaluated on sequences of lengths unseen during training. Promoting this difficult desideratum of length generalization is an intricate problem in its own right; see Yao et al. (2021); Anil et al. (2022) for experiments similar to ours, and Appendix A.5 for more discussion on length generalization. In the following, we check the length generalization performance on Dyck4,2 and C2 (with Pr[σ = 1] = 0.5), where the model is trained on sequences of length 40 and tested on sequences of length {8, 16, 24, . . . , 120, 128}.

Results. Figure 11 shows the performance on sequences of various lengths. In contrast to the LSTM's perfect performance in all scenarios, the Transformer's accuracy drops sharply as we move to lengths unseen during training. This is not purely due to unseen values of the positional encoding: randomly shifting the positions during training can cover all the positions seen during testing, which helps improve the length generalization performance but cannot make it perfect; we see similar results when removing positional encodings altogether. However, similar to the OOD setup in the previous subsection, we empirically show that the above flaws are circumventable. Using a combination of scratchpad (a.k.a. "chain-of-thought") training (Nye et al., 2021; Wei et al., 2022) and recency bias (Press et al., 2022), we demonstrate that Transformers can be guided towards learning recurrent (depth-T) solutions, which generalize out-of-distribution and to longer sequence lengths (Figure 11, yellow curves).
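As one concrete way to realize such a recency bias, here is a hedged PyTorch sketch in the spirit of ALiBi (Press et al., 2022): a penalty proportional to the query-key distance is subtracted from the pre-softmax attention scores. The single shared slope and the function name are illustrative assumptions rather than the paper's exact configuration (ALiBi itself uses one slope per head).

```python
import torch

def recency_biased_attention(q, k, v, slope=1.0):
    """q, k, v: (B, H, T, d_head). Causal attention with an additive
    bias -slope * (i - j) that penalizes attending to distant positions."""
    B, H, T, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # (B, H, T, T)
    i = torch.arange(T).view(T, 1)
    j = torch.arange(T).view(1, T)
    dist = (i - j).clamp(min=0).float()            # distance to earlier positions
    causal = j <= i                                # causal mask
    scores = scores - slope * dist                 # recency bias
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 2 sequences, 4 heads, length 8, head dimension 16.
q = k = v = torch.randn(2, 4, 8, 16)
out = recency_biased_attention(q, k, v, slope=0.5)
```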
Figure 11: Length generalization on Dyck and C2: the Transformer fails to generalize, but adding scratchpad (Nye et al., 2021) and recency bias (Press et al., 2022) serves as a remedy. (a) Mean accuracy over 25 replicates, with shadows showing standard error. (b) Max accuracy over 25 replicates. For "GPT, shifted positions", the positions in a sequence are shifted by a random number. For "GPT, no positional encoding", no position encodings are provided, but the causal mask is still present.

Figure 12: Choice of positional encoding: while having similar or even superior in-distribution performance (on sequences of length T = 40), sinusoidal positional encoding may suffer a larger generalization gap than linear positional encoding when testing on length 2T. The lines show the mean accuracy over 25 replicates.

The results also confirm that the inclusion of recency bias is necessary: without it, scratchpad training shows no improvement on length generalization.

Impact of positional encoding. Figure 11 also shows some interesting findings related to positional encoding, which is believed to be a key component of Transformers and is a topic of active research (Ke et al., 2020; Chu et al., 2021). While this work does not aim to improve positional encoding, some of our results may be of interest for future research.

Sinusoidal vs. linear encoding: We find that the conventional sinusoidal encoding (the default for results in this paper) seems to generalize worse to unseen lengths than the linear encoding (where pi ∝ i), despite having comparable or better in-distribution performance. Figure 12 shows examples on Grid9 and partially observed (abab)∗, where we compare the accuracy on freshly drawn samples of the same length as the training sequences (i.e. in-distribution), or of twice the training length. For Grid9, both positional encodings achieve comparable in-distribution accuracy; however, the sinusoidal encoding performs significantly worse when tested on sequences of doubled length. For partially observed (abab)∗, where the label is whether the current string is a multiple of abab, the sinusoidal encoding has a clear advantage over the linear encoding in-distribution. However, when tested on sequences of twice the training length, the performance gap between the two positional encodings shrinks significantly.

Training with shifted positions: In general, unseen positions appear to be a major contributor to the Transformer's failure at length generalization. This is evidenced by the comparison between Transformers trained with absolute positional encoding and Transformers trained with random shifts added to the positional encoding: for each batch, we sample a random integer in [0, 400] and add it to the position indices before calculating the positional encoding; this random integer is shared within a batch and varies across batches (a minimal sketch of this scheme is given below). Figure 11 shows that adding such random shifts gives a significant boost to the Transformer's length generalization performance, for both Dyck and C2. This suggests that a main challenge for length generalization is the distribution shift due to positions unseen during training, and that finding better positional encodings could be a potential remedy for poor length generalization. As a side note, we also find that removing positional encoding altogether helps improve generalization for both parity and Dyck.
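A minimal sketch of the random-shift scheme described above; the sinusoidal-encoding helper and the exact shift range are illustrative assumptions, not the released training code.

```python
import torch

def sinusoidal_encoding(positions, d_model):
    """Standard sinusoidal encoding evaluated at (possibly shifted) integer positions."""
    pos = positions.float().unsqueeze(-1)                 # (T, 1)
    dims = torch.arange(0, d_model, 2).float()            # (d_model / 2,)
    angles = pos / (10000.0 ** (dims / d_model))          # (T, d_model / 2)
    enc = torch.zeros(positions.shape[0], d_model)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

def shifted_positions(seq_len, max_shift=400):
    """Sample one random offset per batch and add it to the position indices."""
    offset = torch.randint(0, max_shift + 1, (1,)).item()
    return torch.arange(seq_len) + offset

# During training, every batch uses a freshly sampled offset:
pos = shifted_positions(seq_len=40)          # e.g. positions 137..176
pe = sinusoidal_encoding(pos, d_model=512)   # (40, 512), added to the token embeddings
```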
For the former (parity), removing positional encodings makes sense, since parity is a symmetric function for which the ordering of the arguments does not matter,23 though the positive result for Dyck is less clearly understood. Note that removing the positional encoding does not mean having no position information, since the use of the causal mask implicitly encodes position; this is also noted in Bhattamishra et al. (2020) and in concurrent work by Haviv et al. (2022). Understanding this phenomenon is tangential to the current work and is left to future work.

B.3 ADDITIONAL DETAILS

Hyperparameters. For GPT-2 models, we fix the embedding dimension and MLP width to 512 and the number of heads to 8 in all experiments in Section 4, and vary the number of layers from 1 to 16. For LSTM, we fix the embedding dimension to 64, the hidden dimension to 128, and the number of layers to 1. We use the AdamW optimizer (Loshchilov & Hutter, 2017), with learning rate in {3e-5, 1e-4, 3e-4} for GPT-2 or {1e-3, 3e-3} for LSTM, weight decay 1e-4 for GPT-2 or 1e-9 for LSTM, and batch size 16 for GPT-2 or 64 for LSTM. As detailed in Section B.1.1, the models are trained in an online fashion with freshly drawn samples in each batch. The number of freshly drawn samples ranges from 600k to 5000k for different datasets, which is much smaller than the number of possible strings of length 100.

Implementation details. Our experiments are implemented with PyTorch (Paszke et al., 2019). The Transformer architectures are taken from the Hugging Face Transformers library (Wolf et al., 2019), using the GPT-2 configuration as a base. The LSTM architecture is the default one provided by the PyTorch library.

Computational resources. The experiments were performed on an internal cluster with NVIDIA Tesla P40, P100, V100, and A100 GPUs. For the experiments in Section 4, each training run took up to 10 hours on a single GPU, for a total of 10^4 GPU hours. The remaining experiments in Section 5 amount to less than 1% of this expenditure.

23 Empirically, we are able to achieve non-trivial accuracy (even when evaluated at the sequence level) without positional encoding, whereas Bhattamishra et al. (2020) report 0 accuracy. The discrepancy may be due to different model sizes: Bhattamishra et al. consider Transformers with up to 4 layers, 4 heads, and dimension up to 32, whereas for the parity experiments we consider Transformers with 8 layers, 8 heads, and dimension 512.

C.1 USEFUL DEFINITIONS AND LEMMAS

Formal definitions of simulation. We first recall the notions of simulation introduced in Section 2:

A function can simulate an automaton for particular choices of T, q0. For a semiautomaton A = (Q, Σ, δ), a function f : Σ^T → Q^T simulates AT,q0 if f(σ1:T) = AT,q0(σ1:T) for all input sequences σ1:T. Here, the right-hand side denotes the sequence of states q1:T induced by the input sequence σ1:T under the transitions δ, starting from state q0.

A function class can simulate multiple functions associated with a semiautomaton. For a semiautomaton A = (Q, Σ, δ) and a positive integer T, a function class F (a set of functions f : Σ^T → Q^T) simulates A at length T if, for every q0 ∈ Q, there is a function fq0 ∈ F which simulates AT,q0.
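For concreteness, the target map AT,q0 is just the rollout of δ from q0; a minimal Python rendering of this definition (our own sketch, with the parity semiautomaton as an example) is:

```python
def simulate(delta, q0, sigmas):
    """A_{T,q0}: roll out the transition function delta from q0 over sigma_{1:T},
    returning the state sequence q_{1:T} (the target of the simulation task)."""
    states, q = [], q0
    for sigma in sigmas:
        q = delta(q, sigma)
        states.append(q)
    return states

# Example: the mod-2 counter (parity) semiautomaton with Q = {0, 1}, Sigma = {0, 1}.
parity_delta = lambda q, sigma: (q + sigma) % 2
assert simulate(parity_delta, 0, [1, 0, 1, 1]) == [1, 1, 0, 1]
```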
Our proofs rely on composing gadgets which simulate various substructures of the transformation semigroup T(A). Thus, it will be useful to establish a third notion of simulation, which works for functions in the embedding space R^d rather than the symbol spaces Q, Σ. For clarity, we give this notion a different name (continuous simulation): for a semiautomaton A = (Q, Σ, δ), a function f : R^d → R^d continuously simulates AT,q0 if there exist functions E : Σ → R^d, W : im f → Q such that W ∘ f ∘ E simulates AT,q0. When W is a linear threshold function z ↦ arg max_q [Wz]_q, this corresponds to a standard classification head. However, our constructions may leverage other encodings of discrete objects.

Function approximation. We provide some simple function approximation results below.

Lemma 1 (1D discrete function interpolation with an MLP). Let X be a finite subset of R such that |x| ≤ Bx for all x ∈ X, and |x − x′| ≥ Δ for all x ≠ x′ ∈ X. Let f : X → R^d be such that ‖f(x)‖∞ ≤ By for all x ∈ X. Then, there is a 2-layer ReLU network for which fmlp(x + ξ; θmlp) = f(x) for all x ∈ X, |ξ| ≤ Δ/4. The inner dimension is 4|X|, and the weights satisfy ‖W1‖∞ ≤ 4/Δ, ‖b1‖∞ ≤ 4Bx/Δ + 2, ‖W2‖∞ ≤ By, b2 = 0.

Proof. For each x0 ∈ X, we construct an indicator ψx0(x) for x0 out of 4 ReLU units. Letting Δ′ := Δ/4, the construction is
ψx0(x) := ((x − (x0 − 2Δ′))/Δ′)+ − ((x − (x0 − Δ′))/Δ′)+ − ((x − (x0 + Δ′))/Δ′)+ + ((x − (x0 + 2Δ′))/Δ′)+,
which equals 1 on [x0 − Δ′, x0 + Δ′] and 0 outside (x0 − 2Δ′, x0 + 2Δ′). The second layer simply sums these indicators, weighted by each f(x0).

Lemma 2 (General discrete function interpolation with an MLP). Let X be a finite subset of R^din such that ‖x‖∞ ≤ Bx for all x ∈ X, and ‖x − x′‖∞ ≥ Δ for all x ≠ x′ ∈ X. Let f : X → R^dout be such that ‖f(x)‖∞ ≤ By for all x ∈ X. Then, there is a 3-layer ReLU network for which fmlp(x + ξ; θmlp) = f(x) for all x ∈ X, ‖ξ‖∞ ≤ Δ/4. Letting Xi denote the set of unique values in coordinate i, the inner MLP dimensions are d1 = 4 Σ_{i ∈ [din]} |Xi| and d2 = |X|. The weights satisfy ‖W1‖∞ ≤ 4/Δ, ‖b1‖∞ ≤ 4Bx/Δ + 2, ‖W2‖∞ ≤ 1, ‖b2‖∞ ≤ din, ‖W3‖∞ ≤ By, b3 = 0.

Proof. The first layer uses the same construction as in Lemma 1, creating indicators for each value in Xi, for each coordinate i. For each x ∈ X, the second layer has an activation which sums the indicators corresponding to each coordinate xi, with a bias of magnitude at most din (thus creating an indicator for each x). The third layer outputs f(x) for each indicator.

When we apply Lemmas 1 and 2 in recursive constructions with Bx/Δ ≥ 1, we will opt to use the bound ‖b1‖∞ ≤ 6Bx/Δ, to reduce the clutter of propagating the +2 term without resorting to asymptotic notation.

We also introduce a simpler version of Lemma 1 for the special case of the threshold function f(x) := 1[x > 0]:

Lemma 3 (Threshold with an MLP). Let X be a subset of R with |x| ≥ Δ for all x ∈ X. Then, there is a 2-layer ReLU network for which fmlp(x + ξ; θmlp) = 1[x > 0] for all x ∈ X, |ξ| ≤ Δ/4. The inner dimension is 2, and the weights satisfy ‖W1‖∞ ≤ 2/Δ, ‖b1‖∞ ≤ 1/2, ‖W2‖∞ ≤ 1, b2 = 0.

Proof. We construct the threshold using 2 ReLU units. The construction is
ψ(x) := (2x/Δ + 1/2)+ − (2x/Δ − 1/2)+,
which equals 1 for x ≥ Δ/4 and 0 for x ≤ −Δ/4, and in particular takes the value 1[x > 0] at every perturbed input x + ξ with x ∈ X and |ξ| ≤ Δ/4.

Selection via soft attention. We record some useful lemmas pertaining to approximating hard coordinate selection with soft attention. The following is a simplified version of Lemma B.7 of Edelman et al. (2022) (which generalizes this to multi-index selection):

Lemma 4 (Softmax approximates hard max). Let z ∈ R^T, and let softmax : R^T → R^T denote the T-dimensional softmax function, [softmax(z)]_t := e^{z_t} / Σ_{t′ ∈ [T]} e^{z_{t′}}. Let t∗ := arg max_t z_t. Suppose that for all t ≠ t∗, z_{t∗} − z_t ≥ γ. Then, ‖softmax(z) − e_{t∗}‖_1 ≤ 2T e^{−γ}.

Proof. Without loss of generality, z_{t∗} = γ (since the softmax function is invariant under shifting all inputs by the same value), so that all other coordinates are non-positive. Also, assume T ≥ 2 (the T = 1 case is trivial). We have
[softmax(z)]_{t∗} = e^γ / (e^γ + Σ_{t ≠ t∗} e^{z_t}) ≥ e^γ / (e^γ + T − 1) = 1 − (T − 1)/(e^γ + T − 1),
and for t ≠ t∗, [softmax(z)]_t = e^{z_t} / (e^γ + Σ_{t′ ≠ t∗} e^{z_{t′}}) ≤ (e^γ + T − 1)^{−1}.
Thus, the 1-norm of the difference is bounded by 2(T − 1)(e^γ + T − 1)^{−1} ≤ 2T e^{−γ}, as claimed.

Positional embeddings.
We note the following elementary fact about 2-dimensional circular embeddings. Proposition 5 (Circular embeddings). Consider p1, . . . , p T , the T equally-spaced points on the 2-dimensional circle: [pt]1 := cos 2πt , [pt]2 := sin 2πt Then, for any t = t , | pt, pt | 1 2π2 T 2 < 1 19.7 C.2 PROOF OF THEOREM 1: LOGARITHMIC-DEPTH SHORTCUTS VIA PARALLEL PREFIX SUM In this section, we give the full statement and proof of the universal existence of logarithmic-depth shortcuts. Theorem 1 (Simulation is parallelizable). Let A = (Q, Σ, δ) be a semiautomaton, q0 Q, and T 1. Then, there is a depthlog2 T Transformer which continuously simulates AT,q0, with embedding dimension 2|Q| + 2, MLP width |Q|2 + |Q|, and -weight norms at most max{4|Q| + 2, 10T p log |Q| + log T}. It has H = 2 heads with embedding dimension |Q| implying 2|Q| + 2 attention width, and a 3-layer MLP. Proof. The basic idea is that all prefix compositions δ( , σt) . . . δ( , σ1) can be evaluated in logarithmic depth using a binary tree whose leaves are the per-input transition functions δ( , σ) : Q Q. The attention heads select the pairs of functions that need to be composed, while the feedforward networks implement function composition. The network will manipulate functions in terms of their transition maps: for example, the encoding of f := (1 7 1, 2 7 1, 3 7 2) is X q {1,2,3} f(q) eq = [1 1 2]. Small nuances. We will produce a construction for the case where T is a power of 2; general T can be handled via padding. To simplify the construction, we also introduce T padding positions (T 1), . . . , 0 at the beginning; while this greatly simplifies the positional selection construction, this padding construction could be replaced with a slightly more complicated MLP. Also, in this construction, we do not need to use residual connections; the parallel prefix sum algorithm we use can be executed in place , saving a logarithmic factor in the width. We do assume access to the 2 positional embeddings at each layer; in the absence of residual connections, the identity function restricted to these 2 dimensions can be implemented by the MLP and attention heads. Let L = log2 T be the depth of the binary tree. We choose d := 2|Q| + 2. Instead of indexing the dimensions by [d], we give them names: Left function encoding dimensions (q, L) for each q Q. Right function encoding dimensions (q, R) for each q Q. Positional encoding dimensions P1, P2. Without loss of generality, let Q = [|Q|] = {1, . . . , Q} (selecting an arbitrary enumeration of the state space). Also, assume |Q| 2 (if not, add a dummy state). We choose E(σt) := P q Q δ(q, σt) e(q,R), mapping each input symbol to the transition map of its transitions. At the padding positions (T 1), . . . , 0, we will encode the go to q0 function: P q Q q0 e(q,R). Function composition gadget. We first introduce the construction for function composition with a 3-layer Re LU MLP, which will be used by all layers. It gives an exponential improvement over the generic universal function approximation gadget from Lemma 2. Published as a conference paper at ICLR 2023 Lemma 5. There exists a 3-layer Re LU MLP ϕmlp : Rd Rd, with fixed parameters W1, b1, W2, b2, W3 whose dimensions and weights only depend on Q, such that for all f, g : Q Q, ϕmlp outputs the transition map of f g given the concatenated transition maps of f and g. That is, for all |Q|2|Q| choices of f, g: q Q g(q) e(q,L) + X q Q f(q) e(q,R) q Q (f g)(q) e(q,R). 
The intermediate dimensions are d1 = |Q|2 + |Q| and d2 = |Q|2, and weight norms are bounded by 4|Q| + 2. Proof. The first layer uses Lemma 1 to create |Q|2 indicators: one to recognize each value along the e(q,L) direction. Let us index these by q, q Q. Then, this gives us W1 Rd 4|Q|2, b1 R4|Q|2, W 2 R|Q|2 such that [((z W 2z) σ (z 7 W1z + b1))(z)]q,q = 1[e (q,L)z = q ]. We also add Q more weights which let the inputs pass through along the e(q,R) directions (add Q more rows e (q,R) to W1, W 2, calling these indices q for all q Q; set biases to 0), for a total of 4|Q|2 + |Q| hidden units and |Q|2 + |Q| output dimensions of W 2. The second layer implements multiplication between the indicators and function values. The outputs are again indexed by q, q Q. We define W 2 R|Q|2 (|Q|2+|Q|) and b 2 R|Q|2 to be such that [W 2 ](q,q ),( q, q ) := |Q| 1[(q, q ) = ( q, q )], [W 2 ](q,q ), q := 1[q = q], [b 2](q,q ) = |Q|, q, q , q, q Q. Overall, so far we have [(σ (z 7 W 2 W 2z + b 2) σ (z 7 W1z + b1))(z)]q,q = g(q ) 1[e (q,L)z = q ]. The third layer W3 Rd |Q|2 simply converts these activations back into an transition map: q Q e(q,R)e q,q . Finally, we note the weight norms: W1 4|Q|, b1 4|Q| + 2, W 2 W 2 4|Q|, b 2 = |Q|, W3 = 1. Recursive parallel scan. The rest of the construction uses a standard parallel algorithm for computing all prefix function compositions: at layer l [L], compose the function at position t with the function at position t 2l 1. This is a standard algorithm for computing all prefix compositions of associative binary operations with a logarithmic-depth circuit (Hillis & Steele Jr., 1986). We choose the position embeddings to enable implementing these look-backs with rotation matrices. For each t { T + 1, . . . , 0, 1, . . . , T}, we use the circle embeddings Pt,P1 := cos πt , Pt,P2 := sin πt In detail, for each 1 l L: Let θ := π2l 1 T , γ := 100T 2(log |Q| + log T). Let H := 2, k := |Q|. Recall that |Q| 2. We will index the heads by superscripts [L], [R]. Select W [L] Q = W [R] Q = W [R] K := γ (e P1e 1 + e P2e 2 ). Published as a conference paper at ICLR 2023 Select W [L] K := γ (e P1e 1 + e P2e 2 )ρθ, where ρθ is the rotation matrix cos(θ) sin(θ) sin(θ) cos(θ) in the e1, e2 basis. Select W [L] V = W [R] V := P q Q e(q,L)e q . Select W [L] C = P q Q eqe (q,L), W [R] C = P q Q eqe (q,R). At layer l, let α[L], α[R] R2T denote the attention mixture weights of the two heads. With this choice of γ, Lemma 4 and Proposition 5, for each t [T], we are guaranteed that α[R] et 1 and α[L] et 2l 1 1 are both at most 0.1 |Q| T . Thus, by H older s inequality (noting that this mixture is over T vectors of -norm at most |Q|), this attention layer s output is 0.1-close in the -norm to the concatenated transition maps of the functions at positions t and t 2l 1, allowing us to invoke the perturbation-robust function approximation guarantee of Lemma 1 with = 1. For the MLP, we use the function composition gadget. Thus, at the final layer, the (q, R) dimensions at position t contains the transition map of the prefix composition δ( , σt) . . . δ( , σ1) (q 7 q0). It suffices to choose W to be z 7 e (q,R)z for an arbitrary q to read out the sequence of states as scalar outputs in [|Q|]. To output a one-hot encoding, an additional MLP (appended to the end of the final layer) would be required. C.3 PROOF OF THEOREM 2: CONSTANT-DEPTH SHORTCUTS VIA KROHN-RHODES DECOMPOSITION We begin with the full statement of the theorem: Theorem 2 (Transformer Krohn-Rhodes). 
Let A = (Q, Σ, δ) be a solvable semiautomaton (see Definition 6), q0 Q, and T 1. Then, there is a depth-O(|Q|2 log |Q|) Transformer which continuously simulates AT,q0, with embedding dimension O(2|Q||T (A)|), MLP width |Q|O(2|Q|) + O(2|Q||Q| |T (A)| T), attention width O(|Q|2|Q||T (A)|) heads, and weight norms are bounded by 6|Q| T log T + 6 max{|Q|, |Σ|}. We will begin by presenting self-contained constructions for the two atoms in the Krohn-Rhodes decomposition: a modular counter and a memory unit. In Appendix C.3.2, we will introduce necessary background from Krohn-Rhodes theory (including the definition of a solvable semiautomaton). In Appendix C.3.3 and C.3.4, we will complete the proof of Theorem 2. C.3.1 BASE CASES: MODULAR COUNTING AND MEMORY Base case 1: modular addition. We will start with a construction of a tiny network which lets us simulate any semiautomaton whose transformation semigroup is a cyclic group. Later on, we will use copies of this unit to handle all solvable groups. The construction simply uses attention to perform a flat prefix sum, and an MLP to compute the modular sum. Definition 2 (Modular counter semiautomaton). For any positive integer n, define the mod-n modular counter semiautomaton A = (Q, Σ, δ): Q := {0, . . . , n 1}, Σ := {0, . . . , n 1}, δ(q, σ) := (q + σ) mod n, q Q, σ Σ. Lemma 6 (Simulating a modular counter). Let A = (Q, Σ, δ) be the mod-n modular counter semiautomaton. Let q0 Q, and T 1. Then, there is a depth-1 Transformer which continuously simulates AT,q0, with embedding dimension 3, width 4n T, and -weight norms at most 4n T + 2. It has H = 1 head with embedding dimension k = 1, and a 2-layer Re LU MLP. Published as a conference paper at ICLR 2023 Proof. The intuition is simply that the lower triangular matrix causal mask can implement simulation in this cyclic group by performing unweighted prefix sums. The only subtlety is that selecting WQ = WK = 0 does not quite give us prefix sums: the attention mixture weights at position t are 1 t P t [t] et , while we would like the normalizing factor to be uniform across positions (1/T rather than 1/t). It is possible to undo this normalization using the MLP; however, a particulaly simple solution is to use an additional padding input and 1-dimensional position embeddings to absorb a fraction of the attention proportional to 1 t/T. We proceed to formalize this construction, beginning with the input embedding and attention block: Select d := 3, k := 1, H := 1. Intuitively, the 3 dimensions implement {input/output, padding, position} channels . Select input symbol embeddings E(σ) := σ e1 Rd for each σ Σ. Include an extra position , with embedding E( ) := e2 and position encoding P ,: := 0. Think of this as padding at position 0; it is not masked out by the causal attention mask at any position t 1. For t [T], select Pt,: := γte3, where γt := log(2T t) is such that 1 eγt+t = 1 2T . Select WQ := e3, WK := e2, WV := e1, W C := e1. We do not need residual connections. In the output of this attention module, for any input sequence σ1:T , the 1st channel of the output at position t is then where zt {0, . . . , n 1} is such that δ( , σt) = gzt. The MLP simply needs to memorize the function s e1 7 (Ts mod n) e1. We invoke Lemma 1, with = 1 2T , Bx = n 1 2 , By = n. The number of possible values of S (the cardinality of X in Lemma 1) is at most n T. Base case 2: memory lookups. 
It turns out that to simulate semigroups instead of groups, the only additional ingredient is a memory unit, a semiautomaton for which there are read and write operations. The minimal example of this is a flip-flop (Example 2), a semiautomaton which can sequentially remember and retrieve a single bit {0, 1}, and whose transformation semigroup is the flip-flop monoid. It will be convenient to generalize this object to Q states: Definition 3 (Memory semiautomaton). For a given state set Q, define the memory semiautomaton A = (Q, Σ, δ): Σ = Q { }, δ(q, σ) := σ q Q, σ Σ, σ = , δ(q, ) := q q Q. Lemma 7 (Simulating a memory semiautomaton). Let A = (Q, Σ, δ) be the memory semiautomaton. Let q0 Q, and T 1. Then, there is a depth-1 Transformer which continuously simulates AT,q0, with embedding dimension 4, width 4|Q|, and -weight norms at most 2T log(|Q|T). It has H = 1 head with embedding dimension k = 2, and a 2-layer Re LU MLP. Proof. We start in state q0 Q. Our goal is to identify the closest nontoken and output the corresponding state. The attention construction is: Select d = 4, k = 2, H = 1. Published as a conference paper at ICLR 2023 Select input symbol encodings E(σ) := (1[σ = ]q0 + 1[σ = ]σi)e1 + 1[σ = ]e2 + e4 Rd, where the first coordinate denotes the action that sets the state24, the second coordinate denotes whether the input is the no-op action , and the fourth coordinate is padding. We use positional encoding Pt,: := (t/T) e3. WQ := [ 2e4 e4] R4 2, WK := [ce2 ce3] R4 2 for c = O(T log(|Q|T)) as explained below, WV := [e1 0] R4 2, and W C := [e1 0] R4 2. The unnormalized attention score computed for position i attending to j is c(j/T 1[σj = ]). Note that the max attention value is achieved at the closest reset action: the unnormalized scored is non-negative if and only if σj = , and j/T increases with j ensuring that the closest position is chosen. Denote this max position as jmax. In the setting of hard attention, the output for the ith token after the attention module is E(σjmax) e1. In particular, this value is q0 if and only if σj = , j i, i.e. the semiautomaton never leaves the starting state. Otherwise, the value is the value of the nearest nonstate (including the current state). By Lemma 4 (with γ = c/T), we can approximate hard-attention by soft-attention weights α RT , that is, α ejmax 1 2T e c/T . This implies, that the output of the attention j i αj E(σj) e1 E(σjmax) e1 2T|Q| e c/T . Then, the MLP can simply round the first coordinate, and we can invoke Lemma 1 with = 8T|Q| exp( c/T) = 1/2 (for c = T log(16|Q|T)), Bx = |Q|, By = |Q| to get weight norm bound (4+log(|Q|T))T 2 log(|Q|T)T and width 4|Q|. C.3.2 PRIME DECOMPOSITIONS OF GROUPS AND SEMIGROUPS The key idea behind the proof of Theorem 2 is that all semigroups (and thus, all transformation semigroups of semiautomata) admit a prime factorization into elementary components, which turn out to be simple groups and copies of the flip-flop monoid, which have both been discussed in Appendix A.2. This is somewhat counterintuitive: the only constraint on the algebraic structure of a semigroup is associativity (and indeed, there are many more semigroups than groups), but all of these structures can be built using these two types of atoms . These components, as well as the cascade product under which this notion of factorization is defined, are naturally and efficiently implementable by constant-depth self-attention networks. The special case of groups. 
We begin by discussing the analogous decomposition for groups, which generalizes the fact that integers have unique prime factorizations. Let G be a finite group, and let G = Hn Hn 1 H1 H0 = 1 be a composition series: each Hi is a maximal proper normal subgroup of Hi+1; 1 denotes the trivial group with 1 element. Then the quotient group Hi+1/Hi is called a composition factor. The Jordan-H older theorem tells us that one can think about the set of composition factors as an invariant of G. Theorem 6 (Jordan-H older). Any two composition series of G are equivalent: they have the same length n, and the sequences of compositions factors Hi+1/Hi are equivalent under permutation and isomorphism. When each Hi+1/Hi is abelian, G is called a solvable group. It turns out that each Hi+1/Hi is a simple group, so the composition factors of solvable groups can only be cyclic groups of prime order (because every finitely generated abelian group is a direct product of cyclic groups, and, of these, only those of prime order are simple). The smallest non-solvable group is A5, realizable as the 24Technically σ = does not reset the state. We will see that when q0 is selected, it must be that the semiautomaton is always in state q0. Published as a conference paper at ICLR 2023 group of even permutations of 5 elements. As a part of Theorem 2, we will use the composition series to iteratively build neural networks which simulates solvable group operations, requiring intricate constructions to do this with depth independent of the sequence length T. Adding memory to handle semigroups. Now, we move on to semigroups. When not all of the input symbols to a semiautomaton induce permutations, we no longer have the group axiom of invertibility (also, if there is no explicit identity symbol, we are not guaranteed to have the monoid axiom of an identity element either). Intuitively, this would seem to induce a much larger family of algebraic structures; an analogy, which is formalizable by representation theory, is that we are now considering a collection of general matrices under multiplication, instead of only invertible ones. The non-invertible transitions collapse the rank of the transformations, reducing the set of reachable transformations whenever they are included in an input sequence. A landmark result of Krohn & Rhodes (1965) tames the seemingly vast and unorderly universe of general finite semigroups. It extends the Jordan-H older theorem to the case of semigroups, for a more sophisticated notion of decomposition. Since that work, many variations have arisen, in terms of its precise statement, construction of the decomposition, and proof of correctness. Out of these, an important development is the holonomy decomposition method (Zeiger, 1967; Eilenberg, 1974), which forms the basis of our results. We extract the definitions and theorems from Maler & Pnueli (1994), whose exposition emphasizes explicitly tracking the construction of the semiautomaton. We also refer to Maler (2010); Egri-Nagy & Nehaniv (2015); Zimmermann (2020) as recent expositions, containing historical context. Definition 4 (Cascade product; cf. (Maler, 2010), Definition 11). Let n be a positive integer. For each i [n], let A(i) = (Q(i), Σ(i), δ(i)) be a semiautomaton. For i {2, . . . , n}, let ϕ(i) : Q(1) Q(i 1) Σ Σ(i) denote a dependency function. This object ({A(i)}; {ϕ(i)}) is called a transformation cascade, and defines a cascade product semiautomaton A = (Q(1) Q(n), Σ(1), δ) by feedforward simulation under the dependency function. 
We define δ by the i-th component of its output (which we call δ( i) : Q(1) Q(n) Σ(1) : Q(i)): δ( i)((q(1), . . . , q(n)), σ) := δ(i)(q(i), σ(i)), σ(i) := σ if i = 1 ϕ(i)(q(1), . . . , q(i 1), σ) otherwise . The corresponding transformation semigroup T (A) is known as a cascade product semigroup. Intuitively, the cascade specifies a way to compose semiautomata hierarchically: the first layer i = 1 maps input sequences to its state sequence, and each internal layer receives an input which depends on the states of all of the preceding layers. Algebraically, the cascade product semigroup is a subsemigroup of the larger wreath product of semigroups (the straightforward analogue of the wreath product of groups, discussed in Section A.2). Although this is useful from an algebraic point of view, we will not use this perspective; the cascade product is a smaller substructure of the wreath product which is sufficient for semiautomaton simulation. Finally, it will be convenient to define permutation-reset semiautomata, which are a useful intermediate step in the Krohn-Rhodes decomposition. To obtain our final result, we will further break these semiautomata down into flip-flops and simple groups. Definition 5 (Permutation-reset semiautomaton; cf. (Maler & Pnueli, 1994), Definition 12). A semiautomaton A = (Q, Σ, δ) is a permutation-reset semiautomaton if, for each σ Σ, the transition function δ( , σ) : Q Q is either a bijection (i.e. a permutation over the states of Q) or constant (i.e. maps every state to some q(σ)). Associated with each permutation-reset semiautomaton is its permutation group, generated by only the bijections. Now we can state the Krohn-Rhodes theorem, which decomposes every finite semiautomaton into a transformation cascade. Theorem 7 (Krohn-Rhodes). Let A = (Q, Σ, δ) be a semiautomaton. Then, there exists a transformation cascade {A(1), . . . , A(n); ϕ(2), . . . , ϕ(n)}, defining a cascade product semiautomaton A , such that: Published as a conference paper at ICLR 2023 (i) The input symbol space of A(1) (and thus, that of A ) is Σ, the same as that of A. (ii) Letting Q(i) denote the state space of A(i), there exists a function W : Q(1) Q(n) Q such that W A T,q0 simulates AT,q0 for all T 1, q0 Q. For each i [n], the transformation semigroup T (A(i)) is a permutation-reset semiautomaton with at most |Q| states, whose permutation group is a (possibly trivial) subgroup of T (A) (Maler & Pnueli (1994), Theorem 4). (iii) The number of semiautomata in the cascade is n 2|Q|. Furthermore, the cascade has at most |Q| levels: the indices can be partitioned into at most L |Q| contiguous subsets N (1) = {1, . . . , n1}, N (2), {n1 + 1, . . . , n1 + n2}, . . . , N (L) = {n n L + 1, . . . , n} such that ϕ(i) only depends on input indices from previous partitions (Maler & Pnueli (1994), Claim 11 & Corollary 12). With this decomposition, we are now able to define a solvable semiautomaton. Definition 6 (Solvable semiautomaton). Let A = (Q, Σ, δ) be a semiautomaton. We call A solvable if all the permutation groups associated with all of the permutation-reset automata from Theorem 7 are solvable groups. The remainder of this section will build our construction from the bottom up: Appendix C.3.3 will build up from the base case of cyclic groups (Lemma 6), using increasingly sophisticated notions of group products, culminating in a recursive construction which simulates all stages of the Jordan-H older composition series. 
The crucial step is a construction for simulating the semidirect product of groups, given networks which simulate the individual components; this allows us to handle the solvable non-abelian groups. Appendix C.3.4 will build networks which simulate permutation-reset semiautomata. A new base case arises: the memory unit (Lemma 7), a semiautomaton whose transformation semigroup is a generalization of the flip-flop monoid. Combining the constructions for solvable groups and memory units, we obtain simulators for solvable permutation-reset semiautomata. Finally, the cascade product guaranteed by Krohn-Rhodes (Theorem 7) glues all of these pieces together, giving us the final result. C.3.3 SIMULATING SOLVABLE GROUPS We begin by handling groups. Now, we are ready to specify the recursive constructions which glue these components together to form solvable groups. We will proceed in a bottom-up order: (i) Define a canonical semiautomaton AG corresponding to each group G (Definition 7), such that if a network can simulate AG, it can simulate any other semiautomaton whose transformation semigroup T (AG) is G. This lets us talk about simulating groups, rather than particular semiautomata. We will show how to turn simulators for groups N and H into simulators for extensions of N by H, for increasingly sophisticated extensions, until all cases have been captured. (ii) Show how to build the trivial extension: given networks which simulate the groups N and H, simulate the direct product G = N H, by simply running the individual simulators in parallel (Lemma 8). Combined with Lemma 6, this immediately allows us to simulate arbitrary abelian groups with depth 1, since every abelian group is isomorphic to a direct product of cyclic groups. (iii) Show how to build a split extension: given networks which simulate a normal subgroup N and quotient H, construct a network which simulates any semidirect product G = N H (Lemma 9). This is the first place where we will require a sequential cascade of layers. It will allow us to handle certain families of non-abelian groups (including S3, D2n, A4, S4) (iv) Show how to build arbitrary extensions (any G which contains N as a normal subgroup, and for which the quotient group G/N is isomorphic to H), using the wreath product (Lemma 10), which contains all of the group extensions. The wreath product is itself the semidirect product between a |H|-way direct product and H, so this can be done in a constant number of layers, by the above. This finally lets us implement any step of a Published as a conference paper at ICLR 2023 composition series. In particular, using cyclic groups as a simulable base case, this shows that we can simulate all solvable groups. Step (i). It will be convenient to associate with each group G a canonical complete semiautomaton for the class of all semiautomata A for which T (A) = G. It is simply the one whose input symbol space Σ is every transformation reachable by some sequence of inputs (i.e. every element of T (A)). (For a semigroup, we would also want to adjoin the identity element if it is missing, however, we will only find it useful to define this for groups.) Definition 7 (Canonical group semiautomaton; simulating a group). Let G be a finite group. Then, we define the canonical group semiautomaton for G as the semiautomaton (Q, Σ, δ) defined by: Q := G, the set of elements of G. Note that if (for example) G = Sn, we are setting the state space to be the set of n! permutations, not the ground set [n]. Σ := G. 
That is, we include all functions in the input symbol space. δ(g, h) := h g, for all g Q, h Σ. (In algebraic terms, we are embedding the G into its left regular representation, a.k.a. left multiplication action.) Thus, if we take q0 to be the identity element, the sequence of states q1, q2, . . . , q T corresponds to qt = σtσt 1 . . . σ1. When we simulate the canonical group semiautomaton, we will always choose q0 to be the identity element e G. A sequence-to-sequence network is said to continuously simulate G at length T if it continuously simulates the canonical group semiautomaton of G at length T. Notation for composable implementations. Let us furthermore formalize an implementation of group simulation. For any finite group G, T 1, q0 G, let AG = (Q, Σ, δ) be the canonical semiautomaton for G. Then, we summarize a family of concrete implementations of networks which continuously simulate of AT,e G. We write sim : (G, T) 7 (E : G Rd, ftf : RT d RT d, W : Rd G), where the shape parameters of the output can depend on G, T. To reduce notational clutter, we will access the shape attributes of an implementation via objectoriented notation, defining sim(G, T).{depth, dim, heads, head Dim, mlp Width, norm Bound} to respectively denote the complexity-parameterizing quantities {L, d, H, k, max j {d j}, B}, defined in Appendix A.4. Also, we will let sim(G, T).{E, θ, W} respectively denote the encoding layer E, network parameters θnn, and decoding layer W. Canonical encodings of group elements. We also enforce that throughout our constructions of networks which simulate groups, we will maintain that all networks and their submodules manipulate encodings via integer vectors in a consecutive range {0, . . . , n 1}. Furthermore, the identity element will always map to the zero vector. We will keep track of the dimensionality of these vectors sim(G, T).rep Dim d, and their maximum entries sim(G, T).rep Size 1. All encoders E and decoders W will map all group elements to and from this kind of representation, and we will choose W = E 1. In all, the networks will keep a rep Dim-dimensional workspace of integer vectors, with entries bounded by rep Size 1. When combining groups via the various products constructions, we will combine the components individual workspaces to create a larger workspace for the product group s elements. We make some additional remarks on implementations: Note that the canonical semiautomaton forgets about the semiautomaton abstraction, and never assumes that G is a permutation group on the original state space Q of the semiautomaton we would like to simulate. Indeed, when N G are permutation groups on Q, there is no natural Published as a conference paper at ICLR 2023 simulate all G(i) Figure 13: Recursive construction for simulating a direct product of groups G(1) G(n). Any number of groups can be simulated in parallel without increasing the depth. permutation group on Q associated with the quotient H = G/N; it turns out that will consider simulators for N and H.25 To return to solving the simulation problem for some semiautomaton A = (Q, Σ, δ) whose transformation semigroup is isomorphic to G (at length T and initial state q0), let µ : G SQ denote this isomorphism. We use AG T,e G(σ1:T ) as the network, with an encoding layer E µ 1, and decoding layer (π 7 π(q0)) µ W, which can be memorized by an MLP of width O(|G|) via Lemma 2. 
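As an illustration of the canonical group semiautomaton and of this reduction (a hedged sketch with our own naming; group elements are represented as permutation tuples), the following Python snippet rolls out δ(g, h) = h g for G = S3 and then decodes each prefix product by applying it to q0:

```python
def compose(f, g):
    """(f o g) as permutation tuples: i -> f[g[i]]."""
    return tuple(f[g[i]] for i in range(len(g)))

def canonical_group_rollout(sigmas, identity):
    """Canonical group semiautomaton: states are group elements, q0 = identity,
    delta(g, h) = h o g, so q_t = sigma_t o ... o sigma_1."""
    q, states = identity, []
    for h in sigmas:
        q = compose(h, q)
        states.append(q)
    return states

# Example with G = S3 acting on {0, 1, 2}; inputs are two generators.
identity = (0, 1, 2)
swap01, cycle = (1, 0, 2), (1, 2, 0)
prefix = canonical_group_rollout([swap01, cycle, swap01], identity)
# prefix == [(1, 0, 2), (2, 1, 0), (2, 0, 1)]

# To simulate the original semiautomaton on its state space Q, decode each
# prefix permutation pi via the map pi -> pi(q0):
q0 = 2
state_seq = [pi[q0] for pi in prefix]
```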
The modular counter semiautomaton, for which we constructed a simulator in Lemma 6, is the canonical group semiautomaton for the corresponding cyclic group Cn. Calling this construction sim Cn, we can easily verify that it satisfies the canonical simulator s conditions, and: sim Cn.depth = 1. sim Cn.dim = 3. sim Cn.heads = 1. sim Cn.head Dim = 1. sim Cn.mlp Width = 4|G| T. sim Cn.norm Bound 4|G| T + 2 6|G| T. sim Cn.rep Dim = 1. sim Cn.rep Size = |G|. Step (ii). As a precursor to the more sophisticated products, we formalize the obvious fact that two non-interacting parallel semiautomata can be simulated without increasing the depth. First, we define the direct product semiautomaton: Definition 8 (Direct product of semiautomata). Let A = (Q, Σ, δ), A = (Q , Σ , δ ) be two semiautomata. Then, A A = (Q Q , Σ {e} Σ {e}, δ δ ) denotes the natural direct product semiautomaton. Its states are ordered pairs (q Q, q Q ). Its input symbols are defined similarly, adjoining identity inputs (so that δ(q, e) = q, δ (q , e) = q ). The transitions δ δ are defined such that (δ δ )((q, q ), (σ, σ )) := (δ(q, σ), δ (q , σ )). Note that under this definition, we have T (A A ) = T (A) T (A ). In particular, for two groups G, H, we have G H = T (AG) T (AH) = T (AG H) = G H. Lemma 8 (Direct product via parallel simulation). Let G(1), . . . , G(n) be a collection of finite groups, and let T 1. Suppose each group admits a simulation simi := sim(G(i), T). Then, there is a simulation of the direct product group sim := sim(G(1) . . . G(n), T), whose sizes satisfy: 25There is nothing in general preventing quotient groups from being extremely large groups which are not realizable as smaller permutation groups. For concrete examples, see (Kov acs & Praeger, 1989). When we ultimately specialize to simulating the composition series of solvable groups, the largest groups we will handle will be the cyclic groups of prime order, so we will in the end be guaranteed that the groups we want to simulate are realizable with |Q| states, but not directly or canonically. Published as a conference paper at ICLR 2023 (gt, ht) (g t, h t) Figure 14: Recursive construction for simulating the semidirect product N H. The quotient group H is simulated first; these outputs are used to re-map the inputs into the simulator for N. sim .depth = maxi{simi.depth}. sim .dim = P i{simi.dim}. sim .heads = P i{simi.heads}. sim .head Dim = maxi{simi.head Dim}. sim .mlp Width = P i{simi.mlp Width}. sim .norm Bound maxi{simi.norm Bound}. sim .rep Dim = P i{simi.rep Dim}. sim .rep Size = maxi{simi.rep Size}. Proof. First, we pad all of the individual simi with layers implementing identity (add residual connections, and set attention WV and all MLP weight matrices to 0), so that all of them have depth maxi{simi.depth}. Then, the intuition is to construct the direct product semiautomaton by concatenating the workspaces of each G(i). In other words, we set the canonical encoding sim .E of (g(1), . . . , g(n)) to be the concatenation of each simi s encodings. The direct product simply lets each simi take inputs and outputs in its individual workspace. To enable this, we need enough parallel dimensions. We set an embedding space of dimension P i{simi.dim} (and similarly within the heads and MLPs), partitioning the coordinates such that in the product construction, each simi.E and simi.θ only reads and writes to its own dimensions. This clearly simulates the direct product group. Figure 13 provides a sketch of this construction. 
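To illustrate the parallel composition (a minimal Python sketch of the semantics, not of the Transformer construction itself), a direct product of cyclic counters can be simulated by running each component simulator independently on its own input coordinate, with each component realized as a flat prefix sum as in Lemma 6:

```python
from itertools import accumulate

def simulate_cyclic(sigmas, n):
    """Mod-n counter via a prefix sum (the flat, depth-1 shortcut of Lemma 6)."""
    return [s % n for s in accumulate(sigmas)]

def simulate_direct_product(sigma_pairs, moduli):
    """Direct product C_{n_1} x ... x C_{n_k}: each component simulator runs
    independently on its own coordinate (cf. Lemma 8 / Figure 13)."""
    columns = list(zip(*sigma_pairs))
    per_component = [simulate_cyclic(col, n) for col, n in zip(columns, moduli)]
    return list(zip(*per_component))

# Example: C2 x C3 on inputs (sigma^(1)_t, sigma^(2)_t).
inputs = [(1, 2), (1, 1), (0, 2), (1, 1)]
states = simulate_direct_product(inputs, moduli=(2, 3))
# states == [(1, 2), (0, 0), (0, 2), (1, 0)]
```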
Note that the direct product construction already allows us to simulate all finite abelian groups in constant depth, since each such group is isomorphic to the direct product of a collection of abelian groups of prime power order. Step (iii). Now, as a harder (and conceptually crucial) case, we show how to simulate a group which is a semidirect product of two groups we already know how to simulate. This encompasses the direct product as a special case, but can now handle some non-abelian groups which admit such decompositions (like the dihedral group D2n). The catch is that we will have to simulate these groups using a sequential cascade of the individual simulators. This is the key lemma which lets us simulate non-abelian groups: Lemma 9 (Semidirect product via 4-stage cascade). Let G be a finite group which is isomorphic to a semidirect product: G = N H, where N is a normal subgroup of G. Let T 1. Suppose N, H admit simulations sim N := sim(N, T), sim H := sim(H, T). Then, there is a simulation of G, sim := sim(G, T), whose sizes satisfy: sim .depth = sim N.depth + sim H.depth + 2. sim .dim = sim N.dim + sim H.dim. sim .heads = max{sim N.heads, sim H.heads}. sim .head Dim = max{sim N.head Dim, sim H.head Dim}. sim .mlp Width = max{sim{N,H}.mlp Width, 4|G|}. Published as a conference paper at ICLR 2023 sim .norm Bound max{sim{N,H}.norm Bound, 6 sim{N,H}.rep Size, sim N.rep Dim + sim H.rep Dim}. sim .rep Dim = sim N.rep Dim + sim H.rep Dim. sim .rep Size = max{sim N.rep Size, sim H.rep Size}. Proof. The intuition is as follows, using the dihedral group D2n = Cn C2 as an example: For simplicity, let us think of the reversible car on a circular world semiautomaton, whose transformation semigroup is D2n. Its state consists of a direction {+1, 1}, and a position {0, 1, . . . , n 1}. It has two types of inputs: advance by i (increment the position by i in the current direction, modulo n), and reverse (flip the sign of the direction). Our simulation task is to track the car s state sequence, given a sequence of inputs (in constant depth, of course). It is intuitively clear that we can (and should) compute the sequence corresponding to direction at time t , which is equivalent to simulating the parity semiautomaton. We will convert the advance moves via a basis transformation : whenever the current direction is 1, an advance by i should be converted into i. Then, we have reduced the problem to the prefix sum. Algorithm. This intuition essentially shows us how to implement arbitrary semidirect products; we derive the basis change from ϕ. Before implementing it with Transformer operations, we formalize this basis transformation . Recall that by the definition of a semidirect product, the elements of N H can be written as pairs (g N, h H), equipped with a homomorphism ϕ : h Aut(N) which specifies a multiplication rule: (g, h) (g , h ) := (gϕh(g ), hh ). Let us write down the properties of ϕ: ϕ is a homomorphism. That is, ϕh h = ϕh(ϕh ( )) = ϕh ϕh as permutations on N. The output of that homomorphism, ϕh, is also a homomorphism. That is, ϕh(gg ) = ϕh(g) ϕh(g ). Let us roll out the definition of the semidirect product, given a sequence of inputs (gt, ht): (g2, h2) (g1, h1) = (g2 ϕh2(g1), h2h1), (g3, h3) (g2, h2) (g1, h1) = (g3 ϕh3(g2 ϕh2(g1)), h3h2h1), (g4, h4) (g1, h1) = (g4 ϕh4(g3 ϕh3(g2 ϕh2(g1))), h4h3h2h1). In general, by induction, letting (g t, h t) denote (gt, ht) (g1, h1), we have g t = gt ϕht(gt 1) ϕhtht 1(gt 2) ϕht...h3(g2) ϕht...h2(g1). 
Applying $\varphi_{\bar h_t}^{-1}$ to both sides, we notice that
$$\varphi_{\bar h_t}^{-1}(\bar g_t) = \varphi_{\bar h_t}^{-1}(g_t)\,\varphi_{\bar h_{t-1}}^{-1}(g_{t-1})\cdots\varphi_{\bar h_2}^{-1}(g_2)\,\varphi_{\bar h_1}^{-1}(g_1).$$
Thus, it suffices to compute each $\bar h_t = h_t h_{t-1} \cdots h_1$, map each $g_t \mapsto \varphi_{\bar h_t}^{-1}(g_t)$, compute the prefix products in these "coordinates", then invert the mapping to get back $\bar g_t$.

Implementation. Like before, we partition the embedding dimension in our construction $\mathrm{sim}_\rtimes$ into blocks, one for each component simulator. Let us index the dimensions by the $d_N := \mathrm{sim}_N.\mathrm{dim}$ indices in the "$N$ channel", and analogously for the $d_H$-dimensional "$H$ channel". We choose the canonical encoding $E$ to map elements to their individual channels:
$$E(g, h) = \mathrm{sim}_N.E(g)\ \text{(in the $N$ channel)}\; +\; \mathrm{sim}_H.E(h)\ \text{(in the $H$ channel)}.$$
We proceed to specify the construction layer by layer. Let $L_{\{N,H\}}$ denote $\mathrm{sim}_{\{N,H\}}.\mathrm{depth}$.

Layers 1 through $L_H$: quotient group simulation. As suggested by the intuitive sketch, we begin with $L_H$ Transformer layers, which are just a copy of $\mathrm{sim}_H.\theta$, reading and writing in the $H$ channel, with a parallel residual layer in the $N$ channel. After these $L_H$ layers, the output at each position $t$ is an integer vector whose $H$ channel contains $\bar h_t$ and whose $N$ channel contains $\mathrm{sim}_N.E(g_t)$.

Layer $L_H + 1$: basis change. Now, let us add one more "mixing" Transformer layer, whose attention block is the identity (see Footnote 26); we only need a 3-layer MLP block, which represents the function $(g \in N, h \in H) \mapsto \varphi_h^{-1}(g)$. To do this, we invoke Lemma 2 (choosing the output to be in the same representation as that used by $\mathrm{sim}_N.E$, in the $N$ channel), with $\epsilon = 1$, $d_{\mathrm{in}} = \mathrm{sim}_N.\mathrm{repDim} + \mathrm{sim}_H.\mathrm{repDim}$, $B_x = \max\{\mathrm{sim}_N.\mathrm{repSize}, \mathrm{sim}_H.\mathrm{repSize}\}$, $B_y = \mathrm{sim}_N.\mathrm{repSize}$, giving us a construction with $d_1 \le 4(|N| + |H|)$, $d_2 \le |N|\,|H|$, $\|W_1\| \le 4$, $\|b_1\| \le 6\max\{\mathrm{sim}_N.\mathrm{repSize}, \mathrm{sim}_H.\mathrm{repSize}\}$, $\|W_2\| \le 1$, $\|b_2\| \le \mathrm{sim}_N.\mathrm{repDim} + \mathrm{sim}_H.\mathrm{repDim}$, $\|W_3\| \le \mathrm{sim}_N.\mathrm{repSize}$. We also add residual connections in the $H$ channel. In summary, after this layer, the output at each position $t$ is an integer vector whose $H$ channel contains $\bar h_t$ and whose $N$ channel contains $\varphi_{\bar h_t}^{-1}(g_t)$.

Layers $L_H + 2$ through $L_H + L_N + 1$: normal group simulation. The next $L_N$ layers are a copy of $\mathrm{sim}_N.\theta$, with residual connections in the $H$ channel. After these layers, the output at each position $t$ is an integer vector whose $H$ channel contains $\bar h_t$ and whose $N$ channel contains $\varphi_{\bar h_t}^{-1}(\bar g_t)$.

Layer $L_H + L_N + 2$: undoing the basis change. Now we add one more Transformer layer, whose attention block is the identity; we will again use a 3-layer MLP block to represent the inverse of our mapping function, $(g \in N, h \in H) \mapsto \varphi_h(g)$. This uses Lemma 2, with exactly the same bounds. At the end of this final "unmixing" layer, the output at each position $t$ is an integer vector whose $H$ channel contains $\bar h_t$ and whose $N$ channel contains $\bar g_t$; thus, this is a valid simulation of the semidirect product. This construction is sketched in Figure 14.

Step (iv). Note that $H \cong G/N$ does not imply that $G$ is a semidirect product of $N$ and $H$. Thus, although simulating semidirect products allows us to handle some families of non-abelian groups, this does not yet allow us to handle general solvable groups (i.e., general steps of a composition series, even with a cyclic quotient group). The smallest example is the non-abelian quaternion group $Q_8 = \{\pm 1, \pm i, \pm j, \pm k\}$ under quaternion multiplication, which cannot be realized as a semidirect product of proper subgroups.
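To illustrate the four stages concretely, here is a small runnable sketch (ours; the variable names and checking harness are not from the paper) for $D_{2n} \cong C_n \rtimes C_2$. Each stage is either a position-wise map or a prefix scan, which is what allows a constant number of Transformer layers (on top of the component simulators) to implement it.

```python
# Sketch (ours): the 4-stage reduction of Lemma 9 for D_2n ~= C_n x| C_2,
# with phi_h(g) = (-1)^h g (mod n). Stages 2 and 4 are pointwise maps;
# stages 1 and 3 are prefix products in the component groups.
from itertools import accumulate
import random

n, T = 7, 30
random.seed(1)
inputs = [(random.randrange(n), random.randrange(2)) for _ in range(T)]   # (g_t, h_t)

# Stage 1 (layers 1..L_H): prefix products h_bar_t in the quotient H = C_2.
h_bar = list(accumulate((h for _, h in inputs), lambda a, b: (a + b) % 2))

# Stage 2 (layer L_H + 1): basis change g_t -> phi^{-1}_{h_bar_t}(g_t), pointwise.
g_hat = [(-g) % n if hb else g for (g, _), hb in zip(inputs, h_bar)]

# Stage 3 (layers L_H+2 .. L_H+L_N+1): prefix products in N = C_n on mapped inputs.
g_hat_bar = list(accumulate(g_hat, lambda a, b: (a + b) % n))

# Stage 4 (layer L_H + L_N + 2): undo the basis change to recover g_bar_t.
g_bar = [(-g) % n if hb else g for g, hb in zip(g_hat_bar, h_bar)]

# Reference: naive sequential prefix products (g_t, h_t) ... (g_1, h_1).
def mul(a, b):
    (g, h), (gp, hp) = a, b
    return ((g + ((-gp) % n if h else gp)) % n, (h + hp) % 2)

reference, q = [], (0, 0)
for s in inputs:
    q = mul(s, q)                 # the new symbol multiplies on the left
    reference.append(q)
assert list(zip(g_bar, h_bar)) == reference
```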
Instead, since semidirect products do not suffice for general extensions, we appeal to the Krasner–Kaloujnine universal embedding theorem (Krasner & Kaloujnine, 1951): a characterization of all of the groups $G$ which are extensions of $N$ by $H$, as subgroups of the wreath product $N \wr H$.

Lemma 10 (Wreath product via direct and semidirect products). Let $G$ be a finite group which is isomorphic to a wreath product $G \cong N \wr H$. Let $T \ge 1$. Suppose $N, H$ admit simulations $\mathrm{sim}_N := \mathrm{sim}(N, T)$ and $\mathrm{sim}_H := \mathrm{sim}(H, T)$. Then there is a simulation of $G$, $\mathrm{sim}_\wr := \mathrm{sim}(G, T)$. In the case where $\mathrm{sim}_H.\mathrm{repDim} = 1$, the sizes satisfy:

- sim_≀.depth = sim_N.depth + sim_H.depth + 2.
- sim_≀.dim = |H|·sim_N.dim + sim_H.dim.
- sim_≀.heads = max{|H|·sim_N.heads, sim_H.heads}.
- sim_≀.headDim = max{sim_N.headDim, sim_H.headDim}.
- sim_≀.mlpWidth = max{|H|·sim_N.mlpWidth, sim_H.mlpWidth, 5|H|²|N|}.
- sim_≀.normBound ≤ max{sim_{N,H}.normBound, 6|H|}.
- sim_≀.repDim = |H|·sim_N.repDim + 1.
- sim_≀.repSize = max{sim_N.repSize, sim_H.repSize}.

Footnote 26: Even when an attention block simply implements the identity, we choose to include it, rather than combining the preceding and subsequent MLPs into a single MLP. This is to ensure that, if we compose a number of Transformer layers that depends on $|Q|$, the depth of each MLP is bounded by an absolute constant.

[Figure 15: Recursive construction for simulating the wreath product $N \wr H$. One independent copy of $N$ is instantiated for each element of $H$, while the simulator for $H$ permutes them.]

Proof. Even though the wreath product's algebraic structure can be very complex, the construction just requires us to implement its relatively simple description. Applying Lemma 8, we obtain a network sim_× which simulates $N \times \cdots \times N$ ($|H|$ copies). Then we simply apply Lemma 9, using sim_H to re-map the inputs to sim_× for the normal subgroup. This construction is sketched in Figure 15.

Concise implementation of reindexing. We can make one interesting improvement over a generic application of Lemmas 8 and 9: the structure of the mixing function $\varphi$ which specifies the semidirect product is extremely regular. Very fortunately, this allows us to avoid any dependence on the size of the wreath product group ($|N|^{|H|} \cdot |H|$) in the size measures of the implementation. A general automorphism on $N \times \cdots \times N$ is specified by its $|N|^{|H|}$ values. However, in this case, $\varphi_h$ is just a permutation, specified by how each of the $|H|$ channels should switch places. Thus, much like the function composition gadget in Theorem 1, we can construct a simpler MLP than the generic one from Lemma 2. Specifically, we would like to approximate the function $\varphi : H \times (N \times \cdots \times N) \to (N \times \cdots \times N)$ which simply applies $\pi_h$ to the indices:
$$\varphi_h(g^{(1)}, \ldots, g^{(|H|)}) := (g^{(\pi_h(1))}, \ldots, g^{(\pi_h(|H|))}).$$
In the component neural networks' representation space, we need the MLP to implement
$$\big(\mathrm{sim}_N.E(g^{(1)}), \ldots, \mathrm{sim}_N.E(g^{(|H|)}), \mathrm{sim}_H.E(h)\big) \mapsto \big(\mathrm{sim}_N.E(g^{(\pi_h(1))}), \ldots, \mathrm{sim}_N.E(g^{(\pi_h(|H|))})\big),$$
recalling that the elements $g$ and $h$ are represented by integer vectors with $\ell_\infty$-norm at most $\mathrm{sim}_{\{N,H\}}.\mathrm{repSize}$. Notice that, when the representation of an element of $H$ is a single integer, restricting to any particular coordinate in the representation of an element $g$, this is the same function-composition problem on transition maps solved by Lemma 5 in the proof of Theorem 1, which uses its left inputs to permute its right inputs (modulo converting the representations from $\{0, \ldots, |H|-1\}$ to
$\{1, \ldots, |H|\}$, which we can do by shifting the indicators at the input and final-layer output weights). Thus, $|N| \cdot \mathrm{sim}_N.\mathrm{dim}$ parallel copies of the 3-layer function composition MLP suffice, yielding $d_1 = 4|H|^2|N| + |H|\,|N| < 5|H|^2|N|$, $d_2 = |H|^2|N|$, $\|W_1\| \le 4|H|$, $\|b_1\| \le 6|H|$, $\|W_2\| \le 4|H|$, $\|b_2\| = |H|$, $\|W_3\| = 1$. When the information about group elements in $H$ is encoded by multiple integers, it is straightforward to extend this construction by replacing the one-dimensional indicator with the multi-dimensional indicator from Lemma 2. We will skip the details of this case, since our final results are only about solvable groups: when we want to simulate a general group extension, it will always come from the composition series, so that $H$ is always a cyclic group of prime order.

Thus, for general group extensions $G$, we can construct $\mathrm{sim}_\wr$, the wreath product simulator for $N \wr H$, and combine the individual simulators. Note that we can throw away the excess group elements from the simulator: we only include in $\mathrm{sim}_\wr.E$ and $\mathrm{sim}_\wr.W$ the group elements which correspond to the subgroup isomorphic to $G$. Then, no part of this construction needs to maintain a width or matrix entry scaling with $|N \wr H|$.

Putting all of this together, we state an intermediate theorem, which is our most general result for groups:

Theorem 8 (Simulation of solvable groups). Let $G$ be a solvable group which is isomorphic to a permutation group on $n$ elements. Let $T \ge 1$. Then there is a Transformer network $\mathrm{sim} := \mathrm{sim}(G, T)$ which simulates $G$ at length $T$, for which we have the following size bounds:

- sim.depth ≤ 3 log₂|G|.
- sim.dim ≤ 2|G|.
- sim.heads ≤ 2|G|.
- sim.headDim = 1.
- sim.mlpWidth ≤ 20nT|G|.
- sim.normBound ≤ 6nT.
- sim.repDim ≤ 2|G|.
- sim.repSize ≤ n.

Proof. Let $G = H_\ell \triangleright H_{\ell-1} \triangleright \cdots \triangleright H_1 \triangleright H_0 = \{1\}$ denote the composition series. Then, because $G$ is solvable, all of the quotient groups $K_i := H_{i+1}/H_i$ are simple and abelian, thus cyclic groups of prime order. Since $G$ is assumed to be a subgroup of $S_n$, none of these primes can be greater than $n$. Thus, every quotient group $K_i$ in the chain satisfies $2 \le |K_i| \le n$. Also, note that the length $\ell$ of the composition series is at most $\log_2(|G|)$ (since each inclusion at least halves the size of the group).

We start with a simulation of $H_1$, which must be a cyclic group, and build the sequence of group extensions recursively until we obtain $G$. In the worst case (in the sense that the implementation size bounds from Lemma 10 are maximized), each step in the composition series must be manifested by a wreath product with $K := C_n$. Recall that we have:

- sim_K.depth = 1.
- sim_K.dim = 3.
- sim_K.heads = 1.
- sim_K.headDim = 1.
- sim_K.mlpWidth = 4nT.
- sim_K.normBound ≤ 6nT.
- sim_K.repDim = 1.
- sim_K.repSize ≤ n.

At each step $i = 0, \ldots, \ell - 1$, Lemma 10, with $H := K_i$ and $N := H_i$, implies:

- sim_{H_{i+1}}.depth ≤ sim_{H_i}.depth + 3 (1 more layer to simulate the cyclic group $K_i$, and 2 from the wreath product's mixing operations).
- sim_{H_{i+1}}.dim ≤ |K_i|·sim_{H_i}.dim + 1 (noting that all of the components can reuse the same input and positional encoding dimensions).
- sim_{H_{i+1}}.heads ≤ |K_i|·sim_{H_i}.heads + 1.
- sim_{H_{i+1}}.headDim ≤ max{1, 1, …, 1} = 1.
- sim_{H_{i+1}}.mlpWidth ≤ max{|K_i|·sim_{H_i}.mlpWidth, 4nT, 5|K_i|²|H_i|}.
- sim_{H_{i+1}}.normBound ≤ max{6nT, 6|K_i|}.
- sim_{H_{i+1}}.repDim = |K_i|·sim_{H_i}.repDim + 1.
- sim_{H_{i+1}}.repSize ≤ n.

Iterating these recursive inequalities $\ell \le \log_2|G|$ times gives us the desired bounds. Note that we are using Lagrange's theorem ($\prod_i |K_i| = |G|$), as well as the fact that, for positive integers $m_1, \ldots, m_\ell \ge 2$, we have a bound on the series of prefix products: $\sum_{i=1}^{\ell} \prod_{j=1}^{i} m_j \le 2 \prod_{j=1}^{\ell} m_j$.
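As a concrete check of how these recursive inequalities resolve, the following sketch (ours; the helper `iterate_bounds` is hypothetical, not from the paper) iterates them for the composition series of a small solvable group, e.g. $A_4 \le S_4$ with quotient orders $2, 2, 3$, and compares the result against the closed-form bounds of Theorem 8.

```python
# Sketch (ours): iterate the recursive size inequalities from the proof of
# Theorem 8 along a composition series with quotient orders `primes`, then
# check the stated closed-form bounds. Example: A_4 <= S_4 has quotients 2, 2, 3.
from math import log2, prod

def iterate_bounds(primes, n, T):
    # Base case: H_1 is cyclic of order primes[0] <= n (Lemma 6 / sim_K sizes).
    depth, dim, heads, mlp, rep_dim = 1, 3, 1, 4 * n * T, 1
    H_size = primes[0]
    for k in primes[1:]:                       # one extension step per K_i
        mlp = max(k * mlp, 4 * n * T, 5 * k * k * H_size)   # Lemma 10 width
        depth += 3                             # +1 layer for K_i, +2 for mixing
        dim = k * dim + 1
        heads = k * heads + 1
        rep_dim = k * rep_dim + 1
        H_size *= k                            # |H_{i+1}| = |K_i| * |H_i|
    return depth, dim, heads, mlp, rep_dim

n, T = 4, 100
primes = [2, 2, 3]                             # |G| = 12
G = prod(primes)
depth, dim, heads, mlp, rep_dim = iterate_bounds(primes, n, T)
assert depth <= 3 * log2(G)
assert dim <= 2 * G and heads <= 2 * G and rep_dim <= 2 * G
assert mlp <= 20 * n * T * G
```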
C.3.4 SIMULATING SEMIGROUPS

Now, using this construction and the results developed in the previous section for groups, we complete the construction for semigroups:

- We combine the memory gate construction (Lemma 7) and any network simulating a group to implement the corresponding permutation-reset semiautomaton (Definition 5); these are the elements of the cascade in Theorem 7.
- To finish, we implement the cascade product (Definition 4) of these permutation-reset semiautomata, guaranteed to exist by Theorem 7. This gives the full result.

First, we summarize the findings of Lemma 7, naming this neural network $\mathrm{sim}_M$ in our object-oriented notation. Note that, since we are no longer simulating canonical group semiautomata past this point, repDim and repSize are no longer well-defined.

- sim_M.depth = 1.
- sim_M.dim = 4.
- sim_M.heads = 1.
- sim_M.headDim = 2.
- sim_M.mlpWidth = 4|Q|.
- sim_M.normBound ≤ 2T log(|Q|T).

Lemma 11 (Simulating a permutation-reset semiautomaton). Let $A = (Q, \Sigma, \delta)$ be a permutation-reset semiautomaton (see Definition 5), and let $G$ denote its permutation group. Let $T \ge 1$ and $q_0 \in Q$. Let $\mathrm{sim}_G := \mathrm{sim}(G, T)$ be a Transformer network which continuously simulates $G$ at length $T$. Then there is a Transformer network $\mathrm{sim}'_G$ which continuously simulates $A$ at length $T$ from initial state $q_0$, with size bounds:

- sim′_G.depth = sim_G.depth + sim_M.depth + 1 ≤ 3 log₂|G| + 2.
- sim′_G.dim = sim_G.dim + sim_G.repDim + sim_M.dim ≤ |G| + |Q| + 4.
- sim′_G.heads = sim_G.heads + sim_M.heads ≤ 2|G| + 1.
- sim′_G.headDim = sim_G.headDim + sim_M.headDim + sim_G.repDim ≤ |Q| + 3.
- sim′_G.mlpWidth = sim_G.mlpWidth + sim_M.mlpWidth + |G|²|Q| ≤ 20nT|G| + 4|Q| + |G|²|Q|.
- sim′_G.normBound ≤ max{sim_G.normBound, sim_M.normBound, 6|Q|} ≤ 6|Q|·T log T.

Proof. Without loss of generality, we will let $Q := [|Q|]$. We split the embedding space in our construction into two channels: the sim_G.dim dimensions used by the group simulator, and a channel consisting of 4 additional dimensions, to be used by a copy of the memory semiautomaton whose symbol set is $Q$. Let us call these the $G$ and $M$ channels. For the reset symbols, let $E_M(\sigma)$ denote the 4-dimensional encoding of $\sigma$ from the memory semiautomaton.

Since we defined $G$ to be isomorphic to the permutation group associated with $A$, there is a bijection $\Phi$ between group elements and the corresponding permutations of $Q$. We choose the embedding $E$ as follows:
$$E(\sigma) := \begin{cases} \mathrm{sim}_G.E\big(\Phi^{-1}(\delta(\cdot, \sigma))\big)\ \text{($G$ channel)} + E_M(\perp)\ \text{($M$ channel)}, & \text{if } \delta(\cdot, \sigma) \text{ is a bijection}, \\ \mathrm{sim}_G.E(e_G)\ \text{($G$ channel)} + E_M(q_\sigma)\ \text{($M$ channel)}, & \text{if } \delta(\cdot, \sigma) \equiv q_\sigma \text{ is a reset}, \end{cases}$$
where $\perp$ denotes the memory semiautomaton's null (no-write) symbol. Let $L_G$ denote $\mathrm{sim}_G.\mathrm{depth}$.

Layers 1 through $L_G$: group simulation. The first $L_G$ layers are chosen to be a copy of $\mathrm{sim}_G.\theta$ in the $G$ channel, and only residual connections in the $M$ channel. At the end of this, given any inputs $\sigma_{1:T}$ which map via $\Phi^{-1}$ to $g_t$ (letting the group element be the identity when $\sigma_t$ is a reset symbol), the outputs in the $G$ channel will be $d_G := \mathrm{sim}_G.\mathrm{repDim}$-dimensional encodings of the prefix group products $\bar g_t = g_t g_{t-1} \cdots g_1$.

Now, letting $r(t)$ denote the most recent reset (the largest $\tau \le t$ such that $\sigma_\tau$ is a reset token), we notice that the state we want can be derived from this sequence:
$$q_t = \Phi(g_t g_{t-1} \cdots g_{r(t)})\big(q_{\sigma_{r(t)}}\big) = \Phi\big(\bar g_t\,\bar g_{r(t)}^{-1}\big)\big(q_{\sigma_{r(t)}}\big). \quad \text{(C.2)}$$
Here, if there have been no resets up to time $t$, we define $r(t)$ to be $0$, and we treat $q_0$ like a reset symbol at the beginning of the sequence. Also, note that our canonical group semiautomaton simulator always uses $\bar g_0 = e_G$ as its initial state.
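Equation (C.2) is easy to check numerically; the following sketch (ours, purely for illustration) compares it against direct sequential simulation of a randomly generated permutation-reset semiautomaton.

```python
# Sketch (ours): verify Equation (C.2) on a random permutation-reset
# semiautomaton over Q = {0, ..., m-1}. Permutations are lists p with p[x]
# the image of x; composition mirrors the prefix products g_t ... g_1.
import random

m, T = 5, 40
random.seed(0)
symbols = ([('perm', random.sample(range(m), m)) for _ in range(3)] +
           [('reset', random.randrange(m)) for _ in range(2)])
q0 = 0
sigma = [random.choice(symbols) for _ in range(T)]

# Sequential simulation: q_t = delta(q_{t-1}, sigma_t).
qs, q = [], q0
for kind, val in sigma:
    q = val[q] if kind == 'perm' else val
    qs.append(q)

# Shortcut via Eq. (C.2): prefix products (identity on resets) + last reset r(t).
identity = list(range(m))
compose = lambda p, q_: [p[x] for x in q_]          # (p o q_)(x) = p[q_[x]]
invert = lambda p: [p.index(x) for x in range(m)]

g_bar, acc = [identity], identity                   # g_bar[0] = e
for kind, val in sigma:
    acc = compose(val if kind == 'perm' else identity, acc)
    g_bar.append(acc)

last, r = 0, []                                     # r(t): most recent reset, 0 if none
for t, (kind, _) in enumerate(sigma, start=1):
    if kind == 'reset':
        last = t
    r.append(last)

for t in range(1, T + 1):
    rt = r[t - 1]
    q_reset = q0 if rt == 0 else sigma[rt - 1][1]   # q_{sigma_{r(t)}} (or q_0)
    Phi = compose(g_bar[t], invert(g_bar[rt]))      # Phi(g_bar_t g_bar_{r(t)}^{-1})
    assert Phi[q_reset] == qs[t - 1]
```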
Layer $L_G + 1$: memory lookup and copy. To implement the above, at layer $L_G + 1$ we put a copy of the memory semiautomaton in channel $M$, setting its initial state to $q_0$. We will modify this construction slightly, extending $W_V$ with the identity matrix on the $d_G$ group-element encoding dimensions of channel $G$. Intuitively, when the memory unit fetches the last non-$\perp$ token, we would like it to copy the corresponding $\bar g_t$. Note that $\mathrm{sim}_G.\mathrm{repSize}$, the $\ell_\infty$-norm bound on the group element encodings, is at most $|Q|$ by Theorem 8, so we do not need to modify the $W_Q, W_K$ norms to increase the attention head's precision. The final modification is that we append to $W_C$ an identity matrix copying the $d_G$ embedding dimensions to a new $d_G$-dimensional channel, which we will call the $I$ (for "invert") channel (set to 0 in the embedding and all preceding layers). Then, by Lemma 7, at the end of this layer, at each position $t$, the $M$ channel will contain $q_{\sigma_{r(t)}}$ in dimension 1, and channel $I$ will contain $\bar g_{r(t)}$. Finally, in channel $G$ we use only residual connections, preserving $\bar g_t$.

Layer $L_G + 2$: applying $\Phi(g h^{-1})$ pointwise. This finally allows us to execute Equation (C.2) at each position $t$. We use one more Transformer layer, with attention block implementing the identity. The MLP memorizes the function $(g, h, q) \mapsto \Phi(g h^{-1})(q) \cdot e_1$ (the coordinate is selected arbitrarily), with the concatenated ($d_{\mathrm{inv}} := 2\,\mathrm{sim}_G.\mathrm{repDim} + 1$)-dimensional encodings on the ($G$, $I$, $M$) channels, whose activations have $\ell_\infty$-norms bounded by $|Q|$. We invoke Lemma 2, with parameters $\epsilon = 1$, $d_{\mathrm{in}} = d_{\mathrm{inv}}$, $B_x = |Q|$, $B_y = |Q|$, giving us a construction with $d_1 \le 4 d_{\mathrm{inv}}(|Q| + 1)$, $d_2 \le |G|^2|Q|$, $\|W_1\| \le 4$, $\|b_1\| \le 6|Q|$, $\|W_2\| \le 1$, $\|b_2\| \le d_{\mathrm{inv}}$, $\|W_3\| \le |Q|$. From this output, $W$ simply decodes the correct $q_t$ from dimension 1.
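The memory lookup of Layer $L_G + 1$ hinges on an attention head that concentrates on the most recent reset position and copies its payload. The following sketch (ours; Lemma 7's actual weights are not reproduced here, and the score scale $C$ is an assumption standing in for the $T\log(\cdot)$-scale weights) illustrates how a single softmax head can do this to within rounding error.

```python
# Sketch (ours): a single softmax-attention head that fetches the payload at
# the most recent "reset" position -- the role of the memory unit in Layer
# L_G + 1. Reset positions get scores growing with position; others get a
# large negative score, so the softmax is nearly one-hot on the latest reset.
import math, random

T, C = 30, 300.0                  # C stands in for the T*log(...)-scale weights
random.seed(1)
is_reset = [random.random() < 0.3 for _ in range(T)]
payload = [random.randrange(10) for _ in range(T)]   # e.g. encodes (q_sigma, g_bar)

def attend(t):
    """Causal softmax attention over positions 0..t-1, returning the value mix."""
    scores = [C * (tau + 1) / T if is_reset[tau] else -C for tau in range(t)]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    return sum(wi * pi for wi, pi in zip(w, payload)) / sum(w)

for t in range(1, T + 1):
    resets = [tau for tau in range(t) if is_reset[tau]]
    if resets:                    # in the real construction, q_0 acts as a reset at t = 0
        assert abs(attend(t) - payload[resets[-1]]) < 1e-2
```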
Lemma 12 (Implementing the transformation cascade). Let $A = (Q, \Sigma, \delta)$ be a semiautomaton, and let $T \ge 1$. Let $\{A^{(1)}, \ldots, A^{(n)};\ \phi^{(2)}, \ldots, \phi^{(n)}\}$ be the transformation cascade (Definition 4) which simulates $A$, as guaranteed by Theorem 7. For each $i$, let $\mathrm{sim}_i$ be a Transformer network which continuously simulates the permutation-reset semiautomaton $A^{(i)}$ at length $T$. Then there is a Transformer network $\mathrm{sim}_A$ which simulates $A$ at length $T$. Its size bounds are:

- sim_A.depth = |Q|·(max_i{sim_i.depth} + 1) − 1 ≤ 3|Q|² log|Q| + 7|Q|.
- sim_A.dim = ∑_{i=1}^{n} sim_i.dim + 1 ≤ 2^{|Q|}(|T(A)| + |Q| + 4) + 1.
- sim_A.heads = ∑_{i=1}^{n} sim_i.heads ≤ 2^{|Q|+1}(|T(A)| + 1).
- sim_A.headDim = max_{i=1}^{n}{sim_i.headDim} ≤ |Q| + 3.
- sim_A.mlpWidth = ∑_{i=1}^{n} sim_i.mlpWidth + 2^{|Q|}·|Q|^{2^{|Q|}}·|Σ| ≤ 2^{|Q|}(20|Q|·|T(A)|·T + 4|Q| + |T(A)|²|Q| + |Q|^{2^{|Q|}}|Σ|).
- sim_A.normBound ≤ max{max_{i=1}^{n}{sim_i.normBound}, 2^{|Q|}(|T(A)| + 5|Q|)} + 6 max{|Q|, |Σ|} ≤ max{6|Q|·T log T, 2^{|Q|}(|T(A)| + 5|Q|)} + 6 max{|Q|, |Σ|}.

Proof. At this point, most of the work has been done for us. We create a separate channel $i$ for each component permutation-reset semiautomaton $A^{(i)}$. This requires a total of $\sum_{i=1}^{n} \mathrm{sim}_i.\mathrm{dim}$ embedding dimensions. In addition to these channels, we keep one dimension (with residual connections throughout the network) to represent the input $\sigma_t$; let $e_\Sigma$ denote the unit vector along this coordinate. Choosing an arbitrary enumeration to identify $\Sigma$ with $[|\Sigma|]$, we select the embeddings to be $E(\sigma) := \sigma \cdot e_\Sigma$.

The layers of $\mathrm{sim}_A$ are divided into $\tilde L = |Q|$ subnetworks (which are just Transformer networks), which we will concatenate sequentially at the end. Let these subnetworks be indexed by $\tilde\ell \in \{1, \ldots, \tilde L\}$. Each subnetwork starts with a parallel simulation (as in the direct product construction of Lemma 8, padding with layers implementing the identity if their depths do not match), combining all of the $\mathrm{sim}_i.\theta$ in the $\tilde\ell$-th level of the Krohn–Rhodes decomposition, as defined by Theorem 7. Each $\mathrm{sim}_i.\theta$ is chosen to operate in its own channel $i$. We add residual connections on all of the input/output dimensions in each channel. Then, at the end of each subnetwork except the final one ($1 \le \tilde\ell \le \tilde L - 1$), we append one more Transformer layer with identity attention block, whose MLP implements the "wiring" specified by the $\phi^{(i)}$ from the next level of the decomposition. Namely, we invoke Lemma 2 with $\epsilon = 1$, $B_x = \max\{|Q|, |\Sigma|\}$, $B_y = |\Sigma|$, giving us, for each such pre-final layer and each component $i$ of the next level, an MLP which represents the function
$$\big(\mathrm{sim}_1.W^{-1}(q^{(1)}), \ldots, \mathrm{sim}_{i-1}.W^{-1}(q^{(i-1)}), E(\sigma)\big) \mapsto \mathrm{sim}_i.E\big(\phi^{(i)}(q^{(1)}, \ldots, q^{(i-1)}, \sigma)\big),$$
where the inputs are stored in the respective channels $i' < i$ and the $\Sigma$ channel, and the output is written to channel $i$. Here, the number of input dimensions is i