# Looped Transformers as Programmable Computers

Angeliki Giannou*¹, Shashank Rajput*¹, Jy-yong Sohn², Kangwook Lee¹, Jason D. Lee³, Dimitris Papailiopoulos¹

*Equal contribution. ¹University of Wisconsin-Madison, ²Yonsei University, ³Princeton University. Correspondence to: Angeliki Giannou, Shashank Rajput. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). The code is available at https://github.com/jysohn1108/Looped_TF.

We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data reads/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including lexicographic operations, non-linear functions, function calls, program counters, and conditional branches. Using this framework, we emulate a computer using a simple instruction-set architecture, which allows us to map iterative algorithms to programs that can be executed by a constant-depth looped transformer network. We show how a single frozen transformer, instructed by its input, can emulate a basic calculator, a basic linear algebra library, and even a full backpropagation, in-context learning algorithm. Our findings reveal the potential of transformer networks as programmable compute units and offer insight into the mechanics of attention.

## 1 Introduction

Transformers (TFs) have become a popular choice for machine learning tasks, achieving state-of-the-art results in Natural Language Processing (NLP) and Computer Vision (CV) (Vaswani et al., 2017; Khan et al., 2022; Yuan et al., 2021; Dosovitskiy et al., 2020). They use attention to capture higher-order relationships and long-range dependencies, making them effective in tasks such as machine translation and language modeling (Vaswani et al., 2017; Kenton & Toutanova, 2019). Large language models, such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), excel in various NLP tasks and perform in-context learning based on brief prompts and examples. LLMs can also perform algorithmic tasks and reasoning through in-context learning (ICL), as shown in several works, such as (Nye et al., 2021; Wei et al., 2022c; Lewkowycz et al., 2022; Wei et al., 2022b; Dasgupta et al., 2022; Chung et al., 2022). For example, Zhou et al. (2022) showed that LLMs can perform addition on unseen examples when prompted with a multidigit addition algorithm and some examples. These results suggest that LLMs can apply algorithmic principles and perform pre-instructed commands on a given input, as if interpreting natural language as code.

Constructive arguments have demonstrated that Transformers can simulate Turing Machines with enough depth or recursive links between attention layers (Pérez et al., 2021; Pérez et al., 2019; Wei et al., 2022a). This demonstrates the potential of transformer networks to precisely follow algorithmic instructions specified by the input. Yet, these constructions do not provide insight into how to create Transformers that can carry out particular algorithmic tasks, or compile programs written in a higher-level programming language.
Recently, various methods have been developed to select the weights of a Transformer model so that it functions as a learning algorithm on the fly, performing implicit training at inference time when given training data as input (Akyürek et al., 2022; von Oswald et al., 2022). These methods typically require a number of layers proportional to the number of iterations of the learning algorithm and are limited to a small set of loss functions and models.

The ability to program transformer models to emulate the abstract computation of a Turing Machine and the specific algorithms of in-context learning highlights the potential of transformer networks as versatile programmable computers. Our research aims to explore this promising prospect, uncovering how the mechanics of attention can enable the emulation of a general-purpose computer inspired by instruction-set architectures.

**Our Contributions:** In this paper, we show that transformer networks can emulate complex algorithms and programs by programming them with specific weights and placing them in a loop. We accomplish this by reverse engineering attention to emulate basic computing blocks, such as lexicographic operations, nonlinear functions, function calls, program counters and conditional branches. We also demonstrate the importance of using a single loop or recursion to connect the transformer's output sequence back to its input, avoiding the need for a deep model.

We design a transformer that can execute programs written in a generalized version of a single instruction, known as SUBLEQ(A, B, C), which defines a one-instruction set computer (OISC); the instruction takes 3 memory address operands. When executed, it subtracts the value at memory address A from the value at memory address B and stores the result in B. If the result in B is less than or equal to zero, the execution jumps to address C; otherwise it proceeds to the next instruction. Programs written in the SUBLEQ language use only this command, yet this single instruction is capable of defining a universal computer (Mavaddat & Parhami, 1988). We construct transformers that can run programs like SUBLEQ using a more flexible instruction called FLEQ, of the form

mem[c] = f_m(mem[a], mem[b]); if mem[flag] ≤ 0 goto instruction p,

where f_m can be selected from a set of functions (matrix multiplication, non-linear functions, polynomials, etc.) that we can hardcode into the network. The depth of the transformer needed to run these programs is not affected by the program's complexity, but by the depth required for a single FLEQ instruction, which is typically constant.

We use this framework to emulate a calculator, linear algebra functions and an in-context learning algorithm. The input sequence acts as a program for the transformer to execute, while also providing space to store and process variables. The transformer networks used to execute these programs have a depth of 13 or less. Our approach alleviates limitations of the constructions by Pérez et al. (2019) and Wei et al. (2022a) (infinite precision, or depth logarithmic in the program complexity), while improving depth (from proportional to the number of implicit GD iterations to constant) and generalizing (from linear to arbitrary models) the in-context learning constructions in Akyürek et al. (2022). Our study shows that attention mechanisms can be used to emulate complex iterative algorithms and execute general programs with even a single loop.
We hope that this inspires more research on the capabilities of attention and the use of smaller transformer networks to distill tasks for larger models and enhance language model capabilities.

## 2 Prior Work

Our work is inspired by recent results on the expressive power of Transformer networks and their in-context learning capabilities. The authors of Pérez et al. (2021), Pérez et al. (2019), and Wei et al. (2022a) have shown that Transformers are Turing complete, meaning they can simulate a Turing machine. These constructions typically require high/infinite precision (apart from that of Wei et al. (2022a)) and recursion around attention layers. Additionally, Yun et al. (2019) prove that with sufficient width/depth, Transformers can act as universal sequence-to-sequence approximators.

In (Weiss et al., 2021), the authors propose a computational model for the transformer-encoder in the form of a domain-specific language called the Restricted Access Sequence Processing Language (RASP). The model maps the basic components of a TF encoder into simple primitives. Examples of tasks that could be learned by a Transformer are provided, and the maximum number of heads and layers necessary to encode a task in a transformer is analyzed. In a recent and related work, Lindner et al. (2023) suggest using transformer networks as programmable units and introduce a compiler called Tracr which utilizes RASP. However, the expressivity limitations and unclear Turing completeness of the language are discussed in (Weiss et al., 2021; Merrill et al., 2022; Lindner et al., 2023). Our approach, in contrast, demonstrates the potential of transformer networks to serve as universal computers, enabling the implementation of arbitrary nonlinear functions and emulating iterative, non-linear algorithms. Furthermore, our framework allows the depth of our transformers to not scale in proportion to the lines of code that they execute, allowing the implementation of iterative algorithms and expanding the potential applications.

In (Garg et al., 2022) the authors demonstrate that standard Transformers (e.g., GPT-2) can be trained from scratch to perform in-context learning of linear functions and more complex model classes, such as two-layer neural networks, with performance that matches or exceeds task-specific learning algorithms. A useful element of their analysis is the fact that language is completely removed from the picture, and all operations are performed at the level of vector embeddings. This allows a higher abstraction level than using language as an input, and is in fact what also allows us to obtain our derivations. Motivated by the above experimental work, in (Akyürek et al., 2022), the authors investigate the hypothesis that TF-based in-context learners emulate standard learning algorithms implicitly at inference time. The authors provide evidence for this hypothesis by constructing transformers that implement SGD for linear models, showing that trained in-context learners closely match the predictors computed by these algorithms.

In a similar vein, von Oswald et al. (2022) argue that training Transformers on auto-regressive tasks is closely related to gradient-based meta-learning formulations. The authors also provide a hard-coded weight construction showing the equivalence between data transformations induced by a single linear self-attention layer and gradient descent on a regression loss.
The authors empirically show that when training linear attention TFs on simple regression tasks, the models learned by GD and Transformers have intriguing similarities. In (Liu et al., 2022), the authors test the hypothesis that TFs can perform algorithmic reasoning using fewer layers than the number of reasoning steps, in the context of finite automata. The authors characterize shortcut solutions that allow shallow Transformer models to exactly replicate the computation of an automaton on an input sequence, and show that these solutions can be learned through standard training methods. As is expected, this hypothesis is only true for a certain family of automata, as the general existence of shortcut solutions would imply the collapse of complexity classes that are widely believed not to be identical.

Other experimental studies have utilized recursion in transformer architectures in a similar manner to our constructions, although in our case we only utilize a single recursive link that feeds the output of the transformer back as an input (Hutchins et al., 2022; Shen et al., 2022; Dehghani et al., 2018).

## 3 Preliminaries

**The transformer architecture.** Our work follows a similar problem setting as previous studies (e.g., Yun et al. (2019); Garg et al. (2022); Akyürek et al. (2022); von Oswald et al. (2022)) in which the input sequence consists of d-dimensional embedding vectors rather than tokens. This simplifies our results without sacrificing generality, as an embedding layer can map tokens to the desired vector constructions. The input to each layer, $X \in \mathbb{R}^{d \times n}$, is a vector representation of a sequence of $n$ tokens, where each token is a $d$-dimensional column. In this paper, the terms token and column may be used interchangeably. A transformer layer outputs $f(X)$, where $f$ is defined as

$$\mathrm{Attn}(X) = X + \sum_{i=1}^{h} V^i X\, \sigma_S\big((K^i X)^\top (Q^i X)\big) \qquad (1a)$$

$$f(X) = \mathrm{Attn}(X) + W_2\, \sigma\big(W_1\, \mathrm{Attn}(X) + b_1 \mathbf{1}_n^\top\big) + b_2 \mathbf{1}_n^\top \qquad (1b)$$

where $\sigma_S$ is the softmax function applied to the columns of the input matrix, i.e., $[\sigma_S(X;\lambda)]_{i,j} = \frac{e^{\lambda X_{i,j}}}{\sum_{k=1}^{n} e^{\lambda X_{k,j}}}$, with $\lambda \geq 0$ the temperature parameter, $\sigma(x) = x \cdot 1_{x>0}$ is the ReLU activation, and $\mathbf{1}_n$ is the all-ones vector of length $n$. We refer to the $K$, $Q$, and $V$ matrices as the key, query, and value matrices respectively (typically the weight matrices are denoted $W_Q$, $W_K$, $W_V$, but to keep notation cleaner we use $Q$, $K$, $V$ instead); the superscript $i$ on the weight matrices indicates those corresponding to the $i$-th attention head, and $h$ is the number of heads. Equation (1a) represents the attention layer, while its combination with the ReLU layers forms a single transformer layer.

**Iterative computation through a loop.** In the following sections, we utilize TF networks with multiple transformer layers. Let us refer to the output of such a multi-layer TF as TF(W; X), where for simplicity W is the collection of all weight matrices required to define such a multi-layer TF. We use our constructions recursively, and feed the output back as an input sequence, allowing the network to perform iterative computation through a simple fixed-point-like iteration.

Algorithm 1 Looped Transformer
1: for i = 1 : T do
2:   X ← TF(W; X)
3: end for

This recursive transformer is similar to past work on adding recursion to TF networks. We refer to these simple recursive TFs as Looped Transformers. Our model resembles traditional computer processing, with the input sequence X including instructions and memory, and the transformer network executing one instruction at a time, performing complex computations as a self-contained unit with loops analogous to CPU cycles. Despite the analogy between TFs and CPUs being entertaining, there are many differences in implementation. The results obtained from using TFs as computational units do not require the analogy to be valid.
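To make Equations (1a)-(1b) and Algorithm 1 concrete, here is a minimal NumPy sketch of one transformer layer and the outer loop. The temperature handling, weight shapes, and helper names are illustrative choices of ours, not the paper's exact construction.

```python
import numpy as np

def softmax_cols(Z, lam=1.0):
    # Column-wise softmax with temperature lam, as in Eq. (1a).
    Z = lam * Z
    Z = Z - Z.max(axis=0, keepdims=True)          # for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def relu(x):
    return np.maximum(x, 0.0)

def transformer_layer(X, heads, W1, b1, W2, b2, lam=1.0):
    """One layer of Eq. (1a)-(1b). X has shape (d, n): one column per token.
    `heads` is a list of (K, Q, V) weight matrices, one triple per head."""
    ones = np.ones((1, X.shape[1]))
    A = X.copy()
    for K, Q, V in heads:                         # attention, summed over heads
        A = A + V @ X @ softmax_cols((K @ X).T @ (Q @ X), lam)
    return A + W2 @ relu(W1 @ A + b1[:, None] @ ones) + b2[:, None] @ ones

def looped_transformer(X, layers, T, lam=1.0):
    # Algorithm 1: feed the output back as the input, T times.
    for _ in range(T):
        for heads, W1, b1, W2, b2 in layers:
            X = transformer_layer(X, heads, W1, b1, W2, b2, lam=lam)
    return X
```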
**Input sequence format.** The input to our transformer network has the following abstract form:

$$X = \begin{bmatrix} S & M & C\\ p_1 \cdots p_s & p_{s+1} \cdots p_{s+m} & p_{s+m+1} \cdots p_n \end{bmatrix},$$

where $S$ represents the portion of the input that serves as a scratchpad, $M$ represents the portion that acts as memory that can be read from and written to, and $C$ represents the portion that contains the commands provided by the user. The $p_1, \ldots, p_n$ are positional encodings for the $n$ columns, which will be described in more detail in the following paragraph, and will be used as pointers to data and instructions. The structure of our input sequence bears similarities to that of (Wei et al., 2022a; Akyürek et al., 2022), which also use scratchspace and have a separate part for the input data.

**Scratchpad.** The scratchpad, a central component in our constructions, can be considered analogous to a CPU's cache memory. Serving as a temporary workspace, it enables recording the intermediate results of various operations, from simple arithmetic to complex tasks like matrix inversion, with data transferred between memory and scratchpad as needed for computation.

**Memory.** All the compute boxes we create require memory for specific actions, storing data in various forms such as scalars, vectors, and matrices. As a central repository, memory communicates with the scratchpad, enabling data access and manipulation as needed for operations.

**Commands.** Our framework incorporates commands within a transformer network, resembling a low-level programming language. These commands include memory location indicators and operation directives, enabling transformers to execute complex computations and tasks consecutively and systematically.

## 4 Building Transformer Blocks towards General Computation

To build general compute boxes using transformer networks, specialized compute blocks are required. These blocks will be assembled to create the desired end functionality. In this section, we highlight various operations that transformer layers can perform. These operations will serve as building blocks to create more complex routines and algorithms.

### 4.1 Positional Encodings, Program Counter, and Data Pointers

To aid the transformer in locating the position of each token, each column of $X$ is appended with a positional encoding that is based on the column index. In this case, similar to (Wei et al., 2022a), the positional encoding is the binary representation of the column index, which keeps the encoding dimension low, i.e., logarithmic in the sequence length. This approach is slightly different from the typical method of adding positional encodings to the embeddings of the input sequence; appending them as suffixes to the embeddings allows for cleaner arguments and constructions. In particular, the encoding for the token/column indexed by $i$ is a $\log(n)$-dimensional $\pm 1$ binary vector $p_i \in \{\pm 1\}^{\log(n)}$, where $n$ is the length of the input sequence. Using the standard binary representation of an integer $i$, meaning $i = \sum_{k=0}^{\log(n)-1} 2^k b_k$, the positional encoding vector $p_i$ is set to $-1$ at index $j$ if the binary representation of $i$ has a 0 at the $j$-th index, i.e., $b_j = 0$, and to $+1$ otherwise. As a result, we have $p_i^\top p_i = \log(n)$ and, by the Cauchy-Schwarz inequality, $p_i^\top p_j < \|p_i\|\,\|p_j\| = \sqrt{\log(n)}\cdot\sqrt{\log(n)} = \log(n)$ whenever $i \neq j$, since $p_i$ and $p_j$ differ in at least one coordinate.
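The following short snippet builds these ±1 encodings and checks the inner-product property numerically; the helper name and the choice of n are ours.

```python
import numpy as np

def pos_encoding(i, num_bits):
    # ±1 binary positional encoding of column index i: bit 0 -> -1, bit 1 -> +1.
    return np.array([1.0 if (i >> k) & 1 else -1.0 for k in range(num_bits)])

n = 16
num_bits = int(np.log2(n))
P = np.stack([pos_encoding(i, num_bits) for i in range(n)], axis=1)

G = P.T @ P                                           # Gram matrix of the encodings
assert np.all(np.diag(G) == num_bits)                 # p_i^T p_i = log(n)
assert np.all(G[~np.eye(n, dtype=bool)] < num_bits)   # strictly smaller off-diagonal
```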
In the applications presented, the transformer often needs to execute iterative algorithms or go through a sequence of commands. To achieve this, we utilize a program counter that iterates through the commands. The counter contains the encoding of the location where the next command is stored. Additionally, a command may have data pointers that point to the location of the data the command needs to read from and write to. Both the program counter and the data pointers use the same positional encodings as discussed in the previous paragraph.

Using binary vectors as positional encodings allows us to easily increment the program counter by 1 (or any other amount) using the feedforward ReLU layers in the transformer architecture (1). This is formalized in the following lemma; for the proof see Lemma 9.

Lemma 1. Given two d-dimensional binary vectors representing two non-negative integers, there exists a 1-hidden-layer feedforward network with ReLU activation, containing 8d activations in the hidden layer and d neurons in the output layer, that can output the binary vector representation of their sum, as long as the sum is less than $2^{d+1}$.

Furthermore, this technique for pointing to specific data locations enables the transformer to effectively read and write data during the execution of the algorithm or sequence of commands that it is built to implement.

### 4.2 read/write: Copying Data/Instructions to/from the Scratchpad

Figure 1. A sketch of the read operation. Arrows show command blocks being copied from the part of the input that is allocated to commands to the scratchpad. Typically an instruction is another set of pointers. Positional encodings and counters are used for tracking what is copied where.

As previously stated, the scratchpad serves as a temporary memory for storing all information needed for computation. This includes copying commands and data to it, performing computation, and writing results back to memory. This process has similarities with the copy mechanism developed in (Akyürek et al., 2022). The following lemma states that the command or data pointed to by the program counter, or by a data pointer in the current command, can be copied to the scratchpad. The location of the program counter is conventionally placed right below the contents of the scratchpad, but it can be changed arbitrarily. Keeping it in a specific location throughout the entire computation helps retain a good organization of the construction.

Lemma 2 (read). A transformer with one layer, one head, and width of O(log n + d), where d is the dimension of the data vectors and n is the length of the input, can read data/command vectors from the input to the scratchpad from the location pointed to by the position embedding vector in the scratchpad. This operation incurs an error which can be driven arbitrarily close to 0 by increasing the temperature of the softmax operation.
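The mechanism behind Lemma 2 can be previewed numerically: attention scores formed from the ±1 positional encodings concentrate, as the softmax temperature grows, on the column whose encoding matches the pointer held in the scratchpad, so a value projection copies that column. This is only a sketch of the idea; the data layout, temperature, and tolerance below are our own choices rather than the appendix construction.

```python
import numpy as np

def softmax(z, lam):
    z = lam * z
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def pos_encoding(i, num_bits):
    return np.array([1.0 if (i >> k) & 1 else -1.0 for k in range(num_bits)])

n, num_bits, d = 8, 3, 4
P = np.stack([pos_encoding(i, num_bits) for i in range(n)], axis=1)   # encodings
data = np.arange(d * n, dtype=float).reshape(d, n)                    # column j holds data vector j
target = 5                                                            # pointer stored in the scratchpad

scores = P.T @ P[:, target]          # one score per column; maximal at `target`
attn = softmax(scores, lam=50.0)     # high temperature -> nearly one-hot weights
copied = data @ attn                 # convex combination ~= the pointed-to column

assert np.allclose(copied, data[:, target], atol=1e-6)
```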
The next lemma explains that a vector stored in the scratchpad can be copied to a designated location in memory, as specified within the scratchpad itself. This allows for the transfer of data from the scratchpad to a specific location in memory for further use or storage.

Figure 2. A sketch of the write operation. Arrows show data blocks being copied from the scratchpad to a designated location in the part of the input allocated for memory. Positional encodings are used for tracking the destination location and ensuring data is written at the correct memory location.

Lemma 3 (write). A transformer network with a single layer, one head, and width O(log n + d), where d is the dimension of the data vectors and n is the length of the input, can effectively write a data vector stored in the scratchpad to a specific location in the input, as designated by a positional encoding vector in the scratchpad. This operation incurs an error which can be driven arbitrarily close to 0 by increasing the temperature of the softmax operation.

### 4.3 if condition then goto instruction

In this subsection, we state the main ideas used to implement a conditional branching instruction that evaluates a condition and sets the program counter to a specified location if the condition is true, or increments the program counter by 1 if the condition is false. The form of the command is as follows: if mem[a] ≤ 0, then goto i, where mem[a] is the value of some location in the memory part of the input sequence. This command has two parts: evaluating the inequality and modifying the program counter accordingly. The first thing we do is read from mem[a], as described in the previous subsection. We then use one ReLU layer to create the flag, i.e., to evaluate the condition. This is implemented both for the case that mem[a] contains an integer and for the case that it contains its binary representation, as in Section 5.1. Let the current program counter be $p_{PC}$, which points to a given command. Thus, if the flag is 1, we want the program counter to jump and become $p_i$; else, if the flag is 0, the program counter is incremented by one and set to $p_{PC+1}$. This can be implemented with one ReLU layer, which selects one of the two vectors based on the value of the flag. The details of this construction are given in Appendix B.3.

## 5 Emulating a Single Instruction Computer

### 5.1 A SUBLEQ Transformer

Figure 3. Graphical representation of the building blocks necessary to implement the OISC instruction. The first two blocks transfer the data/command to the scratchpad, the second and third implement the subtraction and store the result, while the last one implements the if-goto command that completes the instruction.

Mavaddat & Parhami (1988) showed that there exists an instruction such that any computer program can be translated to a program consisting of instantiations of this single instruction. A variant of such an instruction is SUBLEQ, in which different registers, or memory locations, are accessed. The way that SUBLEQ works is simple. It accesses two registers in memory, takes the difference of their contents and stores it back into one of the registers; then, if the result is negative, it jumps to a different predefined line of code, or otherwise continues to the next instruction from the current line of code. (This version of the SUBLEQ instruction is a slightly restricted version of the original instruction; here we separate the memory/registers from the instructions. We show that this restriction does not make our version computationally less powerful by proving in Appendix C that our version is also Turing complete.) A computer that is built to execute SUBLEQ programs is called a One-Instruction Set Computer, and is a universal computer, i.e., it is Turing complete, if given access to infinite memory.

Algorithm 2 SUBLEQ(a, b, c)
1: mem[b] = mem[b] - mem[a]
2: if mem[b] ≤ 0 then
3:   goto instruction c
4: else
5:   goto next instruction
6: end if
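Before the transformer construction (Lemma 4 below), it may help to see the semantics of Algorithm 2 that must be reproduced; here is a minimal SUBLEQ interpreter in plain Python. The program/memory encoding is an illustrative choice, not the transformer's input format.

```python
def run_subleq(program, mem, max_steps=10_000):
    """program: list of (a, b, c) triples; mem: list of integers.
    Runs the semantics of Algorithm 2 until the program counter leaves
    the program (or a step budget is exhausted)."""
    pc = 0
    for _ in range(max_steps):
        if pc < 0 or pc >= len(program):
            break
        a, b, c = program[pc]
        mem[b] -= mem[a]                     # mem[b] = mem[b] - mem[a]
        pc = c if mem[b] <= 0 else pc + 1    # branch iff the result is <= 0
    return mem

# Example: count mem[1] down to zero. mem[0] holds the constant 1 and
# mem[2] holds 0, which gives an unconditional jump back to instruction 0.
mem = [1, 3, 0]
prog = [(0, 1, 2),       # mem[1] -= 1; if mem[1] <= 0, jump past the end (halt)
        (2, 2, 0)]       # mem[2] -= 0 -> always <= 0 -> jump back to instruction 0
print(run_subleq(prog, mem))   # -> [1, 0, 0]
```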
The transformer keeps track of the lines of code, the memory locations, and a program counter, using the memory part of the input as memory registers and the command part as lines of code/instructions. The scratchpad is used to record the additions and pointers involved in each instruction, and the read, write, and conditional branch operations are utilized.

Lemma 4. There exists a looped transformer architecture that can run SUBLEQ programs. This architecture has ten layers, two heads, and a width of $O(\log(n) + \log(N))$, where $n$ is the length of the input sequence, which is proportional to the length of the program and memory used by the emulated OISC, and $N$ is the number of bits we use to store each integer. The integers are considered to be in the range $[-2^{N-1}+1,\, 2^{N-1}-1]$.

**The importance of loops.** The use of a loop outside the transformer is crucial, as it allows the computer to keep track of the program counter and execute the instructions in the correct order. Without this loop, the size of the transformer would have to scale with the number of lines of code, making the implementation impractical. Note that the overall complexity of running a SUBLEQ program is still going to scale with the number of lines of code, which is to be expected given standard complexity-theoretic assumptions on the circuit depth of functions. Note, however, that the depth of the looped transformer itself does not scale with the size of the program.

**OISC as a basis for a more flexible attention-based computer.** The following construction describes an implementation of a fully functioning one-instruction set computer (OISC) using a transformer architecture. The memory stores integers and the instructions are executed in a sequential manner. The key to this construction is the reverse engineering of the attention mechanism to perform read/write operations, taking full advantage of each piece of the transformer architecture, including the feedforward layers. This implementation serves as the foundation for a more general attention-based computer presented in the next subsection, where the subtraction of two contents of memory can be replaced with a general function, allowing for the implementation of arbitrary iterative algorithms.

Proof of Lemma 4. Looking at Alg. 2, note that each instruction can be specified by just 3 indices, $a$, $b$, and $c$. Since we use the binary representation of indices to form positional encodings and pointers, each of these indices can be represented by a $\log n$-dimensional vector. We represent each instruction by simply concatenating these embedding vectors to form a $3\log n$-dimensional vector, $c_j = [p_a^\top\; p_b^\top\; p_c^\top]^\top$. The input then takes the following form:

$$X = \begin{bmatrix}
0 & 0 & 0 & c_{s+m+1} & c_{s+m+2} & \cdots & c_{n-1} & c_{\mathrm{EOF}}\\
0 & 0 & M & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0\\
p_{PC} & 0 & 0 & 0 & 0 & \cdots & 0 & 0\\
0 & p_{2:s} & p_{s+1:s+m} & p_{s+m+1} & p_{s+m+2} & \cdots & p_{n-1} & p_n\\
1 & \mathbf{1}_{2:s} & 0_{s+1:s+m} & 0_{s+m+1} & 0_{s+m+2} & \cdots & 0_{n-1} & 0_n
\end{bmatrix}$$

where the column blocks are, from left to right, the scratchpad, the block of memory, and the instructions (ending with $c_{\mathrm{EOF}}$), and the bottom three row blocks contain the program counter, the positional encodings, and the indicator of the scratchpad. Here $c_i \in \mathbb{R}^{3\log(n)}$, $M \in \mathbb{R}^{N \times m}$ and $X \in \mathbb{R}^{(8\log(n)+3N+1)\times n}$.
The first $s$ columns constitute the scratchpad, the next $m$ constitute the memory section, and the last $n - m - s$ columns contain the instructions. The program counter $p_{PC}$ points to the next instruction that is to be executed, and hence it is initialized to the first instruction as $p_{PC} := p_{s+m+1}$. The contents of the memory section are $N$-dimensional $\pm 1$ binary vectors which represent the corresponding integers. We follow the 2's complement convention to represent the integers.

**Step 1 - Read the instruction $c_{PC}$.** The first thing to do is to read and copy the instruction pointed to by $p_{PC}$ into the scratchpad. The current instruction is located at column index $PC$, and is pointed to by the current program counter $p_{PC}$. The instruction $c_{PC}$ consists of three pointers, each of length $\log n$. In particular, we copy the elements at location $(1 : 3\log(n),\, PC)$ to location $(3\log(n) + 4 : 6\log(n) + 3,\, 1)$. This can be done using the read operation described in Section 4.2. Hence, after this operation, the input looks as follows:

$$\begin{bmatrix}
0 & 0 & 0 & c_1 & c_2 & \cdots & c_{n-m-s} & c_{\mathrm{EOF}}\\
0 & 0 & M & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0\\
c_{PC} & 0 & 0 & 0 & 0 & \cdots & 0 & 0\\
p_{PC} & 0 & 0 & 0 & 0 & \cdots & 0 & 0\\
0 & p_{2:s} & p_{s+1:s+m} & p_{s+m+1} & p_{s+m+2} & \cdots & p_{n-1} & p_n\\
1 & \mathbf{1}_{2:s} & 0_{s+1:s+m} & 0_{s+m+1} & 0_{s+m+2} & \cdots & 0_{n-1} & 0_n
\end{bmatrix}$$

where $c_{PC}$ contains the three positional encodings. This step can be done in one layer.

**Step 2 - Read the data required by the instruction.** We need to read the data contained in columns $a$ and $b$. To do so, we again use the read operation, now on the pointers $p_a$, $p_b$. Note that we need two heads for this operation, one for reading $a$ and one for reading $b$. The resulting output sequence places mem[a], mem[b] just above the positional encoding copied in the previous step. This can be done in one layer.

**Step 3 - Perform the subtraction.** Let $x$ denote a column of the input $X$, and let it have the structure $x = [\,\cdots\; b_r^\top\; b_s^\top\; \cdots\,]^\top$, where each entry represents the corresponding column element of the updated input above. Thus, $b_r = \mathrm{mem}[a]$, $b_s = \mathrm{mem}[b]$ for the first column, and $b_r = b_s = 0$ otherwise. Hence, to compute $b_{s-r}$, we first need to compute the binary representation of $-r$, which is $b_{-r}$, and then simply add it to $b_s$. To compute $b_{-r}$, which is the 2's complement of $b_r$, we just need to flip the bits of $b_r$ and add 1. Flipping a $\pm 1$ bit can be done with a neuron simply as $b_{\mathrm{flipped}} = 2\sigma(-b) - 1$. For adding 1, we can use Lemma 9, which requires one ReLU layer of width $O(N)$, and so we need 2 transformer layers to perform this (here we make the intermediate attention layers the identity mapping by setting their value matrices to 0). Finally, we need one more ReLU layer to add $b_s$ to $b_{-r}$, bringing the total to 3 transformer layers. This results in calculating $\mathrm{mem}[b] - \mathrm{mem}[a]$ and placing it above $p_a$.

**Step 4 - Write the result back to memory.** Writing $\mathrm{mem}[b] - \mathrm{mem}[a]$ back to location $b$ can be done using the pointer $p_b$ and the set of embeddings, applying the write operation described in Section 4.2. This operation requires one layer.

**Step 5 - Conditional branching.** We use the lemmas from the appendix to create the flag, which is 1 if $\mathrm{mem}[b] - \mathrm{mem}[a] \leq 0$ and 0 otherwise. This can be done using Equation (1b) of the transformer. Thus, the first column of the input matrix is now updated to be

$$[\,0\;\; 0\;\; 0\;\; \mathrm{flag}\;\; p_a\;\; p_b\;\; p_c\;\; p_{PC}\;\; 0\;\; 1\,] \qquad (4)$$

where we represent it as a row vector to save space. This operation requires one layer.
Next we use the construction described in Section 4.3 to choose, depending on the value of the flag, whether we want to increment the current program counter or jump to the command $c$. Similar to implementing goto, this step needs 2 transformer layers.

**Step 6 - Error correction.** Note that in some of the steps above we incur some error while reading and writing, due to the fact that we use softmax instead of hardmax. This error can be made arbitrarily small by increasing the temperature of the softmax. In this step, we push the error down to zero. Note that all the elements of $X$ can only be one of $\{-1, 0, 1\}$, with some additive error from reads and writes as explained before. Assume that the temperature is set high enough that the error is at most $\epsilon < 0.5$. Then, a noisy bit $b$ can be fixed using the following ReLU:

$$b_{\mathrm{noiseless}} = \frac{1}{1-2\epsilon}\big(\sigma(b + 1 - \epsilon) - \sigma(b + \epsilon)\big) + \frac{1}{1-2\epsilon}\big(\sigma(b - \epsilon) - \sigma(b - 1 + \epsilon)\big) - 1.$$

This operation can be done with one transformer layer.

**Step 7 - Program termination.** The special command $c_{\mathrm{EOF}}$ is used to signal the end of a program to the transformer. This command is made up of three encodings: $p_{s+1}$, $p_{s+2}$, and $p_n$. The first encoding, $p_{s+1}$, points to the first entry in the memory, which we hard-code to contain the value 0. The second encoding, $p_{s+2}$, points to the second entry in the memory, which is hard-coded to contain the value $-1$. The third encoding, $p_n$, points to itself, signaling the end of the program and preventing further execution of commands. Hence, on executing this command, the next-command pointer is set to point to this command again. This ensures that the transformer maintains the final state of the input. For this, the last instruction in each program is $c_{\mathrm{EOF}}$, and $\mathrm{mem}[s+1] = 0$, $\mathrm{mem}[s+2] = -1$. Then $a = s+1$, $b = s+2$, and $c = n$, and the memory is updated with the value $\mathrm{mem}[b] = \mathrm{mem}[b] - \mathrm{mem}[a]$. Since $\mathrm{mem}[a] = 0$, the memory remains unchanged, and since $\mathrm{mem}[b] \leq 0$, the branch is always taken and thus the pointer for the next instruction is again $c_{\mathrm{EOF}}$.

### 5.2 FLEQ: A More Flexible Attention-based Computer

In this section, we introduce FLEQ, a generalization of SUBLEQ that defines a more flexible reduced-instruction-set computer, whose single instruction supports not just the addition of registers but any function from a set of M predefined functions implementable by a transformer network. In the following, we use the term FLEQ to refer interchangeably to the instruction, the language, and the attention-based computer it defines.

The design of FLEQ allows for the easier implementation of complex algorithms using functions beyond simple subtraction, such as matrix multiplication, computation of square roots, activation functions, etc., directly as built-in primitives of the architecture. This not only increases the flexibility of the system, but also makes it possible to implement nonlinear computations, linear algebra calculations, and iterative optimization algorithms for in-context learning, while containing the length of the corresponding programs.

Definition 1. Let $T_i$ be a transformer network of the form (1) with $l_i$ layers, $h_i$ heads and dimensionality $r$.
We call this a transformer-based function block if it implements a function $f(A, B)$ where the input and output sequence format is assumed to be the following: $A \in \mathbb{R}^{d_h \times d_w}$ is assumed to be provided in the first set of $d$ columns (columns 1 to $d$) and $B \in \mathbb{R}^{d_h \times d_w}$ in the second set of $d$ columns (columns $d+1$ to $2d$); after passing the input through the $l_i$ layers, the output $f(A, B) \in \mathbb{R}^{d_h \times d_w}$ is stored in the third set of $d$ columns (columns $2d+1$ to $3d$), where $d$ is the maximum size that the input could have, and it is a constant that we determine. Note that $d_h, d_w \leq d$. Finally, the sequence length of the block is $s \geq 3d$; similarly to $d$, $s$ is a predetermined constant.

The parameters $A$, $B$ can be scalars, vectors or matrices, as long as they can fit within a $d \times d$ matrix. Hence, the above definition is minimally restrictive, with the only main constraint being the input and output locations. More details about the input and output requirements are explained in Appendix D.

Theorem 1. Given $M$ different transformer-based function blocks $T_1, \ldots, T_M$, there exists a transformer $T$ of the form (1) with $9 + \max\{l_1, \ldots, l_M\}$ layers, $\sum_{i=1}^{M} h_i$ heads, and dimensionality $O(Md + \log n)$ such that running it recurrently $T$ times can run $T$ instructions of any program in which each instruction is FLEQ(a, b, c, m, flag, p, d_h, d_w), executing the following:

mem[c] = f_m(mem[a], mem[b]); if mem[flag] ≤ 0 goto instruction p.

Here $n$ is the total length of the program, and we assume that mem[flag] is an integer. The parameters $d_h$, $d_w$ are explained in Remark 1 below. The execution of this operation incurs an error which can be driven arbitrarily close to 0 by increasing the temperature of the softmax operation.

Remark 1. Note that the transformer $T$ contains $M$ transformer-based function blocks, and each one may use different input parameters. We thus denote by $d$ the maximum length that each of the parameters $A$, $B$, $C$ (stored in locations $a$, $b$, $c$) as in Definition 1 can have; this is a global constant and it is fixed for all the different instances that we can create. Now, $d_h$, $d_w$ refer to the maximum dimension that the parameters can have in a specific instance of the transformer $T$; the remaining $d - d_w$ columns and $d - d_h$ rows are set to zero.

The proof of this theorem can be found in Appendix F. Below we explain some of our design choices.

**Execution cycle of the unified attention-based computer.** In each iteration of the looped transformer, one instruction is fetched from the set of instructions in the input according to the program counter. The instruction is then copied to the scratchpad. Depending on the function to be implemented, a different function-block location is used to locally record the results of that function. Once the result is calculated, it is copied back to a specified memory location provided by the instruction. The execution cycle is similar to the one-instruction set computer (OISC) of the previous section, with the main difference being that for each instruction we can choose from a pre-selected list of functions that take inputs in the form of arbitrary arrays of numbers, such as matrices, vectors, and scalars. For a more detailed overview of FLEQ and how it interacts with the transformer-based function blocks, we refer the reader to Appendix D.

Figure 4. The structure of input X, to execute FLEQ commands.
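A minimal sketch of the execution cycle that Theorem 1 emulates, with ordinary Python callables standing in for the hard-coded transformer-based function blocks; the instruction encoding and names here are ours, and none of the attention machinery is represented.

```python
import numpy as np

def run_fleq(program, mem, funcs, max_steps=1_000):
    """program: list of FLEQ instructions (a, b, c, m, flag, p);
    mem: dict from location name to value; funcs: list of callables f_m.
    Mirrors the fetch / apply-f_m / write-back / branch cycle described above."""
    pc = 0
    for _ in range(max_steps):
        if pc < 0 or pc >= len(program):
            break
        a, b, c, m, flag, p = program[pc]
        mem[c] = funcs[m](mem[a], mem[b])        # mem[c] = f_m(mem[a], mem[b])
        pc = p if mem[flag] <= 0 else pc + 1     # if mem[flag] <= 0 goto p
    return mem

# Tiny example: a single matrix multiplication, then halt
# (the flag location holds 0, so the branch to an out-of-range index ends the run).
funcs = [lambda A, B: A @ B]
mem = {"A": np.eye(2), "B": np.array([[1.0, 2.0], [3.0, 4.0]]), "C": None, "zero": 0}
prog = [("A", "B", "C", 0, "zero", 99)]
run_fleq(prog, mem, funcs)
print(mem["C"])        # [[1. 2.] [3. 4.]]
```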
**Computational concerns: do we need full attention?** In our constructions, we can reduce the computational complexity of the attention mechanism by limiting the number of embedding vectors that each part of the input has to attend to. In our specific construction, only the columns within the scratchpad require global attention. By focusing only on these columns, we can reduce the computational complexity of the attention mechanism from $O(n^2 d)$ to $O(nd)$, where $n$ is the length of the input sequence and $d$ is the dimension of the embedding vectors.

## 6 Applications

Our unified template allows us to implement algorithms and iterative operations as programs. Calculations like multiplication, division, square root, etc., as well as linear algebra functions like matrix multiplication and transposition, can be formed as attention-based function blocks. One key component of our analysis for creating non-linear functions is the manipulation of the softmax in Equation (1a) so as to create the sigmoid function $g(x) = 1/(1 + e^{-x})$. We then encode a different sigmoid function at each head and create linear combinations of them to approximate different functions. For more details see Lemma 10. Using these function blocks and the FLEQ transformer, we are further able to implement a calculator, matrix inversion, power iteration and learning algorithms like SGD on a linear model with square loss, as well as full backpropagation on a 2-layer sigmoid-activated neural network. We now formally state some of these results below; for a complete list, please see the appendix.

**Calculator.** Our first result is the emulation of a simple calculator. To prove the lemma below, we use Lemma 10, which provides error guarantees in terms of the number of heads $m$, to approximate the square root and the inversion function. The details can be found in Appendix H.

Lemma 5. There exists a transformer with 12 layers, $m$ heads and dimensionality $O(\log n)$ that uses the Unified Attention Based Computer framework of Section 5.2 to implement a calculator which can perform addition, subtraction, multiplication, and computing the inverse, square root and percentage. For computing the inverse and square root, the operand needs to be in the range $[-e^{O(m)}, -\Omega(\tfrac{1}{m})] \cup [\Omega(\tfrac{1}{m}), e^{O(m)}]$ and $[0, O(m^2)]$ respectively, and the returned output is correct up to an error of $O(1/\sqrt{m})$ and $O(1/m)$ respectively. Here, $n$ is the number of operations to be performed.

**Linear algebra.** We continue with emulating approximation algorithms like the Newton-Raphson method for finding the inverse of a non-singular matrix $A$ (Alg. 3), and the power iteration algorithm for finding the eigenvector corresponding to the eigenvalue with the maximum absolute value (Alg. 4). Notice that once we have established matrix transposition, matrix multiplication and functions like scalar division, these algorithms can be encoded as sequential applications of those results.

Algorithm 3 Pseudocode for Matrix Inversion
1: X_T = ϵA
2: for i = T, ..., 0 do
3:   X_{i+1} = X_i(2I - AX_i)
4: end for

Lemma 6. Consider a matrix $A \in \mathbb{R}^{d\times d}$; then for any $\epsilon > 0$ there exists a transformer with 13 layers, 1 head and dimensionality $r = O(d)$ that emulates Alg. 3 with output $X^{(\mathrm{transf})}_1$ satisfying $\|X^{(\mathrm{transf})}_1 - X_1\| \leq \epsilon$. This error $\epsilon$ arises due to the softmax, and can be driven arbitrarily close to 0 by increasing the temperature.

Algorithm 4 Power Iteration
Input: A, T
1: Initialize b_0 = 1
2: for k = 0, ..., T - 1 do
3:   b_{k+1} = A b_k
4: end for
5: b = b_T / ‖b_T‖
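Both routines are standard; the NumPy sketch below shows the loops that the corresponding FLEQ programs emulate line by line. Initializing the inversion iterate with a small multiple of the transpose of A is the usual Newton-Schulz convergence-guaranteeing choice and is our assumption (the listing above writes the initializer as ϵA).

```python
import numpy as np

def newton_inverse(A, num_iters=30, eps=None):
    # Newton-Raphson / Newton-Schulz iteration for A^{-1} (cf. Algorithm 3).
    if eps is None:
        eps = 1.0 / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    X = eps * A.T
    for _ in range(num_iters):
        X = X @ (2.0 * np.eye(A.shape[0]) - A @ X)
    return X

def power_iteration(A, T=100):
    # Algorithm 4: repeated multiplication by A, then a single normalization.
    b = np.ones(A.shape[0])
    for _ in range(T):
        b = A @ b
    return b / np.linalg.norm(b)

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(np.allclose(newton_inverse(A), np.linalg.inv(A)))   # True
print(power_iteration(A))                                  # dominant eigenvector of A
```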
Lemma 7. Consider a matrix $A \in \mathbb{R}^{d\times d}$; then for any $\epsilon > 0$ there exists a transformer with 13 layers, 1 head and dimensionality $r = O(d)$ that emulates Alg. 4 for $T = O(\log 1/\epsilon)$ iterations with output $b^{(\mathrm{transf})}_{T+1}$ satisfying $\|b^{(\mathrm{transf})}_{T+1} - b_{T+1}\| \leq \epsilon$. This error $\epsilon$ arises due to the softmax, and can be driven arbitrarily close to 0 by increasing the temperature.

**Stochastic gradient descent and backpropagation.** Finally, we present our result on the emulation of stochastic gradient descent (SGD) in 2-layer neural networks, over a set of data points $(x_i, y_i)$. We first implement Alg. 5, which serves as a function for calculating and updating the weight and bias matrices with steps proportional to their gradients. Each function call takes as input pointers to the weight and bias matrices, one data point and its corresponding label, and the step size.

Algorithm 5 Backpropagation
Define: Loss function $J(x) = \frac{1}{2}x^2$.
Input: $W_1 \in \mathbb{R}^{m\times d}$, $b_1 \in \mathbb{R}^m$, $W_2 \in \mathbb{R}^{m\times 1}$, $b_2 \in \mathbb{R}$, $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, $\eta \in \mathbb{R}$
1: Compute $z = W_1 x + b_1$, $a = \sigma(z)$.
2: Compute $o = W_2^\top a + b_2$.
3: Compute $\delta_2 = (o - y)$.
4: Compute $\delta_1 = \sigma'(z) \odot W_2 (o - y)$.
5: Compute $\partial J/\partial W_2 = \delta_2 a$, $\partial J/\partial b_2 = \delta_2$.
6: Compute $\partial J/\partial W_1 = \delta_1 x^\top$, $\partial J/\partial b_1 = \delta_1$.
7: Update $W_1$, $W_2$, $b_1$, $b_2$ with one gradient update.

Lemma 8. There exists a transformer with 13 layers, 1 head and dimensionality $O(\log(|D|) + d)$ that uses the Unified Attention Based Computer framework to implement $T$ iterations of SGD on a two-layer sigmoid-activated neural network, over a set of $|D|$ data points $(x_i, y_i) \in \mathbb{R}^{d+1}$, $i = 1, \ldots, |D|$. The step size is given as a parameter to the program. The emulation of each step of SGD is not exact; there is some error in each step which, however, can be driven arbitrarily close to 0 by increasing the temperature of the softmax and another free parameter which does not affect the size of the network.
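For reference, here is the forward and backward pass of Algorithm 5 in NumPy, with the sigmoid derivative written out as a(1 - a) and W2 stored as a vector; a small usage loop follows. This is a plain re-implementation of the algorithm, not the transformer emulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W1, b1, W2, b2, x, y, eta):
    """One SGD step of Algorithm 5 on a 2-layer sigmoid network
    with squared loss J(o - y) = (o - y)^2 / 2."""
    z = W1 @ x + b1                          # line 1: pre-activations
    a = sigmoid(z)                           #         hidden activations
    o = W2 @ a + b2                          # line 2: scalar output
    delta2 = o - y                           # line 3
    delta1 = a * (1.0 - a) * (W2 * delta2)   # line 4: sigma'(z) (.) W2 (o - y)
    dW2, db2 = delta2 * a, delta2            # line 5
    dW1, db1 = np.outer(delta1, x), delta1   # line 6
    return (W1 - eta * dW1, b1 - eta * db1,  # line 7: gradient update
            W2 - eta * dW2, b2 - eta * db2)

rng = np.random.default_rng(0)
d, m = 3, 5
W1, b1 = rng.normal(size=(m, d)), np.zeros(m)
W2, b2 = rng.normal(size=m), 0.0
for _ in range(10):
    x, y = rng.normal(size=d), 1.0
    W1, b1, W2, b2 = backprop_step(W1, b1, W2, b2, x, y, eta=0.1)
```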
## 7 Conclusion

In this work, we have shown that transformer networks can be used as universal computers by programming them with specific weights and placing them in a loop.

## Acknowledgements

DP acknowledges the support of an NSF CAREER Award #1844951, a Sony Faculty Innovation Award, an AFOSR & AFRL Center of Excellence Award FA9550-18-1-0166, an NSF TRIPODS Award #1740707, and an ONR Grant No. N00014-21-1-2806. JDL acknowledges the support of the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, NSF CIF 2212262, an ONR Young Investigator Award, and NSF CAREER Award 2144994. KL acknowledges the support of NSF Award DMS-2023239 and FuriosaAI. JS acknowledges the support of the Yonsei University Research Fund of 2023-22-0122. SR acknowledges the support of a Google PhD Fellowship award.

## References

Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., and Zhou, D. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.

Barron, A. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993. doi: 10.1109/18.256500.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Charton, F. Linear algebra with transformers. arXiv preprint arXiv:2112.01898, 2021.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. 2022.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Dasgupta, I., Lampinen, A. K., Chan, S. C., Creswell, A., Kumaran, D., McClelland, J. L., and Hill, F. Language models show human-like content effects on reasoning. arXiv preprint arXiv:2207.07051, 2022.

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.

Garg, S., Tsipras, D., Liang, P., and Valiant, G. What can transformers learn in-context? A case study of simple function classes. In Advances in Neural Information Processing Systems, 2022.

Hutchins, D., Schlag, I., Wu, Y., Dyer, E., and Neyshabur, B. Block-recurrent transformers. arXiv preprint arXiv:2203.07852, 2022.

Kenton, J. D. M.-W. C. and Toutanova, L. K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171-4186, 2019.

Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s):1-41, 2022.

Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022.

Lindner, D., Kramár, J., Rahtz, M., McGrath, T., and Mikulik, V. Tracr: Compiled transformers as a laboratory for interpretability. arXiv preprint arXiv:2301.05062, 2023.

Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022.

Mavaddat, F. and Parhami, B. URISC: the ultimate reduced instruction set computer. International Journal of Electrical Engineering Education, 25(4):327-334, 1988.

Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843-856, 2022.

Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. 2021.

Perekrestenko, D., Grohs, P., Elbrächter, D., and Bölcskei, H. The universal approximation power of finite-width deep ReLU networks. arXiv preprint arXiv:1806.01528, 2018.

Pérez, J., Barceló, P., and Marinkovic, J. Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1-35, 2021. URL http://jmlr.org/papers/v22/20-302.html.

Pérez, J., Marinković, J., and Barceló, P. On the Turing completeness of modern neural network architectures, 2019. URL https://arxiv.org/abs/1901.03429.

Shen, Z., Liu, Z., and Xing, E. Sliced recursive transformer. In European Conference on Computer Vision, pp. 727-744. Springer, 2022.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., and Vladymyrov, M. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677, 2022.

Wei, C., Chen, Y., and Ma, T. Statistically meaningful approximation: a case study on approximating Turing machines with transformers. Advances in Neural Information Processing Systems (NeurIPS), 2022a.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022b.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022c.

Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. In International Conference on Machine Learning, pp. 11080-11090. PMLR, 2021.

Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.-H., Tay, F. E., Feng, J., and Yan, S. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558-567, 2021.

Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2019.

Zhou, H., Nova, A., Larochelle, H., Courville, A., Neyshabur, B., and Sedghi, H. Teaching algorithmic reasoning via in-context learning. arXiv preprint arXiv:2211.09066, 2022.

## Appendix Contents

- A Limitations
- B Omitted Proofs
  - B.1 Addition of pointers
  - B.2 Read/Write operations
  - B.3 if condition then goto instruction: Conditional branching
- C SUBLEQ is Turing Complete
- D FLEQ Overview
- E Functions in the Unified Template Form
  - E.1 Encoding Non-linear Functions within the Attention Mechanism
  - E.2 Matrix Transposition
  - E.3 Matrix Multiplication by Linearizing the Softmax
  - E.4 Advantage of attention over fully-connected networks
- F FLEQ: Proof of Theorem 1
  - F.1 Step 1
  - F.2 Step 2
  - F.3 Step 3
  - F.4 Step 4
  - F.5 Step 5
  - F.6 Step 6
  - F.7 Step 7
- G Error Analysis
- H A Basic Calculator
- I Linear Algebra
- J Emulating Learning Algorithms at Inference Time

## A Limitations

In this paper, we have presented a new approach for using transformer blocks for function approximation. However, there are several limitations to our work that should be considered. One limitation is that the constructions presented in this paper have not been experimentally validated for efficiency. Additionally, implementing a transformer-based approach may be less efficient than running the algorithm directly. Furthermore, our constructions are forced to have a specific input structure where commands and memory are separated, which may lead to inefficiencies. At this point, it is also unclear how to combine hardcoded transformer models with pretrained ones. Lastly, we have not conducted a thorough finite-precision analysis of our algorithms. Despite these limitations, our work presents a novel approach to exploring the mechanics of attention-based networks, and our methods may have potential for further exploration and development.

## B Omitted Proofs

### B.1 Addition of pointers

Lemma 9. There exists a 1-hidden-layer feedforward ReLU network, with 6d activations in the hidden layer and d neurons in the output layer, that when given two d-dimensional binary vectors representing two non-negative integers, can output the binary vector representation of their sum, as long as the sum is less than $2^{d+1}$.

Proof. For the purpose of explaining this proof, we use the $\{0,1\}^d$ binary representation of the integers, instead of the $\{\pm 1\}^d$ binary representation. However, since the conversion of a bit between the two representations can be done easily using a simple affine transformation, the proof also works for the $\{\pm 1\}^d$ binary representation. Let the two integers be $a$, $b$ and let $c := a + b$. We assume that $c < 2^d$. Further, let $a_1$ be the least significant bit of $a$, $a_d$ the most significant, and $a_i$ the $i$-th bit (counting from the least significant), and similarly for $b$ and $c$. Further, let $a[i]$ represent the integer formed by considering only the $i$ least significant bits of $a$. Note that $c_i$ depends only on the $i$ least significant bits of $a$ and $b$, and not on the more significant bits of $a$ or $b$. In particular, $c_i$ only depends on $a[i] + b[i]$. Define $s := a[i] + b[i]$, and note that $c_i = s_i$. Further note that $s < 2^{i+1}$ and hence can be represented in $i+1$ bits. Then, whenever $c_i = 1$, there can be two cases: $(s_{i+1} = 1,\, s_i = 1)$ or $(s_{i+1} = 0,\, s_i = 1)$. This can be equivalently written as: $c_i = 1$ iff $s \in [2^{i-1},\, 2^i - 1] \cup [3\cdot 2^{i-1},\, 2^{i+1} - 1]$. This can be computed by the following ReLU:

$$c_i = \big(\sigma(s - 2^{i-1} + 1) - \sigma(s - 2^{i-1})\big) + \big(\sigma(2^i - s) - \sigma(2^i - s - 1)\big) - 1 + \big(\sigma(s - 3\cdot 2^{i-1} + 1) - \sigma(s - 3\cdot 2^{i-1})\big).$$

Thus, each bit of $c$ can be computed using 6 neurons. Hence, computing the entire sum needs 8d activations.
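The bit expression above can be checked numerically; the sketch below reassembles a + b bit by bit from the six-ReLU formula (the helper names and the test loop are ours).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bit_of_sum(a, b, i):
    """Bit i (1-indexed, place value 2^(i-1)) of a + b, computed with the
    six-ReLU expression from the proof of Lemma 9; s = a[i] + b[i]."""
    s = (a % 2**i) + (b % 2**i)
    t = 2 ** (i - 1)
    return (relu(s - t + 1) - relu(s - t)) \
         + (relu(2 * t - s) - relu(2 * t - s - 1)) - 1 \
         + (relu(s - 3 * t + 1) - relu(s - 3 * t))

d = 6
rng = np.random.default_rng(1)
for _ in range(100):
    a, b = rng.integers(0, 2 ** (d - 1), size=2)      # keeps a + b < 2^d
    bits = [int(bit_of_sum(a, b, i)) for i in range(1, d + 1)]
    assert sum(bit << k for k, bit in enumerate(bits)) == a + b
```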
### B.2 Read/Write operations

Lemma 2 (read). A transformer with one layer, one head, and width of O(log n + d), where d is the dimension of the data vectors and n is the length of the input, can read data/command vectors from the input to the scratchpad from the location pointed to by the position embedding vector in the scratchpad. This operation incurs an error which can be driven arbitrarily close to 0 by increasing the temperature of the softmax operation.

Proof. Consider a simplified input where the scratchpad only has one column, and we have positional encodings, denoted as $p_i$, that point to the location where data or commands should be copied from. In this case, the operation we want to perform is

$$\begin{bmatrix}
0 & v_2 & \cdots & v_i & \cdots\\
v_1 & 0 & \cdots & 0 & \cdots\\
p_i & 0 & \cdots & 0 & \cdots\\
0 & p_2 & \cdots & p_i & \cdots\\
0 & 0 & \cdots & 0 & \cdots\\
1 & 0 & \cdots & 0 & \cdots
\end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix}
0 & v_2 & \cdots & v_i & \cdots\\
v_i & 0 & \cdots & 0 & \cdots\\
p_i & 0 & \cdots & 0 & \cdots\\
0 & p_2 & \cdots & p_i & \cdots\\
0 & 0 & \cdots & 0 & \cdots\\
1 & 0 & \cdots & 0 & \cdots
\end{bmatrix}$$

which moves the data/command embedding vector $v_i$ from the memory/command part of the input to the scratchpad. The first row contains the data to be read, the second row has the data written in the scratchpad, the third row contains the program counter, the fourth row contains the positional encodings, the fifth row is used for temporary storage, and the last row is just a bit that indicates whether the column is in the scratchpad or not. We use the following key and query matrices:

$$K = Q = \begin{bmatrix} 0 & 0 & I & I & 0 & 0 \end{bmatrix},$$

so that the key and query become

$$KX = QX = \begin{bmatrix} p_i & p_2 & \cdots & p_i & \cdots \end{bmatrix},
\qquad
(KX)^\top QX = \begin{bmatrix}
p_i^\top p_i & p_i^\top p_2 & \cdots\\
p_2^\top p_i & p_2^\top p_2 & \cdots\\
\vdots & \vdots & \ddots
\end{bmatrix}.$$

Recall that $p_i$ is a $\log(n)$-dimensional $\pm 1$ vector such that $p_i^\top p_i = \log(n)$ and $p_i^\top p_j \leq \log(n) - 1$ for $j \neq i$. We show in the appendix that if we apply the softmax with temperature $\lambda \geq \log(n^3/\epsilon)$, then $\sigma_S\big((KX)^\top QX\big)$ is an $n \times n$ matrix of the form

$$\sigma_S\big((KX)^\top QX\big) = \begin{bmatrix} \tfrac{e_1 + e_i}{2} & e_2 & e_3 & \cdots & \tfrac{e_1 + e_i}{2} & \cdots & e_n \end{bmatrix} + \epsilon M,$$

where $e_j$ is the $j$-th column of the identity matrix, the columns equal to $\tfrac{e_1+e_i}{2}$ are columns 1 and $i$, $\|M\| \leq 1$, and $\epsilon$ is as defined in Appendix G. For the purpose of the proof, we ignore the error term $\epsilon M$, because it can be reduced arbitrarily by increasing the temperature (it can be made precisely equal to 0 if we consider hardmax instead of softmax), and overall it does not limit us from deriving arbitrarily small error bounds. Next we set the output and value weight matrices so that

$$V = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
I & I & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}.$$

Using this, the output of the head is

$$X + VX\,\sigma_S\big((KX)^\top QX\big) = \begin{bmatrix}
0 & v_2 & \cdots & v_i & \cdots\\
v_1 & 0 & \cdots & 0 & \cdots\\
p_i & 0 & \cdots & 0 & \cdots\\
0 & p_2 & \cdots & p_i & \cdots\\
\tfrac{v_1 + v_i}{2} & \cdots & & & \cdots\\
1 & 0 & \cdots & 0 & \cdots
\end{bmatrix}.$$

Each column of the matrix above has the following form:

$$x = \begin{bmatrix} v^{(0)}_{\mathrm{orig}} \\ v^{(1)}_{\mathrm{orig}} \\ p^{(0)} \\ p^{(1)} \\ v_{\mathrm{new}} \\ b \end{bmatrix},$$

where $v^{(0)}_{\mathrm{orig}}$ and $v^{(1)}_{\mathrm{orig}}$ are the original value vectors (present in the top two row blocks) contained in that column, $p^{(0)}$ and $p^{(1)}$ are the corresponding embeddings of the column, $v_{\mathrm{new}}$ is the new value, and $b$ is the bit indicating whether the column is part of the scratchpad or not. The feedforward layers compute the following:

$$v^{(1)}_{\mathrm{orig}} := v^{(1)}_{\mathrm{orig}} + \sigma\big(C(b-1)\mathbf{1} + 2v_{\mathrm{new}} - 2v^{(1)}_{\mathrm{orig}}\big) - \sigma\big(C(b-1)\mathbf{1} - 2v_{\mathrm{new}} + 2v^{(1)}_{\mathrm{orig}}\big),$$
$$v_{\mathrm{new}} := v_{\mathrm{new}} - \sigma(v_{\mathrm{new}}) + \sigma(-v_{\mathrm{new}}) = 0,$$

where $C$ is a large positive constant. The first equation replaces $v^{(1)}_{\mathrm{orig}}$ with $2v_{\mathrm{new}} - v^{(1)}_{\mathrm{orig}}$, but only when $b = 1$ (i.e., within the scratchpad, where $C(b-1) = 0$); otherwise the update does not occur. The second equation resets the value of $v_{\mathrm{new}}$ to zero after it has been copied to $v_{\mathrm{orig}}$, where $\sigma(-v_{\mathrm{new}})$ is the ReLU applied to the negative of $v_{\mathrm{new}}$. It can be verified that the output of the feedforward layers is then the desired result

$$\begin{bmatrix}
0 & v_2 & \cdots & v_i & \cdots\\
v_i & 0 & \cdots & 0 & \cdots\\
p_i & 0 & \cdots & 0 & \cdots\\
0 & p_2 & \cdots & p_i & \cdots\\
0 & 0 & \cdots & 0 & \cdots\\
1 & 0 & \cdots & 0 & \cdots
\end{bmatrix}.$$

Proof of Lemma 3 (write). We want to achieve the following operation:

$$\begin{bmatrix}
0 & v_2 & \cdots & v_i & \cdots\\
v_1 & 0 & \cdots & 0 & \cdots\\
p_i & 0 & \cdots & 0 & \cdots\\
0 & p_2 & \cdots & p_i & \cdots\\
0 & 0 & \cdots & 0 & \cdots\\
1 & 0 & \cdots & 0 & \cdots
\end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix}
0 & v_2 & \cdots & v_1 & \cdots\\
v_1 & 0 & \cdots & 0 & \cdots\\
p_i & 0 & \cdots & 0 & \cdots\\
0 & p_2 & \cdots & p_i & \cdots\\
0 & 0 & \cdots & 0 & \cdots\\
1 & 0 & \cdots & 0 & \cdots
\end{bmatrix}.$$

The construction for this is identical to the one for read (see the proof of Lemma 2), except that the feedforward layers output the following:

$$v^{(0)}_{\mathrm{orig}} := v^{(0)}_{\mathrm{orig}} + \sigma\big(-Cb\mathbf{1} + 2v_{\mathrm{new}} - 2v^{(0)}_{\mathrm{orig}}\big) - \sigma\big(-Cb\mathbf{1} - 2v_{\mathrm{new}} + 2v^{(0)}_{\mathrm{orig}}\big),$$
$$v_{\mathrm{new}} := v_{\mathrm{new}} - \sigma(v_{\mathrm{new}}) + \sigma(-v_{\mathrm{new}}) = 0,$$

where $C$ is a large positive constant.
The first equation updates the value of a vector vorig in memory with the value of a vector vnew from the scratchpad. The second equation is resetting the new vector in the scratchpad to zero. It can be verified Looped Transformers as Programmable Computers that the output of the feedforward layers would be 0 v2 v1 v1 0 0 pi 0 0 0 p2 pi 0 0 0 1 0 . . . 0 . . . B.3 if condition then goto instruction : Conditional branching In this subsection, we will implement a conditional branching instruction that evaluates a condition and sets the program counter to a specified location if the condition is true, or increments the program counter by 1 if the condition is false. The form of the command is as follows: if mem[a] 0, then goto i, where mem[a] is a value of some location in the memory part of the input sequence. This command has two parts: evaluating the inequality and modifying the program counter accordingly. The first thing we do is read from mem[a], as described in the previous subsection. Let us say that flag is the truth value of the inequality. Since we assume that for such conditional branching command, mem[a] contains an integer, the following Re LU network can be used to compute the flag: flag = 1 σ(mem[a]) + σ(mem[a] 1). (5) In Section 5.1, we consider mem[a] to be vectors contain the binary 1 representation of integers. There we use 2 s complement convention to represent negative integers. Let the vector be [b N . . . b1], where b N is the most significant bit and b1 the least significant. As we explain in that section, the sign of b N indicates whether the integer is negative or positive (The number is negative if b N = +1 and non-negative otherwise). Hence, the flag is 1 if b N = +1 or if all the bits are 1 (which is the case when mem[a] represents the integer 0). flag = σ(b N) + σ Let the current Program Counter be p PC, which points to a given command. Thus, if flag is 1, we want the program counter to jump and become pi, else if flag is 0 the program counter will be incremented by one, and set to be p PC+1. Consider that the simplified input currently has the following scratchpad . . . flag 0 . . . 0 0 p PC 0 . . . 0 0 pi 0 . . . 0 0 where are inconsequential values. The incremented pointer, p PC+1, can be computed using the pointer incrementing operation that we described in the Subsection 4.1, using one feedforward layer of (1b).Then, pnext = 2σ(p PC+1 1flag) + 2σ(pi 1(1 flag)) 1, where 1 is the all ones vector. Notice that we can implement this with just the feed forward layers of Equation (1b). To account for the residual connection we can add the expression σ(p PC) + σ( p PC) in the equation above. Hence, this entire operation requires 3 feed forward layers of Equation (1b), and hence 2 transformer layers. Note that to ensure that the attention layer of the transformer do not modify the input, we simply set the V matrix to zero in (1a). Looped Transformers as Programmable Computers C subleq is Turing Complete In this section, we show that our slightly restricted version of the original SUBLEQ instruction (Mavaddat & Parhami, 1988) is indeed also Turing complete. To do this, we will utilize Minsky machines, which are also Turing complete. A Minksy machine comprises of registers and a list of instructions, where each instruction can be either of the following two instructions add(a): mem[a] := mem[a] + 1, go to the next instruction. sub(a, n): If mem[a] == 0, go to instruction n. Otherwise mem[a] := mem[a] 1, go to the next instruction. 
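Before giving the translation, it may help to have a reference interpreter for the target language at hand. The sketch below is our own plain-Python rendering, assuming the standard SUBLEQ convention (SUBLEQ(a, b, c): mem[b] := mem[b] − mem[a], then jump to instruction c if the result is at most zero); the slightly restricted variant used in this paper may differ in details such as operand order, so treat this only as a testbed for translations like the one that follows, not as part of the transformer construction.

```python
# A minimal SUBLEQ interpreter, assuming the standard convention
# SUBLEQ(a, b, c): mem[b] := mem[b] - mem[a]; jump to instruction c if the
# result is <= 0, otherwise fall through to the next instruction.

def run_subleq(program, mem, start=0, max_steps=10_000):
    """program: list of (a, b, c) triples; mem: dict from address to int."""
    pc = start
    for _ in range(max_steps):
        if pc < 0 or pc >= len(program):   # halt by jumping out of the program
            return mem
        a, b, c = program[pc]
        mem[b] = mem[b] - mem[a]
        pc = c if mem[b] <= 0 else pc + 1
    raise RuntimeError("step limit exceeded")

# Example: the register-zeroing idiom that also appears in the translation below.
mem = {'b': 7}
run_subleq([('b', 'b', 1)], mem)   # mem['b'] := mem['b'] - mem['b'] = 0, then halt
print(mem['b'])                    # -> 0
```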
Given a program written in the language above, we translate it into an equivalent one written in our SUBLEQ language. For this, we initialize three fixed locations/registers $c_{-1}$, $c_0$, and $c_{+1}$ such that mem[$c_{-1}$] := −1, mem[$c_0$] := 0, and mem[$c_{+1}$] := +1, as well as an extra register mem[b]. We translate the program instruction-by-instruction. Assume that we have translated the first i − 1 instructions, and let j − 1 be the index of the last (translated) SUBLEQ instruction; that is, the next SUBLEQ instruction will have index j. Then, the i-th instruction of the Minsky machine program is translated into our language as follows:

Case 1: The i-th instruction of the Minsky machine program is add(a). This is equivalent to SUBLEQ(a, $c_{-1}$, j + 1), and hence the j-th instruction in our program will simply be SUBLEQ(a, $c_{-1}$, j + 1).

Case 2: The i-th instruction of the Minsky machine program is sub(a, n). This is equivalent to the sequence of the following 5 SUBLEQ instructions.

Algorithm 6 Translation for sub(a, n)
1: Instr. j: SUBLEQ(b, b, j + 1)
2: Instr. j + 1: SUBLEQ(b, a, j + 3)
3: Instr. j + 2: SUBLEQ(a, $c_{+1}$, j + 5)
4: Instr. j + 3: SUBLEQ(a, $c_0$, n′)
5: Instr. j + 4: SUBLEQ(a, $c_{+1}$, j + 5)

Here n′ is the index of the translation of the n-th instruction of the Minsky machine program; it can be computed as a function of the number of add and sub instructions up to instruction n. The correctness of the above can be verified by considering the three cases mem[a] > 1, mem[a] = 1, and mem[a] = 0.

D FLEQ Overview

The format of the input sequence. In Fig. 5, we illustrate the input X to our looped transformer, which can execute a program written as a series of FLEQ instructions. Note that X is divided into three sections: Scratchpad, Memory, and Instructions. As shown in the bottom-left part of Fig. 5, we allocate a separate part of the scratchpad for each of the M functions that are internally implemented by the transformer. For example, if we have matrix multiplication and element-wise square root as two functions, we would allocate a different function block for each one.

Figure 5. The structure of input X, to execute FLEQ commands.

This design may not be the most efficient, but our goal is to demonstrate the possibilities of looped transformers. Additionally, since the number of different functions is typically small in the applications we have in mind, the design does not significantly increase in size. The choice to reserve a different function block for each predefined function is for convenience, as it allows for separate treatment of functions without worrying about potentially overlapping results. We believe that a design with a single function block is feasible, but it would significantly complicate the rest of the transformer construction.

Instruction format. The instruction in Theorem 1 is essentially a composition of the following two components: the function call to fm and the conditional branching (if ... goto ...). The instruction, located at the top-right side of Fig. 5, contains the following components: pa and pb (pointers to the parameters of fm), pc (position to write the result), pm (pointer to the function block), pflag (position of the flag), pp (next instruction), and dh, dw (dimensions of the inputs and the output). The goal of each positional encoding vector in Equation (7) is to point to the corresponding space of the input where each component required by the instruction is located.
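For concreteness, each instruction can be pictured as a fixed-width record of these pointers and dimensions. The sketch below uses hypothetical Python field names for the logical content of one instruction column; in the actual construction each pointer is stored as a ±1 binary encoding of a column index of X.

```python
from dataclasses import dataclass

# Logical content of one FLEQ instruction column (hypothetical field names).
@dataclass
class FleqInstruction:
    p_a: int     # pointer to the first argument of f_m
    p_b: int     # pointer to the second argument of f_m
    p_c: int     # pointer to the location where the result is written
    p_m: int     # pointer/index of the function block implementing f_m
    p_flag: int  # pointer to the value tested by the conditional branch
    p_p: int     # index of the instruction to jump to if mem[flag] <= 0
    d_h: int     # number of rows of the inputs/output
    d_w: int     # number of columns of the inputs/output

# Example: "mem[c] = f_2(mem[a], mem[b]); if mem[flag] <= 0 goto instruction 7"
instr = FleqInstruction(p_a=10, p_b=14, p_c=18, p_m=2, p_flag=18, p_p=7, d_h=4, d_w=4)
```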
To be specific, pa and pb point to the locations that the inputs a and b are located, pc points to the location to which we will record the final result of the function fm. Similarly, pm points to the function block in the scratchpad that the intermediate computations required for fm are recording, pflag points to the variable that we check if it is non-positive (the result is used for conditional branching), and pp points to the address of the line of code that we would jump if the variable in pointed by pflag is non-positive. Execute a function; Jump to command. Recall that the first four parameters (a, b, c, m) of FLEQ, as well as the last two (dh, dw) are related to the implementation of the function block, while the other two (flag, p) are related with the conditional branching. Since there is no overlap between the two components of each instruction, it is possible to use each of these components independently. By having a fixed location flag0 where mem[flag0] is always set to 1, we can have the simpler command FLEQ(a, b, c, m, flag0, p, dh, dw) which implements mem[c] = fm(mem[a], mem[b]). Further, by having fixed locations a0, b0, c0 which are not used elsewhere in the program, and hence inconsequential, we can have the simpler command FLEQ(a0, b0, c0, m, flag, p, dh, dw) which implements if mem[flag] 0 goto instruction p. Using this, we get the following corollary: Corollary 1. The Unified Attention Based Computer presented in Theorem 1 can run programs where each instruction can be either of the following two simple instructions: mem[c] = fm(mem[a], mem[b]) if mem[flag] 0 goto instruction p Format of Transformer-Based Function Blocks. Recall that each function block is located at the bottom left part of the input X, as shown in Fig. 5. Each transformer-based function block is expected to operate using the following format of the input: The number of rows in the input is r, while the number of columns is s and s 3d. Here s will dictate the total maximum number of columns that any transformer-based function block needs to operate. The reason that s might be larger than 3d has to do with the fact that some blocks may need some extra scratchpad space to perform some calculations. Looped Transformers as Programmable Computers The function block specifies the dimensions of input and output. Say they are dh dw, where dh, dw d . These will be part of the instruction which calls this function inside the FLEQ framework, as in (7). Suppose each function block has two inputs (A Rdh dw and B Rdh dw) and one output f(A, B) = C Rdh dw. As in (8), the function block is divided into four parts: (1) the first input A is placed in the first dh rows and the first dw columns, (2) the second input B is placed in the first dh rows and the columns d + 1 : d + dw, (3) the output f(A, B) = C is in the first dh rows and the columns 2d + 1 : 2d + dw columns and 4) the rest s 3d column used as scratchpad space for performing necessary calculations. Note that the unused columns are set to zero. The last r dh rows can be used by the transformer-based function block in any way, e.g., to store any additional positional encodings. We put the format of the input of each transformer-based function block in (8). The first input A = [z1 a, , zdw a ] of the function is zero padded and stored in the first d columns. Similarly, the second input B = [z1 b, , zdw b ] is stored in the next d columns. 
The output/result of the function block C = [z1 c, , zdw c ] is located in the next d columns while we have some extra s 3d columns which can be used as scratchpad. z1 a . . . zdw a 0 z1 b . . . zdw b 0 z1 c . . . zdw c 0 . . . 0 . . . . . . . . . . . . Input A Input B Output C = f(A, B) Let us consider the case where we wish to multiply a matrix A Rd d,with a vector b Rd 1. The resulting output matrix would look as follows: A b 0 A b 0 0 . E Functions in the Unified Template Form In this section, we demonstrate how to implement a variety of nonlinear functions and basic linear algebra operations using transformers. These techniques will be crucial in the construction of iterative algorithms in the following sections. Each transformer-based function block in this section fits in our unified template in terms of input/output parameters locations. We note here that each transformer-based function block might have its own positional encodings used to transfer the output in the correct place or perform some read/write operations and they are part of the design of the block. E.1 Encoding Non-linear Functions within the Attention Mechanism One key ingredient of our constructions is encoding various functions within the attention mechanism. We do this by forcing the softmax to act as a sigmoid function and by storing multiple coefficients in the query and value weight matrices. As far as we know, this is the first work that shows how general non-linear functions can be emulated by attention layers. This allows us to create linear combinations of sigmoids that can be accessed by an indicator vector in the input. Our analysis is based on the result of (Barron, 1993) which we present below. Definition 2. Let ΓC,B be the set of functions defined in a bounded domain B, f : B R, B Rd with a proper extension to Rd such that they have C bounded Fourier integral, i.e., R supx B |w x| F(dw) C holds where F(dw) is the magnitude of the Fourier distribution. Definition 3. Given τ > 0, C > 0 and a bounded set B, let Gϕ,τ = {γϕ(τ(a T x + b)) : |γ| 2C, a B 1, |b| 1} where a B = supx B{x T a} and ϕ is the sigmoid function, i.e., ϕ(x) = 1 1+e x . Theorem 2 (Theorem 3 in (Barron, 1993)). Every function f ΓC,B with f(0) = 0 and can be approximated by a linear combination of sigmoids fi Gϕ,τ, i = 1, . . . m. If τ m1/2 ln m the error scales as f(x) Looped Transformers as Programmable Computers To encode N different functions, we use the index j [N] and write cji, aji for the coefficients of the sigmoids that approximate them or i=1 cjiϕ(x T aji) for j = 1, . . . , N We here note that the terms τ, b can be incorporated in the term aij by adding an extra coefficient of 1 in x and multiplying everything with τ. We are now able to present the lemma on approximating functions using transformer blocks, in a format that is consistent with the FLEQ design outlined in the previous section. Lemma 10. Consider an input of the form e 0 x 0 0 0 0 0 0 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d where d is chosen, N is the number of functions we encode and dx is the dimension of x. e = ej an indicator vector of the function we want to choose. Then there exists a transformer-based function block with 3 layers, m heads and dimensionality O(d) such that Pm i=1 cjiϕ(x T aji) 0 0 x 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d where denoted inconsequential values that will be ignored downstream. Proof. 
The first thing we do is to move the x to the second row block, as follows: e 0 x 0 0 0 0 0 0 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d e 0 0 0 0 0 0 0 x 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d This can be done using a Re LU feedforward layer that performs this using the last row of the input as the indicator bit for the column containing x. Then we want to create the following transformation e 0 0 0 0 0 0 0 x 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d Pm i=1 cjiϕ(x T aji) 0 0 x 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d The proof follows that of Lemma 10. We again ignore the last three rows by setting the corresponding rows in the key, query and values weight matrices to be zero. Let Qi = 0 Id 0 0 , Ki = [a1i . . . a Ni] 0 0 0 , Vi = [c1i . . . c Ni] 0 0 0 We note that for the purpose of this proof, each ai has one extra element at the end equal to log(3d 1), while the vectors Looped Transformers as Programmable Computers x will have the last element equal to one. Then we will have σS((Ki X)T (Qi X)) = σS a ji 0 0 0 0 0 0 0 0 0 0 0 0 0 x 0 0 0 0 0 0 0 0 0 0 0 a jix 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 since a jix = a jix log 3d 1 and thus ea jix/(3d 1 + ea jix) = ϕ(a jix) with a slight abuse of notation over the inner product a jix to account for the extra corrections bias term. Thus, VXσS((KX)T (QX)) = cjiϕ(x T aji) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 By summing over all heads and adding the residual we get Pm i=1 cjiϕ(x T aji) 0 0 x 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d Finally, we use an extra layer similarly to Lemma 3 to write the result in the desired output. Hence, we get Pm i=1 cjiϕ(x T aji) 0 0 x 0 0 0 0 0 p2d+1 0 0 0 p1 p2:d 0 pd+2:2d p2d+1 p2d+2:3d 0 02:d 1 0d+2:2d 0 02d+2:3d Alternative Lemma. We demonstrate an alternative way of encodings functions in the attention mechanism, which has a different complexity tradeoff. Lemma 11. Consider an input of the form x . . . x 0 . . . 0 1 e1 . . . 1 ed e1 . . . ed where z = ej Rd is an indicator vector and x Rd; then there exists a one layer transformer with 1 head such that x . . . x σ(a 1 x) . . . σ(a d x) 1 e1 . . . 1 ed e1 . . . ed Looped Transformers as Programmable Computers a 1 0 Ce 1 0 ... ... ... ... a d 0 Ce d 0 0 0 0 e 1 ... ... ... ... 0 0 0 e d a 1 x C + a 1 x . . . C + a 1 x C + a 2 x a 2 x . . . C + a 2 x ... ... ... ... C + a d x C + a d x . . . a d x 0 0 . . . 0 After applying softmax we get, σs((KX) QX) σ(a 1 x) 0 . . . 0 0 σ(a 2 x) . . . 0 ... ... ... ... 0 0 . . . σ(a d x) 0 . . . for large enough C. Next we set 0 0 . . . 0 0 1 1 . . . 1 0 0 0 . . . 0 0 ... ... ... ... ... 0 0 . . . 0 0 thus resulting in 0 0 . . . 0 0 1 1 . . . 1 0 0 0 . . . 0 0 ... ... ... ... ... 0 0 . . . 0 0 0 0 . . . 0 0 1 1 . . . 1 0 0 0 . . . 0 0 ... ... ... ... ... 0 0 . . . 0 0 Hence, we get VXσs((KX) QX) = 0 . . . 0 σ(a 1 x) . . . σ(a d x) 0 . . . 0 ... ... ... ... 0 . . . 0 0 . . . 0 X + VXσs((KX) QX) = x . . . x σ(a 1 x) . . . σ(a d x) 1 e1 . . . 1 ed e1 . . . ed Looped Transformers as Programmable Computers Corollary 2. Consider an input of the form x . . . 0 0 . . . 0 1 e1 . . . 1 ed e1 . . . ed where m is the number of sigmoids we use and ei is an indicator vector and x Rd; then there exists a 3 layer transformer with 1 head such that Attn(X) = Pd i=1 σ(a 1 x) . . . Pd i=1 σ(a 1 x) 0 . . . 
0 Proof. Given the input x . . . 0 0 . . . 0 1 e1 . . . 1 ed e1 . . . ed we set the query and key matrices as follows: K = Q = 0 0 1 1 . Then, we get d . . . d ... . . . ... d . . . d Setting the value matrix to d I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 VXσS((KX) QX) = x . . . x 0 . . . 0 0 . . . 0 0 . . . 0 Hence, the output of the attention layer is: X + VXσS((KX) QX) = 2x . . . x 0 . . . 0 1 e1 . . . 1 ed e1 . . . ed Note that using the embeddings in the last rows and a feedforward network can be used to produce the following x . . . x 0 . . . 0 1 e1 . . . 1 ed e1 . . . ed Now, passing this into the transformer of Lemma 11 will result in x . . . x σ(a 1 x) . . . σ(a d x) 1 e1 . . . 1 ed e1 . . . ed Looped Transformers as Programmable Computers For the third layer, we set the key and query matrices as follows K = Q = 0 0 1 1 . Then, we get d . . . d ... . . . ... d . . . d Setting the value matrix to 0 0 0 0 0 d 0 0 0 0 0 0 0 0 0 0 VXσS((KX) QX) = 0 . . . 0 Pd i=1 σ(a 1 x) . . . Pd i=1 σ(a 1 x) 0 . . . 0 0 . . . 0 Hence, the output of the attention layer is: X + VXσS((KX) QX) = x . . . x Pd i=1 σ(a 1 x) . . . Pd i=1 σ(a 1 x) 1 e1 . . . 1 ed e1 . . . ed Finally, the feedforward layers can be used to move the results to the first row. E.2 Matrix Transposition Lemma 12. Fix ϵ > 0 and consider an input of the following form A 0 0 . . . 0 0 0 0 . . . 0 p1:d p1:d p1:d . . . p1:d P 1 P 2 P 3 . . . P d where A Rd d; then there exists transformer-based function block with 4 layers, 1 head and dimensionality r = 2d + 2 log d = O(d) that outputs the following matrix A A A . . . A 0 0 0 . . . 0 p1:d p1:d p1:d . . . p1:d P 1 P 2 P 3 . . . P d where A = A + ϵM, for some M 1. Proof. We can vectorize the matrix A into a d2 dimensional vector using the attention mechanism, as shown in Eq. (9). Notice that once we have the matrix in this form we can implement its transpose with a fixed permutation of the columns of the matrix to get the vectorized form of A . Once we have the transpose in vector form, we matricize it back to get the matrix transform using the attention mechanism. We explain the details of this process below: Vectorization: We assume that the input is of the following form, where A is the matrix to be vectorized. A 0 . . . 0 0 0 . . . 0 p1:d p1:d . . . p1:d P 1 P 2 . . . P d Looped Transformers as Programmable Computers Here, P i represents a matrix of d columns, where each column is pi. The first layer uses the p1:d encodings to make d copies of the matrix A, as follows: A 0 . . . 0 A A . . . A p1:d p1:d . . . p1:d P 1 P 2 . . . P d The feed forward part of the second layer then uses the encodings p i to vectorize the matrix in the second row block as follows: A . . . 0 A(1,1) . . . A(1,d) 0 . . . 0 . . . A(d,1) . . . A(d,d) 0 . . . 0 p1:d . . . p1:d P 1 . . . P d This is achieved, by explicitly defining a neural network that keeps the i th row if the corresponding encoding is P i and place it in the d + 1 row. Transposition in the vector form: Once we have the matrix vectorized as the second row block of the scratchpad, the following key and query matrices K = 0 0 I 0 0 0 0 I , Q = 0 0 0 I 0 0 I 0 results in the head outputting the following, which is the vectorized form of A (in the second row block) XσS((KX) (QX)) = . . . A(1,1) . . . A(d,1) 0 . . . 0 . . . A(1,d) . . . A(d,d) 0 . . . 0 P 1 . . . P d p1:d . . . p1:d Then, using the following value matrix gives 0 0 0 0 0 I 0 0 0 0 0 0 0 0 0 0 VXσS((KX) (QX)) = 0 . . . 0 A(1,1) . . . A(d,1) 0 . . . 0 . . . A(1,d) . . . A(d,d) 0 . . . 0 0 . . . 
0 0 . . . 0 Adding back the X (see (1)), results in X + VXσS((KX) (QX)) = A . . . 0 A(1,1) . . . A(d,1) 0 . . . 0 . . . A(1,d) . . . A(d,d) 0 . . . 0 p1:d . . . p1:d P 1 . . . P d Using the feedforward layers and the encodings P i, we get A . . . 0 A(1,1) . . . A(d,1) 0 . . . 0 . . . 0 . . . 0 A(1,d) . . . A(d,d) p1:d . . . p1:d P 1 . . . P d Looped Transformers as Programmable Computers Using an attention layer and the first row of encodings, we get A . . . A 0 . . . 0 p1:d . . . p1:d P 1 . . . P d E.3 Matrix Multiplication by Linearizing the Softmax We will show how we can implement matrix multiplication so that it will fit our unified template. To do so, we need to show for example for the result of A B , where A Rk m and B Rk n with k, m, n < d we can achieve the following: A 0 B 0 0 0 0 0 0 0 A B 0 0 0 0 0 0 Lemma 13. Let A Rk m and B Rk n; then for any ϵ > 0 there exists a transformer-based function block with 2 layers, 1 head and dimensionality r = O(d) that outputs the multiplication AT BT + ϵM, for some M 1 . Corollary 3. Let A Rk m and B Rk n; then for any ϵ > 0 there exists a transformer-based function block with 2 layers, 1 head and dimensionality r = O(d) that outputs the multiplication B A + ϵM, for some M 1 . Corollary 4. Let A Rk m and B Rk n; then for any ϵ > 0 there exists a transformer-based function block with 2 layers, 1 head and dimensionality r = O(d) that outputs the multiplication B B + ϵM, for some M 1 . Corollary 5. Let A Rk m and B Rk n; then for any ϵ > 0 there exists a transformer-based function block with 2 layers, 1 head and dimensionality r = O(d) that outputs the multiplication A A + ϵM, for some M 1 . We will prove just the first of these results and the rest are a simple corollary of it. Proof. Let M R2d 2d, A Rk m and B Rk n be the following matrices: M = A 0 B 0 0 0 0 0 The zeros pad the rows and columns to ensure that the matrix M is 2d 2d. Then, consider the input matrix to be of the following form: M 0 0 0 11 0 I 0 0 p(1) where 1 R2d is the all ones vector. The identity matrix I and the all ones matrix 11 are part of the design of the input and they are always fixed. For now we ignore the encodings and the last row, by setting the corresponding rows of the key,query and value weight matrices to be zero. These rows will be used to copy the output to the place that we want. Focusing on the rest of the rows, we set the key and query weight matrices to be "c I 0 0 0 0 CI 0 I 0 "0 0 0 0 0 ne CDd 0 0 0 where Dd R2d 2d is the diagonal matrix with the first d diagonal elements 1, and the rest 0. Thus we have M 0 0 0 11 0 I 0 0 c M 0 0 CI 0 0 0 11 0 c M M 11 0 C11 0 0 0 0 0 Looped Transformers as Programmable Computers Each of the first 2d columns above looks as follows cz1i cz2i . . . czni C1 0 After we apply the softmax σs per column, we get σs(czij) = eczij Pn j=1 eczij + n(e C + 1) where n = 2d, zij is the (i, j) element of the matrix M M. Let ℓ( ) be the transformation above then we have VXσS((KX) QX) = " 0 0 0 ne CDd 0 0 0 0 0 0 0 0 ne CDdℓ(c M M) 0 0 0 0 0 0 11 + c M M 0 0 0 and by adding back the residual we have M 0 0 11 + c M M I 0 0 for small enough c and large enough C. This is because ne C ecxij Pn j=1 ecxij + n(e C + 1) = ecxij 1 1 + Pn j=1 ecxij C log n + n = (1 + cxij + O((cxij)2))(1 ecxij C log n + O(e2(cxij C log n))) = (1 + cxij + O((cxij)2))(1 ecxij C log n) (1 + cxij) (10) Hence by increasing C and decreasing c, the error can be made arbitrarily small. 
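To see numerically why this works, the following NumPy check (our own, with arbitrary constants) reproduces the calculation around Eq. (10): padding each softmax column with n entries equal to C and n zeros, and rescaling the output by $n e^{C}$, makes the first n entries behave like $1 + c\,x_{ij}$, with an error that vanishes as C grows and c shrinks.

```python
import numpy as np

# Sanity check of the "linearized softmax" trick from Appendix E.3 / Eq. (10).
rng = np.random.default_rng(0)
n = 8
x = rng.uniform(-1.0, 1.0, size=n)   # entries of one column of M^T M
c, C = 1e-3, 12.0                    # small c, large C

logits = np.concatenate([c * x, np.full(n, C), np.zeros(n)])
soft = np.exp(logits) / np.exp(logits).sum()

approx = n * np.exp(C) * soft[:n]    # should be close to 1 + c*x
err = np.max(np.abs(approx - (1.0 + c * x)))
print(err)                           # small compared to c = 1e-3
```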
We now use the feedforward layers to perform the following transform A A 0 A B 0 0 0 0 0 B A 0 B B 0 0 0 0 0 Now if p(1) = [0 0 p2d+1:2d+n 0 0] and p(2) = [p1:n pn+1:d 0 pd+n+1:2d p2d:3d] we can copy A B to the desired place using Lemma 2. Looped Transformers as Programmable Computers E.4 Advantage of attention over fully-connected networks It is possible to implement the functions and overall lexicographic functionality presented in previous sections using fully connected networks, as they are also universal function approximators. However, it is easy to demonstrate a depth separation between attention-based networks and fully connected networks. For example, to compute simple functions like polynomials of x (e.g., x2), a Re LU network with a depth proportional to log(1/ϵ) is required, where ϵ is the quality of approximation, e.g., as showed in (Perekrestenko et al., 2018). In contrast, we have shown how x2 can be implemented in essentially 2 layers. This simple depth separation argument highlights the constant vs scaling depth required for several functionalities in fully connected networks versus attention-based networks. It is important to note that although these constructions are easy to demonstrate their existence, constructing them is not straightforward. In this work, we provide hardcoded attention layers that precisely do that, making it easier to implement these functionalities in practice. F FLEQ: Proof of Theorem 1 Each instruction consists of the following tuple: (pa, pb, pc, pflag, pm, pp), and does the following 1. mem[c] = fm(mem[a], mem[b]) 2. if mem[flag](0,0) 0 goto instruction p Here, locations a, b, and c can contain either scalars, or d-dimensional vectors or d d matrices, and mem[flag](0,0) is the 1-st entry of mem[flag] if it is a vector / matrix, else it is mem[flag] if a scalar. This can be implemented using the following steps (each may use a separate layer of transformer): At the beginning of each iteration, the scratchpad starts with storing the pointer to the next instruction pt. 1. Read the command (pa, pb, pc, pflag, pp, pm) from the location to the scratchpad. 2. Copy the d d data at locations a, b to the scratchpad memory scratch Mem (assume the data is d d even if actually scalar or vector, the fm implementation will handle that) 3. Copy the data to the i-th function row block using the feed forward layer. 4. Once in the correct row block, fm(mem[a], mem[b]) is computed 5. Feedforward layers copy back the data from i-th row block to the scratchpad memory scratch Mem. 6. Write result from scratchpad memory to pc. 7. if mem[flag](0,0) 0 store pp in the scratchpad, else pt+1 Figure 6. The structure of input X Looped Transformers as Programmable Computers The structure of the input X is shown in Figure 6. It has n columns and O(Md + log n) rows. It is partitioned into 3 column blocks: the Scratchpad block, the Memory block, and the Instructions block. The Memory block is the storage and is the location where all the variables are stored. Each variable can be either a scalar, vector or matrix, as long as the number of rows in it are no larger than d. For example, if a variable is a d d matrix, it is stored in d consecutive columns in the block, where each column has length d. The address of this variable is the index of its first column in the input X. 
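The seven steps above make up one pass of the looped transformer, and each pass executes one FLEQ instruction. As a reference for what that loop computes (not for how attention and feedforward layers implement it), here is a plain-Python emulation of the instruction cycle, with hypothetical names and with each fm represented by an ordinary Python callable.

```python
import numpy as np

# Plain-Python emulation of the FLEQ instruction cycle described above.
def run_fleq(instructions, mem, functions, start=0, max_steps=1000):
    """instructions: list of dicts with keys a, b, c, m, flag, p;
    mem: dict from address to np.ndarray; functions: list of callables f_m."""
    pc = start
    for _ in range(max_steps):
        if pc >= len(instructions):        # fall off the end of the program: halt
            return mem
        ins = instructions[pc]
        f = functions[ins["m"]]
        mem[ins["c"]] = f(mem[ins["a"]], mem[ins["b"]])   # mem[c] = f_m(mem[a], mem[b])
        flag = np.asarray(mem[ins["flag"]]).flat[0]       # mem[flag]_(0,0)
        pc = ins["p"] if flag <= 0 else pc + 1            # conditional branch
    raise RuntimeError("step limit exceeded")

# Example: repeatedly subtract 1 from x until it reaches zero.
functions = [lambda u, v: u - v]                          # f_0 = subtraction
mem = {"x": np.array([3.0]), "one": np.array([1.0]), "always": np.array([-1.0])}
prog = [
    {"a": "x", "b": "one", "c": "x", "m": 0, "flag": "x", "p": 2},      # x -= 1; halt if x <= 0
    {"a": "x", "b": "one", "c": "x", "m": 0, "flag": "always", "p": 0}, # x -= 1; goto 0
]
print(run_fleq(prog, mem, functions)["x"])                # -> [0.]
```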
The Instructions block contains instructions, where each instruction is a vector of the form pa pb pc pm pflag pp dh dw b(1) mask b(2) mask b(3) mask which encodes the following logic: mem[c] = fm(mem[a], mem[b]) ; if mem[flag] 0 goto instruction p. pa, pb, pc, pp, and pflag are all binary 1 vectors that point to the locations a, b, c, p, and flag respectively. These are simply the binary representations of the integers a, b, c, p and flag, and hence have length log2 n each. Similarly, pm is the binary vector representation of the integer m, and hence has length log2 M, where M is the number of functions we implement. The bmask is mask bit used while writing the output back to memory. The scratchpad has s columns. The length s depends on the maximum number of columns needed by the function blocks to operate, and can be as low as O(1) for scalar and vector functions, O(d) for matrix functions, and can be as high as O(d2) if functions like matrix vectorization are one of the M functions. The Scratchpad consists of the following parts: The program counter is a row block with log2 n rows and s columns and takes the form: [pi pi pi.] This signifies that the current program counter points to the i-th instruction. Using this, the i-th instruction is read into all the s columns of Current Instruction row block. The Current Instruction row block has O(log n) rows and s columns, and each column initially contains the i-th instruction once it is read. Then, the instructions in each column are slightly modified depending on the column index, to read memory blocks pointed to in the instruction. The memory blocks are read into the Scratchpad Memory . The Scratchpad Memory is a temporary location where the data is first read into from the Memory column block, before it is moved to the correct function s Function Block, using the function index encoding pm in the instruction. The encodings row block has O(log n) rows and n columns, and is used to index every column in the input X. It contains the binary 1 vector encodings of the column index for each column. The details of this row block are explained later. The Function Blocks are custom transformer blocks that can be added in a plug-n-play manner to the Unified Attention Based Computer depending on what elementary functions the user wants the computer to have access to. Looped Transformers as Programmable Computers 0 0 . . . 0 zs+1 . . . zm+s pt pt . . . pt . . . . . . c1 t c2 t . . . cs t . . . . . . z1 at z2 at . . . zs at 0 . . . 0 0 . . . 0 z1 bt z2 bt . . . zs bt 0 . . . 0 0 . . . 0 z1 ct z2 ct . . . zs ct 0 . . . 0 0 . . . 0 0 0 . . . 0 ps+1 . . . pm+s pm+s+1 . . . pn p1 p2 . . . ps 0 . . . 0 0 . . . 0 f1mem . . . . . . . . . . . . . . . ... ... ... ... ... ... ... . . . f Mmem . . . . . . . . . . . . . . . In this step, we need to copy the t-th instruction, pointed to by the program counter pt, to the scratchpad s Current Instruction block. We denote the instruction by ct where pat pbt pct pflagt ppt pmt dh dw b(1) mask b(2) mask b(3) mask For this step, we only consider the following relevant subset of rows of the matrix X: 0 0 . . . 0 . . . cm+s+1 . . . cn pt pt . . . pt . . . . . . c1 t c2 t . . . cs t . . . . . . 0 0 . . . 0 ps+1 . . . pm+s pm+s+1 . . . pn The other rows will not be used or changed during this operation because we can simply set the corresponding rows of the K, V, Q matrices to 0 for all heads and setting the feed forward layers to also pass the corresponding rows unchanged. 
At the beginning of execution of each command, the Current Instruction row block would be empty, so the input would look like . . . . . . cm+s+1 . . . cn pt pt . . . pt . . . . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 ps+1 . . . pm+s pm+s+1 . . . pn Then, consider an attention head with the following K, Q, V matrices: K = [0 0 0 I] , Q = [0 I 0 0] , V = 0 0 0 0 0 0 0 0 I 0 0 0 0 0 0 0 Looped Transformers as Programmable Computers This will result in . . . . . . cm+s+1 . . . cn pt pt . . . pt . . . . . . ct ct . . . ct . . . . . . 0 0 . . . 0 ps+1 . . . pm+s pm+s+1 . . . pn We apply Lemma 9 on the row blocks ct ct . . . ct . . . . . . p1 p2 . . . ps 0 . . . 0 0 . . . 0 to construct feedforward layers that convert ct to ci t, where pat+i pbt+i d pct+i 2d pflagt ppt pmt dh dw b(1) mask = 1(i dw) b(2) mask = 1(i>d) + 1(i d+dw) 1 b(3) mask = 1(i>2d) + 1(i 2d+dw) 1 Note that the last three elements can be created using the following Re LU: b(1) mask =σ(2d + dw i + 1) σ(2d + dw i) b(2) mask =σ(i d) σ(i d 1) + σ(d + dw i + 1) σ(d + dw i) 1 b(3) mask =σ(i 2d) σ(i 2d 1) + σ(2d + dw i + 1) σ(2d + dw i) 1. At the end of this step, we get the following: 0 0 . . . 0 . . . cm+s+1 . . . cn pt pt . . . pt . . . . . . c0 t c1 t . . . cs t . . . . . . 0 0 . . . 0 ps+1 . . . pm+s pm+s+1 . . . pn Use three heads, one each for pa, pb and pc. Using the vectors pat+i, pbt+i d, and pct+i 2d we copy the data (using one head each and a similar technique as last step) to get the following in the Scratchpad Memory: " zat . . . zat+d . . . . . . . . . . . . . . . zbt . . . zbt+d . . . . . . . . . . . . . . . zct . . . zct+s 2d . . . . . . Using the mask bits at the end of ci t, we get " zat . . . zat+dw 1 0 zbt . . . zbt+dw 1 0 zct . . . zct+dw 1 0 0 . . . 0 . . . 0 . . . 0 0 0 . . . 0 0 0 . . . 0 0 0 . . . 0 . . . 0 . . . 0 0 0 . . . 0 0 0 . . . 0 0 0 . . . 0 . . . Looped Transformers as Programmable Computers zi[1 : d] = σ(zi[1 : d] C(1 b(1) mask)1) σ( zi[1 : d] C(1 b(1) mask)1) + σ(zi[d + 1 : 2d] C(1 b(2) mask)1) σ( zi[d + 1 : 2d] C(1 b(1) mask)1) + σ(zi[2d + 1 : 3d] C(1 b(1) mask)1) σ( zi[2d + 1 : 3d] C(1 b(1) mask)1), zi[d + 1 : 3d] = 0, where C is a large positive constant. Using the same mask bits, we also mask the row containing the output data pointers for c: [ 0 . . . 0 0 . . . 0 pct . . . pct+dw 1 0 . . . 0 0 . . . 0 0 . . . 0 ] (12) The following feedforward Re LU layer can move the data to the correct function blocks: fkmem[1 : dh] = (σ(z[1 : dh] C((1 b(1) mask b(2) mask)1 + log M p k pm)) σ( z[1 : dh] C((1 b(1) mask b(2) mask)1 + log M p k pm))), where C is a large positive constant. Each of the M functions have their own attention heads, which are constructed to be copies of their transformer based function blocks. The results after the attention are written back into their respective row blocks. Since the row blocks are separate, the feedforward layers of each of the transformer based function blocks also work in parallel to store the final results in the respective row blocks. Similar to Step 3 we use the following feedforward Re LU layer to move the data from the function block back into the scratchpad memory z[1 : dh] = z[1 : dh] + σ((fkmem[1 : dh] z[1 : dh]) C((1 b(3) mask)1 + log M p k pm)) σ( (fkmem[1 : dh] z[1 : dh]) C((1 b(3) mask)1 + log M p k pm)) , where C is a large positive constant. For this step we focus on the encoding row block, memory storage row block and the following rows in the input (see (12), (11)): 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 zs+1 . . . zm+s 0 . . . 0 0 . . . 
0 znew ct . . . znew ct+dw 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 pct . . . pct+dw 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 0 . . . 0 ps . . . pm 1 pm . . . pn 1 We set the Key and Query matrices as follows: Looped Transformers as Programmable Computers 0 0 0 0 I I 0 0 0 0 0 0 0 0 0 0 VXσS((KX) QX) . . . 0 . . . 0 . . . 0 . . . 0 0 . . . 0 0 . . . . . . . . . dnew ct +dct 2 . . . dnew ct+dw +dct+dw 2 . . . d0 . . . dct 1 dnew ct +dct 2 . . . dnew ct+dw +dct+dw 2 dct+dw+1 . . . . . . . . . 0 . . . 0 . . . 0 . . . 0 0 . . . 0 0 . . . . . . . . . 0 . . . 0 . . . 0 . . . 0 0 . . . 0 0 . . . . . . Finally, we use the feedforward layers similar to the proof of Lemma 3 to write back [dnew ct . . . dnew ct+dw] to the correct rows. This step is identical to Section 4.3. G Error Analysis In all of this section we assume that each element of the input matrix X has values vi bounded by some constant G, i.e., |vi| G. The error in the read/ write operation. The positional encodings as we have already mentioned have the following properties: pi is an log(n) dimensional 1 vector which is the binary representation of i with 1 in the place of 0. Hence, we have p i pi = log(n) and each p i pj < log(n) for i = j. Each time a copy is implemented from one column to another, we create a permutation matrix (a matrix of zeros and ones) which then multiplies the input matrix X Rd n from the right and results in permutations of the column space. We thus focus on just one column of the n n matrix that is created after we apply the softmax. Let z be this column of the matrix, ideally we want to output in one position 1 and in the rest 0. In the place that we want to output 1, say the a th position, we have the inner product za = p i pi for some i [n]. The rest of the elements in the same column would be zb p i pj for i = j and a = b. Then, [σS((KX) QX)]i,i = eλp i pi eλp i pi + P j =i eλp i pj j =i eλp i pj/eλp i pi Since λp i pj < λp i pi λ for i = j, we have that [σS((KX) QX)]i,i 1 1 + ne λ 1 1 + elog n λ 1 + elog n λ Thus, for i = j, [σS((KX) QX)]i,j elog n λ. This implies that there exist ϵi, i = 1, . . . , n such that za = 1 εa, for some εa elog n λ zb = εb for b = a and for some εb elog n λ Looped Transformers as Programmable Computers Hence, we have that z = z + ε where z is the targeted vector and ε is the vector containing the errors εa, εb. Now let xi be the i th row of the input matrix X, then we have Xz = Xz + Xε x1, ε ... xd, ε In the general case that all the columns will change, let P = σS((KX) QX) and P be the targeted matrix then we have that XP = XP + XE where E = [ε1 . . . εn] is the matrix containing all the errors and so XP XP = max 1 j n i=1 | xi, εj | Gn2delog n λ elog Gdn3 λ Thus, if λ > log Gdn3 ϵ we have that The error in Matrix Multiplication . 
This error has already been calculated in Appendix E.3, however we explicitly define it here as follows: ne C ecxij Pn j=1 ecxij + n(e C + 1) = ecxij 1 1 + Pn j=1 ecxij C log n + n = (1 + cxij + O((cxij)2))(1 ecxij C log n + O(e2(cxij C log n))) Let c = ϵ1 C1G for some constant C1 and C = log C2 ϵ2 for some C2 then we have A = ne C ecxij Pn j=1 ecxij + n(e C + 1) = ecxij 1 1 + Pn j=1 ecxij C log n + n = (1 + cxij + ϵ2 1x2 ij G2 )(1 ecxijϵ2 n + e2cxijϵ2 2 n2 ) = (1 + cxij)(1 ecxijϵ2 n + e2cxijϵ2 2 n2 ) + ϵ2 1x2 ij G2 (1 ecxijϵ2 n + e2cxijϵ2 2 n2 ) |A (1 + cxij)| = | (1 + cxij)ecxijϵ2 n + e2cxijϵ2 2 n2 + ϵ2 1x2 ij G2 (1 ecxijϵ2 n + e2cxijϵ2 2 n2 )| ϵ2 1(eϵ1/C1ϵ2 n + 2e2ϵ1/C1ϵ2 2 n2 ) + eϵ1/C1ϵ2 n Hence if ϵ2 = ϵ/4 and ϵ1 = C1 log(nϵ) we have that the total error is less than ϵ. Looped Transformers as Programmable Computers Function approximation. The error in Lemma 10 is an immediate consequence of Theorem 2 and it is proportional to 1/ m, where m is the number of heads we are using. Accumulation of error after T operations. Fix an ϵ > 0 and assume that in the t th iteration the input is Xt = X t + ϵt Mt, where X t is the ideal input 0 < ϵt < tϵ T and Mt is a matrix such that Mt 1, we will show that Xt+1 = X t+1 + ϵt+1Mt+1, where X t+1 is the ideal input, 0 < ϵt+1 < (t + 1)ϵ T and Mt+1 is a matrix such that Matrix Multiplication with a matrix A, A 16 will have the following result: AXt + ϵ = AX t + ϵt AMt + ϵ M = X t+1 + (ϵt + ϵ )Mt+1 where ϵ is controlled by the constants we use in the design of the function block and Mt+1 is some matrix with Mt+1 1. If now ϵ < ϵ T , our claim follows. Read/Write operations will result to an error of Xt P = Xt P + ϵ M = X t P + ϵt Mt P + ϵ M Notice that as before, since M 1 and Mt P 1 and thus we have Xt+1 = Xt P = X t+1 + ϵt+1Mt+1, where ϵt+1 = ϵt + ϵ . Again if ϵ ϵ T the result follows. The result for function approximation follows in a similar way. H A Basic Calculator We show that the FLEQ transformer introduced in the previous section, can be used to build a simple calculator. This transformer consists of six transformer-based function blocks that implement addition, substraction, multiplication, percentage, division and square root. The formal statement is written as below. Theorem 3. There exists a transformer with 12 layers, m heads and dimensionality O(log n) that uses the Unified Attention Based Computer framework in Section 5.2 to implement a calculator which can perform addition, subtraction, multiplication, and computing the inverse, square root and percentage. For computing the inverse and square root, the operand needs to be in the range [ e O(m), Ω( 1 m)] [ Ω( 1 m), e O(m)] and [0, O(m2)] respectively, and the returned output is correct up to an error of O(1/ m) and O(1/m) respectively. Here, n is the number of operations to be performed. Remark 2. In the proof of this theorem, we use Lemma 10 to approximate the square root and the inversion function. That lemma provides error guarantees in terms of the number of heads m. We prove Corollary 2 in the appendix which provides equivalent error guarantees, but where the error decreases with the dimension d of the transformer. Depending on the design choices of the transformer, either of the results can be used, and the calculator s error guarantee will also change accordingly. We show how one can implement a calculator in our FLEQ framework in Alg. 7. 
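Algorithm 7 below chains six such primitive function calls. As a plain-arithmetic sanity check of that chain (our own, written outside the FLEQ framework), the intended computation is:

```python
import math

# The chain of primitive calls used in Algorithm 7 below, written as ordinary
# arithmetic; in the FLEQ version each intermediate result is stored in mem[t].
def calculator_chain(a, b, c, d):
    t = a + b          # f_add
    t = t - c          # f_sub
    t = t * d          # f_mul
    t = 1.0 / t        # f_inv
    t = math.sqrt(t)   # f_sqrt
    t = t / 100.0      # f_perc (multiplication by 0.01)
    return t

print(calculator_chain(7.0, 5.0, 2.0, 10.0))   # sqrt(1/((7+5-2)*10))/100 = 0.001
```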
6Notice that this can be assumed without loss of generality, since we can normalize all the errors with the maximum norm of a matrix to the power of T. Looped Transformers as Programmable Computers Algorithm 7 A sample program for executing a basic calculator functionality. The following algorithm performs 1/(((a+b) c) d) 100 Require: mem[p] = a, mem[q] = b, mem[r] = c, mem[s] = d. { Location of the inputs.} 1: mem[t] = fadd(mem[p], mem[q]) {mem[t] = a + b.} 2: mem[t] = fsub(mem[t], mem[r]) {mem[t] = (a + b) c.} 3: mem[t] = fmul(mem[t], mem[s]) {mem[t] = ((a + b) c) d.} 4: mem[t] = finv(mem[t]) {mem[t] = 1/((a + b) c) d.} 5: mem[t] = fsqrt(mem[t]) {mem[t] = p 1/((a + b) c) d.} 6: mem[t] = fperc(mem[t]) {mem[t] = 1/((a+b) c) d Looking at the algorithm, it is clear that for proving the theorem above, it is sufficient to implement the 6 functions (addition, subtraction, multiplication, inversion, square root and percentage) using the transformer-based function blocks defined in Definition 1. We start with two lemmas, which can be proved by constructing transformers that add and subtract in a similar way to the OISC transformer constructed in Section 5.1. Lemma 14 (addition). There exists a transformer-based function block with 3 layers, 1 head and dimensionality O(1) which can implement f(a, b) = a + b. Proof. Consider the input in the form of Equation (8) a 0 b 0 0 0 0 0 0 0 0 0 p2d+1 0 0 0 0 0 0 p2:d pd+1 pd+2:2d p2d+1 p2d+2:3d 1 0 0 0 0 0 We can perform the following transformation a 0 b 0 0 0 0 0 0 0 0 0 " a 0 b 0 0 0 a 0 b 0 0 0 0 0 0 0 0 0 " a 0 0 0 0 0 0 0 b 0 0 0 0 0 0 0 0 0 " a + b 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 " a + b 0 0 0 a + b 0 0 0 b 0 0 0 0 0 0 0 0 0 The first and second step are implemented with one feed-forward layer each. The third step with the Section 4.2. We have ignored the last three rows since we don t change them and we only use them for the last step. Note that in addition as well as the rest of the operations in this proof, softmax leads to an extra error, which can be driven arbitrarily close to 0 by increasing its temperature. Lemma 15 (subtraction). There exists a transformer-based function block with 3 layers, 1 head and dimensionality O(1) which can implement f(a, b) = a b. This lemma can be proved in the exact same way as the previous one. In addition, we can use the theory presented in Lemma 13 to get the following corollaries: Corollary 6 (multiplication). There exists a transformer-based function block with 2 layers, 1 head and dimensionality O(d) which can implement f(a, b) = ab. Looped Transformers as Programmable Computers Corollary 7 (percentage). There exists a transformer-based function block with 2 layers, 1 head and dimensionality O(1) which can implement f(a) = a/100 = a 0.01. To implement inversion function, we first show that we can approximate inversion with threshold activations, then we can easily conclude that we can also approximate it with sigmoids. Lemma 16. Given two constants ϵ, δ [0, 1], there exists a 1 hidden layer neural network f with threshold activation and d activations in the hidden layer, such that x [ C, δ] [δ, C] , f(x) 1 as long as d = Ω( log(1/(ϵδ)) ϵδ + log C). Proof. We partition [δ, C] into the following intervals [δ, δ(1 + ϵδ)), [δ(1 + ϵδ), δ(1 + ϵδ)(1 + ϵδ(1 + ϵδ))) . . . , [ai, ai(1 + ϵai)), . . . , that is, if an interval begins at a, then it ends at a(1 + ϵa). Note that for any point x [ai, ai(1 + ϵai)) 1 x 1 ai 1 ai(1 + ϵai) = ϵ 1 + ϵai < ϵ. Hence two output activations of the form 1 ai 1x