Published as a conference paper at ICLR 2020

LOGIC AND THE 2-SIMPLICIAL TRANSFORMER

James Clift, Dmitry Doryn, Daniel Murfet, James Wallbridge
{jamesedwardclift,dmitry.doryn,james.wallbridge}@gmail.com
Department of Mathematics, University of Melbourne, d.murfet@unimelb.edu.au
Listing order is alphabetical. Correspondence to d.murfet@unimelb.edu.au.

ABSTRACT

We introduce the 2-simplicial Transformer, an extension of the Transformer which includes a form of higher-dimensional attention generalising the dot-product attention, and uses this attention to update entity representations with tensor products of value vectors. We show that this architecture is a useful inductive bias for logical reasoning in the context of deep reinforcement learning.

1 INTRODUCTION

Deep learning contains many differentiable algorithms for computing with learned representations. These representations form vector spaces, sometimes equipped with additional structure. A recent example is the Transformer (Vaswani et al., 2017) in which there is a vector space $V$ of value vectors and an inner product space $H$ of query and key vectors. This structure supports a kind of message passing, where a value vector $v_j \in V$ derived from entity $j$ is propagated to update an entity $i$ with weight $q_i \cdot k_j$, where $q_i \in H$ is a query vector derived from entity $i$, $k_j \in H$ is a key vector derived from entity $j$, and the inner product on $H$ is written as a dot product. The Transformer therefore represents a relational inductive bias, where a relation from entity $j$ to entity $i$ is perceived to the extent that $q_i \cdot k_j$ is large and positive.

However, the real world has structure beyond entities and their direct relationships: for example, the three blocks in Figure 1 are arranged in such a way that if either of the supporting blocks is removed, the top block will fall. This is a simple 3-way relationship between entities $i, j, k$ that is complex to represent as a system of 2-way relationships. It is natural to make the hypothesis that such higher-order relationships are essential to extracting the full predictive power of data, across many domains.

Figure 1: A 3-way relationship between blocks.

In accordance with this hypothesis, we introduce a generalisation of the Transformer architecture, the 2-simplicial Transformer, which incorporates both 2- and 3-way interactions. Mathematically, the key observation is that higher-order interactions between entities can be understood using algebras. This is nothing but Boole's insight (Boole, 1847) which set in motion the development of modern logic. In our situation, an appropriate algebra is the Clifford algebra $\mathrm{Cl}(H)$ of the space $H$ of queries and keys, which contains that space ($H \subseteq \mathrm{Cl}(H)$) and in which queries and keys can be multiplied. To represent a 3-way interaction we map each entity $i$ to a triple $(p_i, l^1_i, l^2_i)$ of vectors in $H$ consisting of a query vector $p_i$, a (first) key vector $l^1_i$ and a (second) key vector $l^2_i$. Given a triple $i, j, k$ we first form the product $p_i l^1_j l^2_k$ in the Clifford algebra, and then extract a scalar quantity $\eta(p_i l^1_j l^2_k)$ using a natural continuous function $\eta : \mathrm{Cl}(H) \to \mathbb{R}$ associated to the $\mathbb{Z}$-grading of $\mathrm{Cl}(H)$. This scalar measures how strongly the network perceives a 3-way interaction involving $i, j, k$.
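As a concrete illustration of this message passing (and to fix notation for what follows), here is a minimal NumPy sketch of a single head of ordinary dot-product attention, with the weights $q_i \cdot k_j$ normalised by a softmax as in the Transformer. The entity count, dimensions and random weights are illustrative only and are not taken from the paper.

```python
# Single-head dot-product attention: entity i is updated by value vectors v_j,
# weighted by softmax_j(q_i . k_j).  Illustrative sizes and random weights.
import numpy as np

N, d_v, d_h = 5, 8, 4                        # number of entities, dim(V), dim(H)
rng = np.random.default_rng(0)
e = rng.normal(size=(N, d_v))                # entity representations in V
W_Q, W_K = rng.normal(size=(2, d_v, d_h))    # query and key maps V -> H
W_V = rng.normal(size=(d_v, d_v))            # value map V -> V

q, k, v = e @ W_Q, e @ W_K, e @ W_V
logits = q @ k.T                             # logits[i, j] = q_i . k_j
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)    # softmax over j
update = weights @ v                         # row i is sum_j softmax_j(q_i . k_j) v_j
```

The 2-simplicial attention introduced below keeps this structure but replaces the logits $q_i \cdot k_j$ by triple products of one query and two keys, and the messages $v_j$ by images of tensor products $u_j \otimes u_k$ of value vectors.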
In summary, the 2-simplicial Transformer learns how to represent entities in its environment as vectors $v \in V$, and how to transform those entities to queries and (pairs of) keys in $H$, so that the signals provided by the scalars $q_i \cdot k_j$ and $\eta(p_i l^1_j l^2_k)$ are informative about higher-order structure in the environment. As a toy example of higher-order structure, we consider the reinforcement learning problem in a variant of the Box World environment from (Zambaldi et al., 2019). The original Box World is played on a rectangular grid populated by keys and locked boxes of varying colours, with the goal being to open the box containing the Gem. In our variant of the Box World environment, bridge Box World, the agent must use two keys simultaneously to obtain the Gem; this structure in the environment creates many 3-way relationships between entities, including for example the relationship between the locked boxes $j, k$ providing the two keys and the Gem entity $i$. This structure in the environment is fundamentally logical in nature, and encodes a particular kind of conjunction; see Appendix I.

The architecture of our deep reinforcement learning agent largely follows (Zambaldi et al., 2019) and the details are given in Section 4. The key difference between our simplicial agent and the relational agent of (Zambaldi et al., 2019) is that in place of a standard Transformer block we use a 2-simplicial Transformer block. Our experiments show that the simplicial agent confers an advantage over the relational agent as an inductive bias in our reasoning task. Motivation from neuroscience for a simplicial inductive bias for abstract reasoning is contained in Appendix J.

Our use of tensor products of value vectors is inspired by the semantics of linear logic in vector spaces (Girard, 1987; Melliès, 2009; Clift & Murfet, 2017; Wallbridge, 2018), in which an algorithm with multiple inputs computes on the tensor product of those inputs, but this is an old idea in natural language processing, used in models including the second-order RNN (Giles et al., 1989; Pollack, 1991; Goudreau et al., 1994; Giles et al., 1991), the multiplicative RNN (Sutskever et al., 2011; Irsoy & Cardie, 2015), the Neural Tensor Network (Socher et al., 2013) and the factored 3-way Restricted Boltzmann Machine (Ranzato et al., 2010); see Appendix A. Tensors have been used to model predicates in a number of neural network architectures aimed at logical reasoning (Serafini & Garcez, 2016; Dong et al., 2019). The main novelty in our model lies in the introduction of the 2-simplicial attention, which allows these ideas to be incorporated into the Transformer architecture.

2 2-SIMPLICIAL TRANSFORMER

In this section we first review the definition of the ordinary Transformer block and then explain the 2-simplicial Transformer block. We distinguish between the Transformer architecture, which contains a word embedding layer, an encoder and a decoder (Vaswani et al., 2017), and the Transformer block, which is the sub-model of the encoder that is repeated. The fundamental idea, of propagating information between nodes using weights that depend on the dot product of vectors associated to those nodes, comes ultimately from statistical mechanics via the Hopfield network (Appendix B).

The ordinary and 2-simplicial Transformer blocks define operators on sequences $e_1, \ldots, e_N$ of entity representations. Strictly speaking the entities are indices $1 \le i \le N$, but we sometimes identify the entity $i$ with its representation $e_i$.
The space of entity representations is denoted $V$, while the space of query, key and value vectors is denoted $H$. We use only the vector space structure on $V$, but $H = \mathbb{R}^d$ is an inner product space with the usual dot product pairing $(h, h') \mapsto h \cdot h'$, and in defining the 2-simplicial Transformer block we will use additional algebraic structure on $H$, including the multiplication tensor $B : H \otimes H \to H$ of (10) (used to propagate tensor products of value vectors) and the Clifford algebra of $H$ (used to define the 2-simplicial attention).

In the first step of the standard Transformer block we generate from each entity $e_i$ a tuple of vectors via a learned linear transformation $E : V \to H^{\oplus 3}$. These vectors are referred to respectively as query, key and value vectors and we write

$(q_i, k_i, v_i) = E(e_i)$.  (1)

Stated differently, $q_i = W^Q e_i$, $k_i = W^K e_i$, $v_i = W^V e_i$ for weight matrices $W^Q, W^K, W^V$. In the second step we compute a refined value vector $v'_i$ for each entity

$v'_i = \sum_{j=1}^{N} \frac{e^{q_i \cdot k_j}}{\sum_{s=1}^{N} e^{q_i \cdot k_s}}\, v_j = \sum_{j=1}^{N} \mathrm{softmax}(q_i \cdot k_1, \ldots, q_i \cdot k_N)_j\, v_j$.  (2)

Finally, the new entity representation $e'_i$ is computed by the application of a feedforward network $g_\theta$, layer normalisation and a skip connection

$e'_i = \mathrm{LayerNorm}\big(g_\theta(v'_i) + e_i\big)$.  (3)

Remark 2.1. In the introduction we referred to the idea that a Transformer model learns representations of relations. To be more precise, these representations are heads, each of which determines an independent set of transformations $W^Q, W^K, W^V$ which extract queries, keys and values from entities. Thus a head determines not only which entities are related (via $W^Q, W^K$) but also what information to transmit between them (via $W^V$). In multiple-head attention with $K$ heads, there are $K$ channels along which to propagate information between every pair of entities, each of dimension $\dim(H)/K$. More precisely, we choose a decomposition $H = H_1 \oplus \cdots \oplus H_K$ so that $E : V \to \bigoplus_{u=1}^{K} (H_u^{\oplus 3})$ and write

$(q_{i,(1)}, k_{i,(1)}, v_{i,(1)}, \ldots, q_{i,(K)}, k_{i,(K)}, v_{i,(K)}) = E(e_i)$.

To compute the output of the attention, we take a direct sum of the value vectors propagated along every one of these $K$ channels, as in the formula

$e'_i = \mathrm{LayerNorm}\Big(g_\theta\Big[\bigoplus_{u=1}^{K} \sum_{j=1}^{N} \mathrm{softmax}(q_{i,(u)} \cdot k_{1,(u)}, \ldots, q_{i,(u)} \cdot k_{N,(u)})_j\, v_{j,(u)}\Big] + e_i\Big)$.  (4)

In combinatorial topology the canonical one-dimensional object is the 1-simplex (or edge) $j \to i$. Since the standard Transformer model learns representations of relations, we refer to this form of attention as 1-simplicial attention. The canonical two-dimensional object is the 2-simplex (or triangle), which we may represent diagrammatically in terms of indices $i, j, k$ as a triangle with vertices $i$, $j$ and $k$.  (5)

In the 2-simplicial Transformer block, in addition to the 1-simplicial contribution, each entity $e_i$ is updated as a function of pairs of entities $e_j, e_k$ using the tensor product of value vectors $u_j \otimes u_k$ and a probability distribution derived from a scalar triple product $\langle p_i, l^1_j, l^2_k \rangle$ in place of the scalar product $q_i \cdot k_j$. This means that we associate to each entity $e_i$ a four-tuple of vectors via a learned linear transformation $E : V \to H^{\oplus 4}$, denoted

$(p_i, l^1_i, l^2_i, u_i) = E(e_i)$.  (6)

We still refer to $p_i$ as the query, $l^1_i, l^2_i$ as the keys and $u_i$ as the value. Stated differently, $p_i = W^P e_i$, $l^1_i = W^{L_1} e_i$, $l^2_i = W^{L_2} e_i$ and $u_i = W^U e_i$ for weight matrices $W^P, W^{L_1}, W^{L_2}, W^U$.

Definition 2.2. The unsigned scalar triple product of $a, b, c \in H$ is

$\langle a, b, c \rangle = \big\| (a \cdot b)c - (a \cdot c)b + (b \cdot c)a \big\|$  (7)

whose square is a polynomial in the pairwise dot products

$\langle a, b, c \rangle^2 = (a \cdot b)^2 (c \cdot c) + (b \cdot c)^2 (a \cdot a) + (a \cdot c)^2 (b \cdot b) - 2 (a \cdot b)(a \cdot c)(b \cdot c)$.  (8)
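As a quick numerical check of Definition 2.2, the following NumPy sketch (illustrative dimension, random vectors; not code from the paper) evaluates the triple product via (7) and verifies the polynomial identity (8).

```python
import numpy as np

def triple_product(a, b, c):
    # <a, b, c> = || (a.b) c - (a.c) b + (b.c) a ||, cf. (7)
    return np.linalg.norm(a.dot(b) * c - a.dot(c) * b + b.dot(c) * a)

def triple_product_squared(a, b, c):
    # the polynomial in pairwise dot products on the right hand side of (8)
    ab, ac, bc = a.dot(b), a.dot(c), b.dot(c)
    return ab**2 * c.dot(c) + bc**2 * a.dot(a) + ac**2 * b.dot(b) - 2 * ab * ac * bc

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 16))
assert np.isclose(triple_product(a, b, c) ** 2, triple_product_squared(a, b, c))
```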
This scalar triple product has a simple geometric interpretation in terms of the volume of the tetrahedron with vertices $0, a, b, c$. To explain, recall that the triangle spanned by two unit vectors $a, b$ in $\mathbb{R}^2$ has an area $A$ which can be written in terms of the dot product of $a$ and $b$. In three dimensions, the analogous formula involves the volume $V$ of the tetrahedron with vertices given by unit vectors $a, b, c$, and the scalar triple product, as shown in Figure 2. In general, given nonzero vectors $a, b, c$ let $\hat a, \hat b, \hat c$ denote unit vectors in the same directions. Then we can by Lemma C.10(v) factor out the lengths in the scalar triple product

$\langle a, b, c \rangle = \|a\|\, \|b\|\, \|c\|\, \langle \hat a, \hat b, \hat c \rangle$  (9)

so that a general scalar triple product can be understood in terms of the vector norms and configurations of three points on the 2-sphere.

Figure 2: The geometry of 1- and 2-simplicial attention. Left: the dot product in terms of the area $A$ in $\mathbb{R}^2$: $(a \cdot b)^2 = 1 - 4A^2$. Right: the triple product in terms of the volume $V$ in $\mathbb{R}^3$: $\langle a, b, c \rangle^2 = 1 - 36V^2$.

One standard approach to calculating volumes of such tetrahedra is the cross product, which is only defined in three dimensions. Since the space of representations $H$ is high dimensional, the natural framework for the scalar triple product $\langle a, b, c \rangle$ is instead the Clifford algebra of $H$ (see Appendix C). For present purposes, we need to know that $\langle a, b, c \rangle$ attains its minimum value (which is zero) when $a, b, c$ are pairwise orthogonal, and attains its maximum value (which is $\|a\|\,\|b\|\,\|c\|$) if and only if $\{a, b, c\}$ is linearly dependent (Lemma C.10).

Using the number $\langle p_i, l^1_j, l^2_k \rangle$ as a measure of the degree to which entity $i$ is attending to $(j, k)$, or put differently, the degree to which the network predicts the existence of a 2-simplex $(i, j, k)$, the update rule for the entities when using purely 2-simplicial attention is

$v'_i = \sum_{j,k=1}^{N} \frac{e^{\langle p_i, l^1_j, l^2_k \rangle}}{\sum_{s,t=1}^{N} e^{\langle p_i, l^1_s, l^2_t \rangle}}\, B(u_j \otimes u_k)$  (10)

where $B : H \otimes H \to H$ is a learned linear transformation. Although we do not impose any further constraints, the motivation here is to equip $H$ with the structure of an algebra; in this respect we model conjunction by multiplication, an idea going back to Boole (Boole, 1847).

We compute multiple-head 2-simplicial attention in the same way as in the 1-simplicial case. To combine 1-simplicial heads (that is, ordinary Transformer heads) and 2-simplicial heads we use separate inner product spaces $H^1, H^2$ for each simplicial dimension, so that there are learned linear transformations $E^1 : V \to (H^1)^{\oplus 3}$, $E^2 : V \to (H^2)^{\oplus 4}$ and the queries, keys and values are extracted from an entity $e_i$ according to

$(q_i, k_i, v_i) = E^1(e_i)$, $\quad (p_i, l^1_i, l^2_i, u_i) = E^2(e_i)$.

The update rule (for a single head in each simplicial dimension) is then:

$v'_i = \Big\{ \sum_{j=1}^{N} \frac{e^{q_i \cdot k_j}}{\sum_{s=1}^{N} e^{q_i \cdot k_s}}\, v_j \Big\} \oplus \mathrm{LayerNorm}\Big\{ \sum_{j,k=1}^{N} \frac{e^{\langle p_i, l^1_j, l^2_k \rangle}}{\sum_{s,t=1}^{N} e^{\langle p_i, l^1_s, l^2_t \rangle}}\, B(u_j \otimes u_k) \Big\}$,  (11)

$e'_i = \mathrm{LayerNorm}\big(g_\theta(v'_i) + e_i\big)$.  (12)

If there are $K_1$ heads of 1-simplicial attention and $K_2$ heads of 2-simplicial attention, then (11) is modified in the obvious way using $H^1 = \bigoplus_{u=1}^{K_1} H^1_u$ and $H^2 = \bigoplus_{u=1}^{K_2} H^2_u$.

Remark 2.3. Without the additional layer normalisation on the output of the 2-simplicial attention we find that training is unstable. The natural explanation is that these outputs are constructed from polynomials of higher degree than the 1-simplicial attention, and thus computational paths that go through the 2-simplicial attention will be more vulnerable to exploding or vanishing gradients.
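To make the purely 2-simplicial update (10) concrete, here is a single-head NumPy sketch. The dimensions, the random weights and the representation of $B$ as a dense $d^2 \times d$ matrix are illustrative choices, not the implementation used for the experiments below.

```python
import numpy as np

def triple_product(a, b, c):
    return np.linalg.norm(a.dot(b) * c - a.dot(c) * b + b.dot(c) * a)

def two_simplicial_attention(p, l1, l2, u, B):
    """p, l1, l2, u: [N, d] query, keys and values; B: [d*d, d] learned map H (x) H -> H."""
    N, d = u.shape
    logits = np.array([[[triple_product(p[i], l1[j], l2[k])
                         for k in range(N)] for j in range(N)] for i in range(N)])
    w = np.exp(logits - logits.max(axis=(1, 2), keepdims=True))
    w /= w.sum(axis=(1, 2), keepdims=True)              # softmax over the pairs (j, k)
    pair_msgs = np.einsum('jx,ky->jkxy', u, u).reshape(N, N, d * d) @ B   # B(u_j (x) u_k)
    return np.einsum('ijk,jkd->id', w, pair_msgs)       # refined value v'_i for each entity i

rng = np.random.default_rng(0)
N, d = 6, 8
p, l1, l2, u = rng.normal(size=(4, N, d))
B = rng.normal(size=(d * d, d)) / d
v_prime = two_simplicial_attention(p, l1, l2, u, B)     # shape (N, d)
```

In the agent described below, the pairs $(j, k)$ range only over a small set of virtual entities as in (13), which replaces the two inner loops of length $N$ by loops of length $M$.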
The time complexity of 1-simplicial attention as a function of the number of entities is $O(N^2)$, while the time complexity of 2-simplicial attention is $O(N^3)$, since we have to calculate the attention for every triple $(i, j, k)$ of entities. For this reason we consider only triples $(i, j, k)$ where the base of the 2-simplex $(j, k)$ is taken from a set of pairs predicted by the ordinary attention, which we view as the primary locus of computation. More precisely, we introduce in addition to the $N$ entities (now referred to as standard entities) a set of $M$ virtual entities $e_{N+1}, \ldots, e_{N+M}$. These virtual entities serve as a scratch pad onto which the iterated ordinary attention can write representations, and we restrict $j, k$ to lie in the range $N < j, k \le N + M$ so that only value vectors obtained from virtual entities are propagated by the 2-simplicial attention.

With virtual entities the update rule is, for $1 \le i \le N$,

$v'_i = \Big\{ \sum_{j=1}^{N} \frac{e^{q_i \cdot k_j}}{\sum_{s=1}^{N} e^{q_i \cdot k_s}}\, v_j \Big\} \oplus \mathrm{LayerNorm}\Big\{ \sum_{j,k=N+1}^{N+M} \frac{e^{\langle p_i, l^1_j, l^2_k \rangle}}{\sum_{s,t=N+1}^{N+M} e^{\langle p_i, l^1_s, l^2_t \rangle}}\, B(u_j \otimes u_k) \Big\}$  (13)

and, for $N < i \le N + M$,

$v'_i = \Big\{ \sum_{j=1}^{N+M} \frac{e^{q_i \cdot k_j}}{\sum_{s=1}^{N+M} e^{q_i \cdot k_s}}\, v_j \Big\} \oplus \mathrm{LayerNorm}(u_i)$.  (14)

The updated representation $e'_i$ is computed from $v'_i, e_i$ using (12) as before. Observe that the virtual entities are not used to update the standard entities during 1-simplicial attention, and the 2-simplicial attention is not used to update the virtual entities; instead the second summand in (14) involves the vector $u_i = W^U e_i$, which adds recurrence to the update of the virtual entities. After the attention phase the virtual entities are discarded. The method for updating the virtual entities is similar to the role of the memory nodes in the relational recurrent architecture of (Santoro et al., 2018), the master node in (Gilmer et al., 2017, §5.2) and memory slots in the Neural Turing Machine (Graves et al., 2014). The update rule has complexity $O(NM^2)$ and so if we take $M$ to be of order $\sqrt{N}$ we get the desired complexity $O(N^2)$.

3 RL ENVIRONMENT

The environment in our reinforcement learning problem is a variant of the Box World environment from (Zambaldi et al., 2019). The standard Box World environment is a rectangular grid in which are situated the player (a dark gray tile) and a number of locked boxes, represented by a pair of horizontally adjacent tiles with a tile of colour $x$, the key colour, on the left and a tile of colour $y$, the lock colour, on the right. There is also one loose key in each episode, which is a coloured tile not initially adjacent to any other coloured tile. All other tiles are blank (light gray) and are traversable by the player. The rightmost column of the screen is the inventory, which fills from the top and contains keys that have been collected by the player. The player can pick up any loose key by walking over it. In order to open a locked box with key and lock colours $x, y$, the player must step on the lock while in possession of a copy of $y$, in which case one copy of this key is removed from the inventory and replaced by a key of colour $x$. The goal is to attain a white key, referred to as the Gem (represented by a white square), as shown in the sample episode of Figure 3. In this episode, there is a loose pink key (marked 1) which can be used to open one of two locked boxes, obtaining in this way either key 5 or key 2¹. The correct choice is 2, since this leads via the sequence of keys 3, 4 to the Gem. Some locked boxes, if opened, provide keys that are not useful for attaining the Gem.
Since each key may only be used once, opening such boxes means the episode is rendered unsolvable. Such boxes are called distractors. An episode ends when the player either obtains the Gem (with a reward of +10) or opens a distractor box (reward −1). Opening any non-distractor box, or picking up a loose key, garners a reward of +1. The solution length is the number of locked boxes (including the one with the Gem) in the episode on the path from the loose key to the Gem.

¹The agent sees only the colours of tiles, not the numbers, which are added here for exposition.

Figure 3: Right: a sample episode of the Box World environment. The rightmost column is the player inventory, currently empty. Left: graph representation of the puzzle, with key colours as vertices and an arrow $C \to D$ if key $C$ can be used to obtain key $D$.

Our variant of the Box World environment, bridge Box World, is shown in Figure 4. In each episode two keys are now required to obtain the Gem, and there are therefore two loose keys on the board. To obtain the Gem, the player must step on either of the lock tiles with both keys in the inventory, at which point the episode ends with the usual +10 reward. Graphically, Gems with multiple locks are denoted with two vertical white tiles on the left and the two lock tiles on the right. Two solution paths (of the same length) leading to each of the locks on the Gem are generated with no overlapping colours, beginning with two loose keys. In episodes with multiple locks we do not consider distractor boxes of the old kind; instead there is a new type of distractor that we call a bridge. This is a locked box whose lock colour is taken from one solution branch and whose key colour is taken from the other branch. Opening the bridge renders the puzzle unsolvable. An episode ends when the player either obtains the Gem (reward +10) or opens the bridge (reward −1). Opening a box other than the bridge, or picking up a loose key, has a reward of +1 as before. In this paper we consider episodes with zero or one bridge (the player cannot fail to solve an episode with no bridge).

Figure 4: Right: a sample episode of the bridge Box World environment, in which the Gem has two locks and there is a marked bridge. Left: graph representation of the puzzle, with upper and lower solution paths and the bridge between them. There is a source involving the orange key and a sink involving the purple lock.

Standard Box World is straightforward for an agent to solve using relational reasoning, because leaves on the solution graph can be identified (their key colour appears only once on the board) and, by propagating this information backwards along the arrows on the solution graph, an agent can identify distractors. Bridge Box World emphasises reasoning about 3-way relationships (or 2-simplices). The following 2-simplex motifs appear in all solution graphs, where a pair of boxes $(\alpha, \beta)$ is a source if they have the same lock colour but distinct key colours, and a sink if they have the same key colour but distinct lock colours (the 2-simplex leading to the Gem being an example). If $(\alpha, \beta)$ is a source or a sink then either $\alpha$ is the bridge or $\beta$ is the bridge. If the agent can observe both a source and a sink then it can locate the bridge. It is less clear how to identify bridges using iterated relational reasoning, because every path in the solution graph eventually reaches the Gem.
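To spell out this source/sink inference, the following sketch runs on a hypothetical symbolic encoding of the boxes as (key colour, lock colour) pairs; the agent itself only ever receives pixels, so this illustrates the logic of the task rather than the model.

```python
from itertools import combinations

def bridge_candidates(boxes):
    """Boxes are (key_colour, lock_colour) pairs; return boxes lying in both a source and a sink."""
    in_source, in_sink = set(), set()
    for a, b in combinations(boxes, 2):
        if a[1] == b[1] and a[0] != b[0]:    # source: same lock colour, distinct key colours
            in_source.update((a, b))
        if a[0] == b[0] and a[1] != b[1]:    # sink: same key colour, distinct lock colours
            in_sink.update((a, b))
    return in_source & in_sink               # the bridge always lies in this intersection

# Hypothetical episode: upper path u1 -> u2 -> Gem, lower path m1 -> m2 -> Gem, and a bridge
# whose lock colour (u1) comes from the upper path and key colour (m2) from the lower path.
boxes = [("u2", "u1"), ("gem", "u2"), ("m2", "m1"), ("gem", "m2"), ("m2", "u1")]
print(bridge_candidates(boxes))              # {('m2', 'u1')}: the bridge
```

Intersecting a single observed source with a single observed sink already determines the bridge.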
4 RL AGENT ARCHITECTURE

Our baseline relational agent is modeled closely on (Zambaldi et al., 2019), except that we found that a different arrangement of layer normalisations worked better in our experiments; see Remark 4.1. The code for our implementation of both agents is available online (Clift et al., 2019). In the following we describe the network architecture of both the relational and simplicial agent; we will note the differences between the two models as they arise.

The input to the agent's network is an RGB image, represented as a tensor of shape $[R, C+1, 3]$ (i.e. an element of $\mathbb{R}^R \otimes \mathbb{R}^{C+1} \otimes \mathbb{R}^3$) where $R$ is the number of rows and $C$ the number of columns (the $C+1$ is due to the inventory). This tensor is divided by 255 and then passed through a $2 \times 2$ convolutional layer with 12 features, and then a $2 \times 2$ convolutional layer with 24 features. Both activation functions are ReLU and the padding on our convolutional layers is "valid", so that the output has shape $[R-2, C-1, 24]$. We then multiply by a weight matrix of shape $24 \times 62$ to obtain a tensor of shape $[R-2, C-1, 62]$. Each feature vector has concatenated to it a two-dimensional positional encoding, and then the result is reshaped into a tensor of shape $[N, 64]$ where $N = (R-2)(C-1)$ is the number of Transformer entities. This is the list $(e_1, \ldots, e_N)$ of entity representations $e_i \in V = \mathbb{R}^{64}$.

In the case of the simplicial agent, a further two learned embedding vectors $e_{N+1}, e_{N+2}$ are added to this list; these are the virtual entities. So with $M = 0$ in the case of the relational agent and $M = 2$ for the simplicial agent, the entity representations form a tensor of shape $[N+M, 64]$. This tensor is then passed through two iterations of the Transformer block (either purely 1-simplicial in the case of the relational agent, or including both 1- and 2-simplicial attention in the case of the simplicial agent). In the case of the simplicial agent the virtual entities are then discarded, so that in both cases we have a sequence of entities $e'_1, \ldots, e'_N$. Inside each block are two feedforward layers separated by a ReLU activation with 64 hidden nodes; the weights are shared between iterations of the Transformer block. In the 2-simplicial Transformer block the input tensor, after layer normalisation, is passed through the 2-simplicial attention and the result (after an additional layer normalisation) is concatenated to the output of the 1-simplicial attention heads before being passed through the feedforward layers. The pseudo-code for the ordinary and 2-simplicial Transformer blocks is:

    def transformer_block(e):
        x = LayerNorm(e)
        a = 1SimplicialAttention(x)
        b = DenseLayer1(a)
        c = DenseLayer2(b)
        r = Add([e, c])
        eprime = LayerNorm(r)
        return eprime

    def simplicial_transformer_block(e):
        x = LayerNorm(e)
        a1 = 1SimplicialAttention(x)
        a2 = 2SimplicialAttention(x)
        a2n = LayerNorm(a2)
        ac = Concatenate([a1, a2n])
        b = DenseLayer1(ac)
        c = DenseLayer2(b)
        r = Add([e, c])
        eprime = LayerNorm(r)
        return eprime

Our implementation of the standard Transformer block is based on an implementation in Keras from (Mavreshko, 2019).
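To make the data flow of the pseudo-code above explicit, here is a minimal NumPy sketch of the simplicial block with the two attention maps abstracted as callables; the LayerNorm has no learned scale or bias, and all shapes and weights are illustrative rather than those of the trained agent.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalise each entity representation across its feature dimension
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def simplicial_transformer_block(e, attn1, attn2, W1, b1, W2, b2):
    x = layer_norm(e)
    a1 = attn1(x)                           # [N+M, d1]  output of the 1-simplicial heads
    a2n = layer_norm(attn2(x))              # [N+M, d2]  2-simplicial head, extra LayerNorm
    ac = np.concatenate([a1, a2n], axis=-1)
    h = np.maximum(ac @ W1 + b1, 0.0)       # DenseLayer1 with ReLU
    c = h @ W2 + b2                         # DenseLayer2
    return layer_norm(e + c)                # skip connection followed by LayerNorm

rng = np.random.default_rng(0)
e = rng.normal(size=(42, 64))               # e.g. N = 40 standard entities plus M = 2 virtual
W1, b1 = rng.normal(size=(64 + 48, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 64)) * 0.1, np.zeros(64)
attn1 = lambda x: (x @ rng.normal(size=(64, 64))) * 0.1   # stand-ins for the attention maps
attn2 = lambda x: (x @ rng.normal(size=(64, 48))) * 0.1
e_prime = simplicial_transformer_block(e, attn1, attn2, W1, b1, W2, b2)   # shape (42, 64)
```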
In both the relational and simplicial agent, the space $V$ of entity representations has dimension 64 and we denote by $H^1, H^2$ the spaces of 1-simplicial and 2-simplicial queries, keys and values. In both the relational and simplicial agent there are two heads of 1-simplicial attention, $H^1 = H^1_1 \oplus H^1_2$ with $\dim(H^1_i) = 32$. In the simplicial agent there is a single head of 2-simplicial attention with $\dim(H^2) = 48$ and two virtual entities.

The output of the Transformer blocks is a tensor of shape $[N, 64]$. To this final entity tensor we apply max-pooling over the entity dimension, that is, we compute a vector $v \in \mathbb{R}^{64}$ by the rule $v_i = \max_{1 \le j \le N} (e'_j)_i$ for $1 \le i \le 64$. This vector $v$ is then passed through four fully-connected layers with 256 hidden nodes and ReLU activations. The output of the final fully-connected layer is multiplied by one $256 \times 4$ weight matrix to produce logits for the actions (left, up, right and down) and another $256 \times 1$ weight matrix to produce the value function.

Remark 4.1. There is wide variation in the layer normalisation in Transformer models; compare (Vaswani et al., 2017; Child et al., 2019; Zambaldi et al., 2019). In (Zambaldi et al., 2019) layer normalisation occurs in two places: on the concatenation of the $Q, K, V$ matrices, and on the output of the feedforward network $g_\theta$. We keep this second normalisation but move the first from after the linear transformation $E$ of (1) to before this linear transformation, so that it is applied directly to the incoming entity representations. This ordering gave the best performing relational model in our experiments, with our results diverging even further if a direct comparison to the (Zambaldi et al., 2019) architecture was used.
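For concreteness, the following is a minimal NumPy sketch of the max-pooling and policy/value head just described; the layer widths follow the text, while the random weights and helper names are purely illustrative.

```python
import numpy as np

def agent_head(e_prime, mlp, W_policy, W_value):
    v = e_prime.max(axis=0)                  # max-pool over the N entities, giving v in R^64
    h = v
    for W, b in mlp:                         # four fully-connected ReLU layers of width 256
        h = np.maximum(h @ W + b, 0.0)
    logits = h @ W_policy                    # 256 x 4: logits for left, up, right, down
    value = (h @ W_value)[0]                 # 256 x 1: value function estimate
    return logits, value

rng = np.random.default_rng(0)
widths = [64, 256, 256, 256, 256]
mlp = [(rng.normal(size=(m, n)) / np.sqrt(m), np.zeros(n)) for m, n in zip(widths, widths[1:])]
logits, value = agent_head(rng.normal(size=(40, 64)), mlp,
                           rng.normal(size=(256, 4)), rng.normal(size=(256, 1)))
```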
5 EXPERIMENTS AND RESULTS

The training of our agents uses the implementation in Ray RLlib (Liang et al., 2018) of the distributed off-policy actor-critic architecture IMPALA of (Espeholt et al., 2018) with optimisation algorithm RMSProp. The hyperparameters for IMPALA and RMSProp are given in Table 1 of Appendix E. Following (Zambaldi et al., 2019) and other recent work in deep reinforcement learning, we use RMSProp with a large value of the hyperparameter $\varepsilon = 0.1$. As we explain in Appendix G, this is effectively RMSProp with smoothed gradient clipping.

First we verified that our implementation of the relational agent solves the Box World environment (Zambaldi et al., 2019) with a solution length sampled from $[1, 5]$ and number of distractors sampled from $[0, 4]$ on a $9 \times 9$ grid. After training for $2.35 \times 10^9$ timesteps our implementation solved over 93% of puzzles (regarding the discrepancy with the reported sample complexity in (Zambaldi et al., 2019) see Appendix D). Next we trained the relational and simplicial agent on bridge Box World, under the following conditions: half of the episodes contain a bridge, the solution length is uniformly sampled from $[1, 3]$ (both solution paths are of the same length), colours are uniformly sampled from a set of 20 colours, and the boxes and loose keys are arranged randomly on a $7 \times 9$ grid, under the constraint that the box containing the Gem does not occur in the rightmost column or bottom row, and keys appear only in positions $(y, x) = (2r, 3c-1)$ for $1 \le r \le 3$, $1 \le c \le 3$. The starting and ending point of the bridge are uniformly sampled with no restrictions (e.g. the bridge can involve the colours of the loose keys and locks on the Gem) but the lock colour is always on the top solution path. There is no curriculum and no cap on timesteps per episode. We trained four independent trials of both agents to either $5.5 \times 10^9$ timesteps or convergence, whichever came first.

In Figure 6 we give the mean and standard deviation of these four trials, showing a clear advantage of the simplicial agent. In Appendix D we make some remarks about performance comparisons that take into account the fact that the relational agent is simpler (and hence faster to execute) than the simplicial agent. The training runs for the relational and simplicial agents are shown in Figure 9 and Figure 10 of Appendix F, together with analysis and visualization of the 1- and 2-simplicial attention in specific examples.

Figure 6: Training curves of the relational and simplicial agents on bridge Box World. Shown are the mean and standard deviation of four runs of each agent, including the best run of each.

In the reported experiments we use only two Transformer blocks; we performed two trials of a relational agent using four Transformer blocks, but after $5.5 \times 10^9$ timesteps neither trial exceeded the 0.85 plateau in terms of fraction solved. Our overall results therefore suggest that the 2-simplicial Transformer is more powerful than the standard Transformer, with its performance not matched by adding greater depth. This is further supported by the fact that, on a time-adjusted basis, the 2-simplicial model still converges faster than the ordinary model; see Figure 8 of Appendix D.

We analyse the simplicial agent to establish that it has learned to use the 2-simplicial attention, and to provide some intuition for why 2-simplices are useful; additional details are in Appendix F. The analysis is complicated by the fact that our $2 \times 2$ convolutional layers (of which there are two) are not padded, so the number of entities processed by the Transformer blocks is $(R-2)(C-1)$ where the original game board is $R \times C$ and there is an extra column for the inventory (here $R$ is the number of rows). This means there is not a one-to-one correspondence between game board tiles and entities; for example, all the experiments reported in Figure 6 are on a $7 \times 9$ board, so that there are $N = 40$ Transformer entities which can be arranged on a $5 \times 8$ grid (information about this grid is passed to the Transformer blocks via the positional encoding). Nonetheless we found that for trained agents there is a strong relation between a tile in position $(y, x)$ and the Transformer entity with index $x + (C-1)(y-1) - 1$ for $(y, x) \in [1, R-2] \times [1, C-1] \subseteq [0, R-1] \times [0, C]$. This correspondence is presumed in the following analysis, and in our visualisations.

Displayed in Figure 7 are attention distributions for simplicial agent A of Figure 10. The four images in the top right show the ordinary attention of the virtual entities in the first iteration of the simplicial Transformer block: in the first head, the first virtual entity attends strongly to a particular lock, while the second head of the second virtual entity attends strongly to the corresponding key. Shown at the bottom of Figure 7 is the 2-simplicial attention in the second iteration of the simplicial Transformer block. The columns are query entities $i$ and rows are key entity pairs $(j, k)$ in lexicographic order $(1, 1), (1, 2), (2, 1), (2, 2)$. Entity 17 is the top lock on the Gem, 25 is the bottom lock on the Gem, and 39 is the player. We may therefore infer, from our earlier description of the ordinary attention of the virtual entities, that the agent perceives the 2-simplex with query entity 25 as shown.
In general we observe that the top and bottom locks on the Gem, the player, and the entities 7, 15 associated to the inventory often have a non-generic 2-simplicial attention, which strongly suggests that the simplicial agent has learned to use 2-simplices in a meaningful way.

Figure 7: Visualization of 2-simplicial attention in step 18 of an episode.

7 DISCUSSION

On general grounds one might expect that in the limit of infinite experience, any reinforcement learning agent with a sufficiently deep neural network will be able to solve any environment, including those like bridge Box World that involve higher-order relations between entities. In practice, however, we do not care about the infinite computation limit. In the regime of bounded computation it is reasonable to introduce biases towards learning representations of structures that are found in a wide range of environments that we consider important. We argue that higher-order relations between entities are an important example of such structures, and that the 2-simplicial Transformer is a natural inductive bias for 3-way interactions between entities. We have given preliminary evidence for the utility of this bias by showing that in the bridge Box World environment the simplicial agent has better performance than a purely relational agent, and that this performance involves in a meaningful way the prediction of 3-way interactions (or 2-simplices). We believe that simplicial Transformers may be useful for any problem in which higher-order relations between entities are important.

The long history of interactions between logic and algebra is a natural source of inspiration for the design of inductive biases in deep learning. In this paper we have exhibited one example: Boole's idea, that relationships between entities can be modeled by multiplication in an algebra, may be realised in the context of deep learning as an augmentation to the Transformer architecture using Clifford algebras of spaces of representations.

ACKNOWLEDGMENTS

We acknowledge support from the Nectar cloud at the University of Melbourne and GCP research credits. DM thanks Paul Middlebrooks for his excellent podcast "Brain Inspired", where he first learned about cognitive maps.

REFERENCES

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. In Proceedings of the International Conference on Learning Representations, 2016.

Aristotle. Sophistical refutations. In J. Barnes (ed.), Complete works of Aristotle, Volume 1: The revised Oxford translation. Princeton University Press, 1984.

David G. T. Barrett, Felix Hill, Adam Santoro, Ari S. Morcos, and Timothy P. Lillicrap. Measuring abstract reasoning in neural networks. In Proceedings of the 35th International Conference on Machine Learning, pp. 4477-4486, 2018.

Timothy E.J. Behrens, Timothy H. Muller, James C.R. Whittington, Shirley Mark, Alon B. Baram, Kimberly L. Stachenfeld, and Zeb Kurth-Nelson. What is a cognitive map? Organizing knowledge for flexible behavior. Neuron, 100(2):490-509, 2018.

Jacob Bellmund, Peter Gärdenfors, Edvard Moser, and Christian F. Doeller. Navigating cognition: Spatial codes for human thinking. Science, 362, 11 2018.

George Boole. The mathematical analysis of logic: being an essay towards a calculus of deductive reasoning. Macmillan, Barclay & Macmillan, Cambridge, 1847.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. preprint arXiv:1904.10509, 2019.

James Clift and Daniel Murfet. Cofree coalgebras and differential linear logic. preprint arXiv:1701.01285, 2017.

James Clift, Dmitry Doryn, Daniel Murfet, and James Wallbridge. 2simplicialtransformer. https://github.com/dmurfet/2simplicialtransformer/, 2019.

Alexandra O. Constantinescu, Jill X. O'Reilly, and Timothy E.J. Behrens. Organising conceptual knowledge in humans with a gridlike code. Science, 352:1464-1468, 2016.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In Proceedings of the International Conference on Learning Representations, 2019.

Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. Neural logic machines. In Proceedings of the International Conference on Learning Representations, 2019.

Russell A. Epstein, Eva Zita Patai, Joshua B. Julian, and Hugo J. Spiers. The cognitive map in humans: spatial navigation and beyond. Nature Neuroscience, 20:1504-1513, 2017.

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, pp. 1407-1416, 2018.

Gottlob Frege. On sense and denotation (Über Sinn und Bedeutung). Zeitschrift für Philosophie und philosophische Kritik, 100:25-50, 1892.

C. Lee Giles, Guo-Zheng Sun, Hsing-Hen Chen, Yee-Chun Lee, and Dong Chen. Higher order recurrent networks and grammatical inference. In Advances in Neural Information Processing Systems 2, pp. 380-387, 1989.

C. Lee Giles, Dong Chen, Clifford B. Miller, Hsing-Hen Chen, Guo-Zheng Sun, and Yee-Chun Lee. Second-order recurrent neural networks for grammatical inference. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pp. 273-281, 1991.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pp. 1263-1272, 2017.

Jean-Yves Girard. Linear logic. Theor. Comput. Sci., 50(1):1-102, 1987.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, 2016.

Mark W. Goudreau, C. Lee Giles, Srimat T. Chakradhar, and Dong Chen. First-order versus second-order single-layer recurrent neural networks. IEEE Transactions on Neural Networks, 5(3):511-513, 1994.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. preprint arXiv:1410.5401, 2014.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538, 10 2016.

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. preprint arXiv:1709.06560, 2017.

David Hestenes. New foundations for classical mechanics. Kluwer Academic Publishers, 2nd edition, 2002.
John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554-2558, 1982.

Martin Hyland. Game semantics, pp. 131-184. Publications of the Newton Institute. Cambridge University Press, 1997.

Ozan Irsoy and Claire Cardie. Modeling compositionality with multiplicative recurrent neural networks. In Proceedings of the International Conference on Learning Representations, 2015.

Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population based training of neural networks. preprint arXiv:1711.09846, 2017.

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885-1894, 2017.

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael I. Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 3059-3068, 2018.

Yunzhe Liu, Raymond J. Dolan, Zeb Kurth-Nelson, and Timothy E.J. Behrens. Human replay spontaneously reorganizes experience. Cell, 178(3):640-652, 2019.

Alan Macdonald. Sobczyk's simplicial calculus does not have a proper foundation. preprint arXiv:1710.08274, 2017.

David J.C. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.

Nicholas John Mackintosh. Animal learning, 2019. https://www.britannica.com/science/animal-learning/Insight-and-reasoning.

Chris Martens. Programming interactive worlds with linear logic. PhD thesis, Carnegie Mellon University, 2015.

Kirill Mavreshko. keras-transformer. https://github.com/kpot/keras-transformer, 2019.

Paul-André Melliès. Categorical semantics of linear logic. In Interactive Models of Computation and Program Behaviour, Panoramas et Synthèses 27, Société Mathématique de France, 1-196, 2009.

Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: uncovering the different types of features learned by each neuron in deep neural networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

Jordan B. Pollack. The induction of dynamical recognizers. Machine Learning, 7(2):227-252, 1991.

Marc'Aurelio Ranzato, Alex Krizhevsky, and Geoffrey Hinton. Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pp. 621-628, 2010.

David Raposo. Personal communication, May 13, 2019.

Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Théophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7310-7321, 2018.
Luciano Serafini and Artur S. d'Avila Garcez. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. In Proceedings of the 11th International Workshop on Neural-Symbolic Learning and Reasoning (NeSy'16), 2016.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 3145-3153, 2017.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations, 2013.

Robin Smith. Aristotle's logic. In Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, summer 2019 edition, 2019.

Garret Sobczyk. Simplicial calculus with geometric algebra. Fundamental Theories of Physics, vol. 47, 1992.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems 26, pp. 926-934, 2013.

Paul Vincent Spade and Jaakko J. Hintikka. History of logic, 2019. URL https://www.britannica.com/topic/history-of-logic/Aristotle.

Ilya Sutskever, James Martens, and Geoffrey Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, pp. 1017-1024, 2011.

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. Adaptive Computation and Machine Learning series. MIT Press, 2018.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. [Coursera] Neural Networks for Machine Learning (University of Toronto), 2012.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998-6008, 2017.

James Wallbridge. Jets and differential linear logic. preprint arXiv:1811.06235, 2018.

James C. R. Whittington, Timothy H. Muller, Shirley Mark, Caswell Barry, and Tim E. J. Behrens. Generalisation of structural knowledge in the hippocampal-entorhinal system. In Advances in Neural Information Processing Systems 31, pp. 8493-8504, 2018.

Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, Timothy Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter Battaglia. Deep reinforcement learning with relational inductive biases. In Proceedings of the International Conference on Learning Representations, 2019.

A COMPARISON TO THE NTM

The Transformer model and descendants such as the Universal Transformer (Dehghani et al., 2019) can be viewed as general units for computing with learned representations; in this sense they have a similar conceptual role to the Neural Turing Machine (NTM) (Graves et al., 2014) and Differentiable Neural Computer (Graves et al., 2016).
As pointed out in (Dehghani et al., 2019, §4), one can view the Transformer as a block of parallel RNNs (one for each entity) which update their hidden states at each time step by attending to the sequence of hidden states of the other RNNs at the previous step. We expand on those remarks here in order to explain the connection between the 2-simplicial Transformer and earlier work in the NLP literature, which is written in terms of RNNs. We consider a NTM with content-based addressing only and no sharpening. The core of the NTM is an RNN controller with update rule

$h' = \mathrm{ReLU}(M + Wh + Ux + b)$  (15)

where $W, U, b$ are weight matrices, $x$ is the current input symbol, $h$ is the previous hidden state, $h'$ is the next hidden state and $M$ is the output of the memory read head

$M = \sum_{j=1}^{N} \mathrm{softmax}(K[q, M_1], \ldots, K[q, M_N])_j\, M_j$  (16)

where there are $N$ memory slots containing $M_1, \ldots, M_N$, $q$ is a query generated from the hidden state of the RNN by a weight matrix, $q = Zh$, and $K[u, v] = (u \cdot v)/(\|u\|\,\|v\|)$. We omit the mechanism for writing to the memory here, since it is less obvious how that relates to the Transformer; see (Graves et al., 2014, §3.2). Note that while we can view $M_j$ as the hidden state of memory slot $j$, the controller's hidden state and the hidden states of the memory slots play asymmetric roles, since the former is updated with a feedforward network at each time step, while the latter is not.

The Transformer with shared transition functions between layers is analogous to a NTM with this asymmetry removed: there is no longer a separate recurrent controller, and every memory slot is updated with a feedforward network in each timestep. To explain, view the entity representations $e_1, \ldots, e_N$ of the Transformer as the hidden states of $N$ parallel RNNs. The new representation is

$e'_i = \mathrm{LayerNorm}(g_\theta(A) + e_i)$  (17)

where the attention term is

$A = \sum_{j=1}^{N} \mathrm{softmax}(q_i \cdot k_1, \ldots, q_i \cdot k_N)_j\, v_j$  (18)

and $q_i = Z e_i$ is a query vector obtained by a weight matrix from the hidden state, the $k_j = K e_j$ are key vectors and $v_j = V e_j$ is the value vector. Note that in the Transformer the double role of $M_j$ in the NTM has been replaced by two separate vectors, the key and value, and the cosine similarity $K[-,-]$ has been replaced by the dot product.

Having now made the connection between the Transformer and RNNs, we note that the second-order RNN (Giles et al., 1989; Pollack, 1991; Goudreau et al., 1994; Giles et al., 1991) and the similar multiplicative RNN (Sutskever et al., 2011; Irsoy & Cardie, 2015) have in common that the update rule for the hidden state of the RNN involves a term $V(x \otimes h)$ which is a linear function of the tensor product of the current input symbol $x$ and the current hidden state $h$. One way to think of this is that the weight matrix $V$ maps inputs $x$ to linear operators on the hidden state. In (Socher et al., 2013) the update rule contains a term $V(e_1 \otimes e_2)$ where $e_1, e_2$ are entity vectors, and this is directly analogous to our construction.
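To contrast the two kinds of recurrent update mentioned above, here is a small NumPy sketch of a first-order step next to a second-order (multiplicative) step in which the input acts through the tensor product $x \otimes h$; shapes and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 5, 8
x, h = rng.normal(size=d_x), rng.normal(size=d_h)

# first-order update: h' = ReLU(W h + U x + b)
W, U, b = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)), np.zeros(d_h)
h_first = np.maximum(W @ h + U @ x + b, 0.0)

# second-order update: h' = ReLU(V (x (x) h) + b), where V is linear on the tensor product,
# i.e. V maps each input x to a linear operator acting on the hidden state h
V = rng.normal(size=(d_h, d_x * d_h))
h_second = np.maximum(V @ np.outer(x, h).reshape(-1) + b, 0.0)
```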
B CONNECTION TO HOPFIELD NETWORKS

The continuous Hopfield network (Hopfield, 1982) (MacKay, 2003, Ch. 42) with $N$ nodes updates in each timestep a sequence of vectors $\{e_i\}_{i=1}^{N}$ by the rule

$e'_i = \tanh\Big(\eta \sum_{j} (e_i \cdot e_j)\, e_j\Big)$  (19)

for some parameter $\eta$. The Transformer block may therefore be viewed as a refinement of the Hopfield network, in which the three occurrences of entity vectors in (19) are replaced by query, key and value vectors $W^Q e_i$, $W^K e_j$, $W^V e_j$ respectively, the nonlinearity is replaced by a feedforward network with multiple layers, and the dynamics are stabilised by layer normalisation. The initial representations $e_i$ also incorporate information about the underlying lattice, via the positional embeddings.

The idea that the structure of a sentence acts to transform the meaning of its parts is due to Frege (Frege, 1892) and underlies the denotational semantics of logic. From this point of view the Transformer architecture is an inheritor both of the logical tradition of denotational semantics, and of the statistical mechanics tradition via Hopfield networks.

C CLIFFORD ALGEBRA

The volume of an $n$-simplex in $\mathbb{R}^n$ with vertices at $0, v_1, \ldots, v_n$ is $\frac{1}{n!}\,|\det(v_1, \ldots, v_n)|$, that is, $\frac{1}{n!}$ times the volume of the $n$-dimensional parallelotope which shares $n$ edges with the $n$-simplex. In our applications the space of representations $H$ is high dimensional, but we wish to speak of the volume of $k$-simplices for $k < \dim(H)$ and use those volumes to define the coefficients of our simplicial attention. The theory of Clifford algebras (Hestenes, 2002) is one appropriate framework for such calculations.

Let $H$ be an inner product space with pairing $(v, w) \mapsto v \cdot w$. The Clifford algebra $\mathrm{Cl}(H)$ is the associative unital $\mathbb{R}$-algebra generated by the vectors $v \in H$ with relations

$vw + wv = 2(v \cdot w) \cdot 1$.

The canonical $k$-linear map $H \to \mathrm{Cl}(H)$ is injective, and since $v^2 = \|v\|^2 \cdot 1$ in $\mathrm{Cl}(H)$, any nonzero vector $v \in H$ is a unit in the Clifford algebra. While as an algebra $\mathrm{Cl}(H)$ is only $\mathbb{Z}_2$-graded, there is nonetheless a $\mathbb{Z}$-grading of the underlying vector space which can be defined as follows: let $\{e_i\}_{i=1}^{n}$ be an orthonormal basis of $H$; then the set $B = \{ e_{i_1} \cdots e_{i_m} \mid 1 \le i_1 < \cdots < i_m \le n \}$ is a basis for $\mathrm{Cl}(H)$, with $m$ ranging over the set $\{0, \ldots, n\}$. If we assign the basis element $e_{i_1} \cdots e_{i_m}$ the degree $m$, then this determines a $\mathbb{Z}$-grading $[-]_k$ of the Clifford algebra which is easily checked to be independent of the choice of basis.

Definition C.1. $[A]_k$ denotes the homogeneous component of $A \in \mathrm{Cl}(H)$ of degree $k$.
, vk H the wedge product v1 vk is an element in the exterior algebra V H. Using the chosen basis B we can identify the underlying vector space of Cl(H) with V H and using this identification (set vi = P j λijej) [v1 vk]k = h n X j1=1 λ1j1ej1 n X jk=1 λkjkejk i j1,...,jk λ1j1 λkjk[ej1 ejk]k 1 j1<