# Learning Language Representations with Logical Inductive Bias

Published as a conference paper at ICLR 2023

Jianshu Chen, Tencent AI Lab, Bellevue, WA 98004, USA (jianshuchen@global.tencent.com)

ABSTRACT

Transformer architectures have achieved great success in solving natural language tasks, learning strong language representations from large-scale unlabeled texts. In this paper, we seek to go further and explore a new logical inductive bias for better language representation learning. Logic reasoning is known as a formal methodology for reaching answers from given knowledge and facts. Inspired by this view, we develop a novel neural architecture, named FOLNet (First-Order Logic Network), to encode this new inductive bias. We construct a set of neural logic operators as learnable Horn clauses, which are further forward-chained into a fully differentiable neural architecture (FOLNet). Interestingly, we find that the self-attention module in transformers can be composed of two of our neural logic operators, which probably explains their strong reasoning performance. Our proposed FOLNet has the same input and output interfaces as other pretrained models and thus can be pretrained/finetuned using similar losses. It also allows FOLNet to be used in a plug-and-play manner when replacing other pretrained models. With our logical inductive bias, the same set of logic deduction skills learned through pretraining is expected to be equally capable of solving diverse downstream tasks. For this reason, FOLNet learns language representations that have much stronger transfer capabilities. Experimental results on several language understanding tasks show that our pretrained FOLNet model outperforms the existing strong transformer-based approaches.[1]

1 INTRODUCTION

Pretrained transformer models have achieved great success in solving natural language tasks by learning strong language representations from large-scale unlabeled texts. The learned representations can be easily transferred to different downstream tasks by finetuning over a limited amount of labeled data (Radford et al., 2018; Devlin et al., 2018; Lan et al., 2019; Liu et al., 2019; Yang et al., 2019). They even exhibit strong zero-shot or few-shot generalization capability without finetuning when the model size is further scaled up (Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2022). Besides large-scale models and training data, one important reason for this success is the strong relational inductive bias encoded in the transformer architecture (Vaswani et al., 2017): it effectively models the pairwise relations between tokens and uses them to compute the language representations. In this paper, we seek to go beyond the inductive bias in transformer models and explore a new logical inductive bias for better language representation learning. The main idea is to view the computation of language representations as a logic reasoning process; that is, the language representations are deduced via logic reasoning step-by-step from the original discrete token sequences. Specifically, we treat the tokens in the input sequence as the terms in logic programming, and treat their properties and relations as the predicates of different arities. Then, the final language representations are derived as the advanced properties and relations deduced from the basic input properties and relations (e.g., token ids and relative distances).
Most importantly, we require the construction of this deduction process to follow the principles of first-order logic, in order to encode such a logical inductive bias.

Following the above logical inductive bias, we derive a principled neural architecture, named FOLNet (First-Order Logic Network), for learning language representations. Specifically, we construct a set of neural logic operators as learnable Horn clauses, which are further forward-chained into a fully differentiable neural architecture. In particular, the FOLNet architecture consists of two interacting branches responsible for unary and binary relational reasoning, respectively. Interestingly, we find that the self-attention mechanism can be constructed from two of our neural logic operators, and that the entire transformer architecture can be understood as a single-branch version of FOLNet. This newly discovered connection might partially explain the surprisingly strong reasoning performance of the transformer architecture (Wei et al., 2022; Lewkowycz et al., 2022). As we will demonstrate in our experiments, such a dual-branch architecture has several significant advantages that are essential for learning better language representations. Furthermore, we also establish a new unified understanding of different positional encoding strategies through our logical inductive bias. For instance, we find that the existing popular relative positional encoding can be constructed from degenerate versions of two of our neural logic operators. More importantly, this view also allows us to develop a new principled relative positional encoding that is simple yet quite effective in practice. Notably, our proposed FOLNet has the same input and output interfaces as other pretrained transformer models (e.g., BERT) and thus can be trained using similar losses. It also allows FOLNet to be used in a plug-and-play manner when replacing other pretrained models in solving downstream tasks. Our logical inductive bias assumes that the logic deduction skills are shared across all natural language tasks; that is, the skills learned during pretraining should be equally applicable to solving diverse downstream tasks. For this reason, FOLNet learns language representations that have much stronger transfer generalization capabilities. Experimental results on several language understanding tasks (GLUE, SQuAD 2.0 and FOLIO) show that our FOLNet model outperforms the transformer architecture by a large margin when both are pretrained using similar losses. The results clearly show the advantage of using the logical inductive bias for learning language representations.

[1] The code along with the pretrained model checkpoints will be released publicly.

2 LOGICAL INDUCTIVE BIAS FOR LANGUAGE REPRESENTATIONS

Natural language text can be viewed as a sequence of discrete symbols, and language representation learning considers the problem of mapping the discrete symbols into certain more computable forms. One widely used approach is distributed representation, which maps the discrete token ids into dense vectors (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018). Many different functional forms, such as LSTM (Hochreiter & Schmidhuber, 1997) and, more recently, transformer models (Vaswani et al., 2017), have been used to implement such mappings. They generally encode different kinds of inductive bias for modeling natural languages.
For example, RNNs use the same set of model parameters to update the hidden states over time, which encodes translation-invariance over time (Battaglia et al., 2018). These forms of inductive bias have continuously pushed the state of the art in solving natural language tasks. In this section, we introduce a new form of inductive bias, named logical inductive bias, which works together with distributed representations to design more effective representation mappings. Our main idea is to view the language representation mapping as a logic reasoning process; that is, the language representations are deduced step-by-step from the original discrete token sequences. Specifically, we treat the tokens in the input sequence as terms (or objects) in logic programming, and treat their properties and relations as predicates of different arities. In light of the logical inductive bias, the language representations that we seek to compute are the (advanced) properties and relations that can be deduced from the input (basic) properties and relations. Most importantly, we require the construction of this deduction process to follow the principles of first-order logic, in order to encode the logical inductive bias into the representation learning process. We now formulate language representation learning as a logic programming problem, adopting notation similar to that of Evans & Grefenstette (2018).

Terms: We consider a first-order logic system without function symbols, so that terms can only be variables or constants. They are used to represent general objects or a particular object of interest, respectively. In the context of language representation learning, we model each instance of a text sequence x (of length T) as a collection of constants x = {x_1, . . . , x_T}, where each token x_t is a constant (t = 1, . . . , T). We use lower-case letters to denote constants and upper-case letters to denote variables, as in logic programming. For example, X is a variable that represents a general object (e.g., a token).

Atoms: For each term, we define its properties and relations as r-ary predicates p(X_1, . . . , X_r), which take the value T (True) or F (False) depending on whether a certain property/relation regarding (X_1, . . . , X_r) holds or not. For example, whether a token a takes the v-th id in the vocabulary is a unary predicate TokenID_v(a) for v = 1, . . . , V, where V is the vocabulary size, and whether the distance between two tokens a and b is equal to d is a binary predicate Dist_d(a, b), for |d| < T. An atom is ground if it has no variables; e.g., the above TokenID_v(a) and Dist_d(a, b) are all ground atoms.

| Language representations | Logic programming | FOLNet |
|---|---|---|
| Tokens x_t in the text sequence | Constant: x_t | The argument x_t in tensor u_l(x_t) |
| Token ids, relative distances, etc. | Input (basic) atoms: C_0 = B | Input tensors: {u_0(x), u_0(x, y)} |
| Final language representation | Deduced (advanced) atoms: C_L | Output tensors: {u_L(x), u_L(x, y)} |
| Representation mapping (partial) | Modus Ponens using generic clause (8) | Neural logic operator: see Table 2 |
| Representation mapping (partial) | 1-step deduction: C_l = con_R(C_{l−1}) | Forward pass: 1 layer |
| Representation mapping (full) | L-step deduction: forward-chaining (3) | Forward pass: L layers |

Table 1: Identification of language representation learning as a logic programming problem.

Clauses: The reasoning process is constructed by a set of if-then clauses of the form

q ← p_1 ∧ · · · ∧ p_m,    (1)
where p_1, . . . , p_m and q are the body atoms and the head atom, respectively, and ∧ denotes conjunction (i.e., logical AND). These atoms play the roles of premises and conclusions: if p_1, . . . , p_m are true, then q is also true. Clauses of the above form are known as definite Horn clauses (Horn, 1951). We call a clause a ground rule if all its variables are substituted by constants. For example, applying a substitution θ = {a/X, b/Y} to a clause q(X, Y) ← p(X, Y) gives the ground rule q(a, b) ← p(a, b), which can be viewed as applying a general clause to a particular instantiation. Our objective is to learn a collection of clauses and compose them into a mapping from input predicates (e.g., TokenID_v(x_t) and RelDist_d(x_t, x_τ)) to language representations. Specifically, let R be a set of clauses and ground(R) be the corresponding set of ground rules. We define the immediate consequences of applying the ground rules in ground(R) to a set of ground atoms X as

con_R(X) = X ∪ { q | (q ← p_1 ∧ · · · ∧ p_m) ∈ ground(R), p_i ∈ X for i = 1, . . . , m }.    (2)

It can be understood as the set of ground atoms that can be deduced from X, together with X itself. Given a set of input ground atoms B, we can repeatedly apply the ground rules in R for L steps:

C_l = con_R(C_{l−1}),  with C_0 = B and l = 1, . . . , L.    (3)

Then, C_L contains all the ground atoms (predicates) that can be deduced from B (including B itself) in L steps. The above procedure is known as forward-chaining: it deduces all the possible conclusions from the input premises (i.e., B) by repeatedly applying (i.e., chaining) clauses. If we want to verify whether a predicate q holds (i.e., can be entailed), it suffices to check whether q is in C_L. In language representation learning, we start, for example, from the following input (basic) atoms:

B = {TokenID_v(x_t), RelDist_d(x_t, x_τ), . . . | t, τ = 1, . . . , T},    (4)

and deduce C_L as the final representations by forward-chaining our (learned) clauses. For example, in solving an (extractive) question answering problem, whether a certain token is the beginning (or end) of the answer span is modeled as an advanced deduced property of this token, i.e.,

C_L = {AnswerStartsAt(x_t), AnswerEndsAt(x_t) | t = 1, . . . , T}.    (5)

In autoregressive language modeling, the advanced deduced property becomes the id of the next token. Table 1 summarizes the above identifications between language representations and logic programming. Next, we will develop a neural architecture to encode such logical inductive bias. Throughout the paper, we use boldface letters to denote vectors (lowercase) and matrices (uppercase).
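To make the forward-chaining recursion in (2)-(3) concrete, the short Python sketch below runs it on a toy rule set. The rule, constants and predicate names (ParentOf, GrandParentOf) are purely illustrative and are not taken from the paper.

```python
# A minimal sketch of forward-chaining (Eqs. (2)-(3)) over ground atoms.
# Atoms are (predicate, args) tuples; a rule is (head, body) stated over
# variables (upper-case strings), which we ground by substituting constants.
from itertools import product

constants = ["alice", "bob", "carol"]

# One clause q <- p1 ^ p2:  GrandParentOf(X, Y) <- ParentOf(X, Z) ^ ParentOf(Z, Y)
rules = [
    (("GrandParentOf", ("X", "Y")),
     [("ParentOf", ("X", "Z")), ("ParentOf", ("Z", "Y"))]),
]

# Input (basic) ground atoms B.
B = {("ParentOf", ("alice", "bob")), ("ParentOf", ("bob", "carol"))}

def ground(rule):
    """Enumerate all ground instances of a rule (the set ground(R))."""
    head, body = rule
    variables = sorted({v for _, args in [head] + body for v in args if v.isupper()})
    for values in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values))
        sub = lambda atom: (atom[0], tuple(theta.get(a, a) for a in atom[1]))
        yield sub(head), [sub(b) for b in body]

def con(rules, X):
    """Immediate consequences con_R(X) of Eq. (2): X plus all entailed heads."""
    out = set(X)
    for rule in rules:
        for head, body in ground(rule):
            if all(b in X for b in body):
                out.add(head)
    return out

# L-step deduction C_l = con_R(C_{l-1}) with C_0 = B, as in Eq. (3).
C = B
for _ in range(2):
    C = con(rules, C)
print(("GrandParentOf", ("alice", "carol")) in C)  # True
```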
3 FOLNET: A NEURAL ARCHITECTURE WITH LOGICAL INDUCTIVE BIAS

In this section, we develop a novel neural architecture, named First-Order Logic Network (FOLNet), which encodes the logical inductive bias presented in Section 2. Specifically, we focus on: (i) how to represent the atoms as continuous vectors (Section 3.1), (ii) how to devise a neural inference step that approximates (2) (Section 3.2), and (iii) how to forward-chain these steps into a fully-differentiable architecture based on (3) (Section 3.3). The overall neural architecture and its correspondence to the logical inductive bias are shown in Figure 1. Next, we discuss step-by-step how we devise the architecture, its advantages, and some important connections to existing approaches.

Figure 1: Overview of the FOLNet architecture and how it encodes the logical inductive bias. The neural logic operators model the clauses, which are forward-chained into a differentiable model. The mixer ops refer to the operators c, j, m, and t in Table 2, as they reduce over the object dimension.

3.1 VECTOR REPRESENTATIONS OF ATOMS

Recall that we use an r-ary ground atom p_d(x_1, . . . , x_r) to characterize whether the d-th property/relation holds for a tuple of tokens (x_1, . . . , x_r), where d = 1, . . . , D_r. To overcome the difficulty of learning these discrete-valued atoms, which take values in {T, F}, we introduce u_d(x_1, . . . , x_r) ∈ R as a continuous representation that characterizes the extent to which the atom p_d is true. For example, in ProbLog (De Raedt et al., 2007; De Raedt & Kimmig, 2015; Fierens et al., 2012; Manhaeve et al., 2018), u_d(·) gives the probability of the atom p_d(·) being true. In this paper, we take u_d(·) to be the logit of the corresponding atom, i.e., Pr{p_d(·) = T} = 1/(1 + e^{−u_d(·)}); a larger value of u_d(·) implies a higher chance of the atom p_d(·) being true. One can also easily verify that the logit of the negated atom ¬p_d(·) is simply −u_d(·). Let u(x_1, . . . , x_r) ∈ U ≜ R^{D_r} be a D_r-dimensional vector that collects u_d(x_1, . . . , x_r) as its d-th element. Then, u(x_1, . . . , x_r) is a continuous (logit) vector representation of D_r atoms of arity r. For example, for an input text sequence x = (x_1, . . . , x_T), we can use u(x_t) to represent D_1 (unary) properties of each token x_t, and use u(x_t, x_τ) to characterize D_2 (binary) relations between any pair of tokens. Both the input (basic) properties/relations and the deduced (advanced) ones are represented in the same logit vector space, in which the deduction process is carried out. For convenience, we may also refer to a set of atoms directly by their logit vector representation u(·).

3.2 NEURAL MODUS PONENS INFERENCE

We now develop a set of neural operators that implement the deduction step characterized in (2). To begin with, we first introduce the Modus Ponens (MP) rule from first-order logic (Andrews, 2013), which states that if the clause B ← A and the statement A hold, then B is true:

B ⟸ {B ← A and A}.    (6)

In the context of (2), choosing A = p_1 ∧ · · · ∧ p_m and B = q, the deduction in (2) can be viewed as an application of MP inference using all the ground clauses in ground(R). In this paper, we restrict our focus to the setting where all the atoms have arity of either one or two; that is, we only consider atoms of the form u(x_t) and u(x_t, x_τ) (represented in their vector forms), respectively.[2] We then need to develop the MP inference from a set of atoms {v(a), v(a, b)} to another set of atoms {u(x), u(x, y)}, which can be categorized into the following four groups:

u(x) ⟸ {R_UU, v(a)};   u(x) ⟸ {R_UB, v(a, b)};   u(x, y) ⟸ {R_BU, v(a)};   u(x, y) ⟸ {R_BB, v(a, b)},    (7)

where R_UU, R_UB, R_BU and R_BB denote the sets of rules that deduce atoms of a certain arity from atoms of another arity. For example, R_BU defines a collection of clauses that derive binary (B) atoms from unary (U) atoms. We now proceed to model these clauses and their inference in logit space. Given a set of premise atoms P_1, . . . , P_M, we consider N clauses of the following generic form:

Q_n ← ( ∧_{m ∈ M_{n,+}} P_m ) ∧ ( ∧_{m ∈ M_{n,−}} ¬P_m ),   n = 1, . . . , N,    (8)

where Q_n is the head atom, ¬ denotes logical negation (i.e., NOT), and M_{n,+} and M_{n,−} are two subsets of M = {1, . . . , M} with M_{n,+} ∩ M_{n,−} = ∅.

[2] Generalizing our work to higher arities is relatively straightforward in principle but would lead to high computation complexity in practice. We leave such an extension as future work.
Then, the logit vector u of the head atoms {Q_n} can be approximately inferred from the logit vector v of the premises {P_m} by a matrix multiplication followed by an (optional) elementwise nonlinear activation function (Appendix D):

u = σ(Kv),    (9)

where K is an N × M kernel matrix that represents the clauses in (8), and σ(z) = ln(1 + 2e^z) is the activation function. Notably, each row of K characterizes the conjunction pattern on the right-hand side of (8): a positive (negative) value of its (n, m)-th element, K_nm, means that m is more likely to be in M_{n,+} (M_{n,−}) for the n-th clause. It follows that the kernels for R_UU, R_UB, R_BU and R_BB in (7) would take the forms K_UU(x, a), K_UB(x, a, b), K_BU(x, y, a) and K_BB(x, y, a, b), respectively, which are D_1 × D_1, D_1 × D_2, D_2 × D_1 and D_2 × D_2 matrices, and (9) would become matrix multiplications between them and their corresponding premises (i.e., v(a), v(a, b), v(a) or v(a, b)), followed by a summation over whichever of a and b appear therein (see (30)–(33) in Appendix D). Finally, the activation function σ(·) can be dropped when ← is replaced with ↔, where A ↔ B iff A ← B and B ← A.

One major limitation of directly implementing (9) for the inference rules in (7) is the high memory and computation cost. For example, the kernel K_BB(x, y, a, b) requires O(D_2^2 T^4) memory to store its values, and the MP inference (9), which now becomes u(x, y) = Σ_{a,b} K(x, y, a, b) v(a, b), also has a computation complexity of O(D_2^2 T^4). Therefore, we have to restrict the size of the kernels and reduce the overall complexity, e.g., by sharing the values of K_BB(x, y, a, b) across different dimensions. We provide a systematic approach based on the following principles:

1. We restrict all the kernels to be of the form {K(ω), K(ω, ν)}, i.e., of arity r = 1, 2.
2. We pick one reduction dimension and one batching dimension in the matrix multiplication.

With the above assumptions, we factor the predicate dimensions of the unary kernels and unary atoms so that K_d(ω) = K_hs(ω) and u_d(x) = u_hs(x), where d = (h−1)S + s with h = 1, . . . , H and s = 1, . . . , S. This is inspired by the multi-head attention mechanism (Vaswani et al., 2017), where h is akin to the head index, H is the number of heads and S is the size of each head. Then, we enumerate all possible neural logic operators that can compatibly multiply a kernel from {K(ω), K(ω, ν)} with a premise from {v(a), v(a, b)} by properly choosing different reduction and batching dimensions. The resulting neural logic operators for each typing are listed in Table 2 and are further discussed in Appendix A.1, including their different roles in (restricted) Modus Ponens reasoning.

| Sym. | Ops. | Typing | B-dim. | R-dim. | Neural operator in logit space | Kernel act. | Remarks |
|---|---|---|---|---|---|---|---|
| b | bool | U U U | x | w | u_hs(x) = Σ_w K_hw(x) v_ws(x) | Identity | |
| c | cjoin | U U B | h | a | u_hs(x) = Σ_a K_hs(a) v_h(x, a) | Softmax_a | |
| j | join | U B U | h | a | u_hs(x) = Σ_a K_h(x, a) v_hs(a) | Softmax_a | Self-attention |
| m | mu | U B B | x | a | u_hs(x) = Σ_a K_h(x, a) v_s(x, a) | Softmax_a | General RPE |
| a | assoc | B U U | h | w | u_h(x, y) = Σ_w K_hw(x) v_hw(y) | Identity | Self-attention |
| p | prod | B U B | x | w | u_h(x, y) = Σ_w K_hw(x) v_w(x, y) | Identity | General RPE |
| t | trans | B B B | h | a | u_h(x, y) = Σ_a K_h(x, a) v_h(a, y) | Softmax_a | |

Table 2: List of all our neural logic operators with restricted kernels (see Appendix A.1 for our naming protocols). Note that each operator has a unique batching dimension (B-dim) and a unique reduction dimension (R-dim). The typing of an operator specifies the arities of the kernel, the premise and the output atom; for example, U B U means that the arities of the kernel, the input atom and the output atom are 2, 1 and 1, respectively. We also list the activation functions that are used to compute the corresponding kernels, where the normalization dimension of the Softmax is given in its subscript.
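To illustrate how the restricted kernels turn Modus Ponens into ordinary tensor algebra, the sketch below writes each operator in Table 2 as a single torch.einsum contraction. The tensor shapes, the particular predicate index carried by each kernel/premise, and the omission of the kernel activations (Softmax/Identity) and of a batch-of-sequences dimension are simplifying assumptions for illustration, not the paper's exact implementation.

```python
# Sketch: the seven neural logic operators of Table 2 as einsum contractions.
# Index names: x, y, a range over tokens; h, s, w index predicate dimensions.
# Kernel activations are omitted; all shapes below are illustrative.
import torch

T, H, S = 8, 4, 16                      # tokens, heads, head size

# Premises (atoms deduced so far).
v_u  = torch.randn(T, H, S)             # unary premise atoms  (token, group, slot)
v_b  = torch.randn(T, T, H)             # binary premise atoms (token, token, group)
v_bs = torch.randn(T, T, S)             # binary premise atoms (token, token, slot)

# Kernels (clauses), themselves parameterized by atoms of matching arity.
K_uu = torch.randn(T, H, H)             # K_hw(x) for bool
K_us = torch.randn(T, H, S)             # K_hw(x) for assoc / prod (w over head size)
K_hs = torch.randn(T, H, S)             # K_hs(a) for cjoin
K_b  = torch.randn(T, T, H)             # K_h(x, a) for join / mu / trans

# Unary outputs u_hs(x):
u_bool  = torch.einsum("xhw,xws->xhs", K_uu, v_u)    # b: reduce over predicates w
u_cjoin = torch.einsum("ahs,xah->xhs", K_hs, v_b)    # c: reduce over objects a
u_join  = torch.einsum("xah,ahs->xhs", K_b, v_u)     # j: reduce over objects a
u_mu    = torch.einsum("xah,xas->xhs", K_b, v_bs)    # m: reduce over objects a

# Binary outputs u_h(x, y):
u_assoc = torch.einsum("xhw,yhw->xyh", K_us, v_u)    # a: reduce over predicates w
u_prod  = torch.einsum("xhw,xyw->xyh", K_us, v_bs)   # p: reduce over predicates w
u_trans = torch.einsum("xah,ayh->xyh", K_b, v_b)     # t: reduce over objects a

print(u_join.shape, u_assoc.shape)      # torch.Size([8, 4, 16]) torch.Size([8, 8, 4])
```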
Connection with transformers. Interestingly, we find that the j-operator and the a-operator share similar forms with the self-attention mechanism in transformers, where the a-operator computes the self-attention scores and the j-operator performs the self-attention operation. Furthermore, the m-operator and the p-operator generalize the existing relative positional encodings developed in Shaw et al. (2018), which are widely used in different transformer variants, such as the T5 models (Raffel et al., 2020). Specifically, we show in Appendix E that by making v_w(x, y) instance-independent, the p-operator computes the second term in equation (5) of Shaw et al. (2018), where v_w(x, y) plays the role of a^K_{ij}; and by making v_s(x, a) instance-independent, the m-operator computes the second term in equation (3) of Shaw et al. (2018), where v_s(x, a) plays the role of a^V_{ij}. That is, under such degenerate settings, these two operators compose the relative positional encoding developed therein. Note that their a^K_{ij} and a^V_{ij} are static learnable embeddings, while our v_w(x, y) and v_s(x, a) are dynamically computed for each instance (as we will discuss in Section 3.3). Therefore, our m-op and p-op can also be viewed as a more adaptive relative positional encoding, whose advantages will be further demonstrated in our experiments (Section 4).
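As a quick sanity check on this connection, the sketch below composes the a-operator (attention logits) and the j-operator (aggregation with the Softmax_a kernel activation) and compares the result with standard scaled dot-product attention; the 1/√S scaling, the absence of masking, and the single-sequence setting are simplifying assumptions.

```python
# Sketch: self-attention recovered from the assoc- and join-operators.
import torch

T, H, S = 8, 4, 16
q = torch.randn(T, H, S)   # queries: unary kernel atoms K_hw(x) of the a-operator
k = torch.randn(T, H, S)   # keys:    unary premise atoms v_hw(y) of the a-operator
v = torch.randn(T, H, S)   # values:  unary premise atoms v_hs(a) of the j-operator

# a-operator: binary atoms u_h(x, y) = sum_w K_hw(x) v_hw(y) (attention logits).
logits = torch.einsum("xhw,yhw->xyh", q, k) / S ** 0.5

# Softmax_a kernel activation, then j-operator: u_hs(x) = sum_a K_h(x, a) v_hs(a).
attn = torch.softmax(logits, dim=1)                   # normalize over a (= y)
out_folnet = torch.einsum("xah,ahs->xhs", attn, v)

# Reference: standard multi-head scaled dot-product attention.
out_ref = torch.nn.functional.scaled_dot_product_attention(
    q.permute(1, 0, 2), k.permute(1, 0, 2), v.permute(1, 0, 2)
).permute(1, 0, 2)

print(torch.allclose(out_folnet, out_ref, atol=1e-5))  # True
```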
3.3 FORWARD-CHAINING AND DIFFERENTIABLE LEARNING

Note that, in general, a logic operator takes a kernel K and a premise v and infers an outcome u (Table 2). Specifically, it neuralizes the logic deduction in (2) for a particular typing (e.g., U B U). Applying all the neural logic operators amounts to a full execution of (2), which is one recursion step in (3) that maps a set of {u_{l−1}(x), u_{l−1}(x, y)} into {u_l(x), u_l(x, y)}. Therefore, we can naturally forward-chain L stages of these operators to create a fully-differentiable architecture that models the reasoning chain in (3). Figure 1 depicts this forward-chaining process and how it encodes the logical inductive bias described in Section 2. One remaining problem is how to obtain the kernels K(·) and the premises v(·) in the FOLNet architecture of Figure 1. We do this simply by applying two linear projections (one for the premise and one for the kernel) to the previous layer's output {u_l(x), u_l(x, y)}. For the kernel, we may further apply an activation function after the linear projection (see Table 2 for the list of kernel activation functions for each operator). In other words, we parameterize the kernels K(·) and the premises v(·) by (the intermediate deduction results of) FOLNet itself. This is based on the observation that clauses are themselves predicates, since A ← B is defined as A ∨ ¬B (Andrews, 2013), where ∨ denotes disjunction (logical OR).

We now describe the input and the output of FOLNet in Figure 1. At the input, we encode the discrete token id of each token x_t into a vector of the form u_0(x_t) by standard embedding lookup. Likewise, we convert the (discrete) relative distance between two tokens x_t and x_τ into a vector of the form u_0(x_t, x_τ). The {u_0(x_t), u_0(x_t, x_τ)}_{t,τ} are used as the vector representations of the base atoms B and fed into the FOLNet model (Figure 1). After L layers (i.e., L steps of deduction), the output {u_L(x_t), u_L(x_t, x_τ)}_{t,τ} becomes the vector representation of C_L in (3), which is used as the final language representation. Therefore, our FOLNet model has the same input-output interface as other transformer models and can be used in a plug-and-play manner for solving downstream tasks. Because of this, our model can also be pretrained over large-scale texts using the same losses (e.g., MLM, NSP, SOP, etc.) as other encoder-only models; see Appendix A.4 for how to compute these losses from {u_L(x_t), u_L(x_t, x_τ)}_{t,τ}. Our model can also be extended to decoder-only and encoder-decoder versions by slightly modifying the neural logic operators (see Appendix A.5), which can then be pretrained to predict the next token auto-regressively.

We conclude this section by discussing several important properties of the FOLNet architecture.

The dual-branch architecture. Note that the FOLNet model in Figure 1 has two branches: (i) a unary predicate branch for reasoning over u_l(x), and (ii) a binary predicate branch for reasoning over u_l(x, y). This is in sharp contrast to the single-branch architecture of the transformer models (Figure 3 in Appendix A.2). We further note that when FOLNet is loaded with only the j-operator and the a-operator, it degenerates into a dual-branch variant of the transformer architecture. In our experiments, we will show that, even in this degenerate setting, FOLNet still outperforms the transformer, probably because the binary predicate branch explicitly maintains the pairwise relations u_l(x, y) throughout the reasoning process. In addition, the explicit binary predicate branch also allows us to feed the relative-distance knowledge directly into the model without performing the less intuitive operations used in existing RPEs (Shaw et al., 2018). In our experiments in Section 4, we will demonstrate the advantage of this simple yet effective approach for consuming the relative positional information, along with some other advantages of the dual-branch architecture of FOLNet.
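To summarize how the pieces fit together, here is a minimal single-step sketch of the dual-branch update: unary atoms of shape [T, D_1] and binary atoms of shape [T, T, D_2] are linearly projected into kernels and premises, a join/assoc pair updates each branch, and the updated atoms are returned. This is a deliberately stripped-down illustration (only two operators, no Boolean/FFN block, residual connections, or layer normalization, and an ad-hoc way of refreshing the binary branch), not the full FOLNet layer.

```python
# A minimal sketch of one dual-branch deduction step using only the join- and
# assoc-operators. Real FOLNet layers also use the remaining operators,
# Boolean (FFN) blocks, residual connections and layer normalization.
import torch
import torch.nn as nn

class DualBranchStep(nn.Module):
    def __init__(self, d1=768, d2=64, heads=12):
        super().__init__()
        self.h, self.s = heads, d1 // heads
        # Kernels and premises are linear projections of the previous atoms.
        self.q_proj = nn.Linear(d1, d1)      # unary kernel for assoc
        self.k_proj = nn.Linear(d1, d1)      # unary premise for assoc
        self.v_proj = nn.Linear(d1, d1)      # unary premise for join
        self.b_proj = nn.Linear(d2, heads)   # binary kernel logits for join

    def forward(self, u1, u2):
        # u1: unary atoms [T, D1];  u2: binary atoms [T, T, D2]
        T = u1.size(0)
        q = self.q_proj(u1).view(T, self.h, self.s)
        k = self.k_proj(u1).view(T, self.h, self.s)
        v = self.v_proj(u1).view(T, self.h, self.s)

        # assoc-operator: deduce new binary atoms (relations) from unary atoms.
        rel = torch.einsum("xhw,yhw->xyh", q, k) / self.s ** 0.5

        # join-operator: deduce new unary atoms; its kernel mixes a projection
        # of the old binary atoms with the newly deduced relations.
        kernel = torch.softmax(self.b_proj(u2) + rel, dim=1)
        new_u1 = torch.einsum("xah,ahs->xhs", kernel, v).reshape(T, -1)

        # Keep D2 binary predicates: drop the oldest `heads` channels and
        # append the newly deduced relations.
        new_u2 = torch.cat([u2[..., self.h:], rel], dim=-1)
        return new_u1, new_u2

step = DualBranchStep()
u1 = torch.randn(16, 768)       # e.g., token-id embeddings u_0(x)
u2 = torch.randn(16, 16, 64)    # e.g., relative-distance embeddings u_0(x, y)
for _ in range(4):              # forward-chaining: L deduction steps
    u1, u2 = step(u1, u2)
print(u1.shape, u2.shape)       # torch.Size([16, 768]) torch.Size([16, 16, 64])
```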
4 EXPERIMENTS

4.1 EXPERIMENTAL SETTINGS

We now evaluate our FOLNet models under different settings and seek to answer the following question: can the neural architecture (FOLNet) that encodes the logical inductive bias learn better language representations than the transformer models? To this end, we need to eliminate other confounding factors and make sure that the only difference lies in the model architecture itself. First, we choose to pretrain our FOLNet model using the widely used masked language modeling (MLM) loss (Devlin et al., 2018), and add an extra loss of either next sentence prediction (NSP) (Devlin et al., 2018) or sentence-order prediction (SOP) (Lan et al., 2019). Many widely used encoder-only transformer models, such as BERT, RoBERTa, ALBERT, DeBERTa and Megatron-LM, are pretrained with these losses; we therefore also use these models as our primary baselines. Although there could be other more efficient pretraining losses, such as the ones in (Bao et al., 2020; Clark et al., 2020; Yang et al., 2019; Meng et al., 2021), we believe that developing a new model architecture with a better inductive bias is an orthogonal line of research, and we leave the exploration of other pretraining losses for FOLNet as future work. In addition, we consider two settings for the pretraining data: (i) Wikipedia + BookCorpus (Zhu et al., 2015) (16GB of text), and (ii) a larger set of 160GB of text consisting of Wikipedia, BookCorpus2, OpenWebText2, and Pile-CC (extracted from the Pile dataset (Gao et al., 2020)).

| # | Model | Params | PE | D_2 | Loss | MNLI-m/mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BERT | 110M | APE | - | MLM.NSP | 84.5/- | 91.3 | 91.7 | 93.2 | 58.9 | 89.5 | 87.3 | 68.6 | 83.1 |
| 2 | BERT (ours) | 110M | APE | - | MLM.NSP | 83.9/84.1 | 90.9 | 88.2 | 92.6 | 61.5 | 89.2 | 88.2 | 66.8 | 82.7 |
| 3 | FOLNet: j.a | 110M | APE | 12 | MLM.NSP | 84.9/84.5 | 91.1 | 91.6 | 92.2 | 61.1 | 89.6 | 89.5 | 72.2 | 84.0 |
| 4 | FOLNet: j.a | 109M | RPE* | 12 | MLM.NSP | 85.0/84.9 | 91.4 | 91.4 | 93.5 | 63.8 | 90.1 | 89.9 | 72.9 | 84.7 |
| 5 | FOLNet: j.a | 110M | RPE | 12 | MLM.NSP | 84.7/84.9 | 91.4 | 91.5 | 93.8 | 63.4 | 89.8 | 90.6 | 72.6 | 84.7 |
| 6 | FOLNet: jm.ap | 123M | RPE | 12 | MLM.NSP | 85.7/85.3 | 91.6 | 91.8 | 93.8 | 65.7 | 90.4 | 91.0 | 73.5 | 85.4 |
| 7 | FOLNet: j.a | 109M | RPE | 16 | MLM.NSP | 85.2/85.2 | 91.3 | 91.8 | 93.7 | 64.1 | 89.9 | 89.3 | 71.8 | 84.6 |
| 8 | FOLNet: j.a | 109M | RPE | 32 | MLM.NSP | 85.8/85.7 | 91.4 | 92.0 | 93.7 | 64.2 | 90.0 | 90.5 | 71.8 | 84.9 |
| 9 | FOLNet: j.a | 110M | RPE | 64 | MLM.NSP | 85.7/85.5 | 91.4 | 91.8 | 93.2 | 63.5 | 90.1 | 90.2 | 74.7 | 85.1 |
| 10 | FOLNet: j.at | 110M | RPE | 64 | MLM.NSP | 86.7/86.6 | 91.6 | 92.8 | 93.1 | 63.6 | 91.1 | 89.9 | 80.9 | 86.2 |
| 11 | FOLNet: j.atp | 117M | RPE | 64 | MLM.NSP | 87.4/87.4 | 91.9 | 93.3 | 94.0 | 62.9 | 91.3 | 91.4 | 81.6 | 86.7 |
| 12 | FOLNet: jm.atp | 124M | RPE | 64 | MLM.NSP | 88.1/87.6 | 91.7 | 93.9 | 94.2 | 64.7 | 91.2 | 91.4 | 83.2 | 87.3 |
| 13 | FOLNet: jmc.atp | 138M | RPE | 64 | MLM.NSP | 88.2/87.9 | 91.9 | 94.1 | 94.5 | 66.9 | 91.6 | 91.5 | 83.5 | 87.7 |
| 14 | FOLNet: jmc.atp | 137M | RPE | 12 | MLM.NSP | 85.9/86.3 | 91.6 | 92.7 | 93.6 | 63.4 | 90.5 | 91.0 | 75.8 | 85.6 |
| 15 | FOLNet: jmc.atp | 138M | RPE | 64 | MLM.SOP | 88.3/87.9 | 91.8 | 94.2 | 94.7 | 65.6 | 91.1 | 91.1 | 83.2 | 87.5 |

Table 3: Analysis of FOLNet on the development sets of the GLUE benchmark. All the results are medians of five random seeds. From top to bottom, the first block shows the advantage of the dual-branch architecture and compares our new positional encoding to others, the second block analyzes the influence of the binary predicate dimension, the third block performs an ablation study of all the logic operators, and the last block shows the results of other pretraining losses for FOLNet. APE stands for absolute positional encoding, RPE denotes the relative positional encoding used by T5 (Shaw et al., 2018; Raffel et al., 2020), and RPE* denotes our proposed relative positional encoding. We have also pretrained a BERT (base) model (line #2) using the same settings as FOLNet for a fair comparison.
We use the BERT tokenizer with a 32,768-token uncased BPE vocabulary (Sennrich et al., 2016) throughout our experiments.[3] We consider FOLNet models of two different sizes, FOLNet-Base and FOLNet-Large, which are comparable in size to the base (e.g., BERT-Base) and large (e.g., BERT-Large) models in the literature. Finally, the FOLNet models are always pretrained with a sequence length of 128 tokens, although they are evaluated on downstream tasks with longer sequence lengths (e.g., 384 or 512). For evaluation, we consider three benchmarks: GLUE (Wang et al., 2019), SQuAD 2.0 (Rajpurkar et al., 2016b), and FOLIO (Han et al., 2022), where we finetune our pretrained models on each individual task separately. More details about these downstream tasks and the hyper-parameters can be found in Appendix B.

[3] Although Liu et al. (2019) pointed out that it would be more ideal to use byte-level BPE for preprocessing the much larger 160GB of text, we use the same BERT tokenizer (based on character-level BPE) to process the 160GB of text in order to simplify our logistics, and it already demonstrates the strong performance of our models. Using a byte-level BPE tokenizer with a larger 50K vocabulary as in (Liu et al., 2019; Radford et al., 2019; Brown et al., 2020) may further improve our FOLNet models pretrained on the 160GB of text.

4.2 ANALYSIS OF THE FOLNET ARCHITECTURE

We begin with an in-depth analysis of FOLNet to demonstrate its advantages over the transformer architecture. To this end, we first pretrain our FOLNet-Base model on Wikipedia and BookCorpus, the same dataset used by BERT, and we use MLM and NSP as our pretraining losses to be consistent with BERT. We analyze FOLNet from different aspects on the GLUE benchmark and report the results in Table 3. We now proceed to discuss the results below.

| Model | Params | MNLI-m/mm | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | GLUE Avg | SQuAD 2.0 EM | SQuAD 2.0 F1 | FOLIO Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pretrained on Wikipedia + BookCorpus (16GB): | | | | | | | | | | | | | |
| BERT-Base | 110M | 84.5/- | 91.3 | 91.7 | 93.2 | 58.9 | 89.5 | 87.3 | 68.6 | 83.1 | 73.7 | 76.3 | 57.8 |
| RoBERTa-Base | 110M | 85.8/85.5 | 91.3 | 92.0 | 93.7 | 60.1 | 88.5 | 87.3 | 68.2 | 83.3 | 77.7 | 80.5 | - |
| DeBERTa-Base | 134M | 86.3/86.2 | - | - | - | - | - | - | - | - | 79.3 | 82.5 | - |
| FOLNet-Base | 138M | 88.2/87.9 | 91.9 | 94.1 | 94.5 | 66.9 | 91.6 | 91.5 | 83.5 | 87.7 | 84.7 | 87.9 | 64.2 |
| BERT-Large | 340M | 86.6/- | 91.3 | 92.3 | 93.2 | 60.6 | 90.0 | 88.0 | 70.4 | 84.1 | 79.0 | 81.8 | 62.3 |
| Pretrained on extended data (160GB): | | | | | | | | | | | | | |
| RoBERTa-Base | 125M | 87.6/- | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7 | 86.4 | 80.5 | 83.7 | 64.7 |
| DeBERTa-Base | 134M | 88.8/88.5 | - | - | - | - | - | - | - | - | 83.1 | 86.2 | - |
| FOLNet-Base | 138M | 89.4/89.7 | 92.2 | 94.4 | 95.6 | 69.9 | 92.5 | 92.0 | 87.0 | 89.2 | 85.5 | 88.6 | 64.2 |
| RoBERTa-Large | 356M | 90.2/90.2 | 92.2 | 94.7 | 96.4 | 68.0 | 92.4 | 90.9 | 86.6 | 88.9 | 86.5 | 89.4 | 67.7 |
| DeBERTa-Large | 384M | 91.1/91.1 | 92.3 | 95.3 | 96.8 | 70.5 | - | - | - | - | 88.0 | 90.7 | - |
| ALBERT-XXL | 235M | 90.4/- | 92.0 | 95.2 | 96.8 | 68.7 | 92.7 | 90.2 | 88.1 | 89.3 | 87.2 | 89.9 | - |
| ALBERT-XXL+ | 235M | 90.8/- | 92.2 | 95.3 | 96.9 | 71.4 | 93.0 | 90.9 | 89.2 | 89.9 | 87.4 | 90.2 | - |
| FOLNet-Large | 437M | 91.2/91.3 | 92.5 | 95.8 | 96.8 | 71.5 | 92.2 | 93.5 | 91.1 | 90.6 | 88.5 | 91.5 | 70.6 |
| Megatron-1.3B | 1.3B | 90.9/91.0 | 92.6 | - | - | - | - | - | - | - | 87.1 | 90.2 | - |
| Megatron-3.9B | 3.9B | 91.4/91.4 | 92.7 | - | - | - | - | - | - | - | 88.5 | 91.2 | - |

Table 4: Overall results on the development sets of GLUE, SQuAD 2.0 and FOLIO. The upper block of the table shows the results of models pretrained on Wikipedia + BookCorpus (16GB), and the lower block shows models pretrained on the extended data (160GB); models of different sizes are grouped within each block. All the results are medians of five random seeds.
The baseline results on FOLIO are provided by the authors of Han et al. (2022). Here we use ALBERT-XXL to refer to the ALBERT model pretrained for 1M steps, and ALBERT-XXL+ to refer to the ALBERT-XXL model pretrained for 1.5M steps.

The advantage of the dual-branch architecture. Recall that when FOLNet has only the join and assoc operators, it can be viewed as a dual-branch version of the transformer architecture. For a fair comparison, we pretrain such a FOLNet model with the same absolute positional encoding (APE) as BERT (line #3 of Table 3). Note that, even in this heavily degenerate case, FOLNet still noticeably outperforms BERT on average. When FOLNet is equipped with our new relative positional encoding (RPE*, line #4 of Table 3), it outperforms BERT by 2 points on average. Notably, it achieves on-par (or slightly better) average performance compared to the variant with the T5 relative positional encoding (RPE, line #5). As discussed in the earlier sections, the T5 RPE is a degenerate version of our mu and prod operators. Line #6 of Table 3 shows that adding the mu and prod operators to FOLNet further boosts the performance by a noticeable amount.

The benefits of a larger D_2. We can see (lines #7-9 of Table 3) that increasing the dimension D_2 steadily improves the performance. As we will reveal soon, when FOLNet is fully loaded with all the operators, having a larger D_2 is essential to unleash their full power; that is, the performance improvement from increasing D_2 becomes even larger for a fully-loaded FOLNet.

Contribution of the logic operators. We now analyze the contributions of the logic operators by adding them one-by-one into FOLNet until it is fully loaded. We see from line #9 to line #13 in Table 3 that this drastically improves the average performance. In line #14, we evaluate a fully-loaded FOLNet with D_2 decreased from 64 to 12, which shows a significant performance drop. This confirms the importance of having a relatively large D_2 in order to store the deduced relations.

4.3 OVERALL PERFORMANCE

Having closely examined various aspects of the FOLNet architecture, we now proceed to evaluate it comprehensively on three benchmarks (GLUE, SQuAD 2.0, and FOLIO). We will also examine its performance when we further scale up the pretraining data size and the model size. To pretrain the model on the larger (160GB) dataset, we find it more efficient to generate the pretraining data with the SOP loss, because the NSP loss requires us to sample negative sentences from another document. To begin with, we verify the performance by pretraining a FOLNet-Base with the SOP loss on Wikipedia and BookCorpus. The result (line #15 in Table 3) shows that it slightly degrades the performance compared to the variant with NSP (line #13). However, this is relatively tolerable given its convenience when pretraining on a large corpus. Therefore, we replace NSP with SOP when pretraining FOLNet-Base and FOLNet-Large on the 160GB dataset. We show our results on the GLUE benchmark in Table 4 and compare them with other baseline methods. Observe that our FOLNet-Base models pretrained on both the 16GB and the 160GB data significantly outperform other transformer-based models on all three benchmarks. Our FOLNet-Base pretrained on 16GB data outperforms the BERT-Large model, which is 3× larger in model size. In addition, it even outperforms RoBERTa-Base, which is pretrained on 160GB data.
When pretraining our FOLNet-Base model on the 160GB data, we even surpass the RoBERTa-Large model (89.3 vs 88.9) on the GLUE benchmark. Likewise, our FOLNet-Large model also significantly outperforms all the other baselines. It even outperforms ALBERT-XXL+, which is pretrained for 1.5M steps (i.e., 50% more tokens during pretraining). Notably, our FOLNet-Large model achieves performance comparable to the Megatron-3.9B model on both the GLUE and SQuAD 2.0 benchmarks, although the latter is 10 times larger than our model. Furthermore, although our FOLNet model is not designed for solving reasoning problems (it rather incorporates reasoning as an inductive bias at the token level), our model still consistently demonstrates stronger first-order logic reasoning capability on the FOLIO task (e.g., +3.9 over RoBERTa-Large). Finally, we would like to highlight that all the FOLNet-Base and FOLNet-Large models are pretrained with a sequence length of 128, in contrast to 512 as in the other baselines. However, they are evaluated on GLUE, SQuAD 2.0 and FOLIO with sequence lengths of 128, 384 and 512, respectively. In particular, by finetuning with merely 1,004 training examples on FOLIO, FOLNet is able to generalize to much longer sequences (512), which have never been seen during pretraining.

5 RELATED WORKS

Transformer language models. There has been a long line of research on neural language models since Bengio et al. (2000). Recently, great success has been achieved by exploring different variants of pretrained transformer models (Vaswani et al., 2017) for solving downstream language tasks, either with finetuning (Radford et al., 2018; Devlin et al., 2018; Lan et al., 2019; Liu et al., 2019; Yang et al., 2019) or with zero/few-shot learning using large language models (Radford et al., 2019; Brown et al., 2020; Chowdhery et al., 2022). Another line of active research focuses on developing more effective pretraining losses (Yang et al., 2019; Clark et al., 2020; Bao et al., 2020; Tay et al., 2022) beyond the widely used autoregressive or masked language modeling objectives. There has been limited work on developing new neural architectures for learning better language representations. In this paper, we seek to move in this direction and develop a new neural architecture based on a logical inductive bias.

Logic programming and neural reasoning. Our logical inductive bias is inspired by logic programming (Horn, 1951; De Raedt et al., 2007; De Raedt & Kimmig, 2015; Fierens et al., 2012; Manhaeve et al., 2018) and inductive logic programming (Evans & Grefenstette, 2018; Muggleton, 1991; 1996; Friedman et al., 1999). Different from these works, we do not directly work on reasoning problems. Instead, we seek to encode the logical inductive bias into the neural model to learn better language representations. Another line of related works focuses on developing neural models that can perform reasoning in a broad sense. For example, different variants of memory-augmented networks have been developed (Le et al., 2020; Santoro et al., 2018; Graves et al., 2014; 2016; Santoro et al., 2016; Le et al., 2019), which augment a control network with memory units and the associated neural read/write modules. Besides, Rocktäschel & Riedel (2017) and Minervini et al. (2020) consider the problem of proving structured queries against a knowledge base by constructing differentiable neural networks via backward-chaining. Bergen et al. (2021) develop a triangular attention to deduce relations from other relations, which can be viewed as a transformer with a single relational branch.
These methods are effective in solving (relatively small-scale) reasoning tasks. However, it remains unclear whether they can be effectively pretrained on large-scale texts to solve diverse downstream natural language tasks. 6 CONCLUSION We introduce a novel logical inductive bias, which treats language representation learning as a logic programming problem. We then develop a fully-differentiable neural architecture (FOLNet) that effectively encodes this inductive bias by forward-chaining a rich set of neural logic operators. The proposed FOLNet architecture has the same input-output interface as the transformer models and can be pretrained over large-scale text data. Experimental results demonstrate that the FOLNet architecture significantly outperforms different variants of transformer models, and has many inherent advantages due to the new dual-branch architecture along with its rich set of neural logic operators. Published as a conference paper at ICLR 2023 Peter B Andrews. An introduction to mathematical logic and type theory: to truth through proof, volume 27. Springer Science & Business Media, 2013. Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. Unilmv2: Pseudo-masked language models for unified language model pre-training. In International Conference on Machine Learning, pp. 642 652. PMLR, 2020. Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. ar Xiv preprint ar Xiv:1806.01261, 2018. Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. In T. Leen, T. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. URL https://proceedings.neurips.cc/paper/2000/ file/728f206c2a01bf572b5940d7d9a8fa4c-Paper.pdf. Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The fifth pascal recognizing textual entailment challenge. In TAC, 2009. Leon Bergen, Timothy O Donnell, and Dzmitry Bahdanau. Systematic generalization with edge transformers. Advances in Neural Information Processing Systems, 34:1390 1402, 2021. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877 1901, 2020. Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. Sem Eval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (Sem Eval-2017), pp. 1 14, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://aclanthology.org/S17-2001. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. ar Xiv preprint ar Xiv:2204.02311, 2022. Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. ar Xiv preprint ar Xiv:2003.10555, 2020. Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. 
In Machine learning challenges workshop, pp. 177 190. Springer, 2005. Luc De Raedt and Angelika Kimmig. Probabilistic (logic) programming concepts. Machine Learning, 100(1):5 47, 2015. Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. Problog: A probabilistic prolog and its application in link discovery. In IJCAI, volume 7, pp. 2462 2467. Hyderabad, 2007. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171 4186, 2018. William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. URL https://aclanthology.org/I05-5002. Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1 64, 2018. Daan Fierens, Guy Van den Broeck, Maurice Bruynooghe, and Luc De Raedt. Constraints for probabilistic logic programming. In Proceedings of the NIPS probabilistic programming workshop, pp. 1 4, 2012. Published as a conference paper at ICLR 2023 Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In Proc. IJCAI, 1999. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. ar Xiv preprint ar Xiv:2101.00027, 2020. Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3816 3830, 2021. Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 1 9, Prague, June 2007. Association for Computational Linguistics. URL https://aclanthology.org/W07-1401. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. ar Xiv preprint ar Xiv:1410.5401, 2014. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska Barwi nska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626): 471 476, 2016. R Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 7, 2006. Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, et al. Folio: Natural language reasoning with first-order logic. ar Xiv preprint ar Xiv:2209.00840, 2022. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735 1780, 1997. Alfred Horn. On sentences which are true of direct unions of algebras1. The Journal of Symbolic Logic, 16(1):14 21, 1951. Shankar Iyer, Nikhil Dandekar, and Kornl Csernai. First quora dataset release: Question pairs. In Quora, January 2017. URL https://www.quora.com/q/quoradata/ First-Quora-Dataset-Release-Question-Pairs. 
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. ar Xiv preprint ar Xiv:1909.11942, 2019. Hung Le, Truyen Tran, and Svetha Venkatesh. Neural stored-program memory. In International Conference on Learning Representations, 2019. Hung Le, Truyen Tran, and Svetha Venkatesh. Self-attentive associative memory. In International Conference on Machine Learning, pp. 5682 5691. PMLR, 2020. Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012. Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858. Percy Liang. Lambda dependency-based compositional semantics. Co RR, abs/1309.4408, 2013. URL http://arxiv.org/abs/1309.4408. Published as a conference paper at ICLR 2023 Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, 2018. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019. Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. Deepproblog: Neural probabilistic logic programming. Advances in Neural Information Processing Systems, 31, 2018. Yu Meng, Chenyan Xiong, Payal Bajaj, Paul Bennett, Jiawei Han, Xia Song, et al. Coco-lm: Correcting and contrasting text sequences for language model pretraining. Advances in Neural Information Processing Systems, 34:23102 23114, 2021. Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018. URL https: //openreview.net/forum?id=r1gs9Jg RZ. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013. Pasquale Minervini, Matko Bošnjak, Tim Rocktäschel, Sebastian Riedel, and Edward Grefenstette. Differentiable reasoning on large knowledge bases and natural language. In Proceedings of the AAAI conference on artificial intelligence, pp. 5182 5190, 2020. Stephen Muggleton. Inductive logic programming. New Gener. Comput., 8(4):295 318, 1991. Stephen Muggleton. Stochastic logic programs. Advances in Inductive Logic Programming, 32: 254 264, 1996. Nvidia. A pytorch extension: Tools for easy mixed precision and distributed training in pytorch. Available Online: https://https://github.com/NVIDIA/apex, 2019. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019. 
Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532 1543, 2014. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227 2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1202. URL https: //aclanthology.org/N18-1202. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. Open AI, 2018. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1 67, 2020. Published as a conference paper at ICLR 2023 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQu AD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383 2392, Austin, Texas, November 2016a. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology. org/D16-1264. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383 2392, 2016b. Tim Rocktäschel and Sebastian Riedel. End-to-end differentiable proving. Advances in neural information processing systems, 30, 2017. Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Metalearning with memory-augmented neural networks. In International conference on machine learning, pp. 1842 1850. PMLR, 2016. Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. Advances in Neural Information Processing Systems, 31:7299 7310, 2018. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715 1725. Association for Computational Linguistics (ACL), 2016. Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL-HLT (2), pp. 464 468, 2018. URL https://aclanthology.info/papers/ N18-2074/n18-2074. Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L Hamilton. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4506 4515, 2019. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. 
In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631 1642, 2013. Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. ar Xiv preprint ar Xiv:2205.05131, 2022. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. Petar Veliˇckovi c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, 2019. Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625 641, 2019. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. ar Xiv preprint ar Xiv:2201.11903, 2022. Published as a conference paper at ICLR 2023 Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112 1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. URL https: //aclanthology.org/N18-1101. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019. Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Syx4wn Etv H. Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015. Published as a conference paper at ICLR 2023 Supplementary Materials A ADDITIONAL DETAILS OF THE FOLNET ARCHITECTURE A.1 MORE DISCUSSIONS ON THE NEURAL LOGIC OPERATORS Modus Ponens inference with restricted kernels. First, it is straightforward to show that the neural logic operators in Table 2 can be implemented as batch matrix multiplications (Figure 2), where multiple slices of matrix multiplications are executed in parallel to obtain the outputs. Different operators pick their own reduction dimensions (R-dim) and batching dimensions (B-dim) for the matrix multiplication (see Table 2). For example, the join operator picks the a-dimension as the R-dim and the h-dimension as the B-dim so that matrix multiplications are carried out in parallel over h. 
According to (9), each slice of the matrix multiplication can be viewed as an independent neural Modus Ponens inference based on its own set of clauses. For example, in the join operator, let Kjoin,h denote a matrix that collects Kh(x, a) as its (x, a)-th element. Then, Kjoin,h is the h-th kernel slice, which represents a particular group of clauses for the join operator. In addition, recall that the values at each row of the kernel slice characterize the conjunction pattern on the right-hand side of (8), with positive (or negative) values determining whether the original premise Pm (or the negated premise ¬Pm) should be used for conjunction (see Section 3.2 and Appendix D). Therefore, the R-dim in the matrix multiplication should be understood as the conjunction dimension in (8). In the join operator, this implies that the conjunction operation is over the a-dimension of the premises vhs(a), with the conjunction pattern determined by the x-th row of Kjoin,h if we want to infer uhs(x). In contrast, a full Modus Ponens inference from unary atoms to unary atoms (the first expression in (7)) would perform its conjunction over the joint dimensions of (h, s, a), which is much higher in complexity. Therefore, the join operator controls the complexity of Modus Ponens inference by restricting the conjunction operation to the a-dimension. Furthermore, it is noteworthy that the join operator applies the same kernel slice to premises of different s; that is, it implements kernel-sharing across the s-dimension. This is another complexity-reduction strategy resulting from the principles in Section 3.2. Likewise, the same conclusion holds for all other operators, which pick their own conjunction dimensions and kernel-sharing dimensions (see Figure 2).
Figure 2: Our neural logic operators can be efficiently implemented as batch matrix multiplications by using hardware accelerators (e.g., GPUs and TPUs). Each slice of matrix multiplications represents an independent neural Modus Ponens inference based on its own set of clauses.
Naming protocols and examples. We now explain the naming protocols for the logic operators, along with concrete examples that demonstrate how each of them helps reasoning in different aspects.
bool: It operates over the predicate dimension and plays the role of Boolean functions over a set of input truth values, which leads to its name of bool-operator. It is used to deduce a property of x from other properties regarding the same x via certain Boolean operations. For example, it can model inferences like Edible(x) ← IsMushroom(x) ∧ ¬IsToxic(x), where the conjunction pattern on the right-hand side is determined by the kernel.
join: We name this operator the join-operator because it has a similar form as the join operation in λ-DCS (Liang, 2013). For example, when the kernel represents IsBornIn(x, y), it can be used to infer CitizenOfUSA(x) from the premise CountryIsUSA(y), where the kernel determines the conjunction pattern of CountryIsUSA(y) over y.
cjoin: Since it simply swaps the roles of the kernel and the premise in the join-operator, we name it the conjugate join operator. It plays a similar role as the join operator.
mu: We name it the mu-operator because it can be viewed as a more general form of the mu-abstraction operation in λ-DCS (Liang, 2013).
For example, when the kernel models ApplyJobAt(x, y), it can deduce HasAJob(x) from the premise ReceiveOfferFrom(x, y), where the kernel determines the conjunction pattern of ReceiveOfferFrom(x, y) over y.
assoc: We name it the assoc-operator because it can be viewed as computing the association between two vectors. For example, when the kernel represents LosAngeles(y), it can be used to infer SameStateAs(x, y) from SanFrancisco(x), where the kernel determines the conjunction pattern of SanFrancisco(x) over the predicate dimension. In this example, we only have one premise atom over the predicate dimension; in the general case, a particular conjunction pattern would be applied to multiple premise atoms to yield the output.
prod: We name it the prod-operator because it resembles computing a product over the predicate dimension. For example, when the kernel models Graduated(x), it can be used to infer GraduatedFrom(x, y) from StudyAt(x, y), where the kernel determines the conjunction pattern of StudyAt(x, y) over the predicate dimension. In this example, we only have one premise atom over the predicate dimension; in the general case, a particular conjunction pattern would be applied to multiple premise atoms to yield the output.
trans: We name it the trans-operator because its form is reminiscent of reasoning with transitive properties. For example, when the kernel models FatherOf(x, z), it can be used to infer GrandFatherOf(x, y) from ParentOf(z, y), where the kernel determines the conjunction pattern of ParentOf(z, y) over the dimension z.
Note that we do not have a logic operator corresponding to the typing B B U, because it would be identical to the prod operator when the kernel action function is chosen to be the identity mapping. In addition, the above examples further show that the kernels (i.e., the clauses) themselves are also atoms or can be computed from certain atoms. This justifies our strategy of parameterizing the kernels by (the intermediate deduction results of) FOLNet itself in Section 3.3.
A.2 COMPARING TO THE TRANSFORMER ARCHITECTURE
Figure 3 shows that the transformer architecture can be understood as a single-branch version of the FOLNet architecture with only the join-operator and the assoc-operator.
Figure 3: Overview of the Transformer architecture. It can be understood as a single-branch version of the FOLNet architecture with only the join-operator and the assoc-operator.
A.3 THE BOOLEAN OPERATOR
In FOLNet, we do not chain the Boolean operators in parallel with the other operators but chain them with the others in a cascaded manner. This is akin to cascading the FFN with the self-attention module in transformers. Nevertheless, we may also place it in parallel to the other operators, just as the FFN is placed in parallel to self-attention in the transformer of Chowdhery et al. (2022). In addition, we adopt the same FFN architecture (i.e., a multilayer perceptron with GeLU units) for the Boolean operator in order to make FOLNet directly comparable with transformer architectures.
A.4 THE INPUT-OUTPUT INTERFACE
Computing the pretraining losses (e.g., MLM, NSP and SOP) from the output atoms. As we pointed out, FOLNet models have a similar input-output interface as the existing transformer models, so that we can seamlessly adopt existing pretraining losses (e.g., MLM, NSP and SOP) by computing them from {uL(x)}.
Recall that uL(xt) is the vector that represents the derived (advanced) properties of the object (token) xt, where t = 1, . . . , T. For a masked token xt, uL(xt) contains its properties that can be deduced from the other tokens via their input properties and relations. Therefore, these deduced properties in uL(xt) can be used to predict the original masked token. For example, we can apply a linear classifier followed by a softmax operator to compute a probability distribution over the vocabulary and the MLM loss. Likewise, to compute the NSP and SOP losses, we add a special [CLS] token at the beginning of the input sequence, so that the uL(x) that corresponds to the [CLS] token is fed into a binary classifier for computing the NSP or SOP loss. The usage of {uL(xt)} in downstream tasks (such as sequence classification, multiple-choice and sequence labeling) is also similar. On the other hand, we have not used the output binary predicates {uL(xt, yτ)} for computing any losses. The main reason is that we would like to adopt the existing off-the-shelf pretraining losses for an apples-to-apples comparison regarding the proposed model architecture. Since most of these losses were developed for the transformer architecture, which is a single-branch model with only unary predicates on its main pathway (Figure 3), it is not surprising that these losses are mainly computed from the unary properties {uL(xt)}. Nevertheless, we believe that our newly introduced binary predicate branch could open a new avenue for developing additional pretraining losses using uL(xt, yτ) (i.e., the token relations). For example, we may randomly swap two tokens xt and xτ and use uL(xt, xτ) to predict whether they have been swapped or not. We leave the development of more effective pretraining losses for FOLNet as future work.
The input base atoms in B and their logit vector representations. Throughout our work, we only consider plain texts as the input, which is similar to other pretrained language models. Therefore, the base atoms at the input should only encode the information that is directly available from the plain text, which includes the token ids in the input sequence (denoted by TokenIDv(xt)) and the relative distance between any two tokens (denoted by RelDistd(xt, xτ)). Specifically, TokenIDv(xt) = T if the token xt takes the v-th id in the vocabulary for v = 1, . . . , V, and TokenIDv(xt) = F otherwise. When the input consists of a pair of sequences, we construct the input sequence as "[CLS] Sequence #0 [SEP] Sequence #1 [SEP] [PAD] ... [PAD]", which is similar to the input format used in BERT and other transformer models. In this case, we introduce an additional base atom SeqIDs(xt) to characterize whether a token xt belongs to the s-th sequence (s = 0, 1); that is, SeqIDs(xt) = T if token xt belongs to the s-th sequence and is F otherwise. This atom is akin to the segment ids (or token type ids) in existing pretrained transformer models. Furthermore, the relative positional atom RelDistd(xt, xτ) = T if d = distκ(xt, xτ) and is F otherwise, where distκ(xt, xτ) is a clipped distance function with a clipping parameter κ:
distκ(xt, xτ) =
  max(1 − κ, min(τ − t, κ − 1)), if t > 0, τ > 0 and SeqIDs(xt) = SeqIDs(xτ);
  a reserved cross-sequence distance id, if t > 0, τ > 0 and SeqIDs(xt) ≠ SeqIDs(xτ);
  two further reserved distance ids for the cases t = 0, τ > 0 and t > 0, τ = 0, respectively;
  0, if t = 0 and τ = 0.
The above clipped distance function clamps the relative distance to be within [1 − κ, κ − 1] when the two tokens are from the same sequence, and it assigns a special distance id whenever they belong to two different sequences. In other words, the distance between any two tokens is the same if they belong to two separate sequences. Likewise, we also characterize the distances between the [CLS] token (t = 0 or τ = 0) and the other tokens (τ > 0 or t > 0) with two special distance ids, which set all the regular tokens to have the same distance towards (or from) the special [CLS] token. Such a relative positional encoding (atom) preserves the translational invariance of the text sequence and generalizes better than the absolute positional encodings used in BERT and RoBERTa. Notably, it allows FOLNet to generalize to much longer sequences that are unseen during pretraining.
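For concreteness, here is a minimal sketch of the clipped relative-distance atom just described (illustrative only; the exact integer ids reserved for the cross-sequence and [CLS] cases are assumptions, since only their distinctness matters):

```python
import torch

def rel_dist(seq_ids: torch.Tensor, kappa: int = 64) -> torch.Tensor:
    """seq_ids: (T,) segment id per position; position 0 is the [CLS] token."""
    T = seq_ids.numel()
    t = torch.arange(T).unsqueeze(1)     # row index t
    tau = torch.arange(T).unsqueeze(0)   # column index tau
    d = torch.clamp(tau - t, min=1 - kappa, max=kappa - 1)        # same-sequence case
    cross = seq_ids.unsqueeze(1) != seq_ids.unsqueeze(0)
    d = torch.where(cross, torch.full_like(d, kappa), d)          # assumed cross-sequence id
    d = torch.where((t == 0) & (tau > 0), torch.full_like(d, kappa + 1), d)  # [CLS] -> token
    d = torch.where((t > 0) & (tau == 0), torch.full_like(d, kappa + 2), d)  # token -> [CLS]
    d[0, 0] = 0
    return d  # (T, T) integer ids, later mapped to dense vectors via embedding lookup
```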
In summary, we consider the following base atoms as the inputs to the FOLNet model:
B = {TokenIDv(xt), SeqIDs(xt), RelDistd(xt, xτ) | t, τ = 1, . . . , T}. (10)
By stacking TokenIDv(xt) and SeqIDs(xt) over v and s, stacking RelDistd(xt, xτ) over d, and using the values of 1 and 0 to represent T and F, respectively, we obtain 0-1 vectors that represent all the input atoms in B. Then, we may further convert them into dense vector representations {u0(xt), u0(xt, xτ)}t,τ via embedding lookup. The embedding lookup process can also be understood as finding a set of more fine-grained learnable properties (or relations) to characterize a given token xt (or a token pair (xt, xτ)), where these properties (or relations) are represented in the same logit space as the neural logic operators in Table 2, which carry out Modus Ponens inferences. Finally, the above base atoms are just one particular design for our FOLNet model, and there could be other alternatives for encoding the input information. When there is extra information available besides plain texts, we may also encode it as new unary or binary base atoms in B.
A.5 EXTENDING FOLNET TO TEXT-TO-TEXT VERSIONS
So far, we have mainly focused on the encoder-only version of the FOLNet model. We now show that it can be extended to the decoder-only and the encoder-decoder counterparts in a relatively straightforward manner. Such extensions would be useful for text-to-text generation tasks.
Decoder-only. To develop the decoder-only version of FOLNet, we need to let the model autoregressively generate the output tokens. In the context of FOLNet, this requires the unary and binary atoms {u(xt), u(xt, xτ)} to be inferred only from the premises in the past: {v(xω), v(xω, xν) : ω ≤ t}. In addition, we further restrict the binary atoms to have a causal pattern, i.e., u(xt, xτ) = 0 and u(xω, xν) = 0, whenever t < τ and ω < ν. Figure 4(a) illustrates these patterns for the unary and the binary atoms. Note that the causal structure for the binary atoms translates into a lower-triangular pattern for the nonzeros. In particular, the dark blue atoms are directly responsible for the generation of the next token (word), and auto-regressive generation requires the model to update them only from the light blue atoms. We enforce such an auto-regressive property by multiplying a proper 0-1 mask onto the binary kernels and premises in the neural logic operators.
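The following is a minimal sketch of such 0-1 masks (illustrative only, not the released implementation), covering the two decoder-only variants discussed next:

```python
import torch

def causal_lm_mask(T: int) -> torch.Tensor:
    # Lower-triangular: position x may only deduce from positions a <= x.
    return torch.tril(torch.ones(T, T))

def prefix_lm_mask(T: int, prefix_len: int) -> torch.Tensor:
    # Bi-directional within the prefix (input) segment, uni-directional on the target segment.
    mask = torch.tril(torch.ones(T, T))
    mask[:prefix_len, :prefix_len] = 1.0
    return mask

# Usage sketch: zero out masked entries, e.g. K_h(x, a) * mask[x, a], before the
# batched matrix multiplications of the neural logic operators.
```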
In the last column of Table 5, we list the kernels and the premises that should be masked, and in Figure 4(b), we show the masks that correspond to two variants of decoder-only models: (i) the Causal Language Model (CausalLM) (Radford et al., 2018; 2019; Brown et al., 2020; Chowdhery et al., 2022), and (ii) the Prefix Language Model (PrefixLM) (Liu et al., 2018; Raffel et al., 2020). The CausalLM has a lower-triangular mask for the binary atoms so that the information pattern is always unidirectional. On the other hand, the PrefixLM has a bi-directional mask for the input segment (the top-left part in Figure 4(b)) and a unidirectional mask for the output (target) segment (the bottom-right part in Figure 4(b)). Accordingly, the binary atoms of the PrefixLM version share a similar pattern as its mask in Figure 4(b), which models the relations for the prefix and the output segments separately. With these simple modifications, our decoder-only FOLNet model has the same input-output interface as decoder-only transformers; it can be pretrained to predict the next tokens using a linear classifier over the unary atoms uL(xt). After training, it can generate tokens in an auto-regressive manner.
Encoder-Decoder. For the encoder-decoder variant, we use two separate stacks of FOLNet for the encoder and the decoder, where each of them has the same overall architecture as in Figure 1 with its own set of atoms (the green and blue blocks in Figure 4(c)). In particular, the encoder is identical to the encoder-only version that we have thoroughly discussed earlier in the main paper. Meanwhile, the decoder part is similar to the decoder-only variant with a few additional modifications. First, the decoder needs to maintain a slightly different version of the binary atoms (Figure 4(c)). Specifically, besides the relations between the output tokens, the decoder also has to model the (unidirectional) relations from the input tokens to the output tokens (i.e., the bottom-left part of uh(x, y) in Figure 4(c)). These relations are crucial in deducing the output tokens from the input ones, which plays a similar role as the cross-attention scores in transformers.
Figure 4: Extending FOLNet to text-to-text versions. Panels: (a) atoms for the decoder-only FOLNet; (b) masks for the decoder-only FOLNet (Causal Language Model and Prefix Language Model); (c) atoms for the encoder-decoder FOLNet; (d) masks for the encoder-decoder FOLNet. The top row illustrates the atoms and the masking strategy for the decoder-only version, and the bottom row shows the encoder-decoder variant. On the left side, we visualize the unary and the binary atoms {ud(x), uh(x, y)} for these cases, where d = (h − 1)S + s. The green and blue colors characterize the nonzero patterns for the atoms of the encoder and the decoder, respectively. The green-colored atoms are associated with the encoder part in the encoder-decoder variant. The dark blue color represents the atoms that are directly responsible for generating the next token in the decoder. Notably, they are deduced only from the light blue atoms, so that the entire generation process retains an auto-regressive nature. On the right side, the white and the orange blocks represent the 0-1 masking positions, where the light orange part is associated with the prefix segment in PrefixLM.
Likewise, the encoder-decoder variant has separate sets of masks for the encoder and the decoder, respectively, where the mask for the encoder part is designed to have a bi-directional information pattern; a similar design also holds for the prefix part in PrefixLM.
| Sym. | Ops. | Typing | B-dim | R-dim | Neural operator in logit space | Mask |
| b | bool | U U U | x | w | uhs(x) = Σw Khw(x) vws(x) | - |
| c | cjoin | U U B | h | a | uhs(x) = Σa Khs(a) vh(x, a) | vh(x, a) |
| j | join | U B U | h | a | uhs(x) = Σa Kh(x, a) vhs(a) | Kh(x, a) |
| m | mu | U B B | x | a | uhs(x) = Σa Kh(x, a) vs(x, a) | Kh(x, a), vs(x, a) |
| a | assoc | B U U | h | w | uh(x, y) = Σw Khw(x) vhw(y) | - |
| p | prod | B U B | x | w | uh(x, y) = Σw Khw(x) vw(x, y) | vw(x, y) |
| t | trans | B B B | h | a | uh(x, y) = Σa Kh(x, a) vh(a, y) | Kh(x, a), vh(a, y) |
Table 5: The binary kernels and premises to be masked for the decoder-only versions of FOLNet. In the encoder-decoder variant, a similar part of the kernels and premises is masked in its decoder.
Accordingly, we also need to adjust the masks to handle these relations separately (see Figure 4(d)). Second, the neural logic operators in Table 5 should also be slightly adjusted in order to incorporate the additional premise atoms from the encoder output. For example, the premise vhs(a) in the join-operator should now be a concatenation of the unary atoms from the encoder output (with a linear projection) and the decoder premises. Likewise, the premise vhw(y) in the assoc-operator should also be a concatenation of the encoder output (with a linear projection) and the decoder premises. These two modified operators share a similar spirit as the self-attention and the cross-attention mechanisms in transformer decoders. The cjoin operator is similar to the join operator, except that the concatenation now happens at the kernel Khs(a) (with a linear projection). The mu-operator and the prod-operator stay the same as before. Meanwhile, the trans-operator for the decoder needs to be implemented according to the following new expression:
uh(x, y) = Σ_{a∈T} Kh(x, a) vDh(a, y), if y ∈ T;
uh(x, y) = Σ_{a∈I} Kh(x, a) vEh(a, y) + Σ_{a∈T} Kh(x, a) vDh(a, y), if y ∈ I,
where T and I are defined to be the target and the input sequences, respectively, the kernel Kh(x, a) is obtained by applying a linear projection to the binary atoms u(x, a) in the decoder, and the superscript E or D in the premises vh(a, y) denotes whether they come from the encoder or the decoder. Notably, we observe that the encoder-decoder version of FOLNet retains the dual-branch architecture in its decoder module as well. This is in sharp contrast to the standard transformer decoder, which is a single-branch architecture with only unary atoms on its main pathway.
B EXPERIMENTAL DETAILS
B.1 OVERVIEW OF THE DOWNSTREAM TASKS
GLUE. The GLUE benchmark (Wang et al., 2019) consists of 9 tasks: MNLI (Williams et al., 2018), QQP (Iyer et al., 2017), QNLI (Rajpurkar et al., 2016a), SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), STS-B (Cer et al., 2017), MRPC (Dolan & Brockett, 2005), RTE (Dagan et al., 2005; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and WNLI (Levesque et al., 2012). They cover a wide range of natural language understanding tasks such as natural language inference, paraphrasing, linguistic acceptability, and sentiment analysis. We finetune our FOLNet models by following the same procedures as BERT (Devlin et al., 2018). We do not evaluate our model on WNLI because it generally needs special procedures.
The basic description of all the tasks in GLUE (including their evaluation metrics) can be found in Table 6.
| | MNLI | QQP | QNLI | SST-2 | CoLA | RTE | MRPC | STS-B |
| Size | 393K | 364K | 108K | 67K | 8.5K | 2.5K | 3.7K | 5.7K |
| Task | Inference | Similarity | QA/Inference | Sentiment | Acceptability | Inference | Paraphrase | Similarity |
| Metric(s) | Accuracy | Accuracy/F1 | Accuracy | Accuracy | Matthews corr. | Accuracy | Accuracy/F1 | Pearson/Spearman |
| #Classes | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1 (regression) |
Table 6: Basic information about the different tasks in the GLUE benchmark.
SQuAD 2.0. SQuAD 2.0 (Rajpurkar et al., 2016b) is an extractive question answering dataset built from Wikipedia. The objective of the task is to predict an answer span from the context paragraphs. SQuAD 2.0 is an updated version that adds an additional 50,000 unanswerable questions to the original SQuAD 1.1 version. The performance metrics are exact match (EM) and F1 scores.
FOLIO. FOLIO (Han et al., 2022) is a natural language reasoning dataset that contains first-order logic reasoning problems. It requires the models to decide whether a conclusion statement is correct or not given a world defined by a set of premises. It is formulated as a 3-class classification problem with the labels being "True", "False", and "Unknown". We follow the same procedure as Han et al. (2022) by formulating the problem as a sequence-pair classification problem. Specifically, we concatenate all the premises into one sequence (i.e., sequence A), and then further concatenate it with the conclusion (i.e., sequence B), where the two sequences are separated by a [SEP]. In addition, we add a [CLS] token at the beginning and a [SEP] at the end before padding (at the end). By doing so, the task has the same format as a natural language inference task (e.g., MNLI). Table 7 (quoted from the original FOLIO paper (Han et al., 2022)) shows an example from the FOLIO dataset, which demonstrates that strong first-order reasoning capabilities are required to solve the problem. The dataset has an official train/validation/test split with 1,004/204/227 examples, respectively. As of this submission, the test set is not available, and thus we only report performance on the validation set.
A FOLIO example based on the Wild Turkey Wikipedia page:
NL premises:
1. There are six types of wild turkeys: Eastern wild turkey, Osceola wild turkey, Gould's wild turkey, Merriam's wild turkey, Rio Grande wild turkey, and the Ocellated wild turkey.
2. Tom is not an Eastern wild turkey.
3. Tom is not an Osceola wild turkey.
4. Tom is also not a Gould's wild turkey, or a Merriam's wild turkey, or a Rio Grande wild turkey.
5. Tom is a wild turkey.
NL conclusions -> Labels:
A. Tom is an Ocellated wild turkey. -> True
B. Tom is an Eastern wild turkey. -> False
C. Joey is a wild turkey. -> Unknown
FOL premises:
1. ∀x (WildTurkey(x) → (Eastern(x) ∨ Osceola(x) ∨ Goulds(x) ∨ Merriams(x) ∨ Riogrande(x) ∨ Ocellated(x)))
2. ¬(WildTurkey(tom) ∧ Eastern(tom))
3. ¬(WildTurkey(tom) ∧ Osceola(tom))
4. WildTurkey(tom) ∧ ¬(Goulds(tom) ∨ Merriams(tom) ∨ Riogrande(tom))
5. WildTurkey(tom)
FOL conclusions -> Labels:
A. Ocellated(tom) -> True
B. Eastern(tom) -> False
C. WildTurkey(joey) -> Unknown
Table 7: We show an example from the FOLIO dataset, quoted directly from the original FOLIO paper (Han et al., 2022). It demonstrates that strong first-order reasoning capabilities are required to solve the problem. NL stands for natural language and FOL stands for first-order logic. We only use the natural language part to solve the task in our experiments.
B.2 IMPLEMENTATION DETAILS AND HYPER-PARAMETERS
We implement both our pretraining and finetuning pipelines using PyTorch (Paszke et al., 2019) and automatic mixed precision (AMP) learning (Micikevicius et al., 2018) based on the Apex library (Nvidia, 2019). For pretraining, we use large-batch training (with a batch size of 131K) using the LAMB
optimizer (You et al., 2020). The pretraining of FOLNetBase on Wikipedia + BookCorpus (16GB) for 8K steps takes about 12 hours using 512 V100 GPUs. The pretraining of FOLNetBase on 160GB of data for 128K steps takes 7 days using 512 V100 GPUs. The pretraining of FOLNetLarge on 160GB of data for 128K steps takes 19 days using 512 V100 GPUs. For the finetuning of all downstream tasks, we also use AMP learning based on Apex, and the optimizers are FusedAdam from the Apex library. We report the hyper-parameters for pretraining FOLNet in Table 8. The hyper-parameters for finetuning the different downstream tasks are included in Table 9. The hyper-parameters for the finetuning tasks are searched per task, and the results are the median of five random seeds.
| Hyperparams | FOLNetBase | FOLNetBase | FOLNetLarge |
| Pretraining data size | 16G | 160G | 160G |
| Number of Layers (L) | 12 | 12 | 24 |
| Unary Hidden size (D1) | 768 | 768 | 1024 |
| Binary Hidden size (D2) | 64 | 64 | 64 |
| Unary Boolean (FFN) intermediate size | 3072 | 3072 | 4096 |
| Binary Boolean (FFN) intermediate size | 256 | 256 | 256 |
| Number of Attention heads (H) | 12 | 12 | 16 |
| Attention head size (S) | 64 | 64 | 64 |
| RPE clipping parameter (κ) | 64 | 64 | 64 |
| Dropout rate | 0.1 | 0.1 | 0.1 |
| Attention Dropout rate | 0.1 | 0.1 | 0.1 |
| Warmup Ratio | 1% | 1% | 1% |
| Peak Learning Rate | 1e-2 | 2e-3 | 1.6e-3 |
| Batch Size | 131,072 | 131,072 | 131,072 |
| Weight Decay | 0.01 | 0.01 | 0.01 |
| Max Steps | 8K | 128K | 128K |
| Learning rate Decay | Linear | Linear | Linear |
| LAMB ε | 1e-6 | 1e-6 | 1e-6 |
| LAMB β1 | 0.9 | 0.9 | 0.9 |
| LAMB β2 | 0.999 | 0.999 | 0.999 |
| Gradient Clipping | 0.0 | 0.0 | 0.0 |
| Sequence length (T) | 128 | 128 | 128 |
Table 8: The hyper-parameters for pretraining FOLNet in different settings.
| Hyperparams | GLUE-small/big | SQuAD 2.0 | FOLIO |
| Max epochs | {5, 10, 20} / {2, 3, 5} | {2, 3} | {60, 80} |
| Peak lr for Base (16G) | {6e-5, 8e-5, 1e-4, 2e-4} | {8e-5, 9e-5, 1e-4} | {6e-5, 7e-5, 8e-5} |
| Peak lr for Base/Large (160G) | {1e-5, 2e-5, 3e-5, 4e-5} | {1e-5, 2e-5, 3e-5, 4e-5} | {1e-5, 2e-5, 3e-5} |
| Batch size | {16, 32} / {32} | {16, 32} | {16, 32} |
| Learning rate decay | Linear | Linear | Linear |
| Warmup ratio | {6%, 25%} / {6%} | 6% | {6%, 25%} |
| Sequence length | 128 | 384 | 512 |
| Adam ε | 1e-6 | 1e-6 | 1e-6 |
| Adam β1 | 0.9 | 0.9 | 0.9 |
| Adam β2 | 0.999 | 0.999 | 0.999 |
| Gradient clipping | 0.0 | 0.0 | 0.0 |
| Dropout rate | 0.1 | 0.1 | 0.1 |
| Weight decay | 0.01 | 0.01 | 0.01 |
Table 9: The hyper-parameters for finetuning on the GLUE, SQuAD 2.0, and FOLIO tasks. GLUE-small refers to CoLA, STS-B, MRPC and RTE; GLUE-big stands for MNLI, QQP, QNLI and SST-2.
C ADDITIONAL EXPERIMENTS
Zero-shot performance on GLUE. We evaluate the zero-shot performance of our FOLNet model on the GLUE benchmark. Specifically, we perform zero-shot predictions by using the same method and prompt templates as Gao et al. (2021). We only consider the FOLNetLarge model and compare it to RoBERTaLarge, which is similar in model size, pretraining dataset, and pretraining losses.
In addition, the zero-shot prediction methods are also similar: they both predict the label words using their own MLM heads. The results are summarized in Table 10, which demonstrates that FOLNetLarge outperforms RoBERTaLarge on 7 out of 8 tasks, with an average gain of 7.2 points.
| Model | Method | Data | MNLI-m/mm (Acc) | QQP (F1) | QNLI (Acc) | SST-2 (Acc) | CoLA (mcc) | STS-B (Pear.) | MRPC (F1) | RTE (Acc) | Avg |
| Majority guess | - | - | 32.7/33.0 | 0.0 | 49.5 | 50.9 | 0.0 | - | 81.2 | 52.7 | 33.3 |
| RoBERTaLarge | MLM | 160G | 50.8/51.7 | 49.7 | 50.8 | 83.6 | 2.0 | -3.2 | 61.9 | 51.3 | 44.3 |
| FOLNetLarge | MLM | 160G | 50.8/52.6 | 55.4 | 59.7 | 79.5 | 6.4 | 23.6 | 77.2 | 58.2 | 51.5 |
Table 10: The zero-shot performance on the GLUE development set. Acc stands for accuracy, mcc means Matthews's correlation coefficient, and Pear. is short for Pearson's correlation coefficient.
Performance on CLUTRR. To further examine the reasoning capabilities, we evaluate our FOLNet models on the CLUTRR (Compositional Language Understanding and Text-based Relational Reasoning) dataset (Sinha et al., 2019). CLUTRR is a semi-synthetic diagnostic benchmark designed to evaluate the systematic generalization ability of a model. Specifically, for a given natural language story, a model is asked to infer the (implicit) relationship between two family members. To solve the problem, it has to extract the relationships between entities and master the logical rules governing these relationships. CLUTRR examines systematic generalization by testing a model on stories that contain unseen combinations of logical rules as well as stories that require more reasoning steps. Following the same setting as Sinha et al. (2019), we first finetune our FOLNet models with clauses of length k = 2, 3 and k = 2, 3, 4, respectively, and then we evaluate them on clauses of length k = 2, . . . , 10. Specifically, for each input instance, we concatenate the natural language story with the text query (separated by a [SEP] token), and cast the problem as a sequence-pair classification. In addition, we add a [CLS] token at the beginning and append a [SEP] token at the end. However, unlike Sinha et al. (2019), for simplicity, we do not replace entities with special task-specific embeddings but treat them as regular English words; we leave such entity representation techniques as future work, which may further improve our performance. We finetune FOLNetBase for 100 epochs using an Adam optimizer with a learning rate of 5e-5, a batch size of 16, and a warmup ratio of 6%. In our experiments, we mainly compare to transformer-based baselines that also use the natural language inputs. Specifically, we compare to the results of BERTBase and BERT-LSTM from Sinha et al. (2019), which share the same model size and pretraining corpus. There is also a rich set of approaches that work directly on the symbolic inputs of CLUTRR, such as the Graph Attention Network (GAT) (Veličković et al., 2018). Since these methods directly use the structured logical graph underlying the story as their input, they circumvent the difficulty of parsing natural language stories and generally perform much better than the text-based counterparts. We include GAT as a reference to examine whether our FOLNet model could narrow such a performance gap with the help of our logical inductive bias. The full results are reported in Figure 5.
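Before turning to the results, here is a minimal sketch of the input formatting described above (the story and query strings are hypothetical examples; the special tokens follow the BERT convention already used in this section):

```python
# Cast a CLUTRR instance as sequence-pair classification: story and query are
# concatenated and wrapped with the usual [CLS]/[SEP] tokens.
def build_clutrr_input(story: str, query: str) -> str:
    return f"[CLS] {story} [SEP] {query} [SEP]"

example = build_clutrr_input(
    story="Alice went shopping with her son Bob. Bob later called his grandfather Carl.",
    query="How is Alice related to Carl?",
)
```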
Figure 5: Systematic generalization performance on the CLUTRR benchmark. (a) Training clause length k = 2, 3; (b) training clause length k = 2, 3, 4. Each panel plots accuracy (%) against the relation length k for BERT, BERT-LSTM, GAT, and FOLNetBase.
First, our FOLNetBase performs significantly better than both text-based methods (i.e., BERTBase and BERT-LSTM) by a large margin. In particular, when the testing k is outside the training set, we improve over BERTBase by as much as 8%–41% (in absolute improvement). Meanwhile, this advantage further improves to 54%–81% when the testing k has been seen during training, and FOLNetBase achieves performance competitive with GAT. Notably, FOLNetBase even achieves substantially better performance (by 10%–20%) than GAT when the testing k = 10. The results confirm the benefit of encoding the logical inductive bias.
D NEURAL MODUS PONENS
In this section, we show that generic Modus Ponens inference using clauses of the form (8) can be expressed as a matrix multiplication followed by an optional nonlinear activation function. Suppose P1, . . . , PM are M premise atoms and Q is the head (i.e., conclusion) atom. We partition the set M = {1, . . . , M} into three non-overlapping sets M+, M0 and M−, and define the following general clause (where we have dropped the subscript n in (8) for simplicity of notation):
Q ← (⋀_{m∈M+} Pm) ∧ (⋀_{m∈M−} ¬Pm). (11)
Note that the expression (11) can be used to represent a general clause that infers Q from any subset of the premises in {P1, . . . , PM, ¬P1, . . . , ¬PM}, where the sets M+ and M− contain the indexes of the selected original and negated premises, respectively, and the set M0 indexes the ignored premises. Therefore, each particular partition M = M+ ∪ M0 ∪ M− also corresponds to a unique clause. To apply Modus Ponens inference with the clause (11), we can proceed in two steps: (i) evaluate the body of the clause (i.e., the right-hand side of (11)) according to
P ≜ (⋀_{m∈M+} Pm) ∧ (⋀_{m∈M−} ¬Pm), (12)
and (ii) apply the generic Modus Ponens inference rule: deduce Q from (Q ← P) together with P. In the following two subsections, we derive the neural operations that implement these two steps, respectively. Specifically, the neural operations are carried out in the logit space under the ProbLog setting.
D.1 THE MATRIX MULTIPLICATION FOR THE COMPOSITIONS OF THE BODY ATOMS
To derive the operations for the logic expression (12), we define the probabilities of the atoms P, P1, . . . , PM being true in the following logistic form:
Pr{P = T} = 1 / (1 + e^{−z}), Pr{Pm = T} = 1 / (1 + e^{−vm}), m = 1, . . . , M, (13)
where z and vm are the logits corresponding to the atoms P and Pm, respectively. In addition, for each m, we further introduce a variable Wm ∈ {+1, 0, −1} to indicate whether m is in M+, M0 or M−, respectively. Then, (12) can be characterized by the following conditional probability:
Pr{P = T | W, P} = 1 if ⋀_{m=1}^{M} [(Wm = 1 ∧ Pm = T) ∨ (Wm = −1 ∧ Pm = F) ∨ (Wm = 0)], and 0 otherwise, (14)
where W ≜ {W1, . . . , WM} and P ≜ {P1, . . . , PM}. We further assume that the variables P1, . . . , PM are independent of each other and are also independent of W1, . . . , WM. In addition, for a given set of W = {W1, . . . , WM}, we introduce the following random event of P1, . . . , PM:
E ≜ {(P1, . . . , PM) : ⋀_{m=1}^{M} [(Wm = 1 ∧ Pm = T) ∨ (Wm = −1 ∧ Pm = F) ∨ (Wm = 0)]}, (15)
along with its indicator function I((P1, . . . , PM) ∈ E). An indicator function I(·) is one if the expression inside its parentheses is true and zero otherwise.
Then, we can derive Pr{P = T | W} as:
Pr{P = T | W}
= Σ_P Pr{P = T | W, P} Pr{P1, . . . , PM | W}
(a)= Σ_P Pr{P = T | W, P} Pr{P1, . . . , PM}
(b)= Σ_P I((P1, . . . , PM) ∈ E) Pr{P1, . . . , PM}
(c)= Σ_{(P1,...,PM)∈E} ∏_{m=1}^{M} Pr{Pm}
(d)= Σ_{(P1,...,PM)∈E} ∏_{m∈M+} Pr{Pm} ∏_{m∈M−} Pr{Pm} ∏_{m∈M0} Pr{Pm}
(e)= Σ_{{Pm: m∈M0}} ∏_{m∈M+} Pr{Pm = T} ∏_{m∈M−} Pr{Pm = F} ∏_{m∈M0} Pr{Pm}
(f)= ∏_{m∈M+} Pr{Pm = T} ∏_{m∈M−} Pr{Pm = F} Σ_{{Pm: m∈M0}} ∏_{m∈M0} Pr{Pm}
(g)= ∏_{m∈M+} Pr{Pm = T} ∏_{m∈M−} Pr{Pm = F}
(h)= ∏_{m∈M+} 1 / (1 + e^{−vm}) ∏_{m∈M−} 1 / (1 + e^{vm})
(i)= 1 / [∏_{m=1}^{M} (1 + e^{−vm})^{I+m} (1 + e^{vm})^{I−m}]
(j)≤ 1 / [1 + e^{−Σ_{m=1}^{M} (I+m − I−m) vm}], (16)
where step (a) uses the independence between P1, . . . , PM and W1, . . . , WM, step (b) substitutes the definition of the conditional probability (14), step (c) uses the assumption that P1, . . . , PM are independent of each other, step (d) substitutes the decomposition M = M+ ∪ M0 ∪ M−, step (e) follows from the definition (15) of the event E (i.e., within the event E, we have Pm = T for m ∈ M+, Pm = F for m ∈ M−, and Pm taking an arbitrary value in {T, F} for m ∈ M0), step (f) takes out the common factors of the summands and keeps the remaining summation over all possible values of {Pm : m ∈ M0} (based on the definition (15)), step (g) uses the fact that the total probability of the event {Pm : m ∈ M0} is one, step (h) substitutes the second logistic expression in (13), step (i) introduces the variables I+m ≜ I(m ∈ M+) and I−m ≜ I(m ∈ M−), and the upper bound in step (j) is obtained by expanding the multiplications in the denominator and dropping all the cross-terms, all of which are nonnegative. Note that the upper bound (16) is tight when only one of I+1, . . . , I+M and I−1, . . . , I−M is nonzero.
The above result in (16) is conditioned on a particular realization of W1, . . . , WM, which are discrete variables. Therefore, it is difficult to learn them directly in a differentiable manner. To address this issue, we further assume that W1, . . . , WM are random variables. Then, taking the expectation over W1, . . . , WM, we have:
Pr{P = T} ≤ E[ 1 / (1 + e^{−Σ_{m=1}^{M} (I+m − I−m) vm}) ], (17)
where the randomness inside the expectation comes from I+m and I−m, which are in turn determined by W1, . . . , WM. We now adopt a simple yet effective strategy to approximate the right-hand side in order to obtain a differentiable implementation. Specifically, we use the following approximation:
E[ 1 / (1 + e^{−X}) ] ≈ 1 / (1 + e^{−E[X]}), (18)
which becomes more accurate when the distribution of X gets more concentrated around a single peak (i.e., becomes more deterministic). Using the above approximation, we obtain
Pr{P = T} ≈ 1 / (1 + e^{−Σ_{m=1}^{M} (κ+m − κ−m) vm}), (19)
where κ+m ≜ E[I+m] = Pr{Wm = +1} and κ−m ≜ E[I−m] = Pr{Wm = −1}. Substituting the first expression in (13) into the left-hand side of (19), we conclude that the logit of Pr{P = T} can be approximated via:
z ≈ Σ_{m=1}^{M} (κ+m − κ−m) vm = ⟨κ, v⟩, (20)
where κ and v denote the vectors that collect κ+m − κ−m and vm as their m-th elements, respectively. The above expression implies that the logit z for the atom P can be computed by a simple inner product between the two vectors κ and v. Further recall that different realizations of W1, . . . , WM correspond to different partitions M = M+ ∪ M0 ∪ M−, which further define different logic expressions for P in (12). Since κ is a vector that collects Pr{Wm = +1} − Pr{Wm = −1} as its m-th element, it can also be viewed as a signature vector of the logic expression (12), which is the body of the clause (11). Therefore, κ is also the signature vector for the clause in (11), as it determines the logic composition of the body atoms.
When we have N logic expressions composed from the atoms P1, . . . , PM, represented by the signature vectors κ1, . . . , κN, we can compute the logits of the output logic expressions by the following matrix multiplication:
z = Kv, (21)
where z is a logit vector for the N output atoms, and K is a matrix that contains κn as its n-th row.
D.2 THE NONLINEAR ACTIVATION FOR THE IMPLICATION OPERATION
In this subsection, we proceed to derive the neural operator for the generic Modus Ponens inference (deducing Q from (Q ← P) together with P) in the logit space. Likewise, we consider the ProbLog setting, where the atoms P and Q are assigned probabilities Pr{P = T} and Pr{Q = T}, respectively, which characterize the chances of them being true. Let C ≜ (Q ← P) be the clause. Our objective is to infer the outcome probability Pr{Q = T} from Pr{P = T} based on the fact that C = T. To this end, we first derive the conditional probability Pr{Q = T | P, C = T}. We first need to establish the conditional probability Pr{C = T | P, Q} based on the definition of logical implication (Andrews, 2013) and then apply the Bayes rule. Note that C = (Q ← P) is defined as C ≜ (Q ∨ ¬P), which can be expressed as the following conditional probability:
Pr{C = T | P, Q} = 0 if P = T and Q = F, and 1 otherwise. (22)
That is, the clause C is false only when the premise P is true and the conclusion Q is false. To proceed, we further assume that Pr{Q = T | P} = Pr{Q = F | P} = 1/2. The intuition behind this assumption is that we do not have any prior knowledge about the outcome Q when we are only given the input premise P (without C). Then, by the Bayes rule, we have
Pr{Q = T | P, C = T} = Pr{Q = T, C = T | P} / Pr{C = T | P}
= Pr{Q = T | P} Pr{C = T | P, Q = T} / Σ_{Q∈{F,T}} Pr{Q | P} Pr{C = T | P, Q}
(a)= Pr{C = T | P, Q = T} / (Pr{C = T | P, Q = T} + Pr{C = T | P, Q = F})
(b)= 1 if P = T, and 0.5 if P = F, (23)
where steps (a) and (b) substitute Pr{Q = T | P} = Pr{Q = F | P} = 1/2 and (22), respectively. Next, we derive Pr{Q = T | C = T} under the assumption that the premise P is independent of the clause C that is used in the current Modus Ponens inference step. We have
Pr{Q = T | C = T} = Σ_{P∈{T,F}} Pr{Q = T | P, C = T} Pr{P | C = T}
(a)= Σ_{P∈{T,F}} Pr{Q = T | P, C = T} Pr{P}
(b)= Pr{P = T} + (1/2) Pr{P = F}, (24)
where step (a) uses the assumption that P is independent of C, and step (b) substitutes (23). Finally, we derive the expression for the conditional probability (24) in the logit space. Specifically, let z and u be the logits for P and Q, respectively, which parameterize their probabilities:
Pr{P = T} = 1 / (1 + e^{−z}), Pr{Q = T | C = T} = 1 / (1 + e^{−u}). (25)
Substituting the first expression in (25) into (24), followed by some simple algebra, we obtain
Pr{Q = T | C = T} = 1 / (1 + e^{−ln(1 + 2e^{z})}). (26)
Comparing the right-hand side of (26) with that of (25), we obtain the logit for Q as
u = ln(1 + 2e^{z}). (27)
The above expression implies that, to implement Modus Ponens in the logit space, we only need to apply this simple nonlinear activation function to the input premise logit z. To gain further insight into the above logit-space Modus Ponens inference, we analyze the nonlinear activation function further. Note that ln(1 + e^{x}) is a non-negative convex function. By using Jensen's inequality and the non-negativity, we can prove that the above nonlinear activation function is lower bounded as ln(1 + 2e^{x}) ≥ ReLU(x + ln 2), where the right-hand side is indeed a good approximation of the left-hand side. Therefore, we can implement the generic Modus Ponens by applying the (shifted) ReLU function in the logit space with much lower computational complexity.
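The following is a minimal sketch (not the released implementation) of one such neural Modus Ponens step in logit space, combining the matrix multiplication with the kernel of clauses and the activation derived above:

```python
import math
import torch

def modus_ponens(K: torch.Tensor, v: torch.Tensor, exact: bool = False) -> torch.Tensor:
    """K: (N, M) clause signature vectors; v: (M,) premise logits; returns (N,) output logits."""
    z = K @ v                                   # Eq. (21): logits of the clause bodies
    if exact:
        return torch.log1p(2.0 * torch.exp(z))  # u = ln(1 + 2 e^z)
    return torch.relu(z + math.log(2.0))        # shifted-ReLU approximation, u ~= ReLU(z + ln 2)

# Example: three clauses over four premises.
u = modus_ponens(torch.randn(3, 4), torch.randn(4))
```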
D.3 PUTTING EVERYTHING TOGETHER
Given N different clauses of the form (11) and M input premises P1, . . . , PM, the deduction of the N outcome atoms Q1, . . . , QN using the Modus Ponens rule can be implemented (approximately) via
z = Kv, (28)
u = ln(1 + 2e^{z}), (29)
where K is an N × M matrix with its n-th row being the signature vector of the n-th clause, v is an M-dimensional vector consisting of the logits of the input premises P1, . . . , PM, and u is an N-dimensional vector that contains the logits of the output atoms Q1, . . . , QN. Note that the Modus Ponens inference using N clauses can be implemented in parallel by first multiplying the matrix K to the left of the logit vector v and then passing the result through a special element-wise nonlinear activation function (which can be approximated by a shifted ReLU function). For this reason, we call K the kernel-of-clauses (or kernels for short) in this paper. Furthermore, we note that, when the implication in (11) is replaced by the logical equivalence, the original clause (11) becomes (12), so that the activation function in (29) can be dropped. Finally, it is straightforward to show that the neural Modus Ponens inference (28)-(29) for the four categories of rules in (7) can be expressed (equivalently) as:
u(x) = Σ_a KUU(x, a) v(a), (30)
u(x) = Σ_{a,b} KUB(x, a, b) v(a, b), (31)
u(x, y) = Σ_a KBU(x, y, a) v(a), (32)
u(x, y) = Σ_{a,b} KBB(x, y, a, b) v(a, b), (33)
where KUU(x, a), KUB(x, a, b), KBU(x, y, a) and KBB(x, y, a, b) are D1 × D1, D1 × D2, D2 × D1 and D2 × D2 matrices that correspond to RUU, RUB, RBU and RBB, respectively.
E COMPOSITION OF EXISTING RELATIVE POSITIONAL ENCODING
We now show that we can compose the existing relative positional encoding (denoted as RPE in our paper) from our m-operator and p-operator. To begin with, we first write the expressions of the existing RPE from Shaw et al. (2018) using their original notation. Specifically, RPE computes the (multi-head) self-attention outputs according to the following expressions (with the notation of the self-attention head h dropped for simplicity):
z_i = Σ_{j=1}^{n} α_ij (x_j W^V + a^V_ij) = Σ_{j=1}^{n} α_ij x_j W^V + Σ_{j=1}^{n} α_ij a^V_ij, (34)
e_ij = x_i W^Q (x_j W^K + a^K_ij)^T / √d_z = [x_i W^Q (x_j W^K)^T + x_i W^Q (a^K_ij)^T] / √d_z, (35)
where z_i is the self-attention output at the i-th token, e_ij is the unnormalized self-attention score, α_ij is the self-attention probability (which is obtained by applying softmax to e_ij, normalized over j), x_i is the vector of the i-th token at the attention input, W^Q/W^K/W^V are the weight matrices for the query/key/value in the self-attention operation, respectively, n is the sequence length, and d_z is the dimension of each head. Notably, a^V_ij and a^K_ij are the learnable relative positional embedding vectors, defined as
a^K_ij = w^K_{clip(j−i, k)}, a^V_ij = w^V_{clip(j−i, k)}, clip(x, k) = max(−k, min(k, x)).
That is, a^V_ij and a^K_ij are the embedding vectors corresponding to the (clipped) relative distance between the i-th and the j-th tokens, where k is the clipping threshold. Therefore, the second terms in both (34) and (35) are the RPE bias terms that seep into the computations of the self-attention mechanism. We now proceed to show that these two terms can be viewed as the degenerated form of our m-operator and p-operator in Table 2.
For convenience, we first rewrite the expressions for these two operators:
m: uhs(x) = Σ_a Kh(x, a) vs(x, a), (36)
p: uh(x, y) = Σ_w Khw(x) vw(x, y). (37)
We first show that the second term in (34) can be composed from the m-operator (36). To see this, note that the tokens x and a in (36) can be identified with i and j in (34), respectively. In addition, the kernel Kh(x, a) and the premise vs(x, a) can be identified with the self-attention probability α_ij and the relative positional embedding vector a^V_ij, respectively. Therefore, the m-operator shares the same form as the second term in (34), except for our head index h. Likewise, we can show that the second term in (35) can be composed from the p-operator. Observe that the second term in (35) is an inner product between the vector x_i W^Q and the vector a^K_ij. Let kh(x) be a vector that collects Khw(x) as its w-th element and let v(x, y) be a vector that collects vw(x, y) as its w-th element. Then, the p-operator can also be viewed as an inner product between the vectors kh(x) and v(x, y). By identifying kh(x) and v(x, y) with the vectors x_i W^Q and a^K_ij, respectively, we conclude that the second term in (35) can be composed from the p-operator. Besides these similarities, our m-operator and p-operator are more general and powerful than the original RPE in the following aspects. First, recall from Section 3.3 that both the kernels K and the premises v are parameterized by FOLNet itself (by linearly projecting the intermediate representations {ul(x), ul(x, y)}, followed by possible activation functions). Therefore, our vs(x, a) and vw(x, y) are instance-dependent and differ across input instances. In contrast, a^V_ij and a^K_ij in (34)-(35) are static embedding vectors that are the same for all input instances. Furthermore, our m-operator and p-operator are adaptive in a layerwise manner; that is, each layer computes its own m-operator and p-operator adaptively based on its own intermediate representations {ul(x), ul(x, y)}, whereas in RPE the vectors a^V_ij and a^K_ij are generally identical across layers. Therefore, the existing operations in RPE can be viewed as degenerated special cases that are composable from our m-operator and p-operator.
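To make the correspondence concrete, here is a minimal sketch (tensor shapes and the clipping value are illustrative assumptions, not the released implementation) of how the two RPE bias terms in (34)-(35) arise as degenerate m- and p-operators whose premises are static relative-position embeddings rather than instance- and layer-dependent binary atoms:

```python
import torch

T, d_z, n_buckets = 128, 64, 2 * 64 - 1
rel_ids = torch.clamp(torch.arange(T)[None, :] - torch.arange(T)[:, None], -63, 63) + 63

a_V = torch.nn.Embedding(n_buckets, d_z)(rel_ids)  # a^V_ij, shape (T, T, d_z)
a_K = torch.nn.Embedding(n_buckets, d_z)(rel_ids)  # a^K_ij, shape (T, T, d_z)

alpha = torch.softmax(torch.randn(T, T), dim=-1)   # attention probabilities, standing in for K_h(x, a)
q = torch.randn(T, d_z)                            # x_i W^Q, standing in for the kernel K_hw(x)

# Degenerate m-operator: sum_j alpha_ij a^V_ij  -> the RPE bias term of (34)
bias_value = torch.einsum('ij,ijd->id', alpha, a_V)

# Degenerate p-operator: <x_i W^Q, a^K_ij>      -> the RPE bias term of (35)
bias_score = torch.einsum('id,ijd->ij', q, a_K)
```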