# Learning Explanatory Rules from Noisy Data

Journal of Artificial Intelligence Research 61 (2018) 1-64. Submitted 02/17; published 01/18.

Richard Evans (richardevans@google.com)
Edward Grefenstette (etg@google.com)
DeepMind, London, UK

Abstract

Artificial neural networks are powerful function approximators capable of modelling solutions to a wide variety of problems, both supervised and unsupervised. As their size and expressivity increase, so too does the variance of the model, yielding a nearly ubiquitous overfitting problem. Although this is mitigated by a variety of model regularisation methods, the common cure is to seek large amounts of training data (which is not necessarily easily obtained) that sufficiently approximates the data distribution of the domain we wish to test on. In contrast, logic programming methods such as Inductive Logic Programming offer an extremely data-efficient process by which models can be trained to reason on symbolic domains. However, these methods are unable to deal with the variety of domains neural networks can be applied to: they are not robust to noise in or mislabelling of inputs, and, perhaps more importantly, they cannot be applied to non-symbolic domains where the data is ambiguous, such as operating on raw pixels. In this paper, we propose a Differentiable Inductive Logic framework, which can not only solve tasks which traditional ILP systems are suited for, but shows a robustness to noise and error in the training data which ILP cannot cope with. Furthermore, as it is trained by backpropagation against a likelihood objective, it can be hybridised by connecting it with neural networks over ambiguous data in order to be applied to domains which ILP cannot address, while providing data efficiency and generalisation beyond what neural networks on their own can achieve.

1. Introduction

Inductive Logic Programming (ILP) is a collection of techniques for constructing logic programs from examples.
Given a set of positive examples, and a set of negative examples, an ILP system constructs a logic program that entails all the positive examples but does not entail any of the negative examples. From a machine learning perspective, an ILP system can be interpreted as implementing a rule-based binary classifier over examples, mapping each example to an evaluation of its truth or falsehood according to the axioms provided to the system, alongside new rules inferred by the system during training.

ILP has a number of appealing features. First, the learned program is an explicit symbolic structure that can be inspected, understood¹, and verified. Second, ILP systems tend to be impressively data-efficient, able to generalise well from a small handful of examples. The reason for this data-efficiency is that ILP imposes a strong language bias on the sorts of programs that can be learned: a short general program will be preferred to a program consisting of a large number of special-case ad-hoc rules that happen to cover the training data. Third, ILP systems support continual and transfer learning. The program learned in one training session, being declarative and free of side-effects, can be copied and pasted into the knowledge base before the next training session, providing an economical way of storing learned knowledge.

The main disadvantage of traditional ILP systems is their inability to handle noisy, erroneous, or ambiguous data.

1. Human readability is a much touted feature of ILP systems, but when the learned programs become large, and include a number of invented auxiliary predicates, the resulting programs become less readable (see Besold, Muggleton, Schmid, Tamaddoni-Nezhad, & Zeller, 2016). But even a complex machine-generated logic program will be easier to understand than a large tensor of floating point numbers.

© 2018 AI Access Foundation. All rights reserved.
If the positive or negative examples contain any mislabelled data, these systems will not be able to learn the intended rule. De Raedt and Kersting (2008) discuss this issue in depth, stressing the importance of building systems capable of applying relational learning to uncertain data.

A key strength of neural networks is that they are robust to noise and ambiguity. One way to overcome the brittleness of traditional ILP systems is to reimplement them in a robust connectionist framework. Garcez, Besold, de Raedt, Földiák, Hitzler, Icard, Kühnberger, Lamb, Miikkulainen, and Silver (2015) argue strongly for the importance of integrating robust connectionist learning with symbolic relational learning.

Recently, a different approach to program induction has emerged from the deep learning community (Graves, Wayne, & Danihelka, 2014; Reed & de Freitas, 2015; Neelakantan, Le, & Sutskever, 2015; Kaiser, 2015; Andrychowicz & Kurach, 2016; Graves, Wayne, Reynolds, Harley, Danihelka, Grabska-Barwińska, Colmenarejo, Grefenstette, Ramalho, Agapiou, et al., 2016). These neural network-based systems do not construct an explicit symbolic representation of a program. Instead, they learn an implicit procedure (distributed in the weights of the net) that produces the intended results. These approaches take a relatively low-level model of computation² (a model that is much closer to the metal than the Horn clauses used in ILP) and produce a differentiable implementation of that low-level model. The implicit procedure that is learned is a way of operating within that low-level model of computation (by moving the tape head, reading, and writing in the case of differentiable Turing machines; by pushing and popping in the case of differentiable pushdown automata).

There are two appealing features of this differentiable approach to program induction. First, these systems are robust to noise. Unlike ILP, a neural system will tolerate some bad (mislabelled) data.
Second, a neural program induction system can be provided with fuzzy or ambiguous data (from a camera, for example). Unlike traditional ILP systems (which have to be fed crisp, symbolic input), a differentiable induction system can start with raw, un-preprocessed pixel input.

However, the neural approaches to program induction have two disadvantages when compared to ILP. First, the implicit procedure learned by a neural network is not inspectable or human-readable. It is notoriously hard to understand what it has learned, or to what extent it has generalised beyond the training data. Second, the performance of these systems tails off sharply when the test data are significantly larger than the training data: if we train the neural system to add numbers of length 10, it may also be successful when tested on numbers of length 20. But if we test it on numbers of length 100, the performance deteriorates (Kaiser, 2015; Reed & de Freitas, 2015). General-purpose neural architectures, being universal function approximators, produce solutions with high variance. There is an ever-present danger of over-fitting³.

This paper proposes a system that addresses the limits of connectionist systems and ILP systems, and attempts to combine the strengths of both. Differentiable Inductive Logic Programming (∂ILP) is a reimplementation of ILP in an end-to-end differentiable architecture.

2. Low-level models include a Turing machine (Graves et al., 2014, 2016), a Forth virtual machine (Riedel, Bošnjak, & Rocktäschel, 2016), a cellular automaton (Kaiser, 2015), and a pushdown automaton (Sun, Giles, Chen, & Lee, 1998; Grefenstette, Hermann, Suleyman, & Blunsom, 2015; Joulin & Mikolov, 2015).
It attempts to combine the advantages of ILP with the advantages of the neural network-based systems: a data-efficient induction system that can learn explicit human-readable symbolic rules, that is robust to noisy and ambiguous data, and that does not deteriorate when applied to unseen test data. The central component of this system is a differentiable implementation of deduction through forward chaining on definite clauses. We reinterpret the ILP task as a binary classification problem, and we minimise cross-entropy loss with regard to ground-truth boolean labels during training. Our ∂ILP system is able to solve moderately complex tasks requiring recursion and predicate invention. For example, it is able to learn Fizz-Buzz using multiple invented predicates (see Section 5.3.3). Unlike the MLP described by Grus⁴, our learned program generalises robustly to test data outside the range of training examples. Unlike symbolic ILP systems, ∂ILP is robust to mislabelled data. It is able to achieve reasonable performance with up to 20% of mislabelled training data (see Section 5.4). Unlike symbolic ILP systems, ∂ILP is also able to handle ambiguous or fuzzy data. We tested ∂ILP by connecting it to a convolutional net trained on MNIST digits, and it was still able to learn effectively (see Section 5.5). The major limitation of our ∂ILP system is that it requires significant memory resources. This limits the range of benchmark problems on which our system has been tested⁵. We discuss this further in the introduction to Section 5.3 and in Section 5.3.4, and offer an analysis in Appendix E.

The structure of the paper is as follows. We begin, in Section 2, by giving an overview of logic programming as a field, and of the collection of learning methods known as Inductive Logic Programming.
In Section 3, we re-cast learning under ILP as a satisfiability problem, and use this formalisation of the problem as the basis for introducing, in Section 4, a differentiable form of ILP in which continuous representations of rules are learned through backpropagation against a likelihood objective. In Section 5, we evaluate our system against a variety of standard ILP tasks, measure its robustness to noise by evaluating its performance under conditions where consistent errors exist in the data, and finally compare it to neural network baselines on tasks where logic programs are learned over ambiguous data such as raw pixels. We complete the paper by reviewing and contrasting with related work in Section 6, before offering our conclusions regarding the framework proposed here and its empirical validation.

3. With extra supervision, over-fitting can be avoided, to an extent. Reed and de Freitas (2015) use a much richer training signal. Instead of trying to learn rules from mere input-output pairs, they learn from explicit traces of desired computations. In their approach, a training instance is an input plus a fine-grained specification of the desired computational operations. For example, when learning addition, a training instance would be the two inputs x and y that we are expected to add, plus a detailed list of all the low-level operations involved in adding x and y. With this additional signal, it is possible to learn programs that generalise to larger training instances.

4. See the darkly humorous http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/.

5. The memory requirements force us to limit ourselves to predicates of arity at most two.

2. Background

We first review Logic Programming and Inductive Logic Programming (ILP), before focusing on one particular approach to ILP, which turns the induction problem into a satisfiability problem.
2.1 Logic Programming

Logic programming is a family of programming languages in which the central component is not a command or a function, but an if-then rule. These rules are also known as clauses. A definite clause⁶ is a rule of the form

α ← α1, ..., αm

composed of a head atom α, and a body α1, ..., αm, where m ≥ 0. These rules are read from right to left: if all of the atoms on the right are true, then the atom on the left is also true. If the body is empty, i.e. if m = 0, we abbreviate α ← to just α.

An atom α is a tuple p(t1, ..., tn), where p is an n-ary predicate and t1, ..., tn are terms, either variables or constants. An atom is ground if it contains no variables. The set of all atoms, ground and unground, is denoted by A. The set of ground atoms, also known as the Herbrand base, is G.

In first-order logic, terms can also be constructed from function symbols. So, for example, if c is a constant, and f is a one-place function, then f(c), f(f(c)), f(f(f(c))), ... are all terms. In this paper, we impose the following restriction, found in systems like Datalog, on clauses: the only terms allowed are constants and variables, while function symbols are disallowed⁷.

For example, the following program defines the connected relation as the transitive closure of the edge relation:

connected(X, Y) ← edge(X, Y)
connected(X, Y) ← edge(X, Z), connected(Z, Y)

We follow the logic programming convention of using upper case to denote variables and lower case for constants and predicates. A ground rule is a clause in which all variables have been substituted by constants. For example:

connected(a, b) ← edge(a, b)

is a ground rule generated by applying the substitution θ = {a/X, b/Y} to the clause:

connected(X, Y) ← edge(X, Y)

The consequences of a set R of clauses are computed by repeatedly applying the rules in R until no more consequences can be derived. More formally, define cn_R(X) as the immediate consequences of rules R applied to ground atoms X:

cn_R(X) = X ∪ {γ | γ ← γ1, ..., γm ∈ ground(R), each γi ∈ X}

where ground(R) are the ground rules of R. In other words, ground atom γ is in cn_R(X) if γ is in X, or there exists a ground rule γ ← γ1, ..., γm in ground(R) such that each ground atom γ1, ..., γm is in X. Alternatively, we can define cn_R(X) using substitutions:

cn_R(X) = X ∪ {α[θ] | α ← α1, ..., αm ∈ R, αi[θ] ∈ X for all i = 1..m}

where α[θ] is the application of substitution θ to atom α. Now define a series of consequences C_{R,0}, C_{R,1}, C_{R,2}, ...:

C_{R,0} = {}
C_{R,i+1} = cn_R(C_{R,i})

Now define the consequences after T time-steps as the union of this series:

con(R) = C_{R,0} ∪ C_{R,1} ∪ ... ∪ C_{R,T}

Consider, for example, the program R:

edge(a, b)    connected(X, Y) ← edge(X, Y)
edge(b, c)    connected(X, Y) ← edge(X, Z), connected(Z, Y)
edge(c, a)

The consequences are:

C_{R,1} = {edge(a, b), edge(b, c), edge(c, a)}
C_{R,2} = C_{R,1} ∪ {connected(a, b), connected(b, c), connected(c, a)}
C_{R,3} = C_{R,2} ∪ {connected(a, c), connected(b, a), connected(c, b)}
C_{R,4} = C_{R,3} ∪ {connected(a, a), connected(b, b), connected(c, c)}
C_{R,5} = C_{R,4} = con(R)

Given a set R of clauses, and a ground atom γ, we say R entails γ, written R ⊨ γ, if every model satisfying R also satisfies γ. To test if R ⊨ γ, we check if γ ∈ con(R). This technique is called forward chaining. An alternative approach is backward chaining. To test if R ⊨ γ, we work backwards from γ, looking for rules in R whose head unifies with γ. For each such rule α ← α1, ..., αm, where α[θ] = γ, we create a sub-goal to prove the body α1[θ], ..., αm[θ]. This procedure constructs a search tree, where nodes are lists of propositions to be proved, in left-to-right order, and edges are applications of rules with substitutions.

6. In this paper, we restrict ourselves to logic programs composed of definite clauses. We do not consider clauses containing negation-as-failure.

7. The reason for restricting ourselves to definite Datalog clauses is that this language is decidable and forward chaining inference (described below) always terminates.
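To make the forward-chaining definitions concrete, here is a minimal Python sketch of cn_R and con(R) for the edge/connected example above. The tuple encoding of atoms and the grounding helper are our own illustrative choices, not the paper's implementation.

```python
# A minimal sketch of forward chaining over ground rules, for the
# edge/connected example. A ground rule is a (head, body) pair of
# ground atoms, each encoded as a tuple (predicate, constants...).
from itertools import product

constants = ["a", "b", "c"]
facts = {("edge", "a", "b"), ("edge", "b", "c"), ("edge", "c", "a")}

def ground_rules():
    """All ground instances of the two connected/2 clauses."""
    rules = []
    for x, y in product(constants, repeat=2):
        # connected(X, Y) <- edge(X, Y)
        rules.append((("connected", x, y), [("edge", x, y)]))
        for z in constants:
            # connected(X, Y) <- edge(X, Z), connected(Z, Y)
            rules.append((("connected", x, y),
                          [("edge", x, z), ("connected", z, y)]))
    return rules

def cn(rules, atoms):
    """The immediate-consequence operator cn_R(X)."""
    derived = {head for head, body in rules
               if all(b in atoms for b in body)}
    return atoms | derived

def con(rules, atoms, T):
    """Consequences after T steps of forward chaining."""
    for _ in range(T):
        atoms = cn(rules, atoms)
    return atoms

closure = con(ground_rules(), facts, T=5)
```

Running this reproduces the fixed point reached at C_{R,5} above: since the edges form a cycle, every pair of constants ends up connected.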
The root of the tree is the node containing γ, and the search terminates when we find a node with no atoms remaining to be proved.

A distinction we will need later is the difference between intensional and extensional predicates. An extensional predicate is a predicate that is wholly defined by a set of ground atoms. In our example above, edge is an extensional predicate defined by the set of atoms:

{edge(a, b), edge(b, c), edge(c, a)}

An intensional predicate is defined by a set of clauses. In our example, connected is an intensional predicate defined by the clauses:

connected(X, Y) ← edge(X, Y)
connected(X, Y) ← edge(X, Z), connected(Z, Y)

2.2 Inductive Logic Programming (ILP)

An ILP problem is a tuple (B, P, N) of sets of ground atoms, where:

- B is a set of background assumptions, a set of ground atoms⁸.
- P is a set of positive instances: examples taken from the extension of the target predicate to be learned.
- N is a set of negative instances: examples taken from outside the extension of the target predicate.

Consider, for example, the task of learning which natural numbers are even. We are given a minimal description of the natural numbers:

B = {zero(0), succ(0, 1), succ(1, 2), succ(2, 3), ...}

Here, zero(X) is the unary predicate true of X if X = 0, and succ is the successor relation. The positive and negative examples of the even predicate are:

P = {even(0), even(2), even(4), even(6), ...}
N = {even(1), even(3), even(5), even(7), ...}

The aim of ILP is to construct a logic program that explains the positive instances and rejects the negative instances. Given an ILP problem (B, P, N), a solution is a set R of definite clauses such that⁹:

B, R ⊨ γ for all γ ∈ P
B, R ⊭ γ for all γ ∈ N

Induction is finding a set of rules R such that, when they are applied deductively to the background assumptions B, they produce the desired conclusions: positive examples are entailed, while negative examples are not. In the example above, one solution is the set R:

even(X) ← zero(X)
even(X) ← even(Y), succ2(Y, X)
succ2(X, Y) ← succ(X, Z), succ(Z, Y)

Note that this simple toy problem is not entirely trivial. The solution requires both recursion and predicate invention: an auxiliary synthesised predicate succ2.

We first describe how the ILP problem can be transformed into a satisfiability problem (Section 3), and then provide the main contribution of the paper: a differentiable implementation of the satisfiability solving process (Section 4).

8. We assume that B is a set of ground atoms, but not all ILP systems make this restriction. In some ILP systems, the background assumptions are clauses, not atoms.

9. This is a specific definition of ILP in terms of entailment. For a more general definition of the ILP problem, which contains this formulation as a special case, see the work of Muggleton and De Raedt (1994), De Raedt and Kersting (2008), and De Raedt, Frasconi, Kersting, and Muggleton (2008).

3. ILP as a Satisfiability Problem

There are, broadly speaking, two families of approaches to ILP. The bottom-up approaches¹⁰ start by examining features of the examples, extract specific clauses from those examples, and then generalise from those specific clauses. The top-down approaches¹¹ use generate-and-test: they generate clauses from a language definition, and test the generated programs against the positive and negative examples.

Amongst the top-down approaches, one particular method is to transform the problem of searching for clauses into a satisfiability problem. Amongst these induction-via-satisfiability approaches, some¹² use Boolean flags to indicate which branches of the search space (defined by the language grammar) to explore. Others¹³ generate a large set of clauses according to a program template, and use the Boolean flags to determine which clauses are on and off.
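The even/succ2 solution given in Section 2.2 can be verified deductively. The following Python sketch applies its three clauses bottom-up over a finite fragment of the background assumptions B; the bound N_MAX and the tuple encoding of atoms are illustrative assumptions, not part of the paper's formulation.

```python
# A sketch verifying that the rule set R (with the invented predicate
# succ2) solves the even-numbers ILP task over numbers 0..N_MAX.
N_MAX = 10
B = {("zero", 0)} | {("succ", i, i + 1) for i in range(N_MAX)}

def consequences(atoms):
    """Apply the three clauses of R until a fixed point is reached."""
    while True:
        new = set(atoms)
        for x in range(N_MAX + 1):
            # succ2(X, Y) <- succ(X, Z), succ(Z, Y)
            for z in range(N_MAX + 1):
                for y in range(N_MAX + 1):
                    if ("succ", x, z) in atoms and ("succ", z, y) in atoms:
                        new.add(("succ2", x, y))
            # even(X) <- zero(X)
            if ("zero", x) in atoms:
                new.add(("even", x))
            # even(X) <- even(Y), succ2(Y, X)
            for y in range(N_MAX + 1):
                if ("even", y) in atoms and ("succ2", y, x) in atoms:
                    new.add(("even", x))
        if new == atoms:
            return atoms
        atoms = new

entailed = consequences(B)
```

On this fragment, every even number up to N_MAX is entailed and no odd number is, matching the requirement that B, R ⊨ γ for γ ∈ P and B, R ⊭ γ for γ ∈ N.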
In this paper, we also take a top-down, generate-and-test approach, in which clauses are generated from a language template. We assign each generated clause a Boolean flag indicating whether it is on or off. Now the induction problem becomes a satisfiability problem: choose an assignment to the Boolean flags such that the turned-on clauses, together with the background facts, entail the positive examples and do not entail the negative examples. Our approach is an extension of this established approach to ILP: our contribution is to provide, via a continuous relaxation of satisfiability, a differentiable implementation of this architecture. This allows us to apply gradient descent to learn which clauses to turn on and off, even in the presence of noisy or ambiguous data. We use the rest of this section (together with Appendix B) to explain ILP-as-satisfiability in detail.

10. For example, Progol (Muggleton, 1995).

11. For example, TopLog (Muggleton, Santos, & Tamaddoni-Nezhad, 2008), TAL (Corapi, Russo, & Lupu, 2010), and Metagol (Muggleton, Lin, Pahlavi, & Tamaddoni-Nezhad, 2014; Muggleton, Lin, & Tamaddoni-Nezhad, 2015; Cropper, Tamaddoni-Nezhad, & Muggleton, 2015).

12. See, for example, the work by Chikara, Koshimura, Fujita, and Hasegawa (2015).

13. See, for example, the work by Corapi et al. (2010). Note that they are working with Answer-Set Programming, but under the hood, this is propositionalised and compiled into a SAT problem.
3.1 Basic Concepts

A language frame L is a tuple (target, P_e, arity_e, C) where:

- target is the target predicate, the intensional predicate we are trying to learn.
- P_e is a set of extensional predicates.
- arity_e is a map P_e ∪ {target} → ℕ specifying the arity of each predicate.
- C is a set of constants.

An ILP problem is a tuple (L, B, P, N) where:

- L is a language frame.
- B is a set of background assumptions: ground atoms formed from the predicates in P_e and the constants in C.
- P is a set of positive examples: ground atoms formed from the target predicate and the constants in C.
- N is a set of negative examples: ground atoms formed from the target predicate and the constants in C.

Here, P contains ground atoms where the target predicate holds. For example, in the case of even on natural numbers, P might be {even(0), even(2), even(4), even(6)}. The set N contains ground atoms where the target predicate does not hold, e.g. {even(1), even(3), even(5)}.

A rule template τ describes a range of clauses that can be generated. It is a pair (v, int) where:

- v ∈ ℕ specifies the number of existentially quantified variables allowed in the clause.
- int ∈ {0, 1} specifies whether the atoms in the body of the clause can use intensional predicates (int = 1) or only extensional predicates (int = 0).

A program template Π describes a range of programs that can be generated. It is a tuple (P_a, arity_a, rules, T) where:

- P_a is a set of auxiliary (intensional) predicates; these are the additional invented predicates used to help define the target predicate.
- arity_a is a map P_a → ℕ specifying the arity of each auxiliary predicate.
- rules is a map from each intensional predicate p to a pair of rule templates (τ_p^1, τ_p^2).
- T ∈ ℕ specifies the maximum number of steps of forward chaining inference.

Note that rules defines each intensional predicate by a pair of rule templates.
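These definitions translate directly into code. The following Python sketch transcribes the language frame, rule template, and program template as plain data structures; the concrete template values chosen for the even task are illustrative assumptions, not values given in the text.

```python
# Minimal transcription of the language frame, rule template, and
# program template definitions. The template values for `even` below
# are illustrative assumptions only.
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class RuleTemplate:
    v: int     # number of existentially quantified variables
    int_: int  # 1 if intensional predicates may appear in the body

@dataclass(frozen=True)
class LanguageFrame:
    target: str
    Pe: frozenset           # extensional predicates
    arity: Dict[str, int]   # arities of Pe plus the target
    C: Tuple                # constants

@dataclass(frozen=True)
class ProgramTemplate:
    Pa: frozenset           # auxiliary (invented) intensional predicates
    arity_a: Dict[str, int]
    rules: Dict[str, Tuple[RuleTemplate, RuleTemplate]]
    T: int                  # max forward-chaining steps

L = LanguageFrame("even", frozenset({"zero", "succ"}),
                  {"zero": 1, "succ": 2, "even": 1}, tuple(range(11)))
Pi = ProgramTemplate(frozenset({"succ2"}), {"succ2": 2},
                     {"even": (RuleTemplate(0, 0), RuleTemplate(1, 1)),
                      "succ2": (RuleTemplate(1, 0), RuleTemplate(1, 0))},
                     T=6)
```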
In our system we insist, without loss of generality¹⁴, that each predicate can be defined by exactly two clauses. Our program template is a tuple of hyperparameters constraining the range of programs that are used to solve the ILP problem. This template plays the same role as a collection of mode declarations (see Appendix C) or a set of metarules in Metagol (see Appendix D). Assume, for the moment, that the program template is given to us as part of the ILP problem specification. Later, in Section 7.1, we shall discuss how to search through the space of program templates using iterative deepening.

We can combine the extensional predicates from the language frame L = (target, P_e, arity_e, C) with the intensional predicates from the program template Π = (P_a, arity_a, rules, T) into a language (P_e, P_i, arity, C) where:

P_i = P_a ∪ {target}
arity = arity_e ∪ arity_a

Note that the target predicate is always one of the intensional predicates P_i. Let P be the complete set of predicates:

P = P_e ∪ P_i

A language determines the set G of all ground atoms. If we restrict ourselves to nullary, unary, and dyadic predicates, then the set of ground atoms is:

G = {γi}_{i=1..n} = {p() | p ∈ P, arity(p) = 0} ∪ {p(k) | p ∈ P, arity(p) = 1, k ∈ C} ∪ {p(k1, k2) | p ∈ P, arity(p) = 2, k1, k2 ∈ C}

Note that G includes the falsum atom ⊥, the atom that is always false.

3.2 Generating Clauses

For each rule template τ, we can generate a set cl(τ) of clauses that satisfy the template. To keep the set of generated clauses manageable, we make a number of restrictions. First, the only clauses we consider are those composed of atoms involving free variables. We do not allow any constants in any of our clauses. This may seem, initially, to be a severe restriction, but recall that our language has extensional predicates as well as intensional predicates. If we need a predicate whose meaning depends on particular constants, then we treat it as an extensional predicate, rather than an intensional predicate. For example, the predicate zero/1, which appears in the arithmetic examples later, is treated as an extensional predicate. We do not treat zero as an intensional predicate defined by a single rule with an empty body:

zero(0) ←

Rather, we treat zero as an extensional predicate defined by the single background atom:

zero(0)

The second restriction on generated clauses is that we only allow predicates of arity 0, 1, or 2. We do not currently support ternary predicates or higher¹⁵. Third, we insist that all clauses have exactly two atoms in the body. This restriction can also be made without loss of generality. For any logic program involving clauses with more than two atoms in the body of some clause, there is an equivalent logic program (with additional intensional predicates used as auxiliary functions) with exactly two atoms in the body of each clause¹⁶.

There are four additional restrictions on generated clauses. We rule out clauses that are (1) unsafe (a variable used in the head is not used in the body), (2) circular (the head atom appears in the body), (3) duplicated (equivalent to another clause with the body atoms permuted), and (4) not respecting the intensional flag int (i.e. those that contain an intensional predicate in the clause body, even though the int flag was set to 0, i.e. False). We provide a worked example in Appendix B.

14. If a predicate p is defined by three clauses p(X, Y) ← β1, p(X, Y) ← β2, and p(X, Y) ← β3, create a new auxiliary predicate q and replace the above with four clauses: (i) p(X, Y) ← β1; (ii) p(X, Y) ← q(X, Y); (iii) q(X, Y) ← β2; (iv) q(X, Y) ← β3. We have transformed a program in which one predicate is defined by three clauses into a program in which two predicates are defined by two clauses each.
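A minimal Python sketch of this generate-and-prune step is given below. It enumerates two-atom bodies over a small set of variables and filters out unsafe, circular, and duplicated clauses; the int-flag check and other details of cl(τ) are omitted for brevity, and the encoding of predicates and atoms is our own illustrative choice.

```python
# A sketch of clause generation under the restrictions above: bodies of
# exactly two atoms, no constants, and pruning of unsafe, circular, and
# duplicated clauses. Predicates are (name, arity) pairs; variables are
# the head variables plus v existentially quantified extras.
from itertools import product

def generate_clauses(head_pred, body_preds, v):
    head_vars = ["X", "Y"][: head_pred[1]]
    variables = head_vars + ["Z%d" % i for i in range(v)]
    atoms = [(p, args)
             for p in body_preds
             for args in product(variables, repeat=p[1])]
    head = (head_pred, tuple(head_vars))
    seen, clauses = set(), []
    for a1, a2 in product(atoms, repeat=2):
        body = tuple(sorted([a1, a2]))
        if body in seen:                        # duplicated (permuted body)
            continue
        seen.add(body)
        body_vars = {x for _, args in body for x in args}
        if not set(head_vars) <= body_vars:     # unsafe
            continue
        if head in body:                        # circular
            continue
        clauses.append((head, body))
    return clauses

# e.g. a target q/1 defined from extensional p/1 and q/1 itself, v = 1
clauses = generate_clauses(("q", 1), [("p", 1), ("q", 1)], v=1)
```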
3.3 Reducing Induction to Satisfiability

Given a program template Π, let τ_p^i be the i-th rule template for intensional predicate p, where i ∈ {1, 2} indicates which of the two rule templates we are considering for p. Let C_p^{i,j} be the j-th clause in cl(τ_p^i), the set of clauses generated for template τ_p^i. To turn the induction problem into a satisfiability problem, we define a set Φ of Boolean variables (i.e. atoms with nullary predicates) indicating which of the various clauses in C_p^{i,j} are actually to be used in our program. Now a SAT solver can be used to find a truth-assignment to the propositions in Φ, and we can extract the induced rules from the subset of propositions in Φ that are set to True. The technical details behind this approach are described in Appendix B.

4. A Differentiable Implementation of ILP

In this section, we describe our core model: a continuous reimplementation of the ILP-as-satisfiability approach described above. The discrete operations are replaced by differentiable operators, so the ILP problem can be solved by minimising a loss using stochastic gradient descent. In this case, the loss is the cross entropy of the correct label with regard to the predictions of the system. Instead of the discrete semantics in which ground atoms are mapped to {False, True}, we now use a continuous semantics¹⁷ which maps atoms to the real unit interval¹⁸ [0, 1]. Instead of using Boolean flags to choose a discrete subset of clauses, we now use continuous weights to determine a probability distribution over clauses. This model, which we call ∂ILP, implements differentiable deduction over continuous values.

15. This is because of practical space constraints. See Appendix E.

16. This is a separate condition from the restriction above in Section 3.1 that each predicate can be defined by exactly two clauses.
The gradient of the loss with regard to the rule weights, which we use to minimise classification loss, implements a continuous form of induction.

4.1 Valuations

Given a set G of n ground atoms, a valuation is a vector in [0, 1]^n mapping each ground atom γi ∈ G to the real unit interval. Consider, for example, the language L = (P_e, P_i, arity, C), where:

P_e = {r/2}
P_i = {p/0, q/1}
C = {a, b}

One possible valuation on the ground atoms G of L is:

⊥ ↦ 0.0
p() ↦ 0.0
q(a) ↦ 0.1
q(b) ↦ 0.3
r(a, a) ↦ 0.7
r(a, b) ↦ 0.1
r(b, a) ↦ 0.4
r(b, b) ↦ 0.2

We insist that all valuations map ⊥ to 0. (The reason for including the ⊥ atom will become clear in Section 4.5.)

4.2 Induction by Gradient Descent

Given the sets P and N of positive and negative examples, we form a set Λ of atom-label pairs:

Λ = {(γ, 1) | γ ∈ P} ∪ {(γ, 0) | γ ∈ N}

Each pair (γ, λ) indicates whether atom γ is in P (when λ = 1) or in N (when λ = 0). This can be thought of as a dataset, used to learn a binary classifier that maps atoms γ to their truth or falsehood.

Now, given an ILP problem (L, B, P, N), a program template Π, and a set of clause weights W, we construct a differentiable model that implements the conditional probability of λ for a ground atom α:

p(λ | α, W, Π, L, B)

We want our predicted label p(λ | α, W, Π, L, B) to match the actual label λ in the pair (α, λ) we sample from Λ. We wish, in other words, to minimise the expected negative log likelihood when sampling (α, λ) pairs uniformly from Λ:

loss = -E_{(α,λ)~Λ} [λ · log p(λ | α, W, Π, L, B) + (1 - λ) · log(1 - p(λ | α, W, Π, L, B))]

To calculate the probability of the label λ given the atom α, we infer the consequences of applying the rules to the background facts (using T steps of forward chaining).

17. For a related approach, see the work of Serafini and Garcez (2016), discussed in Section 6.3 below.

18. The values in [0, 1] are interpreted as probabilities rather than fuzzy degrees of truth.
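The objective above is ordinary binary cross-entropy averaged over the (atom, label) pairs in Λ. A Python sketch, with p standing in for the model's output p(λ = 1 | α, W, Π, L, B):

```python
# A sketch of the training objective: the expected negative log
# likelihood of the labels over the dataset Lambda of (atom, label)
# pairs. Here each prediction p is the model's probability that the
# sampled atom is true.
import math

def cross_entropy_loss(predictions):
    """predictions: list of (p, label) pairs with label in {0, 1}."""
    total = 0.0
    for p, label in predictions:
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(predictions)

loss = cross_entropy_loss([(0.9, 1), (0.2, 0), (0.8, 1)])
```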
In Figure 1 below, these consequences are called the "Conclusion Valuation". Then, we extract λ as the probability of α in this valuation. The probability p(λ | α, W, Π, L, B) is defined as:

p(λ | α, W, Π, L, B) = f_extract(f_infer(f_convert(B), f_generate(Π, L), W, T), α)

Here, p(λ | α, W, Π, L, B) is computed using four auxiliary functions: f_extract, f_infer, f_convert, and f_generate (see Figure 1). f_extract and f_infer are differentiable operations, while f_convert and f_generate are non-differentiable.

The function f_extract : [0, 1]^n × G → [0, 1] takes a valuation x and an atom γ and extracts the value for that atom:

f_extract(x, γ) = x[index(γ)]

where index : G → ℕ is a function that assigns each ground atom a unique integer index.

The function f_convert : 2^G → [0, 1]^n takes a set of atoms and converts it into a valuation mapping the elements of B to 1 and all other elements of G to 0:

f_convert(B) = y, where y[i] = 1 if γi ∈ B, and y[i] = 0 otherwise

and where γi is the i-th ground atom in G, for i = 1..n.

The function f_generate produces a set of clauses from a program template Π and a language L:

f_generate(Π, L) = {cl(τ_p^i) | p ∈ P_i, i ∈ {1, 2}}

This uses the cl function defined in Section 3.2 above.

The function f_infer : [0, 1]^n × C × W × ℕ → [0, 1]^n is where all the heavy lifting takes place. It performs T steps¹⁹ of forward-chaining inference using the generated clauses, amalgamating the various conclusions together using the clause weights W. It is described in detail below.

Figure 1 shows the architecture. The inputs to the network, the elements that are fed in every training step, are the atom α and the corresponding label λ. These are fed into the boxes labeled "Sampled Target Atom" and "Sampled Label". This input pair (α, λ) is sampled from the set Λ. The conditional probability p(λ | α, W, Π, L, B) is the value in the "Predicted Label" box. The only trainable component is the set of clause weights W, shown in red.
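The two non-inference bookkeeping functions, f_convert and f_extract, are straightforward; a Python sketch follows, where list-based valuations are an illustrative simplification of the vector operations:

```python
# A sketch of the conversion and extraction steps: f_convert maps the
# background atoms B to an initial valuation, and f_extract reads off
# one atom's value by its index in the list G of ground atoms.

def f_convert(background, ground_atoms):
    """Initial valuation: 1.0 for atoms in B, 0.0 elsewhere."""
    return [1.0 if atom in background else 0.0 for atom in ground_atoms]

def f_extract(valuation, atom, ground_atoms):
    """Value of a single ground atom under a valuation."""
    return valuation[ground_atoms.index(atom)]

G = ["FALSUM", ("p", "a"), ("p", "b"), ("q", "a"), ("q", "b")]
x0 = f_convert({("p", "a"), ("p", "b")}, G)
```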
The differentiable operations are shown in green, while the non-differentiable operations are shown in orange. Note that, even though not all operations are differentiable, the operations between the loss and the clause weights W are all differentiable, so we can compute ∂loss/∂W, which in turn is used to update the clause weights by stochastic gradient descent (or any related optimisation method).

So far, we have described the high-level picture, but not the details. In Section 4.3, we explain how the rule weights are represented. In Section 4.4, we show how inference is performed over multiple time steps, by translating each clause c into a function Fc : [0,1]^n → [0,1]^n.

19. Recall that T is a part of the program template Π.

Figure 1: The ∂ilp Architecture.

Finally, in Section 4.5, we describe how the Fc functions are computed.

4.3 Rule Weights

The weights W are a set {W1, ..., W|Pi|} of matrices, one matrix for each intensional predicate p ∈ Pi. The matrix Wp for predicate p is of shape (|cl(τ^1_p)|, |cl(τ^2_p)|), where |cl(τ^1_p)| is the number of clauses generated by the first rule template τ^1_p, and |cl(τ^2_p)| is the number of clauses generated by the second rule template τ^2_p. Note that the various Wp matrices are typically not of the same size, as the different rule templates produce different-sized sets of clauses.

The weight Wp[j, k] represents how strongly the system believes that the pair of clauses (C^{1,j}_p, C^{2,k}_p) is the right way to define the intensional predicate p. (Recall from Section 3.3 that C^{i,j}_p is the j-th clause of the i-th rule template τ^i_p for intensional predicate p. Here, as each predicate is defined by exactly two clauses, i ∈ {1, 2}.
Recall from Section 3.1 that we assume that each intensional predicate is defined by exactly two clauses). The weight matrix Wp ∈ R^(|cl(τ^1_p)| × |cl(τ^2_p)|) is a matrix of real numbers. We transform it into a probability distribution W*_p ∈ [0,1]^(|cl(τ^1_p)| × |cl(τ^2_p)|) using softmax:

W*_p[j, k] = e^(Wp[j,k]) / Σ_{j',k'} e^(Wp[j',k'])

Here, W*_p[j, k] represents the probability that the pair of clauses (C^{1,j}_p, C^{2,k}_p) is the right way to define the intensional predicate p.

Using matrices to store the weights of each pair of clauses requires a lot of memory (see Appendix E). An alternative, less memory-hungry approach would be to have a vector of weights for every set of clauses generated by each individual rule template. Unfortunately, this alternative is much less effective at the ILP tasks. See Appendix F for a fuller discussion.

4.4 Inference

The central idea behind our differentiable implementation of inference is that each clause c induces a function Fc : [0,1]^n → [0,1]^n on valuations. Consider, for example, the clause c:

p(X) ← q(X)

Table 1 shows the results of applying the corresponding function Fc to two valuations on the set G = {p(a), p(b), q(a), q(b), ⊥} of ground atoms. The details of how the Fc functions are generated are deferred to Section 4.5 below. The important point now is that we can automatically generate, from each clause c, a differentiable function Fc on valuations that implements a single step of forward chaining inference using c.

Recall that C is an indexed set of generated clauses, where C^{i,j}_p is the j-th clause of the i-th rule template τ^i_p for intensional predicate p. Define a corresponding indexed set of functions F, where F^{i,j}_p is the valuation function corresponding to the clause C^{i,j}_p. Now we define another indexed set of functions G^{j,k}_p that combines the application of the two functions F^{1,j}_p and F^{2,k}_p.
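The softmax over the entries of Wp can be sketched as follows (clause_pair_distribution is our own name; subtracting the maximum is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def clause_pair_distribution(W_p):
    """Softmax over all entries of the weight matrix for predicate p,
    giving a joint distribution over pairs (j, k) of candidate clauses."""
    e = np.exp(W_p - W_p.max())   # stable: shift by the max entry
    return e / e.sum()

# A 2x2 weight matrix: the pair (0, 0) is most strongly believed.
W_p = np.array([[2.0, 0.0],
                [0.0, -1.0]])
W_star = clause_pair_distribution(W_p)
```

The result W_star sums to 1 over all clause pairs, with the largest probability on the most strongly weighted pair.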
Recall that each intensional predicate p is defined by two clauses generated from two rule templates τ^1_p and τ^2_p. Now G^{j,k}_p is the result of applying both clauses C^{1,j}_p and C^{2,k}_p and taking the element-wise max:

G^{j,k}_p(a) = x  where  x[i] = max(F^{1,j}_p(a)[i], F^{2,k}_p(a)[i])

G        a0    Fc(a0)   a1    Fc(a1)
p(a)     0.0   0.1      0.2   0.7
p(b)     0.0   0.3      0.9   0.4
q(a)     0.1   0.0      0.7   0.0
q(b)     0.3   0.0      0.4   0.0
⊥        0.0   0.0      0.0   0.0

Table 1: Applying c = p(X) ← q(X), treated as a function Fc, to valuations a0 and a1

Next, we will define a series of valuations of the form at. A valuation at represents our conclusions after t time-steps of inference. The initial valuation a0 at t = 0 is based on our initial set B ⊆ G of background axioms:

a0[x] = 1 if γx ∈ B, and 0 otherwise

We now define c^{p,j,k}_t:

c^{p,j,k}_t = G^{j,k}_p(at)

Intuitively, c^{p,j,k}_t is the result of applying one step of forward chaining inference to at using clauses C^{1,j}_p and C^{2,k}_p. Note that this only has non-zero values for one particular predicate: the p-th intensional predicate. We can now define b^p_t, the weighted average of the c^{p,j,k}_t, using the softmax of the weights:

b^p_t = Σ_{j,k} c^{p,j,k}_t · e^(Wp[j,k]) / Σ_{j',k'} e^(Wp[j',k'])

Intuitively, b^p_t is the result of applying all the possible pairs of clauses that can jointly define predicate p, and weighting the results by the weights Wp. Note that b^p_t is also zero everywhere except for the p-th intensional predicate. From this, we define the successor a_{t+1} of a_t:

a_{t+1} = famalgamate(a_t, Σ_{p∈Pi} b^p_t)

The successor depends on the previous valuation a_t and a weighted mixture of the clauses defining the various intensional predicates. Note that the valuations b^p_t are disjoint for different p, so we can simply sum these valuations. When amalgamating the previous valuation a_t with the single-step consequences Σ_{p∈Pi} b^p_t, there are various functions we can use for famalgamate.
First we considered:

famalgamate(x, y) = max(x, y)

(Note this is the element-wise max over the two valuation vectors). But the use of max here adversely affected gradient flow. The definition of famalgamate we actually use is the probabilistic sum:

famalgamate(x, y) = x + y − x · y

This keeps valuations within the real unit interval [0,1] while allowing gradients to flow through both x and y. The two alternative ways of computing famalgamate are compared in Table 3.

4.5 Computing the Fc Functions

We now explain the details of how the various Fc functions are computed. Each Fc function can be computed as follows. Let Xc = {xk}^n_{k=1} be a set of sets of pairs of indices of ground atoms for clause c. Each xk contains all the pairs of indices of atoms that justify atom γk according to the current clause c:

xk = {(a, b) | satisfies_c(γa, γb) ∧ head_c(γa, γb) = γk}

Note that we can restrict ourselves to pairs only (and don't have to worry about triples, etc.) because we restrict ourselves to rules with two atoms in the body. Here, satisfies_c(γ1, γ2) holds if the pair of ground atoms (γ1, γ2) satisfies the body of clause c: if c = α ← α1, α2, then satisfies_c(γ1, γ2) is true if there is a substitution θ such that α1[θ] = γ1 and α2[θ] = γ2. Also, head_c(γ1, γ2) is the head atom produced when applying clause c to the pair of atoms (γ1, γ2): if c = α ← α1, α2 and α1[θ] = γ1 and α2[θ] = γ2, then head_c(γ1, γ2) = α[θ].

For example, suppose P = {p, q, r} and C = {a, b}.
Then our ground atoms G are:

k    0   1        2        3        4        5        6        7        8
γk   ⊥   p(a,a)   p(a,b)   p(b,a)   p(b,b)   q(a,a)   q(a,b)   q(b,a)   q(b,b)

k    9        10       11       12
γk   r(a,a)   r(a,b)   r(b,a)   r(b,b)

Suppose clause c is:

r(X, Y) ← p(X, Z), q(Z, Y)

Then Xc = {xk}^n_{k=1} is:

k    γk       xk
0    ⊥        {}
1    p(a,a)   {}
2    p(a,b)   {}
3    p(b,a)   {}
4    p(b,b)   {}
5    q(a,a)   {}
6    q(a,b)   {}
7    q(b,a)   {}
8    q(b,b)   {}
9    r(a,a)   {(1,5), (2,7)}
10   r(a,b)   {(1,6), (2,8)}
11   r(b,a)   {(3,5), (4,7)}
12   r(b,b)   {(3,6), (4,8)}

Focusing on a particular row, the reason why (2, 7) is in x9 is that γ2 = p(a, b), γ7 = q(b, a), the pair of atoms (p(a, b), q(b, a)) satisfies the body of clause c, and the head of the clause c (for this pair of atoms) is r(a, a), which is γ9.

We can transform Xc, a set of sets of pairs, into a three-dimensional tensor X ∈ N^(n × w × 2). Here, w is the maximum number of pairs for any k in 1...n. The width w depends on the number v of existentially quantified variables in the rule template. Each existentially quantified variable can take |C| values, so w = |C|^v. X is constructed from Xc, filling in unused space with (0, 0) pairs that point to the pair of atoms (⊥, ⊥):

X[k, m] = xk[m] if m < |xk|, and (0, 0) otherwise

This is why we needed to include the falsum atom ⊥ in G: so that the null pairs have some atom to point to. In our running example, this yields:

k    γk       X[k]
0    ⊥        (0,0)  (0,0)
1    p(a,a)   (0,0)  (0,0)
2    p(a,b)   (0,0)  (0,0)
3    p(b,a)   (0,0)  (0,0)
4    p(b,b)   (0,0)  (0,0)
5    q(a,a)   (0,0)  (0,0)
6    q(a,b)   (0,0)  (0,0)
7    q(b,a)   (0,0)  (0,0)
8    q(b,b)   (0,0)  (0,0)
9    r(a,a)   (1,5)  (2,7)
10   r(a,b)   (1,6)  (2,8)
11   r(b,a)   (3,5)  (4,7)
12   r(b,b)   (3,6)  (4,8)

Let X1, X2 ∈ N^(n × w) be two slices of X, taking the first and second elements of each pair:

X1 = X[:, :, 0]    X2 = X[:, :, 1]

We shall use a function gather2 : R^a × N^(b × c) → R^(b × c):

gather2(x, y)[i, j] = x[y[i, j]]

Now we are ready to define Fc(a).
Let Y1, Y2 ∈ [0,1]^(n × w) be the results of gathering the elements of a according to the matrices of indices X1 and X2:

Y1 = gather2(a, X1)    Y2 = gather2(a, X2)

Now let Z ∈ [0,1]^(n × w) contain the results of element-wise multiplying the elements of Y1 and Y2:

Z = Y1 ⊙ Y2

Here, Z[k, :] is the vector of fuzzy conjunctions of all the pairs of atoms that contribute to the truth of γk, according to the current clause. Now we can define Fc(a) by taking the max fuzzy truth values in Z. Let Fc(a) = a′ where a′[k] = max(Z[k, :]).

The following table shows the calculation of Fc(a) for a particular valuation a, using our running example c = r(X, Y) ← p(X, Z), q(Z, Y). Here, since there is one existential variable Z, v = 1, and w = |{a, b}|^v = 2.

k    γk       a[k]   X1[k]   X2[k]   Y1[k]        Y2[k]        Z[k]          Fc(a)[k]
0    ⊥        0.0    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
1    p(a,a)   1.0    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
2    p(a,b)   0.9    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
3    p(b,a)   0.0    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
4    p(b,b)   0.0    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
5    q(a,a)   0.1    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
6    q(a,b)   0.0    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
7    q(b,a)   0.2    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
8    q(b,b)   0.8    (0,0)   (0,0)   (0.0, 0.0)   (0.0, 0.0)   (0.0, 0.0)    0.00
9    r(a,a)   0.0    (1,2)   (5,7)   (1.0, 0.9)   (0.1, 0.2)   (0.1, 0.18)   0.18
10   r(a,b)   0.0    (1,2)   (6,8)   (1.0, 0.9)   (0.0, 0.8)   (0.0, 0.72)   0.72
11   r(b,a)   0.0    (3,4)   (5,7)   (0.0, 0.0)   (0.1, 0.2)   (0.0, 0.0)    0.00
12   r(b,b)   0.0    (3,4)   (6,8)   (0.0, 0.0)   (0.0, 0.8)   (0.0, 0.0)    0.00

Please note that the calculation of the indices in X is not a differentiable operation. Calculating X makes use of the discrete operations satisfies and head. The matrix X of indices is created ahead of time, before the neural net is constructed. However, the function Fc on valuations is differentiable: it takes the indices in X and applies differentiable operations such as gather2 and element-wise multiplication.

4.5.1 Defining Fuzzy Conjunction

Above, when computing Z, we took the element-wise product:

Z = Y1 ⊙ Y2

Now Y1 and Y2 represent the fuzzy truth values of the two atoms satisfying the body of the clause, and Z represents their conjunction.
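The gather-multiply-max pipeline of the worked example above can be reproduced directly in NumPy (a sketch under the example's index tensor; F_c is our own function name):

```python
import numpy as np

# Index tensor X for c = r(X,Y) <- p(X,Z), q(Z,Y), as in the running
# example (n = 13 ground atoms, w = 2 substitutions; row 0 is falsum).
X = np.zeros((13, 2, 2), dtype=int)
X[9]  = [(1, 5), (2, 7)]   # r(a,a): via (p(a,a),q(a,a)) or (p(a,b),q(b,a))
X[10] = [(1, 6), (2, 8)]   # r(a,b)
X[11] = [(3, 5), (4, 7)]   # r(b,a)
X[12] = [(3, 6), (4, 8)]   # r(b,b)

def F_c(a, X):
    """One step of forward chaining for clause c: gather both body atoms,
    take their fuzzy conjunction (product), then max over substitutions."""
    Y1 = a[X[:, :, 0]]   # gather2(a, X1)
    Y2 = a[X[:, :, 1]]   # gather2(a, X2)
    Z = Y1 * Y2          # element-wise product
    return Z.max(axis=1)

# The valuation a from the table above.
a = np.array([0.0, 1.0, 0.9, 0.0, 0.0, 0.1, 0.0, 0.2, 0.8,
              0.0, 0.0, 0.0, 0.0])
out = F_c(a, X)   # out[9] ≈ 0.18, out[10] ≈ 0.72, all other entries 0
```

Because X[:, :, i] is an integer matrix, NumPy's fancy indexing a[X[:, :, i]] implements gather2 directly.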
Now there are many different ways of representing fuzzy conjunction. At a high level of generality, we need an operator ⊗ : [0,1]² → [0,1] satisfying the conditions on a t-norm (Esteva & Godo, 2001):

  commutativity: x ⊗ y = y ⊗ x
  associativity: (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
  monotonicity (i): x1 ≤ x2 implies x1 ⊗ y ≤ x2 ⊗ y
  monotonicity (ii): y1 ≤ y2 implies x ⊗ y1 ≤ x ⊗ y2
  unit (i): x ⊗ 1 = x
  unit (ii): x ⊗ 0 = 0

Now there are a number of operators satisfying these conditions:

  Gödel t-norm: x ⊗ y = min(x, y)
  Łukasiewicz t-norm: x ⊗ y = max(0, x + y − 1)
  Product t-norm: x ⊗ y = x · y

The reason we chose the product t-norm over the Gödel and Łukasiewicz alternatives is that, when back-propagating from the loss to the clause weights, the gradients flow evenly through both Y1 and Y2. With the Gödel t-norm, no gradient information is sent to Y1 when Y1 > Y2. Similarly, with the Łukasiewicz t-norm, no gradient information is sent to either Y1 or Y2 when Y1 + Y2 < 1. Experimental evaluation bears this out: we tested all three t-norms on our dataset of 20 symbolic problems and found that the product t-norm consistently out-performed the Gödel and Łukasiewicz t-norms. See Section 5.3.4 below.

5. Experiments

ILP has a number of excellent features. But it has two main areas of weakness: intolerance of mislabelled data, and inability to cope with fuzzy or ambiguous data. After first checking that ∂ilp is capable of learning standard ILP tasks with discrete error-free input, our experiments were largely focused on testing how ∂ilp compares with standard ILP in these two problem areas.

We implemented our model in TensorFlow (Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, et al., 2016) and tested it with three types of experiment. First, we used standard symbolic ILP tasks, where ∂ilp is given discrete error-free input.
Second, we modified the standard symbolic ILP tasks so that a certain proportion of the positive and negative examples are wilfully mislabelled. Third, we tested it with fuzzy, ambiguous data, connecting ∂ilp to the output of a pretrained convolutional neural network that classifies MNIST digits.

5.1 Hyperparameters

We first ran a grid search to find reasonable hyperparameters that could solve each of ten tasks in at least 5% of all random initialisations of weights20. We tried a range of optimisation algorithms: Stochastic Gradient Descent, Adam, AdaDelta, and RMSProp. We searched across a range of learning rates in {0.5, 0.2, 0.1, 0.05, 0.01, 0.001}. Weights were initialised randomly from a normal distribution with mean 0 and a standard deviation that ranged between 0 and 2 (the standard deviation was a hyperparameter, but the mean was fixed). For each configuration of hyperparameter settings, we ran 20 trials, each with a different random seed, for each of 10 tasks.

20. 5% was chosen, arbitrarily, as a cut-off for reasonable performance.

We found that both RMSProp and Adam gave reasonable results: both were able to solve all 10 tasks for at least 5% of random weight initialisations. RMSProp was successful on a range of learning rates from 0.5 to 0.01. Hyperparameters were validated against training data, not held-out development data. Because of ∂ilp's strong inductive bias towards universally quantified rules, programs with a low loss on the training data are also biased towards performing well on the test data. In all the experiments described below, we stuck to a particular fixed set of hyperparameter settings: we used RMSProp with a learning rate of 0.5, and initialised clause weights by sampling from a N(0, 1) distribution.

5.2 Experimental Method

Before training, rule weights are initialised randomly by sampling from a N(0, 1) distribution. We train for 6000 steps, adjusting rule weights to minimise cross-entropy loss as described above.
During training, ∂ilp is given multiple (B, P, N) triples. Each triple represents a different possible world. At each training step, ∂ilp first samples one of these (B, P, N) triples, and then samples a mini-batch from P ∪ N. Providing multiple alternative possible worlds helps the system to generalise correctly: it is less likely to focus on irrelevant aspects of one particular situation if it is forced to consider multiple situations.

Each step, we sample a mini-batch from the positive and negative examples. Note that, instead of using the whole set of positive and negative examples at each training step, we just take a random subset. This mini-batching gives the process a stochastic element and helps it to escape local minima.

After 6000 steps, ∂ilp produces a set of rule weights for each rule template. To validate this program, we run the model with the learned weights on an entirely new set of background facts. This tests the system's ability to generalise to unseen data. We compute validation error as the sum of squared differences between the actual labels λi and the predicted labels λ̂i:

Σ_{i=1}^{|Λ|} (λi − λ̂i)²

Once ∂ilp has finished training, we extract the rule weights and take the softmax. Interpreting the softmaxed weights as a probability distribution over clauses, we measure the fuzziness of the solution by calculating the entropy of the distribution. On the discrete error-free tasks, ∂ilp finds solutions with zero entropy, as long as it does not get stuck in a local minimum. To extract a human-readable logic program, we just take all clauses whose probability is over some constant threshold (currently set, arbitrarily, to 0.1).

5.3 ILP Benchmark Tasks

We tested ∂ilp on 20 ILP tasks, taken from four domains: arithmetic, lists, graphs, and family-tree relations. Some of the arithmetic examples appeared in the work of Cropper and Muggleton (2016). The list examples are used by Feser, Chaudhuri, and Dillig (2015).
The family-tree dataset comes from Wang, Mazaitis, and Cohen (2015) and is also used by Yang, Yang, and Cohen (2016).

We emphasise that the 20 tasks we used have the following common feature: in each case, the program can be learned from a small amount of training data. ∂ilp is a memory-expensive solution to ILP (see Appendix E), so only problems with small training sets have been tested. This is why we have not tested ∂ilp on the larger ILP datasets, such as Mutagenesis, WebKB, or IMDB. Although the 20 tasks we use are all quite small in the amount of training data needed to learn them, the programs that need to be synthesised in order to solve them are often complex, involving multiple recursive predicates and invented auxiliary predicates. In this section, we shall focus on three examples in detail. The complete list of 20 examples is detailed in Appendix G.

5.3.1 Learning Even/1 on Natural Numbers

The task is to learn the even predicate on natural numbers. The language contains the monadic predicate zero and the successor relation succ. The background knowledge is the set of basic arithmetic facts defining the zero predicate and succ relation on numbers up to 10:

B = {zero(0), succ(0, 1), succ(1, 2), succ(2, 3), ..., succ(9, 10)}

The positive examples P are:

P = {target(0), target(2), target(4), target(6), target(8), target(10)}

In all these examples, target is the name of the target predicate we are trying to learn. In this case, target = even. The negative examples are:

N = {target(1), target(3), target(5), target(7), target(9)}

For validation and test, we use positive and negative examples of the even predicate on numbers greater than 10. One possible language template for this task is:

Pe = {zero/1, succ/2}    Pi = {target/1, pred/2}

Here, pred is an auxiliary binary predicate. The set C of constants is just {0, 1, ..., 10}.
One suitable program template21 for this task is:

τ^1_target = (h = target, n = 0, int = False)
τ^2_target = (h = target, n = 1, int = True)
τ^1_pred = (h = pred, n = 1, int = False)
τ^2_pred = null

This template specifies two clauses for target and one clause for the auxiliary predicate pred. The architecture assumes that every predicate is defined by exactly two clauses specified by two rule templates; here, we only need one clause to define pred, so the second rule template is null.

21. There are other templates that work equally well.

∂ilp can reliably solve this task. One solution22 found is:

target(X) ← zero(X)
target(X) ← target(Y), pred(Y, X)
pred(X, Y) ← succ(X, Z), succ(Z, Y)

Here, pred(X, Y) is true if X + 2 = Y. One major concern with ILP systems is that the program template makes it easy to bake in knowledge of the target solution. To evaluate this concern, we compared the above program template with the corresponding metarules needed for Metagol to learn the same task (see Appendix D below). We also experimented with iterative deepening over the space of program templates (see Section 7.1 below).

5.3.2 Learning Graph Cyclicity

In this example, ∂ilp is given various graphs (nodes connected by edges). The task is to learn the is-cyclic predicate. This predicate is true of a node if there is a path, a sequence of edge connections, from that node back to itself. (Note that this concept is not definable in first-order logic: we need some form of recursion to capture the idea that the sequence of edge connections could be indefinitely long). In this task, there are two (B, P, N) triples, displayed in Figures 2a and 2b. Providing multiple alternative possible worlds helps the system to generalise correctly.

Figure 2: Three (B, P, N) Triples for the Cyclic Task: (a) First Training Triple; (b) Second Training Triple; (c) Validation Triple. Green circular nodes satisfy the is-cyclic predicate; blue square nodes do not.

22.
All clauses have exactly two literals in the body, so the first target clause is actually target(X) ← zero(X), zero(X). We omit duplicated literals for readability.

The validation triple is shown in Figure 2c. Again, we are testing generalisation in an unseen situation. The language for this task is:

Pe = {edge/2}    Pi = {target/1, pred/2}    C = {a, b, c, d, e, f}

Here, target is the is-cyclic predicate we are trying to learn, and pred is an additional auxiliary predicate. One of the solutions found is:

target(X) ← pred(X, X)
pred(X, Y) ← edge(X, Y)
pred(X, Y) ← pred(X, Z), pred(Z, Y)

Note that the invented predicate pred is the transitive closure of the edge relation, true of X and Y if there is some sequence of edge connections, of any length, between X and Y. In order to solve this task, ∂ilp had to use an invented recursive predicate. ∂ilp is able to solve this problem 100% of the time, irrespective of the initial random weights.

5.3.3 Learning Fizz-Buzz

Joel Grus provides23 a TensorFlow implementation of the children's game "Fizz-Buzz". In this game, the players must enumerate the natural numbers; but if the number is divisible by 3, they should say "Fizz", and if the number is divisible by 5, they should say "Buzz". Grus used a standard multilayer perceptron (MLP). The training data was the integers from 100 to 1024. The integers below 100 were held out as test data. On all runs, the MLP was unable to generalise correctly to the unseen held-out test data24. This is because a standard MLP can be interpreted as suffering from what McCarthy (1988) dubbed the "propositional fixation": it remembers particular instances but does not capture generalities.

We tested ∂ilp on Fizz-Buzz, after translating it into a curriculum learning task. We first asked ∂ilp to learn Fizz, and then, once it had learned the correct rules, we fed the results of those rules, as axioms, into the task of learning Buzz25.
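Read discretely, the learned cyclicity program above can be checked with a naive fixpoint computation (our own sketch; the node and edge names are illustrative):

```python
def learned_is_cyclic(edges, nodes):
    """Discrete reading of the learned program:
         pred = transitive closure of edge;  is-cyclic(X) iff pred(X, X)."""
    pred = set(edges)                        # pred(X,Y) <- edge(X,Y)
    changed = True
    while changed:                           # pred(X,Y) <- pred(X,Z), pred(Z,Y)
        changed = False
        for (x, z) in list(pred):
            for (z2, y) in list(pred):
                if z == z2 and (x, y) not in pred:
                    pred.add((x, y))
                    changed = True
    return {x for x in nodes if (x, x) in pred}

# A small illustrative graph: a -> b -> c -> a, plus a dead-end edge c -> d.
nodes = {"a", "b", "c", "d"}
edges = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
cyclic = learned_is_cyclic(edges, nodes)
```

Here a, b, and c lie on the cycle, while d does not, matching the intended semantics of is-cyclic.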
To learn Fizz, we provided the following language:

Pe = {zero/1, succ/2}    Pi = {target/1, pred1/2, pred2/2}    C = {0, 1, 2, 3, 4, 5, 6}

23. See the darkly humorous blog post http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/.
24. "I guess", he suggests with rueful irony, "maybe I should have used a deeper network".
25. Traditionally, Fizz-Buzz has an additional rule: if a number divides by both 3 and 5, then we say "Fizz-Buzz", which is a separate class from either "Fizz" or "Buzz". We did not model this detail.

Here, zero is the monadic predicate that is true of X if X = 0, succ is the successor relation, and target(X) is the Fizz predicate we are trying to learn, true of numbers that are divisible by 3. In this case, we need two auxiliary predicates, pred1 and pred2, to solve the task. The background axioms are:

B = {zero(0)} ∪ {succ(n, n + 1) | n ∈ {0, ..., 5}}

Note the small number of integers that ∂ilp is trained on. The positive and negative examples are:

P = {target(0), target(3), target(6)}
N = {target(1), target(2), target(4), target(5)}

Note that there are three intensional predicates in the language: the target predicate and two auxiliary predicates, pred1 and pred2. The validation data is an expansion of the earlier setup to larger integers: we test whether 7, 8, and 9 satisfy the Fizz property. ∂ilp can find a solution to this problem. One of the solutions that it finds is:

target(X) ← zero(X)
target(X) ← target(Y), pred1(Y, X)
pred1(X, Y) ← succ(X, Z), pred2(Z, Y)
pred2(X, Y) ← succ(X, Z), succ(Z, Y)

In order to solve this task, it had to construct two auxiliary invented predicates: pred1(X, Y) is true if Y is three greater than X, and pred2(X, Y) is true if Y is two greater than X. The rules that ∂ilp has learned are robust to test data of any size: although it was trained on integers from 0 to 6, the learned program works effectively on test integers of any size.
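A discrete reading of the learned Fizz program makes this robustness concrete (our own sketch; pred1 has been unfolded into its meaning, X = Y + 3, and fizz_model is our own name):

```python
def fizz_model(limit):
    """Discrete reading of the learned program:
         target(X) <- zero(X)
         target(X) <- target(Y), pred1(Y, X)
       where pred1(Y, X) holds iff X = Y + 3.
       Forward-chain target from 0 upwards in steps of 3."""
    target = {0}
    changed = True
    while changed:
        changed = False
        for y in list(target):
            x = y + 3                 # apply pred1(Y, X)
            if x <= limit and x not in target:
                target.add(x)
                changed = True
    return target

# The rules were learned from the integers 0..6, but generalise far beyond.
derived = fizz_model(1000)
```

The derived set is exactly the multiples of 3 up to the limit, however large the limit is taken to be.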
There is no chance of the learned program suddenly failing when the integers reach a certain size; there is no worry that ∂ilp's performance will suddenly fall off a cliff when the integers exceed 1000, say. Grus's MLP failed to solve the Fizz task because it did not capture the general, universally quantified rules needed to understand this task. ∂ilp, by contrast, is entirely incapable of doing anything other than producing general, universally quantified rules. The language bias in this model is very strong: the only rules that ∂ilp understands are rules involving free variables; no constants are allowed whatsoever. While an MLP struggles to generalise, ∂ilp is doomed to generalise.

Now, ∂ilp does not always find the correct solution. All existing neural program induction systems are susceptible to getting stuck in local minima when they are started with an unfortunate random initialisation of the weights. (This is a result of the spikiness of the loss landscape in program induction tasks). ∂ilp is no exception, and on complex problems it only finds a correct solution a certain proportion of the time. However, this dependence on the initial random weights is not a serious problem in this particular system, because we can model-select based on the training data. After a certain number (e.g. 100) of runs, we look at the loss on the training data and choose the model with the lowest loss. Even though we are model-selecting on training data, the top-scoring model will generalise robustly to unseen test data, because generalisation has been baked into the system from the beginning: the only possible rules it can learn are universally quantified rules.
Furthermore, while this was not explored in the current work, one could automate the process of finding the right initialisations by designing a search procedure over clause-weight initialisations, based on how well the entropy of the distribution over clauses is minimised by a fixed number of training steps.

One of the attractive features of ILP systems is that knowledge gained from one task can be packaged up, in a compressed and readable format, into a set of clauses, and this knowledge can be transferred directly to new tasks. We took advantage of this transfer-learning feature in the Fizz-Buzz task: when learning Buzz, we gave ∂ilp the results of the invented predicates pred1 and pred2 from the Fizz task. We compiled the learned rules for pred1 and pred2 into a set of background atoms of the form pred1(n, n + 3) and pred2(n, n + 2). The language for the Buzz task is:

Pe = {zero/1, succ/2, pred1/2, pred2/2}    Pi = {pred3/2, target/1}    C = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

The background information B is:

B = {zero(0)} ∪ {succ(n, n + 1)} ∪ {pred1(n, n + 3)} ∪ {pred2(n, n + 2)}

The positive and negative examples are:

P = {target(0), target(5)}
N = {target(1), target(2), target(3), target(4), target(6), target(7), target(8), target(9)}

∂ilp is sometimes (in 30% of weight initialisations) able to find the correct solution to this problem. One solution it finds is:

target(X) ← zero(X)
target(X) ← target(Y), pred3(Y, X)
pred3(X, Y) ← pred1(X, Z), pred2(Z, Y)

Note that the invented predicate pred3 is true of X and Y if Y is 5 greater than X. This is, to our knowledge, the first neural program induction system that learns Fizz-Buzz and generalises robustly to unseen test data.

5.3.4 Results on All Symbolic Tasks

The results of our experiments on all 20 discrete error-free symbolic tasks are shown in Table 2. We compare ∂ilp with Metagol, a state-of-the-art ILP system (Cropper & Muggleton, 2016; Cropper et al., 2015; Muggleton et al., 2015, 2014).
For each task, we show the number of intensional predicates |Pi|; this is usually a good estimate of the hardness of the task. We also indicate whether each task requires at least one recursive clause; this is another factor in the hardness of the task. The hardest tasks involve synthesising multiple intensional predicates, some of which are recursive.

In the few cases where Metagol was unable to find the correct program, this was because it ran out of time. We gave Metagol and ∂ilp the same fixed time limit (24 hours running on a standard workstation). After correspondence with Andrew Cropper, one of the authors of Metagol, we discovered that in each of the problem cases, Metagol was stuck in an infinite loop due to the presence of a recursive metarule interacting with a cycle in the data. Conversely, there are some problems that Metagol can solve but that ∂ilp is unable to. The main limitation of ∂ilp is that it requires a large amount of memory (see Appendix E). Because of its memory consumption, we currently restrict ourselves to nullary, unary, and binary predicates. So problems that require ternary predicates (for example, the chess strategy problem in Cropper & Muggleton, 2016) are currently outside the scope of ∂ilp. Similarly, the robot waiter example described by Cropper and Muggleton (2016) is also too large for ∂ilp to handle in its present form. We emphasise that a direct apples-to-apples comparison between Metagol and ∂ilp on these 20 tasks is hard, because the two systems require different forms of program template (see Appendix D).

A breakdown of ∂ilp's performance on the 20 symbolic tasks is shown in Table 3. We compare four different methods: the core ∂ilp method; a modification of the ∂ilp method that uses the Gödel t-norm instead of the product t-norm (see Section 4.5.1 above); a modification that uses the Łukasiewicz t-norm; and a variant that uses famalgamate(x, y) = max(x, y), described in Section 4.4.
For each method, we ran 200 trials with different random initialisations of the weights, sampled from a N(0, 1) distribution. The numbers in the table represent the percentage of runs that achieved less than 1e−4 mean squared error on the validation set (i.e. the percentage of runs that generalised correctly to unseen data). All existing neural program induction systems are sensitive to the particular values of the initial random weights, and are prone to getting stuck in local minima in some proportion of the runs (Neelakantan et al., 2015; Kaiser, 2015; Bunel, Desmaison, Mudigonda, Kohli, & Torr, 2016; Gaunt, Brockschmidt, Singh, Kushman, Kohli, Taylor, & Tarlow, 2016; Feser, Brockschmidt, Gaunt, & Tarlow, 2016). So it is important to be transparent and make explicit what proportion of random weight initialisations are successful. (We stress, again, that the system's susceptibility to random weight initialisation is not actually a problem in our particular case, since we can model-select based on training data, without danger of overfitting on test data, because of the architectural constraints enforcing generality of the learned rules. See Section 5.3.3 for a fuller discussion).
Domain        Task              |Pi|   Recursive
Arithmetic    Predecessor       1      No
Arithmetic    Even / odd        2      Yes
Arithmetic    Even / succ2      2      Yes
Arithmetic    Less than         1      Yes
Arithmetic    Fizz              3      Yes
Arithmetic    Buzz              2      Yes
Lists         Member            1      Yes
Lists         Length            2      Yes
Family Tree   Son               2      No
Family Tree   Grandparent       2      No
Family Tree   Husband           2      No
Family Tree   Uncle             2      No
Family Tree   Relatedness       1      No
Family Tree   Father            1      No
Graphs        Undirected Edge   1      No
Graphs        Adjacent to Red   2      No
Graphs        Two Children      2      No
Graphs        Graph Colouring   2      Yes
Graphs        Connectedness     1      Yes
Graphs        Cyclic            2      Yes

Table 2: A Comparison Between ∂ilp and Metagol on 20 Symbolic Tasks

Domain        Task              |Pi|   Recursive   ∂ilp    Gödel   Łukasiewicz   Max
Arithmetic    Predecessor       1      No          100.0   100.0   100.0         100.0
Arithmetic    Even / odd        2      Yes         100.0   44.0    52.0          34.0
Arithmetic    Even / succ2      2      Yes         48.5    28.0    6.0           20.5
Arithmetic    Less than         1      Yes         100.0   100.0   100.0         100.0
Arithmetic    Fizz              3      Yes         10.0    1.5     0.0           5.5
Arithmetic    Buzz              2      Yes         14.0    35.0    3.5           5.5
Lists         Member            1      Yes         100.0   100.0   100.0         100.0
Lists         Length            2      Yes         92.5    59.0    6.0           82.0
Family Tree   Son               2      No          100.0   94.5    0.0           99.5
Family Tree   Grandparent       2      No          96.5    61.0    0.0           96.5
Family Tree   Husband           2      No          100.0   100.0   100.0         100.0
Family Tree   Uncle             2      No          70.0    60.5    0.0           68.0
Family Tree   Relatedness       1      No          100.0   100.0   100.0         100.0
Family Tree   Father            1      No          100.0   100.0   100.0         100.0
Graphs        Undirected Edge   1      No          100.0   100.0   100.0         100.0
Graphs        Adjacent to Red   2      No          50.5    40.0    1.0           42.0
Graphs        Two Children      2      No          95.0    74.0    53.0          95.0
Graphs        Graph Colouring   2      Yes         94.5    81.0    2.5           90.0
Graphs        Connectedness     1      Yes         100.0   100.0   100.0         100.0
Graphs        Cyclic            2      Yes         100.0   100.0   0.0           100.0

Table 3: The Percentage of Runs that Achieve Less Than 1e−4 Mean Squared Test Error

5.4 Dealing with Mislabelled Data

Standard ILP finds a set R of clauses such that:

B, R ⊨ γ for all γ ∈ P
B, R ⊭ γ for all γ ∈ N

These requirements are strict, and there is no room for error.
If one of the positive examples P or negative examples N is mislabelled, then standard ILP cannot find the intended program. ∂ILP, by contrast, minimises a loss rather than trying to satisfy a strict requirement, so it is capable of handling mislabelled data. We tested this by adding a global variable ρ, the proportion of mislabelled data, and varying it between 0 and 1 in small increments. At the beginning of each experiment, a proportion ρ of the atoms from P and N are sampled, without replacement, and transferred to the other group. Note that the algorithm sees the same consistently mislabelled examples over all training steps, rather than letting the mislabelling vary between training steps. The results are shown in Table 4 and Figure 3. Each cell shows the mean squared test error for a particular task with a particular proportion of mislabelled data. The results show that ∂ILP is robust to mislabelled data. The mean squared error degrades gracefully as the proportion of mislabelled data increases. This is in stark contrast to a traditional ILP system, where test error increases sharply as soon as there is a single piece of mislabelled training data. At 10% mislabelled data, ∂ILP still finds a perfect answer in 5 of the 6 tasks. In some of the tasks, ∂ILP still finds good answers with 20% or 30% mislabelled data.
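The corruption procedure just described can be sketched as follows (a minimal sketch; the function name `mislabel` and its arguments are our own, not from the paper's code):

```python
import random

def mislabel(positive, negative, rho, seed=0):
    """Flip the labels of a proportion rho of the examples, once,
    before training begins (the corruption stays constant across steps)."""
    rng = random.Random(seed)
    pos, neg = set(positive), set(negative)
    flips = rng.sample(sorted(pos | neg), k=round(rho * (len(pos) + len(neg))))
    for atom in flips:  # transfer each sampled atom to the other group
        if atom in pos:
            pos.remove(atom); neg.add(atom)
        else:
            neg.remove(atom); pos.add(atom)
    return pos, neg

P = {"even(0)", "even(2)", "even(4)"}
N = {"even(1)", "even(3)", "even(5)"}
P2, N2 = mislabel(P, N, rho=1.0)
# with rho = 1.0, the two groups are swapped entirely
```

Sampling without replacement (as in the text) guarantees that exactly ⌊ρ·|P ∪ N|⌉ distinct examples are flipped, and no example is flipped back.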
ρ      Predecessor   Less Than   Member   Son     Connectedness   Undirected Edge
0.00   0.000         0.000       0.000    0.000   0.000           0.000
0.05   0.000         0.000       0.000    0.000   0.000           0.000
0.10   0.000         0.000       0.000    0.065   0.005           0.000
0.15   0.007         0.006       0.002    0.171   0.030           0.005
0.20   0.117         0.005       0.013    0.171   0.079           0.007
0.25   0.279         0.064       0.029    0.328   0.161           0.032
0.30   0.425         0.071       0.034    0.400   0.250           0.032
0.40   0.629         0.276       0.127    0.455   0.326           0.134
0.50   0.686         0.486       0.174    0.483   0.483           0.302
0.60   0.695         0.616       0.199    0.480   0.648           0.369
0.70   0.705         0.761       0.202    0.503   0.686           0.469
0.80   0.715         0.788       0.203    0.507   0.793           0.474
0.90   0.727         0.753       0.182    0.504   0.833           0.499

Table 4: Mean squared test error as a function of the proportion ρ of mislabelled data

[Figure 3: Mean squared test error as a function of the proportion of mislabelled data. Panels: (a) Predecessor, (b) Less-Than, (c) List: Member, (d) Family Tree: Son, (e) Graph: Connectedness, (f) Graph: Undirected Edge.]

Figure 4 shows the entropy of the rules as the proportion of mislabelled examples increases. The rule weights, when soft-maxed, represent a probability distribution over clauses. The entropy of that probability distribution measures the fuzziness of the learned rules. Note that, in some cases, entropy goes down as the proportion of mislabelled data approaches 1. Sometimes, it is easier to find a clear rule when all the data is mislabelled than when half of it is mislabelled. For example, if we are learning the even predicate on natural numbers, and we switch all the positive and negative examples around, then we will learn a rule for the odd numbers, and this rule will have zero entropy.

[Figure 4: Entropy of Rule Weights as the Proportion of Mislabelled Examples Increases. Panels: (a) Predecessor, (b) Less-Than, (c) List: Member, (d) Family Tree: Son, (e) Graph: Connectedness, (f) Graph: Undirected Edge.]

5.5 Dealing with Ambiguous Data

Standard ILP assumes that atoms are either true or false. There is no room for vagueness or indecision.
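The fuzziness measure just described, the entropy of the soft-maxed clause weights, can be sketched as follows (the weight values are hypothetical; this is our own numpy sketch, not the paper's implementation):

```python
import numpy as np

def clause_entropy(weights):
    """Entropy (in nats) of the distribution obtained by soft-maxing
    a vector of clause weights. Low entropy = a crisp rule choice."""
    w = np.asarray(weights, dtype=float)
    p = np.exp(w - w.max())   # numerically stable soft-max
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# A confident rule choice has near-zero entropy ...
print(clause_entropy([10.0, 0.0, 0.0]))
# ... while indifference between clauses maximises it (log of the
# number of candidate clauses).
print(clause_entropy([0.0, 0.0, 0.0]))
```

This is why flipping all labels yields zero entropy: the soft-maxed weights again concentrate on a single (now "odd") clause set.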
∂ILP, by contrast, uses a continuous semantics which maps atoms to the real unit interval [0, 1]. It was designed to reason about cases where we are unsure whether a proposition is true or false (see Section 4.1). We tested ∂ILP's ability to handle ambiguous input by connecting it to the results of a pre-trained convolutional neural network (a "convnet") in the style of Krizhevsky, Sutskever, and Hinton (2012), trained on MNIST classification, following LeCun, Bottou, Bengio, and Haffner (1998). Of course, MNIST is no longer a seriously challenging image-classification problem. The aim here is not to advance the state of the art in image recognition, but to hook up a well-understood and easily-trainable image-recognition system to ∂ILP, to test how ∂ILP fares with ambiguous data. We ran tests on 10 separate tasks. In the next few sections, we describe four of those tasks in detail.

5.5.1 Learning Even Numbers from Raw Pixel Images

In this task, the system is given a sequence of 28 × 28 MNIST images. The training signal is a single binary value, indicating whether the MNIST image represents a number that is even (1) or odd (0). Figure 5 shows how the convnet is connected to ∂ILP. The image is fed into the pre-trained convnet. The logits from the convnet are transformed, via soft-max, into a probability distribution over a set of atoms. For example, if the image represents a "2", the sub-valuation [26] produced by the convnet might be:

image(0) ↦ 0.0    image(3) ↦ 0.0
image(1) ↦ 0.0    image(4) ↦ 0.1
image(2) ↦ 0.8    image(5) ↦ 0.1

We only consider the integers 0-5 to keep the example simple. We also have a set of background atoms B. In this case:

B = {zero(0)} ∪ {succ(n, n + 1) | n ∈ {0, ..., 4}}

B is transformed into a background valuation, and this valuation is merged with the sub-valuation for image to produce the initial valuation. The target predicate in this case is a nullary [27] predicate: target is true if the image I can currently see is even.
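The logits-to-sub-valuation step can be sketched as follows (the logit values are hypothetical, and the function and variable names are ours):

```python
import math

def logits_to_subvaluation(logits, predicate="image"):
    """Soft-max a convnet's logits into a probability for each
    ground atom predicate(0), ..., predicate(n-1)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # stable soft-max
    total = sum(exps)
    return {f"{predicate}({i})": e / total for i, e in enumerate(exps)}

# Hypothetical logits for an image of a handwritten "2":
sub_val = logits_to_subvaluation([-4.0, -4.0, 2.0, -4.0, 0.0, 0.0])
# sub_val["image(2)"] is the largest entry, close to the 0.8 in the text
```

Because the soft-max output sums to 1, the sub-valuation encodes the assumption that the image denotes exactly one integer.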
We emphasise that there is an extra element in Figure 5 that is not present in Figure 1 above. In Figure 5, the initial valuation is generated by combining two elements: the axiom valuation and the image sub-valuation. The initial valuation represents everything that I know to currently be the case. This contains both what is timelessly true (the example-invariant axioms) and what is currently true (the example-varying image sub-valuation, representing the image I am currently seeing). In Figure 1, there is no equivalent of the image sub-valuation.

26. A sub-valuation is the part of a valuation that is restricted to a single predicate.
27. A predicate is nullary if it does not take any arguments. ∂ILP currently handles nullary, unary, and binary predicates only.

[Figure 5: ∂ILP from Raw Pixel Images. The diagram's labels include: cross entropy, target atom, clause weights, clauses, initial valuation, image sub-valuation, network weights, image(s), and program template; it distinguishes differentiable from non-differentiable elements, parameters from inputs, and marks the program template as a time-invariant assumption.]

The language L for this task is:

Pe: {zero/1, succ/2, image/1}
Pi: {target/0, pred1/1, pred2/1}
C = {0, ..., 5}

Note that this task requires synthesising two auxiliary predicates, pred1 and pred2. One of the solutions ∂ILP finds for this task is:

target() ← image(X), pred1(X)
pred1(X) ← zero(X)
pred1(X) ← succ(Y, X), pred2(Y)
pred2(X) ← succ(Y, X), pred1(Y)

The top rule says: the target proposition is true if the image I am currently seeing has numeral X, and X satisfies pred1. The invented pred1 predicate is recognisably the even predicate, with a base case for zero and a step case. The pred2 predicate is the odd predicate. Solving this task involved synthesising two mutually recursive invented predicates.

5.5.2 Learning if the Left Image is Exactly Two Less than the Right Image

In this task, the system is given a pair of images (left and right) each training step.
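As a quick, non-differentiable sanity check of the learned program, one can run the four clauses through a naive forward-chaining evaluator of our own (the paper's actual inference is the differentiable forward chaining of Section 4; atoms here are plain strings):

```python
def forward_chain(facts, max_n=5):
    """Naively apply the four learned clauses to a set of ground facts
    until a fixpoint is reached."""
    db = set(facts)
    while True:
        new = set(db)
        for x in range(max_n + 1):
            for y in range(max_n + 1):
                if f"succ({y},{x})" in db and f"pred2({y})" in db:
                    new.add(f"pred1({x})")   # pred1(X) <- succ(Y,X), pred2(Y)
                if f"succ({y},{x})" in db and f"pred1({y})" in db:
                    new.add(f"pred2({x})")   # pred2(X) <- succ(Y,X), pred1(Y)
            if f"zero({x})" in db:
                new.add(f"pred1({x})")       # pred1(X) <- zero(X)
            if f"image({x})" in db and f"pred1({x})" in db:
                new.add("target()")          # target() <- image(X), pred1(X)
        if new == db:
            return db
        db = new

B = {"zero(0)"} | {f"succ({n},{n+1})" for n in range(5)}
assert "target()" in forward_chain(B | {"image(4)"})      # 4 is even
assert "target()" not in forward_chain(B | {"image(3)"})  # 3 is odd
```

The mutual recursion is visible in the trace: pred1 (even) and pred2 (odd) alternate up the successor chain from zero.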
The training label is 1 if the number represented by the left image is exactly two less than the number represented by the right image. The language L for this task is:

Pe: {zero/1, succ/2, image1/1, image2/1}
Pi: {target/0, pred1/1, pred2/2}
C = {0, ..., 5}

Here, image1(X) is true if the left image represents the integer X, and image2(X) is true if the right image represents X. Note that again there are two auxiliary predicates to synthesise as well as the target predicate. One of the solutions that ∂ILP finds is:

target() ← image2(X), pred1(X)
pred1(X) ← image1(Y), pred2(X, Y)
pred2(X, Y) ← succ(Y, Z), succ(Z, X)

Here, target is true if the right image X satisfies the pred1 predicate, and pred1(X) is true if the left image Y is two less than X.

5.5.3 Learning if the Right Image Equals 1

In this task, the system is given a pair of images (left and right) each training step. The training label is 1 if the number represented by the right image is exactly 1. The system must learn to ignore the label of the left image. The language for this task is:

Pe: {zero/1, succ/2, image1/1, image2/1}
Pi: {target/0, pred1/1}
C = {0, ..., 5}

One of the solutions that ∂ILP finds is:

target() ← image2(X), pred1(X)
pred1(X) ← zero(Y), succ(Y, X)

It has learned to correctly ignore the content of the left image, image1. Note that the invented predicate pred1 is true of X if and only if X = 1.

5.5.4 Learning the Less-Than Relation

In this task, the system is again given a pair of images each training step. The training label is 1 if the number represented by the left image is less than the number represented by the right image. See Figure 6.
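To see that the clauses learned for the "two less" task compute the intended relation, one can unfold them by hand; a small sketch (the helper names are ours; the relations are read directly off the learned clauses):

```python
def succ(y, x, n=5):
    # succ(Y, X) holds when X = Y + 1, over the constants 0..n
    return 0 <= y <= n and 0 <= x <= n and x == y + 1

def pred2(x, y, n=5):
    # pred2(X, Y) <- succ(Y, Z), succ(Z, X): X is two more than Y
    return any(succ(y, z, n) and succ(z, x, n) for z in range(n + 1))

def target(left, right, n=5):
    # target() <- image2(X), pred1(X), with pred1(X) <- image1(Y), pred2(X, Y):
    # holds exactly when the right number is two more than the left
    return pred2(right, left, n)

assert target(1, 3)        # 1 is exactly two less than 3
assert not target(3, 1)
assert not target(2, 5)
```

The existential variable Z in pred2 (bound in the body, absent from the head) is what the program template's "number of existential variables" setting controls.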
The language for this task is:

[Figure 6: Learning Less-Than from Raw Pixel Images: a pair of input images and a binary label.]

Pe: {zero/1, succ/2, image1/1, image2/1}
Pi: {target/0, pred1/1, pred2/2}
C = {0, ..., 5}

This is another hard task that requires synthesising two auxiliary predicates, one of which is recursive. One of the solutions that ∂ILP found is:

target() ← image2(X), pred1(X)
pred1(X) ← image1(Y), pred2(Y, X)
pred2(X, Y) ← succ(X, Y)
pred2(X, Y) ← pred2(Z, Y), pred2(X, Z)

Here pred2 is the less-than relation. ∂ILP found a number of other solutions, some of which are less human-readable, e.g.

target() ← image2(X), pred1(X)
pred1(X) ← succ(Y, X), pred2(X, Y)
pred2(X, Y) ← succ(Y, X), image1(Y)
pred2(X, Y) ← succ(Y, X), pred2(Y, Z)

Here, pred2 is true of pairs of integers (n + 1, n) for n greater than or equal to the left image. It creates a ladder that starts at the left image and keeps going up, and then target tests whether the right image intersects with that ladder. Finally, to test the sample complexity of this approach, we ran the less-than experiment while holding out certain pairs of integers. Please note that we are not just holding out pairs of images. Rather, we are holding out pairs of integers, and removing from training every pair of images whose labels match that pair. We built a simple MLP baseline to compare ∂ILP against. In the baseline model, the logits for the two images are concatenated together, creating a vector of size 20. This vector is then passed to a linear layer with ReLU activations. Again, we train by minimising cross-entropy loss. The MLP baseline is able to solve this straightforward task reliably when it is given every pair of images. But as the number of held-out pairs increases, the baseline performs increasingly poorly. If the MLP model has seen that 2 < 3 and that 3 < 4, but has never seen any pair of images with labels (2, 4), then it does not know what to do.
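The MLP baseline can be sketched as follows (a minimal numpy forward pass; the hidden size and initialisation are our assumptions, as the paper does not give these details):

```python
import numpy as np

rng = np.random.default_rng(0)

# Concatenated logits of the two images form a vector of size 20.
# One hidden layer with ReLU, then a 2-way output, trained with
# cross-entropy (only the forward pass is sketched here).
W1, b1 = rng.normal(0, 0.1, (64, 20)), np.zeros(64)   # hidden size 64 assumed
W2, b2 = rng.normal(0, 0.1, (2, 64)), np.zeros(2)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)        # ReLU
    logits = W2 @ h + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()                       # probabilities for labels 0 / 1

def cross_entropy(x, label):
    return -np.log(forward(x)[label])

x = rng.normal(size=20)                      # stand-in for two images' logits
loss = cross_entropy(x, 1)
```

The baseline has no inductive bias tying (2, 4) to (2, 3) and (3, 4), which is exactly why it fails on held-out integer pairs.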
[Figure 7: Comparing ∂ILP with the MLP Baseline. Panels: (a) proportion of successful runs, (b) mean error.]

We compare the baseline's and ∂ILP's performance in Figure 7. In the left figure, we plot the number of runs that get at least 96% correct on the test data. The reason we cap at 96% is that our pre-trained MNIST classifier has a test accuracy of 98%, and we are running it twice to classify two images, so the chance of it classifying both images correctly is 0.98 × 0.98 ≈ 0.96. Because ∂ILP's performance depends, in part, on the values of the initial random weights, it never achieves 100% performance. But please note that its performance is robust even when holding out 70% of the data. Because of its strong inductive bias, ∂ILP is able to learn the correct generalisation from far fewer pairs of data. In the right image, we plot the mean test error as we increase the hold-out proportion. Note that ∂ILP maintains a respectable mean error even when 70% of the pairs of integers are never seen.

5.5.5 Discussion

Our aim is to connect ∂ILP to a perceptual system, with a view to potentially training the whole system end-to-end: learning rules and perceptual classifiers at once. The initial experiments described above do not quite reach that goal, for two reasons. First, we do not train the convnet at the same time as ∂ILP. Rather, we pre-train the convnet on MNIST classification, and then freeze the convnet weights. Second, we provide a custom mapping from the convnet's logits to the initial sub-valuation. We applied our knowledge that there are two images, each of which describes a single digit, to bake in a particular mapping from the convnet's logits to the initial sub-valuation: since we know that there is exactly one X such that image1(X), we apply soft-max to the logits when transforming to the sub-valuation.
While this does not provide information about the specific tasks the system will deal with, it is nonetheless knowledge about the domain the tasks form a part of. Ideally, this knowledge should be learned by the system rather than baked in by the designers, and it is important to note that the framework presents no obstacles to this being achieved. In future work, we plan to develop an end-to-end differentiable system that learns rules and classifiers simultaneously, focussing on good initialisation through pre-training of the perceptual network. In a further series of experiments, we no longer pass the logits from the pre-trained convnet directly to the sub-valuation (the mapping from {image(i)} for i = 0, ..., 5 to probabilities). Rather, we extract the final hidden layer of the convnet, and learn how to convert it into a sub-valuation. If the final hidden layer of the convnet is h, then we compute the sub-valuation v as follows:

v = softmax(Ah + B)

where A and B are learned weights. This way, the network learns the mapping from perceptual discriminations to symbolic assignments as part of the end-to-end learning process. The model learns to discriminate the images of integers just as much as it needs to for the task at hand. For example, when learning the zero task (whether the image I am currently seeing represents the integer zero), it reliably maps images of zero to the integer 0, but all other images are mapped to a high-entropy distribution. It blurs the difference between the other integers, as it does not need them. In another example, when learning the identity relation, it creates some bijective mapping from classes of images to integers, but may map images of 1 to 2, and images of 2 to 1, etc. In a third example, when learning the predecessor relation on integers, it learned to map the image of n to 9 − n, so e.g. 0 ↦ 9 and 1 ↦ 8, and then it used succ to implement predecessor directly.
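The learned mapping v = softmax(Ah + B) is a single affine layer followed by a soft-max; a sketch (hidden-layer size and weight values are hypothetical; in the experiments A and B are trained jointly with the rule weights):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_atoms = 128, 10        # our assumptions, not from the paper

A = rng.normal(0, 0.05, (num_atoms, hidden_size))
B = np.zeros(num_atoms)

def subvaluation(h):
    """Map the convnet's final hidden layer h to a distribution over
    the atoms image(0), ..., image(9): v = softmax(A h + B)."""
    z = A @ h + B
    e = np.exp(z - z.max())
    return e / e.sum()

v = subvaluation(rng.normal(size=hidden_size))
# v is a probability distribution over which integer the image denotes
```

Nothing in this layer forces image(n) to mean the digit n, which is why the predecessor experiment could settle on the mapping n ↦ 9 − n.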
We find these initial results intriguing, in that they serve to illustrate the arbitrariness of symbols in standing for concepts, and believe they show the promise of end-to-end learning that combines perception and reasoning.

6. Related Work

We survey some recent related work involving neural or differentiable implementations of logic programming in general, and inductive logic programming in particular. Early work in this area involved translating propositional logic programs into neural nets (Steinbach & Kohut, 2002), or learning a propositional logic program from a set of examples (Hölldobler & Kalinke, 1991). More recently, nonclassical logics such as modal logics, nonmonotonic propositional logics, and argumentation frameworks have been encoded in neural nets (Garcez, Gabbay, Ray, & Woods, 2007; Garcez, Lamb, & Gabbay, 2008, 2009; Garcez, Gabbay, & Lamb, 2014). The related work discussed below focuses on learning first-order rules.

6.1 Approximating the Semantics of Logic Programs by Recurrent Neural Networks

Hölldobler, Kalinke, and Störr (1999) show that, for every logic program in a certain class, there exists a recurrent net that can compute the forward-chaining consequences of that program to any desired degree of precision. Given a set G of ground atoms and an injective function ‖·‖ : G → N mapping atoms to integer indices, they define a mapping R : 2^G → R from sets of ground atoms to the reals [28]:

R(M) = Σ_{g ∈ M} 4^{−‖g‖}

Since ‖·‖ is injective, R has an inverse R⁻¹. They define a single-step consequence operator:

T_P(M) = {g | g ← g1, ..., gm ∈ ground(P), {g1, ..., gm} ⊆ M}

Note that this is not quite the same as our cn_P, defined in Section 2.1, since T_P(M) does not include the input atoms M. Now, given a logic program P, they define a function f_P : R → R on the reals that mirrors the forward-chaining consequence function T_P : 2^G → 2^G on sets of ground atoms:

f_P(x) = R(T_P(R⁻¹(x)))

The above function is defined only on the range of R.
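The operators R, T_P, and f_P can be sketched on a finite toy program (the program and atom indices are our own; the base-4 encoding follows the definition above):

```python
# Toy ground program: rules written as (head, [body atoms]).
ground_rules = [
    ("q", ["p"]),
    ("r", ["p", "q"]),
]

def T_P(M):
    """Single-step consequence operator: heads of rules whose bodies
    are all contained in M (input atoms are NOT carried over)."""
    return {h for h, body in ground_rules if all(b in M for b in body)}

index = {"p": 1, "q": 2, "r": 3}   # the injective mapping ||.||

def R(M):
    # Encode a set of ground atoms as a single real, base 4
    return sum(4.0 ** -index[g] for g in M)

def R_inv(x):
    # Decode; valid only on the range of R, for this finite atom set
    M = set()
    for g in sorted(index, key=index.get):
        if x >= 4.0 ** -index[g] - 1e-12:
            M.add(g)
            x -= 4.0 ** -index[g]
    return M

def f_P(x):
    return R(T_P(R_inv(x)))

assert T_P({"p"}) == {"q"}
assert T_P({"p", "q"}) == {"q", "r"}
assert R_inv(R({"p", "r"})) == {"p", "r"}
```

The greedy decoding in R_inv works precisely because, in base 4, each digit outweighs the sum of all smaller digits; this fails in base 2, as footnote 28 explains.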
They extend f_P to a function on R by linear interpolation. Now, given that f_P as defined is continuous, they apply the well-known result that any continuous function on the reals can be approximated, to any degree of precision, by a three-layer network. Therefore, for every acyclic program, there exists a neural net that can compute forward-chaining inference on that program to any desired degree of precision. However, as the authors acknowledge, this instructive and revealing theoretical result does not give us a practical method for constructing neural nets for particular logic programs. The particular encoding R they use, mapping sets of atoms to individual real numbers, is not practical, as it assumes arbitrary-precision floating point numbers.

28. There is nothing particularly special about the use of base 4 in this expression. Any base b > 2 could have been used. To see that base 2 is not acceptable, consider an infinite set {g1, g2, ...} of ground atoms. Now R({g1}) = 0.1₂ and R({g2, g3, ...}) = 0.0111...₂ = 0.1₂, so our R function fails to be injective. In any base b greater than 2, 0.1_b ≠ 0.0111..._b, so the function R is injective, as desired.

6.2 Connectionist Model Generation: A First-Order Approach

Bader, Hitzler, and Hölldobler (2008) built upon the work of Hölldobler et al. (1999) described above. The authors describe an actual implementation based on that theory. The impractical aspect of the work of Hölldobler et al. (1999) is the use of a single real number to represent an entire set of ground atoms. To make the approach practical, they use a vector, rather than a single number, to represent a set of ground atoms. They modify ‖·‖ to be a 2-dimensional level mapping, assigning each ground atom a pair (l, d) where l is a level, as before, and d is a position index in the distributed vector representation.
Similarly, they redefine R : 2^G → R^k, from sets of ground atoms to vectors of length k, via an atom embedding ι(A) = (ι₁(A), ..., ι_k(A)) with

ι_i(A) = 4^{−l} if ‖A‖ = (l, d) and i = d; ι_i(A) = 0 otherwise

The paper describes two key technical contributions. The first is a program that converts an acyclic logic program P into a neural net that computes f_P for any desired level n. To convert a logic program into a multi-layer neural net, for a particular desired level n, they enumerate all possible models containing atoms that are of level less than or equal to n. For each such model M, they compute T_P(M), add a corresponding hidden unit h, and set the weights from h to the output layer to R(T_P(M)). In other words, there is a new hidden unit for every possible model M, so 2^n hidden units are added. Each hidden unit represents a particular combination of propositional atoms. The neural net effectively memorises how to map particular sets of propositional atoms to their immediate consequents. The second major contribution is a technique for learning an acyclic program from examples. To learn a set of weights that computes f_P, we sample models M containing atoms of level n. Set the input layer of the net to the vector representation of the model:

input = R(M)

Set the desired output layer y to the vector representation of the single-step consequence of M:

y = R(T_P(M))

They update the weights from hidden units to output using the weight update rule with learning rate η:

w_out ← η·y + (1 − η)·w_out

Their approach is similar to the approach taken in the current paper in that it involves a differentiable implementation of forward chaining. But it is different in a number of respects. First, they consider acyclic logic programs, while we consider Datalog programs. Acyclic programs can include function symbols, and hence infinite Herbrand universes, but do not include programs involving mutual recursion.
Datalog programs do not support function symbols, and have finite Herbrand universes, but do allow mutual recursion. Second, once a net has learned to compute f_P, the program is hidden in the weights. It is not at all obvious how to extract an explicit symbolic logic program from these weights. How to do so, if indeed it is possible, is an open research question. Third, there is a fundamental difference in the way a clause is represented in the neural net. In their system, each hidden unit represents a hyper-square around the vector representation of a particular set of ground atoms. Each hidden unit has a "propositional fixation". In our system, by contrast, a universally quantified clause is applied simultaneously to every pair of atoms. The hidden units in our net encode the relative weight of a general rule that is applied repeatedly, while each hidden unit in their network encodes a particular set of ground atoms.

6.3 Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge

Serafini and Garcez (2016) provide a generalisation of the semantics of first-order logic, replacing the standard Boolean values with real values in [0, 1]. In their "Real Logic", a constant c is interpreted as a vector G(c) ∈ R^n of reals, a function f is interpreted as a function G(f) : R^{arity(f)·n} → R^n from tuples of vectors to vectors of reals, and a predicate p is interpreted as a function G(p) : R^{arity(p)·n} → [0, 1] from tuples of vectors to [0, 1]. Given these interpretations, they define satisfaction for complex ground formulae:

G(f(t1, ..., tm)) = G(f)(G(t1), ..., G(tm))
G(p(t1, ..., tm)) = G(p)(G(t1), ..., G(tm))
G(¬φ) = 1 − G(φ)
G(φ1 ∨ ... ∨ φm) = μ(G(φ1), ..., G(φm))

where μ is the dual of some t-norm operator (see Section 4.5.1 above). Note that this continuous semantics works for full first-order logic. They are not restricted to Horn clauses in general or to function-free Datalog clauses in particular.
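The continuous connectives can be made concrete; here we assume the product t-norm, whose dual (t-conorm) is the probabilistic sum, though the definition above leaves the choice of t-norm open:

```python
from functools import reduce

def neg(v):
    # G(~phi) = 1 - G(phi)
    return 1.0 - v

def t_norm(a, b):
    # product t-norm for conjunction (one common choice)
    return a * b

def t_conorm(a, b):
    # its dual, used for disjunction: mu(a, b) = a + b - a*b
    return a + b - a * b

def disjunction(*values):
    # G(phi1 v ... v phim) = mu(G(phi1), ..., G(phim)); the probabilistic
    # sum is associative, so reduce() gives the m-ary version
    return reduce(t_conorm, values)

assert disjunction(0.5, 0.5) == 0.75
assert abs(t_norm(0.5, neg(0.5)) - 0.25) < 1e-12
```

With Boolean inputs (0 or 1), these operators reduce exactly to classical negation, conjunction, and disjunction.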
The major difference between this approach and ours is that theirs is a form of continuous abduction, rather than induction. Given an assignment Ĝ from atoms to [0, 1], and a set R of first-order statements, they find an extension G ⊇ Ĝ such that G minimises the discrepancy between the given (continuous) truth-values assigned to the ground instances of the first-order statements in R and the truth-values calculated for those ground instances using the continuous satisfaction rules above. This way, they can fill in facts that were not given, so as to maximise the truth-values of the first-order rules. This is a form of knowledge completion over fuzzy valuations.

6.4 Computing First-Order Logic Programs by Fibring Artificial Neural Networks

Bader, Garcez, and Hitzler (2005) describe a way of implementing the single-step operator T_P (described in Section 6.1 above) using fibred neural networks. This approach can represent recursive programs, and even mutually recursive programs (e.g. the mutual definition of even and odd described in Appendix G.2). However, this system cannot represent programs with free variables: variables that are existentially quantified in the body of a rule and do not appear in the head of the rule.

6.5 Logical Hidden Markov Models

The state transitions of a Hidden Markov Model (HMM) can be viewed as encoding a set of propositional rules, each weighted with a probability. Kersting, De Raedt, and Raiko (2006) lift HMMs from a propositional to a first-order representation, and call the resulting structure a Logical HMM (LOHMM). A rule in a LOHMM is of the form:

p : h(X, Y) ← a(X) b(X)

where p is a probability; h, a, and b are predicates; X is a vector of universally quantified variables and constants; and Y is a vector of existentially quantified variables appearing only in the head.
This rule reads: for all X, if b(X) holds then, with probability p, emit atom a(X) and transition to a state h(X, Y) for some value of Y. In their paper, they show how the rule probabilities p can be learned from sets of atom sequences, using a modified version of the Baum-Welch algorithm. Using this technique, they are able to learn first-order rules from sequences of data. Experimental results show that this approach produces smaller models and better generalisation than traditional HMMs. This approach is very similar to ours in its aim: we both want to learn first-order rules from noisy data by maximising a conditional probability. One difference is that they are learning rules from sequences of atoms. The background assumption behind their approach is that exactly one atom is true at a time: a set of hidden states emits a sequence of atoms, and we wish to learn the hidden states, or the rules that transition between the states. In our ∂ILP approach, by contrast, the state of the world at any time is modelled by a set of atoms, not by a single atom. Another difference between our approaches is that LOHMM rules have a single atom in the body, while ∂ILP rules have multiple atoms in the body. LOHMM rules are equivalent to the rules of linear Datalog (Calì, Gottlob, & Lukasiewicz, 2012).

6.6 Probabilistic Inductive Logic Programming

De Raedt and Kersting (2008) make a strong case for the importance of applying relational learning to uncertain, noisy data. They provide a very general formulation of ILP using a "covers" relation. This general formulation encompasses ILP learning from entailment (the formulation of ILP used in our paper), learning from interpretations, and learning from proofs. They generalise this covers relation to the probabilistic setting by reinterpreting it as a conditional probability.
If e is an example, R is a set of rules, and B is a background theory, they define covers(e, R, B) as

covers(e, R, B) = P(e | R, B)

This is similar to our probabilistic formulation in Section 4.2 above. They describe nFOIL (Landwehr, Kersting, & De Raedt, 2005), a probabilistic variant of FOIL (Quinlan & Cameron-Jones, 1995) using naive Bayes. This system is able to learn individual clauses in the presence of noisy data, and outperformed mFOIL and Aleph on the benchmark Mutagenesis dataset. nFOIL is certainly similar in approach to ∂ILP. One difference between their system and ours is that ∂ILP learns multiple clauses at once, including multiple recursive clauses, while FOIL and its variants are not able to learn recursive clauses [29].

6.7 First-Order Logic Learning in Artificial Neural Networks

Guillame-Bert, Broda, and Garcez (2010) describe an approach to learning definite clauses which is similar, in various ways, to our approach. In their approach, just like ours, the architecture of the network is designed to implement forward chaining on definite clauses. Although our approach is similar to theirs overall, there are a number of differences in the details. First, the type of clause they learn is different. They do not make the Datalog restriction, and consider definite clauses with function symbols. Second, their system is only able to learn a single clause, rather than a set of clauses. Third, they do not support recursive clauses.

6.8 Fast Relational Learning Using Bottom Clause Propositionalization with Artificial Neural Networks

França, Zaverucha, and Garcez (2014) describe a powerful approach for learning first-order logic programs efficiently. They use the bottom clauses ⊥_e introduced by Muggleton (1995). For each positive example e ∈ P, the bottom clause ⊥_e is the most specific clause whose head unifies with e.
Since the clauses form a lattice (with a partial ordering defined by θ-subsumption), there is always a most specific clause for a particular language bias. For example, if target is a binary predicate, the bottom clause ⊥_e for target(a, b) might be

target(X, Y) ← p(X), q(Y), r(X, X), s(Y), t(Y, X)

For each negative example e ∈ N, we find the most specific clause whose head unifies with ¬e (where ¬ is classical negation). For example, the bottom clause for a negative example target(c, d) ∈ N might be:

¬target(X, Y) ← q(X), p(Y), r(Y, X), t(X, X)

A bottom clause is typically rather long: it is, after all, the most specific clause (in the finite set of clauses specified by the language bias) that captures the facts about the example [30].

29. But see the work of Natarajan, Khot, Kersting, Gutmann, and Shavlik (2012) and Khot, Natarajan, Kersting, and Shavlik (2015) for systems that can learn multiple clauses at once. See the work of De Raedt, Kersting, Natarajan, and Poole (2016) for an overview of Statistical Relational Learning.
30. For a thorough discussion, see the work of Tamaddoni-Nezhad and Muggleton (2009).

França et al. (2014) use the set of bottom clauses generated from the examples as a set of features: each clause is applied to every tuple of individuals, generating a list of features for each tuple. They take the bottom clauses from the positive and negative examples, ground them for all tuples, and then pass the ground bottom clauses to CILP++. Then, they use the ground clauses to generate a neural net, and learn the appropriate weights. This system has a number of appealing features. First, it learns first-order rules rather than just propositional rules. Second, it is very efficient. Comparing CILP++ with ∂ILP, both systems have their advantages and disadvantages.
We note first that CILP++ is unable to learn recursive clauses, since the predicates in the body of the bottom clauses ⊥_e are always extensional predicates. ∂ILP, by contrast, is able to learn complex recursive programs, sometimes involving mutual recursion (see Appendix G). On the other hand, ∂ILP is currently restricted to small datasets, because of its memory requirements, so it cannot be tested on large datasets such as Mutagenesis, KRK, or UW-CSE. ∂ILP and CILP++ occupy different positions in the space of design trade-offs: ∂ILP can learn more complex programs, while CILP++ can handle larger amounts of data.

6.9 Learning Knowledge Base Inference with Neural Theorem Provers

Rocktäschel and Riedel (2016) describe a differentiable neural-net implementation of a backward-chaining inference procedure. From the examples they give, it looks like they are also operating on the Datalog subset of logic programs, although they do not explicitly say this. The Neural Theorem Prover is handed a set of predefined clause templates: clauses in which some elements (including predicates) are wild cards, e.g.

#1(X, Z) ← #2(X, Y), #2(Y, Z)

This template has already specified the variable bindings. Notice also the wild-card predicate matching in the above template: it insists that the same predicate (#2) is used for both atoms in the body of the rule. These templates are similar to the mode declarations used by Corapi, Russo, and Lupu (2011), and are a more restrictive form of rule template than the (v, int) templates used in ∂ILP. Given a set of clause templates and a goal, they produce a neural net model that performs backward chaining: starting with the goal, they work backwards using the rules to find a set of substitutions that makes the goal true. They keep track of substitutions as vectors of vectors, and keep track of the plausibility of a particular substitution as a real number in [0, 1].
The Neural Theorem Prover uses a distributed representation of atoms: each atom is represented by a triple of vectors. The vector for each symbol is a distributed latent representation. One thing that is particularly interesting about this system is that it learns clauses and vector embeddings simultaneously. At a high level, the Neural Theorem Prover is similar to our work: they provide a differentiable model of inference on Datalog clauses, using a distributed representation of atoms. The main difference between their approach and ours is that their model uses backward-chaining inference, keeping track of the various proof paths and substitutions, while our approach uses forward chaining.

6.10 A Differentiable Approach to Inductive Logic Programming

Yang et al. (2016) outline a system that learns clauses by a differentiable implementation of forward chaining. They apply their model to the European royal family dataset (Wang et al., 2015), and learn rules. Some of our baseline ILP tasks are borrowed from this model. There are two differences between their approach and the one described here. First, they only learn individual clauses, not sets of clauses. Second, their system does not learn recursive clauses.

6.11 Dyna: A Declarative Language for Implementing Dynamic Programs

Eisner, Goldlust, and Smith (2004) describe a high-level declarative first-order language for specifying Natural Language Processing systems. Starting with an initial assignment of probabilities to axioms, the system runs forward-chaining inference to deduce further probabilities, concluding with the probability of a certain parse of a sentence. The system can also be run in reverse, to maximise the probability of the input sentence. In this reverse mode, the weights of the axioms are updated (using back-propagation) to maximise the probability of the training data. The central difference between ∂ILP and Dyna is the difference between induction and abduction.
In ilp, we are given a set B of background rules, and we use gradient descent to learn the weights of the rules R. In Dyna, by contrast, the set of rules R is fixed, and they use gradient descent to learn the weights of the background assumptions B.

7. Discussion

We conclude by discussing the strengths and weaknesses of our approach.

7.1 Program Templates and Language Bias

A common complaint levelled at Inductive Logic Programming, and at Inductive Program Synthesis more generally, is that the system needs to be given a program template (also known as "language bias", "mode declarations", or "metarules"), carefully hand-engineered by a human to match the task at hand, in order for the system to have any chance of finding the desired program. A lot of the difficult work has already been done by the choice of program template, and all the program synthesis system needs to do is fill in the blanks. The strength of this complaint depends on the amount of information in the program template. If the program template specifies exactly the predicates or functions to be called, then it is very restrictive indeed. The rule templates used in ilp are not so restrictive; we specify the number of existential variables and whether the clause may or must use intensional predicates, but we do not specify the form of the recursive call. See Appendix C and Appendix D for a comparison with other forms of language bias. Nevertheless, it would clearly be preferable to avoid the need for program templates altogether. We believe that the need for program templates is the key weakness in existing ILP systems, ours included. We experimented with a brute-force way to avoid hand-engineering rule templates by using iterative deepening. Suppose we have a way of enumerating program templates, starting with the most minimal, in such a way that all program templates will eventually be produced.
To do this, we used Cantor's pairing function, applied recursively, generating sequences of program templates. Given such an enumeration, we searched through the space of programs without needing any hand-engineered rule templates, by running the different program templates in batch over a large cluster of machines. In this brute-force approach, we have, of course, only avoided hand-engineering rule templates by resorting to a massive amount of computation.

7.2 Conclusion

Our main contribution is a differentiable implementation of Inductive Logic Programming. This model is able to learn moderately complex programs: programs requiring multiple invented predicates and recursive clauses. Unlike traditional symbolic ILP systems, our model is robust to mislabelled training data. Furthermore, unlike traditional symbolic ILP systems, our model can handle ambiguous data in domains where artificial neural networks are traditionally the tool of choice. Unlike neural networks on their own, it offers the same data efficiency and generalisation guarantees as traditional ILP systems, extended to new domains. In doing so, this work opens the door to an end-to-end differentiable system that is capable of learning perceptual and inferential rules simultaneously. We evaluated this model on 20 symbolic ILP tasks, and showed that it can consistently solve problems traditional ILP systems excel on. We tested its tolerance to mislabelled data, and demonstrated that it was able to learn good models even with 20% mislabelled training examples. We also tested it with ambiguous data, connecting the output of a pretrained convnet to the inputs of the induction system, and it was able to learn effectively and consistently generalise. There are two main limitations of our framework that we wish to address in future work. First, we currently use program templates to constrain the set of programs that are searched through.
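The enumeration via Cantor's pairing function can be sketched as follows. This is a minimal illustration of the idea only: the mapping from pairs directly to (v, int)-style template parameters, rather than the recursive application to full program templates, is a simplifying assumption.

```python
def cantor_unpair(z):
    """Invert Cantor's pairing function: map a natural number z to
    the unique pair (k1, k2) with pair(k1, k2) == z, where
    pair(k1, k2) = (k1 + k2) * (k1 + k2 + 1) // 2 + k2."""
    w = int(((8 * z + 1) ** 0.5 - 1) // 2)  # w = k1 + k2
    k2 = z - w * (w + 1) // 2
    return w - k2, k2

# Enumerating z = 0, 1, 2, ... visits every pair of naturals exactly
# once, so an iterative-deepening search that interprets each pair as
# template parameters will eventually produce every template.
templates = [cantor_unpair(z) for z in range(5)]
# templates == [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1)]
```

Because small z values decode to small parameter pairs, the enumeration naturally starts with the most minimal templates, which is exactly what iterative deepening requires.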
We have experimented with avoiding the use of hand-engineered program templates, by searching through the space of program templates using iterative deepening, as outlined in Section 7.1. But this approach is hugely computationally expensive. In future work, we plan to adopt a less brute-force solution to this problem. Second, the experiments on ambiguous data relied on pre-trained neural network classifiers. We baked in the knowledge that the MNIST images were clustered into ten classes corresponding to the ten digits. Preliminary experiments showed that pre-training only the convolutional layers of a vision network, and jointly learning the projection into symbols and the reasoning over such symbols, yielded satisfactory and interpretable results for our MNIST-based tasks, but full end-to-end joint training of vision modules and ilp requires further research. Future work will focus on sensible pre-training strategies for these neural networks, which do not obviously bias them towards the specific problems into which they will be integrated.

Acknowledgments

We thank Nando de Freitas, Shane Legg, Demis Hassabis, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, David Pfau, Chrisantha Fernando, Joel Veness, and Marek Sergot for their helpful comments and suggestions. We are grateful to Andrew Cropper and Stephen Muggleton for making Metagol available online; thanks to Andrew for his patience in explaining how to use it. We thank the reviewers for their thoughtful comments.

Appendix A. Notational Conventions

The notational conventions used throughout this paper are summarised in Tables 5 and 6. We use A, B, ... to denote sets of ground atoms, e.g. A = {p(a, b), p(b, c)}. We use p, q, ... to denote predicates, a, b, ... to denote constants, and X, Y, ... to denote variables. Italic lowercase letters, e.g. a, b, ..., denote scalars. Boldface lowercase letters, e.g. a, b, ..., denote vectors. Boldface uppercase letters, e.g. A, B, ..., denote matrices and tensors.
The i-th element of a vector x is x[i]. The element at row i and column j in a matrix A is A[i, j]. The i-th row of a matrix A is A[i, :]. The j-th column is A[:, j]. If X is a tensor of shape (k, m, n) then X[i, :, :] is the i-th matrix of shape (m, n). We use × for scalar multiplication (and also for cross-product), ⊙ for component-wise multiplication, and · for dot product.

Appendix B. ILP as Satisfiability

This elaborates the discussion in Section 3.3 above. We shall consider the following simple example throughout this section. Let our language-frame be

L = (target, Pe, aritye, C)
target = q/2
C = {a, b, c, d}

We follow the logic programming convention of placing the arity of the predicate after the predicate name, so q/2 states that arity(q) = 2. We shall consider the following ILP problem (L, B, P, N), with background facts:

B = {p(a, b), p(b, c), p(c, d)}

The positive examples P are

P = {q(a, b), q(a, c), q(a, d), q(b, c), q(b, d), q(c, d)}

The negative examples N are

N = {q(X, Y) | (X, Y) ∈ {a, b, c, d}², q(X, Y) ∉ P}

We shall consider two rule templates for q. Template τ^1_q is (v = 0, int = 0), specifying no existentially quantified variables and disallowing intensional predicates in the body. Template τ^2_q is (v = 1, int = 1), specifying one existentially quantified variable and requiring an intensional predicate in the body. For this simple problem, we use the program template Π = (Pa, aritya, rules, T) where:

Term : Explanation
p(X) ← q(X, Y) : A definite clause.
This clause states that, for all X, if there is a Y such that q(X, Y), then p(X)
L : A language frame: a tuple consisting of a target, a set of extensional predicates Pe with associated arities, and a set C of constants
target : The target predicate that the ILP system is trying to learn
Pe : A set of extensional predicates
arity : A map P → N from predicates to arities
C : A set of constants
AL : The set of all atoms (both ground and unground) that can be generated from language L
α : A particular atom (ground or unground)
G : The set of all ground atoms
n : The size of G
γ : A particular ground atom
⊥ : Falsum, the atom that is always false
B : The set of background assumptions
P : The set of positive examples of the target predicate
N : The set of negative examples of the target predicate
τ : A rule template, a tuple (v, int)
v : The number of existentially quantified variables allowed by a rule template
int : Whether intensional predicates are allowed in clauses
Π : A program template: a tuple (Pa, aritya, rules, T)
Pa : A set of auxiliary (invented) predicates
Pi : The set of intensional predicates: Pi = {target} ∪ Pa
rules : A map from intensional predicates p ∈ Pi to pairs of rule templates (τ^1_p, τ^2_p)
P : The set of all predicates in the language: Pe ∪ Pi
T : The max number of time-steps of forward chaining inference permitted by the program template
L : A language (Pe, Pi, arity, C) combining the extensional predicates Pe from a language frame with the intensional predicates Pa from the program template
τ^i_p : The i-th rule template for intensional predicate p, i ∈ {1, 2}
CL : The set of all definite clauses that can be generated from language L
R : A set of definite clauses, a program, a candidate solution to an ILP problem
cl(τ) : The set of clauses that can be generated from rule template τ
C^{i,j}_p : The j-th clause generated by the i-th rule template for predicate p
Φ : The set Φ = {f^{i,j}_p | p ∈ Pi, i ∈ {1, 2}, j = 1..|cl(τ^i_p)|} of Boolean variables indicating whether to use the j-th clause in the i-th rule template for p
cl'(τ^i_p) : The set of generated clauses for τ^i_p, modified by adding the corresponding f^{i,j}_p flags, for j = 1..|cl(τ^i_p)|
F : A subset F ⊆ Φ of flags specifying a particular set of clauses

Table 5: Notational Conventions used throughout this paper.

Term : Explanation
Λ : A set of atom-label pairs
λ : A label for a particular ground atom γ
W : A set of weight matrices W = {Wp | p ∈ Pi}
Wp ∈ R^{|cl(τ^1_p)| × |cl(τ^2_p)|} : The weight matrix for a particular intensional predicate p, determining the probability of every pair (c1, c2) of clauses satisfying rule templates τ^1_p and τ^2_p
a, b, c, x : Valuation vectors in [0, 1]^n
Fc : [0, 1]^n → [0, 1]^n : The valuation function performing one step of forward inference corresponding to clause c
F^{i,j}_p : [0, 1]^n → [0, 1]^n : The valuation function corresponding to clause C^{i,j}_p
G^{j,k}_p : [0, 1]^n → [0, 1]^n : The valuation function corresponding to one step of forward inference using both clause C^{1,j}_p and clause C^{2,k}_p
θ : A substitution mapping free variables to constants
α[θ] : The result of applying substitution θ to atom α

Table 6: Notational Conventions used throughout this paper.

aritya = {}
rules = {τ^1_q, τ^2_q}
T = 3, specifying 3 steps of forward chaining

Here, there are no auxiliary predicates needed. This ILP problem is sufficiently simple that it can be solved without additional invented predicates. Most of the benchmark ILP problems described below in Section 5.3 require at least one invented predicate. When there are no auxiliary predicates, the only intensional predicate that needs to be assigned rule templates is the target predicate, q. Suppose template τ^1_q for predicate q is (v = 0, int = 0), specifying no existentially quantified variables and disallowing intensional predicates in the body. There are only 8 generated clauses for τ^1_q remaining after pruning:

1. q(X, Y) ← p(X, X), p(X, Y)
2. q(X, Y) ← p(X, X), p(Y, X)
3. q(X, Y) ← p(X, X), p(Y, Y)
4. q(X, Y) ← p(X, Y), p(X, Y)
5. q(X, Y) ← p(X, Y), p(Y, X)
6.
q(X, Y) ← p(X, Y), p(Y, Y)
7. q(X, Y) ← p(Y, X), p(Y, X)
8. q(X, Y) ← p(Y, X), p(Y, Y)

Note that some of the clauses have the same atom repeated twice in the body. For example, clause 4:

q(X, Y) ← p(X, Y), p(X, Y)

This clause is equivalent to:

q(X, Y) ← p(X, Y)

The repetition of atoms is not a mistake. The clauses are generated automatically, and we insist that all clauses have exactly two atoms in the body. This insistence on a uniform clause size will come in useful when we transfer to the matrix-based approach in Section 4. Suppose template τ^2_q for q is (v = 1, int = 1), specifying one existentially quantified variable and requiring an intensional predicate in the body. There are 58 generated clauses for τ^2_q remaining after pruning, of which the first 16 are:

1. q(X, Y) ← p(X, X), q(Y, X)
2. q(X, Y) ← p(X, X), q(Y, Y)
3. q(X, Y) ← p(X, X), q(Y, Z)
4. q(X, Y) ← p(X, X), q(Z, Y)
5. q(X, Y) ← p(X, Y), q(X, X)
6. q(X, Y) ← p(X, Y), q(X, Z)
7. q(X, Y) ← p(X, Y), q(Y, X)
8. q(X, Y) ← p(X, Y), q(Y, Y)
9. q(X, Y) ← p(X, Y), q(Y, Z)
10. q(X, Y) ← p(X, Y), q(Z, X)
11. q(X, Y) ← p(X, Y), q(Z, Y)
12. q(X, Y) ← p(X, Y), q(Z, Z)
13. q(X, Y) ← p(X, Z), q(Y, X)
14. q(X, Y) ← p(X, Z), q(Y, Y)
15. q(X, Y) ← p(X, Z), q(Y, Z)
16. q(X, Y) ← p(X, Z), q(Z, Y)

Given a program template Π, let τ^i_p be the i-th rule template for intensional predicate p, where i ∈ {1, 2} indicates which of the two rule templates we are considering for p. Let C^{i,j}_p be the j-th clause in cl(τ^i_p), the set of clauses generated for template τ^i_p. To turn the induction problem into a satisfiability problem, we define a set Φ of Boolean variables (i.e.
atoms with nullary predicates) indicating whether the various generated clauses are actually to be used in our program R:

Φ = {f^{i,j}_p | p ∈ Pi, i ∈ {1, 2}, j = 1..|cl(τ^i_p)|}

Here, f^{i,j}_p indicates whether the j-th clause of cl(τ^i_p) is used in our program R:

f^{i,j}_p iff C^{i,j}_p ∈ R

A candidate solution to an ILP problem is a subset F ⊆ Φ of these atoms indicating which clauses to turn on. We insist that exactly one flag is turned on for each rule template τ^i_p:

∀p ∈ Pi, ∀i ∈ {1, 2}, ∃!j ∈ {1..|cl(τ^i_p)|} : f^{i,j}_p ∈ F

To transform the induction problem into a satisfiability problem, we modify our generated clauses by adding a Boolean flag to each clause, so the j-th clause of cl(τ^i_p) now tests whether f^{i,j}_p is true. Let cl'(τ^i_p) be the modified set of clauses that include the appropriate flags. Returning to our example, the Boolean variables (i.e. atoms using nullary predicates) are f^{1,1}_q, ..., f^{1,8}_q for τ^1_q and f^{2,1}_q, ..., f^{2,58}_q for τ^2_q. The modified clauses cl'(τ^1_q) are:

1. q(X, Y) ← p(X, X), p(X, Y), f^{1,1}_q
2. q(X, Y) ← p(X, X), p(Y, X), f^{1,2}_q
3. q(X, Y) ← p(X, X), p(Y, Y), f^{1,3}_q
4. q(X, Y) ← p(X, Y), p(X, Y), f^{1,4}_q
5. q(X, Y) ← p(X, Y), p(Y, X), f^{1,5}_q
6. q(X, Y) ← p(X, Y), p(Y, Y), f^{1,6}_q
7. q(X, Y) ← p(Y, X), p(Y, X), f^{1,7}_q
8. q(X, Y) ← p(Y, X), p(Y, Y), f^{1,8}_q

The modified clauses cl'(τ^2_q) are:

1. q(X, Y) ← p(X, X), q(Y, X), f^{2,1}_q
2. q(X, Y) ← p(X, X), q(Y, Y), f^{2,2}_q
3. q(X, Y) ← p(X, X), q(Y, Z), f^{2,3}_q
4. q(X, Y) ← p(X, X), q(Z, Y), f^{2,4}_q
5. q(X, Y) ← p(X, Y), q(X, X), f^{2,5}_q
6. q(X, Y) ← p(X, Y), q(X, Z), f^{2,6}_q
7. q(X, Y) ← p(X, Y), q(Y, X), f^{2,7}_q
8. q(X, Y) ← p(X, Y), q(Y, Y), f^{2,8}_q
9. q(X, Y) ← p(X, Y), q(Y, Z), f^{2,9}_q
10. q(X, Y) ← p(X, Y), q(Z, X), f^{2,10}_q
11. q(X, Y) ← p(X, Y), q(Z, Y), f^{2,11}_q
12. q(X, Y) ← p(X, Y), q(Z, Z), f^{2,12}_q
13. q(X, Y) ← p(X, Z), q(Y, X), f^{2,13}_q
14. q(X, Y) ← p(X, Z), q(Y, Y), f^{2,14}_q
15. q(X, Y) ← p(X, Z), q(Y, Z), f^{2,15}_q
16. q(X, Y) ← p(X, Z), q(Z, Y), f^{2,16}_q

Recall the ILP problem.
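Before restating the ILP problem, the flag encoding just defined can be sketched in code. The clause strings, the toy clause tables, and the particular flag assignment below are all illustrative assumptions, not the paper's implementation; the check inside the helper mirrors the exactly-one-flag-per-template constraint above.

```python
# Toy clause tables: key (predicate, template index) -> generated clauses.
clauses = {
    ("q", 1): ["q(X,Y) <- p(X,X), p(X,Y)",   # j = 1
               "q(X,Y) <- p(X,X), p(Y,X)",   # j = 2
               "q(X,Y) <- p(X,Y), p(X,Y)"],  # j = 3 in this toy table
    ("q", 2): ["q(X,Y) <- p(X,Y), q(Y,Z)",   # j = 1
               "q(X,Y) <- p(X,Z), q(Z,Y)"],  # j = 2
}

def extract_program(flags, clauses):
    """Enforce exactly one flag f^{i,j}_p per template, then map the
    chosen flags to the program R = {C^{i,j}_p | f^{i,j}_p in F}."""
    for key, table in clauses.items():
        chosen = [j for (k, j) in flags if k == key]
        assert len(chosen) == 1 and 1 <= chosen[0] <= len(table)
    return [clauses[k][j - 1] for (k, j) in sorted(flags)]

# A hypothetical satisfying flag set: one clause per template.
F = {(("q", 1), 3), (("q", 2), 2)}
R = extract_program(F, clauses)
```

Flipping which flag is on for a template swaps the corresponding clause in and out of R, which is exactly the search space the satisfiability encoding explores.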
Given a set B of background assumptions, and sets P and N of positive and negative examples, we wish to find a set R of definite clauses such that:

B, R ⊨ γ for all γ ∈ P
B, R ⊭ γ for all γ ∈ N

We recast this as a satisfiability problem: find a set F ⊆ Φ of Booleans such that:

∀p ∈ Pi, ∀i ∈ {1, 2}, ∃!j ∈ {1..|cl(τ^i_p)|} : f^{i,j}_p ∈ F
⋃_{p ∈ Pi, i ∈ {1,2}} cl'(τ^i_p) ⊨ γ for all γ ∈ P
⋃_{p ∈ Pi, i ∈ {1,2}} cl'(τ^i_p) ⊭ γ for all γ ∈ N

This approach transforms the induction problem (find a set R of rules) into a satisfiability problem (find a subset F ⊆ Φ of atoms). If we find a set F of flags satisfying these constraints, then we can extract the logic program R:

R = {C^{i,j}_p | f^{i,j}_p ∈ F}

Returning to the running example from Section 3.3, suppose our background facts B are

B = {p(a, b), p(b, c), p(c, d)}

Suppose our positive examples P are

P = {q(a, b), q(a, c), q(a, d), q(b, c), q(b, d), q(c, d)}

and our negative examples N are

N = {q(X, Y) | (X, Y) ∈ {a, b, c, d}², q(X, Y) ∉ P}

Then one set F satisfying the constraints is

F = {f^{1,4}_q, f^{2,16}_q}

We can extract a logic program R from these clauses:

R = {C^{1,4}_q, C^{2,16}_q}

In other words, R is:

q(X, Y) ← p(X, Y), p(X, Y)
q(X, Y) ← p(X, Z), q(Z, Y)

This solution defines q as the transitive closure of p.

Appendix C. Comparing Language Bias in ILP with Mode Declarations

Many ILP systems31 specify a language bias (a range of admissible programs) by using a set of mode declarations. A mode declaration is an atom p(t1, ..., tn) where each ti is a placemarker indicating the type and grounding of the term that can appear in this position. Each mode declaration is either a head declaration (modeh) specifying the form of the head of a rule, or a body declaration (modeb) specifying the form of one of the literals in the body of a rule.
For example, the mode declarations for the even/succ2 example in Section 5.3.1 are:

m1: modeh(even(+nat))
m2: modeh(pred(+nat, +nat))
m3: modeb(zero(+nat))
m4: modeb(even(−nat))
m5: modeb(pred(−nat, +nat))
m6: modeb(succ(+nat, −nat))
m7: modeb(succ(−nat, +nat))

The first two declarations m1 and m2 specify the form of the head of the rule: it must be either even(X) or pred(X, Y). The modeb declarations specify the sorts of atoms that can appear in the body of the rule.

31. E.g., Progol (Muggleton, 1995) or ASPAL (Corapi et al., 2011).

Let us compare this with the program templates in Section 5.3.1:

τtarget,1 = (h = target, n = 0, int = False)
τtarget,2 = (h = target, n = 1, int = True)
τpred,1 = (h = pred, n = 1, int = False)

In comparison with ilp's rule templates, the mode declarations are rather restrictive. While our rule templates only specify the number of existentially quantified variables, and the type of predicate (intensional or extensional) that can appear in the body of the rule, the mode declarations specify precisely the exact predicates allowed in the body of the rule. Declaration m5 is particularly restrictive:

m5: modeb(pred(−nat, +nat))

Not only does it specify the precise predicate to be used in the atom (it must be pred), but it also specifies exactly where the new existentially quantified variable may appear in the atom: it must be in the first argument of pred.

Appendix D. Comparing Language Bias in ILP and in Metagol

Recall that when learning even in Section 5.3.1 above, we used the following program template:

τtarget,1 = (h = target, n = 0, int = False)
τtarget,2 = (h = target, n = 1, int = True)
τpred,1 = (h = pred, n = 1, int = False)

Here, we compare the above program template with the corresponding metarules needed for Metagol to learn the same task:

metarule(base,[P,Q],([P,A]:-[[Q,A]])).
metarule(recursive,[P,Q],([P,A]:-[[Q,B,A],[P,B]])).
metarule(invented, [P,Q], ([P,A,B]:-[[Q,A,C],[Q,C,B]])).
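How a Metagol-style metarule turns into a first-order clause can be sketched by substituting its second-order variables. This is a hypothetical illustration: the tuple encoding and the instantiate helper are not Metagol's implementation, and the particular binding is chosen to match the even/succ2 task.

```python
# A metarule as (head, body): second-order variables P, Q range over
# predicates, first-order variables A, B over terms.
recursive = (("P", "A"), [("Q", "B", "A"), ("P", "B")])

def instantiate(metarule, binding):
    """Substitute predicate variables, leaving other symbols alone."""
    head, body = metarule
    sub = lambda atom: tuple(binding.get(t, t) for t in atom)
    return sub(head), [sub(a) for a in body]

head, body = instantiate(recursive, {"P": "even", "Q": "succ2"})
# head == ("even", "A"); body == [("succ2", "B", "A"), ("even", "B")]
# i.e. even(A) <- succ2(B, A), even(B)
```

The metarule thus fixes both the variable bindings and the shape of the recursive call; only the predicate names remain to be chosen, which is the sense in which it is a stronger language bias than a (v, int) rule template.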
Comparing the two program templates, ilp specifies less information about the terms used in the rules: the Metagol metarules specify the exact variable bindings that must occur, while the ilp template only specifies the number of existential variables. The ilp template also specifies less information about the predicates used in the rules. The Metagol template specifies the exact form of recursion to appear in the rule:

metarule(recursive,[P,Q],([P,A]:-[[Q,B,A],[P,B]])).

This is more restrictive than the ilp template, which simply says that at least one intensional predicate is allowed in the body of the rule.

Appendix E. Space Requirements

ilp is undoubtedly a memory-intensive solution to logic program induction. One major source of memory consumption is the collection of vectors a^{p,j,k}_t and c^{p,j,k}_t. Together, these require the following number of floats:

Σ_{i=1}^{|Pi|} |cl(τ^1_i)| · |cl(τ^2_i)| · n · T

where Pi is the set of intensional predicates, T is the number of time steps of forward inference, and cl(τ) is the set of clauses generated for rule template τ. Here, n is the size of the Herbrand base G:

n = |G| = |P| · |C|² + 1

where P and C are the sets of predicates and constants. The other major source of memory consumption is the calculation of the intermediate vectors Y1, Y2, and Z. These require the following number of floats:

Σ_{i=1}^{|Pi|} |cl(τ^1_i)| · |cl(τ^2_i)| · n

Appendix F. Why Not Use a Simpler, Smaller Model?

A natural question to ask about the model presented here is: why make everything so complicated? Instead of having a matrix of weights Wp ∈ R^{|cl(τ^1_p)| × |cl(τ^2_p)|} for every pair of rule templates (τ^1_p, τ^2_p), why not have a vector of weights Wτ ∈ R^{|cl(τ)|} for each individual rule template τ? Let F^j_τ be the valuation function for the j-th clause of cl(τ).
Let c^{τ,j}_t be the result of applying F^j_τ to a_t:

c^{τ,j}_t = F^j_τ(a_t)

Now we generate weighted sums b^τ_t for each time-step t and template τ in the program template Π:

b^τ_t = Σ_j c^{τ,j}_t · e^{Wτ[j]} / Σ_{j'} e^{Wτ[j']}

Finally, define a_{t+1} in terms of a_t and the b^τ_t:

a_{t+1} = f_amalgamate(a_t, max_{τ ∈ Π} b^τ_t)

In the memory-intensive ilp model, the b^p_t were disjoint for different predicates p. But now, in this simpler model, some of the b^τ_t may produce conclusions involving the same predicate head. The various b^τ_t are no longer disjoint, so we can no longer add them together. Instead, we must take the element-wise max. This simpler model was, in fact, the first model we tried. We only moved away from it, with reluctance, when we found it incapable of escaping local minima on harder tasks. The problem with this simpler, smaller model is that if there are two rule templates both defining the same intensional predicate, then the conclusions of the two rules will over-write each other when they are combined together. In the memory-intensive ilp model, the result of applying the clause-pairs for one head predicate is disjoint from the results of applying clause-pairs whose head is a different predicate, so the results can simply be added. This means no information is lost when amalgamating the results from different clauses. In the simpler model, information is lost when taking the element-wise max, impacting the system's ability to learn programs involving multiple clauses.

Appendix G. The 20 Benchmark ILP Experiments

We describe our twenty benchmark experiments in detail.

G.1 Predecessor

The aim of this task is to learn the predecessor relation from examples. The language contains the monadic predicate zero and the successor relation succ.
The background knowledge is the set of basic arithmetic facts defining the zero predicate and succ relation:

B = {zero(0), succ(0, 1), succ(1, 2), succ(2, 3), ...}

The positive examples P are:

P = {target(1, 0), target(2, 1), target(3, 2), ...}

In all these examples, target is the name of the target predicate we are trying to learn. In this case, target = predecessor. The negative examples are

N = {target(X, Y) | (X, Y) ∈ {0, ..., 9}²} − P

The solution found by ilp is:

target(X, Y) ← succ(Y, X)

G.2 Even / Odd

The aim of this task is to learn the even predicate from examples. Again, the language contains the monadic predicate zero and the successor relation succ, and the same background facts as in the previous example above. This task requires an additional auxiliary monadic predicate pred1. The positive examples are:

P = {target(0), target(2), target(4), ...}

Negative examples are:

N = {target(1), target(3), target(5), ...}

One solution found by ilp is:

target(X) ← zero(X)
target(X) ← pred1(Y), succ(Y, X)
pred1(X) ← target(Y), succ(Y, X)

Here, target and pred1 are mutually recursive. Note that the invented predicate pred1(X) is true if X is odd.

G.3 Even / Succ2

This example is described in Section 5.3.1 above.

G.4 Less-Than

The aim of this task is to learn the less-than relation. The language and background axioms are the same as in the previous examples. All integers are between 0 and 9. Positive examples of the target predicate are:

P = {target(X, Y) | X < Y}

Negative examples are:

N = {target(X, Y) | X ≥ Y}

One solution found by ilp is:

target(X, Y) ← succ(X, Y)
target(X, Y) ← target(X, Z), target(Z, Y)

This experiment is described in Section 5.3.3. This experiment is described in Section 5.3.3. The task is to learn the member relation on lists, where member(X, Y) holds if X is an element of list Y.
The elements of a list are encoded using two relations:

cons(X, Y): the node after X is node Y; we terminate lists with the null node 0
value(X, Y): the value of node X is Y

We train on two separate (B, P, N) triples. In the first, the list is [4, 3, 2, 1] and the positive examples are:

target(4, [4, 3, 2, 1])
target(3, [4, 3, 2, 1])
target(2, [4, 3, 2, 1])
target(1, [4, 3, 2, 1])
target(3, [3, 2, 1])
target(2, [3, 2, 1])
...

The negative examples are all other ground atoms involving target. In the second (B, P, N) triple, the list is [2, 3, 2, 4] and the positive examples are:

target(2, [2, 3, 2, 4])
target(3, [2, 3, 2, 4])
target(4, [2, 3, 2, 4])
target(3, [3, 2, 4])
target(2, [3, 2, 4])
target(4, [3, 2, 4])
...

One solution found by ilp is:

member(X, Y) ← value(Y, X)
member(X, Y) ← cons(Y, Z), member(X, Z)

The task is to learn the length relation, where length(X, Y) is true if the length of list X is Y. The representation of lists is the same as in the experiment above. To solve this task, we need at least one auxiliary relation pred1. Again, we train on two separate (B, P, N) triples. In the first, the list is [1, 2, 3] and the positive examples are:

target([1, 2, 3], 3)
target([2, 3], 2)
target([3], 1)

The negative examples are all ground atoms involving target except those in the positive set. In the second triple, the list is [3, 2, 1] and the positive examples are:

target([3, 2, 1], 3)
target([2, 1], 2)
target([1], 1)

One of the solutions found by ilp is:

target(X, X) ← zero(X)
target(X, Y) ← cons(X, Z), pred1(Z, Y)
pred1(X, Y) ← target(X, Z), succ(Z, Y)

The task here is to learn the son-of relation from various facts about a family tree involving the relations father-of, brother-of, and sister-of. The task requires at least one auxiliary monadic predicate pred1. There is a set of constants representing people: a, b, c, ...
The background facts are:

{father(a, b), father(a, c), father(d, e), father(d, f), father(g, h), father(g, i), brother(b, c), brother(c, b), brother(e, f), sister(f, e), sister(h, i), sister(i, h)}

Note that the facts about sister-of are irrelevant to this problem. The positive examples are:

P = {target(b, a), target(c, a), target(e, d)}

The negative examples are a subset of all ground atoms involving the target predicate that are not in P. One of the solutions found by ilp is:

target(X, Y) ← father(Y, X), pred1(X)
pred1(X) ← brother(X, Y)
pred1(X) ← father(X, Y)

Here, the invented predicate pred1 defines the is-male property: X is male if X is the brother of someone, or if X is the father of someone. The clause for target reads: X is the son of Y if Y is the father of X and X is male.

G.10 Grandparent

The task here is to learn the grandparent relation from various facts involving the father-of and mother-of relations. The problem uses an auxiliary relation pred1. There is a set of constants representing people: a, b, c, ... The background facts are:

{mother(i, a), father(a, b), father(a, c), father(b, d), father(b, e), mother(c, f), mother(c, g), mother(f, h)}

Positive examples are:

P = {target(i, b), target(i, c), target(a, d), target(a, e), target(a, f), target(a, g), target(c, h)}

Again, the negative examples are all ground atoms involving target except those in the positive set. One solution found by ilp is:

target(X, Y) ← pred1(X, Z), pred1(Z, Y)
pred1(X, Y) ← father(X, Y)
pred1(X, Y) ← mother(X, Y)

Note that the invented predicate pred1 here defines the parent-of relation. This is the same solution found in Muggleton et al. (2014).

G.11 Husband

The task here is to learn the husband-of relation from various facts about family relations involving the relations father-of, daughter-of, and brother-of. The atomic facts were taken from the European family-tree dataset (Wang et al., 2015).
One of the solutions found by ilp is:

target(X, Y) ← father(X, Z), daughter(Z, Y)

The task here is to learn the uncle-of relation from various facts about family relations including father-of, mother-of, and brother-of. Again, the background facts were taken from the European family-tree dataset (Wang et al., 2015). The task requires one auxiliary relation pred1. One of the solutions found by ilp is:

target(X, Y) ← brother(X, Z), pred1(Z, Y)
pred1(X, Y) ← father(X, Y)
pred1(X, Y) ← mother(X, Y)

The invented predicate pred1 here defines the is-parent-of relation.

G.13 Relatedness

The task here is to learn the related relation from facts about family relations involving the parent-of relation. The background facts are:

parent(a, b)
parent(a, c)
parent(c, e)
parent(c, f)
parent(d, c)
parent(g, h)

The idea is that target(X, Y) is true if there is some (undirected) path between X and Y. Positive examples are:

P = {target(a, b), target(a, c), target(a, e), target(a, f), target(f, a), target(a, a), target(d, b), target(h, g)}

Negative examples are:

N = {target(g, a), target(a, h), target(e, g), target(g, b)}

One of the solutions found by ilp is:

target(X, Y) ← pred1(X, Y)
target(X, Y) ← pred1(X, Z), target(Z, Y)
pred1(X, Y) ← parent(X, Y)
pred1(X, Y) ← parent(Y, X)

G.14 Father

Figure 8 shows part of a family tree centered on Louis VII and Henry VIII. The task is to learn the father-of relation given this background data and just two positive examples:

Louis VII is the father of Phillip II
Henry VIII is the father of Elizabeth I

[Figure 8: The Background B for the task in Section G.14: a family tree involving Henry VIII, Anne Boleyn, Elizabeth I, Louis VII, Adele of Champagne, and others, connected by married-to, brother-of, and aunt-of edges.]

All other ground atoms involving the father-of relation are assumed to be false and are added to the negative examples.
Note that some of the background data is irrelevant to the task at hand: we could solve this task without any data about brothers or aunts. ilp has to sift through the background data, finding a rule that entails the positive examples without entailing any of the negative examples. Formally, the language L for this task is:

Pe: {husband/2, mother/2, brother/2, aunt/2}
Pi: {target/2}

The set C of constants contains Louis VII, Henry VIII, etc. Here, target is the father-of relation we are trying to learn. For validation, we give the system fictitious family-tree data:

George Donatus is married to Cecilie of Mountbatten
Cecilie of Mountbatten is the mother of Louis

The positive example for validation is the single atom stating that George Donatus is the father of Louis. The negative examples are all other ground atoms whose predicate is target. In this simple task, ilp always finds the correct solution. It finds the program:

target(X, Y) ← husband(X, Z), mother(Z, Y)

In other words, if X is the husband of some Z, and Z is the mother of Y, then X is the father of Y.

G.15 Undirected Edge

A graph is represented by a set of edge(X, Y) atoms, each representing that there is an edge from node X to node Y. The task is to learn the undirected-edge relation, which is true of X and Y if there is an edge, in either direction, between X and Y. We train on two separate (B, P, N) triples. In the first, the background facts are:

{edge(a, b), edge(b, d), edge(c, c)}

The positive examples are:

P = {target(a, b), target(b, a), target(b, d), target(d, b), target(c, c)}

In the second triple, the background facts are:

{edge(a, b), edge(c, d)}

The positive examples are:

P = {target(a, b), target(b, a), target(c, d), target(d, c)}

The solution found by ilp is:

target(X, Y) ← edge(X, Y)
target(X, Y) ← edge(Y, X)

G.16 Adjacent to Red

In this example, the nodes of the graph are coloured either green or red. The task is to learn the predicate "is adjacent to a red node".
As well as the edge relation, we have a colour relation on nodes, where colour(X, C) means node X has colour C. We also have two monadic predicates red(C) and green(C), true of colour C if it is red or green respectively. The task requires at least one auxiliary monadic predicate pred1. We train on two separate (B, P, N) triples. In the first, the background facts are:

{edge(a, b), edge(b, a), edge(c, d), edge(c, e), edge(d, e), colour(a, red), colour(b, green), colour(c, red), colour(d, red), colour(e, green)}

The positive examples are:

P = {target(b), target(c)}

In the second triple, the background facts are:

{edge(b, c), edge(d, c), colour(a, red), colour(b, green), colour(c, red), colour(d, red), colour(e, green)}

The positive examples are:

P = {target(b), target(d)}

One solution found by ∂ILP is:

target(X) ← edge(X, Y), pred1(Y)
pred1(X) ← colour(X, Y), red(Y)

G.17 Two Children

The task is to learn the predicate has-at-least-two-children. As well as the edge predicate, this task uses the not-equals relation neq. In future work, we plan to add negation-as-failure to the language, at which point it will not be necessary to include the neq relation explicitly. This task requires an auxiliary relation pred1. The background facts always include:

B = {neq(X, Y) | X ≠ Y}

We train on two separate (B, P, N) triples. In the first, the background facts also include:

{edge(a, b), edge(a, c), edge(b, d), edge(c, d), edge(c, e), edge(d, e)}

The positive examples are:

P = {target(a), target(c)}

In the second triple, the background facts are:

{edge(a, b), edge(b, c), edge(b, d), edge(c, c), edge(c, e), edge(d, e)}

The positive examples are:

P = {target(b), target(c)}

One solution found by ∂ILP is:

target(X) ← edge(X, Y), pred1(X, Y)
pred1(X, Y) ← edge(X, Z), neq(Z, Y)

G.18 Graph Colouring

A graph is well-coloured if no two adjacent nodes are the same colour.
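This definition can be stated directly as a predicate over coloured graphs. The following one-line Python check (a hypothetical helper for illustration, not part of the task specification) makes it concrete:

```python
# A graph is well-coloured iff no edge joins two nodes of the same colour.
def is_well_coloured(edges, colour):
    """edges: list of (node, node) pairs; colour: dict mapping node -> colour."""
    return all(colour[x] != colour[y] for x, y in edges)

edges = [("a", "b"), ("b", "c")]
good = {"a": "green", "b": "red", "c": "green"}   # alternating colours
bad = {"a": "green", "b": "red", "c": "red"}      # b and c clash
ok = is_well_coloured(edges, good)
clash = is_well_coloured(edges, bad)
```

Note that under this reading a graph with a self-loop edge(c, c) is never well-coloured, since c trivially shares its own colour.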
The task here is to learn the is-bad-node predicate that is true of a node X if X is adjacent to a node of the same colour. As well as the edge relation, we have a colour relation on nodes, where colour(X, C) means node X has colour C. The task requires at least one auxiliary relation pred1. We train on two separate (B, P, N) triples. In the first, the background facts are:

{edge(a, b), edge(b, c), edge(b, d), edge(c, e), edge(e, f), colour(a, green), colour(b, red), colour(c, green), colour(d, green), colour(e, red), colour(f, red)}

The positive example is:

P = {target(e)}

In the second triple, the background facts are:

{edge(a, b), edge(b, a), edge(c, b), edge(f, c), edge(d, c), edge(e, d), colour(a, green), colour(b, green), colour(c, red), colour(d, green), colour(e, green), colour(f, red)}

The positive examples are:

P = {target(a), target(b), target(e), target(f)}

One solution found by ∂ILP is:

target(X) ← edge(X, Y), pred1(X, Y)
pred1(X, Y) ← colour(X, Z), colour(Y, Z)

Here, the invented pred1 relation is true of X and Y if X and Y are both the same colour.

G.19 Graph Connectedness

In this graph task, the problem is to learn the connected(X, Y) relation that is true if there is some sequence of edge transitions connecting X and Y. The background facts are:

{edge(a, b), edge(b, c), edge(c, d), edge(b, a)}

The positive examples are:

P = {target(a, b), target(b, c), target(c, d), target(b, a), target(a, c), target(a, d), target(a, a), target(b, d), target(b, b)}

The solution ∂ILP finds is:

target(X, Y) ← edge(X, Y)
target(X, Y) ← edge(X, Z), target(Z, Y)

This is the transitive closure of the edge relation. This is the same relation that is synthesised as the recursive auxiliary predicate needed for the Graph Cycles task described in Section 5.3.2.

G.20 Graph Cycles

This experiment is described in detail in Section 5.3.2.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.
S., Davis, A., Dean, J., Devin, M., et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Andrychowicz, M., & Kurach, K. (2016). Learning efficient algorithms with hierarchical attentive memory. arXiv preprint arXiv:1602.03218.

Bader, S., Garcez, A. S. d., & Hitzler, P. (2005). Computing first-order logic programs by fibring artificial neural networks. In FLAIRS Conference, pp. 314–319.

Bader, S., Hitzler, P., & Hölldobler, S. (2008). Connectionist model generation: A first-order approach. Neurocomputing, 71(13), 2420–2432.

Besold, T., Muggleton, S., Schmid, U., Tamaddoni-Nezhad, A., & Zeller, C. (2016). How does predicate invention affect human comprehensibility?

Bunel, R. R., Desmaison, A., Mudigonda, P. K., Kohli, P., & Torr, P. (2016). Adaptive neural compilation. In Advances in Neural Information Processing Systems, pp. 1444–1452.

Calì, A., Gottlob, G., & Lukasiewicz, T. (2012). A general Datalog-based framework for tractable query answering over ontologies. Web Semantics: Science, Services and Agents on the World Wide Web, 14, 57–83.

Chikara, N., Koshimura, M., Fujita, H., & Hasegawa, R. (2015). Inductive logic programming using a MaxSAT solver.

Corapi, D., Russo, A., & Lupu, E. (2010). Inductive logic programming as abductive search. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 7. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Corapi, D., Russo, A., & Lupu, E. (2011). Inductive logic programming in answer set programming. In International Conference on Inductive Logic Programming, pp. 91–97. Springer.

Cropper, A., & Muggleton, S. H. (2016). Learning higher-order logic programs through abstraction and invention. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI 2016). IJCAI.

Cropper, A., Tamaddoni-Nezhad, A., & Muggleton, S. H. (2015).
Meta-interpretive learning of data transformation programs. In International Conference on Inductive Logic Programming, pp. 46–59. Springer.

De Raedt, L., Frasconi, P., Kersting, K., & Muggleton, S. (2008). Probabilistic Inductive Logic Programming: Theory and Applications, Vol. 4911 of Lecture Notes in Artificial Intelligence.

De Raedt, L., & Kersting, K. (2008). Probabilistic inductive logic programming. In Probabilistic Inductive Logic Programming, pp. 1–27. Springer.

De Raedt, L., Kersting, K., Natarajan, S., & Poole, D. (2016). Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2), 1–189.

Eisner, J., Goldlust, E., & Smith, N. A. (2004). Dyna: A declarative language for implementing dynamic programs. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, p. 32. Association for Computational Linguistics.

Esteva, F., & Godo, L. (2001). Monoidal t-norm based logic: towards a logic for left-continuous t-norms. Fuzzy Sets and Systems, 124(3), 271–288.

Feser, J. K., Brockschmidt, M., Gaunt, A. L., & Tarlow, D. (2016). Neural functional programming. arXiv preprint arXiv:1611.01988.

Feser, J. K., Chaudhuri, S., & Dillig, I. (2015). Synthesizing data structure transformations from input-output examples. In ACM SIGPLAN Notices, Vol. 50, pp. 229–239. ACM.

França, M. V., Zaverucha, G., & Garcez, A. S. d. (2014). Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94(1), 81–104.

Garcez, A. d., Besold, T. R., de Raedt, L., Földiák, P., Hitzler, P., Icard, T., Kühnberger, K.-U., Lamb, L. C., Miikkulainen, R., & Silver, D. L. (2015). Neural-symbolic learning and reasoning: contributions and challenges. In Proceedings of the AAAI Spring Symposium on Knowledge Representation and Reasoning: Integrating Symbolic and Neural Approaches, Stanford.

Garcez, A. S.
d., Gabbay, D. M., & Lamb, L. C. (2014). A neural cognitive model of argumentation with application to legal inference and decision making. Journal of Applied Logic, 12(2), 109–127.

Garcez, A. S. d., Lamb, L. C., & Gabbay, D. M. (2009). Connectionist modal logic. Neural-Symbolic Cognitive Reasoning, 55–74.

Garcez, A. S. d., Gabbay, D. M., Ray, O., & Woods, J. (2007). Abductive reasoning in neural-symbolic systems. Topoi, 26(1), 37–49.

Garcez, A. S., Lamb, L. C., & Gabbay, D. M. (2008). Neural-Symbolic Cognitive Reasoning. Springer Science & Business Media.

Gaunt, A. L., Brockschmidt, M., Singh, R., Kushman, N., Kohli, P., Taylor, J., & Tarlow, D. (2016). TerpreT: A probabilistic programming language for program induction. arXiv preprint arXiv:1608.04428.

Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., et al. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471–476.

Grefenstette, E., Hermann, K. M., Suleyman, M., & Blunsom, P. (2015). Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pp. 1828–1836.

Guillame-Bert, M., Broda, K., & Garcez, A. d. (2010). First-order logic learning in artificial neural networks. In The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE.

Hölldobler, S., & Kalinke, Y. (1994). Towards a new massively parallel computational model for logic programming. In ECAI'94 Workshop on Combining Symbolic and Connectionist Processing. Citeseer.

Hölldobler, S., Kalinke, Y., & Störr, H.-P. (1999). Approximating the semantics of logic programs by recurrent neural networks. Applied Intelligence, 11(1), 45–58.

Joulin, A., & Mikolov, T. (2015). Inferring algorithmic patterns with stack-augmented recurrent nets.
In Advances in Neural Information Processing Systems, pp. 190–198.

Kaiser, L. (2015). Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228.

Kersting, K., De Raedt, L., & Raiko, T. (2006). Logical hidden Markov models. Journal of Artificial Intelligence Research (JAIR), 25, 425–456.

Khot, T., Natarajan, S., Kersting, K., & Shavlik, J. (2015). Gradient-based boosting for statistical relational learning: the Markov logic network and missing data cases. Machine Learning, 100(1), 75–100.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.

Landwehr, N., Kersting, K., & De Raedt, L. (2005). nFOIL: Integrating naïve Bayes and FOIL. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), pp. 795–800.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

McCarthy, J. (1988). Epistemological challenges for connectionism. Behavioral and Brain Sciences, 11(01), 44–44.

Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, 13(3-4), 245–286.

Muggleton, S., & De Raedt, L. (1994). Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19, 629–679.

Muggleton, S. H., Lin, D., Pahlavi, N., & Tamaddoni-Nezhad, A. (2014). Meta-interpretive learning: application to grammatical inference. Machine Learning, 94(1), 25–49.

Muggleton, S. H., Lin, D., & Tamaddoni-Nezhad, A. (2015). Meta-interpretive learning of higher-order dyadic Datalog: Predicate invention revisited. Machine Learning, 100(1), 49–73.

Muggleton, S. H., Santos, J. C. A., & Tamaddoni-Nezhad, A. (2008). TopLog: ILP using a logic program declarative bias. In International Conference on Logic Programming, pp. 687–692. Springer.
Natarajan, S., Khot, T., Kersting, K., Gutmann, B., & Shavlik, J. (2012). Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning, 86(1), 25–56.

Neelakantan, A., Le, Q. V., & Sutskever, I. (2015). Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834.

Quinlan, J. R., & Cameron-Jones, R. M. (1995). Induction of logic programs: FOIL and related systems. New Generation Computing, 13(3-4), 287–312.

Reed, S., & de Freitas, N. (2015). Neural programmer-interpreters. arXiv preprint arXiv:1511.06279.

Riedel, S., Bošnjak, M., & Rocktäschel, T. (2016). Programming with a differentiable Forth interpreter. arXiv preprint arXiv:1605.06640.

Rocktäschel, T., & Riedel, S. (2016). Learning knowledge base inference with neural theorem provers.

Serafini, L., & Garcez, A. (2016). Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv preprint arXiv:1606.04422.

Steinbach, B., & Kohut, R. (2002). Neural networks: a model of Boolean functions. In Boolean Problems, Proceedings of the 5th International Workshop on Boolean Problems, pp. 223–240.

Sun, G., Giles, C. L., Chen, H., & Lee, Y. (1998). The neural network pushdown automaton: Model, stack and learning simulations. Tech. rep.

Tamaddoni-Nezhad, A., & Muggleton, S. (2009). The lattice structure and refinement operators for the hypothesis space bounded by a bottom clause. Machine Learning, 76(1), 37–72.

Wang, W. Y., Mazaitis, K., & Cohen, W. W. (2015). A soft version of predicate invention based on structured sparsity. In IJCAI, pp. 3918–3924. Citeseer.

Yang, F., Yang, Z., & Cohen, W. W. (2016). A differentiable approach to inductive logic programming.