Neuro-Symbolic Hierarchical Rule Induction

Claire Glanois 1, Zhaohui Jiang 2, Xuening Feng 2, Paul Weng 2, Matthieu Zimmer 2, Dong Li 3, Wulong Liu 3, Jianye Hao 3 4

Abstract

We propose Neuro-Symbolic Hierarchical Rule Induction (HRI), an efficient, interpretable neuro-symbolic model to solve Inductive Logic Programming (ILP) problems. In this model, which is built from a pre-defined set of meta-rules organized in a hierarchical structure, first-order rules are invented by learning embeddings to match facts and body predicates of a meta-rule. To instantiate HRI, we specifically design an expressive set of generic meta-rules, and demonstrate that they generate a substantial fragment of Horn clauses. As a differentiable model, HRI can be trained both via supervised learning and reinforcement learning. To converge to interpretable rules, we inject a controlled noise to avoid local optima and employ an interpretability-regularization term. We empirically validate our model on various tasks (ILP, Visual Genome, reinforcement learning) against relevant state-of-the-art methods, including traditional ILP methods and neuro-symbolic models.

1 IT University of Copenhagen, Denmark. 2 UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China. 3 Huawei Noah's Ark Lab, China. 4 School of Computing and Intelligence, Tianjin University. Correspondence to: .

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Research on neuro-symbolic approaches (Santoro et al., 2017; Manhaeve et al., 2018; Dai et al., 2019; d'Avila Garcez & Lamb, 2020; Amizadeh et al., 2020) has become very active recently and has been fueled by the successes of deep learning (Krizhevsky et al., 2017) and the recognition of its limitations (Marcus, 2018). At a high level, these approaches aim at combining the strengths of deep learning (e.g., high expressivity, differentiable learning) and symbolic methods (e.g., interpretability, generalizability) while addressing their respective limitations (e.g., brittleness for deep learning, scalability for symbolic methods).

In this paper, we focus on inductive logic programming (ILP) tasks (Cropper & Dumančić, 2021). The goal in ILP is to find a first-order logic (FOL) formula that explains positive and negative examples given some background knowledge also expressed in FOL. In contrast to traditional ILP methods, which are based on combinatorial search over the space of logic programs, neuro-symbolic methods (see Section 2) often work on a continuous relaxation of this discrete space.

For tackling ILP problems, we propose a new neuro-symbolic hierarchical model, denoted HRI, which is built upon a set of meta-rules. For a more compact, yet expressive, representation, we introduce the notion of proto-rule, which encompasses multiple meta-rules. The valuations of predicates are computed via a soft unification between proto-rules and predicates using learnable embeddings. Like most ILP methods, our model can provide an interpretable solution and, being independent of the number of objects, manifests some combinatorial generalization skills (1). In contrast to other approaches, HRI is also independent of the number of predicates: this number may vary during training and testing (2). Our model can also allow recursive definitions, if needed. Since it is based on embeddings, it is able to generalize and is particularly suitable for multi-task ILP problems.

(1) Trained on smaller instances, it can generalize to larger ones.
(2) In the case some input predicates only appear in the held-out dataset used for the final evaluation, our model can still predict thanks to the pre-trained embeddings, as mentioned in the Visual Genome experiment in Appendix C.
Moreover, any semantic or visual priors on concepts can be leveraged through the embedding initialization.

The design of proto-rules, which crucially restricts the hypothesis space, embodies a well-known trade-off between efficiency and expressivity. Relying on minimal sets of meta-rules for rule induction models has been shown to improve both learning time and predictive accuracy (Cropper & Muggleton, 2014; Fonseca et al., 2004). For our model to be both adaptive and efficient, we initially designed an expressive and minimal set R0 of proto-rules. While most neuro-symbolic approaches do not formally discuss the expressivity of their models, we provide a theoretical analysis to characterize the expressivity of R0. Moreover, in contrast to traditional ILP work based on combinatorial search (Cropper & Tourret, 2020), we found that a certain redundancy in the proto-rules may experimentally help learning in neuro-symbolic approaches. Therefore, as a replacement of R0, we propose an extended set R' of proto-rules.

We validate our model using proto-rules in R' on classic ILP tasks. Despite using the same set of proto-rules, our model is competitive with other approaches that may require specifying different meta-rules for different tasks to work, which, unsatisfactorily, requires a priori knowledge of the solution. In order to exploit the embeddings of our model and demonstrate its scalability, we separately evaluate our approach on a large domain, GQA (Hudson & Manning, 2019b), extracted from Visual Genome (Krishna et al., 2017). Our model outperforms other methods such as NLIL (Yang & Song, 2020) on those tasks. We also empirically validate all our design choices.

Contributions  Our contributions can be summarized as follows: (1) a hierarchical model for embedding-based rule induction (see Section 4), (2) an expressive set of generic proto-rules and a theoretical analysis (see Section 4.2), (3) an interpretability-oriented training method (see Section 5), (4) an empirical validation on various domains (see Section 6).

2. Related Work

Classic ILP methods  Classic symbolic ILP methods (Quinlan, 1990; Muggleton, 1995; Law et al., 2018; Cropper & Morel, 2021) aim to learn logic rules from examples by a direct search in the space of logic programs. To scale to large knowledge graphs, recent work (Galárraga et al., 2013; Omran et al., 2018) learns predicate and entity embeddings to guide and prune the search. These methods typically rely on carefully hand-designed and task-specific templates in order to narrow down the hypothesis space. Their drawbacks include scalability and a potential struggle with noisy data.

Differentiable ILP methods  By framing an ILP task as a classification problem, these approaches, based on a continuous relaxation of the logical reasoning process, can learn via gradient descent. The learnable weights may be assigned to rules (Evans & Grefenstette, 2018), although this is combinatorially less attractive, or to the membership of predicates in a clause (Payani & Fekri, 2019; Zimmer et al., 2021).
However, these models are still hard to scale and have a constrained implementation, e.g., task-specific template-based variable assignments or a limited number of predicates and objects. In another direction, Dong et al. (2019) learn rules as shallow multi-layer perceptrons (MLP), thus losing interpretability. In contrast to these neuro-symbolic methods, our model HRI ascribes embeddings to predicates, and the membership weights derive from a similarity score, which may be beneficial in rich and multi-task learning domains.

Multi-hop reasoning methods  In multi-hop reasoning methods for knowledge graphs (KG) (Guu et al., 2015; Yang et al., 2017; Yang & Song, 2020; Ren et al., 2021), predicate invention is understood as finding a relational path (Yang et al., 2017) or a combination of paths (Yang & Song, 2020) in the KG; a path amounts to a chain-like first-order rule. However, although computationally efficient, their restricted expressiveness limits their performance and applicability.

Embedding-based models  In the context of KG completion or rule mining notably, many previous studies learn embeddings of both binary predicates and entities. Entities are typically attached to low-dimensional vectors, while relations are understood as bilinear or linear operators applied on entities, possibly involving some non-linear operators (Bordes et al., 2013; Lin et al., 2015; Socher et al., 2013; Nickel et al., 2011; Guu et al., 2015). These methods are either limited in their reasoning power or suffer from similar problems as multi-hop reasoning. Our work is closely related to that of Campero et al. (2018), whose representation model is inspired by Neural Theorem Provers (Rocktäschel & Riedel, 2017), attaching vector embeddings to predicates and rules. Their major drawback, like most classic or differentiable ILP methods, is the need for a carefully hand-designed template set for each ILP task. In contrast, we extend their model to a hierarchical structure with an expressive set of meta-rules, which can be used for various tasks. We further demonstrate our approach in the multi-task setting.

3. Background

We first define first-order logic and inductive logic programming (ILP). Then, we recall some approaches for predicate invention, an essential aspect of ILP.

Notations  Sets are denoted in calligraphic font (e.g., C). Constants (resp. variables) are denoted in lowercase (resp. uppercase). Predicates have the first letter of their names capitalized (e.g., P or Even), their corresponding atoms are denoted sans serif (e.g., P), while predicate variables are denoted in roman font (e.g., P). Integer hyperparameters are denoted n with a subscript (e.g., nL).

3.1. Inductive Logic Programming

First-order logic (FOL) is a formal language that can be defined with a set of constants C, a set of predicates P, a set of functions, and a set of variables. Constants correspond to the objects of discourse, n-ary predicates can be seen as mappings from C^n to the Boolean set B = {True, False}, n-ary functions are mappings from C^n to the set of constants C, and variables correspond to unspecified constants. As customary in most ILP work, we focus on a function-free fragment (3) of FOL. An atom is a predicate applied to a tuple of arguments, either constants or variables; it is called a ground atom if all its arguments are constants.
Well-formed FOL formulas are defined recursively as combinations of atoms with logic connectives (e.g., negation, conjunction, disjunction, implication) and existential or universal quantifiers (∃, ∀). Clauses are a special class of formulas that can be written as a disjunction of literals, which are atoms or their negations. Horn clauses are clauses with at most one positive literal, while definite Horn clauses contain exactly one. In the context of ILP, definite clauses play an important role since they can be re-written as rules:

H ← B1 ∧ B2 ∧ ... ∧ Bk   (1)

where H is called the head atom and the Bi's the body atoms. The variables appearing in the head atom are instantiated with a universal quantifier, while the other variables in the body atoms are existentially quantified (4).

Inductive Logic Programming (ILP) aims at finding a set of rules such that all the positive examples and none of the negative ones of a target predicate are entailed by both these rules and some background knowledge expressed in FOL. The background knowledge may contain ground atoms, called facts, but also some given rules. Forward chaining can be used to chain rules together to deduce a target predicate valuation from some facts. Consider the Even-Succ task as an example. Given background facts {Zero(0), Succ(0, 1), Succ(1, 2)} and rules:

Even(X) ← Zero(X)
Even(X) ← Even(Y) ∧ Aux(Y, X)
Aux(X, Y) ← Succ(X, Z) ∧ Succ(Z, Y)   (2)

we can easily deduce the facts {Aux(0, 2), Even(2)} by unifying the body atoms of the rules with the facts; repeating this process, we can iteratively infer all even numbers.

The predicates appearing in the background facts are called input predicates. Any other predicates, apart from the target predicate, are called auxiliary predicates. A predicate can be extensional, defined by a set of ground atoms, or intensional, defined by a set of clauses.

(3) A fragment of a logical language is a subset of this language obtained by imposing syntactical restrictions on it.
(4) This rule can be read as: if, for a certain grounding, all the body atoms are true, then the head atom is also true.
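To make the forward-chaining procedure concrete, here is a minimal Python sketch (ours, not the paper's implementation; the name `step` and the extra Succ facts are our own) that applies the three rules of (2) to a set of facts until a fixpoint is reached:

```python
# Minimal forward chaining on the Even-Succ example (illustrative sketch).
# Facts are tuples (Predicate, args...).
facts = {("Zero", 0), ("Succ", 0, 1), ("Succ", 1, 2), ("Succ", 2, 3), ("Succ", 3, 4)}

def step(facts):
    """Apply the three rules of (2) once and return the newly derived facts."""
    new = set()
    for (p1, *a1) in facts:
        for (p2, *a2) in facts:
            # Aux(X, Y) <- Succ(X, Z) and Succ(Z, Y)
            if p1 == "Succ" and p2 == "Succ" and a1[1] == a2[0]:
                new.add(("Aux", a1[0], a2[1]))
            # Even(X) <- Even(Y) and Aux(Y, X)
            if p1 == "Even" and p2 == "Aux" and a1[0] == a2[0]:
                new.add(("Even", a2[1]))
        # Even(X) <- Zero(X)
        if p1 == "Zero":
            new.add(("Even", a1[0]))
    return new - facts

while True:  # iterate the immediate-consequence step to a fixpoint
    new = step(facts)
    if not new:
        break
    facts |= new

print(sorted(f for f in facts if f[0] == "Even"))
# [('Even', 0), ('Even', 2), ('Even', 4)]
```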
3.2. Predicate Invention and Meta-Rules

One important part of solving an ILP problem is predicate invention, which consists in creating auxiliary predicates, intensionally defined, which would ultimately help define the target predicate. Without language bias, the set of possible rules to consider grows exponentially in the number of body atoms and in the maximum allowed arity of the predicates. In previous work, this bias is often enforced using meta-rules, also called rule templates, which restrict the hypothesis space by imposing syntactic constraints. The hypothesis space is generated by the successive applications of the meta-rules on the predicate symbols from the background knowledge. A meta-rule corresponds to a second-order clause with predicate variables. For instance, the chain meta-rule is:

H(X, Y) ← B1(X, Z) ∧ B2(Z, Y),   (3)

where H, B1, and B2 correspond to predicate variables.

LRI Model  We recall the differentiable approach to predicate invention proposed by Campero et al. (2018). Their model, Logical Rule Induction (LRI), learns an embedding for each predicate and each meta-rule atom; then, predicates and atoms of meta-rules are matched via a soft unification technique. For any predicate P, let θP ∈ R^d denote its d-dimensional (5) embedding. The embeddings attached to a meta-rule like (3), with one head (H) and two body (B1, B2) predicate variables, are (θH, θB1, θB2) ∈ R^d × R^d × R^d.

The soft unification is based on the cosine as a measure of the degree of similarity between the embeddings of predicates and meta-rule atoms. For example, αP,H = cos(θP, θH) ∈ [0, 1] is the unification score between predicate P and head predicate variable H: a higher unification score indicates a higher belief that P is the correct predicate for H. Similarly, unification scores are computed for any pair of predicate and meta-rule atom. In this model, a ground atom P = P(s, o) is represented as (θP, s, o, vP), where θP is the embedding of predicate P, s and o are respectively the subject and object constants, and vP ∈ [0, 1] is a valuation measuring the belief that P is true. In ILP tasks, the valuations are initialized to 1 for background facts and are otherwise set to 0, reflecting a closed-world assumption.

For better clarity, let us use meta-rule (3) as a running example to present how one inference step is performed. For an intensional predicate P, the valuation of a corresponding ground atom P = P(x, y) with respect to a meta-rule R of the form (3) and two ground atoms, P1 = P1(x, z) and P2 = P2(z, y), is computed as follows:

v(P, R, P1, P2) = (αP,H · αP1,B1 · αP2,B2) · (vP1 · vP2).   (4)

The term in the first brackets measures how well the predicate tuple (P, P1, P2) matches the meta-rule, while the term in the second brackets corresponds to a fuzzy AND. Note that (4), defined as a product of many terms in [0, 1], leads to an underestimation issue. The valuation of atom P with respect to meta-rule R is then computed as (6):

v(P, R) = max_{P1,P2} v(P, R, P1, P2).   (5)

Since P is matched with the heads of several meta-rules, the new valuation of P after one forward-chaining inference step is:

v(P) = max( vold(P), max_R v(P, R) ),   (6)

where vold(P) is the previous valuation of P. The max over meta-rules allows an intensional predicate to be defined as a disjunction. After nI steps of forward chaining, the model uses the binary cross-entropy as loss function to measure the difference between inferred values and target values.

(5) In Campero et al. (2018), d equals the number of predicates.
(6) The max over ground atoms is taken both over the possible existential variable (e.g., Z above), and over predicates P1, P2 ∈ P.
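As an illustration of (4)-(5), the following sketch (our own; the random embeddings stand in for learned parameters, and the paper further constrains the scores to [0, 1]) computes the soft-unified valuation of one ground atom under the chain meta-rule:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (in LRI, d equals the number of predicates)

# Learned embeddings for predicates and for the meta-rule atoms H, B1, B2.
preds = ["Succ", "Aux", "Even"]
theta = {p: rng.normal(size=d) for p in preds}
theta_H, theta_B1, theta_B2 = (rng.normal(size=d) for _ in range(3))

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Valuations of ground atoms; facts get valuation 1 (closed-world assumption).
v = {("Succ", 0, 1): 1.0, ("Succ", 1, 2): 1.0}

def valuation(P, x, y, P1, P2, z):
    """Eq. (4): soft match of (P, P1, P2) against the chain meta-rule,
    fuzzy-ANDed (product) with the body valuations."""
    match = cos(theta[P], theta_H) * cos(theta[P1], theta_B1) * cos(theta[P2], theta_B2)
    return match * v.get((P1, x, z), 0.0) * v.get((P2, z, y), 0.0)

# Eq. (5): max over body predicates and the existential variable Z.
consts = [0, 1, 2]
val = max(valuation("Aux", 0, 2, P1, P2, z)
          for P1 in preds for P2 in preds for z in consts)
print(val)  # driven by the match (Aux, Succ, Succ) with Z = 1
```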
4. Hierarchical Rule Induction

Let us introduce our model HRI and our generic meta-rule set, alongside some theoretical results about its expressivity.

4.1. Proposed Model

Building on LRI (Campero et al., 2018), our model includes several innovations, backed both by theoretical and experimental results: proto-rules, an incremental prior, a hierarchical prior, and improved model inference. The priors reflect the hierarchical and progressive structure arguably present in the formation of human conceptual knowledge.

HRI is illustrated in Figure 1, using the recursive task Even. The boxes, appearing in different layers, denote auxiliary predicates, which embody an asymmetrical disjunction of a conjunction of two atoms with a third body predicate. An arrow represents the soft unification between one body predicate and the heads of lower-level predicates, following (13); same-level predicates are also considered when recursivity is allowed, as in this example. In this figure, the fully-exposed arrows highlight one solution found by our model (mentioned in Appendix B.1), which can be written as:

Even(X) ← aux31(X)
aux31(X) ← (Zero(X) ∧ aux21(Y, X)) ∨ Zero(X)
aux21(X, Y) ← (aux21(X, Z) ∧ aux21(Z, Y)) ∨ aux11(X, Y)
aux11(X, Y) ← (Succ(X, Z) ∧ Succ(Z, Y)) ∨ False.   (7)

Figure 1. Hierarchical Model.

Proto-rules  We first define the notion of proto-rules, which implicitly correspond to sets of meta-rules. A proto-rule is an adaptive (7) meta-rule (see Appendix A for a formal definition). For instance, (3) extended as a proto-rule becomes:

H(X, Y) ← B1(X, Z) ∧ B2(Z, Y),   (8)

where the Bi's correspond to predicate variables, of arity 2 or 1, with an optional second argument. For instance, (8) can be instantiated as the following rule:

P(X, Y) ← P1(X) ∧ P2(Z, Y)   (9)

where P, P2 are binary predicates, and P1 is a unary predicate implicitly viewed as the binary predicate P̄1, where ∀z, P̄1(x, z) := P1(x). We propose a minimal and expressive set of proto-rules in Section 4.2. However, HRI can be instantiated with any set R of proto-rules. The choice of R defines the language bias used in the model.

(7) It amounts to an embedding of lower-arity predicates into the space of higher-arity predicates.

Incremental Prior  To reinforce the incremental aspect of predicate invention, in contrast to LRI, each auxiliary predicate is directly associated with a unique proto-rule defining it. This has two advantages. First, it partially addresses the underestimation issue of (4), since it amounts to setting αP,H = 1. Second, it partly reduces computational costs, since the soft unification for a rule can be computed by matching the body atoms only (8). In order to preserve expressivity (9), we can re-introduce disjunctions in the model by considering proto-rules with disjunctions in their bodies (see Section 4.2) (10); e.g., (8) could be extended to:

H(X, Y) ← (B1(X, Z) ∧ B2(Z, Y)) ∨ B3(X, Y).   (10)

(8) The max over meta-rules in (6) is not needed anymore.
(9) Identifying intensional predicates with proto-rules prevents them from having disjunctive definitions.
(10) Although such a disjunction increases the computational cost again, it can be partly parallelized in the inference.

Hierarchical Prior  A hierarchical architecture is enforced by organizing auxiliary predicates in successive layers, from layer 1 to the max layer nL, as illustrated in Figure 1. Layer 0 contains all input (extensional) predicates. For each layer l = 1, ..., nL, each proto-rule (11) in R generates one auxiliary predicate. Lastly, the target predicate is matched with the auxiliary predicates in layer nL only. With this hierarchical architecture, an auxiliary predicate P at layer ℓ can only be defined from a set P(ℓ) of predicates at a layer lower than or equal to ℓ. This set can be defined in several ways depending on whether recursivity is allowed. We consider three cases: no recursivity, iso-recursivity, and full recursivity. For the no-recursivity (resp. full-recursivity) case, P(ℓ) is defined as the set P<ℓ (resp. P≤ℓ) of predicates at a layer strictly lower than (resp. lower than or equal to) ℓ. For iso-recursivity, P(ℓ) is defined as P<ℓ ∪ {P}. Enforcing this hierarchical prior has two benefits. First, it imposes a stronger language bias, which facilitates learning. Second, it also reduces computational costs, since the soft unification does not need to consider all predicates.

(11) Alternatively, several auxiliary predicates may be generated per proto-rule per layer. For simplicity, we only generate one and control the model expressivity with only one hyperparameter nL.

Improved Model Inference  We improve LRI's inference technique with a soft unification computation that reduces the underestimation issue in (4), which is due to the product of many values in [0, 1]. For clarity's sake, we illustrate the inference with proto-rule (10), although the equations can be straightforwardly adapted to any proto-rule. One inference step in our model is formulated as follows (12):

vand = POOL_{P1,P2} (αP1,B1 · αP2,B2 · AND[vP1, vP2])
vor = OR[vand, POOL_{P3} (αP3,B3 · vP3)]
v = MERGE(vold, vor),   (11)

where v (resp. vold) denotes the new (resp. old) valuation of a grounded auxiliary predicate P. For an auxiliary predicate at layer ℓ, the POOL operation encompasses a max over the groundings compatible with P followed by a pooling performed over the predicates P1, P2, P3 ∈ P(ℓ). Indeed, in the presence of an existential quantifier as in proto-rule (10), a max is applied to eliminate the corresponding existential variable before the actual pooling over predicates. The valuation vt of the target predicate (with variable denoted Pt) is computed as a MERGE of its previous valuation vt_old and a POOL over atoms at layer nL:

vt = MERGE( vt_old, POOL_{P1} (αP1,Pt · vP1) ).   (12)

(12) It generalizes (4)-(6), under the incremental prior.

The underestimation issue is alleviated in two ways. First, the POOL operation is implemented as a sum, AND as min, OR as max, and MERGE as max. Second, the unification scores αPi,Bi based on the cosine are renormalized with a softmax transformation:

αPi,Bi = exp(cos(θPi, θBi)/τ) / Σ_{Pj} exp(cos(θPj, θBi)/τ),   (13)

where Pj ∈ P(ℓ) and the hyperparameter τ, called temperature, controls the renormalization. The score αP1,Pt can be computed in a similar way with embedding θPt. These choices have been further experimentally validated (see Appendix D).

Equation (11) implies that the computational complexity of one inference step at layer ℓ is O(|Pℓ| · |P(ℓ)|^3 · |C|^2), where Pℓ is the set of predicates at layer ℓ and P(ℓ) is the set of predicates available for defining an auxiliary predicate at layer ℓ; parallelizing the OR operation leads to a quadratic dependence in |P(ℓ)|. Its spatial complexity is O(|P| · |C|^2). The inference step described above is iterated nI times, updating the valuations of auxiliary and target predicates. Note that when recursive definitions of predicates are allowed, it could be conceivable to iterate this inference step until convergence. A nice property of this inference procedure is that it is parallelizable, which our implementation on GPU takes advantage of.
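A minimal tensorized sketch of one inference step (11) with the softmax-normalized scores (13) follows, using the choices POOL = sum, AND = min, OR = max, and MERGE = max; all names, shapes, and the random inputs are ours, not the released implementation:

```python
import torch

torch.manual_seed(0)
n_pred, n_const, d, tau = 5, 4, 16, 0.1

theta = torch.randn(n_pred, d)            # embeddings of candidate body predicates
theta_B = torch.randn(3, d)               # embeddings of the body slots B1, B2, B3
V = torch.rand(n_pred, n_const, n_const)  # valuations v_P(x, y) in [0, 1]

def unif_scores(theta_slot):
    """Eq. (13): softmax-normalized cosine similarities over candidates."""
    c = torch.nn.functional.cosine_similarity(theta, theta_slot[None], dim=-1)
    return torch.softmax(c / tau, dim=0)  # shape (n_pred,)

a1, a2, a3 = (unif_scores(theta_B[i]) for i in range(3))

# AND[v_P1, v_P2] with the existential Z eliminated by a max, as in (11):
# and_val[p1, p2, x, y] = max_z min(V[p1, x, z], V[p2, z, y])
and_val = torch.minimum(V[:, None, :, :, None], V[None, :, None, :, :]).amax(dim=3)

# POOL (sum) over predicate pairs, weighted by the unification scores.
v_and = torch.einsum("p,q,pqxy->xy", a1, a2, and_val)
v_or = torch.maximum(v_and, torch.einsum("p,pxy->xy", a3, V))  # OR with the B3 term
v_new = torch.maximum(V[0], v_or)  # MERGE with the old valuation of this predicate

print(v_new.shape)  # (n_const, n_const)
```

Note that because the softmax weights of (13) sum to one, the sum-pooling keeps v_new in [0, 1], unlike the product of raw scores in (4).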
4.2. Generic Set of Proto-Rules

Although HRI can be instantiated with any proto-rules, we design a small set of expressive proto-rules to instantiate it. We introduce first the following set of proto-rules R0:

A : H(X) ← B1(X, Y) ∧ B2(Y, X)
B : H(X, Y) ← B1(X, Z) ∧ B2(Z, Y)
C : H(X, Y) ← B1(X, Y) ∧ B2(Y, X)

where the Bi's correspond to predicate variables whose second argument is optional (see Section 4.1).

To support our design, we analyzed the expressivity of R0 by investigating which fragment of first-order logic can be generated by R0 from a set of predicates P. Because of space constraints, we only present here our main result, where we assume that P contains the zero-ary predicate True and the equality symbol:

Theorem 4.1. The hypothesis space generated by R0 from P is exactly the function-free definite Horn clause fragment F^{1,2}_2[P], composed of clauses with at most two body atoms involving unary and binary predicates in P.

We refer the reader to Appendix A for all formal definitions, proofs, and further theoretical results. This result implies that our model with R0 can potentially solve perfectly any ILP task whose target predicate is expressible in F^{1,2}_2[P], given a large enough max layer nL.
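The "optional second argument" of the Bi's amounts to the projection of a unary predicate into the binary space (footnote 7, and the operator ν of Appendix A). On valuation tensors this is just a broadcast along the ignored argument; a sketch under our own naming:

```python
import torch

n_const = 4
v_unary = torch.rand(n_const)  # valuation tensor of a unary predicate P1(x)

# View P1 as the binary predicate P1(x, z) := P1(x) for all z, i.e.
# broadcast its valuation along the (ignored) second argument.
v_as_binary = v_unary[:, None].expand(n_const, n_const)

# With this lifting, a binary body slot of a proto-rule can be matched
# indifferently against unary and binary candidates: every candidate is
# first brought to a common (n_const, n_const) valuation tensor.
assert v_as_binary[2, 0] == v_as_binary[2, 3] == v_unary[2]
```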
However, to potentially reduce the max layer nL and facilitate the definition of recursive predicates, we extend R0 to the set R'0 by including a disjunction with a third atom in all the proto-rules:

A' : H(X) ← (B1(X, Y) ∧ B2(Y, X)) ∨ B3(X, T)
B' : H(X, Y) ← (B1(X, Z) ∧ B2(Z, Y)) ∨ B3(X, Y)
C' : H(X, Y) ← (B1(X, Y) ∧ B2(Y, X)) ∨ B3(X, Y)

Moreover, as we have observed a certain redundancy to be beneficial to the learning, we incorporate the permutation rule I in our experiments:

R' = R'0 ∪ {I : H(X, Y) ← F(Y, X)}

This small set of proto-rules R', which has the same expressivity as R0 characterized in Theorem 4.1, is used to instantiate our model in all our experiments.

5. Proposed Training Method

Supervised Training  We train our model with a fixed number nT of training iterations. For each iteration, the model is trained using one training instance. During each iteration, we add a geometrically-decaying Gaussian-distributed noise to the embeddings for both predicates and rules to avoid local optima, similarly to LRI (Campero et al., 2018). Then the improved inference procedure described in (11) is performed with these noisy embeddings for a number nI of steps to get the valuations for all predicates. After these nI inference steps, the binary cross-entropy (BCE) loss between the ground-truth target predicate valuation and the learned target predicate valuation is calculated:

− Σ_{x,y} [ G^t_{x,y} log v(P^t_{x,y}) + (1 − G^t_{x,y}) log(1 − v(P^t_{x,y})) ],   (14)

where the sum is over all pairs of constants x, y in one training instance, and G^t_{x,y} ∈ {0, 1} is the ground-truth value of atom P^t_{x,y} = P^t(x, y). To encourage the unification scores to be closer to 0 or 1, an extra regularization term is added to the BCE loss to update the embeddings:

λ Σ_{P,P} αP,P (1 − αP,P),   (15)

where the hyperparameter λ controls the regularization weight and the sum is over all predicates P that can be matched with a predicate variable P appearing in some meta-rules.

In addition, to further help with convergence to an interpretable solution and avoid local optima, we replace during training the softmax transformation in (13) by a variant of Gumbel-softmax: we replace each cos(θPj, θBi) by cos(θPj, θBi) + Gj, where Gj = −g1 log(−log(g2 Uj)), g1 ∈ (0, +∞) is the Gumbel scale, g2 ∈ (0, +∞) is a novel hyperparameter, and Uj ~ U([0, 1]). With g2 = 1, this leads to the usual Gumbel-softmax (Jang et al., 2017; Maddison et al., 2017). Introducing g2 allows a better control of the noise range during training (see Appendix D for further discussion). In our experiments, g2 is linearly decreased during training, while g1 is held fixed. In contrast to the low-level parameter noise applied directly to the embeddings, Gj may be understood as a higher-level noise acting on the similarity coefficients themselves.
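A sketch of this modified Gumbel perturbation (our code; the hyperparameter names g1 and g2 follow the text, while the decay schedule and values are illustrative):

```python
import torch

def noisy_unif_scores(cos_sims, tau, g1, g2):
    """Perturb cosine similarities with G_j = -g1 * log(-log(g2 * U_j)),
    then renormalize with the softmax of (13). With g2 = 1 this reduces
    to the usual Gumbel-softmax with scale g1."""
    u = torch.rand_like(cos_sims)
    g = -g1 * torch.log(-torch.log(g2 * u))
    return torch.softmax((cos_sims + g) / tau, dim=0)

cos_sims = torch.tensor([0.9, 0.1, -0.3])
for step in range(3):
    g2 = 1.0 - 0.3 * step  # g2 linearly decreased during training, g1 held fixed
    print(noisy_unif_scores(cos_sims, tau=0.1, g1=0.2, g2=g2))
```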
Convergence to an Interpretable Model  The passage from our soft model to a symbolic model may be realized by taking the limit of the softmax temperature in the unification score to zero, or equivalently, by switching to an argmax in the unification score; i.e., for each proto-rule, the final learned rules can be interpreted by assigning head and body atoms to the predicates that obtain the highest unification score. For instance, at a layer ℓ, the symbolic rule (9) could be extracted from proto-rule (8) if P is the auxiliary predicate associated with that proto-rule at layer ℓ and

P1 = argmax_{P ∈ P(ℓ)} αP,B1,  P2 = argmax_{P ∈ P(ℓ)} αP,B2.   (16)

Then we can interpret the solution by successively unfolding the logical formula extracted for each involved predicate, starting from the target predicate.
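The argmax extraction of (16) is mechanical; a sketch with hypothetical names, for a proto-rule of the form (8):

```python
import torch

def extract_rule(theta_body, theta_cands, cand_names, head_name):
    """Pick, for each body slot, the candidate predicate with the highest
    unification score, as in (16), and render the symbolic rule."""
    cos = torch.nn.functional.cosine_similarity(
        theta_cands[None, :, :], theta_body[:, None, :], dim=-1)  # (slots, cands)
    picks = cos.argmax(dim=1)
    b1, b2 = (cand_names[i] for i in picks.tolist())
    return f"{head_name}(X, Y) <- {b1}(X, Z) AND {b2}(Z, Y)"

theta_cands = torch.randn(4, 8)  # embeddings of the candidate predicates
theta_body = torch.randn(2, 8)   # embeddings of the body slots B1, B2
print(extract_rule(theta_body, theta_cands, ["Succ", "Zero", "Aux", "Even"], "aux11"))
```

Unfolding the solution then just means applying this extraction layer by layer, starting from the target predicate.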
6. Experimental Results

For the empirical validation of HRI (13), we carried out several series of experiments to answer the following questions: (1) how does HRI fare against other neuro-symbolic models? (2) when can HRI be superior to traditional ILP methods? (3) can HRI scale to tackle large domains? (4) how significant are the different choices in HRI? For (1), we use both ILP and RL tasks. Since traditional ILP methods cannot be applied to RL, we use ILP tasks for (2). For (3), we consider a large domain from Visual Genome (Krishna et al., 2017). For our ablation study (Table 11), we use again ILP tasks. For space constraints, we only present a selection of the experimental results in the main paper. The remaining results, alongside additional experiments (e.g., choice of operators, Gumbel noise, limitations of LRI, sensitivity to hyperparameters), are discussed in Appendices B.2, B.4, C, D, and E.

(13) The source code of HRI and the scripts to reproduce the experimental results can be found at .

All of our results are averaged over several runs with different random seeds. In each run, a model is trained over a fixed number of iterations. Hyperparameter details are given in Appendix E. For simplicity and consistency, we instantiate our model with R', with full recursivity, in all the ILP tasks. For the RL and Visual Genome tasks, we use no recursivity; in the latter domain, it is not expected to be helpful. Note that the performance of HRI can be improved and/or learning can be accelerated if we customize R' to a task or restrict its recursivity. However, we intended to emphasize the genericity of our approach.

6.1. Comparison with Neuro-Symbolic Models

ILP Experiments  We first evaluate our model on the ILP tasks from Evans & Grefenstette (2018), which are detailed in Appendix B.1, where we also provide the interpretable solutions found by our model. We compare our model with ∂ILP (Evans & Grefenstette, 2018) and LRI (Campero et al., 2018) using their reported results. A selection of the results on these ILP tasks is presented in Table 1 (see Appendix B.2 for full results). A run is counted as successful if the mean square error between inferred and given target valuations is less than 1e-4 on new evaluation datasets. Column "train" corresponds to the performance measured during training, while column "soft evaluation" (resp. "symbolic evaluation") corresponds to the evaluation performance (no noise) using (12) (resp. the interpretable solution, as explained at the end of Section 5).

Table 1. Percentage of successful runs among 10 runs. |I| is the smallest number of intensional predicates needed. "Recursive" means whether or not the solution needs to learn recursive rules.

Task             |I|  Recursive  ∂ILP   LRI   Ours: train / soft eval. / symbolic eval.
Undirected Edge   1   No         100    100   100 / 100 / 100
Member            1   Yes        100    100   100 / 100 / 100
Connectedness     1   Yes        100    100   100 / 100 / 100
Grandparent       2   No         96.5   100   100 / 100 / 100
Adjacent to Red   2   No         50.5   100   100 / 100 / 100
Two Children      2   No         95     0     100 / 100 / 100
Graph Coloring    2   Yes        94.5   0     100 / 100 / 100
Even-Succ2        2   Yes        48.5   100   40 / 40 / 40
Buzz              2   Yes        35     70    100 / 40 / 40

In contrast to both ∂ILP and LRI, which use a specific set of meta-rules tailored for each task, we use the same generic and more expressive template set R' to tackle all tasks. Despite that, HRI can often match or outperform them. Naturally, on a few tasks, our generic approach based on R' impedes learning; however, our performance could be improved by enforcing more tailored meta-rules.

We also compare HRI to more generic neuro-symbolic models that do not require specific meta-rule templates: NLM (Dong et al., 2019) and DLM (Zimmer et al., 2021), which can scale better than methods like ∂ILP. For those two models, we run their respective open-source implementations, which are also GPU-based, providing a fair empirical comparison with HRI. To compare their performance and training times for a varying number of training constants, we evaluate them on two ILP tasks used by NLM: Adjacent to Red and Grandparent. Those two ILP tasks are similar to those used by Evans & Grefenstette (2018), but are defined with different background predicates. The results reported in Table 2 show that our model is one to two orders of magnitude faster and that it can scale much better in terms of training constants, while achieving at least as good performance.

Table 2. Comparisons with NLM/DLM in terms of percentage of successful runs and average training times over 10 runs.

Task             #Training constants  % successful runs (NLM / DLM / Ours)  Training time in secs (NLM / DLM / Ours)
Adjacent to Red  7                    100 / 90 / 100                        163 / 920 / 62
Adjacent to Red  10                   90 / 90 / 100                         334 / 6629 / 71
Grandparent      9                    90 / 100 / 100                        402 / 2047 / 79
Grandparent      20                   100 / 100 / 100                       1441 / 3331 / 89

RL Experiments  We also tested our model in an RL setting on blocks manipulation tasks: Stack, Unstack, and On (Jiang & Luo, 2019). In those tasks, an agent has 50 steps to build a stack (Stack), place all the blocks directly on the floor (Unstack), or move a specific block onto another (On). We compare HRI with NLRL (Jiang & Luo, 2019) (which extends ∂ILP to RL), NLM (Dong et al., 2019), and DLM (Zimmer et al., 2021). From all the valuations of the target predicate Move(X, Y), we compute a softmax policy used during the exploration. During training, the supervised BCE loss (14) is replaced by a standard PPO loss (Schulman et al., 2017) on the softmax policy. To estimate the advantage function, we relied on the same critic architecture defined by Zimmer et al. (2021). At the end of the training, the symbolic policy is extracted and evaluated on 5 test scenarios not seen during training. As shown in Table 4, our model outperforms NLRL and competes with NLM and DLM, both in terms of performance and training time.

Table 4. Comparisons with NLRL/NLM/DLM in terms of rewards on the testing scenarios.

Task     NLRL   NLM    DLM    Ours
Unstack  0.914  0.920  0.920  0.920
Stack    0.877  0.920  0.920  0.920
On       0.885  0.896  0.896  0.896
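Turning the valuation tensor of Move(X, Y) into an action distribution is straightforward; a sketch (our own, with illustrative random valuations):

```python
import torch

n_blocks = 4
v_move = torch.rand(n_blocks, n_blocks)  # valuations of Move(X, Y) after n_I steps

# Softmax policy over all (X, Y) pairs, used for exploration; during
# training, the BCE loss (14) is replaced by a PPO loss on this policy.
policy = torch.distributions.Categorical(logits=v_move.flatten())
action = policy.sample()
x, y = divmod(action.item(), n_blocks)
print(f"Move({x}, {y}), log-prob {policy.log_prob(action).item():.3f}")
```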
6.2. Comparison with Traditional ILP Methods

To answer the second question, we selected two state-of-the-art classic ILP methods: ILASP3 (Law et al., 2018) and Popper (Cropper & Morel, 2021). Both are meta-level approaches, like most recent classic ILP approaches. While ILASP3 can handle noise, Popper can scale better. In both the traditional ILP literature and the neuro-symbolic work, experimental comparisons of the two approaches are rare, with a few exceptions (Law et al., 2018). Our experiments in this section can be seen as a contribution to fill this gap and shed more light on the situations in which one approach may be preferred to another.

We selected a sample of four diverse ILP tasks (e.g., easy vs. harder tasks, need of recursion, etc.): Connectedness, Grandparent, Member, and Undirected Edge. Our experimental evaluation indicates that ILASP3 and Popper are superior both in terms of performance and computational/space costs only when the dataset is small and noise-free, especially when a good language bias is available. However, when the dataset becomes larger, as shown in Figure 4 of Appendix B.4, HRI scales better because it can be trained on mini-batches, which is not possible for traditional ILP methods.

Noisy data  Finally, we compared HRI with ILASP3 (Law et al., 2018) (14) and ∂ILP (Evans & Grefenstette, 2018) in a noisy data setting by ranging the ratio of mislabeled target training examples from 0% to 90%. We actually also evaluated ILASP4 (Law et al., 2020), which improves ILASP3 with a more efficient algorithm, using the experimental set-up described in (Law et al., 2018); however, the results were worse than those of ILASP3. We discuss those experiments and results in Appendix B.4. Our experiments suggest that HRI handles noise better than ILASP3 (and also ∂ILP) when the noise level (e.g., probability of incorrect positive/negative examples) is below about 40%, as shown in Figure 2. This demonstrates the robustness of HRI, since in practice noise levels would rarely be above 40%; otherwise a dataset would be hardly exploitable. Furthermore, note that ILASP3 adopts a very specific hypothesis bias and ∂ILP uses task-specific templates to obtain their results, while HRI is based on a much more general hypothesis space.

(14) The experimental results of ILASP3 and ∂ILP come from their own papers (Law et al., 2018; Evans & Grefenstette, 2018).

Figure 2. Mean Squared Error (MSE) w.r.t. different noise ratios, on four tasks including (a) Predecessor, (c) Connectedness, and (d) Undirected Edge.

6.3. Visual Genome

Our model can be applied beyond classical ILP domains, and can even benefit from richer environments and semantic structure. Furthermore, initial predicate embeddings can be bootstrapped by visual or semantic priors. We illustrate this by applying HRI to the larger dataset of Visual Genome (Krishna et al., 2017), or more precisely a pre-processed (less noisy) version known as GQA (Hudson & Manning, 2019a). Similarly to Yang & Song (2020), we filtered out the predicates with less than 1500 occurrences, which leads to a KG with 213 predicates; then, we solved a multi-task ILP problem, which consists in learning the explanations for the top 150 objects in the dataset. More details on the multi-task setting, results, baselines, and metrics are provided in Appendix C. We also empirically confirm there that HRI can benefit from more informative pre-trained embeddings in a single-task setting (cf. Table 5).
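One way to bootstrap predicate embeddings from GPT2 is sketched below; this uses the Hugging Face transformers package and token-embedding averaging, which are our assumptions, not details specified by the paper:

```python
from transformers import GPT2Model, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
wte = model.wte.weight.detach()  # (vocab_size, 768) token embedding table

def init_predicate_embedding(name: str) -> torch.Tensor:
    """Average the GPT2 token embeddings of a predicate name, e.g. 'car',
    to initialize the corresponding predicate embedding theta_P."""
    ids = tokenizer(name)["input_ids"]
    return wte[ids].mean(dim=0)

theta_car = init_predicate_embedding("car")
print(theta_car.shape)  # torch.Size([768])
```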
Based on those results, we used pretrained embeddings from the NLP model GPT2 (Radford et al., 2019) in our other experiments on GQA. In the multi-task setting, we trained 150 target models together with shared background predicate embeddings, where each model corresponds to one object classification. The training consists of Nr rounds. In each round, we trained each model sequentially, by sampling for each target np instances containing positive examples, and nr instances with or without positive examples.

We compared our model with NLIL (Yang & Song, 2020) and two other supervised baselines (classifiers) mentioned in their paper: Freq, which predicts the object class by checking the related relation that contains the target with the highest frequency; and MLP+RCNN, an MLP classifier trained with RCNN features extracted from object images in the Visual Genome dataset. The latter is a strong baseline because it uses visual information for its predictions, while the other methods only use relational information. Like NLIL, we use R@1 and R@5 to evaluate the trained models. Table 3 summarizes the results. We see that both our method and MLP+RCNN achieve the best performance for R@1. For R@5, we outperform Freq and NLIL.

Table 3. R@1 and R@5 for 150-object classification on Visual Genome.

Model     R@1   R@5
MLP+RCNN  0.53  0.81
Freq      0.40  0.44
NLIL      0.51  0.52
Ours      0.53  0.60

We ran a first set of experiments to validate that embedding priors may be leveraged, and compared three initialization methods: random initialization, pretrained embeddings from NLIL (Yang & Song, 2020), and pretrained embeddings from the NLP model GPT2 (Radford et al., 2019). We trained our model on a single ILP task, which consists in predicting the predicate Car. For this task, we further filtered the GQA dataset to keep 185 instances containing cars. Table 5 presents the accuracy, precision, and R@1 obtained over 10 runs. Pretrained embeddings outperform the randomly initialized ones in all metrics. Embeddings from NLIL, trained specifically on this dataset, expectedly yield the best results. Yet, to be fair, for the other experiments, we rely on the NLP embeddings.

Table 5. Performance of different embedding initializations for the single Visual Genome task, displaying both the soft and the symbolic evaluations (once the symbolic rule has been extracted).

Init.   Accuracy (soft / symb.)  Precision (soft / symb.)  R@1 (soft / symb.)
Random  0.63 / 0.49              0.57 / 0.5                0.23 / 0.38
NLIL    0.75 / 0.6               0.87 / 0.75               0.46 / 0.58
GPT2    0.65 / 0.45              0.72 / 0.66               0.27 / 0.5

7. Conclusion

We presented a new neuro-symbolic interpretable model performing hierarchical rule induction through soft unification with learned embeddings; it is initialized by a theoretically supported small-yet-expressive set of proto-rules, which is sufficient to tackle many classical ILP benchmark tasks, as it encompasses a substantial function-free definite Horn clause fragment. Our model has demonstrated its efficiency and performance in ILP, RL, and richer domains against state-of-the-art baselines, where it is typically one to two orders of magnitude faster to train. As discussed in Appendix F, future extensions of HRI could tackle further RL domains and continual learning scenarios benefiting from an interpretable logic-oriented policy, while extending our proto-rule set to broaden its expressivity.

Ethical Considerations

Deploying our model trained with data from annotated images (e.g., Visual Genome), or embeddings produced by NLP models (e.g., GPT2), would raise ethical issues, because such data contain some biases, as has been well documented (Papakyriakopoulos et al., 2020; Tommasi et al., 2017).

References

Amizadeh, S., Palangi, H., Polozov, O., Huang, Y., and Koishida, K. Neuro-symbolic visual reasoning: Disentangling visual from reasoning. In ICML, 2020.

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. Translating embeddings for modeling multi-relational data. In NeurIPS, 2013.
Campero, A., Pareja, A., Klinger, T., Tenenbaum, J., and Riedel, S. Logical rule induction and theory learning using neural theorem proving. arXiv preprint arXiv:1809.02193, 2018.

Cropper, A. and Dumančić, S. Inductive logic programming at 30: a new introduction. Machine Learning, 43, 2021.

Cropper, A. and Morel, R. Learning programs by learning from failures. Machine Learning, 110(4):801-856, 2021.

Cropper, A. and Muggleton, S. Logical minimisation of meta-rules within meta-interpretive learning. In Inductive Logic Programming - 24th International Conference, ILP, volume 9046, pp. 62-75. Springer, 2014.

Cropper, A. and Tourret, S. Derivation reduction of metarules in meta-interpretive learning. In Inductive Logic Programming, pp. 1-21, 2018.

Cropper, A. and Tourret, S. Logical reduction of metarules. Machine Learning, 109(7):1323-1369, 2020.

Dai, W.-Z., Xu, Q., Yu, Y., and Zhou, Z.-H. Bridging machine learning and logical reasoning by abductive learning. In Advances in Neural Information Processing Systems, volume 32, 2019.

d'Avila Garcez, A. and Lamb, L. C. Neurosymbolic AI: The 3rd wave, 2020. URL https://arxiv.org/abs/2012.05876.

Dong, H., Mao, J., Lin, T., Wang, C., Li, L., and Zhou, D. Neural logic machines. In International Conference on Learning Representations. OpenReview.net, 2019.

Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of AI Research, 61:1-64, 2018.

Fonseca, N., Costa, V., Silva, F., and Camacho, R. On avoiding redundancy in inductive logic programming. In Lecture Notes in Artificial Intelligence, volume 3194, pp. 132-146, 2004.

Galárraga, L. A., Teflioudi, C., Hose, K., and Suchanek, F. AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In Proceedings of the 22nd International Conference on World Wide Web, pp. 413-422. ACM Press, 2013.

Guu, K., Miller, J., and Liang, P. Traversing knowledge graphs in vector space. In EMNLP, 2015.

Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6693-6702, 2019a.

Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019b.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.

Jiang, Z. and Luo, S. Neural logic reinforcement learning. In International Conference on Machine Learning, 2019.

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision, 123(1):32-73, May 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84-90, May 2017.

Law, M., Russo, A., and Broda, K. Inductive learning of answer set programs from noisy examples. Advances in Cognitive Systems, 7:57-76, 2018.

Law, M., Russo, A., and Broda, K. The ILASP system for inductive learning of answer set programs. In The Association for Logic Programming Newsletter, 2020. arXiv:2005.00904.

Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI, 2015.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
Manhaeve, R., Dumančić, S., Kimmig, A., Demeester, T., and De Raedt, L. DeepProbLog: Neural probabilistic logic programming. In NeurIPS, 2018.

Marcus, G. Deep learning: A critical appraisal. arXiv:1801.00631, 2018.

Muggleton, S. Inverse entailment and Progol. New Generation Computing, 13:245-286, 1995.

Nickel, M., Tresp, V., and Kriegel, H.-P. A three-way model for collective learning on multi-relational data. In ICML, 2011.

Omran, P., Wang, K., and Wang, Z. Scalable rule learning via learning representation. In IJCAI, pp. 2149-2155, 2018.

Papakyriakopoulos, O., Hegelich, S., Serrano, J. C. M., and Marco, F. Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pp. 446-457. Association for Computing Machinery, 2020.

Payani, A. and Fekri, F. Learning algorithms via neural logic networks. arXiv:1904.01554, 2019.

Quinlan, J. R. Learning logical definitions from relations. Machine Learning, 5:239-266, 1990.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Ren, H., Dai, H., Dai, B., Chen, X., Yasunaga, M., Sun, H., Schuurmans, D., Leskovec, J., and Zhou, D. LEGO: Latent execution-guided reasoning for multi-hop question answering on knowledge graphs. In ICML, 2021.

Robinson, J. A. A machine-oriented logic based on the resolution principle. J. ACM, 12(1):23-41, January 1965.

Rocktäschel, T. and Riedel, S. End-to-end differentiable proving. In NeurIPS, volume 30, 2017.

Santoro, A., Raposo, D., Barrett, D. G. T., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning. In NeurIPS, pp. 4967-4976, 2017.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Socher, R., Chen, D., Manning, C. D., and Ng, A. Y. Reasoning with neural tensor networks for knowledge base completion. In NeurIPS, volume 26, 2013.

Tommasi, T., Patricia, N., Caputo, B., and Tuytelaars, T. A deeper look at dataset bias. 2017.

Yang, F., Yang, Z., and Cohen, W. W. Differentiable learning of logical rules for knowledge base reasoning. In Neural Information Processing Systems, 2017.

Yang, Y. and Song, L. Learn to explain efficiently via neural logic inductive learning. In International Conference on Learning Representations, 2020.

Zellers, R., Yatskar, M., Thomson, S., and Choi, Y. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.

Zimmer, M., Feng, X., Glanois, C., Jiang, Z., Zhang, J., Weng, P., Dong, L., Jianye, H., and Wulong, L. Differentiable logic machines. arXiv preprint arXiv:2102.11529, 2021.

A. Expressivity Analysis

Many methods within inductive logic programming (ILP) define meta-rules, i.e., second-order Horn clauses which delineate the structure of learnable programs. The choice of these meta-rules, referred to as a language bias and restricting the hypothesis space, results in a well-known trade-off between efficiency and expressivity. In this section, we bring some theoretical justification for our designed set of second-order Horn clauses, by characterizing its expressivity.
Relying on minimal sets of meta-rules for rule induction models has been shown to improve both learning time and predictive accuracy (Cropper & Muggleton, 2014; Fonseca et al., 2004). For a model to be both adaptive and efficient, it seems pertinent to aim for a minimal set of meta-rules generating a sufficiently expressive subset of Horn clauses. The desired expressivity is ineluctably contingent on the tackled domain(s); a commonly-considered fragment is Datalog, denoted D, which has been proven expressive enough for many problems. Datalog is a syntactic subset of Prolog which, by losing the Turing-completeness of Prolog/Horn, provides the undeniable advantage that its computations always terminate. Below, we focus our attention on similar fragments, in concordance with previous literature (Galárraga et al., 2013; Cropper & Tourret, 2020). To make the document self-contained, before stating our results, let us introduce both concepts and notations related to first- and second-order logic. Recall that P denotes the set of all considered predicates.

Horn Logic  We focus on Horn clause logic, since it is a widely adopted (15) Turing-complete (16) subset of FOL. A Horn clause is a clause with at most one positive literal. Horn clauses may be of the following types (where P, Q, T, and U are atoms):

- goal clauses, which have no positive literal; they are expressed as ¬P ∨ ¬Q ∨ ... ∨ ¬T, or equivalently False ← P ∧ Q ∧ ... ∧ T,
- definite clauses, which have exactly one positive literal; they are expressed as ¬P ∨ ¬Q ∨ ... ∨ ¬T ∨ U, or equivalently as a Horn rule U ← P ∧ Q ∧ ... ∧ T,
- facts, which can be expressed as U ← True (17), where U is grounded.

(15) Pure Prolog programs are composed of definite clauses, and any query in Prolog is a goal clause.
(16) Horn clause logic and universal Turing machines are equivalent in terms of computational power.
(17) Facts can be seen as a sub-case of definite clauses, assuming True ∈ P.

Meta-rules  We follow the terminology from the Meta-Interpretative Learning literature (Cropper & Tourret, 2020):

Definition 7.1. A meta-rule is a second-order Horn clause of the form: A0 ← A1 ∧ ... ∧ Am, where each Ai is a literal of the form P(T1, ..., Tni), where P is a predicate symbol or a second-order variable that can be substituted by a predicate symbol, and each Ti is either a constant symbol or a first-order variable that can be substituted by a constant symbol.

Here are some examples of intuitive meta-rules, which will be of later use:

(σ) P(A, B) ← Q(B, A)  (permute)
(ι) P(A, B) ← Q(A)  (expand)
(∃) P(A) ← ∃B, Q(A, B)  (existential contraction)
(∀) P(A) ← ∀B, Q(A, B)  (universal contraction)
(δ) P(A) ← Q(A, A)  (diagonal extract)
(δ') P(A, A) ← Q(A)  (diagonal fill)

A common meta-rule set, which has been used and tested in the literature (Cropper & Tourret, 2020), is the following:

(Ident1) P(A) ← Q(A)
(DIdent1) P(A) ← Q(A) ∧ R(A)
(Ident2) P(A, B) ← Q(A, B)
(DIdent2) P(A, B) ← Q(A, B) ∧ R(A, B)
(Precon) P(A, B) ← Q(A) ∧ R(A, B)
(Postcon) P(A, B) ← Q(A, B) ∧ R(B)
(Curry) P(A, B) ← Q(A, B, R)
(Chain) P(A, B) ← Q(A, C) ∧ R(C, B)   (18)

where the letters P, Q, R correspond to existentially-quantified second-order variables and the letters A, B, C to universally-quantified first-order variables.

Figure 3. Three types of meta-rules ((a) meta-rule A, (b) meta-rule B, (c) meta-rule C): boxes represent variables, green arrows represent relations for head atoms, and blue arrows represent relations for body atoms.

Proto-rules  Our model relies on an extended notion of meta-rules, which we refer to as proto-rules.
These templates have the specificity that they do not fix the arity of their body predicates; only the arity of the head predicate is determined. Formally, they can be defined as follows. For any n ∈ N, let Pn (resp. P≤n) be the predicates in P that have arity equal to (resp. lower than or equal to) n. Define the projection operator νn, which canonically embeds predicates of arity i ≤ n in the space of predicates of arity n:

νn : P≤n → Pn  s.t.  νn(P)(X1, ..., Xn) = P(X1, ..., Xi)  for P ∈ Pi.   (19)

Note that the restriction of νn to the space of predicates of arity n is the identity: νn|Pn = idPn. Moreover, this projection naturally extends to second-order variables. To ease the notations, we denote below the projection νni with an overline (e.g., for a predicate P, νni(P) = P̄), since there is no risk of confusion given that ni is specified by the arguments (T1, ..., Tni).

Definition 7.2. A proto-rule is an extension of a second-order Horn clause of the form: A0 ← A1 ∧ ... ∧ Am, where A0 (resp. each Ai, i > 0) is a literal of the form P(T1, ..., Tni) (resp. P̄(T1, ..., Tni)), in which P, which is a predicate symbol or a second-order variable that can be substituted by a predicate symbol, is of arity lower than or equal to ni, and each Ti is either a constant symbol or a first-order variable that can be substituted by a constant symbol.

Seeing second-order rules as functions from predicate spaces to first-order logic rules, we can provide another characterization of proto-rules: a meta-rule can be understood as a mapping Pn1 × ... × Pnm → H for some indices ni, i.e., taking m predicates of specific arities and returning a first-order Horn clause; in contrast, a proto-rule is a mapping P≤n1 × ... × P≤nm → H, for some indices ni.

The first set of proto-rules we propose is the following (see Figure 3):

A : P(A) ← Q(A, B) ∧ R(B, A)
B : P(A, B) ← Q(A, C) ∧ R(C, B)
C : P(A, B) ← Q(A, B) ∧ R(B, A)   (20)

Horn Fragments  A Horn theory T is a set of Horn clauses.
These fragments are related as follows: H S F C D K For M a set of meta-rules, we denote M a m the fragment of M where each literal has arity at most a and each meta-rule has at most m body literals. We extend the notation in order to allow both a and m to be a set of integers; e.g., M{1,2} {3} is the fragment of M where each literal corresponds to an unary or binary predicate and the body of each meta-rule contains exactly three literals. Binary Resolution As it would be of future use, let us introduce some notions around resolution. To gain intuition about the notion of resolvent20, let us look at an example before stating more formal definitions. Consider the following Horn clauses: C1 : P(X, Y ) Q(X) R(X, Y ) or, equivalently, P(X, Y ) Q(X) R(X, Y ) C2 : R(X, a) T(X, a) S(a) or, equivalently, R(X, a) S(a) T(X, a) The atoms R(X, Y ) and R(X, a) are unifiable, under the substitution σ : {a/Y }; under this substitution, we obtain the following clauses: {P(X, a) Q(X) R(X, a), R(X, a) T(X, a) S(a)} Since either R(X, a) is True, or R(X, a) is True, assuming C1 and C2 are True, we can deduce the following clause is True: (P(X, a) Q(X)) ( T(X, a) S(a)) or, equivalently, P(X, a) Q(X) T(X, a) S(a) This resulting clause, is a binary resolvent of C1 and C2. Now, let us state more formal definitions. We denote σ = {t1/X1, , tn/Xn}, a substitution and Eσ the expression obtained from an expression E by simultaneously replacing all occurrences of the variables Xi by the terms ti. 18Some wider Datalog fragment have been proposed, notably allowing negation in the body, under certain stratification constraints; here we consider only its intersection with function-free definite Horn clauses. 19Since below each clause is assume to be a definite Horn clause, it can be equivalently represented as a Horn rule, with one head and several non-negated literals in the body; we below alternately switch between these representations. 20Another way to gain intuition, is to think about propositional logic; there, a resolvent of two parent clauses containing complementary literals, such as P and P, is simply obtained by taking the disjunction of these clauses after removing these complementary literals. In the case of FOL or HOL, such resolvent may involve substitutions. Neuro-Symbolic Hierarchical Rule Induction Definition 7.3. A substitution σ is called a unifier of a given set of expressions {E1, , Em} , m 2 if E1σ = = Emσ. A unifier θ of a unifiable set of expressions E = {E1, , Em} , m 2, is said to be a most general unifier if for each unifier σ of E there exists a substitution λ such that σ = θλ. Definition 7.4. Consider two parent clauses C1 and C2 containing the literals l1 and l2 respectively. If l1 and l2 have a most general unifier σ, then the clause (C1σ \ l2σ) (C2σ \ l2σ) is called a binary resolvent of C1 and C2. Derivation Reduction Diverse reductions methods have been proposed in the literature, such as subsumption (Robinson, 1965), entailment (Muggleton, 1995), and derivation (Cropper & Tourret, 2018). As previously pointed out (e.g., (Cropper & Tourret, 2020)), common entailment methods may be too strong and remove useful clauses enabling to make predicates more specific. The relevant notion to ensure the reduction would not affect the span of our hypothesis space is the one of derivation-reduction (D-reduction) (Cropper & Tourret, 2018), as defined in Definition 7.6. 
Derivation Reduction Diverse reduction methods have been proposed in the literature, such as subsumption (Robinson, 1965), entailment (Muggleton, 1995), and derivation (Cropper & Tourret, 2018). As previously pointed out (e.g., (Cropper & Tourret, 2020)), common entailment methods may be too strong and remove useful clauses that make predicates more specific. The relevant notion ensuring that the reduction does not affect the span of our hypothesis space is derivation-reduction (D-reduction) (Cropper & Tourret, 2018), as defined in Definition 7.6. Let us first define the following operator for a Horn theory T:

R⁰(T) = T
Rⁿ(T) = {C | C is the binary resolvent of C1 and C2, with C1 ∈ Rⁿ⁻¹(T) and C2 ∈ T}

Definition 7.5. The Horn closure of a Horn theory T is R*(T) = ∪ₙ Rⁿ(T).

Definition 7.6. (i) A Horn clause C is derivationally redundant in the Horn theory T ∪ {C} if T ⊢ C, i.e., C ∈ R*(T); it is said to be k-derivable from T if C ∈ Rᵏ(T). (ii) A Horn theory T is derivationally reduced (D-reduced) if and only if it does not contain any derivationally redundant clause.

A clausal theory may have several D-reductions. For instance, we can easily see that

C1 : P(A, B) ← Q(B, A)
C2 : P(A, B) ← Q(A, C) ∧ R(C, B)
C3 : P(A, B) ← Q(A, C) ∧ R(B, C)

may be D-reduced to either {C1, C2} or {C1, C3}.

Hypothesis Space Recall that P (resp. P0) denotes the set of predicates (resp. of initial predicates), and B the background knowledge, which may encompass both initial predicates and pre-given rules. As we are interested in characterizing the expressivity of our model, let us define the relevant notion of hypothesis space (we adopt this denomination, unconventional in logic, because it refers to the space accessible to our learning algorithm), which applies to any rule-induction model attached to a meta-rule or proto-rule set:

Definition 7.7. Given a predicate set P and a meta-rule or proto-rule set M, let us define: (i) the set M_P as the set of all Horn clauses generated by all possible substitutions of predicate variables in M by predicate symbols from P; (ii) the hypothesis space M[P] generated by M from P as the Horn closure of M_P.

Similarly, by incorporating initial rules, we could also define the Horn theory generated by a set of meta-rules or proto-rules given a certain background knowledge B and a set of predicates P.

Equality Assumption The Equality relation is typically assumed to be part of FOL, so it is reasonable to assume it belongs to the background knowledge in our results below. In our approach, it amounts to integrating the predicate Equal into the set of initial predicates P0; Equal, which corresponds to the identity, is intensionally defined on any domain D by:

Equal : D × D → B = {True, False},   (X, Y) ↦ True if X = Y, False if X ≠ Y.   (22)
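Before moving to the main result, here is a minimal propositional sketch of the closure operator Rⁿ(T) and of k-derivability (Definitions 7.5 and 7.6); the first-order case would add unification on top, as sketched earlier. The encoding is our own: a Horn clause is a pair (head, frozenset of body atoms).

```python
from itertools import product

def resolvents(c1, c2):
    """Binary resolvents of c1 with c2: replace a body atom of c1 that
    matches the head of c2 by the body of c2."""
    h1, b1 = c1
    h2, b2 = c2
    if h2 in b1:
        yield (h1, frozenset((b1 - {h2}) | b2))

def closure(theory, k):
    """All clauses that are at most k-derivable from the theory."""
    derived, frontier = set(theory), set(theory)
    for _ in range(k):
        frontier = {r for c1, c2 in product(frontier, theory)
                    for r in resolvents(c1, c2)} - derived
        derived |= frontier
    return derived

# T = {p <- q /\ r,  r <- s}: then p <- q /\ s is 1-derivable from T.
T = {("p", frozenset({"q", "r"})), ("r", frozenset({"s"}))}
assert ("p", frozenset({"q", "s"})) in closure(T, 1)
```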
Main Result Here is our main result, concerning the expressivity of the minimal set R0:

Theorem 7.8. (i) Assuming True ∈ P0, the hypothesis space generated by R0 from P encompasses the set of duplicate-free and function-free definite Horn clauses composed of clauses with at most two body atoms involving unary and binary predicates in P:

True ∈ P0  ⇒  R0[P] ⊇ U^{1,2}_2[P]

(ii) Assuming True, Equal ∈ P0, the hypothesis space generated by R0 from P corresponds to the set of function-free definite Horn clauses composed of clauses with at most two body atoms involving unary and binary predicates in P, i.e., in terms of the second-order logic fragment:

True, Equal ∈ P0  ⇒  R0[P] = F^{1,2}_2[P]

We refer to the end of this section for the proof of this theorem. Theorem 7.8 allows us to conclude that R0 is more expressive than the set RMIL commonly found in the literature, assuming we are working with predicates of arity at most 2:

Corollary 7.9. Assuming the initial predicate set P0 contains only predicates of arity at most 2, the hypothesis space generated by R0 given P encompasses the one generated by RMIL as defined in (18):

∀P ∈ P0, arity(P) ≤ 2  ⇒  R0[P] ⊇ RMIL[P]

Proof. Under our assumption, the meta-rule (Curry) in (18) may be disregarded. The remaining meta-rules present in RMIL have already been examined in Theorem 7.8. R0 therefore has at least the same expressivity as RMIL. To conclude that R0 is strictly more expressive, we can mention the rules P(A) ← P(A, B) or P(A, B) ← P(B, A), which are reached by R0 but not by RMIL.

Disjunction In our model, to ensure incremental learning of the target rule, we impose the restriction that each auxiliary predicate corresponds to one rule. Since such a constraint risks removing the possibility of disjunction, we redefine new proto-rule versions integrating disjunctions:

A : P(A) ← [Q̄(A, B) ∧ R̄(B, A)] ∨ S̄(A, B)
B : P(A, B) ← [Q̄(A, C) ∧ R̄(C, B)] ∨ S̄(A, B)
C : P(A, B) ← [Q̄(A, B) ∧ R̄(B, A)] ∨ S̄(A, B)

Adopting the minimal set R′0 in our incremental setting does not affect the expressivity of the hypothesis space compared to relying on the set R0 in a non-incremental setting. Let us recall that an incremental setting imposes that a predicate symbol be the head of exactly one rule.

Lemma 7.10. The minimal proto-rule set R′0 in an incremental setting holds the same expressivity as R0.

Proof. It is straightforward to see that R′0 is a minimal set. The hypothesis space generated by R0 obviously encompasses the hypothesis space generated by R′0 in an incremental setting. Reciprocally, let us assume n rules instantiated from R0 are attached to one predicate symbol P. They can be expressed as:

P(X, Y) ← f1(X, Y)
P(X, Y) ← f2(X, Y)
⋮
P(X, Y) ← fn(X, Y)

Recursively, we can define auxiliary predicates Pi in the hypothesis space of R′0:

P1(X, Y) ← f1(X, Y)
P2(X, Y) ← P1(X, Y) ∨ f2(X, Y)
⋮
Pn(X, Y) ← Pn−1(X, Y) ∨ fn(X, Y)

Since Pn is then equivalent to P, we can conclude.
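The following quick check (our own illustration, on random truth tables) confirms that the incremental chain of auxiliary predicates used in this proof recovers the n-way disjunction P(X, Y) ← f1(X, Y) ∨ … ∨ fn(X, Y).

```python
import random

n, domain = 5, range(4)
# Random truth tables for the bodies f1, ..., fn.
f = [{(x, y): random.random() < 0.3 for x in domain for y in domain}
     for _ in range(n)]

def P_direct(x, y):
    return any(fi[(x, y)] for fi in f)

def P_chained(x, y):
    value = f[0][(x, y)]                 # P1 <- f1
    for i in range(1, n):
        value = value or f[i][(x, y)]    # P_{i+1} <- P_i \/ f_{i+1}
    return value

assert all(P_direct(x, y) == P_chained(x, y) for x in domain for y in domain)
```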
Redundancy Since we have observed a certain redundancy to be beneficial to learning, we incorporate the permutation rule I in our experiments, and rely on the following set:

R = R′0 ∪ {I : H(X, Y) ← F(Y, X)}

This small yet non-minimal set of proto-rules R has the same expressivity as R0 (characterized in Theorem 7.8).

Lemma 7.11. The proto-rule set R holds the same expressivity as R0.

Proof. It stems from the fact, observed in the proof of Theorem 7.8, that the rule (σ), equivalent to (I), can be D-reduced from R0.

Recursivity The development of recursion theory first arose around Hilbert's Program and Gödel's proof of the Incompleteness Theorems (1931), and is tightly linked to questions of computability. While implementing a recursion-friendly model, we should keep in mind considerations of computability and decidability. Enabling full recursivity within the templates is likely to substantially affect both learning and convergence, by leaving room for numerous inconsistencies. With this in mind, let us consider different stratifications of the rule space, and of our model. First, we introduce a hierarchical filtration of the rule space: initial predicates are set at layer 0, and auxiliary predicates in layer ℓ are defined from predicates of layers at most ℓ. Within such a stratified space, different restrictions may be considered (a programmatic check of these restrictions follows the list):

- recursive-free hierarchical hypothesis space, where rules of layer ℓ are defined from rules of strictly lower layers. The proto-rules from R′0 are therefore restricted to the following form: P(·) ← [Q(·) ∧ R(·)] ∨ O(·), with P ∈ Pℓ and Q, R, O ∈ P<ℓ.
- iso-recursive hierarchical hypothesis space, where the only recursive rules allowed from R′0 have the form: P(·) ← [Q(·) ∧ R(·)] ∨ O(·), with P ∈ Pℓ, Q, R ∈ P<ℓ ∪ {P}, and O ∈ P<ℓ. Such a space fits most recursive ILP tasks (e.g., Fizz, Buzz, Even, Less Than).
- recursive hierarchical hypothesis space, where the recursive rules allowed from R′0 have the form: P(·) ← [Q(·) ∧ R(·)] ∨ O(·), with P ∈ Pℓ and Q, R, O ∈ P≤ℓ. Such a space is required for the recursivity present in Even-Odd or Length.

Such a hierarchical notion is also present in the logic programming literature under the name of stratified programs.
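As announced above, here is a minimal sketch of this stratification: given a layer assignment for predicates, each rule of the form head ← (q ∧ r) ∨ o is classified according to the restriction it satisfies. The encoding and names are our own illustration.

```python
def classify(rule, layer):
    head, q, r, o = rule
    l = layer[head]
    below = lambda p: layer[p] < l
    if all(below(p) for p in (q, r, o)):
        return "recursive-free"
    if below(o) and all(below(p) or p == head for p in (q, r)):
        return "iso-recursive"
    if all(layer[p] <= l for p in (q, r, o)):
        return "recursive"
    raise ValueError("violates the hierarchical filtration")

layer = {"succ": 0, "zero": 0, "even": 1, "aux": 1}
# even defined from itself and layer-0 predicates: iso-recursive.
print(classify(("even", "even", "succ", "zero"), layer))  # iso-recursive
# aux defined from even, both at layer 1: allowed only in the fully
# recursive hierarchical space.
print(classify(("aux", "even", "succ", "zero"), layer))   # recursive
```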
Proof of Theorem 7.8 Before we present the proof, let us introduce two additional notations. R ∘i R′ refers to the resolvent of R with R′, where the i-th body member of R is resolved with the head of R′. To illustrate it, consider the following meta-rules:

R : P(A) ← Q(A) ∧ R(A, B)
R′ : P(A) ← Q(A) ∧ R(B, A)
R″ : P(A, B) ← Q(B, A)

The resolvent R′ ∘2 R″ corresponds to resolving the second body atom of R′ with R″, which therefore leads to R. We define the operation ρ as the composition of a projection ν, a permutation (σ), and an existential contraction over the second variable (∃), which amounts to:

ρ(R)(X) := (∃ ∘ σ ∘ ν)(R)(X) = ∃Y R(Y)   (24)

Here, we identified the permutation clause (σ) (as defined in (17)) with the operation σ it defines on predicates; similarly for the existential clause (∃) with the contraction operation ∃. Likewise, we can define the corresponding non-connected meta-rule attached to the operation ρ by:

(ρ) : P(X) ← ∃Y R(Y)   (25)

Proof. The inclusion R0[P] ⊆ F^{1,2}_2[P] is trivial, since all the rules instantiated from R0 are definite Horn clauses with two body predicates, and duplicate-free; similarly, adding the duplicate-free constraint to F (and denoting the result FD), we can trivially state R0[P] ⊆ FD^{1,2}_2[P]. It remains to prove the other inclusion; e.g., for (ii), it amounts to R0[P] ⊇ F^{1,2}_2[P], which boils down to R0[P] ⊇ F^{1,2}_{P,2}, since R0[P] is closed under resolution. We proceed by successively extending the sub-fragments M under consideration while proving R0[P] ⊇ M_P. For instance, under the assumption True, Equal ∈ P0, we demonstrate that the hypothesis space generated by R0 encompasses the ones generated by the increasingly larger fragments: (a) U; (b) K; (c) D; (d) C; (e) F. For (i), under the simpler assumption True ∈ P0, we can follow almost identical steps, although the duplicate-free constraint cannot be lifted, and therefore step (b) is skipped. Since the proof of (i) is essentially included in the proof of (ii), we focus below on proving (ii).

For each fragment M, to demonstrate the inclusion, we can simply show that every rule within the fragment M_P belongs to the hypothesis space generated by R0; more specifically, we mostly work at the level of second-order logic: there, we show that any meta-rule in M can be reduced from meta-rules in R0, or generated by R0. However, to avoid listing all the meta-rules generating these fragments and to restrict our enumeration, we leverage several observations: first, since we assume the predicate set includes True, we can restrict our attention to clauses with exactly two body literals; secondly, the permutation meta-rule (σ) (introduced in (17)) can be derived from the proto-rule C upon matching the first body of C with True; therefore, we can examine rules only up to permutation (σ) applied to their body or head predicates.

1. First, let us demonstrate the following inclusion:

Claim 1 : True ∈ P0 ⇒ R0[P] ⊇ U^{1,2}_{P,2}   (26)

By the previous observations, we can narrow down to examining a subset of the meta-rules generating U^{1,2}_2, as listed and justified below. Our claim is that all these clauses are derivationally redundant given the Horn theory generated by R0; for each of them, we therefore make explicit the clauses used to reduce them. (We are allowed to use the clauses (∃) and (σ) in the reductions, as we have already explained why they can be derived from R0 at the beginning of this proof.)

Meta-rule | Reduction | Comment
(i) P(A) ← Q(A) ∧ R(A) | A ∘2 σ | A(Q, σ(R))
(ii) P(A) ← Q(A) ∧ R(A, B) | A ∘2 σ | A(Q, σ(R))
(iii) P(A) ← Q(A, B) ∧ R(B) | A | A(Q, R)
(iv) P(A) ← Q(A, B) ∧ R(B, A) | A | A(Q, R)
(v) P(A) ← Q(A, B) ∧ R(B, C) | A ∘2 ∃ | A(Q, ∃(R))
(vi) P(A, B) ← Q(A) ∧ R(A, B) | C ∘2 σ | C(Q, σ(R))
(vii) P(A, B) ← Q(A, C) ∧ R(C, B) | B | B(Q, R)
(viii) P(A, B) ← Q(A, B) ∧ R(B, A) | C | C(Q, R)
(ix) P(A, B) ← Q(A, B) ∧ R(B, C) | C ∘2 ∃ | C(Q, ∃(R))

Besides the resolvent notation ∘i explained previously, we provide a more intuitive form for each resolution (under "Comment"), in terms of composition between maps, where we identify a proto-rule (or meta-rule) with a map of the form P × P → P.

Let us justify why these clauses are the only clauses in U^{1,2}_2 worth examining, in two steps:

- Consider clauses whose head predicate has arity 1. First, by the Datalog assumption, we can restrict our attention to bodies in which A appears at least once. By symmetry, we can assume A appears in the first body atom. By the connectedness assumption, either A also appears in the second body atom, and/or an existential variable (say B) appears in both the first and second body atoms. Up to permutation, this leads to clauses (i)–(v).
- Consider clauses whose head predicate has arity 2. First, by the Datalog assumption, we can restrict our attention to bodies in which A and B each appear at least once. The only body of arity (1, 1) satisfying this condition is not connected (Q(A) ∧ R(B)); up to permutation of A and B and of the body predicates Q and R, and up to inversion (σ) of the body predicates, the only body of arity (1, 2) is of the form Q(A) ∧ R(A, B) (clause (vi) above, which reduces to C up to permutation); for arity (2, 2), up to similar permutations, we can list three types of bodies, given as (vii), (viii), (ix) above. While (vii) and (viii) directly point to B and C, (ix) can be obtained by reducing the second body of C through (∃) (as defined in (17)). This justifies the remaining clauses (vi)–(ix).

We can therefore conclude Claim 1.

2. Let us extend (26) by enabling duplicates in the clauses:

Claim 2 : True, Equal ∈ P0 ⇒ R0[P] ⊇ K^{1,2}_{P,2}   (27)

To extend the result from U^{1,2}_2 to K^{1,2}_2, it is sufficient to show that the two duplication meta-rules from (17) can be derived from R0, under the extra assumption that Equal is included in the background knowledge. This stems from the fact that these meta-rules can be written as follows:

Meta-rule | Equivalent form
P(A, A) ← Q(A) | P(A, B) ← Q(A) ∧ Equal(A, B)
P(A) ← Q(A, A) | P(A) ← Q(A, B) ∧ Equal(A, B)

Since the equivalent forms above are duplicate-free and two-connected, by (26) these meta-rules can be derived from R0, which implies Claim 2.

3. Removing the assumption of two-connectedness from (27), we claim that the following inclusion holds:

Claim 3 : True, Equal ∈ P0 ⇒ R0[P] ⊇ D^{1,2}_{P,2}   (28)

The additional meta-rules we have to reduce can be narrowed down to:

P(A) ← Q(A, A) ∧ R(A, C);   P(A, B) ← Q(A, A) ∧ R(A, B);   P(A, A) ← Q(A, A) ∧ R(A, B)

By using the reductions for the duplicated forms P(A, A) and Q(A, A) stated previously in step (b), these meta-rules may be derived from R0, and Claim 3 follows.
4. In the next step, we extend (28) to the connected Horn fragment:

Claim 4 : True, Equal ∈ P0 ⇒ R0[P] ⊇ C^{1,2}_{P,2}   (29)

To get rid of the Datalog constraint that each head variable appears in the body, we are brought to examine the following meta-rule:

P(A, B) ← Q(A, C)

This meta-rule may be seen as a subcase of B, once we have matched its second body with True. It ensues that R0 is able to generate the fragment C^{1,2}_2, as stated in Claim 4.

5. Lastly, we can get rid of the connectedness assumption:

Claim 5 : True, Equal ∈ P0 ⇒ R0[P] ⊇ F^{1,2}_{P,2}   (30)

Up to symmetries and permutations, we are brought to examine the following non-connected meta-rules:

Meta-rule | Reduction | Comment
(i) P(A) ← Q(A) ∧ R(B) | A | A(Q, R)
(ii) P(A) ← Q(A, B) ∧ R(C) | A ∘1 ∃ | A(∃(Q), R)
(iii) P(A) ← Q(A, B) ∧ R(C, D) | (A ∘1 ∃) ∘2 ∃ | A(∃(Q), ∃(R))
(iv) P(A, B) ← Q(A, B) ∧ R(C) | C ∘2 ρ | C(Q, ρ(R))
(v) P(A, B) ← Q(A) ∧ R(B, C) | B ∘2 σ | B(Q, σ(R))
(vi) P(A, B) ← Q(A, B) ∧ R(C, D) | C ∘2 (ρ ∘ ∃) | C(Q, ρ(∃(R)))
(vii) P(A, B) ← Q(A, C) ∧ R(B, D) | (C ∘1 ∃) ∘2 ∃ | C(∃(Q), ∃(R))

Note that the clause/operation ρ, defined in (24)–(25), enables us to express the clause R(C) (resp. R(C, D)) appearing in the body of (iv) (resp. (vi)) as ρ(R)(B) (composing with (∃) for (vi)). Since, by definition, (ρ) can be derived from R0, we can thereupon conclude that all the above meta-rules are reducible to R0, which concludes the proof. We have therefore proven the following inclusions:

True, Equal ∈ P0 ⇒ R0[P] = F^{1,2}_{P,2} ⊇ C^{1,2}_{P,2} ⊇ D^{1,2}_{P,2} ⊇ K^{1,2}_{P,2} ⊇ U^{1,2}_{P,2}

B. ILP Experiment Results

B.1. EXTRACTED INTERPRETABLE SOLUTIONS

We give a detailed description of the ILP tasks in our experiments, including the target, background knowledge, positive/negative examples, and the solution learned by our method for each task. Note that the solution for Fizz is missing, since our current approach does not solve it with the generic proto-rules in R.

Predecessor The goal of this task is to learn the predecessor(X, Y) relation from examples. The background knowledge is the set of facts defining the predicate zero and the successor relation succ: B = {True, False, zero(0), succ(0, 1), succ(1, 2), …}. The set of positive examples is P = {target(1, 0), target(2, 1), target(3, 2), …} and the set of negative examples is N = {target(X, Y) | (X, Y) ∈ {0, 1, …}²} \ P. Among these examples, target is the name of the target predicate to be learned; in this task, target = predecessor. We use fixed training data for this task given the range of integers. One solution found by our method is:

target(X, Y) ← succ(X, Y).

Undirected Edge A graph is represented by a set of edge(X, Y) atoms, which define the existence of an edge from node X to node Y. The goal of this task is to learn the undirected-edge(X, Y) relation from examples. This relation defines the existence of an edge between nodes X and Y, regardless of the direction of the edge. An example set of background knowledge is B = {True, False, edge(a, b), edge(b, c)}. The corresponding set of positive examples is P = {target(a, b), target(b, a), target(b, c), target(c, b)} and the set of negative examples is N = {target(a, c), target(c, a)}. We use randomly generated training data for this task given the number of nodes in the graph. One solution found by our method is:

target(X, Y) ← (aux1(X, Y) ∧ edge(Y, X)) ∨ edge(X, Y),
aux1(X, Y) ← edge(X, Y),

where aux1 is an invented auxiliary predicate.

Less Than The goal of this task is to learn the less-than(X, Y) relation, which is true if X is less than Y. Here the background knowledge is the same as that of the Predecessor task. The set of positive examples is P = {target(X, Y) | X < Y} and the set of negative examples is N = {target(X, Y) | X ≥ Y}. We use fixed training data for this task given the range of integers. One solution found by our method is:

target(X, Y) ← (target(X, Z) ∧ target(Z, Y)) ∨ succ(X, Y).
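To make the semantics of this recursive solution concrete, here is a minimal bottom-up fixpoint evaluation (our own illustration) of the Less Than rule target(X, Y) ← (target(X, Z) ∧ target(Z, Y)) ∨ succ(X, Y) over the integers 0..9: the least fixpoint is exactly the strict order.

```python
N = 10
succ = {(x, x + 1) for x in range(N - 1)}

target = set()
while True:
    new = succ | {(x, y) for (x, z) in target
                  for (z2, y) in target if z == z2}
    if new == target:       # least fixpoint reached
        break
    target = new

assert target == {(x, y) for x in range(N) for y in range(N) if x < y}
```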
Member The goal of this task is to learn the member(X, Y) relation, which is true if X is an element of list Y. Elements of a list are encoded with two relations, cons and value, where cons(X, Y) holds if Y is the node after X, with the null node 0 being the termination of lists, and value(X, Y) holds if Y is the value of node X. Take the list [3, 2, 1] as an example. The corresponding set of positive examples is P = {target(3, [3, 2, 1]), target(2, [3, 2, 1]), target(1, [3, 2, 1]), target(2, [2, 1]), target(1, [2, 1]), target(3, [3, 2]), target(2, [3, 2]), target(3, [3, 1]), target(1, [3, 1])} and the set of negative examples is N = {target(3, [2, 1]), target(1, [3, 2]), target(2, [3, 1])}. We use randomly generated training data for this task given the length of the list. One solution found by our method is:

target(X, Y) ← (target(X, Z) ∧ aux1(Z, Y)) ∨ value(X, Y),
aux1(X, Y) ← (cons(Y, X) ∧ True) ∨ False,

where aux1 is an invented auxiliary predicate.

Connectedness The goal of this task is to learn the connected(X, Y) relation, which is true if there is a sequence of edges connecting nodes X and Y. An example set of background knowledge is B = {True, False, edge(a, b), edge(b, c), edge(c, d)}. The corresponding set of positive examples is P = {target(a, b), target(b, c), target(c, d), target(a, c), target(a, d), target(b, d)} and the set of negative examples is N = {target(b, a), target(c, b), target(d, c), target(d, a), target(d, b), target(c, a)}. We use randomly generated training data for this task given the number of nodes in the graph. One solution found by our method is:

target(X, Y) ← (target(X, Z) ∧ target(Z, Y)) ∨ edge(X, Y).

Son The goal of this task is to learn the son-of(X, Y) relation, which is true if X is the son of Y. The background knowledge consists of various facts about a family tree, containing the relations father-of, brother-of and sister-of. An example set of background knowledge is B = {True, False, father(a, b), father(a, c), father(a, d), brother(b, c), brother(d, c), sister(c, b)}. The corresponding set of positive examples is P = {target(b, a), target(d, a)} and the set of negative examples N is a subset of all ground atoms involving the target predicate that are not in P. We use randomly generated training data for this task given the number of nodes in the family tree. One solution found by our method is:

target(X, Y) ← (aux1(X, Y) ∧ father(Y, X)) ∨ False,
aux1(X) ← (father(X, Z) ∧ True) ∨ brother(X, T),

where aux1 is an invented auxiliary predicate (unary, and implicitly projected to arity 2 in the first rule).

Grandparent The goal of this task is to learn the grandparent(X, Y) relation, which is true if X is the grandparent of Y. The background knowledge consists of various facts about a family tree, containing the relations father-of and mother-of. An example set of background knowledge is B = {father(c, b), father(b, a), mother(d, b), mother(e, a), True, False}. The corresponding set of positive examples is P = {target(c, a), target(d, a)} and the set of negative examples N is a subset of all ground atoms involving the target predicate that are not in P. We use randomly generated training data for this task given the number of nodes in the family tree.
One solution found by our method is:

target(X, Y) ← (aux1(X, Z) ∧ aux1(Z, Y)) ∨ False,
aux1(X, Y) ← (mother(X, Y) ∧ True) ∨ father(X, Y),

where aux1 is an invented auxiliary predicate.

Adjacent to Red In this task, nodes of the graph are colored either green or red. The goal of this task is to learn the is-adjacent-to-a-red-node(X) relation, which is true if node X is adjacent to a red node. Besides the relation edge, the background knowledge also consists of facts for the relations colour and red, where colour(X, C) holds if node X has colour C, and red(C) holds if colour C is red. An example set of background knowledge is B = {True, False, edge(a, b), edge(b, a), edge(d, e), edge(d, f), colour(a, p), red(p), colour(d, q), red(q)}. The corresponding set of positive examples is P = {target(b), target(e), target(f)} and the set of negative examples N is a subset of all ground atoms involving the target predicate that are not in P. We use randomly generated training data for this task given the number of nodes in the graph. One solution found by our method is:

target(X) ← (edge(X, Z) ∧ aux1(Z, X)) ∨ False,
aux1(X) ← (colour(X, Z) ∧ red(Z, X)) ∨ False,

where aux1 is an invented auxiliary predicate.

Two Children The goal of this task is to learn the has-at-least-two-children(X) relation, which is true if node X has at least two child nodes. Besides the relation edge, the background knowledge also consists of facts of the not-equal relation neq, where neq(X, Y) holds if node X is not equal to node Y. An example set of background knowledge is B = {True, False, edge(a, b), edge(a, c), edge(c, d), neq(a, b), neq(a, c), neq(a, d), neq(b, c), neq(b, d), neq(c, d)}. The corresponding set of positive examples is P = {target(a)} and the set of negative examples N is a subset of all ground atoms involving the target predicate that are not in P. We use randomly generated training data for this task given the number of nodes in the graph. One solution found by our method is:

target(X) ← (aux1(X, Z) ∧ edge(Z, X)) ∨ False,
aux1(X, Y) ← (edge(X, Z) ∧ neq(Z, Y)) ∨ False,

where aux1 is an invented auxiliary predicate.

Relatedness The goal of this task is to learn the related(X, Y) relation, which is true if two nodes X and Y have any family relation. The background knowledge is parent(X, Y), which holds if X is Y's parent. An example set of background knowledge is B = {True, False, parent(a, b), parent(a, c), parent(c, e), parent(c, f), parent(d, c), parent(g, h)}. The corresponding set of positive examples is P = {target(a, b), target(a, c), target(a, d), target(a, e), target(a, f), target(b, a), target(b, c), target(b, d), target(b, e), target(b, f), target(c, a), target(c, b), target(c, d), target(c, e), target(c, f), target(d, a), target(d, b), target(d, c), target(d, e), target(d, f), target(e, a), target(e, b), target(e, c), target(e, d), target(e, f), target(f, a), target(f, b), target(f, c), target(f, d), target(f, e), target(g, h), target(h, g)} and the set of negative examples N is a subset of all ground atoms involving the target predicate that are not in P. We use randomly generated training data for this task given the number of nodes in the family tree. One solution found by our method is:

target(X, Y) ← (target(X, Z) ∧ aux1(Z, Y)) ∨ parent(X, Y),
aux1(X, Y) ← (target(X, Y) ∧ target(Y, X)) ∨ parent(X, Y),

where aux1 is an invented auxiliary predicate.
Cyclic The goal of this task is to learn the is-cyclic(X) relation, which is true if there is a path, i.e., a sequence of edge connections, from node X back to itself. An example set of background knowledge is B = {True, False, edge(a, b), edge(b, a), edge(d, c), edge(d, b)}. The corresponding set of positive examples is P = {target(a), target(b)} and the set of negative examples N is a subset of all ground atoms involving the target predicate that are not in P. We use randomly generated training data for this task given the number of nodes in the graph. One solution found by our method is:

target(X) ← (aux1(X, Z) ∧ aux1(Z, X)) ∨ False,
aux1(X, Y) ← (aux1(X, Z) ∧ edge(Z, Y)) ∨ edge(X, Y),

where aux1 is an invented auxiliary predicate.

Graph Coloring The goal of this task is to learn the adj-to-same(X, Y) relation, which is true if nodes X and Y are of the same colour and there is an edge connection between them. The background knowledge consists of facts about a coloured graph containing the relations edge and colour, similar to those in the task Adjacent to Red. An example set of background knowledge is B = {True, False, edge(a, b), edge(b, a), edge(b, c), edge(a, d), colour(a, p), colour(b, p), colour(c, q), colour(d, q)}. The corresponding set of positive examples is P = {target(a, b), target(b, a)} and the set of negative examples N is a subset of all ground atoms involving the target predicate that are not in P. We use randomly generated training data for this task given the number of nodes in the graph. One solution found by our method is:

target(X, Y) ← (edge(X, Z) ∧ aux1(Z, X)) ∨ False,
aux1(X, Y) ← (colour(X, Z) ∧ colour(Z, Y)) ∨ False,

where aux1 is an invented auxiliary predicate.

Length The goal of this task is to learn the length(X, Y) relation, which is true if the length of list X is Y. Similar to the task Member, elements of a list are encoded with two relations, cons and succ, where cons(X, Y) holds if Y is the node after X, with the null node 0 being the termination of lists, and succ(X, Y) holds if Y is the next value of integer X. Moreover, this task adds zero(X) as another background predicate, which is true if X is 0. Take the list [3, 2, 1] as an example, and suppose node 0 is the end of a list. The background knowledge is B = {True, False, zero(0), succ(0, 1), succ(1, 2), succ(2, 3)}. The corresponding set of positive examples is P = {target([3, 2, 1], 3), target([2, 1], 2), target([1], 1)}. We use randomly generated training data for this task given the number of nodes in a list.

Even-Odd The goal of this task is to learn the even(X) relation, which is true if value X is an even number. The background knowledge includes two predicates: zero(X), which is true if X is 0, and succ(X, Y), which is true if Y is the next value of X. An example set of background knowledge is B = {True, False, zero(0), succ(0, 1), succ(1, 2), …}. The corresponding set of positive examples is P = {target(0), target(2), target(4), …} and the set of negative examples is N = {target(1), target(3), target(5), …}. Once the number of constants is given, the dataset is deterministic. One solution found by our method is:

target(X) ← (zero(X) ∧ aux1(Z, X)) ∨ zero(X),
aux1(X, Y) ← (aux1(X, Z) ∧ aux1(Z, Y)) ∨ aux2(X, Y),
aux2(X, Y) ← (succ(X, Z) ∧ succ(Z, Y)) ∨ False,

where aux1 and aux2 are invented auxiliary predicates.
Even-Succ2 Even-Succ has the same background and target predicates as Even-Odd. In Campero et al. (2018), two different template sets are provided, tailored to Even-Succ and to Even-Odd respectively. In our approach, however, this difference is no longer relevant: since we provide a uniform template set, these two tasks become identical.

Buzz The goal of this task is to learn the buzz(X) relation, which is true if value X is divisible by 5. The background knowledge consists of 4 predicates: zero(X) is true if X is 0, succ(X, Y) is true if Y is the next value of X, pred1(X, Y) is true if Y = X + 3, and pred2(X, Y) is true if Y = X + 2. An example set of background knowledge is B = {True, False, zero(0), succ(0, 1), succ(1, 2), …, pred1(0, 3), pred1(1, 4), …, pred2(0, 2), pred2(1, 3), …}. The corresponding set of positive examples is P = {target(0), target(5), …} and the set of negative examples is N = {target(1), target(2), target(3), target(4), target(6), target(7), …}. Once the number of constants is given, the dataset is deterministic. One solution found by our method is:

target(X) ← (aux1(X, Z) ∧ pred2(Z, X)) ∨ zero(X),
aux1(X, Y) ← (aux2(X, Z) ∧ pred1(Z, Y)) ∨ False,
aux2(X, Y) ← (True ∧ aux3(Y)) ∨ zero(X),
aux3(X) ← (aux1(X, Z) ∧ pred2(Z, X)) ∨ zero(X),

where aux1, aux2 and aux3 are invented auxiliary predicates.

Fizz The goal of this task is to learn the fizz(X) relation, which is true if value X is divisible by 3. The background knowledge is the same as in the Even-Odd task. The corresponding set of positive examples is P = {target(0), target(3), target(6), …} and the set of negative examples is N = {target(1), target(2), target(4), target(5), …}. Once the number of constants is given, the dataset is deterministic.

B.2. FULL ILP EXPERIMENTAL RESULTS

In Table 6, we provide the results for all the ILP tasks.

Table 6. Percentage of successful runs among 10 runs. |I| is the smallest number of intensional predicates needed. "Recursive" indicates whether or not the solution requires learning recursive rules.

Task | |I| | Recursive | ∂ILP | LRI | Ours (train) | Ours (soft eval) | Ours (symbolic eval)
Predecessor | 1 | No | 100 | 100 | 100 | 100 | 100
Undirected Edge | 1 | No | 100 | 100 | 100 | 100 | 100
Less than | 1 | Yes | 100 | 100 | 100 | 100 | 100
Member | 1 | Yes | 100 | 100 | 100 | 100 | 100
Connectedness | 1 | Yes | 100 | 100 | 100 | 100 | 100
Son | 2 | No | 100 | 100 | 100 | 100 | 100
Grandparent | 2 | No | 96.5 | 100 | 100 | 100 | 100
Adjacent to Red | 2 | No | 50.5 | 100 | 100 | 100 | 100
Two Children | 2 | No | 95 | 0 | 100 | 100 | 100
Relatedness | 2 | Yes | 100 | 100 | 100 | 100 | 100
Cyclic | 2 | Yes | 100 | 100 | 100 | 100 | 100
Graph Coloring | 2 | Yes | 94.5 | 0 | 100 | 100 | 100
Length | 2 | Yes | 92.5 | 100 | 20 | 0 | 0
Even-Odd | 2 | Yes | 100 | 100 | 40 | 40 | 40
Even-Succ2 | 2 | Yes | 48.5 | 100 | 40 | 40 | 40
Buzz | 2 | Yes | 35 | 70 | 100 | 40 | 40
Fizz | 3 | Yes | 10 | 10 | 0 | 0 | 0

About task Length: The deceiving performance for task Length in this table can be easily explained. In theory, our model corresponds to rule induction with function-free definite Horn clauses; yet, de facto, in the recursive case, both the number of layers and the number of auxiliary predicates instantiated per proto-rule per layer define the actual expressivity of the model. The number of auxiliary predicates instantiated per proto-rule per layer is by default set to 1 in all our experiments, as we intended to share the same hyperparameter set across all tasks; however, to have a chance to solve the task Length, this number should be increased to 2, in order to widen the hypothesis space.

About task Even-Odd: As mentioned above, task Even-Odd corresponds, for our method, to the same task as task Even-Succ2, as the target predicate is identical.
This is not the case for other ILP approaches like ∂ILP or LRI, as they hand-engineer a different template set for each of these two tasks, corresponding to the desired auxiliary predicates.

B.3. LIMITATIONS OF LRI (CAMPERO ET AL., 2018)

Using LRI, if we gather all the templates needed for all the ILP tasks of Table 6 into a template set RLRI, we obtain 18 templates. Table 7 demonstrates how the evaluation success rate decreases when LRI is trained with RLRI. While LRI can obtain good results, as indicated in Table 6, if provided with meta-rules customized for each task, its performance quickly degrades on hard tasks when provided with a set of more generic meta-rules. This phenomenon is more accentuated for more complex tasks. For easy tasks, like Predecessor, the performance does not change.

Table 7. LRI's performance with an increased number of rule templates, measured in percentage of successful runs over 10 runs using soft evaluation.

Task | LRI with specific templates | LRI with RLRI
Predecessor | 100 | 100
Undirected Edge | 100 | 80
Less than | 100 | 100
Member | 100 | 30
Connectedness | 100 | 100
Son | 100 | 80
Grandparent | 100 | 90
Adjacent to Red | 100 | 0
Two Children | 0 | 0
Relatedness | 100 | 100
Cyclic | 100 | 0
Graph Coloring | 0 | 0
Length | 100 | 50
Even-Odd | 100 | 20
Even-Succ2 | 100 | 10
Buzz | 70 | 0
Fizz | 10 | 0

B.4. COMPARISON WITH TRADITIONAL ILP METHODS

Large dataset We test whether traditional ILP methods can handle large datasets with a large number of objects. As far as we know, traditional ILP methods cannot process input knowledge in mini-batches: users must provide all background knowledge and examples to the solver before it starts running. Moreover, they do not utilize GPUs to speed up solving. In contrast, HRI can sample a small subgraph from the large dataset in each training iteration, and can easily make use of a GPU to accelerate training; it can thus scale better to large datasets (see the sampling sketch below). To verify this, we compare HRI with Popper and ILASP3. For a fair comparison, we try to give the two baseline methods a hypothesis space similar to HRI's, including the hierarchical bias and meta-rule templates. However, once given a hypothesis space similar to HRI's, ILASP3 cannot find a correct solution within 30 minutes, even for a graph with 10 objects in the Grandparent task; we therefore did not test it further on larger datasets with such a general bias. As shown in Figure 4, Popper needs much more time than HRI for the Grandparent task, which requires predicate invention. Although Popper performs better on the Undirected Edge task, note that this is a very simple task, which does not require predicate invention, and Popper can verify the solution straightforwardly once the corresponding meta-rules are given. Besides, for Member and Connectedness with 10,000 objects, Popper cannot find correct solutions within 30 minutes, while HRI succeeds within 5 minutes.
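As announced above, here is a minimal sketch (our own illustration, not the HRI code) of the mini-batch strategy: each training iteration keeps only the background facts within a few hops of the constants appearing in a handful of sampled target examples.

```python
def sample_subgraph(facts, seeds, num_hops=2):
    """Keep only the facts whose constants are within `num_hops` of the
    seed constants, yielding a small batch of background knowledge."""
    kept = set(seeds)
    for _ in range(num_hops):
        kept |= {c for (_, args) in facts for c in args
                 if any(a in kept for a in args)}
    return [(p, args) for (p, args) in facts if all(a in kept for a in args)]

# Toy knowledge graph: a long chain of parent facts over many constants.
facts = [("parent", (f"x{i}", f"x{i+1}")) for i in range(10_000)]
positives = [("target", ("x17", "x19"))]
seeds = {c for (_, args) in positives for c in args}
batch = sample_subgraph(facts, seeds)
print(len(batch), "facts in the mini-batch instead of", len(facts))
```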
Noisy Data Finally, as mentioned in the main paper, we compared HRI with ILASP3 (Law et al., 2018) and ∂ILP (Evans & Grefenstette, 2018) in a noisy data setting, by ranging the ratio of mislabeled target training examples from 0% to 90%, as shown in Figure 2 (the experimental results of ILASP3 and ∂ILP come from their own papers (Law et al., 2018; Evans & Grefenstette, 2018)). We also ran an additional comparison with ILASP4 (Law et al., 2020), a successor of ILASP3 that is computationally less expensive, since it only considers necessary (and not sufficient) constraints. Since we did not find any reference for the hyperparameters and biases chosen for ILASP4, we used ones similar to those the authors indicate for ILASP3 (Law et al., 2018). We used a time limit of 300s and added 2 to their maxbl parameter to increase the hypothesis space, although similar results were obtained with an identical maxbl. Based on these parameters, the preliminary results displayed in Figure 5 (averaged over 5 runs) suggest that ILASP4 is quite sensitive to noise. This should be further investigated, with different parameters and language biases, since we may have missed the best settings.

Figure 4. Time (s) to find a correct solution with respect to the number of objects in the dataset (10,000 to 50,000), for HRI (CPU and GPU) and Popper: (a) Grandparent, (b) Undirected Edge.

Figure 5. Mean Squared Error (MSE) w.r.t. different noise ratios (0.0 to 0.9), comparing ILASP3, ∂ILP, HRI, and ILASP4 on four tasks, including (a) Predecessor, (c) Connectedness, and (d) Undirected Edge.

C. Visual Genome Experiments

For these experiments, we use a dataset called GQA (Hudson & Manning, 2019b), a preprocessed version of the Visual Genome dataset (Krishna et al., 2017), since the original is commonly considered too noisy (Zellers et al., 2018). In GQA, the original scene graphs have been converted into a collection of KGs, leading to 1.9M facts, 1.4M constants, and 2100 predicates. Following (Yang & Song, 2020), we filter this dataset to remove the predicates that appear fewer than 1500 times and to focus on the top 150 objects. In the multi-task setting, we train 150 models to provide a logic explanation for each of the top 150 objects. For the evaluation metrics, we use recall@1 (R@1) and recall@5 (R@5), computed on a held-out set. R@k measures the fraction of ground-truth atoms that appear among the top k most confident predictions in an image (see the sketch below). In our model, even though all the models are instantiated with the same maximum number of layers nL, the trained models may have different numbers of layers. Indeed, since each auxiliary predicate at layer ℓ may be formed with predicates from any layer 0 to ℓ (in P≤ℓ), the trained model may contain 1 to nL layers (without counting the target predicate). Given the soft unification computation in (11), a trained model with a larger number of layers has a tendency to output smaller values. Therefore, to make the outputs of all the models comparable, we apply a simple L2 normalization before comparing the outputs of the trained models when computing the evaluation metrics. After training HRI, we compare the semantic space underlying the embeddings. More precisely, we use cosine similarity to measure distances between all pairs of background embeddings, sort them, and select the top 10 closest pairs. As Table 8 suggests, the fine-tuned embedding pairs have proximities akin to the GPT2-initialized ones; this initialization choice may therefore help the performance to some extent.
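As announced, here is a minimal sketch of the recall@k metric as described above; the atoms and confidence scores are made up for illustration.

```python
def recall_at_k(scores, ground_truth, k):
    """scores: dict atom -> confidence, for one image."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(ground_truth & set(top_k)) / len(ground_truth)

scores = {("person", "o1"): 0.9, ("bench", "o2"): 0.7,
          ("wearing", ("o1", "o3")): 0.6, ("logo", "o4"): 0.2}
ground_truth = {("person", "o1"), ("logo", "o4")}
print(recall_at_k(scores, ground_truth, k=1))  # 0.5
print(recall_at_k(scores, ground_truth, k=5))  # 1.0
```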
Table 8. Top 10 closest embedding pairs, from GPT2 and from our trained multi-task model.

GPT2: (bag, backpack), (arrow, apple), (airplane, air), (backpack, airplane), (apple, airplane), (animal, airplane), (arrow, animal), (at, above), (apple, animal), (bag, airplane)
Fine-tuned: (bag, backpack), (bag, airplane), (arrow, apple), (airplane, air), (arm, air), (backpack, airplane), (air, above), (apple, air), (arrow, animal), (arm, airplane)

Table 9 shows some extracted learned rules for the multi-task model.

Table 9. Examples of extracted rules from the multi-task model.

wrist: wrist(X) ← (watch(Y) ∧ on(Y, X)) ∨ False
person: person(X) ← (on(X, Y) ∧ bench(Y)) ∨ wearing(X, T)
vase: vase(X) ← (aux(X, Y) ∧ flowers(Y)) ∨ False;  aux(X, Y) ← (True ∧ on(Y, X)) ∨ with(X, Y)
logo: logo(X) ← (aux0(X) ∧ True) ∨ aux1(X);  aux1(X) ← (on(X, Y) ∧ laptop(Y)) ∨ False;  aux0(X) ← (on(X, Y) ∧ kite(Y)) ∨ False

D. Design Choices

Operation Choices In our implementation, the default choice is sum for POOL, min for AND, and max for OR. In Table 10, we experimentally compare different choices for these operations: the first column shows the results with our default choices, while the other columns show the results obtained by using max for POOL, product for AND, and probabilistic sum for OR (probsum, defined by probsum(v1, v2) = v1 + v2 − v1·v2).

Table 10. Percentage of successful runs among 10 runs using soft evaluation, obtained by models trained with different implementations for POOL, MERGE, AND, OR.

Task | default | POOL-max | AND-product | OR-probsum
Adjacent To Red | 100 | 20 | 100 | 100
Member | 100 | 40 | 100 | 90
Cyclic | 100 | 50 | 90 | 100
Two Children | 100 | 70 | 80 | 80

Ablation Study Table 11 shows an ablation study of the hierarchical and incremental biases on a few ILP tasks, with or without the adoption of proto-rules. In the no-proto-rules case, we use the set of all templates used in LRI (Campero et al., 2018); the first line thus corresponds to the LRI model with a unified set of templates, and the last line to our full model. As displayed in the table, there is a clear improvement in performance upon adoption of these different biases.

Table 11. Percentage of successful runs among 10 runs; ablation study of the incremental and hierarchical biases.

Setting | Adjacent To Red | Member | Cyclic | Two Children
LRI templates (no proto-rules) | 0 | 30 | 0 | 0
proto-rules, without the full biases | 60 | 10 | 70 | 80
proto-rules + incremental + hierarchical | 100 | 100 | 100 | 100

Extended Gumbel Noise We adopt an extended version of the Gumbel noise, defined as follows:

G = −g1 · log(−log(g2 · U)),

where U is sampled from a uniform distribution U([0, 1]), and with a Gumbel scale g1 ∈ (0, +∞) and a Gumbel factor g2 ∈ (0, +∞). As illustrated in Table 12, using a linearly-decaying Gumbel factor and a fixed Gumbel scale (instead of a decaying Gumbel scale and a fixed Gumbel factor, as commonly adopted) was experimentally beneficial for the tasks we considered. The last row of Table 12 shows the significance of adding Gumbel noise for hard tasks like Member or Two Children.

Table 12. Percentage of successful runs among 20 runs for some ILP tasks, with different settings of the extended Gumbel noise (top block: decaying Gumbel factor g2, fixed scale; bottom block: decaying Gumbel scale g1, fixed factor; last row: no noise).

g1 | g2 | decay mode | Adjacent to Red | Member | Cyclic | Two Children
 | 0.3 | linear | 100 | 100 | 100 | 100
 | 0.5 | linear | 100 | 95 | 100 | 95
 | 0.3 | exponential | 100 | 85 | 95 | 80
 | 0.5 | exponential | 60 | 95 | 95 | 40
1 |  | linear | 95 | 85 | 85 | 65
0.5 |  | linear | 65 | 75 | 35 | 25
0.3 |  | exponential | 30 | 80 | 50 | 0
0.5 |  | exponential | 0 | 0 | 0 | 0
0 |  | \ | 60 | 20 | 75 | 30

In our model, this noise term is added to the cosine-similarity scores in (13), before applying a softmax, in order to yield the unification scores. More precisely:

α_{P_i B_i} = exp[(cos(θ_{P_i}, θ_{B_i}) + G_{(i,i)}) / τ] / Σ_j exp[(cos(θ_{P_j}, θ_{B_i}) + G_{(j,i)}) / τ]   (31)
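Here is a minimal sketch of the extended Gumbel noise and of the unification scores in (31); the shapes and the linear decay schedule for g2 are our reading of the text, and the defaults g1 = 1.0, g2 = 0.3, τ = 0.1 follow Table 13.

```python
import numpy as np

def extended_gumbel(shape, g1, g2, rng):
    u = rng.uniform(0.0, 1.0, size=shape)
    return -g1 * np.log(-np.log(g2 * u))

def unification_scores(pred_emb, body_emb, g1=1.0, g2=0.3, tau=0.1, rng=None):
    """Softmax over predicates j of cosine similarity + Gumbel noise."""
    rng = rng or np.random.default_rng(0)
    pred = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    body = body_emb / np.linalg.norm(body_emb, axis=1, keepdims=True)
    cos = pred @ body.T                           # cos(theta_Pj, theta_Bi)
    logits = (cos + extended_gumbel(cos.shape, g1, g2, rng)) / tau
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=0, keepdims=True)  # column-wise softmax

# Linear decay of the Gumbel factor over training, as in the experiments:
# g2_t = g2_0 * (1 - t / T), e.g., 0.3 * (1 - 3999/4000) = 7.5e-5.
alphas = unification_scores(np.random.randn(20, 30), np.random.randn(4, 30))
print(alphas.shape, alphas.sum(axis=0))  # (20, 4); each column sums to 1
```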
Let us examine the behavior of this decay to justify our approach. On the one hand, making g2 tend towards 0 would make the Gumbel term G tend towards −∞, which seems unreasonable. However, if we keep g2 above a moderately small value (for instance 10⁻⁴ after 3000 iteration steps in our experiments, with a linear decay), reducing g2 amounts to reducing the range and variance of the applied noise. In our context, note that, because of the softmax, only this range matters for the unification scores to be affected. Let us point out that, at the extreme, adding a constant noise term to each of the similarity scores (as in the evaluation task) would not affect the unification scores (equivalent to G being 0). For instance, with g2 going from 0.3 to 0.3·(1 − 3999/4000), the range of the Gumbel noise G roughly goes (for U between 0.0001 and 1) from [−2.543, −0.186] to [−3.045, −2.251]. In that sense, our choice encourages the similarity scores to be sufficiently different and, with g2 small enough (> 0), does not affect the unification scores, similarly to a noise going to 0.

E. Hyperparameters

We list the relevant generic and task-specific hyperparameters used for our training method in Tables 13 and 14, respectively. In Table 14, the hyperparameters train-num-constants and eval-num-constants represent the number of constants during training and evaluation, respectively. We keep the values of these two hyperparameters for each task the same as those used in Campero et al. (2018). Note that our model does not require actual knowledge of the depth of the solution; the max-depth parameter simply has to be an upper bound. Although this parameter could be reduced for simpler tasks, we set max-depth = 4 for all tasks to make our training method more generic.

Table 13. Generic hyperparameters for all tasks.

Hyperparameter | Value
recursivity | full
fuzzy-and | min
fuzzy-or | max
similarity | cosine
lr | 0.01
lr-rules | 0.03
temperature | 0.1
Gumbel-Scale | 1.0
Gumbel-Factor | 0.3
Gumbel-factor-decay-mode | linear

Table 14. Specialized hyperparameters for each ILP task.

Task | max-depth | train-steps | eval-steps | train-num-constants | eval-num-constants
Predecessor | 4 | 2 | 4 | 10 | 14
Undirected Edge | 4 | 2 | 2 | 4 | 6
Less than | 4 | 12 | 12 | 10 | 12
Member | 4 | 12 | 12 | 5 | 7
Connectedness | 4 | 4 | 4 | 5 | 5
Son | 4 | 4 | 4 | 9 | 10
Grandparent | 4 | 4 | 4 | 9 | 11
Adjacent to Red | 4 | 4 | 4 | 7 | 9
Two Children | 4 | 4 | 5 | 5 | 7
Relatedness | 4 | 10 | 12 | 8 | 10
Cyclic | 4 | 4 | 4 | 6 | 7
Graph Coloring | 4 | 4 | 4 | 8 | 10
Even-Odd | 4 | 6 | 8 | 11 | 15
Even-Succ2 | 4 | 6 | 8 | 11 | 15
Buzz | 4 | 8 | 10 | 11 | 16

In multi-task GQA, we used the same generic hyperparameters as given in Table 13, except that we disallowed recursivity, decreased the learning rate (lr) to 0.001 for the background embeddings, and decreased the rule learning rate (lr-rules) to 0.01 for the intensional embeddings. Other specific hyperparameters are given in Table 15.

Table 15. Multi-task GQA hyperparameters.

Hyperparameter | Value
Max depth | 3
Train-steps | 4
Embedding-dim | 30
Train-iterations nT | 3000
Train-num-positive-instances np | 5
Train-num-random-instances nr | 5

In reinforcement learning, we used the same generic hyperparameters as given in Table 13, except that we did not use recursivity and decreased the learning rate to 0.005. Additional hyperparameters are given in Table 16.
Table 16. Reinforcement learning hyperparameters.

Hyperparameter | Value
Max depth | 6
Train-steps | 6
Train-num-constants | 5
Temperature of softmax policy | 0.01
GAE λ | 0.9
γ | 0.99
Trajectories per update | 5
PPO ϵ-clipping | 0.2
GRU hidden neurons (critic) | 64

Sensitivity to Hyperparameters To evaluate the sensitivity of our model to hyperparameters, we tested other hyperparameter choices on some ILP tasks; the default column of Table 17 refers to the results obtained with the default hyperparameters presented in Tables 13 and 14. As Table 17 suggests, both smaller inference steps and smaller depth affect the performance; this performance decrease is natural, since it narrows down the hypothesis space, which may then no longer contain the solution needed for the task. However, we can appreciate that larger inference steps or depth usually do not affect the performance much (up to a limit). We can also notice that cosine similarity performs better than the other similarity functions; naturally, restricting or excluding recursivity performs well compared with full recursivity, unless the task requires it.

Table 17. Percentage of successful runs among 10 runs using soft evaluation, obtained by models trained with different hyperparameter choices. Here, st and se are the default inference steps used in training and evaluation, shown in Table 14.

Task | default | (st−2, se−2) | (st−1, se−1) | (st+1, se+1) | (st+2, se+2)
Adjacent to Red | 100 | 90 | 100 | 100 | 100
Member | 100 | 100 | 100 | 100 | 100
Cyclic | 100 | 90 | 90 | 90 | 100
Two Children | 100 | 80 | 100 | 100 | 100

Task | Similarity (L1, L2, scalar-product) | Recursivity (none, iso-recursive) | Max-depth (1, 2, 3, 5)
Adjacent to Red | 10, 80, 40 | 100, 100 | 0, 90, 100, 100
Member | 60, 100, 80 | 0, 100 | 50, 100, 100, 100
Cyclic | 60, 80, 50 | 90, 100 | 10, 60, 80, 90
Two Children | 0, 60, 10 | 90, 100 | 0, 70, 70, 100

F. Limitations and Potentials of HRI

Expressivity Limitations As made explicit in Theorem 7.8, our model corresponds to rule induction with function-free definite Horn clauses. Of course, the number of layers affects the actual expressivity of each model; in the non-recursive case, this parameter is sufficient to reach all the function-free recursion-free definite Horn clauses. In the recursive case, the number of rules instantiated from the proto-rule set (by default set to 1) also affects the expressivity; this may be seen in the ILP task Length, where two rules should be instantiated per proto-rule to sufficiently widen the hypothesis space. The expressivity of our model could be extended in different ways, which are more or less computationally expensive, and therefore more or less judicious. The points below would be the object of further investigation and experiments; for this reason, we only share here some succinct comments on different ways to gain expressivity:

+ Enabling negations.
+ Enabling functions.
+ Enabling zero-ary predicates: We could easily extend our result to include 0-ary predicates.

Claim: By adding the zero-ary proto-rule Z to R0, the hypothesis space reaches the fragment F^{0,1,2}_{P,2},   (32)

where Z : P() ← Q(A, B).

+ Enabling more body atoms. (For instance, the language bias present in Galárraga et al. (2013) narrows the space to connected, two-connected and duplicate-free Horn clauses U. On the one hand, U is a smaller fragment than F, which our model reaches; on the other hand, they consider clauses with N body atoms, N hypothetically small in their experiments, which could enable greater expressivity.)

Claim: Enabling 3 or 4 body atoms would result in the same hypothesis space.   (33)

This can be proven by reducing clauses with 3 (resp. 4) body atoms to clauses with 2 (resp. 3) body atoms. Instead of a formal proof, we give a visual illustration of these reductions in Figure 6 (resp. Figure 7), and an explicit example just below; we consider clauses up to permutations and symmetries, as usual, and omit clauses having two body atoms with the same variables, as they can be trivially reduced. Red arrows denote head predicates; full grey arrows depict two arrows that can be reduced into the dotted grey arrow, lowering the number of body atoms by introducing an auxiliary predicate.
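As a concrete instance of the reduction depicted in Figure 6, a 3-body chain clause splits into two 2-body clauses via a fresh auxiliary predicate (the particular clause and the name Aux are our own illustrative choices):

P(A, B) ← Q(A, C) ∧ R(C, D) ∧ S(D, B)

reduces to

P(A, B) ← Q(A, C) ∧ Aux(C, B)
Aux(C, B) ← R(C, D) ∧ S(D, B)

Both clauses are instances of the proto-rule B : P(A, B) ← Q̄(A, C) ∧ R̄(C, B), and resolving the second body atom of the first clause with the head of the second recovers the original 3-body clause.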
Figure 6. Reduction of meta-rules with 3 body atoms.

Figure 7. Reduction of meta-rules with 4 body atoms.

In contrast, as illustrated in Figure 8, some rules with 5 body atoms are not reducible to 2-body clauses, such as:

P(A, B) ← Q(A, C) ∧ R(A, D) ∧ S(B, C) ∧ T(B, D) ∧ U(C, D).   (34)

Figure 8. Irreducible meta-rule with 5 body atoms.

As mentioned in Cropper & Tourret (2020), F^{1,2}_5 is therefore not D-reducible to F^{1,2}_2. However, allowing a higher number of body atoms (such as 5) may drastically increase the computational cost.

+ Enabling higher arity. For instance, considering arity-3 predicates, while keeping two-body clauses, could enable reaching C^{1,2}_5:

Claim: F^{1,2,3}_2[P] ⊇ F^{1,2}_5   (35)

Further comparative experiments could be run in the future to test different minimal or small meta-rule or proto-rule sets, and to investigate a judicious balance of minimalism/redundancy and expressivity/efficiency across diverse tasks.

Potentials In contrast to most previous classic ILP or differentiable ILP works, we hypothesize that HRI could be particularly suited for continual learning in semantically richer domains, such as RL scenarios where an interpretable, logic-oriented higher-level policy seems pertinent (e.g., autonomous driving). Here are some arguments supporting this belief:

+ In contrast to some previous works, HRI is independent of the number of predicates, which may vary between training and testing.
+ As our Visual Genome experiments suggest, HRI can be bootstrapped with semantic or visual priors to initialize the predicate embeddings.
+ In semantically rich domains, the complexity of HRI would be more advantageous than that of other ILP works. Semantically rich domains are characterized by a high number of initial predicates while enabling a much lower embedding dimension d (as these predicates are highly interdependent), d ≪ |P0|; such an inequality implies a lower complexity for HRI than for common methods, which have to learn one coefficient per predicate.
+ The embedding-based approach enables reasoning by analogy, and possibly quick generalization to new predicates. For instance, if the model has learned the rule Over_Take() ← P(X) ∧ On_Next_Lane(X), and the embedding θP is learned to be close to θCar, the rule would generalize to include Truck, assuming θCar ≈ θTruck, despite having never seen the new predicate Truck during training. A sketch of this effect follows.
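Here is a minimal sketch of that analogy argument: a learned rule scores background predicates against its slot through embedding similarity, so a new predicate close to Car in embedding space inherits the rule learned for Car (all embeddings here are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
theta = {"Car": rng.normal(size=8)}
theta["Truck"] = theta["Car"] + 0.05 * rng.normal(size=8)  # close to Car
theta["Tree"] = rng.normal(size=8)                          # unrelated
theta["P"] = theta["Car"] + 0.02 * rng.normal(size=8)       # learned slot

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Soft unification score of each background predicate with the rule slot P.
for name in ("Car", "Truck", "Tree"):
    print(name, round(cos(theta["P"], theta[name]), 3))
# Car and Truck score high, so the Over_Take rule transfers to Truck,
# while an unrelated predicate typically scores much lower.
```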