# Differentiable Logic Machines

Published in Transactions on Machine Learning Research (07/2023).

Matthieu Zimmer (matthieu.zimmer@huawei.com), Huawei Noah's Ark Lab, London, United Kingdom.
Xuening Feng (cindyfeng2019@sjtu.edu.cn), UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China.
Claire Glanois (clgl@itu.dk), IT University of Copenhagen, Copenhagen, Denmark.
Zhaohui Jiang (jiangzhaohui@sjtu.edu.cn), UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China.
Jianyi Zhang (zhangjy97@sjtu.edu.cn), UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China.
Paul Weng (paul.weng@sjtu.edu.cn), UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China.
Dong Li (lidong106@huawei.com), Huawei Noah's Ark Lab, China.
Jianye Hao (haojianye@huawei.com), Huawei Noah's Ark Lab, China; School of Computing and Intelligence, Tianjin University, China.
Wulong Liu (liuwulong@huawei.com), Huawei Noah's Ark Lab, Canada.

Reviewed on OpenReview: https://openreview.net/forum?id=mXfkKtu5JA

Abstract

The integration of reasoning, learning, and decision-making is key to building more general artificial intelligence systems. As a step in this direction, we propose a novel neural-logic architecture, called differentiable logic machine (DLM), that can solve both inductive logic programming (ILP) and reinforcement learning (RL) problems, where the solution can be interpreted as a first-order logic program. Our proposition includes several innovations. Firstly, our architecture defines a restricted but expressive continuous relaxation of the space of first-order logic programs by assigning weights to predicates instead of rules, in contrast to most previous neural-logic approaches. Secondly, with this differentiable architecture, we propose several (supervised and RL) training procedures, based on gradient descent, which can recover a fully-interpretable solution (i.e., a logic formula). Thirdly, to accelerate RL training, we also design a novel critic architecture that enables actor-critic algorithms. Fourthly, to solve hard problems, we propose an incremental training procedure that can learn a logic program progressively. Compared to state-of-the-art (SOTA) differentiable ILP methods, DLM successfully solves all the considered ILP problems with a higher percentage of successful seeds (up to 3.5×). On RL problems, without requiring an interpretable solution, DLM outperforms other non-interpretable neural-logic RL approaches in terms of rewards (up to 3.9%). When enforcing interpretability, DLM can solve harder RL problems (e.g., Sorting, Path) than other interpretable RL methods. Moreover, we show that deep logic programs can be learned via incremental supervised training. In addition to this excellent performance, DLM scales well in terms of memory and computational time, especially during the testing phase, where it can deal with more than twice as many constants (>2×) as SOTA methods.

1 Introduction

Following the successes of deep learning and deep reinforcement learning, a research trend (Dong et al., 2019; Jiang & Luo, 2019; Manhaeve et al., 2018) whose goal is to combine reasoning, learning, and decision-making into one architecture has become very active. This research may unlock the next generation of artificial intelligence (AI) (Lake et al., 2017; Marcus, 2018). Simultaneously, a second research trend has flourished under the umbrella term of explainable AI (Barredo Arrieta et al., 2020).
This trend is fueled by the realization that solutions obtained via deep learning-based techniques are difficult to understand, debug, and deploy. Notably, interpretability is crucial in high-stakes domains (e.g., autonomous driving), where the decisions made by a trained model should be understandable.

In this context, neural-logic approaches (see Section 2) have been proposed to integrate reasoning and learning, notably via first-order logic and neural networks. Recent works have demonstrated good achievements by using differentiable methods to learn a logic program (Evans & Grefenstette, 2018) or by applying a logical inductive bias to create a neural-logic architecture (Dong et al., 2019). The latter approach currently obtains the best performance at the cost of interpretability, while the former can yield an interpretable solution, but at the cost of scalability. Therefore, one important research problem regards the design of an efficient method with good performance and better scalability while preserving interpretability. The main motivation of our work is to provide such an algorithm.

In this paper, we propose a novel neural-logic architecture (see Section 3 for background notions and Section 4 for our proposition) that offers a better tradeoff between interpretability, performance, and scalability. This architecture defines a continuous relaxation over first-order logic expressions defined on input predicates. In contrast to most previous approaches (see Section 2), one key idea is to assign learnable weights to predicates instead of template rules, which allows for much better scalability. This architecture can be trained in both the supervised learning (SL) and reinforcement learning (RL) settings, for which we introduce several training techniques to find interpretable solutions. Notably, to accelerate RL training, we propose an adapted critic to train our architecture in an actor-critic scheme. In addition, to further scale the approach, we propose an incremental training methodology that can be applied to both SL and RL training.

We experimentally compare our proposition with previously-proposed neural-logic architectures in both the SL and RL settings, on inductive logic programming (ILP) and RL tasks respectively (see Section 5). Our architecture achieves state-of-the-art (SOTA) performance on both ILP and RL tasks while maintaining interpretability and achieving better scalability. More precisely, our proposition is superior to all considered interpretable methods in terms of success rates, computational time, and memory consumption. Compared to non-interpretable ones, our method compares favorably, but can find fully-interpretable solutions (i.e., logic programs) that are faster and use less memory during the testing phase.

The contributions of this paper can be summarized as follows: (1) a novel neural-logic architecture that can produce an interpretable solution and that scales better than SOTA methods, (2) several training algorithms to obtain an interpretable solution, and (3) a thorough empirical evaluation on both RL and ILP tasks. Hence, to the best of our knowledge, our approach is the first neural-logic method able to output, at the end of training, a fully-interpretable solution for complex tasks like Path or Blocksworld.

2 Related Work

The literature aiming at integrating reasoning, learning, and possibly decision-making is very rich.
Our work is related to statistical relational AI (De Raedt et al., 2016), which aims at combining relational reasoning and learning. However, a key difference is that our focus is to learn a logic program, although we have a probabilistic interpretation of predicate evaluations. Our work is also related to relational RL (Džeroski et al., 2001; Tadepalli et al., 2004; van Otterlo, 2012), whose goal is to combine RL with first-order logic (FOL) representations. To the best of our knowledge, such approaches do not scale as well as those resorting to neural networks. Thus, the investigation of neural approaches to tackle this integration has become very active in recent years (de Raedt et al., 2020; d'Avila Garcez et al., 2019; Besold et al., 2017). For space reasons, we focus below on the recent work closest to ours.

(Differentiable) ILP and their extensions to RL. ILP (Muggleton, 1991; Cropper et al., 2020) aims to extract lifted logical rules from examples. Since traditional ILP systems cannot handle noisy, uncertain, or ambiguous data, they have been extended and integrated into neural and differentiable frameworks. For instance, Evans & Grefenstette (2018) proposed ∂ILP, a model based on a continuous relaxation of the logic reasoning process, such that the parameters can be trained via gradient descent, by expressing the satisfiability problem of ILP as a binary classification problem. This relaxation is defined by assigning weights to templated rules. Jiang & Luo (2019) adapted ∂ILP to RL problems using vanilla policy gradient. Despite being interpretable, this approach does not scale well in terms of both memory and computation, which is notably due to how the relaxation is defined.

Payani & Fekri (2019b) proposed differentiable Neural Logic ILP (dNL-ILP), another ILP solver where, in contrast to ∂ILP, weights are placed on predicates as in our approach. Their architecture is organized as a sequence of one layer of neural conjunction functions followed by one layer of neural disjunction functions to represent expressions in conjunctive normal form (CNF) or disjunctive normal form (DNF), which provides high expressivity. In this architecture, conjunctions and disjunctions are defined over all predicates of any arity, in contrast to DLM. Payani & Fekri (2019b) did not provide any experimental evaluation of dNL-ILP on standard ILP benchmarks, but in our experiments, our best effort to evaluate it suggests that dNL-ILP performs worse than ∂ILP. We believe this is due to the overly generic form imposed on the logic program to be learned. Payani & Fekri (2020) extended their model to RL and showed that initial predicates can be learned from images if sufficient domain knowledge, in the form of auxiliary rules, is provided to the agent. However, they do not show that their approach can learn good policies without this domain knowledge.

Another way to combine learning and logic reasoning is to introduce a logical architectural inductive bias, as in Neural Logic Machines (NLM) (Dong et al., 2019). This approach departs from previous ones by learning rules with multilayer perceptrons (MLPs), which prevents it from providing a final interpretable solution. NLM can generalize, and its inference time is significantly reduced compared to ∂ILP; by avoiding rule templates, as used in traditional neural-symbolic approaches, it also gains in expressivity.
Our architecture is inspired by NLM, but we use interpretable modules instead of MLPs.

Since the search spaces of traditional and differentiable ILP methods are often very different, it is difficult to produce a fair comparison. As far as we know, traditional ILP methods often rely on carefully hand-designed and task-specific templates, which is less the case with our approach. This difficulty and the difference between these approaches may explain the limited comparative evaluation in past work on ILP (Law et al., 2018; Glanois et al., 2022). Glanois et al. (2022) state that traditional methods are superior to neuro-symbolic ones only when the dataset is small and noise-free. In the RL setting, it is also not clear how to extend non-differentiable methods.

Other neural-symbolic approaches. In order to combine probabilistic logic reasoning and neural networks, Manhaeve et al. (2018) proposed DeepProbLog, which extends ProbLog (De Raedt et al., 2007), a probabilistic logic language, with neural predicates. While this approach is shown to be capable of program induction, it is not obvious how to apply it to solve generic ILP problems, since partially-specified programs need to be provided. Another line of work in relational reasoning specifically targets knowledge-base (KB) reasoning. Although these works have demonstrated a huge gain in scalability (w.r.t. the number of predicates or entities), they are usually less concerned with predicate invention¹. Some recent works (Yang et al., 2017; Yang & Song, 2020) extend the multi-hop reasoning framework to ILP problems. The latter work is able to learn more expressive rules with the use of nested attention operators. In the KB completion literature, a recurrent idea is to jointly learn sub-symbolic embeddings of entities and predicates, which are then used for approximate inference. However, the expressivity remains too limited for more complex ILP tasks, and these works are typically more data-hungry.

¹Typically, rules learned in KB reasoning are chain-like rules (i.e., paths on graphs), which form a subset of Horn clauses: $Q(X, Y) \leftarrow P_1(X, Z_1) \wedge P_2(Z_1, Z_2) \wedge \cdots \wedge P_b(Z_{b-1}, Y)$.

3 Background

We present the necessary notions in ILP and RL. We also recall NLM, since our work is based on it.

3.1 Inductive Logic Programming (ILP)

ILP (Muggleton, 1991) refers to the problem of learning a logic program that entails a given set of positive examples and does not entail a given set of negative examples. This logic program is generally written in (a fragment of) FOL. FOL is a formal language defined with several elements: constants, variables, functions, predicates, and formulas. Constants correspond to the objects in the domain of discourse. Let $C$ denote the set of $m$ constants (e.g., objects). They will be denoted in lowercase (e.g., $o$). Variables refer to unspecified constants. They will be denoted in uppercase (e.g., $X$, $Y$, or $Z$). Functions allow denoting some constants using other constants (e.g., $s(0)$ may refer to $1$). Like previous work, we consider a fragment of FOL without any functions. A predicate can be thought of as a relation between constants, which can be evaluated as true (T) or false (F). A predicate is said to be $b$-ary if it is a relation between $b$ constants. Note that a 0-ary predicate is simply a Boolean value. Predicates will be denoted in uppercase (e.g., $P$ or $Q$). Let $\mathcal{P}$ denote the set of predicates used for a given problem.
An atom is a $b$-ary predicate applied to its arguments, $P(x_1, \ldots, x_b)$, where the $x_i$'s are either variables or constants. For simplicity, we may refer to atoms as predicates when there is no risk of confusion. A formula is a logical expression composed of atoms, logical connectives (e.g., negation $\neg$, conjunction $\wedge$, disjunction $\vee$, implication $\leftarrow$), and possibly existential $\exists$ and universal $\forall$ quantifiers.

Since solving an ILP task involves searching an exponentially large space, this problem is generally handled by focusing on formulas of restricted forms, such as a subset of if-then rules, also referred to as clauses. A definite clause is a rule of the form $H \leftarrow A_1 \wedge \ldots \wedge A_k$, which means that the head atom $H$ is implied by the conjunction of the body atoms $A_1, \ldots, A_k$. Horn clauses extend definite clauses by allowing $H$ to be possibly negated. More general rules can be defined by allowing logical operations (e.g., disjunction or negation) in the body. A ground rule (resp. ground atom) is a rule (resp. atom) whose variables have all been replaced by constants.

In ILP tasks, given some initial predicates (e.g., for natural numbers, $Zero(X)$ meaning "$X = 0$", $Succ(X, Y)$ meaning "$X = Y + 1$") and a target predicate (e.g., $Even(X)$ meaning "$X$ is even"), the goal is to learn a logical formula defining the target predicate. Expressing this target predicate directly in terms of the initial ones may require a very long logical formula, which may therefore be hard to learn. A usual approach is to rely on predicate invention, which consists in learning intermediate auxiliary predicates with which the target predicate can be defined. Below, we show a simple example of such predicate invention with the created auxiliary predicate $Succ2(X, Y)$ (meaning "$X = Y + 2$"):

$$Even(X) \leftarrow Zero(X) \vee \big(Succ2(X, Y) \wedge Even(Y)\big), \qquad Succ2(X, Y) \leftarrow Succ(X, Z) \wedge Succ(Z, Y).$$

3.2 Reinforcement Learning (RL)

For any finite set $X$, let $\Delta(X)$ denote the set of probability distributions over $X$. The Markov decision process (MDP) model (Bellman, 1957) is defined as a tuple $(S, A, T, r, \mu, \gamma)$, where $S$ is a set of states, $A$ is a set of actions, $T : S \times A \to \Delta(S)$ is a transition function, $r : S \times A \to \mathbb{R}$ is a reward function, $\mu \in \Delta(S)$ is a distribution over initial states, and $\gamma \in [0, 1)$ is a discount factor. A (stationary Markov) policy $\pi : S \to \Delta(A)$ is a mapping from states to distributions over actions; $\pi(a \mid s)$ stands for the probability of taking action $a$ in state $s$. We consider parametrized policies $\pi_\theta$ with parameter $\theta$ (e.g., neural networks). The aim in discounted MDP settings is to find a policy maximizing the expected discounted total reward:

$$J(\theta) = \mathbb{E}_{\mu, T, \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \tag{1}$$

where $\mathbb{E}_{\mu, T, \pi_\theta}$ is the expectation w.r.t. distribution $\mu$, transition function $T$, and policy $\pi_\theta$. The state value function of a policy $\pi_\theta$ for a state $s$ is defined by:

$$V^\theta(s) = \mathbb{E}_{T, \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big], \tag{2}$$

where $\mathbb{E}_{T, \pi_\theta}$ is the expectation w.r.t. transition function $T$ and policy $\pi_\theta$. The action value function is defined by $Q^\theta(s, a) = \mathbb{E}_{T, \pi_\theta}\big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\big]$, and the advantage function (Sutton & Barto, 2018) is defined by $A^\theta(s, a) = Q^\theta(s, a) - V^\theta(s)$.

RL (Sutton & Barto, 2018), which is based on the MDP model, is the problem of learning a policy that maximizes the expected discounted sum of rewards without knowing the transition and reward functions. Policy gradient (PG) methods constitute a widespread approach for tackling RL problems in continuous or large state-action spaces.
They are based on iterative updates of the policy parameter in the direction of a (policy) gradient expressed as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta}(\cdot),\, a \sim \pi_\theta(\cdot \mid s)}\big[A^\theta(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big],$$

where the expectation is taken w.r.t. $d^{\pi_\theta}(s) = \sum_{s_0} \mu(s_0) \sum_{t=1}^{\infty} \gamma^{t-1} \Pr(s_t = s \mid s_0)$, the discounted state distribution induced by policy $\pi_\theta$. Algorithms like REINFORCE (Williams, 1992) that estimate this gradient via Monte Carlo sampling are known to suffer from high variance, due to the stochasticity of the environment and/or policy (Sutton & Barto, 2018). To address this issue, actor-critic (AC) schemes (Konda & Tsitsiklis, 2000) have been proposed. In such a framework, both an actor ($\pi_\theta$) and a critic (e.g., depending on the AC algorithm, $V^\theta$ (Sutton & Barto, 2018; Schulman et al., 2017) or $Q^\theta$ (Lillicrap et al., 2016; Mnih et al., 2016), from which $A^\theta$ can be estimated) are jointly learned. Using a critic to estimate the policy gradient reduces variance, but at the cost of introducing some bias.

Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a SOTA AC algorithm, which optimizes a clipped surrogate objective function:

$$J^{PPO}(\theta) = \mathbb{E}\Big[\sum_{t \ge 0} \min\big(\omega_t(\theta)\, \hat{A}^{\bar\theta}(s_t, a_t),\ \mathrm{clip}(\omega_t(\theta), \epsilon)\, \hat{A}^{\bar\theta}(s_t, a_t)\big)\Big],$$

where $\bar\theta$ is the parameter of the policy that generated the training data, $\omega_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\bar\theta}(a_t \mid s_t)}$, the advantage function $\hat{A}^{\bar\theta}$ is estimated with generalized advantage estimation (GAE) (Schulman et al., 2016), and $\mathrm{clip}(\cdot, \epsilon)$ clips its argument to $[1 - \epsilon, 1 + \epsilon]$. This surrogate objective was motivated as an approximation of the one used in Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), which was introduced to ensure monotonic improvement after a policy parameter update. Some undeniable advantages of PPO over TRPO lie in its simplicity and lower computational and sample complexity.
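To make the objective concrete, here is a minimal PyTorch sketch of the clipped surrogate loss above (our own illustration, not the authors' implementation); it assumes the advantage estimates have already been computed, e.g., with GAE:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO surrogate J_PPO, negated so that minimizing it
    maximizes the objective.

    log_probs:     log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_bar(a_t | s_t), detached from the graph
    advantages:    estimates of A(s_t, a_t), e.g., computed with GAE
    """
    ratio = torch.exp(log_probs - old_log_probs)               # omega_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```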
3.3 Neural Logic Machine (NLM)

NLMs (Dong et al., 2019) are neural networks designed with a strong architectural inductive bias to solve ILP (and RL) problems. The NLM architecture is composed of MLPs organized in a hierarchical fashion. Its input corresponds to the initial predicates and its output corresponds to the target predicate. The MLPs approximate logic operations and define invented auxiliary predicates based on previously-created or initial predicates. Thus, this architecture simulates forward chaining, a form of reasoning which computes a target predicate from the truth values of initial predicates via a sequence of invented ones. Training an NLM therefore approximates the inductive definition of logic formulas. More specifically, an NLM is comprised of learnable computation units organized into $L$ successive layers, each layer having a maximum breadth $B$ (see Fig. 1a). A computation unit at layer $l$ for a given breadth $b$ corresponds to some $b$-ary predicates and is connected to units at the previous and next layers.

[Figure 1: (a, left) High-level architecture of NLM (and DLM), zoomed in around breadth b: boxes represent computation units (except for layer 0), blue arrows correspond to reduction, and yellow arrows to expansion; permutation is applied on the inputs of each computation unit. (b, right) A DLM computation unit at breadth b and layer l: after the NLM operations (i.e., reduction, expansion, and permutation), negation (green arrow) and preservation (violet arrow) with true (T) or false (F) are applied to generate the inputs of the logic modules.]

An NLM processes atoms (i.e., predicates with all their arguments) represented as tensors. For any $b$-ary predicate $P$, its corresponding tensor $\mathbf{P}$, denoted in bold, is of order $b$ with shape $[m, \ldots, m]$ (recall that $m$ is the number of constants present in the ILP problem). For instance, the tensor representation of a binary ($b = 2$) predicate would be $\mathbf{P} \in [0, 1]^{m \times m}$. A value $\mathbf{P}(i_1, \ldots, i_b) \in [0, 1]$ (with $i_j \in \{1, \ldots, m\}$ and $j \in \{1, \ldots, b\}$) is interpreted as the probability of the truth value of the corresponding grounded atom. Apart from layer 0, which directly corresponds to the initial predicates, a computation unit at breadth $b \in \{0, \ldots, B\}$ and layer $l \in \{1, \ldots, L\}$ takes as inputs the $b$-ary predicates of a set $\mathcal{Q}^l_b$ and then outputs a set $\mathcal{P}^l_b$ of newly-invented $b$-ary predicates. For $l > 0$, the set $\mathcal{Q}^l_b$ is obtained from the sets $\mathcal{P}^{l-1}_b$ of the previous layer according to three operations, expansion, reduction, and permutation (a tensor-level sketch of these operations is given below):

Expansion: Any $b$-ary predicate $P$ can be expanded into a $(b+1)$-ary predicate $\hat{P}$, where the last argument does not play any role in its truth value, i.e., $\hat{P}(X_1, \ldots, X_{b+1}) := P(X_1, \ldots, X_b)$. The set of expanded predicates obtained from $\mathcal{P}^{l-1}_b$ is denoted $\hat{\mathcal{P}}^{l-1}_b$.

Reduction: Any $(b+1)$-ary predicate $P$ can be reduced into a $b$-ary predicate $\check{P}$ by marginalizing out its last argument with an existential (resp. universal) quantifier, i.e., $\check{P}(X_1, \ldots, X_b) = \exists X_{b+1}\, P(X_1, \ldots, X_{b+1})$ (resp. $\check{P}(X_1, \ldots, X_b) = \forall X_{b+1}\, P(X_1, \ldots, X_{b+1})$). These operations on tensors are performed by a max (resp. min) over the corresponding component, i.e., $\check{\mathbf{P}}(i_1, \ldots, i_b) = \max_j \mathbf{P}(i_1, \ldots, i_b, j)$ (resp. $\check{\mathbf{P}}(i_1, \ldots, i_b) = \min_j \mathbf{P}(i_1, \ldots, i_b, j)$). The set of reduced predicates obtained from $\mathcal{P}^{l-1}_{b+1}$ is denoted $\check{\mathcal{P}}^{l-1}_{b+1}$.

Permutation: Let $S_b$ be the set of all permutations of $\{1, \ldots, b\}$ for $b \in \{1, \ldots, B\}$. For a given $b$-ary predicate $P$ and a given permutation $\sigma \in S_b$, $P^\sigma(X_1, \ldots, X_b) := P(X_{\sigma(1)}, \ldots, X_{\sigma(b)})$. Permuting arguments allows building more expressive formulas.

Example 1. In a blocks world environment, consider the Unstack task where the goal is to move all the blocks onto the floor. A constant in $C$ is therefore a block. The current configuration of the blocks can be described by the following initial predicates: $IsFloor(X)$ (i.e., true if $X$ is the floor), $On(X, Y)$ (i.e., true if $X$ is on $Y$), and $Top(X)$ (i.e., true if there is no block on $X$). The target predicate $Move(X, Y)$ to be learned indicates which block $X$ could be moved onto $Y$ so that the next configuration of the blocks is closer to the goal specified by the Unstack task. A possible solution is:

$$Move(X, Y) \leftarrow IsFloor(Y) \wedge Top(X) \wedge \exists Z\, P(X, Z), \qquad P(X, Y) \leftarrow On(X, Y) \wedge \exists Z\, On(Y, Z),$$

where $P$ is an invented predicate, which can be formulated thanks to the three previous operations. Indeed, reduction applied to $On$ can yield $Q(Y) = \exists Z\, On(Y, Z)$. Applying expansion to it yields $\hat{Q}$ with $\forall X,\ \hat{Q}(Y, X) = Q(Y)$. Using the permutation $\sigma$ that swaps the two arguments gives $\hat{Q}^\sigma(X, Y) = \hat{Q}(Y, X)$. Finally, $P$ can be expressed with $On(X, Y)$ and $\hat{Q}^\sigma(X, Y)$.

The input predicates of a computation unit at layer $l$ and breadth $b$ are therefore the elements of the set $\mathcal{Q}^l_b = \{P^\sigma \mid P \in \mathcal{P}^{l-1}_b \cup \hat{\mathcal{P}}^{l-1}_{b-1} \cup \check{\mathcal{P}}^{l-1}_{b+1},\ \sigma \in S_b\}$, assuming $\hat{\mathcal{P}}^{l-1}_{-1} = \emptyset$ and $\check{\mathcal{P}}^{l-1}_{B+1} = \emptyset$. A computation unit outputs $n_O$ new predicates, which form the elements of the set $\mathcal{P}^l_b$ of $b$-ary predicates invented at layer $l$.
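The following minimal PyTorch sketch (our own illustration, with hypothetical helper names) shows how expansion, reduction, and permutation act on the tensor representation of predicates:

```python
import torch

def expand(p):
    """Expansion: b-ary predicate -> (b+1)-ary; the new last argument
    does not influence the truth value. p has shape [m]*b."""
    m = p.shape[0]
    return p.unsqueeze(-1).expand(*p.shape, m)

def reduce_exists(p):
    """Reduction with an existential quantifier: max over the last argument."""
    return p.amax(dim=-1)

def reduce_forall(p):
    """Reduction with a universal quantifier: min over the last argument."""
    return p.amin(dim=-1)

def permute(p, sigma):
    """Permutation of the arguments of a b-ary predicate; sigma is a tuple."""
    return p.permute(*sigma)

# Example: On(X, Y) over m = 3 constants
on = torch.tensor([[0., 1., 0.],
                   [0., 0., 1.],
                   [0., 0., 0.]])
q = reduce_exists(on)        # Q(X) = ∃Y On(X, Y)  -> tensor([1., 1., 0.])
under = permute(on, (1, 0))  # Under(X, Y) = On(Y, X)
```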
For each such predicate, a corresponding MLP computes its outputs: its evaluation for any given grounding (i.e., its arguments $o_1, \ldots, o_b$) is computed with the same MLP, using as inputs all predicates in $\mathcal{Q}^l_b$ grounded on $o_1, \ldots, o_b$. Thus, the NLM architecture is independent of the number of constants, and a trained network can therefore be applied to new instances with any number of constants. NLM offers an expressive neural network architecture, whose model complexity is controlled by setting the number $L$ of layers, the breadth $B$, and the number $n_O$ of output predicates of the computation units, in addition to the MLP size. However, by using MLPs, NLM cannot provide an interpretable solution after training.

4 Differentiable Logic Machine (DLM)

We present our novel neural-logic architecture, called Differentiable Logic Machine (DLM), which offers a good trade-off between expressivity and trainability. We first discuss its architecture, then present several training methods, and finally explain how to extract interpretable solutions.

4.1 Architecture

At a high level, DLM has a similar architecture (Fig. 1a) to NLM (Dong et al., 2019), i.e., it contains computation units organized in a hierarchical fashion with breadth $B$ and $L$ layers. In contrast to NLM, the computation units correspond to soft logic operators, which means that DLM defines a continuous relaxation over FOL programs. Next, we explain the three main innovations in DLM and compare its computational complexity with respect to other related neural-logic methods. In Appendix A, we provide further discussions about DLM, notably its interpretation, expressivity, and implementation details.

Innovations. The first two innovations correspond to two novel operations, negation and preservation, which we introduce to increase expressivity. They are used in addition to the three operations of NLM (expansion, reduction, and permutation) to compute the input predicates of a computation unit. For the third innovation, we replace the MLPs of NLM's computation units with logic modules to promote interpretability. We present the two operations next, and then describe the logic modules:

Negation: On any set of predicates $\mathcal{P}$, this operation yields the set $\eta(\mathcal{P})$ containing the negations of the predicates in $\mathcal{P}$. On tensor representations, the negation is computed via the involutive function $f(x) = 1 - x$ applied component-wise.

Preservation: To add more flexibility to the architecture, we augment any set of predicates $\mathcal{P}$ with $T$ or $F$, which can be seen as predicates that are always true or false respectively. The rationale for introducing these constant predicates $T$ and $F$ is notably to allow a predicate at one layer to be preserved for a later layer (see Example 2 below).

From the set $\mathcal{Q}^l_b$ of input predicates computed with expansion, reduction, and permutation, the application of negation and preservation yields four sets: $\mathcal{Q}^l_b \cup \{T\}$, $\mathcal{Q}^l_b \cup \{F\}$, $\eta(\mathcal{Q}^l_b) \cup \{T\}$, and $\eta(\mathcal{Q}^l_b) \cup \{F\}$. The union of these sets, denoted $\mathcal{R}^l_b$, is used as input of a logic module, which we explain next.

Logic module: In DLM, a computation unit (see Fig. 1b) at layer $l$ and breadth $b$ outputs $n_O$ new predicates, as in NLM. However, in contrast to NLM, each invented predicate is based on input predicates from $\mathcal{R}^l_b$ and is computed with a (soft) logic operator. For simplicity, we only use *and* and *or*, but other logic operators could be considered.
An *and* (resp. *or*) logic module at any layer $l$ and breadth $b$ outputs a conjunction (resp. disjunction) over $n_A$ terms of $\mathcal{R}^l_b$, which represents a predicate $C^l_b(X_1, \ldots, X_b)$ (resp. $D^l_b(X_1, \ldots, X_b)$). In terms of tensors, a conjunctive predicate $C^l_b(X_1, \ldots, X_b)$ is computed with a fuzzy *and* and is obtained as follows (for $n_A = 2$, which easily extends to $n_A > 2$):

$$\mathbf{C}^l_b = \mathbf{Q} \odot \mathbf{Q}', \tag{3}$$

where $\mathbf{Q} = \sum_{P \in \mathcal{R}^l_b} w_P \mathbf{P}$, $\mathbf{Q}' = \sum_{P' \in \mathcal{R}^l_b} w_{P'} \mathbf{P}'$, $\odot$ is the component-wise product, and $w_P \in [0, 1]$ and $w_{P'} \in [0, 1]$ are learnable weights for selecting predicates $P$ and $P'$, such that $\sum_{P \in \mathcal{R}^l_b} w_P = 1$ and $\sum_{P' \in \mathcal{R}^l_b} w_{P'} = 1$. Similarly, a disjunctive predicate $D^l_b(X_1, \ldots, X_b)$ is computed with a fuzzy *or* and is defined as follows (for $n_A = 2$, which easily extends to $n_A > 2$):

$$\mathbf{D}^l_b = \mathbf{Q} + \mathbf{Q}' - \mathbf{Q} \odot \mathbf{Q}'. \tag{4}$$

Each logic module (i.e., each conjunction or disjunction) has its own set of weights $w_P$ (and $w_{P'}$), which are learned as a softmax of parameters $\theta_P$ (and $\theta_{P'}$) with temperature $\tau$ as a hyperparameter:

$$w_P = \frac{\exp(\theta_P / \tau)}{\sum_{P' \in \mathcal{R}^l_b} \exp(\theta_{P'} / \tau)}. \tag{5}$$

Example 2. Assume $n_A = 2$. Consider a blocks world environment with an initial predicate $On(X, Y)$ (i.e., block $X$ is on $Y$). If it were not initially provided, the predicate $Top(X)$ (i.e., no block is on $X$) could be learned thanks to the negation and preservation operations (in addition to reduction and permutation). Indeed, since $Top(X)$ can be expressed as $T \wedge \neg(\exists Y\, On(Y, X))$, it can be obtained as follows. As $\mathcal{Q}^1_1$ contains a predicate $P(X) = \exists Y\, On(Y, X)$ (by reduction and permutation), an *and* logic module can express $Top(X)$ as a conjunction of $T \in \mathcal{Q}^1_1 \cup \{T\}$ (by preservation) and $\neg P \in \eta(\mathcal{Q}^1_1)$ (by negation). Since $P$ is an input predicate in $\mathcal{R}^1_1$, $T$ (resp. $F$) can preserve it via *and* (resp. *or*) logic modules for the next layer.

Computational Complexity. In one computation unit, the number of parameters grows as $O(p\, n_A\, n_O)$, where $p$ is the number of input predicates of the unit, $n_A$ is the number of atoms used to define an invented predicate, and $n_O$ is the number of output predicates invented by the unit. In comparison with related work (see Section 2), to obtain the same expressivity, the alternative approach ∂ILP (Evans & Grefenstette, 2018) (and thus its extension to RL, NLRL (Jiang & Luo, 2019)) would need $O\big(\binom{p}{n_A}\, n_O\big)$ parameters, because the weights are defined over all the $n_A$-combinations of the $p$ predicates. In contrast, dNL-ILP and NLM do better with only $O(p\, n_O)$. However, the components of dNL-ILP and NLM are not really comparable to those of our model or ∂ILP. Indeed, the NLM units are not interpretable, and the dNL-ILP architecture amounts to learning a CNF or DNF formula, where a component of that architecture corresponds to a part of that formula. While expressive, the space of logic programs induced by dNL-ILP is much less constrained than that of our architecture, making it much harder to train, as shown in our experiments (see Table 4).

Assuming that $B$ (the maximum arity) is a small constant, the complete DLM network has $O(L\, n_A\, n_O^2)$ parameters, where $L$ is the maximum number of layers; NLM has $O(L\, n_O^2)$ parameters instead. The complexity of a forward pass in DLM is $O(m^B L\, n_A\, n_O^2)$, where $m$ is the number of constants; for NLM, it is $O(m^B L\, n_O^2)$ (assuming that single-layer perceptrons are used). Once a logic program is extracted (see Section 4.3), we can reduce the forward-pass complexity to $O(m^B p)$, where $p$ is the number of elements in the graph. Note that, by construction, $p \le L\, n_O^2 \le L\, n_A\, n_O^2$.
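The following PyTorch sketch (our own illustration; the `LogicModule` class and its argument conventions are assumptions, not the paper's code) implements the fuzzy conjunction/disjunction of Equations (3)-(5) for $n_A = 2$:

```python
import torch
import torch.nn as nn

class LogicModule(nn.Module):
    """Minimal sketch of a DLM and/or logic module for n_A = 2.

    `inputs` stacks the |R| candidate predicate tensors along dim 0.
    Two weight vectors each select one (soft) predicate via a softmax.
    """
    def __init__(self, num_inputs, kind="and", tau=1.0):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(2, num_inputs))  # theta_P, theta_P'
        self.kind, self.tau = kind, tau

    def forward(self, inputs):                 # inputs: [|R|, m, ..., m]
        w = torch.softmax(self.theta / self.tau, dim=-1)       # Eq. (5)
        # soft selections Q and Q' (convex mixtures of input predicates)
        q = torch.einsum("i,i...->...", w[0], inputs)
        q2 = torch.einsum("i,i...->...", w[1], inputs)
        if self.kind == "and":                 # fuzzy conjunction, Eq. (3)
            return q * q2
        return q + q2 - q * q2                 # fuzzy disjunction, Eq. (4)
```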
4.2 Training

As a differentiable model, DLM can be trained in both the SL and RL settings. Since any computation unit directly corresponds to a (possibly soft) logic expression, DLM is also independent of the number of constants: it can be trained on a small number of constants and generalize to a larger number. Next, we discuss supervised training, then present RL training with an actor-critic scheme. In addition, we also present an incremental training approach to tackle hard tasks. Apart from the latter incremental approach, the algorithms are quite standard. We present their pseudo-codes in Appendix A.

4.2.1 Supervised Training

SL training can be applied to solve ILP tasks (see Section B.1 for concrete examples): given a dataset $D$ expressed with a set of constants $C$ and the initial predicates in $\mathcal{P}^0 = \bigcup_b \mathcal{P}^0_b$, learn to predict an $r$-ary target predicate $P_T$. DLM can be trained by minimizing a binary cross-entropy loss (e.g., via stochastic mini-batch gradient descent):

$$\mathcal{L}_\theta(\mathbf{P}_T, \hat{\mathbf{P}}_T) = -\sum_{o_1, \ldots, o_r \in C} \mathbf{P}_T(o_1, \ldots, o_r) \log \hat{\mathbf{P}}_T(o_1, \ldots, o_r) + \big(1 - \mathbf{P}_T(o_1, \ldots, o_r)\big) \log\big(1 - \hat{\mathbf{P}}_T(o_1, \ldots, o_r)\big),$$

where $\theta$ corresponds to all the DLM parameters, $\hat{\mathbf{P}}_T = \phi_\theta(\mathbf{P}^0)$ is the output atom computed by DLM, and $\mathbf{P}^0$ contains the grounded atoms given in $D$ (i.e., all $P(o_1, \ldots, o_b)$, $\forall b$, $\forall o_1, \ldots, o_b \in C$, $\forall P \in \mathcal{P}^0_b$).

As reported in previous neural-logic work, this loss function generally admits many local optima. Besides, there may be global optima that reside in the interior of the continuous relaxation (i.e., that are not interpretable). Therefore, if we train our model with a standard supervised (or RL) training technique, there is no reason that an interpretable solution would be obtained, even if we manage to completely solve the task. In order to help training and guide the model towards an interpretable solution, we propose to use three techniques: (a) inject some noise into the softmax defined in (5), (b) decrease the temperature $\tau$ during training, and (c) use dropout. For the noise, we use a Gumbel distribution. Thus, the softmax in (5) is replaced by a Gumbel-softmax (Jang et al., 2017):

$$w_P = \frac{\exp\big((G_P + \theta_P) / \tau\big)}{\sum_{P' \in \mathcal{R}^l_b} \exp\big((G_{P'} + \theta_{P'}) / \tau\big)}, \tag{6}$$

where the $G_P$ (and $G_{P'}$) are i.i.d. samples from a Gumbel distribution $\mathrm{Gumbel}(0, \beta)$. The injection of noise during training corresponds to a stochastic smoothing technique: optimizing by injecting noise with gradient descent amounts to performing stochastic gradient descent on a smoothed loss function. This helps avoid early convergence to a local optimum and find a global optimum. The decreasing temperature favors convergence towards interpretable solutions. To further help learn an interpretable solution, we additionally use dropout during our training procedure. Dropout helps learn more independent weights, which also promotes interpretability. The scale $\beta$ of the Gumbel distribution and the dropout probability are also decreased with the temperature during learning (see the sketch below).
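Here is a minimal sketch of the Gumbel-softmax weights of Equation (6) with a toy annealing schedule; the schedule constants are hypothetical illustrations, not the paper's hyperparameters:

```python
import torch

def gumbel_softmax_weights(theta, tau, beta):
    """Selection weights of Eq. (6): softmax over (G_P + theta_P) / tau,
    with G_P i.i.d. Gumbel(0, beta). Gradients flow through theta only."""
    gumbel = torch.distributions.Gumbel(0.0, beta).sample(theta.shape)
    return torch.softmax((gumbel + theta) / tau, dim=-1)

# Toy annealing loop: tau and beta decrease together so that the
# weights progressively approach a hard argmax (interpretable selection).
theta = torch.zeros(8, requires_grad=True)
for step in range(1000):
    tau = max(0.05, 0.995 ** step)
    beta = tau  # noise scale decreased along with the temperature
    w = gumbel_softmax_weights(theta, tau, beta)
    # ... use w in the forward pass and backpropagate the BCE loss ...
```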
4.2.2 Actor-Critic (AC) Training

For RL tasks (see Section 5 for concrete examples), states are assumed to be encoded with a given set $C$ of constants and the initial predicates in $\mathcal{P}^0 = \bigcup_b \mathcal{P}^0_b$, while action selection can be expressed by a $b$-ary action predicate $P_A$, i.e., $P_A(o_1, \ldots, o_b)$ corresponds to an action. In contrast to ILP, the truth values of grounded atoms (describing a state) may change after performing an action, to reflect a new state. The RL problem consists in learning a policy that takes as inputs $\mathbf{P}^0$ (all the grounded atoms of the initial predicates) and that returns an action predicate, such that the expected cumulative reward (1) is maximized by selecting actions according to the action predicate.

Example 3. Consider a blocks world environment with the initial predicate $On(X, Y)$ (i.e., block $X$ is on $Y$). The state is defined by all the state atoms (i.e., ground predicates); e.g., keeping only the positive ones, $On(a, b)$ and $On(b, floor)$ could define a state. The action is defined by an action atom, for instance $Move(a, floor)$. The policy network only takes as input the state atoms (represented as tensors) containing their valuations over all the constants, and predicts the truth values of the predicate $Move$ over all the constants. A possible resulting next state after taking action $Move(a, floor)$ would be $On(b, floor)$, $On(a, floor)$.

[Figure 2: Architecture of our critic, assuming that the maximum arity in $\mathcal{P}^0$ is 2. On the left are depicted the atoms of arity 0, 1, and 2, where $m = |C|$, $n_i$ is the number of predicates of arity $i$, and $E$ is the output dimension of a GRU. Each GRU unit sequentially reads the slices depicted with black lines. The architecture can be generalized to larger arities by introducing more GRU units. The number of parameters of the critic is independent of the number of constants and only depends on the possible arities and the possible number of predicates in $\mathcal{P}^0$.]

To the best of our knowledge, all previous neural-logic works² rely on REINFORCE (Williams, 1992) instead of an AC algorithm, which can generally be much more sample-efficient. One reason may be the difficulty of designing a neural network for the critic that can directly receive $\mathbf{P}^0$ as input (i.e., the same input given to the policy based on a DLM architecture). To overcome this issue, we propose a recurrent neural network architecture based on Gated Recurrent Units (GRU) (Cho et al., 2014) for the critic. Once an architecture for the critic is defined, different actor-critic algorithms could be applied. In this work, we use the SOTA AC algorithm PPO (Schulman et al., 2017). We present next the design of our critic and actor.

Critic. This GRU-based critic estimates the value (2) of a state described by the atoms in $\mathbf{P}^0$. Recall that a GRU unit (Cho et al., 2014) processes an input sequence $(x_t)$ as follows:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \qquad \xi_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \phi\big(W_h x_t + U_h (\xi_t \odot h_{t-1}) + b_h\big),$$

where $z_t$ (resp. $\xi_t$) is the update (resp. reset) gate vector, $h_t$ is the output vector, $\sigma$ (resp. $\phi$) is the sigmoid (resp. tanh) activation function, $\odot$ is the component-wise product, and $W_*$, $U_*$, $b_*$ for $* \in \{z, r, h\}$ are learnable parameters.

Our GRU-based critic (see Fig. 2) includes, for each arity $b > 1$ in $\mathcal{P}^0$, $b$ independent GRU units. Each of them reads all the atoms in $\mathbf{P}^0_b$, but focuses on a different component of the $b$-ary predicates. For simplicity, we present the procedure for the first component; the other components are treated in a similar fashion. The GRU unit for the first component processes, for each $o \in C$, the input sequence $(x_t)$ whose elements are defined, for $t \equiv (o_2, \ldots, o_b) \in C^{b-1}$, by $x_t = \big(\mathbf{P}_b(o, o_2, \ldots, o_b)\big)_{P_b \in \mathcal{P}^0_b}$. Its output, taken as the last computed $h_t$, corresponds to $(Q^1_b(o), \ldots, Q^E_b(o)) \in \mathbb{R}^E$, where $E$ is a hyperparameter. Intuitively, the $b$ GRU units compute, for each constant, a summary (i.e., a vector of dimension $b \times E$) of the objects that are in relation with it.
²Although not discussed in Jiang & Luo (2019), we found in their source code an attempt to apply an AC scheme, but they do so by converting states into images (i.e., by activating the associated pixels in Cartesian space when a block is present at a specific location), which may not only be unsuitable for some problems, but may also lose information during the conversion and prevent good generalization.

After processing all the atoms of arity $b > 1$, all the combined outputs of these GRU units (i.e., the $Q^i_b$'s), in addition to the initial unary predicates, are read by another GRU unit, which outputs a vector $(Q^1_1, \ldots, Q^E_1) \in \mathbb{R}^E$. This latter GRU unit takes as inputs, for each $o \in C$ (playing the role of $t$), $x_t \in \mathbb{R}^{N_1}$, where $N_1$ is the total number of $Q^i_b$'s plus the number of initial unary predicates. The computed vector, in addition to any initial 0-ary predicates if any, is given as input to an MLP, which outputs the estimated state value. Note that the number of parameters of this critic, like DLM, is independent of the number of constants. An NLM architecture could also be used for the critic, but we found that our proposed GRU-based architecture was more efficient in practice. It enables a more expressive aggregation function to compute a value from a set of input tensors: with NLM, the reduction of the inputs to a scalar would only use the min-pooling operation across objects, which leads to a too simple aggregation function.

Actor. The actor's output should ideally correspond to a predicate that evaluates to true for exactly one action and false for all other actions, which corresponds to a deterministic policy. For instance, in a blocks world domain, the target predicate would be $Move(X, Y)$ and would be true for only one pair of objects, corresponding to the optimal action, and false for all other pairs. While not impossible to achieve (at least for certain small tasks), an optimal deterministic policy may involve an unnecessarily complex logic program. Indeed, in the blocks world domain for instance, many states admit several equivalent actions, which a deterministic policy would have to order. Thus, as done in previous works, we separate the reasoning part from the decision-making part. The reasoning part follows the architecture presented in Fig. 1a, which provides a tensor representing the target predicate corresponding to the actions. A component of this tensor can be interpreted as whether the respective action is good or not. The decision part takes this tensor as input and outputs a probability distribution over the actions using a softmax with a fixed low temperature. Note that this temperature used for the actions is different from that used in Equation (6).
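As an illustration of the critic of Fig. 2, here is a minimal PyTorch sketch, simplified to a maximum arity of 2 (the class and argument names are our assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class GRUCritic(nn.Module):
    """Sketch of the GRU-based critic for a state made of n1 unary and
    n2 binary predicates over m constants (higher arities would add
    more GRU units, as described in the text)."""
    def __init__(self, n1, n2, E=32, hidden=64):
        super().__init__()
        # two GRUs for the binary predicates, one per argument position
        self.gru_arg1 = nn.GRU(input_size=n2, hidden_size=E, batch_first=True)
        self.gru_arg2 = nn.GRU(input_size=n2, hidden_size=E, batch_first=True)
        # a GRU reading, per constant, the unary predicates + binary summaries
        self.gru_obj = nn.GRU(input_size=n1 + 2 * E, hidden_size=E,
                              batch_first=True)
        self.head = nn.Sequential(nn.Linear(E, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, unary, binary):
        # unary: [m, n1]; binary: [m, m, n2]
        _, h1 = self.gru_arg1(binary)                       # per-o summary over o2
        _, h2 = self.gru_arg2(binary.transpose(0, 1).contiguous())
        summaries = torch.cat([unary, h1[0], h2[0]], dim=-1)  # [m, n1 + 2E]
        _, h = self.gru_obj(summaries.unsqueeze(0))         # one sequence over o
        return self.head(h[0].squeeze(0))                   # estimated V(s)
```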
4.2.3 Incremental Training

Learning an interpretable solution amounts to searching in a discrete space whose size increases exponentially with the size of the solution (i.e., $L$). In complex tasks whose solution requires a logic program with a large depth, the SL or RL training we have just discussed may not be sufficient. To face this difficulty, we also experimented with a more elaborate training procedure, where potentially useful auxiliary predicates are incrementally learned and added to the initial predicates, which can then facilitate finding a good interpretable solution. This is achieved by repeatedly training a fixed DLM architecture with an increasing number of initial predicates. More specifically, we repeat the following training phase. At phase $i$, we apply the SL or RL training method discussed previously on a randomly initialized DLM to obtain an interpretable solution, using as initial predicates the set $\mathcal{P}^{(i)}$, which includes the predicates of $\mathcal{P}^0 = \bigcup_b \mathcal{P}^0_b$ (i.e., all initial predicates of the problem) in addition to the predicates invented in previous phases (if any). Thus, during the first phase, $\mathcal{P}^{(1)} = \mathcal{P}^0$. At the end of a training phase, if the learned interpretable solution solves the problem, the procedure ends. Otherwise, although the obtained solution is imperfect, it may still contain some useful invented predicates. Therefore, we extract (see Section 4.3) the invented predicates from this interpretable solution, which are then added to $\mathcal{P}^{(i)}$ to form $\mathcal{P}^{(i+1)}$ for the next phase.

Intuitively, incremental training stacks many DLM networks (each with $L$ layers) to obtain a large DLM network whose number of layers can be up to the number of phases times $L$, depending on which invented predicates are used. Thus, incremental training can be seen as a procedure for training a very deep DLM architecture. Note that this procedure increases the number of learnable parameters (as we iteratively increase the input size), but the overall architecture still remains independent of the number of objects.

4.3 Logic Program Extraction

During evaluation and deployment, both the time and space complexity increase quickly as the number of constants increases. To speed up inference and obtain an interpretable model, we post-process the trained model to extract the logical formulas instead of using it directly. For each used module, we replace the Gumbel-softmax (6) by an argmax to choose the predicates deterministically. The fuzzy operations can then be replaced by their corresponding Boolean ones. Formula extraction can be done recursively from the output of the model. All the non-selected input predicates coming from the previous layer do not need to be computed. A graph containing only the selected predicates is built from the output to the input predicates. The extracted interpretable model can then operate on Boolean tensors, which further saves space and computation time. A minimal sketch of this extraction is given below.
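The following sketch illustrates the extraction step; the `modules` data structure is a hypothetical representation of the trained network, not the paper's code:

```python
import torch

def extract_program(modules, output_names):
    """Replace each Gumbel-softmax by an argmax and walk the graph from the
    outputs back to the inputs, keeping only the selected predicates.

    `modules` maps a predicate name to (kind, theta, input_names), where
    `kind` is "and" or "or" and `theta` holds the selection parameters
    (shape [n_A, num_inputs], as in the LogicModule sketch above)."""
    program, stack = {}, list(output_names)
    while stack:
        name = stack.pop()
        if name in program or name not in modules:  # initial predicates
            continue                                # stop the recursion
        kind, theta, input_names = modules[name]
        selected = [input_names[i] for i in theta.argmax(dim=-1).tolist()]
        program[name] = (kind, selected)            # e.g., ("and", [P, Q])
        stack.extend(selected)                      # non-selected are pruned
    return program
```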
5 Experimental Results

We performed four series of experiments. The first (resp. second) series evaluates DLM against SOTA baselines on ILP (resp. RL) tasks. The third series corresponds to an ablation study that justifies the different components (i.e., critic, Gumbel-softmax, dropout) of our method. The last series presents a comparison of the methods in terms of computational costs. Examples of interpretable policies learned by DLM are provided in Appendix B.2. Other details about the experiments (computer specifications, curriculum learning, hyperparameters used in DLM and the baselines) are provided in Appendix D.

The first series, which focuses on SL training, confirms that DLM is competitive with SOTA baselines on ILP tasks. Compared to interpretable models (Payani & Fekri, 2019a; Evans & Grefenstette, 2018), DLM performs better, and compared to a SOTA non-interpretable model (Dong et al., 2019), DLM is easier to train successfully. To keep the paper short, we present the details of those results in Appendix B. The results for the other experiments are discussed next.

5.1 RL Tasks

For the RL experiments, we justify the baselines we used, provide a short description of the tasks with their corresponding training procedures, explain the performance metrics, and discuss the experimental results.

RL Baselines. Our evaluation on ILP tasks suggests that the best neural-logic architectures, apart from DLM, are ∂ILP and NLM. While the authors of NLM directly demonstrated it in an RL setting, ∂ILP was later extended into an RL method called NLRL (Jiang & Luo, 2019). Based on these observations, we selected NLM and NLRL as strong neural-logic RL baselines to compare DLM with.

RL Task Description. Our evaluation is performed on six RL tasks: three Blocks World tasks from Jiang & Luo (2019) and three other tasks (two algorithmic tasks and one Blocks World task) from Dong et al. (2019). We provide a short description below; more details can be found in Appendix C. In the first three, Stack, Unstack, and On, the agent is trained to learn the predicate $Move(X, Y)$, which moves block $X$ onto block (or floor) $Y$ if possible. The observable predicates are IsFloor(X), Top(X), and On(X, Y), with an additional predicate OnGoal(X, Y) for the On task only. In Stack, the agent needs to stack all the blocks, whatever their order. In Unstack, the agent needs to put all the blocks on the floor. In On, the agent needs to reach the goal specified by OnGoal. The last three tasks are Sorting, Path, and Blocksworld. In Sorting, the agent must learn $Swap(X, Y)$, where $X$ and $Y$ are two elements of a list to sort. The binary observable predicates are SmallerIndex, SameIndex, GreaterIndex, SmallerValue, SameValue, and GreaterValue. In Path, the agent is given a graph as a binary predicate, with a source node and a target node as two unary predicates. It must learn the shortest path with a unary predicate $GoTo(X)$, where $X$ is the destination node. Blocksworld is the most complex task: it features a target world and a source world with numbered blocks, which makes the number of constants $2(m + 1)$, where $m$ is the number of blocks and 1 corresponds to the floor. The agent learns $Move(X, Y)$ by moving blocks in the source world. It is rewarded if both worlds match exactly. The binary observable predicates are SameWorldID, SmallerWorldID, LargerWorldID, SameID, SmallerID, LargerID, Left, SameX, Right, Below, SameY, and Above.

All these domains are goal-based, sparse-reward RL problems. Since the first three domains are relatively simple, they can be trained and evaluated on fixed instances with a fixed number of blocks. In contrast, for the last three domains, the training and testing instances are generated randomly.

Table 1: Average rewards of NLRL, NLM, and DLM on RL tasks for the best seed.

| | NLRL | NLM | nIDLM | DLM | DLM+incr | DLM+incr (SL) |
|---|---|---|---|---|---|---|
| Interpretable | yes | no | no | yes | yes | yes |
| Unstack (5 vari.) | 0.914 ± 0.01 | 0.920 ± 0 | 0.920 ± 0 | 0.920 ± 0 | 0.920 ± 0 | 0.920 ± 0 |
| Stack (5 vari.) | 0.877 ± 0.09 | 0.920 ± 0 | 0.920 ± 0 | 0.920 ± 0 | 0.920 ± 0 | 0.920 ± 0 |
| On (5 vari.) | 0.885 ± 0.01 | 0.896 ± 0 | 0.896 ± 0 | 0.896 ± 0 | 0.896 ± 0 | 0.896 ± 0 |
| Sorting (m = 10) | 0.866 ± 0.1 | 0.939 ± 0.02 | 0.939 ± 0.02 | 0.939 ± 0.02 | 0.939 ± 0.02 | 0.939 ± 0.02 |
| Sorting (M = 50) | N/A | 0.556 ± 0.02 | 0.556 ± 0.02 | 0.559 ± 0.02 | 0.559 ± 0.02 | 0.559 ± 0.02 |
| Path (m = 10) | N/A | 0.970 ± 0 | 0.970 ± 0 | ‡ | 0.319 ± 0.2 | 0.970 ± 0 |
| Path (M = 50) | N/A | 0.970 ± 0 | 0.970 ± 0 | ‡ | 0.004 ± 0.4 | 0.970 ± 0 |
| Blocksworld (m = 10) | N/A | 0.888 ± 0.02 | 0.904 ± 0.02 | ‡ | ‡ | 0.894 ± 0.02 |
| Blocksworld (M = 50) | N/A | 0.153 ± 0.04 | 0.159 ± 0.11 | ‡ | ‡ | 0.230 ± 0.04 |

N/A: out-of-memory issue.
‡: could not extract a working interpretable policy for the given architecture size.

Those last three domains, which are much harder than the first three, require training with curriculum learning (CL), which was also used by Dong et al. (2019). The difficulty of a lesson in CL is defined by the number of objects. Further details about training with CL are provided in Appendix D.2. After training with $m = 10$, we evaluate the learned model on test instances with $m = 10$, but also with $M = 50$ to assess its generalizability.

RL Metrics. As for ILP tasks, performance can be measured in terms of success rates, i.e., the percentage of times a trained policy solves an RL task (i.e., reaches a goal). In addition, another natural metric in RL is the average reward obtained during testing.

RL Results. Table 1 reports the average rewards of all the algorithms on the different RL tasks. For DLM, we provide the results obtained by RL training (see Section 4.2.2) where we enforce convergence to an interpretable policy (DLM) or not (nIDLM). DLM+incr and DLM+incr (SL) correspond to our architecture trained incrementally via RL and SL training respectively (see below). Each subrow of an RL task corresponds to the instance(s) on which a trained model is evaluated.

The experimental results show that NLRL does not scale to the harder RL tasks, as expected from its performance on ILP tasks. On Sorting, where we can learn a fully-interpretable policy, DLM is better than NLM in terms of computational time and memory usage during testing. Thus, DLM is always superior to NLRL and can match the performance of NLM on the simpler tasks. For the harder RL tasks (Path, Blocksworld), when we do not enforce convergence to an interpretable policy, our method with RL training (nIDLM) reaches similar or better performance than NLM. Interestingly, although both NLM and nIDLM learn a non-interpretable policy, nIDLM generalizes better than NLM on Blocksworld, which suggests that the architectural inductive bias of DLM is more suitable for problems described with FOL.

However, obtaining a good interpretable policy with CL turns out to be difficult: there is a tension between learning to solve a lesson and converging to a final interpretable policy that generalizes. On the one hand, we can learn an interpretable policy for a lesson with a small number of objects; however, that policy will probably not generalize, and training that interpretable policy on the next lesson is hard, since the softmaxes are nearly argmaxes. On the other hand, we can learn to solve all the lessons with a non-interpretable policy, but that final policy is hard to turn into an interpretable one, because of the many local optima in the loss landscape. This training difficulty explains why we did not manage to learn an interpretable policy for Path and Blocksworld using RL training. To demonstrate that this issue is due to RL training (and not to our DLM architecture), we also evaluated it with incremental training (see Section 4.2.3). In the incremental training phases, we tried both SL and RL training.
While incremental RL training does increase the performance compared to RL training alone, notably in Path, it is not sufficient to solve Path or Blocksworld completely. However, incremental SL training is successful, as reported in column DLM+incr (SL) of Table 1. It corresponds to an imitation learning scenario where the agent tries to copy a good policy given by a teacher (possibly the one learned by nIDLM, which does not enforce interpretability). With DLM+incr (SL), unlike vanilla DLM and DLM+incr, we were able to extract a good interpretable policy for Blocksworld. It needed a stack of 4 DLMs of depth 8, which shows the difficulty of finding an interpretable formula for this task. Interestingly, the generalization performance of DLM+incr (SL) is the best on Blocksworld, which suggests that enforcing interpretability is beneficial. We leave for future work the investigation of alternative RL training methods that scale better than CL for sparse-reward RL problems like Path and Blocksworld.

5.2 Ablation and Computational Cost Studies

For simplicity, these studies are performed on ILP tasks. Their description can be found in Appendix B.

Ablation. We performed an ablation study to understand the contribution of the different features (e.g., Gumbel noise in the softmaxes, decreasing β, use of dropout) of DLM. We trained our model on ILP tasks using only a softmax without injecting noise, without decreasing the noise over time, without dropout, and finally by replacing the Gumbel distribution with a Gaussian one. In those experiments, during evaluation, we still used an argmax to retrieve the interpretable rules. Table 2 shows that all our choices help our model reach interpretable rules. Success rate (SR) is the proportion of examples well-classified by a trained model. Percentage of successful seeds (PSS) is the proportion of seeds for which a trained model reaches an SR of 100%.

Table 2: Average percentage of successful seeds (PSS) and average success rates (SR) on all the tasks of Family Tree and Graph Reasoning during testing with interpretable rules. Scores computed with 5 seeds for each task.

| | Average PSS (%) | Average SR (%) |
|---|---|---|
| Softmax without noise | 58 | 97.3 ± 6.1 |
| Constant β with dropout | 68 | 97.7 ± 9.3 |
| Without dropout | 70 | 99.3 ± 1.6 |
| Gaussian noise | 80 | 99.7 ± 0.8 |
| DLM | 95 | 99.8 ± 0.7 |

In addition, to evaluate the quality of our proposed critic architecture, we used it to train both NLM and DLM. Fig. 3 shows the testing performance of NLM and DLM trained with REINFORCE or with an actor-critic scheme using our proposed critic architecture. Recall that the testing performance corresponds to the evaluation of the latest trained policy in terms of success rates at regular training steps. It is averaged over 3 environments (Unstack, Stack, and On) and 5 random seeds. As shown in Fig. 3, using our proposed critic architecture greatly accelerates RL training for both NLM and DLM.

[Figure 3: Learning performance (testing performance vs. episodes) with or without our proposed critic, for NLM (left) and DLM (right).]

Comparative computational costs. We now compare the different algorithms with respect to computational time and memory usage during testing (Figure 4) and training (see Table 5 in the appendix).
During Published in Transactions on Machine Learning Research (07/2023) Is Grand Parent 2-outdegree 2 4 6 8 10 12 14 persons ( 10) 0 5 10 15 20 25 30 d NL-ILP NLM DLM (Ours) 5 10 15 20 25 30 35 persons ( 10) 0 10 20 30 40 50 60 70 80 90 d NL-ILP NLM DLM (Ours) (a) Computational time Is Grand Parent 2-outdegree 3 4 5 6 7 8 9 10 11 12 persons ( 10) Memory (Mi B) d NL-ILP NLM DLM (Ours) 5 10 15 20 25 30 35 persons ( 10) 0 1000 2000 3000 4000 5000 6000 Memory (Mi B) d NL-ILP NLM DLM (Ours) (b) Memory usage Figure 4: Comparison during test in Is Grand Parent and 2-outdegree. On 2-outdegree, NLM is rapidly out of memory and d NL-ILP does not achieve a 100% success rate. testing, we extracted the compact representation of the produced solution by DLM (see Section 4.3). Figure 4 clearly shows that DLM scales better than the other baselines in terms of both memory and computational times. Note that in the second task (2-outdegree), d NL-ILP does not achieve a 100% success rate and searches instead in a much smaller space. 6 Conclusion We proposed a novel neural-logic architecture that is capable of learning a fully-interpretable solution, i.e., logic program. Among differentiable methods, it obtains state-of-the-art results for inductive logic programming tasks, while retaining interpretability and scaling much better. For reinforcement learning tasks, it is superior to previous interpretable neuro-logic models. Compared to non-interpretable models, it can achieve comparable or higher performances using incremental training. Moreover, it can generalize better, and more importantly, it scales much more advantageously in terms of computational times and memory usage during testing. Our results suggest that learning a fully-interpretable solution using supervised learning with our approach could be practical, but enforcing interpretability in more complex reinforcement learning tasks, especially with RL training using sparse rewards, is a much harder problem, which calls for novel neural-logic architectures or novel training techniques. Another interesting phenomenon that we have observed is that there is a tension between learning interpretable solutions and curriculum learning. We leave a study of this phenomenon for future work. Acknowledgments This work has been supported in part by a project funded by Huawei (HF2020055001) and the program of National Natural Science Foundation of China (No. 62176154). Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, and et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82 115, 2020. Richard Bellman. A Markovian decision process. Journal of mathematics and mechanics, pp. 679 684, 1957. Tarek R. Besold, Artur d Avila Garcez, Sebastian Bader, Howard Bowman, Pedro Domingos, Pascal Hitzler, Kai-Uwe Kuehnberger, Luis C. Lamb, Daniel Lowd, Priscila Machado Vieira Lima, Leo de Penning, Gadi Pinkas, Hoifung Poon, and Gerson Zaverucha. Neural-symbolic learning and reasoning: A survey and interpretation. ar Xiv:1711.03902, 2017. Published in Transactions on Machine Learning Research (07/2023) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, 2014. 
Andrew Cropper, Sebastijan Dumančić, and Stephen H. Muggleton. Turning 30: New ideas in inductive logic programming. In IJCAI, 2020.

Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In IJCAI, 2007.

Luc De Raedt, Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical Relational AI: Logic, Probability and Computation. Morgan & Claypool, 2016.

Luc de Raedt, Sebastijan Dumančić, Robin Manhaeve, and Giuseppe Marra. From Statistical Relational to Neuro-Symbolic Artificial Intelligence. In IJCAI, 2020.

Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. Neural Logic Machines. In ICLR, 2019.

Sašo Džeroski, Luc De Raedt, and Kurt Driessens. Relational reinforcement learning. Machine Learning, 43(1):7–52, 2001.

Artur d'Avila Garcez, Marco Gori, Luis C. Lamb, Luciano Serafini, Michael Spranger, and Son N. Tran. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. arXiv:1905.06088, 2019.

Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of AI Research, 2018.

Claire Glanois, Zhaohui Jiang, Xuening Feng, Paul Weng, Matthieu Zimmer, Dong Li, Wulong Liu, and Jianye Hao. Neuro-symbolic hierarchical rule induction. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 7583–7615. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/glanois22a.html.

A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.

Zhengyao Jiang and Shan Luo. Neural Logic Reinforcement Learning. In ICML, 2019.

Vijay Konda and John Tsitsiklis. Actor-critic algorithms. In NeurIPS, volume 12, 2000.

Robert A. Kowalski. Logic programming. In Computational Logic, volume 9, pp. 523–569, 2014.

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Mark Law, Alessandra Russo, and Krysia Broda. Inductive learning of answer set programs from noisy examples. Advances in Cognitive Systems, 2018.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.

Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming. In NeurIPS, 2018.

Gary Marcus. Deep learning: A critical appraisal. arXiv:1801.00631, 2018.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

Stephen Muggleton. Inductive logic programming. New Generation Computing, 8(4):295–318, Feb 1991.

Ali Payani and Faramarz Fekri. Inductive logic programming via differentiable deep neural logic networks. arXiv:1906.03523, 2019a.
Ali Payani and Faramarz Fekri. Learning algorithms via neural logic networks. arXiv:1904.01554, 2019b.

Ali Payani and Faramarz Fekri. Incorporating relational background knowledge into reinforcement learning via differentiable inductive logic programming. arXiv:2003.10386, 2020.

J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, 2015.

J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347, August 2017.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In NeurIPS, 2015.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Prasad Tadepalli, Robert Givan, and Kurt Driessens. Relational reinforcement learning: An overview. In ICML Workshop on Relational Reinforcement Learning, 2004.

Martijn van Otterlo. Solving relational and first-order logical Markov decision processes: A survey. In Reinforcement Learning, volume 12, pp. 253–292, 2012.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In NeurIPS, 2017.

Yuan Yang and Le Song. Learn to explain efficiently via neural logic inductive learning. In ICLR, 2020.

A Details About DLM

A.1 Tensor Representation

DLM processes predicates by working on the concatenation of their corresponding tensors. We define below the concatenation, provide some illustrative examples, and list the pseudo-code (Algorithm 1) for the inference in DLM.

A set P_b of b-ary predicates can be represented as a tensor, denoted P_b, of order b + 1 with shape [m, ..., m, |P_b|]. The concatenation of two tensors P_b and Q_b, representing two sets of b-ary predicates P_b and Q_b, is performed over the last dimension. It results in a tensor of order b + 1 with shape [m, ..., m, |P_b| + |Q_b|].

We provide some examples of how the different operations (i.e., expansion, reduction, permutation) work on tensors:

Example 4. Expansion: A unary predicate Top(X) represented over 3 constants by the vector (0, 0, 1) would be expanded into TopE(X, Y) as the matrix ((0, 0, 0), (0, 0, 0), (1, 1, 1)). Reduction: A binary predicate On(X, Y) represented by the matrix ((0, 1, 0), (0, 0, 1), (0, 0, 0)) would be reduced with the existential quantifier to the vector (1, 1, 0), representing whether X is On any object. Permutation: A binary predicate On(X, Y) represented by the matrix ((0, 1, 0), (0, 0, 1), (0, 0, 0)) would be permuted to the matrix ((0, 0, 0), (1, 0, 0), (0, 1, 0)), representing the relation Under(X, Y).

We provide a detailed example for understanding DLM's architecture:

Example 5. Consider a blocks world environment with m = 4 objects: 3 blocks (u, v, w) and the floor. Assuming that only two initial predicates are available, the facts {On(u, v), On(v, floor), On(w, floor), Top(u), Top(w)} are encoded by two tensors. The first tensor, encoding the unary predicate Top(X), is a vector of length 4, since there are 4 objects. The second tensor, for the binary predicate On(X, Y), is a 4 × 4 matrix. These two tensors feed the DLM network at layer 0 at different breadths (1 and 2 respectively).
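To make the encoding concrete, the following minimal NumPy sketch (our own illustration, not the authors' code; the object ordering and variable names are our choices) builds the two tensors of Example 5 and applies the three operations of Example 4:

```python
# Sketch of the tensor encoding of Example 5 and of the expansion,
# reduction, and permutation operations of Example 4 (our own illustration).
import numpy as np

OBJECTS = ["u", "v", "w", "floor"]          # m = 4 constants
m = len(OBJECTS)
idx = {o: i for i, o in enumerate(OBJECTS)}

top = np.zeros(m)                           # unary predicate Top(X)
for x in ["u", "w"]:
    top[idx[x]] = 1.0

on = np.zeros((m, m))                       # binary predicate On(X, Y)
for x, y in [("u", "v"), ("v", "floor"), ("w", "floor")]:
    on[idx[x], idx[y]] = 1.0

# Expansion: Top(X) -> TopE(X, Y), repeating the vector along a new last axis.
top_e = np.repeat(top[:, None], m, axis=1)

# Reduction: OnR2(X) <- ∃Y On(X, Y) (max) and OnR1(X) <- ∀Y On(X, Y) (min).
on_r2 = on.max(axis=-1)
on_r1 = on.min(axis=-1)

# Permutation: for a binary predicate, On(Y, X) is simply the transpose.
under = on.T

print(on_r2)  # [1. 1. 1. 0.]: u, v, and w are each on some object
```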
We first focus on what happens in the first layer at breadth 1 (l = 1, b = 1). By assumption, there is no expansion. Since the previous layer l − 1 at breadth b + 1 is not empty, the reduction operation generates two predicates OnR1(X) ← ∀Y, On(X, Y) and OnR2(X) ← ∃Y, On(X, Y). Hence, the possible (positive) input predicates R^1_1 are {Top(X), OnR1(X), OnR2(X)}. Since we deal with unary predicates, the permutation operation does not play a role.

Now, consider the first layer at breadth 2 (l = 1, b = 2). By assumption, there is no reduction. The expansion operation generates one predicate TopE(X, Y), whose truth value is given by Top(X). Therefore, after the permutation operation, R^1_2 = {On(X, Y), TopE(X, Y), On(Y, X), TopE(Y, X)}.

Assume that we want to detect the objects that are neither the floor nor on the top of a stack (i.e., v here) with a unary predicate, which we call BlockNotTop(X). This predicate can be invented as a conjunction with a negative first atom. As |R^1_1| = 3, w_P and w_P′ are vectors of length 4 due to the preservation operation. To build BlockNotTop(X) ← ¬Top(X) ∧ OnR2(X), w_P should be zero everywhere except for selecting ¬Top(X) in {T} ∪ η(R^l_b). Accordingly, w_P′ should be zero everywhere except for selecting OnR2(X) inside {T} ∪ R^l_b.

We provide the formulas for the inference of the NLM and DLM units in the first layer associated with Example 5 in Table 3.

Table 3: First layer of NLM and DLM on Example 5. The input predicates are composed of Top(X), represented by a vector U1 of length m, and On(X, Y), represented by an m × m matrix U2. The operation || denotes the concatenation of tensors on a new dimension. Note that the min and max operations for the reduction are always performed on the last dimension of the input tensor. The expansion operation is denoted ζ; it repeats a tensor m times on a new last dimension. The permutation operation is denoted ν; it permutes a tensor according to all the possible permutations without repetition, stacked in a new dimension. For instance, applying ν to ζ(U2), an m × m × m tensor, results in a new m × m × m × 3! tensor. Note that for DLM we assumed n_A = 2 for readability.
Output breadth 0 (nullary predicates):
  NLM: MLP(min(U1) || max(U1))
  DLM: FuzzyAND_{θ,τ}(U3, U3), FuzzyOR_{θ,τ}(U3, U3), with U3 = min(U1) || max(U1) || 1

Output breadth 1 (unary predicates):
  NLM: MLP(min(U2) || max(U2) || U1)
  DLM: FuzzyAND_{θ,τ}(U4, U4), FuzzyAND_{θ,τ}(U4, 1 − U4), FuzzyOR_{θ,τ}(U4, U4), FuzzyOR_{θ,τ}(U4, 1 − U4), with U4 = min(U2) || max(U2) || U1 || 1

Output breadth 2 (binary predicates):
  NLM: MLP(U2 || U2^T || ζ(U1))
  DLM: FuzzyAND_{θ,τ}(U5, U5), FuzzyAND_{θ,τ}(U5, 1 − U5), FuzzyOR_{θ,τ}(U5, U5), FuzzyOR_{θ,τ}(U5, 1 − U5), with U5 = U2 || U2^T || ζ(U1) || 1

Output breadth 3 (ternary predicates):
  NLM: MLP(ν(ζ(U2)))
  DLM: FuzzyAND_{θ,τ}(U6, U6), FuzzyAND_{θ,τ}(U6, 1 − U6), FuzzyOR_{θ,τ}(U6, U6), FuzzyOR_{θ,τ}(U6, 1 − U6), with U6 = ν(ζ(U2)) || 1

Algorithm 1 Inference of a module at layer l and breadth b
Require: P^{l−1}_{b−1}, P^{l−1}_b, P^{l−1}_{b+1} (tensor representations of the sets of predicates at layer l − 1 with breadths b − 1, b, and b + 1 respectively), θ (parameters of the module), and τ (temperature)
Ensure: P^l_b (tensor representation of the set of output predicates)
  P̌^{l−1}_{b+1} ← Reduction(P^{l−1}_{b+1})
  P̂^{l−1}_{b−1} ← Expansion(P^{l−1}_{b−1})
  Q^l_b ← Permutation(Concatenation(P̌^{l−1}_{b+1}, P^{l−1}_b, P̂^{l−1}_{b−1}))
  Q^l_b ← Preservation(Q^l_b)
  R^l_b ← Negation(Q^l_b)
  // Assuming n_A = 2 for readability
  F1 ← FuzzyAND_{θ,τ}(Q^l_b, Q^l_b)   (Eq. (3))
  F2 ← FuzzyAND_{θ,τ}(Q^l_b, R^l_b)   (Eq. (3))
  F3 ← FuzzyOR_{θ,τ}(Q^l_b, Q^l_b)    (Eq. (4))
  F4 ← FuzzyOR_{θ,τ}(Q^l_b, R^l_b)    (Eq. (4))
  P^l_b ← Concatenation(F1, F2, F3, F4)
  return P^l_b

A.2 Interpretability and Expressivity

Interpretation Following NLM, we keep the probabilistic interpretation of the tensor representations of the predicates in DLM. This interpretation justifies the application of statistical machine learning techniques to train DLM, via cross-entropy minimization for supervised tasks (i.e., ILP) for instance. In this interpretation, using the fuzzy conjunctions and disjunctions can be understood as making the assumption of probabilistic independence between the truth values of any pair of atoms in R^l_b. This may seem a strong assumption; however, it is not detrimental since we want to learn a logic program operating on Boolean tensors.

Expressivity Previous work like ∂ILP or NLM (Evans & Grefenstette, 2018; Jiang & Luo, 2019; Dong et al., 2019) can express Datalog programs, a subset of FOL composed of Horn clauses that do not contain function symbols. With the negation operation, DLM is more expressive than ∂ILP. With sufficient breadth B and depth L, it can express any normal logic program, i.e., Horn clauses with negative literals (Kowalski, 2014), that does not contain function symbols. We refer the reader to the proof in Dong et al. (2019); its extension with negation is straightforward. Hence, the only FOL formulas not covered by DLM are the ones containing function symbols. In practice, the expressivity of DLM can be controlled not only by setting the hyperparameters L, B, n_A, and n_O, but also by restricting the inputs of the logic modules (e.g., no negation, or no existential or universal quantifiers). Therefore, a priori knowledge can be injected into this architecture by choosing different values of B at each layer, different values of n_O for each computation unit, different values of n_A for each logic module, or by removing some inputs of logic modules, for instance.

A.3 Implementation Remarks

In our implementation, half of the n_O outputs of each computation unit correspond to and logic modules and the other half to or modules. Moreover, to enforce a stronger bias, among the and (resp. or) logic modules, half of them only take Q^l_b ∪ {T} (resp. Q^l_b ∪ {F}) as inputs. Note that there is no expressivity loss with this bias, as long as the architecture is large enough. For the other half of the and (resp. or) modules, half of the n_A terms used to build a conjunction (resp. disjunction) are from Q^l_b ∪ {T} (resp. Q^l_b ∪ {F}) and the other half are from η(Q^l_b) ∪ {T} (resp. η(Q^l_b) ∪ {F}).
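To illustrate how a logic module combines weighted predicate selection with noise (cf. Algorithm 1 and Table 3), here is a schematic PyTorch sketch. The product t-norm for FuzzyAND and the probabilistic sum for FuzzyOR are our assumptions, consistent with the independence interpretation above; the exact forms of Eqs. (3)-(4) in the main text may differ.

```python
# Schematic sketch of a DLM logic module with n_A = 2 (our assumptions:
# product t-norm for AND, probabilistic sum for OR; see lead-in above).
import torch

def gumbel_softmax_select(w, preds, tau, beta, training=True):
    """Soft selection of one predicate: preds has shape [..., n], w has shape [n]."""
    if training and beta > 0:
        u = torch.rand_like(w).clamp(1e-9, 1 - 1e-9)
        w = w + beta * (-torch.log(-torch.log(u)))   # Gumbel(0, 1) noise
    probs = torch.softmax(w / tau, dim=-1)           # -> one-hot as tau decreases
    return (preds * probs).sum(dim=-1)

def fuzzy_and(w1, w2, preds, tau, beta):
    a = gumbel_softmax_select(w1, preds, tau, beta)
    b = gumbel_softmax_select(w2, preds, tau, beta)
    return a * b                                     # product t-norm

def fuzzy_or(w1, w2, preds, tau, beta):
    a = gumbel_softmax_select(w1, preds, tau, beta)
    b = gumbel_softmax_select(w2, preds, tau, beta)
    return a + b - a * b                             # probabilistic sum
```

At evaluation time, replacing the softmax by an argmax over w recovers a hard selection of one predicate per argument, i.e., an interpretable rule.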
A.4 Training Algorithms

For completeness, we provide the pseudo-code of the training algorithms we used. They are mostly standard, apart from the incremental training technique. Here is the pseudo-code for supervised training:

Algorithm 2 Supervised training of DLM
Require: (P_0, P_T) (input and target predicates describing an ILP task), θ (randomly-initialized DLM parameters), with hyperparameters τ_i (softmax temperature), β_i (Gumbel scale), p_i (dropout probability), α_i (learning rate)
Ensure: θ (trained DLM parameters)
for i = 0, 1, 2, . . . do
  Sample a batch of ILP instances described by atoms (P_0, P_T)
  P̂_T ← DLM_θ(P_0) with softmax temperature τ_i, Gumbel noise β_i, and dropout probability p_i
  Compute the binary cross-entropy loss L_θ(P_T, P̂_T)
  θ ← θ − α_i ∇_θ L_θ(P_T, P̂_T)
end for
return θ

The algorithm could naturally also be applied to train on a fixed, unique ILP instance. The hyperparameters are scheduled to decrease so that convergence to a fully-interpretable solution is possible (see Table 10 for more details).

Here is the pseudo-code for RL training, which is based on PPO (Schulman et al., 2017):

Algorithm 3 RL training of DLM
Require: (P_0, P_A) (predicates describing an RL task), θ (randomly-initialized DLM parameters), ψ (randomly-initialized critic parameters), with hyperparameters τ_i (softmax temperature), β_i (Gumbel scale), p_i (dropout probability), α and ρ (learning rates)
Ensure: θ (trained DLM parameters)
for i = 0, 1, 2, . . . do
  for j = 1, 2, . . . , N do
    Sample an initial state s^j_0 (described by P_0)
    Generate the j-th trajectory (s^j_0, s^j_1, . . .) using DLM_θ with softmax temperature τ_i, Gumbel noise β_i, and dropout probability p_i
    Compute the rewards-to-go R^j_t of the j-th trajectory (i.e., 1 if the goal is reached and 0 otherwise)
  end for
  Compute Â from (R^j_t)_{j,t} and the critic values (v_ψ(s^j_t))_{j,t} with GAE
  θ ← θ + α ∇_θ J_PPO(θ)
  ψ ← ψ − ρ ∇_ψ MSE(Â + (v(s^j_t))_{j,t}, (v_ψ(s^j_t))_{j,t})
end for
return θ

Recall that in PPO, the update of the critic parameters is performed by minimizing, through gradient descent, the mean square error (MSE) between the output of the critic and the Generalized Advantage Estimate (GAE) (Schulman et al., 2017).

Here is the pseudo-code for incremental training:

Algorithm 4 Incremental training of DLM
Require: (P_0, P_A) (predicates describing an RL task)
Ensure: P_0 (initial predicates and those invented during incremental training), θ (trained DLM parameters)
for k = 0, 1, 2, . . . do
  Randomly initialize θ
  θ ← train DLM_θ on (P_0, P_A)
  if performance is satisfying then
    return (P_0, θ)
  else
    P(k) ← extract the predicates used in DLM_θ
    P_0 ← P_0 ∪ P(k)
  end if
end for

Incremental training is written here for solving an RL task, but it could be adapted to solve a hard ILP task as well. Inside the for loop, DLM can be trained via SL or RL; depending on the training method, the corresponding hyperparameters need to be passed. In addition, for RL training, the critic parameters are also needed and should be randomly initialized before each RL training round.
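The incremental loop of Algorithm 4 can be summarized by the following Python sketch (the helpers train_dlm, evaluate, and extract_predicates are hypothetical names standing for Algorithms 2/3, policy or model evaluation, and rule extraction respectively):

```python
# Sketch of Algorithm 4 (incremental training); helper names are hypothetical.
def incremental_training(p0, task, max_rounds=10, threshold=1.0):
    theta = None
    for _ in range(max_rounds):
        theta = train_dlm(p0, task)            # SL or RL training (Alg. 2/3)
        if evaluate(theta, p0, task) >= threshold:
            break                              # performance is satisfying
        p0 = p0 | extract_predicates(theta)    # add the invented predicates
    return p0, theta
```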
B Additional Experimental Results and Details

B.1 ILP Tasks

For the ILP experiments, we justify the baselines we selected, provide a short description of the tasks, explain the performance metrics used for the evaluation, and discuss the results we obtained in terms of performance and computational costs. During this discussion, we provide the training details.

ILP Baselines Since DLM is a neural-logic architecture, we only compare with other neural-logic approaches: ∂ILP (Evans & Grefenstette, 2018), NLM (Dong et al., 2019), and dNL-ILP (Payani & Fekri, 2019a). Differentiable architectures such as MemNN (Sukhbaatar et al., 2015) or DNC (Graves et al., 2016) are not included, since they have been shown to be inferior to NLM on ILP tasks and, furthermore, they do not provide interpretable solutions. The multi-hop reasoning approaches (Yang et al., 2017; Yang & Song, 2020) are also left out: although they can scale well, the rules they can learn are much less expressive, which prevents them from solving complex ILP tasks in an interpretable way. Besides, DeepProbLog (Manhaeve et al., 2018), which is based on backward chaining, is not included since it is not easy to perform predicate invention with it.

ILP Task Description For the ILP tasks, we evaluate on two domains: Family Tree and Graph Reasoning. In the Family Tree domain, different tasks are considered, corresponding to different target predicates to be learned from an input graph where nodes, representing individuals, are connected by the relations IsMother(X, Y), IsFather(X, Y), IsSon(X, Y), and IsDaughter(X, Y). The target predicates are HasFather, HasSister, IsGrandparent, IsUncle, and IsMGUncle (i.e., maternal great uncle). In the Graph Reasoning domain, the different target predicates to be learned from an input graph are AdjacentToRed, 4-Connectivity, 6-Connectivity, 1-OutDegree, and 2-OutDegree (see Appendix C.1 for their definitions).

ILP Metrics The performance of an ILP method on a task can be measured in terms of success rate, which is the percentage of relations in a test instance that are correctly classified by a trained model. Following previous work, we report it as an average over 250 random instances for the best model obtained over 10 random seeds. In addition, we also report the percentage of successful seeds (PSS), which is the percentage of those 10 seeds that reach a 100% success rate on the testing instances. PSS indicates how reliable a model and its training are.

Performance on ILP Tasks In Table 4, we report those two metrics for DLM and the selected baselines. All the methods are trained on random instances with a number of constants m = 20, except for ∂ILP. Since its authors did not release their source code, the reported success rates for ∂ILP are from Dong et al. (2019) and its PSS is missing. For NLM and dNL-ILP, we use the source code shared by their authors. For the latter, Payani & Fekri (2019a) did not evaluate their method on any standard ILP tasks; using their source code, we did our best to find the best set of hyperparameters (see Appendix D.3) for each ILP task. Column m = 20 provides the success rates for test instances with the same number of constants as in the training instances, while column M = 100 provides the success rates for test instances with 100 constants, which allows measuring the generalization capability of the trained models. N/A means that the method ran out of memory. For dNL-ILP, the memory issue comes from the fact that both learning auxiliary predicates and increasing the number of variables of predicates increase memory consumption sharply with a growing number of constants (see details in Appendix D.3.1).
The experimental results demonstrate that the previous interpretable methods, dNL-ILP and ∂ILP, do not scale to difficult ILP tasks and to a larger number of constants. In contrast, DLM can solve all the ILP tasks, like NLM, while in addition providing an interpretable rule, unlike NLM. Moreover, we can observe that DLM is more stable, in terms of successful seeds, than all the other methods, including NLM. To further demonstrate the stability of DLM, we also report the standard deviations of its success rates over the different seeds in Table 4 (values in parentheses). They show that even when a 100% success rate is not reached for a given seed, the trained model still reaches a success rate close to 100%.

Computational Costs on ILP Tasks In Table 5, we provide the computational time (column T) and memory usage (column M) during training and testing on two representative ILP tasks: IsGrandparent and 2-Outdegree. The Training row provides the computational costs observed during training, while the subsequent rows give the costs measured during testing for different numbers of constants. The dNL-ILP method can scale very well, but both NLM and DLM perform much better than dNL-ILP, as shown in our previous experiments. While the training costs for DLM are slightly higher than for NLM, they are still reasonable. More interestingly, DLM scales much better than NLM during testing, which matters most since these computational costs are incurred after the trained model is deployed.

Table 4: Success rates (%) and percentage of successful seeds of dNL-ILP, ∂ILP, NLM, and DLM on Family Tree and Graph Reasoning. For DLM, standard deviations over seeds are given in parentheses.

Family Tree (m = 20 / M = 100 / PSS):
                 dNL-ILP              ∂ILP        NLM               DLM (Ours)
HasFather        100 / 100 / 100      100 / 100   100 / 100 / 100   100 (±0) / 100 (±0) / 100
HasSister        100 / 100 / 40       100 / 100   100 / 100 / 100   100 (±0) / 100 (±0) / 100
IsGrandparent    100 / 100 / 80       100 / 100   100 / 100 / 100   100 (±0) / 100 (±0) / 100
IsUncle          97.32 / 96.77 / 0    100 / 100   100 / 100 / 90    100 (±0) / 100 (±0) / 100
IsMGUncle        99.09 / N/A / 0      100 / 100   100 / 100 / 20    100 (±0.01) / 100 (±0.08) / 70

Graph Reasoning (m = 10 / M = 50 / PSS):
                 dNL-ILP              ∂ILP        NLM               DLM (Ours)
AdjacentToRed    100 / 100 / 100      100 / 100   100 / 100 / 90    100 (±0.02) / 100 (±0.01) / 90
4-Connectivity   91.36 / 85.30 / 0    100 / 100   100 / 100 / 100   100 (±0) / 100 (±0) / 100
6-Connectivity   92.80 / N/A / 0      100 / 100   100 / 100 / 60    100 (±0.00) / 100 (±0) / 90
1-OutDegree      82.00 / 78.44 / 0    100 / 100   100 / 100 / 100   100 (±0) / 100 (±0) / 100
2-OutDegree      83.39 / 8.24 / 0     N/A / N/A   100 / 100 / 100   100 (±0) / 100 (±0) / 100

N/A: out-of-memory issues. PSS: percentage of successful seeds reaching a 100% success rate on the testing instances.

Table 5: Computational costs of dNL-ILP, NLM, and DLM on IsGrandparent and 2-Outdegree. T: time (s), M: memory (MB).

IsGrandparent   dNL-ILP (T / M)   NLM (T / M)   DLM (Ours) (T / M)
Training        201 / 30          1357 / 70     1629 / 382
m = 10          1 / 27            4 / 24        1 / 2
m = 20          1 / 30            4 / 70        1 / 24
m = 30          5 / 42            5 / 188       1 / 24
m = 40          13 / 99           5 / 414       2 / 74
m = 50          31 / 198          7 / 820       3 / 124
m = 60          65 / 358          8 / 1341      3 / 226
m = 70          119 / 596         8 / 2089      4 / 344
m = 80          192 / 932         11 / 3123     6 / 500
m = 90          303 / 1390        13 / 4434     8 / 724
m = 100         464 / 1994        17 / 6093     10 / 1002
m = 110         656 / 2771        21 / 8079     13 / 1321
m = 120         915 / 3751        27 / 10056    16 / 1710
m = 130         1247 / 4964       N/A / N/A     19 / 2161

2-Outdegree     dNL-ILP (T / M)   NLM (T / M)   DLM (Ours) (T / M)
Training        966 / 22          2522 / 844    3238 / 1372
m = 5           1 / 22            3 / 78        2 / 4
m = 10          1 / 22            6 / 844       2 / 40
m = 15          3 / 42            9 / 4342      3 / 254
m = 20          9 / 112           N/A / N/A     5 / 732
m = 25          24 / 314          N/A / N/A     10 / 1751
m = 30          47 / 682          N/A / N/A     19 / 3594
m = 35          93 / 1332         N/A / N/A     33 / 6666

DLM used depth 4, breadth 3 for IsGrandparent, and depth 6, breadth 4 for 2-Outdegree. N/A: out-of-memory issues.
B.2 Examples of Interpretable Rules or Policies

B.2.1 Discussion of Two Examples

As an illustration for ILP, we provide the logic program learned by our method on the task IsGrandParent. We used L = 5 layers, B = 3 breadths, n_A = 2 atoms, and n_O = 8 outputs per logic module. For better legibility, we give more meaningful names to the learned rules and remove the expansions and reductions:

IsChild1(a, b) ← IsSon(a, b) ∨ IsDaughter(a, b)
IsChild2(a, b) ← IsSon(a, b) ∨ IsDaughter(a, b)
IsGCP(a, b, c) ← IsChild2(a, c) ∧ IsChild2(c, b)
IsGPC1(a, b, c) ← IsChild1(c, a) ∧ IsChild(b, c)
IsGPC2(a, b, c) ← IsGPC1(a, b, c) ∨ IsGCP(b, a, c)
IsGP(a, b) ← ∃C, IsGPC2(a, b, C) ∧ ∃C, IsGPC2(a, b, C)
IsGrandParent(a, b) ← IsGP(a, b) ∧ IsGP(a, b)

The logic program extracted from the trained DLM has redundant parts (e.g., P ∧ P), because we used a relatively large architecture to ensure sufficient expressivity. Note that being able to learn interpretable solutions with a large architecture is a desirable feature when the designer does not know the solution beforehand. Redundancy could be reduced by using a smaller architecture; otherwise, the redundant parts (e.g., P ∧ P) can easily be removed by post-processing the extracted logic program, as we did. After simplification, the solution is given by the following program, which shows that the target predicate has been perfectly learned:

IsChild(a, b) ← IsSon(a, b) ∨ IsDaughter(a, b)
IsGPC1(a, b, c) ← IsChild(c, a) ∧ IsChild(b, c)
IsGrandParent(a, b) ← ∃C, IsGPC1(a, b, C)

As an illustration for RL, we provide the simplified logic program learned by our method on the task On, which corresponds to the output of the reasoning part:

Move(a, b) ← (OnGoal(a, b) ∨ IsFloor(b)) ∧ ¬On(a, b) ∧ Top(a).

Using this program, the decision-making part (stochastically) moves blocks to the floor and moves the good block onto its goal position when it can. For completeness, we provide the complete logic program:

Pred1(a, b) ← OnGoal(b, a) ∨ IsFloor(a)
Pred2(a, b) ← ¬On(a, b) ∧ Top(a)
Pred3(a, b) ← Pred1(b, a) ∧ Pred2(a, b)
Pred4(a, b) ← Pred1(b, a) ∧ Pred1(b, a)
Move(a, b) ← Pred3(a, b) ∧ Pred4(a, b)

Besides, note that we directly output an interpretable logic program. In contrast, with previous interpretable models, logic rules with high weights are extracted to be inspected; however, those rules may not generalize because the weights are usually not concentrated on one element.
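The post-processing mentioned above is straightforward; the toy sketch below (our own rule representation, not the authors' extraction code) collapses the idempotent patterns P ∧ P and P ∨ P into aliases and resolves alias chains:

```python
# Toy simplification of an extracted program: rules are (head, op, (a, b)).
def simplify(rules):
    alias, kept = {}, []
    for head, op, (a, b) in rules:
        a, b = alias.get(a, a), alias.get(b, b)   # resolve earlier aliases
        if a == b and op in ("and", "or"):        # idempotence: P op P = P
            alias[head] = a
        else:
            kept.append((head, op, (a, b)))
    return kept, alias

rules = [("IsGP", "or", ("IsGPC2", "IsGPC2")),
         ("IsGrandParent", "and", ("IsGP", "IsGP"))]
print(simplify(rules))
# ([], {'IsGP': 'IsGPC2', 'IsGrandParent': 'IsGPC2'})
```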
B.2.2 Other ILP Examples

Here are other examples on the Family Tree domain:

Pred1(a) ← ∃B, IsFather(a, B) ∧ ∃B, IsMother(a, B)
Pred2(a) ← ∃B, IsFather(a, B) ∨ ∃B, IsMother(a, B)
Pred3(a) ← ∃B, IsMother(a, B) ∨ ∃B, IsMother(a, B)
Pred4(a) ← Pred1(a) ∨ Pred2(a)
Pred5(a) ← Pred3(a) ∨ Pred3(a)
Pred6(a) ← Pred4(a) ∧ Pred5(a)
Pred7(a) ← Pred6(a) ∨ Pred6(a)
Pred8(a) ← Pred6(a) ∨ Pred6(a)
HasFather(a) ← Pred7(a) ∧ Pred8(a)

Pred1(a, b) ← IsDaughter(b, a) ∧ IsMother(a, b)
Pred2(a, b) ← IsDaughter(b, a) ∧ IsFather(a, b)
Pred3(a, b) ← IsDaughter(b, a) ∨ IsMother(a, b)
Pred4(a, b) ← ∃C, Pred2(b, a, C) ∧ ∃C, Pred1(b, a, C)
Pred5(a, b) ← ∃C, Pred3(b, a, C) ∧ ∃C, Pred1(b, a, C)
Pred6(a) ← ∃B, Pred4(a, B) ∧ ∃B, Pred5(a, B)
Pred7(a) ← Pred6(a) ∨ Pred6(a)
HasSister(a) ← Pred7(a) ∧ Pred7(a)

Pred1(a, b) ← IsSon(b, a) ∧ IsSon(b, a)
Pred2(a, b) ← IsDaughter(b, a) ∨ IsSon(b, a)
Pred3(a, b) ← IsSon(b, a) ∨ IsMother(b, a)
Pred4(a, b) ← IsFather(a, b) ∧ IsFather(a, b)
Pred5(a, b, c) ← IsMother(a, b) ∧ IsMother(a, b)
Pred6(a, b) ← IsSon(b, a) ∧ IsDaughter(b, a)
Pred7(a) ← ∃B, Pred1(a, B) ∨ ∃B, Pred1(a, B)
Pred8(a, b) ← ∃C, Pred5(a, b, C) ∨ ∃C, Pred5(a, b, C)
Pred9(a, b) ← ∃C, Pred6(b, a, C) ∧ ∃C, Pred4(b, a, C)
Pred10(a, b, c) ← Pred2(b, a) ∨ Pred3(a, b)
Pred11(a, b) ← Pred8(a, b) ∧ Pred7(b, a)
Pred12(a, b, c) ← Pred9(a, b) ∨ Pred10(b, c, a)
Pred13(a, b) ← Pred11(a, b) ∧ ∀C, Pred12(a, b, C)
IsUncle(a, b) ← Pred13(a, b) ∧ Pred13(a, b)

C Task Description

C.1.1 Family Tree

The family tree tasks share the same initial predicates: IsFather(X, Y), IsMother(X, Y), IsSon(X, Y), and IsDaughter(X, Y). IsFather(X, Y) is True when Y is X's father; the other three predicates have similar meanings.

HasFather: HasFather(X) is True when X has a father. It can be expressed by:

HasFather(X) ← ∃Y, IsFather(X, Y)

HasSister: HasSister(X) is True when X has at least one sister. It can be expressed by:

HasSister(X) ← ∃Y, IsSister(X, Y)
IsSister(X, Y) ← ∃Z, (IsDaughter(Z, Y) ∧ IsMother(X, Z))

IsGrandparent: IsGrandparent(X, Y) is True when Y is X's grandparent. It can be expressed by:

IsGrandparent(X, Y) ← ∃Z, ((IsSon(Y, Z) ∧ IsFather(X, Z)) ∨ (IsDaughter(Y, Z) ∧ IsMother(X, Z)))

IsUncle: IsUncle(X, Y) is True when Y is X's uncle. It can be expressed by:

IsUncle(X, Y) ← ∃Z, ((IsMother(X, Z) ∧ IsBrother(Z, Y)) ∨ (IsFather(X, Z) ∧ IsBrother(Z, Y)))
IsBrother(X, Y) ← ∃Z, ((IsSon(Z, Y) ∧ IsSon(Z, X)) ∨ (IsSon(Z, Y) ∧ IsDaughter(Z, X)))

IsMGUncle: IsMGUncle(X, Y) is True when Y is X's maternal great uncle. It can be expressed by:

IsMGUncle(X, Y) ← ∃Z, (IsMother(X, Z) ∧ IsUncle(Z, Y))

C.1.2 Graph Reasoning

The graph tasks share the same initial predicate HasEdge(X, Y), which is True when there is an undirected edge between node X and node Y.

AdjacentToRed: AdjacentToRed(X) is True if node X has an edge with a red node. This task also uses Color(X, Y) as another initial predicate besides HasEdge(X, Y); Color(X, Y) is True when the color of node X is Y. It can be expressed by:

AdjacentToRed(X) ← ∃Y, (HasEdge(X, Y) ∧ Color(Y, red))
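Under the Boolean tensor semantics of Appendix A.1, such a definition reduces to simple matrix operations; the small NumPy check below (our own example graph) evaluates AdjacentToRed:

```python
# AdjacentToRed(X) <- ∃Y (HasEdge(X, Y) ∧ Color(Y, red)) on a 3-node path
# graph where only node 2 is red (our own example data).
import numpy as np

has_edge = np.array([[0, 1, 0],
                     [1, 0, 1],
                     [0, 1, 0]], dtype=bool)   # undirected path 0 - 1 - 2
color_red = np.array([0, 0, 1], dtype=bool)    # node 2 is red

adjacent_to_red = (has_edge & color_red[None, :]).any(axis=1)
print(adjacent_to_red)  # [False  True False]: only node 1 touches a red node
```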
4-Connectivity: 4-Connectivity(X, Y) is True if there exists a path between node X and node Y within 4 edges. It can be expressed by:

4-Connectivity(X, Y) ← ∃Z, (HasEdge(X, Y) ∨ Distance2(X, Y) ∨ (Distance2(X, Z) ∧ HasEdge(Z, Y)) ∨ (Distance2(X, Z) ∧ Distance2(Z, Y)))
Distance2(X, Y) ← ∃Z, (HasEdge(X, Z) ∧ HasEdge(Z, Y))

6-Connectivity: 6-Connectivity(X, Y) is True if there exists a path between node X and node Y within 6 edges. It can be expressed by:

6-Connectivity(X, Y) ← ∃Z, (HasEdge(X, Y) ∨ Distance2(X, Y) ∨ Distance3(X, Y) ∨ (Distance2(X, Z) ∧ Distance2(Z, Y)) ∨ (Distance3(X, Z) ∧ Distance2(Z, Y)) ∨ (Distance3(X, Z) ∧ Distance3(Z, Y)))
Distance2(X, Y) ← ∃Z, (HasEdge(X, Z) ∧ HasEdge(Z, Y))
Distance3(X, Y) ← ∃Z, (HasEdge(X, Z) ∧ Distance2(Z, Y))

1-Outdegree: 1-Outdegree(X) is True if the outdegree of node X is exactly 1. It can be expressed by:

1-Outdegree(X) ← ∃Y, ∀Z, (HasEdge(X, Y) ∧ (HasEdge(X, Z) → Z = Y))

2-Outdegree: 2-Outdegree(X) is True if the outdegree of node X is exactly 2. It can be expressed by:

2-Outdegree(X) ← ∃Y, Z, ∀K, (Y ≠ Z ∧ HasEdge(X, Y) ∧ HasEdge(X, Z) ∧ (HasEdge(X, K) → (K = Y ∨ K = Z)))

C.2.1 NLRL Tasks

For the three NLRL tasks, Unstack, Stack, and On, the agent is trained to move blocks to reach a certain configuration. The action is represented by a binary predicate Move(X, Y), which indicates moving block X onto block (or floor) Y. This action can be executed if it is legal, i.e., X is not the ground, block X has no blocks on it, and the same holds for Y if it is a block. The three NLRL tasks share the same initial predicates IsFloor(X), Top(X), and On(X, Y). Besides, there is one additional predicate OnGoal(X, Y) for the On task only. IsFloor(X) is True when X is the floor. Top(X) is True when block X has no blocks on it. On(X, Y) is True when block X is on Y. OnGoal(X, Y) indicates that the goal for the On task is to move block X onto block Y.

Unstack: In this task, the goal is to move all the blocks onto the floor. A policy that solves this task is:

Move(X, Y) ← IsFloor(Y) ∧ Pred(X)
Pred(X) ← Pred2(X) ∧ Top(X)
Pred2(X) ← ∃Y, Z, (On(X, Y) ∧ On(Y, Z))

This solution is actually learned by DLM, where the predicates Pred(X) and Pred2(X) are invented during training. Pred2(X) indicates that block X is on top of another block, i.e., X is not directly on the floor. Pred(X) indicates that block X is on top of a column of blocks and is not directly on the floor.

Stack: The goal in this task is to stack all the blocks into one column. A policy for this task can be expressed as:

Move(X, Y) ← Pred(Y) ∧ Pred4(X, Y)
Pred4(X, Y) ← Pred3(X) ∧ Pred2(Y, X)
Pred2(X, Y) ← ∃Z, (On(X, Z) ∧ Top(Y))
Pred3(X) ← ∃Y, (On(X, Y) ∧ IsFloor(Y))
Pred(X) ← ∃Y, (On(X, Y) ∧ Pred2(Y, X))

This solution was actually found by DLM, where the predicates Pred(X), Pred2(X, Y), Pred3(X), and Pred4(X, Y) are invented during training. The predicate Pred(X) here has the same meaning as the one invented in the Unstack task. Pred2(X, Y) indicates that X is a block and block Y has no blocks on it. Pred3(X) indicates that block X is directly on the floor. Pred4(X, Y) indicates that block X is directly on the floor with no block on it, and Y is a block.

On: The goal is to reach a configuration indicated by OnGoal. A policy for the On task can be expressed by:

Move(X, Y) ← (OnGoal(X, Y) ∨ IsFloor(Y)) ∧ ¬On(X, Y) ∧ Top(X)
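To see that such policies are directly executable, here is a small Python transcription of the Unstack policy; the state encoding as a set of On(x, y) facts and the helper names are our own:

```python
# Unstack policy: Move(X, floor) when X is on top of a column and not
# directly on the floor (our own transcription and state encoding).
def unstack_moves(on, objects, floor="floor"):
    top = {x for x in objects if not any(b == x for (_, b) in on)}
    pred2 = {x for (x, y) in on for (y2, _) in on if y == y2}  # X is on a block
    return [(x, floor) for x in pred2 & top]                   # Pred(X) holds

on = {("u", "v"), ("v", "floor"), ("w", "floor")}
print(unstack_moves(on, {"u", "v", "w"}))  # [('u', 'floor')]
```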
C.2.2 Path

This is a single-source, single-target path-finding problem in an undirected graph. For a given graph, the agent needs to find whether there is at least one path from the source node to the target node within d steps. The initial predicates are Adjacent(X, Y), IsStart(X), and IsTarget(X). The graph is described as an adjacency matrix, and Adjacent(X, Y) is True when node X and node Y are connected by an undirected edge. IsStart(X) (resp. IsTarget(X)) is True when node X is the source (resp. target) node. The action predicate is GoTo(X); if True, it moves from the current node to the next node X. A working policy for Path, requiring recursive predicates, can be expressed by:

GoTo(Y) ← ∃X, IsStart(X) ∧ Adjacent(X, Y) ∧ (IsTarget(Y) ∨ ∃Z, (HasEdge(Y, Z) ∧ IsTarget(Z)))
HasEdge(X, Y) ← Adjacent(X, Y) ∨ (∃Z, HasEdge(X, Z) ∧ HasEdge(Z, Y))

C.2.3 Sorting

In this problem, the algorithm needs to sort an array a of m integers into ascending order by swapping integers iteratively. Each index of array a is treated as a constant. The initial predicates are SmallerIndex(X, Y), SameIndex(X, Y), GreaterIndex(X, Y), SmallerValue(X, Y), SameValue(X, Y), and GreaterValue(X, Y). The first three describe index relations and the last three describe value relations. For example, SmallerIndex(X, Y) is True when X < Y, and SmallerValue(X, Y) is True when a[X] < a[Y]. The action predicate is Swap(X, Y); if True, it swaps a[X] and a[Y]. Therefore, we have m × (m − 1) available actions. The difficulty of the problem comes from the fact that the number of performed swap operations is limited. Hence, if the number of integers is small, a working policy for Sorting can be expressed by:

Swap(X, Y) ← GreaterValue(X, Y) ∧ SmallerIndex(X, Y).

However, the general solution for any m cannot be expressed without recursive predicates.
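A quick check in plain Python (our own harness) confirms that repeatedly firing this rule sorts the array, since each swap of an out-of-order pair strictly decreases the number of inversions:

```python
# Repeatedly apply Swap(X, Y) <- GreaterValue(X, Y) ∧ SmallerIndex(X, Y).
def sort_with_policy(a):
    swaps = 0
    while True:
        pairs = [(i, j) for i in range(len(a)) for j in range(len(a))
                 if i < j and a[i] > a[j]]   # SmallerIndex ∧ GreaterValue
        if not pairs:
            return a, swaps
        i, j = pairs[0]                      # the policy may pick any such pair
        a[i], a[j] = a[j], a[i]
        swaps += 1

print(sort_with_policy([3, 1, 2]))  # ([1, 2, 3], 2)
```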
C.2.4 Blocksworld

Recall that in this problem there are two worlds: the source world, where blocks can be moved, and the target world, which describes the desired block configuration. Each constant (block or ground) has 4 properties: a world ID, an object ID, an X coordinate, and a Y coordinate. The ground has a fixed position (0, 0). The initial predicates are SameWorldID(X, Y), SmallerWorldID(X, Y), LargerWorldID(X, Y), SameID(X, Y), SmallerID(X, Y), LargerID(X, Y), Left(X, Y), SameX(X, Y), Right(X, Y), Below(X, Y), SameY(X, Y), and Above(X, Y). The first three compare two constants' world IDs, while the next three compare object IDs. The last ones compare two constants' X or Y coordinates.

The action predicate is Move(X, Y). If True, it moves block X onto Y in the source world if it is a legal operation. It can be implemented as follows (found by a human expert):

Move(X, Y) ← PlanA(X, Y) ∨ PlanBOnlyIf(X, Y)
PlanBOnlyIf(X, Y) ← PlanB(X, Y) ∧ ¬PlanAWork()
PlanAWork() ← ∃X, Y, PlanA(X, Y)
PlanB(X, Y) ← ShouldMove(X) ∧ NoGoodY(X) ∧ IsGround(Y) ∧ InitialWorld(Y)
PlanA(X, Y) ← ShouldMove(X) ∧ ShouldMoveOn(X, Y)
NoGoodY(X) ← ¬∃Y, ShouldMoveOn(X, Y)
ShouldMoveOn(X, Y) ← WellPlaced(Y) ∧ Clear(Y) ∧ UnderBlock(X, Y)
ShouldMove(X) ← InitialWorld(X) ∧ Moveable(X) ∧ ¬WellPlaced(X)
UnderBlock(X, Y) ← ∃Z, (Target(X, Z) ∧ UnderBlockTI(Z, Y))
WellPlaced(X) ← Matched(X) ∧ ¬HaveUnmatchedBelow(X)
UnderBlockTI(X, Y) ← ∃Z, (Target(Y, Z) ∧ SameXDirectlyAbove(X, Z))
HaveUnmatchedBelow(X) ← ∃Y, (SameXAbove(X, Y) ∧ ¬Matched(Y))
SameXDirectlyAbove(X, Y) ← SameX(X, Y) ∧ DirectlyAbove(X, Y)
Moveable(X) ← Clear(X) ∧ ¬IsGround(X)
DirectlyAbove(X, Y) ← Above(X, Y) ∧ ¬∃Z, Between(X, Z, Y)
Matched(X) ← ∃Y, Match(X, Y)
Clear(X) ← ¬∃Y, SameXAbove(Y, X)
Between(X, Y, Z) ← Above(X, Y) ∧ Above(Y, Z)
Target(X, Y) ← SmallerWorldID(X, Y) ∧ SameID(X, Y)
Match(X, Y) ← ¬SameWorldID(X, Y) ∧ SameID(X, Y) ∧ SameX(X, Y) ∧ SameY(X, Y)
InitialWorld(X) ← ¬∃Y, SmallerWorldID(Y, X)
SameXAbove(X, Y) ← SameWorldID(X, Y) ∧ SameX(X, Y) ∧ Above(X, Y)
IsGround(X) ← ¬∃Y, Below(Y, X)

D Experimental Set-Up

D.1 Computer Specifications

The experiments are run with one CPU thread and one GPU on a computer with the specifications shown in Table 6.

Table 6: Computer specification.

Attribute   Specification
CPU         2 × Intel(R) Xeon(R) CPU E5-2678 v3
Threads     48
Memory      64 GB (4 × 16 GB, 2666)
GPU         4 × GeForce GTX 1080 Ti

D.2 Curriculum Learning

Every 10 epochs, we test the performance of the agent over 100 instances with a deterministic policy and with a stochastic policy. If one of them reaches 100%, then the agent can move to the next lesson. Our agents are trained on only one lesson at a time. In the NLRL tasks, curriculum learning is not needed: the number of blocks during training is always 4. In Path, the first lesson starts with 3 objects and the last one finishes with 10. In Sorting, we start with 3 objects and finish with 15. In Blocksworld, we start with 2 blocks (6 objects) and finish with 12 blocks (26 objects). In these 3 domains, the temperature and noise are only decreased, to obtain an interpretable policy, during the last lesson. For DLM-incr, curriculum learning is not needed: the number of blocks during training is always 7. The teacher produces the trajectories to learn from. A new DLM is stacked only if the extracted formula is better than the previous one; if it is not, the parameters of the last DLM are reinitialized.
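The lesson-advancement rule above can be sketched as follows (evaluate_policy is a hypothetical helper returning the success rate over the evaluation instances):

```python
# Sketch of the curriculum advancement test of Appendix D.2 (hypothetical
# helper evaluate_policy; lesson indexes the instance size per domain).
def maybe_advance(agent, lesson, epoch, n_eval=100):
    if epoch % 10 != 0:
        return lesson
    det = evaluate_policy(agent, lesson, n_eval, deterministic=True)
    sto = evaluate_policy(agent, lesson, n_eval, deterministic=False)
    return lesson + 1 if max(det, sto) == 1.0 else lesson
```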
D.3 Hyperparameters

D.3.1 Hyperparameters for dNL-ILP

For dNL-ILP, we train each task with at most 80,000 iterations. Moreover, at each iteration, we use a new family tree or graph as training data, randomly generated with the same data generator as for NLM and DLM, as background knowledge for training the model. For the tasks HasFather, IsGrandparent, and AdjacentToRed, dNL-ILP can achieve 100% accuracy without learning any auxiliary predicates. For the other ILP tasks, it has to learn at least one auxiliary predicate to induce the target. In practice, the performance decreases with an increasing number of auxiliary predicates or variables; therefore, we only use at most one auxiliary predicate and at most three variables.

Table 7 lists the notation for all the hyperparameters used for testing dNL-ILP, and Table 8 shows the hyperparameters for defining the dNL-ILP rules that achieve the best performance.

Table 7: Notation for the hyperparameters used in testing Payani & Fekri (2019a)'s work.

Hyperparameter   Explanation
Narg             The number of arguments
Nvar             The number of variables
Nterms           The number of terms
Fam              Amalgamate function
Ntrain           The number of nodes for training
T                The number of forward-chaining steps
Nfilter          The number of tests for the rule filter
lr               Learning rate
Nepoch           Maximum number of epochs for training the model
Niter            The number of iterations for one epoch
Mterms           Maximum number of terms in each clause
θmean            Fast-convergence total loss threshold (mean)
θmax             Fast-convergence total loss threshold (max)
β1               ADAM β1
β2               ADAM β2
ϵ                ADAM ϵ

Moreover, we use the same Ntrain as for NLM to train dNL-ILP (i.e., we set Ntrain = 20 for HasFather, HasSister, IsGrandparent, IsUncle, and IsMGUncle, and Ntrain = 10 for AdjacentToRed, 4-Connectivity, 6-Connectivity, 1-Outdegree, and 2-Outdegree). Other hyperparameters, such as those of the optimizer, remain the same for all tasks, consistent with Payani et al.'s GitHub code (https://github.com/apayani/ILP). For Family Tree tasks, we set lr = 0.01 when the loss from the last step is greater than 2; otherwise, we set lr = 0.005. For Graph tasks, we set lr = 0.01. The other hyperparameters are shown in Table 9.

Table 8: Hyperparameters for defining and learning dNL-ILP rules for each task (Narg / Nvar / Nterms / Fam for the auxiliary and target predicates).

Task             T   Auxiliary          Target
HasFather        1   (none)             1 / 1 / 2 / or
HasSister        2   2 / 1 / 1 / eq     1 / 1 / 1 / max
IsGrandparent    1   (none)             2 / 1 / 4 / eq
IsUncle          7   2 / 1 / 3 / or     2 / 1 / 4 / eq
IsMGUncle        7   2 / 1 / 1 / or     2 / 2 / 2 / or
AdjacentToRed    2   (none)             1 / 1 / 2 / eq
4-Connectivity   7   2 / 1 / 1 / or     2 / 1 / 5 / eq
6-Connectivity   7   2 / 1 / 1 / or     2 / 2 / 7 / eq
1-Outdegree      3   (none)             1 / 2 / 2 / eq
2-Outdegree      3   2 / 0 / 1 / eq     1 / 3 / 2 / eq

Table 9: Hyperparameters for training dNL-ILP.

Nepoch   Nfilter   Niter   Mterms   θmean   θmax   β1    β2     ϵ
400      3         200     6        0.5     0.5    0.9   0.99   1e-6

D.3.2 Hyperparameters for DLM

We used ADAM with a learning rate of 0.005 and 5 trajectories, with a clip of 0.2 in the PPO loss, λ = 0.9 in GAE, and a value-function clipping of 0.2. For the softmax over the action distribution, we used a temperature of 0.01.

Table 10: Hyperparameters of the noise in DLM.

                                       Starting from   Exponential decay   Approximate final value
SL   temperature τ of Gumbel dist.     1               0.995               0.5
     scale β of Gumbel dist.           1               0.98                0.005
     dropout probability               0.1             0.98                0.0005
RL   temperature τ of Gumbel dist.     1               0.995               task-dependent
     scale β of Gumbel dist.           0.1             0.98                task-dependent
     dropout probability               0.01            0.98                task-dependent

Table 11: Architectures for DLM on the different ILP and RL tasks.

                            Depth   Breadth   n_O   n_A   IO residual¹
Family Tree
  HasFather                 5       3         8     2
  HasSister                 5       3         8     2
  IsGrandparent             5       3         8     2
  IsUncle                   5       3         8     2
  IsMGUncle                 9       3         8     2
Graph Reasoning
  AdjacentToRed             5       3         8     2
  4-Connectivity            5       3         8     2
  6-Connectivity            9       3         8     2
  1-OutDegree               5       3         8     2
  2-OutDegree               7       4         8     2
NLRL Tasks
  Unstack                   4       2         8     2
  Stack                     4       2         8     2
General Algorithm
  Sorting                   4       3         8     2
  Path                      8       3         8     2
  Path (DLM-incr)           6       3         8     3
Blocks World
  Blocksworld (nIDLM)       8       2         8     2
  Blocksworld (imitation)   8       2         8     2

¹ Input-output residual connections: as in NLM, all the input predicates of the DLM are given as input to every module. Similarly, every output of each module is given to the final predicate of the DLM.
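Reading Table 10 operationally (our own code; we treat the approximate final value as a lower bound on the decayed quantity), each noise hyperparameter is decayed exponentially per training epoch:

```python
# Exponential decay of the DLM noise hyperparameters per Table 10 (our
# reading: the approximate final value acts as a floor).
def schedule(start, decay, final, epoch):
    return max(final, start * decay ** epoch)

epoch = 200
tau = schedule(1.0, 0.995, 0.5, epoch)      # SL softmax temperature τ
beta = schedule(1.0, 0.98, 0.005, epoch)    # SL Gumbel scale β
drop = schedule(0.1, 0.98, 0.0005, epoch)   # SL dropout probability
print(tau, beta, drop)
```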