Out-of-Distribution Generalization by Neural-Symbolic Joint Training

Anji Liu2, Hongming Xu3, Guy Van den Broeck2, Yitao Liang1,3
1 Institute for Artificial Intelligence, Peking University
2 Computer Science Department, University of California, Los Angeles
3 Beijing Institute of General Artificial Intelligence (BIGAI)
{liuanji, guyvdb}@cs.ucla.edu, xuhongming@bigai.ai, yitaol@pku.edu.cn

Abstract

This paper develops a novel methodology to simultaneously learn a neural network and extract generalized logic rules. Different from prior neural-symbolic methods that require background knowledge and candidate logical rules to be provided, we aim to induce task semantics with minimal priors. This is achieved by a two-step learning framework that iterates between optimizing neural predictions of task labels and searching for a more accurate representation of the hidden task semantics. Notably, supervision works in both directions: (partially) induced task semantics guide the learning of the neural network, and induced neural predictions admit an improved semantic representation. We demonstrate that our proposed framework is capable of achieving superior out-of-distribution generalization performance on two tasks: (i) learning multi-digit addition, where it is trained on short sequences of digits and tested on long sequences of digits; (ii) predicting the optimal action in the Tower of Hanoi, where the model is challenged to discover a policy independent of the number of disks in the puzzle.

1 Introduction

The recent rejuvenation of neural-symbolic synthesis demonstrates that we can significantly benefit from a tight integration of low-level perception and high-level reasoning in tasks featuring out-of-distribution (OOD) generalization (Manhaeve et al. 2018; Xu et al. 2018; Li and Srikumar 2019). This is especially the case for tasks whose semantics are (implicitly) encoded as symbolic rules.
Take the multi-digit addition example shown in Fig. 1(left). Given two numbers, each represented by a sequence of MNIST images, the goal is to predict their sum. Following the orange arrows, it is possible to train neural networks (NNs) to predict the sum directly. When the test samples are i.i.d. w.r.t. the training samples, this task poses no challenge for NNs. However, prediction accuracy can degrade when NNs are given OOD samples (e.g., unseen combinations of digits, or numbers of greater length). In contrast, from a neural-symbolic perspective, we can first adopt NNs to extract invariant features z (in this case, the digit of every MNIST image), and then employ logic rules for multi-digit addition to predict the sum. This model is more robust to various types of distribution shift of the input (e.g., longer numbers), since the NN only needs to accurately predict the digit of each MNIST image, and the addition rules are invariant to input length.

Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Some prior work has had success with such an integration (Dai and Muggleton 2021). However, these methods often require abundant prior knowledge; specifically, most of them require the ground-truth symbolic knowledge to be provided (Yang, Ishay, and Lee 2020; Manhaeve et al. 2018). In other words, such success has not been transferred to the setting where priors on task semantics are limited or not provided. We identify two challenges that hinder the development of neural-symbolic models in this more general setting. First, the lack of bidirectional supervision between the neural and symbolic models poses great difficulties for reliably learning meaningful symbolic features z.
Specifically, an under-trained neural network cannot produce accurate grounded facts for the symbolic model to learn logic rules from, and conversely, the symbolic model cannot provide effective supervision to refine the neural model. Existing approaches alleviate this problem either by providing strong priors on the symbolic rules (Dai and Muggleton 2021; Yang, Ishay, and Lee 2020) or by learning symbolic models implicitly (Wang et al. 2019; Amos and Kolter 2017), which cannot deliver reliable reasoning results when facing out-of-distribution samples. Second, even if supervision signals can be properly propagated between the neural and symbolic models, it is still possible that the NN predicts spurious features, leading to poor generalization performance (an example is provided in Sec. 6). This paper makes an initial attempt to solve these crucial yet under-explored problems.

Both aforementioned challenges are mainly attributable to the gigantic set of candidate rules considered by the symbolic model. In one direction, a large number of plausible rules results in noisy update signals to the NN. In the other direction, it is hard to search for good symbolic rules in a vast candidate set. Instead of adding prior knowledge to shrink the rule space, we propose to control the symbolic model's complexity by gradually adding plausible candidate rules with a symbolic search step. That is, the model is first initialized with a few general rules. After eliminating candidates that are very unlikely to hold, we invoke the symbolic search algorithm to extend and specialize the remaining rules.
This process continues until the model converges to a set of specific rules. To improve the effectiveness of the candidate-set expansion procedure, we borrow wisdom from the knowledge compilation community and use Logic Circuits (LCs) (Darwiche 2011) to compactly represent candidate rules. Rule expansion is done by efficient structure learning of the LC. Further, by making use of other properties of LCs (e.g., differentiability), we propose NTOC, a neural-symbolic joint training method that learns invariant features in the NN and generalizable task semantics in the LC. To understand when the proposed model can generalize well to unseen scenarios, we identify a special class of OOD tasks. We show that the multi-digit addition task and the Tower of Hanoi optimal action prediction task shown in Fig. 1 fall into this class. Furthermore, we demonstrate that NTOC outperforms NN and neural-symbolic baselines on both tasks by a large margin.

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

Figure 1: Illustration of the multi-digit addition task (left) and the Tower of Hanoi optimal action prediction task (right). Following the orange arrows, the output y of both tasks can be predicted end-to-end with NNs. However, the features captured by such models are unlikely to generalize well to different scenarios (e.g., increased digit length, or more disks in Hanoi). Instead, following the blue arrows, NNs are used to capture invariant features z of the task, and logical reasoning is then applied to obtain the answer y. Accuracy will not degrade if z is invariant to the change of task scenarios.

2 Notation and OOD Generalization

Notation We write Boolean variables as uppercase letters (e.g., X) and their assignments as lowercase letters (e.g., x).
Analogously, sets of Boolean variables and their joint assignments are denoted by bold uppercase (e.g., X) and lowercase (e.g., x) letters, respectively. We define the set of all assignments x of variables X as val(X). A literal is a variable (e.g., X) or its negation (e.g., ¬X). Propositional logic formulas are constructed in the usual way from literals and logical connectives.

OOD Generalization Following the formulation of Ye et al. (2021), we consider supervised learning tasks where the model has access to samples from a set of training domains Etrain, while at test time it is required to make predictions on samples drawn from a set of test domains Etest. For example, in the digit addition task (Fig. 1(left)), samples from Etrain and Etest may contain numbers of lengths {2, 3} and {5, 6}, respectively. Similarly, in the Hanoi task (Fig. 1(right)), samples from Etest may contain more disks than those from Etrain. To achieve OOD generalization, models need to extract features that are invariant across both sets of domains. However, since Etest is invisible during training, it is impossible to achieve OOD generalization without proper assumptions (Blanchard, Lee, and Scott 2011; Deshmukh et al. 2019). One reasonable assumption is that features invariant across Etrain maintain their invariance in Etest. Intuitively, if the domains in Etrain and Etest share invariant features, it is unlikely that spurious features that fail to generalize to test domains work well across all training domains.

A Special Class This paper specifically considers OOD tasks with a special type of invariant features: the features are represented as a set of symbols whose relation with the output can be quantified by a set of logical rules. For example, Fig. 1(right) illustrates the Tower of Hanoi problem, where the task is to predict the next move given the current state represented as a single image.
One possible set of invariant features z consists of three Boolean variables, each representing the size relation between the top disks of two pillars. They are invariant because NNs equipped with an attention mechanism can learn to focus on the top disks regardless of the total number of disks. The features are symbolic in the sense that there exists a set of logic rules that map them to the optimal actions. We reveal the logical rules in Sec. 6.

3 Neural-Symbolic Joint Learning

This section elaborates the high-level design principles of our proposed neural-symbolic joint learning framework. As shown in Fig. 2(a), the joint learning model consists of a NN fφ and a logic circuit model gθ, parameterized by φ and θ, respectively. The NN fφ maps input x to a |Z|-dimensional vector pnn, where for each i ∈ [|Z|], pnn_i denotes the probability of variable Zi = true. The Logic Circuit (LC) gθ compactly represents a set of weighted propositional logic rules, where the weights characterize the importance of the corresponding rules. LC gθ takes pnn and y as inputs and outputs a scalar representing the probability of satisfaction. The LC as a whole encodes potential task semantics between the NN-predicted features z and the class labels y. The training pipeline of the proposed model is illustrated in Fig. 2(b). First, we train the NN on some primitive tasks, such as teaching it size relationships between two disks in the Hanoi problem. The pretraining step does not need to teach the NN the complete semantics of the input; it is used to provide a good initialization point for optimizing the
neural-symbolic model.¹ Afterwards, training proceeds by iterating over the symbolic rule search and the neural model learning steps, where the goal is to simultaneously learn to predict invariant features z and induce the logical rules. Among the three phases, the symbolic rule search step is task-independent, while the other two can be task-dependent. Specifically, the perception model needs to accommodate the type of input data x (e.g., CNNs for images), and the primitive pretraining tasks are designed to address the perception needs of solving the task with minimal priors regarding the underlying invariance of the task (e.g., digit classification for learning multi-digit addition, and disk size comparison for tackling Hanoi).

Figure 2: Illustration of the proposed neural-symbolic joint learning framework. (a) A NN fφ predicts the probability (i.e., pnn) of a set of logical variables (i.e., Z). A symbolic model gθ, termed Logic Circuit, encodes logical constraints among variables {Z, Y} and outputs the SAT probability of (pnn, y). (b) The NN of the proposed framework is first pretrained with some primitive tasks. Then, learning proceeds by iteratively updating the learnable parameters φ and θ in the neural learning step and searching for a good structure of the Logic Circuit in the symbolic search step.
3.1 Symbolic Rule Search

Given a set of probabilistic facts pnn predicted by the NN and the corresponding labels y, we denote Dsym := {(pnn,(i), y(i))}_{i=1}^N, where {(x(i), y(i))}_{i=1}^N are the training samples and pnn,(i) = fφ(x(i)). The goal of the symbolic rule search step is to adjust the set of rules (i.e., propositional logic formulas) embedded in gθ such that (i) new rules that are satisfied with high probability by the samples in Dsym are added, and (ii) rules that are strongly contradicted by Dsym are pruned away. Concretely, Fig. 2(b) demonstrates an example where the top logical model is transformed into the bottom one by conducting structure search using Dsym. As shown on the right side, both logical models can be written as parameterized propositional logic formulas, which are equivalently expressed as the sets of weighted logical formulas shown in the respective boxes. Comparing the two rule sets, we observe that this example structure search step adds two new rules.

¹ Alternative approaches, such as a loss that encourages large variance in z, can be used in place of pretraining.

Having demonstrated how to adjust the embedded rules by conducting structure search on the logical model, the next immediate question is how to encourage the emergence of good rules. Since good logic rules should be satisfied (with high probability) by the samples in Dsym, we define the objective as maximizing their probability of satisfaction, defined via the Semantic Loss (Xu et al. 2018):

Definition 1 (Semantic Loss). For a set of Boolean variables Z, denote by p a |Z|-dimensional vector containing a probability for each variable in Z, and by α a propositional logic sentence over Z. The semantic loss between α and p is defined as

$L_s(\alpha, p) := -\log \sum_{z \models \alpha} \prod_{i: z \models Z_i} p_i \prod_{j: z \models \neg Z_j} (1 - p_j),$  (1)

where ⊨ denotes logical entailment. A logic sentence α entails β iff all assignments that satisfy α also satisfy β.
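As a concrete illustration of Def. 1, the semantic loss can be computed by brute-force enumeration over assignments (a sketch for intuition only; the enumeration is exponential in |Z|, which the circuit representation of Sec. 4 avoids, and the rule below is a made-up example):

```python
import itertools
import math

def semantic_loss(alpha, p):
    """L_s(alpha, p) of Eq. (1) by enumerating all assignments z of Z.

    alpha: predicate over a tuple of booleans (the logic sentence).
    p:     p[i] = Pr(Z_i = true); features are treated as independent.
    """
    sat_prob = 0.0
    for z in itertools.product([False, True], repeat=len(p)):
        if alpha(z):  # z |= alpha: accumulate this world's weight
            w = 1.0
            for i, zi in enumerate(z):
                w *= p[i] if zi else (1.0 - p[i])
            sat_prob += w
    return -math.log(sat_prob)

# Made-up rule: exactly one of Z1, Z2 holds (XOR).
xor = lambda z: z[0] != z[1]
loss = semantic_loss(xor, [0.9, 0.1])  # Pr(sat) = 0.81 + 0.01 = 0.82
```

The loss here equals −log 0.82 ≈ 0.198 and drops to 0 as p approaches [1, 0], which entails the rule; a tautology always has loss 0, which is why the search additionally controls specificity via the model count.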
Intuitively, Ls(α, p) becomes smaller as the prediction p gets closer to satisfying α, and Ls(α, p) reaches its minimum 0 if p entails α. In the context of gθ and Dsym, our goal is to add to the LC, through structure transformation steps, logic rules α with small semantic loss $L_s(\alpha, D_{sym}) := \sum_{(p^{nn}, y) \in D_{sym}} L_s(\alpha \wedge y, p^{nn})$. However, minimizing the semantic loss alone does not guarantee finding good symbolic rules, since the semantic loss does not control the specificity of the logic formula. For example, if α is a tautology over Z, Ls(α, Dsym) remains zero regardless of Dsym, since every value z ∈ val(Z) entails a tautology. To avoid such trivial rules, we additionally use the model count (i.e., the number of satisfying assignments) to control the specificity of the learned logic formula. In summary, this step searches for logic formulas that (i) match the dataset Dsym well and (ii) are specific enough (i.e., have a relatively low model count) to contain useful information. In Sec. 5.1, we demonstrate a practical implementation of this symbolic search algorithm.

3.2 Neural Model Learning

As illustrated in Fig. 2(b), the neural learning step is responsible for jointly optimizing the parameters of fφ and gθ: it adjusts the NN to predict invariant features z while selecting informative rules in the LC by maximizing their corresponding weights. Different from the well-studied task of learning NNs under the supervision of groundtruth symbolic knowledge (Manhaeve et al. 2018; Xu et al. 2018; Li and Srikumar 2019; Xie et al. 2019), in the joint training framework we do not have access to groundtruth rules. Instead, the symbolic model gθ encodes a weighted set of rules. Concretely, suppose gθ encodes k logic rules α1, . . . , αk with weights θ1, . . . , θk, respectively.² Both models are updated by minimizing the following objective:

$\min_{\phi, \theta} \sum_{(x, y) \in D_{train}} \sum_{i=1}^{k} \theta_i \, L_s(\alpha_i \wedge y, f_\phi(x)),$  (2)

where $\sum_{i=1}^{k} \theta_i = 1$.
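The effect of Eq. (2) can be seen with plain arithmetic: the objective is a θ-weighted sum of per-rule semantic losses, so optimization over the simplex shifts mass toward rules with low loss. The loss values below are made up for illustration:

```python
# Hypothetical averaged semantic losses L_s(alpha_i ^ y, f(x)) for k = 3
# candidate rules (values are made up for illustration).
rule_losses = [0.05,  # alpha_1: close to the groundtruth rule
               1.20,  # alpha_2: frequently violated
               0.80]  # alpha_3: frequently violated

def objective(theta):
    """Eq. (2): theta-weighted sum of per-rule semantic losses."""
    assert abs(sum(theta) - 1.0) < 1e-9, "theta must lie on the simplex"
    return sum(t * l for t, l in zip(theta, rule_losses))

# Concentrating weight on the near-groundtruth rule lowers the objective,
# which is what the continuous optimization over theta converges toward.
uniform = objective([1 / 3, 1 / 3, 1 / 3])
peaked = objective([0.9, 0.05, 0.05])
assert peaked < uniform
```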
This objective is minimized when the rules αi closer to the (unknown) groundtruth have high weights, and the neural network learns to predict probabilities p that best satisfy these likely rules. In this way, we cast the logic rule selection problem as a continuous optimization problem. In Sec. 5.2, we concretize this idea by showing how the rules are compactly represented and how the weights θ are learned together with the parameters of the neural network.

4 Representation of the Symbolic Model

In this section, we first discuss the requirements on the symbolic model posed by the symbolic rule search step and the neural model learning step (Sec. 4.1). These requirements motivate the use of circuit representations, as circuits satisfy all of them. We conclude the section with the necessary background on circuit representations (Sec. 4.2).

4.1 Key Requirements on the Symbolic Model

Firstly, a crucial requirement for implementing the symbolic rule search algorithm is to efficiently compute both search objectives: the semantic loss Ls(α, Dsym) and the model count w.r.t. the propositional logic sentence α. Examining the definition of the semantic loss, we find that computing Ls(α, p) can be cast as a well-known automated reasoning task termed weighted model counting (WMC) (Chavira and Darwiche 2008; Sang, Beame, and Kautz 2005a,b), which generalizes model counting. Therefore, the first requirement reduces to the efficient computation of WMC. Secondly, the symbolic representation needs to be general enough to represent various logical sentences and flexible enough to support efficient search for new logical rules.

² In LCs, the probability of a rule is implicitly defined by θ, which allows us to represent exponentially many rules w.r.t. |θ|.

Figure 3: An example decomposable and deterministic LC g representing X3 ∧ ((X1 ∧ X2) ∨ (¬X1 ∧ ¬X2)). The WMC of g w.r.t. p = [0.6, 0.8, 0.5], here WMC(g, p) = 0.72, is labeled in blue beside each node.
Thirdly, in the neural model learning phase, we backpropagate gradients through the symbolic model to pnn in order to update the neural network fφ. Therefore, we require the symbolic model's objective (Eq. (2)) to be differentiable w.r.t. its input pnn. Circuit representations (Darwiche 2011) satisfy all of the above requirements. On the one hand, they can represent all possible propositional logical rules and support linear-time (w.r.t. model size) computation of WMC. On the other hand, circuits are computation graphs with sum and product units, supporting gradient computation. We proceed to formally introduce circuit representations.

4.2 Circuit Representations and the Computation of Weighted Model Counting

Circuits represent functions, including logical sentences, as (parametrized) computation graphs (Darwiche 2011; Darwiche and Marquis 2002). By imposing different structural constraints, circuits can compute various logical and probabilistic queries efficiently. The basic syntax and semantics of circuits are formalized in the following.

Definition 2 (Circuits). A circuit g(X) encodes a function over X via a parametrized directed acyclic graph (DAG) with a single root node nr. Similar to neural networks, every node of the DAG defines a computational unit. Specifically, each leaf node corresponds to an input unit represented by a literal (e.g., X or ¬X), and each inner node n is either a sum or a product unit that receives inputs from its children, denoted in(n). Let p ∈ [0, 1]^{|X|} be a vector of probabilities for variables X. The output gn(p) of each node n given input p is defined recursively as

$g_n(p) = \begin{cases} h_n(p) & \text{if } n \text{ is an input unit}, \\ \sum_{c \in in(n)} \theta_{n,c} \, g_c(p) & \text{if } n \text{ is a sum unit}, \\ \prod_{c \in in(n)} g_c(p) & \text{if } n \text{ is a product unit}, \end{cases}$  (3)

where hn is any univariate function defined by n, and θn,c denotes the parameter corresponding to edge (n, c). In particular, Logic Circuits (LCs) are the subset of circuits where hn(p) = pi if n encodes the literal Xi and hn(p) = 1 − pi if n encodes the literal ¬Xi.
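Eq. (3) amounts to a feedforward pass over the DAG. Below is a minimal toy encoding of circuit units (the class layout is our own illustration, not the paper's implementation), evaluating an LC for (X1 ∧ X2) ∨ (¬X1 ∧ ¬X2):

```python
import math

class Unit:
    """A circuit node: kind is 'lit', 'sum', or 'prod' (toy encoding)."""
    def __init__(self, kind, var=None, neg=False, children=(), weights=None):
        self.kind, self.var, self.neg = kind, var, neg
        self.children, self.weights = children, weights

def evaluate(n, p):
    """Feedforward evaluation of Eq. (3); one pass, linear in circuit size."""
    if n.kind == "lit":  # input unit: h_n(p) = p_i or 1 - p_i
        return 1.0 - p[n.var] if n.neg else p[n.var]
    vals = [evaluate(c, p) for c in n.children]
    if n.kind == "sum":  # weighted sum unit (weights default to 1)
        w = n.weights if n.weights is not None else [1.0] * len(vals)
        return sum(wi * v for wi, v in zip(w, vals))
    return math.prod(vals)  # product unit

# Deterministic LC for (X1 ^ X2) v (~X1 ^ ~X2): the two branches are
# mutually exclusive, so the root's output is exactly the WMC.
g = Unit("sum", children=(
    Unit("prod", children=(Unit("lit", 0), Unit("lit", 1))),
    Unit("prod", children=(Unit("lit", 0, neg=True), Unit("lit", 1, neg=True))),
))
wmc = evaluate(g, [0.6, 0.8])  # 0.6*0.8 + 0.4*0.2 = 0.56
```

The semantic loss is then −log(wmc), and because the pass uses only sums and products, differentiating through the same graph yields the gradient w.r.t. p.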
For ease of notation, we define gn(x) as gn(p^x), where p^x_i = 1 for all i such that x ⊨ Xi and p^x_i = 0 if x ⊨ ¬Xi. The size of a circuit g, denoted |g|, is the number of edges in its DAG.

Figure 4: An example of the SPLIT operation SPLIT(g, m, n, X2).

This paper focuses on LCs that (i) represent a set of weighted logic formulas and (ii) support linear-time (w.r.t. |g|) computation of arbitrary weighted model counting queries.³ For example, the LC shown in Fig. 3 represents the formula X3 ∧ ((X1 ∧ X2) ∨ (¬X1 ∧ ¬X2)). Slightly abusing notation, for any LC gn, we write x ⊨ gn if x entails the logic formula of gn. The WMC of gn w.r.t. weights p, formally defined as

$\sum_{x \models g_n} \prod_{i: x \models X_i} p_i \prod_{j: x \models \neg X_j} (1 - p_j),$

can be computed by a feedforward evaluation of gn following Eq. (3) (Kimmig, Van den Broeck, and De Raedt 2012). For instance, the WMC between the LC in Fig. 3 and p = [0.6, 0.8, 0.5] is 0.72. Since the semantic loss between gn and p is the negative logarithm of the corresponding WMC, it can also be computed efficiently. Additionally, since the feedforward computation consists only of additions and multiplications, the gradient ∇_p Ls(gn, p) can be computed by backpropagating through the DAG of gn.

5 Practical Implementation

This section introduces the implementation details of our proposed algorithm, Neural to Circuit (NTOC). Details of the symbolic rule search and neural model learning algorithms are provided in Sec. 5.1 and Sec. 5.2, respectively.

5.1 Symbolic Rule Search as Structure Learning

Recall that the symbolic rule search step aims to find propositional logic rules that best explain the dataset Dsym. Different from existing approaches that select a good subset of predefined rules, we directly learn a good LC gθ. This avoids the need to provide task-specific background knowledge and allows us to explicitly control the balance between how well samples in Dsym satisfy g (i.e., Ls(g, Dsym)) and the specificity of g (i.e., its model count).
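The balance just described can be made concrete by brute-force computation of both quantities on a made-up two-variable rule (enumeration is for illustration only; the LC computes the same quantities in linear time):

```python
import itertools
import math

def wmc(formula, p):
    """Weighted model count of `formula` under fact probabilities p."""
    total = 0.0
    for x in itertools.product([False, True], repeat=len(p)):
        if formula(x):
            w = 1.0
            for i, xi in enumerate(x):
                w *= p[i] if xi else 1.0 - p[i]
            total += w
    return total

def model_count(formula, n):
    """Number of satisfying assignments over n Boolean variables."""
    return sum(1 for x in itertools.product([False, True], repeat=n) if formula(x))

tautology = lambda x: True
specific = lambda x: x[0] and x[1]  # candidate rule Z1 ^ Z2
p_nn = [0.9, 0.8]                   # hypothetical NN fact probabilities

# The tautology fits any data (semantic loss 0) but has maximal model
# count 4, so it carries no information; the specific rule pays a small
# loss, -log(0.72), in exchange for model count 1.
assert model_count(tautology, 2) == 4 and model_count(specific, 2) == 1
loss = -math.log(wmc(specific, p_nn))
```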
The proposed symbolic rule search algorithm (Alg. 1) takes Dsym as input and returns a LC representing the learned logic rules. The algorithm starts with a LC representing a tautology (line 3). At this point, the semantic loss Ls(g, Dsym) = 0 and the model count of g is at its maximum. Alg. 1 then iteratively restricts g by pruning worlds w that are likely untrue. This leads to an increase of the semantic loss and a decrease of the model count.

³ For (ii) to hold, circuits must be both decomposable and deterministic. We refer readers interested in why these two properties are necessary to Darwiche (2011).

Algorithm 1: Symbolic Rule Search
1: Input: A dataset of probabilistic facts Dsym = {p(i)}_{i=1}^N
2: Output: A logic circuit g
3: Initialize: g as a LC representing a tautology
4: while termination conditions not met do
5:   (m, n), v ← select an edge and a variable to SPLIT
6:   g ← SPLIT(g, m, n, v)
7:   Prune worlds in g that do not explain Dsym
8: end while

The goal here is to prune worlds with a good strategy that minimizes the increase of Ls(g, Dsym) (i.e., maintains the correctness of the rules) while maximizing the decrease in model count (i.e., improves rule completeness). Iterative pruning relies on the fact that every node n in a LC corresponds to a logic sentence αn; thus, pruning a node is equivalent to discarding the worlds that entail αn. The main loop (lines 4-8) consists of two key steps: (i) applying transformations to the LC to expose prunable nodes (lines 5-6), and (ii) pruning nodes by predefined criteria (line 7). We proceed to describe both steps in detail.⁴

Step #1: SPLIT We use the SPLIT circuit transformation (Liang, Bekker, and Van den Broeck 2017) to discover candidate nodes for pruning. Specifically, SPLIT takes as input an edge (m, n) (where m is a parent of n, m is a sum unit, and n is a product unit) and a variable X ∈ ϕ(n), where the scope ϕ(n) of n is defined as the collection of variables of all its descendant input nodes.
Suppose n represents the logic sentence αn. SPLIT first constructs two product units n1 and n2, representing αn ∧ X and αn ∧ ¬X, respectively. The algorithm then replaces (m, n) with two edges (m, n1) and (m, n2). As an example, after applying SPLIT to the LC in Fig. 4(a) w.r.t. (m, n) and X2, the circuit is transformed into the one shown in Fig. 4(b). Since αn = (αn ∧ X) ∨ (αn ∧ ¬X), SPLIT does not change the LC's semantics.

SPLIT can generate prunable units in the LC. Consider again the example in Fig. 4. If the groundtruth sentence is β = X1 ∧ X2, it is impossible to obtain a corresponding LC by pruning nodes from the LC in Fig. 4(a), since we cannot eliminate only the world X1 ∧ ¬X2. However, after SPLIT, we can eliminate X1 ∧ ¬X2 by pruning away n1. Another key process in this step is selecting which edge and variable to SPLIT. Specifically, we select SPLIT candidates by assigning a score to each edge-variable pair and choosing the candidate with the highest score. The score is the product of two values: one measuring the number of samples in Dsym that activate the edge, and the other estimating the information gain of the split.

Step #2: node pruning We eliminate worlds from g via two criteria. The first is to prune away units that are satisfied by (almost) no samples, i.e., units whose removal changes Ls(g, p) only slightly for (almost) all samples. To accelerate the pruning algorithm, we use a second criterion termed prune by conjunction. Specifically, for a unit n in LC g, denote by g' the LC obtained by conjoining n with a literal (e.g., X or ¬X). If Ls(g', p) − Ls(g, p) is small for (almost) all samples, we eliminate worlds from g by conjoining n with the literal.

⁴ Details such as hyperparameters are given in the appendix.

Termination conditions The trade-off between the accuracy and the specificity of the learned rules is controlled by the termination condition.
Specifically, define the improvement of g' over g as

$I(g, g') := \frac{L_s(g', D_{sym}) - L_s(g, D_{sym})}{MC(g) - MC(g')},$

which measures the increase in semantic loss per unit decrease in model count. The smaller I(g, g') is, the better the quality of the transformation from g to g', as it significantly decreases the model count while not increasing the semantic loss by too much. Define g0 as the initial LC representing a tautology and gt as the circuit learned at iteration t. We terminate Alg. 1 at step t if $I(g_{t-1}, g_t) > d \cdot I(g_0, g_t)$, where d is a hyperparameter set to 5 in all our experiments.

Computational complexity Since WMC is #P-hard in general, for tasks with complex underlying logic rules, neural-symbolic methods that explicitly use these groundtruth rules (Manhaeve et al. 2018; Xu et al. 2018) can be extremely slow at computing the semantic loss, which makes them less scalable on such tasks. In contrast, NTOC gradually searches for good approximations of the groundtruth rules, achieving a better balance between rule correctness and computational complexity.

5.2 NN Learning Using Probabilistic Logic

Given task semantics represented by a LC g, the neural model learning step aims to maximize the probability of pnn satisfying g ∧ y, where pnn is the vector of probabilities for z output by the neural network fφ. That is, the goal is to minimize the semantic loss Ls(g ∧ y, pnn), which can be optimized by gradient-based algorithms since ∇_pnn Ls(g ∧ y, pnn) can be computed efficiently. When the LC is learned from data and thus prone to error, directly minimizing the semantic loss can lead to catastrophic failure. Take the two-digit addition task illustrated in Sec. 2 as an example. Denote by z1 and z2 the symbolic representations of the two images, respectively. Suppose that in addition to the groundtruth addition rules, g contains the following rule: (z1 = digit 0) ∧ (z2 = digit 0).
g is still correct in the sense that if the neural network correctly learns to classify all digits, the semantic loss can still converge to its minimum of 0. However, in practice, minimizing the semantic loss will likely lead to the trivial solution of predicting all images as digit 0, which also minimizes the semantic loss. To avoid such local minima, we make use of the fact that for each z ∈ val(Z), only one y ∈ val(Y) is true. Specifically, we enforce this constraint by maximizing the following objective:

$-L_s(g \wedge y, p^{nn}) - \log \sum_{y' \in val(Y)} \exp(-L_s(g \wedge y', p^{nn})),$  (4)

which pushes pnn to satisfy g ∧ y (where y is the label) but not g ∧ y' for any y' ∈ val(Y) with y' ≠ y. While the above objective can prevent the NN from being trapped in certain local optima, neural learning can still fail if g is not informative (e.g., if it is a tautology). Fortunately, since in this case the logic rules are often too general, it is likely that the groundtruth logic sentence α* entails g. That is, following the main idea in Sec. 3.2, if we properly select a set of sentences {αi}_{i=1}^k that contains close-to-optimal ones, we can jointly refine the logic rules g and the neural network fφ. For LCs, learnable parameters are assigned to the edge weights θn,c defined in Def. 2. Denote the equivalent logical formula of LC unit n by αn. Since the LCs we learn are deterministic, for every sum unit n, the logic formulas of its children, i.e., {αc : c ∈ in(n)}, are mutually exclusive: for all c1, c2 ∈ in(n) with c1 ≠ c2, αc1 ∧ αc2 = false. According to Def. 2, the edge weights of n (i.e., {θn,c : c ∈ in(n)}) sum to 1. By analogy with Eq. (2), each edge parameter θn,c can be viewed as the weight associated with logic formula αc. Therefore, circuit representations provide a natural way to incorporate learnable parameters that weight the logical formulas.
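Eq. (4) is exactly a log-softmax over labels in negative-semantic-loss space. A small numeric sketch with hypothetical per-label losses shows why it penalizes the trivial solution in the digit-0 example above:

```python
import math

def eq4(losses, y):
    """Eq. (4): -L_s(g ^ y, pnn) - log sum_{y'} exp(-L_s(g ^ y', pnn))."""
    return -losses[y] - math.log(sum(math.exp(-l) for l in losses))

# Hypothetical semantic losses L_s(g ^ y', pnn) for three candidate labels.
sharp = [0.1, 2.0, 3.0]  # pnn (nearly) satisfies g only jointly with y = 0
flat = [0.1, 0.1, 0.1]   # pnn satisfies g with every label (trivial solution)

# The objective is the log-probability of the correct label under the induced
# softmax, so it is always <= 0 and approaches 0 only when g ^ y is satisfied
# exclusively; a predictor that satisfies g with every label scores far worse.
assert eq4(sharp, 0) <= 0.0
assert eq4(sharp, 0) > eq4(flat, 0)
```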
Joint learning of the LC weights and the NN parameters is simple, since parameterized LCs remain differentiable, and the LC weights can be learned via well-developed EM algorithms (Liu and Van den Broeck 2021). A final question is which LC units we should weight. While it is possible to weight all edges, adding too many weights to the LC can render the learning process unstable. We choose to add learnable parameters to every sum unit n whose scope ϕ(n) is Y, for the following reason: in classification tasks, for any z ∈ val(Z), only one y ∈ val(Y) holds. Therefore, adding weights to LC units with scope Y allows the model to select the correct y that satisfies the groundtruth constraint α*. In summary, given a LC g learned in the symbolic rule search step, we first add learnable weights to g. We then jointly optimize the LC weights and the neural network's parameters by maximizing Eq. (4).

6 Experiments

In this section, we evaluate the proposed NTOC model on the multi-digit addition task and the Tower of Hanoi action prediction task.⁵ In both tasks, we evaluate the model's ability to (i) simultaneously train neural networks and extract task semantics, and (ii) learn invariant features z that generalize well across various test domains.

Baselines We compare the proposed model against two NN models, LSTM (Sak, Senior, and Beaufays 2014) and DNC (Graves et al. 2016), and a neural-symbolic model, DeepProbLog (Manhaeve et al. 2018). The NN models are selected for their ability to handle sequential prediction tasks. Although not designed for joint training tasks, DeepProbLog is considered the SoTA approach for neural-symbolic tasks such as injecting knowledge into NNs and inducing symbolic rules. Note that in the experiments, DeepProbLog is additionally provided with the groundtruth logic rules; hence, it serves as a loose performance upper bound for the other approaches.
Additionally, since some neural-symbolic models such as OptNet (Amos and Kolter 2017) do not fit the two tasks considered here, we run additional tasks in the appendix to compare against these models.

⁵ Our code can be found at https://github.com/sbx126/NToC.

Table 1: OOD performance (± std over 10 runs) on the multi-digit addition and Tower of Hanoi tasks. NTOC is compared against two NN-based models (LSTM and DNC) and a neural-symbolic approach (DeepProbLog). Note that DeepProbLog is additionally provided with groundtruth logical rules and is considered the performance upper bound of the other approaches. Multi-digit addition: configurations vary the test sequence length (5, 10, 20) and whether test digits use unseen (test) or seen (train) MNIST images. Tower of Hanoi: a suite of configurations for disk-number generalization and movement-step-length generalization. Numbers are sequence accuracies, i.e., the fraction of correctly predicted sequences.

| Model | Len 5 (test img) | Len 10 (test img) | Len 20 (test img) | Len 5 (train img) | Len 10 (train img) | Len 20 (train img) | Hanoi #1 | Hanoi #2 | Hanoi #3 |
|---|---|---|---|---|---|---|---|---|---|
| DeepProbLog | 89.68 ± 1.34 | 80.87 ± 2.32 | timeout | 95.16 ± 0.81 | 90.59 ± 1.41 | timeout | 89.01 ± 1.36 | 96.69 ± 0.39 | 88.37 ± 1.29 |
| LSTM | 81.61 ± 2.54 | 61.13 ± 4.71 | 35.06 ± 5.69 | 91.94 ± 1.66 | 80.31 ± 3.15 | 60.68 ± 5.94 | 79.29 ± 2.78 | 91.09 ± 1.41 | 73.61 ± 3.43 |
| DNC | 87.64 ± 0.37 | 74.78 ± 1.00 | 57.09 ± 1.39 | 93.10 ± 1.45 | 85.52 ± 2.46 | 73.28 ± 4.13 | 66.71 ± 0.54 | 94.57 ± 1.38 | 65.35 ± 1.77 |
| NTOC (ours) | 91.20 ± 0.73 | 82.93 ± 1.95 | 68.60 ± 3.14 | 97.90 ± 0.29 | 95.79 ± 0.53 | 91.92 ± 0.99 | 86.07 ± 0.28 | 94.82 ± 0.19 | 84.32 ± 0.41 |

Multi-digit addition Models are only informed that the task has a recursive structure; except for DeepProbLog, no considered model is provided with the multi-digit addition rules. Specifically, the recursion prior is conveyed by first feeding the two images corresponding to the ones place of the input numbers to predict the output number's ones place, followed by the images for the tens place, hundreds place, and so on.
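The recursive structure being exploited is the grade-school carry recurrence; the per-step semantics the model is expected to recover can be sketched as follows (our own illustration of the target rule, not the learned LC):

```python
def add_step(d1, d2, carry_in):
    """One time step of multi-digit addition over single digits (0-9)."""
    s = d1 + d2 + carry_in
    return s % 10, s // 10  # (output digit, carry out)

def add_numbers(xs, ys):
    """Add two equal-length digit sequences, least-significant digit first."""
    carry, out = 0, []
    for d1, d2 in zip(xs, ys):
        digit, carry = add_step(d1, d2, carry)
        out.append(digit)
    if carry:
        out.append(carry)
    return out

# 58 + 67 = 125, processed ones place first as in the training protocol.
assert add_numbers([8, 5], [7, 6]) == [5, 2, 1]
```

Because the same step rule is applied at every position, a model that captures it is invariant to sequence length, which is exactly the OOD axis tested above.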
For NTOC, we use auxiliary variables, denoted Zin and Zout, to carry memory across time steps. Specifically, zin is the auxiliary input received from the previous time step, and zout is the auxiliary output fed to the next time step. For all methods, 100 MNIST images are given for pretraining. We challenge all models by training them on numbers of length 3 and testing them on numbers of length 5-20. As shown in Table 1, NTOC performs significantly better than the non-symbolic models (i.e., LSTM and DNC). Moreover, NTOC is on par with DeepProbLog, even though the latter is explicitly given the ground-truth multi-digit addition rules. To trace the source of error, we additionally evaluate all models on numbers consisting of MNIST images from the training set. This removes the generalization gap towards unseen digit images and directly challenges the model's ability to learn generalizable addition rules. Results show that NTOC still outperforms all baselines without any provided ground-truth task semantics, which demonstrates that NTOC is able to learn generalizable symbolic rules. Moreover, we find that the main error source of NTOC is the perception part, i.e., the misclassifications made by the NN.

Another benefit of NTOC is its speed. Existing neural-symbolic approaches such as DeepProbLog and ILP (Evans and Grefenstette 2018) require a compilation step to convert their symbolic models into low-level computation modules, which takes hours or days even for very simple tasks. In contrast, since LCs are represented as computation graphs, NTOC is free from compilation and can handle problems with many variables.

Tower of Hanoi optimal action prediction The goal of this task is to predict the optimal next move given a game state. As determined by the nature of the game, algorithms need to select suitable symbols to achieve good generalization performance. We would like to reiterate that, compared to the previous task, only the perception model and the primitive task differ here.
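For the Tower of Hanoi task, the invariant feature in question reduces to comparisons between the top disks of the pillars. As a purely illustrative sketch (our own abstraction: integer disk sizes instead of the MNIST-digit disks used in the experiments, and a hypothetical function name), legal moves can be read off from top-disk comparisons alone, independent of the total number of disks:

```python
def top_disk_moves(state):
    # state: three pillars, each a list of disk sizes from bottom to top.
    # The invariant feature: only the top disk of each pillar matters,
    # regardless of how many disks the puzzle contains.
    INF = float('inf')  # an empty pillar accepts any disk
    tops = [pillar[-1] if pillar else INF for pillar in state]
    # A move i -> j is legal iff pillar i's top disk is smaller than j's.
    return [(i, j) for i in range(3) for j in range(3)
            if i != j and tops[i] < tops[j]]

moves = top_disk_moves([[3, 2, 1], [], []])  # only the smallest disk can move
```

Because the feature depends only on the three top disks, a policy built on it transfers unchanged from 5-disk training states to the 7- or 9-disk test states.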
As illustrated by Fig. 1(right), despite the different visual appearances of Hanoi states with different numbers of disks, an invariant policy can be found if the NN learns to always compare the sizes of the top disks on each pillar.6 This experiment challenges the models to automatically discover such an invariant policy without direct supervision. In the pretraining step, we give 500 samples to teach the model to compare disk sizes. In every sample, each pillar contains at most one disk, so it is impossible to learn the invariant feature solely from pretraining. Furthermore, to ensure we are evaluating each method's ability to discover invariant features and policies, we provide only 1,000 training samples while testing on 10,000 samples. With this large size difference between the training and testing sets, no considered method can resort to the shortcut of memorizing Hanoi state-action pairs.

We use three types of OOD settings: #1 (disk-number generalization): game states with 5 disks are used for training, while test samples contain 7 or 9 disks; #2 (sequence-length generalization): models are trained on sequences of length 3 and tested on sequences of length 5; #3 (both): the combination of #1 and #2. Setting #2 is the simplest in terms of generalization, as spurious features could still help. In contrast, settings #1 and #3, which require the model to generalize to game states with more disks, are more appropriate test beds for a method's ability to discover invariance. As shown in Table 1, our proposed method NTOC outperforms all baselines in all three settings. The lead is especially significant on settings #1 and #3.

7 Conclusion

The integration of perception and symbolic reasoning has been considered a key challenge in machine learning (Mattei and Walsh 2013; De Raedt et al. 2021; Garcez et al. 2019).
Our proposed algorithm NTOC learns explicit symbolic rules without assuming the existence of background knowledge or candidate rules. We show that the proposed model exhibits good out-of-distribution generalization on two neural-symbolic tasks.

6 We use MNIST images in place of disks; disk sizes are represented by different MNIST digits.

Acknowledgements

This work was funded in part by a grant from the National Key R&D Program of China (2021ZD0150200), the DARPA Perceptually-enabled Task Guidance (PTG) Program under contract number HR00112220005, NSF grants #IIS-1943641, #IIS-1956441, #CCF-1837129, Samsung, CISCO, a Sloan Fellowship, and a UCLA Samueli Fellowship. We thank Honghua Zhang for providing helpful feedback on an initial version of the paper.

References

Amos, B.; and Kolter, J. Z. 2017. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, 136-145. PMLR.

Blanchard, G.; Lee, G.; and Scott, C. 2011. Generalizing from several related classification tasks to a new unlabeled sample. NeurIPS, 24.

Chavira, M.; and Darwiche, A. 2008. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7): 772-799.

Dai, W.-Z.; and Muggleton, S. H. 2021. Abductive knowledge induction from raw data. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence.

Darwiche, A. 2011. SDD: A new canonical representation of propositional knowledge bases. In Twenty-Second International Joint Conference on Artificial Intelligence.

Darwiche, A.; and Marquis, P. 2002. A knowledge compilation map. Journal of Artificial Intelligence Research, 17: 229-264.

De Raedt, L.; Dumancic, S.; Manhaeve, R.; and Marra, G. 2021. From Statistical Relational to Neuro-Symbolic Artificial Intelligence. In Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, January 7-15, 2021, 4943-4950. ijcai.org.

Deshmukh, A. A.; Lei, Y.; Sharma, S.; Dogan, U.; Cutler, J. W.; and Scott, C. 2019. A generalization error bound for multi-class domain generalization. arXiv preprint arXiv:1905.10392.

Evans, R.; and Grefenstette, E. 2018. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61: 1-64.

Garcez, A.; Gori, M.; Lamb, L.; Serafini, L.; Spranger, M.; and Tran, S. 2019. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. Journal of Applied Logics, 6(4): 611-632.

Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwinska, A.; Colmenarejo, S. G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626): 471-476.

Kimmig, A.; Van den Broeck, G.; and De Raedt, L. 2012. Algebraic model counting. arXiv preprint arXiv:1211.4475.

Li, T.; and Srikumar, V. 2019. Augmenting Neural Networks with First-order Logic. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 292-302.

Liang, Y.; Bekker, J.; and Van den Broeck, G. 2017. Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI).

Liu, A.; and Van den Broeck, G. 2021. Tractable Regularization of Probabilistic Circuits. In Advances in Neural Information Processing Systems 35 (NeurIPS).

Manhaeve, R.; Dumancic, S.; Kimmig, A.; Demeester, T.; and De Raedt, L. 2018. DeepProbLog: Neural Probabilistic Logic Programming. In 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montreal, Canada, December 2-8, 2018, 3753-3760. Neural Information Processing Systems Foundation Inc.

Mattei, N.; and Walsh, T. 2013. PrefLib: A library for preferences, http://www.preflib.org. In International Conference on Algorithmic Decision Theory, 259-270. Springer.

Sak, H.; Senior, A. W.; and Beaufays, F. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. Neural Computation.

Sang, T.; Beame, P.; and Kautz, H. 2005a. Solving Bayesian networks by weighted model counting. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), volume 1, 475-482. AAAI Press.

Sang, T.; Beame, P.; and Kautz, H. A. 2005b. Performing Bayesian inference by weighted model counting. In AAAI, volume 5, 475-481.

Wang, P.-W.; Donti, P.; Wilder, B.; and Kolter, Z. 2019. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, 6545-6554. PMLR.

Xie, Y.; Xu, Z.; Kankanhalli, M. S.; Meel, K. S.; and Soh, H. 2019. Embedding symbolic knowledge into deep networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 4233-4243.

Xu, J.; Zhang, Z.; Friedman, T.; Liang, Y.; and Broeck, G. 2018. A semantic loss function for deep learning with symbolic knowledge. In International Conference on Machine Learning.

Yang, Z.; Ishay, A.; and Lee, J. 2020. NeurASP: Embracing Neural Networks into Answer Set Programming. In 29th International Joint Conference on Artificial Intelligence (IJCAI 2020).

Ye, H.; Xie, C.; Cai, T.; Li, R.; Li, Z.; and Wang, L. 2021. Towards a theoretical framework of out-of-distribution generalization. Advances in Neural Information Processing Systems, 34.