# Neuro-symbolic Learning Yielding Logical Constraints

Zenan Li¹, Yunpeng Huang¹, Zhaoyu Li², Yuan Yao¹, Jingwei Xu¹, Taolue Chen³, Xiaoxing Ma¹, Jian Lü¹

¹State Key Lab of Novel Software Technology, Nanjing University, China
²Department of Computer Science, University of Toronto, Canada
³School of Computing and Mathematical Sciences, Birkbeck, University of London, UK

{lizn, hyp}@smail.nju.edu.cn, zhaoyu@cs.toronto.edu, t.chen@bbk.ac.uk, {y.yao,jingweix,xxm,lj}@nju.edu.cn

## Abstract

Neuro-symbolic systems combine neural perception and logical reasoning, representing one of the priorities of AI research. End-to-end learning of neuro-symbolic systems is highly desirable, but remains challenging. Resembling the distinction and cooperation between System 1 and System 2 of human thought (à la Kahneman), this paper proposes a framework that fuses neural network training, symbol grounding, and logical constraint synthesis to support learning in a weakly supervised setting. Technically, it is cast as a game with two optimization problems, corresponding to neural network learning and symbolic constraint learning respectively. Such a formulation naturally embeds symbol grounding and enables interaction between the neural and the symbolic parts in both training and inference. The logical constraints are represented as cardinality constraints, and we use the trust region method to avoid degeneracy in learning. A distinguished feature of the optimization lies in the Boolean constraints, for which we introduce a difference-of-convex programming approach. Both theoretical analysis and empirical evaluations substantiate the effectiveness of the proposed framework.

## 1 Introduction

Perception and reasoning serve as fundamental human abilities that are intrinsically linked within the realm of human intelligence [Kahneman, 2011, Booch et al., 2021]. The objective of our study is to develop a learning framework for neuro-symbolic systems (e.g., the one illustrated in Figure 1), enabling simultaneous learning of neural perception and symbolic reasoning. The merit of neuro-symbolic learning lies in resembling the integration of System 1 and System 2 of human minds [Kahneman, 2011, Yoshua and Gary, 2020, LeCun, 2022]. First, it eliminates unnatural and sometimes costly human labeling of the latent variables and conducts learning in an end-to-end fashion. Second, it generates not only a neural network for perception, but also a set of explicit (symbolic) constraints enabling exact and interpretable logical reasoning. Last but not least, the mutually beneficial interaction between the neural and the symbolic parts during both training and inference potentially achieves better performance than separated learning approaches. However, existing approaches do not provide an adequate solution to the problem. They either (1) are not end-to-end, i.e., human intervention is employed to label the latent z so that the task can be divided into a purely neural subtask of image classification and a purely symbolic subtask of constraint solving, or (2) are not interpretable, i.e., they can only approximate symbolic reasoning with a neural network without generating explicit logical constraints (e.g., Wang et al. [2019], Yang et al. [2023]), which inevitably sacrifices the exactness and interpretability of symbolic reasoning, resulting in an inaccurate black-box predictor rather than a genuine neuro-symbolic system.
[Figure 1: An example of neuro-symbolic learning for visual Sudoku solving. In this task, the neural network is employed to transform the puzzle image (strawberry, etc.) into its corresponding symbols, while symbolic reasoning is utilized to produce the puzzle's solution. Importantly, the neuro-symbolic learning task is framed in a weakly supervised setting, where only the raw input (the puzzle image x) and the final output (the puzzle solution y, but without the numbers in z) are observed. Information flow in the figure: inference (perception by the neural network, reasoning by an SMT solver); constraint learning (revising h_ϕ with the current z = f_θ(x)); symbol grounding (revising z to satisfy the constraints h_ϕ(z, y) = 1); network training (revising f_θ by back-propagating the grounded z).]

We argue that end-to-end and interpretable neuro-symbolic learning is extremely challenging due to the semantic and representation gaps between the neural part and the symbolic part. The semantic gap, caused by the latency of the intermediate symbol (i.e., z in Figure 1), leaves neural network training without effective supervision and logical constraint synthesis without definite inputs. The representation gap between the differentiable neural network and the discrete symbolic logic makes it difficult to yield explicit symbolic constraints from continuous neural network parameters. Despite existing proposals mitigating some of these obstacles, such as visual symbol grounding [Topan et al., 2021, Li et al., 2023], softened logic loss [Kimmig et al., 2012, Xu et al., 2018], and semidefinite relaxation [Wang et al., 2019], none of them realizes the full merit of neuro-symbolic learning.

In this paper, we propose a new neuro-symbolic learning framework that directly meets these challenges. It bridges the semantic gap with an efficient symbol grounding mechanism that models the cooperative learning of both the neural network and the logical constraints as a bilevel optimization problem. It bridges the representation gap by employing difference-of-convex (DC) programming as a relaxation technique for the Boolean constraints in the optimization. DC programming ensures convergence to explicit logical constraints, which enables exact symbolic reasoning with powerful off-the-shelf tools such as SAT/SMT solvers [Eén and Sörensson, 2006, Bailleux and Boufkhad, 2003, Bailleux et al., 2006] during the inference stage. In addition, to address degeneracy in logical constraint learning, i.e., the tendency to learn only trivial logical constraints (e.g., simple rules insufficient to solve Sudoku), we introduce an additional trust region term [Boyd et al., 2004, Conn et al., 2000] and employ the proximal point algorithm in the learning of logical constraints.

We provide a theoretical analysis of the convergence of our algorithm, as well as the efficacy of the DC relaxation in preserving exactness and of the trust region in preventing degeneracy. Empirical evaluations with four tasks, viz. visual Sudoku solving, self-driving path planning, chained XOR, and Nonograms, demonstrate the new learning capability and the significant performance superiority of the proposed framework.

**Organization.** Section 2 formulates our neuro-symbolic learning framework.
Section 3 details the algorithm and theoretical analysis. Section 4 presents empirical evaluations. Section 5 covers related work. Section 6 discusses the limitations. Section 7 concludes the paper.

## 2 Neuro-symbolic Learning Framework

In this paper, we focus on end-to-end neuro-symbolic systems comprising two components: (1) a neural network $f_\theta : \mathcal{X} \to \mathcal{Z}$, which transforms the raw input $x \in \mathcal{X}$ into a latent state $z \in \mathcal{Z}$; and (2) symbolic reasoning $g_\phi : \mathcal{Z} \to \mathcal{Y}$, which deduces the final output $y \in \mathcal{Y}$ from the state $z \in \mathcal{Z}$. Both components are built simultaneously, taking only the input x and output y as supervision. We assume that $\mathcal{Z}$ and $\mathcal{Y}$ are finite sets, so z and y can be encoded as binary vectors via one-hot encoding. For ease of discussion, we directly define $\mathcal{Z}$ and $\mathcal{Y}$ to be the spaces $\mathbb{B}^u$ and $\mathbb{B}^v$ of Boolean vectors, respectively, where $\mathbb{B} = \{0, 1\}$.

Unlike existing work (e.g., SATNet [Wang et al., 2019]) that simulates logical reasoning in an implicit and approximate way via a network layer, our key insight is to learn explicit logical constraints $h_\phi : \mathbb{B}^{u+v} \to \mathbb{B}$ on (z, y) in the training phase, which allows exact reasoning to be performed by off-the-shelf constraint solvers (e.g., SMT solvers) in the inference phase. Figure 1 illustrates our neuro-symbolic learning framework.

The latent state z enables the interaction between neural perception and logical reasoning. If z were observable, we could perform the learning of the neural network and the logical constraints by solving the following two separate optimization problems:

$$\theta^\ast = \arg\min_\theta \mathbb{E}_{(x,z)\sim D_1}\big[\ell_1(f_\theta(x), z)\big], \qquad \phi^\ast = \arg\min_\phi \mathbb{E}_{(z,y)\sim D_2}\big[\ell_2(h_\phi(z, y), 1)\big],$$

where $\ell_1(f_\theta(x), z)$ refers to the error between the network prediction $f_\theta(x)$ and the actual symbol z, and $\ell_2(h_\phi(z, y), 1)$ refers to the (un)satisfaction degree of the learned logical constraints $h_\phi(z, y) = 1$. Nevertheless, with z being latent, these two problems become tightly coupled:

$$\begin{array}{ll} \theta^\ast = \arg\min_\theta \mathbb{E}_{(x,y)\sim D}\big[\ell_1(f_\theta(x), z)\big], & \phi^\ast = \arg\min_\phi \mathbb{E}_{(x,y)\sim D}\big[\ell_2(h_\phi(z, y), 1)\big],\\ \quad \text{s.t. } h_\phi(z, y) = 1,\ z \in \mathcal{Z}; & \quad \text{s.t. } z = f_\theta(x),\ z \in \mathcal{Z}. \end{array}$$

We further surrogate constraint satisfaction by loss functions, and obtain essentially a game formulation as follows:

$$\begin{array}{ll} \theta^\ast = \arg\min_\theta \mathbb{E}_{(x,y)\sim D}\big[\ell_1(f_\theta(x), \tilde{z})\big], & \phi^\ast = \arg\min_\phi \mathbb{E}_{(x,y)\sim D}\big[\ell_2(h_\phi(\tilde{z}, y), 1)\big],\\ \quad \text{s.t. } \tilde{z} = \arg\min_{z\in\mathcal{Z}} \ell_2(h_\phi(z, y), 1); & \quad \text{s.t. } \tilde{z} = \arg\min_{z\in\mathcal{Z}} \ell_1(f_\theta(x), z). \end{array} \tag{1}$$

Three players are involved in this game: the neural network $f_\theta$ and the logical constraints $h_\phi$ pursue optimal (prediction/satisfaction) accuracy, while z strives for the grounding that integrates network prediction and logical reasoning.

### 2.1 Efficient and Effective Logical Constraint Learning

For efficient learning of logical constraints, we adopt cardinality constraints [Syrjänen, 2009, 2004, Fiorini et al., 2021] to represent logical constraints. (Cardinality constraints express propositional logic formulae by constraining the number of variables being true; for example, $x \lor y \lor z$ is encoded as $x + y + z \ge 1$.) Cardinality constraints can be easily arithmetized, enabling conventional optimization methods and avoiding computationally expensive model counting or sampling [Manhaeve et al., 2018, Xu et al., 2018, Li et al., 2020, 2023], which significantly boosts learning efficiency. In addition, conjunctive normal form and cardinality constraints can be easily converted into each other, ensuring not only the expressiveness of the learned constraints, but also seamless compatibility with existing reasoning engines.

Formally, we denote the column concatenation of z and y by (z; y), and define a cardinality constraint as $h_\phi(z, y) := w^T(z; y) \in [b_{\min}, b_{\max}]$, where $\phi = (w, b_{\min}, b_{\max})$, $w \in \mathbb{B}^{u+v}$ is a Boolean vector, and $b_{\min}, b_{\max} \in \mathbb{N}_+$ are two positive integers.
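To make the arithmetization concrete, the following minimal sketch checks a single cardinality constraint; the function and variable names (`satisfies`, `w`, `z`, `y`) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def satisfies(w, b_min, b_max, z, y):
    """Check a cardinality constraint h_phi(z, y): w^T (z; y) in [b_min, b_max]."""
    s = w @ np.concatenate([z, y])
    return b_min <= s <= b_max

# The clause (x1 v x2 v x3) becomes the cardinality constraint x1 + x2 + x3 >= 1,
# i.e., w = (1, 1, 1), b_min = 1, b_max = 3 over the three Boolean variables.
w = np.array([1, 1, 1])
z = np.array([0, 1])             # latent symbols (illustrative split of variables)
y = np.array([0])                # observed output
print(satisfies(w, 1, 3, z, y))  # True: 0 + 1 + 0 = 1 lies in [1, 3]
```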
Moreover, we can directly extend $h_\phi$ to the matrix form representing m logical constraints, i.e., $h_\phi(z, y) := W(z; y) = (w_1^T(z; y), \ldots, w_m^T(z; y)) \in [b_{\min}, b_{\max}]$, where $\phi = (W, b_{\min}, b_{\max})$, $W := (w_1^T; \ldots; w_m^T) \in \mathbb{B}^{m\times(u+v)}$, and $b_{\min}, b_{\max} \in \mathbb{N}_+^m$.

For effective learning of logical constraints, one must tackle the frequent problem of degeneracy, which causes incomplete or trivial constraints. (For example, suppose the targeted logical constraints are $z_1 + z_2 + y_1 \ge 2$ and $z_1 \le 1$. Degeneracy happens if one only obtains the single constraint $z_1 + z_2 + y_1 \ge 2$, or a trivial constraint such as $z_1 \ge 0$.)

First, for the bounding box $[b_{\min}, b_{\max}]$, any superset of it is a feasible result of logical constraint learning, but we actually expect the tightest one. To overcome this difficulty, we propose to use the classic mean squared loss, i.e., defining $\ell_2(h_\phi(z, y), 1) = \|W(z; y) - b\|^2$, where b can be computed in the optimization or pre-defined. Our logical constraint learning problem can then be formulated as a Boolean least squares problem [Vandenberghe and Boyd, 2004]. The unbiasedness of least squares indicates that $b \approx (b_{\min} + b_{\max})/2$, and the minimum variance property of least squares ensures that we can achieve a tight bounding box $[b_{\min}, b_{\max}]$ [Henderson, 1975, Björck, 1990, Hastie et al., 2009].

Second, it is highly possible that some of the m distinct logical constraints indicated by the matrix W eventually degenerate to the same one during training, because the Boolean constraints and stochastic gradient descent often introduce implicit bias [Gunasekar et al., 2017, Smith et al., 2021, Ali et al., 2020]. To mitigate this problem, we adopt the trust region method [Boyd et al., 2004, Conn et al., 2000], i.e., adding the constraints $\|w_i - w_i^{(0)}\| \le \lambda$, $i = 1, \ldots, m$, where $w_i^{(0)}$ is a pre-defined centre point of the trust region. The trust region method enforces each $w_i$ to search within its own local region. We give an illustrative figure in Appendix D to further explain the trust region method.

To summarize, using a penalty instead of the trust region constraint, we can formulate the optimization problem of logical constraint learning in (1) as

$$\min_{(W, b)}\ \mathbb{E}_{(x,y)\sim D}\big[\|W(\tilde{z}; y) - b\|^2\big] + \lambda\|W - W^{(0)}\|^2, \quad \text{s.t. } \tilde{z} = \arg\min_{z\in\mathcal{Z}} \ell_1(f_\theta(x), z),\ \ W \in \mathbb{B}^{m\times(u+v)},\ b \in \mathbb{N}_+^m. \tag{2}$$

### 2.2 Neural Network Learning in Tandem with Constraint Learning

A key challenge underlying end-to-end neuro-symbolic learning is symbol grounding, i.e., tackling the chicken-and-egg situation between network training and logical constraint learning: training the network requires the supervision of the symbol z, which comes from solving the learned logical constraints, but constraint learning needs z as input, as recognized by the trained network. Specifically, since the matrix W is often underdetermined in high-dimensional cases, the constraint $\tilde{z} = \arg\min_{z\in\mathcal{Z}} \ell_2(h_\phi(z, y), 1) = \arg\min_{z\in\mathcal{Z}} \|W(z; y) - b\|^2$ often has multiple minimizers (i.e., multiple feasible groundings of z), all of which satisfy the logical constraints $h_\phi(z, y) = 1$. Moreover, W is itself a to-be-trained parameter, meaning that it is highly risky to determine the symbol grounding solely from the logical constraints.
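The ambiguity can be seen on a toy instance. The sketch below (all values illustrative, not learned parameters) enumerates the Boolean groundings that satisfy a single constraint; with several minimizers, the constraint alone cannot pin down z.

```python
import itertools
import numpy as np

# Toy instance: one constraint over two latent bits and one observed bit,
# z1 + z2 + y1 = 2 (W and b are illustrative, not actual learned values).
W = np.array([[1, 1, 1]])
b = np.array([2])
y = np.array([1])

feasible = [z for z in itertools.product([0, 1], repeat=2)
            if np.array_equal(W @ np.concatenate([z, y]), b)]
print(feasible)  # [(0, 1), (1, 0)] -- two minimizers, so the grounding is ambiguous
```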
To address these issues, instead of (approximately) enumerating all feasible solutions via model counting or sampling [Manhaeve et al., 2018, Xu et al., 2018, Li et al., 2020, van Krieken et al., 2022, Li et al., 2023], we directly combine network prediction and logical constraint satisfaction to establish the symbol grounding, owing to the flexibility provided by the cardinality constraints. Specifically, for a given $\alpha \in [0, +\infty)$, the constraint in network learning can be rewritten as

$$\tilde{z} = \arg\min_{z\in\mathcal{Z}}\ \|W(z; y) - b\|^2 + \alpha\|z - f_\theta(x)\|^2.$$

The coefficient α can be interpreted as the preference of symbol grounding for network predictions or logical constraints. For $\alpha \to 0$, the symbol grounding process can be interpreted as distinguishing the final symbol z̃ from all feasible solutions based on network predictions. For $\alpha \to +\infty$, the symbol grounding process can be viewed as a correction step, where we revise the symbol grounding from the network's prediction towards the logical constraints. Furthermore, as we will show in Theorem 1, both symbol grounding strategies finally converge to the expected results. The optimization problem of network training in (1) can be written as

$$\min_\theta\ \mathbb{E}_{(x,y)\sim D}\big[\|\tilde{z} - f_\theta(x)\|^2\big], \quad \text{s.t. } \tilde{z} = \arg\min_{z\in\mathcal{Z}} \|W(z; y) - b\|^2 + \alpha\|z - f_\theta(x)\|^2, \tag{3}$$

where we also use the mean squared loss, i.e., $\ell_1(f_\theta(x), z) = \|f_\theta(x) - z\|^2$, for compatibility.

## 3 Algorithms and Analysis

Our general framework is given by (1), instantiated by (2) and (3). Both optimizations contain Boolean constraints on objectives of the form $\|Qu - q_1\|^2 + \tau\|u - q_2\|^2$, where u are Boolean variables. (In (2), for each logical constraint, $Q = (\tilde{z}; y)^T$, $q_1 = b_i$, $q_2 = w_i^{(0)}$, and $\tau = \lambda$; in (3), $Q = W$, $q_1 = b$, $q_2 = (f_\theta(x); y)$, and $\tau = \alpha$.) We propose to relax these Boolean constraints by difference-of-convex (DC) programming [Tao and Hoai An, 1997, Yuille and Rangarajan, 2003, Lipp and Boyd, 2016, Hoai An and Tao, 2018]. Specifically, a Boolean constraint $u \in \{0, 1\}$ can be rewritten as two constraints, $u - u^2 \ge 0$ and $u - u^2 \le 0$. The first is essentially a box constraint, i.e., $u \in [0, 1]$, which is kept in the optimization. The second is concave, and we can equivalently add it as a penalty term, as indicated by the following proposition [Bertsekas, 2015, Hansen et al., 1993, Le Thi and Dinh Tao, 2001].

**Proposition 1.** Let e denote the all-one vector. There exists $t_0 \ge 0$ such that for every $t > t_0$, the following two problems are equivalent, i.e., they have the same optimum:

$$\text{(P)}\ \min_{u\in\{0,1\}^n} q(u) := \|Qu - q_1\|^2 + \tau\|u - q_2\|^2, \qquad \text{(P}_t\text{)}\ \min_{u\in[0,1]^n} q_t(u) := \|Qu - q_1\|^2 + \tau\|u - q_2\|^2 + t(e^T u - u^T u).$$

**Remarks.** We provide more details, including the setting of $t_0$, in Appendix A.

However, adding this penalty term causes non-convexity. Thus, DC programming further linearizes the penalty, $u - u^2 \approx \bar{u} - \bar{u}^2 + (u - \bar{u})(1 - 2\bar{u})$ at a given point $\bar{u}$, and formulates the problem in Proposition 1 as

$$\min_{u\in[0,1]^n}\ \|Qu - q_1\|^2 + \tau\|u - q_2\|^2 + t(e - 2\bar{u})^T u.$$

By applying this linearization, we achieve a successive convex approximation to the Boolean constraint [Razaviyayn, 2014], ensuring that the training is more stable and globally convergent [Lipp and Boyd, 2016]. Furthermore, instead of fixing the coefficient t, we propose to gradually increase it until the Boolean constraint is fully satisfied, forming an annealing procedure.
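Below is a minimal sketch of this annealing idea, assuming a simple projected-gradient inner solver; the schedule, tolerances, and toy data are illustrative choices of ours, not the paper's exact settings.

```python
import numpy as np

def solve_pt(Q, q1, q2, tau, lr=0.01):
    """Anneal the DC penalty t(e^T u - u^T u) until u becomes Boolean (sketch)."""
    n = Q.shape[1]
    u = np.full(n, 0.5)
    for t in np.arange(0.0, 20.0, 0.5):              # gradually increase t (illustrative)
        for _ in range(500):
            grad = 2 * Q.T @ (Q @ u - q1) + 2 * tau * (u - q2) + t * (1 - 2 * u)
            u = np.clip(u - lr * grad, 0.0, 1.0)     # keep the box constraint u in [0,1]^n
        if np.all((u < 1e-3) | (u > 1 - 1e-3)):      # Boolean constraint reached: stop early
            break
    return np.round(u)

Q = np.array([[1.0, 1.0], [1.0, 0.0]])
print(solve_pt(Q, q1=np.array([1.0, 1.0]), q2=np.array([0.5, 0.5]), tau=0.1))
```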
Then, if u {0, 1}n is a global optimum of (P) (as well as (Pt)), it holds that [ q(u)]i(1 2ui) + ρi 0, i = 1, . . . , n, where ρi is the i-th diagonal element of (QTQ + τI). Remarks. The proposition reveals a trade-off of t: a larger t encourages exploration of more stationary points satisfying the Boolean constraints, but a too large t may cause the converged point to deviate from the optimality of (P). Therefore, a gradual increase of t is sensible for obtaining the desired solution. Moreover, the initial minimization under small t results in a small gradient value (i.e., |[ q(u)]i|), thus a Boolean stationary point can be quickly achieved with a few steps of increasing t. 3.1 Algorithms For a given dataset {(xi, yi)}N i=1, X = (x1, . . . , x N) and Y = (y1, . . . , y N) represent the data matrix and label matrix respectively, and fθ(k)(X) denotes the network prediction at the k-th iteration. Logical constraint learning. Eliminating the constraint in (2) by letting Z = fθ(k)(X), the empirical version of the logical constraint learning problem at the (k+1)-th iteration is i=1 (fθ(k)(X); Y)wi bi 2 + λ wi w(0) i 2 + t1(e 2w(k) i )Twi, where w(k) 1 , . . . , w(k) m are parameters of logical constraints at the k-th iteration. In this objective function, the first term is the training loss of logical constraint learning, the second term is the trust region penalty to avoid degeneracy, and the last term is the DC penalty of the Boolean constraint. To solve this problem, we adopt the proximal point algorithm (PPA) [Rockafellar, 1976, Parikh et al., 2014, Rockafellar, 2021], as it overcomes two challenges posed by stochastic gradient descent. First, stochastic gradient descent has an implicit inductive bias [Gunasekar et al., 2017, Ali et al., 2020, Zhang et al., 2021, Smith et al., 2021], causing different wi, i = 1, . . . , m, to converge to a singleton. Second, the data matrix (fθ(k)(X); Y) is a 0-1 matrix and often ill-conditioned, leading to diverse or slow convergence rates of stochastic gradient descent. Algorithm 1 Neuro-symbolic Learning Procedure Set step sizes (γ, η), and penalty coefficients (λ, t1, t2). Randomly generate an initial matrix W (0) under [0, 1] uniform distribution. for k = 0, 1, . . . , K do Randomly draw a batch {(xi, yi)}N i=1 from training data. Compute the predicted symbol zi = fθ(k)(xi), i = 1, . . . , N. Network prediction Update (W , b) from (zi, yi), i = 1, . . . , m, by PPA update (Eq. 4). Constraint learning Correct the symbol grounding z from zi to logical constraints hϕ (Eq. 5). Symbol grounding Update θ from (xi, zi), i = 1, . . . , m, by SGD update (Eq. 6). Network training if W / Bm (u+v)/ Z / BN (u+v) then Increasing t1/t2. Enforcing DC penalty end if end for Estimate (bmin, bmax) based on network fθ and logical constraints hϕ from training data. Given (W (k), b(k)) at the k-th iteration, the update of PPA can be computed by W (k+1) = (M + (λ + 1 γ )I) 1 (fθ(k)(X); Y)Tb(k) + λW (0) + ( 1 γ + t1)W (k 1) t1 b(k+1) = (1 + 1 γ ) 1(W (k)(fθ(k)(X); Y)Te + 1 γ b(k)), (4) where γ > 0 is the step size of PPA, E is an all-ones matrix, and M = (fθ(k)(X); Y)T(fθ(k)(X); Y). Note that the computation of matrix inverse is required, but it is not an issue because M is positive semidefinite, and so is the involved matrix, which allows the use of Cholesky decomposition to compute the inverse. Moreover, exploiting the low rank and the sparsity of M can significantly enhance the efficiency of the computation. Neural network training. 
**Neural network training.** By adding the DC penalty, the constraint (i.e., symbol grounding) in (3) becomes

$$\tilde{z} = \arg\min_{z}\ \|W(z; y) - b\|^2 + \alpha\|z - f_{\theta^{(k)}}(x)\|^2 + t_2\big(e - 2\tilde{z}^{(k)}\big)^T z.$$

We can also compute the closed-form solution, i.e.,

$$\tilde{Z} = \Big(b^T W + (\alpha + t_2)\tilde{Z}^{(k)} + \alpha f_{\theta^{(k)}}(X) - \tfrac{t_2}{2}E\Big)\big(W^TW + \alpha I\big)^{-1}. \tag{5}$$

Similarly, the low-rank and sparsity properties of $W^TW$ ensure efficient computation of the matrix inverse. Finally, the network parameter θ is updated by stochastic gradient descent,

$$\theta^{(k+1)} = \theta^{(k)} - \eta \sum_{i=1}^{N} \nabla_\theta f_{\theta^{(k)}}(x_i)^T\big(f_{\theta^{(k)}}(x_i) - \tilde{z}_i\big). \tag{6}$$

The overall procedure is summarized in Algorithm 1, which mainly involves three iterative steps: (1) update the logical constraints by combining the network predictions and the observed outputs; (2) correct the symbol grounding by revising the predictions to satisfy the logical constraints; (3) update the neural network by back-propagating the corrected symbol groundings with the observed inputs.

### 3.2 Theoretical Analysis

**Theorem 1.** With an increasing (or decreasing) α, the constraint learning and network training performed by Algorithm 1 converge to stationary points of (2) and (3), respectively. Specifically,

$$\mathbb{E}\big[\|\nabla_\theta \ell_1^k(\theta^k)\|^2\big] = O\Big(\frac{1}{\sqrt{K+1}}\Big), \qquad \mathbb{E}\big[\|\nabla_\phi \ell_2(\phi^k)\|^2\big] = O\Big(\frac{1}{\sqrt{K+1}}\Big).$$

**Remarks.** The proof and additional results can be found in Appendix B. In summary, Theorem 1 confirms the convergence of our neuro-symbolic learning framework and illustrates its theoretical complexity. Furthermore, note that an increase (or decrease) in α indicates a preference for correcting the symbol grounding over network prediction (or over logical constraint learning). In practice, we can directly set a small (or large) enough α instead.

Next, we analyze the setting of the centre points $W^{(0)}$ in the trust region penalty.

**Theorem 2.** Let $w_1^{(0)} \in [0,1]^n$ and $w_2^{(0)} \in [0,1]^n$ be two initial points sampled from the uniform distribution. For given $t \ge 0$, the probability that the corresponding logical constraints $\phi_1$ and $\phi_2$ converge to the same (binary) stationary point ϕ satisfies

$$\Pr(\phi_1 = \phi,\ \phi_2 = \phi) \le \prod_{i=1}^{n}\min\Big(\frac{1}{2\lambda}\big([\nabla q(u)]_i(1 - 2u_i) + t\big),\ 1\Big)^2.$$

**Remarks.** The proof is in Appendix C. In a nutshell, Theorem 2 shows that the probability of $w_1$ and $w_2$ degenerating to the same logical constraint can be made very small with suitably chosen λ and t. Note that λ and t play different roles. As shown by Proposition 2, the coefficient t in the DC penalty ensures that logical constraint learning successfully converges to a sensible result. The coefficient λ in the trust region penalty enlarges the divergence of the convergence conditions between two distinct logical constraints, thereby effectively preventing degeneracy.

## 4 Experiments

We carry out experiments on four tasks, viz., chained XOR, Nonograms, visual Sudoku solving, and self-driving path planning. We use the Z3 SMT (MaxSAT) solver [Moura and Bjørner, 2008] for symbolic reasoning. Other implementation details can be found in Appendix E. The experimental results for the chained XOR and Nonograms tasks are detailed in Appendix F due to the space limit. The code is available at https://github.com/Lizn-zn/Nesy-Programming.

### 4.1 Visual Sudoku Solving

**Datasets.** We consider two 9×9 visual Sudoku solving datasets, i.e., the SATNet dataset [Wang et al., 2019, Topan et al., 2021] and the RRN dataset [Yang et al., 2023], where the latter is more challenging (17–34 versus 31–42 given digits in each puzzle). (The SATNet dataset was originally created by Wang et al. [2019]; Topan et al. [2021] point out that the original dataset has a label leakage problem, which was fixed by removing the labels of given digits.) Both datasets contain 9K/1K training/test examples, and their images are all sampled from the MNIST dataset. We additionally consider two transfer tasks, i.e., training the neuro-symbolic system on the SATNet dataset (resp.
RRN dataset), and then evaluating the system on the RRN dataset (resp. SATNet dataset).

**Baselines.** We compare our method with four state-of-the-art methods, i.e., RRN [Palm et al., 2018], SATNet [Wang et al., 2019], SATNet* [Topan et al., 2021], and L1R32H4 [Yang et al., 2023]. RRN is modified to match visual Sudoku, as done by Yang et al. [2023]. SATNet* is an improved version of SATNet that addresses the symbol grounding problem by introducing an additional pre-clustering step. As part of our ablation study, we introduce two variants of our method (NTR and NDC), where NTR removes the trust region penalty (i.e., setting λ = 0), and NDC removes the DC penalty (i.e., fixing t₁ = t₂ = 0) and directly binarizes (W, b) as the finally learned logical constraints.

**Results.** We report the accuracy results (i.e., the percentage of correctly recognized boards, correctly solved boards, and both) in Table 1. (We successfully reproduce the baseline methods, achieving results consistent with Yang et al. [2023].) A more detailed version of our experimental results is given in Appendix F. The results show that our method significantly outperforms the existing methods in all cases, and that both the trust region penalty and the DC penalty are critical design choices. The solving accuracy is slightly higher than the perception accuracy, as the MaxSAT solver may still solve the problem correctly even when the perception result is wrong. Notably, our method precisely learns all logical constraints, resulting in a logical reasoning component that (1) achieves full accuracy when the neural perception is correct, and (2) ensures robust results on the transfer tasks, in contrast to the highly sensitive existing methods. (The Sudoku task requires 324 = 9 × 9 × 4 ground-truth cardinality constraints: each row/column/block should contain 1–9, and each cell should hold one digit in 1–9; the rank of this constraint set is 249. After removing redundant constraints, our method learns exactly 324 logical constraints that correspond one-to-one to the ground truth.)

Table 1: Accuracy results (%) on the visual Sudoku task, reporting perception/solving/total accuracy on the SATNet dataset, the RRN dataset, and the SATNet→RRN transfer task. Our method performs the best.

| Method | SATNet Percep. | SATNet Solving | SATNet Total | RRN Percep. | RRN Solving | RRN Total | Transfer Percep. | Transfer Solving | Transfer Total |
|---|---|---|---|---|---|---|---|---|---|
| RRN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SATNet | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SATNet* | 72.7 | 75.9 | 67.3 | 75.7 | 0.1 | 0.1 | 80.8 | 1.4 | 1.4 |
| L1R32H4 | 94.1 | 91.0 | 90.5 | 87.7 | 65.8 | 65.7 | 84.8 | 21.3 | 21.3 |
| NTR | 87.4 | 0.0 | 0.0 | 91.4 | 3.9 | 3.9 | 90.2 | 0.0 | 0.0 |
| NDC | 79.9 | 0.0 | 0.0 | 88.0 | 0.0 | 0.0 | 86.1 | 0.0 | 0.0 |
| Ours | 95.5 | 95.9 | 95.5 | 93.1 | 94.4 | 93.1 | 93.9 | 95.2 | 93.9 |

[Figure 2: Training curves of accuracy (left two panels, on the SATNet and RRN datasets, comparing L1R32H4 and ours) and of the rank of W (right two panels, comparing NDC, NTR, and ours). Our method significantly boosts the efficiency of symbol grounding, and accurately converges to the ground-truth constraints.]

Furthermore, we plot some training curves in Figure 2. The left two panels depict the training curves of neural perception accuracy on the SATNet dataset and the RRN dataset, demonstrating the much higher efficiency of our method in symbol grounding compared with the best competitor, L1R32H4. We also compute the rank of the matrix W to evaluate the degeneracy of logical constraint learning.
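For reference, such a rank curve can be obtained directly from the learned parameters; a minimal sketch (the binarization threshold is our own assumption):

```python
import numpy as np

def constraint_rank(W, threshold=0.5):
    """Rank of the binarized constraint matrix, a proxy for degeneracy.

    Distinct learned constraints keep the rank high; collapsed (duplicate)
    rows lower it, which is what the right two panels of Figure 2 track.
    """
    W_bin = (W >= threshold).astype(np.float64)
    return np.linalg.matrix_rank(W_bin)
```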
The results are presented in the right two panels, illustrating that the logical constraints learned by our method are complete and precise. In contrast, the ablation methods either fail to converge to the correct logical constraints or end in a degenerate outcome.

### 4.2 Self-driving Path Planning

**Motivation.** Self-driving systems are fundamentally neuro-symbolic: the primary functions are delineated into two components, object detection empowered by neural perception, and path planning driven by symbolic reasoning. Neuro-symbolic learning has great potential in self-driving, e.g., for learning from demonstrations [Schaal, 1996] and for fostering more human-friendly driving patterns [Sun et al., 2021, Huang et al., 2021].

**Datasets.** We simulate the self-driving path planning task based on two datasets, i.e., KITTI [Geiger et al., 2013] and nuScenes [Caesar et al., 2020]. Rather than providing labels for object detection, we only use planning paths as supervision. To compute the planning paths, we construct obstacle maps with 10×10 grids, and apply the A* algorithm with fixed start points and random end points. Note that KITTI and nuScenes contain 6160/500 and 7063/600 training/test examples, respectively, where nuScenes is more difficult (7.4 versus 4.6 obstacles per image on average).

**Baselines.** We include L1R32H4 [Yang et al., 2023], the best competitor in the previous experiment, as a comparison. Alongside it, we also build an end-to-end ResNet model (denoted ResNet) [He et al., 2016] and an end-to-end recurrent transformer model (denoted RTNet) [Hao et al., 2019]. These models take the scene image, as well as the start point and the end point, as input, and directly output the predicted path. Finally, as a reference, we train a ResNet model with direct supervision (denoted SUP) using the labels of object detection, with logical reasoning again done by the A* algorithm.

**Results.** We report the F1 score of the predicted path grids, the collision rate of the planned path, and the distance error between the shortest path and the planned path (computed only for safe paths) in Table 2. The results show that our method achieves the best performance on both datasets compared with the alternatives. In particular, the existing state-of-the-art method L1R32H4 fails on this task, resulting in a high collision rate. Our method is nearly comparable to the supervised reference model SUP on the KITTI dataset. On the nuScenes dataset, our method even produces a slightly lower distance error on safe paths than the SUP method.

Table 2: Results of the self-driving path planning task. Our method performs the best.

| Method | KITTI F1 score | KITTI Coll. rate | KITTI Dist. err. | nuScenes F1 score | nuScenes Coll. rate | nuScenes Dist. err. |
|---|---|---|---|---|---|---|
| ResNet | 68.5% | 54.0% | 2.91 | 51.8% | 68.1% | 3.60 |
| RTNet | 77.3% | 36.8% | 2.89 | 55.9% | 63.8% | 2.94 |
| L1R32H4 | 11.9% | 100.0% | N/A | 12.0% | 91.5% | 100.0 |
| Ours | 80.2% | 32.8% | 2.84 | 58.8% | 57.8% | 2.81 |
| SUP | 84.9% | 28.3% | 2.75 | 74.6% | 52.9% | 2.90 |

## 5 Related Work

**Neuro-symbolic learning.** Neuro-symbolic learning has received great attention recently. For instance, Dai et al. [2019] and Corapi et al. [2010] suggest bridging neural perception and logical reasoning via an abductive approach, where a logic program is abstracted from a given knowledge base. To reduce reliance on knowledge bases, Ciravegna et al. [2020] and Dong et al. [2019] directly represent and learn constraints using neural networks. However, the learned constraints are still uninterpretable. To improve interpretability, Wang et al.
[2019] introduce SATNet, a method that relaxes the MaxSAT problem with semidefinite programming and incorporates it as a layer into neural networks. SATNet is further followed up by several works [Topan et al., 2021, Lim et al., 2022, Yang et al., 2023]. However, how to explicitly extract and use the learned constraints remains unclear in these works. In contrast to the existing neuro-symbolic learning methods, our method can synthesize explicit logical constraints supporting exact reasoning by off-the-shelf reasoning engines.

**Constraint learning.** Our work is also related to constraint learning, which can be traced back to Valiant's algorithm [Valiant, 1984] and, more generally, inductive logic programming [Muggleton and De Raedt, 1994, Bratko and Muggleton, 1995, Yang et al., 2017, Evans and Grefenstette, 2018]. However, Cropper and Dumančić [2022] highlight that inductive logic programming is limited when learning from raw data, such as images and speech, as opposed to perfect symbolic data. To this end, our method goes a step further by properly tackling the symbol grounding problem.

**Boolean quadratic programming and its relaxation.** Many constraint learning and logical reasoning tasks, e.g., learning pseudo-Boolean functions [Marichal and Mathonet, 2010], MaxSAT learning [Wang et al., 2019] and solving [Gomes et al., 2006], and SAT solving [Lipp and Boyd, 2016], can be formulated as Boolean quadratic programming (i.e., quadratic programming with binary variables) [Hammer and Rubin, 1970]. However, commonly used techniques, such as branch and bound [Buchheim et al., 2012] and cutting planes [Kelley, 1960], cannot be applied in neuro-symbolic learning tasks. In the literature, semidefinite relaxation (SDR) [d'Aspremont and Boyd, 2003, Gomes et al., 2006, Wang and Kolter, 2019] and difference-of-convex (DC) programming [Tao and Hoai An, 1997, Yuille and Rangarajan, 2003, Lipp and Boyd, 2016, Hoai An and Tao, 2018] are two typical methods for relaxing Boolean constraints. Although SDR is generally more efficient, its tightness and the recovery of binary results from the relaxation are still open problems [Burer and Ye, 2020, Wang and Kılınç-Karzan, 2022], compromising the exactness of logical reasoning. In this work, we choose DC programming and translate the DC constraints into a penalty term with gradually increasing weight, so that the Boolean constraints are eventually guaranteed.

## 6 Limitations

In this section, we discuss the limitations of our framework and outline some potential solutions.

**Expressiveness.** The theoretical capability of cardinality constraints to represent any propositional logic formula does not necessarily imply the practical ability to learn any such formula in our framework; this remains a challenge. Fundamentally, logical constraint learning is an inductive method, and different learning methods have different inductive biases. Cardinality constraint-based learning is more suitable for tasks where the logical constraints can be straightforwardly translated into cardinality form. A typical example of such a task is Sudoku, where the target CNF formula consists of at least 8,829 clauses [Lynce and Ouaknine, 2006], while the total number of target cardinality constraints stands at a mere 324 (see the sketch below).
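To make this count concrete, the following sketch enumerates the 324 ground-truth cardinality constraints over the 729 one-hot Sudoku variables; the indexing scheme is our own choice, and the rank value in the final comment is the one reported in Section 4.1.

```python
import numpy as np

# One-hot Sudoku encoding: variable idx(r, c, d) is 1 iff cell (r, c) holds digit d.
idx = lambda r, c, d: 81 * r + 9 * c + d     # illustrative indexing over 729 variables

def sudoku_cardinality_constraints():
    """The 324 ground-truth constraints: each w^T x = 1 (b_min = b_max = 1)."""
    rows = []
    for r in range(9):                       # each cell holds exactly one digit
        for c in range(9):
            w = np.zeros(729); w[[idx(r, c, d) for d in range(9)]] = 1; rows.append(w)
    for d in range(9):
        for r in range(9):                   # each digit appears once per row
            w = np.zeros(729); w[[idx(r, c, d) for c in range(9)]] = 1; rows.append(w)
        for c in range(9):                   # ... and once per column
            w = np.zeros(729); w[[idx(r, c, d) for r in range(9)]] = 1; rows.append(w)
        for blk in range(9):                 # ... and once per 3x3 block
            cells = [(3 * (blk // 3) + i, 3 * (blk % 3) + j)
                     for i in range(3) for j in range(3)]
            w = np.zeros(729); w[[idx(r, c, d) for r, c in cells]] = 1; rows.append(w)
    return np.stack(rows)

W = sudoku_cardinality_constraints()
print(W.shape, np.linalg.matrix_rank(W))     # (324, 729); the paper reports rank 249
```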
Technically, our logical constraint learning prefers equality constraints (e.g., $x + y = 2$), which induce logical conjunction (e.g., $x \land y = \top$), and may ignore potential logical disjunctions, which are represented by inequality constraints (e.g., $x \lor y = \top$ is expressed by $x + y \ge 1$). To overcome this issue, a practical trick is to introduce auxiliary variables, as is commonly done in linear programming [Fang and Puthenpura, 1993]. Consider the disjunction $x \lor y = \top$; here, the auxiliary variables $z_1, z_2$ help form two equalities, namely, $x + y + z_1 = 2$ for $(x, y) = (\top, \top)$ and $x + y + z_2 = 1$ for $(x, y) = (\top, \bot)$ or $(x, y) = (\bot, \top)$. One can refer to the chained XOR task (cf. Section F.1) for a concrete application of auxiliary variables.

**Reasoning efficiency.** The reasoning efficiency, particularly that of SMT solvers, during the inference phase can be a primary bottleneck of our framework. For instance, in the self-driving path planning task, when we scale the map up to a 20×20 grid involving 800 Boolean variables (400 for grid obstacles and 400 for path designation), the Z3 MaxSAT solver takes more than two hours on some inputs. To boost reasoning efficiency, several practical methods could be applied. One straightforward method is to use an integer linear programming (ILP) solver (e.g., Gurobi) as an alternative to the Z3 MaxSAT solver. In addition, some learning-based methods (e.g., Balunovic et al. [2018]) may enhance SMT solvers in our framework. Nonetheless, we do not expect that merely using a more efficient solver resolves the problem. To improve scalability, a more promising way is to combine System 1 and System 2 in the inference stage as well (e.g., Cornelio et al. [2023]). Generally speaking, in the inference stage, neural perception should first deliver a partial solution, which is then completed by the reasoning engine. Such a paradigm ensures fast reasoning via neural perception, drastically reducing the number of logical variables that need to be solved by the exact reasoning engine, thereby also improving its efficiency.

## 7 Conclusion

This paper presents a neuro-symbolic learning approach that conducts neural network training and logical constraint synthesis simultaneously, fueled by symbol grounding. The gap between neural networks and symbolic logic is suitably bridged by cardinality constraint-based learning and difference-of-convex programming. Moreover, we introduce the trust region method to effectively prevent the degeneracy of logical constraint learning. Both theoretical analysis and empirical evaluations have confirmed the effectiveness of the proposed approach. Future work could explore constraint learning using large language models to trim the search space of the involved logical variables, and augment reasoning efficiency by further combining logical reasoning with neural perception.

**Acknowledgment.** We are thankful to the anonymous reviewers for their helpful comments. This work is supported by the National Natural Science Foundation of China (Grants #62025202, #62172199). T. Chen is also partially supported by Birkbeck BEI School Project (EFFECT) and an overseas grant of the State Key Laboratory of Novel Software Technology under Grant #KFKT2022A03. Yuan Yao (y.yao@nju.edu.cn) and Xiaoxing Ma (xxm@nju.edu.cn) are the corresponding authors.

## References

Daniel Kahneman. Thinking, Fast and Slow. Macmillan, 2011.
Grady Booch, Francesco Fabiano, Lior Horesh, Kiran Kate, Jonathan Lenchner, Nick Linck, Andreas Loreggia, Keerthiram Murgesan, Nicholas Mattei, Francesca Rossi, et al. Thinking fast and slow in AI. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 15042–15046, 2021.

Bengio Yoshua and Marcus Gary. AI debate: The best way forward for AI, 2020. URL https://montrealartificialintelligence.com/aidebate/.

Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. OpenReview, 62, 2022.

Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pages 6545–6554. PMLR, 2019.

Zhun Yang, Adam Ishay, and Joohyung Lee. Learning to solve constraint satisfaction problems with recurrent transformer. In The Eleventh International Conference on Learning Representations, 2023.

Sever Topan, David Rolnick, and Xujie Si. Techniques for symbol grounding with SATNet. Advances in Neural Information Processing Systems, 34:20733–20744, 2021.

Zenan Li, Yuan Yao, Taolue Chen, Jingwei Xu, Chun Cao, Xiaoxing Ma, and Jian Lü. Softened symbol grounding for neuro-symbolic systems. In The Eleventh International Conference on Learning Representations, 2023.

Angelika Kimmig, Stephen Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. A short introduction to probabilistic soft logic. In Proceedings of the NIPS Workshop on Probabilistic Programming: Foundations and Applications, pages 1–4, 2012.

Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In International Conference on Machine Learning, pages 5502–5511. PMLR, 2018.

Niklas Eén and Niklas Sörensson. Translating pseudo-Boolean constraints into SAT. Journal on Satisfiability, Boolean Modeling and Computation, 2(1-4):1–26, 2006.

Olivier Bailleux and Yacine Boufkhad. Efficient CNF encoding of Boolean cardinality constraints. In Principles and Practice of Constraint Programming (CP 2003), pages 108–122. Springer, 2003.

Olivier Bailleux, Yacine Boufkhad, and Olivier Roussel. A translation of pseudo-Boolean constraints to SAT. Journal on Satisfiability, Boolean Modeling and Computation, 2(1-4):191–200, 2006.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust Region Methods. SIAM, 2000.

Tommi Syrjänen. Logic Programs and Cardinality Constraints: Theory and Practice. PhD thesis, Helsinki University of Technology, 2009.

Tommi Syrjänen. Cardinality constraint programs. In Logics in Artificial Intelligence (JELIA 2004), pages 187–199. Springer, 2004.

Samuel Fiorini, Tony Huynh, and Stefan Weltge. Strengthening convex relaxations of 0/1-sets using Boolean formulas. Mathematical Programming, 190:467–482, 2021.

Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, and Luc De Raedt. DeepProbLog: Neural probabilistic logic programming. Advances in Neural Information Processing Systems, 31, 2018.
Qing Li, Siyuan Huang, Yining Hong, Yixin Chen, Ying Nian Wu, and Song-Chun Zhu. Closed-loop neural-symbolic learning via integrating neural perception, grammar parsing, and symbolic reasoning. In International Conference on Machine Learning, pages 5884–5894. PMLR, 2020.

Lieven Vandenberghe and Stephen Boyd. Convex Optimization (Tutorial), volume 1. Cambridge University Press, 2004.

Charles R. Henderson. Best linear unbiased estimation and prediction under a selection model. Biometrics, pages 423–447, 1975.

Åke Björck. Least squares methods. Handbook of Numerical Analysis, 1:465–652, 1990.

Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.

Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems, 30, 2017.

Samuel L. Smith, Benoit Dherin, David G. T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176, 2021.

Alnur Ali, Edgar Dobriban, and Ryan Tibshirani. The implicit regularization of stochastic gradient flow for least squares. In International Conference on Machine Learning, pages 233–244. PMLR, 2020.

Emile van Krieken, Thiviyan Thanapalasingam, Jakub M. Tomczak, Frank van Harmelen, and Annette ten Teije. A-NeSI: A scalable approximate method for probabilistic neurosymbolic inference. arXiv preprint arXiv:2212.12393, 2022.

Pham Dinh Tao and Le Thi Hoai An. Convex analysis approach to DC programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.

Alan L. Yuille and Anand Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.

Thomas Lipp and Stephen Boyd. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17:263–287, 2016.

Le Thi Hoai An and Pham Dinh Tao. DC programming and DCA: Thirty years of developments. Mathematical Programming, 169(1):5–68, 2018.

Dimitri Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.

Pierre Hansen, Brigitte Jaumard, Michèle Ruiz, and Junjie Xiong. Global minimization of indefinite quadratic functions subject to box constraints. Naval Research Logistics (NRL), 40(3):373–392, 1993.

Le Thi Hoai An and Pham Dinh Tao. A continuous approach for globally solving linearly constrained quadratic zero-one programming problems. Optimization, 50(1-2):93–120, 2001.

Meisam Razaviyayn. Successive Convex Approximation: Analysis and Applications. PhD thesis, University of Minnesota, 2014.

Amir Beck and Marc Teboulle. Global optimality conditions for quadratic optimization problems with binary constraints. SIAM Journal on Optimization, 11(1):179–188, 2000.

Yong Xia. New optimality conditions for quadratic optimization problems with binary constraints. Optimization Letters, 3:253–263, 2009.

R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

R. Tyrrell Rockafellar. Advances in convergence and scope of the proximal point algorithm. Journal of Nonlinear and Convex Analysis, 22:2347–2375, 2021.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
Leonardo de Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.

Rasmus Palm, Ulrich Paquet, and Ole Winther. Recurrent relational networks. Advances in Neural Information Processing Systems, 31, 2018.

Stefan Schaal. Learning from demonstration. Advances in Neural Information Processing Systems, 9, 1996.

Jiankai Sun, Hao Sun, Tian Han, and Bolei Zhou. Neuro-symbolic program search for autonomous driving decision module design. In Conference on Robot Learning, pages 21–30. PMLR, 2021.

Junning Huang, Sirui Xie, Jiankai Sun, Qiurui Ma, Chunxiao Liu, Dahua Lin, and Bolei Zhou. Learning a decision module by imitating driver's control behaviors. In Conference on Robot Learning, pages 1–10. PMLR, 2021.

Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. Modeling recurrence for transformer. arXiv preprint arXiv:1904.03092, 2019.

Wang-Zhou Dai, Qiuling Xu, Yang Yu, and Zhi-Hua Zhou. Bridging machine learning and logical reasoning by abductive learning. Advances in Neural Information Processing Systems, 32, 2019.

Domenico Corapi, Alessandra Russo, and Emil Lupu. Inductive logic programming as abductive search. In Technical Communications of the 26th International Conference on Logic Programming. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2010.

Gabriele Ciravegna, Francesco Giannini, Stefano Melacci, Marco Maggini, and Marco Gori. A constraint-based approach to learning and explanation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3658–3665, 2020.

Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. Neural logic machines. In International Conference on Learning Representations, 2019.

Sangho Lim, Eun-Gyeol Oh, and Hongseok Yang. Learning symmetric rules with SATNet. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.

Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19:629–679, 1994.

Ivan Bratko and Stephen Muggleton. Applications of inductive logic programming. Communications of the ACM, 38(11):65–70, 1995.

Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. Advances in Neural Information Processing Systems, 30, 2017.

Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.

Andrew Cropper and Sebastijan Dumančić. Inductive logic programming at 30: A new introduction. Journal of Artificial Intelligence Research, 74:765–850, 2022.
Jean-Luc Marichal and Pierre Mathonet. Symmetric approximations of pseudo-Boolean functions. arXiv preprint arXiv:1004.2593, 2010.

Carla P. Gomes, Willem-Jan van Hoeve, and Lucian Leahu. The power of semidefinite programming relaxations for MaxSAT. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems (CPAIOR 2006), pages 104–118. Springer, 2006.

Peter L. Hammer and Abraham A. Rubin. Some remarks on quadratic programming with 0-1 variables. Revue française d'informatique et de recherche opérationnelle. Série verte, 4(V3):67–79, 1970.

Christoph Buchheim, Alberto Caprara, and Andrea Lodi. An effective branch-and-bound algorithm for convex quadratic integer programming. Mathematical Programming, 135:369–395, 2012.

James E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.

Alexandre d'Aspremont and Stephen Boyd. Relaxations and randomized methods for nonconvex QCQPs. EE392o Class Notes, Stanford University, 1:1–16, 2003.

Po-Wei Wang and J. Zico Kolter. Low-rank semidefinite programming for the MAX2SAT problem. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1641–1649, 2019.

Samuel Burer and Yinyu Ye. Exact semidefinite formulations for a class of (random and non-random) nonconvex quadratic programs. Mathematical Programming, 181(1):1–17, 2020.

Alex L. Wang and Fatma Kılınç-Karzan. On the tightness of SDP relaxations of QCQPs. Mathematical Programming, 193(1):33–73, 2022.

Inês Lynce and Joël Ouaknine. Sudoku as a SAT problem. In AI&M, 2006.

Shu-Cherng Fang and Sarat Puthenpura. Linear Optimization and Extensions: Theory and Algorithms. Prentice-Hall, Inc., 1993.

Mislav Balunovic, Pavol Bielik, and Martin Vechev. Learning to solve SMT formulas. Advances in Neural Information Processing Systems, 31, 2018.

Cristina Cornelio, Jan Stuehmer, Shell Xu Hu, and Timothy Hospedales. Learning where and when to reason in neuro-symbolic inference. In The Eleventh International Conference on Learning Representations, 2023.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning, pages 3067–3075. PMLR, 2017.

## A Proofs of the DC Technique

**Notations.** We define $S := Q^TQ + \tau I$ and $s := Q^Tq_1 + \tau q_2$, and denote the largest eigenvalue and the largest diagonal element of S by $\sigma_{\max}$ and $\delta_{\max}$, respectively. The two problems can then be equivalently rewritten as

$$\text{(P)}\ \min_{u\in\{0,1\}^n} u^TSu - 2s^Tu, \qquad \text{(P}_t\text{)}\ \min_{u\in[0,1]^n} u^T(S - tI)u - (2s - te)^Tu.$$

### A.1 Proof of Proposition 1

*Proof.* The result is primarily based on Bertsekas [2015, Proposition 1.3.4]: the minima of a strictly concave function cannot lie in the relative interior of the feasible set. We first show that if $t_0 \ge \sigma_{\max}$, then the two problems are equivalent [Le Thi and Dinh Tao, 2001, Theorem 1]. Specifically, since $S - tI$ is negative definite, problem $(P_t)$ is strictly concave. Therefore, the minima must be in the vertex set of the feasible domain, which is consistent with problem (P). We can further generalize this result to the case $t_0 \ge \delta_{\max}$ [Hansen et al., 1993, Proposition 1]. In this case, considering the i-th component of u, its second-order derivative in problem $(P_t)$ is $2(S_{ii} - t)$.
Similarly, the strict concavity in $u_i$ ensures a binary solution, indicating the equivalence of problems (P) and $(P_t)$. ∎

### A.2 Proof of Proposition 2

*Proof.* The Karush-Kuhn-Tucker (KKT) conditions of problem $(P_t)$ are as follows:

$$[2Su - 2tu - 2s + te]_i - \alpha_i + \beta_i = 0; \quad u_i \in [0, 1]; \quad \alpha_i \ge 0,\ \beta_i \ge 0; \quad \alpha_i u_i = 0,\ \beta_i(u_i - 1) = 0; \quad i = 1, \ldots, n,$$

where α and β are the multiplier vectors. For $u \in \{0,1\}^n$, the KKT conditions are equivalent to

$$\alpha_i = [2Su - 2tu - 2s + te]_i(1 - u_i) \ge 0, \qquad \beta_i = -[2Su - 2tu - 2s + te]_i u_i \ge 0.$$

Using $(1 - 2u_i) \in \{-1, 1\}$, we can combine the two inequalities and obtain

$$2[Su - s]_i(1 - 2u_i) + t \ge 0, \quad i = 1, \ldots, n.$$

Conversely, if $2[Su - s]_i(1 - 2u_i) + t \ge 0$ holds for each $i = 1, \ldots, n$, it is easy to check that $\alpha \ge 0$ and $\beta \ge 0$, which proves the first part of the proposition.

The second part is a direct consequence of Beck and Teboulle [2000, Theorem 2.4]. To be specific, if u achieves a global minimum of (P), then $q(u) \le q(u')$ for any $u' \in \{0,1\}^n$. Hence, flipping only the i-th value of u, i.e., considering $u_i$ and $u_i' = 1 - u_i$, it holds that

$$u^TSu - 2s^Tu \le (u')^TSu' - 2s^Tu' = (u^TSu - 2s^Tu) + 2[Su - s]_i(1 - 2u_i) + S_{ii}.$$

Rearranging the inequality, we obtain $2[Su - s]_i(1 - 2u_i) \ge -S_{ii}$, $i = 1, \ldots, n$, which completes the proof. ∎

## B Proof of Theorem 1

*Proof.* **Notations.** We use ‖·‖ to denote the ℓ₂ norm for vectors and the Frobenius norm for matrices. We define

$$\varphi(\phi, \theta, Z, Y) := \|Zw_u + Yw_v - b\|^2 + \alpha\|(Z, Y) - (f_\theta(X), Y)\|^2 + \lambda\|w - w_0\|^2.$$

For the loss functions of logic programming and network training, we assume $\ell_1(\theta)$ and $\ell_2(\phi)$ to be $\mu_\theta$- and $\mu_\phi$-smooth, respectively. For ease of presentation, we define $\Delta^k = f_{\theta^k}(X)w_u^k + Yw_v^k - b$, and let $c_{\max}$ be an upper bound of $\|\Delta^k\|$. Furthermore, by the Woodbury identity, we can compute

$$(Z^k; Y^k) = \arg\min_{(Z,Y)} \|Zw_u^k + Yw_v^k - b\|^2 + \alpha\|(Z, Y) - (f_{\theta^k}(X), Y)\|^2 = (f_{\theta^k}(X); Y) - \beta^k \Delta^k (w^k)^T, \quad \text{where } \beta^k = \frac{1}{\alpha + \|w^k\|^2}.$$

Let $\rho^k := \alpha\beta^k$; then

$$\varphi(\phi^k, \theta^k, f_{\theta^k}(X), Y) - \varphi(\phi^k, \theta^k, Z^k, Y^k) = \big(1 - ((\alpha\beta^k)^2 + (1 - \alpha\beta^k)^2)\big)\|\Delta^k\|^2 = 2\rho^k(1 - \rho^k)\|\Delta^k\|^2.$$

**Update of ϕ.** We consider the single-rule case (multiple rules can be directly decomposed), i.e., $\phi = (w, b)$ and $\mathbf{b} = (b; \ldots; b)$. The update of ϕ is conducted on the loss function

$$\ell_2^k(w, b) = \varphi(\phi^k, \theta^k, f_{\theta^k}(X), Y) = \|f_{\theta^k}(X)w_u + Yw_v - b\|^2 + \lambda\|w - w_0\|^2.$$

The smallest and largest eigenvalues of $(f_{\theta^k}(X), Y)^T(f_{\theta^k}(X), Y) + \lambda I$ are denoted by $\sigma_{\min}$ and $\sigma_{\max}$, respectively. The PPA method updates w by

$$w^{k+1} = \arg\min_w\ \ell_2^k(w, b) + \frac{1}{\gamma}\|w - w^k\|^2,$$

which can be reduced to

$$w^{k+1} - w^k = -M^k\delta^k, \quad \delta^k = \nabla_w \ell_2^k(w^k, b), \quad M^k = \Big((f_{\theta^k}(X), Y)^T(f_{\theta^k}(X), Y) + \lambda I + \frac{1}{\gamma}I\Big)^{-1}.$$

The (2/γ)-strong convexity of the proximal term implies the Polyak-Łojasiewicz (PL) inequality, which yields

$$\varphi(\phi^k, \theta^k, Z^k, Y^k) = \ell_2^k(w^k, b) - 2\rho^k(1 - \rho^k)\|\Delta^k\|^2 \ge \ell_2^k(w^{k+1}, b) - 2\rho^k(1 - \rho^k)\|\Delta^k\|^2 + \frac{2}{\gamma}\|w^{k+1} - w^k\|^2.$$

Plugging $w^{k+1} - w^k = -M^k\delta^k$ into the inequality, we have

$$\varphi(\phi^k, \theta^k, Z^k, Y^k) \ge \ell_2^k(w^{k+1}, b) + \frac{2}{\gamma}(\delta^k)^T(M^k)^2\delta^k - 2\rho^k(1 - \rho^k)\|\Delta^k\|^2 \ge \varphi(\phi^{k+1}, \theta^k, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}}) + \frac{2}{\gamma}(\delta^k)^T(M^k)^2\delta^k - 2\rho^k(1 - \rho^k)\|\Delta^k\|^2,$$

where

$$(Z^{k+\frac{1}{2}}; Y^{k+\frac{1}{2}}) = \arg\min_{(\tilde{Z}, \tilde{Y})} \|\tilde{Z}w_u^{k+1} + \tilde{Y}w_v^{k+1} - b\|^2 + \alpha\|(\tilde{Z}, \tilde{Y}) - (f_{\theta^k}(X), Y)\|^2 = (f_{\theta^k}(X); Y) - \beta^{k+\frac{1}{2}}\Delta^{k+\frac{1}{2}}(w^{k+1})^T, \quad \beta^{k+\frac{1}{2}} = \frac{1}{\alpha + \|w^{k+1}\|^2}.$$

Note that $(M^k)^2$ has smallest eigenvalue $\gamma^2/(1 + \gamma\sigma_{\max})^2$, and thus

$$\varphi(\phi^k, \theta^k, Z^k, Y^k) \ge \varphi(\phi^{k+1}, \theta^k, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}}) + \frac{2\gamma}{(1 + \gamma\sigma_{\max})^2}\|\nabla_\phi\ell_2(w, b)\|^2 - 2\rho^k(1 - \rho^k)c_{\max}.$$

**Update of θ.** The update of θ is conducted on the loss function $\ell_1^k(\theta) = \|Z^{k+\frac{1}{2}} - f_\theta(X)\|^2$. By the $\mu_\theta$-smoothness of $\ell_1^k$, we obtain

$$\varphi(\phi^{k+1}, \theta^k, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}}) - \varphi(\phi^{k+1}, \theta^{k+1}, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}}) = \ell_1^k(\theta^k) - \ell_1^k(\theta^{k+1}) \ge \langle\nabla_\theta\ell_1^k(\theta^k),\ \theta^k - \theta^{k+1}\rangle - \frac{\mu_\theta}{2}\|\theta^{k+1} - \theta^k\|^2 \ge \frac{\eta}{2}\|\nabla_\theta\ell_1^k(\theta^k)\|^2.$$

Letting $Z^{k+1} = \arg\min_Z \varphi(\phi^{k+1}, \theta^{k+1}, Z, Y)$, we conclude

$$\varphi(\phi^{k+1}, \theta^k, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}}) \ge \varphi(\phi^{k+1}, \theta^{k+1}, Z^{k+1}, Y^{k+1}) + \frac{\eta}{2}\|\nabla_\theta\ell_1^k(\theta^k)\|^2.$$
**Convergence result.** Combining the updates of ϕ and θ, we have

$$\varphi(\phi^k, \theta^k, Z^k, Y^k) \ge \varphi(\phi^{k+1}, \theta^{k+1}, Z^{k+1}, Y^{k+1}) + \frac{\eta}{2}\|\nabla_\theta\ell_1^k(\theta^k)\|^2 + \frac{2\gamma}{(1 + \gamma\sigma_{\max})^2}\|\nabla_\phi\ell_2(\phi^k)\|^2 - 2\rho^k(1 - \rho^k)c_{\max}.$$

Taking a telescoping sum over k, we obtain

$$\varphi(\phi^0, \theta^0, Z^0, Y^0) - \varphi(\phi^K, \theta^K, Z^K, Y^K) \ge \sum_{k=0}^{K}\Big(\frac{\eta}{2}\|\nabla_\theta\ell_1^k(\theta^k)\|^2 + \frac{2\gamma}{(1 + \gamma\sigma_{\max})^2}\|\nabla_\phi\ell_2(\phi^k)\|^2 - 2\rho^k(1 - \rho^k)c_{\max}\Big).$$

Since $\rho^k(1 - \rho^k) \le \kappa_\rho/(K + 1)^2$, we have

$$\mathbb{E}\big[\|\nabla_\theta\ell_1^k(\theta^k)\|^2\big] \le \frac{2}{(K+1)\eta}\Big(\varphi(\phi^0, \theta^0, Z^0, Y^0) - \min\varphi + 2\kappa_\rho c_{\max}\Big),$$
$$\mathbb{E}\big[\|\nabla_\phi\ell_2(\phi^k)\|^2\big] \le \frac{(1 + \gamma\sigma_{\max})^2}{2(K+1)\gamma}\Big(\varphi(\phi^0, \theta^0, Z^0, Y^0) - \min\varphi + 2\kappa_\rho c_{\max}\Big).$$

**Stochastic version.** We first introduce an additional assumption: the gradient estimate is unbiased with bounded variance [Bottou et al., 2018, Sec. 4], i.e.,

$$\mathbb{E}_\xi[\nabla_\theta\tilde{\ell}_1^k(\theta^k)] = \nabla_\theta\ell_1^k(\theta^k), \qquad \mathbb{E}_\xi[\nabla_\phi\tilde{\ell}_2^k(\phi^k)] = \nabla_\phi\ell_2^k(\phi^k),$$

and

$$\mathbb{V}_\xi[\nabla_\theta\tilde{\ell}_1^k(\theta^k)] \le \zeta + \zeta_v\|\nabla_\theta\ell_1^k(\theta^k)\|^2, \qquad \mathbb{V}_\xi[\nabla_\phi\tilde{\ell}_2^k(\phi^k)] \le \zeta + \zeta_v\|\nabla_\phi\ell_2^k(\phi^k)\|^2.$$

This assumption implies that the following inequalities hold with $\zeta_g = \zeta_v + 1$:

$$\mathbb{E}_\xi[\|\nabla_\theta\tilde{\ell}_1^k(\theta^k)\|^2] \le \zeta + \zeta_g\|\nabla_\theta\ell_1^k(\theta^k)\|^2, \qquad \mathbb{E}_\xi[\|\nabla_\phi\tilde{\ell}_2^k(\phi^k)\|^2] \le \zeta + \zeta_g\|\nabla_\phi\ell_2^k(\phi^k)\|^2.$$

For the update of θ, we have

$$\varphi(\phi^{k+1}, \theta^k, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}}) - \mathbb{E}_\xi[\varphi(\phi^{k+1}, \theta^{k+1}, Z^{k+1}, Y^{k+1})] \ge \frac{\eta_k}{2}\|\nabla_\theta\ell_1^k(\theta^k)\|^2 - \frac{\eta_k^2\mu_\theta}{2}\zeta.$$

For the update of ϕ, using the $\mu_\phi$-smoothness and taking the total expectation:

$$\varphi(\phi^k, \theta^k, Z^k, Y^k) - \mathbb{E}_\xi[\varphi(\phi^{k+1}, \theta^k, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}})] + 2\rho^k(1 - \rho^k)\|\Delta^k\|^2 \ge (\nabla_\phi\ell_2^k(\phi^k))^T M^k (\nabla_\phi\ell_2^k(\phi^k)) - \frac{\mu_\phi}{2}\mathbb{E}_\xi[\|M^k\nabla_\phi\tilde{\ell}_2^k(\phi^k)\|^2] \ge \frac{1}{\epsilon_k + \sigma_{\max}}\|\nabla_\phi\ell_2^k(\phi^k)\|^2 - \frac{\mu_\phi}{2(\epsilon_k + \sigma_{\min})^2}\big(\zeta + \zeta_g\|\nabla_\phi\ell_2^k(\phi^k)\|^2\big),$$

where we define $\epsilon_k = 1/\gamma_k$ for simplicity. Now, letting γ be sufficiently small (that is, satisfying $(\epsilon_k + \sigma_{\min})^2 \ge \mu_\phi\zeta_g(\epsilon_k + \sigma_{\max})$), we obtain

$$\varphi(\phi^k, \theta^k, Z^k, Y^k) - \mathbb{E}_\xi[\varphi(\phi^{k+1}, \theta^k, Z^{k+\frac{1}{2}}, Y^{k+\frac{1}{2}})] + 2\rho^k(1 - \rho^k)\|\Delta^k\|^2 \ge \frac{1}{2(\epsilon_k + \sigma_{\max})}\|\nabla_\phi\ell_2^k(\phi^k)\|^2 - \frac{\mu_\phi}{2(\epsilon_k + \sigma_{\min})^2}\zeta.$$

Putting the updates of θ and ϕ together, we have

$$\varphi(\phi^k, \theta^k, Z^k, Y^k) - \mathbb{E}_\xi[\varphi(\phi^{k+1}, \theta^{k+1}, Z^{k+1}, Y^{k+1})] + 2\rho^k(1 - \rho^k)\|\Delta^k\|^2 \ge \frac{\eta_k}{2}\|\nabla_\theta\ell_1^k(\theta^k)\|^2 - \frac{\eta_k^2\mu_\theta}{2}\zeta + \frac{1}{2(\epsilon_k + \sigma_{\max})}\|\nabla_\phi\ell_2^k(\phi^k)\|^2 - \frac{\mu_\phi}{2(\epsilon_k + \sigma_{\min})^2}\zeta.$$

Now, setting $\eta_k \propto \kappa_\theta/\sqrt{K+1}$ and $\gamma_k \propto \kappa_\phi/\sqrt{K+1}$, we conclude

$$\mathbb{E}[\|\nabla_\theta\ell_1^k(\theta^k)\|^2] = O\Big(\frac{1}{\sqrt{K+1}}\Big), \qquad \mathbb{E}[\|\nabla_\phi\ell_2(\phi^k)\|^2] = O\Big(\frac{1}{\sqrt{K+1}}\Big). \ \blacksquare$$

## C Proof of Theorem 2

*Proof.* We consider the following problem:

$$(P_\xi)\ \min_{u\in\{0,1\}^n} q_\xi(u) := u^T(S + \lambda I)u - 2(s + \lambda\xi)^Tu.$$

For given $t \ge 0$, the corresponding stationary points of $(P_\xi)$ satisfy

$$2[Su - s]_i(1 - 2u_i) + 2\lambda(u_i - \xi_i)(1 - 2u_i) + t \ge 0, \quad i = 1, \ldots, n,$$

where

$$(u_i - \xi_i)(1 - 2u_i) = \begin{cases} -\xi_i & \text{if } u_i = 0;\\ \xi_i - 1 & \text{if } u_i = 1. \end{cases}$$

For given $u \in \{0,1\}^n$, denote $\varrho_i = 2[Su - s]_i(1 - 2u_i)$. Then the probability that $(P_\xi)$ has the stationary point u can be computed as

$$\prod_{i=1}^{n}\Pr\big(\varrho_i + 2\lambda(u_i - \xi_i)(1 - 2u_i) + t \ge 0\big), \quad \text{where } \Pr\big(2\lambda(u_i - \xi_i)(1 - 2u_i) + \varrho_i + t \ge 0\big) = \min\Big(\frac{1}{2\lambda}(t + \varrho_i),\ 1\Big).$$

Hence, for two different initial points $u_1^{(0)}$ and $u_2^{(0)}$, the probability that the corresponding rules converge to the same result u satisfies

$$\Pr(u_1 = u,\ u_2 = u) \le \Pr(u)^2 = \prod_{i=1}^{n}\min\Big(\frac{1}{2\lambda}(t + \varrho_i),\ 1\Big)^2. \ \blacksquare$$

## D Trust Region Method

Figure 3 illustrates the key concept of the trust region method. For simplicity, the centre points $w_1^{(0)}, \ldots, w_4^{(0)}$ of the trust regions are also set as the initial points of stochastic gradient descent. Stochastic gradient descent is implicitly biased towards least-norm solutions and, with the Boolean constraints enforced, finally converges to the point (0, 1). The trust region penalty instead encourages stochastic gradient descent to converge to different optimal solutions in different trust regions.

## E Experiment Details

**Computing configuration.** We implemented our approach with the PyTorch DL framework. The experiments were conducted on a GPU server with two Intel Xeon Gold 5118 CPUs @ 2.30GHz, 400GB RAM, and 9 GeForce RTX 2080 Ti GPUs. The server ran Ubuntu 16.04 with GNU/Linux kernel 4.4.0.

**Hyperparameter tuning.** Our framework introduces several hyperparameters. In Table 3 we summarize the (hyper-)parameters, together with their corresponding initialization or update strategies.
E Experiment Details

Computing configuration. We implemented our approach with the PyTorch DL framework. The experiments were conducted on a GPU server with two Intel Xeon Gold 5118 CPUs @ 2.30GHz, 400GB RAM, and 9 GeForce RTX 2080 Ti GPUs. The server ran Ubuntu 16.04 with GNU/Linux kernel 4.4.0.

Hyperparameter tuning. Our framework introduces several hyperparameters. Table 3 summarizes the (hyper-)parameters, together with their corresponding initialization or update strategies. Most of these hyperparameters are quite stable and thus only need to be fixed to a constant or set by standard strategies. We therefore only discuss the selection of m and the setting of bmin, bmax, and b.

Figure 3: Avoiding degeneracy by the trust region method. In logical constraint learning, the imposition of the Boolean constraints and the implicit bias of stochastic gradient descent cause $w_1, \dots, w_4$ to converge to the same result (left figure), while the trust region constraints guarantee that they sufficiently indicate different rules (right figure).

(1) To ensure the sufficiency of the learned constraints, we suggest initially setting a large m to estimate the actual number of logical constraints needed, and then adjusting it for more efficient training. We also observe that a large m does not ruin the performance of our method; for example, we set m = 2000 in the visual Sudoku solving task, while only 324 constraints are learned.

(2) For the bias term b, we recommend tuning b manually rather than setting it by the PPA update: one can gradually increase b from 1 to n − 1 (where n is the number of involved logical variables) and collect all learned logical constraints as candidates. For bmin and bmax, due to the prediction error, it is unreasonable to set bmin and bmax such that all examples satisfy the logical constraint. An alternative is to set a threshold (e.g., k%) on the training (or validation) set, so that each constraint is only required to be satisfied by at least k% of the examples.

Table 3: The list of (hyper-)parameters and their initialization or update strategies.

Param. | Description | Setting
θ | Neural network parameters | Updated by stochastic gradient descent
W | Matrix of logical constraints | Updated by stochastic PPA
b | Bias term of logical constraints | Pre-set or updated by stochastic PPA
bmin/bmax | Lower/upper bounds of logical constraints | Estimated from the training set
m | Pre-set number of constraints | Adaptively tuned
α | Trade-off weight in symbol grounding | Fixed to α = 0.5
λ | Weight of trust region penalty | Fixed to λ = 0.1
t1/t2 | Weights of DC penalty | Increased per epoch
η | Learning rate of network training | Adam schedule
γ | Step size of constraint learning | Adaptively set (γ = 0.001 by default)

F Additional Experiment Results

F.1 Chained XOR

The chained XOR, also known as the parity function, is a basic logical function, yet it has proven challenging for neural networks to learn explicitly [Shalev-Shwartz et al., 2017, Wang et al., 2019]. Specifically, given a sequence of length L, the parity function outputs 1 if the sequence contains an odd number of 1s, and 0 otherwise. The goal of the Chained XOR task is to learn this parity function for a fixed L. Note that this task does not involve any perception subtask.
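For concreteness, a minimal data generator for this task might look as follows (an illustrative sketch; the function name and sizes are our own, not from the paper):

```python
import numpy as np

def make_parity_dataset(num_examples: int, seq_len: int, seed: int = 0):
    """Sample binary sequences x and parity labels y = (sum of bits) mod 2."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(num_examples, seq_len))
    y = x.sum(axis=1) % 2          # 1 iff the sequence contains an odd number of 1s
    return x, y

x, y = make_parity_dataset(10_000, seq_len=20)
```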
We compare our method with SATNet and L1R32H4. In this task, SATNet uses implicit but strong background knowledge, namely that the task can be decomposed into L single XOR tasks; neither L1R32H4 nor our method uses such knowledge. For L1R32H4, we adapt its embedding layer to this task and keep all other configurations fixed.

Regarding our method, we introduce L − 1 auxiliary variables.7 It is worth noting that these auxiliary variables essentially serve as a form of symbol grounding. Concretely, the logical constraints learned by our method can be formulated as
$$w_1 x_1 + \dots + w_L x_L + w_{L+1} z_1 + \dots + w_{2L-1} z_{L-1} = b,$$
where $w_i \in \mathbb{B}$ for $i = 1, \dots, 2L-1$, $x_i \in \mathbb{B}$ for $i = 1, \dots, L$, and $z_i \in \mathbb{B}$ for $i = 1, \dots, L-1$. The auxiliary variables $z_i$ take different truth assignments for different examples, indicating how the logical constraint is satisfied by the given input. Combining the symbol grounding of the auxiliary variables, we revise the optimization problem (1) of our framework as
$$\min_{(W, b)} \ \mathbb{E}_{(x,y)\sim D}\big[\|W(x; z; y) - b\|^2\big] + \lambda \|W - W^{(0)}\|^2, \quad \text{s.t.} \ z = \arg\min_{z \in \mathcal{Z}} \mathbb{E}_{(x,y)\sim D}\big[\|W(x; z; y) - b\|^2\big], \ W \in \mathbb{B}^{m \times (u+v)}, \ b \in \mathbb{N}_+^m.$$
Here, the symbol grounding is guided solely by the logical constraints, since neural perception is not involved.

7 Note that the number of auxiliary variables should not exceed the number of logical variables; otherwise, the logical constraints trivially converge to an arbitrary result.

The experimental results are plotted in Figure 4. They show that L1R32H4 is unable to learn such a simple reasoning pattern, while SATNet often fails to converge even with sufficient iterations, leading to unstable results. Our method consistently delivers full accuracy across all settings, demonstrating superior performance and better scalability than the existing state-of-the-art methods.

Figure 4: Results (%) of the chained XOR task, including accuracy and F1 score (of class 0), with sequence lengths ranging from 20 to 200 (panel (b): L1R32H4); our method stably outperforms the competitors.

To further exemplify the efficacy of our method, we examine the learned constraints for the task with L = 20. After eliminating redundant constraints and replacing the auxiliary variables with logical disjunctions, the final learned constraint can be expressed as
$$(x_1 + \dots + x_{20} + y = 0) \lor (x_1 + \dots + x_{20} + y = 2) \lor \dots \lor (x_1 + \dots + x_{20} + y = 20),$$
which holds exactly when $x_1 + \dots + x_{20} + y$ is even, i.e., when $y$ is the parity of the sequence; our method thus concludes with complete and precise logical constraints.

F.2 Nonograms

Nonograms is a logic puzzle with simple rules but challenging solutions. Given a grid of squares, the task is to plot a binary image, i.e., to fill each square in black or mark it with an X. The required numbers of black squares in each row (resp. column) are given beside (resp. above) that row (resp. column) of the grid. Figure 5 gives a simple example.

Figure 5: An example of nonograms, showing an empty board and its solved counterpart.

In contrast to the supervised setting used in Yang et al. [2023], we evaluate our method in a weakly supervised setting: instead of the fully solved board, only partial solutions (i.e., a single row or column) are observed. Note that this supervision suffices to solve nonograms, because the only logical rule to be learned is that distinct blocks of black squares (in each row or column) must not be connected. For our method, we do not introduce a neural network in this task and only aim to learn the logical constraints. We carry out the experiments on 7×7 nonograms, with training data sizes ranging from 1,000 to 9,000. The results are given in Table 4, showing the efficacy of our logical constraint learning. Compared to L1R32H4, whose effectiveness depends heavily on the training data size, our method works well even with extremely limited data.

Table 4: Accuracy (%) of the nonograms task.

Data Size | L1R32H4 | Ours
1000 | 14.4 | 100.0
5000 | 62.0 | 100.0
9000 | 81.2 | 100.0
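As an illustration of the row/column rule just described, the following hypothetical checker (our own helper, not the paper's code) verifies a single row against its clue by comparing the lengths of its maximal black runs:

```python
from itertools import groupby

def row_satisfies_clue(row: list[int], clue: list[int]) -> bool:
    """True iff the maximal runs of black cells (1s) in `row` match `clue` in order.

    Requiring separated runs encodes the rule that distinct black blocks
    must not be connected.
    """
    runs = [sum(group) for value, group in groupby(row) if value == 1]
    return runs == [c for c in clue if c > 0]   # a clue of [0] denotes an empty row

assert row_satisfies_clue([1, 1, 0, 1, 0, 0, 0], [2, 1])
assert not row_satisfies_clue([1, 1, 1, 0, 0, 0, 0], [2, 1])  # blocks merged
```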
F.3 Visual Sudoku Solving

In the visual Sudoku task, it is worth noting that the computation of z cannot be conducted by batch processing, because the indices of the observed entries of y vary across data points: in different Sudoku games, the cells to be filled are different, so the symbol z would have to be computed in a point-wise way. To resolve this issue, we introduce an auxiliary $\tilde y$ to approximate the output symbol y:
$$(\tilde z, \tilde y) = \arg\min_{\tilde z \in \mathcal{Z},\ \tilde y \in \mathcal{Y}} \ \|W(\tilde z; \tilde y) - b\|^2 + \alpha \|(\tilde z; \tilde y) - (f_\theta(x); y)\|^2.$$

On the SATNet dataset, we use the recurrent transformer as the perception model [Yang et al., 2023], since we observe that it significantly improves the perception accuracy and even outperforms state-of-the-art MNIST digit recognition models. However, its performance degrades on the more difficult RRN dataset, so we use a standard convolutional neural network as the perception model there.

We include detailed results of board and cell accuracy in Table 5. Our method is consistently superior to the existing methods and significantly outperforms the current state-of-the-art method L1R32H4 on the RRN dataset (the total board accuracy improvement exceeds 20%). Also note that the solving accuracy of our method is always the best, illustrating the efficacy of our logical constraint learning.

Table 5: Detailed board and cell accuracy (%) of the original visual Sudoku task.

Board accuracy:
Method | SATNet dataset: Perception / Solving / Total | RRN dataset: Perception / Solving / Total
RRN | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet* | 72.7 / 75.9 / 67.3 | 75.7 / 0.1 / 0.1
L1R32H4 | 94.1 / 91.0 / 90.5 | 87.7 / 65.8 / 65.7
NTR | 87.4 / 0.0 / 0.0 | 91.4 / 3.9 / 3.9
NDC | 79.9 / 0.0 / 0.0 | 88.0 / 0.0 / 0.01
Ours | 95.5 / 95.9 / 95.5 | 93.1 / 94.4 / 93.1

Cell accuracy:
Method | SATNet dataset: Perception / Solving / Total | RRN dataset: Perception / Solving / Total
RRN | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet* | 99.1 / 98.6 / 98.8 | 75.7 / 59.7 / 72.0
L1R32H4 | 99.8 / 99.1 / 99.4 | 99.3 / 89.5 / 92.6
NTR | 99.7 / 60.1 / 77.8 | 99.7 / 38.5 / 57.3
NDC | 99.4 / 10.8 / 50.4 | 99.5 / 10.9 / 38.7
Ours | 99.9 / 99.6 / 99.7 | 99.7 / 98.3 / 98.7

Next, we exchange the evaluation datasets, i.e., we use the RRN dataset to evaluate the model trained on the SATNet dataset, and vice versa. The results are presented in Table 6. The accurate logical constraints and the exact logical reasoning engine guarantee the best performance of our method on these transfer tasks. Notably, while the performance of L1R32H4 drops significantly when transferring the model trained on the SATNet dataset to the RRN dataset, our method remains unaffected by the shift.

Table 6: Detailed board and cell accuracy (%) of the transfer visual Sudoku task.

Board accuracy:
Method | SATNet→RRN: Perception / Solving / Total | RRN→SATNet: Perception / Solving / Total
RRN | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet* | 80.8 / 1.4 / 1.4 | 0.0 / 0.0 / 0.0
L1R32H4 | 84.8 / 21.3 / 21.3 | 94.9 / 95.0 / 94.5
NTR | 90.2 / 0.0 / 0.0 | 86.9 / 0.0 / 0.0
NDC | 86.1 / 0.0 / 0.0 | 82.4 / 0.0 / 0.0
Ours | 93.9 / 95.2 / 93.9 | 95.2 / 95.3 / 95.2

Cell accuracy:
Method | SATNet→RRN: Perception / Solving / Total | RRN→SATNet: Perception / Solving / Total
RRN | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0
SATNet* | 99.1 / 66.2 / 76.5 | 65.8 / 53.8 / 59.2
L1R32H4 | 99.3 / 89.5 / 92.6 | 99.7 / 99.6 / 99.7
NTR | 99.6 / 37.1 / 56.3 | 99.6 / 62.4 / 79.0
NDC | 99.4 / 11.0 / 38.7 | 99.5 / 11.3 / 50.7
Ours | 99.8 / 98.4 / 98.8 | 99.8 / 99.7 / 99.7
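Since the joint grounding step above is a strongly convex quadratic in $(\tilde z, \tilde y)$, it admits a closed-form solution before the symbols are projected back to discrete values. The following sketch uses illustrative shapes and random data rather than the paper's actual code (the 324 rules merely mirror the number of constraints learned in this task), and computes the batched solution with a Woodbury-style solve:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 324, 81 * 9 * 2             # hypothetical: #rules x dimension of (z; y), one-hot
W = rng.integers(0, 2, size=(m, d)).astype(float)   # stand-in for learned Boolean rules
b = W.sum(axis=1) // 2                              # placeholder biases
v0 = rng.random(d)                 # current prediction (f_theta(x); y), flattened
alpha = 0.5

# Minimize ||W v - b||^2 + alpha ||v - v0||^2 over v = (z~; y~). The normal
# equations give v = (W^T W + alpha I)^{-1} (W^T b + alpha v0); the Woodbury
# identity turns this into an m x m solve, cheaper here since m < d.
v = v0 + W.T @ np.linalg.solve(W @ W.T + alpha * np.eye(m), b - W @ v0)

# Sanity check of the first-order condition; in practice a discrete projection
# (rounding z~ and y~ back to symbols) would follow.
assert np.allclose((W.T @ W + alpha * np.eye(d)) @ v, W.T @ b + alpha * v0)
```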
F.4 Self-Driving Path Planning

The goal of the self-driving path planning task is to train the neural network for obstacle detection and to learn the logical constraints for path planning in an end-to-end way. As shown in Figure 6, we construct two maps, each containing 10×10 grid cells (binary variables). The neural perception detects the obstacles from the image x and locates them on the first map, which essentially constitutes the symbol z. Next, the logical reasoning computes the final path from the symbol z and tags it on the second map as the output y.

Figure 6: A neuro-symbolic system in self-driving tasks (input scene, neural network, latent symbol, symbolic reasoning, output path). The neural perception detects the obstacles from the image collected by the camera; the symbolic reasoning plans the driving path based on the obstacle map. The neuro-symbolic learning task is to build these two modules in an end-to-end way.

As a detailed reference, we select some path planning results generated by different methods and plot them in Figure 7.

We find that some correct properties are learned by our method. For example, for a point $y_{34}$ in the path, we have the following connectivity constraint:
$$(y_{34} = s) + (y_{34} = e) + \mathrm{Adj}(y_{34}) = 2,$$
which states that, unless $y_{34}$ is the start point $s$ or the end point $e$, the path point $y_{34}$ must be connected to exactly two of its adjacent points. In addition, some distinctive constraints are also learned, for example,
$$y_{32} + z_{32} + z_{11} + z_{01} = 1.$$
In this constraint, $z_{11}$ and $z_{01}$ are two noise points that always take the value 0; the constraint therefore ensures that if $z_{32}$ is an obstacle, then $y_{32}$ is not selected as a path point. However, it remains unknown whether our neuro-symbolic framework derives all the expected results, because some of the learned constraints are too complex to be interpreted.

Figure 7: Some results of neuro-symbolic learning methods in the self-driving path planning task: (a) input scene (with labels); (b) ResNet method; (c) RT method.
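To make the learned connectivity property concrete, a small checker along the following lines (our own illustrative helper, not the paper's code) can validate a predicted 10×10 path grid:

```python
import numpy as np

def connectivity_ok(y: np.ndarray, start: tuple, end: tuple) -> bool:
    """Check the learned rule (y_ij = s) + (y_ij = e) + Adj(y_ij) = 2 on a path grid.

    `y` is a 10x10 0/1 array of path cells; `start`/`end` are (row, col) pairs.
    Every path cell must have exactly two 'connections', where being the start
    or the end point counts as one connection each.
    """
    n = y.shape[0]
    for i in range(n):
        for j in range(n):
            if y[i, j] != 1:
                continue
            adj = sum(
                y[i + di, j + dj]
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= i + di < n and 0 <= j + dj < n
            )
            if ((i, j) == start) + ((i, j) == end) + adj != 2:
                return False
    return True

# A straight path from (0, 0) to (0, 3) satisfies the rule:
y = np.zeros((10, 10), dtype=int)
y[0, :4] = 1
assert connectivity_ok(y, start=(0, 0), end=(0, 3))
```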