# Deep Differentiable Logic Gate Networks

Felix Petersen (Stanford University, University of Konstanz) mail@felix-petersen.de
Christian Borgelt (University of Salzburg) christian@borgelt.net
Hilde Kuehne (University of Frankfurt, MIT-IBM Watson AI Lab) kuehne@uni-frankfurt.de
Oliver Deussen (University of Konstanz) oliver.deussen@uni.kn

Abstract

Recently, research has increasingly focused on developing efficient neural network architectures. In this work, we explore logic gate networks for machine learning tasks by learning combinations of logic gates. These networks comprise logic gates such as "AND" and "XOR", which allow for very fast execution. The difficulty in learning logic gate networks is that they are conventionally non-differentiable and therefore do not allow training with gradient descent. Thus, to allow for effective training, we propose differentiable logic gate networks, an architecture that combines real-valued logics and a continuously parameterized relaxation of the network. The resulting discretized logic gate networks achieve fast inference speeds, e.g., beyond a million images of MNIST per second on a single CPU core.

1 Introduction

With the success of neural networks, there has also always been strong interest in research and industry in making the respective computations as fast and efficient as possible, especially at inference time. Various techniques have been proposed to solve this problem, including reduced computational precision [1], [2], binary [3], and sparse [4] neural nets. In this work, we want to train a different kind of architecture, which is well known in the domain of computer architectures: logic (gate) networks. The problem in training networks of discrete components like logic gates is that they are non-differentiable and therefore, conventionally, cannot be optimized via standard methods such as gradient descent [5]. One approach for this would be gradient-free optimization methods such as evolutionary training [6], [7], which works for small models but becomes infeasible for larger ones.

In this work, we propose an approach for gradient-based training of logic gate networks (aka. arithmetic / algebraic circuits [8], [9]). Logic gate networks are based on binary logic gates, such as "and" and "xor" (see Table 1). For training logic gate networks, we continuously relax them to differentiable logic gate networks, which allows efficiently training them with gradient descent. For this, we use real-valued logic and learn which logic gate to use at each neuron. Specifically, for each neuron, we learn a probability distribution over logic gates. After training, the resulting network is discretized to a (hard) logic gate network by choosing the logic gate with the highest probability. As the (hard) logic gate network comprises logic gates only, it can be executed very fast. Additionally, as the logic gates are binary, every neuron / logic gate has only 2 inputs, and the networks are extremely sparse.

Logic gate networks are not binary neural networks: binary neural networks are a form of low-precision (wrt. weights and/or activations) neural networks, as they reduce weights and/or activations to binary precision. Binary neural networks are usually dense and typically rely on weights trained in the continuous domain and are discretized afterwards.
Figure 1: Overview of the proposed differentiable logic gate networks: the pixels of the image are converted into Boolean-valued inputs, which are processed by a layer of neurons such that each neuron receives two inputs. The connectivity of neurons remains fixed after an initial pseudo-random initialization. Each neuron is continuously parameterized by a distribution over logical operators. During training, this distribution is learned for each neuron, and, during inference, the most likely operator is used for each neuron. There are multiple outputs per class, which are aggregated by bit-counting, which yields the class scores. The number of neurons in the visualization is greatly reduced for visual simplicity.

In contrast to binary neural networks, logic gate networks do not have weights, are intrinsically sparse as they have only 2 inputs to each neuron, and are not a form of low-precision (wrt. weights and/or activations) neural networks. They also differ from current sparse neural network approaches, as our goal is to learn which logic gate operators are present at each neuron, while the (weightless) connections between neurons are (pseudo-)randomly initialized and remain fixed.

The network is, thus, parameterized by the choice of the logic gate operator / binary function for each neuron. As there is a total of 16 functions of signature f : {0, 1} × {0, 1} → {0, 1}, the information which operation a neuron executes can be encoded in just 4 bits. The objective is to learn which of those 16 operations is optimal for each neuron. Specifically, for each neuron, we learn a probability distribution over possible logic gates, which we parameterize via softmax. We find that this approach allows learning logic gate networks very effectively via gradient descent.

Logic gate networks allow for very fast classification, with speeds beyond a million images per second on a single CPU core (for MNIST at > 97.5% accuracy). The computational cost of a layer with n neurons is Θ(n) with very small constants (as only logic gates of Booleans are required), while, in comparison, a fully connected layer (with m input neurons) requires Θ(n · m) computations with significantly larger constants (as it requires floating-point arithmetic). While the training can be more expensive than for regular neural networks (however, just by a constant and asymptotically less expensive), to our knowledge, the proposed method is the fastest available architecture at inference time. Overall, our method accelerates inference speed (in comparison to fully connected ReLU neural networks) by around two orders of magnitude. In the experiments, we scale the training of logic gate networks up to 5 million parameters, which can be considered relatively small in comparison to other architectures. In comparison to the fastest neural networks at 98.4% on MNIST, our method is more than 12× faster than the best binary neural networks and 2-3 orders of magnitude faster than the theoretical speed of sparse neural networks.

2 Related Work

In this section, we discuss related work on learning logic gate networks, methods with methodological or conceptual similarity, as well as other machine learning methods that are fast and that we compare ourselves to with respect to inference cost and speed.

Differentiable Logics and Triangular Norms. Differentiable logics (aka.
real-valued logics, or infinite-valued logics) are well known in the fields of fuzzy logics [10] and probabilistic metric spaces [11], [12]. In Supplementary Material D, we give examples for T-norms and T-conorms. An additional reference for differentiable real-valued logics is van Krieken et al. [13].

Learning Logic Gate Networks. Chatterjee [14] explored "memorization", a method for memorizing binary classification data sets with a network of binary lookup tables. The motivation for this is to explore principles of learning and memorization, as well as their trade-off and generalization capabilities. He constructs the networks of lookup tables by counting conditional frequencies of data points. We mention this work here because binary logic gates may be seen as the special case of 2-input lookup tables. That is, his method has some similarities to our resulting networks. However, as he memorizes the data set, while this leads to some generalization, this generalization is limited. In his experiments, he considers the binary classification task of distinguishing the combined classes 0-4 from the combined classes 5-9 of MNIST and achieves a test accuracy of 90%. Brudermueller et al. [15] propose a method where they train a neural network on a classification task and then translate it, first into random forests, and then into networks of AND-Inverter logic gates, i.e., networks based only on "and" and "not" logic gates. They evaluate their approach on the "gastrointestinal bleeding" and "veterans aging cohort study" data sets and argue for the verifiability and interpretability of small logical networks in patient care and clinical decision-making.

Continuous Relaxations. A popular approach for making discrete structures differentiable is continuous relaxation [16], [17]. In this work, we also use continuous relaxations; however, instead of relaxing a fixed discrete structure (e.g., an algorithm) [16]-[23], we continuously relax a discrete structure (a logic gate network) to optimize it.

Relaxed Connectivity in Networks. In this work, we relax which logic operator is applied at each node, while the connections are predefined. Zimmer et al. [24] propose differentiable logic machines for inductive logic programming. For this, they propose logic modules, which contain one level of logic and for which they predefine that the first half of operators are fuzzy "and"s and the second half are fuzzy "or"s. They relax which nodes are the inputs to the "and"s and "or"s of their logic modules. Similarly, Chen [25] proposes Gumbel-Max Equation Learner Networks, where he predefines a set of arithmetic operations in each layer and learns via Gumbel-Softmax [26], [27] which outputs of the previous layer should be used as inputs of a respective arithmetic operation. He uses this to learn symbolic expressions from data. While these works relax which nodes are connected to which nodes, this is fixed in our work, and we relax which operator is at which node.

Evolutionary Learning of Networks. Mocanu et al. [28] propose training neural networks with sparse evolutionary training inspired by network science. Their method evolves an initial sparse topology of two consecutive layers of neurons into a scale-free topology. On MNIST, they achieve (with 89 797 parameters) an accuracy of 98.74%. Gaier et al. [29] propose learning networks of operators such as ReLU, sin, inverse, absolute, step, and tanh using evolutionary strategies. Specifically, they use the population-based neuroevolution algorithm NEAT.
They succeed in learning those floating-point function-based networks and achieve an accuracy of 94.2% on MNIST with a total of 1 849 connections.

Learning of Decision Trees. Zantedeschi et al. [30] propose to learn decision trees by quadratically relaxing the decision trees from mixed-integer programs that learn the discrete parameters of the tree (input traversal and node pruning). This allows them to differentiate in order to simultaneously learn the continuous parameters of splitting decisions. Logic gate-based trees are conceptually vastly different from decision trees: decision trees rely on splitting decisions instead of logical operations, and the tree structures of decision trees and logic gate-based trees point in opposite directions [31]. Logic gate-based trees begin with a number of inputs (leaves) and apply logic gates to aggregate them to a binary value (root). Decision trees begin at the root and apply splitting decisions (for which they consider an external input) to decide between children, such that they end up at a leaf node corresponding to a value.

Binary Neural Networks. Binary neural networks (BNNs) [3] are conceptually very different from logic gate networks. For binary neural networks, "binary" refers to representing activations and weights of a neural network with binary states (e.g., {-1, +1}). This allows approximating the expensive matrix multiplication by faster XNOR and bitcount (popcount) operations. The logical operations involved in BNNs are not learned but instead predefined to approximate floating-point operations and, as such, a regular weight-based neural network. This is not the case for logic gate networks, where we learn the logic operations, do not approximate weight-based neural networks, and do not have weights. While BNNs are defined via their weights and not via their logic operations, logic gate networks do not have weights and are purely defined via their logic operations. We include BNNs as baselines in our experiments because they achieve the best inference speed.

Sparse Neural Networks. Sparse neural networks [4] are neural networks where only a selected subset of connections is present, i.e., instead of fully-connected layers, the layers are sparse. In the literature on sparse neural networks, usually, the task is to distill a sparse neural network from a dense neural network, and the choice of connections is important. However, there has also been work suggesting a high effectiveness of using randomized and fixed sparse connections [32]. For logic gate networks, which are sparse by definition, we follow these findings and use randomly initialized and fixed connections.

3 Logic Gate Networks

Logic gate networks are networks similar to neural networks where each neuron is represented by a binary logic gate like "and", "nand", and "nor" and accordingly has only two inputs (instead of all neurons in the previous layer, as is the case in fully-connected neural networks). Given a binary vector as input, pairs of Boolean values are selected, binary logic gates are applied to them, and their output is then used as input for layers further downstream. Logic gate networks do not use weights. Instead, they are parameterized via the choice of logic gate at each neuron. In contrast to fully connected neural networks, binary logic gate networks are sparse because each neuron has only 2 instead of n inputs, where n is the number of neurons per layer. In logic gate networks, we do not need activation functions as the networks are intrinsically non-linear.
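To make the fixed wiring concrete, the following minimal PyTorch sketch draws two pseudo-random input indices per neuron from a seeded generator, so that the wiring can later be reproduced from the seed alone; `random_connections` is an illustrative helper of ours, not the authors' released implementation.

```python
import torch

def random_connections(num_inputs: int, num_neurons: int, seed: int = 0):
    # Fixed pseudo-random wiring: each neuron reads exactly two outputs of
    # the previous layer. Storing only the seed suffices to reproduce it.
    gen = torch.Generator().manual_seed(seed)
    idx_a = torch.randint(num_inputs, (num_neurons,), generator=gen)
    idx_b = torch.randint(num_inputs, (num_neurons,), generator=gen)
    return idx_a, idx_b
```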
While it is possible to make a prediction simply with a single binary output or k binary outputs for k classes, this is not ideal. This is because, in the crisp case, we only get 0s or 1s and no graded prediction, which would be necessary for a "greatest activation" classification scheme. By using multiple neurons per class and aggregating them by summation, even the crisp case allows for grading, with as many levels as there are neurons per class. Each of these neurons captures a different piece of evidence for a class, and this allows for more finely graded predictions.

Figure 1 illustrates a small logic gate network. In the illustration, each node corresponds to a single logic operator. Note that the distribution over operators (red) is part of the differentiable relaxation discussed in the next section. As logic gate networks build on bit-wise logic operations only, their execution is very efficient.

4 Differentiable Logic Gate Networks

Training binary logic gate networks is hard because they are not differentiable, and thus no gradient descent-based training is conventionally possible. Thus, we propose relaxing logic gate networks to differentiable logic gate networks to allow for gradient-based training.

Table 1: List of all real-valued binary logic ops.

| ID | Operator | real-valued | 00 | 01 | 10 | 11 |
|----|----------|-------------|----|----|----|----|
| 0  | False    | 0                   | 0 | 0 | 0 | 0 |
| 1  | A ∧ B    | A · B               | 0 | 0 | 0 | 1 |
| 2  | ¬(A ⇒ B) | A - A·B             | 0 | 0 | 1 | 0 |
| 3  | A        | A                   | 0 | 0 | 1 | 1 |
| 4  | ¬(A ⇐ B) | B - A·B             | 0 | 1 | 0 | 0 |
| 5  | B        | B                   | 0 | 1 | 0 | 1 |
| 6  | A ⊕ B    | A + B - 2·A·B       | 0 | 1 | 1 | 0 |
| 7  | A ∨ B    | A + B - A·B         | 0 | 1 | 1 | 1 |
| 8  | ¬(A ∨ B) | 1 - (A + B - A·B)   | 1 | 0 | 0 | 0 |
| 9  | ¬(A ⊕ B) | 1 - (A + B - 2·A·B) | 1 | 0 | 0 | 1 |
| 10 | ¬B       | 1 - B               | 1 | 0 | 1 | 0 |
| 11 | A ⇐ B    | 1 - B + A·B         | 1 | 0 | 1 | 1 |
| 12 | ¬A       | 1 - A               | 1 | 1 | 0 | 0 |
| 13 | A ⇒ B    | 1 - A + A·B         | 1 | 1 | 0 | 1 |
| 14 | ¬(A ∧ B) | 1 - A·B             | 1 | 1 | 1 | 0 |
| 15 | True     | 1                   | 1 | 1 | 1 | 1 |

Differentiable Logics. To make binary logic networks differentiable, we leverage the following relaxation. First, instead of hard binary activations / values a ∈ {0, 1}, we relax all values to probabilistic activations a ∈ [0, 1]. Second, we replace the logic gates by computing the expected value of the activation given probabilities of independent inputs a1 and a2. For example, the probability that two independent events with probabilities a1 and a2 both occur is a1 · a2. These operators correspond to the probabilistic T-norm and T-conorm; we report the full set of relaxations corresponding to the probabilistic interpretation in Table 1. (In addition, we report alternative relaxations corresponding to alternative interpretations in Tables 10 and 11 in Supplementary Material D.) Accordingly, we define the activation of a neuron with the i-th operator as

$a = f_i(a_1, a_2)$,    (1)

where $f_i$ is the i-th real-valued operator corresponding to Table 1 and $a_1, a_2$ are the inputs to the neuron. There are also alternative real-valued logics like the Hamacher T-(co)norm, the relativistic Einstein sum, and the Łukasiewicz T-(co)norm. While, in this work, we use the probabilistic interpretation, we review an array of possible T-norms and T-conorms that could also be used in SM D.

Differentiable Choice of Operator. While real-valued logics allow differentiation, they do not allow training, as the operators are not continuously parameterized and thus (under hard binary inputs) the activations in the network will always be a ∈ {0, 1}. Thus, we propose to represent the choice of which logic gate is present at each neuron by a categorical probability distribution. For this, we parameterize each neuron with 16 floats (i.e., w ∈ R^16), which, by softmax, map to the probability simplex (i.e., a categorical probability distribution such that all entries sum up to 1 and it has only non-negative values).
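Since the relaxations in Table 1 are simple polynomials in the two input activations, they can be written down directly. The following sketch lists all 16 of them as functions of probability-valued tensors; the list name and the ordering by gate ID are our own convention, not the released code's API.

```python
import torch

# Real-valued relaxations of the 16 binary logic gates (probabilistic
# interpretation, cf. Table 1); a and b are activations in [0, 1].
REAL_VALUED_GATES = [
    lambda a, b: torch.zeros_like(a),        # 0: False
    lambda a, b: a * b,                      # 1: A and B
    lambda a, b: a - a * b,                  # 2: not(A implies B)
    lambda a, b: a,                          # 3: A
    lambda a, b: b - a * b,                  # 4: not(B implies A)
    lambda a, b: b,                          # 5: B
    lambda a, b: a + b - 2 * a * b,          # 6: A xor B
    lambda a, b: a + b - a * b,              # 7: A or B
    lambda a, b: 1 - (a + b - a * b),        # 8: not(A or B)
    lambda a, b: 1 - (a + b - 2 * a * b),    # 9: not(A xor B)
    lambda a, b: 1 - b,                      # 10: not B
    lambda a, b: 1 - b + a * b,              # 11: B implies A
    lambda a, b: 1 - a,                      # 12: not A
    lambda a, b: 1 - a + a * b,              # 13: A implies B
    lambda a, b: 1 - a * b,                  # 14: not(A and B)
    lambda a, b: torch.ones_like(a),         # 15: True
]
```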
That is, $p_i = e^{w_i} / \left(\sum_j e^{w_j}\right)$, and thus p lies in the probability simplex $\Delta^{15}$. During training, we evaluate for each neuron all 16 relaxed binary logic gates and use the categorical probability distribution to compute their weighted average. Thus, we define the activation a of a differentiable logic gate neuron as

$a = \sum_{i=0}^{15} p_i \cdot f_i(a_1, a_2) = \sum_{i=0}^{15} \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot f_i(a_1, a_2)$.    (2)

Aggregation of Output Neurons. Now, we may have n output neurons $a_1, a_2, \ldots, a_n \in [0, 1]$, but we may want the logic gate network to only predict k < n values of a larger range than [0, 1]. Further, we may want to be able to produce graded outputs. Thus, we can aggregate the outputs as

$\left(\sum_{j = i \cdot n/k + 1}^{(i+1) \cdot n/k} a_j\right) / \tau + \beta$    (3)

where the sum runs over the i-th group of output neurons, τ is a normalization temperature, and β is an optional offset.

4.1 Training Considerations

Training. For learning, we randomly initialize the connections and the parameterization of each neuron. For the initial parameterization of each neuron, we draw the elements of w independently from a standard normal distribution. In all reported experiments, we use the same number of neurons in each layer (except for the input) and between 4 and 8 layers, which we call a straight network. We train all models with the Adam optimizer [33] at a constant learning rate of 0.01.

Discretization. After training, during inference, we discretize the probability distributions by only taking their mode (i.e., their most likely value), and thus the network can be computed with Boolean values, which makes inference very fast. In practice, we observe that most neurons converge to one logic gate operation; therefore, the discretization step introduces only a small error, e.g., for MNIST, the gap is smaller than 0.1%. We note that all reported results are accuracies after discretization.

Classification. In the application of a classification learning setting with k classes (e.g., 10) and n output neurons (e.g., 1 000), we group the output into k groups of size n/k (e.g., 100). Then, we count the number of 1s, which corresponds to the classification score, such that the predicted class can be retrieved via the arg max of the class scores. During differentiable training, we sum up the probabilities of the outputs in each group instead of counting the 1s, and we can train the model using a softmax cross-entropy classification loss. For a reference on choosing the hyperparameter τ, see Supplementary Material A.1; the offset β is not relevant for the classification setting, as arg max is shift-invariant. A heuristic for choosing τ is that, when increasing n, we have to reduce τ; empirically, when increasing n by a factor of 10, τ should be decreased by a factor of about 2.

Regression. For regression learning, let us assume that we need to predict a k-dimensional output vector. Here, τ and β play the role of an affine transformation to transform the range of possible predictions from 0 to n/k to an application-specific and more suitable range. Here, the optional bias β is important, e.g., if we want to predict values outside the range of [0, n/k/τ]. In some cases, it is desirable to cover the entire range of real numbers, which may be achieved using a logit transform $\mathrm{logit}(x) = \sigma^{-1}(x) = \log \frac{x}{1-x}$ in combination with τ = n/k, β = 0. During differentiable training, we sum up the probabilities of the outputs in each group instead of counting the 1s, and we can train the model, e.g., using an MSE loss.
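A compact sketch of how Eq. (2) and Eq. (3) could be combined into a trainable layer is given below (assuming the `REAL_VALUED_GATES` list and `random_connections` helper sketched earlier); the class and function names are ours and do not mirror the released difflogic API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffLogicLayer(nn.Module):
    # Sketch of one differentiable logic gate layer (Eq. 2).
    def __init__(self, in_dim: int, out_dim: int, seed: int = 0):
        super().__init__()
        idx_a, idx_b = random_connections(in_dim, out_dim, seed)
        self.register_buffer("idx_a", idx_a)
        self.register_buffer("idx_b", idx_b)
        # 16 logits per neuron, drawn from a standard normal distribution.
        self.logits = nn.Parameter(torch.randn(out_dim, 16))

    def forward(self, x):                        # x: (batch, in_dim) in [0, 1]
        a, b = x[:, self.idx_a], x[:, self.idx_b]
        gates = torch.stack([f(a, b) for f in REAL_VALUED_GATES], dim=-1)
        probs = F.softmax(self.logits, dim=-1)   # (out_dim, 16)
        return (gates * probs).sum(dim=-1)       # expected gate output, Eq. (2)

    def discretize(self):
        # After training, keep only the most likely gate per neuron
        # (4 bits of information each).
        return self.logits.argmax(dim=-1)

def group_sum(outputs, k, tau=1.0, beta=0.0):
    # Aggregate n output neurons into k graded class scores, Eq. (3).
    batch, n = outputs.shape
    return outputs.view(batch, k, n // k).sum(dim=-1) / tau + beta
```

For classification, the group sums can be passed directly to a softmax cross-entropy loss; after training, `discretize()` returns the per-neuron gate IDs used for hard, Boolean-only inference.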
4.2 Remarks

Boolean Vectorization via Larger Data Types. One important computational detail for inference time is that we do not use Boolean data types but instead use larger data types such as, e.g., int64 for a batch size of 64, and thus perform bit-wise logic on larger batches, which significantly improves speed on current hardware. For int64, we batch 64 data points such that the i-th Boolean value of the j-th data point is the j-th bit in the i-th int64 integer. Thus, it is possible to compute on average around 250 binary logic gates on each core in each CPU clock cycle (i.e., per Hz) on a typical desktop / notebook computer. This is the case because modern CPUs execute many instructions per clock cycle even on a single core, and additionally (for Booleans) allow single-instruction multiple-data (SIMD) execution by batching bits from multiple data points into one integer (e.g., int64). Using advanced vector extensions (AVX), even larger speedups would be possible. On GPU, this computational speedup is also available in addition to typical GPU parallelization.

Aggregation of Output Neurons via Binary Adders. In addition, during inference, we aggregate the output neurons directly using logic gate nets that make up the respective adders, as writing all outputs to memory would constitute a bottleneck and aggregating them using logic gate networks is fast. Specifically, we construct, from logic gates, adders that can add exactly one bit to a binary number.

Memory Considerations. Since we pseudo-randomly initialize the connections in binary logic gate networks, i.e., which are the two inputs for each neuron, we do not need to store the connections, as they can be reproduced from a single seed. Thus, it suffices to store the 4-bit information of which of the 16 logic gate operators is used for each neuron. Thus, the memory footprint of logic gate networks is drastically reduced in comparison to neural networks, binary neural networks, and sparse neural networks.

Pruning the Model. An additional speedup for the inference of logic gate networks is available by pruning neurons that are not used, or by simplifying logical expressions. However, this requires storing the connections, posing a (minor) trade-off between memory and speed.

Subset of Operators. We investigated reducing the set of operators; however, we found that, in all settings, the more expressive full set of 16 operators performed better. Nevertheless, a smaller set of operators could be a good trade-off for reducing the model size.

Half Precision. We also investigated training with half precision (float16). In our experiments, half precision (in comparison to full precision) did not degrade training performance; nevertheless, all reported results were trained with full precision (float32).

Optimizer. For training differentiable logic gate networks, we use the Adam optimizer [33] because it includes a normalization of the gradients with respect to their magnitude over past steps. We found that this greatly improves training compared to other optimizers like SGD or SGD with momentum, which can become ineffective for training deeper logic gate networks.
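As an illustration of the bit-packing idea (not the paper's actual implementation, which compiles to C/CUDA and uses, e.g., int64 words and AVX), the sketch below packs the same Boolean feature of 64 data points into one machine word and applies one hard gate per neuron with a single bit-wise operation; gate IDs follow Table 1, and plain Python integers stand in for int64 words.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1

# A few of the 16 hard gates as bit-wise word operations (IDs as in Table 1).
BIT_GATES = {
    1:  lambda a, b: a & b,                # and
    6:  lambda a, b: a ^ b,                # xor
    7:  lambda a, b: a | b,                # or
    8:  lambda a, b: ~(a | b) & MASK,      # nor
    14: lambda a, b: ~(a & b) & MASK,      # nand
    # ... the remaining gate IDs are analogous
}

def pack_bits(column):
    # Pack one Boolean feature of up to 64 data points into a single word:
    # bit j holds the value of data point j.
    word = 0
    for j, bit in enumerate(column):
        word |= int(bit) << j
    return word

def layer_forward(words, gate_ids, idx_a, idx_b):
    # One bit-wise gate per neuron, applied to a whole 64-sample word at
    # once; this is where the SIMD-style speedup comes from.
    return [BIT_GATES[g](words[i], words[j])
            for g, i, j in zip(gate_ids, idx_a, idx_b)]
```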
5 Current Limitations and Opportunities

Expensive Training. A limitation of differentiable logic gate networks is their relatively higher training cost compared to (performance-wise) comparable conventional neural networks. The higher training cost arises because multiple differentiable operators need to be evaluated for each neuron, and, in their real-valued differentiable form, most of these operators require floating-point multiplications. However, the practical computational cost can be reduced through improved implementations. We note that, asymptotically, differentiable logic gate networks are cheaper to train than conventional neural networks due to their sparsity.

Convolutions and Other Architectures. Convolutional logic gate networks and other architectural components such as residual connections are interesting and important directions for future research.

Edge Computing and Embedded Machine Learning. We would like to emphasize that the current limitation to rather small architectures (compared to large deep learning architectures) does not need to be a limitation: for example, in edge computing and embedded machine learning [34]-[37], models are already limited to tiny architectures because they run, e.g., on mobile CPUs, microcontrollers, or IoT devices. In these cases, training cost is not a concern because training is done before deployment. We also note that there are many other applications in industry where the training cost is negligible in comparison to the inference cost.

6 Experiments¹

To empirically validate our method, we perform an array of experiments. We start with the three MONK data sets and continue to the Adult Census and Breast Cancer data sets. For each experiment, we compare our method to other methods in terms of model memory footprint, evaluation speed, and accuracy. To demonstrate that our method also performs well on image recognition, we benchmark it on the MNIST as well as the CIFAR-10 data sets. We benchmark speeds and computational complexity of our method in comparison to baselines, which we discuss in detail in Section B.

¹ The source code will be publicly available at github.com/Felix-Petersen/difflogic.

6.1 MONK's Problems

Table 2: Results on the MONK data sets. The inference times are per data point for 1 CPU thread. Averaged over 10 runs. For Diff Logic Nets, # Parameters and Space vary between the MONK data sets as we use different architectures.

| Method | MONK-1 | MONK-2 | MONK-3 |
|---|---|---|---|
| Decision Tree Learner (ID3) [38] | 98.6% | 67.9% | 94.4% |
| Decision Tree Learner (C4.5) [39] | 100% | 70.4% | 100% |
| Rule Learner (CN2) [31] | 100% | 69.0% | 89.1% |
| Logistic Regression | 71.1% | 61.4% | 97.0% |
| Neural Network | 100% | 100% | 93.5% |
| Diff Logic Net (ours) | 100% | 90.9% | 97.7% |

| Method | # Parameters | Inf. Time | Space |
|---|---|---|---|
| Decision Tree Learner | 30 | 49ns | 60B |
| Logistic Regression | 20 | 68ns | 80B |
| Neural Network | 162 | 152ns | 648B |
| Diff Logic Net (ours) | 144 / 72 / 72 | 18ns | 72B / 36B / 36B |

The MONK's problems [40] are 3 classic machine learning tasks that have been used to benchmark learning algorithms. They consist of 3 binary classification tasks on a data set with 6 attributes with 2-4 possible values each. Correspondingly, the data points can be encoded as binary vectors of size 17. In Table 2, we show the performance of our method, a regular neural network, and a few of the original learning methods that have been benchmarked. We give the prediction speed for a single CPU thread, the number of parameters, and the storage requirements.

Table 3: Results for the Adult and Breast Cancer data sets averaged over 10 runs.

| Adult | Acc. | # Param. | Infer. Time | Space |
|---|---|---|---|---|
| Decision Tree Learner | 79.5% | 50 | 86ns | 130B |
| Logistic Regression | 84.8% | 234 | 63ns | 936B |
| Neural Network | 84.9% | 3810 | 635ns | 15KB |
| Diff Logic Net (ours) | 84.8% | 1280 | 5.1ns | 640B |
| Breast Cancer | Acc. | # Param. | Infer. Time | Space |
|---|---|---|---|---|
| Decision Tree Learner | 71.9% | 100 | 82ns | 230B |
| Logistic Regression | 72.9% | 104 | 34ns | 416B |
| Neural Network | 75.3% | 434 | 130ns | 1.4KB |
| Diff Logic Net (ours) | 76.1% | 640 | 2.8ns | 320B |

On all three data sets, our method performs better than logistic regression, and on MONK-3 (which is the data set with label noise) our method even outperforms the much larger neural network. For hyperparameter details, see Supplementary Material A.1.

6.2 Adult and Breast Cancer

For our second set of experiments, we consider the Adult Census [41] and the Breast Cancer [42] data sets. We find that our method performs very similarly to neural networks and logistic regression on the Adult data set while achieving a much faster inference speed. On the Breast Cancer data set, our method achieves the best performance while still being the fastest model. We present the results in Table 3.

6.3 MNIST

For our comparison to the fastest neural networks, we start by considering MNIST [43]. The methods we compare ourselves to are also discussed in further detail in the baselines section. In comparison to the fastest method achieving at least 98.4% on MNIST, which is FINN by Umuroglu et al. [44] (identified by Qin et al. [3]), our logic network achieves a better performance while requiring less than 10% of the number of binary operations. That is, our model is objectively more than 10× cheaper to evaluate. When comparing real times, for an NVIDIA A6000 GPU, our model is 12× faster than the model by Umuroglu et al. [44] on their specialized FPGA hardware, even though our model only achieves a 7% utilization of the GPU. For the other BNNs, OPs have not been reported, but their inference speed is also substantially slower than FINN's. When compared to the smallest sparse neural networks, our model requires substantially fewer operations than each of the baselines. Sparse function networks [29], which have been learned evolutionarily, achieve an accuracy of 94.2%. We provide an additional discussion of the results displayed in Table 4 in Supplementary Material B.

Table 4: Results for MNIST; all of our results are averaged over 10 runs. Times (T.) are inference times per image, the GPU is an NVIDIA A6000, and the CPU is a single thread at 2.5 GHz. For our experiments, i.e., the top block, we use binarized MNIST.

| MNIST | Acc. | # Param. | Space | T. [CPU] | T. [GPU] | OPs | FLOPs |
|---|---|---|---|---|---|---|---|
| Linear Regression | 91.6% | 4 010 | 16KB | 3µs | 2.4ns | (4M) | 4K |
| Neural Network (small) | 97.92% | 118 282 | 462KB | 14µs | 12.4ns | (236M) | 236K |
| Neural Network | 98.40% | 22 609 930 | 86MB | 2.2ms | 819ns | (45G) | 45M |
| Diff Logic Net (small) | 97.69% | 48 000 | 23KB | 625ns | 6.3ns | 48K | n/a |
| Diff Logic Net | 98.47% | 384 000 | 188KB | 7µs | (50ns) | 384K | n/a |

| Binary Neural Networks | Acc. | T. [CPU] | T. [FPGA] | OPs |
|---|---|---|---|---|
| FINN [44] | 98.40% | (96µs) | 641ns | 5.28M |
| BinaryEye [45] | 98.40% | | 50µs | |
| ReBNet [46] | 98.29% | | 3µs | |
| Low Bit NN [47] | 99.2% | | 152µs | |

| Sparse Neural Networks | Acc. | # Param. | Sparsity | OPs | FLOPs |
|---|---|---|---|---|---|
| Var. Dropout [48] | 98.08% | 4 000 | 98.5% | (8M) | 8K |
| L0 regularization [49] | 98.6% | | 2/3 | (200M) | 200K |
| SET-MLP [28] | 98.74% | 89 797 | 96.8% | (180M) | 180K |
| Sparse Function Net [29] | 94.2% | 1 849 | | > 2K | |

6.4 CIFAR-10

In addition to MNIST, we also benchmark our method on CIFAR-10 [50]. For CIFAR-10, we reduce the color-channel resolution of the CIFAR-10 images and employ a binary embedding: for a color-channel resolution of 4 (the first three rows of Table 5), we use three binary values with the three thresholds 0.25, 0.5, and 0.75.
For a color-channel resolution of 32 (the large models, i.e., the last three rows of the top block of Table 5), we use 31 binary values with thresholds i/32 for i ∈ {1, ..., 31}. We do not apply data augmentation / dropout for our experiments, which could additionally improve performance. For all baselines, we copied the reported accuracies from the original source, and thus those results are with data augmentation and the original color-channel resolution, and may include dropout [51] as well as other techniques such as student-teacher learning with a convolutional teacher [52]. The results are displayed in Table 5.

We find that our method outperforms neural networks in the first setting (color-channel resolution of 4) by a small margin, while requiring less than 0.1% of the memory footprint, and (with a larger model) by a large margin, while requiring less than 1% of the memory footprint. In comparison to the best fully-connected neural network baselines, which are trained with various tricks such as student-teacher learning and retain the full color-channel resolution, our model does not achieve the same performance, while it is also much smaller and has access to less data. With 1 million parameters, the student-teacher model [52] has a footprint that is about 64% larger than the footprint of our largest model (large 4), and achieves an accuracy which is only 3.7% better than ours. It is important to note that this model requires 2 million floating-point operations, while our model requires 5 million bit-wise logic operations (before pruning / optimization). On float-arithmetic hardware-accelerated integrated circuits (as current GPUs and many CPUs), the 2 million floating-point operations are around 100× slower than 5 million bit-wise logic operations. On general-purpose hardware (i.e., without float acceleration), the speed difference would be one order of magnitude larger, i.e., 1 000×.

More competitive with respect to speed are sparse neural networks. In the final block of Table 5, we report the sparsest models for CIFAR-10. Note that two of the methods resulted in performances below 50%, and even those methods which achieve around 75% accuracy require a significantly more expensive inference. ProbMask [53], with its 140 KFLOPs per image, is (depending on hardware) 1-2 orders of magnitude more expensive than our largest model. Also, note that two out of three ResNet32-based sparsification methods achieve only around 37% accuracy. The results in parentheses are estimated because compilation to binaries did not finish / for GPU, the largest models could also not be compiled due to compiler limitations. This can be resolved, if desired, with moderate implementation effort, e.g., by compiling the model directly to PTX (CUDA assembly) without compiling via gcc and nvcc. The actual problem is that the compilers used by the implementation have a compile time that is quadratic in the number of lines of code / statements. We provide an additional discussion of the results displayed in Table 5 in SM B.

Table 5: Results on CIFAR-10. Times (T.) are inference times per image, the GPU is an NVIDIA A6000, and the CPU is a single thread at 2.5 GHz. For our experiments, i.e., the top block, we use a color-channel resolution of 4 for the first 3 lines and a color-channel resolution of 32 for the large models. The other baselines were provided with the full resolution of 256 color-channel values. The numbers in parentheses are extrapolated / estimated.
| CIFAR-10 | Acc. | # Param. | Space | T. [CPU] | T. [GPU] | OPs | FLOPs |
|---|---|---|---|---|---|---|---|
| Neural Network (color-ch. res. = 4) | 50.79% | 12.6M | 48MB | 1.2ms | 370ns | (25G) | 25M |
| Diff Logic Net (small) | 51.27% | 48K | 24KB | 1.3µs | 19ns | 48K | n/a |
| Diff Logic Net (medium) | 57.39% | 512K | 250KB | 7.3µs | 29ns | 512K | n/a |
| Diff Logic Net (large) | 60.78% | 1.28M | 625KB | (18µs) | (73ns) | 1.28M | n/a |
| Diff Logic Net (large 2) | 61.41% | 2.56M | 1.22MB | (37µs) | (145ns) | 2.56M | n/a |
| Diff Logic Net (large 4) | 62.14% | 5.12M | 2.44MB | (73µs) | (290ns) | 5.12M | n/a |

| Best Fully-Connected Baselines (color-ch. res. = 256) | Acc. | # Param. | Space | T. [CPU] | T. [GPU] | OPs | FLOPs |
|---|---|---|---|---|---|---|---|
| Regularized SReLU NN [28] | 68.70% | 20.3M | 77MB | 1.9ms | 565ns | (40G) | 40M |
| Student-Teacher NN [52] | 65.8% | 1M | 4MB | 112µs | 243ns | (2G) | 2M |
| Student-Teacher NN [52] | 74.3% | 31.6M | 121MB | 2.9ms | 960ns | (63G) | 63M |

| Sparse Neural Networks | Acc. | # Param. | Space | Sparsity | OPs | FLOPs |
|---|---|---|---|---|---|---|
| PBW (ResNet32) [54] | 38.64% | | | 99.9% | (140M) | (140K) |
| MLPrune (ResNet32) [55] | 36.09% | | | 99.9% | (140M) | (140K) |
| ProbMask (ResNet32) [53] | 76.87% | | | 99.9% | (140M) | (140K) |
| SET-MLP [28] | 74.84% | 279K | 4.7MB | 98.6% | (558M) | 558K |

6.5 Distribution of Logic Gates

To gain additional insight into learned logic gate networks, we consider histograms of the operators present in each layer of a trained model. Specifically, we consider a 4-layer CIFAR-10 model with 12 000 neurons per layer in Figure 2. We observe that, generally, the constant 0/1 "operator" is learned to be used only very infrequently, as it does not actually provide value to the model. Especially interesting is that it does not occur at all in the last layer. In the first layer, we observe a stronger presence of "and", "nand", "or", and "nor". In the second and third layers, there are more "A", "B", "¬A", and "¬B"s, which can be seen as residual / direct connections. This enables the network to model lower-order dependencies more efficiently by expressing them with fewer layers than the predefined number of layers. In the last layer, the most frequent operations are "xor" and "xnor", which can create conditional dependencies on activations of the previous layers. Interestingly, however, implications (e.g., A ⇒ B) are only infrequently used.

Figure 2: Distribution of logic gates in a trained four-layer logic network.

7 Conclusion

In this work, we presented a novel approach to train logic gate networks, which allows us to effectively train extremely efficient neural networks that, for their level of accuracy, are one or more orders of magnitude more efficient than the state-of-the-art. For this, we leveraged real-valued logics and continuous relaxations via softmax. We will release the source code of this work to the community to foster future research on learning logic gate networks.

Acknowledgments and Disclosure of Funding

This work was supported by the MIT-IBM Watson AI Lab, the DFG in the Cluster of Excellence EXC 2117 "Centre for the Advanced Study of Collective Behaviour" (Project-ID 390829875), and the Land Salzburg within the WISS 2025 project IDA-Lab (20102-F1901166-KZP and 20204-WISS/225/197-2019).

References

[1] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: Parameterized clipping activation for quantized neural networks," arXiv preprint arXiv:1805.06085, 2018.
[2] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning (ICML), 2015.
[3] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N.
Sebe, "Binary neural networks: A survey," Pattern Recognition, vol. 105, p. 107281, 2020.
[4] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, "Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks," arXiv preprint arXiv:2102.00554, 2021.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by backpropagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[6] A. Telikani, A. Tahmassebi, W. Banzhaf, and A. H. Gandomi, "Evolutionary machine learning: A survey," ACM Computing Surveys (CSUR), 2021.
[7] J. Rapin and O. Teytaud, Nevergrad - A gradient-free optimization platform, https://GitHub.com/FacebookResearch/Nevergrad, 2018.
[8] A. Darwiche and P. Marquis, "A knowledge compilation map," Journal of Artificial Intelligence Research, 2002.
[9] S. Arora and B. Barak, Computational Complexity: A Modern Approach. Cambridge University Press, 2009.
[10] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, 1997.
[11] K. Menger, "Statistical metrics," Proceedings of the National Academy of Sciences of the United States of America, vol. 28, no. 12, p. 535, 1942.
[12] E. P. Klement, R. Mesiar, and E. Pap, Triangular Norms. Springer Science & Business Media, 2013.
[13] E. van Krieken, E. Acar, and F. van Harmelen, "Analyzing differentiable fuzzy logic operators," arXiv preprint arXiv:2002.06100, 2020.
[14] S. Chatterjee, "Learning and memorization," in International Conference on Machine Learning (ICML), 2018.
[15] T. Brudermueller, D. L. Shung, A. J. Stanley, J. Stegmaier, and S. Krishnaswamy, "Making logic learnable with neural networks," arXiv preprint arXiv:2002.03847, 2020.
[16] M. Cuturi, O. Teboul, and J.-P. Vert, "Differentiable ranking and sorting using optimal transport," in Proc. Neural Information Processing Systems (NIPS), 2019.
[17] F. Petersen, "Learning with Differentiable Algorithms," 2022.
[18] F. Petersen, B. Goldluecke, C. Borgelt, and O. Deussen, "GenDR: A Generalized Differentiable Renderer," in Proc. International Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[19] F. Petersen, C. Borgelt, H. Kuehne, and O. Deussen, "Monotonic Differentiable Sorting Networks," in International Conference on Learning Representations (ICLR), 2022.
[20] Q. Berthet, M. Blondel, O. Teboul, M. Cuturi, J.-P. Vert, and F. Bach, "Learning with Differentiable Perturbed Optimizers," in Proc. Neural Information Processing Systems (NIPS), 2020.
[21] F. Petersen, C. Borgelt, H. Kuehne, and O. Deussen, "Differentiable Sorting Networks for Scalable Sorting and Ranking Supervision," in International Conference on Machine Learning (ICML), 2021.
[22] M. Cuturi and M. Blondel, "Soft-DTW: A Differentiable Loss Function for Time-Series," in International Conference on Machine Learning (ICML), 2017.
[23] F. Petersen, C. Borgelt, H. Kuehne, and O. Deussen, "Learning with Algorithmic Supervision via Continuous Relaxations," in Proc. Neural Information Processing Systems (NIPS), 2021.
[24] M. Zimmer, X. Feng, C. Glanois, et al., "Differentiable logic machines," arXiv preprint arXiv:2102.11529, 2021.
[25] G. Chen, "Learning symbolic expressions via Gumbel-Max equation learner network," arXiv preprint arXiv:2012.06921, 2020.
[26] C. J. Maddison, A. Mnih, and Y. W. Teh, "The concrete distribution: A continuous relaxation of discrete random variables," in International Conference on Learning Representations (ICLR), 2017.
[27] E. Jang, S. Gu, and B.
Poole, "Categorical reparameterization with Gumbel-Softmax," 2017.
[28] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta, "Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science," Nature Communications, vol. 9, no. 1, pp. 1-12, 2018.
[29] A. Gaier and D. Ha, "Weight agnostic neural networks," in Proc. Neural Information Processing Systems (NIPS), 2019.
[30] V. Zantedeschi, M. Kusner, and V. Niculae, "Learning binary decision trees by argmin differentiation," in International Conference on Machine Learning (ICML), 2021.
[31] P. Clark and T. Niblett, "The CN2 induction algorithm," Machine Learning, vol. 3, no. 4, pp. 261-283, 1989.
[32] S. Liu, T. Chen, X. Chen, et al., "The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training," in International Conference on Learning Representations (ICLR), 2022.
[33] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[34] M. S. Murshed, C. Murphy, D. Hou, N. Khan, G. Ananthanarayanan, and F. Hussain, "Machine learning at the network edge: A survey," ACM Computing Surveys (CSUR), 2021.
[35] T. S. Ajani, A. L. Imoize, and A. A. Atayero, "An overview of machine learning within embedded and mobile devices optimizations and applications," Sensors, vol. 21, no. 13, 2021.
[36] K. P. Seng, P. J. Lee, and L. M. Ang, "Embedded intelligence on FPGA: Survey, applications and challenges," Electronics, vol. 10, no. 8, p. 895, 2021.
[37] S. Branco, A. G. Ferreira, and J. Cabral, "Machine learning in resource-scarce embedded systems, FPGAs, and end-devices: A survey," Electronics, vol. 8, no. 11, p. 1289, 2019.
[38] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[39] S. L. Salzberg, C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993, 1994.
[40] S. B. Thrun, J. W. Bala, E. Bloedorn, et al., "The MONK's problems: A performance comparison of different learning algorithms," Tech. Rep., 1991.
[41] R. Kohavi and B. Becker, "UCI machine learning repository: Adult data set," Available: https://archive.ics.uci.edu/ml/machine-learning-databases/adult, 1996.
[42] M. Zwitter and M. Soklic, UCI machine learning repository: Breast cancer dataset, 1988.
[43] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist.
[44] Y. Umuroglu, N. J. Fraser, G. Gambardella, et al., "FINN: A framework for fast, scalable binarized neural network inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 65-74.
[45] P. Jokic, S. Emery, and L. Benini, "BinaryEye: A 20 kfps streaming camera system on FPGA with real-time on-device image recognition using binary neural networks," in 2018 IEEE 13th International Symposium on Industrial Embedded Systems (SIES), IEEE, 2018, pp. 1-7.
[46] M. Ghasemzadeh, M. Samragh, and F. Koushanfar, "ReBNet: Residual binarized neural network," in Proceedings of the 26th IEEE International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM '18, 2018.
[47] J. Zhan, X. Zhou, and W. Jiang, "Field programmable gate array-based all-layer accelerator with quantization neural networks for sustainable cyber-physical systems," Software: Practice and Experience, 2020.
[48] D. Molchanov, A. Ashukha, and D.
Vetrov, "Variational dropout sparsifies deep neural networks," in International Conference on Machine Learning, PMLR, 2017, pp. 2498-2507.
[49] C. Louizos, M. Welling, and D. P. Kingma, "Learning sparse neural networks through L0 regularization," in International Conference on Learning Representations (ICLR), 2018.
[50] A. Krizhevsky, G. Hinton, et al., "Learning multiple layers of features from tiny images," 2009.
[51] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[52] G. Urban, K. J. Geras, S. E. Kahou, et al., "Do deep convolutional nets really need to be deep and convolutional?" in International Conference on Learning Representations (ICLR), 2018.
[53] X. Zhou, W. Zhang, H. Xu, and T. Zhang, "Effective sparsification of neural networks with global sparsity constraint," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3599-3608.
[54] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in International Conference on Learning Representations (ICLR), 2016.
[55] W. Zeng and R. Urtasun, MLPrune: Multi-layer pruning for automated neural network compression, 2019.
[56] A. Paszke, S. Gross, F. Massa, et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Neural Information Processing Systems (NIPS), 2019.
[57] J.-F. Têtu, L.-C. Trudeau, M. Van Beirendonck, A. Balatsoukas-Stimming, and P. Giard, "A standalone FPGA-based miner for Lyra2REv2 cryptocurrencies," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 4, pp. 1194-1206, 2020.
[58] C. Kolivas, Cgminer, https://github.com/ckolivas/cgminer, 2012.
[59] S. A. Khan, A. Agarwala, and D. Shahani, "Implementation of advance oscilloscope triggering scheme on FPGA," in 2005 IEEE Instrumentation and Measurement Technology Conference Proceedings, IEEE, vol. 1, 2005, pp. 407-411.
[60] I. Shani, L. Shaughnessy, J. Rzasa, et al., "Dynamics of analog logic-gate networks for machine learning," Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 29, no. 12, p. 123130, 2019.
[61] S. Lym, E. Choukse, S. Zangeneh, W. Wen, S. Sanghavi, and M. Erez, "PruneTrain: Fast neural network training by dynamic sparse model reconfiguration," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.
[62] D. Blalock, J. J. G. Ortiz, J. Frankle, and J. Guttag, "What is the state of neural network pruning?" in Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?
       [Yes] The source code for this work will be made publicly available at github.com/Felix-Petersen/difflogic.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]