# Neural Power Units

Niklas Heim, Tomáš Pevný, Václav Šmídl
Artificial Intelligence Center, Czech Technical University, Prague, CZ 120 00
{niklas.heim, tomas.pevny, vasek.smidl}@aic.fel.cvut.cz

**Abstract.** Conventional neural networks can approximate simple arithmetic operations, but fail to generalize beyond the range of numbers that were seen during training. Neural arithmetic units aim to overcome this difficulty, but current arithmetic units either operate only on positive numbers or can represent only a subset of arithmetic operations. We introduce the Neural Power Unit (NPU)¹ that operates on the full domain of real numbers R and is capable of learning arbitrary power functions in a single layer. The NPU thus fixes the shortcomings of existing arithmetic units and extends their expressivity. We achieve this by using complex arithmetic without requiring a conversion of the network to complex numbers C. A simplification of the unit to the Real NPU yields a highly transparent model. We show that the NPUs outperform their competitors in terms of accuracy and sparsity on artificial arithmetic datasets, and that the Real NPU can discover the governing equations of a dynamical system from data alone.

## 1 Introduction

Numbers and simple algebra are essential not only to human intelligence but also to the survival of many other species [Dehaene, 2011, Gallistel, 2018]. A successful, intelligent agent should therefore be able to perform simple arithmetic. State-of-the-art neural networks are capable of learning arithmetic, but they fail to extrapolate beyond the ranges seen during training [Suzgun et al., 2018, Lake and Baroni, 2018]. The inability to generalize to unseen inputs is a fundamental problem that hints at a lack of understanding of the given task: the model merely memorizes the seen inputs and fails to abstract the true learning task. The failure of numerical extrapolation on simple arithmetic tasks was shown by Trask et al. [2018], who also introduced a new class of neural arithmetic units with good extrapolation performance on some arithmetic tasks.

Including neural arithmetic units in standard neural networks promises to significantly increase their extrapolation capabilities due to their inductive bias towards numerical computation. This is especially important for tasks in which the data-generating process contains mathematical relationships. They also promise to reduce the number of parameters needed for a given task, which can improve the explainability of the model. We demonstrate this in a Neural Ordinary Differential Equation (NODE, Chen et al. [2019]), where a handful of neural arithmetic units can outperform a much bigger network built from dense layers (Sec. 4.1). Moreover, our new unit can be used to directly read out the correct generating ODE from the fitted model. This is in line with recent efforts to build transparent models instead of attempting to explain black-box models [Rudin, 2019], like conventional neural networks. We follow the terminology of Lipton [2017], which defines the potential of understanding the parameters of a given model as transparency by decomposability.

¹Implementation of neural arithmetic units: github.com/nmheim/NeuralArithmetic.jl. The code to reproduce our experiments is available at github.com/nmheim/NeuralPowerUnits.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
The currently available arithmetic units all have different strengths and weaknesses, but none of them covers simple arithmetic completely. The Neural Arithmetic Logic Unit (NALU) by Trask et al. [2018] was, chronologically, the first arithmetic unit. It can learn addition (+, including subtraction), multiplication (×), and division (÷), but is limited to positive inputs. The convergence of the NALU is quite fragile due to an internal gating mechanism between the addition and multiplication paths, as well as the use of a logarithm, which is problematic for small inputs. Recently, Schlör et al. [2020] introduced the improved NALU (iNALU) to fix the NALU's shortcomings. It significantly increases the unit's complexity, and we observe only a slight improvement in performance. Madsen and Johansen [2020] solve (+, ×) with two new units: the Neural Addition Unit (NAU) and the Neural Multiplication Unit (NMU). Instead of gating between addition and multiplication paths, they are separate units that can be stacked. They work with the full range of real numbers and converge much more reliably, but cannot represent division.

**Our Contributions**

**Neural Power Unit.** We introduce a new arithmetic layer (NPU, Sec. 3) which is capable of learning products of power functions ($\prod_i x_i^{w_i}$) of arbitrary real inputs $x_i$ and powers $w_i$, thus including multiplication ($x_1 \cdot x_2 = x_1^1 x_2^1$) as well as division ($x_1 / x_2 = x_1^1 x_2^{-1}$). This is achieved by using formulas from complex arithmetic (Sec. 3.1). Stacks of NAUs and NPUs can thus learn the full spectrum of simple arithmetic operations.

**Convergence improvement.** We address the known convergence issues of neural arithmetic units by introducing a relevance gate that smooths out the loss surface of the NPU (Sec. 3.2). With the relevance gate, which helps the unit learn to ignore variables, the NPU reaches extrapolation errors and sparsities that are on par with the NMU on (×) and outperforms the NALU on (÷, √).

**Transparency.** We show how a power unit can be used as a highly transparent² model for equation discovery of dynamical systems. Specifically, we demonstrate its ability to identify a model that can be interpreted as a SIR model with fractional powers (Sec. 4.1), which was used to fit the COVID-19 outbreak in various countries [Taghvaei et al., 2020].

²As defined by Lipton [2017].

## 2 Related Work

Several different approaches to automatically solving arithmetic tasks have been studied in recent years. They include Neural GPUs [Kaiser and Sutskever, 2016], Grid LSTMs [Kalchbrenner et al., 2016], Neural Turing Machines [Graves et al., 2014], and Neural Random-Access Machines [Kurach et al., 2016]. They solve tasks like binary addition and multiplication, or single-digit arithmetic. The Neural Status Register [Faber and Wattenhofer, 2020] focuses on control flow. The Neural Arithmetic Expression Calculator [Chen et al., 2018], a hierarchical reinforcement learner, is the only method that solves the division problem, but it operates on character sequences of arithmetic expressions. Related is symbolic integration with transformers [Lample and Charton, 2019]. Unfortunately, most of the named models have severe problems with extrapolation [Madsen and Johansen, 2019, Saxton et al., 2019]. A solution to the extrapolation problem could be neural arithmetic units. They are designed with an inductive bias towards systematic, arithmetic computation. However, they are currently limited in their capability of expressing the full range of simple arithmetic operations (+, ×, ÷).
In the following two subsections, we briefly describe the currently available arithmetic layers, including their advantages and drawbacks.

### 2.1 Neural Arithmetic Logic Units

Trask et al. [2018] have demonstrated the severity of the extrapolation problem of dense networks for even the simplest arithmetic operations, such as summing or multiplying two numbers. To increase the power of abstraction for arithmetic tasks, they propose the Neural Arithmetic Logic Unit (NALU), which is capable of learning (+, ×, ÷). However, the NALU cannot handle negative inputs correctly due to the logarithm in Eq. 2.

**Definition (NALU).** The NALU consists of a (+) and a (×) path with shared weights $W$ and $M$:

$$\text{Addition:}\quad a = \hat{W} x, \qquad \hat{W} = \tanh(W) \odot \sigma(M) \tag{1}$$
$$\text{Multiplication:}\quad m = \exp\big(\hat{W} \log(|x| + \epsilon)\big) \tag{2}$$
$$\text{Output:}\quad y = a \odot g + m \odot (1 - g), \qquad g = \sigma(Gx) \tag{3}$$

with inputs $x$ and learnt parameters $W$, $M$, and $G$.

Additionally, the logarithm destabilizes training to the extent that the chance of success can drop below 20% for (+, ×); it becomes practically impossible to learn (÷) and difficult to learn from small inputs in general [Madsen and Johansen, 2019]. Schlör et al. [2020] provide a detailed description of the shortcomings of the NALU and suggest an improved NALU (iNALU). The iNALU addresses the NALU's problems through several mechanisms. It has independent addition and multiplication weights for Eq. 1 and Eq. 2, clips weights and gradients to improve training stability, regularizes the weights to push them away from zero, and, most importantly, introduces a mechanism to recover the sign that is lost due to the absolute value in the logarithm. Additionally, the authors propose to reinitialize the network if its loss does not improve during training. We include the iNALU in one of our experiments and find that it only slightly improves the NALU's performance (Sec. 4.2) at the cost of a significantly more complicated unit. Our NPU avoids all these mechanisms by internally using complex arithmetic.

### 2.2 Neural Multiplication Unit & Neural Addition Unit

Instead of trying to fix the NALU's convergence issues, Madsen and Johansen [2020] propose a new unit for (×) only. The Neural Multiplication Unit (NMU) uses explicit multiplications and learns to gate between the identity and (×) of inputs. The NMU is defined by Eq. 4 and is typically used in conjunction with the so-called Neural Addition Unit (NAU) in Eq. 5.

**Definition (NMU & NAU).** The NMU and NAU are two units that can be stacked to model (+, ×):

$$\text{NMU:}\quad y_j = \prod_i \big(\hat{M}_{ij} x_i + 1 - \hat{M}_{ij}\big), \qquad \hat{M}_{ij} = \min(\max(M_{ij}, 0), 1) \tag{4}$$
$$\text{NAU:}\quad y = \hat{A} x, \qquad \hat{A}_{ij} = \min(\max(A_{ij}, -1), 1) \tag{5}$$

with inputs $x$ and learnt parameters $M$ and $A$.

Both NMU and NAU are regularized with $\mathcal{R} = \sum_{ij} \min(|W_{ij}|, |1 - W_{ij}|)$, and their weights are clipped, which biases them towards learning an operation or pruning it completely. The combination of NAU and NMU can thus learn (+, ×) for both positive and negative inputs. Training the NAU and NMU is stable and succeeds much more frequently than with the NALU, but they cannot represent (÷), which we address with our NPU.
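To make the preceding definitions concrete, the following is a minimal sketch of the NALU (Eqs. 1-3) and NMU/NAU (Eqs. 4-5) forward passes, written as plain Julia functions over already-initialized parameter matrices. This is an illustration of the formulas only, not the reference implementation from NeuralArithmetic.jl.

```julia
# Minimal forward passes for Eqs. 1-5 (illustrative sketch).
σ(z) = 1 / (1 + exp(-z))

function nalu(W, M, G, x; ϵ=eps(Float32))
    Ŵ = tanh.(W) .* σ.(M)                 # Eq. 1: shared, soft-constrained weights
    a = Ŵ * x                             # addition path
    m = exp.(Ŵ * log.(abs.(x) .+ ϵ))      # Eq. 2: multiplication path (the sign of x is lost)
    g = σ.(G * x)                         # Eq. 3: gate between the two paths
    return g .* a .+ (1 .- g) .* m
end

nau(A, x) = clamp.(A, -1, 1) * x          # Eq. 5: clipped linear layer

function nmu(M, x)                        # Eq. 4: gated product over the inputs
    M̂ = clamp.(M, 0, 1)
    return [prod(M̂[i, j] * x[i] + 1 - M̂[i, j] for i in eachindex(x)) for j in axes(M̂, 2)]
end
```

The sketch makes the NALU's limitation visible: the `abs` in the multiplication path discards the sign of the input, which is exactly what the NPU below avoids.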
## 3 Neural Power Units

To fix the deficiencies of current arithmetic units, we propose a new arithmetic unit (inspired by the NALU) that can learn arbitrary products of power functions ($\prod_i x_i^{w_i}$, including × and ÷) for positive and negative numbers, and still train well. Combined with the NAU, it covers the full range of arithmetic operations. This is possible through a simple modification of the (×)-path of the NALU (Eq. 6): we replace the logarithm of the absolute value by the complex logarithm and allow $W$ to be complex as well. Since the complex logarithm is defined for negative inputs, the NPU does not have a problem with negative numbers. A complex $W$ improves convergence at the expense of transparency (see Sec. 4.1). The improvement during training might be explained by the additional imaginary parameters that make it possible to avoid regions with an uninformative gradient signal.

### 3.1 Naive Neural Power Unit

With the modifications introduced above, we can extend the multiplication path of the NALU from

$$m = \exp\big(W \log_{\text{real}}(|x| + \epsilon)\big) \tag{6}$$

to use the complex logarithm ($\log := \log_{\text{complex}}$) and a complex weight $W$:

$$y = \exp(W \log x) = \exp\big((W_r + i W_i) \log x\big), \tag{7}$$

where the input $x$ is still a vector of real numbers. With the polar form of a complex number, $z = re^{i\theta}$, the complex logarithm applied to a real number $x = re^{ik\pi}$ is

$$\log x = \log r + ik\pi, \tag{8}$$

where $k = 0$ if $x \geq 0$ and $k = 1$ if $x < 0$. Using the complex logarithm in Eq. 7 lifts the positivity constraint on $x$, resulting in a layer that can process both positive and negative numbers correctly. A complex weight matrix $W$ somewhere in a larger network would, however, result in complex gradients in other layers, effectively doubling the number of parameters of the whole network. As we are only interested in real outputs, we can avoid this doubling by considering only the real part of the output $y$:

$$\mathrm{Re}(y) = \mathrm{Re}\big(\exp((W_r + i W_i)(\log r + i\pi k))\big) \tag{9}$$
$$= \exp(W_r \log r - \pi W_i k) \odot \cos(W_i \log r + \pi W_r k), \tag{10}$$

where we have used Euler's formula $e^{ix} = \cos x + i \sin x$. A diagram of the Naive NPU is shown in Fig. 1.

Figure 1: Naive NPU diagram, with input x and output y. Vectors in green, trainables in orange, functions in blue.

**Definition (Naive NPU).** The Naive Neural Power Unit, with matrices $W_r$ and $W_i$ representing the real and imaginary parts of the complex weights, is defined as

$$y = \exp(W_r \log r - \pi W_i k) \odot \cos(W_i \log r + \pi W_r k), \quad \text{where} \tag{11}$$
$$r = |x| + \epsilon, \qquad k_i = \begin{cases} 0 & x_i \geq 0 \\ 1 & x_i < 0 \end{cases},$$

with inputs $x$, machine epsilon $\epsilon$, and learnt parameters $W_r$ and $W_i$.
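As a concrete reference for Eq. 11, here is a minimal sketch of the Naive NPU forward pass as a plain Julia function (again an illustration, not the reference implementation):

```julia
# Minimal sketch of the Naive NPU forward pass (Eq. 11).
# Wr, Wi: real and imaginary weight matrices; x: real input vector.
function naive_npu(Wr, Wi, x; ϵ=eps(Float32))
    r = abs.(x) .+ ϵ                      # input magnitudes
    k = float.(x .< 0)                    # k_i = 1 for negative inputs, 0 otherwise
    logr = log.(r)
    return exp.(Wr * logr .- π .* (Wi * k)) .* cos.(Wi * logr .+ π .* (Wr * k))
end

# Example: one row with real weights (2, 1) computes x1^2 * x2, also for negative inputs:
naive_npu([2.0 1.0], zeros(1, 2), [-3.0, 2.0])   # ≈ [18.0], since (-3)^2 * 2 = 18
```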
### 3.2 The Relevance Gate

The Naive NPU has difficulties converging on large-scale tasks and reaching sparse results in cases where the input to a given row is small. We demonstrate this on a toy example of learning the function $f: \mathbb{R}^2 \to \mathbb{R}$ which is the identity on the first of two inputs. The task is defined by the loss $L$:

$$L = \sum_i |m(x_{1,i}, x_{2,i}) - f(x_{1,i}, x_{2,i})| = \sum_i |m(x_{1,i}, x_{2,i}) - x_{1,i}|, \quad \text{where } m = \text{Naive NPU with } (W_r, W_i) \in \mathbb{R}^{1\times 2} \tag{12}$$

and $x_1 \sim U(0, 2)$, $x_2 \sim U(0, 0.05)$. The left plot in Fig. 3 depicts the gradient norm $G(W_r) = \lVert \partial L / \partial W_r \rVert$ of the Naive NPU for a batch of two-dimensional inputs. Even in this simple example, the gradient of the Naive NPU is close to zero in large parts of the parameter space. This can be explained as follows. One row of Naive NPU weights effectively raises each input to a power and multiplies the results: $x_1^{w_1} x_2^{w_2} \cdots x_n^{w_n}$. If a single input $x_i$ is constantly close to zero (i.e. irrelevant), the whole row will be close to zero no matter what its weights are, and the gradient information on all other weights is lost. Therefore, we introduce a gate on the input of our layer that can turn irrelevant inputs into 1s. A diagram of the NPU is shown in Fig. 2.

Figure 2: NPU diagram. The NPU has a relevance gate g (hatched background) in front of the input to the unit to prevent zero gradients.

Figure 3: Gradient norm G of the Naive NPU and the NPU for the task of learning the identity on x1 (black areas are beyond the color scale). Inputs and loss are defined in Eq. 12. The correct solution is w1 = 1 and w2 = 0. The Naive NPU has a large zero-gradient region for w2 > 0.75, while the NPU's surface is much more informative. The gates in the central plot are fixed at g1 = g2 = 0.5, which corresponds to the initial gate parameters. During training they adjust as needed, in this case to g1 = 1 and g2 = 0. Wi is set to zero in all plots.

**Definition (NPU).** The NPU extends the Naive NPU by the relevance gate $g$ on the input $x$:

$$y = \exp(W_r \log r - \pi W_i k) \odot \cos(W_i \log r + \pi W_r k), \quad \text{where} \tag{14}$$
$$r = \hat{g} \odot (|x| + \epsilon) + (1 - \hat{g}), \qquad k_i = \begin{cases} 0 & x_i \geq 0 \\ \hat{g}_i & x_i < 0 \end{cases}, \qquad \hat{g}_i = \min(\max(g_i, 0), 1), \tag{15}$$

with inputs $x$ and learnt parameters $W_r$, $W_i$, and $g$.

The central plot of Fig. 3 shows the gradient norm $G$ of the NPU on the identity task with its initial gate setting of $g_1 = g_2 = 0.5$. The large zero-gradient region of the Naive NPU is gone. The last plot shows the same for $g_1 = 1$ and $g_2 = 0$, which corresponds to the correct gates at the end of NPU training. The gradient is independent of $w_2$, which means that $w_2$ can easily be pruned by a simple regularization such as L1. In Sec. 4.3 we show how important the relevance gating mechanism is for the convergence and sparsity of large models. Sparsity is especially important in order to use the NPU as a transparent model.

**Initialization.** We recommend initializing the NPU with a Glorot uniform distribution on the real weights $W_r$. The imaginary weights $W_i$ can be initialized to zeros, so they will only be used where necessary, and the gate $g$ with 0.5, so the NPU can choose to output 1.

**Definition (Real NPU).** In many practical tasks, such as multiplication or division, the final value of $W_i$ should be equal to zero. We denote the NPU with the parameters for the imaginary part removed as the Real NPU and study the impact of this change on convergence in Sec. 4.
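A minimal Julia sketch of the gated forward pass in Eqs. 14-15, with the Real NPU as the special case $W_i = 0$; as before, this is an illustration rather than the reference implementation:

```julia
# Minimal sketch of the relevance-gated NPU forward pass (Eqs. 14-15).
# g is the relevance-gate vector (one entry per input); a gated-off input
# (ĝ_i = 0) is replaced by 1 and no longer zeroes out the whole product.
function npu(Wr, Wi, g, x; ϵ=eps(Float32))
    ĝ = clamp.(g, 0, 1)
    r = ĝ .* (abs.(x) .+ ϵ) .+ (1 .- ĝ)   # irrelevant inputs are pushed towards 1
    k = ĝ .* (x .< 0)                     # k_i = ĝ_i for negative inputs, 0 otherwise
    logr = log.(r)
    return exp.(Wr * logr .- π .* (Wi * k)) .* cos.(Wi * logr .+ π .* (Wr * k))
end

# The Real NPU simply drops the imaginary parameters (Wi = 0).
real_npu(Wr, g, x; kwargs...) = npu(Wr, zero(Wr), g, x; kwargs...)

# With g = [1, 0] the second input is ignored: the unit computes x1^1 regardless of x2.
npu([1.0 5.0], zeros(1, 2), [1.0, 0.0], [2.0, 0.01])   # ≈ [2.0]
```

The example shows the effect discussed above: even though the second input is close to zero and carries a large weight, the closed gate maps it to 1, so it neither collapses the output nor destroys the gradient for the first weight.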
## 4 Experiments

In Sec. 4.1, we show how the NPU can help to build better NODE models. Additionally, we use the Real NPU as a highly transparent model from which we can directly recover the generating equation of an ODE containing fractional powers. The subsequent Secs. 4.2 and 4.3 compare the NPU to prior art (NALU and NMU) on arithmetic tasks typically used to benchmark arithmetic units.

### 4.1 A Step Towards Equation Discovery of an Epidemiological Model

Data-driven models such as SINDy [Champion et al., 2019] or Neural Ordinary Differential Equations (NODE, Chen et al. [2019]) are used more and more in scientific applications. Recently, Universal Differential Equations (UDEs, Rackauckas et al. [2020]) were introduced, which aim to combine data-driven models with physically informed differential equations to maximize the interpretability/explainability of the resulting models. If an ODE model is composed of dense layers, its direct interpretation is problematic and has to be performed retrospectively. The class of models based on SINDy is transparent by design; however, it can only provide an explanation within a linear combination of a predefined set of basis functions. Thus, it cannot learn models with unknown fractional powers. With this experiment we aim to show that the NPU can potentially discover exact ODE models.³

An example of an ODE that contains powers is a modification of the well-known epidemiological SIR model [Kermack et al., 1927] to fractional powers (fSIR, Taghvaei et al. [2020]), which was shown to be a beneficial modification for modelling the COVID-19 outbreak. The SIR model is built from three variables: S (susceptible), I (infectious), and R (recovered/removed). Arguably the most important part of the model is the transmission rate r, which is typically taken to be proportional to the product of S and I. Taghvaei et al. [2020] argue that, especially in the initial phase of an epidemic, the boundary areas of infected and susceptible cells scale with a fractional power, which leads to Eq. 17:

$$\frac{dS}{dt} = -r(t) + \eta R(t), \qquad \frac{dI}{dt} = r(t) - \alpha I(t), \qquad \frac{dR}{dt} = \alpha I(t) - \eta R(t), \tag{16}$$
$$r(t) = \beta I(t)^{\gamma} S(t)^{\kappa}. \tag{17}$$

We have numerically simulated one realization of the fSIR model with the parameters α = 0.05, β = 0.06, η = 0.01, γ = κ = 0.5, in 40 time steps that are equally spaced in the time interval T = (0, 200), such that the training data $X = [S_t, I_t, R_t]_{t=1}^{40}$ contains one time series each for S, I, and R. The initial conditions $u_0 = [S_0, I_0, R_0]$ are set to S0 = 100, I0 = 0.01, and R0 = 0, as shown in Figure 4 (right).

We fit the data with three different NODEs composed of different model types: a dense network, the NPU, and the Real NPU. An exemplary model is NPU = Chain(NPU(3, h), NAU(h, 3)) with variable hidden size h. The detailed models are defined in Tab. A1. The training objective is the loss L with L1 regularization:

$$L = \mathrm{MSE}(X, \mathrm{NODE}_\theta(u_0)) + \beta \lVert\theta\rVert_1. \tag{18}$$

We train each model for 3000 steps with the ADAM optimizer and a learning rate of 0.005, and subsequently with LBFGS until convergence (or for a maximum of 1000 steps). For each model type, we run a small grid search to build a Pareto front with h ∈ {6, 9, 12, 15, 20} and β ∈ {0, 0.01, 0.1, 1}, where each hyper-parameter pair is run five times.

³The demonstration given here is not intended to be used in practice. For real-world predictions in such a sensitive area, much more post-processing is needed to ensure safe predictions.
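For concreteness, the data simulation described above can be sketched with DifferentialEquations.jl as follows. The solver choice (Tsit5) and the exact time grid are assumptions made for illustration; the NODE models that are fitted to X follow Tab. A1 (e.g. Chain(NPU(3, h), NAU(h, 3))), and their training loop is omitted here.

```julia
# Minimal sketch of the fSIR data generation (Eqs. 16-17).
using DifferentialEquations

function fsir!(du, u, p, t)
    S, I, R = u
    α, β, η, γ, κ = p
    r = β * I^γ * S^κ            # fractional transmission rate, Eq. 17
    du[1] = -r + η * R           # dS/dt
    du[2] =  r - α * I           # dI/dt
    du[3] =  α * I - η * R       # dR/dt
end

p  = (0.05, 0.06, 0.01, 0.5, 0.5)               # α, β, η, γ, κ
u0 = [100.0, 0.01, 0.0]                          # S0, I0, R0
prob = ODEProblem(fsir!, u0, (0.0, 200.0), p)
sol  = solve(prob, Tsit5(), saveat=LinRange(0.0, 200.0, 40))
X    = Array(sol)                                # 3×40 training matrix [S_t, I_t, R_t]
```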
The resulting Pareto front is shown on the left of Fig. 4. The NPU reaches much sparser and better solutions than the dense network. The Real NPU has problems converging in the majority of cases; however, there are a few models in the bottom left that reach a very low MSE with very few parameters.

Figure 4: Left: Pareto fronts (testing MSE over number of parameters) of the dense network, NPU, and Real NPU. Right: the true S, I, R time series. The NPU reaches solutions with lower MSE and fewer parameters than the dense net. The Real NPU mostly yields worse results than the NPU; just in a few cases it converges to very sparse models with good MSE.

The best of these models is shown in Fig. 5. It looks strikingly similar to the fSIR model in matrix form:

$$\begin{bmatrix} -\beta & 0 & \eta \\ \beta & -\alpha & 0 \\ 0 & \alpha & -\eta \end{bmatrix} \tag{19}$$

Reading Fig. 5 from right to left, we can extract the ODE that the Real NPU represents. The first hidden variable correctly identifies the transmission rate as a product of two fractional powers, $r = I^{\gamma} S^{\kappa}$ with κ = 0.57 and γ = 0.62, which is close to the true values γ = κ = 0.5. The second, third, and last hidden variables were found to be irrelevant (the relevance gate returns 1). The fourth hidden variable is a selector of the second input I, and the fifth hidden variable is a selector of a power of R, namely R^0.64. In the second layer, the NAU combines the correct hidden outputs from the NPU such that S is composed of the negative transmission rate r and positive R. I and R are also composed of the correct hidden variables, with the parameters α, β, η not far off from the truth. We conclude that even with this very naive approach, the Real NPU can recover something close to the true fractional SIR model.

Figure 5: Visualization of the best Real NPU. Reading from right to left, it takes the SIR variables as input, then applies the NPU and the NAU. It correctly identifies r as a fractional product in the NPU, and gets the rest of the fSIR parameters almost right in the NAU.

In summary, the NPU can work well in sequential tasks, and we have shown that we can reach highly transparent results with the Real NPU; in practice, however, using the Real NPU might be difficult due to its lower success rate. With a more elaborate analysis, it should be possible to reach the same solutions with the full NPU and, e.g., a strong regularization of its imaginary parameters.

### 4.2 Simple Arithmetic Task

In this experiment we compare six different layers (NPU, Real NPU, NMU, NALU, iNALU, Dense) on a small problem with two inputs and four outputs. The objective is to learn the function $f: \mathbb{R}^2 \to \mathbb{R}^4$ with a standard MSE loss:

$$f(x, y) = (x + y,\; xy,\; x/y,\; \sqrt{x})^T =: t, \tag{20}$$
$$L = \sum_{i=1}^{4} \big(\mathrm{model}(x, y)_i - f(x, y)_i\big)^2 = \mathrm{MSE}(\hat{t}, t). \tag{21}$$

Learning the function f requires not only learning the correct arithmetic operations, but also separating them cleanly, which tests the gating mechanisms of the layers. Each model has two layers with a hidden dimension h; e.g. the NPU model is defined by NPU = Chain(NPU(2, h = 6), NAU(h = 6, 4)). The remaining models used in the tables and plots are given in Tab. A3. To obtain valid results in the case of division, we train on positive, non-zero inputs but test on negative, non-zero numbers (except for the test inputs to the square root):

$$(x_\text{train}, y_\text{train}) \sim U(0.1, 2), \qquad (x_\text{test}, y_\text{test}) \in \mathcal{R}(-4.1{:}0.2{:}4), \qquad (x_\text{test,sqrt}, y_\text{test,sqrt}) \in \mathcal{R}(0.1{:}0.1{:}4), \tag{22}$$

where $\mathcal{R}$ denotes a range with start, step, and end. We train each model for 20,000 steps with the ADAM optimizer, a learning rate of 0.001, and a batch size of 100. The input samples are generated on the fly during training.
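A minimal sketch of the task setup in Eqs. 20-22 (target function, sampling, and objective); the models themselves follow Tab. A3:

```julia
# Simple arithmetic task (Eqs. 20-22): target, training samples, and test grids.
f(x, y) = [x + y, x * y, x / y, sqrt(x)]          # Eq. 20 target

train_sample() = 0.1 .+ 1.9 .* rand(2)            # (x, y) ~ U(0.1, 2), drawn on the fly

xtest      = collect(-4.1:0.2:4)                  # Eq. 22: negative, non-zero test inputs
xtest_sqrt = collect(0.1:0.1:4)                   # square root is only tested on positive inputs

mse(t̂, t) = sum(abs2, t̂ .- t) / length(t)         # Eq. 21 objective (MSE over the four outputs)
```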
Fig. 6 shows the error surface of the best of 20 models on each task, and Tab. A2 lists the corresponding averaged testing errors of all 20 models. Both NPUs successfully learn (+, ×, ÷, √) and clearly outperform the NALU and iNALU on all tasks. Surprisingly, the NALU has problems extrapolating in this task, which, as Schlör et al. [2020] suggest, might be due to its gating mechanism. The NPUs are on par with the NMU for (+), but the NMU is better at (×) due to its inductive bias; the NMU cannot learn (÷, √). The fact that the Real NPU performs slightly better than the NPU indicates that the task is easy enough to not require the imaginary parameters to help convergence. In such a case, the Real NPU generalizes better because it corresponds exactly to the task it is trying to learn.

Figure 6: Comparison of the extrapolation quality of the different models learning Eq. 20, shown as log(|t̂_i − t_i|) over the test inputs for each operation. Each column represents the best model of 20 runs that were trained on the range U(0.1, 2). Lighter color implies lower error.

### 4.3 Large Scale Arithmetic Task

One of the most important properties of a layer in a neural network is its ability to scale. With the large-scale arithmetic task we show that the NPU works reliably on many-input tasks that are heavily over-parametrized. In this section we compare the NALU, NMU, NPU, Real NPU, and the Naive NPU on a task that is identical to the arithmetic task analysed by Madsen and Johansen [2020] and Trask et al. [2018]. The goal is to sum two subsets of a 100-dimensional vector and apply an operation (like ×) to the two sums. The dataset generation is defined in Eq. 23, with the parameters from Tab. A5:

$$a = \sum_{i=s_{1,\text{start}}}^{s_{1,\text{end}}} x_i, \qquad b = \sum_{i=s_{2,\text{start}}}^{s_{2,\text{end}}} x_i, \qquad y_\text{add} = a + b, \quad y_\text{mul} = a \cdot b, \quad y_\text{div} = 1/a, \quad y_\text{sqrt} = \sqrt{a}, \tag{23}$$

where the starting and ending values $s_{i,\text{start}}$, $s_{i,\text{end}}$ of the summations are chosen such that a and b come from subsets of the input vector x with a given overlap. The training objective is the standard MSE, regularized with L1:

$$L = \mathrm{MSE}(\mathrm{model}(x), y) + \beta \lVert\theta\rVert_1, \tag{24}$$

where β is scheduled to be low in the beginning of training and stronger towards the end. Specifics of the used models and their hyper-parameters are given in Tab. A4 & A6.

Madsen and Johansen [2020] perform an extensive analysis of this task with different subset and overlap ratios, varying model and input sizes, and much more, establishing that the combination of NAU/NMU outperforms the NALU. We focus on the comparison of NPU, Real NPU, NMU, and NALU with the default parameters of Madsen and Johansen [2020], which set the subset ratio to 0.5 and the overlap ratio to 0.25 (details in Tab. A5). We include the Naive NPU (without the relevance gate) to show how important the gating mechanism is for both sparsity and overall performance.
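For reference, the data generation of Eq. 23 can be sketched as below. The concrete subset indices and the training input range are assumptions chosen for illustration to roughly match the paper's defaults (subset ratio 0.5, overlap ratio 0.25); the exact values are given in Tab. A5.

```julia
# Sketch of the Eq. 23 data generation: two sums over overlapping subsets
# of a 100-dimensional input vector.
function large_scale_sample(; n=100, lo=1.0, hi=2.0)
    x = lo .+ (hi - lo) .* rand(n)      # assumed training range; extrapolation uses larger inputs
    a = sum(x[1:50])                    # first subset sum (50 of 100 inputs)
    b = sum(x[39:88])                   # second subset sum, overlapping the first in 12 indices
    return x, (add = a + b, mul = a * b, div = 1 / a, root = sqrt(a))
end

x, y = large_scale_sample()
y.mul                                    # target for the multiplication task
```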
Fig. 7 plots the testing errors over the number of non-zero parameters for all models and tasks. The addition plot shows that NMU, NPU, and Real NPU successfully learn and extrapolate on (+), with the NMU converging to the sparsest and most accurate models. On (×), the best NMU models outperform the NPU and Real NPU, but some NMUs do not converge at all. The testing MSE of the NALU is so large that it is excluded from the plot. On (÷, √), the NPU clearly outperforms all other layers in both MSE and sparsity.

Figure 7: Testing MSE over the number of non-zero parameters (w_i > 0.001) of the large-scale arithmetic task (panels: addition, multiplication, division, square root). The NMU outperforms the NPU on its native tasks, addition and multiplication. The NPU is the best at division and square root. The Naive NPU without the relevance gate is far off, because it does not have the necessary gradient signal to converge, as discussed in Sec. 3.2.

Table 1: Testing errors of the large-scale arithmetic task. Each value is the median (± median absolute deviation) of 10 runs.

| Task | NPU | Real NPU | NALU | NMU | Naive NPU |
|------|-----|----------|------|-----|-----------|
| + | 0.092 ± 0.031 | 0.063 ± 0.014 | 740.0 ± 330.0 | 0.00602 ± 0.00019 | 161.65 ± 0.11 |
| × | 4.28 ± 0.9 | 3.09 ± 0.74 | 2.9e83 ± 2.9e83 | 1.7 ± 1.4 | 3750.0 ± 870.0 |
| ÷ | 1.0e-7 ± 1.0e-7 | 1.4e-6 ± 4.0e-7 | 530.0 ± 200.0 | 1.622 ± 0.081 | 5.4e17 ± 5.4e17 |
| √ | 0.054 ± 0.0078 | 0.017 ± 0.011 | 7300.0 ± 7200.0 | 10.96 ± 0.89 | 9.3e8 ± 9.3e8 |

Generally, the difference between the Naive NPU and the other NPUs is huge and demonstrates how important the relevance gate is for both convergence and sparsity. The NPUs with relevance gates effectively convert irrelevant inputs to 1s, while the Naive NPU is stuck on the zero-gradient plateau.

## 5 Conclusion

We introduced the Neural Power Unit, which addresses the deficiencies of current arithmetic units: it can learn multiplication, division, and arbitrary power functions for positive, negative, and small numbers. We showed that the NPU outperforms its main competitor (NALU) and reaches performance that is on par with the multiplication specialist NMU (Secs. 4.2 and 4.3). Additionally, we have demonstrated that the NPU converges consistently, even on sequential tasks. The Real NPU can be used as a transparent model that is capable of recovering the governing equations of dynamical systems purely from data (Sec. 4.1).

## Broader Impact

Current neural network architectures are often perceived as black-box models that are difficult to explain or interpret. This becomes highly problematic if ML models are involved in high-stakes decisions in, e.g., criminal justice, healthcare, or control systems. With the NPU, we hope to contribute to the broad topic of interpretable machine learning, with a focus on scientific applications. Additionally, learning to abstract (mathematical) ideas and to extrapolate is a fundamental goal that might contribute to more reliable machine learning systems. However, the inductive biases that are used to increase transparency in the NPU can cause the model to ignore subgroups in the data. This is not an issue for learning arithmetic operations, but could lead to biased models in more general use cases. The methodology presented in the experiments in Sec. 4.1 is not intended to be used for real-world epidemiological predictions; it merely demonstrates that the NPU can learn an ODE with fractional powers. For an application of the NPU, much more post-processing has to be done to ensure the reliability of the results.

## Acknowledgements and Disclosure of Funding

The research presented in this work has been supported by the Grant Agency of the Czech Republic no. 18-21409S. The authors also acknowledge the support of the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics". We thank the authors of the Julia packages Flux.jl [Innes et al., 2018] and DifferentialEquations.jl [Rackauckas and Nie, 2017].

## References

Kathleen Champion, Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445-22451, November 2019. doi: 10.1073/pnas.1906995116.

Kaiyu Chen, Yihan Dong, Xipeng Qiu, and Zitian Chen. Neural Arithmetic Expression Calculator. arXiv:1809.08590, September 2018. URL http://arxiv.org/abs/1809.08590.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural Ordinary Differential Equations. arXiv:1806.07366, December 2019. URL http://arxiv.org/abs/1806.07366.

Stanislas Dehaene. The Number Sense: How the Mind Creates Mathematics, Revised and Updated Edition. Oxford University Press, April 2011. ISBN 978-0-19-991039-7.

Lukas Faber and Roger Wattenhofer. Neural Status Registers. arXiv:2004.07085, April 2020. URL http://arxiv.org/abs/2004.07085.
C. R. Gallistel. Finding numbers in the brain. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1740):20170119, February 2018. doi: 10.1098/rstb.2017.0119.

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. arXiv:1410.5401, December 2014. URL http://arxiv.org/abs/1410.5401.

Michael Innes, Elliot Saba, Keno Fischer, Dhairya Gandhi, Marco Concetto Rudilosso, Neethu Mariya Joy, Tejan Karmali, Avik Pal, and Viral Shah. Fashionable Modelling with Flux. arXiv:1811.01457, November 2018. URL http://arxiv.org/abs/1811.01457.

Łukasz Kaiser and Ilya Sutskever. Neural GPUs Learn Algorithms. arXiv:1511.08228, March 2016. URL http://arxiv.org/abs/1511.08228.

Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid Long Short-Term Memory. arXiv:1507.01526, January 2016. URL http://arxiv.org/abs/1507.01526.

William Ogilvy Kermack, A. G. McKendrick, and Gilbert Thomas Walker. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London. Series A, 115(772):700-721, August 1927. doi: 10.1098/rspa.1927.0118.

Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural Random-Access Machines. arXiv:1511.06392, February 2016. URL http://arxiv.org/abs/1511.06392.

Brenden M. Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv:1711.00350, June 2018. URL http://arxiv.org/abs/1711.00350.

Guillaume Lample and François Charton. Deep Learning for Symbolic Mathematics. arXiv:1912.01412, December 2019. URL http://arxiv.org/abs/1912.01412.

Zachary C. Lipton. The Mythos of Model Interpretability. arXiv:1606.03490, March 2017. URL http://arxiv.org/abs/1606.03490.

Andreas Madsen and Alexander Rosenberg Johansen. Measuring Arithmetic Extrapolation Performance. arXiv:1910.01888, November 2019. URL http://arxiv.org/abs/1910.01888.

Andreas Madsen and Alexander Rosenberg Johansen. Neural Arithmetic Units. arXiv:2001.05016, 2020. URL http://arxiv.org/abs/2001.05016.

Christopher Rackauckas and Qing Nie. DifferentialEquations.jl - A Performant and Feature-Rich Ecosystem for Solving Differential Equations in Julia. Journal of Open Research Software, 5:15, May 2017. doi: 10.5334/jors.151.

Christopher Rackauckas, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, and Ali Ramadhan. Universal Differential Equations for Scientific Machine Learning. arXiv:2001.04385, January 2020. URL http://arxiv.org/abs/2001.04385.

Cynthia Rudin. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. arXiv:1811.10154, September 2019. URL http://arxiv.org/abs/1811.10154.

David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing Mathematical Reasoning Abilities of Neural Models. arXiv:1904.01557, April 2019. URL http://arxiv.org/abs/1904.01557.
Daniel Schlör, Markus Ring, and Andreas Hotho. iNALU: Improved Neural Arithmetic Logic Unit. arXiv:2003.07629, March 2020. URL http://arxiv.org/abs/2003.07629.

Mirac Suzgun, Yonatan Belinkov, and Stuart M. Shieber. On Evaluating the Generalization of LSTM Models in Formal Languages. 2018.

Amirhossein Taghvaei, Tryphon T. Georgiou, Larry Norton, and Allen R. Tannenbaum. Fractional SIR Epidemiological Models. Preprint, Epidemiology, April 2020. URL http://medrxiv.org/lookup/doi/10.1101/2020.04.28.20083865.

Andrew Trask, Felix Hill, Scott Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural Arithmetic Logic Units. arXiv:1808.00508, August 2018. URL http://arxiv.org/abs/1808.00508.