Published as a conference paper at ICLR 2020

NEURAL ARITHMETIC UNITS

Andreas Madsen, Computationally Demanding, amwebdk@gmail.com
Alexander Rosenberg Johansen, Technical University of Denmark, aler@dtu.dk

ABSTRACT

Neural networks can approximate complex functions, but they struggle to perform exact arithmetic operations over real numbers. The lack of inductive bias for arithmetic operations leaves neural networks without the underlying logic necessary to extrapolate on tasks such as addition, subtraction, and multiplication. We present two new neural network components: the Neural Addition Unit (NAU), which can learn exact addition and subtraction, and the Neural Multiplication Unit (NMU), which can multiply subsets of a vector. The NMU is, to our knowledge, the first arithmetic neural network component that can learn to multiply elements from a vector when the hidden size is large. The two new components draw inspiration from a theoretical analysis of recently proposed arithmetic components. We find that careful initialization, restricting the parameter space, and regularizing for sparsity are important when optimizing the NAU and NMU. Compared with previous neural units, our proposed NAU and NMU converge more consistently, have fewer parameters, learn faster, can converge for larger hidden sizes, obtain sparse and meaningful weights, and can extrapolate to negative and small values.¹

1 INTRODUCTION

When studying intelligence, insects, reptiles, and humans have been found to possess neurons with the capacity to hold integers and real numbers and to perform arithmetic operations (Nieder, 2016; Rugani et al., 2009; Gallistel, 2018). In our quest to mimic intelligence, we have put much faith in neural networks, which in turn have provided unparalleled and often superhuman performance on tasks requiring high cognitive abilities (Silver et al., 2016; Devlin et al., 2018; OpenAI et al., 2018). However, when using neural networks to solve simple arithmetic problems, such as counting, multiplication, or comparison, they systematically fail to extrapolate onto unseen ranges (Lake & Baroni, 2018; Suzgun et al., 2019; Trask et al., 2018). The absence of inductive bias makes it difficult for neural networks to extrapolate well on arithmetic tasks, as they lack the underlying logic to represent the required operations.

A neural component that can solve arithmetic problems should be able to take an arbitrary hidden input, learn to select the appropriate elements, and apply the desired arithmetic operation. A recent attempt to achieve this goal is the Neural Arithmetic Logic Unit (NALU) by Trask et al. (2018). The NALU models the inductive bias explicitly via two sub-units: the NAC+ for addition/subtraction and the NAC• for multiplication/division. The sub-units are gated between softly, using a sigmoid function, with the intent of exclusively selecting one of them. However, we find that the soft gating mechanism and the NAC• are fragile and hard to learn.

In this paper, we analyze and improve upon the NAC+ and NAC• with respect to addition, subtraction, and multiplication. Our proposed improvements, namely the Neural Addition Unit (NAU) and the Neural Multiplication Unit (NMU), are more theoretically founded and improve performance regarding stability, speed of convergence, and interpretability of weights.
Most importantly, the NMU supports both negative and small numbers and a large hidden input-size, which is paramount as neural networks are overparameterized and hidden values are often unbounded. The improvements, which are based on a theoretical analysis of the NALU and its components, are achieved by a simplification of the parameter matrix for a better gradient signal, a sparsity regularizer, and a new multiplication unit that can be optimally initialized. The NMU does not support division. However, we find that the NAC• in practice also only supports multiplication and cannot learn division (a theoretical analysis of division is given in section 2.3).

¹Implementation is available on GitHub: https://github.com/AndreasMadsen/stable-nalu.

Figure 1: Visualization of the NMU, where each weight W_{i,j} controls gating between 1 (identity) and x_i; the intermediate results are then multiplied explicitly to form z_j.

To analyze the impact of each improvement, we introduce several variants of the NAC•. We find that allowing division makes optimization for multiplication harder, that linear and regularized weights improve convergence, and that the NMU's way of multiplying is critical when increasing the hidden size. Furthermore, we improve upon the existing benchmarks in Trask et al. (2018) by expanding the "simple function task", expanding the "MNIST Counting and Arithmetic Tasks" with a multiplicative task, and using an improved success-criterion (Madsen & Johansen, 2019). This success-criterion is important because the arithmetic units are solving a logical problem. We propose the MNIST multiplication variant because we want to test the NMU's and NAC•'s ability to learn from real data and extrapolate.

1.1 LEARNING A 10 PARAMETER FUNCTION

Consider the static function t = (x1 + x2) · (x1 + x2 + x3 + x4) for x ∈ ℝ⁴. To illustrate the ability of the NAC•, NALU, and our proposed NMU, we conduct 100 experiments for each model to learn this function. Table 1 shows that the NMU has a higher success rate and converges faster.

Table 1: Comparison of the success-rate, the iteration at which the model converged, and the sparsity error for all weight matrices, with 95% confidence intervals, on the t = (x1 + x2) · (x1 + x2 + x3 + x4) task. Each value is a summary of 100 different seeds.

Op | Model | Success rate | Solved at iteration step (median) | Solved at iteration step (mean) | Sparsity error (mean)
× | NAC• | 13% (+8%, −5%) | 5.5·10⁴ | 5.9·10⁴ (+7.8·10³, −6.6·10³) | 7.5·10⁻⁶ (+2.0·10⁻⁶, −2.0·10⁻⁶)
× | NALU | 26% (+9%, −8%) | 7.0·10⁴ | 7.8·10⁴ (+6.2·10³, −8.6·10³) | 9.2·10⁻⁶ (+1.7·10⁻⁶, −1.7·10⁻⁶)
× | NMU | 94% (+3%, −6%) | 1.4·10⁴ | 1.4·10⁴ (+2.2·10², −2.1·10²) | 2.6·10⁻⁸ (+6.4·10⁻⁹)

2 INTRODUCING DIFFERENTIABLE BINARY ARITHMETIC OPERATIONS

We define our problem as learning a set of static arithmetic operations between selected elements of a vector, e.g. for a vector x, learn the function (x5 + x1) · x7. The approach taken in this paper is to develop a unit for addition/subtraction and a unit for multiplication, and then let each unit decide which inputs to include using backpropagation. We develop these units by taking inspiration from a theoretical analysis of the Neural Arithmetic Logic Unit (NALU) by Trask et al. (2018).

2.1 INTRODUCING NALU

The Neural Arithmetic Logic Unit (NALU) consists of two sub-units: the NAC+ and the NAC•. The sub-units represent the {+, −} and the {×, ÷} operations, respectively. The NALU then assumes that either NAC+ or NAC• will be selected exclusively, using a sigmoid gating-mechanism.
The NAC+ and NAC• are defined accordingly:

W_{h_\ell,h_{\ell-1}} = \tanh(\hat{W}_{h_\ell,h_{\ell-1}}) \, \sigma(\hat{M}_{h_\ell,h_{\ell-1}})    (1)

NAC+:  z_{h_\ell} = \sum_{h_{\ell-1}=1}^{H_{\ell-1}} W_{h_\ell,h_{\ell-1}} z_{h_{\ell-1}}    (2)

NAC•:  z_{h_\ell} = \exp\left( \sum_{h_{\ell-1}=1}^{H_{\ell-1}} W_{h_\ell,h_{\ell-1}} \log(|z_{h_{\ell-1}}| + \epsilon) \right)    (3)

where \hat{W}, \hat{M} \in \mathbb{R}^{H_\ell \times H_{\ell-1}} are weight matrices and z_{h_{\ell-1}} is the input. The matrices are combined using a tanh-sigmoid transformation to bias the parameters towards a {−1, 0, 1} solution. Having {−1, 0, 1} allows the NAC+ to compute exact {+, −} operations between elements of a vector. The NAC• uses an exponential-log transformation for the {×, ÷} operations, which works within ε precision and for positive inputs only.

The NALU combines these units with a gating mechanism z = g ⊙ NAC+ + (1 − g) ⊙ NAC•, given g = σ(Gx), thus allowing the NALU to decide between all of {+, −, ×, ÷} using backpropagation.

2.2 WEIGHT MATRIX CONSTRUCTION AND THE NEURAL ADDITION UNIT

Glorot & Bengio (2010) show that E[z_{h_\ell}] = 0 at initialization is a desired property, as it prevents an explosion of both the output and the gradients. To satisfy this property with W_{h_{\ell-1},h_\ell} = \tanh(\hat{W}_{h_{\ell-1},h_\ell}) \sigma(\hat{M}_{h_{\ell-1},h_\ell}), an initialization must satisfy E[\tanh(\hat{W}_{h_{\ell-1},h_\ell})] = 0. In NALU, this initialization is unbiased, as it samples evenly between + and −, or × and ÷. Unfortunately, this initialization also causes the expectation of the gradient to become zero, as shown in (4).

E\left[\frac{\partial L}{\partial \hat{M}_{h_{\ell-1},h_\ell}}\right] = E\left[\frac{\partial L}{\partial W_{h_{\ell-1},h_\ell}}\right] E\left[\tanh(\hat{W}_{h_{\ell-1},h_\ell})\right] E\left[\sigma'(\hat{M}_{h_{\ell-1},h_\ell})\right] = 0    (4)

Besides the issue of initialization, our empirical analysis (table 2) shows that the weight construction in (1) does not create the desired bias for {−1, 0, 1}. This bias is desired as it restricts the solution space to exact addition, and in section 2.5 also exact multiplication, which is an intrinsic property of an underlying arithmetic function. However, this bias does not necessarily restrict the output space, as a plain linear transformation will always be able to scale values accordingly. To solve these issues, we add a sparsifying regularizer to the loss function (L = \hat{L} + λ_{sparse} R_{\ell,sparse}) and use a simple linear construction, where W_{h_{\ell-1},h_\ell} is clamped to [−1, 1] in each iteration:

W_{h_{\ell-1},h_\ell} = \min(\max(W_{h_{\ell-1},h_\ell}, -1), 1)    (5)

R_{\ell,sparse} = \frac{1}{H_\ell H_{\ell-1}} \sum_{h_\ell=1}^{H_\ell} \sum_{h_{\ell-1}=1}^{H_{\ell-1}} \min\left(|W_{h_{\ell-1},h_\ell}|, 1 - |W_{h_{\ell-1},h_\ell}|\right)    (6)

NAU:  z_{h_\ell} = \sum_{h_{\ell-1}=1}^{H_{\ell-1}} W_{h_\ell,h_{\ell-1}} z_{h_{\ell-1}}    (7)

2.3 CHALLENGES OF DIVISION

The NAC•, as formulated in equation (3), has the capability to compute exact multiplication and division, or more precisely multiplication with the inverse of elements from a vector, when a weight in W_{h_{\ell-1},h_\ell} is −1. However, this flexibility creates critical optimization challenges. By expanding the exp-log transformation, the NAC• can be expressed as

z_{h_\ell} = \prod_{h_{\ell-1}=1}^{H_{\ell-1}} (|z_{h_{\ell-1}}| + \epsilon)^{W_{h_\ell,h_{\ell-1}}}.    (8)

In equation (8), if |z_{h_{\ell-1}}| is near zero (E[z_{h_{\ell-1}}] = 0 is a desired property when initializing (Glorot & Bengio, 2010)), W_{h_{\ell-1},h_\ell} is negative, and ε is small, then the output will explode. This issue is present even for a reasonably large ε value (such as ε = 0.1) and a just slightly negative W_{h_{\ell-1},h_\ell}, as visualized in figure 2. Also note that the curvature can cause convergence to an unstable area. This singularity issue in the optimization space also makes multiplication challenging, which further suggests that supporting division is undesirable. These observations are also found empirically in Trask et al. (2018, table 1) and in Appendix C.7.

Figure 2: RMS loss curvature for a NAC+ unit followed by a NAC•, shown for (a) ε = 10⁻⁷, (b) ε = 0.1, and (c) ε = 1. The weight matrices are constrained to W_1 = [[w_1, w_1, 0, 0], [w_1, w_1, w_1, w_1]] and W_2 = [w_2, w_2]. The problem is (x_1 + x_2) · (x_1 + x_2 + x_3 + x_4) for x = (1, 1.2, 1.8, 2). The solution is w_1 = w_2 = 1 in (a), with many unstable alternatives.
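To make the constructions in equations (1)–(3) concrete, the following is a minimal PyTorch sketch of the NAC+, NAC•, and the NALU gating. It is an illustration of the definitions above, not the released implementation (footnote 1), and the initialization and the choice of separate versus shared sub-unit weights are simplifying assumptions (the original NALU shares \hat{W} and \hat{M} between its sub-units; see Appendix C.5).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NACAdd(nn.Module):
    """NAC+ (eqs. 1-2): a linear layer with W = tanh(W_hat) * sigmoid(M_hat)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W_hat = nn.Parameter(torch.empty(out_features, in_features))
        self.M_hat = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.W_hat)  # illustrative initialization
        nn.init.xavier_uniform_(self.M_hat)

    def weight(self):
        return torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)

    def forward(self, z):
        return F.linear(z, self.weight())

class NACMul(NACAdd):
    """NAC. (eq. 3): multiplication/division through an exp-log transformation."""
    def __init__(self, in_features, out_features, eps=1e-7):
        super().__init__(in_features, out_features)
        self.eps = eps

    def forward(self, z):
        return torch.exp(F.linear(torch.log(torch.abs(z) + self.eps), self.weight()))

class NALU(nn.Module):
    """NALU: sigmoid gate between NAC+ and NAC. (separate sub-unit weights shown)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.nac_add = NACAdd(in_features, out_features)
        self.nac_mul = NACMul(in_features, out_features)
        self.G = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.G)

    def forward(self, z):
        g = torch.sigmoid(F.linear(z, self.G))  # g = sigmoid(Gx)
        return g * self.nac_add(z) + (1 - g) * self.nac_mul(z)
```

In this form, the issues discussed above are visible directly: at initialization E[tanh(\hat{W})] = 0 makes the expected gradient with respect to \hat{M} vanish (equation 4), and the log(|z| + ε) term in NACMul.forward is the source of the singularity illustrated in figure 2.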
2.4 INITIALIZATION OF NAC•

Initialization is important for fast and consistent convergence. A desired property is that the weights are initialized such that E[z_{h_\ell}] = 0 (Glorot & Bengio, 2010). Using a second-order Taylor approximation and assuming all z_{h_{\ell-1}} are uncorrelated, the expectation of the NAC• can be estimated as

E[z_{h_\ell}] \approx \left(1 + \tfrac{1}{2} Var[W_{h_\ell,h_{\ell-1}}] \log(|E[z_{h_{\ell-1}}]| + \epsilon)^2\right)^{H_{\ell-1}} \;\Rightarrow\; E[z_{h_\ell}] > 1.    (9)

As shown in equation (9), satisfying E[z_{h_\ell}] = 0 for the NAC• is likely impossible. The variance cannot be initialized input-independently and is expected to explode (proofs in Appendix B.3).

2.5 THE NEURAL MULTIPLICATION UNIT

To solve the gradient and initialization challenges of the NAC•, we propose a new unit for multiplication, the Neural Multiplication Unit (NMU):

W_{h_{\ell-1},h_\ell} = \min(\max(W_{h_{\ell-1},h_\ell}, 0), 1)    (10)

R_{\ell,sparse} = \frac{1}{H_\ell H_{\ell-1}} \sum_{h_\ell=1}^{H_\ell} \sum_{h_{\ell-1}=1}^{H_{\ell-1}} \min\left(W_{h_{\ell-1},h_\ell}, 1 - W_{h_{\ell-1},h_\ell}\right)    (11)

NMU:  z_{h_\ell} = \prod_{h_{\ell-1}=1}^{H_{\ell-1}} \left(W_{h_{\ell-1},h_\ell} z_{h_{\ell-1}} + 1 - W_{h_{\ell-1},h_\ell}\right)    (12)

The NMU is regularized similarly to the NAU and has a multiplicative identity when W_{h_{\ell-1},h_\ell} = 0. The NMU does not support division by design. As opposed to the NAC•, the NMU can represent inputs of both negative and positive values and is not ε-bounded, which allows the NMU to extrapolate to z_{h_{\ell-1}} that are negative or smaller than ε. Its gradients are derived in Appendix A.3.

2.6 MOMENTS AND INITIALIZATION

The NAU is a linear layer and can be initialized using Glorot & Bengio (2010). The NAC+ unit can also achieve an ideal initialization, although it is less trivial (details in Appendix B.2). The NMU is initialized with E[W_{h_\ell,h_{\ell-1}}] = 1/2. Assuming all z_{h_{\ell-1}} are uncorrelated and E[z_{h_{\ell-1}}] = 0, which is the case for most neural units (Glorot & Bengio, 2010), the expectation can be approximated as

E[z_{h_\ell}] \approx \left(E[W_{h_{\ell-1},h_\ell}] E[z_{h_{\ell-1}}] + 1 - E[W_{h_{\ell-1},h_\ell}]\right)^{H_{\ell-1}} = \left(\tfrac{1}{2}\right)^{H_{\ell-1}},    (13)

which approaches zero for H_{\ell-1} → ∞ (see Appendix B.4). The NMU can, assuming Var[z_{h_{\ell-1}}] = 1 and that H_{\ell-1} is large, be optimally initialized with Var[W_{h_{\ell-1},h_\ell}] = 1/4 (proof in Appendix B.4.3).

2.7 REGULARIZER SCALING

We use the regularizer scaling defined in (14). We motivate this by observing that optimization consists of two parts: a warmup period, where W_{h_{\ell-1},h_\ell} should get close to the solution, unhindered by the sparsity regularizer, followed by a period where the solution is made sparse.

\lambda_{sparse} = \hat{\lambda}_{sparse} \max\left(\min\left(\frac{t - \lambda_{start}}{\lambda_{end} - \lambda_{start}}, 1\right), 0\right)    (14)

2.8 CHALLENGES OF GATING BETWEEN ADDITION AND MULTIPLICATION

The purpose of the gating mechanism is to select either NAC+ or NAC• exclusively. This assumes that the correct sub-unit is selected by the NALU, since selecting the wrong sub-unit leaves no gradient signal for the correct sub-unit. Empirically, we find this assumption to be problematic. We observe that both sub-units converge at the beginning of training, whereafter the gating mechanism, seemingly at random, converges towards either the addition or the multiplication unit. Our study shows that gating behaves close to random for both the NALU and a gated NMU/NAU variant. However, when the gate correctly selects multiplication, our NMU converges much more consistently. We provide an empirical analysis in Appendix C.5 for both the NALU and a gated version of NAU/NMU.

As the output size grows, randomly choosing the correct gating value becomes exponentially harder. Because of these challenges, we leave solving the issue of sparse gating for future work and focus on improving the sub-units NAC+ and NAC•.
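The following is a minimal PyTorch sketch of the NAU (equations 5–7) and the NMU (equations 10–12), including the clamping, the sparsity regularizers, and the regularizer scaling from equation (14). It is a simplified illustration of the definitions above, not the released implementation (footnote 1); in particular, the uniform initialization interval for the NMU weights is an illustrative choice, not the variance derived in Appendix B.4.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NAU(nn.Module):
    """Neural Addition Unit (eqs. 5-7): a linear layer with W clamped to [-1, 1]."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.W)  # Glorot-style initialization (section 2.6)

    def clamp_weights(self):
        # Eq. (5): applied after every optimizer step.
        with torch.no_grad():
            self.W.clamp_(-1.0, 1.0)

    def sparsity_regularizer(self):
        # Eq. (6): biases the weights towards {-1, 0, 1}.
        return torch.mean(torch.min(self.W.abs(), 1 - self.W.abs()))

    def forward(self, z):
        return F.linear(z, self.W)


class NMU(nn.Module):
    """Neural Multiplication Unit (eq. 12): z_j = prod_i (W_ji * z_i + 1 - W_ji)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # Section 2.6 asks for E[W] = 1/2; the exact variance is derived in Appendix
        # B.4.3. A uniform initialization around 1/2 is used here as an illustration.
        self.W = nn.Parameter(torch.empty(out_features, in_features).uniform_(0.25, 0.75))

    def clamp_weights(self):
        # Eq. (10): applied after every optimizer step.
        with torch.no_grad():
            self.W.clamp_(0.0, 1.0)

    def sparsity_regularizer(self):
        # Eq. (11): biases the weights towards {0, 1}.
        return torch.mean(torch.min(self.W, 1 - self.W))

    def forward(self, z):
        # Broadcast to (batch, out, in); unselected inputs (W near 0) contribute 1.
        gated = self.W * z.unsqueeze(1) + 1 - self.W
        return torch.prod(gated, dim=-1)


def sparsity_lambda(t, lambda_hat, lambda_start, lambda_end):
    """Regularizer scaling from eq. (14): zero during warmup, then ramped to lambda_hat."""
    return lambda_hat * max(min((t - lambda_start) / (lambda_end - lambda_start), 1.0), 0.0)
```

During training, sparsity_lambda(t, ...) times each layer's sparsity_regularizer() would be added to the loss, and clamp_weights() would be called after every optimizer step; the two-layer multiplication model used in section 4.1.2 could then be composed as an NAU followed by an NMU, e.g. nn.Sequential(NAU(100, 2), NMU(2, 1)).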
3 RELATED WORK

Pure neural models using convolutions, gating, differentiable memory, and/or attention architectures have attempted to learn arithmetic tasks through backpropagation (Kaiser & Sutskever, 2016; Kalchbrenner et al., 2016; Graves et al., 2014; Freivalds & Liepins, 2017). Some of these results achieve close to perfect extrapolation. However, the models are constrained to only work with well-defined arithmetic setups having no input redundancy, a single operation, and one-hot representations of numbers for input and output. Our proposed models do not have these restrictions.

The Neural Arithmetic Expression Calculator (Chen et al., 2018) can learn real-number arithmetic by having neural network sub-components and repeatedly combining them through a memory-encoder-decoder architecture learned with hierarchical reinforcement learning. While this model has the ability to dynamically handle a larger variety of expressions compared to our solution, it requires an explicit definition of the operations, which we do not.

In our experiments, the NAU is used to do a subset-selection, which is then followed by either a summation or a multiplication. An alternative, fully differentiable version is to use a Gumbel-softmax that can perform exact subset-selection (Xie & Ermon, 2019). However, this is restricted to a predefined subset size, which is a strong assumption that our units are not limited by.

4 EXPERIMENTAL RESULTS

4.1 ARITHMETIC DATASETS

The arithmetic dataset is a replica of the "simple function task" by Trask et al. (2018). The goal is to sum two random contiguous subsets of a vector and apply an arithmetic operation to them, as defined in (15):

t = \left(\sum_{i=s_{1,start}}^{s_{1,end}} x_i\right) \circ \left(\sum_{i=s_{2,start}}^{s_{2,end}} x_i\right), \quad x \in \mathbb{R}^n,\; x_i \sim Uniform[r_{lower}, r_{upper}],\; \circ \in \{+, -, \times\},    (15)

where n (default 100), U[r_{lower}, r_{upper}] (interpolation default is U[1, 2] and extrapolation default is U[2, 6]), and other dataset parameters are used to assess learning capability (see details in Appendix C.1 and the effect of varying the parameters in Appendix C.4).

4.1.1 MODEL EVALUATION

We define the success-criterion as a solution that is acceptably close to a perfect solution. To evaluate if a model instance solves the task consistently, we compare the MSE to a nearly-perfect solution on the extrapolation range over many seeds. If W_1, W_2 define the weights of the fitted model, W_1^ε is nearly perfect, and W_2^* is perfect (example in equation 16), then the criterion for successful convergence is L_{W_1,W_2} < L_{W_1^\epsilon,W_2^*}, measured on the extrapolation error, for ε = 10⁻⁵. We report a 95% confidence interval using a binomial distribution (Wilson, 1927).

W_1^\epsilon = \begin{bmatrix} 1-\epsilon & 1-\epsilon & 0+\epsilon & 0+\epsilon \\ 1-\epsilon & 1-\epsilon & 1-\epsilon & 1-\epsilon \end{bmatrix}, \quad W_2^* = \begin{bmatrix} 1 & 1 \end{bmatrix}    (16)

To measure the speed of convergence, we report the first iteration for which L_{W_1,W_2} < L_{W_1^\epsilon,W_2^*} is satisfied, with a 95% confidence interval calculated using a gamma distribution with maximum likelihood profiling. Only instances that solved the task are included.
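A minimal NumPy sketch of this success-criterion for the t = (x1 + x2) · (x1 + x2 + x3 + x4) example, assuming an NAU followed by an NMU; the function names, the evaluation set size, and the seed are illustrative choices, not from the paper:

```python
import numpy as np

def extrapolation_mse(W1, W2, num_obs=10_000, seed=0):
    """MSE of an NAU->NMU weight configuration on the extrapolation range U[2, 6]
    for t = (x1 + x2) * (x1 + x2 + x3 + x4)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(2, 6, size=(num_obs, 4))
    t = (x[:, 0] + x[:, 1]) * (x[:, 0] + x[:, 1] + x[:, 2] + x[:, 3])
    h = x @ W1.T                                   # NAU layer: weighted sums
    y = np.prod(W2 * h[:, None, :] + 1 - W2, -1)   # NMU layer: gated product (eq. 12)
    return np.mean((y[:, 0] - t) ** 2)

def is_success(model_mse, eps=1e-5):
    """Success-criterion from section 4.1.1: the fitted model must beat the
    nearly-perfect reference solution (eq. 16) on the extrapolation range."""
    W1_eps = np.array([[1 - eps, 1 - eps, 0 + eps, 0 + eps],
                       [1 - eps, 1 - eps, 1 - eps, 1 - eps]])
    W2_star = np.array([[1.0, 1.0]])
    return model_mse < extrapolation_mse(W1_eps, W2_star)
```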
We assume an approximately discrete solution with parameters close to {−1, 0, 1} is important for inferring exact arithmetic operations. To measure the sparsity, we introduce a sparsity error (defined in equation 17). Similar to the convergence metric, we only include model instances that did solve the task and report the 95% confidence interval, which is calculated using a beta distribution with maximum likelihood profiling.

E_{sparsity} = \max_{h_{\ell-1},h_\ell} \min\left(|W_{h_{\ell-1},h_\ell}|,\; \left|1 - |W_{h_{\ell-1},h_\ell}|\right|\right)    (17)

4.1.2 ARITHMETIC OPERATION COMPARISON

We compare models on the different arithmetic operations {+, −, ×}. The multiplication models, NMU and NAC•, have an addition unit first, either NAU or NAC+, followed by a multiplication unit. The addition/subtraction models are two layers of the same unit. The NALU model consists of two NALU layers. See explicit definitions and regularization values in Appendix C.2. Each experiment is trained for 5·10⁶ iterations with early stopping based on the validation dataset, which uses the interpolation range (details in Appendix C.2).

The results are presented in table 2. For multiplication, the NMU succeeds more often and converges faster than the NAC• and NALU. For addition and subtraction, the NAU and NAC+ have a similar success-rate (100%), but the NAU is significantly faster at solving both tasks. Moreover, the NAU reaches a significantly sparser solution than the NAC+. Interestingly, a linear model has a hard time solving subtraction. A more extensive comparison is included in Appendix C.7 and an ablation study is included in Appendix C.3.

Table 2: Comparison of the success-rate, the first iteration reaching success, and the sparsity error, all with 95% confidence intervals, on the arithmetic dataset task. Each value is a summary of 100 different seeds.

Op | Model | Success rate | Solved at iteration step (median) | Solved at iteration step (mean) | Sparsity error (mean)
× | NAC• | 31% (+10%, −8%) | 2.8·10⁶ | 3.0·10⁶ (+2.9·10⁵, −2.4·10⁵) | 5.8·10⁻⁴ (+4.8·10⁻⁴, −2.6·10⁻⁴)
× | NALU | 0% (+4%, −0%) | – | – | –
× | NMU | 98% (+1%, −5%) | 1.4·10⁶ | 1.5·10⁶ (+5.0·10⁴, −6.6·10⁴) | 4.2·10⁻⁷ (+2.9·10⁻⁸)
+ | NAC+ | 100% (+0%, −4%) | 2.5·10⁵ | 4.9·10⁵ (+5.2·10⁴, −4.5·10⁴) | 2.3·10⁻¹ (+6.5·10⁻³, −6.5·10⁻³)
+ | Linear | 100% (+0%, −4%) | 6.1·10⁴ | 6.3·10⁴ (+2.5·10³, −3.3·10³) | 2.5·10⁻¹ (+3.6·10⁻⁴, −3.6·10⁻⁴)
+ | NALU | 14% (+8%, −5%) | 1.5·10⁶ | 1.6·10⁶ (+3.8·10⁵, −3.3·10⁵) | 1.7·10⁻¹ (+2.7·10⁻²)
+ | NAU | 100% (+0%, −4%) | 1.8·10⁴ | 3.9·10⁵ (+4.5·10⁴, −3.7·10⁴) | 3.2·10⁻⁵ (+1.3·10⁻⁵)
− | NAC+ | 100% (+0%, −4%) | 9.0·10³ | 3.7·10⁵ (+3.8·10⁴, −3.8·10⁴) | 2.3·10⁻¹ (+5.4·10⁻³, −5.4·10⁻³)
− | Linear | 7% (+7%, −4%) | 3.3·10⁶ | 1.4·10⁶ (+7.0·10⁵, −6.1·10⁵) | 1.8·10⁻¹ (+7.2·10⁻², −5.8·10⁻²)
− | NALU | 14% (+8%, −5%) | 1.9·10⁶ | 1.9·10⁶ (+4.4·10⁵, −4.5·10⁵) | 2.1·10⁻¹ (+2.2·10⁻²)
− | NAU | 100% (+0%, −4%) | 5.0·10³ | 1.6·10⁵ (+1.7·10⁴, −1.6·10⁴) | 6.6·10⁻² (+2.5·10⁻²)

4.1.3 EVALUATING THEORETICAL CLAIMS

To validate our theoretical claim that the NMU model works better than the NAC• for a larger hidden input-size, we increase the hidden size of the network, thereby adding redundant units. Redundant units are very common in neural networks, which are often overparameterized. Additionally, the NMU model is, unlike the NAC• model, capable of supporting inputs that are both negative and positive. To validate this empirically, the training and validation datasets are sampled from U[−2, 2] and then tested on U[−6, −2] ∪ U[2, 6]. The other ranges are defined in Appendix C.4.

Finally, for a fair comparison we introduce two new units: a variant of the NAC•, denoted NAC•,σ, that only supports multiplication by constraining the weights with W = σ(Ŵ); and a variant, named NAC•,NMU, that uses clamped linear weights and sparsity regularization identically to the NMU. Figure 3 shows that the NMU can handle a much larger hidden size and negative inputs. Furthermore, the results for NAC•,σ and NAC•,NMU validate that removing division and adding bias improve the success-rate, but this is not enough when the hidden size is large, as there is no ideal initialization. Interestingly, no models can learn U[1.1, 1.2], suggesting that certain input ranges might be troublesome.
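For reference, the sparsity error reported in table 2 and figure 3 (equation 17) reduces to a few lines; a minimal NumPy sketch over the model's weight matrices:

```python
import numpy as np

def sparsity_error(weight_matrices):
    """Sparsity error (eq. 17): distance of the single worst weight from {-1, 0, 1},
    taken over all weight matrices of the model."""
    return max(
        float(np.max(np.minimum(np.abs(W), np.abs(1 - np.abs(W)))))
        for W in weight_matrices
    )

# Example usage for a two-layer model: sparsity_error([W1, W2])
```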
Figure 3 (panels: success rate, solved at iteration step, sparsity error; x-axes: hidden size and interpolation range; models: NAC•,NMU, NAC•,σ, NAC•, NALU, NMU): Multiplication task results when varying the hidden input-size and when varying the input-range. Extrapolation ranges are defined in Appendix C.4.

4.2 PRODUCT OF SEQUENTIAL MNIST

To investigate whether a deep neural network can be optimized when backpropagating through an arithmetic unit, the arithmetic units are used as a recurrent unit over a sequence of MNIST digits, where the target is to fit the cumulative product. This task is similar to the MNIST Counting and Arithmetic Tasks in Trask et al. (2018)², but uses multiplication rather than addition (addition is in Appendix D.2). Each model is trained on sequences of length 2 and tested on sequences of up to 20 MNIST digits.

We define the success-criterion by comparing the MSE of each model with a baseline model that has a correct solution for the arithmetic unit. If the MSE of a model is less than the upper 1% MSE-confidence-interval of the baseline model, then the model is considered successfully converged. Sparsity and the solved-at-iteration step are determined as described in experiment 4.1. The validation set is the last 5000 MNIST digits from the training set, which is used for early stopping.

In this experiment, we found that having an unconstrained input-network can cause the multiplication units to learn an undesired solution, e.g. (0.1 · 81 + 1 − 0.1) = 9. Such networks do solve the problem, but not in the intended way. To prevent this solution, we regularize the CNN output with

R_z = \frac{1}{H_{\ell-1} H_\ell} \sum_{h_\ell=1}^{H_\ell} \sum_{h_{\ell-1}=1}^{H_{\ell-1}} (1 - W_{h_{\ell-1},h_\ell})(1 - z_{h_{\ell-1}})^2.

This regularizer is applied to the NMU and NAC•,NMU models. See Appendix D.4 for the results where this regularizer is not used.

Figure 4 shows that the NMU does not hinder learning a more complex neural network. Moreover, the NMU can extrapolate to much longer sequences than what it is trained on.

Figure 4 (panels: success rate, solved at iteration step, sparsity error; x-axis: extrapolation length; models: NAC•,NMU, NAC•,σ, NAC•, LSTM, NALU, NMU): MNIST sequential multiplication task. Each model is trained on sequences of two digits; results are for extrapolating to longer sequences. Error-bars represent the 95% confidence interval.

5 CONCLUSION

By including theoretical considerations, such as initialization, gradients, and sparsity, we have developed the Neural Multiplication Unit (NMU) and the Neural Addition Unit (NAU), which outperform state-of-the-art models on established extrapolation and sequential tasks. Our models converge more consistently, converge faster, reach a more interpretable solution, and support all input ranges.

A natural next step would be to extend the NMU to support division and to add gating between the NMU and NAU, to be comparable in theoretical features with the NALU. However, we find, both experimentally and theoretically, that learning division is impractical because of the singularity when dividing by zero, and that a sigmoid gate choosing between two functions with vastly different convergence properties, such as a multiplication unit and an addition unit, cannot be consistently learned.
Finally, when considering more than just two inputs to the multiplication unit, our model performs significantly better than previously proposed methods and their variations. The ability for a neural unit to consider more than two inputs is critical in neural networks which are often overparameterized. 2The same CNN is used, https://github.com/pytorch/examples/tree/master/mnist. Published as a conference paper at ICLR 2020 ACKNOWLEDGMENTS We would like to thank Andrew Trask and the other authors of the NALU paper, for highlighting the importance and challenges of extrapolation in Neural Networks. We would also like to thank the students Raja Shan Zaker Kreen and William Frisch Møller from The Technical University of Denmark, who initially showed us that the NALU do not converge consistently. Alexander R. Johansen and the computing resources from the Technical University of Denmark, where funded by the Innovation Foundation Denmark through the DABAI project. Kaiyu Chen, Yihan Dong, Xipeng Qiu, and Zitian Chen. Neural arithmetic expression calculator. Co RR, abs/1809.08590, 2018. URL http://arxiv.org/abs/1809.08590. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. Co RR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805. Karlis Freivalds and Renars Liepins. Improving the neural GPU architecture for algorithm learning. Co RR, abs/1702.08727, 2017. URL http://arxiv.org/abs/1702.08727. Charles R. Gallistel. Finding numbers in the brain. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1740):20170119, 2018. doi: 10.1098/rstb.2017.0119. URL https: //royalsocietypublishing.org/doi/abs/10.1098/rstb.2017.0119. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pp. 249 256, May 2010. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. Co RR, abs/1410.5401, 2014. URL http://arxiv.org/abs/1410.5401. Lukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.08228. Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1507.01526. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In The 3rd International Conference for Learning Representations, San Diego, 2015, pp. ar Xiv:1412.6980, Dec 2014. Brenden M. Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2879 2888, 2018. URL http://proceedings.mlr.press/v80/lake18a. html. Andreas Madsen and Alexander R. Johansen. Measuring arithmetic extrapolation performance. In Science meets Engineering of Deep Learning at 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), volume abs/1910.01888, Vancouver, Canada, October 2019. URL http://arxiv.org/abs/1910.01888. Andreas Nieder. 
The neuronal code for number. Nature Reviews Neuroscience, 17:366 EP , 05 2016. URL https://doi.org/10.1038/nrn.2016.40. Published as a conference paper at ICLR 2020 Open AI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob Mc Grew, Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. Co RR, abs/1808.00177, 2018. URL http: //arxiv.org/abs/1808.00177. Rosa Rugani, Laura Fontanari, Eleonora Simoni, Lucia Regolin, and Giorgio Vallortigara. Arithmetic in newborn chicks. Proceedings of the Royal Society B: Biological Sciences, 276(1666):2451 2460, 2009. doi: 10.1098/rspb.2009.0044. URL https://royalsocietypublishing.org/ doi/abs/10.1098/rspb.2009.0044. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484 489, 2016. doi: 10.1038/nature16961. URL https://doi.org/10.1038/nature16961. Mirac Suzgun, Yonatan Belinkov, and Stuart M. Shieber. On evaluating the generalization of LSTM models in formal languages. In Proceedings of the Society for Computation in Linguistics (SCi L), pp. 277 286, January 2019. Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems 31, pp. 8035 8044. 2018. URL http://papers.nips.cc/paper/ 8027-neural-arithmetic-logic-units.pdf. Edwin B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209 212, 1927. doi: 10.1080/01621459.1927. 10502953. URL https://www.tandfonline.com/doi/abs/10.1080/01621459. 1927.10502953. Sang Michael Xie and Stefano Ermon. Differentiable subset sampling. Co RR, abs/1901.10517, 2019. URL http://arxiv.org/abs/1901.10517. Published as a conference paper at ICLR 2020 A GRADIENT DERIVATIVES A.1 WEIGHT MATRIX CONSTRUCTION For clarity the weight matrix construction is defined using scalar notation Whℓ,hℓ 1 = tanh( ˆWhℓ,hℓ 1)σ( ˆ Mhℓ,hℓ 1) (18) The of the loss with respect to ˆWhℓ,hℓ 1 and ˆ Mhℓ,hℓ 1 is then derived using backpropagation. ˆWhℓ,hℓ 1 = L Whℓ,hℓ 1 Whℓ,hℓ 1 ˆWhℓ,hℓ 1 = L Whℓ,hℓ 1 (1 tanh2( ˆWhℓ,hℓ 1))σ( ˆ Mhℓ,hℓ 1) ˆ Mhℓ,hℓ 1 = L Whℓ,hℓ 1 Whℓ,hℓ 1 ˆ Mhℓ,hℓ 1 = L Whℓ,hℓ 1 tanh( ˆWhℓ,hℓ 1)σ( ˆ Mhℓ,hℓ 1)(1 σ( ˆ Mhℓ,hℓ 1)) As seen from this result, one only needs to consider L Whℓ,hℓ 1 for NAC+ and NAC , as the gradient with respect to ˆWhℓ,hℓ 1 and ˆ Mhℓ,hℓ 1 is a multiplication on L Whℓ,hℓ 1 . A.2 GRADIENT OF NAC The NAC is defined using scalar notation. hℓ 1=1 Whℓ,hℓ 1 log(|zhℓ 1| + ϵ) The gradient of the loss with respect to Whℓ,hℓ 1 can the be derived using backpropagation. zhℓ Whℓ,hℓ 1 = exp h ℓ 1=1 Whℓ,h ℓ 1 log(|zh ℓ 1| + ϵ) log(|zhℓ 1| + ϵ) = zhℓlog(|zhℓ 1| + ϵ) We now wish to derive the backpropagation term δhℓ= L zhℓ, because zhℓaffects {zhℓ+1}Hℓ+1 hℓ+1=1 this becomes: hℓ+1=1 δhℓ+1 zhℓ+1 To make it easier to derive zhℓ+1 zhℓ we re-express the zhℓas zhℓ+1. 
zhℓ+1 = exp hℓ=1 Whℓ+1,hℓlog(|zhℓ| + ϵ) Published as a conference paper at ICLR 2020 The gradient of zhℓ+1 zhℓ is then: hℓ=1 Whℓ+1,hℓlog(|zhℓ| + ϵ) Whℓ+1,hℓ log(|zhℓ| + ϵ) hℓ=1 Whℓ+1,hℓlog(|zhℓ| + ϵ) Whℓ+1,hℓ abs (zhℓ) = zhℓ+1Whℓ+1,hℓ abs (zhℓ) abs (zhℓ) is the gradient of the absolute function. In the paper we denote this as sign(zhℓ) for brevity. However, depending on the exact definition used there may be a difference for zhℓ= 0, as abs (0) is undefined. In practicality this doesn t matter much though, although theoretically it does mean that the expectation of this is theoretically undefined when E[zhℓ] = 0. A.3 GRADIENT OF NMU In scalar notation the NMU is defined as: Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ (25) The gradient of the loss with respect to Whℓ 1,hℓis fairly trivial. Note that every term but the one for hℓ 1, is just a constant with respect to Whℓ 1,hℓ. The product, except the term for hℓ 1 can be expressed as zhℓ Whℓ 1,hℓzhℓ 1+1 Whℓ 1,hℓ. Using this fact, the gradient can be expressed as: L whℓ,hℓ 1 = L zhℓ whℓ,hℓ 1 = L zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ zhℓ 1 1 (26) Similarly, the gradient L zhℓwhich is essential in backpropagation can equally easily be derived as: zhℓ zhℓ 1 = zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ Whℓ 1,hℓ (27) Published as a conference paper at ICLR 2020 B.1 OVERVIEW B.1.1 MOMENTS AND INITIALIZATION FOR ADDITION The desired properties for initialization are according to Glorot et al. (Glorot & Bengio, 2010): E[zhℓ] = 0 E L zhℓ 1 V ar[zhℓ] = V ar zhℓ 1 V ar L zhℓ 1 B.1.2 INITIALIZATION FOR ADDITION Glorot initialization can not be used for NAC+ as Whℓ 1,hℓis not sampled directly. Assuming that ˆWhℓ,hℓ 1 Uniform[ r, r] and ˆ Mhℓ,hℓ 1 Uniform[ r, r], then the variance can be derived (see proof in Appendix B.2) to be: V ar[Whℓ 1,hℓ] = 1 One can then solve for r, given the desired variance (V ar[Whℓ 1,hℓ] = 2 Hℓ 1+Hℓ) (Glorot & Bengio, 2010). B.1.3 MOMENTS AND INITIALIZATION FOR MULTIPLICATION Using second order multivariate Taylor approximation and some assumptions of uncorrelated stochastic variables, the expectation and variance of the NAC layer can be estimated to: f(c1, c2) = 1 + c1 1 2V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 c2 Hℓ 1 E[zhℓ] f (1, 1) V ar[zhℓ] f (4, 1) f (1, 2) V ar L zhℓ 1 Hℓf (4, 1) V ar[Whℓ,hℓ 1] 1 |E[zhℓ 1]| + ϵ 2 + 3 |E[zhℓ 1]| + ϵ 4 V ar[zhℓ 1] This is problematic because E[zhℓ] 1, and the variance explodes for E[zhℓ 1] = 0. E[zhℓ 1] = 0 is normally a desired property (Glorot & Bengio, 2010). The variance explodes for E[zhℓ 1] = 0, and can thus not be initialized to anything meaningful. Published as a conference paper at ICLR 2020 For our proposed NMU, the expectation and variance can be derived (see proof in Appendix B.4) using the same assumptions as before, although no Taylor approximation is required: V ar[zhℓ] V ar[Whℓ 1,hℓ] + 1 Hℓ 1 V ar[zhℓ 1] + 1 Hℓ 1 1 V ar L zhℓ 1 V ar[Whℓ 1,hℓ] + 1 Hℓ 1 V ar[zhℓ 1] + 1 Hℓ 1 1 ! These expectations are better behaved. It is unlikely that the expectation of a multiplication unit can become zero, since the identity for multiplication is 1. However, for a large Hℓ 1 it will be near zero. The variance is also better behaved, but do not provide a input-independent initialization strategy. We propose initializing with V ar[Whℓ 1,hℓ] = 1 4, as this is the solution to V ar[zhℓ] = V ar[zhℓ 1] assuming V ar[zhℓ 1] = 1 and a large Hℓ 1 (see proof in Appendix B.4.3). However, more exact solutions are possible if the input variance is known. 
B.2 EXPECTATION AND VARIANCE FOR WEIGHT MATRIX CONSTRUCTION IN NAC LAYERS The weight matrix construction in NAC, is defined in scalar notation as: Whℓ,hℓ 1 = tanh( ˆWhℓ,hℓ 1)σ( ˆ Mhℓ,hℓ 1) (32) Simplifying the notation of this, and re-expressing it using stochastic variables with uniform distributions this can be written as: W tanh( ˆW)σ( ˆ M) ˆW U[ r, r] ˆ M U[ r, r] Since tanh( ˆW) is an odd-function and E[ ˆW] = 0, deriving the expectation E[W] is trivial. E[W] = E[tanh( ˆW)]E[σ( ˆ M)] = 0 E[σ( ˆ M)] = 0 (34) The variance is more complicated, however as ˆW and ˆ M are independent, it can be simplified to: Var[W] = E[tanh( ˆW)2]E[σ( ˆ M)2] E[tanh( ˆW)]2E[σ( ˆ M)]2 = E[tanh( ˆW)2]E[σ( ˆ M)2] (35) These second moments can be analyzed independently. First for E[tanh( ˆW)2]: E[tanh( ˆW)2] = Z tanh(x)2f U[ r,r](x) dx r tanh(x)2 dx 2r 2 (r tanh(r)) = 1 tanh(r) Published as a conference paper at ICLR 2020 Then for E[tanh( ˆ M)2]: E[σ( ˆ M)2] = Z σ(x)2f U[ r,r](x) dx Which results in the variance: B.3 EXPECTATION AND VARIANCE OF NAC B.3.1 FORWARD PASS Expectation Assuming that each zhℓ 1 are uncorrelated, the expectation can be simplified to: hℓ 1=1 Whℓ,hℓ 1 log(|zhℓ 1| + ϵ) hℓ 1=1 exp(Whℓ,hℓ 1 log(|zhℓ 1| + ϵ)) hℓ 1=1 E[exp(Whℓ,hℓ 1 log(|zhℓ 1| + ϵ))] = E[exp(Whℓ,hℓ 1 log(|zhℓ 1| + ϵ))]Hℓ 1 = E h (|zhℓ 1| + ϵ)Whℓ,hℓ 1 i Hℓ 1 = E f(zhℓ 1, Whℓ,hℓ 1) Hℓ 1 Here we define g as a non-linear transformation function of two independent stochastic variables: f(zhℓ 1, Whℓ,hℓ 1) = (|zhℓ 1| + ϵ)Whℓ,hℓ 1 (40) We then apply second order Taylor approximation of f, around (E[zhℓ 1], E[Whℓ,hℓ 1]). E[f(zhℓ 1, Whℓ,hℓ 1)] E f(E[zhℓ 1], E[Whℓ,hℓ 1]) + zhℓ 1 E[zhℓ 1] Whℓ,hℓ 1 E[Whℓ,hℓ 1] f(zhℓ 1,Whℓ,hℓ 1) zhℓ 1 f(zhℓ 1,Whℓ,hℓ 1) zhℓ 1 = E[zhℓ 1] Whℓ,hℓ 1 = E[Whℓ,hℓ 1] zhℓ 1 E[zhℓ 1] Whℓ,hℓ 1 E[Whℓ,hℓ 1] 2f(zhℓ 1,Whℓ,hℓ 1) 2f(zhℓ 1,Whℓ,hℓ 1) zhℓ 1 Whℓ,hℓ 1 2f(zhℓ 1,Whℓ,hℓ 1) zhℓ 1 Whℓ,hℓ 1 2f(zhℓ 1,Whℓ,hℓ 1) zhℓ 1 = E[zhℓ 1] Whℓ,hℓ 1 = E[Whℓ,hℓ 1] zhℓ 1 E[zhℓ 1] Whℓ,hℓ 1 E[Whℓ,hℓ 1] Published as a conference paper at ICLR 2020 Because E[zhℓ 1 E[zhℓ 1]] = 0, E[Whℓ,hℓ 1 E[Whℓ,hℓ 1]] = 0, and Cov[zhℓ 1, Whℓ,hℓ 1] = 0. This simplifies to: E[g(zhℓ 1, Whℓ,hℓ 1)] g(E[zhℓ 1], E[Whℓ,hℓ 1]) 2V ar zhℓ 1 Whℓ,hℓ 1 2g(zhℓ 1,Whℓ,hℓ 1) 2zhℓ 1 2g(zhℓ 1,Whℓ,hℓ 1) zhℓ 1 = E[zhℓ 1] Whℓ,hℓ 1 = E[Whℓ,hℓ 1] Inserting the derivatives and computing the inner products yields: E[f(zhℓ 1, Whℓ,hℓ 1)] (|E[zhℓ 1]| + ϵ)E[Whℓ,hℓ 1] 2V ar[zhℓ 1](|E[zhℓ 1]| + ϵ)E[Whℓ,hℓ 1] 2E[Whℓ,hℓ 1](E[Whℓ,hℓ 1] 1) 2V ar[Whℓ,hℓ 1](|E[zhℓ 1]| + ϵ)E[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 2V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 This gives the final expectation: E[zhℓ] = E g(zhℓ 1, Whℓ,hℓ 1) Hℓ 1 2V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 Hℓ 1 (44) We evaluate the error of the approximation, where Whℓ,hℓ 1 U[ rw, rw] and zhℓ 1 U[0, rz]. These distributions are what is used in the arithmetic dataset. The error is plotted in figure 5. 0.0 0.1 0.2 0.3 0.4 0.5 rw Figure 5: Error between theoretical approximation and the numerical approximation estimated by random sampling of 100000 observations at each combination of rz and rw. Variance The variance can be derived using the same assumptions as used in expectation , that all zhℓ 1 are uncorrelated. 
V ar[zhℓ] = E[z2 hℓ] E[zhℓ]2 hℓ 1=1 (|zhℓ 1| + ϵ)2 Whℓ,hℓ 1 hℓ 1=1 (|zhℓ 1| + ϵ)Whℓ,hℓ 1 = E f(zhℓ 1, 2 Whℓ,hℓ 1) Hℓ 1 E f(zhℓ 1, Whℓ,hℓ 1) 2 Hℓ 1 Published as a conference paper at ICLR 2020 We already have from the expectation result in (43) that: E f(zhℓ 1, Whℓ,hℓ 1) 1 + 1 2V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 (46) By substitution of variable we have that: E f(zhℓ 1, 2 Whℓ,hℓ 1) 1 + 1 2V ar[2 Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 1 + 2 V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 (47) This gives the variance: V ar[zhℓ] = E g(zhℓ 1, 2 Whℓ,hℓ 1) Hℓ 1 E f(zhℓ 1, Whℓ,hℓ 1) 2 Hℓ 1 1 + 2 V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 Hℓ 1 2 V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 2 Hℓ 1 (48) B.3.2 BACKWARD PASS Expectation The expectation of the back-propagation term assuming that δhℓ+1 and zhℓ+1 zhℓ are mutually uncorrelated: hℓ+1=1 δhℓ+1 zhℓ+1 Hℓ+1E[δhℓ+1]E zhℓ+1 Assuming that zhℓ+1, Whℓ+1,hℓ, and zhℓare uncorrelated: E[zhℓ+1]E[Whℓ+1,hℓ]E abs (zhℓ) = E[zhℓ+1] 0 E abs (zhℓ) Variance Deriving the variance is more complicated: = V ar zhℓ+1Whℓ+1,hℓ abs (zhℓ) Assuming again that zhℓ+1, Whℓ+1,hℓ, and zhℓare uncorrelated, and likewise for their second moment: E[z2 hℓ+1]E[W 2 hℓ+1,hℓ]E " abs (zhℓ) E[zhℓ+1]2E[Whℓ+1,hℓ]2E abs (zhℓ) = E[z2 hℓ+1]V ar[Whℓ+1,hℓ]E " abs (zhℓ) E[zhℓ+1]2 0 E abs (zhℓ) = E[z2 hℓ+1]V ar[Whℓ+1,hℓ]E " abs (zhℓ) Using Taylor approximation around E[zhℓ] we have: " abs (zhℓ) (|E[zhℓ]| + ϵ)2 + 1 (|E[zhℓ]| + ϵ)4 V ar[zhℓ] (|E[zhℓ]| + ϵ)2 + 3 (|E[zhℓ]| + ϵ)4 V ar[zhℓ] Published as a conference paper at ICLR 2020 Finally, by reusing the result for E[z2 hℓ] from earlier the variance can be expressed as: V ar L zhℓ 1 Hℓ 1 + 2 V ar[Whℓ,hℓ 1] log(|E[zhℓ 1]| + ϵ)2 Hℓ 1 V ar[Whℓ,hℓ 1] 1 |E[zhℓ 1]| + ϵ 2 + 3 |E[zhℓ 1]| + ϵ 4 V ar[zhℓ 1] B.4 EXPECTATION AND VARIANCE OF NMU B.4.1 FORWARD PASS Expectation Assuming that all zhℓ 1 are independent: Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ E Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ Hℓ 1 E[Whℓ 1,hℓ]E[zhℓ 1] + 1 E[Whℓ 1,hℓ] Hℓ 1 Assuming that E[zhℓ 1] = 0 which is a desired property and initializing E[Whℓ 1,hℓ] = 1/2, the expectation is: E[zhℓ] E[Whℓ 1,hℓ]E[zhℓ 1] + 1 E[Whℓ 1,hℓ] Hℓ 1 Variance Reusing the result for the expectation, assuming again that all zhℓ 1 are uncorrelated, and using the fact that Whℓ 1,hℓis initially independent from zhℓ 1: V ar[zhℓ] = E[z2 hℓ] E[zhℓ]2 Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ 2 E[ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ 2]Hℓ 1 1 = E[W 2 hℓ 1,hℓ]E[z2 hℓ 1] 2E[W 2 hℓ 1,hℓ]E[zhℓ 1] + E[W 2 hℓ 1,hℓ] + 2E[Whℓ 1,hℓ]E[zhℓ 1] 2E[Whℓ 1,hℓ] + 1 Hℓ 1 1 Assuming that E[zhℓ 1] = 0, which is a desired property and initializing E[Whℓ 1,hℓ] = 1/2, the variance becomes: V ar[zhℓ] E[W 2 hℓ 1,hℓ] E[z2 hℓ 1] + 1 Hℓ 1 1 V ar[Whℓ 1,hℓ] + E[Whℓ 1,hℓ]2 V ar[zhℓ 1] + 1 Hℓ 1 1 = V ar[Whℓ 1,hℓ] + 1 Hℓ 1 V ar[zhℓ 1] + 1 Hℓ 1 1 Published as a conference paper at ICLR 2020 B.4.2 BACKWARD PASS Expectation For the backward pass the expectation can, assuming that L zhℓand zhℓ zhℓ 1 are uncorrelated, be derived to: E zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ Whℓ 1,hℓ E zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ Initializing E[Whℓ 1,hℓ] = 1/2, and inserting the result for the expectation E h zhℓ Whℓ 1,hℓzhℓ 1+1 Whℓ 1,hℓ Assuming that E h L zhℓ i = 0, which is a desired property (Glorot & Bengio, 2010). Variance For the variance of the backpropagation term, we assume that L zhℓis uncorrelated with zhℓ zhℓ 1 . 
V ar L zhℓ 1 Assuming again that E h L zhℓ i = 0, and reusing the result E h zhℓ zhℓ 1 V ar L zhℓ 1 2 Hℓ 1 + V ar zhℓ Focusing now on V ar h zhℓ zhℓ 1 i , we have: " zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ E[W 2 hℓ 1,hℓ] E zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ 2 E[Whℓ 1,hℓ]2 (64) Published as a conference paper at ICLR 2020 Inserting the result for the expectation E h zhℓ Whℓ 1,hℓzhℓ 1+1 Whℓ 1,hℓ i and Initializing again E[Whℓ 1,hℓ] = 1/2. " zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ E[W 2 hℓ 1,hℓ] 2 (Hℓ 1 1) 1 " zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ E[W 2 hℓ 1,hℓ] Using the identity that E[W 2 hℓ 1,hℓ] = V ar[Whℓ 1,hℓ] + E[Whℓ 1,hℓ]2, and again using E[Whℓ 1,hℓ] = 1/2. " zhℓ Whℓ 1,hℓzhℓ 1 + 1 Whℓ 1,hℓ 2# V ar[Whℓ 1,hℓ] + 1 2 Hℓ 1 (66) To derive E zhℓ Whℓ 1,hℓzhℓ 1+1 Whℓ 1,hℓ 2 the result for V ar[zhℓ] can be used, but for ˆHℓ 1 = Hℓ 1 1, because there is one less term. Inserting E zhℓ Whℓ 1,hℓzhℓ 1+1 Whℓ 1,hℓ 2 = V ar[Whℓ 1,hℓ] + 1 4 Hℓ 1 1 V ar[zhℓ 1] + 1 Hℓ 1 1, we have: V ar[Whℓ 1,hℓ] + 1 Hℓ 1 1 V ar[zhℓ 1] + 1 Hℓ 1 1 V ar[Whℓ 1,hℓ] + 1 = V ar[Whℓ 1,hℓ] + 1 Hℓ 1 V ar[zhℓ 1] + 1 Hℓ 1 1 1 Inserting the result for V ar h zhℓ zhℓ 1 i into the result for V ar h L zhℓ 1 V ar L zhℓ 1 + V ar[Whℓ 1,hℓ] + 1 Hℓ 1 V ar[zhℓ 1] + 1 Hℓ 1 1 1 V ar[Whℓ 1,hℓ] + 1 Hℓ 1 V ar[zhℓ 1] + 1 Hℓ 1 1 ! Published as a conference paper at ICLR 2020 B.4.3 INITIALIZATION The Whℓ 1,hℓshould be initialized with E[Whℓ 1,hℓ] = 1 2, in order to not bias towards inclusion or exclusion of zhℓ 1. Using the derived variance approximations (68), the variance should be according to the forward pass: V ar[Whℓ 1,hℓ] = (1 + V ar[zhℓ]) Hℓ 1V ar[zhℓ] + (4 + 4V ar[zhℓ]) Hℓ 1 1 Hℓ 1 1 And according to the backward pass it should be: V ar[Whℓ 1,hℓ] = (V ar[zhℓ] + 1)1 Hℓ 1 Both criteria are dependent on the input variance. If the input variance is know then optimal initialization is possible. However, as this is often not the case one can perhaps assume that V ar[zhℓ 1] = 1. This is not an unreasonable assumption in many cases, as there may either be a normalization layer somewhere or the input is normalized. If unit variance is assumed, the variance for the forward pass becomes: V ar[Whℓ 1,hℓ] = 2 Hℓ 1 + 8 Hℓ 1 1 Hℓ 1 1 4Hℓ 1 + 1 Hℓ 1 2 (71) And from the backward pass: V ar[Whℓ 1,hℓ] = 21 Hℓ 1 The variance requirement for both the forward and backward pass can be satisfied with V ar[Whℓ 1,hℓ] = 1 4 for a large Hℓ 1. Published as a conference paper at ICLR 2020 C ARITHMETIC TASK The aim of the Arithmetic task is to directly test arithmetic models ability to extrapolate beyond the training range. Additionally, our generalized version provides a high degree of flexibility in how the input is shaped, sampled, and the problem complexity. Our arithmetic task is identical to the simple function task in the NALU paper (Trask et al., 2018). However, as they do not describe their setup in details, we use the setup from Madsen & Johansen (2019), which provide Algorithm 3, an evaluation-criterion to if and when the model has converged, the sparsity error, as well as methods for computing confidence intervals for success-rate and the sparsity error. t x: overlap Figure 6: Shows how the dataset is parameterized. C.1 DATASET GENERATION The goal is to sum two random subsets of a vector x (a and b), and perform an arithmetic operation on these (a b). i=s1,start xi, b = i=s2,start xi, t = a b (73) Algorithm 1 defines the exact procedure to generate the data, where an interpolation range will be used for training and validation and an extrapolation range will be used for testing. 
Default values are defined in table 3. Table 3: Default dataset parameters for Arithmetic task Parameter name Default value Input size 100 Subset ratio 0.25 Overlap ratio 0.5 Parameter name Default value Interpolation range U[1, 2] Extrapolation range U[2, 6] Algorithm 1 Dataset generation algorithm for Arithmetic task 1: function DATASET(OP( , ) : Operation, i : Input Size, s : Subset Ratio, o : Overlap Ratio, R : Range) 2: x UNIFORM(Rlower, Rupper, i) Sample i elements uniformly 3: k UNIFORM(0, 1 2s o) Sample offset 4: a SUM(x[ik : i(k + s)]) Create sum a from subset 5: b SUM(x[i(k + s o) : i(k + 2s 0)]) Create sum b from subset 6: t OP(a, b) Perform operation on a and b 7: return x, t C.2 MODEL DEFINTIONS AND SETUP Models are defined in table 4 and are all optimized with Adam optimization (Kingma & Ba, 2014) using default parameters, and trained over 5 106 iterations. Training takes about 8 hours on a single CPU core(8-Core Intel Xeon E5-2665 2.4GHz). We run 19150 experiments on a HPC cluster. Published as a conference paper at ICLR 2020 The training dataset is continuously sampled from the interpolation range where a different seed is used for each experiment, all experiments use a mini-batch size of 128 observations, a fixed validation dataset with 1 104 observations sampled from the interpolation range, and a fixed test dataset with 1 104 observations sampled from the extrapolation range. Table 4: Model definitions Model Layer 1 Layer 2 ˆλsparse λstart λend NMU NAU NMU 10 106 2 106 NAU NAU NAU 0.01 5 103 5 104 NAC NAC+ NAC NAC ,σ NAC+ NAC ,σ NAC ,NMU NAC+ NAC ,NMU 10 106 2 106 NAC+ NAC+ NAC+ NALU NALU NALU Linear Linear Linear Re LU Re LU Re LU Re LU6 Re LU6 Re LU6 C.3 ABLATION STUDY To validate our model, we perform an ablation on the multiplication problem. Some noteworthy observations: 1. None of the W constraints, such as Rsparse and clamping W to be in [0, 1], are necessary when the hidden size is just 2. 2. Removing the Rsparse causes the NMU to immediately fail for larger hidden sizes. 3. Removing the clamping of W does not cause much difference. This is because Rsparse also constrains W outside of [0, 1]. The regularizer used here is Rsparse = min(|W|, |1 W|), which is identical to the one used in other experiments in [0, 1], but is also valid outside [0, 1]. Doing this gives only a slightly slower convergence. Although, this can not be guaranteed in general, as the regularizer is omitted during the initial optimization. 4. Removing both constraints, gives a somewhat satisfying solution, but with a lower successrate, slower convergence, and higher sparsity error. In conclusion both constraints are valuable, as they provide faster convergence and a sparser solution, but they are not critical to the success-rate of the NMU. Success rate Solved at iteration step Sparsity error 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 Hidden size model NMU NMU, no W clamp NMU, no Rsparse NMU, no Rsparse, no W-clamp Figure 7: Ablation study where Rsparse is removed and the clamping of W is removed. There are 50 experiments with different seeds, for each configuration. Published as a conference paper at ICLR 2020 C.4 EFFECT OF DATASET PARAMETER To stress test the models on the multiplication task, we vary the dataset parameters one at a time while keeping the others at their default value (default values in table 3). Each runs for 50 experiments with different seeds. The results, are visualized in figure 8. 
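These parameter sweeps all reuse the generation procedure from Appendix C.1. For reference, a minimal Python sketch of Algorithm 1 with the defaults from table 3 is given below; the function and argument names are illustrative, and the overlap ratio is read here as the fraction of a subset shared by both subsets, which is an assumption (consult the released code, footnote 1, for the exact convention).

```python
import numpy as np

def make_observation(op, input_size=100, subset_ratio=0.25, overlap_ratio=0.5,
                     value_range=(1, 2), rng=None):
    """One (x, t) pair in the spirit of Algorithm 1, with the defaults from table 3."""
    rng = np.random.default_rng() if rng is None else rng
    i, s, o = input_size, subset_ratio, overlap_ratio
    x = rng.uniform(value_range[0], value_range[1], size=i)  # sample i elements uniformly
    k = rng.uniform(0, 1 - 2 * s + s * o)                    # sample subset offset
    a = x[int(i * k): int(i * (k + s))].sum()                # sum over first subset
    b = x[int(i * (k + s - s * o)): int(i * (k + 2 * s - s * o))].sum()  # second subset
    return x, op(a, b)                                       # t = a (op) b

# Example: one multiplication observation on the interpolation range U[1, 2]
x, t = make_observation(lambda a, b: a * b)
```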
In figure 3, the interpolation-range is changed, therefore the extrapolation-range needs to be changed such it doesn t overlap. For each interpolation-range the following extrapolationrange is used: U[ 2, 1] uses U[ 6, 2], U[ 2, 2] uses U[ 6, 2] U[2, 6], U[0, 1] uses U[1, 5], U[0.1, 0.2] uses U[0.2, 2], U[1.1, 1.2] uses U[1.2, 6], U[1, 2] uses U[2, 6], U[10, 20] uses U[20, 40]. G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G Success rate Solved at iteration step Sparsity error 0 100 200 300 0 100 200 300 0 100 200 300 Size of the input vector G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 The ratio of which subsets overlap with each other G G G G G G G G 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5 0.1 0.2 0.3 0.4 0.5 Relative size of subsets compared to input size model G G G G G G NAC ,NMU NAC ,σ NAC Gated NAU/NMU NALU NMU Figure 8: Shows the effect of the dataset parameters. C.5 GATING CONVERGENCE EXPERIMENT In the interest of adding some understand of what goes wrong in the NALU gate, and the shared weight choice that NALU employs to remedy this, we introduce the following experiment. We train two models to fit the arithmetic task. Both uses the NAC+ in the first layer and NALU in the second layer. The only difference is that one model shares the weight between NAC+ and NAC in the NALU, and the other treat them as two separate units with separate weights. In both cases NALU should gate between NAC+ and NAC and choose the appropriate operation. Note that this NALU Published as a conference paper at ICLR 2020 model is different from the one presented elsewhere in this paper, including the original NALU paper (Trask et al., 2018). The typical NALU model is just two NALU layers with shared weights. Furthermore, we also introduce a new gated unit that simply gates between our proposed NMU and NAU, using the same sigmoid gating-mechanism as in the NALU. This combination is done with seperate weights, as NMU and NAU use different weight constrains and can therefore not be shared. The models are trained and evaluated over 100 different seeds on the multiplication and addition task. A histogram of the gate-value for all seeds is presented in figure 9 and table 5 contains a summary. Some noteworthy observations: 1. When the NALU weights are separated far more trials converge to select NAC+ for both the addition and multiplication task. Sharing the weights between NAC+ and NAC makes the gating less likely to converge for addition. 2. The performance of the addition task is dependent on NALU selecting the right operation. In the multiplication task, when the right gate is selected, NAC do not converge consistently, unlike our NMU that converges more consistently. 3. Which operation the gate converges to appears to be mostly random and independent of the task. These issues are caused by the sigmoid gating-mechanism and thus exists independent of the used sub-units. These observations validates that the NALU gating-mechanism does not converge as intended. This becomes a critical issues when more gates are present, as is normally the case. E.g. when stacking multiple NALU layers together. Gated NAU/NMU NALU (separate) NALU (shared) Multiplication Addition mul 0.25 0.50 0.75 add mul 0.25 0.50 0.75 add mul 0.25 0.50 0.75 add Figure 9: Shows the gating-value in the NALU layer and a variant that uses NAU/NMU instead of NAC+/NAC . 
Separate/shared refers to the weights in NAC+/NAC used in NALU. Table 5: Comparison of the success-rate, when the model converged, and the sparsity error, with 95% confidence interval on the arithmetic datasets task. Each value is a summary of 100 different seeds. Op Model Success Solved at iteration step Sparsity error Rate Median Mean Mean Gated NAU/NMU 62% +9% 10% 1.5 106 1.5 106 +3.9 104 3.8 104 5.0 10 5 +2.3 10 5 1.8 10 5 NALU (separate) 22% +9% 7% 2.8 106 3.3 106 +3.9 105 3.6 105 5.8 10 2 +4.1 10 2 2.3 10 2 NALU (shared) 24% +9% 7% 2.9 106 3.3 106 +3.7 105 3.6 105 1.0 10 3 +1.1 10 3 Gated NAU/NMU 37% +10% 9% 1.9 104 4.2 105 +7.3 104 6.7 104 1.7 10 1 +4.6 10 2 4.0 10 2 NALU (separate) 51% +10% 10% 1.4 105 2.9 105 +3.5 104 4.3 104 1.8 10 1 +1.4 10 2 1.4 10 2 + NALU (shared) 34% +10% 9% 1.8 105 3.1 105 +4.3 104 5.4 104 1.8 10 1 +2.3 10 2 Published as a conference paper at ICLR 2020 C.6 REGULARIZATION The λstart and λend are simply selected based on how much time it takes for the model to converge. The sparsity regularizer should not be used during early optimization as this part of the optimization is exploratory and concerns finding the right solution by getting each weight on the right side of 0.5. In figure 10, 11 and 12 the scaling factor ˆλsparse is optimized. λsparse = ˆλsparse max(min( t λstart λend λstart , 1), 0) (74) Success rate Solved at iteration step Sparsity error 0 0.01 0.1 1 10 100 0 0.01 0.1 1 10 100 0 0.01 0.1 1 10 100 0.00 Sparse regualizer Figure 10: Shows effect of ˆλsparse in NAU on the arithmetic dataset for the + operation. G G G G G G G G G G G G G G G G G Success rate Solved at iteration step Sparsity error 0 0.01 0.1 1 10 100 0 0.01 0.1 1 10 100 0 0.01 0.1 1 10 100 0.00 Sparse regualizer Figure 11: Shows effect of ˆλsparse in NAU on the arithmetic dataset for the operation. Success rate Solved at iteration step Sparsity error 0 0.01 0.1 1 10 100 0 0.01 0.1 1 10 100 0 0.01 0.1 1 10 100 0e+00 Sparse regualizer Figure 12: Shows effect of ˆλsparse in NMU on the arithmetic dataset for the operation. C.7 COMPARING ALL MODELS Table 6 compares all models on all operations used in NALU (Trask et al., 2018). All variations of models and operations are trained for 100 different seeds to build confidence intervals. Some noteworthy observations are: Published as a conference paper at ICLR 2020 1. Division does not work for any model, including the NAC and NALU models. This may seem surprising but is actually in line with the results from the NALU paper (Trask et al. (2018), table 1) where there is a large error given the interpolation range. The extrapolation range has a smaller error, but this is an artifact of their evaluation method where they normalize with a random baseline. Since a random baseline will have a higher error for the extrapolation range, errors just appear to be smaller. A correct solution to division should have both a small interpolation and extrapolation error. 2. NAC and NALU are barely able to learn z, with just 2% success-rate for NALU and 7% success-rate for NAC . 3. NMU is fully capable of learning z2. It learn this by learning the same subset twice in the NAU layer, this is also how NAC learn z2. 4. The Gated NAU/NMU (discussed in section C.5) works very poorly, because the NMU initialization assumes that E[zhℓ 1] = 0. This is usually true, as discussed in section 2.6, but not in this case for the first layer. In the recommended NMU model, the NMU layer appears after NAU, which causes that assumption to be satisfied. 
Table 6: Comparison of the success-rate, when the model converged, and the sparsity error, with 95% confidence interval on the arithmetic datasets task. Each value is a summary of 100 different seeds. Op Model Success Solved at iteration step Sparsity error Rate Median Mean Mean NAC ,NMU 93% +4% 7% 1.8 106 2.0 106 +1.0 105 9.7 104 9.5 10 7 +4.2 10 7 4.2 10 7 NAC ,σ 100% +0% 4% 2.5 106 2.6 106 +8.8 104 7.2 104 4.6 10 5 +5.0 10 6 5.6 10 6 NAC 31% +10% 8% 2.8 106 3.0 106 +2.9 105 2.4 105 5.8 10 4 +4.8 10 4 2.6 10 4 NAC+ 0% +4% 0% Gated NAU NMU 0% +4% 0% Linear 0% +4% 0% NALU 0% +4% 0% NAU 0% +4% 0% NMU 98% +1% 5% 1.4 106 1.5 106 +5.0 104 6.6 104 4.2 10 7 +2.9 10 8 2.9 10 8 Re LU 0% +4% 0% Re LU6 0% +4% 0% NAC ,NMU 0% +4% 0% NAC ,σ 0% +4% 0% NAC 0% +4% 0% NAC+ 0% +4% 0% Gated NAU NMU 0% +4% 0% Linear 0% +4% 0% NALU 0% +4% 0% NAU 0% +4% 0% NMU 0% +4% 0% Re LU 0% +4% 0% Re LU6 0% +4% 0% Published as a conference paper at ICLR 2020 Table 6: Comparison of the success-rate, when the model converged, and the sparsity error, with 95% confidence interval on the arithmetic datasets task. Each value is a summary of 100 different seeds. (continued) Op Model Success Solved at Sparsity error Rate Median Mean Mean NAC ,NMU 0% +4% 0% NAC ,σ 0% +4% 0% NAC 0% +4% 0% NAC+ 100% +0% 4% 2.5 105 4.9 105 +5.2 104 4.5 104 2.3 10 1 +6.5 10 3 6.5 10 3 Gated NAU NMU 0% +4% 0% Linear 100% +0% 4% 6.1 104 6.3 104 +2.5 103 3.3 103 2.5 10 1 +3.6 10 4 3.6 10 4 NALU 14% +8% 5% 1.5 106 1.6 106 +3.8 105 3.3 105 1.7 10 1 +2.7 10 2 2.5 10 2 NAU 100% +0% 4% 1.8 104 3.9 105 +4.5 104 3.7 104 3.2 10 5 +1.3 10 5 1.3 10 5 NMU 0% +4% 0% Re LU 62% +9% 10% 6.2 104 7.6 104 +8.3 103 7.0 103 2.5 10 1 +2.4 10 3 Re LU6 0% +4% 0% NAC ,NMU 0% +4% 0% NAC ,σ 0% +4% 0% NAC 0% +4% 0% NAC+ 100% +0% 4% 9.0 103 3.7 105 +3.8 104 3.8 104 2.3 10 1 +5.4 10 3 5.4 10 3 Gated NAU NMU 0% +4% 0% Linear 7% +7% 4% 3.3 106 1.4 106 +7.0 105 6.1 105 1.8 10 1 +7.2 10 2 5.8 10 2 NALU 14% +8% 5% 1.9 106 1.9 106 +4.4 105 4.5 105 2.1 10 1 +2.2 10 2 2.2 10 2 NAU 100% +0% 4% 5.0 103 1.6 105 +1.7 104 1.6 104 6.6 10 2 +2.5 10 2 1.9 10 2 NMU 56% +9% 10% 1.0 106 1.0 106 +5.8 102 5.8 102 3.4 10 4 +3.2 10 5 2.6 10 5 Re LU 0% +4% 0% Re LU6 0% +4% 0% NAC ,NMU 3% +5% 2% 1.0 106 1.0 106 +Na N 10 Inf Na N 10 Inf 1.7 10 1 +8.3 10 3 8.1 10 3 NAC ,σ 0% +4% 0% NAC 7% +7% 4% 4.0 105 1.5 106 +6.0 105 5.6 105 2.4 10 1 +1.7 10 2 1.7 10 2 NAC+ 0% +4% 0% Gated NAU NMU 0% +4% 0% Linear 0% +4% 0% NALU 2% +5% 1% 2.6 106 3.3 106 +1.8 106 2.2 106 5.0 10 1 +2.5 10 6 8.0 10 6 NAU 0% +4% 0% NMU 0% +4% 0% Re LU 0% +4% 0% Re LU6 0% +4% 0% Published as a conference paper at ICLR 2020 Table 6: Comparison of the success-rate, when the model converged, and the sparsity error, with 95% confidence interval on the arithmetic datasets task. Each value is a summary of 100 different seeds. (continued) Op Model Success Solved at Sparsity error Rate Median Mean Mean NAC ,NMU 100% +0% 4% 1.4 106 1.5 106 +8.4 104 7.9 104 2.9 10 7 +1.4 10 8 1.4 10 8 NAC ,σ 100% +0% 4% 1.9 106 1.9 106 +5.3 104 6.2 104 1.8 10 2 +4.3 10 4 4.3 10 4 NAC 77% +7% 9% 3.3 106 3.2 106 +1.6 105 2.0 105 1.8 10 2 +5.8 10 4 5.7 10 4 NAC+ 0% +4% 0% Gated NAU NMU 0% +4% 0% Linear 0% +4% 0% NALU 0% +4% 0% NAU 0% +4% 0% NMU 100% +0% 4% 1.2 106 1.3 106 +3.1 104 3.6 104 3.7 10 5 +5.4 10 5 3.7 10 5 Re LU 0% +4% 0% Re LU6 0% +4% 0% Published as a conference paper at ICLR 2020 D SEQUENTIAL MNIST D.1 TASK AND EVALUATION CRITERIA The simple function task is a purely synthetic task, that does not require a deep network. 
As such, it does not test whether an arithmetic layer inhibits the network's ability to be optimized with gradient descent. The sequential MNIST task takes the numerical value of a sequence of MNIST digits and applies a binary operation recursively, such that t_i = Op(t_{i−1}, z_i), where z_i is the numerical value of the i-th MNIST digit. This is identical to the MNIST Counting and Arithmetic Tasks in Trask et al. (2018, section 4.2). We present the addition variant to validate the NAU's ability to backpropagate, and we add a multiplication variant to validate the NMU's ability to backpropagate.

The performance on this task depends on the quality of the image-to-scalar network and on the arithmetic layer's ability to model the scalar. We use the mean squared error (MSE) to evaluate the joint image-to-scalar and arithmetic-layer model. To determine an MSE threshold for a correct prediction, we use an empirical baseline: the arithmetic layer is fixed to the correct solution, such that only the image-to-scalar network is learned. By training this baseline over multiple seeds, an upper bound for the MSE threshold can be set. In our experiments we use the 1% one-sided upper confidence interval, assuming a student-t distribution (a short code sketch of this computation is given at the end of this appendix).

Similar to the simple function task, we use a success criterion, as reporting the MSE is not interpretable and models that do not converge obscure the mean. Furthermore, because the operation is applied recursively, the natural error from the dataset accumulates over time, increasing the MSE exponentially. Using a baseline model and reporting the success rate solves this interpretation challenge.

D.2 ADDITION OF SEQUENTIAL MNIST

Figure 13 shows results for sequential addition of MNIST digits. This experiment is identical to the MNIST Digit Addition Test from Trask et al. (2018, section 4.2). The models are trained on sequences of 10 digits and evaluated on sequences of between 1 and 1000 MNIST digits. Note that the NAU model includes the R_z regularizer, similarly to the multiplication of sequential MNIST experiment in section 4.2. However, because the weights are in [−1, 1] rather than [0, 1], and the identity of addition is 0 rather than 1, R_z is

R_z = \frac{1}{H_{\ell-1} H_\ell} \sum_{h_{\ell-1}=1}^{H_{\ell-1}} \sum_{h_\ell=1}^{H_\ell} \left(1 - \left|W_{h_{\ell-1}, h_\ell}\right|\right) \cdot z^2_{h_{\ell-1}} \qquad (75)

To provide a fair comparison, a variant of NAC+ that also uses this regularizer, called NAC+,Rz, is included (a minimal code sketch of R_z is given at the end of this subsection). Section D.3 provides an ablation study of the R_z regularizer.

Figure 13: The ability of each model to learn the arithmetic operation of addition and to backpropagate through the arithmetic layer in order to learn an image-to-scalar mapping for MNIST digits. The models are tested by extrapolating to longer sequences than those used for training. Panels: success rate, solved at iteration step, sparsity error; x-axis: extrapolation length (1 to 1000); models: NAC+,Rz, NAC+, LSTM, NALU, NAU. The NAU and NAC+,Rz models use the R_z regularizer from section 4.2.
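The following is a minimal sketch of the R_z term in equation 75, not the reference implementation. It assumes the batch statistic entering the regularizer is the per-unit mean of z² (whether the original uses the mean of z² or the square of the mean is not recoverable from the text); function and variable names are illustrative.

```python
# Sketch of the R_z regularizer in equation 75 for the NAU (illustrative only).
# The per-entry penalty (1 - |W|) * z^2 vanishes when |W| = 1 or when the hidden
# value sits at the additive identity 0; large hidden values passed through a
# partial weight (|W| < 1) are penalized the most.
import torch

def nau_rz(W: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """W: (H_prev, H) NAU weights in [-1, 1]; z: (batch, H_prev) hidden inputs."""
    z_sq = (z ** 2).mean(dim=0)                    # per-unit batch statistic, shape (H_prev,)
    penalty = (1.0 - W.abs()) * z_sq.unsqueeze(1)  # (1 - |W|) * z^2, broadcast over output units
    return penalty.mean()                          # equals 1/(H_prev * H) * sum over all entries
```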
D.3 SEQUENTIAL ADDITION WITHOUT THE Rz REGULARIZER

As an ablation study of the R_z regularizer, figure 14 shows the NAU model without the R_z regularizer. Removing the regularizer causes a reduction in the success rate. The reduction is likely larger than for sequential multiplication because the sequence length used for training is longer. The loss function is most sensitive to the 10th output in the sequence, as this has the largest scale. This causes some model instances to simply learn the mean, which appears passable for very long sequences; this is why the success rate increases for longer sequences. However, this is not a valid solution: a well-behaved model should be successful independently of the sequence length.

Figure 14: Same as figure 13, but where the NAU model does not use the R_z regularizer. Panels: success rate, solved at iteration step, sparsity error; x-axis: extrapolation length (1 to 1000); models: NAC+,Rz, NAC+, LSTM, NALU, NAU.

D.4 SEQUENTIAL MULTIPLICATION WITHOUT THE Rz REGULARIZER

As an ablation study of the R_z regularizer, figure 15 shows the NMU and NAC•,NMU models without the R_z regularizer. The success rate is somewhat similar to figure 4. However, as seen in the sparsity error plot, the solution is quite different.

Figure 15: The ability of each model to learn the arithmetic operation of multiplication and to backpropagate through the arithmetic layer in order to learn an image-to-scalar mapping for MNIST digits. The models are tested by extrapolating to longer sequences than those used for training. Panels: success rate, solved at iteration step, sparsity error; x-axis: extrapolation length (1 to 20); models: NAC•,NMU, NAC•,σ, NAC•, LSTM, NALU, NMU. The NMU and NAC•,NMU models do not use the R_z regularizer.
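As referenced in section D.1, the MSE success threshold is derived from the empirical baseline. The sketch below is illustrative only; the function name, the interpretation of "1%" as the significance level, and the use of the standard error of the mean are assumptions rather than the paper's code. The baseline MSE is collected over several seeds with the arithmetic layer fixed to the correct solution, and the threshold is the one-sided upper confidence bound of the mean baseline MSE under a student-t distribution.

```python
# Hedged sketch of the MSE success threshold from an empirical baseline
# (illustrative names and assumptions, not the reference implementation).
import numpy as np
from scipy import stats

def mse_success_threshold(baseline_mse_per_seed, alpha=0.01):
    """Upper end of a one-sided (1 - alpha) confidence interval for the mean baseline MSE."""
    mse = np.asarray(baseline_mse_per_seed, dtype=float)
    n = mse.size
    sem = mse.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    return mse.mean() + stats.t.ppf(1.0 - alpha, df=n - 1) * sem

# A model is then counted as successful at a given extrapolation length if its
# MSE at that length is below the threshold computed from the baseline seeds.
```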