# Numerical influence of ReLU'(0) on backpropagation

David Bertoin (IRT Saint Exupéry, ISAE-SUPAERO, ANITI, Toulouse, France, david.bertoin@irt-saintexupery.com), Jérôme Bolte (Toulouse School of Economics, Université Toulouse 1 Capitole, ANITI, Toulouse, France, jbolte@ut-capitole.fr), Sébastien Gerchinovitz (IRT Saint Exupéry, Institut de Mathématiques de Toulouse, ANITI, Toulouse, France, sebastien.gerchinovitz@irt-saintexupery.com), Edouard Pauwels (CNRS, IRIT, Université Paul Sabatier, ANITI, Toulouse, France, edouard.pauwels@irit.fr)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

In theory, the choice of ReLU'(0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs which occur around half of the time in 32 bits precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU'(0) = 0 seems to be the most efficient. For our experiments on ImageNet, the gain in test accuracy over ReLU'(0) = 1 was more than 10 points (two runs). We also evidence that reconditioning approaches such as batch-norm or ADAM tend to buffer the influence of the value of ReLU'(0). Overall, the message we convey is that algorithmic differentiation of nonsmooth problems potentially hides parameters that could be tuned advantageously.

1 Introduction

Nonsmooth algorithmic differentiation: The training phase of neural networks relies on first-order methods such as Stochastic Gradient Descent (SGD) [14, 9] and crucially on algorithmic differentiation [15]. The fast differentiator used in practice to compute mini-batch descent directions is the backpropagation algorithm [29, 4]. Although designed initially for differentiable problems, it is applied indifferently to smooth or nonsmooth networks. In the nonsmooth case this requires surrogate derivatives at the non-regularity points. We focus on the famous ReLU := max(0, ·), for which a value for s = ReLU'(0) has to be chosen. A priori, any value in [0, 1] bears a variational sense as it corresponds to a subgradient [27]. Yet in most libraries s = 0 is chosen; it is the case for TensorFlow [2], PyTorch [26] or Jax [10]. Why this choice? What would be the impact of a different value of s? How does this interact with other training strategies? We will use the notation backprop_s to denote backpropagation implemented with ReLU'(0) = s, for any given real number s.¹

¹Definition in Section 2. This can be coded explicitly or cheaply emulated by backprop_0. Indeed, considering $f_s : x \mapsto \mathrm{ReLU}(x) + s\,(\mathrm{ReLU}(-x) - \mathrm{ReLU}(x) + x) = \mathrm{ReLU}(x)$, we have $\mathrm{backprop}_0[f_s](0) = s$.
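The footnote's emulation can be checked directly in PyTorch. The following minimal sketch (ours, not the authors' released code) verifies that applying PyTorch's default backpropagation to $f_s$ returns $s$ at $x = 0$:

```python
import torch

def f_s(x, s):
    # f_s(x) = ReLU(x) + s * (ReLU(-x) - ReLU(x) + x) equals ReLU(x) for every x,
    # but PyTorch's backward pass (which uses ReLU'(0) = 0) returns s at x = 0.
    relu = torch.nn.functional.relu
    return relu(x) + s * (relu(-x) - relu(x) + x)

x = torch.tensor(0.0, requires_grad=True)
y = f_s(x, s=0.3)
y.backward()
print(y.item(), x.grad.item())  # 0.0 0.3: the value is ReLU(0) = 0, the surrogate derivative is s
```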
Figure 1: Left: difference between network parameters ($\ell_1$ norm) over 100 iterations within an epoch; "0 vs 0" indicates $\|\theta_{k,0} - \tilde\theta_{k,0}\|_1$, where $\tilde\theta_{k,0}$ is a second run for sanity check, and "0 vs 1" indicates $\|\theta_{k,0} - \theta_{k,1}\|_1$. Center: minimal absolute activation of the hidden layers within the k-th mini-batch, before ReLU. At iteration 65, the jump on the left coincides with the drop in the center, and with an exact evaluation of ReLU(0). Right: illustration of the bifurcation zone at iteration k = 65 in a randomly generated network weight parameter plane (iteration 65 in the center). The quantity represented is the absolute value of the neuron of the first hidden layer which is found to be exactly zero within the mini-batch before application of ReLU (exact zeros are represented in white).

What does backpropagation compute? A popular thinking is that the impact of s = ReLU'(0) should be limited as it concerns the value at a single point. This happens to be exact in theory, but for surprisingly complex reasons related to Whitney stratification (see Section 2 and references therein):

- For a given neural network architecture, backprop_s outputs a gradient almost everywhere, independently of the choice of s ∈ ℝ; see [7] and [5] for a detailed treatment of ReLU networks.
- Under proper randomization of the initialization and the step-size of SGD, with probability 1 the value of s has no impact during training with backprop_s as a gradient surrogate; see [8, 6].

In particular, the set of network parameters such that backprop_0 ≠ backprop_1 is negligible, and the vast majority of SGD sequences produced by training neural networks are not impacted by changing the value of s = ReLU'(0). These results should in principle close the question about the role of ReLU'(0). Yet, this is not what we observe in practice with the default settings of usual libraries.

A surprising experiment and the bifurcation zone: An empirical observation on MNIST triggered our investigations: consider a fully connected ReLU network, and let $(\theta_{k,0})_{k \in \mathbb{N}}$ and $(\theta_{k,1})_{k \in \mathbb{N}}$ be two training weight sequences obtained using SGD, with the same random initialization and the same mini-batch sequence but choosing respectively s = 0 and s = 1 in PyTorch [26]. As depicted in Figure 1 (left), the two sequences differ. A closer look shows that the sudden divergence is related to what we call the bifurcation zone, i.e., the set $S_{0,1}$ of weights such that backprop_0 ≠ backprop_1. As mentioned previously, this contradicts theory, which predicts that the bifurcation zone is met with null probability during training. This contradiction is due to the precision of floating point operations and, to a lesser extent, to the size of deep learning problems. Indeed, rounding schemes used for inexact arithmetics over the reals (which set to zero all values below a certain threshold) may thicken negligible sets. This is precisely what happens in our experiments (Figure 1).

The role of numerical precision: Contrary to numerical linear algebra libraries such as numpy, which operate by default in 64 bits precision, the default choice in PyTorch is 32 bits precision (as in TensorFlow or Jax). We thus modulated machine precision to evaluate the importance of the bifurcation zone in Section 3. In 32 bits, we observed that empirically the zone occupies from about 10% to 90% of the network weight space. It becomes invisible in 64 bits precision even for quite large architectures, while, in 16 bits, it systematically fills up the whole space. Although numerical precision is the primary cause of the apparition of the zone, we identify other factors such as network size and sample size. Let us mention that low precision neural network training is a topic of active research [33, 20, 11, 16]; see also [28] for an overview. Our investigations are complementary as we focus on the interplay between nonsmoothness and numerical precision.
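As a toy illustration of this thickening effect (ours, not an experiment from the paper), the following snippet exhibits a quantity that is mathematically nonzero but rounds to exactly 0.0 in float32, while remaining nonzero in float64; at such an exact zero the ReLU backward pass falls back to the chosen surrogate value ReLU'(0):

```python
import torch

# (1 + 1e-9) - 1 is mathematically 1e-9, but 1 + 1e-9 rounds to 1.0 in float32.
a32 = torch.tensor(1.0, dtype=torch.float32) + 1e-9
a64 = torch.tensor(1.0, dtype=torch.float64) + 1e-9
pre32 = a32 - 1.0   # exactly 0.0 in float32
pre64 = a64 - 1.0   # about 1e-9 in float64
print(pre32.item(), pre64.item())

# At an exact zero, the backward pass of ReLU uses the surrogate ReLU'(0)
# (0 by default in PyTorch), whereas the float64 computation stays on the smooth side.
x = pre32.clone().requires_grad_(True)
torch.relu(x).backward()
print(x.grad.item())  # 0.0 with PyTorch's convention ReLU'(0) = 0
```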
Impact on learning: The next natural question is to measure the impact of the choice of s = ReLU'(0) in machine learning terms. In Section 4, we conduct extensive experiments combining different architectures (fully connected, VGG, ResNet), datasets (MNIST, SVHN, CIFAR10, ImageNet) and other learning factors (Adam optimizer, batch normalization, dropout). In 32 bits numerical precision (the default in PyTorch or TensorFlow), we consistently observe an effect of choosing s ≠ 0. We observe a significant decrease in terms of test accuracy as |s| increases; this can be explained by chaotic oscillatory behaviors induced during training. In some cases gradients even explode and learning cannot be achieved. The sensitivity to this effect highly depends on the problem at hand, in particular on the network structure and the dataset. On the other hand, the choice s = 0 seems to be the most stable. We also observe that both batch normalization [21] and, to a lesser degree, the Adam optimizer [22] considerably mitigate this effect. All our experiments are done using PyTorch [26]; we provide the code to generate all figures presented in this manuscript. One important message is that, even if the default choice s = 0 seems to be the most stable, our experiments show a counter-intuitive phenomenon that illustrates the interplay between numerical precision and nonsmoothness, and calls for caution when learning nonsmooth networks.

Outline of the paper: In Section 2 we recall elements of nonsmooth algorithmic differentiation which are key to understanding the mathematics underlying our experiments. Most results were published in [7, 8]; we provide more detailed pointers to this literature in Appendix A.1. In Section 3 we describe investigations of the bifurcation zone and the factors influencing its importance, using fully connected networks on the MNIST dataset. Neural network training experiments are detailed in Section 4, with further experimental details and additional experiments reported in Appendix D.

2 On the mathematics of backpropagation for ReLU networks

This section recalls recent advances on the mathematical properties of backpropagation, in particular the almost sure absence of impact of ReLU'(0) on the learning phase (assuming exact arithmetic over the reals). The main mathematical tools are the conservative fields developed in [7]; we provide a simplified overview which is applicable to a wide class of neural networks.

2.1 Empirical risk minimization and backpropagation

Given a training set $\{(x_i, y_i)\}_{i=1,\dots,N}$, the supervised training of a neural network f consists in minimizing the empirical risk:
$$\min_{\theta \in \mathbb{R}^P} \ J(\theta) := \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i, \theta), y_i) \tag{1}$$
where $\theta \in \mathbb{R}^P$ are the network's weight parameters and $\ell$ is a loss function. The problem can be rewritten abstractly: for each $i = 1, \dots, N$ and $\theta \in \mathbb{R}^P$, $\ell(f(x_i, \theta), y_i) = l_i(\theta)$, where the function $l_i : \mathbb{R}^P \to \mathbb{R}$ is a composition of the form
$$l_i = g_{i,M} \circ g_{i,M-1} \circ \cdots \circ g_{i,1} \tag{2}$$
where, for each $j = 1, \dots, M$, the function $g_{i,j}$ is locally Lipschitz with appropriate input and output dimensions. A concrete example of what the functions $g_{i,j}$ look like is given in Appendix A.2 in the special case of fully connected ReLU networks. Furthermore, we associate with each $g_{i,j}$ a generalized Jacobian $J_{i,j}$ which is such that $J_{i,j}(w) = \mathrm{Jac}\,g_{i,j}(w)$ whenever $g_{i,j}$ is differentiable at $w$, where Jac denotes the usual Jacobian.
The value of $J_{i,j}$ at the nondifferentiability loci of $g_{i,j}$ can be arbitrary for the moment. The backpropagation algorithm is an automatized implementation of the rules of differential calculus: for each $i = 1, \dots, N$, we have
$$\mathrm{backprop}[l_i](\theta) = J_{i,M}\big(g_{i,M-1} \circ \cdots \circ g_{i,1}(\theta)\big)\; J_{i,M-1}\big(g_{i,M-2} \circ \cdots \circ g_{i,1}(\theta)\big) \cdots J_{i,1}(\theta). \tag{3}$$
Famous autograd libraries such as PyTorch [26] or TensorFlow [1] implement dictionaries of functions g with their corresponding generalized Jacobians J, as well as efficient numerical implementations of the quantities defined in (3).

2.2 ReLU networks training

Our main example is based on the function ReLU: ℝ → ℝ defined by ReLU(x) = max{x, 0}. It is differentiable save at the origin and satisfies ReLU'(x) = 0 for x < 0 and ReLU'(x) = 1 for x > 0. The value of the derivative at x = 0 could be arbitrary in [0, 1], as we have ∂ReLU(0) = [0, 1], where ∂ denotes the subgradient from convex analysis. Let us insist on the fact that any value within [0, 1] has a variational meaning. For example, PyTorch and TensorFlow use ReLU'(0) = 0.

Throughout the paper, and following the lines of [8], we say that a function $g : \mathbb{R}^p \to \mathbb{R}^q$ is elementary log-exp if it can be described by a finite compositional expression involving the basic operations +, −, ×, / as well as the exponential and logarithm functions, inside their domains of definition. Examples include the logistic loss $\log(1 + e^{-x})$ on ℝ, the multivariate Gaussian density $(2\pi)^{-K/2} \exp(-\sum_{k=1}^{K} x_k^2/2)$ on $\mathbb{R}^K$, and the softmax function $\big(e^{x_i} / \sum_{k=1}^{K} e^{x_k}\big)_{1 \le i \le K}$ on $\mathbb{R}^K$. The expressions $x^3/x^2$ and $\exp(-1/x^2)$ do not fit this definition because evaluation at x = 0 cannot be defined by the formula. Roughly speaking, a computer evaluating an elementary log-exp expression should not output any NaN error for any input.

Definition 1 (ReLU network training). Assume that in (1) the function $\ell$ is elementary log-exp, the network f has an arbitrary structure, and the functions $g_{i,j}$ in (2) are either elementary log-exp or consist in applying the ReLU function to some coordinates of their input. We then call the problem in (1) a ReLU network training problem. Furthermore, for any $s \in \mathbb{R}$, we denote by backprop_s the quantity defined in (3) when ReLU'(0) = s for all ReLU functions involved in the composition (2).

Other nonsmooth activation functions: The ReLU operation actually allows to express many other types of nonsmoothness such as absolute values, maxima, minima, quantiles (med for median) or soft-thresholding (st). For any x, y, z ∈ ℝ, we indeed have
$$|x| = \mathrm{ReLU}(2x) - x, \qquad 2\max(x, y) = |x - y| + x + y, \qquad \min(x, y) = -\max(-x, -y),$$
$$\mathrm{med}(x, y, z) = \min\big(\max(x, y), \max(x, z), \max(y, z)\big), \qquad \mathrm{st}(x) = \mathrm{ReLU}(x - 1) - \mathrm{ReLU}(-x - 1).$$
Definition 1 is thus much more general than it may seem, since it allows, for example, to express convolutional neural networks with max-pooling layers such as VGG or ResNet architectures, which correspond to the models considered in the experimental section (although we do not re-program pooling using ReLU). The following theorem is due to [7], with an alternative version in [8].

Theorem 1 (Backprop returns a gradient a.e.). Consider a ReLU network training problem (1) as in Definition 1 and $T \ge 1$. Define $S \subset \mathbb{R}^P$ as the complement of the set
$$\big\{\theta \in \mathbb{R}^P :\ l_i \text{ is differentiable at } \theta \text{ and } \mathrm{backprop}_s[l_i](\theta) = \nabla l_i(\theta),\ \forall i \in \{1, \dots, N\},\ \forall s \in [-T, T]\big\}.$$
Then S is contained in a finite union of embedded differentiable manifolds of dimension at most P − 1 (and in particular has Lebesgue measure zero).
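To make the geometry behind Theorem 1 concrete, here is a minimal single-neuron example of our own (not taken from the paper). Consider $l(\theta) = \mathrm{ReLU}(\theta^\top x)$ for a fixed input $x \neq 0$. Then
$$\mathrm{backprop}_s[l](\theta) = \mathbf{1}_{\{\theta^\top x > 0\}}\, x + s\, \mathbf{1}_{\{\theta^\top x = 0\}}\, x .$$
The function $l$ is differentiable outside the hyperplane $H = \{\theta : \theta^\top x = 0\}$, and there $\mathrm{backprop}_s[l](\theta) = \nabla l(\theta)$ for every $s$; the parameters where the output depends on $s$ are contained in $H$, a manifold of dimension $P - 1$, in accordance with Theorem 1.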
Although this theorem looks natural, it is a nontrivial result about the backpropagation algorithm, and it led to the introduction of conservative fields [7]. It implies that all choices for s = ReLU'(0) in [0, 1] = ∂ReLU(0) are essentially equivalent modulo a negligible set S. Perhaps more surprisingly, s can be chosen arbitrarily in ℝ without breaking this essential property of backprop. The set S is called the bifurcation zone throughout the manuscript. For ReLU network training problems, the bifurcation zone is a Lebesgue null set and is actually contained locally in a finite union of smooth objects of dimension strictly smaller than the ambient dimension. This geometric result reveals a surprising link between backpropagation and Whitney stratifications, as described in [7, 8]. In any case the bifurcation zone is completely negligible. Note that the same result holds if we allow each different call to the ReLU function to use a different value for ReLU'(0).

2.3 ReLU network training with SGD

Let $(B_k)_{k \in \mathbb{N}}$ denote a sequence of mini-batches with sizes $|B_k| \in \{1, \dots, N\}$ for all k, and let $\alpha_k > 0$ be the learning rate. Given initial weights $\theta_0 \in \mathbb{R}^P$ and any parameter $s \in \mathbb{R}$, the SGD training procedure of f consists in applying the recursion
$$\theta_{k+1,s} = \theta_{k,s} - \gamma\, \alpha_k \sum_{n \in B_k} \mathrm{backprop}_s\big[\ell(f(x_n, \theta_{k,s}), y_n)\big], \tag{4}$$
where γ > 0 is a step-size parameter. Note that we explicitly wrote the dependency of the sequence on s = ReLU'(0). According to Theorem 1, if the initialization $\theta_0$ is chosen randomly, say, uniformly in a ball, a hypercube, or with i.i.d. Gaussian entries, then with probability 1, $\theta_0$ does not fall in the bifurcation zone S. Intuitively, since S is negligible, the odds that one of the iterates produced by the algorithm falls on S are very low. As a consequence, varying s in the recursion (4) does not modify the sequence. This rationale is actually true for almost all values of γ. This provides a rigorous statement of the idea that the choice of ReLU'(0) does not affect neural network training. The following result is based on arguments developed in [8]; see also [6] in a probabilistic context.

Theorem 2 (ReLU'(0) does not impact training with probability 1). Consider a ReLU network training problem as in Definition 1. Let $(B_k)_{k \in \mathbb{N}}$ be a sequence of mini-batches with $|B_k| \in \{1, \dots, N\}$ for all k, and let $\alpha_k > 0$ be the associated learning rate parameter. Choose $\theta_0$ uniformly at random in a hypercube and γ uniformly in a bounded interval $I \subset \mathbb{R}_+$. Let $s \in \mathbb{R}$, set $\theta_{0,s} = \theta_0$, and consider the recursion given in (4). Then, with probability one, for all $k \in \mathbb{N}$, $\theta_{k,s} = \theta_{k,0}$.

3 Surprising experiments on a simple feedforward network

3.1 ReLU'(0) has an impact

Even though the ReLU activation function is non-differentiable at 0, autograd libraries such as PyTorch [26] or TensorFlow [1] implement its derivative with ReLU'(0) = 0. What happens if one chooses ReLU'(0) = 1? The popular answer to this question is that it should have no effect. Theorems 1 and 2 provide a formal justification, which is far from being trivial.

A 32 bits MNIST experiment: We ran a simple experiment to confirm this answer. We initialized two fully connected neural networks $f_0$ and $f_1$ of size 784 × 2000 × 128 × 10 with the same weights $\theta_{0,0} = \theta_{0,1} \in \mathbb{R}^P$, chosen at random. Using the MNIST dataset [24], we trained $f_0$ and $f_1$ with the same sequence of mini-batches $(B_k)_{k \in \mathbb{N}}$ (mini-batch size 128), using the recursion in (4) for s = 0 and s = 1, with a fixed $\alpha_k = 1$ and γ chosen uniformly at random in [0.01, 0.012]. At each iteration k, we computed the sum $\|\theta_{k,0} - \theta_{k,1}\|_1$ of the absolute differences between the coordinates of $\theta_{k,0}$ and $\theta_{k,1}$. As a sanity check, we actually computed $\theta_{k,0}$ a second time, denoting this by $\tilde\theta_{k,0}$, using a third network to control for sources of divergence in our implementation. Results are reported in Figure 1. The experiment was run using PyTorch [26] on a CPU.
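A minimal sketch of such a paired experiment is given below (our illustration, not the authors' script): a custom autograd function with a configurable ReLU'(0), two copies of the same network with identical initialization, the same mini-batch stream for both runs, and the $\ell_1$ distance between parameters tracked at each step. Layer sizes follow the text; the data here is a random placeholder rather than the MNIST loader, and the learning rate is arbitrary.

```python
import copy
import torch
import torch.nn as nn

class ReLUAlpha(torch.autograd.Function):
    """ReLU whose backward pass uses ReLU'(0) = alpha instead of PyTorch's default 0."""
    @staticmethod
    def forward(ctx, input, alpha):
        ctx.save_for_backward(input)
        ctx.alpha = alpha
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        deriv = torch.where(input > 0, torch.ones_like(input),
                            torch.where(input < 0, torch.zeros_like(input),
                                        torch.full_like(input, ctx.alpha)))
        return grad_output * deriv, None   # no gradient with respect to alpha

def make_mlp():
    # Same layer sizes as in the experiment above (784-2000-128-10).
    return nn.ModuleList([nn.Linear(784, 2000), nn.Linear(2000, 128), nn.Linear(128, 10)])

def forward(net, x, alpha):
    h = ReLUAlpha.apply(net[0](x), alpha)
    h = ReLUAlpha.apply(net[1](h), alpha)
    return net[2](h)

torch.manual_seed(0)
net0 = make_mlp()
net1 = copy.deepcopy(net0)                         # identical initialization
opts = [torch.optim.SGD(n.parameters(), lr=0.01) for n in (net0, net1)]
loss_fn = nn.CrossEntropyLoss()

for k in range(100):
    # Placeholder mini-batch; the real experiment feeds the same MNIST batches to both runs.
    x, y = torch.randn(128, 784), torch.randint(0, 10, (128,))
    for net, opt, alpha in ((net0, opts[0], 0.0), (net1, opts[1], 1.0)):
        opt.zero_grad()
        loss_fn(forward(net, x, alpha), y).backward()
        opt.step()
    diff = sum((p0 - p1).abs().sum().item()
               for p0, p1 in zip(net0.parameters(), net1.parameters()))
    print(k, diff)   # stays exactly 0 until some pre-activation is exactly 0 in float32
```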
ReLU'(0) has an impact: First, we observe no difference between $\theta_{k,0}$ and $\tilde\theta_{k,0}$, which shows that we have controlled all possible sources of divergence in PyTorch. Second, while no difference between $\theta_{k,0}$ and $\theta_{k,1}$ is expected (Theorem 2), we observe a sudden deviation of $\|\theta_{k,0} - \theta_{k,1}\|_1$ at iteration 65, which then increases in a smooth way. The deviation is sudden, as the norm is exactly zero before iteration 65 and jumps above one after. Therefore it cannot be explained by an accumulation of small rounding errors throughout iterations, as this would produce a smooth divergence starting at the first iteration. This suggests that there is a special event at iteration 65. The center part of Figure 1 displays the minimal absolute value of the neurons of the first hidden layer evaluated on the current mini-batch, before the application of ReLU. It turns out that at iteration 65 this minimal value is exactly 0, resulting in a drop in the center of Figure 1. This means that the divergence is actually due to an iterate of the sequence falling exactly on the bifurcation zone. According to Theorem 2, this event is so exceptional that it should never be seen.

Practice and Theory: Numerical precision vs Genericity: This contradiction can be resolved as follows: the minimal absolute value in Figure 1 oscillates between $10^{-6}$ and $10^{-8}$, which is roughly the value of machine precision in 32 bits float arithmetic. This machine precision is of the order of $10^{-16}$ in 64 bits floating arithmetic, which is orders of magnitude smaller than the typical values represented in Figure 1. And indeed, performing the same experiment in 64 bits precision, the divergence of $\|\theta_{k,0} - \theta_{k,1}\|_1$ disappears and the algorithm can actually be run for many epochs without any divergence between the two sequences. This is represented in Figure 7 in Appendix B. We also report similar results using ReLU6 [19] in place of ReLU on a similar network.

3.2 Relative volume of the bifurcation zone and relative gradient variation

The previous experiment suggests that the mini-batch SGD algorithm (4) crossed the bifurcation zone:
$$S_{01} = \big\{\theta \in \mathbb{R}^P : \exists i \in \{1, \dots, N\},\ \mathrm{backprop}_0[l_i](\theta) \neq \mathrm{backprop}_1[l_i](\theta)\big\} \subset S. \tag{5}$$
This unlikely phenomenon is due to finite numerical precision arithmetic, which thickens the negligible set $S_{01}$ in proportion to the machine precision threshold. This is illustrated on the right of Figure 1, which represents the bifurcation zone at iteration 65 in a randomly generated hyperplane (uniformly among 2-dimensional hyperplanes) centered around the weight parameters which generated the bifurcation in the previous experiment (evaluation on the same mini-batch). The white area corresponds to some entries being exactly zero, i.e., below the 32 bits machine precision threshold, before application of ReLU. On the other hand, in 64 bits precision, the same representation is much smoother and does not contain exact zeros (see Figure 7 in Appendix B). This confirms that the level of floating point arithmetic precision explains the observed divergence. We now estimate the relative volume of this bifurcation zone by Monte Carlo sampling (see Appendix C for details).
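Schematically, such an estimate reduces to a membership test: draw random weights, compare backprop_0 and backprop_1 on every mini-batch, and count the draws for which they differ. The toy sketch below (ours; synthetic float32 data and illustrative sizes, the precise protocol follows and is detailed in Appendix C) uses the $f_s$ emulation from the introduction:

```python
import torch

torch.manual_seed(1)

def relu_s(x, s):
    # ReLU whose surrogate derivative at 0 is s (the f_s emulation from the introduction).
    r = torch.nn.functional.relu
    return r(x) + s * (r(-x) - r(x) + x)

def sample_mlp(sizes):
    # Kaiming-uniform-style draw: weights uniform in [-a, a] with a = sqrt(6 / fan_in).
    return [torch.empty(m, n).uniform_(-(6.0 / m) ** 0.5, (6.0 / m) ** 0.5).requires_grad_(True)
            for m, n in zip(sizes[:-1], sizes[1:])]

def loss_with_s(weights, x, y, s):
    h = x
    for W in weights[:-1]:
        h = relu_s(h @ W, s)
    return torch.nn.functional.cross_entropy(h @ weights[-1], y)

def in_bifurcation_zone(weights, batches):
    # theta lies in S01 iff backprop_0 and backprop_1 differ on at least one mini-batch term.
    for x, y in batches:
        g0 = torch.autograd.grad(loss_with_s(weights, x, y, 0.0), weights)
        g1 = torch.autograd.grad(loss_with_s(weights, x, y, 1.0), weights)
        if any(not torch.equal(a, b) for a, b in zip(g0, g1)):
            return True
    return False

# Toy Monte Carlo estimate on synthetic data; the paper iterates over the full MNIST dataset,
# where exact zeros (and hence membership in S01) are far more frequent (cf. Table 1).
batches = [(torch.randn(256, 784), torch.randint(0, 10, (256,))) for _ in range(10)]
draws = [sample_mlp([784, 256, 256, 256, 10]) for _ in range(20)]
print(sum(in_bifurcation_zone(w, batches) for w in draws) / len(draws))
```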
All experiments are performed using PyTorch [26] on GPU.

Experimental procedure, weight randomization: We randomly generate a set of parameters $\{\theta_j\}_{j=1}^M$, with M = 1000, for a fully connected network architecture f composed of L hidden layers. Given two consecutive layers, respectively composed of m and m' neurons, the weights of the corresponding affine transform are drawn independently, uniformly in $[-\alpha, \alpha]$ with $\alpha = \sqrt{6/m}$. This is the default weight initialization scheme in PyTorch (Kaiming-Uniform [17]). Given this sample of parameters, iterating on the whole MNIST dataset, we approximate the proportion of $\theta_j$ for which $\mathrm{backprop}_0[l_i](\theta_j) \neq \mathrm{backprop}_1[l_i](\theta_j)$ for some i, for different networks and under different conditions (see Appendix C for details).

Impact of the floating-point precision: Using a fixed architecture of three hidden layers of 256 neurons each, we empirically measured the relative volume of $S_{01}$ using the above experimental procedure, varying the numerical precision. Table 1 reports the obtained estimates. As shown in Table 1, line 1, at 16 bits floating-point precision, all drawn weights $\{\theta_i\}_{i=1}^M$ fall within $S_{01}$. In sharp contrast, when using 64 bits precision, none of the sampled weights belongs to $S_{01}$. This proportion is 40% in 32 bits precision. For the rare impacted mini-batches (Table 1, line 2), the relative change in norm is above a factor 20, and higher in 16 bits precision (Table 1, line 3). These results confirm that the floating-point arithmetic precision is key in explaining the impact of ReLU'(0) during backpropagation, affecting both the frequency and the magnitude of the differences.

| Floating-point precision | 16 bits | 32 bits | 64 bits |
|---|---|---|---|
| Proportion of $\{\theta_i\}_{i=1}^M$ in $S$ (CI 5%) | 100% | 40% | 0% |
| Proportion of impacted mini-batches (CI 13%) | 0.05% | 0.0002% | 0% |
| Relative $L_2$ difference for impacted mini-batches (1st quartile, median, 3rd quartile) | (98, 117, 137) | (19, 25, 47) | (0, 0, 0) |

Table 1: Impact of S according to the floating-point precision, for a fully connected neural network (784 × 256 × 256 × 256 × 10) on MNIST. First line: proportion of drawn weights $\theta_i$ such that at least one mini-batch results in a difference between backprop_0 and backprop_1; CI stands for Confidence Interval with 5% risk (Hoeffding CI, see Appendix C). Second line: overall proportion of mini-batch/weight-vector pairs causing a difference between backprop_0 and backprop_1 (McDiarmid CI, see Appendix C). Third line: distribution of $\|\mathrm{backprop}_0 - \mathrm{backprop}_1\|_2 / \|\mathrm{backprop}_0\|_2$ for the affected mini-batch/weight-vector pairs.

Impact of sample and mini-batch size: Given a training set $\{(x_n, y_n)\}_{n=1,\dots,N}$ and a random variable $\theta \in \mathbb{R}^P$ with distribution $P_\theta$, we estimate the probability that $\theta \in S_{01} \subset S$. Intuitively, this probability should increase with the sample size N. We perform this estimation for a fixed architecture of 4 hidden layers composed of 256 neurons each, while varying the sample size. Results are reported in Figure 2. For both the 16 and the 32 bits floating-point precisions, our estimate indeed increases with the sample size, while we do not find any sampled weights in $S_{01}$ in 64 bits precision. We also found that the influence of the mini-batch size is not significant.

Impact of network size: To evaluate the impact of the network size, we carried out a similar Monte Carlo experiment, varying the depth and width of the network. Firstly, we fixed the number of hidden layers to 3. Following the same experimental procedure, we empirically estimated the relative volume of $S_{01}$, varying the width of the hidden layers.
The results, reported in Figure 2, show that increasing the number of neurons per layer increases the probability of finding a random $\theta \sim P_\theta$ in $S_{01}$, for both 16 and 32 bits floating-point precision. In 64 bits, even with the largest width tested (1024 neurons), no sampled weight parameter is found in $S_{01}$. Similarly, we repeated the experiment varying the network depth and fixing, this time, the width of the layers to 256 neurons. Anew, the results, reported in Figure 2, show that increasing the network depth increases the probability that $\theta \sim P_\theta$ belongs to $S_{01}$ for both 16 and 32 bits floating-point precision, while this probability is zero in 64 bits precision. This shows that the size of the network, both in terms of number of layers and size of layers, is positively related to the effect of the choice of s in backpropagation. On the other hand, the fact that neither the network depth, the width, nor the number of samples impacts the 64 bits case suggests that, within our framework, numerical precision is the primary factor of deviation.

Figure 2: Influence of different size parameters on the proportion of drawn weight vectors $\theta_i$ such that at least one mini-batch results in a difference between backprop_0 and backprop_1, on the MNIST dataset. Confidence intervals at risk level 5% result in a variation of 5% for all represented proportions. First: 4 hidden layers with 256 neurons and batch size 256, varying the size of the training set. Second: 3 hidden layer network with mini-batch size 256, varying the number of neurons per layer. Third: 256 neurons per layer and mini-batch size 256, varying the number of layers. Fourth: 3 hidden layers with 256 neurons per layer, varying the mini-batch size.

4 Consequences for learning

4.1 Benchmarks and implementation

Datasets and networks: We further investigate the effect of the phenomenon described in Section 3 in terms of learning, using the CIFAR10 dataset [23] and the VGG11 architecture [31]. To confirm our findings in alternative settings, we also use the MNIST [24], SVHN [25] and ImageNet [12] datasets, fully connected networks (3 hidden layers of size 2048), and the ResNet18 and ResNet50 architectures [18]. Additional details on the different architectures and datasets are found in Appendix D.1. By default, unless stated otherwise, we use the SGD optimizer. We also investigated the effect of batch normalization [21], dropout [32], the Adam optimizer [22], as well as numerical precision. All the experiments in this section are run in PyTorch [26] on GPU. For each training experiment presented in this section (except the ImageNet experiments), we use the optuna library [3] to tune learning rates for each experimental condition; see also Appendix D.2.

4.2 Effect on training and test errors

We first consider training a VGG11 architecture on CIFAR10 using the SGD optimizer. For different values of ReLU'(0), we train this network ten times with random initializations under 32 bits arithmetic precision. The results are depicted in Figure 3. Without batch normalization, varying the value of ReLU'(0) beyond a magnitude of 0.1 has a detrimental effect on the test accuracy, resulting in a concave-shaped curve with maximum around ReLU'(0) = 0. On the other hand, the decrease of the training loss with the number of epochs suggests that choosing ReLU'(0) ≠ 0 induces jiggling behaviors with possible sudden jumps during training. Note that the choice ReLU'(0) = 0 leads to a smooth decrease, and that the magnitude of the jumps for other values is related to the magnitude of the chosen value. This is consistent with the interpretation that changing the value of ReLU'(0) has an excitatory effect on training.
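These experiments require swapping the value of ReLU'(0) inside an off-the-shelf architecture. One way to do this on a stock torchvision VGG11 is sketched below (our illustration, reusing the $f_s$ emulation; the paper's exact VGG11 variant for CIFAR10 and its released code may differ, and data preprocessing is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg11

class ReLUS(nn.Module):
    """Drop-in replacement for nn.ReLU whose backward pass uses ReLU'(0) = s (f_s emulation)."""
    def __init__(self, s):
        super().__init__()
        self.s = s

    def forward(self, x):
        return F.relu(x) + self.s * (F.relu(-x) - F.relu(x) + x)

def replace_relus(module, s):
    # Recursively swap every nn.ReLU submodule for ReLUS(s).
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, ReLUS(s))
        else:
            replace_relus(child, s)

model = vgg11(num_classes=10)   # 10 classes for CIFAR10
replace_relus(model, s=0.5)
print(model.features[1])        # ReLUS() instead of ReLU(inplace=True)
```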
We observed qualitatively similar behaviors for a fully connected network on MNIST and a ResNet18 on CIFAR10 (Appendix D). Sensitivity to the magnitude of ReLU'(0) depends on the network architecture: our fully connected network on MNIST is less sensitive to this value than VGG11 and ResNet18. The latter shows a very high sensitivity, since for values above 0.2 training becomes very unstable and almost impossible.

Figure 3: Training a VGG11 network on CIFAR10 with SGD. Left: test accuracy as a function of ReLU'(0), with and without batch normalization. Right: training loss without batch-norm. For each experiment we performed 10 random initializations, depicted by the boxplots on the left and the filled contours on the right (standard deviation).

These experiments complement the preliminary results obtained in Section 3. In particular, choosing different values for ReLU'(0) has an effect: it induces a chaotic behavior during training which affects test accuracy. The default choice ReLU'(0) = 0 seems to provide the best performance. To conclude, we conducted four training experiments for ResNet50 on the ImageNet dataset [12] using the SGD optimizer. These were conducted with a fixed learning rate, contrary to the results reported above. We observe that switching from ReLU'(0) = 0 to 1 results in a massive drop in test accuracy, from around 75% to 63% or 55% for the two runs.

4.3 Mitigating factors: numerical precision, batch-normalization and Adam

We analyze below the combined effects of the variations of ReLU'(0) with numerical precision or classical reconditioning methods: Adam, batch-normalization and dropout.

Batch-normalization: As represented in Figure 3, batch normalization [21] not only allows to attain higher test accuracy, but also completely filters out the effect of the choice of the value of ReLU'(0), resulting in a flat curve for test accuracy. This is consistent with what we observed on the training loss (Figure 12 in Appendix D.3), for which different choices of ReLU'(0) lead to indistinguishable training loss curves. This experiment suggests that batch normalization has a significant impact in reducing the effect of the choice of ReLU'(0). This observation was confirmed with a very similar behavior on the MNIST dataset with a fully connected network (Figure 9 in Appendix D.2). We could observe a similar effect on CIFAR10 using a ResNet18 architecture (see Appendix D.5); however, in this case the value of ReLU'(0) still has a significant impact on the test error, the ResNet18 architecture being much more sensitive.

Using the Adam optimizer: The Adam optimizer [22] is among the most popular algorithms for neural network training; it combines adaptive step-size strategies with momentum. Adaptive step-sizes act as diagonal preconditioners for gradient steps [22, 13] and can therefore be seen as having an effect on the loss landscape of neural network training problems. We trained a VGG11 network on both CIFAR10 and SVHN using the Adam optimizer. The results are presented in Figure 4. We observe a qualitatively similar behavior as in the experiments of Section 4.2, but a much lower sensitivity to the magnitude of ReLU'(0).
In other words, the use of the Adam optimizer mitigates the effect of this choice, both in terms of test errors and by buffering the magnitude of the sometimes chaotic effect induced on training loss optimization (Figure 13 in Appendix D.3).

Increasing numerical precision: As shown in Appendix D.3, using 64 bits floating precision on VGG11 with CIFAR10 cancels out the effect of ReLU'(0) = 1, in coherence with Section 3. More specifically, ReLU'(0) = 1 in 64 bits precision obtains similar performance as ReLU'(0) = 0 in 32 bits precision. Furthermore, the numerical precision has barely any effect when ReLU'(0) = 0. Finally, we remark that in 16 bits with ReLU'(0) = 1, training is extremely unstable, so that we were not able to train the network in this setting.

Figure 4: Similar to Fig. 3, with VGG11 and the Adam optimizer, on CIFAR10 (left) and SVHN (right).

Figure 5: Monte Carlo estimation of the relative volume of the bifurcation zone on CIFAR10 and SVHN with a VGG11 network (panels: CIFAR-VGG11, CIFAR-VGG11-BatchNorm, SVHN-VGG11-BatchNorm; x-axis: sample size).

Combination with dropout: Dropout [32] is another algorithmic process to regularize deep neural networks. We investigated its effect when combined with varying choices of ReLU'(0). We used a fully connected neural network trained on the MNIST dataset with different values of the dropout probability. The results are reported in Figure 11 in Appendix D.2. We did not observe any joint effect of the dropout probability and the magnitude of ReLU'(0) in this context.

4.4 Back to the bifurcation zone: more neural nets and the effects of batch-norm

The training experiments are complemented by a Monte Carlo estimation of the relative volume of the bifurcation zone similar to the one performed in Section 3 (same experimental setting). To avoid random outputs we force the GPU to compute convolutions deterministically. Examples of results are given in Figure 5. Additional results for a fully connected network on MNIST and a ResNet18 on CIFAR10 are shown in Appendix D. We consistently observe a high probability of finding an example in the bifurcation zone for large sample sizes. Several comments are in order.

Numerical precision: Numerical precision is the major factor in the thickening of the bifurcation zone. In comparison to the 32 bits experiments, 16 bits precision dramatically increases its relative importance. We also considered 64 bits precision on SVHN, a rather large dataset. Due to the computational cost, we only drew 40 random weights and observed no bifurcation on any of the terms of the loss whatsoever. This is consistent with the experiments conducted in Section 3 and suggests that, within our framework, 64 bits precision is the main mitigating factor for our observations.

Batch normalization: In all our relative volume estimation experiments, we observe that batch normalization has a very significant effect on the proportion of examples found in the bifurcation zone. In 32 bits precision, the relative size of this zone increases with the addition of batch normalization; similar observations were made in all experiments presented in Appendix D. This is a counter-intuitive behavior, as we have observed that batch normalization increases test accuracy and mitigates the effect of ReLU'(0). In 16 bits precision, on the other hand, the addition of batch normalization seems to actually decrease the size of the bifurcation zone. Batch normalization therefore does not result in the same qualitative effect depending on the arithmetic precision. These observations open many questions which will be the topic of future research.
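The relative-volume experiments above rely on bitwise-reproducible gradients and on controlling the arithmetic precision. In PyTorch this is typically achieved with switches along the following lines (a sketch of standard settings, not the exact configuration used by the authors):

```python
import torch

# Force deterministic (slower) convolution kernels so that repeated backprops are bitwise identical.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)   # errors out on remaining nondeterministic ops

# Select the arithmetic precision under study (32 bits is PyTorch's default).
torch.set_default_dtype(torch.float64)     # or torch.float32; 16 bits is usually set per-model/tensor
model = torch.nn.Linear(10, 10)            # modules created afterwards use the default dtype
x = torch.randn(4, 10)
print(model(x).dtype)                      # torch.float64
```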
4.5 Total number of ReLU calls during training

We consider the MNIST dataset with a fully connected ReLU network, varying the number of layers in 1, ..., 6 and the number of neurons per layer in 16, 64, 256, 512. For 16 and 32 bits precision, we perform 100 epochs and record the proportion of ReLU(0) calls over all ReLU calls during training. Figure 6 suggests that, for a given precision, the number of times the bifurcation zone is met is roughly proportional to the total number of ReLU calls during training, independently of the architecture used (number of layers, neurons per layer). This suggests that the total number of ReLU calls during training is an important factor in understanding the bifurcation phenomenon. Broader investigation of this conjecture will be a matter of future work.

Figure 6: Proportion of ReLU(0) calls over the total number of ReLU calls after 100 epochs, for ReLU networks on MNIST with number of layers in 1, ..., 6 and neurons per layer in 16, 64, 256, 512.

5 Conclusions and future work

The starting point of our work was to determine whether the choice of the value s = ReLU'(0) affects neural network training. Theory tells us that this choice should have a negligible effect. Performing a simple learning experiment, we discovered that this is false in the real world, and the first purpose of this paper is to account for this empirical evidence. This contradiction between theory and practice is due to finite floating point arithmetic precision, while idealized networks are analyzed theoretically within a model of exact arithmetic on the field of real numbers. Owing to the size of deep learning problems, rounding errors due to numerical precision occur at a relatively high frequency, and virtually all the time for large architectures and datasets under 32 bits arithmetic precision (the default choice for the TensorFlow and PyTorch libraries).

Our second goal was to investigate the impact of the choice of s = ReLU'(0) in machine learning terms. In 32 bits precision it has an effect on test accuracy, which seems to be the result of inducing a chaotic behavior in the course of empirical risk minimization. This was observed consistently in all our experiments. However, we could not identify a systematic quantitative description of this effect; it highly depends on the dataset at hand, the network structure, as well as other learning parameters such as the presence of batch normalization and the use of different optimizers. Our experiments illustrate this diversity. We observe an interesting robustness property of batch normalization and the Adam optimizer, as well as a high sensitivity to the network structure.

Overall, the goal of this work is to draw attention to an overlooked factor in machine learning and neural networks: nonsmoothness. The ReLU activation is probably the most widely used nonlinearity in this context, yet its nondifferentiability is mostly ignored. We highlight the fact that the default choice ReLU'(0) = 0 seems to be the most robust, while different choices could potentially lead to instabilities. For a general nonsmooth nonlinearity, it is not clear a priori which choice would be the most robust, if any, and our investigation underlines the potential importance of this question.
Our research opens new directions regarding the impact of numerical precision on neural network training, its interplay with nonsmoothness, and its combined effect with other learning factors such as network architecture, batch normalization or optimizers. The main idea is that mathematically negligible factors are not necessarily computationally negligible.

Acknowledgments and Disclosure of Funding

The authors thank the anonymous referees for constructive suggestions which greatly improved the paper. The authors acknowledge the support of the DEEL project, the AI Interdisciplinary Institute ANITI funding through the French "Investing for the Future PIA3" program under Grant agreement ANR-19-PI3A-0004, the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-1-7026 and FA9550-18-1-0226, and ANR MaSDOL 19-CE23-0017-01. J. Bolte also acknowledges the support of ANR Chess, grant ANR-17-EURE-0010, and TSE-P.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[3] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623–2631, 2019.

[4] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18, 2018.

[5] J. Berner, D. Elbrächter, P. Grohs, and A. Jentzen. Towards a regularity theory for ReLU networks: chain rule and global error estimates. In 2019 13th International Conference on Sampling Theory and Applications (SampTA), pages 1–5. IEEE, 2019.

[6] P. Bianchi, W. Hachem, and S. Schechtman. Convergence of constant step stochastic gradient descent for non-smooth non-convex functions. arXiv preprint arXiv:2005.08513, 2020.

[7] J. Bolte and E. Pauwels. Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Mathematical Programming, pages 1–33, 2020.

[8] J. Bolte and E. Pauwels. A mathematical model for automatic differentiation in machine learning. In Advances in Neural Information Processing Systems, volume 33, 2020.

[9] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[10] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[11] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[13] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

[14] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, 2016.

[15] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008.

[16] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pages 1737–1746. PMLR, 2015.

[17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[20] K. Hwang and W. Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pages 1–6. IEEE, 2014.

[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.

[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, Conference Track Proceedings, 2015.

[23] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/kriz/cifar.html, 2010.

[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.

[25] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[27] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[28] A. Rodriguez, E. Segal, E. Meiri, E. Fomenko, Y. J. Kim, H. Shen, and B. Ziv. Lower numerical precision deep learning inference and training. Intel White Paper, 3:1–19, 2018.

[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[30] S. Scholtes. Introduction to Piecewise Differentiable Equations. Springer Science & Business Media, 2012.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[33] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.