# Exchangeability and Kernel Invariance in Trained MLPs

Russell Tsuchida¹, Fred Roosta¹,² and Marcus Gallagher¹
¹The University of Queensland  ²International Computer Science Institute

Abstract

In the analysis of machine learning models, it is often convenient to assume that the parameters are IID. This assumption is not satisfied when the parameters are updated through training processes such as Stochastic Gradient Descent. A relaxation of the IID condition is a probabilistic symmetry known as exchangeability. We show the sense in which the weights in MLPs are exchangeable. This yields the result that in certain instances, the layer-wise kernel of fully-connected layers remains approximately constant during training. Our results shed light on such kernel properties throughout training while limiting the use of unrealistic assumptions.

1 Introduction

Despite the widespread usage of deep learning in applications, current theoretical understanding of deep networks continues to lag behind the pursued engineering outcomes. Much recent theory concerns networks in their randomized initial state, or contains assumptions about the parameters or data during training. For example, Cho and Saul [2009], Daniely et al. [2016], Bach [2017] and Tsuchida et al. [2018] analyze the kernels of neural networks with random IID weights. Insightful analyses connecting signal propagation in deep networks to chaos have made similar assumptions [Poole et al., 2016; Raghu et al., 2017]. Random matrix theory has recently been applied to neural networks in an attempt to understand the empirical spectral distribution (ESD) of the Hessian [Pennington and Bahri, 2017] and the Gram matrix [Pennington and Worah, 2017], but these works have made strong assumptions on the weight and data distributions. The most widely used yet unrealistic assumption is that weights remain IID throughout training.

Relaxing such unrealistic assumptions makes obtaining meaningful results more challenging. We take a step in this direction by investigating the probabilistic symmetry known as exchangeability, which is a generalization of the IID assumption. We uncover the striking result that the layer-wise kernel of MLPs with ReLU activations, trained with many optimizers, remains constant up to a scaling factor during training when the network inputs satisfy certain conditions. Otherwise, we are able to bound the absolute difference between the layer-wise kernel and the kernel of the network in its random IID state. Supplemental material is available in the arXiv version of this paper, https://arxiv.org/abs/1810.08351.

2 Background

2.1 Notation

Random variables, vectors and matrices will be denoted by upper case, bold upper case, and bold upper case with overline characters, respectively. Parenthesized superscripts index the layer of the network to which an object belongs. The first and second post-subscripts index the rows and columns of a matrix, respectively. When the row of a matrix is extracted through an index, it will be assumed to be transposed into a column vector. Pre-subscripts will indicate the iteration of an iterative optimizer. Expectation with respect to the distribution of a random variable $R$ is denoted $\mathbb{E}_R$.

Consider an MLP with an input layer and $L$ non-input layers. Denote the number of neurons in layer $l$, $0 \leq l \leq L$, by $n^{(l)}$. Denote an input to the network by $\mathbf{x}$. Denote the random weight matrix connecting layer $l-1$ to layer $l$ by $\overline{W}^{(l)}$. Denote the $\ell_2$ norm by $\|\cdot\|$. Denote the activation function by $\sigma$. We consider ReLU activations throughout.
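To make the notation concrete, here is a minimal sketch (ours, not the authors' code) of the forward pass this notation describes, with ReLU activations and IID Gaussian initial weights; the layer widths and the He-style initialization scale used later in Section 5 are taken here as illustrative assumptions.

```python
import numpy as np

def relu(z):
    # sigma(z) = max(z, 0), applied elementwise
    return np.maximum(z, 0.0)

def init_weights(widths, rng):
    # widths = [n0, n1, ..., nL]; the l-th matrix connects layer l-1 to layer l
    # and has IID Gaussian entries with variance 2 / n^(l) (He-style scale,
    # an illustrative assumption here).
    return [rng.normal(0.0, np.sqrt(2.0 / widths[l]), size=(widths[l], widths[l - 1]))
            for l in range(1, len(widths))]

def forward(weights, x):
    # Returns the post-activation signal sigma(W^(l) h^(l-1)) of every non-input layer.
    signals, h = [], x
    for W in weights:
        h = relu(W @ h)
        signals.append(h)
    return signals

rng = np.random.default_rng(0)
weights = init_weights([3072, 3072, 3072, 3072, 3072], rng)  # L = 4, as in Section 5
x = rng.uniform(size=3072)
print([s.shape for s in forward(weights, x)])
```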
2.2 Exchangeability

An exchangeable sequence of random variables $(Q_1, Q_2, \ldots)$ has the property that the joint distribution of the sequence is invariant to finite permutations. That is, a sequence $(Q_i)_{i \geq 1}$ is exchangeable if $(Q_1, Q_2, \ldots) \overset{d}{=} (Q_{\pi(1)}, Q_{\pi(2)}, \ldots)$ for all finite permutations $\pi$. To aid in readability we will omit the index set in the subscript, so that $(Q_i)_i$ is the same as $(Q_i)_{i \geq 1}$. Infinite exchangeable sequences are characterized as mixtures of IID random variables through de Finetti's theorem.

Theorem 1. [Aldous, 1981] An infinite sequence $Q = (Q_i)_i$ is exchangeable if and only if there exists a measurable function $f$ such that $(Q_i)_i \overset{d}{=} \big(f(A, B_i)\big)_i$, where $A$ and $(B_i)_i$ are mutually IID random variables uniform on $[0, 1]$.

Generalizations of Theorem 1 to multi-dimensional arrays exist [Kallenberg, 2006]. A matrix $\overline{Q}$ is row and column exchangeable (RCE) if its joint distribution is invariant to row and column permutations. That is, $\overline{Q}$ is RCE if $(Q_{ji})_{ji} \overset{d}{=} (Q_{\pi_1(j)\pi_2(i)})_{ji}$ for all finite permutations $\pi_1, \pi_2$.

Theorem 2. [Aldous, 1981] An infinite array $\overline{Q} = (Q_{ji})_{ji}$ is RCE if and only if there exists a measurable function $f$ such that $(Q_{ji})_{ji} \overset{d}{=} \big(f(A, B_j, C_i, D_{ji})\big)_{ji}$, where $A$, $(B_j)_j$, $(C_i)_i$ and $(D_{ji})_{ji}$ are mutually IID uniform on $[0, 1]$.

Intuition concerning the strength of exchangeability in the context of probabilistic symmetries may be aided by the implication graph shown in Figure 1.

[Figure 1: Relative strength of probabilistic symmetries, from strongest to weakest: isotropic Gaussian $\Rightarrow$ rotatable ($\forall R \in O(n)$, $(Q_i)_{1 \leq i \leq n} \overset{d}{=} R(Q_i)_{1 \leq i \leq n}$) $\Rightarrow$ exchangeable ($\forall$ finite permutations $\pi$, $(Q_i)_i \overset{d}{=} (Q_{\pi(i)})_i$) $\Rightarrow$ contractable ($(Q_i)_i \overset{d}{=} (Q_{k_i})_i$ for all $k_1 < k_2 < \ldots$) $\Rightarrow$ stationary ($(Q_i)_i \overset{d}{=} (Q_{c+i})_i$ for all $c \in \mathbb{N}$).]
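As a concrete illustration of Theorem 1 (ours, not from the paper), the sketch below builds an exchangeable but non-IID sequence as a mixture of IID variables: one shared uniform draw $A$ and one independent uniform $B_i$ per element are pushed through a measurable $f$. The particular $f$ is an arbitrary choice for illustration.

```python
import numpy as np

def f(a, b):
    # An arbitrary measurable f for illustration: shift each uniform b by an
    # offset determined by the shared draw a.
    return 3.0 * a + b

rng = np.random.default_rng(1)

def sample_sequence(n):
    A = rng.uniform()            # one shared draw for the whole sequence
    B = rng.uniform(size=n)      # mutually IID uniforms, one per element
    return f(A, B)               # Q_i = f(A, B_i)

# Permuting the sequence only relabels the IID B_i, so the joint distribution
# is unchanged (exchangeability).  The elements are not independent: the shared
# A induces correlation between distinct coordinates.
seqs = np.array([sample_sequence(2) for _ in range(20_000)])
print(np.corrcoef(seqs[:, 0], seqs[:, 1])[0, 1])  # noticeably positive (about 0.9)
```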
2.3 Kernels of Random MLPs

There is a well-studied connection between the feature maps in MLPs (and other neural network architectures) and the kernel of a reproducing kernel Hilbert space (RKHS) [MacKay, 1992; Neal, 1994; Cho and Saul, 2009; Daniely et al., 2016; Bach, 2017; Bietti and Mairal, 2017]. Consider the angle $\theta^{(l)}$ between two random signals $\sigma(\overline{W}^{(l)}\mathbf{x})$ and $\sigma(\overline{W}^{(l)}\mathbf{y})$ in the $l$th hidden layer of an MLP for inputs $\mathbf{x}$ and $\mathbf{y}$. We have

$$\cos\theta^{(l)} = \frac{\sum_{j=1}^{n^{(l)}} \sigma\big(W^{(l)}_j \cdot \mathbf{x}\big)\,\sigma\big(W^{(l)}_j \cdot \mathbf{y}\big)}{\sqrt{\sum_{j=1}^{n^{(l)}} \sigma\big(W^{(l)}_j \cdot \mathbf{x}\big)^2}\ \sqrt{\sum_{j=1}^{n^{(l)}} \sigma\big(W^{(l)}_j \cdot \mathbf{y}\big)^2}}, \tag{1}$$

where $W^{(l)}_j$ is the $j$th row of $\overline{W}^{(l)}$. We divide the numerator and denominator by $n^{(l)}\|\mathbf{x}\|\|\mathbf{y}\|$ and use the absolute-homogeneity property of the ReLU, $\sigma(|a|z) = |a|\sigma(z)$, to consider the scaled numerator

$$\frac{1}{n^{(l)}}\sum_{j=1}^{n^{(l)}} \sigma\big(W^{(l)}_j \cdot \mathbf{x}/\|\mathbf{x}\|\big)\,\sigma\big(W^{(l)}_j \cdot \mathbf{y}/\|\mathbf{y}\|\big). \tag{2}$$

Let $\hat{\mathbf{x}} = \mathbf{x}/\|\mathbf{x}\|$. Suppose that each row $W^{(l)}_j$ of $\overline{W}^{(l)}$ is IID with all other rows (we relax this requirement later) and is defined on some probability space $(\Omega, \Sigma, \mu)$. Asymptotically in the number of neurons $n^{(l)}$, the strong law of large numbers implies that (2) converges almost surely to

$$\mathbb{E}\big[\sigma(W^{(l)}_j \cdot \hat{\mathbf{x}})\,\sigma(W^{(l)}_j \cdot \hat{\mathbf{y}})\big] = \int_\Omega \sigma(W^{(l)}_j \cdot \hat{\mathbf{x}})\,\sigma(W^{(l)}_j \cdot \hat{\mathbf{y}})\, d\mu, \tag{3}$$

which corresponds to an inner product in feature space. The kernel is positive semi-definite and uniquely defines an RKHS.

When $\mu$ is the product measure corresponding to an IID Gaussian with mean $0$ and variance $\mathbb{E}\big[(W^{(l)}_{11})^2\big]$, the kernel has a closed-form expression known as the arc-cosine kernel (of degree 1) [Cho and Saul, 2009], given by

$$\frac{\mathbb{E}\big[(W^{(l)}_{11})^2\big]}{2\pi}\Big(\sin\theta^{(l-1)} + \big(\pi - \theta^{(l-1)}\big)\cos\theta^{(l-1)}\Big), \tag{4}$$

where $\theta^{(l-1)}$ is the angle between $\mathbf{x}$ and $\mathbf{y}$. We will refer to (3) as the layer-wise kernel in layer $l$, denoted $k^{(l)}(\mathbf{x}, \mathbf{y})$. When (3) is normalized in the same fashion as (1), we will call the resulting quantity the layer-wise normalized kernel.

2.4 Layer-wise Kernel in IID MLPs

Our analysis draws upon and extends results concerning the layer-wise normalized kernels of MLPs with IID weights [Tsuchida et al., 2018], which, for completeness, we briefly review here. Construct a sequence $\{\mathbf{x}(m)\}_{m \geq 2}$ such that for all $m$, coordinates $m+1, m+2, \ldots$ of $\mathbf{x}(m)$ are all $0$. Define the sequence $\{\mathbf{y}(m)\}_{m \geq 2}$ in the same way, and additionally require that the angle $\theta^{(l-1)}$ between $\mathbf{x}(m)$ and $\mathbf{y}(m)$ is constant in $m$. Denote the randomly initialized weight matrix by $_0\overline{W}^{(l)}$. We would like to evaluate

$$\lim_{m \to \infty} \mathbb{E}\Big[\sigma\big({}_0W^{(l)}_j \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big({}_0W^{(l)}_j \cdot \hat{\mathbf{y}}(m)\big)\Big]. \tag{5}$$

Sufficient conditions for the central limit theorem (CLT) are given below. Let $x(m)_i$ denote the $i$th coordinate of $\mathbf{x}(m)$.

Hypothesis 3. $\displaystyle\lim_{m \to \infty} m^{1/4}\,\frac{\max_{i=1}^m |x(m)_i|}{\|\mathbf{x}(m)\|}$ and $\displaystyle\lim_{m \to \infty} m^{1/4}\,\frac{\max_{i=1}^m |y(m)_i|}{\|\mathbf{y}(m)\|}$ are both $0$.

This condition is easily satisfied, since for data points with many non-zero entries $\|\mathbf{x}(m)\|$ will grow like $\sqrt{m}$ when compared with $|x(m)_i|$. Provided $\mathbb{E}\big[{}_0W^{(l)}_{11}\big] = 0$ and $\mathbb{E}\big[|{}_0W^{(l)}_{11}|^3\big] < \infty$, Tsuchida et al. [2018] show that under Hypothesis 3,

$$\sigma\big({}_0W^{(l)}_1 \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big({}_0W^{(l)}_1 \cdot \hat{\mathbf{y}}(m)\big) \overset{d}{\to} \sigma(Z_x)\,\sigma(Z_y), \qquad (Z_x, Z_y) \sim \mathcal{N}(\mathbf{0}, \Sigma), \quad \Sigma = \mathbb{E}\big[({}_0W^{(l)}_{11})^2\big]\begin{pmatrix} 1 & \cos\theta^{(l-1)} \\ \cos\theta^{(l-1)} & 1 \end{pmatrix}.$$

Letting $Z_{x(m)} = {}_0W^{(l)}_j \cdot \hat{\mathbf{x}}(m)$ and $Z_{y(m)} = {}_0W^{(l)}_j \cdot \hat{\mathbf{y}}(m)$, we have $\sigma(Z_{x(m)})\,\sigma(Z_{y(m)}) \leq |Z_{x(m)}|\,|Z_{y(m)}| \leq Z^2_{x(m)} + Z^2_{y(m)}$. The integral of the right-hand side is $2\,\mathbb{E}\big[({}_0W^{(l)}_{11})^2\big]$, so the limit may be brought inside the integral in (5) by Theorem 19 of Royden and Fitzpatrick [2010]. The resulting expectation is (4).

Figure 2 shows the normalized kernels for random weights with PDF $\prod_{i=1}^m \frac{\beta}{2\alpha\Gamma(1/\beta)} e^{-|w_i/\alpha|^\beta}$. This PDF generalizes the isotropic Gaussian PDF ($\beta = 2$) and the uniform PDF ($\beta \to \infty$).

[Figure 2: Normalized kernel for a hidden layer with ReLU activations, plotted against $\theta^{(l-1)} \in [0, \pi]$. Samples from a network with 1000 inputs and hidden units are obtained by generating an orthogonal matrix $R$ from a QR decomposition of a random matrix containing IID samples from $U[0, 1]$, then setting $\mathbf{x} = R(1, 0, \ldots, 0)^T$ and $\mathbf{y} = R(\cos\theta, \sin\theta, 0, \ldots, 0)^T$.]

The CLT result says nothing about the kernel of trained networks, whose weights are not IID. In Section 4 we extend the CLT result to trained networks. We do this by first exploring exchangeability in MLPs.
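As a quick numerical sanity check (ours), the sketch below draws the empirical normalized kernel (1) for IID Gaussian weights at a few angles, following the sampling scheme described in the Figure 2 caption, and compares it with the normalized arc-cosine kernel of degree 1, $(\sin\theta + (\pi - \theta)\cos\theta)/\pi$; the widths and the set of angles are arbitrary.

```python
import numpy as np

def empirical_normalized_kernel(theta, n_in=1000, n_hidden=1000, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # Pair of unit inputs at angle theta, as in the Figure 2 caption: rotate the
    # first two canonical basis vectors by a random orthogonal matrix.
    Q, _ = np.linalg.qr(rng.uniform(size=(n_in, n_in)))
    x = Q[:, 0]
    y = np.cos(theta) * Q[:, 0] + np.sin(theta) * Q[:, 1]
    W = rng.normal(size=(n_hidden, n_in))                        # IID Gaussian weights
    sx, sy = np.maximum(W @ x, 0.0), np.maximum(W @ y, 0.0)
    return sx @ sy / (np.linalg.norm(sx) * np.linalg.norm(sy))   # equation (1)

def normalized_arccos_kernel(theta):
    # Normalized arc-cosine kernel of degree 1; the scale factor in (4) cancels.
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

for theta in np.linspace(0.2, np.pi - 0.2, 5):
    print(f"{theta:.2f}  {empirical_normalized_kernel(theta):.3f}  "
          f"{normalized_arccos_kernel(theta):.3f}")
```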
3 Exchangeability in MLPs

Suppose that for every $l$, the matrix $({}_0W^{(l)}_{ji})_{ji}$ is IID, and the weights then evolve according to SGD over $t$ iterations. The index $j$ (which corresponds to the $j$th row of the random weight matrix, or the $j$th neuron in layer $l$) is an arbitrary labeling; one may permute these indices along with the corresponding connections in layer $l+1$ without changing the output of the network or the joint distribution of the weights. We show this for $L = 3$; the generalization to any $L \geq 2$ will be straightforward.

To start our argument, it is clear that there is full exchangeability of the weights when the network has been randomly initialized with IID weights and has not yet been trained. More restrictively, we have the following.

Observation 4. Let $\mathbf{a} \in \mathbb{R}^{n^{(0)}}$ and $\mathbf{b} \in \mathbb{R}^{n^{(3)}}$ be inputs and targets of an MLP. Suppose that the initial weights in each layer $_0\overline{W}^{(l)}$ are IID, and temporarily drop the pre-subscript. Then for any bijective permutations $\pi_1$ and $\pi_2$,

$$\Big(\mathbf{a}, \big(W^{(1)}_{\pi_1(i)h}\big)_{ih}, \big(W^{(2)}_{\pi_2(j)\pi_1(i)}\big)_{ji}, \big(W^{(3)}_{k\pi_2(j)}\big)_{kj}, \mathbf{b}\Big) \overset{d}{=} \Big(\mathbf{a}, \big(W^{(1)}_{ih}\big)_{ih}, \big(W^{(2)}_{ji}\big)_{ji}, \big(W^{(3)}_{kj}\big)_{kj}, \mathbf{b}\Big). \tag{6}$$

This generalizes to any network with one or more hidden layers ($L \geq 2$) because the permutation does not affect the non-exchangeable elements $\mathbf{a}$ and/or $\mathbf{b}$.

Define $g^{(l)}_{qp}$ to be the function that takes $\mathbf{a}$, $\mathbf{b}$ and realizations of $\big({}_0\overline{W}^{(m)}\big)_{m \in [L]}$ and calculates realizations of $_1W^{(l)}_{qp}$ according to an online (batch size of 1) backpropagation update rule, so that, for some cost function $E(\mathbf{a}, \mathbf{b}; \cdot)$ and step size $\alpha$, $g^{(l)}_{qp}$ returns $W^{(l)}_{qp} - \alpha\,\partial E/\partial W^{(l)}_{qp}$. Let $g^{(l)}$ be a matrix-valued function whose $qp$th element is $g^{(l)}_{qp}$. Denote the LHS of (6) (the permuted tuple) by $U_\pi$ and the RHS of (6) by $U$. Then, by examining the backpropagation equations,

$$g^{(2)}_{qp}(U_\pi) = g^{(2)}_{\pi_2(q)\pi_1(p)}(U). \tag{7}$$

By the continuous mapping theorem, we may apply $g^{(2)}$ to both sides of (6) if $g^{(2)}$ is almost everywhere (a.e.) continuous. Temporarily dropping the $0$ pre-subscripts on the weights,

$$g^{(2)}(U) \overset{d}{=} g^{(2)}(U_\pi) = \Big(g^{(2)}_{\pi_2(q)\pi_1(p)}(U)\Big)_{qp} = \big({}_1W^{(2)}_{\pi_2(q)\pi_1(p)}\big)_{qp},$$

and the left-hand side is equal to $\big({}_1W^{(2)}_{qp}\big)_{qp}$. This shows that $_1\overline{W}^{(2)}$ is RCE. Similarly, $_t\overline{W}^{(1)}$ is row- but not column-exchangeable and $_t\overline{W}^{(L)}$ is column- but not row-exchangeable.

When any batch size $M$ is used, the inputs $\mathbf{a}$ and $\mathbf{b}$ may be replaced by sets $\{\mathbf{a}_i\}_{i \leq M}$ and $\{\mathbf{b}_i\}_{i \leq M}$ and (7) still holds. If $M$ is the size of the entire finite dataset, this corresponds to gradient descent. We may use any a.e. continuous $g^{(l)}$ whose evaluation commutes with index permutations in the input (such as SGD, Adam [Kingma and Ba, 2015] or RMSProp). Call such an update rule index commuting. By redefining $g^{(2)}$ to calculate the weights at the $t$th iteration of SGD, one can show that $_t\overline{W}^{(2)}$ is RCE for all $t$.

Theorem 5. Let $L \geq 3$. Suppose that the initial weights in each layer $_0\overline{W}^{(l)}$ are IID. Suppose the network is trained using an index commuting update rule. Then for all $2 \leq l \leq L-1$ and all optimizer iterations $t \geq 0$, the weight matrices $_t\overline{W}^{(l)}$ are RCE. For $L \geq 2$, $_t\overline{W}^{(1)}$ is row but not column exchangeable and $_t\overline{W}^{(L)}$ is column but not row exchangeable.
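The relabeling argument can be checked directly. The sketch below (ours) permutes the hidden neurons of an $L = 3$ ReLU network: the rows of $W^{(1)}$ and matching columns of $W^{(2)}$ via $\pi_1$, and the rows of $W^{(2)}$ and matching columns of $W^{(3)}$ via $\pi_2$. The network output is unchanged; the layer widths are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n1, n2, n3 = 8, 16, 16, 4                 # layer widths (arbitrary)
relu = lambda z: np.maximum(z, 0.0)

W1 = rng.normal(size=(n1, n0))
W2 = rng.normal(size=(n2, n1))
W3 = rng.normal(size=(n3, n2))
a = rng.normal(size=n0)

def output(W1, W2, W3, a):
    return W3 @ relu(W2 @ relu(W1 @ a))

pi1 = rng.permutation(n1)                     # relabel the neurons in layer 1
pi2 = rng.permutation(n2)                     # relabel the neurons in layer 2

# Permute rows of W1 and the matching columns of W2 (pi1), then rows of W2 and
# the matching columns of W3 (pi2).  The network function is unchanged.
W1p = W1[pi1, :]
W2p = W2[pi2, :][:, pi1]
W3p = W3[:, pi2]

print(np.allclose(output(W1, W2, W3, a), output(W1p, W2p, W3p, a)))  # True
```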
4 Kernels of Trained MLPs

We now extend the results of Section 2.4 to trained networks using the results of Section 3. For the remainder of the paper we will drop the pre-subscript $t$ denoting the training iteration on the weights.

4.1 Layer-wise Kernel in Trained MLPs

We examine the limit in $m$ of the layer-wise kernel in layer $l$ for a network with infinitely many RCE weights. By Theorem 1, there exist some measurable function $f$ and some mutually independent $A$ and $(B_i)_i$, each uniform on $[0, 1]$, such that

$$\lim_{m\to\infty} \mathbb{E}\Big[\sigma\big(W^{(l)}_1 \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big(W^{(l)}_1 \cdot \hat{\mathbf{y}}(m)\big)\Big] = \lim_{m\to\infty} \int_{[0,1]} k_A\big(\mathbf{x}(m), \mathbf{y}(m)\big)\, d\mu_A, \tag{8}$$

where $\mu_A$ is the uniform probability measure on $[0, 1]$ and $k_A(\mathbf{x}(m), \mathbf{y}(m))$ is given by

$$\int_{[0,1]^m} \sigma\big(f_A(B) \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big(f_A(B) \cdot \hat{\mathbf{y}}(m)\big)\, d\mu_B,$$

where $\big(f_A(B)_i\big)_i = \big(f(A, B_i)\big)_i$ and $\mu_B$ is the uniform probability measure. We prove the following in Appendix A.

Proposition 6. Suppose that $2 \leq l \leq L-1$, $\mathbb{E}\big[|W^{(l)}_{11}|^3\big] < \infty$, $\mathbb{E}\big[|W^{(l)}_{11} W^{(l)}_{12}|\big] < \infty$, Hypothesis 3 is satisfied, and

$$\lim_{m\to\infty} \sum_{i=1}^m \hat{x}(m)_i = \lim_{m\to\infty} \sum_{i=1}^m \hat{y}(m)_i = 0 \quad \text{or} \quad \mathbb{E}\big[W^{(l)}_{11} W^{(l)}_{12}\big] = 0.$$

Then (8) is given by

$$\frac{1}{2\pi}\Big(\mathbb{E}\big[(W^{(l)}_{11})^2\big] - \mathbb{E}\big[W^{(l)}_{11} W^{(l)}_{12}\big]\Big)\Big(\sin\theta^{(l-1)} + \big(\pi - \theta^{(l-1)}\big)\cos\theta^{(l-1)}\Big). \tag{9}$$

Note that (9) and (4) are the same up to a scaling factor, which cancels out after normalizing.

4.2 The Ergodic Problem

Unfortunately, (8) is not necessarily the inner product in feature space of an infinitely wide network. By Theorem 2,

$$\frac{1}{n^{(l)}}\sum_{j=1}^{n^{(l)}} \sigma\big(W^{(l)}_j \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big(W^{(l)}_j \cdot \hat{\mathbf{y}}(m)\big) \overset{d}{=} \frac{1}{n^{(l)}}\sum_{j=1}^{n^{(l)}} \sigma\big(f_{AC}(B_j, D_j) \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big(f_{AC}(B_j, D_j) \cdot \hat{\mathbf{y}}(m)\big)$$

for some measurable $f$, where $\big(f_{AC}(B_j, D_j)\big)_i = f(A, B_j, C_i, D_{ji})$. By the Birkhoff-Khinchin ergodic theorem (see Appendix E), the right-hand side converges almost surely, as $n^{(l)} \to \infty$, to the random variable

$$\mathbb{E}_{B_1 D_1}\Big[\sigma\big(f_{AC}(B_1, D_1) \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big(f_{AC}(B_1, D_1) \cdot \hat{\mathbf{y}}(m)\big)\Big] \tag{10}$$

depending on $A$ and $C$. For the purposes of experimenting, we make the following simplifying assumption.

Hypothesis 7. The following holds as $n^{(l)} \to \infty$:

$$\frac{1}{n^{(l)}}\sum_{j=1}^{n^{(l)}} \sigma\big(W^{(l)}_j \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big(W^{(l)}_j \cdot \hat{\mathbf{y}}(m)\big) \overset{p}{\to} \mathbb{E}\Big[\sigma\big(W^{(l)}_1 \cdot \hat{\mathbf{x}}(m)\big)\,\sigma\big(W^{(l)}_1 \cdot \hat{\mathbf{y}}(m)\big)\Big].$$

Hypothesis 7 says that taking averages over $j$ of the products of activations in one network is equivalent to taking averages over one fixed neuron of the products of activations in an ensemble of independent networks. A sufficient condition is that the measure is ergodic with respect to the row-shift transformation. This condition is stronger than necessary. In statistical mechanics, an approximate ergodicity applied to sum functions is used to compare time averages with phase averages [Khinchin, 1949; Kurth, 2014]. The Ergodic Problem features heavily in the history of statistical mechanics [Moore, 2015]. It is our hope that by introducing this assumption into the analysis of MLPs, we make further progress towards efforts in connecting neural networks to statistical mechanics [Martin and Mahoney, 2017]. In Section 5 we demonstrate that Hypothesis 7 is not inconsistent with our empirical observations.

5 Experiments

We illustrate our results with selected figures. Other datasets and optimizers are investigated in the supplemental material.

5.1 Verification of Proposition 6

Architecture. We train an autoencoder with 4 layers and 3072 neurons in each layer on CIFAR10 [Krizhevsky and Hinton, 2009], with pixel values normalized to $[0, 1]$, using an $\ell_2$ objective. Weights are initialized with a variance of $2/n^{(l)}$ [He et al., 2015].

Method. In Figure 3 we plot the empirical layer-wise normalized kernel in each layer. The color of the points moves from blue to red as the training iteration $t$ increases. Each sample is generated using Procedure 1. The numerical steps ensure that the desired angle $\theta^{(l-1)}$ is obtained between $\mathbf{x}$ and $\mathbf{y}$. The alphabetical steps ensure that $\sum_i x_i = \sum_i y_i = 0$.

[Figure 3: Layer-wise normalized kernels for a trained MLP at iteration $t$, indicated by color. Batch size of 256 used. First 4 columns: layers 1 to 4. Fifth column: full network. Last column: sample reconstructions on test data, indicating whether or not training converged. First 3 rows: Adam using step size 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon \in \{10^{-16}, 10^{-8}, 1\}$. Last row: SGD with constant learning rate 0.5.]

Procedure 1: Sample at angle $\theta^{(l-1)}$
Input: datapoint $\mathbf{x}$ and angle $\theta^{(l-1)}$. Output: $\mathbf{y}$ at angle $\theta^{(l-1)}$ to $\mathbf{x}$.
(a) Set the last two coordinates of $\mathbf{x}$ to 0.
1. Sample a random vector $\mathbf{p}$ orthogonal to $\mathbf{x}$: set all coordinates of $\mathbf{p}$ to zero where $\mathbf{x}$ is non-zero and sample the remaining coordinates of $\mathbf{p}$ from $U[0, 1]$. Set the last two coordinates of $\mathbf{p}$ to 0. Normalize $\mathbf{p}$ so that $\|\mathbf{x}\| = \|\mathbf{p}\|$.
(b) Set the second-last coordinate of $\mathbf{x}$ to the negative sum of all coordinates of $\mathbf{x}$.
(c) Set the last coordinate of $\mathbf{p}$ to the negative sum of all coordinates of $\mathbf{p}$.
2. Return $\mathbf{y} = \cos\theta^{(l-1)}\,\mathbf{x} + \sin\theta^{(l-1)}\,\mathbf{p}$.
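Since the exact effect of Procedure 1's steps depends on the structure of the data point, here is a small alternative sketch (ours, not the authors' procedure) that achieves the two properties the procedure targets, namely $\mathbf{y}$ at exactly the requested angle to $\mathbf{x}$ and both coordinate sums equal to 0, by projecting onto the zero-sum hyperplane instead of using the coordinate-wise steps above.

```python
import numpy as np

def pair_at_angle_zero_sum(x, theta, rng=None):
    # Returns (x', y) such that x' and y both have coordinate sums of 0 and the
    # angle between them is exactly theta.  x' is x projected onto the zero-sum
    # hyperplane; y is obtained by rotating x' towards an orthogonal direction.
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]

    def project_zero_sum(v):
        return v - v.mean()                       # remove the component along (1, ..., 1)

    xp = project_zero_sum(x.astype(float))
    p = project_zero_sum(rng.uniform(size=n))     # random zero-sum direction
    p -= (p @ xp) / (xp @ xp) * xp                # make p orthogonal to x'
    p *= np.linalg.norm(xp) / np.linalg.norm(p)   # match norms so the angle is exact
    return xp, np.cos(theta) * xp + np.sin(theta) * p

rng = np.random.default_rng(3)
x, y = pair_at_angle_zero_sum(rng.uniform(size=3072), np.pi / 3, rng)
cos_a = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_a)), x.sum(), y.sum())   # 60.0, ~0, ~0
```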
5.2 Inputs with Non-Zero Sums

Consider a modification of the method described in Section 5.1: the alphabetical steps of Procedure 1 are not performed. This means that the sums $\sum_i \hat{x}_i$ and $\sum_i \hat{y}_i$ are no longer $0$. However, if $\mathbb{E}\big[W^{(l)}_{11} W^{(l)}_{12}\big] = 0$, Proposition 6 still applies. Note that

$$\mathbb{E}\big[W^{(l)}_{11} W^{(l)}_{12}\big] = \int_{[0,1]^3} f(A, B_1)\, f(A, B_2)\, d\mu_{B_1 B_2 A} = \int_{[0,1]} \Big(\int_{[0,1]} f(A, B_1)\, d\mu_{B_1}\Big)^2 d\mu_A,$$

which is $0$ if and only if $\int_{[0,1]} f(A, B_1)\, d\mu_{B_1} = 0$ for almost every $A$. Also, by the strong law of large numbers,

$$\frac{1}{n^{(l-1)}} \sum_{i=1}^{n^{(l-1)}} W^{(l)}_{1i} \xrightarrow{a.s.} \int_{[0,1]} f(A, B_1)\, d\mu_{B_1}.$$

Therefore, for finite $n^{(l)}$, if $(E_{n^{(l)}})^2 := \Big(\frac{1}{n^{(l-1)}} \sum_{i=1}^{n^{(l-1)}} W^{(l)}_{ji}\Big)^2$ is "small", $\mathbb{E}\big[W^{(l)}_{11} W^{(l)}_{12}\big]$ will be "small". We are interested in finding optimizer hyperparameters that result in $(E_{n^{(l)}})^2$ far from $0$, which in turn results in deviations from (9). We make the following observations:

(1) When Adam, RMSProp or Nadam [Dozat, 2016] are used, as the hyperparameter $\epsilon$ decreases there is a sharp change in $(E_{n^{(l)}})^2$ and in the mean squared error (MSE) of the observed normalized kernel to the normalized arc-cosine kernel of degree one, measured at iteration $t = 19000$. When $(E_{n^{(l)}})^2$ is small, the kernel is approximately described by (9). See Figures 4 and 5 and Appendices F and G.

(2) SGD using step sizes $\alpha$ that result in stable training generally has smaller $(E_{n^{(l)}})^2$ than Adam, and thus the normalized kernel agrees more closely with Proposition 6. See Figures 4 and 5 and Appendices F and G.

[Figure 4: As in Figure 3, but for inputs with non-zero sums as outlined in Section 5.2.]

[Figure 5: k: MSE of the kernel to the normalized arc-cosine kernel, normalized to between 0 and 1. W: $(E_{n^{(l)}})^2$, normalized to between 0 and 1. Top: Adam using step size 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, varying $\epsilon$. Bottom: SGD, varying $\alpha$.]
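To monitor this effect in a training run, the following is a small sketch (ours) of the two diagnostics discussed above: the squared row-mean statistic $(E_{n^{(l)}})^2$ and the MSE between an empirical normalized kernel and the normalized arc-cosine kernel. The function names, and averaging the squared row mean over rows, are our assumptions; with IID Gaussian weights both quantities should be close to 0.

```python
import numpy as np

def squared_row_mean(W):
    # (E_n)^2: square of the empirical mean of a row's entries, averaged over rows.
    # A value far from 0 suggests E[W_11 W_12] is far from 0 (Section 5.2).
    return float(np.mean(W.mean(axis=1) ** 2))

def kernel_mse(W, pairs, thetas):
    # MSE between the empirical normalized kernel of layer weights W and the
    # normalized arc-cosine kernel, over pairs of unit inputs at angles `thetas`.
    target = (np.sin(thetas) + (np.pi - thetas) * np.cos(thetas)) / np.pi
    empirical = []
    for x, y in pairs:
        sx, sy = np.maximum(W @ x, 0.0), np.maximum(W @ y, 0.0)
        empirical.append(sx @ sy / (np.linalg.norm(sx) * np.linalg.norm(sy)))
    return float(np.mean((np.array(empirical) - target) ** 2))

rng = np.random.default_rng(4)
W = rng.normal(size=(512, 512))                         # stand-in for a trained layer
Q, _ = np.linalg.qr(rng.normal(size=(512, 512)))
thetas = np.linspace(0.1, np.pi - 0.1, 9)
pairs = [(Q[:, 0], np.cos(t) * Q[:, 0] + np.sin(t) * Q[:, 1]) for t in thetas]
print(squared_row_mean(W), kernel_mse(W, pairs, thetas))  # both close to 0 for IID weights
```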
6 Discussion and Conclusion

We identified that the weights in hidden layers of MLPs are RCE. Using this symmetry, we analyzed the kernels of trained networks. Specifically, we found that the normalized kernel remains invariant when the inputs have sums over their coordinates of 0. When the sums are not 0, a bound which depends on $\mathbb{E}[W^{(l)}_{11} W^{(l)}_{12}]$ applies to the residual of the normalized kernel to the normalized arc-cosine kernel. We derived a measure $(E_{n^{(l)}})^2$ which, when close to 0, indicates that $\mathbb{E}[W^{(l)}_{11} W^{(l)}_{12}]$ is close to 0 and thus that the normalized kernel remains approximately invariant during training.

When empirically comparing optimizers, those which result in small $\mathbb{E}[W^{(l)}_{11} W^{(l)}_{12}]$ have kernels which follow the normalized arc-cosine kernel during training. The parameter $\epsilon$ present in Adam and other optimizers can increase $\mathbb{E}[W^{(l)}_{11} W^{(l)}_{12}]$, leading to kernels qualitatively different from the normalized arc-cosine kernel. Changes in other hyperparameters may change $\mathbb{E}[W^{(l)}_{11} W^{(l)}_{12}]$, although we had difficulty finding instances where changing $\alpha$ in SGD resulted in a kernel that did not roughly match the normalized arc-cosine kernel without also resulting in unstable training.

In contrast with works that analyze weight distributions through an approximation of SGD by a stochastic differential equation [Seung et al., 1992; Watkin et al., 1993; Martin and Mahoney, 2017; Chaudhari and Soatto, 2018], we incorporate very little knowledge of the learning rule into our theory. The result is that our theory is perhaps more general than required. Interestingly, our results still hold if we perform (stochastic) gradient ascent on the weights. We believe our analysis would benefit from including more knowledge of the update rule.

Jacot et al. [2018] analyse a continuous-time approximation of (not stochastic) gradient descent. In this dynamic, it is shown that MLPs in function space follow a linear differential equation in an infinite-width limit. A central object of their study is a positive-definite kernel, the neural tangent kernel, which is shown to stay approximately constant during training. Also utilizing a kernel and working in function space, Du et al. [2019] bound distances between functions trained in continuous- and discrete-time dynamics in special cases and investigate the surprising fact that certain models can achieve zero training loss. Chizat and Bach [2018] present a unified framework of these and other works under a training regime called lazy training. We remark that it is easy to find networks trained with Adam that do not exhibit approximately constant kernels (for example, see the first row of Figure 4). These networks seem to outperform those that do have constant kernels during training. This is consistent with the view expressed by Chizat and Bach [2018] that "the most competitive neural networks are not trained in this regime [lazy training]".

In future work, we would like to examine the ESD of weight and Hessian matrices without normality assumptions as in previous works [Pennington and Bahri, 2017; Pennington and Worah, 2017], perhaps using results concerning the ESD of exchangeable random matrices [Chatterjee, 2006; Adamczak et al., 2016].

Acknowledgements

Russell Tsuchida and Fred Roosta gratefully acknowledge the generous support given by the Australian Research Council Centre of Excellence for Mathematical & Statistical Frontiers (ACEMS). Fred Roosta was partially supported by the Australian Research Council through a Discovery Early Career Researcher Award (DE180100923).

References

[Adamczak et al., 2016] R. Adamczak, D. Chafaï, and P. Wolff. Circular law for random matrices with exchangeable entries. Random Structures & Algorithms, 48(3):454-479, 2016.
[Aldous, 1981] D.J. Aldous. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581-598, 1981.
[Bach, 2017] F. Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1-53, 2017.
[Bietti and Mairal, 2017] A. Bietti and J. Mairal. Invariance and stability of deep convolutional representations. In Advances in Neural Information Processing Systems, pages 6210-6220, 2017.
[Chatterjee, 2006] S. Chatterjee. A generalization of the Lindeberg principle. The Annals of Probability, 34(6):2061-2076, 2006.
[Chaudhari and Soatto, 2018] P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In International Conference on Learning Representations, 2018.
[Chizat and Bach, 2018] L. Chizat and F. Bach. A note on lazy training in supervised differentiable programming. Technical report, 2018.
[Cho and Saul, 2009] Y. Cho and L.K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342-350, 2009.
[Daniely et al., 2016] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253-2261, 2016.
[Dozat, 2016] T. Dozat. Incorporating Nesterov momentum into Adam. Technical report, 2016.
[Du et al., 2019] S.S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019.
[He et al., 2015] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision, pages 1026-1034, 2015.
[Jacot et al., 2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571-8580, 2018.
[Kallenberg, 2006] O. Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer Science & Business Media, 2006.
[Khinchin, 1949] A.I. Khinchin. Mathematical Foundations of Statistical Mechanics. Courier Corporation, 1949.
[Kingma and Ba, 2015] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[Krizhevsky and Hinton, 2009] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[Kurth, 2014] R. Kurth. Axiomatics of Classical Statistical Mechanics. Elsevier, 2014.
[MacKay, 1992] D.J.C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.
[Martin and Mahoney, 2017] C.H. Martin and M.W. Mahoney. Rethinking generalization requires revisiting old ideas: Statistical mechanics approaches and complex learning behavior. arXiv preprint arXiv:1710.09553, 2017.
[Moore, 2015] C.C. Moore. Ergodic theorem, ergodic theory, and statistical mechanics. Proceedings of the National Academy of Sciences, 112(7):1907-1911, 2015.
[Neal, 1994] R.M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1994.
[Pennington and Bahri, 2017] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning, pages 2798-2806, 2017.
[Pennington and Worah, 2017] J. Pennington and P. Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, pages 2634-2643, 2017.
[Poole et al., 2016] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360-3368, 2016.
[Raghu et al., 2017] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, pages 2847-2854, 2017.
[Royden and Fitzpatrick, 2010] H.L. Royden and P. Fitzpatrick. Real Analysis. Prentice Hall, 2010.
[Seung et al., 1992] H.S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.
[Tsuchida et al., 2018] R. Tsuchida, F. Roosta-Khorasani, and M. Gallagher. Invariance of weight distributions in rectified MLPs. In International Conference on Machine Learning, 2018.
[Watkin et al., 1993] T.L.H. Watkin, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499, 1993.