# A Spline Theory of Deep Networks

Randall Balestriero, Richard G. Baraniuk (ECE Department, Rice University, Houston, TX, USA. Correspondence to: Randall B.)

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. As an application, we develop and validate a new distance metric for signals that quantifies the difference between their partition encodings.

1. Introduction

Deep learning has significantly advanced our ability to address a wide range of difficult machine learning and signal processing problems. Today's machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parameterized linear and nonlinear transforms. An all-too-common story of late is that of plugging a deep network into an application as a black box, training it on copious training data, and then significantly improving performance over classical approaches.

Despite this empirical progress, the precise mechanisms by which deep learning works so well remain relatively poorly understood, adding an air of mystery to the entire field. Ongoing attempts to build a rigorous mathematical framework fall roughly into five camps: (i) probing and measuring DNs to visualize their inner workings (Zeiler & Fergus, 2014); (ii) analyzing their properties such as expressive power (Cohen et al., 2016), loss surface geometry (Lu & Kawaguchi, 2017; Soudry & Hoffer, 2017), nuisance management (Soatto & Chiuso, 2016), sparsification (Papyan et al., 2017), and generalization abilities; (iii) new mathematical frameworks that share some (but not all) common features with DNs (Bruna & Mallat, 2013); (iv) probabilistic generative models from which specific DNs can be derived (Arora et al., 2013; Patel et al., 2016); and (v) information-theoretic bounds (Tishby & Zaslavsky, 2015).

In this paper, we build a rigorous bridge between DNs and approximation theory via spline functions and operators.
We prove that a large class of DNs, including convolutional neural networks (CNNs) (LeCun, 1998), residual networks (ResNets) (He et al., 2016; Targ et al., 2016), skip connection networks (Srivastava et al., 2015), fully connected networks (Pal & Mitra, 1992), recurrent neural networks (RNNs) (Graves, 2013), and beyond, can be written as spline operators. In particular, when these networks employ current standard-practice piecewise-affine, convex nonlinearities (e.g., ReLU, max-pooling, etc.), they can be written as the composition of max-affine spline operators (MASOs) (Magnani & Boyd, 2009; Hannah & Dunson, 2013). We focus on such nonlinearities here but note that our framework applies also to non-piecewise-affine nonlinearities through a standard approximation argument.

The max-affine spline connection provides a powerful portal through which to view and analyze the inner workings of a DN using tools from approximation theory and functional analysis. Here is a summary of our key contributions:

[C1] We prove that a large class of DNs can be written as a composition of MASOs, from which it follows immediately that, conditioned on the input signal, the output of a DN is a simple affine transformation of the input. We illustrate in Section 4 by deriving a closed-form expression for the input/output mapping of a CNN.

[C2] The affine mapping formula enables us to interpret a MASO DN as constructing a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product. In Section 5 we relate DNs directly to the classical theory of optimal classification via matched filters and provide insights into the effects of data memorization (Zhang et al., 2016).

[C3] We propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other. In Section 6, we show that this leads to significantly improved classification performance and reduced overfitting on standard test data sets like CIFAR100 with no change to the DN architecture.

[C4] The partition of the input space induced by a MASO links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up a new geometric avenue to study how DNs cluster and organize signals in a hierarchical fashion. Section 7 studies the properties of the MASO partition.

[C5] Leveraging the fact that a DN considers two signals to be similar if they lie in the same MASO partition region, we develop a new signal distance in Section 7.3 that measures the difference between their partition encodings. The distance is easily computed via backpropagation.

A number of appendices in the Supplementary Material (SM) contain the mathematical setup and proofs. A significantly extended account of this work with numerous new results is available in (Balestriero & Baraniuk, 2018).

2. Background on Deep Networks

A deep network (DN) is an operator $f_\Theta: \mathbb{R}^D \to \mathbb{R}^C$ that maps an input signal$^1$ $x \in \mathbb{R}^D$ to an output prediction $\widehat{y} \in \mathbb{R}^C$. All current DNs can be written as a composition of $L$ intermediate mappings called layers

$f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(1)}_{\theta^{(1)}} \right)(x),$  (1)

where $\Theta = \{\theta^{(1)}, \dots, \theta^{(L)}\}$ is the collection of the network's parameters from each layer. This composition of mappings is nonlinear and non-commutative, in general.

A DN layer at level $\ell$ is an operator $f^{(\ell)}_{\theta^{(\ell)}}$ that takes as input the vector-valued signal $z^{(\ell-1)}(x) \in \mathbb{R}^{D^{(\ell-1)}}$ and produces the vector-valued output $z^{(\ell)}(x) \in \mathbb{R}^{D^{(\ell)}}$.
We will assume that $x$ and $z^{(\ell)}$ are column vectors. We initialize with $z^{(0)}(x) = x$ and denote $z^{(L)}(x) =: z$ for convenience. The signals $z^{(\ell)}(x)$ are typically called feature maps; it is easy to see that

$z^{(\ell)}(x) = \left( f^{(\ell)}_{\theta^{(\ell)}} \circ \cdots \circ f^{(1)}_{\theta^{(1)}} \right)(x), \quad \ell \in \{1, \dots, L\}.$  (2)

$^1$ For concreteness, we focus here on processing $K$-channel images $x$, such as color digital photographs. But our analysis and techniques apply to signals of any index-dimensionality, including speech and audio signals, video signals, etc.

We briefly overview the basic DN operators and layers we consider in this paper; more details and additional layers are provided in (Goodfellow et al., 2016) and (Balestriero & Baraniuk, 2018).

A fully connected operator performs an arbitrary affine transformation by multiplying its input by the dense matrix $W^{(\ell)} \in \mathbb{R}^{D^{(\ell)} \times D^{(\ell-1)}}$ and adding the arbitrary bias vector $b^{(\ell)}_W \in \mathbb{R}^{D^{(\ell)}}$, as in $f^{(\ell)}_W\big(z^{(\ell-1)}(x)\big) := W^{(\ell)} z^{(\ell-1)}(x) + b^{(\ell)}_W$.

A convolution operator reduces the number of parameters in the affine transformation by replacing the unconstrained $W^{(\ell)}$ with a multichannel convolution matrix, as in $f^{(\ell)}_C\big(z^{(\ell-1)}(x)\big) := C^{(\ell)} z^{(\ell-1)}(x) + b^{(\ell)}_C$.

An activation operator applies a scalar nonlinear activation function $\sigma$ independently to each entry of its input, as in $\big[ f^{(\ell)}_\sigma\big(z^{(\ell-1)}(x)\big) \big]_k := \sigma\big( [z^{(\ell-1)}(x)]_k \big)$, $k = 1, \dots, D^{(\ell)}$. Nonlinearities are crucial to DNs, since otherwise the entire network would collapse to a single global affine transform. Three popular activation functions are the rectified linear unit (ReLU) $\sigma_{\mathrm{ReLU}}(u) := \max(u, 0)$, the leaky ReLU $\sigma_{\mathrm{LReLU}}(u) := \max(\eta u, u)$, $\eta > 0$, and the absolute value $\sigma_{\mathrm{abs}}(u) := |u|$. These three functions are all piecewise affine and convex. Other popular activation functions include the sigmoid $\sigma_{\mathrm{sig}}(u) := \frac{1}{1 + e^{-u}}$ and the hyperbolic tangent $\sigma_{\tanh}(u) := 2\sigma_{\mathrm{sig}}(2u) - 1$. These two functions are neither piecewise affine nor convex.

A pooling operator subsamples its input to reduce its dimensionality according to a sub-sampling policy $\rho$ applied over a collection of input indices $\{\mathcal{R}_k\}_{k=1}^{K^{(\ell)}}$ (typically a small patch), e.g., max-pooling $\big[ f^{(\ell)}_\rho\big(z^{(\ell-1)}(x)\big) \big]_k := \max_{d \in \mathcal{R}^{(\ell)}_k} \big[ z^{(\ell-1)}(x) \big]_d$, $k = 1, \dots, D^{(\ell)}$. See (Balestriero & Baraniuk, 2018) for the definitions of average pooling, channel pooling, skip connections, and recurrent layers.

Definition 1. A DN layer $f^{(\ell)}_{\theta^{(\ell)}}$ comprises a single nonlinear DN operator (non-affine, to be precise) composed with any preceding affine operators lying between it and the preceding nonlinear operator.

This definition yields a single, unique layer decomposition for any DN, and the complete DN is then the composition of its layers per (1). For example, in a standard CNN, there are two different layer types: (i) convolution-activation and (ii) max-pooling.

We form the prediction $\widehat{y}$ by feeding $f_\Theta(x)$ through a final nonlinearity $g: \mathbb{R}^{D^{(L)}} \to \mathbb{R}^{D^{(L)}}$ as in $\widehat{y} = g(f_\Theta(x))$. In classification, $g$ is typically the softmax nonlinearity, which arises naturally from posing the classification inference as a multinomial logistic regression problem (Bishop, 1995). In regression, typically no $g$ is applied.

We learn the DN parameters $\Theta$ for a particular prediction task in a supervised setting using a labeled data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, a loss function, and a learning policy to update the parameters $\Theta$ in the predictor $f_\Theta(x)$. For classification problems, the loss function is typically the cross-entropy $\mathcal{L}_{\mathrm{CE}}(x, y)$ (Bishop, 1995). For regression problems, the loss function is typically the squared error.
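To make the layer composition in (1)–(2) concrete, here is a minimal NumPy sketch (ours, not from the paper; all sizes and weights are arbitrary illustrations) that chains a fully connected operator, a ReLU activation operator, a max-pooling operator, and a final affine map.

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(z, W, b):
    # f_W(z) = W z + b
    return W @ z + b

def relu(z):
    # activation operator: sigma_ReLU applied entrywise
    return np.maximum(z, 0.0)

def max_pool(z, regions):
    # pooling operator: one output per pooling region R_k
    return np.array([z[list(R)].max() for R in regions])

# Toy instance of the composition f = f^(3) o f^(2) o f^(1) from (1):
x = rng.standard_normal(8)                                   # input, D = 8
W1, b1 = rng.standard_normal((6, 8)), rng.standard_normal(6)
W2, b2 = rng.standard_normal((3, 2)), rng.standard_normal(3)

z1 = relu(fully_connected(x, W1, b1))              # layer 1: affine + activation
z2 = max_pool(z1, regions=[(0, 1, 2), (3, 4, 5)])  # layer 2: max-pooling
z3 = fully_connected(z2, W2, b2)                   # layer 3: final affine map
print(z3.shape)                                    # (3,) output logits
```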
Since the layer-by-layer operations in a DN are differentiable almost everywhere with respect to their parameters and inputs, we can use some flavor of first-order optimization such as gradient descent to optimize the parameters $\Theta$ with respect to the loss function. Moreover, the gradients for all internal parameters can be computed efficiently by backpropagation (Hecht-Nielsen, 1992), which follows from the chain rule of calculus.

3. Background on Spline Operators

Approximation theory is the study of how and how well functions can best be approximated using simpler functions (Powell, 1981). A classical example of a simpler function is a spline $s: \mathbb{R}^D \to \mathbb{R}$ (Schmidhuber, 1994). For concreteness, we will work exclusively with affine splines in this paper (aka "linear splines"), but our ideas generalize naturally to higher-order splines.

Multivariate Affine Splines. Consider a partition of a domain $\mathbb{R}^D$ into a set of regions $\Omega = \{\omega_1, \dots, \omega_R\}$ and a set of local mappings $\Phi = \{\phi_1, \dots, \phi_R\}$ that map each region in the partition to $\mathbb{R}$ via $\phi_r(x) := \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r$ for $x \in \omega_r$.$^2$ The parameters are: $\alpha \in \mathbb{R}^{R \times D}$, a matrix of hyperplane "slopes", and $\beta \in \mathbb{R}^R$, a vector of hyperplane "offsets" or "biases". We will use the terms offset and bias interchangeably in the sequel. The notation $[\alpha]_{r,\cdot}$ denotes the column vector formed from the $r$th row of $\alpha$. With this setup, the multivariate affine spline is defined as

$s[\alpha, \beta, \Omega](x) = \sum_{r=1}^{R} \big( \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r \big)\, \mathbb{1}(x \in \omega_r) =: \langle \alpha[x], x \rangle + \beta[x],$  (3)

where $\mathbb{1}(x \in \omega_r)$ is the indicator function. The second equality in (3) introduces the streamlined notation $\alpha[x] = [\alpha]_{r,\cdot}$ when $x \in \omega_r$; the definition for $\beta[x]$ is similar. Such a spline is piecewise affine and hence piecewise convex. However, in general, it is neither globally affine nor globally convex unless $R = 1$, a case we denote as a degenerate spline, since it corresponds simply to an affine mapping.

$^2$ To make the connection between splines and DNs more immediately obvious, here $x$ is interpreted as a point in $\mathbb{R}^D$, which plays the rôle of the space of signals in the other sections.

Max-Affine Spline Functions. A major complication of function approximation with splines in general is the need to jointly optimize both the spline parameters $\alpha, \beta$ and the input domain partition $\Omega$ (the "knots" for a 1D spline) (Bennett & Botkin, 1985). However, if a multivariate affine spline is constrained to be globally convex, then it can always be rewritten as a max-affine spline (Magnani & Boyd, 2009; Hannah & Dunson, 2013)

$s[\alpha, \beta, \Omega](x) = \max_{r=1,\dots,R} \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r.$  (4)

An extremely useful feature of such a spline is that it is completely determined by its parameters $\alpha$ and $\beta$ without needing to specify the partition $\Omega$. As such, we denote a max-affine spline simply as $s[\alpha, \beta]$. Changes in the parameters $\alpha, \beta$ of a max-affine spline automatically induce changes in the partition $\Omega$, meaning that they are adaptive partitioning splines (Magnani & Boyd, 2009).

Max-Affine Spline Operators. A natural extension of an affine spline function is an affine spline operator (ASO) $S[A, B, \Omega_S]$ that produces a multivariate output. It is obtained simply by concatenating $K$ affine spline functions from (3). The details and a more general development are provided in the SM and (Balestriero & Baraniuk, 2018). We are particularly interested in the max-affine spline operator (MASO) $S[A, B]: \mathbb{R}^D \to \mathbb{R}^K$ formed by concatenating $K$ independent max-affine spline functions from (4).
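As a quick illustration of (4) (our sketch, not the paper's), the function below evaluates a max-affine spline and returns the active region index, showing how the partition is implicit in the parameters; ReLU on a scalar input is the special case with slopes {0, 1} and zero offsets.

```python
import numpy as np

def max_affine_spline(x, alpha, beta):
    """Evaluate s[alpha, beta](x) = max_r <[alpha]_r, x> + [beta]_r, cf. (4).

    alpha: (R, D) matrix of hyperplane slopes
    beta:  (R,)   vector of offsets
    Returns the spline value and the active region index r(x).
    """
    activations = alpha @ x + beta      # one affine value per region r
    r = int(np.argmax(activations))     # region of the implicit partition
    return activations[r], r

# Example: scalar ReLU is the max-affine spline max(0, x) with
# alpha = [[0], [1]] and beta = [0, 0].
alpha = np.array([[0.0], [1.0]])
beta = np.array([0.0, 0.0])
for x in (-2.0, 3.0):
    value, region = max_affine_spline(np.array([x]), alpha, beta)
    assert value == max(0.0, x)
```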
A MASO with slope parameters $A \in \mathbb{R}^{K \times R \times D}$ and offset parameters $B \in \mathbb{R}^{K \times R}$ is defined as

$S[A, B](x) = \begin{bmatrix} \max_{r=1,\dots,R} \langle [A]_{1,r,\cdot}, x \rangle + [B]_{1,r} \\ \vdots \\ \max_{r=1,\dots,R} \langle [A]_{K,r,\cdot}, x \rangle + [B]_{K,r} \end{bmatrix} =: A[x]\, x + B[x].$  (5)

The second equality in (5) introduces the streamlined notation in terms of the signal-dependent matrix $A[x]$ and signal-dependent vector $B[x]$, where $[A[x]]_{k,\cdot} := [A]_{k,r_k(x),\cdot}$ and $[B[x]]_k := [B]_{k,r_k(x)}$ with $r_k(x) = \arg\max_r \langle [A]_{k,r,\cdot}, x \rangle + [B]_{k,r}$.

Max-affine spline functions and operators are always piecewise affine and globally convex (and hence also continuous) with respect to each output dimension. Conversely, any piecewise affine and globally convex function/operator can be written as a max-affine spline. Moreover, using standard approximation arguments, it is easy to show that a MASO can approximate arbitrarily closely any (nonlinear) operator that is convex in each output dimension.

4. DNs are Compositions of Spline Operators

While a MASO is appropriate only for approximating convex functions/operators, we now show that virtually all of today's DNs can be written as a composition of MASOs, one for each layer. Such a composition is, in general, nonconvex and hence can approximate a much larger class of functions/operators. Interestingly, under certain broad conditions, the composition remains a piecewise affine spline operator, which enables a variety of insights into DNs.

4.1. DN Operators are MASOs

We now state our main theoretical results, which are proved in the SM and elaborated in (Balestriero & Baraniuk, 2018).

Proposition 1. An arbitrary fully connected operator $f^{(\ell)}_W$ is an affine mapping and hence a degenerate MASO $S\big[A^{(\ell)}_W, B^{(\ell)}_W\big]$ with $R = 1$, $[A^{(\ell)}_W]_{k,1,\cdot} = [W^{(\ell)}]_{k,\cdot}$, and $[B^{(\ell)}_W]_{k,1} = [b^{(\ell)}_W]_k$, leading to $W^{(\ell)} z^{(\ell-1)}(x) + b^{(\ell)}_W = A^{(\ell)}_W[x]\, z^{(\ell-1)}(x) + B^{(\ell)}_W[x]$. The same is true of a convolution operator with $W^{(\ell)}, b^{(\ell)}_W$ replaced by $C^{(\ell)}, b^{(\ell)}_C$.

Proposition 2. Any activation operator $f^{(\ell)}_\sigma$ using a piecewise affine and convex activation function is a MASO $S\big[A^{(\ell)}_\sigma, B^{(\ell)}_\sigma\big]$ with $R = 2$ and $[B^{(\ell)}_\sigma]_{k,1} = [B^{(\ell)}_\sigma]_{k,2} = 0 \;\forall k$. For ReLU, $[A^{(\ell)}_\sigma]_{k,1,\cdot} = \mathbf{0}$ and $[A^{(\ell)}_\sigma]_{k,2,\cdot} = e_k \;\forall k$; for leaky ReLU, $[A^{(\ell)}_\sigma]_{k,1,\cdot} = \eta e_k$ and $[A^{(\ell)}_\sigma]_{k,2,\cdot} = e_k \;\forall k$, $\eta > 0$; and for absolute value, $[A^{(\ell)}_\sigma]_{k,1,\cdot} = -e_k$ and $[A^{(\ell)}_\sigma]_{k,2,\cdot} = e_k \;\forall k$, where $e_k$ represents the $k$th canonical basis element of $\mathbb{R}^{D^{(\ell)}}$.

Proposition 3. Any pooling operator $f^{(\ell)}_\rho$ that is piecewise affine and convex is a MASO $S\big[A^{(\ell)}_\rho, B^{(\ell)}_\rho\big]$.$^3$ Max-pooling has $R = \#\mathcal{R}_k$ (typically a constant over all output dimensions $k$), $[A^{(\ell)}_\rho]_{k,\cdot,\cdot} = \{ e_i,\ i \in \mathcal{R}_k \}$, and $[B^{(\ell)}_\rho]_{k,r} = 0 \;\forall k, r$. Average-pooling is a degenerate MASO with $R = 1$, $[A^{(\ell)}_\rho]_{k,1,\cdot} = \frac{1}{\#\mathcal{R}_k} \sum_{i \in \mathcal{R}_k} e_i$, and $[B^{(\ell)}_\rho]_{k,1} = 0$.

Proposition 4. A DN layer constructed from an arbitrary composition of fully connected/convolution operators followed by one activation or pooling operator is a MASO $S[A^{(\ell)}, B^{(\ell)}]$ such that

$f^{(\ell)}\big(z^{(\ell-1)}(x)\big) = A^{(\ell)}[x]\, z^{(\ell-1)}(x) + B^{(\ell)}[x].$  (6)

Consequently, a large class of DNs boils down to a composition of MASOs. We prove the following in the SM and in (Balestriero & Baraniuk, 2018) for CNNs, ResNets, skip connection nets, fully connected nets, and RNNs.

$^3$ This result is agnostic to the pooling type (spatial or channel).

Theorem 1. A DN constructed from an arbitrary composition of fully connected/convolution, activation, and pooling operators of the types in Propositions 1–3 is a composition of MASOs that is equivalent to a global affine spline operator.
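The following NumPy sketch (ours) numerically checks the MASO notation of (5) against Propositions 2 and 3: elementwise ReLU is recovered from slopes {0, e_k} with zero offsets, and max-pooling from slopes {e_i : i in R_k}.

```python
import numpy as np

def maso(x, A, B):
    """Evaluate S[A, B](x) per (5). A: (K, R, D), B: (K, R)."""
    # For each output k: the R candidate affine values <[A]_{k,r,.}, x> + [B]_{k,r}
    values = np.einsum('krd,d->kr', A, x) + B
    r = values.argmax(axis=1)                       # r_k(x) for each output k
    return values[np.arange(len(r)), r], r

D = 4
x = np.array([1.5, -0.5, 2.0, -3.0])

# ReLU as a MASO: R = 2, slopes {0, e_k}, zero offsets (Proposition 2).
A_relu = np.stack([np.zeros((D, D)), np.eye(D)], axis=1)     # (K=D, R=2, D)
B_relu = np.zeros((D, 2))
out, _ = maso(x, A_relu, B_relu)
assert np.allclose(out, np.maximum(x, 0))

# Max-pooling over regions {0,1} and {2,3} as a MASO (Proposition 3).
regions = [(0, 1), (2, 3)]
A_pool = np.array([[np.eye(D)[i] for i in R] for R in regions])  # (K=2, R=2, D)
B_pool = np.zeros((2, 2))
out, _ = maso(x, A_pool, B_pool)
assert np.allclose(out, [x[list(R)].max() for R in regions])
```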
Note carefully that, while the layers of each of the DNs stated in Theorem 1 are MASOs, the composition of several layers is not necessarily a MASO. Indeed, a composition of MASOs remains a MASO if and only if all of its component operators (except the first) are non-decreasing with respect to each of their output dimensions (Boyd & Vandenberghe, 2004). Interestingly, ReLU and max-pooling are both non-decreasing, while leaky ReLU is strictly increasing. The culprits causing non-convexity of the composition of layers are negative entries in the fully connected or convolution operators, which destroy the required non-decreasing property. A DN where these culprits are thwarted is an interesting special case, because it is convex with respect to its input (Amos et al., 2016) and multiconvex (Xu & Yin, 2013) with respect to its parameters.

Theorem 2. A DN whose layers $\ell = 2, \dots, L$ consist of an arbitrary composition of fully connected and convolution operators with nonnegative weights, i.e., $[W^{(\ell)}]_{k,j} \ge 0$, $[C^{(\ell)}]_{k,j} \ge 0$; non-decreasing, piecewise-affine, and convex activation operators; and non-decreasing, piecewise-affine, and convex pooling operators is globally a MASO and thus also globally convex with respect to each of its output dimensions.

The above results pertain to DNs using convex, piecewise-affine operators. Other popular non-convex DN operators (e.g., the sigmoid and arctan activation functions) can be approximated arbitrarily closely by an affine spline operator but not by a MASO.

DNs are Signal-Dependent Affine Transformations. A common theme of the above results is that, for DNs constructed from fully connected/convolution, activation, and pooling operators from Propositions 1–3, the operator/layer outputs $z^{(\ell)}(x)$ are always a signal-dependent affine function of the input $x$ (recall (5)). The particular affine mapping applied to $x$ depends on which region of the spline's partition of $\mathbb{R}^D$ the input falls in. More on this in Section 7 below.

DN Learning and MASO Parameters. Given labeled training data $\{(x_n, y_n)\}_{n=1}^N$, learning in a DN that meets the conditions of Theorem 1 (i.e., optimizing its parameters $\Theta$) is equivalent to optimally approximating the mapping from input $x$ to output $\widehat{y} = g\big(z^{(L)}(x)\big)$ using an appropriate cost function (e.g., cross-entropy for classification or squared error for regression) by learning the parameters $\theta^{(\ell)}$ of the layers. In general the overall optimization problem is nonconvex (it is actually piecewise multi-convex in general (Rister, 2016)).

4.2. Application: DN Affine Mapping Formula

Combining Propositions 1–3 and Theorem 1 and substituting (5) into (2), we can write an explicit formula for the output of any layer $z^{(\ell)}(x)$ of a DN in terms of the input $x$ for a variety of different architectures. The formula for a standard CNN (using ReLU activation and max-pooling) is

$z^{(L)}_{\mathrm{CNN}}(x) = W^{(L)} \underbrace{\left( \prod_{\ell=L-1}^{1} A^{(\ell)}_\rho[x]\, A^{(\ell)}_\sigma[x]\, C^{(\ell)} \right)}_{A_{\mathrm{CNN}}[x]} x \;+\; W^{(L)} \underbrace{\sum_{\ell=1}^{L-1} \left( \prod_{j=L-1}^{\ell+1} A^{(j)}_\rho[x]\, A^{(j)}_\sigma[x]\, C^{(j)} \right) A^{(\ell)}_\rho[x]\, A^{(\ell)}_\sigma[x]\, b^{(\ell)}_C}_{B_{\mathrm{CNN}}[x]} \;+\; b^{(L)}_W;$  (7)

we derive this formula and analogous formulas for ResNets and RNNs in (Balestriero & Baraniuk, 2018). In (7), $A^{(\ell)}_\sigma[x]$ are the signal-dependent matrices corresponding to the ReLU activations, $A^{(\ell)}_\rho[x]$ are the signal-dependent matrices corresponding to max-pooling, and the biases $b^{(L)}_W, b^{(\ell)}_C$ arise directly from the fully connected and convolution operators. The absence of $B^{(\ell)}_\sigma[x], B^{(\ell)}_\rho[x]$ is due to the absence of bias in the ReLU (recall Proposition 2) and max-pooling (recall Proposition 3) operators.
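For intuition about (7) and the affine collapse that follows, here is a toy NumPy sketch (ours; a small fully connected ReLU network stands in for the CNN, so the convolution matrices $C^{(\ell)}$ are replaced by dense matrices). It records the signal-dependent ReLU selection matrices $A^{(\ell)}_\sigma[x]$ as diagonal 0/1 masks and verifies that the network output is exactly a signal-dependent affine function $A[x]\,x + B[x]$ of the input.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-layer fully connected ReLU network (no pooling, for simplicity).
D, H1, H2, C = 8, 6, 5, 3
W1, b1 = rng.standard_normal((H1, D)), rng.standard_normal(H1)
W2, b2 = rng.standard_normal((H2, H1)), rng.standard_normal(H2)
W3, b3 = rng.standard_normal((C, H2)), rng.standard_normal(C)

x = rng.standard_normal(D)

# Forward pass, recording the signal-dependent ReLU selection matrices
# A_sigma[x] = diag(1[pre-activation > 0]) for each activation operator.
pre1 = W1 @ x + b1
A1 = np.diag((pre1 > 0).astype(float))
z1 = A1 @ pre1
pre2 = W2 @ z1 + b2
A2 = np.diag((pre2 > 0).astype(float))
z2 = A2 @ pre2
z3 = W3 @ z2 + b3                      # final affine classifier layer

# Collapse into the signal-dependent affine map: z3 = A[x] x + B[x].
A_x = W3 @ A2 @ W2 @ A1 @ W1
B_x = W3 @ (A2 @ W2 @ A1 @ b1 + A2 @ b2) + b3
assert np.allclose(z3, A_x @ x + B_x)
```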
Inspection of (7) reveals the exact form of the signal-dependent, piecewise affine mapping linking $x$ to $z^{(L)}_{\mathrm{CNN}}(x)$. Moreover, this formula can be collapsed into

$z^{(L)}_{\mathrm{CNN}}(x) = W^{(L)} \big( A_{\mathrm{CNN}}[x]\, x + B_{\mathrm{CNN}}[x] \big) + b^{(L)}_W,$  (8)

from which we can recognize

$z^{(L-1)}_{\mathrm{CNN}}(x) = A_{\mathrm{CNN}}[x]\, x + B_{\mathrm{CNN}}[x]$  (9)

as an explicit, signal-dependent, affine formula for the featurization process that aims to convert $x$ into a set of (hopefully) linearly separable features that are then input to the linear classifier in layer $\ell = L$ with parameters $W^{(L)}$ and $b^{(L)}_W$. Of course, the final prediction $\widehat{y}$ is formed by running $z^{(L)}_{\mathrm{CNN}}(x)$ through a softmax nonlinearity $g$, but this merely rescales its entries to create a probability distribution.

5. DNs are Template Matching Machines

We now dig deeper into (8) in order to bridge DNs and classical optimal classification theory. While we focus on CNNs and classification for concreteness, our analysis holds for any DN meeting the conditions of Theorem 1.

5.1. Template Matching

An alternate interpretation of (8) is that $z^{(L)}_{\mathrm{CNN}}(x)$ is the output of a bank of linear matched filters (plus a set of biases). That is, the $c$th element of $z^{(L)}(x)$ equals the inner product between the signal $x$ and the matched filter for the $c$th class, which is contained in the $c$th row of the matrix $W^{(L)} A[x]$. The bias $W^{(L)} B[x] + b^{(L)}_W$ can be used to account for the fact that some classes might be more likely than others (i.e., the prior probability over the classes). It is well known that a matched filterbank is the optimal classifier for deterministic signals in additive white Gaussian noise (Rabiner & Gold, 1975). Given an input $x$, the class decision is simply the index of the largest element of $z^{(L)}(x)$.$^4$

Yet another interpretation of (8) is that $z^{(L)}(x)$ is computed not in a single matched filter calculation but hierarchically as the signal propagates through the DN layers. Abstracting (5) to write the per-layer maximization process as $z^{(\ell)}(x) = \max_{r^{(\ell)}} \big( A^{(\ell)}_{r^{(\ell)}} z^{(\ell-1)}(x) + B^{(\ell)}_{r^{(\ell)}} \big)$ and cascading, we obtain a formula for the end-to-end DN mapping

$z^{(L)}(x) = W^{(L)} \max_{r^{(L-1)}} \bigg( A^{(L-1)}_{r^{(L-1)}} \cdots \max_{r^{(2)}} \Big( A^{(2)}_{r^{(2)}} \max_{r^{(1)}} \big( A^{(1)}_{r^{(1)}} x + B^{(1)}_{r^{(1)}} \big) + B^{(2)}_{r^{(2)}} \Big) \cdots + B^{(L-1)}_{r^{(L-1)}} \bigg) + b^{(L)}_W.$  (10)

This formula elucidates that a DN performs a hierarchical, greedy template matching on its input, a computationally efficient yet sub-optimal template matching technique. Such a procedure is globally optimal when the DN is globally convex.

Corollary 1. For a DN abiding by the requirements of Theorem 2, the computation (10) collapses to the following globally optimal template matching

$z^{(L)}(x) = W^{(L)} \max_{r^{(L-1)}, \dots, r^{(2)}, r^{(1)}} \bigg( A^{(L-1)}_{r^{(L-1)}} \Big( \cdots A^{(2)}_{r^{(2)}} \big( A^{(1)}_{r^{(1)}} x + B^{(1)}_{r^{(1)}} \big) + B^{(2)}_{r^{(2)}} \cdots \Big) + B^{(L-1)}_{r^{(L-1)}} \bigg) + b^{(L)}_W.$  (11)

5.2. Template Visualization Examples

Since the complete DN mapping (up to the final softmax) can be expressed as in (8), given a signal $x$, we can compute the signal-dependent template for class $c$ via $[A[x]]_{c,\cdot} = \frac{d\,[z^{(L)}(x)]_c}{dx}$, which can be efficiently computed via backpropagation (Hecht-Nielsen, 1992).$^5$ Once the template $[A[x]]_{c,\cdot}$ has been computed, the bias term $[b[x]]_c$ can be computed via $[b[x]]_c = [z^{(L)}(x)]_c - \langle [A[x]]_{c,\cdot}, x \rangle$. Figure 1 plots various signal-dependent templates for two CNNs trained on the MNIST and CIFAR10 datasets.

$^4$ Again, since the softmax merely rescales the entries of $z^{(L)}(x)$ into a probability distribution, it does not affect the location of its largest element.

$^5$ In fact, we can use the same backpropagation procedure used for computing the gradient with respect to a fully connected or convolution weight, but instead with respect to the input $x$. This procedure is becoming increasingly popular in the study of adversarial examples (Szegedy et al., 2013).
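To illustrate the template recipe of Section 5.2 (our sketch, not the paper's code; a tiny fully connected ReLU network stands in for the CNN), the snippet below recovers the class-$c$ template as the gradient of $[z^{(L)}(x)]_c$ with respect to $x$, here approximated by central finite differences in place of backpropagation, and checks it against the closed-form signal-dependent affine row.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H, C = 8, 6, 3
W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((C, H)), rng.standard_normal(C)

def logits(x):
    # z^(L)(x) for a toy FC -> ReLU -> FC network
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = rng.standard_normal(D)
c = 1                                     # class whose template we extract

# Template A[x]_{c,.} = d[z^(L)(x)]_c / dx (Section 5.2); finite differences
# stand in here for the backpropagation computation used in practice.
eps = 1e-6
template = np.array([(logits(x + eps * e)[c] - logits(x - eps * e)[c]) / (2 * eps)
                     for e in np.eye(D)])

# Signal-dependent bias b[x]_c = [z^(L)(x)]_c - <template, x>.
bias_c = logits(x)[c] - template @ x

# Analytic check: the template is row c of W2 diag(ReLU mask) W1, and the
# bias collects the propagated offsets, matching the affine form of (8).
mask = np.diag((W1 @ x + b1 > 0.0).astype(float))
assert np.allclose(template, (W2 @ mask @ W1)[c], atol=1e-5)
assert np.isclose(bias_c, (W2 @ mask @ b1 + b2)[c], atol=1e-5)
```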
Figure 1. Visualization of the matched filter templates generated by a CNN using ReLU activation and max-pooling. At top, we show the input image $x$ of the digit 0 from the MNIST dataset along with the three signal-dependent templates $A_{\mathrm{CNN}}[x](0, \cdot)$, $A_{\mathrm{CNN}}[x](3, \cdot)$, and $A_{\mathrm{CNN}}[x](7, \cdot)$. The inner products between $x$ and the three templates are 32.7, 27.7, 27.8, respectively (cf. (12)). Below, we repeat the same experiment with the input image $x$ of a truck from the CIFAR10 dataset along with the three signal-dependent templates corresponding to the classes truck, bird, and dog. The inner products between $x$ and the three templates are 30.6, 12.5, 16.2, respectively. (The final classification will involve not only these inner products but also the relevant biases and softmax transformation.)

5.3. Collinear Templates and Data Set Memorization

Under the matched filterbank interpretation of a DN developed in Section 5.1, the optimal template for an image $x$ of class $c$ is a scaled version of $x$ itself. But what are the optimal templates for the other (incorrect) classes? In an idealized setting, we can answer this question.

Proposition 5. Consider an idealized DN consisting of a composition of MASOs that has sufficient approximation power to span arbitrary MASO matrices $A[x_n]$ from (9) for any input $x_n$ from the training set. Train the DN to classify among $C$ classes using the training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ with normalized inputs $\|x_n\|_2 = 1 \;\forall n$ and the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}(y_n, f_\Theta(x_n))$ with the addition of the regularization constraint that $\sum_c \| [A[x_n]]_{c,\cdot} \|_2^2 < \alpha$ with $\alpha > 0$. At the global minimum of this constrained optimization problem, the rows of $A^\star[x_n]$ (the optimal templates) have the form

$[A^\star[x_n]]_{c,\cdot} = \begin{cases} \sqrt{\tfrac{(C-1)\,\alpha}{C}}\; x_n, & c = y_n \\ -\sqrt{\tfrac{\alpha}{C(C-1)}}\; x_n, & c \neq y_n. \end{cases}$  (12)

In short, the idealized CNN in the proposition will memorize a set of collinear templates whose bimodal outputs force the softmax output to a Dirac delta function (aka 1-hot representation) that peaks at the correct class. Figure 2 confirms this bimodal behavior on the MNIST and CIFAR10 datasets.

Figure 2. Empirical study of the implications of Proposition 5. We illustrate the bimodality of $\langle [A^\star[x_n]]_{c,\cdot}, x_n \rangle$ for the large CNN matched filterbank trained on (a) MNIST and (b) CIFAR10. Training used batch normalization and bias units. Results are similar when trained with bias and no batch normalization. The green histogram summarizes the output values for the correct class (top half of (12)), while the red histogram summarizes the output values for the incorrect classes (bottom half of (12)). The easier the classification problem (MNIST), the more bimodal the distribution.

6. New DNs with Orthogonal Templates

While a DN's signal-dependent matched filterbank (8) is optimized for classifying signals immersed in additive white Gaussian noise, such a statistical model is overly simplistic for most machine learning problems of interest. In practice, errors will arise not just from random noise but also from nuisance variations in the inputs such as arbitrary rotations, positions, and modalities of the objects of interest. The effects of these nuisances are only poorly approximated as Gaussian random errors.
Limited work has been done on filterbanks for classification in non-Gaussian noise; one promising direction involves using not matched but rather orthogonal templates (Eldar & Oppenheim, 2001). For a MASO DN's templates to be orthogonal for all inputs, it is necessary that the rows of the matrix $W^{(L)}$ in the final linear classifier layer be orthogonal. This weak constraint on the DN still enables the earlier layers to create a high-performance, class-agnostic, featurized representation (recall the discussion just below (9)).

To create orthogonal templates during learning, we simply add to the standard (potentially regularized) cross-entropy loss function $\mathcal{L}_{\mathrm{CE}}$ a term that penalizes non-zero off-diagonal entries in the matrix $W^{(L)} (W^{(L)})^T$, leading to the new loss with the additional penalty

$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \sum_{c_1 \neq c_2} \big\langle [W^{(L)}]_{c_1,\cdot},\, [W^{(L)}]_{c_2,\cdot} \big\rangle^2.$  (13)

The parameter $\lambda$ controls the tradeoff between cross-entropy minimization and orthogonality preservation. Conveniently, when minimizing (13) via backpropagation, the orthogonal rows of $W^{(L)}$ induce orthogonal backpropagation updates for the various classes.

We now empirically demonstrate that orthogonal templates lead to significantly improved classification performance. We conducted a range of experiments with three different conventional DN architectures (small CNN, large CNN, and ResNet4-4) trained on three different datasets (SVHN, CIFAR10, and CIFAR100). Each DN employed bias units, ReLU activations, and max-pooling, as well as batch normalization prior to each ReLU. The full experimental details are given in (Balestriero & Baraniuk, 2018). For learning, we used the Adam optimizer with an exponential learning rate decay. All inputs were centered to zero mean and scaled to a maximum value of one. No further preprocessing was performed, such as ZCA whitening (Nam et al., 2014).

We assessed how the classification performance of a given DN would change as we varied the orthogonality penalty $\lambda$ in (13). For each configuration of DN architecture, training dataset, learning rate, and penalty $\lambda$, we averaged over 15 runs to estimate the average performance and standard deviation. We report here on only the CIFAR100 with large CNN experiments. (See (Balestriero & Baraniuk, 2018) for detailed results for all three datasets and the other architectures. The trends for all three datasets are similar and are independent of the learning rate.)

Figure 3. Orthogonal templates significantly boost DN performance with no change to the architecture. (a) Classification performance of the large CNN trained on CIFAR100 for different values of the orthogonality penalty $\lambda$ in (13). We plot the average (black dots), standard deviation (gray shade), and maximum (blue dots) of the test set accuracy over 15 runs. (b, top) Training set error. The blue/black curves correspond to $\lambda = 0/1$. (b, bottom) Test set accuracy over the course of the learning.

The results for CIFAR100 in Figure 3 indicate that the benefits of the orthogonality penalty emerge distinctly as soon as $\lambda > 0$. In addition to improved final accuracy and generalization performance, we see that template orthogonality reduces the temptation of the DN to overfit. (This is especially visible in the examples in (Balestriero & Baraniuk, 2018).) One explanation is that the orthogonal weights $W^{(L)}$ positively impact not only the prediction but also the backpropagation via orthogonal gradient updates with respect to each output dimension's partial derivatives.
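A minimal sketch of the orthogonality penalty in (13), written here against a PyTorch-style training loop (our illustration; `model.classifier.weight`, `logits`, `labels`, and `lam` are hypothetical names standing in for the final-layer weight $W^{(L)}$, the network outputs, the targets, and $\lambda$).

```python
import torch

def template_orthogonality_penalty(W_last):
    """Sum of squared off-diagonal entries of W^(L) (W^(L))^T, cf. (13).

    W_last: (C, D_last) weight of the final linear classifier layer.
    """
    gram = W_last @ W_last.t()                      # (C, C) pairwise row inner products
    off_diag = gram - torch.diag(torch.diagonal(gram))
    return off_diag.pow(2).sum()

# Illustrative use inside a training step (names are placeholders):
#   loss = torch.nn.functional.cross_entropy(logits, labels) \
#          + lam * template_orthogonality_penalty(model.classifier.weight)
#   loss.backward()
```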
7. A DN's Intrinsic Multiscale Partition

Like any spline, it is the interplay between the (affine) spline mappings and the input space partition that works the magic in a MASO DN. Recall from Section 3 that a MASO has the attractive property that it implicitly partitions its input space as a function of its slope and offset parameters. The induced partition $\Omega$ opens up a new geometric avenue to study how a DN clusters and organizes signals in a hierarchical fashion.

7.1. Effect of the DN Operators on the Partition

A DN operator at level $\ell$ directly influences the partitioning of its input space $\mathbb{R}^{D^{(\ell-1)}}$ and indirectly influences the partitioning of the overall signal space $\mathbb{R}^D$. A ReLU activation operator splits each of its input dimensions into two half-spaces depending on the sign of the input in each dimension. This partitions $\mathbb{R}^{D^{(\ell-1)}}$ into a combinatorially large number (up to $2^{D^{(\ell)}}$) of regions. Following a fully connected or convolution operator with a ReLU simply rotates the partition in $\mathbb{R}^{D^{(\ell-1)}}$. A max-pooling operator also partitions $\mathbb{R}^{D^{(\ell-1)}}$ into a combinatorially large number (up to $(\#\mathcal{R})^{D^{(\ell)}}$) of regions, where $\#\mathcal{R}$ is the size of the pooling region.

This per-MASO partitioning of each layer's input space constructs an overall partitioning of the input signal space $\mathbb{R}^D$. As each MASO is applied, it subdivides the input space $\mathbb{R}^D$ into finer and finer partitions. The final partition corresponds to the intersection of all of the intermediate partitions, and hence we can encode the input in terms of the ordered collection of per-layer partition regions into which it falls. This overall process can be interpreted as a hierarchical vector quantization (VQ) of the training input signals $x_n$. There are thus many potential connections between DNs and optimal quantization, information theory, and clustering that we leave for future research. See (Balestriero & Baraniuk, 2018) for some early results.

Figure 4. Example of the partition-based distance between images in the CIFAR10 (top) and SVHN (bottom) datasets. In each image, we plot the target image $x$ at top-left and its 15 nearest neighbors in the partition-based distance at layer $\ell$ of a small CNN trained without batch normalization (see (Balestriero & Baraniuk, 2018) for the details). From left to right, the images display layers $\ell = 1, \dots, 6$, and we see that the images become further and further in Euclidean distance but closer and closer in featurized distance. In particular, the first-layer neighbors share similar colors and shapes (and thus are closer in Euclidean distance). Later-layer neighbors mainly come from the same class, independently of their color or shape.

7.2. Inferring a DN Layer's Intrinsic Partition

Unfortunately, there is no simple formula for the partition of the signal space. However, one can obtain the set of input signals $x_n$ that fall into the same partition region at each layer of a DN. At layer $\ell$, denote the index of the region selected by the input $x$ (recall (6)) by

$\big[ t^{(\ell)}(x) \big]_k = \arg\max_r \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)}(x) \big\rangle + [B^{(\ell)}]_{k,r}.$  (14)

Thus, $[t^{(\ell)}(x)]_k \in \{1, \dots, R^{(\ell)}\}$, with $R^{(\ell)}$ the number of partition regions in the layer's input space. Encoding the partition as an ordered collection of integers designating the active hyperplane parameters from (4), we can now visualize which inputs fall into the same or nearby partitions.
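As a small computational sketch (ours) of the partition code (14) for a ReLU layer: with $R = 2$, the argmax over $r$ reduces to the pre-activation sign pattern, and comparing two codes entrywise gives the layer-$\ell$ partition-based distance used in Section 7.3 below.

```python
import numpy as np

def relu_partition_code(W, b, z_prev):
    """Partition code t^(l)(x) for a ReLU layer, cf. (14).

    For ReLU (R = 2) the argmax over r is simply the sign pattern of the
    pre-activation: region 1 if [W z + b]_k <= 0, region 2 otherwise.
    """
    return (W @ z_prev + b > 0).astype(int)

def partition_distance(t1, t2):
    """Normalized disagreement between two codes (the distance of Sec. 7.3)."""
    return 1.0 - np.mean(t1 == t2)

rng = np.random.default_rng(3)
W, b = rng.standard_normal((16, 8)), rng.standard_normal(16)
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
d = partition_distance(relu_partition_code(W, b, x1),
                       relu_partition_code(W, b, x2))
print(f"layer-1 partition distance between x1 and x2: {d:.3f}")
```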
Due to the very large number of possible regions (up to $2^{D^{(\ell)}}$ for a ReLU at layer $\ell$) and the limited amount of training data, in general, many partition regions will be empty or contain only a single training data point.

7.3. A New Image Distance Based on the DN Partition

To validate the utility of the hierarchical intrinsic clustering induced by a DN, we define a new distance function between the signals $x_1$ and $x_2$ that quantifies the similarity of their position encodings $t^{(\ell)}(x_1)$ and $t^{(\ell)}(x_2)$ at layer $\ell$ via

$d\big( t^{(\ell)}(x_1), t^{(\ell)}(x_2) \big) = 1 - \frac{1}{D^{(\ell)}} \sum_{k=1}^{D^{(\ell)}} \mathbb{1}\big( [t^{(\ell)}(x_1)]_k = [t^{(\ell)}(x_2)]_k \big).$  (15)

For a ReLU MASO, this corresponds simply to counting how many entries of the layer inputs for $x_1$ and $x_2$ are positive or negative at the same positions. For a max-pooling MASO, this corresponds to counting how many argmax positions are the same in each patch for $x_1$ and $x_2$. Figure 4 provides a visualization of the nearest neighbors of a test image under this partition-based distance measure. Visual inspection of the figures highlights that, as we progress through the layers of the DN, similar images become closer in the new distance but further in Euclidean distance.

8. Conclusions

We have used the theory of splines to build a rigorous bridge between deep networks (DNs) and approximation theory. Our key finding is that, conditioned on the input signal, the output of a DN can be written as a simple affine transformation of the input. This links DNs directly to the classical theory of optimal classification via matched filters and provides insights into the positive effects of data memorization. There are many avenues for future work, including a more in-depth analysis of the hierarchical MASO partitioning, particularly from the viewpoint of vector quantization and K-means clustering, which are unsupervised learning techniques, and information theory. The spline viewpoint also could inspire the creation of new DN layers that have certain attractive partitioning or approximation capabilities. We have begun exploring some of these directions in (Balestriero & Baraniuk, 2018).$^6$

$^6$ This work was partially supported by ARO grant W911NF-15-1-0316, AFOSR grant FA9550-14-1-0088, ONR grants N00014-17-1-2551 and N00014-18-1-2571, DARPA grant G001534-7500, and a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047.

References

Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. arXiv preprint arXiv:1609.07152, 2016.

Arora, S., Bhaskara, A., Ge, R., and Ma, T. Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343, 2013.

Balestriero, R. and Baraniuk, R. A spline theory of deep networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 383–392, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/balestriero18b.html.

Bennett, J. and Botkin, M. Structural shape optimization with geometric description and adaptive mesh refinement. AIAA Journal, 23(3):458–464, 1985.

Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 698–728, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
Eldar, Y. C. and Oppenheim, A. V. Orthogonal matched filter detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pp. 2837–2840. IEEE, 2001.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Hannah, L. A. and Dunson, D. B. Multivariate convex regression with adaptive partitioning. Journal of Machine Learning Research, 14(1):3261–3294, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Hecht-Nielsen, R. Theory of the backpropagation neural network. In Neural Networks for Perception, pp. 65–93. Elsevier, 1992.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Lu, H. and Kawaguchi, K. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.

Magnani, A. and Boyd, S. P. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.

Nam, W., Dollár, P., and Han, J. H. Local decorrelation for improved pedestrian detection. In Advances in Neural Information Processing Systems (NIPS), pp. 424–432, 2014.

Pal, S. K. and Mitra, S. Multilayer perceptron, fuzzy sets, and classification. IEEE Transactions on Neural Networks, 3(5):683–697, 1992.

Papyan, V., Romano, Y., and Elad, M. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18(83):1–52, 2017.

Patel, A. B., Nguyen, T., and Baraniuk, R. G. A probabilistic framework for deep learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

Powell, M. J. D. Approximation Theory and Methods. Cambridge University Press, 1981.

Rabiner, L. R. and Gold, B. Theory and Application of Digital Signal Processing. Prentice-Hall, 1975.

Rister, B. Piecewise convexity of artificial neural networks. arXiv preprint arXiv:1607.04917, 2016.

Schmidhuber, J. Discovering problem solutions with low Kolmogorov complexity and high generalization capability. In Machine Learning: Proceedings of the Twelfth International Conference. Citeseer, 1994.

Soatto, S. and Chiuso, A. Visual representations: Defining properties and deep approximations. In Proceedings of the International Conference on Learning Representations (ICLR'16), Nov. 2016.

Soudry, D. and Hoffer, E. Exponentially vanishing suboptimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Training very deep networks. In Advances in Neural Information Processing Systems (NIPS), pp. 2377–2385, 2015.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Targ, S., Almeida, D., and Lyman, K. Resnet in Resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.

Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
Xu, Y. and Yin, W. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pp. 818–833, 2014.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.