# A Spline Theory of Deep Networks

Randall Balestriero, Richard G. Baraniuk (ECE Department, Rice University, Houston, TX, USA. Correspondence to: Randall B.)

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. This implies that a DN constructs a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product; we explore the links to the classical theory of optimal classification via matched filters and the effects of data memorization. Going further, we propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other; this leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture. The spline partition of the input signal space opens up a new geometric avenue to study how DNs organize signals in a hierarchical fashion. As an application, we develop and validate a new distance metric for signals that quantifies the difference between their partition encodings.

1. Introduction

Deep learning has significantly advanced our ability to address a wide range of difficult machine learning and signal processing problems. Today's machine learning landscape is dominated by deep (neural) networks (DNs), which are compositions of a large number of simple parameterized linear and nonlinear transforms. An all-too-common story of late is that of plugging a deep network into an application as a black box, training it on copious training data, and then significantly improving performance over classical approaches.

Despite this empirical progress, the precise mechanisms by which deep learning works so well remain relatively poorly understood, adding an air of mystery to the entire field. Ongoing attempts to build a rigorous mathematical framework fall roughly into five camps: (i) probing and measuring DNs to visualize their inner workings (Zeiler & Fergus, 2014); (ii) analyzing their properties such as expressive power (Cohen et al., 2016), loss surface geometry (Lu & Kawaguchi, 2017; Soudry & Hoffer, 2017), nuisance management (Soatto & Chiuso, 2016), sparsification (Papyan et al., 2017), and generalization abilities; (iii) new mathematical frameworks that share some (but not all) common features with DNs (Bruna & Mallat, 2013); (iv) probabilistic generative models from which specific DNs can be derived (Arora et al., 2013; Patel et al., 2016); and (v) information-theoretic bounds (Tishby & Zaslavsky, 2015).

In this paper, we build a rigorous bridge between DNs and approximation theory via spline functions and operators.
We prove that a large class of DNs, including convolutional neural networks (CNNs) (LeCun, 1998), residual networks (ResNets) (He et al., 2016; Targ et al., 2016), skip connection networks (Srivastava et al., 2015), fully connected networks (Pal & Mitra, 1992), recurrent neural networks (RNNs) (Graves, 2013), and beyond, can be written as spline operators. In particular, when these networks employ current standard-practice piecewise-affine, convex nonlinearities (e.g., ReLU, max-pooling, etc.), they can be written as the composition of max-affine spline operators (MASOs) (Magnani & Boyd, 2009; Hannah & Dunson, 2013). We focus on such nonlinearities here but note that our framework applies also to non-piecewise-affine nonlinearities through a standard approximation argument.

The max-affine spline connection provides a powerful portal through which to view and analyze the inner workings of a DN using tools from approximation theory and functional analysis. Here is a summary of our key contributions:

[C1] We prove that a large class of DNs can be written as a composition of MASOs, from which it follows immediately that, conditioned on the input signal, the output of a DN is a simple affine transformation of the input. We illustrate in Section 4 by deriving a closed-form expression for the input/output mapping of a CNN.

[C2] The affine mapping formula enables us to interpret a MASO DN as constructing a set of signal-dependent, class-specific templates against which the signal is compared via a simple inner product. In Section 5 we relate DNs directly to the classical theory of optimal classification via matched filters and provide insights into the effects of data memorization (Zhang et al., 2016).

[C3] We propose a simple penalty term that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other. In Section 6, we show that this leads to significantly improved classification performance and reduced overfitting on standard test data sets like CIFAR100 with no change to the DN architecture.

[C4] The partition of the input space induced by a MASO links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up a new geometric avenue to study how DNs cluster and organize signals in a hierarchical fashion. Section 7 studies the properties of the MASO partition.

[C5] Leveraging the fact that a DN considers two signals to be similar if they lie in the same MASO partition region, we develop a new signal distance in Section 7.3 that measures the difference between their partition encodings. The distance is easily computed via backpropagation.

A number of appendices in the Supplementary Material (SM) contain the mathematical setup and proofs. A significantly extended account of this work with numerous new results is available in (Balestriero & Baraniuk, 2018).

2. Background on Deep Networks

A deep network (DN) is an operator $f_\Theta: \mathbb{R}^D \to \mathbb{R}^C$ that maps an input signal$^1$ $x \in \mathbb{R}^D$ to an output prediction $\widehat{y} \in \mathbb{R}^C$. All current DNs can be written as a composition of $L$ intermediate mappings called layers

$f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(1)}_{\theta^{(1)}} \right)(x),$  (1)

where $\Theta = \{\theta^{(1)}, \dots, \theta^{(L)}\}$ is the collection of the network's parameters from each layer. This composition of mappings is nonlinear and non-commutative, in general.

A DN layer at level $\ell$ is an operator $f^{(\ell)}_{\theta^{(\ell)}}$ that takes as input the vector-valued signal $z^{(\ell-1)}(x) \in \mathbb{R}^{D^{(\ell-1)}}$ and produces the vector-valued output $z^{(\ell)}(x) \in \mathbb{R}^{D^{(\ell)}}$.
We will assume that $x$ and $z^{(\ell)}$ are column vectors. We initialize with $z^{(0)}(x) = x$ and denote $z^{(L)}(x) =: z$ for convenience. The signals $z^{(\ell)}(x)$ are typically called feature maps; it is easy to see that

$z^{(\ell)}(x) = \left( f^{(\ell)}_{\theta^{(\ell)}} \circ \cdots \circ f^{(1)}_{\theta^{(1)}} \right)(x), \quad \ell \in \{1, \dots, L\}.$  (2)

$^1$ For concreteness, we focus here on processing $K$-channel images $x$, such as color digital photographs. But our analysis and techniques apply to signals of any index-dimensionality, including speech and audio signals, video signals, etc.

We briefly overview the basic DN operators and layers we consider in this paper; more details and additional layers are provided in (Goodfellow et al., 2016) and (Balestriero & Baraniuk, 2018).

A fully connected operator performs an arbitrary affine transformation by multiplying its input by the dense matrix $W^{(\ell)} \in \mathbb{R}^{D^{(\ell)} \times D^{(\ell-1)}}$ and adding the arbitrary bias vector $b^{(\ell)}_W \in \mathbb{R}^{D^{(\ell)}}$, as in $f^{(\ell)}_W\big(z^{(\ell-1)}(x)\big) := W^{(\ell)} z^{(\ell-1)}(x) + b^{(\ell)}_W$.

A convolution operator reduces the number of parameters in the affine transformation by replacing the unconstrained $W^{(\ell)}$ with a multichannel convolution matrix, as in $f^{(\ell)}_C\big(z^{(\ell-1)}(x)\big) := C^{(\ell)} z^{(\ell-1)}(x) + b^{(\ell)}_C$.

An activation operator applies a scalar nonlinear activation function $\sigma$ independently to each entry of its input, as in $\big[ f^{(\ell)}_\sigma\big(z^{(\ell-1)}(x)\big) \big]_k := \sigma\big( [z^{(\ell-1)}(x)]_k \big)$, $k = 1, \dots, D^{(\ell)}$. Nonlinearities are crucial to DNs, since otherwise the entire network would collapse to a single global affine transform. Three popular activation functions are the rectified linear unit (ReLU) $\sigma_{\mathrm{ReLU}}(u) := \max(u, 0)$, the leaky ReLU $\sigma_{\mathrm{LReLU}}(u) := \max(\eta u, u)$, $\eta > 0$, and the absolute value $\sigma_{\mathrm{abs}}(u) := |u|$. These three functions are all piecewise affine and convex. Other popular activation functions include the sigmoid $\sigma_{\mathrm{sig}}(u) := \frac{1}{1 + e^{-u}}$ and the hyperbolic tangent $\sigma_{\tanh}(u) := 2\sigma_{\mathrm{sig}}(2u) - 1$. These two functions are neither piecewise affine nor convex.

A pooling operator subsamples its input to reduce its dimensionality according to a sub-sampling policy $\rho$ applied over a collection of input indices $\{\mathcal{R}_k\}_{k=1}^{K^{(\ell)}}$ (typically a small patch), e.g., max-pooling $\big[ f^{(\ell)}_\rho\big(z^{(\ell-1)}(x)\big) \big]_k := \max_{d \in \mathcal{R}^{(\ell)}_k} \big[ z^{(\ell-1)}(x) \big]_d$, $k = 1, \dots, D^{(\ell)}$. See (Balestriero & Baraniuk, 2018) for the definitions of average pooling, channel pooling, skip connections, and recurrent layers.

Definition 1. A DN layer $f^{(\ell)}_{\theta^{(\ell)}}$ comprises a single nonlinear DN operator (non-affine, to be precise) composed with any preceding affine operators lying between it and the preceding nonlinear operator.

This definition yields a single, unique layer decomposition for any DN, and the complete DN is then the composition of its layers per (1). For example, in a standard CNN, there are two different layer types: (i) convolution-activation and (ii) max-pooling.

We form the prediction $\widehat{y}$ by feeding $f_\Theta(x)$ through a final nonlinearity $g: \mathbb{R}^{D^{(L)}} \to \mathbb{R}^{D^{(L)}}$ as in $\widehat{y} = g(f_\Theta(x))$. In classification, $g$ is typically the softmax nonlinearity, which arises naturally from posing the classification inference as a multinomial logistic regression problem (Bishop, 1995). In regression, typically no $g$ is applied.

We learn the DN parameters $\Theta$ for a particular prediction task in a supervised setting using a labeled data set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, a loss function, and a learning policy to update the parameters $\Theta$ in the predictor $f_\Theta(x)$. For classification problems, the loss function is typically the cross-entropy $\mathcal{L}_{\mathrm{CE}}(x, y)$ (Bishop, 1995). For regression problems, the loss function is typically the squared error.
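To make the layer composition in (1)–(2) concrete, here is a minimal NumPy sketch (ours, not from the paper; all sizes and weights are arbitrary illustrations) that chains a fully connected operator, a ReLU activation operator, a max-pooling operator, and a final affine map.

```python
import numpy as np

rng = np.random.default_rng(0)

def fully_connected(z, W, b):
    # f_W(z) = W z + b
    return W @ z + b

def relu(z):
    # activation operator: sigma_ReLU applied entrywise
    return np.maximum(z, 0.0)

def max_pool(z, regions):
    # pooling operator: one output per pooling region R_k
    return np.array([z[list(R)].max() for R in regions])

# Toy instance of the composition f = f^(3) o f^(2) o f^(1) from (1):
x = rng.standard_normal(8)                                   # input, D = 8
W1, b1 = rng.standard_normal((6, 8)), rng.standard_normal(6)
W2, b2 = rng.standard_normal((3, 2)), rng.standard_normal(3)

z1 = relu(fully_connected(x, W1, b1))              # layer 1: affine + activation
z2 = max_pool(z1, regions=[(0, 1, 2), (3, 4, 5)])  # layer 2: max-pooling
z3 = fully_connected(z2, W2, b2)                   # layer 3: final affine map
print(z3.shape)                                    # (3,) output logits
```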
Since the layer-by-layer operations in a DN are differentiable almost everywhere with respect to their parameters and inputs, we can use some flavor of first-order optimization such as gradient descent to optimize the parameters $\Theta$ with respect to the loss function. Moreover, the gradients for all internal parameters can be computed efficiently by backpropagation (Hecht-Nielsen, 1992), which follows from the chain rule of calculus.

3. Background on Spline Operators

Approximation theory is the study of how and how well functions can best be approximated using simpler functions (Powell, 1981). A classical example of a simpler function is a spline $s: \mathbb{R}^D \to \mathbb{R}$ (Schmidhuber, 1994). For concreteness, we will work exclusively with affine splines in this paper (aka "linear splines"), but our ideas generalize naturally to higher-order splines.

Multivariate Affine Splines. Consider a partition of a domain $\mathbb{R}^D$ into a set of regions $\Omega = \{\omega_1, \dots, \omega_R\}$ and a set of local mappings $\Phi = \{\phi_1, \dots, \phi_R\}$ that map each region in the partition to $\mathbb{R}$ via $\phi_r(x) := \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r$ for $x \in \omega_r$.$^2$ The parameters are: $\alpha \in \mathbb{R}^{R \times D}$, a matrix of hyperplane "slopes", and $\beta \in \mathbb{R}^R$, a vector of hyperplane "offsets" or "biases". We will use the terms offset and bias interchangeably in the sequel. The notation $[\alpha]_{r,\cdot}$ denotes the column vector formed from the $r$th row of $\alpha$. With this setup, the multivariate affine spline is defined as

$s[\alpha, \beta, \Omega](x) = \sum_{r=1}^{R} \big( \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r \big)\, \mathbb{1}(x \in \omega_r) =: \langle \alpha[x], x \rangle + \beta[x],$  (3)

where $\mathbb{1}(x \in \omega_r)$ is the indicator function. The second equality in (3) introduces the streamlined notation $\alpha[x] = [\alpha]_{r,\cdot}$ when $x \in \omega_r$; the definition for $\beta[x]$ is similar. Such a spline is piecewise affine and hence piecewise convex. However, in general, it is neither globally affine nor globally convex unless $R = 1$, a case we denote as a degenerate spline, since it corresponds simply to an affine mapping.

$^2$ To make the connection between splines and DNs more immediately obvious, here $x$ is interpreted as a point in $\mathbb{R}^D$, which plays the rôle of the space of signals in the other sections.

Max-Affine Spline Functions. A major complication of function approximation with splines in general is the need to jointly optimize both the spline parameters $\alpha, \beta$ and the input domain partition $\Omega$ (the "knots" for a 1D spline) (Bennett & Botkin, 1985). However, if a multivariate affine spline is constrained to be globally convex, then it can always be rewritten as a max-affine spline (Magnani & Boyd, 2009; Hannah & Dunson, 2013)

$s[\alpha, \beta, \Omega](x) = \max_{r=1,\dots,R} \langle [\alpha]_{r,\cdot}, x \rangle + [\beta]_r.$  (4)

An extremely useful feature of such a spline is that it is completely determined by its parameters $\alpha$ and $\beta$ without needing to specify the partition $\Omega$. As such, we denote a max-affine spline simply as $s[\alpha, \beta]$. Changes in the parameters $\alpha, \beta$ of a max-affine spline automatically induce changes in the partition $\Omega$, meaning that they are adaptive partitioning splines (Magnani & Boyd, 2009).

Max-Affine Spline Operators. A natural extension of an affine spline function is an affine spline operator (ASO) $S[A, B, \Omega_S]$ that produces a multivariate output. It is obtained simply by concatenating $K$ affine spline functions from (3). The details and a more general development are provided in the SM and (Balestriero & Baraniuk, 2018). We are particularly interested in the max-affine spline operator (MASO) $S[A, B]: \mathbb{R}^D \to \mathbb{R}^K$ formed by concatenating $K$ independent max-affine spline functions from (4).
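As a quick illustration of (4) (our sketch, not the paper's), the function below evaluates a max-affine spline and returns the active region index, showing how the partition is implicit in the parameters; ReLU on a scalar input is the special case with slopes {0, 1} and zero offsets.

```python
import numpy as np

def max_affine_spline(x, alpha, beta):
    """Evaluate s[alpha, beta](x) = max_r <[alpha]_r, x> + [beta]_r, cf. (4).

    alpha: (R, D) matrix of hyperplane slopes
    beta:  (R,)   vector of offsets
    Returns the spline value and the active region index r(x).
    """
    activations = alpha @ x + beta      # one affine value per region r
    r = int(np.argmax(activations))     # region of the implicit partition
    return activations[r], r

# Example: scalar ReLU is the max-affine spline max(0, x) with
# alpha = [[0], [1]] and beta = [0, 0].
alpha = np.array([[0.0], [1.0]])
beta = np.array([0.0, 0.0])
for x in (-2.0, 3.0):
    value, region = max_affine_spline(np.array([x]), alpha, beta)
    assert value == max(0.0, x)
```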
A MASO with slope parameters $A \in \mathbb{R}^{K \times R \times D}$ and offset parameters $B \in \mathbb{R}^{K \times R}$ is defined as

$S[A, B](x) = \begin{bmatrix} \max_{r=1,\dots,R} \langle [A]_{1,r,\cdot}, x \rangle + [B]_{1,r} \\ \vdots \\ \max_{r=1,\dots,R} \langle [A]_{K,r,\cdot}, x \rangle + [B]_{K,r} \end{bmatrix} =: A[x]\, x + B[x].$  (5)

The second equality in (5) introduces the streamlined notation in terms of the signal-dependent matrix $A[x]$ and signal-dependent vector $B[x]$, where $[A[x]]_{k,\cdot} := [A]_{k,r_k(x),\cdot}$ and $[B[x]]_k := [B]_{k,r_k(x)}$ with $r_k(x) = \arg\max_r \langle [A]_{k,r,\cdot}, x \rangle + [B]_{k,r}$.

Max-affine spline functions and operators are always piecewise affine and globally convex (and hence also continuous) with respect to each output dimension. Conversely, any piecewise affine and globally convex function/operator can be written as a max-affine spline. Moreover, using standard approximation arguments, it is easy to show that a MASO can approximate arbitrarily closely any (nonlinear) operator that is convex in each output dimension.

4. DNs are Compositions of Spline Operators

While a MASO is appropriate only for approximating convex functions/operators, we now show that virtually all of today's DNs can be written as a composition of MASOs, one for each layer. Such a composition is, in general, nonconvex and hence can approximate a much larger class of functions/operators. Interestingly, under certain broad conditions, the composition remains a piecewise affine spline operator, which enables a variety of insights into DNs.

4.1. DN Operators are MASOs

We now state our main theoretical results, which are proved in the SM and elaborated in (Balestriero & Baraniuk, 2018).

Proposition 1. An arbitrary fully connected operator $f^{(\ell)}_W$ is an affine mapping and hence a degenerate MASO $S\big[A^{(\ell)}_W, B^{(\ell)}_W\big]$ with $R = 1$, $[A^{(\ell)}_W]_{k,1,\cdot} = [W^{(\ell)}]_{k,\cdot}$, and $[B^{(\ell)}_W]_{k,1} = [b^{(\ell)}_W]_k$, leading to $W^{(\ell)} z^{(\ell-1)}(x) + b^{(\ell)}_W = A^{(\ell)}_W[x]\, z^{(\ell-1)}(x) + B^{(\ell)}_W[x]$. The same is true of a convolution operator with $W^{(\ell)}, b^{(\ell)}_W$ replaced by $C^{(\ell)}, b^{(\ell)}_C$.

Proposition 2. Any activation operator $f^{(\ell)}_\sigma$ using a piecewise affine and convex activation function is a MASO $S\big[A^{(\ell)}_\sigma, B^{(\ell)}_\sigma\big]$ with $R = 2$ and $[B^{(\ell)}_\sigma]_{k,1} = [B^{(\ell)}_\sigma]_{k,2} = 0 \;\forall k$. For ReLU, $[A^{(\ell)}_\sigma]_{k,1,\cdot} = \mathbf{0}$ and $[A^{(\ell)}_\sigma]_{k,2,\cdot} = e_k \;\forall k$; for leaky ReLU, $[A^{(\ell)}_\sigma]_{k,1,\cdot} = \eta e_k$ and $[A^{(\ell)}_\sigma]_{k,2,\cdot} = e_k \;\forall k$, $\eta > 0$; and for absolute value, $[A^{(\ell)}_\sigma]_{k,1,\cdot} = -e_k$ and $[A^{(\ell)}_\sigma]_{k,2,\cdot} = e_k \;\forall k$, where $e_k$ represents the $k$th canonical basis element of $\mathbb{R}^{D^{(\ell)}}$.

Proposition 3. Any pooling operator $f^{(\ell)}_\rho$ that is piecewise affine and convex is a MASO $S\big[A^{(\ell)}_\rho, B^{(\ell)}_\rho\big]$.$^3$ Max-pooling has $R = \#\mathcal{R}_k$ (typically a constant over all output dimensions $k$), $[A^{(\ell)}_\rho]_{k,\cdot,\cdot} = \{ e_i,\ i \in \mathcal{R}_k \}$, and $[B^{(\ell)}_\rho]_{k,r} = 0 \;\forall k, r$. Average-pooling is a degenerate MASO with $R = 1$, $[A^{(\ell)}_\rho]_{k,1,\cdot} = \frac{1}{\#\mathcal{R}_k} \sum_{i \in \mathcal{R}_k} e_i$, and $[B^{(\ell)}_\rho]_{k,1} = 0$.

Proposition 4. A DN layer constructed from an arbitrary composition of fully connected/convolution operators followed by one activation or pooling operator is a MASO $S[A^{(\ell)}, B^{(\ell)}]$ such that

$f^{(\ell)}\big(z^{(\ell-1)}(x)\big) = A^{(\ell)}[x]\, z^{(\ell-1)}(x) + B^{(\ell)}[x].$  (6)

Consequently, a large class of DNs boils down to a composition of MASOs. We prove the following in the SM and in (Balestriero & Baraniuk, 2018) for CNNs, ResNets, skip connection nets, fully connected nets, and RNNs.

$^3$ This result is agnostic to the pooling type (spatial or channel).

Theorem 1. A DN constructed from an arbitrary composition of fully connected/convolution, activation, and pooling operators of the types in Propositions 1–3 is a composition of MASOs that is equivalent to a global affine spline operator.
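The following NumPy sketch (ours) numerically checks the MASO notation of (5) against Propositions 2 and 3: elementwise ReLU is recovered from slopes {0, e_k} with zero offsets, and max-pooling from slopes {e_i : i in R_k}.

```python
import numpy as np

def maso(x, A, B):
    """Evaluate S[A, B](x) per (5). A: (K, R, D), B: (K, R)."""
    # For each output k: the R candidate affine values <[A]_{k,r,.}, x> + [B]_{k,r}
    values = np.einsum('krd,d->kr', A, x) + B
    r = values.argmax(axis=1)                       # r_k(x) for each output k
    return values[np.arange(len(r)), r], r

D = 4
x = np.array([1.5, -0.5, 2.0, -3.0])

# ReLU as a MASO: R = 2, slopes {0, e_k}, zero offsets (Proposition 2).
A_relu = np.stack([np.zeros((D, D)), np.eye(D)], axis=1)     # (K=D, R=2, D)
B_relu = np.zeros((D, 2))
out, _ = maso(x, A_relu, B_relu)
assert np.allclose(out, np.maximum(x, 0))

# Max-pooling over regions {0,1} and {2,3} as a MASO (Proposition 3).
regions = [(0, 1), (2, 3)]
A_pool = np.array([[np.eye(D)[i] for i in R] for R in regions])  # (K=2, R=2, D)
B_pool = np.zeros((2, 2))
out, _ = maso(x, A_pool, B_pool)
assert np.allclose(out, [x[list(R)].max() for R in regions])
```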
Note carefully that, while the layers of each of the DNs stated in Theorem 1 are MASOs, the composition of several layers is not necessarily a MASO. Indeed, a composition of MASOs remains a MASO if and only if all of its component operators (except the first) are non-decreasing with respect to each of their output dimensions (Boyd & Vandenberghe, 2004). Interestingly, ReLU and max-pooling are both non-decreasing, while leaky ReLU is strictly increasing. The culprits causing non-convexity of the composition of layers are negative entries in the fully connected or convolution operators, which destroy the required non-decreasing property. A DN where these culprits are thwarted is an interesting special case, because it is convex with respect to its input (Amos et al., 2016) and multiconvex (Xu & Yin, 2013) with respect to its parameters.

Theorem 2. A DN whose layers $\ell = 2, \dots, L$ consist of an arbitrary composition of fully connected and convolution operators with nonnegative weights, i.e., $[W^{(\ell)}]_{k,j} \ge 0$, $[C^{(\ell)}]_{k,j} \ge 0$; non-decreasing, piecewise-affine, and convex activation operators; and non-decreasing, piecewise-affine, and convex pooling operators is globally a MASO and thus also globally convex with respect to each of its output dimensions.

The above results pertain to DNs using convex, piecewise-affine operators. Other popular non-convex DN operators (e.g., the sigmoid and arctan activation functions) can be approximated arbitrarily closely by an affine spline operator but not by a MASO.

DNs are Signal-Dependent Affine Transformations. A common theme of the above results is that, for DNs constructed from fully connected/convolution, activation, and pooling operators from Propositions 1–3, the operator/layer outputs $z^{(\ell)}(x)$ are always a signal-dependent affine function of the input $x$ (recall (5)). The particular affine mapping applied to $x$ depends on which region of the spline's partition of $\mathbb{R}^D$ the input falls in. More on this in Section 7 below.

DN Learning and MASO Parameters. Given labeled training data $\{(x_n, y_n)\}_{n=1}^N$, learning in a DN that meets the conditions of Theorem 1 (i.e., optimizing its parameters $\Theta$) is equivalent to optimally approximating the mapping from input $x$ to output $\widehat{y} = g\big(z^{(L)}(x)\big)$ using an appropriate cost function (e.g., cross-entropy for classification or squared error for regression) by learning the parameters $\theta^{(\ell)}$ of the layers. In general the overall optimization problem is nonconvex (it is actually piecewise multi-convex in general (Rister, 2016)).

4.2. Application: DN Affine Mapping Formula

Combining Propositions 1–3 and Theorem 1 and substituting (5) into (2), we can write an explicit formula for the output of any layer $z^{(\ell)}(x)$ of a DN in terms of the input $x$ for a variety of different architectures. The formula for a standard CNN (using ReLU activation and max-pooling) is

$z^{(L)}_{\mathrm{CNN}}(x) = W^{(L)} \underbrace{\left( \prod_{\ell=L-1}^{1} A^{(\ell)}_\rho[x]\, A^{(\ell)}_\sigma[x]\, C^{(\ell)} \right)}_{A_{\mathrm{CNN}}[x]} x \;+\; W^{(L)} \underbrace{\sum_{\ell=1}^{L-1} \left( \prod_{j=L-1}^{\ell+1} A^{(j)}_\rho[x]\, A^{(j)}_\sigma[x]\, C^{(j)} \right) A^{(\ell)}_\rho[x]\, A^{(\ell)}_\sigma[x]\, b^{(\ell)}_C}_{B_{\mathrm{CNN}}[x]} \;+\; b^{(L)}_W;$  (7)

we derive this formula and analogous formulas for ResNets and RNNs in (Balestriero & Baraniuk, 2018). In (7), $A^{(\ell)}_\sigma[x]$ are the signal-dependent matrices corresponding to the ReLU activations, $A^{(\ell)}_\rho[x]$ are the signal-dependent matrices corresponding to max-pooling, and the biases $b^{(L)}_W, b^{(\ell)}_C$ arise directly from the fully connected and convolution operators. The absence of $B^{(\ell)}_\sigma[x], B^{(\ell)}_\rho[x]$ is due to the absence of bias in the ReLU (recall Proposition 2) and max-pooling (recall Proposition 3) operators.
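For intuition about (7) and the affine collapse that follows, here is a toy NumPy sketch (ours; a small fully connected ReLU network stands in for the CNN, so the convolution matrices $C^{(\ell)}$ are replaced by dense matrices). It records the signal-dependent ReLU selection matrices $A^{(\ell)}_\sigma[x]$ as diagonal 0/1 masks and verifies that the network output is exactly a signal-dependent affine function $A[x]\,x + B[x]$ of the input.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-layer fully connected ReLU network (no pooling, for simplicity).
D, H1, H2, C = 8, 6, 5, 3
W1, b1 = rng.standard_normal((H1, D)), rng.standard_normal(H1)
W2, b2 = rng.standard_normal((H2, H1)), rng.standard_normal(H2)
W3, b3 = rng.standard_normal((C, H2)), rng.standard_normal(C)

x = rng.standard_normal(D)

# Forward pass, recording the signal-dependent ReLU selection matrices
# A_sigma[x] = diag(1[pre-activation > 0]) for each activation operator.
pre1 = W1 @ x + b1
A1 = np.diag((pre1 > 0).astype(float))
z1 = A1 @ pre1
pre2 = W2 @ z1 + b2
A2 = np.diag((pre2 > 0).astype(float))
z2 = A2 @ pre2
z3 = W3 @ z2 + b3                      # final affine classifier layer

# Collapse into the signal-dependent affine map: z3 = A[x] x + B[x].
A_x = W3 @ A2 @ W2 @ A1 @ W1
B_x = W3 @ (A2 @ W2 @ A1 @ b1 + A2 @ b2) + b3
assert np.allclose(z3, A_x @ x + B_x)
```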
Inspection of (7) reveals the exact form of the signal-dependent, piecewise affine mapping linking $x$ to $z^{(L)}_{\mathrm{CNN}}(x)$. Moreover, this formula can be collapsed into

$z^{(L)}_{\mathrm{CNN}}(x) = W^{(L)} \big( A_{\mathrm{CNN}}[x]\, x + B_{\mathrm{CNN}}[x] \big) + b^{(L)}_W,$  (8)

from which we can recognize

$z^{(L-1)}_{\mathrm{CNN}}(x) = A_{\mathrm{CNN}}[x]\, x + B_{\mathrm{CNN}}[x]$  (9)

as an explicit, signal-dependent, affine formula for the featurization process that aims to convert $x$ into a set of (hopefully) linearly separable features that are then input to the linear classifier in layer $\ell = L$ with parameters $W^{(L)}$ and $b^{(L)}_W$. Of course, the final prediction $\widehat{y}$ is formed by running $z^{(L)}_{\mathrm{CNN}}(x)$ through a softmax nonlinearity $g$, but this merely rescales its entries to create a probability distribution.

5. DNs are Template Matching Machines

We now dig deeper into (8) in order to bridge DNs and classical optimal classification theory. While we focus on CNNs and classification for concreteness, our analysis holds for any DN meeting the conditions of Theorem 1.

5.1. Template Matching

An alternate interpretation of (8) is that $z^{(L)}_{\mathrm{CNN}}(x)$ is the output of a bank of linear matched filters (plus a set of biases). That is, the $c$th element of $z^{(L)}(x)$ equals the inner product between the signal $x$ and the matched filter for the $c$th class, which is contained in the $c$th row of the matrix $W^{(L)} A[x]$. The bias $W^{(L)} B[x] + b^{(L)}_W$ can be used to account for the fact that some classes might be more likely than others (i.e., the prior probability over the classes). It is well known that a matched filterbank is the optimal classifier for deterministic signals in additive white Gaussian noise (Rabiner & Gold, 1975). Given an input $x$, the class decision is simply the index of the largest element of $z^{(L)}(x)$.$^4$

Yet another interpretation of (8) is that $z^{(L)}(x)$ is computed not in a single matched filter calculation but hierarchically as the signal propagates through the DN layers. Abstracting (5) to write the per-layer maximization process as $z^{(\ell)}(x) = \max_{r^{(\ell)}} \big( A^{(\ell)}_{r^{(\ell)}} z^{(\ell-1)}(x) + B^{(\ell)}_{r^{(\ell)}} \big)$ and cascading, we obtain a formula for the end-to-end DN mapping

$z^{(L)}(x) = W^{(L)} \max_{r^{(L-1)}} \bigg( A^{(L-1)}_{r^{(L-1)}} \cdots \max_{r^{(2)}} \Big( A^{(2)}_{r^{(2)}} \max_{r^{(1)}} \big( A^{(1)}_{r^{(1)}} x + B^{(1)}_{r^{(1)}} \big) + B^{(2)}_{r^{(2)}} \Big) \cdots + B^{(L-1)}_{r^{(L-1)}} \bigg) + b^{(L)}_W.$  (10)

This formula elucidates that a DN performs a hierarchical, greedy template matching on its input, a computationally efficient yet sub-optimal template matching technique. Such a procedure is globally optimal when the DN is globally convex.

Corollary 1. For a DN abiding by the requirements of Theorem 2, the computation (10) collapses to the following globally optimal template matching

$z^{(L)}(x) = W^{(L)} \max_{r^{(L-1)}, \dots, r^{(2)}, r^{(1)}} \bigg( A^{(L-1)}_{r^{(L-1)}} \Big( \cdots A^{(2)}_{r^{(2)}} \big( A^{(1)}_{r^{(1)}} x + B^{(1)}_{r^{(1)}} \big) + B^{(2)}_{r^{(2)}} \cdots \Big) + B^{(L-1)}_{r^{(L-1)}} \bigg) + b^{(L)}_W.$  (11)

5.2. Template Visualization Examples

Since the complete DN mapping (up to the final softmax) can be expressed as in (8), given a signal $x$, we can compute the signal-dependent template for class $c$ via $[A[x]]_{c,\cdot} = \frac{d\,[z^{(L)}(x)]_c}{dx}$, which can be efficiently computed via backpropagation (Hecht-Nielsen, 1992).$^5$ Once the template $[A[x]]_{c,\cdot}$ has been computed, the bias term $[b[x]]_c$ can be computed via $[b[x]]_c = [z^{(L)}(x)]_c - \langle [A[x]]_{c,\cdot}, x \rangle$. Figure 1 plots various signal-dependent templates for two CNNs trained on the MNIST and CIFAR10 datasets.

$^4$ Again, since the softmax merely rescales the entries of $z^{(L)}(x)$ into a probability distribution, it does not affect the location of its largest element.

$^5$ In fact, we can use the same backpropagation procedure used for computing the gradient with respect to a fully connected or convolution weight, but instead with respect to the input $x$. This procedure is becoming increasingly popular in the study of adversarial examples (Szegedy et al., 2013).
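To illustrate the template recipe of Section 5.2 (our sketch, not the paper's code; a tiny fully connected ReLU network stands in for the CNN), the snippet below recovers the class-$c$ template as the gradient of $[z^{(L)}(x)]_c$ with respect to $x$, here approximated by central finite differences in place of backpropagation, and checks it against the closed-form signal-dependent affine row.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H, C = 8, 6, 3
W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((C, H)), rng.standard_normal(C)

def logits(x):
    # z^(L)(x) for a toy FC -> ReLU -> FC network
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = rng.standard_normal(D)
c = 1                                     # class whose template we extract

# Template A[x]_{c,.} = d[z^(L)(x)]_c / dx (Section 5.2); finite differences
# stand in here for the backpropagation computation used in practice.
eps = 1e-6
template = np.array([(logits(x + eps * e)[c] - logits(x - eps * e)[c]) / (2 * eps)
                     for e in np.eye(D)])

# Signal-dependent bias b[x]_c = [z^(L)(x)]_c - <template, x>.
bias_c = logits(x)[c] - template @ x

# Analytic check: the template is row c of W2 diag(ReLU mask) W1, and the
# bias collects the propagated offsets, matching the affine form of (8).
mask = np.diag((W1 @ x + b1 > 0.0).astype(float))
assert np.allclose(template, (W2 @ mask @ W1)[c], atol=1e-5)
assert np.isclose(bias_c, (W2 @ mask @ b1 + b2)[c], atol=1e-5)
```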
Figure 1. Visualization of the matched filter templates generated by a CNN using ReLU activation and max-pooling. At top, we show the input image $x$ of the digit 0 from the MNIST dataset along with the three signal-dependent templates $A_{\mathrm{CNN}}[x](0, \cdot)$, $A_{\mathrm{CNN}}[x](3, \cdot)$, and $A_{\mathrm{CNN}}[x](7, \cdot)$. The inner products between $x$ and the three templates are 32.7, 27.7, 27.8, respectively (cf. (12)). Below, we repeat the same experiment with the input image $x$ of a truck from the CIFAR10 dataset along with the three signal-dependent templates corresponding to the classes truck, bird, and dog. The inner products between $x$ and the three templates are 30.6, 12.5, 16.2, respectively. (The final classification will involve not only these inner products but also the relevant biases and softmax transformation.)

5.3. Collinear Templates and Data Set Memorization

Under the matched filterbank interpretation of a DN developed in Section 5.1, the optimal template for an image $x$ of class $c$ is a scaled version of $x$ itself. But what are the optimal templates for the other (incorrect) classes? In an idealized setting, we can answer this question.

Proposition 5. Consider an idealized DN consisting of a composition of MASOs that has sufficient approximation power to span arbitrary MASO matrices $A[x_n]$ from (9) for any input $x_n$ from the training set. Train the DN to classify among $C$ classes using the training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ with normalized inputs $\|x_n\|_2 = 1 \;\forall n$ and the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}(y_n, f_\Theta(x_n))$ with the addition of the regularization constraint that $\sum_c \| [A[x_n]]_{c,\cdot} \|_2^2 < \alpha$ with $\alpha > 0$. At the global minimum of this constrained optimization problem, the rows of $A^\star[x_n]$ (the optimal templates) have the form

$[A^\star[x_n]]_{c,\cdot} = \begin{cases} \sqrt{\tfrac{(C-1)\,\alpha}{C}}\; x_n, & c = y_n \\ -\sqrt{\tfrac{\alpha}{C(C-1)}}\; x_n, & c \neq y_n. \end{cases}$  (12)

In short, the idealized CNN in the proposition will memorize a set of collinear templates whose bimodal outputs force the softmax output to a Dirac delta function (aka 1-hot representation) that peaks at the correct class. Figure 2 confirms this bimodal behavior on the MNIST and CIFAR10 datasets.

Figure 2. Empirical study of the implications of Proposition 5. We illustrate the bimodality of $\langle [A^\star[x_n]]_{c,\cdot}, x_n \rangle$ for the large CNN matched filterbank trained on (a) MNIST and (b) CIFAR10. Training used batch normalization and bias units. Results are similar when trained with bias and no batch normalization. The green histogram summarizes the output values for the correct class (top half of (12)), while the red histogram summarizes the output values for the incorrect classes (bottom half of (12)). The easier the classification problem (MNIST), the more bimodal the distribution.

6. New DNs with Orthogonal Templates

While a DN's signal-dependent matched filterbank (8) is optimized for classifying signals immersed in additive white Gaussian noise, such a statistical model is overly simplistic for most machine learning problems of interest. In practice, errors will arise not just from random noise but also from nuisance variations in the inputs such as arbitrary rotations, positions, and modalities of the objects of interest. The effects of these nuisances are only poorly approximated as Gaussian random errors.
Limited work has been done on filterbanks for classification in non-Gaussian noise; one promising direction involves using not matched but rather orthogonal templates (Eldar & Oppenheim, 2001). For a MASO DN's templates to be orthogonal for all inputs, it is necessary that the rows of the matrix $W^{(L)}$ in the final linear classifier layer be orthogonal. This weak constraint on the DN still enables the earlier layers to create a high-performance, class-agnostic, featurized representation (recall the discussion just below (9)).

To create orthogonal templates during learning, we simply add to the standard (potentially regularized) cross-entropy loss function $\mathcal{L}_{\mathrm{CE}}$ a term that penalizes non-zero off-diagonal entries in the matrix $W^{(L)} (W^{(L)})^T$, leading to the new loss with the additional penalty

$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \sum_{c_1 \neq c_2} \big\langle [W^{(L)}]_{c_1,\cdot},\, [W^{(L)}]_{c_2,\cdot} \big\rangle^2.$  (13)

The parameter $\lambda$ controls the tradeoff between cross-entropy minimization and orthogonality preservation. Conveniently, when minimizing (13) via backpropagation, the orthogonal rows of $W^{(L)}$ induce orthogonal backpropagation updates for the various classes.

We now empirically demonstrate that orthogonal templates lead to significantly improved classification performance. We conducted a range of experiments with three different conventional DN architectures (small CNN, large CNN, and ResNet4-4) trained on three different datasets (SVHN, CIFAR10, and CIFAR100). Each DN employed bias units, ReLU activations, and max-pooling, as well as batch normalization prior to each ReLU. The full experimental details are given in (Balestriero & Baraniuk, 2018). For learning, we used the Adam optimizer with an exponential learning rate decay. All inputs were centered to zero mean and scaled to a maximum value of one. No further preprocessing was performed, such as ZCA whitening (Nam et al., 2014).

We assessed how the classification performance of a given DN would change as we varied the orthogonality penalty $\lambda$ in (13). For each configuration of DN architecture, training dataset, learning rate, and penalty $\lambda$, we averaged over 15 runs to estimate the average performance and standard deviation. We report here on only the CIFAR100 with large CNN experiments. (See (Balestriero & Baraniuk, 2018) for detailed results for all three datasets and the other architectures. The trends for all three datasets are similar and are independent of the learning rate.)

Figure 3. Orthogonal templates significantly boost DN performance with no change to the architecture. (a) Classification performance of the large CNN trained on CIFAR100 for different values of the orthogonality penalty $\lambda$ in (13). We plot the average (black dots), standard deviation (gray shade), and maximum (blue dots) of the test set accuracy over 15 runs. (b, top) Training set error. The blue/black curves correspond to $\lambda = 0/1$. (b, bottom) Test set accuracy over the course of the learning.

The results for CIFAR100 in Figure 3 indicate that the benefits of the orthogonality penalty emerge distinctly as soon as $\lambda > 0$. In addition to improved final accuracy and generalization performance, we see that template orthogonality reduces the temptation of the DN to overfit. (This is especially visible in the examples in (Balestriero & Baraniuk, 2018).) One explanation is that the orthogonal weights $W^{(L)}$ positively impact not only the prediction but also the backpropagation via orthogonal gradient updates with respect to each output dimension's partial derivatives.
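A minimal sketch of the orthogonality penalty in (13), written here against a PyTorch-style training loop (our illustration; `model.classifier.weight`, `logits`, `labels`, and `lam` are hypothetical names standing in for the final-layer weight $W^{(L)}$, the network outputs, the targets, and $\lambda$).

```python
import torch

def template_orthogonality_penalty(W_last):
    """Sum of squared off-diagonal entries of W^(L) (W^(L))^T, cf. (13).

    W_last: (C, D_last) weight of the final linear classifier layer.
    """
    gram = W_last @ W_last.t()                      # (C, C) pairwise row inner products
    off_diag = gram - torch.diag(torch.diagonal(gram))
    return off_diag.pow(2).sum()

# Illustrative use inside a training step (names are placeholders):
#   loss = torch.nn.functional.cross_entropy(logits, labels) \
#          + lam * template_orthogonality_penalty(model.classifier.weight)
#   loss.backward()
```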
7. A DN's Intrinsic Multiscale Partition

Like any spline, it is the interplay between the (affine) spline mappings and the input space partition that works the magic in a MASO DN. Recall from Section 3 that a MASO has the attractive property that it implicitly partitions its input space as a function of its slope and offset parameters. The induced partition $\Omega$ opens up a new geometric avenue to study how a DN clusters and organizes signals in a hierarchical fashion.

7.1. Effect of the DN Operators on the Partition

A DN operator at level $\ell$ directly influences the partitioning of its input space $\mathbb{R}^{D^{(\ell-1)}}$ and indirectly influences the partitioning of the overall signal space $\mathbb{R}^D$. A ReLU activation operator splits each of its input dimensions into two half-spaces depending on the sign of the input in each dimension. This partitions $\mathbb{R}^{D^{(\ell-1)}}$ into a combinatorially large number (up to $2^{D^{(\ell)}}$) of regions. Following a fully connected or convolution operator with a ReLU simply rotates the partition in $\mathbb{R}^{D^{(\ell-1)}}$. A max-pooling operator also partitions $\mathbb{R}^{D^{(\ell-1)}}$ into a combinatorially large number (up to $(\#\mathcal{R})^{D^{(\ell)}}$) of regions, where $\#\mathcal{R}$ is the size of the pooling region.

This per-MASO partitioning of each layer's input space constructs an overall partitioning of the input signal space $\mathbb{R}^D$. As each MASO is applied, it subdivides the input space $\mathbb{R}^D$ into finer and finer partitions. The final partition corresponds to the intersection of all of the intermediate partitions, and hence we can encode the input in terms of the ordered collection of per-layer partition regions into which it falls. This overall process can be interpreted as a hierarchical vector quantization (VQ) of the training input signals $x_n$. There are thus many potential connections between DNs and optimal quantization, information theory, and clustering that we leave for future research. See (Balestriero & Baraniuk, 2018) for some early results.

Figure 4. Example of the partition-based distance between images in the CIFAR10 (top) and SVHN (bottom) datasets. In each image, we plot the target image $x$ at top-left and its 15 nearest neighbors in the partition-based distance at layer $\ell$ of a small CNN trained without batch normalization (see (Balestriero & Baraniuk, 2018) for the details). From left to right, the images display layers $\ell = 1, \dots, 6$, and we see that the images become further and further in Euclidean distance but closer and closer in featurized distance. In particular, the first-layer neighbors share similar colors and shapes (and thus are closer in Euclidean distance). Later-layer neighbors mainly come from the same class, independently of their color or shape.

7.2. Inferring a DN Layer's Intrinsic Partition

Unfortunately, there is no simple formula for the partition of the signal space. However, one can obtain the set of input signals $x_n$ that fall into the same partition region at each layer of a DN. At layer $\ell$, denote the index of the region selected by the input $x$ (recall (6)) by

$\big[ t^{(\ell)}(x) \big]_k = \arg\max_r \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)}(x) \big\rangle + [B^{(\ell)}]_{k,r}.$  (14)

Thus, $[t^{(\ell)}(x)]_k \in \{1, \dots, R^{(\ell)}\}$, with $R^{(\ell)}$ the number of partition regions in the layer's input space. Encoding the partition as an ordered collection of integers designating the active hyperplane parameters from (4), we can now visualize which inputs fall into the same or nearby partitions.
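As a small computational sketch (ours) of the partition code (14) for a ReLU layer: with $R = 2$, the argmax over $r$ reduces to the pre-activation sign pattern, and comparing two codes entrywise gives the layer-$\ell$ partition-based distance used in Section 7.3 below.

```python
import numpy as np

def relu_partition_code(W, b, z_prev):
    """Partition code t^(l)(x) for a ReLU layer, cf. (14).

    For ReLU (R = 2) the argmax over r is simply the sign pattern of the
    pre-activation: region 1 if [W z + b]_k <= 0, region 2 otherwise.
    """
    return (W @ z_prev + b > 0).astype(int)

def partition_distance(t1, t2):
    """Normalized disagreement between two codes (the distance of Sec. 7.3)."""
    return 1.0 - np.mean(t1 == t2)

rng = np.random.default_rng(3)
W, b = rng.standard_normal((16, 8)), rng.standard_normal(16)
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
d = partition_distance(relu_partition_code(W, b, x1),
                       relu_partition_code(W, b, x2))
print(f"layer-1 partition distance between x1 and x2: {d:.3f}")
```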
Due to the very large number of possible regions (up to $2^{D^{(\ell)}}$ for a ReLU at layer $\ell$) and the limited amount of training data, in general, many partition regions will be empty or contain only a single training data point.

7.3. A New Image Distance Based on the DN Partition

To validate the utility of the hierarchical intrinsic clustering induced by a DN, we define a new distance function between the signals $x_1$ and $x_2$ that quantifies the similarity of their position encodings $t^{(\ell)}(x_1)$ and $t^{(\ell)}(x_2)$ at layer $\ell$ via

$d\big( t^{(\ell)}(x_1), t^{(\ell)}(x_2) \big) = 1 - \frac{1}{D^{(\ell)}} \sum_{k=1}^{D^{(\ell)}} \mathbb{1}\big( [t^{(\ell)}(x_1)]_k = [t^{(\ell)}(x_2)]_k \big).$  (15)

For a ReLU MASO, this corresponds simply to counting how many entries of the layer inputs for $x_1$ and $x_2$ are positive or negative at the same positions. For a max-pooling MASO, this corresponds to counting how many argmax positions are the same in each patch for $x_1$ and $x_2$. Figure 4 provides a visualization of the nearest neighbors of a test image under this partition-based distance measure. Visual inspection of the figures highlights that, as we progress through the layers of the DN, similar images become closer in the new distance but further in Euclidean distance.

8. Conclusions

We have used the theory of splines to build a rigorous bridge between deep networks (DNs) and approximation theory. Our key finding is that, conditioned on the input signal, the output of a DN can be written as a simple affine transformation of the input. This links DNs directly to the classical theory of optimal classification via matched filters and provides insights into the positive effects of data memorization. There are many avenues for future work, including a more in-depth analysis of the hierarchical MASO partitioning, particularly from the viewpoint of vector quantization and K-means clustering, which are unsupervised learning techniques, and information theory. The spline viewpoint also could inspire the creation of new DN layers that have certain attractive partitioning or approximation capabilities. We have begun exploring some of these directions in (Balestriero & Baraniuk, 2018).$^6$

$^6$ This work was partially supported by ARO grant W911NF-15-1-0316, AFOSR grant FA9550-14-1-0088, ONR grants N00014-17-1-2551 and N00014-18-1-2571, DARPA grant G001534-7500, and a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047.

References

Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. arXiv preprint arXiv:1609.07152, 2016.

Arora, S., Bhaskara, A., Ge, R., and Ma, T. Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343, 2013.

Balestriero, R. and Baraniuk, R. A spline theory of deep networks. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 383–392, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/balestriero18b.html.

Bennett, J. and Botkin, M. Structural shape optimization with geometric description and adaptive mesh refinement. AIAA Journal, 23(3):458–464, 1985.

Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.

Cohen, N., Sharir, O., and Shashua, A. On the expressive power of deep learning: A tensor analysis. In Feldman, V., Rakhlin, A., and Shamir, O. (eds.), 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pp. 698–728, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
Eldar, Y. C. and Oppenheim, A. V. Orthogonal matched filter detection. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pp. 2837–2840. IEEE, 2001.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Hannah, L. A. and Dunson, D. B. Multivariate convex regression with adaptive partitioning. Journal of Machine Learning Research, 14(1):3261–3294, 2013.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Hecht-Nielsen, R. Theory of the backpropagation neural network. In Neural Networks for Perception, pp. 65–93. Elsevier, 1992.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Lu, H. and Kawaguchi, K. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.

Magnani, A. and Boyd, S. P. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.

Nam, W., Dollár, P., and Han, J. H. Local decorrelation for improved pedestrian detection. In Advances in Neural Information Processing Systems (NIPS), pp. 424–432, 2014.

Pal, S. K. and Mitra, S. Multilayer perceptron, fuzzy sets, and classification. IEEE Transactions on Neural Networks, 3(5):683–697, 1992.

Papyan, V., Romano, Y., and Elad, M. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18(83):1–52, 2017.

Patel, A. B., Nguyen, T., and Baraniuk, R. G. A probabilistic framework for deep learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

Powell, M. J. D. Approximation Theory and Methods. Cambridge University Press, 1981.

Rabiner, L. R. and Gold, B. Theory and Application of Digital Signal Processing. Prentice-Hall, 1975.

Rister, B. Piecewise convexity of artificial neural networks. arXiv preprint arXiv:1607.04917, 2016.

Schmidhuber, J. Discovering problem solutions with low Kolmogorov complexity and high generalization capability. In Machine Learning: Proceedings of the Twelfth International Conference. Citeseer, 1994.

Soatto, S. and Chiuso, A. Visual representations: Defining properties and deep approximations. In Proceedings of the International Conference on Learning Representations (ICLR'16), Nov. 2016.

Soudry, D. and Hoffer, E. Exponentially vanishing suboptimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Srivastava, R. K., Greff, K., and Schmidhuber, J. Training very deep networks. In Advances in Neural Information Processing Systems (NIPS), pp. 2377–2385, 2015.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Targ, S., Almeida, D., and Lyman, K. Resnet in Resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.

Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
Xu, Y. and Yin, W. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pp. 818–833, 2014.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.