# Convexified Convolutional Neural Networks

Yuchen Zhang¹, Percy Liang¹, Martin J. Wainwright²

¹Stanford University, CA, USA. ²University of California, Berkeley, CA, USA. Correspondence to: Yuchen Zhang.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We describe the class of convexified convolutional neural networks (CCNNs), which capture the parameter sharing of convolutional neural networks in a convex manner. By representing the nonlinear convolutional filters as vectors in a reproducing kernel Hilbert space, the CNN parameters can be represented in terms of a low-rank matrix, and the rank constraint can be relaxed so as to obtain a convex optimization problem. For learning two-layer convolutional neural networks, we prove that the generalization error obtained by a convexified CNN converges to that of the best possible CNN. For learning deeper networks, we train CCNNs in a layer-wise manner. Empirically, we find that CCNNs achieve performance competitive with or better than CNNs trained by backpropagation, SVMs, fully-connected neural networks, stacked denoising auto-encoders, and other baseline methods.

1. Introduction

Convolutional neural networks (CNNs) (LeCun et al., 1998) have proven successful across many tasks, including image classification (LeCun et al., 1998; Krizhevsky et al., 2012), face recognition (Lawrence et al., 1997), speech recognition (Hinton et al., 2012), text classification (Wang et al., 2012), and game playing (Mnih et al., 2015; Silver et al., 2016). There are two principal advantages of a CNN over a fully-connected neural network: (i) sparsity: each nonlinear convolutional filter acts only on a local patch of the input; and (ii) parameter sharing: the same filter is applied to each patch. However, as with most neural networks, the standard approach to training CNNs is based on solving a nonconvex optimization problem that is known to be NP-hard (Blum & Rivest, 1992). In practice, researchers use some flavor of stochastic gradient method, in which gradients are computed via backpropagation (Bottou, 1998). This approach has two drawbacks: (i) the rate of convergence, which is at best only to a local optimum, can be slow due to nonconvexity (for instance, see Fahlman (1988)), and (ii) its statistical properties are very difficult to understand, since the actual performance is determined by some combination of the CNN architecture and the optimization algorithm.

In this paper, with the goal of addressing these two challenges, we propose a new model class known as convexified convolutional neural networks (CCNNs). These models have two desirable features. First, training a CCNN corresponds to a convex optimization problem, which can be solved efficiently and optimally via a projected gradient algorithm. Second, the statistical properties of CCNN models can be studied in a precise and rigorous manner.

We obtain CCNNs by convexifying two-layer CNNs; doing so requires overcoming two challenges. First, the activation function of a CNN is nonlinear. In order to address this issue, we relax the class of CNN filters to a reproducing kernel Hilbert space (RKHS). This approach is inspired by the paper of Zhang et al. (2016a), which put forth a relaxation for fully-connected neural networks. Second, the parameter sharing induced by CNNs is crucial to their effectiveness and must be preserved.
We show that CNNs with RKHS filters can be parametrized by a low-rank matrix. Relaxing this low-rank constraint to a nuclear norm constraint leads to our final formulation of CCNNs.

On the theoretical front, we prove an oracle inequality on the generalization error achieved by our class of CCNNs, showing that it is upper bounded by the best possible performance achievable by a two-layer CNN given infinite data (a quantity to which we refer as the oracle risk), plus a model complexity term that decays to zero polynomially in the sample size. Our results suggest that the sample complexity for CCNNs is significantly lower than that of the convexified fully-connected neural network (Zhang et al., 2016a), highlighting the importance of parameter sharing. For models with more than one hidden layer, our theory does not apply, but we provide encouraging empirical results using a greedy layer-wise training heuristic. Finally, we apply CCNNs to the MNIST handwritten digit dataset as well as four variation datasets (Variations MNIST), and find that they achieve state-of-the-art accuracy.

Related work. With the empirical success of deep neural networks, there has been increasing interest in understanding their connection to convex optimization. Bengio et al. (2005) showed how to formulate neural network training as a convex optimization problem involving an infinite number of parameters. Aslan et al. (2013; 2014) proposed a method for learning multi-layer latent-variable models, and showed that for certain activation functions, the proposed method is a convex relaxation for learning fully-connected neural networks.

Past work has studied learning translation-invariant features without backpropagation. Mairal et al. (2014) present convolutional kernel networks: they propose a translation-invariant kernel whose feature mapping can be approximated by a composition of convolution, nonlinearity, and pooling operators, obtained through unsupervised learning. However, this method is not equipped with the optimality guarantees that we provide for CCNNs in this paper, even for learning one convolutional layer. The ScatNet method (Bruna & Mallat, 2013) uses translation- and deformation-invariant filters constructed by wavelet analysis; however, these filters are independent of the data, in contrast to CCNNs. Daniely et al. (2016) show that a randomly initialized CNN can extract features as powerful as kernel methods, but it is not clear how to provably improve the model from random initialization.

Notation. For any positive integer $n$, we use $[n]$ as shorthand for the discrete set $\{1, 2, \dots, n\}$. For a rectangular matrix $A$, let $\|A\|_*$ be its nuclear norm, $\|A\|_2$ be its spectral norm (i.e., maximal singular value), and $\|A\|_F$ be its Frobenius norm. We use $\ell_2(\mathbb{N})$ to denote the set of countably infinite-dimensional vectors $v = (v_1, v_2, \dots)$ such that $\sum_{\ell=1}^{\infty} v_\ell^2 < \infty$. For any vectors $u, v \in \ell_2(\mathbb{N})$, the inner product $\langle u, v \rangle := \sum_{\ell=1}^{\infty} u_\ell v_\ell$ and the $\ell_2$-norm $\|u\|_2 := \sqrt{\langle u, u \rangle}$ are well defined.

2. Background and problem setup

In this section, we formalize the class of convolutional neural networks to be learned and describe the associated nonconvex optimization problem.

2.1. Convolutional neural networks

At a high level, a two-layer CNN is a function that maps an input vector $x \in \mathbb{R}^{d_0}$ (e.g., an image) to an output vector in $\mathbb{R}^{d_2}$ (e.g., classification scores for $d_2$ classes). (Average pooling and multiple channels are also an integral part of CNNs, but these do not present any new technical challenges, so we defer these extensions to Section 4.)
This mapping is formed in the following manner. First, we extract a collection of $P$ vectors $\{z_p(x)\}_{p=1}^P$ from the full input vector $x$. Each vector $z_p(x) \in \mathbb{R}^{d_1}$ is referred to as a patch, and these patches may depend on overlapping components of $x$. Second, given some choice of activation function $\sigma : \mathbb{R} \to \mathbb{R}$ and a collection of weight vectors $\{w_j\}_{j=1}^r$ in $\mathbb{R}^{d_1}$, we define the functions

$$h_j(z) := \sigma(w_j^\top z) \quad \text{for each patch } z \in \mathbb{R}^{d_1}. \qquad (1)$$

Each function $h_j$ (for $j \in [r]$) is known as a filter; note that the same filters are applied to each patch, which corresponds to the parameter sharing of a CNN. Third, for each patch index $p \in [P]$, filter index $j \in [r]$, and output coordinate $k \in [d_2]$, we introduce a coefficient $\alpha_{k,j,p} \in \mathbb{R}$ that governs the contribution of the filter $h_j$ on patch $z_p(x)$ to the output $f_k(x)$. The final form of the CNN is given by $f(x) := (f_1(x), \dots, f_{d_2}(x))$, where the $k$th component is given by

$$f_k(x) = \sum_{j=1}^{r} \sum_{p=1}^{P} \alpha_{k,j,p}\, h_j(z_p(x)). \qquad (2)$$

Taking the patch functions $\{z_p\}_{p=1}^P$ and activation function $\sigma$ as fixed, the parameters of the CNN are the filter vectors $w := \{w_j \in \mathbb{R}^{d_1} : j \in [r]\}$ along with the collection of coefficient vectors $\alpha := \{\alpha_{k,j} \in \mathbb{R}^{P} : k \in [d_2], j \in [r]\}$.

We assume that all patch vectors $z_p(x) \in \mathbb{R}^{d_1}$ are contained in the unit $\ell_2$-ball. This assumption can be satisfied without loss of generality by normalization: multiplying every patch $z_p(x)$ by a constant $\gamma > 0$ and multiplying the filter vectors $w$ by $1/\gamma$ enforces the assumption without changing the output of the network. Given some positive radii $B_1$ and $B_2$, we consider the model class

$$\mathcal{F}_{\mathrm{cnn}}(B_1, B_2) := \Big\{ f \text{ of the form (2)} : \max_{j \in [r]} \|w_j\|_2 \le B_1 \ \text{ and } \ \max_{k \in [d_2],\, j \in [r]} \|\alpha_{k,j}\|_2 \le B_2 \Big\}. \qquad (3)$$

When the radii $(B_1, B_2)$ are clear from context, we adopt $\mathcal{F}_{\mathrm{cnn}}$ as a convenient shorthand.

2.2. Empirical risk minimization

Given an input-output pair $(x, y)$ and a CNN $f$, we let $\mathcal{L}(f(x); y)$ denote the loss incurred when the output $y$ is predicted via $f(x)$. We assume that the loss function $\mathcal{L}$ is convex and $L$-Lipschitz in its first argument for any value of its second argument. As a concrete example, for multiclass classification with $d_2$ classes, the output $y$ takes values in the discrete set $[d_2] = \{1, 2, \dots, d_2\}$, and given a vector $f(x) = (f_1(x), \dots, f_{d_2}(x)) \in \mathbb{R}^{d_2}$ of classification scores, the associated multiclass logistic loss for a pair $(x, y)$ is given by $\mathcal{L}(f(x); y) := -f_y(x) + \log\big(\sum_{y'=1}^{d_2} \exp(f_{y'}(x))\big)$.

Given $n$ training examples $\{(x_i, y_i)\}_{i=1}^n$, we would like to compute an empirical risk minimizer:

$$\widehat{f}_{\mathrm{cnn}} \in \arg\min_{f \in \mathcal{F}_{\mathrm{cnn}}} \sum_{i=1}^{n} \mathcal{L}(f(x_i); y_i). \qquad (4)$$

Recalling that functions $f \in \mathcal{F}_{\mathrm{cnn}}$ depend on the parameters $w$ and $\alpha$ in a highly nonlinear way (2), this optimization problem is nonconvex. As mentioned earlier, heuristics based on stochastic gradient methods are used in practice, which makes it challenging to gain a theoretical understanding of their behavior. Thus, in the next section, we describe a relaxation of the class $\mathcal{F}_{\mathrm{cnn}}$ for which empirical risk minimization is convex.
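To make the model in equation (2) concrete, the following NumPy sketch implements a two-layer CNN forward pass for a 1-D input. The patch extraction, the tanh activation, and all dimensions are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def extract_patches(x, patch_size, stride=1):
    """Illustrative patch map z_p(x): overlapping windows of a 1-D input,
    each rescaled to lie in the unit l2-ball as assumed in Section 2.1."""
    P = (len(x) - patch_size) // stride + 1
    Z = np.stack([x[p * stride : p * stride + patch_size] for p in range(P)])
    norms = np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)
    return Z / norms                              # shape (P, d1)

def cnn_forward(x, W, alpha, sigma=np.tanh, patch_size=5):
    """Two-layer CNN of equation (2): f_k(x) = sum_{j,p} alpha[k,j,p] * sigma(<w_j, z_p(x)>).
    W has shape (r, d1); alpha has shape (d2, r, P)."""
    Z = extract_patches(x, patch_size)            # (P, d1)
    H = sigma(Z @ W.T)                            # filter outputs h_j(z_p(x)), shape (P, r)
    return np.einsum('kjp,pj->k', alpha, H)       # outputs f_k(x), shape (d2,)

# toy usage: d0 = 28, patch size 5, r = 16 filters, d2 = 10 classes
x = np.random.randn(28)
P = 28 - 5 + 1
W = np.random.randn(16, 5)
alpha = np.random.randn(10, 16, P)
scores = cnn_forward(x, W, alpha)
```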
3. Convexifying CNNs

We now turn to the development of the class of convexified CNNs. We begin in Section 3.1 by illustrating the procedure for the special case of the linear activation function. Although the linear case is not of practical interest, it provides intuition for our more general convexification procedure, described in Section 3.2, which applies to nonlinear activation functions. In particular, we show how embedding the nonlinear problem into an appropriately chosen reproducing kernel Hilbert space (RKHS) allows us to again reduce to the linear setting.

3.1. Linear activation functions: low-rank relaxations

In order to develop intuition for our approach, let us begin by considering the simple case of the linear activation function $\sigma(t) = t$. In this case, the filter $h_j$ applied to the patch vector $z_p(x)$ outputs a Euclidean inner product of the form $h_j(z_p(x)) = \langle z_p(x), w_j \rangle$. For each $x \in \mathbb{R}^{d_0}$, we first define the $P \times d_1$-dimensional matrix

$$Z(x) := \begin{pmatrix} z_1(x)^\top \\ \vdots \\ z_P(x)^\top \end{pmatrix}. \qquad (5)$$

We also define the $P$-dimensional vector $\alpha_{k,j} := (\alpha_{k,j,1}, \dots, \alpha_{k,j,P})^\top$. With this notation, we can rewrite equation (2) for the $k$th output as

$$f_k(x) = \sum_{j=1}^{r} \sum_{p=1}^{P} \alpha_{k,j,p} \langle z_p(x), w_j \rangle = \sum_{j=1}^{r} \alpha_{k,j}^\top Z(x) w_j = \mathrm{tr}\Big( Z(x) \sum_{j=1}^{r} w_j \alpha_{k,j}^\top \Big) = \mathrm{tr}(Z(x) A_k), \qquad (6)$$

where in the final step we have defined the $d_1 \times P$-dimensional matrix $A_k := \sum_{j=1}^{r} w_j \alpha_{k,j}^\top$. Observe that $f_k$ now depends linearly on the matrix parameter $A_k$. Moreover, the matrix $A_k$ has rank at most $r$, due to the parameter sharing of CNNs. See Figure 1 for a graphical illustration of this model structure.

Figure 1. The $k$th output of a CNN, $f_k(x) \in \mathbb{R}$, can be expressed as the product between a matrix $Z(x) \in \mathbb{R}^{P \times d_1}$ whose rows are features of the input patches and a rank-$r$ matrix $A_k \in \mathbb{R}^{d_1 \times P}$, which is made up of the filter weights $\{w_j\}$ and coefficients $\{\alpha_{k,j,p}\}$. Due to the parameter sharing intrinsic to CNNs, the matrix $A_k$ inherits a low-rank structure, which can be encouraged via convex relaxation using the nuclear norm.

Letting $A := (A_1, \dots, A_{d_2})$ be a concatenation of these matrices across all $d_2$ output coordinates, we can then define a function $f_A : \mathbb{R}^{d_0} \to \mathbb{R}^{d_2}$ of the form

$$f_A(x) := \big( \mathrm{tr}(Z(x) A_1), \dots, \mathrm{tr}(Z(x) A_{d_2}) \big). \qquad (7)$$

Note that these functions are linearly parameterized in terms of the underlying matrix $A$. Our model class corresponds to a collection of such functions obtained by imposing certain constraints on the underlying matrix $A$: in particular, we define $\mathcal{F}_{\mathrm{cnn}}(B_1, B_2)$ to be the set of functions $f_A$ that satisfy:

(C1) $\max_{j \in [r]} \|w_j\|_2 \le B_1$ and $\max_{k \in [d_2],\, j \in [r]} \|\alpha_{k,j}\|_2 \le B_2$; and
(C2) $\mathrm{rank}(A) = r$.

This is simply an alternative formulation of our original class of CNNs defined in equation (3). Notice that if the filter weights $w_j$ are not shared across all patches, then the constraint (C1) still holds, but constraint (C2) no longer holds. Thus, the parameter sharing of CNNs is realized by the low-rank constraint (C2). The matrix $A$ of rank $r$ can be decomposed as $A = UV^\top$, where both $U$ and $V$ have $r$ columns. The column space of the matrix $A$ contains the convolution parameters $\{w_j\}$, and the row space of $A$ contains the output parameters $\{\alpha_{k,j}\}$.

The matrices satisfying constraints (C1) and (C2) form a nonconvex set. A standard convex relaxation is based on the nuclear norm $\|A\|_*$, corresponding to the sum of the singular values of $A$. It is straightforward to verify that any matrix $A$ satisfying the constraints (C1) and (C2) must have nuclear norm bounded as $\|A\|_* \le B_1 B_2 r \sqrt{d_2}$. Consequently, if we define the function class

$$\mathcal{F}_{\mathrm{ccnn}} := \big\{ f_A : \|A\|_* \le B_1 B_2 r \sqrt{d_2} \big\}, \qquad (8)$$

then we are guaranteed that $\mathcal{F}_{\mathrm{cnn}} \subseteq \mathcal{F}_{\mathrm{ccnn}}$. We propose to minimize the empirical risk (4) over $\mathcal{F}_{\mathrm{ccnn}}$ instead of $\mathcal{F}_{\mathrm{cnn}}$; doing so defines a convex optimization problem over this richer class of functions:

$$\widehat{f}_{\mathrm{ccnn}} := \arg\min_{f_A \in \mathcal{F}_{\mathrm{ccnn}}} \sum_{i=1}^{n} \mathcal{L}(f_A(x_i); y_i). \qquad (9)$$

In Section 3.3, we describe iterative algorithms that can be used to solve this convex problem in the more general setting of nonlinear activation functions.
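The reduction in equation (6) can be checked numerically. The sketch below, using arbitrary toy dimensions, computes the CNN output with a linear activation both directly and through the matrices $A_k = \sum_j w_j \alpha_{k,j}^\top$, and confirms that each $A_k$ has rank at most $r$.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d1, r, d2 = 24, 25, 4, 10                # toy sizes (d1 > r so the rank bound is nontrivial)

Z = rng.standard_normal((P, d1))            # Z(x): rows are the patches z_p(x)
W = rng.standard_normal((r, d1))            # filter vectors w_j
alpha = rng.standard_normal((d2, r, P))     # coefficients alpha_{k,j,p}

# direct CNN output with linear activation: f_k(x) = sum_{j,p} alpha_{k,j,p} <w_j, z_p(x)>
f_direct = np.einsum('kjp,pi,ji->k', alpha, Z, W)

# equivalent low-rank parameterization of equation (6):
# A_k = sum_j w_j alpha_{k,j}^T has shape (d1, P) and rank at most r, and f_k(x) = tr(Z(x) A_k)
A = np.einsum('ji,kjp->kip', W, alpha)      # stack of A_k, shape (d2, d1, P)
f_lowrank = np.array([np.trace(Z @ A[k]) for k in range(d2)])

assert np.allclose(f_direct, f_lowrank)
assert all(np.linalg.matrix_rank(A[k]) <= r for k in range(d2))
```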
3.2. Nonlinear activations: RKHS filters

For nonlinear activation functions $\sigma$, we relax the class of CNN filters to a reproducing kernel Hilbert space (RKHS). As we will show, this relaxation allows us to reduce the problem to the linear activation case. Let $\mathcal{K} : \mathbb{R}^{d_1} \times \mathbb{R}^{d_1} \to \mathbb{R}$ be a positive semidefinite kernel function. For particular choices of kernels (e.g., the Gaussian RBF kernel) and a sufficiently smooth activation function $\sigma$, we are able to show that the filter $h : z \mapsto \sigma(\langle w, z \rangle)$ is contained in the RKHS induced by the kernel function $\mathcal{K}$. See Section 3.4 for the choice of the kernel function and the activation function.

Let $S := \{z_p(x_i) : p \in [P], i \in [n]\}$ be the set of patches in the training dataset. The representer theorem then implies that for any patch $z_p(x_i) \in S$, the function value can be represented by

$$h(z_p(x_i)) = \sum_{(i', p') \in [n] \times [P]} c_{i', p'}\, \mathcal{K}\big(z_p(x_i), z_{p'}(x_{i'})\big) \qquad (10)$$

for some coefficients $\{c_{i',p'}\}_{(i',p') \in [n] \times [P]}$. Filters of the form (10) are members of the RKHS, because they are linear combinations of the basis functions $z \mapsto \mathcal{K}(z, z_{p'}(x_{i'}))$. Such filters are parametrized by a finite set of coefficients $\{c_{i',p'}\}_{(i',p') \in [n] \times [P]}$, which can be estimated via empirical risk minimization.

Let $K \in \mathbb{R}^{nP \times nP}$ be the symmetric kernel matrix whose rows and columns are indexed by the example-patch index pairs $(i, p) \in [n] \times [P]$; the entry at row $(i, p)$ and column $(i', p')$ is equal to $\mathcal{K}(z_p(x_i), z_{p'}(x_{i'}))$. So as to avoid re-deriving everything in the kernelized setting, we perform a reduction to the linear setting of Section 3.1. Consider a factorization $K = QQ^\top$ of the kernel matrix, where $Q \in \mathbb{R}^{nP \times m}$; one example is the Cholesky factorization with $m = nP$. We can interpret each row $Q_{(i,p)} \in \mathbb{R}^m$ as a feature vector in place of the original $z_p(x_i) \in \mathbb{R}^{d_1}$, and rewrite equation (10) as

$$h(z_p(x_i)) = \langle Q_{(i,p)}, w \rangle \quad \text{where} \quad w := \sum_{(i', p')} c_{i', p'} Q_{(i', p')}.$$

In order to learn the filter $h$, it suffices to learn the $m$-dimensional vector $w$. To do this, define patch matrices $Z(x_i) \in \mathbb{R}^{P \times m}$ for each $i \in [n]$ so that the $p$th row of $Z(x_i)$ is $Q_{(i,p)}$. Then the problem reduces to learning a linear filter with coefficient vector $w$. Carrying out all of Section 3.1, solving the ERM gives us a parameter matrix $A \in \mathbb{R}^{m \times P d_2}$. The only difference is that the $\ell_2$-norm constraint (C1) needs to be adapted to the norm of the RKHS. See Appendix B for details.

At test time, given a new input $x \in \mathbb{R}^{d_0}$, we compute a patch matrix $Z(x) \in \mathbb{R}^{P \times m}$ as follows: the $p$th row of this matrix is the feature vector for patch $p$, which is equal to $Q^{\dagger} v(z_p(x)) \in \mathbb{R}^m$, where $Q^{\dagger}$ denotes the pseudo-inverse of $Q$ and, for any patch $z$, the vector $v(z) \in \mathbb{R}^{nP}$ has $(i,p)$th coordinate equal to $\mathcal{K}(z, z_p(x_i))$. We note that if $x$ is an instance $x_i$ in the training set, then the vector $Q^{\dagger} v(z_p(x))$ is exactly equal to $Q_{(i,p)}$, so the mapping $Z(\cdot)$ applies to both training and testing. We can then compute the prediction $f_k(x) = \mathrm{tr}(Z(x) A_k)$ via equation (6). Note that we do not explicitly need to compute the filter values $h_j(z_p(x))$ in order to compute the output of the CCNN.
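As an illustration of this kernelization, the sketch below builds the kernel matrix over training patches, factors it as $K \approx QQ^\top$, forms the patch matrices $Z(x_i)$, and checks that the test-time feature map recovers the training rows. The Gaussian kernel, the eigendecomposition-based factorization, and the toy sizes are assumptions chosen for illustration, not the implementation used in the experiments.

```python
import numpy as np

def gaussian_kernel(Z1, Z2, gamma=1.0):
    """K(z, z') = exp(-gamma * ||z - z'||^2) for all pairs of rows of Z1 and Z2."""
    sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
n, P, d1 = 20, 6, 5                                         # toy sizes
patches = rng.standard_normal((n, P, d1))
patches /= np.linalg.norm(patches, axis=2, keepdims=True)   # unit-norm patches

S = patches.reshape(n * P, d1)                  # training patches, rows indexed by (i, p)
K = gaussian_kernel(S, S)                       # kernel matrix, shape (nP, nP)

# factor K ~= Q Q^T via a truncated eigendecomposition (a stand-in for Cholesky / Nystrom)
evals, evecs = np.linalg.eigh(K)
keep = evals > 1e-10
Q = evecs[:, keep] * np.sqrt(evals[keep])       # shape (nP, m)
m = Q.shape[1]

Z_train = Q.reshape(n, P, m)                    # Z(x_i): p-th row is Q_(i,p)

Q_pinv = np.linalg.pinv(Q)
def patch_matrix(x_patches):
    """Z(x) for a new input: p-th row is Q^+ v(z_p(x)), where v(z)_(i,p) = K(z, z_p(x_i))."""
    v = gaussian_kernel(S, x_patches)           # shape (nP, P_new)
    return (Q_pinv @ v).T                       # shape (P_new, m)

# for a training example the two constructions coincide, as noted in Section 3.2
assert np.allclose(patch_matrix(patches[3]), Z_train[3], atol=1e-6)
```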
Retrieving filters. When we learn multi-layer CCNNs (Section 4), we need to compute the filters $h_j$ explicitly in order to form the inputs to the next layer. Recall from Section 3.1 that the column space of the matrix $A$ corresponds to parameters of the convolutional layer, and the row space of $A$ corresponds to parameters of the output layer. Thus, once we obtain the parameter matrix $A$, we compute a rank-$r$ approximation $A \approx \widehat{U} \widehat{V}^\top$, and then set the $j$th filter $h_j$ to the mapping

$$z \mapsto \langle \widehat{U}_j,\, Q^{\dagger} v(z) \rangle \quad \text{for any patch } z \in \mathbb{R}^{d_1}, \qquad (11)$$

where $\widehat{U}_j \in \mathbb{R}^m$ is the $j$th column of the matrix $\widehat{U}$, and $Q^{\dagger} v(z)$ represents the feature vector for patch $z$. (If $z$ is a patch in the training set, namely $z = z_p(x_i)$, then $Q^{\dagger} v(z) = Q_{(i,p)}$.) The matrix $\widehat{V}$ encodes parameters of the output layer and thus does not appear in the filter expression (11). It is important to note that the filter retrieval is not unique, because the rank-$r$ approximation of the matrix $A$ is not unique. The heuristic we suggest is to form the singular value decomposition $A = U \Lambda V^\top$ and then define $\widehat{U}$ to be the first $r$ columns of $U$. When we apply all $r$ filters to all patches of an input $x \in \mathbb{R}^{d_0}$, the resulting output is $H(x) := \widehat{U}^\top Z(x)^\top$; this is an $r \times P$ matrix whose element at row $j$ and column $p$ is equal to $h_j(z_p(x))$.

Algorithm 1: Learning two-layer CCNNs

Input: Data $\{(x_i, y_i)\}_{i=1}^n$, kernel function $\mathcal{K}$, regularization parameter $R > 0$, number of filters $r$.

1. Construct a kernel matrix $K \in \mathbb{R}^{nP \times nP}$ such that the entry at row $(i, p)$ and column $(i', p')$ is equal to $\mathcal{K}(z_p(x_i), z_{p'}(x_{i'}))$. Compute a factorization $K = QQ^\top$ or an approximation $K \approx QQ^\top$, where $Q \in \mathbb{R}^{nP \times m}$.
2. For each $x_i$, construct the patch matrix $Z(x_i) \in \mathbb{R}^{P \times m}$ whose $p$th row is the $(i, p)$th row of $Q$, where $Z(\cdot)$ is defined in Section 3.2.
3. Solve the following optimization problem to obtain a matrix $\widehat{A} = (\widehat{A}_1, \dots, \widehat{A}_{d_2})$:
$$\widehat{A} \in \arg\min_{\|A\|_* \le R} \widetilde{L}(A) \quad \text{where} \quad \widetilde{L}(A) := \sum_{i=1}^{n} \mathcal{L}\big( \mathrm{tr}(Z(x_i) A_1), \dots, \mathrm{tr}(Z(x_i) A_{d_2});\, y_i \big). \qquad (12)$$
4. Compute a rank-$r$ approximation $\widehat{U} \widehat{V}^\top \approx \widehat{A}$, where $\widehat{U} \in \mathbb{R}^{m \times r}$ and $\widehat{V} \in \mathbb{R}^{P d_2 \times r}$.

Output: predictor $\widehat{f}_{\mathrm{ccnn}}(x) := \big( \mathrm{tr}(Z(x) \widehat{A}_1), \dots, \mathrm{tr}(Z(x) \widehat{A}_{d_2}) \big)$ and the convolutional layer output $H(x) := \widehat{U}^\top Z(x)^\top$.

3.3. Algorithm

The algorithm for learning a two-layer CCNN is summarized in Algorithm 1; it is a formalization of the steps described in Section 3.2. In order to solve the optimization problem (12), the simplest approach is projected gradient descent: at iteration $t$, using a step size $\eta_t > 0$, we form the new matrix $A_{t+1}$ from the previous iterate $A_t$ according to

$$A_{t+1} = \Pi_R\big( A_t - \eta_t \nabla_A \widetilde{L}(A_t) \big). \qquad (13)$$

Here $\nabla_A \widetilde{L}$ denotes the gradient of the objective function defined in (12), and $\Pi_R$ denotes the Euclidean projection onto the nuclear norm ball $\{A : \|A\|_* \le R\}$. This nuclear norm projection can be obtained by first computing the singular value decomposition of $A$ and then projecting the vector of singular values onto the $\ell_1$-ball. The latter projection step can be carried out efficiently by the algorithm of Duchi et al. (2008). There are other efficient optimization algorithms (Duchi et al., 2011; Xiao & Zhang, 2014) for solving problem (12). All of these algorithms can be executed in a stochastic fashion, so that each gradient step processes a mini-batch of examples.

The computational complexity of each iteration depends on the width $m$ of the matrix $Q$. Setting $m = nP$ allows us to solve the exact kernelized problem, but to improve computational efficiency we can use the Nyström approximation (Drineas & Mahoney, 2005) or random feature approximation (Rahimi & Recht, 2007); both are randomized methods that yield a tall-and-thin matrix $Q \in \mathbb{R}^{nP \times m}$ such that $K \approx QQ^\top$. Typically, the parameter $m$ is chosen to be much smaller than $nP$. In order to compute the matrix $Q$, the Nyström approximation takes $O(m^2 nP)$ time. The random feature approximation takes $O(m n P d_1)$ time, but can be improved to $O(m n P \log d_1)$ time using the fast Hadamard transform (Le et al., 2013). The time complexity of projected gradient descent also scales with $m$ rather than with $nP$.
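A minimal sketch of the update (13) is given below: a gradient step followed by a Euclidean projection onto the nuclear norm ball, computed via an SVD and a projection of the singular values onto the $\ell_1$-ball in the style of Duchi et al. (2008). The toy matrix, gradient, and step size are illustrative.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a nonnegative vector onto the l1-ball of the given radius."""
    if v.sum() <= radius:
        return v
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_nuclear_ball(A, radius):
    """Projection onto {A : ||A||_* <= radius}: SVD, then project the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(project_l1_ball(s, radius)) @ Vt

def projected_gradient_step(A, grad, step_size, radius):
    """One iterate of equation (13): A_{t+1} = Pi_R(A_t - eta_t * grad L(A_t))."""
    return project_nuclear_ball(A - step_size * grad, radius)

# toy usage on a random matrix and gradient
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 40))
grad = rng.standard_normal((30, 40))
A_next = projected_gradient_step(A, grad, step_size=0.1, radius=5.0)
assert np.linalg.svd(A_next, compute_uv=False).sum() <= 5.0 + 1e-8
```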
3.4. Theoretical results

In this section, we upper bound the generalization error of Algorithm 1, proving that it converges to the best possible generalization error of a CNN. We focus on the binary classification case where the output dimension is $d_2 = 1$. (The multiclass case can be treated by a standard one-versus-all reduction to the binary case.)

Learning a CCNN requires a kernel function $\mathcal{K}$. We consider kernel functions whose associated RKHS is rich enough to contain any function of the form $z \mapsto q(\langle w, z \rangle)$, where $q$ is an arbitrary polynomial function and $w \in \mathbb{R}^{d_1}$ is an arbitrary vector. As a concrete example, we consider the inverse polynomial kernel

$$\mathcal{K}(z, z') := \frac{1}{2 - \langle z, z' \rangle}, \qquad \|z\|_2 \le 1, \ \|z'\|_2 \le 1. \qquad (14)$$

This kernel was studied by Shalev-Shwartz et al. (2011) for learning halfspaces, and by Zhang et al. (2016a) for learning fully-connected neural networks. We also consider the Gaussian RBF kernel

$$\mathcal{K}(z, z') := e^{-\gamma \|z - z'\|_2^2}, \qquad \|z\|_2 = \|z'\|_2 = 1. \qquad (15)$$

As shown in Appendix A, the inverse polynomial kernel and the Gaussian kernel satisfy the above notion of richness. We focus on these two kernels for the theoretical analysis.

Let $\widehat{f}_{\mathrm{ccnn}}$ be the CCNN that minimizes the empirical risk (12) using one of the two kernels above. Our main theoretical result is that for suitably chosen activation functions, the generalization error of $\widehat{f}_{\mathrm{ccnn}}$ is comparable to that of the best CNN model. In particular, we consider the following types of activation functions $\sigma$:

(a) arbitrary polynomial functions (e.g., used by Chen & Manning (2014); Livni et al. (2014));
(b) the sinusoid activation function $\sigma(t) := \sin(t)$ (e.g., used by Sopena et al. (1999); Isa et al. (2010));
(c) the erf function $\sigma_{\mathrm{erf}}(t) := (2/\sqrt{\pi}) \int_0^t e^{-z^2}\, dz$, which is a close approximation to the sigmoid function (Zhang et al., 2016a);
(d) a smoothed hinge function $\sigma_{\mathrm{sh}}(t) := \int_{-\infty}^{t} \tfrac{1}{2}(\sigma_{\mathrm{erf}}(z) + 1)\, dz$, which is a close approximation to the ReLU function (Zhang et al., 2016a).

To understand how these activation functions pair with our choice of kernels, we consider polynomial expansions $\sigma(t) = \sum_{j=0}^{\infty} a_j t^j$; the smoothness of these functions is characterized by the rate at which the coefficients $\{a_j\}_{j=0}^{\infty}$ converge to zero. If $\sigma$ is a polynomial in category (a), then the richness of the RKHS guarantees that it contains the class of filters activated by $\sigma$. If $\sigma$ is a non-polynomial function in categories (b), (c), (d), then as Appendix A shows, the RKHS contains the filter only if the coefficients $\{a_j\}_{j=0}^{\infty}$ converge to zero quickly enough (the criterion depends on the concrete choice of kernel). Concretely, the inverse polynomial kernel captures all four categories of activations, so (a), (b), (c), and (d) are all referred to as valid activation functions for the inverse polynomial kernel. The Gaussian kernel induces a smaller RKHS, so only (a) and (b) are valid activation functions for the Gaussian kernel. In contrast, the sigmoid function and the ReLU function are not valid for either kernel, because their polynomial expansions fail to converge quickly enough; intuitively speaking, they are not smooth enough to be contained in the RKHS.
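For concreteness, the two kernels in equations (14) and (15) can be evaluated as follows; this is a small sketch in which the dimension and the choice $\gamma = 1$ are arbitrary.

```python
import numpy as np

def inverse_polynomial_kernel(z1, z2):
    """Equation (14): K(z, z') = 1 / (2 - <z, z'>), defined for ||z||_2 <= 1, ||z'||_2 <= 1."""
    assert np.linalg.norm(z1) <= 1 + 1e-12 and np.linalg.norm(z2) <= 1 + 1e-12
    return 1.0 / (2.0 - z1 @ z2)

def gaussian_rbf_kernel(z1, z2, gamma=1.0):
    """Equation (15): K(z, z') = exp(-gamma * ||z - z'||_2^2), for unit-norm z and z'."""
    return np.exp(-gamma * np.sum((z1 - z2) ** 2))

rng = np.random.default_rng(0)
z, zp = rng.standard_normal(5), rng.standard_normal(5)
z, zp = z / np.linalg.norm(z), zp / np.linalg.norm(zp)      # normalize to the unit sphere
print(inverse_polynomial_kernel(z, zp), gaussian_rbf_kernel(z, zp))
```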
We are now ready to state the main theoretical result. In the theorem statement, we use $K(X) \in \mathbb{R}^{P \times P}$ to denote the random kernel matrix obtained from an input vector $X \in \mathbb{R}^{d_0}$ drawn randomly from the population. More precisely, the $(p, q)$th entry of $K(X)$ is given by $\mathcal{K}(z_p(X), z_q(X))$.

Theorem 1. Assume that the loss function $\mathcal{L}(\cdot; y)$ is $L$-Lipschitz continuous for every $y \in [d_2]$ and that $\mathcal{K}$ is the inverse polynomial kernel or the Gaussian kernel. For any valid activation function $\sigma$, there is a constant $C_\sigma(B_1)$ such that by choosing the hyperparameter $R := C_\sigma(B_1) B_2 r$ in Algorithm 1, the expected generalization error is at most

$$\mathbb{E}_{X,Y}\big[\mathcal{L}(\widehat{f}_{\mathrm{ccnn}}(X); Y)\big] \le \inf_{f \in \mathcal{F}_{\mathrm{cnn}}} \mathbb{E}_{X,Y}\big[\mathcal{L}(f(X); Y)\big] + c\, L\, C_\sigma(B_1) B_2 r \sqrt{\frac{\log(nP)\, \mathbb{E}_X[\|K(X)\|_2]}{n}}, \qquad (16)$$

where $c > 0$ is a universal constant.

Proof sketch. The proof of Theorem 1 consists of two parts. First, we consider a larger function class that contains the class of CNNs. This function class is defined as

$$\mathcal{F}_{\mathrm{ccnn}} := \Big\{ x \mapsto \sum_{j=1}^{r} \sum_{p=1}^{P} \alpha_{j,p} h_j(z_p(x)) \ : \ r < \infty, \quad \sum_{j=1}^{r} \|\alpha_j\|_2 \|h_j\|_{\mathcal{H}} \le C_\sigma(B_1) B_2 d_2 \Big\}, \qquad (17)\text{–}(18)$$

where $\|\cdot\|_{\mathcal{H}}$ is the norm of the RKHS associated with the kernel. This new function class relaxes the class of CNNs in two ways: (1) the filters are relaxed to belong to the RKHS, and (2) the $\ell_2$-norm bounds on the weight vectors are replaced by a single constraint on $\|\alpha_j\|_2$ and $\|h_j\|_{\mathcal{H}}$. We prove the following property of the predictor $\widehat{f}_{\mathrm{ccnn}}$: it must be an empirical risk minimizer over $\mathcal{F}_{\mathrm{ccnn}}$. This property holds even though equations (17)–(18) define a nonparametric function class $\mathcal{F}_{\mathrm{ccnn}}$, while Algorithm 1 optimizes $\widehat{f}_{\mathrm{ccnn}}$ over a parametric function class.

Second, we characterize the Rademacher complexity of this new function class $\mathcal{F}_{\mathrm{ccnn}}$, proving an upper bound based on matrix concentration theory. Combining this bound with classical Rademacher complexity theory (Bartlett & Mendelson, 2003), we conclude that the generalization loss of $\widehat{f}_{\mathrm{ccnn}}$ converges to the least possible generalization error over $\mathcal{F}_{\mathrm{ccnn}}$. The latter loss is bounded by the generalization loss of CNNs (because $\mathcal{F}_{\mathrm{cnn}} \subseteq \mathcal{F}_{\mathrm{ccnn}}$), which establishes the theorem. See the full version of this paper (Zhang et al., 2016b) for a rigorous proof of Theorem 1.

Remark on activation functions. It is worth noting that the quantity $C_\sigma(B_1)$ depends on the activation function $\sigma$, and more precisely on the convergence rate of the polynomial expansion of $\sigma$. Appendix A shows that if $\sigma$ is a polynomial function of degree $\ell$, then $C_\sigma(B_1) = O(B_1^{\ell})$. If $\sigma$ is the sinusoid function, the erf function, or the smoothed hinge function, then the quantity $C_\sigma(B_1)$ is exponential in $B_1$. From an algorithmic perspective, we do not need to know the activation function in order to execute Algorithm 1. From a theoretical perspective, however, the choice of $\sigma$ matters in Theorem 1 for comparing $\widehat{f}_{\mathrm{ccnn}}$ with the best CNN, whose representation power is characterized by the choice of $\sigma$. Therefore, if a CNN with a low-degree polynomial $\sigma$ performs well on a given task, then the CCNN also enjoys correspondingly strong generalization. Empirically, this is actually borne out: in Section 5, we show that the quadratic activation function performs almost as well as the ReLU function for digit classification.

Remark on parameter sharing. In order to demonstrate the importance of parameter sharing, consider a CNN without parameter sharing, so that we have filter weights $w_{j,p}$ for each filter index $j$ and patch index $p$. With this change, the new CNN output (2) becomes $\sum_{j=1}^{r} \sum_{p=1}^{P} \alpha_{j,p}\, \sigma(w_{j,p}^\top z_p(x))$, where $\alpha_{j,p} \in \mathbb{R}$ and $w_{j,p} \in \mathbb{R}^{d_1}$.
Note that the hidden layer of this new network has $P$ times more parameters than that of the convolutional network with parameter sharing. These networks without parameter sharing can be learned by the recursive kernel method proposed by Zhang et al. (2016a). Their paper shows that under the norm constraints $\|w_{j,p}\|_2 \le B_1'$ and $\sum_{j=1}^{r} \sum_{p=1}^{P} |\alpha_{j,p}| \le B_2'$, the excess risk of the recursive kernel method is at most $O\big(L\, C_\sigma(B_1') B_2' \sqrt{K_{\max}/n}\big)$, where $K_{\max} := \max_{z : \|z\|_2 \le 1} \mathcal{K}(z, z)$ is the maximal value of the kernel function. Plugging in the norm constraints of the function class $\mathcal{F}_{\mathrm{cnn}}$, we have $B_1' = B_1$ and $B_2' = B_2 r \sqrt{P}$. Thus, the expected risk of the estimate $\widehat{f}$ is bounded by

$$\mathbb{E}_{X,Y}\big[\mathcal{L}(\widehat{f}(X); Y)\big] \le \inf_{f \in \mathcal{F}_{\mathrm{cnn}}} \mathbb{E}_{X,Y}\big[\mathcal{L}(f(X); Y)\big] + c\, L\, C_\sigma(B_1) B_2 r \sqrt{\frac{P K_{\max}}{n}}. \qquad (19)$$

Comparing this bound to Theorem 1, we see that (apart from logarithmic terms) they differ in the multiplicative factors $\sqrt{P K_{\max}}$ versus $\sqrt{\mathbb{E}[\|K(X)\|_2]}$. Since the matrix $K(X)$ is $P$-dimensional, we have

$$\|K(X)\|_2 \le P \max_{p \in [P],\, q \in [P]} |\mathcal{K}(z_p(X), z_q(X))| \le P K_{\max},$$

which shows that $\sqrt{P K_{\max}}$ is always at least as large as $\sqrt{\mathbb{E}[\|K(X)\|_2]}$. In general, the first factor can be up to $\sqrt{P}$ times greater, which implies that the sample complexity of the recursive kernel method is up to $P$ times greater than that of the CCNN. This difference is intuitive given that the recursive kernel method learns a model with $P$ times more parameters. Although comparing upper bounds does not rigorously show that one method is better than the other, it gives intuition for the importance of parameter sharing.

4. Learning multi-layer CCNNs

In this section, we describe a heuristic method for learning CNNs with more layers. The idea is to estimate the parameters of the convolutional layers incrementally from bottom to top. Before presenting the multi-layer algorithm, we present two extensions: average pooling and multi-channel inputs.

Algorithm 2: Learning multi-layer CCNNs

Input: Data $\{(x_i, y_i)\}_{i=1}^n$, kernel function $\mathcal{K}$, number of layers $m$, regularization parameters $R_1, \dots, R_m$, number of filters $r_1, \dots, r_m$.

Define $H_1(x) = x$. For each layer $s = 2, \dots, m$: train a two-layer network by Algorithm 1, taking $\{(H_{s-1}(x_i), y_i)\}_{i=1}^n$ as training examples and $R_s$, $r_s$ as parameters. Let $H_s$ be the output of the convolutional layer and $\widehat{f}_s$ be the predictor.

Output: Predictor $\widehat{f}_m$ and the top-layer output $H_m$.

Average pooling. Average pooling is a technique to reduce the output dimension of the convolutional layer from $P \times r$ to $P' \times r$ with $P' < P$. For the CCNN model, if we apply average pooling after the convolutional layer, then the $k$th output of the CCNN model becomes $\mathrm{tr}(G Z(x) A_k)$, where $G \in \mathbb{R}^{P' \times P}$ is the pooling matrix. Thus, performing a pooling operation requires only replacing every matrix $Z(x_i)$ in problem (12) by the pooled matrix $G Z(x_i)$. Note that the linearity of the CCNN allows us to effectively pool before convolution, even though for the CNN, pooling must be done after applying the nonlinear filters. The resulting ERM problem is still convex, and the number of parameters is reduced by a factor of $P/P'$.
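A small sketch of this pooling step is given below, assuming non-overlapping 1-D average pooling for simplicity (the experiments in Section 5 use 2x2 spatial pooling); the toy dimensions are arbitrary.

```python
import numpy as np

def average_pooling_matrix(P, pool_size):
    """Pooling matrix G of shape (P', P) that averages non-overlapping groups of patches;
    a simple 1-D stand-in for the 2x2 spatial pooling used in the experiments."""
    P_out = P // pool_size
    G = np.zeros((P_out, P))
    for q in range(P_out):
        G[q, q * pool_size:(q + 1) * pool_size] = 1.0 / pool_size
    return G

rng = np.random.default_rng(0)
P, m, d2, pool = 12, 8, 3, 2
Z = rng.standard_normal((P, m))                 # patch matrix Z(x)
G = average_pooling_matrix(P, pool)             # shape (P', P)

# with pooling, the k-th CCNN output is tr(G Z(x) A_k); A_k now has P' = P/pool columns,
# so the number of parameters shrinks by the pooling factor
A = rng.standard_normal((d2, m, P // pool))
f = np.array([np.trace(G @ Z @ A[k]) for k in range(d2)])

# equivalently, replace Z(x) by the pooled matrix G Z(x) in problem (12)
Z_pooled = G @ Z
f_check = np.array([np.trace(Z_pooled @ A[k]) for k in range(d2)])
assert np.allclose(f, f_check)
```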
Processing multi-channel inputs. If our input has $C$ channels (corresponding to RGB colors, for example), then the input becomes a matrix $x \in \mathbb{R}^{C \times d_0}$. The $c$th row of the matrix $x$, denoted by $x[c] \in \mathbb{R}^{d_0}$, is a vector representing the $c$th channel. We define the multi-channel patch vector as a concatenation of the patch vectors for each channel: $z_p(x) := (z_p(x[1]), \dots, z_p(x[C])) \in \mathbb{R}^{C d_1}$. We then construct the feature matrix $Z(x)$ using the concatenated patch vectors $\{z_p(x)\}_{p=1}^P$. From here, everything else in Algorithm 1 remains the same. We note that this approach learns a convex relaxation of filters of the form $\sigma\big(\sum_{c=1}^{C} \langle w_c, z_p(x[c]) \rangle\big)$, parametrized by the vectors $\{w_c\}_{c=1}^{C}$.

Multi-layer CCNN. Given these extensions, we are ready to present the algorithm for learning multi-layer CCNNs, summarized in Algorithm 2. For each layer $s$, we call Algorithm 1 using the output of the previous convolutional layers as input; note that this input consists of $r$ channels (one from each previous filter), so we must use the multi-channel extension. Algorithm 2 outputs a new convolutional layer along with a prediction function, which is kept only at the last layer. We optionally use average pooling after each successive layer to reduce the output dimension of the convolutional layers.
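The layer-wise procedure of Algorithm 2 can be summarized schematically as follows. Here `train_ccnn_layer` is a hypothetical stand-in for Algorithm 1, assumed to return the layer's predictor together with a function computing the convolutional-layer output $H$; it is not defined here.

```python
def train_multilayer_ccnn(X, y, kernel, radii, num_filters, train_ccnn_layer):
    """Greedy layer-wise training: layer s is trained on the output of layers 1..s-1.
    `radii` and `num_filters` hold (R_2, ..., R_m) and (r_2, ..., r_m)."""
    H = X                                        # H_1(x) = x
    predictor = None
    for R_s, r_s in zip(radii, num_filters):
        predictor, hidden_fn = train_ccnn_layer(H, y, kernel=kernel,
                                                radius=R_s, filters=r_s)
        H = hidden_fn(H)                         # H_s(x): input to the next layer (r_s channels)
    return predictor, H                          # keep only the last layer's predictor
```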
5. Experiments

In this section, we compare the CCNN approach with other methods on the MNIST dataset and more challenging variations (Variations MNIST), obtained by adding white noise (rand), random rotation (rot), random image background (img), or combining the last two (img+rot). For all datasets, we use 10,000 images for training, 2,000 images for validation, and 50,000 images for testing. This 10k/2k/50k partitioning is standard for the MNIST variations (Variations MNIST).

For the CCNN method and the baseline CNN method, we train two-layer and three-layer models respectively. The models with $k$ convolutional layers are denoted by CCNN-$k$ and CNN-$k$. Each convolutional layer is constructed on 5x5 patches with unit stride, followed by 2x2 average pooling. The first and second convolutional layers contain 16 and 32 filters, respectively. The loss function is the 10-class logistic loss. We use the Gaussian kernel for the CCNN. The feature matrix $Z(x)$ is constructed via random feature approximation (Rahimi & Recht, 2007) with dimension $m = 500$ for the first convolutional layer and $m = 1000$ for the second. Before training each CCNN layer, we preprocess the input vectors $z_p(x_i)$ using local contrast normalization and ZCA whitening (Coates et al., 2010). The convex optimization problem is solved by projected SGD with mini-batches of size 50. Code and reproducible experiments are available on the CodaLab platform at http://worksheets.codalab.org/worksheets/0x1468d91a878044fba86a5446f52aacde/.

As a baseline approach, the CNN models are activated by the ReLU function $\sigma(t) = \max\{0, t\}$ or the quadratic function $\sigma(t) = t^2$. We train them using mini-batch SGD. The input images are preprocessed by global contrast normalization and ZCA whitening (Srivastava et al., 2014).

We compare our method against several alternative baselines. The CCNN-1 model is compared against an SVM with the Gaussian RBF kernel (SVMrbf) and a fully-connected neural network with one hidden layer (NN-1). The CCNN-2 model is compared against methods that report state-of-the-art results on these datasets, including the translation-invariant RBM model (TIRBM) (Sohn & Lee, 2012), the stacked denoising auto-encoder with three hidden layers (SDAE-3) (Vincent et al., 2010), the ScatNet-2 model (Bruna & Mallat, 2013), and the PCANet-2 model (Chan et al., 2015).

Table 1. Classification error on the basic MNIST and its four variations. The best performance within each block is bolded. ReLU and Quad denote the ReLU and quadratic activation functions, respectively.

| Method | basic | rand | rot | img | img+rot |
|---|---|---|---|---|---|
| SVMrbf (Vincent et al., 2010) | 3.03% | 14.58% | **11.11%** | 22.61% | 55.18% |
| NN-1 (Vincent et al., 2010) | 4.69% | 20.04% | 18.11% | 27.41% | 62.16% |
| CNN-1 (ReLU) | 3.37% | 9.83% | 18.84% | 14.23% | 45.96% |
| CCNN-1 | **2.35%** | **8.13%** | 13.45% | **10.33%** | **42.90%** |
| TIRBM (Sohn & Lee, 2012) | - | - | **4.20%** | - | 35.50% |
| SDAE-3 (Vincent et al., 2010) | 2.84% | 10.30% | 9.53% | 16.68% | 43.76% |
| ScatNet-2 (Bruna & Mallat, 2013) | 1.27% | 12.30% | 7.48% | 18.40% | 50.48% |
| PCANet-2 (Chan et al., 2015) | **1.06%** | 6.19% | 7.37% | 10.95% | 35.48% |
| CNN-2 (ReLU) | 2.11% | 5.64% | 8.27% | 10.17% | 32.43% |
| CNN-2 (Quad) | 1.75% | 5.30% | 8.83% | 11.60% | 36.90% |
| CCNN-2 | 1.39% | **4.64%** | 6.91% | **7.44%** | **30.23%** |

Table 1 shows the classification errors on the test set. The models are grouped with respect to the number of layers that they contain. For models with one convolutional layer, the errors of CNN-1 are significantly lower than those of NN-1, highlighting the benefits of local filters and parameter sharing. The CCNN-1 model outperforms CNN-1 on all datasets. For models with two or more hidden layers, the CCNN-2 model outperforms CNN-2 on all datasets and is competitive with the state of the art. In particular, it achieves the best accuracy on the rand, img, and img+rot datasets, and is comparable to the state of the art on the remaining two datasets. Adding a third convolutional layer does not notably improve performance on these datasets.

In Section 3.4, we showed that if the activation function $\sigma$ is a polynomial function, then the CCNN (which does not depend on $\sigma$) requires lower sample complexity to match the performance of the best possible CNN using $\sigma$. More precisely, if $\sigma$ is a degree-$\ell$ polynomial, then $C_\sigma(B)$ in the upper bound is controlled by $O(B^{\ell})$. This motivates us to study the performance of low-degree polynomial activations. Table 1 shows that the CNN-2 model with a quadratic activation function achieves error rates comparable to those with a ReLU activation: CNN-2 (Quad) outperforms CNN-2 (ReLU) on the basic and rand datasets, and is only slightly worse on the rot and img datasets. Since the performance of the CCNN matches that of the best possible CNN, the good performance of the quadratic activation in part explains why the CCNN is also good.

Acknowledgements. MJW and YZ were partially supported by Office of Naval Research grant DOD ONR-N00014 and NSF grant NSF-DMS-1612948. PL and YZ were partially supported by the Microsoft Faculty Fellowship.

References

Aslan, Ozlem, Cheng, Hao, Zhang, Xinhua, and Schuurmans, Dale. Convex two-layer modeling. In Advances in Neural Information Processing Systems, pp. 2985–2993, 2013.

Aslan, Ozlem, Zhang, Xinhua, and Schuurmans, Dale. Convex deep learning via normalized kernels. In Advances in Neural Information Processing Systems, pp. 3275–3283, 2014.

Bartlett, Peter L and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.

Bengio, Yoshua, Roux, Nicolas L, Vincent, Pascal, Delalleau, Olivier, and Marcotte, Patrice. Convex neural networks. In Advances in Neural Information Processing Systems, pp. 123–130, 2005.

Blum, Avrim L and Rivest, Ronald L. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.

Bottou, Léon. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Bruna, Joan and Mallat, Stéphane. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
Chan, Tsung-Han, Jia, Kui, Gao, Shenghua, Lu, Jiwen, Zeng, Zinan, and Ma, Yi. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12):5017–5032, 2015.

Chen, Danqi and Manning, Christopher D. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1, pp. 740–750, 2014.

Coates, Adam, Lee, Honglak, and Ng, Andrew Y. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001(48109):2, 2010.

Daniely, Amit, Frostig, Roy, and Singer, Yoram. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. arXiv preprint arXiv:1602.05897, 2016.

Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. The Journal of Machine Learning Research, 6:2153–2175, 2005.

Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272–279. ACM, 2008.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Fahlman, Scott E. An empirical study of learning speed in back-propagation networks. Journal of Heuristics, 1988.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Isa, IS, Saad, Z, Omar, S, Osman, MK, Ahmad, KA, and Sakim, HA Mat. Suitable MLP network activation functions for breast cancer and thyroid disease detection. In 2010 Second International Conference on Computational Intelligence, Modelling and Simulation, pp. 39–44, 2010.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lawrence, Steve, Giles, C Lee, Tsoi, Ah Chung, and Back, Andrew D. Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98–113, 1997.

Le, Quoc, Sarlós, Tamás, and Smola, Alex. Fastfood: Approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.

Mairal, Julien, Koniusz, Piotr, Harchaoui, Zaid, and Schmid, Cordelia. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635, 2014.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.

Shalev-Shwartz, Shai, Shamir, Ohad, and Sridharan, Karthik. Learning kernel-based halfspaces with the 0-1 loss. SIAM Journal on Computing, 40(6):1623–1646, 2011.

Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Sohn, Kihyuk and Lee, Honglak. Learning invariant representations with local transformations. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1311–1318, 2012.

Sopena, Josep M, Romero, Enrique, and Alquézar, René. Neural networks with periodic and monotonic activation functions: a comparative study in classification problems. In ICANN 99, pp. 323–328, 1999.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Variations MNIST. Variations on the MNIST digits. http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations, 2007.

Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.

Wang, Tao, Wu, David J, Coates, Adam, and Ng, Andrew Y. End-to-end text recognition with convolutional neural networks. In 2012 21st International Conference on Pattern Recognition (ICPR), pp. 3304–3308. IEEE, 2012.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Zhang, Yuchen, Lee, Jason D, and Jordan, Michael I. ℓ1-regularized neural networks are improperly learnable in polynomial time. In Proceedings of the 33rd International Conference on Machine Learning, 2016a.

Zhang, Yuchen, Liang, Percy, and Wainwright, Martin J. Convexified convolutional neural networks. CoRR, abs/1609.01000, 2016b. URL http://arxiv.org/abs/1609.01000.