Learning Compact Neural Networks with Regularization

Samet Oymak (University of California, Riverside, CA, USA; work done at The Voleon Group, Berkeley, CA, USA)

Abstract. Proper regularization is critical for speeding up training, improving generalization performance, and learning compact models that are cost efficient. We propose and analyze regularized gradient descent algorithms for learning shallow neural networks. Our framework is general and covers weight sharing (convolutional networks), sparsity (network pruning), and low-rank constraints among others. We first introduce the covering dimension to quantify the complexity of the constraint set and provide insights on the generalization properties. Then, we show that the proposed algorithms become well behaved and local linear convergence occurs once the amount of data exceeds the covering dimension. Overall, our results demonstrate that near-optimal sample complexity is sufficient for efficient learning and illustrate how regularization can be beneficial for learning overparameterized networks.

1. Introduction

Deep neural networks (DNNs) find ubiquitous use in large scale machine learning systems. Applications include speech processing, computer vision, natural language processing, and reinforcement learning (Krizhevsky et al., 2012; Graves et al., 2013; Hinton et al., 2012; Silver et al., 2016). DNNs can be efficiently trained with first-order methods and provide state-of-the-art performance on important machine learning benchmarks such as ImageNet and TIMIT (Russakovsky et al., 2015; Graves et al., 2013). They also lie at the core of complex systems such as recommendation and ranking models and self-driving cars (Covington et al., 2016; Wang et al., 2015; Bojarski et al., 2016). The abundance of promising applications brings a need to understand the properties of deep learning models.

Recent literature shows a growing interest in the theoretical properties of complex neural network models. Significant questions of interest include efficient training of such models and their generalization abilities. Typically, neural nets are trained with first-order methods that are based on (stochastic) gradient descent; the variations include Adam, Adagrad, and variance reduction methods (Kingma & Ba, 2014; Duchi et al., 2011; Johnson & Zhang, 2013). The fact that SGD is highly parallelizable is often crucial for training large scale models. Consequently, there is a growing body of work that focuses on the theoretical understanding of gradient descent algorithms (Zhong et al., 2017b; Tian, 2017; Panigrahy et al., 2018; Soltanolkotabi et al., 2017; Ge et al., 2017; Janzamin et al., 2015) and the generalization properties of DNNs (Zhang et al., 2016; Hardt et al., 2015; Bartlett et al., 2017; Kawaguchi et al., 2017; Neyshabur et al., 2017).

In this work, we propose and analyze regularized gradient descent algorithms to provably learn compact neural networks that have a space-efficient representation. This is in contrast to the existing theory literature, where the focus is mostly on fully-connected neural networks (FNNs). Proper regularization is a critical tool for building models that are compact and that have better generalization properties; this is achieved by reducing the degrees of freedom of the model.
Sparsifying and quantizing neural networks lead to storage-efficient compact models that will be the building blocks of intelligent mobile devices (Han et al., 2015a;b; Courbariaux et al., 2016; Denton et al., 2014; Jin et al., 2016; Dong et al., 2017; Aghasi et al., 2017). The pruning idea has been around for many years (Hassibi & Stork, 1993; Cun et al., 1990); however, it gained renewed attention due to the growing size of state-of-the-art DNN models. Convolutional neural nets (CNNs) are also compact models that efficiently utilize their parameters by weight sharing (Krizhevsky et al., 2012).

We study neural network regularization and address both generalization and optimization problems with an emphasis on one-hidden-layer networks. We introduce a machinery to measure the impact of regularization, namely the covering dimension of the constraint set. We show that the covering dimension controls the generalization properties as well as the optimization landscape. Hence, regularization can have a substantial benefit over training unconstrained (e.g. fully-connected) models and can help with training over-parameterized networks.

Specifically, we consider networks that are parametrized as y = o^T σ(W x), where x ∈ R^p is the input data, W ∈ R^{h×p} is the weight matrix, o ∈ R^h is the output layer, and h ≤ p. We assume W ∈ C for some constraint set C. We provide insights on the generalization and optimization performance by studying the tradeoff between the constraint set and the amount of training data n as follows.

Generalization error: We study the Rademacher complexity and show that good generalization is achieved when the data size n is larger than the sum of the covering dimension of C and the number of hidden nodes h.

Regularized first order methods: We propose and analyze regularized gradient descent algorithms which incorporate the knowledge of C into the iterations. We show that the problem becomes well conditioned (around the ground truth parameters) once the data size exceeds the covering dimension of the constraint set. This implies local linear convergence of first order methods with near-optimal sample complexity. Recent results (as well as our experiments) indicate that it is not possible to do much better than this, as random initialization can get stuck at spurious local minima (Zhong et al., 2017b; Safran & Shamir, 2017).

Application to CNNs: We apply our results to CNNs and obtain improved global convergence guarantees when combined with the tensor initialization of (Zhong et al., 2017a). We also improve existing local convergence results on the unconstrained problem (compared to (Zhong et al., 2017b)).

1.1. Related Works

Our results on the optimization landscape are closely related to recent works on provably learning shallow neural nets (Zhong et al., 2017b; Tian, 2017; Soltanolkotabi, 2017; Panigrahy et al., 2018; Ge et al., 2017; Oymak & Soltanolkotabi, 2018; Zhong et al., 2017a; Safran & Shamir, 2017; Arora et al., 2014; Mei et al., 2016). (Janzamin et al., 2015) proposed tensor decomposition to learn shallow networks. (Tian, 2017) studies the gradient descent algorithm to train a model assuming access to the population gradient. (Soltanolkotabi et al., 2017) focuses on training shallow networks when they are over-parameterized and analyzes the global landscape for the quadratic loss. More recently, (Ge et al., 2017) shows global convergence of gradient descent by designing a new objective function instead of using the ℓ2-loss.
Our algorithmic results are closest to those of (Zhong et al., 2017b). Similar to us, the authors focus on learning the weights of a ground truth model where the input data is Gaussian. They propose a tensor based initialization followed by local gradient descent for learning a one-hidden-layer FNN. While we analyze a more general class of problems, when specialized to their setup, we improve their sample complexity and radius of convergence for local convergence. For instance, they need O(h²p) samples to learn an FNN whereas we require O(hp), which is proportional to the degrees of freedom of the weight matrix. A growing list of works (Brutzkus & Globerson, 2017; Oymak & Soltanolkotabi, 2018; Du et al., 2017b;a; Zhong et al., 2017a) investigate CNNs with a focus on non-overlapping structure. Unlike these, we formalize the CNN as a low-dimensional subspace constraint and show sample-optimal local convergence even with multiple kernels and overlapping structure. As discussed in Section 4, we also improve the global convergence bounds of (Zhong et al., 2017a).

Generalization properties of deep networks have recently attracted significant attention (Zhang et al., 2016; Hardt et al., 2015; Bartlett et al., 2017; Konstantinos et al., 2017). Our results are closest to (Bartlett et al., 2017; Neyshabur et al., 2017; Konstantinos et al., 2017), which study the problem in a learning theory framework. (Bartlett et al., 2017; Neyshabur et al., 2017) provide generalization bounds for deep FNNs based on the spectral norms of the individual layers. More recently, (Konstantinos et al., 2017) specializes such bounds to CNNs. Our result differs from these in two ways: first, our bound reflects the impact of regularization, and second, we avoid dependencies on the input data length by taking advantage of the Gaussian data model.

2. Problem Statement

Here, we describe the general problem formulation. Our aim is to learn neural networks that efficiently utilize their parameters by using gradient descent and proper regularization. For most of the discussion, the input/output relation for the pairs (y_i, x_i), i = 1, ..., n, is given by y_i = o^T σ(W x_i). Here o ∈ R^h is the vector that connects the hidden layer to the output layer and W ∈ R^{h×p} is the weight matrix that connects the input to the hidden layer. Assuming o is known, we are interested in learning W, which has hp degrees of freedom. The associated loss function for the regression problem is

L(W) = Σ_{i=1}^n (y_i − o^T σ(W x_i))².

Starting from an initial point W_0, gradient descent learns W using the iterations W_{i+1} = W_i − μ∇L(W_i), where μ is the learning rate. If we have a prior on W, such as sparse weights, this information can be incorporated by projecting onto the constraint set. Suppose W lies in a constraint set C and denote the projection onto C by P_C(·). Starting from an initial point W_0, the Projected Gradient Descent (PGD) algorithm is characterized by the iterations

W_{i+1} = P_C(W_i − μ∇L(W_i)).   (1)

Our goal will be to understand the impact of C on generalization as well as the properties of the PGD algorithm.
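To make iteration (1) concrete, here is a minimal sketch of the loss gradient and of a PGD loop, written by us in numpy for illustration; the softplus activation, the step size, and the planted sparse weights in the usage snippet are our own choices and not the paper's implementation. An identity projection recovers plain gradient descent; projections for the constraint sets of Section 2.1 are sketched after that section.

```python
import numpy as np

def softplus(z):
    # Smooth activation log(1 + exp(z)), computed stably.
    return np.logaddexp(0.0, z)

def d_softplus(z):
    # Derivative of softplus: the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(W, o, X, y):
    """Squared loss (1/n) sum_i (o^T softplus(W x_i) - y_i)^2 and its gradient in W.

    The 1/n scaling is our choice so that a fixed step size behaves similarly across n;
    the paper's loss is the plain sum.
    """
    n = X.shape[0]
    Z = X @ W.T                                   # (n, h) pre-activations W x_i
    r = softplus(Z) @ o - y                       # residuals o^T sigma(W x_i) - y_i
    loss = np.mean(r ** 2)
    # d/dW sum_i r_i^2 = 2 sum_i r_i (o * sigma'(W x_i)) x_i^T
    G = (2.0 / n) * ((r[:, None] * d_softplus(Z)) * o[None, :]).T @ X
    return loss, G

def pgd(W0, o, X, y, proj, lr=0.1, iters=300):
    """Iteration (1): gradient step on the loss followed by projection onto C."""
    W = W0.copy()
    for _ in range(iters):
        _, G = loss_and_grad(W, o, X, y)
        W = proj(W - lr * G)
    return W

# Usage sketch: planted s-sparse rows, Gaussian inputs, "good" initialization W0 = W* + Z.
rng = np.random.default_rng(0)
p, h, n, s = 40, 10, 500, 4
W_star = np.zeros((h, p))
for i in range(h):
    W_star[i, rng.choice(p, s, replace=False)] = rng.normal(0.0, np.sqrt(p / (h * s)), s)
o = np.ones(h)
X = rng.standard_normal((n, p))
y = softplus(X @ W_star.T) @ o
W0 = W_star + rng.normal(0.0, np.sqrt(1.0 / h), (h, p))
W_hat = pgd(W0, o, X, y, proj=lambda W: W)        # identity projection = plain GD
print("relative error:", np.linalg.norm(W_hat - W_star) / np.linalg.norm(W_star))
```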
2.1. Compact Models and Associated Regularizers

In order to learn parameter-efficient compact networks, practical approaches include weight sharing, weight pruning, and quantization, as explained below.

Convolutional model (weight sharing): Suppose we have a CNN with k kernels of width b. Each kernel is shifted and multiplied with length-b patches of the input data, i.e. the same kernel weights are used many times across the input. In Section 4, we formulate this as an FNN subject to a subspace constraint where the constraint C is a kb-dimensional subspace.

Sparsity: The weight matrix W has at most s nonzero weights out of hp entries.

Quantization: The weights are restricted to discrete values. In the extreme case, the entries of W are ±1.

Low-rank approximation: The weight matrix W obeys rank(W) ≤ r for some r ≤ h.

We also consider convex regularizers, which can yield a smoother optimization landscape (e.g. subspace, ℓ1). The convexified version of the sparsity constraint is ℓ1 regularization: parametrized by τ > 0, the constraint set is given by C = {W ∈ R^{h×p} : ‖W‖_1 ≤ τ}. Similarly, the convexified version of the low-rank projection is nuclear norm regularization, which corresponds to the ℓ1 norm of the singular values (Recht et al., 2010). Finally, we remark that our results can be specialized to the unconstrained problem, where the constraint set is C = R^{h×p} and PGD reduces to gradient descent.

Notation: Throughout the paper, h denotes the number of hidden nodes, p denotes the input dimension, and n denotes the number of data points unless otherwise stated. s_min(·) and s_max(·) return the minimum and maximum singular values of a matrix. κ(V) = s_max(V)/s_min(V) returns the condition number of the matrix V. Similarly, for a vector v, κ(v) = max_i |v_i| / min_i |v_i|. The Frobenius norm and the spectral norm are denoted by ‖·‖_F and ‖·‖, respectively. c, C > 0 denote absolute constants. N(0, I_d) denotes a vector in R^d with i.i.d. standard normal entries. var[·] returns the variance of a random variable.

3. Main Results

We first introduce covering numbers to quantify the impact of regularization.

3.1. Covering Dimension

If the constraint set C is a d-dimensional subspace (e.g. C = R^{h×p}), weight matrices W ∈ C have d degrees of freedom. This model applies to the convolutional and unconstrained problems. For subspaces, the dimension d is sufficient to capture the problem complexity and our main results apply when the data size n obeys n ≥ O(d). For other constraint types such as sparsity and matrix rank, we consider the constraint set given by C = {W ∈ R^{h×p} : R(W) ≤ τ}, where R is a regularizer function such as the ℓ1 norm. To capture the impact of the regularizer, we define the feasible ball, which is the set of feasible directions given by

T = B^{h×p} ∩ cl({αU ∈ R^{h×p} : W⋆ + U ∈ C, α ≥ 0}),   (2)

where cl(·) is the set closure and B^{h×p} is the unit Frobenius norm ball. For instance, when R is the ℓ0 norm, T is a subset of the (τ + ‖W⋆‖_0)-sparse weight matrices. The covering number is a standard way to measure the complexity of a set (Shalev-Shwartz & Ben-David, 2014). We will quantify the impact of regularization by using the covering dimension, which is defined as follows.

Definition 3.1 (Covering dimension). Let T ⊆ B^{h×p} and let C > 0 be an absolute constant. The covering dimension of T is denoted by cover(T) and is defined as follows. Suppose there exists a set S satisfying the following: T ⊆ conv(S), where conv(S) is the minimal closed convex set containing S; the radius of S obeys sup_{v∈S} ‖v‖_{ℓ2} ≤ C; and for all ε > 0, the ℓ2 ε-covering number of S obeys N_ε(S) ≤ (1 + B/ε)^s for some s ≥ 0 and B > 1. Then cover(T) ≤ s log B, and cover(T) is the infimum of all such upper bounds.

As illustrated in Table 1, the covering dimension captures the degrees of freedom for practical regularizers. This includes the sparsity, low-rank, and weight-sharing constraints discussed previously. Note that Table 1 is obtained by setting τ = R(W⋆). In practice, a good choice for τ can be found by using cross-validation.
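The constraint sets above enter the algorithm only through the projection P_C in iteration (1). The sketch below collects simple projections for the models of Section 2.1; the function names and the numpy implementation are ours. The ℓ0 and rank cases use hard thresholding and truncated SVD, which are exact Euclidean projections onto those (nonconvex) sets, and the ℓ1 case uses the standard sort-based projection onto the ℓ1 ball.

```python
import numpy as np

def proj_sparse(W, s):
    """Keep the s largest-magnitude entries of W (projection onto the set of s-sparse matrices)."""
    out = np.zeros_like(W)
    idx = np.argpartition(np.abs(W).ravel(), -s)[-s:]
    out.flat[idx] = W.flat[idx]
    return out

def proj_l1(W, tau):
    """Euclidean projection onto the l1 ball {W : ||W||_1 <= tau} via sorting/thresholding."""
    v = np.abs(W).ravel()
    if v.sum() <= tau:
        return W.copy()
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, u.size + 1) > (cssv - tau))[0][-1]
    theta = (cssv[rho] - tau) / (rho + 1.0)
    return np.sign(W) * np.maximum(np.abs(W) - theta, 0.0)

def proj_lowrank(W, r):
    """Best rank-r approximation of W via truncated SVD (projection onto {rank(W) <= r})."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S[r:] = 0.0
    return (U * S) @ Vt

def proj_subspace(W, basis):
    """Orthogonal projection of vec(W) onto the column span of `basis`
    (e.g. the convolutional subspace of Section 4); basis columns need not be orthonormal."""
    coeff, *_ = np.linalg.lstsq(basis, W.ravel(), rcond=None)
    return (basis @ coeff).reshape(W.shape)
```

Any of these can be passed as the projection step of a PGD loop such as the sketch in Section 2, e.g. proj = lambda W: proj_sparse(W, s * h) for a weight matrix with s nonzeros per row.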
It is also known that the performance of PGD is robust to the choice of τ (see Theorem 2.6 of (Oymak et al., 2017)). For unstructured constraint sets without a clean covering number, one can use stronger tools from geometric functional analysis. In the extended manuscript (Oymak, 2018), we discuss how more general complexity estimates can be achieved by using the Gaussian width of T (Chandrasekaran et al., 2012) and establish a connection to the covering dimension. Our results apply in the regime n ≳ cover(T), where n is the number of data points. This allows the sample size to be proportional to the degrees of freedom of the constraint space, implying data-efficient learning. Now that we can quantify the impact of regularization, we proceed to state our results.

Table 1: The list of low-dimensional models and the corresponding covering dimensions (up to a constant factor) for the constraint sets C = {W : R(W) ≤ R(W⋆)}. If the constraint is set membership, such as a subspace, R(W) = 0 inside the set and ∞ outside.

Constraint      | Weight matrix model       | cover(T)
None            | W ∈ R^{h×p}               | hp
Convolutional   | k kernels of width b      | kb
Sparsity (ℓ0)   | ≤ s nonzero weights       | s log(6hp/s)
ℓ1 norm         | ≤ s nonzero weights       | s log(6hp/s)
Subspace        | W ∈ S, dim(S) = k         | k
Matrix rank     | rank(W) ≤ r               | rh

3.2. Generalization Properties

To provide insights on generalization, we derive the Rademacher complexity of regularized neural networks with one hidden layer. To be consistent with the rest of the paper, we focus on the Gaussian data distribution. Rademacher complexity is a useful tool that measures the richness of a function class and allows us to give generalization bounds. Given sample size n, let r ∈ R^n be an i.i.d. Rademacher vector and let {x_i}_{i=1}^n be input data points that are i.i.d. with x_i ~ N(0, I_p). Finally, let F be the class of neural nets we analyze. Then, the Rademacher complexity of F with respect to Gaussian data with n samples is given by

Rad(F) = (1/n) E_{{x_i}}[ E_r[ sup_{f∈F} Σ_{i=1}^n r_i f(x_i) ] ].

The following lemma provides the result on the Rademacher complexity of networks with low covering numbers.

Lemma 3.2. Suppose the activation function σ is L-Lipschitz. Consider the class F of one-hidden-layer networks where each f ∈ F is parametrized by its input matrix W and output vector o, satisfies the input/output relation f_{o,W}(x) = o^T σ(W x), and obeys ‖o‖_{ℓ2} ≤ R_o, ‖W‖ ≤ R_W, and W ∈ C, where the ε-covering number of C obeys N_ε(C) ≤ (1 + B/ε)^s for some B > 0 and s ≥ 0. For Gaussian input data {x_i}_{i=1}^n ~ N(0, I_p)^n, the Rademacher complexity of the class F is bounded by

Rad(F) ≤ L R_o R_W O( √( ((h + s) log(n + p) + s log(1 + B/R_W)) / n ) ).

This result obeys the typical form of Rademacher complexity bounds; however, the ambient dimension hp is replaced by the total degrees of freedom, which is given in terms of h + s log B. Furthermore, unlike (Bartlett et al., 2017), we do not have a dependence on the length of the input data, which is E[‖x‖_{ℓ2}] ≈ √p. This is because we take advantage of the Gaussianity of the input data, which allows us to escape the worst-case analysis that suffers from E[‖x‖_{ℓ2}]. Combined with standard learning theory results (Shalev-Shwartz & Ben-David, 2014), this bound shows that empirical risk minimization achieves small generalization error as soon as n ≥ O(h + s log B) samples are available. Observe that the O(s) components of Rad(F) relate to the covering dimension of C and become dominant as soon as s ≥ h. We remark that typically B ≤ O(R_W). For instance, if C is a B-scaled unit ℓ2 (Frobenius) ball, then in order to ensure that it contains the R_W-scaled spectral ball {W : ‖W‖ ≤ R_W}, it suffices to pick B = √h R_W, since ‖W‖_F ≤ √h ‖W‖ for W ∈ R^{h×p}.
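As a rough illustration of Table 1 and of the sample-size message of Lemma 3.2, the helper below evaluates the covering-dimension expressions of Table 1 and the resulting guideline n on the order of h + cover(T); the constants, the use of the natural logarithm, and the example numbers (taken from the experiments of Section 5) are our choices.

```python
import math

def cover_dim(constraint, h, p, s=None, r=None, k=None, b=None):
    """Covering dimension (up to constants) for the constraint sets listed in Table 1."""
    if constraint == "none":
        return h * p
    if constraint == "convolutional":
        return k * b
    if constraint in ("sparsity", "l1"):
        return s * math.log(6 * h * p / s)
    if constraint == "subspace":
        return k
    if constraint == "rank":
        return r * h
    raise ValueError(constraint)

# Rough sample-size guideline from Lemma 3.2: n on the order of h + cover(T).
h, p, s = 20, 80, 8
for c in ("none", "sparsity", "rank"):
    d = cover_dim(c, h, p, s=s, r=2)
    print(f"{c:10s}  cover(T) ~ {d:8.1f}   suggested n ~ {h + d:8.1f}")
```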
Our main results are dedicated to the properties of the PGD algorithm, where the aim is to learn compact neural nets efficiently. We show that the Rademacher complexity bounds are highly consistent with the sample complexity requirements of PGD, which are governed by the local optimization landscape, in particular the positive definiteness of the Hessian matrix.

3.3. Local Convergence of Regularized Training

A crucial ingredient of the convergence analysis of PGD is the positive definiteness of the Hessian along the restricted directions dictated by T (Negahban & Wainwright, 2012). Denoting the Hessian at the ground truth W⋆ by H_{W⋆}, we investigate its restricted eigenvalue

H(W⋆, T) = inf_{v ∈ T} v^T H_{W⋆} v

in the regime h ≤ p. Positivity of H(W⋆, T) ensures that the problem is well conditioned around W⋆ and is locally convergent; however, the radius of convergence is not guaranteed to be large. Below, we present a summary of our results to provide basic insights about the actual technical contribution while avoiding the exact technical details.

Sample size: Whether the constraint set C is convex or nonconvex, we have H(W⋆, T) > 0 as soon as n ≥ O(cover(T)). This implies sample-optimal local convergence for subspace, sparsity, and rank constraints among others.

Radius of convergence: The basin of attraction for the PGD iterations (1) is an O(h^{-1}) neighborhood of W⋆, i.e. we require ‖W_0 − W⋆‖_F ≤ O(h^{-1} ‖W⋆‖_F). As there are more hidden nodes, we require a tighter initialization; however, the result is independent of p.

Rate of convergence: Within the radius of convergence, the weight matrix distance ‖W_i − W⋆‖²_F reduces by a factor of ρ = 1 − O(1 / (max{1, n^{-1} p log p} · h log p)) at each iteration, which implies linear convergence. As long as the problem is not extremely overparametrized (i.e. n ≳ p log p), and ignoring log terms, the rate of convergence is 1 − O(1/h). This implies accurate learning in O(h log ε^{-1}) steps given a target precision ε.

We are now in a place to state the main results. We place the following assumption on the activation function; it is a combination of smoothness and nonlinearity conditions.

Assumption 1 (Activation function). σ(·) obeys the following properties. First, σ(·) is differentiable, σ'(·) is an L-Lipschitz function, and |σ'(0)| ≤ L_0 for some L, L_0 > 0. Second, given g ~ N(0, 1) and θ > 0, define ζ(θ) as

ζ(θ) = min{ var[σ'(θg)] − E[σ'(θg) g]², var[σ'(θg) g] − E[σ'(θg) g²]² },

where the expectations are with respect to g; we require ζ(θ) > 0.

Example functions that satisfy the assumption are the sigmoid and hyperbolic tangent, the error function σ(x) = ∫_0^x exp(−t²) dt, the squared ReLU σ(x) = max{0, x}², and the softplus σ(x) = log(1 + exp(x)) (for sufficiently large θ). While the ReLU does not satisfy the criteria, a smooth ReLU approximation such as the softplus works. In general, the definition of ζ(·) reveals that our assumptions are satisfied if σ (i) is nonlinear, (ii) is increasing, (iii) has a bounded second derivative, and (iv) has a symmetric first derivative (see Theorem 5.3 of (Zhong et al., 2017b)). The quantity ζ(θ) is a measure of the nonlinearity of the activation function and will be used to control the minimum eigenvalue of the Hessian. A very similar quantity is used by (Zhong et al., 2017b), where they have an extra term which is not needed by us; this implies that our ζ(θ) is positive under milder conditions.
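As a quick numerical companion to Assumption 1, the snippet below estimates ζ(θ) by Monte Carlo for a few of the activations listed above; the sample size, the particular θ values, and the helper names are ours, and the estimates are only approximate. For the softplus, small θ may give values near or below zero, consistent with the "sufficiently large θ" qualifier above.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(2_000_000)   # samples of g ~ N(0, 1)

# Derivatives sigma'(.) of activations discussed in Assumption 1.
derivs = {
    "erf-type (sigma(x) = int_0^x exp(-t^2) dt)": lambda z: np.exp(-z ** 2),
    "softplus (sigma(x) = log(1 + exp(x)))":      lambda z: 1.0 / (1.0 + np.exp(-z)),
    "squared ReLU (sigma(x) = max(0, x)^2)":      lambda z: 2.0 * np.maximum(z, 0.0),
}

def zeta(dsigma, theta, g):
    """Monte Carlo estimate of zeta(theta) from Assumption 1."""
    a = dsigma(theta * g)
    t1 = np.var(a) - np.mean(a * g) ** 2
    t2 = np.var(a * g) - np.mean(a * g ** 2) ** 2
    return min(t1, t2)

for name, d in derivs.items():
    print(name, [round(float(zeta(d, th, g)), 4) for th in (0.5, 1.0, 2.0)])
```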
Definition 3.3 (Critical quantities). Θ will be used to lower bound H(W⋆, T) and Ω will control the learning rate. They are defined as

Θ = L² s²_max κ²(o) κ^{h+2}(W⋆) / ζ(s_min),   Ω = h (log p + L_0² / (L² s²_max)),

where s_min and s_max denote the minimum and maximum singular values of W⋆. Θ will be a measure of the conditioning of the problem; it is essentially unitless and obeys Θ ≥ 1 since L² s²_max ≥ ζ(s_min). Ω will be inversely related to the radius of convergence and the learning rate. If L_0 = 0 (e.g. a quadratic activation), Ω simplifies to h log p.

3.3.1. RESTRICTED EIGENVALUE OF HESSIAN

Our first result is a sample complexity bound for the restricted positive definiteness of the Hessian matrix at W⋆. It implies that the problem is locally well conditioned with minimal data (n ≳ cover(T)).

Theorem 3.4. Suppose C is a closed set that includes W⋆, h ≤ p, and let {x_i}_{i=1}^n be i.i.d. N(0, I_p) data points. Set ῡ = CΘ log²(CΘ) and suppose

n ≥ O((√cover(T) + t)² ῡ⁴).

With probability at least 1 − exp(−n/ῡ²) − 2 exp(−O(min{t√n, t²})), we have

H(W⋆, T) ≥ ζ(s_min) o²_min / (κ^{h+2}(W⋆) ῡ³).

Proof sketch. Given a data point x ∈ R^p, we define d(x) = o ⊙ σ'(W⋆ x) ∈ R^h, where σ' is the entrywise derivative and ⊙ is the entrywise product. Then define ρ(x) = d(x) ⊗ x ∈ R^{hp}, where ⊗ is the Kronecker product. At the ground truth, we have

H_{W⋆} = n^{-1} Σ_{i=1}^n ρ(x_i) ρ(x_i)^T.   (3)

After showing that Σ = E[ρ(x) ρ(x)^T] is positive definite, we need to ensure that

H(W⋆, T) ≳ s_min(Σ)   (4)

with a finite sample size n. This boils down to a high-dimensional statistics problem. We first show that ρ(x) has a subexponential tail for x ~ N(0, I_p), i.e. for all unit vectors v, P(|v^T(ρ(x) − E[ρ(x)])| ≥ t) ≤ 2 exp(−C_{σ,o,W⋆} t). Next, we prove a novel restricted eigenvalue result for random matrices with subexponential rows as in (3). This is done by combining Mendelson's small-ball argument with tools from generic chaining (Mendelson, 2014; Talagrand, 2006). Careful treatment is necessary to address the facts that ρ(x_i) is not zero-mean and that its tail depends on σ, o, and W⋆. Our final result ensures (4) with n ≥ O(cover(T)) samples, where O(·) hides the dependencies on the aforementioned variables.

Footnote: If W⋆ has orthogonal rows, the κ^{h+2}(W⋆) term can be removed from Θ by utilizing a more involved analysis that uses an alternative definition of ζ. The reader is referred to the supplementary material.

3.3.2. LINEAR CONVERGENCE OF PGD

Our next result utilizes Theorem 3.4 to characterize PGD in an O(1/h) neighborhood of the ground truth.

Theorem 3.5. Suppose C is a convex and closed set that includes W⋆ and let {x_i}_{i=1}^n be i.i.d. N(0, I_p) data points. Set ῡ = CΘ log²(CΘ) and suppose

n ≥ O((√cover(T) + t)² ῡ⁴).   (5)

Set q = max{1, 8 n^{-1} p log p}. Define the learning rate μ and the rate of convergence ρ as

μ = 1 / (6 q o²_max L² Ω),   ρ = 1 − 1 / (12 q ῡ⁴ Ω).   (6)

Given W (independent of the data points), consider the PGD iteration Ŵ = P_C(W − μ∇L(W)). Suppose W satisfies

‖W − W⋆‖_F ≤ O( ‖W⋆‖_F / (q ῡ⁴ √(hΩ log p)) ).

Then Ŵ obeys ‖Ŵ − W⋆‖²_F ≤ ρ ‖W − W⋆‖²_F with probability 1 − P, where

P = exp(−n/ῡ²) + 2 exp(−O(min{t√n, t²})) + 8(n exp(−p/2) + n p^{-10} + exp(−qn/(4p))).

3.3.3. CONVERGENCE TO THE GROUND TRUTH

Theorem 3.5 shows the improvement of a single iteration. Unfortunately, it requires the existence of fresh data points at every iteration. Once the initialization radius becomes tighter (O(p^{-1/2} h^{-1}) rather than O(h^{-1})), we can show a uniform convergence result that allows W to depend on the data points. Combining both, the following result shows that repeated applications of projected gradient descent converge in O(h log ε^{-1}) steps to an ε-neighborhood of W⋆ using O(cover(T) h log² p) samples. This is in contrast to related works (Zhong et al., 2017b;a), which always require fresh data points.
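Before moving to the multi-step guarantee, the following sketch makes the objects in the proof of Theorem 3.4 concrete: it forms the empirical matrix (3) from ρ(x) = d(x) ⊗ x and probes v^T H v over random directions supported on a sparse pattern standing in for T. It is a crude numerical probe with our own parameter choices (random sampling only upper bounds the true infimum); it is not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
h, p, s, n = 5, 30, 3, 2000

# Planted weights with s-sparse rows and an all-ones output layer o.
W_star = np.zeros((h, p))
for i in range(h):
    W_star[i, rng.choice(p, s, replace=False)] = rng.standard_normal(s)
o = np.ones(h)

def d_softplus(z):
    return 1.0 / (1.0 + np.exp(-z))

# Empirical matrix (3): H = (1/n) sum_i rho(x_i) rho(x_i)^T with rho(x) = d(x) kron x.
X = rng.standard_normal((n, p))
D = d_softplus(X @ W_star.T) * o                       # rows are d(x_i)^T, shape (n, h)
R = np.einsum("ni,nj->nij", D, X).reshape(n, h * p)    # rows are rho(x_i)^T
H = (R.T @ R) / n

print("smallest eigenvalue over all of R^{hp}:", np.linalg.eigvalsh(H).min())

# Crude probe of H(W_star, T): minimum of v^T H v over random unit directions supported on
# the planted sparsity pattern plus s extra random coordinates (a stand-in for the l0 cone).
support = np.abs(W_star.ravel()) > 0
on_support, off_support = np.where(support)[0], np.where(~support)[0]
vals = []
for _ in range(2000):
    sel = np.concatenate([on_support, rng.choice(off_support, s, replace=False)])
    v = np.zeros(h * p)
    v[sel] = rng.standard_normal(sel.size)
    v /= np.linalg.norm(v)
    vals.append(v @ H @ v)
print("min v^T H v over sampled sparse directions:", min(vals))
```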
Theorem 3.6. Consider the setup of Theorem 3.5. Let K = O(q ῡ⁴ Ω log p). Given Kn independent data points (where n obeys (5)), split the dataset into K equal batches. Starting from a point W_0 with ‖W_0 − W⋆‖_F ≤ O(‖W⋆‖_F / (q ῡ⁴ √(hΩ log p))), apply the PGD iterations

W_{i+1} = P_C(W_i − μ∇L_{min{i,K}}(W_i)),

where L_i is the loss function associated with the ith batch. With probability 1 − KP, all W_i for i ≥ 1 obey ‖W_i − W⋆‖²_F ≤ ρ^i ‖W_0 − W⋆‖²_F.

4. Application to Convolutional Neural Nets

We now illustrate how CNNs can be treated under our framework. To describe a shallow CNN, suppose we have k kernels {k_i}_{i=1}^k, each of width b. Set K = [k_1 ... k_k]^T ∈ R^{k×b}. Denote the stride size by s and set r = ⌊p/s⌋. To describe our argument, we introduce some notation specific to the convolutional model. Let v^ℓ ∈ R^b denote the ℓth subvector of v ∈ R^p corresponding to the entries from ℓs+1 to ℓs+b, for 0 ≤ ℓ ≤ r−1. Also, given b ∈ R^b, let v = map_ℓ(b) ∈ R^p be the vector obtained by mapping b to the ℓth subvector, i.e. v^j = b if j = ℓ and 0 otherwise. For each data point x_j, we consider its r subvectors {x_j^l}_{l=1}^r and filter each subvector with each of the kernels. Then, the input/output relation has the following form (with output layer weights o_{i,l}):

y_CNN(K, x_j) = Σ_{i=1}^k Σ_{l=1}^r o_{i,l} σ(k_i^T x_j^l).

Given labels {y_j}_{j=1}^n, the gradient of the ℓ2²-loss with respect to k_i and the jth sample takes the form

∇L_{CNN,j}(k_i) = Σ_{l=1}^r o_{i,l} (y_CNN(K, x_j) − y_j) σ'(k_i^T x_j^l) x_j^l.

We will show that a CNN can be transformed into an FNN combined with a subspace constraint. This allows us to apply Theorem 3.5 to CNNs, which yields near-optimal local convergence guarantees. We start by writing the convolutional model as a fully-connected network.

Definition 4.1 (Convolutional weight matrix structure). Set h = kr. Given kernels {k_i}_{i=1}^k, we construct the fully-connected weight matrix W = FC(K) ∈ R^{h×p} as follows: representing {1, ..., h} as the Cartesian product of {1, ..., k} and {1, ..., r}, define the h = kr rows {w_{i,l}} of the weight matrix W as w_{i,l} = map_l(k_i) for 1 ≤ i ≤ k and 1 ≤ l ≤ r. Finally, let C be the space of all convolutional weight matrices, defined as C = {FC(K) : {k_i}_{i=1}^k ⊂ R^b}.

This model yields a matrix W with a double structure: each row of W has at most b nonzero entries, and, for fixed i, the weight vectors {w_{i,l}}_{l=1}^r are just shifted copies of each other. This implies that the total degrees of freedom is the same as that of {k_i}_{i=1}^k and the convolutional constraint C is a kb-dimensional subspace. Next, given W = FC(K), observe the equality of the predictions, i.e.

y_FC(W, x_j) = Σ_{i=1}^k Σ_{l=1}^r o_{i,l} σ(w_{i,l}^T x_j) = y_CNN(K, x_j).

Similarly, for W = FC(K), one can also show the equality of the CNN gradient and the projected FNN gradient. Considering the CNN and FNN gradient iterations

K̂ = K − (μ/r) ∇L_CNN(K),   Ŵ = P_C(W − μ∇L_FC(W)),

we have the equality FC(K̂) = Ŵ. This relation yields the following corollary of Theorem 3.5.

Corollary 4.2. Let {x_j}_{j=1}^n be i.i.d. N(0, I_p) data points. Given k ground truth kernels K⋆ = [k⋆_1 ... k⋆_k]^T, generate the labels y_j = y_CNN(K⋆, x_j). Assume FC(K⋆) is full row rank and let Θ, Ω, ῡ, q, μ, ρ, P be the same as in Theorem 3.5, defined with respect to the matrix FC(K⋆). Suppose n ≥ O((√(kb) + t)² ῡ⁴) and consider the convolutional iteration K̂ = K − (μ/r)∇L_CNN(K). Suppose the initial point K = [k_1 ... k_k]^T satisfies ‖K − K⋆‖_F ≤ O(‖K⋆‖_F / (q ῡ⁴ √(hΩ log p))). Then, with probability 1 − P,

‖K̂ − K⋆‖²_F ≤ ρ ‖K − K⋆‖²_F.

This corollary can be combined with the results of (Zhong et al., 2017a) to obtain a globally convergent CNN learning algorithm using n ≥ O(poly(k, t, log p)) samples. In particular, for local convergence (Zhong et al., 2017a) needed n ≥ O(p) samples, whereas we show that there is no dependence on the data length p.
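A small sketch of Definition 4.1 and of the prediction equality y_FC(FC(K), x) = y_CNN(K, x) is given below; the no-padding patch count r = (p − b)/s + 1 from Section 5.2, the softplus activation, and all names are our choices rather than the paper's code.

```python
import numpy as np

def FC(K, p, stride):
    """Fully-connected weight matrix of Definition 4.1: rows are shifted copies of the kernels."""
    k, b = K.shape
    r = (p - b) // stride + 1                  # number of patches (no zero-padding, Section 5.2)
    W = np.zeros((k * r, p))
    for i in range(k):
        for l in range(r):
            W[i * r + l, l * stride: l * stride + b] = K[i]   # w_{i,l} = map_l(k_i)
    return W

def y_cnn(K, x, o, stride, act):
    k, b = K.shape
    r = (len(x) - b) // stride + 1
    patches = np.stack([x[l * stride: l * stride + b] for l in range(r)])   # (r, b)
    return float(np.sum(o.reshape(k, r) * act(K @ patches.T)))              # sum_{i,l} o_{i,l} act(k_i^T x^l)

def y_fc(W, x, o, act):
    return float(o @ act(W @ x))

rng = np.random.default_rng(0)
p, b, stride, k = 81, 15, 6, 4                 # the configuration used in Section 5.2
K = rng.standard_normal((k, b))
r = (p - b) // stride + 1
o = rng.standard_normal(k * r)                 # o_{i,l} stored with index i * r + l
x = rng.standard_normal(p)
act = lambda z: np.logaddexp(0.0, z)           # softplus

W = FC(K, p, stride)
print("subspace dimension kb =", k * b, ", rows of FC(K) =", W.shape[0])
print("prediction gap:", abs(y_fc(W, x, o, act) - y_cnn(K, x, o, stride, act)))
```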
5. Numerical Results

To support our theoretical findings, we present the numerical performance of sparsity and convolutional constraints for neural network training. We consider synthetic simulations where o is the all-ones vector and the weight matrix W⋆ ∈ R^{h×p} is sparse or corresponds to a CNN.

5.1. Sparsity Constraint

We generate W⋆ matrices with exactly s nonzero entries in each row, with the nonzero pattern distributed uniformly at random. Each entry of W⋆ is N(0, p/(hs)) to ensure E[‖W⋆ x‖²_{ℓ2}] = ‖x‖²_{ℓ2}. We set the learning rate to μ = 5; we verified that smaller learning rates lead to similar results with slower convergence. We declare the estimate Ŵ to be the output of the PGD algorithm after 2000 iterations. We consider two sets of simulations using ReLU activations.

Good initialization: We set W_0 = W⋆ + Z, where Z has i.i.d. N(0, 1/h) entries. Note that the noise Z satisfies E[‖Z‖²_F] = E[‖W⋆‖²_F].

Random initialization: We set W_0 = Z, where Z has i.i.d. N(0, 1/h) entries.

Figure 1: Experiments with good initialization W_0 = W⋆ + Z.
Figure 2: Experiments with random initialization W_0 = Z.

Each set of experiments considers three algorithms.
Unconstrained: Uses only gradient descent.
ℓ1-regularization: Projects W onto the ℓ1 ball scaled by the ℓ1 norm of W⋆.
ℓ0-regularization: Projects W onto the set of sh-sparse matrices.

For our experiments, we picked p = 80, h = 20 and s = p/10 = 8. For training, we use n data points, where n varies from 100 to 1000. The test error is obtained by averaging over n_test = 1000 independent data points. For each point in the plots, we average the outcomes of 20 random trials. The total degrees of freedom is the number of nonzeros, equal to sh = 160. Our theorems imply good estimation with O(sh log(p/s)) data points when the initialization is sufficiently close.

Figure 1 summarizes the outcome of the experiments with good initialization. Suppose y is the label and ŷ is the prediction. We define the (normalized) test and train losses as the ratio of empirical variances that approximates the population quantity var[y − ŷ]/var[y]. Centering (i.e. using the variance) eliminates the contribution of the trivial but large E[y] term due to the nonnegative ReLU outputs. First, we observe that ℓ1 is slightly better than the ℓ0 constraint; however, both approach 0 test loss when n ≥ 600. The unregularized model has significant test error for all 100 ≤ n ≤ 1000 while perfectly overfitting the training set for all n values.

We also consider the recovery of the ground truth W⋆. Since there is a permutation invariance (permuting the rows of W⋆ does not change the prediction), we define the correlation between W⋆ and Ŵ as

corr(W⋆, Ŵ) = (1/h) Σ_{i=1}^h max_{1 ≤ j ≤ h} ⟨w⋆_i, ŵ_j⟩ / (‖w⋆_i‖_{ℓ2} ‖ŵ_j‖_{ℓ2}),

where w⋆_i is the ith row of W⋆. In words, each row of W⋆ is matched to the most highly correlated row of Ŵ and the correlations are averaged over the h rows. Observe that, if Ŵ and W⋆ match up to a permutation, corr(W⋆, Ŵ) = 1. We see that corr(W⋆, Ŵ) approaches 1 once n ≥ 600, which is the moment the test error hits 0.
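A direct implementation of this permutation-invariant correlation metric is sketched below; it follows the formula above, with numpy and the function name being our choices.

```python
import numpy as np

def corr(W_star, W_hat, eps=1e-12):
    """Permutation-invariant row correlation between the ground truth and the estimate.

    Each row of W_star is matched to the most highly correlated row of W_hat and the
    resulting cosine similarities are averaged; the value is 1 when the two matrices
    agree up to a row permutation (and positive row scalings).
    """
    A = W_star / (np.linalg.norm(W_star, axis=1, keepdims=True) + eps)
    B = W_hat / (np.linalg.norm(W_hat, axis=1, keepdims=True) + eps)
    C = A @ B.T                      # C[i, j] = <w*_i, w^_j> / (||w*_i|| ||w^_j||)
    return float(C.max(axis=1).mean())

# Example: a row-permuted copy of W_star has correlation 1 (up to floating point error).
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 80))
print(corr(W, W[rng.permutation(20)]))
```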
Figure 2 summarizes the outcome of the experiments with random initialization. In this case, we vary n from 200 to 2000, but the rest of the setup is the same. We observe that, unlike the good initialization case, the ℓ0 test error and 1 − corr(W⋆, Ŵ) do not hit 0, and ℓ1 approaches 0 only at n = 2000. On the other hand, both metrics demonstrate the clear benefit of sparsity regularization. The performance gap between ℓ1 and ℓ0 is surprisingly high; however, it is consistent with Theorem 3.5, which only applies to convex regularizers. The performance difference between good and random initialization implies that initialization indeed plays a big role, not only for finding the ground truth solution W⋆ but also for achieving good test errors.

5.2. Convolutional Constraint

For the CNN experiment, we picked the following configuration. The problem parameters are input dimension p = 81, kernel width b = 15, stride s = 6, number of kernels k = 4, and learning rate μ = 1. We did not use zero-padding, hence r = (p − b)/s + 1 = 12. This implies kr = 48 hidden nodes for the fully-connected representation. The subspace dimension and the degrees of freedom is kb = 60. We generate the kernel entries i.i.d. N(0, p/(hb)) and the random matrix Z with i.i.d. N(0, p/(bk)) entries. The noise variance is chosen higher to ensure E[‖P_C(Z)‖²_F] = E[‖FC(K⋆)‖²_F], i.e. Z projected onto the convolutional space has the same variance as the kernel matrix. We compare three models.
Unconstrained model with W_0 = Z initialization: Uses only gradient descent.
CNN subspace constraint with W_0 = Z initialization: Weights are shared via CNN backpropagation.
CNN subspace constraint with W_0 = W⋆ + Z initialization.

Figure 3: Experiments for convolutional constraint.

Figure 3 illustrates the outcome of the CNN experiments. The unconstrained model barely makes it into the test loss figure due to its low signal-to-noise ratio. Focusing on the CNN constraints, we observe that good initialization greatly helps and quickly achieves 0 test error. However, random initialization has respectable test and correlation performance and gracefully improves as the data amount n increases.

6. Conclusions

In this work, we studied neural network regularization in order to reduce the storage cost and to improve generalization properties. We introduced the covering dimension to quantify the impact of regularization and the richness of the constraint set. We proposed projected gradient descent algorithms to efficiently learn compact neural networks and showed that, if initialized reasonably close, PGD linearly converges to the ground truth while requiring a minimal amount of training data. The sample complexity of the algorithm is governed by the covering dimension. We also specialized our results to convolutional neural nets and demonstrated how CNNs can be efficiently learned within our framework. Numerical experiments support the substantial benefit of regularization over training fully-connected neural nets.

Global convergence of projected gradient descent appears to be a more challenging problem. In Section 5, we observed that gradient descent with random initialization can get stuck at local minima. For fully-connected networks, this is a well-known issue and the best known global convergence results are based on tensor initialization (Zhong et al., 2017b; Safran & Shamir, 2017; Janzamin et al., 2015). Interesting future directions include developing data-efficient initialization algorithms that can take advantage of the network priors (weight sharing, sparsity, low rank) and studying the properties of PGD from random initialization.

References

Aghasi, A., Abdi, A., Nguyen, N., and Romberg, J. Net-trim: Convex pruning of deep neural networks with performance guarantee. In Advances in Neural Information Processing Systems, pp. 3180-3189, 2017.
Arora, S., Bhaskara, A., Ge, R., and Ma, T. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pp. 584-592, 2014.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241-6250, 2017.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Brutzkus, A. and Globerson, A. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.

Covington, P., Adams, J., and Sargin, E. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pp. 191-198. ACM, 2016.

Cun, Y. L., Denker, J. S., and Solla, S. A. Optimal brain damage. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems 2, pp. 598-605. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990. ISBN 1-55860-100-7. URL http://dl.acm.org/citation.cfm?id=109230.109298.

Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269-1277, 2014.

Dong, X., Chen, S., and Pan, S. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4860-4874, 2017.

Du, S. S., Lee, J. D., and Tian, Y. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.

Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Ge, R., Lee, J. D., and Ma, T. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649. IEEE, 2013.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015a.

Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135-1143, 2015b.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Hassibi, B. and Stork, D. G. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164-171, 1993.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

Jin, X., Yuan, X., Feng, J., and Yan, S. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Konstantinos, P., Davies, M., and Vandergheynst, P. PAC-Bayesian margin bounds for convolutional neural networks: Technical report. arXiv preprint arXiv:1801.00171, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Mei, S., Bai, Y., and Montanari, A. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Mendelson, S. Learning without concentration. arXiv preprint arXiv:1401.0304, 2014.

Negahban, S. and Wainwright, M. J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13(May):1665-1697, 2012.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

Oymak, S. Learning compact neural networks with regularization. arXiv preprint arXiv:1802.01223, 2018.

Oymak, S. and Soltanolkotabi, M. End-to-end learning of a convolutional neural network via deep tensor decomposition. arXiv preprint arXiv:1805.06523, 2018.

Oymak, S., Recht, B., and Soltanolkotabi, M. Sharp time-data tradeoffs for linear inverse problems. IEEE Transactions on Information Theory, 2017.

Panigrahy, R., Rahimi, A., Sachdeva, S., and Zhang, Q. Convergence results for neural networks via electrodynamics. In LIPIcs-Leibniz International Proceedings in Informatics, volume 94. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

Safran, I. and Shamir, O. Spurious local minima are common in two-layer ReLU neural networks. arXiv preprint arXiv:1712.08968, 2017.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Soltanolkotabi, M. Learning ReLUs via gradient descent. arXiv preprint arXiv:1705.04591, 2017.

Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

Talagrand, M. The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Science & Business Media, 2006.

Tian, Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.

Wang, H., Wang, N., and Yeung, D.-Y. Collaborative deep learning for recommender systems. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235-1244. ACM, 2015.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhong, K., Song, Z., and Dhillon, I. S. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017b.