# Learning One Convolutional Layer with Overlapping Patches

Surbhi Goel¹, Adam Klivans¹, Raghu Meka²

¹Department of Computer Science, University of Texas at Austin. ²Department of Computer Science, UCLA.

Abstract

We give the first provably efficient algorithm for learning a one hidden layer convolutional network with respect to a general class of (potentially overlapping) patches under mild conditions on the underlying distribution. We prove that our framework captures commonly used schemes from computer vision, including one-dimensional and two-dimensional "patch and stride" convolutions. Our algorithm Convotron is inspired by recent work applying isotonic regression to learning neural networks. Convotron uses a simple, iterative update rule that is stochastic in nature and tolerant to noise (it requires only that the conditional mean function is a one layer convolutional network, as opposed to the realizable setting). In contrast to gradient descent, Convotron requires no special initialization or learning-rate tuning to converge to the global optimum. We also point out that learning one hidden convolutional layer with respect to a Gaussian distribution and just one disjoint patch P (the other patches may be arbitrary) is easy in the following sense: Convotron can efficiently recover the hidden weight vector by updating only in the direction of P.

1. Introduction

Developing provably efficient algorithms for learning commonly used neural network architectures continues to be a core challenge in machine learning. The underlying difficulty arises from the highly non-convex nature of the optimization problems posed by neural networks. Obtaining provable guarantees for learning even very basic architectures remains open.

In this paper we consider a simple convolutional neural network with a single filter and overlapping patches followed by average pooling (Figure 1). More formally, for an input image $x$, we consider $k$ patches of size $r$ indicated by selection matrices $P_1, \ldots, P_k \in \{0,1\}^{r \times n}$, where each matrix has exactly one 1 in each row and at most one 1 in each column. The neural network is computed as
$$f_w(x) = \frac{1}{k}\sum_{i=1}^{k} \sigma(w^T P_i x),$$
where $\sigma$ is the activation function and $w \in \mathbb{R}^r$ is the weight vector corresponding to the convolution filter. We focus on ReLU and leaky ReLU activation functions.

Figure 1. Architecture of a convolutional network with one hidden layer and average pooling. Each purple rectangle corresponds to a patch.

1.1. Our Contributions

The main contribution of this paper is a simple, stochastic update algorithm Convotron (Algorithm 1) for provably learning the above convolutional architecture. The algorithm has the following properties:

- Works for general classes of overlapping patches and requires mild distributional conditions.
- Proper recovery of the unknown weight vector.
- Stochastic in nature with a gradient-like update step.
- Requires no special/random initialization scheme or tuning of the learning rate.
- Tolerates noise and succeeds in the probabilistic concept model of learning.
- Logarithmic convergence in $1/\epsilon$, the error parameter, in the realizable setting.
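For concreteness, here is a minimal NumPy sketch of the forward pass $f_w(x)$ defined above (an illustration under assumed helper names such as `f` and `leaky_relu`, not the authors' code). The explicit selection matrices mirror the notation; practical code would simply slice the input.

```python
import numpy as np

def leaky_relu(z, alpha=0.0):
    # alpha = 0 gives ReLU; alpha in (0, 1] gives leaky ReLU.
    return np.where(z >= 0, z, alpha * z)

def f(w, x, patches, alpha=0.0):
    # f_w(x) = (1/k) * sum_i sigma(w^T P_i x), average pooling over k patches.
    return sum(leaky_relu(w @ (P @ x), alpha) for P in patches) / len(patches)

# A 1D patch-and-stride structure: n = 6, patch size r = 3, stride d = 1, so k = 4.
n, r, d = 6, 3, 1
patches = []
for i in range((n - r) // d + 1):
    P = np.zeros((r, n))
    P[np.arange(r), i * d + np.arange(r)] = 1.0   # exactly one 1 per row
    patches.append(P)

w = np.random.randn(r)      # convolution filter
x = np.random.randn(n)      # input "image"
print(f(w, x, patches))
```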
This is the first efficient algorithm for learning general classes of overlapping patches (and the first algorithm for any class of patches that succeeds under mild distributional assumptions). Prior work has focused on analyzing SGD in the realizable/noiseless setting with the caveat of requiring either disjoint patches with Gaussian inputs (Brutzkus & Globerson, 2017; Du et al., 2017b) or technical conditions linking the underlying true parameters and the closeness of patches (Du et al., 2017a). In contrast, our conditions depend only on the patch structure itself and can be efficiently verified. Commonly used patch structures in computer vision applications such as 1D/2D grids satisfy our conditions. Additionally, we require only that the underlying distribution on samples is symmetric and induces a covariance matrix on the patches with polynomially bounded condition number.¹ All prior work handles only continuous distributions.

¹ Brutzkus and Globerson (2017) proved that the problem, even with disjoint patches, is NP-hard in general, and so some distributional assumption is needed for efficient learning.

Another major difference from prior work is that we give guarantees using purely empirical updates. That is, we do not require an assumption that we have access to exact quantities such as the population gradient of the loss function.

We further show that in the commonly studied setting of Gaussian inputs and non-overlapping patches, updating with respect to a single non-overlapping patch is sufficient to guarantee convergence. This indicates that the Gaussian/no-overlap assumption is quite strong.

1.2. Our Approach

Our approach is to exploit the monotonicity of the activation function instead of the strong convexity of the loss surface. We use ideas from isotonic regression and extend them in the context of convolutional networks. These ideas have been successful for learning generalized linear models (Kakade et al., 2011), improperly learning fully connected, depth-three neural networks (Goel & Klivans, 2017b), and learning graphical models (Klivans & Meka, 2017).

1.3. Related Work

It is known that in the worst case, learning even simple neural networks is computationally intractable. For example, in the non-realizable (agnostic) setting, learning a single ReLU (even for bounded distributions and unit norm hidden weight vectors) with respect to the square loss is as hard as learning sparse parity with noise (Goel et al., 2016), a notoriously difficult problem from computational learning theory. For learning one hidden layer convolutional networks, Brutzkus and Globerson (2017) proved that distribution-free recoverability of the unknown weight vector is NP-hard, even if we restrict to disjoint patch structures. As such, a major open question is to discover the mildest assumptions that lead to polynomial-time learnability for simple neural networks.

In this paper, we consider the very popular class of convolutional neural networks (for a summary of other recent approaches for learning more general architectures see Goel & Klivans (2017a)). For convolutional networks, all prior research has focused on analyzing conditions under which (Stochastic) Gradient Descent converges to the hidden weight vector in polynomial time.
Along these lines, Brutzkus and Globerson (2017) proved that with respect to the spherical Gaussian distribution and for disjoint (non-overlapping) patch structures, gradient descent recovers the weight vector in polynomial time. Zhong et al. (2017a) showed that gradient descent combined with tensor methods can recover one hidden layer involving multiple weight vectors, but still require a Gaussian distribution and non-overlapping patches. Du et al. (2017b) proved that gradient descent recovers a hidden weight vector involved in a type of two-layer convolutional network under the assumption that the distribution is a spherical Gaussian, the patches are disjoint, and the learner has access to the true population gradient of the loss function.

We specifically highlight the work of Du, Lee, and Tian (Du et al., 2017a), who proved that gradient descent recovers a hidden weight vector in a one-layer convolutional network under certain technical conditions that are more general than the Gaussian/no-overlap patch scenario. Their conditions involve a certain alignment of the unknown patch structure, the hidden weight vector, and the (continuous) marginal distribution. However, it is unclear which concrete patch-structure/distributional combinations their framework captures. We also note that all of the above results assume there is no noise; i.e., they work in the realizable setting.

Other related works analyzing gradient descent with respect to the Gaussian distribution (but for non-convolutional networks) include (Soltanolkotabi, 2017; Ge et al., 2017; Zhong et al., 2017b; Tian, 2016; Li & Yuan, 2017; Zhang et al., 2017).

In contrast, we consider an alternative to gradient descent, namely Convotron, that is based on isotonic regression. The exploration of alternative algorithms to gradient descent is a feature of our work, as it may lead to new algorithms for learning deeper networks.

2. Preliminaries

$\|\cdot\|$ denotes the $\ell_2$-norm for vectors and the spectral norm for matrices. The identity matrix is denoted by $I$. We denote the joint distribution over input-label pairs by $D$, with inputs drawn from $X$ and labels drawn from $Y$. The marginal distribution on the input is denoted by $D_X$ and the corresponding probability density function by $P_X$.

In this paper we consider a simple convolutional neural network with one hidden layer and average pooling. Given input $x \in \mathbb{R}^n$, the network computes $k$ patches of size $r$, where each patch's location is indicated by matrices $P_1, \ldots, P_k \in \{0,1\}^{r \times n}$. Each matrix $P_i$ has exactly one 1 in each row and at most one 1 in every column. As before, the neural network is computed as
$$f_w(x) = \frac{1}{k}\sum_{i=1}^{k} \sigma(w^T P_i x),$$
where $\sigma$ is the activation function and $w \in \mathbb{R}^r$ is the weight vector corresponding to the convolution filter.

We study the problem of learning the teacher network with true weight $w^*$ under the square loss from noisy labels; that is, we wish to find $w$ such that
$$L(w) := \mathbb{E}_{x \sim D_X}\big[(f_w(x) - f_{w^*}(x))^2\big] \le \epsilon.$$
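To illustrate this setup, the following sketch (our own illustration under assumed names such as `draw_example` and `square_loss`, with an arbitrary noise level; not the authors' code) draws labeled examples from the probabilistic concept model $y = f_{w^*}(x) + \xi$ with mean-zero noise and estimates $L(w)$ by sampling:

```python
import numpy as np

def f(w, x, patches, alpha=0.0):
    # f_w(x) = (1/k) sum_i sigma(w^T P_i x) with (leaky) ReLU activation.
    z = np.array([w @ (P @ x) for P in patches])
    return np.mean(np.where(z >= 0, z, alpha * z))

n, r, d = 6, 3, 1
patches = [np.eye(r, n, k=i * d) for i in range((n - r) // d + 1)]
w_star = np.random.randn(r)                     # unknown teacher filter

def draw_example():
    x = np.random.randn(n)                      # distribution symmetric about the origin
    xi = 0.1 * np.random.randn()                # E[xi | x] = 0, bounded fourth moment
    return x, f(w_star, x, patches) + xi        # probabilistic concept model label

def square_loss(w, num_samples=5000):
    sq = [(f(w, x, patches) - f(w_star, x, patches)) ** 2
          for x in (np.random.randn(n) for _ in range(num_samples))]
    return float(np.mean(sq))

x, y = draw_example()
print(y, square_loss(np.zeros(r)))              # a noisy label and the loss at w = 0
```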
Assumptions 1. We make the following assumptions:

(a) Learning Model: Probabilistic Concept Model (Kearns & Schapire, 1990); that is, for all $(x, y) \sim D$, $y = f_{w^*}(x) + \xi$ for some unknown $w^*$, where $\xi$ is noise with $\mathbb{E}[\xi \mid x] = 0$ and $\mathbb{E}[\xi^4 \mid x] \le \rho$ for some $\rho > 0$. Note that we do not require the noise to be independent of the instance.²

(b) Distribution: The marginal distribution $D_X$ on the input space is symmetric about the origin; that is, $P_X(x) = P_X(-x)$ for all $x$.

(c) Patch Structure: The minimum eigenvalue of $P_\Sigma := \sum_{i,j=1}^{k} P_i \Sigma P_j^T$, where $\Sigma = \mathbb{E}_{x \sim D_X}[xx^T]$, and the maximum eigenvalue of $P := \sum_{i,j=1}^{k} P_i P_j^T$ are polynomially bounded ($\lambda_{\min}(P_\Sigma)$ from below by an inverse polynomial, $\lambda_{\max}(P)$ from above by a polynomial in the relevant parameters).

(d) Activation Function: The activation function has the form
$$\sigma(x) = \begin{cases} x & \text{if } x \ge 0 \\ \alpha x & \text{otherwise} \end{cases}$$
for some constant $\alpha \in [0, 1]$.

² In the realizable setting, as in previous works, it is assumed that $\xi = 0$.

The distributional assumption includes common assumptions such as Gaussian inputs, but is far less restrictive. For example, we do not require the distribution to be continuous, nor do we require it to have identity covariance. In Section 4, we show that commonly used patch schemes from computer vision satisfy our patch requirements. The assumption on activation functions is satisfied by popular activations such as ReLU ($\alpha = 0$) and leaky ReLU ($\alpha > 0$).

2.1. Some Useful Properties

The activations we consider in this paper have the following useful property under the stated distributional assumption:

Lemma 1. For all $a, b \in \mathbb{R}^n$,
$$\mathbb{E}_{x \sim D_X}\big[\sigma(a^T x)(b^T x)\big] = \frac{1+\alpha}{2}\, \mathbb{E}_{x \sim D_X}\big[(a^T x)(b^T x)\big].$$

The loss function can be upper bounded by the squared $\ell_2$-distance between the weight vectors using the following lemma.

Lemma 2. For any $w$, we have $L(w) = \mathbb{E}_{x \sim D_X}\big[(f_{w^*}(x) - f_w(x))^2\big] \le \lambda_{\max}(\Sigma)\,\|w^* - w\|^2$.

Lemma 3. For all $w$ and $x$, $(f_{w^*}(x) - f_w(x))^2 \le \|w^* - w\|^2 \|x\|^2$.

The Gershgorin Circle Theorem, stated below, is useful for bounding the eigenvalues of matrices.

Theorem 1 ((Weisstein, 2003)). For an $n \times n$ matrix $A$, define $R_i := \sum_{j=1, j \ne i}^{n} |A_{i,j}|$. Each eigenvalue of $A$ must lie in at least one of the disks $\{z : |z - A_{i,i}| \le R_i\}$.

Note: The proofs of the lemmas in this section have been deferred to the Supplemental section.

3. The Convotron Algorithm

In this section we describe our main algorithm Convotron and give a proof of its correctness. Convotron is an iterative algorithm similar in flavor to SGD with a modified (aggressive) gradient update. Unlike SGD (Algorithm 3), Convotron comes with provable guarantees and also does not need a good initialization scheme for convergence. The following theorem describes the convergence rate of our algorithm:

Theorem 2. If Assumptions 1 are satisfied, then for
$$\eta = \Omega\!\left(\frac{\lambda_{\min}(P_\Sigma)}{k\,\lambda_{\max}(P)} \min\!\left(\frac{1}{\mathbb{E}_x[\|x\|^4]},\; \frac{\epsilon\delta\|w^*\|^2}{\sqrt{\rho\,\mathbb{E}_x[\|x\|^4]}}\right)\right)
\quad\text{and}\quad
T = O\!\left(\frac{k}{\eta\,\lambda_{\min}(P_\Sigma)} \log\frac{1}{\epsilon\delta}\right),$$
with probability $1-\delta$, the weight vector $w_{T+1}$ computed by Convotron satisfies $\|w_{T+1} - w^*\|^2 \le \epsilon\|w^*\|^2$.

Algorithm 1 Convotron
  Initialize $w_1 := 0 \in \mathbb{R}^r$.
  for $t = 1$ to $T$ do
    Draw $(x_t, y_t) \sim D$
    Let $G_t = (y_t - f_{w_t}(x_t)) \sum_{i=1}^{k} P_i x_t$
    Set $w_{t+1} = w_t + \eta G_t$
  end for
  Return $w_{T+1}$
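Algorithm 1 translates almost line by line into NumPy. The sketch below is our own illustration (helper names like `convotron` and `sample`, as well as the parameter choices in the usage example, are assumptions, not the authors' released code); the update keeps the sum $\sum_i P_i x_t$ exactly as in the pseudocode and starts from $w_1 = 0$:

```python
import numpy as np

def leaky_relu(z, alpha=0.0):
    return np.where(z >= 0, z, alpha * z)

def f(w, x, patches, alpha=0.0):
    # f_w(x) = (1/k) sum_i sigma(w^T P_i x)
    return sum(leaky_relu(w @ (P @ x), alpha) for P in patches) / len(patches)

def convotron(sample, patches, r, eta, T, alpha=0.0):
    w = np.zeros(r)                                  # no special initialization
    for _ in range(T):
        x, y = sample()                              # draw (x_t, y_t) ~ D
        g = (y - f(w, x, patches, alpha)) * sum(P @ x for P in patches)
        w = w + eta * g                              # gradient-like update, no sigma'
    return w

# Usage sketch: a small 1D patch-and-stride instance with Gaussian inputs and
# noiseless (realizable) labels; eta and T are illustrative, not tuned via Theorem 2.
n, r, d = 4, 2, 1
patches = [np.eye(r, n, k=i * d) for i in range((n - r) // d + 1)]
w_star = np.random.randn(r)

def sample():
    x = np.random.randn(n)
    return x, f(w_star, x, patches)

w_hat = convotron(sample, patches, r, eta=0.02, T=4000)
print(np.linalg.norm(w_hat - w_star))                # typically a small number
```

The only difference from the SGD step of Algorithm 3 is that the derivative factor $\sigma'(w_t^T P_i x_t)$ is dropped; this isotonic-regression-style modification is what the analysis exploits, and it is why no initialization scheme is needed. The same loop handles the noisy setting unchanged, since only $\mathbb{E}[y \mid x] = f_{w^*}(x)$ enters the analysis.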
Proof of Theorem 2. Define $S_t = \{(x_1, y_1), \ldots, (x_t, y_t)\}$. The dynamics of Convotron can be expressed as follows:
$$\mathbb{E}_{x_t, y_t}\big[\|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 \,\big|\, S_{t-1}\big] = 2\eta\, \mathbb{E}_{x_t, y_t}\big[(w^* - w_t)^T G_t \,\big|\, S_{t-1}\big] - \eta^2\, \mathbb{E}_{x_t, y_t}\big[\|G_t\|^2 \,\big|\, S_{t-1}\big].$$
We need to bound the right-hand side of the above equation. We have
$$\begin{aligned}
\mathbb{E}_{x_t, y_t}\big[(w^* - w_t)^T G_t \,\big|\, S_{t-1}\big]
&= \mathbb{E}_{x_t, \xi_t}\Big[\big(f_{w^*}(x_t) + \xi_t - f_{w_t}(x_t)\big)\,(w^* - w_t)^T \textstyle\sum_{j=1}^{k} P_j x_t \,\Big|\, S_{t-1}\Big] \\
&= \mathbb{E}_{x_t}\Big[\big(f_{w^*}(x_t) - f_{w_t}(x_t)\big)\,(w^* - w_t)^T \textstyle\sum_{j=1}^{k} P_j x_t \,\Big|\, S_{t-1}\Big] && (1) \\
&= \frac{1}{k}\sum_{1 \le i, j \le k} \mathbb{E}_{x_t}\Big[\big(\sigma(w^{*T} P_i x_t) - \sigma(w_t^T P_i x_t)\big)\big((w^{*T} - w_t^T) P_j x_t\big) \,\Big|\, S_{t-1}\Big] && (2) \\
&= \frac{1+\alpha}{2k}\sum_{1 \le i, j \le k} \mathbb{E}_{x_t}\Big[\big((w^{*T} - w_t^T) P_i x_t\big)\big((w^{*T} - w_t^T) P_j x_t\big) \,\Big|\, S_{t-1}\Big] && (3) \\
&= \frac{1+\alpha}{2k}\, (w^* - w_t)^T \Big(\sum_{1 \le i, j \le k} P_i \Sigma P_j^T\Big)(w^* - w_t)
= \frac{1+\alpha}{2k}\, (w^* - w_t)^T P_\Sigma (w^* - w_t) \\
&\ge \frac{1+\alpha}{2k}\, \lambda_{\min}(P_\Sigma)\, \|w^* - w_t\|^2. && (4)
\end{aligned}$$
Here (1) follows using linearity of expectation and the fact that $\mathbb{E}[\xi_t \mid x_t] = 0$, (3) follows from Lemma 1, and (4) follows from observing that $P_\Sigma$ is symmetric, thus for all $x$, $x^T P_\Sigma x \ge \lambda_{\min}(P_\Sigma)\|x\|^2$.

Now we bound the variance of $G_t$; recall that $\mathbb{E}[\xi_t \mid x_t] = 0$. We have
$$\begin{aligned}
\mathbb{E}_{x_t, y_t}\big[\|G_t\|^2 \,\big|\, S_{t-1}\big]
&= \mathbb{E}_{x_t, y_t}\Big[(y_t - f_{w_t}(x_t))^2 \big\|\textstyle\sum_{i=1}^{k} P_i x_t\big\|^2 \,\Big|\, S_{t-1}\Big] \\
&\le \lambda_{\max}(P)\, \mathbb{E}_{x_t, y_t}\big[(y_t - f_{w_t}(x_t))^2 \|x_t\|^2 \,\big|\, S_{t-1}\big] && (5) \\
&= \lambda_{\max}(P)\, \mathbb{E}_{x_t, \xi_t}\big[(f_{w^*}(x_t) + \xi_t - f_{w_t}(x_t))^2 \|x_t\|^2 \,\big|\, S_{t-1}\big] \\
&= \lambda_{\max}(P)\Big(\mathbb{E}_{x_t}\big[(f_{w^*}(x_t) - f_{w_t}(x_t))^2 \|x_t\|^2 \,\big|\, S_{t-1}\big] + \mathbb{E}_{x_t, \xi_t}\big[\xi_t^2 \|x_t\|^2\big]\Big) && (6) \\
&\le \lambda_{\max}(P)\Big(\mathbb{E}_{x_t}[\|x_t\|^4]\, \|w^* - w_t\|^2 + \sqrt{\rho\, \mathbb{E}_{x_t}[\|x_t\|^4]}\Big). && (7)
\end{aligned}$$
(5) follows from observing that $\|\sum_{i=1}^{k} P_i x\|^2 \le \lambda_{\max}(P)\|x\|^2$ for all $x$, (6) follows from $\mathbb{E}[\xi \mid x] = 0$, and (7) follows from applying Lemma 3 and bounding $\mathbb{E}_{x_t, \xi_t}[\xi_t^2 \|x_t\|^2]$ using the Cauchy-Schwarz inequality.

Combining the above equations and taking expectation over $S_{t-1}$, we get
$$\mathbb{E}_{S_t}\big[\|w_{t+1} - w^*\|^2\big] \le (1 - 3\eta\beta + \eta^2\gamma)\, \mathbb{E}_{S_{t-1}}\big[\|w_t - w^*\|^2\big] + \eta^2 B$$
for $\beta = \frac{1+\alpha}{3k}\lambda_{\min}(P_\Sigma)$, $\gamma = \lambda_{\max}(P)\,\mathbb{E}_x[\|x\|^4]$ and $B = \lambda_{\max}(P)\sqrt{\rho\,\mathbb{E}_x[\|x\|^4]}$. We set $\eta = \beta \min\!\left(\frac{1}{\gamma}, \frac{\epsilon\delta\|w^*\|^2}{B}\right)$ and break the analysis into two cases.

Case 1: $\mathbb{E}_{S_{t-1}}[\|w_t - w^*\|^2] > \frac{\eta B}{\beta}$. This implies that $\mathbb{E}_{S_t}[\|w_{t+1} - w^*\|^2] \le (1 - \eta\beta)\,\mathbb{E}_{S_{t-1}}[\|w_t - w^*\|^2]$ (since $\eta \le \beta/\gamma$ gives $\eta^2\gamma \le \eta\beta$, and the case condition gives $\eta^2 B < \eta\beta\,\mathbb{E}_{S_{t-1}}[\|w_t - w^*\|^2]$).

Case 2: $\mathbb{E}_{S_{t-1}}[\|w_t - w^*\|^2] \le \frac{\eta B}{\beta} \le \epsilon\delta\|w^*\|^2$. Observe that once Case 2 is satisfied, we have $\mathbb{E}_{S_t}[\|w_{t+1} - w^*\|^2] \le (1 - 2\eta\beta)\frac{\eta B}{\beta} + \eta^2 B \le \frac{\eta B}{\beta}$. Hence, for every iteration after $t$, Case 2 continues to hold.

This implies that at each iteration, either $\mathbb{E}_{S_{t-1}}[\|w_t - w^*\|^2]$ decreases by a factor $(1 - \eta\beta)$ or it is at most $\epsilon\delta\|w^*\|^2$. Thus, if Case 2 is never reached up to iteration $T$ (i.e., Case 1 holds throughout), then
$$\mathbb{E}_{S_T}\big[\|w_{T+1} - w^*\|^2\big] \le (1 - \eta\beta)^T \|w^*\|^2 \le e^{-\eta\beta T}\|w^*\|^2,$$
since at initialization $\|w_1 - w^*\| = \|w^*\|$. Setting $T = O\!\left(\frac{1}{\eta\beta}\log\frac{1}{\epsilon\delta}\right)$, in either case $\mathbb{E}_{S_T}[\|w_{T+1} - w^*\|^2] \le \epsilon\delta\|w^*\|^2$, and using Markov's inequality, with probability $1 - \delta$ over the choice of $S_T$, $\|w_{T+1} - w^*\|^2 \le \epsilon\|w^*\|^2$.

By using Lemma 2, we can get a bound $L(w_{T+1}) \le \epsilon\|w^*\|^2$ by appropriately scaling $\epsilon$.

3.1. Convotron in the Realizable Case

For the realizable (no noise) setting, that is, when for all $(x, y) \sim D$ we have $y = f_{w^*}(x)$ for some unknown $w^*$, Convotron achieves faster convergence rates.

Corollary 1. If Assumptions 1 are satisfied with the learning model restricted to the realizable case, then for suitably chosen $\eta$, after
$$T = O\!\left(\frac{k^2\,\lambda_{\max}(P)\,\mathbb{E}_x[\|x\|^4]}{\lambda_{\min}(P_\Sigma)^2} \log\frac{1}{\epsilon\delta}\right)$$
iterations, with probability $1 - \delta$, the weight vector $w_{T+1}$ computed by Convotron satisfies $\|w_{T+1} - w^*\|^2 \le \epsilon\|w^*\|^2$.

Proof. Since the setting has no noise, $\rho = 0$. Setting this parameter in Theorem 2 gives $\eta = \Omega\!\left(\frac{\lambda_{\min}(P_\Sigma)}{k\,\lambda_{\max}(P)\,\mathbb{E}_x[\|x\|^4]}\right)$, since the term $\frac{\epsilon\delta\|w^*\|^2}{\sqrt{\rho\,\mathbb{E}_x[\|x\|^4]}}$ tends to infinity as $\rho$ tends to 0, and taking the minimum removes this dependence from $\eta$. Substituting this $\eta$ gives the required result.

Observe that the dependence on $\epsilon$ in the convergence rate is $\log(1/\epsilon)$ for the realizable setting, compared to the $1/\epsilon$ dependence in the noisy setting.
4. Which Patch Structures are Easy to Learn?

In this section, we show that the convolutional filters commonly used in practice ("patch and stride") have good eigenvalues, giving fast convergence by Theorem 2. We start with the 1D case and subsequently extend the result to the 2D case.

4.1. 1D Convolution

Here we formally describe a patch and stride convolution in the one-dimensional setting. Consider a 1D image of dimension $n$. Let the patch size be $r$ and the stride be $d$. Let the patches be indexed from 1, and let patch $i$ start at position $(i-1)d + 1$ and be contiguous through position $(i-1)d + r$. The matrix $P_i$ of dimension $r \times n$ corresponding to patch $i$ is
$$P_i = \begin{bmatrix} 0_{r \times (i-1)d} & I_r & 0_{r \times (n - r - (i-1)d)} \end{bmatrix},$$
where $0_{a \times b}$ denotes the all-zeros matrix of dimension $a \times b$ and $I_a$ denotes the identity matrix of size $a$. Thus, the total number of patches is $k = \frac{n-r}{d} + 1$. We will assume that $n \ge 2r - 1$ and $r > d$. The latter condition ensures there is some overlap; the non-overlapping case, which is easier, is handled in Section 5.

We will bound the extremal eigenvalues of $P = \sum_{i,j=1}^{k} P_i P_j^T$. Simple algebra gives the following structure for $P$:
$$P_{i,j} = \begin{cases} k - a & \text{if } |i - j| = ad \text{ for some integer } a \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
For illustration, we show the matrix structure for $d = 1$ and $n \ge 2r$:
$$P = \begin{bmatrix} k & k-1 & \cdots & k-r+1 \\ k-1 & k & \cdots & k-r+2 \\ \vdots & \vdots & \ddots & \vdots \\ k-r+1 & k-r+2 & \cdots & k \end{bmatrix}.$$

4.1.1. BOUNDING EXTREMAL EIGENVALUES OF P

The following lemmas bound the extremal eigenvalues of $P$.

Lemma 4. The maximum eigenvalue of $P$ satisfies $\lambda_{\max}(P) \le k(p+1) - (p - p_2)(p_2 + 1) = O(kp)$, where $p = \lfloor \frac{r-1}{d} \rfloor$ and $p_2 = \lfloor \frac{p}{2} \rfloor$.

Proof. Using Theorem 1, we have $\lambda_{\max}(P) \le \max_i \big(P_{i,i} + \sum_{j \ne i} |P_{i,j}|\big) = \max_i \sum_{j=1}^{r} P_{i,j}$. Observe that $P$ is bisymmetric, thus $\sum_{j=1}^{r} P_{i,j} = \sum_{j=1}^{r} P_{r-i+1,j}$, and we can restrict attention to the top half of the matrix. The structure of $P$ indicates that, in a fixed row, the diagonal entry is the maximum and the non-zero entries decrease monotonically by 1 as we move away from the diagonal. Also, there can be at most $p + 1$ non-zero entries in any row. Thus the row sum is maximized when there are $p + 1$ non-zero entries and the diagonal entry is the middle one, that is, for the row at position $p_2 d + 1$. By simple algebra,
$$\sum_{j=1}^{r} P_{p_2 d + 1, j} \le \sum_{m=0}^{p_2}(k - m) + \sum_{m=1}^{p - p_2}(k - m) = k(p+1) - (p - p_2)(p_2 + 1).$$

Lemma 5. The minimum eigenvalue of $P$ satisfies $\lambda_{\min}(P) \ge 0.5$.

Proof. We break the analysis into the following two cases.

$d < r/2$: We can show that $\lambda_{\max}(P^{-1}) \le 2$ using the structure of $P$ (see Lemmas A and B in the Supplemental). Since $\lambda_{\min}(P) = 1/\lambda_{\max}(P^{-1})$, we have $\lambda_{\min}(P) \ge 0.5$.

$d \ge r/2$: In this case we directly bound the minimum eigenvalue of $P$. Using Theorem 1, we know that $\lambda_{\min}(P) \ge \min_i \big(P_{i,i} - \sum_{j \ne i} |P_{i,j}|\big)$. For $P_{i,j} \ne 0$ we need $|i - j| = ad$ for some integer $a \ge 0$. The maximum value that $|i - j|$ can take is $r - 1$, and since $d \ge r/2$, $a$ must be either 0 or 1. Moreover, for any $i$ there is at most one index $j \ne i$ in range with $|i - j| = d$, so each row of $P$ has at most one non-zero off-diagonal entry, of value $k - 1$. This gives, for each $i$, $\sum_{j \ne i} P_{i,j} \le k - 1$, and thus $\lambda_{\min}(P) \ge \min_i \big(P_{i,i} - \sum_{j \ne i} P_{i,j}\big) \ge k - (k - 1) = 1$.

Combining both cases, we get the required result.

4.1.2. LEARNING RESULT FOR 1D

Combining the above analysis with Theorem 2 gives learnability of 1D convolution filters.

Corollary 2. If Assumptions 1(a), (b), and (d) are satisfied and the patches have a patch and stride structure with parameters $n, r, d$, then for suitably chosen $\eta$ and
$$T = O\!\left(\frac{n^3 r}{d^4\, \lambda_{\min}(\Sigma)^2}\, \max\!\left(\mathbb{E}_x[\|x\|^4],\; \frac{\sqrt{\rho\,\mathbb{E}_x[\|x\|^4]}}{\epsilon\|w^*\|^2}\right) \log\frac{1}{\epsilon\delta}\right),$$
with probability $1 - \delta$, the weight vector $w_{T+1}$ output by Convotron satisfies $\|w_{T+1} - w^*\|^2 \le \epsilon\|w^*\|^2$.

Proof. Combining the above lemmas gives $\lambda_{\max}(P) = O(pk) = O(nr/d^2)$ and $\lambda_{\min}(P) = \Omega(1)$. Observe that $\lambda_{\min}(P_\Sigma) \ge \lambda_{\min}(P)\,\lambda_{\min}(\Sigma)$. Substituting these values in Theorem 2 gives the desired result.

Comparing with SGD: Brutzkus and Globerson (2017) showed that, even for $r = 2$ and $d = 1$, gradient descent can get stuck in a local minimum with probability 1/4.
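These bounds are easy to sanity-check numerically. The sketch below (our own illustration with arbitrary small parameters, not code from the paper) builds the 1D selection matrices, forms $P = \sum_{i,j} P_i P_j^T$, and compares its extreme eigenvalues against Lemmas 4 and 5:

```python
import numpy as np

def patch_matrices_1d(n, r, d):
    # P_i selects coordinates i*d, ..., i*d + r - 1 (0-indexed patches).
    return [np.eye(r, n, k=i * d) for i in range((n - r) // d + 1)]

n, r, d = 20, 5, 2
Ps = patch_matrices_1d(n, r, d)
k, p = len(Ps), (r - 1) // d
p2 = p // 2
P = sum(Pi @ Pj.T for Pi in Ps for Pj in Ps)          # P = sum_{i,j} P_i P_j^T

eigs = np.linalg.eigvalsh(P)                          # ascending order
print("lambda_min(P) =", eigs[0], ">= 0.5 (Lemma 5)")
print("lambda_max(P) =", eigs[-1],
      "<=", k * (p + 1) - (p - p2) * (p2 + 1), "(Lemma 4)")
```

For the parameters used here ($n = 20$, $r = 5$, $d = 2$, so $k = 8$ and $p = 2$), the extreme eigenvalues come out to roughly $0.65$ and $21.3$, inside the guaranteed range $[0.5, 22]$.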
4.2. 2D Convolution

Here we formally define stride and patch convolutions in two dimensions. Consider a 2D image of dimension $n_1 \times n_2$. Let the patch size be $r_1 \times r_2$ and the strides in the two directions be $d_1$ and $d_2$ respectively. Enumerate patches such that patch $(i, j)$ starts at position $((i-1)d_1 + 1, (j-1)d_2 + 1)$ and is a rectangle with diagonally opposite corner $((i-1)d_1 + r_1, (j-1)d_2 + r_2)$. Let $k_1 = \frac{n_1 - r_1}{d_1} + 1$ and $k_2 = \frac{n_2 - r_2}{d_2} + 1$. We vectorize the image row-wise into an $n_1 n_2$-dimensional vector and enumerate each patch row-wise to get an $r_1 r_2$-dimensional vector.

Let $Q_{(i,j)}$ be the indicator matrix of dimension $r_1 r_2 \times n_1 n_2$ with a 1 at $(a, b)$ if the $a$-th location of patch $(i, j)$ is $b$. More formally, $(Q_{(i,j)})_{a,b} = 1$ for all $a = p r_2 + q + 1$ with $0 \le p < r_1$, $0 \le q < r_2$, and $b = ((i-1)d_1 + p)\,n_2 + (j-1)d_2 + q + 1$, and $(Q_{(i,j)})_{a,b} = 0$ otherwise. Note that there are $k_1 k_2$ patches in total, with the corresponding patch matrices being $Q_{(i,j)}$ for $1 \le i \le k_1$, $1 \le j \le k_2$.

Figure 2. 2D convolution patches for image size $n_1 = n_2 = 7$, patch size $r_1 = r_2 = 3$, and stride $d_1 = 2$, $d_2 = 1$. The blue box corresponds to patch $(1, 1)$, red to patch $(2, 1)$, green to patch $(1, 2)$, and orange to patch $(3, 4)$.

4.2.1. BOUNDING EXTREMAL EIGENVALUES OF Q

We will bound the extremal eigenvalues of $Q = \sum_{i,p=1}^{k_1}\sum_{j,q=1}^{k_2} Q_{(i,j)} Q_{(p,q)}^T$. Let the $P^{(1)}_i$ be the patch matrices corresponding to the 1D convolution with parameters $n_1, r_1, d_1$ defined as in the previous section, and let $P^{(1)} = \sum_{i,j=1}^{k_1} P^{(1)}_i (P^{(1)}_j)^T$. Define $P^{(2)}_i$ for $1 \le i \le k_2$ and $P^{(2)}$ similarly with parameters $n_2, r_2, d_2$ instead of $n_1, r_1, d_1$.

Lemma 6. $Q_{(i,j)} = P^{(1)}_i \otimes P^{(2)}_j$.

Proof. Intuitively, $P^{(1)}_i$ and $P^{(2)}_j$ give the indices corresponding to the rows and columns of the 2D patch, and the Kronecker product vectorizes it to give the $(i,j)$-th patch. More formally, we show that $(Q_{(i,j)})_{a,b} = 1$ iff $(P^{(1)}_i \otimes P^{(2)}_j)_{a,b} = 1$. Let $a = p r_2 + q + 1$ with $0 \le p < r_1$, $0 \le q < r_2$, and $b = s\, n_2 + u + 1$ with $0 \le s < n_1$, $0 \le u < n_2$. Then $(P^{(1)}_i \otimes P^{(2)}_j)_{a,b} = 1$ iff $(P^{(1)}_i)_{p+1,\, s+1} = 1$ and $(P^{(2)}_j)_{q+1,\, u+1} = 1$. We know that $(P^{(1)}_i)_{p+1,\, s+1} = 1$ iff $s = (i-1)d_1 + p$, and $(P^{(2)}_j)_{q+1,\, u+1} = 1$ iff $u = (j-1)d_2 + q$. This gives $b = ((i-1)d_1 + p)\,n_2 + (j-1)d_2 + q + 1$, which is exactly the condition for $(Q_{(i,j)})_{a,b} = 1$. Thus $Q_{(i,j)} = P^{(1)}_i \otimes P^{(2)}_j$.

Lemma 7. $Q = P^{(1)} \otimes P^{(2)}$.

Proof. We have
$$\begin{aligned}
Q &= \sum_{i,p=1}^{k_1}\sum_{j,q=1}^{k_2} Q_{(i,j)} Q_{(p,q)}^T
= \sum_{i,p=1}^{k_1}\sum_{j,q=1}^{k_2} \big(P^{(1)}_i \otimes P^{(2)}_j\big)\big(P^{(1)}_p \otimes P^{(2)}_q\big)^T \\
&= \sum_{i,p=1}^{k_1}\sum_{j,q=1}^{k_2} \big(P^{(1)}_i \otimes P^{(2)}_j\big)\big((P^{(1)}_p)^T \otimes (P^{(2)}_q)^T\big)
= \sum_{i,p=1}^{k_1}\sum_{j,q=1}^{k_2} \big(P^{(1)}_i (P^{(1)}_p)^T\big) \otimes \big(P^{(2)}_j (P^{(2)}_q)^T\big) \\
&= \Big(\sum_{i,p=1}^{k_1} P^{(1)}_i (P^{(1)}_p)^T\Big) \otimes \Big(\sum_{j,q=1}^{k_2} P^{(2)}_j (P^{(2)}_q)^T\Big)
= P^{(1)} \otimes P^{(2)}.
\end{aligned}$$

Lemma 8. We have $\lambda_{\min}(Q) \ge 0.25$ and $\lambda_{\max}(Q) = O(k_1 p_1 k_2 p_2)$, where $p_1 = \lfloor \frac{r_1 - 1}{d_1} \rfloor$ and $p_2 = \lfloor \frac{r_2 - 1}{d_2} \rfloor$.

Proof. Since $Q = P^{(1)} \otimes P^{(2)}$ and $Q$, $P^{(1)}$, $P^{(2)}$ are positive semi-definite, $\lambda_{\min}(Q) = \lambda_{\min}(P^{(1)})\,\lambda_{\min}(P^{(2)})$ and $\lambda_{\max}(Q) = \lambda_{\max}(P^{(1)})\,\lambda_{\max}(P^{(2)})$. Using the lemmas from the previous section gives the required result.

Note that this technique can be extended to higher dimensional patch structures as well.

4.2.2. LEARNING RESULT FOR 2D

Similar to the 1D case, combining the above analysis with Theorem 2 gives learnability of 2D convolution filters.

Corollary 3. If Assumptions 1(a), (b), and (d) are satisfied and the patches have a 2D patch and stride structure with parameters $n_1, n_2, r_1, r_2, d_1, d_2$, then for suitably chosen $\eta$ and
$$T = O\!\left(\frac{n_1^3 n_2^3 r_1 r_2}{d_1^3 d_2^3\, \lambda_{\min}(\Sigma)^2}\, \max\!\left(\mathbb{E}_x[\|x\|^4],\; \frac{\sqrt{\rho\,\mathbb{E}_x[\|x\|^4]}}{\epsilon\|w^*\|^2}\right) \log\frac{1}{\epsilon\delta}\right),$$
with probability $1 - \delta$, the weight vector $w_{T+1}$ output by Convotron satisfies $\|w_{T+1} - w^*\|^2 \le \epsilon\|w^*\|^2$.

Proof. Lemma 8 gives $\lambda_{\max}(Q) = O(n_1 n_2 r_1 r_2 / (d_1 d_2)^2)$ and $\lambda_{\min}(Q) = \Omega(1)$. Observe that the corresponding $P_\Sigma$ (formed from the patch matrices $Q_{(i,j)}$) satisfies $\lambda_{\min}(P_\Sigma) \ge \lambda_{\min}(Q)\,\lambda_{\min}(\Sigma)$. Substituting these values in Theorem 2 gives the desired result.
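The Kronecker structure of Lemmas 6 and 7 can be verified directly on small instances. The following sketch (our own illustration with arbitrary dimensions, not code from the paper) builds the 2D indicator matrices $Q_{(i,j)}$ from their definition and checks both factorizations:

```python
import numpy as np

def patch_matrices_1d(n, r, d):
    return [np.eye(r, n, k=i * d) for i in range((n - r) // d + 1)]

def patch_matrices_2d(n1, n2, r1, r2, d1, d2):
    # Row-major vectorization: image entry (row, col) maps to index row * n2 + col.
    Qs = []
    for i in range((n1 - r1) // d1 + 1):
        for j in range((n2 - r2) // d2 + 1):
            Q = np.zeros((r1 * r2, n1 * n2))
            for p in range(r1):
                for q in range(r2):
                    Q[p * r2 + q, (i * d1 + p) * n2 + j * d2 + q] = 1.0
            Qs.append(Q)
    return Qs

n1, n2, r1, r2, d1, d2 = 7, 6, 3, 2, 2, 1
P1s = patch_matrices_1d(n1, r1, d1)
P2s = patch_matrices_1d(n2, r2, d2)
Qs = patch_matrices_2d(n1, n2, r1, r2, d1, d2)

# Lemma 6: each Q_(i,j) equals the Kronecker product of the 1D patch matrices.
krons = [np.kron(A, B) for A in P1s for B in P2s]
print(all(np.array_equal(Q, K) for Q, K in zip(Qs, krons)))

# Lemma 7: Q = P^(1) (x) P^(2).
Q = sum(Qi @ Qj.T for Qi in Qs for Qj in Qs)
P1 = sum(A @ B.T for A in P1s for B in P1s)
P2 = sum(A @ B.T for A in P2s for B in P2s)
print(np.allclose(Q, np.kron(P1, P2)))
```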
5. Non-overlapping Patches are Easy

In this section, we show that if there is one patch that does not overlap with any other patch and the covariance matrix is the identity, then we can easily learn the filter even if the other patches have arbitrary overlaps. This includes the commonly used Gaussian assumption. WLOG we assume that $P_1$ is the patch that does not overlap with any other patch, which implies $P_1 P_j^T = P_j P_1^T = 0$ for all $j \ne 1$.

Algorithm 2 Convotron-No-Overlap
  Initialize $w_1 := 0 \in \mathbb{R}^r$.
  for $t = 1$ to $T$ do
    Draw $(x_t, y_t) \sim D$
    Let $G_t = (y_t - f_{w_t}(x_t)) P_1 x_t$
    Set $w_{t+1} = w_t + \eta G_t$
  end for
  Return $w_{T+1}$

Observe that the algorithm ignores the directions of all other patches and yet succeeds. This indicates that with respect to a Gaussian distribution, in order to have an interesting patch structure (for one layer networks), it is necessary to avoid having even a single disjoint patch. The following theorem shows the convergence of Convotron-No-Overlap.

Theorem 3. If Assumptions 1 are satisfied with $\Sigma = I$, then for
$$\eta = \frac{1+\alpha}{3k}\min\!\left(\frac{1}{\mathbb{E}_x[\|x\|^4]},\; \frac{\epsilon\delta\|w^*\|^2}{\sqrt{\rho\,\mathbb{E}_x[\|x\|^4]}}\right)
\quad\text{and}\quad
T = O\!\left(\frac{k}{\eta}\log\frac{1}{\epsilon\delta}\right),$$
with probability $1 - \delta$, the weight vector $w_{T+1}$ output by Convotron-No-Overlap satisfies $\|w_{T+1} - w^*\|^2 \le \epsilon\|w^*\|^2$.

Proof. The proof follows the outline of the Convotron proof very closely, and we use the same definitions as in the previous proof. We have
$$\begin{aligned}
\mathbb{E}_{x_t, y_t}\big[(w^* - w_t)^T G_t \,\big|\, S_{t-1}\big]
&= \frac{1}{k}\sum_{1 \le i \le k} \mathbb{E}_{x_t}\Big[\big(\sigma(w^{*T} P_i x_t) - \sigma(w_t^T P_i x_t)\big)\big((w^{*T} - w_t^T) P_1 x_t\big) \,\Big|\, S_{t-1}\Big] \\
&= \frac{1+\alpha}{2k}\sum_{1 \le i \le k} \mathbb{E}_{x_t}\Big[\big((w^{*T} - w_t^T) P_i x_t\big)\big((w^{*T} - w_t^T) P_1 x_t\big) \,\Big|\, S_{t-1}\Big] \\
&= \frac{1+\alpha}{2k}\, (w^* - w_t)^T \Big(\sum_{i=1}^{k} P_i\Big)\, \mathbb{E}_{x_t}[x_t x_t^T]\, P_1^T (w^* - w_t)
= \frac{1+\alpha}{2k}\, \|w^* - w_t\|^2.
\end{aligned}$$
The last equality follows since $\Sigma = I$, $P_i P_1^T = 0$ for all $i \ne 1$, and $P_1 P_1^T = I_r$. For the variance,
$$\begin{aligned}
\mathbb{E}_{x_t, y_t}\big[\|G_t\|^2 \,\big|\, S_{t-1}\big]
&= \mathbb{E}_{x_t, y_t}\big[(y_t - f_{w_t}(x_t))^2 \|P_1 x_t\|^2 \,\big|\, S_{t-1}\big]
\le \mathbb{E}_{x_t, y_t}\big[(y_t - f_{w_t}(x_t))^2 \|x_t\|^2 \,\big|\, S_{t-1}\big] \\
&\le \mathbb{E}_{x_t}[\|x_t\|^4]\, \|w^* - w_t\|^2 + \sqrt{\rho\, \mathbb{E}_{x_t}[\|x_t\|^4]}.
\end{aligned}$$
Following the rest of the analysis with $\eta$ and $T$ as in the theorem statement gives the required result.

6. Experiments: SGD vs Convotron

To further support our theoretical findings, we empirically compare the performance of SGD (Algorithm 3) with our algorithm Convotron. We measure performance based on the failure probability, that is, the fraction of randomly initialized runs on which the algorithm fails to converge (the randomness is over both the choice of initialization for SGD and the draws from the distribution). More formally, we say that the algorithm fails if the $\ell_2$-distance between the final weight vector $w_T$ and the true weight vector $w^*$, that is, $\|w_T - w^*\|$, is greater than a threshold $\theta$. We choose this measure because, in practice, due to the high computation time of training neural networks, random restarts are expensive.

Algorithm 3 SGD
  Randomly initialize $w_1 \in \mathbb{R}^r$.
  for $t = 1$ to $T$ do
    Draw $(x_t, y_t) \sim D$
    Let $G_t = (y_t - f_{w_t}(x_t)) \sum_{i=1}^{k} \sigma'(w_t^T P_i x_t) P_i x_t$
    Set $w_{t+1} = w_t + \eta G_t$
  end for
  Return $w_{T+1}$

In the experiments, given a fixed true weight vector, for varying learning rates (in increments of 0.01), we choose 50 random initializations and run the two algorithms with them as starting points. We plot the failure probability ($\theta = 0.1$) against the learning rate. Note that the lowest learning rate we use is 0.01, since making the learning rate too small requires a large number of iterations for convergence for both algorithms.

We first test the performance on a simple 1D convolution case with $(n, k, d, T) = (8, 4, 1, 6000)$ and a 2D case with $(n_1, n_2, k_1, k_2, d_1, d_2, T) = (5, 5, 3, 3, 1, 1, 15000)$, on inputs drawn from a normalized ($\ell_2$-norm 1) Gaussian distribution with identity covariance matrix. We adversarially choose a fixed weight vector ($\ell_2$-norm 1). For example, we take the vector to be $[1, -1, 1, -1]$ in the 1D case and normalize. This weight vector can be viewed as an edge detection filter, that is, one counting the number of times the image goes from black (negative) to white (positive).
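The comparison protocol is easy to reproduce in miniature. The sketch below is our own illustration with much smaller, untuned parameters (unnormalized Gaussian inputs, a tiny filter, a handful of restarts), not the authors' experimental code; it runs both update rules from random seeds at a few learning rates and reports the failure fraction:

```python
import numpy as np

def f(w, x, Ps):                                     # ReLU network with average pooling
    return np.mean([max(w @ (P @ x), 0.0) for P in Ps])

def convotron_step(w, x, y, Ps, eta):
    return w + eta * (y - f(w, x, Ps)) * sum(P @ x for P in Ps)

def sgd_step(w, x, y, Ps, eta):                      # gradient step on the square loss
    return w + eta * (y - f(w, x, Ps)) * sum((w @ (P @ x) > 0) * (P @ x) for P in Ps)

def trial(step, w0, w_star, Ps, eta, T):
    w = w0.copy()
    for _ in range(T):
        x = np.random.randn(Ps[0].shape[1])          # Gaussian input
        w = step(w, x, f(w_star, x, Ps), Ps, eta)    # realizable labels
    return np.linalg.norm(w - w_star)

n, r, d, T, theta, runs = 4, 2, 1, 3000, 0.1, 10
Ps = [np.eye(r, n, k=i * d) for i in range((n - r) // d + 1)]
w_star = np.array([1.0, -1.0]) / np.sqrt(2)          # edge-detector-style filter

for eta in (0.01, 0.02, 0.05):
    fails = []
    for step, init in ((sgd_step, lambda: np.random.randn(r)),
                       (convotron_step, lambda: np.zeros(r))):
        dists = [trial(step, init(), w_star, Ps, eta, T) for _ in range(runs)]
        fails.append(np.mean([not (dist <= theta) for dist in dists]))  # NaN counts as failure
    print(f"eta={eta}: failure prob SGD={fails[0]:.2f}, Convotron={fails[1]:.2f}")
```

The experiments reported in Figure 3 use normalized inputs, a finer grid of learning rates (increments of 0.01), and 50 restarts per point.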
Figure 3. Failure probability of SGD (green) vs Convotron (blue) with varying learning rate $\eta$. Experiment 1: patch and stride, 1D (top-left) and 2D (top-right). Experiment 2: input distribution with mean 0 and identity covariance matrix (bottom-left) and non-identity covariance matrix (bottom-right). The curves are shifted due to the scaling difference of the updates.

Figure 3 (Top) shows that SGD has a small, data-dependent range of learning rates where it succeeds but may fail with probability almost 0.5 outside this region, whereas Convotron always returns a good solution for any small enough $\eta$ chosen according to Theorem 2. The failure points observed for SGD show the prevalence of bad local minima where SGD gets stuck.

For the second experiment, we choose a fixed weight vector for which SGD performs well with very high probability on a normalized Gaussian input distribution with identity covariance matrix (see Figure 3, bottom-left). However, on choosing a different covariance matrix with a higher condition number (about 60), the performance of SGD worsens, whereas Convotron always succeeds (see Figure 3, bottom-right). The covariance matrix is generated by choosing random matrices, symmetrizing them, and adding $cI$ for $c > 0$ to make the eigenvalues positive.

These experiments demonstrate that techniques for fine-tuning SGD's learning rate are necessary, even for very simple architectures. In contrast, no fine-tuning is necessary for Convotron: the correct learning rate can be easily computed given the learner's desired patch structure and an estimate of the covariance matrix.

7. Conclusions and Future Work

We have given the first efficient algorithm with provable guarantees for learning general one layer convolutional networks under symmetric, well-conditioned distributions. The obvious open question is to extend our algorithm to higher depth networks and weaken the distributional assumptions.

Acknowledgments

We thank Jessica Hoffmann and Philipp Krähenbühl for useful discussions.

References

Brutzkus, A. and Globerson, A. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

Du, S. S., Lee, J. D., and Tian, Y. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.

Du, S. S., Lee, J. D., Tian, Y., Poczos, B., and Singh, A. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.

Ge, R., Lee, J., and Ma, T. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

Goel, S. and Klivans, A. Eigenvalue decay implies polynomial-time learnability for neural networks. In Advances in Neural Information Processing Systems, pp. 2189-2199, 2017a.

Goel, S. and Klivans, A. Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010, 2017b.

Goel, S., Kanade, V., Klivans, A., and Thaler, J. Reliably learning the ReLU in polynomial time. arXiv preprint arXiv:1611.10258, 2016.

Kakade, S. M., Kanade, V., Shamir, O., and Kalai, A. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pp. 927-935, 2011.

Kearns, M. J. and Schapire, R. E. Efficient distribution-free learning of probabilistic concepts. In Foundations of Computer Science, 1990. Proceedings., 31st Annual Symposium on, pp. 382-391. IEEE, 1990.

Klivans, A. and Meka, R. Learning graphical models using multiplicative weights. arXiv preprint arXiv:1706.06274, 2017.
Li, Y. and Yuan, Y. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pp. 597-607, 2017.

Soltanolkotabi, M. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pp. 2004-2014, 2017.

Tian, Y. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. 2016.

Weisstein, E. W. Gershgorin circle theorem. 2003.

Zhang, Q., Panigrahy, R., Sachdeva, S., and Rahimi, A. Electron-proton dynamics in deep learning. arXiv preprint arXiv:1702.00458, 2017.

Zhong, K., Song, Z., and Dhillon, I. S. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017b.