# Harmonic Decompositions of Convolutional Networks

Meyer Scetbon¹² Zaid Harchaoui²

¹CREST, ENSAE. ²Department of Statistics, University of Washington. Correspondence to: Meyer Scetbon, Zaid Harchaoui.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

We present a description of the function space and the smoothness class associated with a convolutional network using the machinery of reproducing kernel Hilbert spaces. We show that the mapping associated with a convolutional network expands into a sum involving elementary functions akin to spherical harmonics. This functional decomposition can be related to the functional ANOVA decomposition in nonparametric statistics. Building on our functional characterization of convolutional networks, we obtain statistical bounds highlighting an interesting trade-off between the approximation error and the estimation error.

## 1. Introduction

The renewed interest in convolutional neural networks (Fukushima, 1980; Le Cun et al., 1995) in computer vision and signal processing has led to a major leap in generalization performance on common task benchmarks, supported by the recent advances in graphics processing hardware and the collection of huge labelled datasets for training and evaluation.

Convolutional neural networks pose major challenges to statistical learning theory. First and foremost, a convolutional network learns from data, jointly, both a feature representation through its hidden layers and a prediction function through its final layer. A convolutional neural network implements a function unfolding as a composition of basic functions (nonlinearity, convolution, and pooling), which appears to model visual information in images well. Yet the relevant function spaces in which to analyze their statistical performance remain unclear.

The analysis of convolutional neural networks (CNNs) has been an active research topic, and different viewpoints have been developed. A straightforward viewpoint is to dismiss completely the grid- or lattice-structure of images and analyze instead a multi-layer perceptron (MLP) acting on vectorized images; this has the downside of setting aside the most interesting property of CNNs, namely that they model well images, that is, data with a 2D lattice structure. The scattering transform viewpoint and the i-theory viewpoint (Mallat, 2012; Bruna & Mallat, 2013; Mallat, 2016; Poggio & Anselmi, 2016; Oyallon et al., 2018) keep the triad of components nonlinearity-convolution-pooling and their combination in a deep architecture, and characterize the group-invariance properties and compression properties of convolutional neural networks. Recent work (Bietti & Mairal, 2017) considers risk bounds involving appropriately defined spectral norms for convolutional kernel networks acting on continuous-domain images.

We present in this paper the construction of a function space including the mapping associated with a convolutional network acting on discrete-domain images. Doing so, we characterize the sequence of eigenvalues and eigenfunctions of the related integral operator, hence shedding light on the harmonic structure of the function space of a convolutional neural network. Indeed the eigenvalue decay controls the statistical convergence rate.
Thanks to this spectral characterization, we establish high-probability statistical bounds relating the eigenvalue decay and the convergence rate.

We show that a convolutional network function admits a decomposition whose structure is related to a functional tensor-product space ANOVA model decomposition (Lin, 2000). Such models extend the popular additive models in order to capture interactions of any order between covariates. Indeed a tensor-product space ANOVA model decomposes a high-dimensional multivariate function as a sum of one-dimensional functions (main effects), two-dimensional functions (two-way interactions), and so on. A remarkable property of such models is their statistical convergence rate, which is within a log factor of the rate in one dimension, under appropriate assumptions. We bring to light a similar structure in the decomposition of the mapping associated with a convolutional network. This structure plays an essential role in the convergence rates we present. This suggests that an important component of the modeling power of a convolutional network is its ability to capture spatial interactions between sub-images or patches.

This work makes the following contributions. We construct a kernel and a corresponding reproducing kernel Hilbert space (RKHS) to describe a convolutional network (CNN). The construction encompasses networks with any number of filters per layer. Moreover, we provide a sufficient condition for the kernel to be universal. Then, we establish an explicit, analytical, Mercer decomposition of the multilayer kernel associated with this RKHS. We uncover a relationship to a functional ANOVA model by highlighting a sum-product structure involving interactions between sub-images or patches. We obtain a tight control of the eigenvalue decay of the associated integral operator under general conditions on the activation functions. Finally, we establish convergence rates to the Bayes classifier for the regularized least-squares estimator in this RKHS. From a nonparametric learning viewpoint, these rates are optimal in a minimax sense. All the proofs can be found in the longer version of the paper (Scetbon & Harchaoui, 2020).

## 2. Basic Notions

**Image Space.** We first describe the mathematical framework for image data. An image is viewed here as a collection of normalized sub-images or patches. The sub-image or patch representation is standard in image processing and computer vision, and encompasses the pixel representation as a special case (Mairal et al., 2014a). Note that the framework presented here readily applies to signals and any grid- or lattice-structured data, with obvious changes in the indexing structures. We focus on the case of images as it is currently a popular application of convolutional networks (Goodfellow et al., 2016).

Denote by $\mathcal{X}$ the space of images. Let $h, w \ge 1$ be respectively the height and width of the images and $2 \le d \le \min(h^2, w^2)$ the size of each patch. We consider square patches for simplicity. Denoting $r \ge 1$ the height of a patch, we have $r^2 = d$. For each $(i,j) \in \{1, \dots, h-r+1\} \times \{1, \dots, w-r+1\}$ we define the patch extraction operator at location $(i,j)$ as

$$P_{i,j}(X) := (X_{i+\ell,\, j+k})_{\ell,k \in \{1,\dots,r\}} \in \mathbb{R}^d \qquad (1)$$

where $X \in \mathbb{R}^{h \times w}$. Moreover let $1 \le n \le (h-r+1)(w-r+1)$ and let $A \subset \{1, \dots, h-r+1\} \times \{1, \dots, w-r+1\}$ such that $|A| = n$. Define now the initial space of images as

$$E_A := \{ X \in \mathbb{R}^{h \times w} : \ \|P_z(X)\|_2 = 1 \ \text{for } z \in A \},$$

where each patch considered has been normalized.
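To make the patch representation concrete, the following is a minimal NumPy sketch, our own illustration rather than the authors' code, of the patch extraction operators $P_{i,j}$ and of the normalization defining $E_A$ (0-based indexing; the function names are hypothetical).

```python
import numpy as np

def extract_patch(X, i, j, r):
    """Patch extraction operator P_{i,j}: the r x r patch of X whose top-left
    corner sits at (i, j), flattened into a vector of R^d with d = r^2."""
    return X[i:i + r, j:j + r].reshape(-1)

def to_patch_collection(X, locations, r):
    """Illustrative sketch: map an h x w image to its collection of normalized
    patches, one unit vector of R^d per location in A."""
    patches = []
    for (i, j) in locations:
        p = extract_patch(X, i, j, r)
        patches.append(p / np.linalg.norm(p))  # normalization: each patch lies on S^{d-1}
    return patches

# Example: a 6 x 6 image, 3 x 3 patches (d = 9), non-overlapping locations.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
A = [(0, 0), (0, 3), (3, 0), (3, 3)]
patches = to_patch_collection(X, A, r=3)
print([np.round(np.linalg.norm(p), 6) for p in patches])  # all norms equal to 1
```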
Since $\{(i+\ell, j+k) : (i,j) \in A \text{ and } \ell,k \in \{1,\dots,r\}\} = \{1, \dots, h-r+1\} \times \{1, \dots, w-r+1\}$, the mapping

$$\varphi : \mathbb{R}^{h \times w} \to \mathbb{R}^d \times \cdots \times \mathbb{R}^d, \qquad X \mapsto (P_z(X))_{z \in A}$$

is injective. The sets $I := \varphi(E_A)$ and $E_A$ are then isomorphic. Hence we shall work from now on with $I$ as the image space.

*Figure 1. For simplicity, we consider a single-channel image in this illustration. Normalized patches are extracted from the image.*

We have by construction that $I \subset \prod_{i=1}^n S^{d-1}$, the $n$-th Cartesian power of $S^{d-1}$, where $S^{d-1}$ is the unit sphere of $\mathbb{R}^d$. Moreover, as soon as the patches considered are disjoint, we have that $I = \prod_{i=1}^n S^{d-1}$. In order to simplify the notation, we shall always consider the case where $I = \prod_{i=1}^n S^{d-1}$, where $d$ is the dimension of the square patches and $n$ is the number of patches considered. In the following, we shall denote, for any $q \ge 1$ and set $X$, the $q$-ary Cartesian power $\prod_{i=1}^q X := (X)^q$. Moreover, if $X \in (X)^q$, we shall write $X := (X_i)_{i=1}^q$ where each $X_i \in X$.

Let $P_m(d)$ be the space of homogeneous polynomials of degree $m$ in $d$ variables with real coefficients and $H_m(d)$ be the space of harmonic polynomials defined by

$$H_m(d) := \{ P \in P_m(d) \ | \ \Delta P = 0 \} \qquad (2)$$

where $\Delta = \sum_{i=1}^d \partial^2/\partial x_i^2$ is the Laplace operator on $\mathbb{R}^d$ (Folland, 2016). Moreover, define $H_m(S^{d-1})$, the space of real spherical harmonics of degree $m$, as the set of restrictions of harmonic polynomials in $H_m(d)$ to $S^{d-1}$. Let also $L_2^{d\sigma_{d-1}}(S^{d-1})$ be the space of (real) square-integrable functions on the sphere $S^{d-1}$ endowed with its induced Lebesgue measure $d\sigma_{d-1}$, and $|S^{d-1}|$ the surface area of $S^{d-1}$. $L_2^{d\sigma_{d-1}}(S^{d-1})$ endowed with its natural inner product is a separable Hilbert space, and the family of spaces $(H_m(S^{d-1}))_{m \ge 0}$ yields a direct sum decomposition (Efthimiou & Frye, 2014)

$$L_2^{d\sigma_{d-1}}(S^{d-1}) = \bigoplus_{m \ge 0} H_m(S^{d-1}) \qquad (3)$$

which means that the summands are closed and pairwise orthogonal. Moreover, each $H_m(S^{d-1})$ has a finite dimension $\alpha_{m,d}$, with $\alpha_{0,d} = 1$, $\alpha_{1,d} = d$ and, for $m \ge 2$,

$$\alpha_{m,d} = \binom{d-1+m}{m} - \binom{d-3+m}{m-2}.$$

Therefore for all $m \ge 0$, given any orthonormal basis $(Y^1_m, \dots, Y^{\alpha_{m,d}}_m)$ of $H_m(S^{d-1})$, we can build a Hilbertian basis of $L_2^{d\sigma_{d-1}}(S^{d-1})$ by concatenating these orthonormal bases.

Let $L_2(I) := L_2^{\otimes_{i=1}^n d\sigma_{d-1}}(I)$ be the space of (real) square-integrable functions on $I$ endowed with the $n$-fold tensor product measure $\otimes_{i=1}^n d\sigma_{d-1} := d\sigma_{d-1} \otimes \cdots \otimes d\sigma_{d-1}$, and let us define the integral operator on $L_2(I)$ associated with a positive semi-definite kernel $K$ on $I$:

$$T_K : L_2(I) \to L_2(I), \qquad f \mapsto \int_I K(x, \cdot)\, f(x)\, \mathrm{d}\big(\otimes_{i=1}^n \sigma_{d-1}\big)(x).$$

As soon as $\int_I K(x,x)\, \mathrm{d}\big(\otimes_{i=1}^n \sigma_{d-1}\big)(x)$ is finite, which is clearly satisfied when $K$ is continuous, $T_K$ is well defined, self-adjoint, positive semi-definite and trace-class (Simon, 2010; Smale & Zhou, 2007).

We approach here the modeling of interactions of patches or sub-images with functional ANOVA modeling in mind. Let us first recall the basic notions needed to define a tensor product of functional Hilbert spaces (Lin, 2000; Steinwart & Christmann, 2008). Consider a Hilbert space $E_1$ of functions of $X_1$ and a Hilbert space $E_2$ of functions of $X_2$. The tensor product space $E_1 \otimes E_2$ is defined as the completion of the class of functions of the form $\sum_{i=1}^k f_i(X_1) g_i(X_2)$, where $f_i \in E_1$, $g_i \in E_2$ and $k$ is any positive integer, under the norm induced by the norms in $E_1$ and $E_2$. The inner product in $E_1 \otimes E_2$ satisfies

$$\langle f_1(X_1) g_1(X_2),\, f_2(X_1) g_2(X_2) \rangle_{E_1 \otimes E_2} = \langle f_1(X_1), f_2(X_1) \rangle_{E_1}\, \langle g_1(X_2), g_2(X_2) \rangle_{E_2}$$

where, for $i = 1, 2$, $\langle \cdot, \cdot \rangle_{E_i}$ denotes the inner product in $E_i$.
Note that when $E_1$ and $E_2$ are RKHSs with associated kernels $k_1$ and $k_2$, one has an explicit formulation of the kernel associated with the RKHS $E_1 \otimes E_2$ (Carmeli et al., 2010).

A tensor-product space ANOVA model captures interactions between covariates as follows. Let $D$ be the highest order of interaction in the model. Such a model assumes that the high-dimensional function to be estimated is a sum of one-dimensional functions (main effects), two-dimensional functions (two-way interactions), and so on. That is, the $n$-dimensional function $f$ decomposes as

$$f(x_1, \dots, x_n) = C + \sum_{k=1}^{D} \sum_{\substack{A \subset \{1,\dots,n\} \\ |A| = k}} f_A(x_A)$$

where $C$ is a constant and the components $f_A$ satisfy conditions that guarantee uniqueness (Scetbon & Harchaoui, 2020). More precisely, after assigning a function space to each main effect, this strategy models an interaction as living in the tensor product space of the function spaces of the interacting main effects. In other words, if we assume $f_1(X_1)$ to be in a Hilbert space $E_1$ of functions of $X_1$ and $f_2(X_2)$ to be in a Hilbert space $E_2$ of functions of $X_2$, then we can model $f_{12}$ as living in $E_1 \otimes E_2$, the tensor product space of $E_1$ and $E_2$. Higher-order interactions are modeled similarly. In (Lin, 2000), the author considers the case where the main effects are univariate functions living in a Sobolev Hilbert space of order $m \ge 1$ with domain $[0,1]$, denoted $H^m([0,1])$ and defined as

$$\big\{ f : f^{(\nu)} \text{ absolutely continuous},\ \nu = 0, \dots, m-1;\ f^{(m)} \in L_2 \big\}.$$

More generally, functional ANOVA models assume that the main effects are univariate functions living in an RKHS (Lin, 2000).

## 3. Convolutional Networks and Multi-Layer Kernels

We proceed with the mathematical description of a convolutional network. The description follows previous works (Bruna & Mallat, 2013; Mairal et al., 2014b; Bietti & Mairal, 2017). Let $N$ be the number of hidden layers, $(\sigma_i)_{i=1}^N$ be $N$ real-valued functions defined on $\mathbb{R}$, the activation functions at each hidden layer, $(d_i)_{i=1}^N$ the sizes of the square patches at each hidden layer, $(p_i)_{i=1}^N$ the number of filters at each hidden layer, and $(n_i)_{i=1}^{N+1}$ the number of patches at each hidden layer. As our input space is $I = (S^{d-1})^n$, we set $d_0 = d$, $p_0 = 1$, $n_0 = n$. Moreover, as the prediction layer is a linear transformation of the $N$-th layer, we do not need to extract patches from the $N$-th layer, and we set $d_N = n_{N-1}$ so that the only patch extracted for the prediction layer is the full image itself. Therefore we can also set $n_N = 1$.

Then, a mapping defined by a convolutional network is parameterized by a sequence $W := (W^0, \dots, W^N)$ where, for $0 \le k \le N-1$, $W^k \in \mathbb{R}^{p_{k+1} \times d_k p_k}$, and $W^N \in \mathbb{R}^{d_N p_N}$ for the prediction layer. Indeed, let $W$ be such a sequence and denote, for $k \in \{0, \dots, N-1\}$, $W^k := (w^k_1, \dots, w^k_{p_{k+1}})^T$ where, for all $j \in \{1, \dots, p_{k+1}\}$, $w^k_j \in \mathbb{R}^{d_k p_k}$. Moreover, let us define for all $k \in \{0, \dots, N-1\}$, $j \in \{1, \dots, p_{k+1}\}$ and $q \in \{1, \dots, n_{k+1}\}$ the following sequence of operators.

**Convolution Operators.**
$$C^k_j : Z \in (\mathbb{R}^{d_k p_k})^{n_k} \mapsto C^k_j(Z) := \big( \langle Z_i, w^k_j \rangle \big)_{i=1}^{n_k} \in \mathbb{R}^{n_k}$$

**Non-Linear Operators.**
$$M_k : X \in \mathbb{R}^{n_k} \mapsto M_k(X) := \big( \sigma_k(X_i) \big)_{i=1}^{n_k} \in \mathbb{R}^{n_k}$$

**Pooling Operators.** Let $(\gamma^k_{i,j})_{i,j=1}^{n_k}$ be the pooling factors at layer $k$ (which are often assumed to be decreasing with respect to the distance between $i$ and $j$).
$$A_k : X \in \mathbb{R}^{n_k} \mapsto A_k(X) := \Big( \sum_{j=1}^{n_k} \gamma^k_{i,j} X_j \Big)_{i=1}^{n_k} \in \mathbb{R}^{n_k}$$

**Patch Extraction Operators.**
$$P^{k+1}_q : U \in (\mathbb{R}^{p_{k+1}})^{n_k} \mapsto P^{k+1}_q(U) := \big( U_{q+\ell} \big)_{\ell=0}^{d_{k+1}-1} \in \mathbb{R}^{p_{k+1} d_{k+1}}$$

Notice that, as we set $d_N = n_{N-1}$ and $n_N = 1$, when $k = N-1$ there is only one patch extraction operator, which is $P^N_1 = \mathrm{Id}$.
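The four operators above can be written compactly; the sketch below is our own illustration, in which a layer's input is stored with one row per spatial location and one column per filter, and the sliding-window rule $n_{k+1} = n_k - d_{k+1} + 1$ is an assumption made for concreteness.

```python
import numpy as np

def convolution(Z, w):
    """C^k_j (sketch): inner product of each patch Z_i with the filter w."""
    return np.array([z @ w for z in Z])

def nonlinearity(x, sigma):
    """M_k (sketch): apply the activation coordinate-wise."""
    return sigma(x)

def pooling(x, gamma):
    """A_k (sketch): weighted combination of coordinates; gamma[i, j] are the
    pooling factors at the current layer."""
    return gamma @ x

def extract_patches(U, d_next):
    """P^{k+1}_q (sketch): groups of d_{k+1} consecutive rows of U, each flattened
    into a vector; U has one row per spatial location and one column per filter.
    Assumes consecutive windows, i.e. n_{k+1} = n_k - d_{k+1} + 1."""
    n_next = U.shape[0] - d_next + 1
    return [U[q:q + d_next, :].reshape(-1) for q in range(n_next)]
```

These maps are composed layer by layer in the procedure described next.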
Then the function associated with $W$ generated by the convolutional network can be obtained by the following procedure. Let $X^0 \in I$; we can write $X^0 = (X^0_i)_{i=1}^{n_0}$ where, for all $i \in [\![1, n_0]\!]$, $X^0_i \in S^{d-1}$. We then build by induction the sequence $(X^k)_{k=0}^N$ by performing the following operations, starting from $k = 0$ until $k = N-1$:

$$C^k_j(X^k) = \big( \langle X^k_i, w^k_j \rangle \big)_{i=1}^{n_k} \qquad (4)$$

$$M_k\big(C^k_j(X^k)\big) = \big( \sigma_k\big(\langle X^k_i, w^k_j \rangle\big) \big)_{i=1}^{n_k} \qquad (5)$$

$$A_k\big(M_k\big(C^k_j(X^k)\big)\big) = \Big( \sum_{q=1}^{n_k} \gamma^k_{i,q}\, \sigma_k\big(\langle X^k_q, w^k_j \rangle\big) \Big)_{i=1}^{n_k} \qquad (6)$$

Writing now $Z^{k+1}(i,j) = A_k\big(M_k\big(C^k_j(X^k)\big)\big)_i$,

$$\hat{X}^{k+1} = \big( Z^{k+1}(i,1), \dots, Z^{k+1}(i,p_{k+1}) \big)_{i=1}^{n_k}, \qquad X^{k+1} = \big( P^{k+1}_q(\hat{X}^{k+1}) \big)_{q=1}^{n_{k+1}} \qquad (7)$$

The mapping associated with a convolutional network therefore reads $N_W(X^0) := \langle X^N, W^N \rangle_{\mathbb{R}^{p_N n_{N-1}}}$.

In the following, we denote by $F_{(\sigma_i)_{i=1}^N, (p_i)_{i=1}^N}$ the space of all the functions $N_W$ defined as above on $I$ for any choice of $(W^k)_{k=0}^N$ such that, for $0 \le k \le N-1$, $W^k \in \mathbb{R}^{p_{k+1} \times d_k p_k}$ and $W^N \in \mathbb{R}^{d_N p_N}$. We omit the dependence of $F_{(\sigma_i)_{i=1}^N, (p_i)_{i=1}^N}$ on $(d_i)_{i=1}^N$ and $(n_i)_{i=1}^N$ to simplify the notation. We shall also consider the union space

$$F_{(\sigma_i)_{i=1}^N} := \bigcup_{(p_1,\dots,p_N) \in \mathbb{N}^N} F_{(\sigma_i)_{i=1}^N, (p_i)_{i=1}^N}$$

to encompass convolutional networks with varying numbers of filters across layers.

**Example.** Consider the case where at each layer the number of filters is 1, that is, for all $k \in \{1, \dots, N\}$, $p_k = 1$. We can then omit the dependence on $j$ of the convolution operators defined above. At each layer $k$, $\hat{X}^{k+1} \in \mathbb{R}^{n_k}$ is the new image, with $n_k$ pixels, obtained after a convolution, a nonlinear and a pooling operation, where $n_k$ is the number of patches extracted from the image $\hat{X}^k$ at layer $k-1$. Moreover $X^{k+1}$ is the decomposition of the image $\hat{X}^{k+1}$ into $n_{k+1}$ patches, obtained thanks to the patch extraction operators $(P^{k+1}_q)_{q=1}^{n_{k+1}}$. Finally, after $N$ layers, we obtain $\hat{X}^N = X^N \in \mathbb{R}^{n_{N-1}}$, the final image with $n_{N-1}$ pixels obtained after repeating the above operations $N$ times. Then the prediction layer is a linear combination of the coordinates of the final image $X^N$, from which we finally define, for all $X^0 \in I$, $N_W(X^0) := \langle X^N, W^N \rangle_{\mathbb{R}^{n_{N-1}}}$.

We show in Prop. 1 below that there exists an RKHS (Schölkopf & Smola, 2002) containing the space of functions $F_{(\sigma_i)_{i=1}^N}$, and this for a general class of activation functions $(\sigma_i)_{i=1}^N$ admitting a Taylor decomposition on $\mathbb{R}$. Moreover, we show that for a large class of nonlinear functions, the kernel is actually a c-universal kernel on $I$. It is worthwhile to emphasize that the definition of the RKHS $H_N$ we give below does not depend on the number of filters $(p_i)_{i=2}^{N+1}$ at each hidden layer. Therefore our framework encompasses networks with varying numbers of filters across layers (Scetbon & Harchaoui, 2020).

**Definition 3.1** (c-universal (Sriperumbudur et al., 2011)). A continuous positive semi-definite kernel $k$ on a compact Hausdorff space $X$ is called c-universal if the RKHS $H$ induced by $k$ is dense in $C(X)$ w.r.t. the uniform norm.

**Proposition 1.** Let $N \ge 2$ and $(\sigma_i)_{i=1}^N$ be a sequence of $N$ functions with a Taylor decomposition on $\mathbb{R}$. Moreover, let $(f_i)_{i=1}^N$ be the sequence of functions such that, for every $i \in \{1, \dots, N\}$,

$$f_i(x) := \sum_{t \ge 0} \frac{|\sigma_i^{(t)}(0)|}{t!}\, x^t.$$

Then the bivariate function defined on $I \times I$ as

$$K_N(X, X') := f_N \circ \cdots \circ f_2\Big( \frac{1}{n} \sum_{i=1}^n f_1\big( \langle X_i, X'_i \rangle_{\mathbb{R}^d} \big) \Big)$$

is a positive definite kernel on $I$, and the associated RKHS $H_N$ contains $F_{(\sigma_i)_{i=1}^N}$, the function space generated by convolutional networks. Moreover, as soon as $\sigma_i^{(t)}(0) \neq 0$ for all $i \ge 1$ and $t \ge 0$, $K_N$ is a c-universal kernel on $I$.
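With the construction of Prop. 1 in hand, the kernel itself is cheap to evaluate. The following is a minimal sketch (our own, not the authors' code), taking $f_1$ and $f_N \circ \cdots \circ f_2$ as inputs; the usage example takes $\sigma_{\exp}$ at every layer of a three-layer network, for which the majorant functions $f_i$ written above coincide with $\exp$.

```python
import numpy as np

def multilayer_kernel(X, Xp, f1, f_outer):
    """Sketch of K_N(X, X') from Prop. 1, with the averaged patch-wise inner
    products written above: X and X' are lists of n unit-norm patches, f1 acts
    on the inner products and f_outer plays the role of f_N o ... o f_2."""
    n = len(X)
    s = sum(f1(float(np.dot(xi, xpi))) for xi, xpi in zip(X, Xp)) / n
    return f_outer(s)

# Example: sigma_i(x) = exp(x) at every layer of a 3-layer network, so f_1 = exp
# and f_3 o f_2 = exp o exp.
f1 = np.exp
f_outer = lambda t: np.exp(np.exp(t))
rng = np.random.default_rng(1)
X  = [v / np.linalg.norm(v) for v in rng.standard_normal((4, 9))]
Xp = [v / np.linalg.norm(v) for v in rng.standard_normal((4, 9))]
print(multilayer_kernel(X, Xp, f1, f_outer))
```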
**Function space.** A simple fact is that

$$\inf_{f \in H_N} \mathbb{E}\big[(f(X) - Y)^2\big] \;\le\; \inf_{f \in F} \mathbb{E}\big[(f(X) - Y)^2\big]$$

where $F := F_{(\sigma_i)_{i=1}^N}$. In other words, the minimum expected risk in $H_N$ is a lower bound on the minimum expected risk in $F$. Since a major concern of recent years has been the spectacular performance of deep networks, i.e. how small they can drive the risk, analyzing them via this kind of Hilbertian envelope can shed more light on the relation between their multi-layer structure and their statistical performance. From this simple observation, one could obtain statistical bounds on $F$ using high-probability bounds from (Boucheron et al., 2005). However, we choose to focus on getting tight statistical bounds on $H_N$ instead, in order to explore the connection between the statistical behavior and the integral operator eigenspectrum.

**Universality.** For a large class of nonlinear activation functions, the kernel $K_N$ defined above is actually universal. Therefore the corresponding RKHS $H_N$ allows one to get universal consistency results for common loss functions (Steinwart & Christmann, 2008). In particular, if we choose the least-squares loss, we then have

$$\inf_{f \in H_N} \mathbb{E}\big[(f(X) - Y)^2\big] = R^*$$

where $R^*$ is the Bayes risk; see Corollary 5.29 in (Steinwart & Christmann, 2008). For instance, if at each layer the nonlinear function is $\sigma_{\exp}(x) = \exp(x)$, as in (Mairal et al., 2014b; Bietti & Mairal, 2017), then the corresponding RKHS is universal. There are other examples of activation functions satisfying the assumptions of Prop. 1, such as the square activation $\sigma_2(x) = x^2$, the smooth hinge activation $\sigma_{\mathrm{sh}}$, close to the ReLU activation, or a sigmoid-like function such as $\sigma_{\mathrm{erf}}$, similar to the sigmoid function, with

$$\sigma_{\mathrm{erf}}(x) = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\sqrt{\pi} x} e^{-t^2}\, \mathrm{d}t, \qquad \sigma_{\mathrm{sh}}(x) = \frac{1}{\pi} \Big( \sqrt{\pi}\, x \int_{-\infty}^{\sqrt{\pi} x} e^{-t^2}\, \mathrm{d}t + \tfrac{1}{2}\, e^{-\pi x^2} \Big).$$

In the following section, we study in detail the properties of the kernel $K_N$. In particular, we show an explicit Mercer decomposition of the kernel, from which we uncover a relationship between convolutional networks and functional ANOVA models.

## 4. Spectral Analysis of Convolutional Networks

We now give a Mercer decomposition of the kernel introduced in Prop. 1. From this Mercer decomposition, we first show that the multivariate function space generated by a convolutional network enjoys a decomposition related to the one in functional ANOVA models, where the highest order of interaction is controlled by the nonlinear functions $(\sigma_i)_{i=1}^N$ involved in the construction of the network. Moreover, we also obtain a tight control of the eigenvalue decay under general assumptions on the activation functions involved in the construction of the network. Recall that for all $m \ge 0$ we denote by $(Y^1_m, \dots, Y^{\alpha_{m,d}}_m)$ an arbitrary orthonormal basis of $H_m(S^{d-1})$. The next result gives an explicit Mercer decomposition of the kernels of interest.

**Theorem 4.1.** Let $N \ge 2$, $f_1$ a real-valued function that admits a Taylor decomposition around 0 on $[-1,1]$ with non-negative coefficients, and $(f_i)_{i=2}^N$ a sequence of real-valued functions such that $f_N \circ \cdots \circ f_2$ admits a Taylor decomposition around 0 on $\mathbb{R}$ with non-negative coefficients $(a_q)_{q \ge 0}$. Let us denote, for all $k_1, \dots, k_n \ge 0$, $(l_{k_1}, \dots, l_{k_n}) \in \{1, \dots, \alpha_{k_1,d}\} \times \cdots \times \{1, \dots, \alpha_{k_n,d}\}$ and $X \in I$,

$$e_{(k_i, l_{k_i})_{i=1}^n}(X) := \prod_{i=1}^n Y^{l_{k_i}}_{k_i}(X_i).$$

Then each $e_{(k_i, l_{k_i})_{i=1}^n}$ is an eigenfunction of $T_{K_N}$, the integral operator associated with the kernel $K_N$, with associated eigenvalue given by the formula

$$\mu_{(k_i, l_{k_i})_{i=1}^n} := \sum_{\alpha_1, \dots, \alpha_n \ge 0} \frac{a_q}{n^q} \binom{q}{\alpha_1, \dots, \alpha_n} \prod_{i=1}^n \lambda_{k_i, \alpha_i}, \qquad q := \alpha_1 + \cdots + \alpha_n,$$

where for any $k \ge 0$ and $\alpha \ge 0$ we have

$$\lambda_{k,\alpha} = |S^{d-2}|\, \Gamma\Big(\frac{d-1}{2}\Big) \sum_{s \ge 0} \frac{1}{(2s+k)!}\, \frac{\mathrm{d}^{2s+k}}{\mathrm{d}t^{2s+k}}\bigg|_{t=0} f_1^{\alpha}(t)\; \frac{(2s+k)!\, \Gamma\big(s + \frac{1}{2}\big)}{2^k\, (2s)!\, \Gamma\big(s + k + \frac{d}{2}\big)}.$$
Moreover, we have

$$K_N(X, X') = \sum_{k_1, \dots, k_n \ge 0}\ \sum_{1 \le l_{k_i} \le \alpha_{k_i,d}} \mu_{(k_i, l_{k_i})_{i=1}^n}\, e_{(k_i, l_{k_i})_{i=1}^n}(X)\, e_{(k_i, l_{k_i})_{i=1}^n}(X')$$

where the convergence is absolute and uniform.

From this Mercer decomposition, we get a decomposition of the multivariate function generated by a convolutional network. This decomposition is related to the one in functional ANOVA models. But first let us introduce useful notation. Let us write

$$L_2^{d\sigma_{d-1}}(S^{d-1}) = \{1\} \oplus L_{2,0}^{d\sigma_{d-1}}(S^{d-1})$$

where $L_{2,0}^{d\sigma_{d-1}}(S^{d-1})$ is the subspace orthogonal to $\{1\}$. Thus we have

$$\bigotimes_{i=1}^n L_2^{d\sigma_{d-1}}(S^{d-1}) = \bigotimes_{i=1}^n \Big[ \{1\} \oplus L_{2,0}^{d\sigma_{d-1}}(S^{d-1}) \Big].$$

Identifying the tensor product of $\{1\}$ with any Hilbert space with that Hilbert space itself, $\bigotimes_{i=1}^n L_2^{d\sigma_{d-1}}(S^{d-1})$ is the direct sum of all the subspaces of the form $L_{2,0}^{d\sigma_{d-1}}(X_{j_1}) \otimes \cdots \otimes L_{2,0}^{d\sigma_{d-1}}(X_{j_k})$ and $\{1\}$, where $\{j_1, \dots, j_k\}$ is a subset of $\{1, \dots, n\}$ and the subspaces in the decomposition are all orthogonal to each other. In fact, the function space generated by a convolutional network is a subset of $\bigotimes_{i=1}^n L_2^{d\sigma_{d-1}}(S^{d-1})$ which selects only a few orthogonal components in the decomposition of $\bigotimes_{i=1}^n L_2^{d\sigma_{d-1}}(S^{d-1})$ described above and allows only a few interactions between the covariates. Moreover, the highest order of interactions can be controlled by the depth of the network. Indeed, in the following proposition, we show that the eigenvalues $\mu_{(k_i, l_{k_i})_{i=1}^n}$ obtained from the Mercer decomposition vanish as soon as the interactions are large enough relative to the network depth (Scetbon & Harchaoui, 2020).

**Proposition 4.1.** Let $N \ge 2$, $f_1$ a real-valued function admitting a Taylor decomposition around 0 on $[-1,1]$ with non-negative coefficients, and $f_N \circ \cdots \circ f_2$ a polynomial of degree $D \ge 1$. Then, denoting $d^* := \min(D, n)$, we have that $\mu_{(k_i, l_{k_i})_{i=1}^n} = 0$ as soon as $|\{i : k_i \neq 0\}| > d^*$, and, for any $f \in F_{(\sigma_i)_{i=1}^N}$, $q > d^*$ and $\{j_1, \dots, j_q\} \subset \{1, \dots, n\}$, we have

$$f \perp L_{2,0}^{d\sigma_{d-1}}(X_{j_1}) \otimes \cdots \otimes L_{2,0}^{d\sigma_{d-1}}(X_{j_q}).$$

From this observation, we are able to characterize the function space of a convolutional network following the same strategy as functional ANOVA models, but allowing the main effects to live in a Hilbert space which is not necessarily an RKHS of univariate functions. We shall refer to such decompositions as ANOVA-like decompositions to underscore both their similarity to and their difference from functional ANOVA decompositions.

**Definition 4.1** (ANOVA-like Decomposition). Let $f$ be a real-valued function defined on $I$. We say that $f$ admits an ANOVA-like decomposition of order $r$ if $f$ can be written as

$$f(X_1, \dots, X_n) = C + \sum_{k=1}^{r} \sum_{\substack{A \subset \{1,\dots,n\} \\ |A| = k}} f_A(X_A)$$

where $C$ is a constant and, for all $k \in \{1, \dots, r\}$ and $A = \{j_1, \dots, j_k\} \subset \{1, \dots, n\}$ with $X_A = (X_{j_1}, \dots, X_{j_k})$, $f_A \in L_{2,0}^{d\sigma_{d-1}}(X_{j_1}) \otimes \cdots \otimes L_{2,0}^{d\sigma_{d-1}}(X_{j_k})$, and the decomposition is unique.

Here the main effects live in $L_{2,0}^{d\sigma_{d-1}}(S^{d-1})$, which is a Hilbert space of multivariate functions. Thanks to Prop. 4.1, any function generated by a convolutional network admits an ANOVA-like decomposition, where the highest order of interactions is at most $d^*$ and is completely determined by the functions $(\sigma_i)_{i=1}^N$. Moreover, even if the degree $D$ is arbitrarily large, the highest order of interaction cannot be larger than $n$.

**Example.** For any convolutional network such that $\sigma_1$ admits a Taylor decomposition around 0 on $[-1,1]$ (as tanh and other nonlinear activations do), and $(\sigma_i)_{i=2}^N$ are quadratic functions, the highest order of interaction allowed by the network is upper bounded by $d^* \le \min(2^{N-1}, n)$.
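To make the counting concrete, the short sketch below (our own illustration) evaluates the spherical-harmonic dimensions $\alpha_{m,d}$ from Section 2 and the interaction bound of Prop. 4.1 for this quadratic example with $N = 3$ and $n = 16$ patches.

```python
from math import comb

def alpha(m, d):
    """Dimension of H_m(S^{d-1}): alpha_{0,d} = 1, alpha_{1,d} = d, and
    alpha_{m,d} = C(d-1+m, m) - C(d-3+m, m-2) for m >= 2."""
    if m == 0:
        return 1
    if m == 1:
        return d
    return comb(d - 1 + m, m) - comb(d - 3 + m, m - 2)

def interaction_order(D, n):
    """Highest interaction order d* = min(D, n) allowed by a network whose
    composition f_N o ... o f_2 is a polynomial of degree D (Prop. 4.1)."""
    return min(D, n)

print([alpha(m, 9) for m in range(5)])      # dimensions for d = 9 (3 x 3 patches)
print(interaction_order(2 ** (3 - 1), 16))  # quadratic activations, N = 3, n = 16 -> 4
```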
Finally, from the Mercer decomposition, we can control the eigenvalue decay under general assumptions on the activations. Assume that $N \ge 2$ and that $f_1$ is a function admitting a Taylor decomposition on $[-1,1]$ with non-negative coefficients $(b_m)_{m \ge 0}$. Let us first show a control of the $(\lambda_{m,\alpha})_{m,\alpha}$ introduced in Theorem 4.1.

**Proposition 4.2.** If there exist $1 > r > 0$ and constants $0 < c_2 \le c_1$ such that, for all $m \ge 0$,

$$c_2 r^m \le b_m \le c_1 r^m,$$

then for all $\alpha \ge 1$ there exist constants $C_{1,\alpha}, C_{2,\alpha} > 0$, depending only on $\alpha$ and $d$, such that for all $m \ge 0$,

$$C_{2,\alpha}\, r^m \;\le\; \lambda_{m,\alpha} \;\le\; C_{1,\alpha} (m+1)^{\alpha-1} r^m.$$

We can now provide a tight control of the positive eigenvalues of the integral operator $T_{K_N}$ associated with the kernel $K_N$, sorted in non-increasing order with their multiplicities, which is exactly the ranked subsequence of positive eigenvalues in $\big(\mu_{(k_i, l_{k_i})_{i=1}^n}\big)$.

**Proposition 4.3.** Assume that $f_N \circ \cdots \circ f_2$ is a polynomial of degree $D \ge 1$ and let $d^* := \min(D, n)$. Let $(\mu_m)_{m=0}^M$ be the positive eigenvalues of the integral operator $T_{K_N}$ associated with the kernel $K_N$, ranked in non-increasing order with their multiplicities, where $M \in \mathbb{N} \cup \{+\infty\}$. Under the same assumptions as Prop. 4.2, we have $M = +\infty$ and there exist constants $C_3, C_4 > 0$ and $0 < \gamma < q$ such that, for all $m \ge 0$,

$$C_4\, e^{-q\, m^{1/((d-1) d^*)}} \;\le\; \mu_m \;\le\; C_3\, e^{-\gamma\, m^{1/((d-1) d^*)}}.$$

Thanks to this control, we obtain in the next section the convergence rate of the regularized least-squares estimator over a non-trivial set of distributions, for the class of RKHSs introduced earlier. Moreover, in some situations, we show that these convergence rates are actually minimax optimal from a nonparametric learning viewpoint.

## 5. Regularized Least-Squares for CNNs

We consider the standard nonparametric learning framework (Györfi et al., 2002; Steinwart & Christmann, 2008), where the goal is to learn, from independent and identically distributed examples $z = \{(x_1, y_1), \dots, (x_\ell, y_\ell)\}$ drawn from an unknown distribution $\rho$ on $Z := I \times Y$, a functional dependency $f_z : I \to Y$ between input $x \in I$ and output $y \in Y$. The joint distribution $\rho(x,y)$, the marginal distribution $\rho_I$, and the conditional distribution $\rho(\cdot|x)$ are related through $\rho(x,y) = \rho_I(x)\, \rho(y|x)$. We call $f_z$ the learning method or the estimator, and the learning algorithm is the procedure that, for any sample size $\ell \in \mathbb{N}$ and training set $z \in Z^\ell$, yields the learned function or estimator $f_z$. Here we assume that $Y \subset \mathbb{R}$, and given a function $f : I \to Y$, the ability of $f$ to describe the distribution $\rho$ is measured by its expected risk

$$R(f) := \int_{I \times Y} (f(x) - y)^2\, \mathrm{d}\rho(x,y). \qquad (9)$$

The minimizer over the space of measurable $Y$-valued functions on $I$ is

$$f_\rho(x) := \int_Y y\, \mathrm{d}\rho(y|x). \qquad (10)$$

We seek to characterize, with high probability, how close $R(f_z)$ is to $R(f_\rho)$.

Let us now consider the regularized least-squares (RLS) estimator. Consider as hypothesis space a Hilbert space $H$ of functions $f : I \to Y$. For any regularization parameter $\lambda > 0$ and training set $z \in Z^\ell$, the RLS estimator $f_{H,z,\lambda}$ is the solution of

$$\min_{f \in H}\ \frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 + \lambda \|f\|_H^2.$$

In the following we consider the specific estimators obtained from the RKHSs introduced in Prop. 1. But before stating the statistical bounds we have obtained, we recall basic definitions in order to clarify what we mean by asymptotic upper rate, lower rate and minimax rate optimality, following (Caponnetto & De Vito, 2007). We want to track the precise behavior of these rates and the effect of adding layers in a convolutional network. More precisely, we consider a class of Borel probability distributions $P$ on $I \times \mathbb{R}$ satisfying basic general assumptions.
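Before turning to these definitions, here is a minimal sketch (our own, via the representer theorem) of the RLS estimator introduced above; the Gram matrix would be built by evaluating $K_N$ on the training images, and `lam` is the regularization parameter $\lambda$.

```python
import numpy as np

def rls_fit(K, y, lam):
    """Sketch of regularized least-squares in an RKHS with Gram matrix K:
    for the objective (1/l) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2, the
    minimizer is f(x) = sum_i c_i K(x_i, x) with c = (K + lam * l * I)^{-1} y."""
    l = K.shape[0]
    return np.linalg.solve(K + lam * l * np.eye(l), y)

def rls_predict(K_test_train, c):
    """Evaluate the fitted function at new points, given their kernel values
    against the training inputs (one row per test point)."""
    return K_test_train @ c
```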
We consider rates of convergence with respect to the $L_2^{d\rho_I}$ norm, denoted $\|\cdot\|_\rho$.

**Definition 5.1** (Upper Rate of Convergence). A sequence $(a_\ell)_{\ell \ge 1}$ of positive numbers is called an upper rate of convergence in $L_2^{d\rho_I}$ norm over the model $P$ for the sequence of estimated solutions $(f_{z,\lambda_\ell})_{\ell \ge 1}$ using regularization parameters $(\lambda_\ell)_{\ell \ge 1}$ if

$$\lim_{\tau \to +\infty}\ \limsup_{\ell \to \infty}\ \sup_{\rho \in P}\ \rho^\ell\Big( z : \|f_{z,\lambda_\ell} - f_\rho\|_\rho^2 > \tau a_\ell \Big) = 0.$$

**Definition 5.2** (Minimax Lower Rate of Convergence). A sequence $(w_\ell)_{\ell \ge 1}$ of positive numbers is called a minimax lower rate of convergence in $L_2^{d\rho_I}$ norm over the model $P$ if

$$\lim_{\tau \to 0^+}\ \liminf_{\ell \to \infty}\ \inf_{f_z}\ \sup_{\rho \in P}\ \rho^\ell\Big( z : \|f_z - f_\rho\|_\rho^2 > \tau w_\ell \Big) = 1,$$

where the infimum is taken over all measurable learning methods with respect to $P$.

We call such sequences $(w_\ell)_{\ell \ge 1}$ (minimax) lower rates. Every sequence $(\hat{w}_\ell)_{\ell \ge 1}$ which decreases at least with the same speed as $(w_\ell)_{\ell \ge 1}$ is also a lower rate for this set of probability measures, and on every larger set of probability measures at least the same lower rate holds. The meaning of a lower rate $(w_\ell)_{\ell \ge 1}$ is that no measurable learning method can achieve an $L_2^{d\rho_I}(I)$-convergence rate $(a_\ell)_{\ell \ge 1}$, in the sense of Definition 5.1, decreasing faster than $(w_\ell)_{\ell \ge 1}$. In the case where the convergence rate of the sequence coincides with the minimax lower rate, we say that it is optimal in the minimax sense from a nonparametric learning viewpoint.

**Setting.** Here the hypothesis space considered is the RKHS $H_N$ associated with the kernel $K_N$ introduced in Prop. 1, where $N \ge 2$, $f_1$ is a function which admits a Taylor decomposition on $[-1,1]$ with non-negative coefficients $(b_m)_{m \ge 0}$, and $(f_i)_{i=2}^N$ is a sequence of real-valued functions such that $g := f_N \circ \cdots \circ f_2$ admits a Taylor decomposition on $\mathbb{R}$ with non-negative coefficients. In the following, we denote by $T_{\rho_I}$ the integral operator on $L_2^{d\rho_I}(I)$ associated with $K_N$, defined as

$$T_{\rho_I} : L_2^{d\rho_I}(I) \to L_2^{d\rho_I}(I), \qquad f \mapsto \int_I K_N(x, \cdot)\, f(x)\, \mathrm{d}\rho_I(x).$$

Let us now introduce the general assumptions on the class of probability measures considered. Let us denote $\mathrm{d}P := \otimes_{i=1}^n \mathrm{d}\sigma_{d-1}$ and, for $\omega \ge 1$, denote by $W_\omega$ the set of all probability measures $\nu$ on $I$ satisfying $\frac{\mathrm{d}\nu}{\mathrm{d}P} \le \omega$. Furthermore, for a constant $\omega \ge 1 > h > 0$, we introduce $W_{\omega,h} \subset W_\omega$, the set of probability measures $\nu$ on $I$ which additionally satisfy $\frac{\mathrm{d}\nu}{\mathrm{d}P} \ge h$.

**Assumptions 5.1** (Probability measures on $I \times Y$). Let $B, B', L, \sigma > 0$ be some constants and $0 < \beta \le 2$ a parameter. We denote by $\mathcal{F}_{B,B',L,\sigma,\beta}(P)$ the set of all probability measures $\rho$ on $I \times Y$ whose marginal $\rho_I$ belongs to $P$ and which satisfy the following properties:

- $\int_{I \times Y} y^2\, \mathrm{d}\rho(x,y) < \infty$ and $\|f_\rho\|^2_{L^\infty_{d\rho_I}} \le B$;
- there exists $g \in L_2^{d\rho_I}(I)$ such that $f_\rho = T_{\rho_I}^{\beta/2} g$ and $\|g\|^2_\rho \le B'$;
- there exist $\sigma > 0$ and $L > 0$ such that $\int_Y |y - f_\rho(x)|^m\, \mathrm{d}\rho(y|x) \le \frac{1}{2}\, m!\, \sigma^2 L^{m-2}$ for all $m \ge 2$ and $\rho_I$-almost every $x \in I$.

A sufficient condition for the last assumption is that $\rho$ is concentrated on $I \times [-M, M]$ for some constant $M > 0$. In the following we denote $G_{\omega,\beta} := \mathcal{F}_{B,B',L,\sigma,\beta}(W_\omega)$ and $G_{\omega,h,\beta} := \mathcal{F}_{B,B',L,\sigma,\beta}(W_{\omega,h})$.

**Remark 1.** Note that we do not make any assumption on the set of distributions related to the eigenvalue decay. Indeed, the control of the eigenvalue decay obtained in Proposition 4.3 allows us to define a non-trivial set of distributions adapted to these kernels.

The main result of this paper is given in the following theorem. See (Scetbon & Harchaoui, 2020) for details.

**Theorem 5.1.** Assume there exist $1 > r > 0$ and a constant $c_1 > 0$ such that $(b_m)_{m \ge 0}$ satisfies $b_m \le c_1 r^m$ for all $m \ge 0$. Moreover, assume that $f_N \circ \cdots \circ f_2$ is a polynomial of degree $D \ge 1$ and denote $d^* := \min(D, n)$. Let also $\omega \ge 1$ and $0 < \beta \le 2$.
Then there exist constants $A, C > 0$, independent of $\beta$, such that for any $\rho \in G_{\omega,\beta}$ and $\tau \ge 1$ the following holds.

- If $\beta > 1$: for $\lambda_\ell = \ell^{-1/\beta}$ and $\ell$ larger than a threshold depending on $A$, $\tau$, $\beta$ and $(d-1)d^*$, with $\rho^\ell$-probability at least $1 - e^{-4\tau}$ it holds that
$$\|f_{H_N,z,\lambda_\ell} - f_\rho\|_\rho^2 \;\le\; 3C\tau^2\, \frac{\log(\ell)^{(d-1)d^*}}{\ell}.$$
- If $\beta = 1$: for $\lambda_\ell = \frac{\log(\ell)^\mu}{\ell}$ with $\mu > (d-1)d^* > 0$ and $\ell$ larger than a threshold depending on $A$, $\tau$, $\mu$ and $(d-1)d^*$, with $\rho^\ell$-probability at least $1 - e^{-4\tau}$ it holds that
$$\|f_{H_N,z,\lambda_\ell} - f_\rho\|_\rho^2 \;\le\; 3C\tau^2\, \frac{\log(\ell)^{\mu}}{\ell}.$$
- If $\beta < 1$: for $\lambda_\ell = \frac{\log(\ell)^{(d-1)d^*/\beta}}{\ell}$ and $\ell$ larger than a threshold depending on $A$, $\tau$, $\beta$ and $(d-1)d^*$, with $\rho^\ell$-probability at least $1 - e^{-4\tau}$ it holds that
$$\|f_{H_N,z,\lambda_\ell} - f_\rho\|_\rho^2 \;\le\; 3C\tau^2\, \frac{\log(\ell)^{(d-1)d^*}}{\ell^{\beta}}.$$

The precise sample-size thresholds are given in the longer version of the paper (Scetbon & Harchaoui, 2020).

**Remark 2.** It is worth noting that the convergence rates obtained here do not depend on the number of parameters considered in the network, which may be much larger than the input dimension. Indeed, we show here that even on the largest possible function space generated by convolutional networks, learning from data can still happen.

In fact, from the above theorem we can deduce asymptotic upper rates of convergence. Indeed we have

$$\lim_{\tau \to +\infty}\ \limsup_{\ell \to \infty}\ \sup_{\rho \in G_{\omega,\beta}}\ \rho^\ell\Big( z : \|f_{z,\lambda_\ell} - f_\rho\|_\rho^2 > \tau a_\ell \Big) = 0$$

if one of the following conditions holds:

- $\beta > 1$, $\lambda_\ell = \ell^{-1/\beta}$ and $a_\ell = \frac{\log(\ell)^{(d-1)d^*}}{\ell}$;
- $\beta = 1$, $\lambda_\ell = \frac{\log(\ell)^\mu}{\ell}$ and $a_\ell = \frac{\log(\ell)^\mu}{\ell}$ for $\mu > (d-1)d^* > 0$;
- $\beta < 1$, $\lambda_\ell = \frac{\log(\ell)^{(d-1)d^*/\beta}}{\ell}$ and $a_\ell = \frac{\log(\ell)^{(d-1)d^*}}{\ell^\beta}$.

In order to investigate the optimality of these convergence rates, let us take a look at the lower rates.

**Theorem 5.2.** Under the exact same assumptions as Theorem 5.1, and assuming in addition that there exists a constant $0 < c_2 < c_1$ such that $b_m \ge c_2 r^m$ for all $m \ge 0$, we have that for any $0 < \beta \le 2$ and $\omega \ge 1 > h > 0$ such that $W_{\omega,h}$ is not empty,

$$\lim_{\tau \to 0^+}\ \liminf_{\ell \to \infty}\ \inf_{f_z}\ \sup_{\rho \in G_{\omega,h,\beta}}\ \rho^\ell\Big( z : \|f_z - f_\rho\|_\rho^2 > \tau w_\ell \Big) = 1, \qquad w_\ell = \frac{\log(\ell)^{(d-1)d^*}}{\ell}.$$

The infimum is taken over all measurable learning methods with respect to $G_{\omega,h,\beta}$.

**Rate optimality.** If the source condition is satisfied with $\beta > 1$, then the convergence rate of the regularized least-squares estimator stated in Theorem 5.1 is optimal in the minimax sense from a nonparametric learning viewpoint (Györfi et al., 2002; Caponnetto & De Vito, 2007; Steinwart & Christmann, 2008). It is worthwhile to note that the rate is close to the known optimal rate for nonparametric regression with $d$-dimensional inputs, setting the dimension of the sub-images or patches to $d$:

$$\|f_{z,\lambda_\ell} - f_\rho\|_\rho^2 \;\le\; 3C\tau^2\, \frac{\log(\ell)^{d-1}}{\ell}.$$

This connection highlights that the dimension of the sub-images or patches drives the statistical rate of convergence in this regime.

**Functional ANOVA.** In (Lin, 2000), the author establishes a similar result for functional ANOVA models assuming that the main effects live in $H^m([0,1])$. Indeed, denoting $d^*$ the highest order of interaction in the model, the regularized least-squares estimator enjoys a near-optimal rate of convergence, within a log factor of the optimal rate of convergence in one dimension:

$$\|f_{z,\lambda_\ell} - f_\rho\|_\rho^2 \;\le\; 3C\tau^2\, \frac{\log(\ell)^{d^*-1}}{\ell^{2m/(2m+1)}}.$$

This correspondence brings to light how the construction underlying a convolutional network allows one to overcome the curse of dimensionality. The rates in Theorem 5.1 highlight two important aspects of the behavior of CNNs. First, the highest order of interactions, governed by the network depth, controls the statistical performance of such models. If the order is small, we obtain optimal rates which are close to the optimal rate for estimating multivariate functions in $d$ dimensions, where $d$ is the patch size. Therefore we obtain learning rates which are almost dimension-free.
Second, adding layers makes the eigenvalue decay slower, and as soon as $\sigma_N \circ \cdots \circ \sigma_2$ is a polynomial of degree higher than $n$, the optimal rates are exactly the same as those obtained for a polynomial of degree $n$. There is thus a regime in which adding layers does not affect the convergence rate, while allowing the space of target functions to grow. Indeed the eigenvalue decay gives a concrete notion of the complexity of the function space considered. Given an eigensystem $(\mu_m)_{m \ge 0}$ and $(e_m)_{m \ge 0}$ of positive eigenvalues and eigenfunctions, respectively, of the integral operator $T_{K_N}$ associated with the kernel $K_N$ defined on $L_2(I)$, the associated RKHS $H_N$ is

$$H_N = \Big\{ f \in L_2(I) : f = \sum_{m \ge 0} a_m e_m, \ \sum_{m \ge 0} \frac{a_m^2}{\mu_m} < \infty \Big\}$$

endowed with the inner product $\langle f, g \rangle = \sum_{m \ge 0} a_m b_m / \mu_m$ for $f = \sum_{m \ge 0} a_m e_m$ and $g = \sum_{m \ge 0} b_m e_m$. From this definition we see that, as the decay of the eigenvalues of the integral operator gets slower, the RKHS gets larger. Therefore composing layers allows the function space generated by the network to grow, and hence allows the space of target functions to grow, while the rates remain the same.

## 6. Related Works

In (Caponnetto & De Vito, 2007), the authors obtained optimal convergence rates of the regularized least-squares estimator for any RKHS, yet for a hypothetical set of distributions. Indeed, the authors consider a subset of the set of all distributions for which the eigenvalue decay of the integral operator associated with the kernel is polynomial. This assumption may be too stringent or unsuited for the kernels we consider here; recall that the eigenvalue decay we are dealing with here is geometric instead. We show how to control the eigenvalue decay of the integral operator associated with the kernels introduced in Prop. 1 under general assumptions on the activation functions. This control allows us to circumvent abstract assumptions stated in terms of sets of distributions leading to the desired eigenvalue decay, as in (Caponnetto & De Vito, 2007). Thus, thanks to the spectral characterization of the kernels we consider, we can actually put forth a non-trivial set of distributions for which the RLS estimator with the corresponding RKHSs enjoys optimal convergence rates from a nonparametric learning viewpoint. Moreover, this set of distributions is independent of the choice of the RKHS (except for the source condition, which can simply be fixed). Therefore we are able to compare the convergence rates obtained for the different RKHSs defined in Prop. 1 on this set of distributions. In particular, we can compare convergence rates for different network depths.

In (Bach, 2017), the author considers a single-hidden-layer neural network with affine transforms and homogeneous functions acting on vectorial data. In this particular case, the author provides a detailed theoretical analysis of generalization performance. See e.g. (Barron, 1994; Anthony & Bartlett, 2009; Mohri et al., 2012) for related classical approaches and (Zhang et al., 2016; 2017) for more recent ones to analyze multi-layer perceptrons. Recent works (Bartlett et al., 2017; Neyshabur et al., 2018) studied various kinds of statistical bounds for multi-layer perceptrons. In (Bietti & Mairal, 2017), statistical bounds for convolutional kernel networks are presented. These statistical bounds typically depend on the product of spectral norms of the matrices stacking layer weights.
When put in our context, these bounds do not involve the full eigenspectrum of the integral operator associated with each layer.

We focused here on multi-layer convolutional networks on images. With appropriate changes, a similar analysis can be carried out for signal data, with sub-signals/windows in place of sub-images/patches, and for any lattice-structured data (including e.g. voxel data).

## 7. Conclusion

We have presented an approach to convolutional networks that allowed us to draw connections to nonparametric statistics. In particular, we have brought to light a decomposition akin to a functional ANOVA decomposition that explains how a convolutional network models interactions between sub-images or patches of images. This correspondence allows us to interpret how a convolutional network overcomes the curse of dimensionality when learning from dense images. Extensions of our work beyond least-squares estimators would be an interesting avenue for future work.

**Acknowledgements.** This work was developed during the academic year 2018-2019, while M. Scetbon, then with ENS Paris-Saclay, was visiting Univ. Washington. We acknowledge support from NSF CCF-1740551, NSF DMS-1839371, CIFAR LMB, and faculty research awards.

## References

Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge UP, New York, NY, USA, 2009.

Bach, F. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.

Barron, A. R. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, Jan 1994.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, 2017.

Bietti, A. and Mairal, J. Invariance and stability of deep convolutional representations. In Advances in Neural Information Processing Systems, 2017.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1872–1886, 2013.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

Carmeli, C., Vito, E. D., Toigo, A., and Umanità, V. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 08(01):19–61, 2010.

Efthimiou, C. and Frye, C. Spherical harmonics in p dimensions. World Scientific, 2014.

Folland, G. B. A course in abstract harmonic analysis. Chapman and Hall/CRC, 2016.

Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193, 1980.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. The MIT Press, 2016.

Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002.

Le Cun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.

Lin, Y. Tensor product space ANOVA models. The Annals of Statistics, 28(3):734–755, 2000.

Mairal, J., Bach, F. R., and Ponce, J. Sparse modeling for image and vision processing. Found. Trends Comput. Graph. Vis., 8(2-3):85–283, 2014a.
Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627–2635, 2014b.

Mallat, S. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

Mallat, S. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150203, 2016.

Mohri, M., Talwalkar, A., and Rostamizadeh, A. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2012.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.

Oyallon, E., Belilovsky, E., Zagoruyko, S., and Valko, M. Compressing the input for CNNs with the first-order scattering transform. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.

Poggio, T. and Anselmi, F. Visual Cortex and Deep Networks: Learning Invariant Representations. Computational Neuroscience Series. MIT Press, 2016.

Scetbon, M. and Harchaoui, Z. Harmonic decompositions of convolutional networks. CoRR, abs/2003.12756, 2020.

Schölkopf, B. and Smola, A. J. Learning with Kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning Series. MIT Press, 2002.

Simon, B. Trace ideals and their applications. AMS, 2010.

Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.

Sriperumbudur, B. K., Fukumizu, K., and Lanckriet, G. R. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(Jul):2389–2410, 2011.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008.

Zhang, Y., Lee, J. D., and Jordan, M. I. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, 2016.

Zhang, Y., Liang, P., and Wainwright, M. J. Convexified convolutional neural networks. In International Conference on Machine Learning, 2017.