Invariance and Stability of Deep Convolutional Representations

Alberto Bietti (Inria) alberto.bietti@inria.fr    Julien Mairal (Inria) julien.mairal@inria.fr
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

Abstract

In this paper, we study deep signal representations that are near-invariant to groups of transformations and stable to the action of diffeomorphisms without losing signal information. This is achieved by generalizing the multilayer kernel introduced in the context of convolutional kernel networks and by studying the geometry of the corresponding reproducing kernel Hilbert space. We show that the signal representation is stable, and that models from this functional space, such as a large class of convolutional neural networks, may enjoy the same stability.

1 Introduction

The results achieved by deep neural networks for prediction tasks have been impressive in domains where data is structured and available in large amounts. In particular, convolutional neural networks (CNNs) [14] have been shown to model well the local appearance of natural images at multiple scales, while also representing images with some invariance through pooling operations. Yet, the exact nature of this invariance and the characteristics of the functional spaces in which convolutional neural networks live are poorly understood; overall, these models are sometimes seen only as clever engineering black boxes designed with a lot of insight collected since their introduction. Understanding the geometry of these functional spaces is nevertheless a fundamental question. In addition to potentially bringing new intuition about the success of deep networks, it may for instance help solve the issue of regularization by providing ways to control the variations of prediction functions in a principled manner.

Small deformations of natural signals often preserve their main characteristics, such as the class label in a classification task (e.g., the same digit written with different handwriting styles corresponds to images that differ only by small deformations), and provide a much richer class of transformations than translations. Representations that are stable to small deformations allow building more robust models that exploit these invariances, which may lead to improved sample complexity. The scattering transform [5, 17] is a recent attempt to characterize convolutional multilayer architectures based on wavelets. The theory provides an elegant characterization of invariance and stability properties of signals represented via the scattering operator, through a notion of Lipschitz stability to the action of diffeomorphisms. Nevertheless, these networks do not involve learning in the classical sense, since the filters of the networks are pre-defined, and the resulting architecture differs significantly from the most commonly used ones.

In this work, we study these theoretical properties for more standard convolutional architectures, from the point of view of positive definite kernels [27]. Specifically, we consider a functional space derived from a kernel for multi-dimensional signals, which admits a multilayer and convolutional structure that generalizes the construction of convolutional kernel networks (CKNs) [15, 16]. We show that this functional space contains a large class of CNNs with smooth homogeneous activation functions in addition to CKNs [15], allowing us to obtain theoretical results for both classes of models.
The main motivation for introducing a kernel framework is to study data representations and predictive models separately. On the one hand, we study the translation-invariance properties of the kernel representation and its stability to the action of diffeomorphisms, obtaining guarantees similar to those of the scattering transform [17], while preserving signal information. When the kernel is appropriately designed, we also show how to obtain signal representations that are near-invariant to the action of any group of transformations. On the other hand, we show that these stability results can be translated to predictive models by controlling their norm in the functional space. In particular, the RKHS norm controls both stability and generalization, so that stability may lead to improved sample complexity.

Related work. Our work relies on image representations introduced in the context of convolutional kernel networks [15, 16], which yield a sequence of spatial maps similar to traditional CNNs, but where each point on a map is possibly infinite-dimensional and lives in a reproducing kernel Hilbert space (RKHS). The extension to signals with d spatial dimensions is straightforward. Since computing the corresponding Gram matrix as in classical kernel machines is computationally impractical, CKNs provide an approximation scheme that consists of learning finite-dimensional subspaces of each layer's RKHS, onto which the data is projected; see [15]. The resulting architecture of CKNs resembles traditional CNNs, with a subspace-learning interpretation and different unsupervised learning principles. Another major source of inspiration is the study of group invariance and stability to the action of diffeomorphisms of scattering networks [17], which introduced the main formalism and several proof techniques from harmonic analysis that were key to our results. Our main effort was to extend them to more general CNN architectures and to the kernel framework. Invariance to groups of transformations was also studied for more classical convolutional neural networks from methodological and empirical points of view [6, 9], as well as for shallow learned representations [1] and kernel methods [13, 19, 22]. Note also that other techniques combining deep neural networks and kernels have been introduced. Early multilayer kernel machines appear for instance in [7, 26]. Shallow kernels for images modelling local regions were proposed in [25], and a multilayer construction was proposed in [4]. More recently, kernel-based models were introduced in [2, 10, 18] to gain theoretical insight about classical multilayer neural networks, while kernels were used to define convex models for two-layer neural networks in [36]. Finally, we note that Lipschitz stability of deep models to additive perturbations was found to be important for robustness to adversarial examples [8]. Our results show that convolutional kernel networks already enjoy such a property.

Notation and basic mathematical tools. A positive definite kernel K that operates on a set X implicitly defines a reproducing kernel Hilbert space H of functions from X to R, along with a mapping ϕ : X → H. A predictive model associates to every point z in X a label in R; it consists of a linear function f in H such that f(z) = ⟨f, ϕ(z)⟩_H, where ϕ(z) is the data representation.
Given two points z, z′ in X, the Cauchy-Schwarz inequality allows us to control the variation of the predictive model f according to the geometry induced by the Hilbert norm ‖·‖_H:

|f(z) − f(z′)| ≤ ‖f‖_H ‖ϕ(z) − ϕ(z′)‖_H.   (1)

This property implies that two points z and z′ that are close to each other according to the RKHS norm should lead to similar predictions, when the model f has reasonably small norm in H. We then consider notation from signal processing similar to [17]. We call a signal x a function in L2(Ω, H), where Ω is a subset of R^d representing spatial coordinates and H is a Hilbert space, when ‖x‖²_L2 := ∫_Ω ‖x(u)‖²_H du < ∞, where du denotes the Lebesgue measure on R^d. Given a linear operator T : L2(Ω, H) → L2(Ω, H′), the operator norm is defined as ‖T‖_{L2(Ω,H)→L2(Ω,H′)} := sup_{‖x‖_{L2(Ω,H)} ≤ 1} ‖Tx‖_{L2(Ω,H′)}. For the sake of clarity, we drop norm subscripts from now on, using the notation ‖·‖ for Hilbert space norms, L2 norms, and L2 → L2 operator norms, while |·| denotes the Euclidean norm on R^d. Some useful mathematical tools are also presented in Appendix A.

2 Construction of the Multilayer Convolutional Kernel

We now present the multilayer convolutional kernel, which operates on signals with d spatial dimensions. The construction follows closely that of convolutional kernel networks [15] but generalizes it to input signals defined on the continuous domain Ω = R^d (which does not prevent signals from having compact support), as done by Mallat [17] for analyzing the properties of the scattering transform; the issue of discretization, where Ω is a discrete grid, is addressed in Section 2.1.

Figure 1: Construction of the k-th signal representation from the (k−1)-th one: patch extraction Pk maps xk−1 : Ω → Hk−1 to Pk xk−1 : Ω → Pk, the kernel mapping Mk gives Mk Pk xk−1(v) = ϕk(Pk xk−1(v)) ∈ Hk, and linear pooling Ak yields xk := Ak Mk Pk xk−1 : Ω → Hk. Note that while Ω is depicted as a box in R² here, our construction is supported on Ω = R^d. Similarly, a patch is represented as a square box for simplicity, but it may potentially have any shape.

In what follows, an input signal is denoted by x0 and lives in L2(Ω, H0), where H0 is typically R^p0 (e.g., with p0 = 3, x0(u) may represent the RGB pixel value at location u). Then, we build a sequence of RKHSs H1, H2, . . ., and transform x0 into a sequence of feature maps supported on Ω, respectively denoted by x1 in L2(Ω, H1), x2 in L2(Ω, H2), etc. As depicted in Figure 1, a new map xk is built from the previous one xk−1 by successively applying three operators that perform patch extraction (Pk), kernel mapping (Mk) in a new RKHS Hk, and linear pooling (Ak), respectively. When going up in the hierarchy, the points xk(u) carry information from larger signal neighborhoods centered at u in Ω, with more invariance, as we will formally show.

Patch extraction operator. Given the layer xk−1, we consider a patch shape Sk, defined as a compact centered subset of R^d, e.g., a box [−1, 1] × [−1, 1] for images, and we define the Hilbert space Pk := L2(Sk, Hk−1) equipped with the norm ‖z‖² = ∫_Sk ‖z(u)‖² dνk(u) for every z in Pk, where dνk is the normalized uniform measure on Sk. More precisely, we now define the linear patch extraction operator Pk : L2(Ω, Hk−1) → L2(Ω, Pk) such that for all u in Ω,

Pk xk−1(u) = (v ↦ xk−1(u + v))_{v ∈ Sk} ∈ Pk.

Note that by equipping Pk with a normalized measure, the operator Pk preserves the norm: by Fubini's theorem, we have ‖Pk xk−1‖ = ‖xk−1‖, and hence Pk xk−1 is in L2(Ω, Pk).
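To make the norm-preservation property of Pk concrete, here is a minimal discrete sketch (our own illustration, not part of the construction above, with hypothetical sizes): on a periodic one-dimensional signal, extracting patches of size e and weighting each patch entry by the normalized measure 1/e leaves the squared norm unchanged, mirroring the Fubini argument.

```python
import numpy as np

rng = np.random.default_rng(0)
N, e = 64, 5                       # signal length and patch size (hypothetical values)
x = rng.standard_normal(N)         # a discrete 1-D signal, treated as periodic

# Patch extraction: P x[n] = (x[n], ..., x[n+e-1]), with the normalized uniform
# measure on the patch giving each entry weight 1/e in the squared norm.
patches = np.stack([np.roll(x, -v) for v in range(e)], axis=1)   # shape (N, e)

norm_x2 = np.sum(x ** 2)
norm_Px2 = np.sum(patches ** 2) / e     # normalized measure: weight 1/e per entry

print(np.allclose(norm_x2, norm_Px2))   # True: ||P x|| = ||x||, as in the continuous case
```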
Kernel mapping operator. In a second stage, we map each patch of xk−1 to an RKHS Hk with a kernel mapping ϕk : Pk → Hk associated to a positive definite kernel Kk. It is then possible to define the non-linear pointwise operator Mk such that

Mk Pk xk−1(u) := ϕk(Pk xk−1(u)) ∈ Hk.

As in [15], we use homogeneous dot-product kernels of the form

Kk(z, z′) = ‖z‖ ‖z′‖ κk(⟨z, z′⟩ / (‖z‖ ‖z′‖))   with κk(1) = 1,   (2)

which ensures that ‖Mk Pk xk−1(u)‖ = ‖Pk xk−1(u)‖ and that Mk Pk xk−1 is in L2(Ω, Hk). Concrete examples of kernels satisfying (2), along with some additional properties, are presented in Appendix B.

Pooling operator. The last step to build the layer xk is to pool neighboring values in order to achieve some local shift-invariance. As in [15], we apply a linear convolution operator Ak with a Gaussian kernel at scale σk, hσk(u) := σk^{−d} h(u/σk), where h(u) = (2π)^{−d/2} exp(−|u|²/2). Then,

xk(u) = Ak Mk Pk xk−1(u) = ∫_{R^d} hσk(u − v) Mk Pk xk−1(v) dv ∈ Hk.

Applying Schur's test to the integral operator Ak (see Appendix A), we obtain that ‖Ak‖ ≤ 1. Thus, ‖xk‖ ≤ ‖Mk Pk xk−1‖ and xk is in L2(Ω, Hk). Note that a similar pooling operator is used in the scattering representation [5, 17], though in a different way which does not affect subsequent layers.

Multilayer construction. Finally, we obtain a multilayer representation by composing the previous operators multiple times. In order to increase invariance with each layer, the size of the patch Sk and the pooling scale σk typically grow exponentially with k, with σk and sup_{c ∈ Sk} |c| of the same order. With n layers, the final representation is given by the feature map

Φn(x0) := xn = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 x0 ∈ L2(Ω, Hn).   (3)

Then, we can define a kernel Kn on two signals x0 and x′0 by Kn(x0, x′0) := ⟨Φn(x0), Φn(x′0)⟩, whose RKHS H_Kn contains all functions of the form f(x0) = ⟨w, Φn(x0)⟩ with w in L2(Ω, Hn). The following lemma shows that this representation preserves all information about the signal at each layer, and that each feature map xk can be sampled on a discrete set with no loss of information. This suggests a natural approach to discretization, which we discuss next. For space limitation reasons, all proofs in this paper are relegated to Appendix C.

Lemma 1 (Signal preservation). Assume that Hk contains the linear functions z ↦ ⟨w, z⟩ for all w in Pk (this is true for all kernels Kk described in Appendix B). Then the signal xk−1 can be recovered from a sampling of xk = Ak Mk Pk xk−1 at discrete locations as soon as the union of the patches centered at these points covers all of Ω. It follows that xk can be reconstructed from such a sampling.
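Before turning to discretization, the following sketch (our own illustration, not from the paper) instantiates a homogeneous dot-product kernel of the form (2) with the exponential profile κ(t) = e^{t−1}, an admissible choice with κ(1) = 1 (the kernels actually used are those of Appendix B, not reproduced here), and checks numerically, through the kernel trick, that the associated mapping ϕ preserves norms and is non-expansive on random inputs; these are exactly the properties (A1-2) assumed later in Section 3.

```python
import numpy as np

def kappa(t):
    # One admissible dot-product profile with kappa(1) = 1 (assumed choice: exponential).
    return np.exp(t - 1.0)

def K(z, zp):
    # Homogeneous kernel of the form (2): K(z, z') = ||z|| ||z'|| kappa(<z, z'> / (||z|| ||z'||)).
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    return nz * nzp * kappa(np.dot(z, zp) / (nz * nzp))

rng = np.random.default_rng(0)
for _ in range(1000):
    z, zp = rng.standard_normal(10), rng.standard_normal(10)
    # Norm preservation (A1): ||phi(z)||^2 = K(z, z) = ||z||^2.
    assert np.isclose(K(z, z), np.dot(z, z))
    # Non-expansiveness (A2): ||phi(z) - phi(z')||^2 <= ||z - z'||^2, via the kernel trick.
    assert K(z, z) + K(zp, zp) - 2 * K(z, zp) <= np.sum((z - zp) ** 2) + 1e-9
print("norm preservation and non-expansiveness hold on these samples")
```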
2.1 From Theory to Practice: Discretization and Signal Preservation

The previous construction defines a kernel representation for general signals in L2(Ω, H0), which is an abstract object introduced for theoretical purposes, as is often done in signal processing [17]. In practice, signals are discrete, and it is thus important to discuss the problem of discretization, as done in [15]. For clarity, we limit the presentation to one-dimensional signals (Ω = R^d with d = 1), but the arguments easily extend to higher dimensions d when using box-shaped patches. Notation from the previous section is preserved, but we add a bar on top of the discrete analogues of the continuous objects; e.g., x̄k is a discrete feature map in ℓ2(Z, H̄k) for some RKHS H̄k.

Input signals x0 and x̄0. Discrete signals acquired by a physical device are often seen as local integrators of signals defined on a continuous domain (e.g., sensors from digital cameras integrate the pointwise distribution of photons that hit a sensor within a spatial window). Let us then consider a signal x0 in L2(Ω, H0) and a sampling interval s0. By defining x̄0 in ℓ2(Z, H0) such that x̄0[n] = x0(n s0) for all n in Z, it is thus natural to assume that x0 = A0 x, where A0 is a pooling operator (local integrator) applied to an original signal x. The role of A0 is to prevent aliasing and reduce high frequencies; typically, the scale σ0 of A0 should be of the same magnitude as s0, which we choose to be s0 = 1 in the following, without loss of generality. This natural assumption is kept later in the analysis.

Multilayer construction. We now want to build discrete feature maps x̄k in ℓ2(Z, H̄k) at each layer k, involving subsampling by a factor sk with respect to x̄k−1. We define the discrete analogues of the operators Pk (patch extraction), Mk (kernel mapping), and Ak (pooling) as follows: for n in Z,

P̄k x̄k−1[n] := ek^{−1/2} (x̄k−1[n], x̄k−1[n+1], . . . , x̄k−1[n+ek−1]) ∈ P̄k := H̄k−1^{ek}
M̄k P̄k x̄k−1[n] := ϕk(P̄k x̄k−1[n]) ∈ H̄k
x̄k[n] = Āk M̄k P̄k x̄k−1[n] := sk^{1/2} Σ_{m ∈ Z} hk[n sk − m] M̄k P̄k x̄k−1[m] = (hk ∗ M̄k P̄k x̄k−1)[n sk] ∈ H̄k,

where (i) P̄k extracts a patch of size ek starting at position n in x̄k−1 (defining a patch centered at n is also possible), which lives in the Hilbert space P̄k defined as the direct sum of ek copies of H̄k−1; (ii) M̄k is a kernel mapping identical to the one of the continuous case, which preserves the norm, like Mk; and (iii) Āk performs a convolution with a Gaussian filter followed by subsampling with factor sk. The next lemma shows that, under mild assumptions, this construction preserves signal information.

Lemma 2 (Signal recovery with subsampling). Assume that H̄k contains the linear functions z ↦ ⟨w, z⟩ for all w in P̄k and that ek ≥ sk. Then, x̄k−1 can be recovered from x̄k.

We note that this result relies on recovering the signal by deconvolution of the pooling convolution with filter hk, which is stable when its scale parameter, typically of order sk to prevent aliasing, is small enough. This suggests using small values for ek and sk, as in typical recent convolutional architectures [30].

Links between the parameters of the discrete and continuous models. Due to subsampling, the patch sizes in the continuous and discrete models are related by a multiplicative factor. Specifically, a patch of size ek in the discrete model corresponds to a patch Sk of diameter ek sk−1 sk−2 · · · s1 in the continuous case. The same holds true for the scale parameter of the Gaussian pooling.
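The discrete operators above can be put together as follows; this is a minimal sketch of one layer (our own illustration, with hypothetical sizes), where the kernel mapping M̄k is taken as the identity, i.e., the trivial linear case κk(t) = t of (2), whereas a CKN would instead use a richer kernel with a finite-dimensional projection.

```python
import numpy as np

def patch_extract(xbar, e):
    """P_k: patches (xbar[n], ..., xbar[n+e-1]) scaled by e^{-1/2}.

    xbar has shape (N, p) with N positions and p channels; output has shape (N, e*p).
    Boundaries are handled circularly for simplicity (an assumption made only here)."""
    patches = np.concatenate([np.roll(xbar, -v, axis=0) for v in range(e)], axis=1)
    return patches / np.sqrt(e)

def gaussian_pool(z, sigma, s):
    """A_k: convolution with a Gaussian filter at scale sigma, then subsampling by s,
    including the s^{1/2} factor of the discrete definition."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    h = np.exp(-t ** 2 / (2 * sigma ** 2))
    h /= h.sum()                       # L1-normalized filter
    out = np.stack([np.convolve(z[:, j], h, mode="same") for j in range(z.shape[1])], axis=1)
    return np.sqrt(s) * out[::s]       # keep every s-th position

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 3))      # hypothetical input map: 64 positions, 3 channels
e1, s1, sigma1 = 3, 2, 1.0             # hypothetical layer parameters, with e1 >= s1 (Lemma 2)
z1 = patch_extract(x0, e1)             # P_1 x_0, shape (64, 9); M_1 is the identity here
x1 = gaussian_pool(z1, sigma1, s1)     # x_1 = A_1 M_1 P_1 x_0, shape (32, 9)
print(x1.shape)
```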
2.2 From Theory to Practice: Kernel Approximation and Convolutional Kernel Networks

Besides discretization, two modifications are required to use the image representation described above in practice. The first one consists of using feature maps with finite spatial support, which introduces border effects that we do not study, but which are negligible when dealing with large realistic images. The second one requires finite-dimensional approximations of the kernel maps, leading to the convolutional kernel network model of [15]. Typically, the mapping of each RKHS is approximated by performing a projection onto a subspace of finite dimension, a classical approach to make kernel methods work at large scale [12, 31, 34]. One advantage of this approach is its compatibility with the RKHSs (meaning that the approximations live in the respective RKHSs), and the stability results we present next are preserved thanks to the non-expansiveness of the projection. It is then possible to derive theoretical results for the CKN model, which appears as a natural implementation of the kernel constructed previously; yet, we will also show in Section 5 that the results apply more broadly to CNNs that are contained in the functional space associated with the kernel.

3 Stability to Deformations and Translation Invariance

In this section, we study the translation invariance of the kernel representation described in Section 2 for continuous signals, as well as its stability under the action of diffeomorphisms. We use a characterization of stability similar to the one introduced by Mallat [17]: for a C1-diffeomorphism τ : Ω → Ω, let Lτ denote the linear operator defined by Lτ x(u) = x(u − τ(u)); the representation Φ(·) is stable under the action of diffeomorphisms if there exist two constants C1 and C2 such that

‖Φ(Lτ x) − Φ(x)‖ ≤ (C1 ‖∇τ‖∞ + C2 ‖τ‖∞) ‖x‖,   (4)

where ∇τ is the Jacobian of τ, ‖∇τ‖∞ := sup_{u ∈ Ω} ‖∇τ(u)‖, and ‖τ‖∞ := sup_{u ∈ Ω} |τ(u)|. As in [17], our results will assume the regularity condition ‖∇τ‖∞ < 1/2. In order to have a translation-invariant representation, we want C2 to be small (a translation is a diffeomorphism with ∇τ = 0), and indeed we will show that C2 is proportional to 1/σn, where σn is the scale of the last pooling layer, which typically increases exponentially with the number of layers n.

Note that unlike the scattering transform [17], we do not have a representation that preserves the norm, i.e., such that ‖Φ(x)‖ = ‖x‖. While the patch extraction Pk and kernel mapping Mk operators do preserve the norm, the pooling operators Ak may remove (or significantly reduce) frequencies of the signal that are larger than 1/σk. Yet, natural signals such as natural images often have high energy in the low-frequency domain (the power spectrum of natural images is often considered to have a polynomial decay in 1/f², where f is the signal frequency [33]). For such classes of signals, a large fraction of the signal energy is preserved by the pooling operator. In particular, with some additional assumptions on the kernels Kk, it is possible to show [3] that ‖Φ(x)‖ ≥ ‖An · · · A0 x‖. Additionally, when using a Gaussian kernel mapping ϕn+1 on top of the last feature map as a prediction layer, instead of a linear layer, the final representation Φf(x) := ϕn+1(Φn(A0 x)) preserves stability and always has unit norm (see the extended version of the paper [3] for details). This suggests that norm preservation may be a less relevant concern in our kernel setting.

3.1 Stability Results

In order to study the stability of the representation (3), we assume that the input signal x0 may be written as x0 = A0 x, where A0 is an initial pooling operator at scale σ0, which allows us to control the high frequencies of the signal in the first layer. As discussed in Section 2.1, this assumption is natural and compatible with any physical acquisition device. Note that σ0 can be taken arbitrarily small, making the operator A0 arbitrarily close to the identity, so that this assumption does not limit the generality of our results. Moreover, we make the following assumptions for each layer k:

(A1) Norm preservation: ‖ϕk(x)‖ = ‖x‖ for all x in Pk;
(A2) Non-expansiveness: ‖ϕk(x) − ϕk(x′)‖ ≤ ‖x − x′‖ for all x, x′ in Pk;
(A3) Patch sizes: there exists κ > 0 such that at any layer k we have sup_{c ∈ Sk} |c| ≤ κ σk−1.

Note that assumptions (A1-2) imply that the operators Mk preserve the norm and are non-expansive. Appendix B exposes a large class of homogeneous kernels that satisfy assumptions (A1-2).
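Before stating the general bound, the sketch below (a purely illustrative addition, not part of the paper) instantiates the deformation operator Lτ x(u) = x(u − τ(u)) on a discretized one-dimensional signal, for a hypothetical deformation τ(u) = ε sin(2πu), and reports the two quantities ‖τ‖∞ and ‖∇τ‖∞ that appear in the bound (4).

```python
import numpy as np

# Discretize [0, 1) and pick a smooth, small deformation tau (hypothetical example).
u = np.linspace(0.0, 1.0, 512, endpoint=False)
eps = 0.02
tau = eps * np.sin(2 * np.pi * u)                    # ||tau||_inf = eps
grad_tau = 2 * np.pi * eps * np.cos(2 * np.pi * u)   # ||grad tau||_inf = 2*pi*eps < 1/2

x = np.cos(6 * np.pi * u) + 0.5 * np.sin(14 * np.pi * u)   # a smooth test signal

# L_tau x(u) = x(u - tau(u)), computed here by periodic linear interpolation.
x_tau = np.interp(u - tau, u, x, period=1.0)

print("||tau||_inf      =", np.max(np.abs(tau)))
print("||grad tau||_inf =", np.max(np.abs(grad_tau)))
print("relative L2 change of the raw signal:",
      np.linalg.norm(x_tau - x) / np.linalg.norm(x))
```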
General bound for stability. The following result gives an upper bound on the quantity of interest, ‖Φ(Lτ x) − Φ(x)‖, in terms of the norms of various linear operators which control how τ affects each layer. The commutator of linear operators A and B is denoted by [A, B] = AB − BA.

Proposition 3. Let Φ(x) = Φn(A0 x), where Φn is defined in (3), for x in L2(Ω, H0). Then,

‖Φ(Lτ x) − Φ(x)‖ ≤ ( Σ_{k=1}^{n} ‖[Pk Ak−1, Lτ]‖ + ‖[An, Lτ]‖ + ‖Lτ An − An‖ ) ‖x‖.

In the case of a translation Lτ x(u) = Lc x(u) = x(u − c), it is easy to see that pooling and patch extraction operators commute with Lc (this is also known as covariance or equivariance to translations), so that we are left with the term ‖Lc An − An‖, which should control translation invariance. For general diffeomorphisms τ, we no longer have exact covariance, but we show below that the commutators are stable to τ, in the sense that ‖[Pk Ak−1, Lτ]‖ is controlled by ‖∇τ‖∞, while ‖Lτ An − An‖ is controlled by ‖τ‖∞ and decays with the pooling size σn.

Bound on ‖[Pk Ak−1, Lτ]‖. We begin by noting that Pk z can be identified with (Lc z)_{c ∈ Sk} isometrically for all z in L2(Ω, Hk−1), since ‖Pk z‖² = ∫_Sk ‖Lc z‖² dνk(c) by Fubini's theorem. Then,

‖Pk Ak−1 Lτ z − Lτ Pk Ak−1 z‖² = ∫_Sk ‖Lc Ak−1 Lτ z − Lτ Lc Ak−1 z‖² dνk(c) ≤ sup_{c ∈ Sk} ‖Lc Ak−1 Lτ z − Lτ Lc Ak−1 z‖²,

so that ‖[Pk Ak−1, Lτ]‖ ≤ sup_{c ∈ Sk} ‖[Lc Ak−1, Lτ]‖. The following result lets us bound ‖[Lc Ak−1, Lτ]‖ when |c| ≤ κ σk−1, which is satisfied under assumption (A3).

Lemma 4. Let Aσ be the pooling operator with kernel hσ(u) = σ^{−d} h(u/σ). If ‖∇τ‖∞ ≤ 1/2, there exists a constant C1 such that for any σ and any |c| ≤ κσ, we have ‖[Lc Aσ, Lτ]‖ ≤ C1 ‖∇τ‖∞, where C1 depends only on h and κ.

A similar result is obtained in Mallat [17, Lemma E.1] for commutators of the form [Aσ, Lτ], but we extend it to handle integral operators Lc Aσ with a shifted kernel. The proof (given in Appendix C.4) relies on the fact that [Lc Aσ, Lτ] is an integral operator, in order to bound its norm via Schur's test. Note that κ can be made larger, at the cost of an increase of the constant C1 of the order κ^{d+1}.

Bound on ‖Lτ An − An‖. We bound the operator norm ‖Lτ An − An‖ in terms of ‖τ‖∞ using the following result due to Mallat [17, Lemma 2.11], with σ = σn:

Lemma 5. If ‖∇τ‖∞ ≤ 1/2, we have ‖Lτ Aσ − Aσ‖ ≤ (C2/σ) ‖τ‖∞, with C2 = 2^d ‖∇h‖_1.

Combining Proposition 3 with Lemmas 4 and 5, we immediately obtain the following result.

Theorem 6. Let Φ(x) be the representation given by Φ(x) = Φn(A0 x) and assume (A1-3). If ‖∇τ‖∞ ≤ 1/2, we have

‖Φ(Lτ x) − Φ(x)‖ ≤ ( C1 (1 + n) ‖∇τ‖∞ + (C2/σn) ‖τ‖∞ ) ‖x‖.

This result matches the desired notion of stability in Eq. (4), with a translation-invariance factor that decays with σn. The dependence on a notion of depth (the number of layers n here) also appears in [17], with a factor equal to the maximal length of scattering paths and with the same condition ‖∇τ‖∞ ≤ 1/2. However, while the norm of the scattering representation is preserved as the length of these paths goes to infinity, the norm of Φ(x) can decrease with depth due to pooling layers, though this concern may be alleviated by using an additional non-linear prediction layer, as discussed previously (see also [3]).
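As a sanity check of the translation-invariance claim, the following short specialization (ours, combining only Proposition 3 and Lemma 5) considers a pure translation τ(u) = c, for which ∇τ = 0, ‖τ‖∞ = |c|, and the commutator terms vanish since Pk and Ak commute with Lc:

```latex
% Pure translation: L_tau = L_c with tau(u) = c, so \nabla\tau = 0 and \|\tau\|_\infty = |c|.
% The commutator terms of Proposition 3 vanish, and Lemma 5 with \sigma = \sigma_n gives
\[
\|\Phi(L_c x) - \Phi(x)\| \;\le\; \|L_c A_n - A_n\|\,\|x\|
\;\le\; \frac{C_2\,|c|}{\sigma_n}\,\|x\| ,
\]
```

which vanishes as the last pooling scale σn grows, consistent with the claim that C2 is proportional to 1/σn.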
3.2 Stability with Kernel Approximations

As in the analysis of the scattering transform [17], we have characterized the stability and shift-invariance of the data representation for continuous signals, in order to give some intuition about the properties of the corresponding discrete representation described in Section 2.1. Another approximation performed in the CKN model of [15] consists of adding projection steps onto finite-dimensional subspaces of the layers' RKHSs, as discussed in Section 2.2. Interestingly, the stability properties we have obtained previously are compatible with these steps. We may indeed redefine the operator Mk as the pointwise operation such that Mk z(u) = Πk ϕk(z(u)) for any map z in L2(Ω, Pk), instead of Mk z(u) = ϕk(z(u)); here, Πk : Hk → Fk is a projection operator onto a linear subspace. Then, Mk does not necessarily preserve the norm anymore, but ‖Mk z‖ ≤ ‖z‖, with a loss of information corresponding to the quality of approximation of the kernel Kk on the points z(u). On the other hand, the non-expansiveness of Mk is satisfied thanks to the non-expansiveness of the projection. Additionally, the CKN construction provides a finite-dimensional representation at each layer, which preserves the norm structure of the original Hilbert spaces isometrically. In summary, it is possible to show that the conclusions of Theorem 6 remain valid for this tractable CKN representation, but we lose signal information in the process. The stability of the predictions can then be controlled through the norm of the last (linear) layer, which is typically used as a regularizer [15].

4 Global Invariance to Group Actions

In Section 3, we have seen how the kernel representation of Section 2 creates invariance to translations, by commuting with the action of translations at intermediate layers, and how the last pooling layer, defined on the translation group, governs the final level of invariance. It is often useful to encode invariance to other groups of transformations, such as rotations or reflections (see, e.g., [9, 17, 22, 29]). Here, we show how this can be achieved by defining adapted patch extraction and pooling operators that commute with the action of a transformation group G (this is known as group covariance or equivariance). We assume that G is locally compact, so that we can define a left-invariant Haar measure µ, that is, a measure on G that satisfies µ(gS) = µ(S) for any Borel set S ⊂ G and any g in G. We assume that the initial signal x(u) is defined on G, and we define subsequent feature maps on the same domain. The action of an element g in G is denoted by Lg, where Lg x(u) = x(g⁻¹u). Then, we are interested in defining a layer, that is, a succession of patch extraction, kernel mapping, and pooling operators, that commutes with Lg, in order to achieve equivariance to the group G.

Patch extraction. We define patch extraction by Px(u) = (x(uv))_{v ∈ S} for all u in G, where S ⊂ G is a patch centered at the identity. P commutes with Lg, since

P Lg x(u) = (Lg x(uv))_{v ∈ S} = (x(g⁻¹uv))_{v ∈ S} = Px(g⁻¹u) = Lg Px(u).

Kernel mapping. The pointwise operator M is defined as in Section 2, and thus commutes with Lg.

Pooling. The pooling operator on the group G is defined in a similar fashion as in [22] by

Ax(u) = ∫_G x(uv) h(v) dµ(v) = ∫_G x(v) h(u⁻¹v) dµ(v),

where h is a pooling filter, typically localized around the identity element. It is easy to see from the first expression of Ax(u) that A Lg x(u) = Lg Ax(u), making the pooling operator G-equivariant.
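As a toy numerical check of these equivariance relations (our own illustration, using the cyclic group Z_N with addition as group operation and the counting measure as Haar measure), the sketch below verifies that patch extraction and group pooling commute with the action Lg.

```python
import numpy as np

N = 12                                    # cyclic group Z_N (hypothetical finite example)
rng = np.random.default_rng(0)
x = rng.standard_normal(N)                # a signal x : G -> R

def L(g, x):
    """Group action: L_g x(u) = x(g^{-1} u), i.e., x(u - g) on Z_N."""
    return np.roll(x, g, axis=0)

S = [0, 1, 2]                             # a "patch" of group elements near the identity
h = [0.5, 0.3, 0.2]                       # pooling filter localized around the identity

def P(x):
    """Patch extraction: P x(u) = (x(u v))_{v in S} = (x(u + v))_{v in S} on Z_N."""
    return np.stack([x[(np.arange(N) + v) % N] for v in S], axis=1)

def A(x):
    """Group pooling: A x(u) = sum_{v in S} x(u v) h(v)."""
    return sum(hv * x[(np.arange(N) + v) % N] for hv, v in zip(h, S))

g = 5
print(np.allclose(P(L(g, x)), L(g, P(x))))   # True: P L_g = L_g P
print(np.allclose(A(L(g, x)), L(g, A(x))))   # True: A L_g = L_g A
```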
In our analysis of stability in Section 3, we saw that inner pooling layers are useful to guarantee stability to local deformations, while global invariance is achieved mainly through the last pooling layer. In some cases, one only needs stability to a subgroup of G while achieving global invariance to the whole group; e.g., in the roto-translation group [21], one might want invariance to a global rotation but stability to local translations. Then, one can pool only on the subgroup to be stabilized (e.g., translations) in intermediate layers, while pooling on the entire group at the last layer to achieve the global group invariance.

5 Link with Convolutional Neural Networks

In this section, we study the connection between the kernel representation defined in Section 2 and CNNs. Specifically, we show that the RKHS H_Kn obtained from our kernel construction contains a set of CNNs on continuous domains with certain types of smooth homogeneous activations. An important consequence is that the stability results of the previous sections apply to this class of CNNs.

CNN maps construction. We now define a CNN function fσ that takes as input an image x0 in L2(Ω, R^p0) with p0 channels and builds a sequence of feature maps, represented at layer k as a function zk in L2(Ω, R^pk) with pk channels; it performs linear convolutions with a set of filters (w^i_k)_{i=1,...,pk}, followed by a pointwise activation function σ, to obtain intermediate feature maps z̃k, then applies a linear pooling filter and repeats the same operations at each layer. Note that here, each w^i_k is in L2(Sk, R^{pk−1}), with channels denoted by w^{ij}_k in L2(Sk, R). Formally, the intermediate map z̃k in L2(Ω, R^pk) is obtained for k ≥ 1 by

z̃^i_k(u) = nk(u) σ( ⟨w^i_k, Pk zk−1(u)⟩ / nk(u) ),   (8)

where z̃k(u) = (z̃^1_k(u), . . . , z̃^pk_k(u)) in R^pk, and Pk is the patch extraction operator, which operates here on finite-dimensional maps. The activation involves a pointwise non-linearity σ along with a quantity nk(u) that is independent of the filters and that will be made explicit in the sequel. Finally, the map zk is obtained by using a pooling operator as in Section 2, with zk = Ak z̃k and z0 = x0.

Homogeneous activations. The choice of non-linearity σ relies on Lemma B.2 of the appendix, which shows that for many choices of smooth functions σ, the RKHS Hk defined in Section 2 contains the linear functions z ↦ ‖z‖ σ(⟨g, z⟩/‖z‖) for all g in Pk. While this homogenization involving the quantities ‖z‖ is not standard in classical CNNs, we note that (i) the most successful activation function, namely the rectified linear unit, is homogeneous, that is, relu(⟨g, z⟩) = ‖z‖ relu(⟨g, z⟩/‖z‖); and (ii) while relu is non-smooth and thus not in our RKHSs, there exists a smoothed variant that satisfies the conditions of Lemma B.2 for useful kernels. As noticed in [35, 36], this is for instance the case for the inverse polynomial kernel described in Appendix B. In Figure 2, we plot and compare these different variants of relu. We may now define the quantities nk(u) := ‖Pk xk−1(u)‖ in (8), which are due to the homogenization and are independent of the filters w^i_k.

Classification layer. The final CNN prediction function fσ is given by inner products with the feature maps of the last layer: fσ(x0) = ⟨wn+1, zn⟩, with parameters wn+1 in L2(Ω, R^pn). The next result shows that for appropriate σ, the function fσ is in H_Kn. The construction of this function in the RKHS and the proof are given in Appendix D. We note that a similar construction for fully connected networks with constraints on weights and inputs was given in [35].

Proposition 7 (CNNs and RKHSs). Assume that the activation σ satisfies Cσ(a) < ∞ for all a ≥ 0, where Cσ is defined for a given kernel in Lemma B.2. Then the CNN function fσ defined above is in the RKHS H_Kn, with norm

‖fσ‖² ≤ Σ_{i=1}^{pn} ‖w^i_{n+1}‖²_2 B_{n,i},

where B_{n,i} is defined recursively by B_{1,i} = Cσ²(‖w^i_1‖²_2) and B_{k,i} = Cσ²( pk−1 Σ_{j=1}^{pk−1} ‖w^{ij}_k‖²_2 B_{k−1,j} ).
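The homogenization in (8) can be checked numerically: because relu is positively homogeneous, multiplying by nk(u) outside the non-linearity and dividing inside leaves a relu layer unchanged. The sketch below (our own illustration with hypothetical sizes; it uses plain relu, whereas the paper's smoothed variant sReLU is defined through Lemma B.2 and not reproduced here) implements one intermediate map of the form (8) on already-extracted patches, taking nk(u) as the patch norm for simplicity (in the paper, nk(u) is the norm of the kernel-representation patch, which is likewise independent of the filters).

```python
import numpy as np

def homogeneous_layer(patches, W, sigma):
    """One intermediate CNN map of the form (8), acting on extracted patches.

    patches: array of shape (N, q), one patch P_k z_{k-1}(u) per position u;
    W: filters of shape (p_k, q); sigma: pointwise activation."""
    n = np.linalg.norm(patches, axis=1, keepdims=True)      # n_k(u), taken as the patch norm here
    pre = patches @ W.T                                      # <w_k^i, P_k z_{k-1}(u)>
    return n * sigma(pre / n)

relu = lambda t: np.maximum(t, 0.0)

rng = np.random.default_rng(0)
patches = rng.standard_normal((32, 15))   # hypothetical: 32 positions, patch dimension 15
W = rng.standard_normal((8, 15))          # hypothetical: 8 output channels

out_homog = homogeneous_layer(patches, W, relu)
out_plain = relu(patches @ W.T)           # a standard (non-homogenized) relu layer

# Positive homogeneity of relu: relu(<w, p>) = ||p|| relu(<w, p> / ||p||), so both coincide.
print(np.allclose(out_homog, out_plain))  # True
```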
Figure 2: Comparison of one-dimensional functions obtained with relu and smoothed relu (sReLU) activations. (Left) The non-homogeneous setting of [35, 36]. (Right) Our homogeneous setting, f : x ↦ |x| σ(wx/|x|), for different values of the parameter w. Note that for w ≥ 0.5, sReLU and ReLU are indistinguishable.

The results of this section imply that our study of the geometry of the kernel representations, and in particular the stability and invariance properties of Section 3, apply to the generic CNNs defined above, thanks to the Lipschitz smoothness relation (1). The smoothness is then controlled by the RKHS norm of these functions, which sheds light on the links between generalization and stability. In particular, functions with low RKHS norm (a.k.a. "large margin" functions) are known to generalize better to unseen data (see, e.g., the notion of margin bounds for SVMs [27, 28]). This implies, for instance, that generalization is harder if the task requires classifying two slightly deformed images with different labels, since this requires a function with large RKHS norm according to our stability analysis. In contrast, if a stable function (i.e., one with small RKHS norm) is sufficient to do well on a training set, learning becomes easier and few samples may be enough for good generalization.

Acknowledgements

This work was supported by a grant from ANR (MACARON project under grant number ANR-14-CE23-0003-01), by the ERC grant number 714381 (SOLARIS project), and by the MSR-Inria joint center.

References

[1] F. Anselmi, L. Rosasco, and T. Poggio. On invariance and selectivity in representation learning. Information and Inference, 5(2):134-158, 2016.
[2] F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical kernel machines. preprint arXiv:1508.01084, 2015.
[3] A. Bietti and J. Mairal. Group invariance and stability to deformations of deep convolutional representations. preprint arXiv:1706.03078, 2017.
[4] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[5] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1872-1886, 2013.
[6] J. Bruna, A. Szlam, and Y. LeCun. Learning stable group invariant representations with convolutional networks. preprint arXiv:1301.3537, 2013.
[7] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems (NIPS), 2009.
[8] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning (ICML), 2017.
[9] T. Cohen and M. Welling. Group equivariant convolutional networks. In International Conference on Machine Learning (ICML), 2016.
[10] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NIPS), 2016.
[11] J. Diestel and J. J. Uhl. Vector Measures. American Mathematical Society, 1977.
[12] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research (JMLR), 2:243-264, 2001.
[13] B. Haasdonk and H. Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine Learning, 68(1):35-61, 2007.
[14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
[15] J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[16] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[17] S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.
[18] G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. Journal of Machine Learning Research (JMLR), 12:2563-2581, 2011.
[19] Y. Mroueh, S. Voinea, and T. A. Poggio. Learning with group invariant features: A kernel perspective. In Advances in Neural Information Processing Systems (NIPS), 2015.
[20] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.
[21] E. Oyallon and S. Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[22] A. Raj, A. Kumar, Y. Mroueh, T. Fletcher, and B. Schölkopf. Local group invariant representations via orbit embeddings. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[23] S. Saitoh. Integral Transforms, Reproducing Kernels and their Applications, volume 369. CRC Press, 1997.
[24] I. J. Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96-108, 1942.
[25] B. Schölkopf. Support Vector Learning. PhD thesis, Technische Universität Berlin, 1997.
[26] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[27] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[28] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[29] L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
[31] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the International Conference on Machine Learning (ICML), 2000.
[32] E. M. Stein. Harmonic Analysis: Real-variable Methods, Orthogonality, and Oscillatory Integrals. Princeton University Press, 1993.
[33] A. Torralba and A. Oliva. Statistics of natural image categories. Network: Computation in Neural Systems, 14(3):391-412, 2003.
[34] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.
[35] Y. Zhang, J. D. Lee, and M. I. Jordan. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning (ICML), 2016.
[36] Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. In International Conference on Machine Learning (ICML), 2017.