Invariance and Stability of Deep Convolutional Representations

Alberto Bietti (Inria) alberto.bietti@inria.fr    Julien Mairal (Inria) julien.mairal@inria.fr
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

Abstract

In this paper, we study deep signal representations that are near-invariant to groups of transformations and stable to the action of diffeomorphisms without losing signal information. This is achieved by generalizing the multilayer kernel introduced in the context of convolutional kernel networks and by studying the geometry of the corresponding reproducing kernel Hilbert space. We show that the signal representation is stable, and that models from this functional space, such as a large class of convolutional neural networks, may enjoy the same stability.

1 Introduction

The results achieved by deep neural networks for prediction tasks have been impressive in domains where data is structured and available in large amounts. In particular, convolutional neural networks (CNNs) [14] have been shown to model well the local appearance of natural images at multiple scales, while also representing images with some invariance through pooling operations. Yet, the exact nature of this invariance and the characteristics of the functional spaces in which convolutional neural networks live are poorly understood; overall, these models are sometimes seen only as clever engineering black boxes designed with a lot of insight collected since their introduction. Understanding the geometry of these functional spaces is nevertheless a fundamental question. In addition to potentially bringing new intuition about the success of deep networks, it may for instance help solve the issue of regularization by providing ways to control the variations of prediction functions in a principled manner.

Small deformations of natural signals often preserve their main characteristics, such as the class label in a classification task (e.g., the same digit written with different handwriting styles corresponds to images that differ only by small deformations), and provide a much richer class of transformations than translations. Representations that are stable to small deformations allow building more robust models that exploit these invariances, which may lead to improved sample complexity. The scattering transform [5, 17] is a recent attempt to characterize convolutional multilayer architectures based on wavelets. The theory provides an elegant characterization of invariance and stability properties of signals represented via the scattering operator, through a notion of Lipschitz stability to the action of diffeomorphisms. Nevertheless, these networks do not involve learning in the classical sense, since the filters of the networks are pre-defined, and the resulting architecture differs significantly from the most commonly used ones.

In this work, we study these theoretical properties for more standard convolutional architectures, from the point of view of positive definite kernels [27]. Specifically, we consider a functional space derived from a kernel for multi-dimensional signals, which admits a multilayer and convolutional structure that generalizes the construction of convolutional kernel networks (CKNs) [15, 16]. We show that this functional space contains a large class of CNNs with smooth homogeneous activation functions in addition to CKNs [15], allowing us to obtain theoretical results for both classes of models.
The main motivation for introducing a kernel framework is to study data representations and predictive models separately. On the one hand, we study the translation-invariance properties of the kernel representation and its stability to the action of diffeomorphisms, obtaining guarantees similar to those of the scattering transform [17], while preserving signal information. When the kernel is appropriately designed, we also show how to obtain signal representations that are near-invariant to the action of any group of transformations. On the other hand, we show that these stability results can be translated to predictive models by controlling their norm in the functional space. In particular, the RKHS norm controls both stability and generalization, so that stability may lead to improved sample complexity.

Related work. Our work relies on image representations introduced in the context of convolutional kernel networks [15, 16], which yield a sequence of spatial maps similar to traditional CNNs, but where each point on a map is possibly infinite-dimensional and lives in a reproducing kernel Hilbert space (RKHS). The extension to signals with d spatial dimensions is straightforward. Since computing the corresponding Gram matrix as in classical kernel machines is computationally impractical, CKNs provide an approximation scheme that consists of learning finite-dimensional subspaces of each layer's RKHS, onto which the data is projected; see [15]. The resulting architecture of CKNs resembles traditional CNNs, with a subspace-learning interpretation and different unsupervised learning principles. Another major source of inspiration is the study of group invariance and stability to the action of diffeomorphisms of scattering networks [17], which introduced the main formalism and several proof techniques from harmonic analysis that were key to our results. Our main effort was to extend them to more general CNN architectures and to the kernel framework. Invariance to groups of transformations was also studied for more classical convolutional neural networks from methodological and empirical points of view [6, 9], as well as for shallow learned representations [1] and kernel methods [13, 19, 22]. Note also that other techniques combining deep neural networks and kernels have been introduced. Early multilayer kernel machines appear for instance in [7, 26]. Shallow kernels for images modelling local regions were proposed in [25], and a multilayer construction was proposed in [4]. More recently, kernel-based models were introduced in [2, 10, 18] to gain theoretical insight about classical multilayer neural networks, while kernels were used to define convex models for two-layer neural networks in [36]. Finally, we note that Lipschitz stability of deep models to additive perturbations was found to be important for robustness to adversarial examples [8]. Our results show that convolutional kernel networks already enjoy such a property.

Notation and basic mathematical tools. A positive definite kernel K that operates on a set X implicitly defines a reproducing kernel Hilbert space H of functions from X to R, along with a mapping ϕ : X → H. A predictive model associates to every point z in X a label in R; it consists of a linear function f in H such that f(z) = ⟨f, ϕ(z)⟩_H, where ϕ(z) is the data representation.
Given two points z, z′ in X, the Cauchy-Schwarz inequality allows us to control the variation of the predictive model f according to the geometry induced by the Hilbert norm ‖·‖_H:

|f(z) − f(z′)| ≤ ‖f‖_H ‖ϕ(z) − ϕ(z′)‖_H.   (1)

This property implies that two points z and z′ that are close to each other according to the RKHS norm should lead to similar predictions, when the model f has reasonably small norm in H. We then consider notation from signal processing similar to [17]. We call a signal x a function in L2(Ω, H), where Ω is a subset of R^d representing spatial coordinates and H is a Hilbert space, when ‖x‖²_L2 := ∫_Ω ‖x(u)‖²_H du < ∞, where du denotes the Lebesgue measure on R^d. Given a linear operator T : L2(Ω, H) → L2(Ω, H′), the operator norm is defined as ‖T‖_{L2(Ω,H)→L2(Ω,H′)} := sup_{‖x‖_{L2(Ω,H)} ≤ 1} ‖Tx‖_{L2(Ω,H′)}. For the sake of clarity, we drop norm subscripts from now on, using the notation ‖·‖ for Hilbert space norms, L2 norms, and L2 → L2 operator norms, while |·| denotes the Euclidean norm on R^d. Some useful mathematical tools are also presented in Appendix A.

2 Construction of the Multilayer Convolutional Kernel

We now present the multilayer convolutional kernel, which operates on signals with d spatial dimensions. The construction follows closely that of convolutional kernel networks [15] but generalizes it to input signals defined on the continuous domain Ω = R^d (which does not prevent signals from having compact support), as done by Mallat [17] for analyzing the properties of the scattering transform; the issue of discretization, where Ω is a discrete grid, is addressed in Section 2.1.

Figure 1: Construction of the k-th signal representation from the (k−1)-th one: patch extraction Pk maps xk−1 : Ω → Hk−1 to Pk xk−1 : Ω → Pk, the kernel mapping Mk gives Mk Pk xk−1(v) = ϕk(Pk xk−1(v)) ∈ Hk, and linear pooling Ak yields xk := Ak Mk Pk xk−1 : Ω → Hk. Note that while Ω is depicted as a box in R² here, our construction is supported on Ω = R^d. Similarly, a patch is represented as a square box for simplicity, but it may potentially have any shape.

In what follows, an input signal is denoted by x0 and lives in L2(Ω, H0), where H0 is typically R^p0 (e.g., with p0 = 3, x0(u) may represent the RGB pixel value at location u). Then, we build a sequence of RKHSs H1, H2, . . ., and transform x0 into a sequence of feature maps supported on Ω, respectively denoted by x1 in L2(Ω, H1), x2 in L2(Ω, H2), etc. As depicted in Figure 1, a new map xk is built from the previous one xk−1 by successively applying three operators that perform patch extraction (Pk), kernel mapping (Mk) in a new RKHS Hk, and linear pooling (Ak), respectively. When going up in the hierarchy, the points xk(u) carry information from larger signal neighborhoods centered at u in Ω, with more invariance, as we will formally show.

Patch extraction operator. Given the layer xk−1, we consider a patch shape Sk, defined as a compact centered subset of R^d, e.g., a box [−1, 1] × [−1, 1] for images, and we define the Hilbert space Pk := L2(Sk, Hk−1) equipped with the norm ‖z‖² = ∫_Sk ‖z(u)‖² dνk(u) for every z in Pk, where dνk is the normalized uniform measure on Sk. More precisely, we now define the linear patch extraction operator Pk : L2(Ω, Hk−1) → L2(Ω, Pk) such that for all u in Ω,

Pk xk−1(u) = (v ↦ xk−1(u + v))_{v ∈ Sk} ∈ Pk.

Note that by equipping Pk with a normalized measure, the operator Pk preserves the norm: by Fubini's theorem, we have ‖Pk xk−1‖ = ‖xk−1‖, and hence Pk xk−1 is in L2(Ω, Pk).
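To make the norm-preservation property of Pk concrete, here is a minimal discrete sketch (our own illustration, not part of the construction above, with hypothetical sizes): on a periodic one-dimensional signal, extracting patches of size e and weighting each patch entry by the normalized measure 1/e leaves the squared norm unchanged, mirroring the Fubini argument.

```python
import numpy as np

rng = np.random.default_rng(0)
N, e = 64, 5                       # signal length and patch size (hypothetical values)
x = rng.standard_normal(N)         # a discrete 1-D signal, treated as periodic

# Patch extraction: P x[n] = (x[n], ..., x[n+e-1]), with the normalized uniform
# measure on the patch giving each entry weight 1/e in the squared norm.
patches = np.stack([np.roll(x, -v) for v in range(e)], axis=1)   # shape (N, e)

norm_x2 = np.sum(x ** 2)
norm_Px2 = np.sum(patches ** 2) / e     # normalized measure: weight 1/e per entry

print(np.allclose(norm_x2, norm_Px2))   # True: ||P x|| = ||x||, as in the continuous case
```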
Kernel mapping operator. In a second stage, we map each patch of xk−1 to an RKHS Hk with a kernel mapping ϕk : Pk → Hk associated to a positive definite kernel Kk. It is then possible to define the non-linear pointwise operator Mk such that

Mk Pk xk−1(u) := ϕk(Pk xk−1(u)) ∈ Hk.

As in [15], we use homogeneous dot-product kernels of the form

Kk(z, z′) = ‖z‖ ‖z′‖ κk(⟨z, z′⟩ / (‖z‖ ‖z′‖))   with κk(1) = 1,   (2)

which ensures that ‖Mk Pk xk−1(u)‖ = ‖Pk xk−1(u)‖ and that Mk Pk xk−1 is in L2(Ω, Hk). Concrete examples of kernels satisfying (2), along with some additional properties, are presented in Appendix B.

Pooling operator. The last step to build the layer xk is to pool neighboring values in order to achieve some local shift-invariance. As in [15], we apply a linear convolution operator Ak with a Gaussian kernel at scale σk, hσk(u) := σk^{−d} h(u/σk), where h(u) = (2π)^{−d/2} exp(−|u|²/2). Then,

xk(u) = Ak Mk Pk xk−1(u) = ∫_{R^d} hσk(u − v) Mk Pk xk−1(v) dv ∈ Hk.

Applying Schur's test to the integral operator Ak (see Appendix A), we obtain that ‖Ak‖ ≤ 1. Thus, ‖xk‖ ≤ ‖Mk Pk xk−1‖ and xk is in L2(Ω, Hk). Note that a similar pooling operator is used in the scattering representation [5, 17], though in a different way which does not affect subsequent layers.

Multilayer construction. Finally, we obtain a multilayer representation by composing the previous operators multiple times. In order to increase invariance with each layer, the size of the patch Sk and the pooling scale σk typically grow exponentially with k, with σk and sup_{c ∈ Sk} |c| of the same order. With n layers, the final representation is given by the feature map

Φn(x0) := xn = An Mn Pn An−1 Mn−1 Pn−1 · · · A1 M1 P1 x0 ∈ L2(Ω, Hn).   (3)

Then, we can define a kernel Kn on two signals x0 and x′0 by Kn(x0, x′0) := ⟨Φn(x0), Φn(x′0)⟩, whose RKHS H_Kn contains all functions of the form f(x0) = ⟨w, Φn(x0)⟩ with w in L2(Ω, Hn). The following lemma shows that this representation preserves all information about the signal at each layer, and that each feature map xk can be sampled on a discrete set with no loss of information. This suggests a natural approach to discretization, which we discuss next. For space limitation reasons, all proofs in this paper are relegated to Appendix C.

Lemma 1 (Signal preservation). Assume that Hk contains the linear functions z ↦ ⟨w, z⟩ for all w in Pk (this is true for all kernels Kk described in Appendix B). Then the signal xk−1 can be recovered from a sampling of xk = Ak Mk Pk xk−1 at discrete locations as soon as the union of the patches centered at these points covers all of Ω. It follows that xk can be reconstructed from such a sampling.
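Before turning to discretization, the following sketch (our own illustration, not from the paper) instantiates a homogeneous dot-product kernel of the form (2) with the exponential profile κ(t) = e^{t−1}, an admissible choice with κ(1) = 1 (the kernels actually used are those of Appendix B, not reproduced here), and checks numerically, through the kernel trick, that the associated mapping ϕ preserves norms and is non-expansive on random inputs; these are exactly the properties (A1-2) assumed later in Section 3.

```python
import numpy as np

def kappa(t):
    # One admissible dot-product profile with kappa(1) = 1 (assumed choice: exponential).
    return np.exp(t - 1.0)

def K(z, zp):
    # Homogeneous kernel of the form (2): K(z, z') = ||z|| ||z'|| kappa(<z, z'> / (||z|| ||z'||)).
    nz, nzp = np.linalg.norm(z), np.linalg.norm(zp)
    return nz * nzp * kappa(np.dot(z, zp) / (nz * nzp))

rng = np.random.default_rng(0)
for _ in range(1000):
    z, zp = rng.standard_normal(10), rng.standard_normal(10)
    # Norm preservation (A1): ||phi(z)||^2 = K(z, z) = ||z||^2.
    assert np.isclose(K(z, z), np.dot(z, z))
    # Non-expansiveness (A2): ||phi(z) - phi(z')||^2 <= ||z - z'||^2, via the kernel trick.
    assert K(z, z) + K(zp, zp) - 2 * K(z, zp) <= np.sum((z - zp) ** 2) + 1e-9
print("norm preservation and non-expansiveness hold on these samples")
```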
2.1 From Theory to Practice: Discretization and Signal Preservation

The previous construction defines a kernel representation for general signals in L2(Ω, H0), which is an abstract object introduced for theoretical purposes, as is often done in signal processing [17]. In practice, signals are discrete, and it is thus important to discuss the problem of discretization, as done in [15]. For clarity, we limit the presentation to one-dimensional signals (Ω = R^d with d = 1), but the arguments easily extend to higher dimensions d when using box-shaped patches. Notation from the previous section is preserved, but we add a bar on top of the discrete analogues of the continuous objects; e.g., x̄k is a discrete feature map in ℓ2(Z, H̄k) for some RKHS H̄k.

Input signals x0 and x̄0. Discrete signals acquired by a physical device are often seen as local integrators of signals defined on a continuous domain (e.g., sensors from digital cameras integrate the pointwise distribution of photons that hit a sensor within a spatial window). Let us then consider a signal x0 in L2(Ω, H0) and a sampling interval s0. By defining x̄0 in ℓ2(Z, H0) such that x̄0[n] = x0(n s0) for all n in Z, it is thus natural to assume that x0 = A0 x, where A0 is a pooling operator (local integrator) applied to an original signal x. The role of A0 is to prevent aliasing and reduce high frequencies; typically, the scale σ0 of A0 should be of the same magnitude as s0, which we choose to be s0 = 1 in the following, without loss of generality. This natural assumption is kept later in the analysis.

Multilayer construction. We now want to build discrete feature maps x̄k in ℓ2(Z, H̄k) at each layer k, involving subsampling by a factor sk with respect to x̄k−1. We define the discrete analogues of the operators Pk (patch extraction), Mk (kernel mapping), and Ak (pooling) as follows: for n in Z,

P̄k x̄k−1[n] := ek^{−1/2} (x̄k−1[n], x̄k−1[n+1], . . . , x̄k−1[n+ek−1]) ∈ P̄k := H̄k−1^{ek}
M̄k P̄k x̄k−1[n] := ϕk(P̄k x̄k−1[n]) ∈ H̄k
x̄k[n] = Āk M̄k P̄k x̄k−1[n] := sk^{1/2} Σ_{m ∈ Z} hk[n sk − m] M̄k P̄k x̄k−1[m] = (hk ∗ M̄k P̄k x̄k−1)[n sk] ∈ H̄k,

where (i) P̄k extracts a patch of size ek starting at position n in x̄k−1 (defining a patch centered at n is also possible), which lives in the Hilbert space P̄k defined as the direct sum of ek copies of H̄k−1; (ii) M̄k is a kernel mapping identical to the one of the continuous case, which preserves the norm, like Mk; and (iii) Āk performs a convolution with a Gaussian filter followed by subsampling with factor sk. The next lemma shows that, under mild assumptions, this construction preserves signal information.

Lemma 2 (Signal recovery with subsampling). Assume that H̄k contains the linear functions z ↦ ⟨w, z⟩ for all w in P̄k and that ek ≥ sk. Then, x̄k−1 can be recovered from x̄k.

We note that this result relies on recovering the signal by deconvolution of the pooling convolution with filter hk, which is stable when its scale parameter, typically of order sk to prevent aliasing, is small enough. This suggests using small values for ek and sk, as in typical recent convolutional architectures [30].

Links between the parameters of the discrete and continuous models. Due to subsampling, the patch sizes in the continuous and discrete models are related by a multiplicative factor. Specifically, a patch of size ek in the discrete model corresponds to a patch Sk of diameter ek sk−1 sk−2 · · · s1 in the continuous case. The same holds true for the scale parameter of the Gaussian pooling.
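The discrete operators above can be put together as follows; this is a minimal sketch of one layer (our own illustration, with hypothetical sizes), where the kernel mapping M̄k is taken as the identity, i.e., the trivial linear case κk(t) = t of (2), whereas a CKN would instead use a richer kernel with a finite-dimensional projection.

```python
import numpy as np

def patch_extract(xbar, e):
    """P_k: patches (xbar[n], ..., xbar[n+e-1]) scaled by e^{-1/2}.

    xbar has shape (N, p) with N positions and p channels; output has shape (N, e*p).
    Boundaries are handled circularly for simplicity (an assumption made only here)."""
    patches = np.concatenate([np.roll(xbar, -v, axis=0) for v in range(e)], axis=1)
    return patches / np.sqrt(e)

def gaussian_pool(z, sigma, s):
    """A_k: convolution with a Gaussian filter at scale sigma, then subsampling by s,
    including the s^{1/2} factor of the discrete definition."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    h = np.exp(-t ** 2 / (2 * sigma ** 2))
    h /= h.sum()                       # L1-normalized filter
    out = np.stack([np.convolve(z[:, j], h, mode="same") for j in range(z.shape[1])], axis=1)
    return np.sqrt(s) * out[::s]       # keep every s-th position

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 3))      # hypothetical input map: 64 positions, 3 channels
e1, s1, sigma1 = 3, 2, 1.0             # hypothetical layer parameters, with e1 >= s1 (Lemma 2)
z1 = patch_extract(x0, e1)             # P_1 x_0, shape (64, 9); M_1 is the identity here
x1 = gaussian_pool(z1, sigma1, s1)     # x_1 = A_1 M_1 P_1 x_0, shape (32, 9)
print(x1.shape)
```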
2.2 From Theory to Practice: Kernel Approximation and Convolutional Kernel Networks

Besides discretization, two modifications are required to use the image representation described above in practice. The first one consists of using feature maps with finite spatial support, which introduces border effects that we do not study, but which are negligible when dealing with large realistic images. The second one requires finite-dimensional approximations of the kernel maps, leading to the convolutional kernel network model of [15]. Typically, the mapping of each RKHS is approximated by performing a projection onto a subspace of finite dimension, a classical approach to make kernel methods work at large scale [12, 31, 34]. One advantage of this approach is its compatibility with the RKHSs (meaning that the approximations live in the respective RKHSs), and the stability results we present next are preserved thanks to the non-expansiveness of the projection. It is then possible to derive theoretical results for the CKN model, which appears as a natural implementation of the kernel constructed previously; yet, we will also show in Section 5 that the results apply more broadly to CNNs that are contained in the functional space associated with the kernel.

3 Stability to Deformations and Translation Invariance

In this section, we study the translation invariance of the kernel representation described in Section 2 for continuous signals, as well as its stability under the action of diffeomorphisms. We use a characterization of stability similar to the one introduced by Mallat [17]: for a C1-diffeomorphism τ : Ω → Ω, let Lτ denote the linear operator defined by Lτ x(u) = x(u − τ(u)); the representation Φ(·) is stable under the action of diffeomorphisms if there exist two constants C1 and C2 such that

‖Φ(Lτ x) − Φ(x)‖ ≤ (C1 ‖∇τ‖∞ + C2 ‖τ‖∞) ‖x‖,   (4)

where ∇τ is the Jacobian of τ, ‖∇τ‖∞ := sup_{u ∈ Ω} ‖∇τ(u)‖, and ‖τ‖∞ := sup_{u ∈ Ω} |τ(u)|. As in [17], our results will assume the regularity condition ‖∇τ‖∞ < 1/2. In order to have a translation-invariant representation, we want C2 to be small (a translation is a diffeomorphism with ∇τ = 0), and indeed we will show that C2 is proportional to 1/σn, where σn is the scale of the last pooling layer, which typically increases exponentially with the number of layers n.

Note that unlike the scattering transform [17], we do not have a representation that preserves the norm, i.e., such that ‖Φ(x)‖ = ‖x‖. While the patch extraction Pk and kernel mapping Mk operators do preserve the norm, the pooling operators Ak may remove (or significantly reduce) frequencies of the signal that are larger than 1/σk. Yet, natural signals such as natural images often have high energy in the low-frequency domain (the power spectrum of natural images is often considered to have a polynomial decay in 1/f², where f is the signal frequency [33]). For such classes of signals, a large fraction of the signal energy is preserved by the pooling operator. In particular, with some additional assumptions on the kernels Kk, it is possible to show [3] that ‖Φ(x)‖ ≥ ‖An · · · A0 x‖. Additionally, when using a Gaussian kernel mapping ϕn+1 on top of the last feature map as a prediction layer, instead of a linear layer, the final representation Φf(x) := ϕn+1(Φn(A0 x)) preserves stability and always has unit norm (see the extended version of the paper [3] for details). This suggests that norm preservation may be a less relevant concern in our kernel setting.

3.1 Stability Results

In order to study the stability of the representation (3), we assume that the input signal x0 may be written as x0 = A0 x, where A0 is an initial pooling operator at scale σ0, which allows us to control the high frequencies of the signal in the first layer. As discussed in Section 2.1, this assumption is natural and compatible with any physical acquisition device. Note that σ0 can be taken arbitrarily small, making the operator A0 arbitrarily close to the identity, so that this assumption does not limit the generality of our results. Moreover, we make the following assumptions for each layer k:

(A1) Norm preservation: ‖ϕk(x)‖ = ‖x‖ for all x in Pk;
(A2) Non-expansiveness: ‖ϕk(x) − ϕk(x′)‖ ≤ ‖x − x′‖ for all x, x′ in Pk;
(A3) Patch sizes: there exists κ > 0 such that at any layer k we have sup_{c ∈ Sk} |c| ≤ κ σk−1.

Note that assumptions (A1-2) imply that the operators Mk preserve the norm and are non-expansive. Appendix B exposes a large class of homogeneous kernels that satisfy assumptions (A1-2).
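Before stating the general bound, the sketch below (a purely illustrative addition, not part of the paper) instantiates the deformation operator Lτ x(u) = x(u − τ(u)) on a discretized one-dimensional signal, for a hypothetical deformation τ(u) = ε sin(2πu), and reports the two quantities ‖τ‖∞ and ‖∇τ‖∞ that appear in the bound (4).

```python
import numpy as np

# Discretize [0, 1) and pick a smooth, small deformation tau (hypothetical example).
u = np.linspace(0.0, 1.0, 512, endpoint=False)
eps = 0.02
tau = eps * np.sin(2 * np.pi * u)                    # ||tau||_inf = eps
grad_tau = 2 * np.pi * eps * np.cos(2 * np.pi * u)   # ||grad tau||_inf = 2*pi*eps < 1/2

x = np.cos(6 * np.pi * u) + 0.5 * np.sin(14 * np.pi * u)   # a smooth test signal

# L_tau x(u) = x(u - tau(u)), computed here by periodic linear interpolation.
x_tau = np.interp(u - tau, u, x, period=1.0)

print("||tau||_inf      =", np.max(np.abs(tau)))
print("||grad tau||_inf =", np.max(np.abs(grad_tau)))
print("relative L2 change of the raw signal:",
      np.linalg.norm(x_tau - x) / np.linalg.norm(x))
```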
General bound for stability. The following result gives an upper bound on the quantity of interest, ‖Φ(Lτ x) − Φ(x)‖, in terms of the norms of various linear operators which control how τ affects each layer. The commutator of linear operators A and B is denoted by [A, B] = AB − BA.

Proposition 3. Let Φ(x) = Φn(A0 x), where Φn is defined in (3), for x in L2(Ω, H0). Then,

‖Φ(Lτ x) − Φ(x)‖ ≤ ( Σ_{k=1}^{n} ‖[Pk Ak−1, Lτ]‖ + ‖[An, Lτ]‖ + ‖Lτ An − An‖ ) ‖x‖.

In the case of a translation Lτ x(u) = Lc x(u) = x(u − c), it is easy to see that pooling and patch extraction operators commute with Lc (this is also known as covariance or equivariance to translations), so that we are left with the term ‖Lc An − An‖, which should control translation invariance. For general diffeomorphisms τ, we no longer have exact covariance, but we show below that the commutators are stable to τ, in the sense that ‖[Pk Ak−1, Lτ]‖ is controlled by ‖∇τ‖∞, while ‖Lτ An − An‖ is controlled by ‖τ‖∞ and decays with the pooling size σn.

Bound on ‖[Pk Ak−1, Lτ]‖. We begin by noting that Pk z can be identified with (Lc z)_{c ∈ Sk} isometrically for all z in L2(Ω, Hk−1), since ‖Pk z‖² = ∫_Sk ‖Lc z‖² dνk(c) by Fubini's theorem. Then,

‖Pk Ak−1 Lτ z − Lτ Pk Ak−1 z‖² = ∫_Sk ‖Lc Ak−1 Lτ z − Lτ Lc Ak−1 z‖² dνk(c) ≤ sup_{c ∈ Sk} ‖Lc Ak−1 Lτ z − Lτ Lc Ak−1 z‖²,

so that ‖[Pk Ak−1, Lτ]‖ ≤ sup_{c ∈ Sk} ‖[Lc Ak−1, Lτ]‖. The following result lets us bound ‖[Lc Ak−1, Lτ]‖ when |c| ≤ κ σk−1, which is satisfied under assumption (A3).

Lemma 4. Let Aσ be the pooling operator with kernel hσ(u) = σ^{−d} h(u/σ). If ‖∇τ‖∞ ≤ 1/2, there exists a constant C1 such that for any σ and any |c| ≤ κσ, we have ‖[Lc Aσ, Lτ]‖ ≤ C1 ‖∇τ‖∞, where C1 depends only on h and κ.

A similar result is obtained in Mallat [17, Lemma E.1] for commutators of the form [Aσ, Lτ], but we extend it to handle integral operators Lc Aσ with a shifted kernel. The proof (given in Appendix C.4) relies on the fact that [Lc Aσ, Lτ] is an integral operator, in order to bound its norm via Schur's test. Note that κ can be made larger, at the cost of an increase of the constant C1 of the order κ^{d+1}.

Bound on ‖Lτ An − An‖. We bound the operator norm ‖Lτ An − An‖ in terms of ‖τ‖∞ using the following result due to Mallat [17, Lemma 2.11], with σ = σn:

Lemma 5. If ‖∇τ‖∞ ≤ 1/2, we have ‖Lτ Aσ − Aσ‖ ≤ (C2/σ) ‖τ‖∞, with C2 = 2^d ‖∇h‖_1.

Combining Proposition 3 with Lemmas 4 and 5, we immediately obtain the following result.

Theorem 6. Let Φ(x) be the representation given by Φ(x) = Φn(A0 x) and assume (A1-3). If ‖∇τ‖∞ ≤ 1/2, we have

‖Φ(Lτ x) − Φ(x)‖ ≤ ( C1 (1 + n) ‖∇τ‖∞ + (C2/σn) ‖τ‖∞ ) ‖x‖.

This result matches the desired notion of stability in Eq. (4), with a translation-invariance factor that decays with σn. The dependence on a notion of depth (the number of layers n here) also appears in [17], with a factor equal to the maximal length of scattering paths and with the same condition ‖∇τ‖∞ ≤ 1/2. However, while the norm of the scattering representation is preserved as the length of these paths goes to infinity, the norm of Φ(x) can decrease with depth due to pooling layers, though this concern may be alleviated by using an additional non-linear prediction layer, as discussed previously (see also [3]).
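As a sanity check of the translation-invariance claim, the following short specialization (ours, combining only Proposition 3 and Lemma 5) considers a pure translation τ(u) = c, for which ∇τ = 0, ‖τ‖∞ = |c|, and the commutator terms vanish since Pk and Ak commute with Lc:

```latex
% Pure translation: L_tau = L_c with tau(u) = c, so \nabla\tau = 0 and \|\tau\|_\infty = |c|.
% The commutator terms of Proposition 3 vanish, and Lemma 5 with \sigma = \sigma_n gives
\[
\|\Phi(L_c x) - \Phi(x)\| \;\le\; \|L_c A_n - A_n\|\,\|x\|
\;\le\; \frac{C_2\,|c|}{\sigma_n}\,\|x\| ,
\]
```

which vanishes as the last pooling scale σn grows, consistent with the claim that C2 is proportional to 1/σn.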
3.2 Stability with Kernel Approximations

As in the analysis of the scattering transform [17], we have characterized the stability and shift-invariance of the data representation for continuous signals, in order to give some intuition about the properties of the corresponding discrete representation described in Section 2.1. Another approximation performed in the CKN model of [15] consists of adding projection steps onto finite-dimensional subspaces of the layers' RKHSs, as discussed in Section 2.2. Interestingly, the stability properties we have obtained previously are compatible with these steps. We may indeed redefine the operator Mk as the pointwise operation such that Mk z(u) = Πk ϕk(z(u)) for any map z in L2(Ω, Pk), instead of Mk z(u) = ϕk(z(u)); here, Πk : Hk → Fk is a projection operator onto a linear subspace. Then, Mk does not necessarily preserve the norm anymore, but ‖Mk z‖ ≤ ‖z‖, with a loss of information corresponding to the quality of approximation of the kernel Kk on the points z(u). On the other hand, the non-expansiveness of Mk is satisfied thanks to the non-expansiveness of the projection. Additionally, the CKN construction provides a finite-dimensional representation at each layer, which preserves the norm structure of the original Hilbert spaces isometrically. In summary, it is possible to show that the conclusions of Theorem 6 remain valid for this tractable CKN representation, but we lose signal information in the process. The stability of the predictions can then be controlled through the norm of the last (linear) layer, which is typically used as a regularizer [15].

4 Global Invariance to Group Actions

In Section 3, we have seen how the kernel representation of Section 2 creates invariance to translations, by commuting with the action of translations at intermediate layers, and how the last pooling layer, defined on the translation group, governs the final level of invariance. It is often useful to encode invariance to other groups of transformations, such as rotations or reflections (see, e.g., [9, 17, 22, 29]). Here, we show how this can be achieved by defining adapted patch extraction and pooling operators that commute with the action of a transformation group G (this is known as group covariance or equivariance). We assume that G is locally compact, so that we can define a left-invariant Haar measure µ, that is, a measure on G that satisfies µ(gS) = µ(S) for any Borel set S ⊂ G and any g in G. We assume that the initial signal x(u) is defined on G, and we define subsequent feature maps on the same domain. The action of an element g in G is denoted by Lg, where Lg x(u) = x(g⁻¹u). Then, we are interested in defining a layer, that is, a succession of patch extraction, kernel mapping, and pooling operators, that commutes with Lg, in order to achieve equivariance to the group G.

Patch extraction. We define patch extraction by Px(u) = (x(uv))_{v ∈ S} for all u in G, where S ⊂ G is a patch centered at the identity. P commutes with Lg, since

P Lg x(u) = (Lg x(uv))_{v ∈ S} = (x(g⁻¹uv))_{v ∈ S} = Px(g⁻¹u) = Lg Px(u).

Kernel mapping. The pointwise operator M is defined as in Section 2, and thus commutes with Lg.

Pooling. The pooling operator on the group G is defined in a similar fashion as in [22] by

Ax(u) = ∫_G x(uv) h(v) dµ(v) = ∫_G x(v) h(u⁻¹v) dµ(v),

where h is a pooling filter, typically localized around the identity element. It is easy to see from the first expression of Ax(u) that A Lg x(u) = Lg Ax(u), making the pooling operator G-equivariant.
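As a toy numerical check of these equivariance relations (our own illustration, using the cyclic group Z_N with addition as group operation and the counting measure as Haar measure), the sketch below verifies that patch extraction and group pooling commute with the action Lg.

```python
import numpy as np

N = 12                                    # cyclic group Z_N (hypothetical finite example)
rng = np.random.default_rng(0)
x = rng.standard_normal(N)                # a signal x : G -> R

def L(g, x):
    """Group action: L_g x(u) = x(g^{-1} u), i.e., x(u - g) on Z_N."""
    return np.roll(x, g, axis=0)

S = [0, 1, 2]                             # a "patch" of group elements near the identity
h = [0.5, 0.3, 0.2]                       # pooling filter localized around the identity

def P(x):
    """Patch extraction: P x(u) = (x(u v))_{v in S} = (x(u + v))_{v in S} on Z_N."""
    return np.stack([x[(np.arange(N) + v) % N] for v in S], axis=1)

def A(x):
    """Group pooling: A x(u) = sum_{v in S} x(u v) h(v)."""
    return sum(hv * x[(np.arange(N) + v) % N] for hv, v in zip(h, S))

g = 5
print(np.allclose(P(L(g, x)), L(g, P(x))))   # True: P L_g = L_g P
print(np.allclose(A(L(g, x)), L(g, A(x))))   # True: A L_g = L_g A
```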
In our analysis of stability in Section 3, we saw that inner pooling layers are useful to guarantee stability to local deformations, while global invariance is achieved mainly through the last pooling layer. In some cases, one only needs stability to a subgroup of G while achieving global invariance to the whole group; e.g., in the roto-translation group [21], one might want invariance to a global rotation but stability to local translations. Then, one can pool only on the subgroup to be stabilized (e.g., translations) in intermediate layers, while pooling on the entire group at the last layer to achieve the global group invariance.

5 Link with Convolutional Neural Networks

In this section, we study the connection between the kernel representation defined in Section 2 and CNNs. Specifically, we show that the RKHS H_Kn obtained from our kernel construction contains a set of CNNs on continuous domains with certain types of smooth homogeneous activations. An important consequence is that the stability results of the previous sections apply to this class of CNNs.

CNN maps construction. We now define a CNN function fσ that takes as input an image x0 in L2(Ω, R^p0) with p0 channels and builds a sequence of feature maps, represented at layer k as a function zk in L2(Ω, R^pk) with pk channels; it performs linear convolutions with a set of filters (w^i_k)_{i=1,...,pk}, followed by a pointwise activation function σ, to obtain intermediate feature maps z̃k, then applies a linear pooling filter and repeats the same operations at each layer. Note that here, each w^i_k is in L2(Sk, R^{pk−1}), with channels denoted by w^{ij}_k in L2(Sk, R). Formally, the intermediate map z̃k in L2(Ω, R^pk) is obtained for k ≥ 1 by

z̃^i_k(u) = nk(u) σ( ⟨w^i_k, Pk zk−1(u)⟩ / nk(u) ),   (8)

where z̃k(u) = (z̃^1_k(u), . . . , z̃^pk_k(u)) in R^pk, and Pk is the patch extraction operator, which operates here on finite-dimensional maps. The activation involves a pointwise non-linearity σ along with a quantity nk(u) that is independent of the filters and that will be made explicit in the sequel. Finally, the map zk is obtained by using a pooling operator as in Section 2, with zk = Ak z̃k and z0 = x0.

Homogeneous activations. The choice of non-linearity σ relies on Lemma B.2 of the appendix, which shows that for many choices of smooth functions σ, the RKHS Hk defined in Section 2 contains the linear functions z ↦ ‖z‖ σ(⟨g, z⟩/‖z‖) for all g in Pk. While this homogenization involving the quantities ‖z‖ is not standard in classical CNNs, we note that (i) the most successful activation function, namely the rectified linear unit, is homogeneous, that is, relu(⟨g, z⟩) = ‖z‖ relu(⟨g, z⟩/‖z‖); and (ii) while relu is non-smooth and thus not in our RKHSs, there exists a smoothed variant that satisfies the conditions of Lemma B.2 for useful kernels. As noticed in [35, 36], this is for instance the case for the inverse polynomial kernel described in Appendix B. In Figure 2, we plot and compare these different variants of relu. We may now define the quantities nk(u) := ‖Pk xk−1(u)‖ in (8), which are due to the homogenization and are independent of the filters w^i_k.

Classification layer. The final CNN prediction function fσ is given by inner products with the feature maps of the last layer: fσ(x0) = ⟨wn+1, zn⟩, with parameters wn+1 in L2(Ω, R^pn). The next result shows that for appropriate σ, the function fσ is in H_Kn. The construction of this function in the RKHS and the proof are given in Appendix D. We note that a similar construction for fully connected networks with constraints on weights and inputs was given in [35].

Proposition 7 (CNNs and RKHSs). Assume that the activation σ satisfies Cσ(a) < ∞ for all a ≥ 0, where Cσ is defined for a given kernel in Lemma B.2. Then the CNN function fσ defined above is in the RKHS H_Kn, with norm

‖fσ‖² ≤ Σ_{i=1}^{pn} ‖w^i_{n+1}‖²_2 B_{n,i},

where B_{n,i} is defined recursively by B_{1,i} = Cσ²(‖w^i_1‖²_2) and B_{k,i} = Cσ²( pk−1 Σ_{j=1}^{pk−1} ‖w^{ij}_k‖²_2 B_{k−1,j} ).
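The homogenization in (8) can be checked numerically: because relu is positively homogeneous, multiplying by nk(u) outside the non-linearity and dividing inside leaves a relu layer unchanged. The sketch below (our own illustration with hypothetical sizes; it uses plain relu, whereas the paper's smoothed variant sReLU is defined through Lemma B.2 and not reproduced here) implements one intermediate map of the form (8) on already-extracted patches, taking nk(u) as the patch norm for simplicity (in the paper, nk(u) is the norm of the kernel-representation patch, which is likewise independent of the filters).

```python
import numpy as np

def homogeneous_layer(patches, W, sigma):
    """One intermediate CNN map of the form (8), acting on extracted patches.

    patches: array of shape (N, q), one patch P_k z_{k-1}(u) per position u;
    W: filters of shape (p_k, q); sigma: pointwise activation."""
    n = np.linalg.norm(patches, axis=1, keepdims=True)      # n_k(u), taken as the patch norm here
    pre = patches @ W.T                                      # <w_k^i, P_k z_{k-1}(u)>
    return n * sigma(pre / n)

relu = lambda t: np.maximum(t, 0.0)

rng = np.random.default_rng(0)
patches = rng.standard_normal((32, 15))   # hypothetical: 32 positions, patch dimension 15
W = rng.standard_normal((8, 15))          # hypothetical: 8 output channels

out_homog = homogeneous_layer(patches, W, relu)
out_plain = relu(patches @ W.T)           # a standard (non-homogenized) relu layer

# Positive homogeneity of relu: relu(<w, p>) = ||p|| relu(<w, p> / ||p||), so both coincide.
print(np.allclose(out_homog, out_plain))  # True
```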
Figure 2: Comparison of one-dimensional functions obtained with relu and smoothed relu (sReLU) activations. (Left) The non-homogeneous setting of [35, 36]. (Right) Our homogeneous setting, f : x ↦ |x| σ(wx/|x|), for different values of the parameter w. Note that for w ≥ 0.5, sReLU and ReLU are indistinguishable.

The results of this section imply that our study of the geometry of the kernel representations, and in particular the stability and invariance properties of Section 3, apply to the generic CNNs defined above, thanks to the Lipschitz smoothness relation (1). The smoothness is then controlled by the RKHS norm of these functions, which sheds light on the links between generalization and stability. In particular, functions with low RKHS norm (a.k.a. "large margin" functions) are known to generalize better to unseen data (see, e.g., the notion of margin bounds for SVMs [27, 28]). This implies, for instance, that generalization is harder if the task requires classifying two slightly deformed images with different labels, since this requires a function with large RKHS norm according to our stability analysis. In contrast, if a stable function (i.e., one with small RKHS norm) is sufficient to do well on a training set, learning becomes easier and few samples may be enough for good generalization.

Acknowledgements

This work was supported by a grant from ANR (MACARON project under grant number ANR-14-CE23-0003-01), by the ERC grant number 714381 (SOLARIS project), and by the MSR-Inria joint center.

References

[1] F. Anselmi, L. Rosasco, and T. Poggio. On invariance and selectivity in representation learning. Information and Inference, 5(2):134-158, 2016.
[2] F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical kernel machines. preprint arXiv:1508.01084, 2015.
[3] A. Bietti and J. Mairal. Group invariance and stability to deformations of deep convolutional representations. preprint arXiv:1706.03078, 2017.
[4] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[5] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1872-1886, 2013.
[6] J. Bruna, A. Szlam, and Y. LeCun. Learning stable group invariant representations with convolutional networks. preprint arXiv:1301.3537, 2013.
[7] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems (NIPS), 2009.
[8] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning (ICML), 2017.
[9] T. Cohen and M. Welling. Group equivariant convolutional networks. In International Conference on Machine Learning (ICML), 2016.
[10] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NIPS), 2016.
[11] J. Diestel and J. J. Uhl. Vector Measures. American Mathematical Society, 1977.
[12] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research (JMLR), 2:243-264, 2001.
[13] B. Haasdonk and H. Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine Learning, 68(1):35-61, 2007.
[14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
[15] J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.
[16] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2014.
[17] S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.
[18] G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. Journal of Machine Learning Research (JMLR), 12:2563-2581, 2011.
[19] Y. Mroueh, S. Voinea, and T. A. Poggio. Learning with group invariant features: A kernel perspective. In Advances in Neural Information Processing Systems (NIPS), 2015.
[20] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1-141, 2017.
[21] E. Oyallon and S. Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[22] A. Raj, A. Kumar, Y. Mroueh, T. Fletcher, and B. Schölkopf. Local group invariant representations via orbit embeddings. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[23] S. Saitoh. Integral Transforms, Reproducing Kernels and their Applications, volume 369. CRC Press, 1997.
[24] I. J. Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96-108, 1942.
[25] B. Schölkopf. Support Vector Learning. PhD thesis, Technische Universität Berlin, 1997.
[26] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[27] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[28] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[29] L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.
[31] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the International Conference on Machine Learning (ICML), 2000.
[32] E. M. Stein. Harmonic Analysis: Real-variable Methods, Orthogonality, and Oscillatory Integrals. Princeton University Press, 1993.
[33] A. Torralba and A. Oliva. Statistics of natural image categories. Network: Computation in Neural Systems, 14(3):391-412, 2003.
[34] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.
[35] Y. Zhang, J. D. Lee, and M. I. Jordan. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning (ICML), 2016.
[36] Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. In International Conference on Machine Learning (ICML), 2017.