# Measuring dissimilarity with diffeomorphism invariance

Théophile Cantelobre 1 2, Carlo Ciliberto 3, Benjamin Guedj 2 3 4, Alessandro Rudi 1

Measures of similarity (or dissimilarity) are a key ingredient to many machine learning algorithms. We introduce DID, a pairwise dissimilarity measure applicable to a wide range of data spaces, which leverages the data's internal structure to be invariant to diffeomorphisms. We prove that DID enjoys properties which make it relevant for theoretical study and practical use. By representing each datum as a function, DID is defined as the solution to an optimization problem in a Reproducing Kernel Hilbert Space and can be expressed in closed form. In practice, it can be efficiently approximated via Nyström sampling. Empirical experiments support the merits of DID.

1 DI ENS, École normale supérieure, Université PSL, CNRS, Inria, Paris, France. 2 Inria London, UK. 3 Centre for AI, Department of Computer Science, University College London, UK. 4 Inria, Lille - Nord Europe Research Centre, Lille, France. Correspondence to: Théophile Cantelobre, Carlo Ciliberto, Benjamin Guedj, Alessandro Rudi. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

One of the overarching goals of most machine learning algorithms is to generalize to unseen data. Ensuring and quantifying generalization is of course challenging, especially in the high-dimensional setting. One way of reducing the hardness of a learning problem is to study the invariances that may exist with respect to the distribution of data, effectively reducing its dimension. Handling invariances in data has attracted considerable attention over time in machine learning and applied mathematics more broadly. Two notable examples are image registration (De Castro and Morandi, 1987; Reddy and Chatterji, 1996) and time series alignment (Sakoe and Chiba, 1978; Cuturi and Blondel, 2017; Vayer et al., 2020; Blondel et al., 2021; Cohen et al., 2021). In practice, data augmentation is a central tool in the machine learning practitioner's toolbox. In computer vision for instance, images are randomly cropped, color spaces are changed, and artifacts are added. Such heuristics enforce some form of invariance to transformations that are chosen by hand, often parametrically, and come at the cost of a more demanding learning step (e.g., more data to store and process, more epochs, to name but a few).

Prior work. Learning under invariances has also spurred significant theoretical interest. De Coste and Schölkopf (2002), Haasdonk and Burkhardt (2007), Kondor (2008) and Mroueh et al. (2015) algebraically studied how kernels invariant to group actions behave for learning. Bruna and Mallat (2013) take inspiration from signal processing (with wavelet spaces) to build scattering networks, which can present good properties with respect to invariances. Focusing on neural networks, Mairal et al. (2014) and Bietti and Mairal (2019) introduced and analyzed a model aiming to mimic Convolutional Neural Networks (CNNs). More recently, Bietti et al. (2021) studied the sample complexity of learning in the presence of invariances, with invariant kernels. Conversely, the hypothesis that invariance can be a proxy for neural network performance has been put to the test empirically by Petrini et al. (2021).

Contributions.
We introduce a dissimilarity called DID (standing for Diffeomorphism Invariant Dissimilarity) which is invariant to smooth diffeomorphisms present between data points. Although DID is somewhat less sophisticated than most of the models presented above, it is considerably more generic and can be seen as a building block for devising practical machine learning algorithms in the presence of invariances. In order to exploit the internal structure of data points (e.g. images), we cast them as functions between two spaces (e.g. coordinate and color spaces). This unlocks the potential of using the change of variable formula to eliminate diffeomorphisms between functions (i.e. transformations between data points) when they exist. DID is based on a generic method for identifying whether a function $g$ is of the form $f \circ Q$, without any parametric model over $Q$, founded on the change of variable formula. This makes DID a promising and flexible tool for image processing (e.g., image registration), time-series analysis (e.g., dynamic time warping) and machine learning (e.g., nearest neighbors). DID is defined as an optimization problem in a Reproducing Kernel Hilbert Space (RKHS), of which we present a closed-form solution (Thm. 4.2). We then show how it can be approximated in practice using Nyström sampling techniques (Lemma 4.3, Thm. 4.4). By relying on standard linear algebra, this approximation can be efficiently implemented with batch techniques and accelerated hardware. A key aspect is that DID has very few hyper-parameters, which can easily be chosen by a domain expert: the kernels on the input and output spaces and a regularization parameter. We provide guidance on choosing the regularization parameter in Sec. 4.2. Using tools from functional analysis, we prove that DID behaves as expected when comparing $f$ and $f \circ Q$ (i.e., that it considers these two functions to be close) in the limit of vanishing regularization (Thm. 3.2). We support our theoretical claims with numerical experiments on images.

Outline. We introduce our new dissimilarity in Sec. 2. In Sec. 3, we prove that the intuition leading to the definition in Sec. 2 is well founded and discuss the theoretical properties of the dissimilarity. We then show how to compute the dissimilarity in Sec. 4: we first show that it has a closed-form expression; we then present and justify an approximation scheme based on Nyström sampling. We illustrate the behavior of the dissimilarity with experiments on images in Sec. 5.

2. The dissimilarity

2.1. Informal derivation of the dissimilarity

The dissimilarity we describe in this work relies on the internal structure of the objects it compares. An efficient way of encoding this structure is to, whenever possible, view objects as maps between an input space and an output space. In this way, we can consider both the values taken by the function and the locations at which these values were taken. Consider two maps $f$ and $g$ from $\mathbb{R}^d$ to $\mathbb{R}^p$. Our goal is to determine whether there exists a diffeomorphism $Q : \mathbb{R}^d \to \mathbb{R}^d$ such that $g = f \circ Q$. In practice, such transformations could be rigid-body transformations, nonsingular projective transformations, or more generally mild distortions of the space (such as warping). The goal is thus to find a measure of dissimilarity between $f$ and $g$ that is robust to such diffeomorphisms. We derive such a measure informally in this section around three key ideas.

Change of variable formula.
Integrals offer a natural way to eliminate a diffeomorphism from a function, via the change of variable formula
$$\int f(Q(x))\,|\nabla Q(x)|\,dx = \int f(x)\,dx.$$
As $|\nabla Q|$ (the determinant of the Jacobian of $Q$) is unknown, we can approximate the above formula with
$$\Big| \int g(x)\,q(x)\,dx - \int f(x)\,dx \Big|, \qquad (1)$$
where $q$ lies in a space of functions. Indeed, when $g = f \circ Q$ for some diffeomorphism $Q$, choosing $q = |\nabla Q|$ minimizes (1). However, this solution is not unique. Indeed, there exist trivial solutions such as $q = f/g$, which exist irrespective of whether there is a $Q$ such that $g = f \circ Q$.

Range of statistics. One way of reducing the class of solutions to ones that are relevant to our original question is to study not only how well the weighted integral of $g$ can approximate the integral of $f$, but also to require that the same weight approximate a wide class of transformations of $f$. A natural example, with inspiration in probability theory, is to be able to approximate all moments of $f$, i.e., the integrals of the moments $v_1(f) = f$, $v_2(f) = f^2$, and so on, or more general statistics. The function $q = f/g$ may match the integral of $v_1(g)$ with that of $v_1(f)$, but cannot work for $v_2$. However, if $g = f \circ Q$, then $q = |\nabla Q|$ (the solution we are seeking) satisfies, for any continuous function $v : \mathbb{R}^p \to \mathbb{R}$,
$$\int v(f(Q(x)))\,|\nabla Q(x)|\,dx = \int v(f(x))\,dx.$$
Problem (1) is thus replaced by the following one, which also has $q$ as a solution when $g = f \circ Q$:
$$\min_{q} \max_{v \in V} \Big| \int v(g(x))\,q(x)\,dx - \int v(f(x))\,dx \Big|, \qquad (2)$$
where $V$ is a rich set of statistics, e.g., continuous integrable functions on $\mathbb{R}^p$.

Uniformity over regions. Problem (2) averages the statistics uniformly over the whole space, irrespective of the fact that a witness of $g = f \circ Q$ could live in a lower-dimensional region. For instance, $g$ and $f$ might be non-zero only on a small region of the space, consequently yielding a relatively small value for Eq. (2). To enhance such regions, we choose to integrate with respect to a smooth function $h$ that is chosen adversarially to maximize the dissimilarity between $f$ and $g$. In other words, we arrive at the following optimization problem:
$$\max_{h \in H_1} \min_{q} \max_{v \in V} \Big| \int v(g(x))\,q(x)\,dx - \int v(f(x))\,h(x)\,dx \Big|,$$
where $H_1$ is a suitable set of smooth functions. Again, if $g = f \circ Q$, the above is solved by $q = (h \circ Q)\,|\nabla Q|$.

Figure 1. Illustration of $\hat D_\lambda$ on images. (a) $f_1$ (close-up); (b) $f_2$ (raccoon's paw); (c) $f_3$ (pepper close-up); (d) $g_1$ (scene with multiple objects); (e) $g_2$ (raccoon's abdomen, with natural keystone deformation); (f) $g_3$ (peppers among vegetables). We compute $\hat D_\lambda(f_i, g_i)$ and materialize the optimal $h_i$ and $q_i$ selected by DID (with thresholding for visualisation purposes), for $f_i$ and $g_i$ images taken with a smartphone. The mask $\mu$ is illustrated by the dashed circle (in light blue), see Sec. 5.1 for details. The images are taken from different views so as to provide different angles and lighting. Figs. 1a and 1d show a scene of objects and a close-up on one of them (bottle opener). Figs. 1b and 1e are taken from images of a raccoon (raccoon). Figs. 1c and 1f are sub-patches from peppers, a scene with vegetables. Notice that $q_i$ visually matches the area highlighted by $h_i$, despite the perspective and scale changes. Additional details are gathered in Sec. 5.2.

Note that the smoothness of $h$ and $q$ is crucial. On the one hand, the smoothness of $h$ ensures that the considered regions are of interest with respect to the underlying metric on $\mathbb{R}^d$, i.e. they cannot be too close to Dirac masses on pathological sets. On the other, the smoothness of $q$ ensures that the transformations are not matched by cherry-picking dispersed points in the domain such that the integrals match.
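Before formalizing the dissimilarity, the change-of-variable identity underlying the construction can be checked numerically with a few lines of NumPy. The one-dimensional warp $Q$ and the test function $f$ below are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

# Midpoint quadrature on [0, 1].
N = 200_000
x = (np.arange(N) + 0.5) / N

def f(y):
    return np.exp(-3.0 * y) * np.sin(7.0 * y) ** 2   # arbitrary integrable test function

Q  = x + 0.1 * np.sin(2 * np.pi * x)                  # diffeomorphism of [0, 1] onto itself
dQ = 1.0 + 0.2 * np.pi * np.cos(2 * np.pi * x)        # Q'(x) > 0 everywhere on [0, 1]

lhs = np.mean(f(Q) * np.abs(dQ))   # \int f(Q(x)) |Q'(x)| dx
rhs = np.mean(f(x))                # \int f(x) dx
print(lhs, rhs)                    # the two quadratures agree up to discretization error
```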
2.2. Definition of the dissimilarity

Now that we have given the motivation as well as the intuition behind our method, we can formally introduce it. Let $X \subset \mathbb{R}^d$ and $Y \subset \mathbb{R}^p$, where $d, p \ge 1$. In this paper, the objects of interest are maps from $X$ to $Y$. Note that this is quite flexible and can reflect many of the rich types of data considered in image processing, time-series modelling and machine learning. Indeed, an image can be seen as a map from $\mathbb{R}^2$ (the coordinates of the pixels) to $\mathbb{R}^3$ (the color space, RGB for instance). A time series can be seen as a map from $\mathbb{R}$ (time) to $\mathbb{R}^n$ (the space of the sample values). Let $k_X$ be a reproducing kernel on $X$ with Reproducing Kernel Hilbert Space (RKHS) $H$ (Aronszajn, 1950) and $k_Y$ be a reproducing kernel on $Y$ with RKHS $F$. In particular, we assume that they are bounded. To conclude, the formalization below naturally allows for the presence of a mask function $\mu$, whose role is to focus the matching process on a subregion of interest of $f$. The role of the mask will be discussed after the definition.

Definition 2.1 (Dissimilarity $D_\lambda(f, g)$). Let $\lambda \ge 0$ and let $\mu : X \to \mathbb{R}$ be bounded and integrable. For any $f, g : X \to Y$, we define the dissimilarity
$$D_\lambda(f, g) := \max_{\|h\|_H \le 1} \; \min_{q \in H} \; \Delta_{f,g}(h, q) + \lambda \|q\|_H^2, \qquad (3)$$
where $\Delta_{f,g}(h, q)$ is defined as follows,
$$\Delta_{f,g}(h, q) := \max_{\|v\|_F \le 1} \Big( \int_X v(g(x))\,q(x)\,dx - \int_X v(f(x))\,\mu(x)\,h(x)\,dx \Big)^2 .$$
When clear from context, we write $D$ instead of $D_\lambda$ for the sake of conciseness.

The reader should easily recognize the construction described in Sec. 2.1. We have made precise the space in which we search for $q$ and $h$: $H$, the RKHS of $k_X$. Regularity of $h$ is enforced by searching in the unit ball, while regularity of $q$ is enforced with Tikhonov regularization. This choice makes it possible to compute the dissimilarity in closed form (see Sec. 4) without sacrificing expressivity (see Sec. 4.2). In particular, to allow a very rich set of statistics that is also manageable from a computational viewpoint, we choose $V$ to be the unit ball of $F$. For example, if $Y$ is a bounded subset of $\mathbb{R}^p$ and $k_Y$ is chosen as the Laplace kernel $k_Y(y, y') = \exp(-\|y - y'\|)$, then a rescaled version of any infinitely smooth function belongs to $V$ (including in particular all polynomials, smooth probabilities and the Fourier basis; see Appendix A for more details). For the same reason we choose $H_1$ to be the unit ball of $H$, and we also choose $q \in H$. Fig. 1 shows a few examples of the role played by the $h$ and $q$ optimizing the problem in Eq. (3).

The role of the mask $\mu$. We introduced a function $\mu : X \to \mathbb{R}$ which applies to the term depending on $f$ and $h$. This function is meant to be a mask which focuses the distance on a certain region of $f$, discounting other regions. Such a mask is useful in practice, since typically the space $X$ is given by the problem. For example, if we want to use the dissimilarity to check whether the content of a given image ($f$) is contained in another image ($g$), the shape of $X$ is typically rectangular, while the region of interest is the interior of the image. In this case the mask is useful to avoid the artifacts introduced by the corners. Notice that this addition further breaks the symmetry between $f$ and $g$: $f$ becomes a reference, and we search in $g$ for matching statistics. In the experiments, we relied for example on a Blackman window, a classical window function in signal and image processing (see Sec. 5.1), to reduce the impact of the corners.
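To make this representation concrete, the sketch below casts an RGB image into samples of a map $f : X \to Y$ together with a Blackman mask $\mu$ (the window used in Sec. 5.1). The helper name and the array conventions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def image_as_function(img):
    """Cast an H x W x 3 RGB image into N = H*W samples (x_i, f(x_i), mu(x_i)):
    coordinates x_i in X = [0, 1]^2, colors f(x_i) in Y = [0, 1]^3, and a
    separable Blackman mask mu that discounts the borders of the image."""
    H, W, _ = img.shape
    u = (np.arange(H) + 0.5) / H
    v = (np.arange(W) + 0.5) / W
    uu, vv = np.meshgrid(u, v, indexing="ij")
    coords = np.stack([uu.ravel(), vv.ravel()], axis=1)       # (N, 2) points of X
    colors = img.reshape(-1, 3).astype(np.float64) / 255.0    # (N, 3) values in Y

    def blackman(t):                                          # 1-D Blackman window
        return 0.42 - 0.5 * np.cos(2 * np.pi * t) + 0.08 * np.cos(4 * np.pi * t)

    mask = (blackman(uu) * blackman(vv)).ravel()              # tensor-product mask
    return coords, colors, mask

# Example: a random 64 x 48 "image".
img = np.random.randint(0, 256, size=(64, 48, 3), dtype=np.uint8)
coords, colors, mask = image_as_function(img)
```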
3. Robustness to diffeomorphisms

The dissimilarity $D$ is designed (consistently with the derivation in Sec. 2.1) to be small when $f$ and $g$ are equal up to a diffeomorphism. The ideal result would be something along the lines of the following informal theorem:

Theorem 3.1 (Ideal). For any $f : X \to Y$ and any diffeomorphism $Q$ over $X$, $D(f, f \circ Q) = 0$.

This is of course too much to ask. Indeed, the regularization over the choice of $q$ (which in turn controls the regularity of the Jacobian of a hypothetical $Q$) introduces a bias: even if $g = f \circ Q$, the dissimilarity is not 0. This bias vanishes if and only if $q = 0$ (by definiteness of the norm). However, when the RKHS is assumed to be rich enough and $Q$ and $\mu$ are regular enough, we have the following result. Before stating it, we recall that the Laplace kernel $k_X(x, x') = \exp(-\|x - x'\|)$ belongs to the more general family of Sobolev kernels (Wendland, 2004). In particular, it corresponds to the Sobolev kernel of smoothness $m$, with $m = (d + 1)/2$, where $d$ is the dimension of $X \subset \mathbb{R}^d$.

Theorem 3.2. Let $X \subset \mathbb{R}^d$ be an open bounded set with Lipschitz boundary. Let $\mu \in C^\infty(\mathbb{R}^d)$ with compact support $\Omega \subset X$. Choose $k_X$ to be a Sobolev kernel of smoothness $m$, with $m > d/2$. Then for any $C^{m+1}$ diffeomorphism $Q$ on $\mathbb{R}^d$ satisfying $Q^{-1}(\Omega) \subset X$, we have
$$D_\lambda(f, f \circ Q) \;\le\; \lambda\, C_\mu^2\, C_Q^2 \qquad \text{for all measurable } f,$$
where $C_\mu = \|\mu\|_{H^m(\mathbb{R}^d)}$, and $C_Q$ is defined in Eq. (19) in Appendix B.2 and depends only on $\Omega, X, Q, d, m$.

The theorem above is a special case of Thm. B.3, presented in Appendix B.2. There we prove the more general result $D_\lambda(f, g) \le \lambda C_\mu^2 C_Q^2$ for all measurable $f, g$ that satisfy $g(x) = (f \circ Q)(x)$ only in the region not canceled by the mask, i.e., $x \in Q^{-1}(\Omega)$. A first consequence of Thm. 3.2 is that the regularization parameter $\lambda$ controls the threshold to decide whether $g \approx f \circ Q$. In particular, we easily see that $D_\lambda(f, f \circ Q) \to 0$ when $\lambda \to 0$. This result confirms that in the limit where the regularization vanishes, so does the bias, and we recover the result we would have imagined à la Thm. 3.1. We will see in Sec. 4.2 that $\lambda$ also has an important role in controlling the approximation error of $D(f, g)$. This shows that $\lambda$ controls a bias-variance trade-off similar to the one in classical kernel-based supervised learning (Shawe-Taylor and Cristianini, 2004). To show concretely the dependence of $C_Q$ on $Q$, in the following example we compute $C_Q$ explicitly for an interesting class of diffeomorphisms.

Example 3.3 (Magnitude of $C_Q$ for rigid transforms). Let $X$ be the unit ball in $\mathbb{R}^d$ and let $\mu$ be a mask supported on $\Omega$, the ball of radius $r < 1$. We consider the diffeomorphisms $Q(x) = \alpha R x$, with $R$ a unitary matrix and $r < \alpha < 1/r$. We use the Laplace kernel $k_X(x, x') = \exp(-\|x - x'\|)$ for $k_X$ (analogously for $k_Y$). Then we can compute explicitly the bound in Eq. (19), since the Laplace kernel corresponds to the Sobolev kernel with exponent $m = (d + 1)/2$, obtaining a bound of the order of
$$\Big( \frac{\alpha}{\min(\alpha, 1) - r} \Big)^{m + d/2} \, \alpha^d \, (1 + \alpha + \alpha^m).$$

Remark 3.4. By seeing data points as functions, we are able to reason about them efficiently. Since data points are continuous objects, the relevant dimension is $d = \dim(X)$, the dimension of $X$, and not the number of pixels, which we denote $N$ below. Indeed, $X$ is low-dimensional by nature: for images, $d = 2$ since images are 2-d objects, while $N$ could be very large. In this way, the dimensional dependence of $C_Q$ is very reasonable. In the extreme case where we consider time series of point clouds (for example, brain scans through time), $d = 4$ (time + 3 spatial coordinates). If we consider a scale change of 50%, then $C_Q$ is of the order of $1.5^4 \approx 5$. We emphasize that $C_Q$ is not related to the number of samples (for images, the number of pixels) used to compute DID in practice.
Figure 2. Invariance to general diffeomorphisms (warping). (a) $\hat D_\lambda(f, \mathrm{warp}(f, T))$ as a function of $T$; we repeat the warp 50 times (+ markers) on the same image and represent average values ± standard deviation (in grey). (b) $\mathrm{RMSE}(f, \mathrm{warp}(f, T))$ against $\hat D_\lambda(f, \mathrm{warp}(f, T))$ for various values of $T$ and images $f$; 1000 images are each warped once for each value of $T$, and we represent the random regime for each metric. Warping is randomly generated, its intensity controlled by a temperature parameter $T$ (higher $T$ produces, on average, warps with higher displacement norm). In (a): $\hat D_\lambda$ stays constant (i.e. invariant to the warps) as long as their norm is not too strong (small $T$), while the RMSE increases exponentially. When $T$ becomes large, the transformations become intense (indeed they are non-diffeomorphic) and $\hat D_\lambda$ grows to reflect this fact. In (b): we see that $\hat D_\lambda$ stays invariant to warps as long as $T \lesssim 0.1$ (far from the random regime interval), while the Euclidean distance increases exponentially with $T$, even for small $T$. See Sec. 5.3 for more details.

3.1. Discussion on the discriminatory power of $D_\lambda$

In Thm. 3.2, we proved that when $g = f \circ Q$ for some diffeomorphism $Q$, then $D(f, g)$ is small, i.e. that the dissimilarity is essentially invariant to diffeomorphisms. However, to fully characterize the properties of the proposed dissimilarity, it would be interesting to also study its discriminatory power, i.e. the fact that $D(f, g)$ is small only if there exists a diffeomorphism $Q$ such that $g = f \circ Q$. Fig. 2 investigates this question from the empirical perspective. The details of these experiments are reported in Sec. 5 (and further explored in Appendix G). They show that DID is very robust to significant transformations $f \circ Q$ of the original signal $f$. Additionally, we observe that $D_\lambda$ is very discriminative, in contrast to less diffeomorphism-invariant metrics such as the Euclidean distance, when comparing $D_\lambda(f, f \circ Q)$ with $D_\lambda(f, g)$ for a random signal $g$. We point out, however, that the theoretical analysis of DID's discriminative abilities is beyond the scope of this work (whose aim is to introduce the discrepancy and study its invariance properties) and we postpone it to future research.

4. Computing the dissimilarity

Before deriving the closed-form solution for $D_\lambda$, we need to recall some basic properties of kernels. Reproducing kernels and RKHSs satisfy the so-called reproducing property: there exists a map $\psi : X \to H$ such that, for any $f \in H$ and $x \in X$, it holds that $f(x) = \langle f, \psi(x)\rangle_H$, where $\langle\cdot,\cdot\rangle_H$ is the scalar product associated to the RKHS $H$. Moreover, $k_X(x, x') = \langle \psi(x), \psi(x')\rangle_H$ for all $x, x' \in X$. The same holds for $k_Y$ and $F$: there exists $\Phi : Y \to F$ such that $v(y) = \langle v, \Phi(y)\rangle_F$ for all $v \in F$, $y \in Y$, and $k_Y(y, y') = \langle \Phi(y), \Phi(y')\rangle_F$ for all $y, y' \in Y$, where $\langle\cdot,\cdot\rangle_F$ is the inner product associated to $F$.
In particular, note that, since we assumed that $k_X, k_Y$ are bounded kernels, there exist two constants $\kappa_X, \kappa_Y$ such that $\sup_{x \in X} \|\psi(x)\|_H \le \kappa_X$ and analogously $\sup_{y \in Y} \|\Phi(y)\|_F \le \kappa_Y$.

4.1. Closed form solution

In Definition 2.1, we define $D_\lambda(f, g)$ as an optimization problem in $H \times H$. In fact, this optimization problem has a closed-form solution, given by an eigenvalue problem for an operator between $H$ and $F$, as derived in Eq. (7). We introduce the relevant objects and prove Thm. 4.2.

Definition 4.1 (Operators $F_\mu$, $G$). Given $f, g : X \to Y$, the feature map $\Phi : Y \to F$ and the mask function $\mu : X \to \mathbb{R}$, define the linear operators $F_\mu, G : H \to F$ as follows:
$$F_\mu := \int_X \Phi(f(x)) \otimes \psi(x)\, \mu(x)\, dx, \qquad (4)$$
$$G := \int_X \Phi(g(x)) \otimes \psi(x)\, dx. \qquad (5)$$

Definition 4.1 is a compact notation for the integral operators in the RKHS. Noticing that we can rewrite $\Delta_{f,g}$ (see below) as a function of $F_\mu$ and $G$ is key to deriving the closed-form expression and the finite-dimensional approximation in Eqs. (10) and (11), which can be computed in practice. When $X$ is a bounded set, the two operators above are trace class and, by the representer property, $\langle v, F_\mu h\rangle_F = \int_X v(f(x))\,h(x)\,\mu(x)\,dx$ and $\langle v, Gq\rangle_F = \int_X v(g(x))\,q(x)\,dx$ (see Lemma C.1 in Appendix C for a detailed proof). Using this result and considering the linearity of $\langle\cdot,\cdot\rangle_F$ and the variational characterization of the Hilbert norm (i.e. $\|u\|_F = \max_{\|v\|_F \le 1} |\langle v, u\rangle_F|$), we have $\Delta_{f,g}(h, q) = \|F_\mu h - Gq\|_F^2$ for any $h, q \in H$, from which we characterize $D_\lambda$ as
$$D_\lambda(f, g) = \max_{\|h\|_H \le 1} \; \min_{q \in H} \; \|F_\mu h - Gq\|_F^2 + \lambda \|q\|_H^2. \qquad (6)$$
To conclude, note that the optimization problem in the equation above has a closed-form expression in terms of the operator norm of an operator depending on $F_\mu$ and $G$. All the reasoning above is formalized below.

Theorem 4.2 (Closed-form solution). Let $X \subset \mathbb{R}^d$ be an open bounded set. Using the notation above, we have:
$$D_\lambda(f, g) = \lambda\, \big\| (G G^* + \lambda I)^{-1/2} F_\mu \big\|_{\mathrm{op}}^2. \qquad (7)$$
The proof of Thm. 4.2 comes from identifying the inner optimization problem as a linear regression problem, and the outer maximization problem as an eigenproblem. The complete proof is presented in Appendix C.1.

4.2. Approximate computation

Although Thm. 4.2 gives a closed-form expression of $D_\lambda(f, g)$, $F_\mu$ and $G$ are defined as integral operators between infinite-dimensional Hilbert spaces. In practice, we only have access to a discretization of $f$ and $g$ (e.g. an image is a discretized spatial signal represented by $N$ pixels). A first natural approximation is thus to replace the integrals with empirical counterparts. This estimate is then a sum of rank-one operators between $H$ and $F$. To reduce the computational cost, while keeping good accuracy, we can then further approximate it using Nyström methods for kernels. The resulting estimator is $\hat D_\lambda$, presented in Eq. (12); its convergence to $D_\lambda$ is studied in Thm. 4.4.

Figure 3. Invariance to rotation. We consider a patch $f$ (size 100 × 100) of a larger scene (peppers.jpeg) and compare it to its rotated versions $\mathrm{rotate}(f, \alpha)$, where $\alpha$ is an angle. $D(f, \mathrm{rotate}(f, \alpha))$ and $\mathrm{RMSE}(f, \mathrm{rotate}(f, \alpha))$ are represented with distinct markers. Random regimes are represented by shaded areas (see text). Both $\hat D_\lambda$ and the RMSE seem constant as a function of $\alpha$ (although for $\alpha = 0°$ or $\alpha = 180°$ a smaller value is achieved).
However, the RMSE of the rotated patches falls in (or close to) the confidence interval, making them indistinguishable from random patches from the same image. $\hat D_\lambda$ takes values that are over 10× smaller for rotated patches than for random patches. Hence, DID is invariant to rotation, whereas the RMSE is constant (for $\alpha > 0$). Here $\lambda = 10^{-6}$. See Sec. 5.4 for more details.

Quadrature approximation. We replace $F_\mu$ with an estimator $F_{\mu,N}$ and $G$ with an estimator $G_N$:
$$F_{\mu,N} := \frac{v_X}{N} \sum_{i=1}^N \Phi(f(x_i)) \otimes \psi(x_i)\,\mu(x_i), \qquad (8)$$
$$G_N := \frac{v_X}{N} \sum_{i=1}^N \Phi(g(x_i)) \otimes \psi(x_i), \qquad (9)$$
where $v_X := \int_X dx$ is the volume of the domain $X$. Note that the set of points $\{x_1, \ldots, x_N\}$ can be chosen at random or arbitrarily so as to best approximate the integrals, and $F_\mu$ and $G$ can be approximated using different points. In practice, they are often given as the positions of the pixels of an image or the sample times of a time series.

Nyström approximation. From the previous section, it is clear that $\mathrm{rank}(F_{\mu,N}) \le N$ and $\mathrm{rank}(G_N) \le N$. This justifies using the low-rank approximations we introduce in this section. It is possible to further reduce the rank of the matrices, while keeping good accuracy, by using the so-called Nyström approximation (Williams and Seeger, 2001; Drineas et al., 2005; Rudi et al., 2015). Let $M_X, M_Y \in \mathbb{N}$ and choose the sets of points $\tilde X = \{\tilde x_1, \ldots, \tilde x_{M_X}\} \subset X$ and $\tilde Y = \{\tilde y_1, \ldots, \tilde y_{M_Y}\} \subset Y$. The Nyström approximation of a vector $v \in H$ consists of the projected vector $P_{\tilde X} v$, where $P_{\tilde X} : H \to H$ is the projection operator whose range is $\mathrm{span}\{\psi(\tilde x_1), \ldots, \psi(\tilde x_{M_X})\}$. Note, in particular, that $P_{\tilde X}$ has rank $M_X$ when $k_X$ is universal. $P_{\tilde Y} : F \to F$ on $Y$ is defined analogously.

Combining the two approximations. Assume in this section that the kernels of choice are universal (e.g., the Laplace or the Gaussian kernel). Let $P_{\tilde X} : H \to H$ be the projection operator associated to the Nyström points on $X$ and $P_{\tilde Y}$ the one associated to the Nyström points on $Y$. Combining the two approximations, we define the following estimator for $D_\lambda$,
$$\hat D_\lambda(f, g) := \lambda\, \big\| (P_{\tilde Y} G_N P_{\tilde X} G_N^* P_{\tilde Y} + \lambda I)^{-1/2}\, P_{\tilde Y} F_{\mu,N} P_{\tilde X} \big\|_{\mathrm{op}}^2 .$$
Note, however, that $\hat D_\lambda(f, g)$ has a finite-dimensional characterization that we now derive. Let $K_{\tilde Y f} \in \mathbb{R}^{M_Y \times N}$ be the matrix defined by $(K_{\tilde Y f})_{i,j} = k_Y(\tilde y_i, f(x_j))$, let $K_{\tilde Y g}$ be defined analogously, and finally let $K_{X \tilde X} \in \mathbb{R}^{N \times M_X}$ be the matrix defined by $(K_{X \tilde X})_{i,j} = k_X(x_i, \tilde x_j)$. Let $\hat\mu = [\mu(x_1), \ldots, \mu(x_N)]^\top$ and let $R_{\tilde X} \in \mathbb{R}^{M_X \times M_X}$ be the upper-triangular Cholesky factor of $K_{\tilde X \tilde X}$, defined by $(K_{\tilde X \tilde X})_{ij} = k_X(\tilde x_i, \tilde x_j)$. Analogously, define $K_{\tilde Y \tilde Y}$ and its Cholesky factor $R_{\tilde Y} \in \mathbb{R}^{M_Y \times M_Y}$. Note that the decomposition exists since the kernel $k_X$ is universal, so the kernel matrix $K_{\tilde X \tilde X}$ is invertible. We introduce the following matrices $\hat A, \hat B \in \mathbb{R}^{M_Y \times M_X}$, which are the finite-dimensional representations, in the appropriate spaces $\mathbb{R}^{M_X}, \mathbb{R}^{M_Y}$, of $P_{\tilde Y} F_{\mu,N} P_{\tilde X}$ and $P_{\tilde Y} G_N P_{\tilde X}$ respectively:
$$\hat A := \frac{v_X}{N}\, R_{\tilde Y}^{-\top}\, K_{\tilde Y f}\, \mathrm{diag}(\hat\mu)\, K_{X \tilde X}\, R_{\tilde X}^{-1}, \qquad (10)$$
$$\hat B := \frac{v_X}{N}\, R_{\tilde Y}^{-\top}\, K_{\tilde Y g}\, K_{X \tilde X}\, R_{\tilde X}^{-1}. \qquad (11)$$
In particular, we have the following characterization of $\hat D_\lambda$.

Lemma 4.3. With the notation above,
$$\hat D_\lambda(f, g) = \lambda\, \big\| \hat A^\top (\hat B \hat B^\top + \lambda I)^{-1} \hat A \big\|_{\mathrm{op}}. \qquad (12)$$
Note that, in practice, $\tilde X$ and $\tilde Y$ can be chosen either deterministically, e.g. on a grid, or randomly.
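For concreteness, a minimal NumPy sketch of Eqs. (10)-(12) is given below, assuming the quadrature points, function values, mask values and Nyström points are provided as arrays, and using the Gaussian/Laplace kernels of Sec. 5.1. Function names, the jitter added for numerical stability and the default hyper-parameters are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np
from scipy.linalg import cholesky, solve

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def laplace_kernel(A, B, a):
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
    return np.exp(-a * d)

def did(coords, f_vals, g_vals, mask, Xn, Yn, lam, vol_X=1.0, sigma=1/6, a=5.0):
    """Approximate D_lambda(f, g) as in Eq. (12).
    coords: (N, d) quadrature points; f_vals, g_vals: (N, p) values of f and g;
    mask: (N,) values of mu; Xn: (MX, d) and Yn: (MY, p) Nystrom points."""
    N = coords.shape[0]
    # Upper-triangular Cholesky factors of the Nystrom kernel matrices (small jitter for stability).
    R_X = cholesky(gaussian_kernel(Xn, Xn, sigma) + 1e-10 * np.eye(len(Xn)))
    R_Y = cholesky(laplace_kernel(Yn, Yn, a) + 1e-10 * np.eye(len(Yn)))
    K_Xx = gaussian_kernel(coords, Xn, sigma)          # (N, MX)
    K_Yf = laplace_kernel(Yn, f_vals, a)               # (MY, N)
    K_Yg = laplace_kernel(Yn, g_vals, a)               # (MY, N)

    # A_hat = (v_X / N) R_Y^{-T} K_Yf diag(mu) K_Xx R_X^{-1}, Eq. (10); B_hat analogous, Eq. (11).
    A = vol_X / N * solve(R_Y.T, K_Yf * mask) @ solve(R_X.T, K_Xx.T).T
    B = vol_X / N * solve(R_Y.T, K_Yg) @ solve(R_X.T, K_Xx.T).T

    # Optional normalization by operator norms, as in Sec. 5.1:
    # A /= np.linalg.norm(A, 2); B /= np.linalg.norm(B, 2)

    C = solve(B @ B.T + lam * np.eye(B.shape[0]), A)   # (B B^T + lam I)^{-1} A
    return lam * np.linalg.norm(A.T @ C, 2)            # largest eigenvalue, Eq. (12)
```

For an image pair, `coords`, `f_vals`, `g_vals` and `mask` can be obtained as in the earlier image-as-function sketch, with Nyström points subsampled from the coordinates and the colors.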
Now, we provide a bound on the approximation error associated to $\hat D_\lambda(f, g)$. We assume that the $N$ points $x_1, \ldots, x_N$ are sampled independently and uniformly at random in $X$ and, moreover, that the $M_X$ points in $\tilde X$ and the $M_Y$ points in $\tilde Y$ are sampled independently and uniformly at random in $X$ and $Y$ respectively (a similar result can be derived for a grid).

Theorem 4.4. Let $\delta \in (0, 1)$. Let $X \subset \mathbb{R}^d$, $Y \subset \mathbb{R}^p$ be bounded sets and $k_X, k_Y$ be Sobolev kernels with smoothness, respectively, $s + d/2$ and $z + p/2$, for some $s, z > 0$. There exist two constants $c_1, c_2$ such that, when $M_X \ge c_1$ and $M_Y \ge c_2$, the following holds with probability at least $1 - \delta$,
$$|\hat D_\lambda(f, g) - D_\lambda(f, g)| \;\le\; \frac{c \,\log\frac{1}{\delta}}{\lambda} \left( \frac{1}{\sqrt{N}} + \frac{(\log M_X)^{\alpha}}{M_X^{s/d}} + \frac{(\log M_Y)^{\beta}}{M_Y^{z/p}} \right),$$
for any measurable $f, g : X \to Y$, where $\hat D_\lambda(f, g)$ is defined as in Eq. (12). Here $c_1, c_2, c$ depend only on $X, Y, \mu, s, z, d, p$, while $\alpha = s/d + 1/2$ and $\beta = z/p + 1/2$.

The theorem above shows that the estimation error of $\hat D_\lambda$ with respect to $D_\lambda$ goes to 0 as $N, M_X, M_Y \to \infty$. On the contrary, the error diverges as $\lambda \to 0$. This is in accordance with the fact that the error is of variance type, and shows that $\lambda$ plays the role of a regularization parameter. The bound also shows that, when $s \gg d$ and $z \gg p$, i.e. when we choose very smooth Sobolev kernels, the decay rate of the error in $M_X$ and $M_Y$ is faster. For example, if we choose $s = rd$, $z = rp$ for some $r > 0$, then choosing $M_X = M_Y = O(N^{1/(2r)})$ leads to the rate $|\hat D_\lambda(f, g) - D_\lambda(f, g)| = O\big(\tfrac{1}{\lambda \sqrt N}\big)$ up to logarithmic factors.

On the choice of $\lambda$. To conclude, choosing $\lambda = N^{-1/4}$ guarantees a final convergence rate of $\hat D_\lambda$ to $D_\lambda$ of the order of $N^{-1/4}$ and, together with Thm. 3.2, a level of invariance to diffeomorphisms for $\hat D_\lambda$ of the order $\hat D_\lambda(f, f \circ Q) = O(N^{-1/4} C_\mu^2 C_Q^2)$, which can serve as a statistically significant threshold to decide whether, in practice, $f \approx g$ up to a diffeomorphism. Clearly, increasing $r$ (and hence $s$), while reducing the number of Nyström points required in the approximation (with important computational implications discussed below), increases the constant $C_Q$, as shown, e.g., in Example 3.3, where $m = s + d/2$.

Algorithm and computational complexity. The final form of the empirical estimator $\hat D_\lambda$ is Eq. (12). An efficient algorithm to compute it consists in (1) first computing the matrices $\hat A$ and $\hat B$, (2) then the inverse $\hat C = (\hat B \hat B^\top + \lambda I)^{-1}$, and (3) finally computing the largest eigenvalue of $\hat A^\top \hat C \hat A$ via a power iteration method (Trefethen and Bau III, 1997). Assuming that (a) the cost of one kernel computation in $\mathbb{R}^d$ is $O(d)$ (as is the case for any translation-invariant kernel such as the Laplace kernel) and (b) $M_X \le N$ and $M_Y \le N$ (which is reasonable in light of Thm. 4.4), the cost of computing $\hat D_\lambda$ with the algorithm above is $O(d N M_X + p N M_Y + M_X M_Y^2 + M_X^3 + M_Y^3)$. Choosing the parameters as in the discussion after Thm. 4.4, with $r = 1$, would lead to a total computational cost of $O(N^{1.5}(p + d))$. Computing DID between different pairs of data points can easily be parallelized, as each computation relies on matrix-vector products and matrix inversions.
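The power-iteration step mentioned above can be sketched as follows for the symmetric positive semi-definite matrix $\hat A^\top \hat C \hat A$; this is a generic textbook routine, not code from the paper.

```python
import numpy as np

def largest_eigenvalue(M, n_iter=200, tol=1e-10, seed=0):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix M,
    e.g. M = A_hat.T @ C_hat @ A_hat in the computation of D_lambda."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    lam_old = 0.0
    for _ in range(n_iter):
        w = M @ v
        lam = v @ w                    # Rayleigh quotient (v has unit norm)
        if abs(lam - lam_old) < tol * max(1.0, abs(lam)):
            break
        v = w / np.linalg.norm(w)
        lam_old = lam
    return lam
```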
Figure 4. Effect of regularization. We consider the same image and setting as in Fig. 2 and vary $T$ and $\lambda$ (shaded areas are standard deviations over 50 warps). Observe: (1) $\hat D_\lambda$ increases as a function of $T$ (as the norm of the diffeomorphism increases); (2) $\hat D_\lambda$ is proportional to $\lambda$. Both of these phenomena are predicted by Thm. 3.2. See Sec. 5.5 for more details.

Choice of kernel hyperparameters. As in any kernel-based method, selecting kernel hyper-parameters is important. The usual considerations apply: we advise using cross-validation to choose hyper-parameters (such as the bandwidth). We do not discuss this further for space considerations. In the theoretical analysis above, the rates in $N$ hold for any parameter of the kernel, but the constants depend on the chosen parameters. This is typical of the Sobolev analysis technique, where the important parameter is the kernel regularity $m$, which corresponds to the differentiability class of the diffeomorphisms one aims to capture. Finally, the method introduced is empirically quite robust to the choice of parameters.

5. Experiments

This section investigates the empirical performance of DID. We observe that, in line with our results in Thm. 3.2, when $g = f \circ Q$ the resulting $D_\lambda$ is small, while it is consistently large for signals that are not diffeomorphic versions of $f$.

5.1. Implementation details

We implemented $\hat D_\lambda$ as in Eq. (12), as described at the end of Sec. 4.2, using standard linear algebra routines such as matrix and matrix-vector products, matrix inversions, and eigendecompositions. The Python source code used for the experiments presented here is freely available at https://github.com/theophilec/diffy; it depends on NumPy and PyTorch and supports using GPUs.

Normalization. In practice we normalize $\hat A$ and $\hat B$ by their operator norms $\|\hat A\|_{\mathrm{op}}$ and $\|\hat B\|_{\mathrm{op}}$. This normalizes $\hat D_\lambda$ between 0 and 1 and makes interpretation easier.

Choice of mask. We choose $\mu$ to be a Blackman window, a standard windowing function in image processing. In 1-D, the Blackman window is defined for any $0 \le t \le 1$ as $\mu(t) = 0.42 - 0.5\cos(2\pi t) + 0.08\cos(4\pi t)$ (Oppenheim et al., 1999). We generalize it to higher dimensions by taking its tensor product over dimensions.

Choice of kernel. Because we work with images in the experiments we present, $X \subset \mathbb{R}^2$ (coordinate space) and $Y \subset \mathbb{R}^3$ (color space). We consider the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2/(2\sigma^2))$ on $X$ and the Laplace kernel $k(y, y') = \exp(-a\|y - y'\|)$ on $Y$. For the experiments presented in this paper, unless otherwise stated, DID has parameters $M_X = 100$, $M_Y = 163$, $\sigma = 1/6$ and $a = 5$.

Datasets. We rely on images from Imagenet (more precisely from the Imagenette subset), example images from the Matlab software (peppers), and finally images taken with our personal devices for illustration (raccoon, flowers, objects). All images are made available with the source code.

Diffeomorphism generation. Diffeomorphisms are obtained either by affine transformations or by generating warpings. In particular, for warpings, we use the code from Petrini et al. (2021). We generate random transformations of images by displacing each of their coordinates independently (while enforcing zero displacement at the edge of the grid), then interpolating the colors. We choose the standard deviation of the displacement of each pixel so as to obtain a transformation of given (average) displacement norm. This can be controlled by two parameters, $T$ (a temperature, between $10^{-4}$ and 10) and $c$ (a cut-off parameter, taking $c = 2$). We denote by $\mathrm{warp}(f, T)$ such a (random) warp; a simplified generator is sketched below, just before Sec. 5.3. Examples of warps for various $T$ parameters (and samples) are provided in Appendix H.

5.2. Illustrative examples

In Fig. 1, we show how the $h$ and $q$ that optimize DID concentrate on related regions for different scenes which are related by a diffeomorphism, in particular rigid-body and perspective transformations for objects and keystone deformation for raccoon. We present supplementary illustrative examples in Appendix G.
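Before turning to the warping experiments, here is a rough sketch of such a random warp generator. The experiments in the paper use the generator of Petrini et al. (2021); the low-frequency sine parameterization below and the scaling of the displacement with $T$ are illustrative assumptions, not the actual code.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def random_warp(img, T, c=2, seed=None):
    """Apply a random smooth warp to an H x W x 3 image: each coordinate is
    displaced by a low-frequency random field (zero at the borders) whose
    magnitude grows with the temperature T, then colors are interpolated."""
    rng = np.random.default_rng(seed)
    H, W, _ = img.shape
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    def displacement(scale):
        # Sum of the first c x c sine modes with random coefficients of size ~ sqrt(T),
        # so the field is smooth and vanishes on the image boundary.
        u = np.zeros((H, W))
        for i in range(1, c + 1):
            for j in range(1, c + 1):
                coef = rng.standard_normal() * np.sqrt(T) / (i * i + j * j)
                u += coef * np.sin(np.pi * i * yy / (H - 1)) * np.sin(np.pi * j * xx / (W - 1))
        return u * scale  # convert to pixel units

    dy, dx = displacement(H), displacement(W)
    warped = np.stack(
        [map_coordinates(img[..., k].astype(float), [yy + dy, xx + dx], order=1, mode="nearest")
         for k in range(3)],
        axis=-1)
    return warped
```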
5.3. Invariance to warping

Diffeomorphisms (even infinitely regular ones) are a much wider class of transformations than rigid-body transformations such as scale, rotation and translation (or combinations thereof). In this experiment (Fig. 2), we evaluate DID's behavior against a wide family of transformations (which we call warps). Consider $f$ an image from Imagenette. For $T = 10^k$ with $-4 \le k \le 2$ and various images, we evaluate $\hat D_\lambda(f, \mathrm{warp}(f, T))$ as well as $\mathrm{RMSE}(f, \mathrm{warp}(f, T))$. We compare the values observed to $\hat D_\lambda(f, g)$ and $\mathrm{RMSE}(f, g)$ for random images $f$ and $g$. We call this the random regime (90% confidence interval). Finally, we look at the performance of DID on a fixed image (gas station), with repeated warps. Fig. 2 shows that DID is invariant to diffeomorphic warping for $T \lesssim 10^{-1}$, whereas the Euclidean distance increases exponentially with $T$ (making $\mathrm{warp}(f, T)$ indistinguishable from $g$, a random image).

5.4. Invariance to rotation

The peppers image is often used to demonstrate image registration techniques. In this experiment (see Fig. 3) we show that DID is invariant to rotation, using patches taken from it. Consider $f$ a patch from peppers and $g = \mathrm{rotate}(f, \alpha)$, a rotated version of $f$ by angle $\alpha$ (in practice, we rotate a larger patch and then crop to avoid artifacts). We then compare $D(f, \mathrm{rotate}(f, \alpha))$ and $\mathrm{RMSE}(f, \mathrm{rotate}(f, \alpha))$. We show that while the Euclidean distance is constant for $\alpha \neq 0$, DID is invariant. We compare the values of DID and of the Euclidean distance with their values for random patches from the image. As before, we call this the random regime (90% confidence interval). This shows that the Euclidean distance is not able to distinguish between a random patch and a rotated version of the same patch, while DID can. See Fig. 3 for the results of the experiments.

5.5. Effect of regularization

In order to understand the effect of regularization, we reuse the setup from Sec. 5.3 with a single image, varying $\lambda$. In Fig. 4, we observe two phenomena: (1) as $T$ increases, so does $D$; (2) as $\lambda$ decreases, so does $D$. This shows that DID behaves closely to what is predicted by Thm. 3.2. Indeed, $D$ appears proportional to $\lambda$. Also, as $T$ increases, so does the norm of the transformation between $f$ and $\mathrm{warp}(f, T)$, which makes the upper bound of Thm. 3.2 increase in turn.

Acknowledgements

T.C. gratefully acknowledges support from the French National Agency for Research, grant ANR-18-CE40-0016-01. C.C. acknowledges the support of the Royal Society (grant SPREM RGS\R1\201149) and Amazon.com Inc. (Amazon Research Award ARA). B.G. acknowledges partial support by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1; B.G. also acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02. A.R. acknowledges partial support from the French government under the management of the Agence Nationale de la Recherche as part of the Investissements d'avenir program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and support from the European Research Council (grant REAL 947908).

References

Robert A Adams and John JF Fournier. Sobolev spaces. Elsevier, 2003. Charalambos D Aliprantis and Owen Burkinshaw. Principles of real analysis. Gulf Professional Publishing, 1998. Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
Alberto Bietti and Julien Mairal. Group invariance, stability to deformations, and complexity of deep convolutional representations. Journal of Machine Learning Research, 20(25):1 49, 2019. Alberto Bietti, Luca Venturi, and Joan Bruna. On the sample complexity of learning under geometric stability. Advances in Neural Information Processing Systems 34 (Neur IPS 2021), 2021. Mathieu Blondel, Arthur Mensch, and Jean-Philippe Vert. Differentiable divergences between time series. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. G erard Bourdaud and Winfried Sickel. Composition operators on function spaces with fractional order of smoothness. Harmonic Analysis and Nonlinear Partial Differential Equations, 26:93 132, 2011. Joan Bruna and St ephane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872 1886, 2013. Martins Bruveris. Regularity of maps between sobolev spaces. Annals of Global Analysis and Geometry, 52(1): 11 24, 2017. Samuel Cohen, Giulia Luise, Alexander Terenin, Brandon Amos, and Marc Deisenroth. Aligning time series on incomparable spaces. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 894 903. PMLR, 06 11 Aug 2017. Measuring dissimilarity with diffeomorphism invariance E. De Castro and C. Morandi. Registration of translated and rotated images using finite Fourier transforms. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):700 703, 1987. Dennis De Coste and Bernhard Sch olkopf. Training invariant support vector machines. Machine learning, 46(1):161 190, 2002. Petros Drineas, Michael W Mahoney, and Nello Cristianini. On the Nystr om method for approximating a Gram matrix for improved kernel-based learning. journal of machine learning research, 6(12), 2005. Bernard Haasdonk and Hans Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine learning, 68(1):35 61, 2007. Imre Risi Kondor. Group theoretical methods in machine learning. Ph D thesis, 2008. Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, volume 27, 2014. Youssef Mroueh, Stephen Voinea, and Tomaso A Poggio. Learning with group invariant features: A kernel perspective. In Advances in Neural Information Processing Systems, volume 28, 2015. Francis Narcowich, Joseph Ward, and Holger Wendland. Sobolev bounds on functions with scattered zeros, with applications to radial basis function surface fitting. Mathematics of Computation, 74(250):743 763, 2005. Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck. Discrete-Time Signal Processing. Prentice-hall Englewood Cliffs, second edition, 1999. Leonardo Petrini, Alessandro Favero, Mario Geiger, and Matthieu Wyart. Relative stability toward diffeomorphisms indicates performance in deep nets. Advances in Neural Information Processing Systems 34 (Neur IPS 2021), 2021. B.S. Reddy and B.N. Chatterji. An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing, 5(8): 1266 1271, 1996. Alessandro Rudi and Carlo Ciliberto. 
PSD representations for effective probability models. Advances in Neural Information Processing Systems, 34, 2021. Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nystr om computational regularization. In NIPS, pages 1657 1665, 2015. Alessandro Rudi, Ulysse Marteau-Ferey, and Francis Bach. Finding global minima via kernel approximations. ar Xiv preprint ar Xiv:2012.11978, 2020. Thomas Runst and Winfried Sickel. Sobolev spaces of fractional order, Nemytskij operators, and nonlinear partial differential equations. de Gruyter, 2011. H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26 (1):43 49, 1978. John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam, 1997. Adrien Vacher, Boris Muzellec, Alessandro Rudi, Francis Bach, and Francois-Xavier Vialard. A dimension-free computational upper-bound for smooth optimal transport estimation. ar Xiv preprint ar Xiv:2101.05380, 2021. Titouan Vayer, Laetitia Chapel, Nicolas Courty, R emi Flamary, Yann Soullard, and Romain Tavenard. Time series alignment with global invariances. preprint, February 2020. Holger Wendland. Scattered data approximation, volume 17. Cambridge university press, 2004. Christopher Williams and Matthias Seeger. Using the Nystr om method to speed up kernel machines. In Proceedings of the 14th annual conference on neural information processing systems, pages 682 688, 2001. Measuring dissimilarity with diffeomorphism invariance Appendix A. Background We recall here the classical version of change of variable theorem for Rd, see e.g. Thm. 40.7 of Aliprantis and Burkinshaw (1998). Theorem A.1 (Change of Variables). Let V be an open set in Rd and Q : Rd Rd be an injective continuosly differentiable map. Let f L1(V ). Denote by Q(x) Rd d the gradient of Q for any x Rd. Then, we have Z Q 1(V ) f(Q(x))| Q(x)|dx = Z A.1. Sobolev spaces We recall some basic properties of Sobolev spaces. The Sobolev space Hm(Z) for m > 0 and an open set Z Rd is defined as follows (Adams and Fournier, 2003) Hm(Z) = f L2(Rd) f Hm(Z) < , f 2 Hm(Z) = X α1+ +αdf(x) xα1 1 . . . xαd d Analogously, we define the space Hm(Z, Rt) for functions with output in Rt, t 1, with norm f 2 Hm(Z,Rt) = Pt j=1 fj 2 Hm(Z). In the following theorem we collect some important properties of Hm(Z) that will be useful for the proof of Thm. 3.2. Theorem A.2. Let m > d/2 and Z be an open set with locally Lipschitz continuous boundary. The following properties hold (a) Hm(Z) is a Reproducing Kernel Hilbert space. Moreover, Hm(Z) C(Z), and, when Z is bounded we have f|Z Hm(Z), for any f Cm(Rd). (b) Restriction and extension. For any u Hm(Rd), it holds that u|Z Hm(Z) and u|Z Hm(Z) c1 u Hm(Rd). Moreover, for any u Hm(Z) there exists a function EZ[u] H(Rd) such that (EZ[u])|Z = u and EZ[u] Hm(Rd) c2 u Hm(Z). The constants c1, c2 depend only on Z, m, d. (c) Pointwise product. For any u, v Hm(Z) we have u v Hm. In particular, there exists c0 > 0 depending only on Z, m, d such that u v Hm(Z) c0 u Hm(Z) v Hm(Z). Proof. The space Hm(Rd) is a RKHS due to the characterization of the norm with respect to the Fourier transform (Wendland, 2004). The space Hm(Z) is a RKHS since any restriction of a RKHS to a subset of the set of definition is still a RKHS (Aronszajn, 1950). 
The inclusions derive directly by the definition of the Hm norm (see Adams and Fournier, 2003, for more embeddings of Sobolev spaces). The second point is a classical result on Sobolev space and is derived in Adams and Fournier (2003). The third point is equivalent to showing that the Sobolev spaces with m > d/2 are Banach algebras and it is derived in Adams and Fournier (2003). For the case Z = Rd an explicit derivation based on the Fourier transform is done in Lemma 10 of Rudi et al. (2020). B. Proof of the more general version of Theorem 3.2 In the next subsection we introduce some preliminary results that will be necessary to prove the more general version of Theorem 3.2, which is in the subsection Appendix B.2. B.1. Preliminary result on composition of Sobolev functions The following theorem quantifies the fact that a Sobolev space with m > d/2 is closed with respect to composition with diffeomorphisms. There exists many abstract results about the closure of composition in Sobolev space in the literature (see Measuring dissimilarity with diffeomorphism invariance e.g. Bruveris, 2017). Here, however, we want a quantitative bound. We base our result on the explicit bound in Bourdaud and Sickel (2011). To obtain a final readable form we have to do a bit of slalom between restriction and extension between Z and Rd. Indeed, we need functions that are equivalent to the functions of interest on Z, but whose norm does not diverge when going to Rd. For example, constant functions don t belong to Hm(Rd), but for us it is enough to have a function that is equal to a constant on Z and that goes to zero at infinity fast enough to have finite Hm(Z) norm. Theorem B.1 (Smooth composition). Let m > d/2. Let Q : Rd Rd be an invertible m-times differentiable map whose inverse has continuous Lipschitz derivative. Let Z be open bounded set with Lipschitz boundary and a compact Ωsuch that Ω Z and Q 1(Ω) Z. Assume also, without loss of generality, that 0 is in the interior of Ω. For any h Hm(Rd) supported on Ωthe following holds: 1. there exists b Hm(Rd) satisfying b(x) = h(Q(x)) for any x Q 1(Ω). 2. if h(x) = 0 for any x Rd \ Ω, then b(x) = 0 for any x Rd \ Q 1(Ω), 3. the norm of b is controlled by b Hm(Rd) C h Hm(Z) d m d/2 Ω,Q (1 + Q Hm(Z,Rd) + Dm + Lm), (13) where dΩ,Q := min[1, d H(Q 1(Ω), Q 1(Z) Z)], d H(A, B) is the Haussdorff distance between two sets A, B and D := diam(Z), L = maxx Z Q(x) , while C depends only on d, m, Z. Proof. The fact To apply Bourdaud and Sickel (2011) we need a function such that h(0) = 0. With this aim, we rewrite h Q as h Q = h(0) + ((h h(0)) Q). Step 1. Construction of b. Denote by U the open set U = Q 1(Z) Z. Note that the set is not empty since, by construction, Q 1(Ω) U. Define Q = EZ[Q|Z], i.e. the extension to the whole Rd of the restriction of Q on the set Z (restriction and extension done componentwise). By the first point we have that Q|Z belongs to Hm(Z, Rd) and, by the second point, that Q belongs to Hm(Rd, Rd) and moreover Q(x) = Q(x) for any x Z. Denote by s = h(0) C (Rd) the constant function equal to h(0) everywhere on Rd. In particular, note that s|Z Hm(Z) via Thm. A.2(b). Denote by τ, the extension of s|Z to Rd, i.e., τ = EZ[s|Z]. Denote by u, the function u = h|Z s|Z and by u Hm(Rd) the function u = EZ[h|Z s|Z]. Note that u(x) = h(x) h(0) for any x Z, and, in particular for any x Ω. Define by ρ the C (Rd) function that is 1 on Q 1(Ω) and 0 on Rd U. Now define, b = ρ (τ + u Q). Denote by b the function b = τ + u Q. We have that b(x) = h(Q(x)) for any x U. 
Since Q(U) Z and that h(x) = h(x), Q(x) = Q(x) for any x Z. In particular, since b(x) = 0 for all x U \ Q 1(Ω) and by definition of ρ, we have: (a) ρ b = b = h(Q(x)) on Q 1(Ω), (b) ρ b = 0 on UQ 1(Ω); (c) ρ b = 0 and 0 on Rd \ U. Then b = ρ b = h(Q(x)) for any x Rd. Step 2. Bound of b. Now, let us bound the norm of b. By applying Thm. A.2(c) we have b Hm(Rd) c ρ Hm(Rd)( τ Hm(Rd) + u Q Hm(Rd)). (14) The bound for τ is obtained applying Thm. A.2(b) and the definition of Hm(Z) norm, as follows, τ Hm(Rd) c2(Z) s|Z Hm(Z) = c2h(0)vol(Z)1/2. Now we bound the norm u Q Hm(Rd) with respect to the norms of u and Q. We use a result on the composition of Sobolev functions (Bourdaud and Sickel, 2011, Theorem 27) that is highly technical, but allows to highlight the quantities of interests for us. For a more extensive treatment of the topic see for example Runst and Sickel (2011). Since u(0) = 0 by Measuring dissimilarity with diffeomorphism invariance construction, by applying Theorem 27 of Bourdaud and Sickel (2011) with p = 2 and considering that their norm W m Ep is bounded by Hm(Rd) and analogously W 1 mp W 1 W 1 (Rd), u Q Hm(Rd) c( u Hm(Rd) + u W 1 (Rd))( Q Hm(Rd,Rd) + Q m W 1 (Rd,Rd)) (15) Here z W 1 (A,Rd) = supx A,j {1,...,t} zj for any differentiable z : A R and open set A. Now to conclude, note that u Hm(Rd) = EZ[h|Z s|Z] Hm(Rd) c2 h|Z s|Z Hm(Z) c2 h Hm(Z) + c2h(0)vol(Z)1/2, moreover Q Hm(Rd,Rd) = EZ(Q|Z) Hm(Rd,Rd) c2 Q|Z Hm(Z,Rd). Note also that for the Sobolev space W 1 there exists the same type of result as Thm. A.2(b) and for the same extension operator defined in Thm. A.2(b) (which is a total extension operator, see e.g. the Stein extension theorem, Thm 5.4 page 154 of Adams and Fournier, 2003), so, as above, we have u W 1 (Rd) c2 h W 1 (Z) + c2h(0), Q W 1 (Rd,Rd) c2 Q|Z W 1 (Z,Rd) = c2 sup x Z max( Q(x) , Q(x) ) c2diam(Z) + c2 sup x Z Q(x) . Substituting the six bounds above in Eq. (14), we obtain b Hm(Rd) c ρ Hm(Rd)(c2h(0)vol(Z)1/2 + c ( h Hm(W ) + h W 1 (Z) + h(0)c )( Q|Z Hm(Z,Rd) + Dm + Lm)), where D = diam(Z)m, L = supx Z Q(x) and c = cc2 2(1 + 2mcm 1 2 ), c = 1 + vol(Z)1/2, with c, c2 depending only on d, m, Z. The final result is obtained considering that x A + d(B + y A)R (1 + d + dy)(B + A)(1 + R) for any x, y, d, A, B, R 0, and applying this result with x = c2(Z)vol(Z)1/2, A = h(0), d = c , B = h Hm(Z) + h W 1 (Z), y = c , R = Q|Z Hm(Z,Rd) +Dm +Lm. In particular, in the final result, the constant C corresponds to C = 3(1+d+dy) and we used the fact that h(0) h W 1 (Z) h Hm(Z), then h Hm(Z) + h W 1 (Z) + h(0) 3 h Hm(Z). Step 3. The norm of ρ . Let At be the set At = {x | miny Q 1(Ω) x y t}. Let η = d H(Q 1(Ω), U), i.e., the Haussdorff distance between the sets Q 1(Ω) and U, corresponding to the largest η for which Aη U. Note that η > 0 since Q 1(Ω) is compact, while U is open and Q 1(Ω) U. The fact that η > 0 implies that for any η < η it holds that Aη U. The function ρ can be obtained by the convolution of the indicator function 1Aη/2 with the bump function ψη/2(x) = (η/2) dψ(x/(η/2))/S where S = R Rd ψ(x)dx and ψ is an infinitely smooth non-zero non-negative function that is 0 on x 1 as for example ψ(x) = exp( 1/(1 x 2)+) for any x Rd, where (z)+ = max(0, z). In particular, since (a) f(y ) Hm(Rd) = f Hm(Rd) for any y Rd, by construction of the norm Hm(Rd), and (b) t df( /t) Hm(Rd) c6t d/2 max(1, t m) f Hm(Rd) (see, e.g., Proposition 3 of Runst and Sickel, 2011) we have Aη/2 ψη(y ) Hm(Rd)dy vol(Aη/2) ψη Hm(Rd) c6 ψ Hm(Rd) vol(Aη/2) (η/2) d/2 max(1, (η/2) m). 
To conclude note that vol(Aη) vol(Z), since Aη/2 U Z. Lemma B.2 (Existence and norm of q Hm(Rd)). Let µ be an infinitely differentiable function, with compact support Ω Rd. Let Q : Rd Rd be a Cm+1 diffeomorphism. Let Z be an open bounded set with Lipschitz boundary and such that Ω Z and Q 1(Ω) Z. Moreover, assume without loss of generality that 0 is in the interior of Ω. For any h Hm(Rd), there exists a function q Hm(Rd) satisfying ( h(Q(y))µ(Q(y))| Q(y)| y Q 1(Ω) 0 y Rd \ Q 1(Ω) . (16) Measuring dissimilarity with diffeomorphism invariance In particular, let b Hm(Rd) be defined according to Thm. B.1, then q Hm(Rd) C h Hm(Rd) µ Hm(Rd)d m d/2 Ω,Q Q d Hm(Z,Rd)(1 + Q Hm(Z,Rd) + Dm + Lm), the constant C depends only on d, m, Z, while dΩ,Q := min[1, d H(Q 1(Ω), Q 1(Z) Z) and d H(A, B) is the Haussdorff distance between two sets A, B. Proof. Let h Hm(Rd) and µ to be the extension of µ on Rd (which corresponds to µ(x) = µ(x) for any x Ωand µ(x) = 0 for x Rd \ Ω). Define the function s(x) = h µ and note that s Hm(Rd) since it is the product of two functions in Hm(Rd) (see Thm. A.2(c)). Second, note that the function r = (| Q|)|Z Hm(Z) since (a) the map Q belongs to Cm+1(Rd, Rd) so its entries ( Q)i,j|Z belong to Hm(Z) (see Thm. A.2(a)); and (b) the determinant of a matrix, by the Leibniz formula, is defined in terms of sums and products of its entries and Hm(Z) is closed with respect to multiplication, by Thm. A.1. To quantify its norm let s write explicitly the Leibniz formula (Trefethen and Bau III, 1997), σ Sd sgn(σ) i=1 e σi Q(x) where sgn(σ) { 1, 1}, Sn is the set of permutations of of d elements and e1, . . . , ed is the canonical basis of Rd. Note, in particular, that, by the equation above, since |Sd| = d! and that e σi Q(x) xi Hm(Z) Q Hm(Z,Rd), we have that r Hm(Z) d! Q d Hm(Z,Rd). Now consider the function b Hm(Rd) defined according to Thm. B.1 we have that b(x) = s(Q(x)) for any x Rd. Now, define q as follows q = b EZ[r]. The function q is in Hm(Z), since it is the product of two functions in Hm(Z) (see Thm. A.2(c) and r Hm(Z) (see Thm. A.2(b)). Note, in particular, that Eq. (16) holds, by construction. To conclude, note that, by applying Thm. A.2(c) and Thm. A.2(b), we have q Hm(Rd) c0 EZ[r] Hm(Rd) b Hm(Rd) c0c2 r Hm(Z) b Hm(Rd) d!c0c2 Q d Hm(Z,Rd) b Hm(Rd). To conclude, we bound b according with Thm. B.1, obtaining b Hm(Rd) C s Hm(W ) d m d/2 Ω,Q (1 + Q Hm(Z,Rd) + Dm + Lm), and s Hm(W ) s Hm(Rd) c0 h Hm(Rd) µ Hm(Rd), by Thm. A.2(c). B.2. Proof of the general version of Theorem 3.2 Thm. 3.2 is a particular case of the following theorem Theorem B.3. Let X Rd be an open bounded set with Lipschitz boundary. Let µ C (Rd) with compact support Ω X. Let the RKHS H be H = Hm(Rd), i.e. the Sobolev space of smoothness m, with m > d/2. Denote by F the RKHS induced by the kernel k Y on Y , that we assume uniformly bounded. Then for any Cm+1 diffeomorphism Q on Rd satisfying Q 1(Ω) X, we have that for all f, g measurable functions, g(x) = (f Q)(x), x Q 1(Ω) implies D(f, g) λ CµCQ, where Cµ = µ Hm(Rd), and CQ is defined in Eq. (19) below and depends only on Ω, X, Q, d, m. Proof. Now we have all the elements to prove the main theorem of the paper. Let q as defined in Lemma B.2, with Z = X. Moreover, let v F, where F is the reproducing kernel Hilbert space associated to the kernel k Y which is bounded by assumption. Then for any continuous f, we have that v f is continuous and bounded. Denote by Θf,g(h, v, q) the quantity Θf,g(h, v, q) := Z X v(f(x))h(x)µ(x)dx Z X v(g(y))q(y)dy. 
Measuring dissimilarity with diffeomorphism invariance Step 1. Simplifying Θf,g(h, v, q). Since µ is supported on Ω X, by assumptions, we have Z X v(f(x))h(x)µ(x)dx = Z Ω v(f(x))h(x)µ(x)dx. Moreover, by expanding the characterization of q in Eq. (16), we have that q(x) = 0 for any x Rd \ Q 1(Ω), since Q 1(Ω) X, we have Z X v(g(y)) q(y)dy = Z Q 1(Ω) v(g(y))h(Q(y))µ(Q(y))| Q(y)|dy, Q 1(Ω) v(f(Q(y)))h(Q(y))µ(Q(y))| Q(y)|dy, where we used in the last step that g(y) = f(Q(y)) for any y Q 1(Ω), which is now the domain of integration. Step 2. Applying the Change of Variable theorem. Note that the function q is continuous since Hm(Rd) is subset of continuous functions. By applying the change of variable theorem we have Z Q 1(Ω) v(f(Q(x)))h(Q(x))µ(Q(x))| Q(x)|dx = Z Ω v(f(y))h(y)µ(y)dy. Then, by using the characterizations in Step 1, we have Θf,g(h, v, q) = Z X v(f(x))h(x)µ(x)dx Z X v(g(y)) q(y)dy Ω v(f(x))h(x)µ(x)dx Z Q 1(Ω) v(f(Q(y)))h(Q(y))µ(Q(y))| Q(y)|dy Step 3. Bound on Dλ(f, g). Now, denote by H the set Hm(Rd). Since q H, and f,g(h, v, q) = 0, we have that Dλ(f, g) = max h H 1 min q H max v F 1 |Θf,g(h, v, q)|2 + λ q 2 H max h H 1 max v F 1 |Θf,g(h, v, q)|2 + λ q 2 H (17) = max h H 1 λ q 2 H. (18) Step 4. Simplifying q Hm(Rd). We bound q Hm(Rd) as in Lemma B.2, where b is bounded according to Thm. B.1, q Hm(Rd) C h Hm(Rd)CµCQ, where Cµ := µ Hm(Rd) and CQ := d m d/2 Ω,Q Q d Hm(X,Rd)(1 + Q Hm(X,Rd) + Dm + Lm), (19) where D := diam(X), L = maxx X Q(x) and dΩ,Q := min[1, d H(Q 1(Ω), Q 1(X) X)] and d H(A, B) is the Haussdorff distance between two sets A, B. The final result is obtained by considering that we are optimizing with the constraint h Hm(Rd) 1. C. Proof of closed form of Dλ and computational results In the next lemma, we prove some important properties useful for the characterization of Dλ in terms of Fµ, G. Lemma C.1. Let X Rd be an open bounded set. The linear operators Fµ, G defined above are compact and trace class. Moreover, for any h, q H and v F, we have v, Fµh F = Z X v(f(x))h(x)µ(x)dx, v, Gq F = Z X v(f(x))q(x)dx. Measuring dissimilarity with diffeomorphism invariance Proof. Since k Y is bounded, and f, g are measurable, then Φ(f( )) : X F is bounded and measurable. Since µ C (Rd) by assumption X is bounded and h is bounded and continuous since it belongs to H and k X is bounded and continuous, so J := R X Φ(f(x)) F ψ(x) H|µ(x)|dx < . This guarantees the existence of the following Bochner integral Fµ := R X Φ(f(x)) ψ(x)µ(x)dx F H. In particular, denoting by the trace norm, i.e. A = Tr( A A), and recalling that u v = Tr( p (u v) (u v)) = u F v H for any u U, v V and any two separable Hilbert spaces U, V, we have X Φ(f(x)) ψ(x) |µ(x)|dx = Z X Φ(f(x)) F ψ(x) H|µ(x)|dx =: J < . Then Fµ is trace class. The same reasoning hold for G, considering that R X Φ(f(x)) F ψ(x) Hdx < , since X is compact. C.1. Proof of Thm. 4.2 Proof. We have seen in Lemma C.1, that since X is a bounded set, then Fµ and G are trace class and, by the representer property v, Fµh F = Z X v(f(x))h(x)µ(x)dx, v, Gq F = Z X v(f(x))q(x)dx Using this result and considering the linearity of the inner product and the variational characterization of the Hilbert norm (i.e. u F = max v F 1 | v, u |F for any u F), we have f,g(h, q) = max v F 1 | v, Fµh F v, Gq F |2 = max v F 1 | v, Fµh Gq F |2 = Fµh Gq 2 F, for any h, q H. From which we characterize Dλ as Dλ(f, g) = max h H 1 min q H Fµh Gq 2 F + λ q 2 H . (20) Now, we prove that the problem above has a characterization in terms of the operatorial norm of a given operator. 
C. Proof of the closed form of $D_\lambda$ and computational results

In the next lemma, we prove some properties that are useful for the characterization of $D_\lambda$ in terms of $F_\mu$ and $G$.

Lemma C.1. Let $X \subset \mathbb{R}^d$ be an open bounded set. The linear operators $F_\mu, G$ defined above are compact and trace class. Moreover, for any $h, q \in \mathcal H$ and $v \in \mathcal F$, we have
$$\langle v, F_\mu h\rangle_{\mathcal F} = \int_X v(f(x))\,h(x)\,\mu(x)\,dx,\qquad \langle v, G q\rangle_{\mathcal F} = \int_X v(g(x))\,q(x)\,dx.$$

Proof. Since $k_Y$ is bounded and $f, g$ are measurable, the map $\Phi(f(\cdot)) : X \to \mathcal F$ is bounded and measurable. Since $\mu$ is continuous on $\mathbb{R}^d$ by assumption, $X$ is bounded, and $h$ is bounded and continuous (it belongs to $\mathcal H$ and $k_X$ is bounded and continuous), we have $J := \int_X \|\Phi(f(x))\|_{\mathcal F}\,\|\psi(x)\|_{\mathcal H}\,|\mu(x)|\,dx < \infty$. This guarantees the existence of the Bochner integral $F_\mu := \int_X \Phi(f(x)) \otimes \psi(x)\,\mu(x)\,dx \in \mathcal F \otimes \mathcal H$. In particular, denoting by $\|\cdot\|_{\mathrm{tr}}$ the trace norm, i.e. $\|A\|_{\mathrm{tr}} = \mathrm{Tr}(\sqrt{A^*A})$, and recalling that $\|u\otimes v\|_{\mathrm{tr}} = \mathrm{Tr}\big(\sqrt{(u\otimes v)^*(u\otimes v)}\big) = \|u\|_{\mathcal U}\,\|v\|_{\mathcal V}$ for any $u\in\mathcal U$, $v\in\mathcal V$ and any two separable Hilbert spaces $\mathcal U, \mathcal V$, we have
$$\|F_\mu\|_{\mathrm{tr}} \le \int_X \|\Phi(f(x))\otimes\psi(x)\|_{\mathrm{tr}}\,|\mu(x)|\,dx = \int_X \|\Phi(f(x))\|_{\mathcal F}\,\|\psi(x)\|_{\mathcal H}\,|\mu(x)|\,dx = J < \infty.$$
Then $F_\mu$ is trace class (and hence compact). The same reasoning holds for $G$, considering that $\int_X \|\Phi(g(x))\|_{\mathcal F}\,\|\psi(x)\|_{\mathcal H}\,dx < \infty$ since $X$ is bounded.

C.1. Proof of Thm. 4.2

Proof. We have seen in Lemma C.1 that, since $X$ is a bounded set, $F_\mu$ and $G$ are trace class and, by the reproducing property,
$$\langle v, F_\mu h\rangle_{\mathcal F} = \int_X v(f(x))\,h(x)\,\mu(x)\,dx,\qquad \langle v, G q\rangle_{\mathcal F} = \int_X v(g(x))\,q(x)\,dx.$$
Using this result, the linearity of the inner product and the variational characterization of the Hilbert norm (i.e. $\|u\|_{\mathcal F} = \max_{\|v\|_{\mathcal F}\le 1} |\langle v, u\rangle_{\mathcal F}|$ for any $u \in \mathcal F$), we have
$$\max_{\|v\|_{\mathcal F}\le 1} |\Theta_{f,g}(h,v,q)|^2 = \max_{\|v\|_{\mathcal F}\le 1} \big|\langle v, F_\mu h\rangle_{\mathcal F} - \langle v, Gq\rangle_{\mathcal F}\big|^2 = \max_{\|v\|_{\mathcal F}\le 1} \big|\langle v, F_\mu h - Gq\rangle_{\mathcal F}\big|^2 = \|F_\mu h - Gq\|^2_{\mathcal F},$$
for any $h, q \in \mathcal H$. From this we characterize $D_\lambda$ as
$$D_\lambda(f,g) = \max_{\|h\|_{\mathcal H}\le 1}\,\min_{q\in\mathcal H}\,\|F_\mu h - Gq\|^2_{\mathcal F} + \lambda\|q\|^2_{\mathcal H}. \qquad (20)$$
Now we prove that the problem above has a characterization in terms of the operator norm of a suitable operator (in the rest of this proof we write $F := F_\mu$ and $D := D_\lambda$ to lighten notation). First, notice that $q \mapsto \|Fh - Gq\|^2_{\mathcal F} + \lambda\|q\|^2_{\mathcal H}$ is $2\lambda$-strongly convex. It therefore has a unique global minimizer $q^\star(h)$, which is also a critical point. This leads to $q^\star(h) = ZFh$ with $Z = (G^*G + \lambda I)^{-1}G^*$; here $G^*G + \lambda I$ is a positive operator bounded from below by $\lambda I$, and therefore invertible. So far, we have shown that
$$D(f,g) = \max_{\|h\|_{\mathcal H}\le 1}\,\|Fh - GZFh\|^2_{\mathcal F} + \lambda\|ZFh\|^2_{\mathcal H}. \qquad (21)$$
Rewriting both squared norms as scalar products in $\mathcal F$ and $\mathcal H$ and using the adjoint operators, we have
$$\|Fh - GZFh\|^2_{\mathcal F} + \lambda\|ZFh\|^2_{\mathcal H} = \langle h, Th\rangle_{\mathcal H},\qquad T = F^*(I - Z^*G^*)(I - GZ)F + \lambda F^*Z^*ZF.$$
Thus, since $T$ is self-adjoint and positive, we have rewritten $D$ as the operator norm of $T$:
$$D(f,g) = \max_{\|h\|_{\mathcal H}\le 1}\,\langle h, Th\rangle_{\mathcal H} = \|T\|_{\mathrm{op}}. \qquad (22)$$
We can now simplify $T$. Recall that for any bounded operator $A$, $A(A^*A + \lambda I)^{-1} = (AA^* + \lambda I)^{-1}A$ and $(A + \lambda I)^{-1}A = I - \lambda(A + \lambda I)^{-1}$. Thus
$$GZ = Z^*G^* = G(G^*G + \lambda I)^{-1}G^* = (GG^* + \lambda I)^{-1}GG^* = I - \lambda(GG^* + \lambda I)^{-1}.$$
Similarly,
$$Z^*Z = G(G^*G + \lambda I)^{-1}(G^*G + \lambda I)^{-1}G^* = (GG^* + \lambda I)^{-1}GG^*(GG^* + \lambda I)^{-1} \qquad (23)$$
$$= \big(I - \lambda(GG^* + \lambda I)^{-1}\big)(GG^* + \lambda I)^{-1} = (GG^* + \lambda I)^{-1} - \lambda(GG^* + \lambda I)^{-2}. \qquad (24)$$
Replacing these expressions in $T$, we obtain
$$T = \lambda^2 F^*(GG^* + \lambda I)^{-1}(GG^* + \lambda I)^{-1}F + \lambda F^*\big[(GG^* + \lambda I)^{-1} - \lambda(GG^* + \lambda I)^{-2}\big]F \qquad (25)$$
$$= \lambda F^*(GG^* + \lambda I)^{-1}F. \qquad (26)$$
Hence
$$D_\lambda(f,g) = \|T\|_{\mathrm{op}} = \lambda\,\|F^*(GG^* + \lambda I)^{-1}F\|_{\mathrm{op}} = \lambda\,\big\|(GG^* + \lambda I)^{-1/2}F\big\|^2_{\mathrm{op}}. \qquad (27)$$

C.2. Proof of Lemma 4.3

Before proceeding with the proof of Lemma 4.3, we introduce some operators that will also be useful in the rest of the paper. We recall that for this set of results we assume that the kernel $k_X$ is universal. This implies that the kernel matrix $K_{\tilde X\tilde X}$ is invertible, so that $R_X$ exists and is invertible. The same holds for $R_Y$.

Definition C.2 (The operators $S, V : \mathcal H \to \mathbb{R}^{M_X}$ and $Z, U : \mathcal F \to \mathbb{R}^{M_Y}$). First define $S : \mathcal H \to \mathbb{R}^{M_X}$ as
$$S u = \big(\langle\psi(\tilde x_1), u\rangle_{\mathcal H}, \dots, \langle\psi(\tilde x_{M_X}), u\rangle_{\mathcal H}\big) \in \mathbb{R}^{M_X},\qquad S^*\alpha = \sum_{i=1}^{M_X}\alpha_i\,\psi(\tilde x_i),$$
for all $u\in\mathcal H$ and $\alpha\in\mathbb{R}^{M_X}$. Analogously, define $Z : \mathcal F \to \mathbb{R}^{M_Y}$ as
$$Z v = \big(\langle\Phi(\tilde y_1), v\rangle_{\mathcal F}, \dots, \langle\Phi(\tilde y_{M_Y}), v\rangle_{\mathcal F}\big) \in \mathbb{R}^{M_Y},\qquad Z^*\beta = \sum_{i=1}^{M_Y}\beta_i\,\Phi(\tilde y_i),$$
for all $v\in\mathcal F$ and $\beta\in\mathbb{R}^{M_Y}$. Moreover, define $V, U$ as $V = R_X^{-\top}S$ and $U = R_Y^{-\top}Z$.

Remark C.3. We recall the following basic facts about the operators above, together with a short proof when needed.
1. The range of $S^*$ is $\mathrm{span}(\psi(\tilde x_1), \dots, \psi(\tilde x_{M_X}))$.
2. $S$ is full rank and $SS^* = K_{\tilde X\tilde X}$.
3. $R_X^\top R_X = K_{\tilde X\tilde X}$, since $R_X$ is the upper-triangular Cholesky factor of $K_{\tilde X\tilde X}$.
4. $VV^* = I$; indeed, $VV^* = R_X^{-\top}SS^*R_X^{-1} = R_X^{-\top}K_{\tilde X\tilde X}R_X^{-1} = R_X^{-\top}R_X^\top R_X R_X^{-1} = I$.
5. $V$ is a partial isometry, since $VV^* = I$, and it is full rank, since it is the product of two full-rank operators.
6. $P_X = V^*V$; indeed, $V^*V$ is a projection and the range of $V^*$ is the range of $S^*$, that is, $\mathrm{span}(\psi(\tilde x_1), \dots, \psi(\tilde x_{M_X}))$.
For the same reasons, we have that:
(a) the range of $Z^*$ is $\mathrm{span}(\Phi(\tilde y_1), \dots, \Phi(\tilde y_{M_Y}))$;
(b) $Z$ is full rank and $ZZ^* = K_{\tilde Y\tilde Y}$;
(c) $R_Y^\top R_Y = K_{\tilde Y\tilde Y}$;
(d) $UU^* = I$;
(e) $U$ is a partial isometry;
(f) $P_Y = U^*U$.
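As a quick numerical sanity check of facts 2, 3 and 4 above (the kernel, its bandwidth and the sample points below are arbitrary illustrative choices, not those of the paper), one can verify the identity $VV^* = R_X^{-\top}K_{\tilde X\tilde X}R_X^{-1} = I$ directly on matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 20, 2
xt = rng.uniform(0, 1, (M, d))                 # Nystrom points x~_1, ..., x~_M

# Kernel matrix K_{X~,X~} (an exponential kernel, chosen here for good conditioning)
# and its upper-triangular Cholesky factor R, so that R^T R = K (fact 3).
dist = np.sqrt(((xt[:, None, :] - xt[None, :, :]) ** 2).sum(-1))
K = np.exp(-dist / 0.5)
R = np.linalg.cholesky(K).T

# Fact 4: V V* = R^{-T} S S* R^{-1} = R^{-T} K R^{-1} = I.
VVt = np.linalg.solve(R.T, np.linalg.solve(R.T, K).T).T
print(np.abs(VVt - np.eye(M)).max())           # close to machine precision
```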
Now we are ready to state the proof of Lemma 4.3.

Proof. First, note that since $\|A\|^2_{\mathrm{op}} = \|A^*A\|_{\mathrm{op}}$ for any bounded linear operator $A$, we have
$$\widehat D_\lambda = \lambda\,\big\|P_X F_{\mu,N}^* P_Y\,\big(P_Y G_N P_X G_N^* P_Y + \lambda I\big)^{-1} P_Y F_{\mu,N} P_X\big\|_{\mathrm{op}}.$$
From Remark C.3 we recall that $P_X = V^*V$, where $V$ is the partial isometry of Definition C.2; analogously, $P_Y = U^*U$, where $U$ is the partial isometry of Definition C.2. Now, we have
$$U(U^*BU + \lambda I)^{-1}U^* = (B + \lambda I)^{-1}$$
for any positive semidefinite $B \in \mathbb{R}^{M_Y\times M_Y}$, since $UU^* = I$ and so $U^*UU^* = U^*$; indeed
$$U(U^*BU + \lambda I)^{-1}U^*(B + \lambda I) = U(U^*BU + \lambda I)^{-1}U^*(B + \lambda I)UU^* = U(U^*BU + \lambda I)^{-1}(U^*BU + \lambda U^*U)U^*$$
$$= U(U^*BU + \lambda I)^{-1}(U^*BU + \lambda I)U^* - \lambda U(U^*BU + \lambda I)^{-1}(I - U^*U)U^* = UU^* - \lambda U(U^*BU + \lambda I)^{-1}(U^* - U^*UU^*) = I.$$
We now use this result with $C = U G_N P_X G_N^* U^*$. So,
$$\widehat D_\lambda = \lambda\,\big\|V^*V F_{\mu,N}^* U^*\,U(U^*CU + \lambda I)^{-1}U^*\,U F_{\mu,N} V^*V\big\|_{\mathrm{op}} = \lambda\,\big\|V^*A^*(C + \lambda I)^{-1}A\,V\big\|_{\mathrm{op}} = \lambda\,\big\|A^*(C + \lambda I)^{-1}A\big\|_{\mathrm{op}},$$
where $A = U F_{\mu,N} V^*$ and we used the fact that $\|V^*TU\|_{\mathrm{op}} = \|T\|_{\mathrm{op}}$ for any couple of partial isometries such that $VV^* = I$ and $UU^* = I$. By applying the definition of $U$, $V$ and $F_{\mu,N}$, we see that $A = \widehat A$ as in Eq. (10). Indeed, by expanding the definitions of $U, V$ from Definition C.2 and denoting by $c_i \in \mathbb{R}^{M_Y}$ and $d_i \in \mathbb{R}^{M_X}$ the vectors
$$c_i = Z\Phi(f(x_i)) = \big(k_Y(\tilde y_1, f(x_i)), \dots, k_Y(\tilde y_{M_Y}, f(x_i))\big),\qquad d_i = S\psi(x_i) = \big(k_X(\tilde x_1, x_i), \dots, k_X(\tilde x_{M_X}, x_i)\big),$$
we have
$$U F_{\mu,N} V^* = \frac{v_X}{N}\sum_{i=1}^N \big(U\Phi(f(x_i))\big)\otimes\big(V\psi(x_i)\big)\,\mu(x_i) = \frac{v_X}{N}\sum_{i=1}^N R_Y^{-\top}\Big(\big(Z\Phi(f(x_i))\big)\otimes\big(S\psi(x_i)\big)\Big)R_X^{-1}\,\mu(x_i) = \frac{v_X}{N}\sum_{i=1}^N R_Y^{-\top}\,c_i d_i^\top\,R_X^{-1}\,\mu(x_i).$$
Note now that, by construction, $c_i$ is the $i$-th column of $K_{\tilde Y f}$, while $d_i^\top$ is the $i$-th row of the matrix $K_{X\tilde X}$, for $i = 1,\dots,N$. Denoting by $\mathrm{diag}(\hat\mu)$ the diagonal matrix whose $i$-th diagonal element is $\mu(x_i)$, we have
$$\sum_{i=1}^N \mu(x_i)\,c_i d_i^\top = K_{\tilde Y f}\,\mathrm{diag}(\hat\mu)\,K_{X\tilde X},$$
from which
$$U F_{\mu,N} V^* = \frac{v_X}{N}\,R_Y^{-\top}\,K_{\tilde Y f}\,\mathrm{diag}(\hat\mu)\,K_{X\tilde X}\,R_X^{-1} = \widehat A.$$
To conclude, note that
$$C = U G_N P_X G_N^* U^* = (U G_N V^*)(V G_N^* U^*) = (U G_N V^*)(U G_N V^*)^*.$$
Analogously to the proof that $A = \widehat A$, we have that $U G_N V^* = \widehat B$, where $\widehat B$ is defined in Eq. (11). Then
$$\widehat D_\lambda = \lambda\,\big\|A^*(C + \lambda I)^{-1}A\big\|_{\mathrm{op}} = \lambda\,\big\|\widehat A^*\big(\widehat B\widehat B^* + \lambda I\big)^{-1}\widehat A\big\|_{\mathrm{op}}.$$

D. Proof of Thm. 4.4

Before proving the theorem, we need some preliminary lemmas.

Lemma D.1. Let $\delta \in (0,1)$. Let $X \subset \mathbb{R}^d$ be an open bounded set with locally Lipschitz boundary. Let $k$ be a Sobolev kernel of smoothness $m$, with $m > d/2$, on $X$ and denote by $\mathcal H$ and $\psi : X \to \mathcal H$ the associated RKHS and canonical feature map. Let $\tilde X = \{\tilde x_1, \dots, \tilde x_M\} \subset X$ be $M$ points sampled independently and uniformly at random in $X$. Denote by $P_{\tilde X}$ the projection operator whose range corresponds to $\mathrm{span}\{\psi(\tilde x_1), \dots, \psi(\tilde x_M)\}$. There exists $M_0$ such that for all $M \ge M_0$ the following holds with probability at least $1-\delta$:
$$\sup_{x\in X}\,\big\|(I - P_{\tilde X})\psi(x)\big\|_{\mathcal H} \le C\,M^{-m/d+1/2}\Big(\log\frac{C' M}{\delta}\Big)^{m/d-1/2},$$
where $C, C'$ are constants depending only on $X, m, d$.

Proof. To prove this result we follow the same reasoning as in Theorem C.3 of Rudi and Ciliberto (2021), but applied to the Sobolev kernel. First, by applying Lemma C.2 of Rudi and Ciliberto (2021), we have that
$$\sup_{x\in X}\,\big\|(I - P_{\tilde X})\psi(x)\big\|_{\mathcal H} \le \sup_{\|f\|_{\mathcal H}\le 1}\,\|f - P_{\tilde X}f\|_{L^\infty(X)}.$$
Now, denote by $\eta$ the so-called fill distance (Narcowich et al., 2005), defined as $\eta = \sup_{x\in X}\min_{i\in\{1,\dots,M\}}\|x - \tilde x_i\|$. By applying Proposition 3.2 of Narcowich et al. (2005) with $\alpha = 0$, $q = \infty$, $\tau = m$, we have that there exists $\eta_0$ such that, when $\eta \le \eta_0$,
$$\|f - P_{\tilde X}f\|_{L^\infty(X)} \le C\,\eta^{m-d/2}\,\|f\|_{\mathcal H},\qquad \forall\,f\in\mathcal H,$$
where $\eta_0$ and $C$ are constants depending only on $d, X, m$. To conclude, note that, by Lemmas 11 and 12 of Vacher et al. (2021), $\eta \le (C_1 M^{-1}\log(C_2 M/\delta))^{1/d}$ with probability at least $1-\delta$, where $C_1, C_2$ depend only on $X, d$. The result is obtained by combining the three inequalities above and selecting $M_0$ as the smallest integer satisfying $(C_1 M_0^{-1}\log(C_2 M_0/\delta))^{1/d} \le \eta_0$.

Now we are ready to prove Thm. 4.4.

Proof of Theorem 4.4. Let $\rho = \delta/4$. Denote by $\kappa_X$ and $\kappa_Y$ the constants bounding the kernels $k_X, k_Y$ (which are Sobolev kernels of smoothness $s$ and $z$; see Wendland (2004) for the explicit definition of such kernels). Note that $\kappa_X, \kappa_Y$ are constants depending, respectively, only on $s, d$ and on $z, p$. We recall here that $v_X := \mathrm{vol}(X) = \int_X dx$.

Step 1. We recall that $x_1, \dots, x_N$ are independently distributed according to the uniform measure over $X$. Define the random variables $\zeta_i \in \mathcal F\otimes\mathcal H$ as $\zeta_i = v_X\,\Phi(f(x_i))\otimes\psi(x_i)\,\mu(x_i)$ for $i = 1,\dots,N$. Note now that
$$\frac{1}{N}\sum_{i=1}^N \zeta_i = F_{\mu,N},\qquad \mathbb{E}\,\zeta_1 = F_\mu.$$
Note moreover that $\|\zeta_i\|_{\mathrm{HS}} \le v_X\,\kappa_X\kappa_Y\,\|\mu\|_{L^\infty} =: L$.
Denote by $\langle A, B\rangle_{\mathrm{HS}} = \mathrm{Tr}(A^*B)$ the Hilbert-Schmidt inner product. We recall that the space $\mathrm{HS}(\mathcal H, \mathcal F)$ of operators with finite Hilbert-Schmidt norm is a separable Hilbert space. We recall also that $\|\cdot\|_{\mathrm{op}} \le \|\cdot\|_{\mathrm{HS}} \le \|\cdot\|_{\mathrm{tr}}$, where the last is the trace norm, and that $F_{\mu,N}$ has finite trace norm. By applying the Bernstein inequality for random vectors (see, e.g., Prop. 11 of Rudi et al. (2015) and references therein), the following holds with probability at least $1-\rho$:
$$\|F_{\mu,N} - F_\mu\|_{\mathrm{HS}} \le \frac{4L\log\frac{2}{\rho}}{\sqrt N}. \qquad (28)$$
Applying the same reasoning to $G$, we obtain, with probability at least $1-\rho$,
$$\|G_N - G\|_{\mathrm{HS}} \le \frac{4L'\log\frac{2}{\rho}}{\sqrt N}, \qquad (29)$$
where $L' := v_X\,\kappa_X\kappa_Y$.

Step 2. Now, recall that $P_X$ is the projection operator whose range is $\mathrm{span}\{\psi(\tilde x_1), \dots, \psi(\tilde x_{M_X})\}$, $X$ is bounded with Lipschitz boundary, and $k_X$ is a Sobolev kernel of smoothness $m_X$. By applying Lemma D.1, there exists $C_0$ such that, when $M_X \ge C_0$, with probability at least $1-\rho$,
$$\sup_{x\in X}\,\|(I - P_X)\psi(x)\|_{\mathcal H} \le C_1\,M_X^{-s/d}\Big(\log\frac{C_2 M_X}{\rho}\Big)^{s/d+1/2}, \qquad (30)$$
where $C_0, C_1, C_2$ depend only on $X, s, d$. Applying the same reasoning to $P_Y$, there exists $C'_0$ such that, when $M_Y \ge C'_0$, with probability at least $1-\rho$,
$$\sup_{y\in Y}\,\|(I - P_Y)\Phi(y)\|_{\mathcal F} \le C'_1\,M_Y^{-z/p}\Big(\log\frac{C'_2 M_Y}{\rho}\Big)^{z/p+1/2}, \qquad (31)$$
where $C'_0, C'_1, C'_2$ depend only on $Y, z, p$.

Step 3. Now we can estimate the distance between $F_\mu$ and $P_Y F_{\mu,N} P_X$ and, analogously, between $G$ and $P_Y G_N P_X$. In particular, we can rewrite $F_\mu - P_Y F_{\mu,N} P_X$ as
$$F_\mu - P_Y F_{\mu,N} P_X = (F_\mu - F_{\mu,N}) + (I - P_Y)F_{\mu,N} + P_Y F_{\mu,N}(I - P_X),$$
from which, using $\|\cdot\|_{\mathrm{op}} \le \|\cdot\|_{\mathrm{HS}}$, we derive
$$\|F_\mu - P_Y F_{\mu,N} P_X\|_{\mathrm{op}} \le \|F_\mu - F_{\mu,N}\|_{\mathrm{HS}} + \|(I - P_Y)F_{\mu,N}\|_{\mathrm{op}} + \|P_Y\|_{\mathrm{op}}\,\|F_{\mu,N}(I - P_X)\|_{\mathrm{op}}.$$
Now, the term $\|F_\mu - F_{\mu,N}\|_{\mathrm{HS}}$ is already bounded in Eq. (28). For the second term, note that by expanding the definition of $F_{\mu,N}$ and using Eq. (31), we obtain
$$\|(I - P_Y)F_{\mu,N}\|_{\mathrm{op}} \le \frac{v_X}{N}\sum_{i=1}^N \big\|(I - P_Y)\big(\Phi(f(x_i))\otimes\psi(x_i)\big)\big\|_{\mathrm{op}}\,|\mu(x_i)| \le \frac{v_X}{N}\sum_{i=1}^N \|(I - P_Y)\Phi(f(x_i))\|_{\mathcal F}\,\|\psi(x_i)\|_{\mathcal H}\,|\mu(x_i)|$$
$$\le \kappa_X\,v_X\,\|\mu\|_{L^\infty}\,C'_1\,M_Y^{-z/p}\Big(\log\frac{C'_2 M_Y}{\rho}\Big)^{z/p+1/2},$$
with probability at least $1-\rho$. Applying the same reasoning to the third term, we obtain
$$\|F_{\mu,N}(I - P_X)\|_{\mathrm{op}} \le \kappa_Y\,v_X\,\|\mu\|_{L^\infty}\,C_1\,M_X^{-s/d}\Big(\log\frac{C_2 M_X}{\rho}\Big)^{s/d+1/2},$$
with probability at least $1-\rho$. Combining all the terms and considering that $\|P_Y\|_{\mathrm{op}} = 1$ since it is a projection, we have
$$\|F_\mu - P_Y F_{\mu,N} P_X\|_{\mathrm{op}} \le \beta, \qquad (32)$$
$$\beta := \frac{4L\log\frac{2}{\rho}}{\sqrt N} + \kappa_X\,v_X\,\|\mu\|_{L^\infty}\,C'_1\,M_Y^{-z/p}\Big(\log\frac{C'_2 M_Y}{\rho}\Big)^{z/p+1/2} + \kappa_Y\,v_X\,\|\mu\|_{L^\infty}\,C_1\,M_X^{-s/d}\Big(\log\frac{C_2 M_X}{\rho}\Big)^{s/d+1/2}.$$
To conclude, note that
$$\|F_\mu\|_{\mathrm{op}} \le \int_X \|\Phi(f(x))\|_{\mathcal F}\,\|\psi(x)\|_{\mathcal H}\,|\mu(x)|\,dx \le v_X\,\kappa_X\kappa_Y\,\|\mu\|_{L^\infty(X)} = L,$$
and with the same reasoning we have $\|F_{\mu,N}\|_{\mathrm{op}} \le L$. Then, by considering that $\|AA^* - \hat A\hat A^*\|_{\mathrm{op}} \le (\|A\|_{\mathrm{op}} + \|\hat A\|_{\mathrm{op}})\,\|A - \hat A\|_{\mathrm{op}}$ for any bounded operators $A, \hat A$ between the same two Hilbert spaces, we have
$$\|F_\mu F_\mu^* - P_Y F_{\mu,N} P_X F_{\mu,N}^* P_Y\|_{\mathrm{op}} \le 2L\beta.$$
Repeating the same reasoning from the beginning of Step 3 for $G$ and $G_N$, we obtain
$$\|GG^* - P_Y G_N P_X G_N^* P_Y\|_{\mathrm{op}} \le 2L'\beta', \qquad (33)$$
$$\beta' := \frac{4L'\log\frac{2}{\rho}}{\sqrt N} + \kappa_X\,v_X\,C'_1\,M_Y^{-z/p}\Big(\log\frac{C'_2 M_Y}{\rho}\Big)^{z/p+1/2} + \kappa_Y\,v_X\,C_1\,M_X^{-s/d}\Big(\log\frac{C_2 M_X}{\rho}\Big)^{s/d+1/2}.$$

Step 4. Before deriving the final result, we need an algebraic inequality between bounded operators. Let $B, \hat B, Q, \hat Q$ be bounded operators and assume that $Q, \hat Q$ are also symmetric, positive and invertible. We recall that $\|A\|^2_{\mathrm{op}} = \|A^*A\|_{\mathrm{op}} = \|AA^*\|_{\mathrm{op}}$ for any bounded operator $A$. We have
$$\|Q^{-1/2}B\|^2_{\mathrm{op}} - \|\hat Q^{-1/2}\hat B\|^2_{\mathrm{op}} = \big(\|Q^{-1/2}B\|^2_{\mathrm{op}} - \|\hat Q^{-1/2}B\|^2_{\mathrm{op}}\big) + \big(\|\hat Q^{-1/2}B\|^2_{\mathrm{op}} - \|\hat Q^{-1/2}\hat B\|^2_{\mathrm{op}}\big).$$
Then, using the identity $A^{-1} - B^{-1} = B^{-1}(B - A)A^{-1}$, valid for any bounded and invertible operators, we have
$$\big|\|Q^{-1/2}B\|^2_{\mathrm{op}} - \|\hat Q^{-1/2}B\|^2_{\mathrm{op}}\big| = \big|\|B^*Q^{-1}B\|_{\mathrm{op}} - \|B^*\hat Q^{-1}B\|_{\mathrm{op}}\big| \le \|B^*(Q^{-1} - \hat Q^{-1})B\|_{\mathrm{op}} = \|B^*\hat Q^{-1}(\hat Q - Q)Q^{-1}B\|_{\mathrm{op}}$$
$$\le \|B\|_{\mathrm{op}}\,\|\hat Q^{-1}\|_{\mathrm{op}}\,\|Q - \hat Q\|_{\mathrm{op}}\,\|Q^{-1}\|_{\mathrm{op}}\,\|B\|_{\mathrm{op}}.$$
Moreover, we have
$$\big|\|\hat Q^{-1/2}B\|^2_{\mathrm{op}} - \|\hat Q^{-1/2}\hat B\|^2_{\mathrm{op}}\big| = \big|\|\hat Q^{-1/2}BB^*\hat Q^{-1/2}\|_{\mathrm{op}} - \|\hat Q^{-1/2}\hat B\hat B^*\hat Q^{-1/2}\|_{\mathrm{op}}\big| \le \|\hat Q^{-1/2}(BB^* - \hat B\hat B^*)\hat Q^{-1/2}\|_{\mathrm{op}} \le \|\hat Q^{-1/2}\|^2_{\mathrm{op}}\,\|BB^* - \hat B\hat B^*\|_{\mathrm{op}}.$$
Now, noting that $\|\hat Q^{-1/2}\|^2_{\mathrm{op}} = \|\hat Q^{-1}\|_{\mathrm{op}}$ and combining the inequalities above, we obtain
$$\big|\|Q^{-1/2}B\|^2_{\mathrm{op}} - \|\hat Q^{-1/2}\hat B\|^2_{\mathrm{op}}\big| \le \|B\|^2_{\mathrm{op}}\,\|\hat Q^{-1}\|_{\mathrm{op}}\,\|Q^{-1}\|_{\mathrm{op}}\,\|Q - \hat Q\|_{\mathrm{op}} + \|\hat Q^{-1}\|_{\mathrm{op}}\,\|BB^* - \hat B\hat B^*\|_{\mathrm{op}}. \qquad (34)$$

Step 5. With the tools derived above we can proceed to bound $|\widehat D_\lambda(f,g) - D_\lambda(f,g)|$. First, note that, by Lemma 4.3, $\widehat D_\lambda(f,g)$ of Eq. (12) is equivalent to
$$\widehat D_\lambda(f,g) := \lambda\,\big\|\big(P_Y G_N P_X G_N^* P_Y + \lambda I\big)^{-\frac12}\,P_Y F_{\mu,N} P_X\big\|^2_{\mathrm{op}},$$
so that
$$|D_\lambda(f,g) - \widehat D_\lambda(f,g)| = \lambda\,\Big|\big\|(GG^* + \lambda I)^{-\frac12}F_\mu\big\|^2_{\mathrm{op}} - \big\|\big(P_Y G_N P_X G_N^* P_Y + \lambda I\big)^{-\frac12}P_Y F_{\mu,N} P_X\big\|^2_{\mathrm{op}}\Big|.$$
By applying Eq. (34) with $Q = GG^* + \lambda I$, $\hat Q = P_Y G_N P_X G_N^* P_Y + \lambda I$, $B = F_\mu$ and $\hat B = P_Y F_{\mu,N} P_X$, and noting that $Q \succeq \lambda I$ and $\hat Q \succeq \lambda I$ (hence $\|Q^{-1}\|_{\mathrm{op}} \le \lambda^{-1}$ and $\|\hat Q^{-1}\|_{\mathrm{op}} \le \lambda^{-1}$), and that $\|F_\mu\|_{\mathrm{op}} \le L$, we have
$$|D_\lambda(f,g) - \widehat D_\lambda(f,g)| \le \lambda\,\big(\|B\|^2_{\mathrm{op}}\,\|\hat Q^{-1}\|_{\mathrm{op}}\,\|Q^{-1}\|_{\mathrm{op}}\,\|Q - \hat Q\|_{\mathrm{op}} + \|\hat Q^{-1}\|_{\mathrm{op}}\,\|BB^* - \hat B\hat B^*\|_{\mathrm{op}}\big) \le L^2\lambda^{-1}\,\|Q - \hat Q\|_{\mathrm{op}} + \|BB^* - \hat B\hat B^*\|_{\mathrm{op}}.$$
Now, since $\|Q - \hat Q\|_{\mathrm{op}} = \|GG^* - P_Y G_N P_X G_N^* P_Y\|_{\mathrm{op}}$ and $\|BB^* - \hat B\hat B^*\|_{\mathrm{op}} = \|F_\mu F_\mu^* - P_Y F_{\mu,N} P_X F_{\mu,N}^* P_Y\|_{\mathrm{op}}$, which are bounded in Eqs. (33) and (32) respectively, we conclude
$$|D_\lambda(f,g) - \widehat D_\lambda(f,g)| \le 2L'L^2\lambda^{-1}\beta' + 2L\beta.$$

E. Computing the optimal h and q

Using the finite-rank approximation described in the paper, we recover $h$ and $q$ through the following formulas, where $[h] = [h(x_1), \dots, h(x_N)]$ and $[q] = [q(x_1), \dots, q(x_N)]$ are the values at the grid points (the discretization of the integral):
$$[h] = K_{X\tilde X}\,R_X^{-1}\,\tilde h,\qquad [q] = K_{X\tilde X}\,R_X^{-1}\,\widehat G^*\big(\widehat G\widehat G^* + \lambda I\big)^{-1}\widehat F\,\tilde h,$$
where $\tilde h$ is computed as a by-product of the computation of $\widehat D$ (for example, by power iteration).
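For concreteness, the following self-contained Python (NumPy) sketch instantiates the computation above on a toy one-dimensional example: it assembles $\widehat A$ and $\widehat B$ as in the proof of Lemma 4.3, evaluates $\widehat D_\lambda = \lambda\,\|\widehat A^\top(\widehat B\widehat B^\top + \lambda I)^{-1}\widehat A\|_{\mathrm{op}}$, and recovers $\tilde h$ as the leading eigenvector. The functions $f$, $Q$, the Gaussian kernels, the uniform weighting $\mu$ and all the sizes are illustrative choices only (not those of the experiments), the snippet is not the released implementation, and the final lines assume that the matrices $\widehat F, \widehat G$ of Appendix E coincide with $\widehat A, \widehat B$ above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for rows of A (n, d) and B (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)

# Toy 1-D "images" f, g : X = [0, 1] -> R, with g = f o Q for a smooth diffeomorphism Q.
f = lambda t: np.sin(6 * np.pi * t)
Q = lambda t: t ** 2
g = lambda t: f(Q(t))

N, MX, MY = 400, 30, 30             # integration grid size and Nystrom sizes
lam, sig_x, sig_y = 1e-2, 0.1, 0.5  # regularization and kernel widths (illustrative)

x  = rng.uniform(0, 1, (N, 1))      # integration points x_1, ..., x_N in X
xt = rng.uniform(0, 1, (MX, 1))     # Nystrom points in X
yt = rng.uniform(-1, 1, (MY, 1))    # Nystrom points in the output space Y
mu = np.ones(N)                     # weighting function mu (uniform here)
vX = 1.0                            # vol(X)

# Upper-triangular Cholesky factors, as in Remark C.3: R^T R = K (small jitter for stability).
RX = np.linalg.cholesky(gaussian_kernel(xt, xt, sig_x) + 1e-9 * np.eye(MX)).T
RY = np.linalg.cholesky(gaussian_kernel(yt, yt, sig_y) + 1e-9 * np.eye(MY)).T
RX_inv, RY_invT = np.linalg.inv(RX), np.linalg.inv(RY).T

K_X_xt = gaussian_kernel(x, xt, sig_x)     # K_{X, X~}:  (N, MX)
K_yt_f = gaussian_kernel(yt, f(x), sig_y)  # K_{Y~, f}:  (MY, N)
K_yt_g = gaussian_kernel(yt, g(x), sig_y)  # K_{Y~, g}:  (MY, N)

# Matrices of Lemma 4.3 (A_hat built from F_{mu,N}, B_hat from G_N).
A_hat = (vX / N) * RY_invT @ (K_yt_f * mu) @ K_X_xt @ RX_inv
B_hat = (vX / N) * RY_invT @ K_yt_g @ K_X_xt @ RX_inv

# D_hat = lambda * || A_hat^T (B_hat B_hat^T + lam I)^{-1} A_hat ||_op.
M = A_hat.T @ np.linalg.solve(B_hat @ B_hat.T + lam * np.eye(MY), A_hat)
eigval, eigvec = np.linalg.eigh((M + M.T) / 2)
D_hat, h_tilde = lam * eigval[-1], eigvec[:, -1]
print("D_hat(f, g) =", D_hat)

# Appendix E: values of the optimal h and q on the grid (assuming F_hat = A_hat, G_hat = B_hat).
q_tilde = B_hat.T @ np.linalg.solve(B_hat @ B_hat.T + lam * np.eye(MY), A_hat @ h_tilde)
h_grid = K_X_xt @ RX_inv @ h_tilde
q_grid = K_X_xt @ RX_inv @ q_tilde
```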
F. Information for reproducing experiments

Dependencies. We conduct our experiments using PyTorch version 1.9.0 (torchvision 0.10.0) and NumPy 1.20.1, but our implementation does not rely on any special functions, so it should work with any recent version. We also use Kornia version 0.5.7 and PIL version 8.2.0 for IO.

Code. Our source code is organized as follows:
- warping (dir): contains the code for generating random warps, modified from Petrini et al. (2021), https://github.com/pcsl-epfl/diffeomorphism.
- did (dir): implements the objects necessary for the computation of DID.
- imagenet.py: implements an ImageFolder torchvision dataset for ImageNet (and Imagenette).
- scenes (dir), perspective (dir), peppers.mat: the images used for the experiments (the .mat file is a Matlab binary object which can be read with scipy).
- demo_XXX.py: used for the demonstrations in Fig. 1; they are the easiest way to get familiar with DID.
- exp_XXX.py: used for the quantitative experiments.
Note that the path to the Imagenette dataset must be set in each experiment file or in the imagenet.py source file.

Imagenette. We use images from the ImageNet dataset for our experiments, restricted to the subset called Imagenette, available at https://github.com/fastai/imagenette.

ImageNet statistics. We use the following ImageNet statistics to normalize all images: $\mu = [0.485, 0.456, 0.406]$, $\sigma = [0.229, 0.224, 0.225]$.

G. DID in action (supplementary experiments)

We propose a collection of experiments with DID on the peppers image (see Fig. 5), using random square patches. We show the values DID takes and the shapes of the optimal $h$ and $q$ for a choice of $\lambda$.

Figure 5. Peppers image from the Matlab software.

DID has parameters $M_X = 100$, $M_Y = 163$, $k_X$ Gaussian with $\sigma = 1/6$ and $k_Y$ Abel with $a = 5$. We ran several experiments and chose $\lambda = 10^{-2}$ as the best value; indeed, as shown in Fig. 6, the regions found by $h$ and $q$ for this value are coherent. We consider a random square patch $f$ of size $150 \times 150$. We rotate, translate and scale it, and denote the result $g = \mathrm{transform}(f)$. We then show $f$, $h$ (over $f$), $g$ and $q$ (over $g$), as well as the value of $\widehat D_\lambda(f, g)$. The experiments in Fig. 6 can be reproduced with appendix_peppers_match.py.

Figure 6. $\widehat D_\lambda(f, \mathrm{transform}(f))$ for random patches $f$ from peppers, with $\lambda = 0.01$ (panel values: 1.82e-01, 8.23e-02, 4.60e-02, 8.45e-02, 5.83e-02, 1.22e-01). We show $f$ and $g = \mathrm{transform}(f)$, as well as the optimal functions $h$ and $q$.

H. Warping demonstrations

The warps below were generated using the file appendix_warp.py in our source code, on an image from ImageNet. Rows correspond to different random samples, columns to different warp temperatures. All warps use $c = 2$.

Figure 7. Image #8000. Deformations $\mathrm{warp}(f, T)$ for different values of $T$ ($c = 2$); columns show $T = 10^{-3}$, $10^{-2}$, $10^{-1}$, together with the original image.