# Diffeomorphic Learning

Journal of Machine Learning Research 21 (2020) 1-28. Submitted 6/18; Revised 8/20; Published 11/20.

Laurent Younes (laurent.younes@jhu.edu)
Department of Applied Mathematics and Statistics and Center for Imaging Science, Johns Hopkins University, 3400 N. Charles St., Baltimore, MD 21209

Editor: Corinna Cortes

Abstract

We introduce in this paper a learning paradigm in which training data is transformed by a diffeomorphic transformation before prediction. The learning algorithm minimizes a cost function evaluating the prediction error on the training set, penalized by the distance between the diffeomorphism and the identity. The approach borrows ideas from shape analysis, where diffeomorphisms are estimated for shape and image alignment, and brings them to a previously unexplored setting, estimating, in particular, diffeomorphisms in much larger dimensions. After introducing the concept and describing a learning algorithm, we present diverse applications, mostly with synthetic examples, demonstrating the potential of the approach, as well as some insight on how it can be improved.

Keywords: Diffeomorphisms, Reproducing Kernel Hilbert Spaces, Classification

1. Introduction

We consider, in this paper, a family of classifiers that take the form $x \mapsto F(\psi(x))$, $x \in \mathbb{R}^d$, where $\psi$ is a diffeomorphism of $\mathbb{R}^d$ and $F$ is real-valued or categorical, belonging to a class of simple (e.g., linear) predictors. We will describe a training algorithm that, when presented with a finite number of training samples, displaces all points together through non-linear trajectories, in order to bring them to a position for which the classes are linearly separable. Because these trajectories are built with a guarantee to avoid collisions between points, they can be interpolated to provide a fluid motion of the whole space ($\mathbb{R}^d$) resulting in a diffeomorphic transformation. One can then use this transformation to assign a class to any new observation. Because one may expect that the simplicity of the transformation will be directly related to the ability of the classifier to generalize, point trajectories and their interpolation are optimized so that the resulting global transformation is penalized for straying too far from the identity, this penalty being assessed using a geodesic distance in the group of diffeomorphisms, based on methods previously developed for shape analysis.

The last decade's achievements in deep learning have indeed demonstrated that optimizing nonlinear transformations of the data in very high dimensions and using massive parametrization could lead to highly accurate predictions without necessarily being struck by the curse of dimensionality. The approach that we describe here also explores a very large space in terms, at least, of the number of parameters required to describe the transformations, and uses a training algorithm that implements, like neural networks, dynamic programming. It however frames the estimated transformations within a well specified space of diffeomorphisms, whose nonlinear structure is adapted to the successive compositions of functions that drive deep learning methods.
While it shares some of the characteristics of deep methods, our formulation relies on a regularization term, which makes it closer in spirit to many of the methods used for non-parametric prediction.

We now sketch our model for classification, and for this purpose introduce some notation. We let $c$ denote the number of classes for the problem at hand. As a consequence, the function $F$ can either take values in $C = \{0, \ldots, c-1\}$, or in the space of probability distributions on $C$ (which will be our choice, since we will use logistic regression). We furthermore assume that $F$ is parametrized by a parameter $\theta \in \mathbb{R}^q$, and we will write $F(x; \theta)$ instead of $F(x)$ when needed. Assume that a training set is given, in the form $T_0 = (x_1, y_1, \ldots, x_N, y_N)$ with $x_k \in \mathbb{R}^d$ and $y_k \in C$ for $k = 1, \ldots, N$. We will assume that this set is not redundant, i.e., that $x_i \neq x_j$ whenever $i \neq j$. (Our construction can be extended to redundant training sets by allowing the training class variables $y_k$ to be multi-valued. This would result in somewhat cumbersome notation that is better avoided.) If $\psi$ is a diffeomorphism of $\mathbb{R}^d$, we will let $\psi \cdot T_0 = (\psi(x_1), y_1, \ldots, \psi(x_N), y_N)$. The learning procedure will consist in minimizing the objective function

$$
G(\psi, \theta) = D(\mathrm{id}, \psi)^2 + \lambda g(\theta) + \frac{1}{\sigma^2}\,\Gamma(F(\cdot\,; \theta), \psi \cdot T_0) \tag{1}
$$

with respect to $\psi$ and $\theta$. Here $D$ is a Riemannian distance in a group of diffeomorphisms of $\mathbb{R}^d$, which will be described in the next section, $\Gamma$ is a standard loss function, $g$ a regularization term on the parameters of the final classifier, and $\lambda, \sigma$ are positive numbers. One can take for example

$$
\Gamma(F(\cdot\,; \theta), \psi \cdot T_0) = -\sum_{k=1}^N \log F(\psi(x_k); \theta)(y_k), \tag{2}
$$

with

$$
F(z; \theta)(y) = \frac{e^{\theta(y)^T z}}{\sum_{y' \in C} e^{\theta(y')^T z}},
$$

where the parameter is $\theta = (\theta_1, \ldots, \theta_{c-1}) \in (\mathbb{R}^d)^{c-1}$ with $\theta_0 = 0$, while (as done in our experiments) $g(\theta) = |\theta|^2$.

The proposed construction shares some of its features with feed-forward networks (Goodfellow et al., 2016), for which, for example, using the loss function above on the final layer is standard, and can also be seen as a kernel method, sharing this property with classifiers such as support vector machines (SVMs) or other classification methods that use the kernel trick (Vapnik, 2013; Schölkopf and Smola, 2002). However, the algorithm is directly inspired from diffeomorphic shape analysis, and can be seen as a variant of the large deformation diffeomorphic metric mapping (LDDMM) algorithm, which has been introduced for image and shape registration (Joshi and Miller, 2000; Beg et al., 2005; Younes, 2010). To our knowledge, this theory has not been applied to classification problems prior to the development of the present work, although some similar ideas using a linearized model have been suggested in (Trouvé and Yu, 2001) for the inference of metrics in shape spaces, and a similar approach has been discussed in Walder and Schölkopf (2009) in the context of dimensionality reduction. Invertible neural networks have been introduced as normalizing flows in (Rezende and Mohamed, 2015), with a main focus on generative modeling and variational inference. (Normalizing flows are discrete iterates of invertible functions that are implementable within a neural architecture; see Kobyzev et al. (2020) for a recent review and discussion.)
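Returning to the terminal classifier in (2), the following is a minimal numpy sketch of the softmax predictor and of the loss $\Gamma$ evaluated on transformed points; the function and variable names are our own and only meant to fix ideas.

```python
import numpy as np

def predict_proba(z, theta):
    """Class probabilities F(z; theta)(y) of equation (2) for one transformed point z.
    theta: array of shape (c - 1, d) holding theta_1, ..., theta_{c-1}; theta_0 = 0."""
    logits = np.concatenate([[0.0], theta @ z])
    logits -= logits.max()                 # for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def loss(Z, y, theta):
    """Negative log-likelihood Gamma(F(.; theta), psi . T0) over transformed points Z."""
    return -sum(np.log(predict_proba(z, theta)[yk]) for z, yk in zip(Z, y))
```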
Since, or concurrently to, the original submission of the present work, several papers have proposed to use flows of ODEs within a machine learning context, mostly through an adaptation of residual networks (He et al., 2016) and normalizing flows to a continuous-time setting (Salman et al., 2018; Chen et al., 2019; Rousseau et al., 2019; Dupont et al., 2019; Vialard et al., 2020). Additional comments on neural ODEs are provided in section 5.2.

While the underlying theory in shape analysis, which formulates the estimation of transformations as optimal control problems, is by now well established, the present paper introduces several original contributions. Translating shape analysis methods designed for small-dimensional problems to larger-scale problems indeed introduces new challenges, and suggests new strategies on the design of the flows that control the diffeomorphic transformations, on the choice of parameters driving the model and on optimization schemes. Importantly, the paper provides experimental evidence that diffeomorphic methods can be competitive in machine learning contexts. Indeed, diffeomorphisms in high dimension are wild objects, and, even with the kind of regularization that is applied in this paper, it was not obvious that using them in a non-parametric setting would avoid overfitting and compare, often favorably, with several state-of-the-art machine learning methods.

As explained in section 5.1, the model (after the addition of one or more dummy dimensions) can represent essentially any function of the data. In this regard, given that one is ready to work with such high-dimensional objects, the requirement that the transformation is a diffeomorphism is not a strong restriction on the scope of the model. Using such a representation has several advantages, however. It expresses the model in terms of a well-understood non-linear space of functions (the group of diffeomorphisms), which has a simple algebraic structure and a differential geometry extensively explored in the literature. The description of such models is, as a consequence, quite concise and explicit, and lends itself to an optimal control formulation with the learning algorithms that result from it. Finally, there are obvious advantages, on the generative side, in using invertible data transformations, especially when the data in the transformed configuration may be easier to describe using a simple statistical model, since, in that case, the application of the inverse diffeomorphism to realizations of that simple model provides a generative model in the original configuration space.

The paper is organized as follows. Section 2 introduces basic concepts related to groups of diffeomorphisms and their Riemannian distances. Section 3 formulates the optimal control problem associated to the estimation of the diffeomorphic predictor, introducing its reduction to a finite-dimensional parametrization using a kernel trick and describing its optimality conditions. Some additional discussion on the choice of reproducing kernel is provided in section 4. Section 5 introduces remarks and extensions that are specific to the prediction problems that we consider here, and sections 6 and 7 provide experimental results, including details on how (hyper-)parameters used in our model can be set. We conclude the paper in section 8.

2. Distance on Diffeomorphisms

We now describe the basic framework leading to the definition of Riemannian metrics on groups of diffeomorphisms.
While many of the concepts described in these first sections are adapted from similar constructions in shape analysis (Younes, 2010), this short presentation helps making the paper self-contained and accessible to readers without prior knowledge of that domain.

Let $B_p = C^p_0(\mathbb{R}^d, \mathbb{R}^d)$ denote the space of $p$-times continuously differentiable functions that tend to 0 (together with their first $p$ derivatives) at infinity. This space is a Banach space for the norm

$$
\|v\|_{p,\infty} = \max_{0 \le k \le p} \|d^k v\|_\infty,
$$

where $\|\cdot\|_\infty$ is the usual supremum norm. Introduce a Hilbert space $V$ of vector fields on $\mathbb{R}^d$ which is continuously embedded in $B_p$ for some $p \ge 1$, which means that there exists a constant $C$ such that $\|v\|_{p,\infty} \le C\,\|v\|_V$ for all $v$ in $V$, where $\|\cdot\|_V$ denotes the Hilbert norm on $V$ (and we will denote the associated inner product by $\langle\,\cdot\,,\,\cdot\,\rangle_V$). This assumption implies that $V$ is a reproducing kernel Hilbert space (RKHS). Because $V$ is a space of vector fields, the definition of the associated kernel slightly differs from the usual case of scalar-valued functions, in that the kernel is matrix-valued. More precisely, a direct application of Riesz's Hilbert space representation theorem implies that there exists a function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathcal{M}_d(\mathbb{R})$, where $\mathcal{M}_d(\mathbb{R})$ is the space of $d$ by $d$ real matrices, such that

1. The vector field $K(\cdot, y)a : x \mapsto K(x, y)a$ belongs to $V$ for all $y, a \in \mathbb{R}^d$.
2. For $v \in V$ and for all $y, a \in \mathbb{R}^d$, $\langle K(\cdot, y)a\,,\, v\rangle_V = a^T v(y)$.

These properties imply that $\langle K(\cdot, x)a\,,\, K(\cdot, y)b\rangle_V = a^T K(x, y)\, b$ for all $x, y, a, b \in \mathbb{R}^d$, which in turn implies that $K(y, x) = K(x, y)^T$ for all $x, y \in \mathbb{R}^d$.

Diffeomorphisms can be generated as flows of ordinary differential equations (ODEs) associated with time-dependent elements of $V$. More precisely, let $v \in L^2([0, 1], V)$, i.e., $v(t) \in V$ for $t \in [0, 1]$ and

$$
\int_0^1 \|v(t)\|_V^2\, dt < \infty.
$$

Then the ODE $\partial_t y = v(t, y)$ has a unique solution over $[0, 1]$, and the flow associated with this ODE is the function $\varphi_v : (t, x) \mapsto y(t)$ where $y$ is the solution starting at $x$. This flow is, at all times, a diffeomorphism of $\mathbb{R}^d$, and satisfies the equation $\partial_t \varphi = v(t) \circ \varphi$, $\varphi(0) = \mathrm{id}$ (the identity map). Here, and in the rest of this paper, we make the small abuse of notation of writing $v(t)(y) = v(t, y)$, where, in this case, $v(t) \in V$ for all $t \in [0, 1]$. Similarly, we will often write $\varphi_v(t)$ for the function $x \mapsto \varphi_v(t, x)$, so that $\varphi_v$ may be considered either as a time-dependent diffeomorphism, or as a function of both time and space variables.

The set of diffeomorphisms that can be generated in such a way forms a group, denoted $\mathrm{Diff}_V$ since it depends on $V$. Given $\psi_1 \in \mathrm{Diff}_V$, one defines the optimal deformation cost $\Lambda(\psi_1)$ from $\mathrm{id}$ to $\psi_1$ as the minimum of $\int_0^1 \|v(t)\|_V^2\, dt$ over all $v \in L^2([0, 1], V)$ such that $\varphi_v(1) = \psi_1$. If we let $D(\psi_1, \psi_2) = \Lambda(\psi_2 \circ \psi_1^{-1})^{1/2}$, then $D$ is a geodesic distance on $\mathrm{Diff}_V$ associated with the right-invariant Riemannian metric generated by $v \mapsto \|v\|_V$ on $V$. We refer to Younes (2010) for more details and additional properties of this construction.

For our purposes here, we only need to notice that the minimization of the objective function in (1) can be rewritten as an optimal control problem minimizing

$$
E(v, \theta) = \int_0^1 \|v(t)\|_V^2\, dt + \lambda g(\theta) + \frac{1}{\sigma^2}\, \Gamma(F(\cdot\,; \theta), \varphi(1) \cdot T_0) \tag{3}
$$

over $v \in L^2([0, 1], V)$, $\theta \in \mathbb{R}^q$, and subject to the constraint that $\varphi(t)$ satisfies the equation $\partial_t \varphi = v \circ \varphi$ with $\varphi(0) = \mathrm{id}$. We point out that, under mild regularity conditions on the dependency of $\Gamma$ with respect to $T_0$ (continuity in $x_1, \ldots, x_N$ suffices), a minimizer of this function in $v$ for fixed $\theta$ always exists, with $v \in L^2([0, 1], V)$.
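As an illustration of the construction above, the following sketch (our own, with a Gaussian scalar kernel standing in for $K$ and a plain Euler scheme) integrates the flow of a kernel-parametrized, time-dependent vector field and checks numerically that it can be undone by flowing backward, the practical counterpart of the statement that such flows are diffeomorphisms.

```python
import numpy as np

def kappa(x, y, rho=1.0):
    # scalar Gaussian kernel; the matrix kernel of (13) is kappa(x, y) times the identity
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2 * rho ** 2))

def integrate_flow(x, centers, alphas, rho=1.0, reverse=False):
    """Euler approximation of phi_v(1, x), or of its inverse when reverse=True, for
    v(t, .) = sum_j kappa(., centers[t, j]) alphas[t, j].
    centers, alphas: arrays of shape (T, n, d); x: array of shape (m, d)."""
    x = np.asarray(x, dtype=float).copy()
    T = len(centers)
    steps = reversed(range(T)) if reverse else range(T)
    sign = -1.0 if reverse else 1.0
    for t in steps:
        w = kappa(x[:, None, :], centers[t][None, :, :], rho)   # (m, n)
        x += sign * (w @ alphas[t]) / T
    return x

# toy check: flowing forward then backward approximately recovers the input
rng = np.random.default_rng(0)
T, n, d = 100, 5, 2
centers = rng.normal(size=(T, n, d))
alphas = 0.2 * rng.normal(size=(T, n, d))
pts = rng.normal(size=(200, d))
back = integrate_flow(integrate_flow(pts, centers, alphas), centers, alphas, reverse=True)
print(np.abs(back - pts).max())   # small, and decreasing as T grows
```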
3. Optimal Control Problem

3.1 Reduction

The minimization in (3) can be reduced using an RKHS argument, similar to the kernel trick invoked in standard kernel methods (Aronszajn, 1950; Duchon, 1977; Meinguet, 1979; Wahba, 1990; Schölkopf and Smola, 2002). Let $z_k(t) = \varphi(t, x_k)$. Because the endpoint cost $\Gamma$ only depends on $(z_1(1), \ldots, z_N(1))$, it suffices to compute these trajectories, which satisfy $\partial_t z_k = v(t, z_k)$. This implies that an optimal $v$ must be such that, at every time $t$, $\|v(t)\|_V^2$ is minimal over all $w \in V$ satisfying $w(z_k(t)) = v(t, z_k(t))$, $k = 1, \ldots, N$, which requires $v(t)$ to take the form

$$
v(t) = \sum_{k=1}^N K(\cdot, z_k(t))\, a_k(t), \tag{4}
$$

where $K$ is the kernel of the RKHS $V$ and $a_1, \ldots, a_N$ are unknown time-dependent vectors in $\mathbb{R}^d$, which provide our reduced variables. Letting $a = (a_1, \ldots, a_N)$, the reduced problem requires to minimize

$$
E(a(\cdot), \theta) = \int_0^1 \sum_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t))\, a_l(t)\, dt + \lambda g(\theta) + \frac{1}{\sigma^2}\, \Gamma(F(\cdot\,; \theta), \mathcal{T}(z(1))) \tag{5}
$$

subject to $\partial_t z_k = \sum_{l=1}^N K(z_k, z_l)\, a_l$, $z_k(0) = x_k$, with the notation $z = (z_1, \ldots, z_N)$ and $\mathcal{T}(z) = (z_1, y_1, \ldots, z_N, y_N)$.

3.2 Optimality Conditions and Gradient

We now consider the minimization problem with fixed $\theta$ (optimization in $\theta$ will depend on the specific choice of classifier $F$ and risk function $\Gamma$). For the optimal control problem (5), the state space is the space to which the state variable $z = (z_1, \ldots, z_N)$ belongs, and is therefore $Q = (\mathbb{R}^d)^N$. The control space contains the control variable $a$, and is $U = (\mathbb{R}^d)^N$. Optimality conditions for the variable $a$ are provided by Pontryagin's maximum principle (PMP). They require the introduction of a third variable (co-state), denoted $p \in Q$, and of a control-dependent Hamiltonian $H_a$ defined on $Q \times Q$, given, in our case, by

$$
H_a(p, z) = \sum_{k,l=1}^N (p_k - a_k)^T K(z_k, z_l)\, a_l. \tag{6}
$$

(In this expression, $a$, $p$ and $z$ do not depend on time.) The PMP (Hocking, 1991; Macki and Strauss, 2012; Miller et al., 2015; Vincent and Grantham, 1997) then states that any optimal solution $a$ must be such that there exists a time-dependent co-state satisfying

$$
\begin{cases}
\partial_t z = \partial_p H_{a(t)}(p(t), z(t)) \\
\partial_t p = -\partial_z H_{a(t)}(p(t), z(t)) \\
a(t) = \mathrm{argmax}_a\, H_a(p(t), z(t))
\end{cases} \tag{7}
$$

with boundary conditions $z(0) = (x_1, \ldots, x_N)$ and

$$
p(1) = -\frac{1}{\sigma^2}\, \partial_z \Gamma(F(\cdot\,; \theta), \mathcal{T}(z(1))). \tag{8}
$$

These expressions are closely related to those allowing for the computation of the differential of $E$ with respect to $a(\cdot)$, which is given by $\partial_{a(\cdot)} E(a(\cdot), \theta) = u(\cdot)$ with

$$
u_k(t) = -\sum_{l=1}^N K(z_k(t), z_l(t))\,\big(p_l(t) - 2 a_l(t)\big), \tag{9}
$$

where $p$ solves

$$
\begin{cases}
\partial_t z = \partial_p H_{a(t)}(p(t), z(t)) \\
\partial_t p = -\partial_z H_{a(t)}(p(t), z(t))
\end{cases} \tag{10}
$$

with boundary conditions $z(0) = (x_1, \ldots, x_N)$ and $p(1) = -\frac{1}{\sigma^2}\, \partial_z \Gamma(F(\cdot\,; \theta), \mathcal{T}(z(1)))$. Concretely, the differential is computed by (i) solving the first equation of (10), which does not involve $p$, (ii) using the obtained value of $z(1)$ to compute $p(1)$ from the boundary condition, then (iii) solving the second equation of (10) backward in time to obtain $p$ at all times, and (iv) plugging $p$ into the expression of $u(t)$.

For practical purposes, the discrete-time version of the problem is obviously more useful, and its differential is obtained from a similar dynamic programming (or back-propagation) computation. Namely, discretize time over $0, 1, \ldots, T$ and consider the objective function

$$
E(a(\cdot), \theta) = \frac{1}{T} \sum_{t=0}^{T-1} \sum_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t))\, a_l(t) + \lambda g(\theta) + \frac{1}{\sigma^2}\, \Gamma(F(\cdot\,; \theta), \mathcal{T}(z(T))) \tag{11}
$$

with

$$
z_k(t+1) = z_k(t) + \frac{1}{T} \sum_{l=1}^N K(z_k(t), z_l(t))\, a_l(t), \qquad z_k(0) = x_k.
$$

We therefore use a simple Euler scheme to discretize the state ODE. Note that the state is discretized over $0, \ldots, T$ and the control over $0, \ldots, T-1$.
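The following is a minimal numpy sketch of the discrete forward pass in (11), with a Gaussian scalar kernel standing in for $K$ and names of our own choosing; the endpoint term $\Gamma$ and the optimization itself are left out.

```python
import numpy as np

def kernel_matrix(z, rho=1.0):
    # Gram matrix k_{kl} = kappa(z_k, z_l) for a Gaussian scalar kernel;
    # with the matrix kernel (13), K(z_k, z_l) = k_{kl} * Id
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * rho ** 2))

def forward_pass(x, a, rho=1.0):
    """Euler discretization of the reduced state equation and of the running cost in (11).
    x: (N, d) training points; a: (T, N, d) control vectors a_k(t)."""
    z = np.array(x, dtype=float)
    T = len(a)
    kinetic = 0.0
    for t in range(T):
        k = kernel_matrix(z, rho)                   # (N, N)
        kinetic += np.sum((k @ a[t]) * a[t]) / T    # sum_{k,l} a_k(t)^T K(z_k, z_l) a_l(t)
        z = z + (k @ a[t]) / T                      # z_k(t+1) = z_k(t) + (1/T) sum_l K a_l(t)
    return z, kinetic
```

The full objective adds $\lambda g(\theta)$ and the endpoint cost evaluated at $z(T)$; its differential in $a$ can then be obtained by the co-state recursion described next (or, equivalently, by automatic differentiation of a forward pass of this kind).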
The differential of $E$ is now given by the following expression, very similar to that obtained in continuous time: $\partial_{a(\cdot)} E(a(\cdot), \theta) = u(\cdot)$ with

$$
u_k(t) = -\sum_{l=1}^N K(z_k(t), z_l(t))\,\big(p_l(t) - 2 a_l(t)\big), \qquad t = 0, \ldots, T-1,
$$

where $p$ (discretized over $0, \ldots, T-1$) can be computed using

$$
\begin{cases}
z(t+1) = z(t) + \frac{1}{T}\, \partial_p H_{a(t)}(p(t), z(t)) \\
p(t-1) = p(t) + \frac{1}{T}\, \partial_z H_{a(t)}(p(t), z(t))
\end{cases} \tag{12}
$$

with boundary conditions $z(0) = (x_1, \ldots, x_N)$ and $p(T-1) = -\frac{1}{\sigma^2}\, \partial_z \Gamma(F(\cdot\,; \theta), \mathcal{T}(z(T)))$.

These computations allow us to compute the differential of the objective function with respect to $a$. The differential in $\theta$ depends on the selected terminal classifier, and its expression for the function chosen in (2) is standard. We emphasize the fact that we are speaking of the differential of the objective function rather than of its gradient. Our implementation uses a Riemannian (sometimes called "natural") gradient with respect to the metric

$$
\big\langle \eta_1(\cdot)\,,\,\eta_2(\cdot) \big\rangle = \int_0^1 \sum_{k,l=1}^N \eta_{1,k}(t)^T K(z_k(t), z_l(t))\,\eta_{2,l}(t)\,dt
$$

with $\partial_t z_k = \sum_{l=1}^N K(z_k(t), z_l(t))\, a_l(t)$. With respect to this metric, one has $\nabla_{a(\cdot)} E(a(\cdot), \theta) = -(p - 2a)$, a very simple expression that can also be used in the discrete case. Using this Riemannian inner product as a preconditioner for the deformation parameters (and a standard Euclidean inner product on the other parameters), the experiments in sections 6 and 7 run Polak-Ribière conjugate gradient iterations (Nocedal and Wright, 1999) to optimize the objective function. (We also experimented with limited-memory BFGS, which does not use natural gradients, and found that the conditioned conjugate gradient performed better on our data.)

4.1 General Principles

To fully specify the algorithm, one needs to select the RKHS $V$ or, equivalently, its reproducing kernel $K$. They constitute important components of the model because they determine the regularity of the estimated diffeomorphisms. We recall that $K$ is a kernel over vector fields, and is therefore matrix-valued. One simple way to build such a kernel is to start with a scalar positive kernel $\kappa : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and let

$$
K(x, y) = \kappa(x, y)\, \mathrm{Id}_{\mathbb{R}^d}. \tag{13}
$$

We will refer to such kernels as scalar. One can choose $\kappa$ from the large collection of known positive kernels (and their infinite number of possible combinations; Aronszajn (1950); Dyn (1989); Schölkopf and Smola (2002); Buhmann (2003)). The most common options are Gaussian kernels,

$$
\kappa(x, y) = \exp(-|x - y|^2 / 2\rho^2), \tag{14}
$$

or Matérn kernels of class $C^k$,

$$
\kappa(x, y) = P_k(|x - y|/\rho)\, \exp(-|x - y|/\rho), \tag{15}
$$

where $P_k$ is a reversed Bessel polynomial of order $k$. In both cases, $\rho$ is a positive scale parameter. The Matérn kernels have the nice property that their associated RKHS is equivalent to a Sobolev space of order $k + d/2$. Vector fields $v$ in the RKHS associated with a matrix kernel such as (13), where $\kappa$ is a radial basis function (RBF), are such that each coordinate function of $v$ belongs to the scalar RKHS associated with $\kappa$, which is translation and rotation invariant (i.e., the transformations that associate to a scalar function $h$ the functions $x \mapsto h(R^T(x - b))$ are isometries, for all rotation matrices $R$ and all vectors $b \in \mathbb{R}^d$).

4.2 Graph-Based Kernels

While (13) provides a simple recipe for the definition of matrix-valued kernels, other interesting choices can be made within this class. A notion of rotation and translation invariance more adapted to spaces of vector fields, in which one requires that replacing $v : \mathbb{R}^d \to \mathbb{R}^d$ by $x \mapsto R\, v(R^T(x - b))$ is an isometry of $V$ for all $R, b$, leads to a more general class of matrix kernels extensively discussed in (Micheli and Glaunès, 2014).
When the data is structured, however (e.g., when it is defined over a grid), rotation invariance may not be a good requirement, since it allows, for example, for permuting coordinates, which would break the data structure. In this context, other choices may be preferable, as illustrated by the following example. Assume that the data is defined over a graph, say $G$ with $d$ vertices. Then one may consider matrix kernels relying on this structure. For example, letting $N_i$ denote the set of nearest neighbors of $i$ in $G$, one can simply take

$$
K(x, y) = \mathrm{diag}\big(\Phi(|P_i x - P_i y|),\ i = 1, \ldots, d\big) \tag{16}
$$

where $P_i x$ is the vector $(x_j,\ j \in N_i)$ and $\Phi$ is an RBF associated to a positive radial scalar kernel.

4.3 Introducing Affine Transformations

RKHSs of vector fields built from RBFs have the property that all their elements (and several of their derivatives) vanish at infinity. As a consequence, these spaces do not contain simple transformations, such as translations or more general affine transformations. It is however possible to complement them with such mappings, defining

$$
\hat V_{\mathfrak a} = \{ g + v : g \in \mathfrak a,\ v \in V \}
$$

where $\mathfrak a$ is any Lie sub-algebra of the Lie algebra of the affine group (so that any element $g \in \mathfrak a$ takes the form $g(x) = Ax + b$, where $A$ is a matrix and $b$ is a vector). In particular, $\mathfrak a$ can be the whole space of affine transformations. Since $\mathfrak a$ and $V$ intersect only at $\{0\}$, one can define without ambiguity a Hilbert norm on $\hat V_{\mathfrak a}$ by letting

$$
\|g + v\|^2_{\hat V_{\mathfrak a}} = \|g\|^2_{\mathfrak a} + \|v\|^2_V
$$

where $\|g\|_{\mathfrak a}$ is any inner-product norm on $\mathfrak a$. A simple choice, for $g(x) = Ax + b$, can be to take

$$
\|g\|^2_{\mathfrak a} = \kappa_1\, \mathrm{trace}(AA^T) + \kappa_2\, |b|^2. \tag{17}
$$

Both $\mathfrak a$ and $\hat V_{\mathfrak a}$ are RKHSs in this case, respectively with kernels $K_{\mathfrak a}$ and $K_{\mathfrak a} + K$, where $K$ is the kernel of $V$. If the norm on $\mathfrak a$ is given by (17), then $K_{\mathfrak a}(x, y) = (\kappa_1^{-1}\, x^T y + \kappa_2^{-1})\, \mathrm{Id}_{\mathbb{R}^d}$, as can be deduced from the definition of a reproducing kernel.

Instead of using this extended RKHS, another option is to model affine transformations separately from the vector field. This leads to replacing (4) by

$$
\hat v(t, \cdot) = g(t, \cdot) + \sum_{k=1}^N K(\cdot, z_k(t))\, a_k(t)
$$

and the cost (5) by

$$
E(a(\cdot), \theta) = \int_0^1 \|g(t)\|^2_{\mathfrak a}\, dt + \int_0^1 \sum_{k,l=1}^N a_k(t)^T K(z_k(t), z_l(t))\, a_l(t)\, dt + \lambda g(\theta) + \frac{1}{\sigma^2}\, \Gamma(F(\cdot\,; \theta), \mathcal{T}(z(1)))
$$

with $\partial_t z_k = \hat v(t, z_k(t))$, $z_k(0) = x_k$. The two approaches are, in theory, equivalent, in that the second one simply ignores the reduction on the affine part of the vector field, but it may be helpful, numerically, to use separate variables in the optimization process for the affine transform and the reduced coefficients of the vector field, because this gives more flexibility to the optimization. The derivation of the associated optimality conditions and gradient is similar to that made in section 3.2 and left to the reader.

Another point worth mentioning is that, if $\varphi$ satisfies $\partial_t \varphi = g \circ \varphi + v \circ \varphi$ for (time-dependent) $g \in \mathfrak a$ and $v \in V$, and if one defines the time-dependent affine transformation $\rho$ by $\partial_t \rho = g \circ \rho$, then $\varphi = \rho \circ \psi$, where $\psi$ satisfies $\partial_t \psi = w \circ \psi$ with $w = \rho_L^{-1}\, v \circ \rho$, $\rho_L$ being the linear part of $\rho$. In the special case when $\mathfrak a$ is the Lie algebra of the Euclidean group, so that $\rho$ is the composition of a rotation and a translation, and when the norm of $V$ is invariant under such transformations (e.g., when using a scalar kernel associated to an RBF), then $\varphi$ and $\psi$ are equidistant to the identity. As a consequence, when the final function $F$ implements a linear model, there is, in theory (numerics may be different), no gain in introducing an affine component restricted to rotations and translations.
The equidistance property does not hold, however, if one uses a larger group of affine transformations, or a norm on $V$ that is not Euclidean-invariant, and the introduction of such transformations actually extends the model in a way that may significantly modify its performance, generally, in our experiments, for the better.

5. Enhancements and Remarks

5.1 Adding a Dimension

We use, in this paper, logistic regression as the final classifier applied to transformed data. This classifier estimates a linear separation rule between the classes, but it should be clear that not every training set can be transformed into a linearly separable one using a diffeomorphism of the ambient space. A very simple one-dimensional example is when the true class associated to an input $x \in \mathbb{R}$ is 0 if $|x| < 1$ and 1 otherwise: no one-dimensional diffeomorphism will make the data separable, since such diffeomorphisms are necessarily monotone. This limitation can be fixed easily, however, by adding a dummy dimension and applying the model to a $(d+1)$-dimensional dataset in which $X$ is replaced by $(X, 0)$. In the example just mentioned, the transformation $(x, \mu) \mapsto (x, \mu + x^2 - 1)$ is a diffeomorphism of $\mathbb{R}^2$ that separates the two classes (along the $y$ axis).

Notice that any binary classifier that can be expressed as $x \mapsto \mathrm{sign}(f(x) - a)$ for some smooth function $f$ can be included in the model class we are considering after adding a dimension, simply taking (letting again $\mu$ denote the additional scalar variable) $\psi(x, \mu) = (x, \mu + f(x))$, which is a diffeomorphism of $\mathbb{R}^{d+1}$, together with the linear direction $u = (0_{\mathbb{R}^d}, 1)$. However, even when it works perfectly on the training set, this transformation will not be optimal in general, and the diffeomorphic classifier would typically prefer a diffeomorphism $\varphi$ that will minimize the overall distortion, essentially trading off some non-linear transformation of the data, $x$, to induce a simpler classification rule.

Figure 1 provides an example of the effect of adding a dimension when the original data is two-dimensional with two classes forming a pattern resembling a target. While no 2D transformation will break the topology and separate the central disk from the exterior ring, using an additional dimension offers a straightforward solution. Another simple example is provided in Figure 2, where a circle (class 1) is inscribed in a half ellipse (class 2). In this case, adding a dimension is not required, but the transformation estimated when this is done is closer to the identity (in the sense of our metric on diffeomorphisms) than the one computed in two dimensions.

Figure 1: Additional dimension separating the center of a target from its external ring. Left: initial configuration; Right: transformed configuration.

Figure 2: Comparison of optimal transformations without (top) and with (bottom) adding a dimension. Left: initial configuration; Right: transformed configuration.

When adding one or more dimensions (in a $c$-class problem, it makes sense to add $c-1$ dimensions), one may want to expand training data in the form $x_k \mapsto (x_k, \delta u_k)$ for small $\delta$, with $u_k$ a realization of a standard Gaussian variable, to help break symmetries in the training phase (test data being still expanded as $x_k \mapsto (x_k, 0)$).
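A minimal sketch of this augmentation follows (the value of $\delta$ in the usage comment is only an example of a "small" value, and the names are ours).

```python
import numpy as np

def add_dummy_dimensions(x, extra_dims=1, delta=0.0, rng=None):
    """Append dummy coordinates: delta * N(0, 1) values for training data (symmetry
    breaking), zeros (delta = 0) for test data."""
    rng = rng or np.random.default_rng()
    extra = delta * rng.standard_normal((x.shape[0], extra_dims))
    return np.concatenate([x, extra], axis=1)

# e.g., with c classes:
# x_train_aug = add_dummy_dimensions(x_train, extra_dims=c - 1, delta=1e-2)
# x_test_aug  = add_dummy_dimensions(x_test,  extra_dims=c - 1, delta=0.0)
```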
5.2 Parametric Dimension and Sub-Riemannian Methods

The dimensional reduction method described in section 3.1 is the exact finite-dimensional parametrization of the spatial component of the original infinite-dimensional problem. Assuming $T$ steps for the time discretization, this results in $TdN$ parameters in the transformation part of the model, while the computation of the gradient has a cost of order $TdN^2$. With our implementation (on a four-core Intel i7 laptop), we were able to handle up to 2,000 training samples in 100 dimensions with 10 time steps, which required about one day for 2,000 gradient iterations. Even if the efficiency of our implementation may be improved, for example through the use of GPU arrays, it is clear that the algorithm we just described will not scale to large datasets unless some approximations are made. Importantly, no significant reduction in computation cost can be obtained by randomly sampling from the training set at each iteration, since the representation (4) requires computing the trajectories of all sample points.

This suggests replacing the optimal expression in (4) by a form whose complexity remains bounded, rather than linear in the number of training samples. One possibility is to require that $v$ is decomposed as a sum similar to (4), but over a smaller number of "control points", i.e., to let

$$
v(t) = \sum_{j=1}^n K(\cdot, \zeta_j(t))\, a_j(t) \tag{18}
$$

with $\partial_t \zeta_j = v(t, \zeta_j)$ and $\zeta_j(0) = \zeta_j^0$, where $\{\zeta_1^0, \ldots, \zeta_n^0\}$ is, for example, a subset of the training data set. As a result, the computational cost per iteration in the data size would now be of order $TdNn$ (because the evolution of all points is still needed to evaluate the objective function and its gradient), but this change would make possible randomized evaluations of the data cost, $\Gamma$ (leading to stochastic gradient descent), that would replace $TdnN$ by $TdnN'$, where $N'$ is the size of the mini-batch used at each iteration. This general scheme will be explored in future work, where optimal strategies for selecting an $n$-point subset of the training data must be analyzed, including in particular the trade-off they imply between computation time and sub-optimality of solutions. Even though they appeared in a different context, some inspiration may be obtained from similar approaches that were explored in shape analysis (Younes, 2012; Durrleman et al., 2013; Younes, 2014; Durrleman et al., 2014; Gris et al., 2018).

Another plausible choice is to specify a parametric form for the vector field at a given time. Models motivated by neural architectures have been introduced in Chen et al. (2019), and we provide in section 6.11 a few comparative results between the Riemannian approach that we introduced and a planar representation of the vector field (Rezende and Mohamed, 2015) in the form

$$
v(t, x) = a(t)\, \Phi(h(t)^T x + b(t)) \tag{19}
$$

where $a(t)$ and $h(t)$ are $d$-dimensional vectors, $b$ is a scalar and $\Phi$ is a non-linear function (for example, $\Phi(t) = \max(t, 0)$, the positive part, or ReLU). While the exact expression of $\|v(t)\|_V$ may be challenging to compute analytically when $v$ is given in this form, it is sufficient, in order to ensure the existence of solutions to the state equation, to control the supremum norm and the Lipschitz constant of $v(t)$. One can therefore replace $\int_0^1 \|v(t)\|_V^2\, dt$ in the optimal control formulation (3) by, e.g.,

$$
\int_0^1 \big(\alpha\, |a(t)|^2 + \beta\, |h(t)|^2 + \gamma\, b(t)^2\big)\, dt.
$$

Note that the Lipschitz constant of $v(t, \cdot)$ is controlled (assuming that $\Phi$ is Lipschitz) by $|a(t)|\,|h(t)|$, so that, since the finiteness of the integral above ensures that $\int_0^1 |a(t)|\,|h(t)|\, dt$ is finite, the flow is guaranteed to provide a homeomorphic transformation.
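For reference, here is a sketch of the planar parametrization (19) and of the surrogate penalty above, with a ReLU non-linearity; the names and array layouts are our own.

```python
import numpy as np

def planar_field(x, a, h, b):
    """v(t, x) = a * max(h^T x + b, 0) for one time step (equation (19), ReLU case)."""
    return np.outer(np.maximum(x @ h + b, 0.0), a)      # (m, d)

def planar_flow(x, A, H, B):
    """Euler flow of the planar fields; A, H: (T, d), B: (T,); x: (m, d)."""
    x = np.array(x, dtype=float)
    T = len(A)
    for t in range(T):
        x = x + planar_field(x, A[t], H[t], B[t]) / T
    return x

def penalty(A, H, B, alpha=1.0, beta=1.0, gamma=1.0):
    # discrete stand-in for int_0^1 (alpha|a(t)|^2 + beta|h(t)|^2 + gamma b(t)^2) dt
    return (alpha * np.sum(A ** 2) + beta * np.sum(H ** 2) + gamma * np.sum(B ** 2)) / len(A)
```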
Using this restriction, the computational complexity is linear in the dimension and in the number of training samples, with significantly shorter computation times.

5.3 Deformable Templates

It is important to stress the fact that, even though our model involves diffeomorphic transformations, it is not a deformable template model. The latter type of model typically works with small-dimensional images ($k = 2$ or 3), say $I : \mathbb{R}^k \to \mathbb{R}$, and tries to adjust a $k$-dimensional deformation (using a diffeomorphism $g : \mathbb{R}^k \to \mathbb{R}^k$) such that the deformed image, given by $I \circ g^{-1}$, aligns with a fixed template (and classification based on a finite number of templates would pick the one for which a combination of the deformation and the associated residuals is smallest). The transformation $\psi_g : I \mapsto I \circ g^{-1}$ is a homeomorphism of the space of, say, continuous images. Once images are discretized over a grid with $d$ points, $\psi_g$ becomes (assuming that the grid is fine enough) a one-to-one transformation of $\mathbb{R}^d$, but a very special one. In this context, the model described in this paper would be directly applied to discrete images, looking for a $d$-dimensional diffeomorphism that would include deformations such as $\psi_g$, but many others, involving also variations in the image values or more complex transformations (including, for example, reshuffling all image pixels in arbitrary orders!).

6. Comparative Experiments

6.1 Classifiers Used for Comparison

We now provide a few experiments that illustrate some of the advantages of the proposed diffeomorphic learning method, and some of its limitations as well. We will compare the performance of this algorithm with a few off-the-shelf methods, namely k-nearest neighbors (kNN), linear and non-linear SVM, random forests (RF), multi-layer perceptrons with 1, 2 and 5 hidden layers (abbreviated below as MLP1, MLP2 and MLP5) and logistic regression (Hastie et al., 2003; Bishop, 2006). The classification rates that are reported were evaluated on a test set containing 2,000 examples per class (except for MNIST, for which we used the test set available with this data). We used the scikit-learn Python package (Pedregosa et al., 2011) with the following parameters (most being defaults in scikit-learn).

- Linear SVM: $\ell^2$ penalty with default weight $C = 1$, with one-vs-all multi-class strategy when relevant.
- Kernel SVM: $\ell^2$ penalty with weight $C$ estimated as described in section 7. Gaussian (i.e., RBF) kernel with coefficient $\gamma$ identical to that used for the kernel in diffeomorphic learning. One-vs-all multi-class strategy when relevant.
- Random forests: 100 trees, with Gini entropy splitting rule, with the default choice ($\sqrt{d}$) of number of features at each node.
- k-nearest neighbors: with the standard Euclidean metric and the default number of neighbors (five).
- Multi-layer perceptrons: with ReLU activations, Adam solver, constant learning rate and 10,000 maximal iterations, using 1, 2 or 5 hidden layers, each composed of 100 units.
- Logistic regression: $\ell^2$ penalty with weight $C = 1$.

The same classifier is used as the final step of the diffeomorphic learning algorithm, so that its performance on transformed data is also the performance of the algorithm that is proposed in this paper. In all cases, we added a dummy dimension to the data as described in section 5.1 when running diffeomorphic learning (classification results on original data were obtained without the added dimension). We also used an affine transformation to complement the kernel, as described in section 4.3. A sketch of the corresponding scikit-learn setup is given below.
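The following illustrates how these comparison classifiers can be instantiated with scikit-learn, following the settings listed above; the RBF coefficient `gamma_rbf` and the kernel-SVM weight `C_rbf` are placeholders (they are set as described above and in section 7), and the one-vs-all wrapping used in the multi-class experiments is omitted.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def baseline_classifiers(gamma_rbf, C_rbf=1.0):
    # settings as in section 6.1 (mostly scikit-learn defaults)
    return {
        "Log. reg.": LogisticRegression(C=1.0),
        "lin. SVM": LinearSVC(C=1.0),
        "SVM": SVC(kernel="rbf", gamma=gamma_rbf, C=C_rbf),
        "RF": RandomForestClassifier(n_estimators=100),
        "kNN": KNeighborsClassifier(n_neighbors=5),
        "MLP1": MLPClassifier(hidden_layer_sizes=(100,), max_iter=10000),
        "MLP2": MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=10000),
        "MLP5": MLPClassifier(hidden_layer_sizes=(100,) * 5, max_iter=10000),
    }
```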
Note that adding a dimension was not always necessary for learning (especially with large-dimensional problems), nor did the affine transform always improve the results, but they never harmed the results in any significant way, so that it was simpler to always turn on these options in our experiments. The optimization procedure was initialized with a vanishing vector field (i.e., $\psi = \mathrm{id}$) and run until numerical stabilization (with a limit of 2,000 iterations). Doing so was always overkill in terms of classification performance, because in almost all cases the classification error on the test set stabilizes well before the algorithm has finished optimizing the transformation. It was however important to avoid stopping the minimization early in order to make sure that optimizing the diffeomorphism did not result in overfitting. We now describe the datasets that we used (all but the last one being synthetic) and compare the performances of the classifiers above on the original data and on the transformed data after learning.

6.2 Tori datasets

In our first set of experiments, we let $D_i = R(T_i \times \mathbb{R}^{d-3})$, where $T_1, T_2$ are non-intersecting tori in $\mathbb{R}^3$ and $R$ is a random $d$-dimensional rotation. The tori are positioned as illustrated in the first panel of Figure 3, so that, even though they have an empty intersection, they are not linearly separable. The distribution of training and test data is (before rotation) the product of a uniform distribution on the torus and of a standard Gaussian in the $d-3$ remaining dimensions.

Figure 3: Left: Tori data (d = 3). Center: Spherical layers (d = 2). Right: RBF data (d = 2).

Classification results for this dataset are summarized in Table 1. Here, we let the number of noise dimensions vary over 0, 7 and 17 (so that the total number of dimensions is 3, 10 and 20), and the number of training samples is 200, 500 or 1,000. The problem becomes very challenging for most classifiers when the number of noisy dimensions is large, in which case a multi-layer perceptron with one hidden layer seems to perform best. All other classifiers see their performance improved after diffeomorphic transformation of the data. One can also notice that, after transformation, all classifiers perform approximately at the same level.

                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
Tori, d = 3, 100 samples per class
Original Data        0.341     0.341    0.000  0.007  0.000  0.004  0.000  0.000
Transformed Data     0.000     0.000    0.000  0.007  0.000  0.000  0.000  0.000
Tori, d = 10, 100 samples per class
Original Data        0.312     0.317    0.280  0.311  0.309  0.159  0.170  0.233
Transformed Data     0.159     0.163    0.162  0.176  0.153  0.155  0.153  0.163
Tori, d = 10, 250 samples per class
Original Data        0.325     0.326    0.207  0.285  0.257  0.073  0.093  0.109
Transformed Data     0.023     0.024    0.017  0.021  0.012  0.030  0.026  0.025
Tori, d = 20, 100 samples per class
Original Data        0.320     0.317    0.324  0.376  0.369  0.325  0.325  0.353
Transformed Data     0.321     0.329    0.322  0.347  0.330  0.320  0.319  0.323
Tori, d = 20, 250 samples per class
Original Data        0.320     0.323    0.312  0.339  0.367  0.204  0.284  0.280
Transformed Data     0.249     0.255    0.249  0.277  0.251  0.247  0.247  0.250
Tori, d = 20, 500 samples per class
Original Data        0.316     0.315    0.267  0.305  0.355  0.117  0.130  0.191
Transformed Data     0.163     0.166    0.158  0.173  0.167  0.160  0.163  0.164

Table 1: Comparative performance of classifiers on Tori data.

Figure 4 illustrates how the data is transformed by the diffeomorphism and is typical of the other results.

Figure 4: Visualization of the diffeomorphic flow applied to the 3D tori dataset. Left: training data; Right: test data. Top to bottom: t = 0, t = 0.3, t = 0.7, t = 1. The data is visualized in a frame formed by the discriminant direction followed by the two principal components in the perpendicular space.
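For concreteness, data of this type can be generated along the following lines; the torus radii and the relative position of the two tori are our own choices (the paper only specifies the configuration through Figure 3), and the angles are sampled uniformly, a simple stand-in for the area-uniform law.

```python
import numpy as np

def sample_torus(n, R=2.0, r=0.5, rng=None):
    # points on a torus of radii (R, r) in R^3, sampled uniformly in the two angles
    rng = rng or np.random.default_rng()
    u, v = rng.uniform(0, 2 * np.pi, (2, n))
    return np.stack([(R + r * np.cos(v)) * np.cos(u),
                     (R + r * np.cos(v)) * np.sin(u),
                     r * np.sin(v)], axis=1)

def tori_dataset(n_per_class, d=10, rng=None):
    """Two linked, non-intersecting tori embedded in R^d, plus Gaussian noise dimensions
    and a random orthogonal transformation (rotation up to a reflection)."""
    rng = rng or np.random.default_rng()
    x, y = [], []
    for label, (center, flip) in enumerate([(np.zeros(3), False),
                                            (np.array([2.0, 0.0, 0.0]), True)]):
        t = sample_torus(n_per_class, rng=rng)
        if flip:                      # rotate the second torus into the xz-plane
            t = t[:, [0, 2, 1]]
        t = t + center
        noise = rng.standard_normal((n_per_class, d - 3))
        x.append(np.concatenate([t, noise], axis=1))
        y.append(np.full(n_per_class, label))
    x, y = np.concatenate(x), np.concatenate(y)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return x @ q, y
```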
We also point out that all classifiers except RF are invariant under rotation of the data, so that applying a random rotation when generating it does not change their performance. The estimation of the diffeomorphism using a radial kernel is also rotation invariant. The RF classifier, which is based on comparisons along coordinate axes, is highly affected, however. Without a random rotation, it performs extremely well, with an error rate of only 0.05 with 17 noisy dimensions and 250 examples per class.

6.3 Spherical Layers

We here deduce the class from the sign of $\cos(9|X|)$, where $X$ follows a uniform distribution on the $d$-dimensional unit ball, and we provide results with $d = 3$ (a representation of the data set in 2D is provided in the second panel of Figure 3). We see that linear classifiers cannot do better than chance (this is by design), but that after the estimated diffeomorphic transformation is applied, all classifiers outperform, with a large margin for small datasets, the best non-linear classifiers trained on the original data.

                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
Spherical layers, d = 3, 100 samples per class
Original Data        0.507     0.507    0.357  0.353  0.408  0.478  0.488  0.481
Transformed Data     0.246     0.249    0.329  0.279  0.355  0.246  0.245  0.253
Spherical layers, d = 3, 250 samples per class
Original Data        0.487     0.487    0.233  0.231  0.279  0.394  0.264  0.113
Transformed Data     0.119     0.118    0.185  0.129  0.199  0.149  0.143  0.138
Spherical layers, d = 3, 500 samples per class
Original Data        0.506     0.506    0.185  0.194  0.229  0.299  0.113  0.091
Transformed Data     0.089     0.092    0.165  0.087  0.197  0.092  0.103  0.096

Table 2: Comparative performance of classifiers on 3D spherical layers.

6.4 RBF Datasets

The next dataset generates classes using sums of radial basis functions. More precisely, we let $\rho(z) = \exp(-(z/\alpha)^2)$ with $\alpha = 0.1$ and generate classes according to the sign of the function

$$
\sum_{j=1}^{L} a_j\, \rho(|X - c_j|) - \mu.
$$

In this expression, $X$ follows a uniform distribution over the $d$-dimensional unit ball and $\mu$ is estimated so that the positive and negative classes are balanced. The centers $c_1, \ldots, c_L$ are chosen as $c_j = (3j/L)\, e_{(j \bmod d)+1}$, where $j \bmod d$ is the remainder of the division of $j$ by $d$ and $e_1, \ldots, e_d$ is the canonical basis of $\mathbb{R}^d$. The coefficients are $a_j = \cos(6\pi j/L)$, and we took $L = 100$. The third panel of Figure 3 depicts the resulting regions of the unit disc in the case $d = 2$. Table 3 provides classification performances for various combinations of dimension and sample size. Except when $d = 5$ and only 100 samples per class are observed, in which case linear classifiers provide the best rates, the best classification is obtained after diffeomorphic transformation.

                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
RBF data, d = 2, 100 samples per class
Original Data        0.381     0.381    0.158  0.126  0.208  0.183  0.193  0.163
Transformed Data     0.142     0.139    0.149  0.153  0.186  0.140  0.136  0.141
RBF data, d = 2, 250 samples per class
Original Data        0.358     0.359    0.109  0.150  0.136  0.176  0.158  0.127
Transformed Data     0.101     0.102    0.106  0.102  0.130  0.126  0.122  0.119
RBF data, d = 2, 500 samples per class
Original Data        0.377     0.378    0.089  0.103  0.097  0.119  0.101  0.081
Transformed Data     0.058     0.055    0.085  0.055  0.090  0.070  0.054  0.142
RBF data, d = 3, 100 samples per class
Original Data        0.307     0.307    0.209  0.207  0.221  0.218  0.233  0.250
Transformed Data     0.171     0.175    0.202  0.168  0.195  0.162  0.162  0.154
RBF data, d = 3, 250 samples per class
Original Data        0.310     0.308    0.139  0.182  0.148  0.151  0.151  0.103
Transformed Data     0.083     0.077    0.133  0.082  0.122  0.086  0.083  0.089
RBF data, d = 3, 500 samples per class
Original Data        0.337     0.335    0.129  0.112  0.114  0.144  0.093  0.094
Transformed Data     0.071     0.070    0.118  0.063  0.102  0.080  0.068  0.078
RBF data, d = 5, 100 samples per class
Original Data        0.216     0.218    0.221  0.256  0.254  0.280  0.259  0.233
Transformed Data     0.221     0.223    0.221  0.222  0.219  0.221  0.223  0.214
RBF data, d = 5, 250 samples per class
Original Data        0.233     0.233    0.197  0.226  0.213  0.250  0.273  0.240
Transformed Data     0.199     0.204    0.185  0.215  0.196  0.207  0.209  0.205
RBF data, d = 5, 500 samples per class
Original Data        0.231     0.230    0.167  0.219  0.184  0.206  0.189  0.158
Transformed Data     0.128     0.129    0.133  0.127  0.142  0.141  0.140  0.143

Table 3: Comparative performance of classifiers on RBF data.

6.5 Mixture of Gaussians

In the next example, we assume that the conditional distribution of $X$ given $Y = y$ is normal with mean $m_y$ and covariance matrix $\Sigma_y$. We used three classes ($y \in \{1, 2, 3\}$) in dimension 20, with $m_1 = (0, \ldots, 0)^T$, $m_2 = (-1, -1, -1, 0, \ldots, 0)^T$ and $m_3 = (-1, 1, 1, 0, \ldots, 0)^T$, $\Sigma_1 = 10\, \mathrm{Id}_{\mathbb{R}^d}$, $\Sigma_2(i, j) = 20 \exp(-|i - j|/20)$ and $\Sigma_3(i, j) = 20 \exp(-|i - j|/60)$, where $\mathrm{Id}_{\mathbb{R}^d}$ is the $d$-dimensional identity matrix.
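A sketch of how this mixture can be sampled is given below; the mean vectors are passed as arguments (see the values in the text above), and the names are ours.

```python
import numpy as np

def gaussian_mixture_data(n_per_class, means, d=20, rng=None):
    """Sample the three-class Gaussian dataset of section 6.5.
    means: list of three d-dimensional mean vectors; covariances are Sigma_1 = 10 Id,
    Sigma_2(i, j) = 20 exp(-|i-j|/20), Sigma_3(i, j) = 20 exp(-|i-j|/60)."""
    rng = rng or np.random.default_rng()
    idx = np.arange(d)
    covs = [10.0 * np.eye(d),
            20.0 * np.exp(-np.abs(idx[:, None] - idx[None, :]) / 20.0),
            20.0 * np.exp(-np.abs(idx[:, None] - idx[None, :]) / 60.0)]
    x = np.concatenate([rng.multivariate_normal(m, s, size=n_per_class)
                        for m, s in zip(means, covs)])
    y = np.repeat(np.arange(len(means)), n_per_class)
    return x, y
```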
Classification results for training sets with 100 and 250 examples per class are described in Table 4. While multilayer perceptrons perform best on this data, the performance of diffeomorphic classification is comparable, and improves on all other classifiers. A visualization of the transformation applied to this dataset is provided in Figure 5.

                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
Mixture of Gaussians, d = 20, 100 samples per class
Original Data        0.398     0.407    0.225  0.191  0.495  0.143  0.101  0.099
Transformed Data     0.135     0.153    0.175  0.110  0.236  0.134  0.109  0.108
Mixture of Gaussians, d = 20, 250 samples per class
Original Data        0.354     0.359    0.172  0.163  0.427  0.073  0.061  0.063
Transformed Data     0.079     0.091    0.145  0.089  0.187  0.077  0.074  0.074

Table 4: Comparative performance of classifiers on Gaussian mixtures.

Figure 5: Diffeomorphic transformation applied to the Gaussian mixture dataset (evolution visualized at times t = 0.0, 0.3, 0.5, 0.7 and 1.0). The data is 20-dimensional, and the visualization is made in the plane generated by the two discriminative directions provided by logistic regression estimated on transformed data at time t = 1.

6.6 Curve Segments

To build this dataset, we start with two scalar functions and assume that the observed data is the restriction of one of them (where this choice determines the class) to a small discrete interval. More precisely, we let $Y \in \{1, 2\}$ and, given $Y = y$, we let $X = (\psi_y(t - A) + \varepsilon_t,\ t \in I)$, where $I = \{0, 1/d, \ldots, (d-1)/d\}$ is a discretization of the unit interval, $(\varepsilon_t,\ t \in I)$ are independent standard Gaussian variables and $A$ follows a uniform distribution on $[-2, 2]$. For the experiments in Table 5, we took $\psi_y(t) = \log\cosh(t/\beta_y)$ with $\beta_1 = 0.02$ and $\beta_2 = 0.021$. This table shows significant improvements of classification rates after application of the diffeomorphism.
                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
Curve segments, d = 50, 100 samples per class
Original Data        0.484     0.495    0.400  0.448  0.466  0.449  0.444  0.465
Transformed Data     0.324     0.324    0.350  0.378  0.438  0.329  0.335  0.332
Curve segments, d = 50, 250 samples per class
Original Data        0.487     0.486    0.266  0.345  0.353  0.493  0.486  0.467
Transformed Data     0.152     0.157    0.147  0.195  0.248  0.155  0.258  0.216
Curve segments, d = 50, 500 samples per class
Original Data        0.495     0.496    0.155  0.294  0.230  0.505  0.484  0.426
Transformed Data     0.093     0.094    0.062  0.070  0.093  0.093  0.104  0.113

Table 5: Comparative performance of classifiers on curve segment data.

6.7 Xor Data

Here, we consider a 50-dimensional dataset with two classes. Each sample has all coordinates equal to zero except two of them that are equal to $\pm 1$, and the class is 0 if the two nonzero coordinates are equal and 1 otherwise. The position of the two nonzero values is random and each takes the values $\pm 1$ with equal probability. Table 6 shows that diffeomorphic classification performs well on this data with 100 training samples per class. When the number of examples is raised to 250 per class, multi-layer perceptrons with one or two hidden layers perform well too. The other classifiers perform poorly even on the larger dataset. The classes are obviously not linearly separable, explaining the performances of logistic regression and linear SVMs, and the minimum distance between distinct examples from the same class is the same as that between examples from different classes, namely $\sqrt{2}$.

                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
xor data, d = 50, 100 samples per class
Original Data        0.491     0.474    0.489  0.501  0.500  0.339  0.470  0.472
Transformed Data     0.037     0.005    0.039  0.007  0.008  0.012  0.009  0.014
xor data, d = 50, 250 samples per class
Original Data        0.511     0.434    0.510  0.481  0.472  0.011  0.071  0.226
Transformed Data     0.000     0.000    0.000  0.000  0.000  0.000  0.000  0.000

Table 6: Comparative performance of classifiers on xor data.

6.8 Segment Lengths

Our next synthetic example is a simple pattern recognition problem in $d = 100$ dimensions. Samples from class $y$ take the form $(\rho_1 U_1, \ldots, \rho_d U_d)$, where the $\rho_i$ are uniformly distributed between 0.75 and 1.25 and $U = (U_1, \ldots, U_d)$ is a binary vector with exactly $\ell_y$ consecutive ones (possibly wrapping around $\{1, \ldots, d\}$) and $d - \ell_y$ zeros, with $\ell_1 = 10$ and $\ell_2 = 11$. An optimal linear separation between classes is achieved (because of shift invariance) by thresholding the sum of the $d$ variables. However, because of the multiplication by $\rho_1, \ldots, \rho_d$, this simple classification rule performs very poorly, while the same rule applied to the binary variables obtained by thresholding individual entries at, say, 0.5 perfectly separates the classes. It is therefore not surprising that multi-layer perceptrons perform very well on this problem and achieve the best classification rates. All other classifiers perform significantly better when run on the transformed data.
                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
Segment length data, d = 100, 100 samples per class
Original Data        0.462     0.412    0.387  0.491  0.505  0.337  0.343  0.328
Transformed Data     0.341     0.349    0.354  0.347  0.450  0.390  0.385  0.362
Segment length data, d = 100, 250 samples per class
Original Data        0.432     0.412    0.135  0.483  0.503  0.082  0.079  0.084
Transformed Data     0.100     0.098    0.107  0.098  0.443  0.096  0.113  0.114
Segment length data, d = 100, 500 samples per class
Original Data        0.391     0.391    0.045  0.447  0.194  0.025  0.025  0.025
Transformed Data     0.032     0.030    0.032  0.054  0.165  0.029  0.031  0.036

Table 7: Comparative performance of classifiers on segment lengths.

6.9 Segment Pairs

This section describes a more challenging version of the former, in which each data point consists of two sequences of ones on the unit circle (discretized over 50 points), these two sequences being both of length five in class 1, and of lengths 4 and 6 (in any order) in class 2. No linear rule can separate the classes and do better than chance, since such a rule would need to be based on summing the variables (by shift invariance), which returns 10 in both classes. The problem is also challenging for metric-based methods because each data point has close neighbors in the other class: the nearest non-identical neighbor in the same class is obtained by shifting one of the two segments, and would be at distance 2, which is identical to the distance of the closest element from the other class, which is obtained by replacing a point in one segment by 0 and adding a 1 next to the other segment. One way to separate the classes would be to compute moving averages (convolutions) over windows of length 6 along the circle, and to threshold the result at 5.5, which can be represented as a one-hidden-layer perceptron. As shown in Table 8, one-hidden-layer perceptrons do perform best on this data, with some margin compared to all others. Diffeomorphic classification does however improve significantly on the classification rates of all other classifiers.

                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
Segment pair data, d = 50, 100 samples per class
Original Data        0.477     0.466    0.476  0.475  0.458  0.470  0.470  0.488
Transformed Data     0.461     0.457    0.465  0.461  0.469  0.461  0.462  0.461
Segment pair data, d = 50, 250 samples per class
Original Data        0.490     0.485    0.447  0.412  0.492  0.343  0.392  0.405
Transformed Data     0.394     0.392    0.412  0.390  0.444  0.398  0.391  0.386
Segment pair data, d = 50, 500 samples per class
Original Data        0.501     0.502    0.369  0.320  0.504  0.054  0.164  0.284
Transformed Data     0.240     0.238    0.261  0.233  0.378  0.250  0.244  0.253
Segment pair data, d = 50, 1000 samples per class
Original Data        0.465     0.463    0.233  0.191  0.547  0.008  0.038  0.137
Transformed Data     0.071     0.068    0.099  0.073  0.281  0.085  0.082  0.085

Table 8: Comparative performance of classifiers on segment pair data.

6.10 MNIST (Subset)

To conclude this section, we provide (in Table 9) classification results on a subset of the MNIST digit recognition dataset (LeCun, 1998), with 10 classes and 100 examples per class for training. To reduce the computation time, we reduced the dimension of the data by transforming the original 28 × 28 images to 14 × 14, resulting in a 196-dimensional dataset. With this sample size, this reduction had little influence on the performance of classifiers run before diffeomorphic transformation.
All classifiers have similar error rates, with a small improvement obtained after diffeomorphic transformation compared to linear classifiers, reaching a final performance comparable to that obtained by random forests and multilayer perceptrons with two hidden layers. We did observe a glitch in the performance of nonlinear support vector machines, which may result from a poor choice of hyper-parameters (section 6.1) for this dataset.

                   Log. reg.  lin. SVM   SVM     RF     kNN    MLP1   MLP2   MLP5
MNIST data, d = 196, 100 samples per class
Original Data        0.128     0.138    0.228  0.111  0.120  0.122  0.111  0.127
Transformed Data     0.113     0.120    0.325  0.129  0.094  0.110  0.110  0.124

Table 9: Comparative performance of classifiers on a subset of the MNIST dataset.

6.11 Comparison with Neural ODEs

In Table 10, we provide classification performances obtained on the same datasets as above, when the vector fields generating the diffeomorphisms are specified by (19). We still work with one added dimension, which, for neural ODEs, corresponds to the ANODE method in Dupont et al. (2019). Unsurprisingly, results vary with the datasets. In some cases, ANODE achieves top performance over all classifiers: this includes the tori datasets, segment lengths, and segment pairs. In other cases, such as curve segments or RBF, and also MNIST, results are significantly worse than those obtained with most nonlinear classifiers.

Dataset (samples per class)   100     250     500     1000
3D tori                       0.00    NA      NA      NA
10D tori                      0.15    0.01    NA      NA
20D tori                      0.30    0.18    0.02    NA
Spherical layers              0.42    0.31    0.15    NA
2D RBF                        0.30    0.12    0.09    NA
3D RBF                        0.11    0.06    0.09    NA
5D RBF                        0.20    0.27    0.20    NA
M. of G.                      0.13    0.11    NA      NA
Curve segments                0.48    0.49    0.32    NA
Xor                           0.035   0.00    NA      NA
Segment lengths               0.29    0.075   0.025   NA
Segment pairs                 0.46    0.30    0.05    0.015
MNIST                         0.16    NA      NA      NA

Table 10: Classification rates obtained with vector fields represented using equation (19).

Note that, in order to obtain optimal results, we had to use a relatively deep discretization of the ODE, over 50 time steps. In spite of this, the neural-net formulation is significantly faster than the fully Riemannian version.

6.12 Discussion

The results presented in this section are certainly encouraging, and demonstrate that the proposed algorithm has, at least, some potential. We however point out that the classifiers that were used in our experiments were run off-the-shelf, using the setup described in section 6.1. There is no doubt that, after some tuning specific to each dataset, each of these classifiers could perform better. Such an analysis was not our goal here, however, and the classification rates that we obtained must be analyzed with this in mind. We just note that we also used the exact same version of diffeomorphic learning for all datasets, with the hyper-parameter values described in section 7.

We have included kernel-based support vector machines in our comparisons. These classifiers indeed share with the one proposed here the use of a kernel trick to reduce an infinite-dimensional problem to one with a number of parameters comparable to the size of the training set. Some notable differences exist, however. The most fundamental one lies in the role that is taken by the kernel in the modeling and computation. For SVMs and other kernel methods, the kernel provides, or is closely related to, the transformation of the data into a feature space. It indeed defines the inner product in that space, and can itself be considered as an infinite-dimensional representation of the data, through the standard mapping $x \mapsto \kappa(x, \cdot)$.
Kernel methods then implement linear methods in this feature space, such as computing a separating hyperplane for SVMs. For the diffeomorphic classification method that is described here, the kernel is used as a computational tool, similar to the way it appears in spline interpolation. The fundamental concept here is the Hilbert space $V$ and its norm, which is, in the case of a $C^k$ Matérn kernel, a Hilbert Sobolev space of order $k + d/2$. In small dimensions, one could actually directly discretize this norm on a finite grid (using finite differences to approximate the derivatives), as done in shape analysis for image registration (see, e.g., Beg et al. (2005)). The reproducing kernel of $V$ is therefore a mathematical tool that makes explicit the derivation of the discrete representation described in section 3.1, rather than an essential part of the model, as it is for kernel methods such as SVMs.

Another important distinction, which makes the diffeomorphic method deviate from both kernel methods in machine learning and from spline interpolation, is that the kernel (or its associated space $V$) is itself part of a nonlinear dynamical evolution, in which it intervenes at every time. This dynamical aspect of the approach constitutes a strong deviation from SVMs, which apply the kernel transformation once before implementing a linear method. It is, in this regard, closer to feed-forward neural network models, in that it allows for very large non-linear transformations of the data to be applied prior to linear classification. Unlike neural networks, however, diffeomorphic learning operates the transformations within the original space containing the data.

As mentioned in section 6.1, we ran the optimization algorithm until numerical convergence, although it is clear from monitoring classification over time that classes become well separated early on, while the estimation of the optimal transformation typically takes more time. If one is not interested in the limit diffeomorphism (which can be of interest for modeling purposes), significant computation time may be saved by stopping the procedure earlier, using, for example, a validation set.

7. Setting Hyper-Parameters

The diffeomorphic classification method described in this paper requires the determination of several hyper-parameters, which we now describe with a discussion of how they were set in our experiments.

(a) Kernel scale parameter. While the kernel can be any function of positive type, and may therefore be considered as an infinite-dimensional hyper-parameter, we assume that one uses a kernel such as those described in equations (14) and (15) (our experiments use the latter with $k = 3$) and focus on the choice of the scale parameter $\rho$. In our implementation, we have based this selection on two constraints, namely that training data should tend to collaborate with some of their neighbors from the same class, while data from different classes should ignore each other. This resulted in the following computation. (i) To address the first constraint, we evaluate, for each data point, the fifth percentile of its distances to other points from the same class. We then define $\rho_1$ to be the 75th percentile of these values. (ii) For the second constraint, we define $\rho_2$ as the 10th percentile of the minimum distance between each data point and its closest neighbor in another class. We finally set $\rho = \rho_0 = \min(\rho_1, \rho_2)$.
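A sketch of this heuristic follows (variable names are ours).

```python
import numpy as np
from scipy.spatial.distance import cdist

def default_kernel_scale(x, y):
    """Heuristic default rho_0 = min(rho_1, rho_2) for the kernel scale described above."""
    dist = cdist(x, x)
    n = len(x)
    same = (y[:, None] == y[None, :])
    np.fill_diagonal(same, False)                     # ignore self-distances
    # rho_1: 75th percentile over points of the 5th percentile of same-class distances
    rho1 = np.percentile([np.percentile(dist[i, same[i]], 5) for i in range(n)], 75)
    # rho_2: 10th percentile of the distance to the closest point from another class
    other = ~(y[:, None] == y[None, :])
    rho2 = np.percentile([dist[i, other[i]].min() for i in range(n)], 10)
    return min(rho1, rho2)
```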
Table 11 provides a comparison of the performance of the algorithm for values of ρ that deviate from this default value, showing that, in most cases, the performance obtained with ρ0 is not far from the best one. The only exception was found with the Xor dataset, for which a significant improvement was obtained using a large value of ρ. Note that in Tables 11 and 12, all experiments in the same row were run with the same training and test sets.

Dataset        Samples/class  0.25ρ0  0.5ρ0  0.75ρ0  ρ0     1.5ρ0  2ρ0    4ρ0
Sph. Layers    250            0.238   0.199  0.188   0.175  0.165  0.183  0.212
RBF            100            0.170   0.107  0.085   0.083  0.070  0.063  0.063
Tori (d = 10)  100            0.285   0.233  0.207   0.165  0.163  0.175  0.194
M. of G.       100            0.322   0.221  0.156   0.135  0.109  0.103  0.124
Xor            100            0.476   0.458  0.039   0.036  0.042  0.022  0.468
Seg. Length    250            0.164   0.143  0.114   0.097  0.093  0.093  0.86

Table 11: Performance comparison when the kernel scale deviates from the default value ρ0, assessed on the spherical layers, RBF, tori, mixture of Gaussians, Xor and segment length datasets.

(b) Regularization weight. A second important parameter is the regularization weight, σ² in equation (1). In our experiments, this parameter is adjusted online during the minimization algorithm, starting with a large value and slowly decreasing it until the training error reaches a target, δ (a schematic sketch of this schedule is given after Table 12). While this merely replaces one parameter with another, the target value is easier to interpret, and we used δ = 0.005 in all our experiments (resulting de facto in a vanishing training error at the end of the procedure in all cases).

(c) ℓ2 penalty on logistic regression. This parameter, denoted λ in equation (1), was taken equal to 1 in all our experiments.

(d) Time discretization. The value of T in equation (11) may also impact the performance of the algorithm, at least for small values, since one expects the model to converge to its continuous limit for large T. This expectation is confirmed in Table 12, which shows that the classification error rates obtained with the value chosen in our experiments, T = 10, were not far from the asymptotic ones. In some cases, the error rates obtained for smaller values of T may be slightly better, but the difference is marginal.

Dataset        Samples/class  T = 1  T = 2  T = 4  T = 10  T = 20  T = 40  T = 100
Sph. Layers    250            0.199  0.199  0.190  0.197   0.197   0.197   0.197
RBF            250            0.089  0.089  0.085  0.083   0.083   0.083   0.083
Tori (d = 10)  100            0.243  0.160  0.170  0.177   0.186   0.195   0.197
M. of G.       100            0.178  0.169  0.141  0.119   0.114   0.111   0.109
Xor            100            0.473  0.384  0.061  0.036   0.035   0.034   0.034
Seg. Length    250            0.140  0.117  0.100  0.094   0.092   0.091   0.091

Table 12: Performance comparison for various values of the number of discretization steps, T, assessed on the spherical layers, RBF, tori, mixture of Gaussians, Xor and segment length datasets.
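The following schematic fragment illustrates the annealing schedule of item (b). Here minimize_step and training_error are placeholders for the minimization routine applied to the objective (1) and for the current training error; the initial value sigma0 and the decay factor are illustrative choices, only the target δ = 0.005 being taken from our experiments.

def anneal_sigma(minimize_step, training_error, sigma0=10.0, delta=0.005, decay=0.9):
    # Schematic version of the schedule in item (b): minimize the objective at
    # the current value of sigma, then slowly decrease sigma, stopping once the
    # training error reaches the target delta.
    sigma = sigma0
    while training_error() > delta:
        minimize_step(sigma)   # a few iterations of the minimization at this sigma
        sigma *= decay
    return sigma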
8. Conclusion

In this paper, we have introduced the concept of diffeomorphic learning and provided a few illustrations of its performance on simple, but often challenging, classification problems. On this class of problems, the proposed approach appeared quite competitive with the other classifiers used for comparison. Some limitations also appeared, regarding, in particular, the scalability of the method, for which we have provided some options that will be explored in the future. We have only considered, in this paper, applications of diffeomorphic learning to classification problems. Extensions to other contexts will be considered in the future. Some, such as regression, may be relatively straightforward, while others, such as clustering or dimension reduction, will require additional thought, as the results obtained will depend strongly on the amount of metric distortion allowed in the diffeomorphic transformation.

References

N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68:337–404, 1950.

M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comp. Vis., 61(2):139–157, 2005.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Martin D. Buhmann. Radial Basis Functions: Theory and Implementations, volume 12 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 2003.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv:1806.07366 [cs, stat], December 2019. URL http://arxiv.org/abs/1806.07366.

Jean Duchon. Interpolation des fonctions de deux variables suivant le principe de la flexion des plaques minces. R.A.I.R.O. Analyse Numérique, 10:5–12, 1977.

Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3140–3150. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8577-augmented-neural-odes.pdf.

Stanley Durrleman, Stéphanie Allassonnière, and Sarang Joshi. Sparse adaptive parameterization of variability in image ensembles. International Journal of Computer Vision, 101(1):161–183, 2013.

Stanley Durrleman, Marcel Prastawa, Nicolas Charon, Julie R. Korenberg, Sarang Joshi, Guido Gerig, and Alain Trouvé. Morphometry of anatomical shape complexes with dense deformations and sparse parameters. NeuroImage, 101:35–49, 2014.

Nira Dyn. Interpolation and approximation by radial and related functions. In C. K. Chui, L. L. Shumaker, and J. D. Ward, editors, Approximation Theory VI: vol. 1, pages 211–234. Academic Press, 1989.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Barbara Gris, Stanley Durrleman, and Alain Trouvé. A sub-Riemannian modular framework for diffeomorphism-based analysis of shape ensembles. SIAM Journal on Imaging Sciences, 11(1):802–833, 2018.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2003.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings CVPR 2016, pages 770–778, 2016. URL https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html.

Leslie M. Hocking. Optimal Control: An Introduction to the Theory with Applications. Oxford University Press, 1991.

Sarang Joshi and Michael I. Miller. Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing, 9(8):1357–1370, 2000.

Ivan Kobyzev, Simon Prince, and Marcus Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. doi: 10.1109/TPAMI.2020.2992934.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Jack Macki and Aaron Strauss. Introduction to Optimal Control Theory. Springer Science & Business Media, 2012.

Jean Meinguet. Multivariate interpolation at arbitrary points made simple. J. Applied Math. and Physics, 30:292–304, 1979.

Mario Micheli and Joan A. Glaunès. Matrix-valued kernels for shape deformation analysis. Geom. Imaging Comput., 1(1):57–139, 2014.

Michael I. Miller, Alain Trouvé, and Laurent Younes. Hamiltonian systems and optimal control in computational anatomy: 100 years since D'Arcy Thompson. Annual Review of Biomedical Engineering, 17:447–509, 2015.

Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 1999.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of ICML 2015, May 2015. URL https://arxiv.org/abs/1505.05770v6.

François Rousseau, Lucas Drumetz, and Ronan Fablet. Residual networks as flows of diffeomorphisms. Journal of Mathematical Imaging and Vision, pages 1–11, 2019.

Hadi Salman, Payman Yadollahpour, Tom Fletcher, and Kayhan Batmanghelich. Deep diffeomorphic normalizing flows. arXiv:1810.03256 [cs, stat], November 2018. URL http://arxiv.org/abs/1810.03256.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

Alain Trouvé and Yong Yu. Metric similarities learning through examples: an application to shape retrieval. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 50–62. Springer, 2001.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

François-Xavier Vialard, Roland Kwitt, Susan Wei, and Marc Niethammer. A shooting formulation of deep learning. arXiv:2006.10330 [cs, math], June 2020. URL http://arxiv.org/abs/2006.10330.

Thomas L. Vincent and Walter J. Grantham. Nonlinear and Optimal Control Systems. Wiley, 1997.

Grace Wahba. Spline Models for Observational Data. SIAM, 1990.

Christian Walder and Bernhard Schölkopf. Diffeomorphic dimensionality reduction. In Advances in Neural Information Processing Systems, pages 1713–1720, 2009.

Laurent Younes. Shapes and Diffeomorphisms, volume 171. Springer Science & Business Media, 2010.

Laurent Younes. Constrained diffeomorphic shape evolution. Foundations of Computational Mathematics, 12(3):295–325, 2012.

Laurent Younes. Gaussian diffeons for surface and image matching within a Lagrangian framework. Geometry, Imaging and Computing, 1(1):141–171, 2014.