# on_mahalanobis_distance_in_functional_settings__ddd19106.pdf Journal of Machine Learning Research 21 (2020) 1-33 Submitted 3/18; Revised 12/19; Published 1/20 On Mahalanobis Distance in Functional Settings Jos e R. Berrendero joser.berrendero@uam.es Beatriz Bueno-Larraz beatriz.bueno@uam.es Antonio Cuevas antonio.cuevas@uam.es Department of Mathematics Universidad Aut onoma de Madrid Madrid, Spain Editor: Lorenzo Rosasco Mahalanobis distance is a classical tool in multivariate analysis. We suggest here an extension of this concept to the case of functional data. More precisely, the proposed definition concerns those statistical problems where the sample data are real functions defined on a compact interval of the real line. The obvious difficulty for such a functional extension is the non-invertibility of the covariance operator in infinite-dimensional cases. Unlike other recent proposals, our definition is suggested and motivated in terms of the Reproducing Kernel Hilbert Space (RKHS) associated with the stochastic process that generates the data. The proposed distance is a true metric; it depends on a unique real smoothing parameter which is fully motivated in RKHS terms. Moreover, it shares some properties of its finite dimensional counterpart: it is invariant under isometries, it can be consistently estimated from the data and its sampling distribution is known under Gaussian models. An empirical study for two statistical applications, outliers detection and binary classification, is included. The results are quite competitive when compared to other recent proposals in the literature. Keywords: Functional data, Mahalanobis distance, reproducing kernel Hilbert spaces, kernel methods in statistics, square root operator 1. Introduction The classical (finite-dimensional) Mahalanobis distance is a well-known tool in multivariate analysis with multiple applications. Let X be a random variable taking values in Rd with non-singular covariance matrix Σ. In many practical situations it is required to measure the distance between two points x, y Rd when considered as two possible observations drawn from X. Clearly, the usual (square) Euclidean distance x y 2 = (x y) (x y) is not a suitable choice since it disregards the standard deviations and the covariances of the components of X (given a column vector x Rd we denote by x the transpose of x). The most popular alternative is perhaps the classical Mahalanobis distance, M(x, y), defined as M(x, y) = (x y) Σ 1(x y) 1/2 . (1) c 2020 Jos e R. Berrendero, Beatriz Bueno-Larraz and Antonio Cuevas. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-156.html. Berrendero, Bueno-Larraz and Cuevas Very often the interest is focused on studying how extreme a point x is within the distribution of X; this is typically evaluated in terms of M(x, m), where m stands for the vector of means of X. This distance is named after the Indian statistician P. C. Mahalanobis (1893-1972) who first proposed and analyzed this concept (Mahalanobis, 1936) in the setting of Gaussian distributions. Nowadays, some popular applications of the Mahalanobis distance are: supervised classification, outlier detection (Rousseeuw and van Zomeren, 1990 and Penny, 1996), multivariate depth measures (Zuo and Serfling, 2000), hypothesis testing (through Hotelling s statistic, Rencher, 2012, Ch. 5) or goodness of fit (Mardia, 1975). This list of references is far from exhaustive. 1.1. The infinite-dimensional (functional) case. The covariance operator Our framework here is Functional Data Analysis (FDA); see, e.g., Cuevas (2014) for an overview. In other words, we deal with statistical problems involving functional data. Thus our sample is made of trajectories X1(t), . . . , Xn(t) in L2[0, 1] drawn from a second order stochastic process X(t), t [0, 1] with m(t) = E(X(t)). The inner product and the norm in L2[0, 1] will be denoted by , and , respectively. We will henceforth assume that the covariance function K(s, t) = Cov(X(s), X(t)) is continuous. The function K defines a linear operator K : L2[0, 1] L2[0, 1], called covariance operator, given by Kf(t) = Z 1 0 K(t, s)f(s)ds. (2) Note that the covariance operator is the infinite-dimensional version of the covariance transformation x 7 Σx in Rd. Still, its properties are different in some important aspects as we will discuss below. We will define a Mahalanobis distance associated with K assuming that K is injective, that is, all the eigenvalues are strictly positive. For later use, it will be useful to note that the assumption of continuity for K(s, t) entails m L2[0, 1] and E X 2 < . Indeed, since R 1 0 K(t, t)dt = E R 1 0 (X(t) m(t))2dt < , it holds R 1 0 (X(t) m(t))2dt < almost surely (a.s.). Hence, as we are assuming X(t) L2[0, 1] a.s., we also have m L2[0, 1]. Now, E R 1 0 (X(t) m(t))2dt = R 1 0 E(X(t) m(t))2dt < implies E X 2 = E Z 1 0 X2(t)dt = Z 1 0 EX2(t)dt = Z 1 0 E(X(t) m(t))2dt + Z 1 0 m2(t)dt < . It can be noted that this reasoning does not require in fact the continuity of K (boundedness would suffice) but, for simplicity, we will keep the continuity assumption as it is quite common in functional data analysis and not very restrictive. Besides, continuity will be needed below to prove, for example, Theorem 14 on asymptotic distribution. 1.2. Some basic notions on linear operators. Functions of operators Let (H, , ) be a real, separable, infinite-dimensional Hilbert space. In what follows, we will mostly consider the case H = L2[0, 1] but the basic ideas and definitions can be given On Mahalanobis Distance in Functional Settings in general. Denote by C the space of self-adjoint bounded linear operators on H; see, e.g., Gohberg et al. (2003) for definitions and background. Let us recall that for any linear bounded operator A : H H in C, the operator norm of A is defined by A op = sup x =1 Ax = sup x =1 | Ax, x |. If K C is compact then, from the Spectral Theorem, K can be expressed in the form j λj x, ej ej, (3) {ei} being an orthonormal system of eigenfunctions of K and {λi} the corresponding sequence of eigenvalues. Let C+ be the subset of positive operators in C, that is the subset of those operators A in C such that Ax, x 0, for all x. Note that every compact operator in C+ has necessarily non-negative eigenvalues. If K C is compact, the spectrum of K is defined by σ(K) = {λj : λj is an eigenvalue of K} {0}. If K has a spectral representation (3) and f is a real function defined on a real interval containing σ(K), bounded on σ(K), we define the operator f(K) by f(K)(x) = f(0)P0x + X j f(λj) x, ej ej, where P0x denotes the orthogonal projection of x on the kernel of K. For these operators we have f(K) op = sup j |f(λj)|, see, e.g., Gohberg et al. (2003, Ch. 8). Some examples of functions f(K) to be considered below are as follows. (i) If K C+ is compact and injective, then the square root of K, denoted K1/2, is defined by K1/2(x) = P j λ1/2 j x, ej ej. Note that the name square root is justified since K1/2K1/2 = K. In fact, A1/2 can be defined as well, in a natural way, even if A is not compact; see, for example, Debnath and Mikiusinski (2005, p. 173) for a definition of the square root operator when A is a bounded positive operator. (ii) We will also need to use operator functions f(A) in some examples where A is not necessarily compact but it can be expressed in the form Ax = P j µj x, ej ej, for some (bounded) sequence of constants {µj} and some orthonormal system {ej}. Then, we can define f(A) (if f is a function bounded on {µj}) by f(A)(x) = P j f(µj) x, ej ej (iii) A particular case arises when A = K + h I, where K is a compact, injective operator, K C+, h is a real number and I is the identity operator. Since K is injective there is an orthonormal basis {ej} of H, made of eigenvectors of K. Thus, if {λj} are the corresponding eigenvalues, we have, for all x H, (K + h I)(x) = P j(λj + h) x, ej ej and (K + h I)1/2(x) = P j(λj + h)1/2 .x, ej ej. Similarly, the positive part of K + h I is defined by (K + h I)+(x) = P j(λj + h)+ x, ej ej, where f(t) := t+ = max{t, 0} is the usual positive part function, defined for real numbers. Berrendero, Bueno-Larraz and Cuevas 1.3. On the difficulties of defining a functional Mahalanobis-type distance The aim of this paper is to extend the notion of the multivariate (finite-dimensional) Mahalanobis distance (1) to the functional case when x, y L2[0, 1]. Clearly, in view of (1), the inverse K 1 of the functional operator K should play some role in this extension if we want to keep a close analogy with the multivariate case. Unfortunately, such a direct approach utterly fails since, typically, K is not invertible in general as an operator, in the sense that there is no linear continuous operator K 1 such that K 1K = KK 1 = I, the identity operator. This is a well-known consequence of the fact that, when K(s, t) is a continuous function, the operator K defined in (2) is compact, that is, it takes bounded sets to relative compact sets, which entails the non-invertibility. It is interesting to see this from the spectral point of view, starting again with the finite-dimensional case. Indeed, some elementary linear algebra yields the following representations for Σx, Σ 1x and the squared Mahalanobis distance M(x, y)2, for x, y Rd: j=1 λj(e jx)ej, Σ 1x = 1 λj (e jx)ej, Σ 1/2x = λj (e jx)ej, M(x, y)2 = (x y) Σ 1/2Σ 1/2(x y) = 1 λj ((x y) ej)2, (4) where λ1, . . . , λd are the, strictly positive, eigenvalues of Σ and {e1, . . . , ed} the corresponding orthonormal basis of eigenvectors. In the functional case, the expression (3) of K in terms of the Spectral Theorem, would suggest an expression for the inverse operator of type 1 λj x, ej ej. (5) However, it is easy to check that R 1 0 R 1 0 K(s, t)2dsdt = P j=1 λ2 j so that, in particular, the sequence {λj} converges to zero very quickly. As a consequence, there is no hope of keeping a direct analogy with (4) and the very reason for this is the non-invertibility of the covariance operator. See more details on this in the next subsection. Nevertheless, as we will see, the failed definition (5) gives some clues in order to extend Mahalanobis idea to the infinite-dimensional case. 1.4. Some previous proposals Motivated by the finite-dimensional spectral version (4) of the Mahalanobis distance, Galeano et al. (2015) have proposed the following definition of the functional Mahalanobis distance. dk FM(x, y) = which is well-defined for any x, y L2[0, 1] at the expense of introducing a sort of tuning parameter k N. We keep the notation dk FM used in Galeano et al. (2015). Let us note that On Mahalanobis Distance in Functional Settings dk FM(x, y) is a semi-distance, since it lacks the identifiability condition dk FH(x, y) = 0 x = y. The applications of dk FM considered by these authors focus mainly on supervised classification. While this proposal is quite simple and natural, its theoretical motivation presents some difficulties. The most important one is the fact that the series is divergent, with probability one, whenever x and y are independent trajectories from a Gaussian process with covariance function K and the mean function m belongs to the Reproducing Kernel Hilbert Space associated with K (see, e.g. Luki c and Beder, 2001, Cor. 7.1); the same holds if y is replaced with the mean function m L2[0, 1]. So, (6) is defined in terms of the k-th partial sum of a divergent series. As a consequence, one may expect that the definition might be strongly influenced by the choice of k. As we will discuss below, in practice this effect is not noticed if x and y are replaced with certain appropriate smoothed approximations but, in that case, the smoothing procedure should be incorporated to the definition. Another recent proposal is due to Ghiglietti et al. (2017). The idea is also to modify the template (7) to deal with the convergence issues. In this case, the suggested definition is eλjc g(c; p)dc where p > 0 and g(c; p) is a weight function such that g(0; p) = 1, g is non-increasing and non-negative and R 0 g(c; p)dc = p. Moreover, for any c > 0, g(c; p) is assumed to be non-decreasing in p with limp g(c; p) = 1. This definition does not suffer from any problem derived from divergence but, still, it depends on two smoothing functions: the exponential in the denominator of (8) and the weighting function g(c; p). As pointed out also in Ghiglietti et al. (2017), a more convenient expression for (8) is given by the following weighted version of the template, formal definition (7), where hj(p) = R 0 λje λjcg(c; p)dc. The applications of (8) offered in Ghiglietti et al. (2017) and Ghiglietti and Paganoni (2017) deal with hypotheses testing for two-sample problems of type H0 : m1 = m2. All in all, both proposals, Galeano et al. (2015) and Ghiglietti et al. (2017), are natural extensions to the functional case of the multivariate notion (1). Moreover, as suggested by the practical examples considered in both works, these options performed quite well in many cases. However, we believe that there is still some room to further explore the subject for the reasons we will explain below. Berrendero, Bueno-Larraz and Cuevas 1.5. The organization on this work In the next section some theory of RKHS and its connection with the Mahalanobis distance is introduced, together with the proposed definition. In Section 3 some properties of the proposed distance are presented and compared with those of the original multivariate definition. Then, a consistent estimator of the distance is analyzed in Section 4 and convergence rates are obtained. Finally, some numerical outputs corresponding to different statistical applications can be found in Section 5, along with some final remarks in Section 6. 2. A new definition of Mahalanobis distance for functional data In this section we will propose a further definition of a Mahalanobis-type distance, denoted Mα (as it will depend on a tuning parameter α). Its most relevant features can be summarized as follows: i) Mα is also inspired in the natural template (7). The serious convergence issues appearing in (7) are solved by smoothing. ii) Mα depends on a single, real, easy to interpret smoothing parameter α whose choice is not critical, in the sense that the distance has some stability with respect to α. Hence, it is possible to think of a cross-validation or bootstrap-based choice of α. In particular, no auxiliary weight function is involved in the definition. iii) Mα(x, y) is a true metric which is defined for any given pair x, y of functions in L2[0, 1]. It shares some invariance properties with the finite-dimensional counterpart (1). iv) If m(t) = E(X(t)), the distribution of Mα(X, m) is explicitly known for Gaussian processes. In particular, E(M2 α(X, m)) and Var(M2 α(X, m)) have explicit, relatively simple expressions. v) The distance Mα(X, m) can be consistently estimated from a random sample and convergence rates can be obtained under some mild conditions. The main contribution of this paper is to show that the theory of Reproducing Kernel Hilbert Spaces (RKHS) provides a natural and useful framework in order to propose an extension of the Mahalanobis distance to the functional setting. We refer to Berlinet and Thomas-Agnan (2004), Appendix F in Janson (1997), Sch olkopf and Smola (2002), Hsing and Eubank (2015) for background on RKHS s. 2.1. A few, very basic, ideas on RKHS s There are several equivalent ways of approaching RKHS s. For our purposes, it is enough to recall that the RKHS associated with the covariance function K = K(s, t) is defined by H(K) := K1/2(L2[0, 1]) = {f L2[0, 1] : λj < }, (9) On Mahalanobis Distance in Functional Settings (recall that we assume λj > 0 for all j) and the corresponding inner product and norm are given by f, ej g, ej It can be seen that all finite linear combinations of type g(s) = Pp j=1 aj K(s, tj) are in H(K); in fact, H(K) could be defined as the family of such finite linear combinations plus the pointwise-limit of all Cauchy sequences made of functions of type g(s) = Pp j=1 aj K(s, tj). Note that, in particular, if g( ) = K( , t), g, ej = λjej(t), so that f, g K = f(t). This is the so-called reproducing property which accounts for the name of these spaces. Also, supt fn, K( , t) K = supt |fn(t)| fn K supt K( , t) K = fn K supt K(t, t)1/2. So, fn K 0 implies fn(t) 0 uniformly on t since K is assumed to be continuous. 2.2. The proposed definition First note that in the finite dimensional case, the RKHS associated with a non-singular covariance matrix is just Rd and the square norm is x 2 Σ = Pp j=1 (x ej)2 λj . Therefore, the spectral expression (5) of the square Mahalanobis distance in the finite-dimensional case can be formulated in terms of the corresponding RKHS square norm, that is, M2(x, y) = x y 2 Σ. As discussed above, the difficulty to extend this idea to the functional case lies in the fact that in general, the RKHS space H(K) is much smaller than the ambient space L2[0, 1] and, more importantly, the trajectories of a second-order stochastic process with continuous covariance function K = K(s, t) do not belong, with probability one, to H(K). This observation suggests us the simple strategy we will follow here: given two functions x, y L2[0, 1], just approximate them by two other functions xα, yα H(K) and calculate the distance xα yα K. It only remains to decide how to obtain the RKHS approximations xα and yα. One could think of taking xα as the closest function to x in H(K) but this approach also fails since H(K) is dense in L2[0, 1] whenever all λj are strictly greater than zero (Remark 4.9 of Cucker and Zhou, 2007). Thus, every function x L2[0, 1] can be arbitrarily well approximated by functions in H(K) in such a way that for any given function x L2[0, 1] there is no function in H(K) closest to x. In other words, there is no projection of x over H(K) since this space is not closed. The situation is reminiscent to that arising in the penalization methods in nonparametric functional estimation: we might look instead for the function xα closest to x among those that are not too rough . In more formal terms: let us fix a penalization parameter α > 0. Given any x L2[0, 1], define xα = argmin f H(K) x f 2 + α f 2 K. (11) Whereas the term x f 2 controls closeness to x, the term α f 2 K accounts for the roughness of the approximation. As we will see below, the penalized projection xα is welldefined. In fact it admits a relatively simple closed form. Finally, the definition we propose for the functional α-Mahalanobis distance is as follows. Berrendero, Bueno-Larraz and Cuevas Definition 1 Let K be the covariance operator associated with the continuous covariance function K = K(s, t) = Cov(X(s), X(t)) of a process X = {X(t), t [0, 1]} with trajectories in L2[0, 1]. Assume that K is injective. Thus, all the eigenvalues of K are strictly positive. Given a constant α > 0 we define the α-Mahalanobis distance between two functions x and y in L2[0, 1] by Mα(x, y) = xα yα K, (12) where xα is defined as the unique solution of (11) (see Proposition 2 below) and K is the RKHS norm defined in (10). Let us note that the original Mahalanobis distance as well as our adapted functional version (12) are both intrinsically dependent on the covariance operator K. Therefore, (12) is naturally aimed at measuring either the statistical distance between two observations (trajectories) x = x(t) and y = y(t) of a process with covariance operator K or, more commonly, the distance between a trajectory x = x(t) and the mean function m = m(t). In the later case, the notation Mα(x, m) will be used. As mentioned, given a realization x of the stochastic process, we have relatively simple expressions for both the smoothed trajectory xα and the proposed distance. Such expressions are given in the following result. Proposition 2 In the setting of Definition 1 the smoothed trajectories xα defined in (11) satisfy the following basic properties: (a) Let I be the identity operator on L2[0, 1]. Then, K + αI is invertible and xα = (K + αI) 1 Kx = λj λj + α x, ej ej, (13) where λj, j = 1, 2, are the eigenvalues of K and ej stands for the unit eigenfunction of K corresponding to λj. (b) Denoting as K1/2 the square root operator, the norm of xα in H(K) satisfies λj (λj + α)2 x, ej 2 = K1/2(K + αI) 1x 2, (14) and, therefore, Mα(x, y)2 = P j=1 λj (λj+α)2 x y, ej 2. Proof (a) From Spectral Theorem for compact and self-adjoint operators (see, e.g., Theorem 2 of Chapter 2 in Cucker and Smale, 2001), Kx = P j λj x, ej ej. Thus, (K + αI) x = j=1 (λj + α) y, ej ej and (K + αI) 1 y = 1 λj + α y, ej ej. (15) The expression xα = (K + αI) 1 Kx for the solution of the minimization problem (11) follows directly from Theorem 8.4 (i) in Cucker and Zhou (2007, p. 136); note that K 1/2(f) On Mahalanobis Distance in Functional Settings is well defined for f H(K), and f K = K 1/2(f) . Now, expression (13) readily follows by replacing y with Kx in the second equation of (15). (b) From (10) and (13), λj (λj + α)2 x, ej 2. Moreover, from the Spectral Theorem K1/2(x) = P i=1 λi x, ei ei. Then, using (15) K1/2(K + αI) 1x = λj λj + α x, ej ej. By Parseval s identity we get K1/2(K + αI) 1x 2 = xα 2 K, as desired. Corollary 3 The expression Mα given in (12) defines a metric in L2[0, 1]. Proof This result is a direct consequence of Proposition 2. Indeed, from expression (13), the transformation x 7 xα form L2[0, 1] to H(K) in injective (since the coefficients x, ej completely determine x). Now, the result follows from the fact that K is a norm. Remark 4 The expression xα = (K + αI) 1 Kx obtained in the first part of Proposition 2 has an interesting intuitive meaning: the transformation x 7 Kx takes first the function x L2[0, 1] to the space H(K), made of much nicer functions, with Fourier coefficients x, ej converging quickly to zero, since we must have P j=1 x, ej 2/λj < ; see (10). Then, after this smoothing step , we perform an approximation step by applying the inverse operator (K + αI) 1, in order the get, as a final output, a function xα that is both, close to x and smoother than x. On the other hand, the expressions (11) and (13) clearly relate our approach with the general methodology of Tikhonov regularization; see more details on this in Section 6 below. 3. Some properties of the functional Mahalanobis distance In this section we analyze in detail and prove some of the features of Mα we have anticipated above. We assume throughout that X = X(t) fulfils the conditions indicated in the statement of Definition 1. 3.1. Invariance In the finite dimensional case, one appealing property of the Mahalanobis distance is the fact that it does not change if we apply a non-singular linear transformation to the data. Then, the invariance for a large class of linear operators appears also as a desirable property for any extension of the Mahalanobis distance to the functional case. Here, we will prove Berrendero, Bueno-Larraz and Cuevas invariance with respect to operators preserving the norm. We recall that a bounded linear operator L is an isometry if it maps L2[0, 1] to L2[0, 1] and f = Lf . In this case, it holds L L = I, where L stands for the adjoint of L. Theorem 5 Let L be an isometry on L2[0, 1]. Then, Mα(x, y) = Mα(Lx, Ly) for all α > 0, x, y L2[0, 1], where Mα was defined in (12). Proof Let KL be the covariance operator of the process LX. The first step of the proof is to show that KL = LKL . It is enough to prove that for all f, g L2[0, 1], it holds KLf, g = LKL f, g . Observe that KLf, g = Z 1 0 KLf(t)g(t)dt = Z 1 0 E[(LX(s) Lm(s))(LX(t) Lm(t))]f(s)g(t)dsdt. Then, using Fubini s theorem and the definition of the adjoint operator, KLf, g = E L(X m), f L(X m), g = E X m, L f X m, L g . Analogously, we also have LKL f, g = KL f, L g = E X m, L f X m, L g . From the last two equations we conclude KL = LKL . The second step of the proof is to observe that the eigenvalues λj of KL are the same as those of K, and the unit eigenfunction uj of KL for the eigenvalue λj is given by vj = Lej, where ej is the unit eigenfunction corresponding to λj. Indeed, using L L = I we have KLvj = LKL vj = LKL Lej = λj Lej = λjvj, j = 1, 2, . . . Then, by (14) and using that L is an isometry, Mα(Lx, Ly)2 = (Lx Ly)α 2 KL = λj (λj + α)2 Lx Ly, Lej 2 λj (λj + α)2 x y, ej 2 = Mα(x, y)2. The family of isometries on L2[0, 1] contains some interesting examples such as symmetries and changes between orthonormal bases. Thus, this distance does not depend on the basis on which the data are represented. On Mahalanobis Distance in Functional Settings 3.1.1. On the effect of regularization on invariance properties. We have shown that our proposed functional Mahalanobis distance is invariant under isometries. However, unlike the finite-dimensional case, it is not invariant for bounded operators in general. The reason for this different behavior is the need for regularization (which precludes general invariance) together with the fact that the class of bounded linear operators is too broad in the case of infinite-dimensional Hilbert spaces. To shed some light on these issues, let us consider the Mahalanobis distance changes under a particular type of bounded operators, namely, those positive compact self-adjoint bounded operators that commute with K. It turns out that these are precisely the operators that can be diagonalized simultaneously with K (see Theorem 1.1 in Gohberg et al., 2003, p. 243). This means that they have the same eigenfunctions as K with different positive eigenvalues. In the finite-dimensional case these transformations would correspond to nonuniform changes of scale along the principal directions (that is, eigenvectors). To be more precise, let L be a positive semidefinite compact self-adjoint bounded operator that commutes with K. Then, as in Theorem 1 we have that KL = LKL, where KL is the covariance operator of the process LX. Now, if µ1 µ2 are the (non-negative) eigenvalues of L with unit eigenfunctions e1, e2, . . . (which are the same as those of K), it holds j = 1, 2, . . . KLej = LKLej = λjµ2 jej. That is, ej is a unit eigenfunction of KL and its corresponding eigenvalue is λjµ2 j. Using Proposition 1(b) together with Spectral Theorem for compact self-adjoint operators we readily obtain Mα(Lx, Lm)2 = λjµ4 j (λjµ2 j + α)2 x m, ej 2, It is very easy to see that, for all j = 1, 2, . . . λjµ4 j (λjµ2 j + α)2 < λj (λj + α)2 µj < 1. In particular, if L is such that it shrinks x along all the principal directions, then Mα(Lx, Lm) Mα(x, m), for all α > 0. The above discussion shows how regularization (associated with values α > 0) is an obstacle for invariance: observe that in the finite-dimensional case, where the sum is finite and α = 0, Mahalanobis distance would remain unchanged. 3.2. Distribution for Gaussian processes It is a well known fact that the squared Mahalanobis distance to the mean for finite dimensional Gaussian data has a χ2 distribution with d degrees of freedom, where d is the dimension of the data. In the functional case, the distribution of Mα(X, m)2 for a Gaussian process X equals that of an infinite linear combination of independent χ2 1 random variables. We prove this fact in the following result and its corollary, and also give explicit expressions for the expectation and the variance of Mα(X, m)2. Proposition 6 Let {X(t), t [0, 1]} be an L2-Gaussian process with continuous covariance function K such that the corresponding covariance operator K is injective. Let λ1, λ2, be the eigenvalues of K and let e1, e2, . . . be the corresponding unit eigenfunctions. Berrendero, Bueno-Larraz and Cuevas (a) The squared Mahalanobis distance satisfies, for X = X(t) and for all f L2[0, 1], Mα(X, f)2 = (X f)α 2 K = j=1 βj Yj, (16) where βj = λ2 j(λj +α) 2 and Yj, j = 1, 2, , are non-central χ2 1(γj) random variables with non-centrality parameter γj = µ2 j/λj, with µj := m f, ej . (b) We have for X = X(t) and for all f L2[0, 1], E Mα(X, f)2 = λ2 j (λj + α)2 1 + µ2 j λj Var Mα(X, f)2 = 2 λ4 j (λj + α)4 1 + 2µ2 j λj Proof (a) Using (14), (X f)α 2 K = P j=1 βj Yj, where βj = λ2 j(λj + α) 2 and Yj = λ 1 j X f, ej 2. Since the process is Gaussian the variables λ 1/2 j X f, ej are independent with normal distribution, mean λ 1/2 j µj and variance 1 (see Ash and Gardner, 1975, p. 40). The result follows. (b) It is easy to see that the partial sums in (16) form a sub-martingale with respect to the natural filtration σ(Y1, . . . , YN). Indeed, j=1 βj Yj Y1, . . . , YN = βN+1E(YN+1) + Moreover, if λ := supj λj, which is always finite, sup N E N+1 X j=1 βj Yj = λj(λj + µ2 j) (λj + α)2 λ α2 j=1 µ2 j < , because m L2[0, 1] and P j=1 λj = R 1 0 K(t, t)dt < (see e.g. Cucker and Smale, 2001, Corollary 3, p. 34). Now, Doob s convergence Theorem implies PN j=1 βj Yj P j=1 βj Yj a.s. as N , and Monotone Convergence Theorem yields the expression for the expectation of Mα(X, f)2. The proof for the variance is fairly similar. Using Jensen s inequality, we deduce j=1 βj (Yj EYj) 2 Y1, . . . , YN j=1 βj (Yj EYj) On Mahalanobis Distance in Functional Settings Moreover, since the variables Yj are independent, j=1 βj (Yj EYj) j=1 β2 j Var(Yj) = 2 λ3 j(λj + 2µ2 j) (λj + α)4 j=1 µ2 j < . Then, PN+1 j=1 βj (Yj EYj) 2 P j=1 βj (Yj EYj) 2 a.s., as N , and using Monotone Convergence Theorem, Var Mα(X, f)2 = lim N Var N X j=1 βj Yj = 2 λ4 j (λj + α)4 1 + 2µ2 j λj When we compute the squared Mahalanobis distance from X to the mean the expressions above simplify because µj = 0 for each j, and then we have the following corollary. Corollary 7 Under the same assumptions of Proposition 6, if m denotes the mean function, we have Mα(X, m)2 = P j=1 βj Yj, where βj = λ2 j(λj + α) 2 and Y1, Y2, . . . are independent χ2 1 random variables. Moreover, E Mα(X, m)2 = P j=1 λ2 j(λj + α) 2 and Var Mα(X, m)2 = 2 P j=1 λ4 j(λj + α) 4. 3.3. Stability with respect to the regularization parameter Our definition of distance depends on a regularization parameter α > 0. In this subsection we prove the continuity of Mα with respect to this tuning parameter α. The proof of the main result requires the following auxiliary lemma, which has been adapted from Corollary 8.3 in Gohberg et al. (2003), p. 71. Lemma 8 Let Aj : H H, j = 1, 2, . . ., be a sequence of bounded invertible operators on a Hilbert space H which converges in norm op to another operator A, and such that supj A 1 j op < . Then A is also invertible, and A 1 j A 1 op 0, as j . We will apply the preceding lemma in the proof of the following result. Proposition 9 Let αj be a sequence of positive real numbers such that αj α > 0, as j . Then, Xαj K Xα K a.s. as j . Proof Note that by Proposition 2(b), Eq (14), we have (by the sub-multiplicative property of the norm) Xαj K Xα K K1/2(K + αj I) 1X K1/2(K + αI) 1X K1/2 op (K + αj I) 1 (K + αI) 1 op X . Berrendero, Bueno-Larraz and Cuevas But (K + αj I) (K + αI) op = |αj α| 0, as j , and supj (K + αj I) 1 op supj α 1 j < (see Gohberg et al., 2003, (1.14), p. 228). Therefore, (K + αj I) 1 (K + αI) 1 op 0, as j , by Lemma 8. Observe that Proposition 9 implies the pointwise convergence of the sequence of distribution functions of Mαj(X, m) to that of Mα(X, m). This fact in turn implies the pointwise convergence of the corresponding quantile functions. 4. A consistent estimator of the functional Mahalanobis distance Again, throughout this section, we will assume that the process {X(t), t [0, 1]}, with trajectories in L2[0, 1], has a continuous covariance function K = K(s, t) and that the covariance operator K is injective (that is, all the eigenvalues are strictly positive). As mentioned in the introduction, the continuity of K entails E X 2 < and m L2[0, 1], where m(t) = E(X(t)). The problem considered in this section is as follows: given a sample X1(t), . . . , Xn(t) of realizations of the stochastic process X(t), we want to estimate the Mahalanobis distance between any trajectory of the process X and the mean function m in a consistent way. Let X(t) = n 1 Pn i=1 Xi(t) be the sample mean and let b K(s, t) = 1 i=1 (Xi(s) X(s))(Xi(t) X(t)) be the sample covariance function. The function b K defines the sample covariance operator b Kf( ) = R 1 0 b K( , t)f(t)dt. Define the following estimator for Mα(X, m): c Mα,n(X, X) := b Xα Xα b K, (17) where b Xα = ( b K + αI) 1 b KX and Xα = ( b K + αI) 1 b K X. The following lemma collects, for posterior reference, some asymptotic results on the estimation of the mean function and covariance operator. Although similar results can be found at several places in the literature (see, e.g., Hsing and Eubank, 2015, Th. 8.1.2) we include them here for the sake of completeness. Lemma 10 Under the conditions on {X(t), t [0, 1]} stated at the beginning of this section, we have X m 0, b K K L2([0,1] [0,1]) 0, and b K K op 0, a.s., as n . Proof Mourier s Strong Law of Large Numbers (see e.g. Theorem 4.5.2 in Laha and Rohatgi, 1979, p. 452) implies directly X m 0 since (E X )2 E X 2 < and L2[0, 1] is a separable Banach space. Consider the process Z(s, t) = X(s)X(t). Then, Z L2([0, 1] [0, 1]) and this is also a separable Banach space. Therefore, if Zi(s, t) = Xi(s)Xi(t), Z = n 1 Pn i=1 Zi(s, t), and mz(s, t) = E[X(s)X(t)], using again Mourier s SLLN we have Z mz L2([0,1] [0,1]) 0, a.s., n , On Mahalanobis Distance in Functional Settings and also, since b K(s, t) = Z(s, t) X(s) X(t), b K K HS 0 a.s., where b K K HS = b K K L2([0,1] [0,1]) stands for the Hilbert-Schmidt norm of the operator b K K. Finally, for any x L2[0, 1], ( b K K)x 2 = Z 1 0 b K(t, ) K(t, ), x 2dt x 2 b K K 2 L2([0,1] [0,1]). Thus, in particular, the operator norm is smaller than the Hilbert-Schmidt norm and we have b K K op b K K L2([0,1] [0,1]) 0 a.s. as n . Now, before going on with the proof of the consistency and convergence rates we need some, non-trivial, auxiliary results on the square root operator K1/2 which appears in the definition (9) of the RKHS, and will appear also in the proofs below. 4.1. On the square root operator We refer here to the notation and definitions included in Subsection 1.2 above. In particular, recall that we denote by C the family of bounder linear self-adjoint operators on (H, , ). Given two operators A, B C we will say that A B if and only if B A is non-negative, that is B A C+. The following proposition collects some, mostly elementary, operator inequalities which will be helpful below in the proofs of the results on consistency and convergence rates. The inequality in part (e) will be especially useful in what follows. Proposition 11 Let K0, K C+ be compact operators. Suppose that K is injective. Denote h0 = K0 K op. We have, (a) (K ϵI)+ K (K + ϵI), for all ϵ > 0. (b) (K h0I)+ K0 (K + h0I). (c) (K h0I)1/2 + K1/2 (K + h0I)1/2. Also, (K h0I)1/2 + K1/2 0 (K + h0I)1/2. (d) Let A, B, M C+ be operators such that A M B. Then, B M op B A op, and M A op B A op. (e) K1/2 0 K1/2 op 4 K0 K 1/2 op . Proof (a) The relation K (K+ϵI) is obvious. To see the other inequality we will use (3), together with the fact that, as K is injective, Spectral Theorem implies that the eigenvectors {ej} corresponding to the positive eigenvalues {λj} are in fact an orthonormal basis of H. Thus, for any x = P j x, ej ej, (K (K ϵI)+)x, x = P j (λj (λj ϵ)+) x, ej 2, and (λj (λj ϵ)+) is either λj (λj ϵ) = ϵ > 0, or λj, depending on whether λj > ϵ or λj ϵ, respectively. Berrendero, Bueno-Larraz and Cuevas (b) As h0 = sup x =1 | (K K0)x, x |, we have for any x with x = 1, (K + h0I K0)x, x = (K K0)x, x + h0 Ix, x = (K K0)x, x + h0 0, We thus conclude K0 K + h0I. Now, to prove the other inequality, let us first note that, we also have K h0I K0 (whose proof is similar to that of K0 K + h0I). Now, the first inequality in (b) follows from the following general property: Let A C, B C+, with Ax = X j λj x, ej ej ({ej} being an orthonormal basis). If A B, then A+ B. To prove this, define x = P j x jej, where x j = x, ej when λj > 0, x j = 0, otherwise and x = x x . We have A+x, x = Ax , x , so that Bx, x = Bx , x + Bx , x Ax , x = A+x, x . We thus conclude A+ B and the inequality (K h0I)+ K0 follows from (K h0I) K0. (c) This follows directly from (a) and (b), together with the operator monotonicity of the square root transformation for non-negative bounded self-adjoint operators; see Pedersen (1972). (d) The result is immediate since Bx, x Mx, x Ax, x 0. and B A op = sup x =1 (( Bx, x Ax, x )) . (e) Using the precedent results (b), (c) and (d), together with triangular inequality, we get K1/2 0 K1/2 op 2 (K + h0I)1/2 (K h0I)1/2 + op. (K + h0I)1/2x (K h0I)1/2 + x, x = X (λj + h0) q (λj h0)+ x, ej 2, and x 2 = P j x, ej 2 = 1, we conclude (K + h0I)1/2 (K h0I)1/2 + op sup j (λj + h0) q (λj + h0) (λj h0)+ p (λj + h0) + p (λj h0)+ 2h0 h1/2 0 = 2h1/2 0 . On Mahalanobis Distance in Functional Settings 4.2. Consistency Let us first establish a consistency statement for (17). As we will see the square root operator, K1/2, will play a relevant role throughout this subsection. Theorem 12 Under the assumptions and notation established at the beginning of Section 4, denote, as usual, by m the mean function of X, X the sample mean trajectory and n the sample size. Then, c Mα,n(X, X) Mα(X, m) a.s., as n . In fact, the result is true when measuring the distance between the mean and any function f in L2[0, 1], that is, c Mα,n(f, X) Mα(f, m) a.s., as n . Proof First note that c Mα,n(X, X) = b K1/2( b K + αI) 1(X X) . This follows from Proposition 2(b), Eq. (14). Therefore, c Mα,n(X, X) Mα(X, m) b K1/2( b K + αI) 1(X X) K1/2(K + αI) 1(X m) b K1/2( b K + αI) 1 op X m + b K1/2( b K + αI) 1 K1/2(K + αI) 1 op X m . By Lemma 10, X m goes to zero a.s. as n . As a consequence, it is enough to show that b K1/2( b K + αI) 1 K1/2(K + αI) 1 op 0 a.s. For that purpose, observe that b K1/2( b K + αI) 1 K1/2(K + αI) 1 op b K1/2( b K + αI) 1 b K1/2(K + αI) 1 op + b K1/2(K + αI) 1 K1/2(K + αI) 1 op b K1/2 op ( b K + αI) 1 (K + αI) 1 op + b K1/2 K1/2 op (K + αI) 1 op. Therefore, to end the proof we will show that b K1/2 K1/2 op 0 a.s. as n and ( b K + αI) 1 (K + αI) 1 op 0 a.s. as n . From Lemma 10, b K K op 0 a.s. Moreover, using part (e) of Proposition 11 with K0 = b K we also get b K1/2 K1/2 op 0, a.s., as n . Finally, observe that b K K op = ( b K + αI) (K + αI) op 0 a.s., and we also have supn ( b K + αI) 1 op α 1 < , see Gohberg et al. (2003, eq. (1.14), p. 228). Then, Lemma 8 implies ( b K + αI) 1 (K + αI) 1 op 0 a.s. as n . The statement for a given f readily follows along the same lines. 4.3. Asymptotic distribution Putting together Theorem 12 and Proposition 6 we get the asymptotic distribution of c Mα,n. Corollary 13 Under the assumptions and notation of Theorem 12, c Mα,n(X, X) converges in distribution to P j=1 βj Yj, where βj = λ2 j(λj + α) 2 and Y1, Y2, . . . are independent χ2 1 random variables. We can also prove another asymptotic distribution result involving the distances between the sample and the population means, which could be useful to perform inference on the mean. Berrendero, Bueno-Larraz and Cuevas Theorem 14 Under the assumptions and notation established at the beginning of Section 4, we have n c Mα,n( X, m) d X λ2 j (λj + α)2 Yj 1 where Y1, Y2, . . . are independent χ2 1 random variables. Proof We can rewrite the left-hand side of Equation (18) as, n c Mα,n( X, m) = n(c Mα,n( X, m) Mα( X, m)) + n Mα( X, m). (19) Now, from Equation (12) and Proposition 2, we have n|c Mα,n( X, m) Mα( X, m)| n b K1/2( b K + αI) 1( X m) K1/2(K + αI) 1( X m) b K1/2( b K + αI) 1 K1/2(K + αI) 1 op n( X m) . As a part of the proof of Theorem 12 we have seen that the first norm in the right-hand side goes to zero a.s. as n . From the Functional Central Limit Theorem (see e.g., Theorem 7.5.1 in Laha and Rohatgi, 1979), n( X m) converges in distribution in L2[0, 1] to a Gaussian stochastic process Z with zero mean and covariance operator K. Since the norm is a continuous function in this space, by the continuous mapping theorem the second term converges in distribution to the random variable Z . Thus, by Slutsky s theorem, the distribution of the product goes to zero, and this convergence holds also in probability since the limit is a constant. We can rewrite the remaining term of Equation (19) as, n Mα( X, m) = n K1/2(K + αI) 1( X m) = n 1 i=1 χα,i µα , where we denote χα,i = K1/2(K+αI) 1Xi and µα = K1/2(K+αI) 1m. Since K1/2(K+αI) 1 is a bounded linear operator and the process X is Bochner-integrable (E X < ), the expectation and the operator commute, that is, Eχα,1 = E[K1/2(K + αI) 1X1] = K1/2(K + αI) 1EX1 = µα. Therefore, we can use again the Functional Central Limit Theorem with χα,i and µα, since E χα,1 2 K1/2(K + αI) 1 2 op E X1 2 < , which gives us that n Mα( X, m) converges in distribution to ξ , ξ being a random element with zero mean and whose covariance operator is the same as that of χα,1. Using the same reasoning as at the beginning of the proof of Theorem 5 and denoting as A the adjoint of the operator A, the covariance operator of χα,1 is given by K1/2(K + αI) 1K[K1/2(K + αI) 1] = K1/2(K + αI) 1K(K + αI) 1K1/2, since both K1/2 and (K + αI) 1 are self-adjoint operators (for instance, Theorem 3.35 and Problem 3.32 of Kato, 2013, Chapter 5 and Proposition 2.4 of Conway, 1990, Chapter X). On Mahalanobis Distance in Functional Settings Now ξ is a zero-mean Gaussian process with a compact covariance operator which has the same eigenfunctions as K and its eigenvalues are λ2 j(λj + α) 2. Thus, using its Karhunen Lo eve representation we get j=1 Zjej = X where ej are the eigenfunctions of K and Zj are independent Gaussian random variables with zero mean and variances λ2 j(λj +α) 2 (the eigenvalues of the covariance operator of ξ). Then the result follows from the standardization of these Zj, applying Slutsky s theorem to the sum of Equation (19). 4.4. Convergence rates Convergence rates, in probability, can be obtained for the distance estimator using part (e) of Proposition 11. Theorem 15 Under the assumptions established at the beginning of Section 4 and assuming in addition that E X 4 < , we have |c Mα,n(X, X) Mα(X, m)| = OP (n 1/4) as n . (20) where OP stands here for convergence order in probability. Proof At the beginning of the proof of Theorem 12, we got c Mα,n(X, X) Mα(X, m) b K1/2( b K + αI) 1 op X m + b K1/2( b K + αI) 1 K1/2(K + αI) 1 op X m . (21) The first term in the right hand side of (21) is the product of two random variables, say U1 and U2. From the a.s. convergence of b K towards K, we know that U1 converges a.s. to K1/2(K + αI) 1 op. Regarding the second factor U2 = X m , we have E(U2 2 ) = 1 n E( X m 2); indeed, note that X m = Pn i=1 Zi/n where the Zi = Xi m are iid random elements with the same distribution as X m. Therefore, as we have E X 2 < (which entails E X m 2 < ) we conclude that U2 is OP (n 1/2) as n . Therefore the global convergence rate for the first term in the right-hand side of (21) is of type OP (n 1/2). Now, to handle the second term in the upper bound of (21) note that b K1/2( b K + αI) 1 K1/2(K + αI) 1 op b K1/2 op ( b K + αI) 1 (K + αI) 1 op + b K1/2 K1/2 op (K + αI) 1 op := W1 + W2. Regarding W1, we first note b K1/2 op K1/2 op, a.s., since b K1/2 K1/2 op 0, a.s. Now, for ( b K + αI) 1 (K + αI) 1 op we use the following inequality (see Gohberg et al., 2003, Cor. 8.2, p. 71): if A is an invertible bounded linear operator and An is a sequence Berrendero, Bueno-Larraz and Cuevas of bounded linear operators such that An A op 0, then the An are also (eventually) invertible and A 1 n A 1 op A 1 2 op An A op 1 A 1 op An A op , so that, in particular, A 1 n A 1 op = O ( An A op) , as n . Using this result for A = K+αI, An = b K+αI, we get W1 = O b K K op , a.s. But according to Theorem 8.1.2 in Hsing and Eubank (2015) we have, under the assumption E X 4 < , that n( b K K) converges in law (with respect to the Hilbert-Schmidt norm, HS) to a Gaussian element with mean 0. In particular, n( b K K) must be bounded in probability (with respect to HS) so that b K K op = OP (n 1/2), since op HS. So, we conclude that W1 is OP (n 1/2), as n . As for the term W2 we use part (e) in Proposition 11, with K0 = b K. This gives b K1/2 K1/2 op 4 b K K 1/2 op , so that, finally (using again Theorem 8.1.2 in Hsing and Eubank, 2015) we get that W2 is OP (n 1/4). Collecting the convergence orders obtained for U2, W1 and W2 we get (20). 5. Statistical applications The purpose of this section is to give a general overview of possible applications of the proposed distance by analyzing its practical performance under various simulation scenarios and real data examples . The selected models and examples have been mostly chosen among those previously proposed in the literature. However, as usual in empirical studies, many other meaningful scenarios could be considered. Thus we make no attempt to reach any definitive conclusion. Only the long term practitioners experience will lead to a safer judgment. 5.1. Exploratory analysis The Mahalanobis distance can be used to analyze and summarize some interesting features of the data. Typically this is done, by example, via outlier detection and boxplots-type summaries of the data. We follow here the experimental setting proposed in Arribas-Gil and Romo (2014), where some real and simulated data sets are used for outliers detection and functional boxplots. 5.1.1. Outliers detection Our aim. We will measure the number of true positives and false positives of different methods when detecting functional outliers. We ran 100 simulations of each model with different contamination rates c = 0, 0.05, 0.1, 0.15 and 0.2. Methods. We replicate the simulation study proposed in Arribas-Gil and Romo (2014) with a selection of the methods presented there (we have excluded those methods that assumed some prior knowledge of the number of outliers). The details about the implementations of On Mahalanobis Distance in Functional Settings each method can be found in that paper. We have adapted the original code provided by the authors to include our method. In order to formally define what we exactly mean by an outlier with our method, we have approximated the distribution of the random variable Xα mα K given in Corollary 7 for Gaussian processes through a Monte Carlo sample of size 2000 where the Monte Carlo observations are generated using the covariance structure of the original data. Note that this procedure may present some drawbacks when the process is far from being Gaussian, as we will see when generating the boxplots. Then we mark as outliers the curves whose distance to the mean is greater that the 95% of the distances for the simulated data. A weak point of this method would be that the distribution of Corollary 7 is computed using the covariance structure of the data. Therefore, if the number of outliers is large compared with the sample size, this estimation is biased. In order to partially overcome this problem, we compute the covariance function using the robust minimum covariance determinant (MCD) estimator, see Rousseeuw and Driessen (1999). Regarding our proposal, we have noticed that the choice of α does not affect the number of selected outliers significantly. We have chosen α = 0.01, but an automatic technique (as the one proposed in Arribas-Gil and Romo, 2014 for the choice of the factor of the adjusted outliergram) could be used as well. Simulation models. The curves are generated using three different combinations of the main process (from which most trajectories are drawn) and the contamination one (from which the outliers come from). Given a contamination rate c, n c n curves are drawn from the main process and c n from the contamination one (we denote as x the smallest integer not smaller than x). The first model is defined by, main process: X(t) = 30t(1 t)3/2 + ε(t), contamination process: X(t) = 30t3/2(1 t) + ε(t), for t [0, 1], where ε is a Gaussian process with zero mean and covariance function K(s, t) = 0.3 exp( |s t|/0.3). The second model is given by, main process: X(t) = 4t + ε(t), contamination process: X(t) = 4t + ( 1)u1.8 + (0.02π) 1/2 e (t µ)2 0.02 + ε(t), where ε is a Gaussian process with zero mean and covariance function K(s, t) = exp( |s t|), u follows a Bernoulli distribution with parameter 0.5 and µ is uniformly distributed over [0.25, 0.75]. Finally, using the same definitions for ε and µ, the third model is given by, main process: X(t) = 4t + ε(t), contamination process: X(t) = 4t + 2 sin(4(t + µ)π) + ε(t). The sample size for each simulation was 100 and the curves are simulated in a discretized fashion over a grid of 50 equidistant points in [0, 1]. Berrendero, Bueno-Larraz and Cuevas Results. The proportions (pc and pf) of correctly and falsely detected outliers for each method on the different settings can be found in Table 1. We can see that the Mahalanobisbased method proposed in this paper (denoted Mah RKHS in the table) is quite competitive. 5.1.2. Boxplots Our aim. As a part of the exploratory analysis of the data, we include the functional boxplots of two real data sets used also in Arribas-Gil and Romo (2014). We do not include here the boxplots obtained with the other methods, which can be found in that paper. Our methodology. We use the proposed Mahalanobis distance to define a depth measure by (1 + M2 α(x, m)) 1, for a realization x of the process. Using this depth, we mark as the functional median the deepest curve of the set. The central band of the boxplot is built as the envelope of the 50% deepest curves, and the whiskers are constructed as the envelope of all the curves that are not marked as outliers. In order to detect the outliers we use the same procedure as before. However, the sample sizes now are too small to estimate robustly the covariance matrix over the grid, so we use the standard empirical covariance matrix. For the mortality data set we choose again the value α = 0.01. However, for the Berkeley data set the distribution of the distances seems far from Gaussian. Then, in this case the value of α has a greater effect on the number of curves marked as outliers with our procedure. In an attempt to overcome this problem, the parameter α is adjusted automatically in order to minimize an estimate of the KL divergence between the empirical distribution and the distribution for Gaussian processes. The selected values of α with this procedure are 0.089 for the female set and 0.1 for the male set. Data sets. The two data sets under study are: Male mortality rates in Australia 1901-2003: this data set can be found in the R package fds . It contains Australia male log mortality rates between 1901 and 2003, provided by the Australian Demographic Data Bank. Berkeley growth: this data set is available in the R package fda . It contains height measures of 54 girls and 39 boys, under the age of 18, at 31 fixed points. In Arribas-Gil and Romo (2014) the authors suggest to smooth the data, since the curves in the first set are very irregular. However, the distance proposed here has an intrinsic smoothing procedure, so we work directly with the original curves. Results. The curves marked as outliers for the male mortality set correspond to years 1919 (influenza epidemic) and 1999-2003, which are among the curves detected using other different proposals in Arribas-Gil and Romo (2014). The resulting boxplot for this data set can be found in Figure 1b, where the outliers are plotted in red. This figure includes also (on the left) a graphic representation of the depths: from green, the deepest curves, to ochre, the outer ones. The boxplots for the Berkely growth sets, female and male, are shown in Figure 2. Although now the smoothing parameter is adapted to the data set, the number of outliers detected is quite large when compared to the sample size. On Mahalanobis Distance in Functional Settings c= 0 Model 1 Model 2 Model 3 Method pc pf pc pf pc pf Fun. BP - 0.001 (0.003) - 0.001 (0.002) - 0.000 (0.002) Adj. Fun. BP - 0.007 (0.010) - 0.006 (0.010) - 0.007 (0.012) Rob. Mah. Dist. - 0.016 (0.014) - 0.015 (0.013) - 0.015 (0.015) ISE - 0.038 (0.020) - 0.032 (0.021) - 0.033 (0.021) DB weighting - 0.016 (0.012) - 0.015 (0.011) - 0.014 (0.011) Outliergram - 0.054 (0.025) - 0.057 (0.027) - 0.058 (0.022) Adj. Ourliergram - 0.012 (0.012) - 0.011 (0.013) - 0.011 (0.014) Mah. RKHS - 0.037 (0.015) - 0.033 (0.018) - 0.035 (0.016) c= 0.05 Model 1 Model 2 Model 3 Method pc pf pc pf pc pf Fun. BP 0.186 (0.193) 0.001 (0.003) 0.208 (0.220) 0.000 (0.001) 0.184 (0.179) 0.000 (0.002) Adj. Fun. BP 0.576 (0.282) 0.008 (0.012) 0.551 (0.330) 0.006 (0.010) 0.588 (0.344) 0.008 (0.012) Rob. Mah. Dist. 0.976 (0.096) 0.008 (0.009) 0.361 (0.250) 0.008 (0.010) 0.104 (0.153) 0.015 (0.013) ISE 0.865 (0.313) 0.033 (0.020) 1.000 (0.000) 0.038 (0.026) 1.000 (0.000) 0.033 (0.021) DB weighting 0.894 (0.259) 0.008 (0.009) 0.941 (0.203) 0.012 (0.011) 0.957 (0.168) 0.011 (0.009) Outliergram 0.998 (0.020) 0.038 (0.022) 0.998 (0.020) 0.033 (0.021) 1.000 (0.000) 0.036 (0.023) Adj. Ourliergram 0.994 (0.035) 0.006 (0.008) 0.978 (0.070) 0.006 (0.009) 0.998 (0.020) 0.012 (0.014) Mah. RKHS 0.998 (0.020) 0.022 (0.016) 1.000 (0.000) 0.027 (0.014) 1.000 (0.000) 0.031 (0.016) c= 0.1 Model 1 Model 2 Model 3 Method pc pf pc pf pc pf Fun. BP 0.139 (0.123) 0.000 (0.001) 0.158 (0.151) 0.000 (0.002) 0.134 (0.128) 0.000 (0.002) Adj. Fun. BP 0.549 (0.239) 0.005 (0.008) 0.593 (0.268) 0.008 (0.010) 0.632 (0.248) 0.008 (0.012) Rob. Mah. Dist. 0.961 (0.105) 0.004 (0.007) 0.373 (0.170) 0.007 (0.009) 0.104 (0.108) 0.011 (0.014) ISE 0.790 (0.335) 0.027 (0.017) 1.000 (0.000) 0.036 (0.021) 1.000 (0.000) 0.033 (0.022) DB weighting 0.176 (0.247) 0.001 (0.004) 0.910 (0.232) 0.005 (0.008) 0.922 (0.258) 0.006 (0.008) Outliergram 0.981 (0.040) 0.020 (0.014) 0.998 (0.014) 0.018 (0.012) 1.000 (0.000) 0.020 (0.016) Adj. Ourliergram 0.897 (0.118) 0.006 (0.009) 0.971 (0.076) 0.006 (0.009) 1.000 (0.000) 0.007 (0.011) Mah. RKHS 0.767 (0.148) 0.014 (0.012) 1.000 (0.000) 0.014 (0.011) 0.995 (0.030) 0.015 (0.013) c= 0.15 Model 1 Model 2 Model 3 Method pc pf pc pf pc pf Fun. BP 0.098 (0.105) 0.000 (0.002) 0.114 (0.101) 0.000 (0.002) 0.134 (0.130) 0.000 (0.001) Adj. Fun. BP 0.494 (0.215) 0.006 (0.010) 0.550 (0.242) 0.006 (0.009) 0.584 (0.247) 0.006 (0.009) Rob. Mah. Dist. 0.927 (0.098) 0.001 (0.003) 0.324 (0.184) 0.004 (0.007) 0.152 (0.175) 0.005 (0.008) ISE 0.778 (0.349) 0.027 (0.018) 0.999 (0.007) 0.040 (0.029) 1.000 (0.000) 0.034 (0.023) DB weighting 0.020 (0.039) 0.001 (0.003) 0.659 (0.329) 0.002 (0.005) 0.634 (0.391) 0.002 (0.005) Outliergram 0.879 (0.137) 0.011 (0.012) 0.984 (0.043) 0.008 (0.011) 0.999 (0.013) 0.008 (0.009) Adj. Ourliergram 0.616 (0.220) 0.003 (0.007) 0.969 (0.099) 0.006 (0.008) 0.996 (0.019) 0.007 (0.010) Mah. RKHS 0.295 (0.122) 0.013 (0.011) 0.988 (0.052) 0.008 (0.009) 0.941 (0.167) 0.007 (0.009) c= 0.2 Model 1 Model 2 Model 3 Method pc pf pc pf pc pf Fun. BP 0.060 (0.090) 0.000 (0.001) 0.098 (0.104) 0.000 (0.001) 0.102 (0.094) 0.000 (0.000) Adj. Fun. BP 0.376 (0.226) 0.003 (0.006) 0.509 (0.205) 0.005 (0.009) 0.540 (0.227) 0.003 (0.006) Rob. Mah. Dist. 0.866 (0.167) 0.000 (0.002) 0.304 (0.171) 0.002 (0.005) 0.111 (0.118) 0.004 (0.007) ISE 0.513 (0.396) 0.031 (0.023) 0.997 (0.018) 0.047 (0.031) 0.999 (0.010) 0.028 (0.023) DB weighting 0.015 (0.025) 0.001 (0.004) 0.216 (0.228) 0.001 (0.003) 0.111 (0.179) 0.000 (0.002) Outliergram 0.356 (0.202) 0.002 (0.005) 0.894 (0.158) 0.001 (0.004) 0.959 (0.146) 0.001 (0.004) Adj. Ourliergram 0.248 (0.179) 0.001 (0.003) 0.959 (0.074) 0.004 (0.008) 0.999 (0.007) 0.008 (0.011) Mah. RKHS 0.141 (0.089) 0.012 (0.011) 0.945 (0.127) 0.005 (0.007) 0.749 (0.232) 0.006 (0.009) Table 1: Proportions of correctly and falsely detected outliers (standard errors appear in brackets) Berrendero, Bueno-Larraz and Cuevas (a) Depths of the curves (b) Boxplot and outliers Figure 1: Male mortality rates in Australia 1901-2003 (a) Depths of the curves (female) (b) Boxplot and outliers (female) (c) Depths of the curves (male) (d) Boxplot and outliers (male) Figure 2: Berkeley growth Female: 3, 8, 10, 13, 15, 18, 26, 29, 42, 43, 48 and 53. Male: 5, 10, 15, 27, 29, 32, 35 and 37. But if we look at the estimated density functions corresponding to the distribution of M2 α on each set (Figure 3), we can see that these distributions have two modes. In fact, all the curves marked as outliers are the ones that fall into the second mode (whose distance to the mean is greater that the red dotted line). This behavior is similar to that of the Integrated Squared Error shown in Arribas-Gil and Romo (2014). On Mahalanobis Distance in Functional Settings Figure 3: Estimated density functions of the distributions of M2 α for Berkeley growth. 5.2. Supervised classification The goals of the study. Mahalanobis distance can be also used as a tool for supervised classification. Mimicking a common criterion (see, e.g., Galeano et al., 2015, Section 3.3) we would classify the curve x to the population j {1, . . . , k} whenever M2 α(x, mj) 2 log πj = min 1 i k M2 α(x, mi) 2 log πi , where π1, . . . , πk denote the prior probabilities of the classes, which are assumed to be homoskedastic, and mj stands for the j-th mean function. In the examples below we will just consider the case of equal probabilities, which amounts to assign x to the population for which the Mahalanobis distance to the respective mean is smaller. The heteroskedastic case is also considered later on. Here we present two different examples of binary classification with same prior probabilities, for which the percentage of misclassified curves is measured. In order to check the performance of our proposal, we compare it with other classifiers presented below. The name used on the tables for each method is shown between brackets. Optimal Bayes classifier proposed in Dai et al. (2017) ( OB ). This is a functional extension of the classical multivariate Bayes classifier based on nonparametric estimators of the density functions corresponding to the main coefficients in Karhunen-Lo eve expansions. Here the curves are projected onto a common sequence of eigenfunctions, and the quotient in Bayes classifier is taken using the densities of these projections. The authors propose three approaches to estimate these densities. We have chosen the implementation which assumes that these densities are Gaussian since, according to their results, it seems to slightly outperform the others. The number of eigenfunctions used for the projections is fixed by cross-validation. Mahalanobis-based semidistance of Equation (6) proposed in Galeano et al. (2015) ( dk FM ). k-nearest neighbours with 3 and 5 neighbours ( knn3 and knn5 ). In spite of its simplicity, this method tends to show a good performance when dealing with functional data. Berrendero, Bueno-Larraz and Cuevas Our methodology. Our proposal is denoted as Mα . Now the parameter α is fixed by cross-validation, for α [10 4, 10 1]. For heterokedastic problems, we have implemented our binary classifier mimicking an improvement that is usually made in the multivariate context. In that finite setting, given two equiprobable populations with covariance matrices Σ0, Σ1, a curve x is assigned to class 1, according to the Quadratic Discriminant classifier, whenever M2 Σ0(x, x0) M2 Σ1(x, x1) > log |Σ1| where MΣ0, MΣ1 denote the finite dimensional Mahalanobis distances using the covariance matrices Σ0 and Σ1 respectively. The finite dimensional distance M is defined in (1) (see, for instance, Section 8.3.7 of Izenman, 2008). In most cases with multivariate data, classifying with this rule gives better results than merely classifying to class 1 when M2(x, x0) > M2(x, x1). In the case of functional data we will consider the following heuristic version of this idea. If m0, K0 and m1, K1 are the mean and covariance functions of each class, we classify x to class 1 if M2 α,K0(x, m0) M2 α,K1(x, m1) > C, and to 0 otherwise. These are the distances between the projections into H(K0) and H(K1), respectively. The constant C is computed as log((λ1 1 . . . λ1 10)/(λ0 1 . . . λ0 10)), where λ0 j, λ1 j, j = 1, . . . , 10, are the ten larger eigenvalues of K0 and K1, respectively. First case study: cut Brownian Motion vs. Brownian Bridge. The first problem under consideration is to distinguish between two cut versions of a standard Brownian Motion and a Brownian Bridge. By cut we mean to take the process X(t) on the interval t [0, T], T < 1. We know an explicit expression for the Bayes error of this problem, which depends on the cut point T. For the case of equal prior probabilities of the classes, which will be the case here, this Bayes error is given by, ( (1 T) log(1 T))1/2 (T(1 T))1/2 ( (1 T) log(1 T))1/2 where Φ stands for the distribution function of a standard Gaussian random variable. Since both processes are almost indistinguishable around zero, L 0.5 when T 0. Also L 0 when T 1, since then one can decide the class with no error just looking at the last point of the curve. The trajectories of both processes are shown in Figure 4 and the cut points considered, 0.75, 0.8125, 0.875, 0.9375 and 1, are marked with vertical dotted lines. For each class, 50 samples are drawn for training and 250 for test. The experiment is run 500 times for each cut point, and the trajectories are sampled over an equidistant grid in [0, 1] of size 50. Second case study: simulated data as in Dai et al. (2017). We have implemented also the experimental setting proposed in Dai et al. (2017). The authors consider three different scenarios. For the first two, the curves of both classes X(0) and X(1), are drawn from processes X(i)(t) = mi(t) + j=1 Aj,iφj(t) + ε, i = 0, 1, On Mahalanobis Distance in Functional Settings Figure 4: Trajectories of Brownian Motion and Bridge with cut points (vertical) where ε is a Gaussian variable with zero mean and variance 0.01. Function φj is the jth element in the Fourier basis, starting with φ1(t) = 1, φ2(t) = 2 cos(2πt), φ3(t) = 2 sin(2πt). For Scenario A, the coefficients Aj,0, Aj,1 are independent centered Gaussian variables. For Scenario B they are independent centered exponential random variables. Finally, in Scenario C the processes are X(i)(t) = mi(t) + Bi φj(t), i = 0, 1, where Aj,0, Aj,1 are the same as in Scenario B and B0, B1 are independent variables with common distribution χ2 30/30. Thus, in this latter case the coefficients of the basis expansion are dependent but uncorrelated. The means and the variances of the coefficients Aj,i, i = 0, 1, are changed in order to check the same and different scenarios for mean and variances. Then m0(t) = 0 always, and m1(t) is either 0 or t. In the same way, the variance of Aj,0 is always exp( j/3) and the variance of Aj,1 is either exp( j/3), or exp( j/2). The curves are sampled on 51 equidistant points in [0, 1]. The prior probabilities of both classes are set to 0.5 and two sample sizes, 50 and 100, are tested for training. For test we use 500 realizations of the processes. Each experiment is repeated 500 times. Results. Table 2 shows the percentages of misclassified curves, as well as the Bayes errors for the first data set. Our proposal and knn with 5 neighbors seem to outperform the other methods for this problem. The misclassification percentages for all the different scenarios of the second data set are shown in Table 3. Our proposal is mainly the winner, although in Scenario A it is overtaken by the Optimal Bayes classifier in the case of equal means and different variances. Also knn with 5 neighbors performs better sometimes in the case of different means and equal variances. Berrendero, Bueno-Larraz and Cuevas 6. Some final remarks In this section we include some final remarks and additional comments. 6.1. Sensitivity of the distance with respect to α and λi In Section 3.3 we proved that the proposed distance is continuous with respect to the smoothing parameter α. However the choice of this value may be of importance in some cases, specially in those cases where cross-validation procedures can not be applied. Then we will try to shed some light on the role this parameter plays in the distance. To simplify matters in the following discussion we consider the distance from a function x to the mean m. Let us begin with a simple, derivative-based, analysis of the squared distance, when considered as a function of α and λj, as defined in (12), Mα(x, m)2 = λj (λj + α)2 x m, ej 2. (22) Some informal conclusions can be drawn from the above formula as well as from the definition of xα: (a) In general, larger values of α generate smoother curves fα, for a given original function f. In order to gain some intuition on this, we show in Figure 5, the aspect of a smoothed Brownian trajectory xα for different values of α. (b) Of course, Mα(x, m)2 is a decreasing function of α. This means that when α gets larger, so providing increasingly smoother versions of x and m, the distance decreases to 0. Clearly a very large value of α would provide functions xα and mα too far away from the original unsmoothed versions x and m. On the opposite side, the choice α = 0 leads to the divergence problems addressed in Subsection 1.4. (c) The derivative (with respect to α) of the j-th coefficient g(λj, α) = λj (λj+α)2 in (22) is αg(λj, α) = 2λj (λj+α)3 whose absolute value is again a decreasing function of α. Since λj are usually unknown, and must be estimated, it is also interesting to consider the sensitivity of the coefficients g(λj, α) with respect to λj. Note that λj g(λj, α) = α λj (α+λj)3 so that the speed of variation (in absolute value) is minimized for λj = α. t Bayes Mα OB dk FM knn3 knn5 0.75 33.9 42.5 (3.5) 43.5 (2.5) 46.4 (3.2) 43.2 (2.8) 42.4 (2.8) 0.8125 30.8 40.0 (3.7) 41.9 (2.6) 44.8 (3.3) 41.0 (2.8) 40.1 (3.0) 0.875 26.9 36.1 (3.6) 40.2 (2.6) 42.6 (3.7) 38.0 (3.0) 36.9 (3.0) 0.9375 20.9 32.3 (3.1) 38.0 (2.8) 39.9 (3.5) 33.7 (2.7) 32.5 (2.7) 1 0.0 26.5 (2.8) 35.9 (2.9) 36.0 (3.5) 28.4 (2.7) 27.6 (2.7) Table 2: Percentage of misclassification for the cut versions of Brownian Motion Brownian Bridge On Mahalanobis Distance in Functional Settings Scenario A (Gaussian) n mean sd Mα OB dk FM knn3 knn5 50 same diff 35.9 (3.5) 19.0 (4.0) 47.0 (3.1) 45.6 (2.2) 46.2 (2.0) diff same 42.3 (3.8) 47.3 (6.8) 43.7 (3.7) 42.9 (3.6) 42.0 (3.6) diff diff 29.1 (5.0) 36.4 (10.1) 40.0 (5.4) 39.7 (3.0) 40.0 (3.1) 100 same diff 34.2 (3.0) 9.3 (2.1) 45.8 (3.5) 44.6 (1.9) 45.4 (1.8) diff same 34.6 (4.5) 45.1 (8.2) 37.0 (4.4) 42.1 (3.0) 41.0 (3.0) diff diff 22.0 (4.9) 35.7 (11.3) 34.2 (6.2) 38.3 (2.4) 38.6 (2.5) Scenario B (exponential) n mean sd Mα OB dk FM knn3 knn5 50 same diff 24.2 (5.2) 30.2 (10.4) 37.0 (6.6) 37.6 (2.6) 38.0 (2.7) diff same 41.8 (3.9) 49.1 (5.5) 42.3 (4.1) 38.0 (3.4) 37.2 (3.6) diff diff 14.3 (4.8) 31.8 (12.8) 25.1 (9.0) 24.7 (3.1) 25.1 (3.5) 100 same diff 16.9 (3.1) 24.0 (9.6) 28.2 (6.1) 35.3 (2.4) 35.7 (2.3) diff same 34.5 (4.6) 48.3 (5.9) 36.7 (4.2) 36.5 (2.8) 35.6 (2.7) diff diff 7.7 (2.9) 30.1 (13.4) 17.8 (6.3) 21.6 (2.4) 21.8 (2.6) Scenario C (dependent) n mean sd Mα OB dk FM knn3 knn5 50 same diff 30.0 (5.4) 33.3 (8.1) 40.1 (5.9) 39.9 (2.7) 39.9 (2.7) diff same 43.6 (4.1) 48.8 (4.8) 42.9 (4.2) 38.1 (3.6) 37.5 (3.8) diff diff 19.9 (4.9) 36.2 (11.0) 30.3 (7.7) 26.4 (3.1) 26.6 (3.3) 100 same diff 21.7 (3.0) 28.0 (7.5) 29.4 (5.7) 37.6 (2.4) 37.5 (2.4) diff same 38.0 (4.3) 48.8 (5.0) 38.9 (3.8) 36.5 (2.7) 35.6 (2.8) diff diff 13.3 (3.2) 34.6 (11.0) 23.2 (6.1) 23.4 (2.4) 23.3 (2.4) Table 3: Percentage of misclassification for the experimental setting of Dai et al. (2017). 6.2. Mahalanobis-based classifiers and optimality Our distance-based classifiers might be seen as a geometrically oriented proposal to functional classification. One might wonder under which circumstances these classifiers are close to the optimal (Bayes) classifier. It might be seen (see, e.g, Ba ıllo et al., 2011) that, under very general conditions, when the distribution P1 of X|Y = 1 is absolutely continuous with respect to that of X|Y = 0, denoted by P0, the Bayes rule is just g(x) = 1({d P1(x) d P0 > log 1 p p }), d P1(x) d P0 being the Radon-Nikodym derivative of P1 with respect to P0, p = P(Y = 1) and 1(A) the indicator function of the event A . Now, taking into account the results by Parzen (1961) concerning the explicit expression of d P1(x) d P0 in the Gaussian homoskedastic case (see Berrendero et al., 2018 for additional details) one might see that the optimal rule amounts to assign the observation x to population 1 whenever x m1, x m1 K x m2, x m2 K > log 1 p Here we must assume that P1 and P0 are Gaussian processes with continuous trajectories and a continuous covariance function K and the mean functions m0 and m1 belong both to the RKHS, H(K), associated with K. Also, it is important to note that the notation Berrendero, Bueno-Larraz and Cuevas Figure 5: Smoothed versions of a Brownian trajectory for different values of α x m1, x m1 K must be carefully interpreted in terms of the so-called Lo eve isometry since, with probability one, the trajectories x do not belong to H(K) so that x, K is not directly defined and, in fact, (23) is defined in terms of such isometry. As a consequence, xα m1, xα m1 K xα m2, xα m2 K > log 1 p (where , K is now the true inner product in H(K)) can be seen as an approximation to the optimal rule (23) when we replace the Lo eve isometry x m1, x m1 K with the smoothed approximation given by our estimated Mahalanobis-type distance. Note that, strictly speaking, Mα(x, mi)2 = xα miα, xα miα K = xα miα 2 K but, if we assume m1, m2 H(K), there is no need of considering the smoothed versions of these functions. 6.3. Connections with Tikhonov s regularization methodology Our proposal is closely related to standard regularization results developed to deal with ill-posed equations in functional analysis; see e.g. Kress (1989). Indeed, observe that we can write xα = (K + αI) 1Kx = K1/2(K + αI) 1K1/2x = K1/2Rαx, where Rαx := (K + αI) 1K1/2x is a Thikonov regularization operator (which yields a continuous approximation to the Moore-Penrose pseudoinverse of K1/2). Generally speaking, each regularization operator Rα is associated with a function q(α, λj) that downweights the effect of 1/ p λj in the spectral representation of xα. In the case of Thikonov regularization, q(α, λj) = λj/(λj + α). Thus, λj λj + α x, ej ej = λj x, ej ej, and therefore xα 2 K = K1/2Rαx 2 K = Rαx 2 = P j=1 q(α, λj)2/λj. On Mahalanobis Distance in Functional Settings In principle, to define the Mahalanobis distance one could select other regularization methods. For instance, the cut-offoperator is given by λj x, ej ej, which corresponds to q(α, λj) = 1, when λj α, and q(α, λj) = 0, when λj < α. For this choice, the regularization parameter determines the terms of the series that are retained. Other possibility is the so-called Landweber regularization operator, given by q(m, a, λj) = 1 (1 aλj)m+1, for 0 < a < min{λ 1 j : λj > 0}. In this case regularization depends on two parameters, a and m. We believe Tikhonov scheme provides a natural choice with good properties since it depends on just one regularization parameter (unlike Landweber method) and defines a metric in L2[0, 1] (unlike the cut-offapproach). Acknowledgments This work has been partially supported by Spanish Grant MTM2016-78751-P. The authors are grateful to Daniel Est evez and Dmitry Yakubovich for their help with operator theory. The constructive comments and suggestions from three anonymous reviewers are also gratefully acknowledged. Ana Arribas-Gil and Juan Romo. Shape outlier detection and visualization for functional data: the outliergram. Biostatistics, 15(4):603 619, 2014. Robert B. Ash and Melvin F. Gardner. Topics in Stochastic Processes. Academic Press, 1975. Amparo Ba ıllo, Antonio Cuevas, and Juan Antonio Cuesta-Albertos. Supervised classification for a family of Gaussian functional models. Scandinavian Journal of Statistics, 38 (3):480 498, 2011. Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic, 2004. Jos e R. Berrendero, Antonio Cuevas, and Jos e L. Torrecilla. On the use of reproducing kernel Hilbert spaces in functional classification. Journal of the American Statistical Association, 113(3):1210 1218, 2018. John B. Conway. A course in functional analysis. Springer, 1990. Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1 49, 2001. Felipe Cucker and Ding Xuan Zhou. Learning Theory: an Approximation Theory Viewpoint. Cambridge University Press, 2007. Berrendero, Bueno-Larraz and Cuevas Antonio Cuevas. A partial overview of the theory of statistics with functional data. Journal of Statistical Planning and Inference, 147:1 23, 2014. Xiongtao Dai, Hans-Georg M uller, and Fang Yao. Optimal Bayes classifiers for functional data and density ratios. Biometrika, 104(3):545 560, 2017. Lokenath Debnath and Piotr Mikiusinski. Introduction to Hilbert Spaces with Applications (3rd Ed.). Elsevier, 2005. Pedro Galeano, Esdras Joseph, and Rosa E Lillo. The Mahalanobis distance for functional data with applications to classification. Technometrics, 57:281 291, 2015. Andrea Ghiglietti and Anna Maria Paganoni. Exact tests for the means of Gaussian stochastic processes. Statistics & Probability Letters, 131:102 107, 2017. Andrea Ghiglietti, Francesca Ieva, and Anna Maria Paganoni. Statistical inference for stochastic processes: two-sample hypothesis tests. Journal of Statistical Planning and Inference, 180:49 68, 2017. Israel Gohberg, Seymour Goldberg, and Marinus Kaashoek. Basic Classes of Linear Operators. Birkh auser, 2003. Tailen Hsing and Randall Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. John Wiley & Sons, 2015. Alan J. Izenman. Modern Multivariate Statistical Techniques. Springer, 2008. Svante Janson. Gaussian Hilbert Spaces. Cambridge University Press, 1997. Tosio Kato. Perturbation Theory for Linear Operators. Springer, 2013. Rainer Kress. Linear Integral Equations. Springer, 1989. Radha G. Laha and Vijay K. Rohatgi. Probability Theory. John Wiley & Sons, 1979. Milan N. Luki c and Jay H. Beder. Stochastic processes with sample paths in reproducing kernel Hilbert spaces. Transactions of the American Mathematical Society, 353:3945 3969, 2001. Prasanta C. Mahalanobis. On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2:49 55, 1936. Kanti V. Mardia. Assessment of multinormality and the robustness of Hotelling s t2-test. Applied Statistics, pages 163 171, 1975. Emanuel Parzen. An approach to time series analysis. Journal of the American Statistical Association, 32:951 989, 1961. Gert K. Pedersen. Some operator monotone functions. Proceedings of the American Mathematical Society, 36:309 310, 1972. On Mahalanobis Distance in Functional Settings Kay I. Penny. Appropriate critical values when testing for a single multivariate outlier by using the Mahalanobis distance. Journal of the Royal Statistical Society. Series C (Applied Statistics), 45:73 81, 1996. Alvin C. Rencher. Methods of Multivariate Analysis, volume 3 ed. John Wiley & Sons, 2012. Peter J. Rousseeuw and Katrien van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212 223, 1999. Peter J. Rousseeuw and Bert C. van Zomeren. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411):633 639, 1990. Bernhard Sch olkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002. Yijun Zuo and Robert Serfling. General notions of statistical depth function. Annals of Statistics, 28:461 482, 2000.