# on_perfect_clustering_for_gaussian_processes__a49841a5.pdf Published in Transactions on Machine Learning Research (09/2023) On Perfect Clustering for Gaussian Processes Juan A. Cuesta-Albertos cuestaj@unican.es Departamento de Matemáticas, Estadística y Computación Universidad de Cantabria, Spain Subhajit Dutta duttas@iitk.ac.in Department of Mathematics and Statistics IIT Kanpur, India Reviewed on Open Review: https: // openreview. net/ forum? id= ig DOV2KBw M In this paper, we propose a data based transformation for infinite-dimensional Gaussian processes and derive its limit theorem. For a clustering problem using mixture models, an appropriate modification of this transformation asymptotically leads to perfect separation of the populations under rather general conditions, except the scenario in which differences between clusters depend only on the locations; in which case our procedure is useless. Theoretical properties related to label consistency are studied for the k-means clustering algorithm when used on this transformed data. Good empirical performance of the proposed methodology is demonstrated using simulated as well as benchmark data sets, when compared with some popular parametric and nonparametric methods for such functional data. 1 Introduction Suppose that we are given two Gaussian distributions (say, GDs) P1 and P2. The Hajek and Feldman property (established independently by Hajek (1958) and Feldman (1958)) states that P1 and P2 are either equivalent, or else mutually singular. In other words, for every measurable set A, P1(A) = 0 if and only if P2(A) = 0, or else there exist two disjoint measurable sets S1 and S2 such that P1(S1) = 1, P2(S1) = 0 and P1(S2) = 0, P2(S2) = 1. Mutual singularity is not very interesting in finite dimensions because it happens only when at least one of the covariance matrices is singular. However, in the functional setting, this singularity appears in non-trivial situations. To mention an example, it was shown by Rao and Varadarajan (1963) that if the covariance operators of P1 and P2, namely, ΣP1 and ΣP2 satisfy ΣP2 = aΣP1 for some a = 1, then P1 and P2 are mutually singular. It is clear that the mutually singular case of the Hajek and Feldman property (say, HFp) looks very promising for classification as well as clustering (in the mixture setting) of data points. Recently, some results have appeared taking advantage of this property to propose perfect classifiers (see the references given below). However, the clustering problem seems to be harder and as far as we know, Delaigle et al (2019) is the only available paper with results in this area. The main drawback of the paper by Delaigle et al (2019) is that it primarily deals with location problems (see Section M of Appendix II for a detailed discussion). In this paper, we present a family of transformations on functional data which allows one to identify some mutually singular situations. The transformed data are then used to obtain perfect clustering in the mixture setting. To give an overview of our main contributions, let us consider a Gaussian process (say, Z) defined on a bounded real interval, which without loss of generality, we identify with the unit interval [0, 1]. Further, Published in Transactions on Machine Learning Research (09/2023) assume that its trajectories belong to the Hilbert space of square integrable functions H, which is defined as follows: H : set of real functions f(t) with t [0, 1] such that Z 1 0 f 2(t)dt < . The inner product in H is f, g = Z 1 0 f(t)g(t)dt. The keystone of this paper is Theorem 2.1 in Section 2. It states that under appropriate assumptions, if b H, then the limit of a sequence of scaled Mahalanobis distances between some finite-dimensional projections of Z and b converges in probability to a non-random limit. Scaling is done using the dimension of the projection, and this convergence holds as the dimension goes to infinity. Practical interest of this result lies in the fact that the limit depends only on the distribution of Z (say, PZ). Therefore, Theorem 2.1 allows one to identify some cases in which GDs are mutually singular. In such scenarios, this result allows one to obtain perfect classification (also see Cuesta-Albertos and Dutta (2022)) as well as perfect clustering. We now explain this point a bit more precisely. Consider a probability distribution P such that P = PJ h=1 πh Ph, where 0 < πh < 1 with PJ h=1 πh = 1 and Ph are GDs on H for 1 h J. Additionally, assume P to be known, but the precise values of J, πh and Ph for 1 h J are unknown. Thus, PZ can be understood as a two-step procedure. In the first step, we randomly select one among the J possibilities with probabilities πh for 1 h J. Now, if Ph0 is selected in this step, then one chooses a random function using the probability Ph0 for h0 {1, . . . , J}. We will stick to this interpretation throughout the paper. According to this mixture model, every function Z P is in fact generated from one of the Ph s. Using Theorem 2.1, if we project the available functions on certain subspaces of increasing dimensions, then the proposed transformation leads to different limits (depending just on the Ph which generated each function). Thus, if we have a set of observations (with at least one observation from each Ph), in the limit we can identify the value of J as well as the subset of the observations produced by each Ph without any possibility of mistake. We call this fact perfect clustering, which in practice means that we can estimate J and the cluster assignment of all the observations correctly with a high probability when the dimension is large. We believe that HFp should have attracted the attention of researchers in classification and clustering for functional data, the orthogonality case apparently being more attractive because it would allow one to obtain perfect classification and perfect clustering. It took 50 years before the HFp was formally used in classification. To the best of our knowledge, the first paper using HFp in classification was Baillo et al (2011), where the authors derived a classification procedure using likelihood ratios. They focused on the equivalence case and hence, did not obtain perfect classification. Optimal classification of Gaussian processes (say, GPs) was analyzed in Torrecilla et al (2020) from the HFp viewpoint. Further, the optimal (Bayes ) classifier of equivalent GPs was derived and a procedure to obtain asymptotically perfect classification of mutually singular GPs was described as well. The results covered both homoscedastic as well as heteroscedastic cases. Additionally, Delaigle and Hall (2012) and Delaigle and Hall (2013) investigated conditions under which a perfect classification procedure for GPs was possible and developed related classifiers. The paper by Dai et al (2017) proposed a functional classifier based on ratio of density functions, which also leads to perfect classification. These papers contain no reference of the HFp. In fact, the relationship between Delaigle and Hall (2012) and the HFp was analyzed in Berrendero et al (2018), where the authors presented an expression of the optimal Bayes rule in some classification problems for functional data. As mentioned earlier, perfect clustering has been studied only by Delaigle et al (2019). In Rao and Varadarajan (1963) and Shepp (1966b), the authors obtained characterizations of the singularity or equivalence of Gaussian measures in functional spaces. Their results involve increasing sequences of subspaces. For equivalent GDs, the limit obtained in Rao and Varadarajan (1963) includes a term which is the exponential of an expression involving the difference of the means of P1 and P2. Curiously, the logarithm of this term is related with the expressions of our limits. Similarities between our proposal and those in Rao and Varadarajan (1963) and Shepp (1966b) end here because the other involved terms are different. Moreover, we handle Mahalanobis distances between data points, while these papers use Hellinger and Jeffreys functionals to measure discrepancy between distributions. As a consequence, the characterizations they obtain are not applicable in practice to cluster data points because they depend on the underlying distribution. Moreover, it is not straight forward to compute such functionals using data points. Published in Transactions on Machine Learning Research (09/2023) 1.1 Contributions In this paper, we first analyze the limit of the above mentioned scaled Mahalanobis distances by assuming the underlying parameters of the GPs to be known (see Section 2). We begin with a general concentration result in Theorem 2.1. Then, we propose a transformation for clustering that asymptotically yield perfect separation among the clusters (see Theorem 2.5). Further, this transformation can be used to find the unknown number of clusters (see Proposition 2.3). In Section 3, we estimate the covariance operator of the mixture distribution from data and state related asymptotic results for the proposed transformation. In Theorem 3.2, we prove uniform (on the sample points) consistency of the empirical version for the transformation associated with GP clustering. It is surprising that our GP clustering method fails to discriminate location only scenarios, but yields perfect clustering for differences in scales (see Remark 2.2.2). We have also compared our work both theoretically (see Section M of Appendix II) as well as numerically (see Sections 4, 5 and Section N in Appendix II) with the existing literature on perfect clustering for functional data. All proofs are deferred to Appendix I. Some additional material is presented in Appendix II, which includes a possible extension to non-Gaussian distributions (see Section K), discussion on a clustering procedure in the location only case (see Section L) and theoretical comparisons of our results with those obtained in the paper by Delaigle et al (2019) (see Section M). In summary, the main contributions of our paper are as follows: 1. Given a sample from a mixture GD, we develop a method which allows us to identify the number of members in the mixture as well as the cluster membership of the functions. 2. On the theoretical side, if we know the probability P (but neither J nor the Phs) and we have a finite sample taken from P, then the procedure allows to identify J as well as the functions produced by each Ph for 1 h J. 3. From the practical point of view, we can estimate the covariance operator of the full mixture if the sample size increases. Under appropriate conditions, our procedure allows to identify J as well as the functions produced by each Ph with a high probability for 1 h J. 4. The proposed procedure only works in those cases in which the distributions in the sample have different covariance operators and we have enough difference in the values between Lh S and Lhk S for 1 h, k J (see Theorem 2.2 and Remark 2.5.2). 5. Our simulations as well as analysis of real data sets show promising behaviour of the procedure when compared with several existing methods. 1.2 Notation We will use the following notation. The distribution of the random process Z will be denoted as PZ, its mean function by µZ and its covariance operator (referred to simply as covariance) by ΣZ. We use ΣZ(s, t) to denote the covariance between Z(s) and Z(t) for s, t [0, 1]. Given a square matrix A, trace(A) denotes its trace. The usual Euclidean norm on Rd is denoted by . To simplify notation, we do not always explicitly state the dependence of the norm on the dimension d. Further, we will assume that all the involved random quantities are defined on a common probability space (Ω, A, P). The notions of convergence in probability and equality in distribution are denoted by P and , respectively. 2 Transformation with Known Gaussian Distributions Let {Vd}d N be an increasing sequence of subspaces of H. Here, the dimension of Vd is d. This restriction is not necessary for the development which follows as long as the dimension of Vd goes to infinity with increasing d, but it simplifies the notation. Given the subspace Vd, let µZ d and ΣZ d represent the d-dimensional mean and the d d covariance matrix of the projection of Z on Vd. If u H, we denote ud = (u1, . . . , ud)T to be its projection on Vd. Published in Transactions on Machine Learning Research (09/2023) Fix b H. Theorem 2.1 analyses the behaviour of the limit of squared Mahalanobis norm of the d-dimensional random vector (Z b)d for d N. For every positive definite (p.d.) d d matrix Ad, we define the map DAd d (u, v) = 1 A 1/2 d (u v)d 2 for u, v H. (1) Later, we will use this function alongwith the sequence of covariance matrices {ΣZ d }d N. In this section, the underlying distributions are assumed to be known. After stating Theorem 2.1 and some remarks related to it, we will look into an application to cluster analysis inspired from this result. We will take advantage of the fact that the limit in this theorem is not random, but it may depend on the underlying probability distribution PZ. Theorem 2.1 Let {Ad}d N be a sequence of d d symmetric, p.d. matrices and αd 1, . . . , αd d be the eigenvalues of the matrix Sd = (Ad) 1/2ΣZ d (Ad) 1/2 for d N. We define αd = (αd 1, . . . , αd d)T and αd = max(αd 1, . . . , αd d) is the supremum norm. Let us assume that 0 = lim d αd Let b H such that there exist constants Lµ and LS (finite, or not) with Lµ = lim d DAd d (µZ, b) and (3) LS = lim d 1 dtrace(Sd). (4) Then, DAd d (Z, b) P L := Lµ + LS as d . Remark 2.1.1 A condition in Theorem 2.1 is required to ensure that no single component is extremely influential. For instance, it may happen that we take a sequence such that αd 1 = d and αd i = o(d 1) for every 2 i d. Under this condition, a limiting value is not feasible in Theorem 2.1. However, this possibility is excluded by assumption (2). Remark 2.1.2 We allow both the constants in Theorem 2.1 to be infinite. When LS is finite, Lemma B.1 (see Appendix I) shows that assumption (2) follows from assumption (4). Remark 2.1.3 Let Z1 and Z2 be independent observations generated from the GDs P1 and P2. Thus, Z1 Z2 is a GP with mean µZ1 µZ2 and covariance ΣZ1 +ΣZ2. Consider the matrix Sd = (Ad) 1/2(ΣZ1 d + ΣZ2 d )(Ad) 1/2 with d N. Define Z = Z1 Z2 and b = 0 in Theorem 2.1. Then, under the assumptions of this result, the following convergence result holds: DAd d (Z1, Z2) = DAd d (Z1 Z2, 0) P L := Lµ + LS as d . Here, Lµ = limd DAd d (µZ1, µZ2) and LS is as defined in (4) of Theorem 2.1. Remark 2.1.4 In general, the fact that Vd Vd+1 does not guarantee the existence of any relationship between the sets {αd 1, . . . , αd d} and {αd+1 1 , . . . , αd+1 d+1}. However, in some cases {αd 1, . . . , αd d} {αd+1 1 , . . . , αd+1 d+1} holds (for instance, when Vd is generated by the first d eigenfunctions of ΣZ in Section 2.1). 2.1 Application: Cluster Analysis In this subsection, we deal with a random function Z whose distribution is a two component mixture distribution of the form: PZ = π1P1 + π2P2, where 0 < π1 < 1 and π1 + π2 = 1. Here, Ph denotes the GD on H with Zh Ph having mean function µZh and covariance ΣZh for h = 1, 2. The mean function and the covariance of the mixture satisfy µZ(t) = π1µZ1(t) + π2µZ2(t) with t [0, 1] and ΣZ(s, t) = π1ΣZ1(s, t) + π2ΣZ2(s, t) + π1π2[µZ1(s) µZ2(s)][µZ1(t) µZ2(t)] for s, t [0, 1]. (5) Published in Transactions on Machine Learning Research (09/2023) Given N independent and identically distributed (i.i.d.) r.v. s Z1, . . . , ZN PZ, consider the following set: Ch = {j : Zj was generated from Ph for 1 j N} (6) with h {1, 2}. Clearly, the set Ch depends on the sample size N. The components of the mixture distribution PZ and the sets Ch for h = 1, 2 are unknown, and the problem we are addressing here is the estimation of these sets. However, we assume PZ and the sets C1 and C2 to be known in this section to build the fundamental idea behind using the proposed transformation for clustering of GPs. Let Vd with d N denote the sequence of d-dimensional subspaces generated by the d eigenfunctions associated with the d largest eigenvalues of ΣZ (recall the discussion in Remark 2.1.4). Consider a subsample of size two from Z1, . . . , ZN generated from P. To simplify our notation, we denote them to be Z1 and Z2, respectively. They are independent, with PZ1 = Ph and PZ2 = Pk for h, k {1, 2}. The clustering procedure that we propose is based on the behavior of the transformation DΣZ d (Z1, Z2), which is stated below in Theorem 2.2. Theorem 2.2 (a) Assume that h = k {1, 2}. Define Sh d := (ΣZ d ) 1/2(2ΣZh d )(ΣZ d ) 1/2 for d N, and assume that Lh S = limd 1 dtrace(Sh d ) exists. Then, Lh S is finite and DΣZ d d (Z1, Z2) P Lh S as d . (7) (b) Assume that h = k {1, 2}. Then, Lhk µ = limd DΣZ d d (µZh, µZk) = 0. Define Shk d := (ΣZ d ) 1/2(ΣZh d + ΣZk d )(ΣZ d ) 1/2 for d N, and assume that Lhk S = limd 1 dtrace(Shk d ) exists. Then, Lhk S is finite and DΣZ d d (Z1, Z2) P Lhk S as d . (8) Remark 2.2.1 The structure of the covariance ΣZ stated in equation (5) imposes some restrictions on the associated constants as stated in part (b) of Theorem 2.2. In particular, the fact that Lh S and Lhk S are finite implies that assumption (2) in Theorem 2.1 always holds for the sequence of matrices {Sh d }d N and {Shk d }d N with h, k {1, 2}. Remark 2.2.2 It follows from part (b) of Theorem 2.2 that the statistic we propose is useless for cluster analysis in the homoscedastic case (independently of the difference between µZ1 and µZ2) because if ΣZ1 = ΣZ2, then L12 = L1 S = L2 S. A possibility is to modify the statistic DΣZ d d (z1, z2) so that the value of the transformation DΣZ d d (µZ1, µZ2) increases with d. Our proposal is to use DΣZ d ,r d (u, v) := 1 ((ΣZ d ) 1/2)r(u v)d 2 = 1 λr i with r N. Discussion of this transformation, and some numerical results are included in Section L of Appendix II. To simplify notation and avoid technicalities with empty classes, we additionally assume that the observations whose indices belong to the sets C1 = {1, . . . , N1} and C2 = {N1 + 1, . . . , N} with N = N1 + N2 and N1, N2 > 0, were generated by P1 and P2, respectively. In practice, these sets are unknown and in fact our aim is their estimation. We begin with this simplifying assumption for ease of notation, and to obtain a clearer exposition of the proposed methodology. Define the N N matrix Γd, whose (i, j)-th element is Γd(Zi, Zj) = γd ij = 1 N 2 t=1, t =i,j h DΣZ d d (Zt, Zi) DΣZ d d (Zt, Zj) i2 (9) for 1 i, j N. Theorem 2.2 and the fact that t = i, j in (9) give us the following: ( 0, if i, j Ch for h = 1, 2, γhk, if i Ch and j Ck, with h = k {1, 2}, (10) Published in Transactions on Machine Learning Research (09/2023) as d . Here, N 2 (Lh S Lhk S )2 + Nk 1 N 2 (Lk S Lkh S )2. Note that N1 and N2 are fixed here. Combining the fact stated above in (10), as d , we obtain " 0N10T N1 γ121N11T N2 γ211N21T N1 0N20T N2 Let βd i and βi (for 1 i N) denote the eigenvalues corresponding to the matrices Γd and Γ, respectively. Define the following quantities i=1 I(|βd i | > ad) and K0 = i=1 I(|βi| > 0), (12) with {ad}d N decreasing to 0 as d (at an appropriate rate), and I is the indicator function. The constant K0 clearly equals 2 for the limiting N N matrix Γ stated in (11), and hence, correctly identifies the true number of clusters. Proposition 2.3 Assume N1, N2 1 are fixed. Under the assumptions of Theorem 2.2, if L12 S = L1 S and L21 S = L2 S, then there exists a sequence {ad}d N R+ such that ad 0 and Kd P 2 as d . This now implies that we can correctly identify the true number of clusters asymptotically as d . The structure of the matrix Γ in (11) is straight forward because of the simplifying assumption on the sets C1 and C2. However, this is not a requirement and we will drop it. Proposition 2.3 holds more generally for any permutation of the data points Z1, . . . , ZN. In fact, if the sets C1 and C2 are unknown, then the rows/columns of the Γ matrix will be permuted accordingly. But, the underlying structure remains the same and Proposition 2.3 continues to hold. As a followup of our previous result, we now prove that if any standard clustering method is used on the Γd matrix, then we can perfectly cluster all the observations asymptotically (as d ) because of the structure of the Γ matrix stated in (11). Definition 2.4 A clustering method can be defined as a map from H to the set {1, . . . , J}. Consider the sequence of maps {ψd}d N and a second map ϕ. A measure of distance between two clusterings based on the Rand index (see p. 847 of Rand (1971)) is defined as follows: Rd,N = 1 N 2 X 1 i 0. Further, assume that the conditions in Theorem 2.2 and Proposition 2.3 hold. Then, the clusters will be perfectly identifiable, i.e., Rd,N P 0 as d . Remark 2.5.1 The well-known Rand index (a measure of similarity) is usually defined as 1 Rd,N. As a consequence, Theorem 2.5 implies that the usual Rand index goes to one as d . Published in Transactions on Machine Learning Research (09/2023) Remark 2.5.2 The structure of the N N symmetric matrix Γ stated in (11) continues to hold, and will lead us to perfect clustering for every value of J 2 if enough distinct γij s exist. In particular, this happens if we have some h with 1 h J such that γhk = γhk for every 1 k = k J, but other possibilities exist as well. Moreover, the procedure described in Proposition 2.3 also works fine with the limit equal to K0 (which is the rank of Γ). One may be tempted to think that it generally coincides with J. But, this is true only for J 3 and may be different for J 4 (as shown in Lemma F.1 of Appendix F). The proof of Lemma F.1 further shows that the condition under which rank(Γ) < J is quite restrictive. Thus, for simplicity, our proposal is to estimate the number of clusters J using Kd (defined in equation (12) above). 2.1.1 Example with GPs If we assume that ΣZ2 = aΣZ1 (for some a > 0), then we have the following expressions for the scale constants stated in Theorem 2.2: L1 S = 2 π1 + π2a, L2 S = 2a π1 + π2a and L12 S = 1 + a π1 + aπ2 . Thus, it is possible to identify perfectly the clusters as long as a = 1, since this implies that γ12 and γ21 both are positive quantities. 2.1.2 Uniform Convergence In Theorem 2.2, we have proved consistency for finite sets of data points corresponding to the transformation DΣZ d d (Z1, Z2) defined in (1). We now prove the uniform (on the random sample) convergence of this function as N . This result will be useful in establishing a second result on uniform convergence, which we state in the next section. Theorem 2.6 Assume the conditions in Theorem 2.2, and let {d N} N be such that d N as N . Then, we have the following. a) For h {1, 2}, let αd N = (αd N 1 , . . . , αd N d N )T be the eigenvalues of Sh d N with d N N. If log N = o d N αd N then it happens that sup Z1,Z2 CN h D ΣZ d N d N (Z1, Z2) Lh S P 0 as N . (14) b) For any h = k {1, 2}, let αd N = (αd N 1 , . . . , αd N d N )T be the eigenvalues of Shk d N with d N N. If log N = o d N αd N then it happens that sup Z1 CN h ,Z2 CN k D ΣZ d N d N (Z1, Z2) Lhk P 0 as N . (16) Remark 2.6.1 Assumption (2) holds here, so αd N d N = 1 d N max 1 i d N αd N i 0 as N . Thus, if we take d N growing fast enough, then it is assured that assumptions (13) and (15) hold. The structure of the matrices Sh d and Shk d for d N with h = k {1, 2} implies that a sufficient condition is log N = o(d N) (also see Proposition H.1 in Appendix H). Published in Transactions on Machine Learning Research (09/2023) 3 Transformations with Estimated Gaussian Distributions In this section, we will discuss the steps to implement the procedure described in Section 2. In practice, the involved distributions and all the associated quantities need to be estimated from the data. Here, Z will denote a random element with distribution the GP mixture π1P1 + π2P2. Let ϕZ j (t) with t [0, 1] and λZ j for j N denote the eigenfunctions and eigenvalues of ΣZ, respectively. We will now make the following assumptions: A.1 supt [0,1] E[Z4(t)] < . A.2 Assume that λZ 1 > λZ 2 > > 0 satisfying P j=1 λZ j < . It is well-known that assumption A.2 implies {ϕZ j }j N forms an orthonormal basis of H. To estimate ΣZ and its eigenvalues and eigenfunctions, we will use the corresponding empirical quantities. Suppose that we have a simple random sample Z1, . . . , ZN taken from Pz. Given s, t [0, 1], we define ˆΣZ(s, t) = 1 i=1 [Zi(s) ZN(s)][Zi(t) ZN(t)], where ZN(t) = 1 N PN i=1 Zi(t). Consider the corresponding families ˆλZ 1 ˆλZ 2 and ˆϕZ 1 , ˆϕZ 2 , . . . of its eigenvalues and eigenvectors, respectively. Note that ˆΣZ as well as all the ˆλZ j s and ˆϕZ j s depend on N. With a finite sample, we cannot estimate the entire (infinite) collection of eigenvalues and eigenvectors. Thus, we follow the work of Delaigle and Hall (2012) and Hall and Hosseini-Nasab (2006), and select a non-random decreasing sequence ηN going to zero slowly enough as to satisfy lim N N 1/5ηN = . We define ˆRZ N = inf{j : ˆλZ j ˆλZ j+1 < ηN} 1. (17) This definition implies that ˆλZ j ηN for every j ˆRZ N. Moreover, we also need that the theoretical eigenvalues are reasonably well separated. To obtain this, given δ > 0, we further define RZ N = inf{j : λZ j λZ j+1 < (1 + δ)ηN} 1. (18) We now state empirical analogues of the results stated in Section 2.1. 3.1 Consistency of Clustering Let Z1, . . . , ZN be a simple random sample taken from the GP mixture distribution PZ. Now, PZ and the sets C1 and C2 (containing information on the class labels defined in (6)) are unknown. Extensions of Theorems 2.2 and 2.6 to Theorems 3.1 and 3.2 are presented below. The following results will be based on the analysis of the map ˆD ˆ RN (u, v), which is the transformation DΣZ d d (u, v) defined in (1) with d = ˆRN (defined in (17)), and the pooled covariance matrix ΣZ ˆ RN which is estimated by ˆΣZ ˆ RN (sample covariance of the full sample). Here, ˆD ˆ RN is an abridged notation of D ˆΣZ ˆ RN ˆ RN . The first result is related to the consistency of the transformation on finite sets. Theorem 3.1 Let assumptions A.1 and A.2 and those in Theorem 2.2 hold. (a) If h = k {1, 2}, then ˆD ˆ RN (Z1, Z2) P Lh S as N . (19) (b) If h = k {1, 2}, then ˆD ˆ RN (Z1, Z2) P Lhk S as N . (20) Published in Transactions on Machine Learning Research (09/2023) We need an increasing sample size in order to estimate the parameters consistently. Thus, it is desirable to be able to cluster the increasing number of data points, asymptotically without error. The only way to achieve this is to get some kind of uniform convergence in (19) and (20) when the sample size increases. This is the purpose of Theorem 3.2, which gives us clear evidence that using this transformation would lead to asymptotic perfect separation in the empirical case as well. Theorem 3.2 Let us assume all the conditions in Theorem 2.6 with log N = o(RZ N) in (18). (a) For h {1, 2}, we have that sup Z1,Z2 CN h ˆD ˆ RN (Z1, Z2) Lh S P 0 as N . (b) For any h, k {1, 2} with h = k, we have that sup Z1 CN h ,Z2 CN k ˆD ˆ RN (Z1, Z2) Lhk S P 0 as N . Remark 3.2.1 Clearly, Theorem 3.1 follows from Theorem 3.2. But, the conditions required for proving the former are weaker and hence, we state it as a separate result. Remark 3.2.2 (Asymptotic perfect identification of clusters) Recall the matrix Γd from (9) with d N. Now, consider the matrix ˆΓ ˆ RN , which is obtained by replacing γd ijs in the matrix Γ ˆ RN with their estimated values ˆγ ˆ RN ij , i.e., ˆγ ˆ RN ij = ˆD ˆ RN (Zi, Zj) with 1 i = j N. Define v12 = π1 L1 S L12 S 2 + π2 L2 S L21 S 2. Fix ϵ > 0. Theorem 3.2 implies that with probability converging to one as N , we have - if Zi, Zj Ch for h {1, 2}, then ˆγ ˆ RN ij 4ϵ2, - if Zi Ch, Zj Ck for h = k {1, 2}, then ˆγ ˆ RN ij v12 Hϵ, for some H > 0. Consequently, if v12 > 0, then the elements in ˆΓd will be clustered into two well-separated clusters: one around 0 and another one around v12 with probability converging to one. Similarly, let PZ be a mixture of J(> 2) components and denote vhk := πh Lh S Lhk S 2 + πk Lk S Lkh S 2 with 1 h = k J. For positive and distinct vhks, the elements in the matrix ˆΓd will be perfectly clustered into 1 + J 2 well separated clusters: one of them around the point 0 and the remaining around the values vhk (for h < k) with probability converging to one as N . Therefore, asymptotically, the sequence of matrices {ˆΓ ˆ RN }N N will contain enough information to perfectly cluster all the data points in a sample. 3.2 Implementation Issues We are given a sample of data points without the labels. Here, we consider the N N estimated matrix ˆΓN with the (i, j)-th element as ˆD ˆ RN (Zi, Zj) (which is just the empirical version of DΣZ d d (Zi, Zj) based on the pooled sample covariance) for 1 i, j N and apply any clustering procedure on its rows (or, columns). Note again that we do not need to estimate the unknown constants Lh S and Lhk S for h, k {1, 2} (stated in Theorem 3.1) for the implementation of our clustering procedure. The expressions related with Kd and ˆRN are not used in our implementation (see Table 3.2 below for the main differences between theory and practice). In Section 4.2, we give complete details of the implementation of our procedure. Published in Transactions on Machine Learning Research (09/2023) Table 3.1: Some key differences in our theoretical assumptions and implementations. Quantity Theoretical Assumption Implementation ˆRN involves ˆλjs and goes to zero as N estimated using cross-validation in clustering (see Wang (2010) for more details) Kd involves βis and goes to zero as d involves ˆβis and estimated using the optishrink function from the R package denoise R 4 Analysis of Simulated Datasets For our simulation study, we consider two class problems (J = 2). We generated data on a discrete grid of 100 equi-spaced points in the unit interval [0, 1] from four different simulation models, which are described below. Fix s > 0. I. Define Xh(t) = P40 j=1(λ1/2 hj Zhj+µhj)ϕj(t) with t [0, 1] and h = 1, 2. Here, the Zhjs are independent standard normal (i.e., N(0, 1)) random variables, ϕj(t) = 2 sin(πjt) with t [0, 1] and 1 j 40. Also, µhj = 0 for j > 6, and we set the other components equal to (0, 0.5, 1, 0.5, 1, 0.5)T and (0, 0.75, 0.75, 0.15, 1.4, 0.1)T for k = 1, 2, and λ1j = 1/j2 and λ2j = s/j2 for 1 j 40. This model is from the paper Delaigle and Hall (2012). II. In this example, X1 B and X2 µ+s B with µ(t) = Gt for t [0, 1] and G N(0, 4) independent of B. Here, B is the standard Brownian bridge, i.e., a centered GP with σij = min(ti, tj) titj for ti, tj [0, 1] and i, j N. Since E[X2(t)] = E[Gt] = 0 for t [0, 1], the differences in mean never appear in this setting. In fact, the inclusion of µ modifies the covariances because if 0 < ti < tj < 1, then the independence between G and B yields the following: E[X2(ti)X2(tj)] = 4titj + s2ti(1 tj). This model is from the paper Berrendero et al (2018). III. Let Xh = µh + P50 j=1 ξhjλhj 1/2ϕj for h = 1, 2. Here, ξhjs are i.i.d. N(0, 1), µ1 = 0 and µ2(t) = t with t [0, 1], λ1j = e j/3 and λ2j = se j/3 for 1 j 50, and ϕ2i 1 = 2 sin(2iπt) and ϕ2i = 2 cos(2iπt) for 1 i 25 with t [0, 1]. This model is from the paper Dai et al (2017). IV. This problem consists of two Brownian motions defined in the closed interval [0, 1] with mean functions µZ1(t) = 20t1.1(1 t) and µZ2(t) = 20t(1 t)1.1, respectively, for t [0, 1]. For the first class, the eigenfunctions are ϕj(t) = 2 sin((j 0.5)πt) and associated eigenvalues are λ1j = 1/(π(j 0.5))2 for 1 j 15. The second class is similar to the first one, but the eigenvalues are multiplied by s (i.e., λ2j = sλ1j = s/(π(j 0.5))2) for 1 j 15. This model is from the paper Galeano et al (2015). We set s = 1 for location only problems. In location and scale problems, we fixed s = 3, while for scale only problems the mean functions µZ1 and µZ2 were set to be the constant function 0 and s = 3 was retained. 4.1 Choice of d A critical issue here is selection of the optimal dimension of the projected space for a given a set of data points (i.e., a fixed value of (N1, N2) or N). Let us recall Theorem 3.1. According to this result, we expect the rows of the matrix ˆΓd to form two clearly separated clusters depending on the class label of the observation for Published in Transactions on Machine Learning Research (09/2023) large values of d. To demonstrate this, we construct a sequence of images which shows how this separation varies with increasing values of d. We generated samples of size 250 from each of the two classes for the scale case of Example II. For the purpose of demonstration, the first 250 observations correspond to the first GD, while the next 250 observations to the second. Figure 1 below shows the heatmap for increasing values of d, and we observe the best concentration at d = 80. However, some noise in the off-diagonal submatrices for d = 80 (compared to d = 60) makes us to consider that the optimum could be somewhere between the values 60 and 80. Figure 1: Heatmap of ˆΓd for varying values of d. Clearly, the choice of d is quite important as d is the dimension of the subspace where we project our observations (for a fixed sample size). We observe from Figure 1 that its estimation is quite crucial. The next subsection contains further details on the choice of d. 4.2 Clustering Procedure in Practice To implement the clustering method, we use cross-validation (CV) to choose the dimension d suitably. We use the idea developed by Wang (2010), which we state briefly here. Given B N, split the data into three random subsets (say, S1b, S2b and S3b) each of equal size for 1 b B. For each value of b, treat the data points in S1b and S2b as the training sets, and S3b as the validation set. For a fixed value of d and given a clustering algorithm, the two training sets S1b and S2b are used to construct two cluster assignments. An appropriate distance between these two cluster assigments (say, D) is computed based on the validation set S3b (see Section 2 of Wang (2010) for more details). We repeat this partitioning B(= 50) times and average it over these B samples to get ˆDCV d . Define ˆd CV = arg min2 d N ˆDCV d . A pseudo-code for this procedure is given in Algorithm 2 (see Section O of the Appendix). Recall the structure of the Γ matrix stated in (11), and also see Figure 1. As mentioned in Section 3.2, the number of clusters were estimated using the quantity Kd described in Section 2.1 (see (12)). To implement the procedure in practice, one needs to estimate the sequence {ad}d N. We have used the function optishrink available in the R package denoise R. This function extracts a low-rank signal from Gaussian noisy data using the optimal shrinker of singular values. The low rank structure of the Γ matrix motivates us to directly apply this function on ˆΓd. A pseudo-code (alongwith stepwise computational complexity) of our methods is given in Algorithm 1 (see Section O of the Appendix). Our overall implementation yielded quite desirable results in our numerical study (see Tables 4.1 and 4.2 below). We can apply any clustering method on the transformed data ˆΓd. In addition to the k-means algorithm (CD-k-means) discussed in Theorem 2.5, we considered spectral clustering (CD-spectral) and Gaussian mixture models (CD-mclust). Here, the acronym CD corresponds to our proposed transformation. One may refer to the book by Hastie et al (2009) for details on these three popular clustering methods. The R codes for our methods are available here: GP clustering. We considered several methods for comparison. The first three methods are those that we have used on the matrix ˆΓd but applied directly to functional data, namely, the k-means algorithm (k-means), spectral clustering (spectral) and Gaussian mixture models (mclust). Several competent methods for functional Published in Transactions on Machine Learning Research (09/2023) clustering using functional mixed mixture models are implemented in the function funcit from the R package funcy. We report this method as funclust. The methodology developed by Chiou and Li (2007) is available in the function FClust from the R package fdaspace using two clustering techniques EMcluster (CL1) and k CFC (CL2). We have reported the minimum value, and stated it as CL. In Delaigle et al (2019), the authors developed functional clustering based on the k-means using basis functions. We implemented this method for two choices of the basis functions, namely, Haar and PC, and reported the best result among these two (we call it DHP). We have not used the DB2 basis for our comparisons because it requires the grid points to be of a power of 2. The DHP method is available from the journal website, and we used those Matlab codes for our comparisons. We conducted simulations based on models I to IV, which were introduced in the beginning of Section 4. We did not consider the location only scenario as our proposed method is useless in such cases (recall part (c) of Theorem 2.2). However, we have some discussion and additional results in Section L in Appendix II for this scenario. The sample size of each class was set to be 250. Our experiment was replicated 100 times, and the results are reported in Tables 4.1 and 4.2 below. To measure the similarity between two cluster assignments, we computed the adjusted Rand index using the function RRand in the R package phyclust. One minus the adjusted Rand index (we call it adjusted Rand distance) is reported in tables below, where the minimum is marked in bold and the second lowest is in italics. It is worth noting that all the competing methods require the number of clusters as an input variable, and we have run these methods with k = 2 (the true number of clusters). However, when applying the CD procedure we have estimated the number of clusters following the procedure described above. We obtained the correct value in more than 99% of the cases (across all four examples for both scenarios) in our simulation study. Table 4.1: Adjusted Rand distances for different GPs with differences in locations and scales (with standard error in brackets). Ex. k-means spectral mclust funclust CL DHP CD k-means spectral mclust I 0.0632 0.0280 1.0000 0.1541 0.0239 0.0818 0.0654 0.0028 0.0001 (0.0007) (0.0016) (0.0000) (0.0017) (0.0007) (0.0025) (0.0010) (0.0002) (0.0001) II 0.5377 0.4210 0.9700 0.8222 0.5767 0.5149 0.2104 0.3464 0.0738 (0.0046) (0.0042) (0.0017) (0.0027) (0.0045) (0.0049) (0.0030) (0.0045) (0.0014) III 0.4250 0.4460 1.0000 0.3858 0.2891 0.4137 0.1367 0.2596 0.0250 (0.0017) (0.0048) (0.0000) (0.0003) (0.0000) (0.0054) (0.0024) (0.0042) (0.0016) IV 0.4945 0.3788 1.0000 0.3975 0.1833 0.1379 0.0316 0.0000 0.0000 (0.0056) (0.0047) (0.0000) (0.0011) (0.0000) (0.0033) (0.0001) (0.0000) (0.0000) 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 4 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 4 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 4 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 4 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Figure 2: Representative curves for the four examples having differences in locations and scales. The mean curves marked in bold, while the two classes marked in red and blue colors, respectively. Please zoom in for a better resolution. In the first setting, we considered clustering problems with differences in their location and scale parameters. Class specific representative plots are given in Figure 2. Usefulness of the proposed transformation is clear from Table 4.1. Our method attains the first position across all examples, while in Example IV we Published in Transactions on Machine Learning Research (09/2023) obtain perfect clustering. Although there is no location difference in Example II, sub-optimal performance of our method is probably due to low signal from the difference between the two covariance structures. CL attains the second best performance in the first three examples among the competing methods, while DHP performs better than CL in Example IV. In the next setting, we dealt with differences only in scale parameters. Class specific representative plots are given in Figure 3. It is clear from Table 4.2 that the separation in scatters is captured very well by the proposed transformation ˆΓd. Moreover, our method again leads to perfect clustering (with a significant improvement in Example II compared to Table 4.1). In Example II, all (except possibly, funclust) the competing methods perform quite poorly across all examples. However, its performance is far below our method CD even in this case. The performances of k-means and DHP are similar, and quite bad in this scenario. Generally, the results in Table 4.2 suggest that all existing methods fail to judiciously capture information if it is present only in the scale parameters. Table 4.2: Adjusted Rand distances for different GPs with differences only in scales (with standard error in brackets). Ex. k-means spectral mclust funclust CL DHP CD k-means spectral mclust I 0.9900 0.9992 1.0000 0.9776 0.8269 0.9966 0.0063 0.0001 0.0000 (0.0014) (0.0007) (0.0000) (0.0006) (0.0000) (0.0005) (0.0002) (0.0001) (0.0000) II 0.9985 0.9962 0.9986 0.5004 0.9065 0.9999 0.0091 0.0100 0.0084 (0.0005) (0.0007) (0.0005) (0.0049) (0.0007) (0.0003) (0.0036) (0.0047) (0.0019) III 0.9944 0.9990 1.0000 0.9956 0.9994 0.9967 0.1789 0.2872 0.0352 (0.0009) (0.0004) (0.0000) (0.0000) (0.0000) (0.0007) (0.0026) (0.0032) (0.0007) IV 0.9927 0.9984 1.0000 1.0006 0.8464 0.9980 0.0102 0.0014 0.0005 (0.0002) (0.0005) (0.0000) (0.0000) (0.0000) (0.0006) (0.0021) (0.0013) (0.0004) 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 3 2 1 0 1 2 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 10 0.0 0.2 0.4 0.6 0.8 1.0 6 4 2 0 2 4 6 0.0 0.2 0.4 0.6 0.8 1.0 6 4 2 0 2 4 6 0.0 0.2 0.4 0.6 0.8 1.0 6 4 2 0 2 4 6 0.0 0.2 0.4 0.6 0.8 1.0 6 4 2 0 2 4 6 Figure 3: Representative curves for the four examples having differences only in scales. The mean curves marked in bold, while the two classes marked in red and blue colors, respectively. Please zoom in for a better resolution. After applying the transformation ˆΓd, we had used three methods for clustering the transformed observations (see Tables 4.1 and 4.2). Additionally, when these methods are applied directly to the functional data, we observe that k-means and spectral perform fairly well, but their performances deteriorate sharply for the case with differences only in the scales. The mclust algorithm performs worst (possibly due to the presence of low signal in the locations) among the three clustering methods, while CD-mclust clearly achieves the best results with a substantial improvement (see Table 4.1). It is worth mentioning that the Rand distances improve substantially for all three methods when applied on the transformed matrix ˆΓd. All the three usual clustering methods seem useless (possibly due to the presence of signal only in the scales) from the results in Table 4.2, while their CD counterparts clearly lead to excellent performances with CDmclust yielding the overall best result. Overall, the best performance of this method may be attributed to the fact that the signal gets amplified in the transformed matrix and the specific structure (recall equation (11)) of this matrix. Published in Transactions on Machine Learning Research (09/2023) 5 Analysis of Benchmark Datasets We have applied our proposed methods to some benchmark data sets, Wheat (from the R package fds), Satellite (available at https://www.math.univ-toulouse.fr/ ferraty/SOFTWARES/NPFDA /index.html), Cars (kindly provided by the first author of Torrecilla et al (2020)) and Velib (from the R package fun FEM). To evaluate the clustering algorithms, we ran a single execution (without splitting). Class assigments are already available for the Wheat dataset. The Satellite data has been analyzed in detail in the paper Dabo Niang et al (2007), where the authors split the curves into two clusters unimodal and multimodal . The authors of this paper kindly shared the exact cluster assignments for this data set with us. The Cars data contains asset log-returns of the car companies Tesla, General Motors and BMW (see Torrecilla et al (2020) for more details). However, the rank of the estimated ˆΓd matrix was two for this data set, and our method detected only two distinct clusters. This is coherent with Torrecilla et al (2020), where the authors had noted that assets of General Motors and BMW were very similar and quite difficult to distinguish. So, we merged General Motors with BMW while assigning the class labels for this data set. Consequently, the number of clusters was set to be two for all the competing methods. The Velib data was analyzed by Bouveyron et al (2015), where the authors identified the optimal number of clusters to be ten using the fun FEM algorithm. Setting J = 10 (this information was provided to the competing methods), we determined the class labels of the observations using this algorithm. We report the adjusted Rand distance for these four data sets in Table 5.1. Competitive performance of our proposed methodology w.r.t. the competing methods is clear from the results given below. Table 5.1: Adjusted Rand distances for different clustering methods. Data J N d k-means spectral mclust funclust CL DHP CD k-means spectral mclust Wheat 2 100 701 0.6960 0.7860 0.6978 0.6960 0.8058 0.5730 0.7859 0.8057 0.3644 Satellite 2 472 70 0.6072 0.8947 0.3525 0.6072 0.6060 0.7253 0.8861 0.9954 0.4448 Cars 2 90 32 0.8856 0.9230 0.8834 0.8856 0.9650 0.9088 0.5385 0.9313 0.4680 Velib 10 1189 181 0.3755 0.5464 0.8561 0.6738 0.5672 0.5872 0.5408 - R package is now archived; - Matlab code available only for two classes. To get a better understanding of the performance of our proposed method, we further computed the wellknown average purity function correspoinding to the CD method having the minimum adjusted Rand distance. A value of average purity function close to one indicates good performance of a method. We obtained the values as 0.9000, 0.8622, 0.8666 and 0.6038 for the Wheat data, the Satellite data, the Cars data and the Velib data, respectively. Overall, our proposed method CD yields quite promising results in all four benchmark data sets, with CD-mclust yielding the most stable performance. Acknowledgements The first author has been partially supported by grants PID2021-128314NB-I00 and PID2022-139237NB-I00 funded by MCIN/AEI/10.13039/501100011033/FEDER, UE and ERDF A way of making Europe". Both the authors would like to thank the Action Editor Prof. Brian Kulis and three anonymous reviewers for their constructive comments and suggestions that improved the paper. Published in Transactions on Machine Learning Research (09/2023) Baíllo, A., Cuevas, A. and Cuesta-Albertos, J. A. (2011) Supervised classification for a family of Gaussian functional models. Scand. J. Statist., Vol. 38, 480-498. Berrendero, J. R., Cuevas, A. and Torrecilla, J. L. (2018) On the use of reproducing kernel Hilbert spaces in functional classification. J. Amer. Stat. Assoc., Vol. 113, 1210-1218. Bouveyron, C., Côme, E. and Jacques, J. (2015) The discriminative functional mixture model for a comparative analysis of bike sharing systems. Ann. Appl. Statist., Vol. 9, pp. 1726-1760. Chiou, J. M., and Li, P. L. (2007) Functional clustering and identifying substructures of longitudinal data. J. R. Statist. Soc. B, Vol. 69, 679-699. Cuesta-Albertos, J.A. and Dutta, S. (2022) On perfect classification and clustering for Gaussian processes. ar Xiv preprint, https://arxiv.org/abs/1602.04941. Dabo-Niang, S., Ferraty, F., and Vieu, P. (2007) On the using of modal curves for radar waveforms classification. Comput. Statist. Data Anal., Vol. 51(10), 4878-4890. Dai, X., Muller, H-G. and Yao, F. (2017) Optimal Bayes classifiers for functional data and density ratios. Biometrika, Vol. 104, 545-560. Delaigle, A. and Hall, P. (2012) Achieving near perfect classification for functional data. J. R. Statist. Soc. B, Vol. 74, 267-286. Delaigle, A. and Hall, P. (2013) Classification using censored functional data. J. Amer. Stat. Assoc., Vol. 108, 1269-1283. Delaigle, A., Hall, P. and Pham, T. (2019) Clustering functional data into groups using projections. J. R. Statist. Soc. B, Vol. 81, 271-304. Feldman, J. (1958) Equivalence and perpendicularity of Gaussian processes. Pacific J. Math., Vol. 8, No. 4, 699-708. Galeano, P., Joseph, E. and Lillo, R. E. (2015) The Mahalanobis distance for functional data with applications to classification. Technometrics, Vol. 57, 281-291. Hajek, J. (1958). A property of J-divergences of marginal probability distributions. Cz. Math. J., Vol. 8, 460-462. Hall, P. and Hosseini-Nasab, M. (2006) On properties of functional principal components analysis. J. R. Statist. Soc. B, Vol. 68, 109-126. Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer. Laurent, B. and Massart, P. (2000) Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28, 1302 1338 Rand, W. M. (1971) Objective criteria for the evaluation of clustering methods. J. Amer. Stat. Assoc., 66:336, 846-850. Rao, C. R. and Varadarajan, V. S. (1963) Discrimination of Gaussian processes. Sankhya, Series A, Vol. 25, 303-330. Shepp, L. A. (1966b) Gaussian measures in function space. Pacific J. Math., Vol. 17, 167-173. Torrecilla, J. L., Ramos-Carreño, C., Sánchez-Montañés, M. and Suárez, A. (2020) Optimal classification of Gaussian processes in homoand heteroscedastic settings. Statist. Comput., Vol. 30, 1091-1111. Wang, J. (2010) Consistent selection of the number of clusters via cross-validation. Biometrika, Vol. 97, 893-904. Published in Transactions on Machine Learning Research (09/2023) Appendix I: Proofs and Mathematical Details A Proof of Theorem 2.1 Fix d N. The d-dimensional random vector (Z b)d has a Gaussian distribution with mean equal to (µ b)d and covariance matrix equal to ΣZ d . Now, (Ad) 1/2(Z b)d 2 is equal to the square of the norm of a d-dimensional normal variable with mean md = (Ad) 1/2(µ b)d and covariance matrix Sd = (Ad) 1/2ΣZ d (Ad) 1/2. Therefore, if ud is a d-dimensional vector with centered normal distribution and covariance matrix equal to Sd, then DAd d (Z, b) 1 d md + ud, md + ud = 1 d md 2 + ud 2 + 2 md, ud . (21) By assumption (3), we have lim d 1 d md 2 = Lµ. Let us consider the second term in (21). Fix a basis in Vd spanned by the eigenvectors of Sd. Note that this term is not dependent on Lµ. Denote ud = (ud,1, . . . , ud,d)T and md = (md,1, . . . , md,d)T in this basis. Therefore, the random variables (ud,i)2 with 1 i d are independent with means equal to αd i for 1 i d and Pd i=1(ud,i)2 Pd i=1 αd i (ui)2. Here, {ui}1 i d is a sequence of independent and identically distributed (i.i.d.) real variables with the standard normal distribution. We split the proof into two cases. A.0.1 LS is finite Fix ϵ > 0. Taking into account that the variance of a χ2 distribution with one degree of freedom is two and using Chebychev s inequality, we have that ud 2 trace(Sd) ϵ = P (ud,i)2 αd i ϵ i=1 (αd i )2 which converges to zero by assumptions (4) and (2). Consequently, we have shown that dtrace(Sd) P 0 as d , and assumption (4) gives 1 d ud 2 P LS as d . A.0.2 LS is infinite We have that " 1 Pd i=1 αd i (ud,i)2 αd i ϵ αd i Pd i=1 αd i αd i Pd i=1 αd i 2 ϵ2 αd Pd i=1 αd i , Published in Transactions on Machine Learning Research (09/2023) which converges to zero because LS = and assumption (2). Thus, we have shown that 1 d Pd i=1 αd i dtrace(Sd) P 0 as d . (22) Consequently, 1 d ud 2 converges to at the same rate as 1 dtrace(Sd). Concerning the last term in (21), we have md, ud = Pd i=1 md,iud,i. We split the proof into cases. A.0.3 Lµ is finite Fix ϵ > 0, and define αd = (αd 1, . . . , αd d)T . Using Chebychev s inequality again, we get d| md, ud | > ϵ 1 ϵ2d2 i=1 (md,i)2αd i 1 ϵ2d2 αd md 2, which converges to zero by assumptions (3) and (2), and the proposition is proved in this case. A.0.4 Lµ is infinite The result follows from equation (21) and the previous results, if we are able to show that the sequence of real-valued random variables wd = md, ud max( md 2, ud 2) converges to zero in probability as d . In turn, this will be fixed if we show that every subsequence of {wd} contains a new subsequence which satisfies this property. Thus, let {wdk} be a subsequence of {wd} and let us consider the associated subsequences { mdk } and { udk }. Obviously, there exists a further subsequence {dk } such that one of the following holds: (i) limdk mdk 2 trace(Sdk ) = 0. (ii) limdk mdk 2 trace(Sdk ) = . (iii) There exists a finite C > 0 such that limdk mdk 2 trace(Sdk ) = C. Note that in cases (i) and (iii), we have LS = . To simplify notation, we denote the sequence {Sdk } by {Sh}, and similarly for the remaining ones. In case (i), since equation (22) shows that P 1 as h , (23) P 0 as h . Consequently, lim h |wh| = lim h | mh, uh | uh 2 lim h mh uh = 0 in probability. If (ii) holds, we have that |wh| uh mh . Since E uh 2 = trace(Sd), we have that uh 2 mh 2 P 0 and also, in this case wh P 0 as h . In case (iii), taking into account that equation (23) now holds, it is enough to show that mh, uh Ctrace(Sh) Published in Transactions on Machine Learning Research (09/2023) Fix ϵ > 0. We have that P mh, uh Ctrace(Sh) m2 h,iαh i Ph i=1 αh i 2 1 C2ϵ2 αh Ph i=1 αh i which converges to zero by assumptions (4) and (2). B On Assumptions (4) and (2) The next lemma shows that if LS < , then assumption (4) implies assumption (2). Lemma B.1 Let {ad}d 1 be a sequence of real positive numbers such that limd 1 d Pd i=1 ai exists, and it is finite. Then, it happens that limd 1 Proof: Fix d N, and denote Ad = Pd i=1 ai. We have that and consequently, 0 = limd ad d . Given ϵ > 0, there exists d1 > 0 such that if d > d1, then ad d ϵ and d2 d1 such that sup 1 i d1 Let d > d2 and take 1 i d. So, we have that if i d1, then ai d2 ϵ and if i > d1, then ai i ϵ. This completes the proof. C Proof of Theorem 2.2 First, note that (Z1 Z2)d is a d-dimensional normal vector, with mean (µZh µZk)d and covariance ΣZh d + ΣZk d for h, k {1, 2}. Now, the proof will based in the following lemma. Lemma C.1 Assume that h = k {1, 2}. Define Sh d , Sk d and Shk d as in Theorem 2.2. Then, i) Lhk µ = limd DΣZ d d (µZh, µZk) = 0, and ii) lim supd 1 dtrace(Sh d ) < , lim supd 1 dtrace(Sk d) < and lim supd 1 dtrace(Shk d ) < . Proof: Let us denote Σ = π1ΣZh + π2ΣZk, µ = (µZh µZk) and π12 = π1π2. From (5), we have that ΣZ d = Σ d + π12µdµT d . From here, the Sherman-Morrison formula gives (ΣZ d ) 1 = (Σ d) 1 π12(Σ d) 1µdµT d (Σ d) 1 1 + π12µT d (Σ d) 1µd . Since (Σ d) 1 is positive definite (p.d.) for all d N, this now implies that 0 µT d (ΣZ d ) 1µd = µT d (Σ d) 1µd π12(µT d (Σ d) 1µd)2 1 + π12(µT d (Σ d) 1µd) = µT d (Σ d) 1µd 1 + π12µT d (Σ d) 1µd 1 π12 , and the proof that Lhk µ = 0 trivially ends from definition of Lhk µ . Published in Transactions on Machine Learning Research (09/2023) To handle the terms Lh S (Lk S is identical) and Lhk S , recall the Woodbury matrix identity: (U + V ) 1 = U 1 (U + UV 1U) 1. Using this identity, we have (ΣZ d ) 1 = 1/π1(ΣZh d ) 1 Bd, where Bd = (π1ΣZh d + π2 1ΣZh d (π2ΣZk d + π12µdµT d ) 1ΣZh d ) 1. If U and V are p.d., then U T V U is p.d. In Bd, both the matrices ΣZh d and (π2ΣZk d +π12µdµT d ) are symmetric and p.d., and this implies that Bd is also p.d. Further, (ΣZh d )1/2 and Bd are p.d. which now implies that (ΣZh d )1/2Bd(ΣZh d )1/2 is p.d. Recall that trace is a linear map. Now, trace(Sh d ) = 2 trace(ΣZh d (ΣZ d ) 1) = 2 trace(1/π1Id) trace(ΣZh d Bd) = 2 trace(1/π1Id) trace((ΣZh d )1/2Bd(ΣZh d )1/2) trace(2/π1Id) = 2d/π1. We have proved that lim supd 1 dtrace(Sh d ) < 2/π1. Similarly, it is proved that lim supd 1 dtrace(Sk d) < 2/π2. Finally, we have trace(Shk d ) = trace((ΣZ d ) 1/2ΣZh d (ΣZ d ) 1/2) + trace((ΣZ d ) 1/2ΣZk d (ΣZ d ) 1/2) π2 = d π1π2 from where we conclude that lim supd 1 dtrace(Shk d ) < 1/(π1π2). To prove part (a) in Theorem 2.2, we have h = k. So, (µZh µZk)d = 0d and ΣZh d + ΣZk d = 2ΣZh d . If we take Ad = Sh d , according to Remark 2.1.2, Lemma C.1 gives that assumption (2) holds for this selection of Ad. Therefore, (7) follows from Theorem 2.1 because in this case Lh µ = 0 and once we assume that Lh S exists, Lemma C.1 gives that it is finite. In case (b), we have h = k. We take Ad = Shk d and b = µZh µZk. Similarly, as in (a), we have that assumption (2) also holds in this case and Theorem 2.1 implies DΣZ d d (Z1, Z2) P Lhk µ + Lhk S as d . Now, (8) follows because Lemma C.1 gives that Lhk µ = 0 and, also, that Lhk S is finite. D Proof of Proposition 2.3 Under the conditions of Proposition 2.3, the number of significant (unique) eigenvalues of the matrix Γ is 2. Recall that N is fixed here. Consider the standardized distance matrix Dd with the (i, j)-th element as DΣZ d d (Zi, Zj) for 1 i, j N and d N. We have a sequence of matrices Dd P D0 as d (componentwise). Since the map D to Γ is clearly continuous w.r.t. this convergence, we have that Γd P Γ as d . Let us denote the eigenvalues of Γd (respectively, Γ) to be βd 1, . . . , βd N (respectively, β1, . . . , βN). Since eigenvalues are continuous functions of the respective matrices, we have βd j P βj as d for all 1 j N. Let us now look into the following: i=1 I(|βd i | > ad) P i=1 I(|β0 i | > 0) as d Published in Transactions on Machine Learning Research (09/2023) with ad 0 as d at an appropriate rate. Recall that the limiting quantity on the right should give us the correct number of clusters. Consider the sequence {1/m}m N. Let us take i {1, . . . , N} such that βd i P 0 as d . Thus, for every ϵ, δ > 0 there exists Di δ,ϵ such that if d Di δ,ϵ then P[|βd i | > δ] < ϵ. In particular, if we take δ = ϵ = 1/m, there exists Di m such that if d Di m, then we have P |βd i | > 1 Without loss of generality, we can assume that Di 1 < Di 2 < , and consider the sequence ( 2 if 1 i < Di 1, 1 m if Di m i < Di m+1 for some m 1. Then, obviously ai d 0 and P I(|βd i | > ai d) > 0 = P |βd i | > ai d < ai d. If we define ad = sup{ai d : β0 i = 0}, and i satisfies that β0 i = 0, then I(|βd i | > ad) P 0 as d . A similar reasoning allows us also to conclude that if |β0 i | > 0, then I(|βd i | > ad) P 1 as d . E Proof of Theorem 2.5 In this proof, we use the superindex d in Gd i to emphasize that the groupings can change with the dimension d N. Proposition 2.3 implies that Kd = 2 with probability converging to one. Note that ϕ(G1, . . . , GJ) has an alternative mathematical expression as u,v Gh u v 2, (24) where |G| denotes the cardinality of the set G. Let us denote the rows/columns of Γd as γd 1, . . . , γd N. The structure of Γd implies that γd i γd j 2 P 0 as d iff i, j Ch for h {1, 2}. So, if each Gd h for h = 1, 2 contains observations from the same population, then ϕd(Gd 1, Gd 2) P 0 as d . Let us assume that on the contrary, there exists a subsequence of dimensions {dk} such that for every k there exists at least a couple of points ik, jk with ik Gd 1 and jk Gd 2 (say). Since the number of points is finite, there exists a further subsequence {dk } such that both sequences {ik } and {jk } are constant. Therefore, for those subsequences, we will have lim inf d ϕd(G1, G2) lim d γdk ik γdk jk 2 P γ12 > 0. So, for the minimization of ϕd(Gd 1, Gd 2), each Gd h must contain all observations from a single population with probability converging to one as the dimension increases. This proves the convergence in probability of the Rand index Rd,N to zero as d . F Rank of the Matrix Γ Identifying number of clusters from the matrix Γ is not equivalent to finding the rank of the matrix Γ. Published in Transactions on Machine Learning Research (09/2023) Lemma F.1 The rank of the matrix Γ is less then or equal to J. Moreover, equality is guaranteed only when J 3. Proof: Trivially, rank(Γ) J. Let us denote the reduced Echelon form of N N matrix Γ as Γ . Thus, the matrix Γ is a J J symmetric matric with γij > 0 and distinct when i = j, while γii = 0. Moreover, for J = 3, we have det(Γ ) = det 0 γ12 γ13 γ12 0 γ23 γ13 γ23 0 = 2γ12γ13γ23 = 0. In the case J = 4, if γ12 = γ13γ24 + γ14γ23 + 2 γ13γ14γ23γ24 γ34 , then a simple computation gives that det(Γ ) = 0. This happens, for instance, if we consider the following matrix (with all positive and distinct off-diagonal entries): 0 t 1 2 t 0 3 4 1 3 0 5 2 4 5 0 where t = 2 + 4 G Proof of Theorem 2.6 In order to simplify the writing, we will write d instead of d N. We will use the notation αd 2 := Pd i=1(αd i )2 1/2 . The real r.v. s {ui} are assumed to be i.i.d. with standard normal distribution. The following lemma is deduced from Lemma 1 in Laurent and Massart (2000) on p. 1325, after some simple computations, taking into account that αd 2 αd . We state it here for further reference. Lemma G.1 If Zd = Pd i=1 αd i (u2 i 1) and x 1, then P [|Zd| 4x αd ] 2 exp( x). We will also employ the following well known bound for the tail of the standard normal distribution: P[|N(0, 1)| t] 2 π exp( t2/2) for all t 1. (25) Proof of Theorem 2.6 : Let us show part b). The proof of (14) is similar to that of (16). We use the notation md = (ΣZ d ) 1/2(µZ1 µZ2)d and ui d = (ΣZ d ) 1/2(Zi µi)d with d N, where Zi is a generic observation with distribution Pi for i = 1, 2. Moreover, with an obvious abuse of notation, we will often write ui d CN i with d N for i = 1, 2. Recall that Lµ = 0 and LS < (see part (c) in Theorem 2.2). Repeating the first steps in the proof of Theorem 2.1, we have that sup Z1 CN 1 ,Z2 CN 2 DΣZ d d (Z1, Z2) 1 dtrace(S12 d ) 1 d md 2 + sup u1 CN 1 ,u2 CN 2 1 d u1 d u2 d 2 1 dtrace(S12 d ) (26) + 2 sup u1 CN 1 ,u2 CN 2 md, u1 d u2 d , (27) Published in Transactions on Machine Learning Research (09/2023) and it is enough to prove that the terms in (26) and (27) converge to zero in probability. The first term in (26) converges to zero by first part of (c) in Theorem 2.2. Concerning the second term, let N1, N2 be the number of elements in CN 1 and CN 2 , respectively. Since N1 + N2 = N, it is clear that N1N2 N 2/4. Let ε > 0. We have that sup u1 CN 1 ,u2 CN 2 1 d u1 d u2 d 2 1 dtrace(S12 d ) > ε u1 CN 1 ,u2 CN 2 1 d u1 d u2 d 2 1 dtrace(S12 d ) > ε 4 P 1 d u1 d u2 d 2 1 dtrace(S12 d ) > ε , (28) where u1 and u2 are associated with some Z1 CN 1 and Z2 CN 2 , respectively. However, it is clear that 1 d u1 d u2 d 2 1 dtrace(S12 d ) 1 i=1 αd i (u2 i 1). Take x = εd/(4 αd ). By assumption (15), we have d/ αd and eventually x 1. So, from Lemma G.1, we obtain i d αd i (u2 i 1) 2 exp εd 4 αd + 2 log N , which converges to zero by assumption (15). Fix ϵ > 0. For the third term, in equation (27) we have that sup u1 CN 1 ,u2 CN 2 1 d| md, u1 d u2 d | > ε d| md, u1 d u2 d | > ε i d mdi(αd i )1/2ui |N(0, 1)| > ε d q P i d(mdi)2αd i 1 23/2π1/2 exp i d(mdi)2αd i + 2 log N 1 23/2π1/2 exp i d(mdi)2 + 2 log N which converges to 0 because of the fact that Lµ = 0 (see part (c) in Theorem 2.2) and (15). The same assumption allows us to apply inequality (25) to equation (29). H Result Related to Remark 2.6.1 Proposition H.1 Under assumptions of Theorem 2.2, if we assume that log N d N 0, then conditions (13) and (15) hold. Published in Transactions on Machine Learning Research (09/2023) Proof: Fix h {1, 2}, and note that (ΣZ d ) 1 = (πhΣZh d ) 1 Pd, where Pd = (πhΣZh d + πhΣZh d (T k d ) 1πhΣZh d ) 1 is a p.d. matrix and T k d is the matrix πkΣZk d + π1π2[µZ1 d µZ2 d ][µZ1 d µZ2 d ]T with k = h {1, 2}. Further, Id + ΣZ d Pd = 1 πh ΣZ d (ΣZh d ) 1. From here, Weyl s inequality gives πh ΣZ d (ΣZh d ) 1) = 1 πh αmin(ΣZ d (ΣZh d ) 1). (30) Note the fact that the eigenvalues of the matrices AB and BA are same. So, the matrices Sh d and ΣZh d (ΣZ d ) 1 will have the same eigenvalues. Furthermore, the eigenvalues of Sh d are the inverses of the eigenvalues of ΣZ d (ΣZh d ) 1. Thus, (30) gives that αmax(Sh d ) < 2 πh (free of d). (31) We now have αd N 1 log N d N 0 as N . Equation (31) now implies that condition (13) holds if we assume log N d N 0 as N . Fix h = k {1, 2}. Our second matrix of interest is Shk d = (ΣZ d ) 1/2(ΣZh d + ΣZk d )(ΣZ d ) 1/2. Since the matrices are symmetric, we have αmax(Shk d ) αmax((ΣZ d ) 1/2ΣZh d (ΣZ d ) 1/2) + αmax((ΣZ d ) 1/2ΣZk d (ΣZ d ) 1/2). Again, the eigenvalues of (ΣZ d ) 1/2ΣZi d (ΣZ d ) 1/2 and of ΣZi d (ΣZ d ) 1 will be equal for i = h, k. So, αmax(Shk d ) αmax(ΣZh d (ΣZ d ) 1) + αmax(ΣZk d (ΣZ d ) 1) = 1 αmin(ΣZ d (ΣZh d ) 1) + 1 αmin(ΣZ d (ΣZk d ) 1) πk = 1 πhπk (using equation (30)). From here, similarly as before, we would obtain that log N d N 0 implies (15) holds. I Proof of Theorem 3.1 Recall that we use subspaces Vd generated by estimates of the first d eigenfunctions of the covariance of Z. We begin with some notation and preliminary results which have been taken from Delaigle and Hall (2012) and Hall and Hosseini-Nasab (2006), or follow directly from the results there. Then, we will give the proof of Theorem 3.1. For every n N, let us consider ˆ 2 Z = Z 1 0 (ˆΣZ(s, t) ΣZ(s, t))2dsdt, δZ j = min k j (λk λk+1). Published in Transactions on Machine Learning Research (09/2023) In Delaigle and Hall (2012) and Hall and Hosseini-Nasab (2006), it is shown that if j 1, then |ˆλj λj| ˆ Z, (32) and that if j ˆRZ N (recall the definition of ˆRZ N in (17)), then ˆϕj ϕj 81/2 ˆ Z(δZ j ) 1, (33) ˆ Z = Op(N 1/2), (34) RZ N and ˆRZ N ˆλZ 1 η 1 N . (35) Moreover, if j ˆRZ N, there exists a k j such that δZ j = λk λk+1 ˆλk ˆλk+1 2 ˆ Z ηN 2 ˆ Z = ηN + o P (ηN), (36) where we have applied (32) and (17). Using (34) and the assumption on ηN, we can conclude that ηN > 2 ˆ Z from an index onward. Thus, (36) and (33) yield ˆϕj ϕj 81/2 ˆ Z ηN 2 ˆ Z . (37) From (32), (17) and (34), we obtain that λj ˆλj ˆ Z ηN ˆ Z = ηN + o P (ηN). (38) Now, we are in a position to prove Theorem 3.1. Proof of Theorem 3.1: The proof is based on Lemma I.1. The result follows trivially from this lemma, the fact that ˆRZ N P as n , and the result in Theorem 2.2. Lemma I.1 Under the assumptions in Theorem 3.1, it happens that ˆD ˆ RZ N (Z1, Z2) D ˆ RZ N (Z1, Z2) P 0 as n . Proof. For fixed Z1, Z2, let us denote u = Z1 Z2. Obviously, u = O(1) a.s. Let us denote (u1, . . . , u ˆ RZ n)T and (ˆu1, . . . , ˆu ˆ RZ n)T to be the projections of u on the subspaces generated by the first ˆRZ N eigenvectors of the matrices ΣZ ˆ RZ N and ˆΣZ ˆ RZ N , respectively, when written in the basis generated by those eigenvectors. Now, define ˆu Z j = u, ˆϕZ j = Z 1 0 u(t)ˆϕZ j (t)dt and similarly define uj for j N. Fix n N and take j ˆRZ N. We now have (uj)2 uj (λj)1/2 ˆuj (ˆλj)1/2 uj (λj)1/2 + ˆuj (ˆλj)1/2 ˆuj (λj)1/2 (ˆλj)1/2 ! uj (λj)1/2 + ˆuj (ˆλj)1/2 We analyze each term in this expression separately as follows: uj ˆuj 0 |u(t)||ϕj(t) ˆϕj(t)|dt 81/2 u ˆ Z (λj)1/2(ηn 2 ˆ Z) 81/2 u ˆ Z(η 3/2 n + o P (η 3/2 n )), (39) Published in Transactions on Machine Learning Research (09/2023) where we have applied the Cauchy-Schwarz inequality, (37), (34) and (38). On the other hand, we have ˆuj (λj)1/2 (ˆλj)1/2 0 |u(t)||ˆϕj(t)|dt |λj ˆλj| (λj)1/2 + (ˆλj)1/2 (λjˆλj)1/2 u ˆ Z (λj)1/2 + (ˆλj)1/2 (λjˆλj)1/2 1 2 u ˆ Z(η 3/2 n + o P (η 3/2 n )), (40) where we have applied (17) and (38). Concerning the final term, using (38) and (17) again, we obtain that uj (λj)1/2 + ˆuj (ˆλj)1/2 1 (λj)1/2 + 1 u (η 1/2 n + o P (η 1/2 n )). (41) Now, define C = 81/2 + 1. Combining (39), (40), (41), (35) and (34), we get the following: ˆD ˆ RZ N (Z, Z2) D ˆ RZ N (Z, Z2) 1 ˆRZ N C u 2 ˆ Z(η 2 n + o P (η 2 n )) = OP (n 1/2η 2 n ). By construction, ηn is such that nη5 n . So, we have | ˆD ˆ RZ N (Z1, Z2) D ˆ RZ N (Z1, Z2)| P 0 as n , and the lemma is proved. J Proof of Theorem 3.2 We now prove the following lemma. Lemma J.1 Under the assumptions in Theorem 3.2, we have that P[ ˆRN RN] 1 as N . Proof : Fix N N. From (32), we have that inf j RN(ˆλj ˆλj+1) inf j RN(λj λj+1) 2 ˆ Z (1 + δ)ηN 2 ˆ Z, and the proof ends because (34) and the fact that ηN N 1/5 imply that P[δηN 2 ˆ Z 0] 1. In this setting, recall that Lµ = 0 and LS < (see part (b) of Theorem 2.2). We will only prove part b); part a) being similar. W.l.o.g. we will assume that h = 1 and k = 2. Recall that for every Z1 and Z2, we have D ˆ RN (Z1, Z2) = 1 ˆRN Z1 Z2, ϕj 2 λj and ˆD ˆ RN (Z1, Z2) = 1 ˆRN Z1 Z2, ˆϕj 2 We are going to consider the function D ˆ RN (Z1, Z2) = 1 ˆRN Z1 Z2, ϕj 2 Published in Transactions on Machine Learning Research (09/2023) sup Z1 CN 1 ,Z2 CN 2 ˆD ˆ RN (Z1, Z2) L12 S sup Z1 CN 1 ,Z2 CN 2 ˆD ˆ RN (Z1, Z2) D ˆ RN (Z1, Z2) + sup Z1 CN 1 ,Z2 CN 2 D ˆ RN (Z1, Z2) D ˆ RN (Z1, Z2) + sup Z1 CN 1 ,Z2 CN 2 D ˆ RN (Z1, Z2) L12 S =: T1 + T2 + T3. Lemma J.1, and equations (35) and (32) imply that there exists C > 0 such that P[RN ˆRN CN 1/5] 1. Consequently, with probability going to 1, it happens that DRN (Z1, Z2) D ˆ RN (Z1, Z2) DCN 1/5(Z1, Z2). Since, by assumption (15), log N = o RN and trivially we have log N = o CN 1/5 , b) in Theorem 2.6 gives that T3 converges in probability to zero as N . Since L12 S < , this fact implies that sup Z1 CN 1 ,Z2 CN 2 D ˆ RN (Z1, Z2) = OP (1). (42) With respect to T2, we have T2 sup Z1 CN 1 ,Z2 CN 2 Z1 Z2, ϕj 2 ˆλj sup Z1 CN 1 ,Z2 CN 2 D ˆ RN (Z1, Z2) = Op(N 1/10), where last equality follows from (42), (32), (34), (35) and (17). Finally, given Z1 CN 1 , Z2 CN 2 , the Cauchy-Schwarz inequality and the fact that ˆϕj = ϕj = 1 imply ˆD ˆ Rn(Z1, Z2) D ˆ Rn(Z1, Z2) 1 ˆRN Z1 Z2, ˆϕj 2 Z1 Z2, ϕj 2 Z1 Z2, ˆϕj ϕj Z1 Z2, ˆϕj + ϕj ˆϕj ϕj ˆϕj + ϕj 2 Z1 Z2 2 1 = 2 Z1 Z2 2 HN. Moreover, the application of (33), (34), (36) and (17) gives that HN = OP (N 1/10), which in turn is equivalent to saying that there exists C > 0 such that P[Hn < CN 1/10] 1. This and the reasoning leading to (28) imply that to prove T1 P 0, it is enough to show that for every C > 0 N 2P h Z1 Z2 2 > CN 1/10i 0 as N , (43) Published in Transactions on Machine Learning Research (09/2023) where Z1 and Z2 came from distributions P1 and P2, respectively. To show (43), notice that Z1 Z2 follows a Gaussian distribution whose mean function is µZ1 µZ2 and its covariance is Σ12 = ΣZ1 + ΣZ2. Let γj with j N denote the ordered eigenvalues of Σ12. Consider a basis composed by the eigenfunctions of Σ12, we denote by (µZ1 µZ2)j the components of µZ1 µZ2 in this basis and {uj}j N is a sequence of i.i.d. real standard normal variables. Now, we have the following γ1/2 j uj + (µZ1 µZ2)j 2 γj(u2 j 1) + γj + (µZ1 µZ2)2 j + 2(µZ1 µZ2)jγ1/2 j uj γj(u2 j 1) + 2(µZ1 µZ2)jγ1/2 j uj + trace(Σ12) + µZ1 µZ2 2. Note that K := trace(Σ12) + µZ1 µZ2 2 < . Thus, P h Z1 Z2 2 > CN 1/10i = P γj(u2 j 1) + 2(µZ1 µZ2)jγ1/2 j uj > CN 1/10 K j=1 γj(u2 j 1) > 1 j=1 (µZ1 µZ2)jγ1/2 j uj > 1 =: P1 + P2. (44) Obviously, 1 4 CN 1/10 K . Thus, eventually 1 4 CN 1/10 K > 1 and from Lemma G.1, we have that j=1 γj(u2 j 1) > 1 CN 1/10 K . (45) For P2, first note that the real r.v. Pd j=1(µZ1 µZ2)jγ1/2 j uj is centered normal, with variance equal to j=1 (µZ1 µZ2)2 jγj γ1 j=1 (µZ1 µZ2)2 j γ1 µZ1 µZ2 2 for every d N. Therefore, j=1 (µZ1 µZ2)jγ1/2 j uj |N(0, 1)| > 1 4γ1/2 1 µZ1 µZ2 CN 1/10 K # 2 π exp 1 2γ1(4 µZ1 µZ2 )2 CN 1/10 K 2 , (46) where last inequality comes from (25) because, eventually 1 < CN 1/10 K /(4γ1/2 1 µZ1 µZ2 ). Finally, (44), (45), and (46) give (43) and consequently, T1 P 0 as N . Published in Transactions on Machine Learning Research (09/2023) Appendix II: Additional Material K Extension to non-Gaussian Processes Obviously, non-Gaussian processes can also be mutually singular. In fact, Theorem 4.3 in Rao and Varadarajan (1963) contains a sufficient condition for this property to be satisfied. This allows us to consider the possibility to extend previous results to cover non-Gaussian distributions. It is obvious that the proofs that we have developed can cover non-Gaussian distributions as long as they satisfy the due properties. In this subsection, we state the properties a distribution should satisfy in order the proofs can be extended. Let P1 and P2 be two probabilities on the Hilbert space H. Here, Z will denote a L2[0, 1]-valued random element with distribution π1P1 + π2P2 for some π1, π2 > 0 with π1 + π2 = 1. The basic assumption is the existence of a covariance of Z. We will also consider assumptions A.1 and A.2 (see Section 3 of the paper) and b L2[0, 1]. Given a p.d. d d matrix Ad and a d-dimensional subspace Vd L2[0, 1], we need to consider the d-dimensional random vector Ud = (Ad) 1/2(Z b)d and the covariance matrix Sd = A 1/2 d ΣZ d A 1/2 d , where ΣZ d is the covariance matrix of Zd and (Z b)d is the projection on Vd of (Z b) with d N. Let us write Ud E[Ud] = (u1, . . . , ud)T in the basis of the eigenvectors of Sd and let αd 1, . . . , αd d denote the eigenvalues of Sd. Therefore, ui/αd i for 1 i d are real, standardised random variables which we need to assume to be i.i.d. Similar properties must hold for the decomposition of Z in its eigenfunction basis (also see Dai et al (2017)). We finally need two exponential inequalities as those stated in (25) of Lemma G.1. L Discussion on GP Clustering for the Location Only Case We have some ideas to fix the problem with the location only case. Recall the notation used in Subsection 2.1 of the paper. As stated there, the problem in this case is that DΣZ d d (u, v) = 1 (ΣZ d ) 1/2(u v)d 2 = 1 Our idea is to replace the terms in the sum with some others going to 0 slowly enough (or, if possible, not converging to zero at all). To use this idea, our proposal is as follows: DΣZ d ,r d (u, v) := 1 ((ΣZ d ) 1/2)r(u v)d 2 = 1 λr i , with r I. (47) Here, I is the set of integers. In this article, we have studied the case when r = 1, i.e., DΣZ d ,1 d . However, this was not a strict requirement and we look into some possible scenarios below: If r {0, 1, 2, . . .}, then assumption A.2 in the main paper trivially gives that (ui vi)2 λr i (ui vi)2 λi eventually for large i. Consequently, we have DΣZ d ,r d (u, v) P 0 as d . When r {2, 3, . . .}, the transformation DΣZ d ,r d may be useful because 1/λr i will start to take high values (recall assumption A.2) and this may lead to separation between the observations of corresponding to different clusters. Keeping the viewpoint stated above in mind, we consider the transformation DΣZ d ,4 d (using r = 4 in (47)). Numerical results for the difference in location only setting stated in Section 4 are reported below. We have excluded Example II from this comparison because, as stated earlier, the two GPs have no differences in their means. Published in Transactions on Machine Learning Research (09/2023) Table L.1: Adjusted Rand distances for different GPs with differences only in locations (with standard error in brackets). Ex. k-means spectral mclust funclust CL DHP CD k-means spectral mclust I 0.0001 0.0001 1.0000 0.0002 0.0001 0.0001 0.0012 0.0814 0.0016 (0.0001) (0.0001) (0.0000) (0.0001) (0.0000) (0.0001) (0.0002) (0.0072) (0.0003) III 0.0646 0.0960 1.0000 0.0795 0.0945 0.1480 0.1965 0.1948 0.1649 (0.0015) (0.0018) (0.0000) (0.0009) (0.0045) (0.0047) (0.0025) (0.0027) (0.0017) IV 0.1606 0.2111 1.0000 0.0318 0.1015 0.0134 0.1257 0.3777 0.1784 (0.0007) (0.0008) (0.0000) (0.0003) (0.0000) (0.0004) (0.0019) (0.0085) (0.0021) As expected, the performance of k-means is quite good in Examples I and III. Both DHP and CL also perform quite well securing a first place in some cases. The proposed statistic DΣZ d ,4 d shows significant improvement (recall from part (b) of Theorem 2.2 that Lµ = 0 for the earlier transformation DΣZ d ,1 d ), and this is reflected in the numerical figures of Table L.1. Clearly, there is scope of further work with the proposed transformation DΣZ d ,r d for r {2, 3, . . .}, both theoretically as well as numerically. M Review of the paper by Delaigle et al (2019) As stated, Delaigle et al (2019) is related to perfect clustering and it is the only paper on perfect clustering we are aware of. In this section, we analyze the relation between our proposal and this paper. The proposal by Delaigle et al (2019) is based on finding a finite-dimensional subspace in which the data are projected, and clustering is done by applying a modification of the k-means algorithm on those projections. A theoretical result related to perfect clustering is stated in Theorem 1 of this paper. In the homoscedastic case, Delaigle et al (2019) gives an explicit expression of the subspace in which the data should be projected (see Theorem 2 of this paper). The technique proposed in this paper has some advantages over our proposal in the sense that they can handle the homoscedastic (differences only in location) case. However, it suffers from several limitations, the main being its difficulty to deal with more than two populations. As stated in Section 5.2 of Delaigle et al (2019), "...it is not clear how to extend... [their method] ...to more than two populations.". The only exception being the case in which one needs to assume that the data follow a binary hierarchical structure, where the groups can be sequentially split into two. As a consequence of this, Delaigle et al (2019) do not propose a procedure to estimate the true number of clusters. On the technical side, the theory of Delaigle et al (2019) has some limitations. It requires to arbitrarily fix p N. Then, the data are projected on a p-dimensional subspace in which the clustering is to be done. New issues appear in the way in which the subspace should be chosen as well as the clusters need to be constructed. According to Theorem 1 of this paper, the generators of the subspace must be chosen in a finite set with cardinality an as the sample size n . Moreover, the partition of the data set must be chosen between those in a finite set of Voronoi tessellations of Rp with cardinality bn as n . Additionally, the result needs technical conditions like the existence of some c (0, 1) such that for every C > 0 it happens that ap nbn exp( Cnc) as n . Published in Transactions on Machine Learning Research (09/2023) N Full Numerical Results Full results for two scenarios are given below. Table N.1: Adjusted Rand distances for different GPs with differences in locations and scales (with standard error in brackets). Ex. CL1 CL2 DHP1 DHP2 I 0.0239 0.0554 0.8386 0.0818 (0.0007) (0.0005) (0.0064) (0.0025) II 0.5767 0.9967 0.5470 0.5149 (0.0045) (0.0018) (0.0047) (0.0049) III 0.2891 0.9962 0.4137 0.5613 (0.0000) (0.0000) (0.0054) (0.0060) IV 0.1833 0.6660 0.1379 0.5211 (0.0000) (0.0000) (0.0033) (0.0061) Table N.2: Adjusted Rand distances for different GPs with differences only in scales (with standard error in brackets). Ex. CL1 CL2 DHP1 DHP2 I 0.8269 1.0017 0.9989 0.9966 (0.0000) (0.0000) (0.0005) (0.0008) II 0.9065 1.0019 1.0001 0.9999 (0.0007) (0.0005) (0.0002) (0.0003) III 0.9994 0.9998 0.9984 0.9967 (0.0000) (0.0000) (0.0004) (0.0007) IV 0.8464 0.9928 0.9994 0.9980 (0.0000) (0.0000) (0.0003) (0.0006) Full result for the location only case (using the transformation DΣZ d ,4 d stated in Section L of Appendix II) is given next. Table N.3: Adjusted Rand distances for different GPs with differences in locations (with standard error in brackets). Ex. C1 C2 DHP1 DHP2 I 0.0001 0.0001 0.9896 0.0001 (0.0000) (0.0000) (0.0001) (0.0001) III 0.0945 0.9975 0.1480 0.1623 (0.0010) (0.0043) (0.0047) (0.0075) IV 0.1015 0.9001 0.0134 0.1473 (0.0000) (0.0000) (0.0004) (0.0039) R codes for our clustering methods are available from this link: GP clustering. Published in Transactions on Machine Learning Research (09/2023) O Pseudo-codes of Our Methods Algorithm 1 Clustering Algorithm Data: Z1, . . . , ZN are observed at p time points (i.e., p-dimensional data); Compute the sample variance-covariance matrix SN; O(Np2) Define d N = min{p, N}; Spectral decomposition of the d N d N matrix SN yields e1, . . . , ed N (eigen functions) and λ1, . . . , λd N (eigen values); O(p3) Fix 1 d d N; for i = 1 : N do for j = 1 : d do Compute zij = Zi, ej ; end for end for O(N 2p) for i = 1 : N do for i = 1 : N do Compute Dd(Zi, Z i) = 1 (zij zi j)2 end for end for O(N 2d) for i = 1 : N do for i = 1 : N do temp = 0; for t = i, i do Compute temp = temp + [Dd(Zi, Zt) Dd(Z i, Zt)]2; end for γd(Zi, Z i) = temp; end for end for O(N 2d) Spectral decomposition of the N N matrix ΓN yields β1, . . . , βN (eigen values); O(N 3) Use optishrink to get Kd; Implement any suitable clustering algorithm on ΓN with Kd as input for the number of clusters; Obtain class labels for the N observations, or a given data set. Algorithm 2 Cross Validation Algorithm to Choose the Value of d Consider the data S = {Z1, . . . , ZN}; for d = 2 : d N do; Fix B N; for b = 1 : B do Split the set S into three disjoint subsets S1b, S2b and S3b; Use Algorithm 1 with data points corresponding to S1b to get a cluster assignment for data points corresponding to S3b (say, C1); Use Algorithm 1 with data points corresponding to S2b to get a cluster assignment for data points corresponding to S3b (say, C2); Compute a distance between C1 and C2 (say, Db) end for O(Bd N) Average over D1, . . . , DB to get DCV d ; O(d N) end for Return d CV = arg min2 d d N DCV d .