# multiscale_classification_using_localized_spatial_depth__08c80f6c.pdf

Journal of Machine Learning Research 17 (2016) 1-30 Submitted 3/16; Revised 10/16; Published 12/16

Multi-scale Classiﬁcation using Localized Spatial Depth

Subhajit Dutta duttas@iitk.ac.in Department of Mathematics and Statistics Indian Institute of Technology Kanpur 208016, India.

Soham Sarkar sohamsarkar1991@gmail.com Anil K. Ghosh akghosh@isical.ac.in Theoretical Statistics and Mathematics Unit Indian Statistical Institute 203, B. T. Road, Kolkata 700108, India.

Editor: Jie Peng

In this article, we develop and investigate a new classiﬁer based on features extracted using spatial depth. Our construction is based on ﬁtting a generalized additive model to posterior probabilities of diﬀerent competing classes. To cope with possible multi-modal as well as non-elliptic nature of the population distribution, we also develop a localized version of spatial depth and use that with varying degrees of localization to build the classiﬁer. Final classiﬁcation is done by aggregating several posterior probability estimates, each of which is obtained using this localized spatial depth with a ﬁxed scale of localization. The proposed classiﬁer can be conveniently used even when the dimension of the data is larger than the sample size, and its good discriminatory power for such data has been established using theoretical as well as numerical results.

Keywords: Bayes classiﬁer, elliptic distributions, generalized additive models, HDLSS asymptotics, uniform strong consistency, weighted aggregation of posteriors.

1. Introduction

In a supervised classiﬁcation problem with J competing classes, we have nj labeled observations xj1, . . . , xjnj from the j-th class (1 j J). We use this training sample consisting of n = PJ j=1 nj observations to construct a decision rule for classifying an unlabeled observation x to one of these J classes. If πj, fj and p(j| ) denote the prior probability, the probability density function and the posterior probability of the j-th class, respectively, then the Bayes classiﬁer assigns x to the class j0, where j0 = argmax1 j J p(j|x) = argmax1 j J πjfj(x). However, the fj s or the p(j| ) s are usually unknown in practice, and one needs to estimate them from the training sample. Popular parametric classiﬁers like linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) (see, e.g., Hastie et al., 2009) are motivated by parametric model assumptions on the fj s. So, they may lead to poor classiﬁcation when these assumptions fail to hold, and the class boundaries of the Bayes classiﬁer have complex geometry. On the other hand, nonparametric classiﬁers like those based on k-nearest neighbors (k-NN) (see, e.g., Cover and Hart, 1967) and kernel

c 2016 Subhajit Dutta, Soham Sarkar, and Anil K. Ghosh.

Dutta, Sarkar and Ghosh

10 5 0 5 10

10 5 0 5 10

10 5 0 5 10

10 5 0 5 10

(a) Example E1

(b) Example E2

Figure 1: Bayes class boundaries in R2.

density estimates (KDE) (see, e.g., Scott, 2015) are more ﬂexible and free from such model assumptions. But, they suﬀer from the curse of dimensionality and are often not suitable for high-dimensional data.

To demonstrate this, let us consider two examples denoted by E1 and E2. E1 involves a classiﬁcation problem with two classes in Rd, where the distribution of the ﬁrst class is an equal mixture of Nd(0d, Id) and Nd(0d, 10Id), and that of the second class is Nd(0d, 5Id). Here Nd denotes the d-variate normal distribution, 0d = (0, . . . , 0)T Rd and Id is the d d identity matrix. In E2, each class distribution is an equal mixture of two uniform distributions. For the ﬁrst (respectively, the second) class, it is a mixture of Ud(0, 1) and Ud(2, 3) (respectively, Ud(1, 2) and Ud(3, 4)), where Ud(r1, r2) denotes the uniform distribution over the region {x Rd : r1 x r2} with 0 r1 < r2 < and being the Euclidean norm. Figure 1 shows the class boundaries of the Bayes classiﬁer for these two examples when d = 2 and π1 = π2 = 1/2. The regions colored grey (respectively, black) correspond to observations classiﬁed to the ﬁrst (respectively, the second) class by the Bayes classiﬁer. It is clear that classiﬁers like LDA and QDA, or any other classiﬁer with linear or quadratic class boundaries will deviate signiﬁcantly from the Bayes classiﬁer in both examples. A natural question then is how standard nonparametric classiﬁers like those based on k-NN and KDE perform in such examples.

Figure 2 shows the average misclassiﬁcation rates of these two classiﬁers along with the Bayes risks for diﬀerent values of d. These classiﬁers were trained on a sample of size 100 from each class, and the misclassiﬁcation rates were computed based on 250 independent observations from each class. This procedure was repeated 500 times to calculate the average misclassiﬁcation rates. Smoothing parameters associated with k-NN and KDE (i.e., the number of neighbors k in k-NN and the bandwidth in KDE) were chosen by minimizing leave-one-out cross-validation estimates of misclassiﬁcation rates (see, e.g., Hastie et al., 2009). Figure 2 shows that in E1, the Bayes risk decreases to zero as d grows. Since the class distributions in E2 have disjoint supports, the Bayes risk is zero for all values of d. But in both examples, the misclassiﬁcation rates of these two nonparametric classiﬁers increased to almost 50% as d increased.

Multi-scale Classification using Localized Spatial Depth

0.0 0.1 0.2 0.3 0.4 0.5

misclassification rate

0.0 0.1 0.2 0.3 0.4 0.5

misclassification rate

0.0 0.1 0.2 0.3 0.4 0.5

misclassification rate

Bayes KDE k NN

(a) Example E1

0.0 0.1 0.2 0.3 0.4 0.5

misclassification rate

0.0 0.1 0.2 0.3 0.4 0.5

misclassification rate

0.0 0.1 0.2 0.3 0.4 0.5

misclassification rate

Bayes KDE k NN

(b) Example E2

Figure 2: Average misclassiﬁcation rates of nonparametric classiﬁers and the Bayes classiﬁer for d = 2, 5, 10, 20, 50 and 100.

These two examples clearly show the necessity to develop new classiﬁers to cope with such situations. We use the idea of data depth for this purpose. Over the last three decades, data depth (see, e.g., Liu et al., 1999; Zuo and Serﬂing, 2000) has emerged as a powerful tool for multivariate data analysis with applications in many areas including supervised and unsupervised classiﬁcation (see, e.g., Jornsten, 2004; Ghosh and Chaudhuri, 2005a,b; Hoberg and Mosler, 2006; Xia et al., 2008; Dutta and Ghosh, 2012; Li et al., 2012; Lange et al., 2014; Paindaveine and Van Bever, 2015). Spatial depth (also known as the L1 depth) is a popular notion of data depth that was introduced and studied by Vardi and Zhang (2000) and Serﬂing (2002). The spatial depth (SPD) of an observation x Rd with respect to (w.r.t.) a distribution function F on Rd is deﬁned as SPD(x, F) = 1 EF [u(x X)] , where X F, and u( ) is the multivariate sign function given by u(x) = x 1x if x = 0d Rd, and u(0d) = 0d. This version of SPD is invariant w.r.t. location shift, orthogonal, and homogeneous scale transformations. SPD is often computed on the standardized version of X as well. In that case, it is deﬁned as

SPD(x, F) = 1 EF [u(Σ 1/2(x X))] ,

where Σ is a scatter matrix associated with F. One can check that if Σ has the aﬃne equivariance property (see, e.g., Zuo and Serﬂing, 2000), this version of SPD is aﬃne invariant. To diﬀerentiate between these two versions of SPD, we will denote them by SPD

and SPD , respectively. If Σ = λId for some λ > 0 (e.g., if F is spherically symmetric, see Fang et al., 1990), then SPD and SPD coincide. Throughout this article, the term SPD will be used in a generic sense. Like other depth functions, SPD provides a center-outward ordering of multivariate data. An observation has higher (respectively, lower) depth if it lies close to (respectively, away from) the center of the distribution. In other words, given an observation x and a

Dutta, Sarkar and Ghosh

pair of probability distributions F1 and F2, if SPD(x, F1) is larger than SPD(x, F2), one would expect x to come from F1 instead of F2. Based on this simple idea, the maximum depth classiﬁer was developed by Jornsten (2004); Ghosh and Chaudhuri (2005b). For a J class problem involving distributions F1, . . . , FJ, the maximum depth classiﬁer based on SPD assigns an observation x to the j0-th class, where j0 = argmax1 j J SPD(x, Fj).

0 2 4 6 8 10

0.0 0.2 0.4 0.6 0.8 1.0

SPD(x, F1) SPD(x, F2)

(a) Example E1

0 1 2 3 4 5

0.0 0.2 0.4 0.6 0.8 1.0

SPD(x, F1) SPD(x, F2)

(b) Example E2

Figure 3: SPD(x, F1) and SPD(x, F2) for diﬀerent values of x when x R2.

In E1 and E2, since the class distributions are spherically symmetric, SPD coincides with SPD , and they become a monotonically decreasing function of the Euclidean norm of x (see Lemma 7). In Figure 3, we have plotted SPD(x, F1) and SPD(x, F2) for diﬀerent values of x in E1 and E2, where F1 and F2 are the distributions of the two classes and x R2. It is transparent from Figure 3 that the maximum depth classiﬁer based on SPD will fail in both examples. In E1, for all values of x smaller (respectively, greater) than a constant close to 4, the observations will be classiﬁed to the ﬁrst (respectively, the second) class by the maximum SPD classiﬁer. On the other hand, this classiﬁer will classify all observations to the second class in E2. Most of the popular depth functions turn out to be monotonically decreasing functions of the Euclidean norm in the case of a spherically symmetric distribution. So, the maximum depth classiﬁers based on those depth functions will have similar problems as well.

In Section 2, we develop a modiﬁed classiﬁer based on SPD to overcome this limitation of maximum depth classiﬁers. In the literature, most of the modiﬁed depth based classiﬁers are developed mainly for two class problems (see, e.g., Ghosh and Chaudhuri, 2005b; Dutta and Ghosh, 2012; Li et al., 2012; Lange et al., 2014). For classiﬁcation problems involving J(> 2) classes, one usually solves J 2 binary classiﬁcation problems taking one pair of classes at a time and then uses either majority voting (see, e.g., Friedman, 1996) or pairwise coupling (see, e.g., Hastie and Tibshirani, 1998) to make the ﬁnal classiﬁcation. Unlike those existing methods, our proposed classiﬁer directly addresses the J class problem.

Multi-scale Classification using Localized Spatial Depth

Almost all existing depth based classiﬁers require ellipticity of class distributions to achieve Bayes optimality. To cope with possible multi-modal as well as non-elliptic population distributions, we construct a localized version of spatial depth (LSPD) in Section 3. In Section 4, we develop a multi-scale classiﬁer based on LSPD. Relevant theoretical results on SPD, LSPD and the resulting classiﬁers are studied in these sections. In Sections 5 and 6, some simulated and benchmark data sets are analyzed to demonstrate the usefulness of these proposed classiﬁers. An advantage of SPD over other depth functions is its computational simplicity. Classiﬁers based on SPD and LSPD can be constructed even when the dimension exceeds the sample size. We deal with such high dimension, low sample size (HDLSS) cases in Section 7, and show that both classiﬁers turn out to be optimal under a fairly general framework. Several high-dimensional data sets are also analyzed to evaluate their empirical performance. All proofs and mathematical details are given in Appendix A.

2. Bayes Optimality of a Classiﬁer Based on Spatial Depth

Let us assume that f1, . . . , f J are density functions of J elliptically symmetric distributions (Fang et al., 1990) on Rd, where fj(x) = |Σj| 1/2gj( Σ 1/2 j (x µj) ) for 1 j J. Here µj Rd, Σj is a d d symmetric and positive deﬁnite matrix, and gj( t ) is a probability density function of a spherically symmetric distribution on Rd for 1 j J. For such classiﬁcation problems involving general elliptic populations with equal or unequal priors, the next theorem establishes the Bayes optimality of a classiﬁer, which is based on z (x) = (z 1(x), . . . , z J(x))T = (SPD (x, F1), . . . , SPD (x, FJ))T .

Theorem 1 If the densities of J competing classes are elliptically symmetric, the posterior probabilities of these classes satisfy the logistic regression model given by

p(j|x) = p(j|z (x)) = exp(Φj(z (x)))

[1 + P(J 1) k=1 exp(Φk(z (x)))] for 1 j (J 1) (1)

and p(J|x) = p(J|z (x)) = 1

[1 + P(J 1) k=1 exp(Φk(z (x)))] . (2)

Here Φj(z (x)) = ϕj1(z 1(x)) + + ϕj J(z J(x)), and ϕjis are appropriate real-valued functions of πj and fj for 1 j J. Consequently, the Bayes rule assigns an observation x to the class j0, where j0 = argmax1 j J p(j|z (x)).

Theorem 1 shows that the Bayes classiﬁer is based on a nonparametric multinomial additive logistic regression model for the posterior probabilities, which is a special case of generalized additive models (GAM) (Hastie and Tibshirani, 1990). If the prior probabilities of J classes are equal, and f1, . . . , f J are all elliptic and unimodal diﬀering only in their locations, this Bayes classiﬁer reduces to the maximum depth classiﬁer (Ghosh and Chaudhuri, 2005b) (see Remark 8 after the proof of Theorem 1 in Appendix A). A special case of Theorem 1 with Σj = λj Id, where λj > 0 for 1 j J is stated below.

Corollary 2 If the densities of J competing classes are spherically symmetric (i.e., fj(x) = gj( x µj ) for 1 j J), then the posterior probabilities of these classes satisfy the logistic regression model given in Theorem 1 with z (x) replaced by z (x) = (SPD (x, F1), . . ., SPD (x, FJ))T .

Dutta, Sarkar and Ghosh

For any ﬁxed i and j, one can calculate the J-dimensional vector z (xji) (or, z (xji)), where xji is the i-th labeled observation from the j-th class for 1 i nj and 1 j J. These z (xji)s (or, z (xji)s) can be viewed as realizations of the vector of covariates in a nonparametric multinomial additive logistic regression model, where the response corresponds to the class label that belongs to {1, . . . , J}. Now, a classiﬁer based on SPD can be constructed by ﬁtting a GAM with the logistic link function. This procedure can be viewed as a multinomial logistic regression in the J-dimensional depth plot. Lange et al. (2014); Li et al. (2012); Mozharovskyi et al. (2015) used such plots for nonparametric classiﬁcation. Recently, Cuesta-Albertos et al. also considered GAM to construct a depth based classiﬁer for functional data. In practice, we use a random sample x1, . . . , xn generated from F to compute the empirical versions of SPD and SPD , which are given by

SPD (x, Fn) = 1

i=1 u(x xi)

and SPD (x, Fn) = 1

i=1 u(bΣ 1/2(x xi))

respectively, where bΣ is an estimate of Σ, and Fn is the empirical distribution of the data x1, . . . , xn. Clearly, SPD is aﬃne invariant if bΣ has the aﬃne equivariance property. The resulting classiﬁer worked quite well in examples E1 and E2, and we shall see the numerical results later in Section 5.1.

3. Extraction of Small Scale Distributional Features by Localization of Spatial Depth

Under elliptic symmetry, the density function of a class can be expressed as a function of SPD , and hence the depth contours coincide with the density contours. This is the main mathematical argument used in the proof of Theorem 1. For non-elliptic distributions, where the density function cannot be expressed as a function of SPD, such mathematical arguments are no longer valid. Consider an equal mixture of Nd(0d, 0.25Id), Nd(21d, 0.25Id) and Nd(41d, 0.25Id), where 1d = (1, . . . , 1)T denotes a d-dimensional vector with all elements equal to 1. We have plotted the density contours in Figure 4(a) and SPD contours in Figure 4(b) when d = 2. In this trimodal distribution, the SPD contours failed to match the density contours. As a second example, we consider a d-dimensional distribution with independent components, where the i-th component is exponential with the scale parameter d/(d i + 1) for 1 i d. Figures 5(a) and 5(b) show the density contours and the SPD

contours, respectively, when d = 2. Even in this example, SPD and density contours diﬀered signiﬁcantly. We observed a similar picture for contours based on SPD as well. To cope with this issue, we suggest a localization of SPD. Note that SPD (x, F) = 1 EF [u(x X)] is constructed by assigning the same weight to each unit vector u(x X) and ignoring the signiﬁcance of the distance between x and X. By introducing a weight function, which takes account of this distance, one can extract important features related to the local geometry of the data. To capture these local features, we use a kernel function K( ) and deﬁne

Γ h(x, F) = EF [Kh(t)] EF [Kh(t)u(t)] ,

Multi-scale Classification using Localized Spatial Depth

(a) Density

(c) LSPD h=0.4

Figure 4: Contours of density, SPD and LSPD h (with h = 0.4) functions for a symmetric, trimodal density function.

0 1 2 3 4 5

0 1 2 3 4 5

(a) Density

0 1 2 3 4 5

0 1 2 3 4 5

0 1 2 3 4 5

0 1 2 3 4 5

(c) LSPD h=0.25

Figure 5: Contours of density, SPD and LSPD h (with h = 0.25) functions for the density function f(x1, x2) = 0.5 exp{ (x1 + 0.5x2)}I{x1 > 0, x2 > 0}.

where t = (x X) and Kh(t) = h d K(t/h). For our theoretical investigation, we will assume K to be a continuous probability density function on Rd that satisﬁes the following properties:

(K1) K(t) = g0( t ), where g0 is a decreasing function with g0(0) < and g0( t ) 0 as t ,

(K2) K(t) has bounded ﬁrst derivatives, and

Rd t K(t)dt < .

The Gaussian kernel K(t) = (

2π) d exp( t 2/2) is a possible choice. It is desirable that localized spatial depth (LSPD) approximates the class density, or a monotone function of it for small values of h. This will ensure that the class densities and hence the class posterior probabilities become functions of LSPD as h 0. On the other hand, one should expect that as h , LSPD should tend to SPD, or a monotone function of it. However,

Dutta, Sarkar and Ghosh

Γ h(x, F) 0 as h . So, we re-scale Γ h(x, F) by an appropriate factor of h to deﬁne LSPD as follows:

LSPD h(x, F) = Γ h(x, F) if h 1, hdΓ h(x, F) if h > 1. (3)

LSPD h deﬁned in this way is a continuous function of h. For d = 2, Figures 4(c) and 5(c) show that unlike SPD contours, LSPD h contours matched the density contours in both examples. Using t = Σ 1/2(x X) in the deﬁnition of Γ h(x, F), one gets Γ h(x, F), and LSPD h is deﬁned using Γ h(x, F) in the same way. Clearly, LSPD h is aﬃne invariant if Σ is aﬃne equivariant. When Σ = λId, we obtain Γ h(x, F) = λd/2Γ h (x, F) with h = h

λ, and using this expression, one can derive the relation between LSPD h and LSPD h. The vector z h(x) = (LSPD h(x, F1), . . . ,LSPD h(x, FJ))T has the desired behavior as shown in Theorem 3.

Theorem 3 If f1, . . . , f J are continuous density functions with bounded ﬁrst derivatives, and Σj is the scatter matrix corresponding to the j-th class (1 j J), then (a) z h(x) (|Σ1|1/2f1(x), . . . , |ΣJ|1/2f J(x))T as h 0, and (b) z h(x) (K(0)SPD (x, F1), . . . , K(0)SPD (x, FJ))T as h .

Now, we construct a classiﬁer by plugging in LSPDh instead of SPD in the GAM framework discussed in equations (1) and (2) of Section 2. Consider the following model for the posterior probabilities:

p(j|x) = p(j|z h(x)) = exp(Φj(z h(x)))

[1 + P(J 1) k=1 exp(Φk(z h(x)))] , for 1 j (J 1), (4)

and p(J|x) = p(J|z h(x)) = 1

[1 + P(J 1) k=1 exp(Φk(z h(x)))] . (5)

The main implication of part (a) of Theorem 3 is that the classiﬁer constructed using GAM and z h(x) as the covariate tends to the Bayes classiﬁer in a general nonparametric setup as h 0. On the other hand, part (b) of Theorem 3 implies that for elliptic class distributions, the same classiﬁer tends to the Bayes classiﬁer when h . When we ﬁt a GAM, the unknown functions Φjs are estimated nonparametrically. Flexibility of such nonparametric estimates also takes care of the unknown constants |Σj|1/2 for 1 j J and K(0) in the expressions of the limiting values of z h(x) in parts (a) and (b) of Theorem 3, respectively. A special case of Theorem 3 follows by taking Σj = λj Id with λj > 0 for all 1 j J.

Corollary 4 If f1, . . . , f J are continuous density functions with bounded ﬁrst derivatives, then (a) z h(x) = (LSPD h(x, F1), . . ., LSPD h(x, FJ))T (f1(x), . . . , f J(x))T as h 0, and (b) z h(x) (K(0)SPD (x, F1), . . . , K(0)SPD (x, FJ))T as h .

If x1, . . . , xn is a random sample of size n from F, the empirical version of Γ h(x, F) is given by

Γ h(x, Fn) = 1

i=1 Kh(ti) 1 n

i=1 Kh(ti)u(ti) ,

Multi-scale Classification using Localized Spatial Depth

where ti = (x xi) for 1 i n. Then LSPD h(x, Fn) is deﬁned using (3) with Γ h(x, F) replaced by Γ h(x, Fn). Similarly, we obtain Γ h(x, Fn) and LSPD h(x, Fn) by using ti = bΣ 1/2(x xi) in the expression stated above. Here bΣ is an estimate of Σ, and Fn is the empirical distribution of the data x1, . . . , xn. We know that supx Rd |SPD (x, Fn) SPD (x, F)| goes to 0 almost surely (a.s.) as n goes to inﬁnity (see Gao, 2003). Theorem 5 establishes a similar a.s. uniform convergence of LSPD h(x, Fn) to its population counterpart LSPD h(x, F) for a ﬁxed value of h.

Theorem 5 Assume the density corresponding to the distribution function F to be bounded. Then, for any ﬁxed h > 0, supx Rd |LSPD h(x, Fn) LSPD h(x, F)| a.s. 0 as n .

From the proof of Theorem 5 (see Appendix A), it is easy to check that this a.s. uniform convergence also holds when h . Under additional moment conditions on F, we obtain this convergence when h 0 in such a way that nh2d/ log n as n (see Remarks 9 and 10 after the proof of Theorem 5 in Appendix A). The fact that LSPD tends to a constant multiple of the probability density function as h 0 is a crucial requirement for limiting Bayes optimality of classiﬁers based on this local depth function. Agostinelli and Romanazzi (2010) proposed localized versions of simplicial depth and half-space depth, but the relationship between the local depth and the probability density function was established only for d = 1. A depth function based on inter-point distances was developed by Lok and Lee (2011) to capture multi-modality in a data set. Chen et al. (2009) deﬁned kernelized spatial depth using a reproducing kernel Hilbert space. Hu et al. (2011) also considered a generalized notion of Mahalanobis depth in reproducing kernel Hilbert spaces. However, there is no result connecting them to the probability density function. In fact, the kernelized spatial depth function becomes degenerate at the value (1 1/

2) as the tuning parameter goes to zero. Consequently, it becomes non-informative for small values of the tuning parameter. It will be appropriate to note here that none of the preceding authors used their proposed depth functions for constructing classiﬁers. Recently, Paindaveine and Van Bever (2013, 2015) proposed a notion of local depth and used it for supervised classiﬁcation along with other applications. Their version of local depth does not relate to the underlying density function either. At this point, one should note that convergence of local depth function to the underlying density function is an advantageous property for classiﬁcation. However, this may not always be a desirable property for other applications of data depth (see Paindaveine and Van Bever, 2013, for a detailed discussion).

4. Multi-scale Classiﬁcation using Localized Spatial Depth

When the class distributions are elliptic, part (b) of Theorem 3 implies that LSPDh with large values of h will lead to good classiﬁers. These large values may not be appropriate for non-elliptic class distributions, but part (a) of Theorem 3 implies that LSPDh with small values of h will lead to good classiﬁers for general nonparametric models for class densities. However, the empirical version of LSPDh with small h and the resulting classiﬁer may have their statistical limitations for high-dimensional data.

Dutta, Sarkar and Ghosh

We now consider two examples to demonstrate the above points. The ﬁrst example (we call it E3) involves two multivariate normal distributions Nd(0d, Id) and Nd(1d, 4Id). In the second example (we call it E4), both the competing classes have trimodal distributions. The ﬁrst class has the same density as in Figure 4(a) (i.e., an equal mixture of Nd(0d, 0.25Id), Nd(21d, 0.25Id) and Nd(41d, 0.25Id)), while the second class is an equal mixture of Nd(1d, 0.25Id), Nd(31d, 0.25Id) and Nd(51d, 0.25Id). In each of these examples, we considered d = 5 and generated a training sample of size 100 from each class. The misclassiﬁcation rate for the classiﬁer based on LSPD h was computed based on a test sample of size 500 (250 observations from each class). This procedure was repeated 100 times to calculate the average misclassiﬁcation rates for diﬀerent values of h. Small values of h extracted local distributional features and yielded low misclassiﬁcation rates in E4 (see Figure 6(b)). However, those small values of h led to relatively higher misclassiﬁcation rates in E3, while the underlying global elliptic structure was captured well by the proposed classiﬁer for larger values of h (see Figure 6(a)). This provides a strong motivation for adapting a multi-scale approach in constructing the ﬁnal classiﬁer so that one can harness the strength of diﬀerent classiﬁers corresponding to diﬀerent scales of localization. One would expect that when aggregated judiciously, the multi-scale classiﬁer will lead to an improved misclassiﬁcation rate. Usefulness of the multi-scale approach in combining diﬀerent classiﬁers has been discussed in the classiﬁcation literature (see, e.g., Kittler et al., 1998; Dzeroski and Zenko, 2004; Ghosh et al., 2005, 2006).

2 1 0 1 2 3 4

Misclassification rate

(a) Example E3

2 1 0 1 2 3 4 0

Misclassification rate

(b) Example E4

Figure 6: Misclassiﬁcation rates of the Bayes classiﬁer (indicated by dotted lines) and the classiﬁer based on LSPD h (indicated by solid curves) in examples E3 and E4 for varying choices of h.

A popular way of aggregation is to consider a weighted average of the estimated posterior probabilities computed for diﬀerent values of h. There are various proposals for the choice of the weight function in the literature. Following Ghosh et al. (2005, 2006), we compute b h, a cross-validation estimate of the misclassiﬁcation rate of the classiﬁer based on LSPD h

Multi-scale Classification using Localized Spatial Depth

(or, LSPD h) and use

2 (b h b 0)2

b 0(1 b 0)/n

as the weight function, where b 0 = minh b h. The exponential function helps to appropriately weigh up (respectively, weigh down) the promising (respectively, the unsatisfactory) classiﬁer resulting from diﬀerent choices of the smoothing parameter h. We compute R W(h) g(h) p(j|z h(x))dh for the j-th class (1 j J), where a probability density function g is used to make the integral ﬁnite. Here p(j|z h(x)) is as deﬁned in equations (4) and (5) of Section 3. If we use very small values of h to classify a test case, then the kernel function used in LSPDh will put almost zero weights on all observations. Clearly, those small values of h will not be useful for classiﬁcation. On the other hand, LSPDh behaves like SPD for large values of h. So, after a certain threshold value, increasing the value of h will not provide any additional information about the distributional features. Therefore, one needs to ﬁnd suitable lower and upper limits of h to compute the weighted posterior probabilities of diﬀerent classes. Following Ghosh et al. (2006), we compute the pairwise distances (standardized pairwise distances in the case of LSPD h) among the observations in a class and compute the quantiles of these distances. Let λj,α denote the α-th quantile (0 < α < 1) of the pairwise distances for the j-th class with 1 j J. We use h L = minj{λj,0.05}/3 as the lower limit of h, and h U = 2rh L as the upper limit of h. Here r is the smallest integer for which we have z h(xji) z (xji) / z (xji) < 0.05 (or, z h(xji) z (xji) / z (xji) < 0.05 in case of LSPD h) for 1 i nj and 1 j J. Our ﬁnal classiﬁer, which we call the LSPD classiﬁer, assigns an observation x to the class j0, where

j0 = argmax 1 j J

W(h) g(h) p(j|z h(x))dh.

One can choose g to be the uniform distribution on the interval [h L, h U]. Since we are dealing with a scale parameter h, we take the uniform distribution in the logarithmic scale. In practice, we generate M independent observations h1, . . . , h M from the distribution g. For any given 1 j J and x, R h U h L W(h) g(h) p(j|z h(x))dh is approximated by the average PM i=1 W(hi) p(j|z hi(x))/M.

5. Analysis of Simulated Data Sets

We have analyzed several data sets simulated from elliptic as well as non-elliptic distributions in R5. In each example, taking an equal number of observations from each of the two competing classes, we generated training and test sets of sizes 200 and 500, respectively. This procedure was repeated 500 times, and the average test set misclassiﬁcation rates of diﬀerent classiﬁers are reported in Tables 1 and 2 along with their corresponding standard errors. To facilitate comparison, the corresponding Bayes risks are reported as well. In all the tables in this article, the best misclassiﬁcation rate in a data set is indicated by . The other ﬁgures in bold (if any) are the misclassiﬁcation rates whose diﬀerences from the best misclassiﬁcation rate were found to be statistically insigniﬁcant at the 5% level when the usual large sample test for equality of proportions was used.

Dutta, Sarkar and Ghosh

For the classiﬁers based on SPD and LSPD, we wrote our own R codes and they are available at the link goo.gl/E5tmd6. Throughout this article, we have used 50 diﬀerent values of h for multi-scale classiﬁcation based on LSPD, and the weight function is computed using 5-fold cross-validation method. In this section and in Section 6, we have used SPD

and LSPD h for classiﬁcation with the usual sample covariance matrix of the j-th class as ˆΣj for 1 j J. Any other choice of ˆΣj has been mentioned at appropriate places. We compared our proposed classiﬁers with a pool of classiﬁers that include parametric classiﬁers like LDA and QDA, and nonparametric classiﬁers like those based on k-NN (with the Euclidean metric as the distance function) and KDE (with the Gaussian kernel). For k-NN and KDE, we have used the pooled sample covariance matrix for standardization. Tables 1 and 2 show misclassiﬁcation rates for the multi-scale versions of k-NN (Ghosh et al., 2005) and KDE (Ghosh et al., 2006) based on the same weight function described in Section 4. For the multi-scale method based on KDE, we have considered 50 equi-spaced values of the bandwidth in the range suggested by Ghosh et al. (2006). For the multi-scale version of k-NN, we considered all possible values of k (see Ghosh et al., 2005, for more details). These multi-scale versions usually had better performance then their single scale analogs with the smoothing parameters chosen by the method of cross-validation. We also considered support vector machines (SVM) (Hastie et al., 2009) based on the linear kernel (i.e., K(x, y) = x, y ) and the radial basis function (RBF) kernel (i.e., Kγ(x, y) = exp( γ x y 2)) to facilitate comparison. We used the codes available at the R library e1071 (Dimitriadou et al., 2011). For the RBF kernel, it has been suggested in the literature to use γ = 1/d (see http://www.csie.ntu.edu.tw/~cjlin/libsvm/). However, for our numerical work, we considered γ = i/10d for 1 i 50. We also used 25 diﬀerent values for the box constraint in the interval [0.1, 100], which were equi-spaced in the logarithmic scale. Misclassiﬁcation rates were computed for these diﬀerent choices of the tuning parameters, and the best result is reported in the tables for both classiﬁers. Misclassiﬁcation rates are also reported for classiﬁcation tree (TREE), and a boosted version of TREE known as random forest (RF) (see, e.g., Hastie et al., 2009). For the implementation of TREE and RF, we used the R codes available in the libraries tree (Ripley, 2011) and random Forest (Liaw and Wiener, 2002), respectively. For classiﬁcation tree, the deviance function was used as a measure of impurity, and the maximum height of the tree was restricted to 31. Nodes with less than 5 observations were never considered for splitting. We have combined the results of 500 trees in RF, where each tree was generated based on 63.2% randomly chosen observations from the training sample. At any stage, only a random subset of

d out of d variables were considered for splitting. Here t denotes the largest integer less than, or equal to t. In addition, we also compared the performance of our classiﬁers with two depth based classiﬁcation methods: the classiﬁer based on depth-depth (DD) plot (Li et al., 2012) and the maximum depth classiﬁer based on local depth (LD) (Paindaveine and Van Bever, 2013). The DD classiﬁer ﬁts a polynomial on the depth values corresponding to the two competing classes to construct a separating surface. Three notions of depth were used: Mahalanobis depth, half-space depth and projection depth, where the last two depths were computed based on 500 random projections. For each of these depth functions, we used polynomials of degrees 1, 2 and 3. The best result obtained among all these nine possibilities is reported in Tables 1 and 2. For the maximum LD classiﬁer, we used the R library Depth Proc

Multi-scale Classification using Localized Spatial Depth

(Kosiorowski and Zawadzki, 2016) and considered the best result obtained for diﬀerent choices of depth and a range of values for the localization parameter. The misclassiﬁcation rates of the maximum LD classiﬁer was higher than those of the DD classiﬁer in almost all cases, and we do not report those results in this article.

5.1 Examples Involving Elliptic Distributions

Recall examples E1 and E2 in Section 2, and example E3 in Section 4 involving elliptic class distributions. In E1, the DD classiﬁer led to the lowest misclassiﬁcation rate closely followed by SPD and LSPD classiﬁers (see Table 1), but it did not perform well in E2. In this example, SPD and LSPD classiﬁers signiﬁcantly outperformed all their competitors. Since the class distributions were elliptic, the SPD classiﬁer had a slight edge over the LSPD classiﬁer in these examples. In view of normality of the class distributions, QDA was expected to have the best performance in E3. The DD classiﬁer ranked second here, while SPD and LSPD classiﬁers performed satisfactorily. In all these examples, the Bayes classiﬁer had non-linear class boundaries. So, LDA and SVM with the linear kernel did not perform well. The performance of SVM with the RBF kernel was relatively better, and it had competitive misclassiﬁcation rates in E3. In all these examples, nonparametric classiﬁers based on k-NN and KDE yielded much higher misclassiﬁcation rates compared to SPD and LSPD classiﬁers.

Table 1: Misclassiﬁcation rates (in %) of diﬀerent classiﬁers in elliptic data sets.

Ex Bayes LDA QDA SVM SVM k-NN KDE TREE RF DD SPD LSPD risk (linear) (RBF) E1 26.50 50.22 51.58 45.46 33.03 39.99 39.16 36.90 31.32 27.92 28.32 28.54 (0.11) (0.19) (0.12) (0.12) (0.13) (0.12) (0.13) (0.11) (0.12) (0.10) (0.11) E2 0.00 47.43 42.08 43.92 34.06 36.98 34.29 39.10 34.26 26.68 8.23 8.26 (0.11) (0.12) (0.11) (0.12) (0.13) (0.15) (0.13) (0.11) (0.13) (0.11) (0.10) E3 10.14 21.56 11.09 22.09 11.74 17.86 16.95 19.18 13.77 11.37 11.49 11.64 (0.09) (0.07) (0.09) (0.07) (0.09) (0.08) (0.13) (0.08) (0.08) (0.07) (0.07)

5.2 Examples Involving Non-elliptic Distributions

We now consider some examples involving non-elliptic class distributions. Recall the trimodal example E4 discussed in Section 4. In this example, when the classiﬁers based on k-NN and KDE were used after standardizing the data set by the pooled sample covariance matrix, they yielded misclassiﬁcation rates higher than 40%. For KDE, we used a common bandwidth in all directions after standardization. This lead to the use of a large bandwidth in the principal component direction 1

d1d (this can be observed from Figure 4(a)). Since the diﬀerence between the posterior probabilities of the two classes changes its sign frequently along this direction, use of this large bandwidth makes it diﬃcult to discriminate between the two competing classes. In the k-NN classiﬁer, this standardization leads to the use of a neighborhood which was also elongated along the direction 1

d1d, and this aﬀected the performance of this classiﬁer. So, we did not standardize the data for these two classiﬁers, and they outperformed all other classiﬁers considered here (see Table 2). Classiﬁers based on SPD and LSPD also had poor performance because of this issue with standard-

Dutta, Sarkar and Ghosh

ization. So we used classiﬁers based on SPD and LSPD in this example. The LSPD

classiﬁer had the third best performance. SVM with the RBF kernel also performed well. All other classiﬁers had relatively higher misclassiﬁcation rates. The DD classiﬁer, LDA, QDA and SVM with the linear kernel all misclassiﬁed more than 25% of the observations. The next example (we call it E5) is with exponential distributions, where the component variables are independently distributed in both classes. The i-th variable in the ﬁrst (respectively, the second) class is exponential with scale parameter d/(d i + 1) (respectively, d/2i) for 1 i d. Further, the second class has a location shift such that the diﬀerence between the mean vectors of the two classes is 1

d1d. Recall that Figure 5(a) shows the density contours of the ﬁrst class when d = 2. In this example, the RF classiﬁer had the best performance followed by TREE. Here all the measurement variables were independent, and there was signiﬁcant separation between the two classes in some of the co-ordinate directions. This is one of the main reasons behind the superior performance of both TREE and RF. Classiﬁers based on DD, SPD and LSPD also performed quite well, and their misclassiﬁcation rates were signiﬁcantly lower than all other classiﬁers. The two linear classiﬁers performed poorly, but QDA had a reasonably good performance in this example. Good performance of QDA was not surprising as the two competing classes are unimodal, while they diﬀer widely in their dispersion structures.

Table 2: Misclassiﬁcation rates (in %) of diﬀerent classiﬁers in non-elliptic data sets.

Ex Bayes LDA QDA SVM SVM k-NN KDE TREE RF DD SPD LSPD risk (linear) (RBF) E4 2.10 40.45 42.41 36.16 3.28 2.70 2.75 15.52 4.98 30.14 10.07 3.25 (0.12) (0.11) (0.12) (0.04) (0.03) (0.03) (0.10) (0.07) (0.12) (0.10) (0.04) E5 2.04 41.17 5.97 32.14 7.12 9.55 9.32 4.82 2.04 5.92 5.53 5.42 (0.15) (0.05) (0.34) (0.07) (0.08) (0.07) (0.08) (0.03) (0.05) (0.06) (0.06) E6 13.16 49.67 25.77 47.77 29.33 27.44 27.59 38.39 29.73 28.86 24.15 24.09 (0.12) (0.12) (0.15) (0.11) (0.12) (0.11) (0.14) (0.11) (0.14) (0.10) (0.10) E7 19.96 50.78 50.48 49.77 46.01 35.29 38.88 34.45 27.62 26.48 38.39 40.64 (0.23) (0.22) (0.07) (0.23) (0.22) (0.24) (0.13) (0.11) (0.12) (0.20) (0.28)

In example E6, each class is an equal mixture of four elliptic distributions. The ﬁrst class constitutes of Nd(1d, S0.6), t3,d(βd, S0.7), Nd( 1d, S0.8) and t3,d( βd, S0.9), while the second class is an equal mixture of t3,d(1d, S 0.9), Nd(βd, 3S 0.8), t3,d( 1d, S 0.7) and Nd( βd, 3S 0.6). Here t3,d(µ, Σ) denotes the d-variate t distribution with 3 degrees of freedom (df), location parameter µ and scatter matrix Σ. The vector βd is a d-dimensional vector with the i-th element equal to ( 1)i+1 for 1 i d and the matrix Sα = ((α|i j|))d d for α ( 1, 1) and 1 i, j J. This example has a complex structure for the class distributions, and both SPD and LSPD classiﬁers signiﬁcantly outperformed all their competitors. As the Bayes classiﬁer was far from being linear, LDA and linear SVM did not have satisfactory performance. Finally, we consider a classiﬁcation problem between a Cauchy distribution and a skewed Cauchy distribution (Azzalini, 2014) (we call it E7). The Cauchy distribution had location parameter 1d and scatter matrix 0.5Id + 0.51d1T d ; while the skewed Cauchy distribution had location parameter 0d, scatter matrix Id and asymmetry vector 1d. The DD classiﬁer and RF performed better than other classiﬁers, but SPD and LSPD classiﬁers yielded

Multi-scale Classification using Localized Spatial Depth

relatively higher misclassiﬁcation rates. Both half-space depth and projection depth used in the DD classiﬁer are robust against outliers generated from heavy-tailed distributions, while the moment based estimates used in both SPD and LSPD are non-robust. So, it is better to use robust estimates of Σjs here. When we used MCD estimates based on 75% of the observations (Rousseeuw and Van Driessen, 1999), the misclassiﬁcation rates of SPD

and LSPD classiﬁers dropped to 31.90% and 32.05%, respectively, with corresponding standard errors of 0.18% and 0.20%. All these examples clearly demonstrate that the LSPD classiﬁer performs as good as (if not better) popular nonparametric classiﬁers for non-elliptic, or multi-modal data. This adjustment of the LSPD classiﬁer is automatic in view of the multi-scale approach developed in Section 4.

5.3 Computing Time for SPD and LSPD Classiﬁers

For a training sample of size n, computation of z(xji) for 1 i nj and 1 j J requires O(n2) calculations. Fitting a GAM involves an iterative algorithm, and it is quite diﬃcult to calculate its exact computational complexity. Each iteration requires computations of the order O(n2) (Wood, 2006). So, the algorithm takes no more than O(n2) computations to ﬁt a GAM for a ﬁnite number of iterations. For the multi-scale classiﬁer based on LSPD, we need to repeat this procedure for M diﬀerent values of h and then compute the weight function W(h) based on V -fold cross-validation. The overall order of computation remains O(n2) although the associated constant increases linearly with d, J, M and V . However, one should note that these are oﬄine calculations. Both SPD and LSPD classiﬁers require O(n) calculations to classify a test case. Throughout this article, we have used M = 50 and V = 5 and the R library VGAM (Yee, 2008) was used to ﬁt GAM. In a single iteration, the average CPU time to determine the weight function W(h) based on cross-validation for the LSPD classiﬁer was 21.83 seconds, while 0.55 seconds were required to ﬁt a GAM using the full training data. The average CPU time to classify the 500 test observations was about 0.01 seconds. All the calculations were done on a desktop computer with an Intel i7 (2.2 GHz) processor having 8 GB RAM.

6. Analysis of Benchmark Data Sets

We have analyzed seven benchmark data sets for further evaluation of our proposed classiﬁers. The biomedical data set is taken from the CMU data archive (http://lib.stat. cmu.edu/datasets/). In this data set, we ignored the observations with missing values. The diabetes data set is available in the R library mclust (also analyzed in Reaven and Miller, 1979). All other data are taken from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). Descriptions of these data sets are available at these sources. Satellite image (satimage) data set has speciﬁc training and test samples. For this data set, we report misclassiﬁcation rates of diﬀerent classiﬁers based on this ﬁxed test set. If a classiﬁer had misclassiﬁcation rate ε, its standard error was computed as p

ε(1 ε)/(size of the test set). For all other data sets, we formed the training and the test sets by randomly partitioning the data, and this random partitioning was repeated 500 times. Average test set misclassiﬁcation rates of diﬀerent classiﬁers were computed over these 500 partitions, and they are reported in Table 3 along with their corresponding stan-

Dutta, Sarkar and Ghosh

dard errors. Sizes of training and test sets in each partition are also reported in this table. For all classiﬁers, we used the same tuning procedures as described in Section 5. Codes for the DD classiﬁer are available only for two class problems. In biomedical and Parkinson s data sets, the DD classiﬁer yielded misclassiﬁcation rates of 12.54% and 14.48%, respectively, with corresponding standard errors of 0.18% and 0.15%. We also used the maximum LD classiﬁer on these real data sets. However, its performance was not satisfactory for most data sets and we do not report those misclassiﬁcation rates in Table 3. In biomedical and vehicle data, covariance matrices of the competing classes were diﬀerent. So, QDA led to signiﬁcant improvement over LDA, and its misclassiﬁcation rates were close to the best rate. In both these data sets, the competing classes were nearly elliptic (this can be veriﬁed using the diagnostic plots suggested by Li et al., 1997). The SPD classiﬁer utilized this ellipticity of the class distributions to outperform the nonparametric classiﬁers. The LSPD classiﬁer competed well with the SPD classiﬁer in biomedical data. But, the evidence of ellipticity was much stronger in vehicle data and LSPD had a slightly higher misclassiﬁcation rate. In diabetes data also, the three competing classes had widely varying covariance structures. As expected, QDA performed better than LDA. Since the class distributions were not elliptic, the SPD classiﬁer yielded a higher misclassiﬁcation rate than the LSPD classiﬁer, while both TREE and RF outperformed all other classiﬁers in this data set.

Table 3: Descriptions of the real data sets, and misclassiﬁcation rates (in %) of diﬀerent classiﬁers.

Data set Biomed Parkinson s Diabetes Wine Waveform Vehicle Satimage d 4 22 3 13 21 18 36 J 2 2 3 3 3 4 6 Train 100 97 73 100 300 423 4435 Test 94 98 72 78 501 423 2000

Data LDA QDA SVM SVM k-NN KDE TREE RF SPD LSPD set (linear) (RBF) Biomed 15.66 12.57 21.90 12.76 17.74 16.67 17.69 13.23 12.53 12.49 * (0.14) (0.13) (0.13) (0.13) (0.15) (0.14) (0.18) (0.14) (0.21) (0.15) Parkinson s 30.93 xxxx 14.83 13.29 14.42 11.24 * 16.63 11.68 15.44 14.23 (0.12) xxxx (0.12) (0.10) (0.16) (0.12) (0.20) (0.15) (0.15) (0.11) Diabetes 13.86 8.51 10.20 14.93 11.20 11.96 3.78 * 4.29 9.36 7.93 (0.16) (0.13) (0.19) (0.15) (0.13) (0.14) (0.09) (0.10) (0.15) (0.14) Wine 2.00 2.46 3.64 1.86 1.98 1.40 * 10.99 2.12 2.34 1.85 (0.06) (0.09) (0.09) (0.06) (0.06) (0.05) (0.22) (0.06) (0.08) (0.07) Waveform 19.74 20.78 18.89 16.28 21.23 21.04 28.81 16.45 15.12 * 15.36 (0.15) (0.15) (0.07) (0.07) (0.11) (0.11) (0.12) (0.08) (0.06) (0.06) Vehicle 22.49 16.38 20.59 25.37 21.80 21.21 31.41 25.52 16.35 * 17.15 (0.07) (0.07) (0.07) (0.08) (0.08) (0.07) (0.10) (0.07) (0.08) (0.08) Satimage 16.02 14.11 12.95 8.97 18.00 21.40 18.60 8.24 * 12.58 12.58 (0.82) (0.78) (0.75) (0.64) (0.86) (0.92) (0.87) (0.61) (0.74) (0.74)

xxxx : QDA could not be used because of singularity of the estimated class dispersion matrices.

In Parkinson s data, we could not use QDA because of singularity of the estimated class dispersion matrices. So, we used the pooled sample covariance matrix for computation

Multi-scale Classification using Localized Spatial Depth

of SPD and LSPD . In this data set, all the nonparametric classiﬁers had signiﬁcantly lower misclassiﬁcation rates than LDA, and the classiﬁer based on KDE had the lowest misclassiﬁcation rate. The performance of the LSPD classiﬁer was also competitive. Since the underlying distributions were non-elliptic, LSPD outperformed the SPD classiﬁer. We observed a similar phenomena in wine data as well. The sample covariance matrices of diﬀerent classes were nearly singular, and we used the pooled sample covariance matrix for computing SPD and LSPD . The classiﬁer based on KDE yielded the lowest misclassiﬁcation rate, while the LSPD classiﬁer had the second best performance. Although the data dimension was quite high in both data sets, all the competing classes had low intrinsic dimensions (can be estimated using the method described by Levina and Bickel, 2004). So, nonparametric methods like KDE were not aﬀected much by the curse of dimensionality. TREE was the only classiﬁer with a somewhat higher misclassiﬁcation rate.

In waveform data, the competing class distributions were nearly elliptic and the SPD classiﬁer was expected to perform well. The LSPD classiﬁer is quite ﬂexible, and it yielded a competitive misclassiﬁcation rate. The class distributions were not normal (can be checked using the method proposed in Royston, 1983) for this data, and did not have low intrinsic dimensions. As a result, LDA, QDA and the nonparametric classiﬁers had relatively higher misclassiﬁcation rates.

In satimage data, recall that the results are based on a single training and a single test set. So, the standard errors of the misclassiﬁcation rates were high for all classiﬁers, and it is quite diﬃcult to compare the performance of diﬀerent classiﬁers. Both RF and SVM with the RBF kernel had lower misclassiﬁcation rates than other classiﬁers, while the classiﬁers based on SPD and LSPD had the next best performance.

7. Classiﬁcation of High-dimensional Data

A serious practical limitation of many existing depth based classiﬁers is their computational complexity in high dimensions, and this makes such classiﬁers impossible to use even for moderately large dimensional data. Besides, depth functions that are based on random simplices formed by the data points (see, e.g., Liu et al., 1999; Zuo and Serﬂing, 2000) cannot be deﬁned in a meaningful way if the dimension of the data exceeds the sample size. Tukey s half-space depth and projection depth both become degenerate at zero for such high-dimensional data (see, e.g., Dutta et al., 2011). Classiﬁcation of high-dimensional data presents a substantial challenge to many nonparametric classiﬁcation tools as well. We have seen in examples E1 and E2 (recall Figure 2) that nonparametric classiﬁers like those based on k-NN and KDE can yield poor performance when the data dimension is large. Some limitations of SVM for classiﬁcation of high-dimensional data has been noted by Marron et al. (2007); Dutta and Ghosh (2016).

One of our primary motivations behind using SPD is its computational tractability (especially when the dimension is large). If the dimension exceeds the sample size, then the sample covariance matrices become singular, and we cannot use these estimates to deﬁne the empirical versions of SPD and LSPD h. So, we use classiﬁers based on SPD and LSPD h. We now assume the following regularity conditions to investigate the behavior of these classiﬁers for such high-dimensional data.

Dutta, Sarkar and Ghosh

(C) Consider two independent random vectors X1 = (X(1) 1 , . . . , X(d) 1 )T Fj and X2 = (X(1) 2 , . . . , X(d) 2 )T Fi for 1 j, i J. Further, assume that (C1) aj = limd d 1 Pd k=1 E(X(k) 1 )2 exists, and d 1 Pd k=1(X(k) 1 )2 a.s. aj as d , (C2) bji = limd d 1 Pd k=1 E(X(k) 1 X(k) 2 ) exists, and d 1 Pd k=1 X(k) 1 X(k) 2 a.s. bji as d .

It is not diﬃcult to verify that for X1 Fj (1 j J), if we assume that the sequence of variables {X(k) 1 E(X(k) 1 ) : k = 1, 2, . . .} centered at their means are independent with uniformly bounded eighth moments (see Theorem 1 (2) in Jung and Marron, 2009, p. 4110), or they are m-dependent processes with some appropriate conditions (see Theorem 2 in de Jong, 1995, p. 350), then the convergence results in (C1) and (C2) hold. Also, if the observations are generated from discrete time ARMA processes, all these conditions are satisﬁed. Stationarity of such time series is not required here. These assumptions continue to hold if the sequences {(X(k) 1 )2 E(X(k) 1 )2 : k = 1, 2, . . .} and {X(k) 1 X(k) 2 E(X(k) 1 X(k) 2 ) : k = 1, 2, . . .}, where X1 Fj and X2 Fi for all 1 j, i J, are mixingales satisfying some appropriate conditions (see, e.g., Theorem 2 in de Jong, 1995, p. 350). Deﬁne σ2 j = aj bjj and νji = bjj 2bji + bii. For the random vector X1 Fj, σ2 j is the

limit of d 1 Pd k=1 V ar(X(k) 1 ) as d . If we consider a second independent random vector X2 Fi with i = j, then νji is the limit of d 1 Pd k=1{E(X(k) 1 ) E(X(k) 2 )}2 as d . Hall et al. (2005) assumed a similar set of conditions to study the performance of support vector machines (SVM) with the linear kernel and the 1-NN classiﬁer as the data dimension grows to inﬁnity. Similar conditions on observation vectors were also considered by Jung and Marron (2009) to study consistency of principal components of the empirical covariance matrix for high-dimensional data. Under (C1) and (C2), the following theorem describes the behavior of z (x) = (SPD (x, F1), . . ., SPD (x, FJ))T and z h(x) = (LSPD h(x, F1), . . ., LSPD h(x, FJ))T as d grows to inﬁnity.

Theorem 6 Suppose that the conditions (C1)-(C2) hold, and X Fj for 1 j J.

(a) z (X) a.s. (cj1, . . . , cj J)T = cj as d , where cjj = 1 q

1 2 and cji = 1 r

σ2 j +νji σ2 j +σ2 i +νji for 1 j = i J. (b) Assume that h and d in such a way that

d/h 0 or A0(> 0). Then, z h(X) a.s. g0(0)cj or c j = (g0(ej1A0)cj1, . . . , g0(ej JA0)cj J)T depending on whether

0 or A0, respectively. Here K(t) = g0( t ), ejj =

2σj and eji = q

σ2 j + σ2 i + νji for j = i.

(c) Assume that h > 1, and

d/h as d . Then, z h(X) a.s. 0J.

The cjs as well as the c js in the statement of Theorem 6 are distinct for all 1 j J whenever either σ2 j = σ2 i or νji = 0 for all 1 j = i J (see Lemma 11 in Appendix A). In such a case, part (a) of Theorem 6 implies that for large d, z (x) becomes degenerate at points depending on the class distributions. So, z (x) has good discriminatory power, and our classiﬁer based on SPD can discriminate well among the J populations. Further, it follows from part (b) that when both d and h grow to inﬁnity in such a way that

d/h 0 or to a positive constant, z h(x) has good discriminatory power and the classiﬁer based on LSPD h can yield low misclassiﬁcation probability. However, part (c) shows that if

Multi-scale Classification using Localized Spatial Depth

at a rate faster than h, z h(x) converges to the same value 0J and it becomes non-informative. Consequently, the classiﬁer based on LSPD h will lead to a high misclassiﬁcation probability in this case. To evaluate the performance of our depth based classiﬁers for high-dimensional data, we considered examples E1-E7 with d = 200. In each example, we generated 20 observations from each class to constitute the training sample, while 250 observations from each class were used to form the test set. We generated 500 training and test sets, and the average test set misclassiﬁcation rates of the diﬀerent classiﬁers along with their corresponding standard errors are reported in Table 4. The Bayes risks were almost zero in all these examples, and we have not stated them in Table 4. We did not standardize the data for KDE and k-NN. QDA could not be used in these examples, and we used Id instead of the pooled sample covariance matrix for LDA. When the competing classes have equal priors (which is the case in simulated examples), this leads to the Euclidean distance based classiﬁer which classiﬁes an observation to the class having the nearest centroid. As we have mentioned before, we use SPD and LSPD for classiﬁcation of these highdimensional data sets. For a single iteration, the LSPD classiﬁer required an average CPU time of 8.82 seconds to compute the weight function W(h), 0.39 seconds for ﬁtting GAM using the full training data, and 0.06 seconds for classiﬁcation of 500 test cases.

Table 4: Misclassiﬁcation rates (in %) of diﬀerent classiﬁers in simulated data sets.

Example LDA SVM SVM k-NN KDE TREE RF SPD LSPD

(linear) (RBF) E1 50.93 47.57 28.97 49.71 49.99 45.72 41.95 0.27 0.31 (0.13) (0.09) (0.38) (0.06) (0.06) (0.15) (0.14) (0.03) (0.03) E2 45.84 45.69 32.70 49.96 49.92 43.70 39.36 0.08 0.09 (0.08) (0.07) (0.18) (0.01) (0.01) (0.17) (0.12) (0.03) (0.03) E3 0.20 0.29 0.00 49.99 49.98 27.46 0.28 0.00 0.00 (0.01) (0.01) (0.00) (0.01) (0.01) (0.12) (0.17) (0.00) (0.00) E4 34.87 44.28 10.43 0.19 0.20 38.55 23.57 0.68 0.13 (0.26) (0.15) (0.43) (0.08) (0.08) (0.24) (0.45) (0.12) (0.06) E5 40.83 44.61 13.69 49.98 49.93 18.93 0.00 0.84 0.80 (0.07) (0.11) (0.15) (0.01) (0.01) (0.03) (0.00) (0.04) (0.04) E6 50.11 48.16 31.03 46.52 47.03 48.20 45.00 30.98 29.76 (0.12) (0.14) (0.26) (0.18) (0.14) (0.13) (0.17) (0.20) (0.19) E7 44.74 35.06 31.82 18.92 22.91 36.82 22.36 26.33 25.96 (0.45) (0.24) (0.48) (0.22) (0.32) (0.19) (0.19) (0.26) (0.27)

Id was used instead of the pooled sample covariance matrix.

In the ﬁrst ﬁve examples, the two competing classes had separation between either in their locations and/or scales. So, good performance of the SPD and LSPD classiﬁers was expected in view of Theorem 6 and Lemma 11 (see Appendix A). In E1 and E2, recall that the component distributions of the two classes diﬀered only in scales. The SPD and LSPD

classiﬁers performed well in these examples, and the former had an edge due to ellipticity of the class distributions. Surprisingly, all other classiﬁers failed to extract this separability information properly, and had misclassiﬁcation rates higher than 25%. Since the Bayes class boundaries were highly nonlinear in these two examples, poor performance of linear SVM and LDA was quite expected. Dutta and Ghosh (2016) showed that when one component

Dutta, Sarkar and Ghosh

distribution from the ﬁrst class and one from the second class diﬀer only in their scales, the k-NN classiﬁer gives a decision in favor of the distribution with a smaller spread (also see Hall et al., 2005). This was the main reason behind the poor performance of the k-NN classiﬁer. Similar arguments can be given for the poor performance of the classiﬁer based on KDE. In these two examples, splitting based on a single variable failed to yield signiﬁcant reduction in the impurity function (one can see this in Figure 1). So, TREE and RF had relatively higher misclassiﬁcation rates. In E3, the two Gaussian distributions diﬀered in their locations and scales. Barring TREE, k-NN and the classiﬁer based on KDE, all other classiﬁers yielded misclassiﬁcation rates close to zero. Since the scale diﬀerence between the two classes dominates the location diﬀerence, such a poor performance of the classiﬁer based on KDE and k-NN was expected (see the results in Hall et al., 2005; Dutta and Ghosh, 2016). The same explanation holds for E5 as well. These nonparametric classiﬁers yielded excellent performance in E4, where the component distribution diﬀer only in their locations. However, TREE and RF failed to have satisfactory performance here. Splitting based on linear combinations of the variables may be helpful in E4 (see Figure 4). Examples E6 and E7 were diﬃcult to deal with. Unlike E1-E5, none of the classiﬁers could achieve misclassiﬁcation rates close to zero in these two examples. Conditions (C1) and (C2) do not hold here, and Theorem 6 is not applicable. The LSPD classiﬁer had the best performance in E6 (just like the case with d = 5 in Section 5). SVM with the RBF kernel and the SPD classiﬁer also led to competitive misclassiﬁcation rates. Their performance was much better than all other classiﬁers. In E7, the linear classiﬁers and SVM with the RBF kernel could not perform well. This is also consistent with what we observed in Section 5. Barring TREE, all other classiﬁers yielded competitive performances in this example. Among them the k-NN classiﬁer led to the lowest misclassiﬁcation rate. We also analyzed two high-dimensional benchmark data sets, namely, lightning-2 data and colon data (Alon et al., 1999). The ﬁrst data set is from the UCR time series classiﬁcation archive (http://www.cs.ucr.edu/~eamonn/time_series_data/), while the other one is taken from the R library rda. In each case, we formed 500 training and test sets by randomly partitioning each data into two almost equal parts. The average test set misclassiﬁcation rates of diﬀerent classiﬁers are reported in Table 5.

Table 5: Misclassiﬁcation rates (in %) of diﬀerent classiﬁers in real data sets.

Data d J Sample size LDA SVM SVM k-NN KDE TREE RF SPD LSPD set Train Test (linear) (RBF) Lightning-2 637 2 60 61 31.86 35.64 28.73 29.89 28.11 33.69 22.08 * 27.70 27.16 (0.25) (0.35) (0.32) (0.20) (0.30) (0.34) (0.34) (0.30) (0.30) Colon 2000 2 31 31 14.47 16.38 21.48 22.47 23.20 28.78 19.28 19.66 19.05 (0.21) (0.23) (0.25) (0.27) (0.28) (0.35) (0.24) (0.31) (0.30)

Id was used instead of the pooled sample covariance matrix.

Lightning-2 data consist of observations that are realizations of a time series. In this data set, RF had the best performance followed by the LSPD classiﬁer. The SPD classiﬁer also worked well and yielded the third best performance. The class distributions for this data set turn out to be non-elliptic (can be veriﬁed using the method proposed by Li et al., 1997) with low intrinsic dimensions (Levina and Bickel, 2004). As a consequence, the classiﬁer based on KDE and k-NN yielded reasonably good performances.

Multi-scale Classification using Localized Spatial Depth

Colon data contain micro-array expression levels of 2000 genes for normal and colon cancer tissues. There was a good linear separation among the observations from the two competing classes, and the linear classiﬁers lead to low misclassiﬁcation rates. Among the other classiﬁers, the LSPD classiﬁer yielded the minimum misclassiﬁcation rate closely followed by RF and the SPD classiﬁer. These three classiﬁers were less aﬀected by the curse of dimensionality. In these high-dimensional benchmark data sets, the data had low intrinsic dimensions due to high correlation among the the measurement variables (Levina and Bickel, 2004). Moreover, data from the competing classes diﬀered mainly in their locations. As a consequence, though the proposed LSPD classiﬁer had a good overall performance, its superiority over the nonparametric methods was not as prominent as it was in the simulated examples.

Acknowledgments

The authors are grateful to Prof. Probal Chaudhuri for his valuable contributions to this manuscript. They are also thankful to the Action Editor and two anonymous reviewers for providing them with several helpful comments. The ﬁrst author would like to thank Prof. Thomas W. Yee for his help with VGAM, and Prof. Jun Li for sharing R codes of the DD classiﬁer.

Appendix A. Proofs and Mathematical Details

Lemma 7 If F has a spherically symmetric density f(x) = g( x ) on Rd with d > 1, then EF [u(x X)] is a non-negative monotonically increasing function of x .

Proof of Lemma 7 : In view of spherical symmetry of f(x), S(x) = EF [u(x X)] is invariant under orthogonal transformations of x. Consequently, S(x) = η( x ) for some non-negative function η. Consider now x1 and x2 such that x1 < x2 . Using spherical symmetry of f(x), without loss of generality, we can assume xi = (ti, 0, . . . , 0)T for i = 1, 2 such that |t1| < |t2|. For any x = (t, 0, . . . , 0)T , we have

(t X1)2 + X2 2 + . . . + X2 d

due to spherical symmetry of f(x). For any x Rd with d > 1, EF [ x X ] is a strictly convex function of x in this case. Consequently, it is a strictly convex function of t. Observe now that S(x) with this choice of x is the absolute value of the derivative of EF [ x X ] w.r.t. t. This derivative is a symmetric function of t that vanishes at t = 0. Hence, S(x) is an increasing function of |t|, and this proves that η( x1 ) < η( x2 ).

Proof of Theorem 1 : If the population distribution fj(x) is elliptically symmetric, we have fj(x) = |Σj| 1/2gj(δ(x, Fj)), where δ(x, Fj) = Σ 1/2 j (x µj) is the Mahalanobis

distance for 1 j J. Since SPD (x, Fj) = 1 E[u(Σ 1/2 j (x µj))] is aﬃne invariant, it

is a function of δ(x, Fj). Again, as Σ 1/2 j (X µj) has a spherically symmetric distribution

Dutta, Sarkar and Ghosh

with its center at the origin, from Lemma 7 it follows that SPD (x, Fj) is a monotonically decreasing function of δ(x, Fj). Therefore, δ(x, Fj) is also a function of SPD (x, Fj) and using this fact fj(x) can also be expressed as

fj(x) = ψj(SPD (x, Fj)) for all 1 j J,

where ψj is an appropriate real-valued function that depends on gj. Now, one can check that

log h p(j|x)

i = log(πj/πJ) + log ψj(SPD (x, Fj)) log ψJ(SPD (x, FJ)).

for 1 j (J 1). Now, if we deﬁne ϕjj(z) = log πj + log ψj(z) and ϕij(z) = 0 for 1 j = i (J 1); and ϕ1J(z) = = ϕ(J 1)J(z) = log πJ log ψJ(z), then the proof is complete.

Remark 8 If fj(x) is unimodal, ψj(z) is monotonically increasing for 1 j J. Moreover, if the distributions diﬀer only in their locations, then the ψjs are same for all classes. In that case, fj(x) > fi(x) δ(x, Fj) < δ(x, Fi) SPD (x, Fj) > SPD (x, Fi) for 1 i = j J, and hence the classiﬁer turns out to be the maximum SPD classiﬁer.

Proof of Theorem 3(a) : Let h < 1. For any ﬁxed x Rd and the distribution function Fj, we have LSPD h(x, Fj) = EFj[Kh(t)] EFj[Kh(t)u(t)] , where t = Σj 1/2(x X) for 1 j J. For the ﬁrst term in the expression of LSPD h(x, Fj) above, we have

EFj[Kh(t)] = Z

Rd 1 hd Kh(Σ 1/2 j (x v))fj(v)dv = |Σj|1/2 Z

Rd K(y)fj(x hΣ1/2 j y)dy,

where y = h 1Σ 1/2 j (x v). So, using Taylor s expansion of fj(x), we get

EFj[Kh(t)] = |Σj|1/2fj(x) h|Σj|1/2 Z

Rd K(y) (Σ1/2 j y)T fj(ξ)dy,

where ξ lies on the line joining x and (x hΣ1/2 j v). Using the Cauchy-Schwartz inequality,

one gets EFj[Kh(t)] |Σj|1/2fj(x) h|Σj|1/2λ1/2 j M j MK, where M j = supx Rd fj(x) ,

MK = R y K(y)dy, and λj is the largest eigenvalue of Σj. This implies EFj[Kh(t)]

|Σj|1/2fj(x) 0 as h 0 for 1 j J. For the second term in the expression of LSPD h(x, Fj), a similar argument yields

EFj[Kh(t)u(t)] = |Σj|1/2 Z

Rd K(y)u(y)fj(x hΣ1/2 j y)dy

= h|Σj|1/2 Z

Rd K(y)u(y) (Σ1/2 j y)T fj(ξ)dy (as Z

Rd K(y)u(y)dy = 0).

Now, EFj[Kh(t)u(t)] h|Σj|1/2λ1/2 j M j MK 0 and this implies that LSPD h(x, Fj) |Σj|1/2fj(x), as h 0. Consequently, we have z h(x) (|Σ1|1/2f1(x), . . . , |ΣJ|1/2f J(x))T

Multi-scale Classification using Localized Spatial Depth

Proof of Theorem 3(b) : Here we consider the case h > 1. Take any ﬁxed x Rd

and a j with 1 j J. For any ﬁxed t, since K(t/h) K(0) as h and K is bounded, using Dominated Convergence Theorem (DCT), one can show that LSPD h(x, Fj) K(0)SPD (x, Fj) as h . So, z h(x) (K(0)SPD (x, F1), . . . , K(0) SPD (x, FJ))T as h .

Proof of Theorem 5 : Deﬁne the sets Bn = {x = (x1, . . . , xd) : x

dn}, and An = {x : n2xi is an integer and |xi| n for all 1 i d}. Clearly An Bn Rd, the set Bn is a closed ball and the set An has cardinality (2n3 + 1)d. We will prove almost sure (a.s.) uniform convergence on three disjoint sets: (i) An, (ii) Bn \ An and (iii) Bc n. Consider any ﬁxed h (0, 1]. Recall that for this choice of h, LSPD h(x, F) (see equation (3)) and LSPD h(x, Fn) are deﬁned as follows:

LSPD h(x, Fn) = 1 nhd

u(x Xi) , and

LSPD h(x, F) = 1

hd E h K x X

(i) Deﬁne Zi = K(h 1(x Xi))u(x Xi) E[K(h 1(x X))u(x X)] for 1 i n. Note that Zis are independent and identically distributed (i.i.d.) with E(Zi) = 0 and Zi 2K(0). Fix an ϵ > 0. Using the exponential inequality for sums of i.i.d. random vectors (see Yurinskii, 1976, p. 491), we obtain P n 1 Pn i=1 Zi ϵ 2e C0nϵ2. Here C0 is a positive constant that depends on K(0) and ϵ. This now implies that

hd E h K x X

hd E h K x X

i=1 Zi hdϵ 2e C0nh2dϵ2. (6)

For a ﬁxed value of h, Pn i=1 K(h 1(x Xi)) is a sum of i.i.d. bounded random variables. Using Bernstein s inequality, we obtain

i ϵ 2e C1nϵ2,

for some suitable positive constant C1. This implies

hd E h K x X

i ϵ 2e C1nh2dϵ2. (7)

Combining (6) and (7), we get P(|LSPD (x, Fn) LSPD (x, F)| ϵ) C3e C4nh2dϵ2 for some suitable constants C3 and C4. Since the cardinality of An is (2n3 + 1)d, we have

P sup x An |LSPD (x, Fn) LSPD (x, F)| ϵ C3(2n3 + 1)de C4nh2dϵ2. (8)

Dutta, Sarkar and Ghosh

Now, P n 1(2n3 + 1)de C4nh2dϵ2 < . So, an application of Borel-Cantelli lemma implies that supx An |LSPD h(x, Fn) LSPD h(x, F)| a.s. 0 as n .

(ii) Consider the set Bn \ An. Given any x in Bn \ An, there exists y An such that x y

2/n2. First we will show that |LSPD (y, Fn) LSPD (x, Fn)| a.s. 0 as n . Using the mean-value theorem, one obtains 1 nhd

(x y)T K ξ Xi

where ξ lies on the line joining x and y. Note that the right hand side is less than M K hd+1

and M K = supt K(t) . This upper bound is free of x, and goes to 0 as n . Now, 1 nhd

u(x Xi) K y Xi

u(y Xi) i (9)

i=1 [u(x Xi) u(y Xi)]

We have proved above that the ﬁrst part converges to 0 in a.s. sense. For the second part, consider a ball of radius 1/n around x (say, B(x, 1/n)). Now, 1 nhd

i=1 [u(x Xi) u(y Xi)]

i=1 I[Xi B(x, 1/n)] + 2n

i=1 I[Xi B(x, 1/n)] P[X1 B(x, 1/n)]

hd P[X1 B(x, 1/n)] + 2n

Note that I[Xi B(x, 1/n)]s are i.i.d. bounded random variables with expectation P[X B(x, 1/n)]. Therefore, a.s. convergence of the ﬁrst term follows from Bernstein s inequality. Since P[X B(x, 1/n)] Mfn d (where Mf = supx f(x) < ), the second term converges to 0. For any ﬁxed h, the third term also converges to 0 as n . So, we have |LSPD h(x, Fn) LSPD h(y, Fn)| a.s. 0 as n . Similarly, one can prove that |LSPD h(x, F) LSPD h(y, F)| a.s. 0 as n . In the arguments above, all the bounds are free from x and y. We have also proved that supy An |LSPD h(y, Fn) LSPD h(y, F)| a.s. 0 as n . Combining all these results, we

have supx Bn\An |LSPD h(x, Fn) LSPD h(x, F)| a.s. 0 as n .

(iii) Now, consider the region outside Bn (i.e., the set Bc n). First note that

sup x Bcn |LSPD h(x, Fn) LSPD h(x, F)| sup x Bcn

+ sup x Bcn

1 hd E h K x X

Multi-scale Classification using Localized Spatial Depth

We will show that both of these terms become suﬃciently small as n . Fix an ϵ > 0. We can choose two constants M1 and M2 such that P( X M1) hdϵ/2K(0) and K(t) hdϵ/2 when t M2. Now, one can check that

1 hd E h K x X

hd E h K x X

I( X M1) i + 1

hd K(0)P( X > M1).

If x Bc n and X M1, then h 1 x X h 1|

dn M1|. Choose n large enough so that |

dn M1| M2h, and this implies K(h 1(x X)) hdϵ/2. So, we obtain

1 hd E h K x X

hd K(0)P( X > M1) ϵ, and

i=1 I( Xi > M1)

i=1 I( Xi > M1) P( X > M1)

The Glivenko-Cantelli theorem implies that the last term on the right hand side converges to 0 as n . So, we have supx Bcn |LSPD h(x, Fn) LSPD h(x, F)| a.s. 0 as n . Combining the arguments in parts (i), (ii) and (iii) and for a ﬁxed h (0, 1], we get supx |LSPD h(x, Fn) LSPD h(x, F)| a.s. 0 as n . If we have h > 1, then this convergence result can be proved in a similar way. For this case, recall that the deﬁnition of LSPD (x, F) does not involve the hd term in the denominator (see equation (3)).

Remark 9 Following the proof of Theorem 5, it is easy to check that a.s. convergence holds when h diverges to inﬁnity with n.

Remark 10 The result continues to hold when h 0 as well. However, for a.s. convergence in part (i) (to use the Borel-Cantelli lemma) we require nh2d/ log n as n . In part (iii), we need M1 and M2 to vary with n. Assume the ﬁrst moment of the density corresponding to F to be ﬁnite, and R t K(t)dt < (which implies that t K(t) 0 as t ). Also, assume that nh2d/ log n as n . We can now choose M1 = M2 = n to ensure that both P( X M1) hdϵ/2K(0) and K(t) hdϵ/2 for t M2 hold for a suﬃciently large n.

Proof of Theorem 6(a) : Consider two independent random vectors X = (X(1), . . ., X(d))T Fj and X1 = (X(1) 1 , . . . , X(d) 1 )T Fj, where 1 j J. It follows from (C1) and

(C2) that X X1 /

2σ2 j as d . So, for almost every realization x of X Fj,

2σ2 i as d . (10)

Next, consider two independent random vectors X Fj and X1 Fi for 1 i = j J.

Using (C1) and (C2), we get X X1 /

σ2 j + σ2 i + νji as d . Consequently, for almost every realization x of X Fj

σ2 j + σ2 i + νji as d . (11)

Dutta, Sarkar and Ghosh

Let us next consider x X1, x X2 , where X Fj, X1, X2 Fi are independent random vectors, and , denotes the inner product in Rd. Therefore, for almost every realization x of X, arguments similar to those used in (10) and (11) yield

a.s. σ2 j as d if 1 i = j J, and (12)

a.s. σ2 j + νji as d if 1 i = j J. (13)

Observe now that EFj[u(x X)] 2 = EFj[u(x X1)], EFj[u(x X2)] = EFj[ u(x X1), u(x X2) ], where X1, X2 Fj are independent random vectors for 1 j J. Since we are dealing with expectations of random vectors with bounded norm, a simple application of DCT implies that for almost every realization x of X Fj (1 j J), as d ,

SPD (x, Fj) a.s. 1

1 2 and SPD (x, Fi) a.s. 1

σ2 j + νji σ2 j + σ2 i + νji for i = j. (14)

Thus, for X Fj, we get z (X) = (SPD (X, F1), . . . , SPD (X, FJ))T a.s. cj as d .

Proof of Theorem 6(b) : Recall that for h > 1, LSPD h(x, F) = EF [hd Kh(t)] EF [hd Kh(t)u(t)] . Since we have assumed Xs to be standardized, here we get hd Kh(t) = K((x X)/h) = g0( x X /h). Let X Fj and Xi Fi with 1 i, j J. Using (10) and (11) above, and the continuity of g0, for almost every realization x of X Fj, one obtains the following

a.s. g0(0) or g0(eji A0),

depending on whether

d/h 0 or A0. The proof follows from an application of DCT, and the arguments used in the proof of Theorem 6(a).

Proof of Theorem 6(c) : Since g0(s) 0 as s , using the same argument as used in the proof of Theorem 6(b), for Xi Fi and almost every realization x of X Fj, we have

The proof now follows from a simple application of DCT.

Lemma 11 Recall cj and c j for 1 j J deﬁned in Theorem 6(a) and (b), respectively. For any 1 j = i J, cj = ci if and only if σj = σi and νji = νij = 0. Similarly, c j = c i if and only if σj = σi and νji = νij = 0.

Proof of Lemma 11 : The if part is easy to check in both cases. So, it is enough to prove the only if part and that too for the case of J = 2. If c1 = (c11, c12)T and c2 = (c21, c22)T

are equal, then we have

σ2 1 + ν12 σ2 1 + σ2 2 + ν12 = 1/2 and σ2 2 + ν12 σ2 1 + σ2 2 + ν12 = 1/2.

Multi-scale Classification using Localized Spatial Depth

These two equations hold simultaneously only if σ2 1 = σ2 2 and ν12 = ν21 = 0. Consider the case c 1 = c 2. Recall that c 11 = g0(A0

2σ1)c11, c 22 = g0(A0

2σ2)c22, c 12 = g0(A0 p

σ2 1 + σ2 2 + ν12)c12 and c 21 = g0(A0 p

σ2 2 + σ2 1 + ν21)c21. If possible, assume that σ1 > σ2. This implies that A0 p

σ2 1 + σ2 2 + ν12 > A0

2σ1 and hence we obtain

2σ1) > g0(A0 q

σ2 1 + σ2 2 + ν12) (since g0 is monotonically decreasing). (15)

Also, if σ1 > σ2, we must have

1/2 < σ2 1 σ2 1 + σ2 2 < σ2 1 + ν12 σ2 1 + σ2 2 + ν12 < 1 1 p

σ2 1 + ν12 σ2 1 + σ2 2 + ν12 . (16)

Combining (15) and (16), we have c 11 > c 21, and this implies c 1 = c 2. Similarly, if σ1 < σ2, we get c 12 > c 22 and hence c 1 = c 2. Again, if σ1 = σ2 but ν12 = ν21 > 0, similar arguments lead to c 1 = c 2 . This completes the proof.

C. Agostinelli and M. Romanazzi. Local depth. Journal of Statistical Planning and Inference, 141:817 830, 2010.

U. Alon, N. Barkai, D. A. Notterman, K. Gish, D. Mack, and A. J. Leine. Broad pattern of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, USA, 96: 6745 6750, 1999.

A. Azzalini. The Skew-Normal and Related Families. 2014. Cambridge University Press, Cambridge.

Y. Chen, X. Dang, H. Peng, and H. L. Bart Jr. Outlier detection with the kernelized spatial depth function. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31: 288 305, 2009.

T. M. Cover and P. E. Hart. Nearest neighbor pattern classiﬁcation. IEEE Transactions on Information Theory, 13:21 27, 1967.

J. A. Cuesta-Albertos, M. Febrero-Bande, and M. Oviedo da la Fuente. The DDG-classiﬁer in the functional setting. Test, forthcoming.

R. M. de Jong. Laws of large numbers for dependent heterogeneous processes. Econometric Theory, 11:347 358, 1995.

E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. e1071 : Misc functions of the department of statistics (e1071), TU Wien. R package version 1.5-27, 2011. http: //CRAN.R-project.org/package=e1071.

S. Dutta and A. K. Ghosh. On robust classiﬁcation using projection depth. Annals of the Institute of Statistical Mathematics, 64:657 676, 2012.

Dutta, Sarkar and Ghosh

S. Dutta and A. K. Ghosh. On some transformations of high dimension, low sample size data for nearest neighbor classiﬁcation. Machine Learning, 102:57 83, 2016.

S. Dutta, A. K. Ghosh, and P. Chaudhuri. Some intriguing properties of Tukey s half-space depth. Bernoulli, 17:1420 1434, 2011.

S. Dzeroski and B. Zenko. Is combining classiﬁers better than selecting the best one? Machine Learning, 54:255 273, 2004.

K. T. Fang, S. Kotz, and K. W. Ng. Symmetric Multivariate and Related Distributions. 1990. Chapman & Hall, London.

J. Friedman. Another approach to polychotomous classiﬁcation. Technical report, Dept. of Statistics, Stanford University, 1996. http://old.cba.ua.edu/~mhardin/poly.pdf.

Y. Gao. Data depth based on spatial rank. Statistics and Probability Letters, 65:217 225, 2003.

A. K. Ghosh and P. Chaudhuri. On data depth and distribution free discriminant analysis using separating surfaces. Bernoulli, 11:1 27, 2005a.

A. K. Ghosh and P. Chaudhuri. On maximum depth and related classiﬁers. Scandinavian Journal of Statistics, 32:328 350, 2005b.

A. K. Ghosh, P. Chaudhuri, and C. A. Murthy. On visualization and aggregation of nearest neighbor classiﬁers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1592 1602, 2005.

A. K. Ghosh, P. Chaudhuri, and D. Sengupta. Classiﬁcation using kernel density estimates : Multiscale analysis and visualization. Technometrics, 48:120 132, 2006.

P. Hall, J. S. Marron, and A. Neeman. Geometric representation of high dimension low sample size data. Journal of the Royal Statistical Society: Series B, 67:427 444, 2005.

T. Hastie and R. Tibshirani. Generalized Additive Models. 1990. Chapman & Hall, London.

T. Hastie and R. Tibshirani. Classiﬁcation by pairwise coupling. Annals of Statistics, 26: 451 471, 1998.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009. Springer, New York.

R. Hoberg and K. Mosler. Data analysis and classiﬁcation with the zonoid depth. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 72:45 59, 2006.

Y. Hu, Y. Wang, Y. Wu, Q. Li, and C. Hou. Generalized Mahalanobis depth in the reproducing kernel Hilbert space. Statistical Papers, 52:511 522, 2011.

R. Jornsten. Clustering and classiﬁcation based on the L1 data depth. Journal of Multivariate Analysis, 90:67 89, 2004.

Multi-scale Classification using Localized Spatial Depth

S. Jung and J. S. Marron. PCA consistency in high dimension, low sample size context. Annals of Statistics, 37:4104 4130, 2009.

J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classiﬁers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:226 239, 1998.

D. Kosiorowski and Z. Zawadzki. Depth Proc: An R Package for Robust Exploration of Multidimensional Economic Phenomena, 2016.

T. Lange, K. Mosler, and P. Mozharovskyi. Fast nonparametric classiﬁcation based on data depth. Statistical Papers, 55:49 69, 2014.

E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems (NIPS), 17:777 784, 2004. MIT Press, Cambridge, MA.

J. Li, J. A. Cuesta-Albertos, and R. Liu. Nonparametric classiﬁcation procedures based on DD-plot. Journal of the American Statistical Association, 107:737 753, 2012.

R. Z. Li, K. T. Fang, and L. X. Zhu. Some Q-Q probabillity plots to test spherical and elliptic symmetry. Journal of Computational and Graphical Statistics, 6:435 450, 1997.

A. Liaw and M. Wiener. Classiﬁcation and regression by random forest. R News, 2:18 22, 2002.

R. Liu, J. Parelius, and K. Singh. Multivariate analysis of data depth : Descriptive statistics and inference. Annals of Statistics, 27:783 858, 1999.

W. S. Lok and S. M. S. Lee. A new statistical depth function with applications to multimodal data. Journal of Nonparametric Statistics, 23:617 631, 2011.

J. S. Marron, M. J. Todd, and J. Ahn. Distance weighted discrimination. Journal of the American Statistical Association, 102:1267 1271, 2007.

P. Mozharovskyi, K. Mosler, and T. Lange. Classifying real-world data with the DDα

procedure. Advances in Data Analysis and Classiﬁcation, 9:287 314, 2015.

D. Paindaveine and G. Van Bever. From depth to local depth : a focus on centrality. Journal of the American Statistical Association, 105:1105 1119, 2013.

D. Paindaveine and G. Van Bever. Nonparametrically consistent depth-based classiﬁers. Bernoulli, 21:62 82, 2015.

G. M. Reaven and R. G. Miller. An attempt to deﬁne the nature of chemical diabetes using a multidimensional analysis. Diabetologia, 16:17 24, 1979.

B. Ripley. tree: Classiﬁcation and regression trees. R package version 1.0-29, 2011.

P. J. Rousseeuw and K. Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212 223, 1999.

Dutta, Sarkar and Ghosh

J. P. Royston. Some techniques for assessing multivariate normality based on the Shapiro Wilk W. Technometrics, 32:121 133, 1983.

D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. 2015. Wiley, Hoboken, New Jersey.

R. Serﬂing. A depth function and a scale curve based on spatial quantiles. In Statistics and Data Analysis based on L1-Norm and Related Methods (Y. Dodge ed.), pages 25 38. 2002. Birkhaeuser.

Y. Vardi and C. H. Zhang. The multivariate L1-median and associated data depth. Proceedings of the National Academy of Sciences, USA, 97:1423 1426, 2000.

S. N. Wood. Generalized Additive Models: An Introduction with R. 2006. Chapman & Hall/CRC, Boca Raton, FL.

C. Xia, L. Lin, and G. Yang. An extended projection data depth and its applications to discrimination. Communication in Statistics - Theory and Methods, 37:2276 2290, 2008.

Thomas W. Yee. The VGAM package. R News, 8(2):28 39, 2008.

V. V. Yurinskii. Exponential inequalities for sums of random vectors. Journal of Multivariate Analysis, 6:473 499, 1976.

Y. Zuo and R. Serﬂing. General notions of statistical depth function. Annals of Statistics, 28:461 482, 2000.