# RaSE: Random Subspace Ensemble Classification

Journal of Machine Learning Research 22 (2021) 1-93. Submitted 6/20; Revised 1/21; Published 1/21.

Ye Tian (ye.t@columbia.edu), Department of Statistics, Columbia University, New York, NY 10027, USA

Yang Feng (yang.feng@nyu.edu), Department of Biostatistics, School of Global Public Health, New York University, New York, NY 10003, USA

Editor: Jie Peng

**Abstract**

We propose a flexible ensemble classification framework, Random Subspace Ensemble (RaSE), for sparse classification. In the RaSE algorithm, we aggregate many weak learners, where each weak learner is a base classifier trained in a subspace optimally selected from a collection of random subspaces. To conduct subspace selection, we propose a new criterion, the ratio information criterion (RIC), based on weighted Kullback-Leibler divergence. The theoretical analysis covers the risk and Monte-Carlo variance of the RaSE classifier, establishes the screening consistency and weak consistency of RIC, and provides an upper bound for the misclassification rate of the RaSE classifier. In addition, we show that in a high-dimensional framework, the number of random subspaces needs to be very large to guarantee that a subspace covering the signals is selected. Therefore, we propose an iterative version of the RaSE algorithm and prove that, under some specific conditions, a smaller number of generated random subspaces is needed to find a desirable subspace through iteration. An array of simulations under various models and real-data applications demonstrate the effectiveness and robustness of the RaSE classifier and its iterative version in terms of low misclassification rate and accurate feature ranking. The RaSE algorithm is implemented in the R package RaSEn on CRAN.

Keywords: Random Subspace Method, Ensemble Classification, Sparsity, Information Criterion, Consistency, Feature Ranking, High-Dimensional Data
## 1. Introduction

Ensemble classification is a very popular framework for carrying out classification tasks; it typically combines the results of many weak learners to form the final classification. It aims at improving both the accuracy and the stability of weak classifiers and usually leads to better performance than any individual weak classifier (Rokach, 2010). Two prominent examples of ensemble classification are bagging (Breiman, 1996, 1999) and random forests (Breiman, 2001), which focus on decision trees and improve performance by bootstrapping the training data and/or randomly selecting the splitting dimensions in different trees, respectively. Boosting is another example, which converts a weak learner that performs only slightly better than random guessing into a strong learner achieving arbitrarily high accuracy (Freund and Schapire, 1995).

©2021 Ye Tian and Yang Feng. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-600.html.

Recently, several new ensemble ideas have appeared. Blaser and Fryzlewicz (2016) aggregate decision trees with random rotations to obtain an ensemble classifier: the feature space is randomly rotated each time before a decision tree is fitted. Random rotations make it possible for the tree-based classifier to arrive at oblique solutions, increasing the flexibility of the decision trees. As a variant, to make the ensembles favor simple base learners, Blaser and Fryzlewicz (2019) proposed a regularized random rotation algorithm. Another popular framework for ensemble classifiers is via random projection. Durrant and Kabán (2015) studied a random projection ensemble classifier with linear discriminant analysis (LDA) and developed its theoretical properties. Furthermore, Cannings and Samworth (2017) proposed a very general framework for random projection ensembles. Each weak learner first randomly projects the original feature space into a low-dimensional subspace, and then the base classifier is trained in that new space. The choice of the base classifier is flexible, and it was shown that the random projection ensemble classifier performs competitively and has desirable theoretical properties. There are two key aspects of their framework. One is that, since naïvely aggregating all projections might lead to poor performance, they first select good projections and aggregate only those. The other is that they tune the decision threshold instead of applying the naïve majority vote. These two ideas will also appear in the framework we propose. To make the random projections include more important features in the linear combinations, Mukhopadhyay and Dunson (2019) proposed a targeted random projection ensemble approach, which includes each variable with probability proportional to its marginal utility.

Another example of ensemble classification is the random subspace method, which was first studied in the context of decision trees (Ho, 1998). As the name suggests, it randomly selects a feature subset and grows each tree within the chosen subspace. A similar idea is used in random forests when the splitting of each tree is restricted to a subset of features. The random subspace method is closely related to other aggregation-based approaches, including the bootstrap procedure for features (Boot and Nibbering, 2020). Also, as Cannings and Samworth (2017) pointed out, the random subspace method can be regarded as the random projection ensemble classification method with the projection space restricted to be axis-aligned. Compared with other ensemble approaches, the random subspace method preserves the data structure by sticking to the original features, which can be helpful for interpretation and provides a direct way to perform feature ranking.
It has been coupled with various base classifiers, including linear discriminant analysis (Skurichina and Duin, 2002) and the k-nearest neighbor classifier (Bay, 1998), and combined with other techniques such as boosting (García-Pedrajas and Ortiz-Boyer, 2008). A related approach is random partition (Ahn et al., 2007), where the whole space is partitioned into multiple parts. Bryll et al. (2003) introduced optimization ideas into the random subspace framework, selecting optimal subspaces by evaluating how the corresponding fitted models perform on the training data. Despite these developments, most existing works lack theoretical support, and to the best of our knowledge, research on the link between random subspaces and feature ranking is scarce. Furthermore, the existing literature usually considers the ensemble of all generated random subspaces, which may not be wise in the sparse classification regime, since many random subspaces will contain no signals. Our new ensemble framework on random subspaces is designed to tackle sparse classification problems with theoretical guarantees. Instead of naïvely aggregating all generated random subspaces, we divide them into groups and keep only the optimally performing subspace within each group to construct the ensemble classifier.

Feature ranking and selection are of critical importance in many real-world applications. For example, in disease diagnosis, beyond obtaining an accurate prediction for patients, we are also interested in understanding how each feature contributes to the prediction, which can facilitate the advancement of medical science. It has been widely acknowledged that in many high-dimensional classification problems, only a handful of features are useful, with the rest being irrelevant. This is sometimes referred to as the sparse classification problem, which we briefly review next. Bickel et al. (2004) showed that linear discriminant analysis (LDA) is equivalent to random guessing in the worst-case scenario when the sample size is smaller than the dimensionality. Exploiting the underlying sparsity plays a significant role in improving the performance of classic methods, including LDA and quadratic discriminant analysis (QDA) (Mai et al., 2012; Jiang et al., 2018; Hao et al., 2018; Fan et al., 2012; Shao et al., 2011; Fan et al., 2015; Li and Shao, 2015). While those methods work well under their corresponding models, it is not clear how to conduct feature ranking with other types of base classifiers.

In this work, we propose a flexible ensemble classification framework, named Random Subspace Ensemble (RaSE), which can be combined with any base classifier and provides feature ranking as a by-product. The main mechanism of RaSE is briefly described as follows. Suppose the observation pair $(x, y)$ takes values in $\mathcal{X} \times \{0, 1\}$, where $\mathcal{X}$ is an open subset of $\mathbb{R}^p$, $p$ is a positive integer, and $y$ is the class label. Assume the training set consists of $n$ observation pairs $\{(x_i, y_i), i = 1, \ldots, n\}$. We use $C_n^{S,T}(x) \in \{0, 1\}$ to represent the prediction of the classifier trained with only the features in subset $S$ when the base classifier is $T$. For the $j$-th ($j \in \{1, \ldots, B_1\}$) weak learner, $B_2$ random subspaces $\{S_{jk}\}_{k=1}^{B_2}$ are generated and the optimal one, $S_{j*}$, is selected according to some criterion to be specified. Then this weak learner is trained using only the slice of the training samples in the subspace $S_{j*}$. Finally, the $B_1$ weak classifiers $C_n^{S_{1*},T}, \ldots, C_n^{S_{B_1*},T}$ are aggregated to form the decision function

$$C_n^{RaSE}(x) = 1\left(\frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S_{j*},T}(x) > \alpha\right),$$

where $\alpha$ is a threshold to be determined. This framework contributes to the research on ensemble methods and feature ranking in the following aspects.
First, it admits a flexible framework in which any classification algorithm can serve as the base classifier. Examples include standard LDA, QDA, the k-nearest neighbor classifier (kNN), support vector machines (SVM), and decision trees. Second, the ensemble process naturally implies a ranking of variable importance via the frequencies with which the variables appear in the $B_1$ subspaces $\{S_{j*}\}_{j=1}^{B_1}$. For several specific sparse classification problems, equipped with a new information criterion named the ratio information criterion (RIC), RaSE is shown to cover the minimal discriminative set (to be defined later) for each weak learner with high probability when $B_2$ and the sample size $n$ are sufficiently large.

Although the RaSE framework shares some similarities with the random projection (RP) framework of Cannings and Samworth (2017), there are several essential differences between them. First, the key workhorse behind RaSE is the search for a desirable subspace that covers the signals, which makes it amenable to feature ranking and selection. The RP framework, on the other hand, is not naturally designed for feature ranking. Second, a key condition (Assumption 2 in Cannings and Samworth (2017)) needed to guarantee the success of RP implies that, under the criterion for choosing the optimal random projection, each random projection does not deviate much from the optimal one with a non-zero probability that is independent of $n$ and $p$; this may not be satisfied in the high-dimensional setting. RaSE, however, assumes a set of conditions that explicitly take the high dimensionality into account, which leads to screening consistency and weak consistency. Third, we propose a new information criterion, RIC, and analyze its theoretical properties in the high-dimensional setting. Fourth, motivated by the stringent requirement of a large $B_2$ for the vanilla RaSE (see Sections 3.2 and 3.5), we propose the iterative RaSE, which relaxes the requirement on $B_2$ by taking advantage of the feature ranking from preceding steps.

The rest of this paper is organized as follows. In Section 2, we introduce the RaSE algorithm and discuss some important concepts, including the minimal discriminative set and RIC. At the end of Section 2, an iterative version of the RaSE algorithm is presented. In Section 3, theoretical properties of RaSE and RIC are investigated, including the impact of $B_1$ on the risk and Monte-Carlo variance of the RaSE classifier, the screening consistency and weak consistency of RIC, the upper bound on the expected misclassification rate, and the theoretical analysis of the iterative RaSE algorithm. In Section 4, we discuss several important computational issues in the RaSE algorithm, tuning parameter selection, and how to apply the RaSE framework for feature ranking. Section 5 focuses on numerical studies, with extensive simulations and several real-data applications through which we compare RaSE with various competing methods. The results frequently place RaSE classifiers among the top-ranked methods and also show their effectiveness in feature ranking. Finally, we summarize our contributions and point out a few potential directions for future work in Section 6. We present some additional results for the empirical studies in Appendix A, and all proofs are relegated to Appendix B.

## 2. Methodology

Recall that we have $n$ pairs of observations $\{(x_i, y_i), i = 1, \ldots, n\} \overset{\text{i.i.d.}}{\sim} (x, y) \in \mathcal{X} \times \{0, 1\}$, where $\mathcal{X}$ is an open subset of $\mathbb{R}^p$, $p$ is a positive integer, and $y \in \{0, 1\}$ is the class label. We use $S_{\text{Full}} = \{1, \ldots, p\}$ to represent the whole feature set. We assume the marginal densities of $x$ for class 0 ($y = 0$) and class 1 ($y = 1$) exist and denote them by $f^{(0)}$ and $f^{(1)}$, respectively.
The corresponding probability measures they induce are denoted by $P^{(0)}$ and $P^{(1)}$. Thus, the joint distribution of $(x, y)$ can be described by the following mixture model:

$$x \mid y = y_0 \sim (1 - y_0)f^{(0)} + y_0 f^{(1)}, \quad y_0 = 0, 1, \tag{1}$$

where $y$ is a Bernoulli variable with success probability $\pi_1 = 1 - \pi_0 \in (0, 1)$. For any subspace $S$, we use $|S|$ to denote its cardinality. Denote by $P_x$ the probability measure induced by the marginal distribution of $x$, which is in fact $\pi_0 P^{(0)} + \pi_1 P^{(1)}$. When restricting to the feature subspace $S$, the corresponding marginal densities of classes 0 and 1 are denoted by $f^{(0)}_S$ and $f^{(1)}_S$, respectively.

### 2.1 Random Subspace Ensemble Classification (RaSE)

Motivated by Cannings and Samworth (2017), to train each weak learner (say the $j$-th one), $B_2$ independent random subspaces $S_{j1}, \ldots, S_{jB_2}$ are generated. Then, according to some specific criterion (to be introduced in Section 2.3), the optimal subspace $S_{j*}$ is selected and the $j$-th weak learner is trained only in $S_{j*}$. In this way, $B_1$ weak classifiers $\{C_n^{S_{j*},T}\}_{j=1}^{B_1}$ are obtained. Finally, we aggregate the outputs of $\{C_n^{S_{j*},T}\}_{j=1}^{B_1}$ to form the final decision function by taking a simple average. The whole procedure is summarized in the following algorithm.

**Algorithm 1:** Random subspace ensemble classification (RaSE)

Input: training data $\{(x_i, y_i)\}_{i=1}^n$, new data $x$, subspace distribution $\mathcal{D}$, criterion $\mathcal{C}$, integers $B_1$ and $B_2$, type of base classifier $T$.
Output: predicted label $C_n^{RaSE}(x)$, the selected proportion of each feature $\eta$.

1. Independently generate random subspaces $S_{jk} \sim \mathcal{D}$, $1 \le j \le B_1$, $1 \le k \le B_2$
2. for $j \leftarrow 1$ to $B_1$ do
3. &nbsp;&nbsp;Select the optimal subspace $S_{j*}$ from $\{S_{jk}\}_{k=1}^{B_2}$ according to $\mathcal{C}$ and $T$
4. end
5. Construct the ensemble decision function $\nu_n(x) = \frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S_{j*},T}(x)$
6. Set the threshold $\hat{\alpha}$ according to (2)
7. Output the predicted label $C_n^{RaSE}(x) = 1(\nu_n(x) > \hat{\alpha})$ and the selected proportion of each feature $\eta = (\eta_1, \ldots, \eta_p)^T$, where $\eta_l = B_1^{-1}\sum_{j=1}^{B_1} 1(l \in S_{j*})$, $l = 1, \ldots, p$

In Algorithm 1, the subspace distribution $\mathcal{D}$ is by default a hierarchical uniform distribution over the subspaces. Specifically, with $D$ denoting the upper bound on the subspace size (how to set $D$ in practice will be discussed in Section 4.1), we first generate the subspace size $d$ from the uniform distribution over $\{1, \ldots, D\}$. Then, the subspace $S_{11}$ follows the uniform distribution over $\{S \subseteq S_{\text{Full}} : |S| = d\}$. In practice, the subspace distribution can be adjusted if we have prior information about the data structure. In Step 6 of Algorithm 1, we set the decision threshold to minimize the empirical classification error on the training set,

$$\hat{\alpha} = \arg\min_{\alpha \in [0,1]}\left[\hat{\pi}_0\left(1 - \hat{G}^{(0)}_n(\alpha)\right) + \hat{\pi}_1 \hat{G}^{(1)}_n(\alpha)\right], \tag{2}$$

where

$$\hat{\pi}_r = \frac{1}{n}\sum_{i=1}^n 1(y_i = r), \quad \hat{G}^{(r)}_n(\alpha) = \frac{1}{n\hat{\pi}_r}\sum_{i=1}^n 1(y_i = r)1(\nu_n(x_i) \le \alpha), \quad r = 0, 1,$$

and $\nu_n(x_i) = B_1^{-1}\sum_{j=1}^{B_1} 1(C_n^{S_{j*},T}(x_i) = 1)$.

### 2.2 Minimal Discriminative Set

For sparse classification problems, it is important to accurately separate signals from noise. Motivated by Kohavi et al. (1997) and Zhang and Wang (2011), we define the discriminative set and study some of its properties as follows.

**Definition 1** A feature subset $S$ is called a discriminative set if $y$ is conditionally independent of $x_{S^c}$ given $x_S$, where $S^c = S_{\text{Full}} \setminus S$. We call $S$ a minimal discriminative set if it has minimal cardinality among all discriminative sets.

**Assumption 1** The densities $f^{(0)}$ and $f^{(1)}$ have the same support a.s. with respect to $P_x$.

**Remark 2** Note that the existence of densities and the common-support requirement are not necessary for the definition of the minimal discriminative set and the RaSE framework. We focus on continuous distributions purely for notational convenience. We will discuss this assumption again after introducing our information criterion in Definition 6.

**Proposition 3** Under Assumption 1, we can characterize the discriminative set through the marginal density ratio via the following two facts.
(i) If $S$ is a discriminative set, then

$$\frac{f^{(1)}(x)}{f^{(0)}(x)} = \frac{f^{(1)}_S(x_S)}{f^{(0)}_S(x_S)}$$

almost surely with respect to $P_x$.

(ii) If for a feature subset $S$ there exists a function $h : \mathbb{R}^{|S|} \to [0, +\infty]$ such that

$$\frac{f^{(1)}(x)}{f^{(0)}(x)} = h(x_S)$$

almost surely with respect to $P_x$, then $S$ is a discriminative set and $h(x_S) = f^{(1)}_S(x_S)/f^{(0)}_S(x_S)$ almost surely with respect to $P_x$.

In general, there may exist more than one minimal discriminative set. For instance, if two features are exactly identical, then more than one minimal discriminative set may exist, since we cannot distinguish between them. To rule out this type of degenerate scenario, we impose a uniqueness assumption on the minimal discriminative set.

**Assumption 2** There is only one minimal discriminative set, denoted by $S^*$. In addition, all discriminative sets cover $S^*$.

In the classification problem, we are often interested in the risk of a classifier $C$. Under the 0-1 loss, the risk, or misclassification rate, is defined as $R(C) = E[1(C(x) \ne y)] = P(C(x) \ne y)$. The Bayes classifier

$$C^{\text{Bayes}}(x) = \begin{cases} 1, & P(y = 1 \mid x) > \frac{1}{2}, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

is known to achieve the minimal risk among all classifiers (Devroye et al., 2013). If only the features in $S$ are used, we obtain a local Bayes classifier $C_S^{\text{Bayes}}(x_S)$, which achieves the minimal risk among all classifiers using only the features in $S$. In general, there is no guarantee that $R(C_S^{\text{Bayes}}) = R(C^{\text{Bayes}})$. Fortunately, equality holds when $S$ is a discriminative set.

**Proposition 4** For any discriminative set $S$, it holds that $R(C_S^{\text{Bayes}}) = R(C_{S^*}^{\text{Bayes}}) = R(C^{\text{Bayes}})$.

This direct result shows that covering $S^*$ is sufficient to obtain performance as good as the Bayes classifier. To clarify the notions above, we next take the two-class Gaussian settings as examples and investigate what $S^*$ is in each case.

**Example 1 (LDA)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma)$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma)$, where $\Sigma$ is positive definite. The log-density ratio is

$$\log\frac{f^{(0)}(x)}{f^{(1)}(x)} = C - (\mu^{(1)} - \mu^{(0)})^T \Sigma^{-1} x,$$

where $C$ is a constant independent of $x$. By Proposition 3, the minimal discriminative set is $S^* = \{j : [\Sigma^{-1}(\mu^{(1)} - \mu^{(0)})]_j \ne 0\}$. This definition is equivalent to that in Mai et al. (2012) for the LDA case.

**Example 2 (QDA)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma^{(0)})$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma^{(1)})$, where $\Sigma^{(0)}$ and $\Sigma^{(1)}$ are positive definite matrices with $\Sigma^{(0)} \ne \Sigma^{(1)}$. Then the log-density ratio is

$$\log\frac{f^{(0)}(x)}{f^{(1)}(x)} = C + \frac{1}{2}x^T\left[(\Sigma^{(1)})^{-1} - (\Sigma^{(0)})^{-1}\right]x + \left[(\Sigma^{(0)})^{-1}\mu^{(0)} - (\Sigma^{(1)})^{-1}\mu^{(1)}\right]^T x, \tag{4}$$

where $C$ is a constant independent of $x$. Let $S^*_l = \{j : [(\Sigma^{(0)})^{-1}\mu^{(0)} - (\Sigma^{(1)})^{-1}\mu^{(1)}]_j \ne 0\}$ and $S^*_q = \{j : [(\Sigma^{(1)})^{-1} - (\Sigma^{(0)})^{-1}]_{ij} \ne 0 \text{ for some } i\}$. The elements of $S^*_l$ are often called variables with main effects, while the elements of $S^*_q$ are called variables with quadratic effects (Hao et al., 2018; Fan et al., 2015; Jiang et al., 2018). By Proposition 3, the minimal discriminative set is $S^* = S^*_l \cup S^*_q$.

**Proposition 5** If $f^{(0)} \sim N(\mu^{(0)}, \Sigma^{(0)})$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma^{(1)})$, where $\Sigma^{(0)}$ and $\Sigma^{(1)}$ are positive definite matrices, then the following conclusions hold:

(i) The minimal discriminative set $S^*$ is unique;

(ii) For any discriminative set $S$, we have $S \supseteq S^*$;

(iii) Any set $S \supseteq S^*$ is a discriminative set. This conclusion also holds without the Gaussian assumption.

### 2.3 Ratio Information Criterion (RIC)

As discussed in Section 2.2, it is desirable to identify the minimal discriminative set $S^*$ for the classifier to achieve a low misclassification rate. Hence, in Algorithm 1 it is important to apply a proper criterion to select the optimal subspace. In the variable selection literature, a criterion enjoying the property of correctly selecting the minimal discriminative set with high probability is often referred to as a consistent one. For model (1), Zhang and Wang (2011) proved that BIC, in conjunction with a backward elimination procedure, is selection consistent in the Gaussian mixture case.
However, the BIC they investigated involves the joint log-likelihood of $(x, y)$, which requires estimating high-dimensional covariance matrices; this can be problematic when $p$ is close to or larger than $n$ without additional structural assumptions. Here we propose a new criterion, based on the Kullback-Leibler divergence (Kullback and Leibler, 1951), which enjoys weak consistency under the general setting of (1). The two asymmetric Kullback-Leibler divergences between densities $f$ and $g$ are defined as

$$KL(f\|g) = E_{x \sim f}\left[\log\frac{f(x)}{g(x)}\right], \quad KL(g\|f) = E_{x \sim g}\left[\log\frac{g(x)}{f(x)}\right],$$

where $E_{x \sim f}$ represents the expectation with respect to $x \sim f$. In the binary classification model (1), the marginal class probabilities can be crucial, because imbalanced marginal distributions can significantly compromise the performance of most standard learning algorithms (He and Garcia, 2009). Therefore, we consider a weighted version of the two KL divergences for the marginal distributions $f^{(0)}_S$ and $f^{(1)}_S$ on subspace $S$, namely $\pi_0 KL(f^{(0)}_S\|f^{(1)}_S) + \pi_1 KL(f^{(1)}_S\|f^{(0)}_S)$. Denoting by $\hat{f}^{(0)}_S, \hat{f}^{(1)}_S, \hat{\pi}_0, \hat{\pi}_1$ the estimates obtained by plugging in the MLEs of the parameters, it holds that

$$\hat{\pi}_0\,\widehat{KL}(f^{(0)}_S\|f^{(1)}_S) = \frac{1}{n}\sum_{i=1}^n 1(y_i = 0)\log\left[\frac{\hat{f}^{(0)}_S(x_{i,S})}{\hat{f}^{(1)}_S(x_{i,S})}\right], \quad
\hat{\pi}_1\,\widehat{KL}(f^{(1)}_S\|f^{(0)}_S) = \frac{1}{n}\sum_{i=1}^n 1(y_i = 1)\log\left[\frac{\hat{f}^{(1)}_S(x_{i,S})}{\hat{f}^{(0)}_S(x_{i,S})}\right].$$

Now we are ready to introduce the new criterion, named the ratio information criterion (RIC), which adds a proper penalty term.

**Definition 6** For model (1), the ratio information criterion (RIC) for feature subspace $S$ is defined as

$$\text{RIC}_n(S) = -2\left[\hat{\pi}_0\,\widehat{KL}(f^{(0)}_S\|f^{(1)}_S) + \hat{\pi}_1\,\widehat{KL}(f^{(1)}_S\|f^{(0)}_S)\right] + c_n \deg(S), \tag{5}$$

where $c_n$ is a function of the sample size $n$ and $\deg(S)$ is the degree of freedom of the model with subspace $S$.

**Remark 7** Assumption 1 is necessary to ensure that RIC is well defined. To see this, note that the existence of both Kullback-Leibler divergences requires (1) $f^{(0)}(x) = 0 \Rightarrow f^{(1)}(x) = 0$ a.s. with respect to $P^{(1)}$, and (2) $f^{(1)}(x) = 0 \Rightarrow f^{(0)}(x) = 0$ a.s. with respect to $P^{(0)}$. This is equivalent to Assumption 1 when $\pi_0, \pi_1 > 0$.

Note that although AIC is also motivated by the Kullback-Leibler divergence, it aims to minimize the KL divergence between the estimated density and the true density (Burnham and Anderson, 1998). In our case, however, the goal is to maximize the KL divergence between the conditional densities of the two classes, to achieve a greater separation. Next, let us work out a few familiar examples where explicit expressions for RIC exist.

**Proposition 8 (RIC for the LDA model)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma)$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma)$, where $\Sigma$ is positive definite. The MLEs of the parameters are

$$\hat{\mu}^{(r)}_S = \frac{1}{n_r}\sum_{i=1}^n 1(y_i = r)x_{i,S}, \quad r = 0, 1, \qquad
\hat{\Sigma}_{S,S} = \frac{1}{n}\sum_{i=1}^n \sum_{r=0}^1 1(y_i = r)(x_{i,S} - \hat{\mu}^{(r)}_S)(x_{i,S} - \hat{\mu}^{(r)}_S)^T,$$

where $n_r = \sum_{i=1}^n 1(y_i = r)$. Then we have

$$\text{RIC}_n(S) = -(\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S)^T\hat{\Sigma}_{S,S}^{-1}(\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S) + c_n(|S| + 1).$$

**Proposition 9 (RIC for the QDA model)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma^{(0)})$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma^{(1)})$, where $\Sigma^{(0)}, \Sigma^{(1)}$ are positive definite but not necessarily equal. The MLEs $\{\hat{\mu}^{(r)}_S, r = 0, 1\}$ are the same as in Proposition 8, and

$$\hat{\Sigma}^{(r)}_{S,S} = \frac{1}{n_r}\sum_{i=1}^n 1(y_i = r)(x_{i,S} - \hat{\mu}^{(r)}_S)(x_{i,S} - \hat{\mu}^{(r)}_S)^T, \quad r = 0, 1.$$

Then we have

$$\begin{aligned}
\text{RIC}_n(S) = {}& -(\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S)^T\left[\hat{\pi}_1(\hat{\Sigma}^{(0)}_{S,S})^{-1} + \hat{\pi}_0(\hat{\Sigma}^{(1)}_{S,S})^{-1}\right](\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S) \\
& + \text{Tr}\left[\left((\hat{\Sigma}^{(1)}_{S,S})^{-1} - (\hat{\Sigma}^{(0)}_{S,S})^{-1}\right)\left(\hat{\pi}_1\hat{\Sigma}^{(1)}_{S,S} - \hat{\pi}_0\hat{\Sigma}^{(0)}_{S,S}\right)\right] \\
& + (\hat{\pi}_1 - \hat{\pi}_0)\left(\log|\hat{\Sigma}^{(1)}_{S,S}| - \log|\hat{\Sigma}^{(0)}_{S,S}|\right) + c_n\frac{|S|(|S|+3)}{2}.
\end{aligned}$$

Note that the primary term of RIC in the LDA case is the negative Mahalanobis distance (McLachlan, 1999), which is closely related to the Bayes error of the LDA classifier (Efron, 1975). For the QDA case, the KL-divergence part contains three terms. The first is similar to the Mahalanobis distance, representing the contribution of linear signals to the classification model, while the second and third terms represent the contributions of quadratic signals.
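To make the LDA-case formula concrete, here is a minimal numerical sketch of the RIC computation in Proposition 8. This is an illustrative Python translation (the official implementation is the R package RaSEn); the penalty factor `cn` is left to the user, since its recommended choice is discussed later in the paper.

```python
import numpy as np

def ric_lda(X, y, S, cn):
    """RIC for the LDA model (Proposition 8).

    X: (n, p) data matrix; y: (n,) labels in {0, 1};
    S: list of feature indices defining the subspace;
    cn: user-chosen penalty factor (a function of n).
    """
    XS = X[:, S]
    mu0 = XS[y == 0].mean(axis=0)
    mu1 = XS[y == 1].mean(axis=0)
    # Pooled MLE of the common covariance on subspace S
    n = len(y)
    R0 = XS[y == 0] - mu0
    R1 = XS[y == 1] - mu1
    Sigma = (R0.T @ R0 + R1.T @ R1) / n
    delta = mu1 - mu0
    # RIC_n(S) = -delta' Sigma^{-1} delta + cn * (|S| + 1)
    return -delta @ np.linalg.solve(Sigma, delta) + cn * (len(S) + 1)
```

A subspace containing a true signal feature yields a large Mahalanobis term and hence a much smaller (more negative) RIC than a pure-noise subspace, which is exactly what the subspace-selection step in Algorithm 1 exploits.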
Note that the KL divergence can also be estimated by non-parametric methods, including those based on k-nearest neighbor distances (Wang et al., 2009; Ganguly et al., 2018), which sometimes lead to more robust estimates than the parametric ones in our numerical experiments. Specifically, consider two samples $\{x^{(0)}_{i,S}\}_{i=1}^{n_0} \overset{\text{i.i.d.}}{\sim} f^{(0)}_S$ and $\{x^{(1)}_{i,S}\}_{i=1}^{n_1} \overset{\text{i.i.d.}}{\sim} f^{(1)}_S$. Write $\rho_{k_0,0}(x^{(0)}_{j,S})$ for the Euclidean distance between $x^{(0)}_{j,S}$ and its $k_0$-th nearest neighbor in the sample $\{x^{(0)}_{i,S}\}_{i=1}^{n_0} \setminus \{x^{(0)}_{j,S}\}$, and $\rho_{k_1,1}(x^{(0)}_{j,S})$ for the Euclidean distance between $x^{(0)}_{j,S}$ and its $k_1$-th nearest neighbor in the sample $\{x^{(1)}_{i,S}\}_{i=1}^{n_1}$. Wang et al. (2009) and Ganguly et al. (2018) defined the following asymptotically unbiased estimator given $k_0$ and $k_1$:

$$\widehat{KL}(f^{(0)}_S\|f^{(1)}_S) = \frac{|S|}{n_0}\sum_{i=1}^{n_0}\log\frac{\rho_{k_1,1}(x^{(0)}_{i,S})}{\rho_{k_0,0}(x^{(0)}_{i,S})} + \log\frac{n_1}{n_0 - 1} + \Psi(k_0) - \Psi(k_1), \tag{6}$$

where $\Psi$ denotes the digamma function (Abramowitz and Stegun, 1948). Similarly, we can obtain the estimate $\widehat{KL}(f^{(1)}_S\|f^{(0)}_S)$. In addition, Berrett and Samworth (2019) proposed a weighted estimator based on (6) and investigated its efficiency. We will compare the performance of RaSE when using the estimator in (6) with that of the parametric methods in our simulations.

Another line of research on classification studies the conditional distribution of $y \mid x$. For this setup, there is a rich literature on information-type criteria that involve the conditional log-likelihood function. The Akaike information criterion (AIC) (Akaike, 1973) was shown to be inconsistent, while the Bayesian information criterion (BIC) was demonstrated to be consistent under certain regularity conditions (Rao and Wu, 1989). Chen and Chen (2008, 2012) modified the conventional BIC to form the extended BIC (eBIC) for the high-dimensional setting where $p$ grows at a polynomial rate of $n$.
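As a concrete illustration of the estimator in (6), the following self-contained Python sketch computes $\widehat{KL}(f^{(0)}_S\|f^{(1)}_S)$ by brute-force distance computation. It is an illustration only, not the RaSEn implementation; since (6) only needs $\Psi$ at positive integers, we use the exact identity $\Psi(k) = -\gamma + \sum_{i=1}^{k-1} 1/i$ instead of a special-function library.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def psi(k):
    # Digamma at a positive integer: psi(k) = -gamma + sum_{i=1}^{k-1} 1/i
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, k))

def kl_knn(X0, X1, k0=1, k1=1):
    """kNN estimator of KL(f0 || f1) from samples X0 ~ f0, X1 ~ f1, as in (6)."""
    n0, d = X0.shape
    n1 = X1.shape[0]
    # rho_{k0,0}: distance from each point of X0 to its k0-th NN in X0 \ {itself}
    d00 = np.sort(np.linalg.norm(X0[:, None, :] - X0[None, :, :], axis=2), axis=1)
    rho00 = d00[:, k0]            # column 0 is the zero self-distance
    # rho_{k1,1}: distance from each point of X0 to its k1-th NN in X1
    d01 = np.sort(np.linalg.norm(X0[:, None, :] - X1[None, :, :], axis=2), axis=1)
    rho11 = d01[:, k1 - 1]
    return (d / n0) * np.sum(np.log(rho11 / rho00)) \
        + np.log(n1 / (n0 - 1)) + psi(k0) - psi(k1)
```

For well-separated classes (e.g., $N(0,1)$ versus $N(3,1)$, whose true divergence is $9/2$), the estimate is clearly positive, while for two samples from the same distribution it fluctuates around zero.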
Fan and Tang (2013) proved the consistency of a generalized information criterion (GIC) for generalized linear models in ultra-high-dimensional spaces.

### 2.4 Iterative RaSE

The success of the RIC proposed in Section 2.3 relies on the assumption that the minimal discriminative set $S^*$ appears in some of the $B_2$ subspaces for each weak learner. For sparse classification problems, the size of $S^*$ can be very small compared with $p$. When $p$ is large, the probability of generating a subset that covers $S^*$ under the hierarchical uniform distribution is quite low. It turns out that, using the selection frequency of each feature over the $B_1$ subspaces $\{S_{j*}\}_{j=1}^{B_1}$ from Algorithm 1, we can improve the RaSE algorithm by running it again with a new hierarchical distribution for the subspaces. In particular, we first calculate the proportion vector $\eta = (\eta_1, \ldots, \eta_p)^T$ giving the proportion of times each feature appears among the $B_1$ subspaces $\{S_{j*}\}_{j=1}^{B_1}$, where $\eta_l = B_1^{-1}\sum_{j=1}^{B_1} 1(l \in S_{j*})$, $l = 1, \ldots, p$. The new hierarchical distribution is specified as follows. In the first step, we generate the subspace size $d$ from the uniform distribution over $\{1, \ldots, D\}$ as before. Before moving to the second step, note that each subspace $S$ can be equivalently represented as $J = (J_1, \ldots, J_p)^T$, where $J_l = 1(l \in S)$, $l = 1, \ldots, p$. Then we generate $J$ from a restrictive multinomial distribution with parameters $(p, d, \tilde{\eta})$, where $\tilde{\eta} = (\tilde{\eta}_1, \ldots, \tilde{\eta}_p)^T$,

$$\tilde{\eta}_l = \eta_l 1(\eta_l > C_0/\log p) + \frac{C_0}{p}1(\eta_l \le C_0/\log p),$$

and the restriction is that $J_l \in \{0, 1\}$, $l = 1, \ldots, p$. Here $C_0$ is a constant. Intuitively, this strategy can be repeated to increase the probability that the signals in $S^*$ are covered by the subspaces we generate. It can also lead to improved feature ranking according to the updated feature proportions $\eta$. This will be verified via multiple simulation experiments in Section 5.
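The distribution update above can be sketched in Python as follows. Since the exact sampler for the restrictive multinomial distribution is not spelled out in this section, the sketch approximates it (an assumption on my part) by drawing $d$ distinct features with probabilities proportional to $\tilde{\eta}$; the flooring of small frequencies at $C_0/p$ follows the displayed formula.

```python
import numpy as np

def flatten_eta(eta, C0, p):
    """tilde_eta: keep frequencies above C0/log(p), floor the rest at C0/p."""
    thresh = C0 / np.log(p)
    return np.where(eta > thresh, eta, C0 / p)

def sample_subspace(eta, D, C0, rng):
    """Draw one subspace from the updated hierarchical distribution.

    Approximation: a weighted draw of d distinct features stands in for the
    restrictive multinomial distribution with parameters (p, d, tilde_eta).
    """
    p = len(eta)
    eta_tilde = flatten_eta(eta, C0, p)
    d = int(rng.integers(1, D + 1))          # subspace size ~ Uniform{1, ..., D}
    probs = eta_tilde / eta_tilde.sum()
    return np.sort(rng.choice(p, size=d, replace=False, p=probs))
```

Features that were frequently selected in the previous round dominate `probs`, so the next round of $B_1 B_2$ subspaces concentrates on the suspected signals while the $C_0/p$ floor keeps every feature reachable.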
This iterative RaSE algorithm is summarized in Algorithm 2. A related idea was introduced by Mukhopadhyay and Dunson (2019) to generate random projections with probabilities proportional to the marginal utilities. One major difference in RaSE is that feature importance is determined via the features' joint contributions in the selected subspaces.

**Algorithm 2:** Iterative RaSE ($\text{RaSE}_T$)

Input: training data $\{(x_i, y_i)\}_{i=1}^n$, new data $x$, initial subspace distribution $\mathcal{D}^{(0)}$, criterion $\mathcal{C}$, integers $B_1$ and $B_2$, type of base classifier $T$, number of iterations $T$.
Output: predicted label $C_n^{RaSE}(x)$, the proportion of each feature $\eta^{(T)}$.

1. for $t \leftarrow 0$ to $T$ do
2. &nbsp;&nbsp;Independently generate random subspaces $S^{(t)}_{jk} \sim \mathcal{D}^{(t)}$, $1 \le j \le B_1$, $1 \le k \le B_2$
3. &nbsp;&nbsp;for $j \leftarrow 1$ to $B_1$ do
4. &nbsp;&nbsp;&nbsp;&nbsp;Select the optimal subspace $S^{(t)}_{j*}$ from $\{S^{(t)}_{jk}\}_{k=1}^{B_2}$ according to $\mathcal{C}$ and $T$
5. &nbsp;&nbsp;end
6. &nbsp;&nbsp;Update $\eta^{(t)}$, where $\eta^{(t)}_l = B_1^{-1}\sum_{j=1}^{B_1} 1(l \in S^{(t)}_{j*})$, $l = 1, \ldots, p$
7. &nbsp;&nbsp;Update $\mathcal{D}^{(t+1)}$ as the restrictive multinomial distribution with parameters $(p, d, \tilde{\eta}^{(t)})$, where $\tilde{\eta}^{(t)}_l = \eta^{(t)}_l 1(\eta^{(t)}_l > C_0/\log p) + \frac{C_0}{p}1(\eta^{(t)}_l \le C_0/\log p)$ and $d$ is sampled from the uniform distribution over $\{1, \ldots, D\}$
8. end
9. Set the threshold $\hat{\alpha}$ according to (2)
10. Construct the ensemble decision function $\nu_n(x) = \frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S^{(T)}_{j*},T}(x)$
11. Output the predicted label $C_n^{RaSE}(x) = 1(\nu_n(x) > \hat{\alpha})$ and $\eta^{(T)}$

## 3. Theoretical Studies

In this section, we investigate various theoretical properties of RaSE, including the impact of $B_1$ on the risk of the RaSE classifier and on the expected misclassification rate. Furthermore, we will demonstrate that RIC achieves weak consistency. Throughout this section, we allow the dimension $p$ to grow with the sample size $n$. To streamline the presentation, we first introduce some additional notation. In this paper, there are three sources of randomness: (1) the randomness of the training data, (2) the randomness of the subspaces, and (3) the randomness of the test data. We use the following notation to differentiate them. Analogously to Cannings and Samworth (2017), we write $P^*$ and $E^*$ for the probability and expectation with respect to the collection of $B_1 B_2$ random subspaces $\{S_{jk} : 1 \le j \le B_1, 1 \le k \le B_2\}$; $\bar{P}$ and $\bar{E}$ are used when the randomness comes from the training data $\{(x_i, y_i)\}_{i=1}^n$; and we use $P$ and $E$ when considering all three sources of randomness.

Recall that in Algorithm 1 the decision function is $\nu_n(x) = \frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S_{j*},T}(x)$. For a given threshold $\alpha \in (0, 1)$, the RaSE classifier is

$$C_n^{RaSE}(x) = \begin{cases} 1, & \nu_n(x) > \alpha, \\ 0, & \nu_n(x) \le \alpha. \end{cases}$$

By the weak law of large numbers, as $B_1 \to \infty$, $\nu_n$ converges in probability to its expectation $\mu_n(x) = P^*\left(C_n^{S_{1*},T}(x) = 1\right)$. Note that since both the training data and the criterion $\mathcal{C}$ in Algorithm 1 are fixed, $S_{1*}$ is a deterministic function of $\{S_{1k}\}_{k=1}^{B_2}$. Nevertheless, $\mu_n(x)$ is still random due to the randomness of the new observation $x$. It is then helpful to define the conditional cumulative distribution functions of $\mu_n$ for classes 0 and 1, respectively, as $G^{(0)}_n(\alpha') = P(\mu_n(x) \le \alpha' \mid y = 0) = P^{(0)}(\mu_n(x) \le \alpha')$ and $G^{(1)}_n(\alpha') = P(\mu_n(x) \le \alpha' \mid y = 1) = P^{(1)}(\mu_n(x) \le \alpha')$. Since the distribution $\mathcal{D}$ of subspaces is discrete, $\mu_n(x)$ takes finitely many distinct values almost surely, implying that $G^{(0)}_n$ and $G^{(1)}_n$ are step functions. We denote the corresponding probability mass functions by $g^{(0)}_n$ and $g^{(1)}_n$, respectively. Since for any $x$, $\nu_n(x) \to \mu_n(x)$ as $B_1 \to \infty$, we consider the following randomized RaSE classifier in population,

$$C_n^{RaSE*}(x) = \begin{cases} 1, & \mu_n(x) > \alpha, \\ 0, & \mu_n(x) < \alpha, \\ \text{Bernoulli}\left(\frac{1}{2}\right), & \mu_n(x) = \alpha, \end{cases}$$

as the infinite-simulation RaSE classifier with $B_1 \to \infty$. In the following sections, we study various properties of $C_n^{RaSE*}$. In Section 3.1, we condition on the training data and study the impact of $B_1$ via the relationship between the test errors of $C_n^{RaSE}$ and $C_n^{RaSE*}$ and the Monte-Carlo variance $\text{Var}^*(R(C_n^{RaSE}))$, which reflects the stability of the RaSE classifier as $B_1$ increases.
It will be demonstrated that, conditioned on the training data, both the difference between the test errors of C_n^RaSE and C̄_n^RaSE and the Monte-Carlo variance of R(C_n^RaSE) converge to zero at an exponential rate as B1 → ∞, for almost every threshold α ∈ (0, 1). In Section 3.3, we will prove an upper bound for the expected misclassification rate R(C_n^RaSE) with respect to all sources of randomness for a fixed threshold α.

Next, we introduce several scaling notations. For two sequences a_n and b_n, we use a_n = o(b_n) or a_n ≪ b_n to denote |a_n/b_n| → 0, and a_n = O(b_n) or a_n ≲ b_n to denote lim sup |a_n/b_n| < ∞. The corresponding stochastic scaling notations are o_p and O_p, where a_n = o_p(b_n) means |a_n/b_n| → 0 in probability, and a_n = O_p(b_n) means that for any ϵ > 0 there exists M > 0 such that P(|a_n/b_n| > M) ≤ ϵ for all n. Also, we use λ_min(A) and λ_max(A) to denote the smallest and largest eigenvalues of a square matrix A. For any vector x = (x_1, ..., x_p)^T, the Euclidean norm is ‖x‖_2 = (Σ_i x_i²)^{1/2}. And for any matrix A = (a_{ij})_{p×p}, we define the operator norm ‖A‖_2 = sup_{‖x‖_2 = 1} ‖Ax‖_2, the infinity norm ‖A‖_∞ = max_i Σ_{j=1}^p |a_{ij}|, the maximum norm ‖A‖_max = max_{i,j} |a_{ij}|, and the Frobenius norm ‖A‖_F = (Σ_{i,j} a_{ij}²)^{1/2}.

3.1 Impact of B1

In this section, we study the impact of B1 by presenting upper bounds on the absolute difference between the test errors of C_n^RaSE and C̄_n^RaSE, and on the Monte-Carlo variance for RaSE, conditioned on the training data as B1 grows. Note that the discrete distribution of random subspaces leads to both bounds vanishing at exponential rates, except for a finite set of thresholds α, which is very appealing.

Theorem 10 (Risk for the RaSE classifier conditioned on training data) Denote G_n(α') = π1 G_n^(1)(α') − π0 G_n^(0)(α'), and let {α_i}_{i=1}^N represent the discontinuity points of G_n.
Given training samples of size n, we have the following bound on the expected misclassification rate of the RaSE classifier with threshold α as B1 → ∞:

|E_S[R(C_n^RaSE)] − R(C̄_n^RaSE)| ≤ O(B1^{-1/2}) if α ∈ {α_i}_{i=1}^N, and ≤ exp{−C_α B1} otherwise,

where C_α = 2 min_{1≤i≤N} |α − α_i|². It shows that as B1 → ∞, the RaSE classifier C_n^RaSE and its infinite-simulation version C̄_n^RaSE achieve the same expected misclassification rate conditioned on the training data. Many similar results on the excess risk of randomized ensembles have been studied in the literature (Cannings and Samworth, 2017; Lopes et al., 2019; Lopes, 2020). Next, we provide a similar upper bound for the MC variance of the RaSE classifier. Suppose the discontinuity points of G_n^(0) and G_n^(1) are {α_i^(0)}_{i=1}^{N0} and {α_j^(1)}_{j=1}^{N1}, respectively.

Theorem 11 (MC variance for the RaSE classifier) It holds that

Var_S[R(C_n^RaSE)] ≤ 2[π0 (g_n^(0)(α))² + π1 (g_n^(1)(α))²] + O(1/B1) if α ∈ {α_i^(0)}_{i=1}^{N0} ∪ {α_j^(1)}_{j=1}^{N1}, and ≤ exp{−C_α B1} otherwise,

where C_α = 2 min_{1≤i≤N0, 1≤j≤N1} min(|α − α_i^(0)|², |α − α_j^(1)|²). This theorem asserts that, except for a finite set of thresholds α, the MC variance of the RaSE classifier shrinks to zero at an exponential rate.

3.2 Theoretical Properties of RIC

An important step of the RaSE classifier is the choice of an optimal subspace among B2 subspaces for each of the B1 weak classifiers. Before showing the screening consistency and weak consistency of RIC, we first present a proposition that explains the intuition behind why RIC can succeed.

Proposition 12 When Assumptions 1 and 2 hold, we have the following conclusions:

(i) KL(f_S^(0) ‖ f_S^(1)) = KL(f_{S*}^(0) ‖ f_{S*}^(1)) and KL(f_S^(1) ‖ f_S^(0)) = KL(f_{S*}^(1) ‖ f_{S*}^(0)) hold for any S ⊇ S*;

(ii) π0 KL(f_S^(0) ‖ f_S^(1)) + π1 KL(f_S^(1) ‖ f_S^(0)) < π0 KL(f_{S*}^(0) ‖ f_{S*}^(1)) + π1 KL(f_{S*}^(1) ‖ f_{S*}^(0)) if S ⊉ S*;

(iii) KL(f_{S*}^(0) ‖ f_{S*}^(1)) = sup_S KL(f_S^(0) ‖ f_S^(1)) and KL(f_{S*}^(1) ‖ f_{S*}^(0)) = sup_S KL(f_S^(1) ‖ f_S^(0)).
From Proposition 12, if we define the population RIC without penalty as

RIC(S) = −2[π0 KL(f_S^(0) ‖ f_S^(1)) + π1 KL(f_S^(1) ‖ f_S^(0))],

it can be easily seen that

sup_{S: S ⊇ S*} RIC(S) = RIC(S*) < inf_{S: S ⊉ S*} RIC(S).

To successfully differentiate the subspaces covering S* from {S : S ⊉ S*} using RIC_n, we need to impose a condition on the minimum gap between the RIC of subspaces covering S* and that of subspaces S with S ⊉ S*. Similar assumptions, such as the beta-min condition, appear in the high-dimensional variable selection literature (Bühlmann and Van De Geer, 2011). Denote

ψ(n, p, D) = max{D^{κ1}(D^{κ3} + D^{κ4}), D^{κ5}, D^{2κ1+κ2}} · [(D log p + κ1 log D)/n]^{1/2}.

The complete set of conditions is presented as follows.

Assumption 3 Suppose the densities f^(0) and f^(1) are in parametric forms with f^(0)(x) = f^(0)(x; θ), f^(1)(x) = f^(1)(x; θ), where θ contains all parameters for both f^(0) and f^(1). Note that not all elements of θ need appear in both f^(0) and f^(1). Denote the dimension of θ by p*. As in Section 2.1, D represents the upper bound on the subspace size. Define L_S(x, θ) = log[f_S^(0)(x_S; θ_S) / f_S^(1)(x_S; θ_S)].
Assume the following conditions hold, where κ1, κ2, κ3, κ4, κ5 ≥ 0 and C1, C2, C3, C4, C5 > 0 are universal constants that do not depend on n, p, or D:

(i) p* is a function of p and p*(p) ≲ p^{κ1};

(ii) max{KL(f_S^(0) ‖ f_S^(1)), KL(f_S^(1) ‖ f_S^(0))} ≲ D^{κ1} for any subset S with |S| ≤ D;

(iii) There exists a family of functions {V_S({x_{i,S}}_{i=1}^n)}_S with V_S({x_{i,S}}_{i=1}^n) ∈ R, such that for {x_{i,S}}_{i=1}^n ⊆ X and any subset S with |S| ≤ D, there exists a constant ζ such that if ‖θ'_S − θ_S‖_2 ≤ ζ, then

(1/n) Σ_{i=1}^n ‖∇²_{θ_S} L_S(x_i, θ')‖_max ≤ V_S({x_{i,S}}_{i=1}^n),

and the following tail probability bound holds for V_S:

P(V_S({x_{i,S}}_{i=1}^n) > C1 D^{κ2}) ≤ exp{−C2 n},

where x_{i,S} ~ f_S^(0) or f_S^(1);

(iv) Each component of ∇_{θ_S} L_S(x, θ) is 2C3 D^{κ3}-sub-Gaussian for any subset S with |S| ≤ D, where x_S ~ f_S^(0) or f_S^(1), respectively;

(v) sup_{S: |S| ≤ D} ‖E_{x_S ~ f_S^(0)} ∇_{θ_S} L_S(x, θ)‖_∞ and sup_{S: |S| ≤ D} ‖E_{x_S ~ f_S^(1)} ∇_{θ_S} L_S(x, θ)‖_∞ are ≲ D^{κ4};

(vi) L_S(x, θ) is a 2C4 D^{κ5}-sub-Gaussian variable, where x_S ~ f_S^(0) or f_S^(1);

(vii) Denoting the MLE of θ based on subset S with |S| ≤ D as θ̂_S = argmax_{θ_S} Σ_{i=1}^n Σ_{r=0,1} 1(y_i = r) log f_S^(r)(x_{i,S}; θ_S), when ϵ is smaller than a positive constant, it holds that

P(‖θ̂_S − θ_S‖_∞ > ϵ) ≤ |S|^{κ1} exp{−C5 n ϵ²},

which implies ‖θ̂_S − θ_S‖_∞ = O_p((κ1 log |S| / n)^{1/2});

(viii) The signal strength satisfies

Δ := inf_{S: S ⊉ S*, |S| ≤ D} RIC(S) − RIC(S*) ≳ ψ(n, p, D), where ψ(n, p, D) = o(1).

Here, condition (i) assumes that the number of parameters in the model grows no faster than a polynomial rate in the dimension. Condition (ii) assumes an upper bound for the two KL divergences. Conditions (iii)-(vi) are imposed to guarantee the accuracy of the second-order approximation for RIC. Condition (vii) is usually satisfied for common distribution families (Van der Vaart, 2000). The last condition is a requirement on the signal strength; a condition of this type is necessary to prove a consistency result for any information criterion. Condition (viii) also imposes a constraint among n, p, and D. When D ≪ p, it can be simplified as Δ ≳ D^κ (D log p / n)^{1/2}, where κ is some positive constant.
To cover all the signals in S , we require D p . Therefore, an implied requirement for n, p, and p is (p )κ log p which is similar to the conditions in the literature of variable selection. Here, we allow p to grow at an exponential rate of n. To help readers understand these conditions better, we show that some commonly used conditions (presented in Assumption 4) for high-dimensional LDA model are sufficient for Assumption 3.(i)-(vii) to hold with the results presented in Proposition 14. Assumption 3.(viii) can be relaxed under the LDA model. Tian and Feng Assumption 4 (LDA model) Suppose the following conditions are satisfied, where m, M, M are constants: (i) λmin(Σ) m > 0, Σ max M < ; (ii) µ(1) µ(0) M < ; (iii) Denote δS = Σ 1 S,S(µ(1) S µ(0) S ), γ = inf j |(δS )j| > 0, then Remark 13 Here, condition (i) constrains the eigenvalues of the common covariance matrix, which is similar to condition (C2) in Hao et al. (2018) and Wang (2009), condition (2) in Shao et al. (2011), condition (C4) in Li and Liu (2019). Condition (ii) imposes an upper bound for the maximal componentwise mean difference of the two classes, which is similar to condition (3) in Shao et al. (2011). Condition (iii) assumes a lower bound on the minimum signal strength γ, which is in a similar spirit to condition 2 in Mai et al. (2012). Proposition 14 Suppose f(0) N(µ(0), Σ), f(1) N(µ(1), Σ), where Σ is positive definite. If Assumption 4 holds, then Assumption 3.(i)-(vii) hold with θS = ((µ(0) S )T , (µ(1) S )T , vec(ΣS,S)T )T and κ1 = 2, κ2 = κ4 = 1, κ3 = κ5 = 1 The detailed proof for this proposition can be found in Appendix B. We are now ready to present the consistency result for RIC as defined in (5). 
Theorem 15 (Consistency of RIC) Under Assumptions 1-3, we have (i) If sup S:|S| D deg(S) cn/ = o(1), then the following screening consistency holds for sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) 1 O p DDκ1 exp Cn D2κ1+κ2 Cn Dκ1(Dκ3 + Dκ4) (ii) If in addition cn ψ(n, p, D), then the weak consistency in the following sense holds for RIC: P RICn(S ) = inf S:|S| D RICn(S) 1 O p DDκ1 exp n Cn cn D2κ1+κ2 2. Note that here we assume that MLE of θ is well-defined for all models under consideration. Ra SE: Random Subspace Ensemble Classification Cn cn Dκ1(Dκ3 + Dκ4) O p D exp Cn cn Corollary 16 Denote p S = P(S11 S ) = 1 D P (p d) , where the subspaces are generated from the hierarchical uniform distribution. Assume the conditions stated in Assumption 1-2 hold, and in addition there holds If we select the optimal subspace by minimizing RIC as the criterion in Ra SE where D p , then we have sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) j=1 {S1j S } j=1 {S1j S } = 1 (1 p S )B2 1 O (exp { B2p S }) . Remark 17 Note that this corollary actually holds when we replace RICn with a general criterion Crn (smaller value leads to better subspace) with more discussions in Section 3.5. And for RIC, the bounds for the first probability on the right-hand side of (7) in Theorems 15, 18 and 20 can be plugged in to get the explicit bounds. Also, we want to point out that direct analyses of RIC for discriminant analysis models are also insightful and interesting. We can show similar consistency results as those in Theorem 15 from properties of discriminant analysis approach itself based on some common conditions used in literature about sparse discriminant analysis, instead of applying the general analysis of KL divergence. 
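The mechanism behind Proposition 12 and the unpenalized RIC — subspaces covering S* maximize the weighted KL divergence and therefore minimize RIC — can be checked numerically in the equal-covariance Gaussian case, where both KL divergences reduce to half the Mahalanobis distance between the class means. The following is an illustrative sketch under that assumption, not the RaSEn implementation; the toy model and subspace size are our own choices.

```python
import numpy as np
from itertools import combinations

def ric_unpenalized(mu0, mu1, Sigma, S, pi0=0.5, pi1=0.5):
    """Population RIC without penalty, -2*[pi0*KL(f0_S||f1_S) + pi1*KL(f1_S||f0_S)].
    With equal covariance, both KL terms equal
    0.5 * (mu1_S - mu0_S)^T Sigma_{S,S}^{-1} (mu1_S - mu0_S)."""
    S = list(S)
    d = (mu1 - mu0)[S]
    kl = 0.5 * d @ np.linalg.solve(Sigma[np.ix_(S, S)], d)
    return -2.0 * (pi0 * kl + pi1 * kl)

# Toy check: among all size-2 subspaces, the one covering the signal set
# {features 1 and 3, i.e. indices 0 and 2} attains the smallest RIC.
p = 5
Sigma = np.eye(p)
mu0 = np.zeros(p)
mu1 = np.array([1.0, 0.0, 0.8, 0.0, 0.0])
scores = {S: ric_unpenalized(mu0, mu1, Sigma, S)
          for S in combinations(range(p), 2)}
best = min(scores, key=scores.get)
```

In the actual criterion, a penalty increasing in deg(S) is added to the sample version RIC_n, so that among subspaces with equal discriminative power the smallest one is preferred.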
Theorem 18 (LDA consistency) For the LDA model, under Assumption 4, we have (i) If Dcn/γ2 = o(1), then the following screening consistency holds for RIC: sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) (ii) If in addition cn D2 q n , then RIC is weakly consistent: P RICn(S ) = inf S:|S| D RICn(S) 1 O p2 exp Cn cn Tian and Feng Assumption 5 (QDA model) Denote Ω(r) S,S = (Σ(r) S,S) 1, r = 0, 1. Suppose the following conditions are satisfied, where m, M, M are constants: (i) λmin(Σ(r)) m > 0, λmax(Σ(r)) M < , r = 0, 1; (ii) µ(1) µ(0) M < ; (iii) Denote γl = inf j (Ω(1) S,Sµ(1) S Ω(0) S,Sµ(0) S )j > 0, γq = inf i sup j (Ω(1) S q ,S q Ω(0) S q ,S q )ij > 0, min{γ2 l , γ2 q, γq} D2 r Remark 19 The conditions here are similar to Assumption 4 for the LDA model. A set of analogous conditions were used in Jiang et al. (2018). Theorem 20 (QDA consistency) For the QDA model, under Assumption 5, (i) If D2cn/ min{γ2 l , γ2 q, γq} = o(1), then RIC is screening consistent: sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) min{γ2 l , γ2 q, γq} D2 (ii) Further, if cn D2 q n , then RIC is weakly consistent: P RICn(S ) = inf S:|S| D RICn(S) 1 O p2 exp Cn cn The proof is available in Appendix B. Note that the bound here is tighter than the results from Proposition 14 and Theorem 15. Based on the consistency of RIC, in the next section, we will construct an upper bound for the expectation of the misclassification rate R(CRa SE n ). 3.3 Misclassification Rate of the Ra SE Classifier In the following theorem, we present an upper bound on the misclassification rate for the Ra SE classifier, which holds for any criterion to choose optimal subspaces. Theorem 21 (General misclassification rate) For the Ra SE classifier with threshold α and any criterion to choose optimal subspaces, it holds that E{E[R(CRa SE n ) R(CBayes)]} E sup S:S S |S| D [R(CS n ) R(CBayes)] + P(S1 S ) min(α, 1 α) . (8) Ra SE: Random Subspace Ensemble Classification Here, the upper bound consists of two terms. 
The first term involving E sup S:S S |S| D R(CBayes)] can be seen as the maximum discrepancy between the risk of models trained in any subspace covering S based on finite samples with the Bayes risk, which will be investigated in detail in the next subsection. This term shrinks to zero under certain conditions (see details in Section 3.4). The second term corresponds to the event that at least one signal is missed in the selected subspace. Corollary 16 in Section 3.2 shows that B2 needs to be sufficiently large to ensure this term goes to 0. We will show in Section 3.5 that the iterative Ra SE could relax the requirement on B2 under certain scenarios. Specifically, if we use the criterion of minimizing training error (misclassification rate on the training set) or leave-one-out cross-validation error, a similar guarantee of performance can be arrived as follows. Theorem 22 (Misclassification rate when minimizing training error or leaveone-out cross-validation error) If the criterion of minimal training error or leave-oneout cross-validation error is applied for the Ra SE classifier with threshold α, it holds that E{E[R(CRa SE n ) R(CBayes)]} E sup S:S S |S| D [R(CS n ) R(CBayes)] + E(ϵn) + E sup S:S S |S| D + (1 p S )B2 min(α, 1 α) , (9) where ϵS n = R(CS n ) Rn(CS n ), ϵn = E[R(CS1 n ) Rn(CS1 n )]. Here Rn(C) is the training error or leave-one-out cross-validation error of classifier C. This theorem is closely related to Theorem 3 in Cannings and Samworth (2017) and derived along similar lines. The merit of Theorem 22 compared with Theorem 21 is that we don t have the term P(S1 S ) in the bound, which can be difficult to quantify when minimizing training error or leave-one-out cross-validation error. Regarding the upper bound in (9), the first term is the same as the first term of the bound in Theorem 21. 
The second term involving E(ϵn) + E sup S:S S |S| D |ϵS n| is relative to the distance between the training error and test error, which usually shrinks to zero for some specific classifiers under certain conditions (see details in Section 3.4). The third term involving (1 p S )B2 reflects the possibility that S is not selected in any of the B2 subspaces we generate, which is similar to the second term in the bound given by Theorem 21. This term shrinks to zero under the condition of Corollary 16. 3.4 Detailed Analysis for Several Base Classifiers In this section, we work out the technical details for the Ra SE classifier when the base classifier is chosen to be LDA, QDA, and k NN. Tian and Feng 3.4.1 Linear Discriminant Analysis (LDA) LDA was proposed by Fisher (1936) and corresponds to model (1) where f(r) N(µ(r), Σ), r = 0, 1. For given training data {xi, yi}n i=1 in subspace S, using the MLEs given in Proposition 8, the classifier can be constructed as CS LDA n (x) = ( 1, LS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣS,S) > 0, 0, otherwise, where the decision function LS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣS,S) = log(ˆπ1/ˆπ0) + (x S (ˆµ(0) S + ˆµ(1) S )/2)T (ˆΣS,S) 1(ˆµ(1) S ˆµ(0) S ). And the degree of freedom of the LDA model with feature subspace S is deg(S) = |S| + 1. Efron (1975) derived that R(CS LDA n ) R(CBayes) = π1 r µ(1) S µ(0) S T Σ 1 S,S µ(1) S µ(0) S , ˆ S = r ˆµ(1) S ˆµ(0) S T ˆΣ 1 S,S ˆµ(1) S ˆµ(0) S , τS = log(π1/π0)/ S, ˆτS = log(ˆπ1/ˆπ0)/ ˆ S. Proposition 23 If Assumption 4 holds, then we have E sup S:S S |S| D [R(CS LDA n ) R(CBayes)] D2 r n max{(p ) 1 2 γ 1, (p ) 3 Regarding the second term in the upper bound in Theorem 22, due to Theorem 23.1 in Devroye et al. 
(2013), for any subset S, we have P |ϵS n| > ϵ 8n D exp 1 which yields sup S:S S |S| D S:S S |S| D P |ϵS n| > ϵ 8n Dp D p exp 1 By Lemma 39 in Appendix B, it follows E sup S:S S |S| D 32[(D p ) log p + D log n + 3 log 2 + 1] Ra SE: Random Subspace Ensemble Classification sup k=1,...,B2 |ϵS1k n | sup k=1,...,B2 |ϵS n| > ϵ k=1 P |ϵS1k n | > ϵ 8B2n D exp 1 again by Lemma 39, we have 32[log B2 + D log n + 3 log 2 + 1] By plugging these bounds in the right-hand side of (8) and (9), we can get the explicit upper bound of the misclassification rate for the LDA model. 3.4.2 Quadratic Discriminant Analysis (QDA) QDA considers the model (1) analogous to LDA while x|y = r N(µ(r), Σ(r)), r = 0, 1, where Σ(0) can be different from Σ(1). On the basis of training data {(xi, yi)}n i=1 in subspace S, it admits the following form of classifier based on the MLEs given in Proposition 9: CS QDA n (x) = ( 1, QS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣ(0) S,S, ˆΣ(1) S,S) > 0, 0, otherwise, where the decision function QS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣ(0) S,S, ˆΣ(1) S,S) = log(ˆπ1/ˆπ0) 1 2x T S[(ˆΣ(1) S,S) 1 (ˆΣ(0) S,S) 1]x S + x T S[(ˆΣ(1) S,S) 1 ˆµ(1) S (ˆΣ(0) S,S) 1 ˆµ(0) S ] 2(ˆµ(1) S )T (ˆΣ(1) S,S) 1 ˆµ(1) S + 1 2(ˆµ(0) S )T (ˆΣ(0) S,S) 1 ˆµ(0) S . And the degree of freedom of QDA model with feature subspace S is deg(S) = |S|(|S| + 1)/2 + |S| + 1. To analyze the first term in (8) and (9), as in Jiang et al. (2018), for any constant c, we define ess sup z [ c,c] h(r)(z), r = 0, 1 where ess sup represents the essential supremum defined as the supremum except a set with measure zero and h(r)(z) is the density of QS (x S |π1, π0, µ(0) S , µ(1) S , Σ(0) S ,S , Σ(1) S ,S ) given that y = r. Proposition 24 If Assumption 5 and the following conditions hold: (i) There exist positive constants c, Uc such that uc Uc < ; (ii) There exists a positive number ϖ0 (0, 1), p exp{nϖ0}; Tian and Feng then we have E sup S:S S |S| D R(CS QDA n ) R(CBayes) D2 log p for any ϖ (0, 1/2). 
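The subspace QDA decision function Q_S displayed above can be sketched as follows. This is a plug-in sketch with the log-determinant term written out explicitly; it is an illustration under Gaussian class-conditional assumptions, not the paper's exact estimator.

```python
import numpy as np

def qda_decision(xS, pi0, pi1, mu0, mu1, Sig0, Sig1):
    """Subspace QDA decision function Q_S: classify as class 1 when positive.
    Quadratic term -0.5*x^T(P1-P0)x, linear term x^T(P1 mu1 - P0 mu0),
    plus the prior and normalizing constants."""
    P0, P1 = np.linalg.inv(Sig0), np.linalg.inv(Sig1)
    return (np.log(pi1 / pi0)
            - 0.5 * np.log(np.linalg.det(Sig1) / np.linalg.det(Sig0))
            - 0.5 * xS @ (P1 - P0) @ xS
            + xS @ (P1 @ mu1 - P0 @ mu0)
            - 0.5 * mu1 @ P1 @ mu1
            + 0.5 * mu0 @ P0 @ mu0)

# Toy check with equal identity covariances (where QDA reduces to LDA):
I2 = np.eye(2)
q_pos = qda_decision(np.array([2.0, 0.0]), 0.5, 0.5,
                     np.zeros(2), np.array([2.0, 0.0]), I2, I2)
q_neg = qda_decision(np.zeros(2), 0.5, 0.5,
                     np.zeros(2), np.array([2.0, 0.0]), I2, I2)
```

When Σ^(0) = Σ^(1), the quadratic term vanishes and the rule reduces to the LDA decision function of Section 3.4.1.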
Also, by applying Theorem 23.1 in Devroye et al. (2013) and Lemma 39, we have similar conclusions as (11) and (12) in the following E sup S:S S |S| D 32[(D p ) log p + D(D + 3)/2 log n + 3 log 2 + 1] 32[log B2 + D(D + 3)/2 log n + 3 log 2 + 1] The explicit upper bound of misclassification rate for the QDA model follows when we plug these inequalities into the right-hand sides of (8) and (9). 3.4.3 k-nearest Neighbor (k NN) k NN method was firstly proposed by Fix (1951). Given x, it tries to mimic the regression function E[y|x] in the local region around x by using the average of k nearest neighbors. k NN is a non-parametric method and its success has been witnessed in a wide range of applications. Given training data {xi, yi}n i=1, for the new observation x and subspace S, rank the training data by the increasing ℓ2 distance in Euclidean space to x S as {xmi,S}n i=1 such that x S xm1,S 2 x S xm2,S 2 . . . x S xmn,S 2, where 2 represents the ℓ2 norm in the corresponding Euclidean space. Then the k NN classifier admits the following form: CS k NN n (x) = ( 1, 1 k Pk i=1 ymi > 0.5, 0, otherwise. By Devroye and Wagner (1979) and Cannings and Samworth (2017), it holds the following tail bound: sup S:S S |S| D S:S S |S| D P |ϵS n| > ϵ 8p D p exp nϵ3 108k(3D + 1) Then by Lemma 39 and similar to the analysis for deriving (12), it follows E sup S:S S |S| D |ϵS n| [108k(3D + 1)] 1 3 3 log 2 + (D p ) log p + 1 Ra SE: Random Subspace Ensemble Classification E|ϵn| [108k(3D + 1)] 1 3 3 log 2 + log B2 + 1 However, for k NN, due to its lack of parametric form, it is much more involved to derive a similar upper bound as those presented in Propositions 23 and 24. We decide to leave this analysis as future work. 3.5 Theoretical Analysis of Iterative Ra SE Recall that to control the misclassification rate when minimizing RIC, we showed in Section 3.3 that to control P(S1 S ), B2 needs to be sufficiently large. 
In particular, a sufficient condition regarding B2 was presented in Corollary 16. Since p p d p p D D p p + 1 condition (ii) in Corollary 16 implies B2 p p +1 D p , which could be very large for highdimensional settings. Next, we show the iterative Ra SE in Algorithm 2 can sometimes relax the requirement on B2 substantially. Different from the hierarchical uniform distribution over the subspaces in Ra SE (Algorithm 1), the iterative Ra SE in Algorithm 2 uses a non-uniform distribution from the second iteration. The non-uniform distribution works by assigning higher probabilities to subspaces that include the more frequently appeared variables among the B1 subspaces chosen in the previous step. We will show that the subspaces generated from such non-uniform distributions require a smaller B2 to cover S . In the following analysis, we study the iterative Ra SE algorithm that minimizes a general criterion Cr, which is a real-valued function on any subspace with its sample version denoted as Crn. To guarantee the success of Algorithm 2, we need the following conditions. Assumption 6 Suppose the following conditions are satisfied: (i) There exists a positive function of n, p, D called ν satisfying ν(n, p, D) = o(1), such that sup S:|S| D |Crn(S) Cr(S)| = Op(ν(n, p, D)). (ii) (Stepwise detectable condition) There exists a series {Mn} n=1 and a specific positive integer p p such that (a) for any feature subset S(1) S where | S(1) | p p , there exists S(2) S \ S(1) and |S(2) | = p such that Cr(S ) Cr(S) > Mnν(n, p, D) holds for any n, any S and S satisfying S S = S S , S S = S(1) S(2) , S S = S 1 S 2 where S 1 S(1) , S 2 S \ S(1) , |S 2| p ; Tian and Feng (b) the criterion satisfies inf S:|S| D S S Cr(S) sup S:|S| D S S Cr(S) > Mnν(n, p, D). (iii) We have D log log p log p. In Assumption 6, condition (i) provides a uniform bound for the specific criterion we use. 
Condition (ii)(a) is introduced to make Algorithm 2 detect additional signals in each iteration until all signals are covered. Condition (ii)(b) is imposed to help us find discriminative sets among B2 subspaces, and it holds for RIC by previous analysis in Section 3.2. Condition (iii) characterizes the requirement on the dimension p and the maximal subspace size D. Theorem 25 For Algorithm 2, B2 in the first step is set as and B2 in the following steps is set as D p p (log p)p B2 p where C0 is the same as in Algorithm 2. Also we set B1 such that B1 log p . If Assumption 6 holds, then after T iterations where e B1 p , as n, B1, B2 there holds P(S(T) 1 S ) 0. Next, we compare the requirements on B2 for iterative Ra SE with that for the vanilla Ra SE. For simplicity, we assume D, p , and p are constants. When p < p , iterative Ra SE requires B2 p p (log p)p , which is much weaker than the requirement B2 pp for vanilla Ra SE implied by Corollary 16. Using the results in Theorem 21, an upper bound for the error rate could be obtained. Also note that the rate of p is constrained by Assumption 6.(i), where we assume ν(n, p, D) = o(1). For example, with LDA and QDA model, by Lemmas 32 and 38 in Appendix B, under Assumptions 4 and 5 respectively, we have ν(n, p, D) = D2 r Therefore the constraint is D2 q Ra SE: Random Subspace Ensemble Classification 4. Computational and Practical Issues 4.1 Tuning Parameter Selection In Algorithm 1, there are five tuning parameters, including the number of weak learners B1, the number of candidate subspaces B2 to explore for each weak learner, the distribution D of subspaces, the criterion C for selecting the optimal subspace for each weak learner, and the threshold ˆα. If we set C as minimizing the RIC, the difference between the risk of Ra SE and Bayes risk, as well as the MC variance of Ra SE, vanish at an exponential rate when B1 , except for a finite set of thresholds α. 
This implies the Ra SE classifier becomes more accurate and stable as B1 increases. Regarding the impact of B2, by Corollary 16, Theorem 15 and Theorem 25, under the minimal RIC criterion with some conditions, as B2, n , the subspace chosen for each weaker learner recovers the minimal determinative set with high probability. By Theorem 21, the expectation of the misclassification rate becomes closer to the Bayes error as the sample size n and B2 increase, which motivates us to use a large B2 if we have sufficient computational power. However, when we choose minimizing training error as the criterion C to select the optimal subspace, Theorem 22 shows that the influence of B2 becomes more subtle. In our implementation, we set B1 = 200 and B2 = 500 as default. For LDA and QDA classifier, C is set to choose the optimal subspace by minimizing the RIC, while for k NN, the default setting is minimizing the leave-one-out cross-validation error. Without prior information about the features, as we mentioned in Section 2.1, D is set as the hierarchical uniform distribution over the subspaces. To generate the size d of subspaces from the uniform distribution over {1, . . . , D}, another parameter D has to be determined. In practice, for QDA base classifier we set D = min(p, [ n0], [ n1]) and for LDA, k NN and all the other base classifiers, we set D = min(p, [ n]), where [a] denotes the largest integer not larger than a. The threshold ˆα is chosen by (2) to minimize the training error. When using non-parametric estimate of RIC corresponding to (6), following Wang et al. (2009) and Ganguly et al. (2018), we set k0 = [ n0] and k1 = [ n1] to satisfy the conditions they presented for proving the consistency. 4.2 Computational Cost Ra SE is an ensemble framework, generating B1B2 subspaces in total following distribution D. 
If we use the uniform distribution introduced in the last section to generate one subspace, the time required equals to the time for sampling at most D features from p ones, which is O(p D). And the time for training the base model is denoted as Ttrain, which equals to O(n D2) for LDA and QDA base classifiers. Similarly, the time for predicting test data is denoted as Ttest, which equals to O(ntest D) for LDA base classifier, O(ntest D2) for QDA base classifier, and O(n ntest D) for k-NN base classifier. In total, the computation cost for the training process is O(B1B2Ttrain + B1B2 log B2) time. Here, O(B2 log B2) is the time needed to find the optimal subspace among B2 ones based on the sorting of their scores calculated under some criterion. The computation cost for prediction process is O(B1Ttest) and Ra SE algorithm takes approximately O(B1B2Ttrain + B1B2 log B2 + B1Ttest) for both model fitting and prediction. Tian and Feng In practice, this type of framework is very convenient to apply the parallel computing paradigm, making the computing quite fast. And for specific classifiers like LDA and QDA, we have simplified the RIC expression, which can be directly used to speed up calculation. Compared to the projection generation process in Cannings and Samworth (2017), Ra SE is more efficient since we only need to select features based on certain distribution without doing any complicated matrix operations. 4.3 Feature Ranking There are many powerful tools in statistics and machine learning for variable selection and feature ranking. For the sparse classification approaches like sparse-LDA and sparse-QDA (Mai et al., 2012; Jiang et al., 2018; Fan et al., 2012; Shao et al., 2011; Hao et al., 2018; Fan et al., 2015; Mai et al., 2015; Fan et al., 2016), or independent regularization approach like nearest shrunken centroids (Tibshirani et al., 2003), this is usually directly implied by the methodology. 
For model-free classification methods, however, it s not straightforward to rank features. For random forest, Breiman (2001) proposed a feature ranking method by randomly permuting the value of each feature and calculating the misclassification rate on the out-of-bag data for the new random forest. For Ra SE, as an ensemble framework, it s quite natural to rank variables by their frequencies of appearing in B1 subspaces corresponding to B1 weak learners. Following Corollary 16, as n, B2 , under some conditions for signal strength and the increasing rate of B2, by applying the criterion of minimizing RIC, the chosen subspace tends to cover the minimal discriminative set S with high probability, which intuitively illustrates why this idea works to rank variables. When we do not have sufficient computational resources to set a very large B2, as Theorem 25 indicates, under some conditions, the iterative Ra SE (Algorithm 2) can cover S with high probability after a few steps with a smaller B2. In practice, the frequencies of signals in S tend to increase after iterations, which can improve the performance of the Ra SE classifier and provide a better ranking. We will demonstrate this via extensive simulation studies in the next section. 5. Simulations and Real-data Experiments We use six simulation settings and four real data sets to demonstrate the effectiveness of the Ra SE method, coupled with RIC and leave-one-out cross-validation as the minimizing criterion to choose the optimal subspace. 
The performance of Ra SE classifiers with LDA, QDA, Gamma and k NN as base classifiers with different iteration numbers are compared with that of other competitors, including the standard LDA, QDA, and k NN classifiers, sparse LDA (s LDA) (Mai et al., 2012), regularization algorithm under marginality principle (RAMP) (Hao et al., 2018), nearest shrunken centroids (NSC) (Tibshirani et al., 2003), random forests (RF) (Breiman, 2001), and random projection ensemble classifier (RPEnsemble) (Cannings and Samworth, 2017). For the LDA model, we also implemented the non-parametric estimate for RIC to show its effectiveness and robustness. The standard LDA and QDA methods are implemented by using R package MASS. And the k NN classifier is implemented by knn, knn.cv in R package class and function knn3 in R package caret. We utilize package dsda to fit s LDA model. RAMP is implemented through package RAMP. For the RF, we use R package Random Forest; the number of trees Ra SE: Random Subspace Ensemble Classification are set as 500 (as default) and [ p] variables are randomly selected when training each tree (as default). And the NSC model is fitted by calling function pamr.train in package pamr. RPEnsemble is implemented by R package RPEnsemble. To obtain the MLE of parameters in the Gamma distribution, the Newton s method is applied via function nlm in package stats and the initial point is chosen to be the moment estimator. To get the non-parametric estimate of KL divergence and RIC, we call function KL.divergence in package FNN. When fitting the standard k NN classifier, and the k NN base classifier in RPEnsemble and Ra SE method, the number of neighbors k is chosen from {3, 5, 7, 9, 11} via leave-one-out cross-validation, following Cannings and Samworth (2017). For RAMP, the response type is set as binomial , for which the logistic regression with interaction effects is considered. In s LDA, the penalty parameter λ is chosen to minimize cross-validation error. 
In RPEnsemble method, LDA, QDA, k NN, are set as base classifiers with default parameters, and the number of weak learner B1 = 500 and the number of projection candidates for each weak learner B2 = 50 and the dimension of projected space d = 5. The projection generation method is set to be Haar . The criterion of choosing optimal projection is set to minimize training error for LDA and QDA and minimize leave-one-out cross-validation error for k NN. For the Ra SE method, for LDA, QDA, and independent Gamma classifier (to be illustrated later in Section 5.1.2), the criterion is set to be minimizing RIC, and for k-NN, the strategy of minimizing leave-one-out cross-validation error is applied. Other parameter settings in Ra SE are the same as in the last section. For simulations, the number of iterations T in Algorithm 2 is set to be 0, 1, 2, while for real-data experiments, we only consider Ra SE methods with 0 or 1 iteration. We write the iteration number on the subscript, and if it is zero, the subscript will be omitted. For example, Ra SE1-LDA represents the Ra SE classifier with T = 1 iteration and LDA base classifier. In addition, we use LDAn to denote the LDA base classifier with the non-parametric estimate of RIC. For all experiments, 200 replicates are considered, and the average test errors (percentage) are calculated based on them. The standard deviation of the test errors over 200 replicates is also calculated for each approach and written on the subscript. The approach with minimal test error for each setting is highlighted in bold, and methods that achieve test error within one standard deviation of the best one are highlighted in italics. Also, for all simulations and the madelon data set, the average selected percentage of features in B1 subspaces for the largest sample size setting in the Ra SE method in 200 replicates are presented, which provides a natural way for feature ranking. 
For the average selected percentage in the case of the smallest sample size, refer to Appendix A. To highlight the different behaviors of signals and noise features, we present the average selected percentages of all noise features as a box-plot marked with "N". The RaSE classifier competes favorably with existing classification methods: its misclassification rate is the lowest in 27 out of 30 (simulation and real-data) settings and within one standard deviation of the lowest in the remaining settings. 5.1 Simulations For the simulated data, model 1 follows model 1 in Mai et al. (2012), which is a sparse LDA-adapted setting. In model 2, a Gamma distribution with independent components is used for each class. Model 3 follows the setting of model 3 in Fan et al. (2015), which is a QDA-adapted model. The marginal class probabilities are set to π0 = π1 = 0.5 for the first three simulation models. Model 4 is motivated by the kNN algorithm, and its data generation process is introduced below. To test the robustness of RaSE, two non-sparse settings, models 1′ and 4′, are investigated as well: model 1′ has decaying signals in the LDA model, while model 4′ inherits from model 4 by increasing the number of signals to 30 with the signal strength decreased. For the simulations, we consider the corresponding signal ("Sig-") classifiers as a benchmark; these fit the correct model on the minimal discriminative set S∗, mimicking the behavior of the Bayes classifier when S∗ is sparse. 5.1.1 Models 1 and 1′ (LDA) First we consider a sparse LDA setting (model 1). Let x|y = r ~ N(μ^(r), Σ), r = 0, 1, where Σ = (Σ_ij)_{p×p} = (0.5^{|i−j|})_{p×p}, μ^(0) = 0_{p×1}, and μ^(1) = Σ × 0.556 (3, 1.5, 0, 0, 2, 0_{1×(p−5)})^T. Here p = 400, and the training sample size n ∈ {200, 400, 1000}. Test data of size 1000 are independently generated from the same model. As analyzed in Example 1, the feature subset {1, 2, 5} is the minimal discriminative set S∗.
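The model-1 design above can be sketched as follows. This is a minimal Python sketch under the stated parameters; `gen_model1` and the reduced dimension used in the demo are our own choices:

```python
import numpy as np

def gen_model1(n, p=400, seed=0):
    """Sketch of model 1: x|y=r ~ N(mu^(r), Sigma) with Sigma_ij = 0.5^|i-j|,
    mu^(0) = 0 and mu^(1) = Sigma @ (0.556 * beta), beta supported on {1, 2, 5}."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]   # minimal discriminative set {1, 2, 5}
    mu1 = Sigma @ (0.556 * beta)
    L = np.linalg.cholesky(Sigma)       # sample correlated Gaussians
    y = rng.binomial(1, 0.5, n)         # pi_0 = pi_1 = 0.5
    x = rng.standard_normal((n, p)) @ L.T + np.outer(y, mu1)
    return x, y

X, y = gen_model1(100, p=50)            # reduced p for a quick demo
```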
The left panel of Table 1 presents the performance of various methods on model 1 for different sample sizes. RaSE1-LDAn performs the best when the sample size is n = 200 and 400, sLDA achieves performances similar to the best classifier in each setting, and RaSE2-QDA ranks at the top when n = 1000. Since this model is very sparse, the default value of B2 cannot guarantee that the minimal discriminative set is selected; the iterative version of RaSE therefore improves the performance considerably. NSC also achieves a comparably small misclassification rate when n = 200. In Figure 1, the average selected percentages of the 400 features over 200 replicates when n = 1000 are presented. Note that after two iterations the three signals are captured by almost all B1 = 200 subspaces for all three base classifiers, and all noise features are rarely selected across the B1 subspaces except when the non-parametric estimate of RIC is applied. Next, we consider a non-sparse LDA model (model 1′). Let μ^(1) = Σ (0.9, 0.9², . . . , 0.9⁵⁰, 0_{1×(p−50)})^T and keep the other parameters the same as above. Now S∗ contains the first 50 features. Under this non-sparse setting, as the right panel of Table 1 shows, although most methods obtain similar error rates, RaSE1-LDAn achieves the best performance when n = 200 and 400, and RaSE1-kNN performs the best when n = 1000. From the table, it can be seen that despite the non-sparse design, the iterations still improve the performance of RaSE. An interesting phenomenon is observed in Figure 2, which exhibits the average selected percentages for model 1′: the selected percentages decay as the signal strength decreases, except for the first feature. One possible reason is that the marginal discriminative powers of features 2 and 3 are the strongest among all features due to the specific correlation structure in our setting.
5.1.2 Model 2 (Gamma distribution) In this model we investigate the Gamma distribution with independent features, which is rarely studied in the literature. The j-th coordinate of x|y = r follows a Gamma distribution (see (13) below).

Method       | Model 1: n=200  n=400         n=1000       | Model 1′: n=200  n=400        n=1000
RaSE-LDA     | 12.99 (1.42)  12.38 (1.20)  11.16 (1.10)   | 21.43 (4.64)  19.56 (3.67)  17.98 (2.74)
RaSE-LDAn    | 13.40 (1.31)  13.23 (1.20)  12.64 (1.11)   | 21.64 (4.22)  20.25 (3.01)  19.42 (2.59)
RaSE-QDA     | 13.78 (1.26)  13.69 (1.19)  13.23 (1.04)   | 22.36 (3.96)  20.46 (2.83)  19.83 (2.51)
RaSE-kNN     | 13.21 (1.48)  12.57 (1.23)  11.19 (1.10)   | 19.98 (3.74)  18.21 (2.89)  16.97 (2.10)
RaSE1-LDA    | 11.48 (1.26)  10.42 (1.07)  10.25 (1.08)   | 20.15 (3.65)  18.39 (2.69)  17.20 (2.14)
RaSE1-LDAn   | 10.58 (1.15)  10.28 (1.01)  10.16 (1.04)   | 18.21 (2.92)  17.55 (2.26)  16.86 (1.74)
RaSE1-QDA    | 11.28 (1.47)  10.76 (1.25)  10.40 (1.22)   | 21.24 (4.75)  19.65 (3.56)  18.23 (2.51)
RaSE1-kNN    | 11.11 (1.25)  10.58 (1.09)  10.33 (1.05)   | 18.90 (3.16)  17.75 (2.51)  16.80 (1.98)
RaSE2-LDA    | 12.62 (1.45)  11.41 (1.16)  10.13 (1.03)   | 21.78 (3.44)  19.22 (2.54)  17.53 (1.98)
RaSE2-LDAn   | 10.90 (1.16)  10.42 (0.96)  10.17 (1.06)   | 18.75 (2.87)  17.95 (2.19)  17.30 (1.70)
RaSE2-QDA    | 11.74 (1.45)  10.39 (1.01)  10.09 (1.07)   | 21.84 (5.03)  20.00 (3.62)  19.04 (2.57)
RaSE2-kNN    | 11.17 (1.31)  10.60 (0.99)  10.34 (1.02)   | 18.96 (3.10)  17.77 (2.29)  17.19 (1.78)
RP-LDA       | 17.26 (1.53)  15.03 (1.24)  13.37 (1.09)   | 25.66 (1.98)  23.85 (1.79)  21.99 (1.50)
RP-QDA       | 17.96 (1.60)  15.26 (1.34)  13.50 (1.12)   | 26.41 (2.04)  24.03 (1.88)  22.06 (1.51)
RP-kNN       | 18.54 (1.60)  16.15 (1.21)  14.23 (1.21)   | 27.03 (2.27)  25.06 (1.76)  23.01 (1.55)
LDA          | —             46.39 (2.62)  18.62 (1.53)   | —             47.81 (2.87)  27.44 (1.95)
QDA          | —             —             47.74 (1.73)   | —             —             48.90 (1.65)
kNN          | 29.08 (2.77)  26.73 (1.97)  24.53 (1.60)   | 35.67 (2.59)  34.07 (2.33)  32.36 (1.72)
sLDA         | 10.80 (1.26)  10.48 (1.16)  10.23 (1.07)   | 18.80 (3.19)  17.62 (2.28)  17.24 (1.81)
RAMP         | 14.09 (2.67)  10.56 (1.33)  10.10 (1.04)   | 21.91 (5.49)  19.22 (3.25)  17.55 (2.04)
NSC          | 11.50 (1.13)  11.50 (0.97)  11.67 (1.09)   | 18.49 (1.99)  19.02 (1.78)  19.41 (1.39)
RF           | 12.66 (1.43)  11.77 (1.04)  11.25 (1.10)   | 21.33 (3.01)  20.00 (2.10)  19.25 (1.51)
Sig-LDA      | 10.07 (0.94)  10.09 (0.95)  10.07 (1.03)   | 23.70 (3.14)  20.90 (2.27)  18.95 (1.64)
(— : not applicable.)
Table 1: Error rates (%) for models 1 and 1′; standard deviations over 200 replicates in parentheses.

Gamma(α_j^{(r)}, β_j^{(r)}), which has the density function

f_j^{(r)}(x; \alpha_j^{(r)}, \beta_j^{(r)}) = \frac{1}{(\beta_j^{(r)})^{\alpha_j^{(r)}} \Gamma(\alpha_j^{(r)})} x^{\alpha_j^{(r)} - 1} \exp\{-x/\beta_j^{(r)}\} 1(x \ge 0), \quad j = 1, \ldots, p, \; r = 0, 1. \quad (13)

Denote α^{(r)} = (α_1^{(r)}, . . . , α_p^{(r)})^T and β^{(r)} = (β_1^{(r)}, . . . , β_p^{(r)})^T. Here we let α^{(0)} = (2, 1.5, 1.5, 2, 2, 1_{1×(p−5)})^T, α^{(1)} = (2.5, 1.5, 1.5, 1, 1, 1_{1×(p−5)})^T, β^{(0)} = (1.5, 3, 1, 1, 1, 1_{1×(p−5)})^T, β^{(1)} = (2, 1, 3, 1, 1, 1_{1×(p−5)})^T, p = 400, and n ∈ {100, 200, 400}. Hence the minimal discriminative set S∗ is {1, 2, 3, 4, 5}, by Proposition 3. MLEs of α^{(0)}, α^{(1)}, β^{(0)}, β^{(1)} can be obtained by numerical approaches such as gradient descent or Newton's method, and the marginal class probabilities are estimated by the proportions of the two classes in the training sample. The Bayes classifier estimated from these MLEs is then applied to classify new observations; it is denoted as the independent Gamma classifier in Table 2. For this example, we also apply RaSE with the independent Gamma classifier as one of the base classifiers.

[Figure 1: Average selected percentages of features for model 1 in 200 replicates when n = 1000.]

According to (3) and (13), the decision function of the independent Gamma classifier estimated in subspace S is

C_n^{S,\text{Gamma}}(x) = 1\left\{ \frac{\hat\pi_1 \prod_{j \in S} f_j^{(1)}(x_j; \hat\alpha_j^{(1)}, \hat\beta_j^{(1)})}{\hat\pi_1 \prod_{j \in S} f_j^{(1)}(x_j; \hat\alpha_j^{(1)}, \hat\beta_j^{(1)}) + \hat\pi_0 \prod_{j \in S} f_j^{(0)}(x_j; \hat\alpha_j^{(0)}, \hat\beta_j^{(0)})} > 0.5 \right\},

where \hat\pi_1, \hat\pi_0, \hat\alpha_j^{(1)}, \hat\alpha_j^{(0)}, \hat\beta_j^{(1)}, \hat\beta_j^{(0)} are the corresponding MLEs. This is also a very sparse model.
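The Gamma MLE fitting and the independent Gamma decision rule can be sketched as follows. scipy's `gamma.fit` (with the location fixed at 0) stands in for the Newton iteration via `nlm`, and the two-feature toy data are our own:

```python
import numpy as np
from scipy import stats

def fit_gamma_params(X):
    """Per-feature Gamma MLE (shape, loc=0, scale) for one class."""
    return [stats.gamma.fit(col, floc=0) for col in X.T]

def gamma_classify(x, par0, par1, pi0=0.5, pi1=0.5):
    """Independent Gamma decision rule: label 1 iff the estimated posterior
    probability of class 1 exceeds 0.5, computed on the log scale."""
    ll0 = sum(stats.gamma.logpdf(xj, a, loc, s) for xj, (a, loc, s) in zip(x, par0))
    ll1 = sum(stats.gamma.logpdf(xj, a, loc, s) for xj, (a, loc, s) in zip(x, par1))
    return int(np.log(pi1) + ll1 > np.log(pi0) + ll0)

# toy usage: two features, classes differ in the first feature's shape
rng = np.random.default_rng(0)
X0 = np.column_stack([rng.gamma(2.0, 1.5, 300), rng.gamma(1.0, 1.0, 300)])
X1 = np.column_stack([rng.gamma(5.0, 1.5, 300), rng.gamma(1.0, 1.0, 300)])
par0, par1 = fit_gamma_params(X0), fit_gamma_params(X1)
acc = np.mean([gamma_classify(x, par0, par1) == r
               for r, X in ((0, X0), (1, X1)) for x in X])
```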
Therefore the iteration process improves the RaSE method, yielding a lower misclassification rate. The left panel of Table 2 shows the performance of various methods on this model. It demonstrates that RaSE1-Gamma performs the best when n = 100, while RaSE2-Gamma incurs the lowest misclassification rate with a low standard deviation in the other two settings.

[Figure 2: Average selected percentages of features for model 1′ in 200 replicates when n = 1000.]

Figure 3 shows the average selected percentages of features when n = 400, from which we can see that, due to the high sparsity, the default B2 is not sufficient and makes it hard for the vanilla RaSE classifier to capture all the features in S∗. After iterations, the frequencies of the features in the minimal discriminative set increase significantly, and S∗ can be easily identified for all three base classifiers after two iterations.

[Figure 3: Average selected percentages of features for model 2 in 200 replicates when n = 400.]

5.1.3 Model 3 (QDA) Let x|y = r ~ N(μ^(r), Σ^(r)), r = 0, 1, where Ω^(0) = (Σ^(0))⁻¹ is a p × p band matrix with (Ω^(0))_ii = 1, (Ω^(0))_ik = 0.3 for |i − k| = 1, and all other entries zero.
Ω^(1) = Ω^(0) + Ω̄, where Ω̄ is a p × p sparse symmetric matrix with Ω̄_{10,10} = 0.3758, Ω̄_{10,30} = 0.0616, Ω̄_{10,50} = 0.2037, Ω̄_{30,30} = 0.5482, Ω̄_{30,50} = 0.0286, Ω̄_{50,50} = 0.4614, and all other entries zero. Here p = 200 and n ∈ {200, 400, 1000}. As analyzed in Example 2, the minimal discriminative set is S∗ = {1, 2, 10, 30, 50}, where features 1 and 2 carry linear signals (main effects) and features 10, 30, and 50 carry quadratic signals. The right panel of Table 2 shows the results: RaSE1-QDA achieves the best performance when n = 200, and RaSE2-QDA has the lowest test misclassification rate when n = 400 and 1000, which is reasonable since the data are generated from a QDA-adapted model. RaSE1-kNN and RaSE2-kNN also do a good job when n is large. Figure 4 presents the average selected percentages of features when n = 1000, which shows that the frequencies of the elements of the minimal discriminative set increase after iterations. When the base classifier is QDA or kNN, all five features stand out. On the other hand, RaSE with the LDA base classifier only captures the two linear signals, which is expected since LDA does not consider the quadratic terms.
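The model-3 design can be sketched as follows. `model3_precisions` and `gen_model3` are our own names, the signs of the perturbation entries are taken as printed, and the class means are omitted in this sketch:

```python
import numpy as np

def model3_precisions(p=200):
    """Precision matrices of model 3. Omega^(0) is banded (1 on the diagonal,
    0.3 on the first off-diagonals); Omega^(1) adds a sparse symmetric bump on
    features 10, 30, 50. Entry signs follow the extracted text as printed."""
    Omega0 = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
    Omega1 = Omega0.copy()
    bump = {(9, 9): 0.3758, (9, 29): 0.0616, (9, 49): 0.2037,
            (29, 29): 0.5482, (29, 49): 0.0286, (49, 49): 0.4614}  # 0-based
    for (i, j), v in bump.items():
        Omega1[i, j] += v
        if i != j:
            Omega1[j, i] += v
    return Omega0, Omega1

def gen_model3(n, p=200, seed=0):
    """Draw x|y=r ~ N(0, (Omega^(r))^{-1}); class means are omitted here
    (the paper also places linear signals on features 1 and 2)."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, 0.5, n)
    x = np.empty((n, p))
    for r, Om in enumerate(model3_precisions(p)):
        idx = np.where(y == r)[0]
        L = np.linalg.cholesky(np.linalg.inv(Om))
        x[idx] = rng.standard_normal((idx.size, p)) @ L.T
    return x, y

Om0, Om1 = model3_precisions(60)    # reduced p for a quick demo
X, y = gen_model3(50, p=60)
```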
Method³      | Model 2: n=100  n=200        n=400         | Model 3: n=200  n=400        n=1000
RaSE-LDA       | 21.51 (3.42)  18.52 (2.75)  17.43 (2.00) | 37.30 (3.17)  36.11 (1.97)  35.67 (1.73)
RaSE-Gamma/QDA | 23.57 (3.65)  21.41 (3.17)  20.29 (2.43) | 32.52 (2.90)  30.44 (2.60)  29.00 (1.97)
RaSE-kNN       | 22.83 (3.09)  21.07 (3.07)  20.47 (2.57) | 31.10 (3.23)  27.83 (2.41)  25.22 (1.56)
RaSE1-LDA      | 19.01 (2.83)  15.14 (1.75)  13.55 (1.20) | 36.09 (2.87)  32.82 (1.74)  32.68 (1.49)
RaSE1-Gamma/QDA| 15.05 (2.31)  13.02 (1.51)  12.50 (1.18) | 26.83 (2.47)  25.07 (1.89)  23.53 (1.50)
RaSE1-kNN      | 19.84 (2.90)  16.64 (1.86)  15.33 (1.39) | 28.76 (2.60)  25.88 (1.98)  24.18 (1.47)
RaSE2-LDA      | 20.88 (3.03)  16.92 (2.05)  13.66 (1.14) | 38.09 (2.48)  33.69 (1.83)  32.71 (1.55)
RaSE2-Gamma/QDA| 16.27 (2.22)  12.64 (1.31)  11.83 (1.02) | 26.99 (2.68)  24.87 (1.99)  23.11 (1.60)
RaSE2-kNN      | 22.39 (3.13)  17.40 (2.27)  14.59 (1.44) | 28.73 (2.56)  25.46 (1.82)  23.76 (1.54)
RP-LDA         | 38.89 (1.96)  35.00 (1.97)  30.89 (1.76) | 44.90 (1.86)  42.82 (1.76)  40.38 (1.74)
RP-QDA         | 43.62 (3.62)  37.62 (2.35)  33.02 (2.14) | 43.02 (2.07)  39.87 (1.88)  36.38 (1.78)
RP-kNN         | 41.48 (2.26)  38.90 (2.06)  36.69 (1.85) | 44.32 (1.81)  42.46 (1.58)  40.80 (2.12)
LDA            | —             47.50 (2.35)  49.03 (1.94) | —             42.88 (1.82)  38.68 (1.70)
Gamma/QDA      | 32.06 (2.40)  26.22 (1.80)  21.56 (1.44) | —             —             45.13 (1.58)
kNN            | 45.48 (2.24)  44.68 (2.14)  44.07 (2.00) | 45.67 (1.78)  44.63 (2.02)  43.43 (1.63)
sLDA           | 22.26 (3.52)  18.64 (2.12)  15.34 (1.55) | 36.41 (3.15)  33.87 (2.01)  32.99 (1.52)
RAMP           | 20.64 (3.81)  16.72 (2.25)  13.31 (1.21) | 36.94 (5.87)  32.65 (1.89)  32.42 (1.78)
NSC            | 26.00 (6.49)  19.92 (4.31)  16.87 (2.69) | 41.14 (4.49)  38.24 (3.85)  35.13 (2.20)
RF             | 24.97 (5.74)  18.02 (2.77)  15.26 (1.47) | 37.34 (2.91)  31.61 (2.19)  27.42 (1.60)
Sig-Gamma/QDA  | 12.65 (1.12)  12.12 (1.12)  11.76 (0.97) | 23.62 (1.47)  22.72 (1.40)  22.16 (1.31)
(— : not applicable.)

Table 2: Error rates (%) for models 2 and 3; standard deviations over 200 replicates in parentheses.

5.1.4 Models 4 and 4′ (kNN) As in the LDA models, we first study a sparse setting (model 4) with a data generating process motivated by the kNN classifier. First, 10 initial points z1, . . . , z10 are generated i.i.d. from N(0_{p×1}, I_p); five of them are labeled 0 and the other five are labeled 1. Then, each time, one of {z1, . . . , z10} is randomly selected (say z_{k_i}), and we generate x_i ~ N((z_{k_i,S∗}^T, 0_{1×(p−5)})^T, 0.5² I_p).
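The model-4 generation process can be sketched as follows. This is a minimal Python sketch; `gen_model4` and the reduced dimension in the demo are our own choices:

```python
import numpy as np

def gen_model4(n, p=200, seed=0):
    """Sketch of model 4: ten centers z_1..z_10 (five per class) are drawn
    once; each observation picks a random center, embeds its five signal
    coordinates in the first five features, and adds N(0, 0.5^2 I_p) noise."""
    rng = np.random.default_rng(seed)
    centers = rng.standard_normal((10, 5))
    labels = np.array([0] * 5 + [1] * 5)
    k = rng.integers(0, 10, n)             # which center each point uses
    x = 0.5 * rng.standard_normal((n, p))  # sd 0.5 everywhere
    x[:, :5] += centers[k]
    return x, labels[k]

X, y = gen_model4(120, p=30)               # reduced p for a quick demo
```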
Here the minimal discriminative set is S∗ = {1, 2, 3, 4, 5}, p = 200, and n ∈ {200, 400, 1000}. The results are presented in Table 3, and the average selected percentages of features in the B1 = 200 subspaces are presented in Figure 5. From the left panel of Table 3, it can be seen that the performance of RaSE2-kNN is striking: it outperforms all the other methods, and the gap between its misclassification rate and the others' is very prominent. Note that in this case the Sig-kNN classifier is obtained by applying kNN to only the first five features, also using leave-one-out cross-validation to choose k from {3, 5, 7, 9, 11}.

3. To save space, in the rows labeled "Gamma/QDA", the entry represents the independent Gamma classifier for model 2 and QDA for model 3.

[Figure 4: Average selected percentages of features for model 3 in 200 replicates when n = 1000.]

Note that Sig-kNN is not the optimal classifier, because no true parametric model is available, which explains why RaSE2-kNN can achieve an even better performance when n = 400 and 1000. Figure 5 shows that RaSE with all three base classifiers captures the features in the minimal discriminative set. Now we study a non-sparse setting (model 4′), where the number of signals is increased to 30 and each x_i ~ N((z_{k_i,S∗}^T, 0_{1×(p−30)})^T, 2 I_p). The other parameters and the data generation mechanism are the same as in model 4. From the right panel of Table 3, we observe that RaSE2-kNN still achieves the best performance, and the iterations improve the performance of the RaSE classifiers under this non-sparse setting.
In addition, Figure 6 shows that RaSE captures the signals while the noise features keep a low selected percentage, which again verifies the robustness of RaSE.

Method     | Model 4: n=200  n=400        n=1000         | Model 4′: n=200  n=400       n=1000
RaSE-LDA   | 27.38 (9.53)  25.08 (8.40)  24.92 (9.05)    | 26.50 (5.78)  23.84 (4.98)  20.92 (5.20)
RaSE-QDA   | 24.24 (7.20)  22.62 (6.77)  22.04 (6.73)    | 29.28 (4.34)  26.73 (3.63)  24.77 (3.35)
RaSE-kNN   | 13.26 (5.03)  10.67 (4.44)  8.85 (4.05)     | 20.83 (4.50)  15.70 (3.74)  10.33 (2.90)
RaSE1-LDA  | 25.89 (10.13) 23.05 (8.49)  23.42 (8.95)    | 21.28 (4.76)  18.59 (3.97)  16.87 (3.92)
RaSE1-QDA  | 13.83 (5.88)  12.54 (5.79)  12.70 (5.51)    | 21.97 (5.37)  19.32 (4.65)  15.84 (4.36)
RaSE1-kNN  | 7.51 (3.80)   6.16 (3.48)   5.90 (3.12)     | 16.24 (3.52)  11.09 (2.84)  7.08 (2.04)
RaSE2-LDA  | 27.49 (10.13) 23.39 (8.51)  23.39 (9.05)    | 20.73 (4.99)  17.34 (3.99)  15.72 (3.93)
RaSE2-QDA  | 13.15 (5.00)  11.90 (5.38)  12.15 (5.22)    | 20.94 (5.06)  18.61 (4.61)  14.96 (4.11)
RaSE2-kNN  | 7.06 (3.62)   5.89 (3.32)   5.74 (3.02)     | 13.62 (3.31)  8.64 (2.40)   5.36 (1.58)
RP-LDA     | 28.03 (8.91)  25.48 (7.80)  24.83 (8.09)    | 22.16 (4.49)  18.84 (4.42)  16.84 (4.10)
RP-QDA     | 26.22 (7.66)  23.93 (6.97)  22.75 (7.00)    | 21.37 (4.29)  17.58 (3.84)  15.53 (3.73)
RP-kNN     | 26.63 (8.09)  24.49 (7.15)  23.32 (7.35)    | 22.37 (4.69)  18.69 (4.19)  16.50 (3.95)
LDA        | 47.51 (2.66)  33.27 (7.65)  27.89 (8.97)    | 46.06 (3.05)  25.16 (4.12)  17.78 (4.05)
QDA        | —             —             36.70 (4.83)    | —             —             30.45 (2.89)
kNN        | 24.49 (6.64)  21.04 (6.60)  19.07 (6.50)    | 24.73 (4.35)  20.06 (3.95)  15.91 (3.59)
sLDA       | 24.90 (9.41)  22.80 (8.19)  23.22 (8.79)    | 19.78 (4.92)  16.41 (3.98)  14.59 (3.69)
RAMP       | 22.14 (11.50) 15.01 (7.83)  12.82 (7.13)    | 24.79 (6.49)  18.59 (4.56)  13.13 (3.13)
NSC        | 27.70 (9.44)  25.71 (8.35)  25.82 (8.72)    | 22.09 (6.20)  18.17 (4.51)  16.17 (4.09)
RF         | 23.64 (8.05)  17.39 (6.19)  14.84 (5.73)    | 21.73 (5.44)  15.70 (3.65)  11.54 (2.80)
Sig-kNN    | 6.89 (3.40)   6.03 (3.37)   6.01 (3.21)     | 7.60 (2.19)   5.40 (1.76)   4.06 (1.43)
(— : not applicable.)

Table 3: Error rates (%) for models 4 and 4′; standard deviations over 200 replicates in parentheses.

5.2 Real-data Experiments 5.2.1 Madelon The madelon data set (http://archive.ics.uci.edu/ml/datasets/madelon) is an artificial data set containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled 0 or 1 (Guyon et al., 2005).
It consists of 2000 observations, 1000 of class 0 and 1000 of class 1. There are 500 features, among which only 20 are informative; the others have no predictive power. The training sample size is set to n ∈ {200, 500, 1000} for three different settings, and in each case the remaining data are used as test data. 200 replicates are run, and the average misclassification rate and standard deviation of each method are reported in Table 4. Figure 7 presents the average selected percentage of features in different RaSE models when n = 1000. From the left panel of Table 4, we can see that RaSE1-kNN outperforms all the other methods in all three settings. Figure 7 also shows that the RaSE model leads to sparse solutions, since most features have selected percentages close to zero.

[Figure 5: Average selected percentages of features for model 4 in 200 replicates when n = 1000.]

5.2.2 Musk The musk data set (https://archive.ics.uci.edu/ml/datasets/Musk+(Version+2)) contains 6598 observations with 5581 non-musk (class 0) and 1017 musk (class 1) molecules. The molecules are classified based on p = 166 shape measurements (Dua and Graff, 2019). The training sample size is set to 200, 500, or 1000; in each setting, the remaining observations are used as test data. 200 replicates are considered, and the average misclassification rates and standard deviations are reported in Table 4.
When the training sample size is 200, RP-kNN achieves the lowest misclassification rate, and RaSE-LDA, RaSE-kNN, RaSE1-LDA, RaSE1-kNN, sLDA, RP-QDA, and RF also perform well. As the sample size increases, RaSE-kNN becomes the best when n = 500, with RF yielding a comparable performance; when n = 1000, RaSE1-kNN and RF outperform the other methods. 5.2.3 Mice Protein Expression The mice protein expression data set (https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) contains 1080 instances with 570 healthy mice (class 0) and 510 mice with Down syndrome (class 1).

[Figure 6: Average selected percentages of features for model 4′ in 200 replicates when n = 1000.]

There are 77 features representing the expression levels of 77 different proteins (Higuera et al., 2015). Training samples of size 200, 500, and 800 are considered, and the remaining observations are used as test data. The averages and standard deviations of the test misclassification rates over 200 replicates are reported in Table 5. When n = 200, sLDA achieves the lowest error among all approaches, and RaSE-kNN achieves a similar performance. As the sample size increases to 500 and 800, the average misclassification rates of RaSE-kNN and RaSE1-kNN decrease significantly, and they become the best classifiers when n = 500 and 800, respectively. When n = 800, RP-kNN and RF also achieve a similar performance. 5.2.4 Hand-written Digits Recognition The hand-written digits recognition data set (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) consists of features of hand-written numerals (0-9) extracted from a collection of Dutch utility maps (Dua and Graff, 2019).
Here we use the mfeat-fou data set, which records 76 Fourier coefficients of the character shapes. We extract the observations corresponding to digits 7 (class 0) and 9 (class 1) from the original data, giving 400 observations, 200 in each class.

Method     | Madelon: n=200  n=500        n=1000         | Musk: n=200   n=500         n=1000
RaSE-LDA   | 43.39 (3.53)  39.87 (1.63)  39.16 (1.29)    | 10.55 (1.22)  8.86 (0.77)   7.87 (0.46)
RaSE-QDA   | 43.83 (3.28)  40.58 (1.73)  40.01 (1.57)    | 12.55 (7.02)  9.16 (0.97)   7.77 (0.73)
RaSE-kNN   | 35.02 (2.52)  26.78 (2.71)  21.45 (1.91)    | 10.26 (1.81)  7.33 (0.96)   5.76 (0.74)
RaSE1-LDA  | 45.98 (2.49)  39.76 (1.81)  38.57 (1.11)    | 10.56 (1.19)  8.88 (0.81)   7.82 (0.46)
RaSE1-QDA  | 44.46 (5.36)  37.21 (2.85)  34.63 (2.15)    | 16.71 (9.58)  11.02 (2.99)  8.60 (0.85)
RaSE1-kNN  | 26.05 (3.33)  16.66 (1.33)  13.57 (1.00)    | 10.52 (1.95)  7.49 (0.97)   5.71 (0.78)
RP-LDA     | 41.18 (1.77)  39.86 (1.14)  39.41 (1.17)    | 12.58 (1.86)  10.19 (0.76)  9.50 (0.46)
RP-QDA     | 39.97 (1.53)  39.29 (1.57)  38.79 (1.49)    | 10.70 (1.95)  8.75 (0.83)   8.18 (0.45)
RP-kNN     | 40.10 (1.66)  39.07 (1.54)  38.40 (1.43)    | 10.01 (1.66)  8.05 (1.06)   6.89 (0.84)
LDA        | —             49.82 (1.24)  47.60 (1.49)    | 25.60 (3.76)  9.06 (0.80)   6.97 (0.40)
QDA        | —             —             —               | —             —             —
kNN        | 35.89 (1.42)  31.84 (1.43)  28.56 (1.56)    | 11.35 (1.55)  8.19 (0.86)   6.53 (0.55)
sLDA       | 43.01 (2.98)  40.40 (1.94)  39.57 (1.52)    | 10.68 (1.54)  7.81 (0.68)   6.85 (0.40)
RAMP       | 49.09 (3.46)  41.88 (5.16)  38.58 (1.36)    | 14.35 (1.93)  11.50 (1.40)  9.72 (1.10)
NSC        | 41.49 (2.32)  40.10 (1.09)  40.05 (1.24)    | 20.72 (5.11)  24.12 (3.62)  25.58 (1.01)
RF         | 43.91 (2.49)  38.59 (1.56)  34.17 (1.46)    | 10.83 (1.44)  7.60 (0.66)   5.71 (0.48)
(— : not applicable.)
Table 4: Error rates (%) for the madelon and musk data sets; standard deviations over 200 replicates in parentheses.

[Figure 7: Average selected percentages of features for the madelon data set in 200 replicates when n = 1000.]

Method     | Mice protein: n=200  n=500   n=800          | Digits: n=50  n=100         n=200
RaSE-LDA   | 7.41 (1.14)   5.70 (0.93)   4.65 (1.24)     | 1.56 (0.85)   1.13 (0.59)   0.80 (0.54)
RaSE-QDA   | 9.14 (2.58)   4.81 (1.17)   3.44 (1.23)     | 2.50 (1.47)   1.89 (0.91)   1.47 (0.96)
RaSE-kNN   | 6.80 (1.88)   1.55 (0.88)   0.62 (0.55)     | 1.86 (0.96)   1.12 (0.66)   0.75 (0.45)
RaSE1-LDA  | 7.24 (1.10)   5.53 (1.02)   4.49 (1.23)     | 1.06 (0.63)   0.70 (0.35)   0.53 (0.40)
RaSE1-QDA  | 9.38 (2.21)   5.16 (1.16)   3.40 (1.16)     | 2.18 (1.66)   1.18 (0.71)   0.85 (0.61)
RaSE1-kNN  | 7.43 (2.00)   1.70 (0.87)   0.60 (0.56)     | 1.72 (0.95)   1.02 (0.62)   0.60 (0.44)
RP-LDA     | 24.84 (2.91)  22.79 (2.50)  22.34 (2.55)    | 1.75 (1.29)   1.22 (0.67)   1.04 (0.61)
RP-QDA     | 18.31 (2.57)  16.19 (2.08)  15.66 (2.38)    | 2.12 (1.67)   1.28 (0.94)   0.92 (0.62)
RP-kNN     | 11.77 (2.54)  2.57 (0.89)   0.92 (0.68)     | 1.68 (1.34)   1.03 (0.60)   0.84 (0.59)
LDA        | 7.07 (1.37)   3.88 (0.85)   3.13 (1.08)     | —             1.82 (0.96)   1.01 (0.56)
QDA        | —             —             3.25 (2.32)     | —             —             —
kNN        | 20.53 (2.47)  7.75 (1.44)   2.80 (1.21)     | 1.42 (1.32)   0.67 (0.41)   0.60 (0.47)
sLDA       | 5.70 (1.10)   3.95 (0.91)   3.13 (1.05)     | 2.30 (1.36)   1.71 (1.27)   1.15 (0.95)
RAMP       | 11.76 (2.42)  8.52 (1.69)   7.02 (1.89)     | 3.31 (1.75)   2.26 (1.19)   1.70 (0.87)
NSC        | 30.49 (3.31)  29.70 (2.76)  29.88 (3.02)    | 3.22 (1.44)   3.53 (1.23)   3.59 (1.50)
RF         | 8.32 (1.71)   2.62 (0.94)   1.04 (0.73)     | 2.34 (1.24)   1.63 (0.73)   1.37 (0.74)
(— : not applicable.)

Table 5: Error rates (%) for the mice protein expression and hand-written digits recognition data sets; standard deviations over 200 replicates in parentheses.

The training samples of size 50, 100, and 200 are used, and the remaining data serve as test data.
The averages and standard deviations of the test misclassification rates are reported in Table 5, from which it can be seen that RaSE1-LDA enjoys the minimal test misclassification rate when n = 50 and 200, while the standard kNN method is the best when n = 100. We also note that all the RaSE classifiers improve after one iteration in all three settings, implying that the underlying classification problem may be a sparse one. 6. Discussion 6.1 Summary In this work, we introduce a flexible ensemble classification framework named RaSE, designed for sparse classification problems. To select the optimal subspace for each weak learner, we define a new information criterion, the ratio information criterion (RIC), based on Kullback-Leibler divergence, and show that it achieves screening consistency and weak consistency under general model conditions. This guarantees that, with high probability, each weak learner is trained using features from the minimal discriminative set when the sample size and the number of random subspaces are sufficiently large. We also investigate the consistency of RIC for LDA and QDA models under specific conditions, and conduct a theoretical analysis of RaSE classifiers with specific base classifiers. In addition, we present two versions of the RaSE algorithm: the vanilla version (RaSE) and the iterative version (RaSE_T). We also apply RaSE to feature ranking based on the average selected percentage of features across subspaces. Theoretical analysis shows that when the stepwise detectable condition holds and the signal is sufficiently sparse, iterative RaSE covers the minimal discriminative set with high probability after a few iterations, with a smaller required number of random subspaces than for vanilla RaSE. Numerous numerical experiments, on both simulated and real data, verify that RaSE is a favorable classification method for sparse classification problems.
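As a small illustration, the closed-form RIC for the LDA model (Proposition 8 in the appendix), RIC_n(S) = −(μ̂_S^(1) − μ̂_S^(0))ᵀ Σ̂_{S,S}⁻¹ (μ̂_S^(1) − μ̂_S^(0)) + c_n(|S| + 1), can be computed directly. This sketch uses a pooled MLE covariance and synthetic data; `ric_lda` is our own name:

```python
import numpy as np

def ric_lda(X, y, S, cn):
    """RIC for the LDA model on subspace S (Proposition 8):
    -(mu1 - mu0)' Sigma^{-1} (mu1 - mu0) + cn * (|S| + 1),
    with the pooled (MLE) covariance on the selected coordinates."""
    S = np.asarray(S)
    X0, X1 = X[y == 0][:, S], X[y == 1][:, S]
    d = X1.mean(0) - X0.mean(0)
    Sigma = (len(X0) * np.atleast_2d(np.cov(X0.T, bias=True))
             + len(X1) * np.atleast_2d(np.cov(X1.T, bias=True))) / len(X)
    return -d @ np.linalg.solve(Sigma, d) + cn * (len(S) + 1)

# subspaces covering the true signal should score lower (better)
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10))
y = rng.binomial(1, 0.5, 400)
X[y == 1, 0] += 2.0
X[y == 1, 1] += 1.0                     # signal on features 1 and 2
ric_signal = ric_lda(X, y, [0, 1], cn=0.1)
ric_noise = ric_lda(X, y, [5, 6], cn=0.1)
```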
The RaSE algorithms are available in the R package RaSEn (https://cran.r-project.org/web/packages/RaSEn/index.html). 6.2 Future Work There are many interesting directions in which RaSE can be extended and explored. One question is how to extend RaSE and RIC to multi-class problems; for example, pairwise KL divergences could be used to define a multi-class RIC. Moreover, RaSE can be applied to variable selection by thresholding the average selected percentage of features in the B1 subspaces. When the sample size is small, a bootstrap-type idea can be used: each time, apply RaSE to a bootstrap sample and, at the end, average the selected percentages for variable selection or feature ranking. Finally, we aggregate the classifiers by taking a simple average over all weak learners; a boosting-type idea could instead assign different weights to weak learners according to their training errors, which may further improve the performance of RaSE. In addition, the distribution over random subspaces could differ across weak learners and could also be updated using a boosting-like scheme. Acknowledgments The authors would like to thank the Action Editor and anonymous referees for many constructive comments which have greatly improved the paper. This work was partially supported by National Science Foundation CAREER grant DMS-2013789. Appendix A. Additional Figures of Simulations We present figures of the selected percentage of each feature when n equals the smallest value among the three settings.
[Figure 8: Average selected percentages of features for model 1 in 200 replicates when n = 200.]
[Figure 9: Average selected percentages of features for model 1′ in 200 replicates when n = 200.]
[Figure 10: Average selected percentages of features for model 2 in 200 replicates when n = 100.]
Figure 11: Average selected percentages
of features for model 3 in 200 replicates when n = 200.

[Figure 12: Average selected percentages of features for model 4 in 200 replicates when n = 200.]
[Figure 13: Average selected percentages of features for model 4′ in 200 replicates when n = 200.]

Appendix B. Main Proofs In the following proofs, for convenience, we use C to represent a positive constant, which may differ from occurrence to occurrence. B.1 Proof of Proposition 3 (i) The conditional probability is

P(y = 1|x) = \frac{\pi_1 f^{(1)}(x)}{\pi_1 f^{(1)}(x) + \pi_0 f^{(0)}(x)}.

The definition of a discriminative set is equivalent to

P(y = 1|x) = P(y = 1|x_S), \quad (14)

or P(y = 0|x) = P(y = 0|x_S) almost surely, which is in turn equivalent to

\frac{f^{(1)}(x)}{f^{(0)}(x)} = \frac{f_S^{(1)}(x_S)}{f_S^{(0)}(x_S)}. \quad (15)

(ii) First, observe that the condition is equivalent to f^{(1)}(x) = h(x_S) f^{(0)}(x). Integrating both sides with respect to x_{S^c}, we get f_S^{(1)}(x_S) = h(x_S) f_S^{(0)}(x_S). By the equivalence of (14) and (15), S is a discriminative set. B.2 Proof of Proposition 5 To facilitate our analysis, we first state the following lemma. Lemma 26 A discriminative set S is unique if it is no longer a discriminative set after removing any feature from it.
Proof [Proof of Lemma 26] Suppose the conclusion does not hold; then there exist two different discriminative sets S_1 and S_2 with this property, that is, deleting any feature from either of them yields a non-discriminative set. According to Proposition 3 (i), we have

\frac{f_{S_1}^{(1)}(x_{S_1})}{f_{S_1}^{(0)}(x_{S_1})} = \frac{f_{S_2}^{(1)}(x_{S_2})}{f_{S_2}^{(0)}(x_{S_2})}, \quad (16)

almost surely w.r.t. P_x, where P_x = \pi_0 P^{(0)} + \pi_1 P^{(1)}. Since f^{(0)}, f^{(1)} are supported on the whole of \mathbb{R}^p, P_x dominates the Lebesgue measure. Combining this with the explicit form of the Gaussian density and denoting \bar\Omega_{S_1,S_1} = (\Sigma^{(1)}_{S_1,S_1})^{-1} - (\Sigma^{(0)}_{S_1,S_1})^{-1}, \bar\Omega_{S_2,S_2} = (\Sigma^{(1)}_{S_2,S_2})^{-1} - (\Sigma^{(0)}_{S_2,S_2})^{-1}, \delta_{S_1} = (\Sigma^{(0)}_{S_1,S_1})^{-1}\mu^{(0)}_{S_1} - (\Sigma^{(1)}_{S_1,S_1})^{-1}\mu^{(1)}_{S_1}, \delta_{S_2} = (\Sigma^{(0)}_{S_2,S_2})^{-1}\mu^{(0)}_{S_2} - (\Sigma^{(1)}_{S_2,S_2})^{-1}\mu^{(1)}_{S_2}, it can be obtained from (4) and (16) that

c - \frac{1}{2} x_{S_1}^T \bar\Omega_{S_1,S_1} x_{S_1} + \delta_{S_1}^T x_{S_1} = c' - \frac{1}{2} x_{S_2}^T \bar\Omega_{S_2,S_2} x_{S_2} + \delta_{S_2}^T x_{S_2},

where c, c' are constants not depending on x. Considering the combined vector x_{S_1 \cup S_2}, there exist a matrix \bar\Omega and a vector \bar\delta such that the equation above simplifies to

x_{S_1 \cup S_2}^T \bar\Omega x_{S_1 \cup S_2} + \bar\delta^T x_{S_1 \cup S_2} + c - c' = 0, \quad (17)

for almost every x_{S_1 \cup S_2} in Euclidean space. Since S_1 is a discriminative set, by Example 2, for every feature j \in S_1, the corresponding row of \bar\Omega_{S_1,S_1} is non-zero or the corresponding component of \delta_{S_1} is non-zero; the same holds for S_2. Since S_1\setminus S_2 and S_2\setminus S_1 are not empty, at least one of \bar\Omega and \bar\delta contains non-zero components. However, (17) clearly cannot hold for almost every x_{S_1 \cup S_2} in Euclidean space, which is a contradiction. Thus S is unique. Now let us proceed with the proof of Proposition 5. By Definition 1, the minimal discriminative set S^* satisfies the property described in Lemma 26; therefore, by the lemma, S^* is unique, which implies (i). For (ii), if there exists a discriminative set S \not\supseteq S^*, we can remove elements from S until we arrive at a discriminative set \tilde S that satisfies the property stated in Lemma 26.
It is then apparent that $S' \neq S^*$, while both satisfy the property in Lemma 26, which contradicts the uniqueness given by Lemma 26. (iii) is trivial from Definition 1.

B.3 Proof of Proposition 8

By Definition 6 and the fact that $\deg(S) = |S| + 1$, it is easy to obtain that
\[
\begin{aligned}
\mathrm{RIC}_n(S)
&= \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 0)\Big[\big((\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\big)x_{i,S} + \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S - \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S\Big] \\
&\quad + \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 1)\Big[\big((\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\big)x_{i,S} + \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S - \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S\Big] \\
&\quad + c_n(|S| + 1) \\
&= 2\hat\pi_0\Big[\big((\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\big)\hat\mu^{(0)}_S + \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S - \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S\Big] \\
&\quad + 2\hat\pi_1\Big[\big((\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\big)\hat\mu^{(1)}_S + \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S - \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S\Big] \\
&\quad + c_n(|S| + 1) \\
&= -(\hat\mu^{(1)}_S - \hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}(\hat\mu^{(1)}_S - \hat\mu^{(0)}_S) + c_n(|S| + 1),
\end{aligned}
\]
where the second equality uses $n^{-1}\sum_{i=1}^n \mathbb{1}(y_i = r)\, x_{i,S} = \hat\pi_r \hat\mu^{(r)}_S$, $r = 0, 1$. This completes the proof.

B.4 Proof of Proposition 9

By Definition 6 and the fact that $\deg(S) = |S|(|S|+3)/2 + 1$, it is easy to obtain that
\[
\begin{aligned}
\mathrm{RIC}_n(S)
&= \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 0)\Big[\tfrac{1}{2}x_{i,S}^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x_{i,S} + \big((\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1}\big)x_{i,S} \\
&\qquad + \tfrac{1}{2}(\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1}\hat\mu^{(0)}_S - \tfrac{1}{2}(\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1}\hat\mu^{(1)}_S + \tfrac{1}{2}\log|\hat\Sigma^{(0)}_{S,S}| - \tfrac{1}{2}\log|\hat\Sigma^{(1)}_{S,S}|\Big] \\
&\quad + \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 1)\Big[\tfrac{1}{2}x_{i,S}^T\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)x_{i,S} + \big((\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1}\big)x_{i,S} \\
&\qquad + \tfrac{1}{2}(\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1}\hat\mu^{(1)}_S - \tfrac{1}{2}(\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1}\hat\mu^{(0)}_S + \tfrac{1}{2}\log|\hat\Sigma^{(1)}_{S,S}| - \tfrac{1}{2}\log|\hat\Sigma^{(0)}_{S,S}|\Big] \\
&\quad + c_n\big(|S|(|S|+3)/2 + 1\big).
\end{aligned}
\]
Denote the observations in class $r$ by $\{x^{(r)}_i\}_{i=1}^{n_r}$, $r = 0, 1$. It holds that
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \mathbb{1}(y_i = 0)\Big[x_{i,S}^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x_{i,S}\Big]
&= \frac{1}{n}\sum_{i=1}^{n_0} (x^{(0)}_{i,S})^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x^{(0)}_{i,S} \\
&= \frac{1}{n}\sum_{i=1}^{n_0} \mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x^{(0)}_{i,S}(x^{(0)}_{i,S})^T\Big] \\
&= \mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\frac{1}{n}\sum_{i=1}^{n_0} x^{(0)}_{i,S}(x^{(0)}_{i,S})^T\Big] \\
&= \hat\pi_0\,\mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\big(\hat\mu^{(0)}_S(\hat\mu^{(0)}_S)^T + \hat\Sigma^{(0)}_{S,S}\big)\Big] \\
&= \hat\pi_0(\hat\mu^{(0)}_S)^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\hat\mu^{(0)}_S \\
&\quad + \hat\pi_0\,\mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\hat\Sigma^{(0)}_{S,S}\Big].
\end{aligned}
\]
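The closed-form RIC for the LDA model can be sanity-checked numerically: for equal-covariance Gaussians, the weighted KL criterion $-2[\pi_0\,\mathrm{KL}(f^{(0)}\|f^{(1)}) + \pi_1\,\mathrm{KL}(f^{(1)}\|f^{(0)})]$ collapses to the negative quadratic form above, independently of $\pi_0, \pi_1$. The following sketch (not part of the RaSEn package; the parameter values are illustrative) verifies this with the standard Gaussian KL formula.

```python
import numpy as np

def gaussian_kl(mu0, mu1, S0, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - d
                  + diff @ S1_inv @ diff
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Illustrative LDA parameters (equal covariance across classes).
mu0 = np.array([0.0, 0.0, 0.0])
mu1 = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
pi0, pi1 = 0.4, 0.6

weighted_kl = pi0 * gaussian_kl(mu0, mu1, Sigma, Sigma) \
            + pi1 * gaussian_kl(mu1, mu0, Sigma, Sigma)
diff = mu1 - mu0
quad = diff @ np.linalg.inv(Sigma) @ diff

# -2 * weighted KL equals the negative quadratic form in Proposition 8.
assert np.isclose(-2 * weighted_kl, -quad)
```

The same `gaussian_kl` helper, with two different covariance matrices, gives the population version of the QDA criterion treated in Proposition 9.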
Similarly, we have
\[
\frac{1}{n}\sum_{i=1}^n \mathbb{1}(y_i = 1)\Big[x_{i,S}^T\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)x_{i,S}\Big]
= \hat\pi_1(\hat\mu^{(1)}_S)^T\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)\hat\mu^{(1)}_S + \hat\pi_1\,\mathrm{Tr}\Big[\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)\hat\Sigma^{(1)}_{S,S}\Big].
\]
Combining these with the fact that $n^{-1}\sum_{i=1}^n \mathbb{1}(y_i = r)\,x_{i,S} = \hat\pi_r \hat\mu^{(r)}_S$, $r = 0, 1$, we obtain
\[
\begin{aligned}
\mathrm{RIC}_n(S) &= -(\hat\mu^{(1)}_S - \hat\mu^{(0)}_S)^T\big[\hat\pi_1(\hat\Sigma^{(0)}_{S,S})^{-1} + \hat\pi_0(\hat\Sigma^{(1)}_{S,S})^{-1}\big](\hat\mu^{(1)}_S - \hat\mu^{(0)}_S) \\
&\quad + \mathrm{Tr}\big[\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)\big(\hat\pi_1\hat\Sigma^{(1)}_{S,S} - \hat\pi_0\hat\Sigma^{(0)}_{S,S}\big)\big] + (\hat\pi_1 - \hat\pi_0)\big(\log|\hat\Sigma^{(1)}_{S,S}| - \log|\hat\Sigma^{(0)}_{S,S}|\big) \\
&\quad + c_n\big(|S|(|S|+3)/2 + 1\big),
\end{aligned}
\]
which completes the proof.

B.5 Proof of Theorem 10

Denote $g_n(\alpha') = \pi_1 g^{(1)}_n(\alpha') - \pi_0 g^{(0)}_n(\alpha')$. By the definition of $\{\alpha_i\}_{i=1}^N$, it holds that $g_n(\alpha') = 0$ when $\alpha' \notin \{\alpha_i\}_{i=1}^N$. Recall that $x|y=0 \sim P^{(0)}$ and $x|y=1 \sim P^{(1)}$, where $P^{(0)}, P^{(1)}$ are the corresponding distributions. We have
\[
\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] = \pi_0 \int_{\mathcal{X}} P(\nu_n(x) > \alpha)\, dP^{(0)} + \pi_1 \int_{\mathcal{X}} P(\nu_n(x) \le \alpha)\, dP^{(1)}.
\]
For given $x$ and the corresponding $\alpha' = \mu_n(x)$, we can construct a random variable $T = \sum_{j=1}^{B_1} \mathbb{1}\{C^{S_j}_n(x) = 1\} \sim \mathrm{Bin}(B_1, \alpha')$. Then
\[
\int_{\mathcal{X}} P(\nu_n(x) \le \alpha)\, dP^{(1)} = \int_{[0,1]} P(T \le B_1\alpha)\, dG^{(1)}_n(\alpha').
\]
Similarly,
\[
\int_{\mathcal{X}} P(\nu_n(x) > \alpha)\, dP^{(0)} = 1 - \int_{[0,1]} P(T \le B_1\alpha)\, dG^{(0)}_n(\alpha').
\]
This leads to
\[
\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] = \pi_0 + \int_{[0,1]} P(T \le B_1\alpha)\, dG_n(\alpha'),
\]
where $G_n(\alpha') = \pi_1 G^{(1)}_n(\alpha') - \pi_0 G^{(0)}_n(\alpha')$. There also holds
\[
\begin{aligned}
R(C^{\mathrm{RaSE}}_n) &= \pi_0 P^{(0)}(\mu_n(x) > \alpha) + \pi_1 P^{(1)}(\mu_n(x) < \alpha) + \tfrac{1}{2}\big[\pi_0 P^{(0)}(\mu_n(x) = \alpha) + \pi_1 P^{(1)}(\mu_n(x) = \alpha)\big] \\
&= \pi_0\big(1 - G^{(0)}_n(\alpha)\big) + \pi_1 G^{(1)}_n(\alpha) + \tfrac{1}{2}\big[\pi_0 g^{(0)}_n(\alpha) - \pi_1 g^{(1)}_n(\alpha)\big] \\
&= \pi_0 + G_n(\alpha) - \tfrac{1}{2} g_n(\alpha),
\end{aligned}
\]
where $g_n(\alpha') = \pi_1 g^{(1)}_n(\alpha') - \pi_0 g^{(0)}_n(\alpha')$. This implies
\[
\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] - R(C^{\mathrm{RaSE}}_n) = \int_{[0,1]} \big[P(T \le B_1\alpha) - \mathbb{1}\{\alpha' \le \alpha\}\big]\, dG_n(\alpha') + \tfrac{1}{2} g_n(\alpha). \tag{18}
\]

(i) When $\alpha \notin \{\alpha_i\}_{i=1}^N$, $g_n(\alpha) = 0$ holds. For $\alpha' \in \{\alpha_i\}_{i=1}^N$, by Hoeffding's inequality (Petrov, 2012), we obtain that
\[
\begin{aligned}
\big|P(T \le B_1\alpha) - \mathbb{1}\{\alpha' \le \alpha\}\big|
&\le P\big(T - B_1\alpha' > B_1(\alpha - \alpha')\big)\mathbb{1}\{\alpha' \le \alpha\} + P\big(B_1\alpha' - T \ge B_1(\alpha' - \alpha)\big)\mathbb{1}\{\alpha' > \alpha\} \\
&\le \exp\{-2B_1(\alpha' - \alpha)^2\} \le \exp\{-C_\alpha B_1\},
\end{aligned}
\]
where $C_\alpha = 2\min_{1\le i\le N} |\alpha - \alpha_i|^2$. This leads to the final bound
\[
\big|\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] - R(C^{\mathrm{RaSE}}_n)\big| \le \exp\{-C_\alpha B_1\}.
\]

(ii) When $\alpha = \alpha_{i_0}$, $i_0 \in \{1, 2, \ldots$
, N}, for α = αi0, we have |P(T B1α) 1{α α}| exp{ 2B1(α α)2}, leading to X i =i0 |P(T B1α) 1{α α}|gn(αi) = exp { CαB1} X i =i0 gn(αi) exp { CαB1} , where Cα = 2 min 1 i N |α αi|2 = 2 min i =i0 |αi0 αi|2. Therefore, again by (18), we have |E[R(CRa SE)] R(CRa SE n )| [P(T B1αi0) 1] gn(αi0) + 1 2gn(αi0) +exp { CαB1} . By Berry-Esseen theorem (Esseen, 1956), P(T B1αi0) = P B1(αi0(1 αi0)) 0 as B1 . Eventually there holds that |E[R(CRa SE)] R(CRa SE n )| O 1 B1 This completes the proof. B.6 Proof of Theorem 11 Referring to the proof of the bound on MC-variance in Cannings and Samworth (2017), it can be obtained that Rp 1{νn(x)>α}d P(0) 2 Z [0,α ] P(T B1α)P(T > B1α)d G(0) n (α )d G(0) n (α ), where given x, T Bin(B1, α ), T Bin(B1, α ) and T , T are independent. (i) For α / {α(0) i }N0 i=1 {α(1) i }N1 i=1: constant C(0) α = 2 min 1 i N0(|α α(0) i |2) > 0, C(1) α = 2 min 1 i N1(|α α(1) i |2) > 0. If α α α : Then by Hoeffding s inequality, we have P(T B1α) = P(T B1α B1(α α )) exp{ B1C(0) α }. (20) If α α α: Similarly we have P(T > B1α) exp{ B1C(0) α }. (21) If α α α : We have both (20) and (21). Thus we always have P(T B1α)P(T > B1α) exp{ B1C(0) α }. Tian and Feng Since here the integration actually is the finite summation, it implies that X 1{νn(x)>α}d P(0) exp{ B1C(0) α }. Since α / {α(1) i }N1 i=1: We can obtain the similar conclusion that X 1{νn(x) α}d P(1) exp{ B1C(1) α }. Thus for any α / {α(0) i }N0 i=1 {α(1) i }N1 i=1, setting Cα = 2 min 1 i N0 1 j N1 (|α α(0) i |2, |α α(1) j |2) = min(C(0) α , C(1) α ) > 0, then by the convexity, we have Var R(CRa SE n ) = Var π0 X 1{νn(x)>α}d P(0) + π1 X 1{νn(x) α}d P(1) X 1{νn(x)>α}d P(0) + π1Var Z X 1{νn(x) α}d P(1) exp{ B1Cα}. (ii) For α = α(0) i0 or α(1) i1 : Without loss of generality, suppose α = α(0) i0 . When α < α , similar to (i), there exists positive number C α such that P(T B1α)P(T > B1α) exp{ B1C α}. 
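The Hoeffding bounds on binomial tails used throughout this argument are easy to check numerically against the exact tail. A minimal sketch (the values of $B_1$, $\alpha$, $\alpha'$ are hypothetical, chosen only to illustrate the exponential decay in $B_1$):

```python
import math

def binom_cdf(k, n, p):
    """Exact P(T <= k) for T ~ Bin(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

B1, alpha, alpha_prime = 200, 0.5, 0.7   # illustrative values with alpha' > alpha
exact = binom_cdf(int(B1 * alpha), B1, alpha_prime)
hoeffding = math.exp(-2 * B1 * (alpha_prime - alpha) ** 2)

assert exact <= hoeffding   # Hoeffding's inequality: P(T <= B1*alpha) <= exp{-2 B1 (alpha'-alpha)^2}
assert hoeffding < 1e-6     # the bound is already tiny at B1 = 200
```

This is the mechanism behind both Theorem 10 and the MC-variance bound of Theorem 11: once $\alpha' $ is bounded away from $\alpha$, the relevant tail probabilities vanish exponentially fast in the ensemble size $B_1$.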
When α = α = α(0) i0 , similar to (19), there holds P(T B1α)P(T > B1α) = 1 Thus it holds X 1{νn(x)>α}d P (0) 1 2(g(0) n (α(0) i0 ))2 + O 1 B1 And similar results hold for Var R X 1{νn(x) α}d P (1) . Eventually we would have Var R(CRa SE n ) π0Var Z X 1{νn(x)>α}d P(0) + π1Var Z X 1{νn(x) α}d P(1) h π0(g(0) n (α))2 + π1(g(1) n (α))2i + O 1 B1 which completes the proof. Ra SE: Random Subspace Ensemble Classification B.7 Proof of Proposition 12 (i) Due to Proposition 3, this is trivial to be seen to hold here. (ii) First let s consider subspace S = S {j}, where |S | 1, j / S . The conditional densities of j|S are denoted as f(0) j|S , f(1) j|S and the components of x S corresponding to S and {j} are x S , xj, respectively. The definition of KL divergence and Fubini theorem incur KL(f(0) S ||f(1) S ) = Ex S f(0) S f(0) S (x S ) f(1) S (x S ) f(0) j|S (xj) f(1) j|S (xj) = Ex S f(0) S f(0) S (x S ) f(1) S (x S ) + Ex S f(0) S Exj f(0) j|S log f(0) j|S (xj) f(1) j|S (xj) = KL(f(0) S ||f(1) S ) + Ex S f(0) S h KL(f(0) j|S ||f(1) j|S ) i KL(f(0) S ||f(1) S ). Here = holds if and only if f(0) j|S (xj|x S ) = f(1) j|S (xj|x S ) a.s. with respect to P(0). By induction, this indicates that for any S S and S = , there holds KL(f(0) S ||f(1) S ) KL(f(0) S ||f(1) S ). (22) Note that in (22), by Proposition 3, if f(0) j|S (xj|x S ) = f(1) j|S (xj|x S ) a.s. with respect to P(0), we have f(0) S (x S) f(1) S (x S) = f(0) S (x S ) f(1) S (x S ) , a.s. , w.r.t. P(0). (23) Similarly, we have KL(f(1) S ||f(0) S ) KL(f(1) S ||f(0) S ), (24) where the = holds if and only if f(1) S (x S) f(0) S (x S) = f(1) S (x S ) f(0) S (x S ) , a.s. , w.r.t. P(1). (25) If S S , consider S = S (S \S) S , then by (i) and (22), there holds π0KL(f(0) S ||f(1) S ) + π1KL(f(1) S ||f(0) S ) π0KL(f(0) S ||f(1) S ) + π1KL(f(1) S ||f(0) S ) (26) = π0KL(f(0) S ||f(1) S ) + π1KL(f(1) S ||f(0) S ). 
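The marginalization inequality (22), $\mathrm{KL}(f^{(0)}_{S'}\|f^{(1)}_{S'}) \le \mathrm{KL}(f^{(0)}_S\|f^{(1)}_S)$ for $S' \subseteq S$, is an instance of the data-processing inequality and can be illustrated with Gaussian closed forms. A minimal check with illustrative parameters (subspace $\{1\}$ versus $\{1, 2\}$):

```python
import numpy as np

def gaussian_kl(mu0, mu1, S0, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - d
                  + diff @ S1_inv @ diff
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Illustrative two-feature parameters for the two classes.
mu0 = np.array([0.0, 1.0])
mu1 = np.array([0.5, -1.0])
S0 = np.array([[1.0, 0.4], [0.4, 2.0]])
S1 = np.array([[1.5, -0.2], [-0.2, 1.0]])

kl_full = gaussian_kl(mu0, mu1, S0, S1)                          # subspace {1, 2}
kl_sub = gaussian_kl(mu0[:1], mu1[:1], S0[:1, :1], S1[:1, :1])   # marginal on {1}

assert kl_sub <= kl_full   # dropping a coordinate never increases KL
```

Since Gaussian marginals are obtained by selecting sub-vectors and sub-matrices, this mirrors exactly the nesting of subspaces used in the proof of Proposition 12.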
Then by Proposition 3, if the = holds in (26), due to (23) and (25), we have f(0) S (x S) f(1) S (x S) = f(0) S (x S) f(1) S (x S) = f(0) S (x S ) f(1) S (x S ) , a.s. , w.r.t. Px = π0P(0) + π1P(1), implying that S is a discriminative set but S S , which yields a contradiction. Therefore (ii) holds here. Tian and Feng (iii) holds since the full model SFull S and the KL divergence is monotone in the sense of (22) and (24). B.8 Proof of Proposition 14 Denote R0|1 S (µ(0) S , µ(1) S , ΣS,S, x S) = log f(0) S (x S) f(1) S (x S) = µ(0) S µ(1) S T Σ 1 S,S h x S 1 2(µ(0) S + µ(1) S ) i . To eliminate any confusion and better illustrate the meaning of the gradient and second- order derivative matrix, we will use R0|1 S µ(0) S , R0|1 S µ(1) S , R0|1 S ΣS,S to represent the gradient and use 2R0|1 S µ(0) S ΣS,S , 2R0|1 S µ(1) S ΣS,S , 2R0|1 S ΣS,S ΣS,S to represent the second-order derivative matrix. According to Brookes (2005) and Petersen and Pedersen (2012), with some calculation we can obtain that R0|1 S µ(0) S = Σ 1 S,S x S µ(0) S , R0|1 S µ(1) S = Σ 1 S,S x S µ(1) S , R0|1 S ΣS,S = Σ 1 S,S µ(0) S µ(1) S x S 1 2(µ(0) S + µ(1) S ) T Σ 1 S,S, 2R0|1 S µ(0) S µ(0) S = 2R0|1 S µ(1) S µ(1) S = Σ 1 S,S, 2R0|1 S ΣS,S µ(0) S = Σ 1 S,S Σ 1 S,S I|S| x S µ(0) S , 2R0|1 S ΣS,S µ(1) S = Σ 1 S,S Σ 1 S,S I|S| x S µ(1) S , 2R0|1 S ΣS,S ΣS,S = I|S| Σ 1 S,S µ(0) S µ(1) S x S 1 2(µ(0) S + µ(1) S ) T 2(µ(0) S + µ(1) S ) µ(0) S µ(1) S T Σ 1 S,S I|S| Σ 1 S,S Σ 1 S,S , where is the Kronecker product. Then let s check conditions in Assumption 3 one by one. Without loss of generality, for (iii), we only check the case that xi,S i.i.d. f(0) S . And for (iv), (v), we only check the case that x S f(0) S . (i) It s easy to see that the number of parameters in LDA model in p-dimensional space is 2 + 2p + p(p+1) 2 , therefore κ1 = 2. 
(ii) According to Assumption 4, we have KL(f(0)||f(1)) = µ(1) µ(0) T Σ 1 µ(1) µ(0) Ra SE: Random Subspace Ensemble Classification µ(1) S µ(0) S 2 2 Σ 1 S ,S 2 The similar conclusion holds for KL(f(1)||f(0)) as well. (iii) For any ( µ(0) S , µ(1) S , ΣS,S): 1 n 2R0|1 S ΣS,S ΣS,S ( µ(0) S , µ(1) S , ΣS,S, xi) 2R0|1 S ΣS,S ΣS,S ( µ(0) S , µ(1) S , ΣS,S, xi) 2 Σ 1 S,S( µ(0) S µ(1) S ) 2 2( µ(0) S + µ(1) S ) 2 Σ 1 S,S 2 2 Σ 1 S,S 3 2 µ(0) S µ(1) S 2 2( µ(0) S + µ(1) S ) 2 . When µ(0) S µ(0) S 2, µ(1) S µ(1) S 2, ΣS,S ΣS,S F ζ < m, due to (28), it follows that Σ 1 S,S Σ 1 S,S 2 1 m2 ΣS,S ΣS,S 2 1 1 m ΣS,S ΣS,S 2 1 m2 ΣS,S ΣS,S F 1 1 m ΣS,S ΣS,S F ζ m2 mζ , which leads to Σ 1 S,S 2 Σ 1 S,S 2 + Σ 1 S,S Σ 1 S,S 2 1 m ζ . In addition, we have µ(0) S µ(1) S 2 µ(0) S µ(0) S 2 + µ(1) S µ(1) S 2 + µ(0) S µ(1) S 2 D 1 2 M . Without loss of generality, consider xi,S i.i.d. f(0) S , it holds that 1 n 2( µ(0) S + µ(1) S ) 2 xi,S µ(0) S 2 + 1 2 µ(0) S µ(0) S 2 2 µ(1) S µ(1) S 2 + 1 2 µ(0) S µ(1) S 2 xi,S µ(0) S 2 + D 1 2 M . By Proposition 1 in Hsu et al. (2012), since 1 n Pn i=1(xi,S µ(0) S ) N(0, 1 nΣS,S), it follows that xi,S µ(0) S 2 > 1 n Tr(ΣS,S) + 2 r Tr(Σ2 S,S) ϵ n + 2 ΣS,S 2ϵ Tian and Feng And due to Assumption 4, we have Tr(ΣS,S) D ΣS,S max DM, Tr(Σ2 S,S) = ΣS,S 2 F D2 ΣS,S 2 max D2M2, ΣS,S 2 D ΣS,S max DM, xi,S µ(0) S 2 > 1 n DM + 2DM r ϵ Therefore when µ(0) S µ(0) S 2, µ(1) S µ(1) S 2, ΣS,S ΣS,S 2 ζ, we have 1 n 2R0|1 S ΣS,S ΣS,S ( µ(0) S , µ(1) S , ΣS,S, xi) max D 1 2 M xi,S µ(0) S 2 + D 1 2 M := VS({xi,S}n i=1), and P (VS({xi,S}n i=1) > CD) exp{ Cn}. Thus we can set κ2 = 1. (iv) Σ 1 S,S x S µ(0) S N(0, Σ 1 S,S), Σ 1 S,S x S µ(1) S N(Σ 1 S,S(µ(0) S µ(1) S ), Σ 1 S,S) when x S f(0) S . And also we have Σ 1 S,S max Σ 1 S,S 2 Σ 1 2 m 1. Then each component of R0|1 S µ(0) S and R0|1 S µ(1) S is m 1-sub Gaussian. 
On the other hand, since Σ 1 S,S x S 1 2(µ(0) S + µ(1) S ) N 1 2Σ 1 S,S(µ(0) S µ(1) S ), Σ 1 S,S and Σ 1 S,S(µ(0) S µ(1) S ) D 1 2 Σ 1 S,S 2 µ(0) S µ(1) S m 1M D 1 2 , (27) it follows that each component of R0|1 S ΣS,S is m 1 M D-sub Gaussian. Therefore κ3 = 1 (v) Notice that because of (27), we have Ex S f(0) S h Σ 1 S,S(x S µ(1) S ) i = Σ 1 S,S(µ(0) S µ(1) S ) m 1M D 1 2 , Ex S f(0) S Σ 1 S,S µ(0) S µ(1) S x S 1 2(µ(0) S + µ(1) S ) T Σ 1 S,S Ra SE: Random Subspace Ensemble Classification Σ 1 S,S µ(0) S µ(1) S µ(0) S µ(1) S T Σ 1 S,S Σ 1 S,S(µ(0) S µ(1) S ) 2 2(m 1M )2D. Therefore κ4 = 1. (vi) It s easy to see that µ(1) S µ(0) S T Σ 1 S,S h x S 1 2(µ(0) S + µ(1) S ) i N((µ(1) S µ(0) S )T Σ 1 S,S(µ(1) S µ(0) S ), (µ(1) S µ(0) S )T Σ 1 S,S(µ(1) S µ(0) S )). And there holds (µ(1) S µ(0) S )T Σ 1 S,S(µ(1) S µ(0) S ) µ(1) S µ(0) S 2 2 Σ 1 S,S 2 m 1(M )2D, which yields that log f(0) S (x S) f(1) S (x S) m 1D-sub Gaussian. So κ5 = 1 (vii) This can be easily derived from Lemma 28 and its proof. (viii) This is obvious because of Lemma 31.(iii) and Assumption 4.(iii). B.9 Proof of Theorem 15 First suppose that we have n0 observations of class 0 {x(0) i }n0 i=1 and n1 observations of class 1 {x(1) i }n1 i=1, where n0 + n1 = n. Define G(0) n0,S(θ S) = 1 f(0) S (x(0) i,S|θ S) f(1) S (x(0) i,S|θ S) , G(1) n1,S(θ S) = 1 f(1) S (x(1) i,S|θ S) f(0) S (x(1) i,S|θ S) for any θ S. Denote d RIC(S) = 2[ˆπ0G(0) n0,S(ˆθS) + ˆπ1G(1) n1,S(ˆθS)], then RICn(S) = d RIC(S) + cn deg(S). Lemma 27 If Assumptions 1-3 holds, then we have sup S:|S| D | d RIC(S) RIC(S)| > ϵ Cn ϵ Dκ1(Dκ3 + Dκ4) + p DDκ1 exp n Cn ϵ D2κ1+κ2 + p D exp Cn ϵ Proof [Proof of Lemma 27] By Taylor expansion and mean-value theorem, there exists λ (0, 1) and θS = λθS + (1 λ)ˆθS satisfying that G(0) n0,S(ˆθS) = G(0) n0,S(θS) + G(0) n0,S(θS)T (ˆθS θS) + 1 2(ˆθS θS)T 2G(0) n0,S( θ)(ˆθS θS). 
Tian and Feng Notice that G(0) n0,S(θS)T (ˆθS θS) 2 Dκ1 G(0) n0,S(θS) ˆθS θS , and when ˆθS θS 2 ζ, we also have (ˆθS θS)T 2G(0) n0,S( θ)(ˆθS θS) Dκ1 ˆθS θS 2 2G(0) n0,S( θ) max D2κ1 ˆθS θS 2 VS({x(0) i,S}n0 i=1) . Since nr Bin(n, πr), r = 0, 1, by Hoeffding s inequality, we have P(|nr nπr| > 0.5nπr) exp{ Cn}, r = 0, 1. Because of Assumption 3.(iv), given n0, each component of n0 G(0) n0,S(θS) is n0 2C3Dκ3sub Gaussian, then the tail bound in the below holds: G(0) n0,S(θS) Ex S f(0) S θSLS(x; θ) > ϵ Dκ1 exp Cn0 ϵ Then according to all conclusions above, we have P(|G(0) n0,S(ˆθS) KL(f(0) S ||f(1) S )| > ϵ) En0[P(|G(0) n0,S(ˆθS) KL(f(0) S ||f(1) S )| > ϵ||n0 nπ0| 0.5nπ0)] + P(|n0 nπ0| > 0.5nπ0) En0P G(0) n0,S(θS) KL(f(0) S ||f(1) S ) > ϵ/3 |n0 nπ0| 0.5nπ0 + G(0) n0,S(θS) Ex S f(0) S G(0) n0,S(θS) > CDκ3 |n0 nπ0| 0.5nπ0 + P CDκ1(Dκ3 + Dκ4) ˆθS θS > ϵ/3 + P 0.5CD2κ1+κ2 ˆθS θS 2 > ϵ/3 VS({x(0) i,S}n0 i=1) > CDκ2 |n0 nπ0| 0.5nπ0 + P(|n0 nπ0| > 0.5nπ0) + P(D 1 2 κ1 ˆθS θS > ζ) Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ which yields P(|ˆπ0G(0) n0,S(ˆθS) π0KL(f(0) S ||f(1) S )| > ϵ) Dκ1 exp Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ Ra SE: Random Subspace Ensemble Classification Similarly, there holds P(|ˆπ1G(1) n1,S(ˆθS) π1KL(f(1) S ||f(0) S )| > ϵ) Dκ1 exp Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ Since d RIC(S) = 2[ˆπ0G(0) n0,S(ˆθS) + ˆπ1G(1) n1,S(ˆθS)], we obtain that P | d RIC(S) RIC(S)| > ϵ Dκ1 exp Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ Due to the union bound over all p D = O(p D) possible subsets, we obtain the conclusion. Let s now prove Theorem 15. 
(i) It s easy to see that sup S:S S |S| D RICn(S) inf S:S S |S| D RICn(S) RIC(S ) + cn sup S:|S| D deg(S) + /3 inf S:S S |S| D RIC(S) /3 sup S:|S| D | d RIC(S) RIC(S)| > /3 P(cn sup S:|S| D deg(S) /3) + P sup S:|S| D | d RIC(S) RIC(S)| > /3 Cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp Cn D2κ1+κ2 Tian and Feng (ii) Similar to above, we have RICn(S ) inf S:S S |S| D RICn(S) RIC(S ) + cn deg(S ) + /3 inf S:S S |S| D RIC(S) /3 sup S:|S| D | d RIC(S) RIC(S)| > /3 Cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp Cn D2κ1+κ2 Besides, it holds RICn(S ) inf S:S S |S| D RICn(S) P (RIC(S ) + cn deg(S ) + cn/3 RIC(S ) + cn(deg(S ) + 1) cn/3) sup S:|S| D | d RIC(S) RIC(S)| > cn/3 Cn cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp n Cn cn D2κ1+κ2 + p D exp Cn cn Therefore we have P RICn(S ) = inf S:|S| D RICn(S) RICn(S ) inf S:S S |S| D RICn(S) RICn(S ) inf S:S S |S| D RICn(S) Ra SE: Random Subspace Ensemble Classification Cn cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp n Cn cn D2κ1+κ2 + p D exp Cn cn which is because cn . This completes the proof. B.10 Proof of Theorem 18 Lemma 28 For arbitrary ϵ (0, m 1), we have the following conclusions: (i) P( (ˆµ(1) ˆµ(0)) (µ(1) µ(0)) > ϵ) p exp{ Cnϵ2}, r = 0, 1; sup S1:|S1| D S2:|S2| D ˆΣS1,S2 ΣS1,S2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ sup S:|S| D ˆΣS,S ΣS,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ sup S:|S| D ˆΣ 1 S,S Σ 1 S,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ Proof [Proof of Lemma 28] For (i), because Σ max M, for any j = 1, . . . , p and r = 0, 1, ˆµ(r) j µ(r) j is a M-sub Gaussian variable. By the tail bound and union bound, we have P( ˆµ(r) µ(r) > ϵ) j=1 P(|ˆµ(r) j µ(r) j | > ϵ) p exp{ Cnϵ2}, r = 0, 1, which leads to (i). Denote Σ = (σij)p p, ˆΣ = (ˆσij)p p. To show (ii), similar to Bickel et al. (2008), we have P max i,j |ˆσij σij| > ϵ p2 exp{ Cnϵ2} + p exp{ Cnϵ}. 
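The entrywise covariance concentration just invoked (following Bickel et al., 2008) predicts that $\max_{i,j}|\hat\sigma_{ij} - \sigma_{ij}|$ shrinks at rate roughly $\sqrt{\log p / n}$. A quick Monte-Carlo sketch (illustrative dimensions and covariance, fixed seed; not a proof, just a sanity check of the rate):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
Sigma = np.eye(p) + 0.25 * np.ones((p, p))   # illustrative true covariance

def max_entry_error(n):
    """Max entrywise error of the MLE covariance from n Gaussian samples."""
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Xc = X - X.mean(axis=0)
    Sigma_hat = Xc.T @ Xc / n
    return np.abs(Sigma_hat - Sigma).max()

errs = [max_entry_error(n) for n in (100, 10_000)]
assert errs[1] < 0.1   # loose check: at n = 10,000 the max entry error is small
```

The union bound over all $\binom{p}{\le D}$ subspaces in Lemma 28 then converts this entrywise control into uniform control over all low-dimensional submatrices $\hat\Sigma_{S,S}$.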
And it yields sup S1:|S1| D S2:|S2| D ˆΣS1,S2 ΣS1,S2 > ϵ sup S1:|S1| D S2:|S2| D j S2 |ˆσij σij| > ϵ P D max i,j |ˆσij σij| > ϵ p2 exp Cn ϵ 2 + p exp n Cn ϵ Tian and Feng Since ˆΣS,S ΣS,S is symmetric, we have ˆΣS,S ΣS,S 2 ˆΣS,S ΣS,S (Bickel et al. (2008)). For (iv), firstly because the operator norm is sub-multiplicative, we have ˆΣ 1 S,S Σ 1 S,S 2 = ˆΣ 1 S,S(ˆΣS,S ΣS,S)Σ 1 S,S 2 ˆΣ 1 S,S 2 ˆΣS,S ΣS,S 2 Σ 1 S,S 2 ( ˆΣ 1 S,S Σ 1 S,S 2 + Σ 1 S,S 2) ˆΣS,S ΣS,S 2 Σ 1 S,S 2, ˆΣ 1 S,S Σ 1 S,S 2 Σ 1 S,S 2 2 ˆΣS,S ΣS,S 2 1 ˆΣS,S ΣS,S 2 Σ 1 S,S 2 1 m2 ˆΣS,S ΣS,S 2 1 1 m ˆΣS,S ΣS,S 2 . (28) Then we obtain that sup S:|S| D ˆΣ 1 S,S Σ 1 S,S 2 > ϵ P ˆΣS,S ΣS,S 2 > 1 m2 ˆΣS,S ΣS,S 2 > ϵ sup S:|S| D ˆΣS,S ΣS,S 2 > m2 Then by applying (iii), we get (iv) immediately. Lemma 29 Define δS = Σ 1 S,S(µ(1) S µ(0) S ) for any subset S. For S = S S , where S S = , we have δ S = Σ 1 S, S(µ(1) S µ(0) S ) = Σ 1 S, S µ(1) S µ(0) S µ(1) S µ(0) S Proof [Proof of Lemma 29] First, when S = SFull, we know that δ S = Σ 1 S, S(µ(1) S µ(0) S ) = Σ 1 S, S µ(1) S µ(0) S µ(1) S µ(0) S Then combined with the matrix decomposition (ΣS,S ΣS,S Σ 1 S ,S ΣS ,S) 1 Σ 1 S,SΣS,S (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1 Σ 1 S ,S ΣS ,S(ΣS,S ΣS,S Σ 1 S ,S ΣS ,S) 1 (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1 it can be noticed that (ΣS,S ΣS,S Σ 1 S ,S ΣS ,S) 1(µ(1) S µ(0) S ) = Σ 1 S,SΣS,S (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1(µ(1) S µ(0) S ). Ra SE: Random Subspace Ensemble Classification Also since there holds that (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1 = Σ 1 S ,S Σ 1 S ,S ΣS ,S( ΣS,S+ΣS,S Σ 1 S ,S ΣS ,S) 1ΣS,S Σ 1 S ,S , it can be easily verified that the * part is actually δS . Therefore the conclusion holds with S = SFull. For general S S , notice that SFull = (SFull\ S) S, with the same procedure, we can obtain that δSFull = Σ 1(µ(1) µ(0)) = 0 δ S δSFull = 0 δS we reach the conclusion. Lemma 30 For S which satisfies S S , let S = S \S, then Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S 2 γm. 
Proof [Proof of Lemma 30] Let S = S S , due to Lemma 29, it s easy to see that there holds µ(1) S µ(0) S µ(1) S µ(0) S (ΣS,S ΣS, S Σ 1 S , S Σ S ,S) 1 Σ 1 S,SΣS, S (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1 Σ 1 S , S Σ S ,S(ΣS,S ΣS, S Σ 1 S , S Σ S ,S) 1 (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1 (30) and δ S consists of several components of δS . Denote c = Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S . Since Σ 1 S, S is symmetric, combining with (29) we have (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1c = δ S , leading to (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1c 2 = δ S 2 γ, by Assumption 4.(iii). For the left-hand side: (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1c 2 c 2 (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1 2 c 2 Σ 1 S, S 2 c 2 λ 1 min(Σ S, S) Tian and Feng which leads to c 2 γm. Lemma 31 For LDA, RIC(S) = µ(1) S µ(0) S T Σ 1 S,S µ(1) S µ(0) S . It satisfies the following conclusions: (i) If S S , then RIC(S) = RIC(S ); (ii) For S = S S, we have RIC( S) = RIC(S) h Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S i T Σ S , S Σ S ,SΣ 1 S,SΣS, S h Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S i (iii) It holds inf S:|S| D S S RIC(S) RIC(S ) m3γ2. Proof [Proof of Lemma 31] (i) First let s suppose (ii) is correct. For S S , consider SFull = S S and S S = , then by sparsity condition, we have = Σ 1(µ(1) µ(0)) = Σ 1 µ(1) S µ(0) S µ(1) S µ(0) S Also by basic algebra, there holds (ΣS,S ΣS, S Σ 1 S , S Σ S ,S) 1 = Σ 1 S,S Σ 1 S,SΣS, S ( Σ S , S +Σ S ,SΣ 1 S,SΣS, S ) 1Σ S ,SΣ 1 S,S. (31) Combined with (30) and (31), it can be obtained that Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S = 0, yielding that RIC(SFull) = RIC(S). Since the same procedure can be conducted for arbitrary S S , we complete the proof. (ii) This can be directly calculated and simplified by applying (30) and (31). 
(iii) It s easy to see that λmin(Σ S , S Σ S ,SΣ 1 S,SΣS, S ) = λ 1 max((Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1) λ 1 max(Σ 1 S, S) = λmin(Σ S, S) Ra SE: Random Subspace Ensemble Classification Then by (ii) and Lemma 30, it holds that RIC( S) RIC(S) Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S 2 2 λmin(Σ S , S Σ S ,SΣ 1 S,SΣS, S ) which completes the proof. Lemma 32 If Assumption 4 holds, for ϵ smaller than some constant and n, p larger than some constants, we have sup S:|S| D | d RIC(S) RIC(S)| > ϵ p2 exp Cn ϵ Proof [Proof of Lemma 32] By Lemma 28, when ϵ < m 1, there holds that sup S:|S| D | d RIC(S) RIC(S)| > ϵ sup S:|S| D ˆµ(1) S ˆµ(0) S T ˆΣ 1 S,S ˆµ(1) S ˆµ(0) S µ(1) S µ(0) S T Σ 1 S,S µ(1) S µ(0) S > ϵ sup S:|S| D ˆµ(1) S ˆµ(0) S T ˆΣ 1 S,S Σ 1 S,S ˆµ(1) S ˆµ(0) S + ˆµ(1) S ˆµ(0) S µ(1) S + µ(0) S T Σ 1 S,S ˆµ(1) S ˆµ(0) S + ˆµ(1) S ˆµ(0) S µ(1) S + µ(0) S T Σ 1 S,S µ(1) S µ(0) S > ϵ sup S:|S| D ˆµ(1) S ˆµ(0) S 2 ˆΣ 1 S,S Σ 1 S,S 2 + (ˆµ(1) S ˆµ(0) S ) (µ(1) S µ(0) S ) 2 Σ 1 S,S 2 ˆµ(1) S ˆµ(0) S 2 + µ(1) S µ(0) S 2 (3M )2D sup S:|S| D ˆΣ 1 S,S Σ 1 S,S 2 > ϵ sup S:|S| D (ˆµ(1) S ˆµ(0) S ) (µ(1) S µ(0) S ) > M Tian and Feng sup S:|S| D (ˆµ(1) S ˆµ(0) S ) (µ(1) S µ(0) S ) 2 m 1 3 p2 exp Cn ϵ 2 + p exp n Cn ϵ o + p exp { Cn} + p exp Cn ϵ p2 exp Cn ϵ Then the following steps to prove Theorem 18 are analogous to what we did to prove Theorem 15. B.11 Proof of Theorem 20 Denote ˆΩ(r) S,S = (ˆΣ(r) S,S) 1, r = 0, 1. Then we denote T(S) = Tr h (Ω(1) S,S Ω(0) S,S)(π1Σ(1) S,S π0Σ(0) S,S) i + (π1 π0)(log |Σ(1) S,S| log |Σ(0) S,S|), D(S) = (µ(1) S µ(0) S )T h π1Ω(0) S,S + π0Ω(1) S,S i (µ(1) S µ(0) S ), RIC(S) = 2[T(S) D(S)]. And their sample versions by MLEs are ˆT(S) = Tr h (ˆΩ(1) S,S ˆΩ(0) S,S)(ˆπ1 ˆΣ(1) S,S ˆπ0 ˆΣ(0) S,S) i + (ˆπ1 ˆπ0)(log |ˆΣ(1) S,S| log |ˆΣ(0) S,S|), ˆD(S) = (ˆµ(1) S ˆµ(0) S )T h ˆπ1Ω(0) S,S + ˆπ0Ω(1) S,S i (ˆµ(1) S ˆµ(0) S ), d RIC(S) = 2[ ˆT(S) ˆD(S)]. Similar to Lemma 28, we have the following lemma holds. 
Lemma 33 For arbitrary ϵ (0, m 1), we have conclusions in the follows for r = 0, 1: (i) P( (ˆµ(1) ˆµ(0)) (µ(1) µ(0)) > ϵ) p exp{ Cnϵ2}; sup S:|S| D ˆΣ(r) S,S Σ(r) S,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ sup S:|S| D ˆΩ(r) S,S Ω(r) S,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ Also, the lemmas in the follows are useful to prove Theorem 20 as well. Lemma 34 There hold the following conditions: Ra SE: Random Subspace Ensemble Classification (i) For S = S {j}, k(r) j,S = Σ(r) j,j Σ(r) j,S(Σ(r) S,S) 1Σ(r) S,j, r = 0, 1, where j / S, we have: T( S) T(S) = h Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j i T k(0) j,S Σ(0) S,S + π1 k(1) j,S Σ(1) S,S h Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j i + 1 π0 k(1) j,S k(0) j,S π1 k(0) j,S k(1) j,S + (π1 π0) log k(1) j,S k(0) j,S And for S = S S where S S = , there holds D( S) D(S) = π1 h Σ(0) S ,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i T Σ(0) S , S Σ(0) S ,SΩ(0) S,SΣ(0) S, S h Σ(0) S ,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i + π0 h Σ(1) S ,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i T Σ(1) S , S Σ(1) S ,SΩ(1) S,SΣ(1) S, S h Σ(1) S ,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i . (ii) Further, (i) implies the monotonicity of T, D, RIC in the following sense: If S1 S2, then T(S1) T(S2), D(S1) D(S2), which leads to RIC(S1) RIC(S2). (iii) Define S l = {j : [(Σ(0)) 1µ(0) (Σ(1)) 1µ(1)]j = 0}, S q = {j : [(Σ(1)) 1 (Σ(0)) 1]ij = 0, i}, then: (a) If S S q, then T(S) = T(S q); (b) If S S , then D(S) = D(S ) = D(S l ). Proof [Proof of Lemma 34] (i) (33) is obvious due to Lemma 31. Now let s prove (32). It s easy to see that T( S) = Tr h (Ω(1) S, S Ω(0) S, S)(π1Σ(1) S, S π0Σ(0) S, S) i + (π1 π0)(log |Σ(1) S, S| log |Σ(0) S, S|) = |S| π1Tr h Ω(0) S, SΣ(1) S, S i π0Tr h Ω(1) S, SΣ(0) S, S i + (π1 π0)(log |Σ(1) S, S| log |Σ(0) S, S|). 
Ω(r) S, S = Ω(r) S,S + 1 k(r) j,S Ω(r) S,SΣ(r) S,jΣ(r) j,SΩ(r) S,S 1 k(r) j,S Ω(r) S,SΣ(r) S,j 1 k(r) j,S Σ(r) j,SΩ(r) S,S 1 k(r) j,S Tian and Feng for r = 0, 1, we have Tr Ω(1) S, SΣ(0) S, S Ω(1) S,SΣ(0) S,S + 1 k(1) j,S Ω(1) S,SΣ(1) S,jΣ(1) j,SΩ(1) S,SΣ(0) S,S 1 k(1) j,S Ω(1) S,SΣ(1) S,jΣ(0) j,S k(1) j,S Σ(1) j,SΩ(1) S,SΣ(0) S,j + Σ(0) j,j k(1) j,S = Tr[Ω(1) S,SΣ(0) S,S] + 1 k(1) j,S Σ(1) j,SΩ(1) S,SΣ(0) S,SΩ(1) S,SΣ(1) S,j 2 k(1) j,S Σ(1) j,SΩ(1) S,SΣ(0) S,j + Σ(0) j,j k(1) j,S , (36) where the last equality follows from the fact that Tr(AB) = Tr(BA) if A and B are square matrices with the same dimension. Similarly we have Tr Ω(0) S, SΣ(1) S, S = Tr[Ω(0) S,SΣ(1) S,S]+ 1 k(0) j,S Σ(0) j,SΩ(0) S,SΣ(1) S,SΩ(0) S,SΣ(0) S,j 2 k(0) j,S Σ(0) j,SΩ(0) S,SΣ(1) S,j + Σ(1) j,j k(0) j,S . (37) Combining (34), (36), (37), the fact that Σ(r) j,j = k(r) j,S + Σ(r) j,SΩ(r) S,SΣ(r) S,j, |Σ(r) S, S| = |Σ(r) S,S| |k(r) j,S|, r = 0, 1 and T(S) = |S| π1Tr Ω(0) S,SΣ(1) S,S π0Tr Ω(1) S,SΣ(0) S,S +(π1 π0)(log |Σ(1) S,S| log |Σ(0) S,S|), (32) is obtained. (ii) For the monotonicity, since π0 log k(1) j,S k(0) j,S = π0 log k(0) j,S k(1) j,S k(0) j,S k(1) j,S 1 and π1 log k(1) j,S k(0) j,S k(1) j,S k(0) j,S 1 , we have 1 + (π1 π0) log k(1) j,S k(0) j,S k(1) j,S k(0) j,S π1 k(0) j,S k(1) j,S k(0) j,S k(1) j,S 1 k(1) j,S k(0) j,S 1 k(1) j,S k(0) j,S π1 k(0) j,S k(1) j,S implying that T( S) T(S). And it s easy to see that Σ(r) S , S Σ(r) S ,SΩ(r) S,SΣ(r) S, S is positive-definite, thus D( S) D(S). Ra SE: Random Subspace Ensemble Classification And we also have RIC( S) RIC(S). By induction we will obtain the monotonicity. (iii) Denote Ω(r) = (Σ(r)) 1, r = 0, 1. Consider full feature space SFull S . Let s remove one feature which does not belong to S q from SFull. Again, without loss of generality (in fact, we can always switch this feature with the last one), suppose we are removing the last feature j. That is, SFull = S {j}. 
By sparsity: Ω(1) Ω(0) = 0 0T 0 Plugging (35) into above and by simplification we can obtain that k(0) j,S = k(1) j,S, (39) Ω(0) S,SΣ(0) S,j = Ω(1) S,SΣ(1) S,j, (40) By Lemma 34, T(SFull) = T(S). Besides, this also implies that Ω(1) Ω(0) = Ω(1) S,S Ω(0) S,S 0 0T 0 By induction, it can be seen that for all subspace S S q, T(S) = T(S q). Then again consider SFull = S {j} but j / S , therefore by sparsity, in addition to (38), we also have Ω(1)µ(1) Ω(0)µ(0) = 0 which together with (38) leads to Σ(0) j,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ) = 0, Σ(1) j,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ) = 0. By Lemma 34, D(SFull) = D(S). And it holds that Ω(1)µ(1) Ω(0)µ(0) = Ω(1) S,Sµ(1) S Ω(0) S,Sµ(0) S 0 Again by induction, it can be seen that for any subspace S S , D(S) = D(S ). Lemma 35 It holds that inf j S q inf S:S \S={j} |S| D+p T(S) T(S ) m2 4M γq min mγq, 1 Tian and Feng Proof [Proof of Lemma 35] Suppose S \S = {j}, S = S {j} S and j S q. Due to Lemma 34, equation (35) and Assumption 5, we have either (k(1) j,S) 1 (k(0) j,S) 1 γq or (k(1) j,S) 1Ω(1) S,SΣ(1) S,j (k(0) j,S) 1Ω(0) S,SΣ(0) S,j γq holds. And it s easy to notice that for r = 0, 1, (k(r) j,S) 1 Ω(r) S, S 2 = λ 1 min(Σ(r) S, S) λ 1 min(Σ(r)) m 1, k(r) j,S = Σ(r) j,j Σ(r) j,SΩ(r) S,SΣ(r) S,j Σ(r) j,j M, which implies that m (k(r) j,S) M, r = 0, 1. (i) If (k(1) j,S) 1 (k(0) j,S) 1 γq, then we have k(0) j,S k(1) j,S 1 = k(0) j,S (k(1) j,S) 1 (k(0) j,S) 1 mγq, leading to k(0) j,S k(1) j,S 1 log k(0) j,S k(1) j,S mγq log(1 + mγq). When mγq 1, we know that mγq log(1 + mγq) 1 When mγq > 1, it holds that mγq log(1 + mγq) 1 Thus, we have k(0) j,S k(1) j,S 1 log k(0) j,S k(1) j,S 6 min{mγq, (mγq)2}. And the same result holds for k(1) j,S k(0) j,S as well. Therefore by (32), we have T(S) T( S) = T(S) T(S ) k(0) j,S k(1) j,S 1 log k(0) j,S k(1) j,S k(1) j,S k(0) j,S 1 log k(1) j,S k(0) j,S 6 min{mγq, (mγq)2}. 
Since this holds for arbitrary j S q and S satisfying S \S = {j}, we obtain that inf j S q inf S:S \S={j} |S| D+p T(S) T(S ) 1 6 min{mγq, (mγq)2}. (41) Ra SE: Random Subspace Ensemble Classification (ii) If (k(1) j,S) 1Ω(1) S,SΣ(1) S,j (k(0) j,S) 1Ω(0) S,SΣ(0) S,j γq, when (k(1) j,S) 1 (k(0) j,S) 1 m 2M γq, similar to (i), we have T(S) T( S) 1 γq (k(1) j,S) 1Ω(1) S,SΣ(1) S,j (k(0) j,S) 1Ω(0) S,SΣ(0) S,j (k(1) j,S) 1 (k(0) j,S) 1 Ω(1) S,SΣ(1) S,j + (k(0) j,S) 1 Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j . By (35), (k(1) j,S) 1Ω(1) S,SΣ(1) S,j is part of Ω(1) S, S, then M 1 Ω(1) S,SΣ(1) S,j (k(1) j,S) 1Ω(1) S,SΣ(1) S,j Ω(1) S, S max Ω(1) S, S yielding Ω(1) S,SΣ(1) S,j M Then we have Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j 2 Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j 1 T(S) T( S) Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j 2 2 λmin π0(k(1) j,S) 1Σ(0) S,S + π1(k(0) j,S) 1Σ(1) S,S 2 π0M 1λmin(Σ(0) S,S) + π1M 1λmin(Σ(1) S,S) m3γ2 q 4M . (43) Combining (41), (42) and (43), we complete the proof. Lemma 36 It holds that inf j S l \S q inf S:S \S={j} |S| D+p [D(S ) D(S)] m3γ2 l . Proof For any j S l \S q and S satisfying S \S = {j}, let S = S {j}, it s easy to see that there holds µ(1) S µ(1) j µ(0) S µ(0) j Tian and Feng Denote c = Σ(1) j,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ). By (39), (40), we can obtain Σ(0) j,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ) = Σ(1) j,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ). Combining with (44), we have m 1|c| (k(1) j,S) 1|c| = |δ j| γl, because δ j is the corresponding component of δS with respect to feature j. We obtain that |c| mγl, which leads to D(S ) D(S) = D( S) D(S) = k(1) j,S|c|2 m3γ2 l . 
Lemma 37 It holds that inf S:S S |S| D RIC(S) RIC(S ) min m2 4M γq min mγq, 1 Proof [Proof of Lemma 37] Because of Lemmas 35 and 36, it suffices to prove inf S:S S |S| D RIC(S) RIC(S ) inf j S inf S:S \S={j} |S| D+p RIC(S) RIC(S ) inf j S q inf S:S \S={j} |S| D+p [T(S) T(S )], inf j S l \S q inf S:j / S |S| D+p [D(S ) D(S)] For any subset S which does not cover S and |S| D, consider feature j S \S and S = S S \{j} S. It s easy to notice that | S| D + p . By the monotonicity proved in Lemma 34, we know that RIC( S) RIC(S). In addition, we know that RIC( S) inf j S inf S:S \S={j} |S| D+p RIC(S), which yields the first inequality. For the second inequality, it directly comes from Lemma 34.(iii). Lemma 38 If Assumption 5 holds, for ϵ smaller than some constant and n, p larger than some constants, we have sup S:|S| D | d RIC(S) RIC(S)| > ϵ p2 exp Cn ϵ Ra SE: Random Subspace Ensemble Classification Proof [Proof of Lemma 38] Recall that RIC(S) = 2[T(S) D(S)], d RIC(S) = 2[ ˆT(S) ˆD(S)]. Denote T1(S) = Tr h Ω(1) S,S Ω(0) S,S π1Σ(1) S,S π0Σ(0) S,S i , T2(S) = (π1 π0)(log |Σ(1) S,S| log |Σ(0) S,S|), then T(S) = T1(S) + T2(S). And since for any S with |S| D, we have ˆT1(S) T1(S) = Tr h ˆΩ(1) S,S ˆΩ(0) S,S ˆπ1 ˆΣ(1) S,S ˆπ0 ˆΣ(0) S,S Ω(1) S,S Ω(0) S,S π1Σ(1) S,S π0Σ(0) S,S i D ˆΩ(1) S,S ˆΩ(0) S,S ˆπ1 ˆΣ(1) S,S ˆπ0 ˆΣ(0) S,S Ω(1) S,S Ω(0) S,S π1Σ(1) S,S π0Σ(0) S,S 2 D Ω(1) S,S Ω(0) S,S 2 h |ˆπ1 π1| Σ(1) S,S 2 + Σ(0) S,S 2 + ˆΣ(1) S,S Σ(1) S,S 2 + ˆΣ(0) S,S Σ(0) S,S 2 i + D π1Σ(1) S,S π0Σ(0) S,S 2 ˆΩ(1) S,S Ω(1) S,S 2 + ˆΩ(0) S,S Ω(0) S,S 2 2Dm 1 h |ˆπ1 π1| 2M + ˆΣ(1) S,S Σ(1) S,S 2 + ˆΣ(0) S,S Σ(0) S,S 2 + DM ˆΩ(1) S,S Ω(1) S,S 2 + ˆΩ(0) S,S Ω(0) S,S 2 where the last inequality comes from Σ(r) S,S 2 M and Ω(1) S,S Ω(0) S,S 2 2m 1. 
And this yields sup S:|S| D ˆT1(S) T1(S) > ϵ P 4Dm 1M|ˆπ1 π1| > ϵ 2Dm 1 sup S:|S| D ˆΣ(1) S,S Σ(1) S,S 2 > ϵ 2Dm 1 sup S:|S| D ˆΣ(0) S,S Σ(0) S,S 2 > ϵ DM sup S:|S| D ˆΩ(1) S,S Ω(1) S,S 2 > ϵ DM sup S:|S| D ˆΩ(0) S,S Ω(0) S,S 2 > ϵ p2 exp Cn ϵ On the other hand, we have log |ˆΣ(1) S,S| λi(ˆΣ(1) S,S) λi(Σ(1) S,S) λi(ˆΣ(1) S,S) λi(Σ(1) S,S) 1 λi(ˆΣ(1) S,S) λi(Σ(1) S,S) 1 By Weyl s inequality, λi(ˆΣ(1) S,S) λi(Σ(1) S,S) λi(Σ(1) S,S) ˆΣ(1) S,S Σ(1) S,S 2 . Tian and Feng Therefore there exists ϵ > 0 such that when ˆΣ(1) S,S Σ(1) S,S 2 ϵ , we have |ˆΣ(1) S,S| ˆΣ(1) S,S Σ(1) S,S 2 + o D ˆΣ(1) S,S Σ(1) S,S 2 ˆΣ(1) S,S Σ(1) S,S 2 . Therefore, there holds sup S:|S| D ˆT2(S) T2(S) > ϵ sup S:|S| D 2 |ˆπ1 π1| log |Σ(1) S,S| + log |Σ(0) S,S| + log |ˆΣ(1) S,S| |ˆΣ(0) S,S| 2 |ˆπ1 π1| sup S:|S| D h log λi Σ(1) S,S + log λi Σ(1) S,S i > ϵ sup S:|S| D log |ˆΣ(1) S,S| sup S:|S| D log |ˆΣ(0) S,S| P 2D max{| log M|, | log m|} |ˆπ1 π1| > ϵ m sup S:|S| D ˆΣ(1) S,S Σ(1) S,S 2 > ϵ m sup S:|S| D ˆΣ(0) S,S Σ(0) S,S 2 > ϵ sup S:|S| D ˆΣ(1) S,S Σ(1) S,S 2 > ϵ ! sup S:|S| D ˆΣ(0) S,S Σ(0) S,S 2 > ϵ ! p2 exp Cn ϵ Thus we have sup S:|S| D ˆT(S) T(S) > ϵ p2 exp Cn ϵ Besides, following the same strategy in the proof of Lemma 32, it can be shown that sup S:|S| D ˆD(S) D(S) > ϵ p2 exp Cn ϵ By (45) and (46), we complete the proof. By all the above lemmas, and following similar idea in the proof of Theorem 15, we can prove the consistency stated in the theorem. Ra SE: Random Subspace Ensemble Classification B.12 Proof of Theorem 21 Firstly since taking subspace can also be seen as an axis-aligned projection, applying Theorem 2 in Cannings and Samworth (2017) we have E[R(CRa SE n ) R(CBayes)] E[R(CS1 n ) R(CBayes)] min(α, 1 α) . (47) Then it can be easily noticed that E{E[R(CS1 n ) R(CBayes)]} E{E[(R(CS1 n ) R(CBayes))1(S1 S )]} + (1 R(CBayes))P(S1 S ) E sup S:S S |S| D R(CS n ) R(CBayes) + P(S1 S ). (48) Combining (47) and (48), we obtain the conclusion. 
B.13 Proof of Theorem 22
This conclusion is almost the same as Theorem 2 of Cannings and Samworth (2017). However, since we are studying the discrete space of subspaces, and discriminative sets are the ideal subspaces here, we can drop Assumptions 2 and 3 of their paper and obtain a similar upper bound. First, we have
\[
\mathbb{E}\bigl[R(C^{\mathrm{RaSE}}_n)\bigr]-R(C^{\mathrm{Bayes}})\ \le\ \frac{\mathbb{E}\bigl[R(C^{S^*_1}_n)\bigr]-R(C^{\mathrm{Bayes}})}{\min(\alpha,1-\alpha)} \tag{49}
\]
by Cannings and Samworth (2017). Then write
\[
\mathbb{E}\bigl[R(C^{S^*_1}_n)\bigr]=\mathbb{E}\bigl[R_n(C^{S^*_1}_n)\bigr]+\epsilon_n, \tag{50}
\]
where $\epsilon_n=\mathbb{E}[R(C^{S^*_1}_n)]-\mathbb{E}[R_n(C^{S^*_1}_n)]$. Then we have
\begin{align*}
\mathbb{E}\bigl[R_n(C^{S^*_1}_n)\bigr]
&\le\sup R_n(C^{S}_n)+\mathbb{E}\Bigl[R_n(C^{S^*_1}_n)\,\mathbf{1}\Bigl(R_n(C^{S^*_1}_n)>\sup R_n(C^{S}_n)\Bigr)\Bigr]\\
&\le\sup R_n(C^{S}_n)+\mathbb{P}\Bigl(R_n(C^{S^*_1}_n)>\sup R_n(C^{S}_n)\Bigr)\\
&=\sup R_n(C^{S}_n)+\Bigl[\mathbb{P}\Bigl(R_n(C^{S_{11}}_n)>\sup R_n(C^{S}_n)\Bigr)\Bigr]^{B_2}\\
&\le\sup R_n(C^{S}_n)+(1-p_{S^*})^{B_2}\\
&\le\sup R(C^{S}_n)+\sup|\epsilon^{S}_n|+(1-p_{S^*})^{B_2}, \tag{51}
\end{align*}
where $\epsilon^{S}_n=R(C^{S}_n)-R_n(C^{S}_n)$ and all suprema are taken over $\{S:\,S\supseteq S^*,\ |S|\le D\}$. Combining (49), (50) and (51), we obtain
\[
\mathbb{E}\bigl\{\mathbb{E}\bigl[R(C^{\mathrm{RaSE}}_n)-R(C^{\mathrm{Bayes}})\bigr]\bigr\}
\ \le\ \frac{\mathbb{E}\bigl[\sup\bigl(R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr)\bigr]+\mathbb{E}\bigl[\sup|\epsilon^{S}_n|\bigr]+\mathbb{E}(\epsilon_n)+(1-p_{S^*})^{B_2}}{\min(\alpha,1-\alpha)}.
\]

B.14 Proof of Proposition 23
First we prove the following lemma.

Lemma 39 (Devroye et al., 2013) For a non-negative random variable $z$ satisfying $\mathbb{P}(z>t)\le C_1\exp\{-C_2nt^{\alpha}\}$ for any $t>0$, where $C_1>1$, $C_2>0$ and $\alpha\ge 1$ are three fixed constants, we have
\[
\mathbb{E}z\ \le\ \Bigl(\frac{\log C_1+1}{C_2n}\Bigr)^{1/\alpha}.
\]
Proof [Proof of Lemma 39] It is easy to see that
\[
\mathbb{E}z^{\alpha}=\int_0^{\infty}\mathbb{P}(z^{\alpha}>t)\,dt\ \le\ \epsilon+C_1\int_{\epsilon}^{\infty}\exp\{-C_2nt\}\,dt=\epsilon+\frac{C_1}{C_2n}\exp\{-C_2n\epsilon\},
\]
hence
\[
\mathbb{E}z^{\alpha}\ \le\ \inf_{\epsilon>0}\Bigl[\epsilon+\frac{C_1}{C_2n}\exp\{-C_2n\epsilon\}\Bigr]=\frac{\log C_1+1}{C_2n}.
\]
Then by Jensen's inequality, it holds that
\[
\mathbb{E}z\ \le\ (\mathbb{E}z^{\alpha})^{1/\alpha}\ \le\ \Bigl(\frac{\log C_1+1}{C_2n}\Bigr)^{1/\alpha},
\]
which completes the proof.

Now let us prove the proposition. Notice that
\[
\Delta_S^2=\bigl|\delta_S^{T}\Sigma_{S,S}\delta_S\bigr|\ \ge\ \|\delta_S\|_2^2\,\lambda_{\min}(\Sigma_{S,S})\ \ge\ mp^*\gamma^2
\]
for any $S\supseteq S^*$. In addition, by the mean value theorem and (10), it follows that
\[
R(C^{S,\mathrm{LDA}}_n)-R(C^{\mathrm{Bayes}})\ \lesssim\ \bigl|\hat\Delta_S-\Delta_S\bigr|+\bigl|\hat\tau_S-\tau_S\bigr|.
\]
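As an aside (a sketch we add here, not part of the original argument), the bound of Lemma 39 can be sanity-checked numerically. For $\alpha=1$, the extremal tail $\mathbb{P}(z>t)=\min(1,C_1e^{-C_2nt})$ attains the bound with equality, since $\mathbb{E}z=\int_0^{\infty}\mathbb{P}(z>t)\,dt$. The function names below are ours.

```python
from math import exp, log

def tail(t, C1, C2, n):
    # Extremal survival function allowed by the assumption of Lemma 39 (alpha = 1).
    return min(1.0, C1 * exp(-C2 * n * t))

def expected_value(C1, C2, n, steps=200000, upper=1.0):
    # E z = integral of P(z > t) dt over [0, infinity), truncated at `upper`
    # (the tail is negligible beyond it here) and computed by the trapezoidal rule.
    h = upper / steps
    interior = sum(tail(i * h, C1, C2, n) for i in range(1, steps))
    return h * (interior + 0.5 * (tail(0.0, C1, C2, n) + tail(upper, C1, C2, n)))

C1, C2, n = 5.0, 2.0, 50
# Numerical E z versus the Lemma 39 bound (log C1 + 1) / (C2 n):
print(expected_value(C1, C2, n), (log(C1) + 1) / (C2 * n))
```

For this extremal tail the two printed values agree, illustrating that the constant in Lemma 39 cannot be improved in general.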
By Lemma 32, we know that
\[
\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\hat\Delta_S^2-\Delta_S^2\bigr|>\epsilon\Bigr)\ \lesssim\ p^2\exp\{-Cn\epsilon^2\}.
\]
Since $\hat\Delta_S+\Delta_S\ge\Delta_S\ge\sqrt{mp^*}\,\gamma$ for $S\supseteq S^*$, we have $|\hat\Delta_S-\Delta_S|\le|\hat\Delta_S^2-\Delta_S^2|/(\sqrt{mp^*}\,\gamma)$, and a chain of union bounds over the events on $\sup_{S}|\hat\Delta_S^2-\Delta_S^2|$, $|\hat\pi_0-\pi_0|$ and $|\hat\pi_1-\pi_1|$ yields
\[
\mathbb{P}\Bigl(\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S,\mathrm{LDA}}_n)-R(C^{\mathrm{Bayes}})\bigr]>\epsilon\Bigr)
\ \lesssim\ p^2\exp\Bigl\{-\frac{Cn}{D^2}\min\bigl\{\epsilon(p^*)^{3/2}\gamma^3,\ \epsilon(p^*)^{1/2}\gamma\bigr\}^2\Bigr\},
\]
which completes the proof by applying Lemma 39.

B.15 Proof of Proposition 24
Denote $\delta_S=\Omega^{(1)}_{S,S}\mu^{(1)}_S-\Omega^{(0)}_{S,S}\mu^{(0)}_S$, $\bar\Omega_{S,S}=\Omega^{(1)}_{S,S}-\Omega^{(0)}_{S,S}$, and
\[
d^*_S(x_S)=\log\frac{\pi_1}{\pi_0}-\frac{1}{2}x_S^{T}\bar\Omega_{S,S}x_S+\delta_S^{T}x_S-\frac{1}{2}(\mu^{(1)}_S)^{T}\Omega^{(1)}_{S,S}\mu^{(1)}_S+\frac{1}{2}(\mu^{(0)}_S)^{T}\Omega^{(0)}_{S,S}\mu^{(0)}_S.
\]
Their estimators are correspondingly denoted as $\hat\delta_S=\hat\Omega^{(1)}_{S,S}\hat\mu^{(1)}_S-\hat\Omega^{(0)}_{S,S}\hat\mu^{(0)}_S$, $\hat{\bar\Omega}_{S,S}=\hat\Omega^{(1)}_{S,S}-\hat\Omega^{(0)}_{S,S}$, and
\[
\hat d_S(x_S)=\log\frac{\hat\pi_1}{\hat\pi_0}-\frac{1}{2}x_S^{T}\hat{\bar\Omega}_{S,S}x_S+\hat\delta_S^{T}x_S-\frac{1}{2}(\hat\mu^{(1)}_S)^{T}\hat\Omega^{(1)}_{S,S}\hat\mu^{(1)}_S+\frac{1}{2}(\hat\mu^{(0)}_S)^{T}\hat\Omega^{(0)}_{S,S}\hat\mu^{(0)}_S.
\]
From the proof of Theorem 20, we know that $d^*_S(x_S)=d^*_{S^*}(x_{S^*})$ for any $S\supseteq S^*$. Denote the training data by $\mathcal{D}_{tr}$; then
\begin{align*}
R(C^{S}_n)&=\pi_0\,\mathbb{P}^{(0)}\bigl(\hat d_S(x_S)>0\,\big|\,\mathcal{D}_{tr}\bigr)+\pi_1\,\mathbb{P}^{(1)}\bigl(\hat d_S(x_S)\le 0\,\big|\,\mathcal{D}_{tr}\bigr),\\
R(C^{\mathrm{Bayes}})&=\pi_0\,\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)+\pi_1\,\mathbb{P}^{(1)}\bigl(d^*_{S^*}(x_{S^*})\le 0\bigr).
\end{align*}
Then we have
\begin{align*}
&\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\Bigl[\mathbb{P}^{(0)}\bigl(\hat d_S(x_S)>0\,\big|\,\mathcal{D}_{tr}\bigr)-\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)\Bigr]\\
&\qquad\le\sup\Bigl[\mathbb{P}^{(0)}\bigl(d^*_S(x_S)>d^*_S(x_S)-\hat d_S(x_S)\,\big|\,\mathcal{D}_{tr}\bigr)-\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)\Bigr]\\
&\qquad\le\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>-\epsilon\bigr)-\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)+\sup\,\mathbb{P}^{(0)}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\bigr)\\
&\qquad\le\int_{-\epsilon}^{0}h^{(0)}(z)\,dz+\sup\,\mathbb{P}^{(0)}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\bigr)\\
&\qquad\le\epsilon u_c+\sup\,\mathbb{P}^{(0)}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\bigr), \tag{52}
\end{align*}
for any $\epsilon\in(0,c)$, where the suprema are over $\{S:\,S\supseteq S^*,\ |S|\le D\}$. Denote
\[
a_S=(\Sigma^{(0)}_{S,S})^{-\frac{1}{2}}\Bigl[\Omega^{(1)}_{S,S}\bigl(\mu^{(1)}_S-\hat\mu^{(1)}_S\bigr)+\bigl(\Omega^{(1)}_{S,S}-\hat\Omega^{(1)}_{S,S}\bigr)\bigl(\hat\mu^{(1)}_S-\hat\mu^{(0)}_S\bigr)+\hat\Omega^{(0)}_{S,S}\bigl(\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr)\Bigr].
\]
Notice that
\begin{align*}
d^*_S(x_S)-\hat d_S(x_S)
&=\log\frac{\pi_1}{\pi_0}-\log\frac{\hat\pi_1}{\hat\pi_0}-\frac{1}{2}(x_S-\mu^{(0)}_S)^{T}\bigl(\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr)(x_S-\mu^{(0)}_S)+a_S^{T}(\Sigma^{(0)}_{S,S})^{-\frac{1}{2}}(x_S-\mu^{(0)}_S)\\
&\qquad+\frac{1}{2}(\mu^{(1)}_S-\mu^{(0)}_S)^{T}\Omega^{(1)}_{S,S}\bigl(\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr)-\frac{1}{2}(\hat\mu^{(0)}_S-\mu^{(0)}_S)^{T}\Omega^{(0)}_{S,S}\bigl(\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr).
\end{align*}
Further, denote $z_S=(\Sigma^{(0)}_{S,S})^{-\frac{1}{2}}(x_S-\mu^{(0)}_S)\sim N(0_{|S|},I_{|S|})$. Then it follows that
\begin{align*}
\bigl|d^*_S(x_S)-\hat d_S(x_S)\bigr|
&\le C|\hat\pi_1-\pi_1|+\frac{1}{2}\bigl\|\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr\|_2\bigl\|\Sigma^{(0)}_{S,S}\bigr\|_2\|z_S\|_2^2+\bigl|a_S^{T}z_S\bigr|\\
&\qquad+\frac{1}{2}\bigl\|\mu^{(1)}_S-\mu^{(0)}_S\bigr\|_2\bigl\|\Omega^{(1)}_{S,S}\bigr\|_2\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_2+\frac{1}{2}\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_2^2\bigl\|\Omega^{(0)}_{S,S}\bigr\|_2\\
&\le C|\hat\pi_1-\pi_1|+\frac{M}{2}\bigl\|\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr\|_2\|z_S\|_2^2+\bigl|a_S^{T}z_S\bigr|
+M\sqrt{D}m^{-1}\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_{\infty}+Dm^{-1}\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_{\infty}^2,
\end{align*}
and
\[
\|a_S\|_2\ \lesssim\ \bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_2+\bigl\|\Omega^{(1)}_{S,S}-\hat\Omega^{(1)}_{S,S}\bigr\|_2\bigl\|\hat\mu^{(1)}_S-\hat\mu^{(0)}_S\bigr\|_2+\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_2
\ \lesssim\ D^{\frac{1}{2}}\bigl(\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_{\infty}+\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_{\infty}\bigr)+\bigl\|\Omega^{(1)}_{S,S}-\hat\Omega^{(1)}_{S,S}\bigr\|_2\bigl\|\hat\mu^{(1)}_S-\hat\mu^{(0)}_S\bigr\|_2.
\]
For any $t,t'>0$, denote the event
\[
B=\Bigl\{C|\hat\pi_1-\pi_1|\le\epsilon/4,\ \sup_{S:|S|\le D}\bigl\|\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr\|_2\le t,\ M\sqrt{D}m^{-1}\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_{\infty}\le\epsilon/8,\ Dm^{-1}\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_{\infty}^2\le\epsilon/8,\ \|a_S\|_2\le t'\Bigr\}.
\]
When $\|a_S\|_2\le t'$, $a_S^{T}z_S$ is $t'$-sub-Gaussian. This yields that
\[
\mathbb{P}\Bigl(\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl|d^*_S(x_S)-\hat d_S(x_S)\bigr|>\epsilon\ \Big|\ \mathcal{D}_{tr}\in B\Bigr)
\ \le\ \mathbb{P}\bigl(t\|z_S\|_2^2>\epsilon/4\bigr)+\mathbb{P}\bigl(|a_S^{T}z_S|>\epsilon/4\,\big|\,\|a_S\|_2\le t'\bigr),
\]
where $\epsilon$ satisfies $C\epsilon/t>D$ and the first term is controlled by the tail bound of the $\chi^2$ distribution,
\[
\mathbb{P}\bigl(\|z_S\|_2^2>C\epsilon/t\bigr)=\mathbb{P}\bigl(\|z_S\|_2^2>D+(C\epsilon/t-D)\bigr).
\]
Taking the expectation over the training data in (52), we have
\[
\mathbb{E}\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr]
\ \le\ \epsilon u_c+\mathbb{E}\Bigl[\sup\,\mathbb{P}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\in B\bigr)\Bigr]+\mathbb{P}(B^c). \tag{53}
\]
Take
\[
\epsilon=\Bigl(\frac{\log p}{n}\Bigr)^{\frac{1-\varpi}{2}},\qquad t=\Bigl(\frac{D\log p}{n}\Bigr)^{\frac{1}{2}},\qquad t'=D^{\frac{3}{2}}\Bigl(\frac{\log p}{n}\Bigr)^{\frac{1}{2}}
\]
for an arbitrary $\varpi\in(0,1/2)$ and plug them into (53); we obtain that
\[
\mathbb{E}\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr]
\ \lesssim\ \Bigl(\frac{\log p}{n}\Bigr)^{\frac{1-\varpi}{2}}+p^2\exp\bigl\{-Cn^{\varpi}(\log p)^{1-\varpi}\bigr\}+p\exp\bigl\{-CDn^{2\varpi}(\log p)^{1-\varpi}\bigr\},
\]
because $\log p\lesssim n^{\varpi_0}$ for some $\varpi_0\in(0,1)$. Therefore, due to (54), we have
\[
\mathbb{E}\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr]\ \lesssim\ \Bigl(\frac{D^2\log p}{n}\Bigr)^{\frac{1-\varpi}{2}}.
\]

B.16 Proof of Theorem 25
We first prove the following helpful lemma.

Lemma 40 For $H\sim\mathrm{Hypergeometric}(p,p^*,d)$, if $\min(d,p^*)\,p^*d/p=o(1)$ and $B_2=o\bigl((p/(p^*d))^{t+1}\bigr)$, where $t$ is a positive integer no larger than $d$, then $\bigl(\Pr(H\le t)\bigr)^{B_2}\to 1$ as $p\to\infty$.
Proof [Proof of Lemma 40] The cumulative distribution function of $H$ satisfies
\[
\Pr(H\le t)=1-\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\;{}_3F_2\!\left[\begin{matrix}1,\ t+1-p^*,\ t+1-d\\ t+2,\ p+t+2-p^*-d\end{matrix};1\right],
\]
where ${}_3F_2$ is the generalized hypergeometric function (Abadir, 1999). And we have
\[
{}_3F_2\!\left[\begin{matrix}1,\ t+1-p^*,\ t+1-d\\ t+2,\ p+t+2-p^*-d\end{matrix};1\right]
=\sum_{n=0}^{\min(d-t,\,p^*-t)}\frac{(1)_n\,(t+1-p^*)_n\,(t+1-d)_n}{(t+2)_n\,(p+t+2-p^*-d)_n}\cdot\frac{1}{n!},
\]
so that
\[
\Bigl|{}_3F_2[\,\cdots]-1\Bigr|
\ \le\ \sum_{n=1}^{\min(d-t,\,p^*-t)}\frac{(p^*-t-1)\cdots(p^*-t-n)\,(d-t-1)\cdots(d-t-n)}{(t+2)_n\,(p+t+2-p^*-d)\cdots(p+t+1-p^*-d+n)}
\ \le\ \min(d,p^*)\,O\Bigl(\frac{p^*d}{p}\Bigr),
\]
where $(a)_n:=a(a+1)\cdots(a+n-1)$ and $(a)_0:=1$ for any real number $a$ and positive integer $n$. On the other hand, we can also see that ${}_3F_2[\,\cdots]\ge 1$.

For convenience, denote $F={}_3F_2[\,\cdots]$; then by Taylor expansion, it holds that
\[
\bigl(\Pr(H\le t)\bigr)^{B_2}=\exp\Bigl\{B_2\log\Bigl(1-\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\,F\Bigr)\Bigr\}.
\]
In addition, we have
\[
\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}
\ \le\ \frac{d^{t+1}\,p^*(p^*-1)\cdots(p^*-t)}{(t+1)!\,(p-p^*+t+1)\cdots(p-p^*+1)}
\ \le\ \Bigl(\frac{p^*d}{p-p^*}\Bigr)^{t+1}\frac{1}{(t+1)!}.
\]
Since $p^*d/p=o(1)$ and $B_2=o\bigl((p/(p^*d))^{t+1}\bigr)$, we get
\[
B_2\,\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\,F=o(1),
\quad\text{leading to}\quad
B_2\log\Bigl(1-\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\,F\Bigr)=o(1),
\]
which implies the conclusion.

Next let us prove the original theorem. Without loss of generality, assume there is a positive constant $C$ such that at the first step of Algorithm 2, $B_2$ satisfies
\[
B_2\ \ge\ C\Bigl(\frac{p}{D-\tilde p^*+1}\Bigr)^{\tilde p^*},\qquad B_2=o\Bigl(\bigl(p/(p^*D)\bigr)^{\tilde p^*+1}\Bigr),
\]
and at the following steps it satisfies
\[
B_2\ \ge\ C\,\frac{(D+C_0)^{D+1}\,p^{\tilde p^*}(\log p)^{\tilde p^*}}{C_0^{D}\,(D-\tilde p^*+1)}.
\]
Notice that condition (i) in Assumption 6 implies that
\[
\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>M_n\nu(n,p,D)\Bigr)\to 0.
\]
We will prove the original theorem in three steps. Denote $\eta^{(t)}_j=\mathbb{P}(j\in S^{(t)*}_1)$, $\hat\eta^{(t)}_j=\frac{1}{B_1}\sum_{i=1}^{B_1}\mathbf{1}(j\in S^{(t)*}_i)$, $\tilde S^{(t)}=\{j:\hat\eta^{(t)}_j>C_0/\log p\}$, $\tilde S^{(t)}_*=\tilde S^{(t)}\cap S^*$, and $\tilde p^*_t=|\tilde S^{(t)}_*|$.

(i) Step 1: When $t=0$, due to the stepwise detectable condition, there exists a subset $\tilde S^{(0)*}\subseteq S^*$ with cardinality $\tilde p^*$ that satisfies the conditions.
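The coverage phenomenon that Lemma 40 formalizes can be checked numerically: for a single random subspace of size $d$, the probability of capturing at least $t$ of the $p^*$ signals is a hypergeometric tail, and unless $B_2$ grows polynomially in $p$, no subspace among the $B_2$ draws covers the signals. The following is a minimal sketch we add for illustration; the function names are ours.

```python
from math import comb

def tail_ge(p, p_star, d, t):
    """P(H >= t) for H ~ Hypergeometric(p, p_star, d): the number of
    signal features captured when d of p features are drawn uniformly."""
    return sum(
        comb(p_star, k) * comb(p - p_star, d - k)
        for k in range(t, min(d, p_star) + 1)
    ) / comb(p, d)

def coverage_prob(p, p_star, d, t, B2):
    """P(at least one of B2 independent subspaces captures >= t signals)."""
    return 1 - (1 - tail_ge(p, p_star, d, t)) ** B2

# With p_star, d, t fixed, the chance that any of B2 = 500 subspaces
# covers all 3 signals collapses as p grows:
for p in (100, 400, 1600):
    print(p, coverage_prob(p, p_star=3, d=10, t=3, B2=500))
```

This is exactly why the iterative version of the algorithm, which reuses the selection frequencies of earlier rounds, needs far fewer subspaces than the vanilla scheme.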
On the other hand,
\[
\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(0)}_{1k}\supseteq\tilde S^{(0)*}\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}\bigl(S^{(0)}_{11}\supseteq\tilde S^{(0)*}\bigr)\Bigr]^{B_2}
\ \ge\ 1-\Bigl(1-\Bigl(\frac{D-\tilde p^*+1}{p}\Bigr)^{\tilde p^*}\Bigr)^{B_2}
\ \ge\ 1-\exp\Bigl\{-B_2\Bigl(\frac{D-\tilde p^*+1}{p}\Bigr)^{\tilde p^*}\Bigr\}\ \ge\ 1-e^{-C}.
\]
Then we have
\begin{align*}
\mathbb{P}\bigl(S^{(0)*}_1\supseteq\tilde S^{(0)*}\bigr)
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(0)}_{1k}\supseteq\tilde S^{(0)*}\bigr\},\ \inf_{\substack{S:\,|S\cap S^*|\le\tilde p^*,\,|S|\le D\\ S\cap S^*\ne\tilde S^{(0)*}}}\mathrm{Cr}_n(S)-\sup_{\substack{S:\,S\supseteq\tilde S^{(0)*}\\|S|\le D}}\mathrm{Cr}_n(S)>0,\ \bigcap_{k=1}^{B_2}\bigl\{|S^{(0)}_{1k}\cap S^*|\le\tilde p^*\bigr\}\Bigr)\\
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(0)}_{1k}\supseteq\tilde S^{(0)*}\bigr\}\Bigr)
-\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>\tfrac{1}{2}M_n\nu(n,p,D)\Bigr)
-\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(0)}_{1k}\cap S^*|\ge\tilde p^*+1\bigr\}\Bigr) \tag{55}
\end{align*}
as $n$ is sufficiently large. The last inequality holds because there exists a variable $H\sim\mathrm{Hypergeometric}(p,p^*,D)$ such that
\[
\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(0)}_{1k}\cap S^*|\ge\tilde p^*+1\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\bigr)\Bigr]^{B_2}
\ \le\ 1-\bigl(\Pr(H\le\tilde p^*)\bigr)^{B_2}\ \to\ 0
\]
due to Lemma 40, where we used
\[
\mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\bigr)
=\sum_{d\le D}\mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\,\big|\,|S^{(0)}_{11}|=d\bigr)\mathbb{P}\bigl(|S^{(0)}_{11}|=d\bigr)
\ \le\ \mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\,\big|\,|S^{(0)}_{11}|=D\bigr)=\Pr(H\ge\tilde p^*+1).
\]
Then for any $j\in\tilde S^{(0)*}$, for sufficiently large $n$ it holds that
\[
\eta^{(0)}_j=\mathbb{P}\bigl(j\in S^{(0)*}_1\bigr)\ \ge\ \mathbb{P}\bigl(S^{(0)*}_1\supseteq\tilde S^{(0)*}\bigr)\ \ge\ 1-\tfrac{3}{2}e^{-C}.
\]
And by Hoeffding's inequality (Petrov, 2012),
\[
\mathbb{P}\bigl(\hat\eta^{(0)}_j\le 1-2e^{-C}\bigr)\ \le\ \mathbb{P}\bigl(|\hat\eta^{(0)}_j-\eta^{(0)}_j|\ge\tfrac{1}{2}e^{-C}\bigr)\ \le\ 2\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
Following by union bounds, we can obtain
\[
\mathbb{P}\Bigl(\bigcap_{j\in\tilde S^{(0)*}}\bigl\{\hat\eta^{(0)}_j>1-2e^{-C}\bigr\}\Bigr)\ \ge\ 1-2\tilde p^*\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]

(ii) Step 2: When $t=1$, let us first condition on some specific $\tilde S^{(0)}$, defined before as $\{j:\hat\eta^{(0)}_j>C_0/\log p\}$, satisfying $|\tilde S^{(0)}_*|=\tilde p^*$. Later we will take the expectation to get the inequality without conditioning. To simplify and distinguish the notation, we omit the conditioning mentioned above and denote the corresponding conditional probabilities by $\mathbb{P}_c$. For any specific $\tilde S^{(0)}$, due to the stepwise detectable condition again, there exists a subset $\tilde S^{(1)*}\subseteq S^*$ with cardinality $\tilde p^*$ that satisfies the conditions. Similar to step 1, we have
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}_c\bigl(S^{(1)}_{11}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr)\Bigr]^{B_2}, \tag{56}
\]
and
\begin{align*}
\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr)
&=\sum_{d\le D}\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\,\big|\,|S^{(1)}_{1k}|=d\bigr)\,\mathbb{P}\bigl(|S^{(1)}_{1k}|=d\bigr)\\
&\ge\ \mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(1)*}\,\big|\,|S^{(1)}_{1k}|=d,\ S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)\\
&\qquad\times\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\,\big|\,|S^{(1)}_{1k}|=d\bigr). \tag{57}
\end{align*}
For convenience, let us consider a sequence $(j_1,\ldots,j_d)$ sampled from $\{1,\ldots,p\}$ without replacement with sampling weights $(\hat\eta^{(0)}_1,\ldots,\hat\eta^{(0)}_p)$, in this order. We use $\mathbb{P}_c(j_1,\ldots,j_d\,|\,|S^{(1)}_{1k}|=d)$ to represent the corresponding probability and use $\mathbb{P}_c(j_i\,|\,j_1,\ldots,j_{i-1},|S^{(1)}_{1k}|=d)$ to represent the conditional probability of sampling $j_i$ given $j_1,\ldots,j_{i-1}$. Based on the notation defined above, there holds
\begin{align*}
&\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\,\big|\,|S^{(1)}_{1k}|=d\bigr)\\
&\qquad=\sum_{\substack{(j_1,\ldots,j_d):\,\{j_1,\ldots,j_d\}\supseteq\tilde S^{(0)*}\\ \{j_1,\ldots,j_d\}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset}}\mathbb{P}_c\bigl(j_1\,\big|\,|S^{(1)}_{1k}|=d\bigr)\,\mathbb{P}_c\bigl(j_2\,\big|\,j_1,|S^{(1)}_{1k}|=d\bigr)\cdots\mathbb{P}_c\bigl(j_d\,\big|\,j_1,\ldots,j_{d-1},|S^{(1)}_{1k}|=d\bigr). \tag{58}
\end{align*}
Since for $j_i\in\tilde S^{(0)*}$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(1)}_{1k}|=d\bigr)\ \ge\ \mathbb{P}_c\bigl(j_i\,\big|\,|S^{(1)}_{1k}|=d\bigr)\ \ge\ \frac{\hat\eta^{(0)}_{j_i}}{\sum_{j\in\tilde S^{(0)}}\hat\eta^{(0)}_j+\sum_{j\notin\tilde S^{(0)}}C_0/p}\ \ge\ \frac{C_0}{(D+C_0)\log p},
\]
and for $j_i\in\{1,\ldots,p\}\backslash\tilde S^{(0)}$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(1)}_{1k}|=d\bigr)\ \ge\ \frac{C_0/p}{\sum_{j\in\tilde S^{(0)}}\hat\eta^{(0)}_j+\sum_{j\notin\tilde S^{(0)}}C_0/p}\ \ge\ \frac{C_0}{(D+C_0)p},
\]
by plugging these inequalities into (58), it yields that when $d\ge\tilde p^*$ and $n$ is sufficiently large,
\[
\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\,\big|\,|S^{(1)}_{1k}|=d\bigr)
\ \ge\ d(d-1)\cdots(d-\tilde p^*+1)\,\frac{C_0^{d}}{(D+C_0)^{d}(\log p)^{\tilde p^*}}
\ \ge\ \frac{1}{(\log p)^{\tilde p^*}}\Bigl(\frac{C_0}{D+C_0}\Bigr)^{D}. \tag{59}
\]
In the second-to-last inequality, we used the fact that $|\tilde S^{(0)}|\le D\log p/C_0$. On the other hand, given $|S^{(1)}_{1k}|=d$, $S^{(1)}_{1k}\supseteq\tilde S^{(0)*}$ and $S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset$, the indicators of whether the remaining features are sampled or not follow the restricted multinomial distribution with parameters $(p-|\tilde S^{(0)}|,\,d-\tilde p^*,\,1)$, which means that the remaining variables have the same sampling weights. Then for $d\ge 2\tilde p^*$,
\[
\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(1)*}\,\big|\,|S^{(1)}_{1k}|=d,\ S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)
\ \ge\ \Bigl(\frac{\max(d-2\tilde p^*+1,1)}{p-|\tilde S^{(0)}|}\Bigr)^{\tilde p^*},
\]
which combined with (56), (57) and (59) leads to
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr\}\Bigr)\ \ge\ 1-e^{-C},
\]
since $B_2\ge C\,(D+C_0)^{D+1}p^{\tilde p^*}(\log p)^{\tilde p^*}/\bigl(C_0^{D}(D-\tilde p^*+1)\bigr)$. Furthermore,
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(1)}_{1k}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr)\Bigr]^{B_2},
\]
and
\begin{align*}
&\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr)\\
&\quad=\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)\,\mathbb{P}_c\bigl(S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)\\
&\qquad+\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})\ne\emptyset\bigr)\,\mathbb{P}_c\bigl(S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})\ne\emptyset\bigr)\\
&\quad\le\sum_{d\le D}\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset,\ |S^{(1)}_{11}|=d\bigr)\,\mathbb{P}\bigl(|S^{(1)}_{11}|=d\bigr)\\
&\quad\le\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset,\ |S^{(1)}_{11}|=D\bigr).
\end{align*}
Similar to step 1, because of Lemma 40, it follows that
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(1)}_{1k}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr\}\Bigr)
\ \le\ 1-\Bigl[1-\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset,\ |S^{(1)}_{11}|=D\bigr)\Bigr]^{B_2}\ \to\ 0
\]
uniformly for any $\tilde S^{(0)}$. Then, similar to (55), we have for $j\in\tilde S^{(0)*}\cup\tilde S^{(1)*}$,
\[
\eta^{(1)}_j=\mathbb{P}_c\bigl(j\in S^{(1)*}_1\bigr)\ \ge\ \mathbb{P}_c\bigl(S^{(1)*}_1\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr)\ \ge\ 1-\tfrac{3}{2}e^{-C},
\]
and by Hoeffding's inequality (Petrov, 2012),
\[
\mathbb{P}\bigl(\hat\eta^{(1)}_j\le 1-2e^{-C}\,\big|\,\tilde S^{(0)}\bigr)\ \le\ \mathbb{P}\bigl(|\hat\eta^{(1)}_j-\eta^{(1)}_j|\ge\tfrac{1}{2}e^{-C}\,\big|\,\tilde S^{(0)}\bigr)\ \le\ 2\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
By union bounds, it holds that
\[
\mathbb{P}\Bigl(\bigcap_{j\in\tilde S^{(0)*}\cup\tilde S^{(1)*}}\bigl\{\hat\eta^{(1)}_j>1-2e^{-C}\bigr\}\,\Big|\,\tilde S^{(0)}\Bigr)\ \ge\ 1-4\tilde p^*\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
Since the above conclusions hold for any $\tilde S^{(0)}$ with $|\tilde S^{(0)}_*|=\tilde p^*$, we can conclude that
\[
\mathbb{P}\Bigl(\sum_{j=1}^{p}\mathbf{1}\bigl(\hat\eta^{(1)}_j\ge 1-2e^{-C}\bigr)\ \ge\ 2\tilde p^*\Bigr)\ \ge\ 1-4\tilde p^*\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
After the second iteration step, with probability at least $1-4\tilde p^*\exp\{-\frac{1}{2}B_1e^{-2C}\}$, there will be at least $2\tilde p^*$ features of $S^*$ covered in $\tilde S^{(1)}$ and having $\hat\eta_j>1-2e^{-C}$. Similarly, after the $t$-th iteration step, there will be at least $(t+1)\tilde p^*$ features of $S^*$ covered in $\tilde S^{(t)}$ and having $\hat\eta_j>1-2e^{-C}$.

(iii) Step 3: By step 2, after at most $t'=\lceil p^*/\tilde p^*\rceil-1$ iterations, $\tilde S^{(t')}$ will cover $S^*$.
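The weighted sampling scheme used throughout steps 2 and 3, drawing features one at a time without replacement with probability proportional to the selection frequencies $\hat\eta_j$ of the previous iteration, can be sketched as follows. This is our illustrative code under simplified assumptions, not the RaSEn package implementation, and the names are ours.

```python
import random

def sample_subspace(eta_hat, d, rng):
    """Draw a subspace of d distinct features, picking features one at a
    time with probability proportional to the current weights eta_hat
    (selection frequencies from the previous RaSE iteration).
    Assumes d is at most the number of strictly positive weights."""
    weights = list(eta_hat)
    chosen = []
    for _ in range(d):
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if w > 0 and r <= acc:
                chosen.append(j)
                weights[j] = 0.0  # without replacement
                break
    return chosen

# Features with large previous-round frequencies dominate the new subspaces;
# the floor weight 0.01 mimics the minimum weight kept for unselected features.
rng = random.Random(7)
eta = [0.9, 0.8, 0.7] + [0.01] * 97
counts = [0] * 100
for _ in range(2000):
    for j in sample_subspace(eta, 5, rng):
        counts[j] += 1
print(sorted(range(100), key=lambda j: -counts[j])[:3])
```

The point of the sketch is the contrast with step 1: once the weights concentrate on $S^*$, each subspace covers the signals with constant probability, so $B_2$ no longer needs to grow polynomially in $p$.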
Without loss of generality, assume the smallest $t'$ satisfying $\tilde S^{(t')}\supseteq S^*$ equals $\lceil p^*/\tilde p^*\rceil$. When the iteration number $t\ge\lceil p^*/\tilde p^*\rceil$, with high probability we have
\[
\bigcap_{j\in S^*}\bigl\{\hat\eta^{(t-1)}_j\ge 1-2e^{-C}\bigr\}.
\]
Using the same notation defined at the beginning of step 2 (notice that here we only need to condition on $\bigcap_{j\in S^*}\{\hat\eta^{(t-1)}_j\ge 1-2e^{-C}\}$), we have
\[
\mathbb{P}_c\bigl(S^{(t)}_{1k}\supseteq S^*\bigr)
\ \ge\ \sum_{(j_1,\ldots,j_d):\,\{j_1,\ldots,j_d\}\supseteq S^*}\mathbb{P}_c\bigl(j_1\,\big|\,|S^{(t)}_{1k}|=d\bigr)\cdots\mathbb{P}_c\bigl(j_d\,\big|\,j_1,\ldots,j_{d-1},|S^{(t)}_{1k}|=d\bigr).
\]
Similar to step 2, for $j_i\in S^*$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(t)}_{1k}|=d\bigr)\ \ge\ \frac{\hat\eta^{(t-1)}_{j_i}}{\sum_{j\in S^*}\hat\eta^{(t-1)}_j+\sum_{j\notin S^*}C_0/p}\ \ge\ \frac{1-2e^{-C}}{D+C_0},
\]
and for $j_i\in\{1,\ldots,p\}\backslash S^*$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(t)}_{1k}|=d\bigr)\ \ge\ \frac{C_0/p}{\sum_{j\in S^*}\hat\eta^{(t-1)}_j+\sum_{j\notin S^*}C_0/p}\ \ge\ \frac{C_0}{(D+C_0)p}.
\]
Thus, we have
\[
\mathbb{P}_c\bigl(S^{(t)}_{1k}\supseteq S^*\bigr)\ \ge\ \frac{(D-p^*+1)\,(1-2e^{-C})^{p^*}\,C_0^{D-p^*}}{D\,(D+C_0)^{D-1}},
\]
and hence
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(t)}_{1k}\supseteq S^*\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}_c\bigl(S^{(t)}_{11}\supseteq S^*\bigr)\Bigr]^{B_2}
\ \ge\ 1-\exp\Bigl\{-B_2\,\frac{(D-p^*+1)(1-2e^{-C})^{p^*}C_0^{D-p^*}}{D(D+C_0)^{D-1}}\Bigr\}\ \to\ 1,
\]
given the lower bound on $B_2$. Then, similar to (55), there holds
\begin{align*}
\mathbb{P}\bigl(S^{(t)*}_1\supseteq S^*\bigr)
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(t)}_{1k}\supseteq S^*\bigr\},\ \inf_{\substack{S:\,S\not\supseteq S^*\\|S|\le D}}\mathrm{Cr}_n(S)-\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\mathrm{Cr}_n(S)>M_n\nu(n,p,D)\Bigr)\\
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(t)}_{1k}\supseteq S^*\bigr\}\Bigr)-\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>\tfrac{1}{2}M_n\nu(n,p,D)\Bigr).
\end{align*}
Combining all the conclusions above, as $n,B_1,B_2\to\infty$, we get
\[
\mathbb{P}\bigl(S^{(t)*}_1\not\supseteq S^*\bigr)\ \le\ \exp\Bigl\{-B_2\,\frac{(D-p^*+1)(1-2e^{-C})^{p^*}C_0^{D-p^*}}{D(D+C_0)^{D-1}}\Bigr\}
+\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>\tfrac{1}{2}M_n\nu(n,p,D)\Bigr)+o(1)\ \to\ 0.
\]
For $t>\lceil p^*/\tilde p^*\rceil$, the same conclusion can be obtained by following the same procedure, which completes our proof.

References

K. M. Abadir. An introduction to hypergeometric functions for economists. Econometric Reviews, 18(3):287-330, 1999.

M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55. US Government Printing Office, 1948.

H. Ahn, H. Moon, M. J. Fazzari, N. Lim, J. J. Chen, and R. L. Kodell.
Classification by ensembles from random partitions of high-dimensional data. Computational Statistics & Data Analysis, 51(12):6166-6179, 2007.

H. Akaike. Information theory and an extension of the maximum likelihood principle. Proceedings of the IEEE International Symposium on Information Theory, 1973.

S. D. Bay. Combining nearest neighbor classifiers through multiple feature subsets. In ICML, volume 98, pages 37-45. Citeseer, 1998.

T. B. Berrett and R. J. Samworth. Efficient two-sample functional estimation and the super-oracle phenomenon. arXiv preprint arXiv:1904.09347, 2019.

P. J. Bickel, E. Levina, et al. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989-1010, 2004.

P. J. Bickel, E. Levina, et al. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577-2604, 2008.

R. Blaser and P. Fryzlewicz. Random rotation ensembles. The Journal of Machine Learning Research, 17(1):126-151, 2016.

R. Blaser and P. Fryzlewicz. Regularizing axis-aligned ensembles via data rotations that favor simpler learners, 2019.

T. Boot and D. Nibbering. Subspace methods. In Macroeconomic Forecasting in the Era of Big Data, pages 267-291. Springer, 2020.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

L. Breiman. Pasting small votes for classification in large databases and on-line. Machine Learning, 36(1-2):85-103, 1999.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

M. Brookes. The matrix reference manual. Imperial College London, 3, 2005.

R. Bryll, R. Gutierrez-Osuna, and F. Quek. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition, 36(6):1291-1302, 2003.

P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

K. P. Burnham and D. R. Anderson. Practical use of the information-theoretic approach. In Model Selection and Inference, pages 75-117. Springer, 1998.

T. I. Cannings and R. J. Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959-1035, 2017.

J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759-771, 2008.

J. Chen and Z. Chen. Extended BIC for small-n-large-p sparse GLM. Statistica Sinica, pages 555-574, 2012.

L. Devroye and T. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202-207, 1979.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

D. Dua and C. Graff. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA, 2019.

R. J. Durrant and A. Kabán. Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions. Machine Learning, 99(2):257-286, 2015.

B. Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352):892-898, 1975.

C.-G. Esseen. A moment inequality with an application to the central limit theorem. Scandinavian Actuarial Journal, 1956(2):160-170, 1956.

J. Fan, Y. Feng, and X. Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745-771, 2012.

J. Fan, Y. Feng, J. Jiang, and X. Tong. Feature augmentation via nonparametrics and selection (FANS) in high-dimensional classification. Journal of the American Statistical Association, 111(513):275-287, 2016.

Y. Fan and C. Y. Tang. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):531-552, 2013.

Y. Fan, Y. Kong, D. Li, Z. Zheng, et al. Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3):1243-1272, 2015.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.

E. Fix. Discriminatory Analysis: Nonparametric Discrimination, Consistency Properties. USAF School of Aviation Medicine, 1951.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37. Springer, 1995.

S. Ganguly, J. Ryu, Y.-H. Kim, Y.-K. Noh, and D. D. Lee. Nearest neighbor density functional estimation based on inverse Laplace transform. arXiv preprint arXiv:1805.08342, 2018.

N. García-Pedrajas and D. Ortiz-Boyer. Boosting random subspace method. Neural Networks, 21(9):1344-1362, 2008.

I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552, 2005.

N. Hao, Y. Feng, and H. H. Zhang. Model selection for high-dimensional quadratic regression via regularization. Journal of the American Statistical Association, 113(522):615-625, 2018.

H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.

C. Higuera, K. J. Gardiner, and K. J. Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 10(6), 2015.

T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832-844, 1998.

D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.

B. Jiang, X. Wang, and C. Leng. A direct approach for sparse quadratic discriminant analysis. The Journal of Machine Learning Research, 19(1):1098-1134, 2018.

R. Kohavi, G. H. John, et al. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.

S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

Q. Li and J. Shao. Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, pages 457-473, 2015.

Y. Li and J. S. Liu. Robust variable and interaction selection for logistic regression and general index models. Journal of the American Statistical Association, 114(525):271-286, 2019.

M. E. Lopes. Estimating a sharp convergence bound for randomized ensembles. Journal of Statistical Planning and Inference, 204:35-44, 2020.

M. E. Lopes et al. Estimating the algorithmic variance of randomized ensembles via the bootstrap. The Annals of Statistics, 47(2):1088-1112, 2019.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultrahigh dimensions. Biometrika, 99(1):29-42, 2012.

Q. Mai, Y. Yang, and H. Zou. Multiclass sparse discriminant analysis. arXiv preprint arXiv:1504.05845, 2015.

G. J. McLachlan. Mahalanobis distance. Resonance, 4(6):20-26, 1999.

M. Mukhopadhyay and D. B. Dunson. Targeted random projection for prediction from high-dimensional features. Journal of the American Statistical Association, pages 1-13, 2019.

K. B. Petersen and M. S. Pedersen. The matrix cookbook, Nov 2012. URL http://www2.imm.dtu.dk/pubdb/p.php, 3274:14, 2012.

V. V. Petrov. Sums of Independent Random Variables, volume 82. Springer Science & Business Media, 2012.

R. Rao and Y. Wu. A strongly consistent procedure for model selection in a regression problem. Biometrika, 76(2):369-374, 1989.

L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1-39, 2010.

J. Shao, Y. Wang, X. Deng, S. Wang, et al. Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39(2):1241-1265, 2011.

M. Skurichina and R. P. Duin. Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis & Applications, 5(2):121-135, 2002.

R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science, pages 104-117, 2003.

A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

H. Wang. Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512-1524, 2009.

Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392-2405, 2009.

Q. Zhang and H. Wang. On BIC's selection consistency for discriminant analysis. Statistica Sinica, pages 731-740, 2011.