# RaSE: Random Subspace Ensemble Classification

Journal of Machine Learning Research 22 (2021) 1-93. Submitted 6/20; Revised 1/21; Published 1/21.

Ye Tian (ye.t@columbia.edu), Department of Statistics, Columbia University, New York, NY 10027, USA

Yang Feng (yang.feng@nyu.edu), Department of Biostatistics, School of Global Public Health, New York University, New York, NY 10003, USA

Editor: Jie Peng

**Abstract**

We propose a flexible ensemble classification framework, Random Subspace Ensemble (RaSE), for sparse classification. In the RaSE algorithm, we aggregate many weak learners, where each weak learner is a base classifier trained in a subspace optimally selected from a collection of random subspaces. To conduct subspace selection, we propose a new criterion, the ratio information criterion (RIC), based on weighted Kullback-Leibler divergence. The theoretical analysis covers the risk and Monte-Carlo variance of the RaSE classifier, establishes the screening consistency and weak consistency of RIC, and provides an upper bound for the misclassification rate of the RaSE classifier. In addition, we show that in a high-dimensional framework, the number of random subspaces needs to be very large to guarantee that a subspace covering the signals is selected. Therefore, we propose an iterative version of the RaSE algorithm and prove that, under some specific conditions, a smaller number of generated random subspaces is needed to find a desirable subspace through iteration. An array of simulations under various models and real-data applications demonstrate the effectiveness and robustness of the RaSE classifier and its iterative version in terms of low misclassification rate and accurate feature ranking. The RaSE algorithm is implemented in the R package RaSEn on CRAN.

Keywords: Random Subspace Method, Ensemble Classification, Sparsity, Information Criterion, Consistency, Feature Ranking, High-Dimensional Data
## 1. Introduction

Ensemble classification is a very popular framework for carrying out classification tasks; it typically combines the results of many weak learners to form the final classification. It aims at improving both the accuracy and the stability of weak classifiers and usually leads to better performance than any individual weak classifier (Rokach, 2010). Two prominent examples of ensemble classification are bagging (Breiman, 1996, 1999) and random forests (Breiman, 2001), which focus on decision trees and improve performance by bootstrapping the training data and/or randomly selecting the splitting dimensions in different trees, respectively. Boosting is another example, which converts a weak learner that performs only slightly better than random guessing into a strong learner achieving arbitrarily high accuracy (Freund and Schapire, 1995).

©2021 Ye Tian and Yang Feng. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-600.html.

Recently, several new ensemble ideas have appeared. Blaser and Fryzlewicz (2016) aggregate decision trees with random rotations to obtain an ensemble classifier: the feature space is randomly rotated each time before a decision tree is fitted. Random rotations make it possible for the tree-based classifier to arrive at oblique solutions, increasing the flexibility of the decision trees. As a variant, to make the ensembles favor simple base learners, Blaser and Fryzlewicz (2019) proposed a regularized random rotation algorithm. Another popular framework for ensemble classifiers is via random projection. Durrant and Kabán (2015) studied a random projection ensemble classifier with linear discriminant analysis (LDA) and developed its theoretical properties. Furthermore, Cannings and Samworth (2017) proposed a very general framework for random projection ensembles. Each weak learner first randomly projects the original feature space into a low-dimensional subspace, and then the base classifier is trained in that new space. The choice of the base classifier is flexible, and it was shown that the random projection ensemble classifier performs competitively and has desirable theoretical properties. There are two key aspects of their framework. One is that, since naïvely aggregating all projections might lead to poor performance, they first select good projections and aggregate only those. The other is that they tune the decision threshold instead of applying the naïve majority vote. These two ideas will also appear in the framework we propose. To make the random projections include more important features in the linear combinations, Mukhopadhyay and Dunson (2019) proposed a targeted random projection ensemble approach, which includes each variable with probability proportional to its marginal utility.

Another example of ensemble classification is the random subspace method, which was first studied in the context of decision trees (Ho, 1998). As the name suggests, it randomly selects a feature subset and grows each tree within the chosen subspace. A similar idea is used in random forests when the splitting of each tree is restricted to a subset of features. The random subspace method is closely related to other aggregation-based approaches, including the bootstrap procedure for features (Boot and Nibbering, 2020). Also, as Cannings and Samworth (2017) pointed out, the random subspace method can be regarded as the random projection ensemble classification method with the projection space restricted to be axis-aligned. Compared with other ensemble approaches, the random subspace method preserves the data structure by sticking to the original features, which can be helpful for interpretation and provides a direct way to perform feature ranking.
It has been coupled with various base classifiers, including linear discriminant analysis (Skurichina and Duin, 2002) and the k-nearest neighbor classifier (Bay, 1998), and combined with other techniques such as boosting (García-Pedrajas and Ortiz-Boyer, 2008). A related approach is random partition (Ahn et al., 2007), where the whole space is partitioned into multiple parts. Bryll et al. (2003) introduced optimization ideas into the random subspace framework, selecting optimal subspaces by evaluating how the corresponding fitted models perform on the training data. Despite these developments, most existing works lack theoretical support, and to the best of our knowledge, research on the link between random subspaces and feature ranking is scarce. Furthermore, the existing literature usually considers the ensemble of all generated random subspaces, which may not be wise in the sparse classification regime, since many random subspaces will contain no signals. Our new ensemble framework on random subspaces is designed to tackle sparse classification problems with theoretical guarantees. Instead of naïvely aggregating all generated random subspaces, we divide them into groups and keep only the optimally performing subspace within each group to construct the ensemble classifier.

Feature ranking and selection are of critical importance in many real-world applications. For example, in disease diagnosis, beyond obtaining an accurate prediction for patients, we are also interested in understanding how each feature contributes to the prediction, which can facilitate the advancement of medical science. It has been widely acknowledged that in many high-dimensional classification problems, only a handful of features are useful, with the rest being irrelevant. This is sometimes referred to as the sparse classification problem, which we briefly review next. Bickel et al. (2004) showed that linear discriminant analysis (LDA) is equivalent to random guessing in the worst-case scenario when the sample size is smaller than the dimensionality. Exploiting the underlying sparsity plays a significant role in improving the performance of classic methods, including LDA and quadratic discriminant analysis (QDA) (Mai et al., 2012; Jiang et al., 2018; Hao et al., 2018; Fan et al., 2012; Shao et al., 2011; Fan et al., 2015; Li and Shao, 2015). While those methods work well under their corresponding models, it is not clear how to conduct feature ranking with other types of base classifiers.

In this work, we propose a flexible ensemble classification framework, named Random Subspace Ensemble (RaSE), which can be combined with any base classifier and provides feature ranking as a by-product. The main mechanism of RaSE is briefly described as follows. Suppose the observation pair $(x, y)$ takes values in $\mathcal{X} \times \{0, 1\}$, where $\mathcal{X}$ is an open subset of $\mathbb{R}^p$, $p$ is a positive integer, and $y$ is the class label. Assume the training set consists of $n$ observation pairs $\{(x_i, y_i), i = 1, \ldots, n\}$. We use $C_n^{S,T}(x) \in \{0, 1\}$ to represent the prediction of the classifier trained with only the features in subset $S$ when the base classifier is $T$. For the $j$-th ($j \in \{1, \ldots, B_1\}$) weak learner, $B_2$ random subspaces $\{S_{jk}\}_{k=1}^{B_2}$ are generated and the optimal one, $S_{j*}$, is selected according to some criterion to be specified. Then this weak learner is trained using only the slice of the training samples in the subspace $S_{j*}$. Finally, the $B_1$ weak classifiers $C_n^{S_{1*},T}, \ldots, C_n^{S_{B_1*},T}$ are aggregated to form the decision function

$$C_n^{RaSE}(x) = 1\left(\frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S_{j*},T}(x) > \alpha\right),$$

where $\alpha$ is a threshold to be determined. This framework contributes to the research on ensemble methods and feature ranking in the following aspects.
First, it admits a flexible framework in which any classification algorithm can serve as the base classifier. Examples include standard LDA, QDA, the k-nearest neighbor classifier (kNN), support vector machines (SVM), and decision trees. Second, the ensemble process naturally implies a ranking of variable importance via the frequencies with which the variables appear in the $B_1$ subspaces $\{S_{j*}\}_{j=1}^{B_1}$. For several specific sparse classification problems, equipped with a new information criterion named the ratio information criterion (RIC), RaSE is shown to cover the minimal discriminative set (to be defined later) for each weak learner with high probability when $B_2$ and the sample size $n$ are sufficiently large.

Although the RaSE framework shares some similarities with the random projection (RP) framework of Cannings and Samworth (2017), there are several essential differences between them. First, the key workhorse behind RaSE is the search for a desirable subspace that covers the signals, which makes it amenable to feature ranking and selection. The RP framework, on the other hand, is not naturally designed for feature ranking. Second, a key condition (Assumption 2 in Cannings and Samworth (2017)) needed to guarantee the success of RP implies that, under the criterion for choosing the optimal random projection, each random projection does not deviate much from the optimal one with a non-zero probability that is independent of $n$ and $p$; this may not be satisfied in the high-dimensional setting. RaSE, however, assumes a set of conditions that explicitly take the high dimensionality into account, which leads to screening consistency and weak consistency. Third, we propose a new information criterion, RIC, and analyze its theoretical properties in the high-dimensional setting. Fourth, motivated by the stringent requirement of a large $B_2$ for the vanilla RaSE (see Sections 3.2 and 3.5), we propose the iterative RaSE, which relaxes the requirement on $B_2$ by taking advantage of the feature ranking from preceding steps.

The rest of this paper is organized as follows. In Section 2, we introduce the RaSE algorithm and discuss some important concepts, including the minimal discriminative set and RIC. At the end of Section 2, an iterative version of the RaSE algorithm is presented. In Section 3, theoretical properties of RaSE and RIC are investigated, including the impact of $B_1$ on the risk and Monte-Carlo variance of the RaSE classifier, the screening consistency and weak consistency of RIC, the upper bound on the expected misclassification rate, and the theoretical analysis of the iterative RaSE algorithm. In Section 4, we discuss several important computational issues in the RaSE algorithm, tuning parameter selection, and how to apply the RaSE framework for feature ranking. Section 5 focuses on numerical studies, with extensive simulations and several real-data applications through which we compare RaSE with various competing methods. The results frequently place RaSE classifiers among the top-ranked methods and also show their effectiveness in feature ranking. Finally, we summarize our contributions and point out a few potential directions for future work in Section 6. We present some additional results for the empirical studies in Appendix A, and all proofs are relegated to Appendix B.

## 2. Methodology

Recall that we have $n$ pairs of observations $\{(x_i, y_i), i = 1, \ldots, n\} \overset{\text{i.i.d.}}{\sim} (x, y) \in \mathcal{X} \times \{0, 1\}$, where $\mathcal{X}$ is an open subset of $\mathbb{R}^p$, $p$ is a positive integer, and $y \in \{0, 1\}$ is the class label. We use $S_{\text{Full}} = \{1, \ldots, p\}$ to represent the whole feature set. We assume the marginal densities of $x$ for class 0 ($y = 0$) and class 1 ($y = 1$) exist and denote them by $f^{(0)}$ and $f^{(1)}$, respectively.
The corresponding probability measures they induce are denoted by $P^{(0)}$ and $P^{(1)}$. Thus, the joint distribution of $(x, y)$ can be described by the following mixture model:

$$x \mid y = y_0 \sim (1 - y_0)f^{(0)} + y_0 f^{(1)}, \quad y_0 = 0, 1, \tag{1}$$

where $y$ is a Bernoulli variable with success probability $\pi_1 = 1 - \pi_0 \in (0, 1)$. For any subspace $S$, we use $|S|$ to denote its cardinality. Denote by $P_x$ the probability measure induced by the marginal distribution of $x$, which is in fact $\pi_0 P^{(0)} + \pi_1 P^{(1)}$. When restricting to the feature subspace $S$, the corresponding marginal densities of classes 0 and 1 are denoted by $f^{(0)}_S$ and $f^{(1)}_S$, respectively.

### 2.1 Random Subspace Ensemble Classification (RaSE)

Motivated by Cannings and Samworth (2017), to train each weak learner (say the $j$-th one), $B_2$ independent random subspaces $S_{j1}, \ldots, S_{jB_2}$ are generated. Then, according to some specific criterion (to be introduced in Section 2.3), the optimal subspace $S_{j*}$ is selected and the $j$-th weak learner is trained only in $S_{j*}$. In this way, $B_1$ weak classifiers $\{C_n^{S_{j*},T}\}_{j=1}^{B_1}$ are obtained. Finally, we aggregate the outputs of $\{C_n^{S_{j*},T}\}_{j=1}^{B_1}$ to form the final decision function by taking a simple average. The whole procedure is summarized in the following algorithm.

**Algorithm 1:** Random subspace ensemble classification (RaSE)

Input: training data $\{(x_i, y_i)\}_{i=1}^n$, new data $x$, subspace distribution $\mathcal{D}$, criterion $\mathcal{C}$, integers $B_1$ and $B_2$, type of base classifier $T$.
Output: predicted label $C_n^{RaSE}(x)$, the selected proportion of each feature $\eta$.

1. Independently generate random subspaces $S_{jk} \sim \mathcal{D}$, $1 \le j \le B_1$, $1 \le k \le B_2$
2. for $j \leftarrow 1$ to $B_1$ do
3. &nbsp;&nbsp;Select the optimal subspace $S_{j*}$ from $\{S_{jk}\}_{k=1}^{B_2}$ according to $\mathcal{C}$ and $T$
4. end
5. Construct the ensemble decision function $\nu_n(x) = \frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S_{j*},T}(x)$
6. Set the threshold $\hat{\alpha}$ according to (2)
7. Output the predicted label $C_n^{RaSE}(x) = 1(\nu_n(x) > \hat{\alpha})$ and the selected proportion of each feature $\eta = (\eta_1, \ldots, \eta_p)^T$, where $\eta_l = B_1^{-1}\sum_{j=1}^{B_1} 1(l \in S_{j*})$, $l = 1, \ldots, p$

In Algorithm 1, the subspace distribution $\mathcal{D}$ is by default a hierarchical uniform distribution over the subspaces. Specifically, with $D$ denoting the upper bound on the subspace size (how to set $D$ in practice will be discussed in Section 4.1), we first generate the subspace size $d$ from the uniform distribution over $\{1, \ldots, D\}$. Then, the subspace $S_{11}$ follows the uniform distribution over $\{S \subseteq S_{\text{Full}} : |S| = d\}$. In practice, the subspace distribution can be adjusted if we have prior information about the data structure. In Step 6 of Algorithm 1, we set the decision threshold to minimize the empirical classification error on the training set,

$$\hat{\alpha} = \arg\min_{\alpha \in [0,1]}\left[\hat{\pi}_0\left(1 - \hat{G}^{(0)}_n(\alpha)\right) + \hat{\pi}_1 \hat{G}^{(1)}_n(\alpha)\right], \tag{2}$$

where

$$\hat{\pi}_r = \frac{1}{n}\sum_{i=1}^n 1(y_i = r), \quad \hat{G}^{(r)}_n(\alpha) = \frac{1}{n\hat{\pi}_r}\sum_{i=1}^n 1(y_i = r)1(\nu_n(x_i) \le \alpha), \quad r = 0, 1,$$

and $\nu_n(x_i) = B_1^{-1}\sum_{j=1}^{B_1} 1(C_n^{S_{j*},T}(x_i) = 1)$.

### 2.2 Minimal Discriminative Set

For sparse classification problems, it is important to accurately separate signals from noise. Motivated by Kohavi et al. (1997) and Zhang and Wang (2011), we define the discriminative set and study some of its properties as follows.

**Definition 1** A feature subset $S$ is called a discriminative set if $y$ is conditionally independent of $x_{S^c}$ given $x_S$, where $S^c = S_{\text{Full}} \setminus S$. We call $S$ a minimal discriminative set if it has minimal cardinality among all discriminative sets.

**Assumption 1** The densities $f^{(0)}$ and $f^{(1)}$ have the same support a.s. with respect to $P_x$.

**Remark 2** Note that the existence of densities and the common-support requirement are not necessary for the definition of the minimal discriminative set and the RaSE framework. We focus on continuous distributions purely for notational convenience. We will discuss this assumption again after introducing our information criterion in Definition 6.

**Proposition 3** Under Assumption 1, we can characterize the discriminative set through the marginal density ratio via the following two facts.
(i) If $S$ is a discriminative set, then

$$\frac{f^{(1)}(x)}{f^{(0)}(x)} = \frac{f^{(1)}_S(x_S)}{f^{(0)}_S(x_S)}$$

almost surely with respect to $P_x$.

(ii) If for a feature subset $S$ there exists a function $h : \mathbb{R}^{|S|} \to [0, +\infty]$ such that

$$\frac{f^{(1)}(x)}{f^{(0)}(x)} = h(x_S)$$

almost surely with respect to $P_x$, then $S$ is a discriminative set and $h(x_S) = f^{(1)}_S(x_S)/f^{(0)}_S(x_S)$ almost surely with respect to $P_x$.

In general, there may exist more than one minimal discriminative set. For instance, if two features are exactly identical, then more than one minimal discriminative set may exist, since we cannot distinguish between them. To rule out this type of degenerate scenario, we impose a uniqueness assumption on the minimal discriminative set.

**Assumption 2** There is only one minimal discriminative set, denoted by $S^*$. In addition, all discriminative sets cover $S^*$.

In the classification problem, we are often interested in the risk of a classifier $C$. Under the 0-1 loss, the risk, or misclassification rate, is defined as $R(C) = E[1(C(x) \ne y)] = P(C(x) \ne y)$. The Bayes classifier

$$C^{\text{Bayes}}(x) = \begin{cases} 1, & P(y = 1 \mid x) > \frac{1}{2}, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

is known to achieve the minimal risk among all classifiers (Devroye et al., 2013). If only the features in $S$ are used, we obtain a local Bayes classifier $C_S^{\text{Bayes}}(x_S)$, which achieves the minimal risk among all classifiers using only the features in $S$. In general, there is no guarantee that $R(C_S^{\text{Bayes}}) = R(C^{\text{Bayes}})$. Fortunately, equality holds when $S$ is a discriminative set.

**Proposition 4** For any discriminative set $S$, it holds that $R(C_S^{\text{Bayes}}) = R(C_{S^*}^{\text{Bayes}}) = R(C^{\text{Bayes}})$.

This direct result shows that covering $S^*$ is sufficient to obtain performance as good as the Bayes classifier. To clarify the notions above, we next take the two-class Gaussian settings as examples and investigate what $S^*$ is in each case.

**Example 1 (LDA)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma)$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma)$, where $\Sigma$ is positive definite. The log-density ratio is

$$\log\frac{f^{(0)}(x)}{f^{(1)}(x)} = C - (\mu^{(1)} - \mu^{(0)})^T \Sigma^{-1} x,$$

where $C$ is a constant independent of $x$. By Proposition 3, the minimal discriminative set is $S^* = \{j : [\Sigma^{-1}(\mu^{(1)} - \mu^{(0)})]_j \ne 0\}$. This definition is equivalent to that in Mai et al. (2012) for the LDA case.

**Example 2 (QDA)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma^{(0)})$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma^{(1)})$, where $\Sigma^{(0)}$ and $\Sigma^{(1)}$ are positive definite matrices with $\Sigma^{(0)} \ne \Sigma^{(1)}$. Then the log-density ratio is

$$\log\frac{f^{(0)}(x)}{f^{(1)}(x)} = C + \frac{1}{2}x^T\left[(\Sigma^{(1)})^{-1} - (\Sigma^{(0)})^{-1}\right]x + \left[(\Sigma^{(0)})^{-1}\mu^{(0)} - (\Sigma^{(1)})^{-1}\mu^{(1)}\right]^T x, \tag{4}$$

where $C$ is a constant independent of $x$. Let $S^*_l = \{j : [(\Sigma^{(0)})^{-1}\mu^{(0)} - (\Sigma^{(1)})^{-1}\mu^{(1)}]_j \ne 0\}$ and $S^*_q = \{j : [(\Sigma^{(1)})^{-1} - (\Sigma^{(0)})^{-1}]_{ij} \ne 0 \text{ for some } i\}$. The elements of $S^*_l$ are often called variables with main effects, while the elements of $S^*_q$ are called variables with quadratic effects (Hao et al., 2018; Fan et al., 2015; Jiang et al., 2018). By Proposition 3, the minimal discriminative set is $S^* = S^*_l \cup S^*_q$.

**Proposition 5** If $f^{(0)} \sim N(\mu^{(0)}, \Sigma^{(0)})$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma^{(1)})$, where $\Sigma^{(0)}$ and $\Sigma^{(1)}$ are positive definite matrices, then the following conclusions hold:

(i) The minimal discriminative set $S^*$ is unique;

(ii) For any discriminative set $S$, we have $S \supseteq S^*$;

(iii) Any set $S \supseteq S^*$ is a discriminative set. This conclusion also holds without the Gaussian assumption.

### 2.3 Ratio Information Criterion (RIC)

As discussed in Section 2.2, it is desirable to identify the minimal discriminative set $S^*$ for the classifier to achieve a low misclassification rate. Hence, in Algorithm 1 it is important to apply a proper criterion to select the optimal subspace. In the variable selection literature, a criterion enjoying the property of correctly selecting the minimal discriminative set with high probability is often referred to as a consistent one. For model (1), Zhang and Wang (2011) proved that BIC, in conjunction with a backward elimination procedure, is selection consistent in the Gaussian mixture case.
However, the BIC they investigated involves the joint log-likelihood of $(x, y)$, which requires estimating high-dimensional covariance matrices; this can be problematic when $p$ is close to or larger than $n$ without additional structural assumptions. Here we propose a new criterion, based on the Kullback-Leibler divergence (Kullback and Leibler, 1951), which enjoys weak consistency under the general setting of (1). The two asymmetric Kullback-Leibler divergences between densities $f$ and $g$ are defined as

$$KL(f\|g) = E_{x \sim f}\left[\log\frac{f(x)}{g(x)}\right], \quad KL(g\|f) = E_{x \sim g}\left[\log\frac{g(x)}{f(x)}\right],$$

where $E_{x \sim f}$ represents the expectation with respect to $x \sim f$. In the binary classification model (1), the marginal class probabilities can be crucial, because imbalanced marginal distributions can significantly compromise the performance of most standard learning algorithms (He and Garcia, 2009). Therefore, we consider a weighted version of the two KL divergences for the marginal distributions $f^{(0)}_S$ and $f^{(1)}_S$ on subspace $S$, namely $\pi_0 KL(f^{(0)}_S\|f^{(1)}_S) + \pi_1 KL(f^{(1)}_S\|f^{(0)}_S)$. Denoting by $\hat{f}^{(0)}_S, \hat{f}^{(1)}_S, \hat{\pi}_0, \hat{\pi}_1$ the estimates obtained by plugging in the MLEs of the parameters, it holds that

$$\hat{\pi}_0\,\widehat{KL}(f^{(0)}_S\|f^{(1)}_S) = \frac{1}{n}\sum_{i=1}^n 1(y_i = 0)\log\left[\frac{\hat{f}^{(0)}_S(x_{i,S})}{\hat{f}^{(1)}_S(x_{i,S})}\right], \quad
\hat{\pi}_1\,\widehat{KL}(f^{(1)}_S\|f^{(0)}_S) = \frac{1}{n}\sum_{i=1}^n 1(y_i = 1)\log\left[\frac{\hat{f}^{(1)}_S(x_{i,S})}{\hat{f}^{(0)}_S(x_{i,S})}\right].$$

Now we are ready to introduce the new criterion, named the ratio information criterion (RIC), which adds a proper penalty term.

**Definition 6** For model (1), the ratio information criterion (RIC) for feature subspace $S$ is defined as

$$\text{RIC}_n(S) = -2\left[\hat{\pi}_0\,\widehat{KL}(f^{(0)}_S\|f^{(1)}_S) + \hat{\pi}_1\,\widehat{KL}(f^{(1)}_S\|f^{(0)}_S)\right] + c_n \deg(S), \tag{5}$$

where $c_n$ is a function of the sample size $n$ and $\deg(S)$ is the degree of freedom of the model with subspace $S$.

**Remark 7** Assumption 1 is necessary to ensure that RIC is well defined. To see this, note that the existence of both Kullback-Leibler divergences requires (1) $f^{(0)}(x) = 0 \Rightarrow f^{(1)}(x) = 0$ a.s. with respect to $P^{(1)}$, and (2) $f^{(1)}(x) = 0 \Rightarrow f^{(0)}(x) = 0$ a.s. with respect to $P^{(0)}$. This is equivalent to Assumption 1 when $\pi_0, \pi_1 > 0$.

Note that although AIC is also motivated by the Kullback-Leibler divergence, it aims to minimize the KL divergence between the estimated density and the true density (Burnham and Anderson, 1998). In our case, however, the goal is to maximize the KL divergence between the conditional densities of the two classes, to achieve a greater separation. Next, let us work out a few familiar examples where explicit expressions for RIC exist.

**Proposition 8 (RIC for the LDA model)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma)$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma)$, where $\Sigma$ is positive definite. The MLEs of the parameters are

$$\hat{\mu}^{(r)}_S = \frac{1}{n_r}\sum_{i=1}^n 1(y_i = r)x_{i,S}, \quad r = 0, 1, \qquad
\hat{\Sigma}_{S,S} = \frac{1}{n}\sum_{i=1}^n \sum_{r=0}^1 1(y_i = r)(x_{i,S} - \hat{\mu}^{(r)}_S)(x_{i,S} - \hat{\mu}^{(r)}_S)^T,$$

where $n_r = \sum_{i=1}^n 1(y_i = r)$. Then we have

$$\text{RIC}_n(S) = -(\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S)^T\hat{\Sigma}_{S,S}^{-1}(\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S) + c_n(|S| + 1).$$

**Proposition 9 (RIC for the QDA model)** Suppose $f^{(0)} \sim N(\mu^{(0)}, \Sigma^{(0)})$ and $f^{(1)} \sim N(\mu^{(1)}, \Sigma^{(1)})$, where $\Sigma^{(0)}, \Sigma^{(1)}$ are positive definite but not necessarily equal. The MLEs $\{\hat{\mu}^{(r)}_S, r = 0, 1\}$ are the same as in Proposition 8, and

$$\hat{\Sigma}^{(r)}_{S,S} = \frac{1}{n_r}\sum_{i=1}^n 1(y_i = r)(x_{i,S} - \hat{\mu}^{(r)}_S)(x_{i,S} - \hat{\mu}^{(r)}_S)^T, \quad r = 0, 1.$$

Then we have

$$\begin{aligned}
\text{RIC}_n(S) = {}& -(\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S)^T\left[\hat{\pi}_1(\hat{\Sigma}^{(0)}_{S,S})^{-1} + \hat{\pi}_0(\hat{\Sigma}^{(1)}_{S,S})^{-1}\right](\hat{\mu}^{(1)}_S - \hat{\mu}^{(0)}_S) \\
& + \text{Tr}\left[\left((\hat{\Sigma}^{(1)}_{S,S})^{-1} - (\hat{\Sigma}^{(0)}_{S,S})^{-1}\right)\left(\hat{\pi}_1\hat{\Sigma}^{(1)}_{S,S} - \hat{\pi}_0\hat{\Sigma}^{(0)}_{S,S}\right)\right] \\
& + (\hat{\pi}_1 - \hat{\pi}_0)\left(\log|\hat{\Sigma}^{(1)}_{S,S}| - \log|\hat{\Sigma}^{(0)}_{S,S}|\right) + c_n\frac{|S|(|S|+3)}{2}.
\end{aligned}$$

Note that the primary term of RIC in the LDA case is the negative Mahalanobis distance (McLachlan, 1999), which is closely related to the Bayes error of the LDA classifier (Efron, 1975). For the QDA case, the KL-divergence part contains three terms. The first is similar to the Mahalanobis distance, representing the contribution of linear signals to the classification model, while the second and third terms represent the contributions of quadratic signals.
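To make the LDA-case formula concrete, here is a minimal numerical sketch of the RIC computation in Proposition 8. This is an illustrative Python translation (the official implementation is the R package RaSEn); the penalty factor `cn` is left to the user, since its recommended choice is discussed later in the paper.

```python
import numpy as np

def ric_lda(X, y, S, cn):
    """RIC for the LDA model (Proposition 8).

    X: (n, p) data matrix; y: (n,) labels in {0, 1};
    S: list of feature indices defining the subspace;
    cn: user-chosen penalty factor (a function of n).
    """
    XS = X[:, S]
    mu0 = XS[y == 0].mean(axis=0)
    mu1 = XS[y == 1].mean(axis=0)
    # Pooled MLE of the common covariance on subspace S
    n = len(y)
    R0 = XS[y == 0] - mu0
    R1 = XS[y == 1] - mu1
    Sigma = (R0.T @ R0 + R1.T @ R1) / n
    delta = mu1 - mu0
    # RIC_n(S) = -delta' Sigma^{-1} delta + cn * (|S| + 1)
    return -delta @ np.linalg.solve(Sigma, delta) + cn * (len(S) + 1)
```

A subspace containing a true signal feature yields a large Mahalanobis term and hence a much smaller (more negative) RIC than a pure-noise subspace, which is exactly what the subspace-selection step in Algorithm 1 exploits.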
Note that the KL divergence can also be estimated by non-parametric methods, including those based on k-nearest neighbor distances (Wang et al., 2009; Ganguly et al., 2018), which sometimes lead to more robust estimates than the parametric ones in our numerical experiments. Specifically, consider two samples $\{x^{(0)}_{i,S}\}_{i=1}^{n_0} \overset{\text{i.i.d.}}{\sim} f^{(0)}_S$ and $\{x^{(1)}_{i,S}\}_{i=1}^{n_1} \overset{\text{i.i.d.}}{\sim} f^{(1)}_S$. Write $\rho_{k_0,0}(x^{(0)}_{j,S})$ for the Euclidean distance between $x^{(0)}_{j,S}$ and its $k_0$-th nearest neighbor in the sample $\{x^{(0)}_{i,S}\}_{i=1}^{n_0} \setminus \{x^{(0)}_{j,S}\}$, and $\rho_{k_1,1}(x^{(0)}_{j,S})$ for the Euclidean distance between $x^{(0)}_{j,S}$ and its $k_1$-th nearest neighbor in the sample $\{x^{(1)}_{i,S}\}_{i=1}^{n_1}$. Wang et al. (2009) and Ganguly et al. (2018) defined the following asymptotically unbiased estimator given $k_0$ and $k_1$:

$$\widehat{KL}(f^{(0)}_S\|f^{(1)}_S) = \frac{|S|}{n_0}\sum_{i=1}^{n_0}\log\frac{\rho_{k_1,1}(x^{(0)}_{i,S})}{\rho_{k_0,0}(x^{(0)}_{i,S})} + \log\frac{n_1}{n_0 - 1} + \Psi(k_0) - \Psi(k_1), \tag{6}$$

where $\Psi$ denotes the digamma function (Abramowitz and Stegun, 1948). Similarly, we can obtain the estimate $\widehat{KL}(f^{(1)}_S\|f^{(0)}_S)$. In addition, Berrett and Samworth (2019) proposed a weighted estimator based on (6) and investigated its efficiency. We will compare the performance of RaSE when using the estimator in (6) with that of the parametric methods in our simulations.

Another line of research on classification studies the conditional distribution of $y \mid x$. For this setup, there is a rich literature on information-type criteria that involve the conditional log-likelihood function. The Akaike information criterion (AIC) (Akaike, 1973) was shown to be inconsistent, while the Bayesian information criterion (BIC) was demonstrated to be consistent under certain regularity conditions (Rao and Wu, 1989). Chen and Chen (2008, 2012) modified the conventional BIC to form the extended BIC (eBIC) for the high-dimensional setting where $p$ grows at a polynomial rate of $n$.
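As a concrete illustration of the estimator in (6), the following self-contained Python sketch computes $\widehat{KL}(f^{(0)}_S\|f^{(1)}_S)$ by brute-force distance computation. It is an illustration only, not the RaSEn implementation; since (6) only needs $\Psi$ at positive integers, we use the exact identity $\Psi(k) = -\gamma + \sum_{i=1}^{k-1} 1/i$ instead of a special-function library.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def psi(k):
    # Digamma at a positive integer: psi(k) = -gamma + sum_{i=1}^{k-1} 1/i
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, k))

def kl_knn(X0, X1, k0=1, k1=1):
    """kNN estimator of KL(f0 || f1) from samples X0 ~ f0, X1 ~ f1, as in (6)."""
    n0, d = X0.shape
    n1 = X1.shape[0]
    # rho_{k0,0}: distance from each point of X0 to its k0-th NN in X0 \ {itself}
    d00 = np.sort(np.linalg.norm(X0[:, None, :] - X0[None, :, :], axis=2), axis=1)
    rho00 = d00[:, k0]            # column 0 is the zero self-distance
    # rho_{k1,1}: distance from each point of X0 to its k1-th NN in X1
    d01 = np.sort(np.linalg.norm(X0[:, None, :] - X1[None, :, :], axis=2), axis=1)
    rho11 = d01[:, k1 - 1]
    return (d / n0) * np.sum(np.log(rho11 / rho00)) \
        + np.log(n1 / (n0 - 1)) + psi(k0) - psi(k1)
```

For well-separated classes (e.g., $N(0,1)$ versus $N(3,1)$, whose true divergence is $9/2$), the estimate is clearly positive, while for two samples from the same distribution it fluctuates around zero.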
Fan and Tang (2013) proved the consistency of a generalized information criterion (GIC) for generalized linear models in ultra-high-dimensional spaces.

### 2.4 Iterative RaSE

The success of the RIC proposed in Section 2.3 relies on the assumption that the minimal discriminative set $S^*$ appears in some of the $B_2$ subspaces for each weak learner. For sparse classification problems, the size of $S^*$ can be very small compared with $p$. When $p$ is large, the probability of generating a subset that covers $S^*$ under the hierarchical uniform distribution is quite low. It turns out that, using the selection frequency of each feature over the $B_1$ subspaces $\{S_{j*}\}_{j=1}^{B_1}$ from Algorithm 1, we can improve the RaSE algorithm by running it again with a new hierarchical distribution for the subspaces. In particular, we first calculate the proportion vector $\eta = (\eta_1, \ldots, \eta_p)^T$ giving the proportion of times each feature appears among the $B_1$ subspaces $\{S_{j*}\}_{j=1}^{B_1}$, where $\eta_l = B_1^{-1}\sum_{j=1}^{B_1} 1(l \in S_{j*})$, $l = 1, \ldots, p$. The new hierarchical distribution is specified as follows. In the first step, we generate the subspace size $d$ from the uniform distribution over $\{1, \ldots, D\}$ as before. Before moving to the second step, note that each subspace $S$ can be equivalently represented as $J = (J_1, \ldots, J_p)^T$, where $J_l = 1(l \in S)$, $l = 1, \ldots, p$. Then we generate $J$ from a restrictive multinomial distribution with parameters $(p, d, \tilde{\eta})$, where $\tilde{\eta} = (\tilde{\eta}_1, \ldots, \tilde{\eta}_p)^T$,

$$\tilde{\eta}_l = \eta_l 1(\eta_l > C_0/\log p) + \frac{C_0}{p}1(\eta_l \le C_0/\log p),$$

and the restriction is that $J_l \in \{0, 1\}$, $l = 1, \ldots, p$. Here $C_0$ is a constant. Intuitively, this strategy can be repeated to increase the probability that the signals in $S^*$ are covered by the subspaces we generate. It can also lead to improved feature ranking according to the updated feature proportions $\eta$. This will be verified via multiple simulation experiments in Section 5.
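The distribution update above can be sketched in Python as follows. Since the exact sampler for the restrictive multinomial distribution is not spelled out in this section, the sketch approximates it (an assumption on my part) by drawing $d$ distinct features with probabilities proportional to $\tilde{\eta}$; the flooring of small frequencies at $C_0/p$ follows the displayed formula.

```python
import numpy as np

def flatten_eta(eta, C0, p):
    """tilde_eta: keep frequencies above C0/log(p), floor the rest at C0/p."""
    thresh = C0 / np.log(p)
    return np.where(eta > thresh, eta, C0 / p)

def sample_subspace(eta, D, C0, rng):
    """Draw one subspace from the updated hierarchical distribution.

    Approximation: a weighted draw of d distinct features stands in for the
    restrictive multinomial distribution with parameters (p, d, tilde_eta).
    """
    p = len(eta)
    eta_tilde = flatten_eta(eta, C0, p)
    d = int(rng.integers(1, D + 1))          # subspace size ~ Uniform{1, ..., D}
    probs = eta_tilde / eta_tilde.sum()
    return np.sort(rng.choice(p, size=d, replace=False, p=probs))
```

Features that were frequently selected in the previous round dominate `probs`, so the next round of $B_1 B_2$ subspaces concentrates on the suspected signals while the $C_0/p$ floor keeps every feature reachable.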
This iterative RaSE algorithm is summarized in Algorithm 2. A related idea was introduced by Mukhopadhyay and Dunson (2019) to generate random projections with probabilities proportional to the marginal utilities. One major difference in RaSE is that feature importance is determined via the features' joint contributions in the selected subspaces.

**Algorithm 2:** Iterative RaSE ($\text{RaSE}_T$)

Input: training data $\{(x_i, y_i)\}_{i=1}^n$, new data $x$, initial subspace distribution $\mathcal{D}^{(0)}$, criterion $\mathcal{C}$, integers $B_1$ and $B_2$, type of base classifier $T$, number of iterations $T$.
Output: predicted label $C_n^{RaSE}(x)$, the proportion of each feature $\eta^{(T)}$.

1. for $t \leftarrow 0$ to $T$ do
2. &nbsp;&nbsp;Independently generate random subspaces $S^{(t)}_{jk} \sim \mathcal{D}^{(t)}$, $1 \le j \le B_1$, $1 \le k \le B_2$
3. &nbsp;&nbsp;for $j \leftarrow 1$ to $B_1$ do
4. &nbsp;&nbsp;&nbsp;&nbsp;Select the optimal subspace $S^{(t)}_{j*}$ from $\{S^{(t)}_{jk}\}_{k=1}^{B_2}$ according to $\mathcal{C}$ and $T$
5. &nbsp;&nbsp;end
6. &nbsp;&nbsp;Update $\eta^{(t)}$, where $\eta^{(t)}_l = B_1^{-1}\sum_{j=1}^{B_1} 1(l \in S^{(t)}_{j*})$, $l = 1, \ldots, p$
7. &nbsp;&nbsp;Update $\mathcal{D}^{(t+1)}$ as the restrictive multinomial distribution with parameters $(p, d, \tilde{\eta}^{(t)})$, where $\tilde{\eta}^{(t)}_l = \eta^{(t)}_l 1(\eta^{(t)}_l > C_0/\log p) + \frac{C_0}{p}1(\eta^{(t)}_l \le C_0/\log p)$ and $d$ is sampled from the uniform distribution over $\{1, \ldots, D\}$
8. end
9. Set the threshold $\hat{\alpha}$ according to (2)
10. Construct the ensemble decision function $\nu_n(x) = \frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S^{(T)}_{j*},T}(x)$
11. Output the predicted label $C_n^{RaSE}(x) = 1(\nu_n(x) > \hat{\alpha})$ and $\eta^{(T)}$

## 3. Theoretical Studies

In this section, we investigate various theoretical properties of RaSE, including the impact of $B_1$ on the risk of the RaSE classifier and on the expected misclassification rate. Furthermore, we will demonstrate that RIC achieves weak consistency. Throughout this section, we allow the dimension $p$ to grow with the sample size $n$. To streamline the presentation, we first introduce some additional notation. In this paper, there are three sources of randomness: (1) the randomness of the training data, (2) the randomness of the subspaces, and (3) the randomness of the test data. We use the following notation to differentiate them. Analogously to Cannings and Samworth (2017), we write $P^*$ and $E^*$ for the probability and expectation with respect to the collection of $B_1 B_2$ random subspaces $\{S_{jk} : 1 \le j \le B_1, 1 \le k \le B_2\}$; $\bar{P}$ and $\bar{E}$ are used when the randomness comes from the training data $\{(x_i, y_i)\}_{i=1}^n$; and we use $P$ and $E$ when considering all three sources of randomness.

Recall that in Algorithm 1 the decision function is $\nu_n(x) = \frac{1}{B_1}\sum_{j=1}^{B_1} C_n^{S_{j*},T}(x)$. For a given threshold $\alpha \in (0, 1)$, the RaSE classifier is

$$C_n^{RaSE}(x) = \begin{cases} 1, & \nu_n(x) > \alpha, \\ 0, & \nu_n(x) \le \alpha. \end{cases}$$

By the weak law of large numbers, as $B_1 \to \infty$, $\nu_n$ converges in probability to its expectation $\mu_n(x) = P^*\left(C_n^{S_{1*},T}(x) = 1\right)$. Note that since both the training data and the criterion $\mathcal{C}$ in Algorithm 1 are fixed, $S_{1*}$ is a deterministic function of $\{S_{1k}\}_{k=1}^{B_2}$. Nevertheless, $\mu_n(x)$ is still random due to the randomness of the new observation $x$. It is then helpful to define the conditional cumulative distribution functions of $\mu_n$ for classes 0 and 1, respectively, as $G^{(0)}_n(\alpha') = P(\mu_n(x) \le \alpha' \mid y = 0) = P^{(0)}(\mu_n(x) \le \alpha')$ and $G^{(1)}_n(\alpha') = P(\mu_n(x) \le \alpha' \mid y = 1) = P^{(1)}(\mu_n(x) \le \alpha')$. Since the distribution $\mathcal{D}$ of subspaces is discrete, $\mu_n(x)$ takes finitely many distinct values almost surely, implying that $G^{(0)}_n$ and $G^{(1)}_n$ are step functions. We denote the corresponding probability mass functions by $g^{(0)}_n$ and $g^{(1)}_n$, respectively. Since for any $x$, $\nu_n(x) \to \mu_n(x)$ as $B_1 \to \infty$, we consider the following randomized RaSE classifier in population,

$$C_n^{RaSE*}(x) = \begin{cases} 1, & \mu_n(x) > \alpha, \\ 0, & \mu_n(x) < \alpha, \\ \text{Bernoulli}\left(\frac{1}{2}\right), & \mu_n(x) = \alpha, \end{cases}$$

as the infinite-simulation RaSE classifier with $B_1 \to \infty$. In the following sections, we study various properties of $C_n^{RaSE*}$. In Section 3.1, we condition on the training data and study the impact of $B_1$ via the relationship between the test errors of $C_n^{RaSE}$ and $C_n^{RaSE*}$ and the Monte-Carlo variance $\text{Var}^*(R(C_n^{RaSE}))$, which reflects the stability of the RaSE classifier as $B_1$ increases.
It will be demonstrated that, conditioned on the training data, both the difference between the test errors of C_n^RaSE and C̄_n^RaSE and the Monte-Carlo variance of R(C_n^RaSE) converge to zero at an exponential rate as B1 → ∞, for almost every threshold α ∈ (0, 1). In Section 3.3, we will prove an upper bound for the expected misclassification rate R(C_n^RaSE) with respect to all sources of randomness for a fixed threshold α.

Next, we introduce several scaling notations. For two sequences a_n and b_n, we use a_n = o(b_n) or a_n ≪ b_n to denote |a_n/b_n| → 0, and a_n = O(b_n) or a_n ≲ b_n to denote lim sup |a_n/b_n| < ∞. The corresponding stochastic scaling notations are o_p and O_p, where a_n = o_p(b_n) means |a_n/b_n| → 0 in probability, and a_n = O_p(b_n) means that for any ϵ > 0 there exists M > 0 such that P(|a_n/b_n| > M) ≤ ϵ for all n. Also, we use λ_min(A) and λ_max(A) to denote the smallest and largest eigenvalues of a square matrix A. For any vector x = (x_1, ..., x_p)^T, the Euclidean norm is ‖x‖_2 = (Σ_i x_i²)^{1/2}. And for any matrix A = (a_{ij})_{p×p}, we define the operator norm ‖A‖_2 = sup_{‖x‖_2 = 1} ‖Ax‖_2, the infinity norm ‖A‖_∞ = max_i Σ_{j=1}^p |a_{ij}|, the maximum norm ‖A‖_max = max_{i,j} |a_{ij}|, and the Frobenius norm ‖A‖_F = (Σ_{i,j} a_{ij}²)^{1/2}.

3.1 Impact of B1

In this section, we study the impact of B1 by presenting upper bounds on the absolute difference between the test errors of C_n^RaSE and C̄_n^RaSE, and on the Monte-Carlo variance for RaSE, conditioned on the training data as B1 grows. Note that the discrete distribution of random subspaces leads to both bounds vanishing at exponential rates, except for a finite set of thresholds α, which is very appealing.

Theorem 10 (Risk for the RaSE classifier conditioned on training data) Denote G_n(α') = π1 G_n^(1)(α') − π0 G_n^(0)(α'), and let {α_i}_{i=1}^N represent the discontinuity points of G_n.
Given training samples of size n, we have the following bound on the expected misclassification rate of the RaSE classifier with threshold α as B1 → ∞:

|E_S[R(C_n^RaSE)] − R(C̄_n^RaSE)| ≤ O(B1^{-1/2}) if α ∈ {α_i}_{i=1}^N, and ≤ exp{−C_α B1} otherwise,

where C_α = 2 min_{1≤i≤N} |α − α_i|². It shows that as B1 → ∞, the RaSE classifier C_n^RaSE and its infinite-simulation version C̄_n^RaSE achieve the same expected misclassification rate conditioned on the training data. Many similar results on the excess risk of randomized ensembles have been studied in the literature (Cannings and Samworth, 2017; Lopes et al., 2019; Lopes, 2020). Next, we provide a similar upper bound for the MC variance of the RaSE classifier. Suppose the discontinuity points of G_n^(0) and G_n^(1) are {α_i^(0)}_{i=1}^{N0} and {α_j^(1)}_{j=1}^{N1}, respectively.

Theorem 11 (MC variance for the RaSE classifier) It holds that

Var_S[R(C_n^RaSE)] ≤ 2[π0 (g_n^(0)(α))² + π1 (g_n^(1)(α))²] + O(1/B1) if α ∈ {α_i^(0)}_{i=1}^{N0} ∪ {α_j^(1)}_{j=1}^{N1}, and ≤ exp{−C_α B1} otherwise,

where C_α = 2 min_{1≤i≤N0, 1≤j≤N1} min(|α − α_i^(0)|², |α − α_j^(1)|²). This theorem asserts that, except for a finite set of thresholds α, the MC variance of the RaSE classifier shrinks to zero at an exponential rate.

3.2 Theoretical Properties of RIC

An important step of the RaSE classifier is the choice of an optimal subspace among B2 subspaces for each of the B1 weak classifiers. Before showing the screening consistency and weak consistency of RIC, we first present a proposition that explains the intuition behind why RIC can succeed.

Proposition 12 When Assumptions 1 and 2 hold, we have the following conclusions:

(i) KL(f_S^(0) ‖ f_S^(1)) = KL(f_{S*}^(0) ‖ f_{S*}^(1)) and KL(f_S^(1) ‖ f_S^(0)) = KL(f_{S*}^(1) ‖ f_{S*}^(0)) hold for any S ⊇ S*;

(ii) π0 KL(f_S^(0) ‖ f_S^(1)) + π1 KL(f_S^(1) ‖ f_S^(0)) < π0 KL(f_{S*}^(0) ‖ f_{S*}^(1)) + π1 KL(f_{S*}^(1) ‖ f_{S*}^(0)) if S ⊉ S*;

(iii) KL(f_{S*}^(0) ‖ f_{S*}^(1)) = sup_S KL(f_S^(0) ‖ f_S^(1)) and KL(f_{S*}^(1) ‖ f_{S*}^(0)) = sup_S KL(f_S^(1) ‖ f_S^(0)).
From Proposition 12, if we define the population RIC without penalty as

RIC(S) = −2[π0 KL(f_S^(0) ‖ f_S^(1)) + π1 KL(f_S^(1) ‖ f_S^(0))],

it can be easily seen that

sup_{S: S ⊇ S*} RIC(S) = RIC(S*) < inf_{S: S ⊉ S*} RIC(S).

To successfully differentiate the subspaces covering S* from {S : S ⊉ S*} using RIC_n, we need to impose a condition on the minimum gap between the RIC of subspaces covering S* and that of subspaces S with S ⊉ S*. Similar assumptions, such as the beta-min condition, appear in the high-dimensional variable selection literature (Bühlmann and Van De Geer, 2011). Denote

ψ(n, p, D) = max{D^{κ1}(D^{κ3} + D^{κ4}), D^{κ5}, D^{2κ1+κ2}} · [(D log p + κ1 log D)/n]^{1/2}.

The complete set of conditions is presented as follows.

Assumption 3 Suppose the densities f^(0) and f^(1) are in parametric forms with f^(0)(x) = f^(0)(x; θ), f^(1)(x) = f^(1)(x; θ), where θ contains all parameters for both f^(0) and f^(1). Note that not all elements of θ need appear in both f^(0) and f^(1). Denote the dimension of θ by p*. As in Section 2.1, D represents the upper bound on the subspace size. Define L_S(x, θ) = log[f_S^(0)(x_S; θ_S) / f_S^(1)(x_S; θ_S)].
Assume the following conditions hold, where κ1, κ2, κ3, κ4, κ5 ≥ 0 and C1, C2, C3, C4, C5 > 0 are universal constants that do not depend on n, p, or D:

(i) p* is a function of p and p*(p) ≲ p^{κ1};

(ii) max{KL(f_S^(0) ‖ f_S^(1)), KL(f_S^(1) ‖ f_S^(0))} ≲ D^{κ1} for any subset S with |S| ≤ D;

(iii) There exists a family of functions {V_S({x_{i,S}}_{i=1}^n)}_S with V_S({x_{i,S}}_{i=1}^n) ∈ R, such that for {x_{i,S}}_{i=1}^n ⊆ X and any subset S with |S| ≤ D, there exists a constant ζ such that if ‖θ'_S − θ_S‖_2 ≤ ζ, then

(1/n) Σ_{i=1}^n ‖∇²_{θ_S} L_S(x_i, θ')‖_max ≤ V_S({x_{i,S}}_{i=1}^n),

and the following tail probability bound holds for V_S:

P(V_S({x_{i,S}}_{i=1}^n) > C1 D^{κ2}) ≤ exp{−C2 n},

where x_{i,S} ~ f_S^(0) or f_S^(1);

(iv) Each component of ∇_{θ_S} L_S(x, θ) is 2C3 D^{κ3}-sub-Gaussian for any subset S with |S| ≤ D, where x_S ~ f_S^(0) or f_S^(1), respectively;

(v) sup_{S: |S| ≤ D} ‖E_{x_S ~ f_S^(0)} ∇_{θ_S} L_S(x, θ)‖_∞ and sup_{S: |S| ≤ D} ‖E_{x_S ~ f_S^(1)} ∇_{θ_S} L_S(x, θ)‖_∞ are ≲ D^{κ4};

(vi) L_S(x, θ) is a 2C4 D^{κ5}-sub-Gaussian variable, where x_S ~ f_S^(0) or f_S^(1);

(vii) Denoting the MLE of θ based on subset S with |S| ≤ D as θ̂_S = argmax_{θ_S} Σ_{i=1}^n Σ_{r=0,1} 1(y_i = r) log f_S^(r)(x_{i,S}; θ_S), when ϵ is smaller than a positive constant, it holds that

P(‖θ̂_S − θ_S‖_∞ > ϵ) ≤ |S|^{κ1} exp{−C5 n ϵ²},

which implies ‖θ̂_S − θ_S‖_∞ = O_p((κ1 log |S| / n)^{1/2});

(viii) The signal strength satisfies

Δ := inf_{S: S ⊉ S*, |S| ≤ D} RIC(S) − RIC(S*) ≳ ψ(n, p, D), where ψ(n, p, D) = o(1).

Here, condition (i) assumes that the number of parameters in the model grows no faster than a polynomial rate in the dimension. Condition (ii) assumes an upper bound for the two KL divergences. Conditions (iii)-(vi) are imposed to guarantee the accuracy of the second-order approximation for RIC. Condition (vii) is usually satisfied for common distribution families (Van der Vaart, 2000). The last condition is a requirement on the signal strength; a condition of this type is necessary to prove a consistency result for any information criterion. Condition (viii) also imposes a constraint among n, p, and D. When D ≪ p, it can be simplified as Δ ≳ D^κ (D log p / n)^{1/2}, where κ is some positive constant.
To cover all the signals in S , we require D p . Therefore, an implied requirement for n, p, and p is (p )κ log p which is similar to the conditions in the literature of variable selection. Here, we allow p to grow at an exponential rate of n. To help readers understand these conditions better, we show that some commonly used conditions (presented in Assumption 4) for high-dimensional LDA model are sufficient for Assumption 3.(i)-(vii) to hold with the results presented in Proposition 14. Assumption 3.(viii) can be relaxed under the LDA model. Tian and Feng Assumption 4 (LDA model) Suppose the following conditions are satisfied, where m, M, M are constants: (i) λmin(Σ) m > 0, Σ max M < ; (ii) µ(1) µ(0) M < ; (iii) Denote δS = Σ 1 S,S(µ(1) S µ(0) S ), γ = inf j |(δS )j| > 0, then Remark 13 Here, condition (i) constrains the eigenvalues of the common covariance matrix, which is similar to condition (C2) in Hao et al. (2018) and Wang (2009), condition (2) in Shao et al. (2011), condition (C4) in Li and Liu (2019). Condition (ii) imposes an upper bound for the maximal componentwise mean difference of the two classes, which is similar to condition (3) in Shao et al. (2011). Condition (iii) assumes a lower bound on the minimum signal strength γ, which is in a similar spirit to condition 2 in Mai et al. (2012). Proposition 14 Suppose f(0) N(µ(0), Σ), f(1) N(µ(1), Σ), where Σ is positive definite. If Assumption 4 holds, then Assumption 3.(i)-(vii) hold with θS = ((µ(0) S )T , (µ(1) S )T , vec(ΣS,S)T )T and κ1 = 2, κ2 = κ4 = 1, κ3 = κ5 = 1 The detailed proof for this proposition can be found in Appendix B. We are now ready to present the consistency result for RIC as defined in (5). 
Theorem 15 (Consistency of RIC) Under Assumptions 1-3, we have (i) If sup S:|S| D deg(S) cn/ = o(1), then the following screening consistency holds for sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) 1 O p DDκ1 exp Cn D2κ1+κ2 Cn Dκ1(Dκ3 + Dκ4) (ii) If in addition cn ψ(n, p, D), then the weak consistency in the following sense holds for RIC: P RICn(S ) = inf S:|S| D RICn(S) 1 O p DDκ1 exp n Cn cn D2κ1+κ2 2. Note that here we assume that MLE of θ is well-defined for all models under consideration. Ra SE: Random Subspace Ensemble Classification Cn cn Dκ1(Dκ3 + Dκ4) O p D exp Cn cn Corollary 16 Denote p S = P(S11 S ) = 1 D P (p d) , where the subspaces are generated from the hierarchical uniform distribution. Assume the conditions stated in Assumption 1-2 hold, and in addition there holds If we select the optimal subspace by minimizing RIC as the criterion in Ra SE where D p , then we have sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) j=1 {S1j S } j=1 {S1j S } = 1 (1 p S )B2 1 O (exp { B2p S }) . Remark 17 Note that this corollary actually holds when we replace RICn with a general criterion Crn (smaller value leads to better subspace) with more discussions in Section 3.5. And for RIC, the bounds for the first probability on the right-hand side of (7) in Theorems 15, 18 and 20 can be plugged in to get the explicit bounds. Also, we want to point out that direct analyses of RIC for discriminant analysis models are also insightful and interesting. We can show similar consistency results as those in Theorem 15 from properties of discriminant analysis approach itself based on some common conditions used in literature about sparse discriminant analysis, instead of applying the general analysis of KL divergence. 
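The mechanism behind Proposition 12 and the unpenalized RIC — subspaces covering S* maximize the weighted KL divergence and therefore minimize RIC — can be checked numerically in the equal-covariance Gaussian case, where both KL divergences reduce to half the Mahalanobis distance between the class means. The following is an illustrative sketch under that assumption, not the RaSEn implementation; the toy model and subspace size are our own choices.

```python
import numpy as np
from itertools import combinations

def ric_unpenalized(mu0, mu1, Sigma, S, pi0=0.5, pi1=0.5):
    """Population RIC without penalty, -2*[pi0*KL(f0_S||f1_S) + pi1*KL(f1_S||f0_S)].
    With equal covariance, both KL terms equal
    0.5 * (mu1_S - mu0_S)^T Sigma_{S,S}^{-1} (mu1_S - mu0_S)."""
    S = list(S)
    d = (mu1 - mu0)[S]
    kl = 0.5 * d @ np.linalg.solve(Sigma[np.ix_(S, S)], d)
    return -2.0 * (pi0 * kl + pi1 * kl)

# Toy check: among all size-2 subspaces, the one covering the signal set
# {features 1 and 3, i.e. indices 0 and 2} attains the smallest RIC.
p = 5
Sigma = np.eye(p)
mu0 = np.zeros(p)
mu1 = np.array([1.0, 0.0, 0.8, 0.0, 0.0])
scores = {S: ric_unpenalized(mu0, mu1, Sigma, S)
          for S in combinations(range(p), 2)}
best = min(scores, key=scores.get)
```

In the actual criterion, a penalty increasing in deg(S) is added to the sample version RIC_n, so that among subspaces with equal discriminative power the smallest one is preferred.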
Theorem 18 (LDA consistency) For the LDA model, under Assumption 4, we have (i) If Dcn/γ2 = o(1), then the following screening consistency holds for RIC: sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) (ii) If in addition cn D2 q n , then RIC is weakly consistent: P RICn(S ) = inf S:|S| D RICn(S) 1 O p2 exp Cn cn Tian and Feng Assumption 5 (QDA model) Denote Ω(r) S,S = (Σ(r) S,S) 1, r = 0, 1. Suppose the following conditions are satisfied, where m, M, M are constants: (i) λmin(Σ(r)) m > 0, λmax(Σ(r)) M < , r = 0, 1; (ii) µ(1) µ(0) M < ; (iii) Denote γl = inf j (Ω(1) S,Sµ(1) S Ω(0) S,Sµ(0) S )j > 0, γq = inf i sup j (Ω(1) S q ,S q Ω(0) S q ,S q )ij > 0, min{γ2 l , γ2 q, γq} D2 r Remark 19 The conditions here are similar to Assumption 4 for the LDA model. A set of analogous conditions were used in Jiang et al. (2018). Theorem 20 (QDA consistency) For the QDA model, under Assumption 5, (i) If D2cn/ min{γ2 l , γ2 q, γq} = o(1), then RIC is screening consistent: sup S:S S |S| D RICn(S) < inf S:S S |S| D RICn(S) min{γ2 l , γ2 q, γq} D2 (ii) Further, if cn D2 q n , then RIC is weakly consistent: P RICn(S ) = inf S:|S| D RICn(S) 1 O p2 exp Cn cn The proof is available in Appendix B. Note that the bound here is tighter than the results from Proposition 14 and Theorem 15. Based on the consistency of RIC, in the next section, we will construct an upper bound for the expectation of the misclassification rate R(CRa SE n ). 3.3 Misclassification Rate of the Ra SE Classifier In the following theorem, we present an upper bound on the misclassification rate for the Ra SE classifier, which holds for any criterion to choose optimal subspaces. Theorem 21 (General misclassification rate) For the Ra SE classifier with threshold α and any criterion to choose optimal subspaces, it holds that E{E[R(CRa SE n ) R(CBayes)]} E sup S:S S |S| D [R(CS n ) R(CBayes)] + P(S1 S ) min(α, 1 α) . (8) Ra SE: Random Subspace Ensemble Classification Here, the upper bound consists of two terms. 
The first term involving E sup S:S S |S| D R(CBayes)] can be seen as the maximum discrepancy between the risk of models trained in any subspace covering S based on finite samples with the Bayes risk, which will be investigated in detail in the next subsection. This term shrinks to zero under certain conditions (see details in Section 3.4). The second term corresponds to the event that at least one signal is missed in the selected subspace. Corollary 16 in Section 3.2 shows that B2 needs to be sufficiently large to ensure this term goes to 0. We will show in Section 3.5 that the iterative Ra SE could relax the requirement on B2 under certain scenarios. Specifically, if we use the criterion of minimizing training error (misclassification rate on the training set) or leave-one-out cross-validation error, a similar guarantee of performance can be arrived as follows. Theorem 22 (Misclassification rate when minimizing training error or leaveone-out cross-validation error) If the criterion of minimal training error or leave-oneout cross-validation error is applied for the Ra SE classifier with threshold α, it holds that E{E[R(CRa SE n ) R(CBayes)]} E sup S:S S |S| D [R(CS n ) R(CBayes)] + E(ϵn) + E sup S:S S |S| D + (1 p S )B2 min(α, 1 α) , (9) where ϵS n = R(CS n ) Rn(CS n ), ϵn = E[R(CS1 n ) Rn(CS1 n )]. Here Rn(C) is the training error or leave-one-out cross-validation error of classifier C. This theorem is closely related to Theorem 3 in Cannings and Samworth (2017) and derived along similar lines. The merit of Theorem 22 compared with Theorem 21 is that we don t have the term P(S1 S ) in the bound, which can be difficult to quantify when minimizing training error or leave-one-out cross-validation error. Regarding the upper bound in (9), the first term is the same as the first term of the bound in Theorem 21. 
The second term involving E(ϵn) + E sup S:S S |S| D |ϵS n| is relative to the distance between the training error and test error, which usually shrinks to zero for some specific classifiers under certain conditions (see details in Section 3.4). The third term involving (1 p S )B2 reflects the possibility that S is not selected in any of the B2 subspaces we generate, which is similar to the second term in the bound given by Theorem 21. This term shrinks to zero under the condition of Corollary 16. 3.4 Detailed Analysis for Several Base Classifiers In this section, we work out the technical details for the Ra SE classifier when the base classifier is chosen to be LDA, QDA, and k NN. Tian and Feng 3.4.1 Linear Discriminant Analysis (LDA) LDA was proposed by Fisher (1936) and corresponds to model (1) where f(r) N(µ(r), Σ), r = 0, 1. For given training data {xi, yi}n i=1 in subspace S, using the MLEs given in Proposition 8, the classifier can be constructed as CS LDA n (x) = ( 1, LS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣS,S) > 0, 0, otherwise, where the decision function LS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣS,S) = log(ˆπ1/ˆπ0) + (x S (ˆµ(0) S + ˆµ(1) S )/2)T (ˆΣS,S) 1(ˆµ(1) S ˆµ(0) S ). And the degree of freedom of the LDA model with feature subspace S is deg(S) = |S| + 1. Efron (1975) derived that R(CS LDA n ) R(CBayes) = π1 r µ(1) S µ(0) S T Σ 1 S,S µ(1) S µ(0) S , ˆ S = r ˆµ(1) S ˆµ(0) S T ˆΣ 1 S,S ˆµ(1) S ˆµ(0) S , τS = log(π1/π0)/ S, ˆτS = log(ˆπ1/ˆπ0)/ ˆ S. Proposition 23 If Assumption 4 holds, then we have E sup S:S S |S| D [R(CS LDA n ) R(CBayes)] D2 r n max{(p ) 1 2 γ 1, (p ) 3 Regarding the second term in the upper bound in Theorem 22, due to Theorem 23.1 in Devroye et al. 
(2013), for any subset S, we have P |ϵS n| > ϵ 8n D exp 1 which yields sup S:S S |S| D S:S S |S| D P |ϵS n| > ϵ 8n Dp D p exp 1 By Lemma 39 in Appendix B, it follows E sup S:S S |S| D 32[(D p ) log p + D log n + 3 log 2 + 1] Ra SE: Random Subspace Ensemble Classification sup k=1,...,B2 |ϵS1k n | sup k=1,...,B2 |ϵS n| > ϵ k=1 P |ϵS1k n | > ϵ 8B2n D exp 1 again by Lemma 39, we have 32[log B2 + D log n + 3 log 2 + 1] By plugging these bounds in the right-hand side of (8) and (9), we can get the explicit upper bound of the misclassification rate for the LDA model. 3.4.2 Quadratic Discriminant Analysis (QDA) QDA considers the model (1) analogous to LDA while x|y = r N(µ(r), Σ(r)), r = 0, 1, where Σ(0) can be different from Σ(1). On the basis of training data {(xi, yi)}n i=1 in subspace S, it admits the following form of classifier based on the MLEs given in Proposition 9: CS QDA n (x) = ( 1, QS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣ(0) S,S, ˆΣ(1) S,S) > 0, 0, otherwise, where the decision function QS(x S|ˆπ0, ˆπ1, ˆµ(0) S , ˆµ(1) S , ˆΣ(0) S,S, ˆΣ(1) S,S) = log(ˆπ1/ˆπ0) 1 2x T S[(ˆΣ(1) S,S) 1 (ˆΣ(0) S,S) 1]x S + x T S[(ˆΣ(1) S,S) 1 ˆµ(1) S (ˆΣ(0) S,S) 1 ˆµ(0) S ] 2(ˆµ(1) S )T (ˆΣ(1) S,S) 1 ˆµ(1) S + 1 2(ˆµ(0) S )T (ˆΣ(0) S,S) 1 ˆµ(0) S . And the degree of freedom of QDA model with feature subspace S is deg(S) = |S|(|S| + 1)/2 + |S| + 1. To analyze the first term in (8) and (9), as in Jiang et al. (2018), for any constant c, we define ess sup z [ c,c] h(r)(z), r = 0, 1 where ess sup represents the essential supremum defined as the supremum except a set with measure zero and h(r)(z) is the density of QS (x S |π1, π0, µ(0) S , µ(1) S , Σ(0) S ,S , Σ(1) S ,S ) given that y = r. Proposition 24 If Assumption 5 and the following conditions hold: (i) There exist positive constants c, Uc such that uc Uc < ; (ii) There exists a positive number ϖ0 (0, 1), p exp{nϖ0}; Tian and Feng then we have E sup S:S S |S| D R(CS QDA n ) R(CBayes) D2 log p for any ϖ (0, 1/2). 
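The subspace QDA decision function Q_S displayed above can be sketched as follows. This is a plug-in sketch with the log-determinant term written out explicitly; it is an illustration under Gaussian class-conditional assumptions, not the paper's exact estimator.

```python
import numpy as np

def qda_decision(xS, pi0, pi1, mu0, mu1, Sig0, Sig1):
    """Subspace QDA decision function Q_S: classify as class 1 when positive.
    Quadratic term -0.5*x^T(P1-P0)x, linear term x^T(P1 mu1 - P0 mu0),
    plus the prior and normalizing constants."""
    P0, P1 = np.linalg.inv(Sig0), np.linalg.inv(Sig1)
    return (np.log(pi1 / pi0)
            - 0.5 * np.log(np.linalg.det(Sig1) / np.linalg.det(Sig0))
            - 0.5 * xS @ (P1 - P0) @ xS
            + xS @ (P1 @ mu1 - P0 @ mu0)
            - 0.5 * mu1 @ P1 @ mu1
            + 0.5 * mu0 @ P0 @ mu0)

# Toy check with equal identity covariances (where QDA reduces to LDA):
I2 = np.eye(2)
q_pos = qda_decision(np.array([2.0, 0.0]), 0.5, 0.5,
                     np.zeros(2), np.array([2.0, 0.0]), I2, I2)
q_neg = qda_decision(np.zeros(2), 0.5, 0.5,
                     np.zeros(2), np.array([2.0, 0.0]), I2, I2)
```

When Σ^(0) = Σ^(1), the quadratic term vanishes and the rule reduces to the LDA decision function of Section 3.4.1.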
Also, by applying Theorem 23.1 in Devroye et al. (2013) and Lemma 39, we have similar conclusions as (11) and (12) in the following E sup S:S S |S| D 32[(D p ) log p + D(D + 3)/2 log n + 3 log 2 + 1] 32[log B2 + D(D + 3)/2 log n + 3 log 2 + 1] The explicit upper bound of misclassification rate for the QDA model follows when we plug these inequalities into the right-hand sides of (8) and (9). 3.4.3 k-nearest Neighbor (k NN) k NN method was firstly proposed by Fix (1951). Given x, it tries to mimic the regression function E[y|x] in the local region around x by using the average of k nearest neighbors. k NN is a non-parametric method and its success has been witnessed in a wide range of applications. Given training data {xi, yi}n i=1, for the new observation x and subspace S, rank the training data by the increasing ℓ2 distance in Euclidean space to x S as {xmi,S}n i=1 such that x S xm1,S 2 x S xm2,S 2 . . . x S xmn,S 2, where 2 represents the ℓ2 norm in the corresponding Euclidean space. Then the k NN classifier admits the following form: CS k NN n (x) = ( 1, 1 k Pk i=1 ymi > 0.5, 0, otherwise. By Devroye and Wagner (1979) and Cannings and Samworth (2017), it holds the following tail bound: sup S:S S |S| D S:S S |S| D P |ϵS n| > ϵ 8p D p exp nϵ3 108k(3D + 1) Then by Lemma 39 and similar to the analysis for deriving (12), it follows E sup S:S S |S| D |ϵS n| [108k(3D + 1)] 1 3 3 log 2 + (D p ) log p + 1 Ra SE: Random Subspace Ensemble Classification E|ϵn| [108k(3D + 1)] 1 3 3 log 2 + log B2 + 1 However, for k NN, due to its lack of parametric form, it is much more involved to derive a similar upper bound as those presented in Propositions 23 and 24. We decide to leave this analysis as future work. 3.5 Theoretical Analysis of Iterative Ra SE Recall that to control the misclassification rate when minimizing RIC, we showed in Section 3.3 that to control P(S1 S ), B2 needs to be sufficiently large. 
In particular, a sufficient condition regarding B2 was presented in Corollary 16. Since p p d p p D D p p + 1 condition (ii) in Corollary 16 implies B2 p p +1 D p , which could be very large for highdimensional settings. Next, we show the iterative Ra SE in Algorithm 2 can sometimes relax the requirement on B2 substantially. Different from the hierarchical uniform distribution over the subspaces in Ra SE (Algorithm 1), the iterative Ra SE in Algorithm 2 uses a non-uniform distribution from the second iteration. The non-uniform distribution works by assigning higher probabilities to subspaces that include the more frequently appeared variables among the B1 subspaces chosen in the previous step. We will show that the subspaces generated from such non-uniform distributions require a smaller B2 to cover S . In the following analysis, we study the iterative Ra SE algorithm that minimizes a general criterion Cr, which is a real-valued function on any subspace with its sample version denoted as Crn. To guarantee the success of Algorithm 2, we need the following conditions. Assumption 6 Suppose the following conditions are satisfied: (i) There exists a positive function of n, p, D called ν satisfying ν(n, p, D) = o(1), such that sup S:|S| D |Crn(S) Cr(S)| = Op(ν(n, p, D)). (ii) (Stepwise detectable condition) There exists a series {Mn} n=1 and a specific positive integer p p such that (a) for any feature subset S(1) S where | S(1) | p p , there exists S(2) S \ S(1) and |S(2) | = p such that Cr(S ) Cr(S) > Mnν(n, p, D) holds for any n, any S and S satisfying S S = S S , S S = S(1) S(2) , S S = S 1 S 2 where S 1 S(1) , S 2 S \ S(1) , |S 2| p ; Tian and Feng (b) the criterion satisfies inf S:|S| D S S Cr(S) sup S:|S| D S S Cr(S) > Mnν(n, p, D). (iii) We have D log log p log p. In Assumption 6, condition (i) provides a uniform bound for the specific criterion we use. 
Condition (ii)(a) is introduced to make Algorithm 2 detect additional signals in each iteration until all signals are covered. Condition (ii)(b) is imposed to help us find discriminative sets among B2 subspaces, and it holds for RIC by previous analysis in Section 3.2. Condition (iii) characterizes the requirement on the dimension p and the maximal subspace size D. Theorem 25 For Algorithm 2, B2 in the first step is set as and B2 in the following steps is set as D p p (log p)p B2 p where C0 is the same as in Algorithm 2. Also we set B1 such that B1 log p . If Assumption 6 holds, then after T iterations where e B1 p , as n, B1, B2 there holds P(S(T) 1 S ) 0. Next, we compare the requirements on B2 for iterative Ra SE with that for the vanilla Ra SE. For simplicity, we assume D, p , and p are constants. When p < p , iterative Ra SE requires B2 p p (log p)p , which is much weaker than the requirement B2 pp for vanilla Ra SE implied by Corollary 16. Using the results in Theorem 21, an upper bound for the error rate could be obtained. Also note that the rate of p is constrained by Assumption 6.(i), where we assume ν(n, p, D) = o(1). For example, with LDA and QDA model, by Lemmas 32 and 38 in Appendix B, under Assumptions 4 and 5 respectively, we have ν(n, p, D) = D2 r Therefore the constraint is D2 q Ra SE: Random Subspace Ensemble Classification 4. Computational and Practical Issues 4.1 Tuning Parameter Selection In Algorithm 1, there are five tuning parameters, including the number of weak learners B1, the number of candidate subspaces B2 to explore for each weak learner, the distribution D of subspaces, the criterion C for selecting the optimal subspace for each weak learner, and the threshold ˆα. If we set C as minimizing the RIC, the difference between the risk of Ra SE and Bayes risk, as well as the MC variance of Ra SE, vanish at an exponential rate when B1 , except for a finite set of thresholds α. 
This implies the Ra SE classifier becomes more accurate and stable as B1 increases. Regarding the impact of B2, by Corollary 16, Theorem 15 and Theorem 25, under the minimal RIC criterion with some conditions, as B2, n , the subspace chosen for each weaker learner recovers the minimal determinative set with high probability. By Theorem 21, the expectation of the misclassification rate becomes closer to the Bayes error as the sample size n and B2 increase, which motivates us to use a large B2 if we have sufficient computational power. However, when we choose minimizing training error as the criterion C to select the optimal subspace, Theorem 22 shows that the influence of B2 becomes more subtle. In our implementation, we set B1 = 200 and B2 = 500 as default. For LDA and QDA classifier, C is set to choose the optimal subspace by minimizing the RIC, while for k NN, the default setting is minimizing the leave-one-out cross-validation error. Without prior information about the features, as we mentioned in Section 2.1, D is set as the hierarchical uniform distribution over the subspaces. To generate the size d of subspaces from the uniform distribution over {1, . . . , D}, another parameter D has to be determined. In practice, for QDA base classifier we set D = min(p, [ n0], [ n1]) and for LDA, k NN and all the other base classifiers, we set D = min(p, [ n]), where [a] denotes the largest integer not larger than a. The threshold ˆα is chosen by (2) to minimize the training error. When using non-parametric estimate of RIC corresponding to (6), following Wang et al. (2009) and Ganguly et al. (2018), we set k0 = [ n0] and k1 = [ n1] to satisfy the conditions they presented for proving the consistency. 4.2 Computational Cost Ra SE is an ensemble framework, generating B1B2 subspaces in total following distribution D. 
If we use the uniform distribution introduced in the last section to generate one subspace, the time required equals to the time for sampling at most D features from p ones, which is O(p D). And the time for training the base model is denoted as Ttrain, which equals to O(n D2) for LDA and QDA base classifiers. Similarly, the time for predicting test data is denoted as Ttest, which equals to O(ntest D) for LDA base classifier, O(ntest D2) for QDA base classifier, and O(n ntest D) for k-NN base classifier. In total, the computation cost for the training process is O(B1B2Ttrain + B1B2 log B2) time. Here, O(B2 log B2) is the time needed to find the optimal subspace among B2 ones based on the sorting of their scores calculated under some criterion. The computation cost for prediction process is O(B1Ttest) and Ra SE algorithm takes approximately O(B1B2Ttrain + B1B2 log B2 + B1Ttest) for both model fitting and prediction. Tian and Feng In practice, this type of framework is very convenient to apply the parallel computing paradigm, making the computing quite fast. And for specific classifiers like LDA and QDA, we have simplified the RIC expression, which can be directly used to speed up calculation. Compared to the projection generation process in Cannings and Samworth (2017), Ra SE is more efficient since we only need to select features based on certain distribution without doing any complicated matrix operations. 4.3 Feature Ranking There are many powerful tools in statistics and machine learning for variable selection and feature ranking. For the sparse classification approaches like sparse-LDA and sparse-QDA (Mai et al., 2012; Jiang et al., 2018; Fan et al., 2012; Shao et al., 2011; Hao et al., 2018; Fan et al., 2015; Mai et al., 2015; Fan et al., 2016), or independent regularization approach like nearest shrunken centroids (Tibshirani et al., 2003), this is usually directly implied by the methodology. 
For model-free classification methods, however, it s not straightforward to rank features. For random forest, Breiman (2001) proposed a feature ranking method by randomly permuting the value of each feature and calculating the misclassification rate on the out-of-bag data for the new random forest. For Ra SE, as an ensemble framework, it s quite natural to rank variables by their frequencies of appearing in B1 subspaces corresponding to B1 weak learners. Following Corollary 16, as n, B2 , under some conditions for signal strength and the increasing rate of B2, by applying the criterion of minimizing RIC, the chosen subspace tends to cover the minimal discriminative set S with high probability, which intuitively illustrates why this idea works to rank variables. When we do not have sufficient computational resources to set a very large B2, as Theorem 25 indicates, under some conditions, the iterative Ra SE (Algorithm 2) can cover S with high probability after a few steps with a smaller B2. In practice, the frequencies of signals in S tend to increase after iterations, which can improve the performance of the Ra SE classifier and provide a better ranking. We will demonstrate this via extensive simulation studies in the next section. 5. Simulations and Real-data Experiments We use six simulation settings and four real data sets to demonstrate the effectiveness of the Ra SE method, coupled with RIC and leave-one-out cross-validation as the minimizing criterion to choose the optimal subspace. 
The performance of Ra SE classifiers with LDA, QDA, Gamma and k NN as base classifiers with different iteration numbers are compared with that of other competitors, including the standard LDA, QDA, and k NN classifiers, sparse LDA (s LDA) (Mai et al., 2012), regularization algorithm under marginality principle (RAMP) (Hao et al., 2018), nearest shrunken centroids (NSC) (Tibshirani et al., 2003), random forests (RF) (Breiman, 2001), and random projection ensemble classifier (RPEnsemble) (Cannings and Samworth, 2017). For the LDA model, we also implemented the non-parametric estimate for RIC to show its effectiveness and robustness. The standard LDA and QDA methods are implemented by using R package MASS. And the k NN classifier is implemented by knn, knn.cv in R package class and function knn3 in R package caret. We utilize package dsda to fit s LDA model. RAMP is implemented through package RAMP. For the RF, we use R package Random Forest; the number of trees Ra SE: Random Subspace Ensemble Classification are set as 500 (as default) and [ p] variables are randomly selected when training each tree (as default). And the NSC model is fitted by calling function pamr.train in package pamr. RPEnsemble is implemented by R package RPEnsemble. To obtain the MLE of parameters in the Gamma distribution, the Newton s method is applied via function nlm in package stats and the initial point is chosen to be the moment estimator. To get the non-parametric estimate of KL divergence and RIC, we call function KL.divergence in package FNN. When fitting the standard k NN classifier, and the k NN base classifier in RPEnsemble and Ra SE method, the number of neighbors k is chosen from {3, 5, 7, 9, 11} via leave-one-out cross-validation, following Cannings and Samworth (2017). For RAMP, the response type is set as binomial , for which the logistic regression with interaction effects is considered. In s LDA, the penalty parameter λ is chosen to minimize cross-validation error. 
In RPEnsemble method, LDA, QDA, k NN, are set as base classifiers with default parameters, and the number of weak learner B1 = 500 and the number of projection candidates for each weak learner B2 = 50 and the dimension of projected space d = 5. The projection generation method is set to be Haar . The criterion of choosing optimal projection is set to minimize training error for LDA and QDA and minimize leave-one-out cross-validation error for k NN. For the Ra SE method, for LDA, QDA, and independent Gamma classifier (to be illustrated later in Section 5.1.2), the criterion is set to be minimizing RIC, and for k-NN, the strategy of minimizing leave-one-out cross-validation error is applied. Other parameter settings in Ra SE are the same as in the last section. For simulations, the number of iterations T in Algorithm 2 is set to be 0, 1, 2, while for real-data experiments, we only consider Ra SE methods with 0 or 1 iteration. We write the iteration number on the subscript, and if it is zero, the subscript will be omitted. For example, Ra SE1-LDA represents the Ra SE classifier with T = 1 iteration and LDA base classifier. In addition, we use LDAn to denote the LDA base classifier with the non-parametric estimate of RIC. For all experiments, 200 replicates are considered, and the average test errors (percentage) are calculated based on them. The standard deviation of the test errors over 200 replicates is also calculated for each approach and written on the subscript. The approach with minimal test error for each setting is highlighted in bold, and methods that achieve test error within one standard deviation of the best one are highlighted in italics. Also, for all simulations and the madelon data set, the average selected percentage of features in B1 subspaces for the largest sample size setting in the Ra SE method in 200 replicates are presented, which provides a natural way for feature ranking. 
For the average selected percentage in the case of the smallest sample size, refer to Appendix A. To highlight the different behaviors of signals and noise features, we present the average selected percentages of all noise features as a box-plot marked with "N". The RaSE classifier competes favorably with existing classification methods: its misclassification rate is the lowest in 27 out of 30 (simulation and real-data) settings and within one standard deviation of the lowest in the remaining settings. 5.1 Simulations For the simulated data, model 1 follows model 1 in Mai et al. (2012), which is a sparse LDA-adapted setting. In model 2, a Gamma distribution with independent components is used for each class. Model 3 follows the setting of model 3 in Fan et al. (2015), which is a QDA-adapted model. The marginal class probabilities are set to π0 = π1 = 0.5 for the first three simulation models. Model 4 is motivated by the kNN algorithm, and its data generation process is introduced below. To test the robustness of RaSE, two non-sparse settings, models 1′ and 4′, are investigated as well: model 1′ has decaying signals in the LDA model, while model 4′ inherits from model 4 by increasing the number of signals to 30 with the signal strength decreased. For the simulations, we consider the corresponding signal ("Sig-") classifiers as a benchmark; these fit the correct model on the minimal discriminative set S∗, mimicking the behavior of the Bayes classifier when S∗ is sparse. 5.1.1 Models 1 and 1′ (LDA) First we consider a sparse LDA setting (model 1). Let x|y = r ~ N(μ^(r), Σ), r = 0, 1, where Σ = (Σ_ij)_{p×p} = (0.5^{|i−j|})_{p×p}, μ^(0) = 0_{p×1}, and μ^(1) = Σ × 0.556 (3, 1.5, 0, 0, 2, 0_{1×(p−5)})^T. Here p = 400, and the training sample size n ∈ {200, 400, 1000}. Test data of size 1000 are independently generated from the same model. As analyzed in Example 1, the feature subset {1, 2, 5} is the minimal discriminative set S∗.
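The model-1 design above can be sketched as follows. This is a minimal Python sketch under the stated parameters; `gen_model1` and the reduced dimension used in the demo are our own choices:

```python
import numpy as np

def gen_model1(n, p=400, seed=0):
    """Sketch of model 1: x|y=r ~ N(mu^(r), Sigma) with Sigma_ij = 0.5^|i-j|,
    mu^(0) = 0 and mu^(1) = Sigma @ (0.556 * beta), beta supported on {1, 2, 5}."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]   # minimal discriminative set {1, 2, 5}
    mu1 = Sigma @ (0.556 * beta)
    L = np.linalg.cholesky(Sigma)       # sample correlated Gaussians
    y = rng.binomial(1, 0.5, n)         # pi_0 = pi_1 = 0.5
    x = rng.standard_normal((n, p)) @ L.T + np.outer(y, mu1)
    return x, y

X, y = gen_model1(100, p=50)            # reduced p for a quick demo
```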
The left panel of Table 1 presents the performance of various methods on model 1 for different sample sizes. RaSE1-LDAn performs the best when the sample size is n = 200 and 400, sLDA achieves performances similar to the best classifier in each setting, and RaSE2-QDA ranks at the top when n = 1000. Since this model is very sparse, the default value of B2 cannot guarantee that the minimal discriminative set is selected; the iterative version of RaSE therefore improves the performance considerably. NSC also achieves a comparably small misclassification rate when n = 200. In Figure 1, the average selected percentages of the 400 features over 200 replicates when n = 1000 are presented. Note that after two iterations the three signals are captured by almost all B1 = 200 subspaces for all three base classifiers, and all noise features are rarely selected across the B1 subspaces except when the non-parametric estimate of RIC is applied. Next, we consider a non-sparse LDA model (model 1′). Let μ^(1) = Σ (0.9, 0.9², . . . , 0.9⁵⁰, 0_{1×(p−50)})^T and keep the other parameters the same as above. Now S∗ contains the first 50 features. Under this non-sparse setting, as the right panel of Table 1 shows, although most methods obtain similar error rates, RaSE1-LDAn achieves the best performance when n = 200 and 400, and RaSE1-kNN performs the best when n = 1000. From the table, it can be seen that despite the non-sparse design, the iterations still improve the performance of RaSE. An interesting phenomenon is observed in Figure 2, which exhibits the average selected percentages for model 1′: the selected percentages decay as the signal strength decreases, except for the first feature. One possible reason is that the marginal discriminative powers of features 2 and 3 are the strongest among all features due to the specific correlation structure in our setting.
5.1.2 Model 2 (Gamma distribution) In this model we investigate the Gamma distribution with independent features, which is rarely studied in the literature. The j-th coordinate of x|y = r follows a Gamma distribution (see (13) below).

Method       | Model 1: n=200  n=400         n=1000       | Model 1′: n=200  n=400        n=1000
RaSE-LDA     | 12.99 (1.42)  12.38 (1.20)  11.16 (1.10)   | 21.43 (4.64)  19.56 (3.67)  17.98 (2.74)
RaSE-LDAn    | 13.40 (1.31)  13.23 (1.20)  12.64 (1.11)   | 21.64 (4.22)  20.25 (3.01)  19.42 (2.59)
RaSE-QDA     | 13.78 (1.26)  13.69 (1.19)  13.23 (1.04)   | 22.36 (3.96)  20.46 (2.83)  19.83 (2.51)
RaSE-kNN     | 13.21 (1.48)  12.57 (1.23)  11.19 (1.10)   | 19.98 (3.74)  18.21 (2.89)  16.97 (2.10)
RaSE1-LDA    | 11.48 (1.26)  10.42 (1.07)  10.25 (1.08)   | 20.15 (3.65)  18.39 (2.69)  17.20 (2.14)
RaSE1-LDAn   | 10.58 (1.15)  10.28 (1.01)  10.16 (1.04)   | 18.21 (2.92)  17.55 (2.26)  16.86 (1.74)
RaSE1-QDA    | 11.28 (1.47)  10.76 (1.25)  10.40 (1.22)   | 21.24 (4.75)  19.65 (3.56)  18.23 (2.51)
RaSE1-kNN    | 11.11 (1.25)  10.58 (1.09)  10.33 (1.05)   | 18.90 (3.16)  17.75 (2.51)  16.80 (1.98)
RaSE2-LDA    | 12.62 (1.45)  11.41 (1.16)  10.13 (1.03)   | 21.78 (3.44)  19.22 (2.54)  17.53 (1.98)
RaSE2-LDAn   | 10.90 (1.16)  10.42 (0.96)  10.17 (1.06)   | 18.75 (2.87)  17.95 (2.19)  17.30 (1.70)
RaSE2-QDA    | 11.74 (1.45)  10.39 (1.01)  10.09 (1.07)   | 21.84 (5.03)  20.00 (3.62)  19.04 (2.57)
RaSE2-kNN    | 11.17 (1.31)  10.60 (0.99)  10.34 (1.02)   | 18.96 (3.10)  17.77 (2.29)  17.19 (1.78)
RP-LDA       | 17.26 (1.53)  15.03 (1.24)  13.37 (1.09)   | 25.66 (1.98)  23.85 (1.79)  21.99 (1.50)
RP-QDA       | 17.96 (1.60)  15.26 (1.34)  13.50 (1.12)   | 26.41 (2.04)  24.03 (1.88)  22.06 (1.51)
RP-kNN       | 18.54 (1.60)  16.15 (1.21)  14.23 (1.21)   | 27.03 (2.27)  25.06 (1.76)  23.01 (1.55)
LDA          | —             46.39 (2.62)  18.62 (1.53)   | —             47.81 (2.87)  27.44 (1.95)
QDA          | —             —             47.74 (1.73)   | —             —             48.90 (1.65)
kNN          | 29.08 (2.77)  26.73 (1.97)  24.53 (1.60)   | 35.67 (2.59)  34.07 (2.33)  32.36 (1.72)
sLDA         | 10.80 (1.26)  10.48 (1.16)  10.23 (1.07)   | 18.80 (3.19)  17.62 (2.28)  17.24 (1.81)
RAMP         | 14.09 (2.67)  10.56 (1.33)  10.10 (1.04)   | 21.91 (5.49)  19.22 (3.25)  17.55 (2.04)
NSC          | 11.50 (1.13)  11.50 (0.97)  11.67 (1.09)   | 18.49 (1.99)  19.02 (1.78)  19.41 (1.39)
RF           | 12.66 (1.43)  11.77 (1.04)  11.25 (1.10)   | 21.33 (3.01)  20.00 (2.10)  19.25 (1.51)
Sig-LDA      | 10.07 (0.94)  10.09 (0.95)  10.07 (1.03)   | 23.70 (3.14)  20.90 (2.27)  18.95 (1.64)
(— : not applicable.)
Table 1: Error rates (%) for models 1 and 1′; standard deviations over 200 replicates in parentheses.

Gamma(α_j^{(r)}, β_j^{(r)}), which has the density function

f_j^{(r)}(x; \alpha_j^{(r)}, \beta_j^{(r)}) = \frac{1}{(\beta_j^{(r)})^{\alpha_j^{(r)}} \Gamma(\alpha_j^{(r)})} x^{\alpha_j^{(r)} - 1} \exp\{-x/\beta_j^{(r)}\} 1(x \ge 0), \quad j = 1, \ldots, p, \; r = 0, 1. \quad (13)

Denote α^{(r)} = (α_1^{(r)}, . . . , α_p^{(r)})^T and β^{(r)} = (β_1^{(r)}, . . . , β_p^{(r)})^T. Here we let α^{(0)} = (2, 1.5, 1.5, 2, 2, 1_{1×(p−5)})^T, α^{(1)} = (2.5, 1.5, 1.5, 1, 1, 1_{1×(p−5)})^T, β^{(0)} = (1.5, 3, 1, 1, 1, 1_{1×(p−5)})^T, β^{(1)} = (2, 1, 3, 1, 1, 1_{1×(p−5)})^T, p = 400, and n ∈ {100, 200, 400}. Hence the minimal discriminative set S∗ is {1, 2, 3, 4, 5}, by Proposition 3. MLEs of α^{(0)}, α^{(1)}, β^{(0)}, β^{(1)} can be obtained by numerical approaches such as gradient descent or Newton's method, and the marginal class probabilities are estimated by the proportions of the two classes in the training sample. The Bayes classifier estimated from these MLEs is then applied to classify new observations; it is denoted as the independent Gamma classifier in Table 2. For this example, we also apply RaSE with the independent Gamma classifier as one of the base classifiers.

[Figure 1: Average selected percentages of features for model 1 in 200 replicates when n = 1000.]

According to (3) and (13), the decision function of the independent Gamma classifier estimated in subspace S is

C_n^{S,\text{Gamma}}(x) = 1\left\{ \frac{\hat\pi_1 \prod_{j \in S} f_j^{(1)}(x_j; \hat\alpha_j^{(1)}, \hat\beta_j^{(1)})}{\hat\pi_1 \prod_{j \in S} f_j^{(1)}(x_j; \hat\alpha_j^{(1)}, \hat\beta_j^{(1)}) + \hat\pi_0 \prod_{j \in S} f_j^{(0)}(x_j; \hat\alpha_j^{(0)}, \hat\beta_j^{(0)})} > 0.5 \right\},

where \hat\pi_1, \hat\pi_0, \hat\alpha_j^{(1)}, \hat\alpha_j^{(0)}, \hat\beta_j^{(1)}, \hat\beta_j^{(0)} are the corresponding MLEs. This is also a very sparse model.
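The Gamma MLE fitting and the independent Gamma decision rule can be sketched as follows. scipy's `gamma.fit` (with the location fixed at 0) stands in for the Newton iteration via `nlm`, and the two-feature toy data are our own:

```python
import numpy as np
from scipy import stats

def fit_gamma_params(X):
    """Per-feature Gamma MLE (shape, loc=0, scale) for one class."""
    return [stats.gamma.fit(col, floc=0) for col in X.T]

def gamma_classify(x, par0, par1, pi0=0.5, pi1=0.5):
    """Independent Gamma decision rule: label 1 iff the estimated posterior
    probability of class 1 exceeds 0.5, computed on the log scale."""
    ll0 = sum(stats.gamma.logpdf(xj, a, loc, s) for xj, (a, loc, s) in zip(x, par0))
    ll1 = sum(stats.gamma.logpdf(xj, a, loc, s) for xj, (a, loc, s) in zip(x, par1))
    return int(np.log(pi1) + ll1 > np.log(pi0) + ll0)

# toy usage: two features, classes differ in the first feature's shape
rng = np.random.default_rng(0)
X0 = np.column_stack([rng.gamma(2.0, 1.5, 300), rng.gamma(1.0, 1.0, 300)])
X1 = np.column_stack([rng.gamma(5.0, 1.5, 300), rng.gamma(1.0, 1.0, 300)])
par0, par1 = fit_gamma_params(X0), fit_gamma_params(X1)
acc = np.mean([gamma_classify(x, par0, par1) == r
               for r, X in ((0, X0), (1, X1)) for x in X])
```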
Therefore the iteration process improves the RaSE method, yielding a lower misclassification rate. The left panel of Table 2 shows the performance of various methods on this model. It demonstrates that RaSE1-Gamma performs the best when n = 100, while RaSE2-Gamma incurs the lowest misclassification rate with a low standard deviation in the other two settings.

[Figure 2: Average selected percentages of features for model 1′ in 200 replicates when n = 1000.]

Figure 3 shows the average selected percentages of features when n = 400, from which we can see that, due to the high sparsity, the default B2 is not sufficient and makes it hard for the vanilla RaSE classifier to capture all the features in S∗. After iterations, the frequencies of the features in the minimal discriminative set increase significantly, and S∗ can be easily identified for all three base classifiers after two iterations.

[Figure 3: Average selected percentages of features for model 2 in 200 replicates when n = 400.]

5.1.3 Model 3 (QDA) Let x|y = r ~ N(μ^(r), Σ^(r)), r = 0, 1, where Ω^(0) = (Σ^(0))⁻¹ is a p × p band matrix with (Ω^(0))_ii = 1, (Ω^(0))_ik = 0.3 for |i − k| = 1, and all other entries zero.
Ω^(1) = Ω^(0) + Ω̄, where Ω̄ is a p × p sparse symmetric matrix with Ω̄_{10,10} = 0.3758, Ω̄_{10,30} = 0.0616, Ω̄_{10,50} = 0.2037, Ω̄_{30,30} = 0.5482, Ω̄_{30,50} = 0.0286, Ω̄_{50,50} = 0.4614, and all other entries zero. Here p = 200 and n ∈ {200, 400, 1000}. As analyzed in Example 2, the minimal discriminative set is S∗ = {1, 2, 10, 30, 50}, where features 1 and 2 carry linear signals (main effects) and features 10, 30, and 50 carry quadratic signals. The right panel of Table 2 shows the results: RaSE1-QDA achieves the best performance when n = 200, and RaSE2-QDA has the lowest test misclassification rate when n = 400 and 1000, which is reasonable since the data are generated from a QDA-adapted model. RaSE1-kNN and RaSE2-kNN also do a good job when n is large. Figure 4 presents the average selected percentages of features when n = 1000, which shows that the frequencies of the elements of the minimal discriminative set increase after iterations. When the base classifier is QDA or kNN, all five features stand out. On the other hand, RaSE with the LDA base classifier only captures the two linear signals, which is expected since LDA does not consider the quadratic terms.
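The model-3 design can be sketched as follows. `model3_precisions` and `gen_model3` are our own names, the signs of the perturbation entries are taken as printed, and the class means are omitted in this sketch:

```python
import numpy as np

def model3_precisions(p=200):
    """Precision matrices of model 3. Omega^(0) is banded (1 on the diagonal,
    0.3 on the first off-diagonals); Omega^(1) adds a sparse symmetric bump on
    features 10, 30, 50. Entry signs follow the extracted text as printed."""
    Omega0 = np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
    Omega1 = Omega0.copy()
    bump = {(9, 9): 0.3758, (9, 29): 0.0616, (9, 49): 0.2037,
            (29, 29): 0.5482, (29, 49): 0.0286, (49, 49): 0.4614}  # 0-based
    for (i, j), v in bump.items():
        Omega1[i, j] += v
        if i != j:
            Omega1[j, i] += v
    return Omega0, Omega1

def gen_model3(n, p=200, seed=0):
    """Draw x|y=r ~ N(0, (Omega^(r))^{-1}); class means are omitted here
    (the paper also places linear signals on features 1 and 2)."""
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, 0.5, n)
    x = np.empty((n, p))
    for r, Om in enumerate(model3_precisions(p)):
        idx = np.where(y == r)[0]
        L = np.linalg.cholesky(np.linalg.inv(Om))
        x[idx] = rng.standard_normal((idx.size, p)) @ L.T
    return x, y

Om0, Om1 = model3_precisions(60)    # reduced p for a quick demo
X, y = gen_model3(50, p=60)
```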
Method³      | Model 2: n=100  n=200        n=400         | Model 3: n=200  n=400        n=1000
RaSE-LDA       | 21.51 (3.42)  18.52 (2.75)  17.43 (2.00) | 37.30 (3.17)  36.11 (1.97)  35.67 (1.73)
RaSE-Gamma/QDA | 23.57 (3.65)  21.41 (3.17)  20.29 (2.43) | 32.52 (2.90)  30.44 (2.60)  29.00 (1.97)
RaSE-kNN       | 22.83 (3.09)  21.07 (3.07)  20.47 (2.57) | 31.10 (3.23)  27.83 (2.41)  25.22 (1.56)
RaSE1-LDA      | 19.01 (2.83)  15.14 (1.75)  13.55 (1.20) | 36.09 (2.87)  32.82 (1.74)  32.68 (1.49)
RaSE1-Gamma/QDA| 15.05 (2.31)  13.02 (1.51)  12.50 (1.18) | 26.83 (2.47)  25.07 (1.89)  23.53 (1.50)
RaSE1-kNN      | 19.84 (2.90)  16.64 (1.86)  15.33 (1.39) | 28.76 (2.60)  25.88 (1.98)  24.18 (1.47)
RaSE2-LDA      | 20.88 (3.03)  16.92 (2.05)  13.66 (1.14) | 38.09 (2.48)  33.69 (1.83)  32.71 (1.55)
RaSE2-Gamma/QDA| 16.27 (2.22)  12.64 (1.31)  11.83 (1.02) | 26.99 (2.68)  24.87 (1.99)  23.11 (1.60)
RaSE2-kNN      | 22.39 (3.13)  17.40 (2.27)  14.59 (1.44) | 28.73 (2.56)  25.46 (1.82)  23.76 (1.54)
RP-LDA         | 38.89 (1.96)  35.00 (1.97)  30.89 (1.76) | 44.90 (1.86)  42.82 (1.76)  40.38 (1.74)
RP-QDA         | 43.62 (3.62)  37.62 (2.35)  33.02 (2.14) | 43.02 (2.07)  39.87 (1.88)  36.38 (1.78)
RP-kNN         | 41.48 (2.26)  38.90 (2.06)  36.69 (1.85) | 44.32 (1.81)  42.46 (1.58)  40.80 (2.12)
LDA            | —             47.50 (2.35)  49.03 (1.94) | —             42.88 (1.82)  38.68 (1.70)
Gamma/QDA      | 32.06 (2.40)  26.22 (1.80)  21.56 (1.44) | —             —             45.13 (1.58)
kNN            | 45.48 (2.24)  44.68 (2.14)  44.07 (2.00) | 45.67 (1.78)  44.63 (2.02)  43.43 (1.63)
sLDA           | 22.26 (3.52)  18.64 (2.12)  15.34 (1.55) | 36.41 (3.15)  33.87 (2.01)  32.99 (1.52)
RAMP           | 20.64 (3.81)  16.72 (2.25)  13.31 (1.21) | 36.94 (5.87)  32.65 (1.89)  32.42 (1.78)
NSC            | 26.00 (6.49)  19.92 (4.31)  16.87 (2.69) | 41.14 (4.49)  38.24 (3.85)  35.13 (2.20)
RF             | 24.97 (5.74)  18.02 (2.77)  15.26 (1.47) | 37.34 (2.91)  31.61 (2.19)  27.42 (1.60)
Sig-Gamma/QDA  | 12.65 (1.12)  12.12 (1.12)  11.76 (0.97) | 23.62 (1.47)  22.72 (1.40)  22.16 (1.31)
(— : not applicable.)

Table 2: Error rates (%) for models 2 and 3; standard deviations over 200 replicates in parentheses.

5.1.4 Models 4 and 4′ (kNN) As in the LDA models, we first study a sparse setting (model 4) with a data generating process motivated by the kNN classifier. First, 10 initial points z1, . . . , z10 are generated i.i.d. from N(0_{p×1}, I_p); five of them are labeled 0 and the other five are labeled 1. Then, each time, one of {z1, . . . , z10} is randomly selected (say z_{k_i}), and we generate x_i ~ N((z_{k_i,S∗}^T, 0_{1×(p−5)})^T, 0.5² I_p).
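The model-4 generation process can be sketched as follows. This is a minimal Python sketch; `gen_model4` and the reduced dimension in the demo are our own choices:

```python
import numpy as np

def gen_model4(n, p=200, seed=0):
    """Sketch of model 4: ten centers z_1..z_10 (five per class) are drawn
    once; each observation picks a random center, embeds its five signal
    coordinates in the first five features, and adds N(0, 0.5^2 I_p) noise."""
    rng = np.random.default_rng(seed)
    centers = rng.standard_normal((10, 5))
    labels = np.array([0] * 5 + [1] * 5)
    k = rng.integers(0, 10, n)             # which center each point uses
    x = 0.5 * rng.standard_normal((n, p))  # sd 0.5 everywhere
    x[:, :5] += centers[k]
    return x, labels[k]

X, y = gen_model4(120, p=30)               # reduced p for a quick demo
```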
Here the minimal discriminative set is S∗ = {1, 2, 3, 4, 5}, p = 200, and n ∈ {200, 400, 1000}. The results are presented in Table 3, and the average selected percentages of features in the B1 = 200 subspaces are presented in Figure 5. From the left panel of Table 3, it can be seen that the performance of RaSE2-kNN is striking: it outperforms all the other methods, and the gap between its misclassification rate and the others' is very prominent. Note that in this case the Sig-kNN classifier is obtained by applying kNN to only the first five features, also using leave-one-out cross-validation to choose k from {3, 5, 7, 9, 11}.

3. To save space, in the rows labeled "Gamma/QDA", the entry represents the independent Gamma classifier for model 2 and QDA for model 3.

[Figure 4: Average selected percentages of features for model 3 in 200 replicates when n = 1000.]

Note that Sig-kNN is not the optimal classifier, because no true parametric model is available, which explains why RaSE2-kNN can achieve an even better performance when n = 400 and 1000. Figure 5 shows that RaSE with all three base classifiers captures the features in the minimal discriminative set. Now we study a non-sparse setting (model 4′), where the number of signals is increased to 30 and each x_i ~ N((z_{k_i,S∗}^T, 0_{1×(p−30)})^T, 2 I_p). The other parameters and the data generation mechanism are the same as in model 4. From the right panel of Table 3, we observe that RaSE2-kNN still achieves the best performance, and the iterations improve the performance of the RaSE classifiers under this non-sparse setting.
In addition, Figure 6 shows that RaSE captures the signals while the noise features keep a low selected percentage, which again verifies the robustness of RaSE.

Method     | Model 4: n=200  n=400        n=1000         | Model 4′: n=200  n=400       n=1000
RaSE-LDA   | 27.38 (9.53)  25.08 (8.40)  24.92 (9.05)    | 26.50 (5.78)  23.84 (4.98)  20.92 (5.20)
RaSE-QDA   | 24.24 (7.20)  22.62 (6.77)  22.04 (6.73)    | 29.28 (4.34)  26.73 (3.63)  24.77 (3.35)
RaSE-kNN   | 13.26 (5.03)  10.67 (4.44)  8.85 (4.05)     | 20.83 (4.50)  15.70 (3.74)  10.33 (2.90)
RaSE1-LDA  | 25.89 (10.13) 23.05 (8.49)  23.42 (8.95)    | 21.28 (4.76)  18.59 (3.97)  16.87 (3.92)
RaSE1-QDA  | 13.83 (5.88)  12.54 (5.79)  12.70 (5.51)    | 21.97 (5.37)  19.32 (4.65)  15.84 (4.36)
RaSE1-kNN  | 7.51 (3.80)   6.16 (3.48)   5.90 (3.12)     | 16.24 (3.52)  11.09 (2.84)  7.08 (2.04)
RaSE2-LDA  | 27.49 (10.13) 23.39 (8.51)  23.39 (9.05)    | 20.73 (4.99)  17.34 (3.99)  15.72 (3.93)
RaSE2-QDA  | 13.15 (5.00)  11.90 (5.38)  12.15 (5.22)    | 20.94 (5.06)  18.61 (4.61)  14.96 (4.11)
RaSE2-kNN  | 7.06 (3.62)   5.89 (3.32)   5.74 (3.02)     | 13.62 (3.31)  8.64 (2.40)   5.36 (1.58)
RP-LDA     | 28.03 (8.91)  25.48 (7.80)  24.83 (8.09)    | 22.16 (4.49)  18.84 (4.42)  16.84 (4.10)
RP-QDA     | 26.22 (7.66)  23.93 (6.97)  22.75 (7.00)    | 21.37 (4.29)  17.58 (3.84)  15.53 (3.73)
RP-kNN     | 26.63 (8.09)  24.49 (7.15)  23.32 (7.35)    | 22.37 (4.69)  18.69 (4.19)  16.50 (3.95)
LDA        | 47.51 (2.66)  33.27 (7.65)  27.89 (8.97)    | 46.06 (3.05)  25.16 (4.12)  17.78 (4.05)
QDA        | —             —             36.70 (4.83)    | —             —             30.45 (2.89)
kNN        | 24.49 (6.64)  21.04 (6.60)  19.07 (6.50)    | 24.73 (4.35)  20.06 (3.95)  15.91 (3.59)
sLDA       | 24.90 (9.41)  22.80 (8.19)  23.22 (8.79)    | 19.78 (4.92)  16.41 (3.98)  14.59 (3.69)
RAMP       | 22.14 (11.50) 15.01 (7.83)  12.82 (7.13)    | 24.79 (6.49)  18.59 (4.56)  13.13 (3.13)
NSC        | 27.70 (9.44)  25.71 (8.35)  25.82 (8.72)    | 22.09 (6.20)  18.17 (4.51)  16.17 (4.09)
RF         | 23.64 (8.05)  17.39 (6.19)  14.84 (5.73)    | 21.73 (5.44)  15.70 (3.65)  11.54 (2.80)
Sig-kNN    | 6.89 (3.40)   6.03 (3.37)   6.01 (3.21)     | 7.60 (2.19)   5.40 (1.76)   4.06 (1.43)
(— : not applicable.)

Table 3: Error rates (%) for models 4 and 4′; standard deviations over 200 replicates in parentheses.

5.2 Real-data Experiments 5.2.1 Madelon The madelon data set (http://archive.ics.uci.edu/ml/datasets/madelon) is an artificial data set containing data points grouped in 32 clusters placed on the vertices of a five-dimensional hypercube and randomly labeled 0 or 1 (Guyon et al., 2005).
It consists of 2000 observations, 1000 of class 0 and 1000 of class 1. There are 500 features, among which only 20 are informative; the others have no predictive power. The training sample size is set to n ∈ {200, 500, 1000} for three different settings, and in each case the remaining data are used as test data. 200 replicates are run, and the average misclassification rate and standard deviation of each method are reported in Table 4. Figure 7 presents the average selected percentage of features in different RaSE models when n = 1000. From the left panel of Table 4, we can see that RaSE1-kNN outperforms all the other methods in all three settings. Figure 7 also shows that the RaSE model leads to sparse solutions, since most features have selected percentages close to zero.

[Figure 5: Average selected percentages of features for model 4 in 200 replicates when n = 1000.]

5.2.2 Musk The musk data set (https://archive.ics.uci.edu/ml/datasets/Musk+(Version+2)) contains 6598 observations with 5581 non-musk (class 0) and 1017 musk (class 1) molecules. The molecules are classified based on p = 166 shape measurements (Dua and Graff, 2019). The training sample size is set to 200, 500, or 1000; in each setting, the remaining observations are used as test data. 200 replicates are considered, and the average misclassification rates and standard deviations are reported in Table 4.
When the training sample size is 200, RP-kNN achieves the lowest misclassification rate, and RaSE-LDA, RaSE-kNN, RaSE1-LDA, RaSE1-kNN, sLDA, RP-QDA, and RF also perform well. As the sample size increases, RaSE-kNN becomes the best when n = 500, with RF yielding a comparable performance; when n = 1000, RaSE1-kNN and RF outperform the other methods. 5.2.3 Mice Protein Expression The mice protein expression data set (https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) contains 1080 instances with 570 healthy mice (class 0) and 510 mice with Down syndrome (class 1).

[Figure 6: Average selected percentages of features for model 4′ in 200 replicates when n = 1000.]

There are 77 features representing the expression levels of 77 different proteins (Higuera et al., 2015). Training samples of size 200, 500, and 800 are considered, and the remaining observations are used as test data. The averages and standard deviations of the test misclassification rates over 200 replicates are reported in Table 5. When n = 200, sLDA achieves the lowest error among all approaches, and RaSE-kNN achieves a similar performance. As the sample size increases to 500 and 800, the average misclassification rates of RaSE-kNN and RaSE1-kNN decrease significantly, and they become the best classifiers when n = 500 and 800, respectively. When n = 800, RP-kNN and RF also achieve a similar performance. 5.2.4 Hand-written Digits Recognition The hand-written digits recognition data set (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) consists of features of hand-written numerals (0-9) extracted from a collection of Dutch utility maps (Dua and Graff, 2019).
Here we use the mfeat-fou data set, which records 76 Fourier coefficients of the character shapes. We extract the observations corresponding to digits 7 (class 0) and 9 (class 1) from the original data, giving 400 observations, 200 in each class.

Method     | Madelon: n=200  n=500        n=1000         | Musk: n=200   n=500         n=1000
RaSE-LDA   | 43.39 (3.53)  39.87 (1.63)  39.16 (1.29)    | 10.55 (1.22)  8.86 (0.77)   7.87 (0.46)
RaSE-QDA   | 43.83 (3.28)  40.58 (1.73)  40.01 (1.57)    | 12.55 (7.02)  9.16 (0.97)   7.77 (0.73)
RaSE-kNN   | 35.02 (2.52)  26.78 (2.71)  21.45 (1.91)    | 10.26 (1.81)  7.33 (0.96)   5.76 (0.74)
RaSE1-LDA  | 45.98 (2.49)  39.76 (1.81)  38.57 (1.11)    | 10.56 (1.19)  8.88 (0.81)   7.82 (0.46)
RaSE1-QDA  | 44.46 (5.36)  37.21 (2.85)  34.63 (2.15)    | 16.71 (9.58)  11.02 (2.99)  8.60 (0.85)
RaSE1-kNN  | 26.05 (3.33)  16.66 (1.33)  13.57 (1.00)    | 10.52 (1.95)  7.49 (0.97)   5.71 (0.78)
RP-LDA     | 41.18 (1.77)  39.86 (1.14)  39.41 (1.17)    | 12.58 (1.86)  10.19 (0.76)  9.50 (0.46)
RP-QDA     | 39.97 (1.53)  39.29 (1.57)  38.79 (1.49)    | 10.70 (1.95)  8.75 (0.83)   8.18 (0.45)
RP-kNN     | 40.10 (1.66)  39.07 (1.54)  38.40 (1.43)    | 10.01 (1.66)  8.05 (1.06)   6.89 (0.84)
LDA        | —             49.82 (1.24)  47.60 (1.49)    | 25.60 (3.76)  9.06 (0.80)   6.97 (0.40)
QDA        | —             —             —               | —             —             —
kNN        | 35.89 (1.42)  31.84 (1.43)  28.56 (1.56)    | 11.35 (1.55)  8.19 (0.86)   6.53 (0.55)
sLDA       | 43.01 (2.98)  40.40 (1.94)  39.57 (1.52)    | 10.68 (1.54)  7.81 (0.68)   6.85 (0.40)
RAMP       | 49.09 (3.46)  41.88 (5.16)  38.58 (1.36)    | 14.35 (1.93)  11.50 (1.40)  9.72 (1.10)
NSC        | 41.49 (2.32)  40.10 (1.09)  40.05 (1.24)    | 20.72 (5.11)  24.12 (3.62)  25.58 (1.01)
RF         | 43.91 (2.49)  38.59 (1.56)  34.17 (1.46)    | 10.83 (1.44)  7.60 (0.66)   5.71 (0.48)
(— : not applicable.)
Table 4: Error rates (%) for the madelon and musk data sets; standard deviations over 200 replicates in parentheses.

[Figure 7: Average selected percentages of features for the madelon data set in 200 replicates when n = 1000.]

Method     | Mice protein: n=200  n=500   n=800          | Digits: n=50  n=100         n=200
RaSE-LDA   | 7.41 (1.14)   5.70 (0.93)   4.65 (1.24)     | 1.56 (0.85)   1.13 (0.59)   0.80 (0.54)
RaSE-QDA   | 9.14 (2.58)   4.81 (1.17)   3.44 (1.23)     | 2.50 (1.47)   1.89 (0.91)   1.47 (0.96)
RaSE-kNN   | 6.80 (1.88)   1.55 (0.88)   0.62 (0.55)     | 1.86 (0.96)   1.12 (0.66)   0.75 (0.45)
RaSE1-LDA  | 7.24 (1.10)   5.53 (1.02)   4.49 (1.23)     | 1.06 (0.63)   0.70 (0.35)   0.53 (0.40)
RaSE1-QDA  | 9.38 (2.21)   5.16 (1.16)   3.40 (1.16)     | 2.18 (1.66)   1.18 (0.71)   0.85 (0.61)
RaSE1-kNN  | 7.43 (2.00)   1.70 (0.87)   0.60 (0.56)     | 1.72 (0.95)   1.02 (0.62)   0.60 (0.44)
RP-LDA     | 24.84 (2.91)  22.79 (2.50)  22.34 (2.55)    | 1.75 (1.29)   1.22 (0.67)   1.04 (0.61)
RP-QDA     | 18.31 (2.57)  16.19 (2.08)  15.66 (2.38)    | 2.12 (1.67)   1.28 (0.94)   0.92 (0.62)
RP-kNN     | 11.77 (2.54)  2.57 (0.89)   0.92 (0.68)     | 1.68 (1.34)   1.03 (0.60)   0.84 (0.59)
LDA        | 7.07 (1.37)   3.88 (0.85)   3.13 (1.08)     | —             1.82 (0.96)   1.01 (0.56)
QDA        | —             —             3.25 (2.32)     | —             —             —
kNN        | 20.53 (2.47)  7.75 (1.44)   2.80 (1.21)     | 1.42 (1.32)   0.67 (0.41)   0.60 (0.47)
sLDA       | 5.70 (1.10)   3.95 (0.91)   3.13 (1.05)     | 2.30 (1.36)   1.71 (1.27)   1.15 (0.95)
RAMP       | 11.76 (2.42)  8.52 (1.69)   7.02 (1.89)     | 3.31 (1.75)   2.26 (1.19)   1.70 (0.87)
NSC        | 30.49 (3.31)  29.70 (2.76)  29.88 (3.02)    | 3.22 (1.44)   3.53 (1.23)   3.59 (1.50)
RF         | 8.32 (1.71)   2.62 (0.94)   1.04 (0.73)     | 2.34 (1.24)   1.63 (0.73)   1.37 (0.74)
(— : not applicable.)

Table 5: Error rates (%) for the mice protein expression and hand-written digits recognition data sets; standard deviations over 200 replicates in parentheses.

The training samples of size 50, 100, and 200 are used, and the remaining data serve as test data.
The averages and standard deviations of the test misclassification rates are reported in Table 5, from which it can be seen that RaSE1-LDA enjoys the minimal test misclassification rate when n = 50 and 200, while the standard kNN method is the best when n = 100. We also note that all the RaSE classifiers improve after one iteration in all three settings, implying that the underlying classification problem may be a sparse one. 6. Discussion 6.1 Summary In this work, we introduce a flexible ensemble classification framework named RaSE, designed for sparse classification problems. To select the optimal subspace for each weak learner, we define a new information criterion, the ratio information criterion (RIC), based on Kullback-Leibler divergence, and show that it achieves screening consistency and weak consistency under general model conditions. This guarantees that, with high probability, each weak learner is trained using features from the minimal discriminative set when the sample size and the number of random subspaces are sufficiently large. We also investigate the consistency of RIC for LDA and QDA models under specific conditions, and conduct a theoretical analysis of RaSE classifiers with specific base classifiers. In addition, we present two versions of the RaSE algorithm: the vanilla version (RaSE) and the iterative version (RaSE_T). We also apply RaSE to feature ranking based on the average selected percentage of features across subspaces. Theoretical analysis shows that when the stepwise detectable condition holds and the signal is sufficiently sparse, iterative RaSE covers the minimal discriminative set with high probability after a few iterations, with a smaller required number of random subspaces than for vanilla RaSE. Numerous numerical experiments, on both simulated and real data, verify that RaSE is a favorable classification method for sparse classification problems.
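As a small illustration, the closed-form RIC for the LDA model (Proposition 8 in the appendix), RIC_n(S) = −(μ̂_S^(1) − μ̂_S^(0))ᵀ Σ̂_{S,S}⁻¹ (μ̂_S^(1) − μ̂_S^(0)) + c_n(|S| + 1), can be computed directly. This sketch uses a pooled MLE covariance and synthetic data; `ric_lda` is our own name:

```python
import numpy as np

def ric_lda(X, y, S, cn):
    """RIC for the LDA model on subspace S (Proposition 8):
    -(mu1 - mu0)' Sigma^{-1} (mu1 - mu0) + cn * (|S| + 1),
    with the pooled (MLE) covariance on the selected coordinates."""
    S = np.asarray(S)
    X0, X1 = X[y == 0][:, S], X[y == 1][:, S]
    d = X1.mean(0) - X0.mean(0)
    Sigma = (len(X0) * np.atleast_2d(np.cov(X0.T, bias=True))
             + len(X1) * np.atleast_2d(np.cov(X1.T, bias=True))) / len(X)
    return -d @ np.linalg.solve(Sigma, d) + cn * (len(S) + 1)

# subspaces covering the true signal should score lower (better)
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10))
y = rng.binomial(1, 0.5, 400)
X[y == 1, 0] += 2.0
X[y == 1, 1] += 1.0                     # signal on features 1 and 2
ric_signal = ric_lda(X, y, [0, 1], cn=0.1)
ric_noise = ric_lda(X, y, [5, 6], cn=0.1)
```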
The RaSE algorithms are available in the R package RaSEn (https://cran.r-project.org/web/packages/RaSEn/index.html). 6.2 Future Work There are many interesting directions in which RaSE can be extended and explored. One question is how to extend RaSE and RIC to multi-class problems; for example, pairwise KL divergences could be used to define a multi-class RIC. Moreover, RaSE can be applied to variable selection by thresholding the average selected percentage of features in the B1 subspaces. When the sample size is small, a bootstrap-type idea can be used: each time, apply RaSE to a bootstrap sample and, at the end, average the selected percentages for variable selection or feature ranking. Finally, we aggregate the classifiers by taking a simple average over all weak learners; a boosting-type idea could instead assign different weights to weak learners according to their training errors, which may further improve the performance of RaSE. In addition, the distribution over random subspaces could differ across weak learners and could also be updated using a boosting-like scheme. Acknowledgments The authors would like to thank the Action Editor and anonymous referees for many constructive comments which have greatly improved the paper. This work was partially supported by National Science Foundation CAREER grant DMS-2013789. Appendix A. Additional Figures of Simulations We present figures of the selected percentage of each feature when n equals the smallest value among the three settings.
[Figure 8: Average selected percentages of features for model 1 in 200 replicates when n = 200.]
[Figure 9: Average selected percentages of features for model 1′ in 200 replicates when n = 200.]
[Figure 10: Average selected percentages of features for model 2 in 200 replicates when n = 100.]
Figure 11: Average selected percentages
of features for model 3 in 200 replicates when n = 200.

[Figure 12: Average selected percentages of features for model 4 in 200 replicates when n = 200.]
[Figure 13: Average selected percentages of features for model 4′ in 200 replicates when n = 200.]

Appendix B. Main Proofs In the following proofs, for convenience, we use C to represent a positive constant, which may differ from occurrence to occurrence. B.1 Proof of Proposition 3 (i) The conditional probability is

P(y = 1|x) = \frac{\pi_1 f^{(1)}(x)}{\pi_1 f^{(1)}(x) + \pi_0 f^{(0)}(x)}.

The definition of a discriminative set is equivalent to

P(y = 1|x) = P(y = 1|x_S), \quad (14)

or P(y = 0|x) = P(y = 0|x_S) almost surely, which is in turn equivalent to

\frac{f^{(1)}(x)}{f^{(0)}(x)} = \frac{f_S^{(1)}(x_S)}{f_S^{(0)}(x_S)}. \quad (15)

(ii) First, observe that the condition is equivalent to f^{(1)}(x) = h(x_S) f^{(0)}(x). Integrating both sides with respect to x_{S^c}, we get f_S^{(1)}(x_S) = h(x_S) f_S^{(0)}(x_S). By the equivalence of (14) and (15), S is a discriminative set. B.2 Proof of Proposition 5 To facilitate our analysis, we first state the following lemma. Lemma 26 A discriminative set S is unique if it is no longer a discriminative set after removing any feature from it.
Proof [Proof of Lemma 26] Suppose the conclusion does not hold; then there exist two different discriminative sets S_1 and S_2 with this property, that is, deleting any feature from either of them yields a non-discriminative set. According to Proposition 3 (i), we have

\frac{f_{S_1}^{(1)}(x_{S_1})}{f_{S_1}^{(0)}(x_{S_1})} = \frac{f_{S_2}^{(1)}(x_{S_2})}{f_{S_2}^{(0)}(x_{S_2})}, \quad (16)

almost surely w.r.t. P_x, where P_x = \pi_0 P^{(0)} + \pi_1 P^{(1)}. Since f^{(0)}, f^{(1)} are supported on the whole of \mathbb{R}^p, P_x dominates the Lebesgue measure. Combining this with the explicit form of the Gaussian density and denoting \bar\Omega_{S_1,S_1} = (\Sigma^{(1)}_{S_1,S_1})^{-1} - (\Sigma^{(0)}_{S_1,S_1})^{-1}, \bar\Omega_{S_2,S_2} = (\Sigma^{(1)}_{S_2,S_2})^{-1} - (\Sigma^{(0)}_{S_2,S_2})^{-1}, \delta_{S_1} = (\Sigma^{(0)}_{S_1,S_1})^{-1}\mu^{(0)}_{S_1} - (\Sigma^{(1)}_{S_1,S_1})^{-1}\mu^{(1)}_{S_1}, \delta_{S_2} = (\Sigma^{(0)}_{S_2,S_2})^{-1}\mu^{(0)}_{S_2} - (\Sigma^{(1)}_{S_2,S_2})^{-1}\mu^{(1)}_{S_2}, it can be obtained from (4) and (16) that

c - \frac{1}{2} x_{S_1}^T \bar\Omega_{S_1,S_1} x_{S_1} + \delta_{S_1}^T x_{S_1} = c' - \frac{1}{2} x_{S_2}^T \bar\Omega_{S_2,S_2} x_{S_2} + \delta_{S_2}^T x_{S_2},

where c, c' are constants not depending on x. Considering the combined vector x_{S_1 \cup S_2}, there exist a matrix \bar\Omega and a vector \bar\delta such that the equation above simplifies to

x_{S_1 \cup S_2}^T \bar\Omega x_{S_1 \cup S_2} + \bar\delta^T x_{S_1 \cup S_2} + c - c' = 0, \quad (17)

for almost every x_{S_1 \cup S_2} in Euclidean space. Since S_1 is a discriminative set, by Example 2, for every feature j \in S_1, the corresponding row of \bar\Omega_{S_1,S_1} is non-zero or the corresponding component of \delta_{S_1} is non-zero; the same holds for S_2. Since S_1\setminus S_2 and S_2\setminus S_1 are not empty, at least one of \bar\Omega and \bar\delta contains non-zero components. However, (17) clearly cannot hold for almost every x_{S_1 \cup S_2} in Euclidean space, which is a contradiction. Thus S is unique. Now let us proceed with the proof of Proposition 5. By Definition 1, the minimal discriminative set S^* satisfies the property described in Lemma 26; therefore, by the lemma, S^* is unique, which implies (i). For (ii), if there exists a discriminative set S \not\supseteq S^*, we can remove elements from S until we arrive at a discriminative set \tilde S that satisfies the property stated in Lemma 26.
It is then apparent that $S' \neq S^*$, while both satisfy the property in Lemma 26, which contradicts the uniqueness given by Lemma 26. (iii) is trivial from Definition 1.

B.3 Proof of Proposition 8

By Definition 6 and the fact that $\deg(S) = |S| + 1$, it is easy to obtain that
\[
\begin{aligned}
\mathrm{RIC}_n(S)
&= \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 0)\Big[\big((\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\big)x_{i,S} + \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S - \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S\Big] \\
&\quad + \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 1)\Big[\big((\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\big)x_{i,S} + \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S - \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S\Big] \\
&\quad + c_n(|S| + 1) \\
&= 2\hat\pi_0\Big[\big((\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\big)\hat\mu^{(0)}_S + \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S - \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S\Big] \\
&\quad + 2\hat\pi_1\Big[\big((\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S} - (\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\big)\hat\mu^{(1)}_S + \tfrac{1}{2}(\hat\mu^{(1)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(1)}_S - \tfrac{1}{2}(\hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}\hat\mu^{(0)}_S\Big] \\
&\quad + c_n(|S| + 1) \\
&= -(\hat\mu^{(1)}_S - \hat\mu^{(0)}_S)^T \hat\Sigma^{-1}_{S,S}(\hat\mu^{(1)}_S - \hat\mu^{(0)}_S) + c_n(|S| + 1),
\end{aligned}
\]
where the second equality uses $n^{-1}\sum_{i=1}^n \mathbb{1}(y_i = r)\, x_{i,S} = \hat\pi_r \hat\mu^{(r)}_S$, $r = 0, 1$. This completes the proof.

B.4 Proof of Proposition 9

By Definition 6 and the fact that $\deg(S) = |S|(|S|+3)/2 + 1$, it is easy to obtain that
\[
\begin{aligned}
\mathrm{RIC}_n(S)
&= \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 0)\Big[\tfrac{1}{2}x_{i,S}^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x_{i,S} + \big((\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1}\big)x_{i,S} \\
&\qquad + \tfrac{1}{2}(\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1}\hat\mu^{(0)}_S - \tfrac{1}{2}(\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1}\hat\mu^{(1)}_S + \tfrac{1}{2}\log|\hat\Sigma^{(0)}_{S,S}| - \tfrac{1}{2}\log|\hat\Sigma^{(1)}_{S,S}|\Big] \\
&\quad + \frac{2}{n}\sum_{i=1}^n \mathbb{1}(y_i = 1)\Big[\tfrac{1}{2}x_{i,S}^T\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)x_{i,S} + \big((\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1}\big)x_{i,S} \\
&\qquad + \tfrac{1}{2}(\hat\mu^{(1)}_S)^T(\hat\Sigma^{(1)}_{S,S})^{-1}\hat\mu^{(1)}_S - \tfrac{1}{2}(\hat\mu^{(0)}_S)^T(\hat\Sigma^{(0)}_{S,S})^{-1}\hat\mu^{(0)}_S + \tfrac{1}{2}\log|\hat\Sigma^{(1)}_{S,S}| - \tfrac{1}{2}\log|\hat\Sigma^{(0)}_{S,S}|\Big] \\
&\quad + c_n\big(|S|(|S|+3)/2 + 1\big).
\end{aligned}
\]
Denote the observations in class $r$ by $\{x^{(r)}_i\}_{i=1}^{n_r}$, $r = 0, 1$. It holds that
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^n \mathbb{1}(y_i = 0)\Big[x_{i,S}^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x_{i,S}\Big]
&= \frac{1}{n}\sum_{i=1}^{n_0} (x^{(0)}_{i,S})^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x^{(0)}_{i,S} \\
&= \frac{1}{n}\sum_{i=1}^{n_0} \mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)x^{(0)}_{i,S}(x^{(0)}_{i,S})^T\Big] \\
&= \mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\frac{1}{n}\sum_{i=1}^{n_0} x^{(0)}_{i,S}(x^{(0)}_{i,S})^T\Big] \\
&= \hat\pi_0\,\mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\big(\hat\mu^{(0)}_S(\hat\mu^{(0)}_S)^T + \hat\Sigma^{(0)}_{S,S}\big)\Big] \\
&= \hat\pi_0(\hat\mu^{(0)}_S)^T\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\hat\mu^{(0)}_S \\
&\quad + \hat\pi_0\,\mathrm{Tr}\Big[\big((\hat\Sigma^{(0)}_{S,S})^{-1} - (\hat\Sigma^{(1)}_{S,S})^{-1}\big)\hat\Sigma^{(0)}_{S,S}\Big].
\end{aligned}
\]
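The closed-form RIC for the LDA model can be sanity-checked numerically: for equal-covariance Gaussians, the weighted KL criterion $-2[\pi_0\,\mathrm{KL}(f^{(0)}\|f^{(1)}) + \pi_1\,\mathrm{KL}(f^{(1)}\|f^{(0)})]$ collapses to the negative quadratic form above, independently of $\pi_0, \pi_1$. The following sketch (not part of the RaSEn package; the parameter values are illustrative) verifies this with the standard Gaussian KL formula.

```python
import numpy as np

def gaussian_kl(mu0, mu1, S0, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - d
                  + diff @ S1_inv @ diff
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Illustrative LDA parameters (equal covariance across classes).
mu0 = np.array([0.0, 0.0, 0.0])
mu1 = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
pi0, pi1 = 0.4, 0.6

weighted_kl = pi0 * gaussian_kl(mu0, mu1, Sigma, Sigma) \
            + pi1 * gaussian_kl(mu1, mu0, Sigma, Sigma)
diff = mu1 - mu0
quad = diff @ np.linalg.inv(Sigma) @ diff

# -2 * weighted KL equals the negative quadratic form in Proposition 8.
assert np.isclose(-2 * weighted_kl, -quad)
```

The same `gaussian_kl` helper, with two different covariance matrices, gives the population version of the QDA criterion treated in Proposition 9.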
Similarly, we have
\[
\frac{1}{n}\sum_{i=1}^n \mathbb{1}(y_i = 1)\Big[x_{i,S}^T\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)x_{i,S}\Big]
= \hat\pi_1(\hat\mu^{(1)}_S)^T\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)\hat\mu^{(1)}_S + \hat\pi_1\,\mathrm{Tr}\Big[\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)\hat\Sigma^{(1)}_{S,S}\Big].
\]
Combining these with the fact that $n^{-1}\sum_{i=1}^n \mathbb{1}(y_i = r)\,x_{i,S} = \hat\pi_r \hat\mu^{(r)}_S$, $r = 0, 1$, we obtain
\[
\begin{aligned}
\mathrm{RIC}_n(S) &= -(\hat\mu^{(1)}_S - \hat\mu^{(0)}_S)^T\big[\hat\pi_1(\hat\Sigma^{(0)}_{S,S})^{-1} + \hat\pi_0(\hat\Sigma^{(1)}_{S,S})^{-1}\big](\hat\mu^{(1)}_S - \hat\mu^{(0)}_S) \\
&\quad + \mathrm{Tr}\big[\big((\hat\Sigma^{(1)}_{S,S})^{-1} - (\hat\Sigma^{(0)}_{S,S})^{-1}\big)\big(\hat\pi_1\hat\Sigma^{(1)}_{S,S} - \hat\pi_0\hat\Sigma^{(0)}_{S,S}\big)\big] + (\hat\pi_1 - \hat\pi_0)\big(\log|\hat\Sigma^{(1)}_{S,S}| - \log|\hat\Sigma^{(0)}_{S,S}|\big) \\
&\quad + c_n\big(|S|(|S|+3)/2 + 1\big),
\end{aligned}
\]
which completes the proof.

B.5 Proof of Theorem 10

Denote $g_n(\alpha') = \pi_1 g^{(1)}_n(\alpha') - \pi_0 g^{(0)}_n(\alpha')$. By the definition of $\{\alpha_i\}_{i=1}^N$, it holds that $g_n(\alpha') = 0$ when $\alpha' \notin \{\alpha_i\}_{i=1}^N$. Recall that $x|y=0 \sim P^{(0)}$ and $x|y=1 \sim P^{(1)}$, where $P^{(0)}, P^{(1)}$ are the corresponding distributions. We have
\[
\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] = \pi_0 \int_{\mathcal{X}} P(\nu_n(x) > \alpha)\, dP^{(0)} + \pi_1 \int_{\mathcal{X}} P(\nu_n(x) \le \alpha)\, dP^{(1)}.
\]
For given $x$ and the corresponding $\alpha' = \mu_n(x)$, we can construct a random variable $T = \sum_{j=1}^{B_1} \mathbb{1}\{C^{S_j}_n(x) = 1\} \sim \mathrm{Bin}(B_1, \alpha')$. Then
\[
\int_{\mathcal{X}} P(\nu_n(x) \le \alpha)\, dP^{(1)} = \int_{[0,1]} P(T \le B_1\alpha)\, dG^{(1)}_n(\alpha').
\]
Similarly,
\[
\int_{\mathcal{X}} P(\nu_n(x) > \alpha)\, dP^{(0)} = 1 - \int_{[0,1]} P(T \le B_1\alpha)\, dG^{(0)}_n(\alpha').
\]
This leads to
\[
\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] = \pi_0 + \int_{[0,1]} P(T \le B_1\alpha)\, dG_n(\alpha'),
\]
where $G_n(\alpha') = \pi_1 G^{(1)}_n(\alpha') - \pi_0 G^{(0)}_n(\alpha')$. There also holds
\[
\begin{aligned}
R(C^{\mathrm{RaSE}}_n) &= \pi_0 P^{(0)}(\mu_n(x) > \alpha) + \pi_1 P^{(1)}(\mu_n(x) < \alpha) + \tfrac{1}{2}\big[\pi_0 P^{(0)}(\mu_n(x) = \alpha) + \pi_1 P^{(1)}(\mu_n(x) = \alpha)\big] \\
&= \pi_0\big(1 - G^{(0)}_n(\alpha)\big) + \pi_1 G^{(1)}_n(\alpha) + \tfrac{1}{2}\big[\pi_0 g^{(0)}_n(\alpha) - \pi_1 g^{(1)}_n(\alpha)\big] \\
&= \pi_0 + G_n(\alpha) - \tfrac{1}{2} g_n(\alpha),
\end{aligned}
\]
where $g_n(\alpha') = \pi_1 g^{(1)}_n(\alpha') - \pi_0 g^{(0)}_n(\alpha')$. This implies
\[
\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] - R(C^{\mathrm{RaSE}}_n) = \int_{[0,1]} \big[P(T \le B_1\alpha) - \mathbb{1}\{\alpha' \le \alpha\}\big]\, dG_n(\alpha') + \tfrac{1}{2} g_n(\alpha). \tag{18}
\]

(i) When $\alpha \notin \{\alpha_i\}_{i=1}^N$, $g_n(\alpha) = 0$ holds. For $\alpha' \in \{\alpha_i\}_{i=1}^N$, by Hoeffding's inequality (Petrov, 2012), we obtain that
\[
\begin{aligned}
\big|P(T \le B_1\alpha) - \mathbb{1}\{\alpha' \le \alpha\}\big|
&\le P\big(T - B_1\alpha' > B_1(\alpha - \alpha')\big)\mathbb{1}\{\alpha' \le \alpha\} + P\big(B_1\alpha' - T \ge B_1(\alpha' - \alpha)\big)\mathbb{1}\{\alpha' > \alpha\} \\
&\le \exp\{-2B_1(\alpha' - \alpha)^2\} \le \exp\{-C_\alpha B_1\},
\end{aligned}
\]
where $C_\alpha = 2\min_{1\le i\le N} |\alpha - \alpha_i|^2$. This leads to the final bound
\[
\big|\mathbb{E}[R(C^{\mathrm{RaSE}}_n)] - R(C^{\mathrm{RaSE}}_n)\big| \le \exp\{-C_\alpha B_1\}.
\]

(ii) When $\alpha = \alpha_{i_0}$, $i_0 \in \{1, 2, \ldots$
, N}, for α = αi0, we have |P(T B1α) 1{α α}| exp{ 2B1(α α)2}, leading to X i =i0 |P(T B1α) 1{α α}|gn(αi) = exp { CαB1} X i =i0 gn(αi) exp { CαB1} , where Cα = 2 min 1 i N |α αi|2 = 2 min i =i0 |αi0 αi|2. Therefore, again by (18), we have |E[R(CRa SE)] R(CRa SE n )| [P(T B1αi0) 1] gn(αi0) + 1 2gn(αi0) +exp { CαB1} . By Berry-Esseen theorem (Esseen, 1956), P(T B1αi0) = P B1(αi0(1 αi0)) 0 as B1 . Eventually there holds that |E[R(CRa SE)] R(CRa SE n )| O 1 B1 This completes the proof. B.6 Proof of Theorem 11 Referring to the proof of the bound on MC-variance in Cannings and Samworth (2017), it can be obtained that Rp 1{νn(x)>α}d P(0) 2 Z [0,α ] P(T B1α)P(T > B1α)d G(0) n (α )d G(0) n (α ), where given x, T Bin(B1, α ), T Bin(B1, α ) and T , T are independent. (i) For α / {α(0) i }N0 i=1 {α(1) i }N1 i=1: constant C(0) α = 2 min 1 i N0(|α α(0) i |2) > 0, C(1) α = 2 min 1 i N1(|α α(1) i |2) > 0. If α α α : Then by Hoeffding s inequality, we have P(T B1α) = P(T B1α B1(α α )) exp{ B1C(0) α }. (20) If α α α: Similarly we have P(T > B1α) exp{ B1C(0) α }. (21) If α α α : We have both (20) and (21). Thus we always have P(T B1α)P(T > B1α) exp{ B1C(0) α }. Tian and Feng Since here the integration actually is the finite summation, it implies that X 1{νn(x)>α}d P(0) exp{ B1C(0) α }. Since α / {α(1) i }N1 i=1: We can obtain the similar conclusion that X 1{νn(x) α}d P(1) exp{ B1C(1) α }. Thus for any α / {α(0) i }N0 i=1 {α(1) i }N1 i=1, setting Cα = 2 min 1 i N0 1 j N1 (|α α(0) i |2, |α α(1) j |2) = min(C(0) α , C(1) α ) > 0, then by the convexity, we have Var R(CRa SE n ) = Var π0 X 1{νn(x)>α}d P(0) + π1 X 1{νn(x) α}d P(1) X 1{νn(x)>α}d P(0) + π1Var Z X 1{νn(x) α}d P(1) exp{ B1Cα}. (ii) For α = α(0) i0 or α(1) i1 : Without loss of generality, suppose α = α(0) i0 . When α < α , similar to (i), there exists positive number C α such that P(T B1α)P(T > B1α) exp{ B1C α}. 
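The Hoeffding bounds on binomial tails used throughout this argument are easy to check numerically against the exact tail. A minimal sketch (the values of $B_1$, $\alpha$, $\alpha'$ are hypothetical, chosen only to illustrate the exponential decay in $B_1$):

```python
import math

def binom_cdf(k, n, p):
    """Exact P(T <= k) for T ~ Bin(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

B1, alpha, alpha_prime = 200, 0.5, 0.7   # illustrative values with alpha' > alpha
exact = binom_cdf(int(B1 * alpha), B1, alpha_prime)
hoeffding = math.exp(-2 * B1 * (alpha_prime - alpha) ** 2)

assert exact <= hoeffding   # Hoeffding's inequality: P(T <= B1*alpha) <= exp{-2 B1 (alpha'-alpha)^2}
assert hoeffding < 1e-6     # the bound is already tiny at B1 = 200
```

This is the mechanism behind both Theorem 10 and the MC-variance bound of Theorem 11: once $\alpha' $ is bounded away from $\alpha$, the relevant tail probabilities vanish exponentially fast in the ensemble size $B_1$.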
When α = α = α(0) i0 , similar to (19), there holds P(T B1α)P(T > B1α) = 1 Thus it holds X 1{νn(x)>α}d P (0) 1 2(g(0) n (α(0) i0 ))2 + O 1 B1 And similar results hold for Var R X 1{νn(x) α}d P (1) . Eventually we would have Var R(CRa SE n ) π0Var Z X 1{νn(x)>α}d P(0) + π1Var Z X 1{νn(x) α}d P(1) h π0(g(0) n (α))2 + π1(g(1) n (α))2i + O 1 B1 which completes the proof. Ra SE: Random Subspace Ensemble Classification B.7 Proof of Proposition 12 (i) Due to Proposition 3, this is trivial to be seen to hold here. (ii) First let s consider subspace S = S {j}, where |S | 1, j / S . The conditional densities of j|S are denoted as f(0) j|S , f(1) j|S and the components of x S corresponding to S and {j} are x S , xj, respectively. The definition of KL divergence and Fubini theorem incur KL(f(0) S ||f(1) S ) = Ex S f(0) S f(0) S (x S ) f(1) S (x S ) f(0) j|S (xj) f(1) j|S (xj) = Ex S f(0) S f(0) S (x S ) f(1) S (x S ) + Ex S f(0) S Exj f(0) j|S log f(0) j|S (xj) f(1) j|S (xj) = KL(f(0) S ||f(1) S ) + Ex S f(0) S h KL(f(0) j|S ||f(1) j|S ) i KL(f(0) S ||f(1) S ). Here = holds if and only if f(0) j|S (xj|x S ) = f(1) j|S (xj|x S ) a.s. with respect to P(0). By induction, this indicates that for any S S and S = , there holds KL(f(0) S ||f(1) S ) KL(f(0) S ||f(1) S ). (22) Note that in (22), by Proposition 3, if f(0) j|S (xj|x S ) = f(1) j|S (xj|x S ) a.s. with respect to P(0), we have f(0) S (x S) f(1) S (x S) = f(0) S (x S ) f(1) S (x S ) , a.s. , w.r.t. P(0). (23) Similarly, we have KL(f(1) S ||f(0) S ) KL(f(1) S ||f(0) S ), (24) where the = holds if and only if f(1) S (x S) f(0) S (x S) = f(1) S (x S ) f(0) S (x S ) , a.s. , w.r.t. P(1). (25) If S S , consider S = S (S \S) S , then by (i) and (22), there holds π0KL(f(0) S ||f(1) S ) + π1KL(f(1) S ||f(0) S ) π0KL(f(0) S ||f(1) S ) + π1KL(f(1) S ||f(0) S ) (26) = π0KL(f(0) S ||f(1) S ) + π1KL(f(1) S ||f(0) S ). 
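The marginalization inequality (22), $\mathrm{KL}(f^{(0)}_{S'}\|f^{(1)}_{S'}) \le \mathrm{KL}(f^{(0)}_S\|f^{(1)}_S)$ for $S' \subseteq S$, is an instance of the data-processing inequality and can be illustrated with Gaussian closed forms. A minimal check with illustrative parameters (subspace $\{1\}$ versus $\{1, 2\}$):

```python
import numpy as np

def gaussian_kl(mu0, mu1, S0, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - d
                  + diff @ S1_inv @ diff
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# Illustrative two-feature parameters for the two classes.
mu0 = np.array([0.0, 1.0])
mu1 = np.array([0.5, -1.0])
S0 = np.array([[1.0, 0.4], [0.4, 2.0]])
S1 = np.array([[1.5, -0.2], [-0.2, 1.0]])

kl_full = gaussian_kl(mu0, mu1, S0, S1)                          # subspace {1, 2}
kl_sub = gaussian_kl(mu0[:1], mu1[:1], S0[:1, :1], S1[:1, :1])   # marginal on {1}

assert kl_sub <= kl_full   # dropping a coordinate never increases KL
```

Since Gaussian marginals are obtained by selecting sub-vectors and sub-matrices, this mirrors exactly the nesting of subspaces used in the proof of Proposition 12.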
Then by Proposition 3, if the = holds in (26), due to (23) and (25), we have f(0) S (x S) f(1) S (x S) = f(0) S (x S) f(1) S (x S) = f(0) S (x S ) f(1) S (x S ) , a.s. , w.r.t. Px = π0P(0) + π1P(1), implying that S is a discriminative set but S S , which yields a contradiction. Therefore (ii) holds here. Tian and Feng (iii) holds since the full model SFull S and the KL divergence is monotone in the sense of (22) and (24). B.8 Proof of Proposition 14 Denote R0|1 S (µ(0) S , µ(1) S , ΣS,S, x S) = log f(0) S (x S) f(1) S (x S) = µ(0) S µ(1) S T Σ 1 S,S h x S 1 2(µ(0) S + µ(1) S ) i . To eliminate any confusion and better illustrate the meaning of the gradient and second- order derivative matrix, we will use R0|1 S µ(0) S , R0|1 S µ(1) S , R0|1 S ΣS,S to represent the gradient and use 2R0|1 S µ(0) S ΣS,S , 2R0|1 S µ(1) S ΣS,S , 2R0|1 S ΣS,S ΣS,S to represent the second-order derivative matrix. According to Brookes (2005) and Petersen and Pedersen (2012), with some calculation we can obtain that R0|1 S µ(0) S = Σ 1 S,S x S µ(0) S , R0|1 S µ(1) S = Σ 1 S,S x S µ(1) S , R0|1 S ΣS,S = Σ 1 S,S µ(0) S µ(1) S x S 1 2(µ(0) S + µ(1) S ) T Σ 1 S,S, 2R0|1 S µ(0) S µ(0) S = 2R0|1 S µ(1) S µ(1) S = Σ 1 S,S, 2R0|1 S ΣS,S µ(0) S = Σ 1 S,S Σ 1 S,S I|S| x S µ(0) S , 2R0|1 S ΣS,S µ(1) S = Σ 1 S,S Σ 1 S,S I|S| x S µ(1) S , 2R0|1 S ΣS,S ΣS,S = I|S| Σ 1 S,S µ(0) S µ(1) S x S 1 2(µ(0) S + µ(1) S ) T 2(µ(0) S + µ(1) S ) µ(0) S µ(1) S T Σ 1 S,S I|S| Σ 1 S,S Σ 1 S,S , where is the Kronecker product. Then let s check conditions in Assumption 3 one by one. Without loss of generality, for (iii), we only check the case that xi,S i.i.d. f(0) S . And for (iv), (v), we only check the case that x S f(0) S . (i) It s easy to see that the number of parameters in LDA model in p-dimensional space is 2 + 2p + p(p+1) 2 , therefore κ1 = 2. 
(ii) According to Assumption 4, we have KL(f(0)||f(1)) = µ(1) µ(0) T Σ 1 µ(1) µ(0) Ra SE: Random Subspace Ensemble Classification µ(1) S µ(0) S 2 2 Σ 1 S ,S 2 The similar conclusion holds for KL(f(1)||f(0)) as well. (iii) For any ( µ(0) S , µ(1) S , ΣS,S): 1 n 2R0|1 S ΣS,S ΣS,S ( µ(0) S , µ(1) S , ΣS,S, xi) 2R0|1 S ΣS,S ΣS,S ( µ(0) S , µ(1) S , ΣS,S, xi) 2 Σ 1 S,S( µ(0) S µ(1) S ) 2 2( µ(0) S + µ(1) S ) 2 Σ 1 S,S 2 2 Σ 1 S,S 3 2 µ(0) S µ(1) S 2 2( µ(0) S + µ(1) S ) 2 . When µ(0) S µ(0) S 2, µ(1) S µ(1) S 2, ΣS,S ΣS,S F ζ < m, due to (28), it follows that Σ 1 S,S Σ 1 S,S 2 1 m2 ΣS,S ΣS,S 2 1 1 m ΣS,S ΣS,S 2 1 m2 ΣS,S ΣS,S F 1 1 m ΣS,S ΣS,S F ζ m2 mζ , which leads to Σ 1 S,S 2 Σ 1 S,S 2 + Σ 1 S,S Σ 1 S,S 2 1 m ζ . In addition, we have µ(0) S µ(1) S 2 µ(0) S µ(0) S 2 + µ(1) S µ(1) S 2 + µ(0) S µ(1) S 2 D 1 2 M . Without loss of generality, consider xi,S i.i.d. f(0) S , it holds that 1 n 2( µ(0) S + µ(1) S ) 2 xi,S µ(0) S 2 + 1 2 µ(0) S µ(0) S 2 2 µ(1) S µ(1) S 2 + 1 2 µ(0) S µ(1) S 2 xi,S µ(0) S 2 + D 1 2 M . By Proposition 1 in Hsu et al. (2012), since 1 n Pn i=1(xi,S µ(0) S ) N(0, 1 nΣS,S), it follows that xi,S µ(0) S 2 > 1 n Tr(ΣS,S) + 2 r Tr(Σ2 S,S) ϵ n + 2 ΣS,S 2ϵ Tian and Feng And due to Assumption 4, we have Tr(ΣS,S) D ΣS,S max DM, Tr(Σ2 S,S) = ΣS,S 2 F D2 ΣS,S 2 max D2M2, ΣS,S 2 D ΣS,S max DM, xi,S µ(0) S 2 > 1 n DM + 2DM r ϵ Therefore when µ(0) S µ(0) S 2, µ(1) S µ(1) S 2, ΣS,S ΣS,S 2 ζ, we have 1 n 2R0|1 S ΣS,S ΣS,S ( µ(0) S , µ(1) S , ΣS,S, xi) max D 1 2 M xi,S µ(0) S 2 + D 1 2 M := VS({xi,S}n i=1), and P (VS({xi,S}n i=1) > CD) exp{ Cn}. Thus we can set κ2 = 1. (iv) Σ 1 S,S x S µ(0) S N(0, Σ 1 S,S), Σ 1 S,S x S µ(1) S N(Σ 1 S,S(µ(0) S µ(1) S ), Σ 1 S,S) when x S f(0) S . And also we have Σ 1 S,S max Σ 1 S,S 2 Σ 1 2 m 1. Then each component of R0|1 S µ(0) S and R0|1 S µ(1) S is m 1-sub Gaussian. 
On the other hand, since Σ 1 S,S x S 1 2(µ(0) S + µ(1) S ) N 1 2Σ 1 S,S(µ(0) S µ(1) S ), Σ 1 S,S and Σ 1 S,S(µ(0) S µ(1) S ) D 1 2 Σ 1 S,S 2 µ(0) S µ(1) S m 1M D 1 2 , (27) it follows that each component of R0|1 S ΣS,S is m 1 M D-sub Gaussian. Therefore κ3 = 1 (v) Notice that because of (27), we have Ex S f(0) S h Σ 1 S,S(x S µ(1) S ) i = Σ 1 S,S(µ(0) S µ(1) S ) m 1M D 1 2 , Ex S f(0) S Σ 1 S,S µ(0) S µ(1) S x S 1 2(µ(0) S + µ(1) S ) T Σ 1 S,S Ra SE: Random Subspace Ensemble Classification Σ 1 S,S µ(0) S µ(1) S µ(0) S µ(1) S T Σ 1 S,S Σ 1 S,S(µ(0) S µ(1) S ) 2 2(m 1M )2D. Therefore κ4 = 1. (vi) It s easy to see that µ(1) S µ(0) S T Σ 1 S,S h x S 1 2(µ(0) S + µ(1) S ) i N((µ(1) S µ(0) S )T Σ 1 S,S(µ(1) S µ(0) S ), (µ(1) S µ(0) S )T Σ 1 S,S(µ(1) S µ(0) S )). And there holds (µ(1) S µ(0) S )T Σ 1 S,S(µ(1) S µ(0) S ) µ(1) S µ(0) S 2 2 Σ 1 S,S 2 m 1(M )2D, which yields that log f(0) S (x S) f(1) S (x S) m 1D-sub Gaussian. So κ5 = 1 (vii) This can be easily derived from Lemma 28 and its proof. (viii) This is obvious because of Lemma 31.(iii) and Assumption 4.(iii). B.9 Proof of Theorem 15 First suppose that we have n0 observations of class 0 {x(0) i }n0 i=1 and n1 observations of class 1 {x(1) i }n1 i=1, where n0 + n1 = n. Define G(0) n0,S(θ S) = 1 f(0) S (x(0) i,S|θ S) f(1) S (x(0) i,S|θ S) , G(1) n1,S(θ S) = 1 f(1) S (x(1) i,S|θ S) f(0) S (x(1) i,S|θ S) for any θ S. Denote d RIC(S) = 2[ˆπ0G(0) n0,S(ˆθS) + ˆπ1G(1) n1,S(ˆθS)], then RICn(S) = d RIC(S) + cn deg(S). Lemma 27 If Assumptions 1-3 holds, then we have sup S:|S| D | d RIC(S) RIC(S)| > ϵ Cn ϵ Dκ1(Dκ3 + Dκ4) + p DDκ1 exp n Cn ϵ D2κ1+κ2 + p D exp Cn ϵ Proof [Proof of Lemma 27] By Taylor expansion and mean-value theorem, there exists λ (0, 1) and θS = λθS + (1 λ)ˆθS satisfying that G(0) n0,S(ˆθS) = G(0) n0,S(θS) + G(0) n0,S(θS)T (ˆθS θS) + 1 2(ˆθS θS)T 2G(0) n0,S( θ)(ˆθS θS). 
Tian and Feng Notice that G(0) n0,S(θS)T (ˆθS θS) 2 Dκ1 G(0) n0,S(θS) ˆθS θS , and when ˆθS θS 2 ζ, we also have (ˆθS θS)T 2G(0) n0,S( θ)(ˆθS θS) Dκ1 ˆθS θS 2 2G(0) n0,S( θ) max D2κ1 ˆθS θS 2 VS({x(0) i,S}n0 i=1) . Since nr Bin(n, πr), r = 0, 1, by Hoeffding s inequality, we have P(|nr nπr| > 0.5nπr) exp{ Cn}, r = 0, 1. Because of Assumption 3.(iv), given n0, each component of n0 G(0) n0,S(θS) is n0 2C3Dκ3sub Gaussian, then the tail bound in the below holds: G(0) n0,S(θS) Ex S f(0) S θSLS(x; θ) > ϵ Dκ1 exp Cn0 ϵ Then according to all conclusions above, we have P(|G(0) n0,S(ˆθS) KL(f(0) S ||f(1) S )| > ϵ) En0[P(|G(0) n0,S(ˆθS) KL(f(0) S ||f(1) S )| > ϵ||n0 nπ0| 0.5nπ0)] + P(|n0 nπ0| > 0.5nπ0) En0P G(0) n0,S(θS) KL(f(0) S ||f(1) S ) > ϵ/3 |n0 nπ0| 0.5nπ0 + G(0) n0,S(θS) Ex S f(0) S G(0) n0,S(θS) > CDκ3 |n0 nπ0| 0.5nπ0 + P CDκ1(Dκ3 + Dκ4) ˆθS θS > ϵ/3 + P 0.5CD2κ1+κ2 ˆθS θS 2 > ϵ/3 VS({x(0) i,S}n0 i=1) > CDκ2 |n0 nπ0| 0.5nπ0 + P(|n0 nπ0| > 0.5nπ0) + P(D 1 2 κ1 ˆθS θS > ζ) Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ which yields P(|ˆπ0G(0) n0,S(ˆθS) π0KL(f(0) S ||f(1) S )| > ϵ) Dκ1 exp Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ Ra SE: Random Subspace Ensemble Classification Similarly, there holds P(|ˆπ1G(1) n1,S(ˆθS) π1KL(f(1) S ||f(0) S )| > ϵ) Dκ1 exp Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ Since d RIC(S) = 2[ˆπ0G(0) n0,S(ˆθS) + ˆπ1G(1) n1,S(ˆθS)], we obtain that P | d RIC(S) RIC(S)| > ϵ Dκ1 exp Cn ϵ Dκ1(Dκ3 + Dκ4) + Dκ1 exp n Cn ϵ D2κ1+κ2 o + exp Cn ϵ Due to the union bound over all p D = O(p D) possible subsets, we obtain the conclusion. Let s now prove Theorem 15. 
(i) It s easy to see that sup S:S S |S| D RICn(S) inf S:S S |S| D RICn(S) RIC(S ) + cn sup S:|S| D deg(S) + /3 inf S:S S |S| D RIC(S) /3 sup S:|S| D | d RIC(S) RIC(S)| > /3 P(cn sup S:|S| D deg(S) /3) + P sup S:|S| D | d RIC(S) RIC(S)| > /3 Cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp Cn D2κ1+κ2 Tian and Feng (ii) Similar to above, we have RICn(S ) inf S:S S |S| D RICn(S) RIC(S ) + cn deg(S ) + /3 inf S:S S |S| D RIC(S) /3 sup S:|S| D | d RIC(S) RIC(S)| > /3 Cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp Cn D2κ1+κ2 Besides, it holds RICn(S ) inf S:S S |S| D RICn(S) P (RIC(S ) + cn deg(S ) + cn/3 RIC(S ) + cn(deg(S ) + 1) cn/3) sup S:|S| D | d RIC(S) RIC(S)| > cn/3 Cn cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp n Cn cn D2κ1+κ2 + p D exp Cn cn Therefore we have P RICn(S ) = inf S:|S| D RICn(S) RICn(S ) inf S:S S |S| D RICn(S) RICn(S ) inf S:S S |S| D RICn(S) Ra SE: Random Subspace Ensemble Classification Cn cn Dκ1(Dκ3 + Dκ4) + p DDκ1 exp n Cn cn D2κ1+κ2 + p D exp Cn cn which is because cn . This completes the proof. B.10 Proof of Theorem 18 Lemma 28 For arbitrary ϵ (0, m 1), we have the following conclusions: (i) P( (ˆµ(1) ˆµ(0)) (µ(1) µ(0)) > ϵ) p exp{ Cnϵ2}, r = 0, 1; sup S1:|S1| D S2:|S2| D ˆΣS1,S2 ΣS1,S2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ sup S:|S| D ˆΣS,S ΣS,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ sup S:|S| D ˆΣ 1 S,S Σ 1 S,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ Proof [Proof of Lemma 28] For (i), because Σ max M, for any j = 1, . . . , p and r = 0, 1, ˆµ(r) j µ(r) j is a M-sub Gaussian variable. By the tail bound and union bound, we have P( ˆµ(r) µ(r) > ϵ) j=1 P(|ˆµ(r) j µ(r) j | > ϵ) p exp{ Cnϵ2}, r = 0, 1, which leads to (i). Denote Σ = (σij)p p, ˆΣ = (ˆσij)p p. To show (ii), similar to Bickel et al. (2008), we have P max i,j |ˆσij σij| > ϵ p2 exp{ Cnϵ2} + p exp{ Cnϵ}. 
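The entrywise covariance concentration just invoked (following Bickel et al., 2008) predicts that $\max_{i,j}|\hat\sigma_{ij} - \sigma_{ij}|$ shrinks at rate roughly $\sqrt{\log p / n}$. A quick Monte-Carlo sketch (illustrative dimensions and covariance, fixed seed; not a proof, just a sanity check of the rate):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
Sigma = np.eye(p) + 0.25 * np.ones((p, p))   # illustrative true covariance

def max_entry_error(n):
    """Max entrywise error of the MLE covariance from n Gaussian samples."""
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Xc = X - X.mean(axis=0)
    Sigma_hat = Xc.T @ Xc / n
    return np.abs(Sigma_hat - Sigma).max()

errs = [max_entry_error(n) for n in (100, 10_000)]
assert errs[1] < 0.1   # loose check: at n = 10,000 the max entry error is small
```

The union bound over all $\binom{p}{\le D}$ subspaces in Lemma 28 then converts this entrywise control into uniform control over all low-dimensional submatrices $\hat\Sigma_{S,S}$.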
And it yields sup S1:|S1| D S2:|S2| D ˆΣS1,S2 ΣS1,S2 > ϵ sup S1:|S1| D S2:|S2| D j S2 |ˆσij σij| > ϵ P D max i,j |ˆσij σij| > ϵ p2 exp Cn ϵ 2 + p exp n Cn ϵ Tian and Feng Since ˆΣS,S ΣS,S is symmetric, we have ˆΣS,S ΣS,S 2 ˆΣS,S ΣS,S (Bickel et al. (2008)). For (iv), firstly because the operator norm is sub-multiplicative, we have ˆΣ 1 S,S Σ 1 S,S 2 = ˆΣ 1 S,S(ˆΣS,S ΣS,S)Σ 1 S,S 2 ˆΣ 1 S,S 2 ˆΣS,S ΣS,S 2 Σ 1 S,S 2 ( ˆΣ 1 S,S Σ 1 S,S 2 + Σ 1 S,S 2) ˆΣS,S ΣS,S 2 Σ 1 S,S 2, ˆΣ 1 S,S Σ 1 S,S 2 Σ 1 S,S 2 2 ˆΣS,S ΣS,S 2 1 ˆΣS,S ΣS,S 2 Σ 1 S,S 2 1 m2 ˆΣS,S ΣS,S 2 1 1 m ˆΣS,S ΣS,S 2 . (28) Then we obtain that sup S:|S| D ˆΣ 1 S,S Σ 1 S,S 2 > ϵ P ˆΣS,S ΣS,S 2 > 1 m2 ˆΣS,S ΣS,S 2 > ϵ sup S:|S| D ˆΣS,S ΣS,S 2 > m2 Then by applying (iii), we get (iv) immediately. Lemma 29 Define δS = Σ 1 S,S(µ(1) S µ(0) S ) for any subset S. For S = S S , where S S = , we have δ S = Σ 1 S, S(µ(1) S µ(0) S ) = Σ 1 S, S µ(1) S µ(0) S µ(1) S µ(0) S Proof [Proof of Lemma 29] First, when S = SFull, we know that δ S = Σ 1 S, S(µ(1) S µ(0) S ) = Σ 1 S, S µ(1) S µ(0) S µ(1) S µ(0) S Then combined with the matrix decomposition (ΣS,S ΣS,S Σ 1 S ,S ΣS ,S) 1 Σ 1 S,SΣS,S (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1 Σ 1 S ,S ΣS ,S(ΣS,S ΣS,S Σ 1 S ,S ΣS ,S) 1 (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1 it can be noticed that (ΣS,S ΣS,S Σ 1 S ,S ΣS ,S) 1(µ(1) S µ(0) S ) = Σ 1 S,SΣS,S (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1(µ(1) S µ(0) S ). Ra SE: Random Subspace Ensemble Classification Also since there holds that (ΣS ,S ΣS ,SΣ 1 S,SΣS,S ) 1 = Σ 1 S ,S Σ 1 S ,S ΣS ,S( ΣS,S+ΣS,S Σ 1 S ,S ΣS ,S) 1ΣS,S Σ 1 S ,S , it can be easily verified that the * part is actually δS . Therefore the conclusion holds with S = SFull. For general S S , notice that SFull = (SFull\ S) S, with the same procedure, we can obtain that δSFull = Σ 1(µ(1) µ(0)) = 0 δ S δSFull = 0 δS we reach the conclusion. Lemma 30 For S which satisfies S S , let S = S \S, then Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S 2 γm. 
Proof [Proof of Lemma 30] Let S = S S , due to Lemma 29, it s easy to see that there holds µ(1) S µ(0) S µ(1) S µ(0) S (ΣS,S ΣS, S Σ 1 S , S Σ S ,S) 1 Σ 1 S,SΣS, S (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1 Σ 1 S , S Σ S ,S(ΣS,S ΣS, S Σ 1 S , S Σ S ,S) 1 (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1 (30) and δ S consists of several components of δS . Denote c = Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S . Since Σ 1 S, S is symmetric, combining with (29) we have (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1c = δ S , leading to (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1c 2 = δ S 2 γ, by Assumption 4.(iii). For the left-hand side: (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1c 2 c 2 (Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1 2 c 2 Σ 1 S, S 2 c 2 λ 1 min(Σ S, S) Tian and Feng which leads to c 2 γm. Lemma 31 For LDA, RIC(S) = µ(1) S µ(0) S T Σ 1 S,S µ(1) S µ(0) S . It satisfies the following conclusions: (i) If S S , then RIC(S) = RIC(S ); (ii) For S = S S, we have RIC( S) = RIC(S) h Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S i T Σ S , S Σ S ,SΣ 1 S,SΣS, S h Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S i (iii) It holds inf S:|S| D S S RIC(S) RIC(S ) m3γ2. Proof [Proof of Lemma 31] (i) First let s suppose (ii) is correct. For S S , consider SFull = S S and S S = , then by sparsity condition, we have = Σ 1(µ(1) µ(0)) = Σ 1 µ(1) S µ(0) S µ(1) S µ(0) S Also by basic algebra, there holds (ΣS,S ΣS, S Σ 1 S , S Σ S ,S) 1 = Σ 1 S,S Σ 1 S,SΣS, S ( Σ S , S +Σ S ,SΣ 1 S,SΣS, S ) 1Σ S ,SΣ 1 S,S. (31) Combined with (30) and (31), it can be obtained that Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S = 0, yielding that RIC(SFull) = RIC(S). Since the same procedure can be conducted for arbitrary S S , we complete the proof. (ii) This can be directly calculated and simplified by applying (30) and (31). 
(iii) It s easy to see that λmin(Σ S , S Σ S ,SΣ 1 S,SΣS, S ) = λ 1 max((Σ S , S Σ S ,SΣ 1 S,SΣS, S ) 1) λ 1 max(Σ 1 S, S) = λmin(Σ S, S) Ra SE: Random Subspace Ensemble Classification Then by (ii) and Lemma 30, it holds that RIC( S) RIC(S) Σ S ,SΣ 1 S,S µ(1) S µ(0) S µ(1) S µ(0) S 2 2 λmin(Σ S , S Σ S ,SΣ 1 S,SΣS, S ) which completes the proof. Lemma 32 If Assumption 4 holds, for ϵ smaller than some constant and n, p larger than some constants, we have sup S:|S| D | d RIC(S) RIC(S)| > ϵ p2 exp Cn ϵ Proof [Proof of Lemma 32] By Lemma 28, when ϵ < m 1, there holds that sup S:|S| D | d RIC(S) RIC(S)| > ϵ sup S:|S| D ˆµ(1) S ˆµ(0) S T ˆΣ 1 S,S ˆµ(1) S ˆµ(0) S µ(1) S µ(0) S T Σ 1 S,S µ(1) S µ(0) S > ϵ sup S:|S| D ˆµ(1) S ˆµ(0) S T ˆΣ 1 S,S Σ 1 S,S ˆµ(1) S ˆµ(0) S + ˆµ(1) S ˆµ(0) S µ(1) S + µ(0) S T Σ 1 S,S ˆµ(1) S ˆµ(0) S + ˆµ(1) S ˆµ(0) S µ(1) S + µ(0) S T Σ 1 S,S µ(1) S µ(0) S > ϵ sup S:|S| D ˆµ(1) S ˆµ(0) S 2 ˆΣ 1 S,S Σ 1 S,S 2 + (ˆµ(1) S ˆµ(0) S ) (µ(1) S µ(0) S ) 2 Σ 1 S,S 2 ˆµ(1) S ˆµ(0) S 2 + µ(1) S µ(0) S 2 (3M )2D sup S:|S| D ˆΣ 1 S,S Σ 1 S,S 2 > ϵ sup S:|S| D (ˆµ(1) S ˆµ(0) S ) (µ(1) S µ(0) S ) > M Tian and Feng sup S:|S| D (ˆµ(1) S ˆµ(0) S ) (µ(1) S µ(0) S ) 2 m 1 3 p2 exp Cn ϵ 2 + p exp n Cn ϵ o + p exp { Cn} + p exp Cn ϵ p2 exp Cn ϵ Then the following steps to prove Theorem 18 are analogous to what we did to prove Theorem 15. B.11 Proof of Theorem 20 Denote ˆΩ(r) S,S = (ˆΣ(r) S,S) 1, r = 0, 1. Then we denote T(S) = Tr h (Ω(1) S,S Ω(0) S,S)(π1Σ(1) S,S π0Σ(0) S,S) i + (π1 π0)(log |Σ(1) S,S| log |Σ(0) S,S|), D(S) = (µ(1) S µ(0) S )T h π1Ω(0) S,S + π0Ω(1) S,S i (µ(1) S µ(0) S ), RIC(S) = 2[T(S) D(S)]. And their sample versions by MLEs are ˆT(S) = Tr h (ˆΩ(1) S,S ˆΩ(0) S,S)(ˆπ1 ˆΣ(1) S,S ˆπ0 ˆΣ(0) S,S) i + (ˆπ1 ˆπ0)(log |ˆΣ(1) S,S| log |ˆΣ(0) S,S|), ˆD(S) = (ˆµ(1) S ˆµ(0) S )T h ˆπ1Ω(0) S,S + ˆπ0Ω(1) S,S i (ˆµ(1) S ˆµ(0) S ), d RIC(S) = 2[ ˆT(S) ˆD(S)]. Similar to Lemma 28, we have the following lemma holds. 
Lemma 33 For arbitrary ϵ (0, m 1), we have conclusions in the follows for r = 0, 1: (i) P( (ˆµ(1) ˆµ(0)) (µ(1) µ(0)) > ϵ) p exp{ Cnϵ2}; sup S:|S| D ˆΣ(r) S,S Σ(r) S,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ sup S:|S| D ˆΩ(r) S,S Ω(r) S,S 2 > ϵ p2 exp n Cn ϵ D 2o + p exp Cn ϵ Also, the lemmas in the follows are useful to prove Theorem 20 as well. Lemma 34 There hold the following conditions: Ra SE: Random Subspace Ensemble Classification (i) For S = S {j}, k(r) j,S = Σ(r) j,j Σ(r) j,S(Σ(r) S,S) 1Σ(r) S,j, r = 0, 1, where j / S, we have: T( S) T(S) = h Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j i T k(0) j,S Σ(0) S,S + π1 k(1) j,S Σ(1) S,S h Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j i + 1 π0 k(1) j,S k(0) j,S π1 k(0) j,S k(1) j,S + (π1 π0) log k(1) j,S k(0) j,S And for S = S S where S S = , there holds D( S) D(S) = π1 h Σ(0) S ,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i T Σ(0) S , S Σ(0) S ,SΩ(0) S,SΣ(0) S, S h Σ(0) S ,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i + π0 h Σ(1) S ,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i T Σ(1) S , S Σ(1) S ,SΩ(1) S,SΣ(1) S, S h Σ(1) S ,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) S µ(0) S ) i . (ii) Further, (i) implies the monotonicity of T, D, RIC in the following sense: If S1 S2, then T(S1) T(S2), D(S1) D(S2), which leads to RIC(S1) RIC(S2). (iii) Define S l = {j : [(Σ(0)) 1µ(0) (Σ(1)) 1µ(1)]j = 0}, S q = {j : [(Σ(1)) 1 (Σ(0)) 1]ij = 0, i}, then: (a) If S S q, then T(S) = T(S q); (b) If S S , then D(S) = D(S ) = D(S l ). Proof [Proof of Lemma 34] (i) (33) is obvious due to Lemma 31. Now let s prove (32). It s easy to see that T( S) = Tr h (Ω(1) S, S Ω(0) S, S)(π1Σ(1) S, S π0Σ(0) S, S) i + (π1 π0)(log |Σ(1) S, S| log |Σ(0) S, S|) = |S| π1Tr h Ω(0) S, SΣ(1) S, S i π0Tr h Ω(1) S, SΣ(0) S, S i + (π1 π0)(log |Σ(1) S, S| log |Σ(0) S, S|). 
Ω(r) S, S = Ω(r) S,S + 1 k(r) j,S Ω(r) S,SΣ(r) S,jΣ(r) j,SΩ(r) S,S 1 k(r) j,S Ω(r) S,SΣ(r) S,j 1 k(r) j,S Σ(r) j,SΩ(r) S,S 1 k(r) j,S Tian and Feng for r = 0, 1, we have Tr Ω(1) S, SΣ(0) S, S Ω(1) S,SΣ(0) S,S + 1 k(1) j,S Ω(1) S,SΣ(1) S,jΣ(1) j,SΩ(1) S,SΣ(0) S,S 1 k(1) j,S Ω(1) S,SΣ(1) S,jΣ(0) j,S k(1) j,S Σ(1) j,SΩ(1) S,SΣ(0) S,j + Σ(0) j,j k(1) j,S = Tr[Ω(1) S,SΣ(0) S,S] + 1 k(1) j,S Σ(1) j,SΩ(1) S,SΣ(0) S,SΩ(1) S,SΣ(1) S,j 2 k(1) j,S Σ(1) j,SΩ(1) S,SΣ(0) S,j + Σ(0) j,j k(1) j,S , (36) where the last equality follows from the fact that Tr(AB) = Tr(BA) if A and B are square matrices with the same dimension. Similarly we have Tr Ω(0) S, SΣ(1) S, S = Tr[Ω(0) S,SΣ(1) S,S]+ 1 k(0) j,S Σ(0) j,SΩ(0) S,SΣ(1) S,SΩ(0) S,SΣ(0) S,j 2 k(0) j,S Σ(0) j,SΩ(0) S,SΣ(1) S,j + Σ(1) j,j k(0) j,S . (37) Combining (34), (36), (37), the fact that Σ(r) j,j = k(r) j,S + Σ(r) j,SΩ(r) S,SΣ(r) S,j, |Σ(r) S, S| = |Σ(r) S,S| |k(r) j,S|, r = 0, 1 and T(S) = |S| π1Tr Ω(0) S,SΣ(1) S,S π0Tr Ω(1) S,SΣ(0) S,S +(π1 π0)(log |Σ(1) S,S| log |Σ(0) S,S|), (32) is obtained. (ii) For the monotonicity, since π0 log k(1) j,S k(0) j,S = π0 log k(0) j,S k(1) j,S k(0) j,S k(1) j,S 1 and π1 log k(1) j,S k(0) j,S k(1) j,S k(0) j,S 1 , we have 1 + (π1 π0) log k(1) j,S k(0) j,S k(1) j,S k(0) j,S π1 k(0) j,S k(1) j,S k(0) j,S k(1) j,S 1 k(1) j,S k(0) j,S 1 k(1) j,S k(0) j,S π1 k(0) j,S k(1) j,S implying that T( S) T(S). And it s easy to see that Σ(r) S , S Σ(r) S ,SΩ(r) S,SΣ(r) S, S is positive-definite, thus D( S) D(S). Ra SE: Random Subspace Ensemble Classification And we also have RIC( S) RIC(S). By induction we will obtain the monotonicity. (iii) Denote Ω(r) = (Σ(r)) 1, r = 0, 1. Consider full feature space SFull S . Let s remove one feature which does not belong to S q from SFull. Again, without loss of generality (in fact, we can always switch this feature with the last one), suppose we are removing the last feature j. That is, SFull = S {j}. 
By sparsity: Ω(1) Ω(0) = 0 0T 0 Plugging (35) into above and by simplification we can obtain that k(0) j,S = k(1) j,S, (39) Ω(0) S,SΣ(0) S,j = Ω(1) S,SΣ(1) S,j, (40) By Lemma 34, T(SFull) = T(S). Besides, this also implies that Ω(1) Ω(0) = Ω(1) S,S Ω(0) S,S 0 0T 0 By induction, it can be seen that for all subspace S S q, T(S) = T(S q). Then again consider SFull = S {j} but j / S , therefore by sparsity, in addition to (38), we also have Ω(1)µ(1) Ω(0)µ(0) = 0 which together with (38) leads to Σ(0) j,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ) = 0, Σ(1) j,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ) = 0. By Lemma 34, D(SFull) = D(S). And it holds that Ω(1)µ(1) Ω(0)µ(0) = Ω(1) S,Sµ(1) S Ω(0) S,Sµ(0) S 0 Again by induction, it can be seen that for any subspace S S , D(S) = D(S ). Lemma 35 It holds that inf j S q inf S:S \S={j} |S| D+p T(S) T(S ) m2 4M γq min mγq, 1 Tian and Feng Proof [Proof of Lemma 35] Suppose S \S = {j}, S = S {j} S and j S q. Due to Lemma 34, equation (35) and Assumption 5, we have either (k(1) j,S) 1 (k(0) j,S) 1 γq or (k(1) j,S) 1Ω(1) S,SΣ(1) S,j (k(0) j,S) 1Ω(0) S,SΣ(0) S,j γq holds. And it s easy to notice that for r = 0, 1, (k(r) j,S) 1 Ω(r) S, S 2 = λ 1 min(Σ(r) S, S) λ 1 min(Σ(r)) m 1, k(r) j,S = Σ(r) j,j Σ(r) j,SΩ(r) S,SΣ(r) S,j Σ(r) j,j M, which implies that m (k(r) j,S) M, r = 0, 1. (i) If (k(1) j,S) 1 (k(0) j,S) 1 γq, then we have k(0) j,S k(1) j,S 1 = k(0) j,S (k(1) j,S) 1 (k(0) j,S) 1 mγq, leading to k(0) j,S k(1) j,S 1 log k(0) j,S k(1) j,S mγq log(1 + mγq). When mγq 1, we know that mγq log(1 + mγq) 1 When mγq > 1, it holds that mγq log(1 + mγq) 1 Thus, we have k(0) j,S k(1) j,S 1 log k(0) j,S k(1) j,S 6 min{mγq, (mγq)2}. And the same result holds for k(1) j,S k(0) j,S as well. Therefore by (32), we have T(S) T( S) = T(S) T(S ) k(0) j,S k(1) j,S 1 log k(0) j,S k(1) j,S k(1) j,S k(0) j,S 1 log k(1) j,S k(0) j,S 6 min{mγq, (mγq)2}. 
Since this holds for arbitrary j S q and S satisfying S \S = {j}, we obtain that inf j S q inf S:S \S={j} |S| D+p T(S) T(S ) 1 6 min{mγq, (mγq)2}. (41) Ra SE: Random Subspace Ensemble Classification (ii) If (k(1) j,S) 1Ω(1) S,SΣ(1) S,j (k(0) j,S) 1Ω(0) S,SΣ(0) S,j γq, when (k(1) j,S) 1 (k(0) j,S) 1 m 2M γq, similar to (i), we have T(S) T( S) 1 γq (k(1) j,S) 1Ω(1) S,SΣ(1) S,j (k(0) j,S) 1Ω(0) S,SΣ(0) S,j (k(1) j,S) 1 (k(0) j,S) 1 Ω(1) S,SΣ(1) S,j + (k(0) j,S) 1 Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j . By (35), (k(1) j,S) 1Ω(1) S,SΣ(1) S,j is part of Ω(1) S, S, then M 1 Ω(1) S,SΣ(1) S,j (k(1) j,S) 1Ω(1) S,SΣ(1) S,j Ω(1) S, S max Ω(1) S, S yielding Ω(1) S,SΣ(1) S,j M Then we have Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j 2 Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j 1 T(S) T( S) Ω(1) S,SΣ(1) S,j Ω(0) S,SΣ(0) S,j 2 2 λmin π0(k(1) j,S) 1Σ(0) S,S + π1(k(0) j,S) 1Σ(1) S,S 2 π0M 1λmin(Σ(0) S,S) + π1M 1λmin(Σ(1) S,S) m3γ2 q 4M . (43) Combining (41), (42) and (43), we complete the proof. Lemma 36 It holds that inf j S l \S q inf S:S \S={j} |S| D+p [D(S ) D(S)] m3γ2 l . Proof For any j S l \S q and S satisfying S \S = {j}, let S = S {j}, it s easy to see that there holds µ(1) S µ(1) j µ(0) S µ(0) j Tian and Feng Denote c = Σ(1) j,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ). By (39), (40), we can obtain Σ(0) j,SΩ(0) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ) = Σ(1) j,SΩ(1) S,S(µ(1) S µ(0) S ) (µ(1) j µ(0) j ). Combining with (44), we have m 1|c| (k(1) j,S) 1|c| = |δ j| γl, because δ j is the corresponding component of δS with respect to feature j. We obtain that |c| mγl, which leads to D(S ) D(S) = D( S) D(S) = k(1) j,S|c|2 m3γ2 l . 
Lemma 37 It holds that inf S:S S |S| D RIC(S) RIC(S ) min m2 4M γq min mγq, 1 Proof [Proof of Lemma 37] Because of Lemmas 35 and 36, it suffices to prove inf S:S S |S| D RIC(S) RIC(S ) inf j S inf S:S \S={j} |S| D+p RIC(S) RIC(S ) inf j S q inf S:S \S={j} |S| D+p [T(S) T(S )], inf j S l \S q inf S:j / S |S| D+p [D(S ) D(S)] For any subset S which does not cover S and |S| D, consider feature j S \S and S = S S \{j} S. It s easy to notice that | S| D + p . By the monotonicity proved in Lemma 34, we know that RIC( S) RIC(S). In addition, we know that RIC( S) inf j S inf S:S \S={j} |S| D+p RIC(S), which yields the first inequality. For the second inequality, it directly comes from Lemma 34.(iii). Lemma 38 If Assumption 5 holds, for ϵ smaller than some constant and n, p larger than some constants, we have sup S:|S| D | d RIC(S) RIC(S)| > ϵ p2 exp Cn ϵ Ra SE: Random Subspace Ensemble Classification Proof [Proof of Lemma 38] Recall that RIC(S) = 2[T(S) D(S)], d RIC(S) = 2[ ˆT(S) ˆD(S)]. Denote T1(S) = Tr h Ω(1) S,S Ω(0) S,S π1Σ(1) S,S π0Σ(0) S,S i , T2(S) = (π1 π0)(log |Σ(1) S,S| log |Σ(0) S,S|), then T(S) = T1(S) + T2(S). And since for any S with |S| D, we have ˆT1(S) T1(S) = Tr h ˆΩ(1) S,S ˆΩ(0) S,S ˆπ1 ˆΣ(1) S,S ˆπ0 ˆΣ(0) S,S Ω(1) S,S Ω(0) S,S π1Σ(1) S,S π0Σ(0) S,S i D ˆΩ(1) S,S ˆΩ(0) S,S ˆπ1 ˆΣ(1) S,S ˆπ0 ˆΣ(0) S,S Ω(1) S,S Ω(0) S,S π1Σ(1) S,S π0Σ(0) S,S 2 D Ω(1) S,S Ω(0) S,S 2 h |ˆπ1 π1| Σ(1) S,S 2 + Σ(0) S,S 2 + ˆΣ(1) S,S Σ(1) S,S 2 + ˆΣ(0) S,S Σ(0) S,S 2 i + D π1Σ(1) S,S π0Σ(0) S,S 2 ˆΩ(1) S,S Ω(1) S,S 2 + ˆΩ(0) S,S Ω(0) S,S 2 2Dm 1 h |ˆπ1 π1| 2M + ˆΣ(1) S,S Σ(1) S,S 2 + ˆΣ(0) S,S Σ(0) S,S 2 + DM ˆΩ(1) S,S Ω(1) S,S 2 + ˆΩ(0) S,S Ω(0) S,S 2 where the last inequality comes from Σ(r) S,S 2 M and Ω(1) S,S Ω(0) S,S 2 2m 1. 
And this yields sup S:|S| D ˆT1(S) T1(S) > ϵ P 4Dm 1M|ˆπ1 π1| > ϵ 2Dm 1 sup S:|S| D ˆΣ(1) S,S Σ(1) S,S 2 > ϵ 2Dm 1 sup S:|S| D ˆΣ(0) S,S Σ(0) S,S 2 > ϵ DM sup S:|S| D ˆΩ(1) S,S Ω(1) S,S 2 > ϵ DM sup S:|S| D ˆΩ(0) S,S Ω(0) S,S 2 > ϵ p2 exp Cn ϵ On the other hand, we have log |ˆΣ(1) S,S| λi(ˆΣ(1) S,S) λi(Σ(1) S,S) λi(ˆΣ(1) S,S) λi(Σ(1) S,S) 1 λi(ˆΣ(1) S,S) λi(Σ(1) S,S) 1 By Weyl s inequality, λi(ˆΣ(1) S,S) λi(Σ(1) S,S) λi(Σ(1) S,S) ˆΣ(1) S,S Σ(1) S,S 2 . Tian and Feng Therefore there exists ϵ > 0 such that when ˆΣ(1) S,S Σ(1) S,S 2 ϵ , we have |ˆΣ(1) S,S| ˆΣ(1) S,S Σ(1) S,S 2 + o D ˆΣ(1) S,S Σ(1) S,S 2 ˆΣ(1) S,S Σ(1) S,S 2 . Therefore, there holds sup S:|S| D ˆT2(S) T2(S) > ϵ sup S:|S| D 2 |ˆπ1 π1| log |Σ(1) S,S| + log |Σ(0) S,S| + log |ˆΣ(1) S,S| |ˆΣ(0) S,S| 2 |ˆπ1 π1| sup S:|S| D h log λi Σ(1) S,S + log λi Σ(1) S,S i > ϵ sup S:|S| D log |ˆΣ(1) S,S| sup S:|S| D log |ˆΣ(0) S,S| P 2D max{| log M|, | log m|} |ˆπ1 π1| > ϵ m sup S:|S| D ˆΣ(1) S,S Σ(1) S,S 2 > ϵ m sup S:|S| D ˆΣ(0) S,S Σ(0) S,S 2 > ϵ sup S:|S| D ˆΣ(1) S,S Σ(1) S,S 2 > ϵ ! sup S:|S| D ˆΣ(0) S,S Σ(0) S,S 2 > ϵ ! p2 exp Cn ϵ Thus we have sup S:|S| D ˆT(S) T(S) > ϵ p2 exp Cn ϵ Besides, following the same strategy in the proof of Lemma 32, it can be shown that sup S:|S| D ˆD(S) D(S) > ϵ p2 exp Cn ϵ By (45) and (46), we complete the proof. By all the above lemmas, and following similar idea in the proof of Theorem 15, we can prove the consistency stated in the theorem. Ra SE: Random Subspace Ensemble Classification B.12 Proof of Theorem 21 Firstly since taking subspace can also be seen as an axis-aligned projection, applying Theorem 2 in Cannings and Samworth (2017) we have E[R(CRa SE n ) R(CBayes)] E[R(CS1 n ) R(CBayes)] min(α, 1 α) . (47) Then it can be easily noticed that E{E[R(CS1 n ) R(CBayes)]} E{E[(R(CS1 n ) R(CBayes))1(S1 S )]} + (1 R(CBayes))P(S1 S ) E sup S:S S |S| D R(CS n ) R(CBayes) + P(S1 S ). (48) Combining (47) and (48), we obtain the conclusion. 
B.13 Proof of Theorem 22
This conclusion is almost the same as Theorem 2 of Cannings and Samworth (2017). However, since we are studying the discrete space of subspaces, and discriminative sets are the ideal subspaces here, we can drop Assumptions 2 and 3 of their paper and obtain a similar upper bound. First, we have
\[
\mathbb{E}\bigl[R(C^{\mathrm{RaSE}}_n)\bigr]-R(C^{\mathrm{Bayes}})\ \le\ \frac{\mathbb{E}\bigl[R(C^{S^*_1}_n)\bigr]-R(C^{\mathrm{Bayes}})}{\min(\alpha,1-\alpha)} \tag{49}
\]
by Cannings and Samworth (2017). Then write
\[
\mathbb{E}\bigl[R(C^{S^*_1}_n)\bigr]=\mathbb{E}\bigl[R_n(C^{S^*_1}_n)\bigr]+\epsilon_n, \tag{50}
\]
where $\epsilon_n=\mathbb{E}[R(C^{S^*_1}_n)]-\mathbb{E}[R_n(C^{S^*_1}_n)]$. Then we have
\begin{align*}
\mathbb{E}\bigl[R_n(C^{S^*_1}_n)\bigr]
&\le\sup R_n(C^{S}_n)+\mathbb{E}\Bigl[R_n(C^{S^*_1}_n)\,\mathbf{1}\Bigl(R_n(C^{S^*_1}_n)>\sup R_n(C^{S}_n)\Bigr)\Bigr]\\
&\le\sup R_n(C^{S}_n)+\mathbb{P}\Bigl(R_n(C^{S^*_1}_n)>\sup R_n(C^{S}_n)\Bigr)\\
&=\sup R_n(C^{S}_n)+\Bigl[\mathbb{P}\Bigl(R_n(C^{S_{11}}_n)>\sup R_n(C^{S}_n)\Bigr)\Bigr]^{B_2}\\
&\le\sup R_n(C^{S}_n)+(1-p_{S^*})^{B_2}\\
&\le\sup R(C^{S}_n)+\sup|\epsilon^{S}_n|+(1-p_{S^*})^{B_2}, \tag{51}
\end{align*}
where $\epsilon^{S}_n=R(C^{S}_n)-R_n(C^{S}_n)$ and all suprema are taken over $\{S:\,S\supseteq S^*,\ |S|\le D\}$. Combining (49), (50) and (51), we obtain
\[
\mathbb{E}\bigl\{\mathbb{E}\bigl[R(C^{\mathrm{RaSE}}_n)-R(C^{\mathrm{Bayes}})\bigr]\bigr\}
\ \le\ \frac{\mathbb{E}\bigl[\sup\bigl(R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr)\bigr]+\mathbb{E}\bigl[\sup|\epsilon^{S}_n|\bigr]+\mathbb{E}(\epsilon_n)+(1-p_{S^*})^{B_2}}{\min(\alpha,1-\alpha)}.
\]

B.14 Proof of Proposition 23
First we prove the following lemma.

Lemma 39 (Devroye et al., 2013) For a non-negative random variable $z$ satisfying $\mathbb{P}(z>t)\le C_1\exp\{-C_2nt^{\alpha}\}$ for any $t>0$, where $C_1>1$, $C_2>0$ and $\alpha\ge 1$ are three fixed constants, we have
\[
\mathbb{E}z\ \le\ \Bigl(\frac{\log C_1+1}{C_2n}\Bigr)^{1/\alpha}.
\]
Proof [Proof of Lemma 39] It is easy to see that
\[
\mathbb{E}z^{\alpha}=\int_0^{\infty}\mathbb{P}(z^{\alpha}>t)\,dt\ \le\ \epsilon+C_1\int_{\epsilon}^{\infty}\exp\{-C_2nt\}\,dt=\epsilon+\frac{C_1}{C_2n}\exp\{-C_2n\epsilon\},
\]
hence
\[
\mathbb{E}z^{\alpha}\ \le\ \inf_{\epsilon>0}\Bigl[\epsilon+\frac{C_1}{C_2n}\exp\{-C_2n\epsilon\}\Bigr]=\frac{\log C_1+1}{C_2n}.
\]
Then by Jensen's inequality, it holds that
\[
\mathbb{E}z\ \le\ (\mathbb{E}z^{\alpha})^{1/\alpha}\ \le\ \Bigl(\frac{\log C_1+1}{C_2n}\Bigr)^{1/\alpha},
\]
which completes the proof.

Now let us prove the proposition. Notice that
\[
\Delta_S^2=\bigl|\delta_S^{T}\Sigma_{S,S}\delta_S\bigr|\ \ge\ \|\delta_S\|_2^2\,\lambda_{\min}(\Sigma_{S,S})\ \ge\ mp^*\gamma^2
\]
for any $S\supseteq S^*$. In addition, by the mean value theorem and (10), it follows that
\[
R(C^{S,\mathrm{LDA}}_n)-R(C^{\mathrm{Bayes}})\ \lesssim\ \bigl|\hat\Delta_S-\Delta_S\bigr|+\bigl|\hat\tau_S-\tau_S\bigr|.
\]
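As an aside (a sketch we add here, not part of the original argument), the bound of Lemma 39 can be sanity-checked numerically. For $\alpha=1$, the extremal tail $\mathbb{P}(z>t)=\min(1,C_1e^{-C_2nt})$ attains the bound with equality, since $\mathbb{E}z=\int_0^{\infty}\mathbb{P}(z>t)\,dt$. The function names below are ours.

```python
from math import exp, log

def tail(t, C1, C2, n):
    # Extremal survival function allowed by the assumption of Lemma 39 (alpha = 1).
    return min(1.0, C1 * exp(-C2 * n * t))

def expected_value(C1, C2, n, steps=200000, upper=1.0):
    # E z = integral of P(z > t) dt over [0, infinity), truncated at `upper`
    # (the tail is negligible beyond it here) and computed by the trapezoidal rule.
    h = upper / steps
    interior = sum(tail(i * h, C1, C2, n) for i in range(1, steps))
    return h * (interior + 0.5 * (tail(0.0, C1, C2, n) + tail(upper, C1, C2, n)))

C1, C2, n = 5.0, 2.0, 50
# Numerical E z versus the Lemma 39 bound (log C1 + 1) / (C2 n):
print(expected_value(C1, C2, n), (log(C1) + 1) / (C2 * n))
```

For this extremal tail the two printed values agree, illustrating that the constant in Lemma 39 cannot be improved in general.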
By Lemma 32, we know that
\[
\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\hat\Delta_S^2-\Delta_S^2\bigr|>\epsilon\Bigr)\ \lesssim\ p^2\exp\{-Cn\epsilon^2\}.
\]
Since $\hat\Delta_S+\Delta_S\ge\Delta_S\ge\sqrt{mp^*}\,\gamma$ for $S\supseteq S^*$, we have $|\hat\Delta_S-\Delta_S|\le|\hat\Delta_S^2-\Delta_S^2|/(\sqrt{mp^*}\,\gamma)$, and a chain of union bounds over the events on $\sup_{S}|\hat\Delta_S^2-\Delta_S^2|$, $|\hat\pi_0-\pi_0|$ and $|\hat\pi_1-\pi_1|$ yields
\[
\mathbb{P}\Bigl(\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S,\mathrm{LDA}}_n)-R(C^{\mathrm{Bayes}})\bigr]>\epsilon\Bigr)
\ \lesssim\ p^2\exp\Bigl\{-\frac{Cn}{D^2}\min\bigl\{\epsilon(p^*)^{3/2}\gamma^3,\ \epsilon(p^*)^{1/2}\gamma\bigr\}^2\Bigr\},
\]
which completes the proof by applying Lemma 39.

B.15 Proof of Proposition 24
Denote $\delta_S=\Omega^{(1)}_{S,S}\mu^{(1)}_S-\Omega^{(0)}_{S,S}\mu^{(0)}_S$, $\bar\Omega_{S,S}=\Omega^{(1)}_{S,S}-\Omega^{(0)}_{S,S}$, and
\[
d^*_S(x_S)=\log\frac{\pi_1}{\pi_0}-\frac{1}{2}x_S^{T}\bar\Omega_{S,S}x_S+\delta_S^{T}x_S-\frac{1}{2}(\mu^{(1)}_S)^{T}\Omega^{(1)}_{S,S}\mu^{(1)}_S+\frac{1}{2}(\mu^{(0)}_S)^{T}\Omega^{(0)}_{S,S}\mu^{(0)}_S.
\]
Their estimators are correspondingly denoted as $\hat\delta_S=\hat\Omega^{(1)}_{S,S}\hat\mu^{(1)}_S-\hat\Omega^{(0)}_{S,S}\hat\mu^{(0)}_S$, $\hat{\bar\Omega}_{S,S}=\hat\Omega^{(1)}_{S,S}-\hat\Omega^{(0)}_{S,S}$, and
\[
\hat d_S(x_S)=\log\frac{\hat\pi_1}{\hat\pi_0}-\frac{1}{2}x_S^{T}\hat{\bar\Omega}_{S,S}x_S+\hat\delta_S^{T}x_S-\frac{1}{2}(\hat\mu^{(1)}_S)^{T}\hat\Omega^{(1)}_{S,S}\hat\mu^{(1)}_S+\frac{1}{2}(\hat\mu^{(0)}_S)^{T}\hat\Omega^{(0)}_{S,S}\hat\mu^{(0)}_S.
\]
From the proof of Theorem 20, we know that $d^*_S(x_S)=d^*_{S^*}(x_{S^*})$ for any $S\supseteq S^*$. Denote the training data by $\mathcal{D}_{tr}$; then
\begin{align*}
R(C^{S}_n)&=\pi_0\,\mathbb{P}^{(0)}\bigl(\hat d_S(x_S)>0\,\big|\,\mathcal{D}_{tr}\bigr)+\pi_1\,\mathbb{P}^{(1)}\bigl(\hat d_S(x_S)\le 0\,\big|\,\mathcal{D}_{tr}\bigr),\\
R(C^{\mathrm{Bayes}})&=\pi_0\,\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)+\pi_1\,\mathbb{P}^{(1)}\bigl(d^*_{S^*}(x_{S^*})\le 0\bigr).
\end{align*}
Then we have
\begin{align*}
&\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\Bigl[\mathbb{P}^{(0)}\bigl(\hat d_S(x_S)>0\,\big|\,\mathcal{D}_{tr}\bigr)-\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)\Bigr]\\
&\qquad\le\sup\Bigl[\mathbb{P}^{(0)}\bigl(d^*_S(x_S)>d^*_S(x_S)-\hat d_S(x_S)\,\big|\,\mathcal{D}_{tr}\bigr)-\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)\Bigr]\\
&\qquad\le\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>-\epsilon\bigr)-\mathbb{P}^{(0)}\bigl(d^*_{S^*}(x_{S^*})>0\bigr)+\sup\,\mathbb{P}^{(0)}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\bigr)\\
&\qquad\le\int_{-\epsilon}^{0}h^{(0)}(z)\,dz+\sup\,\mathbb{P}^{(0)}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\bigr)\\
&\qquad\le\epsilon u_c+\sup\,\mathbb{P}^{(0)}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\bigr), \tag{52}
\end{align*}
for any $\epsilon\in(0,c)$, where the suprema are over $\{S:\,S\supseteq S^*,\ |S|\le D\}$. Denote
\[
a_S=(\Sigma^{(0)}_{S,S})^{-\frac{1}{2}}\Bigl[\Omega^{(1)}_{S,S}\bigl(\mu^{(1)}_S-\hat\mu^{(1)}_S\bigr)+\bigl(\Omega^{(1)}_{S,S}-\hat\Omega^{(1)}_{S,S}\bigr)\bigl(\hat\mu^{(1)}_S-\hat\mu^{(0)}_S\bigr)+\hat\Omega^{(0)}_{S,S}\bigl(\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr)\Bigr].
\]
Notice that
\begin{align*}
d^*_S(x_S)-\hat d_S(x_S)
&=\log\frac{\pi_1}{\pi_0}-\log\frac{\hat\pi_1}{\hat\pi_0}-\frac{1}{2}(x_S-\mu^{(0)}_S)^{T}\bigl(\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr)(x_S-\mu^{(0)}_S)+a_S^{T}(\Sigma^{(0)}_{S,S})^{-\frac{1}{2}}(x_S-\mu^{(0)}_S)\\
&\qquad+\frac{1}{2}(\mu^{(1)}_S-\mu^{(0)}_S)^{T}\Omega^{(1)}_{S,S}\bigl(\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr)-\frac{1}{2}(\hat\mu^{(0)}_S-\mu^{(0)}_S)^{T}\Omega^{(0)}_{S,S}\bigl(\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr).
\end{align*}
Further, denote $z_S=(\Sigma^{(0)}_{S,S})^{-\frac{1}{2}}(x_S-\mu^{(0)}_S)\sim N(0_{|S|},I_{|S|})$. Then it follows that
\begin{align*}
\bigl|d^*_S(x_S)-\hat d_S(x_S)\bigr|
&\le C|\hat\pi_1-\pi_1|+\frac{1}{2}\bigl\|\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr\|_2\bigl\|\Sigma^{(0)}_{S,S}\bigr\|_2\|z_S\|_2^2+\bigl|a_S^{T}z_S\bigr|\\
&\qquad+\frac{1}{2}\bigl\|\mu^{(1)}_S-\mu^{(0)}_S\bigr\|_2\bigl\|\Omega^{(1)}_{S,S}\bigr\|_2\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_2+\frac{1}{2}\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_2^2\bigl\|\Omega^{(0)}_{S,S}\bigr\|_2\\
&\le C|\hat\pi_1-\pi_1|+\frac{M}{2}\bigl\|\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr\|_2\|z_S\|_2^2+\bigl|a_S^{T}z_S\bigr|
+M\sqrt{D}m^{-1}\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_{\infty}+Dm^{-1}\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_{\infty}^2,
\end{align*}
and
\[
\|a_S\|_2\ \lesssim\ \bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_2+\bigl\|\Omega^{(1)}_{S,S}-\hat\Omega^{(1)}_{S,S}\bigr\|_2\bigl\|\hat\mu^{(1)}_S-\hat\mu^{(0)}_S\bigr\|_2+\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_2
\ \lesssim\ D^{\frac{1}{2}}\bigl(\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_{\infty}+\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_{\infty}\bigr)+\bigl\|\Omega^{(1)}_{S,S}-\hat\Omega^{(1)}_{S,S}\bigr\|_2\bigl\|\hat\mu^{(1)}_S-\hat\mu^{(0)}_S\bigr\|_2.
\]
For any $t,t'>0$, denote the event
\[
B=\Bigl\{C|\hat\pi_1-\pi_1|\le\epsilon/4,\ \sup_{S:|S|\le D}\bigl\|\hat{\bar\Omega}_{S,S}-\bar\Omega_{S,S}\bigr\|_2\le t,\ M\sqrt{D}m^{-1}\bigl\|\hat\mu^{(1)}_S-\mu^{(1)}_S\bigr\|_{\infty}\le\epsilon/8,\ Dm^{-1}\bigl\|\hat\mu^{(0)}_S-\mu^{(0)}_S\bigr\|_{\infty}^2\le\epsilon/8,\ \|a_S\|_2\le t'\Bigr\}.
\]
When $\|a_S\|_2\le t'$, $a_S^{T}z_S$ is $t'$-sub-Gaussian. This yields that
\[
\mathbb{P}\Bigl(\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl|d^*_S(x_S)-\hat d_S(x_S)\bigr|>\epsilon\ \Big|\ \mathcal{D}_{tr}\in B\Bigr)
\ \le\ \mathbb{P}\bigl(t\|z_S\|_2^2>\epsilon/4\bigr)+\mathbb{P}\bigl(|a_S^{T}z_S|>\epsilon/4\,\big|\,\|a_S\|_2\le t'\bigr),
\]
where $\epsilon$ satisfies $C\epsilon/t>D$ and the first term is controlled by the tail bound of the $\chi^2$ distribution,
\[
\mathbb{P}\bigl(\|z_S\|_2^2>C\epsilon/t\bigr)=\mathbb{P}\bigl(\|z_S\|_2^2>D+(C\epsilon/t-D)\bigr).
\]
Taking the expectation over the training data in (52), we have
\[
\mathbb{E}\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr]
\ \le\ \epsilon u_c+\mathbb{E}\Bigl[\sup\,\mathbb{P}\bigl(|d^*_S(x_S)-\hat d_S(x_S)|>\epsilon\,\big|\,\mathcal{D}_{tr}\in B\bigr)\Bigr]+\mathbb{P}(B^c). \tag{53}
\]
Take
\[
\epsilon=\Bigl(\frac{\log p}{n}\Bigr)^{\frac{1-\varpi}{2}},\qquad t=\Bigl(\frac{D\log p}{n}\Bigr)^{\frac{1}{2}},\qquad t'=D^{\frac{3}{2}}\Bigl(\frac{\log p}{n}\Bigr)^{\frac{1}{2}}
\]
for an arbitrary $\varpi\in(0,1/2)$ and plug them into (53); we obtain that
\[
\mathbb{E}\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr]
\ \lesssim\ \Bigl(\frac{\log p}{n}\Bigr)^{\frac{1-\varpi}{2}}+p^2\exp\bigl\{-Cn^{\varpi}(\log p)^{1-\varpi}\bigr\}+p\exp\bigl\{-CDn^{2\varpi}(\log p)^{1-\varpi}\bigr\},
\]
because $\log p\lesssim n^{\varpi_0}$ for some $\varpi_0\in(0,1)$. Therefore, due to (54), we have
\[
\mathbb{E}\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\bigl[R(C^{S}_n)-R(C^{\mathrm{Bayes}})\bigr]\ \lesssim\ \Bigl(\frac{D^2\log p}{n}\Bigr)^{\frac{1-\varpi}{2}}.
\]

B.16 Proof of Theorem 25
We first prove the following helpful lemma.

Lemma 40 For $H\sim\mathrm{Hypergeometric}(p,p^*,d)$, if $\min(d,p^*)\,p^*d/p=o(1)$ and $B_2=o\bigl((p/(p^*d))^{t+1}\bigr)$, where $t$ is a positive integer no larger than $d$, then $\bigl(\Pr(H\le t)\bigr)^{B_2}\to 1$ as $p\to\infty$.
Proof [Proof of Lemma 40] The cumulative distribution function of $H$ satisfies
\[
\Pr(H\le t)=1-\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\;{}_3F_2\!\left[\begin{matrix}1,\ t+1-p^*,\ t+1-d\\ t+2,\ p+t+2-p^*-d\end{matrix};1\right],
\]
where ${}_3F_2$ is the generalized hypergeometric function (Abadir, 1999). And we have
\[
{}_3F_2\!\left[\begin{matrix}1,\ t+1-p^*,\ t+1-d\\ t+2,\ p+t+2-p^*-d\end{matrix};1\right]
=\sum_{n=0}^{\min(d-t,\,p^*-t)}\frac{(1)_n\,(t+1-p^*)_n\,(t+1-d)_n}{(t+2)_n\,(p+t+2-p^*-d)_n}\cdot\frac{1}{n!},
\]
so that
\[
\Bigl|{}_3F_2[\,\cdots]-1\Bigr|
\ \le\ \sum_{n=1}^{\min(d-t,\,p^*-t)}\frac{(p^*-t-1)\cdots(p^*-t-n)\,(d-t-1)\cdots(d-t-n)}{(t+2)_n\,(p+t+2-p^*-d)\cdots(p+t+1-p^*-d+n)}
\ \le\ \min(d,p^*)\,O\Bigl(\frac{p^*d}{p}\Bigr),
\]
where $(a)_n:=a(a+1)\cdots(a+n-1)$ and $(a)_0:=1$ for any real number $a$ and positive integer $n$. On the other hand, we can also see that ${}_3F_2[\,\cdots]\ge 1$.

For convenience, denote $F={}_3F_2[\,\cdots]$; then by Taylor expansion, it holds that
\[
\bigl(\Pr(H\le t)\bigr)^{B_2}=\exp\Bigl\{B_2\log\Bigl(1-\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\,F\Bigr)\Bigr\}.
\]
In addition, we have
\[
\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}
\ \le\ \frac{d^{t+1}\,p^*(p^*-1)\cdots(p^*-t)}{(t+1)!\,(p-p^*+t+1)\cdots(p-p^*+1)}
\ \le\ \Bigl(\frac{p^*d}{p-p^*}\Bigr)^{t+1}\frac{1}{(t+1)!}.
\]
Since $p^*d/p=o(1)$ and $B_2=o\bigl((p/(p^*d))^{t+1}\bigr)$, we get
\[
B_2\,\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\,F=o(1),
\quad\text{leading to}\quad
B_2\log\Bigl(1-\frac{\binom{d}{t+1}\binom{p-d}{p^*-t-1}}{\binom{p}{p^*}}\,F\Bigr)=o(1),
\]
which implies the conclusion.

Next let us prove the original theorem. Without loss of generality, assume there is a positive constant $C$ such that at the first step of Algorithm 2, $B_2$ satisfies
\[
B_2\ \ge\ C\Bigl(\frac{p}{D-\tilde p^*+1}\Bigr)^{\tilde p^*},\qquad B_2=o\Bigl(\bigl(p/(p^*D)\bigr)^{\tilde p^*+1}\Bigr),
\]
and at the following steps it satisfies
\[
B_2\ \ge\ C\,\frac{(D+C_0)^{D+1}\,p^{\tilde p^*}(\log p)^{\tilde p^*}}{C_0^{D}\,(D-\tilde p^*+1)}.
\]
Notice that condition (i) in Assumption 6 implies that
\[
\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>M_n\nu(n,p,D)\Bigr)\to 0.
\]
We will prove the original theorem in three steps. Denote $\eta^{(t)}_j=\mathbb{P}(j\in S^{(t)*}_1)$, $\hat\eta^{(t)}_j=\frac{1}{B_1}\sum_{i=1}^{B_1}\mathbf{1}(j\in S^{(t)*}_i)$, $\tilde S^{(t)}=\{j:\hat\eta^{(t)}_j>C_0/\log p\}$, $\tilde S^{(t)}_*=\tilde S^{(t)}\cap S^*$, and $\tilde p^*_t=|\tilde S^{(t)}_*|$.

(i) Step 1: When $t=0$, due to the stepwise detectable condition, there exists a subset $\tilde S^{(0)*}\subseteq S^*$ with cardinality $\tilde p^*$ that satisfies the conditions.
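The coverage phenomenon that Lemma 40 formalizes can be checked numerically: for a single random subspace of size $d$, the probability of capturing at least $t$ of the $p^*$ signals is a hypergeometric tail, and unless $B_2$ grows polynomially in $p$, no subspace among the $B_2$ draws covers the signals. The following is a minimal sketch we add for illustration; the function names are ours.

```python
from math import comb

def tail_ge(p, p_star, d, t):
    """P(H >= t) for H ~ Hypergeometric(p, p_star, d): the number of
    signal features captured when d of p features are drawn uniformly."""
    return sum(
        comb(p_star, k) * comb(p - p_star, d - k)
        for k in range(t, min(d, p_star) + 1)
    ) / comb(p, d)

def coverage_prob(p, p_star, d, t, B2):
    """P(at least one of B2 independent subspaces captures >= t signals)."""
    return 1 - (1 - tail_ge(p, p_star, d, t)) ** B2

# With p_star, d, t fixed, the chance that any of B2 = 500 subspaces
# covers all 3 signals collapses as p grows:
for p in (100, 400, 1600):
    print(p, coverage_prob(p, p_star=3, d=10, t=3, B2=500))
```

This is exactly why the iterative version of the algorithm, which reuses the selection frequencies of earlier rounds, needs far fewer subspaces than the vanilla scheme.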
On the other hand,
\[
\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(0)}_{1k}\supseteq\tilde S^{(0)*}\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}\bigl(S^{(0)}_{11}\supseteq\tilde S^{(0)*}\bigr)\Bigr]^{B_2}
\ \ge\ 1-\Bigl(1-\Bigl(\frac{D-\tilde p^*+1}{p}\Bigr)^{\tilde p^*}\Bigr)^{B_2}
\ \ge\ 1-\exp\Bigl\{-B_2\Bigl(\frac{D-\tilde p^*+1}{p}\Bigr)^{\tilde p^*}\Bigr\}\ \ge\ 1-e^{-C}.
\]
Then we have
\begin{align*}
\mathbb{P}\bigl(S^{(0)*}_1\supseteq\tilde S^{(0)*}\bigr)
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(0)}_{1k}\supseteq\tilde S^{(0)*}\bigr\},\ \inf_{\substack{S:\,|S\cap S^*|\le\tilde p^*,\,|S|\le D\\ S\cap S^*\ne\tilde S^{(0)*}}}\mathrm{Cr}_n(S)-\sup_{\substack{S:\,S\supseteq\tilde S^{(0)*}\\|S|\le D}}\mathrm{Cr}_n(S)>0,\ \bigcap_{k=1}^{B_2}\bigl\{|S^{(0)}_{1k}\cap S^*|\le\tilde p^*\bigr\}\Bigr)\\
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(0)}_{1k}\supseteq\tilde S^{(0)*}\bigr\}\Bigr)
-\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>\tfrac{1}{2}M_n\nu(n,p,D)\Bigr)
-\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(0)}_{1k}\cap S^*|\ge\tilde p^*+1\bigr\}\Bigr) \tag{55}
\end{align*}
as $n$ is sufficiently large. The last inequality holds because there exists a variable $H\sim\mathrm{Hypergeometric}(p,p^*,D)$ such that
\[
\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(0)}_{1k}\cap S^*|\ge\tilde p^*+1\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\bigr)\Bigr]^{B_2}
\ \le\ 1-\bigl(\Pr(H\le\tilde p^*)\bigr)^{B_2}\ \to\ 0
\]
due to Lemma 40, where we used
\[
\mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\bigr)
=\sum_{d\le D}\mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\,\big|\,|S^{(0)}_{11}|=d\bigr)\mathbb{P}\bigl(|S^{(0)}_{11}|=d\bigr)
\ \le\ \mathbb{P}\bigl(|S^{(0)}_{11}\cap S^*|\ge\tilde p^*+1\,\big|\,|S^{(0)}_{11}|=D\bigr)=\Pr(H\ge\tilde p^*+1).
\]
Then for any $j\in\tilde S^{(0)*}$, for sufficiently large $n$ it holds that
\[
\eta^{(0)}_j=\mathbb{P}\bigl(j\in S^{(0)*}_1\bigr)\ \ge\ \mathbb{P}\bigl(S^{(0)*}_1\supseteq\tilde S^{(0)*}\bigr)\ \ge\ 1-\tfrac{3}{2}e^{-C}.
\]
And by Hoeffding's inequality (Petrov, 2012),
\[
\mathbb{P}\bigl(\hat\eta^{(0)}_j\le 1-2e^{-C}\bigr)\ \le\ \mathbb{P}\bigl(|\hat\eta^{(0)}_j-\eta^{(0)}_j|\ge\tfrac{1}{2}e^{-C}\bigr)\ \le\ 2\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
Following by union bounds, we can obtain
\[
\mathbb{P}\Bigl(\bigcap_{j\in\tilde S^{(0)*}}\bigl\{\hat\eta^{(0)}_j>1-2e^{-C}\bigr\}\Bigr)\ \ge\ 1-2\tilde p^*\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]

(ii) Step 2: When $t=1$, let us first condition on some specific $\tilde S^{(0)}$, defined before as $\{j:\hat\eta^{(0)}_j>C_0/\log p\}$, satisfying $|\tilde S^{(0)}_*|=\tilde p^*$. Later we will take the expectation to get the inequality without conditioning. To simplify and distinguish the notation, we omit the conditioning mentioned above and denote the corresponding conditional probabilities by $\mathbb{P}_c$. For any specific $\tilde S^{(0)}$, due to the stepwise detectable condition again, there exists a subset $\tilde S^{(1)*}\subseteq S^*$ with cardinality $\tilde p^*$ that satisfies the conditions. Similar to step 1, we have
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}_c\bigl(S^{(1)}_{11}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr)\Bigr]^{B_2}, \tag{56}
\]
and
\begin{align*}
\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr)
&=\sum_{d\le D}\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\,\big|\,|S^{(1)}_{1k}|=d\bigr)\,\mathbb{P}\bigl(|S^{(1)}_{1k}|=d\bigr)\\
&\ge\ \mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(1)*}\,\big|\,|S^{(1)}_{1k}|=d,\ S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)\\
&\qquad\times\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\,\big|\,|S^{(1)}_{1k}|=d\bigr). \tag{57}
\end{align*}
For convenience, let us consider a sequence $(j_1,\ldots,j_d)$ sampled from $\{1,\ldots,p\}$ without replacement with sampling weights $(\hat\eta^{(0)}_1,\ldots,\hat\eta^{(0)}_p)$, in this order. We use $\mathbb{P}_c(j_1,\ldots,j_d\,|\,|S^{(1)}_{1k}|=d)$ to represent the corresponding probability and use $\mathbb{P}_c(j_i\,|\,j_1,\ldots,j_{i-1},|S^{(1)}_{1k}|=d)$ to represent the conditional probability of sampling $j_i$ given $j_1,\ldots,j_{i-1}$. Based on the notation defined above, there holds
\begin{align*}
&\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\,\big|\,|S^{(1)}_{1k}|=d\bigr)\\
&\qquad=\sum_{\substack{(j_1,\ldots,j_d):\,\{j_1,\ldots,j_d\}\supseteq\tilde S^{(0)*}\\ \{j_1,\ldots,j_d\}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset}}\mathbb{P}_c\bigl(j_1\,\big|\,|S^{(1)}_{1k}|=d\bigr)\,\mathbb{P}_c\bigl(j_2\,\big|\,j_1,|S^{(1)}_{1k}|=d\bigr)\cdots\mathbb{P}_c\bigl(j_d\,\big|\,j_1,\ldots,j_{d-1},|S^{(1)}_{1k}|=d\bigr). \tag{58}
\end{align*}
Since for $j_i\in\tilde S^{(0)*}$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(1)}_{1k}|=d\bigr)\ \ge\ \mathbb{P}_c\bigl(j_i\,\big|\,|S^{(1)}_{1k}|=d\bigr)\ \ge\ \frac{\hat\eta^{(0)}_{j_i}}{\sum_{j\in\tilde S^{(0)}}\hat\eta^{(0)}_j+\sum_{j\notin\tilde S^{(0)}}C_0/p}\ \ge\ \frac{C_0}{(D+C_0)\log p},
\]
and for $j_i\in\{1,\ldots,p\}\backslash\tilde S^{(0)}$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(1)}_{1k}|=d\bigr)\ \ge\ \frac{C_0/p}{\sum_{j\in\tilde S^{(0)}}\hat\eta^{(0)}_j+\sum_{j\notin\tilde S^{(0)}}C_0/p}\ \ge\ \frac{C_0}{(D+C_0)p},
\]
by plugging these inequalities into (58), it yields that when $d\ge\tilde p^*$ and $n$ is sufficiently large,
\[
\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\,\big|\,|S^{(1)}_{1k}|=d\bigr)
\ \ge\ d(d-1)\cdots(d-\tilde p^*+1)\,\frac{C_0^{d}}{(D+C_0)^{d}(\log p)^{\tilde p^*}}
\ \ge\ \frac{1}{(\log p)^{\tilde p^*}}\Bigl(\frac{C_0}{D+C_0}\Bigr)^{D}. \tag{59}
\]
In the second-to-last inequality, we used the fact that $|\tilde S^{(0)}|\le D\log p/C_0$. On the other hand, given $|S^{(1)}_{1k}|=d$, $S^{(1)}_{1k}\supseteq\tilde S^{(0)*}$ and $S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset$, the indicators of whether the remaining features are sampled or not follow the restricted multinomial distribution with parameters $(p-|\tilde S^{(0)}|,\,d-\tilde p^*,\,1)$, which means that the remaining variables have the same sampling weights. Then for $d\ge 2\tilde p^*$,
\[
\mathbb{P}_c\bigl(S^{(1)}_{1k}\supseteq\tilde S^{(1)*}\,\big|\,|S^{(1)}_{1k}|=d,\ S^{(1)}_{1k}\supseteq\tilde S^{(0)*},\ S^{(1)}_{1k}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)
\ \ge\ \Bigl(\frac{\max(d-2\tilde p^*+1,1)}{p-|\tilde S^{(0)}|}\Bigr)^{\tilde p^*},
\]
which combined with (56), (57) and (59) leads to
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(1)}_{1k}\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr\}\Bigr)\ \ge\ 1-e^{-C},
\]
since $B_2\ge C\,(D+C_0)^{D+1}p^{\tilde p^*}(\log p)^{\tilde p^*}/\bigl(C_0^{D}(D-\tilde p^*+1)\bigr)$. Furthermore,
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(1)}_{1k}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr)\Bigr]^{B_2},
\]
and
\begin{align*}
&\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr)\\
&\quad=\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)\,\mathbb{P}_c\bigl(S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset\bigr)\\
&\qquad+\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})\ne\emptyset\bigr)\,\mathbb{P}_c\bigl(S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})\ne\emptyset\bigr)\\
&\quad\le\sum_{d\le D}\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset,\ |S^{(1)}_{11}|=d\bigr)\,\mathbb{P}\bigl(|S^{(1)}_{11}|=d\bigr)\\
&\quad\le\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset,\ |S^{(1)}_{11}|=D\bigr).
\end{align*}
Similar to step 1, because of Lemma 40, it follows that
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{|S^{(1)}_{1k}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\bigr\}\Bigr)
\ \le\ 1-\Bigl[1-\mathbb{P}_c\bigl(|S^{(1)}_{11}\cap(S^*\backslash\tilde S^{(0)*})|\ge\tilde p^*+1\,\big|\,S^{(1)}_{11}\cap(\tilde S^{(0)}\backslash\tilde S^{(0)*})=\emptyset,\ |S^{(1)}_{11}|=D\bigr)\Bigr]^{B_2}\ \to\ 0
\]
uniformly for any $\tilde S^{(0)}$. Then, similar to (55), we have for $j\in\tilde S^{(0)*}\cup\tilde S^{(1)*}$,
\[
\eta^{(1)}_j=\mathbb{P}_c\bigl(j\in S^{(1)*}_1\bigr)\ \ge\ \mathbb{P}_c\bigl(S^{(1)*}_1\supseteq\tilde S^{(0)*}\cup\tilde S^{(1)*}\bigr)\ \ge\ 1-\tfrac{3}{2}e^{-C},
\]
and by Hoeffding's inequality (Petrov, 2012),
\[
\mathbb{P}\bigl(\hat\eta^{(1)}_j\le 1-2e^{-C}\,\big|\,\tilde S^{(0)}\bigr)\ \le\ \mathbb{P}\bigl(|\hat\eta^{(1)}_j-\eta^{(1)}_j|\ge\tfrac{1}{2}e^{-C}\,\big|\,\tilde S^{(0)}\bigr)\ \le\ 2\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
By union bounds, it holds that
\[
\mathbb{P}\Bigl(\bigcap_{j\in\tilde S^{(0)*}\cup\tilde S^{(1)*}}\bigl\{\hat\eta^{(1)}_j>1-2e^{-C}\bigr\}\,\Big|\,\tilde S^{(0)}\Bigr)\ \ge\ 1-4\tilde p^*\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
Since the above conclusions hold for any $\tilde S^{(0)}$ with $|\tilde S^{(0)}_*|=\tilde p^*$, we can conclude that
\[
\mathbb{P}\Bigl(\sum_{j=1}^{p}\mathbf{1}\bigl(\hat\eta^{(1)}_j\ge 1-2e^{-C}\bigr)\ \ge\ 2\tilde p^*\Bigr)\ \ge\ 1-4\tilde p^*\exp\bigl\{-\tfrac{1}{2}B_1e^{-2C}\bigr\}.
\]
After the second iteration step, with probability at least $1-4\tilde p^*\exp\{-\frac{1}{2}B_1e^{-2C}\}$, there will be at least $2\tilde p^*$ features of $S^*$ covered in $\tilde S^{(1)}$ and having $\hat\eta_j>1-2e^{-C}$. Similarly, after the $t$-th iteration step, there will be at least $(t+1)\tilde p^*$ features of $S^*$ covered in $\tilde S^{(t)}$ and having $\hat\eta_j>1-2e^{-C}$.

(iii) Step 3: By step 2, after at most $t'=\lceil p^*/\tilde p^*\rceil-1$ iterations, $\tilde S^{(t')}$ will cover $S^*$.
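The weighted sampling scheme used throughout steps 2 and 3, drawing features one at a time without replacement with probability proportional to the selection frequencies $\hat\eta_j$ of the previous iteration, can be sketched as follows. This is our illustrative code under simplified assumptions, not the RaSEn package implementation, and the names are ours.

```python
import random

def sample_subspace(eta_hat, d, rng):
    """Draw a subspace of d distinct features, picking features one at a
    time with probability proportional to the current weights eta_hat
    (selection frequencies from the previous RaSE iteration).
    Assumes d is at most the number of strictly positive weights."""
    weights = list(eta_hat)
    chosen = []
    for _ in range(d):
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w
            if w > 0 and r <= acc:
                chosen.append(j)
                weights[j] = 0.0  # without replacement
                break
    return chosen

# Features with large previous-round frequencies dominate the new subspaces;
# the floor weight 0.01 mimics the minimum weight kept for unselected features.
rng = random.Random(7)
eta = [0.9, 0.8, 0.7] + [0.01] * 97
counts = [0] * 100
for _ in range(2000):
    for j in sample_subspace(eta, 5, rng):
        counts[j] += 1
print(sorted(range(100), key=lambda j: -counts[j])[:3])
```

The point of the sketch is the contrast with step 1: once the weights concentrate on $S^*$, each subspace covers the signals with constant probability, so $B_2$ no longer needs to grow polynomially in $p$.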
Without loss of generality, assume the smallest $t'$ satisfying $\tilde S^{(t')}\supseteq S^*$ equals $\lceil p^*/\tilde p^*\rceil$. When the iteration number $t\ge\lceil p^*/\tilde p^*\rceil$, with high probability we have
\[
\bigcap_{j\in S^*}\bigl\{\hat\eta^{(t-1)}_j\ge 1-2e^{-C}\bigr\}.
\]
Using the same notation defined at the beginning of step 2 (notice that here we only need to condition on $\bigcap_{j\in S^*}\{\hat\eta^{(t-1)}_j\ge 1-2e^{-C}\}$), we have
\[
\mathbb{P}_c\bigl(S^{(t)}_{1k}\supseteq S^*\bigr)
\ \ge\ \sum_{(j_1,\ldots,j_d):\,\{j_1,\ldots,j_d\}\supseteq S^*}\mathbb{P}_c\bigl(j_1\,\big|\,|S^{(t)}_{1k}|=d\bigr)\cdots\mathbb{P}_c\bigl(j_d\,\big|\,j_1,\ldots,j_{d-1},|S^{(t)}_{1k}|=d\bigr).
\]
Similar to step 2, for $j_i\in S^*$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(t)}_{1k}|=d\bigr)\ \ge\ \frac{\hat\eta^{(t-1)}_{j_i}}{\sum_{j\in S^*}\hat\eta^{(t-1)}_j+\sum_{j\notin S^*}C_0/p}\ \ge\ \frac{1-2e^{-C}}{D+C_0},
\]
and for $j_i\in\{1,\ldots,p\}\backslash S^*$ it holds that
\[
\mathbb{P}_c\bigl(j_i\,\big|\,j_1,\ldots,j_{i-1},|S^{(t)}_{1k}|=d\bigr)\ \ge\ \frac{C_0/p}{\sum_{j\in S^*}\hat\eta^{(t-1)}_j+\sum_{j\notin S^*}C_0/p}\ \ge\ \frac{C_0}{(D+C_0)p}.
\]
Thus, we have
\[
\mathbb{P}_c\bigl(S^{(t)}_{1k}\supseteq S^*\bigr)\ \ge\ \frac{(D-p^*+1)\,(1-2e^{-C})^{p^*}\,C_0^{D-p^*}}{D\,(D+C_0)^{D-1}},
\]
and hence
\[
\mathbb{P}_c\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(t)}_{1k}\supseteq S^*\bigr\}\Bigr)
=1-\Bigl[1-\mathbb{P}_c\bigl(S^{(t)}_{11}\supseteq S^*\bigr)\Bigr]^{B_2}
\ \ge\ 1-\exp\Bigl\{-B_2\,\frac{(D-p^*+1)(1-2e^{-C})^{p^*}C_0^{D-p^*}}{D(D+C_0)^{D-1}}\Bigr\}\ \to\ 1,
\]
given the lower bound on $B_2$. Then, similar to (55), there holds
\begin{align*}
\mathbb{P}\bigl(S^{(t)*}_1\supseteq S^*\bigr)
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(t)}_{1k}\supseteq S^*\bigr\},\ \inf_{\substack{S:\,S\not\supseteq S^*\\|S|\le D}}\mathrm{Cr}_n(S)-\sup_{\substack{S:\,S\supseteq S^*\\|S|\le D}}\mathrm{Cr}_n(S)>M_n\nu(n,p,D)\Bigr)\\
&\ge\mathbb{P}\Bigl(\bigcup_{k=1}^{B_2}\bigl\{S^{(t)}_{1k}\supseteq S^*\bigr\}\Bigr)-\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>\tfrac{1}{2}M_n\nu(n,p,D)\Bigr).
\end{align*}
Combining all the conclusions above, as $n,B_1,B_2\to\infty$, we get
\[
\mathbb{P}\bigl(S^{(t)*}_1\not\supseteq S^*\bigr)\ \le\ \exp\Bigl\{-B_2\,\frac{(D-p^*+1)(1-2e^{-C})^{p^*}C_0^{D-p^*}}{D(D+C_0)^{D-1}}\Bigr\}
+\mathbb{P}\Bigl(\sup_{S:|S|\le D}\bigl|\mathrm{Cr}_n(S)-\mathrm{Cr}(S)\bigr|>\tfrac{1}{2}M_n\nu(n,p,D)\Bigr)+o(1)\ \to\ 0.
\]
For $t>\lceil p^*/\tilde p^*\rceil$, the same conclusion can be obtained by following the same procedure, which completes our proof.

References

K. M. Abadir. An introduction to hypergeometric functions for economists. Econometric Reviews, 18(3):287-330, 1999.

M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55. US Government Printing Office, 1948.

H. Ahn, H. Moon, M. J. Fazzari, N. Lim, J. J. Chen, and R. L. Kodell.
Classification by ensembles from random partitions of high-dimensional data. Computational Statistics & Data Analysis, 51(12):6166-6179, 2007.

H. Akaike. Information theory and an extension of the maximum likelihood principle. Proceedings of the IEEE International Symposium on Information Theory, 1973.

S. D. Bay. Combining nearest neighbor classifiers through multiple feature subsets. In ICML, volume 98, pages 37-45. Citeseer, 1998.

T. B. Berrett and R. J. Samworth. Efficient two-sample functional estimation and the super-oracle phenomenon. arXiv preprint arXiv:1904.09347, 2019.

P. J. Bickel, E. Levina, et al. Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989-1010, 2004.

P. J. Bickel, E. Levina, et al. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577-2604, 2008.

R. Blaser and P. Fryzlewicz. Random rotation ensembles. The Journal of Machine Learning Research, 17(1):126-151, 2016.

R. Blaser and P. Fryzlewicz. Regularizing axis-aligned ensembles via data rotations that favor simpler learners, 2019.

T. Boot and D. Nibbering. Subspace methods. In Macroeconomic Forecasting in the Era of Big Data, pages 267-291. Springer, 2020.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

L. Breiman. Pasting small votes for classification in large databases and on-line. Machine Learning, 36(1-2):85-103, 1999.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

M. Brookes. The matrix reference manual. Imperial College London, 3, 2005.

R. Bryll, R. Gutierrez-Osuna, and F. Quek. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition, 36(6):1291-1302, 2003.

P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

K. P. Burnham and D. R. Anderson. Practical use of the information-theoretic approach. In Model Selection and Inference, pages 75-117. Springer, 1998.

T. I. Cannings and R. J. Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959-1035, 2017.

J. Chen and Z. Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759-771, 2008.

J. Chen and Z. Chen. Extended BIC for small-n-large-p sparse GLM. Statistica Sinica, pages 555-574, 2012.

L. Devroye and T. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202-207, 1979.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

D. Dua and C. Graff. UCI machine learning repository. School of Information and Computer Science, University of California, Irvine, CA, 2019.

R. J. Durrant and A. Kabán. Random projections as regularizers: learning a linear discriminant from fewer observations than dimensions. Machine Learning, 99(2):257-286, 2015.

B. Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352):892-898, 1975.

C.-G. Esseen. A moment inequality with an application to the central limit theorem. Scandinavian Actuarial Journal, 1956(2):160-170, 1956.

J. Fan, Y. Feng, and X. Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745-771, 2012.

J. Fan, Y. Feng, J. Jiang, and X. Tong. Feature augmentation via nonparametrics and selection (FANS) in high-dimensional classification. Journal of the American Statistical Association, 111(513):275-287, 2016.

Y. Fan and C. Y. Tang. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):531-552, 2013.

Y. Fan, Y. Kong, D. Li, Z. Zheng, et al. Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3):1243-1272, 2015.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.

E. Fix. Discriminatory Analysis: Nonparametric Discrimination, Consistency Properties. USAF School of Aviation Medicine, 1951.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37. Springer, 1995.

S. Ganguly, J. Ryu, Y.-H. Kim, Y.-K. Noh, and D. D. Lee. Nearest neighbor density functional estimation based on inverse Laplace transform. arXiv preprint arXiv:1805.08342, 2018.

N. García-Pedrajas and D. Ortiz-Boyer. Boosting random subspace method. Neural Networks, 21(9):1344-1362, 2008.

I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552, 2005.

N. Hao, Y. Feng, and H. H. Zhang. Model selection for high-dimensional quadratic regression via regularization. Journal of the American Statistical Association, 113(522):615-625, 2018.

H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.

C. Higuera, K. J. Gardiner, and K. J. Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 10(6), 2015.

T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832-844, 1998.

D. Hsu, S. Kakade, T. Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.

B. Jiang, X. Wang, and C. Leng. A direct approach for sparse quadratic discriminant analysis. The Journal of Machine Learning Research, 19(1):1098-1134, 2018.

R. Kohavi, G. H. John, et al. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.

S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

Q. Li and J. Shao. Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, pages 457-473, 2015.

Y. Li and J. S. Liu. Robust variable and interaction selection for logistic regression and general index models. Journal of the American Statistical Association, 114(525):271-286, 2019.

M. E. Lopes. Estimating a sharp convergence bound for randomized ensembles. Journal of Statistical Planning and Inference, 204:35-44, 2020.

M. E. Lopes et al. Estimating the algorithmic variance of randomized ensembles via the bootstrap. The Annals of Statistics, 47(2):1088-1112, 2019.

Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultrahigh dimensions. Biometrika, 99(1):29-42, 2012.

Q. Mai, Y. Yang, and H. Zou. Multiclass sparse discriminant analysis. arXiv preprint arXiv:1504.05845, 2015.

G. J. McLachlan. Mahalanobis distance. Resonance, 4(6):20-26, 1999.

M. Mukhopadhyay and D. B. Dunson. Targeted random projection for prediction from high-dimensional features. Journal of the American Statistical Association, pages 1-13, 2019.

K. B. Petersen and M. S. Pedersen. The matrix cookbook, Nov 2012. URL http://www2.imm.dtu.dk/pubdb/p.php, 3274:14, 2012.

V. V. Petrov. Sums of Independent Random Variables, volume 82. Springer Science & Business Media, 2012.

R. Rao and Y. Wu. A strongly consistent procedure for model selection in a regression problem. Biometrika, 76(2):369-374, 1989.

L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1-39, 2010.

J. Shao, Y. Wang, X. Deng, S. Wang, et al. Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39(2):1241-1265, 2011.

M. Skurichina and R. P. Duin. Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis & Applications, 5(2):121-135, 2002.

R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science, pages 104-117, 2003.

A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

H. Wang. Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512-1524, 2009.

Q. Wang, S. R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory, 55(5):2392-2405, 2009.

Q. Zhang and H. Wang. On BIC's selection consistency for discriminant analysis. Statistica Sinica, pages 731-740, 2011.