Provable Variable Selection for Streaming Features

Jing Wang (Cornell University, New York, NY 10021, USA), Jie Shen (Rutgers University, Piscataway, NJ 08854, USA), Ping Li (Baidu Research, Bellevue, WA 98004, USA). Correspondence to: Jie Shen. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

Abstract. In large-scale machine learning applications and high-dimensional statistics, it is ubiquitous to deal with a considerable number of features, many of which are redundant. As a remedy, online feature selection has attracted increasing attention in recent years: features are revealed sequentially and their importance is evaluated on the fly. Though online feature selection has proven an elegant methodology, it is usually challenging to carry out a rigorous theoretical characterization. In this work, we propose a provable online feature selection algorithm that utilizes the online leverage score. The selected features are then fed to k-means clustering, making the clustering step memory and computationally efficient. We prove that, with high probability, performing k-means clustering on the selected feature space does not deviate far from the optimal clustering on the original data. Empirical results on real-world data sets demonstrate the effectiveness of our algorithm.

1. Introduction

For retailers, both brick-and-mortar and internet-based stores, various recommendation methods have been proposed to promote product sales. The recommendation model is usually updated in a timely manner, and it may incorporate new, valuable product features that were not previously available. For example, during the Apple WWDC 2018 keynote, Apple introduced new features of their platforms to fight fingerprinting, a technique that tracks users by identifying their computers. When new features become available, a feature selection model is employed to determine whether they will drive future product sales, and only the related features are included in the recommendation model.

Hence, in real-world applications, features are usually revealed in a continuous stream. It is necessary to evaluate new features immediately and output intermediate results. The feature evaluation process over a stream is called online feature selection (Perkins & Theiler, 2003; Zhou et al., 2005; Wu et al., 2010). We first formulate this problem. Suppose that there are $n$ samples but initially we do not observe all of the features. We call the sequence $a_1, a_2, \ldots \in \mathbb{R}^n$ a feature stream, with each $a_i \in \mathbb{R}^n$ being the $i$th feature, or the $i$th covariate of the $n$ samples. Note that in our setting, the feature $a_i$ is revealed at time stamp $i$. If $a_i$ is selected, we update the observation matrix $A$ as follows:

$A \leftarrow [A \;\; \theta_i a_i]$,  (1.1)

where the parameter $\theta_i \neq 0$ is chosen in an online manner.

In the literature, a large number of online methods have been proposed based on statistical measurements or optimization techniques (Perkins & Theiler, 2003; Zhou et al., 2005; Wang et al., 2015). For example, Perkins & Theiler (2003) added into the model a new feature that contributes to predictor learning, based on an optimization analysis (grafting). Zhou et al. (2005) proposed an adaptive complexity penalty method that evaluates a new feature based on its p-value. Wu et al. (2010) utilized the Markov blanket to measure the relationship between a new feature and the selected feature subset.
Though successful, most of the results in this line of research are empirical in nature. On the other hand, feature selection methods can be categorized as either supervised or unsupervised. For instance, Shen & Li (2017) recently proposed a non-convex supervised approach for variable selection with favorable iteration complexity. Unsupervised methods, however, are of great practical importance to many areas such as cardiology, where annotated data is usually precious and limited due to genetic privacy issues and the medical background required of annotators.

Summary of Contributions. In this paper, we consider the high-dimensional regime in which the number of features is much larger than the sample size, and the features are revealed in an online manner. We propose an unsupervised algorithm termed Online leverage scores for Feature Selection (OFS). Our main technique is to approximately compute the widely used leverage score in each iteration and determine the importance of each feature in real time. We prove that the reduced feature space is a good approximation to the original one, in a sense to be clarified. Furthermore, we apply k-means clustering on the set of selected features, and show that the clustering performance does not degrade by much. Computationally, our algorithm enjoys low time complexity and little memory usage, which makes it a perfect fit for big data analytics.

2. Related Work

Feature selection is a primary technique in machine learning to address the curse of dimensionality. Over the last decades, a number of methods have been proposed (Guyon & Elisseeff, 2003; Donoho & Jin, 2008). In this section, we give a brief review of existing approaches in both the batch and the online settings.

Batch Methods. Existing batch feature selection methods can be roughly divided into unsupervised, supervised and semi-supervised approaches. Supervised methods utilize the target variable to guide the feature evaluation process; examples include the Fisher score, the Least Absolute Shrinkage and Selection Operator (Lasso) (Tibshirani, 1996) and minimum Redundancy Maximum Relevance (Peng et al., 2005). Unsupervised feature selection methods mainly depend on latent data distribution analysis (He et al., 2005), such as spectral analysis (Zhao & Liu, 2007; Cai et al., 2010) and the Kullback-Leibler divergence between neighborhood distributions (Wei & Philip, 2016). Semi-supervised feature selection algorithms combine the benefits of both of the aforementioned approaches, for example by combining Gaussian fields and harmonic functions (Kong & Yu, 2010; Zhu et al., 2003).

Feature selection methods are also characterized as wrapper, embedded and filter models. The wrapper model evaluates feature subsets by their performance on a specific algorithm, such as SVM or Naive Bayes for classification tasks (Forman, 2003) and k-means for clustering tasks (Guyon et al., 2002; Xu et al., 2014). The embedded model seeks the desired feature subset by solving a regularized optimization objective with certain constraints (Zhang, 2009; Yang & Xu, 2013); examples include Least Angle Regression (Efron et al., 2004) and the group Lasso (Zhang et al., 2016). The optimization process forces most coefficients to be small or exactly zero, and the features corresponding to nonzero coefficients are selected.
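As a concrete illustration of the embedded model just described, the following scikit-learn sketch fits an $\ell_1$-regularized regression and keeps the features with nonzero coefficients; the toy data, the regularization strength, and the variable names are illustrative choices of ours, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Embedded feature selection: the l1 penalty drives most coefficients to
# exactly zero; the features with nonzero coefficients are the selected ones.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
w_true = np.zeros(50)
w_true[:5] = 2.0                                  # only 5 informative features
y = X @ w_true + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)            # indices of surviving features
print("selected features:", selected)
```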
The filter model utilizes certain statistical measurements, such as the Hilbert-Schmidt Independence Criterion (HSIC), the leverage score (Boutsidis et al., 2009) and kernel-based measures of independence (Chen et al., 2017). In particular, the statistical leverage score is an important measurement for unsupervised feature selection: it characterizes the outstanding features that have a greater effect on the result of a statistical procedure. There are multiple variants of the statistical leverage score, such as the normalized leverage score (Boutsidis et al., 2009), the truncated leverage score (Gittens & Mahoney, 2013) and the kernel ridge leverage score (Alaoui & Mahoney, 2015). The ridge leverage score has been used to select features for k-means clustering (Boutsidis et al., 2009) and is proved to attain a $(2+\epsilon)$-approximate partition. Specifically, the ridge leverage score of the $i$th column of the data matrix $A \in \mathbb{R}^{n\times d}$ is defined as (Alaoui & Mahoney, 2015)

$l_i = a_i^\top (AA^\top + \lambda I)^{-1} a_i$,  (2.1)

where $\lambda > 0$ is a parameter and $I \in \mathbb{R}^{n\times n}$ is the identity matrix. However, it is expensive to compute, requiring $O(n^3 + n^2 d)$ running time and $O(nd)$ memory storage. A number of recent papers focus on sampling some columns of $A$ to approximate the linear kernel of $A$ (Li et al., 2013; Alaoui & Mahoney, 2015; Cohen et al., 2016; Musco & Musco, 2017). However, none of these techniques has been applied to feature selection over streaming features.

Online Methods. Motivated by the fact that features arrive in a stream in real-world applications, online feature selection has attracted a lot of attention (Perkins & Theiler, 2003; Zhou et al., 2005; Wu et al., 2010; Wang et al., 2013). Batch-mode algorithms cannot handle this situation well, as they require the global feature space in advance. Existing online feature selection approaches either utilize statistical measurements, such as alpha-investing (Zhou et al., 2005) and mutual information (Wu et al., 2010), or rely on optimization techniques, such as stochastic gradient grafting (Perkins & Theiler, 2003; Wang et al., 2015). None of the aforementioned methods comes with a theoretical guarantee on the selected feature subset for the clustering task.

2.1. Notation

We use bold lower-case letters, e.g. $v \in \mathbb{R}^d$, to denote a column vector, and $\|v\|_2$ to denote its $\ell_2$-norm. Capital letters such as $X$ are used to denote matrices, and the transpose is denoted by $X^\top$. The capital letter $I_{n\times n}$ is reserved for the identity matrix, where $n$ indicates its size. For an invertible matrix $X$, we write its inverse as $X^{-1}$; otherwise, we use $X^\dagger$ for the pseudoinverse. For a square matrix $X$, we write its trace as $\mathrm{Tr}(X)$, which is the sum of its diagonal elements. The $i$th column and the $j$th row of the matrix $X$ are denoted by $x_i$ and $(x^j)^\top$, respectively. Suppose that the rank of a matrix $X \in \mathbb{R}^{n\times m}$ is $k \le \min\{m, n\}$. The singular value decomposition of $X$ is given by $X = U\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k)\,V^\top$, where the singular values are in descending order $\sigma_1 \ge \cdots \ge \sigma_k > 0$, $U = [u_1, \ldots, u_k] \in \mathbb{R}^{n\times k}$ contains the left singular vectors and $V = [v_1, \ldots, v_k]$ contains the right singular vectors. In this paper, we will use the Frobenius norm $\|X\|_F := \sqrt{\sum_{i=1}^k \sigma_i^2}$ and the spectral norm $\|X\|_2 := \max_{1\le i\le k}\sigma_i = \sigma_1$. For a sequence of random variables $X_1, X_2, \ldots$, we write $\mathbb{E}_{j-1}[X_j]$ for the expectation of $X_j$ conditioned on $\{X_1, \ldots, X_{j-1}\}$.
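As a concrete reference point for the batch cost quoted above, the ridge leverage scores in (2.1) can be computed for all $d$ features with a single $n \times n$ linear solve. Below is a minimal numpy sketch; the function name and the toy data are ours and purely illustrative, and the cost mirrors the $O(n^3 + n^2 d)$ time and $O(nd)$ memory of the exact computation.

```python
import numpy as np

def ridge_leverage_scores(A, lam=1.0):
    """Ridge leverage scores of the columns of A (n x d), per Eq. (2.1):
    l_i = a_i^T (A A^T + lam * I)^{-1} a_i."""
    n, d = A.shape
    K = A @ A.T + lam * np.eye(n)          # n x n regularized Gram matrix
    Z = np.linalg.solve(K, A)              # solve K Z = A instead of inverting K
    return np.einsum('ij,ij->j', A, Z)     # l_i = a_i^T (K^{-1} a_i)

# toy usage: n = 50 samples, d = 2000 features
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 2000))
scores = ridge_leverage_scores(A, lam=0.1)
print(scores.shape, float(scores.max()))
```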
3. Main Results

In this section, we propose an online algorithm for feature selection, where the goal is to approximate the original data with far fewer attributes in a suitable sense. To this end, we make use of the leverage score which, at a high level, reflects the importance of each feature. Suppose that the data matrix is $A \in \mathbb{R}^{n\times d}$, i.e., $n$ samples lying in a $d$-dimensional ambient space. The statistical leverage score of the $i$th column (i.e., feature) of $A$ is defined as

$l_i^\star = a_i^\top (AA^\top)^\dagger a_i$.  (3.1)

It is well known that sampling an $n \times O(\epsilon^{-2} n \log n)$ matrix $\tilde A$ from the columns of $A$ with probabilities proportional to the respective leverage scores yields a $(1+\epsilon)$-spectral approximation to $A$ (Spielman & Srivastava, 2011), in the sense that $\|\tilde A^\top x\|_2^2$ is within a $(1\pm\epsilon)$ factor of $\|A^\top x\|_2^2$ for all $x$; more precisely,

$(1-\epsilon)\, x^\top AA^\top x \le x^\top \tilde A\tilde A^\top x \le (1+\epsilon)\, x^\top AA^\top x$,

that is, $(1-\epsilon)\, AA^\top \preceq \tilde A\tilde A^\top \preceq (1+\epsilon)\, AA^\top$.

In the online setting, however, we are not able to access all the data to compute the leverage score. The key idea of our algorithm is that when a new feature arrives, we approximate its leverage score based on the features obtained so far, which then guides the selection process. To be more concrete, at time stamp $i$, suppose the observed data matrix is $\tilde A_{i-1}$ and the new feature $a_i$ is revealed; we need to determine whether $a_i$ is kept or discarded. A natural way to do so is to compute the approximate leverage score of $a_i$ as follows:

$\tilde l_i = a_i^\top (\tilde A_{i-1} \tilde A_{i-1}^\top)^\dagger a_i$.  (3.2)

Intuitively, if $\tilde A_{i-1}$ is a good approximation to $A$, then $\tilde l_i$ indicates the importance of $a_i$ just as $l_i^\star$ does, and we will show that this is indeed the case after a few attributes have been revealed. It is known that if the entire feature space is available, each leverage score is upper bounded by 1. However, the estimate based on $\tilde A_{i-1}$ can be arbitrary, because $\tilde A_{i-1}$ retains only a rescaled subset of the columns of $A$ and hence $\tilde A_{i-1}\tilde A_{i-1}^\top \neq AA^\top$ in general. For our analysis, we technically require that each $\tilde l_i$ is not larger than 1. Hence, we will make use of the modified quantity

$\tilde l_i = \min\big((1+\epsilon)\, a_i^\top (\tilde A_{i-1}\tilde A_{i-1}^\top)^\dagger a_i,\ 1\big)$.  (3.3)

Note that $\epsilon > 0$ is a pre-defined accuracy parameter, and the above suggests we are using a conservative (over-)estimate of the leverage score: consider $\tilde A_{i-1} = A$, for which $\tilde l_i \ge l_i^\star$. This is essential in the online setting, since an aggressive strategy may lose many important features. Then, the sampling probability is computed as

$p_i = \min\big(8\epsilon^{-2}\log n \cdot \tilde l_i,\ 1\big)$.  (3.4)

With the scaling factor of $\tilde l_i$ above, it is not hard to see that for a small approximation error $\epsilon$, one has to select the current feature with high probability, which conforms to the intuition that an exact estimate of $A$ requires selecting all the features. We summarize our method in Algorithm 1.

Algorithm 1 Online Feature Selection (OFS)
Require: Initial data matrix $\tilde A_0$, sampling rate $c = 8\epsilon^{-2}\log n$, approximation parameter $\epsilon \in (0, 1)$.
1: for $i = 1, 2, \ldots$ do
2:   Reveal the $i$th feature $a_i$.
3:   Compute the online leverage score $\tilde l_i = \min\big((1+\epsilon)\, a_i^\top (\tilde A_{i-1}\tilde A_{i-1}^\top)^\dagger a_i,\ 1\big)$.
4:   Compute the probability $p_i = \min(c\,\tilde l_i, 1)$.
5:   With probability $p_i$, update $\tilde A_i = [\tilde A_{i-1}\ \ a_i/\sqrt{p_i}]$; otherwise set $\tilde A_i = \tilde A_{i-1}$.
6: end for
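The following numpy sketch mirrors Algorithm 1 under the stated sampling rate; the function name, the handling of the empty initial matrix, and the toy data are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

def ofs_select(feature_stream, n, eps=0.5, seed=0):
    """Sketch of Algorithm 1: online feature selection via online leverage scores."""
    rng = np.random.default_rng(seed)
    c = 8 * np.log(n) / eps**2                # sampling rate c = 8 eps^{-2} log n
    A_tilde = np.zeros((n, 0))                # selected (rescaled) features so far
    selected = []
    for i, a in enumerate(feature_stream):
        a = np.asarray(a, dtype=float)
        if A_tilde.shape[1] == 0:
            l_i = 1.0                         # stand-in for the initial matrix A~_0
        else:
            G_pinv = np.linalg.pinv(A_tilde @ A_tilde.T)
            # online leverage score, Eq. (3.3)
            l_i = min((1.0 + eps) * a @ G_pinv @ a, 1.0)
        p_i = min(c * l_i, 1.0)               # sampling probability, Eq. (3.4)
        if rng.random() < p_i:
            # keep the feature, rescaled by 1/sqrt(p_i)
            A_tilde = np.hstack([A_tilde, (a / np.sqrt(p_i))[:, None]])
            selected.append(i)
    return A_tilde, selected

# toy usage: 30 samples, 500 streaming features
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 500))
A_tilde, idx = ofs_select(iter(A.T), n=30, eps=0.5)
print(len(idx), "of", A.shape[1], "features kept")
```

For such a tiny example the sampling rate keeps most features; the $O(\epsilon^{-2} n\log n\log(\|A\|_2))$ bound below only translates into a real reduction when $d$ is much larger than $n\log n$.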
3.1. Analysis

We first show that, with high probability, the data matrix produced by our algorithm is a good approximation to $A$.

Theorem 1. Consider Algorithm 1 and let $\tilde A$ be its output when it terminates. It holds with high probability that

$(1-\epsilon)\, AA^\top \preceq \tilde A\tilde A^\top \preceq (1+\epsilon)\, AA^\top$.

Proof. Let $A_i = (a_1\ a_2\ \cdots\ a_i)$. Define $Y_0$ as the zero matrix and, for all $i \ge 1$, let $Y_{i-1} = (AA^\top)^{\dagger/2}(\tilde A_{i-1}\tilde A_{i-1}^\top - A_{i-1}A_{i-1}^\top)(AA^\top)^{\dagger/2}$. Let $u_i = (AA^\top)^{\dagger/2} a_i$. If $\|Y_{i-1}\|_2 \ge \epsilon$, we set $X_i = 0$. Otherwise, set $X_i = (1/p_i - 1)\, u_i u_i^\top$ if $a_i$ is sampled into $\tilde A$, and $X_i = -u_i u_i^\top$ otherwise. Thus, $X_i = Y_i - Y_{i-1}$.

Consider the case $\|Y_{i-1}\|_2 < \epsilon$. We get
$\tilde l_i = \min\big((1+\epsilon)\, a_i^\top (\tilde A_{i-1}\tilde A_{i-1}^\top)^\dagger a_i,\ 1\big) \ge \min\big((1+\epsilon)\, a_i^\top (A_{i-1}A_{i-1}^\top + \epsilon AA^\top)^\dagger a_i,\ 1\big) \ge \min\big((1+\epsilon)\, a_i^\top ((1+\epsilon)AA^\top)^\dagger a_i,\ 1\big) = a_i^\top (AA^\top)^\dagger a_i = u_i^\top u_i$.
Thus, $p_i \ge \min(c\, u_i^\top u_i, 1)$. If $p_i = 1$, then $X_i = 0$. Otherwise, we have $p_i \ge c\, u_i^\top u_i$, and moreover
$\mathbb{E}_{i-1}[X_i^2] = p_i (1/p_i - 1)^2 (u_i u_i^\top)^2 + (1-p_i)(u_i u_i^\top)^2 \preceq (u_i u_i^\top)^2 / p_i \preceq u_i u_i^\top / c$.
Let $W_i = \sum_{k=1}^{i} \mathbb{E}_{k-1}[X_k^2]$. We have $W_n \preceq \sum_{k=1}^{n} u_k u_k^\top / c \preceq I/c$. Applying Lemma 4 gives
$\Pr(\|Y_n\|_2 \ge \epsilon) \le n\,\exp\!\Big(-\frac{\epsilon^2/2}{1/c + \epsilon/(3c)}\Big) \le n\,\exp(-c\epsilon^2/4)$,
which is at most $1/n$ for $c = 8\epsilon^{-2}\log n$. This implies that, with high probability,
$\big\|(AA^\top)^{\dagger/2}\,\tilde A\tilde A^\top\,(AA^\top)^{\dagger/2} - I\big\|_2 \le \epsilon$.
We thus have $(1-\epsilon)AA^\top \preceq \tilde A\tilde A^\top \preceq (1+\epsilon)AA^\top$, completing the theorem.

Now we turn to controlling the number of features selected by Algorithm 1. We will use the following result from (Cohen et al., 2015).

Lemma 1. Let $A$ be an $n \times d$ matrix, $\epsilon \in (0, 1)$, and $c = O(\epsilon^{-2}\log n)$. Let $\tilde l_1, \ldots, \tilde l_d$ be over-estimates of the leverage scores, i.e., $\tilde l_i \ge a_i^\top (AA^\top)^\dagger a_i$ for all $1 \le i \le d$, and let $p_i = \min\{c\,\tilde l_i, 1\}$. Construct $\tilde A$ by independently sampling each column $a_i$ of $A$ with probability $p_i$ and rescaling it by $1/\sqrt{p_i}$ if it is included in $\tilde A$. Then, with high probability, $\tilde A$ is a $(1+\epsilon)$-spectral approximation of $A$ and the number of columns in $\tilde A$ is $O\big(\epsilon^{-2}\sum_{i=1}^{d}\tilde l_i \log n\big)$.

By Lemma 1, in order to control the number of selected features, we need to bound the sum of the online leverage scores.

Lemma 2. After running Algorithm 1, it holds with high probability that $\sum_{i=1}^{d} \tilde l_i = O\big(n\log(\|A\|_2)\big)$.

Proof. We define $\delta_i = \log\det(\tilde A_i\tilde A_i^\top) - \log\det(\tilde A_{i-1}\tilde A_{i-1}^\top)$, so that the sum of the $\delta_i$ is bounded by the logarithm of the ratio of the determinants associated with $\tilde A\tilde A^\top$. By the matrix determinant lemma, we have
$\mathbb{E}_{i-1}\big[\exp(\tilde l_i/8 - \delta_i)\big] = p_i\, e^{\tilde l_i/8}\big(1 + a_i^\top(\tilde A_{i-1}\tilde A_{i-1}^\top)^{-1} a_i/p_i\big)^{-1} + (1-p_i)\, e^{\tilde l_i/8} \le (1 + \tilde l_i/4)\Big(p_i\big(1 + a_i^\top(\tilde A_{i-1}\tilde A_{i-1}^\top)^{-1} a_i/p_i\big)^{-1} + 1 - p_i\Big)$.
If $c\,\tilde l_i < 1$, we have $p_i = c\,\tilde l_i$ and
$\mathbb{E}_{i-1}\big[\exp(\tilde l_i/8 - \delta_i)\big] \le (1 + \tilde l_i/4)\Big(c\,\tilde l_i\big(1 + 1/((1+\epsilon)c)\big)^{-1} + 1 - c\,\tilde l_i\Big) \le (1 + \tilde l_i/4)(1 - \tilde l_i/4) \le 1$.
Otherwise, $p_i = 1$ and we have
$\mathbb{E}_{i-1}\big[\exp(\tilde l_i/8 - \delta_i)\big] \le (1 + \tilde l_i/4)\big(1 + a_i^\top(\tilde A_{i-1}\tilde A_{i-1}^\top + \lambda I)^{-1} a_i\big)^{-1} \le (1 + \tilde l_i/4)(1 + \tilde l_i)^{-1} \le 1$.
We now analyze the expected product of $\exp(\tilde l_i/8 - \delta_i)$ over the first $k$ steps. For $k \ge 1$ we have
$\mathbb{E}\Big[\exp\Big(\sum_{i=1}^{k} \tilde l_i/8 - \delta_i\Big)\Big] \le \mathbb{E}\Big[\exp\Big(\sum_{i=1}^{k-1} \tilde l_i/8 - \delta_i\Big)\Big]$,
and so, by induction on $k$, $\mathbb{E}\big[\exp\big(\sum_{i=1}^{d} \tilde l_i/8 - \delta_i\big)\big] \le 1$. Hence, by Markov's inequality, with high probability
$\sum_{i=1}^{d} \tilde l_i \le 8n + 8\sum_{i=1}^{d} \delta_i$.
Using Theorem 1, with high probability we have $\tilde A\tilde A^\top \preceq (1+\epsilon)AA^\top$, implying that
$\det(\tilde A\tilde A^\top) \le (1+\epsilon)^n(\|A\|_2^2)^n$, i.e., $\log\det(\tilde A\tilde A^\top) \le n(1 + \log(\|A\|_2^2))$.
By the definition of $\delta_i$, it holds with high probability that
$\sum_{i=1}^{d} \delta_i \le n(1 + \log(\|A\|_2^2)) - n = n\log(\|A\|_2^2)$.
And with high probability,
$\sum_{i=1}^{d} \tilde l_i \le 8n + 8n\log(\|A\|_2^2) = O\big(n\log(\|A\|_2^2)\big) = O\big(n\log(\|A\|_2)\big)$.
The proof is complete.

Lemmas 1 and 2 together imply that Algorithm 1 selects $O\big(\epsilon^{-2} n\log n\log(\|A\|_2)\big)$ features with high probability.

Time Complexity. The running time of Algorithm 1 is dominated by the online leverage score computation in Step 3, which is $O(n^3)$. In the case that $\tilde A_{i-1}$ is a Laplacian matrix, Step 3 can be implemented in $O(d\log^2 n)$ time by a fast graph Laplacian solver combined with the Johnson-Lindenstrauss lemma, as stated in (Koutis et al., 2016).

Memory Cost. The memory cost of the leverage score computation is significantly reduced from $O(nd)$ to $O(n^2\log n)$ (the storage of $\tilde A_i$). This follows from the analysis of Lemma 2, which shows that when the algorithm terminates, only $O\big(\epsilon^{-2} n\log n\log(\|A\|_2)\big)$ features are selected. Note that this paper considers the regime where $n \ll d$, such as the number of patients with rare diseases ($n$) versus the length of their gene expressions ($d$), or the batch size in neural networks ($n$) versus the corresponding dimension of the feature space ($d$). Hence our online implementation is orders of magnitude more efficient, which makes our algorithm practical for learning tasks such as clustering.
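The guarantee of Theorem 1 can be checked numerically by whitening with $(AA^\top)^{-1/2}$ on the row space of $A$ and inspecting the eigenvalues of the whitened $\tilde A\tilde A^\top$. The sketch below is illustrative only: it uses plain (offline) leverage-score column sampling with an arbitrary oversampling factor as a stand-in for the online scores, just to show what the two-sided spectral bound means in code.

```python
import numpy as np

def spectral_approx_error(A, A_tilde):
    """Smallest eps such that (1-eps) A A^T <= A~ A~^T <= (1+eps) A A^T
    holds on the row space of A."""
    G, G_tilde = A @ A.T, A_tilde @ A_tilde.T
    vals, vecs = np.linalg.eigh(G)
    keep = vals > 1e-10 * vals.max()
    W = vecs[:, keep] / np.sqrt(vals[keep])       # whitening: W^T G W = I
    eigs = np.linalg.eigvalsh(W.T @ G_tilde @ W)  # should lie in [1-eps, 1+eps]
    return float(np.max(np.abs(eigs - 1.0)))

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 1000))
lev = np.einsum('ij,ij->j', A, np.linalg.pinv(A @ A.T) @ A)   # exact leverage scores
p = np.minimum(1.0, 30 * lev)                                  # oversampling factor 30
mask = rng.random(A.shape[1]) < p
A_tilde = A[:, mask] / np.sqrt(p[mask])                        # rescale kept columns
print("empirical eps:", spectral_approx_error(A, A_tilde))
```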
3.2. Application to k-Means Clustering

We now explore the performance of the matrix $\tilde A$ returned by Algorithm 1 when it is used for k-means clustering. We first recall the k-means clustering problem. Formally, k-means clustering seeks to partition the $n$ data points of $A \in \mathbb{R}^{n\times d}$ into $k$ clusters $\{C_1, \ldots, C_k\}$ so as to minimize the distance between each data point and its closest center among $\{\mu_1, \ldots, \mu_k\}$ (Awasthi et al., 2010):

$\min_{\mu_1,\ldots,\mu_k} \sum_{i=1}^{k}\sum_{a_j \in C_i} \|a_j - \mu_i\|_2^2$,  (3.5)

where $\mu_i$ is the center of the data points in $C_i$ and, with a slight abuse of notation, $a_j$ here denotes the $j$th data point (the $j$th row of $A$). It is known that k-means clustering is an instance of low-rank approximation (Boutsidis et al., 2009). To see this, we construct an $n\times k$ cluster indicator matrix $X$. For each solution $\{\mu_i\}_{i=1}^k$ of (3.5), we assign a cluster label $x_j \in \{1, 2, \ldots, k\}$ to each sample $a_j$, and set $X_{ij} = 1/\sqrt{|C_j|}$ if the $i$th data point belongs to $C_j$, and $0$ otherwise. In this way, the $j$th row of $XX^\top A$ is exactly the average of the points sharing the $j$th point's label, i.e., its cluster center. Hence, from this discussion, we may rewrite (3.5) as

$\min_{X} \sum_{j} \big\|a_j - (XX^\top A)^j\big\|_2^2$,

where $(XX^\top A)^j$ is the $j$th row of $XX^\top A$. More compactly, we aim to solve

$\min_{X} \|A - XX^\top A\|_F^2$,

where $X$ ranges over cluster indicator matrices. See (Ostrovsky et al., 2006) for a more detailed discussion.

Let the indicator matrix $X^\star \in \mathbb{R}^{n\times k}$ denote the optimal partition on $A$, i.e., $X^\star = \arg\min_{X} \|A - XX^\top A\|_F^2$. We first investigate how the k-means objective evaluated on $\tilde A$ deviates from that on $A$. The following lemma bounds the k-means objective value on $\tilde A$.

Lemma 3. Suppose that $\tilde A$ is the matrix returned by Algorithm 1 with parameter $\epsilon$. Then, for any cluster indicator matrix $X$,

$(1-\epsilon)\,\|A - XX^\top A\|_F^2 \le \|\tilde A - XX^\top \tilde A\|_F^2 \le (1+\epsilon)\,\|A - XX^\top A\|_F^2$.

Proof. Using the notation $Y = I - XX^\top$, we can rewrite the k-means objective on the data matrices $A$ and $\tilde A$ as
$\|A - XX^\top A\|_F^2 = \|YA\|_F^2 = \mathrm{Tr}(Y^\top AA^\top Y)$ and $\|\tilde A - XX^\top \tilde A\|_F^2 = \mathrm{Tr}(Y^\top \tilde A\tilde A^\top Y)$.
Moreover, $\mathrm{Tr}(Y^\top \tilde A\tilde A^\top Y) = \sum_{i} y_i^\top \tilde A\tilde A^\top y_i$, where $y_i$ is the $i$th column of $Y$. Then, by the spectral bound on $\tilde A\tilde A^\top$ in Theorem 1, we immediately get
$(1-\epsilon)\,\mathrm{Tr}(Y^\top AA^\top Y) \le \mathrm{Tr}(Y^\top \tilde A\tilde A^\top Y) \le (1+\epsilon)\,\mathrm{Tr}(Y^\top AA^\top Y)$.
Plugging $Y = I - XX^\top$ into the above inequalities completes the proof.

Now we show that $\tilde A$ also yields a good approximate solution for $A$.

Theorem 2. Suppose that $\tilde A$ is returned by Algorithm 1, and let $\tilde X = \arg\min_{X} \|\tilde A - XX^\top\tilde A\|_F^2$. Then, for $\epsilon \in (0, 1)$,

$\|A - \tilde X\tilde X^\top A\|_F^2 \le \frac{1+\epsilon}{1-\epsilon}\,\|A - X^\star X^{\star\top} A\|_F^2$.

Proof. Using Lemma 3, we have
$(1-\epsilon)\,\|A - \tilde X\tilde X^\top A\|_F^2 \le \|\tilde A - \tilde X\tilde X^\top\tilde A\|_F^2$ and $\|\tilde A - X^\star X^{\star\top}\tilde A\|_F^2 \le (1+\epsilon)\,\|A - X^\star X^{\star\top} A\|_F^2$.
On the other hand, by the optimality of $\tilde X$ for $\tilde A$, we have
$\|\tilde A - \tilde X\tilde X^\top\tilde A\|_F^2 \le \|\tilde A - X^\star X^{\star\top}\tilde A\|_F^2$.
Combining the above inequalities, we have
$\|A - \tilde X\tilde X^\top A\|_F^2 \le \frac{1+\epsilon}{1-\epsilon}\,\|A - X^\star X^{\star\top} A\|_F^2$.
The proof is complete.

Theorem 2 implies that if $\tilde X$ is an optimal solution for $\tilde A$, then it also preserves a $\frac{1+\epsilon}{1-\epsilon} = 1 + O(\epsilon)$ approximation ratio for $A$. We compare our algorithm with existing dimension reduction methods for k-means clustering in Table 1.

Table 1. Comparison of dimension reduction methods for k-means clustering on a data matrix $A \in \mathbb{R}^{n\times d}$ with $n$ data points and $d$ features. $\delta \in (0, 1)$ represents the confidence level. Note that we consider the high-dimensional regime where $n \ll d$.

Methods | Dimension | Time | Approximation Ratio
Boutsidis et al. (2009) | $O(\epsilon^{-2}k\log(k/\epsilon))$ | $O(\min(nd^2, n^2d))$ | $2+\epsilon$
Boutsidis et al. (2015) | $O(k/\epsilon^2)$ | $O(nd\,\epsilon^{-2}k/\log(n))$ | $2+\epsilon$
Liu et al. (2017) | $O(\max(\epsilon^{-2}(k+\log(1/\delta)),\, 6/(\epsilon^2\delta)))$ | $O(\mathrm{nnz}(A))$ | $1+\epsilon$
Pourkamali-Anaraki & Becker (2017) | $O(\log(n)/n)$ | $O(nd\log(d)+d\log(n))$ | N/A
This Work | $O(\epsilon^{-2}n\log n\log(\|A\|_2))$ | see Section 3.1 | see Theorem 2
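To make the indicator-matrix formulation of Section 3.2 concrete, the sketch below builds $X$ from a label vector and checks that $\|A - XX^\top A\|_F^2$ coincides with the usual sum of squared distances to the cluster centers; the toy data and the function name are ours, purely for illustration.

```python
import numpy as np

def indicator_matrix(labels, k):
    """n x k matrix with X[i, j] = 1/sqrt(|C_j|) if point i is in cluster j."""
    n = len(labels)
    X = np.zeros((n, k))
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if len(members):
            X[members, j] = 1.0 / np.sqrt(len(members))
    return X

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 30))            # 200 points, 30 features
labels = rng.integers(0, 3, size=200)         # an arbitrary 3-way partition
X = indicator_matrix(labels, 3)

# k-means objective written as a low-rank approximation error
obj_lowrank = np.linalg.norm(A - X @ X.T @ A, 'fro') ** 2

# the same objective written as squared distances to cluster centers, Eq. (3.5)
obj_kmeans = sum(
    np.sum((A[labels == j] - A[labels == j].mean(axis=0)) ** 2)
    for j in range(3)
)
print(np.isclose(obj_lowrank, obj_kmeans))    # True
```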
4. Experiments

This section describes an empirical study of the efficacy and efficiency of our algorithm. We first elaborate the experimental settings.

Data Sets. We perform the experiments on 6 real-world data sets: USPS (https://archive.ics.uci.edu/ml/datasets.html), AR (http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html), COIL20 (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php), CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), MNIST (http://yann.lecun.com/exdb/mnist/) and ORL (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html). A summary is shown in Table 2.

Table 2. Summary of Data Sets.
Data set | # Data Points | # Features
USPS | 9,298 | 256
AR | 1,400 | 3,168
COIL20 | 1,440 | 1,024
CIFAR-10 | 60,000 | 512
MNIST | 70,000 | 784
ORL | 96,436 | 1,770

Comparative Methods. We compare our algorithm with state-of-the-art feature selection approaches, including a supervised model, alpha-investing (Alpha) (Zhou et al., 2005), as well as unsupervised models, the λ-ridge leverage score (Lev S) (Alaoui & Mahoney, 2015) and the Laplacian score (Lap S) (He et al., 2005). We also compare to sparse random projection (SEC) (Liu et al., 2017), which is specifically designed for k-means clustering.

Pipeline. After running our method and the baselines above, we obtain a reduced set of features. We then feed it to the standard k-means clustering implementation available in Matlab. We also report the clustering result based on the original set of features, which we simply denote by k-means.

Results. We report the clustering accuracy against the number of selected features in Figure 1. We can see that our algorithm achieves competitive performance with the batch methods. For example, our algorithm outperforms all the baseline methods on COIL20, CIFAR-10 and ORL when the number of selected features varies from 10 to 500. The clustering performance on our selected subset even exceeds that obtained with all available features.

[Figure 1. Clustering accuracy against the number of selected features. Panels: USPS, AR, COIL20, CIFAR-10, MNIST, ORL; curves: Lap S, Lev S, Alpha, SEC, k-means, OFS.]
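For reference, a minimal Python version of this pipeline might look as follows; scikit-learn's KMeans stands in for the Matlab implementation used in the paper, the metric is the usual Hungarian-matching clustering accuracy, and the random data and random feature subset are placeholders for the benchmarks and the OFS output above.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-matching accuracy between predicted and true cluster labels."""
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)     # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

# placeholder data: n samples, d features, with known labels
rng = np.random.default_rng(0)
n, d, k = 300, 1000, 3
y = rng.integers(0, k, size=n)
A = rng.standard_normal((n, d)) + 3.0 * rng.standard_normal((k, d))[y]

# 'selected' would come from OFS (or any baseline); here we take a random subset
selected = rng.choice(d, size=100, replace=False)
A_reduced = A[:, selected]

y_pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_reduced)
print("clustering accuracy:", clustering_accuracy(y, y_pred))
```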
Computational Efficiency. We illustrate the running time in Table 3. In terms of efficiency, our algorithm outperforms most of the comparative methods. This is not surprising, since the batch methods often update the model with all the data, while we process features one by one. For example, on the CIFAR-10 data set, the Laplacian score requires 2 minutes for feature selection because computing the graph matrix over the global feature space is expensive. Our algorithm, in contrast, only requires a few seconds: in each iteration, we operate on a skinny matrix $\tilde A$ instead of the whole data matrix $A$.

Table 3. CPU time of comparative algorithms (in seconds unless stated otherwise).
Data set | Lap S | Lev S | Alpha | SEC | OFS
USPS | 1.46 | 0.94 | 3.30 | 0.003 | 0.15
AR | 0.70 | 7.76 | 26.99 | 0.005 | 1.74
COIL20 | 0.23 | 0.64 | 4.79 | 0.002 | 0.58
CIFAR-10 | 2 mins | 13.79 | 3 mins | 0.002 | 2.28
MNIST | 19.38 | 10.23 | 9.62 | 0.003 | 1.32
ORL | 0.24 | 0.45 | 0.42 | 0.003 | 0.05

5. Conclusion

In this paper, we have proposed an online feature selection algorithm for k-means clustering. For features arriving in a stream, we approximate their leverage scores in an online manner and perform feature selection based on these inexact scores. We provide a theoretical guarantee that our unsupervised approach produces an accurate approximation of the original feature space. Moreover, in the high-dimensional regime the algorithm is computationally efficient and consumes little memory. Perhaps more importantly, our algorithm is capable of addressing streaming data, which makes it a perfect fit for large-scale learning systems. In addition, we extend the analysis to the k-means clustering problem and provably show that, with the set of features reduced by our approach, we are still able to obtain a near-optimal solution to the original k-means problem. The extensive empirical study matches our analysis well.

Acknowledgements

Jing Wang, Jie Shen and Ping Li are supported by NSF-Bigdata-1419210 and NSF-III-1360971. Jing Wang is supported by grants from the Dalio Foundation. The work was done when Jing Wang was a postdoc at Rutgers University.

A. Technical Lemmas

We provide a technical lemma which is due to (Tropp et al., 2011).

Lemma 4. Let $Y_0, Y_1, \ldots, Y_n$ be a matrix martingale of self-adjoint matrices of dimension $d$, and let $X_1, \ldots, X_n$ be such that $X_k = Y_k - Y_{k-1}$ for all $1 \le k \le n$. Assume $\|X_k\|_2 \le R$ almost surely for all $k$. Define the predictable quadratic variation process $W_k = \sum_{j=1}^{k} \mathbb{E}_{j-1}[X_j^2]$ for all $k$, where $\mathbb{E}_{j-1}[X_j^2]$ denotes the expectation of $X_j^2$ conditioned on $X_1, \ldots, X_{j-1}$. Then, for all $\epsilon > 0$ and $\sigma^2 > 0$,

$\Pr\big(\|Y_n\|_2 \ge \epsilon \text{ and } \|W_n\|_2 \le \sigma^2\big) \le d\,\exp\!\Big(-\frac{\epsilon^2/2}{\sigma^2 + R\epsilon/3}\Big)$.

References

Alaoui, A. and Mahoney, M. W. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems, pp. 775–783, 2015.

Awasthi, P., Blum, A., and Sheffet, O. Stability yields a PTAS for k-median and k-means clustering. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, pp. 309–318, 2010.

Boutsidis, C., Drineas, P., and Mahoney, M. W. Unsupervised feature selection for the k-means clustering problem. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, pp. 153–161, 2009.

Boutsidis, C., Zouzias, A., Mahoney, M. W., and Drineas, P. Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045–1062, 2015.

Cai, D., Zhang, C., and He, X. Unsupervised feature selection for multi-cluster data. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining, pp. 333–342, 2010.

Chen, J., Stern, M., Wainwright, M. J., and Jordan, M. I. Kernel feature selection via conditional covariance minimization. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pp. 6949–6958, 2017.
Cohen, M. B., Lee, Y. T., Musco, C., Musco, C., Peng, R., and Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science, pp. 181–190. ACM, 2015.

Cohen, M. B., Musco, C., and Pachocki, J. Online row sampling. In Proceedings of the 19th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pp. 7:1–7:18, 2016.

Donoho, D. and Jin, J. Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences, 105(39):14790–14795, 2008.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

Forman, G. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(3):1289–1305, 2003.

Gittens, A. and Mahoney, M. W. Revisiting the Nyström method for improved large-scale machine learning. Journal of Machine Learning Research, 28(3):567–575, 2013.

Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(3):1157–1182, 2003.

Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.

He, X., Cai, D., and Niyogi, P. Laplacian score for feature selection. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, pp. 507–514, 2005.

Kong, X. and Yu, P. S. Semi-supervised feature selection for graph classification. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining, pp. 793–802, 2010.

Koutis, I., Levin, A., and Peng, R. Faster spectral sparsification and numerical algorithms for SDD matrices. ACM Transactions on Algorithms, 12(2):17, 2016.

Li, M., Miller, G. L., and Peng, R. Iterative row sampling. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science, pp. 127–136, 2013.

Liu, W., Shen, X., and Tsang, I. Sparse embedded k-means clustering. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pp. 3321–3329, 2017.

Musco, C. and Musco, C. Recursive sampling for the Nyström method. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pp. 3836–3848, 2017.

Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. The effectiveness of Lloyd-type methods for the k-means problem. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pp. 165–176, 2006.

Peng, H., Long, F., and Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.

Perkins, S. and Theiler, J. Online feature selection using grafting. In Proceedings of the 20th International Conference on Machine Learning, pp. 592–599, 2003.

Pourkamali-Anaraki, F. and Becker, S. Preconditioned data sparsification for big data with applications to PCA and k-means. IEEE Transactions on Information Theory, 63(5):2954–2974, 2017.

Shen, J. and Li, P. On the iteration complexity of support recovery via hard thresholding pursuit. In Proceedings of the 34th International Conference on Machine Learning, pp. 3115–3124, 2017.

Spielman, D. A. and Srivastava, N. Graph sparsification by effective resistances. SIAM Journal on Computing, 40(6):1913–1926, 2011.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pp. 267–288, 1996.

Tropp, J. A. et al. Freedman's inequality for matrix martingales. Electronic Communications in Probability, 16:262–270, 2011.

Wang, J., Zhao, Z.-Q., Hu, X., Cheung, Y.-M., Wang, M., and Wu, X. Online group feature selection. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pp. 1757–1763, 2013.

Wang, J., Wang, M., Li, P., Liu, L., Zhao, Z., Hu, X., and Wu, X. Online feature selection with group structure analysis. IEEE Transactions on Knowledge and Data Engineering, 27(11):3029–3041, 2015.

Wei, X. and Philip, S. Y. Unsupervised feature selection by preserving stochastic neighbors. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 995–1003, 2016.

Wu, X., Yu, K., Wang, H., and Ding, W. Online streaming feature selection. In Proceedings of the 27th International Conference on Machine Learning, pp. 1159–1166, 2010.

Xu, Z., Huang, G., Weinberger, K. Q., and Zheng, A. X. Gradient boosted feature selection. In Proceedings of the 20th ACM International Conference on Knowledge Discovery and Data Mining, pp. 522–531, 2014.

Yang, W. and Xu, H. A unified robust regression model for lasso-like algorithms. In Proceedings of the 30th International Conference on Machine Learning, pp. 585–593, 2013.

Zhang, T. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10(3):555–568, 2009.

Zhang, Y., Ray, S., and Guo, E. W. On the consistency of feature selection with lasso for non-linear targets. In Proceedings of the 33rd International Conference on Machine Learning, pp. 183–191, 2016.

Zhao, Z. and Liu, H. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 1151–1157, 2007.

Zhou, J., Foster, D., Stine, R., and Ungar, L. Streaming feature selection using alpha-investing. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining, pp. 384–393, 2005.

Zhu, X., Ghahramani, Z., and Lafferty, J. D. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, pp. 912–919, 2003.