# Theory of Dual-sparse Regularized Randomized Reduction

Tianbao Yang (TIANBAO-YANG@UIOWA.EDU), Department of Computer Science, The University of Iowa, Iowa City, USA
Lijun Zhang (ZHANGLJ@LAMDA.NJU.EDU.CN), National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Rong Jin (RONGJIN@CSE.MSU.EDU), Department of Computer Science and Engineering, Michigan State University, East Lansing, USA; Institute of Data Science and Technologies at Alibaba Group, Seattle, USA
Shenghuo Zhu (SHENGHUO@GMAIL.COM), Institute of Data Science and Technologies at Alibaba Group, Seattle, USA

Abstract

In this paper, we study randomized reduction methods, which reduce high-dimensional features into a low-dimensional space by randomized methods (e.g., random projection, random hashing), for large-scale high-dimensional classification. Previous theoretical results on randomized reduction methods hinge on strong assumptions about the data, e.g., low rank of the data matrix or a large separable margin of classification, which hinder their applications in broad domains. To address these limitations, we propose dual-sparse regularized randomized reduction methods that introduce a sparse regularizer into the reduced dual problem. Under a mild condition that the original dual solution is a (nearly) sparse vector, we show that the resulting dual solution is close to the original dual solution and concentrates on its support set. In numerical experiments, we present an empirical study to support the analysis, and we also present a novel application of the dual-sparse regularized randomized reduction methods to reducing the communication cost of distributed learning from large-scale high-dimensional data.

1. Introduction

As the scale and dimensionality of data continue to grow in many applications (e.g., bioinformatics, finance, computer vision, medical informatics) (Sánchez et al., 2013; Mitchell et al., 2004; Simianer et al., 2012; Bartz et al., 2011), it becomes critical to develop efficient and effective algorithms to solve big data machine learning problems. Randomized reduction methods for large-scale or high-dimensional data analytics have received a great deal of attention in recent years (Mahoney & Drineas, 2009; Shi et al., 2012; Paul et al., 2013; Weinberger et al., 2009; Mahoney, 2011). By either reducing the dimensionality (referred to as feature reduction) or reducing the number of training instances (referred to as instance reduction), the resulting problem has a smaller size of training data that is not only memory-efficient but also computation-efficient. While randomized instance reduction has been studied extensively for fast least squares regression (Drineas et al., 2008; 2006; 2011; Ma et al., 2014), randomized feature reduction is more popular for linear classification (Blum, 2005; Shi et al., 2012; Paul et al., 2013; Weinberger et al., 2009; Shi et al., 2009a); for example, random hashing is a notable built-in tool in Vowpal Wabbit (http://hunch.net/~vw/), a fast learning library, for solving high-dimensional problems. In this paper, we focus on the latter technique and refer to randomized feature reduction as randomized reduction for short.
Although several theoretical properties have been examined for randomized reduction methods when applied to classification, e.g., generalization performance (Paul et al., 2013), preservation of margin (Blum, 2005; Balcan et al., 2006; Shi et al., 2012) and the recovery error of the model (Zhang et al., 2014), these previous results rely on strong assumptions about the data. For example, both (Paul et al., 2013) and (Zhang et al., 2014) assume the data matrix is of low rank, and (Blum, 2005; Balcan et al., 2006; Shi et al., 2012) assume that all examples in the original space are separated with a positive margin (with a high probability). Another analysis in (Zhang et al., 2014) assumes the weight vector for classification is sparse. These assumptions are too strong to hold in many real applications.

Contributions. To address these limitations, we propose dual-sparse regularized randomized reduction methods, referred to as DSRR, by leveraging the (near) sparsity of dual solutions for large-scale high-dimensional (LSHD) classification problems (i.e., the number of (effective) support vectors is small compared to the total number of examples). In particular, we add a dual-sparse regularizer to the reduced dual problem. We present a novel theoretical analysis of the recovery error of the dual variables and the primal variable and study its implication for different randomized reduction methods (e.g., random projection, random hashing and random sampling).

Novelties. Compared with previous works (Blum, 2005; Balcan et al., 2006; Shi et al., 2012; Paul et al., 2013), our theoretical analysis demands a mild assumption about the data and directly provides a guarantee of a small recovery error of the obtained model, which is critical for subsequent analysis, e.g., feature selection (Guyon et al., 2002; Brank et al., 2002) and model interpretation (Rätsch et al., 2005; Sonnenburg & Franc, 2010; Sonnenburg et al., 2007; Ben-Hur et al., 2008). For example, when exploiting a linear model to classify people as sick or not sick based on genomic markers, the learned weight vector is important for understanding the effect of different genomic markers on the disease and for designing effective medicine (Jostins & Barrett, 2011; Kang & Cho, 2011). In addition, the recovery could also increase the predictive performance, in particular when there exists noise in the original features (Goldberger et al., 2005). Compared with (Zhang et al., 2014), which proposes to recover a linear model in the original feature space by dual recovery, i.e., constructing a weight vector using the dual variables learned from the reduced problem and the original feature vectors, our methods are better in that (i) we rely on a more realistic assumption of the sparsity of dual variables (e.g., in the support vector machine (SVM)); (ii) we analyze both smooth loss functions and non-smooth loss functions (they focused on smooth functions); (iii) we study different randomized reduction methods in the same framework, not just random projection. In numerical experiments, we present an empirical study on a real data set to support our analysis, and we also demonstrate a novel application of the reduction and recovery framework in distributed learning from LSHD data, which combines the benefits of the two complementary techniques for addressing big data problems.
Distributed learning/optimization has recently received significant interest for solving big data problems (Jaggi et al., 2014; Li et al., 2014; Yang, 2013; Agarwal et al., 2011). However, it is notorious for its high communication cost, especially when the dimensionality of the data is very high. By solving a dimensionality-reduced data problem and using the recovered solution as an initial solution to the distributed optimization on the original data, we can reduce the number of iterations and the communication cost. In practice, we employ the recently developed distributed stochastic dual coordinate ascent algorithm (Yang, 2013), and observe that with the recovered solution as an initial solution we are able to attain almost the same performance with only one or two communications of high-dimensional vectors among multiple machines.

2. Preliminaries

Let $(x_i, y_i)$, $i = 1, \ldots, n$ denote a set of training examples, where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, -1\}$. Assume both $n$ and $d$ are very large. The goal of classification is to solve the following optimization problem:

$$w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i y_i) + \frac{\lambda}{2}\|w\|_2^2 \quad (1)$$

where $\ell(zy)$ is a convex loss function and $\lambda$ is a regularization parameter. Using the conjugate function, we can turn the problem into a dual problem:

$$\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top X^\top X\alpha \quad (2)$$

where $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d\times n}$ is the data matrix and $\ell_i^*(\alpha)$ is the convex conjugate function of $\ell(zy_i)$. Given the optimal dual solution $\alpha_*$, the optimal primal solution can be computed by $w_* = \frac{1}{\lambda n}X\alpha_*$. For LSHD problems, directly solving the primal problem (1) or the dual problem (2) could be very expensive. We aim to address the challenge by randomized reduction methods. Let $\mathcal{A}(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^m$ denote a randomized reduction operator that reduces a $d$-dimensional feature vector into an $m$-dimensional feature vector. Let $\widehat{x} = \mathcal{A}(x)$ denote the reduced feature vector. With the reduced feature vectors $\widehat{x}_1, \ldots, \widehat{x}_n$ of the training examples, a conventional approach is to solve the following reduced primal problem

$$u_* = \arg\min_{u\in\mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(u^\top \widehat{x}_i y_i) + \frac{\lambda}{2}\|u\|_2^2 \quad (3)$$

or its dual problem

$$\widehat{\alpha}_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}^\top\widehat{X}\alpha \quad (4)$$

where $\widehat{X} = (\widehat{x}_1, \ldots, \widehat{x}_n) \in \mathbb{R}^{m\times n}$. Previous studies have analyzed the reduced problems for random projection methods and proved the preservation of margin (Blum, 2005; Shi et al., 2012) and the preservation of the minimum enclosing ball (Paul et al., 2013). Zhang et al. (2014) proposed a dual recovery approach that constructs a recovered solution by $\widehat{w} = \frac{1}{\lambda n}\sum_{i=1}^n[\widehat{\alpha}_*]_i x_i$ and proved the recovery error for random projection under the assumption of a low-rank data matrix or a sparse $w_*$. In addition, they also showed that the naive recovery by $A^\top u_*$ (when $\mathcal{A}(x) = Ax$) has a large recovery error. One deficiency of the simple dual recovery approach is that, due to the reduction of the feature space, many non-support vectors of the original optimization problem become support vectors, which can corrupt the recovery. As a result, the original analysis of the dual recovery method requires a strong assumption on the data (i.e., the low-rank assumption). In this work, we address this limitation in a different way, which allows us to relax the assumption significantly.
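To make the conventional reduce-and-solve pipeline above concrete, here is a minimal numpy sketch (function names are ours) that forms the reduced features with a Gaussian random projection and solves the reduced primal (3) by plain gradient descent with the squared hinge loss; the paper does not prescribe a particular solver, so this is only an illustrative choice.

```python
import numpy as np

def gaussian_reduce(X, m, rng=None):
    """Reduce d-dimensional rows of X to m dimensions with A_ij ~ N(0, 1/m)."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))
    return X @ A.T, A                      # reduced examples (n x m) and the operator A

def solve_reduced_primal(Xhat, y, lam, n_iter=500):
    """Gradient descent on the reduced primal (3) with the squared hinge loss
    loss(z) = max(0, 1 - z)^2, where z = y_i * u^T xhat_i."""
    n, m = Xhat.shape
    u = np.zeros(m)
    # step size from a crude bound on the Lipschitz constant of the gradient
    step = 1.0 / (2.0 * np.linalg.norm(Xhat, 2) ** 2 / n + lam)
    for _ in range(n_iter):
        margins = y * (Xhat @ u)
        g_loss = -2.0 * np.maximum(0.0, 1.0 - margins)   # d loss / d margin
        grad = Xhat.T @ (g_loss * y) / n + lam * u
        u -= step * grad
    return u
```

For example, with $X\in\mathbb{R}^{n\times d}$ holding the examples as rows, `Xhat, A = gaussian_reduce(X, m)` followed by `u = solve_reduced_primal(Xhat, y, lam)` yields the solution of (3) that the dual recovery approach then tries to lift back to $\mathbb{R}^d$.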
3. DSRR and its Guarantee

To reduce the number of, or the contribution of, training instances that are non-support vectors in the original optimization problem but are transformed into support vectors due to the reduction of the feature space, we employ a simple trick: we add a dual-sparse regularization to the reduced dual problem. In particular, we solve the following problem:

$$\widetilde{\alpha}_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\widehat{X}^\top\widehat{X}\alpha - \frac{1}{n}R(\alpha) \quad (5)$$

where $R(\alpha) = \tau\|\alpha\|_1$, and $\tau > 0$ is a regularization parameter whose theoretical value will be revealed later. To further understand the added dual-sparse regularizer, we consider SVM, where the loss function can be either the hinge loss (a non-smooth function) $\ell(zy) = \max(0, 1 - zy)$ or the squared hinge loss (a smooth function) $\ell(zy) = \max(0, 1 - zy)^2$. We first consider the hinge loss, where $\ell_i^*(\alpha_i) = \alpha_i y_i$ for $\alpha_i y_i \in [-1, 0]$. Then the new dual problem is equivalent to

$$\max_{\alpha\circ y\in[-1,0]^n} -\frac{1}{n}\sum_{i=1}^n \alpha_i y_i - \frac{1}{2\lambda n^2}\alpha^\top\widehat{X}^\top\widehat{X}\alpha - \frac{\tau}{n}\|\alpha\|_1$$

Using the variable transformation $\alpha_i y_i = -\beta_i$, the above problem is equivalent to

$$\max_{\beta\in[0,1]^n} \frac{1}{n}\sum_{i=1}^n \beta_i(1-\tau) - \frac{1}{2\lambda n^2}(\beta\circ y)^\top\widehat{X}^\top\widehat{X}(\beta\circ y)$$

Changing into the primal form, we have

$$\min_{u\in\mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell_{1-\tau}(u^\top\widehat{x}_i y_i) + \frac{\lambda}{2}\|u\|_2^2 \quad (6)$$

where $\ell_\gamma(z) = \max(0, \gamma - z)$ is a max-margin loss with margin $\gamma$. In other words, adding the $\ell_1$ regularization to the reduced problem of SVM is equivalent to using a max-margin loss with a smaller margin, which is intuitive because examples become more difficult to separate after dimensionality reduction, and which is consistent with several previous studies showing that the margin shrinks in the reduced feature space (Blum, 2005; Shi et al., 2012). Similarly, for the squared hinge loss the equivalent primal problem is

$$\min_{u\in\mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell^2_{1-\tau}(u^\top\widehat{x}_i y_i) + \frac{\lambda}{2}\|u\|_2^2 \quad (7)$$

where $\ell^2_\gamma(z) = \max(0, \gamma - z)^2$. Although adding a dual-sparse regularizer is intuitive and can be motivated from previous results, we emphasize that the proposed dual-sparse formulation provides a new perspective, and bounding the dual recovery error $\|\widetilde{\alpha}_* - \alpha_*\|$ is a non-trivial task, which is a major contribution of this paper. We first state our main result in Theorem 1 for smooth loss functions.

Theorem 1. Let $\widetilde{\alpha}_*$ be the optimal dual solution to (5). Assume $\alpha_*$ is $s$-sparse with support set $S$. If $\tau \geq \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty$, then we have

$$\|[\widetilde{\alpha}_*]_{S^c}\|_1 \leq 3\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \quad (8)$$

Furthermore, if $\ell(z)$ is an $L$-smooth loss function (i.e., its gradient is $L$-Lipschitz continuous), we have

$$\|\widetilde{\alpha}_* - \alpha_*\|_2 \leq 3\tau L\sqrt{s}, \qquad \|\widetilde{\alpha}_* - \alpha_*\|_1 \leq 12\tau Ls \quad (9)$$

$$\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \leq 3\tau Ls, \qquad \|[\widetilde{\alpha}_*]_{S^c}\|_1 \leq 9\tau Ls \quad (10)$$

where $S^c$ is the complement of $S$, and $[\alpha]_S$ is the vector that only contains the elements of $\alpha$ in the set $S$.

Remark 1: The proof is presented at the end of Section 4. It can be seen that the dual recovery error is proportional to the value of $\tau$, which depends on $\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty$, a quantity that we can bound without any assumption about the data matrix or the optimal dual variable $\alpha_*$. In contrast, previous bounds (Zhang et al., 2013; 2014; Paul et al., 2013) depend on $\|X^\top X - \widehat{X}^\top\widehat{X}\|_2$, which requires the low-rank assumption on $X$. In the next section, we provide an upper bound of $\frac{1}{\lambda n}\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty$ that allows us to understand how the reduced dimensionality $m$ affects the recovery error. Essentially, the results indicate that for random projection, randomized Hadamard transform and random hashing, $\frac{1}{\lambda n}\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty \leq O\big(\sqrt{\tfrac{\log(n/\delta)}{m}}\big)\|w_*\|_2$ with a high probability $1-\delta$, and thus the recovery error scales as $\sqrt{1/m}$ in terms of $m$, the same order of recovery error as in (Zhang et al., 2013; 2014), which assume low rank of the data matrix.
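As a concrete illustration of (5), the sketch below solves the dual-sparse regularized reduced dual for the squared hinge loss with proximal gradient descent. For this loss the conjugate works out to $\ell_i^*(\alpha_i) = \alpha_i y_i + (\alpha_i y_i)^2/4$ on the domain $\alpha_i y_i \le 0$, and the prox step combines soft-thresholding (for the $\tau\|\alpha\|_1/n$ term) with projection onto that domain. The solver and step-size rule are our own illustrative choices under these assumptions, not the optimizer used in the paper.

```python
import numpy as np

def solve_dsrr_dual(Xhat, y, lam, tau, n_iter=1000):
    """Proximal-gradient sketch for (5) with the squared hinge loss, whose conjugate is
    l_i*(a) = a*y_i + (a*y_i)^2/4 on the domain a*y_i <= 0 (and +infinity otherwise).
    Xhat holds the reduced examples as rows (n x m); returns the sparse dual solution."""
    n = Xhat.shape[0]
    alpha = np.zeros(n)
    # Lipschitz constant of the smooth part's gradient: 1/(2n) from the conjugate term
    # plus ||Xhat||_2^2 / (lam * n^2) from the quadratic term.
    step = 1.0 / (0.5 / n + np.linalg.norm(Xhat, 2) ** 2 / (lam * n ** 2))
    for _ in range(n_iter):
        grad = (y + 0.5 * alpha) / n + Xhat @ (Xhat.T @ alpha) / (lam * n ** 2)
        v = alpha - step * grad
        # prox of step*(tau/n)*|a| plus the indicator of {a * y <= 0}:
        # soft-threshold, then clip each coordinate onto its feasible half-line.
        a = np.sign(v) * np.maximum(np.abs(v) - step * tau / n, 0.0)
        alpha = np.where(y > 0, np.minimum(a, 0.0), np.maximum(a, 0.0))
    return alpha
```

Setting `tau=0` recovers the plain reduced dual (4); with `tau > 0` many coordinates of the returned dual vector become exactly zero, which is the behavior that Theorem 1 quantifies.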
Remark 2: We would like to make a connection with LASSO for sparse signal recovery. In sparse signal recovery under noisy measurements $f = Uw + e$, where $e$ denotes the noise in the measurements, if the LASSO problem $\min_w \frac{1}{2}\|Uw - f\|_2^2 + \lambda\|w\|_1$ is solved, then the regularization parameter $\lambda$ is required to be larger than the noise-dependent quantity $\|U^\top e\|_\infty$ in order to obtain an accurate recovery (Eldar & Kutyniok, 2012). Similarly, in our formulation the added $\ell_1$ regularization $\tau\|\alpha\|_1$ counteracts the noise in $\widehat{X}^\top\widehat{X}$ as compared with $X^\top X$, and the value of $\tau$ depends on that noise.

To present the theoretical result for non-smooth loss functions, we need restricted eigenvalue conditions similar to those used in the sparse recovery analysis for LASSO (Bickel et al., 2009; Xiao & Zhang, 2013). In particular, we introduce the following definition of the restricted eigenvalue condition.

Definition 2. Given an integer $s > 0$, define $\mathcal{K}_{n,s} = \{\alpha\in\mathbb{R}^n : \|\alpha\|_2 \leq 1, \|\alpha\|_1 \leq \sqrt{s}\}$. We say that $X$ satisfies the restricted eigenvalue condition at sparsity level $s$ if there exist positive constants $\rho_s^+$ and $\rho_s^-$ such that

$$\rho_s^+ = \sup_{\alpha\in\mathcal{K}_{n,s}} \frac{\alpha^\top X^\top X\alpha}{n}, \qquad \rho_s^- = \inf_{\alpha\in\mathcal{K}_{n,s}} \frac{\alpha^\top X^\top X\alpha}{n}$$

We also define another quantity that measures the restricted eigenvalue of $X^\top X - \widehat{X}^\top\widehat{X}$, namely

$$\sigma_s = \sup_{\alpha\in\mathcal{K}_{n,s}} \frac{|\alpha^\top(X^\top X - \widehat{X}^\top\widehat{X})\alpha|}{n}. \quad (11)$$

Theorem 3. Let $\widetilde{\alpha}_*$ be the optimal dual solution to (5). Assume $\alpha_*$ is $s$-sparse with support set $S$. If $\tau \geq \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty$, then we have

$$\|[\widetilde{\alpha}_*]_{S^c}\|_1 \leq 3\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1$$

Assume further that the data matrix $X$ satisfies the restricted eigenvalue condition at sparsity level $16s$ and $\sigma_{16s} < \rho_{16s}^-$. Then we have

$$\|\widetilde{\alpha}_* - \alpha_*\|_2 \leq \frac{3\lambda}{2(\rho_{16s}^- - \sigma_{16s})}\tau\sqrt{s}, \qquad \|\widetilde{\alpha}_* - \alpha_*\|_1 \leq \frac{6\lambda}{\rho_{16s}^- - \sigma_{16s}}\tau s$$

Remark 3: The proof appears in the full version of the paper (http://arxiv.org/abs/1504.03991). Compared to smooth loss functions, the conditions that guarantee a small recovery error for non-smooth loss functions are more restrictive. In the next section, we provide a bound on $\sigma_{16s}$ to further understand the condition $\sigma_{16s} \leq \rho_{16s}^-$, which essentially implies that $m \geq \Omega\big(\big(\tfrac{\rho_{16s}^+}{\rho_{16s}^-}\big)^2 s\log(n/s)\big)$.

Last but not least, we provide a theoretical result on the recovery error for a nearly sparse optimal dual variable $\alpha_*$. We state the result for smooth loss functions. To quantify near sparsity, let $\alpha_*^s \in \mathbb{R}^n$ denote the vector that zeros out all entries of $\alpha_*$ except for the top $s$ elements in magnitude, and assume $\alpha_*^s$ satisfies the following condition:

$$\Big\|\nabla\ell^*(\alpha_*^s) + \frac{1}{\lambda n}X^\top X\alpha_*^s\Big\|_\infty \leq \xi \quad (12)$$

where $\nabla\ell^*(\alpha) = (\ell_1^{*\prime}(\alpha_1), \ldots, \ell_n^{*\prime}(\alpha_n))^\top$. The above condition can be considered as a sub-optimality condition (Boyd & Vandenberghe, 2004) of $\alpha_*^s$ measured in the infinity norm. For the optimal solution $\alpha_*$, we have $\nabla\ell^*(\alpha_*) + \frac{1}{\lambda n}X^\top X\alpha_* = 0$.

Theorem 4. Let $\widetilde{\alpha}_*$ be the optimal dual solution to (5). Assume $\alpha_*$ is nearly $s$-sparse such that (12) holds, with the support set of $\alpha_*^s$ given by $S$. If $\tau \geq \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty + 2\xi$, then we have

$$\|[\widetilde{\alpha}_*]_{S^c}\|_1 \leq 3\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1$$

Furthermore, if $\ell(z)$ is an $L$-smooth loss function, we have

$$\|\widetilde{\alpha}_* - \alpha_*^s\|_2 \leq 3\tau L\sqrt{s}, \qquad \|\widetilde{\alpha}_* - \alpha_*^s\|_1 \leq 12\tau Ls \quad (13)$$

$$\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \leq 3\tau Ls, \qquad \|[\widetilde{\alpha}_*]_{S^c}\|_1 \leq 9\tau Ls \quad (14)$$

Remark 4: The proof appears in the full version of the paper. Compared to Theorem 1 for an exactly sparse optimal dual solution, the dual recovery error bound for a nearly sparse optimal dual solution is increased by $6L\sqrt{s}\,\xi$ for the $\ell_2$ norm and by $24Ls\xi$ for the $\ell_1$ norm. Finally, we note that with the recovery error bound for the dual solution, we can easily derive an error bound for the primal solution $\widetilde{w} = \frac{1}{\lambda n}X\widetilde{\alpha}_*$.
Below we present a theorem for smooth loss functions. One can easily extend the result to non-smooth loss functions.

Theorem 5. Let $\widetilde{w} = \frac{1}{\lambda n}X\widetilde{\alpha}_*$ be the recovered primal solution, with $\widetilde{\alpha}_*$ the optimal dual solution to (5). Assume $\alpha_*$ is $s$-sparse and $\ell(z)$ is an $L$-smooth loss function. If $\tau \geq \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty$, then we have

$$\|\widetilde{w} - w_*\|_2 \leq \frac{3\sigma_1 L\tau\sqrt{s}}{\lambda n}$$

where $\sigma_1$ is the maximum singular value of $X$. Furthermore, if $\frac{1}{n}X^\top X$ has a restricted eigenvalue $\rho_{16s}^+$ at sparsity level $16s$, then

$$\|\widetilde{w} - w_*\|_2 \leq \sqrt{\frac{\rho_{16s}^+}{\lambda^2 n}}\,3L\tau\sqrt{s}$$

Remark 5: Since $\rho_{16s}^+$ is always no larger than $\sigma_1^2/n$, the second result (when the restricted eigenvalue condition holds) is always better than the first result. With the bound on $\tau$ revealed later, we can see that the error of $\widetilde{w}$ scales as $O(\sqrt{\tfrac{s}{m}}\|w_*\|_2)$ in terms of the sparsity $s$ of $\alpha_*$, the reduced dimensionality $m$ and the magnitude of $w_*$. A similar order of error bound was established in (Zhang et al., 2014) assuming $w_*$ is $s$-sparse and $X$ is approximately low rank. In contrast, we do not assume $X$ is approximately low rank.

4. Analysis

In this section, we first provide an upper-bound analysis of $\frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top\widehat{X})\alpha_*\|_\infty$ and $\sigma_s$, and then present the proof of Theorem 1 for smooth loss functions. To facilitate our analysis, we define

$$\mathcal{E} = \frac{1}{\lambda n}\big\|(\widehat{X}^\top\widehat{X} - X^\top X)\alpha_*\big\|_\infty$$

4.1. Bounding $\mathcal{E}$

A critical condition in both Theorem 1 and Theorem 3 is $\tau \geq 2\mathcal{E}$. In order to reveal the theoretical value of $\tau$ and its implication for various randomized reduction methods, we need to bound $\mathcal{E}$. We first provide a general analysis and then study its implication for various randomized reduction methods separately. The analysis is based on the following assumption, which essentially is implied by Johnson-Lindenstrauss (JL)-type lemmas.

Assumption 1 (A1). Let $\mathcal{A}(x) = Ax$ be a linear projection operator, where $A\in\mathbb{R}^{m\times d}$ is such that for any given $x\in\mathbb{R}^d$, with a high probability $1-\delta$ we have

$$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \leq \epsilon_{A,\delta}\|x\|_2^2$$

where $\epsilon_{A,\delta}$ depends on $m$, $\delta$ and possibly $d$.

With this assumption, we have the following theorem regarding the upper bound of $\mathcal{E}$.

Theorem 6. Suppose $A\in\mathbb{R}^{m\times d}$ satisfies Assumption A1. Then with a high probability $1-2\delta$ we have

$$\mathcal{E} \leq R\|w_*\|_2\,\epsilon_{A,\delta/n}$$

where $R = \max_i\|x_i\|_2$.

Proof. We have

$$\frac{1}{\lambda n}\|(\widehat{X}^\top\widehat{X} - X^\top X)\alpha_*\|_\infty = \frac{1}{\lambda n}\|(X^\top A^\top AX - X^\top X)\alpha_*\|_\infty = \frac{1}{\lambda n}\|X^\top(A^\top A - I)X\alpha_*\|_\infty = \|X^\top(I - A^\top A)w_*\|_\infty$$

where we use the fact $w_* = \frac{1}{\lambda n}X\alpha_*$. Then

$$\Big|\Big[\frac{1}{\lambda n}(\widehat{X}^\top\widehat{X} - X^\top X)\alpha_*\Big]_i\Big| = \big|x_i^\top(I - A^\top A)w_*\big|$$

Therefore, in order to bound $\mathcal{E}$ we need to bound $|x_i^\top(I - A^\top A)w_*|$ for all $i\in[n]$. We first bound it for an individual $i$ and then apply the union bound. Let $\widetilde{x}_i$ and $\widetilde{w}_*$ be the normalized versions of $x_i$ and $w_*$, i.e., $\widetilde{x}_i = x_i/\|x_i\|_2$ and $\widetilde{w}_* = w_*/\|w_*\|_2$. Suppose Assumption A1 is satisfied. Then with a probability $1-\delta$,

$$\widetilde{x}_i^\top A^\top A\widetilde{w}_* - \widetilde{x}_i^\top\widetilde{w}_* = \frac{\|A(\widetilde{x}_i + \widetilde{w}_*)\|_2^2 - \|A(\widetilde{x}_i - \widetilde{w}_*)\|_2^2}{4} - \widetilde{x}_i^\top\widetilde{w}_* \leq \frac{\epsilon_{A,\delta}}{2}\big(\|\widetilde{x}_i\|_2^2 + \|\widetilde{w}_*\|_2^2\big) = \epsilon_{A,\delta}$$

and similarly, with a probability $1-\delta$,

$$\widetilde{x}_i^\top A^\top A\widetilde{w}_* - \widetilde{x}_i^\top\widetilde{w}_* \geq -\frac{\epsilon_{A,\delta}}{2}\big(\|\widetilde{x}_i\|_2^2 + \|\widetilde{w}_*\|_2^2\big) = -\epsilon_{A,\delta}$$

Therefore, with a probability $1-2\delta$ we have

$$|x_i^\top A^\top Aw_* - x_i^\top w_*| \leq \|x_i\|_2\|w_*\|_2\,\big|\widetilde{x}_i^\top A^\top A\widetilde{w}_* - \widetilde{x}_i^\top\widetilde{w}_*\big| \leq \|x_i\|_2\|w_*\|_2\,\epsilon_{A,\delta}$$

Then applying the union bound over $i\in[n]$ (i.e., replacing $\delta$ with $\delta/n$) completes the proof.

Next, we discuss four classes of randomized reduction operators, namely random projection, randomized Hadamard transform, random hashing and random sampling, and study the corresponding $\epsilon_{A,\delta}$ and their implications for the recovery error.

Random Projection. Random projection has been employed widely for dimensionality reduction. The projection operator $A$ is usually sampled from a sub-Gaussian distribution with mean 0 and variance $1/m$, e.g., (i) the Gaussian distribution: $A_{ij}\sim\mathcal{N}(0, 1/m)$; (ii) the Rademacher distribution: $\Pr(A_{ij} = \pm 1/\sqrt{m}) = 0.5$; (iii) a discrete distribution: $\Pr(A_{ij} = \pm\sqrt{3/m}) = 1/6$ and $\Pr(A_{ij} = 0) = 2/3$. A small sampler sketch for these three choices is given below.
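The sketch below samples the three sub-Gaussian projection matrices listed above; it is an illustrative helper of ours (the function name and interface are not from the paper), with each entry having mean 0 and variance $1/m$ as required.

```python
import numpy as np

def projection_matrix(m, d, kind="gaussian", rng=None):
    """Sample a random projection A in R^{m x d} with zero-mean, variance-1/m entries.
    kind: 'gaussian', 'rademacher' (+-1/sqrt(m) with equal probability), or 'sparse'
    (Achlioptas: +-sqrt(3/m) with probability 1/6 each, 0 with probability 2/3)."""
    rng = np.random.default_rng() if rng is None else rng
    if kind == "gaussian":
        return rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))
    if kind == "rademacher":
        return rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)
    if kind == "sparse":
        vals = np.array([-np.sqrt(3.0 / m), 0.0, np.sqrt(3.0 / m)])
        return rng.choice(vals, size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
    raise ValueError(kind)
```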
The last two distributions for dimensionality reduction were proposed and analyzed in (Achlioptas, 2003). The following lemma is the general JL-type lemma for $A$ with sub-Gaussian entries, which reveals the value of $\epsilon_{A,\delta}$ in Assumption A1.

Lemma 1 (Nelson). Let $A\in\mathbb{R}^{m\times d}$ be a random matrix with sub-Gaussian entries of mean 0 and variance $1/m$. For any given $x$, with a probability $1-\delta$ we have

$$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \leq c\sqrt{\frac{\log(1/\delta)}{m}}\|x\|_2^2$$

where $c$ is some small universal constant.

Randomized Hadamard Transform. The randomized Hadamard transform was introduced to speed up random projection, reducing the computational time of random projection (i.e., the running time of computing $Ax$) from $O(dm)$ to $O(d\log d)$ or even $O(d\log m)$. The projection matrix $A$ is of the form $A = PHD$, where $D\in\mathbb{R}^{d\times d}$ is a diagonal matrix with $D_{ii} = \pm 1$ with equal probabilities, $H$ is the $d\times d$ Hadamard matrix (assuming $d$ is a power of 2) scaled by $1/\sqrt{d}$, and $P\in\mathbb{R}^{m\times d}$ is typically a sparse matrix that facilitates computing $Px$. Several choices of $P$ are possible (Nelson; Ailon & Chazelle, 2009; Tropp, 2011). Below we provide a JL-type lemma for a randomized Hadamard transform with $P\in\mathbb{R}^{m\times d}$ that samples $m$ coordinates of $\sqrt{\tfrac{d}{m}}HDx$ with replacement.

Lemma 2 (Nelson). Let $A = \sqrt{\tfrac{d}{m}}PHD\in\mathbb{R}^{m\times d}$ be a randomized Hadamard transform with $P$ a random sampling matrix. For any given $x$, with a probability $1-\delta$ we have

$$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \leq c\sqrt{\frac{\log(1/\delta)\log(d/\delta)}{m}}\|x\|_2^2$$

where $c$ is some small universal constant.

Remark 6: Compared to random projection, there is an additional $\sqrt{\log(d/\delta)}$ factor in $\epsilon_{A,\delta}$. However, it can be removed by applying an additional random projection. In particular, if we let $A \propto P'PHD\in\mathbb{R}^{m\times d}$ (suitably scaled), where $P\in\mathbb{R}^{t\times d}$ is a random sampling matrix with $t = m\log(d/\delta)$ and $P'\in\mathbb{R}^{m\times t}$ is a random projection matrix that satisfies Lemma 1, then we obtain the same order of $\epsilon_{A,\delta}$. Please refer to (Nelson) for more details.

Random Hashing. Another line of work to speed up random projection is random hashing, which makes the projection matrix $A$ much sparser and takes advantage of the sparsity of the feature vectors. It was introduced in (Shi et al., 2009b) for dimensionality reduction and later improved to an unbiased version by (Weinberger et al., 2009) with some theoretical analysis. Dasgupta et al. (2010) provided a rigorous analysis of the unbiased random hashing. Recently, Kane & Nelson (2014) proposed two new random hashing algorithms with a slightly sparser random matrix $A$. Here we provide a JL-type lemma for the random hashing algorithm in (Weinberger et al., 2009; Dasgupta et al., 2010). Let $h: [d]\rightarrow[m]$ denote a random hashing function and $\xi = (\xi_1, \ldots, \xi_d)$ denote a vector of independent Rademacher random variables, i.e., $\xi_i\in\{1, -1\}$ with equal probabilities. The projection matrix $A$ can be written as $A = HD$, where $D\in\mathbb{R}^{d\times d}$ is a diagonal matrix with $D_{jj} = \xi_j$, and $H\in\mathbb{R}^{m\times d}$ with $H_{ij} = \delta_{i,h(j)}$, where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise. Under the random matrix $A$, the feature vector $x\in\mathbb{R}^d$ is reduced to $\widehat{x}\in\mathbb{R}^m$, where $[\widehat{x}]_i = \sum_{j: h(j)=i}[x]_j\xi_j$. The following JL-type lemma is a basic result from (Dasgupta et al., 2010) with a rephrasing.

Lemma 3. Let $A = HD\in\mathbb{R}^{m\times d}$ be a random hashing matrix. For any given vector $x\in\mathbb{R}^d$ such that $\frac{\|x\|_\infty}{\|x\|_2} \leq \frac{1}{\sqrt{c}}$, and for $\delta < 0.1$, with a probability $1-3\delta$ we have

$$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \leq \sqrt{\frac{12\log(1/\delta)}{m}}\|x\|_2^2$$

where $c = 8\sqrt{m/3}\,\log^{1/2}(1/\delta)\log^2(m/\delta)$.

Remark 7: Compared to random projection, there is an additional condition on the feature vector, $\frac{\|x\|_\infty}{\|x\|_2}\leq\frac{1}{\sqrt{c}}$. However, it can be removed by applying an extra preconditioner $P$ to $x$ before applying the projection matrix $A$, i.e., $\widehat{x} = HDPx$.
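The mapping $[\widehat{x}]_i = \sum_{j: h(j)=i}[x]_j\xi_j$ described above is straightforward to implement. The sketch below does so without the preconditioner discussed next, and uses a seeded random table in place of a proper hash function, which is a simplification for illustration only.

```python
import numpy as np

def hash_reduce(x, m, seed=0):
    """Unbiased random hashing A = HD: bucket each coordinate j into h(j) with sign xi_j,
    so that [x_hat]_i = sum_{j: h(j) = i} x_j * xi_j. A seeded RNG replaces a real hash
    function here purely for illustration."""
    d = x.shape[0]
    rng = np.random.default_rng(seed)
    h = rng.integers(0, m, size=d)          # h : [d] -> [m]
    xi = rng.choice([-1.0, 1.0], size=d)    # Rademacher signs (the diagonal of D)
    x_hat = np.zeros(m)
    np.add.at(x_hat, h, x * xi)             # scatter-add coordinates into their buckets
    return x_hat
```

Because each input coordinate touches exactly one output coordinate, the reduction runs in time proportional to the number of non-zeros of $x$, which is the appeal of hashing for sparse high-dimensional features.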
Two preconditioners were discussed in (Dasgupta et al., 2010): one corresponds to duplicating $x$ $c$ times and scaling it by $1/\sqrt{c}$, and the other is given by $P\in\mathbb{R}^{d\times d}$ consisting of $d/b$ diagonal blocks of a $b\times b$ randomized Hadamard matrix, where $b = 6c\log(3c/\delta)$. The running time of the reduction using the latter preconditioner is $O(d\log c\log\log c)$.

Random Sampling. Last, we discuss random sampling and compare it with the aforementioned randomized reduction methods. In fact, the JL-type lemma for random sampling is implicit in the proof of Lemma 2. We make it explicit in the following lemma.

Lemma 4. Let $A = \sqrt{\tfrac{d}{m}}P\in\mathbb{R}^{m\times d}$ be a scaled random sampling matrix, where $P\in\mathbb{R}^{m\times d}$ samples $m$ coordinates with replacement. Then with a probability $1-\delta$ we have

$$\big|\|Ax\|_2^2 - \|x\|_2^2\big| \leq \frac{\|x\|_\infty}{\|x\|_2}\sqrt{\frac{3d\log(1/\delta)}{m}}\,\|x\|_2^2$$

Remark 8: Compared with the other three randomized reduction methods, there is an additional $\frac{\|x\|_\infty}{\|x\|_2}\sqrt{d}$ factor in $\epsilon_{A,\delta}$, which could result in a much larger $\epsilon_{A,\delta}$ and consequently a larger recovery error. That is why the randomized Hadamard transform was introduced: it makes this additional factor close to a constant.

From the above discussions, we can conclude that with random projection, randomized Hadamard transform and random hashing, with a probability $1-\delta$ we have

$$\mathcal{E} = \max_i\big|x_i^\top(I - A^\top A)w_*\big| \leq O\Big(\sqrt{\frac{\log(n/\delta)}{m}}\Big)R\|w_*\|_2$$

which essentially indicates that setting $\tau \geq 2cR\|w_*\|_2\sqrt{\tfrac{\log(n/\delta)}{m}}$ suffices (up to the extra logarithmic factor for the randomized Hadamard transform).

4.2. Bounding $\sigma_s$ for the non-smooth case

Another condition in Theorem 3 requires $\sigma_{16s} \leq \rho_{16s}^-$. Since $\rho_{16s}^-$ depends on the data, we provide an upper bound of $\sigma_{16s}$ to further understand the condition. In the following analysis, we assume $\epsilon_{A,\delta} = O\big(\sqrt{\tfrac{\log(1/\delta)}{m}}\big)$. Recall the definition of $\sigma_s$:

$$\sigma_s = \sup_{\alpha\in\mathcal{K}_{n,s}}\frac{|\alpha^\top(X^\top X - \widehat{X}^\top\widehat{X})\alpha|}{n}. \quad (15)$$

We provide a bound of $\sigma_s$ below. The key idea is to use the convex relaxation of $\mathcal{K}_{n,s}$. Define $\mathcal{S}_{n,s} = \{\alpha\in\mathbb{R}^n : \|\alpha\|_2\leq 1, \|\alpha\|_0\leq s\}$. It was shown in (Plan & Vershynin, 2011) that $\mathrm{conv}(\mathcal{S}_{n,s}) \subseteq \mathcal{K}_{n,s} \subseteq 2\,\mathrm{conv}(\mathcal{S}_{n,s})$, where $\mathrm{conv}(\mathcal{S})$ is the convex hull of the set $\mathcal{S}$. It is not difficult to show that (see the supplement)

$$\max_{\alpha\in\mathcal{K}_{n,s}}\big|(X\alpha)^\top(I - A^\top A)(X\alpha)\big| \leq 4\max_{\alpha_1,\alpha_2\in\mathcal{S}_{n,s}}\big|(X\alpha_1)^\top(I - A^\top A)(X\alpha_2)\big|$$

Let $u_1 = X\alpha_1$ and $u_2 = X\alpha_2$. For any fixed $\alpha_1, \alpha_2\in\mathcal{S}_{n,s}$, with a probability $1-\delta$ we have

$$\frac{1}{n}\big|(X\alpha_1)^\top(I - A^\top A)(X\alpha_2)\big| = O\Big(\sqrt{\frac{\log(1/\delta)}{m}}\Big)\rho_s^+$$

where we use

$$\max_{\alpha\in\mathcal{S}_{n,s}}\frac{\|X\alpha\|_2^2}{n} \leq \max_{\alpha\in\mathcal{K}_{n,s}}\frac{\|X\alpha\|_2^2}{n} = \rho_s^+$$

Then, by using Lemma 3.3 in (Plan & Vershynin, 2011) about the entropy of $\mathcal{S}_{n,s}$ and the union bound, we arrive at the following upper bound for $\sigma_s$.

Theorem 7. With a probability $1-\delta$, we have

$$\sigma_s \leq O\Big(\rho_s^+\sqrt{\frac{\log(1/\delta) + s\log(n/s)}{m}}\Big)$$

Remark 9: With the above result, we can further understand the condition $\sigma_{16s}\leq\rho_{16s}^-$, which amounts to

$$O\Big(\rho_{16s}^+\sqrt{\frac{\log(1/\delta) + s\log(n/s)}{m}}\Big) \leq \rho_{16s}^-, \quad\text{i.e.,}\quad m\geq\Omega\big(\kappa_{16s}^2\,(\log(1/\delta) + s\log(n/s))\big)$$

where $\kappa_{16s} = \rho_{16s}^+/\rho_{16s}^-$ is the restricted condition number of the data matrix.

4.3. Proof Sketch of Theorem 1

We present a proof sketch of Theorem 1. Due to the limitation of space, other proofs are provided in the supplement. Let $\widehat{F}(\alpha)$ be defined as

$$\widehat{F}(\alpha) = \frac{1}{n}\sum_{i=1}^n\ell_i^*(\alpha_i) + \frac{1}{2\lambda n^2}\alpha^\top\widehat{X}^\top\widehat{X}\alpha + \frac{\tau}{n}\|\alpha\|_1$$

Since $\widetilde{\alpha}_* = \arg\min_\alpha\widehat{F}(\alpha)$, for any $g\in\partial\|\alpha_*\|_1$ we have

$$0 \geq \widehat{F}(\widetilde{\alpha}_*) - \widehat{F}(\alpha_*) \geq (\widetilde{\alpha}_* - \alpha_*)^\top\Big[\frac{1}{n}\nabla\ell^*(\alpha_*) + \frac{1}{\lambda n^2}\widehat{X}^\top\widehat{X}\alpha_*\Big] + \frac{\tau}{n}(\widetilde{\alpha}_* - \alpha_*)^\top g + \frac{1}{2nL}\|\widetilde{\alpha}_* - \alpha_*\|_2^2$$

where we used the strong convexity of $\ell_i^*$ with strong convexity modulus $1/L$.
By the optimality condition of $\alpha_*$ for the original dual problem, we have

$$0 \geq (\alpha_* - \widetilde{\alpha}_*)^\top\Big[\frac{1}{n}\nabla\ell^*(\alpha_*) + \frac{1}{\lambda n^2}X^\top X\alpha_*\Big]$$

Combining the above two inequalities, we have

$$0 \geq (\widetilde{\alpha}_* - \alpha_*)^\top\frac{1}{\lambda n^2}(\widehat{X}^\top\widehat{X} - X^\top X)\alpha_* + \frac{\tau}{n}(\widetilde{\alpha}_* - \alpha_*)^\top g + \frac{1}{2nL}\|\widetilde{\alpha}_* - \alpha_*\|_2^2$$

Since the above inequality holds for any $g\in\partial\|\alpha_*\|_1$, if we choose $[g]_i = \mathrm{sign}([\alpha_*]_i)$ for $i\in S$ and $[g]_i = \mathrm{sign}([\widetilde{\alpha}_*]_i)$ for $i\in S^c$, then we have

$$(\widetilde{\alpha}_* - \alpha_*)^\top g \geq -\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 + \|[\widetilde{\alpha}_*]_{S^c}\|_1$$

Combining the above inequalities, and bounding the first term in absolute value by $\frac{\mathcal{E}}{n}\|\widetilde{\alpha}_* - \alpha_*\|_1$, leads to

$$(\tau + \mathcal{E})\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \geq (\tau - \mathcal{E})\|[\widetilde{\alpha}_*]_{S^c}\|_1 + \frac{1}{2L}\|\widetilde{\alpha}_* - \alpha_*\|_2^2$$

Assuming $\tau\geq 2\mathcal{E}$, we have

$$\|\widetilde{\alpha}_* - \alpha_*\|_2^2 \leq 3\tau L\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1, \qquad \|[\widetilde{\alpha}_*]_{S^c}\|_1 \leq 3\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \quad (17)$$

Therefore,

$$\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1^2 \leq s\,\|[\widetilde{\alpha}_* - \alpha_*]_S\|_2^2 \leq s\,\|\widetilde{\alpha}_* - \alpha_*\|_2^2 \leq 3\tau Ls\,\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1$$

leading to $\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \leq 3\tau Ls$. Combining this inequality with the inequalities in (17), we have $\|[\widetilde{\alpha}_*]_{S^c}\|_1\leq 9\tau Ls$ and $\|\widetilde{\alpha}_* - \alpha_*\|_2\leq 3\tau L\sqrt{s}$.

5. Numerical Experiments

In this section, we provide a case study in support of DSRR and the theoretical analysis, and a demonstration of the application of DSRR to distributed optimization.

A case study on text classification. We use the RCV1-binary data (Lewis et al., 2004) to conduct a case study. The data contains 697,641 documents and 47,236 features. We use a 677,399/20,242 split for training and testing. The feature vectors were normalized such that their $\ell_2$ norm is equal to 1. We only report the results using random hashing since it is the most efficient, while the other randomized reduction methods (except for random sampling) have similar performance. For the loss function, we use both the squared hinge loss (smooth) and the hinge loss (non-smooth). We aim to examine two questions related to our analysis and motivation: (i) how does the value of $\tau$ affect the recovery error? and (ii) how does the number of samples $m$ affect the recovery error? We vary the value of $\tau$ among $0, 0.1, 0.2, \ldots, 0.9$, the value of $m$ among $1024, 2048, 4096, 8192$, and the value of $\lambda$ among $0.001, 0.00001$. Note that $\tau = 0$ corresponds to the randomized reduction approach without the sparse regularizer. The results averaged over 5 random trials are shown in Figure 1 for the squared hinge loss and in Figure 2 for the hinge loss. We first analyze the results in Figure 1. We can observe that as $\tau$ increases the ratio $\|[\widetilde{\alpha}_*]_{S^c}\|_1/\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1$ decreases, indicating that the magnitude of the dual variables for the original non-support vectors decreases. This is intuitive and consistent with our motivation. The recovery error of the dual solution (middle panels) first decreases and then increases. This can be partially explained by the theoretical result in Theorem 1: once $\tau$ exceeds a certain threshold so that $\tau\geq 2\mathcal{E}$ holds, Theorem 1 implies that a larger $\tau$ leads to a larger error, whereas below that threshold the dual recovery error decreases as $\tau$ increases. In addition, the figures show that the thresholds for larger $m$ are smaller, which is consistent with our analysis of $\mathcal{E} = O(\sqrt{1/m})$. The difference between $\lambda = 0.001$ and $\lambda = 0.00001$ arises because a smaller $\lambda$ leads to a larger $\|w_*\|_2$. For the hinge loss, we observe similar trends; however, the recovery is much more difficult than for the squared hinge loss, especially when the value of $\lambda$ is small.

Figure 1. Recovery error for the squared hinge loss. From left to right: $\|[\widetilde{\alpha}_*]_{S^c}\|_1 / \|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1$ versus $\tau$, the relative dual error $\|\widetilde{\alpha}_* - \alpha_*\|_2/\|\alpha_*\|_2$ versus $\tau$, and the relative primal error $\|\widetilde{w} - w_*\|_2/\|w_*\|_2$ versus $\tau$, for $m\in\{1024, 2048, 4096, 8192\}$.

Figure 2. The same curves as in Figure 1, but for the non-smooth hinge loss.
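The quantities plotted in Figures 1 and 2 are simple functions of the two dual solutions. A small helper of our own (the function name and the support tolerance are illustrative choices) that computes them is:

```python
import numpy as np

def recovery_metrics(alpha_opt, alpha_rec, X, lam, tol=1e-10):
    """Metrics of the case study: ratio of non-support to support error, relative dual
    error, and relative primal error. alpha_opt is the dual solution of the original
    problem, alpha_rec the DSRR solution; X holds the original examples as rows."""
    n = X.shape[0]
    S = np.abs(alpha_opt) > tol                      # support set of the original dual solution
    support_err = np.abs(alpha_rec[S] - alpha_opt[S]).sum()
    nonsupport_err = np.abs(alpha_rec[~S]).sum()
    w_opt = X.T @ alpha_opt / (lam * n)              # primal from dual, as in Section 2
    w_rec = X.T @ alpha_rec / (lam * n)
    return {
        "nonsupport_over_support": nonsupport_err / support_err,
        "rel_dual_err": np.linalg.norm(alpha_rec - alpha_opt) / np.linalg.norm(alpha_opt),
        "rel_primal_err": np.linalg.norm(w_rec - w_opt) / np.linalg.norm(w_opt),
    }
```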
An application to distributed learning. Although in some cases the solution learned in the reduced space can provide sufficiently good performance, it usually performs worse than the optimal solution of the original problem, and sometimes the performance gap between them cannot be ignored, as seen in the following experiments. To address this issue, we combine the benefits of distributed learning and the proposed randomized reduction methods for solving big data problems. When data is too large and sits on multiple machines, distributed learning can be employed to solve the optimization problem. In distributed learning, individual machines iteratively solve sub-problems associated with the subset of data on them and communicate some global variables (e.g., the primal solution $w\in\mathbb{R}^d$) among them. When the dimensionality $d$ is very large, the total communication cost could be very high. To reduce the total communication cost, we propose to first solve the reduced data problem and then use the found solution as the initial solution for distributed learning on the original data. Below, we demonstrate the effectiveness of DSRR for the recently proposed distributed stochastic dual coordinate ascent (DisDCA) algorithm (Yang, 2013). The procedure is: (1) reduce the original high-dimensional data to a very low-dimensional space on individual machines; (2) use DisDCA to solve the reduced problem; (3) use the optimal dual solution of the reduced problem as an initial solution for DisDCA when solving the original problem. We record the running time for the randomized reduction in step 1 and the optimization of the reduced problem in step 2, and the optimization of the original problem in step 3. A single-machine sketch of the warm-start idea in step (3) is given below.
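The sketch below illustrates step (3) on a single machine, assuming `X` holds the original examples as rows and `w_init` is the model recovered from the reduced problem. Plain gradient descent on the original primal (1) with the squared hinge loss stands in for DisDCA, which we do not reimplement here; the point is only that a good initial solution reduces the remaining work on the original data.

```python
import numpy as np

def refine_from_warm_start(X, y, lam, w_init, n_iter=100):
    """Continue optimizing the ORIGINAL primal (1), squared hinge loss, from a warm start.
    A single-machine stand-in for warm-starting DisDCA with the recovered solution."""
    n = X.shape[0]
    w = w_init.copy()
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n + lam)
    for _ in range(n_iter):
        margins = y * (X @ w)
        g_loss = -2.0 * np.maximum(0.0, 1.0 - margins)   # derivative of max(0, 1 - z)^2
        w -= step * (X.T @ (g_loss * y) / n + lam * w)
    return w
```

Starting from the recovered solution rather than from zero typically means far fewer passes (and, in the distributed setting, far fewer rounds of communicating $d$-dimensional vectors) are needed to reach the same objective value.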
We compare the performance of four methods: (i) DSRR, which uses the model of the reduced problem solved by DisDCA to make predictions; (ii) the method that uses the recovered model in the original space, referred to as DSRR-Rec; (iii) the method that uses the dual solution of the reduced problem as an initial solution for DisDCA and runs it on the original problem with k = 1 or 2 communications (the number of updates before each communication is set to the number of examples on each machine), referred to as DSRR-DisDCA-k; and (iv) the distributed method that directly solves the original problem by DisDCA. For DisDCA solving the original problem, we stop running when its performance on the testing data does not improve. Two data sets are used, namely RCV1-binary and the KDD 2010 Cup data. For the KDD 2010 Cup data, we use the version available on the LibSVM data website. The statistics of the two data sets are summarized in Table 1.

Table 1. Statistics of datasets

| Name | #Training | #Testing | #Features | #Nodes |
|------|-----------|----------|-----------|--------|
| RCV1 | 677,399   | 20,242   | 47,236    | 5      |
| KDD  | 8,407,752 | 748,401  | 29,890,095| 10     |

Figure 3. Top: testing error of DSRR, DSRR-Rec, DSRR-DisDCA-1, DSRR-DisDCA-2, and DisDCA. Bottom: training time of the same methods. The value of $\lambda$ is $10^{-5}$ and the value of $\tau$ is 0.9. The high-dimensional features are reduced to an $m = 1024$-dimensional space using random hashing. The loss function is the squared hinge loss.

The results averaged over 5 trials are shown in Figure 3, which exhibits that the performance of DSRR-DisDCA-1/2 is remarkable in the sense that it achieves almost the same performance as directly training on the original data (DisDCA) while using much less training time. In addition, DSRR-DisDCA performs much better than DSRR and has small computational overhead.

6. Conclusions

In this paper, we have proposed dual-sparse regularized randomized reduction methods for classification. We presented a rigorous theoretical analysis of the proposed dual-sparse randomized reduction methods in terms of recovery error, under a mild condition that the optimal dual variable is (nearly) sparse, for both smooth and non-smooth loss functions and for various randomized reduction approaches. The numerical experiments validate our theoretical analysis and also demonstrate that the proposed reduction and recovery framework can benefit distributed optimization by providing a good initial solution.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful and insightful comments. T. Yang was supported in part by NSF (IIS-1463988). R. Jin was partially supported by NSF IIS-1251031 and ONR N000141410631.

References

Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66:671-687, 2003.

Agarwal, Alekh, Chapelle, Olivier, Dudík, Miroslav, and Langford, John. A reliable effective terascale linear learning system. CoRR, 2011.

Ailon, Nir and Chazelle, Bernard. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302-322, 2009.

Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65:79-94, 2006.

Bartz, Daniel, Hatrick, Kerr, Hesse, Christian W., Müller, Klaus-Robert, and Lemm, Steven. Directional variance adjustment: improving covariance estimates for high-dimensional portfolio optimization. arXiv:1109.3069, 2011.

Ben-Hur, Asa, Ong, Cheng Soon, Sonnenburg, Sören, Schölkopf, Bernhard, and Rätsch, Gunnar. Support vector machines and kernels for computational biology. PLoS Computational Biology, 4:e1000173, 2008.

Bickel, Peter J., Ritov, Ya'acov, and Tsybakov, Alexandre B. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4), 2009.

Blum, Avrim. Random projection, margins, kernels, and feature-selection. In Proceedings of the 2005 International Conference on Subspace, Latent Structure and Feature Selection, volume 3940, pp. 52-68. Springer, 2005.

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.
Brank, Janez, Grobelnik, Marko, Milić-Frayling, Nataša, and Mladenić, D. Feature selection using support vector machines. In Proceedings of the International Conference on Data Mining Methods and Databases for Engineering, Finance, and Other Fields, pp. 84-89, 2002.

Dasgupta, Anirban, Kumar, Ravi, and Sarlós, Tamás. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pp. 341-350, 2010.

Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Sampling algorithms for l2 regression and applications. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1127-1136, 2006.

Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844-881, 2008.

Drineas, Petros, Mahoney, Michael W., Muthukrishnan, S., and Sarlós, Tamás. Faster least squares approximation. Numerische Mathematik, 117(2):219-249, February 2011.

Eldar, Yonina C. and Kutyniok, Gitta. Compressed Sensing: Theory and Applications. Cambridge University Press, 2012. ISBN 9781107005587.

Goldberger, Jacob, Roweis, Sam, Hinton, Geoffrey, and Salakhutdinov, Ruslan. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), pp. 513-520, 2005.

Guyon, Isabelle, Weston, Jason, Barnhill, Stephen, and Vapnik, Vladimir. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422, 2002.

Jaggi, Martin, Smith, Virginia, Takáč, Martin, Terhorst, Jonathan, Krishnan, Sanjay, Hofmann, Thomas, and Jordan, Michael I. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), pp. 3068-3076, 2014.

Jostins, Luke and Barrett, Jeffrey C. Genetic risk prediction in complex disease. Human Molecular Genetics, 20(R2):R182-R188, 2011.

Kane, Daniel M. and Nelson, Jelani. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61:4:1-4:23, 2014.

Kang, J., Kugathasan, S., Georges, M., Zhao, H., and Cho, J. H. Improved risk prediction for Crohn's disease with a multi-locus approach. Human Molecular Genetics, 20:2435-2442, 2011.

Lewis, David D., Yang, Yiming, Rose, Tony G., and Li, Fan. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR), 5:361-397, 2004.

Li, Mu, Andersen, David G., Smola, Alex J., and Yu, Kai. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems (NIPS), pp. 19-27, 2014.

Ma, Ping, Mahoney, Michael W., and Yu, Bin. A statistical perspective on algorithmic leveraging. In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 91-99, 2014.

Mahoney, Michael W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.

Mahoney, Michael W. and Drineas, Petros. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702, 2009.

Mitchell, Tom M., Hutchinson, Rebecca, Niculescu, Radu S., Pereira, Francisco, Wang, Xuerui, Just, Marcel, and Newman, Sharlene. Learning to decode cognitive states from brain images. Machine Learning, 57(1-2):145-175, 2004.

Nelson, Jelani. Johnson-Lindenstrauss notes. Technical report, MIT.

Paul, Saurabh, Boutsidis, Christos, Magdon-Ismail, Malik, and Drineas, Petros. Random projections for support vector machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 498-506, 2013.
Plan, Yaniv and Vershynin, Roman. One-bit compressed sensing by linear programming. CoRR, abs/1109.4299, 2011.

Rätsch, G., Sonnenburg, S., and Schölkopf, B. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21:i369-i377, 2005.

Rätsch, Gunnar, Sonnenburg, Sören, and Schölkopf, Bernhard. RASE: recognition of alternatively spliced exons in C. elegans. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB) (Supplement of Bioinformatics), pp. 369-377, 2005.

Sánchez, Jorge, Perronnin, Florent, Mensink, Thomas, and Verbeek, Jakob J. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.

Shi, Qinfeng, Petterson, James, Dror, Gideon, Langford, John, Smola, Alex, and Vishwanathan, S. V. N. Hash kernels for structured data. Journal of Machine Learning Research (JMLR), 10:2615-2637, 2009a.

Shi, Qinfeng, Petterson, James, Dror, Gideon, Langford, John, Smola, Alexander J., Strehl, Alexander L., and Vishwanathan, Vishy. Hash kernels. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 496-503, 2009b.

Shi, Qinfeng, Shen, Chunhua, Hill, Rhys, and van den Hengel, Anton. Is margin preserved after random projection? In Proceedings of the International Conference on Machine Learning (ICML), 2012.

Simianer, Patrick, Riezler, Stefan, and Dyer, Chris. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 11-21, 2012.

Sonnenburg, Sören and Franc, Vojtěch. COFFIN: A computational framework for linear SVMs. In Proceedings of the International Conference on Machine Learning (ICML), pp. 999-1006, 2010.

Sonnenburg, Sören, Schweikert, Gabriele, Philips, Petra, Behr, Jonas, and Rätsch, Gunnar. Accurate splice site prediction using support vector machines. BMC Bioinformatics, 8, 2007.

Tropp, Joel A. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011.

Weinberger, Kilian Q., Dasgupta, Anirban, Langford, John, Smola, Alexander J., and Attenberg, Josh. Feature hashing for large scale multitask learning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1113-1120, 2009.

Xiao, Lin and Zhang, Tong. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062-1091, 2013.

Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems (NIPS), pp. 629-637, 2013.

Zhang, Lijun, Mahdavi, Mehrdad, Jin, Rong, Yang, Tianbao, and Zhu, Shenghuo. Recovering the optimal solution by dual random projection. In Proceedings of the Conference on Learning Theory (COLT), pp. 135-157, 2013.

Zhang, Lijun, Mahdavi, Mehrdad, Jin, Rong, Yang, Tianbao, and Zhu, Shenghuo. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300-7316, 2014.