# faster_projectionfree_convex_optimization_over_the_spectrahedron__b9d62965.pdf Faster Projection-free Convex Optimization over the Spectrahedron Dan Garber Toyota Technological Institute at Chicago dgarber@ttic.edu Minimizing a convex function over the spectrahedron, i.e., the set of all d d positive semidefinite matrices with unit trace, is an important optimization task with many applications in optimization, machine learning, and signal processing. It is also notoriously difficult to solve in large-scale since standard techniques require to compute expensive matrix decompositions. An alternative is the conditional gradient method (aka Frank-Wolfe algorithm) that regained much interest in recent years, mostly due to its application to this specific setting. The key benefit of the CG method is that it avoids expensive matrix decompositions all together, and simply requires a single eigenvector computation per iteration, which is much more efficient. On the downside, the CG method, in general, converges with an inferior rate. The error for minimizing a β-smooth function after t iterations scales like β/t. This rate does not improve even if the function is also strongly convex. In this work we present a modification of the CG method tailored for the spectrahedron. The per-iteration complexity of the method is essentially identical to that of the standard CG method: only a single eigenvector computation is required. For minimizing an -strongly convex and β-smooth function, the expected error of the method after t iterations is: rank(X ) 1/4t β p λmin(X )t where rank(X ), λmin(X ) are the rank of the optimal solution and smallest nonzero eigenvalue, respectively. Beyond the significant improvement in convergence rate, it also follows that when the optimum is low-rank, our method provides better accuracy-rank tradeoff than the standard CG method. To the best of our knowledge, this is the first result that attains provably faster convergence rates for a CG variant for optimization over the spectrahedron. We also present encouraging preliminary empirical results. 1 Introduction Minimizing a convex function over the set of positive semidefinite matrices with unit trace, aka the spectrahedron, is an important optimization task which lies at the heart of many optimization, machine learning, and signal processing tasks such as matrix completion [1, 13], metric learning [21, 22], kernel matrix learning [16, 9], multiclass classification [2, 23], and more. Since modern applications are mostly of very large scale, first-order methods are the obvious choice to deal with this optimization problem. However, even these are notoriously difficult to apply, since most of the popular gradient schemes require the computation of an orthogonal projection on each iteration to enforce feasibility, which for the spectraheron, amounts to computing a full eigen-decomposition 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain. of a real symmetric matrix. Such a decomposition requires O(d3) arithmetic operations for a d d matrix and thus is prohibitive for high-dimensional problems. An alternative is to use first-order methods that do not require expensive decompositions, but rely only on computationally-cheap leading eigenvector computations. These methods are mostly based on the conditional gradient method, also known as the Frank-Wolfe algorithm [3, 12], which is a generic method for constrained convex optimization given an oracle for minimizing linear functions over the feasible domain. Indeed, linear minimization over the spectrahedron amounts to a single leading eigenvector computation. While the CG method has been discovered already in the 1950 s [3], it has regained much interest in recent years in the machine learning and optimization communities, in particular due to its applications to semidefinite optimization and convex optimization with a nuclear norm constraint / regularization1, e.g., [10, 13, 17, 19, 22, 2, 11]. This regained interest is not surprising: while a full eigen-decomposition for d d matrix requires O(d3) arithmetic operations, leading eigenvecor computations can be carried out, roughly speaking, in worst-case time that is only linear in the number of non-zeros in the input matrix multiplied by either 1 for the popular Power Method or by 1/2 for the more efficient Lanczos method, where is the target accuracy. These running times improve exponentially to only depend on log(1/ ) when the eigenvalues of the input matrix are well distributed [14]. Indeed, in several important machine learning applications, such as matrix completion, the CG method requires eigenvector computations of very sparse matrices [13]. Also, very recently, new eigenvector algorithms with significantly improved performance guarantees were introduced which are applicable for matrices with certain popular structure [5, 8, 20]. The main drawback of the CG method is that its convergence rate is, in general, inferior compared to projection-based gradient methods. The convergence rate for minimizing a smooth function, roughly speaking, scales only like 1/t. In particular, this rate does not improve even when the function is also strongly convex. On the other hand, the convergence rate of optimal projection-based methods, such as Nesterov s accelerated gradient method, scales like 1/t2 for smooth functions, and can be improved exponentially to exp( (t)) when the objective is also strongly convex. Very recently, several successful attempts were made to devise natural modifications of the CG method that retain the overall low per-iteration complexity, while enjoying provably faster convergence rates, usually under a strong-convexity assumption, or a slightly weaker one. These results exhibit provablyfaster rates for optimization over polyhedral sets [7, 15] and strongly-convex sets [6], but do not apply to the spectrahedron. For the specific setting considered in this work, several heuristic improvements of the CG method were suggested which show promising empirical evidence, however, non of them provably improve over the rate of the standard CG method [19, 17, 4]. In this work we present a new non-trivial variant of the CG method, which, to the best of our knowledge, is the first to exhibit provably faster convergence rates for optimization over the spectrahedron, under standard smoothness and strong convexity assumptions. The per-iteration complexity of the method is essentially identical to that of the standard CG method, i.e., only a single leading eigenvector computation per iteration is required. Our method is tailored for optimization over the spectrahedron, and can be seen as a certain hybridization of the standard CG method and the projected gradient method. From a high-level view, we take advantage of the fact that solving a 2-regularized linear problem over the set of extreme points of the spectrahedron is equivalent to linear optimization over this set, i.e., amounts to a single eigenvector computation. We then show via a novel and non-trivial analysis, that includes new decomposition concepts for positive semidefinite matrices, that such an algorithmically-cheap regularization is sufficient, in presence of strong convexity, to derive faster convergence rates. 2 Preliminaries and Notation For vectors we let k k denote the standard Euclidean norm, while for matrices we let k k denote the spectral norm, k k F denote the Frobenius norm, and k k denote the nuclear norm. We denote by Sd the space of d d real symmetric matrices, and by Sd the spectrahedron in Sd, i.e., Sd := {X 2 Sd | X 0, Tr(X) = 1}. We let Tr( ) and rank( ) denote the trace and rank of a given matrix in Sd, respectively. We let denote the standard inner-product for matrices. Given a matrix X 2 Sd, we let λmin(X) denote the smallest non-zero eigenvalue of X. 1minimizing a convex function subject to a nuclear norm constraint is efficiently reducible to the minimization of the function over the spectrahedron, as fully detailed in [13]. Given a matrix A 2 Sd, we denote by EV(A) an eigenvector of A that corresponds to the largest (signed) eigenvalue of A, i.e., EV(A) 2 arg maxv:kvk=1 v>Av. Given a scalar > 0, we also denote by EV (A) an -approximation to the largest (in terms of eigenvalue) eigenvector of A, i.e., EV (A) returns a unit vector v such that v>Av λmax(A) . Definition 1. We say that a function f(X) : Rm n ! R is -strongly convex w.r.t. a norm k k, if for all X, Y 2 Rm n it holds that f(Y) f(X) + (Y X) rf(X) + Definition 2. We say that a function f(X) : Rm n ! R is β-smooth w.r.t. a norm k k, if for all X, Y 2 Rm n it holds that f(Y) f(X) + (Y X) rf(X) + β The first-order optimality condition implies that for a -strongly convex f, if X is the unique minimizer of f over a convex set K Rm n, then for all X 2 K it holds that 2 k X X k2. (1) 2.1 Problem setting The main focus of this work is the following optimization problem: min X2Sd f(X), (2) where we assume that f(X) is both -strongly convex and β-smooth w.r.t. k k F . We denote the (unique) minimizer of f over Sd by X . 3 Our Approach We begin by briefly describing the conditional gradient and projected-gradient methods, pointing out their advantages and short-comings for solving Problem (2) in Subsection 3.1. We then present our new method which is a certain combination of ideas from both methods in Subsection 3.2. 3.1 Conditional gradient and projected gradient descent The standard conditional gradient algorithm is detailed below in Algorithm 1. Algorithm 1 Conditional Gradient 1: input: sequence of step-sizes { t}t 1 [0, 1] 2: let X1 be an arbitrary matrix in Sd 3: for t = 1... do 4: vt EV ( rf(Xt)) 5: Xt+1 Xt + t(vtv> t Xt) 6: end for Let us denote the approximation error of Algorithm 1 after t iterations by ht := f(Xt) f(X ). The convergence result of Algorithm 1 is based on the following simple observations: ht+1 = f(Xt + t(vtv> t Xt)) f(X ) (3) ht + t(vtv> t Xt) rf(Xt) + 2 ht + t(X Xt) rf(Xt) + 2 F (1 t)ht + 2 where the first inequality follows from the β-smoothness of f(X), the second one follows for the optimal choice of vt, and the third one follows from convexity of f(X). Unfortunately, while we expect the error ht to rapidly converge to zero, the term kvtv> F in Eq. (3), in principal, might remain as large as the diameter of Sd, which, given a proper choice of step-size t, results in the well-known convergence rate of O(β/t) [12, 10]. This consequence holds also in case f(X) is not only smooth, but also strongly-convex. However, in case f is strongly convex, a non-trivial modification of Algorithm 1 can lead to a much faster convergence rate. In this case, it follows from Eq. (1), that on any iteration t, k Xt X k2 F 2 ht. Thus, if we consider replacing the choice of Xt+1 in Algorithm 1 with the following update rule: V2Sd V rf(Xt) + tβ F , Xt+1 Xt + t(Vt Xt), (4) then, following basically the same steps as in Eq. (3), we will have that ht+1 ht + t(X Xt) rf(Xt) + 2 and thus by a proper choice of t, a linear convergence rate will be attained. Of course the issue now, is that computing Vt is no longer a computationally-cheap leading eigenvalue problem (in particular Vt is not rank-one), but requires a full eigen-decomposition of Xt, which is much more expensive. In fact, the update rule in Eq. (4) is nothing more than the projected gradient decent method. 3.2 A new hybrid approach: rank one-regularized conditional gradient algorithm At the heart of our new method is the combination of ideas from both of the above approaches: on one hand, solving a certain regularized linear problem in order to avoid the shortcomings of the CG method, i.e., slow convergence rate, and on the other hand, maintaining the simple structure of a leading eigenvalue computation that avoids the shortcoming of the computationally-expensive projected-gradient method. Towards this end, suppose that we have an explicit decomposition of the current iterate Xt = Pk i , where (a1, a2, ..., ak) is a probability distribution over [k], and each xi is a unit vector. Note in particular that the standard CG method (Algorithm 1) naturally produces such an explicit decomposition of Xt (provided X1 is chosen to be rank-one). Consider now the update rule in Eq. (4), but with the additional restriction that Vt is rank one, i.e, Vt arg min V2Sd, rank(V)=1V rf(Xt) + tβ F . Note that in this case it follows that Vt is a unit trace rank-one matrix which corresponds to the leading eigenvector of the matrix rf(Xt) + tβXt. However, when Vt is rank-one, the regularization k Vt Xtk2 F makes little sense in general, since unless X is rank-one, we do not expect Xt to be such (note however, that if X is rank one, this modification will already result in a linear convergence rate). However, we can think of solving a set of decoupled component-wise regularized problems: 8i 2 [k] : v(i) kvk=1 v>rf(Xt)v + tβ 2 kvv> xix> rf(Xt) + tβxix> where the equivalence in the first line follows since kvv>k F = 1, and thus the minimizer of the LHS is w.l.o.g. a leading eigenvector of the matrix on the RHS. Following the lines of Eq. (3), we will now have that ht+1 ht + t i ) rf(Xt) + 2 i ) rf(Xt) + 2 = ht + t Ei (a1,...,ak) i ) rf(Xt) + tβ where the second inequality follows from convexity of the squared Frobenius norm, and the last equality follows since (a1, ..., ak) is a probability distribution over [k]. While the approach in Eq. (6) relies only on leading eigenvector computations, the benefit in terms of potential convergence rates is not trivial, since it is not immediate that we can get non-trivial bounds for the individual distances kv(i) i k F . Indeed, the main novelty in our analysis is dedicated precisely to this issue. A motivation, if any, is that there might exists a decomposition of X as X = Pk i=1 bix (i)x (i)>, which is close in some sense to the decomposition of Xt. We can then think of the regularized problem in Eq. (6), as an attempt to push each individual component x(i) towards its corresponding component in the decomposition of X , and as an overall result, bring the following iterate Xt+1 closer to X . Note that Eq. (7) implicitly describes a randomized algorithm in which, instead of solving a regularized EV problem for each rank-one matrix in the decomposition of Xt, which is expensive as this decomposition grows large with the number of iterations, we pick a single rank-one component according to its weight in the decomposition, and only update it. This directly brings us to our proposed algorithm, Algorithm 2, which is given below. Algorithm 2 Randomized Rank one-regularized Conditional Gradient 1: input: sequence of step-sizes { t}t 1, sequence of error tolerances { t}t 0 2: let x0 be an arbitrary unit vector 3: X1 x1x> 1 such that x1 EV 0( rf(x0x> 0 )) 4: for t = 1... do 5: suppose Xt is given by Xt = Pk i , where each xi is a unit vector, and (a1, a2, ..., ak) is a probability distribution over [k], for some integer k 6: pick it 2 [k] according to the probability distribution (a1, a2, ...ak) 7: set a new step-size t as follows: t/2 if ait t ait else rf(Xt) + tβxitx> 9: Xt+1 Xt + t(vtv> it) 10: end for We have the following guarantee for Algorithm 2 which is the main result of this paper. Theorem 1. [Main Theorem] Consider the sequence of step-sizes { t}t 1 defined by t = 18/(t + 8), and suppose that 0 = β and for any iteration t 1 it holds that t = rank(X ) 1/4t β p λmin(X )t . Then, all iterates are feasible, and 8t 1 : E [f(Xt) f(X )] = O rank(X ) 1/4t β p λmin(X )t It is important to note that the step-size choice in Theorem 1 does not require any knowledge on the parameters , β, rank(X ), and λmin(X ). The knowledge of β is required however for the EV computations. While it follows from Theorem 1 that the knowledge of , rank(X ), λmin(X ) is needed to set the accuracy parameters - t, in practice, iterative eigenvector methods are very efficient and are much less sensitive to exact knowledge of parameters than the choice of step-size for instance. While the eigenvalue problem in Algorithm 2 is different from the one in Algorithm 1, due to the additional term in xitx> it, the efficiency of solving both problems is essentially the same since efficient EV procedures are based on iteratively multiplying the input matrix with a vector. In particular, multiplying a vector with a rank-one matrix takes O(d) time. Thus, as long as nnz(rf(Xt)) = (d), which is highly reasonable, both EV computations run in essentially the same time. Finally, note also that aside from the computation of the gradient direction and the leading eigenvector computation, all other operations on any iteration t, can be carried out in O(d2 + t) additional time. The complete proof of Theorem 1 and all supporting lemmas are given in full detail in the appendix. Here we only detail the two main ingredients in the analysis of Algorithm 2. Throughout this section, given a matrix Y 2 Sd, we let PY, 2 Sd denote the projection matrix onto all eigenvectors of Y that correspond to eigenvalues of magnitude at least . Similarly, we let P? Y, denote the projection matrix onto the eigenvectors of Y that correspond to eigenvalues of magnitude smaller than (including eigenvectors that correspond to zero-valued eigenvalues). 4.1 A new decomposition for positive semidefinite matrices with locality properties The analysis of Algorithm 2 relies heavily on a new decomposition idea of matrices in Sd that suggests that given a matrix X in the form of a convex combination of rank-one matrices: X = Pk i , and another matrix Y 2 Sd, roughly speaking, we can decompose Y as the sum of rank-one matrices, such that the components in the decomposition of Y are close to those in the decomposition of X in terms of the overall distance k X Yk F . This decomposition and corresponding property justifies the idea of solving rank-one regularized problems, as suggested in Eq. (6), and applied in Algorithm 2. Lemma 1. Let X, Y 2 Sd such that X is given as X = Pk i , where each xi is a unit vector, and (a1, ..., ak) is a distribution over [k], and let , γ 2 [0, 1] be scalars that satisfy γ 1 γ k X Yk F . Then, Y can be written as Y = Pk j=1(aj bj)W, such that 1. each yi is a unit vector and W 2 Sd 2. 8i 2 [k] : 0 bi ai and Pk Y, k F + k X Yk F i=1 bikxix> Y, k F + k X Yk F 4.2 Bounding the per-iteration improvement The main step in the proof of Theorem 1, is understanding the per-iteration improvement, as captured in Eq. (7), achievable by applying the update rule in Eq. (6), which updates on each iteration all of the rank-one components in the decomposition of the current iterate. Lemma 2. [full deterministic update] Fix a scalar > 0. Let X 2 Sd such that X = Pk i , where each xi is a unit vector, and (a1, ..., ak) is a probability distribution over [k]. For any i 2 [k], let vi := EV rf(X) + βxix> . Then, it holds that i ) rf(X) + β (f(X) f(X )) + β min{1, 5 f(X) f(X ), 3 2 p λmin(X ) f(X) f(X )}. proof sketch. The proof is divided to three parts, each corresponding to a different term in the min expression in the bound in the Lemma. The first bound, at a high-level, follows from the standard conditional gradient analysis (see Eq. (3)). We continue to derive the second and third bounds. From Lemma 1 we know we can write X in the following way: where for all i 2 [k], b i 2 [0, ai] and y i is a unit vector, and W 2 Sd. Using nothing more than Eq. (8), the optimality of vi for each i 2 [k], and the bounds in Lemma 1, it can be shown that i ) rf(X) + β (X X) rf(X) + β (X X) rf(X) + β X , k F + k X X k F Now we can optimize the above bound in terms of , γ. One option is to upper bound k X P? rank(X ) , which together with the choice 1 = 2rank(X ) , γ1 = 2rank(X )k X X k F , give us: RHS of (9) (X X) rf(X) + 5 β rank(X )k X X k F . (10) Another option, is to choose 2 = λmin(X ), γ2 = k X X k F λmin(X ) which gives us k X P? X , k F = 0. This results in the bound: RHS of (9) (X X) rf(X) + 3 βk X X k F λmin(X ) . (11) Now, using the convexity of f to upper bound (X X) rf(X) (f(X) f(X )) and Eq. (1) in both Eq. (10) and (11), gives the second the third parts of the bound in the lemma. 5 Preliminary Empirical Evaluation We evaluate our method, along with other conditional gradient variants, on the task of matrix completion [13]. Setting The underlying optimization problem for the matrix completion task is the following: min Z2N Bd1,d2( ){f(Z) := 1 (Z Eil,jl rl)2}, (12) where Ei,j is the indicator matrix for the entry (i, j) in Rd1 d2, {(il, jl, rl)}n l=1 [d1] [d2] R, and NBd1,d2( ) denotes the nuclear-norm ball of radius in Rd1 d2, i.e., NBd1,d2( ) := {Z 2 Rd1 d2 | k Zk := min{d1,d2} X where we let σ(Z) denote the vector of singular values of Z. . That is, our goal is to find a matrix with bounded nuclear norm (which serves as a convex surrogate for bounded rank) which matches best the partial observations given by {(il, jl, rl)}n In order to transform Problem (12) to optimization over the spectrahedron, we use the reduction specified in full detail in [13], and also described in Section A in the appendix. The objective function in Eq. (12) is known to have a smoothness parameter β with respect to k k F , which satisfies β = O(1), see for instance [13]. While the objective function in Eq. (12) is not strongly convex, it is known that under certain conditions, the matrix completion problem exhibit properties very similar to strong convexity, in the sense of Eq. (1) (which is indeed the only consequence of strong convexity that we use in our analysis) [18]. 60 80 100 120 140 160 180 200 220 #iterations CG Away-CG ROR-CG 60 80 100 120 140 160 180 200 220 240 #iterations CG Away-CG ROR-CG Figure 1: Comparison between conditional gradient variants for solving the matrix completion problem on the MOVIELENS100K (left) and MOVIELENS1M (right) datasets. Two modifications of Algorithm 2 We implemented our rank one-regularized conditional gradient variant, Algorithm 2 (denoted ROR-CG in our figures) with two modifications. First, on each iteration t, instead of picking an index it of a rank-one matrix in the decomposition of the current iterate at random according to the distribution (a1, a2, ..., ak), we choose it in a greedy way, i.e., we choose the rank-one component that has the largest product with the current gradient direction. While this approach is computationally more expensive, it could be easily parallelized since all dot-product computations are independent of each other. Second, after computing the eigenvector vt using the step-size t = 1/t (which is very close to that prescribed in Theorem 1), we apply a line-search, as detailed in [13], in order to the determine the optimal step-size given the direction vtv> Baselines As baselines for comparison we used the standard conditional gradient method with exact line-search for setting the step-size (denoted CG in our figures)[13], and the conditional gradient with away-steps variant, recently studied in [15] (denoted Away-CG in our figures). While the away-steps variant was studied in the context of optimization over polyhedral sets, and its formal improved guarantees apply only in that setting, the concept of away-steps still makes sense for any convex feasible set. This variant also allows the incorporation of an exact line-search procedure to choose the optimal step-size. Datasets We have experimented with two well known datasets for the matrix completion task: the MOVIELENS100K dataset for which d1 = 943, d2 = 1682, n = 105, and the MOVIELENS1M dataset for which d1 = 6040, d2 = 3952, n 106. The MOVIELENS1M dataset was further sub-sampled to contain roughly half of the observations. We have set the parameter in Problem (12) to = 10000 for the ML100K dataset, and = 35000 for the ML1M dataset. Figure 1 presents the objective (12) vs. the number of iterations executed. Each graph is the average over 5 independent experiments 2. It can be seen that our approach indeed improves significantly over the baselines in terms of convergence rate, for the setting under consideration. [1] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717 772, 2009. [2] Miroslav Dudík, Zaïd Harchaoui, and Jérôme Malick. Lifted coordinate descent for learning with trace- norm regularization. Journal of Machine Learning Research - Proceedings Track, 22:327 336, 2012. [3] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:149 154, 1956. [4] Robert M Freund, Paul Grigas, and Rahul Mazumder. An extended frank-wolfe method with" in-face" directions, and its application to low-rank matrix completion. ar Xiv preprint ar Xiv:1511.02204, 2015. 2We ran several experiments since the leading eigenvector computation in each one of the CG variants is randomized. [5] Dan Garber and Elad Hazan. Fast and simple pca via convex optimization. ar Xiv preprint ar Xiv:1509.05647, [6] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning,ICML, pages 541 549, 2015. [7] Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493 1528, 2016. [8] Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, and Aaron Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, pages 2626 2634, 2016. [9] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211 2268, 2011. [10] Elad Hazan. Sparse approximate solutions to semidefinite programs. In 8th Latin American Theoretical Informatics Symposium, LATIN, 2008. [11] Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012. [12] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013. [13] Martin Jaggi and Marek Sulovský. A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning, ICML, 2010. [14] J. Kuczy nski and H. Wo zniakowski. Estimating the largest eigenvalues by the power and lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl., 13:1094 1122, October 1992. [15] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496 504, 2015. [16] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27 72, 2004. [17] Sören Laue. A hybrid algorithm for convex semidefinite optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012. [18] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1348 1356. 2009. [19] Shai Shalev-Shwartz, Alon Gonen, and Ohad Shamir. Large-scale convex minimization with a low-rank constraint. In Proceedings of the 28th International Conference on Machine Learning, ICML, 2011. [20] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015. [21] Eric P Xing, Andrew Y Ng, Michael I Jordan, and Stuart Russell. Distance metric learning with application to clustering with side-information. Advances in neural information processing systems, 15:505 512, 2003. [22] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. J. Mach. Learn. Res., 13(1):1 26, January 2012. [23] Xinhua Zhang, Dale Schuurmans, and Yao-liang Yu. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems, pages 2906 2914, 2012.