# Projection Free Rank-Drop Steps

Edward Cheung, Yuying Li
Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
{eycheung, yuying}@uwaterloo.ca

The Frank-Wolfe (FW) algorithm has been widely used for solving nuclear norm constrained problems, since it does not require projections. However, FW often yields high-rank intermediate iterates, which can be very expensive in time and space for large problems. To address this issue, we propose a rank-drop method for nuclear norm constrained problems. The goal is to generate descent steps that decrease the rank, maintaining low-rank solutions throughout the algorithm. Moreover, the optimization problems are constrained so that the rank-drop step is also feasible and can be readily incorporated into a projection-free minimization method, e.g., FW. We demonstrate that by incorporating rank-drop steps into the FW algorithm, the rank of the solution is greatly reduced compared to the original FW or its common variants.

1 Introduction

The Frank-Wolfe algorithm has been widely used in many machine learning applications due to its projection-free property, particularly when linear minimization over the domain is easy but projection is difficult. A particularly interesting problem in machine learning is the nuclear norm constrained problem,

$$\min_{X \in \mathbb{R}^{m \times n}} f(X) \quad \text{s.t.} \quad \|X\|_{\mathrm{NN}} \le \delta, \qquad (1)$$

where $\|\cdot\|_{\mathrm{NN}}$ is the nuclear norm, a typical convex relaxation for rank-constrained optimization problems [Fazel et al., 2001]. Common applications of nuclear norm constrained problems are matrix completion, multivariate regression, multi-task learning, and clustering with missing information.

For (1), a projection onto the nuclear norm ball requires a full singular value decomposition (SVD), which can be too expensive to perform at each iteration but is required by methods such as projected gradient descent. Without any projection, the FW iterate is guaranteed to be feasible. For (1), the linear subproblem used by FW only requires computing the singular vector pair corresponding to the largest singular value of the gradient at each iteration [Jaggi, 2013]. This is significantly cheaper than computing the full SVD when the dimension of the matrix is large.

A challenge when using FW for (1) is that the nuclear norm ball is the convex hull of an infinite number of rank-one matrices, referred to as the atoms of the set of feasible points. Since an atom is added at each iteration, the solution is expressed as a convex combination of an arbitrarily large atomic set, which lacks the crucial low-rank property we desire from nuclear norm constrained problems.

While other methods exist for solving instances of (1), e.g., Active Subspace Selection [Hsieh and Olsen, 2014] or Redistributing Nonconvex Regularizers [Yao et al., 2016], implementations of these methods exploit specific properties of $f(\cdot)$. In [Yao et al., 2016], it appears that a specialized solver must be written for each choice of $f(\cdot)$, and in [Hsieh and Olsen, 2014], knowledge of special structure in $\nabla f(\cdot)$ and $\nabla^2 f(\cdot)$ is necessary to use the approach efficiently. In contrast, FW can be readily applied to a general $f(\cdot)$ without special modifications. One of our main goals is to devise a new way to efficiently generate feasible descent steps which also decrease the rank of the current iterate when solving (1), without assuming a specific form for $f(\cdot)$.
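As a concrete illustration of the linear subproblem mentioned above (a minimal Python sketch of our own, not code from the paper; the function name `nuclear_ball_lmo` and the use of scipy are assumptions), the FW linear minimization over the nuclear norm ball only needs the top singular triplet of the gradient:

```python
import numpy as np
from scipy.sparse.linalg import svds

def nuclear_ball_lmo(grad, delta):
    """Linear minimization oracle for the nuclear norm ball of radius delta.

    Solves  argmin_{||S||_NN <= delta} <S, grad>  using only the singular
    vector pair of the largest singular value of the gradient, as noted in
    [Jaggi, 2013]; the minimizer is the rank-one matrix -delta * u1 v1^T.
    """
    u, _, vt = svds(grad, k=1)          # top singular triplet of the gradient
    return -delta * np.outer(u[:, 0], vt[0, :])

# Usage sketch for one FW step with a given step size tau (names illustrative):
#   S = nuclear_ball_lmo(grad_f(X), delta);  X_next = X + tau * (S - X)
```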
We propose a new nonconvex optimization formulation to determine steepest-descent rank-drop steps. By further considering the interior rank-drop step and exterior rank-drop step cases separately, we obtain formulations which lead to efficient computation of feasible rank-drop steps. We establish theoretical properties of the proposed formulations. In addition, we demonstrate computationally that the proposed rank-drop FW method can obtain much lower rank solutions than previous FW alternatives, greatly improving computational efficiency.

1.1 Notation

In this paper, $\sigma_i(A)$ denotes the $i$-th singular value of $A$, with $\sigma_1 \ge \dots \ge \sigma_{\min\{m,n\}}$. For a matrix $A = [a_{ij}] \in \mathbb{R}^{m \times n}$, $\|A\|_{\mathrm{NN}} := \sum_i \sigma_i(A)$ denotes the nuclear norm and $\|A\|_{\mathrm{sp}} := \sigma_1(A)$ denotes the spectral norm. Let $B_{\|\cdot\|}(A, \epsilon) := \{X \in \mathbb{R}^{m \times n} : \|A - X\| \le \epsilon\}$ be the closed ball for a specified norm. We use the term thin SVD to mean the SVD with strictly positive singular values, i.e., if $A = U \Sigma V^\top$ is a thin SVD and $\mathrm{rank}(A) = r$, then $U \in \mathbb{R}^{m \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$, and $V \in \mathbb{R}^{n \times r}$. Finally, $\langle A, B \rangle := \mathrm{tr}(A^\top B)$ denotes the usual trace inner product.

2 The Frank-Wolfe Algorithm and Away Steps

We briefly review FW along with its common variants. Here FW is described for the general problem

$$\min_{x \in S} f(x), \qquad (2)$$

where $S$ is a closed and bounded convex set. The FW method is summarized in Algorithm 1. We denote the FW direction by $d_{\mathrm{fw}} := s_k - x_k$.

Algorithm 1 Frank-Wolfe (FW)
  Let $x_0 \in S$
  for $k = 0, 1, \dots$ do
    $s_k \leftarrow \arg\min_{s \in S} \langle s, \nabla f(x_k) \rangle$
    $d_{\mathrm{fw}} \leftarrow s_k - x_k$
    $x_{k+1} \leftarrow x_k + \tau d_{\mathrm{fw}}$ for some $\tau \in [0, 1]$
  end for

There are many choices of step size $\tau$ which ensure convergence. We assume in this paper that the optimal step size is used, since the step-size choice is not the focus of this work.

Away steps were introduced in [Wolfe, 1970] and analyzed in [Guélat and Marcotte, 1986] to move away from bad atoms and accelerate convergence. The away-step method maintains a set of active atoms, $\mathcal{A}_k$, such that the current iterate $x_k$ admits the atomic decomposition $x_k = \sum_{a_i \in \mathcal{A}_k} \alpha_{a_i} a_i$, where the weights satisfy $\alpha_{a_i} > 0$ and $\sum_{a_i \in \mathcal{A}_k} \alpha_{a_i} \le 1$. The away step moves away from one of the active atoms, which may yield a better descent direction than the current FW direction. The away atom solves the following optimization problem,

$$v_k := \arg\min_{v \in \mathcal{A}_k} \langle \nabla f(x_k), x_k - v \rangle. \qquad (3)$$

To ensure convergence, the away direction $d_{\mathrm{away}} := x_k - v_k$ is only taken if $\langle \nabla f(x_k), d_{\mathrm{away}} \rangle < \langle \nabla f(x_k), d_{\mathrm{fw}} \rangle$. Algorithm 2 summarizes the away-step FW method (AFW) of [Lacoste-Julien and Jaggi, 2015].

Algorithm 2 (Atomic) Away-Steps Frank-Wolfe (AFW)
  Let $x_0 \in S$ and $\mathcal{A}_0 \leftarrow \{x_0\}$. Initialize $\alpha_s \leftarrow 0$ for all $s \in S$.
  for $k = 0, 1, \dots$ do
    $s_k \leftarrow \arg\min_{s \in S} \langle s, \nabla f(x_k) \rangle$,  $d_{\mathrm{fw}} \leftarrow s_k - x_k$
    $v_k \leftarrow \arg\max_{v \in \mathcal{A}_k} \langle v, \nabla f(x_k) \rangle$,  $d_{\mathrm{away}} \leftarrow x_k - v_k$
    if $\langle \nabla f(x_k), d_{\mathrm{fw}} \rangle \le \langle \nabla f(x_k), d_{\mathrm{away}} \rangle$ then
      $d_k \leftarrow d_{\mathrm{fw}}$
      $\tau^* \leftarrow \arg\min_{\tau \in [0,1]} f(x_k + \tau d_k)$
      $\alpha_{a_i} \leftarrow (1 - \tau^*) \alpha_{a_i}$ for all $a_i \in \mathcal{A}_k$
      $\alpha_{s_k} \leftarrow \alpha_{s_k} + \tau^*$,  $\mathcal{A}_k \leftarrow \mathcal{A}_k \cup \{s_k\}$
    else
      $d_k \leftarrow d_{\mathrm{away}}$
      $\tau^* \leftarrow \arg\min_{\tau \in [0,\, \alpha_{v_k}/(1 - \alpha_{v_k})]} f(x_k + \tau d_k)$
      $\alpha_{v_i} \leftarrow (1 + \tau^*) \alpha_{v_i}$ for all $v_i \in \mathcal{A}_k$
      $\alpha_{v_k} \leftarrow \alpha_{v_k} - \tau^*$
    end if
    $x_{k+1} \leftarrow x_k + \tau^* d_k$
    $\mathcal{A}_{k+1} \leftarrow \{a \in \mathcal{A}_k : \alpha_a > 0\}$
  end for

2.1 In-Face Steps

In [Freund et al., 2015], in-face steps are proposed to generalize away steps. Instead of generating descent directions by moving away from bad atoms, the in-face step selects the best descent direction along the minimal face containing the current iterate. By aiming to move along the minimal face to the boundary of the feasible region and reach a lower-dimensional face, the algorithm explicitly maintains a low-rank structure without compromising convergence. For (1), as in [Freund et al., 2015], the minimal face $\mathcal{F}(X_k)$ of $B_{\mathrm{NN}}(0, \delta)$ containing a point $X_k$ is given by

$$\mathcal{F}(X_k) = \begin{cases} B_{\mathrm{NN}}(0, \delta), & \text{if } \|X_k\|_{\mathrm{NN}} < \delta, \\ \{U M V^\top\}, & \text{otherwise}, \end{cases} \qquad (4)$$

where $X_k$ has the thin rank-$r$ SVD $U \Sigma V^\top$ and $M$ ranges over the real positive semidefinite matrices with $\mathrm{tr}(M) = \delta$.

In particular, for comparison, we consider the in-face step from the away-step strategy described in [Freund et al., 2015], which uses the direction determined by

$$Z_k := \arg\max_{Z \in \mathcal{F}(X_k)} \langle \nabla f(X_k), Z \rangle. \qquad (5)$$
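As a small illustration of Algorithm 2 in the matrix setting relevant here (a sketch of our own, not the authors' code; the representation of atoms as `(c, u, v)` triples and the function name `away_atom` are assumptions), the away atom can be selected without ever forming a dense matrix:

```python
import numpy as np

def away_atom(atoms, weights, grad):
    """Select the away atom v_k = argmax_{a in A_k} <a, grad f(x_k)>.

    Each active atom is stored as a triple (c, u, v) representing the
    rank-one matrix c * u v^T, so <a, G> = c * u^T G v.  Atoms with zero
    weight are ignored, matching the pruning step of Algorithm 2.
    """
    scores = [c * (u @ grad @ v) if w > 0 else -np.inf
              for (c, u, v), w in zip(atoms, weights)]
    k = int(np.argmax(scores))
    return k, atoms[k]

# d_away = x_k - (atom k); the away step is taken only when
# <grad, d_away> < <grad, d_fw>, with maximum step alpha_k / (1 - alpha_k).
```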
3 Rank-Drop Steps

One significant issue with the FW method for nuclear norm constrained problems is that the rank of the intermediate solution often increases steadily [Freund et al., 2015], frequently yielding a high-rank solution upon termination. Figure 1 highlights this phenomenon. For very large problems, we cannot store the matrix $X_k$ and instead only maintain the rank-one factors. Thus, computing the gradient at each iteration can be computationally prohibitive. For matrix completion, as an example, if $\Omega$ is the set of observed entries, then forming the gradient requires at least $|\Omega| \cdot r$ inner product calculations. Since the number of observed entries, $|\Omega|$, is typically very large, any increase in the rank of the solution greatly increases the computational time of each iteration.

[Figure 1 (plot omitted): rank of the iterate versus iteration, 0 to 1000, for Frank-Wolfe, Away-Step, and Rank-Drop.] Figure 1: The Frank-Wolfe and Away-Step methods reach very high rank solutions at convergence for the MovieLens 100k dataset with $\delta = 1$. The rank of the iterate becomes very large before it begins to decrease very slowly. In contrast, the proposed rank-drop FW method maintains very low-rank intermediate solutions throughout the entire computation. The objective values at termination are 1905.4, 1905.4, and 1905.6 for FW, Away-Step, and Rank-Drop, respectively.

To address this issue, we propose to search over the set of rank-one matrices which decrease the rank of the iterate by one. The proposed optimization formulation for determining such a rank-one matrix is motivated by the following theorem.

Theorem 3.1 ([Egerváry, 1960]). Let $u \in \mathbb{R}^m$, $v \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, and $B = A - \sigma^{-1} u v^\top$. Then $\mathrm{rank}(B) = \mathrm{rank}(A) - 1$ if and only if there exist vectors $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ such that $u = Ax$, $v = A^\top y$, and $\sigma = y^\top A x \neq 0$.

3.1 The Rank-Drop Optimization Framework

Assume that $\mathrm{rank}(X_k) > 1$. Our goal is to determine a rank-drop step which reduces the rank of $X_k$ but has a comparable objective value. We restrict our attention to the rank-one matrices that can decrease the rank of a matrix; for a matrix $A$, let $\mathcal{R}(A)$ denote the set of rank-drop steps, which are these rank-one matrices,

$$\mathcal{R}(A) := \{\sigma^{-1} u v^\top : \exists\, x \in \mathbb{R}^n,\ y \in \mathbb{R}^m \ \text{s.t.}\ u = Ax,\ v = A^\top y,\ \sigma = y^\top A x \neq 0\}. \qquad (6)$$

Since $u$ and $v$ must lie in the column and row spans of $A$ respectively, Lemma 3.2 shows that the set of rank-drop steps can be expressed in a more concise form.

Lemma 3.2. Let $A \in \mathbb{R}^{m \times n}$ have a thin SVD $A = U \Sigma V^\top$. Then the set of rank-drop steps can be rewritten as

$$\mathcal{R}(A) = \left\{ \frac{U s t^\top V^\top}{s^\top \Sigma^{-1} t} : s, t \in \mathbb{R}^r,\ s^\top \Sigma^{-1} t > 0 \right\}. \qquad (7)$$

Let $Z_k \in \mathcal{R}(X_k)$ and $\hat{Z}_k = Z_k / \|Z_k\|_{\mathrm{NN}}$. Then, similar to away steps, we consider iterates of the following form,

$$X_{k+1} := X_k + \tau D_{\mathrm{rd}}, \quad \text{where } D_{\mathrm{rd}} := X_k - \delta \hat{Z}_k. \qquad (8)$$

Our goal is to find a rank-drop step that results in a feasible iterate and best minimizes the objective.
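To make Theorem 3.1 and Lemma 3.2 concrete, the following small numerical check (our own illustration, not from the paper; `rank_drop_step` and the random test matrix are assumptions) builds a member of $\mathcal{R}(A)$ from a thin SVD and verifies that subtracting it drops the rank by one:

```python
import numpy as np

def rank_drop_step(U, sigma, Vt, s, t):
    """Rank-one matrix Z in R(A) from Lemma 3.2, for A = U diag(sigma) Vt.

    Z = (U s t^T V^T) / (s^T Sigma^{-1} t); by Theorem 3.1, A - Z has rank
    r - 1 whenever s^T Sigma^{-1} t is nonzero (here we require it positive).
    """
    denom = s @ np.diag(1.0 / sigma) @ t
    assert denom > 0, "require s^T Sigma^{-1} t > 0"
    return np.outer(U @ s, Vt.T @ t) / denom

# Check the rank-drop property on a random rank-3 matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))   # rank 3
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
U, sigma, Vt = U[:, :3], sigma[:3], Vt[:3, :]                   # thin SVD
s, t = np.abs(rng.standard_normal(3)), np.abs(rng.standard_normal(3))
Z = rank_drop_step(U, sigma, Vt, s, t)
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(A - Z))   # expect 3, then 2
```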
Since not all rank-drop steps lead to feasible iterates, and testing feasibility at each iteration is computationally expensive, we want to establish an easily verifiable sufficient condition for rank-drop steps that ensures feasibility. In Lemma 3.3, we first establish a lower bound on the nuclear norm of a rank-drop step.

Lemma 3.3. Let $A \in \mathbb{R}^{m \times n}$ have rank $r$ and let $Z \in \mathcal{R}(A)$ be an arbitrary rank-drop step. Then $\|Z\|_{\mathrm{NN}} \ge \sigma_r(A)$.

To ensure feasibility of the iterate after a rank-drop step, we consider two distinct cases. Let $\kappa(X_k)$ be half of the distance between $X_k$ and the boundary of the nuclear norm ball,

$$\kappa(X_k) := \frac{\delta - \|X_k\|_{\mathrm{NN}}}{2}. \qquad (9)$$

We consider $\kappa(X_k) \ge \sigma_r(X_k)$, the interior case, and $\kappa(X_k) < \sigma_r(X_k)$, the exterior case. For each case, we motivate a rank-drop step formulation that guarantees feasibility of the next iterate and provide a method for its computation. (Due to space limitations, proofs are omitted; they are provided in an appendix at https://arxiv.org/abs/1704.04285.)

3.2 The Interior Rank-Drop Problem

Assume that $\kappa(X_k) \ge \sigma_r(X_k)$, where $\kappa(X_k)$ is defined in (9). Theorem 3.4 below establishes an easily verifiable condition guaranteeing that $X_{k+1}$, after taking the rank-drop step, remains in the interior of the nuclear norm ball.

Theorem 3.4. Let $X_k \in \mathbb{R}^{m \times n}$ with $\|X_k\|_{\mathrm{NN}} < \delta$, $Z_k \in \mathcal{R}(X_k)$, and $\hat{Z}_k = Z_k / \|Z_k\|_{\mathrm{NN}}$. If $\|Z_k\|_{\mathrm{NN}} \le (\delta - \|X_k\|_{\mathrm{NN}})/2$, then $X_{k+1} = X_k + \tau (X_k - \delta \hat{Z}_k) \in B_{\mathrm{NN}}(0, \delta)$ for every $\tau \in [0, \tau^*]$, where $\tau^* = \|Z_k\|_{\mathrm{NN}} / (\delta - \|Z_k\|_{\mathrm{NN}})$. Moreover, if $\tau = \tau^*$, then $\mathrm{rank}(X_{k+1}) = \mathrm{rank}(X_k) - 1$.

Theorem 3.4 motivates the following formulation,

$$\min_{Z} \ \langle Z, \nabla f(X_k) \rangle \quad \text{s.t.} \quad Z \in B_{\mathrm{NN}}(0, \kappa(X_k)) \cap \mathcal{R}(X_k). \qquad (10)$$

Problem (10) determines, in the interior case, the best rank-drop step for solving (1) based on the first-order approximation of the objective. The constraint on $Z$ guarantees that the next iterate is feasible if a rank-drop step is taken.

Theorem 3.5. If $\mathrm{rank}(X_k) \ge 1$ with $\sigma_r(X_k) \le \kappa(X_k)$, then the feasible region of (10) is non-empty.

Assume that a thin SVD of $X_k$, $X_k = U \Sigma V^\top$, is given. Using Lemma 3.2, the constraint in (10) can be made explicit,

$$\min_{s,t \in \mathbb{R}^r} \ \left\langle \frac{U s t^\top V^\top}{s^\top \Sigma^{-1} t},\ \nabla f(X_k) \right\rangle \quad \text{s.t.} \quad \frac{U s t^\top V^\top}{s^\top \Sigma^{-1} t} \in B_{\mathrm{NN}}(0, \kappa(X_k)), \quad s^\top \Sigma^{-1} t > 0. \qquad (11)$$

To make (11) more amenable to computation, we remove the fraction using the normalization constraint $s^\top \Sigma^{-1} t = \kappa(X_k)^{-1}$ and formulate the problem equivalently as follows,

$$\min_{s,t \in \mathbb{R}^r} \ q(s, t) := \langle \nabla f(X_k),\ \kappa(X_k)\, U s t^\top V^\top \rangle \quad \text{s.t.} \quad s^\top \Sigma^{-1} t = \kappa(X_k)^{-1}, \quad \|s\|_2 = 1, \quad \|t\|_2 \le 1. \qquad (12)$$

Note that the constraints in (12) also ensure that $s$ and $t$ cannot be rescaled to obtain a different solution yielding an identical rank-drop step. The equivalence of (10) and (12) is formally established in Theorem 3.6.

Theorem 3.6. If $X_k$ is a feasible point of (1) and $\kappa(X_k) \ge \sigma_r(X_k)$, then an optimal solution to (12) is an optimal solution to (11). Moreover, an optimal solution to (11) can always be rescaled into an optimal solution to (12).
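Before turning to the exterior case, here is a minimal sketch of the interior-case bookkeeping from Theorem 3.4 (our own illustration with dense numpy matrices and hypothetical names, not the authors' implementation): compute $\kappa(X_k)$, verify $\|Z_k\|_{\mathrm{NN}} \le \kappa(X_k)$, and take the step with $\tau^* = \|Z_k\|_{\mathrm{NN}}/(\delta - \|Z_k\|_{\mathrm{NN}})$.

```python
import numpy as np

def interior_rank_drop_update(X, Z, delta):
    """One interior-case rank-drop update following Theorem 3.4.

    Z is assumed to be a rank-drop step (Z in R(X)), hence rank one, so its
    nuclear norm equals its single nonzero singular value.  If the sufficient
    condition ||Z||_NN <= kappa(X) = (delta - ||X||_NN)/2 holds, then
    X + tau*(X - delta * Z/||Z||_NN) stays feasible for tau in [0, tau_star],
    and tau = tau_star drops the rank by one.
    """
    nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()   # nuclear norm
    kappa = (delta - nuc(X)) / 2.0
    z_nn = nuc(Z)
    if z_nn > kappa:
        return None                       # condition violated; use another step type
    tau_star = z_nn / (delta - z_nn)
    return X + tau_star * (X - delta * Z / z_nn)
```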
3.3 The Exterior Rank-Drop Problem

Now consider the case $\kappa(X_k) < \sigma_r(X_k)$. If $\kappa(X_k)$ is relatively large and $X_k$ is not nearly rank deficient, a rank drop is less important. If $\kappa(X_k)$ is small, however, then $X_k$ is close to the boundary of the nuclear norm ball. In this case, we are interested in rank-drop descent directions that move inside the nuclear norm ball. We establish the following theorems to facilitate formulating appropriate optimization problems for this case. Subsequently, $\partial g(X)$ denotes the subdifferential of $g$ at $X$.

Theorem 3.7. Let $X_k \in B_{\mathrm{NN}}(0, \delta)$ have the thin SVD $X_k = U \Sigma V^\top$. Define $D_k = X_k - \delta \hat{Z}_k$ with $Z_k \in \mathcal{R}(X_k)$ and $\hat{Z}_k = Z_k / \|Z_k\|_{\mathrm{NN}} = U s t^\top V^\top$, for some $s$ and $t$ with $\|s\|_2 = \|t\|_2 = 1$. Then,

$$\max_{G \in \partial \|X_k\|_{\mathrm{NN}}} \langle D_k, G \rangle \le 0 \qquad (13)$$

if and only if $\delta s^\top t \ge \|X_k\|_{\mathrm{NN}}$.

Following Theorem 3.7, $D_k$ is a descent direction for the nuclear norm at $X_k$ if $\delta s^\top t \ge \|X_k\|_{\mathrm{NN}}$. When $X_k$ is on the boundary of the nuclear norm ball, Corollary 3.7.1 below gives a simpler and useful characterization.

Corollary 3.7.1. Let $X_k$ have the thin SVD $X_k = U \Sigma V^\top$ with $\|X_k\|_{\mathrm{NN}} = \delta$. Define $D_k = X_k - \delta \hat{Z}_k$ with $Z_k \in \mathcal{R}(X_k)$ and $\hat{Z}_k = Z_k / \|Z_k\|_{\mathrm{NN}} = U s t^\top V^\top$, for some $s$ and $t$ with $\|s\|_2 = \|t\|_2 = 1$. Then,

$$\max_{G \in \partial \|X_k\|_{\mathrm{NN}}} \langle D_k, G \rangle \le 0 \qquad (14)$$

if and only if $s = t$.

When $X_k$ is on the boundary of the nuclear norm ball, $\kappa(X_k) = 0$. Lemma 3.8 below shows that $X_k$ is close to the boundary when either $r$ is large or $\sigma_r(X_k)$ is small, assuming $\kappa(X_k) < \sigma_r(X_k)$.

Lemma 3.8. If $\kappa(X_k) < \sigma_r(X_k)$, then $\|X_k\|_{\mathrm{NN}} > \frac{r}{r+2}\, \delta$.

For the exterior case, $\kappa(X_k) < \sigma_r(X_k)$, we are mostly interested in the situation when $X_k$ is close to the boundary of the nuclear norm ball. Consequently, we motivate the next formulation assuming $\|X_k\|_{\mathrm{NN}} = \delta$ as a good approximation. When $X_k$ is on the boundary of the nuclear norm ball, we consider all directions which move into the nuclear norm ball, hence feasible directions, as candidate solutions. From Corollary 3.7.1, this can be reduced to searching over directions $D_k = X_k - \delta U s s^\top V^\top$. Let $W = U^\top \nabla f(X_k) V$, and note that $\frac{1}{2} s^\top (W + W^\top) s = \langle \nabla f(X_k), U s s^\top V^\top \rangle$. Hence, in the exterior case, we solve the following optimization problem,

$$\max_{s \in \mathbb{R}^r} \ \frac{\tfrac{1}{2}\, s^\top (W + W^\top)\, s}{s^\top \Sigma^{-1} s} \quad \text{s.t.} \quad \|s\|_2 = 1. \qquad (15)$$

We note that (15) is a generalized eigenvalue problem, which can be further reduced to a standard eigenvalue problem since $\Sigma^{-1}$ is a nonsingular diagonal matrix. Theorem 3.9 below establishes that, using (15), the step size required to decrease the rank of the current iterate also guarantees that the next iterate is feasible.

Theorem 3.9. Let $s$ be an optimal solution to (15) and let $D_k = X_k - \delta U s s^\top V^\top$, where $X_k = U \Sigma V^\top$ is a thin SVD with $\|X_k\|_{\mathrm{NN}} \le \delta$. If $\tau^* = (\delta s^\top \Sigma^{-1} s - 1)^{-1}$ and $X_{k+1} = X_k + \tau^* D_k$, then $\mathrm{rank}(X_{k+1}) = \mathrm{rank}(X_k) - 1$ and $\|X_{k+1}\|_{\mathrm{NN}} \le \delta$.

We observe that $\tau^*$ matches the upper bound on the step size for in-face steps when an away-step method is used and $\|X_k\|_{\mathrm{NN}} = \delta$.

Comparison with In-Face Steps

Lemmas 3.10 and 3.11 below show that the rank-drop steps derived from (12) and (15) lie on the minimal face of $B_{\mathrm{NN}}(0, \delta)$ containing $X_k$, denoted $\mathcal{F}(X_k)$.

Lemma 3.10. Suppose $\|X_k\|_{\mathrm{NN}} < \delta$ and $\kappa(X_k) \ge \sigma_r(X_k)$, with thin SVD $X_k = U \Sigma V^\top$. Then $Z = \kappa(X_k)\, U s t^\top V^\top \in \mathcal{F}(X_k)$, where $(s, t)$ is the solution to (12).

Lemma 3.11. Suppose $\kappa(X_k) < \sigma_r(X_k)$, with thin SVD $X_k = U \Sigma V^\top$. Then $Z = \delta\, U s s^\top V^\top \in \mathcal{F}(X_k)$, where $s$ is the solution to (15).

We emphasize, however, that the rank-drop steps are constructed explicitly to decrease the rank of the current iterate. Specifically, we highlight the following differences between rank-drop steps and the in-face direction (5) suggested in [Freund et al., 2015]. Firstly, when the iterate is in the interior of the nuclear norm ball, in-face steps often do not lead to rank-drop steps. Moreover, at each iteration in which the iterate is in the interior, a binary search is required to determine the maximum feasible step length once the direction is computed. This requires several SVD updates, leading to very expensive intermediate iterations. Thus, it is critical for the iterates of the in-face method to reach the boundary of the nuclear norm ball quickly. Secondly, the parametrization suggested in [Freund et al., 2015] for large datasets (i.e., $\gamma_1 = 0$ and $\gamma_2 = \infty$) corresponds to only taking the maximum step length. On the boundary of the nuclear norm ball, this is equivalent to only accepting iterates whose rank has decreased. Since our proposed rank-drop step is determined to optimally decrease the objective (up to first order) among all rank-drop steps explicitly, we believe this is a better optimization formulation when the iterate approaches the boundary.
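Problem (15) is small (its dimension is the current rank $r$) and, as noted above, is a generalized symmetric eigenvalue problem. The sketch below is our own illustration of the exterior rank-drop direction and the step size of Theorem 3.9; the use of scipy's generalized `eigh` solver and the function name are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def exterior_rank_drop(U, sigma, Vt, grad, delta):
    """Exterior-case rank-drop direction via the generalized eigenproblem (15).

    Maximizing  (1/2) s^T (W + W^T) s / (s^T Sigma^{-1} s)  with ||s||_2 = 1
    is solved by the top eigenvector of the pencil (A, B), where
    A = (W + W^T)/2 and B = Sigma^{-1} is positive definite, because the
    ratio is invariant to rescaling s.  Returns s and the step size tau*
    from Theorem 3.9.
    """
    W = U.T @ grad @ Vt.T                    # W = U^T grad V, an r x r matrix
    A = 0.5 * (W + W.T)
    B = np.diag(1.0 / sigma)                 # Sigma^{-1}
    _, vecs = eigh(A, B)                     # generalized eigenvalues, ascending
    s = vecs[:, -1]
    s = s / np.linalg.norm(s)                # enforce ||s||_2 = 1
    tau = 1.0 / (delta * (s @ B @ s) - 1.0)  # tau* = (delta s^T Sigma^{-1} s - 1)^{-1}
    return s, tau

# Next iterate: X + tau * (X - delta * np.outer(U @ s, Vt.T @ s)).
```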
Solving the Optimization Problem in the Interior Case

We now discuss how to solve the rank-drop optimization problem (12) in the interior case. Let $W := U^\top \nabla f(X_k) V$. The Lagrangian for (12) is

$$L(s, t, \lambda, \alpha, \beta) := s^\top W t + \lambda \left(s^\top \Sigma^{-1} t - \kappa(X_k)^{-1}\right) + \alpha (s^\top s - 1) + \beta (t^\top t - 1).$$

Theorem 3.12. Suppose $(s, t, \lambda, \alpha, \beta)$ satisfies the KKT conditions of (12) with $\|t\|_2 < 1$. Then $\alpha = \beta = 0$, $\lambda$ is an eigenvalue of $-\Sigma W$, and $M_\lambda := \frac{1}{2}(W + \lambda \Sigma^{-1})$ has zero as a singular value. Conversely, assume that $(\hat{s}, \hat{t})$ forms a singular vector pair of $M_\lambda$ associated with the singular value zero and $\kappa(X_k)\, \hat{s}^\top \Sigma^{-1} \hat{t} > 1$; then the KKT conditions of (12) are satisfied at $(\hat{s},\ \hat{t}/(\kappa(X_k)\, \hat{s}^\top \Sigma^{-1} \hat{t}),\ \lambda,\ 0,\ 0)$.

Since problem (12) is nonconvex and generally difficult to solve, we only consider candidate KKT points characterized in Theorem 3.12. For each eigenvalue $\lambda$ of $-\Sigma W$, we fix $\alpha = \beta = 0$ and choose $(\hat{s}, \hat{t})$ corresponding to the singular vector pair of $\frac{1}{2}(W + \lambda \Sigma^{-1})$ with singular value zero. If $(s, t) = (\hat{s},\ \hat{t}/(\kappa(X_k)\, \hat{s}^\top \Sigma^{-1} \hat{t}))$ is feasible with respect to (12) and $\|t\|_2 < 1$, then by Theorem 3.12, $(s, t, \lambda, 0, 0)$ satisfies the KKT conditions. In the event that no feasible candidate is found, we solve (15) instead and obtain a feasible rank-drop step, as guaranteed by Theorem 3.9. We remark that, although it is always possible to find feasible rank-drop steps using (15), in the interior case it is still preferable to solve problem (12), since the assumption that $s = t$ can be restrictive and unnecessary far away from the boundary. Algorithm 3 presents the details of the rank-drop direction computation.

3.4 Step Selection Criteria

The criteria in [Freund et al., 2015] for away steps require solving the regular FW linear subproblem at each iteration, even when the FW step is not taken. In the rank-drop framework, we accept any rank-drop step that does not increase the objective. This allows the algorithm to maintain a lower rank SVD and to skip computing an FW step unnecessarily. Additionally, we observe that it is very rare for the algorithm to choose a rank-drop step two iterations in a row. Thus, we only generate the rank-drop direction when the previous step is a regular FW step, avoiding unnecessary rank-drop step computations.

4 Convergence Analysis

Following the proof of Theorem 4 in [Guélat and Marcotte, 1986], we show that the iterates generated by Rank-Drop FW in Algorithm 4 converge to the global optimum of (1).

Theorem 4.1. Let $\{X_k\}$ be a sequence generated by Algorithm 4 and let $f^*$ be the optimal value of problem (1). Assume $\nabla f(X)$ is Lipschitz continuous with constant $L$ on the feasible region. Then

$$f(X_k) - f^* \le \frac{8 \delta^2 L}{4 + N^k_{\mathrm{fw}}},$$

where $N^k_{\mathrm{fw}}$ is the number of FW steps taken up to iteration $k$.

5 Complexity Per Iteration

When computing the rank-drop steps, we note that the dimension of the subproblems is $r$, the rank of the current iterate. First, forming the matrix $W := U^\top \nabla f(X_k) V$ requires $O(r^2 h \min\{m, n\})$ operations, where $h$ is the maximum number of nonzero elements in any row or column. In the interior case, we must compute an eigen-decomposition of an $r \times r$ matrix, which takes $O(r^3)$ time. Then, each eigenvalue $\lambda$ is used to form the matrix $\frac{1}{2}(W + \lambda \Sigma^{-1})$, for which the singular vector pair corresponding to the zero singular value is computed. The total time required for the interior case is $O(r^3 + r^2 h \min\{m, n\})$. In the exterior case, $O(r^2)$ flops are required to compute the largest eigenvalue. Thus, the total complexity per iteration is $O(r^3 + r^2 h \min\{m, n\})$.
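To illustrate the interior-case computation that dominates the $O(r^3)$ term above, here is a minimal Python sketch of the candidate search characterized by Theorem 3.12 (our own illustration of the interior branch of Algorithm 3; the handling of complex eigenvalues, the tolerance, and the function name are assumptions):

```python
import numpy as np

def interior_candidates(W, sigma, kappa):
    """KKT candidates of problem (12), following Theorem 3.12.

    For each (real) eigenvalue lambda of -Sigma W, form M = (W + lambda*Sigma^{-1})/2
    and take the singular pair (s_hat, t_hat) of its smallest (zero) singular
    value.  When kappa * s_hat^T Sigma^{-1} t_hat > 1, the rescaled point
    (s_hat, t_hat / (kappa * s_hat^T Sigma^{-1} t_hat)) is feasible for (12)
    and satisfies the KKT conditions.  Returns the (possibly empty) list.
    """
    Sigma_inv = np.diag(1.0 / sigma)
    eigvals = np.linalg.eigvals(-np.diag(sigma) @ W)
    candidates = []
    for lam in eigvals:
        if abs(lam.imag) > 1e-10:
            continue                              # only real multipliers are usable
        M = 0.5 * (W + lam.real * Sigma_inv)
        Us, svals, Vst = np.linalg.svd(M)
        s_hat, t_hat = Us[:, -1], Vst[-1, :]      # pair of the smallest singular value
        scale = kappa * (s_hat @ Sigma_inv @ t_hat)
        if scale > 1.0:                           # condition from Theorem 3.12
            candidates.append((s_hat, t_hat / scale))
    return candidates
```

Among the returned candidates, the one with the best value of $q(s, t)$ in (12) would be selected, falling back to the exterior problem (15) when the list is empty, mirroring Algorithm 3.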
6 Experimental Results

We validate the rank-drop steps on a matrix completion task using several datasets from MovieLens (http://grouplens.org/datasets/movielens/). We first center and scale each dataset to have mean 0 and standard deviation 1. We compare the proposed Rank-Drop Frank-Wolfe (RDFW) against the aforementioned FW variants, FW [Frank and Wolfe, 1956], AFW [Lacoste-Julien and Jaggi, 2015], and IF(0, ∞) [Freund et al., 2015], as well as a state-of-the-art nuclear norm regularized solver, Active ALT [Hsieh and Olsen, 2014].

Algorithm 3 Compute Rank-Drop Direction (rankDrop)
  Input: thin SVD $X_k := U \Sigma V^\top$ and $\nabla f(X_k)$
  $\kappa(X_k) \leftarrow \frac{\delta - \|X_k\|_{\mathrm{NN}}}{2}$,  $W \leftarrow U^\top \nabla f(X_k) V$
  if $\kappa(X_k) \ge \sigma_r(X_k)$ then (interior case)
    $\Lambda \leftarrow \mathrm{eigs}(-\Sigma W)$
    for $\lambda_i \in \Lambda$ do
      $M_{\lambda} := \frac{1}{2}(W + \lambda_i \Sigma^{-1})$
      $(s, t, \sigma) \leftarrow \mathrm{SVD}(M_{\lambda})$ (return the singular pair for the zero singular value)
      if $\kappa(X_k)\, s^\top \Sigma^{-1} t \ge 1$ and $q\big(s,\ t/(\kappa(X_k)\, s^\top \Sigma^{-1} t)\big)$ is the best value found so far then
        $(s^*, t^*) \leftarrow (s, t)$
      end if
    end for
    $\alpha \leftarrow (s^{*\top} \Sigma^{-1} t^*)^{-1}$,  $\tau^* \leftarrow \frac{\alpha}{\delta - \alpha}$
  end if
  if $\kappa(X_k) < \sigma_r(X_k)$ or no interior candidate was found then (exterior case, or no candidates from the interior case)
    $s^* \leftarrow \mathrm{genEig}\big(\tfrac{1}{2}(W + W^\top),\ \Sigma^{-1}\big)$,  $t^* \leftarrow s^*$
    $\tau^* \leftarrow (\delta\, s^{*\top} \Sigma^{-1} s^* - 1)^{-1}$
  end if
  return $(s^*, t^*, \tau^*)$

Algorithm 4 Rank-Drop Frank-Wolfe (RDFW)
  Let $X_0 \in S$ with initial thin SVD $X_0 = U \Sigma V^\top$, and let $T$ be the maximum iteration count
  prevStep ← RD (so the first iteration takes a Frank-Wolfe step)
  for $k = 0, 1, \dots, T$ do
    Compute $\nabla f(X_k)$
    if prevStep = RD then
      go to (Frank-Wolfe)
    end if
    Compute the rank-drop direction (Algorithm 3): $(s_k, t_k, \tau^*) \leftarrow \mathrm{rankDrop}(U, \Sigma, V, \nabla f(X_k))$
    $\tilde{X} := X_k + \tau^* (X_k - \delta U s_k t_k^\top V^\top)$
    if $f(\tilde{X}) \le f(X_k)$ then
      $X_{k+1} \leftarrow \tilde{X}$,  $Z_k \leftarrow \delta U s_k t_k^\top V^\top$,  prevStep ← RD
    else
      (Frank-Wolfe)
      $Z_k \leftarrow \arg\min_{Z \in B_{\mathrm{NN}}(0, \delta)} \langle Z, \nabla f(X_k) \rangle$
      $\tau^* \leftarrow \arg\min_{\tau \in [0,1]} f(X_k + \tau (Z_k - X_k))$
      $X_{k+1} := X_k + \tau^* (Z_k - X_k)$,  prevStep ← Frank-Wolfe
    end if
    $(U, \Sigma, V) \leftarrow \mathrm{updateSVD}(U, \Sigma, V, \tau^* Z_k)$
  end for

| Dataset | Metric | Frank-Wolfe | Away-Step FW | In-Face(0, ∞) | Active ALT | Rank-Drop FW |
|---|---|---|---|---|---|---|
| ML-100k | RMSE | 0.878 | 0.878 | 0.878 | 0.877 | 0.879 |
| | Rank (Max) | 501.4 (504) | 466.8 (477) | 54.2 (64) | 43.4 (50) | 41.6 (44) |
| | Time (s) | 146.09 | 152.49 | 37.85 | 30.28 | 35.45 |
| ML-1M | RMSE | 0.820 | 0.820 | 0.822 | 0.818 | 0.820 |
| | Rank (Max) | 467.4 (478) | 425.2 (430) | 77.6 (112) | 65 (68) | 67 (69) |
| | Time (s) | 1,127.70 | 1,197.67 | 356.42 | 503.39 | 284.85 |
| ML-10M | RMSE | 0.806 | 0.806 | 0.807 | 0.806 | 0.808 |
| | Rank (Max) | 1000 (1000) | 889.8 (893) | 210 (297) | 141.8 (144) | 140 (140) |
| | Time (s) | 41,467.60 | 44,327.57 | 29,630.43 | 13,281.95 | 12,499.43 |
| ML-20M | RMSE | - | - | 0.800 | 0.800 | 0.801 |
| | Rank (Max) | - | - | 274.6 (471) | 206.2 (214) | 202 (203) |
| | Time (s) | - | - | 117,535.82 | 38,497.16 | 29,102.62 |

Table 1: Computational results on matrix completion problems, averaged over 5 random initializations. The max rank is the maximum rank observed over all 5 trials. For ML-20M, the FW and AFW algorithms took too long to successfully terminate.

| Dataset | # Users | # Movies | # Ratings |
|---|---|---|---|
| MovieLens 100k | 943 | 1,682 | 100,000 |
| MovieLens 1M | 6,040 | 3,900 | 1,000,209 |
| MovieLens 10M | 82,248 | 10,681 | 10,000,054 |
| MovieLens 20M | 138,493 | 27,278 | 20,000,263 |

Table 2: MovieLens data.

Following [Yao et al., 2016], we randomly partition each dataset into 50% training, 25% validation, and 25% testing. The $\delta$ value in (1) is tuned as $\delta = \mu_j \|Y\|_F$, where $\|Y\|_F$ is the Frobenius norm of the training data matrix and $\mu_j = 2 + 0.2j$, $j \in \mathbb{N}$.
We increase $j$ until the mean RMSE on the validation set does not improve by more than $10^{-3}$. We terminate the algorithm when an upper bound on the relative optimality gap ensures $(f(X_k) - f^*)/f^* < 10^{-2}$, or when a maximum iteration count of 1000 is reached. For Active ALT, a regularized (rather than constrained) nuclear norm problem is solved, where the regularization parameter $\lambda$ is chosen by approximately solving for the Lagrange multiplier of the constrained problem at its solution. From the optimality conditions, we have $U^\top \nabla f(X^*) V + \lambda I = 0$. Thus, $\lambda$ is approximated by the mean of the diagonal values of $-U^\top \nabla f(X_k) V$, where $X_k$ is the converged solution of RDFW. Active ALT terminates when $f(X_{k-1}) - f(X_k) < 10^{-4}$, matching the criterion suggested in [Yao et al., 2016], with a maximum iteration count of 150.

| Dataset | Frank-Wolfe $\delta$ | Active ALT $\lambda$ |
|---|---|---|
| MovieLens 100k | 3 | 10.94 |
| MovieLens 1M | 3.4 | 22.7 |
| MovieLens 10M | 5.2 | 49.4 |
| MovieLens 20M | 6.6 | 59.04 |

Table 3: Parameters used for each dataset.

6.1 Computational Details

All simulations were run in MATLAB. For all FW variants, we maintain a thin SVD of the current iterate, updated at each iteration using the rank-one update described in [Brand, 2006]. The rank is calculated by counting all singular values larger than $10^{-6}$, in agreement with the definition of rank in [Freund et al., 2015].

7 Discussion

We observe that the proposed method, RDFW, shows large improvements over FW and its common variants in terms of both rank and computing time. While Active ALT, the leading nuclear norm based matrix completion solver, is very competitive with RDFW, RDFW has the additional attractive property that it does not require knowledge of the structure of the Hessian, or even require $f(\cdot)$ to be twice differentiable, allowing for greater generality. Moreover, since the matrix completion objective is quadratic, this is the ideal situation for Active ALT: its underlying solver requires only one Newton step to converge, whereas with a non-quadratic objective the computational burden can increase.

8 Conclusions

We have proposed a rank-drop optimization formulation to determine optimal descent rank-drop steps for nuclear norm constrained minimization. By considering the interior and exterior cases separately to ensure feasibility, we also devise subproblems that can be solved efficiently. The proposed formulation can be deployed in a projection-free minimization method, e.g., the FW method, to efficiently compute a low-rank solution by maintaining low-rank intermediate iterates, without compromising the strong convergence guarantees. While classic FW methods tend to produce very high rank solutions for nuclear norm constrained problems, we have shown that the addition of rank-drop steps can drastically reduce the rank of the iterates, allowing much faster algorithms that reach low-rank solutions with less required space.

Acknowledgements

The authors acknowledge funding from the Natural Sciences and Engineering Research Council of Canada.

References

[Brand, 2006] Matthew Brand. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20-30, 2006.

[Egerváry, 1960] Eugen Egerváry. On rank-diminishing operations and their applications to the solution of linear equations. Zeitschrift für angewandte Mathematik und Physik ZAMP, 11(5):376-386, 1960.
[Fazel et al., 2001] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the 2001 American Control Conference, volume 6, pages 4734-4739. IEEE, 2001.

[Frank and Wolfe, 1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

[Freund et al., 2015] Robert M. Freund, Paul Grigas, and Rahul Mazumder. An extended Frank-Wolfe method with in-face directions, and its application to low-rank matrix completion. arXiv preprint arXiv:1511.02204, 2015.

[Guélat and Marcotte, 1986] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1):110-119, 1986.

[Hsieh and Olsen, 2014] Cho-Jui Hsieh and Peder A. Olsen. Nuclear norm minimization via active subspace selection. In ICML, pages 575-583, 2014.

[Jaggi, 2013] Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, pages 427-435, 2013.

[Lacoste-Julien and Jaggi, 2015] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496-504, 2015.

[Wolfe, 1970] Philip Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming, pages 1-36, 1970.

[Yao et al., 2016] Quanming Yao, James Kwok, et al. Efficient learning with a family of nonconvex regularizers by redistributing nonconvexity. arXiv preprint arXiv:1606.03841, 2016.