# Generalized Singular Value Thresholding

Canyi Lu¹, Changbo Zhu¹, Chunyan Xu², Shuicheng Yan¹, Zhouchen Lin³

¹ Department of Electrical and Computer Engineering, National University of Singapore
² School of Computer Science and Technology, Huazhong University of Science and Technology
³ Key Laboratory of Machine Perception (MOE), School of EECS, Peking University

canyilu@nus.edu.sg, zhuchangbo@gmail.com, xuchunyan01@gmail.com, eleyans@nus.edu.sg, zlin@pku.edu.cn

This work studies the Generalized Singular Value Thresholding (GSVT) operator $\operatorname{Prox}_g^{\sigma}(\cdot)$,

$$\operatorname{Prox}_g^{\sigma}(B) = \arg\min_{X} \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{1}{2}\|X - B\|_F^2,$$

associated with a nonconvex function $g$ defined on the singular values of $X$. We prove that GSVT can be obtained by performing the proximal operator of $g$ (denoted as $\operatorname{Prox}_g(\cdot)$) on the singular values, since $\operatorname{Prox}_g(\cdot)$ is monotone when $g$ is lower bounded. If the nonconvex $g$ satisfies certain conditions (many popular nonconvex surrogate functions of the $\ell_0$-norm, e.g., the $\ell_p$-norm with $0 < p < 1$, are special cases), a general solver for finding $\operatorname{Prox}_g(b)$ is proposed for any $b \geq 0$. GSVT greatly generalizes the known Singular Value Thresholding (SVT), which is a basic subroutine in many convex low rank minimization methods. We are able to solve the nonconvex low rank minimization problem by using GSVT in place of SVT.

## Introduction

The sparse and low rank structures have received much attention in recent years. There have been many applications which exploit these two structures, such as face recognition (Wright et al. 2009), subspace clustering (Cheng et al. 2010; Liu et al. 2013b) and background modeling (Candès et al. 2011). To achieve sparsity, a principled approach is to use the convex $\ell_1$-norm. However, $\ell_1$-minimization may be suboptimal, since the $\ell_1$-norm is a loose approximation of the $\ell_0$-norm and often leads to an over-penalized problem. This has brought attention back to nonconvex surrogates which interpolate between the $\ell_0$-norm and the $\ell_1$-norm. Many nonconvex penalties have been proposed, including the $\ell_p$-norm ($0 < p < 1$) (Frank and Friedman 1993), Smoothly Clipped Absolute Deviation (SCAD) (Fan and Li 2001), Logarithm (Friedman 2012), Minimax Concave Penalty (MCP) (Zhang and others 2010), Geman (Geman and Yang 1995) and Laplace (Trzasko and Manduca 2009). Their definitions are shown in Table 1. Numerical studies (Candès, Wakin, and Boyd 2008) have shown that nonconvex optimization usually outperforms convex models.

Table 1: Popular nonconvex surrogate functions of the $\ell_0$-norm ($\|\theta\|_0$).

| Penalty | Formula $g(\theta)$, $\theta \geq 0$, $\lambda > 0$ |
| --- | --- |
| $\ell_p$-norm | $\lambda\theta^p$, $0 < p < 1$ |
| SCAD | $\lambda\theta$ if $\theta \leq \lambda$; $\dfrac{-\theta^2 + 2\gamma\lambda\theta - \lambda^2}{2(\gamma-1)}$ if $\lambda < \theta \leq \gamma\lambda$; $\dfrac{\lambda^2(\gamma+1)}{2}$ if $\theta > \gamma\lambda$ |
| Logarithm | $\dfrac{\lambda}{\log(\gamma+1)}\log(\gamma\theta + 1)$ |
| MCP | $\lambda\theta - \dfrac{\theta^2}{2\gamma}$ if $\theta < \gamma\lambda$; $\dfrac{1}{2}\gamma\lambda^2$ if $\theta \geq \gamma\lambda$ |
| Geman | $\dfrac{\lambda\theta}{\theta + \gamma}$ |
| Laplace | $\lambda\left(1 - \exp\left(-\dfrac{\theta}{\gamma}\right)\right)$ |

The low rank structure is an extension of sparsity, defined on the singular values of a matrix. A principled way is to use the nuclear norm, which is a convex surrogate of the rank function (Recht, Fazel, and Parrilo 2010). However, it suffers from the same suboptimality issue as the $\ell_1$-norm in many cases. Very recently, many popular nonconvex surrogate functions in Table 1 have been extended to the singular values to better approximate the rank function (Lu et al. 2014). However, different from the convex case, nonconvex low rank minimization is much more challenging than nonconvex sparse minimization.
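To make the formulas in Table 1 concrete, the following small Python sketch evaluates three of the surrogates (Logarithm, MCP, Laplace); the function names and default parameter values are illustrative choices, not from the paper.

```python
import numpy as np

# A few nonconvex surrogates from Table 1, as functions of a nonnegative
# scalar or array theta. Parameter names follow the table.

def penalty_logarithm(theta, lam=1.0, gamma=1.5):
    # g(theta) = lambda / log(gamma + 1) * log(gamma * theta + 1)
    return lam / np.log(gamma + 1.0) * np.log(gamma * theta + 1.0)

def penalty_mcp(theta, lam=1.0, gamma=1.5):
    # g(theta) = lambda*theta - theta^2/(2*gamma)  if theta <  gamma*lambda
    #          = gamma*lambda^2/2                  if theta >= gamma*lambda
    theta = np.asarray(theta, dtype=float)
    return np.where(theta < gamma * lam,
                    lam * theta - theta ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def penalty_laplace(theta, lam=1.0, gamma=1.5):
    # g(theta) = lambda * (1 - exp(-theta / gamma))
    return lam * (1.0 - np.exp(-theta / gamma))

if __name__ == "__main__":
    theta = np.linspace(0.0, 3.0, 7)
    print(penalty_logarithm(theta))
    print(penalty_mcp(theta))
    print(penalty_laplace(theta))
```

All three functions are zero at the origin, concave and nondecreasing, and saturate for large arguments, which is what makes them tighter approximations of the $\ell_0$-norm than the linear $\ell_1$ penalty.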
The Iteratively Reweighted Nuclear Norm (IRNN) method was proposed to solve the following nonconvex low rank minimization problem (Lu et al. 2014)

$$\min_{X} F(X) = \sum_{i=1}^{m} g(\sigma_i(X)) + h(X), \qquad (1)$$

where $\sigma_i(X)$ denotes the $i$-th singular value of $X \in \mathbb{R}^{m\times n}$ (we assume $m \leq n$ in this work), $g: \mathbb{R}^+ \to \mathbb{R}^+$ is continuous, concave and nondecreasing on $[0, +\infty)$ (the popular nonconvex surrogate functions in Table 1 are examples), and $h: \mathbb{R}^{m\times n} \to \mathbb{R}^+$ is a loss function with Lipschitz continuous gradient. IRNN updates $X^{k+1}$ by minimizing a surrogate function which upper bounds the objective in (1); the surrogate is constructed by linearizing $g$ and $h$ at $X^k$ simultaneously. In theory, IRNN is guaranteed to decrease the objective value of (1) in each iteration. However, it may decrease slowly since the upper bound surrogate may be quite loose. It is expected that minimizing a tighter surrogate will lead to faster convergence.

[Figure 1: Gradients of some nonconvex functions: (a) $\ell_p$-norm, (b) SCAD, (c) Logarithm, (d) MCP, (e) Geman, (f) Laplace. For the $\ell_p$-norm, $p = 0.5$; for all penalties, $\lambda = 1$, $\gamma = 1.5$.]

A possible tighter surrogate of the objective in (1) keeps $g$ and relaxes $h$ only. This leads to the following updating rule, which is named the Generalized Proximal Gradient (GPG) method in this work:

$$
\begin{aligned}
X^{k+1} &= \arg\min_{X} \sum_{i=1}^{m} g(\sigma_i(X)) + h(X^k) + \langle \nabla h(X^k), X - X^k\rangle + \frac{\mu}{2}\|X - X^k\|_F^2 \\
&= \arg\min_{X} \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{\mu}{2}\Big\|X - X^k + \frac{1}{\mu}\nabla h(X^k)\Big\|_F^2,
\end{aligned} \qquad (2)
$$

where $\mu > L(h)$, the Lipschitz constant of $\nabla h$, guarantees the convergence of GPG as shown later. It can be seen that solving (2) requires solving the following problem

$$\operatorname{Prox}_g^{\sigma}(B) = \arg\min_{X} \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{1}{2}\|X - B\|_F^2. \qquad (3)$$

In this work, the mapping $\operatorname{Prox}_g^{\sigma}(\cdot)$ is called the Generalized Singular Value Thresholding (GSVT) operator associated with the function $\sum_{i=1}^{m} g(\cdot)$ defined on the singular values. If $g(x) = \lambda x$, then $\sum_{i=1}^{m} g(\sigma_i(X))$ degenerates to the convex nuclear norm $\lambda\|X\|_*$. In that case (3) has a closed form solution $\operatorname{Prox}_g^{\sigma}(B) = U \operatorname{Diag}(\mathcal{D}_\lambda(\sigma(B)))V^T$, where $\mathcal{D}_\lambda(\sigma(B)) = \{(\sigma_i(B) - \lambda)_+\}_{i=1}^{m}$, and $U$ and $V$ are from the SVD of $B$, i.e., $B = U \operatorname{Diag}(\sigma(B))V^T$. This is the known Singular Value Thresholding (SVT) operator associated with the convex nuclear norm (when $g(x) = \lambda x$) (Cai, Candès, and Shen 2010). More generally, for a convex $g$, the solution to (3) is

$$\operatorname{Prox}_g^{\sigma}(B) = U \operatorname{Diag}(\operatorname{Prox}_g(\sigma(B)))V^T, \qquad (4)$$

where $\operatorname{Prox}_g(\cdot)$ is defined element-wise as follows,

$$\operatorname{Prox}_g(b) = \arg\min_{x \geq 0} f_b(x) = g(x) + \frac{1}{2}(x - b)^2. \qquad (5)$$

(Footnote: for $x < 0$, $g(x) = g(-x)$. If $b \geq 0$, then $\operatorname{Prox}_g(b) \geq 0$; if $b < 0$, then $\operatorname{Prox}_g(b) = -\operatorname{Prox}_g(-b)$. So we only need to discuss the case $b, x \geq 0$ in this work.)

Here $\operatorname{Prox}_g(\cdot)$ is the known proximal operator associated with a convex $g$ (Combettes and Pesquet 2011). That is to say, solving (3) is equivalent to performing $\operatorname{Prox}_g(\cdot)$ on each singular value of $B$. In this case, the mapping $\operatorname{Prox}_g(\cdot)$ is unique, i.e., (5) has a unique solution. More importantly, $\operatorname{Prox}_g(\cdot)$ is monotone, i.e., $\operatorname{Prox}_g(x_1) \geq \operatorname{Prox}_g(x_2)$ for any $x_1 \geq x_2$. This guarantees that the nonincreasing order of the singular values is preserved after shrinkage and thresholding by the mapping $\operatorname{Prox}_g(\cdot)$. For a nonconvex $g$, we still call $\operatorname{Prox}_g(\cdot)$ the proximal operator, but note that such a mapping may not be unique. It is still an open problem whether $\operatorname{Prox}_g(\cdot)$ is monotone or not for a nonconvex $g$. Without proving the monotonicity of $\operatorname{Prox}_g(\cdot)$, one cannot simply perform it on the singular values of $B$ to obtain the solution to (3) as in SVT (a minimal sketch of the convex SVT case is given below).
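For reference, here is a minimal Python/NumPy sketch of the classical SVT operator, i.e., (4)–(5) for the nuclear norm case $g(x) = \lambda x$; the function name and the test shapes are illustrative.

```python
import numpy as np

def svt(B, lam):
    """Singular Value Thresholding for g(x) = lam * x (the nuclear norm case).

    Returns U diag((sigma - lam)_+) V^T, i.e., the closed form of (3)-(4).
    """
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_thresholded = np.maximum(s - lam, 0.0)   # (sigma_i(B) - lam)_+
    return U @ np.diag(s_thresholded) @ Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 8))
    X = svt(B, lam=0.5)
    # The singular values of X are the soft-thresholded singular values of B.
    print(np.round(np.linalg.svd(X, compute_uv=False), 4))
    print(np.round(np.maximum(np.linalg.svd(B, compute_uv=False) - 0.5, 0.0), 4))
```

For convex $g$ the soft-thresholding map $b \mapsto (b-\lambda)_+$ is monotone, which is exactly why applying it to each singular value keeps them sorted; the question addressed below is whether the same strategy remains valid for nonconvex $g$.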
Even if $\operatorname{Prox}_g(\cdot)$ is monotone, since it may not be unique, one also needs to carefully choose the solutions $p_i \in \operatorname{Prox}_g(\sigma_i(B))$ such that $p_1 \geq p_2 \geq \cdots \geq p_m$. Another challenging problem is that there does not exist a general solver of (5) for a general nonconvex $g$.

It is worth mentioning that some previous works studied the solution to (3) for some special choices of nonconvex $g$ (Nie, Huang, and Ding 2012; Chartrand 2012; Liu et al. 2013a). However, none of their proofs was rigorous, since they ignored proving the monotone property of $\operatorname{Prox}_g(\cdot)$. See the detailed discussions in the next section. Another recent work (Gu et al. 2014) considered the following problem related to the weighted nuclear norm:

$$\min_{X} f_{w,B}(X) = \sum_{i=1}^{m} w_i \sigma_i(X) + \frac{1}{2}\|X - B\|_F^2, \qquad (6)$$

where $w_i \geq 0$, $i = 1, \ldots, m$. Problem (6) is a little more general than (3), obtained by taking different $g_i(x) = w_i x$. It is claimed in (Gu et al. 2014) that the solution to (6) is

$$X^* = U \operatorname{Diag}\big(\{\operatorname{Prox}_{g_i}(\sigma_i(B)),\, i = 1, \ldots, m\}\big)V^T, \qquad (7)$$

where $B = U \operatorname{Diag}(\sigma(B))V^T$ is the SVD of $B$, and $\operatorname{Prox}_{g_i}(\sigma_i(B)) = \max\{\sigma_i(B) - w_i, 0\}$. However, such a result and its proof are not correct. A counterexample is as follows:

$$B = \begin{bmatrix} 0.0941 & 0.4201 \\ 0.5096 & 0.0089 \end{bmatrix}, \quad X^* = \begin{bmatrix} 0.0345 & 0.1287 \\ 0.0542 & 0.0512 \end{bmatrix}, \quad \hat{X} = \begin{bmatrix} 0.0130 & 0.1938 \\ 0.1861 & 0.0218 \end{bmatrix},$$

where $X^*$ is obtained by (7). The solution $X^*$ is not optimal to (6), since for $\hat{X}$ shown above we have $f_{w,B}(\hat{X}) = 0.2262 < f_{w,B}(X^*) = 0.2393$. The reason is that

$$\big(\operatorname{Prox}_{g_i}(\sigma_i(B)) - \operatorname{Prox}_{g_j}(\sigma_j(B))\big)\big(\sigma_i(B) - \sigma_j(B)\big) \geq 0 \qquad (8)$$

is not guaranteed to hold for every $i \neq j$. Note that (8) does hold when $0 \leq w_1 \leq \cdots \leq w_m$, and thus (7) is optimal to (6) in this case.

In this work, we give the first rigorous proof that $\operatorname{Prox}_g(\cdot)$ is monotone for any lower bounded function (regardless of the convexity of $g$). Then solving (3) degenerates to solving (5) for each $b = \sigma_i(B)$. The Generalized Singular Value Thresholding (GSVT) operator $\operatorname{Prox}_g^{\sigma}(\cdot)$ in (3), associated with any lower bounded function, is much more general than the known SVT associated with the convex nuclear norm (Cai, Candès, and Shen 2010). In order to compute GSVT, we analyze the solution to (5) for certain types of $g$ (some special cases are shown in Table 1) in theory, and propose a general solver of (5). At last, with GSVT, we can solve (1) by the Generalized Proximal Gradient (GPG) algorithm shown in (2). We test both Iteratively Reweighted Nuclear Norm (IRNN) and GPG on the matrix completion problem. Both synthetic and real data experiments show that GPG outperforms IRNN in terms of the recovery error and the objective function value.

## Generalized Singular Value Thresholding

### Problem Reformulation

A main goal of this work is to compute GSVT (3) and to use it to solve (1). We will show that, if $\operatorname{Prox}_g(\cdot)$ is monotone, problem (3) can be reformulated into an equivalent problem which is much easier to solve.

Lemma 1 (von Neumann's trace inequality (Rhea 2011)). For any matrices $A, B \in \mathbb{R}^{m\times n}$ ($m \leq n$), $\operatorname{Tr}(A^T B) \leq \sum_{i=1}^{m} \sigma_i(A)\sigma_i(B)$, where $\sigma_1(A) \geq \sigma_2(A) \geq \cdots \geq 0$ and $\sigma_1(B) \geq \sigma_2(B) \geq \cdots \geq 0$ are the singular values of $A$ and $B$, respectively. The equality holds if and only if there exist unitaries $U$ and $V$ such that $A = U \operatorname{Diag}(\sigma(A))V^T$ and $B = U \operatorname{Diag}(\sigma(B))V^T$ are the SVDs of $A$ and $B$, simultaneously.

Theorem 1. Let $g: \mathbb{R}^+ \to \mathbb{R}^+$ be a function such that $\operatorname{Prox}_g(\cdot)$ is monotone. Let $B = U \operatorname{Diag}(\sigma(B))V^T$ be the SVD of $B \in \mathbb{R}^{m\times n}$. Then an optimal solution to (3) is

$$X^* = U \operatorname{Diag}(\varrho^*)V^T, \qquad (9)$$

where $\varrho^*$ satisfies $\varrho_1^* \geq \varrho_2^* \geq \cdots \geq \varrho_m^*$ and, for $i = 1, \ldots, m$,

$$\varrho_i^* \in \operatorname{Prox}_g(\sigma_i(B)) = \arg\min_{\varrho_i \geq 0} g(\varrho_i) + \frac{1}{2}(\varrho_i - \sigma_i(B))^2. \qquad (10)$$

Proof. Denote $\sigma_1(X) \geq \cdots \geq \sigma_m(X) \geq 0$ as the singular values of $X$.
Problem (3) can be rewritten as

$$\min_{\varrho:\ \varrho_1 \geq \cdots \geq \varrho_m \geq 0}\ \min_{X:\ \sigma_i(X) = \varrho_i}\ \sum_{i=1}^{m} g(\varrho_i) + \frac{1}{2}\|X - B\|_F^2. \qquad (11)$$

By using von Neumann's trace inequality in Lemma 1, we have

$$
\begin{aligned}
\|X - B\|_F^2 &= \operatorname{Tr}(X^T X) - 2\operatorname{Tr}(X^T B) + \operatorname{Tr}(B^T B) \\
&= \sum_{i=1}^{m}\sigma_i^2(X) - 2\operatorname{Tr}(X^T B) + \sum_{i=1}^{m}\sigma_i^2(B) \\
&\geq \sum_{i=1}^{m}\sigma_i^2(X) - 2\sum_{i=1}^{m}\sigma_i(X)\sigma_i(B) + \sum_{i=1}^{m}\sigma_i^2(B) \\
&= \sum_{i=1}^{m}\big(\sigma_i(X) - \sigma_i(B)\big)^2.
\end{aligned}
$$

Note that the above equality holds when $X$ admits the singular value decomposition $X = U \operatorname{Diag}(\sigma(X))V^T$, where $U$ and $V$ are the left and right orthonormal matrices in the SVD of $B$. In this case, problem (11) reduces to

$$\min_{\varrho:\ \varrho_1 \geq \cdots \geq \varrho_m \geq 0}\ \sum_{i=1}^{m}\Big( g(\varrho_i) + \frac{1}{2}(\varrho_i - \sigma_i(B))^2 \Big). \qquad (12)$$

Since $\operatorname{Prox}_g(\cdot)$ is monotone and $\sigma_1(B) \geq \sigma_2(B) \geq \cdots \geq \sigma_m(B)$, there exist $\varrho_i^* \in \operatorname{Prox}_g(\sigma_i(B))$ such that $\varrho_1^* \geq \varrho_2^* \geq \cdots \geq \varrho_m^*$. Such a choice of $\varrho^*$ is optimal to (12), and thus (9) is optimal to (3).

From the above proof it can be seen that the monotone property of $\operatorname{Prox}_g(\cdot)$ is the key condition which makes problem (12) conditionally separable. Thus the solution (9) to (3) shares a similar formulation to the known Singular Value Thresholding (SVT) operator associated with the convex nuclear norm (Cai, Candès, and Shen 2010). Note that for a convex $g$, $\operatorname{Prox}_g(\cdot)$ is always monotone. Indeed,

$$\big(\operatorname{Prox}_g(b_1) - \operatorname{Prox}_g(b_2)\big)(b_1 - b_2) \geq \big(\operatorname{Prox}_g(b_1) - \operatorname{Prox}_g(b_2)\big)^2 \geq 0, \quad \forall\, b_1, b_2 \in \mathbb{R}^+.$$

The above inequality follows from the optimality of $\operatorname{Prox}_g(\cdot)$ and the convexity of $g$. The monotonicity of $\operatorname{Prox}_g(\cdot)$ for a nonconvex $g$ was previously unknown. There were some previous works (Nie, Huang, and Ding 2012; Chartrand 2012; Liu et al. 2013a) claiming that the solution (9) is optimal to (3) for some special choices of nonconvex $g$. However, their results are not rigorous since the monotone property of $\operatorname{Prox}_g(\cdot)$ was not proved. Surprisingly, we find that the monotone property of $\operatorname{Prox}_g(\cdot)$ holds for any lower bounded function $g$.

Theorem 2. For any lower bounded function $g$, its proximal operator $\operatorname{Prox}_g(\cdot)$ is monotone, i.e., for any $p_i^* \in \operatorname{Prox}_g(x_i)$, $i = 1, 2$, we have $p_1^* \geq p_2^*$ whenever $x_1 > x_2$.

Note that it is possible that $\sigma_i(B) = \sigma_j(B)$ for some $i < j$ in (10). Since $\operatorname{Prox}_g(\cdot)$ may not be unique, we need to choose $\varrho_i^* \in \operatorname{Prox}_g(\sigma_i(B))$ and $\varrho_j^* \in \operatorname{Prox}_g(\sigma_j(B))$ such that $\varrho_i^* \geq \varrho_j^*$. This is the only difference between GSVT and SVT.

### Proximal Operator of Nonconvex Function

So far, we have proved that solving (3) is equivalent to solving (5) for each $b = \sigma_i(B)$, $i = 1, \ldots, m$, for any lower bounded function $g$. For a nonconvex $g$, the candidate solutions to (5) have a closed form only in some special cases (Gong et al. 2013); there does not exist a general solver for a more general nonconvex $g$. In this section, we analyze the solution to (5) for a broad class of nonconvex $g$. A general solver is then proposed in the next section.

Assumption 1. $g: \mathbb{R}^+ \to \mathbb{R}^+$, $g(0) = 0$. $g$ is concave, nondecreasing and differentiable. The gradient $\nabla g$ is convex.

In this work, we are interested in nonconvex surrogates of the $\ell_0$-norm. Except for the differentiability of $g$ and the convexity of $\nabla g$, all the other assumptions in Assumption 1 are necessary for constructing a surrogate of the $\ell_0$-norm. As shown later, these two additional assumptions make our analysis much easier. Note that the assumptions on the nonconvex function considered in Assumption 1 are quite general.

Algorithm 1: A General Solver of (5), where $g$ satisfies Assumption 1

    Input: b ≥ 0.
    Output: an optimal solution to (5), which is either 0 or x̂_b = max{x | ∇f_b(x) = 0, 0 ≤ x ≤ b}.
    if ∇g(b) = 0 then
        x̂_b = b
    else  // find x̂_b by fixed point iteration
        x^0 = b  // initialization
        while not converged do
            x^{k+1} = b − ∇g(x^k)
            if x^{k+1} < 0 then x̂_b = 0; break
        end
    end
    Compare f_b(0) and f_b(x̂_b) to identify the optimal one.
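To make Algorithm 1 concrete, here is a minimal Python sketch of the fixed point iteration for one specific choice of $g$ satisfying Assumption 1, the Logarithm penalty from Table 1 (so $\nabla g(x) = \lambda\gamma / ((\gamma x + 1)\log(\gamma+1))$); the function name, tolerance and iteration cap are illustrative choices, not from the paper.

```python
import numpy as np

def prox_log_penalty(b, lam=1.0, gamma=1.5, max_iter=100, tol=1e-10):
    """Solve (5) for g(x) = lam/log(gamma+1) * log(gamma*x + 1) via Algorithm 1:
    fixed point iteration x <- b - grad_g(x), started at x = b."""
    def g(x):
        return lam / np.log(gamma + 1.0) * np.log(gamma * x + 1.0)

    def grad_g(x):
        return lam * gamma / ((gamma * x + 1.0) * np.log(gamma + 1.0))

    def f_b(x):
        return g(x) + 0.5 * (x - b) ** 2

    assert b >= 0.0
    if grad_g(b) == 0.0:           # cannot happen for this g; kept to mirror Algorithm 1
        x_hat = b
    else:
        x = b                       # initialization x^0 = b
        x_hat = 0.0
        for _ in range(max_iter):
            x_new = b - grad_g(x)   # x^{k+1} = b - grad_g(x^k)
            if x_new < 0.0:         # no stationary point in (0, b]
                x_hat = 0.0
                break
            if abs(x_new - x) < tol:
                x_hat = x_new
                break
            x = x_new
        else:
            x_hat = x
    # Compare the two candidates 0 and x_hat (Corollary 1 below).
    return 0.0 if f_b(0.0) <= f_b(x_hat) else x_hat

if __name__ == "__main__":
    for b in [0.2, 0.8, 1.5, 3.0]:
        print(b, prox_log_penalty(b))
```

Running the sketch shows the expected behavior: small inputs are thresholded to zero, while large inputs are shrunk only slightly, since the iteration converges to the largest stationary point of $f_b$ near $x = b$.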
It is easy to verify that many popular surrogates of the $\ell_0$-norm in Table 1 satisfy Assumption 1, including the $\ell_p$-norm, Logarithm, MCP, Geman and Laplace penalties. Only the SCAD penalty violates the convexity assumption on $\nabla g$, as shown in Figure 1.

Proposition 1. Given $g$ satisfying Assumption 1, the optimal solution to (5) lies in $[0, b]$.

The above fact is obvious since both $g(x)$ and $\frac{1}{2}(x-b)^2$ are nondecreasing on $[b, +\infty)$. Such a result limits the solution space and thus is very useful for our analysis. Our general solver of (5) is also based on Proposition 1. Note that the solutions to (5) are either $0$ or stationary points $\{x \mid \nabla f_b(x) = \nabla g(x) + x - b = 0\}$. Our analysis is mainly based on the number of intersection points of $D(x) = \nabla g(x)$ and the line $C_b(x) = b - x$. Let $\bar{b} = \sup\{b \mid C_b(x) \text{ and } D(x) \text{ have no intersection}\}$. We give the solution to (5) in different cases; please refer to the supplementary material for the detailed proofs.

Proposition 2. Given $g$ satisfying Assumption 1 and $\nabla g(0) = +\infty$. Restricted to $[0, +\infty)$, when $b > \bar{b}$, $C_b(x)$ and $D(x)$ have two intersection points, denoted as $P_1^b = (x_1^b, y_1^b)$ and $P_2^b = (x_2^b, y_2^b)$ with $x_1^b < x_2^b$. If there does not exist $b > \bar{b}$ such that $f_b(0) = f_b(x_2^b)$, then $\operatorname{Prox}_g(b) = 0$ for all $b \geq 0$. If there exists $b > \bar{b}$ such that $f_b(0) = f_b(x_2^b)$, let $b^* = \inf\{b \mid f_b(0) = f_b(x_2^b)\}$. Then we have

$$\operatorname{Prox}_g(b) = \arg\min_{x \geq 0} f_b(x) = \begin{cases} x_2^b, & \text{if } b > b^*,\\ 0, & \text{if } b \leq b^*.\end{cases}$$

Proposition 3. Given $g$ satisfying Assumption 1 and $\nabla g(0) < +\infty$. Restricted to $[0, +\infty)$, if $C_{\nabla g(0)}(x) = \nabla g(0) - x \leq \nabla g(x)$ for all $x \in (0, \nabla g(0))$, then $C_b(x)$ and $D(x)$ have only one intersection point $(x^b, y^b)$ when $b > \nabla g(0)$. Furthermore,

$$\operatorname{Prox}_g(b) = \arg\min_{x \geq 0} f_b(x) = \begin{cases} x^b, & \text{if } b > \nabla g(0),\\ 0, & \text{if } b \leq \nabla g(0).\end{cases}$$

Suppose instead that there exists $0 < \hat{x} < \nabla g(0)$ such that $C_{\nabla g(0)}(\hat{x}) = \nabla g(0) - \hat{x} > \nabla g(\hat{x})$. Then, when $\nabla g(0) \geq b > \bar{b}$, $C_b(x)$ and $D(x)$ have two intersection points, denoted as $P_1^b = (x_1^b, y_1^b)$ and $P_2^b = (x_2^b, y_2^b)$ with $x_1^b < x_2^b$; when $b > \nabla g(0)$, $C_b(x)$ and $D(x)$ have only one intersection point $(x^b, y^b)$. Also, there exists $\tilde{b}$ with $\nabla g(0) > \tilde{b} > \bar{b}$ such that $f_{\tilde{b}}(0) = f_{\tilde{b}}(x_2^{\tilde{b}})$. Let $b^* = \inf\{b \mid f_b(0) = f_b(x_2^b)\}$. We have

$$\operatorname{Prox}_g(b) = \arg\min_{x \geq 0} f_b(x) = \begin{cases} x^b, & \text{if } b > \nabla g(0),\\ x_2^b, & \text{if } \nabla g(0) \geq b > b^*,\\ 0, & \text{if } b \leq b^*.\end{cases}$$

Corollary 1. Given $g$ satisfying Assumption 1, denote $\hat{x}_b = \max\{x \mid \nabla f_b(x) = 0, 0 \leq x \leq b\}$ and $x^* = \arg\min_{x \in \{0, \hat{x}_b\}} f_b(x)$. Then $x^*$ is optimal to (5).

The results in Propositions 2 and 3 give the solution to (5) in different cases, while Corollary 1 summarizes them. It can be seen that one only needs to compute $\hat{x}_b$, which is the largest local minimum; comparing the objective values at $0$ and $\hat{x}_b$ then yields an optimal solution to (5).

[Figure 2: Plots of $b$ versus $\operatorname{Prox}_g(b)$ for different choices of $g$: the convex $\ell_1$-norm and popular nonconvex functions in Table 1 which satisfy Assumption 1, including the $\ell_p$-norm, Logarithm and Laplace penalties.]

## Algorithms

In this section, we first give a general solver of (5) in which $g$ satisfies Assumption 1. Then we are able to solve the GSVT problem (3). With GSVT, problem (1) can be solved by the Generalized Proximal Gradient (GPG) algorithm shown in (2). We also give the convergence guarantee of GPG.

### A General Solver of (5)

Given $g$ satisfying Assumption 1, as shown in Corollary 1, $0$ and $\hat{x}_b = \max\{x \mid \nabla f_b(x) = 0, 0 \leq x \leq b\}$ are the candidate solutions to (5). The remaining task is to find $\hat{x}_b$, which is the largest local minimum point near $x = b$.
So we can start searching for $\hat{x}_b$ from $x^0 = b$ by the fixed point iteration algorithm. Note that this is very fast since we only need to search within $[0, b]$. The whole procedure for finding $\hat{x}_b$ is given in Algorithm 1. In theory, it can be proved that the fixed point iteration is guaranteed to find $\hat{x}_b$. If $g$ is nonsmooth or $\nabla g$ is nonconvex, the fixed point iteration algorithm may still be applicable; the key is to find all the local solutions from suitably chosen initial points, and all the nonsmooth points should also be considered as candidates.

All the nonconvex surrogates $g$ in Table 1 except SCAD satisfy Assumption 1, and thus the solution to (5) can be obtained by Algorithm 1. Figure 2 illustrates the shrinkage effect of the proximal operators of these functions and of the convex $\ell_1$-norm. The shrinkage and thresholding effects of these proximal operators are similar when $b$ is relatively small. However, when $b$ is relatively large, the proximal operators of the nonconvex functions are nearly unbiased, i.e., they keep $b$ nearly unchanged, just like the $\ell_0$-norm. In contrast, the proximal operator of the convex $\ell_1$-norm is biased; in this case the $\ell_1$-norm may be over-penalized, and thus may perform quite differently from the $\ell_0$-norm. This also supports the necessity of using nonconvex penalties on the singular values to approximate the rank function.

### Generalized Proximal Gradient Algorithm for (1)

Given $g$ satisfying Assumption 1, we are now able to obtain the optimal solution to (3) by (9) and Algorithm 1. We therefore have a better solver than IRNN for (1) via the updating rule (2), or equivalently

$$X^{k+1} = \operatorname{Prox}^{\sigma}_{\frac{1}{\mu}g}\Big(X^k - \frac{1}{\mu}\nabla h(X^k)\Big).$$

The above updating rule is named Generalized Proximal Gradient (GPG) for the nonconvex problem (1); it generalizes some previous methods (Beck and Teboulle 2009; Gong et al. 2013). The main per-iteration cost of GPG is an SVD, which is the same as in many convex methods (Toh and Yun 2010a; Lin, Chen, and Ma 2009). In theory, we have the following convergence result for GPG.

Theorem 3. If $\mu > L(h)$, the sequence $\{X^k\}$ generated by (2) satisfies the following properties: (1) $F(X^k)$ is monotonically decreasing; (2) $\lim_{k \to +\infty} (X^k - X^{k+1}) = 0$; (3) if $F(X) \to +\infty$ when $\|X\|_F \to +\infty$, then any limit point of $\{X^k\}$ is a stationary point.

It is expected that GPG decreases the objective function value faster than IRNN, since it uses a tighter surrogate function. This is verified by the experiments.

## Experiments

In this section, we conduct experiments on the matrix completion problem to test our proposed GPG algorithm:

$$\min_{X} \sum_{i=1}^{m} g(\sigma_i(X)) + \frac{1}{2}\|\mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\|_F^2, \qquad (13)$$

where $\Omega$ is the index set, and $\mathcal{P}_\Omega: \mathbb{R}^{m\times n} \to \mathbb{R}^{m\times n}$ is the linear operator that keeps the entries in $\Omega$ unchanged and sets those outside $\Omega$ to zero. Given $\mathcal{P}_\Omega(M)$, the goal of matrix completion is to recover the low rank matrix $M$. Note that we have many choices of $g$ satisfying Assumption 1; we simply test the Logarithm penalty, since it is suggested in (Lu et al. 2014; Candès, Wakin, and Boyd 2008) that it usually performs well compared with other nonconvex penalties. Problem (13) can be solved by GPG, using GSVT (9) in each iteration (a minimal sketch of one GPG step is given below). We compare GPG with IRNN on both synthetic and real data. The continuation technique is used to enhance the low rank matrix recovery in GPG.
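As a concrete illustration rather than the authors' implementation, the following Python sketch combines the GSVT operator (9) with one GPG iteration (2) for the matrix completion objective (13), using the Logarithm penalty and a compact version of the Algorithm 1 solver; all function names, parameter values ($\lambda$, $\gamma$, $\mu$) and the synthetic test are illustrative assumptions.

```python
import numpy as np

def prox_log(b, lam, gamma=1.5, iters=100):
    """Prox of g(x) = lam/log(gamma+1) * log(gamma*x + 1), via the fixed point
    iteration of Algorithm 1 (compact version of the earlier sketch)."""
    log_c = np.log(gamma + 1.0)
    g = lambda x: lam / log_c * np.log(gamma * x + 1.0)
    grad_g = lambda x: lam * gamma / ((gamma * x + 1.0) * log_c)
    f_b = lambda x: g(x) + 0.5 * (x - b) ** 2
    x = b
    for _ in range(iters):
        x_new = b - grad_g(x)
        if x_new < 0.0:     # no stationary point in (0, b]
            x = 0.0
            break
        x = x_new
    return 0.0 if f_b(0.0) <= f_b(x) else x

def gsvt(B, prox_scalar):
    """GSVT operator (9): apply the scalar prox to each singular value of B.
    By Theorem 2 the prox is monotone, so the singular value order is preserved."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag([prox_scalar(si) for si in s]) @ Vt

def gpg_step(X, M, mask, mu, lam, gamma=1.5):
    """One GPG iteration (2) for (13): h(X) = 0.5*||P_Omega(X - M)||_F^2, so
    grad h(X) = P_Omega(X - M) and L(h) = 1. The prox of (1/mu)*g for the
    Logarithm penalty is simply the prox of g with lam/mu."""
    grad_h = mask * (X - M)
    return gsvt(X - grad_h / mu, lambda b: prox_log(b, lam / mu, gamma))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 50, 60, 3
    M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
    mask = (rng.random((m, n)) < 0.5).astype(float)   # observed entries (Omega)
    X = np.zeros((m, n))
    for _ in range(300):
        X = gpg_step(X, M, mask, mu=1.1, lam=0.1)
    print("relative error:", np.linalg.norm(X - M) / np.linalg.norm(M))
```

The per-iteration cost is dominated by the SVD inside `gsvt`, matching the paper's observation that GPG has the same per-iteration complexity as standard convex proximal methods.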
The initial value of $\lambda$ in the Logarithm penalty is set to $\lambda_0$ and is dynamically decreased until reaching $\lambda_t$.

### Low-Rank Matrix Recovery on Random Data

We conduct two experiments on synthetic data, without and with noise (Lu et al. 2014). For the noise free case, we generate $M = M_1 M_2$, where $M_1 \in \mathbb{R}^{m\times r}$ and $M_2 \in \mathbb{R}^{r\times n}$ are i.i.d. random matrices, and $m = n = 150$. The underlying rank $r$ varies from 20 to 33. Half of the elements in $M$ are missing. We set $\lambda_0 = 0.9\|\mathcal{P}_\Omega(M)\|$ and $\lambda_t = 10^{-5}\lambda_0$. The relative error $\text{RelErr} = \|X^* - M\|_F / \|M\|_F$ is used to evaluate the recovery performance. If RelErr is smaller than $10^{-3}$, $X^*$ is regarded as a successful recovery of $M$. We repeat the experiments 100 times for each $r$. We compare GPG (using GSVT) with IRNN and the convex Augmented Lagrange Multiplier (ALM) method (Lin, Chen, and Ma 2009).

[Figure 3: Experimental results of low rank matrix recovery on random data. (a) Frequency of Success (FoS) for the noise free case. (b) Relative error for the noisy case. (c) Convergence curves of IRNN and GPG for the noisy case.]

Figure 3 (a) plots $r$ versus the frequency of success. It can be seen that GPG is slightly better than IRNN when $r$ is relatively small, while both IRNN and GPG fail when $r \geq 32$. Both of them outperform the convex ALM method, since the nonconvex Logarithm penalty approximates the rank function better than the convex nuclear norm.

For the noisy case, the data matrix $M$ is generated in the same way, but with additional noise $0.1E$ added, where $E$ is an i.i.d. random matrix. For this task, we set $\lambda_0 = 10\|\mathcal{P}_\Omega(M)\|$ and $\lambda_t = 0.1\lambda_0$ in GPG. The convex APGL algorithm (Toh and Yun 2010b) is also compared in this task. Each method is run 100 times for each $r \in \{15, 18, 20, 23, 25, 30\}$. Figure 3 (b) shows the mean relative error. It can be seen that GPG, using GSVT in each iteration, significantly outperforms IRNN and APGL. The reason is that $\lambda_t$ is not as small as in the noise free case; thus the upper bound surrogate of $g$ used in IRNN is much looser than the surrogate used in GPG. Figure 3 (c) plots some convergence curves of GPG and IRNN. It can be seen that GPG, which does not relax $g$, decreases the objective function value faster.

### Applications on Real Data

Matrix completion can be applied to image inpainting, since the main information of an image is dominated by its top singular values. For a color image, assume that 40% of the pixels are uniformly missing. They can be recovered by applying low rank matrix completion to each channel (red, green and blue) of the image independently. Besides the relative error defined above, we also use the Peak Signal-to-Noise Ratio (PSNR) to evaluate the recovery performance.

[Figure 4: Image inpainting by APGL, IRNN, and GPG. Panels include (a) the original images and (f) the PSNR and relative error comparisons.]

Figure 4 shows two images recovered by APGL, IRNN and GPG, respectively. It can be seen that GPG achieves the best performance, i.e., the largest PSNR value and the smallest relative error.

We also apply matrix completion to collaborative filtering. The task of collaborative filtering is to predict the unknown preferences of a user on a set of unrated items, according to other similar users or similar items. We test on the MovieLens data sets (Herlocker et al. 1999), which include three problems: movie-100K, movie-1M and movie-10M. Since only the entries in $\Omega$ of $M$ are known, we use the Normalized Mean Absolute Error (NMAE) $\|\mathcal{P}_\Omega(X^*) - \mathcal{P}_\Omega(M)\|_1 / |\Omega|$ to evaluate the performance, as in (Toh and Yun 2010b). As shown in Table 2, GPG achieves the best performance. The improvement benefits from the GPG algorithm, which uses a fast and exact solver for GSVT (9).

Table 2: Comparison of NMAE of APGL, IRNN and GPG for collaborative filtering.
| Problem | size of $M$: $(m, n)$ | APGL | IRNN | GPG |
| --- | --- | --- | --- | --- |
| movie-100K | (943, 1682) | 2.76e-3 | 2.60e-3 | 2.53e-3 |
| movie-1M | (6040, 3706) | 2.66e-1 | 2.52e-1 | 2.47e-1 |
| movie-10M | (71567, 10677) | 3.13e-1 | 3.01e-1 | 2.89e-1 |

## Conclusions

This paper studied the Generalized Singular Value Thresholding (GSVT) operator associated with a nonconvex function $g$ defined on the singular values. We proved that the proximal operator $\operatorname{Prox}_g(\cdot)$ of any lower bounded function $g$ is monotone. Thus, GSVT can be obtained by performing $\operatorname{Prox}_g(\cdot)$ on the singular values separately. Given $b \geq 0$, we also proposed a general solver to find $\operatorname{Prox}_g(b)$ for certain types of $g$. At last, we applied the generalized proximal gradient algorithm, using GSVT as a subroutine, to solve the nonconvex low rank minimization problem (1). Experimental results showed that it outperforms previous methods, with smaller recovery error and objective function value.

For nonconvex low rank minimization, GSVT plays the same role as SVT in convex minimization. One may extend other convex low rank models to nonconvex cases and solve them by using GSVT in place of SVT. An interesting future work is to solve the nonconvex low rank minimization problem with affine constraints by ALM (Lin, Chen, and Ma 2009) and prove the convergence.

## Acknowledgements

This research is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office. Z. Lin is supported by NSF China (grant nos. 61272341 and 61231002), 973 Program of China (grant no. 2015CB3525) and MSRA Collaborative Research Program. C. Lu is supported by the MSRA fellowship 2014.

## References

Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.

Cai, J.-F.; Candès, E. J.; and Shen, Z. 2010. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20(4):1956-1982.

Candès, E. J.; Li, X.; Ma, Y.; and Wright, J. 2011. Robust principal component analysis? Journal of the ACM 58(3).

Candès, E. J.; Wakin, M. B.; and Boyd, S. P. 2008. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications 14(5-6):877-905.

Chartrand, R. 2012. Nonconvex splitting for regularized low-rank + sparse decomposition. IEEE Transactions on Signal Processing 60(11):5810-5819.

Cheng, B.; Yang, J.; Yan, S.; Fu, Y.; and Huang, T. S. 2010. Learning with ℓ1-graph for image analysis. TIP 19:858-866.

Combettes, P. L., and Pesquet, J.-C. 2011. Proximal splitting methods in signal processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering.

Fan, J., and Li, R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456):1348-1360.

Frank, L., and Friedman, J. 1993. A statistical view of some chemometrics regression tools. Technometrics.

Friedman, J. 2012. Fast sparse regression and classification. International Journal of Forecasting 28(3):722-738.

Geman, D., and Yang, C. 1995. Nonlinear image recovery with half-quadratic regularization. TIP 4(7):932-946.

Gong, P.; Zhang, C.; Lu, Z.; Huang, J.; and Ye, J. 2013. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In ICML.

Gu, S.; Zhang, L.; Zuo, W.; and Feng, X. 2014. Weighted nuclear norm minimization with application to image denoising. In CVPR.

Herlocker, J. L.; Konstan, J. A.; Borchers, A.; and Riedl, J. 1999. An algorithmic framework for performing collaborative filtering. In International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
Lin, Z.; Chen, M.; and Ma, Y. 2009. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215.

Liu, D.; Zhou, T.; Qian, H.; Xu, C.; and Zhang, Z. 2013a. A nearly unbiased matrix completion approach. In Machine Learning and Knowledge Discovery in Databases.

Liu, G.; Lin, Z.; Yan, S.; Sun, J.; Yu, Y.; and Ma, Y. 2013b. Robust recovery of subspace structures by low-rank representation. TPAMI 35(1):171-184.

Lu, C.; Tang, J.; Yan, S.; and Lin, Z. 2014. Generalized nonconvex nonsmooth low-rank minimization. In CVPR.

Nie, F.; Huang, H.; and Ding, C. H. 2012. Low-rank matrix recovery via efficient Schatten p-norm minimization. In AAAI.

Recht, B.; Fazel, M.; and Parrilo, P. A. 2010. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52(3):471-501.

Rhea, D. 2011. The case of equality in the von Neumann trace inequality. Preprint.

Toh, K., and Yun, S. 2010a. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization.

Toh, K., and Yun, S. 2010b. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization 6:615-640.

Trzasko, J., and Manduca, A. 2009. Highly undersampled magnetic resonance image reconstruction via homotopic ℓ0-minimization. IEEE Transactions on Medical Imaging 28(1):106-121.

Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. TPAMI 31(2):210-227.

Zhang, C.-H., et al. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2):894-942.