Journal of Machine Learning Research 24 (2023) 1-47 Submitted 9/19; Revised 10/23; Published 11/23

Scale Invariant Power Iteration

Cheolmin Kim* cheolmkim@u.northwestern.edu
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, 60208, USA

Youngseok Kim* youngseok@uchicago.edu
Department of Statistics, University of Chicago, Chicago, IL, 60637, USA

Diego Klabjan d-klabjan@northwestern.edu
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, 60208, USA

Editor: Zaid Harchaoui

Abstract

We introduce a new class of optimization problems called scale invariant problems that cover interesting problems in machine learning and statistics, and we show that they are efficiently solved by a general form of power iteration called scale invariant power iteration (SCI-PI). SCI-PI is a special case of the generalized power method (GPM) (Journée et al., 2010) where the constraint set is the unit sphere. In this work, we provide a convergence analysis of SCI-PI for scale invariant problems that yields a better rate than the analysis of GPM. Specifically, we prove that it attains local linear convergence with a generalized rate of power iteration to find an optimal solution for scale invariant problems. Moreover, we discuss some extended settings of scale invariant problems and provide similar convergence results. In numerical experiments, we introduce applications to independent component analysis, Gaussian mixtures, and non-negative matrix factorization with the KL-divergence. Experimental results demonstrate that SCI-PI is competitive with application-specific state-of-the-art algorithms and often yields better solutions.

Keywords: scale invariance, power iteration, optimization, convergence analysis, machine learning applications

1. Introduction

We study a new class of optimization problems called scale invariant problems having the form of

$$\text{maximize } f(x) \quad \text{subject to } x \in \mathcal{B}^d \triangleq \{x \in \mathbb{R}^d : \|x\|_2 = 1\}, \qquad (1)$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable, scale invariant function. We say that a function $f$ is scale invariant, as rigorously defined later in Definition 1, if its geometric surface is invariant under constant multiplication of $x$. Many important optimization problems in statistics and machine learning can be formulated as scale invariant problems, for instance, $L^p$-norm kernel PCA and maximum likelihood estimation of mixture proportions, to name a few. Moreover, as studied herein, independent component analysis (ICA, Example 3), Gaussian mixture models (GMM, Example 6), and non-negative matrix factorization with the Kullback-Leibler divergence (KL-NMF, Example 5) can be formulated as extended settings of scale invariant problems where the objective function is a sum of scale invariant functions or the problem is scale invariant with respect to a subset of variables while the other variables are fixed. Since $\mathcal{B}^d$ is not a convex set, scale invariant problems are in general non-convex optimization problems. Nevertheless, some instances can be efficiently solved, for instance, the leading eigenvector problem (Golub and Van Loan, 2012).

*. The two authors contributed equally to this paper.

©2023 Cheolmin Kim, Youngseok Kim and Diego Klabjan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/19-784.html.
Power iteration (Muntz, 1913; Mises and Pollaczek-Geiringer, 1929) is an algorithm to find the leading eigenvector of a diagonalizable matrix $A$. In power iteration, the update $x_{k+1} \leftarrow Ax_k/\|Ax_k\|_2$ is repeatedly applied until some stopping criterion is satisfied. Since no hyperparameter is required, this update rule is practical, and at the same time it attains global linear convergence with the rate $|\lambda_2|/|\lambda_1|$, where $|\lambda_i|$ is the $i$th largest absolute eigenvalue of $A$ (Wilkinson, 1965; Golub and Van Loan, 2012). The linear convergence property of power iteration has been extended to many applications. However, theoretical understanding of when and how such algorithms enjoy this attractive convergence property of power iteration is limited. For example, for convex $f$, a general form of power iteration called the generalized power method (GPM) (Journée et al., 2010) has been shown to attain only a global sublinear convergence rate, with iteration complexity $O(1/\epsilon)$, not generalizing the appealing linear convergence of power iteration. For the historical development of power iteration, see Golub and Van der Vorst (2000); Tapia et al. (2018).

In this work, we present scale invariant problems that generalize the leading eigenvector problem in the sense that any stationary point $x^*$ of (1) satisfying $\nabla f(x^*) = \lambda^* x^*$ for some $\lambda^*$ is an eigenvector of $\nabla^2 f(x^*)$. By this property, scale invariant problems can be seen as the leading eigenvector problem near a local optimum $x^*$, so we can expect that a general form of power iteration would work well for them. By swapping the objective function and the constraint, we obtain a geometrically interpretable dual problem with the goal of finding the closest point $w$ to the origin on the constraint set $f(w) = 1$. By mapping an iterate $x_k$ to the dual space, taking a descent step in the dual space, and mapping it back to the original space, we geometrically derive scale invariant power iteration (SCI-PI), which replaces $Ax_k$ with $\nabla f(x_k)$ in power iteration. SCI-PI is the same algorithm as GPM applied to the unit sphere constraint. However, we improve the convergence rate of GPM for scale invariant problems, showing that the algorithm attains local linear convergence with a generalized rate of power iteration when initialized close to a local maximum. To the best of our knowledge, this is the first work exploiting the properties of scale invariant problems. Also, this is the first linear convergence result of GPM for general optimization problems. This improvement is significant since with linear convergence, the iteration complexity to attain an $\epsilon$-optimal solution reduces from $O(1/\epsilon)$ to $O(\log(1/\epsilon))$. Moreover, under some mild conditions, we provide an explicit expression for the initial condition on $\|x_0 - x^*\|_2$ that ensures convergence.

In the extended settings (Section 4), we discuss three variants of (1). In the first setting, we consider a sum of scale invariant functions (Subsection 4.1) as an objective function. This setting covers Kurtosis-based ICA and can be solved by SCI-PI with similar convergence guarantees. Second, we consider a block version of scale invariant problems (Subsection 4.2), which covers KL-NMF and the Burer-Monteiro factorization of semi-definite programs. To solve this block scale invariant problem, we present a block version of SCI-PI and show that it attains linear convergence in a two-block case. Lastly, we consider partially scale invariant problems (Subsection 4.3), which include general mixture problems such as GMM. For these partially scale invariant problems, we present an alternating algorithm based on SCI-PI and gradient ascent along with its convergence analysis.
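To make the two update rules above concrete, the following minimal sketch in Julia (the language also used for our experiments in Section 5) implements classical power iteration and SCI-PI, which replaces $Ax_k$ with $\nabla f(x_k)$; for $f(x) = \frac{1}{2}x^T Ax$ the two coincide. The function names and the fixed iteration budget are illustrative assumptions, not part of the paper.

```julia
using LinearAlgebra

# Power iteration: x_{k+1} = A*x_k / ||A*x_k||_2; converges to the leading
# eigenvector at rate |λ2|/|λ1| when the eigen-gap is positive.
function power_iteration(A::AbstractMatrix, x0::AbstractVector; iters::Int = 1000)
    x = normalize(x0)
    for _ in 1:iters
        x = normalize(A * x)
    end
    return x
end

# SCI-PI: the same loop with A*x_k replaced by the gradient ∇f(x_k).
function sci_pi(∇f::Function, x0::AbstractVector; iters::Int = 1000)
    x = normalize(x0)
    for _ in 1:iters
        g = ∇f(x)
        iszero(norm(g)) && return x   # stationary point: stop (cf. Algorithm 1)
        x = normalize(g)
    end
    return x
end

A  = Symmetric(randn(20, 20))
v1 = power_iteration(A, randn(20))
v2 = sci_pi(x -> A * x, randn(20))    # ∇f(x) = Ax for f(x) = x'Ax/2
```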
In numerical experiments, we benchmark the proposed algorithms against state-of-the-art methods for ICA, KL-NMF, and GMM. The experimental results show that our algorithms are computationally competitive and result in better solutions in most, if not all, of the studied cases.

In summary, this work has the following contributions.

1. We introduce scale invariant problems, which cover interesting examples in statistics and machine learning. By the eigenvector property (Proposition 4), they resemble the leading eigenvector problem near a local optimum $x^*$.

2. For scale invariant problems, we prove that SCI-PI (a special form of GPM) converges to a local maximum $x^*$ at a linear rate, i.e., within $O(\log(1/\epsilon))$ iterations, when initialized close to $x^*$. This generalizes the attractive convergence property of power iteration. Moreover, we introduce three extended settings of scale invariant problems along with solution algorithms and their convergence analyses.

3. We report numerical experiments including a novel reformulation of KL-NMF as a block scale invariant problem. The experimental results demonstrate that SCI-PI is not only computationally competitive with state-of-the-art methods but also often yields better solutions.

The paper is organized as follows. In Section 2, we define scale invariance and present interesting properties of scale invariant problems including an eigenvector property and a dual formulation. We then provide a geometric derivation of SCI-PI and a convergence analysis in Section 3. The extended settings are discussed in Section 4 and we report the numerical experiments in Section 5. We finish the introduction with a literature review and a notation paragraph.

1.1 Related Works

Power Iteration The global linear convergence property of power iteration is analogous to that of gradient descent for convex optimization. Accordingly, many variants including coordinate-wise (Lei et al., 2016), momentum (Xu et al., 2018), online (Boutsidis et al., 2015; Garber et al., 2015), stochastic (Oja, 1982), stochastic variance-reduced (Shamir, 2015, 2016; Xu et al., 2018; Kim and Klabjan, 2020b), and truncated (Yuan and Zhang, 2013; Han and Liu, 2014) power iterations have been developed, drawing a literature parallel to that of gradient descent for convex optimization. We discover a class of optimization problems which can be locally seen as the leading eigenvector problem, and prove that they can be efficiently solved by a general form of power iteration.

Generalized Power Method (GPM) GPM (Journée et al., 2010) is an iterative algorithm that finds the next iterate by projecting the gradient at the current iterate onto the constraint set. GPM has been applied to statistical problems such as sparse principal component analysis (PCA) (Journée et al., 2010; Luss and Teboulle, 2013) and L1-norm kernel PCA (Kim and Klabjan, 2020a). While GPM has a general form of power iteration, its convergence analysis does not extend the attractive convergence property of power iteration. For example, only global sublinear convergence has been shown for convex $f$. We generalize the local linear convergence property of power iteration to scale invariant problems.

Block Power Iteration A block version of power iteration has been developed to solve the phase synchronization problem (Boumal, 2016).
If the problem consists of a single block and the shift parameter is set to zero, this algorithm specializes to power iteration. Under some conditions on the measurement noise and the initial iterate, it attains linear convergence to a global solution (Liu et al., 2017). To solve the Burer-Monteiro factorization (Burer and Monteiro, 2003) of semi-definite programs (Vandenberghe and Boyd, 1996), Erdogdu et al. (2022) developed a block coordinate maximization (BCM) algorithm. By iteratively sampling a block and applying power iteration to it, BCM attains local linear convergence as well as global sublinear convergence. However, the linear convergence property of block power iteration has not been extended to more general settings. In this work, we prove that block variants of SCI-PI attain linear convergence for block and partially scale invariant problems.

Alternating Minimization Alternating algorithms have been developed for many applications such as k-means clustering (MacQueen, 1967), Gaussian mixture models (Bishop, 2006), dictionary learning (Olshausen and Field, 1997; Aharon et al., 2006), matrix completion (Candès and Recht, 2009), matrix factorization (Lee and Seung, 2001), and finding a point in the intersection of two closed sets (Lewis et al., 2009). Beck (2015) studied alternating minimization and proved that it achieves sublinear convergence for convex programming. For optimization problems with a separable convex objective function and a linear constraint, the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) has been shown to attain linear convergence (Hong and Luo, 2017). However, due to the exact minimization step, ADMM can incur a high per-iteration cost. Instead of performing exact minimization, our algorithms alternately apply simple steps to update blocks, but at the same time they achieve local linear convergence.

Manifold Optimization Viewed as an optimization problem on the real projective space, a scale invariant problem can be reformulated to an equivalent problem in the embedding space. The reformulated problem is unconstrained in the embedding space but it has a highly non-convex structure, e.g., the maximization of the Rayleigh quotient. In order to solve the reformulated problem, general algorithms for unconstrained non-convex optimization such as gradient and Newton methods with line search, and the trust region method (Absil et al., 2009), can be employed. Rather than working in the embedding space, we focus on a generalization of power iteration.

Gauge Optimization A gauge function, which is a nonnegative, convex, and positively homogeneous function that vanishes at the origin, is a multiplicatively scale invariant function. Gauge functions include norms and pseudonorms as special cases and generalize the notion of a norm. The gauge program that minimizes a gauge function over a convex set is introduced in (Freund, 1987) and further studied in (Friedlander et al., 2014). The literature on gauge optimization is mainly about developing and studying dual problems. Conversely, we develop a simple numerical algorithm that solves the primal problem.

1.2 Notation

Let $\mathbb{R}$ and $\mathbb{R}_+$ denote the set of real numbers and the set of non-negative real numbers, respectively. Let $d$ be the dimension of the optimization variable $x$. Let $\mathbb{R}^d$ denote the set of $d$-dimensional real vectors and $f$ be a function from $\mathbb{R}^d$ to $\mathbb{R}$. We denote the gradient and Hessian of a function $f$ as $\nabla f$ and $\nabla^2 f$.
Let $u$ and $v$ be functions from $\mathbb{R}$ to $\mathbb{R}_+$ and from $\mathbb{R} \setminus \{0\}$ to $\mathbb{R}$, representing multiplicative and additive factor functions, respectively, and let $p$ be the degree of a scale invariant function, which equals the degree of homogeneity for a multiplicatively scale invariant function and 0 for an additively scale invariant function. We use $(\lambda_i, v_i)$ and $(s_i, u_i)$ to represent eigen-pairs. The $j$th element of $v_i$ is denoted as $v_{i,j}$. Let $k$ be the iteration index, and we denote the sequences of iterates and function values by $\{x_k\}_{k=0,1,\ldots}$ and $\{f(x_k)\}_{k=0,1,\ldots}$, respectively. Lastly, we let $\odot$, $\oslash$ and $(\cdot)^{\odot 2}$ denote element-wise product, division and square, respectively, and let $1_n \in \mathbb{R}^n$ denote the vector of $n$ ones.

2. Scale Invariant Problems

Before presenting properties of scale invariant problems, we first define scale invariant functions.

Definition 1 We say that a function $f : \mathbb{R}^d \to \mathbb{R}$ is multiplicatively scale invariant if it satisfies
$$f(cx) = u(c) f(x) \qquad (2)$$
for some even function $u : \mathbb{R} \to \mathbb{R}_+$ with $u(0) = 0$. Also, we say that $f : \mathbb{R}^d \setminus \{0\} \to \mathbb{R}$ is additively scale invariant if it satisfies
$$f(cx) = f(x) + v(c) \qquad (3)$$
for some even function $v : \mathbb{R} \setminus \{0\} \to \mathbb{R}$ with $v(1) = 0$.

The following proposition characterizes the exact forms of $u$ and $v$ for continuous $f$.

Proposition 2 If a continuous function $f \not\equiv 0$ satisfies (2) with a multiplicative factor $u$, then we have
$$u(c) = |c|^p \qquad (4)$$
for some $p > 0$. Also, if a continuous function $f$ satisfies (3) with an additive factor $v$, then we have
$$v(c) = \log_a |c| \qquad (5)$$
for some $a$ such that $0 < a$ and $a \neq 1$.

Using the explicit forms of $u$ and $v$ in Proposition 2, we establish derivative-based properties of scale invariant functions below.

Proposition 3 Suppose that $f$ is twice differentiable. If $f$ satisfies (2) with a multiplicative factor $u(c) = |c|^p$, we have
$$c\,\nabla f(cx) = |c|^p\,\nabla f(x), \quad \nabla f(x)^T x = p f(x), \quad \nabla^2 f(x)\,x = (p-1)\,\nabla f(x). \qquad (6)$$
Also, if $f$ satisfies (3) with an additive factor $v(c) = \log_a |c|$, we have
$$c\,\nabla f(cx) = \nabla f(x), \quad \nabla f(x)^T x = \frac{1}{\log(a)}, \quad \nabla^2 f(x)\,x = -\nabla f(x). \qquad (7)$$

Proposition 3 states that a scale invariant function $f$ satisfies that $\nabla^2 f(x)x$ is a scalar multiple of $\nabla f(x)$. Let
$$L(\lambda, x) = f(x) + \lambda\,(1 - \|x\|_2)$$
be the Lagrange function of (1) and $(\lambda^*, x^*)$ be a stationary point satisfying $\nabla L(\lambda^*, x^*) = 0$ such that
$$\nabla f(x^*) = \lambda^* x^*, \qquad \|x^*\|_2 = 1. \qquad (8)$$
In the next proposition, we derive an eigenvector property which states that for any stationary point $(\lambda^*, x^*)$ of (1) satisfying (8), $x^*$ is an eigenvector of $\nabla^2 f(x^*)$.

Proposition 4 Suppose that $f$ is twice differentiable and let $(\lambda^*, x^*)$ be a stationary point of (1) satisfying (8). If $f$ satisfies (2) with $u(c) = |c|^p$, then we have $\nabla^2 f(x^*)\,x^* = (p-1)\lambda^* x^*$. Also, if $f$ satisfies (3) with $v(c) = \log_a |c|$, then we have $\nabla^2 f(x^*)\,x^* = -\lambda^* x^*$. In both cases, $x^*$ is an eigenvector of $\nabla^2 f(x^*)$. Moreover, if $\lambda^*$ is greater than the largest eigenvalue of $\nabla^2 f(x^*)(I - x^*(x^*)^T)$, then $x^*$ is a local maximum of (1).

Proof If $f$ is multiplicatively scale invariant with degree $p$, by Proposition 3, we have $\nabla^2 f(x^*)\,x^* = (p-1)\,\nabla f(x^*) = (p-1)\lambda^* x^*$. Also, by Proposition 3, if $f$ is additively scale invariant, we have $\nabla^2 f(x^*)\,x^* = -\nabla f(x^*) = -\lambda^* x^*$. Therefore, in both cases, a stationary point $x^*$ is an eigenvector of $\nabla^2 f(x^*)$. Suppose that $\lambda^*$ is greater than the largest eigenvalue of $\nabla^2 f(x^*)(I - x^*(x^*)^T)$. For any nonzero $h$ satisfying $h^T x^* = 0$, we have
$$h^T \nabla^2_{xx} L(x^*, \lambda^*)\,h = h^T \big(\nabla^2 f(x^*) - \lambda^* (I - x^*(x^*)^T)\big) h = h^T \nabla^2 f(x^*)(I - x^*(x^*)^T) h - \lambda^* \|h\|_2^2 < 0.$$
Since the second-order sufficient condition is satisfied, $x^*$ is a local maximum.

Proposition 4 states that a stationary point $x^*$ is an eigenvector of $\nabla^2 f(x^*)$.
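As a concrete illustration, for $f(x) = \frac{1}{2}x^T Ax$ with symmetric $A$ (multiplicatively scale invariant with $p = 2$), every unit eigenvector of $A$ is a stationary point of (1), and the Lagrange multiplier $\lambda^*$ is the Rayleigh quotient. The following minimal Julia check of Propositions 3 and 4 uses a random test instance, which is an illustrative assumption rather than content from the paper.

```julia
using LinearAlgebra

# f(x) = x'Ax/2 satisfies f(cx) = |c|^2 f(x): multiplicatively scale invariant, p = 2.
A  = Symmetric(randn(5, 5))
f(x)  = dot(x, A * x) / 2
∇f(x) = A * x                       # the Hessian is ∇²f(x) = A

x = normalize(randn(5))
@assert dot(∇f(x), x) ≈ 2 * f(x)    # Euler identity: ∇f(x)'x = p f(x)
@assert A * x ≈ (2 - 1) * ∇f(x)     # ∇²f(x)x = (p − 1)∇f(x)

λs, V = eigen(A)                    # stationary points of (1) are eigenvectors of A
xstar = V[:, end]                   # leading unit eigenvector
λstar = dot(xstar, ∇f(xstar))       # Lagrange multiplier λ*: ∇f(x*) = λ* x*
@assert ∇f(xstar) ≈ λstar * xstar
```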
Note that the Lagrange multiplier $\lambda^*$ is not necessarily an eigenvalue corresponding to $x^*$. The eigenvalue corresponding to $x^*$ is $(p-1)\lambda^*$ if $f$ is multiplicatively scale invariant, or $-\lambda^*$ if $f$ is additively scale invariant. The second-order sufficient condition for local optimality requires that the Lagrange multiplier $\lambda^*$, rather than the eigenvalue corresponding to $x^*$, is greater than the largest eigenvalue of $\nabla^2 f(x^*)(I - x^*(x^*)^T)$. Due to this eigenvector property, scale invariant problems can be considered as a generalization of the leading eigenvector problem. Next, we introduce a dual formulation of scale invariant problems.

Proposition 5 Suppose that the objective function $f$ is continuous and either multiplicatively scale invariant with a positive optimal value and a multiplicative factor $u(c) = |c|^p$ such that $p > 0$, or additively scale invariant having an additive factor $v(c) = \log_a |c|$ such that $a > 1$. Then, solving (1) is equivalent to solving the following optimization problem:
$$\text{minimize } \|w\|_2 \quad \text{subject to } f(w) = 1. \qquad (9)$$
In other words, if $x^*$ is an optimal solution to (1), then $w^* = x^*/f(x^*)^{1/p}$ (multiplicative case) or $w^* = a^{1 - f(x^*)} x^*$ (additive case) is an optimal solution to (9). Conversely, if $w^*$ is an optimal solution to (9), then $x^* = w^*/\|w^*\|_2$ is an optimal solution to (1).

For a multiplicatively scale invariant $f$ having a negative optimal value and a multiplicative factor $u(c) = |c|^p$ with $p > 0$, we can derive a similar reformulation by replacing $f(w) = 1$ with $f(w) = -1$. On the other hand, for an additively scale invariant $f$ having an additive factor $v(c) = \log_a |c|$ such that $0 < a < 1$, we obtain a maximization problem with the same objective function and constraint. The dual formulation (9) has a nice geometric interpretation: an optimal solution $w^*$ is the closest point to the origin on $\{w : f(w) = 1\}$. We use this understanding to derive SCI-PI in Section 3. Lastly, we introduce two well-known examples of scale invariant problems in machine learning and statistics.

Example 1 ($L^p$-norm Kernel PCA) Given data vectors $a_i \in \mathbb{R}^d$ and a mapping $\Phi$, $L^p$-norm kernel PCA considers
$$\text{maximize } \frac{1}{n} \sum_{i=1}^n |\Phi(a_i)^T x|^p \quad \text{subject to } x \in \mathcal{B}^d \qquad (10)$$
where the objective function satisfies property (2) with $u(c) = |c|^p$.

Example 2 (Estimation of Mixture Proportions) Given a design matrix $L \in \mathbb{R}^{n \times d}$ satisfying $L_{ij} \ge 0$, the problem of estimating mixture proportions seeks to find a vector $\pi$ of mixture proportions on the probability simplex $S^d = \{\pi : \sum_{j=1}^d \pi_j = 1,\ \pi \ge 0\}$ that solves
$$\text{maximize } \frac{1}{n} \sum_{i=1}^n \log\Big(\sum_{j=1}^d L_{ij}\pi_j\Big) \quad \text{subject to } \pi \in S^d. \qquad (11)$$
By reparametrizing $\pi_j$ by $x_j^2$, we obtain an equivalent optimization problem
$$\text{maximize } \frac{1}{n} \sum_{i=1}^n \log\Big(\sum_{j=1}^d L_{ij} x_j^2\Big) \quad \text{subject to } x \in \mathcal{B}^d, \qquad (12)$$
which now satisfies property (3) with $v(c) = 2 \log |c|$.

The reformulation idea in Example 2 implies that any simplex-constrained problem with scale invariant $f$ can be reformulated to a scale invariant problem.

3. Scale Invariant Power Iteration

In this section, we provide a geometric derivation of SCI-PI to find a local optimal solution of (1). The algorithm is developed using the geometric interpretation of the dual formulation (9), as illustrated in Figure 1. Starting with an iterate $x_k \in \mathcal{B}^d$, we obtain a dual iterate $w_k$ by mapping $x_k$ to the constraint set $f(w) = 1$. Given $w_k$, we identify the hyperplane $l_k$ on which the current iterate $w_k$ lies and which is tangent to $f(w) = 1$. After identifying the equation of $l_k$, we find the closest point $z_k$ to the origin on $l_k$ and obtain a new dual iterate $w_{k+1}$ by mapping $z_k$ to the constraint $f(w) = 1$.
Finally, we obtain a new primal iterate $x_{k+1}$ by mapping $w_{k+1}$ back to the set $\mathcal{B}^d$.

Now, we develop an algorithm based on the above idea. For the derivation of the algorithm, we assume that the objective function $f$ is differentiable and satisfies either (2) with $u(c) = |c|^p$ where $p > 0$ and $f(x) > 0$ for all $x \in \mathcal{B}^d$, or (3) with $v(c) = \log_a |c|$ where $a > 1$. Under these conditions, the mapping from $x_k$ to $w_k$ is well defined as $w_k = x_k/f(x_k)^{1/p}$ or $w_k = a^{1 - f(x_k)} x_k$, respectively. Let $w_k = c_k x_k$. Since $w_k$ is on the constraint set $f(w) = 1$, the normal vector of the hyperplane $l_k$ is $\nabla f(w_k)$. Therefore, we can write down the equation of the hyperplane as $l_k = \{w : \nabla f(w_k)^T (w - w_k) = 0\}$. Note that $z_k$ is a scalar multiple of $\nabla f(w_k)$, where the scalar can be determined from the requirement that $z_k$ lies on $l_k$. Since $w_{k+1}$ is the projection of $z_k$, it must be a scalar multiple of the normal vector $y_k = \nabla f(w_k)$. Therefore, we can write $w_{k+1}$ as $w_{k+1} = d_k y_k$. Finally, by projecting $w_{k+1}$ onto $\mathcal{B}^d$, we obtain
$$x_{k+1} = \frac{w_{k+1}}{\|w_{k+1}\|_2} = \frac{d_k y_k}{\|d_k y_k\|_2} = \frac{y_k}{\|y_k\|_2} = \frac{\nabla f(w_k)}{\|\nabla f(w_k)\|_2} = \frac{\nabla f(c_k x_k)}{\|\nabla f(c_k x_k)\|_2} = \frac{\nabla f(x_k)}{\|\nabla f(x_k)\|_2}$$
where the last equality follows from Proposition 3. The update rule is the linear optimization oracle on $\mathcal{B}^d$. In the sense that SCI-PI finds an optimal solution by solving a sequence of linear optimization problems, it is similar to the Frank-Wolfe algorithm (also called conditional gradients) and online linear prediction algorithms (Huang et al., 2017). Summarizing all the above, we obtain SCI-PI, presented in Algorithm 1.

Figure 1: Geometric derivation of SCI-PI

Algorithm 1 SCI-PI
Input: initial point $x_0 \in \mathcal{B}^d$
$k \leftarrow 0$
while $\nabla f(x_k) \neq 0$ do
  $x_{k+1} \leftarrow \nabla f(x_k)/\|\nabla f(x_k)\|_2$
  $k \leftarrow k + 1$
end while
Output: $x_k$

Note that in Figure 1, $\{\|w_k\|_2\}_{k=0,1,\ldots}$ is non-increasing if the sublevel set $\{w : f(w) \le 1\}$ is convex. Since all sublevel sets of a quasi-convex function are convex, we can expect that SCI-PI yields an ascent step if $f$ is quasi-convex (not necessarily scale invariant). See Proposition 11 in Appendix B for the convergence to a stationary point for quasi-convex $f$. If $f$ is not quasi-convex, the sequence $\{f(x_k)\}_{k=0,1,\ldots}$ is not necessarily increasing, making it hard to analyze global convergence. Exploiting the eigenvector property, we study local convergence of SCI-PI for scale invariant $f$ below.

Theorem 6 Let $f$ be a scale invariant, twice continuously differentiable function on an open set containing $\mathcal{B}^d$. Let $x^*$ be a local maximum such that $\nabla f(x^*) = \lambda^* x^*$, and let $(\lambda_i, v_i)$ be an eigen-pair of $\nabla^2 f(x^*)$ with $x^* = v_1$. If $\lambda^* > \lambda_2 = \max_{2 \le i \le d} |\lambda_i|$, then there exists some $\delta > 0$ such that under the initial condition $\|x_0 - x^*\|_2 < \delta$, the sequence of iterates $\{x_k\}_{k=0,1,\ldots}$ generated by SCI-PI satisfies
$$\|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1}\Big(\frac{\lambda_2}{\lambda^*} + \gamma_t\Big)^2 \|x_0 - x^*\|_2^2,$$
where $\lambda_2/\lambda^* + \gamma_t < 1$ for all $t \ge 0$ and $\lim_{k \to \infty} \gamma_k = 0$. Moreover, if $\partial_j f = \partial f/\partial x_j$ has a continuous Hessian $H_j$ on an open set containing $\bar{\mathcal{B}}^d = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$, we can explicitly write $\delta$ as
$$\delta(\lambda^*, \bar{\lambda}_1, \lambda_2, M) = \min\Big\{\frac{2\lambda^*}{\lambda^* + |\lambda^* - \bar{\lambda}_1 - M|},\ \frac{2(\lambda^* - \lambda_2)}{\lambda^* - \lambda_2 + M + |\lambda^* - \bar{\lambda}_1 - M|}\Big\}$$
where $\bar{\lambda}_1 = |\lambda_1|$ and
$$M = \max_{x \in \bar{\mathcal{B}}^d,\, y_1,\ldots,y_d \in \bar{\mathcal{B}}^d} \sqrt{\textstyle\sum_{i=1}^d (x^T G_i(y_1,\ldots,y_d)\,x)^2}, \qquad G_i(y_1,\ldots,y_d) = \textstyle\sum_{j=1}^d v_{i,j}\,H_j(y_j).$$

Proof Since $\nabla^2 f(x^*)$ is real and symmetric, without loss of generality, we assume that $\{v_1,\ldots,v_d\}$ form an orthogonal basis in $\mathbb{R}^d$. Since $f$ is twice continuously differentiable on an open set containing $\mathcal{B}^d$, for $x \in \mathcal{B}^d$, using the Taylor expansion of $\nabla f(x)^T v_i$ at $x^*$, we have
$$\nabla f(x)^T v_i = \nabla f(x^*)^T v_i + (x - x^*)^T \nabla^2 f(x^*) v_i + R_i(x) \qquad (13)$$
where
$$R_i(x) = o(\|x - x^*\|_2). \qquad (14)$$
From $\nabla f(x^*) = \lambda^* x^*$ and $x^* = v_1$, we have
$$\nabla f(x)^T v_1 = \nabla f(x^*)^T x^* + (x - x^*)^T \nabla^2 f(x^*)\,x^* + R_1(x) = \lambda^* - \lambda_1(1 - x^T x^*) + R_1(x) = \lambda^* + \alpha(x) \qquad (15)$$
where $\alpha(x) = -\lambda_1(1 - x^T x^*) + R_1(x)$
(16) Kim, Kim and Klabjan On the other hand, for 2 i d, due to f(x ) = λ x , x = v1 and the orthogonality of {v1, . . . , vd}, we have f(x )T vi = λ (x )T vi = 0. (17) From (13), this results in f(x)T vi = λix T vi + Ri(x). (18) Using (18) and the definition of λ2, we have f(x)T vi 2 = h λ2 i (x T vi)2 + 2λi(x T vi)Ri(x) + (Ri(x))2i i=2 (x T vi)2 + 2 λ2 i=2 |x T vi||Ri(x)| + i=2 (Ri(x))2 . By the Cauchy Schwartz inequality, we have i=2 |x T vi||Ri(x)| λ2 i=2 (x T vi)2 i=2 (Ri(x))2, which results in f(x)T vi 2 λ2 2 i=2 (x T vi)2 + 2 λ2 i=2 (x T vi)2 i=2 (Ri(x))2 + i=2 (Ri(x))2 i=2 (x T vi)2 + β(x) i=2 (Ri(x))2. (21) Using 1 x T x = o( x x 2) and (14) for (16) and (21), we have α(x) = o( x x 2), β(x) = o( x x 2). (22) From x Bd, x = v1, and the fact that {v1, , vd} forms an orthogonal basis in Rd, we have d X i=2 (x T vi)2 = 1 (x T v1)2 = 1 (x T x )2 2(1 x T x ) = x x 2 2. (23) Therefore, by (15), (20), (22), (23), and Lemma 12, we obtain the first part of the desired result. Scale Invariant Power Iteration On the other hand, if jf has a continuous Hessian Hj, by Lemma 14, we have i=1 (Ri(x))2 1 4M2 x x 4 2. (24) Using (24) for (16) and (21), we have α(x) = λ1(1 x T k x ) + R1(x) (M + |λ1|)(1 x T x ), i=2 (Ri(x))2 M 2 x x 2 2. (25) Therefore, using (15), (20), (23), (25) and Lemma 12 with A = λ , B = M + |λ1|, C = 0, D = λ2, E = 0, F = M, we obtain the desired result. Theorem 6 states local convergence of SCI-PI with an asymptotic rate of λ / λ2. Note that the assumption that the Lagrange multiplier λ corresponding to a local maximum x satisfies λ > λ2 = max2 i d |λi| holds for all strict local maxima if f is convex, multiplicatively scale invariant with p 1 since λi 0 for all i and λ = (p 1)λ1, according to Proposition 4. However, in general, not all local maxima satisfy this assumption since it is stronger than the second-order sufficient condition stated as λ > max2 i d λi in Proposition 4. Nevertheless, by adding σ x 2 for some σ > 0 to the objective function f, we can always enforce λ > λ2. Conversely, by adding σ x 2 for some σ < 0, we may improve the convergence rate as done by shifted power iteration (Golub and Van Loan, 2012). The convergence rate of γk is o(1) for twice continuously differentiable f. If jf has a continuous Hessian, we further have γk = O( xk x 2). (For the derivations of the convergence rate of γk, see the proofs of Lemmas 12 and 14.) The non-convexity of the objective function hinders the attainment of global guarantees for SCI-PI. While Theorem 6 establishes a condition on the initial iterate that guarantees local convergence, finding an initial point sufficiently close to a global optimal solution is a challenging task. To ensure global guarantees, additional conditions on problem-specific parameters are necessary. For example, mild assumptions on problem parameters in affine phase retrieval render the objective function strongly convex, leading to global optimality (Huang and Xu, 2022). Nevertheless, local convergence remains an appealing property since random initialization often provides a satisfactory starting point (Chen et al., 2019). Reduction to power iteration For the leading eigenvector problem to find the leading eigenvector of a positive semi-definite matrix A 0, the objective function is f(x) = 1 2x T Ax, and thus SCI-PI specializes to power iteration. Let λi be the ith largest eigenvalue of A. The condition λ > λ2 in Theorem 6 is interpreted as the positive eigen-gap λ1 λ2 > 0 assumption. 
The convergence result in Theorem 6 not only matches the convergence rate of λ1/λ2 but also restores the initial condition δ < 2 or x T 0 x > 0 of power iteration since M = 0. Kim, Kim and Klabjan Comparision to generalized power method Under the spherical constraint, generalized power method (Journée et al., 2010) has the same update rule as SCI-PI. Generalized power method has shown to attain sublinear convergence for convex f. Local linear convergence in Theorem 6 has been not shown for generalized power method. This convergence result established for a scale invariant objective function with the spherical constraint is extended to various settings in the next section. 4. Extended Settings 4.1 Sum of Scale Invariant Functions Consider a sum of scale invariant functions of the form f(x) = Pm i=1 f M i (x) + Pn j=1 f A j (x) where f M i is multiplicatively scale invariant with ui(c) = |c|pi and f A j is additively scale invariant with vj(c) = logaj |c|. Note that this does not imply that f is scale invariant in general. Thus, a stationary point x satisfying f(x ) = λ x is not necessarily an eigenvector of 2f(x ). Instead, a stationary point x is an eigenvector of F(x ) defined as j=1 2f A j (x). i=1 f M i (x) + j=1 f A j (x) = F(x)x by Proposition 3. Here is an example that involves a sum of scale invariant functions. Example 3 (Kurtosis-based ICA) Given a pre-processed data matrix W Rn d, Kurtosisbased ICA (Hyvärinen and Oja, 2000) solves maximize 1 n (w T i x)4 3 2 subject to x Bd. (26) The objective function f is a sum of scale invariant functions. We present a local convergence analysis of SCI-PI for a sum of scale invariant functions as follows. Theorem 7 Let f be a sum of scale invariant functions and twice continuously differentiable on an open set containing Bd. Let x be a local maximum such that f(x ) = λ x and {v1, , vd} be a set of eigenvectors of F(x ) with x = v1. If λ > λ2 = 2f(x )(I x (x )T ) 2, then there exists some δ > 0 such that under the initial condition x0 x 2 < δ, the sequence of iterates {xk}k=0,1, generated by SCI-PI satisfies 2 x0 x 2 2, Scale Invariant Power Iteration λ2 λ + γt < 1 for all t 0 and lim k γk = 0. Moreover, if jf = f/ xj has a continuous Hessian Hj on an open set containing Bd, we can explicitly write δ as δ(λ , λ1, λ2, M) = 2(λ λ2) λ + M + λ1 + |λ M| 2 2f(x )x 2 and M = max x Bd,y1, ,yd Bd q Pd i=1(x T Gi(y1, , yd)x)2, Gi(y1, , yd) = Pd j=1 vi,j Hj(yj). Proof By the stationarity condition, a local optimal solution x is an eigenvector of F(x ). Since F(x ) is real and symmetric, without loss of generality, we assume that {v1, , vd} form an orthogonal basis in Rd. Since f is twice continuously differentiable on an open set containing Bd, for x Bd, using the Taylor expansion of f(x)T vi at x , we have f(x)T vi = f(x )T vi + (x x )T 2f(x )vi + Ri(x) (27) where Ri(x) = o( x x 2). (28) Using (27) with i = 1 and f(x ) = λ x , we obtain f(x)T v1 = λ (x )T v1 + (x x )T 2f(x )v1 + R1(x) = λ + α(x) (29) where α(x) = (x x )T 2f(x )v1 + R1(x). (30) Using (27) and f(x ) = λ x for 2 i d, we have f(x)T vi = λ (x )T vi + (x x )T 2f(x )vi + Ri(x) = (x x )T 2f(x )vi + Ri(x), resulting in i=2 ( f(x)T vi)2 = (x x )T 2f(x )vi + Ri(x) 2 . (31) From x = v1 and the fact that {v1, , vd} forms an orthogonal basis in Rd, we have (x x )T 2f(x )vi 2 = 2f(x )(x x ) 2 2 (x x )T 2f(x )v1 2 = (x x )T 2f(x ) I x (x )T 2f(x )(x x ) = (x x )T 2f(x ) I x (x )T 2 2f(x )(x x ). 
Kim, Kim and Klabjan 2f(x ) I x (x )T 2 2f(x ) 2 = I x (x )T 2f(x ) 2 2 = 2f(x ) I x (x )T 2 2, (x x )T 2f(x )vi 2 λ2 2 x x 2 2. (32) Also, using (32) and the Cauchy-Schwartz inequality, we obtain i=2 (x x )T 2f(x )vi Ri(x) i=2 |(x x )T 2f(x )vi||Ri(x)| i=2 ((x x )T 2f(x )vi)2 i=2 Ri(x)2. Using (32) and (33) for (31), we obtain i=2 ( f(x)T vi)2 λ2 2 x x 2 2 + 2 λ2 x x 2 i=2 Ri(x)2 + i=2 Ri(x)2, resulting in d X i=2 ( f(x)T vi)2 λ2 x x 2 2 + β(x) 2 (34) i=2 Ri(x)2. (35) Using (x x )T 2f(x )v1 = o( p x x 2) and (28) for (30) and (35), we have α(x) = o( p x x 2), β(x) = o( x x 2). (36) By (29), (34), (36), and Lemma 12, we obtain the first part of the desired result. On the other hand, if jf has a continuous Hessian Hj, by Lemma 14, we have i=1 (Ri(x))2 1 4M2 x x 4 2. (37) Scale Invariant Power Iteration Using |(x x )T 2f(x )v1| λ1 q 1 x T k x and (37) for (30) and (35), this leads to α(x) = (x x )T 2f(x )v1 + R1(x) λ1 q 1 x T k x M(1 x T k x ), i=2 Ri(x)2 M 2 x x 2 2. (38) By (29), (34), (38), and Lemma 13 with A = λ , B = M, C = λ1, D = 0, E = λ2, F = M, δ(λ , λ1, λ2, M) = min λ + |λ M| + λ1 , 2(λ λ2) λ + M + λ1 + |λ M| 2(λ λ2) λ + M + λ1 + |λ M|, which completes the proof. Note that λ1 has the additional 2 factor which comes from the fact that x is not necessarily an eigenvector of 2f(x ). Nonetheless, the asymptotic convergence rate in Theorem 7 provides a generalization of the convergence rate in Theorem 6. 4.2 Block Scale Invariant Problems Next, consider a class of optimization problems having the form of maximize f(x, y) subject to x Bdx, y Bdy where f : Rdx+dy R is scale invariant in x for fixed y and vice versa. A stationary point (x , y ) satisfies xf(x , y ) = λ x and yf(x , y ) = s y for some λ , s R, and x and y are eigenvectors of 2 xxf(x , y ) and 2 yyf(x , y ), respectively, according to Proposition 4. Some examples of block scale invariant problems are given next. Example 4 (Semidefinite Programming (SDP) (Vandenberghe and Boyd, 1996)) Let A, X Rn n. Given an SDP problem maximize A, X subject to Xii = 1, i {1, 2, , n}, X 0, the Burer-Monteiro approach (Burer and Monteiro, 2003) yields the following block scale invariant problem (Erdogdu et al., 2022) maximize A, σσT subject to σi 2 = 1, i {1, 2, , n} where σi denotes the ith row of σ Rn r. Kim, Kim and Klabjan Example 5 (Kullback-Leibler (KL) divergence NMF) The KL-NMF problem (Févotte and Idier, 2011; Lee and Seung, 2001; Wang and Zhang, 2013) is defined as minimize DKL(V WH) Vij log Vij Pr q=1 Wiq Hqj Vij + q=1 Wiq Hqj subject to Wiq 0, Hqj 0, i = 1, , n, j = 1, , m, q = 1, , r. Many popular algorithms (Lee and Seung, 2001; Lin, 2007) for the KL-NMF problem are based on alternating minimization of W and H. Since the objective function can be decomposed over j, given W 0 and j {1, , m}, we consider a subproblem of the form minimize fj KL(h) = Vij log Vij Pr q=1 Wiqhj Vij + subject to hq 0 (40) where hq = Hqj. Note that the KL-NMF problem in the form of (39) is not a block scale invariant problem. However, using a novel reformulation, we show that the KL divergence NMF subproblem is indeed a scale invariant problem. Lemma 8 The KL-NMF subproblem (40) is equivalent to the following scale invariant problem i=1 Vij log q=1 Wiq hq subject to q=1 hq = 1, hq 0, (41) with the relationship (Pn i=1 Vij) hq = (Pn i=1 Wiq)hq. Proof Since a log-linear function is concave, (40) is a convex problem in h. Consider the Lagrangian of the original problem L(h, λ) = fj KL(h) q=1 λqhq, λ 0. 
(42) Let h be an optimal solution to (40) and λ be a vector in Rr satisfying the following first-order KKT conditions fj KL(h ) = λ q1m, λ qh q = 0, q = 1, , r (43) where fj KL denotes the derivative of fj KL with respect to h. Since (43) implies Pr q=1 h qλ q = 0, we have q=1 h qλ q = q=1 h q fj KL(h ) = Vij Wiqh q Pr q =1 Wiq h q + q=1 Wiqh q, resulting in n X q=1 Wiqh q. (44) Scale Invariant Power Iteration minimize fj SCI(h) = i=1 Vij log Vij Pr q=1Wiqhq subject to q=1 Wiqhq, hq 0, (45) and let f KL and f SCI be the optimal objective values of (40) and (45), respectively. We prove the equivalence of (45) and (40) by the following arguments: 1. Since (45) has an additional constraint Pn i=1 Vij = Pn i=1 Pr q=1 Wiqhq compared to (40), it always satisfies f SCI f KL . 2. Since we have shown that Pn i=1 Vij = Pn i=1 Pr q=1 Wiqh q, a solution h of (40) is a feasible point of (45). This implies f KL f SCI. Now, we reparameterize h by h so that Pn i=1 Vij = Pn i=1 Pr q=1 Wiqhq if and only if Pr q=1 hq = 1, which yields the relationship between two variables hq = hq(Pn i=1 Wiq)/(Pn i=1 Vij). Note that (41) is a mixture proportion estimation problem (Example 2) and thus a scale invariant problem. To solve block scale invariant problems, we consider an alternating maximization algorithm called block SCI-PI, which repeats xk+1 xf(xk, yk)/ xf(xk, yk) 2, yk+1 yf(xk, yk)/ yf(xk, yk) 2. (46) We present a local convergence result of block SCI-PI below. Theorem 9 Suppose that f is twice continuously differentiable on an open set containing Bdx Bdy and let (x , y ) be a local maximum satisfying xf(x , y ) = λ x , λ > λ2 = max 2 i dx |λi|, yf(x , y ) = s y , s > s2 = max 2 j dy |sj| where (λi, vi) and (sj, uj) are eigen-pairs of 2 xxf(x , y ) and 2 yyf(x , y ), respectively with x = v1 and y = u1. If ν2 = 2 yxf(x , y ) 2 2 < (λ λ2)(s s2), then for the sequence of iterates {(xk, yk)}k=0,1, generated by (46), there exists some δ > 0 such that if 0 < δ, then we have 2 k Qk 1 t=0 (ρ + γt)2 2 0 and limk γk = 0 Kim, Kim and Klabjan Proof From Lemma 15 with x = xk, y = yk, we have Pdx i=2( xf(xk, yk)T vi)2 ( xf(xk, yk)T v1)2 λ2 λ xk x 2 + ν λ yk y 2 + θx(xk, yk) 2 . xk+1 = xf(xk, yk) xf(xk, yk) 2 , s Pdx i=2( xf(xk, yk)T vi)2 ( xf(xk, yk)T v1)2 λ2 λ xk x 2 + ν λ yk y 2 + θx(xk, yk). (47) Using Lemma 15 for x = yk, y = xk and the definition of yk+1, we have s xk x 2 + s2 s yk y 2 + θy(xk, yk). (48) Combining (47) and (48), we obtain " xk+1 x 2 yk+1 y 2 λ2 λ ν λ ν s s2 s " xk x 2 yk y 2 " θx(xk, yk) θy(xk, yk) Since ρ < 1 due to ν2 < (λ λ2)(s s2), by Lemma 17, we obtain the desired result. Being the spectral norm of the off-diagonal block of the Hessian at the local maximum (x , y ), ν measures how much partial derivatives of one block of variables are affected by the other block of variables. If the objective function f is separable in x and y as in the case of the KL-NMF problem, ν becomes zero, and we have ρ = max { λ2/λ , s2/s }. Note that ρ increases as ν increases. If ν2 becomes larger than (λ λ2)(s s2), the Jacobi update rule (46) may fail due to the interaction effects between x and y. On the other hand, the result of Theorem 6 can be restored by dropping x or y in Theorem 9. While we consider the two-block case here, the algorithm and the convergence analysis can be easily generalized to more than two blocks. 
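As a concrete sketch of the two-block update (46): both blocks are updated in Jacobi fashion from the current iterate $(x_k, y_k)$. Below is a minimal Julia implementation under the assumption that the partial gradients are supplied as callables; the function name and iteration budget are illustrative.

```julia
using LinearAlgebra

# Block SCI-PI, update (46): Jacobi-style normalized partial-gradient steps.
# ∇xf(x, y) and ∇yf(x, y) return ∇_x f and ∇_y f at the current iterate.
function block_sci_pi(∇xf, ∇yf, x0, y0; iters::Int = 500)
    x, y = normalize(x0), normalize(y0)
    for _ in 1:iters
        xn = normalize(∇xf(x, y))   # x_{k+1} = ∇_x f(x_k, y_k) / ‖∇_x f(x_k, y_k)‖₂
        yn = normalize(∇yf(x, y))   # y_{k+1} = ∇_y f(x_k, y_k) / ‖∇_y f(x_k, y_k)‖₂
        x, y = xn, yn
    end
    return x, y
end
```

Per Theorem 9, this iteration contracts locally when the cross-block interaction is weak, i.e., when $\nu^2 < (\lambda^* - \lambda_2)(s^* - s_2)$.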
4.3 Partially Scale Invariant Problems Lastly, we consider a class of optimization problems of the form maximize f(x, y) subject to x Bdx where f(x, y) : Rdx+dy R is a scale invariant function in x for each y Rdy. A partially scale invariant problem has the form of (1) with respect to x once y is fixed. If x is fixed, we obtain an unconstrained optimization problem with respect to y. A stationary point (x , y ) satisfies xf(x , y ) = λ x and yf(x , y ) = 0 for some λ R, and x is an eigenvector of 2 xxf(x , y ). Here is an example of partially scale invariant problems. Scale Invariant Power Iteration Example 6 (Gaussian Mixture Model (GMM)) The GMM problem is defined as j=1 πj N(ai; µj, Σj) subject to π Sd. Note that the objective function is scale invariant for fixed µj and Σj, and µj is unconstrained. If we assume some structure on Σj, estimation of Σj can also be unconstrained. For general Σj, semi-positive definiteness is necessary for Σj. To solve partially scale invariant problems, we consider an alternating maximization algorithm based on SCI-PI and the gradient method as xk+1 xf(xk, yk)/ xf(xk, yk) 2, yk+1 yk + α yf(xk, yk). (50) While the gradient method is used in (50), any method for unconstrained optimization can replace it. We present a convergence analysis of (50) below. Theorem 10 Suppose that f(x, y) is scale invariant in x for each y Rdy, µ-strongly concave in y with an L-Lipschitz continuous yf(x, y) for each x Bdx, and three-times continuously differentiable on an open set containing Bdx Rdy. Let (x , y ) be a local maximum satisfying f(x ) = λ x , λ > λ2 = max2 i dx|λi| where (λi, vi) is an eigen-pair of 2 xx f(x , y ) with x = v1. If ν2 = 2 yxf(x , y ) 2 2 < µ(λ λ2), then for the sequence of iterates {(xk, yk)}k=0,1, generated by (50) with α = 2/(L+µ), there exists some δ > 0 such that if 0 < δ, then we have 2 k Qk 1 t=0 (ρ + γt)2 2 0 and limk γk = 0 Proof Using Lemma 15 for x = xk, y = yk and the definition of xk+1, we have xk+1 x 2 λ2 λ xk x 2 + ν λ yk y 2 + θx(xk, yk). (51) By Lemma 16 with x = xk, y = yk, we also have yk+1 y 2 2ν L + µ xk x 2 + L µ yk y 2 + θy(xk, yk). (52) Kim, Kim and Klabjan Combining (51) and (52), we obtain " xk+1 x 2 yk+1 y 2 2ν L + µ L µ L + µ " xk x 2 yk y 2 " θx(xk, yk) Note that since ν2 < µ(λ λ2), the spectral radius ρ satisfies Therefore, by Lemma 14, we obtain the desired result. As in the result of Theorem 9, the rate ρ increases as ν increases and is equal to max { λ2/λ , (L µ)/(L + µ)} when ν = 0. Also, by dropping y, we can restore the convergence result of Theorem 6. 5. Numerical Experiments We test the proposed algorithms on real-world data sets. All experiments are implemented on a standard laptop (2.6 GHz Intel Core i7 processor and 16GM memory) using the Julia programming language. Let us emphasize that scale invariant problems frequently appear in many important applications in statistics and machine learning. We select three important applications, KL-NMF, GMM and ICA. A description of the data sets is provided below and source codes are available at: https://github.com/youngseok-kim/SCIPI-JMLR. 5.1 Description of Data Sets For KL-NMF (Section 5.2), we use four public real data sets available online and summarized in Table 1. Waving Trees (WT) has 287 images, each having 160 120 pixels. KOS and NIPS are sparse, large matrices implemented for topic modeling. WIKI is a large binary matrix having values 0 or 1 representing the adjacency matrix of a directed graph. 
Here, sparsity represents the fraction of zero elements in a matrix.

Name   # of samples   # of features   # of nonzeros   Sparsity
WIKI   8,274          8,297           104,000         0.999
NIPS   1,500          12,419          280,000         0.985
KOS    3,430          6,906           950,000         0.960
WT     287            19,200          5,510,000       0.000

Table 1: A brief summary of data sets used for KL-NMF. These four data sets are retrieved from https://www.microsoft.com/en-us/research/project, https://archive.ics.uci.edu/ml/datasets/bag+of+words, and https://snap.stanford.edu/data/wiki-Vote.html

Name            # of classes   # of samples   Dimension
Sonar           2              208            60
Ionosphere      2              351            34
House Votes84   2              435            16
Br Cancer       2              699            10
PIDiabetes      2              768            8
Vehicle         4              846            18
Glass           6              214            9
Zoo             7              101            16
Vowel           11             990            10
Servo           51             167            4

Table 2: A brief summary of data sets used for GMM

Name        # of samples   # of features
Wine        178            14
Soybean     683            35
Vehicle     846            18
Vowel       990            10
Cardio      2,126          22
Satellite   6,435          37
Pendigits   10,992         17
Letter      20,000         16
Shuttle     58,000         9

Table 3: A brief summary of data sets used for ICA

For GMM (Section 5.3), we use ten public real data sets, corresponding to all small and moderate data sets provided by the mlbench package in R. We select data sets for multi-class classification problems and run EM and SCI-PI for the given number of classes without class labels. In Table 2, the sample size varies from 101 to 990, the dimension varies from 4 to 60, and the number of classes varies from 2 to 51. In these data sets, only a small portion of entries is missing. If missing entries exist, we impute them with the means. For ICA, discussed also in Section 5.3, we use nine public data sets (see Table 3) from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/index.php). The sample size varies from 178 to 58,000 and the dimension varies from 9 to 37.

5.2 KL-divergence Nonnegative Matrix Factorization

We perform experiments on the KL-NMF problem (39) described in Example 5. Let us recall that the original KL-NMF problem can be solved via block SCI-PI, where in each iteration the algorithm solves subproblems of the form (41). Our focus is to compare this algorithm with other well-known alternating minimization algorithms listed below, updating $H$ and $W$ alternately. We let $z = V \oslash (Wh)$.

Projected gradient descent (PGD): It iterates $h_{\text{new}} \leftarrow h + \eta\,W^T(z - 1_n)$ followed by projection onto the simplex, where $\eta$ is an appropriate learning rate (Lin, 2007).

Multiplicative update (MU): A famous multiplicative update algorithm originally suggested by Lee and Seung (2001), which iterates $h_{\text{new}} \leftarrow h \odot (W^T z) \oslash (W^T 1_n)$ and is learning-rate free.

Our method (SCI-PI): It iterates $h_{\text{new}} \leftarrow h \odot (\sigma + W^T z)^{\odot 2}$ followed by rescaling, where $\sigma$ is a shift parameter. We simply use $\sigma = 1$ for preconditioning.

Sequential quadratic programming (MIXSQP): It exactly solves each subproblem via the convex solver mixsqp (Kim et al., 2020), performing sequential non-negative least squares.
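To fix ideas, the three first-order updates above can be sketched in Julia as follows, with $W \in \mathbb{R}^{n \times r}$, the corresponding column $v$ of $V$, and the current iterate $h$. The function names, the PGD step size $\eta$, and the simplex-projection routine `project` are illustrative assumptions; the SCI-PI rescaling here normalizes onto the simplex of the reformulated problem (41), eliding the exact reparametrization between $h$ and $\tilde{h}$ in Lemma 8.

```julia
# z = v ./ (W*h) elementwise, as defined above.
mu_step(h, W, v) = h .* (W' * (v ./ (W * h))) ./ vec(sum(W; dims = 1))   # MU

function scipi_step(h, W, v; σ = 1.0)                                    # SCI-PI
    g = σ .+ W' * (v ./ (W * h))      # shifted factor σ + Wᵀz
    hnew = h .* g .^ 2                # elementwise square: h ⊙ (σ + Wᵀz)^⊙2
    return hnew ./ sum(hnew)          # rescaling
end

pgd_step(h, W, v, η, project) =                                          # PGD
    project(h .+ η .* (W' * (v ./ (W * h) .- 1)))   # gradient step, then projection
```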
KL-NMF subproblem on synthetic data sets Before presenting experimental results of alternating algorithms on the full KL-NMF problem (39), we report small experiments using synthetic data sets on the KL-NMF subproblem (40), where we repeat the above iterations until convergence. To study the convergence rate for the KL-NMF subproblems, we use the four data sets studied in Kim et al. (2020). We study MU, PGD and SCI-PI since they have the same order of computational complexity per iteration, but omit MIXSQP since it is a second-order method which cannot be directly compared. For PGD, the learning rate is optimized by grid search. The stopping criterion is $f_k - f^* \le 2 \times 10^{-6} f^*$, where $f_k$ is the objective value at iteration $k$ and $f^*$ is the objective value of the solution obtained by MIXSQP after extensive computation time. The result is shown in Figure 2.

Figure 2: Convergence of the three algorithms (MU, SCI-PI, PGD) for the KL-NMF subproblem; the relative error $|f_k - f^*|/|f_0 - f^*|$ (top) and the log relative error (bottom); $n$/$m$: the number of samples/features of the data matrix (panels: $n = 2000, m = 10$; $n = 2000, m = 20$; $n = 2000, m = 200$; $n = 20000, m = 200$).

The average runtimes of the three methods are 33, 33 and 30 seconds for 10,000 iterations, respectively. Although the reformulated scale invariant problem (12) is a non-convex problem, SCI-PI always finds a global optimal solution, regardless of the starting point. Moreover, as shown in the figure, SCI-PI outperforms the other two algorithms on all simulated data sets.

KL-NMF on real-world data sets Next, we test the four algorithms on the data sets in Table 1. We estimate $r = 20$ factors. At each iteration, all four algorithms solve $m$ subproblems simultaneously for $W$ and then alternately for $H$. The result is summarized in Figure 3. The convergence plots are based on the average relative errors over ten repeated runs with random initializations. The result shows that SCI-PI is an overall winner, showing faster convergence rates. The stopping criterion is the same as above. To assess the overall performance when initialized differently, we select KOS and WIKI and run MU, PGD, SCI-PI, and MIXSQP ten times. The three algorithms except MIXSQP have (approximately) the same computational cost per iteration and take runtimes of 391, 396 and 408 seconds on the KOS data and 372, 390 and 418 seconds on the WIKI data, respectively, for 200 iterations. MIXSQP has a larger per-iteration cost. After 400 seconds, SCI-PI achieves the lowest objective values in all cases but one for each data set (38 out of 40 in total). Thus it clearly outperforms the other methods and also achieves the lowest variance. Unlike the other three algorithms, SCI-PI is not an ascent algorithm but an eigenvalue-based fixed-point algorithm. Admittedly, the non-monotone convergence of SCI-PI can hurt the reliability of the solution, but for the KL-NMF problem its performance turns out to be stable.

5.3 Gaussian Mixture Model and Independent Component Analysis

In this subsection, we study the empirical performance of SCI-PI when it is applied to GMM and ICA.

GMM GMM fits a mixture of Gaussian distributions to the underlying data. Let $L_{ij} = N(a_i; \mu_j, \Sigma_j)$, where $i$ is the sample index and $j$ the cluster index, and let $\pi$ be the actual mixture proportion vector. GMM fits into our partially scale invariant setting (Section 4.3) with reparametrization, but the gradient update for $\mu_j, \Sigma_j$ is replaced by the exact coordinate ascent step. The EM and SCI-PI updates for $\pi$ can be written respectively as
$$r = 1_n \oslash (L\pi), \qquad \pi_{\text{new}} \propto \pi \odot (L^T r) \ \text{(EM)}, \qquad \pi_{\text{new}} \propto \pi \odot (\alpha + L^T r)^{\odot 2} \ \text{(SCI-PI)} \qquad (54)$$
where $\alpha$ is a shift parameter set to 1. We compare SCI-PI and EM for different real-world data sets from Table 2.
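In code, the two updates in (54) differ only by the shifted, elementwise-squared factor. A minimal Julia sketch, where L is the n-by-d matrix with L[i, j] = N(a_i; μ_j, Σ_j) and p stands for the proportion vector π (the helper names are illustrative assumptions):

```julia
# EM:     π_new ∝ π ⊙ (Lᵀr),         with r = 1_n ⊘ (Lπ)
# SCI-PI: π_new ∝ π ⊙ (α + Lᵀr)^⊙2,  with shift α = 1
em_pi_step(p, L)             = (w = p .* (L' * (1 ./ (L * p)));           w ./ sum(w))
scipi_pi_step(p, L; α = 1.0) = (w = p .* (α .+ L' * (1 ./ (L * p))).^2;   w ./ sum(w))
```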
All the algorithms are initialized from the same standard Gaussian random draws, repeated ten times. The result is summarized in the left panel of Figure 4. In some cases, SCI-PI achieves much larger objective values even when initialized identically. In many cases the two algorithms exhibit the same performance. This is because estimation of the $\mu_j$'s and $\Sigma_j$'s is usually harder than estimation of $\pi$, and EM and SCI-PI have the same updates for $\mu$ and $\Sigma$. In a few cases EM outperforms SCI-PI. Let us mention that SCI-PI and EM have the same order of computational complexity and require 591 and 590 seconds of total computation time, respectively.

(For each evaluation, we randomly draw ten initial points and report the averaged relative errors with respect to $f^*$. The initial input for the KL-NMF problem is a one-step MU update of a Uniform(0, 1) random matrix. In all plots we do not show the first few iterations; the initial random solutions have a gap of approximately 50%, which drops to a few percent after ten iterations, where the plots start.)

Figure 3: Convergence of the four NMF algorithms (MU, PGD, SCI-PI, MIXSQP); the relative error $|f_k - f^*|/|f_0 - f^*|$ (top left) and the log relative error (bottom); boxplots containing ten objective values achieved after 400 seconds (top right).

Figure 4: The relative objective $f_{\text{SCI-PI}}/f_{\text{EM}}$ for GMM (left) and $f_{\text{SCI-PI}}/f_{\text{Fast ICA}}$ for ICA (right).

ICA We implement SCI-PI on the Kurtosis-based ICA problem (Hyvärinen et al., 2004) and compare it with the benchmark algorithm Fast ICA (Hyvarinen, 1999), which is the most popular algorithm. Given a pre-processed data matrix $W \in \mathbb{R}^{n \times d}$ (a centered matrix $\widetilde{W} = n^{1/2} U D V^T$ is pre-processed by $W = \widetilde{W} V D^{-1} V^T$ so that $W^T W = n\, V V^T$), we maximize an approximation of negative entropy (Hyvärinen and Oja, 2000), $f(x) = \sum_{i=1}^n \big((w_i^T x)^4 - 3\big)^2$, subject to $x \in \mathcal{B}^d$. This problem fits into the sum of scale invariant functions setting (Section 4.1). SCI-PI iterates $x_{k+1} \leftarrow W^T\big[\big((Wx_k)^{\odot 4} - 3 \cdot 1_n\big) \odot (Wx_k)^{\odot 3}\big]$ and Fast ICA iterates $x_{k+1} \leftarrow W^T (Wx_k)^{\odot 3} - 3\,\big(1_n^T (Wx_k)^{\odot 2}\big)\,x_k$, both followed by normalization. In Figure 4 (right panel), we compare SCI-PI and Fast ICA on the data sets in Table 3. The majority of data points (81 out of 100 in total) show that SCI-PI tends to find a better solution with a larger objective value, but in a few cases SCI-PI converges to a sub-optimal point. Both algorithms are fixed-point based and thus have no guarantee of global convergence, but overall SCI-PI outperforms Fast ICA. SCI-PI and Fast ICA have the same order of computational complexity and require 11 and 12 seconds of total computation time, respectively.

6. Final Remarks

In this paper, we propose a new class of optimization problems called scale invariant problems, together with a generic solver, SCI-PI, which is an eigenvalue-based fixed-point iteration.
We showed that SCI-PI directly generalizes power iteration and enjoys similar properties: under mild conditions it attains local linear convergence, and its convergence rate is determined by the eigenvalues of the Hessian matrix at a solution. Also, we extended scale invariant problems to more general settings. We showed by experiments that SCI-PI can be a competitive option for numerous important problems such as KL-NMF, GMM and ICA. Moreover, while not studied in this work, SCI-PI can be generalized to solve optimization problems on the Stiefel manifold such as block PCA. Under orthonormality constraints, a problem with a scale invariant objective function can be locally considered as the top-$k$ eigenvector problem. Therefore, we can develop a general form of the QR iteration (Francis, 1961, 1962) and its convergence analysis using similar proof techniques. Finding more examples and extending SCI-PI further to a more general setting is a promising direction for future studies.

Appendix A. Proofs of Propositions

A.1 Proof of Proposition 2

From (2) and (3), we infer functional equations of the multiplicative factor $u$ and the additive factor $v$. Under the continuity assumption on $f$, these functional equations have the forms of Cauchy equations. Relying on known properties of Cauchy equations, we prove that $u$ and $v$ are homogeneous and log functions, respectively. We next provide the details.

We first consider the multiplicatively scale invariant case. Let $x$ be a point such that $f(x) \neq 0$. Then, we have $f(rsx) = u(rs)f(x) = u(r)u(s)f(x)$, which results in $u(rs) = u(r)u(s)$ for all $r, s \in \mathbb{R}$. Let $g(r) = \ln(u(e^r))$. Then, we have
$$g(r+s) = \ln(u(e^{r+s})) = \ln(u(e^r e^s)) = \ln(u(e^r)) + \ln(u(e^s)) = g(r) + g(s),$$
which implies that $g$ satisfies the first Cauchy functional equation. Since $f$ is continuous, so is $u$ and thus $g$. Therefore, by (Sahoo and Kannappan, 2011, pp. 81-82), we have
$$g(r) = r\,g(1) \qquad (55)$$
for all $r \in \mathbb{R}$. From the definition of $g$ and (55), we have
$$u(e^r) = e^{g(r)} = (e^r)^{g(1)}. \qquad (56)$$
Representing $r > 0$ as $r = e^{\ln(r)}$ and using (56), we obtain
$$u(r) = u(e^{\ln(r)}) = r^{g(1)} = r^{\ln(u(e))} = r^p.$$
Since $f(x) \neq 0$, if $p = \ln(u(e)) < 0$, then we have $\lim_{r \to 0^+} f(rx) = \lim_{r \to 0^+} u(r)f(x) = f(x)\lim_{r \to 0^+} r^p = \pm\infty$, while $f(0)$ is finite, contradicting the fact that $f$ is continuous at $0$. Also, if $p = 0$, then we get $u(r) = 1$, which contradicts $u(0) = 0$. Therefore, we must have $p > 0$. From $u$ being an even function, we finally have $u(r) = |r|^p$.

Now, consider the additively scale invariant case. For any $x \in \text{dom}(f)$, we have $f(rsx) = f(x) + v(rs) = f(x) + v(r) + v(s)$, which results in $v(rs) = v(r) + v(s)$ for all $r, s \in \mathbb{R} \setminus \{0\}$. Let $g(r) = v(e^r)$. Then, we have
$$g(r+s) = v(e^{r+s}) = v(e^r e^s) = v(e^r) + v(e^s) = g(r) + g(s).$$
Since $g$ is continuous and satisfies the second Cauchy functional equation, by (Sahoo and Kannappan, 2011, pp. 83-84), we have $g(r) = r\,g(1)$ for all $r \in \mathbb{R}$. For $r > 0$, letting $r = e^{\ln(r)}$, we have
$$v(r) = v(e^{\ln(r)}) = g(\ln(r)) = g(1)\ln(r) = v(e)\ln(r) = \log_a(r)$$
where $a = e^{1/v(e)}$. Note that $a$ satisfies $0 < a$ and $a \neq 1$. From the fact that $v$ is an even function, we finally have $v(r) = \log_a |r|$ for $r \in \mathbb{R} \setminus \{0\}$.

A.2 Proof of Proposition 3

Without loss of generality, we can represent a scale invariant function $f$ as
$$f(cx) = u(c)f(x) + v(c) \qquad (57)$$
since we can restore a multiplicatively or additively scale invariant function by setting $v(c) = 0$ or $u(c) = 1$, respectively.
In order to derive the first-order derivative properties, we differentiate (57) with respect to x and c, respectively. Then, we further differentiate the latter with respect to x to obtain the second-order property. By differentiating (57) with respect to x, we have f(cx) = u(c) On the other hand, by differentiating (57) with respect to c, we have f(cx)T x = u (c)f(x) + v (c). (58) By differentiating (58) with respect to x, we obtain c 2f(cx)x + f(cx) = u (c) f(x). (59) Plugging c = 1 into (58) and (59) completes the proof. Kim, Kim and Klabjan A.3 Proof of Proposition 5 In order to prove the equivalence of (1) and (9), we show that a scalar multiple of an optimal solution to one problem is optimal to the other problem. Since {x : x 2 = 1} and {w : f(w) = 1} have a one-to-one correspondence, we can uniquely determine such a mapping. First, we consider the multiplicatively scale invariant case where f(x ) > 0. Suppose that an optimal solution to (9) is z not x /f(x )1/p such that z 2 < x /f(x )1/p 2. (60) Let ˆz = z/ z 2. Then, we have ˆz 2 = 1 and z = ˆz/f(ˆz)1/p. Since ˆz 2 = x 2 = 1, we have z 2 = ˆz/f(ˆz)1/p 2 = 1/f(ˆz)1/p, x /f(x )1/p 2 = 1/f(x )1/p. (61) From (60) and (61), we have f(x ) < f(ˆz) since p > 0. This contradicts the assumption that x is an optimal solution to (1). For an optimal solution w to (9), since f(w ) = 1, we have w = 0 and thus w 2 > 0 and f (w / w 2) = 1/ w p 2 > 0. Suppose an optimal solution to (1) is y with f(y) > f (w / w 2) > 0. (62) Let ˆy = y/f(y)1/p. Then, f(ˆy) = 1 and y = ˆy/ ˆy 2. Using f(ˆy) = f(w ) = 1, we have f(y) = f (ˆy/ ˆy 2) = 1/ ˆy 1/p 2 , f (w / w 2) = 1/ w 1/p 2 . (63) From (62) and (63), we obtain ˆy 2 < w 2, which contradicts that w is optimal to (9). Next, we consider the additively scale invariant case. Suppose that an optimal solution to (9) is z with z 2 < a1 f(x )x 2. (64) Let ˆz = z/ z 2. Then, we have ˆz 2 = 1 and z = a1 f(ˆz)ˆz. Using ˆz 2 = x 2 = 1, we have z 2 = a1 f(ˆz), a1 f(x )x 2 = a1 f(x ). (65) From (64) and (65), we obtain f(x ) < f(ˆz) due to a > 1, which contradicts the assumption that x is an optimal solution to (1). Conversely, suppose that an optimal solution of (1) is y with f(y) > f (w / w 2) . (66) Let ˆy = a1 f(y)y. Then, we have f(ˆy) = 1 and y = ˆy/ ˆy 2. Since f(ˆy) = f(w ) = 1, we have f(y) = f(ˆy) loga ˆy 2 = 1 loga ˆy 2, f (w / w 2) = 1 loga w 2. (67) From (66) and (67), we have ˆy 2 < w 2 since a > 1, contradicting the fact that w is an optimal solution to (9). Scale Invariant Power Iteration Appendix B. Sublinear convergence of SCI-PI for quasi-convex f Proposition 11 If f is quasi-convex and continuously differentiable, a sequence of iterates {xk}k=0,1, generated by SCI-PI satisfies f(xk+1) f(xk) for all k 0. Moreover, either Algorithm 1 terminates with a stationary point or every limit point is a stationary point. Proof Suppose that f(xk+1) < f(xk). By the first-order property of differentiable quasiconvex functions, this leads to f(xk)T (xk+1 xk) = f(xk)T f(xk) f(xk) 2 xk = f(xk) 2 f(xk)T xk 0. However, since f(xk+1) = f(xk), f(xk) is not a scalar multiple of xk, resulting in f(xk) 2 f(xk)T xk > 0. This contradicts (68). Therefore, we have f(xk+1) f(xk). If Algorithm 1 terminates at iteration k, we have f(xk) = 0. Therefore, xk is a stationary point satisfying (8) with the value of the Lagrange multiplier being zero. Otherwise, let x be a limit point and suppose that x is not a stationary point. Then, there exists some ϵ > 0 such that f(x )T x / f(x ) 2 < 1 ϵ. 
Appendix B. Sublinear Convergence of SCI-PI for Quasi-Convex f

Proposition 11 If $f$ is quasi-convex and continuously differentiable, a sequence of iterates $\{x_k\}_{k=0,1,\dots}$ generated by SCI-PI satisfies $f(x_{k+1}) \ge f(x_k)$ for all $k \ge 0$. Moreover, either Algorithm 1 terminates with a stationary point or every limit point is a stationary point.

Proof Suppose that $f(x_{k+1}) < f(x_k)$. By the first-order property of differentiable quasi-convex functions, this leads to
\[ \nabla f(x_k)^T (x_{k+1} - x_k) = \nabla f(x_k)^T \left( \frac{\nabla f(x_k)}{\|\nabla f(x_k)\|_2} - x_k \right) = \|\nabla f(x_k)\|_2 - \nabla f(x_k)^T x_k \le 0. \tag{68} \]
However, since $f(x_{k+1}) \neq f(x_k)$, $\nabla f(x_k)$ is not a scalar multiple of $x_k$, so the Cauchy-Schwarz inequality is strict, resulting in $\|\nabla f(x_k)\|_2 - \nabla f(x_k)^T x_k > 0$. This contradicts (68). Therefore, we have $f(x_{k+1}) \ge f(x_k)$.

If Algorithm 1 terminates at iteration $k$, we have $\nabla f(x_k) = 0$. Therefore, $x_k$ is a stationary point satisfying (8) with the value of the Lagrange multiplier being zero. Otherwise, let $x^*$ be a limit point and suppose that $x^*$ is not a stationary point. Then, there exists some $\epsilon > 0$ such that
\[ \frac{\nabla f(x^*)^T x^*}{\|\nabla f(x^*)\|_2} < 1 - \epsilon. \]
Since $\nabla f$ is continuous, there exists some $\delta > 0$ such that for $x \in B^d$ with $\nabla f(x) \neq 0$, if $\|x - x^*\|_2 < \delta$, then
\[ \frac{\nabla f(x^*)^T \nabla f(x)}{\|\nabla f(x^*)\|_2 \|\nabla f(x)\|_2} > 1 - \epsilon. \]
Let $k'$ be an index such that $\|x_{k'} - x^*\|_2 < \delta$. Since $\{f(x_k)\}_{k=0,1,\dots}$ is non-decreasing, we have $f(x_{k'}) \le f(x_{k'+1}) \le f(x^*)$. By the first-order derivative property of quasi-convex functions, we obtain
\[ \nabla f(x^*)^T (x_{k'+1} - x^*) \le 0. \]
However, since $x_{k'+1} = \nabla f(x_{k'}) / \|\nabla f(x_{k'})\|_2$ and $\|x_{k'} - x^*\|_2 < \delta$, we must have
\[ \frac{\nabla f(x^*)^T x^*}{\|\nabla f(x^*)\|_2} < 1 - \epsilon < \frac{\nabla f(x^*)^T \nabla f(x_{k'})}{\|\nabla f(x^*)\|_2 \|\nabla f(x_{k'})\|_2} = \frac{\nabla f(x^*)^T x_{k'+1}}{\|\nabla f(x^*)\|_2}, \]
and hence $\nabla f(x^*)^T (x_{k'+1} - x^*) > 0$. This leads to a contradiction. Therefore, $x^*$ must be a stationary point.

Appendix C. Additional Lemmas

On several occasions, we use the identity: if $x \in B^d$ and $y \in B^d$, then
\[ \|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2x^T y = 2(1 - x^T y). \tag{69} \]
Note that if $x^T y \ge 0$, then
\[ \sqrt{1 - (x^T y)^2} = \sqrt{(1 - x^T y)(1 + x^T y)} \ge \sqrt{1 - x^T y} = \|x - y\|_2 / \sqrt{2}. \]
By the Cauchy-Schwarz inequality, we also have
\[ \sqrt{1 - (x^T y)^2} = \sqrt{(1 - x^T y)(1 + x^T y)} \le \sqrt{2}\, \sqrt{1 - x^T y} = \|x - y\|_2. \tag{70} \]

C.1 In Support of the Proofs of Theorem 6 and Theorem 7

Lemma 12 Let $x^*$ be a vector in $\mathbb{R}^d$ and $\{v_1, \dots, v_d\}$ be an orthonormal basis of $\mathbb{R}^d$ such that $x^* = v_1$. If a twice continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ satisfies
\[ \nabla f(x)^T v_1 = \lambda^* + \alpha(x), \qquad \sum_{i=2}^d (\nabla f(x)^T v_i)^2 \le \big( \bar{\lambda}\, \|x - x^*\|_2 + \beta(x) \big)^2 \tag{71} \]
for every $x \in B^d$ and some functions $\alpha, \beta : \mathbb{R}^d \to \mathbb{R}$ and scalars $\lambda^*, \bar{\lambda}$ such that
\[ \alpha(x) = o\big(\sqrt{\|x - x^*\|_2}\big), \qquad \beta(x) = o(\|x - x^*\|_2), \qquad \lambda^* > \bar{\lambda} \ge 0, \]
then for the sequence of iterates $\{x_k\}_{k=0,1,\dots}$ generated by SCI-PI, there exists some $\delta > 0$ such that under the initial condition $\|x_0 - x^*\|_2 < \delta$, we have
\[ \|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_t \right)^2 \|x_0 - x^*\|_2^2, \qquad \frac{\bar{\lambda}}{\lambda^*} + \gamma_t < 1, \qquad \lim_{k \to \infty} \gamma_k = 0. \]

Proof By (71), for every $x \in B^d$ with $\nabla f(x)^T v_1 > 0$, we have
\[ \frac{\sqrt{\sum_{i=2}^d (\nabla f(x)^T v_i)^2}}{\nabla f(x)^T v_1} \le \frac{\bar{\lambda}\|x - x^*\|_2 + \beta(x)}{\lambda^* + \alpha(x)} = \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2 + \theta(x), \quad \theta(x) = \frac{\bar{\lambda}\|x - x^*\|_2 + \beta(x)}{\lambda^* + \alpha(x)} - \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2. \]
Then, we have $\theta(x) = o(\|x - x^*\|_2)$ and
\[ \frac{\sum_{i=2}^d (\nabla f(x)^T v_i)^2}{(\nabla f(x)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma(x) \right)^2 \|x - x^*\|_2^2, \tag{72} \]
where
\[ \gamma(x) = \frac{\theta(x)}{\|x - x^*\|_2}, \tag{73} \]
so that $\gamma(x) \to 0$ as $x \to x^*$. Next, for any $x^+ \in B^d$ with $x^{+T} x^* > 0$, since $2t^2/(1+t) \le 1$ for $t \in (0, 1]$ and $x^* = v_1$, we have
\[ \|x^+ - x^*\|_2^2 = \frac{2\,(x^{+T} x^*)^2}{1 + x^{+T} x^*} \cdot \frac{1 - (x^{+T} x^*)^2}{(x^{+T} x^*)^2} \le \frac{1 - (x^{+T} v_1)^2}{(x^{+T} v_1)^2}. \tag{74} \]
By (71), there exists some $\delta_1 > 0$ such that if $\|x - x^*\|_2 < \delta_1$, then
\[ \nabla f(x)^T v_1 > 0. \tag{75} \]
Also, by (73), for any $\gamma > 0$ satisfying
\[ \frac{\bar{\lambda}}{\lambda^*} + \gamma < 1, \tag{76} \]
there exists some constant $\delta_2 > 0$ such that if $\|x - x^*\|_2 < \delta_2$, then
\[ \gamma(x) \le \gamma. \tag{77} \]
Let $\gamma_k = \gamma(x_k)$ and $\delta = \min\{\delta_1, \delta_2, \sqrt{2}\}$. We prove that if $\|x_k - x^*\|_2 < \delta$, then
\[ \|x_{k+1} - x^*\|_2^2 \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_k \right)^2 \|x_k - x^*\|_2^2 \quad \text{and} \quad \gamma_k \le \gamma. \tag{78} \]
Since $\|x_k - x^*\|_2 < \sqrt{2}$, we have $x_k^T x^* > 0$ by (69). Also, from $\|x_k - x^*\|_2 < \delta_1$ and $x^* = v_1$, using the update rule of SCI-PI and (75), we obtain
\[ x_{k+1}^T x^* = \frac{\nabla f(x_k)^T x^*}{\|\nabla f(x_k)\|_2} = \frac{\nabla f(x_k)^T v_1}{\|\nabla f(x_k)\|_2} > 0. \]
From the fact that $\{v_1, \dots, v_d\}$ forms an orthonormal basis of $\mathbb{R}^d$, we have $\nabla f(x_k) = \sum_{i=1}^d (\nabla f(x_k)^T v_i) v_i$ and $\|\nabla f(x_k)\|_2^2 = \sum_{i=1}^d (\nabla f(x_k)^T v_i)^2$, so the update rule of SCI-PI gives
\[ \frac{1 - (x_{k+1}^T v_1)^2}{(x_{k+1}^T v_1)^2} = \frac{\|\nabla f(x_k)\|_2^2 - (\nabla f(x_k)^T v_1)^2}{(\nabla f(x_k)^T v_1)^2} = \frac{\sum_{i=2}^d (\nabla f(x_k)^T v_i)^2}{(\nabla f(x_k)^T v_1)^2}. \]
Applying (74) with $x^+ = x_{k+1}$ and then (72) at $x = x_k$, we obtain
\[ \|x_{k+1} - x^*\|_2^2 \le \frac{\sum_{i=2}^d (\nabla f(x_k)^T v_i)^2}{(\nabla f(x_k)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_k \right)^2 \|x_k - x^*\|_2^2. \]
Since $\|x_k - x^*\|_2 < \delta_2$, we have $\gamma_k \le \gamma$ by (77), which proves (78). Moreover, by (76), $\|x_{k+1} - x^*\|_2 \le \|x_k - x^*\|_2 < \delta$, so (78) applies at every iteration. Therefore, if $x_0$ satisfies $\|x_0 - x^*\|_2 < \delta$, repeatedly applying (78) yields
\[ \|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_t \right)^2 \|x_0 - x^*\|_2^2 \quad \text{with} \quad \frac{\bar{\lambda}}{\lambda^*} + \gamma_t \le \frac{\bar{\lambda}}{\lambda^*} + \gamma < 1. \tag{79} \]
Since
\[ \|x_k - x^*\|_2^2 \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma \right)^{2k} \|x_0 - x^*\|_2^2, \tag{80} \]
we have $x_k \to x^*$, and thus $\lim_{k \to \infty} \gamma_k = 0$ by (73). This gives the desired result.
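For the quadratic objective $f(x) = \frac{1}{2} x^\top A x$, SCI-PI reduces to classical power iteration ($\nabla f(x) = Ax$) and the rate $\bar{\lambda}/\lambda^*$ of Lemma 12 becomes $\lambda_2/\lambda_1$. The following minimal numerical sketch (ours; the test matrix, initialization, and print schedule are arbitrary choices) checks that the per-step error ratio approaches this rate:

```python
import numpy as np

# Sketch: on f(x) = 0.5 * x^T A x, SCI-PI is power iteration and Lemma 12
# predicts a local linear rate of lambda_2 / lambda_1.
rng = np.random.default_rng(0)
d = 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis
lam = np.sort(rng.uniform(0.1, 0.9, d))[::-1]      # descending spectrum
lam[0] = 1.0                                       # lambda_1 = lambda^*
A = Q @ np.diag(lam) @ Q.T
v1 = Q[:, 0]                                       # leading eigenvector x^*

x = v1 + 0.1 * rng.standard_normal(d)              # initialize near x^*
x /= np.linalg.norm(x)
for k in range(40):
    err_prev = np.linalg.norm(x - np.sign(x @ v1) * v1)
    x = A @ x                                      # SCI-PI: x <- grad f / ||grad f||_2
    x /= np.linalg.norm(x)
    err = np.linalg.norm(x - np.sign(x @ v1) * v1)
    if k % 10 == 0:
        print(f"k={k:2d}  error ratio {err / err_prev:.4f}")
print("predicted rate lambda_2/lambda_1:", lam[1] / lam[0])
```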
Lemma 13 Let $x^*$ be a vector in $\mathbb{R}^d$ and $\{v_1, \dots, v_d\}$ be an orthonormal basis of $\mathbb{R}^d$ such that $x^* = v_1$. If a three-times continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ satisfies
\[ \nabla f(x)^T v_1 \ge A - B(1 - x^T x^*) - C\sqrt{1 - x^T x^*}, \tag{81} \]
\[ \sum_{i=2}^d (\nabla f(x)^T v_i)^2 \le \left( D\sqrt{1 - (x^T x^*)^2} + E\|x - x^*\|_2 + \tfrac{F}{2}\|x - x^*\|_2^2 \right)^2 \tag{82} \]
for every $x \in B^d$ and some constants $A, B, C, D, E, F$ such that $A > 0$, $B + C \ge 0$, and $0 \le D + E < A$, then for the sequence of iterates $\{x_k\}_{k=0,1,\dots}$ generated by SCI-PI, under the initial condition $\|x_0 - x^*\|_2 < \delta$ where
\[ \delta = \min\left\{ \frac{\sqrt{2}\, A}{A + |A - B| + C},\ \frac{\sqrt{2}\,(A - D - E)}{A + |A - B| + C + E + F} \right\}, \tag{83} \]
we have
\[ \|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \left( \frac{D + E}{A} + \gamma_t \right)^2 \|x_0 - x^*\|_2^2, \qquad \frac{D + E}{A} + \gamma_t < 1, \qquad \lim_{k \to \infty} \gamma_k = 0. \]

Proof Let $\|x_0 - x^*\|_2 = \delta_0 < \delta$. To prove the main result, we show that if $\|x_k - x^*\|_2 \le \delta_0$, then $x_{k+1}^T x^* > 0$ and
\[ \frac{1 - (x_{k+1}^T x^*)^2}{(x_{k+1}^T x^*)^2} \le \rho_k\, \frac{1 - (x_k^T x^*)^2}{(x_k^T x^*)^2}, \qquad \rho_k = \left( \frac{D + E + (E + F)\|x_k - x^*\|_2/\sqrt{2}}{A - (A + |A - B| + C)\|x_k - x^*\|_2/\sqrt{2}} \right)^2. \tag{84} \]
Suppose that $\|x_k - x^*\|_2 \le \delta_0$ for some $k \ge 0$. Since $\|x_k - x^*\|_2 < \delta \le \sqrt{2}$, we have $x_k^T x^* > 0$. Writing $1 - x_k^T x^* = \|x_k - x^*\|_2^2/2$ and $\sqrt{1 - x_k^T x^*} = \|x_k - x^*\|_2/\sqrt{2}$, inequality (81) gives
\[ \nabla f(x_k)^T v_1 \ge A\, x_k^T x^* + (A - B)(1 - x_k^T x^*) - C\sqrt{1 - x_k^T x^*} \ge A\, x_k^T x^* - \frac{(|A - B| + C)\|x_k - x^*\|_2}{\sqrt{2}} \ge A - \frac{(A + |A - B| + C)\|x_k - x^*\|_2}{\sqrt{2}} > 0, \tag{85} \]
where the second inequality uses $1 - x_k^T x^* \le \|x_k - x^*\|_2/\sqrt{2}$ (valid since $\|x_k - x^*\|_2 \le \sqrt{2}$), the third uses $x_k^T x^* \ge 1 - \|x_k - x^*\|_2/\sqrt{2}$, and positivity follows from $\|x_k - x^*\|_2 < \delta \le \sqrt{2}A/(A + |A - B| + C)$. Inequality (85) implies
\[ x_{k+1}^T x^* = \frac{\nabla f(x_k)^T v_1}{\|\nabla f(x_k)\|_2} > 0. \]
On the other hand, since
\[ \frac{1 - (x_{k+1}^T v_1)^2}{(x_{k+1}^T v_1)^2} = \frac{\|\nabla f(x_k)\|_2^2 - (\nabla f(x_k)^T v_1)^2}{(\nabla f(x_k)^T v_1)^2} = \frac{\sum_{i=2}^d (\nabla f(x_k)^T v_i)^2}{(\nabla f(x_k)^T v_1)^2}, \]
using (81), (82), and (85), we have
\[ \frac{1 - (x_{k+1}^T x^*)^2}{(x_{k+1}^T x^*)^2} \le \left( \frac{D\sqrt{1 - (x_k^T x^*)^2} + E\|x_k - x^*\|_2 + \frac{F}{2}\|x_k - x^*\|_2^2}{A - (A + |A - B| + C)\|x_k - x^*\|_2/\sqrt{2}} \right)^2 \le \rho_k\, \frac{1 - (x_k^T x^*)^2}{(x_k^T x^*)^2}, \]
where we use $(x_k^T x^*)^2 \le 1$ and the fact that $\sqrt{1 + u} \le 1 + u$ for $u \ge 0$ to derive
\[ \frac{D\sqrt{1 - (x_k^T x^*)^2} + E\|x_k - x^*\|_2 + \frac{F}{2}\|x_k - x^*\|_2^2}{\sqrt{1 - (x_k^T x^*)^2}} = D + E\sqrt{\frac{2}{1 + x_k^T x^*}} + \frac{F\,\|x_k - x^*\|_2}{\sqrt{2}\sqrt{1 + x_k^T x^*}} \le D + E + \frac{(E + F)\|x_k - x^*\|_2}{\sqrt{2}}. \]
This proves (84).

Since $x_k^T x^* > 0$ and $x_{k+1}^T x^* > 0$, we can translate (84) into a contraction of $\|x_k - x^*\|_2^2 = 2(1 - x_k^T x^*)$. Writing $t_k = x_k^T x^*$ and $q_k = \rho_k + (1 - \rho_k) t_k^2$, (84) implies $t_{k+1} \ge t_k/\sqrt{q_k}$, and therefore
\[ \frac{\|x_{k+1} - x^*\|_2^2}{\|x_k - x^*\|_2^2} = \frac{1 - t_{k+1}}{1 - t_k} \le \frac{1 - t_k/\sqrt{q_k}}{1 - t_k} = \frac{\rho_k (1 + t_k)}{q_k + t_k \sqrt{q_k}} =: \bar{\rho}_k. \]
Before proving $\bar{\rho}_k \le \bar{\rho}_0 < 1$, we first show $\rho_k \le \rho_0 < 1$. Since $\|x_k - x^*\|_2 \le \delta_0 = \|x_0 - x^*\|_2$ and $\rho_k$ is increasing in $\|x_k - x^*\|_2$, we have $\rho_k \le \rho_0$. From $\delta_0 < \delta$ and (83), we have
\[ D + E + (E + F)\,\delta_0/\sqrt{2} < A - (A + |A - B| + C)\,\delta_0/\sqrt{2}, \]
which gives $\rho_0 < 1$. If $\rho_k = 0$, obviously $\bar{\rho}_k = 0 \le \bar{\rho}_0$. Otherwise, using $0 < \rho_k \le \rho_0$ and $t_k \ge t_0$, we have
\[ \bar{\rho}_k = \frac{1 + t_k}{t_k^2/\rho_k + (1 - t_k^2) + t_k\sqrt{t_k^2/\rho_k^2 + (1 - t_k^2)/\rho_k}} \le \frac{\rho_0 (1 + t_k)}{\rho_0 + (1 - \rho_0)t_k^2 + t_k\sqrt{\rho_0 + (1 - \rho_0)t_k^2}} \le \frac{\rho_0 (1 + t_0)}{\rho_0 + (1 - \rho_0)t_0^2 + t_0\sqrt{\rho_0 + (1 - \rho_0)t_0^2}} = \bar{\rho}_0, \]
where the last inequality holds because, for fixed $\rho_0 < 1$, the middle expression is non-increasing in $t_k \in [0, 1]$. Moreover, $\bar{\rho}_0 < 1$ since $\rho_0 < 1$ and $t_0 > 0$ give $\rho_0 + (1 - \rho_0)t_0^2 + t_0\sqrt{\rho_0 + (1 - \rho_0)t_0^2} > \rho_0 + t_0\rho_0 = \rho_0(1 + t_0)$.

Therefore, we have
\[ \|x_{k+1} - x^*\|_2^2 \le \bar{\rho}_k\, \|x_k - x^*\|_2^2 \le \bar{\rho}_0\, \|x_k - x^*\|_2^2, \]
so $\|x_{k+1} - x^*\|_2 \le \delta_0 < \delta$, and by induction $\|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \bar{\rho}_t\, \|x_0 - x^*\|_2^2 \le \bar{\rho}_0^{\,k}\, \|x_0 - x^*\|_2^2$, which implies the convergence of $x_k$ to $x^*$. Setting $\gamma_k = \sqrt{\bar{\rho}_k} - (D + E)/A$, as $x_k \to x^*$ we have $\bar{\rho}_k - \rho_k \to 0$ and $\rho_k \to ((D + E)/A)^2$, which leads to $\gamma_k \to 0$. This completes the proof.
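A compact way to read (84) (our restatement, not an additional result) is in terms of the angle $\theta_k$ between $x_k$ and $x^*$: the driving quantity is $\tan^2\theta_k$, exactly as in the classical analysis of power iteration:
\[
\tan^2\theta_{k+1} = \frac{1 - (x_{k+1}^\top x^*)^2}{(x_{k+1}^\top x^*)^2} \le \rho_k \tan^2\theta_k,
\qquad
\rho_k \to \left(\frac{D+E}{A}\right)^2 \ \text{as } x_k \to x^*,
\]
so near the solution the lemma reproduces the power-iteration behavior with $(D+E)/A$ playing the role of $|\lambda_2|/\lambda_1$.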
Lemma 14 Let $f$ be three-times continuously differentiable on an open set containing $B^d$ and let $H_j$ be the Hessian of $\nabla_j f = \partial f/\partial x_j$. Let $\{v_1, \dots, v_d\}$ be an orthonormal basis of $\mathbb{R}^d$. For $x \in B^d$ and $y_1, \dots, y_d \in B^d$, let
\[ G_i(y_1, \dots, y_d) = \sum_{j=1}^d v_{i,j}\, H_j(y_j), \qquad R_i(x) = \nabla f(x)^T v_i - \nabla f(x^*)^T v_i - (x - x^*)^T \nabla^2 f(x^*)\, v_i. \]
Then, we have
\[ \sum_{i=1}^d (R_i(x))^2 \le \frac{1}{4} M^2\, \|x - x^*\|_2^4, \qquad M = \max_{x \in B^d,\ y_1, \dots, y_d \in B^d} \sqrt{\sum_{i=1}^d \big( x^T G_i(y_1, \dots, y_d)\, x \big)^2}. \]

Proof From $\nabla_j f$ being twice continuously differentiable near $B^d$, Taylor's theorem gives
\[ \nabla_j f(x) = \nabla_j f(x^*) + \nabla(\nabla_j f)(x^*)^T (x - x^*) + \frac{1}{2}(x - x^*)^T H_j(y_j)(x - x^*) \tag{86} \]
where $y_j \in N(x, x^*) := \{ y \mid y = \lambda x + (1 - \lambda)x^*,\ 0 \le \lambda \le 1 \}$. Combining the coordinates $j = 1, \dots, d$ with weights $v_{i,j}$, we obtain
\[ R_i(x) = \frac{1}{2}(x - x^*)^T G_i(y_1, \dots, y_d)(x - x^*) = \frac{\|x - x^*\|_2^2}{2} \left( \frac{x - x^*}{\|x - x^*\|_2} \right)^T G_i(y_1, \dots, y_d) \left( \frac{x - x^*}{\|x - x^*\|_2} \right). \]
Since $(x - x^*)/\|x - x^*\|_2 \in B^d$, using the definition of $M$, we have the desired result.

C.2 In Support of the Proofs of Theorem 9 and Theorem 10

Lemma 15 Suppose that $f(x, y)$ is scale invariant in $x \in \mathbb{R}^{d_x}$ for each $y \in \mathbb{R}^{d_y}$ and twice continuously differentiable on an open set containing $B^{d_x} \times B^{d_y}$. Let $(x^*, y^*)$ be a point satisfying
\[ \nabla_x f(x^*, y^*) = \lambda^* x^*, \qquad \lambda^* > \bar{\lambda} = \max_{2 \le i \le d_x} |\lambda_i|, \qquad x^* = v_1, \]
where $(\lambda_i, v_i)$ is an eigen-pair of $\nabla^2_{xx} f(x^*, y^*)$. Then, for any $x \in B^{d_x}$ and $y \in B^{d_y}$, we have
\[ \nabla_x f(x, y)^T v_1 = \lambda^* + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, x^* + \alpha(x, y), \]
\[ \sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2 \le \big( \bar{\lambda}\|x - x^*\|_2 + \nu\|y - y^*\|_2 + \beta(x, y) \big)^2, \]
where
\[ \alpha(x, y) = o(\|(x - x^*, y - y^*)\|_2), \qquad \beta(x, y) = o(\|(x - x^*, y - y^*)\|_2), \qquad \nu = \|\nabla^2_{xy} f(x^*, y^*)\|_2. \]
Therefore, we have
\[ \frac{\sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2}{(\nabla_x f(x, y)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2 + \frac{\nu}{\lambda^*}\|y - y^*\|_2 + \theta(x, y) \right)^2, \qquad \theta(x, y) = o(\|(x - x^*, y - y^*)\|_2). \]

Proof Since $\nabla^2_{xx} f(x^*, y^*)$ is real and symmetric, without loss of generality, we assume that $\{v_1, \dots, v_{d_x}\}$ forms an orthonormal basis of $\mathbb{R}^{d_x}$. By Taylor expansion of $\nabla_x f(x, y)^T v_i$ at $(x^*, y^*)$, we have
\[ \nabla_x f(x, y)^T v_i = \nabla_x f(x^*, y^*)^T v_i + (x - x^*)^T \nabla^2_{xx} f(x^*, y^*)\, v_i + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i + R_i(x, y) \tag{87} \]
where $R_i(x, y) = o(\|(x - x^*, y - y^*)\|_2)$. Using $\nabla_x f(x^*, y^*) = \lambda^* x^*$ and $x^* = v_1$, we have
\[ \nabla_x f(x^*, y^*)^T v_1 = \lambda^*, \qquad (x - x^*)^T \nabla^2_{xx} f(x^*, y^*)\, v_1 = -\lambda_1 (1 - x^T x^*). \]
Therefore, we obtain
\[ \nabla_x f(x, y)^T v_1 = \lambda^* + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, x^* + \alpha(x, y) \tag{88} \]
where
\[ \alpha(x, y) = R_1(x, y) - \lambda_1 (1 - x^T x^*) = o(\|(x - x^*, y - y^*)\|_2), \]
using $1 - x^T x^* = \|x - x^*\|_2^2/2$. In the same way, for $2 \le i \le d_x$, we have
\[ \nabla_x f(x^*, y^*)^T v_i = \lambda^* (x^*)^T v_i = 0, \qquad (x - x^*)^T \nabla^2_{xx} f(x^*, y^*)\, v_i = \lambda_i\, x^T v_i, \]
resulting in
\[ \nabla_x f(x, y)^T v_i = \lambda_i\, x^T v_i + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i + R_i(x, y). \tag{89} \]
From (89), we obtain
\[ \sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2 = \sum_{i=2}^{d_x} \lambda_i^2 (x^T v_i)^2 + \sum_{i=2}^{d_x} \big( (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i \big)^2 + \sum_{i=2}^{d_x} (R_i(x, y))^2 \]
\[ + 2\sum_{i=2}^{d_x} \lambda_i (x^T v_i)(y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i + 2\sum_{i=2}^{d_x} \lambda_i (x^T v_i) R_i(x, y) + 2\sum_{i=2}^{d_x} (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i\, R_i(x, y). \]
Since $\{v_1, \dots, v_{d_x}\}$ forms an orthonormal basis of $\mathbb{R}^{d_x}$, with $x^* = v_1$ and $\|x\|_2 = 1$, we have
\[ \sum_{i=2}^{d_x} \lambda_i^2 (x^T v_i)^2 \le \bar{\lambda}^2 \big( 1 - (x^T x^*)^2 \big) \le \bar{\lambda}^2 \|x - x^*\|_2^2, \qquad \sum_{i=2}^{d_x} \big( (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i \big)^2 \le \nu^2 \|y - y^*\|_2^2. \]
Let $\bar{R}(x, y) = \max_{2 \le i \le d_x} |R_i(x, y)|$ and note that $\bar{R}(x, y) = o(\|(x - x^*, y - y^*)\|_2)$. Using the Cauchy-Schwarz inequality, we have
\[ \Big| \sum_{i=2}^{d_x} \lambda_i (x^T v_i)(y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i \Big| \le \bar{\lambda}\, \nu\, \|y - y^*\|_2 \|x - x^*\|_2, \]
\[ \Big| \sum_{i=2}^{d_x} \lambda_i (x^T v_i) R_i(x, y) \Big| \le \bar{\lambda}\, \sqrt{d_x}\, \bar{R}(x, y)\, \|x - x^*\|_2, \qquad \Big| \sum_{i=2}^{d_x} (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i\, R_i(x, y) \Big| \le \nu\, \sqrt{d_x}\, \bar{R}(x, y)\, \|y - y^*\|_2. \]
Therefore, we obtain
\[ \sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2 \le \big( \bar{\lambda}\|x - x^*\|_2 + \nu\|y - y^*\|_2 + \beta(x, y) \big)^2 \tag{90} \]
with $\beta(x, y) = \sqrt{d_x}\, \bar{R}(x, y) = o(\|(x - x^*, y - y^*)\|_2)$. Finally, since $\{v_1, \dots, v_{d_x}\}$ forms an orthonormal basis and $|x^T x^*| \le \|x\|_2 \|x^*\|_2 = 1$, we have
\[ 1 - \frac{(\nabla_x f(x, y)^T x^*)^2}{\|\nabla_x f(x, y)\|_2^2} \le \frac{\sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2}{(\nabla_x f(x, y)^T v_1)^2}. \]
Using (88) and (90), we have
\[ \frac{\sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2}{(\nabla_x f(x, y)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2 + \frac{\nu}{\lambda^*}\|y - y^*\|_2 + \theta(x, y) \right)^2, \qquad \theta(x, y) = o(\|(x - x^*, y - y^*)\|_2). \]
This completes the proof.
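Lemmas 15 through 17 together analyze an alternating scheme for partially scale invariant problems: an SCI-PI step on the sphere-constrained block $x$ and a gradient ascent step with step size $\alpha = 2/(L+\mu)$ on the strongly concave block $y$. Below is a minimal sketch of one such iteration (ours; `grad_x`, `grad_y`, `L`, and `mu` are placeholders for a concrete objective, and the ordering of the two block updates is an illustrative choice):

```python
import numpy as np

def alternating_step(x, y, grad_x, grad_y, L, mu):
    """One iteration of the alternating scheme analyzed in Lemmas 15-17 (sketch).

    x: current sphere-constrained block, ||x||_2 = 1.
    y: current unconstrained block (f(x, .) is mu-strongly concave in y).
    grad_x, grad_y: callables returning the partial gradients of f at (x, y).
    L, mu: Lipschitz and strong-concavity constants of grad_y f(x, .).
    """
    g = grad_x(x, y)
    x_next = g / np.linalg.norm(g)      # SCI-PI update on the x block
    alpha = 2.0 / (L + mu)              # step size from Lemma 16
    y_next = y + alpha * grad_y(x, y)   # gradient ascent on the y block
    return x_next, y_next
```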
Lemma 16 Suppose that $f(x, y)$ is $\mu$-strongly concave in $y \in \mathbb{R}^{d_y}$ with an $L$-Lipschitz continuous $\nabla_y f(x, \cdot)$ for each $x \in B^{d_x}$, and three-times continuously differentiable with respect to $x$ and $y$ on an open set containing $B^{d_x} \times \mathbb{R}^{d_y}$. Let $(x^*, y^*)$ be a point such that $\nabla_y f(x^*, y^*) = 0$. Then, for any $x \in B^{d_x}$ and $y \in \mathbb{R}^{d_y}$, with $\alpha = 2/(L + \mu)$, we have
\[ \| y + \alpha \nabla_y f(x, y) - y^* \|_2 \le \frac{2\nu}{L + \mu}\|x - x^*\|_2 + \frac{L - \mu}{L + \mu}\|y - y^*\|_2 + \theta(x, y) \tag{91} \]
where $\nu = \|\nabla^2_{yx} f(x^*, y^*)\|_2$ and $\theta(x, y) = o(\|(x - x^*, y - y^*)\|_2)$.

Proof Let $\nabla_{y,i} f$ be the $i$th coordinate of $\nabla_y f$ and
\[ H_{y,i} = \begin{bmatrix} H^{xx}_{y,i} & H^{xy}_{y,i} \\ H^{yx}_{y,i} & H^{yy}_{y,i} \end{bmatrix} \]
be the Hessian of $\nabla_{y,i} f$. By Taylor expansion of $\nabla_{y,i} f(x, y)$ in $x$ at $(x^*, y)$, we have
\[ \nabla_{y,i} f(x, y) = \nabla_{y,i} f(x^*, y) + \nabla_x \nabla_{y,i} f(x^*, y)^T (x - x^*) + R_i(x, y) \tag{92} \]
where $\nabla_x \nabla_{y,i} f(x^*, y)$ denotes the $i$th column of $\nabla^2_{xy} f(x^*, y)$ and
\[ R_i(x, y) = \frac{1}{2}(x - x^*)^T H^{xx}_{y,i}(\hat{x}_i, y)(x - x^*), \qquad \hat{x}_i \in N(x, x^*). \tag{93} \]
Also, from $f$ being three-times continuously differentiable, we have
\[ \nabla_x \nabla_{y,i} f(x^*, y) = \nabla_x \nabla_{y,i} f(x^*, y^*) + H^{xy}_{y,i}(x^*, \hat{y}_i)(y - y^*), \qquad \hat{y}_i \in N(y, y^*). \tag{94} \]
Since
\[ |(y - y^*)^T H^{yx}_{y,i}(x^*, \hat{y}_i)(x - x^*)| \le \|H^{yx}_{y,i}(x^*, \hat{y}_i)\|_2 \|x - x^*\|_2 \|y - y^*\|_2 \le \frac{1}{2}\|H^{yx}_{y,i}(x^*, \hat{y}_i)\|_2 \big( \|x - x^*\|_2^2 + \|y - y^*\|_2^2 \big), \]
we have
\[ (y - y^*)^T H^{yx}_{y,i}(x^*, \hat{y}_i)(x - x^*) = o(\|(x - x^*, y - y^*)\|_2). \tag{95} \]
By (92), (93), (94), and (95), we have
\[ \nabla_y f(x, y) = \nabla_y f(x^*, y) + \nabla^2_{yx} f(x^*, y^*)(x - x^*) + \bar{R}(x, y) \tag{96} \]
where
\[ \bar{R}_i(x, y) = R_i(x, y) + (y - y^*)^T H^{yx}_{y,i}(x^*, \hat{y}_i)(x - x^*) = o(\|(x - x^*, y - y^*)\|_2). \]
Using (96), we have
\[ y + \alpha \nabla_y f(x, y) - y^* = \big( y - y^* + \alpha \nabla_y f(x^*, y) \big) + \alpha \nabla^2_{yx} f(x^*, y^*)(x - x^*) + \alpha \bar{R}(x, y), \]
resulting in
\[ \| y + \alpha \nabla_y f(x, y) - y^* \|_2 \le \| y - y^* + \alpha \nabla_y f(x^*, y) \|_2 + \alpha \| \nabla^2_{yx} f(x^*, y^*)(x - x^*) \|_2 + \alpha \| \bar{R}(x, y) \|_2. \tag{97} \]
Since $-f(x^*, \cdot)$ is $\mu$-strongly convex with an $L$-Lipschitz continuous gradient and $\nabla_y f(x^*, y^*) = 0$, by the theory of convex optimization (Bubeck, 2015, p. 279), we have
\[ \| y - y^* + \alpha \nabla_y f(x^*, y) \|_2 \le \frac{L - \mu}{L + \mu}\|y - y^*\|_2 \tag{98} \]
due to $\alpha = 2/(L + \mu)$. Also, we have
\[ \alpha \| \nabla^2_{yx} f(x^*, y^*)(x - x^*) \|_2 \le \frac{2\nu}{L + \mu}\|x - x^*\|_2. \tag{99} \]
Plugging (98) and (99) into (97), we finally obtain (91) with $\theta(x, y) = \alpha \| \bar{R}(x, y) \|_2 = o(\|(x - x^*, y - y^*)\|_2)$.

Lemma 17 Suppose that a sequence of iterates $\{(x_k, y_k)\}_{k=0,1,\dots}$ satisfies
\[ \begin{bmatrix} \|x_{k+1} - x^*\|_2 \\ \|y_{k+1} - y^*\|_2 \end{bmatrix} \le \begin{bmatrix} a & e/b \\ e/c & d \end{bmatrix} \begin{bmatrix} \|x_k - x^*\|_2 \\ \|y_k - y^*\|_2 \end{bmatrix} + \begin{bmatrix} \theta_x(x_k, y_k) \\ \theta_y(x_k, y_k) \end{bmatrix} \tag{100} \]
for some functions $\theta_x$ and $\theta_y$ such that
\[ \theta_x(x, y) = o(\|(x - x^*, y - y^*)\|_2), \qquad \theta_y(x, y) = o(\|(x - x^*, y - y^*)\|_2). \]
If $a, d, e \ge 0$ and $b, c > 0$ satisfy
\[ \rho = \frac{a + d + \sqrt{(a - d)^2 + 4e^2/(bc)}}{2} < 1, \]
where $\rho$ is the largest eigenvalue of the above matrix, then there exists some $\delta > 0$ such that if $\Delta_0 < \delta$, where $\Delta_k = \sqrt{\|x_k - x^*\|_2^2 + \|y_k - y^*\|_2^2}$, we have
\[ \Delta_k^2 \le \prod_{t=0}^{k-1} (\rho + \gamma_t)^2\, \Delta_0^2 \quad \text{and} \quad \lim_{t \to \infty} \gamma_t = 0. \]

Proof From (100), we have
\[ \begin{bmatrix} \|x_{k+1} - x^*\|_2 \\ \|y_{k+1} - y^*\|_2 \end{bmatrix} \le \big( M + N(x_k, y_k) \big) \begin{bmatrix} \|x_k - x^*\|_2 \\ \|y_k - y^*\|_2 \end{bmatrix} \tag{101} \]
where
\[ M = \begin{bmatrix} a & e/b \\ e/c & d \end{bmatrix}, \qquad \epsilon(x, y) = \frac{\max\{\theta_x(x, y), \theta_y(x, y)\}}{\sqrt{\|x - x^*\|_2^2 + \|y - y^*\|_2^2}}, \]
\[ N(x, y) = \frac{\epsilon(x, y)}{\sqrt{\|x - x^*\|_2^2 + \|y - y^*\|_2^2}} \begin{bmatrix} \|x - x^*\|_2 & \|y - y^*\|_2 \\ \|x - x^*\|_2 & \|y - y^*\|_2 \end{bmatrix}. \tag{102} \]
Note that we have $\lim_{(x, y) \to (x^*, y^*)} N_{ij}(x, y) = 0$ for $i, j = 1, 2$. By Lemma 18, there exists a sequence $\{\omega_t\}_{t=0,1,\dots}$ such that
\[ \|M^k\|_2 \le \prod_{t=0}^{k-1} (\rho + \omega_t) \quad \text{and} \quad \lim_{t \to \infty} \omega_t = 0. \]
Since $\rho < 1$, this implies that $\|M^k\|_2$ converges to $0$. Let
\[ \tau = \min\{k : \|M^k\|_2 < 1\}, \qquad \tilde{\rho} = \frac{\|M^\tau\|_2 + 1}{2}, \qquad \rho_{\max} = \max_{1 \le k \le \tau} \|M^k\|_2. \]
For $(x, y)$ near $(x^*, y^*)$, let $\phi(x, y, l)$ denote the $l$th iterate generated from the initial point $(x, y)$, and let
\[ g(x, y, m) = \Big\| \prod_{l=0}^{m-1} \big( M + N(\phi(x, y, l)) \big) \Big\|_2. \tag{103} \]
Due to $N_{ij}(x, y) \to 0$ as $(x, y) \to (x^*, y^*)$, we have $g(x, y, m) \to \|M^m\|_2$ as $(x, y) \to (x^*, y^*)$. Hence, there exists some $\delta_\tau > 0$ such that $g(x, y, \tau) < \tilde{\rho}$ whenever $\Delta(x, y) < \delta_\tau$, and for each $1 \le m < \tau$ there exists some $\delta_m > 0$ such that $g(x, y, m) < 1 + \rho_{\max}$ whenever $\Delta(x, y) < \delta_m$. Taking the minimum of $\delta_m$ over $1 \le m \le \tau$, we obtain $\delta$ satisfying
\[ g(x, y, m) < 1 + \rho_{\max} \ \ (1 \le m < \tau) \quad \text{and} \quad g(x, y, \tau) < \tilde{\rho} \quad \text{whenever} \ \Delta(x, y) < \delta. \tag{104} \]
For any $n \ge 0$, if $\Delta_{n\tau} < \delta$, using (104) for (103) and (101), we have
\[ \Delta_{n\tau + m} \le (1 + \rho_{\max})\, \Delta_{n\tau} \ \ (1 \le m < \tau), \qquad \Delta_{(n+1)\tau} \le \tilde{\rho}\, \Delta_{n\tau}. \tag{105} \]
Suppose that $\Delta_0 < \delta$. Then, by repeatedly applying (105), for any $n \ge 0$ and $0 \le m < \tau$, we have
\[ \Delta_{n\tau + m} \le \tilde{\rho}^{\,n} (1 + \rho_{\max})\, \Delta_0, \]
which implies that $\Delta_k \to 0$ as $k \to \infty$. Let $N_t = N(x_t, y_t)$,
\[ \eta_k = \frac{\big\| \prod_{t=0}^{k} (M + N_t) \big\|_2}{\big\| \prod_{t=0}^{k-1} (M + N_t) \big\|_2} - \frac{\|M^{k+1}\|_2}{\|M^k\|_2}, \qquad \gamma_k = \omega_k + \eta_k. \]
Then, we have
\[ \Big\| \prod_{t=0}^{k} (M + N_t) \Big\|_2 \le \prod_{t=0}^{k} (\rho + \omega_t + \eta_t) = \prod_{t=0}^{k} (\rho + \gamma_t), \tag{106} \]
and, by iterating (101), $\Delta_k \le \big\| \prod_{t=0}^{k-1} (M + N_t) \big\|_2\, \Delta_0$. Since $N_t \to 0$, we have $\eta_t \to 0$, and thus $\lim_{t \to \infty} \gamma_t = 0$. This concludes the proof.
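Writing out the spectrum of $M$ makes the role of this condition explicit (a routine computation, included here for readability):
\[
\det(M - \lambda I) = \lambda^2 - (a + d)\lambda + ad - \frac{e^2}{bc},
\qquad
\lambda_{1,2} = \frac{(a + d) \pm \sqrt{(a - d)^2 + 4e^2/(bc)}}{2},
\]
so both eigenvalues are real, and $\rho = \lambda_1 < 1$ holds exactly when $\sqrt{(a - d)^2 + 4e^2/(bc)} < 2 - (a + d)$.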
Lemma 18 Let $M$ be a $2 \times 2$ matrix
\[ M = \begin{bmatrix} a & e/b \\ e/c & d \end{bmatrix} \]
for some $a > 0$, $b > 0$, $c > 0$, $d \ge 0$, $e \ge 0$, and let $\rho$ be the largest absolute eigenvalue of $M$. Then, there exists a sequence $\{\omega_t\}_{t=0,1,\dots}$ such that
\[ \|M^k\|_2 \le \prod_{t=0}^{k-1} (\rho + \omega_t) \quad \text{and} \quad \lim_{t \to \infty} \omega_t = 0. \]

Proof The characteristic equation reads
\[ \det(M - \lambda I) = \lambda^2 - \lambda(a + d) + ad - \frac{e^2}{bc} = 0 \]
with discriminant
\[ (a - d)^2 + \frac{4e^2}{bc} \ge 0. \]
Thus, all eigenvalues are real. First, we consider the case when $\det(M - \lambda I) = 0$ has a double root. The condition for a double root is
\[ (a - d)^2 + \frac{4e^2}{bc} = 0. \]
Since $b > 0$ and $c > 0$, this implies $a = d$ and $e = 0$. Therefore, $M = aI$ and $\rho = a$. From $M^k = a^k I$, we have $\|M^k\|_2 = a^k = \rho^k$, resulting in
\[ \omega_k = \frac{\|M^{k+1}\|_2}{\|M^k\|_2} - \rho = \rho - \rho = 0, \qquad k \ge 0. \]

Next, we consider the case when $M$ has two distinct eigenvalues $\lambda_1$ and $\lambda_2$. Since $a + d > 0$, we have $\lambda_1 + \lambda_2 > 0$. Without loss of generality, assume $\lambda_1 > \lambda_2$. Then, $\rho = \lambda_1$. Let $v_1$ and $v_2$ be unit eigenvectors corresponding to $\lambda_1$ and $\lambda_2$, respectively. Since $v_1$ and $v_2$ are linearly independent, we can represent each column of $M$ as a linear combination of $v_1$ and $v_2$:
\[ M = [\, \alpha_1 v_1 + \beta_1 v_2 \ \ \ \alpha_2 v_1 + \beta_2 v_2 \,]. \]
By repeatedly multiplying by $M$, we obtain
\[ M^k = [\, \alpha_1 \lambda_1^{k-1} v_1 + \beta_1 \lambda_2^{k-1} v_2 \ \ \ \alpha_2 \lambda_1^{k-1} v_1 + \beta_2 \lambda_2^{k-1} v_2 \,]. \]
Let $C^k = (M^k)^T M^k$. Then, we have
\[ C^k_{11} = \alpha_1^2 \lambda_1^{2(k-1)} + \beta_1^2 \lambda_2^{2(k-1)} + 2\alpha_1\beta_1 (\lambda_1\lambda_2)^{k-1} v_1^T v_2, \]
\[ C^k_{22} = \alpha_2^2 \lambda_1^{2(k-1)} + \beta_2^2 \lambda_2^{2(k-1)} + 2\alpha_2\beta_2 (\lambda_1\lambda_2)^{k-1} v_1^T v_2, \]
\[ C^k_{12} = C^k_{21} = \alpha_1\alpha_2 \lambda_1^{2(k-1)} + \beta_1\beta_2 \lambda_2^{2(k-1)} + (\alpha_1\beta_2 + \alpha_2\beta_1)(\lambda_1\lambda_2)^{k-1} v_1^T v_2. \]
Since $|v_1^T v_2| \le 1$, we have
\[ C^k_{11} \ge \big( |\alpha_1| \lambda_1^{k-1} - |\beta_1| |\lambda_2|^{k-1} \big)^2 \ge 0, \qquad C^k_{22} \ge \big( |\alpha_2| \lambda_1^{k-1} - |\beta_2| |\lambda_2|^{k-1} \big)^2 \ge 0, \]
and
\[ \|M^k\|_2 = \sqrt{ \frac{ C^k_{11} + C^k_{22} + \sqrt{ \big( C^k_{11} - C^k_{22} \big)^2 + 4 (C^k_{12})^2 } }{2} }. \]
Since $|\lambda_2/\lambda_1| < 1$, we have
\[ \lim_{k \to \infty} \frac{C^k_{11}}{\lambda_1^{2(k-1)}} = \alpha_1^2, \qquad \lim_{k \to \infty} \frac{C^k_{22}}{\lambda_1^{2(k-1)}} = \alpha_2^2, \qquad \lim_{k \to \infty} \frac{C^k_{12}}{\lambda_1^{2(k-1)}} = \lim_{k \to \infty} \frac{C^k_{21}}{\lambda_1^{2(k-1)}} = \alpha_1 \alpha_2, \]
so that $\lim_{k \to \infty} \|M^{k+1}\|_2 / \|M^k\|_2 = \lambda_1 = \rho$. Setting $\omega_k = \|M^{k+1}\|_2/\|M^k\|_2 - \rho$, we have $\|M^k\|_2 = \prod_{t=0}^{k-1}(\rho + \omega_t)$ and
\[ \lim_{k \to \infty} \omega_k = \lim_{k \to \infty} \frac{\|M^{k+1}\|_2}{\|M^k\|_2} - \rho = \rho - \rho = 0, \]
and we obtain the desired result.

References

P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006.

Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185-209, 2015.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Nicolas Boumal. Nonconvex phase synchronization. SIAM Journal on Optimization, 26(4):2355-2377, 2016.

Christos Boutsidis, Dan Garber, Zohar Karnin, and Edo Liberty. Online principal components analysis. In ACM-SIAM Symposium on Discrete Algorithms, pages 887-901, 2015.

Stephen Boyd, Neal Parikh, and Eric Chu. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers, 2011.

Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Now Publishers, 2015.

Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176:5-37, 2019.

Murat A. Erdogdu, Asuman Ozdaglar, Pablo A. Parrilo, and Nuri Denizcan Vanli. Convergence rate of block-coordinate maximization Burer-Monteiro method for solving large SDPs. Mathematical Programming, 195(1):243-281, 2022.

Cédric Févotte and Jérôme Idier. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9):2421-2456, 2011.
John G. F. Francis. The QR transformation: a unitary analogue to the LR transformation, part 1. The Computer Journal, 4(3):265-271, 1961.

John G. F. Francis. The QR transformation, part 2. The Computer Journal, 4(4):332-345, 1962.

Robert M. Freund. Dual gauge programs, with applications to quadratic programming and the minimum-norm problem. Mathematical Programming, 38(1):47-67, 1987.

Michael P. Friedlander, Ives Macedo, and Ting Kei Pong. Gauge optimization and duality. SIAM Journal on Optimization, 24(4):1999-2022, 2014.

Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In International Conference on Machine Learning, pages 560-568, 2015.

Gene H. Golub and Henk A. Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1-2):35-65, 2000.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. JHU Press, 2012.

Fang Han and Han Liu. Scale-invariant sparse PCA on high-dimensional meta-elliptical data. Journal of the American Statistical Association, 109(505):275-287, 2014.

Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1-2):165-199, 2017.

Meng Huang and Zhiqiang Xu. Strong convexity of affine phase retrieval. arXiv preprint arXiv:2204.09412, 2022.

Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. Journal of Machine Learning Research, 18(145):1-31, 2017.

Aapo Hyvärinen. Fast ICA for noisy data using Gaussian moments. In IEEE International Symposium on Circuits and Systems VLSI, volume 5, pages 57-61, 1999.

Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.

Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis. John Wiley & Sons, 2004.

Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517-553, 2010.

Cheolmin Kim and Diego Klabjan. A simple and fast algorithm for L1-norm kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8):1842-1855, 2020a.

Cheolmin Kim and Diego Klabjan. Stochastic variance-reduced algorithms for PCA with arbitrary mini-batch sizes. In International Conference on Artificial Intelligence and Statistics, volume 108, pages 4302-4312, 2020b.

Youngseok Kim, Peter Carbonetto, Matthew Stephens, and Mihai Anitescu. A fast algorithm for maximum likelihood estimation of mixture proportions using sequential quadratic programming. Journal of Computational and Graphical Statistics, 29(2):261-273, 2020.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556-562, 2001.

Qi Lei, Kai Zhong, and Inderjit S. Dhillon. Coordinate-wise power method. In Advances in Neural Information Processing Systems, pages 2064-2072, 2016.

Adrian S. Lewis, D. Russell Luke, and Jérôme Malick. Local linear convergence for alternating and averaged nonconvex projections. Foundations of Computational Mathematics, 9(4):485-513, 2009.

Chih-Jen Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19(10):2756-2779, 2007.
Huikang Liu, Man-Chung Yue, and Anthony Man-Cho So. On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM Journal on Optimization, 27(4):2426-2446, 2017.

Ronny Luss and Marc Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65-98, 2013.

James MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, 1967.

R. V. Mises and Hilda Pollaczek-Geiringer. Praktische Verfahren der Gleichungsauflösung. Zeitschrift für Angewandte Mathematik und Mechanik, 9(1):58-77, 1929.

C. L. Müntz. Solution directe de l'équation séculaire et de quelques problèmes analogues. Comptes Rendus de l'Académie des Sciences, 156:43-46, 1913.

Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, 1982.

Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.

Prasanna K. Sahoo and Palaniappan Kannappan. Introduction to Functional Equations. Chapman and Hall/CRC, 2011.

Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, pages 144-152, 2015.

Ohad Shamir. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In International Conference on Machine Learning, pages 248-256, 2016.

Richard A. Tapia, John E. Dennis Jr., and Jan P. Schäfermeyer. Inverse, shifted inverse, and Rayleigh quotient iteration as Newton's method. SIAM Review, 60(1):3-55, 2018.

Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 38(1):49-95, 1996.

Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336-1353, 2013.

James H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1965.

Peng Xu, Bryan He, Christopher De Sa, Ioannis Mitliagkas, and Chris Ré. Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics, pages 58-67, 2018.

Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899-925, 2013.