Journal of Machine Learning Research 24 (2023) 1-47 Submitted 9/19; Revised 10/23; Published 11/23

Scale Invariant Power Iteration

Cheolmin Kim* cheolmkim@u.northwestern.edu
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, 60208, USA

Youngseok Kim* youngseok@uchicago.edu
Department of Statistics, University of Chicago, Chicago, IL, 60637, USA

Diego Klabjan d-klabjan@northwestern.edu
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, 60208, USA

Editor: Zaid Harchaoui

Abstract

We introduce a new class of optimization problems called scale invariant problems that cover interesting problems in machine learning and statistics, and we show that they are efficiently solved by a general form of power iteration called scale invariant power iteration (SCI-PI). SCI-PI is a special case of the generalized power method (GPM) (Journée et al., 2010) where the constraint set is the unit sphere. In this work, we provide a convergence analysis of SCI-PI for scale invariant problems that yields a better rate than the analysis of GPM. Specifically, we prove that it attains local linear convergence with a generalized rate of power iteration to find an optimal solution for scale invariant problems. Moreover, we discuss some extended settings of scale invariant problems and provide similar convergence results. In numerical experiments, we introduce applications to independent component analysis, Gaussian mixtures, and non-negative matrix factorization with the KL-divergence. Experimental results demonstrate that SCI-PI is competitive with application-specific state-of-the-art algorithms and often yields better solutions.

Keywords: scale invariance, power iteration, optimization, convergence analysis, machine learning applications

1. Introduction

We study a new class of optimization problems called scale invariant problems having the form of

$$\text{maximize } f(x) \quad \text{subject to } x \in \mathcal{B}^d \triangleq \{x \in \mathbb{R}^d : \|x\|_2 = 1\}, \qquad (1)$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable, scale invariant function. We say that a function $f$ is scale invariant, as rigorously defined later in Definition 1, if its geometric surface is invariant under constant multiplication of $x$. Many important optimization problems in statistics and machine learning can be formulated as scale invariant problems, for instance, $L^p$-norm kernel PCA and maximum likelihood estimation of mixture proportions, to name a few. Moreover, as studied herein, independent component analysis (ICA, Example 3), Gaussian mixture models (GMM, Example 6), and non-negative matrix factorization with the Kullback-Leibler divergence (KL-NMF, Example 5) can be formulated as extended settings of scale invariant problems where the objective function is a sum of scale invariant functions or the problem is scale invariant with respect to a subset of variables while the other variables are fixed. Since $\mathcal{B}^d$ is not a convex set, scale invariant problems are in general non-convex optimization problems. Nevertheless, some instances can be efficiently solved, for instance, the leading eigenvector problem (Golub and Van Loan, 2012).

*. The two authors contributed equally to this paper.

©2023 Cheolmin Kim, Youngseok Kim and Diego Klabjan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/19-784.html.
Power iteration (Muntz, 1913; Mises and Pollaczek-Geiringer, 1929) is an algorithm to find the leading eigenvector of a diagonalizable matrix $A$. In power iteration, the update $x_{k+1} \leftarrow Ax_k/\|Ax_k\|_2$ is repeatedly applied until some stopping criterion is satisfied. Since no hyperparameter is required, this update rule is practical, and at the same time it attains global linear convergence with the rate $|\lambda_2|/|\lambda_1|$, where $|\lambda_i|$ is the $i$th largest absolute eigenvalue of $A$ (Wilkinson, 1965; Golub and Van Loan, 2012). The linear convergence property of power iteration has been extended to many applications. However, theoretical understanding of when and how such algorithms enjoy this attractive convergence property of power iteration is limited. For example, for convex $f$, a general form of power iteration called the generalized power method (GPM) (Journée et al., 2010) has been shown to attain only a global sublinear convergence rate, with iteration complexity $O(1/\epsilon)$, not generalizing the appealing linear convergence of power iteration. For the historical development of power iteration, see Golub and Van der Vorst (2000); Tapia et al. (2018).

In this work, we present scale invariant problems that generalize the leading eigenvector problem in the sense that any stationary point $x^*$ of (1) satisfying $\nabla f(x^*) = \lambda^* x^*$ for some $\lambda^*$ is an eigenvector of $\nabla^2 f(x^*)$. By this property, scale invariant problems can be seen as the leading eigenvector problem near a local optimum $x^*$, so we can expect that a general form of power iteration would work well for them. By swapping the objective function and the constraint, we obtain a geometrically interpretable dual problem with the goal of finding the closest point $w$ to the origin on the constraint set $f(w) = 1$. By mapping an iterate $x_k$ to the dual space, taking a descent step in the dual space, and mapping it back to the original space, we geometrically derive scale invariant power iteration (SCI-PI), which replaces $Ax_k$ with $\nabla f(x_k)$ in power iteration. SCI-PI is the same algorithm as GPM applied to the unit sphere constraint. However, we improve the convergence rate of GPM for scale invariant problems, showing that the algorithm attains local linear convergence with a generalized rate of power iteration when initialized close to a local maximum. To the best of our knowledge, this is the first work exploiting the properties of scale invariant problems. Also, this is the first linear convergence result of GPM for general optimization problems. This improvement is significant since with linear convergence, the iteration complexity to attain an $\epsilon$-optimal solution reduces from $O(1/\epsilon)$ to $O(\log(1/\epsilon))$. Moreover, under some mild conditions, we provide an explicit expression for the initial condition on $\|x_0 - x^*\|_2$ that ensures convergence.

In the extended settings (Section 4), we discuss three variants of (1). In the first setting, we consider a sum of scale invariant functions (Subsection 4.1) as an objective function. This setting covers Kurtosis-based ICA and can be solved by SCI-PI with similar convergence guarantees. Second, we consider a block version of scale invariant problems (Subsection 4.2), which covers KL-NMF and the Burer-Monteiro factorization of semi-definite programs. To solve this block scale invariant problem, we present a block version of SCI-PI and show that it attains linear convergence in a two-block case. Lastly, we consider partially scale invariant problems (Subsection 4.3), which include general mixture problems such as GMM. For these partially scale invariant problems, we present an alternating algorithm based on SCI-PI and gradient ascent along with its convergence analysis.
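To make the two update rules above concrete, the following minimal sketch in Julia (the language also used for our experiments in Section 5) implements classical power iteration and SCI-PI, which replaces $Ax_k$ with $\nabla f(x_k)$; for $f(x) = \frac{1}{2}x^T Ax$ the two coincide. The function names and the fixed iteration budget are illustrative assumptions, not part of the paper.

```julia
using LinearAlgebra

# Power iteration: x_{k+1} = A*x_k / ||A*x_k||_2; converges to the leading
# eigenvector at rate |λ2|/|λ1| when the eigen-gap is positive.
function power_iteration(A::AbstractMatrix, x0::AbstractVector; iters::Int = 1000)
    x = normalize(x0)
    for _ in 1:iters
        x = normalize(A * x)
    end
    return x
end

# SCI-PI: the same loop with A*x_k replaced by the gradient ∇f(x_k).
function sci_pi(∇f::Function, x0::AbstractVector; iters::Int = 1000)
    x = normalize(x0)
    for _ in 1:iters
        g = ∇f(x)
        iszero(norm(g)) && return x   # stationary point: stop (cf. Algorithm 1)
        x = normalize(g)
    end
    return x
end

A  = Symmetric(randn(20, 20))
v1 = power_iteration(A, randn(20))
v2 = sci_pi(x -> A * x, randn(20))    # ∇f(x) = Ax for f(x) = x'Ax/2
```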
In numerical experiments, we benchmark the proposed algorithms against state-of-the-art methods for ICA, KL-NMF, and GMM. The experimental results show that our algorithms are computationally competitive and result in better solutions in most, if not all, of the studied cases.

In summary, this work has the following contributions.

1. We introduce scale invariant problems, which cover interesting examples in statistics and machine learning. By the eigenvector property (Proposition 4), they resemble the leading eigenvector problem near a local optimum $x^*$.

2. For scale invariant problems, we prove that SCI-PI (a special form of GPM) converges to a local maximum $x^*$ at a linear rate, i.e., within $O(\log(1/\epsilon))$ iterations, when initialized close to $x^*$. This generalizes the attractive convergence property of power iteration. Moreover, we introduce three extended settings of scale invariant problems along with solution algorithms and their convergence analyses.

3. We report numerical experiments including a novel reformulation of KL-NMF as a block scale invariant problem. The experimental results demonstrate that SCI-PI is not only computationally competitive with state-of-the-art methods but also often yields better solutions.

The paper is organized as follows. In Section 2, we define scale invariance and present interesting properties of scale invariant problems including an eigenvector property and a dual formulation. We then provide a geometric derivation of SCI-PI and a convergence analysis in Section 3. The extended settings are discussed in Section 4 and we report the numerical experiments in Section 5. We finish the introduction with a literature review and a notation paragraph.

1.1 Related Works

Power Iteration The global linear convergence property of power iteration is analogous to that of gradient descent for convex optimization. Accordingly, many variants including coordinate-wise (Lei et al., 2016), momentum (Xu et al., 2018), online (Boutsidis et al., 2015; Garber et al., 2015), stochastic (Oja, 1982), stochastic variance-reduced (Shamir, 2015, 2016; Xu et al., 2018; Kim and Klabjan, 2020b), and truncated (Yuan and Zhang, 2013; Han and Liu, 2014) power iterations have been developed, drawing a literature parallel to that of gradient descent for convex optimization. We discover a class of optimization problems which can be locally seen as the leading eigenvector problem, and prove that they can be efficiently solved by a general form of power iteration.

Generalized Power Method (GPM) GPM (Journée et al., 2010) is an iterative algorithm that finds the next iterate by projecting the gradient at the current iterate onto the constraint set. GPM has been applied to statistical problems such as sparse principal component analysis (PCA) (Journée et al., 2010; Luss and Teboulle, 2013) and L1-norm kernel PCA (Kim and Klabjan, 2020a). While GPM has a general form of power iteration, its convergence analysis does not extend the attractive convergence property of power iteration. For example, only global sublinear convergence has been shown for convex $f$. We generalize the local linear convergence property of power iteration to scale invariant problems.

Block Power Iteration A block version of power iteration has been developed to solve the phase synchronization problem (Boumal, 2016).
If the problem consists of a single block and the shift parameter is set to zero, this algorithm specializes to power iteration. Under some conditions on the measurement noise and the initial iterate, it attains linear convergence to a global solution (Liu et al., 2017). To solve the Burer-Monteiro factorization (Burer and Monteiro, 2003) of semi-definite programs (Vandenberghe and Boyd, 1996), Erdogdu et al. (2022) developed a block coordinate maximization (BCM) algorithm. By iteratively sampling a block and applying power iteration to it, BCM attains local linear convergence as well as global sublinear convergence. However, the linear convergence property of block power iteration has not been extended to more general settings. In this work, we prove that block variants of SCI-PI attain linear convergence for block and partially scale invariant problems.

Alternating Minimization Alternating algorithms have been developed for many applications such as k-means clustering (MacQueen, 1967), Gaussian mixture models (Bishop, 2006), dictionary learning (Olshausen and Field, 1997; Aharon et al., 2006), matrix completion (Candès and Recht, 2009), matrix factorization (Lee and Seung, 2001), and finding a point in the intersection of two closed sets (Lewis et al., 2009). Beck (2015) studied alternating minimization and proved that it achieves sublinear convergence for convex programming. For optimization problems with a separable convex objective function and a linear constraint, the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) has been shown to attain linear convergence (Hong and Luo, 2017). However, due to the exact minimization step, ADMM can incur a high per-iteration cost. Instead of performing exact minimization, our algorithms alternately apply simple steps to update blocks, but at the same time they achieve local linear convergence.

Manifold Optimization Viewed as an optimization problem on the real projective space, a scale invariant problem can be reformulated to an equivalent problem in the embedding space. The reformulated problem is unconstrained in the embedding space but it has a highly non-convex structure, e.g., the maximization of the Rayleigh quotient. In order to solve the reformulated problem, general algorithms for unconstrained non-convex optimization such as gradient and Newton methods with line search, and the trust region method (Absil et al., 2009), can be employed. Rather than working in the embedding space, we focus on a generalization of power iteration.

Gauge Optimization A gauge function, which is a nonnegative, convex, and positively homogeneous function that vanishes at the origin, is a multiplicatively scale invariant function. Gauge functions include norms and pseudonorms as special cases and generalize the notion of a norm. The gauge program that minimizes a gauge function over a convex set is introduced in (Freund, 1987) and further studied in (Friedlander et al., 2014). The literature on gauge optimization is mainly about developing and studying dual problems. Conversely, we develop a simple numerical algorithm that solves the primal problem.

1.2 Notation

Let $\mathbb{R}$ and $\mathbb{R}_+$ denote the set of real numbers and the set of non-negative real numbers, respectively. Let $d$ be the dimension of the optimization variable $x$. Let $\mathbb{R}^d$ denote the set of $d$-dimensional real vectors and $f$ be a function from $\mathbb{R}^d$ to $\mathbb{R}$. We denote the gradient and Hessian of a function $f$ as $\nabla f$ and $\nabla^2 f$.
Let $u$ and $v$ be functions from $\mathbb{R}$ to $\mathbb{R}_+$ and from $\mathbb{R} \setminus \{0\}$ to $\mathbb{R}$, representing multiplicative and additive factor functions, respectively, and let $p$ be the degree of a scale invariant function, which equals the degree of homogeneity for a multiplicatively scale invariant function and 0 for an additively scale invariant function. We use $(\lambda_i, v_i)$ and $(s_i, u_i)$ to represent eigen-pairs. The $j$th element of $v_i$ is denoted as $v_{i,j}$. Let $k$ be the iteration index, and we denote the sequences of iterates and function values by $\{x_k\}_{k=0,1,\ldots}$ and $\{f(x_k)\}_{k=0,1,\ldots}$, respectively. Lastly, we let $\odot$, $\oslash$ and $(\cdot)^{\odot 2}$ denote element-wise product, division and square, respectively, and let $1_n \in \mathbb{R}^n$ denote the vector of $n$ ones.

2. Scale Invariant Problems

Before presenting properties of scale invariant problems, we first define scale invariant functions.

Definition 1 We say that a function $f : \mathbb{R}^d \to \mathbb{R}$ is multiplicatively scale invariant if it satisfies
$$f(cx) = u(c) f(x) \qquad (2)$$
for some even function $u : \mathbb{R} \to \mathbb{R}_+$ with $u(0) = 0$. Also, we say that $f : \mathbb{R}^d \setminus \{0\} \to \mathbb{R}$ is additively scale invariant if it satisfies
$$f(cx) = f(x) + v(c) \qquad (3)$$
for some even function $v : \mathbb{R} \setminus \{0\} \to \mathbb{R}$ with $v(1) = 0$.

The following proposition characterizes the exact forms of $u$ and $v$ for continuous $f$.

Proposition 2 If a continuous function $f \not\equiv 0$ satisfies (2) with a multiplicative factor $u$, then we have
$$u(c) = |c|^p \qquad (4)$$
for some $p > 0$. Also, if a continuous function $f$ satisfies (3) with an additive factor $v$, then we have
$$v(c) = \log_a |c| \qquad (5)$$
for some $a$ such that $0 < a$ and $a \neq 1$.

Using the explicit forms of $u$ and $v$ in Proposition 2, we establish derivative-based properties of scale invariant functions below.

Proposition 3 Suppose that $f$ is twice differentiable. If $f$ satisfies (2) with a multiplicative factor $u(c) = |c|^p$, we have
$$c\,\nabla f(cx) = |c|^p\,\nabla f(x), \quad \nabla f(x)^T x = p f(x), \quad \nabla^2 f(x)\,x = (p-1)\,\nabla f(x). \qquad (6)$$
Also, if $f$ satisfies (3) with an additive factor $v(c) = \log_a |c|$, we have
$$c\,\nabla f(cx) = \nabla f(x), \quad \nabla f(x)^T x = \frac{1}{\log(a)}, \quad \nabla^2 f(x)\,x = -\nabla f(x). \qquad (7)$$

Proposition 3 states that a scale invariant function $f$ satisfies that $\nabla^2 f(x)x$ is a scalar multiple of $\nabla f(x)$. Let
$$L(\lambda, x) = f(x) + \lambda\,(1 - \|x\|_2)$$
be the Lagrange function of (1) and $(\lambda^*, x^*)$ be a stationary point satisfying $\nabla L(\lambda^*, x^*) = 0$ such that
$$\nabla f(x^*) = \lambda^* x^*, \qquad \|x^*\|_2 = 1. \qquad (8)$$
In the next proposition, we derive an eigenvector property which states that for any stationary point $(\lambda^*, x^*)$ of (1) satisfying (8), $x^*$ is an eigenvector of $\nabla^2 f(x^*)$.

Proposition 4 Suppose that $f$ is twice differentiable and let $(\lambda^*, x^*)$ be a stationary point of (1) satisfying (8). If $f$ satisfies (2) with $u(c) = |c|^p$, then we have $\nabla^2 f(x^*)\,x^* = (p-1)\lambda^* x^*$. Also, if $f$ satisfies (3) with $v(c) = \log_a |c|$, then we have $\nabla^2 f(x^*)\,x^* = -\lambda^* x^*$. In both cases, $x^*$ is an eigenvector of $\nabla^2 f(x^*)$. Moreover, if $\lambda^*$ is greater than the largest eigenvalue of $\nabla^2 f(x^*)(I - x^*(x^*)^T)$, then $x^*$ is a local maximum of (1).

Proof If $f$ is multiplicatively scale invariant with degree $p$, by Proposition 3, we have $\nabla^2 f(x^*)\,x^* = (p-1)\,\nabla f(x^*) = (p-1)\lambda^* x^*$. Also, by Proposition 3, if $f$ is additively scale invariant, we have $\nabla^2 f(x^*)\,x^* = -\nabla f(x^*) = -\lambda^* x^*$. Therefore, in both cases, a stationary point $x^*$ is an eigenvector of $\nabla^2 f(x^*)$. Suppose that $\lambda^*$ is greater than the largest eigenvalue of $\nabla^2 f(x^*)(I - x^*(x^*)^T)$. For any nonzero $h$ satisfying $h^T x^* = 0$, we have
$$h^T \nabla^2_{xx} L(x^*, \lambda^*)\,h = h^T \big(\nabla^2 f(x^*) - \lambda^* (I - x^*(x^*)^T)\big) h = h^T \nabla^2 f(x^*)(I - x^*(x^*)^T) h - \lambda^* \|h\|_2^2 < 0.$$
Since the second-order sufficient condition is satisfied, $x^*$ is a local maximum.

Proposition 4 states that a stationary point $x^*$ is an eigenvector of $\nabla^2 f(x^*)$.
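As a concrete illustration, for $f(x) = \frac{1}{2}x^T Ax$ with symmetric $A$ (multiplicatively scale invariant with $p = 2$), every unit eigenvector of $A$ is a stationary point of (1), and the Lagrange multiplier $\lambda^*$ is the Rayleigh quotient. The following minimal Julia check of Propositions 3 and 4 uses a random test instance, which is an illustrative assumption rather than content from the paper.

```julia
using LinearAlgebra

# f(x) = x'Ax/2 satisfies f(cx) = |c|^2 f(x): multiplicatively scale invariant, p = 2.
A  = Symmetric(randn(5, 5))
f(x)  = dot(x, A * x) / 2
∇f(x) = A * x                       # the Hessian is ∇²f(x) = A

x = normalize(randn(5))
@assert dot(∇f(x), x) ≈ 2 * f(x)    # Euler identity: ∇f(x)'x = p f(x)
@assert A * x ≈ (2 - 1) * ∇f(x)     # ∇²f(x)x = (p − 1)∇f(x)

λs, V = eigen(A)                    # stationary points of (1) are eigenvectors of A
xstar = V[:, end]                   # leading unit eigenvector
λstar = dot(xstar, ∇f(xstar))       # Lagrange multiplier λ*: ∇f(x*) = λ* x*
@assert ∇f(xstar) ≈ λstar * xstar
```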
Note that the Lagrange multiplier $\lambda^*$ is not necessarily an eigenvalue corresponding to $x^*$. The eigenvalue corresponding to $x^*$ is $(p-1)\lambda^*$ if $f$ is multiplicatively scale invariant, or $-\lambda^*$ if $f$ is additively scale invariant. The second-order sufficient condition for local optimality requires that the Lagrange multiplier $\lambda^*$, rather than the eigenvalue corresponding to $x^*$, is greater than the largest eigenvalue of $\nabla^2 f(x^*)(I - x^*(x^*)^T)$. Due to this eigenvector property, scale invariant problems can be considered as a generalization of the leading eigenvector problem. Next, we introduce a dual formulation of scale invariant problems.

Proposition 5 Suppose that the objective function $f$ is continuous and either multiplicatively scale invariant with a positive optimal value and a multiplicative factor $u(c) = |c|^p$ such that $p > 0$, or additively scale invariant having an additive factor $v(c) = \log_a |c|$ such that $a > 1$. Then, solving (1) is equivalent to solving the following optimization problem:
$$\text{minimize } \|w\|_2 \quad \text{subject to } f(w) = 1. \qquad (9)$$
In other words, if $x^*$ is an optimal solution to (1), then $w^* = x^*/f(x^*)^{1/p}$ (multiplicative case) or $w^* = a^{1 - f(x^*)} x^*$ (additive case) is an optimal solution to (9). Conversely, if $w^*$ is an optimal solution to (9), then $x^* = w^*/\|w^*\|_2$ is an optimal solution to (1).

For a multiplicatively scale invariant $f$ having a negative optimal value and a multiplicative factor $u(c) = |c|^p$ with $p > 0$, we can derive a similar reformulation by replacing $f(w) = 1$ with $f(w) = -1$. On the other hand, for an additively scale invariant $f$ having an additive factor $v(c) = \log_a |c|$ such that $0 < a < 1$, we obtain a maximization problem with the same objective function and constraint. The dual formulation (9) has a nice geometric interpretation: an optimal solution $w^*$ is the closest point to the origin on $\{w : f(w) = 1\}$. We use this understanding to derive SCI-PI in Section 3. Lastly, we introduce two well-known examples of scale invariant problems in machine learning and statistics.

Example 1 ($L^p$-norm Kernel PCA) Given data vectors $a_i \in \mathbb{R}^d$ and a mapping $\Phi$, $L^p$-norm kernel PCA considers
$$\text{maximize } \frac{1}{n} \sum_{i=1}^n |\Phi(a_i)^T x|^p \quad \text{subject to } x \in \mathcal{B}^d \qquad (10)$$
where the objective function satisfies property (2) with $u(c) = |c|^p$.

Example 2 (Estimation of Mixture Proportions) Given a design matrix $L \in \mathbb{R}^{n \times d}$ satisfying $L_{ij} \ge 0$, the problem of estimating mixture proportions seeks to find a vector $\pi$ of mixture proportions on the probability simplex $S^d = \{\pi : \sum_{j=1}^d \pi_j = 1,\ \pi \ge 0\}$ that solves
$$\text{maximize } \frac{1}{n} \sum_{i=1}^n \log\Big(\sum_{j=1}^d L_{ij}\pi_j\Big) \quad \text{subject to } \pi \in S^d. \qquad (11)$$
By reparametrizing $\pi_j$ by $x_j^2$, we obtain an equivalent optimization problem
$$\text{maximize } \frac{1}{n} \sum_{i=1}^n \log\Big(\sum_{j=1}^d L_{ij} x_j^2\Big) \quad \text{subject to } x \in \mathcal{B}^d, \qquad (12)$$
which now satisfies property (3) with $v(c) = 2 \log |c|$.

The reformulation idea in Example 2 implies that any simplex-constrained problem with scale invariant $f$ can be reformulated to a scale invariant problem.

3. Scale Invariant Power Iteration

In this section, we provide a geometric derivation of SCI-PI to find a local optimal solution of (1). The algorithm is developed using the geometric interpretation of the dual formulation (9), as illustrated in Figure 1. Starting with an iterate $x_k \in \mathcal{B}^d$, we obtain a dual iterate $w_k$ by mapping $x_k$ to the constraint set $f(w) = 1$. Given $w_k$, we identify the hyperplane $l_k$ on which the current iterate $w_k$ lies and which is tangent to $f(w) = 1$. After identifying the equation of $l_k$, we find the closest point $z_k$ to the origin on $l_k$ and obtain a new dual iterate $w_{k+1}$ by mapping $z_k$ to the constraint $f(w) = 1$.
Finally, we obtain a new primal iterate $x_{k+1}$ by mapping $w_{k+1}$ back to the set $\mathcal{B}^d$.

Now, we develop an algorithm based on the above idea. For the derivation of the algorithm, we assume that the objective function $f$ is differentiable and satisfies either (2) with $u(c) = |c|^p$ where $p > 0$ and $f(x) > 0$ for all $x \in \mathcal{B}^d$, or (3) with $v(c) = \log_a |c|$ where $a > 1$. Under these conditions, the mapping from $x_k$ to $w_k$ is well defined as $w_k = x_k/f(x_k)^{1/p}$ or $w_k = a^{1 - f(x_k)} x_k$, respectively. Let $w_k = c_k x_k$. Since $w_k$ is on the constraint set $f(w) = 1$, the normal vector of the hyperplane $l_k$ is $\nabla f(w_k)$. Therefore, we can write down the equation of the hyperplane as $l_k = \{w : \nabla f(w_k)^T (w - w_k) = 0\}$. Note that $z_k$ is a scalar multiple of $\nabla f(w_k)$, where the scalar can be determined from the requirement that $z_k$ lies on $l_k$. Since $w_{k+1}$ is the projection of $z_k$, it must be a scalar multiple of the normal vector $y_k = \nabla f(w_k)$. Therefore, we can write $w_{k+1}$ as $w_{k+1} = d_k y_k$. Finally, by projecting $w_{k+1}$ onto $\mathcal{B}^d$, we obtain
$$x_{k+1} = \frac{w_{k+1}}{\|w_{k+1}\|_2} = \frac{d_k y_k}{\|d_k y_k\|_2} = \frac{y_k}{\|y_k\|_2} = \frac{\nabla f(w_k)}{\|\nabla f(w_k)\|_2} = \frac{\nabla f(c_k x_k)}{\|\nabla f(c_k x_k)\|_2} = \frac{\nabla f(x_k)}{\|\nabla f(x_k)\|_2}$$
where the last equality follows from Proposition 3. The update rule is the linear optimization oracle on $\mathcal{B}^d$. In the sense that SCI-PI finds an optimal solution by solving a sequence of linear optimization problems, it is similar to the Frank-Wolfe algorithm (also called conditional gradients) and online linear prediction algorithms (Huang et al., 2017). Summarizing all the above, we obtain SCI-PI, presented in Algorithm 1.

Figure 1: Geometric derivation of SCI-PI

Algorithm 1 SCI-PI
Input: initial point $x_0 \in \mathcal{B}^d$
$k \leftarrow 0$
while $\nabla f(x_k) \neq 0$ do
  $x_{k+1} \leftarrow \nabla f(x_k)/\|\nabla f(x_k)\|_2$
  $k \leftarrow k + 1$
end while
Output: $x_k$

Note that in Figure 1, $\{\|w_k\|_2\}_{k=0,1,\ldots}$ is non-increasing if the sublevel set $\{w : f(w) \le 1\}$ is convex. Since all sublevel sets of a quasi-convex function are convex, we can expect that SCI-PI yields an ascent step if $f$ is quasi-convex (not necessarily scale invariant). See Proposition 11 in Appendix B for the convergence to a stationary point for quasi-convex $f$. If $f$ is not quasi-convex, the sequence $\{f(x_k)\}_{k=0,1,\ldots}$ is not necessarily increasing, making it hard to analyze global convergence. Exploiting the eigenvector property, we study local convergence of SCI-PI for scale invariant $f$ below.

Theorem 6 Let $f$ be a scale invariant, twice continuously differentiable function on an open set containing $\mathcal{B}^d$. Let $x^*$ be a local maximum such that $\nabla f(x^*) = \lambda^* x^*$, and let $(\lambda_i, v_i)$ be an eigen-pair of $\nabla^2 f(x^*)$ with $x^* = v_1$. If $\lambda^* > \lambda_2 = \max_{2 \le i \le d} |\lambda_i|$, then there exists some $\delta > 0$ such that under the initial condition $\|x_0 - x^*\|_2 < \delta$, the sequence of iterates $\{x_k\}_{k=0,1,\ldots}$ generated by SCI-PI satisfies
$$\|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1}\Big(\frac{\lambda_2}{\lambda^*} + \gamma_t\Big)^2 \|x_0 - x^*\|_2^2,$$
where $\lambda_2/\lambda^* + \gamma_t < 1$ for all $t \ge 0$ and $\lim_{k \to \infty} \gamma_k = 0$. Moreover, if $\partial_j f = \partial f/\partial x_j$ has a continuous Hessian $H_j$ on an open set containing $\bar{\mathcal{B}}^d = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$, we can explicitly write $\delta$ as
$$\delta(\lambda^*, \bar{\lambda}_1, \lambda_2, M) = \min\Big\{\frac{2\lambda^*}{\lambda^* + |\lambda^* - \bar{\lambda}_1 - M|},\ \frac{2(\lambda^* - \lambda_2)}{\lambda^* - \lambda_2 + M + |\lambda^* - \bar{\lambda}_1 - M|}\Big\}$$
where $\bar{\lambda}_1 = |\lambda_1|$ and
$$M = \max_{x \in \bar{\mathcal{B}}^d,\, y_1,\ldots,y_d \in \bar{\mathcal{B}}^d} \sqrt{\textstyle\sum_{i=1}^d (x^T G_i(y_1,\ldots,y_d)\,x)^2}, \qquad G_i(y_1,\ldots,y_d) = \textstyle\sum_{j=1}^d v_{i,j}\,H_j(y_j).$$

Proof Since $\nabla^2 f(x^*)$ is real and symmetric, without loss of generality, we assume that $\{v_1,\ldots,v_d\}$ form an orthogonal basis in $\mathbb{R}^d$. Since $f$ is twice continuously differentiable on an open set containing $\mathcal{B}^d$, for $x \in \mathcal{B}^d$, using the Taylor expansion of $\nabla f(x)^T v_i$ at $x^*$, we have
$$\nabla f(x)^T v_i = \nabla f(x^*)^T v_i + (x - x^*)^T \nabla^2 f(x^*) v_i + R_i(x) \qquad (13)$$
where
$$R_i(x) = o(\|x - x^*\|_2). \qquad (14)$$
From $\nabla f(x^*) = \lambda^* x^*$ and $x^* = v_1$, we have
$$\nabla f(x)^T v_1 = \nabla f(x^*)^T x^* + (x - x^*)^T \nabla^2 f(x^*)\,x^* + R_1(x) = \lambda^* - \lambda_1(1 - x^T x^*) + R_1(x) = \lambda^* + \alpha(x) \qquad (15)$$
where $\alpha(x) = -\lambda_1(1 - x^T x^*) + R_1(x)$
(16) Kim, Kim and Klabjan On the other hand, for 2 i d, due to f(x ) = λ x , x = v1 and the orthogonality of {v1, . . . , vd}, we have f(x )T vi = λ (x )T vi = 0. (17) From (13), this results in f(x)T vi = λix T vi + Ri(x). (18) Using (18) and the definition of λ2, we have f(x)T vi 2 = h λ2 i (x T vi)2 + 2λi(x T vi)Ri(x) + (Ri(x))2i i=2 (x T vi)2 + 2 λ2 i=2 |x T vi||Ri(x)| + i=2 (Ri(x))2 . By the Cauchy Schwartz inequality, we have i=2 |x T vi||Ri(x)| λ2 i=2 (x T vi)2 i=2 (Ri(x))2, which results in f(x)T vi 2 λ2 2 i=2 (x T vi)2 + 2 λ2 i=2 (x T vi)2 i=2 (Ri(x))2 + i=2 (Ri(x))2 i=2 (x T vi)2 + β(x) i=2 (Ri(x))2. (21) Using 1 x T x = o( x x 2) and (14) for (16) and (21), we have α(x) = o( x x 2), β(x) = o( x x 2). (22) From x Bd, x = v1, and the fact that {v1, , vd} forms an orthogonal basis in Rd, we have d X i=2 (x T vi)2 = 1 (x T v1)2 = 1 (x T x )2 2(1 x T x ) = x x 2 2. (23) Therefore, by (15), (20), (22), (23), and Lemma 12, we obtain the first part of the desired result. Scale Invariant Power Iteration On the other hand, if jf has a continuous Hessian Hj, by Lemma 14, we have i=1 (Ri(x))2 1 4M2 x x 4 2. (24) Using (24) for (16) and (21), we have α(x) = λ1(1 x T k x ) + R1(x) (M + |λ1|)(1 x T x ), i=2 (Ri(x))2 M 2 x x 2 2. (25) Therefore, using (15), (20), (23), (25) and Lemma 12 with A = λ , B = M + |λ1|, C = 0, D = λ2, E = 0, F = M, we obtain the desired result. Theorem 6 states local convergence of SCI-PI with an asymptotic rate of λ / λ2. Note that the assumption that the Lagrange multiplier λ corresponding to a local maximum x satisfies λ > λ2 = max2 i d |λi| holds for all strict local maxima if f is convex, multiplicatively scale invariant with p 1 since λi 0 for all i and λ = (p 1)λ1, according to Proposition 4. However, in general, not all local maxima satisfy this assumption since it is stronger than the second-order sufficient condition stated as λ > max2 i d λi in Proposition 4. Nevertheless, by adding σ x 2 for some σ > 0 to the objective function f, we can always enforce λ > λ2. Conversely, by adding σ x 2 for some σ < 0, we may improve the convergence rate as done by shifted power iteration (Golub and Van Loan, 2012). The convergence rate of γk is o(1) for twice continuously differentiable f. If jf has a continuous Hessian, we further have γk = O( xk x 2). (For the derivations of the convergence rate of γk, see the proofs of Lemmas 12 and 14.) The non-convexity of the objective function hinders the attainment of global guarantees for SCI-PI. While Theorem 6 establishes a condition on the initial iterate that guarantees local convergence, finding an initial point sufficiently close to a global optimal solution is a challenging task. To ensure global guarantees, additional conditions on problem-specific parameters are necessary. For example, mild assumptions on problem parameters in affine phase retrieval render the objective function strongly convex, leading to global optimality (Huang and Xu, 2022). Nevertheless, local convergence remains an appealing property since random initialization often provides a satisfactory starting point (Chen et al., 2019). Reduction to power iteration For the leading eigenvector problem to find the leading eigenvector of a positive semi-definite matrix A 0, the objective function is f(x) = 1 2x T Ax, and thus SCI-PI specializes to power iteration. Let λi be the ith largest eigenvalue of A. The condition λ > λ2 in Theorem 6 is interpreted as the positive eigen-gap λ1 λ2 > 0 assumption. 
The convergence result in Theorem 6 not only matches the convergence rate of λ1/λ2 but also restores the initial condition δ < 2 or x T 0 x > 0 of power iteration since M = 0. Kim, Kim and Klabjan Comparision to generalized power method Under the spherical constraint, generalized power method (Journée et al., 2010) has the same update rule as SCI-PI. Generalized power method has shown to attain sublinear convergence for convex f. Local linear convergence in Theorem 6 has been not shown for generalized power method. This convergence result established for a scale invariant objective function with the spherical constraint is extended to various settings in the next section. 4. Extended Settings 4.1 Sum of Scale Invariant Functions Consider a sum of scale invariant functions of the form f(x) = Pm i=1 f M i (x) + Pn j=1 f A j (x) where f M i is multiplicatively scale invariant with ui(c) = |c|pi and f A j is additively scale invariant with vj(c) = logaj |c|. Note that this does not imply that f is scale invariant in general. Thus, a stationary point x satisfying f(x ) = λ x is not necessarily an eigenvector of 2f(x ). Instead, a stationary point x is an eigenvector of F(x ) defined as j=1 2f A j (x). i=1 f M i (x) + j=1 f A j (x) = F(x)x by Proposition 3. Here is an example that involves a sum of scale invariant functions. Example 3 (Kurtosis-based ICA) Given a pre-processed data matrix W Rn d, Kurtosisbased ICA (Hyvärinen and Oja, 2000) solves maximize 1 n (w T i x)4 3 2 subject to x Bd. (26) The objective function f is a sum of scale invariant functions. We present a local convergence analysis of SCI-PI for a sum of scale invariant functions as follows. Theorem 7 Let f be a sum of scale invariant functions and twice continuously differentiable on an open set containing Bd. Let x be a local maximum such that f(x ) = λ x and {v1, , vd} be a set of eigenvectors of F(x ) with x = v1. If λ > λ2 = 2f(x )(I x (x )T ) 2, then there exists some δ > 0 such that under the initial condition x0 x 2 < δ, the sequence of iterates {xk}k=0,1, generated by SCI-PI satisfies 2 x0 x 2 2, Scale Invariant Power Iteration λ2 λ + γt < 1 for all t 0 and lim k γk = 0. Moreover, if jf = f/ xj has a continuous Hessian Hj on an open set containing Bd, we can explicitly write δ as δ(λ , λ1, λ2, M) = 2(λ λ2) λ + M + λ1 + |λ M| 2 2f(x )x 2 and M = max x Bd,y1, ,yd Bd q Pd i=1(x T Gi(y1, , yd)x)2, Gi(y1, , yd) = Pd j=1 vi,j Hj(yj). Proof By the stationarity condition, a local optimal solution x is an eigenvector of F(x ). Since F(x ) is real and symmetric, without loss of generality, we assume that {v1, , vd} form an orthogonal basis in Rd. Since f is twice continuously differentiable on an open set containing Bd, for x Bd, using the Taylor expansion of f(x)T vi at x , we have f(x)T vi = f(x )T vi + (x x )T 2f(x )vi + Ri(x) (27) where Ri(x) = o( x x 2). (28) Using (27) with i = 1 and f(x ) = λ x , we obtain f(x)T v1 = λ (x )T v1 + (x x )T 2f(x )v1 + R1(x) = λ + α(x) (29) where α(x) = (x x )T 2f(x )v1 + R1(x). (30) Using (27) and f(x ) = λ x for 2 i d, we have f(x)T vi = λ (x )T vi + (x x )T 2f(x )vi + Ri(x) = (x x )T 2f(x )vi + Ri(x), resulting in i=2 ( f(x)T vi)2 = (x x )T 2f(x )vi + Ri(x) 2 . (31) From x = v1 and the fact that {v1, , vd} forms an orthogonal basis in Rd, we have (x x )T 2f(x )vi 2 = 2f(x )(x x ) 2 2 (x x )T 2f(x )v1 2 = (x x )T 2f(x ) I x (x )T 2f(x )(x x ) = (x x )T 2f(x ) I x (x )T 2 2f(x )(x x ). 
Kim, Kim and Klabjan 2f(x ) I x (x )T 2 2f(x ) 2 = I x (x )T 2f(x ) 2 2 = 2f(x ) I x (x )T 2 2, (x x )T 2f(x )vi 2 λ2 2 x x 2 2. (32) Also, using (32) and the Cauchy-Schwartz inequality, we obtain i=2 (x x )T 2f(x )vi Ri(x) i=2 |(x x )T 2f(x )vi||Ri(x)| i=2 ((x x )T 2f(x )vi)2 i=2 Ri(x)2. Using (32) and (33) for (31), we obtain i=2 ( f(x)T vi)2 λ2 2 x x 2 2 + 2 λ2 x x 2 i=2 Ri(x)2 + i=2 Ri(x)2, resulting in d X i=2 ( f(x)T vi)2 λ2 x x 2 2 + β(x) 2 (34) i=2 Ri(x)2. (35) Using (x x )T 2f(x )v1 = o( p x x 2) and (28) for (30) and (35), we have α(x) = o( p x x 2), β(x) = o( x x 2). (36) By (29), (34), (36), and Lemma 12, we obtain the first part of the desired result. On the other hand, if jf has a continuous Hessian Hj, by Lemma 14, we have i=1 (Ri(x))2 1 4M2 x x 4 2. (37) Scale Invariant Power Iteration Using |(x x )T 2f(x )v1| λ1 q 1 x T k x and (37) for (30) and (35), this leads to α(x) = (x x )T 2f(x )v1 + R1(x) λ1 q 1 x T k x M(1 x T k x ), i=2 Ri(x)2 M 2 x x 2 2. (38) By (29), (34), (38), and Lemma 13 with A = λ , B = M, C = λ1, D = 0, E = λ2, F = M, δ(λ , λ1, λ2, M) = min λ + |λ M| + λ1 , 2(λ λ2) λ + M + λ1 + |λ M| 2(λ λ2) λ + M + λ1 + |λ M|, which completes the proof. Note that λ1 has the additional 2 factor which comes from the fact that x is not necessarily an eigenvector of 2f(x ). Nonetheless, the asymptotic convergence rate in Theorem 7 provides a generalization of the convergence rate in Theorem 6. 4.2 Block Scale Invariant Problems Next, consider a class of optimization problems having the form of maximize f(x, y) subject to x Bdx, y Bdy where f : Rdx+dy R is scale invariant in x for fixed y and vice versa. A stationary point (x , y ) satisfies xf(x , y ) = λ x and yf(x , y ) = s y for some λ , s R, and x and y are eigenvectors of 2 xxf(x , y ) and 2 yyf(x , y ), respectively, according to Proposition 4. Some examples of block scale invariant problems are given next. Example 4 (Semidefinite Programming (SDP) (Vandenberghe and Boyd, 1996)) Let A, X Rn n. Given an SDP problem maximize A, X subject to Xii = 1, i {1, 2, , n}, X 0, the Burer-Monteiro approach (Burer and Monteiro, 2003) yields the following block scale invariant problem (Erdogdu et al., 2022) maximize A, σσT subject to σi 2 = 1, i {1, 2, , n} where σi denotes the ith row of σ Rn r. Kim, Kim and Klabjan Example 5 (Kullback-Leibler (KL) divergence NMF) The KL-NMF problem (Févotte and Idier, 2011; Lee and Seung, 2001; Wang and Zhang, 2013) is defined as minimize DKL(V WH) Vij log Vij Pr q=1 Wiq Hqj Vij + q=1 Wiq Hqj subject to Wiq 0, Hqj 0, i = 1, , n, j = 1, , m, q = 1, , r. Many popular algorithms (Lee and Seung, 2001; Lin, 2007) for the KL-NMF problem are based on alternating minimization of W and H. Since the objective function can be decomposed over j, given W 0 and j {1, , m}, we consider a subproblem of the form minimize fj KL(h) = Vij log Vij Pr q=1 Wiqhj Vij + subject to hq 0 (40) where hq = Hqj. Note that the KL-NMF problem in the form of (39) is not a block scale invariant problem. However, using a novel reformulation, we show that the KL divergence NMF subproblem is indeed a scale invariant problem. Lemma 8 The KL-NMF subproblem (40) is equivalent to the following scale invariant problem i=1 Vij log q=1 Wiq hq subject to q=1 hq = 1, hq 0, (41) with the relationship (Pn i=1 Vij) hq = (Pn i=1 Wiq)hq. Proof Since a log-linear function is concave, (40) is a convex problem in h. Consider the Lagrangian of the original problem L(h, λ) = fj KL(h) q=1 λqhq, λ 0. 
(42) Let h be an optimal solution to (40) and λ be a vector in Rr satisfying the following first-order KKT conditions fj KL(h ) = λ q1m, λ qh q = 0, q = 1, , r (43) where fj KL denotes the derivative of fj KL with respect to h. Since (43) implies Pr q=1 h qλ q = 0, we have q=1 h qλ q = q=1 h q fj KL(h ) = Vij Wiqh q Pr q =1 Wiq h q + q=1 Wiqh q, resulting in n X q=1 Wiqh q. (44) Scale Invariant Power Iteration minimize fj SCI(h) = i=1 Vij log Vij Pr q=1Wiqhq subject to q=1 Wiqhq, hq 0, (45) and let f KL and f SCI be the optimal objective values of (40) and (45), respectively. We prove the equivalence of (45) and (40) by the following arguments: 1. Since (45) has an additional constraint Pn i=1 Vij = Pn i=1 Pr q=1 Wiqhq compared to (40), it always satisfies f SCI f KL . 2. Since we have shown that Pn i=1 Vij = Pn i=1 Pr q=1 Wiqh q, a solution h of (40) is a feasible point of (45). This implies f KL f SCI. Now, we reparameterize h by h so that Pn i=1 Vij = Pn i=1 Pr q=1 Wiqhq if and only if Pr q=1 hq = 1, which yields the relationship between two variables hq = hq(Pn i=1 Wiq)/(Pn i=1 Vij). Note that (41) is a mixture proportion estimation problem (Example 2) and thus a scale invariant problem. To solve block scale invariant problems, we consider an alternating maximization algorithm called block SCI-PI, which repeats xk+1 xf(xk, yk)/ xf(xk, yk) 2, yk+1 yf(xk, yk)/ yf(xk, yk) 2. (46) We present a local convergence result of block SCI-PI below. Theorem 9 Suppose that f is twice continuously differentiable on an open set containing Bdx Bdy and let (x , y ) be a local maximum satisfying xf(x , y ) = λ x , λ > λ2 = max 2 i dx |λi|, yf(x , y ) = s y , s > s2 = max 2 j dy |sj| where (λi, vi) and (sj, uj) are eigen-pairs of 2 xxf(x , y ) and 2 yyf(x , y ), respectively with x = v1 and y = u1. If ν2 = 2 yxf(x , y ) 2 2 < (λ λ2)(s s2), then for the sequence of iterates {(xk, yk)}k=0,1, generated by (46), there exists some δ > 0 such that if 0 < δ, then we have 2 k Qk 1 t=0 (ρ + γt)2 2 0 and limk γk = 0 Kim, Kim and Klabjan Proof From Lemma 15 with x = xk, y = yk, we have Pdx i=2( xf(xk, yk)T vi)2 ( xf(xk, yk)T v1)2 λ2 λ xk x 2 + ν λ yk y 2 + θx(xk, yk) 2 . xk+1 = xf(xk, yk) xf(xk, yk) 2 , s Pdx i=2( xf(xk, yk)T vi)2 ( xf(xk, yk)T v1)2 λ2 λ xk x 2 + ν λ yk y 2 + θx(xk, yk). (47) Using Lemma 15 for x = yk, y = xk and the definition of yk+1, we have s xk x 2 + s2 s yk y 2 + θy(xk, yk). (48) Combining (47) and (48), we obtain " xk+1 x 2 yk+1 y 2 λ2 λ ν λ ν s s2 s " xk x 2 yk y 2 " θx(xk, yk) θy(xk, yk) Since ρ < 1 due to ν2 < (λ λ2)(s s2), by Lemma 17, we obtain the desired result. Being the spectral norm of the off-diagonal block of the Hessian at the local maximum (x , y ), ν measures how much partial derivatives of one block of variables are affected by the other block of variables. If the objective function f is separable in x and y as in the case of the KL-NMF problem, ν becomes zero, and we have ρ = max { λ2/λ , s2/s }. Note that ρ increases as ν increases. If ν2 becomes larger than (λ λ2)(s s2), the Jacobi update rule (46) may fail due to the interaction effects between x and y. On the other hand, the result of Theorem 6 can be restored by dropping x or y in Theorem 9. While we consider the two-block case here, the algorithm and the convergence analysis can be easily generalized to more than two blocks. 
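As a concrete sketch of the two-block update (46): both blocks are updated in Jacobi fashion from the current iterate $(x_k, y_k)$. Below is a minimal Julia implementation under the assumption that the partial gradients are supplied as callables; the function name and iteration budget are illustrative.

```julia
using LinearAlgebra

# Block SCI-PI, update (46): Jacobi-style normalized partial-gradient steps.
# ∇xf(x, y) and ∇yf(x, y) return ∇_x f and ∇_y f at the current iterate.
function block_sci_pi(∇xf, ∇yf, x0, y0; iters::Int = 500)
    x, y = normalize(x0), normalize(y0)
    for _ in 1:iters
        xn = normalize(∇xf(x, y))   # x_{k+1} = ∇_x f(x_k, y_k) / ‖∇_x f(x_k, y_k)‖₂
        yn = normalize(∇yf(x, y))   # y_{k+1} = ∇_y f(x_k, y_k) / ‖∇_y f(x_k, y_k)‖₂
        x, y = xn, yn
    end
    return x, y
end
```

Per Theorem 9, this iteration contracts locally when the cross-block interaction is weak, i.e., when $\nu^2 < (\lambda^* - \lambda_2)(s^* - s_2)$.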
4.3 Partially Scale Invariant Problems Lastly, we consider a class of optimization problems of the form maximize f(x, y) subject to x Bdx where f(x, y) : Rdx+dy R is a scale invariant function in x for each y Rdy. A partially scale invariant problem has the form of (1) with respect to x once y is fixed. If x is fixed, we obtain an unconstrained optimization problem with respect to y. A stationary point (x , y ) satisfies xf(x , y ) = λ x and yf(x , y ) = 0 for some λ R, and x is an eigenvector of 2 xxf(x , y ). Here is an example of partially scale invariant problems. Scale Invariant Power Iteration Example 6 (Gaussian Mixture Model (GMM)) The GMM problem is defined as j=1 πj N(ai; µj, Σj) subject to π Sd. Note that the objective function is scale invariant for fixed µj and Σj, and µj is unconstrained. If we assume some structure on Σj, estimation of Σj can also be unconstrained. For general Σj, semi-positive definiteness is necessary for Σj. To solve partially scale invariant problems, we consider an alternating maximization algorithm based on SCI-PI and the gradient method as xk+1 xf(xk, yk)/ xf(xk, yk) 2, yk+1 yk + α yf(xk, yk). (50) While the gradient method is used in (50), any method for unconstrained optimization can replace it. We present a convergence analysis of (50) below. Theorem 10 Suppose that f(x, y) is scale invariant in x for each y Rdy, µ-strongly concave in y with an L-Lipschitz continuous yf(x, y) for each x Bdx, and three-times continuously differentiable on an open set containing Bdx Rdy. Let (x , y ) be a local maximum satisfying f(x ) = λ x , λ > λ2 = max2 i dx|λi| where (λi, vi) is an eigen-pair of 2 xx f(x , y ) with x = v1. If ν2 = 2 yxf(x , y ) 2 2 < µ(λ λ2), then for the sequence of iterates {(xk, yk)}k=0,1, generated by (50) with α = 2/(L+µ), there exists some δ > 0 such that if 0 < δ, then we have 2 k Qk 1 t=0 (ρ + γt)2 2 0 and limk γk = 0 Proof Using Lemma 15 for x = xk, y = yk and the definition of xk+1, we have xk+1 x 2 λ2 λ xk x 2 + ν λ yk y 2 + θx(xk, yk). (51) By Lemma 16 with x = xk, y = yk, we also have yk+1 y 2 2ν L + µ xk x 2 + L µ yk y 2 + θy(xk, yk). (52) Kim, Kim and Klabjan Combining (51) and (52), we obtain " xk+1 x 2 yk+1 y 2 2ν L + µ L µ L + µ " xk x 2 yk y 2 " θx(xk, yk) Note that since ν2 < µ(λ λ2), the spectral radius ρ satisfies Therefore, by Lemma 14, we obtain the desired result. As in the result of Theorem 9, the rate ρ increases as ν increases and is equal to max { λ2/λ , (L µ)/(L + µ)} when ν = 0. Also, by dropping y, we can restore the convergence result of Theorem 6. 5. Numerical Experiments We test the proposed algorithms on real-world data sets. All experiments are implemented on a standard laptop (2.6 GHz Intel Core i7 processor and 16GM memory) using the Julia programming language. Let us emphasize that scale invariant problems frequently appear in many important applications in statistics and machine learning. We select three important applications, KL-NMF, GMM and ICA. A description of the data sets is provided below and source codes are available at: https://github.com/youngseok-kim/SCIPI-JMLR. 5.1 Description of Data Sets For KL-NMF (Section 5.2), we use four public real data sets available online and summarized in Table 1. Waving Trees (WT) has 287 images, each having 160 120 pixels. KOS and NIPS are sparse, large matrices implemented for topic modeling. WIKI is a large binary matrix having values 0 or 1 representing the adjacency matrix of a directed graph. 
Here, sparsity represents the fraction of zero elements in a matrix.

Name   # of samples   # of features   # of nonzeros   Sparsity
WIKI   8,274          8,297           104,000         0.999
NIPS   1,500          12,419          280,000         0.985
KOS    3,430          6,906           950,000         0.960
WT     287            19,200          5,510,000       0.000

Table 1: A brief summary of data sets used for KL-NMF. These four data sets are retrieved from https://www.microsoft.com/en-us/research/project, https://archive.ics.uci.edu/ml/datasets/bag+of+words, and https://snap.stanford.edu/data/wiki-Vote.html

Name            # of classes   # of samples   Dimension
Sonar           2              208            60
Ionosphere      2              351            34
House Votes84   2              435            16
Br Cancer       2              699            10
PIDiabetes      2              768            8
Vehicle         4              846            18
Glass           6              214            9
Zoo             7              101            16
Vowel           11             990            10
Servo           51             167            4

Table 2: A brief summary of data sets used for GMM

Name        # of samples   # of features
Wine        178            14
Soybean     683            35
Vehicle     846            18
Vowel       990            10
Cardio      2,126          22
Satellite   6,435          37
Pendigits   10,992         17
Letter      20,000         16
Shuttle     58,000         9

Table 3: A brief summary of data sets used for ICA

For GMM (Section 5.3), we use ten public real data sets, corresponding to all small and moderate data sets provided by the mlbench package in R. We select data sets for multi-class classification problems and run EM and SCI-PI for the given number of classes without class labels. In Table 2, the sample size varies from 101 to 990, the dimension varies from 4 to 60, and the number of classes varies from 2 to 51. In these data sets, only a small portion of entries is missing. If missing entries exist, we impute them with the means. For ICA, discussed also in Section 5.3, we use nine public data sets (see Table 3) from the UCI Machine Learning repository (https://archive.ics.uci.edu/ml/index.php). The sample size varies from 178 to 58,000 and the dimension varies from 9 to 37.

5.2 KL-divergence Nonnegative Matrix Factorization

We perform experiments on the KL-NMF problem (39) described in Example 5. Let us recall that the original KL-NMF problem can be solved via block SCI-PI, where in each iteration the algorithm solves subproblems of the form (41). Our focus is to compare this algorithm with other well-known alternating minimization algorithms listed below, updating $H$ and $W$ alternately. We let $z = V \oslash (Wh)$.

Projected gradient descent (PGD): It iterates $h_{\text{new}} \leftarrow h + \eta\,W^T(z - 1_n)$ followed by projection onto the simplex, where $\eta$ is an appropriate learning rate (Lin, 2007).

Multiplicative update (MU): A famous multiplicative update algorithm originally suggested by Lee and Seung (2001), which iterates $h_{\text{new}} \leftarrow h \odot (W^T z) \oslash (W^T 1_n)$ and is learning-rate free.

Our method (SCI-PI): It iterates $h_{\text{new}} \leftarrow h \odot (\sigma + W^T z)^{\odot 2}$ followed by rescaling, where $\sigma$ is a shift parameter. We simply use $\sigma = 1$ for preconditioning.

Sequential quadratic programming (MIXSQP): It exactly solves each subproblem via the convex solver mixsqp (Kim et al., 2020), performing sequential non-negative least squares.
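To fix ideas, the three first-order updates above can be sketched in Julia as follows, with $W \in \mathbb{R}^{n \times r}$, the corresponding column $v$ of $V$, and the current iterate $h$. The function names, the PGD step size $\eta$, and the simplex-projection routine `project` are illustrative assumptions; the SCI-PI rescaling here normalizes onto the simplex of the reformulated problem (41), eliding the exact reparametrization between $h$ and $\tilde{h}$ in Lemma 8.

```julia
# z = v ./ (W*h) elementwise, as defined above.
mu_step(h, W, v) = h .* (W' * (v ./ (W * h))) ./ vec(sum(W; dims = 1))   # MU

function scipi_step(h, W, v; σ = 1.0)                                    # SCI-PI
    g = σ .+ W' * (v ./ (W * h))      # shifted factor σ + Wᵀz
    hnew = h .* g .^ 2                # elementwise square: h ⊙ (σ + Wᵀz)^⊙2
    return hnew ./ sum(hnew)          # rescaling
end

pgd_step(h, W, v, η, project) =                                          # PGD
    project(h .+ η .* (W' * (v ./ (W * h) .- 1)))   # gradient step, then projection
```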
KL-NMF subproblem on synthetic data sets Before presenting experimental results of alternating algorithms on the full KL-NMF problem (39), we report small experiments using synthetic data sets on the KL-NMF subproblem (40), where we repeat the above iterations until convergence. To study the convergence rate for the KL-NMF subproblems, we use the four data sets studied in Kim et al. (2020). We study MU, PGD and SCI-PI since they have the same order of computational complexity per iteration, but omit MIXSQP since it is a second-order method which cannot be directly compared. For PGD, the learning rate is optimized by grid search. The stopping criterion is $f_k - f^* \le 2 \times 10^{-6} f^*$, where $f_k$ is the objective value at iteration $k$ and $f^*$ is the objective value of the solution obtained by MIXSQP after extensive computation time. The result is shown in Figure 2.

Figure 2: Convergence of the three algorithms (MU, SCI-PI, PGD) for the KL-NMF subproblem; the relative error $|f_k - f^*|/|f_0 - f^*|$ (top) and the log relative error (bottom); $n$/$m$: the number of samples/features of the data matrix (panels: $n = 2000, m = 10$; $n = 2000, m = 20$; $n = 2000, m = 200$; $n = 20000, m = 200$).

The average runtimes of the three methods are 33, 33 and 30 seconds for 10,000 iterations, respectively. Although the reformulated scale invariant problem (12) is a non-convex problem, SCI-PI always finds a global optimal solution, regardless of the starting point. Moreover, as shown in the figure, SCI-PI outperforms the other two algorithms on all simulated data sets.

KL-NMF on real-world data sets Next, we test the four algorithms on the data sets in Table 1. We estimate $r = 20$ factors. At each iteration, all four algorithms solve $m$ subproblems simultaneously for $W$ and then alternately for $H$. The result is summarized in Figure 3. The convergence plots are based on the average relative errors over ten repeated runs with random initializations. The result shows that SCI-PI is an overall winner, showing faster convergence rates. The stopping criterion is the same as above. To assess the overall performance when initialized differently, we select KOS and WIKI and run MU, PGD, SCI-PI, and MIXSQP ten times. The three algorithms except MIXSQP have (approximately) the same computational cost per iteration and take runtimes of 391, 396 and 408 seconds on the KOS data and 372, 390 and 418 seconds on the WIKI data, respectively, for 200 iterations. MIXSQP has a larger per-iteration cost. After 400 seconds, SCI-PI achieves the lowest objective values in all cases but one for each data set (38 out of 40 in total). Thus it clearly outperforms the other methods and also achieves the lowest variance. Unlike the other three algorithms, SCI-PI is not an ascent algorithm but an eigenvalue-based fixed-point algorithm. Admittedly, the non-monotone convergence of SCI-PI can hurt the reliability of the solution, but for the KL-NMF problem its performance turns out to be stable.

5.3 Gaussian Mixture Model and Independent Component Analysis

In this subsection, we study the empirical performance of SCI-PI when it is applied to GMM and ICA.

GMM GMM fits a mixture of Gaussian distributions to the underlying data. Let $L_{ij} = N(a_i; \mu_j, \Sigma_j)$, where $i$ is the sample index and $j$ the cluster index, and let $\pi$ be the actual mixture proportion vector. GMM fits into our partially scale invariant setting (Section 4.3) with reparametrization, but the gradient update for $\mu_j, \Sigma_j$ is replaced by the exact coordinate ascent step. The EM and SCI-PI updates for $\pi$ can be written respectively as
$$r = 1_n \oslash (L\pi), \qquad \pi_{\text{new}} \propto \pi \odot (L^T r) \ \text{(EM)}, \qquad \pi_{\text{new}} \propto \pi \odot (\alpha + L^T r)^{\odot 2} \ \text{(SCI-PI)} \qquad (54)$$
where $\alpha$ is a shift parameter set to 1. We compare SCI-PI and EM for different real-world data sets from Table 2.
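In code, the two updates in (54) differ only by the shifted, elementwise-squared factor. A minimal Julia sketch, where L is the n-by-d matrix with L[i, j] = N(a_i; μ_j, Σ_j) and p stands for the proportion vector π (the helper names are illustrative assumptions):

```julia
# EM:     π_new ∝ π ⊙ (Lᵀr),         with r = 1_n ⊘ (Lπ)
# SCI-PI: π_new ∝ π ⊙ (α + Lᵀr)^⊙2,  with shift α = 1
em_pi_step(p, L)             = (w = p .* (L' * (1 ./ (L * p)));           w ./ sum(w))
scipi_pi_step(p, L; α = 1.0) = (w = p .* (α .+ L' * (1 ./ (L * p))).^2;   w ./ sum(w))
```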
All the algorithms are initialized from the same standard Gaussian random draws, repeated ten times. The result is summarized in the left panel of Figure 4. In some cases, SCI-PI achieves much larger objective values even when initialized identically. In many cases the two algorithms exhibit the same performance. This is because estimation of the $\mu_j$'s and $\Sigma_j$'s is usually harder than estimation of $\pi$, and EM and SCI-PI have the same updates for $\mu$ and $\Sigma$. In a few cases EM outperforms SCI-PI. Let us mention that SCI-PI and EM have the same order of computational complexity and require 591 and 590 seconds of total computation time, respectively.

(For each evaluation, we randomly draw ten initial points and report the averaged relative errors with respect to $f^*$. The initial input for the KL-NMF problem is a one-step MU update of a Uniform(0, 1) random matrix. In all plots we do not show the first few iterations; the initial random solutions have a gap of approximately 50%, which drops to a few percent after ten iterations, where the plots start.)

Figure 3: Convergence of the four NMF algorithms (MU, PGD, SCI-PI, MIXSQP); the relative error $|f_k - f^*|/|f_0 - f^*|$ (top left) and the log relative error (bottom); boxplots containing ten objective values achieved after 400 seconds (top right).

Figure 4: The relative objective $f_{\text{SCI-PI}}/f_{\text{EM}}$ for GMM (left) and $f_{\text{SCI-PI}}/f_{\text{Fast ICA}}$ for ICA (right).

ICA We implement SCI-PI on the Kurtosis-based ICA problem (Hyvärinen et al., 2004) and compare it with the benchmark algorithm Fast ICA (Hyvarinen, 1999), which is the most popular algorithm. Given a pre-processed data matrix $W \in \mathbb{R}^{n \times d}$ (a centered matrix $\widetilde{W} = n^{1/2} U D V^T$ is pre-processed by $W = \widetilde{W} V D^{-1} V^T$ so that $W^T W = n\, V V^T$), we maximize an approximation of negative entropy (Hyvärinen and Oja, 2000), $f(x) = \sum_{i=1}^n \big((w_i^T x)^4 - 3\big)^2$, subject to $x \in \mathcal{B}^d$. This problem fits into the sum of scale invariant functions setting (Section 4.1). SCI-PI iterates $x_{k+1} \leftarrow W^T\big[\big((Wx_k)^{\odot 4} - 3 \cdot 1_n\big) \odot (Wx_k)^{\odot 3}\big]$ and Fast ICA iterates $x_{k+1} \leftarrow W^T (Wx_k)^{\odot 3} - 3\,\big(1_n^T (Wx_k)^{\odot 2}\big)\,x_k$, both followed by normalization. In Figure 4 (right panel), we compare SCI-PI and Fast ICA on the data sets in Table 3. The majority of data points (81 out of 100 in total) show that SCI-PI tends to find a better solution with a larger objective value, but in a few cases SCI-PI converges to a sub-optimal point. Both algorithms are fixed-point based and thus have no guarantee of global convergence, but overall SCI-PI outperforms Fast ICA. SCI-PI and Fast ICA have the same order of computational complexity and require 11 and 12 seconds of total computation time, respectively.

6. Final Remarks

In this paper, we propose a new class of optimization problems called scale invariant problems, together with a generic solver, SCI-PI, which is an eigenvalue-based fixed-point iteration.
We showed that SCI-PI directly generalizes power iteration and enjoys similar properties: under mild conditions it attains local linear convergence, and its convergence rate is determined by the eigenvalues of the Hessian matrix at a solution. Also, we extended scale invariant problems to more general settings. We showed by experiments that SCI-PI can be a competitive option for numerous important problems such as KL-NMF, GMM and ICA. Moreover, while not studied in this work, SCI-PI can be generalized to solve optimization problems on the Stiefel manifold such as block PCA. Under orthonormality constraints, a problem with a scale invariant objective function can be locally considered as the top-$k$ eigenvector problem. Therefore, we can develop a general form of the QR iteration (Francis, 1961, 1962) and its convergence analysis using similar proof techniques. Finding more examples and extending SCI-PI further to a more general setting is a promising direction for future studies.

Appendix A. Proofs of Propositions

A.1 Proof of Proposition 2

From (2) and (3), we infer functional equations of the multiplicative factor $u$ and the additive factor $v$. Under the continuity assumption on $f$, these functional equations have the forms of Cauchy equations. Relying on known properties of Cauchy equations, we prove that $u$ and $v$ are homogeneous and log functions, respectively. We next provide the details.

We first consider the multiplicatively scale invariant case. Let $x$ be a point such that $f(x) \neq 0$. Then, we have $f(rsx) = u(rs)f(x) = u(r)u(s)f(x)$, which results in $u(rs) = u(r)u(s)$ for all $r, s \in \mathbb{R}$. Let $g(r) = \ln(u(e^r))$. Then, we have
$$g(r+s) = \ln(u(e^{r+s})) = \ln(u(e^r e^s)) = \ln(u(e^r)) + \ln(u(e^s)) = g(r) + g(s),$$
which implies that $g$ satisfies the first Cauchy functional equation. Since $f$ is continuous, so is $u$ and thus $g$. Therefore, by (Sahoo and Kannappan, 2011, pp. 81-82), we have
$$g(r) = r\,g(1) \qquad (55)$$
for all $r \in \mathbb{R}$. From the definition of $g$ and (55), we have
$$u(e^r) = e^{g(r)} = (e^r)^{g(1)}. \qquad (56)$$
Representing $r > 0$ as $r = e^{\ln(r)}$ and using (56), we obtain
$$u(r) = u(e^{\ln(r)}) = r^{g(1)} = r^{\ln(u(e))} = r^p.$$
Since $f(x) \neq 0$, if $p = \ln(u(e)) < 0$, then we have $\lim_{r \to 0^+} f(rx) = \lim_{r \to 0^+} u(r)f(x) = f(x)\lim_{r \to 0^+} r^p = \pm\infty$, while $f(0)$ is finite, contradicting the fact that $f$ is continuous at $0$. Also, if $p = 0$, then we get $u(r) = 1$, which contradicts $u(0) = 0$. Therefore, we must have $p > 0$. From $u$ being an even function, we finally have $u(r) = |r|^p$.

Now, consider the additively scale invariant case. For any $x \in \text{dom}(f)$, we have $f(rsx) = f(x) + v(rs) = f(x) + v(r) + v(s)$, which results in $v(rs) = v(r) + v(s)$ for all $r, s \in \mathbb{R} \setminus \{0\}$. Let $g(r) = v(e^r)$. Then, we have
$$g(r+s) = v(e^{r+s}) = v(e^r e^s) = v(e^r) + v(e^s) = g(r) + g(s).$$
Since $g$ is continuous and satisfies the second Cauchy functional equation, by (Sahoo and Kannappan, 2011, pp. 83-84), we have $g(r) = r\,g(1)$ for all $r \in \mathbb{R}$. For $r > 0$, letting $r = e^{\ln(r)}$, we have
$$v(r) = v(e^{\ln(r)}) = g(\ln(r)) = g(1)\ln(r) = v(e)\ln(r) = \log_a(r)$$
where $a = e^{1/v(e)}$. Note that $a$ satisfies $0 < a$ and $a \neq 1$. From the fact that $v$ is an even function, we finally have $v(r) = \log_a |r|$ for $r \in \mathbb{R} \setminus \{0\}$.

A.2 Proof of Proposition 3

Without loss of generality, we can represent a scale invariant function $f$ as
$$f(cx) = u(c)f(x) + v(c) \qquad (57)$$
since we can restore a multiplicatively or additively scale invariant function by setting $v(c) = 0$ or $u(c) = 1$, respectively.
In order to derive the first-order derivative properties, we differentiate (57) with respect to x and c, respectively. Then, we further differentiate the latter with respect to x to obtain the second-order property. By differentiating (57) with respect to x, we have f(cx) = u(c) On the other hand, by differentiating (57) with respect to c, we have f(cx)T x = u (c)f(x) + v (c). (58) By differentiating (58) with respect to x, we obtain c 2f(cx)x + f(cx) = u (c) f(x). (59) Plugging c = 1 into (58) and (59) completes the proof. Kim, Kim and Klabjan A.3 Proof of Proposition 5 In order to prove the equivalence of (1) and (9), we show that a scalar multiple of an optimal solution to one problem is optimal to the other problem. Since {x : x 2 = 1} and {w : f(w) = 1} have a one-to-one correspondence, we can uniquely determine such a mapping. First, we consider the multiplicatively scale invariant case where f(x ) > 0. Suppose that an optimal solution to (9) is z not x /f(x )1/p such that z 2 < x /f(x )1/p 2. (60) Let ˆz = z/ z 2. Then, we have ˆz 2 = 1 and z = ˆz/f(ˆz)1/p. Since ˆz 2 = x 2 = 1, we have z 2 = ˆz/f(ˆz)1/p 2 = 1/f(ˆz)1/p, x /f(x )1/p 2 = 1/f(x )1/p. (61) From (60) and (61), we have f(x ) < f(ˆz) since p > 0. This contradicts the assumption that x is an optimal solution to (1). For an optimal solution w to (9), since f(w ) = 1, we have w = 0 and thus w 2 > 0 and f (w / w 2) = 1/ w p 2 > 0. Suppose an optimal solution to (1) is y with f(y) > f (w / w 2) > 0. (62) Let ˆy = y/f(y)1/p. Then, f(ˆy) = 1 and y = ˆy/ ˆy 2. Using f(ˆy) = f(w ) = 1, we have f(y) = f (ˆy/ ˆy 2) = 1/ ˆy 1/p 2 , f (w / w 2) = 1/ w 1/p 2 . (63) From (62) and (63), we obtain ˆy 2 < w 2, which contradicts that w is optimal to (9). Next, we consider the additively scale invariant case. Suppose that an optimal solution to (9) is z with z 2 < a1 f(x )x 2. (64) Let ˆz = z/ z 2. Then, we have ˆz 2 = 1 and z = a1 f(ˆz)ˆz. Using ˆz 2 = x 2 = 1, we have z 2 = a1 f(ˆz), a1 f(x )x 2 = a1 f(x ). (65) From (64) and (65), we obtain f(x ) < f(ˆz) due to a > 1, which contradicts the assumption that x is an optimal solution to (1). Conversely, suppose that an optimal solution of (1) is y with f(y) > f (w / w 2) . (66) Let ˆy = a1 f(y)y. Then, we have f(ˆy) = 1 and y = ˆy/ ˆy 2. Since f(ˆy) = f(w ) = 1, we have f(y) = f(ˆy) loga ˆy 2 = 1 loga ˆy 2, f (w / w 2) = 1 loga w 2. (67) From (66) and (67), we have ˆy 2 < w 2 since a > 1, contradicting the fact that w is an optimal solution to (9). Scale Invariant Power Iteration Appendix B. Sublinear convergence of SCI-PI for quasi-convex f Proposition 11 If f is quasi-convex and continuously differentiable, a sequence of iterates {xk}k=0,1, generated by SCI-PI satisfies f(xk+1) f(xk) for all k 0. Moreover, either Algorithm 1 terminates with a stationary point or every limit point is a stationary point. Proof Suppose that f(xk+1) < f(xk). By the first-order property of differentiable quasiconvex functions, this leads to f(xk)T (xk+1 xk) = f(xk)T f(xk) f(xk) 2 xk = f(xk) 2 f(xk)T xk 0. However, since f(xk+1) = f(xk), f(xk) is not a scalar multiple of xk, resulting in f(xk) 2 f(xk)T xk > 0. This contradicts (68). Therefore, we have f(xk+1) f(xk). If Algorithm 1 terminates at iteration k, we have f(xk) = 0. Therefore, xk is a stationary point satisfying (8) with the value of the Lagrange multiplier being zero. Otherwise, let x be a limit point and suppose that x is not a stationary point. Then, there exists some ϵ > 0 such that f(x )T x / f(x ) 2 < 1 ϵ. 
Appendix B. Sublinear Convergence of SCI-PI for Quasi-Convex f

Proposition 11 If $f$ is quasi-convex and continuously differentiable, a sequence of iterates $\{x_k\}_{k=0,1,\dots}$ generated by SCI-PI satisfies $f(x_{k+1}) \ge f(x_k)$ for all $k \ge 0$. Moreover, either Algorithm 1 terminates with a stationary point or every limit point is a stationary point.

Proof Suppose that $f(x_{k+1}) < f(x_k)$. By the first-order property of differentiable quasi-convex functions, this leads to
\[ \nabla f(x_k)^T (x_{k+1} - x_k) = \nabla f(x_k)^T \left( \frac{\nabla f(x_k)}{\|\nabla f(x_k)\|_2} - x_k \right) = \|\nabla f(x_k)\|_2 - \nabla f(x_k)^T x_k \le 0. \tag{68} \]
However, since $f(x_{k+1}) \neq f(x_k)$, $\nabla f(x_k)$ is not a scalar multiple of $x_k$, so the Cauchy-Schwarz inequality is strict, resulting in $\|\nabla f(x_k)\|_2 - \nabla f(x_k)^T x_k > 0$. This contradicts (68). Therefore, we have $f(x_{k+1}) \ge f(x_k)$.

If Algorithm 1 terminates at iteration $k$, we have $\nabla f(x_k) = 0$. Therefore, $x_k$ is a stationary point satisfying (8) with the value of the Lagrange multiplier being zero. Otherwise, let $x^*$ be a limit point and suppose that $x^*$ is not a stationary point. Then, there exists some $\epsilon > 0$ such that
\[ \frac{\nabla f(x^*)^T x^*}{\|\nabla f(x^*)\|_2} < 1 - \epsilon. \]
Since $\nabla f$ is continuous, there exists some $\delta > 0$ such that for $x \in B^d$ with $\nabla f(x) \neq 0$, if $\|x - x^*\|_2 < \delta$, then
\[ \frac{\nabla f(x^*)^T \nabla f(x)}{\|\nabla f(x^*)\|_2 \|\nabla f(x)\|_2} > 1 - \epsilon. \]
Let $k'$ be an index such that $\|x_{k'} - x^*\|_2 < \delta$. Since $\{f(x_k)\}_{k=0,1,\dots}$ is non-decreasing, we have $f(x_{k'}) \le f(x_{k'+1}) \le f(x^*)$. By the first-order derivative property of quasi-convex functions, we obtain
\[ \nabla f(x^*)^T (x_{k'+1} - x^*) \le 0. \]
However, since $x_{k'+1} = \nabla f(x_{k'}) / \|\nabla f(x_{k'})\|_2$ and $\|x_{k'} - x^*\|_2 < \delta$, we must have
\[ \frac{\nabla f(x^*)^T x^*}{\|\nabla f(x^*)\|_2} < 1 - \epsilon < \frac{\nabla f(x^*)^T \nabla f(x_{k'})}{\|\nabla f(x^*)\|_2 \|\nabla f(x_{k'})\|_2} = \frac{\nabla f(x^*)^T x_{k'+1}}{\|\nabla f(x^*)\|_2}, \]
and hence $\nabla f(x^*)^T (x_{k'+1} - x^*) > 0$. This leads to a contradiction. Therefore, $x^*$ must be a stationary point.

Appendix C. Additional Lemmas

On several occasions, we use the identity: if $x \in B^d$ and $y \in B^d$, then
\[ \|x - y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2x^T y = 2(1 - x^T y). \tag{69} \]
Note that if $x^T y \ge 0$, then
\[ \sqrt{1 - (x^T y)^2} = \sqrt{(1 - x^T y)(1 + x^T y)} \ge \sqrt{1 - x^T y} = \|x - y\|_2 / \sqrt{2}. \]
By the Cauchy-Schwarz inequality, we also have
\[ \sqrt{1 - (x^T y)^2} = \sqrt{(1 - x^T y)(1 + x^T y)} \le \sqrt{2}\, \sqrt{1 - x^T y} = \|x - y\|_2. \tag{70} \]

C.1 In Support of the Proofs of Theorem 6 and Theorem 7

Lemma 12 Let $x^*$ be a vector in $\mathbb{R}^d$ and $\{v_1, \dots, v_d\}$ be an orthonormal basis of $\mathbb{R}^d$ such that $x^* = v_1$. If a twice continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ satisfies
\[ \nabla f(x)^T v_1 = \lambda^* + \alpha(x), \qquad \sum_{i=2}^d (\nabla f(x)^T v_i)^2 \le \big( \bar{\lambda}\, \|x - x^*\|_2 + \beta(x) \big)^2 \tag{71} \]
for every $x \in B^d$ and some functions $\alpha, \beta : \mathbb{R}^d \to \mathbb{R}$ and scalars $\lambda^*, \bar{\lambda}$ such that
\[ \alpha(x) = o\big(\sqrt{\|x - x^*\|_2}\big), \qquad \beta(x) = o(\|x - x^*\|_2), \qquad \lambda^* > \bar{\lambda} \ge 0, \]
then for the sequence of iterates $\{x_k\}_{k=0,1,\dots}$ generated by SCI-PI, there exists some $\delta > 0$ such that under the initial condition $\|x_0 - x^*\|_2 < \delta$, we have
\[ \|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_t \right)^2 \|x_0 - x^*\|_2^2, \qquad \frac{\bar{\lambda}}{\lambda^*} + \gamma_t < 1, \qquad \lim_{k \to \infty} \gamma_k = 0. \]

Proof By (71), for every $x \in B^d$ with $\nabla f(x)^T v_1 > 0$, we have
\[ \frac{\sqrt{\sum_{i=2}^d (\nabla f(x)^T v_i)^2}}{\nabla f(x)^T v_1} \le \frac{\bar{\lambda}\|x - x^*\|_2 + \beta(x)}{\lambda^* + \alpha(x)} = \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2 + \theta(x), \quad \theta(x) = \frac{\bar{\lambda}\|x - x^*\|_2 + \beta(x)}{\lambda^* + \alpha(x)} - \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2. \]
Then, we have $\theta(x) = o(\|x - x^*\|_2)$ and
\[ \frac{\sum_{i=2}^d (\nabla f(x)^T v_i)^2}{(\nabla f(x)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma(x) \right)^2 \|x - x^*\|_2^2, \tag{72} \]
where
\[ \gamma(x) = \frac{\theta(x)}{\|x - x^*\|_2}, \tag{73} \]
so that $\gamma(x) \to 0$ as $x \to x^*$. Next, for any $x^+ \in B^d$ with $x^{+T} x^* > 0$, since $2t^2/(1+t) \le 1$ for $t \in (0, 1]$ and $x^* = v_1$, we have
\[ \|x^+ - x^*\|_2^2 = \frac{2\,(x^{+T} x^*)^2}{1 + x^{+T} x^*} \cdot \frac{1 - (x^{+T} x^*)^2}{(x^{+T} x^*)^2} \le \frac{1 - (x^{+T} v_1)^2}{(x^{+T} v_1)^2}. \tag{74} \]
By (71), there exists some $\delta_1 > 0$ such that if $\|x - x^*\|_2 < \delta_1$, then
\[ \nabla f(x)^T v_1 > 0. \tag{75} \]
Also, by (73), for any $\gamma > 0$ satisfying
\[ \frac{\bar{\lambda}}{\lambda^*} + \gamma < 1, \tag{76} \]
there exists some constant $\delta_2 > 0$ such that if $\|x - x^*\|_2 < \delta_2$, then
\[ \gamma(x) \le \gamma. \tag{77} \]
Let $\gamma_k = \gamma(x_k)$ and $\delta = \min\{\delta_1, \delta_2, \sqrt{2}\}$. We prove that if $\|x_k - x^*\|_2 < \delta$, then
\[ \|x_{k+1} - x^*\|_2^2 \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_k \right)^2 \|x_k - x^*\|_2^2 \quad \text{and} \quad \gamma_k \le \gamma. \tag{78} \]
Since $\|x_k - x^*\|_2 < \sqrt{2}$, we have $x_k^T x^* > 0$ by (69). Also, from $\|x_k - x^*\|_2 < \delta_1$ and $x^* = v_1$, using the update rule of SCI-PI and (75), we obtain
\[ x_{k+1}^T x^* = \frac{\nabla f(x_k)^T x^*}{\|\nabla f(x_k)\|_2} = \frac{\nabla f(x_k)^T v_1}{\|\nabla f(x_k)\|_2} > 0. \]
From the fact that $\{v_1, \dots, v_d\}$ forms an orthonormal basis of $\mathbb{R}^d$, we have $\nabla f(x_k) = \sum_{i=1}^d (\nabla f(x_k)^T v_i) v_i$ and $\|\nabla f(x_k)\|_2^2 = \sum_{i=1}^d (\nabla f(x_k)^T v_i)^2$, so the update rule of SCI-PI gives
\[ \frac{1 - (x_{k+1}^T v_1)^2}{(x_{k+1}^T v_1)^2} = \frac{\|\nabla f(x_k)\|_2^2 - (\nabla f(x_k)^T v_1)^2}{(\nabla f(x_k)^T v_1)^2} = \frac{\sum_{i=2}^d (\nabla f(x_k)^T v_i)^2}{(\nabla f(x_k)^T v_1)^2}. \]
Applying (74) with $x^+ = x_{k+1}$ and then (72) at $x = x_k$, we obtain
\[ \|x_{k+1} - x^*\|_2^2 \le \frac{\sum_{i=2}^d (\nabla f(x_k)^T v_i)^2}{(\nabla f(x_k)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_k \right)^2 \|x_k - x^*\|_2^2. \]
Since $\|x_k - x^*\|_2 < \delta_2$, we have $\gamma_k \le \gamma$ by (77), which proves (78). Moreover, by (76), $\|x_{k+1} - x^*\|_2 \le \|x_k - x^*\|_2 < \delta$, so (78) applies at every iteration. Therefore, if $x_0$ satisfies $\|x_0 - x^*\|_2 < \delta$, repeatedly applying (78) yields
\[ \|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma_t \right)^2 \|x_0 - x^*\|_2^2 \quad \text{with} \quad \frac{\bar{\lambda}}{\lambda^*} + \gamma_t \le \frac{\bar{\lambda}}{\lambda^*} + \gamma < 1. \tag{79} \]
Since
\[ \|x_k - x^*\|_2^2 \le \left( \frac{\bar{\lambda}}{\lambda^*} + \gamma \right)^{2k} \|x_0 - x^*\|_2^2, \tag{80} \]
we have $x_k \to x^*$, and thus $\lim_{k \to \infty} \gamma_k = 0$ by (73). This gives the desired result.
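For the quadratic objective $f(x) = \frac{1}{2} x^\top A x$, SCI-PI reduces to classical power iteration ($\nabla f(x) = Ax$) and the rate $\bar{\lambda}/\lambda^*$ of Lemma 12 becomes $\lambda_2/\lambda_1$. The following minimal numerical sketch (ours; the test matrix, initialization, and print schedule are arbitrary choices) checks that the per-step error ratio approaches this rate:

```python
import numpy as np

# Sketch: on f(x) = 0.5 * x^T A x, SCI-PI is power iteration and Lemma 12
# predicts a local linear rate of lambda_2 / lambda_1.
rng = np.random.default_rng(0)
d = 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal basis
lam = np.sort(rng.uniform(0.1, 0.9, d))[::-1]      # descending spectrum
lam[0] = 1.0                                       # lambda_1 = lambda^*
A = Q @ np.diag(lam) @ Q.T
v1 = Q[:, 0]                                       # leading eigenvector x^*

x = v1 + 0.1 * rng.standard_normal(d)              # initialize near x^*
x /= np.linalg.norm(x)
for k in range(40):
    err_prev = np.linalg.norm(x - np.sign(x @ v1) * v1)
    x = A @ x                                      # SCI-PI: x <- grad f / ||grad f||_2
    x /= np.linalg.norm(x)
    err = np.linalg.norm(x - np.sign(x @ v1) * v1)
    if k % 10 == 0:
        print(f"k={k:2d}  error ratio {err / err_prev:.4f}")
print("predicted rate lambda_2/lambda_1:", lam[1] / lam[0])
```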
Lemma 13 Let $x^*$ be a vector in $\mathbb{R}^d$ and $\{v_1, \dots, v_d\}$ be an orthonormal basis of $\mathbb{R}^d$ such that $x^* = v_1$. If a three-times continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ satisfies
\[ \nabla f(x)^T v_1 \ge A - B(1 - x^T x^*) - C\sqrt{1 - x^T x^*}, \tag{81} \]
\[ \sum_{i=2}^d (\nabla f(x)^T v_i)^2 \le \left( D\sqrt{1 - (x^T x^*)^2} + E\|x - x^*\|_2 + \tfrac{F}{2}\|x - x^*\|_2^2 \right)^2 \tag{82} \]
for every $x \in B^d$ and some constants $A, B, C, D, E, F$ such that $A > 0$, $B + C \ge 0$, and $0 \le D + E < A$, then for the sequence of iterates $\{x_k\}_{k=0,1,\dots}$ generated by SCI-PI, under the initial condition $\|x_0 - x^*\|_2 < \delta$ where
\[ \delta = \min\left\{ \frac{\sqrt{2}\, A}{A + |A - B| + C},\ \frac{\sqrt{2}\,(A - D - E)}{A + |A - B| + C + E + F} \right\}, \tag{83} \]
we have
\[ \|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \left( \frac{D + E}{A} + \gamma_t \right)^2 \|x_0 - x^*\|_2^2, \qquad \frac{D + E}{A} + \gamma_t < 1, \qquad \lim_{k \to \infty} \gamma_k = 0. \]

Proof Let $\|x_0 - x^*\|_2 = \delta_0 < \delta$. To prove the main result, we show that if $\|x_k - x^*\|_2 \le \delta_0$, then $x_{k+1}^T x^* > 0$ and
\[ \frac{1 - (x_{k+1}^T x^*)^2}{(x_{k+1}^T x^*)^2} \le \rho_k\, \frac{1 - (x_k^T x^*)^2}{(x_k^T x^*)^2}, \qquad \rho_k = \left( \frac{D + E + (E + F)\|x_k - x^*\|_2/\sqrt{2}}{A - (A + |A - B| + C)\|x_k - x^*\|_2/\sqrt{2}} \right)^2. \tag{84} \]
Suppose that $\|x_k - x^*\|_2 \le \delta_0$ for some $k \ge 0$. Since $\|x_k - x^*\|_2 < \delta \le \sqrt{2}$, we have $x_k^T x^* > 0$. Writing $1 - x_k^T x^* = \|x_k - x^*\|_2^2/2$ and $\sqrt{1 - x_k^T x^*} = \|x_k - x^*\|_2/\sqrt{2}$, inequality (81) gives
\[ \nabla f(x_k)^T v_1 \ge A\, x_k^T x^* + (A - B)(1 - x_k^T x^*) - C\sqrt{1 - x_k^T x^*} \ge A\, x_k^T x^* - \frac{(|A - B| + C)\|x_k - x^*\|_2}{\sqrt{2}} \ge A - \frac{(A + |A - B| + C)\|x_k - x^*\|_2}{\sqrt{2}} > 0, \tag{85} \]
where the second inequality uses $1 - x_k^T x^* \le \|x_k - x^*\|_2/\sqrt{2}$ (valid since $\|x_k - x^*\|_2 \le \sqrt{2}$), the third uses $x_k^T x^* \ge 1 - \|x_k - x^*\|_2/\sqrt{2}$, and positivity follows from $\|x_k - x^*\|_2 < \delta \le \sqrt{2}A/(A + |A - B| + C)$. Inequality (85) implies
\[ x_{k+1}^T x^* = \frac{\nabla f(x_k)^T v_1}{\|\nabla f(x_k)\|_2} > 0. \]
On the other hand, since
\[ \frac{1 - (x_{k+1}^T v_1)^2}{(x_{k+1}^T v_1)^2} = \frac{\|\nabla f(x_k)\|_2^2 - (\nabla f(x_k)^T v_1)^2}{(\nabla f(x_k)^T v_1)^2} = \frac{\sum_{i=2}^d (\nabla f(x_k)^T v_i)^2}{(\nabla f(x_k)^T v_1)^2}, \]
using (81), (82), and (85), we have
\[ \frac{1 - (x_{k+1}^T x^*)^2}{(x_{k+1}^T x^*)^2} \le \left( \frac{D\sqrt{1 - (x_k^T x^*)^2} + E\|x_k - x^*\|_2 + \frac{F}{2}\|x_k - x^*\|_2^2}{A - (A + |A - B| + C)\|x_k - x^*\|_2/\sqrt{2}} \right)^2 \le \rho_k\, \frac{1 - (x_k^T x^*)^2}{(x_k^T x^*)^2}, \]
where we use $(x_k^T x^*)^2 \le 1$ and the fact that $\sqrt{1 + u} \le 1 + u$ for $u \ge 0$ to derive
\[ \frac{D\sqrt{1 - (x_k^T x^*)^2} + E\|x_k - x^*\|_2 + \frac{F}{2}\|x_k - x^*\|_2^2}{\sqrt{1 - (x_k^T x^*)^2}} = D + E\sqrt{\frac{2}{1 + x_k^T x^*}} + \frac{F\,\|x_k - x^*\|_2}{\sqrt{2}\sqrt{1 + x_k^T x^*}} \le D + E + \frac{(E + F)\|x_k - x^*\|_2}{\sqrt{2}}. \]
This proves (84).

Since $x_k^T x^* > 0$ and $x_{k+1}^T x^* > 0$, we can translate (84) into a contraction of $\|x_k - x^*\|_2^2 = 2(1 - x_k^T x^*)$. Writing $t_k = x_k^T x^*$ and $q_k = \rho_k + (1 - \rho_k) t_k^2$, (84) implies $t_{k+1} \ge t_k/\sqrt{q_k}$, and therefore
\[ \frac{\|x_{k+1} - x^*\|_2^2}{\|x_k - x^*\|_2^2} = \frac{1 - t_{k+1}}{1 - t_k} \le \frac{1 - t_k/\sqrt{q_k}}{1 - t_k} = \frac{\rho_k (1 + t_k)}{q_k + t_k \sqrt{q_k}} =: \bar{\rho}_k. \]
Before proving $\bar{\rho}_k \le \bar{\rho}_0 < 1$, we first show $\rho_k \le \rho_0 < 1$. Since $\|x_k - x^*\|_2 \le \delta_0 = \|x_0 - x^*\|_2$ and $\rho_k$ is increasing in $\|x_k - x^*\|_2$, we have $\rho_k \le \rho_0$. From $\delta_0 < \delta$ and (83), we have
\[ D + E + (E + F)\,\delta_0/\sqrt{2} < A - (A + |A - B| + C)\,\delta_0/\sqrt{2}, \]
which gives $\rho_0 < 1$. If $\rho_k = 0$, obviously $\bar{\rho}_k = 0 \le \bar{\rho}_0$. Otherwise, using $0 < \rho_k \le \rho_0$ and $t_k \ge t_0$, we have
\[ \bar{\rho}_k = \frac{1 + t_k}{t_k^2/\rho_k + (1 - t_k^2) + t_k\sqrt{t_k^2/\rho_k^2 + (1 - t_k^2)/\rho_k}} \le \frac{\rho_0 (1 + t_k)}{\rho_0 + (1 - \rho_0)t_k^2 + t_k\sqrt{\rho_0 + (1 - \rho_0)t_k^2}} \le \frac{\rho_0 (1 + t_0)}{\rho_0 + (1 - \rho_0)t_0^2 + t_0\sqrt{\rho_0 + (1 - \rho_0)t_0^2}} = \bar{\rho}_0, \]
where the last inequality holds because, for fixed $\rho_0 < 1$, the middle expression is non-increasing in $t_k \in [0, 1]$. Moreover, $\bar{\rho}_0 < 1$ since $\rho_0 < 1$ and $t_0 > 0$ give $\rho_0 + (1 - \rho_0)t_0^2 + t_0\sqrt{\rho_0 + (1 - \rho_0)t_0^2} > \rho_0 + t_0\rho_0 = \rho_0(1 + t_0)$.

Therefore, we have
\[ \|x_{k+1} - x^*\|_2^2 \le \bar{\rho}_k\, \|x_k - x^*\|_2^2 \le \bar{\rho}_0\, \|x_k - x^*\|_2^2, \]
so $\|x_{k+1} - x^*\|_2 \le \delta_0 < \delta$, and by induction $\|x_k - x^*\|_2^2 \le \prod_{t=0}^{k-1} \bar{\rho}_t\, \|x_0 - x^*\|_2^2 \le \bar{\rho}_0^{\,k}\, \|x_0 - x^*\|_2^2$, which implies the convergence of $x_k$ to $x^*$. Setting $\gamma_k = \sqrt{\bar{\rho}_k} - (D + E)/A$, as $x_k \to x^*$ we have $\bar{\rho}_k - \rho_k \to 0$ and $\rho_k \to ((D + E)/A)^2$, which leads to $\gamma_k \to 0$. This completes the proof.
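A compact way to read (84) (our restatement, not an additional result) is in terms of the angle $\theta_k$ between $x_k$ and $x^*$: the driving quantity is $\tan^2\theta_k$, exactly as in the classical analysis of power iteration:
\[
\tan^2\theta_{k+1} = \frac{1 - (x_{k+1}^\top x^*)^2}{(x_{k+1}^\top x^*)^2} \le \rho_k \tan^2\theta_k,
\qquad
\rho_k \to \left(\frac{D+E}{A}\right)^2 \ \text{as } x_k \to x^*,
\]
so near the solution the lemma reproduces the power-iteration behavior with $(D+E)/A$ playing the role of $|\lambda_2|/\lambda_1$.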
Lemma 14 Let $f$ be three-times continuously differentiable on an open set containing $B^d$ and let $H_j$ be the Hessian of $\nabla_j f = \partial f/\partial x_j$. Let $\{v_1, \dots, v_d\}$ be an orthonormal basis of $\mathbb{R}^d$. For $x \in B^d$ and $y_1, \dots, y_d \in B^d$, let
\[ G_i(y_1, \dots, y_d) = \sum_{j=1}^d v_{i,j}\, H_j(y_j), \qquad R_i(x) = \nabla f(x)^T v_i - \nabla f(x^*)^T v_i - (x - x^*)^T \nabla^2 f(x^*)\, v_i. \]
Then, we have
\[ \sum_{i=1}^d (R_i(x))^2 \le \frac{1}{4} M^2\, \|x - x^*\|_2^4, \qquad M = \max_{x \in B^d,\ y_1, \dots, y_d \in B^d} \sqrt{\sum_{i=1}^d \big( x^T G_i(y_1, \dots, y_d)\, x \big)^2}. \]

Proof From $\nabla_j f$ being twice continuously differentiable near $B^d$, Taylor's theorem gives
\[ \nabla_j f(x) = \nabla_j f(x^*) + \nabla(\nabla_j f)(x^*)^T (x - x^*) + \frac{1}{2}(x - x^*)^T H_j(y_j)(x - x^*) \tag{86} \]
where $y_j \in N(x, x^*) := \{ y \mid y = \lambda x + (1 - \lambda)x^*,\ 0 \le \lambda \le 1 \}$. Combining the coordinates $j = 1, \dots, d$ with weights $v_{i,j}$, we obtain
\[ R_i(x) = \frac{1}{2}(x - x^*)^T G_i(y_1, \dots, y_d)(x - x^*) = \frac{\|x - x^*\|_2^2}{2} \left( \frac{x - x^*}{\|x - x^*\|_2} \right)^T G_i(y_1, \dots, y_d) \left( \frac{x - x^*}{\|x - x^*\|_2} \right). \]
Since $(x - x^*)/\|x - x^*\|_2 \in B^d$, using the definition of $M$, we have the desired result.

C.2 In Support of the Proofs of Theorem 9 and Theorem 10

Lemma 15 Suppose that $f(x, y)$ is scale invariant in $x \in \mathbb{R}^{d_x}$ for each $y \in \mathbb{R}^{d_y}$ and twice continuously differentiable on an open set containing $B^{d_x} \times B^{d_y}$. Let $(x^*, y^*)$ be a point satisfying
\[ \nabla_x f(x^*, y^*) = \lambda^* x^*, \qquad \lambda^* > \bar{\lambda} = \max_{2 \le i \le d_x} |\lambda_i|, \qquad x^* = v_1, \]
where $(\lambda_i, v_i)$ is an eigen-pair of $\nabla^2_{xx} f(x^*, y^*)$. Then, for any $x \in B^{d_x}$ and $y \in B^{d_y}$, we have
\[ \nabla_x f(x, y)^T v_1 = \lambda^* + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, x^* + \alpha(x, y), \]
\[ \sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2 \le \big( \bar{\lambda}\|x - x^*\|_2 + \nu\|y - y^*\|_2 + \beta(x, y) \big)^2, \]
where
\[ \alpha(x, y) = o(\|(x - x^*, y - y^*)\|_2), \qquad \beta(x, y) = o(\|(x - x^*, y - y^*)\|_2), \qquad \nu = \|\nabla^2_{xy} f(x^*, y^*)\|_2. \]
Therefore, we have
\[ \frac{\sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2}{(\nabla_x f(x, y)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2 + \frac{\nu}{\lambda^*}\|y - y^*\|_2 + \theta(x, y) \right)^2, \qquad \theta(x, y) = o(\|(x - x^*, y - y^*)\|_2). \]

Proof Since $\nabla^2_{xx} f(x^*, y^*)$ is real and symmetric, without loss of generality, we assume that $\{v_1, \dots, v_{d_x}\}$ forms an orthonormal basis of $\mathbb{R}^{d_x}$. By Taylor expansion of $\nabla_x f(x, y)^T v_i$ at $(x^*, y^*)$, we have
\[ \nabla_x f(x, y)^T v_i = \nabla_x f(x^*, y^*)^T v_i + (x - x^*)^T \nabla^2_{xx} f(x^*, y^*)\, v_i + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i + R_i(x, y) \tag{87} \]
where $R_i(x, y) = o(\|(x - x^*, y - y^*)\|_2)$. Using $\nabla_x f(x^*, y^*) = \lambda^* x^*$ and $x^* = v_1$, we have
\[ \nabla_x f(x^*, y^*)^T v_1 = \lambda^*, \qquad (x - x^*)^T \nabla^2_{xx} f(x^*, y^*)\, v_1 = -\lambda_1 (1 - x^T x^*). \]
Therefore, we obtain
\[ \nabla_x f(x, y)^T v_1 = \lambda^* + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, x^* + \alpha(x, y) \tag{88} \]
where
\[ \alpha(x, y) = R_1(x, y) - \lambda_1 (1 - x^T x^*) = o(\|(x - x^*, y - y^*)\|_2), \]
using $1 - x^T x^* = \|x - x^*\|_2^2/2$. In the same way, for $2 \le i \le d_x$, we have
\[ \nabla_x f(x^*, y^*)^T v_i = \lambda^* (x^*)^T v_i = 0, \qquad (x - x^*)^T \nabla^2_{xx} f(x^*, y^*)\, v_i = \lambda_i\, x^T v_i, \]
resulting in
\[ \nabla_x f(x, y)^T v_i = \lambda_i\, x^T v_i + (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i + R_i(x, y). \tag{89} \]
From (89), we obtain
\[ \sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2 = \sum_{i=2}^{d_x} \lambda_i^2 (x^T v_i)^2 + \sum_{i=2}^{d_x} \big( (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i \big)^2 + \sum_{i=2}^{d_x} (R_i(x, y))^2 \]
\[ + 2\sum_{i=2}^{d_x} \lambda_i (x^T v_i)(y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i + 2\sum_{i=2}^{d_x} \lambda_i (x^T v_i) R_i(x, y) + 2\sum_{i=2}^{d_x} (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i\, R_i(x, y). \]
Since $\{v_1, \dots, v_{d_x}\}$ forms an orthonormal basis of $\mathbb{R}^{d_x}$, with $x^* = v_1$ and $\|x\|_2 = 1$, we have
\[ \sum_{i=2}^{d_x} \lambda_i^2 (x^T v_i)^2 \le \bar{\lambda}^2 \big( 1 - (x^T x^*)^2 \big) \le \bar{\lambda}^2 \|x - x^*\|_2^2, \qquad \sum_{i=2}^{d_x} \big( (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i \big)^2 \le \nu^2 \|y - y^*\|_2^2. \]
Let $\bar{R}(x, y) = \max_{2 \le i \le d_x} |R_i(x, y)|$ and note that $\bar{R}(x, y) = o(\|(x - x^*, y - y^*)\|_2)$. Using the Cauchy-Schwarz inequality, we have
\[ \Big| \sum_{i=2}^{d_x} \lambda_i (x^T v_i)(y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i \Big| \le \bar{\lambda}\, \nu\, \|y - y^*\|_2 \|x - x^*\|_2, \]
\[ \Big| \sum_{i=2}^{d_x} \lambda_i (x^T v_i) R_i(x, y) \Big| \le \bar{\lambda}\, \sqrt{d_x}\, \bar{R}(x, y)\, \|x - x^*\|_2, \qquad \Big| \sum_{i=2}^{d_x} (y - y^*)^T \nabla^2_{yx} f(x^*, y^*)\, v_i\, R_i(x, y) \Big| \le \nu\, \sqrt{d_x}\, \bar{R}(x, y)\, \|y - y^*\|_2. \]
Therefore, we obtain
\[ \sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2 \le \big( \bar{\lambda}\|x - x^*\|_2 + \nu\|y - y^*\|_2 + \beta(x, y) \big)^2 \tag{90} \]
with $\beta(x, y) = \sqrt{d_x}\, \bar{R}(x, y) = o(\|(x - x^*, y - y^*)\|_2)$. Finally, since $\{v_1, \dots, v_{d_x}\}$ forms an orthonormal basis and $|x^T x^*| \le \|x\|_2 \|x^*\|_2 = 1$, we have
\[ 1 - \frac{(\nabla_x f(x, y)^T x^*)^2}{\|\nabla_x f(x, y)\|_2^2} \le \frac{\sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2}{(\nabla_x f(x, y)^T v_1)^2}. \]
Using (88) and (90), we have
\[ \frac{\sum_{i=2}^{d_x} (\nabla_x f(x, y)^T v_i)^2}{(\nabla_x f(x, y)^T v_1)^2} \le \left( \frac{\bar{\lambda}}{\lambda^*}\|x - x^*\|_2 + \frac{\nu}{\lambda^*}\|y - y^*\|_2 + \theta(x, y) \right)^2, \qquad \theta(x, y) = o(\|(x - x^*, y - y^*)\|_2). \]
This completes the proof.
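Lemmas 15 through 17 together analyze an alternating scheme for partially scale invariant problems: an SCI-PI step on the sphere-constrained block $x$ and a gradient ascent step with step size $\alpha = 2/(L+\mu)$ on the strongly concave block $y$. Below is a minimal sketch of one such iteration (ours; `grad_x`, `grad_y`, `L`, and `mu` are placeholders for a concrete objective, and the ordering of the two block updates is an illustrative choice):

```python
import numpy as np

def alternating_step(x, y, grad_x, grad_y, L, mu):
    """One iteration of the alternating scheme analyzed in Lemmas 15-17 (sketch).

    x: current sphere-constrained block, ||x||_2 = 1.
    y: current unconstrained block (f(x, .) is mu-strongly concave in y).
    grad_x, grad_y: callables returning the partial gradients of f at (x, y).
    L, mu: Lipschitz and strong-concavity constants of grad_y f(x, .).
    """
    g = grad_x(x, y)
    x_next = g / np.linalg.norm(g)      # SCI-PI update on the x block
    alpha = 2.0 / (L + mu)              # step size from Lemma 16
    y_next = y + alpha * grad_y(x, y)   # gradient ascent on the y block
    return x_next, y_next
```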
Lemma 16 Suppose that $f(x, y)$ is $\mu$-strongly concave in $y \in \mathbb{R}^{d_y}$ with an $L$-Lipschitz continuous $\nabla_y f(x, \cdot)$ for each $x \in B^{d_x}$, and three-times continuously differentiable with respect to $x$ and $y$ on an open set containing $B^{d_x} \times \mathbb{R}^{d_y}$. Let $(x^*, y^*)$ be a point such that $\nabla_y f(x^*, y^*) = 0$. Then, for any $x \in B^{d_x}$ and $y \in \mathbb{R}^{d_y}$, with $\alpha = 2/(L + \mu)$, we have
\[ \| y + \alpha \nabla_y f(x, y) - y^* \|_2 \le \frac{2\nu}{L + \mu}\|x - x^*\|_2 + \frac{L - \mu}{L + \mu}\|y - y^*\|_2 + \theta(x, y) \tag{91} \]
where $\nu = \|\nabla^2_{yx} f(x^*, y^*)\|_2$ and $\theta(x, y) = o(\|(x - x^*, y - y^*)\|_2)$.

Proof Let $\nabla_{y,i} f$ be the $i$th coordinate of $\nabla_y f$ and
\[ H_{y,i} = \begin{bmatrix} H^{xx}_{y,i} & H^{xy}_{y,i} \\ H^{yx}_{y,i} & H^{yy}_{y,i} \end{bmatrix} \]
be the Hessian of $\nabla_{y,i} f$. By Taylor expansion of $\nabla_{y,i} f(x, y)$ in $x$ at $(x^*, y)$, we have
\[ \nabla_{y,i} f(x, y) = \nabla_{y,i} f(x^*, y) + \nabla_x \nabla_{y,i} f(x^*, y)^T (x - x^*) + R_i(x, y) \tag{92} \]
where $\nabla_x \nabla_{y,i} f(x^*, y)$ denotes the $i$th column of $\nabla^2_{xy} f(x^*, y)$ and
\[ R_i(x, y) = \frac{1}{2}(x - x^*)^T H^{xx}_{y,i}(\hat{x}_i, y)(x - x^*), \qquad \hat{x}_i \in N(x, x^*). \tag{93} \]
Also, from $f$ being three-times continuously differentiable, we have
\[ \nabla_x \nabla_{y,i} f(x^*, y) = \nabla_x \nabla_{y,i} f(x^*, y^*) + H^{xy}_{y,i}(x^*, \hat{y}_i)(y - y^*), \qquad \hat{y}_i \in N(y, y^*). \tag{94} \]
Since
\[ |(y - y^*)^T H^{yx}_{y,i}(x^*, \hat{y}_i)(x - x^*)| \le \|H^{yx}_{y,i}(x^*, \hat{y}_i)\|_2 \|x - x^*\|_2 \|y - y^*\|_2 \le \frac{1}{2}\|H^{yx}_{y,i}(x^*, \hat{y}_i)\|_2 \big( \|x - x^*\|_2^2 + \|y - y^*\|_2^2 \big), \]
we have
\[ (y - y^*)^T H^{yx}_{y,i}(x^*, \hat{y}_i)(x - x^*) = o(\|(x - x^*, y - y^*)\|_2). \tag{95} \]
By (92), (93), (94), and (95), we have
\[ \nabla_y f(x, y) = \nabla_y f(x^*, y) + \nabla^2_{yx} f(x^*, y^*)(x - x^*) + \bar{R}(x, y) \tag{96} \]
where
\[ \bar{R}_i(x, y) = R_i(x, y) + (y - y^*)^T H^{yx}_{y,i}(x^*, \hat{y}_i)(x - x^*) = o(\|(x - x^*, y - y^*)\|_2). \]
Using (96), we have
\[ y + \alpha \nabla_y f(x, y) - y^* = \big( y - y^* + \alpha \nabla_y f(x^*, y) \big) + \alpha \nabla^2_{yx} f(x^*, y^*)(x - x^*) + \alpha \bar{R}(x, y), \]
resulting in
\[ \| y + \alpha \nabla_y f(x, y) - y^* \|_2 \le \| y - y^* + \alpha \nabla_y f(x^*, y) \|_2 + \alpha \| \nabla^2_{yx} f(x^*, y^*)(x - x^*) \|_2 + \alpha \| \bar{R}(x, y) \|_2. \tag{97} \]
Since $-f(x^*, \cdot)$ is $\mu$-strongly convex with an $L$-Lipschitz continuous gradient and $\nabla_y f(x^*, y^*) = 0$, by the theory of convex optimization (Bubeck, 2015, p. 279), we have
\[ \| y - y^* + \alpha \nabla_y f(x^*, y) \|_2 \le \frac{L - \mu}{L + \mu}\|y - y^*\|_2 \tag{98} \]
due to $\alpha = 2/(L + \mu)$. Also, we have
\[ \alpha \| \nabla^2_{yx} f(x^*, y^*)(x - x^*) \|_2 \le \frac{2\nu}{L + \mu}\|x - x^*\|_2. \tag{99} \]
Plugging (98) and (99) into (97), we finally obtain (91) with $\theta(x, y) = \alpha \| \bar{R}(x, y) \|_2 = o(\|(x - x^*, y - y^*)\|_2)$.

Lemma 17 Suppose that a sequence of iterates $\{(x_k, y_k)\}_{k=0,1,\dots}$ satisfies
\[ \begin{bmatrix} \|x_{k+1} - x^*\|_2 \\ \|y_{k+1} - y^*\|_2 \end{bmatrix} \le \begin{bmatrix} a & e/b \\ e/c & d \end{bmatrix} \begin{bmatrix} \|x_k - x^*\|_2 \\ \|y_k - y^*\|_2 \end{bmatrix} + \begin{bmatrix} \theta_x(x_k, y_k) \\ \theta_y(x_k, y_k) \end{bmatrix} \tag{100} \]
for some functions $\theta_x$ and $\theta_y$ such that
\[ \theta_x(x, y) = o(\|(x - x^*, y - y^*)\|_2), \qquad \theta_y(x, y) = o(\|(x - x^*, y - y^*)\|_2). \]
If $a, d, e \ge 0$ and $b, c > 0$ satisfy
\[ \rho = \frac{a + d + \sqrt{(a - d)^2 + 4e^2/(bc)}}{2} < 1, \]
where $\rho$ is the largest eigenvalue of the above matrix, then there exists some $\delta > 0$ such that if $\Delta_0 < \delta$, where $\Delta_k = \sqrt{\|x_k - x^*\|_2^2 + \|y_k - y^*\|_2^2}$, we have
\[ \Delta_k^2 \le \prod_{t=0}^{k-1} (\rho + \gamma_t)^2\, \Delta_0^2 \quad \text{and} \quad \lim_{t \to \infty} \gamma_t = 0. \]

Proof From (100), we have
\[ \begin{bmatrix} \|x_{k+1} - x^*\|_2 \\ \|y_{k+1} - y^*\|_2 \end{bmatrix} \le \big( M + N(x_k, y_k) \big) \begin{bmatrix} \|x_k - x^*\|_2 \\ \|y_k - y^*\|_2 \end{bmatrix} \tag{101} \]
where
\[ M = \begin{bmatrix} a & e/b \\ e/c & d \end{bmatrix}, \qquad \epsilon(x, y) = \frac{\max\{\theta_x(x, y), \theta_y(x, y)\}}{\sqrt{\|x - x^*\|_2^2 + \|y - y^*\|_2^2}}, \]
\[ N(x, y) = \frac{\epsilon(x, y)}{\sqrt{\|x - x^*\|_2^2 + \|y - y^*\|_2^2}} \begin{bmatrix} \|x - x^*\|_2 & \|y - y^*\|_2 \\ \|x - x^*\|_2 & \|y - y^*\|_2 \end{bmatrix}. \tag{102} \]
Note that we have $\lim_{(x, y) \to (x^*, y^*)} N_{ij}(x, y) = 0$ for $i, j = 1, 2$. By Lemma 18, there exists a sequence $\{\omega_t\}_{t=0,1,\dots}$ such that
\[ \|M^k\|_2 \le \prod_{t=0}^{k-1} (\rho + \omega_t) \quad \text{and} \quad \lim_{t \to \infty} \omega_t = 0. \]
Since $\rho < 1$, this implies that $\|M^k\|_2$ converges to $0$. Let
\[ \tau = \min\{k : \|M^k\|_2 < 1\}, \qquad \tilde{\rho} = \frac{\|M^\tau\|_2 + 1}{2}, \qquad \rho_{\max} = \max_{1 \le k \le \tau} \|M^k\|_2. \]
For $(x, y)$ near $(x^*, y^*)$, let $\phi(x, y, l)$ denote the $l$th iterate generated from the initial point $(x, y)$, and let
\[ g(x, y, m) = \Big\| \prod_{l=0}^{m-1} \big( M + N(\phi(x, y, l)) \big) \Big\|_2. \tag{103} \]
Due to $N_{ij}(x, y) \to 0$ as $(x, y) \to (x^*, y^*)$, we have $g(x, y, m) \to \|M^m\|_2$ as $(x, y) \to (x^*, y^*)$. Hence, there exists some $\delta_\tau > 0$ such that $g(x, y, \tau) < \tilde{\rho}$ whenever $\Delta(x, y) < \delta_\tau$, and for each $1 \le m < \tau$ there exists some $\delta_m > 0$ such that $g(x, y, m) < 1 + \rho_{\max}$ whenever $\Delta(x, y) < \delta_m$. Taking the minimum of $\delta_m$ over $1 \le m \le \tau$, we obtain $\delta$ satisfying
\[ g(x, y, m) < 1 + \rho_{\max} \ \ (1 \le m < \tau) \quad \text{and} \quad g(x, y, \tau) < \tilde{\rho} \quad \text{whenever} \ \Delta(x, y) < \delta. \tag{104} \]
For any $n \ge 0$, if $\Delta_{n\tau} < \delta$, using (104) for (103) and (101), we have
\[ \Delta_{n\tau + m} \le (1 + \rho_{\max})\, \Delta_{n\tau} \ \ (1 \le m < \tau), \qquad \Delta_{(n+1)\tau} \le \tilde{\rho}\, \Delta_{n\tau}. \tag{105} \]
Suppose that $\Delta_0 < \delta$. Then, by repeatedly applying (105), for any $n \ge 0$ and $0 \le m < \tau$, we have
\[ \Delta_{n\tau + m} \le \tilde{\rho}^{\,n} (1 + \rho_{\max})\, \Delta_0, \]
which implies that $\Delta_k \to 0$ as $k \to \infty$. Let $N_t = N(x_t, y_t)$,
\[ \eta_k = \frac{\big\| \prod_{t=0}^{k} (M + N_t) \big\|_2}{\big\| \prod_{t=0}^{k-1} (M + N_t) \big\|_2} - \frac{\|M^{k+1}\|_2}{\|M^k\|_2}, \qquad \gamma_k = \omega_k + \eta_k. \]
Then, we have
\[ \Big\| \prod_{t=0}^{k} (M + N_t) \Big\|_2 \le \prod_{t=0}^{k} (\rho + \omega_t + \eta_t) = \prod_{t=0}^{k} (\rho + \gamma_t), \tag{106} \]
and, by iterating (101), $\Delta_k \le \big\| \prod_{t=0}^{k-1} (M + N_t) \big\|_2\, \Delta_0$. Since $N_t \to 0$, we have $\eta_t \to 0$, and thus $\lim_{t \to \infty} \gamma_t = 0$. This concludes the proof.
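Writing out the spectrum of $M$ makes the role of this condition explicit (a routine computation, included here for readability):
\[
\det(M - \lambda I) = \lambda^2 - (a + d)\lambda + ad - \frac{e^2}{bc},
\qquad
\lambda_{1,2} = \frac{(a + d) \pm \sqrt{(a - d)^2 + 4e^2/(bc)}}{2},
\]
so both eigenvalues are real, and $\rho = \lambda_1 < 1$ holds exactly when $\sqrt{(a - d)^2 + 4e^2/(bc)} < 2 - (a + d)$.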
Lemma 18 Let $M$ be a $2 \times 2$ matrix
\[ M = \begin{bmatrix} a & e/b \\ e/c & d \end{bmatrix} \]
for some $a > 0$, $b > 0$, $c > 0$, $d \ge 0$, $e \ge 0$, and let $\rho$ be the largest absolute eigenvalue of $M$. Then, there exists a sequence $\{\omega_t\}_{t=0,1,\dots}$ such that
\[ \|M^k\|_2 \le \prod_{t=0}^{k-1} (\rho + \omega_t) \quad \text{and} \quad \lim_{t \to \infty} \omega_t = 0. \]

Proof The characteristic equation reads
\[ \det(M - \lambda I) = \lambda^2 - \lambda(a + d) + ad - \frac{e^2}{bc} = 0 \]
with discriminant
\[ (a - d)^2 + \frac{4e^2}{bc} \ge 0. \]
Thus, all eigenvalues are real. First, we consider the case when $\det(M - \lambda I) = 0$ has a double root. The condition for a double root is
\[ (a - d)^2 + \frac{4e^2}{bc} = 0. \]
Since $b > 0$ and $c > 0$, this implies $a = d$ and $e = 0$. Therefore, $M = aI$ and $\rho = a$. From $M^k = a^k I$, we have $\|M^k\|_2 = a^k = \rho^k$, resulting in
\[ \omega_k = \frac{\|M^{k+1}\|_2}{\|M^k\|_2} - \rho = \rho - \rho = 0, \qquad k \ge 0. \]

Next, we consider the case when $M$ has two distinct eigenvalues $\lambda_1$ and $\lambda_2$. Since $a + d > 0$, we have $\lambda_1 + \lambda_2 > 0$. Without loss of generality, assume $\lambda_1 > \lambda_2$. Then, $\rho = \lambda_1$. Let $v_1$ and $v_2$ be unit eigenvectors corresponding to $\lambda_1$ and $\lambda_2$, respectively. Since $v_1$ and $v_2$ are linearly independent, we can represent each column of $M$ as a linear combination of $v_1$ and $v_2$:
\[ M = [\, \alpha_1 v_1 + \beta_1 v_2 \ \ \ \alpha_2 v_1 + \beta_2 v_2 \,]. \]
By repeatedly multiplying by $M$, we obtain
\[ M^k = [\, \alpha_1 \lambda_1^{k-1} v_1 + \beta_1 \lambda_2^{k-1} v_2 \ \ \ \alpha_2 \lambda_1^{k-1} v_1 + \beta_2 \lambda_2^{k-1} v_2 \,]. \]
Let $C^k = (M^k)^T M^k$. Then, we have
\[ C^k_{11} = \alpha_1^2 \lambda_1^{2(k-1)} + \beta_1^2 \lambda_2^{2(k-1)} + 2\alpha_1\beta_1 (\lambda_1\lambda_2)^{k-1} v_1^T v_2, \]
\[ C^k_{22} = \alpha_2^2 \lambda_1^{2(k-1)} + \beta_2^2 \lambda_2^{2(k-1)} + 2\alpha_2\beta_2 (\lambda_1\lambda_2)^{k-1} v_1^T v_2, \]
\[ C^k_{12} = C^k_{21} = \alpha_1\alpha_2 \lambda_1^{2(k-1)} + \beta_1\beta_2 \lambda_2^{2(k-1)} + (\alpha_1\beta_2 + \alpha_2\beta_1)(\lambda_1\lambda_2)^{k-1} v_1^T v_2. \]
Since $|v_1^T v_2| \le 1$, we have
\[ C^k_{11} \ge \big( |\alpha_1| \lambda_1^{k-1} - |\beta_1| |\lambda_2|^{k-1} \big)^2 \ge 0, \qquad C^k_{22} \ge \big( |\alpha_2| \lambda_1^{k-1} - |\beta_2| |\lambda_2|^{k-1} \big)^2 \ge 0, \]
and
\[ \|M^k\|_2 = \sqrt{ \frac{ C^k_{11} + C^k_{22} + \sqrt{ \big( C^k_{11} - C^k_{22} \big)^2 + 4 (C^k_{12})^2 } }{2} }. \]
Since $|\lambda_2/\lambda_1| < 1$, we have
\[ \lim_{k \to \infty} \frac{C^k_{11}}{\lambda_1^{2(k-1)}} = \alpha_1^2, \qquad \lim_{k \to \infty} \frac{C^k_{22}}{\lambda_1^{2(k-1)}} = \alpha_2^2, \qquad \lim_{k \to \infty} \frac{C^k_{12}}{\lambda_1^{2(k-1)}} = \lim_{k \to \infty} \frac{C^k_{21}}{\lambda_1^{2(k-1)}} = \alpha_1 \alpha_2, \]
so that $\lim_{k \to \infty} \|M^{k+1}\|_2 / \|M^k\|_2 = \lambda_1 = \rho$. Setting $\omega_k = \|M^{k+1}\|_2/\|M^k\|_2 - \rho$, we have $\|M^k\|_2 = \prod_{t=0}^{k-1}(\rho + \omega_t)$ and
\[ \lim_{k \to \infty} \omega_k = \lim_{k \to \infty} \frac{\|M^{k+1}\|_2}{\|M^k\|_2} - \rho = \rho - \rho = 0, \]
and we obtain the desired result.

References

P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006.

Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185-209, 2015.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Nicolas Boumal. Nonconvex phase synchronization. SIAM Journal on Optimization, 26(4):2355-2377, 2016.

Christos Boutsidis, Dan Garber, Zohar Karnin, and Edo Liberty. Online principal components analysis. In ACM-SIAM Symposium on Discrete Algorithms, pages 887-901, 2015.

Stephen Boyd, Neal Parikh, and Eric Chu. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers, 2011.

Sébastien Bubeck. Convex Optimization: Algorithms and Complexity. Now Publishers, 2015.

Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717-772, 2009.

Yuxin Chen, Yuejie Chi, Jianqing Fan, and Cong Ma. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176:5-37, 2019.

Murat A. Erdogdu, Asuman Ozdaglar, Pablo A. Parrilo, and Nuri Denizcan Vanli. Convergence rate of block-coordinate maximization Burer-Monteiro method for solving large SDPs. Mathematical Programming, 195(1):243-281, 2022.

Cédric Févotte and Jérôme Idier. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9):2421-2456, 2011.
John G. F. Francis. The QR transformation: a unitary analogue to the LR transformation, part 1. The Computer Journal, 4(3):265-271, 1961.

John G. F. Francis. The QR transformation, part 2. The Computer Journal, 4(4):332-345, 1962.

Robert M. Freund. Dual gauge programs, with applications to quadratic programming and the minimum-norm problem. Mathematical Programming, 38(1):47-67, 1987.

Michael P. Friedlander, Ives Macedo, and Ting Kei Pong. Gauge optimization and duality. SIAM Journal on Optimization, 24(4):1999-2022, 2014.

Dan Garber, Elad Hazan, and Tengyu Ma. Online learning of eigenvectors. In International Conference on Machine Learning, pages 560-568, 2015.

Gene H. Golub and Henk A. Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1-2):35-65, 2000.

Gene H. Golub and Charles F. Van Loan. Matrix Computations. JHU Press, 2012.

Fang Han and Han Liu. Scale-invariant sparse PCA on high-dimensional meta-elliptical data. Journal of the American Statistical Association, 109(505):275-287, 2014.

Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1-2):165-199, 2017.

Meng Huang and Zhiqiang Xu. Strong convexity of affine phase retrieval. arXiv preprint arXiv:2204.09412, 2022.

Ruitong Huang, Tor Lattimore, András György, and Csaba Szepesvári. Following the leader and fast rates in online linear prediction: Curved constraint sets and other regularities. Journal of Machine Learning Research, 18(145):1-31, 2017.

Aapo Hyvärinen. Fast ICA for noisy data using Gaussian moments. In IEEE International Symposium on Circuits and Systems VLSI, volume 5, pages 57-61, 1999.

Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.

Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis. John Wiley & Sons, 2004.

Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11(Feb):517-553, 2010.

Cheolmin Kim and Diego Klabjan. A simple and fast algorithm for L1-norm kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8):1842-1855, 2020a.

Cheolmin Kim and Diego Klabjan. Stochastic variance-reduced algorithms for PCA with arbitrary mini-batch sizes. In International Conference on Artificial Intelligence and Statistics, volume 108, pages 4302-4312, 2020b.

Youngseok Kim, Peter Carbonetto, Matthew Stephens, and Mihai Anitescu. A fast algorithm for maximum likelihood estimation of mixture proportions using sequential quadratic programming. Journal of Computational and Graphical Statistics, 29(2):261-273, 2020.

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556-562, 2001.

Qi Lei, Kai Zhong, and Inderjit S. Dhillon. Coordinate-wise power method. In Advances in Neural Information Processing Systems, pages 2064-2072, 2016.

Adrian S. Lewis, D. Russell Luke, and Jérôme Malick. Local linear convergence for alternating and averaged nonconvex projections. Foundations of Computational Mathematics, 9(4):485-513, 2009.

Chih-Jen Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19(10):2756-2779, 2007.
Huikang Liu, Man-Chung Yue, and Anthony Man-Cho So. On the estimation performance and convergence rate of the generalized power method for phase synchronization. SIAM Journal on Optimization, 27(4):2426-2446, 2017.

Ronny Luss and Marc Teboulle. Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint. SIAM Review, 55(1):65-98, 2013.

James MacQueen. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, 1967.

R. V. Mises and Hilda Pollaczek-Geiringer. Praktische Verfahren der Gleichungsauflösung. Zeitschrift für Angewandte Mathematik und Mechanik, 9(1):58-77, 1929.

C. L. Müntz. Solution directe de l'équation séculaire et de quelques problèmes analogues. Comptes Rendus de l'Académie des Sciences, 156:43-46, 1913.

Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267-273, 1982.

Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.

Prasanna K. Sahoo and Palaniappan Kannappan. Introduction to Functional Equations. Chapman and Hall/CRC, 2011.

Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In International Conference on Machine Learning, pages 144-152, 2015.

Ohad Shamir. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In International Conference on Machine Learning, pages 248-256, 2016.

Richard A. Tapia, John E. Dennis Jr., and Jan P. Schäfermeyer. Inverse, shifted inverse, and Rayleigh quotient iteration as Newton's method. SIAM Review, 60(1):3-55, 2018.

Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 38(1):49-95, 1996.

Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336-1353, 2013.

James H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1965.

Peng Xu, Bryan He, Christopher De Sa, Ioannis Mitliagkas, and Chris Ré. Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics, pages 58-67, 2018.

Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(Apr):899-925, 2013.