# certifiably_optimal_low_rank_factor_analysis__20d4a55b.pdf

Journal of Machine Learning Research 18 (2017) 1-53 Submitted 11/15; Revised 2/17; Published 4/17

Certiﬁably Optimal Low Rank Factor Analysis

Dimitris Bertsimas dbertsim@mit.edu Sloan School of Management MIT Cambridge, MA 02139, USA

Martin S. Copenhaver mcopen@mit.edu Operations Research Center MIT Cambridge, MA 02139, USA

Rahul Mazumder rahulmaz@mit.edu Sloan School of Management MIT Cambridge, MA 02139, USA

Editor: Benjamin Recht

Factor Analysis (FA) is a technique of fundamental importance that is widely used in classical and modern multivariate statistics, psychometrics, and econometrics. In this paper, we revisit the classical rank-constrained FA problem which seeks to approximate an observed covariance matrix (Σ) by the sum of a Positive Semideﬁnite (PSD) low-rank component (Θ) and a diagonal matrix (Φ) (with nonnegative entries) subject to Σ Φ being PSD. We propose a ﬂexible family of rank-constrained, nonlinear Semideﬁnite Optimization based formulations for this task. We introduce a reformulation of the problem as a smooth optimization problem with convex, compact constraints and propose a uniﬁed algorithmic framework, utilizing state of the art techniques in nonlinear optimization to obtain high-quality feasible solutions for our proposed formulation. At the same time, by using a variety of techniques from discrete and global optimization, we show that these solutions are certiﬁably optimal in many cases, even for problems with thousands of variables. Our techniques are general and make no assumption on the underlying problem data. The estimator proposed herein aids statistical interpretability and provides computational scalability and signiﬁcantly improved accuracy when compared to current, publicly available popular methods for rank-constrained FA. We demonstrate the eﬀectiveness of our proposal on an array of synthetic and real-life datasets. To our knowledge, this is the ﬁrst paper that demonstrates how a previously intractable rank-constrained optimization problem can be solved to provable optimality by coupling developments in convex analysis and in global and discrete optimization. Keywords: factor analysis, rank minimization, semideﬁnite optimization, ﬁrst order methods, nonlinear optimization, global optimization, discrete optimization

1. Introduction

Factor Analysis (FA) (Anderson, 2003; Bartholomew et al., 2011; Mardia et al., 1979), a widely used methodology in classical and modern multivariate statistics, is used as a

c 2017 Dimitris Bertsimas, Martin S. Copenhaver, and Rahul Mazumder.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/15-613.html.

Bertsimas, Copenhaver, and Mazumder

tool to obtain a parsimonious representation of the correlation structure among a set of variables in terms of a smaller number of common hidden factors. A basic FA model is of the form x = Lf + ϵ, where xp 1 is the observed random vector, fr1 1 (with r1 p; note that we do not necessarily restrict r1 to be small) is a random vector of common factor variables or scores, Lp r1 is a matrix of factor loadings and ϵp 1 is a vector of uncorrelated random variables. We assume that the variables are mean-centered, f and ϵ are uncorrelated and without loss of generality, the covariance of f is the identity matrix. We will denote Cov(ϵ) = Φ = diag(Φ1, . . . , Φp). It follows that

Σ = Σc + Φ, (1)

where Σ is the covariance matrix of x and Σc := LL is the covariance matrix corresponding to the common factors. Decomposition (1) suggests that Σ can be written as the sum of a positive semideﬁnite (PSD) matrix Σc of rank r1 and a nonnegative diagonal matrix (Φ) corresponding to the errors. In particular, the variance of the ith coordinate of x := (x1, . . . , xp), i.e., var(xi) = P k l2 ik + Φi, i = 1, . . . , p, splits into two parts, where L = ((lik)). The ﬁrst part (P k l2 ik) is known as the communality estimate (since this is the variance of the factors common to all the xi s) and the remaining part Φi is the variance speciﬁc to the ith variable (Φi s are also referred to as the unique variances or simply uniquenesses).

Formulation of the estimator: In decomposition (1), the assumption that the rank (r1) of Σc is small compared to p is fairly stringent see Guttman (1958); Shapiro (1982); ten Berge (1998) for a historical overview of the concept. In a classical paper of Guttman (1958), the author argued based on psychometric evidence that Σc is often found to have high algebraic rank. In psychometric case studies it is rather rare that the covariance structure can be completely explained by a few common factors corresponding to mental abilities in fact there is evidence of at least hundreds of common factors being present with the number growing without an upper bound. Formally, this means that instead of assuming that Σc has exactly low-rank it is practical to assume that it can be well-approximated by a low-rank matrix, namely, L1L 1 with L1 Rp r. More precisely, L1L 1 is the best rank-r approximation to Σc in the matrix q-norm (also known as the Schatten norm), as deﬁned in (5), and (Σc L1L 1) is the residual component. Following psychometric terminology, L1 corresponds to the r most signiﬁcant factors representative of mental abilities and the residual Σc L1L 1 corresponds to the remaining psychometric factors unexplained by L1L 1. Thus we can rewrite decomposition (1) as follows:

Σ = L1L 1 | {z } :=Θ

where we use the notation Θ = L1L 1 and N = (Σc L1L 1) with Θ + N = Σc = Σ Φ. Note that Θ denotes the best rank-r approximation to (Σ Φ), with the residual component being N = Σ Φ Θ. Note that the entries in Φ need to be non-negative1 and Σ Φ 0. In fact, in the words of ten Berge (1998) (see p. 326)

1. Negative estimates of the diagonals of Φ are unwanted since they correspond to variances, but some FA estimation procedures often lead to negative estimates of Φ these are popularly known in the literature as Heywood cases and have invited a signiﬁcant amount of discussion in the community.

Certifiably Optimal Low Rank Factor Analysis

. . . However, when Σ Φ the covariance matrix for the common parts of the variables, would appear to be indeﬁnite, that would be no less embarrassing than having a negative unique variance in Φ. . .

We further refer the reader to Mardia et al. (1979) discussing the importance of Σ Φ being PSD.2 We thus have the following natural structural constraints on the parameters:

Θ 0, Φ = diag(Φ, . . . , Φp) 0 and Σ Φ 0. (3)

Motivated by the above considerations, we present the following rank-constrained estimation problem for FA:

minimize Σ (Θ + Φ) q q s.t. rank(Θ) r

Φ = diag(Φ1, . . . , Φp) 0

where Θ Rp p, Φ Rp p are the optimization variables, and for a real symmetric matrix Ap p, its matrix q-norm, also known as the Schatten norm (or Schatten-von-Neumann norm) is deﬁned as

i=1 |λi(A)|q ! 1

where λi(A), i = 1, . . . , p, are the (real) eigenvalues of A.

Interpreting the estimator: The estimation criterion (4) seeks to jointly obtain the (low-rank) common factors and uniquenesses that best explain Σ in terms of minimizing the matrix q-norm of the error Σ (Θ + Φ) under the PSD constraints (3). Note that criterion (4) does not necessarily assume that Σ exactly decomposes into a low-rank PSD matrix and a non-negative diagonal matrix. Problem (4) enjoys curious similarities with Principal Component Analysis (PCA). In PCA, given a PSD matrix Σ the leading r principal component directions of Σ are obtained by minimizing Σ Θ q subject to Θ 0 and rank(Θ) r. If the optimal solution Φ to Problem (4) is given, Problem (4) is analogous to a rank-r PCA on the residual matrix Σ Φ thus it is naturally desirable to have Σ Φ 0. In PCA one is interested in understanding the proportion of variance explained by the top-r principal component directions: Pr i=1 λi(Σ)/ Pp i=1 λi(Σ). The denominator Pp i=1 λi(Σ) = Tr(Σ) accounts for the total variance explained by the covariance matrix Σ. Analogously, the proportion of variance explained by bΘr (which denotes the best rank r approximation to Σ Φ) is given by Pr i=1 λi(Σ Φ)/ Pp i=1 λi(Σ Φ) for this quantity to be interpretable it is imperative that Σ Φ 0. In the above argument, of course, we assumed that Φ is given. In general, Φ needs to be estimated: Problem (4) achieves this goal by jointly learning Φ and Θ. We note that certain popular approaches of FA (see Sections 1.1 and 1.2) do not impose the PSD constraint Σ Φ as a part of the estimation

2. However, the estimation method described in Mardia et al. (1979) does not guarantee that Σ Φ 0.

Bertsimas, Copenhaver, and Mazumder

scheme leading to indeﬁnite Σ Φ thereby rendering statistical interpretations troublesome. Our numerical evidence suggests that the quality of estimates of Θ and Φ obtained from Problem (4) outperform those obtained by other competing procedures which do not incorporate the PSD constraints into their estimation criteria.

Choice of r: In exploratory FA, it is standard to consider several choices of r and study the manner in which the proportion of variance explained by the common factors saturates with increasing r. We refer the reader to popularly used methods described in Anderson (2003); Bartholomew et al. (2011); Mardia et al. (1979) and more modern techniques (Bai and Ng, 2008, see also references therein) for the choice of r.

Estimate of Covariance Matrix: In the ﬁnite sample setting, we set Σ to be the sample covariance matrix. A consequence of solving Problem (4) is that we get an estimate for the covariance matrix given by ˆΘ + ˆΦ in this sense, criterion (4) can be viewed as a regularization scheme: the rank constraint on Θ encourages parsimony and the PSD constraints encourage interpretability, as discussed above.

In this paper, we propose a general computational framework to solve Problem (4) for any q 1. The well-known Schatten q-norm appearing in the loss function is chosen for ﬂexibility it underlines the fact that our approach can be applied for any q 1. Note that the estimation criterion (4) (even for the case q = 1) does not seem to appear in prior work on approximate minimum rank Factor Analysis (MRFA) (ten Berge and Kiers, 1991; Shapiro and ten Berge, 2002). However, we show in Proposition 1 that Problem (4) for the special case q = 1 turns out to be equivalent to MRFA. For q = 2, the loss function is the familiar squared Frobenius norm also used in MINRES, though the latter formulation is not equivalent to Problem (4), as explained in Section 1.1. We place more emphasis on studying the computational properties for the more common norms q {1, 2}. The presence of the rank constraint in Problem (4) makes the optimization problem nonconvex. Globally optimizing Problem (4) or for that matter obtaining a good stationary point is quite challenging. We propose a new equivalent smooth formulation to Problem (4) which does not contain the combinatorial rank constraint. We employ simple and tractable sequential convex relaxation techniques with guaranteed convergence properties and excellent computational properties to obtain a stationary point for Problem (4). An important novelty of this paper is to present certiﬁable lower bounds on Problem (4) without resorting to structural assumptions, thus making it possible to solve Problem (4) to provable optimality. Towards this end we propose new methods and ideas that incorporate state-of-the-art developments in nonlinear and global optimization.3

1.1 A selective overview of related FA estimators

FA has a long and inﬂuential history which dates back more than a hundred years. The notion of FA possibly ﬁrst appeared in Spearman (1904) for the one factor model, which was then generalized to the multiple factors model by various authors (see, for example,

3. The class of optimization problems studied in this paper involve global minimization of nonconvex, continuous semideﬁnite optimization problems. Computational methods for this class of problems are in a nascent stage; further, such methods are signiﬁcantly less developed when compared to those for mixed integer linear optimization problems, thus posing a major challenge in this work.

Certifiably Optimal Low Rank Factor Analysis

Thurstone 1947). Signiﬁcant contributions related to computational methods for FA have been nicely documented in Bai and Ng (2008); Harman and Jones (1966); J oreskog (1967, 1978); Lawley and Maxwell (1962); Ledermann (1937); Rao (1973); Shapiro (1982); and ten Berge and Kiers (1991), among others. We will brieﬂy describe some widely used approaches for FA that are closely related to the approach pursued herein and also point out their connections.

Constrained Minimum Trace Factor Analysis (MTFA): This approach (ten Berge et al., 1981; Riccia and Shapiro, 1982) seeks to decompose Σ exactly into the sum of a diagonal matrix and a low-rank component, which are estimated via the following convex optimization problem:

minimize Tr(Θ)

s.t. Θ 0 Σ = Θ + Φ Φ = diag(Φ1, . . . , Φp) 0,

with variables Θ, Φ. Because Θ is PSD, Tr(Θ) = Pp i=1 λi(Θ) is a convex surrogate (Fazel, 2002) for the rank of Θ Problem (6) may thus be viewed as a convexiﬁcation of the rank minimization problem

minimize rank(Θ)

s.t. Θ 0 Σ = Θ + Φ Φ = diag(Φ1, . . . , Φp) 0.

In general, Problems (6) and (7) are not equivalent. See Riccia and Shapiro (1982); Saunderson et al. (2012); Shapiro (1982) (and references therein) for further connections between the minimizers of (6) and (7). A main diﬀerence between formulations (4) and (7) is that the former allows an error in the residual (Σ Θ Φ) by constraining Θ to have low-rank, unlike (7) which imposes a hard constraint Σ = Θ + Φ. As noted earlier, this can be quite restrictive in various applications. Even if one views Problem (6) as imposing a less stringent requirement than that of Problem (7), we see two distinct advantages of Problem (4) over Problem (6): it oﬀers the modeling ﬂexibility of controlling the complexity, viz. rank, of solutions via the choice of r; and it provides smaller of estimates of rank for a comparable amount of explained variance, as substantiated by experimental ﬁndings presented in Section 5.

Approximate Minimum Rank Factor Analysis (MRFA): This method (see for example, ten Berge and Kiers, 1991; Shapiro and ten Berge, 2002) considers the following optimization problem:

i=r+1 λi(Σ Φ)

s.t. Φ = diag(Φ1, . . . , Φp) 0 Σ Φ 0,

Bertsimas, Copenhaver, and Mazumder

where the optimization variable is Φ. Proposition 1 presented below establishes that Problem (8) is equivalent to the rank-constrained FA formulation (4) for the case q = 1. This connection does not appear to be formally established in ten Berge and Kiers (1991). We believe that criterion (4) for q = 1 is easier to interpret as an estimation criterion for FA models over (8). ten Berge and Kiers (1981, 1991) describe a method4 for numerically optimizing (8) as documented in the code for MRFA (ten Berge and Kiers, 2003), their implementation can handle problems of size p 20.

Principal Component (PC) Factor Analysis: Principal Component factor analysis or PC in short (Connor and Korajczyk, 1986; Bai and Ng, 2008) implicitly assumes that Φ = σ2Ip p, for some σ2 > 0 and performs a low-rank PCA on Σ. It is not clear how to estimate Φ via this method such that Φi 0 and Σ Φ 0. Following Mardia et al. (1979), the Φ s may be estimated after estimating bΘ via the update rule bΦ = diag(Σ bΘ) the estimates thus obtained, however, need not be non-negative. Furthermore, it is not guaranteed that the condition Σ Φ 0 is met.

Minimum Residual Factor Analysis (MINRES): This approach (Harman and Jones, 1966; Shapiro and ten Berge, 2002) considers the following optimization problem (with respect to the variable Lp r):

σij (LL )ij 2 , (9)

where Σ := ((σij)), Θ = LL and the sum (in the objective) is taken over all the oﬀ-diagonal entries. Formulation (9) is equivalent to the nonconvex optimization problem

minimize Σ (Θ + Φ) 2 2 s.t. rank(Θ) r Θ 0 Φ = diag(Φ1, . . . , Φp),

with variables Θ and Φ, where Φi, i 1, are unconstrained. If bΘ is a minimizer of Problem (10), then any b L satisfying b Lb L = bΘ minimizes (9) and vice-versa. Various heuristic approaches are used to for Problem (9). The R package psych, for example, uses a black box gradient based tool optim to minimize the nonconvex Problem (9) with respect to L. Once b L is estimated, the diagonal entries of Φ are estimated as bΦi = σii (b Lb L )ii, for i 1. Note that bΦi obtained in this fashion may be negative5 and the condition Σ Φ 0 may be violated.

Generalized Least Squares, Principal Axis and variants: The Ordinary Least Squares (OLS) method for FA (Bartholomew et al., 2011) considers formulation (10) with the additional constraint that Φi 0 i. The Weighted Least Squares (WLS) or the generalized least squares method (see for example, Bartholomew et al., 2011) considers a weighted least squares objective: W (Σ (Θ + Φ)) 2 2 .

4. The method is similar to Algorithm 1 presented herein for the case of q = 1; however, ten Berge and Kiers (1981, 1991) rely on a heuristic procedure, as described in Bentler and Woodward (1980), for the subproblem with respect to Φ. 5. If bΦi < 0, some ad-hoc procedure is used to threshold it to a nonnegative quantity.

Certifiably Optimal Low Rank Factor Analysis

As in the ordinary least squares case, here too we assume that Φi 0. Various choices of W are possible depending upon the application, with W {Σ 1, Φ 1} being a couple of popular choices. The Principal Axis (PA) FA method (Bartholomew et al., 2011; Revelle, 2015) is popularly used to estimate factor model parameters based on criterion (10) along with the constraints Φi 0 i. This method starts with a nonnegative estimate bΦ and performs a rank r eigendecomposition on Σ bΦ to obtain bΘ. The matrix bΦ is then updated to match the diagonal entries of Σ bΘ, and the above steps are repeated until the estimate bΦ stabilizes.6 Note that in this procedure the estimate bΘ may fail to be PSD and the entries of bΦi may be negative as well. Heuristic restarts and various initializations are often carried out to arrive at a reasonable solution (see for example discussions in Bartholomew et al. 2011). In summary, the least squares stylized methods described above may lead to estimates that violate one or more of the constraints: Σ Φ 0 , Θ 0 and Φ 0.

Maximum Likelihood for Factor Analysis: This approach (Bai and Li, 2012; J oreskog, 1967; Mardia et al., 1979; Roweis and Ghahramani, 1999; Rubin and Thayer, 1982) is another widely used method in FA and typically assumes that the data follows a multivariate Gaussian distribution. This procedure maximizes a likelihood function and is quite diﬀerent from the loss functions pursued in this paper and discussed above. The estimator need not exist for any Σ see for example Robertson and Symons (2007).

Most of the methods described in Section 1.1 are widely used and their implementations are available in statistical packages psych (Revelle, 2015), n Factors (Raiche and Magis, 2011), GPArotation (Bernaards and Jennrich, 2005), and others and are publicly available from CRAN.7

1.2 Broad categories of factor analysis estimators

A careful investigation of the methods described above suggests that they can be divided into two broad categories. Some of the above estimators explicitly incorporate a PSD structural assumption on the residual covariance matrix Σ Φ in addition to requiring Θ 0 and Φ 0 while the others do not. As already pointed out, these constraints are important for statistical interpretability. We propose to distinguish between the following two broad categories of FA algorithms:

(A) This category is comprised of FA estimation procedures cast as nonlinear Semideﬁnite Optimization (SDO) problems estimation takes place in the presence of constraints of the form Σ Φ 0, along with Θ 0 and Φ 0. Members of this category are MRFA, MTFA and more generally Problem (4).

Existing approaches for these problems are typically not scalable: for example, we are not aware of any algorithm (prior to this paper) for Problem (8) (MRFA) that scales to covariance matrices with p larger than thirty. Indeed, while theoretical guarantees

6. We are not aware of a proof, however, showing the convergence of this procedure. 7. http://cran.us.r-project.org

Bertsimas, Copenhaver, and Mazumder

of optimality exist in certain cases (Saunderson et al., 2012), the conditions required for such results to hold are generally diﬃcult to verify in practice.

(B) This category includes classical FA methods which are not based on nonlinear SDO based formulations (as in Category (A)). MINRES, OLS, WLS, GLS, PC and PA based FA estimation procedures (as described in Section 1.1) belong to this category. These methods are generally scalable to problem sizes where p is of the order of a few thousand signiﬁcantly larger than most procedures belonging to Category (A) and are implemented in open-source R-packages.

Contributions: Our contributions in this paper may be summarized as follows:

1. We consider a ﬂexible family of FA estimators which can be obtained as solutions to rank-constrained nonlinear SDO problems. In particular, our framework provides a unifying perspective on several existing FA estimation approaches.

2. We propose a novel exact reformulation of the rank-constrained FA problem (4) as a smooth optimization problem with convex compact constraints. We also develop a uniﬁed algorithmic framework utilizing modern optimization techniques to obtain high quality solutions to Problem (4). Our algorithms, at every iteration, simply require computing a low-rank eigendecomposition of a p p matrix and a structured scalable convex SDO. Our proposal is capable of solving FA problems involving covariance matrices having dimensions up to a few thousand, thereby making it on par with the most scalable FA methods used currently.8

3. Our SDO formulation enables us to estimate the underlying factors and unique variances under the restriction that the residual covariance matrix is PSD a characteristic that is absent in several popularly used FA methods. This aids statistical interpretability, especially in drawing parallels with PCA and understanding the proportion of variance explained by a given number of factors. Methods proposed herein produce superior quality estimates, in terms of various performance metrics, when compared to existing FA approaches. To our knowledge, this is the ﬁrst paper demonstrating that certiﬁably optimal solutions to a rank-constrained problem can be found for problems of realistic sizes, without making any assumptions on the underlying data.

4. Using techniques from discrete and global optimization, we develop a branch-andbound algorithm which proves that the low-rank solutions found are often optimal in seconds for problems on the order of p = 10 variables, in minutes for problems on the order of p = 100, and in days for some problems on the order of p = 4000. As the selected rank increases, so too does the computational burden of proving optimality. It is particularly crucial to note that the optimal solutions for all problems we consider are found very quickly, and that vast majority of computational time is then spent on proving optimality. Hence, for a practitioner who is not particularly concerned with certifying optimality, our techniques for ﬁnding feasible solutions provide high-quality estimates quickly.

8. An implementation of our approach is available at https://github.com/copenhaver/factoranalysis.

Certifiably Optimal Low Rank Factor Analysis

5. We provide computational evidence demonstrating the favorable performance of our proposed method. Finally, to the best of our knowledge this is the ﬁrst paper that views various FA methods in a uniﬁed fashion via a modern optimization lens and attempts to compare a wide range of FA techniques in large scale.

Structure of the paper: The paper is organized as follows. In Section 1 we propose a ﬂexible family of optimization Problems (4) for the task of statistical estimation in FA models. Section 2 presents an exact reformulation of Problem (4) as a nonlinear SDO without the rank constraint. Section 3 describes the use of nonlinear optimization techniques such as the Conditional Gradient (CG) method (Bertsekas, 1999) adapted to provide feasible solutions (upper bounds) to our formulation. First order methods employed to compute the convex SDO subproblems are also described in the same section. In Section 4, we describe our method for certifying optimality of the solutions from Section 3 in the case when q = 1. In Section 5, we present computational results demonstrating the eﬀectiveness of our proposed method in terms of (a) modeling ﬂexibility in the choice of the number of factors r and the parameter q, (b) scalability, and (c) the quality of solutions obtained in a wide array of real and synthetic datasets comparisons with several existing methods for FA are considered. Section 6 contains our conclusions.

2. Reformulations of Problem (4)

Let λ(A) denote the vector of eigenvalues of A, arranged in decreasing order, i.e.,

λ1(A) λ2(A) . . . λp(A). (11)

The following proposition presents the ﬁrst reformulation of Problem (4) as a continuous eigenvalue optimization problem with convex compact constraints. Proofs of all results can be found in Appendix A.

Proposition 1 (a) For any q 1, Problem (4) is equivalent to:

(CFAq) minimize fq(Φ; Σ) :=

i=r+1 λq i (Σ Φ)

s.t. Φ = diag(Φ1, . . . , Φp) 0 Σ Φ 0,

where Φ is the optimization variable. (b) Suppose Φ is a minimizer of Problem (12), and let

Θ = U diag λ1(Σ Φ ), . . . , λr(Σ Φ ), 0, . . . , 0 U , (13)

where Up p is the matrix of eigenvectors of Σ Φ . Then (Θ , Φ ) is a solution to Problem (4).

Problem (12) is a nonlinear SDO in Φ, unlike the original formulation (4) that estimates Θ and Φ jointly. Note that the rank constraint does not appear in Problem (12) and the constraint set of Problem (12) is convex and compact. However, Problem (12) is nonconvex due to the nonconvex objective function Pp i=r+1 λq i (Σ Φ). For q = 1 the function appearing in the objective of (12) is concave and for q > 1, it is neither convex nor concave.

Bertsimas, Copenhaver, and Mazumder

Proposition 2 The estimation Problem (4) is equivalent to

minimize Σ (Θ + Φ) q q s.t. rank(Θ) r

Φ = diag(Φ1, . . . , Φp) 0

Note that Problem (14) has an additional PSD constraint Σ Θ 0 which does not explicitly appear in Problem (4). It is interesting to note that the two problems are equivalent. By virtue of Proposition 2, Problem (14) can as well be used as the estimation criterion for rank constrained FA. However, we will work with formulation (4) because it is easier to interpret from a statistical perspecitve.

Special instances of (CFAq): We show that some well-known FA estimation problems can be viewed as special cases of our general framework. For q = 1, Problem (12) reduces to MRFA, as described in (8). For q = 1 and r = 0, Problem (12) reduces to MTFA (6). When q = 2, we get a variant of (10), i.e., a PSD constrained analogue of MINRES

minimize p P

i=r+1 λ2 i (Σ Φ)

s.t. Φ = diag(Φ1, . . . , Φp) 0 Σ Φ 0.

Note that unlike Problem (10), Problem (15) explicitly imposes PSD constraints on Φ and Σ Φ. The objective function in (12) is continuous but non-smooth. The function is diﬀerentiable at Φ if and only if the r and (r+1)th eigenvalues of Σ Φ are distinct (Lewis, 1996; Shapiro and ten Berge, 2002), i.e., λr+1(Σ Φ) < λr(Σ Φ). The non-smoothness of the objective function in Problem (12) makes the use of standard gradient based methods problematic (Bertsekas, 1999). Theorem 1 presents a reformulation of Problem (12) in which the objective function is continuously diﬀerentiable.

Theorem 1 (a) The estimation criterion given by Problem (4) is equivalent to9

minimize gq(W, Φ) := Tr(W(Σ Φ)q) s.t. Φ = diag(Φ1, . . . , Φp) 0 Σ Φ 0 I W 0 Tr(W) = p r.

(b) The solution bΘ of Problem (4) can be recovered from the solution c W, bΦ of Problem (16) via: bΘ := b U diag(bλ1, . . . , bλr, 0, . . . , 0) b U , (17)

9. For any PSD matrix A, with eigendecomposition A = UA diag(λ1, . . . , λp)U A, we deﬁne Aq := UA diag(λq 1, . . . , λq p)U A, for any q 1.

Certifiably Optimal Low Rank Factor Analysis

where b U is the matrix formed by the p eigenvectors corresponding to the eigenvalues bλ1, . . . , bλp (arranged in decreasing order) of the matrix Σ bΦ. Given bΦ, any solution bΘ (given by (17)) is independent of q.

In Problem (16), if we partially minimize the function gq(W, Φ) over Φ (with ﬁxed W), the resulting function is concave in W. This observation leads to the following proposition.

Proposition 3 The function Gq(W) obtained upon (partially) minimizing gq(W, Φ) over Φ (with W ﬁxed) in Problem (16), given by

Gq(W) := inf Φ gq(W, Φ)

s.t. Φ = diag(Φ1, . . . , Φp) 0 Σ Φ 0, (18)

is concave in W. The sub-gradients of the function Gq(W) exist and are given by:

Gq(W) = (Σ bΦ(W))q, (19)

where bΦ(W) is a minimizer of the convex optimization Problem (18).

In light of Proposition 3, we present another reformulation of Problem (4) as the following concave minimization problem:

minimize Gq(W)

s.t. I W 0 Tr(W) = p r, (20)

where the function Gq(W) is diﬀerentiable if and only if bΦ(W) is unique. Note that by virtue of Proposition 1, Problems (12) and (20) are equivalent. Therefore, it is natural to ask whether one formulation might be favored over the other from a computational perspective. Towards this end, note that both Problems (12) and (20) involve the minimization of a non-smooth objective function, over convex compact constraints. However, the objective function of Problem (12) is nonconvex (for q > 1) whereas the one in Problem (20) is concave (for all q 1). We will see in Section 3 that CG-based algorithms can be applied to a concave minimization problem (even if the objective function is not diﬀerentiable); however, CG applied to general non-smooth objective functions has limited convergence guarantees. Thus, formulation (20) is readily amenable to CG-based optimization algorithms, unlike formulation (12). This seems to make Problem (20) computationally more appealing than Problem (12).

3. Finding Upper Bounds

This section presents a uniﬁed computational framework for the class of problems (CFAq). Problem (16) is a nonconvex smooth optimization problem and obtaining a stationary point is quite challenging. We propose iterative schemes based on the Conditional Gradient (CG) algorithm (Bertsekas, 1999) a generalization of the Frank-Wolfe algorithm (Frank

Bertsimas, Copenhaver, and Mazumder

and Wolfe, 1956) to obtain a stationary point of the problem. The appealing aspect of our framework is that every iteration of the algorithm requires solving a convex SDO problem which is computationally tractable. While oﬀ-the-shelf interior point algorithms for example SDPT3 (Toh et al., 1999), Yalmip (L ofberg, 2004), and MOSEK (Andersen and Andersen, 2000) can be used to solve the convex SDO problems, they typically do not scale well for large problems due to intensive memory requirements. In this vein, ﬁrst order algorithms have received a lot of attention (Nesterov, 2004, 2005, 2007) in convex optimization of late, due to their low cost per iteration, low-memory requirements, and ability to deliver solutions of moderate accuracy for large problems within a modest time limit. We use ﬁrst order methods to solve the convex SDO problems. We present one primary scheme based on the CG algorithm:

Algorithm 1: This scheme, described in Section 3.1, applies CG on the optimization Problem (20), where the function Gq(W) deﬁned in (18) is concave (and possibly non-smooth).

In addition, in Appendix B we present an alternative approach that applies CG to Problem (16), where the objective function gq(W, Φ) is smooth.

To make notation simpler, we will use the following shorthand:

Ψp,p r := {W Rp p : I W 0, Tr(W) = p r} (21)

and FΣ = {Φ : Σ Φ 0, Φ = diag(Φ1, . . . , Φp) 0}.

3.1 A CG based algorithm for Problem (20)

The CG method for Problem (20) requires solving a linearization of the concave objective function. At iteration k, if W(k) is the current estimate of W, the new estimate W(k+1) is obtained by

W(k+1) arg min W Ψp,p r

D Gq(W(k)), W E = arg min W Ψp,p r Tr(W(Σ Φ(k))q), (22)

where by Proposition 3, (Σ Φ(k))q is a sub-gradient of Gq(W) at W(k) with Φ(k) given by Φ(k) arg min Φ FΣ Tr(W(k)(Σ Φ)q) (23)

No explicit line search is necessary here because the minimum will always be at the new point, i.e., Φ(k), due to the concavity of the objective function. The sequence W(k) is recursively computed via (22) until the convergence criterion

Gq(W(k)) Gq(W(k+1)) TOL Gq(W(k)), (24)

for some user-deﬁned tolerance, TOL > 0, is met. A short description of the procedure appears in Algorithm 1. Before we present the convergence rate of Algorithm 1, we will need to introduce some notation. For any point W belonging to the feasible set of Problem (20) let us deﬁne (W) as follows: (W) := inf W Ψp,p r Gq(W), W W . (25)

Certifiably Optimal Low Rank Factor Analysis

Algorithm 1 A CG based algorithm for formulation (20)

1. Initialize with W(1) (k = 1), feasible for Problem (20) and repeat, for k 2, Steps 2-3 until convergence criterion (24) is satisﬁed.

2. Update Φ (with W ﬁxed) by solving (23).

3. Update W (with Φ ﬁxed) by solving (22), to get W(k+1).

Further, W satisﬁes the ﬁrst order stationary condition for Problem (20) if W is feasible for the problem and (W ) 0. We now present Theorem 2 establishing the rate of convergence and associated convergence properties of Algorithm 1. The proof (along with all other omitted proofs) is contained in Appendix A.

Theorem 2 If W(k) is a sequence produced by Algorithm 1, then Gq(W(k)) is a monotone decreasing sequence and every limit point W( ) of the sequence W(k) is a stationary point of Problem (20). Furthermore, Algorithm 1 has a convergence rate of O(1/K) (with K denoting the iteration index) to a ﬁrst order stationary point of Problem (20), i.e.,

min i=1,...,K

n (W(i)) o Gq(W(1)) Gq(W( ))

3.2 Solving the convex SDO problems

Algorithm 1 requires sequentially solving convex SDO problems in W and Φ. We describe herein how these subproblems can be solved eﬃciently.

3.2.1 Solving the SDO problem with respect to W

A generic SDO problem associated with Problem (22) requires updating W as

c W arg min W Ψp,p r W, f W , (27)

for some ﬁxed symmetric f Wp p, depending upon the algorithm and the choice of q, described as follows. For Algorithm 1 the update in W at iteration k for Problem (22), corresponds to f W = (Σ Φ(k+1))q. A solution to Problem (27) is given by c W = Pp i=r+1 uiu i, where u1, . . . , up are the eigenvectors of the matrix f W, corresponding to the eigenvalues λ1(f W), . . . , λp(f W).

3.2.2 Solving the SDO problem with respect to Φ

The SDO problem arising from the update of Φ is not as straightforward as the update with respect to W. Before presenting the general case, it helps to consider a few special cases of (CFAq). For q = 1 the objective function of Problem (16)

g1(W, Φ) = W, Σ

i=1 wiiΦi (28)

Bertsimas, Copenhaver, and Mazumder

is linear in Φ (for ﬁxed W). For q = 2, the objective function of Problem (16)

g2(W, Φ) = Tr(WΣ2) +

i=1 (wiiΦ2 i 2 wi, σi Φi) (29)

is a convex quadratic in Φ (for ﬁxed W). For Algorithm 1, the partial minimizations with respect to Φ, for q = 1 and q = 2, require minimizing Problems (28) and (29), respectively. Various instances of optimization problems with respect to Φ such as those appearing in Algorithm 1 can be viewed as special cases of the following family of SDO problems:

minimize Φ FΣ

ciΦ2 i + diΦi (30)

where ci 0 and di for i = 1, . . . , p are problem parameters that depend upon the choice of algorithm and q. We now present a ﬁrst order convex optimization scheme for solving (30).

A First Order Scheme for Problem (30)

With the intention of providing a simple and scalable algorithm for the convex SDO problem, we use the Alternating Direction Method of Multipliers (Bertsekas, 1999; Boyd et al., 2011) (ADMM). We introduce a splitting variable Λ = Σ Φ and rewrite Problem (30) in the following equivalent form:

minimize Φ,Λ

i=1 (ciΦ2 i + diΦi)

s.t. Φ = diag(Φ1, . . . , Φp) 0

The Augmented Lagrangian for the above problem is:

Lρ(Φ, Λ, ν) :=

i=1 (ciΦ2 i + diΦi) + D ν, Λ (Σ Φ) E + ρ

2 Λ (Σ Φ) 2 2, (32)

where ρ > 0 is a scalar and , denotes the standard trace inner product. ADMM involves the following three updates:

Φ(k+1) arg min Φ=diag(Φ1,...,Φp) 0 Lρ(Φ, Λ(k), ν(k)), (33)

Λ(k+1) arg min Λ 0 Lρ(Φ(k+1), Λ, ν(k)), (34)

ν(k+1) = ν(k) + ρ(Λ(k+1) (Σ Φ(k+1))), (35)

and produces a sequence {(Φ(k), λ(k), ν(k))}, k 1 the convergence properties of the algorithm are quite well known (Boyd et al., 2011).

Certifiably Optimal Low Rank Factor Analysis

Problem (33) can be solved in closed form as

Φ(k+1) i = ρ ρ + 2ci max

(σii λ(k) ii ) (di + ν(k) ii ) ρ , 0

, i = 1, . . . , p. (36)

The update with respect to Λ in (34) requires an eigendecomposition:

Λ(k+1) = arg min Λ 0

Λ (Σ Φ(k+1) 1

where, the operator PS+ p (A) denotes the projection of a symmetric matrix A onto the cone of

PSD matrices of dimension p p: PS+ p (A) = UA diag max {λ1, 0} , . . . , max {λp, 0} U A,

where, A = UA diag(λ1, . . . , λp)U A is the eigendecomposition of A.

Stopping criterion: The ADMM iterations (33) (35) are continued till the values of Λ(k+1) (Σ Φ(k+1)) 2 and the relative change in the objective values of Problem (30) become smaller than a certain threshold, say, TOL α, where α {10 1, . . . , 10 3} this is typically taken to be smaller than the convergence threshold for the CG iterations (TOL).

Computational cost of Problem (30): The most intensive computational stage in the ADMM procedure is in performing the projection operation (37); this requires O(p3) operations due to the associated eigendecomposition. This needs to be done for as many ADMM steps, until convergence. Since Problem (30) is embedded inside iterative procedures like Algorithm 1, the estimates of (Φ, Λ, ν) obtained by solving Problem (30) for a iteration index (of the CG algorithm) provides a good warm-start for the Problem (30) in the subsequent CG iteration. This is often found to decrease the number of iterations required by the ADMM algorithm to converge to a prescribed level of accuracy.10

3.3 Computational cost of Algorithm 1

For Algorithm 1 and other CG-based algorithms for factor analysis (see Appendix B), the computational bottleneck is in performing the eigendecomposition of a p p matrix: the W update requires performing a low-rank eigendecomposition of a p p matrix and the Φ update requires solving a problem of the form (30), which also costs O(p3). Since eigendecompositions can easily be done for p of the order of a few thousands, the proposed algorithms can be applied to that scale. Note that most existing popular algorithms for FA belonging to Category (B) (see Section 1.2) also perform an eigendecomposition with cost O(p3). Thus it appears that

10. The utility of warm starts is another compelling reason to apply a ﬁrst-order-based approach instead of interior point methods. Indeed, warm starts are well-known to perform quite poorly when incorporated into interior point methods often they can perform worse than cold starts (Yildirim and Wright, 2002; John and Yildirim, 2008). Given the need to repeatedly solve similarly structured SDOs for both upper bounds (as presented in this section) as well as lower bounds (as presented in Section 4), the ability to eﬀectively incorporate warm start information is crucial.

Bertsimas, Copenhaver, and Mazumder

Category (B) and the algorithms proposed herein have the same computational complexity and hence these two estimation approaches are equally scalable.

4. Certiﬁcates of Optimality via Lower Bounds

In this section, we outline our approach to computing lower bounds to (CFAq) via techniques from global optimization and matrix analysis. In particular, we focus on the case when q = 1.11 We begin with an overview of the method. We then discuss initialization parameters for the method as well as branching rules and other reﬁnements employed in our approach.

4.1 Overview of Method

Our primary problem of interest is to provide lower bounds to (CFA1), i.e.,

minimize Φ FΣ

i=r+1 σi(Σ Φ), (38)

or equivalently visa-vis Theorem 1,

minimize W Ψp,p r Φ FΣ

W, Σ Φ . (39)

One possible approach is to consider convex lower bounds to (39).

Deﬁnition 3 For a function f : Γ Rn R we deﬁne its convex envelope on Γ, denoted convenvΓ(f), to be the largest convex function g with g f. In symbols,

g = sup{h : Γ R | h convex on Γ and h f}.

The convex envelope acts as the best possible convex lower bound for a given function. Further, its precise form is well-known for certain classes of functions; one such function of principle interest here is described in the following theorem. This is in some contexts referred to as a Mc Cormick hull and is widely used throughout the nonconvex optimization literature (Floudas, 1999; Tawarmalani and Sahinidis, 2002a).

Theorem 4 (Al-Khayyal and Falk, 1983) If f : Γ = [0, 1] [ℓ, u] R is deﬁned by f(x, y) = xy, then the convex envelope of f on Γ is precisely

convenvΓ(f)(x, y) = max{ ux, ℓ ℓx y}.

Further, |f(x, y) convenvΓ(f)(x, y)| (u ℓ)/4. In particular, if |u ℓ| 0, then convenvΓ(f) f.

11. The general case for q > 1 can be addressed using similar (although more complicated) techniques as applied to Problem (16), again applying principles developed in global optimization.

Certifiably Optimal Low Rank Factor Analysis

Using this result we proceed to describe our approach for computing lower bounds to (CFA1) as in (39). First observe that if Φi [ℓi, ui] and Wii [0, 1], then

convenv( WiiΦi)(Wii, Φi) = max{ ui Wii, ℓi ℓi Wii Φi}

= min{ui Wii, Φi + ℓi Wii ℓi}.

Hence, the best possible convex lower bound to the objective in Problem (39), namely W, Σ Φ = W, Σ P i WiiΦi, for ℓ diag(Φ) u and I W 0 is

i=1 min{ui Wii, Φi + ℓi Wii ℓi}.

By introducing auxiliary variables e Rp to represent these convex envelopes, we obtain the following linear SDO that is a lower bound to (39):

minimize W,Φ,e W, Σ

s. t. Φ FΣ W Ψp,p r ℓi Φi ui ei Φi + ℓi Wii ℓi ei ui Wii

The approach now lies in iteratively reﬁning the lower and upper bounds on the diagonal entries of Φ, denoted ℓand u, respectively, in order to improve the quality of the approximations obtained via convex envelopes (cf. Theorem 4). This classical scheme is known as spatial branch and bound and is shown in pseudocode in Algorithm 2 as it applies to solving (39) by way of using (LSℓ,u). In words, Algorithm 2 involves treating a given node n = [ℓ, u], which represents bounds on Φ, namely, ℓ diag(Φ) u. Here we solve (LSℓ,u) with the lower and upper bounds ℓand u, respectively, and see whether the resulting new feasible solution is better (lower in objective value) than the best known incumbent solution encountered thus far. We then see if the the bound for this node as obtained via (LSℓ,u) is better than the currently known best feasible solution; if it is not at least the current best feasible solution s objective value (up to some numerical tolerance), then we must further branch on this node, generating two new nodes n1 and n2 which partition the existing node n. Throughout, we keep track of the worst lower bound encountered, which allows for us to terminate the algorithm early while still having a provable suboptimality guarantee on the best feasible solution Φf FΣ found thus far. In light of Theorem 4, we have as a corollary the following theorem.

Theorem 5 Given numerical tolerance TOL > 0, Algorithm 2 (with an appropriate branching rule)12 terminates in ﬁnitely many iterations and solves (CFA1) to within an additive optimality gap of at most TOL. Further, if Algorithm 2 is terminated early (i.e., before Nodes = ), then the best feasible solution Φf at termination is guaranteed to be within an additive optimality gap of zf zlb.

12. A branching rule that is suﬃcient for convergence is selecting i arg maxi(uc i ℓc i) and α = (uc i + ℓc i)/2.

Bertsimas, Copenhaver, and Mazumder

The algorithm we have considered here omits some important details. After discussing properties of (LSℓ,u) in Section 4.2, we will discuss various aspects of Algorithm 2. In Section 4.3, we detail how to choose input u0. We then turn our attention to branching (line 5) in Section 4.4. In Section 4.5, we use results from matrix analysis coupled with ideas from the modern practice of discrete optimization to make tailored reﬁnements to Algorithm 2. Finally, in Section 4.6 we include a discussion of node selection strategies.

Algorithm 2 Spatial branch and bound scheme to solve (39). The inputs are as follows: (a) upper bounds u0 such that for any i and any Φ FΣ, we have Φi u0 i (see Section 4.3); (b) optimality tolerance TOL; and (c) initial feasible solution Φf FΣ.

1. Initialize zf P i>r σi(Σ Φf); Nodes {([0, u0], )}; and zlb .

2. While Nodes = , remove some node ([ℓc, uc], zc) Nodes.

3. Solve (LSℓc,uc). Let Φ be an optimal solution with z the optimal objective value; set zu P i>r σi(Σ Φ).

4. If zu < zf (i.e., a better feasible solution is found), update the best feasible solution found thus far (Φf) to be Φ and update the corresponding value (zf) to zu.

5. If z < zf TOL (i.e., a TOL optimal solution has not yet been found), then pick some i {1, . . . , p} and some α (ℓc i, uc i). Then add two new nodes to Nodes:

j<i [ℓc j, uc j] [ℓc i, α] Y

j>i [ℓc j, uc j], z

j<i [ℓc j, uc j] [α, uc i] Y

j>i [ℓc j, uc j], z

6. Update the best lower bound zlb min ([ℓ,u],z) Nodes z and return to Step 2.

Global Optimization State of the Art: We close this section by discussing similarities between the branch-and-bound approach we develop here and existing methods in nonconvex optimization. Our approach is very similar in spirit to approaches to global optimization (Floudas, 1999), and in particular for (nonconvex) quadratic optimization problems, quadratically-constrained convex optimization problems, and bilinear optimization problems (Hansen et al., 1993; Bao et al., 2009; Tawarmalani and Sahinidis, 2002b; Tawarmalani et al., 2013; Costa and Liberti, 2012; Anstreicher and Burer, 2010; Misener and Floudas, 2012). The primary similarity is that we work within a branch and bound framework using successively better convex lower bounds. However, while global optimization software for a variety of nonconvex problems with underlying vector variables is generally well-developed (as evidenced by solvers like BARON, see Sahinidis 2014), this is not the case for problems with underlying matrix variables and semideﬁnite constraints. The presence of semideﬁnite structure presents several substantial computational challenges. First and foremost, algorithmic implementations for solving linear SDOs are not nearly as advanced as those which exist for linear optimization problems. Therefore, each

Certifiably Optimal Low Rank Factor Analysis

subproblem, which is itself a linear SDO, carries a larger computational cost than the usual corresponding linear program which typically arises in other global optimization problems with vector variables. Secondly, a critical component of the success of global optimization software is the ability to quickly resolve multiple instances of subproblems which have similar structure. Corresponding methods for SDOs, as solved via interior point methods, are generally not well-developed. Finally, semideﬁnite structure complicates the traditional process of computing convex envelopes. Such computations are critical to the success of modern global optimization solvers like BARON. There are a variety of other possible approaches to computing lower bounds to (CFAq). One such approach is the method of moments (Lasserre, 2009). However, for problems of the size we are considering, such an approach is likely not computationally feasible, so we do not make a direct comparison here. There is also recent work in complementarity constraints literature (Bai et al., 2016) which connects rank-constrained optimization problems to copositive optimization (Burer, 2012). In short, such an approach turns (38) into an equivalent convex problem; despite the transformation, the new problem is not particularly amenable to computation at this time. For this reason, we do not consider the copositive optimization approach.

4.2 Properties of (LSℓ,u)

We now examine properties of (LSℓ,u), the main subproblem of interest. Observe that it is a linear SDO problem, and therefore we can consider its dual, namely

maximize q,f u,f ℓ,µ, σ,M,N,P (p r)q u f u Tr(N) P, Σ diag(ℓ)

s. t. µ + σ = 1 diag(P) + f u f ℓ µ = 0 Σ diag(u) + M + N + diag(diag(u ℓ)µ) q I = 0 f u, f ℓ, µ, σ 0 M, N, P 0.

Observation 1 We now include some remarks about structural properties of (LSℓ,u) and its dual (DSℓ,u).

1. If rank(Σ) = p then the Slater condition (Boyd and Vandenberghe, 2004) holds and hence there is strong duality, so we can work with (DSℓ,u) instead of (LSℓ,u) as an exact reformulation.

2. There exists an optimal solution to the dual with f u = 0. This is a variable reduction which is not immediately obvious. Note that f u appears as the multiplier for the constraints in the primal of the form diag(Φ) u. To claim that we can set f u = 0 it suﬃces to show that the corresponding constraints in the primal can be ignored. Namely, if (W , Φ ) solves (LSℓ,u) with the constraints Φi ui i omitted, then the pair (W , eΦ) is feasible and optimal to (LSℓ,u) with all the constraints included, where eΦ is deﬁned by eΦi = min{Φ i , ui}.

Bertsimas, Copenhaver, and Mazumder

Hereinafter we set f u = 0 and omit the constraint diag(Φ) u (with the caveat that, upon solving a problem and identifying some Φ , we must instead work with min{Φ , diag(u)}, taken entrywise).

Solving Subproblems: We brieﬂy detail how to solve (LSℓ,u). In light of the discussion in Section 3.2, we choose to apply a ﬁrst-order method. Observe that we cannot solve the primal form (LSℓ,u) within the branch-and-bound framework unless we solve it to optimality. Therefore, we instead choose to work with its dual (DSℓ,u). We apply an oﬀthe-shelf solver SCS (O Donoghue et al., 2016) to solve (DSℓ,u) and ﬁnd reasonably accurate, feasible solutions for this dual problem, which guarantees that we have a lower bound to (LSℓ,u).13 In this way, we maintain the provable optimality properties of Algorithm 2 without needing to solve nodes in the branch-and-bound tree to full optimality.

4.3 Input Parameters

In solving the root node of Algorithm 2, we must begin with some choice of u0 = u. An obvious ﬁrst choice for ui is ui = Σii, but one can do better. Let us optimally set ui, deﬁning it as

ui := maximize x R x

s.t. Σ x E(i) 0, (40)

where E(i) Rp p is a matrix with all zeros except E(i) ii = 1. These bounds are useful because if Φ FΣ, then Φi ui. Note that problem (40) is a linear SDO for which strong duality holds. Its dual is precisely

ui = minimize M Rp p M, Σ

s. t. Mii = 1 M 0, (41)

a linear SDO in standard form with a single equality constraint. By a result of (Barvinok, 1995; Pataki, 1998), there exists a rank one solution to (41). This implies that (41) can actually be solved as a convex quadratic program:

ui = minimize m Rp m Σm

s. t. m2 i = 1. = minimize m m Σm

s. t. mi = 1. (42)

This formulation given in (42) is computationally inexpensive to solve (given a large number of specialized convex quadratic problem solvers), in contrast to both formulations (40) and (41).14

13. One notable feature of the ADMM-based approach is that we can extract an approximately feasible primal solution Φ, which is useful for branching. Note that in Algorithm 2, we can replace the best incumbent solution if we ﬁnd a new Φ which has better objective value P

i>r σi(Σ Φ). Because Φ may not be feasible (i.e., Φ / FΣ), we take care here. Namely, compute t = P

i>r λi(Σ Φ), where λ1(Σ Φ) λp(Σ Φ) are the sorted eigenvalues of Σ Φ. If t < zf, then we perform an iteration of CG scheme for ﬁnding feasible solutions (outlined in Section 3) to ﬁnd a feasible Φ FΣ. We then use this as a possible candidate for replacing the incumbent. 14. In the case when Σ 0, it is not necessary to use quadratic optimization problems to compute u. In this case once can apply a straightforward Schur complement argument (Boyd and Vandenberghe, 2004) to show that u can be computed by solving for inverse of p diﬀerent (p 1) (p 1) matrices (or equivalently, by ﬁnding the diagonal of the precision matrix Σ 1).

Certifiably Optimal Low Rank Factor Analysis

4.4 Branching

Here we detail two methods for branching (line 5 in Algorithm 2). The problem of branching is as follows: having solved (LSℓ,u) for some particular n = [ℓ, u], we must choose some i {1, . . . , p} and split the interval [ℓi, ui] to create two new subproblems. We begin with a simple branching rule. Given a solution (W , Φ , e ) to (LSℓ,u), compute i argmaxi |e i W iiΦ i | and branch on variable Φi, generating two new subproblems with the intervals Y

j<i [ℓj, uj] [ℓi, Φ i ] Y

j>i [ℓj, uj] and Y

j<i [ℓj, uj] [Φ i , ui] Y

j>i [ℓj, uj].

Observe that, so long as maxi |e i W iiΦ i | > 0, the solution (W , Φ , e ) is not optimal for either of the subproblems created. We now brieﬂy describe an alternative rule which we employ instead. We again pick the branching index i as before, but now the two new nodes we generate are Y

j<i [ℓj, uj] [ℓi, (1 ϵ)Φ i +ϵℓi] Y

j>i [ℓj, uj] and Y

j<i [ℓj, uj] [(1 ϵ)Φ i +ϵℓi, ui] Y

j>i [ℓj, uj],

where ϵ [0, 1) is some parameter. For the computational experiments, we set ϵ = 0.4. Such an approach, which lowers the location of the branch in the ith interval [ℓi, ui] from Φ i , serves to improve the objective value from the ﬁrst node, while hurting the objective value from the second node (here by objective value, we mean the objective value of the optimal solution to the two new subproblems). In this way, it spreads out the distance between the two, and so it is more likely that the ﬁrst node may have an objective value that is higher than zf TOL than before, and hence, this would mean there are fewer nodes necessary to consider to solve for an additive gap of TOL. While this heuristic explanation is only partially satisfying, we have observed throughout a variety of numerical experiments that this rule, even though simple, performs better across a variety of example classes than the basic branching rule outlined. At the same time, recent work on the theory of branching rules supports such a heuristic rule (le Bodic and Nemhauser, 2015). In Section 5.4, we give evidence on the impact of the use of the modiﬁed branching rule.

4.5 Weyl s Method Pruning and Bound Tightening

In this subsection, we develop another method for lower bounds for the factor analysis problem. While we use it to supplement our approach detailed throughout Section 4, it is of interest as a standalone method, particularly for its computational speed and simplicity. In Section 5.3, we discuss the performance of this approach in both contexts. The central problem of interest in factor analysis involves the spectrum of a symmetric matrix (Σ Φ) which is the diﬀerence of two other symmetric matrices (Σ and Φ). From a linear algebraic perspective, the spectrum of the sum of real symmetric matrices is an extensively studied problem (Horn and Johnson, 2013; Bhatia, 2007).Therefore, it is natural to inquire how such results carry over to our setting. We discuss the implications of a wellknown result from this literature, namely Weyl s inequality (see, e.g., Bhatia 2007):

Bertsimas, Copenhaver, and Mazumder

Theorem 6 (Weyl s Inequality) For symmetric matrices A, B Rp p with sorted eigenvalues

λ1(A) λ2(A) λp(A) and λ1(B) λ2(B) λp(B)

one has for any k {1, . . . , p} that

λk(A + B) λk+j(A) + λp j(B) j = 0, . . . , p k.

For any vector x Rp we let {x(i)}p i=1 denote sorted {xi}p i=1 with x(1) x(2) x(p). Using this notation, we arrive at the following theorem.

Theorem 7 For any diagonal matrix Φ one has that

minimize Φ FΣ

i>r σi(Σ Φ) minimize Φ FΣ

i>r max j=0,...,p i

n λi+j(Σ Φ) + diag Φ Φ

(p j) , 0 o!

Proof Apply Weyl s inequality with A = Σ diag( Φ) and B = Φ Φ, and use the fact that Φ FΣ so Σ Φ 0.

Weyl s Method: The new lower bound introduced in Theorem 7 is a nonconvex problem in general. We begin by discussing one situation in which Theorem 7 provides computationally tractable (and fast) lower bounds; we deem this Weyl s method, detailed as follows:

1. Compute bounds ui as in (40), so that for all Φ FΣ, one has Φi ui i.

2. For each r {1, . . . , p}, one can compute a lower bound to (CFA1) (for a given r) of

i>r max{λi(Σ diag(u)), 0},

by taking Theorem 7 with Φ = diag(u).

As per the above remarks, computing u in Step 1 of Weyl s method can be carried out eﬃciently. Step 2 only relies on computing the eigenvalues of Σ diag(u). Therefore, this lower bounding procedure is quite simple to carry out. What is perhaps surprising is that this simple lower bounding procedure is eﬀective as a standalone method. We describe such results in Section 5.3. We now turn our attention to how Weyl s method can be utilized within the branch and bound tree as described in Algorithm 2.

Pruning: We begin by considering how Weyl s method can be used for pruning. The notion of pruning for branch and bound trees is grounded in the theory and practice of discrete optimization. In short, pruning in the elimination of nodes from the tree without actually solving them. We make this precise in our context. Consider some point in the branch and bound process in Algorithm 2, where we have some collection of nodes, ([ℓc, uc], zc) Nodes. Recall that zc is the optimal objective

Certifiably Optimal Low Rank Factor Analysis

value of the parent node of n. Visa-vis Weyl s method, we know a priori, without solving (LSℓc,uc), that

minimize Φ FΣ ℓc diag(Φ) uc

i>r σi(Σ Φ) X

i>r max{λi(Σ diag(uc)), 0}.

Hence, if zf TOL < P i>r max{λi(Σ diag(uc)), 0}, where zf is as in Algorithm 2, then node n can be discarded, i.e., there is no need to actually compute (LSℓc,uc) or further consider this branch. This is because if we were to solve (LSℓc,uc), and then branch again, solving further down this branch to optimality, then the ﬁnal lower bound obtained would necessarily be at least as large as the best feasible objective already found (within tolerance TOL). In this way, because Weyl s method is relatively fast, this provides a simple method for pruning. In the computational results detailed in Section 5.3, we always use Weyl s method to discard nodes which are not fruitful to consider.

Bound Tightening: We now turn our attention to another way in which Weyl s method can be used to improve the performance of Algorithm 2 bound tightening. In short, bound tightening is the use of implicit constraints to strengthen bounds obtained for a given node. We detail this with the same node notation as above. Namely, consider a given node n = [ℓc, uc]. Fix some j {1, . . . , p} and let α (ℓc j, uc j). If we have that

i>r max{λi(Σ diag( u)), 0},

where u is uc with the jth entry replaced by α, then we can replace the node n with the tightened node n = [ ℓ, uc], where ℓis ℓc with the jth entry replaced by α. We consider why this is valid. Suppose that one were to solve (LSℓc,uc) and choose to branch on index j at α. Then one would create two new nodes: [ℓc, u] and [ ℓ, uc]. We would necessarily then prune away the node [ℓc, u] as just described; hence, we can replace [ℓc, uc] without loss of generality with [ ℓ, uc]. Note that here α (ℓc j, uc j) and j {1, . . . , p} were arbitrary. Hence, for each j, one can choose the largest such αj (ℓc j, uc j) (if one exists) so that zf TOL < P i>k max{λi(Σ diag( u)), 0}, and then replace ℓc by ℓ.15

Such a procedure is somewhat expensive (because of its use of repeated eigenvalue calculations), but can be thought of as optimal pruning via Weyl s method. In our experience the beneﬁt of bound tightening does not warrant its computational cost when used at every node in the branch-and-bound tree except in a small number of problems. For this reason, in the computational results in Section 5.3 we only employ bound tightening at the root node n = [0, u0].

4.6 Node Selection

In this section, we brieﬂy describe our method of node selection. The problem of node selection has been considered extensively in discrete optimization and is still an active area of research. Here we describe a simple node selection strategy.

15. An obvious choice to ﬁnd such an α is a grid-search-based bisection method. For simplicity we use a linear search on a grid instead of resorting to the bisection method.

Bertsimas, Copenhaver, and Mazumder

To be precise, consider some point in Algorithm 2 were we have a certain collection of nodes, ([ℓc, uc], zc) Nodes. The most obvious node selection strategy is to pick the node n for which zc is smallest among all nodes in Nodes. In this way, the algorithm is likely to improve the gap zf zlb at every iteration. Such greedy selection strategies tend to not perform particularly well in general global optimization problems (see, e.g., Tawarmalani and Sahinidis, 2002a). For these reasons, we employ a slightly modiﬁed greedy selection strategy which utilizes Weyl s method. For a given node n, we also consider its corresponding lower bound wc

obtained from Weyl s method, namely, wc := P i>r max{λi(Σ diag(uc)), 0}. For each node, we now consider max{zc, wc}. There are two cases to consider:

1. With probability β, we select the node with smallest value of max{zc, wc}.

2. In the remaining scenarios (occurring with probability 1 β), we choose randomly between selecting the node with smallest value of zc and the node with smallest value of wc. To be precise, let Z be the minimum of zc over all nodes and likewise let W be the minimum of wc over all nodes. Then with (independent) probability β, we choose the node with worst zc or wc (i.e., with min{zc, wc} = min{Z, W}); with probability 1 β, if Z < W we choose a node with wc = W, and if Z > W we choose a node with zc = Z.

In this way, we allow for the algorithm to switch between trying to make progress towards improving the convex envelope bounds and making progress towards improving the best of the two bounds (the convex envelope bounds along with the Weyl bounds). We set β = 0.9 for all computational experiments. It is possible that a more dynamic branching strategy could perform substantially better; however, the method here has a desirable level of simplicity. We close by noting that while this node selection strategy appears na ıve, it is not necessarily so simple. Improved node selection strategies from discrete optimization often take into account some sort of duality theory. Weyl s inequality is at its core a result from duality theory (principally Wielandt s minimax principle; see Bhatia, 2007), and therefore our strategy is not as unsophisticated as it might appear on ﬁrst inspection.

5. Computational experiments

In this section, we perform various computational experiments to study the properties of our diﬀerent algorithmic proposals for (CFAq), for q {1, 2}. Using a variety of statistical measures, we compare our methods with existing popular approaches for FA, as implemented in standard R statistical packages psych (Revelle, 2015), n Factors (Raiche and Magis, 2011), and GPArotation (Bernaards and Jennrich, 2005). We then turn our attention to certiﬁcates of optimality as described in Section 4 for (CFA1).

5.1 Synthetic examples

For our synthetic experiments, we considered distinctly diﬀerent groups of examples. Classes A1 and A2 have subspaces of the low-rank common factors which are random and the values

Certifiably Optimal Low Rank Factor Analysis

of Φi are taken to be equally spaced. The underlying matrix corresponding to the common factors in type A1 is exactly low-rank, while this is not the case in type A2. Class A1(R/p). We generated a matrix Lp R := ((Lij)) (with R < p) with Lij iid N(0, 1). The unique variances Φ1, . . . , Φp, are taken to be proportional to p equi-spaced values on

the interval [λR(L L), λ1(L L)] such that Φi = φ λ1(L L) + (λR(L L) λ1(L L))i 1

1 i p. Here φ, which controls the ratio of the variances between the uniquenesses and the common latent factors is chosen such that Pp i=1 Φi = Tr(LL ), i.e., the contribution to the total variance from the common factors matches that from the uniqueness factors. The covariance matrix is thus given by: Σ = LL + Φ.

Class A2(p). Here Lp p is generated as Lij iid N(0, 1). We did a full singular value decomposition on L let UL denote the set of p (left) singular vectors. We created a positive deﬁnite matrix with exponentially decaying eigenvalues as follows: e Le L = UL diag(λ1, . . . , λp)U L, where the eigenvalues were chosen as λi = 0.8i/2, i = 1, . . . , p. We chose the diagonal entries of Φ (like data Type-A1), as a scalar multiple ( φ) of a uniformly spaced grid in [λp, λ1] and φ was chosen such that P i Φi = Tr(e Le L ).

In contrast, classes B1, B2, and B3 are qualitatively diﬀerent from the aforementioned ones the subspaces corresponding to the common factors are more structured, and hence diﬀerent from the coherence-like assumptions on the eigenvectors which are necessary for nuclear norm based methods (Saunderson et al., 2012) to work well. Class B1(R/p). We set Θ = LL , where Lp R is given by

Lij = 1, i j 0, i > j.

Class B2(r/R/p). Here we set Θ = LL , where Lp R is such that

1, i, j = 1, . . . , r iid N(0, 1), i > r, j = 1, . . . , R 0, i = 1, . . . , r, j > r.

Class B3(r/R/p). Here we deﬁne Θ = LL , where Lp R is such that

1, j = 1, . . . , r, i j iid N(0, 1), j > r, i = 1, . . . , R 0, i > j, j = 1, . . . , r.

In all the B classes, we generated Φi iid abs(N(0, 1)) and the covariance matrix Σ was taken to be Σ = Θ + αΦ, where α is so that Tr(Θ) = α Tr(Φ).

Comparisons with other FA methods

We performed a suite of experiments using Algorithm 1 for the cases q {1, 2}. We compared our proposed algorithm with the following popular FA estimation procedures as described in Section 1.1:

Bertsimas, Copenhaver, and Mazumder

1. MINRES: minimum residual factor analysis 2. WLS: weighted least squares method with weights being the uniquenesses 3. PA: this is the principal axis factor analysis method 4. MTFA: constrained minimum trace factor analysis formulation (6) 5. PC: The method of principal component factor analysis 6. MLE: this is the maximum likelihood estimator (MLE) 7. GLS: the generalized least squares method

For MINRES, WLS, GLS, and PA, we used the implementations available in the R package psych (Revelle, 2015) available from CRAN. For MLE we investigated the methods factanal from R package stats and the fa function from R package psych. The estimates obtained by the diﬀerent MLE implementations were similar in our experiments; therefore, we report the results obtained from factanal. For MTFA, we used our own implementation by adapting the ADMM algorithm (Section 3.2.2) to solve Problem (6). For the experiments in Section 5.1, we took the convergence thresholds for Algorithm 1 as TOL = 10 5 and ADMM as TOL α = 10 9. For the PC method we followed the description in Bai and Ng (2008) (as described in Section 1.1) the Φ estimates were thresholded at zero if they became negative. Note that all the methods considered in the experiments, apart from MTFA, allow the user to specify the desired number of factors in the problem. Since standard implementations of MINRES, WLS and PA require Σ to be a correlation matrix, we standardized all covariance matrices to correlation matrices at the outset.

5.1.1 Performance measures

We consider the following measures of goodness of ﬁt (see Bartholomew et al. 2011 and references therein) to assess the performances of the diﬀerent FA estimation procedures.

Estimation error in Φ: We use the following measure to assess the quality of an estimator for Φ:

Error(Φ) :=

i=1 (bΦi Φi)2. (43)

The estimation of Φ plays an important role in FA given a good estimate bΦ, the r-common factors can be obtained by a rank-r eigendecomposition on the residual covariance matrix Σ bΦ.

Proportion of variance explained and semi-deﬁniteness of (Σ Φ): A fundamental objective in FA lies in understanding how well the r-common factors explain the residual covariance, i.e., (Σ bΦ) a direct analogue of what is done in PCA, as explained in Section 1. For a given r, the proportion of variance explained by the r common factors is given by

Explained Variance =

i=1 λi( bΘ)/

i=1 λi(Σ bΦ). (44)

As r increases, the explained variance increases to one. This trade-oﬀbetween r and Explained Variance plays an important role in exploratory FA and in particular the choice

Certifiably Optimal Low Rank Factor Analysis

of r. For the expression (44) to be meaningful, it is desirable to have Σ bΦ 0. Note that our framework (CFAq), and in particular MTFA, estimates Φ under a PSD constraint on Σ Φ. However, as seen in our experiments (bΦ, bΘ) estimated by the remaining methods MINRES, PA, WLS, GLS, MLE and others often violate the PSD condition on Σ bΦ for some choices of r, thereby rendering the interpretation of Explained Variance troublesome. For the MTFA method with estimator bΘ, the measure (44) applies only for the value of r = rank( bΘ) and the explained variance is one. For the methods we have included in our comparisons, we report the values of Explained Variance as delivered by the R-package implementations.

Proximity between bΘ and Θ: A relevant measure of the proximity between Θ and its estimate ( bΘ) is given by

Error(Θ) := bΘ Θr 2 2/ Θr 2 2, (45)

where Θr is the best rank-r approximation to Θ and can be viewed as the natural oracle counterpart of bΘ. Note that MTFA does not incorporate any constraint on rank( bΘ) in its formulation. Since the estimates obtained by this procedure satisfy bΘ = Σ bΦ, rank( bΘ) may be quite diﬀerent from r.

Discussion of experimental results. We next discuss our ﬁndings from the numerical experiments for the synthetic datasets. Table 1 shows the performances of the various methods for diﬀerent choices of p and R for class A1. For the problems (CFAq), q {1, 2}, we present the results of Algorithm 1. (Results obtained by the approach in Appendix B were similar.) In all the examples, with the exception of MTFA, we set the number of factors to be r = (R 1), one less than the rank of the covariance matrix for the common underlying factors; in other words, the remaining rank one component can be considered as noise. For MTFA, the rank of bΘ was computed as the number of eigenvalues of bΘ larger than 10 5. MTFA and (CFAq), q {1, 2}, estimate Φ with zero error signiﬁcantly better than competing methods. While MTFA and (CFAq), q {1, 2}, result in estimates such that Σ bΦ is PSD, other methods, however, fail to do so; indeed, the discrepancy can often be quite large. MTFA performs poorly in terms of estimating Θ since the estimated Θ has rank diﬀerent than r. In terms of the proportion of variance explained (CFAq) performs signiﬁcantly better than all other methods. The notion of Explained Variance by MTFA for r = (R 1) is not applicable since the rank of the estimated Θ is larger than r. Figure 1 displays results for type A2. Here we present the results for (CFAq), q {1, 2}, using Algorithm 1. For all the methods (with the exception of MTFA) we computed estimates of Θ and Φ for a range of values of r. MTFA and (CFA1) do a perfect job in estimating Φ and both deliver PSD matrices Σ bΦ. MTFA computes solutions ( bΘ) with a higher numerical rank and with large errors in estimating Θ (for smaller values of r). Among the four performance measures corresponding to MTFA, Error(Θ) is the only one that varies with diﬀerent r values. Each of the other three measures deliver a single value corresponding to r = rank( bΘ). Overall, it appears that (CFAq) is signiﬁcantly better than all other methods. Figure 2 shows the results for classes B1, B2, and B3. We present the results for (CFAq) for q {1, 2} using Algorithm 1 as before. Figure 2 shows the performance of the diﬀerent

Bertsimas, Copenhaver, and Mazumder

Performance Method Used Problem Size measure (CFA1) (CFA2) MTFA MINRES WLS PA PC MLE (R/p) Error(Φ) 0.0 0.0 0.0 699.5 700.1 700.0 639.9 699.6

3/200 Explained Var. 0.6889 0.6889 - 0.2898 0.2898 0.2898 0.2956 0.2898 λmin(Σ bΦ) 0.0 0.0 0.0 -0.6086 -0.6194 -0.6195 -0.6140 -0.6034 Error(Θ) 0.0 0.0 689.68 0.1023 0.1079 0.1146 0.7071 0.1111 Error(Φ) 0.0 0.0 0.0 197.18 197.01 196.98 139.60 197.18

5/200 Explained Var. 0.8473 0.8473 - 0.3813 0.3813 0.3814 0.3925 0.3813 λmin(Σ bΦ) 0.0 0.0 0.0 -0.4920 -0.5071 -0.5081 -0.4983 -0.4919 Error(Θ) 0.0 0.0 190.26 0.0726 0.0802 0.0817 1.1638 0.0725 Error(Φ) 0.0 0.0 0.0 39.85 39.18 39.28 2.47 40.04

10/200 Explained Var. 0.9371 0.9371 - 0.4449 0.4452 0.4452 0.4686 0.4449 λmin(Σ bΦ) 0.0 0.0 0.0 -0.2040 -0.2289 -0.2329 -0.2060 -0.2037 Error(Θ) 0.0 0.0 35.94 0.0531 0.0399 0.0419 2.47 0.0560 Error(Φ) 0.0 0.0 0.0 3838 3859 3858 3715 3870

2/500 Explained Var. 0.7103 0.7103 - 0.3036 0.3033 0.3033 0.3056 0.3031 λmin(Σ bΦ) 0.0 0.0 0.0 -0.7014 -0.7126 -0.7125 -0.7121 -0.7168 Error(Θ) 0.0 0.0 3835.6 0.0726 0.0770 0.0766 0.5897 0.1134 Error(Φ) 0.0 0.0 0.0 1482.9 1479.1 1478.9 1312.9 1485.1

5/500 Explained Var. 0.8311 0.8311 - 0.3753 0.3754 0.3754 0.3798 0.3752 λmin(Σ bΦ) 0.0 0.0 0.0 -0.5332 -0.5433 -0.5445 -0.5423 -0.5379 Error(Θ) 0.0 0.0 1459.5 0.1176 0.0752 0.0746 1.1358 0.1301 Error(Φ) 0.0 0.0 0.0 301.94 301.08 301.04 160.29 302.1

10/500 Explained Var. 0.9287 0.9287 - 0.4443 0.4444 0.4444 0.4538 0.4443 λmin(Σ bΦ) 0.0 0.0 0.0 -0.3290 -0.3280 -0.3281 -0.3210 -0.3302 Error(Θ) 0.0 0.0 291.44 0.0508 0.0415 0.0412 2.3945 0.0516 Error(Φ) 0.0 0.0 0.0 19123 19088 19090 18770 19120

2/1000 Explained Var. 0.6767 0.6767 - 0.2885 0.2886 0.2886 0.2898 0.2885 λmin(Σ bΦ) 0.0 0.0 0.0 -0.6925 -0.7335 -0.7418 -0.7422 -0.6780 Error(Θ) 0.0 0.0 19037 50.76 0.3571 0.0984 0.8092 18.5930 Error(Φ) 0.0 0.0 0.0 6872 6862 6861 6497 6876

5/1000 Explained Var. 0.8183 0.8183 - 0.3716 0.3717 0.3717 0.3739 0.3716 λmin(Σ bΦ) 0.0 0.0 0.0 -0.6202 -0.6785 -0.6797 -0.6788 -0.6209 Error(Θ) 0.0 0.0 6818.1 0.1706 0.0776 0.0780 1.1377 0.1696 Error(Φ) 0.0 0.0 0.0 1682 1681 1681 1311 1682

10/1000 Explained Var. 0.9147 0.9147 - 0.4360 0.4360 0.4360 0.4408 0.4360 λmin(Σ bΦ) 0.0 0.0 0.0 -0.2643 -0.2675 -0.2676 -0.2631 -0.2643 Error(Θ) 0.0 0.0 1654.3 0.0665 0.0568 0.0570 2.4202 0.0665

Table 1: Comparative performances of the various FA methods for data of type A1, for diﬀerent choices of R and p. In all the above methods (apart from MTFA) r was taken to be (R 1). In all the cases rank( bΘ) obtained by MTFA is seen to be R. The - symbol implies that the notion of explained variance is not meaningful for MTFA for r = R 1. No method in Category (B) satisﬁes Σ Φ 0. Methods proposed in this paper seem to signiﬁcantly outperform their competitors, across the diﬀerent performance measures.

Certifiably Optimal Low Rank Factor Analysis

methods in terms of four diﬀerent metrics: error in Φ estimation, proportion of variance explained, violation of the PSD constraint on Σ Φ, and error in Θ estimation. For the case of B1, we see that the proportion of explained variance for (CFAq) reaches one at a rank smaller than that of MTFA this shows that the nonconvex criterion (CFAq) provides smaller estimates of the rank than its convex relaxation MTFA when one seeks a model that explains the full proportion of residual variance. This result is qualitatively diﬀerent from the behavior seen for A1 and A2 where the beneﬁt of (CFAq) over MTFA was mainly due to its ﬂexibility to control the rank of Θ. Algorithms in Category (A) do an excellent job in estimating Φ. All other competing methods perform poorly in estimating Φ for small/moderate values of r. We observe that none of the methods apart from (CFAq) and MTFA lead to PSD estimates of Σ bΦ (unless r becomes suﬃciently large which corresponds to a model with a saturated ﬁt). In terms of the proportion of variance explained, our proposal performs much better than the competing methods. We see that the error in Θ estimation incurred by (CFAq), increases marginally as soon as the rank r becomes larger than a certain value for B1. Note that around the same values of r, the proportion of explained variance reaches one in both these cases, thereby suggesting that this is possibly not a region of statistical interest. In summary, Figure 2 suggests that (CFAq) performs very well compared to all its competitors.

Summary: All methods of Category (B) (see Section 1.2) used in the experimental comparisons perform worse than Category (A) in terms of measures Error(Φ), Error(Θ) and Explained Variance for small/moderate values of r. They also lead to indeﬁnite estimates of Σ bΦ. MTFA performs well in estimating Φ but fails in estimating Θ mainly due to the lack in ﬂexibility of imposing a rank constraint; in some cases the trace heuristic falls short of doing a good job in approximating the rank function when compared to its nonconvex counterpart (CFAq). The estimation methods proposed herein have a signiﬁcant edge over existing methods in producing high quality solutions across various performance metrics.

5.2 Real data examples

This section describes the performance of diﬀerent FA methods on some real-world benchmark datasets popularly used in the context of FA. These datasets can be found in the R libraries datasets (R Core Team, 2014), psych (Revelle, 2015), and Facto Mine R (Husson et al., 2015) and are as follows:

The bﬁdata set has 2800 observations on 28 variables (25 personality self-reported items and 3 demographic variables).

The neo data set has 1000 measurements for p = 30 dimensions.

The Harman data set is a correlation matrix of 24 psychological tests given to 145 seventhand eighth-grade children.

The geomorphology data set is a collection of geomorphological data collected across p = 10 variables and 75 observations. (The geomorphology data set originally has p = 11, but we remove the one categorical feature, leaving p = 10.)

Bertsimas, Copenhaver, and Mazumder

0 500 1000 1500 2000

G G G G G G G G G G G G

MINRES PC MLE (CFA1) MTFA

0.0 0.2 0.4 0.6 0.8 1.0

Variance Explained

G G G G G G G G G G

0 5 10 15 20 25 30 35

0.8 0.6 0.4 0.2 0.0

Minimum Eigen Value

G G G G G G G G G G G G

0 5 10 15 20 25 30 35

0.0 0.5 1.0 1.5

G G G G G G G G G G G G

Number of factors Number of factors

Figure 1: Performance of various FA methods for examples from class A2(p = 200) as a function of the number of factors. The y-axis label Minimum Eigenvalue refers to λmin(Σ bΦ). We present the results of (CFA1), as obtained via Algorithm 1 the results of (CFA2) were similar, and hence omitted from the plot. Our methods seems to perform better than all the other competitors. For large values of r, as the ﬁts saturate, the methods become similar. The methods (as available from the R package implementations) that experienced convergence diﬃculties do not appear in the plot.

The JO data set records athletic performance in Olympic events, and involves 24 observations with p = 58. This example is distinctive because the corresponding correlation matrix Σ is not full rank (having more variables than observations).

We present the results in Figure 3. We also experimented with other methods such as WLS and GLS but the results were similar to MINRES and hence have not been shown in the ﬁgure. For the real examples, most of the performance measures described in Section 5.1.1 do not apply16; however, the notion of explained variance (44) does apply. We used this metric to compare the performance of diﬀerent competing estimation procedures. We observe that solutions delivered by Category (B) explain the maximum amount of residual variance for a given rank r, which is indeed desirable, especially in the light of its analogy with PCA on the residual covariance matrix Σ Φ.

16. Understanding performance of FA methods on real data is diﬃcult because it is an unsupervised problem. However, we can understand the performances of diﬀerent methods by drawing parallels with PCA in terms of the proportion of variance explained of the matrix Σ Φ see our discussion in Section 1.

Certifiably Optimal Low Rank Factor Analysis

Type B1 Type B2 Type B3

0 10 20 30 40

G G G G G G G G G G G G G G G G

MINRES PC MLE (CFA1) MTFA

0 20 40 60 80 100

MINRES PC MLE (CFA1) MTFA

0 20 40 60 80 100

G G G G G G G

MINRES PC MLE (CFA1) MTFA

Variance Explained

0.2 0.4 0.6 0.8 1.0

G G G G G G G G G G G G G

0.4 0.5 0.6 0.7 0.8 0.9 1.0

0.2 0.4 0.6 0.8 1.0

0.4 0.3 0.2 0.1 0.0

G G G G G G G G G G G G G G G G

0.8 0.6 0.4 0.2 0.0

1.0 0.8 0.6 0.4 0.2 0.0

G G G G G G G

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

G G G G G G G G G G G G

2 3 4 5 6 7

0.0 0.2 0.4 0.6 0.8 1.0 1.2

0.0 0.5 1.0 1.5 2.0

G G G G G G G

Number of factors Number of factors Number of factors

Figure 2: Performance of diﬀerent methods for instances of B1(30/100), B2(5/10/100), and B3(5/10/100). We see that (CFA1) exhibits very good performance across all instances, signiﬁcantly outperforming the competing methods (the results of (CFA2) were similar).

Bertsimas, Copenhaver, and Mazumder

bﬁ Neo Harman

Variance Explained

0 5 10 15 20

0.0 0.2 0.4 0.6 0.8 1.0

G G G G G G G G G G G G G G G G G G

MINRES PC MLE (CFA1) MTFA G

0 5 10 15 20 25

0.0 0.2 0.4 0.6 0.8 1.0

G G G G G G G G G G G G G G G G G G G G G

MINRES PC MLE (CFA1) MTFA G

0 5 10 15 20

0.0 0.2 0.4 0.6 0.8 1.0

G G G G G G G G G G G G G G G

MINRES PC MLE (CFA1) MTFA

Number of factors Number of factors Number of factors

Figure 3: Figure showing proportion of variance explained by diﬀerent methods for real-data examples. We see that in terms of the proportion of explained variance, (CFA1) delivers the largest values for diﬀerent values of r, which is indeed desirable. (CFA1) also shows nice ﬂexibility in delivering diﬀerent models with varying r, in contrast to MTFA which delivers one model with proportion of variance explained equal to one. The results of (CFA2) were found to be similar to (CFA1).

5.3 Certiﬁcates of Optimality via Algorithm 2

We now turn our attention to certiﬁcates of optimality using Algorithm 2. Computational results of Algorithm 2 for a variety of problem sizes across all six classes can be found in Tables 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. In general, we provide results for p ranging between 10 and 4000. Parametric choices are outlined in depth in Table 2.17 We always initialize Algorithm 2 with an initial feasible solution as found via Algorithm 1 so that we can understand if the estimators found via the CG-based approach are indeed optimal. In particular, if the best feasible objective value does not change throughout the branchand-bound algorithm, then the initial estimator was indeed optimal (within the numerical tolerance).

Root Node Gap: Let us ﬁrst consider the gap at the root node. In classes A1, B1, B2, and B3, we see that the Weyl bound at the root node often provides a better bound than the one given by using convex envelopes. Indeed, the bound provided by Weyl s method can in many instances certify optimality (up to numerical tolerance) at the root node. For example, this is the case in many instances of classes A1 and half of the instances in B3. Given that Weyl s method is computationally inexpensive (only requiring the computation of p convex quadratic optimization problems and an eigenvalue decomposition), this suggests that Weyl s inequality as used within the context of factor analysis is particularly fruitful. In contrast, in class A2 and the real examples, we see that the convex envelope bound tends to perform better. Because of the structure of Weyl s inequality, Weyl s method is well-suited for correlation matrices Σ with very quickly decaying eigenvalues. Examples in

17. All computational experiments are performed in a shared cluster computing environment with highly variable demand, and therefore runtimes are not necessarily a reliable measure of problem complexity; hence, the number of nodes considered is always displayed. Further, Algorithm 2 is highly parallelizable, like many branch-and-bound algorithms; however, our implementation is serial. Therefore, with improvements in code design, it is very likely that runtimes can be substantially improved beyond those shown here.

Certifiably Optimal Low Rank Factor Analysis

Root node Terminal node Problem size Instance Upper CE Weyl Upper Lower Nodes Time (s) (R/p) bound LB LB bound bound

1 1.54 1.44 1.43 1.54 1.44 1 0.14 2 1.50 1.40 1.40 1.50 1.40 1 0.38 3 1.44 1.25 1.32 1.44 1.35 5 0.20

1 0.88 0.49 0.70 0.88 0.78 78 4.08 2 0.49 0.25 0.39 0.49 0.39 1 1.27 3 0.52 0.17 0.42 0.52 0.42 14 0.34

1 0.43 0.90 0.10 0.43 0.33 28163 19432.70 2 0.06 0.75 0.00 0.06 0.00 1 0.20 3 0.17 0.98 0.00 0.17 0.07 3213 380.66

1 3.99 3.91 3.94 3.99 3.94 1 0.19 2 4.64 4.58 4.60 4.64 4.60 1 2.07 3 3.34 3.26 3.28 3.34 3.28 1 0.66

1 2.33 2.06 2.24 2.33 2.33 1 0.21 2 2.55 2.38 2.49 2.55 2.49 1 6.62 3 2.04 1.86 1.97 2.04 1.97 1 3.51

1 0.61 0.07 0.49 0.61 0.51 626 75.98 2 0.50 0.01 0.40 0.50 0.40 41 2.88 3 0.92 0.18 0.81 0.92 0.82 267 24.81

Table 2: Computational results for Algorithm 2 for class C = A1(R/p). All computations are performed in julia using SDO solvers MOSEK (for primal feasible solutions) and SCS (for dual feasible solutions within tolerance 10 3). Computation time does not include preprocessing (such as computation of u as in (40) and ﬁnding an initial incumbent feasible solution Φf as computed via the conditional gradient algorithm in Section 3). We always use default tolerance TOL = 0.1 for algorithm termination. Parameters for branching, pruning, node selection, etc., are detailed throughout Section 4. For each problem size, we run three instances (the instance number is the random seed provided to julia). Upper bounds denote zf, which is the best feasible solution found thus far (either at the root node or at algorithm termination). At the root node, we display two lower bounds: the lower bound arising from convex envelopes (denoted CE LB ) and the one arising from the Weyl bound (denoted Weyl LB ). Note that for lower bound at the termination node, we mean the worst bound max{zc, wc} (see Section 4.6; zc is from the convex envelope approach, while wc comes from Weyl s method). Nodes indicates the number of nodes considered in the course of execution, while Time (s) denotes the runtime (in seconds). We set r , the rank used within Algorithm 2, to r = R 1, where R is the generative rank displayed. All results displayed to two decimals. Computations run on high-demand, shared cluster computing environment with variable architectures. Runtime is capped at 400000s (approximately 5 days), and any instance which is still running at that time is marked with an asterisk next to its runtime.

Bertsimas, Copenhaver, and Mazumder

Root node Terminal node Problem size Instance Upper CE Weyl Upper Lower Nodes Time (s) (R/p) bound LB LB bound bound

1 7.16 7.09 7.14 7.16 7.14 1 18.66 2 6.74 6.66 6.71 6.74 6.71 1 13.19 3 6.78 6.74 6.77 6.78 6.77 1 33.28

1 3.04 2.86 3.01 3.04 3.01 1 19.13 2 3.32 3.12 3.29 3.32 3.29 1 7.33 3 3.70 3.56 3.67 3.70 3.67 1 52.53

1 0.88 0.15 0.81 0.88 0.81 1 6.22 2 1.20 0.08 1.12 1.20 1.12 1 17.41 3 1.14 0.02 1.07 1.14 1.07 1 8.88

1 13.72 13.69 13.72 13.72 13.72 1 125.17 2 14.07 14.03 14.06 14.07 14.06 1 63.25 3 14.67 14.64 14.66 14.67 14.66 1 28.36

1 8.54 8.46 8.53 8.54 8.53 1 117.71 2 7.16 7.04 7.14 7.16 7.14 1 70.09 3 7.07 6.94 7.06 7.07 7.06 1 78.60

1 2.62 2.10 2.58 2.62 2.58 1 36.91 2 2.81 2.38 2.78 2.81 2.78 1 9.25 3 3.16 2.64 3.12 3.16 3.12 1 49.04

Table 3: Computational results for larger examples from class C = A1(R/p). Same parameters as in Table 2 but with larger problem instances. Again we set r = R 1.

these two classes do not have such a spectrum, and indeed Weyl s method does not provide the best root node bound.18 Because neither Weyl s method nor the convex envelope bound strictly dominate one another at the root node across all examples, our approach incorporating both can leverage the advantages of each. Observe that the root node gap (either in terms of the absolute diﬀerence between the initial feasible solution found and the better of the convex envelope bound and the Weyl bound) tends to be smaller when r is much smaller than p. This suggests that the approach we take is well-suited to certify optimality of particularly low-rank decompositions in noisy factor analysis settings. We see that this phenomenon is true across all classes. The numerical results suggest that corresponding theoretical guarantees for Weyl s method are potentially of interest in their own right and warrant further consideration. Finally, we remark that if the true convex envelope of the objective over the set of semideﬁnite constraints was taken, then the convex envelope objective would always be non-negative. However, because we have taken the convex envelope of the objective over the polyhedral constraints only, this is not the case.

Performance at Termination: It is particularly encouraging that the initial feasible solutions provided via Algorithm 1 remain the best feasible solution throughout the execution of Algorithm 2 for all but three problem instances. (Of course, this need not be

18. However, it is worth remarking that Weyl s method still provides lower bounds on the rank of solutions to the noiseless factor analysis problem. Hence, even in settings where Weyl s method is not necessarily well-suited for proving optimality for the noisy factor analysis problem, it can still be applied successfully to lower bound rank for noiseless factor analysis.

Certifiably Optimal Low Rank Factor Analysis

Root node Terminal node Problem size Instance r Upper CE Weyl Upper Lower Nodes Time (s) (p) used bound LB LB bound bound

0.98 0.29 0.26 0.96 0.86 1955 91.99 2 0.85 0.28 0.29 0.85 0.75 736 37.70 3 0.89 0.29 0.23 0.89 0.79 1749 53.30

0.53 0.43 0.13 0.53 0.43 3822 409.68 2 0.58 0.45 0.15 0.58 0.48 8813 1623.38 3 0.64 0.44 0.10 0.59 0.49 13061 1671.61

5.13 4.14 2.13 5.13 5.03 29724 36857.46 2 4.68 3.44 2.00 4.68 4.58 28707 35441.37 3 4.83 3.67 2.00 4.83 4.73 17992 11059.55

4.26 2.68 1.55 4.26 4.05 86687 400002.3* 2 3.88 2.00 1.46 3.88 3.71 132721 400003.7* 3 4.03 2.25 1.48 4.03 3.84 100042 400002.1*

17.83 16.54 11.49 17.83 17.44 35228 400002.5* 2 18.57 17.34 12.19 18.57 18.20 34006 400007.1* 3 18.21 17.06 11.96 18.21 17.86 35874 400004.9*

38.65 37.50 33.08 38.65 38.18 20282 400015.2* 2 38.84 37.66 33.25 38.84 38.37 17306 400013.7* 3 39.02 37.86 33.67 39.02 38.58 19889 400004.9*

30.98 29.05 25.69 30.98 29.91 16653 400005.6* 2 31.37 29.39 26.07 31.37 30.32 26442 400021.4* 3 31.20 29.24 26.13 31.20 30.15 19445 400043.4*

Table 4: Computational results for class C = A2(p). All parameters as per Table 2. Here we show the behavior across a variety of choices of the parameter r . Recall that the class A2 is generated by a common variance component with high rank and exponentially decaying eigenvalues.

universally true.) This is an important observation to make because without a provable optimality scheme such as the one we consider here, it is diﬃcult to quantify the performance of heuristic upper bound methods. As we demonstrate here, despite the only local guarantees of solutions obtained via a conditional gradient scheme, they tend to perform quite well in the setting of factor analysis. Indeed, even in the three examples where the best feasible solution is improved, the improved solution is found very early in the branching process.

Across the diﬀerent example classes, we see that in general the gap tends to decrease more when r is small relative to p and p is smaller. To appropriately contextualize and appreciate the number of nodes solved for a problem with p = 100 on the timescale of 100s, with state-of-the-art implementations of interior point methods, solving a single node in the branch and bound tree can take on the order of 40s (for the speciﬁcally structured problems of interest and on the same machine). In other words, if one were to na ıvely use interior point methods, it would only be possible to solve approximately three nodes during a 100s time limit. In contrast, by using a ﬁrst-order method approach which facilitates warm starts, we are able to solve hundreds of nodes in the same amount of time.

We see that Algorithm 2 performs particularly well for classes A1 (Tables 2 and 3) and B1 (Table 5), for problems of reasonable size with relatively small underlying ranks. This

Bertsimas, Copenhaver, and Mazumder

Root node Terminal node Problem size Instance r Upper CE Weyl Upper Lower Nodes Time (s) (r/p) chosen bound LB LB bound bound

1 1 0.25 0.06 0.11 0.25 0.20 24 0.66 2 1 0.17 0.04 0.04 0.17 0.12 17 0.50 3 1 0.16 0.08 0.00 0.16 0.11 26 0.16

1 2 0.09 0.27 0.00 0.09 0.04 77 0.62 2 2 0.07 0.36 0.00 0.07 0.02 118 0.75 3 2 0.07 0.36 0.00 0.07 0.02 125 3.79

1 3 0.10 0.46 0.00 0.10 0.05 2162 166.44 2 3 0.10 0.62 0.00 0.10 0.05 3282 406.82 3 4 0.06 0.49 0.00 0.04 0.00 122 11.69

1 1 0.26 0.10 0.13 0.26 0.21 21 0.17 2 1 0.19 0.01 0.07 0.19 0.14 15 0.40 3 1 0.19 0.04 0.04 0.19 0.14 26 0.59

1 2 0.10 0.24 0.01 0.10 0.05 73 1.47 2 2 0.08 0.32 0.00 0.08 0.03 103 2.82 3 2 0.08 0.30 0.00 0.08 0.03 96 1.64

1 3 0.13 0.44 0.00 0.13 0.08 2602 113.82 2 3 0.12 0.59 0.00 0.12 0.07 3971 188.93 3 3 0.15 0.45 0.01 0.15 0.10 3910 531.04

1 2 0.11 0.15 0.04 0.11 0.06 50 0.66 2 2 0.10 0.18 0.01 0.10 0.05 61 1.95 3 2 0.10 0.20 0.01 0.10 0.05 64 0.58

1 3 0.17 0.35 0.03 0.17 0.12 2283 333.62 2 3 0.17 0.46 0.02 0.17 0.12 4104 544.71 3 3 0.19 0.36 0.03 0.19 0.14 4362 620.61

1 2 0.12 0.07 0.05 0.12 0.05 1 0.71 2 2 0.11 0.09 0.04 0.11 0.07 12 0.82 3 2 0.11 0.12 0.03 0.11 0.06 30 0.46

1 3 0.19 0.22 0.06 0.19 0.14 1535 237.19 2 3 0.19 0.28 0.05 0.19 0.14 2625 383.27 3 3 0.20 0.26 0.05 0.20 0.15 2829 355.08

Table 5: Computational results for class C = B1(R/p). We choose r during computation as the largest r such that the CG method of Section 3 is strictly positive (up to additive tolerance 0.05; we use a smaller value here because the objective values are smaller across this class). For this class, examples can be preprocessed because Σ B1(R/p) has a block of size R R in the upper left, with all other entries set to zero except the diagonal. Hence, it suﬃces to perform factor analysis with the truncated matrix Σ1:R,1:R.

is highly encouraging. Class A1 forms a set of prototypical examples for which theoretical recovery guarantees perform well; in stark contrast, problems such as those in B1 which have highly structured underlying factors tend to not satisfy the requirements of such recovery guarantees. Indeed, if Σ B1(R/p), then there generally appears to be a rank R/2 matrix Θ 0 so that Σ Θ is positive semideﬁnite and diagonal. In such a problem, for r on the order of R/2 1, we provide nearly optimal solutions within a reasonable time frame. Further, we note that similar results to those obtained for classes A1 and B1 are obtained for the other classes and are detailed in Tables 4 (class A2), 6 (class B2), 7 (class B3), and

Certifiably Optimal Low Rank Factor Analysis

Root node Terminal node Problem size Instance r Upper CE Weyl Upper Lower Nodes Time (s) (r/R/p) chosen bound LB LB bound bound

1 4 1.15 0.4 1.03 1.15 1.05 517 146.72 2 4 1.08 0.50 0.97 1.08 0.98 366 60.56 3 4 0.89 0.29 0.79 0.89 0.79 1 0.56

1 4 1.24 0.46 1.11 1.24 1.14 933 232.74 2 4 0.99 0.55 0.92 0.99 0.92 1 0.87 3 4 0.72 0.08 0.62 0.72 0.62 123 20.16

1 4 7.54 7.46 7.53 7.54 7.53 1 84.20 2 4 6.45 6.34 6.44 6.45 6.44 1 119.62 3 4 7.20 7.09 7.18 7.20 7.18 1 97.17

1 4 8.40 8.32 8.39 8.40 8.39 1 135.73 2 4 5.79 5.70 5.78 5.79 5.78 1 38.19 3 4 7.58 7.47 7.57 7.58 7.57 1 73.51

1 9 3.56 3.11 3.53 3.56 3.53 1 71.10 2 9 3.00 2.61 2.97 3.00 2.97 1 85.47 3 9 2.58 2.14 2.55 2.58 2.55 1 22.99

1 9 3.23 2.80 3.20 3.23 3.20 1 50.63 2 9 2.94 2.56 2.91 2.94 2.91 1 51.99 3 9 2.57 2.12 2.54 2.57 2.54 1 72.45

Table 6: Computational results for class B2(r/R/p). Parameters as set in Table 2. Here r is chosen during computation as largest k such that the CG method of Section 3 is strictly positive (up to 0.1). Here r = R 1 ends up being an appropriate choice.

8 and 9. In a class such as A2, which is generated as a high rank matrix (with decaying spectrum) with added individual variances, theoretical recovery guarantees do not generally apply, so again it is encouraging to see that our approach still makes signiﬁcant progress towards proving optimality. Further, as shown in Table 11, for a variety of problems with p on the order of 1000 or 4000, solutions can be found in seconds and optimality can be certiﬁed within minutes via Weyl bounds, with no need for convex envelope bounds as computed via Algorithm 2, so long as the rank r is suﬃciently small (for classes A1, B2, and B3 on the order of hundreds, and for A2 on the order of tens). This strongly supports the value of such an eigenvalue-based approach. When computing lower bounds solely via Weyl s method, the only necessary computations are solving quadratic programs (to ﬁnd u as in Section 4.3) and an eigenvalue decomposition. As Table 11 suggests, for suﬃciently small rank, one can still quickly ﬁnd certiﬁably optimal solutions even for very large-scale factor analysis problems.

A particularly interesting example that stands apart from the rest is JO from the set of real examples. In particular, the correlation matrix for this example is not full rank (as there are fewer observations than variables). As displayed in Table 10, solutions found via conditional gradient methods are certiﬁed to be optimal at the root node. Indeed, the initial u0 as in Section 4.3 can be quickly veriﬁed to be u0 = 0 for this example, i.e., factor analysis is only possible with no individual variances. Hence, Algorithm 2 terminates at the root node without needing any branching.

Bertsimas, Copenhaver, and Mazumder

Root node Terminal node Problem size Instance r Upper CE Weyl Upper Lower Nodes Time (s) (r/R/p) chosen bound LB LB bound bound

1 3 0.30 0.30 0.13 0.30 0.20 777 219.75 2 3 0.22 0.11 0.12 0.22 0.12 1 2.61 3 3 0.19 0.17 0.10 0.19 0.10 1 1.57

1 2 1.00 0.62 0.82 1.00 0.90 217 48.43 2 2 0.95 0.47 0.74 0.95 0.85 305 26.14 3 2 0.45 0.06 0.30 0.45 0.36 98 12.31

1 3 0.55 0.24 0.40 0.55 0.45 7099 47651.57 2 3 0.28 0.07 0.17 0.28 0.18 266 1431.88 3 3 0.19 0.07 0.11 0.19 0.11 1 574.61

1 2 1.03 0.70 0.81 1.03 0.93 4906 34514.69 2 2 0.90 0.66 0.78 0.90 0.80 95 596.96 3 2 0.55 0.39 0.47 0.55 0.47 1 1787.14

1 8 0.12 0.58 0.05 0.12 0.05 1 77.38 2 8 0.11 0.53 0.05 0.11 0.05 1 41.24 3 8 0.10 0.43 0.05 0.10 0.05 1 110.17

1 6 0.33 0.73 0.20 0.33 0.22 27770 400000.0* 2 7 0.25 0.43 0.17 0.25 0.17 1 128.63 3 7 0.31 0.28 0.24 0.31 0.24 1 213.72

Table 7: Computational results for class B3(r/R/p). Parameters as set in Table 2. As in Tables 5 and 6, here r is chosen during computation as largest r such that the CG method of Section 3 produces a feasible solution with strictly positive objective (up to 0.1).

Root node Terminal node r Upper CE Weyl Upper Lower Nodes Time (s) used bound LB LB bound bound 1 4.06 3.78 2.53 4.06 3.96 44 0.64 2 2.64 2.04 1.42 2.64 2.54 1885 142.06 3 1.56 0.62 0.61 1.56 1.46 11056 2458.23 4 0.88 0.33 0.28 0.88 0.78 66877 60612.61 5 0.36 0.88 0.0 0.36 0.25 155759 400012.4*

Table 8: Computational results for the geomorphology example (p = 10). Parameters as set in Table 2. Here we display results for the choices of r {1, 2, 3, 4, 5} (for r > 5, the upper bound is below the numerical tolerance, so we do not include those).

Root node Terminal node r Upper CE Weyl Upper Lower Nodes Time (s) used bound LB LB bound bound 1 9.88 9.64 5.89 9.88 9.78 158 30.13 2 7.98 7.54 4.22 7.98 7.88 31710 49837.17 3 6.53 5.85 3.01 6.53 6.35 81935 400003.8*

Table 9: Computational results for the Harman example (p = 24). Parameters as set in Table 2. Here we display results for the choices of r {1, 2, 3}.

Certifiably Optimal Low Rank Factor Analysis

r Optimal Time value (s) 1 51.85 5.59 2 46.30 6.86 3 41.29 6.08 4 36.54 6.08 5 32.36 1.21 6 28.72 7.83 7 25.39 1.13 8 22.10 3.87 9 19.20 4.97 10 16.63 4.17 11 14.30 7.87

r Optimal Time value (s) 12 12.52 3.37 13 10.81 2.97 14 9.25 1.07 15 7.78 6.21 16 6.44 5.40 17 5.15 0.91 18 3.98 8.50 19 2.84 5.33 20 1.87 2.15 21 1.07 2.13 22 0.48 1.65

Table 10: Computational results for the example JO (p = 58). Parameters as set in Table 2. Here we display results for the choices of r {1, 2, . . . , 22} (for r > 22, the upper bound is below the numerical tolerance, so we do not include those). For all choices of r , the optimal solution is found and certiﬁed to be optimal at the root node, and therefore we do not display any information other than the optimal values.

Finally, we note that all synthetic examples we have considered have equal proportions of common and individual variances (although, of course, this is not exploited by our approach as this information is not a priori possible to specify without additional contextual information). If one modiﬁes the classes so that the proportion of the common variances is higher than the proportion of individual variances (in the generative example), then Algorithm 2 is able to deliver better bounds on a smaller time scale. (Results are not included here.) This is not particularly surprising because the branch-and-bound approach we take essentially hinges on how well the products WiiΦi can be approximated. When there is underlying factor structure with a lower proportion of individual variances, the scale of WiiΦi is smaller and hence these products are easier to approximate well.

5.4 Additional Considerations

We now turn our attention to assessing the beneﬁts of various algorithmic modiﬁcations as presented in Section 4. We illustrate the impact of these by focusing on four representative examples from across the classes: A1(3/10), B1(6/50), B3(3/5/20), and G := geomorphology.

Performance of Branching Strategy: We begin by considering the impact of our branching strategy as developed in Section 4.4. The results across the four examples are shown in Table 12. Recall that ϵ [0, 1) controls the extent of deviation from the canonical branching location, with ϵ = 0 corresponding to no deviation. Across all examples, we see that the number of nodes considered to prove optimality is approximately convex in ϵ [0, 0.5]. In particular, for all examples, the optimal choice of ϵ is not the canonical choice of ϵ = 0. This contrast is stark for the examples B3(3/5/20) and G. Indeed, for these two examples, the number of nodes considered when ϵ = 0 is over ﬁve times larger than for any ϵ {0.1, 0.2, 0.3, 0.4, 0.5}.

Bertsimas, Copenhaver, and Mazumder

Example r Upper Lower bound bound

A1(100/1000)

10 460.35 457.46 30 322.35 313.62 50 198.70 197.13 70 103.72 102.78 90 28.65 28.34

10 184.22 183.18 20 61.24 60.34 30 20.25 19.50 40 6.63 6.02 50 2.17 1.71

B2(10/90/1000)

20 379.02 377.23 40 236.10 234.83 60 122.08 121.33 80 34.49 34.24

B2(20/90/1000)

20 378.37 376.64 40 235.16 233.94 60 121.69 120.96 80 34.33 34.09

B3(10/50/1000)

10 384.40 384.04 20 233.25 232.97 30 105.68 105.49 40 0.69 0.60

B3(20/150/1000)

20 426.31 422.13 45 285.56 282.33 70 174.00 171.70 95 86.54 85.15 120 20.30 19.80

Example r Upper Lower bound bound

A1(360/4000)

100 1334.40 1327.60 150 991.64 986.15 200 693.02 688.87 250 434.64 431.80 300 213.29 211.75 350 30.75 30.50

10 733.41 733.09 20 244.13 243.85 30 79.94 79.68 40 26.17 25.96 50 8.57 8.39 60 2.80 2.66 70 0.90 0.79

B2(80/360/4000)

100 1360.1 1354.04 150 1008.90 1004.04 200 703.78 700.21 250 440.56 438.16 300 215.87 214.62 350 31.02 30.91

B3(120/360/4000)

100 1081.20 1078.74 150 630.25 628.54 200 253.50 252.50 240 7.89 7.46

Table 11: Computational results across several classes for larger scale instances with p {1000, 4000}. Here for a given example, the random seed is set as 1. Results are displayed across a variety of choices of rank r . Class B1 is not shown because of the reduction noted in Table 5 (which reduces this class to a much smaller-dimensional class, and hence it is not truly large scale). The Upper bound denotes the upper bound found by the conditional gradient method, while Lower bound denotes the Weyl bound at the root node (no convex envelope bounds via Algorithm 2 are shown here because of the large nature of the SDO-based convex envelope lower bounds which are prohibitive to solve for problems of this size). These instances are completed in MATLAB because of its superior support (over julia) for solving large-scale quadratic optimization problems, which is the only computation necessary for computing Weyl bounds.

Certifiably Optimal Low Rank Factor Analysis

Example r Nodes considered for ϵ = 0.0 0.1 0.2 0.3 0.4 0.5 A1(3/10) 2 103 72 59 62 78 101 B1(6/50) 2 39 33 29 40 50 53 B3(3/5/20) 2 8245 762 278 189 217 203 geomorphology 1 937 162 59 46 44 52

Table 12: Computational results for eﬀect of branching strategy across diﬀerent examples. In particular, we consider how the number of nodes needed to prove optimality (up to numerical tolerance) changes across diﬀerent choices of ϵ, where ϵ is as in Section 4.4. Recall that ϵ = 0 corresponds to a na ıve choice of branching, while ϵ > 0 corresponds to a modiﬁed branching. All other parameters as in Table 2, including the branching index. For synthetic examples, the instances are initialized with random seed 1.

Example r Nodes considered for Na ıve strategy Modiﬁed strategy A1(3/10) 2 99 78 B1(6/50) 2 135 50 B3(3/5/20) 2 375 217 geomorphology 1 43 44

Table 13: Computational results for eﬀect of node selection strategy across diﬀerent examples. In particular, we consider how the number of nodes needed to prove optimality (up to numerical tolerance) changes across the na ıve and modiﬁed branching strategies as described in Section 4.4. All other parameters as in Table 2. For synthetic examples, the instances are initialized with random seed 1.

Example r Upper CE LB CE LB Bound with tightening without tightening A1(3/10) 2 0.88 0.70 0.39 B1(6/50) 2 0.11 0.15 0.17 B3(3/5/20) 2 1.00 0.62 0.51 geomorphology 1 4.06 3.78 3.78

Table 14: Computational results for eﬀect of root node bound tightening across diﬀerent examples. In particular, we consider how the convex envelope lower bound (denoted CE LB ) compares with and without bound tightening at the root node (see Section 4.5). All other parameters as in Table 2. Upper bound displayed is as in Table 2 and is included for scale (i.e., to compare the relative impact of tightening). For synthetic examples, the instances are initialized with random seed 1.

Example r Nodes considered for TOL = 0.10 0.05 0.025 A1(3/10) 2 78 287 726 B1(6/50) 2 1 50 262 B3(3/5/20) 2 217 1396 5132 geomorphology 1 44 217 978

Table 15: Computational results for numerical tolerance TOL across diﬀerent examples. In particular, we consider how the number of nodes needed to prove optimality changes as a function of the additive numerical tolerance TOL chosen for algorithm termination. All other parameters as in Table 2. For synthetic examples, the instances are initialized with random seed 1.

Bertsimas, Copenhaver, and Mazumder

In other words, the alternative branching strategy can have a substantial impact on the number of nodes considered in the branch-and-bound tree. As a direct consequence, this strategy can drastically reduce the computation time needed to prove optimality. As the examples suggest, it is likely that ϵ should be chosen dynamically during algorithm execution, as the particular choice depends on a given problem s structure. However, we set ϵ = 0.4 for all other computational experiments because this appears to oﬀer a distinct beneﬁt over the na ıve strategy of setting ϵ = 0.

Performance of Node Selection Strategy: We now turn our attention to the node selection strategy as detailed in Section 4.6. Recall that node selection considers how to pick which node to consider next in the current branch-and-bound tree at any iteration of Algorithm 2. We compare two strategies: the na ıve strategy which selects the node with worst convex envelope bound (as explicitly written in Algorithm 2) and the modiﬁed strategy which employs randomness and Weyl bounds to consider nodes which might not be advantageous to fathom when only convex envelope bounds are considered. The comparison is shown in Table 13. We see that this strategy is advantageous overall (with only the example G giving a negligible decrease in performance). The beneﬁt is particularly strong for examples from the B classes which have highly structured underlying factors. For such examples, there is a large diﬀerence between the convex envelope bounds and the Weyl bounds at the root node (see e.g. Table 5). Hence, an alternative branching strategy which incorporates the eigenvalue information provided by Weyl bounds has potential to improve beyond the na ıve strategy. Indeed, this appears to be the case across all examples where such behavior occurs.

Performance of Bound Tightening All computational results employ bound tightening as developed in Section 4.5, but only at the root node. Bound tightening, which requires repeated eigenvalue computations, is a computationally expensive process. For this reason, we have chosen not to employ bound tightening at every node in the branch-and-bound tree. From a variety of computational experiments, we observed that the most important node for bound tightening is the root node, and therefore it is a reasonable choice to only employ bound tightening there. Consequently, we employ pruning via Weyl s method (detailed in Section 4.5 as well) at all nodes in the branch-and-bound tree. (Recall that bound tightening can be thought of as optimal pruning via Weyl s method.) In Table 14 we show the impact of bound tightening at the root node in terms of the improvement in the lower bounds provided by convex envelopes. The results for the class A1 are particularly distinctive. Indeed, for this class bound tightening has a substantial impact on the quality of the convex envelope bound (for the example A1(3/10) given, the improvement is from a relative gap at the root node of 56% to a gap of 20%). For the examples shown, bound tightening oﬀers the least improvement in the real example geomorphology. In light of Table 8 this is not too surprising, as Weyl s method (at the root node) is not particular eﬀective for this example. As Weyl s inequality is central to bound tightening, problems for which Weyl s inequality is not particularly eﬀective tend to experience less beneﬁt from bound tightening at the root node.

Inﬂuence of Optimality Tolerance We close this section by assessing the inﬂuence of the optimality tolerance for termination, TOL. In particular, we study how the number of

Certifiably Optimal Low Rank Factor Analysis

nodes to prove optimality changes as a function of additive gap necessary for termination. The corresponding results across the four examples are shown in Table 15. Not surprisingly, as the gap necessary for termination is progressively halved, the corresponding number of nodes considered increases substantially. However, it is important to note that even though the gap at termination is smaller as this tolerance decreases (by design), for these examples the best feasible solution remains unchanged. In other words, the increase in the number of nodes is the price for more accurately proving optimality and not for ﬁnding better feasible solutions. Indeed, as noted earlier, the solutions found via conditional gradient methods at the outset are of remarkably high quality.

6. Conclusions

We analyze the classical rank-constrained FA problem from a computational perspective. We proposed a general, ﬂexible family of rank-constrained, nonlinear SDO-based formulations for the task of approximating an observed covariance matrix Σ as the sum of a PSD low-rank component Θ and a diagonal matrix Φ (with nonnegative entries) subject to Σ Φ being PSD. Our framework enables us to estimate the underlying factors and unique variances under the restriction that the residual covariance matrix is semideﬁnite this is important for statistical interpretability and understanding the proportion of variance explained by a given number of factors. This constraint, however, seems to ignored by most other widely used methods in FA.

We introduce a novel exact reformulation of the rank-constrained FA problem as a smooth optimization problem with convex, compact constraints. We present a uniﬁed algorithmic framework, utilizing modern techniques in nonlinear optimization and ﬁrst order methods in convex optimization to obtain high-quality solutions for the FA problem. At the same time, we use techniques from discrete and global optimization to demonstrate that these solutions are often provably optimal. We provide computational evidence demonstrating that the methods proposed herein provide high quality estimates with improved accuracy when compared to existing, popularly-used methods in FA.

In this work we have demonstrated that a previously intractable rank optimization problem can be solved to provable optimality. We envision that ideas similar to those used here can be used to address an even larger class of estimation problems with underlying matrix structure. In this way, we anticipate signiﬁcant progress on such problems in the next decade, particularly in light of myriad advances throughout distinct areas of modern optimization.

Acknowledgments

Copenhaver is supported by the U.S. Department of Defense, Oﬃce of Naval Research, through the National Defense Science and Engineering Graduate (NDSEG) Fellowship. Mazumder is partially supported by ONR Grant N000141512342. We gratefully acknowledge the thoughtful and detailed feedback of two anonymous referees which has led to a substantial improvement in the quality, organization, and completeness of the paper. Copen-

Bertsimas, Copenhaver, and Mazumder

haver would also like to thank the Julia Opt team, especially Joey Huchette, for assistance with experimental features in the julia language.

This appendix contains all proofs for results presented in the main text. Proof of Proposition 1:

(a) We start by observing that for any two real symmetric matrices A, B (with dimensions p p) and the matrix q-norm, a result due to Mirsky (this is also known as the q Wielandt-Hoﬀman inequality; see e.g. Stewart and Sun 1990, p. 205) states:

A B q λ(A) λ(B) q, (46)

where λ(A) and λ(B) denote the vector of eigenvalues of A and B, respectively, arranged in decreasing order, i.e., λ1(A) λ2(A) . . . λp(A) and λ1(B) λ2(B) . . . λp(B). Using this result for Problem (4), it is easy to see that for ﬁxed Φ

{Θ : Θ 0, rank(Θ) r} = {Θ : λ(Θ) 0, λ(Θ) 0 r} , (47)

where λ(Θ) 0 counts the number of non-zero elements of λ(Θ). If we partially minimize the objective function in Problem (4), with respect to Θ (with Φ ﬁxed), and use (46) along with (47), we have the following inequality:

inf Θ (Σ Φ) Θ q q s.t. Θ 0, rank(Θ) r

inf λ(Θ) λ(Σ Φ) λ(Θ) q q s.t. λ(Θ) 0, λ(Θ) 0 r. (48)

Since Σ Φ 0, it follows that the minimum objective value of the r.h.s. optimization Problem (48) is given by Pp i=r+1 λq i (Σ Φ) and is achieved for λi(Θ) = λi(Σ Φ) for i = 1, . . . , r. This leads to the following inequality:

inf Θ:Θ 0,rank(Θ) r (Σ Φ) Θ q q

i=r+1 λq i (Σ Φ). (49)

Furthermore, if Up p denotes the matrix of p eigenvectors of Σ Φ, then the following choice of Θ :

Θ := U diag λ1(Σ Φ), . . . , λr(Σ Φ), 0, . . . , 0 U , (50)

gives equality in (49). This leads to the following result:

inf Θ:Θ 0,rank(Θ) r Σ (Θ + Φ) q q = Σ (Θ + Φ) q q =

i=r+1 λq i (Σ Φ). (51)

(b) The minimizer Θ of Problem (51) is given by (50). In particular, if Φ solves Problem (12) and we compute Θ via (50) (with Φ = Φ ), then the tuple (Φ , Θ ) solves Problem (4). This completes the proof of the proposition.

Certifiably Optimal Low Rank Factor Analysis

Proof of Proposition 2: We build upon the proof of Proposition 1. Note that any Φ that is feasible for Problems (14) and (4) is PSD. Observe that Θ (appearing in the proof of Proposition 1) as given by (50) satisﬁes:

Σ Φ Θ 0 = Σ Θ 0, (52)

where, the right hand side of (52) follows since Φ 0. We have thus established that the solution Θ to the following problem

minimize Θ (Σ Φ) Θ q q

is feasible for the following optimization problem:

minimize Θ (Σ Φ) Θ q q

Since Problem (54) involves minimization over a subset of the feasible set of Problem (53), it follows that Θ is also a minimizer for Problem (54). This completes the proof of the equivalence.

Proof of Theorem 1:

(a) The proof is based on ideas appearing in Overton and Womersley (1992), where it was shown that the sum of the top r eigenvalues of a real symmetric matrix can be written as the solution to a linear SDO problem.

By an elegant classical result due to Fan (Stewart and Sun, 1990), the smallest (p r) eigenvalues of a real symmetric matrix A can be written as

i=r+1 λi(A) = inf Tr (V AV)

s.t. V V = I, (55)

where the optimization variable is V Rp (p r). We will show that the solution to the above nonconvex problem can be obtained via the following linear (convex) SDO problem: minimize Tr (WA)

s.t. I W 0 Tr(W) = p r. (56)

Bertsimas, Copenhaver, and Mazumder

Clearly, Problem (56) is a convex relaxation of Problem (55) hence its minimum value is at least smaller than Pp i=r+1 λi(A). By standard results in convex analysis (Rockafellar, 1996), it follows that the minimum of the above linear SDO problem (56) is attained at the extreme points of the feasible set of (56). The extreme points (see for example, Overton and Womersley, 1992; Vandenberghe and Boyd, 1996) of this set are given by the set of orthonormal matrices of rank p r: VV : V Rp (p r) : V V = I .

It thus follows that the (global) minima of Problems (56) and (55) are the same. Applying this result to the PSD matrix A = (Σ Φ)q appearing in the objective of (12), we arrive at (16). This completes the proof of part (a).

(b) The statement follows from (50).

Proof of Proposition 3: Observe that, for every ﬁxed Φ, the function gq(W, Φ) is concave (in fact, it is linear). Since Gq(W) is obtained by taking a point-wise inﬁmum (with respect to Φ) of the concave function gq(W, Φ), the resulting function Gq(W) is concave (Boyd and Vandenberghe, 2004). The expression of the sub-gradient (19) is an immediate consequence of Danskin s Theorem (Bertsekas, 1999; Rockafellar, 1996).

Proof of Theorem 2: Using the concavity of the function Gq(W) we have:

Gq(W(i+1)) Gq(W(i)) + Gq(W(i)), W(i+1) W(i) = Gq(W(i)) + (W(i)). (57)

Note that (W(i)) 0 and Gq(W(i+1)) Gq(W(i)) for all i k. Hence the (decreasing) sequence of objective values converge to Gq(W( )) (say) and (W( )) 0. Adding up the terms in (57) from i = 1, . . . , k we have:

Gq(W( )) Gq(W(1)) Gq(W(k+1)) Gq(W(1)) k min i=1,...,k

n (W(k)) o (58)

from which (26) follows.

Proof of Theorem 5: By construction,

max [ℓ,u] Nodes u ℓ 1

decreases monotonically to zero as the iterative scheme proceeds. Therefore, to show convergence it suﬃces to show that the additive error at a node n = [ℓ, u] Nodes is O( u ℓ 1). Fix an arbitrary node n = [ℓ, u] Nodes. Without loss of generality, we may assume that (CFA1) as restricted to diag(Φ) [ℓ, u] has a feasible solution; as

Certifiably Optimal Low Rank Factor Analysis

such, let Φ be an optimal solution. Likewise, let Φ be an optimal solution (LSℓ,u). First note that Pp i=r+1 σi(Σ Φ ) Pp i=r+1 σi(Σ Φ ). Now, by two applications of the bound in Theorem 4 and the fact that Φ is optimal to (LSℓ,u), we have that Pp i=r+1 σi(Σ Φ ) Pp i=r+1 σi(Σ Φ ) + u ℓ 1/2. Hence, the additive error satisﬁes

i=r+1 σi(Σ Φ )

i=r+1 σi(Σ Φ )

as was to be shown.

This appendix contains an alternative conditional gradient method for ﬁnding feasible solutions. Since Problem (16) involves the minimization of a smooth function over a compact convex set, the CG method requires iteratively solving the following convex optimization problem:

minimize gp(W(k), Φ(k)), W Φ

s.t. W Ψp,p r Φ FΣ,

where gp(W(k), Φ(k)) is the derivative of gp(W(k), Φ(k)) at the current iterate (W(k), Φ(k)). Note that due to the separability of the constraints in W and Φ, Problem (59) splits into two independent optimization problems with respect to W and Φ. The overall procedure is outlined in Algorithm 3. Since Wgp(W(k), Φ(k)) = (Σ Φ(k))q, the update for W appearing in (60) requires solving:

minimize W Ψp,p r W, (Σ Φ(k))q . (63)

Similarly, the update for Φ appearing in (61) requires solving:

minimize Φ FΣ

i=1 Φiℓi, (64)

where the vector (ℓ1, . . . , ℓp) is given by diag(ℓ1, . . . , ℓp) = q diag W(k)(Σ Φ)q 1 , where diag(A) is a diagonal matrix having the same diagonal entries as A. The sequence (W(k), Φ(k)) is recursively computed via Algorithm 3 until a convergence criterion is met:

gp(W(k), Φ(k)) gp(W(k+1), Φ(k+1)) TOL gp(W(k), Φ(k)), (65)

for some user-deﬁned tolerance TOL > 0.

Bertsimas, Copenhaver, and Mazumder

Algorithm 3 A CG based algorithm for Problem (16)

1 Initialize with (W(1), Φ(1)), feasible for Problem (16) and repeat the following Steps 23 until the convergence criterion described in (65) is met.

2 Solve the linearized Problem (59), which requires solving two separate (convex) SDO problems over W and Φ:

W (k+1) arg min W Ψp,p r W, Wgp(W(k), Φ(k)) (60)

Φ (k+1) arg min Φ FΣ Φ, Φgp(W(k), Φ(k)) (61)

where Wgp(W, Φ) (and Φgp(W, Φ)) is the partial derivative with respect to W (respectively, Φ).

3 Obtain the new iterates:

W(k+1) = W(k) + ηk(W (k+1) W(k)),

Φ(k+1) = Φ(k) + ηk(Φ (k+1) Φ(k)). (62)

with ηk [0, 1] chosen via an Armijo type line-search rule (Bertsekas, 1999).

A tuple (W , Φ ) satisﬁes the ﬁrst order stationarity conditions (Bertsekas, 1999) for Problem (16), if the following condition is satisﬁed:

inf gp(W , Φ ), W W

s.t. W Ψp,p r Φ FΣ.

Note that Φ deﬁned above also satisﬁes the ﬁrst order stationarity conditions for Problem (12). The following theorem presents a global convergence guarantee for Algorithm 3:

Theorem 8 (Bertsekas, 1999) Every limit point (W( ), Φ( )) of a sequence (W(k), Φ(k)) produced by Algorithm 3 is a ﬁrst order stationary point of the optimization Problem (16).

Numerical experiments (in line with those from Section 5) suggest that Algorithm 3 performs similarly to Algorithm 1, and therefore we only present the results for Algorithm 1 in the main text. Algorithm 1 has the advantage that it does not require a line search, unlike Algorithm 3. Finally, we note that for Algorithm 3 the update for W at iteration k for solving Problem (63) corresponds to f W = (Σ Φ(k))q.

Certifiably Optimal Low Rank Factor Analysis

F.A. Al-Khayyal and J.E. Falk. Jointly constrained biconvex programming. Mathematics of Operations Research, 8:273 286, 1983.

E.D. Andersen and K.D. Andersen. High Performance Optimization, chapter The Mosek Interior Point Optimizer for Linear Programming: An Implementation of the Homogeneous Algorithm. Springer, 2000.

T. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 3rd edition, 2003.

K.M. Anstreicher and S. Burer. Computable representations for convex hulls of lowdimensional quadratic forms. Mathematical Programming, Series B, 124:33 43, 2010.

J. Bai and K. Li. Statistical analysis of factor models of high dimension. The Annals of Statistics, 40(1):436 465, 2012.

J. Bai and S. Ng. Large dimensional factor analysis. Foundations and Trends in Econometrics, 3(2):89 163, 2008.

L. Bai, J.E. Mitchell, and J.-S. Pang. On conic QPCCs, conic QCQPs and completely positive programs. Mathematical Programming, 159(1-2):109 136, 2016.

X. Bao, N.V. Sahinidis, and M. Tawarmalani. Multiterm polyhedral relaxations for nonconvex, quadratically constrained quadratic programs. Optimization Methods and Software, 24:485 504, 2009.

D. Bartholomew, M. Knott, and I. Moustaki. Latent Variable Models and Factor Analysis: A Uniﬁed Approach. Wiley, 2011.

A. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete and Computational Geometry, 12:189 202, 1995.

P.M. Bentler and J.A. Woodward. Inequalities among lower-bounds to reliability: With applications to test construction and factor analysis. Psychometrika, 45:249 267, 1980.

C. Bernaards and R. Jennrich. Gradient projection algorithms and software for arbitrary rotation criteria in factor analysis. Educational and Psychological Measurement, 65:676 696, 2005.

D. Bertsekas. Nonlinear Programming. Athena Scientiﬁc, 2nd edition, 1999.

R. Bhatia. Perturbation bounds for matrix eigenvalues. SIAM, 2007.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1042 1068, 2011.

Bertsimas, Copenhaver, and Mazumder

S. Burer. Handbook on Semideﬁnite, Conic and Polynomial Optimization, chapter Copositive programming, pages 201 218. Springer, 2012.

G. Connor and R. Korajczyk. Performance measurement with the arbitrage pricing theory: A new framework for analysis. Journal of Financial Economics, 15(3):373 394, 1986.

A. Costa and L. Liberti. Relaxations of multilinear convex envelopes: dual is better than primal. In International Symposium on Experimental Algorithms, pages 87 98. Springer, 2012.

M. Fazel. Matrix Rank Minimization with Applications. Ph D thesis, Stanford University, 2002.

C.A. Floudas. Deterministic global optimization: theory, algorithms, and applications. Kluwer, 1999.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95 110, 1956.

L. Guttman. To what extent can communalities reduce rank? Psychometrika, 23(4):297 308, 1958.

P. Hansen, B. Jaumard, M. Ruiz, and J. Xiong. Global minimization of indeﬁnite quadratic functions subject to box constraints. Naval Research Logistics, 40:373 392, 1993.

H. Harman and W. Jones. Factor analysis by minimizing residuals (MINRES). Psychometrika, 31:351 368, 1966.

R.A. Horn and C.R. Johnson. Matrix analysis. Cambridge University Press, 2nd edition, 2013.

F. Husson, J. Josse, S. Le, and J. Mazet. Facto Mine R: Multivariate Exploratory Data Analysis and Data Mining, 2015. URL http://CRAN.R-project.org/package=Facto Mine R. R package version 1.31.3.

E. John and E.A. Yildirim. Implementation of warm-start strategies in interior-point methods for linear programming in ﬁxed dimension. Computational Optimization and Applications, 41:151 183, 2008.

K. J oreskog. Some contributions to maximum likelihood factor analysis. Psychometrika, 32 (4):443 482, 1967.

K. J oreskog. Structural analysis of covariance and correlation matrices. Psychometrika, 43 (4):443 477, 1978.

J.B. Lasserre. Moments, positive polynomials and their applications. Imperial College Press, 2009.

D. Lawley and A. Maxwell. Factor analysis as a statistical method. Journal of the Royal Statistical Society. Series D (The Statistician), 12(3):209 229, 1962.

Certifiably Optimal Low Rank Factor Analysis

P. le Bodic and G.L. Nemhauser. An abstract model for branching and its application to mixed integer programming. ar Xiv preprint ar Xiv:1511.01818, 2015.

W. Ledermann. On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika, 2(2):85 93, 1937.

A. Lewis. Derivatives of spectral functions. Mathematics of Operations Research, 21(3): 576 588, 1996.

J. L ofberg. YALMIP: A toolbox for modeling and optimization in Matlab. In International Symposium on Computer Aided Control Systems Design, pages 284 289. IEEE, 2004.

K. Mardia, J. Kent, and J. Bibby. Multivariate analysis. Academic Press, 1979.

R. Misener and C.A. Floudas. Global optimization of mixed-integer quadraticallyconstrained quadratic programs (MIQCQP) through piecewise-linear and edge-concave relaxations. Mathematical Programming, Series B, 136:155 182, 2012.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, Series A, 103:127 152, 2005.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007. Technical Report number 76.

B. O Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042 1068, 2016.

M. Overton and R. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM Journal of Matrix Analysis and Applications, 13:41 45, 1992.

G. Pataki. On the rank of extreme matrices in semideﬁnite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23:339 358, 1998.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, 2014. URL http://www.R-project.org/.

G. Raiche and D. Magis. n Factors: Parallel analysis and non graphical solutions to the Cattell Scree test. URL http://cran.r-project.org/web/packages/n Factors/index. html, 2011.

C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1973.

W. Revelle. psych: Procedures for Psychological, Psychometric, and Personality Research, 2015. URL http://CRAN.R-project.org/package=psych. R package version 1.5.6.

Bertsimas, Copenhaver, and Mazumder

G. Riccia and A. Shapiro. Minimum rank and minimum trace of covariance matrices. Psychometrika, 47:443 448, 1982.

D. Robertson and J. Symons. Maximum likelihood factor analysis with rank-deﬁcient sample covariance matrices. Journal of Multivariate Analysis, 98(4):813 828, 2007.

R.T. Rockafellar. Convex Analysis. Princeton University Press, 1996.

S. Roweis and Z. Ghahramani. A unifying review of linear gaussian models. Neural Computation, 11(2):305 345, 1999.

D. Rubin and D. Thayer. EM algorithms for ML factor analysis. Psychometrika, 47(1): 69 76, 1982.

N. V. Sahinidis. BARON 14.3.1: Global Optimization of Mixed-Integer Nonlinear Programs, User s Manual, 2014.

J. Saunderson, V. Chandrasekaran, P. Parrilo, and A. Willsky. Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid ﬁtting. SIAM Journal on Matrix Analysis and Applications, 33(4):1395 1416, 2012.

A. Shapiro. Rank-reducibility of a symmetric matrix and sampling theory of minimum trace factor analysis. Psychometrika, 47:187 199, 1982.

A. Shapiro and J. ten Berge. Statistical inference of minimum rank factor analysis. Psychometrika, 67:79 94, 2002.

C. Spearman. General Intelligence, Objectively Determined and Measured. American Journal of Psychology, 15:201 293, 1904.

G. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.

M. Tawarmalani and N.V. Sahinidis. Convexiﬁcation and global optimization in continuous and mixed-integer nonlinear programming: theory, algorithms, software, and applications, volume 65 of Nonconvex Optimization and its Applications. Kluwer, 2002a.

M. Tawarmalani and N.V. Sahinidis. Convex extensions and envelopes of lower semicontinuous functions. Mathematical Programming, Series A, 93:247 263, 2002b.

M. Tawarmalani, J-P.P. Richard, and C. Xiong. Explicit convex and concave envelopes through polyhedral subdivisions. Mathematical Programming, Series A, 138:531 577, 2013.

J. ten Berge. Some recent developments in factor analysis and the search for proper communalities. In Advances in data science and classiﬁcation, pages 325 334. Springer, 1998.

J. ten Berge and H. Kiers. Computational aspects of the greatest lower bound to the reliability and constrained minimum trace factor analysis. Psychometrika, 46:201 213, 1981.

Certifiably Optimal Low Rank Factor Analysis

J. ten Berge and H. Kiers. A numerical approach to the approximate and the exact minimum rank of a covariance matrix. Psychometrika, 56:309 315, 1991.

J. ten Berge and H. Kiers. The minimum rank factor analysis program MRFA. URL http://www.ppsw.rug.nl/~kiers/, 2003.

J. ten Berge, T. Snijders, and F.E. Zegers. Computational aspects of the greatest lower bound to the reliability and constrained minimum trace factor analysis. Psychometrika, 46:201 213, 1981.

L.L. Thurstone. Multiple Factor Analysis: a Development and Expansion of the Vectors of the Mind. University of Chicago Press, 1947.

K.C. Toh, M.J. Todd, and R.H. Tutuncu. SDPT3 a Matlab software package for semidefinite programming. Optimization Methods and Software, 11:545 581, 1999.

L. Vandenberghe and S. Boyd. Semideﬁnite programming. SIAM Review, 38(1):49 95, 1996.

E.A. Yildirim and S. Wright. Warm-start strategies in interior-point methods for linear programming. SIAM Journal on Optimization, 12:782 810, 2002.