Journal of Machine Learning Research 22 (2021) 1-40 Submitted 6/20; Revised 10/20; Published 4/21

Empirical Bayes Matrix Factorization

Wei Wang weiwang@statistics.uchicago.edu
Department of Statistics, University of Chicago, Chicago, IL, USA

Matthew Stephens mstephens@uchicago.edu
Department of Statistics and Department of Human Genetics, University of Chicago, Chicago, IL, USA

Editor: Sayan Mukherjee

Abstract

Matrix factorization methods, which include Factor analysis (FA) and Principal Components Analysis (PCA), are widely used for inferring and summarizing structure in multivariate data. Many such methods use a penalty or prior distribution to achieve sparse representations ("sparse FA/PCA"), and a key question is how much sparsity to induce. Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it estimates the appropriate amount of sparsity by estimating prior distributions from the observed data. The approach is very flexible: it allows for a wide range of different prior families and allows that each component of the matrix factorization may exhibit a different amount of sparsity. The key to this flexibility is the use of a variational approximation, which we show effectively reduces fitting the EBMF model to solving a simpler problem, the so-called normal means problem. We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. Software implementing our approach is available at https://github.com/stephenslab/flashr.
Keywords: empirical Bayes, matrix factorization, normal means, sparse prior, unimodal prior, variational approximation

1. Introduction

Matrix factorization methods are widely used for inferring and summarizing structure in multivariate data. In brief, these methods represent an observed n × p data matrix Y as:

Y = L^T F + E, (1.1)

where L is a K × n matrix, F is a K × p matrix, and E is an n × p matrix of residuals (whose entries we assume to be normally distributed, although the methods we develop can be generalized to other settings; see Section 6.4). Here we adopt the notation and terminology of factor analysis, and refer to L as the loadings and F as the factors.

(c) 2021 Wei Wang and Matthew Stephens. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-589.html.

The model (1.1) has many potential applications. One range of applications arises from matrix completion problems (e.g. Fithian et al., 2018): methods that estimate L and F in (1.1) from partially observed Y provide a natural and effective way to fill in the missing entries. Another wide range of applications comes from the desire to summarize and understand the structure in a matrix Y: in (1.1) each row of Y is approximated by a linear combination of underlying factors (rows of F), which ideally have some intuitive or scientific interpretation. For example, suppose Y_ij represents the rating of user i for movie j. Each factor might represent a genre of movie (comedy, drama, romance, horror, etc.), and the ratings for user i could be written as a linear combination of these factors, with the weights (loadings) representing how much individual i likes that genre. Or, suppose Y_ij represents the expression of gene j in sample i.
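For concreteness, here is a small numeric sketch of the model (1.1) in the movie-ratings setting; the sizes and noise level are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 4, 3, 2                  # e.g. 4 users, 3 movies, 2 genres
L = rng.normal(size=(K, n))        # loadings: how much each user likes each genre
F = rng.normal(size=(K, p))        # factors: how strongly each movie fits each genre
E = 0.1 * rng.normal(size=(n, p))  # residual noise
Y = L.T @ F + E                    # the model (1.1): Y = L^T F + E
```

Each row of Y (a user's ratings) is a linear combination of the K rows of F, weighted by that user's column of L, plus noise.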
Each factor might represent a module of co-regulated genes, and the data for sample i could be written as a linear combination of these factors, with the loadings representing how active each module is in each sample. Many other examples could be given across many fields, including psychology (Ford et al., 1986), econometrics (Bai and Ng, 2008), natural language processing (Bouchard et al., 2015), population genetics (Engelhardt and Stephens, 2010), and functional genomics (Stein-O'Brien et al., 2018). The simplest approaches to estimating L and/or F in (1.1) are based on maximum likelihood or least squares. For example, Principal Components Analysis (PCA) or, more precisely, truncated Singular Value Decomposition (SVD) can be interpreted as fitting (1.1) by least squares, assuming that columns of L are orthogonal and columns of F are orthonormal (Eckart and Young, 1936). And classical factor analysis (FA) corresponds to maximum likelihood estimation of L, assuming that the elements of F are independent standard normal and allowing different residual variances for each column of Y (Rubin and Thayer, 1982). While these simple methods remain widely used, in the last two decades researchers have focused considerable attention on obtaining more accurate and/or more interpretable estimates, either by imposing additional constraints (e.g. non-negativity; Lee and Seung, 1999), by regularization using a penalty term (e.g. Jolliffe et al., 2003; Witten et al., 2009; Mazumder et al., 2010; Hastie et al., 2015; Fithian et al., 2018), or by using a prior distribution (e.g. Bishop, 1999; Attias, 1999; Ghahramani and Beal, 2000; West, 2003). In particular, many authors have noted the benefits of sparsity assumptions on L and/or F, particularly in applications where interpretability of the estimates is desired, and there now exists a wide range of methods that attempt to induce sparsity in these models (e.g.
Sabatti and James, 2005; Zou et al., 2006; Pournara and Wernisch, 2007; Carvalho et al., 2008; Witten et al., 2009; Engelhardt and Stephens, 2010; Knowles and Ghahramani, 2011; Bhattacharya and Dunson, 2011; Mayrink et al., 2013; Yang et al., 2014; Gao et al., 2016; Hore et al., 2016; Ročková and George, 2016; Srivastava et al., 2017; Kaufmann and Schumacher, 2017; Frühwirth-Schnatter and Lopes, 2018; Zhao et al., 2018). Many of these methods induce sparsity in the loadings only, although some induce sparsity in both loadings and factors. In any statistical problem involving sparsity, a key question is how strong the sparsity should be. In penalty-based methods this is controlled by the strength and form of the penalty, whereas in Bayesian methods it is controlled by the prior distributions. In this paper we take an Empirical Bayes approach to this problem, exploiting variational approximation methods (Blei et al., 2017) to obtain simple algorithms that jointly estimate the prior distributions for both loadings and factors, as well as the loadings and factors themselves. Both EB and variational methods have been previously used for this problem (Bishop, 1999; Lim and Teh, 2007; Raiko et al., 2007; Stegle et al., 2010). However, most of this previous work has used simple normal prior distributions that do not induce sparsity. Variational methods that use sparsity-inducing priors include Girolami (2001), which uses a Laplace prior on the factors (no prior on the loadings, which are treated as free parameters); Hochreiter et al. (2010), which extends this to Laplace priors on both factors and loadings, with fixed values of the Laplace prior parameters; Titsias and Lázaro-Gredilla (2011), which uses a sparse spike-and-slab (point-normal) prior on the loadings (with the same prior on all K loadings) and a normal prior on the factors; and Hore et al. (2016), which uses a spike-and-slab prior on one mode in a tensor decomposition.
(While this work was in review further examples appeared, including Argelaguet et al. 2018, which uses normal priors on the loadings and point-normal priors on the factors.) Our primary contribution here is to develop and implement a more general EB approach to matrix factorization (EBMF). This general approach allows for a wide range of potential sparsity-inducing prior distributions on both the loadings and the factors within a single algorithmic framework. We accomplish this by showing that, when using variational methods, fitting EBMF with any prior family can be reduced to repeatedly solving a much simpler problem, the empirical Bayes normal means (EBNM) problem, with the same prior family. This feature makes it easy to implement methods for any desired prior family: one simply has to implement a method to solve the corresponding normal means problem, and then plug this into our algorithm. This approach can work for both parametric families (e.g. normal, point-normal, Laplace, point-Laplace) and non-parametric families, including the adaptive shrinkage priors (unimodal and scale mixtures of normals) from Stephens (2017). It is also possible to accommodate non-negativity constraints on L and/or F by using non-negative prior families. Even simple versions of our approach (e.g. using point-normal priors on both factors and loadings) provide more generality than most existing EBMF approaches and software. A second contribution of our work is to highlight similarities and differences between EBMF and penalty-based methods for regularizing L and/or F. Indeed, our algorithm for fitting EBMF has the same structure as commonly-used algorithms for penalty-based methods, with the prior distribution playing a role analogous to the penalty (see Remark 3 later).
While the general correspondence between estimates from penalized methods and Bayesian posterior modes (MAP estimates) is well known, the connection here is different, because the EBMF approach is estimating a posterior mean, not a mode (indeed, with sparse priors the MAP estimates of L and F are not useful because they are trivially 0). A key difference between the EBMF approach and penalty-based methods is that the EBMF prior is estimated by solving an optimization problem, whereas in penalty-based methods the strength of the penalty is usually chosen by cross-validation. This difference makes it much easier for EBMF to allow for different levels of sparsity in every factor and every loading: in EBMF one simply uses a different prior for every factor and loading, whereas tuning a separate parameter for every factor and loading by CV becomes very cumbersome. The final contribution is that we provide an R software package, flash (Factors and Loadings by Adaptive SHrinkage), implementing our flexible EBMF framework. We demonstrate the utility of these methods through both numerical comparisons with competing methods and through a scientific application: analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons flash often provides more accurate inferences than other methods, while remaining computationally tractable for moderate-sized matrices (millions of entries). In the GTEx data, flash highlights both effects that are shared across many tissues ("dense" factors) and effects that are specific to a small number of tissues ("sparse" factors). These sparse factors often highlight similarities between tissues that are known to be biologically related, providing external support for the reliability of the results.

2.
A General Empirical Bayes Matrix Factorization Model

We define the K-factor Empirical Bayes Matrix Factorization (EBMF) model as follows:

Y = Σ_{k=1}^K l_k f_k^T + E, (2.1)
l_{k1}, ..., l_{kn} ~iid g_{lk}, g_{lk} ∈ G_l, (2.2)
f_{k1}, ..., f_{kp} ~iid g_{fk}, g_{fk} ∈ G_f, (2.3)
E_ij ~ N(0, 1/τ_ij), with τ := (τ_ij) ∈ T. (2.4)

Here Y is the n × p observed data matrix, l_k is an n-vector (the kth set of loadings), f_k is a p-vector (the kth factor), G_l and G_f are pre-specified (possibly non-parametric) families of distributions, g_{lk} and g_{fk} are unknown prior distributions that are to be estimated, E is an n × p matrix of independent error terms, and τ is an unknown n × p matrix of precisions (τ_ij) which is assumed to lie in some space T. (This allows structure to be imposed on τ, such as constant precision, τ_ij = τ, or column-specific precisions, τ_ij = τ_j, for example.) Our methods allow that some elements of Y may be missing, and can estimate the missing values (Section 4.1). The term "Empirical Bayes" in EBMF means we fit (2.1)-(2.4) by obtaining point estimates for the priors g_{lk}, g_{fk} (k = 1, ..., K) and approximating the posterior distributions for the parameters l_k, f_k given those point estimates. This contrasts with a fully Bayes approach that, instead of obtaining point estimates for g_{lk}, g_{fk}, would integrate over uncertainty in the estimates. This would involve specifying prior distributions for g_{lk}, g_{fk}, as well as (perhaps substantial) additional computation. The EB approach has the advantage of simplicity, both conceptually and computationally, while enjoying many of the benefits of a fully Bayes approach. In particular it allows for sharing of information across elements of each loading/factor. For example, if the data suggest that a particular factor f_k is sparse, then this will be reflected in a sparse estimate of g_{fk}, and subsequently strong shrinkage of the smaller elements of f_{k1}, ..., f_{kp} towards 0.
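To make the generative model concrete, the following sketch simulates from (2.1)-(2.4) using point-normal priors for g_{lk} and g_{fk} (one choice among many; the sparsity levels and sizes here are purely illustrative):

```python
import numpy as np

def sample_point_normal(size, pi0, sigma, rng):
    """Draw from a point-normal prior: 0 with probability pi0, else N(0, sigma^2)."""
    nonzero = rng.random(size) >= pi0
    return nonzero * rng.normal(0.0, sigma, size)

rng = np.random.default_rng(1)
n, p, K = 50, 40, 3
# sparse loadings (pi0 = 0.8), denser factors (pi0 = 0.3); each k has its own prior
L = np.stack([sample_point_normal(n, 0.8, 2.0, rng) for _ in range(K)])
F = np.stack([sample_point_normal(p, 0.3, 1.0, rng) for _ in range(K)])
tau = np.full((n, p), 4.0)                # constant precision tau_ij = 4
E = rng.normal(0.0, 1.0 / np.sqrt(tau))   # E_ij ~ N(0, 1/tau_ij)
Y = sum(np.outer(L[k], F[k]) for k in range(K)) + E   # model (2.1)
```

The point of the EBMF framework is that the prior parameters used here (the zero proportions and slab variances) are unknown and are estimated from Y, separately for each loading and each factor.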
Conversely, when the data suggest a non-sparse factor, the prior will be dense and the shrinkage less strong. By allowing different prior distributions for each factor and each loading, the model has the flexibility to adapt to any combination of sparse and dense loadings and factors. However, to fully capitalize on this flexibility one needs suitably flexible prior families G_l and G_f, capable of capturing both sparse and dense factors. A key feature of our work is that it allows for very flexible prior families, including non-parametric families.

Some specific choices of the distributional families G_l and G_f correspond to models used in previous work. In particular, many previous papers have studied the case with normal priors, where G_l and G_f are both the family of zero-mean normal distributions (e.g. Bishop, 1999; Lim and Teh, 2007; Raiko et al., 2007; Nakajima and Sugiyama, 2011). This family is particularly simple, having a single hyper-parameter, the prior variance, to estimate for each factor. However, it does not induce sparsity on either L or F; indeed, when the matrix Y is fully observed, the estimates of L and F under a normal prior (when using a fully factored variational approximation) are simply scalings of the singular vectors from an SVD of Y (Nakajima and Sugiyama, 2011; Nakajima et al., 2013). Our work here extends these previous approaches to a much wider range of prior families that do induce sparsity on L and/or F. We note that the EBMF model (2.1)-(2.4) differs in an important way from the sparse factor analysis (SFA) methods in Engelhardt and Stephens (2010), which use a type of Automatic Relevance Determination prior (e.g. Tipping, 2001; Wipf and Nagarajan, 2008) to induce sparsity on the loadings matrix. In particular, SFA estimates a separate hyperparameter for every element of the loadings matrix, with no sharing of information across elements of the same loading.
In contrast, EBMF estimates a single shared prior distribution for elements of each loading, which, as noted above, allows for sharing of information across elements of each loading/factor.

3. Fitting the EBMF Model

To simplify exposition we begin with the case K = 1 ("rank 1"); see Section 3.6 for the extension to general K. To simplify notation we assume the families G_l, G_f are the same, so we can write G_l = G_f = G. To further lighten notation in the case K = 1 we use g_l, g_f, l, f instead of g_{l1}, g_{f1}, l_1, f_1. Fitting the EBMF model involves estimating all of g_l, g_f, l, f, τ. A standard EB approach would be to do this in two steps:

1. Estimate g_l, g_f and τ by maximizing the likelihood:

L(g_l, g_f, τ) := ∫∫ p(Y | l, f, τ) g_l(dl_1) ... g_l(dl_n) g_f(df_1) ... g_f(df_p) (3.1)

over g_l, g_f ∈ G, τ ∈ T. (This optimum will typically not be unique because of identifiability issues; see Section 3.8.)

2. Estimate l and f using their posterior distribution: p(l, f | Y, ĝ_l, ĝ_f, τ̂).

However, both these steps are difficult, even for very simple choices of G. Instead, following previous work (see Introduction for citations) we use variational approximations to approximate this approach. Although variational approximations are known to typically under-estimate uncertainty in posterior distributions, our focus here is on obtaining useful point estimates for l, f; results shown later demonstrate that the variational approximation can perform well in this task.

3.1 The Variational Approximation

The variational approach (see Blei et al., 2017, for a review) begins by writing the log of the likelihood (3.1) as:

l(g_l, g_f, τ) := log L(g_l, g_f, τ) (3.2)
= F(q, g_l, g_f, τ) + D_KL(q || p), (3.3)

where

F(q, g_l, g_f, τ) = ∫ q(l, f) log [ p(Y, l, f | g_l, g_f, τ) / q(l, f) ] dl df, (3.4)

and

D_KL(q || p) = -∫ q(l, f) log [ p(l, f | Y, g_l, g_f, τ) / q(l, f) ] dl df (3.5)

is the Kullback-Leibler divergence from q to p. This identity holds for any distribution q(l, f).
Because D_KL is non-negative, it follows that F(q, g_l, g_f, τ) is a lower bound for the log likelihood:

l(g_l, g_f, τ) ≥ F(q, g_l, g_f, τ), (3.6)

with equality when q(l, f) = p(l, f | Y, g_l, g_f, τ). In other words,

l(g_l, g_f, τ) = max_q F(q, g_l, g_f, τ), (3.7)

where the maximization is over all possible distributions q(l, f). Maximizing l(g_l, g_f, τ) can thus be viewed as maximizing F over q, g_l, g_f, τ. However, as noted above, this maximization is difficult. The variational approach simplifies the problem by maximizing F but restricting the family of distributions for q. Specifically, the most common variational approach, and the one we consider here, restricts q to the family Q of distributions that fully factorize:

Q = { q : q(l, f) = Π_{i=1}^n q_{l,i}(l_i) Π_{j=1}^p q_{f,j}(f_j) }. (3.8)

The variational approach seeks to optimize F over q, g_l, g_f, τ with the constraint q ∈ Q. For q ∈ Q we can write q(l, f) = q_l(l) q_f(f), where q_l(l) = Π_{i=1}^n q_{l,i}(l_i) and q_f(f) = Π_{j=1}^p q_{f,j}(f_j), and we can consider the problem as maximizing F(q_l, q_f, g_l, g_f, τ).

3.2 Alternating Optimization

We optimize F(q_l, q_f, g_l, g_f, τ) by alternating between optimizing over variables related to l [(q_l, g_l)], over variables related to f [(q_f, g_f)], and over τ. Each of these steps is guaranteed to increase (or, more precisely, not decrease) F, and convergence can be assessed by (for example) stopping when these optimization steps yield a very small increase in F. Note that F may be multi-modal, and there is no guarantee that the algorithm will converge to a global optimum. The approach is summarized in Algorithm 1.

Algorithm 1 Alternating Optimization for EBMF (rank 1)
Require: Initial values q_l^(0), q_f^(0), g_l^(0), g_f^(0)
1: t ← 0
2: repeat
3: t ← t + 1
4: τ^(t) ← arg max_τ F(q_l^(t-1), q_f^(t-1), g_l^(t-1), g_f^(t-1), τ)
5: (q_l^(t), g_l^(t)) ← arg max_{q_l, g_l} F(q_l, q_f^(t-1), g_l, g_f^(t-1), τ^(t))
6: (q_f^(t), g_f^(t)) ← arg max_{q_f, g_f} F(q_l^(t), q_f, g_l^(t), g_f, τ^(t))
7: until converged
8: return q_l^(t), q_f^(t), g_l^(t), g_f^(t), τ^(t)

The key steps in Algorithm 1 are the maximizations in Steps 4-6. Step 4, the update of τ, involves computing the expected squared residuals:

R̄²_ij := E_{q_l,q_f}[(Y_ij - l_i f_j)²] (3.9)
= [Y_ij - E_{q_l}(l_i) E_{q_f}(f_j)]² - E_{q_l}(l_i)² E_{q_f}(f_j)² + E_{q_l}(l_i²) E_{q_f}(f_j²). (3.10)

This is straightforward provided the first and second moments of q_l and q_f are available (see Appendix A.1 for details). Steps 5 and 6 are essentially identical except for switching the roles of l and f. One of our key results is that each of these steps can be achieved by solving a simpler problem, the Empirical Bayes normal means (EBNM) problem. The next subsection (3.3) describes the EBNM problem, and the following subsection (3.4) details how it can be used to solve Steps 5 and 6.

3.3 The EBNM Problem

Suppose we have observations x = (x_1, ..., x_n) of underlying quantities θ = (θ_1, ..., θ_n), with independent Gaussian errors with known standard deviations s = (s_1, ..., s_n). Suppose further that the elements of θ are assumed i.i.d. from some distribution g ∈ G. That is,

x | θ ~ N_n(θ, diag(s_1², ..., s_n²)), (3.11)
θ_1, ..., θ_n ~iid g, g ∈ G, (3.12)

where N_n(µ, Σ) denotes the n-dimensional normal distribution with mean µ and covariance matrix Σ. By "solving the EBNM problem" we mean fitting the model (3.11)-(3.12) by the following two-step procedure:

1. Estimate g by maximum (marginal) likelihood:

ĝ = arg max_{g ∈ G} Π_j ∫ p(x_j | θ_j, s_j) g(dθ_j). (3.13)

2. Compute the posterior distribution for θ given ĝ:

p(θ | x, s, ĝ) ∝ Π_j ĝ(θ_j) p(x_j | θ_j, s_j). (3.14)

Later in this paper we will need the posterior first and second moments, so we define them here for convenience:

θ̄_j := E(θ_j | x, s, ĝ), (3.15)
θ̄²_j := E(θ_j² | x, s, ĝ). (3.16)

Formally, this procedure defines a mapping (which depends on the family G) from the known quantities (x, s) to (ĝ, p), where ĝ and p are given in (3.13) and (3.14).
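As a simple concrete instance, the following sketch solves the EBNM problem for G the family of zero-mean normals, where the marginal likelihood is available in closed form. It maximizes over a grid for simplicity; this is an illustration under those assumptions, not the ashr or ebnm implementation.

```python
import numpy as np

def ebnm_normal(x, s):
    """Solve the EBNM problem (3.11)-(3.12) for G = {N(0, sigma^2)}.

    Marginally x_j ~ N(0, s_j^2 + sigma^2), so step 1 (3.13) maximizes this
    marginal likelihood over sigma^2 (here on a grid); step 2 returns the
    posterior first and second moments (3.15)-(3.16).
    """
    x, s = np.asarray(x, float), np.asarray(s, float)
    grid = np.geomspace(1e-6, 10 * max(x.var(), 1e-6), 200)
    def loglik(v):  # marginal log likelihood at sigma^2 = v (up to a constant)
        return -0.5 * np.sum(np.log(s**2 + v) + x**2 / (s**2 + v))
    v_hat = max(grid, key=loglik)
    post_mean = v_hat / (v_hat + s**2) * x     # shrunk toward zero
    post_var = v_hat * s**2 / (v_hat + s**2)
    return post_mean, post_mean**2 + post_var  # (theta_bar, theta2_bar)
```

Note the characteristic empirical Bayes behavior: the shrinkage factor v̂/(v̂ + s_j²) is learned from all of x jointly, so noisier observations (larger s_j) are shrunk more strongly toward 0.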
We use EBNM to denote this mapping:

EBNM(x, s) = (ĝ, p). (3.17)

Remark 1 Solving the EBNM problem is central to all our algorithms, so it is worth some study. A key point is that the EBNM problem provides an attractive and flexible way to induce shrinkage and/or sparsity in estimates of θ. For example, if θ is truly sparse, with many elements at or near 0, then the estimate ĝ will typically have considerable mass near 0, and the posterior means (3.15) will be shrunk strongly toward 0 compared with the original observations. In this sense solving the EBNM problem can be thought of as a model-based analogue of thresholding-based methods, with the advantage that, by estimating g from the data, the EBNM approach automatically adapts to provide an appropriate level of shrinkage. These ideas have been used in wavelet denoising (Clyde and George, 2000; Johnstone et al., 2004; Johnstone and Silverman, 2005a; Xing et al., 2016) and false discovery rate estimation (Thomas et al., 1985; Stephens, 2017), for example. Here we apply them to matrix factorization problems.

3.4 Connecting the EBMF and EBNM Problems

The EBNM problem is well studied, and can be solved reasonably easily for many choices of G (e.g. Johnstone and Silverman, 2005b; Koenker and Mizera, 2014a; Stephens, 2017). In Section 4 we give specific examples; for now our main point is that if one can solve the EBNM problem for a particular choice of G, then it can be used to implement Steps 5 and 6 in Algorithm 1 for the corresponding EBMF problem. The following proposition formalizes this for Step 5 of Algorithm 1; a similar proposition holds for Step 6 (see also Appendix A).

Proposition 2 Step 5 in Algorithm 1 is solved by solving an EBNM problem.
Specifically,

arg max_{q_l, g_l} F(q_l, q_f, g_l, g_f, τ) = EBNM(l̂(Y, f̄, f̄², τ), s_l(f̄², τ)), (3.18)

where the functions l̂ : R^{n×p} × R^p × R^p × R^{n×p} → R^n and s_l : R^p × R^{n×p} → R^n are given by

l̂(Y, v, w, τ)_i := Σ_j τ_ij Y_ij v_j / Σ_j τ_ij w_j, (3.19)
s_l(w, τ)_i := (Σ_j τ_ij w_j)^{-1/2}, (3.20)

and f̄, f̄² ∈ R^p denote the vectors whose elements are the first and second moments of f under q_f:

f̄ := (E_{q_f}(f_j)), (3.21)
f̄² := (E_{q_f}(f_j²)). (3.22)

Proof See Appendix A.

For intuition into where the EBNM in Proposition 2 comes from, consider estimating l, g_l in (2.1) with f and τ known. The model then becomes n independent regressions of the rows of Y on f, and the maximum likelihood estimate for l has elements:

l̂_i = Σ_j τ_ij Y_ij f_j / Σ_j τ_ij f_j², (3.23)

with standard errors

s_i = (Σ_j τ_ij f_j²)^{-1/2}. (3.24)

Further, it is easy to show that

l̂_i ~ N(l_i, s_i²). (3.25)

Combining (3.25) with the prior

l_1, ..., l_n ~iid g_l, g_l ∈ G, (3.26)

yields an EBNM problem. The EBNM in Proposition 2 is the same as the EBNM (3.25)-(3.26), but with the terms f_j and f_j² replaced by their expectations under q_f. Thus, the update for (q_l, g_l) in Algorithm 1, with (q_f, g_f, τ) fixed, is closely connected to solving the EBMF problem for known f, τ.

3.5 Streamlined Implementation Using First and Second Moments

Although Algorithm 1, as written, optimizes over (q_l, q_f, g_l, g_f), in practice each step requires only the first and second moments of the distributions q_l and q_f. For example, the EBNM problem in Proposition 2 involves f̄ and f̄² and not g_f. Consequently, we can simplify implementation by keeping track of only those moments. In particular, when solving the normal means problem EBNM(x, s) in (3.17), we need only return the posterior first and second moments (3.15) and (3.16). This results in a streamlined and intuitive implementation, summarized in Algorithm 2.
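As a concrete illustration, here is a minimal Python sketch of this moment-based alternating scheme for the rank-1 model, assuming column-specific precisions and a generic `ebnm` routine supplied by the caller. This is a sketch under those assumptions, not the flash implementation.

```python
import numpy as np

def ebmf_rank1(Y, ebnm, n_iter=50):
    """Minimal sketch of moment-based rank-1 EBMF.

    `ebnm(x, s)` must solve the EBNM problem (3.11)-(3.12) and return the
    posterior first and second moments (3.15)-(3.16).
    """
    n, p = Y.shape
    u, d, vt = np.linalg.svd(Y, full_matrices=False)  # rank-1 SVD init
    l1, f1 = u[:, 0] * d[0], vt[0]                    # first moments
    l2, f2 = l1**2, f1**2                             # second moments
    for _ in range(n_iter):
        # expected squared residuals (3.10), then tau_ij = tau_j = n / sum_i R2_ij
        R2 = (Y - np.outer(l1, f1))**2 - np.outer(l1**2, f1**2) + np.outer(l2, f2)
        tau = np.tile(n / R2.sum(axis=0), (n, 1))
        # update l: observations (3.19) and standard errors (3.20), then EBNM
        den = tau @ f2
        l1, l2 = ebnm((tau * Y) @ f1 / den, den**-0.5)
        # update f symmetrically
        den = tau.T @ l2
        f1, f2 = ebnm((tau * Y).T @ l1 / den, den**-0.5)
    return l1, f1

# Example with a "no shrinkage" ebnm (posterior = observation), which reduces
# the scheme to weighted alternating least squares:
rng = np.random.default_rng(0)
l, f = rng.normal(size=30), rng.normal(size=20)
Y = np.outer(l, f) + 0.1 * rng.normal(size=(30, 20))
l1, f1 = ebmf_rank1(Y, lambda x, s: (x, x**2 + s**2), n_iter=20)
```

Swapping in a shrinkage-inducing `ebnm` (e.g. one fitting a point-normal prior) is exactly what distinguishes EBMF from plain alternating least squares.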
Algorithm 2 Streamlined Alternating Optimization for EBMF (rank 1)
Require: A data matrix Y (n × p)
Require: A function ebnm(x, s) → (θ̄, θ̄²) that solves the EBNM problem (3.11)-(3.12) and returns the first and second posterior moments (3.15)-(3.16)
Require: A function init(Y) → (l̂, f̂) that produces initial estimates for l (an n-vector) and f (a p-vector) given data Y (for example, rank-1 singular value decomposition)
1: Initialize first moments (l̄, f̄), using (l̄, f̄) ← init(Y)
2: Initialize second moments (l̄², f̄²) by squaring first moments: l̄² ← (l̄_i²) and f̄² ← (f̄_j²)
3: repeat
4: Compute the matrix of expected squared residuals R̄²_ij from (3.9)
5: τ_j ← n / Σ_i R̄²_ij [this update assumes column-specific variances; it can be modified to make other assumptions]
6: Compute l̂(Y, f̄, f̄², τ) and standard errors s_l(f̄², τ), using (3.19) and (3.20)
7: (l̄, l̄²) ← ebnm(l̂, s_l)
8: Compute f̂(Y, l̄, l̄², τ) and standard errors s_f(l̄², τ) (similarly as for l̂ and s_l; see (A.14) and (A.15))
9: (f̄, f̄²) ← ebnm(f̂, s_f)
10: until converged
11: return l̄, l̄², f̄, f̄², τ

Remark 3 Algorithm 2 has a very intuitive form: it has the flavor of an alternating least squares algorithm, which alternates between estimating l given f (Step 6) and f given l (Step 8), but with the addition of the ebnm step (Steps 7 and 9), which can be thought of as regularizing or shrinking the estimates; see Remark 1. This viewpoint highlights connections with related algorithms. For example, the (rank-1 version of the) SSVD algorithm from Yang et al. (2014) has a similar form, but uses a thresholding function in place of the ebnm function to induce shrinkage and/or sparsity.

3.6 The K-factor EBMF Model

It is straightforward to extend the variational approach to fit the general K-factor model (2.1)-(2.4). In brief, we introduce variational distributions (q_{lk}, q_{fk}) for k = 1, ..., K, and then optimize the objective function F(q_{l1}, g_{l1}, q_{f1}, g_{f1}; ...
; q_{lK}, g_{lK}, q_{fK}, g_{fK}; τ). Similar to the rank-1 model, this optimization can be done by iteratively updating parameters relating to a single loading or factor, keeping other parameters fixed. And again we simplify implementation by keeping track of only the first and second moments of the distributions q_{lk} and q_{fk}, which we denote l̄_k, l̄²_k, f̄_k, f̄²_k. The updates to l̄_k, l̄²_k (and f̄_k, f̄²_k) are essentially identical to those for fitting the rank-1 model above, but with Y_ij replaced by the residuals obtained by removing the estimated effects of the other K - 1 factors:

R^k_ij := Y_ij - Σ_{k' ≠ k} l̄_{k'i} f̄_{k'j}. (3.27)

Based on this approach we have implemented two algorithms for fitting the K-factor model. First, a simple greedy algorithm, which starts by fitting the rank-1 model and then adds factors k = 2, ..., K one at a time, optimizing over the new factor parameters before moving on to the next factor. Second, a backfitting algorithm (Breiman and Friedman, 1985), which iteratively refines the estimates for each factor given the estimates for the other factors. Both algorithms are detailed in Appendix A.

3.7 Selecting K

An interesting feature of EB approaches to matrix factorization, noted by Bishop (1999), is that they automatically select the number of factors K. This is because the maximum likelihood solution for g_{lk}, g_{fk} is sometimes a point mass on 0 (provided G includes this distribution). Furthermore, the same is true of the solution to the variational approximation (see also Bishop, 1999; Stegle et al., 2012). This means that if K is set sufficiently large then some loading/factor combinations will be optimized to be exactly 0. (Or, in the greedy approach, which adds one factor at a time, the algorithm will eventually add a factor that is exactly 0, at which point it terminates.) Here we note that the variational approximation may be expected to result in conservative estimation (i.e.
underestimation) of K compared with the (intractable) use of maximum likelihood to estimate g_l, g_f. We base our argument on the simplest case: comparing K = 1 vs K = 0. Let δ_0 denote the degenerate distribution with all its mass at 0. Note that the rank-1 factor model (2.1) with g_l = δ_0 (or g_f = δ_0) is essentially a rank-0 model. Now note that the variational lower bound F is exactly equal to the log-likelihood when g_l = δ_0 (or g_f = δ_0). This is because if the prior is a point mass at 0 then the posterior is also a point mass, which trivially factorizes as a product of point masses, and so the variational family Q includes the true posterior in this case. Since F is a lower bound to the log-likelihood we have the following simple lemma:

Lemma 4 If F(q̂, ĝ_l, ĝ_f, τ̂) > F(δ_0, δ_0, δ_0, τ̂_0) then l(ĝ_l, ĝ_f, τ̂) > l(δ_0, δ_0, τ̂_0).

Proof

l(ĝ_l, ĝ_f, τ̂) ≥ F(q̂, ĝ_l, ĝ_f, τ̂) > F(δ_0, δ_0, δ_0, τ̂_0) = l(δ_0, δ_0, τ̂_0). (3.28)

Thus, if the variational approximation F favors ĝ_l, ĝ_f, τ̂ over the rank-0 model, then it is guaranteed that the likelihood would also favor ĝ_l, ĝ_f, τ̂ over the rank-0 model. In other words, compared with the likelihood, the variational approximation is conservative in terms of preferring the rank-1 model to the rank-0 model. This conservatism is a double-edged sword. On the one hand, it means that if the variational approximation finds structure it should be taken seriously. On the other hand, it means that the variational approximation could miss subtle structure.

In practice Algorithm 2 can converge to a local optimum of F that is not as high as the trivial (rank-0) solution, F(δ_0, δ_0, δ_0, τ̂_0). We can add a check for this at the end of Algorithm 2, and set ĝ_l = ĝ_f = δ_0 and τ̂ = τ̂_0 when this occurs.

3.8 Identifiability

In EBMF each loading and factor is identifiable, at best, only up to a multiplicative constant (provided G is a scale family).
Specifically, scaling the prior distributions g_{fk} and g_{lk} by c_k and 1/c_k respectively results in the same marginal likelihood, and also results in a corresponding scaling of the posterior distribution on the factors f_k and loadings l_k (e.g. it scales the posterior first moments by c_k, 1/c_k and the second moments by c_k², 1/c_k²). However, this non-identifiability is not generally a problem, and if necessary it could be dealt with by re-scaling factor estimates to have norm 1.

4. Software Implementation: flash

We have implemented Algorithms 2, 4 and 5 in an R package, flash ("factors and loadings via adaptive shrinkage"). These algorithms can fit the EBMF model for any choice of distributional family G_l, G_f: the user must simply provide a function to solve the EBNM problem for these prior families. One source of functions for solving the EBNM problem is the adaptive shrinkage (ashr) package, which implements methods from Stephens (2017). These methods solve the EBNM problem for several flexible choices of G, including:

G = SN, the set of all scale mixtures of zero-centered normals;
G = SU, the set of all symmetric unimodal distributions with mode at 0;
G = U, the set of all unimodal distributions with mode at 0;
G = U+, the set of all non-negative unimodal distributions with mode at 0.

These methods are computationally stable and efficient, being based on convex optimization methods (Koenker and Mizera, 2014b) and analytic Bayesian posterior computations. We have also implemented functions to solve the EBNM problem for additional choices of G in the package ebnm (https://github.com/stephenslab/ebnm). These include G being the point-normal family:

G = PN, the set of all distributions that are a mixture of a point mass at zero and a normal with mean 0.

This choice is less flexible than those in ashr, and involves non-convex optimizations, but can be faster.
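To illustrate the point-normal case, here is a minimal grid-based EBNM solver for G = PN. It is an illustrative sketch only: the ebnm package optimizes the marginal likelihood numerically rather than over a coarse grid.

```python
import numpy as np

def norm_pdf(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def ebnm_point_normal(x, s):
    """Sketch of an EBNM solver for the point-normal family G = PN:
    g = pi0 * delta_0 + (1 - pi0) * N(0, sigma^2).

    Fits (pi0, sigma^2) by maximizing the marginal likelihood over a grid,
    then returns posterior first and second moments of theta.
    """
    x, s = np.asarray(x, float), np.asarray(s, float)
    best, best_ll = None, -np.inf
    for pi0 in np.linspace(0.0, 0.95, 20):
        for v in np.geomspace(1e-3, 100.0, 30):
            # marginal: x_j ~ pi0 N(0, s_j^2) + (1 - pi0) N(0, s_j^2 + v)
            ll = np.sum(np.log(pi0 * norm_pdf(x, s**2)
                               + (1 - pi0) * norm_pdf(x, s**2 + v) + 1e-300))
            if ll > best_ll:
                best, best_ll = (pi0, v), ll
    pi0, v = best
    # posterior probability that theta_j is nonzero (the "slab" component)
    slab = (1 - pi0) * norm_pdf(x, s**2 + v)
    w = slab / (pi0 * norm_pdf(x, s**2) + slab)
    m = v / (v + s**2) * x             # posterior mean given nonzero
    var_nz = v * s**2 / (v + s**2)     # posterior variance given nonzero
    return w * m, w * (m**2 + var_nz)  # (theta_bar, theta2_bar)
```

Unlike hard thresholding, the posterior mean here shrinks small observations smoothly toward 0 according to the estimated sparsity pi0, which is the adaptivity discussed in Remark 1.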
Although in this paper we focus our examples on sparsity-inducing priors with Gl = Gf = G, we note that our software makes it easy to experiment with different choices, some of which represent novel methodologies. For example, setting Gl = U+ and Gf = SN yields an EB version of semi-non-negative matrix factorization (Ding et al., 2008), and we are aware of no existing EB implementations for this problem. Exploring the relative merits of these many possible options in different types of application will be an interesting direction for future work.

4.1 Missing Data

If some elements of Y are missing, this is easily dealt with. For example, the sums over j in (3.19) and (3.20) are simply computed using only the j for which Yij is not missing. This corresponds to an assumption that the missing elements of Y are missing at random (Rubin, 1976). In practice we implement this by setting τij = 0 whenever Yij is missing (and filling in the missing entries of Y with an arbitrary number). This allows the implementation to exploit standard fast matrix multiplication routines, which cannot handle missing data. If many data points are missing then it may be helpful to exploit sparse matrix routines.

4.2 Initialization

Both Algorithms 2 and 4 require a rank-1 initialization procedure, init. Here we use the softImpute function from the package softImpute (Mazumder et al., 2010), with penalty parameter λ = 0, which essentially performs SVD when Y is completely observed, but can also deal with missing values in Y. The backfitting algorithm (Algorithm 5) also requires initialization. One option is to use the greedy algorithm to initialize, which we call greedy+backfitting.

5. Numerical Comparisons

We now compare our methods with several competing approaches. To keep these comparisons manageable in scope we focus attention on methods that aim to capture possible sparsity in L and/or F.
For EBMF we present results for two different shrinkage-oriented prior families G: the scale mixtures of normals (G = SN) and the point-normal family (G = PN). We denote these flash and flash_pn respectively when we need to distinguish them. In addition we consider Sparse Factor Analysis (SFA) (Engelhardt and Stephens, 2010), SFAmix (Gao et al., 2013), Nonparametric Bayesian Sparse Factor Analysis (NBSFA) (Knowles and Ghahramani, 2011), Penalized Matrix Decomposition (Witten et al., 2009) (PMD, implemented in the R package PMA), and Sparse SVD (Yang et al., 2014) (SSVD, implemented in the R package ssvd). Although the methods we compare against represent only a small fraction of the very large number of methods for this problem, they were chosen to cover a wide range of approaches to inducing sparsity: SFA, SFAmix and NBSFA are three Bayesian methods with quite different approaches to prior specification; PMD is based on a penalized likelihood with an L1 penalty on factors and/or loadings; and SSVD is based on iterative thresholding of singular vectors. We also compare with softImpute (Mazumder et al., 2010), which does not explicitly model sparsity in L and F, but fits a regularized low-rank matrix using a nuclear-norm penalty. Finally, for reference, we also use standard (truncated) SVD. All of the Bayesian methods (flash, SFA, SFAmix and NBSFA) are self-tuning, at least to some extent, and we applied them here with default values. According to Yang et al. (2014) SSVD is robust to the choice of tuning parameters, so we also ran SSVD with its default values, using the robust option (method="method"). The softImpute method has a single tuning parameter (λ, which controls the nuclear norm penalty), and we chose this penalty by orthogonal cross-validation (OCV; Appendix B). The PMD method can use two tuning parameters (one for l and one for f) to allow different sparsity levels in l vs f.
However, since tuning two parameters can be inconvenient, it also has the option to use a single parameter for both l and f. We used OCV to tune parameters in both cases, referring to the methods as PMD.cv2 (2 tuning parameters) and PMD.cv1 (1 tuning parameter).

5.1 Simple Simulations

5.1.1 A Single Factor Example

We simulated data with n = 200, p = 300 under the single-factor model (2.1) with sparse loadings and a non-sparse factor:
$$l_i \sim \pi_0 \delta_0 + (1 - \pi_0) \sum_{m=1}^{5} \tfrac{1}{5} N(0, \sigma_m^2) \tag{5.1}$$
$$f_j \sim N(0, 1), \tag{5.2}$$
where $\delta_0$ denotes a point mass at 0, and $(\sigma_1^2, \dots, \sigma_5^2) := (0.25, 0.5, 1, 2, 4)$. We simulated using three different levels of sparsity on the loadings, with π0 = 0.9, 0.3, 0. (We set the noise precision τ = 1, 1/16, 1/25 in these three cases to make each problem neither too easy nor too hard.) We applied all methods to this rank-1 problem, specifying the true value K = 1. (The NBSFA software does not provide the option to fix K, so it is omitted here.) We compare methods on their accuracy in estimating the true low-rank structure (B := l f^T), using the relative root mean squared error:
$$\mathrm{RRMSE}(\hat B, B) := \sqrt{ \frac{\sum_{i,j} (\hat B_{ij} - B_{ij})^2}{\sum_{i,j} B_{ij}^2} }. \tag{5.3}$$
Despite the simplicity of this simulation, the methods vary greatly in performance (Figure 1). Both versions of flash consistently outperform all the other methods across all scenarios (although softImpute performs similarly in the non-sparse case). The next best performances come from softImpute (SI.cv), PMD.cv2 and SFA, whose relative performances depend on the scenario. All three consistently improve on, or do no worse than, SVD. PMD.cv1 performs similarly to SVD. The SFAmix method performs very variably, sometimes providing very poor estimates, possibly due to poor convergence of the MCMC algorithm (it is the only method here that uses MCMC). The SSVD method consistently performs worse than simple SVD, possibly because it is more adapted to both factors and loadings being sparse (and possibly because, following Yang et al.
2014, we did not use CV to tune its parameters). Inspection of individual results suggests that the poor performance of both SFAmix and SSVD is often due to over-shrinking of non-zero loadings to zero.

5.1.2 A Sparse Bi-cluster Example (Rank 3)

An important feature of our EBMF methods is that they estimate separate distributions gl, gf for each factor and each loading, allowing them to adapt to any combination of sparsity in the factors and loadings. This flexibility is not easy to achieve in other ways. For example, methods that use CV are generally limited to one or two tuning parameters because of the computational difficulty of searching over a larger space.

Figure 1: Boxplots comparing the accuracy of flash with several other methods in a simple rank-1 simulation. This simulation involves a single dense factor, and a loading that varies from strong sparsity (90% zeros, left panel) through moderate sparsity (30% zeros) to no sparsity (0% zeros, right panel). Accuracy is measured by the difference of each method's RRMSE from the flash RRMSE, with smaller values indicating higher accuracy. The y axis is plotted on a nonlinear (square-root) scale to avoid the plots being dominated by poorer-performing methods.

To illustrate this flexibility we simulated data under the factor model (2.1) with n = 150, p = 240, K = 3, τ = 1/4, and:
$$l_{1,i} \sim N(0, 2^2), \quad i = 1, \dots, 10 \tag{5.4}$$
$$l_{2,i} \sim N(0, 1), \quad i = 11, \dots, 60 \tag{5.5}$$
$$l_{3,i} \sim N(0, 1/2^2), \quad i = 61, \dots, 150 \tag{5.6}$$
$$f_{1,j} \sim N(0, 1/2^2), \quad j = 1, \dots, 80 \tag{5.7}$$
$$f_{2,j} \sim N(0, 1), \quad j = 81, \dots, 160 \tag{5.8}$$
$$f_{3,j} \sim N(0, 2^2), \quad j = 161, \dots, 240, \tag{5.9}$$
with all other elements of lk and fk set to zero for k = 1, 2, 3. This example has a sparse bi-cluster structure in which distinct groups of samples are each loaded on only one factor (Figure 2a), and both the size of the groups and the number of variables in each factor vary.
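For concreteness, the bi-cluster simulation (5.4)-(5.9) and the RRMSE criterion (5.3) can be sketched in a few lines of numpy; the seed and the truncated-SVD baseline here are our illustrative choices, standing in for the methods compared in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, tau = 150, 240, 1 / 4

# Loadings and factors following (5.4)-(5.9); entries outside the stated
# index ranges are zero, giving the sparse bi-cluster structure.
L = np.zeros((3, n))
F = np.zeros((3, p))
L[0, :10]    = rng.normal(0, 2.0, size=10)
L[1, 10:60]  = rng.normal(0, 1.0, size=50)
L[2, 60:150] = rng.normal(0, 0.5, size=90)
F[0, :80]    = rng.normal(0, 0.5, size=80)
F[1, 80:160] = rng.normal(0, 1.0, size=80)
F[2, 160:]   = rng.normal(0, 2.0, size=80)

B = L.T @ F                                       # true low-rank structure
Y = B + rng.normal(0, np.sqrt(1 / tau), size=(n, p))

def rrmse(B_hat, B):
    """Relative root mean squared error, as in (5.3)."""
    return np.sqrt(((B_hat - B) ** 2).sum() / (B ** 2).sum())

# Baseline comparison: truncated rank-3 SVD of Y (no shrinkage).
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
B_svd = U[:, :3] * s[:3] @ Vt[:3]
print(rrmse(B_svd, B), rrmse(Y, B))  # the SVD estimate beats the raw data
```

A sparsity-adaptive method would additionally shrink the many zero entries of L and F, which is what the boxplots in Figure 2 quantify.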
We applied flash, softImpute, SSVD and PMD to this example. (We excluded SFA and SFAmix since these methods do not model sparsity in both factors and loadings.)

Figure 2: Results from simulations with sparse bi-cluster structure (K = 3). (a) Left: Illustration of the true latent rank-3 block structure used in these simulations. Right: boxplots comparing the accuracy of flash with several other methods across 100 replicates. Accuracy is measured by the difference of each method's RRMSE from the flash RRMSE, so smaller is better. (b) Illustration of the tendency of each method to either over-shrink the signal (SSVD) or under-shrink the noise (SI.cv, PMD.cv1, SVD) compared with flash. Each panel shows the mean absolute value of the estimated structure from each method.

The results (Figure 2) show that again flash consistently outperforms the other methods, and again the next best is softImpute. On this example both SSVD and PMD outperform SVD. Although SSVD and PMD perform similarly on average, their qualitative behavior differs: PMD insufficiently shrinks the zero values, whereas SSVD shrinks the zero values well but over-shrinks some of the signal, essentially removing the smallest of the three loading/factor combinations (Figure 2b).

5.2 Missing Data Imputation for Real Data Sets

Here we compare methods on their ability to impute missing data using five real data sets. In each case we hold out (mask) some of the data points, and then apply the methods to obtain estimates of the missing values. The data sets are as follows:

MovieLens 100K data, an (incomplete) 943 × 1682 matrix of user-movie ratings (integers from 1 to 5) (Harper and Konstan, 2016). Most users do not rate most movies, so the matrix is sparsely observed (94% missing), and contains about 100K observed ratings. We hold out a fraction of the observed entries and assess the accuracy of methods in estimating these.
We centered and scaled the ratings for each user before analysis.

GTEx eQTL summary data, a 16,069 × 44 matrix of Z scores computed by testing association of genetic variants (rows) with gene expression in different human tissues (columns). These data come from the Genotype Tissue Expression (GTEx) project (Consortium et al., 2015), which assessed the effects of thousands of eQTLs across 44 human tissues. (An eQTL is a genetic variant that is associated with expression of a gene.) To identify eQTLs, the project tested for association between expression and every nearby genetic variant, each test yielding a Z score. The data used here are the Z scores for the most significant genetic variant for each gene (the top eQTL). See Section 5.3 for more detailed analyses of these data.

Brain tumor data, a 43 × 356 matrix of gene expression measurements on 4 different types of brain tumor (included in the denoiseR package, Josse et al., 2018). We centered each column before analysis.

Presidential address data, a 13 × 836 matrix of word counts from the inaugural addresses of 13 US presidents (1940–2009) (also included in the denoiseR package, Josse et al., 2018). Since both row and column means vary greatly, we centered and scaled both rows and columns before analysis, using the biScale function from softImpute.

Breast cancer data, a 251 × 226 matrix of gene expression measurements from Carvalho et al. (2008), which were used as an example in the paper introducing NBSFA (Knowles and Ghahramani, 2011). Following Knowles and Ghahramani (2011), we centered each column (gene) before analysis.

Among the methods considered above, only flash, PMD and softImpute can handle missing data. We add NBSFA (Knowles and Ghahramani, 2011) to these comparisons. To emphasize the importance of parameter tuning, we include results for PMD and softImpute with default settings (denoted PMD, SI) as well as with cross-validation (PMD.cv1, SI.cv).
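The masking-and-scoring protocol used in these comparisons can be sketched as follows; a minimal numpy illustration in which a row-mean imputation plays the role of a competing method (purely a hypothetical baseline, not one of the methods above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 20
Y = rng.normal(size=(n, p)) + rng.normal(size=(n, 1))  # toy data with row effects

# Mask 10% of the entries to form the held-out set Omega.
mask = rng.random((n, p)) < 0.10
Y_obs = np.where(mask, np.nan, Y)

def rmse(Y_hat, Y, mask):
    """Root mean squared error over the held-out entries."""
    return np.sqrt(np.mean((Y[mask] - Y_hat[mask]) ** 2))

# Illustrative baseline "method": fill each missing entry with its row mean.
row_means = np.nanmean(Y_obs, axis=1, keepdims=True)
Y_hat = np.where(mask, row_means, Y_obs)

print(rmse(Y_hat, Y, mask))
```

A real comparison would replace the row-mean fill with each method's imputations and repeat over multiple random masks, as in the 20 replicates reported below.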
For these real data the appropriate value of K is, of course, unknown. Both flash and NBSFA automatically estimate K. For PMD and softImpute we specified K based on the values inferred by flash and NBSFA. (Specifically, we used K = 10, 30, 20, 10, 40 respectively for the five data sets.) We applied each method to all 5 data sets, using 10-fold OCV (Appendix B) to mask data points for imputation, repeated 20 times (with different random number seeds) for each data set. We measure imputation accuracy using the root mean squared error (RMSE):
$$\mathrm{RMSE}(\hat Y, Y; \Omega) = \sqrt{ \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} (Y_{ij} - \hat Y_{ij})^2 }, \tag{5.10}$$
where Ω is the set of indices of the held-out data points.

Figure 3: Comparison of the accuracy of different methods in imputing missing data. Each panel corresponds to one data set (e.g. MovieLens, breast cancer, presidential address) and shows a boxplot of error rates (RMSE) for 20 simulations based on masking observed entries in that data set.

The results are shown in Figure 3. Although the ranking of methods varies among data sets, flash, PMD.cv1 and SI.cv perform similarly on average, and consistently outperform NBSFA, which in turn typically outperforms (untuned) PMD and unpenalized softImpute. These results highlight the importance of appropriate tuning for the penalized methods, and also the effectiveness of the EB method in flash in providing automatic tuning. In these comparisons, as in the simulations, the two flash methods typically performed similarly. The exception is the GTEx data, where the scale mixture of normals (G = SN) performed worse. Detailed investigation revealed this to be due to a very small number of very large outlier imputed values, well outside the range of the observed data, which grossly inflated the RMSE. These outliers were so extreme that it should be possible to implement a filter to avoid them. However, we did not do this here, as it seems useful to highlight this unexpected behavior.
(Note that this occurs only when data are missing, and even then in only one of the five data sets considered here.)

5.3 Sharing of Genetic Effects on Gene Expression Among Tissues

To illustrate flash in a scientific application, we applied it to the GTEx data described above, a 16,069 × 44 matrix of Z scores, with Zij reflecting the strength (and direction) of effect of eQTL i in tissue j. We applied flash with G = SN using the greedy+backfitting algorithm (i.e. the backfitting algorithm, initialized using the greedy algorithm). The flash results yielded 26 factors (Figures 4-5) which summarize the main patterns of eQTL sharing among tissues (and, conversely, the main patterns of tissue-specificity). For example, the first factor has approximately equal weight for every tissue, and reflects the fact that many eQTLs show similar effects across all 44 tissues. The second factor has strong effects only in the 10 brain tissues, from which we infer that some eQTLs show much stronger effects in brain tissues than in other tissues. Subsequent factors tend to be sparser, and many have a strong effect in only one tissue, capturing tissue-specific effects. For example, the 3rd factor shows a strong effect only in whole blood, and captures eQTLs that have much stronger effects in whole blood than in other tissues. (Two tissues, Lung and Spleen, show very small effects in this factor but with the same sign as blood. This is intriguing since the lung has recently been found to make blood cells, see Lefrançais et al. (2017), and a key role of the spleen is storing blood cells.) Similarly, Factors 7, 11 and 14 capture effects specific to Testis, Thyroid and Esophagus Mucosa respectively. A few other factors show strong effects in a small number of tissues that are known to be biologically related, providing support that the factors identified are scientifically meaningful.
For example, factor 10 captures the two tissues related to the cerebellum, Brain Cerebellar Hemisphere and Brain Cerebellum. Factor 19 captures tissues related to female reproduction: Ovary, Uterus and Vagina. Factor 5 captures Muscle Skeletal, with small but concordant effects in the heart tissues (Heart Atrial Appendage and Heart Left Ventricle). Factor 4 captures the two skin tissues (Skin Not Sun Exposed Suprapubic, Skin Sun Exposed Lower leg) and also Esophagus Mucosa, possibly reflecting the sharing of squamous cells, which are found both in the surface of the skin and in the lining of the digestive tract. In factor 24, Colon Transverse and Small Intestine Terminal Ileum show the strongest effects (with the same sign), reflecting some sharing of effects in these intestinal tissues. Among the 26 factors, only a few are difficult to interpret biologically (e.g. factor 8).

To highlight the benefits of sparsity, we contrast the flash results with those for softImpute, which was the best-performing method in the missing data assessments on these data, but which uses a nuclear norm penalty that does not explicitly reward sparse factors or loadings. The first eight softImpute factors are shown in Figure 6. Except for the first two factors, the softImpute results show little resemblance to the flash results, and in our view are harder to interpret.

5.4 Computational Demands

It is difficult to make general statements about the computational demands of our methods, because both the number of factors and the number of iterations per factor can vary considerably depending on the data.
However, to give a specific example, running our current implementation of the greedy algorithm on the GTEx data (a 16,000 by 44 matrix) takes about 140s (wall time) for G = PN and 650s for G = SN (on a 2015 MacBook Air with a 2.2 GHz Intel Core i7 processor and 8GB RAM). By comparison, a single run of softImpute without CV takes 2-3s, so a naive implementation of 5-fold CV with 10 different tuning parameters and 10 different values of K would take over 1000s (although one could improve on this by using warm starts, for example).

Figure 4: Results from running flash on GTEx data (factors 1-14). [Each panel plots one factor's values across the 44 GTEx tissues; panel titles report pve values from 0.694 (factor 1) and 0.022 (factor 2) down to 0.002-0.016 for the later factors.] The pve ("percentage variance explained") for loading/factor k is defined as $\mathrm{pve}_k := s_k / (\sum_{k'} s_{k'} + \sum_{ij} 1/\tau_{ij})$ where $s_k := \sum_{ij} (\bar l_{ki} \bar f_{kj})^2$. It is a measure of the amount of signal in the data captured by loading/factor k (but its naming as "percentage variance explained" should be considered loose since the factors are not orthogonal).

Figure 5: Results from running flash on GTEx data (factors 15-26). [Each panel plots one factor's values across the 44 GTEx tissues; the panel titles report pve values of 0.001-0.002 for each of these factors.]

Figure 6: Results from running softImpute on GTEx data (factors 1-8). [Each panel plots one factor's values across the 44 GTEx tissues, with pve values of 0.648 (factor 1), 0.021 (factor 2) and 0.006-0.013 for factors 3-8.] The factors are both less sparse and less interpretable than the flash results.

6. Discussion

Here we discuss some potential extensions or modifications of our work.

6.1 Orthogonality Constraint

Our formulation here does not require the factors or loadings to be orthogonal. In scientific applications we do not see any particular reason to expect underlying factors to be orthogonal. However, imposing such a constraint could have computational or mathematical advantages. Formally adding such a constraint to our objective function seems tricky, but it would be straightforward to modify our algorithms to include an orthogonalization step at each update. This would effectively result in an EB version of the SSVD algorithms in Yang et al. (2014), and it seems likely to be computationally faster than our current approach. One disadvantage of this approach is that it is unclear what optimization problem such an algorithm would solve (but the same is true of SSVD, and our algorithms have the advantage that they deal with missing data).

6.2 Non-negative Matrix Factorization

We focused here on the potential for EBMF to induce sparsity on loadings and factors. However, EBMF can also encode other assumptions.
For example, to assume that the loadings and factors are non-negative, one simply restricts G to be a family of non-negative-valued distributions, yielding Empirical Bayes non-negative Matrix Factorization (EBNMF). Indeed, the ashr software can already solve the EBNM problem for some such families G, and so flash already implements EBNMF. In preliminary assessments we found that the greedy approach is problematic here: the non-negative constraint makes it harder for later factors to compensate for errors in earlier factors. However, it is straightforward to apply the backfitting algorithm to fit EBNMF, with initialization by any existing NMF method. The performance of this approach is an area for future investigation.

6.3 Tensor Factorization

It is also straightforward to extend EBMF to tensor factorization, specifically a CANDECOMP/PARAFAC decomposition (Kolda and Bader, 2009):
$$Y_{ijm} = \sum_{k=1}^{K} l_{ki} f_{kj} h_{km} + E_{ijm} \tag{6.1}$$
$$l_{k1}, \dots, l_{kn} \overset{iid}{\sim} g_{l_k}, \quad g_{l_k} \in G \tag{6.2}$$
$$f_{k1}, \dots, f_{kp} \overset{iid}{\sim} g_{f_k}, \quad g_{f_k} \in G \tag{6.3}$$
$$h_{k1}, \dots, h_{kr} \overset{iid}{\sim} g_{h_k}, \quad g_{h_k} \in G \tag{6.4}$$
$$E_{ijm} \overset{iid}{\sim} N(0, 1/\tau_{ijm}). \tag{6.5}$$
The variational approach is easily extended to this case (a generalization of the methods in Hore et al., 2016), and updates that increase the objective function can be constructed by solving an EBNM problem, similar to EBMF. It seems likely that issues of convergence to local optima, and the need for good initializations, will require some attention to obtain good practical performance. However, the results in Hore et al. (2016) are promising, and the automatic-tuning feature of EB methods seems particularly attractive here. For example, extending PMD to this case while allowing different sparsity levels in l, f and h would require 3 penalty parameters even in the rank-1 case, making it difficult to tune by CV.
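The mean structure in the CP decomposition (6.1) is conveniently expressed as an einsum; a minimal numpy sketch (dimensions and values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
K, n, p, r = 2, 5, 6, 4

# CP components as in (6.1)-(6.5): one loading vector, one factor vector
# and one third-mode vector per component k.
L = rng.normal(size=(K, n))
F = rng.normal(size=(K, p))
H = rng.normal(size=(K, r))

# Mean tensor: Y_ijm = sum_k l_ki f_kj h_km.
mean = np.einsum('ki,kj,km->ijm', L, F, H)

# Add Gaussian noise with constant precision tau = 4 (i.e. sd = 0.5).
Y = mean + rng.normal(0, 0.5, size=(n, p, r))

# Check one entry against the explicit sum in (6.1).
i, j, m = 1, 2, 3
print(np.isclose(mean[i, j, m],
                 sum(L[k, i] * F[k, j] * H[k, m] for k in range(K))))  # True
```

In the EB version, each of L, F and H would be updated by an EBNM solve against the residual tensor, exactly as the matrix case updates loadings and factors.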
6.4 Non-Gaussian Errors

It is also possible to extend the variational approximations used here to fit non-Gaussian models, such as binomial data; see for example Jaakkola and Jordan (2000); Seeger and Bouchard (2012); Klami (2015). The extension of our EB methods using these ideas is detailed in Wang (2017).

Acknowledgments

We thank P. Carbonetto for computational assistance, and P. Carbonetto, D. Gerard, and A. Sarkar for helpful conversations and comments on a draft manuscript. Computing resources were provided by the University of Chicago Research Computing Center. This work was supported by NIH grant HG002585 and by a grant from the Gordon and Betty Moore Foundation (Grant GBMF #4559).

Appendix A. Variational EBMF with K Factors

Here we describe in detail the variational approach to the K-factor model, including deriving the updates that we use to optimize the variational objective. (These derivations naturally include the K = 1 model as a special case, and our proof of Proposition 5 below includes Proposition 2 as a special case.) Let ql, qf denote the variational distributions on the K loadings/factors:
$$q_l(l_1, \dots, l_K) = \prod_k q_{l_k}(l_k) \tag{A.1}$$
$$q_f(f_1, \dots, f_K) = \prod_k q_{f_k}(f_k). \tag{A.2}$$
The objective function F (3.4) is thus a function of $q_l = (q_{l_1}, \dots, q_{l_K})$, $q_f = (q_{f_1}, \dots, q_{f_K})$, $g_l = (g_{l_1}, \dots, g_{l_K})$ and $g_f = (g_{f_1}, \dots, g_{f_K})$, as well as the precision τ:
$$F(q_l, q_f, g_l, g_f, \tau) = \int \prod_k q_{l_k}(l_k)\, q_{f_k}(f_k) \log \frac{p(Y, l, f; g_{l_1}, g_{f_1}, \dots, g_{l_K}, g_{f_K}, \tau)}{\prod_k q_{l_k}(l_k)\, q_{f_k}(f_k)} \, dl \, df \tag{A.3}$$
$$= E_{q_l, q_f} \log p(Y \mid l, f; \tau) + \sum_k E_{q_{l_k}} \log \frac{g_{l_k}(l_k)}{q_{l_k}(l_k)} + \sum_k E_{q_{f_k}} \log \frac{g_{f_k}(f_k)}{q_{f_k}(f_k)}. \tag{A.4}$$
We optimize F by iteratively updating the parameters relating to τ, to a single loading k ($q_{l_k}, g_{l_k}$), or to a single factor k ($q_{f_k}, g_{f_k}$), keeping all other parameters fixed. We simplify the implementation by keeping track of only the first and second moments of the distributions $q_{l_k}$ and $q_{f_k}$, which we denote $\bar l_k, \overline{l^2}_k, \bar f_k, \overline{f^2}_k$. We now describe each kind of update in turn.
A.1 Updates for Precision Parameters

Here we derive updates to optimize F over the precision parameters τ. Focusing on the parts of F that depend on τ gives:
$$F(\tau) = E_{q_l} E_{q_f} \sum_{ij} \Big[ 0.5 \log(\tau_{ij}) - 0.5\, \tau_{ij} \big(Y_{ij} - \sum_k l_{ki} f_{kj}\big)^2 \Big] + \text{const} \tag{A.5}$$
$$= 0.5 \sum_{ij} \Big[ \log(\tau_{ij}) - \tau_{ij} \overline{R^2}_{ij} \Big] + \text{const}, \tag{A.6}$$
where $\overline{R^2}$ is defined by:
$$\overline{R^2}_{ij} := E_{q_l, q_f}\Big[ \big(Y_{ij} - \sum_{k=1}^{K} l_{ki} f_{kj}\big)^2 \Big] \tag{A.7}$$
$$= \big(Y_{ij} - \sum_k \bar l_{ki} \bar f_{kj}\big)^2 - \sum_k (\bar l_{ki})^2 (\bar f_{kj})^2 + \sum_k \overline{l^2}_{ki}\, \overline{f^2}_{kj}. \tag{A.8}$$
If we constrain τ ∈ T then we have
$$\hat\tau = \arg\max_{\tau \in T} \sum_{ij} \big[\log(\tau_{ij}) - \tau_{ij} \overline{R^2}_{ij}\big]. \tag{A.9}$$
For example, assuming constant precision τij = τ yields:
$$\hat\tau = \frac{np}{\sum_{ij} \overline{R^2}_{ij}}. \tag{A.10}$$
Assuming column-specific precisions (τij = τj), which is the default in our software, yields:
$$\hat\tau_j = \frac{n}{\sum_i \overline{R^2}_{ij}}. \tag{A.11}$$
Other variance structures are considered in Appendix A.5 of Wang (2017).

A.2 Updating Loadings and Factors

The following proposition, which generalizes Proposition 2 in the main text, shows how the updates for the loadings (and factors) of the K-factor EBMF model can be achieved by solving an EBNM problem.

Proposition 5 For the K-factor model, $\arg\max_{q_{l_k}, g_{l_k}} F(q_l, q_f, g_l, g_f, \tau)$ is solved by solving an EBNM problem. Specifically,
$$\arg\max_{q_{l_k}, g_{l_k}} F(q_l, q_f, g_l, g_f, \tau) = \mathrm{EBNM}\big(\hat l(R^k, \bar f_k, \overline{f^2}_k, \tau),\; s^l(\overline{f^2}_k, \tau)\big), \tag{A.12}$$
where the functions $\hat l$ and $s^l$ are given by (3.19) and (3.20), $\bar f_k, \overline{f^2}_k \in \mathbb{R}^p$ denote the vectors whose elements are the first and second moments of $f_k$ under $q_{f_k}$, and $R^k$ denotes the residual matrix (3.27). Similarly, $\arg\max_{q_{f_k}, g_{f_k}} F(q_l, q_f, g_l, g_f, \tau)$ is solved by solving an EBNM problem. Specifically,
$$\arg\max_{q_{f_k}, g_{f_k}} F(q_l, q_f, g_l, g_f, \tau) = \mathrm{EBNM}\big(\hat f(R^k, \bar l_k, \overline{l^2}_k, \tau),\; s^f(\overline{l^2}_k, \tau)\big), \tag{A.13}$$
where the functions $\hat f : \mathbb{R}^{n \times p} \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^{n \times p} \to \mathbb{R}^p$ and $s^f : \mathbb{R}^n \times \mathbb{R}^{n \times p} \to \mathbb{R}^p$ are given by
$$\hat f(Y, v, w, \tau)_j := \frac{\sum_i \tau_{ij} Y_{ij} v_i}{\sum_i \tau_{ij} w_i}, \tag{A.14}$$
$$s^f(w, \tau)_j := \Big( \sum_i \tau_{ij} w_i \Big)^{-0.5}. \tag{A.15}$$
A.2.1 A Lemma on the Normal Means Problem

To prove Proposition 5 we introduce a lemma that characterizes the solution of the normal means problem in terms of an objective that is closely related to the variational objective. Recall that the EBNM model is:
$$x = \theta + e \tag{A.16}$$
$$\theta_1, \dots, \theta_n \overset{iid}{\sim} g, \quad g \in G, \tag{A.17}$$
where $e_j \sim N(0, s_j^2)$. Solving the EBNM problem involves estimating g by maximum likelihood:
$$\hat g = \arg\max_{g \in G} l(g), \tag{A.18}$$
where
$$l(g) = \log p(x \mid g). \tag{A.19}$$
It also involves finding the posterior distributions:
$$p(\theta \mid x, \hat g) = \prod_j p(\theta_j \mid x, \hat g) \propto \prod_j \hat g(\theta_j)\, p(x_j \mid \theta_j, s_j). \tag{A.20}$$

Lemma 6 Solving the EBNM problem also solves:
$$\max_{q_\theta, g \in G} F_{NM}(q_\theta, g) \tag{A.21}$$
$$F_{NM}(q_\theta, g) = -\frac{1}{2} E_{q_\theta} \sum_j (A_j \theta_j^2 - 2 B_j \theta_j) + E_{q_\theta} \log \frac{g(\theta)}{q_\theta(\theta)} + \text{const} \tag{A.22}$$
with $A_j = 1/s_j^2$ and $B_j = x_j/s_j^2$, and $g(\theta) := \prod_j g(\theta_j)$. Equivalently, (A.21)-(A.22) is solved by $g = \hat g$ in (A.18) and $q_\theta = p(\theta \mid x, \hat g)$ in (A.20), with $x_j = B_j/A_j$ and $s_j^2 = 1/A_j$.

Proof The log-likelihood can be written as:
$$l(g) := \log[p(x \mid g)] \tag{A.23}$$
$$= \log[p(x, \theta \mid g)/p(\theta \mid x, g)] \tag{A.24}$$
$$= \int q_\theta(\theta) \log \frac{p(x, \theta \mid g)}{p(\theta \mid x, g)}\, d\theta \tag{A.25}$$
$$= \int q_\theta(\theta) \log \frac{p(x, \theta \mid g)}{q_\theta(\theta)}\, d\theta + \int q_\theta(\theta) \log \frac{q_\theta(\theta)}{p(\theta \mid x, g)}\, d\theta \tag{A.26}$$
$$= F_{NM}(q_\theta, g) + D_{KL}(q_\theta \,\|\, p_{\theta \mid x, g}), \tag{A.27}$$
where
$$F_{NM}(q_\theta, g) = \int q_\theta(\theta) \log \frac{p(x, \theta \mid g)}{q_\theta(\theta)}\, d\theta \tag{A.28}$$
$$D_{KL}(q_\theta \,\|\, p_{\theta \mid x, g}) = \int q_\theta(\theta) \log \frac{q_\theta(\theta)}{p(\theta \mid x, g)}\, d\theta. \tag{A.29}$$
Here $p_{\theta \mid x, g}$ denotes the posterior distribution $p(\theta \mid x, g)$. This identity holds for any distribution $q_\theta(\theta)$. Rearranging (A.27) gives:
$$F_{NM}(q_\theta, g) = l(g) - D_{KL}(q_\theta \,\|\, p_{\theta \mid x, g}). \tag{A.30}$$
Since $D_{KL}(q_\theta \| p_{\theta \mid x, g}) \ge 0$, with equality when $q_\theta = p_{\theta \mid x, g}$, $F_{NM}(q_\theta, g)$ is maximized over $q_\theta$ by setting $q_\theta = p_{\theta \mid x, g}$. Further,
$$\max_{q_\theta} F_{NM}(q_\theta, g) = l(g), \tag{A.31}$$
so
$$\arg\max_{g \in G} \max_{q_\theta} F_{NM}(q_\theta, g) = \arg\max_{g \in G} l(g) = \hat g. \tag{A.32}$$
It remains only to show that $F_{NM}$ has the form (A.22). By (A.16) and (A.17), we have
$$\log p(x, \theta \mid g) = -\frac{1}{2} \sum_j s_j^{-2} (x_j - \theta_j)^2 + \log g(\theta) + \text{const}, \tag{A.33}$$
so that
$$F_{NM}(q_\theta, g) = -\frac{1}{2} E_{q_\theta} \sum_j (A_j \theta_j^2 - 2 B_j \theta_j) + E_{q_\theta} \log \frac{g(\theta)}{q_\theta(\theta)} + \text{const}. \tag{A.34}$$
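Once g is fixed, the posterior computations in the EBNM problem are straightforward; a minimal numpy sketch for the special case of a fixed normal prior g = N(0, sigma2_g) (fixing the prior skips the maximum-likelihood step (A.18), so this is only the posterior half of an EBNM solver):

```python
import numpy as np

def normal_means_posterior(x, s2, sigma2_g):
    """Posterior moments of theta_j given x_j ~ N(theta_j, s2_j) and a
    FIXED normal prior theta_j ~ N(0, sigma2_g).  Returns the first and
    second posterior moments, the quantities tracked by the variational
    algorithm."""
    post_var = 1.0 / (1.0 / sigma2_g + 1.0 / s2)   # standard conjugate update
    post_mean = post_var * x / s2                  # shrinks x toward 0
    return post_mean, post_mean ** 2 + post_var

x = np.array([3.0, -0.5, 10.0])
s2 = np.array([1.0, 1.0, 4.0])
m1, m2 = normal_means_posterior(x, s2, sigma2_g=4.0)
print(m1)  # each x_j shrunk toward zero by factor sigma2_g / (sigma2_g + s2_j)
```

A full solver would additionally estimate sigma2_g (or a richer prior family) by maximizing (A.19), which is what ashr and ebnm provide.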
A.2.2 Proof of Proposition 5

We are now ready to prove Proposition 5.

Proof We prove the first part of the proposition since the proof of the second part is essentially the same. The objective function (A.3) is:
$$F(q_l, q_f, g_l, g_f, \tau) = E_{q_l, q_f} \log p(Y \mid l, f; \tau) + \sum_k E_{q_{l_k}} \log \frac{g_{l_k}(l_k)}{q_{l_k}(l_k)} + \sum_k E_{q_{f_k}} \log \frac{g_{f_k}(f_k)}{q_{f_k}(f_k)} \tag{A.35}$$
$$= -\frac{1}{2} E_{q_{l_k}} \sum_i (A_{ik} l_{ki}^2 - 2 B_{ik} l_{ki}) + E_{q_{l_k}} \log \frac{g_{l_k}(l_k)}{q_{l_k}(l_k)} + C_1, \tag{A.36}$$
where $C_1$ is a constant with respect to $q_{l_k}, g_{l_k}$ and
$$A_{ik} = \sum_j \tau_{ij} E_{q_f}(f_{kj}^2) \tag{A.37}$$
$$B_{ik} = \sum_j \tau_{ij} R^k_{ij} E_{q_f}(f_{kj}). \tag{A.38}$$
Based on Lemma 6, we can solve this optimization problem (A.36) by solving the EBNM problem with:
$$x_i = \frac{\sum_j \tau_{ij} R^k_{ij} E_{q_f}(f_{kj})}{\sum_j \tau_{ij} E_{q_f}(f_{kj}^2)} \tag{A.39}$$
$$s_i^2 = \frac{1}{\sum_j \tau_{ij} E_{q_f}(f_{kj}^2)}. \tag{A.40}$$

A.3 Algorithms

Just as with the rank-1 EBMF model, the updates for the rank-K model require only the first and second moments of the variational distributions q. Thus we implement the updates in algorithms that keep track of the first moments ($\bar l := (\bar l_1, \dots, \bar l_K)$ and $\bar f := (\bar f_1, \dots, \bar f_K)$), the second moments ($\overline{l^2} := (\overline{l^2}_1, \dots, \overline{l^2}_K)$ and $\overline{f^2} := (\overline{f^2}_1, \dots, \overline{f^2}_K)$), and the precision τ. Algorithm 3 implements a basic update for τ, and for the parameters relating to a single factor k ($\bar l_k, \overline{l^2}_k, \bar f_k, \overline{f^2}_k$). Note that the latter updates are identical to the updates for fitting the single-factor EBMF model, but with $Y_{ij}$ replaced by the residuals obtained by removing the estimated effects of the other K − 1 factors.

Algorithm 3 Single-factor update for EBMF (rank K)
Require: A data matrix Y (n × p).
Require: A function ebnm(x, s) → ($\bar\theta$, $\overline{\theta^2}$) that solves the EBNM problem (3.11)-(3.12) and returns the first and second posterior moments (3.15)-(3.16).
Require: Current values for the first moments $\bar l := (\bar l_1, \dots, \bar l_K)$ and $\bar f := (\bar f_1, \dots, \bar f_K)$.
Require: Current values for the second moments $\overline{l^2} := (\overline{l^2}_1, \dots, \overline{l^2}_K)$ and $\overline{f^2} := (\overline{f^2}_1, \dots, \overline{f^2}_K)$.
Require: An index k indicating which loading/factor to compute updated values for.
1: Compute the matrix of expected squared residuals, \overline{R^2}, using (A.7).
2: \tau_j <- n / \sum_i \overline{R^2}_{ij}. [Assumes column-specific variances; can be modified to make other assumptions.]
3: Compute the residual matrix R^k := Y - \sum_{k' \neq k} \bar{l}_{k'} \bar{f}_{k'}^T.
4: Compute \hat{l}(R^k, \bar{f}_k, \overline{f^2}_k, \tau) and its standard errors s_l(\tau, \overline{f^2}_k), using (3.19) and (3.20).
5: (\bar{l}_k, \overline{l^2}_k) <- ebnm(\hat{l}, s_l).
6: Compute \hat{f}(R^k, \bar{l}_k, \overline{l^2}_k, \tau) and its standard errors s_f(\overline{l^2}_k, \tau).
7: (\bar{f}_k, \overline{f^2}_k) <- ebnm(\hat{f}, s_f).
8: return updated values \bar{l}_k, \overline{l^2}_k, \bar{f}_k, \overline{f^2}_k, \tau.

Based on these basic updates we implemented two algorithms for fitting the K-factor EBMF model: the greedy algorithm and the backfitting algorithm, as follows.

A.3.1 Greedy Algorithm

The greedy algorithm is a forward procedure that, at the kth step, adds a new loading and factor l_k, f_k by optimizing over q_{lk}, q_{fk}, g_{lk}, g_{fk} while keeping the distributions related to previous factors fixed. Essentially this involves fitting the single-factor model to the residuals obtained by removing the previous factors. The procedure stops adding factors when the estimated new factor (or loading) is identically zero. The algorithm is as follows:

Algorithm 4 Greedy algorithm for EBMF
Require: A data matrix Y (n x p).
Require: A function ebnm(x, s) -> (\bar{\theta}, \overline{\theta^2}) that solves the EBNM problem (3.11)-(3.12) and returns the first and second posterior moments (3.15)-(3.16).
Require: A function init(Y) -> (l; f) that provides initial estimates for the loadings and factors (see Section 4.2).
Require: A function single_update(Y, \bar{l}, \overline{l^2}, \bar{f}, \overline{f^2}, k) -> (\bar{l}_k, \overline{l^2}_k, \bar{f}_k, \overline{f^2}_k, \tau) implementing Algorithm 3.
1: Initialize K <- 0.
2: repeat
3: K <- K + 1.
4: Compute the residual matrix R_{ij} = Y_{ij} - \sum_{k=1}^{K-1} \bar{l}_{ki} \bar{f}_{kj}.
5: Initialize first moments (\bar{l}_K, \bar{f}_K) <- init(R).
6: Initialize second moments by squaring first moments: \overline{l^2}_K <- \bar{l}_K^2; \overline{f^2}_K <- \bar{f}_K^2.
7: repeat
8: (\bar{l}_K, \overline{l^2}_K, \bar{f}_K, \overline{f^2}_K, \tau) <- single_update(Y, \bar{l}, \overline{l^2}, \bar{f}, \overline{f^2}, K)
9: until converged
10: until \bar{f}_K is identically 0 or \bar{l}_K is identically 0.
11: return \bar{l}, \overline{l^2}, \bar{f}, \overline{f^2}, \tau

A.3.2 Backfitting Algorithm

The backfitting algorithm iteratively refines a fit of K factors and loadings by updating them one at a time, at each update keeping the other loadings and factors fixed. The name comes from its connection with the backfitting algorithm of Breiman and Friedman (1985), specifically the fact that it involves iteratively re-fitting to residuals.

Algorithm 5 Backfitting algorithm for EBMF (rank K)
Require: A data matrix Y (n x p).
Require: A function ebnm(x, s) -> (\bar{\theta}, \overline{\theta^2}) that solves the EBNM problem (3.11)-(3.12) and returns the first and second posterior moments (3.15)-(3.16).
Require: A function init(Y) -> (l_1, \dots, l_K; f_1, \dots, f_K) that provides initial estimates for the loadings and factors (e.g. the greedy algorithm from Appendix A.3.1, or a rank-K SVD).
Require: A function single_update(Y, \bar{l}, \overline{l^2}, \bar{f}, \overline{f^2}, k) -> (\bar{l}_k, \overline{l^2}_k, \bar{f}_k, \overline{f^2}_k, \tau) implementing Algorithm 3.
1: Initialize first moments (\bar{l}_1, \dots, \bar{l}_K; \bar{f}_1, \dots, \bar{f}_K) <- init(Y).
2: Initialize second moments by squaring first moments: \overline{l^2}_k <- \bar{l}_k^2; \overline{f^2}_k <- \bar{f}_k^2. [Alternatively the init function could provide these initial values.]
3: repeat
4: for k = 1, \dots, K do
5: (\bar{l}_k, \overline{l^2}_k, \bar{f}_k, \overline{f^2}_k, \tau) <- single_update(Y, \bar{l}, \overline{l^2}, \bar{f}, \overline{f^2}, k)
6: end for
7: until converged
8: return \bar{l}, \overline{l^2}, \bar{f}, \overline{f^2}, \tau

A.4 Objective Function Computation

The algorithms above all involve updates that will increase (or, at least, not decrease) the objective function F(q_l, q_f, g_l, g_f, \tau). However, these updates do not require computing the objective function itself. In iterative algorithms it can be helpful to compute the objective function to monitor convergence (and as a check on the implementation). In this subsection we describe how this can be done.
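The single-factor update (Algorithm 3) and the greedy loop (Algorithm 4) described above can be sketched in Python as follows. This is an illustrative sketch only, not the flashr implementation: it assumes column-specific variances (step 2 of Algorithm 3), uses a rank-1 SVD of the residuals as the init function, accepts any ebnm(x, s) function returning posterior first and second moments, and all names are our own.

```python
import numpy as np

def single_update(Y, lbar, l2bar, fbar, f2bar, k, ebnm):
    """One pass of the single-factor update: refresh the precision tau and
    the moments of loading/factor k, holding the other factors fixed.
    lbar, l2bar are K x n moment arrays; fbar, f2bar are K x p."""
    n, p = Y.shape
    fit = lbar.T @ fbar                              # expected fit, n x p
    # Expected squared residuals: squared mean residual plus fit variance.
    R2 = (Y - fit)**2 + l2bar.T @ f2bar - (lbar**2).T @ (fbar**2)
    tau = np.tile(n / np.maximum(R2.sum(axis=0), 1e-12), (n, 1))
    Rk = Y - fit + np.outer(lbar[k], fbar[k])        # residuals excluding k
    # Loading update via the EBNM map (A.39)-(A.40).
    A = tau @ f2bar[k]
    lbar[k], l2bar[k] = ebnm((tau * Rk) @ fbar[k] / A, np.sqrt(1.0 / A))
    # Factor update, symmetrically.
    A = tau.T @ l2bar[k]
    fbar[k], f2bar[k] = ebnm((tau * Rk).T @ lbar[k] / A, np.sqrt(1.0 / A))
    return lbar, l2bar, fbar, f2bar, tau

def greedy_fit(Y, ebnm, max_K=10, n_iter=50, tol=1e-8):
    """Greedy fitting sketch: add factors one at a time, initializing each
    from the leading SVD factor of the current residuals; stop (and drop
    the new factor) when it is estimated to be essentially zero."""
    n, p = Y.shape
    lbar = np.zeros((0, n)); l2bar = np.zeros((0, n))
    fbar = np.zeros((0, p)); f2bar = np.zeros((0, p))
    for K in range(max_K):
        R = Y - lbar.T @ fbar                        # residuals of current fit
        u, d, vt = np.linalg.svd(R, full_matrices=False)
        l0, f0 = u[:, 0] * d[0], vt[0]               # rank-1 initialization
        lbar = np.vstack([lbar, l0]); l2bar = np.vstack([l2bar, l0**2])
        fbar = np.vstack([fbar, f0]); f2bar = np.vstack([f2bar, f0**2])
        for _ in range(n_iter):
            lbar, l2bar, fbar, f2bar, tau = single_update(
                Y, lbar, l2bar, fbar, f2bar, K, ebnm)
        if np.abs(lbar[K]).max() < tol or np.abs(fbar[K]).max() < tol:
            lbar, l2bar = lbar[:K], l2bar[:K]        # new factor vanished:
            fbar, f2bar = fbar[:K], f2bar[:K]        # drop it and stop
            break
    return lbar, l2bar, fbar, f2bar
```

With a non-shrinking ebnm (returning its input as the posterior mean) this reduces to alternating weighted least squares; the empirical Bayes shrinkage enters entirely through the ebnm function.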
In essence, this involves extending the solver of the EBNM problem to also return the value of the log-likelihood achieved in that problem (which is usually not difficult). The objective function of the EBMF model is:

F(q_l, q_f, g_l, g_f, \tau) = E_{q_l, q_f} \log p(Y \mid l, f; \tau) + E_{q_l} \log \frac{g_l(l)}{q_l(l)} + E_{q_f} \log \frac{g_f(f)}{q_f(f)}. \quad (A.41)

The calculation of E_{q_l, q_f} \log p(Y \mid l, f; \tau) is straightforward, and E_{q_l} \log [g_l(l)/q_l(l)] and E_{q_f} \log [g_f(f)/q_f(f)] can be calculated from the log-likelihood of the EBNM model using the following Lemma 7.

Lemma 7 Suppose \hat{g}, q solve the EBNM problem with data (x, s):

(\hat{g}, q) = \text{EBNM}(x, s), \quad (A.42)

where q := (q_1, \dots, q_n) are the estimated posterior distributions of the normal means parameters \theta_1, \dots, \theta_n. Then

E_q \log \left( \prod_j \hat{g}(\theta_j) \Big/ \prod_j q_j(\theta_j) \right) = l(\hat{g}; x, s) + \frac{1}{2} \sum_j \left[ \log(2\pi s_j^2) + \frac{1}{s_j^2}\left( x_j^2 + E_q(\theta_j^2) - 2 x_j E_q(\theta_j) \right) \right], \quad (A.43)

where l(\hat{g}; x, s) is the log of the likelihood for the normal means problem (A.19).

Proof We have from (A.28):

F^{NM}(q_\theta, \hat{g}) = \int q_\theta(\theta) \log \frac{p(x, \theta \mid \hat{g})}{q_\theta(\theta)} \, d\theta \quad (A.44)
= \int q_\theta(\theta) \log \frac{p(x \mid \theta)\, \hat{g}(\theta)}{q_\theta(\theta)} \, d\theta \quad (A.45)
= E_q \log \left( \prod_j \hat{g}(\theta_j) \Big/ \prod_j q_j(\theta_j) \right) - \frac{1}{2} \sum_j \left[ \log(2\pi s_j^2) + \frac{1}{s_j^2} E_q (x_j - \theta_j)^2 \right]. \quad (A.46)

And the result follows from noting that F^{NM}(\hat{q}, \hat{g}) = l(\hat{g}).

A.5 Inference with Penalty Term

Conceivably, in some settings one might like to encourage solutions to the EBMF problem to be sparser than the maximum-likelihood estimates for g_l, g_f would produce. This could be done by extending the EBMF model to introduce a penalty term on the distributions g_l, g_f, so that the maximum-likelihood estimates are replaced by estimates maximizing a penalized likelihood. We are not advocating for this approach, but it is straightforward given existing machinery, and so we document it here for completeness.
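The bookkeeping in Lemma 7 above can be checked numerically: the prior/posterior term of the objective is recoverable from the EBNM log-likelihood and the posterior moments via (A.43). A small Python sketch (function names ours):

```python
import numpy as np

def prior_kl_term(loglik, x, s, pm, p2):
    """Recover E_q log[g(theta)/q(theta)] from the EBNM solver's
    log-likelihood via (A.43). loglik = l(ghat; x, s); pm, p2 are the
    posterior first and second moments. Sketch of the bookkeeping only."""
    x, s, pm, p2 = map(np.asarray, (x, s, pm, p2))
    return loglik + 0.5 * np.sum(
        np.log(2 * np.pi * s**2) + (x**2 + p2 - 2 * x * pm) / s**2)
```

For a normal prior, where the marginal likelihood, posterior, and both expectations are available in closed form, the identity can be verified exactly.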
Let h_l(g_l) and h_f(g_f) denote penalty terms on g_l and g_f, so the penalized log-likelihood is:

l(g_l, g_f, \tau) := \log p(Y \mid g_l, g_f, \tau) + h_l(g_l) + h_f(g_f) \quad (A.47)
= F(q, g_l, g_f, \tau) + h_l(g_l) + h_f(g_f) + D_{KL}(q \| p),

where F(q, g_l, g_f, \tau) and D_{KL}(q \| p) are defined in (3.4) and (3.5). The corresponding penalized variational objective is:

\max_{q, g_l, g_f} F(q, g_l, g_f, \tau) + h_l(g_l) + h_f(g_f). \quad (A.48)

It is straightforward to modify the algorithms above to maximize this penalized objective: simply modify the EBNM solvers to solve a corresponding penalized normal means problem. That is, instead of estimating the prior g by maximum likelihood, the EBNM solver must now maximize the penalized log-likelihood:

\hat{g} = \arg\max_{g \in \mathcal{G}} l_{EBNM}(g) + h(g), \quad (A.49)

where l_{EBNM} denotes the log-likelihood for the EBNM problem. (The computation of the posterior distributions given \hat{g} is unchanged.)

For example, the ashr software (Stephens, 2017) provides the option to include a penalty on g to encourage overestimation of the size of the point mass on zero. This penalty was introduced to ensure conservative behavior in false discovery rate applications of the normal means problem. It is unclear whether such a penalty is desirable in the matrix factorization application. However, the above discussion shows that using this penalty (e.g. within the ebnm function used by the greedy or backfitting algorithms) can be thought of as solving a penalized version of the EBMF problem.

Appendix B. Orthogonal Cross-Validation

Cross-validation (CV) assessments involve holding out (hiding) data from methods. Here we introduce a novel approach to selecting the data to be held out, which we call Orthogonal Cross-Validation (OCV). Although not the main focus of our paper, we believe that OCV is a novel and appealing approach to selecting hold-out data for factor models, e.g. when using CV to select an appropriate dimension K for dimension reduction methods, as in Owen and Wang (2016).
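To make the penalized EBNM solver in (A.49) above concrete, here is a small Python sketch using a point-normal prior family and a penalty h(g) = (lam - 1) log pi0 that favors a larger point mass at zero. Both the prior family and the exact penalty form are our illustrative choices (in the spirit of, but not necessarily identical to, the ashr penalty), and fitting is by a simple grid search over (pi0, sg2):

```python
import numpy as np

def ebnm_point_normal_penalized(x, s, lam=10.0):
    """Penalized EBNM sketch: prior g = pi0*delta_0 + (1-pi0)*N(0, sg2),
    fit by maximizing the penalized marginal log-likelihood
    l_EBNM(g) + (lam - 1)*log(pi0) over a grid. The posterior computation
    (not shown) would be unchanged by the penalty."""
    x, s = np.asarray(x, float), np.asarray(s, float)

    def pen_loglik(pi0, sg2):
        f0 = np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)   # null
        v = sg2 + s**2
        f1 = np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)         # slab
        return (np.sum(np.log(pi0 * f0 + (1 - pi0) * f1))
                + (lam - 1) * np.log(pi0))

    grid_pi0 = np.linspace(0.01, 0.99, 50)
    grid_sg2 = np.geomspace(1e-3, 1e2, 30)
    _, pi0, sg2 = max((pen_loglik(p, v), p, v)
                      for p in grid_pi0 for v in grid_sg2)
    return pi0, sg2
```

With lam = 1 the penalty vanishes and this reduces to ordinary maximum-likelihood estimation of g; larger lam biases the estimate toward the null, i.e. toward sparser solutions.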
Generic k-fold CV involves randomly dividing the data matrix into k parts and then, for each part, training methods on the other k - 1 parts before assessing error on that part, as in Algorithm 6.

Algorithm 6 k-fold CV
1: procedure k-fold cross-validation
2: Randomly divide the data matrix Y into Y^{(1)}, \dots, Y^{(k)} with hold-out index sets \Omega^{(1)}, \dots, \Omega^{(k)}.
3: for i = 1, \dots, k do
4: Treat Y^{(i)} as missing and run flash.
5: \hat{Y}^{(i)} = E[Y_{\Omega^{(i)}} \mid Y_{-\Omega^{(i)}}].
6: s_i^2 = \| \hat{Y}^{(i)} - Y_{\Omega^{(i)}} \|_2^2.
7: return RMSE: score = \sqrt{ \sum_i s_i^2 / (np) }.

The novel part of OCV is in how the hold-out pattern is chosen. We randomly divide the rows into k sets and the columns into k sets, combine these into k orthogonal parts, and then take all Y_{ij} with the chosen row and column indices as the hold-out set Y^{(i)}. To illustrate this scheme, we take 3-fold CV as an example. We randomly divide the columns into 3 sets and the rows into 3 sets as well. The data matrix Y is then divided into 9 partitions (after row and column permutation):

[ Y_{11} Y_{12} Y_{13} ]
[ Y_{21} Y_{22} Y_{23} ]
[ Y_{31} Y_{32} Y_{33} ]

Then Y^{(1)} = \{Y_{11}, Y_{22}, Y_{33}\}, Y^{(2)} = \{Y_{12}, Y_{23}, Y_{31}\} and Y^{(3)} = \{Y_{13}, Y_{21}, Y_{32}\} are orthogonal to each other, and the data matrix Y is marked as:

[ Y^{(1)} Y^{(2)} Y^{(3)} ]
[ Y^{(3)} Y^{(1)} Y^{(2)} ]
[ Y^{(2)} Y^{(3)} Y^{(1)} ]

In OCV, each fold Y^{(k)} contains an equally balanced part of the data matrix and includes all the row and column indices. This ensures that every i and every j is included in each Y^{(k)}. In 3-fold OCV, we have:

[ Y_{11} Y_{12} Y_{13} ]   [ L^{(1)} ]
[ Y_{21} Y_{22} Y_{23} ] = [ L^{(2)} ] [ F^{(1)} F^{(2)} F^{(3)} ] + E, \quad (B.1)
[ Y_{31} Y_{32} Y_{33} ]   [ L^{(3)} ]

so that the hold-out pattern

[ Y^{(1)} Y^{(2)} Y^{(3)} ]
[ Y^{(3)} Y^{(1)} Y^{(2)} ]
[ Y^{(2)} Y^{(3)} Y^{(1)} ]

corresponds to the blocks

[ L^{(1)} F^{(1)}  L^{(1)} F^{(2)}  L^{(1)} F^{(3)} ]
[ L^{(2)} F^{(1)}  L^{(2)} F^{(2)}  L^{(2)} F^{(3)} ]
[ L^{(3)} F^{(1)}  L^{(3)} F^{(2)}  L^{(3)} F^{(3)} ],

where Y^{(1)} = \{Y_{11}, Y_{22}, Y_{33}\}, Y^{(2)} = \{Y_{12}, Y_{23}, Y_{31}\} and Y^{(3)} = \{Y_{13}, Y_{21}, Y_{32}\}. We can see that each of L^{(1)}, L^{(2)}, L^{(3)} and F^{(1)}, F^{(2)}, F^{(3)} shows up once and only once in each hold-out part Y^{(k)}. In this sense the hold-out pattern is balanced.
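The OCV hold-out pattern described above generalizes to any n, p, and k via a Latin-square assignment of entries to folds: entry (i, j) goes to fold (row group of i + column group of j) mod k. A small Python sketch (function name ours):

```python
import numpy as np

def ocv_folds(n, p, k, seed=0):
    """Orthogonal CV hold-out masks: randomly partition rows and columns
    into k groups each, then assign entry (i, j) to fold
    (row_group[i] + col_group[j]) mod k. Each fold then meets every row
    group and every column group exactly once, as in the 3-fold example."""
    rng = np.random.default_rng(seed)
    rg = rng.permutation(np.arange(n) % k)   # row groups, roughly balanced
    cg = rng.permutation(np.arange(p) % k)   # column groups
    fold = (rg[:, None] + cg[None, :]) % k   # n x p fold labels
    return [fold == i for i in range(k)]     # boolean hold-out masks
```

Because every row group and column group contains all k residues (provided n, p >= k), each fold touches every row and every column of Y, which is exactly the balance property that makes the held-out blocks imputable from the retained ones.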
References

Ricard Argelaguet, Britta Velten, Damien Arnol, Sascha Dietrich, Thorsten Zenz, John C Marioni, Florian Buettner, Wolfgang Huber, and Oliver Stegle. Multi-omics factor analysis: a framework for unsupervised integration of multi-omics data sets. Molecular Systems Biology, 14(6):e8124, 2018.

Hagai Attias. Independent factor analysis. Neural Computation, 11(4):803-851, 1999.

Jushan Bai and Serena Ng. Large Dimensional Factor Analysis. Now Publishers Inc, 2008.

Anirban Bhattacharya and David B Dunson. Sparse Bayesian infinite factor models. Biometrika, pages 291-306, 2011.

Christopher M. Bishop. Variational principal components. In Ninth International Conference on Artificial Neural Networks (Conf. Publ. No. 470), volume 1, pages 509-514. IET, 1999.

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Guillaume Bouchard, Jason Naradowsky, Sebastian Riedel, Tim Rocktäschel, and Andreas Vlachos. Matrix and tensor factorization methods for natural language processing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts, pages 16-18, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-5005. URL https://www.aclweb.org/anthology/P15-5005.

Leo Breiman and Jerome H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580-598, 1985. doi: 10.1080/01621459.1985.10478157. URL http://www.tandfonline.com/doi/abs/10.1080/01621459.1985.10478157.

Carlos M. Carvalho, Jeffrey Chang, Joseph E. Lucas, Joseph R. Nevins, Quanli Wang, and Mike West. High-dimensional sparse factor modeling: Applications in gene expression genomics.
Journal of the American Statistical Association, 103(484):1438-1456, 2008. ISSN 0162-1459. doi: 10.1198/016214508000000869.

Merlise Clyde and Edward I George. Flexible empirical Bayes estimation for wavelets. Journal of the Royal Statistical Society Series B, 62(4):681-698, 2000. doi: 10.1111/1467-9868.00257.

GTEx Consortium et al. The genotype-tissue expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235):648-660, 2015.

Chris HQ Ding, Tao Li, and Michael I Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45-55, 2008.

C Eckart and G Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211-218, 1936.

Barbara E Engelhardt and Matthew Stephens. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genetics, 6(9):e1001117, September 2010.

William Fithian, Rahul Mazumder, et al. Flexible low-rank statistical modeling with missing data and side information. Statistical Science, 33(2):238-260, 2018.

J Kevin Ford, Robert C MacCallum, and Marianne Tait. The application of exploratory factor analysis in applied psychology: A critical review and analysis. Personnel Psychology, 39(2):291-314, 1986.

Sylvia Frühwirth-Schnatter and Hedibert Freitas Lopes. Sparse Bayesian factor analysis when the number of factors is unknown. arXiv preprint arXiv:1804.04231, 2018.

Chuan Gao, Christopher D Brown, and Barbara E Engelhardt. A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects. arXiv:1310.4792v1, 2013.

Chuan Gao, Ian C. McDowell, Shiwen Zhao, Christopher D. Brown, and Barbara E. Engelhardt. Context specific and differential gene co-expression networks via Bayesian biclustering. PLoS Computational Biology, 12(7):1-39, July 2016. doi: 10.1371/journal.pcbi.1004791.
URL https://doi.org/10.1371/journal.pcbi.1004791.

Zoubin Ghahramani and Matthew J Beal. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems, pages 449-455, 2000.

Mark Girolami. A variational method for learning sparse and overcomplete representations. Neural Computation, 13(11):2517-2532, 2001.

F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19, 2016.

Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(1):3367-3402, 2015.

Sepp Hochreiter, Ulrich Bodenhofer, Martin Heusel, Andreas Mayr, Andreas Mitterecker, Adetayo Kasim, Tatsiana Khamiakova, Suzy Van Sanden, Dan Lin, Willem Talloen, Luc Bijnens, Hinrich W. H. Göhlmann, Ziv Shkedy, and Djork-Arné Clevert. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520-1527, April 2010. ISSN 1367-4803. doi: 10.1093/bioinformatics/btq227. URL https://doi.org/10.1093/bioinformatics/btq227.

Victoria Hore, Ana Viñuela, Alfonso Buil, Julian Knight, Mark I McCarthy, Kerrin Small, and Jonathan Marchini. Tensor decomposition for multiple-tissue gene expression experiments. Nature Genetics, 48(9):1094-1100, 2016.

Tommi S. Jaakkola and Michael I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25-37, 2000.

Iain M. Johnstone and Bernard W. Silverman. Empirical Bayes selection of wavelet thresholds. The Annals of Statistics, 33(4):1700-1752, 2005a.

Iain M Johnstone and Bernard W Silverman. EbayesThresh: R and S-PLUS programs for empirical Bayes thresholding. Journal of Statistical Software, 12:1-38, 2005b.

Iain M Johnstone, Bernard W Silverman, et al. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences.
The Annals of Statistics, 32(4):1594-1649, 2004.

Ian T Jolliffe, Nickolay T Trendafilov, and Mudassir Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.

Julie Josse, Sylvain Sardy, and Stefan Wager. denoiseR: A package for low rank matrix estimation, 2018.

Sylvia Kaufmann and Christian Schumacher. Identifying relevant and irrelevant variables in sparse factor models. Journal of Applied Econometrics, 32(6):1123-1144, 2017.

Arto Klami. Polya-gamma augmentations for factor models. In Asian Conference on Machine Learning, pages 112-128, 2015.

David Knowles and Zoubin Ghahramani. Nonparametric Bayesian sparse factor models with application to gene expression modeling. Annals of Applied Statistics, 5(2B):1534-1552, 2011. doi: 10.1214/10-AOAS435.

Roger Koenker and Ivan Mizera. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674-685, 2014a.

Roger Koenker and Ivan Mizera. Convex optimization in R. Journal of Statistical Software, 60(5):1-23, 2014b. doi: 10.18637/jss.v060.i05.

Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

D D Lee and H S Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999. doi: 10.1038/44565.

Emma Lefrançais, Guadalupe Ortiz-Muñoz, Axelle Caudrillier, Beñat Mallavia, Fengchun Liu, David M Sayah, Emily E Thornton, Mark B Headley, Tovo David, Shaun R Coughlin, Matthew F Krummel, Andrew D Leavitt, Emmanuelle Passegué, and Mark R Looney. The lung is a site of platelet biogenesis and a reservoir for haematopoietic progenitors. Nature, 544(7648):105-109, 2017. doi: 10.1038/nature21706.

Yew Jin Lim and Yee Whye Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, volume 7, pages 15-21. Citeseer, 2007.
Vinicius Diniz Mayrink, Joseph Edward Lucas, et al. Sparse latent factor models with interactions: Analysis of gene expression data. The Annals of Applied Statistics, 7(2):799-822, 2013.

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug):2287-2322, 2010.

Shinichi Nakajima and Masashi Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12:2583-2648, 2011.

Shinichi Nakajima, Masashi Sugiyama, S Derin Babacan, and Ryota Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14(Jan):1-37, 2013.

Art B Owen and Jingshu Wang. Bi-cross-validation for factor analysis. Statistical Science, 31(1):119-139, 2016.

Iosifina Pournara and Lorenz Wernisch. Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinformatics, 8:61, 2007. ISSN 1471-2105. doi: 10.1186/1471-2105-8-61.

Tapani Raiko, Alexander Ilin, and Juha Karhunen. Principal component analysis for large scale problems with lots of missing values. In European Conference on Machine Learning, pages 691-698. Springer, 2007.

Veronika Ročková and Edward I George. Fast Bayesian factor analysis via automatic rotations to sparsity. Journal of the American Statistical Association, 111(516):1608-1622, 2016.

Donald B Rubin. Inference and missing data. Biometrika, 63(3):581-592, 1976.

Donald B Rubin and Dorothy T Thayer. EM algorithms for ML factor analysis. Psychometrika, 47(1):69-76, 1982.

Chiara Sabatti and Gareth M James. Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics, 22(6):739-746, 2005.

Matthias Seeger and Guillaume Bouchard. Fast variational Bayesian inference for non-conjugate matrix factorization models.
In Artificial Intelligence and Statistics, pages 1012-1018, 2012.

Sanvesh Srivastava, Barbara E Engelhardt, and David B Dunson. Expandable factor analysis. Biometrika, 104(3):649-663, 2017.

Oliver Stegle, Leopold Parts, Richard Durbin, and John Winn. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Computational Biology, 6(5):e1000770, 2010.

Oliver Stegle, Leopold Parts, Matias Piipari, John Winn, and Richard Durbin. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protocols, 7(3):500-507, 2012.

Genevieve L Stein-O'Brien, Raman Arora, Aedin C Culhane, Alexander V Favorov, Lana X Garmire, Casey S Greene, Loyal A Goff, Yifeng Li, Aloune Ngom, Michael F Ochs, et al. Enter the matrix: factorization uncovers knowledge from omics. Trends in Genetics, 34(10):790-805, 2018.

Matthew Stephens. False discovery rates: a new deal. Biostatistics, 18(2):275-294, April 2017. doi: 10.1093/biostatistics/kxw041.

D C Thomas, J Siemiatycki, R Dewar, J Robins, M Goldberg, and B G Armstrong. The problem of multiple inference in studies designed to generate hypotheses. American Journal of Epidemiology, 122(6):1080-1095, December 1985.

Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun):211-244, 2001.

Michalis K. Titsias and Miguel Lázaro-Gredilla. Spike and slab variational inference for multi-task and multiple kernel learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2339-2347. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4305-spike-and-slab-variational-inference-for-multi-task-and-multiple-kernel-learning.pdf.

Wei Wang.
Applications of Adaptive Shrinkage in Multiple Statistical Problems. PhD thesis, The University of Chicago, 2017.

Mike West. Bayesian factor regression models in the large p, small n paradigm. Bayesian Statistics 7 - Proceedings of the Seventh Valencia International Meeting, pages 723-732, 2003. ISSN 08966273. doi: 10.1.1.18.3036.

David P. Wipf and Srikantan S. Nagarajan. A new view of automatic relevance determination. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1625-1632. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3372-a-new-view-of-automatic-relevance-determination.pdf.

Daniela M. Witten, Robert Tibshirani, and Trevor Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009. doi: 10.1093/biostatistics/kxp008.

Zhengrong Xing, Peter Carbonetto, and Matthew Stephens. Smoothing via adaptive shrinkage (smash): denoising Poisson and heteroskedastic Gaussian signals. arXiv preprint arXiv:1605.07787, 2016.

Dan Yang, Zongming Ma, and Andreas Buja. A sparse singular value decomposition method for high-dimensional data. Journal of Computational and Graphical Statistics, 23(4):923-942, 2014.

Shiwen Zhao, Barbara E Engelhardt, Sayan Mukherjee, and David B Dunson. Fast moment estimation for generalized latent Dirichlet models. Journal of the American Statistical Association, pages 1-13, 2018.

Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006. ISSN 1061-8600. doi: 10.1198/106186006X113430.