# Rich Component Analysis

Rong Ge (RONGGE@CS.DUKE.EDU), Duke University, Computer Science Department, 308 Research Dr, Durham NC 27708

James Zou (JAMESYZOU@GMAIL.COM), Microsoft Research, One Memorial Dr, Cambridge MA 02139

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

## Abstract

In many settings, we have multiple data sets (also called views) that capture different and overlapping aspects of the same phenomenon. We are often interested in finding patterns that are unique to one or to a subset of the views. For example, we might have one set of molecular observations and one set of physiological observations on the same group of individuals, and we want to quantify molecular patterns that are uncorrelated with physiology. Despite being a common problem, this is highly challenging when the correlations come from complex distributions. In this paper, we develop the general framework of Rich Component Analysis (RCA) to model settings where the observations from different views are driven by different sets of latent components, and each component can be a complex, high-dimensional distribution. We introduce algorithms based on cumulant extraction that provably learn each of the components without having to model the other components. We show how to integrate RCA with stochastic gradient descent into a meta-algorithm for learning general models, and demonstrate substantial improvements in accuracy on several synthetic and real datasets in both supervised and unsupervised tasks. Our method makes it possible to learn latent variable models when we don't have samples from the true model but only samples after complex perturbations.

## 1. Introduction

A hallmark of the modern data deluge is the prevalence of complex data that capture different aspects of some common phenomena. For example, for a set of patients, it's common to have multiple modalities of molecular measurements for each individual (gene expression, genotyping, etc.) as well as physiological attributes. Each set of measurements corresponds to a view on the samples. The complexity and heterogeneity of the data are such that it's often not feasible to build a joint model for all the data. Moreover, if we are particularly interested in one aspect of the problem (e.g., patterns that are specific to a subset of genes and not shared across all genes), it would be wasteful of computational and modeling resources to model the interactions across all the data.

More concretely, suppose we have two sets (views) of data, $U$ and $V$, on a common collection of samples. We model this as $U = S_1 + S_2$ and $V = A S_2 + S_3$, where $S_1$ captures the latent component specific to $U$, $S_3$ is specific to $V$, and $S_2$ is common to both $U$ and $V$ and is related in the two views by an unknown linear transformation $A$. Each component $S_i$ can be a complex, high-dimensional distribution. The observed samples from $U$ and $V$ are component-wise linear combinations of the unobserved samples from the $S_i$. To model all the data, we would need to jointly model all three $S_i$, which can have prohibitive sample/computation complexity and is also prone to model misspecification. Ideally, if we are only interested in the component that's unique to the first view, we would simply write down a model for $S_1$ without making any parametric assumptions about $S_2$ and $S_3$, except that they are independent.
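To make the two-view setup concrete, the following minimal sketch draws paired observations from the model $U = S_1 + S_2$, $V = AS_2 + S_3$. The specific component distributions and names below are illustrative assumptions of ours, not the paper's experimental setup.

```python
# A minimal sketch of the two-view generative process U = S1 + S2, V = A S2 + S3.
# The component distributions chosen here are illustrative placeholders only;
# RCA makes no parametric assumptions about S2 and S3 beyond independence.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 20000

A = rng.uniform(-1.0, 1.0, size=(d, d))      # unknown linear map between the views

s1 = rng.laplace(size=(n, d))                # component specific to U (hypothetical choice)
s2 = rng.uniform(-1.0, 1.0, size=(n, d))     # component shared by both views (non-Gaussian)
s3 = rng.uniform(-1.0, 1.0, size=(n, d))     # component specific to V

U = s1 + s2                                  # observed view 1: each row is s1 + s2
V = s2 @ A.T + s3                            # observed view 2: each row is A s2 + s3
```

Only the rows of `U` and `V` are treated as observable in what follows; the individual draws of `s1`, `s2`, `s3` are never available to the learner.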
In this paper, we develop a general framework of Rich Component Analysis (RCA) to explore such multi-component, multi-view datasets. Our framework allows for learning an arbitrarily complex model of a specific component of the data, $S_i$, without having to make parametric assumptions about the other components $S_j$. This allows the analyst to focus on the most salient aspect of the data analysis. The main conceptual contribution is the development of new algorithms to learn the parameters of complex distributions without any samples from that distribution. In the two-view example, we do not observe samples from our model of interest, $S_1$. Instead, the observations from $U$ are compositions of true samples from $S_1$ with complex signal from another process, $S_2$, which is shared with $V$. Our approach performs consistent parameter estimation of $S_1$ without modeling $S_2$ or $S_3$.

### 1.1. Related models

Understanding signals as compositions of different components or factors is a widely studied problem. Many previous works address different aspects of this problem through different modeling assumptions. RCA differs from previous works in the following important aspects:

1. Each component $S_i$ can be a very complicated multidimensional distribution.
2. It is possible to learn the parameters of a specific component without specifying a parametric model for any of the other components.
3. The number of components can be much larger than the number of observations.

Some previous approaches achieve a subset of these desirable properties; however, to the best of our knowledge, RCA is the only approach that has all three. Satisfying these three properties enables RCA to be more robust to model misspecification.

Independent component analysis (ICA) (Comon & Jutten, 2010) may appear similar to our model, but it is actually quite different. In ICA, let $s = [s_1, \ldots, s_n]$ be a vector of latent sources, where the $s_i$ are one-dimensional, independent, non-Gaussian random variables. There is an unknown mixing matrix $A$ and the observations are $x = As$. Given many samples $x^{(t)}$, the goal of ICA is to deconvolve and recover each sample $s_i$. In our setting, each $s_i$ can be a high-dimensional vector with complex correlations. It is information-theoretically not possible to deconvolve and recover the individual samples $s_i$. Instead we aim to learn the distribution $S_i(\theta)$ without having explicit samples from it. There are indeed algorithms for overcomplete ICA that work even when the number of components is much larger than the number of observations (e.g., (De Lathauwer et al., 2007)), but they still need to assume that the individual components are one-dimensional.

Another related model is canonical correlation analysis (CCA) (Hotelling, 1936). One way to interpret CCA is by the following generative model: there is a common signal $z \sim N(0, I)$ and view-specific signals $z^{(m)} \sim N(0, I)$. Each view $x^{(m)}$ is then sampled according to $N(A^{(m)} z + B^{(m)} z^{(m)}, \Sigma^{(m)})$, where $m$ indexes the view. CCA is equivalent to maximum likelihood estimation of $A^{(m)}$ in this generative model. In our framework, CCA corresponds to the very restricted setting where $S_1, S_2, S_3$ are all Gaussian. RCA learns $S_1$ without making such parametric assumptions about $S_2$ and $S_3$. Moreover, using CCA, it is not clear how to learn the distribution $S_1$ if it is not orthogonal to the shared subspace of $S_2$. In our experiments, we show that the naive approach of performing CCA (or kernel CCA) followed by taking the orthogonal projection leads to very poor performance.
Factor analysis (FA) (Harman, 1976) also corresponds to a multivariate Gaussian model, and hence does not address the general problem that we solve. In FA, latent variables are sampled $z \sim N(0, I)$ and the observations are $x \mid z \sim N(\mu + \Lambda z, \Psi)$.

It is also natural to consider a probabilistic model for all the components $S_1, S_2, S_3$ and perform learning by algorithms such as EM. When we have simple models for all the components, this might be feasible. However, in many cases $S_2$ and $S_3$ may either correspond to noise or have very complicated distributions, and we may not have enough knowledge about their distributions or enough sample/computation resources to model them. In these cases, the RCA model still allows us to get a reasonable estimate of the component that we are most interested in. See Section 5 for more discussion and experiments.

A different notion of contrastive learning was introduced in (Zou et al., 2013). They focused on settings where there are two mixture models with overlapping mixture components. The method there applies only to Latent Dirichlet Allocation and Hidden Markov Models and requires explicit parametric models for each component.

**Outline.** RCA consists of two stages: 1) from the observed data, extract all the cumulants of the component that we want to model; 2) using the cumulants, perform method-of-moments or maximum likelihood estimation (MLE) of the model parameters via polynomial approximations to gradient descent. We introduce the relevant properties of cumulants and tensors in Section 2. In Section 3, we develop the formal models for Rich Component Analysis (RCA) and the cumulant extraction algorithms. Section 4 shows how to integrate the extracted cumulants with method-of-moments or maximum likelihood inference. We test RCA on many data sets and show its performance gains in Section 5. All the proofs are in the Appendix.

## 2. Preliminaries

In this section we introduce the basics of cumulants. For more information please refer to Appendix A. Cumulants provide an alternative way to describe the correlations of random variables. Cumulants have a nice additive property: the cumulant of a sum of independent random variables is equal to the sum of the cumulants. More formally, for a random variable $X \in \mathbb{R}$, the cumulants are defined to be the coefficients of (the Taylor expansion of) the cumulant generating function $\log \mathbb{E}[e^{tX}]$.

We can also define cross-cumulants, which are cumulants of different variables (e.g., the covariance). For $t$ variables $X_1, \ldots, X_t$, their cross-cumulant can be computed using the following formula:
$$\kappa_t(X_1, \ldots, X_t) = \sum_{\pi} (|\pi| - 1)!\,(-1)^{|\pi| - 1} \prod_{B \in \pi} \mathbb{E}\Big[\prod_{i \in B} X_i\Big].$$
In this formula, $\pi$ is enumerated over all partitions of $[t]$, $|\pi|$ is the number of parts, and $B$ runs through the list of parts. We also write $\kappa_t(X) \equiv \kappa_t(X, \ldots, X)$ when all arguments are the same random variable.

We can similarly define cumulants for multivariate distributions. For a random vector $X \in \mathbb{R}^d$, the $t$-th order cumulant (and $t$-th order moment) is an object in $\mathbb{R}^{d^t}$ (a $t$-th order tensor). The $(i_1, \ldots, i_t)$-th coordinate of the cumulant tensor is $\kappa_t(X_{i_1}, X_{i_2}, \ldots, X_{i_t})$. We often unfold tensors into matrices. A tensor $T \in \mathbb{R}^{d^t}$ unfolds into the matrix $M = \mathrm{unfold}(T) \in \mathbb{R}^{d^{t-1} \times d}$ with $M_{(i_1, \ldots, i_{t-1}),\, i_t} = T_{i_1, \ldots, i_t}$.

Cumulants have several nice properties that we summarize below.

**Fact.** Suppose $X_1, \ldots, X_t$ are random variables in $\mathbb{R}^d$. The $t$-th order cumulant $\kappa_t(X_1, \ldots, X_t)$ is a tensor in $\mathbb{R}^{d^t}$ that has the following properties:

1. (Independence) If $(X_1, \ldots, X_t)$ and $(Y_1, \ldots, Y_t)$ are independent, then $\kappa_t(X_1 + Y_1, \ldots, X_t + Y_t) = \kappa_t(X_1, \ldots, X_t) + \kappa_t(Y_1, \ldots, Y_t)$.
2. (Linearity) $\kappa_t(c_1 X_1, \ldots, c_t X_t) = c_1 c_2 \cdots c_t\, \kappa_t(X_1, \ldots, X_t)$; more generally, we can apply linear transformations to multivariate cumulants (see Appendix A).
3. (Computation) The cumulant $\kappa_t(X_1, \ldots, X_t)$ can be computed in $O((td)^t)$ time.

The second-order cross-cumulant $\kappa_2(X, Y)$ is equal to the covariance $\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$. Higher cumulants measure higher-order correlations and also provide a measure of the deviation from Gaussianity. Cumulants and moments can be converted into each other using Faà di Bruno's formula. We estimate cumulants using k-statistics (Rose & Smith, 2002).

## 3. Rich Component Analysis

In this section, we show how to use cumulants to disentangle complex latent components. The key ideas and most applications of RCA are captured in the contrastive learning setting where there are two views. We introduce this model first and then extend it to general settings.

### 3.1. RCA for contrastive learning

Recall the example in the introduction where we have two views of the data; formally,
$$U = S_1 + S_2, \qquad V = A S_2 + S_3. \qquad (1)$$
Here, $S_1, S_2, S_3 \in \mathbb{R}^d$ are independent random variables that can have complicated distributions; $A \in \mathbb{R}^{d \times d}$ is an unknown linear transformation (we assume $A$ is square for simplicity; our algorithm works as long as $A$ has full column rank). The observations consist of pairs of samples $(u, v)$. Each pair is generated by drawing independent samples $s_i \sim S_i$, $i = 1, 2, 3$, and adding these samples component-wise to obtain $u = s_1 + s_2$ and $v = A s_2 + s_3$. Note that the same $s_2$ shows up in both $u$ and $v$, introducing correlation between the two views.

We are interested in learning properties of the $S_i$, for example their maximum likelihood (MLE) parameters. For concreteness, we focus our discussion on learning $S_1$, although our techniques also apply to $S_2$ and $S_3$. The main difficulty is that we don't have any samples from $S_1$. The observations of $U$ involve a potentially complicated perturbation by $S_2$. In fact, even if we know the distributions $S_1, S_2, S_3$ exactly, the true value of $s_1$ is often not determined because we only have two equations and three variables. Our hope is to remove the effect of the perturbation $S_2$ by utilizing the second view $V$, and we would like to do this without assuming a particular model for $S_2$ or $S_3$.

Note that the problem is inherently under-determined: it is impossible to find the means of $S_1, S_2, S_3$ without any additional information. This is in some sense the only ambiguity: as we will see, if we know the mean of one distribution, it is possible to extract all higher-order cumulants of $S_1, S_2, S_3$. For simplicity, throughout this section we assume the means of $S_1, S_2, S_3$ are 0 (given the mean of any one of $S_1, S_2, S_3$, we can always use the means of $U$ and $V$ to compute the means of the other distributions, and shift them to have mean 0).

**Determining the linear transformation.** First, we can find $A$ by the following formula:
$$A = \mathrm{unfold}(\kappa_4(V, U, U, U))^{\dagger}\, \mathrm{unfold}(\kappa_4(V, U, U, V)). \qquad (2)$$

**Lemma 3.1.** Suppose the unfolding of the 4th-order cumulant, $\mathrm{unfold}(\kappa_4(A S_2, S_2, S_2, S_2))$, has full rank. Then, given the exact cumulants $\kappa_4(V, U, U, U)$ and $\kappa_4(V, U, U, V)$, the above algorithm finds the correct linear transformation $A$ in time $O(d^5)$.

Intuitively, since only $S_2$ appears in both $U$ and $V$, the cross-cumulants $\kappa_4(V, U, U, U)$ and $\kappa_4(V, U, U, V)$ depend only on $S_2$. Also, by the linearity of cumulants we must have $\mathrm{unfold}(\kappa_4(V, U, U, V)) = \mathrm{unfold}(\kappa_4(V, U, U, U))\,A$ (see Appendix B.1). In the lemma we could have used 3rd-order cumulants; however, for many distributions (e.g., all symmetric distributions) the 3rd-order cumulant is 0.

Most distributions satisfy the condition that $\mathrm{unfold}(\kappa_4(A S_2, S_2, S_2, S_2))$ has full rank; the only natural distribution that does not satisfy this constraint is the Gaussian distribution (where $\kappa_4$ is 0).

Note: although the running time of this algorithm appears to be a large polynomial, in practice we can randomly select an $O(d) \times d$ submatrix of the two unfolded cumulants, and the running time becomes $O(d^3)$. The number of samples we need for real instances also appears low: even with 1000 samples we can often obtain reasonable estimates of $A \in \mathbb{R}^{30 \times 30}$ (see Section 5 for experiments).
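To illustrate Equation (2), here is a hedged sketch (our own code and naming, not the paper's implementation) that estimates $A$ from empirical 4th-order cross-cumulants of zero-mean observations, e.g., paired draws from model (1) centered to zero mean. For zero-mean variables, the 4th-order cross-cumulant is the 4th cross-moment minus the three products of pairwise covariances; the two unfolded cumulants are then related linearly through $A$ (up to a transpose that depends on the unfolding convention), so $A$ can be recovered by least squares.

```python
# Hedged sketch of Equation (2): recover A from unfolded 4th-order cross-cumulants.
# Assumes zero-mean observations (subtract empirical means first) and a shared
# component S2 with non-zero 4th-order cumulant (e.g. uniform); a Gaussian S2 would fail.
import numpy as np


def cross_cumulant4(X1, X2, X3, X4):
    """Empirical 4th-order cross-cumulant tensor (d x d x d x d) for zero-mean data."""
    n = X1.shape[0]

    def cov(Xa, Xb):
        return Xa.T @ Xb / n

    m4 = np.einsum('ni,nj,nk,nl->ijkl', X1, X2, X3, X4) / n
    return (m4
            - np.einsum('ij,kl->ijkl', cov(X1, X2), cov(X3, X4))
            - np.einsum('ik,jl->ijkl', cov(X1, X3), cov(X2, X4))
            - np.einsum('il,jk->ijkl', cov(X1, X4), cov(X2, X3)))


def estimate_A(U, V):
    """Estimate A from kappa_4(V,U,U,U) and kappa_4(V,U,U,V) via least squares."""
    d = U.shape[1]
    M1 = cross_cumulant4(V, U, U, U).reshape(d**3, d)   # rows: first three indices
    M2 = cross_cumulant4(V, U, U, V).reshape(d**3, d)   # columns: the last index
    # Under this reshape, multilinearity of cumulants gives M2 = M1 @ A.T,
    # so we solve the least-squares system for A.T and transpose.
    At, *_ = np.linalg.lstsq(M1, M2, rcond=None)
    return At.T


# Example usage (U, V are paired observation matrices with rows as samples):
# A_hat = estimate_A(U - U.mean(0), V - V.mean(0))
```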
**Extracting cumulants.** Even when the linear transformation $A$ is known, in most cases it is still information-theoretically impossible to find the values of the samples $s_1, s_2, s_3$, as we only have two views. However, we can still hope to learn useful information about the distributions $S_1, S_2, S_3$. In particular, we derive the following formulas to estimate the cumulants of the distributions:
$$\kappa_t(S_1) = \kappa_t(U) - \kappa_t(U, U, \ldots, U, A^{-1}V), \qquad (3)$$
$$\kappa_t(S_2) = \kappa_t(U, U, \ldots, U, A^{-1}V), \qquad (4)$$
$$\kappa_t(S_3) = \kappa_t(V) - \kappa_t(AU, V, V, \ldots, V). \qquad (5)$$

**Theorem 3.2.** For all $t > 1$, Equations (3)-(5) compute the $t$-th order cumulants of $S_1, S_2, S_3$ in time $O((td)^{t+2})$.

The proof of Theorem 3.2 relies on the fact that, since only $S_2$ appears in both $U$ and $V$, the cross-cumulant $\kappa_t(U, U, \ldots, U, A^{-1}V)$ captures the cumulant of $S_2$. Moreover, by independence, $\kappa_t(U) = \kappa_t(S_1) + \kappa_t(S_2)$, so we can recover $\kappa_t(S_1)$ by subtracting off the estimated $\kappa_t(S_2)$ (and similarly for $\kappa_t(S_3)$). When the dimension of $U$ is smaller than the dimension of $V$ and $A \in \mathbb{R}^{d_V \times d_U}$ has full column rank, the above formulas with the pseudo-inverse $A^{\dagger}$ in place of $A^{-1}$ still recover all the cumulants.

In Appendix B.1, we prove that both the formula for computing $A$ and the formulas for extracting the cumulants are robust to noise. In particular, we give the sample complexity for learning $A$ and $\kappa_t(S_1)$ from samples of $U$ and $V$; both are polynomial in the relevant quantities. Note that the exponential dependence on $t$ is necessary, because even storing the tensor requires $d^t$ time. Most follow-up algorithms only require low-order tensors (e.g., $t = 3$).

Given $\kappa_t(S_1)$, we can use standard algorithms to compute the moments of $S_1$. Many learning algorithms are based on the method of moments and can be directly applied (see Section 4.1). Other optimization-based algorithms can also be adapted (Section 4.2).
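Equations (3)-(4) are simple to instantiate for low orders. Below is a hedged sketch (again our own code, not the paper's) of the $t = 2$ and $t = 3$ cases for zero-mean data with a known (or previously estimated) invertible $A$; at these orders the cross-cumulants coincide with the corresponding cross-moments.

```python
# Hedged sketch of Equations (3)-(4) for t = 2 and t = 3, zero-mean data, known A.
import numpy as np


def extract_S1_cumulants(U, V, A):
    """Return estimates of kappa_2(S1) and kappa_3(S1) from paired samples of U, V."""
    n = U.shape[0]
    W = V @ np.linalg.inv(A).T              # rows are A^{-1} v = s2 + A^{-1} s3

    # t = 2: kappa_2(S1) = kappa_2(U) - kappa_2(U, A^{-1} V); only S2 is shared.
    k2_U = U.T @ U / n
    k2_shared = U.T @ W / n
    k2_S1 = k2_U - 0.5 * (k2_shared + k2_shared.T)      # symmetrize the estimate

    # t = 3: kappa_3(S1) = kappa_3(U) - kappa_3(U, U, A^{-1} V).
    k3_U = np.einsum('ni,nj,nk->ijk', U, U, U) / n
    k3_shared = np.einsum('ni,nj,nk->ijk', U, U, W) / n
    k3_S1 = k3_U - k3_shared
    return k2_S1, k3_S1
```

In the experiments of Section 5, contrastive PCA takes the top eigenvector of the extracted second-order cumulant, and contrastive regression uses it in place of $\mathbb{E}[XX^\top]$ in the least-squares formula of Section 4.1.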
### 3.2. General model of Rich Component Analysis

We can extend the cumulant extraction algorithm from contrastive learning to general settings with more views and components. The ideas are very similar, but the algorithm is more technical in order to keep track of all the components. We present the intuition and the main results here and defer the details to Appendix B.2.

Consider a set of observations $U_1, U_2, \ldots, U_k \in \mathbb{R}^d$, each of which is linearly related to a subset of the variables $S_1, S_2, \ldots, S_p \in \mathbb{R}^d$; the variable $S_j$ appears in a subset $Q_j \subseteq [k]$ of the observations. That is, for all $i \in [k]$,
$$U_i = \sum_{j=1}^{p} A^{(i,j)} S_j, \qquad (6)$$
where the $A^{(i,j)} \in \mathbb{R}^{d \times d}$ are unknown linear transformations, and $A^{(i,j)} = 0$ if $i \notin Q_j$. For simplicity we assume all the linear transformations are invertible. The variable $S_j$ models the latent source of signal that is common to the subset of observations $\{U_i \mid i \in Q_j\}$. The matrix $A^{(i,j)}$ models the transformation of the latent signal $S_j$ in view $i$.

In order for the model to be identifiable, it is necessary that all the subsets $Q_j$ are distinct (otherwise latent sources with identical $Q_j$ can be collapsed into one $S_j$). In the most general setting, we have a latent signal that is uniquely associated with every subset of observations. In this case, $p = 2^k - 1$ and $\{Q_j\}$ corresponds to all the non-empty subsets of $[k]$. In some settings, only specific subsets of views $U_i$ share common signals, and $\{Q_j\}$ can be a small collection. We measure the complexity of the set system using the following notion:

**Definition 3.1** ($L$-distinguishable). We say a set system $\{Q_j\}$ is $L$-distinguishable if for every set $Q_j$ there exists a subset $T \subseteq Q_j$ of size at most $L$ (called the distinguishing set) such that for any other set $Q_{j'}$ ($j' \neq j$), either $Q_j \subseteq Q_{j'}$ or $T \not\subseteq Q_{j'}$.

For example, the set system of the contrastive model is $\{\{1\}, \{1,2\}, \{2\}\}$ and it is 2-distinguishable. Intuitively, for any set $Q_j$ in the set system, there is a subset $T$ of size at most $L$ that distinguishes $Q_j$ from all the other sets (except the supersets of $Q_j$). (A small brute-force check of this definition is sketched at the end of this subsection.)

We use Algorithm 1 to recover all the linear transformations $A^{(i,j)}$ (for more details of the algorithm, see Algorithm 2 in the Appendix). Algorithm 1 takes as input a set system $\{Q_j\}$ that captures our prior belief about how the datasets are related. When we don't have any prior belief, we can input the most general $\{Q_j\}$ of size $2^k - 1$, which is $k$-distinguishable. The algorithm automatically determines whether a given variable $S_j = 0$. In the algorithm, $\min(Q_j)$ is the smallest element of $Q_j$.

**Algorithm 1** (Find Linear Transformations). Require: a set system $\{Q_j\}$ that is $L$-distinguishable and the $(L+1)$-th order cumulants. Repeat until all sets are processed:

1. Pick a set $Q_j$ that is not a subset of any remaining set.
2. Let $T = \{w_1, w_2, \ldots, w_L\}$ be the distinguishing set for $Q_j$.
3. Compute the unfolded cumulants for all $i \in Q_j$: $M_i = \mathrm{unfold}(\kappa_{L+1}(U_{w_1}, \ldots, U_{w_L}, U_i))$.
4. If $M_{\min Q_j} = 0$, set $S_j = 0$ and continue the loop.
5. Let $A^{(i,j)} = M_{\min Q_j}^{\dagger} M_i$ for all $i \in Q_j$, and $A^{(i,j)} = 0$ for all $i \notin Q_j$.
6. Mark $Q_j$ as processed and subtract all of its cumulants.

**Lemma 3.3.** Given observations $U_i$ as defined in Equation (6), suppose the sets $Q_j$ are $L$-distinguishable, all the unknown linear transformations $A^{(i,j)}$ are invertible, and each unfolding $\mathrm{unfold}(\kappa_{L+1}(S_j))$ is either 0 (if $S_j = 0$) or has full rank. Then, given the exact $(L+1)$-th order cumulants, Algorithm 1 outputs all the correct linear transformations $A^{(i,j)}$ in time $\mathrm{poly}(L!, (dk)^L)$.

Once all the linear transformations $A^{(i,j)}$ are recovered, we follow the same strategy as in the contrastive analysis case in Section 3.1.

**Theorem 3.4.** Under the same assumptions as Lemma 3.3, for any $t \geq L$, Algorithm 3 computes the correct $t$-th order cumulants of all the variables in time $\mathrm{poly}((L+t)!, (dk)^{L+t})$.

Note that in the most general case it is impossible to find cumulants of order $t < L$, because there can be many different variables $S_j$ but not enough views. Both Algorithms 1 and 3 are robust to noise, with sample complexity that depends polynomially on the relevant condition numbers and exponentially on the order of the cumulant considered. For more details see Appendix B.2.
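As referenced after Definition 3.1, the following small sketch (a hypothetical helper of ours, not part of the paper's algorithms) checks $L$-distinguishability of a set system by brute force and finds a distinguishing set for each $Q_j$.

```python
# Brute-force check of Definition 3.1 (our own illustrative helper).
# {Q_j} is L-distinguishable if every Q_j has a subset T with |T| <= L such that
# every other Q_j' either contains Q_j, or does not contain T.
from itertools import combinations


def distinguishing_set(Q, others, L):
    """Return a subset T of Q with |T| <= L that distinguishes Q, or None."""
    for size in range(1, L + 1):
        for T in map(set, combinations(sorted(Q), size)):
            if all(Q <= other or not T <= other for other in others):
                return T
    return None


def is_L_distinguishable(sets, L):
    """Check Definition 3.1 for a list of distinct, non-empty sets."""
    sets = [set(Q) for Q in sets]
    return all(
        distinguishing_set(Q, [S for i, S in enumerate(sets) if i != j], L) is not None
        for j, Q in enumerate(sets)
    )


# The contrastive model's set system {{1}, {1,2}, {2}} is 2- but not 1-distinguishable:
print(is_L_distinguishable([{1}, {1, 2}, {2}], L=2))   # True
print(is_L_distinguishable([{1}, {1, 2}, {2}], L=1))   # False
```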
## 4. Using Cumulants in Learning Applications

The cumulant extraction techniques of Section 3 construct unbiased estimators for the cumulants of $S_i$. In this section we show how to use the estimated cumulants/moments to perform maximum likelihood learning of $S_i$. For concreteness, we frame the discussion in the contrastive learning setting, where we want to learn $S_1$. For general RCA, the method works when $L$ (see Definition 3.1) is small or when the distributions have a specific relationship between lower and higher cumulants.

### 4.1. Method-of-Moments

RCA recovers the cumulants of $S_1$, from which we can construct all the moments of $S_1$ in time $O((td)^t)$. This makes it possible to directly combine RCA with any estimation algorithm based on the method of moments.

The method of moments has numerous applications in machine learning. The simplest (and most commonly used) example is arguably principal component analysis, where we want to find the maximum-variance directions of $S_1$. This depends only on the covariance matrix $\mathbb{E}[S_1 S_1^\top]$. RCA removes the covariance due to $S_2$ and constructs an unbiased estimator of $\mathbb{E}[S_1 S_1^\top]$, from which we can extract the top eigenspace.

The next simplest model is least squares regression (LSR). Suppose the distribution $S_1$ contains samples and labels $(X, Y) \in \mathbb{R}^d \times \mathbb{R}$, and only the samples are corrupted by perturbations, i.e., $Y$ is independent of $S_2$. LSR tries to find a parameter $\beta$ that minimizes $\mathbb{E}[(Y - \beta^\top X)^2]$. The optimal solution again depends only on the moments of $(X, Y)$: $\beta^* = (\mathbb{E}[XX^\top])^{-1} \mathbb{E}[YX]$. Using the second-order cumulants/moments extracted by RCA, we can efficiently estimate $\beta^*$.

Method-of-moments estimators, especially together with tensor decomposition algorithms, have been successfully applied to learning many latent variable models, including mixtures of Gaussians (GMM), Hidden Markov Models, Latent Dirichlet Allocation and many others (see (Anandkumar et al., 2014)). RCA can be used in conjunction with all these methods. We'll consider learning a GMM in Section 5.

### 4.2. Approximating Gradients

There are many machine learning models where it's not clear how to apply the method of moments. Gradient descent (GD) and stochastic gradient descent (SGD) are general-purpose techniques for parameter estimation across many models. Here we show how to combine RCA with gradient descent. The key idea is that the extracted cumulants/moments of $S_1$ form a polynomial basis. If the gradient of the log-likelihood can be approximated by a low-degree polynomial in $S_1$, then the extracted cumulants from RCA can be used to approximate this gradient.

Consider the general setting where we have a model $\mathcal{D}$ with parameter $\theta$, and for any sample $s_1$ the likelihood is $L(\theta, s_1)$. The maximum likelihood estimator tries to find the parameter that maximizes the likelihood of the observed samples: $\theta^* = \arg\max_\theta \mathbb{E}[\log L(\theta, s_1)]$. In many applications, this is solved using stochastic gradient descent, where we pick a random sample and move the current guess in the corresponding gradient direction: $\theta^{(t+1)} = \theta^{(t)} + \eta_t \nabla_\theta \log L(\theta^{(t)}, s_1^{(t)})$, where $\eta_t$ is a step size and $s_1^{(t)}$ is the $t$-th sample. For convex functions this is known to converge to the optimal solution (Shalev-Shwartz et al., 2009). Even for non-convex functions it is often used as a heuristic.

If the gradient of the log-likelihood, $\nabla_\theta \log L(\theta, s_1)$, is a low-degree polynomial in $s_1$, then using the lower-order moments we can obtain an unbiased estimator of $\mathbb{E}[\nabla_\theta \log L(\theta, S_1)]$ with bounded variance, which is sufficient for stochastic gradient descent to work. This is the case for linear least-squares regression and its regularized forms using either an $\ell_1$ or $\ell_2$ regularizer. When the log-likelihood is not a low-degree polynomial in $S_1$, we approximate the gradient by a low-degree polynomial, either through a simple Taylor expansion or through other polynomial approximations (e.g., Chebyshev polynomials; see (Powell, 1981)).
This will give us a biased estimator of the gradient, whose bias decreases as the degree of the approximation increases. In general, when the (negative) log-likelihood function is strongly convex, we can still hope to find an approximate solution:

**Lemma 4.1.** Suppose the negative log-likelihood function $F(\theta) = -\mathbb{E}[\log L(\theta, S_1)]$ is $\mu$-strongly convex and $H$-smooth. Given an estimator $G(\theta)$ of the gradient such that $\|G(\theta) - \nabla F(\theta)\| \leq \epsilon$, gradient descent using $G(\theta)$ with step size $\frac{1}{2H}$ converges to a solution $\theta$ such that $\|\theta - \theta^*\|^2 \leq 8\epsilon^2/\mu^2$.

When high-degree polynomials are needed to approximate the gradient, our algorithm requires a number of samples that grows exponentially in the degree.

**Logistic regression.** We give a specific example to illustrate using RCA and low-degree polynomials to simulate gradient descent. Consider the basic logistic regression setting, where the samples are $s_1 = (x, y) \in \mathbb{R}^d \times \{0, 1\}$ and the log-likelihood function is
$$\log L(\theta, s_1) = \log \frac{e^{y\theta^\top x}}{1 + e^{\theta^\top x}}.$$
We can then approximate the function $\frac{e^{\theta^\top x}}{1 + e^{\theta^\top x}}$ by a low-degree polynomial in $\theta^\top x$. As an example, using the 3rd-degree Chebyshev approximation $\frac{e^{\theta^\top x}}{1 + e^{\theta^\top x}} \approx 0.5 + 0.245\,\theta^\top x - 0.014\,(\theta^\top x)^3$, the gradient we take in each step is
$$\mathbb{E}[\nabla_\theta \log L(\theta, S_1)] \approx \mathbb{E}[YX] - 0.5\,\mathbb{E}[X] - 0.245\,\mathbb{E}[X(\theta^\top X)] + 0.014\,\mathbb{E}[X(\theta^\top X)^3].$$
To estimate this approximation, we only need the quadratic term $\mathbb{E}[X(\theta^\top X)]$ and a projection of the 4th-order moment, $\mathbb{E}[X(\theta^\top X)^3]$. These terms are computed from the projected 2nd- and 4th-order cumulants of $X$, which are extracted from the cumulants of $U$ and $V$ via Section 3. Even though these quantities involve high-degree moments, they can be estimated in time linear in the number of samples because they are lower-dimensional projections.
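To make the moment-based gradient concrete, here is a hedged sketch (our own naming and structure, not the paper's code) of the 3rd-degree Chebyshev surrogate gradient above. It assumes the needed moments of $S_1 = X$ have already been reconstructed from the cumulants extracted in Section 3: the mean $\mathbb{E}[X]$, the second moment $\mathbb{E}[XX^\top]$, the 4th-order moment tensor $\mathbb{E}[X \otimes X \otimes X \otimes X]$, and $\mathbb{E}[YX]$ (which can be estimated directly from $(Y, U)$ when $S_2$ has zero mean and is independent of $Y$).

```python
# Hedged sketch of the Chebyshev-approximated logistic gradient from moments of S1.
import numpy as np


def approx_loglik_gradient(theta, m1, m2, m4, e_yx):
    """E[grad log L] ~ E[YX] - 0.5 E[X] - 0.245 E[X (th.X)] + 0.014 E[X (th.X)^3]."""
    term2 = m2 @ theta                                           # E[X (theta^T X)]
    term4 = np.einsum('ijkl,j,k,l->i', m4, theta, theta, theta)  # E[X (theta^T X)^3]
    return e_yx - 0.5 * m1 - 0.245 * term2 + 0.014 * term4


def fit_logistic_from_moments(m1, m2, m4, e_yx, lr=0.1, steps=500):
    """Gradient ascent on the polynomial surrogate of the expected log-likelihood."""
    theta = np.zeros_like(m1)
    for _ in range(steps):
        theta = theta + lr * approx_loglik_gradient(theta, m1, m2, m4, e_yx)
    return theta
```

Forming the full $d \times d \times d \times d$ moment tensor here is only for clarity; as noted above, in practice one would estimate the projected quantities $\mathbb{E}[X(\theta^\top X)]$ and $\mathbb{E}[X(\theta^\top X)^3]$ directly, in time linear in the number of samples.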
## 5. Experiments

In the experiments, we focus on the contrastive learning setting where we are given observations of $U = S_1 + S_2$ and $V = A S_2 + S_3$. The goal is to estimate the parameters of the $S_1$ distribution. Our approach can also learn the shared component $S_2$ as well as $S_3$. We tested our method in five settings, where $S_1$ corresponds to: a low-rank Gaussian (PCA), linear regression, a mixture of Gaussians (GMM), logistic regression, and the Ising model. The first three settings illustrate combining RCA with the method of moments, and the latter two require RCA with polynomial approximations to stochastic gradient descent. In each setting, we compared the following four algorithms:

1. The standard learning algorithm using the actual samples $s_1 \sim S_1(\theta)$ to learn the parameters $\theta$. This is the gold standard, denoted "true samples".
2. Our contrastive RCA algorithm using paired samples from $U$ and $V$ to learn $S_1(\theta)$.
3. The naive approach that ignores $S_2$ and uses $U$ to learn $S_1(\theta)$ directly, denoted "naive".
4. First perform canonical correlation analysis (CCA) on $U$ and $V$, and project the samples from $U$ onto the subspace orthogonal to the canonical correlation subspace. Then learn $S_1$ from the projected samples of $U$. We denote this "CCA".

In all five settings, we let $S_3$ be sampled uniformly from $[-1, 1]^d$, where $d$ is the dimension of $S_3$. The empirical results are robust to other choices of $S_3$ that we have tried, e.g., a multivariate Gaussian or a mixture of Gaussians.

**Contrastive PCA.** $S_1$ was set to have a principal component along direction $v_1$, i.e., $s_1 \sim N(0, v_1 v_1^\top + \sigma^2 I)$. $S_2$ was sampled from $\mathrm{Unif}([-1,1]^d) + v_2 v_2^\top$, and $v_1, v_2$ are random unit vectors in $\mathbb{R}^d$. RCA constructs an unbiased estimator of $\mathbb{E}[S_1 S_1^\top]$ from the samples of $U$ and $V$. We then report the top eigenvector of this estimator as the estimate $\hat v_1$. We evaluate each algorithm by the mean squared error (MSE) between the inferred $\hat v_1$ and the true $v_1$.

**Contrastive regression.** $S_1$ is the uniform distribution, $s_1 \sim \mathrm{Unif}([-1,1]^d)$, and $y = \beta^\top s_1 + N(0, 1)$. $S_2$ was sampled from $\mathrm{Unif}([-1,1]^d) + v_2 v_2^\top$, and $\beta, v_2$ are random unit vectors in $\mathbb{R}^d$. Our approach gives an unbiased estimator of $\mathbb{E}[S_1 S_1^\top]$, from which we estimate $\hat\beta = (\mathbb{E}[S_1 S_1^\top])^{-1} \mathbb{E}[Y S_1]$. All algorithms are evaluated by the MSE between the inferred $\hat\beta$ and the true $\beta$.

**Contrastive mixture of Gaussians.** $S_1$ is a mixture of $d$ spherical Gaussians in $\mathbb{R}^d$, $s_1 \sim \sum_{k=1}^{d} \frac{1}{d} N(\mu^{(1)}_k, \sigma^2)$. $S_2$ is also a mixture of spherical Gaussians, $s_2 \sim \sum_{k=1}^{d} \frac{1}{d} N(\mu^{(2)}_k, \sigma^2)$. RCA gives unbiased estimators of the third-order moment tensor $\mathbb{E}[s_1 \otimes s_1 \otimes s_1]$. We then use the estimator in (Hsu & Kakade, 2013) to get a low-rank tensor whose components correspond to the center vectors, and apply alternating minimization (see (Kolda & Bader, 2009)) to infer $\hat\mu^{(1)}_k$. Algorithms are evaluated by the MSE between the inferred centers $\{\hat\mu^{(1)}_k\}$ and the true centers $\{\mu^{(1)}_k\}$.

**Contrastive logistic regression.** Let $s_1 \sim \mathrm{Unif}([-1,1]^d)$ and $y = 1$ with probability $\frac{1}{1 + e^{-\beta^\top s_1}}$. $S_2$ was sampled from $\mathrm{Unif}([-1,1]^d) + v_2 v_2^\top$, and $\beta, v_2$ are unit vectors in $\mathbb{R}^d$. We use the 4th-order Chebyshev polynomial approximation to the SGD of logistic regression as in Section 4.2. Evaluation is the MSE between the inferred $\hat\beta$ and the true $\beta$.

*Figure 1.* All y-axes indicate mean squared error (MSE). (a-e) show the tradeoff between sample size and MSE for the four algorithms in each of the five applications. (f, g) show the convergence rate of SGD for the logistic and Ising models. (h-j) show the tradeoff between perturbation strength and MSE. (k) shows the inference accuracy of $A$. Error bars correspond to one standard deviation.

**Contrastive Ising model.** Let $S_1$ be a mean-zero Ising model on a $d$-by-$d$ grid with periodic boundary conditions. Each of the $d^2$ vertices is connected to four neighbors and can take values in $\{\pm 1\}$. The edge between vertices $i$ and $j$ is associated with a coupling $J_{ij} \sim \mathrm{Unif}[-1, 1]$. The state of the Ising model, $s_1$, has probability $\frac{1}{Z} \exp\big(\sum_{(i,j) \in E} J_{ij} s_1(i) s_1(j)\big)$, where $Z$ is the partition function. We let $S_2$ also be a $d$-by-$d$ grid of spins where half of the spins are independent Bernoulli random variables and the other half are correlated, i.e., they are all $+1$ or all $-1$ with probability 0.5. We use composite likelihood to estimate the couplings $J_{ij}$ of $S_1$, which is asymptotically consistent (Varin et al., 2011). For the gold-standard baseline (which uses the true samples $s_1$), we use the exact gradient of the composite likelihood. For RCA, we used the 4th-order Taylor approximation to the gradient. Evaluation is the MSE between the true $J_{ij}$ and the estimated $\hat J_{ij}$.

**Results.** For the method-of-moments applications (PCA, linear regression, GMM) we used 10-dimensional samples for $U$ and $V$. The tradeoff between inference accuracy (measured in MSE) and sample size is shown in the top row of Figure 1. Even with just 100 samples, RCA performs significantly better than the naive approach and CCA. With 1000 samples, the accuracy of RCA approaches that of the algorithm using the true samples from $S_1$. It is interesting to note that projecting onto the subspace orthogonal to the canonical correlation subspace can perform much worse than even the naive algorithm. In the linear regression setting, for example, when the signal of $S_2$ happens to align with $\beta$, the direction of prediction, projecting onto the subspace orthogonal to $S_2$ loses much of the predictive signal.
In the SGD settings, we used a 10-dimensional logistic model and a 5-by-5 Ising model (50 $J_{ij}$ parameters to infer). RCA also performed substantially better than the two benchmarks (Figure 1 d, e). In all cases, the accuracy of RCA improved monotonically with increasing sample size. This was not the case for the naive and CCA algorithms, which were unable to take advantage of larger data due to model misspecification. In Figure 1 f and g, we plot the learning trajectory of RCA over the SGD steps for representative runs of the algorithm with 1000 samples. RCA converges to its final state at a rate similar to the true-sample case. The residual error of RCA is due to the bias introduced by approximating the sigmoid with a low-degree polynomial. When many samples are available, a higher-degree polynomial approximation can be used to reduce this bias.

We also explored how the algorithms perform as the magnitude of the signal in $S_2$ is increased relative to $S_1$ (Figure 1 h-j), with a fixed 1000 samples. In these plots the x-axis measures the ratio of the standard deviations of $S_2$ and $S_1$. When this ratio is close to 0, most of the signal in $U$ comes from $S_1$, and all the algorithms are fairly accurate. As the strength of the perturbation increases, RCA performs significantly better than the benchmarks, especially in the Ising model.

Finally, we empirically explored the sample complexity of the subroutine that recovers the $A$ matrix from the 4th-order cumulants. Figure 1 k shows the MSE between the true $A$ (sampled from $\mathrm{Unif}[-1,1]^{d \times d}$) and the inferred $\hat A$ as a function of the sample size. Even with 1000 samples, we can obtain reasonable estimates of $A \in \mathbb{R}^{30 \times 30}$.

**Comparisons to joint modeling and EM.** An advantage of the RCA approach is that we only need to model the part of the data-generating process that we are interested in (or able to model): if we are interested in the process that is unique to $U$, then we only need to write down a model for $S_1$ without having to model $S_2$ and $S_3$. This not only reduces the risk of model misspecification but can also reduce inference complexity. An alternative is to specify a fully parametrized joint model for $S_1, S_2, S_3$ and $A$, and perform maximum likelihood inference on all the parameters given $U, V$. This is only possible if we are able to explicitly model $S_2$ and $S_3$.

To test the performance of a joint model, we consider a friendly setting where we are given the exact parametric forms of $S_1(\theta_1), S_2(\theta_2), S_3(\theta_3)$ and $A$ is the identity matrix. We used the same datasets as in the mixture of Gaussians experiments: $S_1$ and $S_2$ are both mixtures of 10 spherical Gaussians in $\mathbb{R}^{10}$ and $S_3 \sim \mathrm{Unif}[0,1]^{10}$. The Gaussians all have known variance 1, and the goal is to infer the means of the mixture components of $S_1$ from $U, V$. We used EM to perform inference in the joint model of $S_1, S_2, S_3$. With 500 samples, the joint model achieved an MSE of 0.016 (s.e. 0.0007), which is 88% larger than the MSE of 0.0085 achieved by RCA. Even with 1000 samples, the joint model still performed worse: MSE = 0.015 (s.e. 0.0005). While the joint model outperformed the naive algorithm (MSE = 0.02, see Figure 1c), it is less accurate than RCA even when the models are correctly specified. This could be due to the fact that the joint model has a large number of components and interactions which need to be inferred, using approximations such as EM, from a small dataset.

**Biomarkers experiment.** We applied RCA to a real dataset of DNA methylation biomarkers.
Twenty biomarkers (10 test and 10 control) measured the DNA methylation level (a real number between 0 and 1) at twenty genomic loci across 686 individuals (Zou et al., 2014). Each individual was associated with a binary disease status $Y$. Logistic regression on the ten test biomarkers was used to determine the weight vector $\beta$, which quantifies the contribution of the methylation at each of these ten loci to the disease risk. The other ten independent loci are control markers. Getting accurate estimates of the values of $\beta$ is important for understanding the biological roles of these loci.

In this dataset, all the samples were measured on one platform, leading to a relatively accurate estimate of $\beta$. In many cases, samples are collected from multiple facilities (or by different labs). We simulated this within our RCA framework. We let $S_1$ be the original data matrix of the ten test markers across the 686 samples. We let $S_3$ be the original data matrix of the ten control markers in these same samples. We modeled $S_2$ as a mixture model, where samples are randomly assigned to different components that capture lab-specific biases. The perturbed observations are $U = S_1 + S_2$ and $V = A S_2 + S_3$, i.e., $U$ and $V$ simulate the measurements for the test and control markers, respectively, when the true signal has been perturbed by this mixture distribution of lab biases. We assume that we can only access $U$ and $V$ and do not know $S_2$, i.e., where each sample was generated.

Running logistic regression directly on $U$ and the phenotype $Y$ obtained an MSE of 0.24 (std 0.03) between the inferred $\hat\beta$ and the true $\beta$ measured from directly regressing $S_1$ on $Y$. Directly using CCA also introduces significant errors, with an MSE of 0.25 (std 0.02). Using all the control markers as covariates in the logistic regression, the MSE of the test markers' $\beta$ was 0.14 (std 0.03). In general, adding $V$ as covariates to the regression can eliminate $S_2$ at the expense of adding $S_3$, and can reduce accuracy when $S_3$ is larger than $S_2$. Using our RCA logistic regression on $U$ and $V$, we obtained significantly more accurate estimates of $\beta$, with an MSE of 0.1 (std 0.03). See the Appendix for more analysis of this experiment.

## Discussion

We proposed the framework of Rich Component Analysis in this paper. RCA generalizes the widely used Independent Component Analysis/Factor Analysis models in several directions: 1) each latent source is allowed to be a high-dimensional distribution, and 2) there can be many more latent sources than there are observations. These two extensions make it impossible to deconvolve and recover the samples from each independent source. Our approach, based on cumulant extraction and approximate SGD, provides a theoretically sound and practical solution. Additionally, it allows the algorithm to recover a particularly interesting component without modeling all the other components, which can be very complex. These properties make RCA more robust to model misspecification. We demonstrate the accuracy of RCA on several simulations and on the real problem of biomarker discovery. RCA performs better than the most natural competitors.

## References

Anandkumar, Animashree, Ge, Rong, Hsu, Daniel, Kakade, Sham M., and Telgarsky, Matus. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773-2832, 2014.

Arora, Sanjeev, Ge, Rong, Ma, Tengyu, and Moitra, Ankur. Simple, efficient, neural algorithms for dictionary learning. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015.
Bulinskii, A. V. Bounds for mixed cumulants and higher-order covariances of bounded random variables. Theory of Probability & Its Applications, 19(4):835-839, 1975.

Comon, Pierre and Jutten, Christian. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.

De Lathauwer, L., Castaing, J., and Cardoso, J.-F. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965-2973, 2007.

Harman, Harry H. Modern Factor Analysis. University of Chicago Press, 1976.

Hotelling, Harold. Relations between two sets of variates. Biometrika, pp. 321-377, 1936.

Hsu, Daniel and Kakade, Sham M. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pp. 11-20. ACM, 2013.

Kenney, John Francis and Keeping, Ernest Sydney. Mathematics of Statistics, Part I. 1954.

Kolda, Tamara G. and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

Powell, Michael James David. Approximation Theory and Methods. Cambridge University Press, 1981.

Rose, Colin and Smith, Murray D. Mathematical Statistics with Mathematica, volume 1. Springer New York, 2002.

Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Stochastic convex optimization. In Proceedings of the Conference on Learning Theory (COLT), 2009.

Varin, Cristiano, Reid, Nancy, and Firth, David. An overview of composite likelihood methods. Statistica Sinica, 21:5-42, 2011.

Zou, James, Hsu, Daniel, Parkes, David, and Adams, Ryan. Contrastive learning using spectral methods. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2013.

Zou, James, Lippert, Christoph, Heckerman, David, Aryee, Martin, and Listgarten, Jennifer. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, 11(3), 2014.