# Stay on path: PCA along graph paths

Megasthenis Asteris MEGAS@UTEXAS.EDU
Anastasios Kyrillidis ANASTASIOS@UTEXAS.EDU
Alexandros G. Dimakis DIMAKIS@AUSTIN.UTEXAS.EDU
Department of Electrical and Computer Engineering, The University of Texas at Austin

Han-Gyol Yi GYOL@UTEXAS.EDU
Bharath Chandrasekaran BCHANDRA@AUSTIN.UTEXAS.EDU
Department of Communication Sciences & Disorders, The University of Texas at Austin

Abstract

We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph G on p vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in G. From a statistical perspective, information on the underlying network may potentially reduce the number of observations required to recover the population principal component. We consider the canonical estimator which optimally exploits the prior knowledge by solving a non-convex quadratic maximization on the empirical covariance. We introduce a simple network model and analyze the estimator under the spiked covariance model. We show that side information potentially improves the statistical complexity. We propose two algorithms to approximate the solution of the constrained quadratic maximization and recover a component with the desired properties. We empirically evaluate our schemes on synthetic and real datasets.

1. Introduction

Principal Component Analysis (PCA) is an invaluable tool in data analysis and machine learning. Given a set of $n$ centered $p$-dimensional datapoints $Y \in \mathbb{R}^{p \times n}$, the first principal component is

$$\arg\max_{\|x\|_2 = 1} \; x^\top \widehat{\Sigma}\, x, \qquad (1)$$

where $\widehat{\Sigma} = \tfrac{1}{n} Y Y^\top$ is the empirical covariance matrix. The principal component spans the direction of maximum data variability. This direction usually involves all $p$ variables of the ambient space; in other words, the PC vectors are typically non-sparse. However, it is often desirable to obtain a principal component with specific structure, for example by limiting the support of non-zero entries. From a statistical viewpoint, in the high-dimensional regime $n = O(p)$, recovery of the true (population) principal component is only possible if additional structural information, such as sparsity, is available about it (Amini & Wainwright, 2009; Vu & Lei, 2012).

There are several approaches for extracting a sparse principal component. Many rely on approximating the solution to

$$\max_{x \in \mathbb{R}^p} \; x^\top \widehat{\Sigma}\, x, \quad \text{subject to } \|x\|_2 = 1,\ \|x\|_0 \le k. \qquad (2)$$

This non-convex quadratic optimization is NP-hard (by a reduction from the maximum clique problem), but optimally exploits the side information on the sparsity.

Graph Path PCA. In this paper we enforce additional structure on the support of principal components. Consider a directed acyclic graph (DAG) $G = (V, E)$ on $p$ vertices. Let S and T be two additional special vertices and consider all simple paths from S to T on the graph $G$. Ignoring the order of vertices along a path, let $\mathcal{P}(G)$ denote the collection of all S-T paths in $G$. We seek the principal component supported on a path of $G$, i.e., the solution to

$$\max_{x \in \mathcal{X}(G)} \; x^\top \widehat{\Sigma}\, x, \qquad (3)$$

where

$$\mathcal{X}(G) \;\triangleq\; \big\{ x \in \mathbb{R}^p : \|x\|_2 = 1,\ \mathrm{supp}(x) \in \mathcal{P}(G) \big\}. \qquad (4)$$

We will argue that this formulation can be used to impose several types of structure on the support of principal components.
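To make the feasible set in (3)-(4) concrete, the following small, self-contained sketch (ours, not the authors' code; the helper names and the toy graph are illustrative assumptions) enumerates the S-T paths of a tiny DAG and solves (3) by brute force, taking the top eigenvector of $\widehat{\Sigma}$ restricted to each candidate support. This is only viable when the number of paths is tiny.

```python
import numpy as np

def st_paths(succ, source="S", target="T"):
    """Enumerate all simple S-T paths in a small DAG given as successor lists."""
    stack = [(source, [source])]
    while stack:
        v, path = stack.pop()
        if v == target:
            yield path
            continue
        for u in succ.get(v, []):
            stack.append((u, path + [u]))

def graph_path_pca_bruteforce(Sigma, succ):
    """Solve (3) exactly by scoring every S-T path (toy sizes only)."""
    p = Sigma.shape[0]
    best_val, best_x = -np.inf, None
    for path in st_paths(succ):
        support = [v for v in path if isinstance(v, int)]   # drop the special S, T vertices
        sub = Sigma[np.ix_(support, support)]
        vals, vecs = np.linalg.eigh(sub)                    # top eigenpair on this support
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_x = np.zeros(p)
            best_x[support] = vecs[:, -1]                   # unit-norm, supported on the path
    return best_x, best_val

# Toy instance: 4 variables {0,1,2,3}; feasible supports are {0,2}, {0,3}, {1,2}, {1,3}.
succ = {"S": [0, 1], 0: [2, 3], 1: [2, 3], 2: ["T"], 3: ["T"]}
Sigma = np.full((4, 4), 0.3) + np.diag([1.0, 2.0, 1.5, 0.5])
x_hat, val = graph_path_pca_bruteforce(Sigma, succ)
```

The output is exactly the estimator analyzed below; the algorithms of Section 3 are needed precisely because the number of S-T paths is typically exponential.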
Note that the covariance matrix $\widehat{\Sigma}$ and the graph can be arbitrary: the matrix captures data correlations, while the graph is a mathematical tool to efficiently describe the possible supports of interest. We illustrate this through a few applications.

Financial model selection: Consider the problem of identifying which companies out of the S&P 500 index capture most of the data variability. Running sparse PCA with a sparsity parameter $k$ will select $k$ companies that maximize explained variance. However, it may be useful to enforce more structure: if we must select one company from each business sector (e.g., Energy, Health Care, etc.), how could we identify these representative variables? In Section 4, we show that this additional requirement can be encoded using our graph path framework. We compare our variable selection with sparse PCA and show that it leads to interpretable results.

Biological and fMRI networks: Several problems involve variables that are naturally connected in a network. In these cases our Graph Path PCA can enforce interpretable sparsity structure, especially when the starting and ending points are manually selected by domain experts. In Section 4, we apply our algorithm to fMRI data using a graph on regions of interest (ROIs) based on the Harvard-Oxford brain structural atlas (Desikan et al., 2006). We emphasize that our application to brain data is preliminary: the directional graphs we extract are based simply on distance and should not be interpreted as causality, but only as a way of encoding desired supports.

What can we say about the tractability of (3)? We note that despite the additional constraints on the sparsity patterns, the number of admissible support sets (i.e., S-T paths) can be exponential in $p$, the number of variables. For example, consider a graph $G$ as follows: S is connected to two nodes, each of which is connected to two further nodes, and so on for $k$ levels, which finally connect to T. Clearly there are $2^k$ S-T paths, and therefore a direct search is not tractable.

Our Contributions:

1. From a statistical viewpoint, we show that side information on the underlying graph $G$ can reduce the number of observations required to recover the population principal component $x_\star \in \mathcal{X}(G)$ via (3). For our analysis, we introduce a simple, sparsity-inducing network model on $p$ vertices partitioned into $k$ layers, with edges from one layer to the next, and maximum out-degree $d$ (Fig. 1). We show that $n = O(\log(p/k) + k \log d)$ observations $y_i \sim N(0, \Sigma)$ suffice to obtain an arbitrarily good estimate via (3). Our proof follows the steps of (Vu & Lei, 2012).

2. We complement this with an information-theoretic lower bound on the minimax estimation error, under the spiked covariance model with latent signal $x_\star \in \mathcal{X}(G)$, which matches the upper bound.

3. We propose two algorithms for approximating the solution of (3), based on those of (Yuan & Zhang, 2013) and (Papailiopoulos et al., 2013; Asteris et al., 2014) for the sparse PCA problem. We empirically evaluate our algorithms on synthetic and real datasets.

Figure 1. A (p, k, d)-layer graph G = (V, E): a DAG on p vertices, partitioned into k disjoint sets (layers) L1, . . . , Lk. The highlighted vertices form an S-T path.

Related Work. There is a large volume of work on algorithms and the statistical analysis of sparse PCA (Johnstone & Lu, 2004; Zou et al., 2006; d'Aspremont et al., 2007; 2008; Vu & Lei, 2012; Amini & Wainwright, 2009).
In contrast, there is limited work that considers additional structure on the sparsity patterns. Motivated by a face recognition application, (Jenatton et al., 2010) introduce structured sparse PCA using a regularization that encodes higher-order information about the data. The authors design sparsity-inducing norms that further promote a pre-specified set of sparsity patterns. Finally, we note that the idea of pursuing additional structure on top of sparsity is not limited to PCA: model-based compressive sensing seeks sparse solutions under a restricted family of sparsity patterns (Baldassarre et al., 2013; Baraniuk et al., 2010; Kyrillidis & Cevher, 2012), while structure induced by an underlying network appears in (Mairal & Yu, 2011) for sparse linear regression.

2. A Data Model & Sample Complexity

The layer graph. Consider a directed acyclic graph $G = (V, E)$ on $p$ vertices with the following properties. $V = \{S, T\} \cup \widehat{V}$, where S is a source vertex, T is a terminal vertex, and $\widehat{V}$ is the set of the remaining $p - 2$ vertices. $\widehat{V}$ can be partitioned into $k$ disjoint subsets (layers) $L_1, \ldots, L_k$, i.e., $\bigcup_i L_i = \widehat{V}$ and $L_i \cap L_j = \emptyset$ for all $i, j \in [k]$, $i \ne j$, such that:

- $\Gamma_{\mathrm{out}}(v) \subseteq L_{i+1}$ for all $v \in L_i$, $i = 1, \ldots, k-1$, where $\Gamma_{\mathrm{out}}(v)$ denotes the out-neighborhood of $v$;
- $\Gamma_{\mathrm{out}}(S) = L_1$, and $\Gamma_{\mathrm{out}}(v) = \{T\}$ for all $v \in L_k$.

In the sequel, for simplicity, we further assume that $p - 2$ is a multiple of $k$ and $|L_i| = (p-2)/k$ for all $i \in [k]$. Further, $|\Gamma_{\mathrm{out}}(v)| = d$ for all $v \in L_i$, $i = 1, \ldots, k-1$, and $|\Gamma_{\mathrm{in}}(v)| = d$ for all $v \in L_i$, $i = 2, \ldots, k$, where $\Gamma_{\mathrm{in}}(v)$ denotes the in-neighborhood of $v$. In words, the edges from one layer are maximally spread across the vertices of the next. We refer to $G$ as a (p, k, d)-layer graph.

Fig. 1 illustrates a (p, k, d)-layer graph $G$. The highlighted vertices form an S-T path $\pi$: a set of vertices forming a trail from S to T. Let $\mathcal{P}(G)$ denote the collection of S-T paths in a graph $G$ for a given pair of source and terminal vertices. For the (p, k, d)-layer graph, $|\pi| = k + 2$ (including S and T) for all $\pi \in \mathcal{P}(G)$, and

$$|\mathcal{P}(G)| \;=\; |L_1| \cdot d^{\,k-1} \;=\; \tfrac{p-2}{k}\, d^{\,k-1} \;\le\; \big(\tfrac{p-2}{k}\big)^{k},$$

since $d \in \{1, \ldots, (p-2)/k\}$.

Spike along a path. We consider the spiked covariance model, as in the sparse PCA literature (Johnstone & Lu, 2004; Amini & Wainwright, 2008). Besides sparsity, we impose additional structure on the latent signal: structure induced by a (known) underlying graph $G$. Consider a $p$-dimensional signal $x_\star$ and a bijective mapping between the $p$ variables and the vertices of $G$. For simplicity, assume that the vertices of $G$ are labeled so that $x_i$ is associated with vertex $i \in V$. We restrict $x_\star$ to

$$\mathcal{X}(G) \;\triangleq\; \big\{ x \in \mathbb{R}^p : \|x\|_2 = 1,\ \mathrm{supp}(x) \in \mathcal{P}(G) \big\},$$

that is, $x_\star$ is a unit-norm vector whose active (nonzero) entries correspond to vertices along a path in $\mathcal{P}(G)$. We observe $n$ points (samples) $\{y_i\}_{i=1}^n \subset \mathbb{R}^p$, generated randomly and independently as follows:

$$y_i \;=\; \sqrt{\beta}\, u_i\, x_\star + z_i, \qquad (5)$$

where the scaling coefficient $u_i \sim N(0, 1)$ and the additive noise $z_i \sim N(0, I_p)$ are independent. Equivalently, the $y_i$s are i.i.d. samples distributed according to $N(0, \Sigma)$, where

$$\Sigma \;=\; I_p + \beta\, x_\star x_\star^\top. \qquad (6)$$
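As a concrete illustration of the sampling model (5)-(6), the following minimal NumPy sketch (ours; the function and variable names are assumptions, not the authors' code) draws a unit-norm spike supported on one vertex per layer and generates $n$ samples. For simplicity it works directly with the $p-2$ variable coordinates, since S and T carry no variables.

```python
import numpy as np

def sample_spiked_path_model(p, k, beta, n, rng=np.random.default_rng(0)):
    """Minimal sketch of the sampling model in Eq. (5)-(6).

    Splits the p-2 variable vertices into k equal layers and picks one
    vertex per layer to form (the variable part of) an S-T path.
    """
    layers = np.array_split(np.arange(p - 2), k)      # layer membership of variables
    path = [rng.choice(layer) for layer in layers]    # one vertex per layer = a path support
    x_star = np.zeros(p - 2)
    x_star[path] = rng.standard_normal(k)             # random loadings on the path
    x_star /= np.linalg.norm(x_star)                  # unit-norm spike
    U = rng.standard_normal(n)                        # scaling coefficients u_i ~ N(0, 1)
    Z = rng.standard_normal((n, p - 2))               # additive noise z_i ~ N(0, I)
    Y = np.sqrt(beta) * U[:, None] * x_star + Z       # y_i = sqrt(beta) u_i x_star + z_i
    return Y, x_star

# Empirical covariance of the samples: Sigma_hat = (1/n) Y^T Y
Y, x_star = sample_spiked_path_model(p=102, k=5, beta=2.0, n=500)
Sigma_hat = Y.T @ Y / Y.shape[0]
```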
2.1. Lower bound

Theorem 1 (Lower Bound). Consider a (p, k, d)-layer graph $G$ on $p$ vertices, with $k \ge 4$ and $\log d \ge 4H(3/4)$, where $H(\cdot)$ denotes the binary entropy function (note that $(p-2)/k \ge d$), and a signal $x_\star \in \mathcal{X}(G)$. Let $\{y_i\}_{i=1}^n$ be a sequence of $n$ random observations, independently drawn according to the probability density function $D_p(x_\star) = N(0,\, I_p + \beta x_\star x_\star^\top)$ for some $\beta > 0$, and let $D_p^{(n)}(x_\star)$ denote the product measure over the $n$ independent draws. Consider the problem of estimating $x_\star$ from the $n$ observations, given $G$. There exists $x_\star \in \mathcal{X}(G)$ such that for every estimator $\widehat{x}$,

$$\mathbb{E}_{D_p^{(n)}(x_\star)}\big[\|\widehat{x}\widehat{x}^\top - x_\star x_\star^\top\|_F\big] \;\ge\; \tfrac{1}{4}\, \min\Big\{ 1,\ \Big( \tfrac{C\,(1+\beta)}{n\beta^2}\big( \log\tfrac{p-2}{k} + \tfrac{k}{4}\log d \big) \Big)^{1/2} \Big\}. \qquad (7)$$

Theorem 1 effectively states that for some latent signal $x_\star \in \mathcal{X}(G)$ and observations generated according to the spiked covariance model, the minimax error is bounded away from zero unless $n = \Omega(\log(p/k) + k \log d)$. In the sequel, we provide a proof sketch of Theorem 1, following the steps of (Vu & Lei, 2012). The key idea is to discretize the space $\mathcal{X}(G)$ in order to utilize the Generalized Fano Inequality (Yu, 1997). The next lemma summarizes Fano's inequality for the special case in which the $n$ observations are distributed according to the $n$-fold product measure $D_p^{(n)}(x_\star)$:

Lemma 2.1 (Generalized Fano (Yu, 1997)). Let $\mathcal{X}_\epsilon \subseteq \mathcal{X}(G)$ be a finite set of points $x_1, \ldots, x_{|\mathcal{X}_\epsilon|} \in \mathcal{X}(G)$, each yielding a probability measure $D_p^{(n)}(x_i)$ on the $n$ observations. If $d(x_i, x_j) \ge \alpha$ for some pseudo-metric¹ $d(\cdot,\cdot)$, and the Kullback-Leibler divergences satisfy $\mathrm{KL}\big(D_p^{(n)}(x_i)\,\|\,D_p^{(n)}(x_j)\big) \le \gamma$ for all $i \ne j$, then for any estimator $\widehat{x}$,

$$\max_{x_i \in \mathcal{X}_\epsilon} \mathbb{E}_{D_p^{(n)}(x_i)}\big[d(\widehat{x}, x_i)\big] \;\ge\; \tfrac{\alpha}{2}\Big(1 - \tfrac{\gamma + \log 2}{\log|\mathcal{X}_\epsilon|}\Big). \qquad (8)$$

Inequality (8), used with the pseudo-metric $d(\widehat{x}, x) \triangleq \|\widehat{x}\widehat{x}^\top - x x^\top\|_F$, will yield the desired lower bound of Theorem 1 on the minimax estimation error (Eq. (7)). To that end, we need to show the existence of a sufficiently large set $\mathcal{X}_\epsilon \subseteq \mathcal{X}(G)$ such that (i) the points in $\mathcal{X}_\epsilon$ are well separated under $d(\cdot,\cdot)$, while (ii) the KL divergence of the induced probability measures is appropriately upper bounded.

Lemma 2.2 (Local Packing). Consider a (p, k, d)-layer graph $G$ on $p$ vertices with $k \ge 4$ and $\log d \ge 4H(3/4)$. For any $\epsilon \in (0, 1]$, there exists a set $\mathcal{X}_\epsilon \subseteq \mathcal{X}(G)$ such that $\epsilon < \|x_i - x_j\|_2 \le \sqrt{2}\,\epsilon$ for all $x_i, x_j \in \mathcal{X}_\epsilon$, $x_i \ne x_j$, and

$$\log|\mathcal{X}_\epsilon| \;\ge\; \log\tfrac{p-2}{k} + \tfrac{k}{4}\log d.$$

Proof. (See Appendix 7.)

¹A pseudometric on a set $Q$ is a function $d : Q^2 \to \mathbb{R}$ that satisfies all properties of a distance (non-negativity, symmetry, triangle inequality) except the identity of indiscernibles: $d(q, q) = 0$ for all $q \in Q$, but possibly $d(q_1, q_2) = 0$ for some $q_1 \ne q_2 \in Q$.

For a set $\mathcal{X}_\epsilon$ with the properties of Lemma 2.2, taking into account the fact that $\|x_i x_i^\top - x_j x_j^\top\|_F \ge \|x_i - x_j\|_2$ (Lemma A.1.2 of (Vu & Lei, 2012)), we have

$$d^2(x_i, x_j) \;=\; \|x_i x_i^\top - x_j x_j^\top\|_F^2 \;>\; \epsilon^2, \quad \forall\, x_i, x_j \in \mathcal{X}_\epsilon,\ x_i \ne x_j. \qquad (9)$$

Moreover,

$$\mathrm{KL}\big(D_p(x_i)\,\|\,D_p(x_j)\big) \;=\; \tfrac{\beta^2}{4(1+\beta)}\, \|x_i x_i^\top - x_j x_j^\top\|_F^2 \;\le\; \tfrac{\beta^2}{1+\beta}\, \|x_i - x_j\|_2^2.$$

In turn, for the $n$-fold product distribution, and taking into account that $\|x_i - x_j\|_2 \le \sqrt{2}\,\epsilon$,

$$\mathrm{KL}\big(D_p^{(n)}(x_i)\,\|\,D_p^{(n)}(x_j)\big) \;\le\; \tfrac{2 n \beta^2 \epsilon^2}{1+\beta} \;\triangleq\; \gamma. \qquad (10)$$

Equations (9) and (10) establish the parameters $\alpha$ and $\gamma$ required by Lemma 2.1. Substituting those into (8), along with the lower bound of Lemma 2.2 on $|\mathcal{X}_\epsilon|$, we obtain

$$\max_{x_i \in \mathcal{X}_\epsilon} \mathbb{E}_{D_p^{(n)}(x_i)}\big[d(\widehat{x}, x_i)\big] \;\ge\; \tfrac{\epsilon}{2}\Big(1 - \tfrac{2n\beta^2\epsilon^2/(1+\beta) + \log 2}{\log|\mathcal{X}_\epsilon|}\Big). \qquad (11)$$

The final step towards establishing the desired lower bound in (7) is to appropriately choose $\epsilon$. One can verify that if

$$\epsilon^2 \;=\; \min\Big\{ 1,\ \tfrac{C\,(1+\beta)}{n\beta^2}\Big( \log\tfrac{p-2}{k} + \tfrac{k}{4}\log d \Big) \Big\}, \qquad (12)$$

where $C > 0$ is a constant to be determined, then

$$\tfrac{2n\beta^2\epsilon^2}{1+\beta} \;\le\; \tfrac{1}{4}\log|\mathcal{X}_\epsilon| \quad \text{and} \quad \log|\mathcal{X}_\epsilon| \;\ge\; 4\log 2 \qquad (13)$$

(see Appendix 8 for details). Under the conditions in (13), the inequality in (11) implies that

$$\max_{x_i \in \mathcal{X}_\epsilon} \mathbb{E}_{D_p^{(n)}(x_i)}\big[d(\widehat{x}, x_i)\big] \;\ge\; \tfrac{\epsilon}{4}. \qquad (14)$$

Substituting $\epsilon$ according to (12) yields the desired result in (7), completing the proof of Theorem 1.

2.2. Upper bound

Our upper bound is based on the estimator obtained via the constrained quadratic maximization in (3). We note that the analysis is not restricted to the spiked covariance model; it applies to a broader class of distributions (see Assumption 1).
Theorem 2 (Upper bound). Consider a (p, k, d)-layer graph $G$ and $x_\star \in \mathcal{X}(G)$. Let $\{y_i\}_{i=1}^n$ be a sequence of $n$ i.i.d. $N(0, \Sigma)$ samples, where $\Sigma \succ 0$ has eigenvalues $\lambda_1 > \lambda_2 \ge \ldots$ and principal eigenvector $x_\star$. Let $\widehat{\Sigma}$ be the empirical covariance of the $n$ samples, $\widehat{x}$ the estimate of $x_\star$ obtained via (3), and $\epsilon \triangleq \|\widehat{x}\widehat{x}^\top - x_\star x_\star^\top\|_F$. Then,

$$\mathbb{E}[\epsilon] \;\le\; C\, \tfrac{\lambda_1}{\lambda_1 - \lambda_2}\, \max\Big\{ \sqrt{\tfrac{A}{n}},\ \tfrac{A}{n} \Big\}, \quad \text{where } A = O\big( \log\tfrac{p-2}{k} + k\log d \big).$$

In the sequel, we provide a proof sketch of Theorem 2. The proof closely follows the steps of (Vu & Lei, 2012) in developing their upper bound for the sparse PCA problem.

Lemma 2.3 (Lemma 3.2.1 (Vu & Lei, 2012)). Consider $\Sigma \in \mathbb{S}^{p\times p}_+$ with principal eigenvector $x_\star$ and $\lambda_{\mathrm{gap}} \triangleq \lambda_1 - \lambda_2(\Sigma)$. For any $\widetilde{x} \in \mathbb{R}^p$ with $\|\widetilde{x}\|_2 = 1$,

$$\tfrac{\lambda_{\mathrm{gap}}}{2}\, \|\widetilde{x}\widetilde{x}^\top - x_\star x_\star^\top\|_F^2 \;\le\; \big\langle \Sigma,\ x_\star x_\star^\top - \widetilde{x}\widetilde{x}^\top \big\rangle.$$

Let $\widehat{x}$ be the estimate of $x_\star$ via (3), and $\epsilon \triangleq \|\widehat{x}\widehat{x}^\top - x_\star x_\star^\top\|_F$. From Lemma 2.3, it follows (see (Vu & Lei, 2012)) that

$$\tfrac{\lambda_{\mathrm{gap}}}{2}\, \epsilon^2 \;\le\; \big\langle \widehat{\Sigma} - \Sigma,\ \widehat{x}\widehat{x}^\top - x_\star x_\star^\top \big\rangle. \qquad (15)$$

Both $x_\star$ and $\widehat{x}$ belong to $\mathcal{X}(G)$: they are unit-norm vectors whose support has cardinality $k+2$ and coincides with a path in $\mathcal{P}(G)$. Their difference is supported in $\mathcal{P}^2(G)$: the collection of sets formed by the union of two sets in $\mathcal{P}(G)$. Let $\mathcal{X}^2(G)$ denote the set of unit-norm vectors supported in $\mathcal{P}^2(G)$. Via an appropriate upper bounding of the right-hand side of (15), (Vu & Lei, 2012) show that

$$\mathbb{E}[\epsilon] \;\le\; \tfrac{\widehat{C}}{\lambda_1 - \lambda_2}\, \mathbb{E}\Big[ \sup_{\theta \in \mathcal{X}^2(G)} \theta^\top \big(\widehat{\Sigma} - \Sigma\big)\, \theta \Big],$$

for an appropriate constant $\widehat{C} > 0$. Further, under the assumptions on the data distribution, and utilizing a result due to (Mendelson, 2010), the expected supremum is at most $C K^2 \lambda_1 \max\{W/\sqrt{n},\, W^2/n\}$, for constants $C$ and $K$ depending on the distribution, where

$$W \;\triangleq\; \mathbb{E}_{Y \sim N(0, I_p)}\Big[ \sup_{\theta \in \mathcal{X}^2(G)} \langle Y, \theta \rangle \Big]. \qquad (16)$$

This reduces the problem of bounding $\mathbb{E}[\epsilon]$ to bounding the supremum of a Gaussian process. Let $N_\delta \subseteq \mathcal{X}^2(G)$ be a minimal $\delta$-covering of $\mathcal{X}^2(G)$ in the Euclidean metric, with the property that for every $x \in \mathcal{X}^2(G)$ there exists $y \in N_\delta$ such that $\|x - y\|_2 \le \delta$ and $\mathrm{supp}(x - y) \in \mathcal{P}^2(G)$. Then,

$$\sup_{\theta \in \mathcal{X}^2(G)} \langle Y, \theta \rangle \;\le\; (1-\delta)^{-1} \max_{\theta \in N_\delta} \langle Y, \theta \rangle. \qquad (17)$$

Taking the expectation with respect to $Y$ and applying a union bound on the right-hand side, we conclude that

$$W \;\le\; \widetilde{C}\, (1-\delta)^{-1} \sqrt{\log|N_\delta|}. \qquad (18)$$

It remains to construct a $\delta$-covering $N_\delta$ with the desired properties. To this end, we associate an isometric copy of the sphere $\mathbb{S}^{2k+1}$ with each support set in $\mathcal{P}^2(G)$. It is known that there exists a minimal $\delta$-covering of $\mathbb{S}^{2k+1}$ with cardinality at most $(1 + 2/\delta)^{2k+2}$. The union of the local $\delta$-nets forms a set $N_\delta$ with the desired properties. Then,

$$\log|N_\delta| \;\le\; \log|\mathcal{P}^2(G)| + (2k+2)\log(1 + 2/\delta) \;=\; O\big( \log\tfrac{p-2}{k} + k\log d \big),$$

for any constant $\delta$. Substituting into (18) gives $W^2 = O\big(\log\tfrac{p-2}{k} + k\log d\big) = O(A)$, which implies the desired bound on $\mathbb{E}[\epsilon]$, completing the proof of Theorem 2.
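To make the last counting step explicit (our own unrolling of the displayed bound, not a verbatim step from the paper's appendix), one can bound the number of pairwise unions crudely by $|\mathcal{P}(G)|^2$ and plug in the path count from Section 2:

```latex
% Counting step: |P^2(G)| <= |P(G)|^2, since every support in P^2(G) is a union of two S-T paths.
\log |N_\delta|
  \;\le\; \log |\mathcal{P}^2(G)| + (2k+2)\,\log\!\Big(1 + \tfrac{2}{\delta}\Big)
  \;\le\; 2\Big(\log \tfrac{p-2}{k} + (k-1)\log d\Big) + (2k+2)\,\log\!\Big(1 + \tfrac{2}{\delta}\Big)
  \;=\; O\!\Big(\log \tfrac{p-2}{k} + k \log d\Big),
```

for any constant $\delta$, so the quantity $A$ in Theorem 2 indeed scales as the logarithm of the number of feasible supports plus the dimension of each local sphere.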
3. Algorithmic approaches

We propose two algorithms for approximating the solution of the constrained quadratic maximization in (3):

1. The first is an adaptation of the truncated power iteration method of (Yuan & Zhang, 2013) for the problem of computing sparse eigenvectors.

2. The second relies on approximately solving (3) on a low-rank approximation of $\widehat{\Sigma}$, similar to (Papailiopoulos et al., 2013; Asteris et al., 2014).

Both algorithms rely on a projection operation from $\mathbb{R}^p$ onto the feasible set $\mathcal{X}(G)$, for a given graph $G = (V, E)$. Besides the projection step, the algorithms are oblivious to the specifics of the constraint set,² and can adapt to different constraints by modifying the projection operation.

²For Alg. 2, the observation holds under mild assumptions: $\mathcal{X}(G)$ must be such that $\|x\|_2 = \Theta(1)$ for all $x \in \mathcal{X}(G)$, while $x$ and $-x \in \mathcal{X}(G)$ should both achieve the same objective value.

3.1. Graph-Truncated Power Method

Algorithm 1 Graph-Truncated Power Method
input: $\widehat{\Sigma} \in \mathbb{R}^{p \times p}$, $G = (V, E)$, $x_0 \in \mathbb{R}^p$
1: $i \leftarrow 0$
2: repeat
3:   $w_i \leftarrow \widehat{\Sigma}\, x_i$
4:   $x_{i+1} \leftarrow \mathrm{Proj}_{\mathcal{X}(G)}(w_i)$
5:   $i \leftarrow i + 1$
6: until convergence/stopping criterion
output: $x_i$

We consider a simple iterative procedure, similar to the truncated power method of (Yuan & Zhang, 2013) for the problem of computing sparse eigenvectors. Our algorithm produces a sequence of vectors $x_i \in \mathcal{X}(G)$, $i \ge 0$, that serve as intermediate estimates of the desired solution of (3). The procedure is summarized in Algorithm 1. In the $i$th iteration, the current estimate $x_i$ is multiplied by the empirical covariance $\widehat{\Sigma}$, and the product $w_i \in \mathbb{R}^p$ is projected back onto the feasible set $\mathcal{X}(G)$, yielding the next estimate $x_{i+1}$.

The core of Algorithm 1 lies in the projection operation,

$$\mathrm{Proj}_{\mathcal{X}(G)}(w) \;\triangleq\; \arg\min_{x \in \mathcal{X}(G)} \tfrac{1}{2}\, \|x - w\|_2^2, \qquad (19)$$

which is analyzed separately in Section 3.3. The initial estimate $x_0$ can be selected randomly or based on simple heuristics, e.g., the projection onto $\mathcal{X}(G)$ of the column of $\widehat{\Sigma}$ corresponding to the largest diagonal entry. The algorithm terminates when some convergence criterion is satisfied.

The per-iteration computational complexity of Algorithm 1 is dominated by the cost of the matrix-vector multiplication and the projection step. The former is $O(k\,p)$, where $k$ is the cardinality of the largest support in $\mathcal{X}(G)$. The projection operation for the particular set $\mathcal{X}(G)$ boils down to solving the longest path problem on a weighted variant of the DAG $G$ (see Section 3.3), which can be solved in time $O(|V| + |E|)$, i.e., linear in the size of $G$.
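As a minimal NumPy sketch of the iteration in Algorithm 1 (ours, not the authors' implementation; the function names are assumptions), the method only needs the empirical covariance and a callable that performs the projection (19); a sketch of that projection, following Section 3.3, is given after that section.

```python
import numpy as np

def graph_truncated_power_method(S, project_path, x0, max_iter=200, tol=1e-8):
    """Sketch of Algorithm 1: power iteration with projection onto X(G).

    S            : (p, p) empirical covariance matrix.
    project_path : callable w -> unit vector supported on the best S-T path
                   (see the longest-path sketch after Section 3.3).
    x0           : feasible initial estimate in X(G).
    """
    x = x0
    for _ in range(max_iter):
        w = S @ x                    # multiply by the empirical covariance
        x_new = project_path(w)      # project back onto the feasible set X(G)
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x
```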
3.2. Low-Dimensional Sample and Project

The second algorithm outputs an estimate of the desired solution of (3) by (approximately) solving the constrained quadratic maximization not on the original matrix $\widehat{\Sigma}$, but on a rank-$r$ approximation $\widehat{\Sigma}_r$ of $\widehat{\Sigma}$ instead:

$$\widehat{\Sigma}_r \;=\; \sum_{i=1}^{r} \lambda_i q_i q_i^\top \;=\; \sum_{i=1}^{r} v_i v_i^\top \;=\; V V^\top, \qquad (20)$$

where $\lambda_i$ is the $i$th largest eigenvalue of $\widehat{\Sigma}$, $q_i$ is the corresponding eigenvector, $v_i \triangleq \sqrt{\lambda_i}\, q_i$, and $V$ is the $p \times r$ matrix whose $i$th column equals $v_i$. The approximation rank $r$ is an accuracy parameter; typically, $r \ll p$. Our algorithm operates³ on $\widehat{\Sigma}_r$ and seeks

$$x_r \;\triangleq\; \arg\max_{x \in \mathcal{X}(G)} \; x^\top \widehat{\Sigma}_r\, x. \qquad (21)$$

The motivation is that an (approximate) solution for the low-rank problem in (21) can be computed efficiently. Intuitively, if $\widehat{\Sigma}_r$ is a sufficiently good approximation of the original matrix $\widehat{\Sigma}$, then $x_r$ should perform similarly to the solution of the original problem (3).

³Under the spiked covariance model, this approach may be asymptotically unsuitable; as the ambient dimension increases, it will fail to recover the latent signal. Empirically, however, if the spectral decay of $\widehat{\Sigma}$ is sharp, it yields very competitive results.

The Algorithm. Our algorithm samples points from the low-dimensional principal subspace of $\widehat{\Sigma}$ and projects them onto the feasible set $\mathcal{X}(G)$, producing a set of candidate estimates for $x_r$. It outputs the candidate that maximizes the objective in (21). The exact steps are formally presented in Algorithm 2. The following paragraphs delve into the details of Algorithm 2.

Algorithm 2 Low-Dimensional Sample and Project
input: $\widehat{\Sigma} \in \mathbb{R}^{p \times p}$, $G = (V, E)$, $r \in [p]$, $\epsilon > 0$
1: $[Q, \Lambda] \leftarrow \mathrm{svd}(\widehat{\Sigma}, r)$
2: $V \leftarrow Q \Lambda^{1/2}$  {so that $\widehat{\Sigma}_r = V V^\top$}
3: $\mathcal{C} \leftarrow \emptyset$  {candidate solutions}
4: for $i = 1 : O(\epsilon^{-r} \log p)$ do
5:   $c_i \leftarrow$ point sampled uniformly from $\mathbb{S}^{r-1}$
6:   $w_i \leftarrow V c_i$
7:   $x_i \leftarrow \mathrm{Proj}_{\mathcal{X}(G)}(w_i)$
8:   $\mathcal{C} \leftarrow \mathcal{C} \cup \{x_i\}$
9: end for
output: $\widehat{x}_r \leftarrow \arg\max_{x \in \mathcal{C}} \|V^\top x\|_2^2$

3.2.1. The Low-Rank Problem

The rank-$r$ maximization in (21) can be written as

$$\max_{x \in \mathcal{X}(G)} x^\top \widehat{\Sigma}_r\, x \;=\; \max_{x \in \mathcal{X}(G)} \|V^\top x\|_2^2, \qquad (22)$$

and in turn (see (Asteris et al., 2014) for details), as a double maximization over the variables $c \in \mathbb{S}^{r-1}$ and $x \in \mathcal{X}(G)$:

$$\max_{x \in \mathcal{X}(G)} \|V^\top x\|_2^2 \;=\; \max_{c \in \mathbb{S}^{r-1}} \; \max_{x \in \mathcal{X}(G)} \big( (V c)^\top x \big)^2. \qquad (23)$$

The rank-1 case. Let $w \triangleq V c$; $w$ is just a vector in $\mathbb{R}^p$. For given $c$, and hence $w$, the $x$ that maximizes the inner objective in (23) is

$$x(c) \;\triangleq\; \arg\max_{x \in \mathcal{X}(G)} \big( w^\top x \big)^2. \qquad (24)$$

The maximization in (24) is nothing but a rank-1 instance of the maximization in (22). Observe that if $x \in \mathcal{X}(G)$, then $-x \in \mathcal{X}(G)$, and the two vectors attain the same objective value. Hence, (24) can be simplified to

$$x(c) \;\triangleq\; \arg\max_{x \in \mathcal{X}(G)} w^\top x. \qquad (25)$$

Further, since $\|x\|_2 = 1$ for all $x \in \mathcal{X}(G)$, the maximization in (25) is equivalent to minimizing $\tfrac{1}{2}\|w - x\|_2^2$. In other words, $x(c)$ is just the projection of $w \in \mathbb{R}^p$ onto $\mathcal{X}(G)$:

$$x(c) \;=\; \mathrm{Proj}_{\mathcal{X}(G)}(w). \qquad (26)$$

The projection operator is described in Section 3.3.

Multiple rank-1 instances. Let $c_r, x_r$ denote a pair that attains the maximum value in (23). If $c_r$ were known, then $x_r$ would coincide with the projection $x(c_r)$ of $w = V c_r$ onto the feasible set, according to (26). Of course, the optimal value $c_r$ of the auxiliary variable is not known. Recall, however, that $c_r$ lies on the low-dimensional manifold $\mathbb{S}^{r-1}$. Consider an $\epsilon$-net $N_\epsilon$ covering the $r$-dimensional unit sphere $\mathbb{S}^{r-1}$; Algorithm 2 constructs such a net by random sampling. By definition, $N_\epsilon$ contains at least one point, call it $\widehat{c}_r$, in the vicinity of $c_r$. It can be shown that the corresponding solution $x(\widehat{c}_r)$ in (24) performs approximately as well as the optimal solution $x_r$, in terms of the quadratic objective in (23), for a large, but tractable, number of points in the $\epsilon$-net of $\mathbb{S}^{r-1}$.

3.3. The Projection Operator

Algorithms 1 and 2 rely on a projection operation from $\mathbb{R}^p$ onto the feasible set $\mathcal{X}(G)$ (Eq. (4)). We show that the projection effectively reduces to solving the longest path problem on (a weighted variant of) $G$.

The projection operation, defined in Eq. (19), can be written equivalently⁴ as

$$\mathrm{Proj}_{\mathcal{X}(G)}(w) \;=\; \arg\max_{x \in \mathcal{X}(G)} w^\top x.$$

For any $x \in \mathcal{X}(G)$, $\mathrm{supp}(x) = \pi$ for some $\pi \in \mathcal{P}(G)$. For a given set $\pi$, by the Cauchy-Schwarz inequality,

$$\big( w^\top x \big)^2 \;\le\; \sum_{i \in \pi} w_i^2 \;=\; \widehat{w}^\top \mathbf{1}_\pi, \qquad (27)$$

where $\widehat{w} \in \mathbb{R}^p$ is the vector obtained by squaring the entries of $w$, i.e., $\widehat{w}_i = w_i^2$, $i \in [p]$, and $\mathbf{1}_\pi \in \{0, 1\}^p$ denotes the characteristic vector of $\pi$. Letting $x[\pi]$ denote the subvector of $x$ supported on $\pi$, equality in (27) is achieved by the $x$ such that $x[\pi] = w[\pi]/\|w[\pi]\|_2$ and $x[\pi^c] = 0$. Hence, the projection reduces to determining

$$\pi(w) \;\triangleq\; \arg\max_{\pi \in \mathcal{P}(G)} \widehat{w}^\top \mathbf{1}_\pi. \qquad (28)$$

Consider a weighted graph $G_w$, obtained from $G = (V, E)$ by assigning weight $\widehat{w}_v = w_v^2$ to vertex $v \in V$. The objective function in (28) equals the weight of the path $\pi$ in $G_w$, i.e., the sum of the weights of the vertices along $\pi$. Determining the optimal support $\pi(w)$ for a given $w$ is thus equivalent to solving the longest (weighted) path problem⁵ on $G_w$. The longest (weighted) path problem is NP-hard on arbitrary graphs. In the case of DAGs, however, it can be solved using standard algorithms relying on topological sorting in time $O(|V| + |E|)$ (Cormen et al., 2001), i.e., linear in the size of the graph. Hence, the projection can be determined in time $O(p + |E|)$.

⁴This follows from expanding the quadratic $\tfrac{1}{2}\|x - w\|_2^2$ and the fact that $\|x\|_2 = 1$ for all $x \in \mathcal{X}(G)$.

⁵The longest path problem is commonly defined on graphs with weighted edges instead of vertices. The latter is trivially transformed to the former: set $w(u, v) \triangleq w(v)$ for every $(u, v) \in E$, where $w(u, v)$ denotes the weight of edge $(u, v)$ and $w(v)$ that of vertex $v$. Auxiliary edges can be introduced for source vertices.
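The reduction in Section 3.3 translates directly into a linear-time dynamic program over a topological order of $G$. The sketch below is our own illustration (the vertex naming, the `succ`/`order` inputs, and the handling of S and T are assumptions, not the authors' code); it assumes T is reachable from S and that variable vertices are indexed by integers 0, ..., p-1.

```python
import numpy as np

def project_path(w, succ, order, source="S", target="T"):
    """Sketch of Proj_X(G)(w) from Section 3.3 (Eqs. (19) and (28)).

    w      : length-p vector to project; coordinate i is tied to variable vertex i.
    succ   : dict vertex -> list of out-neighbours (a DAG including 'S' and 'T').
    order  : topological order of all vertices, with source first and target last.
    Returns a unit vector supported on the heaviest S-T path under weights w_i^2.
    """
    # Vertex weights: squared entries for variable vertices, zero for S and T.
    weight = {v: (w[v] ** 2 if isinstance(v, int) else 0.0) for v in order}
    best = {v: -np.inf for v in order}   # best path weight ending at v
    best[source] = 0.0
    parent = {}
    for u in order:                       # dynamic program along the topological order
        if best[u] == -np.inf:
            continue                      # u not reachable from the source
        for v in succ.get(u, []):
            cand = best[u] + weight[v]
            if cand > best[v]:
                best[v] = cand
                parent[v] = u
    # Recover the optimal path pi(w) by backtracking from T (assumes T is reachable).
    path, v = [], target
    while v != source:
        path.append(v)
        v = parent[v]
    support = [v for v in path if isinstance(v, int)]
    x = np.zeros_like(w, dtype=float)
    x[support] = w[support]               # x[pi] = w[pi] / ||w[pi]||, zero elsewhere
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else x
```

The callable can be passed directly as `project_path` to the power-method sketch given after Section 3.1.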
4. Experiments

4.1. Synthetic Data. We evaluate Algorithms 1 and 2 on synthetic data generated according to the model of Section 2. We consider two metrics: the loss $\|\widehat{x}\widehat{x}^\top - x_\star x_\star^\top\|_F$ and the support Jaccard distance between the true signal $x_\star$ and the estimate $\widehat{x}$.

For each dimension $p$, we generate a (p, k, d)-layer graph $G$ with $k = \log p$ and out-degree $d = p/k$, i.e., each vertex is connected to all vertices in the following layer. We augment the graph with auxiliary source and terminal vertices S and T, with edges to the original vertices as in Fig. 1. Per random realization, we first construct a signal $x_\star \in \mathcal{X}(G)$ as follows: we randomly select an S-T path $\pi$ in $G$ and assign random zero-mean Gaussian values to the entries of $x_\star$ indexed by $\pi$. The signal is then scaled to unit length. Given $x_\star$, we generate $n$ independent samples according to the spiked covariance model in (5). Fig. 2 depicts the aforementioned distance metrics as a function of the number $n$ of observations. Results are the average of 100 independent realizations. We repeat the procedure for multiple values of the ambient dimension $p$.

Figure 2. Metrics ($\|\widehat{x}\widehat{x}^\top - x_\star x_\star^\top\|_F$ and support Jaccard distance, plotted against $(k\log d + \log(p/k))/n$ for $p \in \{100, 300, 700, 1000\}$) on the estimate $\widehat{x}$ produced by Alg. 1 (Alg. 2 is similar), as a function of the number of samples (average of 100 realizations). Samples are generated according to the spiked covariance model with signal $x_\star \in \mathcal{X}(G)$ for a (p, k, d)-layer graph $G$. Here, $k = \log p$ and $d = p/k$. We repeat for multiple values of $p$.

Comparison with Sparse PCA. We compare the performance of Alg. 1 and Alg. 2 with their sparse PCA counterparts: the Truncated Power Method of (Yuan & Zhang, 2013) and the Spannogram algorithm of (Papailiopoulos et al., 2013), respectively. Fig. 3 depicts the metrics of interest as a function of the number of samples for all four algorithms. Here, samples are drawn i.i.d. from $N(0, \Sigma)$, where $\Sigma$ has principal eigenvector equal to $x_\star$ and power-law spectral decay $\lambda_i = i^{-1/4}$. Results are an average of 100 realizations. The side information on the structure of $x_\star$ assists the recovery: both algorithms achieve improved performance compared to their sparse PCA counterparts. Here, the power-method-based algorithms exhibit inferior performance, which may be attributed to poor initialization. We note, though, that at least for the size of these experiments, the power method algorithms are significantly faster.

Figure 3. Estimation error ($\|\widehat{x}\widehat{x}^\top - x_\star x_\star^\top\|_F$ and support Jaccard distance, for the Truncated Power Method, Spannogram $k$-sparse, Graph Power Method, and Low-D Sampling, against the number of samples $n$ from 1000 to 5000) between the true signal $x_\star$ and the estimate $\widehat{x}$ from $n$ samples (average of 100 realizations). Samples are generated i.i.d. from $N(0, \Sigma)$, where $\Sigma$ has eigenvalues $\lambda_i = i^{-1/4}$ and principal eigenvector $x_\star \in \mathcal{X}(G)$, for a (p, k, d)-layer graph $G$ ($p = 10^3$, $k = 50$, $d = 10$).

4.2. Finance Data. This dataset contains daily closing prices for 425 stocks of the S&P 500 Index over a period of 1259 days (5 years), 02.01.2010 to 01.28.2015, collected from Yahoo! Finance.⁶ Stocks are classified, according to the Global Industry Classification Standard⁷ (GICS), into 10 business sectors, e.g., Energy, Health Care, Information Technology, etc. (see Fig. 4 for the complete list). We seek a set of stocks comprising a single representative from each GICS sector that captures most of the variance in the dataset. Equivalently, we want to compute a structured principal component constrained to have exactly 10 nonzero entries, one for each GICS sector.

⁶http://finance.yahoo.com
⁷http://www.msci.com/products/indexes/sector/gics/gics_structure.html
Consider a layer graph $G = (V, E)$ (similar to the one depicted in Fig. 1) on $p = 425$ vertices corresponding to the 425 stocks, partitioned into $k = 10$ groups (layers) $L_1, \ldots, L_{10} \subset V$ corresponding to the GICS sectors. Each vertex in layer $L_i$ has outgoing edges towards all (and only the) vertices in layer $L_{i+1}$. Note that (unlike Fig. 1) the layers do not have equal sizes, and the vertex out-degree varies across layers. Finally, we introduce auxiliary vertices S and T, connected to the original graph as in Fig. 1. Observe that any set of sector representatives corresponds to an S-T path in $G$, and vice versa. Hence, the desired set of stocks can be obtained by finding a structured principal component constrained to be supported along an S-T path in $G$. Note that the order of the layers in $G$ is irrelevant.

Fig. 4 depicts the subset of stocks selected by the proposed structured PCA algorithms (Alg. 1, 2). A single representative is selected from each sector. For comparison, we also run the two corresponding algorithms for sparse PCA, with sparsity parameter $k = 10$, equal to the number of sectors. As expected, the latter yield components achieving higher values of explained variance, but the selected stocks originate from only 5 out of the 10 sectors.

Figure 4. The sets of 10 stocks extracted by sparse PCA and our structured PCA approach. Sparse PCA ($k = 10$) selects 10 stocks from 5 GICS sectors (above). On the contrary, our structured PCA algorithms yield a set of 10 stocks containing a representative from each sector (below), as desired. (GICS sector codes: 10 Energy, 15 Materials, 20 Industrials, 25 Consumer Discretionary, 30 Consumer Staples, 35 Health Care, 40 Financials, 45 Information Tech., 50 Telecom. Services, 55 Utilities.)

4.3. Neuroscience Data. We use a single-session, single-participant resting-state functional magnetic resonance imaging (resting-state fMRI) dataset. The participant was not instructed to perform any explicit cognitive task throughout the scan (Van Essen et al., 2013). Data were provided by the Human Connectome Project, WU-Minn Consortium.⁸ Mean timeseries of $n = 1200$ points for $p = 111$ regions of interest (ROIs) are extracted based on the Harvard-Oxford Atlas (Desikan et al., 2006). The timescale of the analysis is restricted to 0.01-0.1 Hz. Based on recent results on resting-state fMRI neural networks, we set the posterior cingulate cortex as the source node S and the prefrontal cortex as the target node T (Greicius et al., 2009). Starting from S, we construct a layered graph with $k = 4$, based on the physical (Euclidean) distances between the centers of mass of the ROIs: i.e., given layer $L_i$, we construct $L_{i+1}$ from non-selected nodes that are close in the Euclidean sense. Here, $|L_1| = 34$ and $|L_i| = 25$ for $i = 2, 3, 4$. Each layer is fully connected with the previous one. No further assumptions are derived from neurobiology.

⁸Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657; funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research, and by the McDonnell Center for Systems Neuroscience at Washington University.

The extracted component suggests a directed pathway from the posterior cingulate cortex (S) to the prefrontal cortex (T), through the hippocampus (1), nucleus accumbens (2), parahippocampal gyrus (3), and frontal operculum (4) (Fig. 5).
The hippocampus and the parahippocampal gyrus are critical in memory encoding, and have been found to be structurally connected to the posterior cingulate cortex and the prefrontal cortex (Greicius et al., 2009). The nucleus accumbens receives input from the hippocampus and plays an important role in memory consolidation (Wittmann et al., 2005). It is noteworthy that our approach has pinpointed the core neural components of the memory network given minimal information.

Figure 5. The nodes extracted for the neuroscience example. The source node is set to the posterior cingulate cortex (S: PCC), and the target to the prefrontal cortex (T: Prefrontal). The directed path proceeds through the nucleus accumbens (1: NAcc), hippocampus (2: Hipp), parahippocampal gyrus (3: Parahipp), and the frontal operculum (4: Operculum). Here, X coordinates (in mm) denote how far from the midline the cuts are.

5. Conclusions

We introduced a new problem: sparse PCA where the set of feasible support sets is determined by a graph on the variables. We focused on the special case where feasible sparsity patterns coincide with paths on the underlying graph. We provided an upper bound on the statistical complexity of the constrained quadratic maximization estimator (3) under a simple graph model, complemented with a lower bound on the minimax error. Finally, we proposed two algorithms to extract a component accommodating the graph constraints and applied them to real data from finance and neuroscience.

A potential future direction is to expand the set of graph-induced sparsity patterns (beyond paths) that can lead to interpretable solutions and are computationally tractable. We hope this work triggers future efforts to introduce and exploit such underlying structure in diverse research fields.

6. Acknowledgments

The authors would like to acknowledge support from grants NSF CCF 1422549, 1344364, 1344179 and an ARO YIP award.

References

Amini, Arash and Wainwright, Martin. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008 (ISIT 2008), IEEE International Symposium on, pp. 2454-2458. IEEE, 2008.

Amini, Arash and Wainwright, Martin. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, pp. 2877-2921, 2009.

Asteris, Megasthenis, Papailiopoulos, Dimitris, and Dimakis, Alexandros. Nonnegative sparse PCA with provable guarantees. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1728-1736, 2014.

Baldassarre, Luca, Bhan, Nirav, Cevher, Volkan, and Kyrillidis, Anastasios. Group-sparse model selection: Hardness and relaxations. arXiv preprint arXiv:1303.3207, 2013.

Baraniuk, R., Cevher, V., Duarte, M., and Hegde, C. Model-based compressive sensing. Information Theory, IEEE Transactions on, 56(4):1982-2001, 2010.

Cormen, Thomas, Stein, Clifford, Rivest, Ronald, and Leiserson, Charles. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001. ISBN 0070131511.

d'Aspremont, Alexandre, El Ghaoui, Laurent, Jordan, Michael, and Lanckriet, Gert. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

d'Aspremont, Alexandre, Bach, Francis, and El Ghaoui, Laurent. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9:1269-1294, 2008.
Desikan, Rahul, Ségonne, Florent, Fischl, Bruce, Quinn, Brian, Dickerson, Bradford, Blacker, Deborah, Buckner, Randy, Dale, Anders, Maguire, Paul, and Hyman, Bradley. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31(3):968-980, 2006.

Greicius, Michael D, Supekar, Kaustubh, Menon, Vinod, and Dougherty, Robert F. Resting-state functional connectivity reflects structural connectivity in the default mode network. Cerebral Cortex, 19(1):72-78, 2009.

Jenatton, Rodolphe, Obozinski, Guillaume, and Bach, Francis. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics, pp. 366-373, 2010.

Johnstone, Iain and Lu, Arthur Yu. Sparse principal components analysis. Unpublished manuscript, 2004.

Kyrillidis, Anastasios and Cevher, Volkan. Combinatorial selection and least absolute shrinkage via the CLASH algorithm. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pp. 2216-2220. IEEE, 2012.

Mairal, Julien and Yu, Bin. Path coding penalties for directed acyclic graphs. In Proceedings of the 4th NIPS Workshop on Optimization for Machine Learning (OPT11). Citeseer, 2011.

Mendelson, Shahar. Empirical processes with a bounded ψ1 diameter. Geometric and Functional Analysis, 20(4):988-1027, 2010.

Papailiopoulos, Dimitris, Dimakis, Alex, and Korokythakis, Stavros. Sparse PCA through low-rank approximations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 767-774. ACM, 2013.

Van Essen, David, Smith, Stephen, Barch, Deanna, Behrens, Timothy, Yacoub, Essa, Ugurbil, Kamil, and the WU-Minn HCP Consortium. The WU-Minn Human Connectome Project: An overview. NeuroImage, 80:62-79, 2013.

Vu, Vincent and Lei, Jing. Minimax rates of estimation for sparse PCA in high dimensions. In International Conference on Artificial Intelligence and Statistics, pp. 1278-1286, 2012.

Wittmann, Bianca, Schott, Björn, Guderian, Sebastian, Frey, Julietta, Heinze, Hans-Jochen, and Düzel, Emrah. Reward-related fMRI activation of dopaminergic midbrain is associated with enhanced hippocampus-dependent long-term memory formation. Neuron, 45(3):459-467, 2005.

Yu, Bin. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423-435. Springer, 1997.

Yuan, Xiao-Tong and Zhang, Tong. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899-925, 2013.

Zou, Hui, Hastie, Trevor, and Tibshirani, Robert. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.