# Stable Feature Selection from Brain sMRI

Bo Xin, Lingjing Hu, Yizhou Wang and Wen Gao

National Engineering Laboratory for Video Technology, Key Laboratory of Machine Perception, School of EECS, Peking University, Beijing, 100871, China
Yanjing Medical College, Capital Medical University, Beijing, 101300, China

## Abstract

Neuroimage analysis usually involves learning thousands or even millions of variables using only a limited number of samples. In this regard, sparse models, e.g. the lasso, are applied to select the optimal features and achieve high diagnosis accuracy. The lasso, however, usually results in independent unstable features. Stability, a manifest of reproducibility of statistical results subject to reasonable perturbations to data and the model (Yu 2013), is an important focus in statistics, especially in the analysis of high dimensional data. In this paper, we explore a nonnegative generalized fused lasso model for stable feature selection in the diagnosis of Alzheimer's disease. In addition to sparsity, our model incorporates two important pathological priors: the spatial cohesion of lesion voxels and the positive correlation between the features and the disease labels. To optimize the model, we propose an efficient algorithm by proving a novel link between total variation and fast network flow algorithms via conic duality. Experiments show that the proposed nonnegative model performs much better in exploring the intrinsic structure of data via selecting stable features compared with other state-of-the-art methods.

## Introduction

Neuroimage analysis is challenging due to its high feature dimensionality and data scarcity. Sparse models such as the lasso (Tibshirani 1996) have gained great reputation in statistics and machine learning, and they have been applied to the analysis of such high dimensional data by exploiting the sparsity property in the absence of abundant data. As a major result, automatic selection of relevant variables/features by such sparse formulations achieves promising performance. For example, in (Liu, Zhang, and Shen 2012), the lasso model was applied to the diagnosis of Alzheimer's disease (AD) and showed better performance than the support vector machine (SVM), which is one of the state-of-the-art methods in brain image classification.

However, in statistics, it is known that the lasso does not always provide interpretable results because of its instability (Yu 2013). Stability here means the reproducibility of statistical results subject to reasonable perturbations to data and the model. (Such perturbations include the often used jackknife, bootstrap and cross-validation.) This unstable behavior of the lasso is critical in high dimensional data analysis. The resulting irreproducibility of the feature selection is especially undesirable in neuroimage analysis/diagnosis. However, unlike problems such as registration and classification, the stability of feature selection is much less studied in this field. In this paper we propose a model to induce more stable feature selection from high dimensional brain structural Magnetic Resonance Imaging (sMRI) images.
Besides sparsity, the proposed model harnesses two important additional pathological priors in brain sMRI: (i) the spatial cohesion of lesion voxels (via inducing fusion terms) and (ii) the positive correlation between the features and the disease labels. The correlation prior is based on the observation that in many brain image analysis problems (such as AD, frontotemporal dementia, corticobasal degeneration, etc.), there exist strong correlations between the features and the labels. For example, the gray matter of AD patients is degenerated/atrophied. Therefore, the gray matter values (indicating the volume) are positively correlated with the cognitive scores or disease labels {-1, 1}; that is, the less gray matter, the lower the cognitive score. Accordingly, we impose nonnegative constraints on the variables to enforce this prior and name the model the non-negative Generalized Fused Lasso (n2GFL). It extends the popular generalized fused lasso and enables it to explore the intrinsic structure of data via selecting stable features. To measure feature stability, we introduce the Estimation Stability recently proposed in (Yu 2013) and the (multi-set) Dice coefficient (Dice 1945). Experiments demonstrate that, compared with existing models, our model selects much more stable (and pathological-prior consistent) voxels. It is worth mentioning that non-negativity per se is a very important prior in many practical problems, e.g. (Lee and Seung 1999). Although n2GFL is proposed to solve the diagnosis of AD in this work, the model can be applied to more general problems.

Incorporating these priors makes the problem novel w.r.t. the lasso or generalized fused lasso from an optimization standpoint. Although off-the-shelf convex solvers such as CVX (Grant and Boyd 2013) can be applied to solve the optimization, they hardly scale to high-dimensional problems in feasible time. In this regard, we propose an efficient algorithm that solves the n2GFL problem exactly. We generalize the proximal gradient methods (such as FISTA) (Beck and Teboulle 2009) to solve our constrained optimization and prove its convergence. We then show that, by using an element-wise post-processing, the resulting proximal operator can be reduced to the total variation (TV) problem. It is known that TV can be solved by parametric flow algorithms (Chambolle and Darbon 2009; Xin et al. 2014). In the present study, we provide a novel equivalence via conic duality, which gives us a minimum quadratic cost flow formulation (Hochbaum and Hong 1995). Fast flow algorithms (including parametric flow) are then easily applied. In practice, our algorithm runs hundreds of times faster than CVX at the same precision and can scale to high-dimensional problems.

**Related work.** In addition to sparsity, people leverage underlying data structures and introduce stronger priors such as structured sparsity (Jacob, Obozinski, and Vert 2009) to increase model stability. However, for voxel-based sMRI data analysis, handcrafted grouping of the voxels or sub-structures may not coincide with the various pathological topology priors. Consequently, group lasso (with overlap) (Jacob, Obozinski, and Vert 2009; Jenatton et al. 2012; Rao et al. 2013) is not an ideal model for the problem. In contrast, graph-based structured sparse models adapt better to such a situation. The most popular one is referred to here as Lap-L, which adopts l2-norm regularization of the differences of neighboring variables
(e.g. (Ng and Abugharbieh 2011; Grosenick et al. 2013)); although different names are given in these works, they are fundamentally applying graph Laplacian smoothing. However, as we will show in the experiments, these models select many more features than necessary. Very recently, the generalized fused lasso (or total variation), which induces l1 differences, has been successfully applied to brain image analysis problems (Gramfort, Thirion, and Varoquaux 2013; Xin et al. 2014). In the experiments, we show that by including an extra nonnegative constraint, the features selected by our model are much more stable than those of such unconstrained models. A very recent work (Avants et al. 2014) also explored this positive correlation (partially supporting our assumption), but the problem formulation was quite different: neither a structural assumption was considered, nor was the stability of feature selection discussed. From the optimization standpoint, the applied framework is similar to that of (Xin et al. 2014), but two key differences exist: (1) the FISTA and soft-thresholding process applied in (Xin et al. 2014) do not generalize to constrained optimization problems; we show the necessary modifications and provide theoretical proof; (2) we propose a novel understanding of TV's relation to flow problems via conic duality and prove that the minimum norm point problem of (Xin et al. 2014) is a special case of our framework.

## The Proposed Method

### Nonnegative Generalized Fused Lasso (n2GFL)

Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a set of samples, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$ are features and labels, respectively. Also, we denote by $X \in \mathbb{R}^{d \times N}$ and $y \in \mathbb{R}^N$ the concatenations of $x_i$ and $y_i$. Then, we consider the formulation

$$\min_{\beta \in \mathbb{R}^d} \; l(\beta; X, y) + \lambda_1 \sum_{i=1}^{d} |\beta_i| + \lambda_2 \sum_{(i,j) \in E} w_{ij} |\beta_i - \beta_j|, \quad \text{s.t. } \beta \geq 0, \tag{1}$$

where $\lambda_1, \lambda_2 \geq 0$ are tuning parameters and $l$ is a loss term in the variable $\beta$ (assumed to be convex and smooth). The $w_{ij}$ are pre-defined weights. Here, the variables (e.g. the sMRI voxel parameters in the AD problem) are supposed to have a certain underlying structure represented by a graph $G = (V, E)$ with nodes $V$ and edges $E$; each variable corresponds to a node on the graph. As mentioned above, in many brain image analyses there exist strong directional correlations (positive or negative) between the features and the labels, thus we assume $\beta \geq 0$ (or $\beta \leq 0$). Due to the l1 penalties on each variable as well as on each adjacent pair of variables in (1), solutions tend to be both sparse and smooth, i.e., adjacent variables tend to be similar and spatially coherent. Also, because we have added the nonnegative constraints, the model will not select negatively correlated features as support. In practice, we notice that unconstrained models systematically select many negatively correlated features. The nonnegative constraints greatly reduce these falsely recovered variables and encourage genuine disease-related features to be selected.

### Efficient Optimization of n2GFL

The optimization of n2GFL is convex, and off-the-shelf solvers such as CVX can be applied. However, such a solution hardly scales to problems with dimensionality in the thousands (mainly due to its choice of general second order frameworks); see Table 1. In this regard, we propose certain modifications to scalable first order methods (e.g. accelerated proximal methods (Beck and Teboulle 2009)). This is done by exploiting the Lagrange multiplier method to deal with the constraints.
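To make the formulation concrete, the following minimal NumPy sketch evaluates the n2GFL objective (1). The squared loss, the edge-list encoding of $E$ and all names here are illustrative assumptions, not the paper's code (the AD experiments later use a logistic loss).

```python
import numpy as np

def n2gfl_objective(beta, X, y, E, w, lam1, lam2):
    """Evaluate the n2GFL objective (1) with a squared loss.

    beta : (d,) nonnegative coefficient vector
    X    : (d, N) feature matrix (columns are samples)
    y    : (N,) labels
    E    : (m, 2) integer array of graph edges (i, j)
    w    : (m,) edge weights w_ij
    """
    assert np.all(beta >= 0), "n2GFL requires beta >= 0"
    loss = 0.5 * np.sum((X.T @ beta - y) ** 2)                         # smooth convex loss l
    sparsity = lam1 * np.sum(np.abs(beta))                             # lasso term
    fusion = lam2 * np.sum(w * np.abs(beta[E[:, 0]] - beta[E[:, 1]]))  # fused term
    return loss + sparsity + fusion
```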
From the optimization standpoint, these modifications are nontrivial and constitute one major contribution of this work. We first extend the (fast) iterative shrinkage-thresholding algorithms (ISTA and FISTA) (Beck and Teboulle 2009) as follows.

**Proposition 1.** Let $\beta^*$ be the optimal solution to (1) and let $\beta^k$ be defined by

$$\beta^{k+1} = \arg\min_{\beta \in \mathbb{R}^d} \; \frac{1}{2}\|\beta - z^k\|_2^2 + \frac{\lambda_1}{L}\sum_{i=1}^{d}|\beta_i| + \frac{\lambda_2}{L}\sum_{(i,j)\in E} w_{ij}|\beta_i - \beta_j|, \quad \text{s.t. } \beta \geq 0, \tag{2}$$

where $L > 0$ is the Lipschitz constant of $\nabla l(\cdot)$ and $k$ is the iteration number. If $z^k = \beta^k - \frac{1}{L}\nabla l(\beta^k)$, then

$$F(\beta^k) - F(\beta^*) \leq \frac{\alpha L\|\beta^0 - \beta^*\|_2^2}{2k},$$

where $F(\cdot)$ is the objective of (1). If $z^k = y^k - \frac{1}{L}\nabla l(y^k)$, where $y^k = \beta^k + \alpha_k(\beta^k - \beta^{k-1})$ with $\alpha_k$ controlling the momentum, we have

$$F(\beta^k) - F(\beta^*) \leq \frac{2\alpha L\|\beta^0 - \beta^*\|_2^2}{(k+1)^2}.$$

The proof can be viewed as an instantiation of the convex analysis introduced in (Nesterov 2004); we provide a rigorous proof in the supplementary file. Now the key to solving (1) is how efficiently we can solve (2). If there were no constraints, (2) would be the fused lasso signal approximation proposed in (Friedman et al. 2007), where it was shown that, by utilizing the separability of the l1 norm, an element-wise soft-thresholding technique can be applied to remove the sparse term. Since the constraints of (2) are also separable, we show that (2) can be further reduced likewise.

**Proposition 2.** If we define

$$\tilde{\beta} = \arg\min_{\beta \in \mathbb{R}^d} \; \frac{1}{2}\|\beta - z\|_2^2 + \frac{\lambda_2}{L}\sum_{(i,j)\in E} w_{ij}|\beta_i - \beta_j|, \tag{3}$$

then the optimal solution to (2) (denoted $\beta^*$) can be obtained from $\tilde{\beta}$ by the element-wise post-processing

$$\beta^* = \max\Big(\operatorname{sign}(\tilde{\beta}) \odot \max\big(|\tilde{\beta}| - \tfrac{\lambda_1}{L},\, 0\big),\; 0\Big), \tag{4}$$

where $\odot$ is the element-wise product.

*Proof.* Define $\theta_i = \frac{\lambda_1}{L}$ for all $i \in V$ and $\theta_{ij} = \frac{\lambda_2 w_{ij}}{L}$ for all $(i,j) \in E$. Denote by $\bar{\beta}$ the optimal solution of the unconstrained version of (2). According to (Friedman et al. 2007), $\bar{\beta}_i = \operatorname{sign}(\tilde{\beta}_i)\max(|\tilde{\beta}_i| - \theta_i, 0)$. We now consider the nonnegative constraints in (2). By the Karush-Kuhn-Tucker (KKT) conditions, the necessary and sufficient conditions on $\beta^*_1, \ldots, \beta^*_d$ are

$$L g_i = (\beta_i - z_i) + \theta_i s_i + \sum_{j:(i,j)\in E}\theta_{ij} t_{ij} - \sum_{j:(j,i)\in E}\theta_{ij} t_{ji} - \alpha_i = 0, \quad \text{s.t. } \alpha_i\beta_i = 0, \tag{5}$$

where $\alpha_i \geq 0$ are the Lagrange multipliers and $s, t$ are subgradients: $s_i = \operatorname{sign}(\beta_i)$ if $\beta_i \neq 0$ and $s_i \in [-1,1]$ if $\beta_i = 0$; $t_{ij} = \operatorname{sign}(\beta_i - \beta_j)$ if $\beta_i \neq \beta_j$ and $t_{ij} \in [-1,1]$ if $\beta_i = \beta_j$. The first equation in (5) sets the derivative of the Lagrange function to zero, and the constraint equations follow from complementary slackness. We consider two cases.

Case 1 ($\bar{\beta}_i \geq 0$): by setting $\alpha_i = 0$, $\beta^*_i = \bar{\beta}_i \geq 0$ satisfies the conditions in (5); thus $\beta^*_i = \bar{\beta}_i = \operatorname{sign}(\tilde{\beta}_i)\max(|\tilde{\beta}_i| - \theta_i, 0)$ is the solution of (2).

Case 2 ($\bar{\beta}_i < 0$): we can set $\beta^*_i = 0$ and $\alpha_i = -\bar{\beta}_i > 0$, so that $\beta^*_i = \bar{\beta}_i + \alpha_i$. Then

$$L g_i = (\beta^*_i - z_i) + \theta_i s_i + \sum_{j:(i,j)\in E}\theta_{ij} t_{ij} - \sum_{j:(j,i)\in E}\theta_{ij} t_{ji} - \alpha_i = (\bar{\beta}_i - z_i) + \theta_i s_i + \sum_{j:(i,j)\in E}\theta_{ij} t_{ij} - \sum_{j:(j,i)\in E}\theta_{ij} t_{ji} = 0.$$

Hence, in summary, $\beta^* = \max(\operatorname{sign}(\tilde{\beta}) \odot \max(|\tilde{\beta}| - \frac{\lambda_1}{L}, 0), 0)$. ∎

Notice that (3) is a (continuous) total variation problem, which is known to be efficiently solvable by parametric flow algorithms (Chambolle and Darbon 2009; Xin et al. 2014). Here we present a more general perspective on this equivalence via conic duality, which yields a natural and novel minimum quadratic cost flow formulation. Fast flow algorithms, such as but not limited to parametric flow (Gallo, Grigoriadis, and Tarjan 1989), can then be applied. For example, we show that the minimum norm point problem solved by parametric flow in (Xin et al. 2014) can be viewed as a special case of the proposed dual.
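Propositions 1 and 2 translate directly into a short iteration. In the sketch below, `prox_tv` is a placeholder for any routine solving the TV problem (3) (e.g. the flow construction of the next section), and the momentum schedule shown is the standard FISTA one; both are stand-ins under stated assumptions, not the paper's implementation.

```python
import numpy as np

def nonneg_prox_step(z, lam1, lam2, L, prox_tv):
    """One proximal step (2) via Proposition 2.

    z       : (d,) gradient-step point z^k
    prox_tv : callable solving (3), i.e.
              argmin_b 0.5*||b - z||^2 + (lam2/L) * sum_ij w_ij |b_i - b_j|
    """
    beta_tilde = prox_tv(z, lam2 / L)                        # TV prox, eq. (3)
    soft = np.sign(beta_tilde) * np.maximum(np.abs(beta_tilde) - lam1 / L, 0.0)
    return np.maximum(soft, 0.0)                             # eq. (4): clamp at zero

def n2gfl_fista(grad_l, d, lam1, lam2, L, prox_tv, iters=200):
    """Accelerated scheme of Proposition 1 (FISTA with the nonnegative prox)."""
    beta = np.zeros(d)
    y_k, t = beta.copy(), 1.0
    for _ in range(iters):
        z = y_k - grad_l(y_k) / L                            # gradient step on l
        beta_next = nonneg_prox_step(z, lam1, lam2, L, prox_tv)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0    # momentum schedule
        y_k = beta_next + ((t - 1.0) / t_next) * (beta_next - beta)
        beta, t = beta_next, t_next
    return beta
```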
### Conic Dual to Total Variation

To solve (3), we apply generalized inequalities and the corresponding Lagrange duality introduced in (Boyd and Vandenberghe 2004). Specifically, we first define the set

$$C = \{(\beta, \alpha) \in \mathbb{R}^{d+1} \mid \forall (i,j),\; |\beta_i - \beta_j| \leq \alpha\}.$$

**Lemma 3.** The set $C$ is a proper cone.

This can easily be shown by checking all the properties required of a proper cone; see the supplementary file for a proof. We now consider the following problem (we keep using $\theta_{ij} = \frac{\lambda_2 w_{ij}}{L}$):

$$\min_{\{\beta^{ij}\in\mathbb{R}^d,\,\alpha_{ij}\in\mathbb{R}\}_{(i,j)\in E}} \frac{1}{2}\|\beta - z\|_2^2 + \sum_{(i,j)\in E}\theta_{ij}\alpha_{ij} \quad \text{s.t. } (\beta^{ij}, \alpha_{ij}) \in C \text{ and } \beta^{ij}_k = \begin{cases}\beta_k & k = i, j\\ 0 & \text{else.}\end{cases} \tag{6}$$

Since $(\beta^{ij}, \alpha_{ij}) \in C$, we have $|\beta_i - \beta_j| \leq \alpha_{ij}$; therefore (6) is indeed equivalent to (3). Moreover, because $C$ is a proper cone, we can rewrite (6) as

$$\min_{\{\beta^{ij}\in\mathbb{R}^d,\,\alpha_{ij}\in\mathbb{R}\}_{(i,j)\in E}} \frac{1}{2}\|\beta - z\|_2^2 + \sum_{(i,j)\in E}\theta_{ij}\alpha_{ij} \quad \text{s.t. } (\beta^{ij}, \alpha_{ij}) \succeq_C 0 \text{ and } \beta^{ij}_k = \begin{cases}\beta_k & k = i, j\\ 0 & \text{else,}\end{cases} \tag{7}$$

where $x \succeq_C 0 \Leftrightarrow x \in C$ is the generalized inequality (Boyd and Vandenberghe 2004). We call (7) the primal problem (which equals the original TV problem). Since the primal problem is convex and satisfies Slater's condition, strong Lagrange duality holds (under the generalized inequality). We define the Lagrange function as

$$\mathcal{L}(\beta, \alpha, \xi, \tau) = \frac{1}{2}\|\beta - z\|_2^2 + \sum_{(i,j)\in E}\theta_{ij}\alpha_{ij} - \sum_{(i,j)\in E}\left((\xi^{ij})^T\beta^{ij} + \tau^{ij}\alpha_{ij}\right), \tag{8}$$

where $(\xi^{ij}, \tau^{ij})$ are the Lagrange multipliers, with $(\xi^{ij}, \tau^{ij}) \succeq_{C^*} 0$ and $\xi^{ij}_k = 0$ if $k \neq i, j$, and $C^*$ is the dual cone of $C$, defined as $C^* = \{v \mid w^T v \geq 0, \; \forall w \in C\}$. To formulate the dual problem, we take the derivative of $\mathcal{L}(\cdot)$ with respect to the primal variables $(\beta, \alpha)$, which gives

$$\beta - z - \sum_{(i,j)\in E}\xi^{ij} = 0 \quad \text{and} \quad \theta_{ij} - \tau^{ij} = 0. \tag{9}$$

By applying (9) to (8), the dual problem is written as

$$\max_{\{\xi^{ij}\in\mathbb{R}^d\}_{(i,j)\in E}} \; -\frac{1}{2}\Big\|\sum_{(i,j)\in E}\xi^{ij}\Big\|_2^2 - z^T\sum_{(i,j)\in E}\xi^{ij} \quad \text{s.t. } (\xi^{ij}, \tau^{ij}) \succeq_{C^*} 0;\; \xi^{ij}_k = 0,\; \forall k \neq i, j. \tag{10}$$

**Proposition 4.** $(\xi^{ij}, \tau^{ij}) \succeq_{C^*} 0$ and $\xi^{ij}_k = 0, \; \forall k \neq i,j$ $\iff$ $\xi^{ij}_i + \xi^{ij}_j = 0$ and $|\xi^{ij}_i| \leq \theta_{ij}$.

Please find the proof in the supplementary file. Accordingly, the dual problem becomes

$$\min_{\{\xi^{ij}\in\mathbb{R}^d\}_{(i,j)\in E}} \frac{1}{2}\Big\|z - \sum_{(i,j)\in E}\xi^{ij}\Big\|_2^2 \quad \text{s.t. } \xi^{ij}_k = 0 \;\forall k \neq i,j, \quad \xi^{ij}_i + \xi^{ij}_j = 0, \quad |\xi^{ij}_i| \leq \theta_{ij}, \tag{11}$$

where we omit $\|z\|_2^2$ from the objective since it is constant with respect to the $\xi$s, and we have changed the sign of all $\xi$s for a better illustration of the flows. Problem (11) can be viewed as the following minimum quadratic cost flow formulation:

$$\min_{\{\xi^{ij}\in\mathbb{R}^d\}_{(i,j)\in E}} \frac{1}{2}\Big\|y - \Big(\sum_{(i,j)\in E}\xi^{ij} + \gamma\Big)\Big\|_2^2 \quad \text{s.t. } \xi^{ij}_k = 0 \;\forall k \neq i,j, \quad \xi^{ij}_i + \xi^{ij}_j = 0, \quad |\xi^{ij}_i| \leq \theta_{ij}, \tag{12}$$

where we have introduced $\gamma$ (with $\gamma_i = \max\{|z_i|, \sum_{j:(i,j)\in E}\theta_{ij}\}$) and denote $y = z + \gamma$ to ensure $y \geq 0$ and $\sum_{(i,j)\in E}\xi^{ij} + \gamma \geq 0$. Thus each feasible $\xi$ of (12) is a possible flow on the graph $G = (V, E)$. Since $\xi^{ij}_i + \xi^{ij}_j = 0$, $|\xi^{ij}_i|$ can denote a flow on edge $(i,j)$ such that $\xi^{ij}_i > 0$ denotes a flow coming into node $i$ and $\xi^{ij}_i < 0$ denotes a flow leaving node $i$.

Figure 1: Graph structure and the flows. Each source-to-node edge (s, i) has zero cost and both maximum and minimum capacity $\gamma_i$, which ensures a flow of exactly $\gamma_i$; each node-to-node edge (i, j) has zero cost and maximum capacity $\theta_{ij}$; each node-to-sink edge (i, t) has infinite capacity and cost (in red) $\frac{1}{2}\big(y_i - (\bar{\xi}_i + \gamma_i)\big)^2$, where $\bar{\xi}_i = \sum_{j:(i,j)\in E}\xi^{ij}_i$. All flows through node 2 are highlighted in blue as an example.

Figure 1 illustrates such flows, taking node 2 as an example. Minimizing the objective of (12) is thus equivalent to computing a minimum cost flow on this graph. Since the cost is quadratic in the flow, this problem is a minimum quadratic cost flow problem.
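As a numerical sanity check on this conic dual, one can solve (3) and (11) on a toy chain graph with a generic solver and confirm that the primal solution is recovered as $\beta^* = z - \sum_{(i,j)}\xi^{ij}$ (stationarity (9), after the sign change). The following sketch uses CVXPY, standing in for the fast flow solver purely for illustration; the graph, $z$ and $\theta$ are made-up toy values.

```python
import numpy as np
import cvxpy as cp

# Tiny chain graph: 5 nodes, unit-weight edges.
z = np.array([3.0, -1.0, 0.5, 2.0, -0.5])
E = [(0, 1), (1, 2), (2, 3), (3, 4)]
theta = 0.8  # theta_ij = lam2 * w_ij / L, constant here for simplicity

# Primal TV problem (3).
beta = cp.Variable(5)
tv = sum(cp.abs(beta[i] - beta[j]) for i, j in E)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta - z) + theta * tv)).solve()

# Dual (11): one scalar per edge with |xi_e| <= theta, entering node i and
# leaving node j with opposite signs (xi^ij_i + xi^ij_j = 0).
xi = cp.Variable(len(E))
I = np.eye(5)
agg = sum(xi[e] * (I[i] - I[j]) for e, (i, j) in enumerate(E))
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(z - agg)),
           [cp.abs(xi) <= theta]).solve()

# Recover the primal from the dual: beta* = z - sum_{(i,j)} xi^ij.
print(np.allclose(beta.value, z - agg.value, atol=1e-4))  # expected: True
```

Each edge carries one free scalar whose two nonzero coordinates are opposite in sign, which is exactly the flow interpretation above.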
According to (Hochbaum and Hong 1995; Mairal et al. 2011), this type of problem can be efficiently solved via fast flow algorithms, including but not limited to parametric flow (Gallo, Grigoriadis, and Tarjan 1989).

Note that in (Xin et al. 2014), TV is shown to be equivalent to a minimum norm point (MNP) problem under submodular constraints, which is solved via parametric flow. We now discuss the relation between the dual problem, i.e. (12) or (11), and the MNP problem considered in (Xin et al. 2014). Recall that the MNP problem is defined as

$$\min_{s \in \mathbb{R}^d,\; s \in B(\lambda f_c)} \|z - s\|_2^2, \tag{13}$$

where $f_c(S)$ is a cut function, defined as $f_c(S) = \sum_{i \in S,\, j \in V\setminus S} w_{ij}$, and $B(\cdot)$ is the base polyhedron of $f_c$.

**Proposition 5.** For any minimizer $\xi^*$ of (11), define $\hat{s} = \sum_{(i,j)\in E}\xi^{*ij}$; then $\hat{s}$ is a minimizer of (13). Conversely, for any minimizer $s^*$ of (13), there exists a decomposition $s^* = \sum_{(i,j)\in E}\hat{\xi}^{ij}$, where $\hat{\xi}$ is a minimizer of (11).

According to Prop. 5, the MNP problem (13) can be viewed as a special case of (11) (the conic dual), where $\sum_{(i,j)\in E}\xi^{ij} = s$. Moreover, since (11) has relatively looser constraints, it is possible to devise more efficient algorithms (than parametric flow) to solve (11) and thereby TV. For example, in (Mairal et al. 2011), a flow algorithm faster than parametric flow is proposed to solve their specific minimum quadratic cost flow problem. Hence, the conic dual perspective opens a new opportunity to solve the well-known TV problem more efficiently.

**Optimization summary.** In summary, by applying Prop. 1, we can solve n2GFL by iteratively solving (2). By applying Prop. 2, we further reduce (2) to the TV problem defined in (3); we then transform it to a minimum quadratic cost flow problem via conic duality and solve it with a fast flow algorithm. In Table 1, we compare the proposed algorithm with an off-the-shelf solver on synthetic data. We generate a random $\beta \in \mathbb{R}^d$ and a 2D grid graph of $d$ nodes with each node having four neighbors. We then generate $N = d/2$ samples: $x_i \in \mathbb{R}^d$ and $y_i = \beta^T x_i + 0.01 n_i$, where $x_i$ and $n_i$ are drawn from the standard normal distribution. All experiments are carried out on an Intel(R) Core(TM) i7-3770 CPU at 3.40GHz. The experiments show that the proposed optimization algorithm is more efficient and scalable.

Table 1: Runtime (in sec.) comparison of the proposed algorithm with CVX. d is the data dimensionality. "–" indicates the test did not finish within 24 hours.

| d | 400 | 900 | 2500 | 4900 | 10000 |
|---|---|---|---|---|---|
| CVX | 5.22 | 33.71 | 889.80 | 1.08e4 | – |
| Ours | 0.87 | 2.39 | 15.95 | 64.33 | 321.93 |

## Application to the Diagnosis of AD

In the diagnosis of AD, two fundamental tasks are AD/NC (Normal Control) classification and MCI/NC (Mild Cognitive Impairment) classification. Let $x_i \in \mathbb{R}^d$ be a subject's sMRI voxels and $y_i \in \{-1, 1\}$ be the disease status (AD/NC or MCI/NC). Since the problems are classifications, we use the logistic regression loss

$$l(\beta; X, y) = \sum_{i=1}^{N} \log\left(1 + \exp\left(-y_i(\beta^T x_i + c)\right)\right), \tag{14}$$

where $c \in \mathbb{R}$ is the bias parameter (to be learned). For the graph structure, we define each voxel as a node and spatial adjacency as the edges, i.e. $w_{ij} = 1$ if voxels $i$ and $j$ are adjacent and 0 otherwise.
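For use in the FISTA scheme of Prop. 1, the smooth loss (14) and its gradient can be written as below. Folding the bias $c$ into an extra coordinate is our simplification for brevity (the nonnegativity clamp in (4) would then be applied to the first $d$ coordinates only, leaving $c$ free); this is a standard logistic-loss sketch, not the paper's exact code.

```python
import numpy as np

def logistic_loss_grad(beta_c, X, y):
    """Logistic loss (14) and its gradient.

    beta_c : (d+1,) vector; the last entry is the bias c
    X      : (d, N) voxel features (columns are subjects)
    y      : (N,) labels in {-1, +1}
    """
    Xa = np.vstack([X, np.ones((1, X.shape[1]))])  # row of ones absorbs the bias c
    m = y * (Xa.T @ beta_c)                        # margins y_i (beta^T x_i + c)
    loss = np.sum(np.logaddexp(0.0, -m))           # sum_i log(1 + exp(-m_i))
    s = 1.0 / (1.0 + np.exp(m))                    # -dL/dm_i = 1 / (1 + e^{m_i})
    grad = -Xa @ (y * s)
    return loss, grad
```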
The data are obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.ucla.edu). We split all the baseline data into 1.5T and 3.0T MRI scan datasets (named 15T and 30T). 64 AD patients, 90 NC and 208 MCI patients are included in our 15T dataset; 66 AD patients and 110 NC are included in our 30T dataset. (Most 30T MCI data are in an on-going phase and are not included.) Data preprocessing follows the DARTEL VBM pipeline (Ashburner et al. 2007), as commonly done in the literature. 2,527 voxels of size 8 × 8 × 8 mm³ whose values are greater than 0.2 in the mean gray matter population template serve as the input features. We design experiments on three tasks, namely 15ADNC, 30ADNC and 15MCINC.

**Classification Accuracy.** 10-fold cross-validation (CV) is applied, and the classification accuracies for all tasks are summarized in Table 2. Under exactly the same experimental setup, we compare n2GFL with state-of-the-art classifiers: logistic regression (LR), SVM, sparse models such as the lasso and its graph Laplacian structured variant Lap-L, the unconstrained GFL (Xin et al. 2014), and the MLDA model (Dai et al. 2012), which applies a variant of Fisher Discriminant Analysis after univariate feature selection (via T-test). For each model, we used grid search to find the optimal parameters. Note that our accuracies may not be superior to those of the recent work (Liu et al. 2014); the main reason is that (Liu et al. 2014) uses multi-modality data (including PET as well as sMRI). Nevertheless, Table 2 demonstrates that n2GFL outperforms all the other models using only voxel-based sMRI data.

Table 2: Classification accuracies (%).

| Task | SVM | LR | MLDA | Lap-L | lasso | GFL | n2GFL |
|---|---|---|---|---|---|---|---|
| 15ADNC | 83.1 | 83.1 | 83.1 | 84.4 | 85.7 | 85.1 | 86.4 |
| 30ADNC | 87.5 | 86.9 | 83.5 | 87.5 | 88.6 | 85.8 | 90.3 |
| 15MCINC | 71.1 | 70.5 | 59.4 | 70.1 | 70.8 | 69.8 | 71.8 |

**Feature selection.** For each task, the selected features are those whose $\beta_i$ are nonzero. In Figure 2, the result of 30ADNC is used to illustrate the feature selection by different models (using the parameters at their best accuracy). As shown, the voxels selected by both GFL and n2GFL cluster into several spatially connected regions, whereas those of the lasso and T-test/MLDA scatter around. Also, as mentioned before, Lap-L tends to select many more voxels than necessary due to the l2 regularization. Moreover, the voxels selected by GFL and n2GFL are concentrated in the hippocampus and parahippocampal gyrus (which are believed to be the early damaged regions). On the other hand, the lasso and T-test/MLDA either select fewer lesion voxels or select probably noisy voxels outside the early damaged regions.

Figure 2: Feature selection by different models. The top row illustrates selected voxels in a 3D model (voxels with positive β in brown, negative ones in blue); the middle and bottom rows illustrate the corresponding projections on brain slices.

**Feature Stability.** In Figure 3, we show the selected voxels across different folds of CV. (Here, parameters were determined by accuracy; similar results were observed using parameters producing the same level of sparsity.) As shown, the voxels selected by the lasso vary considerably across folds, whereas those selected by GFL are more stable. By further assuming the positive correlation between the features and the disease labels, n2GFL increases the stability yet again.

Figure 3: Stability of selected voxels across different folds of the cross-validation. The results of 5 different folds are shown in (a)-(e); voxels with positive β are in brown, negative ones in blue. The common voxels selected in all 10 folds are shown in green in (f). The top row shows voxels selected by the lasso, the middle row those of GFL, and the bottom row those of n2GFL.

To quantitatively evaluate the stability gain, we denote the variables of the k-th fold of CV as $\beta^{(k)}$ and introduce two measurements. In (Yu 2013), the Estimation Stability (ES) is proposed to measure the stability of the estimation:

$$\mathrm{ES} = \frac{\frac{1}{K}\sum_{k=1}^{K}\|X\beta^{(k)} - X\bar{\beta}\|_2^2}{\|X\bar{\beta}\|_2^2}, \tag{15}$$

where $\bar{\beta} = \sum_{k=1}^{K}\beta^{(k)}/K$. It is shown in (Yu 2013) that ES is a fair measurement of estimation stability.
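Both stability measures are simple to compute once the per-fold solutions are collected. Below is a minimal sketch of ES as reconstructed in (15), together with the multi-set Dice coefficient mDC defined next in (16); the variable names are our own.

```python
import numpy as np

def estimation_stability(X, betas):
    """ES of (15). X is (d, N); betas is a list of K per-fold solutions."""
    B = np.stack(betas)                      # (K, d)
    preds = B @ X                            # (K, N); row k holds the predictions X beta^(k)
    mean_pred = preds.mean(axis=0)           # predictions of beta_bar
    return np.mean(np.sum((preds - mean_pred) ** 2, axis=1)) / np.sum(mean_pred ** 2)

def multiset_dice(betas):
    """mDC of (16): K * |intersection of supports| over the summed support sizes."""
    supports = [set(np.flatnonzero(b)) for b in betas]
    common = set.intersection(*supports)
    return len(supports) * len(common) / sum(len(s) for s in supports)
```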
To further understand the stability of feature selection, we also extend the Dice coefficient (Dice 1945) to multiple sets and apply the multi-set Dice Coefficient (mDC) as a measurement. We denote the set $S^{(k)} = \{i : \beta_i^{(k)} \neq 0\}$ and define

$$\mathrm{mDC} = \frac{K \,\#\big(\bigcap_{k=1}^{K} S^{(k)}\big)}{\sum_{k=1}^{K}\#(S^{(k)})}, \tag{16}$$

where $\#$ denotes the number of elements in a set. In Table 3, both measurements quantitatively suggest that n2GFL obtains much more stable voxels, owing to the consideration of the correlation between the features and the disease labels. (We note that in (Xin et al. 2014) the stability is computed using only the top 50 positive voxels, because these voxels are believed to be the most atrophied ones. When the stability is computed over all non-zero voxels, the mDC of GFL drops by around 30%, which clearly shows that the instability is largely caused by the undesirable voxels that disagree with the correlation prior, i.e. the scattered blue voxels in the middle row of Figure 3.)

Table 3: Stability comparison of the models.

| | lasso | GFL | n2GFL |
|---|---|---|---|
| ES (smaller is better) | 0.035 | 0.033 | 0.022 |
| mDC (larger is better) | 0.267 | 0.374 | 0.644 |

## Conclusions

In this paper, we explore the nonnegative generalized fused lasso model to address an important problem of neuroimage analysis, i.e. the stability of feature selection. Experiments show that our model greatly improves the stability of feature selection over existing methods for brain image analysis. Although n2GFL is applied here to the diagnosis of AD, it can be applied to more general problems. Moreover, we believe that the theoretical points made here, e.g. the nonnegative FISTA, the soft-thresholding reduction and the conic dual of TV, provide motivation for future work of general interest.

## Acknowledgments

This work was supported in part by Natural Science Foundation of China (NSFC) grants 973-2015CB351800, NSFC-61272027, NSFC-61231010, NSFC-61121002 and NSFC-61210005.

## References

Ashburner, J., et al. 2007. A fast diffeomorphic image registration algorithm. NeuroImage 38(1):95–113.

Avants, B. B.; Libon, D. J.; Rascovsky, K.; Boller, A.; McMillan, C. T.; Massimo, L.; Coslett, H.; Chatterjee, A.; Gross, R. G.; and Grossman, M. 2014. Sparse canonical correlation analysis relates network-level atrophy to multivariate cognitive measures in a neurodegenerative population. NeuroImage 84:698–711.

Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2(1):183–202.

Boyd, S. P., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Chambolle, A., and Darbon, J. 2009. On total variation minimization and surface evolution using parametric maximum flows. International Journal of Computer Vision 84(3):288–307.

Dai, Z.; Yan, C.; Wang, Z.; Wang, J.; Xia, M.; Li, K.; and He, Y. 2012. Discriminative analysis of early Alzheimer's disease using multi-modal imaging and multi-level characterization with multi-classifier (M3). NeuroImage 59(3):2187–2195.

Dice, L. R. 1945. Measures of the amount of ecologic association between species. Ecology 26(3):297–302.

Friedman, J.; Hastie, T.; Höfling, H.; and Tibshirani, R. 2007. Pathwise coordinate optimization. The Annals of Applied Statistics 1(2):302–332.

Gallo, G.; Grigoriadis, M.; and Tarjan, R. 1989. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing 18(1):30–55.

Gramfort, A.; Thirion, B.; and Varoquaux, G. 2013. Identifying predictive regions from fMRI with TV-L1 prior. In Pattern Recognition in Neuroimaging (PRNI), 2013 International Workshop on, 17–20. IEEE.
Grant, M., and Boyd, S. 2013. CVX: MATLAB software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx.

Grosenick, L.; Klingenberg, B.; Katovich, K.; Knutson, B.; and Taylor, J. E. 2013. Interpretable whole-brain prediction analysis with GraphNet. NeuroImage 72:304–321.

Hochbaum, D. S., and Hong, S. 1995. About strongly polynomial time algorithms for quadratic optimization over submodular constraints. Mathematical Programming 69(1-3):269–309.

Jacob, L.; Obozinski, G.; and Vert, J.-P. 2009. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, 433–440. ACM.

Jenatton, R.; Gramfort, A.; Michel, V.; Obozinski, G.; Eger, E.; Bach, F.; and Thirion, B. 2012. Multiscale mining of fMRI data with hierarchical structured sparsity. SIAM Journal on Imaging Sciences 5(3):835–856.

Lee, D. D., and Seung, H. S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791.

Liu, F.; Wee, C.-Y.; Chen, H.; and Shen, D. 2014. Inter-modality relationship constrained multi-modality multi-task feature selection for Alzheimer's disease and mild cognitive impairment identification. NeuroImage 84:466–475.

Liu, M.; Zhang, D.; and Shen, D. 2012. Ensemble sparse classification of Alzheimer's disease. NeuroImage 60(2):1106–1116.

Mairal, J.; Jenatton, R.; Obozinski, G.; and Bach, F. 2011. Convex and network flow optimization for structured sparsity. The Journal of Machine Learning Research 12:2681–2720.

Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer.

Ng, B., and Abugharbieh, R. 2011. Generalized sparse regularization with application to fMRI brain decoding. In Information Processing in Medical Imaging, 612–623. Springer.

Rao, N.; Cox, C.; Nowak, R.; and Rogers, T. T. 2013. Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis. In Advances in Neural Information Processing Systems, 2202–2210.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58(1):267–288.

Xin, B.; Kawahara, Y.; Wang, Y.; and Gao, W. 2014. Efficient generalized fused lasso and its application to the diagnosis of Alzheimer's disease. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Yu, B. 2013. Stability. Bernoulli 19(4):1484–1500.