# Regularized Joint Mixture Models

Journal of Machine Learning Research 24 (2023) 1-47. Submitted 7/21; Revised 8/22; Published 1/23.

Konstantinos Perrakis (konstantinos.perrakis@durham.ac.uk), Department of Mathematical Sciences, Durham University, UK

Thomas Lartigue (thomas.lartigue@dzne.de), Aramis Project Team, Inria & Center of Applied Mathematics, CNRS, Ecole Polytechnique, IP Paris, France

Frank Dondelinger (fdondelinger.work@gmail.com), Lancaster Medical School, Lancaster, UK

Sach Mukherjee (sach.mukherjee@dzne.de), German Center for Neurodegenerative Diseases, Bonn, Germany & MRC Biostatistics Unit, University of Cambridge, UK

Editor: Samuel Kaski

Abstract

Regularized regression models are well studied and, under appropriate conditions, offer fast and statistically interpretable results. However, large data in many applications are heterogeneous in the sense of harboring distributional differences between latent groups. Then, the assumption that the conditional distribution of response $Y$ given features $X$ is the same for all samples may not hold. Furthermore, in scientific applications, the covariance structure of the features may contain important signals and its learning is also affected by latent group structure. We propose a class of mixture models for paired data $(X, Y)$ that couples together the distribution of $X$ (using sparse graphical models) and the conditional $Y \mid X$ (using sparse regression models). The regression and graphical models are specific to the latent groups and model parameters are estimated jointly. This allows signals in either or both of the feature distribution and regression model to inform learning of latent structure and provides automatic control of confounding by such structure. Estimation is handled via an expectation-maximization algorithm, whose convergence is established theoretically. We illustrate the key ideas via empirical examples. An R package is available at https://github.com/k-perrakis/regjmix.

Keywords: distribution shifts, heterogeneous data, joint learning, latent groups, mixture models, sparse regression

(c) 2023 Konstantinos Perrakis, Thomas Lartigue, Frank Dondelinger and Sach Mukherjee. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0796.html.

## 1. Introduction

Regularized regression models usually assume homogeneity in the sense that the same conditional distribution of a response $Y$ given features $X$ is taken to hold for all samples. In the presence of latent groups that might have different underlying conditional distributions, regression modeling may be confounded, possibly severely. Similarly, covariance structure among features can be an important signal in scientific applications, but its learning may be strongly affected by latent group structure. These issues are a concern whenever data might harbour unrecognized distributional shifts or group structure, an issue that is increasingly prominent in an era of large and often heterogeneous data. Furthermore, for heterogeneous data the two aspects, the distribution of the features $X$ and the conditional $Y \mid X$, are related in practice, since either or both may contain signals relevant to detecting and modeling group structure, which in turn is essential to overall estimation.
Motivated by such heterogeneous data settings with paired data of the form $(X, Y)$, in this paper we study a class of joint mixture models that couple together both aspects, sparse graphical models for $X$ and parsimonious regression models for $Y \mid X$, in one framework. Specifically, in high-level notation, we consider models of the form

$$
\begin{aligned}
Z &\sim \tau_Z \\
X \mid Z = k &\sim p_X(\mu_k, \Sigma_k) \\
Y \mid X, Z = k &\sim p_Y\big(g(X^T\beta_k), \sigma^2_k\big)
\end{aligned}
\tag{1}
$$

where $Z \in \{1, \ldots, K\}$ is a latent indicator of group membership with distribution $\tau_Z$, $p_W(m, s)$ denotes the probability distribution of a random variable $W$ with location $m$ and scale $s$, and $g(\cdot)$ is a link function. This work focuses on the familiar and important case where both $X$ and $Y$ are normally distributed and $g$ is the identity function. However, the key ideas apply to any model of the general form in Eq. (1). In this context the presence of group structure has the following consequences:

- Confounding due to latent groups. Associations between components of $X$ and $Y$ may be entirely different globally (with $Z$ marginalized out) vs. locally (conditionally on $Z$), with regression coefficients differing even in signs and sparsity patterns.
- Ambiguous group structure in feature space. Clustering the $X$'s alone may lead to cluster labels which do not capture the relevant structure, as instances of group signal in $X$ may be unrelated to $Y$ (e.g. clustering gene expression data may yield well-defined clusters; however, these may not relate to a specific biological/medical response).
- Group-specific signal in regression coefficients. Nonidentical coefficients or feature importance across groups provide a potentially useful discriminant signal for identifying the group structure itself. This signal cannot be detected by clustering the $X$'s.

Two common strategies given paired data $(X, Y)$ are: (S1) ignore any potential grouping and fit one regression model using the entire data, and (S2) cluster the $X$'s and then fit separate regression models to the group-specific data. Strategy (S1) is risky, since the resulting regression coefficients may be entirely incorrect if latent group structure is present (e.g. due to Simpson's paradox and related phenomena). Also, when the modeling aspect is of importance, under (S1) evaluation of predictive loss is not a satisfactory guide for model assessment, since prediction error may be apparently small despite severe model misspecification. Strategy (S2), although in some ways safer, is also not guaranteed to protect from such effects, unless the resulting group structure obtained from clustering the $X$'s is correct with reference to the overall problem; this may not hold in general. Furthermore, since (S2) models the $X$ data alone, it cannot exploit any signal in the conditionals $Y \mid X$ to guide the clustering. A common variant of (S2) is to perform a dimension reduction on $X$, such as PCA, and cluster on the reduced space. This, however, does not resolve the problem, as the major principal components may not be predictive of $Y$; see e.g. Jolliffe (1982).

### 1.1 Related work and contribution

For discrete latent variables $Z$, a standard way of approaching such heterogeneous problems is via mixture models. The literature on mixtures is vast; below we summarize related work according to model structure, covering the most popular approaches for the case of continuous responses.

Mixtures for $X \mid Z$. The most commonly used and extensively studied approach within this category is the Gaussian mixture model (GMM); see e.g., McLachlan and Peel (2000).
GMMs have undergone a series of novel developments focusing on parsimonious modeling of the covariance matrices, such as parametrizations based on eigenvalue decomposition (Banfield and Raftery, 1993; Celeux and Govaert, 1995; Fraley and Raftery, 2002), factorizations based on factor analysis models (McNicholas and Murphy, 2008), and extensions to sparse graphical model estimation (Anandkumar et al., 2012; Städler et al., 2017; Fop et al., 2019), among others. These approaches differ from ours in that they consider only the $X$ signal and do not include a regression component, thus inheriting the potential drawbacks of strategy (S2).

Mixtures for $Y \mid X, Z$. Finite mixtures of regression (FMR) models belong in this category. Similarly to GMMs, Gaussian FMRs have been studied and developed extensively, allowing for flexible modeling designs (see e.g. Frühwirth-Schnatter, 2005) and regularized estimation (Khalili and Jiahua, 2007; Städler et al., 2010; Khalili and Lin, 2013). FMRs focus on the relationship between $Y$ and $X$ without including a generative probability model for $X$. Our approach is motivated by settings in which the $X$ distribution itself is of interest and may be confounded by latent group structure. Furthermore, under FMRs a new $X^*$ cannot be allocated to one specific group and thereby used to obtain a group-specific prediction.

Mixtures for $Y, Z \mid X$. Mixtures of experts (MoE; Dayton and Macready, 1988; Jacobs et al., 1991; Jordan and Jacobs, 1994; Jacobs, 1997) jointly model the response and latent allocations. MoEs consist of expert networks (these are models that predict $Y$ from $X$) and a discriminative model (the gating network) that chooses among the experts. The parsimonious covariance parametrizations for GMMs (Banfield and Raftery, 1993; Celeux and Govaert, 1995; Fraley and Raftery, 2002) have been introduced within the MoE framework initially in Dang and McNicholas (2015) for the special case where the same set of predictors enter the expert and gating networks and, more recently, in Murphy and Murphy (2020) for the general case where different predictors are allowed to enter the two networks; the latter work also introduces an additional noise component for outlier detection. Regularized MoE approaches include those of Khalili (2010) and Chamroukhi and Huynh (2018), among others. MoEs include FMRs as a special case in the absence of a gating network. Also, similarly to FMRs, MoEs condition upon features and, thus, lack a generative model for $X$. However, unlike with FMRs, group-specific prediction of the response is possible under MoEs, as the learned gating network can be used to allocate new feature observations.

Mixtures for $Y, X \mid Z$. A first approach within this category is profile regression (Molitor et al., 2010; Liverani et al., 2015). Under profile regression, $X$ and $Y$ are conditionally independent given the latent group indicator $Z$. Specifically, the component $Y \mid Z$ involves a regression model including a profile parameter (capturing the effect of $X$) plus additional covariates, while the component $X \mid Z$ is some multivariate distribution (e.g. Gaussian). A second approach, more relevant to our work, is the cluster weighted model (CWM) mixture introduced by Ingrassia et al. (2012). In this case, we have a linear model component for $Y \mid X, Z$ and a multivariate distribution component for $X \mid Z$, following the hierarchical structure of Eq. (1). As illustrated in Ingrassia et al.
(2012), Gaussian CWMs lead to the same family of probability distributions generated by GMMs (see also below) and under specific conditions include FMRs and MoEs as special cases. Extensions of CWMs accommodate the use of GLMs and mixed-type data (Ingrassia et al., 2015; Punzo and Ingrassia, 2016), parsimonious parametrizations (similarly to GMMs and MoEs) of covariances (Dang et al., 2017) and latent factor structures for the feature matrix (Subedi et al., 2013), among others.

The class of models proposed here, henceforth referred to as regularized joint mixture (RJM) models, belongs to the latter category of mixtures. Specifically, the model specification (the likelihood part of the model) is of a CWM type, but the resulting clustering and parameter learning process under RJMs is different due to regularization. CWMs rely on maximum-likelihood (ML) estimation, and in this case, under the normal-normal setting with identity link (as considered here), Eq. (1) is equivalent to a GMM on the concatenated matrix $[X, Y]$, as shown in Ingrassia et al. (2012). However, under RJMs the equivalence to the GMM no longer holds, because the regression and graphical model parts are treated differently (below we include comparisons with direct Gaussian mixture modeling of the concatenated matrix $[X, Y]$).

We view regularization as essential for delivering a usable solution to the problem. In many cases the number of features $p$ may be on the same order as the sample size $n$, or larger, and at the group level the sample sizes are of course smaller; hence, without suitable regularization both the regression and graphical models will typically be ill-behaved. In the specific implementation we propose, we use the graphical lasso (Friedman et al., 2008a) for graphical model estimation, while for the regression part we consider: (i) the Bayesian lasso (Park and Casella, 2008) and (ii) the normal-Jeffreys prior (Figueiredo, 2001). We note that other choices would be possible within the general framework, subject to computational considerations and appropriate handling of tuning parameters. In summary, the merits of RJMs are the following:

(i) Learns latent group structure by combining information from the distribution over $X$ and the regression of $Y$ on $X$ within a principled framework;
(ii) Provides group-specific feature importance and graphical models with explicit sparsity patterns;
(iii) Applicable in $p > n$ settings;
(iv) Allows group-specific prediction for the response given a new feature vector $X^*$.

### 1.2 Motivation

Given the large literature on mixture models, it is important to clarify at the outset why the models we study are needed. We are motivated by applications in which latent group structure may be important and where aspects such as (potentially group-specific) feature importance and covariance structure among the $X$'s play a role.

To take one example, in biomedicine there is much interest in latent disease subtypes. These will often have subtype-specific covariance patterns, due to differences in underlying regulatory networks, and the analyst may want to understand subtype-specific disease biology and feature importance. Focusing on only the $X$'s may be insufficient, because this does not account for the response $Y$ (and there may be many ways of clustering $X$ not relevant to the $Y$ of interest). For example, if $X$ are data on human subjects and $Y$ a cancer phenotype, many instances of cluster structure in $X$ may be unrelated to the cancer setting.
In addition, focusing solely on differences in regression models $Y \mid X$ means that subgroup recovery is difficult or impossible if these differences are not large enough. Similarly, in data-driven marketing, latent customer subgroups may have different covariance structure among features and at the same time manifest differences in regression models linking such features to responses (such as revenue per customer). The formulation we propose includes both sources of information in one model and thereby allows for subgroup identification and parameter estimation that accounts for both aspects.

Figure 1: Examples for two subgroups. Each panel shows a specific level of difference in the regression models, quantified by the difference $\Delta\beta$ in regression coefficients. The difference in the feature distributions is controlled via a simple mean shift $\Delta\mu$.

Figure 1 shows simple illustrative examples to bring out some of these points. We emphasize that this is for illustrative purposes only, intended to highlight some interesting contrasts (full empirical results appear in Section 5 below). In these examples, we consider settings in which there may be either or both of an $X$ signal difference (in the figure this is a simple mean shift $\Delta\mu$, but the models we propose are general and for multivariate $X$ allow also for differences in covariance structure) and a difference in subgroup-specific regression coefficients ($\Delta\beta$). For this initial illustration we consider two latent groups, each with sample size equal to 100, and ten potential predictors, but with only one predictor having an effect (a non-zero regression coefficient) on the response; full details of the simulations can be found in Appendix A. Results in terms of subgroup identification, as quantified by the adjusted Rand index, are summarized in Figure 2. When there is no difference in regression models ($\Delta\beta = 0$), MoEs cannot detect any structure (since the $X$ distribution is not modelled). On the other hand, with a stronger difference in regression models ($\Delta\beta = 1$), MoE outperforms a Gaussian mixture (on the $X$'s), since the latter does not model the regression part. The approach we propose models both aspects in a unified framework, and hence works well regardless of where the signal lies. Furthermore, and as shown in detail via empirical examples below, by accounting for the latent structure, RJM is able to detect subgroup-specific sparsity patterns whilst avoiding Simpson's-paradox-like effects that could otherwise arise. Later, we show detailed empirical results, including an example, based on cancer data, that highlights some of these points, and in particular how subgroup identification benefits from joint modeling, relative to simply clustering $X$ (or clustering the stacked vector $(X, Y)$) or using MoE.

Figure 2: Role of signal location in subgroup identification. The difference $\Delta\beta$ in regression coefficients represents the signal from the regression models, from no signal (left panel) to a strong signal (right panel).
In each panel the signal in the feature distribution increases from left to right via a simple mean shift $\Delta\mu$.

The remainder of the paper is structured as follows. In Section 2 we lay out the model specification and discuss the regularization methods under consideration as well as efficient tuning strategies. Computational and theoretical details of the expectation-maximization (EM) optimization are covered in Section 3. In Section 4 we discuss prediction using RJMs and discuss how predictive measures can potentially be used for cluster selection. In Section 5 we present empirical examples, focusing initially on small-scale simulations and then proceeding to larger scale semi-synthetic experiments and applications to real data. The paper concludes with a discussion in Section 6.

## 2. The RJM model

### 2.1 Model specification

Let $y$ denote an $n$-dimensional vector of outputs or responses and $X$ an $n \times p$ feature matrix. Samples are indexed by $i = 1, \ldots, n$. Let $K$ denote the number of groups and $z_i \in \{1, \ldots, K\}$ represent the true (latent) group indicator for the sample point $(y_i, x_i)$, with $\Pr(z_i = k) = \tau_k$. The group-specific parameters are $\theta_k = (\theta^X_k, \theta^Y_k)^T$, with $\theta^X_k$ and $\theta^Y_k$ being the parameters governing, respectively, the marginal distribution of $X$ and the regression model of $Y$ on $X$. We allow for group-specific parameters, but assume that samples are independent and identically distributed within groups. The joint distribution of $(y_i, x_i)$ in group $k$ is

$$
p(y_i, x_i \mid \theta_k, z_i = k) = p(y_i \mid \theta^Y_k, x_i, z_i = k)\, p(x_i \mid \theta^X_k, z_i = k). \tag{2}
$$

The features are modeled as $p$-dimensional multivariate normal, so that $\theta^X_k = (\mu_k, \mathrm{vec}(\Sigma_k))^T$, where $\mu_k$ is the mean and $\Sigma_k$ the $p \times p$ covariance matrix. For the responses, we specify a normal linear regression model, with parameters $\theta^Y_k = (\alpha_k, \beta_k, \sigma^2_k)^T$, where $\alpha_k$ is the intercept, $\beta_k$ the vector of regression coefficients and $\sigma^2_k$ the error variance. Inclusion of the intercept is necessary in the present setting, because it is not possible to center the response appropriately when the group labels are unknown. Thus, we have that

$$
p(x_i \mid \theta^X_k, z_i = k) = p(x_i \mid \mu_k, \Sigma_k, z_i = k) = N_p(x_i \mid \mu_k, \Sigma_k) \tag{3}
$$

and

$$
p(y_i \mid \theta^Y_k, x_i, z_i = k) = p(y_i \mid \alpha_k, \beta_k, \sigma^2_k, x_i, z_i = k) = N(y_i \mid \alpha_k + x_i^T\beta_k, \sigma^2_k). \tag{4}
$$

Marginalizing out the latent variables leads to a mixture representation of the form

$$
p(y, X \mid \theta, \tau) = \prod_{i=1}^{n} \sum_{k=1}^{K} p(y_i \mid \theta^Y_k, x_i, z_i = k)\, p(x_i \mid \theta^X_k, z_i = k)\, \tau_k, \tag{5}
$$

where $\theta = (\theta_1, \ldots, \theta_K)^T$ and $\tau = (\tau_1, \ldots, \tau_K)^T$.

### 2.2 Regularization and priors

Given the likelihood function of $\theta$ and $\tau$ in (5), we consider general solutions of the form

$$
\hat\theta, \hat\tau = \arg\max_{\theta, \tau} \Big\{ \log p(y, X \mid \theta, \tau) - \sum_{k=1}^{K} \mathrm{pen}(\bar\theta^X_k) - \sum_{k=1}^{K} \mathrm{pen}(\bar\theta^Y_k) \Big\}, \tag{6}
$$

where $\mathrm{pen}(\cdot)$ denotes a penalty function, and $\bar\theta^X_k$, $\bar\theta^Y_k$ are parameter subsets that we wish to penalize. The particular parameters we penalize are the group-specific covariances and regression vectors; hence, $\bar\theta^X_k = \mathrm{vec}(\Sigma_k)$ and $\bar\theta^Y_k = \beta_k$ in (6) (although under one approach below we consider $\bar\theta^Y_k = (\beta_k, \sigma^2_k)^T$). Penalization is required because the corresponding ML estimates may be ill-behaved or ill-defined, and we are interested in group-specific feature importance and conditional independence structure.
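Before turning to the question of tuning, the following is a minimal R sketch of the observed-data log-likelihood in (5) for the normal-normal case with identity link. The parameter objects (tau, mu, Sigma, alpha, beta, sigma2) are hypothetical containers introduced only for illustration and do not reflect the regjmix interface.

```r
# Minimal sketch: observed-data log-likelihood of Eqs. (3)-(5),
# normal-normal case. Object names are illustrative only.
library(mvtnorm)

rjm_loglik <- function(y, X, tau, mu, Sigma, alpha, beta, sigma2) {
  n <- nrow(X)
  K <- length(tau)
  comp <- matrix(NA_real_, n, K)
  for (k in 1:K) {
    lx <- dmvnorm(X, mean = mu[[k]], sigma = Sigma[[k]], log = TRUE)   # Eq. (3)
    ly <- dnorm(y, mean = alpha[k] + drop(X %*% beta[[k]]),
                sd = sqrt(sigma2[k]), log = TRUE)                      # Eq. (4)
    comp[, k] <- log(tau[k]) + lx + ly
  }
  m <- apply(comp, 1, max)                 # log-sum-exp over groups, Eq. (5)
  sum(m + log(rowSums(exp(comp - m))))
}
```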
In general, tuning of penalty parameters is challenging in the latent group setting. The common approach, based on cross-validation (CV), needs to be handled with care when the estimation of parameters requires iterative procedures converging to (local) maxima, such as the EM algorithm. Specifically, performing CV at each iteration of the algorithm would change the penalty and, thus, also change the objective function at each iteration. A brute-force solution to the problem would be to pre-specify a grid of values for the penalty and select the value that optimizes a specific criterion. However, apart from open issues related to the range and length of the grid, this would also be a computationally burdensome task, requiring multiple EM runs (each with multiple starts) for each point on the grid. Given these considerations, we argue that more viable strategies are the following: (i) using universal penalties from the existing literature, (ii) using CV in a stepwise manner, and (iii) considering the penalties as free parameters under estimation. The first strategy is advisable for parameters whose learning is not the main goal of the analysis, but for which regularization is required in order to attain workable, non-spurious solutions. Universal penalties which satisfy certain theoretical requirements (see e.g., Donoho and Johnstone, 1994; Städler and Mukherjee, 2013) typically work well to that end. On the other hand, for the main parameters of interest, whose learning (e.g. estimation, sparsity patterns) is of importance, the second and third strategies seem more appropriate, as they offer more specific, application-driven solutions. In short, the stepwise CV approach entails adjusting the penalties a few times during the EM and running the algorithm sufficiently long in order to reach local maxima after the last adjustment. The last strategy requires maximizing the objective with respect to the penalties and, thus, requires the introduction of a prior distribution. Within this framework one could ideally (yet not necessarily) consider properties of the prior; for instance, its behavior under small and/or large samples and the consequent effect on shrinkage, among others. From a pragmatic perspective, the prior choice is linked to more realistic considerations relating to the maximization required in the EM algorithm. For instance, the half-Cauchy prior, which is a standard choice for penalties (Polson and Scott, 2012) in fully Bayesian implementations based on posterior sampling, is not necessarily convenient to work with within an EM framework.

We proceed with a description of the penalty functions, discussing our choices from both penalized likelihood and Bayesian viewpoints, and explaining our reasoning for the way that penalty parameters are tuned based on the aforementioned strategies. For the remainder of this Section, it is convenient to discuss the corresponding solutions under known group labels. For the subset of datapoints where $z_i = k$, let us denote the $n_k \times 1$ response by $y_k$ and the $n_k \times p$ predictor matrix by $X_k$ for $k = 1, \ldots, K$. We emphasize that this is for expositional clarity only; the actual solutions under latent group labels are, of course, obtained iteratively via the EM algorithm presented in Section 3.

#### 2.2.1 Regularization of $\Sigma_k$

For the regularization of $\Sigma_k$ we use the graphical lasso (Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008a). The graphical lasso induces sparsity in the inverse covariance matrix $\Omega_k = \Sigma_k^{-1}$ for group $k$. In this case we have $\mathrm{pen}(\bar\theta^X_k) = \mathrm{pen}(\Omega_k) = \zeta\|\Omega_k\|_1$ in (6), where $\zeta > 0$ controls the strength of regularization and $\|\cdot\|_q$ is the $L_q$ norm. For known group labels the graphical lasso estimate would be

$$
\hat\Omega_k = \arg\max_{\Omega_k \in \mathcal{M}^+} \Big\{ \log|\Omega_k| - \mathrm{tr}\big(\Omega_k \hat S_k\big) - \zeta\|\Omega_k\|_1 \Big\}, \tag{7}
$$

where $\mathcal{M}^+$ is the space of positive definite matrices and $\hat S_k$ is the ML covariance estimate of $X_k$.
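For fixed, known labels the penalized problem in (7) can be solved with standard graphical lasso software; the sketch below uses the glasso package, with its rho argument playing the role of $\zeta$. The function name and the plug-in of the universal threshold are illustrative assumptions; within the EM the same solver is applied to the weighted covariance matrix with a group-specific penalty (Section 3).

```r
# Sketch: group-wise graphical lasso estimate as in Eq. (7), known labels.
library(glasso)

fit_group_glasso <- function(Xk, zeta) {
  n_k <- nrow(Xk)
  Sk  <- cov(Xk) * (n_k - 1) / n_k        # ML covariance estimate of group k
  glasso(Sk, rho = zeta)$wi               # sparse estimate of Omega_k
}

# Example with the universal threshold psi = sqrt(2 * n * log(p)) / 2,
# converted to the per-group penalty zeta = psi / n_k:
# Omega_k_hat <- fit_group_glasso(Xk, zeta = sqrt(2 * n * log(p)) / (2 * nrow(Xk)))
```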
The solution in (7) is equivalent to the posterior mode under a likelihood as in (3) and a prior distribution of the following form

$$
p(\Omega_k) \propto \prod_{j=1}^{p} \mathrm{Exp}\big(\omega_{kjj} \mid \psi/2\big) \prod_{j<l} \mathrm{DE}\big(\omega_{kjl} \mid 0, \psi^{-1}\big)\, \mathbf{1}\{\Omega_k \in \mathcal{M}^+\}, \tag{8}
$$

where $\mathbf{1}\{\cdot\}$ is the indicator function, $\mathrm{Exp}(\cdot \mid r)$ is the exponential distribution with rate $r > 0$ and $\mathrm{DE}(\cdot \mid \mu, b)$ is the double exponential distribution with location $\mu \in \mathbb{R}$ and scale $b > 0$. The connection between (7) and the corresponding posterior mode under the likelihood in (3) and the prior in (8) is that for any given value of $\zeta$ we have $\psi = n_k\zeta$ (Wang, 2012). For the graphical lasso penalty, we use the universal threshold $\psi = \sqrt{2n\log p}/2$ derived from the arguments developed in Städler and Mukherjee (2013). Note that here we use $n$ instead of $n_k$, as the group labels will be unknown. The reason for choosing the universal-threshold approach for the graphical lasso is that we are less interested in the sparsity pattern of $\Omega_k$ itself. Rather, we mainly want a well-behaved estimate that allows group structure to be effectively accounted for.

#### 2.2.2 Regularization of $\beta_k$

The lasso approach. We consider a scaled version of the lasso (Tibshirani, 1996), where the penalty in (6) is given by $\mathrm{pen}(\bar\theta^Y_k) = \mathrm{pen}(\beta_k, \sigma^2_k) = \lambda_k\|\beta_k\|_1/\sigma_k$ with $\lambda_k > 0$. Here we introduce group-specific penalty parameters $\lambda_k$, unlike previously, where the parameter $\psi$ is common across groups. This type of lasso regularization has been studied by Städler et al. (2010) in the context of FMR, a setting where the role of the scaling of variance parameters is much more important than in standard homogeneous regression, as the $\sigma_k$'s may differ and the grouping is not fixed. If the groups were known, the lasso estimate would be

$$
\hat\alpha_k, \hat\beta_k, \hat\sigma^2_k = \arg\min_{\alpha_k, \beta_k, \sigma^2_k} \Big\{ \frac{\|y_k - \alpha_k 1_{n_k} - X_k\beta_k\|_2^2}{2\sigma^2_k} + \lambda_k\frac{\|\beta_k\|_1}{\sigma_k} + (n_k + p + 2)\log\sigma_k \Big\}, \tag{9}
$$

where $1_q$ denotes a $q$-dimensional vector of ones. The solution in (9), which is slightly different than the one in Städler et al. (2010), corresponds to the posterior mode under the Bayesian lasso formulation (Park and Casella, 2008), which specifies independent double exponential priors for the regression coefficients conditional on the error variance, which is assigned the scale-invariant Jeffreys prior. Namely,

$$
p(\beta_k \mid \sigma^2_k, \lambda_k) = \prod_{j=1}^{p} \frac{\lambda_k}{2\sigma_k}\exp\Big(-\lambda_k\frac{|\beta_{kj}|}{\sigma_k}\Big) \quad \text{and} \quad p(\sigma^2_k) \propto \frac{1}{\sigma^2_k}, \tag{10}
$$

with the correspondence to (9) completed when $p(\alpha_k) \propto 1$. We propose two methods for handling $\lambda_k$: the fixed-penalty lasso (FLasso), based on a plug-in estimate, and the random-penalty lasso (RLasso), based on the construction of a suitable prior.

FLasso. This approach is essentially a two-step tuning procedure. We start with initial estimates $\hat\lambda^{(0)}_k$ obtained by minimizing the CV mean squared error based on some prior clustering of the data. Then, at a certain iteration we re-calculate the CV estimates and fix each group penalty to the new estimate $\hat\lambda^{(1)}_k$ for all further EM iterations. Specifically, we fix the parameter after the first iteration at which the group assignments do not change. From a Bayesian perspective the FLasso approach can be viewed as an empirical Bayes method, as we use the data in order to plug in $\hat\lambda^{(0)}_k$ and $\hat\lambda^{(1)}_k$ in the prior of $\beta_k$ appearing in (10). The monotonic behaviour of the EM may be disrupted at the re-estimation iteration, but after that point it will hold.
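As a concrete illustration of the FLasso tuning idea, the sketch below obtains initial penalties $\hat\lambda^{(0)}_k$ by cross-validated lasso within a preliminary clustering of the data. The choice of k-means for the initial clustering and the use of cv.glmnet (whose penalty is parameterized on a slightly different scale than $\lambda_k$ in (9)) are assumptions made for illustration, not the exact procedure of the paper.

```r
# Sketch: FLasso initialization of group-specific penalties via CV,
# based on a preliminary clustering of X (k-means here, for illustration).
library(glmnet)

init_flasso_penalties <- function(y, X, K) {
  cl <- kmeans(scale(X), centers = K, nstart = 25)$cluster
  sapply(1:K, function(k) {
    cvfit <- cv.glmnet(X[cl == k, , drop = FALSE], y[cl == k], alpha = 1)
    cvfit$lambda.min      # CV mean-squared-error optimal penalty for group k
  })
}
```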
RLasso. In this approach we propose placing a prior distribution on $\lambda_k$, so that this parameter is automatically updated during the EM. We construct the prior so that it satisfies the requirement of supporting no penalization asymptotically ($\lambda_k \to 0$ as $n \to \infty$). A suitable prior for our purposes is the Pareto distribution, whose scale parameter is also the lower bound of its support. Specifically, we have a prior distribution with scale $a_n > 0$ and shape $b_n > 0$ (parameters are defined to depend on $n$) of the following form

$$
p(\lambda_k) = b_n a_n^{b_n} \lambda_k^{-(b_n + 1)}, \tag{11}
$$

where $\lambda_k \in [a_n, \infty)$. In our setting the parameter $a_n$ does not need to be specified explicitly and is regarded as decreasing in $n$, while the shape parameter is specified as $b_n = (p - 1) - c\sqrt{2K\log p/n}$ for some $c \in (0, 1]$. The rationale for these choices and further details are discussed in Appendix B. As shown next, RLasso leads to a reasonable update for $\lambda_k$ during the M-step.

The normal-Jeffreys approach. The normal-Jeffreys (NJ) prior (Figueiredo, 2001) consists of independent improper priors; in our context the prior is given by

$$
\prod_{j=1}^{p} p(\beta_{kj}) \propto \prod_{j=1}^{p} |\beta_{kj}|^{-1}. \tag{12}
$$

For known group labels, with $p(\alpha_k, \sigma^2_k) \propto 1/\sigma^2_k$, the corresponding penalized estimate is

$$
\hat\alpha_k, \hat\beta_k, \hat\sigma^2_k = \arg\min_{\alpha_k, \beta_k, \sigma^2_k} \Big\{ \frac{\|y_k - \alpha_k 1_{n_k} - X_k\beta_k\|_2^2}{2\sigma^2_k} + \sum_{j=1}^{p} \log|\beta_{kj}| + (n_k + 2)\log\sigma_k \Big\}. \tag{13}
$$

Here $\mathrm{pen}(\bar\theta^Y_k) = \mathrm{pen}(\beta_k) = \sum_{j=1}^{p}\log|\beta_{kj}|$. The NJ prior is well known in the shrinkage-prior literature (Griffin and Brown, 2005; Carvalho et al., 2010; Polson and Scott, 2010). As with most shrinkage priors, (12) can be expressed as a scale-mixture of normals; namely, $p(\beta_{kj} \mid s_{kj}) = (2\pi s_{kj})^{-1/2}\exp(-\beta^2_{kj}/2s_{kj})$ with $\pi(s_{kj}) \propto s^{-1}_{kj}$, and therefore $\int p(\beta_{kj} \mid s_{kj})\pi(s_{kj})\,ds_{kj} \propto |\beta_{kj}|^{-1}$. As the mixing distribution lacks a hyper-parameter, the prior is characterized by the absence of a global scale parameter. Also, due to its heavy tails, small coefficients are shrunk a lot, while large signals remain relatively unaffected, similarly to other heavy-tailed priors (Carvalho et al., 2010; Griffin and Brown, 2005). Figueiredo (2003) and Bae and Mallick (2004) show that the NJ prior strongly induces sparsity and yields good performance in terms of selection.

The use of the NJ prior is appealing for the RJM framework. Handling penalties is cumbersome in our setting and the NJ prior provides an attractive tuning-free alternative. In general, shrinkage priors which lack a global scale parameter fail to capture the average signal density of the data (Carvalho et al., 2010); however, despite this potential shortcoming of the NJ prior, the potential benefits are worth exploring. Also, the posterior mode under (12) is easy to find through the use of an EM algorithm in which the scaling parameters $s_{kj}$ are considered latent (Figueiredo, 2003). This additional latent structure can be easily incorporated in our EM without additional computational costs. In fact, the corresponding NJ-EM update is in closed form, which is not the case for the lasso approach.

## 3. The RJM-EM algorithm

In this Section we first present the expectation and maximization steps of the proposed EM algorithm. We then prove that, under certain conditions on the regularization, the proposed algorithm converges towards a critical point of the likelihood function.

### 3.1 The EM steps
The E-step. Irrespective of the regularization approach, the group-membership probabilities of the mixture model in (5) at iteration $t$ of the algorithm are calculated as

$$
m^{(t)}_{ki} = \widehat{\Pr}\big(z_i = k \mid y_i, x_i, \theta^{(t)}_k\big) = \frac{p\big(y_i \mid \theta^{Y(t)}_k, x_i, z_i = k\big)\, p\big(x_i \mid \theta^{X(t)}_k, z_i = k\big)\, \tau^{(t)}_k}{\sum_{k'} p\big(y_i \mid \theta^{Y(t)}_{k'}, x_i, z_i = k'\big)\, p\big(x_i \mid \theta^{X(t)}_{k'}, z_i = k'\big)\, \tau^{(t)}_{k'}}, \tag{14}
$$

for $i = 1, \ldots, n$, with the distributions appearing on the right-hand side of (14) defined in (3) and (4). Let us define some quantities that will be used throughout; namely, $n^{(t)}_k = \sum_{i=1}^n m^{(t)}_{ki}$, $m^{(t)}_k = (m^{(t)}_{k1}, \ldots, m^{(t)}_{kn})^T$ and $M^{(t)}_k = \mathrm{diag}(m^{(t)}_k)$. A convenient feature of the RJM design is that, due to the hierarchical structure of the model, the objective function can be split into separate simple parts; specifically,

$$
Q(\theta, \tau, \lambda \mid \theta^{(t)}, \tau^{(t)}, \lambda^{(t)}) = Q^Y(\theta^Y, \lambda \mid \theta^{Y(t)}, \lambda^{(t)}) + Q^X(\theta^X \mid \theta^{X(t)}) + Q^Z(\tau \mid \tau^{(t)}), \tag{15}
$$

where $\theta^Y = (\theta^Y_1, \ldots, \theta^Y_K)^T$ and $\theta^X = (\theta^X_1, \ldots, \theta^X_K)^T$. Here, by $\lambda$ we denote the vector of group penalty parameters of the regression component. Depending upon the regularization approach, the elements of the vector $\lambda$ at iteration $t$ are fixed in FLasso, free and under estimation in RLasso, and absent in NJ; respectively, $\lambda^{(t)} = (\hat\lambda^{(t^*)}_1, \ldots, \hat\lambda^{(t^*)}_K)^T$ (where $t^* \in \{0, 1\}$, with zero and one corresponding to the initial CV estimate and the re-estimated CV value; see FLasso in Section 2.2.2), $\lambda^{(t)} = (\lambda^{(t)}_1, \ldots, \lambda^{(t)}_K)^T$, and $\lambda^{(t)} = \emptyset$. Starting in reverse order from the right-hand side of (15), we have

$$
Q^Z(\tau \mid \tau^{(t)}) = \sum_{k=1}^{K} n^{(t)}_k \log\tau_k, \tag{16}
$$

while the second component of the objective function is given by

$$
Q^X(\theta^X \mid \theta^{X(t)}) = \sum_{k=1}^{K}\Big\{ \frac{1}{2}\sum_{i=1}^{n} m^{(t)}_{ki}\Big[\log|\Omega_k| - (x_i - \mu_k)^T\Omega_k(x_i - \mu_k)\Big] - \psi\|\Omega_k\|_1 \Big\}, \tag{17}
$$

where $\psi = \sqrt{2n\log p}/2$. The last component $Q^Y$ in (15) depends on the regularization method. We define two distinct functions, $Q^Y_{\mathrm{lasso}}$ and $Q^Y_{\mathrm{NJ}}$. For lasso we use the re-parametrization $\chi_k = \alpha_k/\sigma_k$, $\phi_k = \beta_k/\sigma_k$ and $\rho_k = \sigma_k^{-1}$ (Städler et al., 2010), resulting in

$$
Q^Y_{\mathrm{lasso}}(\theta^Y, \lambda \mid \theta^{Y(t)}, \lambda^{(t)}) = \sum_{k=1}^{K}\Big\{ \frac{(\rho_k y - \chi_k 1_n - X\phi_k)^T M^{(t)}_k (\rho_k y - \chi_k 1_n - X\phi_k)}{2} + \lambda_k\|\phi_k\|_1 - (n^{(t)}_k + p + 2)\log\rho_k + f(\lambda_k) \Big\}, \tag{18}
$$

which is convex. Here $f(\lambda_k) = 0$ for FLasso and $f(\lambda_k) = -c\sqrt{2K\log p/n}\,\log\lambda_k$ for RLasso. Under the NJ approach the corresponding objective is given by

$$
Q^Y_{\mathrm{NJ}}(\theta^Y \mid \theta^{Y(t)}) = \sum_{k=1}^{K}\frac{1}{2}\Big\{ \frac{(y - \alpha_k 1_n - X\beta_k)^T M^{(t)}_k (y - \alpha_k 1_n - X\beta_k)}{\sigma^2_k} + \beta_k^T V^{(t)}_k \beta_k + (n^{(t)}_k + 2)\log\sigma^2_k \Big\}, \tag{19}
$$

where $V^{(t)}_k = \mathrm{diag}\big(1/\beta^{2(t)}_{k1}, \ldots, 1/\beta^{2(t)}_{kp}\big)$. This matrix arises from the second underlying latent structure (the latent scale parameters) in the EM; details are provided in Appendix C. As we will see below, the matrix $V_k$ will in fact provide the final sparse estimate of $\beta_k$, as some of its diagonal entries go to infinity during the EM; consequently, the diagonal entries of $U_k = V_k^{-1}$ that go to zero correspond to the coefficients that are set equal to zero.

The M-step. From (16) we have that the group probabilities are updated as

$$
\tau^{(t+1)}_k = n^{(t)}_k / n. \tag{20}
$$

Concerning the parameter block $\theta^X_k$, from (17) we obtain the following updating equations:

$$
\mu^{(t+1)}_k = \frac{\sum_{i=1}^{n} m^{(t)}_{ki} x_i}{n^{(t)}_k}, \tag{21}
$$

$$
\Omega^{(t+1)}_k = \arg\max_{\Omega_k}\Big\{ \log|\Omega_k| - \mathrm{tr}\big(\Omega_k S^{(t)}_k\big) - \zeta^{(t)}_k\|\Omega_k\|_1 \Big\}, \tag{22}
$$

where in (22) we have $S^{(t)}_k = \frac{1}{n^{(t)}_k}\sum_{i=1}^{n} m^{(t)}_{ki}(x_i - \mu^{(t+1)}_k)(x_i - \mu^{(t+1)}_k)^T$ and penalty given by $\zeta^{(t)}_k = \psi/n^{(t)}_k = \sqrt{2n\log p}/(2n^{(t)}_k)$. For the lasso objective in (18), the FLasso group penalties are simply $\lambda^{(t+1)}_k = \hat\lambda^{(t^*)}_k$, while the RLasso update is

$$
\lambda^{(t+1)}_k = \frac{c\,K^{1/2}}{\|\phi^{(t)}_k\|_1}\Big(\sqrt{2\log p/n}\Big). \tag{23}
$$

Note that the RLasso update is a scaled version of the optimal universal penalty under orthonormal predictors (the quantity inside the parenthesis in (23)) and that the scaling depends on the sparsity of the coefficients and on $c$.
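Before turning to the $\theta^Y_k$ updates, a minimal R sketch of the E-step (14) and of the closed-form updates (20)-(21) follows; the remaining updates plug the same weights into the graphical lasso (22) and into the penalized regression updates given next. As in the earlier sketch, the parameter objects are hypothetical and used only for illustration.

```r
# Sketch: E-step responsibilities (Eq. 14) and the tau/mu updates (Eqs. 20-21).
library(mvtnorm)

rjm_estep <- function(y, X, tau, mu, Sigma, alpha, beta, sigma2) {
  n <- nrow(X)
  K <- length(tau)
  logw <- matrix(NA_real_, n, K)
  for (k in 1:K) {
    logw[, k] <- log(tau[k]) +
      dmvnorm(X, mean = mu[[k]], sigma = Sigma[[k]], log = TRUE) +
      dnorm(y, mean = alpha[k] + drop(X %*% beta[[k]]),
            sd = sqrt(sigma2[k]), log = TRUE)
  }
  w <- exp(logw - apply(logw, 1, max))    # stabilized responsibilities
  w / rowSums(w)                          # n x K matrix with entries m_ki
}

rjm_update_tau_mu <- function(X, M) {     # M: output of rjm_estep
  nk  <- colSums(M)
  tau <- nk / nrow(X)                                                       # Eq. (20)
  mu  <- lapply(seq_len(ncol(M)), function(k) colSums(M[, k] * X) / nk[k])  # Eq. (21)
  list(tau = tau, mu = mu, nk = nk)
}
```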
For the components of $\theta^Y_k$, the updates are as follows:

$$
\rho^{(t+1)}_k = \frac{y^T M^{(t)}_k\big(\chi^{(t)}_k 1_n + X\phi^{(t)}_k\big) + \sqrt{\Big[y^T M^{(t)}_k\big(\chi^{(t)}_k 1_n + X\phi^{(t)}_k\big)\Big]^2 + 4\,y^T M^{(t)}_k y\,\big(n^{(t)}_k + p + 2\big)}}{2\,y^T M^{(t)}_k y}, \tag{24}
$$

$$
\chi^{(t+1)}_k = \frac{\big(\rho^{(t+1)}_k y - X\phi^{(t)}_k\big)^T m^{(t)}_k}{n^{(t)}_k}, \tag{25}
$$

$$
\phi^{(t+1)}_k = \arg\min_{\phi_k}\Big\{ \tfrac{1}{2}\big\|M^{1/2(t)}_k\big(\rho^{(t+1)}_k y - \chi^{(t+1)}_k 1_n - X\phi_k\big)\big\|_2^2 + \lambda^{(t+1)}_k\|\phi_k\|_1 \Big\}. \tag{26}
$$

Finally, the EM updates of $\theta^Y_k$ under the NJ prior are the following:

$$
\sigma^{2(t+1)}_k = \frac{\big(y - \alpha^{(t)}_k 1_n - X\beta^{(t)}_k\big)^T M^{(t)}_k\big(y - \alpha^{(t)}_k 1_n - X\beta^{(t)}_k\big)}{n^{(t)}_k + 2}, \tag{27}
$$

$$
\alpha^{(t+1)}_k = \frac{\big(y - X\beta^{(t)}_k\big)^T m^{(t)}_k}{n^{(t)}_k}, \tag{28}
$$

$$
\beta^{(t+1)}_k = \Big(X^T M^{(t)}_k X + \sigma^{2(t+1)}_k V^{(t)}_k\Big)^{-1} X^T M^{(t)}_k\big(y - \alpha^{(t+1)}_k 1_n\big). \tag{29}
$$

As remarked previously, in practice we work with $U^{(t)}_k = V^{-1(t)}_k$. Specifically, we have two available options for $\beta_k$; the first is suited for the $n > p$ case and is given by

$$
\beta^{(t+1)}_k = U^{1/2(t)}_k\Big(\sigma^{2(t+1)}_k I_p + U^{1/2(t)}_k X^T M^{(t)}_k X U^{1/2(t)}_k\Big)^{-1} U^{1/2(t)}_k X^T M^{(t)}_k\big(y - \alpha^{(t+1)}_k 1_n\big), \tag{30}
$$

while the second, which is faster to compute when $n < p$, is given by

$$
\beta^{(t+1)}_k = \sigma^{-2(t+1)}_k U^{(t)}_k\Big[I_p - X^T\Big(\sigma^{2(t+1)}_k M^{-1(t)}_k + X U^{(t)}_k X^T\Big)^{-1} X U^{(t)}_k\Big] X^T M^{(t)}_k\big(y - \alpha^{(t+1)}_k 1_n\big). \tag{31}
$$

Additional details on practical implementation appear in Appendix D.

### 3.2 Convergence guarantees

#### 3.2.1 Preliminaries

The proposed EM algorithm is an expectation/conditional-maximization (ECM) algorithm as introduced by Meng and Rubin (1993). Let us recall some elements of their formalism. We call $\xi \in \Xi$ the variable optimised in the M step. The corresponding optimised function is $Q(\xi \mid \xi^{(t)})$, where $\xi^{(t)}$ is the value of the parameter after $t$ ECM steps. Then, the exact M step is defined as

$$
\xi^{(t+1)} := \arg\max_{\xi \in \Xi} Q(\xi \mid \xi^{(t)}). \tag{32}
$$

When the optimisation in (32) is inconvenient, Meng and Rubin (1993) proposed to replace it by $S \in \mathbb{N}^*$ successive block-wise updates ("conditional maximization"; CM). Given $S$ constraint functions $\{g_s(\xi)\}_{s=1}^{S}$, the CM step is decomposed into the $S$ intermediary steps:

$$
\xi^{(t+s/S)} := \arg\max_{\xi \in \Xi;\; g_s(\xi) = g_s(\xi^{(t+(s-1)/S)})} Q(\xi \mid \xi^{(t)}), \tag{33}
$$

for $s = 1, \ldots, S$. In their Theorems 2 and 3, Meng and Rubin (1993) provide conditions under which all limit points of any ECM sequence are critical points of the observed likelihood. We propose a reformulation of their Theorem 3, where we list explicitly all the required conditions.

Theorem 1 (Theorem 3 of Meng and Rubin (1993)) With $r \in \mathbb{N}^*$, let $\Xi$ be a subset of the Euclidean space $\mathbb{R}^r$. Let $\{\xi^{(t)}\}_{t \in \mathbb{N}} \in \Xi^{\mathbb{N}}$ be an ECM sequence that has an observed log-likelihood, called $L(\xi)$, as its objective function. The initial term $\xi^{(0)}$ is such that $L(\xi^{(0)}) > -\infty$. Let $Q(\xi \mid \xi^{(t)})$ be the corresponding expected complete likelihood that is conditionally maximised at each CM step, with constraint functions $\{g_s(\xi)\}_{s=1}^{S}$. Finally, call $\Xi^{\circ}$ the interior of $\Xi$, and assume that:

- Each conditional maximisation in the CM step (33) has a unique optimum;
- For all $s$, $g_s(\xi)$ is differentiable and the gradient $\nabla g_s(\xi) \in \mathbb{R}^{r \times d_s}$ is of full rank on $\Xi^{\circ}$;
- $\bigcap_{s=1}^{S} \{\nabla g_s(\theta) u \mid u \in \mathbb{R}^{d_s}\} = \{0\}$;
- The conditions (6)-(10) of Wu (1983):
  (6) $\Xi_{\xi^{(0)}} := \{\xi \in \Xi \mid L(\xi) \geq L(\xi^{(0)})\}$ is compact for any $L(\xi^{(0)}) > -\infty$;
  (7) $L$ is continuous on $\Xi$ and differentiable on $\Xi^{\circ}$;
  (9) $\Xi_{\xi^{(0)}} \subseteq \Xi^{\circ}$;
  (10) $Q(\xi_1 \mid \xi_2)$ is continuous in both $\xi_1$ and $\xi_2$.

Then all limit points of $\{\xi^{(t)}\}_{t \in \mathbb{N}}$ are stationary points of the objective $L(\xi)$.

Note that condition (8) of Wu (1983), namely that the sequence $\{L(\xi^{(t)})\}_t$ is upper bounded for any $\xi^{(0)} \in \Xi$, is verified as a direct consequence of (6) and (7) and is actually not an additional condition.

#### 3.2.2 Main results

Here, we apply the convergence Theorem 1 to the proposed ECM algorithm for the RJM model. First we show that, without modification, our ECM verifies almost all the hypotheses of Theorem 1.
In particular, it verifies the ones specific to the ECM procedure, as laid out by Meng and Rubin (1993). Then, we provide conditions on the ECM penalization under which the remaining, more restrictive, regularity hypotheses in Wu (1983) are also verified.

In our case, the optimization variable is $\xi := (\mu, \Omega, \phi, \chi, \rho, \lambda, \tau) \in \Xi$, where the closure of the parameter set $\Xi$ is $\bar\Xi = \mathbb{R}^{Kp} \times S_p^{+}(\mathbb{R})^K \times \mathbb{R}^{Kp} \times \mathbb{R}^{K} \times \mathbb{R}_{+}^{K} \times \mathbb{R}_{+}^{K} \times S_K \subset \mathbb{R}^{K(p^2 + 2p + 3)}$, with $S_p^{+}(\mathbb{R})$ the cone of positive semi-definite matrices of size $p$ and $S_K := \{\tau \in [0,1]^K \mid \sum_k \tau_k = 1\}$. Its interior is $\Xi^{\circ} = \mathbb{R}^{Kp} \times S_p^{++}(\mathbb{R})^K \times \mathbb{R}^{Kp} \times \mathbb{R}^{K} \times \mathbb{R}_{+}^{*K} \times \mathbb{R}_{+}^{*K} \times S_K^{\circ}$, with $S_p^{++}(\mathbb{R})$ the open cone of positive definite matrices of size $p$ and $S_K^{\circ} := \{\tau \in (0,1)^K \mid \sum_k \tau_k = 1\}$. The ECM sequence takes its values in $\Xi$. With proper priors, the parameters $\Omega$, $\rho$, $\lambda$ and $\tau$ cannot take values on the border of their respective sets during an ECM sequence. In such a scenario, the ECM sequence lives in $\Xi^{\circ}$ and we can simply consider that $\Xi = \Xi^{\circ}$, which helps with several of the hypotheses. To ensure this property, it is sufficient to set the regularization such that the objective $L(\xi)$ is $-\infty$ everywhere on the border. This objective is the penalized observed log-likelihood function:

$$
L(\xi) = \sum_{i=1}^{n} \log\Bigg(\sum_{k=1}^{K} p(y_i \mid \theta^Y_k, x_i, z_i = k)\, p(x_i \mid \theta^X_k, z_i = k)\,\tau_k\Bigg) - \mathrm{pen}(\xi) = \sum_{i=1}^{n} \log\Bigg(\sum_{k=1}^{K} \exp\Big[-\frac{(y_i\rho_k - \chi_k - x_i^T\phi_k)^2}{2} + \log\rho_k - \frac{\|x_i - \mu_k\|^2_{\Omega_k}}{2} + \frac{\log|\Omega_k|}{2} + \log\tau_k - \frac{(p+1)\log 2\pi}{2}\Big]\Bigg) - \mathrm{pen}(\xi), \tag{34}
$$

where $\|x_i - \mu_k\|^2_{\Omega_k} := (x_i - \mu_k)^T\Omega_k(x_i - \mu_k)$. The form $\mathrm{pen}(\xi)$ for the penalty term is a generalization of the separable penalty $\sum_k \mathrm{pen}(\bar\theta^X_k) + \sum_k \mathrm{pen}(\bar\theta^Y_k)$ proposed in Eq. (6). With the posterior weights $m^{(t)}_{ki} = \widehat{\Pr}(z_i = k \mid y_i, x_i, \xi^{(t)})$ as defined in the E-step (14), we can define the function $Q(\xi \mid \xi^{(t)})$ to maximise in the CM step:

$$
Q(\xi \mid \xi^{(t)}) = \sum_{i=1}^{n}\sum_{k=1}^{K} m^{(t)}_{ki}\Big[-\frac{(y_i\rho_k - \chi_k - x_i^T\phi_k)^2}{2} + \log\rho_k - \frac{\|x_i - \mu_k\|^2_{\Omega_k}}{2} + \frac{\log|\Omega_k|}{2} + \log\tau_k - \frac{(p+1)\log 2\pi}{2}\Big] - \mathrm{pen}(\xi). \tag{35}
$$

The conditional maximization of this function is carried out in Eqs. (20) to (26), for one specific version of the penalty $\mathrm{pen}(\xi)$. As required by Theorem 1, each of these optimizations is uniquely defined. This property is penalty dependent. Hence, in general it is required to use penalties/priors on the parameters that lead to uni-modal posterior distributions. On the other hand, the general structure of the conditional updates is independent of the penalty. Indeed, we propose a block-type update where each block is updated conditionally on all others being fixed. The order of the updates is $\tau \to \mu \to \Omega \to \lambda \to \rho \to \chi \to \phi$. This corresponds to constraint functions of the form:

$$
\begin{aligned}
g_1(\xi) &= \xi\backslash\tau := (\mu, \Omega, \lambda, \rho, \chi, \phi), \\
g_2(\xi) &= \xi\backslash\mu := (\tau, \Omega, \lambda, \rho, \chi, \phi), \\
g_3(\xi) &= \xi\backslash\Omega := (\tau, \mu, \lambda, \rho, \chi, \phi), \\
g_4(\xi) &= \xi\backslash\lambda := (\tau, \mu, \Omega, \rho, \chi, \phi), \\
g_5(\xi) &= \xi\backslash\rho := (\tau, \mu, \Omega, \lambda, \chi, \phi), \\
g_6(\xi) &= \xi\backslash\chi := (\tau, \mu, \Omega, \lambda, \rho, \phi), \\
g_7(\xi) &= \xi\backslash\phi := (\tau, \mu, \Omega, \lambda, \rho, \chi).
\end{aligned}
\tag{36}
$$

When the joint optimization in several consecutive blocks is possible, usually because they are separate in the objective, this approach can be simplified by fusing the corresponding blocks. For instance, in Eqs. (20) to (24), we perform a joint optimization in $\tau, \mu, \Omega, \lambda, \rho$ under the constraint that $\phi, \chi$ are fixed. Then, in Eq. (25), an optimization in $\chi$ with $\lambda, \rho, \phi$ fixed. Finally, in Eq. (26), an optimization in $\phi$ with $\lambda, \rho, \chi$ fixed. Note that in this case, the optimization in $\tau, \mu, \Omega$ is a true M step and is independent of the other parameters. Hence, once this block has been updated in the first step, constraining it to remain fixed in the subsequent steps is unnecessary. The functions $g_s(\xi)$ defined in (36) are obviously differentiable on $\Xi$ with gradients of the form

$$
\nabla g_s(\xi) = \big(0_{d_s \times d_1}\ \cdots\ 0_{d_s \times d_{s-1}}\ \ I_{d_s}\ \ 0_{d_s \times d_{s+1}}\ \cdots\ 0_{d_s \times d_S}\big)^T \in \mathbb{R}^{r \times d_s},
$$
which are of rank $d_s$ (full rank), with $r := \sum_s d_s = K(p^2 + 2p + 3)$. We can also see that there is no overlap between their non-zero components, which results in the desired property that $\bigcap_{s=1}^{S} \{\nabla g_s(\theta) u \mid u \in \mathbb{R}^{d_s}\} = \{0\}$. As a consequence, as long as the penalty is chosen such that the posterior distribution in each parameter is unimodal, the algorithm verifies the three ECM-specific conditions for convergence introduced by Meng and Rubin (1993).

Among the basic conditions identified by Wu (1983), some are also systematically verified by our algorithm with little assumption on the penalty; namely:

(7) The model part of $L(\xi)$ is always continuous and differentiable in $\Xi = \Xi^{\circ}$. Hence, this property is guaranteed for $L(\xi)$ as long as the penalty term is also continuous and differentiable.

(9) $\Xi_{\xi^{(0)}} := \{\xi \in \Xi \mid L(\xi) \geq L(\xi^{(0)})\} \subseteq \Xi = \Xi^{\circ}$.

(10) $Q(\xi_1 \mid \xi_2)$ is continuous in $\xi_1$ for the same reason that $L(\xi)$ is continuous in $\xi$. The dependency of $Q(\xi_1 \mid \xi_2)$ on $\xi_2$ is entirely through the terms $p(z_i = k \mid y_i, x_i, \xi_2) = p(y_i, x_i, z_i = k \mid \xi_2)/\sum_l p(y_i, x_i, z_i = l \mid \xi_2)$, which are continuous in $\xi_2$ for the same reason that the likelihood is.

The final missing hypothesis is (6), the compactness of the level sets of the likelihood function. This property is much more restrictive and requires specific hypotheses on the regularization. The following theorem synthesizes every observation made so far and provides sufficient conditions to verify the final hypothesis.

Theorem 2 (Convergence of the ECM algorithm for RJMs) Consider an ECM sequence $\{\xi^{(t)}\}_{t \in \mathbb{N}} \in \Xi^{\mathbb{N}}$ with objective function the observed log-likelihood $L(\xi)$ of the RJM model (34). The initial term $\xi^{(0)}$ is such that $L(\xi^{(0)}) > -\infty$, and the conditional maximization step of the expected complete likelihood $Q(\xi \mid \xi^{(t)})$ in Eq. (35) is conducted using the block-wise scheme defined by the constraint functions $g_s(\xi)$ in Eq. (36). Assume that the regularization term $\mathrm{pen}(\xi)$ is continuous, differentiable and such that each of the block maximizations is unique (uni-modal posterior). Assume additionally that there exists a positive constant $\delta > 0$ such that

$$
\mathrm{pen}(\xi) \geq \delta \sum_{k=1}^{K}\Big( \log\tau_k^{-1} + \|\mu_k\| + \|\Omega_k\| + \log\big|\Omega_k^{-1}\big| + f_{\lambda}(\lambda_k) + \rho_k + \log\rho_k^{-1} + |\chi_k| + \|\phi_k\| \Big), \tag{37}
$$

where $f_{\lambda}$ is a lower-bounded function on $\mathbb{R}_{+}^{*}$ such that $f_{\lambda}(x) \to +\infty$ as $x \to 0$ and $f_{\lambda}(x) \to +\infty$ as $x \to +\infty$. Then, Theorem 1 applies and all limit points of $\{\xi^{(t)}\}_{t \in \mathbb{N}}$ are stationary points of the objective $L(\xi)$.

The norms $\|\cdot\|$ on each parameter in Eq. (45) (in Appendix E) are unspecified, since all norms are equivalent in finite dimension. For $f_{\lambda}$, the lower bound on the penalty in $\lambda_k$, a function such as $f_{\lambda}(x) = x - \log x$ is suitable.

Sketch of proof: The full details of the proof can be found in Appendix E; we provide here only a brief sketch. We prove that under the assumptions of Theorem 2, all the hypotheses of Theorem 1 are met, which yields the desired convergence result. As previously discussed, all the hypotheses of Theorem 1, save for hypothesis (6) of Wu (1983), are organically verified within the RJM model. Provided that $L(\xi)$ is $-\infty$ on the border of $\bar\Xi$, we can safely take $\Xi = \Xi^{\circ}$, and then the few required assumptions on the penalty are verified: it is continuous, differentiable and such that each of the block maximizations is unique. Hence, all the efforts of the proof are spent on proving that $L(\xi)$ is $-\infty$ on the border of $\bar\Xi$ and that hypothesis (6) is verified.
This is done all at once: thanks to the control (45), we are able to define an increasing family $\Xi_m$ of compact subsets of $\Xi$ such that the log-likelihood $L(\xi)$ at any point of $\Xi\backslash\Xi_m$ is as low as desired with a well-chosen compact $\Xi_m$. With this result, we have that (i) $L(\xi)$ is $-\infty$ outside of $\Xi^{\circ}$ and (ii) $\Xi_{\xi^{(0)}} := \{\xi \in \Xi \mid L(\xi) \geq L(\xi^{(0)})\}$ is compact for any $L(\xi^{(0)}) > -\infty$ (hypothesis (6)), allowing us to conclude.

## 4. Prediction and cluster selection

An interesting feature of RJMs relates to prediction. Specifically, a new observation $x^* \in \mathbb{R}^p$ can be allocated to a cluster via the quantities $\hat\pi^*_k \propto \hat\tau_k\,\varphi_p(x^* \mid \hat\mu_k, \hat\Sigma_k)$, where $\hat\tau_k$, $\hat\mu_k$ and $\hat\Sigma_k$ are EM estimates for $k = 1, \ldots, K$. A simple prediction of $y^*$ then follows via $\hat y^* = \hat\alpha_{k^*} + x^{*T}\hat\beta_{k^*}$, where $k^* = \arg\max_k \hat\pi^*_k$ and $\hat\alpha_{k^*}$ and $\hat\beta_{k^*}$ are the corresponding EM estimates.

Although not our primary focus in this paper, it is interesting to briefly consider the idea of setting $K$ based on predictive loss. In more detail, let $\mathcal{G}$ denote the set of the numbers of clusters under consideration, so that the cluster indicator under model $g \in \mathcal{G}$ is $k_g = 1, \ldots, K_g$. Under each model $g$ we can obtain cluster allocations for a subset of held-out test data $y^*$ and $X^*$. Denote by $y^*_{k_g}$ the $n^*_{k_g} \times 1$ test-response vector and by $X^*_{k_g}$ the $n^*_{k_g} \times p$ test-feature matrix assigned to group $k_g = 1, \ldots, K_g$ conditional on $g$. Then, the solution for the best group-wise predictive model (in an $\ell_2$ sense) is given by

$$
\hat g_{\mathrm{pred}} = \arg\min_{g \in \mathcal{G}} \sum_{k_g=1}^{K_g} \frac{\big\|y^*_{k_g} - \hat\alpha_{k_g} 1_{n^*_{k_g}} - X^*_{k_g}\hat\beta_{k_g}\big\|_2^2}{n^*_{k_g}}. \tag{38}
$$

This effectively sets $K$ to minimize predictive loss and connects, in a way, supervised and unsupervised learning, providing a simple and quite natural way of determining the number of clusters based on a guided search that aims to optimize prediction of $y$. Of course, standard approaches for inferring the number of clusters, such as information criteria, can be used within the RJM framework. For instance, using BIC (Schwarz, 1978) in our case translates into selecting the number of clusters as

$$
\hat g_{\mathrm{BIC}} = \arg\max_{g \in \mathcal{G}} \Big\{ 2\log p(y, X \mid \hat\theta_g, \hat\tau_g) - \log(n)\,\nu_g \Big\}, \tag{39}
$$

where $\nu_g$ is the number of elements in $\hat\theta_g$ that are not set equal to zero. Similarly, for AIC (Akaike, 1974) we have

$$
\hat g_{\mathrm{AIC}} = \arg\max_{g \in \mathcal{G}} \Big\{ 2\log p(y, X \mid \hat\theta_g, \hat\tau_g) - 2\nu_g \Big\}. \tag{40}
$$
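The allocation and prediction rule above, and the BIC criterion in (39), can be sketched as follows; the fitted-object structure (fit$tau, fit$mu and so on) is hypothetical and only mirrors the notation in the text, not the regjmix interface.

```r
# Sketch: allocate a new x* and predict y* (Section 4), plus a BIC-type score
# as in Eq. (39); the structure of `fit` is illustrative.
library(mvtnorm)

rjm_predict <- function(xstar, fit) {
  K <- length(fit$tau)
  logpi <- sapply(1:K, function(k)
    log(fit$tau[k]) + dmvnorm(xstar, mean = fit$mu[[k]],
                              sigma = fit$Sigma[[k]], log = TRUE))
  kstar <- which.max(logpi)                          # hard allocation
  list(group = kstar,
       yhat  = fit$alpha[kstar] + sum(xstar * fit$beta[[kstar]]))
}

rjm_bic <- function(loglik, theta_hat, n) {
  nu <- sum(unlist(theta_hat) != 0)     # non-zero estimated parameters
  2 * loglik - log(n) * nu              # Eq. (39); AIC replaces log(n) by 2
}
```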
## 5. Empirical examples

In this Section we present results from simulation experiments, starting with a small-scale simulation in Section 5.1, which allows us to evaluate and visualize easily the various learning aspects of RJMs. In Section 5.2 we use data from The Cancer Genome Atlas in semi-synthetic examples of much larger scale, providing detailed comparisons with baseline and various oracle-type approaches. All simulations are based on data-generating mechanisms which are multivariate generalizations of three elementary problems depicted in Figure 3, the purpose of which is to facilitate understanding of more complex multivariate problems, such as the ones in Section 5.2, via illustration of simpler univariate analogues. Finally, in Section 5.3 we show results using fully empirical data.

Figure 3: Some interesting cases of group structure. Univariate analogues of the problems considered in the empirical examples (all of which are multivariate) in order to illustrate the key ideas. Shown are two latent groups with some separation between the group-wise means of a single feature $x$. Three specific cases, that differ with respect to the regression model linking $y$ and $x$, are shown: equal-intercept/unequal-slope (Case A); unequal-intercept/equal-slope (Case B); and equal intercept and slope (Case C).

### 5.1 Small-scale simulations

Set-up. We consider two groups ($K = 2$) with total $n = 100$ and balanced groups, i.e. $n_k = 50$ for $k \in \{1, 2\}$. The number of predictors is $p = 10$, where in each group only the first predictor ($x_{k1}$) has a non-zero coefficient, i.e. only $\beta_{k1} \neq 0$ for $k \in \{1, 2\}$. The covariates are generated as $X_k \sim N_p(\mu_k, \Sigma_k)$, with $\mu_1 = (0, \ldots, 0)^T$ and $\mu_2 = (1, \ldots, 1)^T$ being of dimensionality $p \times 1$. For the covariances we consider two scenarios: an uncorrelated scenario with diagonal covariances of the form $\Sigma_1 = \Sigma_2 = I_p$, and a correlated scenario with non-diagonal covariances, where each variable $x_{kj}$, for $j = 2, \ldots, p$, is again Gaussian noise, but the signal variable $x_{k1}$ is generated as $x_{k1} \sim N_{n_k}(1.5x_{k3} + 0.5x_{k5} + 0.7x_{k7},\, 0.5\,I_{n_k})$. The response of each group is generated as $y_k \sim N_{n_k}(1_{n_k}\alpha_k + x_{k1}\beta_{k1},\, \sigma^2_k I_{n_k})$. Specification of the slopes and intercepts is based on the three cases of Figure 3; see Table 3 (Appendix F). Finally, the error variance $\sigma^2_k$ of each group is set to fix the signal strength in a label-oracle sense, namely so that the correlation between test data (for test group sample sizes of 250) and predictions from a lasso model is approximately 0.8 when the group labels are known. The results that follow are from 50 repetitions of each simulation. We focus on regression signal detection, estimation of coefficients and group assignment performance.
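For concreteness, a sketch of this data-generating mechanism under the correlated scenario is given below; the intercept, slope and error-variance values are placeholders only (the values actually used follow Table 3 in Appendix F and the signal-strength calibration described above).

```r
# Sketch: small-scale simulation, correlated scenario. The Case A-style
# values of alpha, beta1 and sigma below are placeholders only.
set.seed(1)
n_k <- 50; p <- 10

gen_group <- function(mu, alpha, beta1, sigma) {
  X <- matrix(rnorm(n_k * p), n_k, p) + matrix(mu, n_k, p, byrow = TRUE)
  # correlated scenario: signal variable x_k1 depends on x_k3, x_k5, x_k7
  X[, 1] <- rnorm(n_k, mean = 1.5 * X[, 3] + 0.5 * X[, 5] + 0.7 * X[, 7],
                  sd = sqrt(0.5))
  y <- rnorm(n_k, mean = alpha + X[, 1] * beta1, sd = sigma)
  list(X = X, y = y)
}

g1 <- gen_group(mu = rep(0, p), alpha = 0, beta1 =  1, sigma = 1)   # group 1
g2 <- gen_group(mu = rep(1, p), alpha = 0, beta1 = -1, sigma = 1)   # group 2
X <- rbind(g1$X, g2$X)
y <- c(g1$y, g2$y)
```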
Signal detection and estimation. The variable inclusion frequencies over 50 repetitions of the simulations for the uncorrelated scenario are presented in Figure 4. RJM-NJ performs better overall, as it detects influential effects almost all of the time, while the inclusion rates of non-influential effects are much lower than 50%. RJM-FLasso is effective in detecting the signals but produces much denser solutions. RJM-RLasso solutions are sparser in comparison to FLasso; we note, however, that RLasso tends to over-shrink the coefficients of the influential predictors as well. The inclusion frequencies for the correlated scenario are similar (Appendix F, Figure 12). Violin plots of slope estimates are presented in Appendix F (Figure 13, uncorrelated scenario; Figure 14, correlated scenario), while the corresponding plots for intercepts are presented in Figures 15 and 16. The NJ estimates are overall more accurate. In all comparisons we include results from mixtures of experts obtained from the R package MoEClust (Murphy and Murphy, 2020), which performs simultaneous selection for experts, gates and covariance structures using forward-search model selection based on BIC. In terms of variable selection, MoE performs exceptionally well under case A and yields similar results to RJM-NJ under cases B and C (see Figures 4 and 12). As for estimation, MoE leads to overall accurate slope and intercept estimates in case A; however, the corresponding estimates in cases B and C have a higher variance in comparison to the RJM estimates (see Figures 13 to 16, Appendix F).

Figure 4: First simulation, uncorrelated scenario. Variable inclusion frequencies (over 50 repetitions) for signal variables (in green) and non-signal variables (in red) for regression cases A, B and C. Horizontal black lines correspond to a frequency of 0.5.

Latent group assignment. A natural question is whether including the regression part of the model within a unified framework provides any gains with respect to simply clustering the $X$ matrix. In practice, between-group differences in means may be subtle. Hence, we consider in particular performance as the magnitude of the mean difference varies; i.e., we set $\mu_1 = 0_p$ and $\mu_2 = 0_p + d\,1_p$ with $d = h\,U$, where $h$ defines a grid ranging from 0.1 to 1 with a step size of 0.05 and $U \in \{-1, 1\}$ is a uniformly random sign. Hence, $|d|$ is a measure of the strength of the mean signal. We compare to k-means, hierarchical clustering, GMMs and MoEs as implemented in R using the default options of kmeans, hclust, mclust (Scrucca et al., 2016) and MoEClust, respectively; for the latter two using the BIC-optimal model. In addition, for the clustering approaches we use as input: (i) only $X$ and (ii) $X$ together with $y$ stacked in one data matrix. For these simulations we use 20 repetitions. One-standard-error plots of adjusted Rand index averages as functions of $|d|$ are shown in Figure 5. As seen, in case A, RJMs generally outperform all methods except MoE; the latter performs better for lower values of $|d|$, while RJMs perform better for higher values. In cases B and C (where RJM is over-parameterized) our methods remain competitive in the uncorrelated scenario, while leading to better overall results in the correlated scenario. On the other hand, MoE is not effective under cases B and C, which, as argued in Section 1 (recall Figure 2), is to be expected when there are no differences in regression coefficients.
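A sketch of this baseline comparison, clustering either $X$ alone or the stacked data $[X, y]$ and scoring against the true labels with the adjusted Rand index, is shown below; it reuses the X and y objects generated in the previous sketch, with true_z the vector of true group labels.

```r
# Sketch: baseline clustering comparisons scored by the adjusted Rand index;
# X and y are as generated above, true_z holds the true group labels.
library(mclust)   # provides adjustedRandIndex() and Mclust()

true_z <- rep(1:2, each = n_k)
ari <- function(cl) adjustedRandIndex(cl, true_z)

c(kmeans_X  = ari(kmeans(scale(X),           centers = 2, nstart = 25)$cluster),
  kmeans_YX = ari(kmeans(scale(cbind(X, y)), centers = 2, nstart = 25)$cluster),
  hclust_X  = ari(cutree(hclust(dist(scale(X))), k = 2)),
  mclust_X  = ari(Mclust(X, G = 2)$classification))
```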
Figure 5: First simulation, cases A (top), B (middle) and C (bottom). One-standard-error plots of adjusted Rand index averages (from 20 repetitions) vs absolute distance ($|d|$) of the group-wise covariate means under the uncorrelated scenario (left) and correlated scenario (right). Methods compared: kmeans X, kmeans YX, hclust X, hclust YX, mclust X, mclust YX, MoEClust, RJM-NJ, RJM-FLasso, RJM-RLasso.

### 5.2 Semi-synthetic simulations based on real cancer data

The simulations presented below are based on data from The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov). The rationale is to anchor the simulation in real covariance structures. Specifically, we use data previously used in Taschler et al. (2019), consisting of gene expression values from four cancer types: breast (BRCA), kidney renal clear cell (KIRC), lung adenocarcinoma (LUAD) and thyroid (THCA). Our strategy is to treat the cancer type as hidden: this allows us to test our approaches in the context of differential covariance structure as seen in a real group-structured problem whilst having access to true gold-standard labels.

Table 1: Second simulation. Intercept values and slope-generating mechanisms for the two groups under the three cases illustrated in Figure 3. [$\mathrm{TN}(\mu, \sigma^2, l, u)$ denotes a truncated normal distribution, where $\mu \in \mathbb{R}$, $\sigma^2 > 0$ and $l, u$ are the respective lower and upper truncation bounds, while $\mathrm{mTN}(\mu, \sigma^2, a, b)$ denotes the specific mixture of $\mathrm{TN}(\mu, \sigma^2, -\infty, a)$ and $\mathrm{TN}(\mu, \sigma^2, b, \infty)$ with $a < b$ and mixing parameter equal to 0.5; i.e., a truncated normal with support everywhere except in $(a, b)$.]

| Case | Group | Intercept | Slopes, common locations | Slopes, disjoint locations |
|---|---|---|---|---|
| A | 1 | $\alpha_1 = \alpha_2 = 0$ | $\beta^*_1 \sim \mathrm{TN}(0, \sigma^2, -\infty, -0.1)$ | $\beta^*_1 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ |
|   | 2 |  | $\beta^*_2 \sim \mathrm{TN}(0, \sigma^2, 0.1, \infty)$ | $\beta^*_2 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ |
| B | 1 | $\alpha_1 = 0$ | $\beta^*_1 = \beta^*_2 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ | $\beta^*_1 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ |
|   | 2 | $\alpha_2 = 1$ |  | $\beta^*_2 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ |
| C | 1 | $\alpha_1 = \alpha_2 = 0$ | $\beta^*_1 = \beta^*_2 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ | $\beta^*_1 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ |
|   | 2 |  |  | $\beta^*_2 \sim \mathrm{mTN}(0, \sigma^2, -0.1, 0.1)$ |

Set-up. In these experiments we use covariates from two cancer types, namely the BRCA and KIRC groups. For all simulations we use $n = 250$, balanced group sample sizes, i.e. $n_k = 125$ for $k = 1, 2$, and varying dimensionality for the features; namely, (i) $p = 100$ ($n > p$ problem), (ii) $p = 250$ ($n = p$ problem) and (iii) $p = 500$ ($n < p$ problem). We consider sparse problems where the percentage of non-zero coefficients ($\beta^*$) is $s = 4\%$, and the setting in which some of the non-zero coefficients are at common locations and others are at disjoint locations across the two groups (placing half the non-zero coefficients at common locations). Specification of the common-location $\beta^*_k$'s determines the three general cases depicted initially in Figure 3. To rule out very small coefficients, we draw from a truncated normal distribution with support excluding the interval $(-0.1, 0.1)$. Group-specific intercept values and slope-generating mechanisms (based on Figure 3) are summarized in Table 1.
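The slope-generating mechanisms of Table 1 can be sketched with the truncnorm package, drawing mTN values by flipping a fair coin between the two tails; $\sigma$ is left as an argument because its value is tuned to the target signal-to-noise ratio described below.

```r
# Sketch: drawing slopes as in Table 1. TN is a one-sided truncated normal;
# mTN is a 0.5/0.5 mixture of the two tails, i.e. support excluding (a, b).
library(truncnorm)

r_tn <- function(n, sigma, lower = -Inf, upper = Inf)
  rtruncnorm(n, a = lower, b = upper, mean = 0, sd = sigma)

r_mtn <- function(n, sigma, a = -0.1, b = 0.1) {
  upper_tail <- rbinom(n, 1, 0.5) == 1
  ifelse(upper_tail, r_tn(n, sigma, lower = b), r_tn(n, sigma, upper = a))
}

# Case A, common locations: negative slopes in group 1, positive in group 2
# beta1_common <- r_tn(n_common, sigma, upper = -0.1)
# beta2_common <- r_tn(n_common, sigma, lower =  0.1)
```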
Given the matrices X_k, the intercepts α_k and the sparse vectors β_k, the response is generated as y_k ~ N_{n_k}(m_k, σ²_y I_{n_k}), where m_k = 1_{n_k} α_k + X_k β_k for k = 1, 2 and σ²_y = 1. The scale parameter σ² in Table 1 is tuned so that the overall signal-to-noise ratio under each case is approximately equal to three; i.e., Var(m)/σ²_y ≈ 3 with m = (m_1, m_2)ᵀ. Performance is evaluated as a function of the absolute distance |d| between the group-wise feature means µ1 and µ2. We initially normalize the features, so that µ1 = µ2 = 0, and consider again the case where each element of µ2 is shifted by d = hu, where h ∈ {0.1, 0.2, ..., 0.8, 0.9} and u is a uniformly random sign. Each simulation is repeated 20 times using random subsamples of features from the original data. Here we present results for the n < p setting (p = 500); results for p ∈ {100, 250} can be found in Appendix G. For the regression questions addressed below, our aim is to compare RJM with the clustering-then-regression approach. Obviously, a range of regression methods could be used in the second step. As the simulations are sparse and linear by design, to ensure that the simple clustering-then-regression approach is not disadvantaged we use the lasso in the second step.

Group assignment. We compare to the same methods considered in Section 5.1, except for MoEClust, as the dimensionalities are too large for a forward model search for optimal selection of expert and gating functions. Figure 6 presents error plots of adjusted Rand index averages. In general, we observe a phase-transition type of behaviour, as all methods improve as |d| increases. However, the transition is faster for RJM, which outperforms the other methods and stabilizes relatively quickly at correct assignment. For p = 500 the lasso-based RJM variants outperform the NJ variant for cases B and C; however, for p ∈ {100, 250} all RJM methods perform more or less equally well; see Figures 17 and 18 in Appendix G.

Figure 6: Second simulation, p = 500, group assignment. Average adjusted Rand index as a function of the absolute distance (|d|) of the group-wise covariate means, for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Variable selection. For this comparison we first set a benchmark model called the label-oracle-lasso, under which the true group labels are assumed known and we fit separate lasso regressions via glmnet (Friedman et al., 2008b). We also consider a cluster-lasso model which involves separate regressions based on estimated group labels. This approach involves an initial clustering step: we give an advantage to the cluster-lasso by using, for each dimension considered, the clustering approach that performs best (hclust for p = 100 and mclust for p ∈ {250, 500}). Naturally, the cluster-lasso will be equivalent to the oracle-lasso when group assignment is perfect. We summarize results via the area under the ROC curve (AUC) based on the ranking of the absolute values of the coefficients. In particular, we consider the difference between the AUC from the oracle-lasso and the AUC from the competing methods (cluster-lasso and the RJM approaches).
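The two lasso benchmarks can be sketched directly with glmnet; the snippet below is a schematic of the comparison (the function name oracle_lasso is ours), not the RJM estimator itself.

```r
library(glmnet)

# Label-oracle-lasso: with the true labels known, fit a separate
# cross-validated lasso within each group.
oracle_lasso <- function(X, y, z) {
  lapply(sort(unique(z)), function(k) {
    idx <- which(z == k)
    cv.glmnet(X[idx, , drop = FALSE], y[idx])
  })
}

# Cluster-lasso: the same procedure, but with labels estimated beforehand by a
# clustering step (e.g. mclust on X), so any label errors propagate into the
# group-specific regressions.
z_hat <- mclust::Mclust(X, G = 2)$classification
fits  <- oracle_lasso(X, y, z_hat)
```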
One standard-error plots for p = 500 are presented in Figure 7. As expected, cluster-lasso yields a smaller selection loss (approaching the oracle-lasso) as the separation of the group-wise means increases, but so do the RJM methods. RJM-FLasso is overall better and even seems to result in slightly improved selection in comparison to the oracle-lasso as |d| increases, possibly due to the fact that RJM uses weighted estimation based on the entire sample. Importantly, the RJM methods overall outperform the common cluster-lasso approach in the low- and medium-magnitude regions of |d|. These results illustrate the nontrivial gains possible from a unified treatment of the various aspects of the model vs. the simple approach of clustering followed by sparse regression. For the simulations with p = 100 and p = 250 see Figures 19 and 20 (Appendix G), respectively.

Figure 7: Second simulation, p = 500, variable selection. AUC loss relative to the oracle-lasso as a function of the absolute distance (|d|) of the group-wise covariate means, for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Estimation. Comparisons are made again with respect to the label-oracle-lasso; this time we consider the increase in root mean squared error (RMSE) resulting from the fact that the group labels are unknown. Here we also consider the pooled-lasso, a naive model which does not take group structure into account. This allows us to investigate the effect of ignoring group structure under each case in Table 1; this is of particular interest in Case A, where the common-location coefficients have opposite signs. Results for p = 500 are summarized in Figure 8. We use standardized coefficients for the calculation of RMSE in order to have a common scale across simulations and cases. As expected, under Case A the pooled-lasso model performs poorly, while the RJM methods provide overall better estimates than cluster-lasso. Under cases B and C, cluster-lasso and RJM, which are over-parameterized (the common-location effects are equal), perform more or less the same and are in general comparable to the pooled-lasso, which is under-parameterized (due to the disjoint-location effects). The p = 100 and p = 250 cases are shown in Appendix G (Figures 21 and 22); results are in general similar, with the difference that cluster-lasso performs better when |d| = 1 and RJM-RLasso performs worse overall. Overall, our illustrations suggest that RJM-FLasso is the most stable method, followed by RJM-NJ.

Figure 8: Second simulation, p = 500, estimation. Increase in RMSE relative to the oracle-lasso as a function of the absolute distance (|d|) of the group-wise covariate means, under group one (top) and group two (bottom), and for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Selection of number of clusters. In Appendix H we further consider the case where the number of clusters is not known a priori, in simulation experiments which take into account all four cancer types. Results on cluster selection using the predictive approach described in Section 4 are shown in Figure 23.
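The AUC summary used throughout this subsection (Figures 7, 19 and 20) can be computed directly from the ranking of the absolute coefficient estimates; the helper below (our naming, a sketch only) uses the Mann-Whitney form of the AUC.

```r
# AUC of the variable ranking induced by |beta_hat|: the probability that a
# randomly chosen signal variable outranks a randomly chosen noise variable,
# with ties counted as 1/2 (Mann-Whitney statistic).
selection_auc <- function(beta_hat, support) {  # support: logical, TRUE = truly non-zero
  s   <- abs(beta_hat)
  pos <- s[support]
  neg <- s[!support]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
```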
5.3 Real cancer data example

In this section we consider a fully non-synthetic example, where both features and response are real empirical data. The general strategy is as follows. We use the TCGA data as introduced above, with gene expression levels treated as features. Specifically, we include all four cancer types used in the previous section, selecting via stratified random sampling n = 250 samples in total. Stratified sampling ensures that the cancer-type proportions are preserved; specifically, the dataset under consideration consists of 102 BRCA, 51 KIRC, 49 LUAD and 48 THCA observations (abbreviations as previously introduced). A total of p = 100 gene expression levels (selected at random from all genes) are used in these experiments. As before, in the applications that follow the true labels (i.e., the cancer-type indicator) are treated as latent and hence not used in the analysis, but only to evaluate performance. As response, we use one of the p = 100 gene expression levels, with the remaining genes forming the feature set. This procedure has the advantage of allowing us to consider many different responses (genes) whilst entirely eschewing synthetic data generation. We first show an illustrative example using one particular gene as response and then show results from all responses considered.

Illustrative analysis for a particular response gene (NAPSA). We consider the gene NAPSA (napsin A aspartic peptidase; gene ID 9476) to illustrate the set-up. For this illustrative analysis we assume that the cancer types are given, so that it is known a priori that there are four classes. This particular gene is used to illustrate the approach as it is informative with respect to the hidden group structure, as shown in Figure 9 (left panel), but not to the extent of fully revealing the class structure. Heatmaps of the sample covariance matrices of the remaining 99 genes under each cancer type are presented in Figure 9 (panels on the right); these generally indicate slight differences in covariance structures.

Figure 9: Real data example, data visualization for a single, illustrative response. Plot (a) of response gene NAPSA in the four cancer types, and heatmaps of feature covariances in (b) BRCA, (c) KIRC, (d) LUAD and (e) THCA cancer types.

For evaluation of clustering performance we compare to the same methods as in Section 5.1, including again MoE as implemented in package MoEClust; however, for computational convenience, we use the option of a classification-EM algorithm, which is faster but generally sub-optimal (although we note that initial tests suggested a gain in clustering performance for this dataset). We further consider k-medoids and fuzzy c-means (implemented via pam and fanny in package cluster, respectively) and clustered support vector machines (clustSVM, package SwarmSVM; Gu and Han, 2013). Finally, for the purely cluster-oriented approaches (k-means/medoids, fuzzy c-means, hclust and mclust) we use as input the concatenated matrix containing the response and the predictor genes. Table 2 shows the resulting adjusted Rand index under each method (for k-means and clustSVM, which are highly sensitive to initialization, the values are averages from 100 runs). As seen, the RJMs clearly outperform the other approaches. Figure 10 shows the resulting regression coefficient estimates (396 in total given the four cancer types) from the three RJM variants, ranked in absolute value from highest to lowest (non-zero values in green; zero values in red).
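A sketch of this set-up is given below. The objects expr (a samples-by-genes expression matrix restricted to the 100 selected genes) and cancer_type (the per-sample label, used only for evaluation) are assumed inputs for illustration, not part of any package.

```r
set.seed(1)
n_target <- 250

# Stratified subsample preserving the cancer-type proportions (up to rounding)
idx <- unlist(lapply(split(seq_len(nrow(expr)), cancer_type), function(i)
  sample(i, round(n_target * length(i) / nrow(expr)))))

# One gene as response, the remaining 99 as features
response_gene <- "NAPSA"
y <- expr[idx, response_gene]
X <- expr[idx, setdiff(colnames(expr), response_gene)]
z_true <- cancer_type[idx]  # held out; used only to compute the adjusted Rand index
```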
Table 2: Real data application, clustering performance. Adjusted Rand index values for the ten methods under consideration using gene NAPSA as response variable.

Method          | k-means | k-medoids | fuzzy c-means | hclust     | clustSVM
Adj. Rand Index | 0.43    | 0.44      | 0.59          | 0.51       | 0.42
Method          | mclust  | MoEClust  | RJM-NJ        | RJM-FLasso | RJM-RLasso
Adj. Rand Index | 0.29    | 0.55      | 0.72          | 0.68       | 0.75

Consistent with the results presented in Section 5.1, we observe again that RJM-NJ results in the most parsimonious model (fewer than 100 predictors), followed by RJM-RLasso (around 100 predictors), while RJM-FLasso includes the most predictors (more than 100). We also observe that the lasso variants tend to shrink the coefficients of influential predictors more than RJM-NJ; this is also generally anticipated, as the NJ prior has heavier tails than the Bayesian lasso prior. Finally, the forward search of MoEClust included a few predictors in the gating networks, but resulted in entirely sparse expert networks, as all regression coefficients were set to zero.

Figure 10: Real data application, regression performance. Absolute values of RJM regression coefficients ranked in decreasing order, using gene NAPSA as response variable. Green points indicate non-zero coefficients; red points, coefficients that are set equal to zero.

Performance over all responses. Above we considered a specific gene to illustrate the key ideas; here we show results from all responses. That is, we consider in turn each of the genes as response, treating all others as features. Thus, there are 100 problems considered in total (each with the same four latent subgroups). In this case, the input for the purely clustering methods is the data matrix of the predictor variables. Violin plots of the resulting adjusted Rand index values from the ten methods are presented in Figure 11. These results, spanning one hundred different responses, support the results seen above: the three RJM variants consistently perform relatively well over most of the responses, and the results are broadly in line with those in Sections 5.1 and 5.2. In Appendix I, we further consider BIC-based model selection; as shown there, RJMs select the correct number of groups more frequently than GMMs and MoEs.

Figure 11: Real data application, clustering performance. Violin plots of adjusted Rand index values from TCGA-based experiments spanning one hundred different responses (see text for details).
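The sweep over responses can be organised as a simple loop. In the sketch below, rjm_fit stands in for whichever RJM variant is fitted (the actual interface is provided by the regjmix package and may differ, so both the function name and the `cluster` component are hypothetical), while expr, idx and z_true are as in the previous sketch.

```r
# Treat each gene in turn as the response and score the recovered grouping.
ari_all <- sapply(colnames(expr), function(g) {
  y <- expr[idx, g]
  X <- expr[idx, setdiff(colnames(expr), g)]
  fit <- rjm_fit(X, y, K = 4)                     # hypothetical wrapper around an RJM variant
  mclust::adjustedRandIndex(fit$cluster, z_true)  # cluster assignments scored against true labels
})
summary(ari_all)  # distribution over the 100 response genes (cf. Figure 11)
```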
5.4 Summary

Broadly speaking, the RJM variants performed similarly to one another in terms of clustering/group assignment. In several situations they outperformed the competing methods, while, at the same time, they tended to remain competitive across the range of scenarios tested. Also, RJM can improve selection of the number of groups when this is not known at the outset. There were some differences between the RJM variants with respect to regression modelling. The NJ approach tended to perform well, in terms of variable selection and estimation, in sparse settings characterized by moderate to strong signals, while the lasso approaches yielded relatively denser models overall.

6. Discussion

We introduced a class of regularized mixture models that jointly deal with sparse covariance structure and sparse regression in the context of latent groups. We showed that principled joint modeling of these two aspects leads to gains with respect to simpler decoupled or pooled strategies, and that exploiting established ℓ1-penalized tools and related Bayesian approaches leads to practically applicable solutions. The RJM methods presented in this paper are implemented in the R package regjmix, available at https://github.com/k-perrakis/regjmix. Future research directions include extensions to generalized linear models and mixed models. Below we discuss some additional aspects and point to specific directions for future work.

Distribution shifts and shift-robust learning. By accounting for data heterogeneity, RJMs help to guard against (potentially severe) confounding of multivariate regression models by hidden group structure. This has interesting connections to distribution shifts in machine learning and shift-robust learning; see e.g. Recht et al. (2018); Heinze-Deml and Meinshausen (2021). In particular, we think RJM would be a useful tool for shift-robust learning, since it could be used to block paired data (X, Y) into distributionally non-identical groups, which could in turn be used to train and test predictors in a shift-robust fashion, facilitating shift-robust learning under unknown distributional regimes.

Choice of regularization. In this work we used the graphical lasso approach for covariance estimation, mainly motivated by certain biomedical applications where network models are of interest. However, the general RJM strategy could be used with other kinds of multivariate models (e.g. factor models). For the regression coefficients we considered: (i) the Bayesian lasso prior under two strategies (FLasso/RLasso) and (ii) the NJ prior. For practitioners seeking the closest analogue to the popular lasso approach based on cross-validation in the non-latent regression setting, we recommend FLasso. When it is preferable to use a shrinkage prior with heavier tails we recommend using NJ, which can be very effective in detecting sparsity patterns without over-shrinking large coefficients. In general, our main goal was to explore some of the available regularization options, but we note that the RJM model is fairly modular in the sense that other methods from the penalized likelihood or Bayesian literatures (recent reviews are provided by Desboulets, 2018, and van Erp et al., 2019, respectively) can be used within the same framework. Of course, specifics will depend upon the approach; for instance, the elastic-net (Zou and Hastie, 2005) and the adaptive lasso (Zou, 2006) are fairly easy to incorporate at present, while other methods such as the horseshoe estimator (Carvalho et al., 2010) require further investigation.

High-dimensional issues. RJM remains effective when p > n, but a general issue when jointly modeling (Y, X) is that for relatively large p cluster allocation will be mainly guided by X. In the empirical examples RJM outperformed mclust (recall that without regularization RJM is equivalent to a GMM); to provide some intuition about this, let us consider the two regularization steps.
The first, on the covariance matrix of X, can be viewed as p lasso regressions (Meinshausen and Bühlmann, 2006) that essentially discard non-influential relationships among the features. The second discards non-influential predictor effects on the response. Overall this sparsification may be viewed as a dimensionality reduction, which mitigates over-emphasis on X. One idea for handling this issue as p grows larger is to consider explicit weighting of the effect of X, e.g. by replacing the multivariate normal in (3) with a density of the form N_p(x_i | µ_k, Σ_k)^{1/δ}, where the power parameter (δ > 1) would inflate the covariance. While from a computational standpoint the proposed framework is scalable and can also handle the p > n case, in very high dimensions it would become computationally burdensome. This could potentially be addressed via high-dimensional projections. Finally, although our EM convergence result is general, there remain open theoretical questions concerning rates of convergence and optimality of the estimators themselves.

Acknowledgments

We would like to thank Keefe Murphy for interesting discussions and for updating the R package MoEClust in order to make it possible to implement the comparisons presented in this paper. We acknowledge support via the German Bundesministerium für Bildung und Forschung (BMBF) project MechML, the Medical Research Council (programme number MC_UU_00002/17) and the National Institute for Health Research (NIHR) Cambridge Biomedical Research Centre.

Appendix A. Simulations for Section 1.2

The simulation results presented in Section 1.2 are based on 20 repetitions where we consider two groups (K = 2) with total n = 200 and a balanced design, i.e. n_k = 100 for k ∈ {1, 2}. The number of predictors is p = 10. In each group only one predictor (x*_k) has a nonzero coefficient (β*_k), and this predictor is chosen randomly over the 20 repetitions. The three cases in Figures 1 and 2 correspond to: (i) β*_1 = β*_2 = 0.5 (plots on the left), (ii) β*_1 = 0.5, β*_2 = 1 (plots in the middle), and (iii) β*_1 = 0.5, β*_2 = 1.5 (plots on the right). The features are generated as X_k ~ N_10(µ_k, 0.5 I_p). That is, the two feature groups share the same diagonal covariance structure, but we let the mean vectors vary. Specifically, the first mean vector is always µ_1 = (0, ..., 0)ᵀ, while for the second mean vector we consider again three cases (which lead to the variability with respect to the x-axis of the plots in Figure 2); namely, (i) µ_2 = (0, ..., 0)ᵀ, (ii) µ_2 = (0.5, ..., 0.5)ᵀ and (iii) µ_2 = (1, ..., 1)ᵀ. The response of each group is generated as y_k ~ N_{n_k}(x*_k β*_k, σ²_k I_{n_k}), where σ²_k = Var(x*_k β*_k)/5 for k ∈ {1, 2}. The regularized joint mixture model is based on the normal-Jeffreys prior discussed in Section 2.2.2. For the implementation of the Gaussian mixture and mixture-of-experts models we used the R packages mclust (Scrucca et al., 2016) and MoEClust (Murphy and Murphy, 2020), respectively, using the default model-search options and selecting the BIC-optimal model.
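A compact R sketch of this generator is given below (helper names are ours; the signal predictor is drawn independently for each group here, and the case (i) values are used).

```r
set.seed(1)
n_k <- 100; p <- 10

# One group: constant mean vector mu, diagonal covariance 0.5*I, a single
# randomly chosen signal predictor, and noise variance set to Var(signal)/5.
gen_group <- function(mu, beta_star) {
  X <- matrix(rnorm(n_k * p, mean = mu, sd = sqrt(0.5)), n_k, p)
  j <- sample(p, 1)                       # index of the signal predictor
  signal <- X[, j] * beta_star
  y <- signal + rnorm(n_k, sd = sqrt(var(signal) / 5))
  list(X = X, y = y, j = j)
}

g1 <- gen_group(mu = 0, beta_star = 0.5)  # case (i): beta1* = beta2* = 0.5,
g2 <- gen_group(mu = 0, beta_star = 0.5)  # with mu2 = (0, ..., 0)
X <- rbind(g1$X, g2$X); y <- c(g1$y, g2$y); z_true <- rep(1:2, each = n_k)
```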
Appendix B. Justification for the RLasso Pareto prior

We generally want the Pareto prior to be such that it will not penalize the regression coefficients asymptotically. Under the prior in (11) the mode of λ_k is a_n, while the prior mean is given by E(λ_k) = a_n b_n / (b_n − 1), for b_n > 1, and the prior variance by Var(λ_k) = a_n² b_n / ((b_n − 1)²(b_n − 2)), for b_n > 2. Given that a_n → 0 as n → ∞, in order to meet our requirement we want b_n → C > 2 as n → ∞; to that end, we specify b_n = (p + 1) − c √(2K log p / n), for c ∈ (0, 1]. Explicit specification of a_n is not required as it does not affect the posterior mode; any decreasing function of n (subject to a_n > 0) will satisfy the requirement. As for c, we recommend setting it equal to min{√(2p/(3n)), 1} as a default option.

Appendix C. Objective function under the NJ prior

The hierarchical form of the NJ prior is β_k | S_k ~ N_p(0, S_k), where S_k = diag(s_k1, ..., s_kp), assuming latent s_kj with π(s_kj) ∝ s_kj^{−1} for k = 1, ..., K and j = 1, ..., p. The conditional distribution of any s (dropping momentarily the subscripts k, j for simplicity) is

    p(s | β) = q(s) / ∫ q(s) ds,  with q(s) = p(β | s) s^{−1} and ∫ q(s) ds = |β|^{−1}.

Given this, it follows that

    E_{s|β}[s^{−1}] = ∫ s^{−1} p(s | β) ds = β^{−2}.   (41)

This result will be needed in the derivation of the E-step below. The joint prior of β_k and σ²_k is p(β, σ² | S) = p(β | S) p(σ²) ∝ ∏_{k=1}^K exp(−(1/2) β_kᵀ S_k^{−1} β_k) σ_k^{−2}. The objective under the NJ prior, presented in Eq. (19) in the main paper, is derived as follows:

    Q_NJ(θ_Y | θ_Y^{(t)}) = E_{z,S | y,X,θ_Y^{(t)}}[ log f(y | X, z, α, β, σ²) + log π(β | S) + log π(σ²) ]
      = E_{z | y,X,θ_Y^{(t)}}[ log f(y | X, z, α, β, σ²) ] + E_{S | β^{(t)}}[ log π(β | S) ] + log π(σ²)
      = Σ_i E_{z | y,X,θ_Y^{(t)}}[ log f(y_i | x_i, z_i, α, β, σ²) ] + E_{S | β^{(t)}}[ log π(β | S) ] + log π(σ²)
      = −(1/2) Σ_k { Σ_i m_{ki}^{(t)} [ (y_i − α_k − x_iᵀ β_k)² / σ²_k + log σ²_k ] + β_kᵀ E_{S_k | β_k^{(t)}}[S_k^{−1}] β_k + 2 log σ²_k }   (42)
      = −(1/2) Σ_k { (y − α_k 1_n − X β_k)ᵀ M_k^{(t)} (y − α_k 1_n − X β_k) / σ²_k + (n_k^{(t)} + 2) log σ²_k + β_kᵀ E_{S_k | β_k^{(t)}}[S_k^{−1}] β_k }   (43)
      = −(1/2) Σ_k { (y − α_k 1_n − X β_k)ᵀ M_k^{(t)} (y − α_k 1_n − X β_k) / σ²_k + (n_k^{(t)} + 2) log σ²_k + β_kᵀ V_k^{(t)} β_k },   (44)

where m_{ki}^{(t)} appearing in (42) is given in (14) in the main paper, M_k^{(t)} = diag(m_k^{(t)}) with m_k^{(t)} = (m_{k1}^{(t)}, ..., m_{kn}^{(t)})ᵀ, and V_k^{(t)} = diag(1/β_{k1}^{2(t)}, ..., 1/β_{kp}^{2(t)}). The transition from (43) to (44) is due to (41).

Appendix D. Details and implementation of the EM

For the graphical lasso optimization in (22) in the main paper we use the efficient R package glassoFast (Sustik and Calderhead, 2012). For the lasso optimizations in (26) we use glmnet (Friedman et al., 2008b) with penalty equal to λ_k^{(t+1)} / n_k^{(t)}. We initialize the algorithm via a simple clustering of the data, for which we use the R package mclust. Through the resulting group assignments we obtain initial estimates θ_{X_k}^{(0)} and θ_{Y_k}^{(0)}. In order to initiate EMs from different starting points we add random perturbations to µ_k^{(0)} and σ_k^{2(0)}, and positive random perturbations to the diagonal elements of Σ_k^{(0)}. The multiple EMs can easily be run in parallel. As a default option we use ten EM starts. For the termination of the algorithm we use a combination of two criteria that are commonly used in practice. The first is to simply set a maximum number (T) of EM iterations. Empirical results suggest that the option T = 20 is sufficient. The second criterion takes into account the relative change in the objective function in (15); namely, the algorithm is stopped when

    | Q(θ, τ, λ | θ^{(t)}, τ^{(t)}, λ^{(t)}) / Q(θ, τ, λ | θ^{(t−1)}, τ^{(t−1)}, λ^{(t−1)}) − 1 | < ϵ,

using as default option ϵ = 10^{−6}. Moreover, the algorithm is stopped, and results are discarded, when the sample size of a certain group becomes prohibitively small for estimation. We define this criterion as a function of the total sample size and the number of groups. Specifically, we terminate if min_k n_k^{(t)} ≤ n/(10K).

Appendix E.
Proof of Theorem 2 We need to prove that with our model and under the assumptions of Theorem 2, all the hypotheses of Theorem 3 of Meng and Rubin (1993) (Theorem 1) are met. As discussed throughout section 3.2.2 in the main paper, under the RJM model, the following conditions are sufficient to verify all the hypotheses of Theorem 4.1 except for hypothesis 6: the penalty is continuous, differentiable, such that each of the block maximisation is unique and Lpξq infinite on the border of Ξ. The first three conditions are already assumptions of our Theorem. Hence, it remains only to be shown that Lpξq is infinite on the border of Ξ and that hypothesis (6) is met. Both are very similar properties, we prove both of them together. For this task, we make use of the penalty lower bound assumption of our Theorem, recalled in Eq. (45). log τ 1 k µk Ωk log ˇˇΩ 1 k ˇˇ fλpλkq ρk log ρ 1 k |χk| φk . Regularized Joint Mixtures To begin with, we have: ˆ pyiρk χk x T i φkq2 2 log ρk xi µk 2 Ωk log |Ωk| 2 log τk pp 1q log 2π 2 fi,kpξq pyiρk χk x T i φkq2 2 log ρk xi µk 2 Ωk log |Ωk| 2 log τk 2 Lpξq npp 1q log 2π k 1 exp ˆ 1 2fi,kpξq . (46) From Eq. (45) and the fact that pyiρk χk x T i φkq2 ě 0 and xi µk 2 Ωk ě 0, we have: fi,kpξq ě 2 δ 1l kq log τl µl Ωl p1 n 2δ1l kq log |Ωl| fλpλlq ρl p1 n δ 1l kq log ρl |χl| φl fτ k,lpτlq fµ k,lpµlq fΩ k,lpΩlq fλ k,lpλlq fρ k,lpρlq fχ k,lpχlq fφ k,lpφlq . fτ k,lpτlq : p1 n δ 1l kq log τl , fµ k,lpµlq : µl , fΩ k,lpΩlq : Ωl p1 n 2δ1l kq log |Ωl| , fλ k,lpλlq : fλpλlq , fρ k,lpρlq : ρl p1 n δ 1l kq log ρl , fχ k,lpχlq : |χl| , fφ k,lpφlq : φl . The dependency on k, l is denoted in the indices of all these functions for the sake of uniformity, although only fτ k,l, fΩ k,l and fρ k,l actually depend on k and l. We recall that, with a ą 0, the function x ÞÑ x a log x, is lower bounded on R , and converges towards 8 both when x Ñ 0 and when x Ñ 8. To analyse fΩ k,l, it is convenient to consider the nuclear norm for Ωk and rewrite the whole as: fΩ k,lpΩlq ř j ψl,j p1 n 2δ1l kq log ψl,j, with tψl,jup j 1 the eigenvalues of Ωl. Perrakis, Lartigue, Dondelinger and Mukherjee With these observations at hand, note that all the functions in (48) can be lower bounded by the same constant c ą 8, valid for all values of k and l. They also all converge towards 8 on the boundary of their respective sets of definition. For m ą 0, we define Ξm as the compact subset of Ξ such that ξ P Ξm ðñ ξ P Ξ and @k 1, ..., K : ψmaxpΩkq ď m , ψminpΩkq ě 1 1 m ď λk ď m , 1 m ď ρk ď m , It is clear that @m ą 0, Ξm Ď Ξ . With all these objects defined, we can finish the proof. For any real number A ą 8, let us show that there exists M ą 0 such that @ξ P ΞzΞM, Lpξq ă A. First consider the following: for any real number B ą 8, there exists a m B ą 0 such that, for all k and l: if τl ă 1 m B , then fτ k,lpτlq ą B , if µl ą m B, then fµ k,lpµlq ą B , if ψmaxpΩlq ą m B, then fΩ k,lpΩlq ą B , if ψminpΩlq ă 1 m B , then fΩ k,lpΩlq ą B , if λl ă 1 m B or λl ą m B, then fλ k,lpλlq ą B , if ρl ă 1 m B or ρl ą m B, then fρ k,lpρlq ą B , if |χl| ą m B, then fχ k,lpχlq ą B , if φl ą m B, then fφ k,lpφlq ą B . If ξ P ΞzΞm B, then by definition of the sets Ξm (49), there exist at least one l P t1, ..., Ku such that at least one of the above scenarios is realised. By injecting the resulting lower bound into the inequality (47), we get: @k, fi,kpξq ą 2 nδp B p7K 1qcq . Regularized Joint Mixtures Then, from Eq. 
(46):

    L(ξ) = −(n(p + 1)/2) log 2π + Σ_{i=1}^n log Σ_{k=1}^K exp(−(1/2) f_{i,k}(ξ))
         < −(n(p + 1)/2) log 2π + Σ_{i=1}^n log Σ_{k=1}^K exp(−(1/n) δ (B + (7K − 1)c))
         = −(n(p + 1)/2) log 2π + n log K − δ (B + (7K − 1)c).

Since δ > 0, there exists B_A > 0 such that for all B ≥ B_A:

    −(n(p + 1)/2) log 2π + n log K − δ (B + (7K − 1)c) < A.

As a consequence, M := m_{B_A} is such that ∀ξ ∈ Ξ\Ξ_M, L(ξ) < A. In other words, {ξ ∈ Ξ | L(ξ) ≥ A} ⊆ Ξ_M. We have proven that for any A > −∞, there exists M > 0 such that {ξ ∈ Ξ | L(ξ) ≥ A} ⊆ Ξ_M. Since the Ξ_m are compact, this means that the closed set {ξ ∈ Ξ | L(ξ) ≥ A} is also compact, hence hypothesis (6) of Wu (1983) is verified. Moreover, since the Ξ_m lie in the interior of Ξ, this means that the log-likelihood goes to −∞ on the border of Ξ. Hence no EM sequence will take values on the border, and we can safely restrict attention to the interior of Ξ. With these last two hypotheses verified, we can apply Theorem 3 of Meng and Rubin (1993) and benefit from its convergence guarantees.

Appendix F. Further results from Section 5.1

Table 3: First simulation. Intercept and slope parameter values for the two groups under the three cases illustrated in Figure 3 in the main paper.

Case | Group | Intercept | Slope
A    | 1     | 0         | −1
     | 2     | 0         | −1
B    | 1     | 0         | −1
     | 2     | 1         | −1
C    | 1     | 0         | −1
     | 2     | 0         | −1

Figure 12: First simulation, correlated scenario. Variable inclusion frequencies (under 50 repetitions) for signal variables (in green), correlated noise variables (in black) and uncorrelated noise variables (in red) for regression cases A, B and C. Horizontal black lines correspond to a frequency of 0.5.

Figure 13: First simulation, uncorrelated scenario. Violin plots of MoEClust and RJM slope estimates (from 50 repetitions) for cases A, B and C. Horizontal black lines correspond to the true slopes.
Figure 14: First simulation, correlated scenario. Violin plots of MoEClust and RJM slope estimates (from 50 repetitions) for cases A, B and C. Horizontal black lines correspond to the true slopes.

Figure 15: First simulation, uncorrelated scenario. Violin plots of MoEClust and RJM intercept estimates (from 50 repetitions) for cases A, B and C. Horizontal black lines correspond to the real intercepts.

Figure 16: First simulation, correlated scenario. Violin plots of MoEClust and RJM intercept estimates (from 50 repetitions) for cases A, B and C. Horizontal black lines correspond to the real intercepts.

Appendix G. Further results from Section 5.2

Figure 17: Second simulation, p = 100, group assignment. Average adjusted Rand index as a function of the absolute distance (|d|) of the group-wise covariate means, for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Figure 18: Second simulation, p = 250, group assignment. Average adjusted Rand index as a function of the absolute distance (|d|) of the group-wise covariate means, for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Figure 19: Second simulation, p = 100, variable selection. AUC loss from the oracle-lasso as a function of the absolute distance (|d|) of the group-wise covariate means, for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Figure 20: Second simulation, p = 250, variable selection. AUC loss from the oracle-lasso as a function of the absolute distance (|d|) of the group-wise covariate means, for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]
Figure 21: Second simulation, p = 100, regression coefficient estimation. Increase in RMSE relative to the oracle-lasso as a function of the absolute distance (|d|) of the group-wise covariate means, under group one (top) and group two (bottom), and for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Figure 22: Second simulation, p = 250, regression coefficient estimation. Increase in RMSE relative to the oracle-lasso as a function of the absolute distance (|d|) of the group-wise covariate means, under group one (top) and group two (bottom), and for cases A (left), B (center) and C (right). [Error bars indicate standard errors from 20 repetitions.]

Appendix H. Selection of number of groups in Section 5.2

Here we consider all four cancer types (BRCA, KIRC, LUAD, THCA) and provide some results on cluster selection under an unknown number of groups, using the predictive approach described in Section 4. We consider three simulation settings where the respective true number of groups is g = 2 (using the BRCA and KIRC cancer types), g = 3 (BRCA, KIRC, LUAD) and g = 4 (further including THCA). For each setting we fit RJM models with two, three and four components. The simulations are along the lines of Section 5.2, considering Case A of Table 1 for p = 100. The conditions outlined in Table 1 for the β*_j's at the common locations are used again for g ∈ {3, 4}. Here we use the real group-sample proportions as they occur in the TCGA data set and assume that the sample size grows with the number of groups (so that the simulations are on an equal basis). Specifically, we set n = 250 × g. The resulting group sample sizes for g ∈ {2, 3, 4} are (n1 = 335, n2 = 165), (n1 = 382, n2 = 188, n3 = 180) and (n1 = 410, n2 = 200, n3 = 200, n4 = 190), respectively. We use 80% of the samples for training and 20% for testing.

Figure 23: Second simulation, cluster selection. Barplots of selected clusters (50 repetitions) using the predictive approach in Section 4. Correct identification is highlighted in green; the true number of clusters and the detection rate of the correct model are annotated in each panel. [Panel annotations, by row: true number of clusters 2, 3, 4 with detection rates 66%, 52%, 60% (first row), 54%, 60%, 50% (second row) and 60%, 52%, 56% (third row).]

Figure 23 shows barplots of the selected number of clusters resulting from 50 repetitions of the simulations. As seen, the correct model is selected in the majority of repetitions under all settings.
The detection rate of the correct model is also annotated in each panel of Figure 23. Here, RJM-NJ is slightly better, with an average overall detection rate of 59%, while RLasso and FLasso have 56% and 55%, respectively.

Appendix I. Selection of number of groups in Section 5.3

Here we present additional results concerning performance when a model selection step is included. The setting is as in Section 5.3 in the main text, except that the sample size is equal to 200. In particular, we treat in turn each of the genes as response, with all others considered as features. A model selection step is included over the number of clusters, selecting between models with g ∈ {2, 3, 4} components based on BIC. Here we compare with GMMs (mclust) and an MoE implementation from package flexmix (Grün and Leisch, 2008). Under the latter method we use elastic-net regularization (Zou and Hastie, 2005) in the expert networks and intercept-only multinomial gating functions (the latter is better than allowing 99 predictors to enter the gating functions in the absence of regularization for this part of the model). We note that results from MoEClust are not presented here, as this method (based on an incremental forward model search) never selected the correct model with four clusters. Results from the 100 attempts are summarized in Table 4.

Table 4: TCGA data application, performance including a model selection step. Number of times, out of 100 applications to the TCGA data, that the methods selected two, three and four clusters based on BIC. Each time a different gene expression was used as response variable, with the predictor matrix containing the remaining 99 gene expressions. The correct number of clusters is four, corresponding to the four cancer types included in the dataset.

Method        | ĝ = 2 | ĝ = 3 | ĝ = 4
GMM (mclust)  | 5     | 91    | 4
MoE (flexmix) | 18    | 67    | 15
RJM-NJ        | 17    | 18    | 65
RJM-FL        | 16    | 17    | 67
RJM-RL        | 17    | 19    | 64

References

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.

A. Anandkumar, D. J. Hsu, F. Huang, and S. M. Kakade. Learning mixtures of tree graphical models. In NIPS, pages 1061–1069, 2012.

K. Bae and B. Mallick. Gene selection using a two-level hierarchical Bayesian model. Bioinformatics, 20(18):3423–3430, 2004.

J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3):803–821, 1993.

C. Carvalho, N. Polson, and J. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.

G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793, 1995.

F. Chamroukhi and B. T. Huynh. Regularized maximum-likelihood estimation of mixture-of-experts for regression and clustering. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2018.

U. J. Dang and P. D. McNicholas. Families of parsimonious finite mixtures of regression models. In I. Morlini, T. Minerva, and M. Vichi, eds., Advances in Statistical Models for Data Analysis, pages 73–84. Springer International Publishing, 2015.

U. J. Dang, A. Punzo, P. D. McNicholas, S. Ingrassia, and R. P. Browne. Multivariate response and parsimony for Gaussian cluster-weighted models. Journal of Classification, 34(1):4–34, 2017.

C. Dayton and G. Macready. Concomitant-variable latent-class models. Journal of the American Statistical Association, 83(401):173–178, 1988.
L. D. D. Desboulets. A review on variable selection in regression analysis. Econometrics, 6(4), 2018.

D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.

M. A. T. Figueiredo. Wavelet-based image estimation: An empirical Bayes approach using Jeffreys noninformative prior. IEEE Transactions on Image Processing, 10(9):1322–1331, 2001.

M. A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150–1159, 2003.

M. Fop, T. B. Murphy, and L. Scrucca. Model-based clustering with sparse covariance matrices. Statistics and Computing, 29(4):791–819, 2019.

C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008a.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2008b.

S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer, Heidelberg, 2005.

J. Griffin and P. Brown. Alternative prior distributions for variable selection with very many more variables than observations. Technical report, University of Warwick, 2005.

B. Grün and F. Leisch. FlexMix version 2: Finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software, 28(4):1–35, 2008.

Q. Gu and J. Han. Clustered support vector machines. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 307–315, 2013.

C. Heinze-Deml and N. Meinshausen. Conditional variance penalties and domain shift robustness. Machine Learning, 110:303–348, 2021.

S. Ingrassia, S. C. Minotti, and G. Vittadini. Local statistical modeling via a cluster-weighted approach with elliptical distributions. Journal of Classification, 29(3):363–401, 2012.

S. Ingrassia, A. Punzo, G. Vittadini, and S. C. Minotti. The generalized linear mixed cluster-weighted model. Journal of Classification, 32(1):85–113, 2015.

R. A. Jacobs. Bias/variance analyses of mixtures-of-experts architectures. Neural Computation, 9(2):369–383, 1997.

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

I. T. Jolliffe. A note on the use of principal components in regression. Journal of the Royal Statistical Society, Series C, 31(3):300–303, 1982.

M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

A. Khalili. New estimation and feature selection methods in mixture-of-experts models. Canadian Journal of Statistics, 38(4):519–539, 2010.

A. Khalili and J. Chen. Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102(479):1025–1038, 2007.

A. Khalili and S. Lin. Regularization in finite mixture of regression models with diverging number of parameters. Biometrics, 69(2):436–446, 2013.

S. Liverani, D. I. Hastie, L. Azizi, M. Papathomas, and S. Richardson. PReMiuM: An R package for profile regression mixture models using Dirichlet processes. Journal of Statistical Software, 64(7):1–30, 2015.

G. J. McLachlan and D. Peel. Finite Mixture Models. John Wiley, New York, USA, 2000.
P. D. McNicholas and T. B. Murphy. Parsimonious Gaussian mixture models. Statistics and Computing, 18(3):285–296, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

X.-L. Meng and D. B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267–278, 1993.

J. Molitor, M. Papathomas, M. Jerrett, and S. Richardson. Bayesian profile regression with an application to the National Survey of Children's Health. Biostatistics, 11(3):484–498, 2010.

K. Murphy and T. B. Murphy. Gaussian parsimonious clustering models with covariates and a noise component. Advances in Data Analysis and Classification, 14(2):293–325, 2020.

T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

N. G. Polson and J. G. Scott. Shrink globally, act locally: Sparse Bayesian regularization and prediction. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds., Bayesian Statistics, Vol. 9, pages 501–538. Oxford University Press, 2010.

N. G. Polson and J. G. Scott. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902, 2012.

A. Punzo and S. Ingrassia. Clustering bivariate mixed-type data via the cluster-weighted model. Computational Statistics, 31(3):989–1013, 2016.

B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv:1806.00451 [cs.LG], 2018.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

L. Scrucca, M. Fop, T. B. Murphy, and A. E. Raftery. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1):289–317, 2016.

N. Städler and S. Mukherjee. Penalized estimation in high-dimensional hidden Markov models with state-specific graphical models. The Annals of Applied Statistics, 7(4):2157–2179, 2013.

N. Städler, P. Bühlmann, and S. van de Geer. ℓ1-penalization for mixture regression models. Test, 19:209–256, 2010.

N. Städler, F. Dondelinger, S. M. Hill, R. Akbani, Y. Lu, G. B. Mills, and S. Mukherjee. Molecular heterogeneity at the network level: high-dimensional testing, clustering and a TCGA case study. Bioinformatics, 33(18):2890–2896, 2017.

S. Subedi, A. Punzo, S. Ingrassia, and P. D. McNicholas. Clustering and classification via cluster-weighted factor analyzers. Advances in Data Analysis and Classification, 7(1):5–40, 2013.

M. A. Sustik and B. Calderhead. GLASSOFAST: An efficient GLASSO implementation. Technical Report TR-12-29:1–3, UTCS, 2012.

B. Taschler, F. Dondelinger, and S. Mukherjee. Model-based clustering in very high dimensions via adaptive projections. arXiv:1902.08472v1 [stat.ML], 2019.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58(1):267–288, 1996.

S. van Erp, D. L. Oberski, and J. Mulder. Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89:31–50, 2019.

H. Wang. Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4):867–886, 2012.

C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.