Journal of Machine Learning Research 24 (2023) 1-44. Submitted 4/23; Published 8/23.

Mixed Regression via Approximate Message Passing

Nelvin Tan (tcnt2@cam.ac.uk), Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, United Kingdom
Ramji Venkataramanan (rv285@cam.ac.uk), Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, United Kingdom

Editor: David Sontag

©2023 Nelvin Tan and Ramji Venkataramanan. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/23-0473.html.

We study the problem of regression in a generalized linear model (GLM) with multiple signals and latent variables. This model, which we call a matrix GLM, covers many widely studied problems in statistical learning, including mixed linear regression, max-affine regression, and mixture-of-experts. The goal in all these problems is to estimate the signals, and possibly some of the latent variables, from the observations. We propose a novel approximate message passing (AMP) algorithm for estimation in a matrix GLM and rigorously characterize its performance in the high-dimensional limit. This characterization is in terms of a state evolution recursion, which allows us to precisely compute performance measures such as the asymptotic mean-squared error. The state evolution characterization can be used to tailor the AMP algorithm to take advantage of any structural information known about the signals. Using state evolution, we derive an optimal choice of AMP denoising functions that minimizes the estimation error in each iteration. The theoretical results are validated by numerical simulations for mixed linear regression, max-affine regression, and mixture-of-experts. For max-affine regression, we propose an algorithm that combines AMP with expectation-maximization to estimate the intercepts of the model along with the signals. The numerical results show that AMP significantly outperforms other estimators for mixed linear regression and max-affine regression in most parameter regimes.

Keywords: Approximate Message Passing, Mixed Linear Regression, Max-Affine Regression, Mixture-of-Experts, Expectation-Maximization

1. Introduction

We study the problem of regression in a generalized linear model with multiple signals (regressors) and latent variables. Specifically, consider $L$ signal vectors $\beta^{(1)},\dots,\beta^{(L)} \in \mathbb{R}^p$, and define the signal matrix $B := (\beta^{(1)},\dots,\beta^{(L)}) \in \mathbb{R}^{p\times L}$. The goal is to estimate $B$ from an observed matrix $Y := [Y_1,\dots,Y_n]^\top \in \mathbb{R}^{n\times L_{\text{out}}}$, whose $i$th row $Y_i \in \mathbb{R}^{L_{\text{out}}}$ is generated as
$$Y_i = q(B^\top X_i,\ \Psi_i), \quad i \in [n]. \qquad (1)$$
Here $X_i \in \mathbb{R}^p$ is the $i$th feature vector, $\Psi_i \in \mathbb{R}^{L_\Psi}$ is a vector of unobserved auxiliary variables, and $q: \mathbb{R}^L \times \mathbb{R}^{L_\Psi} \to \mathbb{R}^{L_{\text{out}}}$ is a known function. We refer to the model (1) as the matrix generalized linear model, or matrix GLM. As we show below, the matrix GLM covers many widely studied regression models including mixed linear regression, max-affine regression, mixed GLMs, and mixture-of-experts.

1.1 Mixed Linear Regression

In this model, we wish to estimate $L$ signal vectors from unlabeled observations. Specifically, the components of the observed vector $Y := (Y_1,\dots,Y_n)$ are generated as
$$Y_i = \langle X_i, \beta^{(1)}\rangle\, c_{i1} + \dots + \langle X_i, \beta^{(L)}\rangle\, c_{iL} + \epsilon_i, \quad i\in[n]. \qquad (2)$$
Here $\epsilon_i$ is a noise variable, and $c_{i1},\dots,c_{iL} \in \{0,1\}$ are binary-valued latent variables such that $\sum_{l=1}^L c_{il} = 1$, for $i\in[n]$. The notation $\langle\cdot,\cdot\rangle$ denotes the Euclidean inner product and $[n] := \{1,\dots,n\}$. In words, each observation comes from exactly one of the $L$ signal vectors, but we do not know which one. A minimal data-generation sketch is given below.
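To make the model (2) concrete, here is a minimal data-generation sketch for the two-signal case (the function names, the choice of a standard Gaussian prior for the signals, and the mixing proportions are illustrative assumptions, not part of the paper's setup):

```python
import numpy as np

def generate_mlr(n, p, L, alphas, sigma, rng):
    """Sample (X, Y, B, c) from the mixed linear regression model (2).

    alphas[l] is the probability that observation i uses signal l.
    Feature entries are i.i.d. N(0, 1/n), matching Section 2.1.
    """
    B = rng.normal(size=(p, L))                      # signals, standard Gaussian prior here
    X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))
    c = rng.choice(L, size=n, p=alphas)              # latent labels c_i
    eps = sigma * rng.normal(size=n)
    Y = np.einsum("ij,ij->i", X, B[:, c].T) + eps    # Y_i = <X_i, beta^{(c_i)}> + eps_i
    return X, Y, B, c

rng = np.random.default_rng(0)
X, Y, B, c = generate_mlr(n=1000, p=500, L=2, alphas=[0.7, 0.3], sigma=0.2, rng=rng)
```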
The mixed linear regression (MLR) model in (2) is a special case of the matrix GLM in (1), with the rows of the auxiliary matrix given by $\Psi_i = (c_{i1},\dots,c_{iL},\epsilon_i)$, for $i\in[n]$. The case $L = 1$ is standard linear regression, which implicitly assumes a homogeneous population, i.e., a single regression vector captures the population characteristics of the entire sample. This assumption may not be realistic when the sample contains several sub-populations; standard linear regression may then produce biased estimates, since the population heterogeneity is unobserved. The MLR model is more flexible, as it allows the regressors to differ across unobserved sub-populations. MLR has been used for analyzing heterogeneous data in a variety of fields including biology, physics, and economics (McLachlan and Peel, 2004; Grün and Leisch, 2007; Li et al., 2019; Devijver et al., 2020).

In the MLR model (2), a natural approach for estimating $\beta^{(1)},\dots,\beta^{(L)}$ from $\{X_i, Y_i\}_{i=1}^n$ is the global least-squares estimator:
$$(\hat\beta^{(1)},\dots,\hat\beta^{(L)}) = \operatorname*{argmin}_{\substack{\beta^{(1)},\dots,\beta^{(L)}\in\mathbb{R}^p \\ c_{i1},\dots,c_{iL}\in\{0,1\},\ \sum_{l=1}^L c_{il}=1,\ i\in[n]}} \sum_{i=1}^n \Big(Y_i - \sum_{l=1}^L \langle X_i,\beta^{(l)}\rangle\, c_{il}\Big)^2. \qquad (3)$$
However, this optimization problem is non-convex, and computing the global minimum is known to be NP-hard (Yi et al., 2014). A range of alternative approaches has been proposed, including estimators based on: Bayesian methods (Viele and Tong, 2002); spectral methods (Chaganty and Liang, 2013; Yi et al., 2014); expectation-maximization (Faria and Soromenho, 2010; Städler et al., 2010; Zhang et al., 2020); alternating minimization and its variants (Yi et al., 2014; Shen and Sanghavi, 2019; Ghosh and Kannan, 2020; Zilber and Nadler, 2023); convex relaxation (Chen et al., 2014); moment descent methods (Li and Liang, 2018; Chen et al., 2020); and tractable non-convex objective functions (Zhong et al., 2016; Barik and Honorio, 2022). Most of these techniques are generic: while some can incorporate certain constraints such as sparsity, they are not well-equipped to exploit specific structural information about $\beta^{(1)},\dots,\beta^{(L)}$, such as a known prior on the signals. Moreover, these methods are sub-optimal with respect to sample complexity: for accurate recovery they require the number of observations $n$ to be at least of order $p\log p$ (Yi et al., 2014; Li and Liang, 2018; Chen et al., 2020). In contrast, here we consider the high-dimensional regime where $n$ is proportional to $p$ and provide exact asymptotics for the performance of the proposed estimator.

1.2 Max-Affine Regression

In the max-affine regression (MAR) model, we have
$$Y_i = \max\big\{\langle X_i,\beta^{(1)}\rangle + b_1,\ \dots,\ \langle X_i,\beta^{(L)}\rangle + b_L\big\} + \epsilon_i, \quad i\in[n]. \qquad (4)$$
Here $b_1,\dots,b_L\in\mathbb{R}$ are the intercepts (typically unknown), and $\epsilon_i$ is a zero-mean noise variable that is independent of $X_i$. In words, each observation is the maximum of $L$ affine functions, each defined via a different signal vector. When $L=1$ and $b_1=0$, the model (4) corresponds to standard linear regression. When $L=2$ with $\beta^{(1)} = -\beta^{(2)} = \beta$ and $b_1 = b_2 = 0$, then (4) reduces to $Y_i = |\langle X_i,\beta\rangle| + \epsilon_i$. This is the widely studied phase retrieval problem (Netrapalli et al., 2013; Candès et al., 2015), which arises in applications such as scientific imaging (Fogel et al., 2016).
For general $L$, the function $x \mapsto \max_{l\in\{1,\dots,L\}}\{\langle x,\beta_l\rangle + b_l\}$ is always convex; thus, estimation under model (4) can be used to fit convex functions to the observed data. Indeed, the MAR model serves as a parametric approximation to the non-parametric convex regression model
$$Y_i = \phi(X_i) + \epsilon_i, \quad i\in[n], \qquad (5)$$
where $\phi:\mathbb{R}^p\to\mathbb{R}$ is an unknown convex function (Balázs et al., 2015; Ghosh et al., 2022). Unfortunately, convex regression suffers from the curse of dimensionality unless $p$ is small (Guntuboyina and Sen, 2013). Since convex functions can be approximated to arbitrary accuracy by maxima of affine functions, it is reasonable to simplify the problem by considering only those convex functions that can be written as a maximum of a fixed number of affine functions. This assumption leads directly to the MAR model (4), which has been studied as a tractable alternative to the non-parametric convex regression model (5) in applications where $p$ is large, such as data in economics, finance, and operations research (Balázs, 2016). MAR can also be used as a tractable model for the problem of estimating convex sets from support function measurements (Soh and Chandrasekaran, 2021), which arises in tomography applications (Prince and Willsky, 1990; Gregor and Rannou, 2002).

To write the MAR model as an instance of the matrix GLM (1), let us concisely denote the unknown parameters by $\beta^{(l)}_{\text{ma}} = (\beta^{(l)\top}, b_l)^\top \in \mathbb{R}^{p+1}$ for $l\in[L]$, and the observations by $(X^{(\text{ma})}_i, Y_i)$ for $i\in[n]$, where $X^{(\text{ma})}_i = (X_i^\top, 1)^\top \in \mathbb{R}^{p+1}$ are the augmented features. Under the augmented features and signals, the model (4) becomes
$$Y_i = \max_{l\in[L]}\big\{\langle X^{(\text{ma})}_i,\ \beta^{(l)}_{\text{ma}}\rangle\big\} + \epsilon_i, \quad i\in[n], \qquad (6)$$
which is of the form in (1). A natural approach for estimating $\beta^{(1)}_{\text{ma}},\dots,\beta^{(L)}_{\text{ma}}$ is the least-squares estimator
$$(\hat\beta^{(1)}_{\text{ma}},\dots,\hat\beta^{(L)}_{\text{ma}}) = \operatorname*{argmin}_{\beta^{(1)}_{\text{ma}},\dots,\beta^{(L)}_{\text{ma}}\in\mathbb{R}^{p+1}}\ \sum_{i=1}^n \Big(Y_i - \max_{l\in[L]}\big\{\langle X^{(\text{ma})}_i,\beta^{(l)}_{\text{ma}}\rangle\big\}\Big)^2. \qquad (7)$$
Ghosh et al. (2022, Lemma 1) showed that a global minimizer of the least-squares criterion above always exists but is in general not unique, since any relabelling of the indices (of the signal vectors) of a minimizer is also a minimizer. Furthermore, the optimization problem in (7) is non-convex, and for a worst-case choice of the design matrix $X = (X_1,\dots,X_n)^\top$, the problem is NP-hard (Ghosh et al., 2022).
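For concreteness, here is a minimal sketch of data generation from the MAR model (4); the function name and the choice of signal prior are illustrative assumptions:

```python
import numpy as np

def generate_mar(n, p, betas, b, sigma, rng):
    """Sample (X, Y) from the max-affine regression model (4).

    betas: (p, L) matrix of signals; b: length-L vector of intercepts.
    """
    X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))
    affine = X @ betas + b                 # affine[i, l] = <X_i, beta^{(l)}> + b_l
    Y = affine.max(axis=1) + sigma * rng.normal(size=n)
    return X, Y

rng = np.random.default_rng(1)
p, L = 500, 2
betas = rng.normal(size=(p, L))
X, Y = generate_mar(n=1500, p=p, betas=betas, b=np.array([1.0, 0.0]), sigma=0.1, rng=rng)
```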
1.3 Mixed Generalized Linear Models and Mixture-of-Experts

A mixed GLM is a generalization of the MLR model (2) in which the output function is not necessarily linear. Specifically, for some known function $q:\mathbb{R}^2\to\mathbb{R}$, we have
$$Y_i = q\big(\langle X_i,\beta^{(1)}\rangle\, c_{i1} + \dots + \langle X_i,\beta^{(L)}\rangle\, c_{iL},\ \epsilon_i\big), \quad i\in[n]. \qquad (8)$$
As before, $\epsilon_i$ is a noise variable, and $c_{i1},\dots,c_{iL}\in\{0,1\}$ are binary-valued latent variables such that $\sum_{l=1}^L c_{il}=1$, for $i\in[n]$. The case $L=1$ is the standard GLM which, with suitable choices for $q$ and $\epsilon$, covers a range of statistical learning problems including logistic regression, phase retrieval, and one-bit compressed sensing. In all these settings, the mixed GLM model (8) allows the flexibility to account for unlabeled data coming from multiple sub-populations (Khalili and Chen, 2007; Sedghi et al., 2016).

The mixture-of-experts model, introduced by Jacobs et al. (1991) and Jordan and Jacobs (1994), is a generalization of the mixed GLM, where the probability of selecting each regressor can depend on the feature vector. In addition to the $L$ regressors $\beta^{(1)},\dots,\beta^{(L)}\in\mathbb{R}^p$, here we have $L$ gating parameters $w^{(1)},\dots,w^{(L)}\in\mathbb{R}^p$, using which the observations are generated as follows. For each $i\in[n]$:
$$Y_i = q(\langle X_i,\beta^{(l)}\rangle,\ \epsilon_i) \quad\text{with probability}\quad \frac{\exp(\langle X_i, w^{(l)}\rangle)}{\sum_{l'=1}^L \exp(\langle X_i, w^{(l')}\rangle)}, \quad\text{for } l\in[L]. \qquad (9)$$
Here $q:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ is a known activation function and $\epsilon_i$ is a noise variable. Mixture-of-experts models and their variants have been widely studied in machine learning (Yuksel et al., 2012; Huang and Yao, 2012; Makkuva et al., 2019, 2020) and in applications such as computer vision (Gross et al., 2017), natural language processing (Shazeer et al., 2017), and econometrics (Huang et al., 2013; Compiani and Kitamura, 2016).

To see that the mixture-of-experts model is a special case of the matrix GLM in (1), we take the signal matrix to be $B = [\beta^{(1)},\dots,\beta^{(L)}, w^{(1)},\dots,w^{(L)}]$ and the auxiliary matrix $\Psi\in\mathbb{R}^{n\times 2}$ with rows $\Psi_i = (\psi_i, \epsilon_i)$, where $\psi_i \overset{\text{i.i.d.}}{\sim} \text{Uniform}[0,1]$ for $i\in[n]$, independent of $\{\epsilon_i\}_{i\in[n]}$. Then the model (9) can be written as
$$Y_i = q(B^\top X_i, \Psi_i) = \sum_{l=1}^L q(\langle X_i,\beta^{(l)}\rangle, \epsilon_i)\ \mathbb{1}\Bigg\{ \sum_{l'=1}^{l-1} \frac{\exp(\langle X_i,w^{(l')}\rangle)}{\sum_{l''=1}^L \exp(\langle X_i,w^{(l'')}\rangle)} \,<\, \psi_i \,\le\, \sum_{l'=1}^{l} \frac{\exp(\langle X_i,w^{(l')}\rangle)}{\sum_{l''=1}^L \exp(\langle X_i,w^{(l'')}\rangle)} \Bigg\}, \qquad (10)$$
where $\mathbb{1}\{\cdot\}$ is the indicator function.
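The cumulative-softmax representation (10) translates directly into a sampler. Below is a minimal sketch for the two-expert, identity-activation case used later in Section 4.4; the function name and vectorized layout are illustrative assumptions:

```python
import numpy as np

def sample_moe(X, betas, ws, sigma, rng):
    """Sample MOE observations via (10): draw psi_i ~ Uniform[0,1] and pick
    the expert whose cumulative softmax-gating interval contains psi_i."""
    logits = X @ ws                               # (n, L) gating scores <X_i, w^{(l)}>
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    psi = rng.uniform(size=X.shape[0])
    # expert l is chosen when the cumulative gating probability first exceeds psi_i
    labels = (probs.cumsum(axis=1) < psi[:, None]).sum(axis=1)
    Z = np.take_along_axis(X @ betas, labels[:, None], axis=1)[:, 0]
    eps = sigma * rng.normal(size=X.shape[0])
    return Z + eps, labels                        # identity activation: q(z, eps) = z + eps
```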
1.4 Approximate Message Passing

The main contribution of this work is to design and analyze an approximate message passing (AMP) algorithm for estimation in the matrix GLM model (1). We then apply the algorithm to mixed linear regression, max-affine regression, and mixture-of-experts. AMP is a family of iterative algorithms which can be tailored to take advantage of structural information about the signals and the model, e.g., a known prior on the signal vectors or on the proportion of observations that come from each signal. AMP algorithms were first proposed for the standard linear model (Kabashima, 2003; Donoho et al., 2009; Bayati and Montanari, 2011a; Krzakala et al., 2012), but have since been applied to a range of statistical problems, including estimation in generalized linear models (Rangan, 2011; Schniter and Rangan, 2014; Barbier et al., 2019; Ma et al., 2019; Sur and Candès, 2019; Maillard et al., 2020; Mondelli and Venkataramanan, 2021) and low-rank matrix estimation (Deshpande and Montanari, 2014; Fletcher and Rangan, 2018; Kabashima et al., 2016; Lesieur et al., 2017; Montanari and Venkataramanan, 2021; Li and Wei, 2023). In all these settings, under suitable model assumptions, the performance of AMP in the high-dimensional limit is characterized by a succinct deterministic recursion called state evolution. The state evolution characterization has been used to show that AMP achieves Bayes-optimal performance for some models (Deshpande and Montanari, 2014; Donoho et al., 2013; Montanari and Venkataramanan, 2021; Barbier et al., 2019), and a conjecture from statistical physics states that AMP is optimal among polynomial-time algorithms for a wide range of statistical estimation problems.

1.5 Main Contributions

We propose an AMP algorithm for the matrix GLM (1), under the assumption that the features $\{X_i\}_{i\in[n]}$ are i.i.d. Gaussian. Our first technical contribution is a state evolution result for the AMP algorithm (Theorem 1), which gives a rigorous characterization of its performance in the high-dimensional limit as $n,p\to\infty$ with a fixed ratio $\delta = n/p$, for a constant $\delta > 0$. This allows us to compute exact asymptotic formulas for performance measures such as the mean-squared error (MSE) and the normalized correlation between the signals and their estimates.

The AMP algorithm uses a pair of denoising functions to produce updated signal estimates in each iteration. The accuracy of these estimates can be tracked using a signal-to-noise ratio defined in terms of the state evolution parameters. Our second contribution (Proposition 2) is to derive an optimal choice of denoising functions that maximizes this signal-to-noise ratio. The optimal choice for one of these functions depends on the prior on the signals, while the other depends only on the output function $q(\cdot,\cdot)$ in (1).

In Section 4, we present numerical simulation results for mixed linear regression, max-affine regression, and mixture-of-experts. The case of max-affine regression requires special attention, as the AMP derived for the matrix GLM cannot be applied directly: the matrix GLM AMP and its state evolution analysis are derived assuming that the features are all i.i.d. Gaussian. However, to write MAR as an instance of the matrix GLM, recall from (6) that we use the augmented features $X^{(\text{ma})}_i = (X_i^\top, 1)^\top\in\mathbb{R}^{p+1}$, $i\in[n]$, which are not i.i.d. Gaussian due to the last component being 1. We address this by using the original formulation of MAR in (4), with the intercepts $b_1,\dots,b_L$ treated as unknown model parameters. We estimate these intercepts via an expectation-maximization (EM) algorithm that uses the AMP iterates to approximate certain intractable quantities. This leads to a combined EM-AMP algorithm, which is described in Section 4.3. For both mixed linear regression and max-affine regression, the numerical results show that AMP significantly outperforms other popular estimators (such as alternating minimization) in most parameter regimes.

Though the algorithms and results in this paper focus on estimating the signals $\beta^{(1)},\dots,\beta^{(L)}$, they can often be translated to estimating the latent variables as well. For example, in mixed linear regression, given signal estimates $\hat\beta^{(1)},\dots,\hat\beta^{(L)}$, the labels can be estimated as $\hat c_i = \operatorname{argmin}_{l\in[L]}\,(Y_i - \langle X_i,\hat\beta^{(l)}\rangle)^2$, for $i\in[n]$.

A preliminary version of this paper was published in the proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS 2023) (Tan and Venkataramanan, 2023). The focus of the preliminary version was largely on mixed linear regression. In the current paper, in addition to MLR, we provide results for max-affine regression and mixture-of-experts, including the novel EM-AMP algorithm for MAR.

Technical ideas. The state evolution performance characterization in Theorem 1 is proved using a change of variables that maps the proposed algorithm to an abstract AMP recursion with matrix-valued iterates. A state evolution characterization for this abstract AMP was established by Javanmard and Montanari (2013); this result is translated via the change of variables to obtain the state evolution characterization for the proposed AMP. Our combined EM-AMP algorithm for max-affine regression is inspired by the work of Vila and Schniter (2013), who used a similar approach for the problem of sparse linear regression with unknown parameters in the signal prior. Though our AMP algorithm and its analysis assume i.i.d. Gaussian features, we expect that they can be extended to a much broader class of i.i.d. designs using the recent universality results of Wang et al. (2022). Another exciting direction for future work is to generalize the AMP algorithm and its state evolution to mixed regression models with rotationally invariant design matrices.
This can be done via a reduction to an abstract AMP recursion for rotationally invariant matrices, similar to the ones studied in Fan (2022) and Zhong et al. (2021).

1.6 Other Related Work

Mixtures of linear and generalized linear models. The special case of symmetric mixed linear regression, where $\beta^{(1)} = -\beta^{(2)}$, has been studied in many recent works. We note that symmetric MLR is a version of the phase retrieval problem (Netrapalli et al., 2013; Candès et al., 2015; Fogel et al., 2016). Balakrishnan et al. (2017) and Klusowski et al. (2019) obtained statistical guarantees on the performance of the EM algorithm for a class of problems, including symmetric MLR. Variants of the EM algorithm for symmetric MLR in the high-dimensional setting (with sparse signals) were analyzed by Wang et al. (2015), Yi and Caramanis (2015), and Zhu et al. (2017). Fan et al. (2018) obtained minimax lower bounds for a class of computationally feasible algorithms for symmetric MLR. Kong et al. (2020) studied MLR as a canonical example of meta-learning: they consider the setting where the number of signals $L$ is large, and derive conditions under which a large number of signals with a few observations each can compensate for the lack of signals with abundantly many observations. The case of MLR with sparse signals was studied by Krishnamurthy et al. (2019) and Pal et al. (2021), and the gap between statistical and computational performance limits for sparse MLR was recently characterized by Arpino and Venkataramanan (2023). Pal et al. (2022) studied the prediction error of MLR in the non-realizable setting, where no generative model is assumed for the data. The convergence rate of maximum-likelihood estimation for the parameters of a mixed GLM was derived by Ho et al. (2022). Chandrasekher et al. (2023) analyzed the performance of a class of iterative algorithms (not including AMP) for mixed GLMs, providing a sharp characterization of the per-iteration error with sample-splitting in the regime $n \gtrsim p\,\mathrm{polylog}(p)$, assuming a Gaussian design and a random initialization. Spectral estimators for mixed GLMs were studied in the recent work of Zhang et al. (2022), which characterizes their asymptotic performance for Gaussian designs and independent signals. Statistical and computational limits for a two-layer neural network, a model similar to the matrix GLM, were studied by Aubin et al. (2018). A vector AMP algorithm for MAP and MMSE inference in a similar multi-layer model was proposed by Pandit et al. (2020). Despite the similarities with the matrix GLM, to the best of our knowledge these works do not investigate AMP for the settings of mixed and max-affine regression.

Max-affine regression. For the non-parametric convex regression model in (5), the least-squares estimator is $\hat\phi^{(\text{ls})} \in \operatorname{argmin}_\phi \sum_{i=1}^n (Y_i - \phi(X_i))^2$, where the minimization is over all convex functions $\phi$. This least-squares estimator can be computed by solving a quadratic program. Theoretical properties of this estimator and algorithms to compute it were studied by Seijo and Sen (2011), Lim and Glynn (2012), and Mazumder et al. (2019). For the MAR model (4), several approaches for signal estimation have been proposed, including alternating minimization (Magnani and Boyd, 2009; Ghosh et al., 2022), convex adaptive partitioning (Hannah and Dunson, 2013), and adaptive max-affine partitioning (Balázs, 2016).
Among these, theoretical guarantees have been established only for alternating minimization; these guarantees hold in the regime where $n$ is at least of order $p\log(n/p)$ (Ghosh et al., 2022). In contrast, in this paper we consider the high-dimensional regime where $n$ is proportional to $p$ as $n\to\infty$.

2. Preliminaries

Notation. All vectors (including rows of matrices) are treated as column vectors unless otherwise stated. Matrices are denoted by upper-case letters, and given a matrix $A$, we write $A_i$ for its $i$th row. The notation $M \succeq 0$ denotes that the square matrix $M$ is positive semidefinite. We write $I_p$ for the $p\times p$ identity matrix. For $r\in[1,\infty)$, we write $\|x\|_r$ for the $\ell_r$-norm of $x = (x_1,\dots,x_n)\in\mathbb{R}^n$, so that $\|x\|_r = \big(\sum_{i=1}^n |x_i|^r\big)^{1/r}$. Given random variables $U, V$, we write $U \overset{d}{=} V$ to denote equality in distribution.

Complete convergence. The asymptotic results in this paper are stated in terms of complete convergence (Hsu and Robbins, 1947; Feng et al., 2022, Sec. 1.1). This is a stronger mode of stochastic convergence than almost sure convergence, and is denoted using the symbol $\xrightarrow{c}$. Let $\{X_n\}$ be a sequence of random elements taking values in a Euclidean space $E$. We say that $X_n$ converges completely to a deterministic limit $x\in E$, and write $X_n \xrightarrow{c} x$, if $Y_n \to x$ almost surely for any sequence of $E$-valued random elements $\{Y_n\}$ with $Y_n \overset{d}{=} X_n$ for all $n$.

Wasserstein distances. For $D\in\mathbb{N}$, let $\mathcal{P}_D(r)$ be the set of all Borel probability measures on $\mathbb{R}^D$ with finite $r$th moment. That is, any $P\in\mathcal{P}_D(r)$ satisfies $\int_{\mathbb{R}^D} \|x\|_2^r\, dP(x) < \infty$. For $P, Q\in\mathcal{P}_D(r)$, the $r$-Wasserstein distance between $P$ and $Q$ is defined by $d_r(P,Q) = \inf_{(X,Y)} \mathbb{E}[\|X-Y\|_2^r]^{1/r}$, where the infimum is taken over all pairs of random vectors $(X,Y)$ defined on a common probability space with $X\sim P$ and $Y\sim Q$.

2.1 Model Assumptions

In the model (1), each feature vector $X_i\in\mathbb{R}^p$ is assumed to have independent Gaussian entries with zero mean and variance $1/n$, i.e., $X_i \overset{\text{i.i.d.}}{\sim} N(0, I_p/n)$. The $n\times p$ design matrix $X$ is formed by stacking the feature vectors, $X = [X_1,\dots,X_n]^\top$. Similarly, the auxiliary variable matrix $\Psi\in\mathbb{R}^{n\times L_\Psi}$ is defined as $\Psi = [\Psi_1,\dots,\Psi_n]^\top$. The design matrix $X$ is independent of both the signal matrix $B = (\beta^{(1)},\dots,\beta^{(L)})\in\mathbb{R}^{p\times L}$ and the auxiliary variable matrix $\Psi$. As $p\to\infty$, we assume that $n/p = \delta$, for some constant $\delta > 0$.

As $p\to\infty$, the empirical distributions of the rows of the signal matrix and the auxiliary variable matrix are assumed to converge in Wasserstein distance to well-defined limits. More precisely, for some $r\in[2,\infty)$, there exist random variables $\bar B\sim P_{\bar B}$ (where $\bar B\in\mathbb{R}^L$) and $\bar\Psi\sim P_{\bar\Psi}$ (where $\bar\Psi\in\mathbb{R}^{L_\Psi}$) with $\mathbb{E}[\bar B\bar B^\top]\succ 0$ and $\mathbb{E}\big[\sum_{l=1}^L |\bar B_l|^r\big],\ \mathbb{E}\big[\sum_{l=1}^{L_\Psi} |\bar\Psi_l|^r\big] < \infty$, such that, writing $\nu_p(B)$ and $\nu_n(\Psi)$ for the empirical distributions of the rows of $B$ and $\Psi$ respectively, we have $d_r(\nu_p(B), P_{\bar B}) \xrightarrow{c} 0$ and $d_r(\nu_n(\Psi), P_{\bar\Psi}) \xrightarrow{c} 0$.

3. AMP for the Matrix GLM

Consider the task of estimating the signal matrix $B$ given $\{X_i, Y_i\}_{i\in[n]}$, generated according to (1).

Algorithm. In each iteration $k\ge 1$, the AMP algorithm iteratively produces estimates $\hat B^k$ and $\Theta^k$ of $B\in\mathbb{R}^{p\times L}$ and $\Theta := XB\in\mathbb{R}^{n\times L}$, respectively. Starting with an initializer $\hat B^0\in\mathbb{R}^{p\times L}$ and defining $\hat R^{-1} := 0\in\mathbb{R}^{n\times L}$, for $k\ge 0$ we compute:
$$\Theta^k = X\hat B^k - \hat R^{k-1}(F^k)^\top, \qquad \hat R^k = g_k(\Theta^k, Y),$$
$$B^{k+1} = X^\top \hat R^k - \hat B^k (C^k)^\top, \qquad \hat B^{k+1} = f_{k+1}(B^{k+1}). \qquad (11)$$
Here the functions $g_k:\mathbb{R}^L\times\mathbb{R}^{L_{\text{out}}}\to\mathbb{R}^L$ and $f_{k+1}:\mathbb{R}^L\to\mathbb{R}^L$ act row-wise on their matrix inputs, and the matrices $C^k, F^{k+1}\in\mathbb{R}^{L\times L}$ are defined as
$$C^k = \frac{1}{n}\sum_{i=1}^n g_k'(\Theta^k_i, Y_i), \qquad F^{k+1} = \frac{1}{n}\sum_{j=1}^p f_{k+1}'(B^{k+1}_j), \qquad (12)$$
where $g_k', f_{k+1}'\in\mathbb{R}^{L\times L}$ denote the Jacobians of $g_k, f_{k+1}$, respectively, with respect to their first arguments. We note that the time complexity of each iteration of (11) is $O(npL)$.
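A minimal sketch of the iteration (11)-(12) is given below, with the denoisers and their averaged Jacobians supplied as callbacks; the callback signatures are illustrative assumptions, not the paper's implementation (which approximates $f_k, g_k$ and their derivatives from the data, see Appendix A):

```python
import numpy as np

def amp_matrix_glm(X, Y, B0, f, g, df, dg, num_iter=10):
    """Sketch of the matrix-GLM AMP iteration (11)-(12).

    f(k, B)    -> (p, L) row-wise denoiser f_k applied to B
    g(k, T, Y) -> (n, L) row-wise denoiser g_k applied to (Theta^k, Y)
    df, dg     -> (L, L) averaged Jacobians, i.e., the matrices F^k, C^k in (12)
    """
    n, L = X.shape[0], B0.shape[1]
    B_hat = B0.copy()
    R_hat = np.zeros((n, L))
    F = np.zeros((L, L))            # hat R^{-1} = 0, so the first memory term vanishes
    for k in range(num_iter):
        Theta = X @ B_hat - R_hat @ F.T          # memory-corrected residual step
        R_hat = g(k, Theta, Y)
        C = dg(k, Theta, Y)                      # C^k = (1/n) sum_i g_k'(Theta_i, Y_i)
        B_plus = X.T @ R_hat - B_hat @ C.T       # memory-corrected estimation step
        B_hat = f(k + 1, B_plus)
        F = df(k + 1, B_plus)                    # F^{k+1} = (1/n) sum_j f_{k+1}'(B_j)
    return B_hat
```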
State evolution. The memory terms $\hat R^{k-1}(F^k)^\top$ and $\hat B^k(C^k)^\top$ in (11) play a crucial role in debiasing the iterates $\Theta^k$ and $B^{k+1}$, ensuring that their joint empirical distributions are accurately captured by state evolution in the high-dimensional limit. Theorem 1 below shows that for each $k\ge 1$, the empirical distribution of the rows of $B^k$ converges to the distribution of $M^k_B\bar B + G^k_B \in\mathbb{R}^L$, where $G^k_B\sim N(0, T^k_B)$ is independent of $\bar B$, the random variable representing the limiting distribution of the rows of the signal matrix $B$. The deterministic matrices $M^k_B, T^k_B\in\mathbb{R}^{L\times L}$ are recursively defined below. The result implies that the empirical distribution of the rows of $\hat B^k$ converges to the distribution of $f_k(M^k_B\bar B + G^k_B)$. Thus $f_k$ can be viewed as a denoising function that can be tailored to take advantage of the prior on $\bar B$. Theorem 1 also shows that the empirical distribution of the rows of $\Theta^k$ converges to the distribution of $M^k_\Theta Z + G^k_\Theta\in\mathbb{R}^L$, where $Z\sim N(0, \frac{1}{\delta}\mathbb{E}[\bar B\bar B^\top])$ and $G^k_\Theta\sim N(0, T^k_\Theta)$ are independent.

We now describe the state evolution recursion defining the matrices $M^k_B, T^k_B, M^k_\Theta, T^k_\Theta\in\mathbb{R}^{L\times L}$. Recalling that the observation $Y$ is generated via the function $q$ according to (1), it is convenient to rewrite $g_k$ in (11) in terms of another function $h_k:\mathbb{R}^L\times\mathbb{R}^L\times\mathbb{R}^{L_\Psi}\to\mathbb{R}^L$:
$$h_k(z, u, v) := g_k(u,\ q(z, v)). \qquad (13)$$
Then, for $k\ge 0$, given $\Sigma^k\in\mathbb{R}^{2L\times 2L}$, take $(Z, Z^k)\sim N(0, \Sigma^k)$ to be independent of $\bar\Psi\sim P_{\bar\Psi}$ and compute:
$$M^{k+1}_B = \mathbb{E}[\partial_Z h_k(Z, Z^k, \bar\Psi)], \qquad (14)$$
$$T^{k+1}_B = \mathbb{E}[h_k(Z, Z^k, \bar\Psi)\, h_k(Z, Z^k, \bar\Psi)^\top], \qquad (15)$$
$$\Sigma^{k+1} = \begin{bmatrix} \Sigma^{k+1}_{(11)} & \Sigma^{k+1}_{(12)} \\ \Sigma^{k+1}_{(21)} & \Sigma^{k+1}_{(22)} \end{bmatrix}, \qquad (16)$$
where the four $L\times L$ matrices constituting $\Sigma^{k+1}\in\mathbb{R}^{2L\times 2L}$ are given by:
$$\Sigma^{k+1}_{(11)} = \frac{1}{\delta}\,\mathbb{E}[\bar B\bar B^\top], \qquad \Sigma^{k+1}_{(12)} = \big(\Sigma^{k+1}_{(21)}\big)^\top = \frac{1}{\delta}\,\mathbb{E}\big[\bar B\, f_{k+1}(M^{k+1}_B\bar B + G^{k+1}_B)^\top\big],$$
$$\Sigma^{k+1}_{(22)} = \frac{1}{\delta}\,\mathbb{E}\big[f_{k+1}(M^{k+1}_B\bar B + G^{k+1}_B)\, f_{k+1}(M^{k+1}_B\bar B + G^{k+1}_B)^\top\big]. \qquad (17)$$
Here we take $G^{k+1}_B\sim N(0, T^{k+1}_B)$ to be independent of $\bar B\sim P_{\bar B}$. Note that $\partial_Z h_k$ denotes the partial derivative (Jacobian) of $h_k$ with respect to its first argument $Z\in\mathbb{R}^L$, so it is an $L\times L$ matrix. The state evolution recursion (14)-(16) is initialized with $\Sigma^0\in\mathbb{R}^{2L\times 2L}$, defined below in (21).

Since $(Z, Z^k)\sim N(0,\Sigma^k)$, using standard properties of Gaussian random vectors, we have
$$(Z,\ Z^k,\ \bar\Psi) \overset{d}{=} (Z,\ M^k_\Theta Z + G^k_\Theta,\ \bar\Psi), \qquad (18)$$
where $G^k_\Theta\sim N(0, T^k_\Theta)$, with
$$M^k_\Theta = \Sigma^k_{(21)}\,\big(\Sigma^k_{(11)}\big)^{-1}, \qquad (19)$$
$$T^k_\Theta = \Sigma^k_{(22)} - \Sigma^k_{(21)}\big(\Sigma^k_{(11)}\big)^{-1}\Sigma^k_{(12)}. \qquad (20)$$
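The expectations in (14)-(17) generally have no closed form, but one step of the recursion can be approximated by Monte Carlo. Below is a rough sketch under simplifying assumptions: `h_k` and `f_k1` act row-wise on arrays, the samplers for $\bar B$ and $\bar\Psi$ are user-supplied, and the Jacobian in (14) is estimated by finite differences with an illustrative step size:

```python
import numpy as np

def se_step(Sigma_k, h_k, f_k1, sample_Bbar, sample_Psi, delta, rng,
            n_mc=100_000, eps=1e-4):
    """Monte Carlo sketch of one step of the state evolution (14)-(17)."""
    L = Sigma_k.shape[0] // 2
    ZZk = rng.multivariate_normal(np.zeros(2 * L), Sigma_k, size=n_mc)
    Z, Zk = ZZk[:, :L], ZZk[:, L:]
    Psi = sample_Psi(n_mc)
    H = h_k(Z, Zk, Psi)
    T_B = H.T @ H / n_mc                         # (15)
    M_B = np.zeros((L, L))                       # (14): finite-difference Jacobian in Z
    for j in range(L):
        dZ = np.zeros_like(Z)
        dZ[:, j] = eps
        M_B[:, j] = (h_k(Z + dZ, Zk, Psi) - H).mean(axis=0) / eps
    Bbar = sample_Bbar(n_mc)
    G_B = rng.multivariate_normal(np.zeros(L), T_B, size=n_mc)
    F = f_k1(Bbar @ M_B.T + G_B)                 # rows f_{k+1}(M_B Bbar_i + G_B,i)
    Sigma_next = np.block([[Bbar.T @ Bbar, Bbar.T @ F],
                           [F.T @ Bbar,    F.T @ F]]) / (delta * n_mc)   # (17)
    return M_B, T_B, Sigma_next
```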
Main result. We begin with two assumptions required for the main result. The first is on the AMP initializer $\hat B^0\in\mathbb{R}^{p\times L}$, and the second is on the functions $g_k, f_{k+1}$ used to define the AMP in (11).

(A1) There exist $\Sigma^0\in\mathbb{R}^{2L\times 2L}$ and $c_0\in\mathbb{R}$ such that, as $n, p\to\infty$ (with $n/p\to\delta$), we have
$$\frac{1}{n}\begin{bmatrix} B^\top B & B^\top \hat B^0 \\ (\hat B^0)^\top B & (\hat B^0)^\top \hat B^0 \end{bmatrix} \xrightarrow{c} \Sigma^0, \qquad (21)$$
$$\frac{1}{p}\sum_{j=1}^p\sum_{l=1}^L \big|\hat B^0_{jl}\big|^r \xrightarrow{c} c_0. \qquad (22)$$
Here $r\in[2,\infty)$ is the same as that used for the assumptions on the signal matrix at the end of Section 2.1. Furthermore, there exists a Lipschitz function $F_0:\mathbb{R}^L\to\mathbb{R}^L$ such that $\frac{1}{p}(\hat B^0)^\top \varphi(B) \xrightarrow{c} \mathbb{E}[F_0(\bar B)\varphi(\bar B)^\top]$ for all Lipschitz $\varphi:\mathbb{R}^L\to\mathbb{R}^L$, and $\Sigma^0_{(22)} - \mathbb{E}[F_0(\bar B)F_0(\bar B)^\top]$ is positive semidefinite.

(A2) For $k\ge 0$, the function $f_{k+1}$ is non-constant and Lipschitz on $\mathbb{R}^L$, and $h_k$ defined in (13) is Lipschitz on $\mathbb{R}^{2L+L_\Psi}$ with $P_{\bar\Psi}(\{v : (z,u)\mapsto h_k(z,u,v)\ \text{is non-constant}\}) > 0$. Furthermore, $f_{k+1}'$ is continuous Lebesgue almost everywhere and, writing $D_k\subset\mathbb{R}^{L+L_{\text{out}}}$ for the set of discontinuities of $g_k'$, we have $P[(Z^k, Y)\in D_k] = 0$.

Assumptions (A1) and (A2) are similar to those required for AMP initialization in (non-matrix) generalized linear models (Feng et al., 2022, Section 4). Moreover, (A1) is implied by the assumptions on the signal matrix if the initializer $\hat B^0$ is chosen to be a scaled version of the all-ones matrix.

The result is stated in terms of pseudo-Lipschitz test functions. Let $\mathrm{PL}_m(r, C)$ be the set of functions $\varphi:\mathbb{R}^m\to\mathbb{R}$ such that $|\varphi(x)-\varphi(y)| \le C\big(1 + \|x\|_2^{r-1} + \|y\|_2^{r-1}\big)\|x-y\|_2$ for all $x,y\in\mathbb{R}^m$. A function $\varphi\in\mathrm{PL}_m(r,C)$ is called pseudo-Lipschitz of order $r$.

Theorem 1. Consider the AMP in (11) for the matrix GLM model in (1). Suppose that the model assumptions in Section 2.1 as well as (A1) and (A2) are satisfied, and that $T^1_B$ is positive definite. Then for each $k\ge 0$, we have
$$\sup_{\varphi\in\mathrm{PL}_{2L}(r,1)} \bigg| \frac{1}{p}\sum_{j=1}^p \varphi(B^{k+1}_j, B_j) - \mathbb{E}\big[\varphi(M^{k+1}_B\bar B + G^{k+1}_B,\ \bar B)\big] \bigg| \xrightarrow{c} 0, \qquad (23)$$
$$\sup_{\varphi\in\mathrm{PL}_{2L+L_\Psi}(r,1)} \bigg| \frac{1}{n}\sum_{i=1}^n \varphi(\Theta^k_i, \Theta_i, \Psi_i) - \mathbb{E}\big[\varphi(M^k_\Theta Z + G^k_\Theta,\ Z,\ \bar\Psi)\big] \bigg| \xrightarrow{c} 0, \qquad (24)$$
as $n,p\to\infty$ with $n/p\to\delta$, where $\Theta_i = B^\top X_i$ for $1\le i\le n$. In the above, $G^{k+1}_B\sim N(0, T^{k+1}_B)$ is independent of $\bar B$, and $G^k_\Theta\sim N(0, T^k_\Theta)$ is independent of $(Z, \bar\Psi)$.

The proof of the theorem is given in Section 5.1. The result (23) is equivalent to the statement that the joint empirical distribution of the rows of $(B^{k+1}, B)$ converges completely in $r$-Wasserstein distance to the joint distribution of $(M^{k+1}_B\bar B + G^{k+1}_B,\ \bar B)$; see (Feng et al., 2022, Corollary 7.21). An analogous statement holds for (24).

Performance measures. Theorem 1 allows us to compute the limiting values of performance measures such as the mean-squared error (MSE) and the normalized correlation between each signal and its AMP estimate. For $k\ge 1$, writing $\hat\beta^{(\ell),k}$ for the $\ell$th column of the AMP iterate $\hat B^k$, we have $\hat B^k = (\hat\beta^{(1),k},\dots,\hat\beta^{(L),k})$. Note that $\hat\beta^{(\ell),k}$ is the estimate of the signal $\beta^{(\ell)}$ after $k$ iterations. Define the shorthand $\tilde B^k := M^k_B\bar B + G^k_B$. Then Theorem 1 implies that the normalized squared correlation between each signal and its AMP estimate after $k$ iterations converges as
$$\frac{\langle \hat\beta^{(\ell),k},\ \beta^{(\ell)}\rangle^2}{\|\hat\beta^{(\ell),k}\|_2^2\, \|\beta^{(\ell)}\|_2^2} \ \xrightarrow{c}\ \frac{\big(\mathbb{E}[f_{k,\ell}(\tilde B^k)\,\bar B_\ell]\big)^2}{\mathbb{E}[f_{k,\ell}(\tilde B^k)^2]\ \mathbb{E}[\bar B_\ell^2]}, \quad \ell\in[L]. \qquad (25)$$
Here $f_{k,\ell}$ is the $\ell$th component of the function $f_k:\mathbb{R}^L\to\mathbb{R}^L$, and $\bar B_\ell$ is the $\ell$th component of $\bar B\in\mathbb{R}^L$. Similarly, the MSE of the AMP estimate after $k$ iterations converges as
$$\frac{1}{p}\,\big\|\beta^{(\ell)} - \hat\beta^{(\ell),k}\big\|_2^2 \ \xrightarrow{c}\ \mathbb{E}\Big[\big(\bar B_\ell - f_{k,\ell}(\tilde B^k)\big)^2\Big], \quad \ell\in[L]. \qquad (26)$$
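For reference, the empirical quantities on the left-hand sides of (25) and (26) are straightforward to compute; a minimal sketch (function names are illustrative):

```python
import numpy as np

def normalized_sq_corr(beta_hat, beta):
    """Empirical normalized squared correlation, the LHS of (25)."""
    return np.dot(beta_hat, beta) ** 2 / (np.dot(beta_hat, beta_hat) * np.dot(beta, beta))

def mse(beta_hat, beta):
    """Empirical per-coordinate MSE, the LHS of (26)."""
    return np.mean((beta - beta_hat) ** 2)
```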
3.1 Choosing the Functions of AMP

Recalling that the empirical distributions of the rows of $\Theta^k$ and $B^{k+1}$ converge to the laws of $M^k_\Theta Z + G^k_\Theta$ and $M^{k+1}_B\bar B + G^{k+1}_B$, respectively, we define the random vectors
$$\tilde Z^k := Z + (M^k_\Theta)^{-1} G^k_\Theta, \qquad \tilde B^{k+1} := \bar B + (M^{k+1}_B)^{-1} G^{k+1}_B. \qquad (27)$$
(If the inverse does not exist, we premultiply by the pseudoinverse.) Since $G^{k+1}_B\sim N(0,T^{k+1}_B)$ and $G^k_\Theta\sim N(0,T^k_\Theta)$, the effective noise covariance matrices are:
$$\mathrm{Cov}(\tilde Z^k - Z) = (M^k_\Theta)^{-1}\, T^k_\Theta\, \big((M^k_\Theta)^{-1}\big)^\top =: N^k_\Theta,$$
$$\mathrm{Cov}(\tilde B^{k+1} - \bar B) = (M^{k+1}_B)^{-1}\, T^{k+1}_B\, \big((M^{k+1}_B)^{-1}\big)^\top =: N^{k+1}_B. \qquad (28)$$
From (19)-(20), we observe that $M^k_\Theta, T^k_\Theta$ are both determined by $\Sigma^k$, which in turn is determined by the choice of $f_k$ (from (17)). Similarly, from (14) and (15), $M^{k+1}_B$ and $T^{k+1}_B$ are determined by $g_k$. A natural objective is to choose $f_k$ and $g_k$ to minimize the traces of the effective noise covariance matrices $N^k_\Theta$ and $N^{k+1}_B$ in (28). We can interpret the quantities $\mathrm{Tr}(N^k_\Theta)$ and $\mathrm{Tr}(N^{k+1}_B)$ as the effective noise variances for estimating $Z$ and $\bar B$ from $\tilde Z^k$ and $\tilde B^{k+1}$, respectively. In the special case where there is only one signal, minimizing these effective noise variances is equivalent to maximizing the scalar signal-to-noise ratios $(M^k_\Theta)^2/T^k_\Theta$ and $(M^{k+1}_B)^2/T^{k+1}_B$, respectively, which is achieved by the Bayes-optimal AMP for generalized linear models (Rangan, 2011; Feng et al., 2022). Assuming that the signal prior $P_{\bar B}$ and the distribution of auxiliary variables $P_{\bar\Psi}$ are known, the following proposition gives optimal choices for $f_k, g_k$.

Proposition 2. Let $k\ge 1$. Then:
1) Given $M^k_B, T^k_B$, the quantity $\mathrm{Tr}(N^k_\Theta)$ is minimized when $f_k = f^*_k$, where
$$f^*_k(s) = \mathbb{E}\big[\bar B \,\big|\, M^k_B\bar B + G^k_B = s\big], \qquad (29)$$
with $G^k_B\sim N(0, T^k_B)$ and $\bar B\sim P_{\bar B}$ independent.
2) Given $M^k_\Theta, T^k_\Theta$, the quantity $\mathrm{Tr}(N^{k+1}_B)$ is minimized when $g_k = g^*_k$, where
$$g^*_k(u, y) = \mathrm{Cov}[Z\mid Z^k = u]^{-1}\,\big(\mathbb{E}[Z\mid Z^k = u,\ Y = y] - \mathbb{E}[Z\mid Z^k = u]\big). \qquad (30)$$
Here $(Z, Z^k)\sim N(0, \Sigma^k)$ and $Y = q(Z, \bar\Psi)$, with $\bar\Psi\sim P_{\bar\Psi}$ independent of $Z$.

The proof is given in Section 5.2.
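When the prior $P_{\bar B}$ is zero-mean Gaussian (as in the experiments of Section 4.1 with prior (32)), the conditional expectation in (29) is linear in $s$ by standard Gaussian conditioning, so $f^*_k$ has a closed form. A minimal sketch, assuming a known prior covariance `Sigma_prior` (the function name is illustrative):

```python
import numpy as np

def f_star_gaussian(S, M_B, T_B, Sigma_prior):
    """Bayes-optimal denoiser (29) for a zero-mean Gaussian prior N(0, Sigma_prior):
    E[B | M_B B + G = s] = Sigma_prior M_B^T (M_B Sigma_prior M_B^T + T_B)^{-1} s.
    S: (p, L) array whose rows are the effective observations s."""
    cov_s = M_B @ Sigma_prior @ M_B.T + T_B         # covariance of the effective observation
    gain = Sigma_prior @ M_B.T @ np.linalg.inv(cov_s)
    return S @ gain.T                               # apply the linear map row-wise
```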
4. Numerical Simulations

In this section, we present numerical results for mixed linear regression (Eq. (2)), max-affine regression (Eq. (4)), and mixture-of-experts (Eq. (9)). For MLR and MAR, we compare the performance of AMP with other popular estimators.

4.1 Mixed Linear Regression (Two Signals)

Consider the MLR model (2) with two signals, where
$$Y_i = \langle X_i,\beta^{(1)}\rangle\, c_i + \langle X_i,\beta^{(2)}\rangle\, (1 - c_i) + \epsilon_i, \quad i\in[n]. \qquad (31)$$
We take $c_i \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\alpha)$ for $\alpha\in(0,1)$, $\epsilon_i\overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$, and $X_i\overset{\text{i.i.d.}}{\sim} N(0, I_p/n)$, for $i\in[n]$. We set the signal dimension $p=500$ and vary the value of $n$ in our experiments. The AMP algorithm in (11) is implemented with $g_k = g^*_k$, the optimal choice given by (30). For the function $f_k$, we use the Bayes-optimal $f^*_k$ in (29) unless stated otherwise. In Appendix A, we provide the implementation details, and show how the functions $f_k, g_k$ and their derivatives can be approximated from the data. The performance in all the plots is measured via the normalized squared correlation between the AMP estimate and the signal (see (25)). Each point on the plots is obtained from 10 independent runs, where in each run AMP is executed for 10 iterations. We report the average and error bars at one standard deviation of the final iteration.

Gaussian prior. In Figures 1, 2, and 3, we set the Bernoulli parameter $\alpha = 0.7$ and choose the two signals to be jointly Gaussian, with their entries generated as
$$\big(\beta^{(1)}_j,\ \beta^{(2)}_j\big) \overset{\text{i.i.d.}}{\sim} N\Bigg( \begin{pmatrix}0\\0\end{pmatrix},\ \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix} \Bigg), \quad j\in[p]. \qquad (32)$$
The initializer $\hat B^0\in\mathbb{R}^{p\times 2}$ is chosen randomly according to the same distribution, independently of the signal. Figure 1 shows the performance of AMP for independent signals ($\rho = 0$): the normalized squared correlation is plotted as a function of the sampling ratio $\delta = n/p$, for different noise levels $\sigma$. The state evolution predictions closely match the performance of AMP for practical values of $n, p$, validating the result of Theorem 1.

[Figure 1: MLR, Gaussian prior with ρ = 0: normalized squared correlation vs. δ, comparing state evolution (SE) and AMP for σ ∈ {0, 0.2, 0.4}, with α = 0.7.]

[Figure 2: MLR, Gaussian prior with different values of signal covariance ρ ∈ {0, 1, -1}: normalized squared correlation vs. δ, with α = 0.7, σ = 0.]

As expected, the correlation improves with increasing $\delta$ and degrades with increasing $\sigma$. The performance for $\beta^{(1)}$ is better than for $\beta^{(2)}$, as 70% of the observations come from $\beta^{(1)}$.

Figure 2 plots the performance as a function of $\delta$ for signal correlation $\rho\in\{0, 1, -1\}$, with $\sigma = 0$ (noiseless). When $\rho = 1$, both signals are identical and the problem reduces to standard linear regression. When $\rho = -1$, we have $\beta^{(1)} = -\beta^{(2)} = \beta$, so there is still effectively only one signal vector. However, the $\rho = -1$ case is harder than $\rho = 1$, since each measurement is unlabelled and could come from either $\beta$ or $-\beta$ (with probabilities 0.7 and 0.3, respectively). We note that the case of $\rho = -1$ and $\alpha = 0.5$ is the phase retrieval problem, for which AMP algorithms have been studied in a number of works, e.g., Schniter and Rangan (2014); Ma et al. (2019). AMP needs to be initialized carefully in this setting, since a random initialization independent of the signal leads to state evolution predicting zero correlation between the signal and the AMP iterates (Ma et al., 2018; Mondelli and Venkataramanan, 2021).

In practical applications, we may not know the exact proportion of observations that come from the first signal. Figure 3 shows the performance when AMP is run assuming a proportion parameter $\hat\alpha = 0.6$, which is different from the true value $\alpha = 0.7$. The functions $f^*_k, g^*_k$ defining the AMP depend on $\alpha$, hence replacing $\alpha$ with $\hat\alpha$ in these functions is effectively running AMP with a different (sub-optimal) choice of denoising functions.

[Figure 3: MLR, Gaussian prior with ρ = 0 and different values of the estimated proportion α̂ ∈ {0.7, 0.6}: normalized squared correlation vs. δ, with true α = 0.7, σ = 0.]

Sparse prior. We next consider a sparse prior for each of the two signals, with their entries generated as
$$\beta^{(1)}_j,\ \beta^{(2)}_j \overset{\text{i.i.d.}}{\sim} \frac{\varepsilon}{2}\,\delta_{+1} + (1-\varepsilon)\,\delta_0 + \frac{\varepsilon}{2}\,\delta_{-1}, \quad j\in[p]. \qquad (33)$$
Here $\delta_{(\cdot)}$ denotes the Dirac delta measure, and we note that the two signals are independent. The initializer is generated randomly from the same prior, independently of the signals. We investigate the performance of AMP with two choices of denoising function: the Bayes-optimal denoising function (defined in (29)) and the soft-thresholding denoising function (defined in (34)-(36) below). For the case of standard linear regression, the soft-thresholding function is a popular choice of denoiser for AMP when the signal is known to be sparse, but the exact sparsity level and the distribution of the non-zero coefficients are not known (Montanari, 2012). AMP with soft-thresholding denoising is also closely related to the LASSO, which is widely used for sparse linear regression (Bayati and Montanari, 2011b).

We evaluate the two denoisers in Figures 4 and 5, respectively, by plotting heatmaps showing the performance for various values of the pair $(\delta, \varepsilon)$. For each point in the heatmap, we take the minimum of the mean normalized squared correlations of the two estimates with their respective signals. This is obtained by executing 10 runs of AMP using the desired $f_k$ function with 10 iterations per run (i.e., $k = 1,\dots,10$), and taking the average of the 10 correlations (from the 10 runs) at the final iteration.

[Figure 4: MLR: heatmap of minimum normalized correlation over (δ, ε) for the Bayes-optimal $f_k$, for (a) α = 0.6 and (b) α = 0.7, with p = 500, σ = 0.]
For the Bayes-optimal denoiser $f_k$, the heatmaps are shown in Figure 4. We have the following observations:
- Performance is better for $\alpha = 0.6$ than for $\alpha = 0.7$. This is because we take the minimum of the two correlations, and when $\alpha$ is larger ($\alpha = 0.7$), there is less data available for the group with fewer observations (for a given $(\varepsilon, \delta)$).
- For a given $\delta$, performance is generally better for $\varepsilon$ close to 0 or 1. At $\varepsilon = 0.1$, most of the signal entries are 0, and at $\varepsilon = 1$, all the values are either $1$ or $-1$. Around $\varepsilon = 0.5$, we have a significant proportion of all three values, making estimation harder.

The soft-thresholding function with threshold $\theta$, denoted by $\mathrm{ST}(\cdot\,;\theta):\mathbb{R}\to\mathbb{R}$, is defined as
$$\mathrm{ST}(x;\theta) = \begin{cases} x - \theta & \text{if } x > \theta, \\ 0 & \text{if } -\theta \le x \le \theta, \\ x + \theta & \text{if } x < -\theta. \end{cases} \qquad (34)$$
To set the threshold for the soft-thresholding denoiser $f_k$, we recall from Theorem 1 that the empirical distribution of the rows $B^k_j$ converges to the distribution of the random vector $M^k_B\bar B + G^k_B$. Therefore, the empirical distribution of $\{(M^k_B)^{-1}B^k_j\}_{j\in[p]}$ converges to the distribution of $\bar B + (M^k_B)^{-1}G^k_B$, where $(M^k_B)^{-1}G^k_B \sim N(0, N^k_B)$ with
$$N^k_B = \mathrm{Cov}\big((M^k_B)^{-1}G^k_B\big) = (M^k_B)^{-1}\, T^k_B\, \big((M^k_B)^{-1}\big)^\top. \qquad (35)$$
Letting $\zeta > 0$ be a tuning parameter, we set the soft-thresholding denoiser $f_k$ to be
$$f_k(B^k_j) = \begin{pmatrix} \mathrm{ST}\big(\{(M^k_B)^{-1}B^k_j\}_1;\ \zeta\sqrt{\{N^k_B\}_{11}}\big) \\ \mathrm{ST}\big(\{(M^k_B)^{-1}B^k_j\}_2;\ \zeta\sqrt{\{N^k_B\}_{22}}\big) \end{pmatrix}. \qquad (36)$$
This implies that the Jacobian $f_k'(B^k_j)$ has entries
$$\frac{\partial \{f_k\}_{i_1}}{\partial \{B^k_j\}_{i_2}} = \begin{cases} \{(M^k_B)^{-1}\}_{i_1 i_2} & \text{if } \big|\{(M^k_B)^{-1}B^k_j\}_{i_1}\big| > \zeta\sqrt{\{N^k_B\}_{i_1 i_1}}, \\ 0 & \text{otherwise}, \end{cases} \qquad (37)$$
for $i_1, i_2\in\{1,2\}$. Here the notation $\{\cdot\}_{i_1}$ denotes the $i_1$th entry of the vector, and $\{\cdot\}_{i_1 i_2}$ the $(i_1,i_2)$th entry of the matrix, inside the braces.

Figure 5 shows the heatmaps for soft-thresholding, with the tuning parameter $\zeta$ set to 1.1402. This value of $\zeta$ attains the minimax MSE of the soft-thresholding denoiser over the class of sparse signal priors that assign probability mass at least 0.9 to the value 0 (Montanari, 2012). We observe that the performance for $\alpha = 0.5$ is stronger, as the samples are more evenly spread between the two signals. As expected, the correlation improves as the signal becomes sparser (i.e., as $\varepsilon$ decreases), and as $\delta$ increases.

[Figure 5: MLR: heatmap of minimum normalized correlation over (δ, ε) for soft-thresholding $f_k$, for (a) α = 0.5 and (b) α = 0.6, with p = 1000, σ = 0, and soft-thresholding tuning parameter ζ = 1.1402.]

Figure 6 compares the Bayes-optimal function with the soft-thresholding function for fixed values of the sparsity level $\varepsilon = 0.1$ and mixture parameter $\alpha = 0.6$. The significantly better performance of the Bayes-optimal denoiser compared to soft-thresholding is because the former optimally utilizes knowledge of the signal prior, whereas soft-thresholding only uses an estimate of the proportion of zeros in the signals.

[Figure 6: MLR: comparison of minimum normalized correlation for Bayes-optimal $f_k$ vs. soft-thresholding (ST) $f_k$, against the respective SE predictions, with p = 1000, α = 0.6, σ = 0, ζ = 1.1402, and ε = 0.1.]
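A minimal sketch of the soft-thresholding denoiser (34)-(36) follows; the vectorized layout (one threshold per coordinate from the diagonal of (35)) is an implementation choice:

```python
import numpy as np

def soft_threshold(x, theta):
    """Soft-thresholding function ST(x; theta) in (34), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def f_soft_threshold(Bk, M_B, T_B, zeta=1.1402):
    """Soft-thresholding denoiser (36): whiten each row by (M_B)^{-1}, then
    threshold each coordinate at zeta times its effective noise std from (35).
    Bk: (p, L) AMP iterate; returns the (p, L) denoised estimate."""
    M_inv = np.linalg.inv(M_B)
    N_B = M_inv @ T_B @ M_inv.T                 # effective noise covariance (35)
    S = Bk @ M_inv.T                            # rows (M_B)^{-1} B^k_j
    thresholds = zeta * np.sqrt(np.diag(N_B))   # one threshold per coordinate
    return soft_threshold(S, thresholds)
```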
Comparison with other estimators. Figure 7 compares the performance of AMP with other widely studied estimators for mixed linear regression, for the Gaussian signal prior in (32) with independent signals ($\rho = 0$). The other estimators are: the spectral estimator proposed in (Yi et al., 2014, Algorithm 2); alternating minimization (AM) (Yi et al., 2014, Algorithm 1); and expectation maximization (EM) (Faria and Soromenho, 2010, Section 2.1). Figure 8 compares the performance of AMP with these estimators for the sparse signal prior in (33) with $\varepsilon = 0.1$. For this prior, we modified the least-squares step of the AM algorithm in (Yi et al., 2014, Algorithm 2) to use the Lasso instead of standard least squares; this gives better performance as it takes advantage of the signal sparsity. We also tried the lasso-type EM algorithm of Städler et al. (2010), but it did not give a noticeable improvement in performance. In both setups, AMP significantly outperforms the other estimators, as it is tailored to take advantage of the signal prior via the choice of the denoising function $f_k$.

[Figure 7: MLR, comparison of different estimators (spectral, EM, AM, AMP) for Gaussian prior with ρ = 0: normalized squared correlation vs. δ, with α = 0.6, σ = 0.]

[Figure 8: MLR, comparison of different estimators (spectral, EM, AM with lasso, AMP) for sparse prior: normalized squared correlation vs. δ, with α = 0.6, σ = 0.1.]

4.2 Mixed Linear Regression (Three Signals)

To illustrate AMP's ability to tackle MLR with more than two signals, we now consider the model (2) with three signals:
$$Y_i = \langle X_i,\beta^{(1)}\rangle\, c_{i1} + \langle X_i,\beta^{(2)}\rangle\, c_{i2} + \langle X_i,\beta^{(3)}\rangle\, c_{i3} + \epsilon_i, \quad i\in[n]. \qquad (39)$$
We take $[c_{i1}, c_{i2}, c_{i3}]$ to be a one-hot vector, and denote the position of the one by $c_i \overset{\text{i.i.d.}}{\sim} \mathrm{Categorical}(\{\alpha_1,\alpha_2,\alpha_3\})$. As before, $\epsilon_i\overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$, and $X_i\overset{\text{i.i.d.}}{\sim} N(0, I_p/n)$, for $i\in[n]$. We set the signal dimension $p = 500$ and vary the value of $n$ in our experiments. The AMP algorithm in (11) is implemented with $g_k = g^*_k$ and $f_k = f^*_k$ (i.e., the optimal choices given in Proposition 2).

We use independent Gaussian priors for the three signals. Specifically, we generate:
$$\big(\beta^{(1)}_j, \beta^{(2)}_j, \beta^{(3)}_j\big) \overset{\text{i.i.d.}}{\sim} N\big(\mathbb{E}[\bar B],\, I_3\big),\ j\in[p]; \qquad c_i \overset{\text{i.i.d.}}{\sim} \mathrm{Categorical}(\{\alpha_1,\alpha_2,\alpha_3\}),\ i\in[n]. \qquad (40)$$
The initializer $\hat B^0\in\mathbb{R}^{p\times 3}$ is chosen randomly according to the same distribution, independent of the signal. We consider the following three scenarios, where $\sigma = 0$ (noiseless):

Signals with the same mean and the same proportions. Figure 9 shows the performance with $\mathbb{E}[\bar B] = [0, 0, 0]^\top$ and $(\alpha_1,\alpha_2,\alpha_3) = (1/3, 1/3, 1/3)$. We observe that the performance does not improve much with increasing $\delta$, as the algorithm finds it challenging to distinguish the signals when they all have the same prior and correspond to the same proportion of observations.

[Figure 9: MLR with three signals: signals with same mean and same proportions.]

Signals with the same mean and different proportions. Figure 10 shows the performance with $\mathbb{E}[\bar B] = [0,0,0]^\top$ and $(\alpha_1,\alpha_2,\alpha_3) = (0.5, 0.3, 0.2)$. The performance here is significantly better than in the previous case, where the signals have the same mean and proportions. As expected, the correlation for $\beta^{(1)}$ is significantly better than that for $\beta^{(2)}$ and $\beta^{(3)}$, since $\beta^{(1)}$ has the highest proportion of observations.

[Figure 10: MLR with three signals: signals with same mean and different proportions.]
Signals with different means and the same proportions. Figure 11 shows the performance with $\mathbb{E}[\bar B] = [0, 0.5, 1]^\top$ and $(\alpha_1,\alpha_2,\alpha_3) = (1/3, 1/3, 1/3)$. This is the case with the best estimation performance: the distinct means help distinguish the signals from one another, and the equal proportions ensure that all three have a sufficient number of observations for large enough $\delta$.

[Figure 11: MLR with three signals: signals with different means and same proportions.]

Finally, Figure 12 compares the performance of AMP with other widely studied estimators for MLR, for the Gaussian signal prior in (40) with $\mathbb{E}[\bar B] = [0, 0.5, 1]^\top$ and $(\alpha_1,\alpha_2,\alpha_3) = (1/3, 1/3, 1/3)$ (this is the case where the signals have different prior distributions but appear in the same proportion of observations). The other estimators are: the spectral estimator proposed in (Yi et al., 2014, Algorithm 2); alternating minimization (AM) (Yi et al., 2014, Algorithm 1); and expectation maximization (EM) (Faria and Soromenho, 2010, Section 2.1). We modified the grid-search step of the spectral estimator in (Yi et al., 2014, Algorithm 2) to sample evenly across a sphere instead of a circle, to account for the fact that we now have three signals instead of two. (In the two-signal case, grid search was used to iterate over all possible combinations of the top two eigenvectors of $\frac{1}{n}\sum_{i=1}^n Y_i X_i X_i^\top$ to get the best combination for each signal.) Since this step cannot be done exactly as in the 2D case, we used the Fibonacci sphere algorithm (Álvaro González, 2010) to achieve it approximately and efficiently in our 3D case. As in the case of two-signal MLR, AMP significantly outperforms the other estimators, as it is tailored to take advantage of the signal prior via the choice of the denoising function $f_k$.

[Figure 12: MLR with three signals: comparison of different estimators (spectral, EM, AM, AMP). The signals have different means and occur in the same proportions.]

4.3 Max-Affine Regression

We consider the MAR model (4) with two signals, given by
$$Y_i = \max\big\{\langle X_i,\beta^{(1)}\rangle + b_1,\ \langle X_i,\beta^{(2)}\rangle + b_2\big\} + \epsilon_i, \quad i\in[n], \qquad (41)$$
where $\epsilon_i\overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$ and $X_i\overset{\text{i.i.d.}}{\sim} N(0, I_p/n)$ for $i\in[n]$. Recall from (6) that MAR can be written as an instance of the matrix GLM using the augmented features $X^{(\text{ma})}_i = (X_i^\top, 1)^\top\in\mathbb{R}^{p+1}$, $i\in[n]$. Since the augmented features are not i.i.d. Gaussian (due to the last component being 1), we use the original formulation of MAR in (4) and consider the intercepts $b_1$ and $b_2$ to be unknown parameters of the output function $q(\cdot,\cdot)$.
Our solution is to estimate the unknown intercepts using an expectation-maximization (EM) algorithm. The EM algorithm iteratively produces intercept estimates, denoted by $b^m \equiv (b^m_1, b^m_2)$, for $m\ge 1$. However, the expectation step of the EM algorithm requires an estimate of $\Theta = XB$, which we approximate via AMP. This leads to a combined expectation-maximization approximate message passing (EM-AMP) algorithm, described in Algorithm 1 below. The idea of combining AMP with the EM algorithm was introduced by Vila and Schniter (2013), for sparse linear regression with unknown parameters in the signal prior.

Algorithm 1: Expectation-maximization approximate message passing (EM-AMP)
1: Initialize the intercepts $b^0 := (b^0_1, b^0_2)$, and $\hat B^{k_{\max},0}$.
2: for iteration $m := 0, 1, \dots, m_{\max}-1$ do
3:   Compute $\mathbb{E}[Z \mid Y; b^m]$, and run AMP with the intercept estimates $b^m$ as part of the model for $k_{\max}$ iterations to produce $\hat\Theta^m = X\hat B^{k_{\max},m}$. Let $\hat\Theta^{(1)}, \hat\Theta^{(2)}$ be the two columns of $\hat\Theta^m$.
4:   Compute $b^{m+1}_1 := \frac{1}{\big|\{i:\ \hat\Theta^{(1)}_i + b^m_1 > \hat\Theta^{(2)}_i + b^m_2\}\big|} \sum_{i:\ \hat\Theta^{(1)}_i + b^m_1 > \hat\Theta^{(2)}_i + b^m_2} \big(Y_i - \mathbb{E}[Z_1 \mid Y; b^m]\big)$.
5:   Compute $b^{m+1}_2 := \frac{1}{\big|\{i:\ \hat\Theta^{(1)}_i + b^m_1 \le \hat\Theta^{(2)}_i + b^m_2\}\big|} \sum_{i:\ \hat\Theta^{(1)}_i + b^m_1 \le \hat\Theta^{(2)}_i + b^m_2} \big(Y_i - \mathbb{E}[Z_2 \mid Y; b^m]\big)$.
6: Output $b^{m_{\max}}$ and $\hat B^{k_{\max},m_{\max}}$.

The AMP stage in step 3 of Algorithm 1 is implemented with $g_k = g^*_k$ and $f_k = f^*_k$ (i.e., the optimal choices), computed using the current intercept estimates. The details of computing $g^*_k$ and the conditional expectation $\mathbb{E}[Z\mid Y; b^m]$ in this step are given in Appendix B. The derivation of the EM updates in steps 4 and 5 of Algorithm 1 is given in Section 5.3.
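A skeleton of Algorithm 1 is sketched below. The two callbacks stand in for the AMP stage and for the conditional expectation $\mathbb{E}[Z\mid Y; b^m]$ computed as in Appendix B; their names and signatures are illustrative assumptions:

```python
import numpy as np

def em_amp(X, Y, B0, run_amp, cond_mean_Z, m_max=5, k_max=5):
    """Skeleton of EM-AMP (Algorithm 1) for two-signal max-affine regression.

    run_amp(X, Y, B_init, b, k_max) -> B_hat after k_max AMP iterations, with
                                       the intercepts b built into the model
    cond_mean_Z(Y, Theta_hat, b)    -> (n, 2) estimate of E[Z | Y; b]
    """
    b = np.zeros(2)                       # EM initialization b^0 = (0, 0)
    B_hat = B0.copy()
    for m in range(m_max):
        B_hat = run_amp(X, Y, B_hat, b, k_max)        # step 3
        Theta_hat = X @ B_hat                          # columns Theta^(1), Theta^(2)
        EZ = cond_mean_Z(Y, Theta_hat, b)
        # steps 4-5: average the residuals over the set where each arm attains the max
        win1 = Theta_hat[:, 0] + b[0] > Theta_hat[:, 1] + b[1]
        b = np.array([np.mean(Y[win1] - EZ[win1, 0]),
                      np.mean(Y[~win1] - EZ[~win1, 1])])
    return b, B_hat
```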
We set the signal dimension $p = 500$ and vary the value of $n$ in our experiments. We consider different choices for the intercepts $b := (b_1, b_2)$ and use a Gaussian prior for $B = (\beta^{(1)}, \beta^{(2)})$, where we generate
$$\big(\beta^{(1)}_j,\ \beta^{(2)}_j\big) \overset{\text{i.i.d.}}{\sim} N\big(\mathbb{E}[\bar B],\ I_2\big), \quad j\in[p]. \qquad (42)$$
The initializer $\hat B^0\in\mathbb{R}^{p\times 2}$ is chosen according to the same distribution, independently of the signal. The EM initialization $b^0$ is taken to be $(0, 0)$.

Figures 13-15 show the performance of EM-AMP for max-affine regression with different choices of prior, intercepts, and noise level. The performance in all the plots is measured via the normalized squared correlation between the full signals $\beta^{(1)}_{\text{ma}} = ((\beta^{(1)})^\top, b_1)^\top$ and $\beta^{(2)}_{\text{ma}} = ((\beta^{(2)})^\top, b_2)^\top$ and their respective estimates $\hat\beta^{(1)}_{\text{ma}} = ((\hat\beta^{(1)})^\top, \hat b_1)^\top$ and $\hat\beta^{(2)}_{\text{ma}} = ((\hat\beta^{(2)})^\top, \hat b_2)^\top$, i.e.,
$$\frac{\langle \beta^{(l)}_{\text{ma}},\ \hat\beta^{(l)}_{\text{ma}}\rangle^2}{\|\hat\beta^{(l)}_{\text{ma}}\|_2^2\ \|\beta^{(l)}_{\text{ma}}\|_2^2}, \quad l\in\{1,2\}. \qquad (43)$$
Each point on the plots is obtained from 5 independent runs, where in each run we execute EM-AMP with $m_{\max} = 5$ and $k_{\max} = 5$. We report the average and error bars at one standard deviation of the final iteration. EM-AMP is compared with: (i) the alternating minimization (AM) algorithm (Ghosh et al., 2022), the only known algorithm for MAR with theoretical guarantees; and (ii) Oracle AMP (OR-AMP), where we assume that the true intercepts $b$ are known and are part of the matrix GLM model. Though the intercepts are not known in practice, OR-AMP provides an upper bound on the best correlation achievable by AMP, since it uses the optimal denoising functions and the correct intercepts. Hence, it is reasonable to expect OR-AMP to provide the best performance. We study the performance of our algorithms in the following three scenarios:

Same intercept, signals with different means. Figure 13 shows the results for the setting $b = (1, 1)$, $\mathbb{E}[\bar B] = [0, 1]^\top$, $\sigma = 0.1$. When the intercepts are the same, the proportion of observations from each signal is roughly the same for large enough $\delta$. This is because $Y_i = \max_{l\in\{1,2\}}\{\langle X_i,\beta^{(l)}\rangle + b_l\}$, where $\langle X_i,\beta^{(l)}\rangle$ is a zero-mean Gaussian regardless of the mean of $\beta^{(l)}$. (The variance of $\langle X_i,\beta^{(l)}\rangle$ does depend on the mean of $\beta^{(l)}$.) In this setting, AM performs poorly for smaller $\delta$ values, but matches or slightly exceeds the performance of EM-AMP for large $\delta$.

[Figure 13: MAR: same intercepts, signals with different means, noise level σ = 0.1. Curves: AM, OR-AMP, EM-AMP.]

Different intercepts, signals with different means. Figure 14 shows the results for the setting $b = (1, 0)$, $\mathbb{E}[\bar B] = [0, 1]^\top$, $\sigma = 0.1$. As mentioned above, the signal mean does not affect the proportion of observations from each signal; hence, the signal with the larger intercept will have more observations. EM-AMP outperforms AM for the signal with fewer observations for all values of $\delta$, while for the other signal, AM is slightly better for larger values of $\delta$.

[Figure 14: MAR: different intercepts, signals with different means, noise level σ = 0.1. Curves: AM, OR-AMP, EM-AMP.]

Same intercept, signals with different means, higher noise level. Figure 15 shows the results for $b = (1, 1)$, $\mathbb{E}[\bar B] = [0, 1]^\top$, $\sigma = 0.4$. The plots show that EM-AMP significantly outperforms AM for all values of $\delta$. AM is quite sensitive to the presence of noise, unlike EM-AMP, which is more robust and nearly matches the performance of OR-AMP.

[Figure 15: MAR: same intercept, signals with different means, noise level σ = 0.4. Curves: AM, OR-AMP, EM-AMP.]

Hard case. When the entries of both $\beta^{(1)}$ and $\beta^{(2)}$ are generated from the same distribution, and the intercepts $b_1$ and $b_2$ are the same, the estimation problem becomes very challenging. In this case, AM, EM-AMP, and AMP all struggle to give an accurate estimate.

4.4 Mixture-of-Experts

We consider the MOE model (9) with two regressors, two gating parameters, and the identity activation function $q(x) = x$. The model is given by
$$Y_i = \langle X_i,\beta^{(1)}\rangle\ \mathbb{1}\bigg\{\psi_i \le \frac{\exp(\langle X_i,w^{(1)}\rangle)}{\exp(\langle X_i,w^{(1)}\rangle) + \exp(\langle X_i,w^{(2)}\rangle)}\bigg\} + \langle X_i,\beta^{(2)}\rangle\ \mathbb{1}\bigg\{\psi_i > \frac{\exp(\langle X_i,w^{(1)}\rangle)}{\exp(\langle X_i,w^{(1)}\rangle) + \exp(\langle X_i,w^{(2)}\rangle)}\bigg\} + \epsilon_i, \qquad (44)$$
where $\psi_i\overset{\text{i.i.d.}}{\sim}\mathrm{Uniform}[0,1]$, $\epsilon_i\overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$, and $X_i\overset{\text{i.i.d.}}{\sim} N(0, I_p/n)$ for $i\in[n]$. The signal dimension is set to $p = 500$ and the value of $n$ is varied in our experiments. We use a Gaussian prior for $B = (\beta^{(1)}, \beta^{(2)}, w^{(1)}, w^{(2)})$, where we generate
$$\big(\beta^{(1)}_j, \beta^{(2)}_j, w^{(1)}_j, w^{(2)}_j\big) \overset{\text{i.i.d.}}{\sim} N\big([1,2,3,4]^\top,\ I_4\big), \quad j\in[p]. \qquad (45)$$
We run the AMP algorithm in (11) with $g_k = g^*_k$ and $f_k = f^*_k$ (i.e., the optimal choices). The initializer $\hat B^0\in\mathbb{R}^{p\times 4}$ is chosen according to the same distribution, independently of the signal. The details of the implementation are given in Appendix C.

Figure 16 shows the performance of AMP for MOE. The performance in the plots is measured via the normalized squared correlation given in (25). Each point on the plots is obtained from 5 independent runs, where in each run we execute AMP with $k = 5$. We report the average and error bars at one standard deviation of the final iteration. Figure 16 indicates a good match between the empirical performance of AMP and the theoretical state evolution predictions. The improvement with increasing $\delta$ is more significant for the regressors than for the gating parameters, since the latter are easier to estimate because of their larger means.

[Figure 16: Mixture-of-experts, noise level σ = 0.1: normalized squared correlation vs. δ for the two regressors and the two gating parameters.]

5. Proofs and Derivations

In this section, we provide detailed proofs and derivations of our results.

5.1 Proof of Theorem 1

To prove the theorem, we use a change of variables to rewrite (11) as a new matrix-valued AMP iteration. The new iteration is a special case of an abstract AMP iteration for which a state evolution result has been established by Javanmard and Montanari (2013). This state evolution result is then translated to obtain the results in (23)-(24).
Given the iteration (11), for $k\ge 0$ define
$$\check B^{k+1} := B^{k+1} - B\,(M^{k+1}_B)^\top, \qquad \check\Theta^k := (\Theta,\ \Theta^k), \qquad (46)$$
where we recall that $\Theta = XB$. For $k\ge 0$, we also define the function $\check f_k:\mathbb{R}^{2L}\to\mathbb{R}^{2L}$:
$$\check f_k(\check B^k, B) = \big(B,\ f_k(\check B^k + B\,(M^k_B)^\top)\big). \qquad (47)$$
Then, we claim that the original AMP iteration (11) is equivalent to the following one:
$$\check\Theta^k = X\,\check f_k(\check B^k, B) - h_{k-1}(\check\Theta^{k-1}, \Psi)\,(\check F^k)^\top,$$
$$\check B^{k+1} = X^\top h_k(\check\Theta^k, \Psi) - \check f_k(\check B^k, B)\,(\check C^k)^\top, \qquad (48)$$
where $h_k$ is defined in (13), and the matrices $\check C^k\in\mathbb{R}^{L\times 2L}$, $\check F^{k+1}\in\mathbb{R}^{2L\times L}$ are defined as:
$$\check C^k = \Big[\ \mathbb{E}[\partial_Z h_k(Z, Z^k, \bar\Psi)],\ \ \frac{1}{n}\sum_{i=1}^n \partial_{\Theta^k_i} h_k(\Theta_i, \Theta^k_i, \Psi_i)\ \Big], \qquad \check F^{k+1} = \begin{bmatrix} 0_{L\times L} \\ \frac{1}{n}\sum_{j=1}^p f_{k+1}'\big(\check B^{k+1}_j + B_j (M^{k+1}_B)^\top\big) \end{bmatrix}. \qquad (49)$$
The iteration (48) is initialized with $\check\Theta^0 = (\Theta,\ X\hat B^0)$, where $\hat B^0$ is the initializer of the original AMP. The equivalence between the iteration in (48) and the original AMP in (11) can be seen by substituting the definitions (46) and (47) into (48), and recalling from (14) that $M^{k+1}_B = \mathbb{E}[\partial_Z h_k(Z, Z^k, \bar\Psi)]$.

A key feature of the new iteration in (48) is that, in addition to the previous iterate, the inputs to the functions $\check f_k$ and $h_k$ are auxiliary variables ($B$ and $\Psi$, respectively) that are independent of $X$. This is in contrast to the AMP in (11), where the input $Y$ to the function $g_k$ is not independent of $X$. The recursion in (48) is a special case of an abstract AMP recursion with matrix-valued iterates for which a state evolution result has been established by Javanmard and Montanari (2013). We will use a version of the result described in (Feng et al., 2022, Sec. 6.7); the state evolution for the abstract AMP can also be obtained using the general AMP framework in Gerbelot and Berthier (2021). The standard form of the abstract AMP recursion uses empirical estimates (instead of expected values) for the first $L$ columns of $\check C^k$ in (49). However, the state evolution result still remains valid for the recursion (48) (see Remark 4.3 of Feng et al. (2022)). This result, stated in Proposition 3 below, guarantees that the empirical distributions of the rows of $\check\Theta^k$ and $\check B^{k+1}$ converge to the Gaussian distributions $N(0, \check\Sigma^k)$ and $N(0, \check T^{k+1})$, respectively, where the deterministic covariance matrices $\check\Sigma^k\in\mathbb{R}^{2L\times 2L}$, $\check T^{k+1}\in\mathbb{R}^{L\times L}$ are defined by the following state evolution recursion. Let $\check\Sigma^0 = \Sigma^0$ (defined in Assumption (A1)), and for $k\ge 0$:
$$\check T^{k+1} = \mathbb{E}\big[h_k(G^k_\sigma, \bar\Psi)\, h_k(G^k_\sigma, \bar\Psi)^\top\big], \qquad (50)$$
$$\check\Sigma^{k+1} = \delta^{-1}\,\mathbb{E}\big[\check f_{k+1}(G^{k+1}_\tau, \bar B)\, \check f_{k+1}(G^{k+1}_\tau, \bar B)^\top\big] = \begin{bmatrix} \delta^{-1}\mathbb{E}[\bar B\bar B^\top] & \check\Sigma^{k+1}_{(12)} \\ \check\Sigma^{k+1}_{(21)} & \check\Sigma^{k+1}_{(22)} \end{bmatrix}, \qquad (51)$$
$$\check\Sigma^{k+1}_{(12)} = \big(\check\Sigma^{k+1}_{(21)}\big)^\top = \delta^{-1}\,\mathbb{E}\big[\bar B\, f_{k+1}(G^{k+1}_\tau + M^{k+1}_B\bar B)^\top\big], \qquad (52)$$
$$\check\Sigma^{k+1}_{(22)} = \delta^{-1}\,\mathbb{E}\big[f_{k+1}(G^{k+1}_\tau + M^{k+1}_B\bar B)\, f_{k+1}(G^{k+1}_\tau + M^{k+1}_B\bar B)^\top\big]. \qquad (53)$$
Here we take $G^k_\sigma\sim N(0, \check\Sigma^k)$ independent of $\bar\Psi\sim P_{\bar\Psi}$, and $G^{k+1}_\tau\sim N(0, \check T^{k+1})$ independent of $\bar B\sim P_{\bar B}$. Comparing the recursive definitions of $(T^{k+1}_B, \Sigma^{k+1})$ in (15)-(17) and of $(\check T^{k+1}, \check\Sigma^{k+1})$ in (50)-(51), and noting that they are both initialized with $\Sigma^0$, we have that $\check T^{k+1} = T^{k+1}_B$ and $\check\Sigma^{k+1} = \Sigma^{k+1}$ for $k\ge 0$.

The following proposition follows from the state evolution result (Feng et al., 2022, Sec. 6.7) for an abstract AMP recursion with matrix-valued iterates.

Proposition 3. Assume the setting of Theorem 1. For the abstract AMP in (48), for $k\ge 0$ we have:
$$\sup_{\eta\in\mathrm{PL}_{2L}(r,1)} \bigg|\frac{1}{p}\sum_{j=1}^p \eta(\check B^{k+1}_j, B_j) - \mathbb{E}[\eta(G^{k+1}_\tau, \bar B)]\bigg| \xrightarrow{c} 0, \qquad (54)$$
$$\sup_{\eta\in\mathrm{PL}_{2L+L_\Psi}(r,1)} \bigg|\frac{1}{n}\sum_{i=1}^n \eta(\check\Theta^k_i, \Psi_i) - \mathbb{E}[\eta(G^k_\sigma, \bar\Psi)]\bigg| \xrightarrow{c} 0, \qquad (55)$$
as $n,p\to\infty$ with $n/p\to\delta$.
To obtain the result (23), we recall the definition of $\check{B}^{k+1}$ from (46), and in (54) we take $\eta(\check{B}^{k+1}, \bar{B}) = c_{k,r} \, \phi(\check{B}^{k+1} + \bar{B} (M^{k+1}_B)^\top, \bar{B})$ for a suitably small constant $c_{k,r} > 0$, and recall that $G^{k+1}_\tau \sim N(0, T^{k+1}_B)$. To obtain (24), we recall the definition of $\check{\Theta}^k$ from (46), and in (55) take $\eta(\check{\Theta}^k, \Psi) = \phi(\Theta^k, \Theta, \Psi)$. Since $\check{\Sigma}^k = \Sigma^k$, we have:
$$(G^k_\sigma, \Psi) \overset{d}{=} (Z, Z^k, \Psi) \overset{d}{=} (Z, M^k_\Theta Z + G^k_\Theta, \Psi), \tag{56}$$
where the last equality follows from (18). This completes the proof of the theorem.

5.2 Proof of Proposition 2

The proof relies on the following generalized Cauchy-Schwarz inequality for covariance matrices.

Lemma 4 (Lavergne, 2008, Lemma 1) Let $U, V \in \mathbb{R}^L$ be random vectors such that $\mathbb{E}[\|U\|_2^2] < \infty$, $\mathbb{E}[\|V\|_2^2] < \infty$, and $\mathbb{E}[V V^\top]$ is invertible. Then
$$\mathbb{E}[U U^\top] - \mathbb{E}[U V^\top] \, \mathbb{E}[V V^\top]^{-1} \, \mathbb{E}[V U^\top] \succeq 0. \tag{57}$$

Proof of part 1. Using the law of total expectation, $\Sigma^k_{(12)}$ in (17) can be written as:
$$\delta \Sigma^k_{(12)} = \mathbb{E}\big[ \bar{B} f_k(M^k_B \bar{B} + G^k_B)^\top \big] = \mathbb{E}\big[ \mathbb{E}[\bar{B} \mid M^k_B \bar{B} + G^k_B] \, f_k(M^k_B \bar{B} + G^k_B)^\top \big] = \mathbb{E}[f^*_k f_k^\top], \tag{58}$$
where we use the shorthand $f_k \equiv f_k(M^k_B \bar{B} + G^k_B)$ and $f^*_k \equiv \mathbb{E}[\bar{B} \mid M^k_B \bar{B} + G^k_B]$. Using Lemma 4, we have that
$$\mathbb{E}[f^*_k (f^*_k)^\top] - \mathbb{E}[f^*_k f_k^\top] \, \mathbb{E}[f_k f_k^\top]^{-1} \, \mathbb{E}[f_k (f^*_k)^\top] \succeq 0 \;\; \Longrightarrow \;\; \delta^{-1} \mathbb{E}[f^*_k (f^*_k)^\top] - \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} \Sigma^k_{(21)} \succeq 0, \tag{59}$$
where we have used (58) and (17) for the second implication. Adding and subtracting $T^k_\Theta$ in (59), we obtain
$$T^k_\Theta \succeq \underbrace{T^k_\Theta - \delta^{-1} \mathbb{E}[f^*_k (f^*_k)^\top] + \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} \Sigma^k_{(21)}}_{:= \Gamma^k_\Theta}. \tag{60}$$
Multiplying the matrix $(T^k_\Theta - \Gamma^k_\Theta)$ in (60) by $(M^k_\Theta)^{-1}$ on the left and $((M^k_\Theta)^{-1})^\top$ on the right preserves positive semi-definiteness. This yields
$$N^k_\Theta - (M^k_\Theta)^{-1} \, \Gamma^k_\Theta \, ((M^k_\Theta)^{-1})^\top \succeq 0, \tag{61}$$
where we have used the formula for $N^k_\Theta$ from (28). Eq. (61) implies
$$\mathrm{Tr}(N^k_\Theta) \ge \mathrm{Tr}\big( (M^k_\Theta)^{-1} \, \Gamma^k_\Theta \, ((M^k_\Theta)^{-1})^\top \big). \tag{62}$$
Now, using the formula for $T^k_\Theta$ in (20), it can be verified that when $f_k = f^*_k$, we have
$$T^k_\Theta = \Gamma^k_\Theta = \tfrac{1}{\delta} \Big( \mathbb{E}[f^*_k (f^*_k)^\top] - \mathbb{E}[f^*_k (f^*_k)^\top] \, \mathbb{E}[\bar{B} \bar{B}^\top]^{-1} \, \mathbb{E}[f^*_k (f^*_k)^\top] \Big). \tag{63}$$
Therefore (60)-(62) are satisfied with equality when $f_k = f^*_k$, which proves the first part of the proposition.

Proof of part 2. We begin by introducing the multivariate Stein's lemma:

Lemma 5 Let $x = (x_1, \ldots, x_L)$ and $g : \mathbb{R}^L \to \mathbb{R}^L$ be such that, for $j = 1, \ldots, L$, the function $x_j \mapsto g_l(x_1, \ldots, x_L)$ (where $g_l$ is the $l$th entry of $g$) is absolutely continuous for Lebesgue almost every $(x_i : i \ne j) \in \mathbb{R}^{L-1}$, with weak derivative $\partial_{x_j} g_l : \mathbb{R}^L \to \mathbb{R}$ satisfying $\mathbb{E}[|\partial_{x_j} g_l(x)|] < \infty$. Let $\nabla g(x) = (\nabla g_1(x), \ldots, \nabla g_L(x))^\top \in \mathbb{R}^{L \times L}$, where $\nabla g_l(x) = (\partial_{x_1} g_l(x), \ldots, \partial_{x_L} g_l(x))^\top$ for $x \in \mathbb{R}^L$. If $X \sim N(\mu, \Sigma)$ with $\Sigma$ positive definite, then
$$\mathbb{E}[\nabla g(X)] = \big( \Sigma^{-1} \, \mathbb{E}[(X - \mu) g(X)^\top] \big)^\top. \tag{64}$$

Proof We have
$$\mathbb{E}[(X - \mu) g(X)^\top] = \big( \mathbb{E}[(X - \mu) g_1(X)], \ldots, \mathbb{E}[(X - \mu) g_L(X)] \big) \overset{(a)}{=} \big( \Sigma \mathbb{E}[\nabla g_1(X)], \ldots, \Sigma \mathbb{E}[\nabla g_L(X)] \big) = \Sigma \, \mathbb{E}\big[ (\nabla g_1(X), \ldots, \nabla g_L(X)) \big] \overset{(b)}{=} \Sigma \, \mathbb{E}[\nabla g(X)]^\top, \tag{65}$$
where (a) uses the single-output Stein's lemma from (Feng et al., 2022, Lemma 6.20), which states that under our conditions $\mathbb{E}[(X - \mu) g_l(X)] = \Sigma \mathbb{E}[\nabla g_l(X)]$ for $l = 1, \ldots, L$, and (b) uses the definition of $\nabla g(x)$. Finally, rearranging the above equation and taking the transpose gives the result.

Next, we use Lemma 5 to show that
$$M^{k+1}_B = \mathbb{E}\big[ g_k(Z^k, Y) \, g^*_k(Z^k, Y)^\top \big], \tag{66}$$
where $g^*_k$ is defined in (30). Indeed, using the law of total expectation, we have
$$M^{k+1}_B = \mathbb{E}\big[ \mathbb{E}[\partial_Z h_k(Z, Z^k, \Psi) \mid Z^k] \big] \overset{(a)}{=} \mathbb{E}\big[ \mathbb{E}\big[ h_k(Z, Z^k, \Psi) (Z - \mathbb{E}[Z \mid Z^k])^\top \mid Z^k \big] \, \mathrm{Cov}[Z \mid Z^k]^{-1} \big] = \mathbb{E}\big[ h_k(Z, Z^k, \Psi) (Z - \mathbb{E}[Z \mid Z^k])^\top \, \mathrm{Cov}[Z \mid Z^k]^{-1} \big] = \mathbb{E}\big[ \mathbb{E}\big[ h_k(Z, Z^k, \Psi) (Z - \mathbb{E}[Z \mid Z^k])^\top \, \mathrm{Cov}[Z \mid Z^k]^{-1} \mid Z^k, \bar{Y} \big] \big] \overset{(b)}{=} \mathbb{E}\big[ h_k(Z, Z^k, \Psi) \, g^*_k(Z^k, Y)^\top \big] \overset{(c)}{=} \mathbb{E}\big[ g_k(Z^k, Y) \, g^*_k(Z^k, Y)^\top \big]. \tag{67}$$
Here (a) applies Lemma 5 (conditionally on $Z^k$), (b) follows from the definition of $g^*_k$ in (30), and (c) from (13).
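Identities like (57) and (64) are easy to sanity-check numerically. The snippet below is an illustrative Monte Carlo verification of Lemma 5 for a smooth entrywise test function; it is not part of the paper's development, and the test function and sample sizes are our own choices.

```python
import numpy as np

# Monte Carlo check of Lemma 5: E[grad g(X)] = (Sigma^{-1} E[(X - mu) g(X)^T])^T,
# using the entrywise function g(x) = tanh(x), whose Jacobian is diagonal.
rng = np.random.default_rng(1)
L, n_mc = 3, 1_000_000
mu = np.array([0.5, -1.0, 2.0])
A = rng.normal(size=(L, L))
Sigma = A @ A.T + L * np.eye(L)          # a positive definite covariance

X = rng.multivariate_normal(mu, Sigma, size=n_mc)
lhs = np.diag((1.0 - np.tanh(X) ** 2).mean(axis=0))            # E[grad g(X)]
rhs = np.linalg.solve(Sigma, (X - mu).T @ np.tanh(X) / n_mc).T  # RHS of (64)
print(np.abs(lhs - rhs).max())  # small, up to Monte Carlo error
```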
Using the shorthand $g_k \equiv g_k(Z^k, Y)$ and $g^*_k \equiv g^*_k(Z^k, Y)$, from Lemma 4 we have:
$$\mathbb{E}[g^*_k (g^*_k)^\top] - \mathbb{E}[g^*_k g_k^\top] \, \mathbb{E}[g_k g_k^\top]^{-1} \, \mathbb{E}[g_k (g^*_k)^\top] \succeq 0 \;\; \Longrightarrow \;\; \mathbb{E}[g^*_k (g^*_k)^\top] - (N^{k+1}_B)^{-1} \succeq 0 \tag{68}$$
$$\Longrightarrow \;\; \mathbb{E}[g^*_k (g^*_k)^\top]^{-1} - N^{k+1}_B \preceq 0, \tag{69}$$
where (68) is obtained by recalling from (28) that $(N^{k+1}_B)^{-1} = (M^{k+1}_B)^\top (T^{k+1}_B)^{-1} M^{k+1}_B$, and using the expressions for $M^{k+1}_B$ and $T^{k+1}_B$ in (66) and (15). Eq. (69) follows from the fact that if P and Q are positive definite matrices such that $P - Q \succeq 0$, then $Q^{-1} - P^{-1} \succeq 0$. From (69), we have that
$$\mathrm{Tr}(N^{k+1}_B) \ge \mathrm{Tr}\big( \mathbb{E}[g^*_k (g^*_k)^\top]^{-1} \big), \tag{70}$$
with equality if $g_k = g^*_k$. This completes the proof of the second part of the proposition.

5.3 Derivation of the EM Step for Max-Affine Regression

In this section, we derive steps 4 and 5 of EM-AMP for max-affine regression (Algorithm 1). We follow an approach similar to the one in Vila and Schniter (2013), where EM was combined with AMP for compressed sensing with unknown parameters in the signal prior. We adapt their derivation to the MAR model. Recall that
$$Y_i = \max\{ \langle X_i, \beta^{(1)} \rangle + b_1, \; \langle X_i, \beta^{(2)} \rangle + b_2 \} + \epsilon_i = \max\{ \Theta^{(1)}_i + b_1, \; \Theta^{(2)}_i + b_2 \} + \epsilon_i, \quad i \in [n]. \tag{71}$$
Here $\Theta^{(1)}$ and $\Theta^{(2)}$ are the first and second columns of $\Theta \in \mathbb{R}^{n \times 2}$, respectively, and $\Theta_i := (\Theta^{(1)}_i, \Theta^{(2)}_i) \in \mathbb{R}^2$. The parameter that we would like to estimate using EM is $b = (b_1, b_2)$.

Before providing the derivation, we briefly review the main idea behind the EM algorithm. The EM algorithm iteratively produces estimates $b^m \equiv (b^m_1, b^m_2)$ for $m \ge 1$, with the goal of increasing the likelihood $p(Y; b)$ at each iteration, where $Y = [Y_1, \ldots, Y_n]^\top$. It achieves this by iteratively increasing a lower bound on $p(Y; b)$, thus guaranteeing that the likelihood converges to a local maximum, or at least a saddle point (Wu, 1983). In our case, for an arbitrary distribution $\hat{p}$ on $(\Theta^{(1)}, \Theta^{(2)})$ we have
$$\log p(Y; b) = \int \hat{p}(\Theta^{(1)}, \Theta^{(2)}) \log p(Y; b) \, d\Theta^{(1)} d\Theta^{(2)} = \int \hat{p}(\Theta^{(1)}, \Theta^{(2)}) \log \bigg[ \frac{p(\Theta^{(1)}, \Theta^{(2)}, Y; b)}{\hat{p}(\Theta^{(1)}, \Theta^{(2)})} \cdot \frac{\hat{p}(\Theta^{(1)}, \Theta^{(2)})}{p(\Theta^{(1)}, \Theta^{(2)} \mid Y; b)} \bigg] d\Theta^{(1)} d\Theta^{(2)}$$
$$\overset{(a)}{=} \mathbb{E}_{(\Theta^{(1)}, \Theta^{(2)}) \sim \hat{p}} \big[ \log p(\Theta^{(1)}, \Theta^{(2)}, Y; b) \big] + H(\hat{p}) + D\big( \hat{p} \,\|\, p(\cdot \mid Y; b) \big) \tag{72}$$
$$\overset{(b)}{\ge} \mathbb{E}_{(\Theta^{(1)}, \Theta^{(2)}) \sim \hat{p}} \big[ \log p(\Theta^{(1)}, \Theta^{(2)}, Y; b) \big] + H(\hat{p}) =: \mathcal{L}(Y; b),$$
where in (a), $H(\cdot)$ is the Shannon entropy and $D(\cdot \| \cdot)$ the Kullback-Leibler (KL) divergence, and the inequality (b) follows from the non-negativity of the KL divergence. At step $m+1$, the EM algorithm iterates over two steps:

In the E-step, we choose $\hat{p}$ to maximize the lower bound $\mathcal{L}(Y; b)$ for fixed $b = b^m$.

In the M-step, we choose $b$ to maximize the lower bound $\mathcal{L}(Y; b)$ for fixed $\hat{p} = \hat{p}^m$.

For the E-step, since $\mathcal{L}(Y; b^m) = \log p(Y; b^m) - D(\hat{p} \| p(\cdot \mid Y; b^m))$ (via rearranging (72)), the maximizing probability density function (pdf) is $\hat{p}^m = p(\Theta^{(1)}, \Theta^{(2)} \mid Y; b^m)$. Then, for the M-step, from the definition of $\mathcal{L}(Y; b)$, the maximizing $b$ is:
$$b^{m+1} = \operatorname*{argmax}_{b \in \mathbb{R}^2} \; \mathbb{E}_{(\Theta^{(1)}, \Theta^{(2)}) \sim p(\Theta^{(1)}, \Theta^{(2)} \mid Y; b^m)} \big[ \log p(\Theta^{(1)}, \Theta^{(2)}, Y; b) \big]. \tag{73}$$
We can further expand the $p(\cdot)$ above as
$$p(\Theta^{(1)}, \Theta^{(2)}, Y; b) = p(\Theta^{(1)}, \Theta^{(2)}) \, p(Y \mid \Theta^{(1)}, \Theta^{(2)}; b) = p(\Theta^{(1)}, \Theta^{(2)}) \prod_{i=1}^n p(Y_i \mid \Theta_i; b), \tag{74}$$
where $p(\Theta^{(1)}, \Theta^{(2)}) = \prod_{i=1}^n p(\Theta^{(1)}_i, \Theta^{(2)}_i)$ does not depend on $b$ (recall that $(\Theta^{(1)}, \Theta^{(2)}) = XB$). It is challenging to jointly optimize over $b = (b_1, b_2)$ in (73), so we update one component of $b$ at a time while holding the other fixed. This is a well-known incremental variant of EM (Neal and Hinton, 1998). Using (74) and (73), the updated $b_1$ estimate is
$$b^{m+1}_1 = \operatorname*{argmax}_{b_1 \in \mathbb{R}} \; \mathbb{E}\Big[ \log \prod_{i=1}^n p(Y_i \mid \Theta_i; b_1, b^m_2) \,\Big|\, Y; b^m \Big] = \operatorname*{argmax}_{b_1 \in \mathbb{R}} \; \sum_{i=1}^n \mathbb{E}\big[ \log p(Y_i \mid \Theta_i; b_1, b^m_2) \,\big|\, Y; b^m \big] = \operatorname*{argmax}_{b_1 \in \mathbb{R}} \; \sum_{i=1}^n \int_{\Theta_i} p(\Theta_i \mid Y; b^m) \log p(Y_i \mid \Theta_i; b_1, b^m_2) \, d\Theta_i, \tag{75}$$
where we recall that $b^m = (b^m_1, b^m_2)$.

At this point, note that
$$p(Y_i \mid \Theta_i; b_1, b^m_2) = N\big( \max\{ \Theta^{(1)}_i + b_1, \, \Theta^{(2)}_i + b^m_2 \}, \; \sigma^2 \big). \tag{76}$$
To get $b^{m+1}_1$, we need to solve
$$\frac{\partial}{\partial b_1} \sum_{i=1}^n \int_{\Theta_i} p(\Theta_i \mid Y; b^m) \log p(Y_i \mid \Theta_i; b_1, b^m_2) \, d\Theta_i = 0 \;\; \Longleftrightarrow \;\; \sum_{i=1}^n \int_{\Theta_i} p(\Theta_i \mid Y; b^m) \, \frac{\partial}{\partial b_1} \log p(Y_i \mid \Theta_i; b_1, b^m_2) \, d\Theta_i = 0, \tag{77}$$
where we used Leibniz's integral rule to exchange differentiation and integration. Using (76), the derivative in (77) is
$$\frac{\partial}{\partial b_1} \log p(Y_i \mid \Theta_i; b_1, b^m_2) = \frac{\partial}{\partial b_1} \log \bigg[ \frac{1}{\sigma} \phi_N \bigg( \frac{Y_i - \max\{ \Theta^{(1)}_i + b_1, \Theta^{(2)}_i + b^m_2 \}}{\sigma} \bigg) \bigg] = \begin{cases} \frac{1}{\sigma^2} \big( Y_i - \Theta^{(1)}_i - b_1 \big) & \text{if } \Theta^{(1)}_i + b_1 > \Theta^{(2)}_i + b^m_2, \\ 0 & \text{if } \Theta^{(1)}_i + b_1 \le \Theta^{(2)}_i + b^m_2. \end{cases} \tag{78}$$
Substituting the above back into (77), we get
$$\sum_{i: \, \Theta^{(1)}_i + b_1 > \Theta^{(2)}_i + b^m_2} \int_{\Theta_i} p(\Theta_i \mid Y; b^m) \, \frac{1}{\sigma^2} \big( Y_i - \Theta^{(1)}_i - b_1 \big) \, d\Theta_i = 0 \;\; \Longleftrightarrow \;\; \sum_{i: \, \Theta^{(1)}_i + b_1 > \Theta^{(2)}_i + b^m_2} \big( Y_i - \mathbb{E}[\Theta^{(1)}_i \mid Y; b^m] - b_1 \big) = 0. \tag{79}$$
Rearranging gives
$$b_1 = \frac{1}{|\{ i : \Theta^{(1)}_i + b_1 > \Theta^{(2)}_i + b^m_2 \}|} \sum_{i: \, \Theta^{(1)}_i + b_1 > \Theta^{(2)}_i + b^m_2} \big( Y_i - \mathbb{E}[\Theta^{(1)}_i \mid Y; b^m] \big) \approx \frac{1}{|\{ i : \Theta^{(1)}_i + b_1 > \Theta^{(2)}_i + b^m_2 \}|} \sum_{i: \, \Theta^{(1)}_i + b_1 > \Theta^{(2)}_i + b^m_2} \big( Y_i - \mathbb{E}[Z_1 \mid \bar{Y}; b^m] \big), \tag{80}$$
where we approximate $\mathbb{E}[\Theta^{(1)}_i \mid Y; b^m]$ by $\mathbb{E}[Z_1 \mid \bar{Y}; b^m]$ because computing $\mathbb{E}[\Theta^{(1)}_i \mid Y; b^m]$ is intractable. The computation of $\mathbb{E}[Z_1 \mid \bar{Y}; b^m]$ is detailed in Appendix B.

Note that (80) cannot be solved directly for $b_1$, since $b_1$ also appears on the right-hand side (inside the index set). An easy (but admittedly non-principled) fix is to just use $b^m_1$ on the right-hand side. This gives the update:
$$b^{m+1}_1 := \frac{1}{|\{ i : \Theta^{(1)}_i + b^m_1 > \Theta^{(2)}_i + b^m_2 \}|} \sum_{i: \, \Theta^{(1)}_i + b^m_1 > \Theta^{(2)}_i + b^m_2} \big( Y_i - \mathbb{E}[Z_1 \mid \bar{Y}; b^m] \big), \tag{81}$$
where $\Theta_i$ and $\mathbb{E}[Z_1 \mid \bar{Y}; b^m]$ can be obtained from AMP in the previous iteration. Similarly, the update for the other intercept is:
$$b^{m+1}_2 := \frac{1}{|\{ i : \Theta^{(1)}_i + b^m_1 \le \Theta^{(2)}_i + b^m_2 \}|} \sum_{i: \, \Theta^{(1)}_i + b^m_1 \le \Theta^{(2)}_i + b^m_2} \big( Y_i - \mathbb{E}[Z_2 \mid \bar{Y}; b^m] \big). \tag{82}$$
Since $\Theta^{(1)}, \Theta^{(2)}$ are unknown, we approximate them using the AMP iterates. Specifically, the two columns of $\hat{\Theta}^m = X \hat{B}^{k_{\max}, m}$ provide estimates of $\Theta^{(1)}$ and $\Theta^{(2)}$, respectively. Using these in (81) and (82) yields Steps 4 and 5 of EM-AMP in Algorithm 1.
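The updates (81)-(82) are simple averages over the two "active" index sets. A minimal sketch (names ours), assuming the AMP estimates of $(\Theta^{(1)}, \Theta^{(2)})$ and the per-observation posterior means of $Z_1$ and $Z_2$ are available as arrays:

```python
import numpy as np

def em_intercept_update(b_m, Theta_hat, Y, Ez1, Ez2):
    """One incremental-EM update of the intercepts (b1, b2), following (81)-(82).

    Theta_hat: (n, 2) AMP estimates of (Theta^(1), Theta^(2));
    Ez1, Ez2:  length-n arrays of E[Z_1 | Ybar_i; b^m] and E[Z_2 | Ybar_i; b^m].
    """
    b1, b2 = b_m
    active1 = Theta_hat[:, 0] + b1 > Theta_hat[:, 1] + b2   # observations won by branch 1
    active2 = ~active1
    b1_new = np.mean(Y[active1] - Ez1[active1]) if active1.any() else b1
    b2_new = np.mean(Y[active2] - Ez2[active2]) if active2.any() else b2
    return b1_new, b2_new
```

Guarding against an empty index set (keeping the previous estimate in that case) is a practical safeguard we add here; the derivation above implicitly assumes both sets are non-empty.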
Acknowledgments

N. Tan was supported by the Cambridge Trust and the Harding Distinguished Postgraduate Scholars Programme Leverage Scheme.

Appendix A. Implementation Details for MLR

In this appendix, we consider MLR with two signals and provide the implementation details of matrix-AMP with Bayes-optimal functions (see Proposition 2), for the Gaussian prior and the sparse discrete prior. While the implementation details stated here are for MLR with two signals, they generalize straightforwardly to three or more signals; we omit the details.

A.1 Gaussian Prior

We start by writing the matrix-AMP algorithm in (11)-(12) in more detail. Initialize $\hat{R}^{-1} = 0 \in \mathbb{R}^{n \times 2}$ and $F^0 = I_2$. Next, we initialize the rows of $B^0$ and $\hat{B}^0$ independently using the jointly Gaussian prior. Letting $\bar{B} = (\bar\beta^{(1)}, \bar\beta^{(2)}) \in \mathbb{R}^2$ be a random vector distributed according to the jointly Gaussian prior, we initialize:
$$B^0_j, \hat{B}^0_j \overset{\text{i.i.d.}}{\sim} N\Bigg( \begin{bmatrix} \mathbb{E}[\bar\beta^{(1)}] \\ \mathbb{E}[\bar\beta^{(2)}] \end{bmatrix}, \; \begin{bmatrix} \mathrm{Var}[\bar\beta^{(1)}] & \mathrm{Cov}[\bar\beta^{(1)}, \bar\beta^{(2)}] \\ \mathrm{Cov}[\bar\beta^{(2)}, \bar\beta^{(1)}] & \mathrm{Var}[\bar\beta^{(2)}] \end{bmatrix} \Bigg), \quad j \in [p], \tag{83}$$
where the covariance matrix can be written in terms of moments as
$$\begin{bmatrix} \mathbb{E}[(\bar\beta^{(1)})^2] - (\mathbb{E}[\bar\beta^{(1)}])^2 & \mathbb{E}[\bar\beta^{(1)} \bar\beta^{(2)}] - \mathbb{E}[\bar\beta^{(1)}] \mathbb{E}[\bar\beta^{(2)}] \\ \mathbb{E}[\bar\beta^{(1)} \bar\beta^{(2)}] - \mathbb{E}[\bar\beta^{(1)}] \mathbb{E}[\bar\beta^{(2)}] & \mathbb{E}[(\bar\beta^{(2)})^2] - (\mathbb{E}[\bar\beta^{(2)}])^2 \end{bmatrix}. \tag{84}$$
For each iteration $k \in \mathbb{N}_0$ of matrix-AMP, we have the following steps (a code sketch of this loop is given after the list):

1. Compute $\Theta^k := X \hat{B}^k - \hat{R}^{k-1} (F^k)^\top$.
2. Compute $\hat{R}^k := g_k(\Theta^k, Y)$.
3. Approximate $C^k := \frac{1}{n} \sum_{i=1}^n g'_k(\Theta^k_i, Y_i)$.
4. Compute $B^{k+1} := X^\top \hat{R}^k - \hat{B}^k (C^k)^\top$.
5. Approximate $\hat{B}^{k+1} := f_{k+1}(B^{k+1})$.
6. Approximate $F^{k+1} := \frac{1}{n} \sum_{j=1}^p f'_{k+1}(B^{k+1}_j)$.
7. Approximate $\Sigma^{k+1}$.
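A minimal sketch of the control flow of these seven steps is given below. The denoisers and their derivative estimates are supplied as callables; their exact signatures here are our own illustrative choice (in practice, closures can carry the state-evolution quantities $\Sigma^k$, $M^k_B$, $T^k_B$ needed internally).

```python
import numpy as np

def matrix_amp(X, Y, B_hat0, g, dg, f, df, update_Sigma, num_iters=10):
    """Control-flow skeleton of matrix-AMP (steps 1-7 above)."""
    n, _ = X.shape
    L = B_hat0.shape[1]
    R_hat = np.zeros((n, L))      # \hat{R}^{-1} = 0
    F = np.eye(L)                 # F^0 = I_L
    B_hat = B_hat0
    for k in range(num_iters):
        Theta = X @ B_hat - R_hat @ F.T        # step 1
        R_hat = g(Theta, Y)                    # step 2 (row-wise g_k)
        C = dg(Theta, Y)                       # step 3: (1/n) sum_i g'_k
        B = X.T @ R_hat - B_hat @ C.T          # step 4
        B_hat = f(B)                           # step 5 (row-wise f_{k+1})
        F = df(B)                              # step 6: (1/n) sum_j f'_{k+1}
        update_Sigma(B_hat)                    # step 7 (updates state used by g, f)
    return B_hat
```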
The quantities in steps 1, 2, and 4 can be directly computed. The other steps require some form of numerical approximation (based on limiting properties of the iterates) to make the computation tractable. We now explain how the quantities in steps 2, 3, 5, 6, and 7 can be computed or approximated.

Step 2: We assume that $\Sigma^k$ has been approximated in the previous iteration. From Proposition 2, for MLR the function $g_k : \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}^2$ is given by
$$g_k(Z^k, \bar{Y}) = \mathrm{Cov}[Z \mid Z^k]^{-1} \big( \mathbb{E}[Z \mid Z^k, \bar{Y}] - \mathbb{E}[Z \mid Z^k] \big). \tag{85}$$
Since $(Z, Z^k) \sim N(0, \Sigma^k)$, using standard properties of Gaussian random vectors we have:
$$\mathrm{Cov}[Z \mid Z^k] = \Sigma^k_{(11)} - \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} \Sigma^k_{(21)}, \qquad \mathbb{E}[Z \mid Z^k] = \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} Z^k. \tag{86}$$
To compute $\mathbb{E}[Z \mid Z^k, \bar{Y}]$, we recall that $Z = (Z_1, Z_2)$ and let $\bar\Psi = (\bar{c}, \bar\epsilon)$, with $\bar{c} \sim \text{Bernoulli}(\alpha)$ and $\bar\epsilon \sim N(0, \sigma^2)$ independent. Then,
$$\bar{Y} = q(Z, \bar\Psi) = Z_1 \bar{c} + Z_2 (1 - \bar{c}) + \bar\epsilon, \tag{87}$$
using which we have that
$$\mathbb{E}[Z \mid Z^k, \bar{Y}] = \mathbb{E}[Z \mid Z^k, \bar{Y}, \bar{c} = 1] \, \mathbb{P}[\bar{c} = 1 \mid Z^k, \bar{Y}] + \mathbb{E}[Z \mid Z^k, \bar{Y}, \bar{c} = 0] \, \mathbb{P}[\bar{c} = 0 \mid Z^k, \bar{Y}]. \tag{88}$$
We now show how each of the four terms in (88) can be computed. We first find the joint distribution of $(Z, Z^k, \bar{Y} \mid \bar{c} = 1)$, which from (87) is jointly Gaussian. We denote this distribution by $N(0, \Sigma^{k,1}_Y)$, and proceed to derive $\Sigma^{k,1}_Y \in \mathbb{R}^{5 \times 5}$. Given a matrix $M \in \mathbb{R}^{n_1 \times n_2}$, we will use the notation $M_{[a],[b]}$ to denote the submatrix consisting of the first a rows and first b columns of M, and $M_{[a+],[b+]}$ to denote the submatrix with rows $\{a, \ldots, n_1\}$ and columns $\{b, \ldots, n_2\}$. We know from the joint distribution of $(Z, Z^k)$ that $(\Sigma^{k,1}_Y)_{[4],[4]} = \Sigma^k$. Hence, we only need to determine the remaining entries:
$$(\Sigma^{k,1}_Y)_{5,5} = \mathrm{Var}[\bar{Y} \mid \bar{c} = 1] = \mathrm{Var}[Z_1 + \bar\epsilon] = \Sigma^k_{11} + \sigma^2, \qquad (\Sigma^{k,1}_Y)_{1,5} = (\Sigma^{k,1}_Y)_{5,1} = \mathrm{Cov}[\bar{Y}, Z_1 \mid \bar{c} = 1] = \mathrm{Cov}[Z_1 + \bar\epsilon, Z_1] = \Sigma^k_{11},$$
$$(\Sigma^{k,1}_Y)_{2,5} = (\Sigma^{k,1}_Y)_{5,2} = \mathrm{Cov}[\bar{Y}, Z_2 \mid \bar{c} = 1] = \Sigma^k_{12}, \qquad (\Sigma^{k,1}_Y)_{3,5} = (\Sigma^{k,1}_Y)_{5,3} = \mathrm{Cov}[\bar{Y}, Z^k_1 \mid \bar{c} = 1] = \Sigma^k_{13}, \qquad (\Sigma^{k,1}_Y)_{4,5} = (\Sigma^{k,1}_Y)_{5,4} = \mathrm{Cov}[\bar{Y}, Z^k_2 \mid \bar{c} = 1] = \Sigma^k_{14}, \tag{89}$$
where we have used the fact that $(Z, Z^k)$ and $\bar\epsilon$ are independent, and the notation $\Sigma^k_{ij}$ refers to the $(i, j)$-th entry of the matrix $\Sigma^k$. This gives
$$\Sigma^{k,1}_Y = \begin{bmatrix} \Sigma^k_{11} & \Sigma^k_{12} & \Sigma^k_{13} & \Sigma^k_{14} & \Sigma^k_{11} \\ \Sigma^k_{21} & \Sigma^k_{22} & \Sigma^k_{23} & \Sigma^k_{24} & \Sigma^k_{21} \\ \Sigma^k_{31} & \Sigma^k_{32} & \Sigma^k_{33} & \Sigma^k_{34} & \Sigma^k_{31} \\ \Sigma^k_{41} & \Sigma^k_{42} & \Sigma^k_{43} & \Sigma^k_{44} & \Sigma^k_{41} \\ \Sigma^k_{11} & \Sigma^k_{12} & \Sigma^k_{13} & \Sigma^k_{14} & \Sigma^k_{11} + \sigma^2 \end{bmatrix}. \tag{90}$$
From the joint distribution, we can compute
$$\mathbb{E}[Z \mid Z^k, \bar{Y}, \bar{c} = 1] = (\Sigma^{k,1}_Y)_{[2],[3+]} \, \big( (\Sigma^{k,1}_Y)_{[3+],[3+]} \big)^{-1} \begin{bmatrix} Z^k \\ \bar{Y} \end{bmatrix}, \tag{91}$$
where $[3+] := \{3, 4, 5\}$. Using the same approach, we can determine the joint distribution of $(Z, Z^k, \bar{Y} \mid \bar{c} = 0)$ as $N(0, \Sigma^{k,0}_Y)$, where
$$\Sigma^{k,0}_Y = \begin{bmatrix} \Sigma^k_{11} & \Sigma^k_{12} & \Sigma^k_{13} & \Sigma^k_{14} & \Sigma^k_{12} \\ \Sigma^k_{21} & \Sigma^k_{22} & \Sigma^k_{23} & \Sigma^k_{24} & \Sigma^k_{22} \\ \Sigma^k_{31} & \Sigma^k_{32} & \Sigma^k_{33} & \Sigma^k_{34} & \Sigma^k_{32} \\ \Sigma^k_{41} & \Sigma^k_{42} & \Sigma^k_{43} & \Sigma^k_{44} & \Sigma^k_{42} \\ \Sigma^k_{21} & \Sigma^k_{22} & \Sigma^k_{23} & \Sigma^k_{24} & \Sigma^k_{22} + \sigma^2 \end{bmatrix}. \tag{92}$$
From this joint distribution, we can compute
$$\mathbb{E}[Z \mid Z^k, \bar{Y}, \bar{c} = 0] = (\Sigma^{k,0}_Y)_{[2],[3+]} \, \big( (\Sigma^{k,0}_Y)_{[3+],[3+]} \big)^{-1} \begin{bmatrix} Z^k \\ \bar{Y} \end{bmatrix}. \tag{93}$$
The first conditional probability term in (88) can be computed as:
$$\mathbb{P}[\bar{c} = 1 \mid Z^k, \bar{Y}] = \frac{\mathbb{P}[\bar{c} = 1] \, p(Z^k, \bar{Y} \mid \bar{c} = 1)}{\mathbb{P}[\bar{c} = 1] \, p(Z^k, \bar{Y} \mid \bar{c} = 1) + \mathbb{P}[\bar{c} = 0] \, p(Z^k, \bar{Y} \mid \bar{c} = 0)} = \frac{\alpha \, p(Z^k, \bar{Y} \mid \bar{c} = 1)}{\alpha \, p(Z^k, \bar{Y} \mid \bar{c} = 1) + (1 - \alpha) \, p(Z^k, \bar{Y} \mid \bar{c} = 0)}, \tag{94}$$
where, given $\bar{c} = 1$, we have $(Z^k, \bar{Y}) \sim N(0, (\Sigma^{k,1}_Y)_{[3+],[3+]})$, and given $\bar{c} = 0$, we have $(Z^k, \bar{Y}) \sim N(0, (\Sigma^{k,0}_Y)_{[3+],[3+]})$. Similarly, we have:
$$\mathbb{P}[\bar{c} = 0 \mid Z^k, \bar{Y}] = \frac{(1 - \alpha) \, p(Z^k, \bar{Y} \mid \bar{c} = 0)}{\alpha \, p(Z^k, \bar{Y} \mid \bar{c} = 1) + (1 - \alpha) \, p(Z^k, \bar{Y} \mid \bar{c} = 0)}. \tag{95}$$
Using (91)-(95), we can compute (88), which together with the quantities in (86) allows us to compute $g_k(Z^k, \bar{Y})$ in (85). Finally, we compute $\hat{R}^k$ by applying $g_k$ row-wise, i.e., computing $g_k(\Theta^k_i, Y_i)$ for each $i \in [n]$.
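The Step 2 computation can be organized as follows. This is an illustrative sketch under the assumptions above (two signals, known α and σ); it uses SciPy's multivariate Gaussian density for the $(Z^k, \bar{Y})$ likelihoods, and all names are our own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def g_k_mlr(theta_k, y, Sigma_k, alpha, sigma):
    """Bayes-optimal g_k for two-component MLR at one row (theta_k in R^2, y scalar)."""
    S11, S12 = Sigma_k[:2, :2], Sigma_k[:2, 2:]
    S21, S22 = Sigma_k[2:, :2], Sigma_k[2:, 2:]
    cov_cond = S11 - S12 @ np.linalg.solve(S22, S21)       # Cov[Z | Z^k], eq. (86)
    mean_cond = S12 @ np.linalg.solve(S22, theta_k)        # E[Z | Z^k], eq. (86)

    def joint_cov(c):
        """Covariance of (Z, Z^k, Ybar) given the active branch: c = 0 for cbar = 1,
        c = 1 for cbar = 0 (eqs. (90) and (92))."""
        S = np.zeros((5, 5))
        S[:4, :4] = Sigma_k
        S[4, :4] = S[:4, 4] = Sigma_k[c, :4]
        S[4, 4] = Sigma_k[c, c] + sigma ** 2
        return S

    v = np.append(theta_k, y)                              # the vector (Z^k, Ybar)
    post_mean, like = [], []
    for c, prior in ((0, alpha), (1, 1 - alpha)):
        S = joint_cov(c)
        post_mean.append(S[:2, 2:] @ np.linalg.solve(S[2:, 2:], v))   # eqs. (91), (93)
        like.append(prior * multivariate_normal.pdf(v, cov=S[2:, 2:]))
    w = like[0] / (like[0] + like[1])                      # P[cbar = 1 | Z^k, Ybar], eq. (94)
    Ez = w * post_mean[0] + (1 - w) * post_mean[1]         # E[Z | Z^k, Ybar], eq. (88)
    return np.linalg.solve(cov_cond, Ez - mean_cond)       # eq. (85)
```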
Step 3: Recalling that $g_k(Z^k, \bar{Y}) = h_k(Z, Z^k, \bar\Psi)$, we approximate $C^k = \frac{1}{n} \sum_{i=1}^n g'_k(\Theta^k_i, Y_i)$ by calculating $\mathbb{E}[g'_k(Z^k, \bar{Y})] = \mathbb{E}[\partial_{Z^k} h_k(Z, Z^k, \bar\Psi)]$, where $\partial_{Z^k} h_k$ denotes the Jacobian with respect to $Z^k$. We compute the latter expectation by applying the generalized Stein's lemma (Lemma 5) to $(Z, Z^k) \sim N(0, \Sigma^k)$ and $h_k(Z, Z^k, \bar\Psi)$. This gives:
$$\mathbb{E}\Big[ \begin{bmatrix} Z \\ Z^k \end{bmatrix} h_k(Z, Z^k, \bar\Psi)^\top \Big] = \Sigma^k \, \mathbb{E}\Big[ \begin{bmatrix} \partial_Z h_k(Z, Z^k, \bar\Psi) \\ \partial_{Z^k} h_k(Z, Z^k, \bar\Psi) \end{bmatrix} \Big] \in \mathbb{R}^{4 \times 2}. \tag{96}$$
Writing the above explicitly, we have
$$\begin{bmatrix} \mathbb{E}[Z \, h_k(Z, Z^k, \bar\Psi)^\top] \\ \mathbb{E}[Z^k \, h_k(Z, Z^k, \bar\Psi)^\top] \end{bmatrix} = \begin{bmatrix} \Sigma^k_{(11)} & \Sigma^k_{(12)} \\ \Sigma^k_{(21)} & \Sigma^k_{(22)} \end{bmatrix} \begin{bmatrix} \mathbb{E}[\partial_Z h_k(Z, Z^k, \bar\Psi)] \\ \mathbb{E}[\partial_{Z^k} h_k(Z, Z^k, \bar\Psi)] \end{bmatrix} = \begin{bmatrix} \Sigma^k_{(11)} \mathbb{E}[\partial_Z h_k] + \Sigma^k_{(12)} \mathbb{E}[\partial_{Z^k} h_k] \\ \Sigma^k_{(21)} \mathbb{E}[\partial_Z h_k] + \Sigma^k_{(22)} \mathbb{E}[\partial_{Z^k} h_k] \end{bmatrix}, \tag{97}$$
where the matrices $\Sigma^k_{(11)}, \Sigma^k_{(12)}, \Sigma^k_{(21)}, \Sigma^k_{(22)} \in \mathbb{R}^{2 \times 2}$ are as defined in (17). Looking at just the second block row above and rearranging, we obtain:
$$\mathbb{E}[\partial_{Z^k} h_k(Z, Z^k, \bar\Psi)] = (\Sigma^k_{(22)})^{-1} \Big( \mathbb{E}[Z^k \, h_k(Z, Z^k, \bar\Psi)^\top] - \Sigma^k_{(21)} \mathbb{E}[\partial_Z h_k(Z, Z^k, \bar\Psi)] \Big). \tag{98}$$
Here $\mathbb{E}[Z^k \, h_k(Z, Z^k, \bar\Psi)^\top] = \mathbb{E}[Z^k \, g_k(Z^k, \bar{Y})^\top]$ can be approximated by $\frac{1}{n} (\Theta^k)^\top g_k(\Theta^k, Y)$, and $\mathbb{E}[\partial_Z h_k(Z, Z^k, \bar\Psi)]$ can be approximated by $\frac{1}{n} g_k(\Theta^k, Y)^\top g_k(\Theta^k, Y)$ (this follows from the equivalent expressions for $M^{k+1}_B$ in (14) and (66), noting that we have used $g_k = g^*_k$).

Step 5: Since $\bar{B}$ is independent of $G^{k+1}_B \sim N(0, T^{k+1}_B)$, we have that
$$\begin{bmatrix} \bar{B} \\ M^{k+1}_B \bar{B} + G^{k+1}_B \end{bmatrix} \sim N\Bigg( \begin{bmatrix} \mathbb{E}[\bar{B}] \\ M^{k+1}_B \mathbb{E}[\bar{B}] \end{bmatrix}, \; \begin{bmatrix} \mathrm{Cov}[\bar{B}] & \mathrm{Cov}[\bar{B}] (M^{k+1}_B)^\top \\ M^{k+1}_B \mathrm{Cov}[\bar{B}] & M^{k+1}_B \mathrm{Cov}[\bar{B}] (M^{k+1}_B)^\top + T^{k+1}_B \end{bmatrix} \Bigg). \tag{99}$$
This implies that
$$f_{k+1}(\underbrace{M^{k+1}_B \bar{B} + G^{k+1}_B}_{=: s}) = \mathbb{E}[\bar{B} \mid s] = \mathbb{E}[\bar{B}] + \mathrm{Cov}[\bar{B}] (M^{k+1}_B)^\top \big( M^{k+1}_B \mathrm{Cov}[\bar{B}] (M^{k+1}_B)^\top + T^{k+1}_B \big)^{-1} \big( s - M^{k+1}_B \mathbb{E}[\bar{B}] \big). \tag{100}$$
We can use this function to compute $f_{k+1}(B^{k+1}_j)$ provided we can approximate $M^{k+1}_B$ and $T^{k+1}_B$ (which equals $M^{k+1}_B$ under the Bayes-optimal choices, by (15) and (66)). Using (66), this can be calculated via
$$T^{k+1}_B = M^{k+1}_B \approx \tfrac{1}{n} g_k(\Theta^k, Y)^\top g_k(\Theta^k, Y). \tag{101}$$

Step 6: The expression for this can be obtained by taking the derivative of (100) with respect to s, which gives
$$f'_{k+1}(s) = \big( M^{k+1}_B \mathrm{Cov}[\bar{B}] (M^{k+1}_B)^\top + T^{k+1}_B \big)^{-1} M^{k+1}_B \mathrm{Cov}[\bar{B}], \tag{102}$$
where $M^{k+1}_B$ and $T^{k+1}_B$ can be approximated using (101).

Step 7: Using the formulas in (16)-(17) for $\Sigma^{k+1}$, and noting that $f_{k+1}$ is a conditional expectation, the covariance $\Sigma^{k+1}$ can be approximated as
$$\Sigma^{k+1} \approx \begin{bmatrix} \Sigma^k_{(11)} & \frac{1}{n} f_{k+1}(B^{k+1})^\top f_{k+1}(B^{k+1}) \\ \frac{1}{n} f_{k+1}(B^{k+1})^\top f_{k+1}(B^{k+1}) & \frac{1}{n} f_{k+1}(B^{k+1})^\top f_{k+1}(B^{k+1}) \end{bmatrix}. \tag{103}$$
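The Gaussian-prior denoiser (100) and its Jacobian (102) are linear in s, so they take only a few lines to implement. A minimal sketch (names ours), assuming the estimate $M^{k+1}_B \approx T^{k+1}_B$ from (101) is available:

```python
import numpy as np

def gaussian_prior_denoiser(S, mean_B, cov_B, M_B, T_B):
    """Bayes-optimal f_{k+1} for a jointly Gaussian prior, eq. (100), applied row-wise.

    S: (p, 2) array whose rows are s_j = B^{k+1}_j. Returns the (p, 2) array of
    posterior means E[B | s_j] and the (constant) Jacobian of eq. (102).
    """
    G = M_B @ cov_B @ M_B.T + T_B                    # covariance of s in (99)
    A = cov_B @ M_B.T @ np.linalg.inv(G)             # gain matrix in (100)
    F_hat = mean_B + (S - mean_B @ M_B.T) @ A.T      # E[B | s_j] for each row
    jac = np.linalg.solve(G, M_B @ cov_B)            # eq. (102), independent of s
    return F_hat, jac
```

Because the prior is Gaussian, the Jacobian does not depend on s, so the average in step 6 reduces to this single matrix.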
A.2 Sparse Discrete Prior

As described in Appendix A.1, there are seven main steps in the AMP algorithm. A change in the prior requires changing the denoiser $f_k$, which affects steps 5, 6, and 7; the other steps remain unchanged. The changes are as follows, for the Bayes-optimal and soft-thresholding choices of the denoiser $f_k$.

Bayes-optimal $f_k$:

Step 5: We have
$$f_{k+1}(\underbrace{M^{k+1}_B \bar{B} + G^{k+1}_B}_{=: s}) = \mathbb{E}[\bar{B} \mid s] = \frac{\sum_b b \, \mathbb{P}[\bar{B} = b] \, p(s \mid \bar{B} = b)}{\sum_b \mathbb{P}[\bar{B} = b] \, p(s \mid \bar{B} = b)}, \tag{104}$$
where $(s \mid \bar{B} = b) \sim N(M^{k+1}_B b, T^{k+1}_B)$, i.e., $p(s \mid \bar{B} = b)$ is the bivariate Gaussian pdf with mean vector $M^{k+1}_B b$ and covariance matrix $T^{k+1}_B$.

Step 6: In the following, for brevity we write $f \equiv f_{k+1}$, with $f_1, f_2$ denoting its two components. By the definition of a Jacobian, we have
$$\partial_s f(s) = \begin{bmatrix} \partial_{s_1} f_1 & \partial_{s_2} f_1 \\ \partial_{s_1} f_2 & \partial_{s_2} f_2 \end{bmatrix}. \tag{105}$$
To compute $\nabla_s f_1(s)$, letting $b = [b^{(1)}, b^{(2)}]^\top$, we write
$$f_1(s) = \frac{\sum_b b^{(1)} \, \mathbb{P}[\bar{B} = b] \, p(s \mid \bar{B} = b)}{\sum_b \mathbb{P}[\bar{B} = b] \, p(s \mid \bar{B} = b)} =: \frac{\mathrm{num}_1}{\mathrm{denom}_1}. \tag{106}$$
By the quotient rule for functions with a vector input and a scalar output, we have
$$\nabla_s f_1(s) = \frac{(\nabla_s \mathrm{num}_1) \, \mathrm{denom}_1 - \mathrm{num}_1 \, (\nabla_s \mathrm{denom}_1)}{\mathrm{denom}_1^2}. \tag{107}$$
Since $p(s \mid \bar{B} = b)$ is the bivariate Gaussian pdf with mean $M^{k+1}_B b$ and covariance matrix $T^{k+1}_B$, we have that
$$\nabla_s \, p(s \mid \bar{B} = b) = \nabla_s \, \frac{\exp\big( -\tfrac{1}{2} (s - M^{k+1}_B b)^\top (T^{k+1}_B)^{-1} (s - M^{k+1}_B b) \big)}{\sqrt{\det(2 \pi T^{k+1}_B)}} = (T^{k+1}_B)^{-1} \big( M^{k+1}_B b - s \big) \, p(s \mid \bar{B} = b). \tag{108}$$
Using the above equation, we get
$$\nabla_s \mathrm{num}_1 = \sum_b b^{(1)} (T^{k+1}_B)^{-1} (M^{k+1}_B b - s) \, \mathbb{P}[\bar{B} = b] \, p(s \mid \bar{B} = b), \qquad \nabla_s \mathrm{denom}_1 = \sum_b (T^{k+1}_B)^{-1} (M^{k+1}_B b - s) \, \mathbb{P}[\bar{B} = b] \, p(s \mid \bar{B} = b), \tag{109}$$
using which $\nabla_s f_1(s)$ can be computed via (107). The Jacobian row $\nabla_s f_2(s)$ can be computed similarly.

Soft-Thresholding $f_k$:

Step 5: We can directly compute the function in (36).

Step 6: We can directly compute the Jacobian as shown in (37).

Step 7: We observe that, unlike the Bayes-optimal case, we no longer have the equality $\mathbb{E}[\bar{B} f_{k+1}(M^{k+1}_B \bar{B} + G^{k+1}_B)^\top] = \mathbb{E}[f_{k+1}(M^{k+1}_B \bar{B} + G^{k+1}_B) f_{k+1}(M^{k+1}_B \bar{B} + G^{k+1}_B)^\top]$. Hence, we need to compute $\mathbb{E}[\bar{B} f_{k+1}(M^{k+1}_B \bar{B} + G^{k+1}_B)^\top]$ separately, and we evaluate each of its entries in turn. We start by substituting the definitions of $\bar{B}$ and $f_{k+1}$, with $\bar{B} = (\bar\beta^{(1)}, \bar\beta^{(2)})$. This gives:
$$\{ \bar{B} f_{k+1}^\top \}_{11} = \bar\beta^{(1)} \, \mathrm{ST}\Big( \{ \bar{B} + (M^{k+1}_B)^{-1} G^{k+1}_B \}_1; \; \alpha \sqrt{\{ N^{k+1}_B \}_{11}} \Big) = \bar\beta^{(1)} \, \mathrm{ST}\Big( \bar\beta^{(1)} + \{ (M^{k+1}_B)^{-1} \}_{11} \{ G^{k+1}_B \}_1 + \{ (M^{k+1}_B)^{-1} \}_{12} \{ G^{k+1}_B \}_2; \; \alpha \sqrt{\{ N^{k+1}_B \}_{11}} \Big). \tag{110}$$
Expanding the function and taking expectations over $\bar\beta^{(1)}$ and $G^{k+1}_B$, we get
$$\mathbb{E}[\{ \bar{B} f_{k+1}^\top \}_{11}] = \begin{cases} \varepsilon & \text{if } \big| \{ \bar{B} + (M^{k+1}_B)^{-1} G^{k+1}_B \}_1 \big| > \alpha \sqrt{\{ (M^{k+1}_B)^{-1} T^{k+1}_B ((M^{k+1}_B)^{-1})^\top \}_{11}}, \\ 0 & \text{otherwise}. \end{cases} \tag{111}$$
Following a similar set of steps for the other entries, we have
$$\mathbb{E}[\{ \bar{B} f_{k+1}^\top \}_{22}] = \begin{cases} \varepsilon & \text{if } \big| \{ \bar{B} + (M^{k+1}_B)^{-1} G^{k+1}_B \}_2 \big| > \alpha \sqrt{\{ (M^{k+1}_B)^{-1} T^{k+1}_B ((M^{k+1}_B)^{-1})^\top \}_{22}}, \\ 0 & \text{otherwise}, \end{cases} \tag{112}$$
$$\mathbb{E}[\{ \bar{B} f_{k+1}^\top \}_{12}] = \mathbb{E}[\{ \bar{B} f_{k+1}^\top \}_{21}] = 0. \tag{113}$$
In our algorithm, we do not have access to $\bar{B} + (M^{k+1}_B)^{-1} G^{k+1}_B$, but we have $B^{k+1}$, whose empirical distribution (of rows) converges to that of $\bar{B} + (M^{k+1}_B)^{-1} G^{k+1}_B$. We can therefore estimate the expectations in (111) and (112) by evaluating the right-hand side for each row $B^{k+1}_j$ and taking the average. For example, we compute $\mathbb{E}[\{ \bar{B} f_{k+1}^\top \}_{11}]$ by evaluating (111) for each of the p rows of $B^{k+1}$ and averaging.

Appendix B. Implementation Details for MAR

Changing the matrix GLM model $q(\cdot, \cdot)$ only affects the denoising function $g_k$ in AMP. Hence, in the seven-step implementation described in Appendix A.1, the only change is in the computation of $g_k$ in steps 2 and 3. (The Jacobian $g'_k$ in step 3 is approximated using $g_k$, as described in Step 3 of Appendix A.1.) Recall that
$$g_k(Z^k, \bar{Y}) = \mathrm{Cov}[Z \mid Z^k]^{-1} \big( \mathbb{E}[Z \mid Z^k, \bar{Y}] - \mathbb{E}[Z \mid Z^k] \big). \tag{114}$$
Note that $\mathrm{Cov}[Z \mid Z^k]$ and $\mathbb{E}[Z \mid Z^k]$ are the same as for mixed linear regression (with the formulas given in (86)), so we only need to evaluate $\mathbb{E}[Z \mid Z^k, \bar{Y}] \in \mathbb{R}^2$. This is approximated using a Monte Carlo approach, which we now describe. We have
$$\mathbb{E}[Z \mid Z^k = u, \bar{Y} = y] = \frac{\int z \, p_{Z^k}(u) \, p_{Z | Z^k}(z | u) \, p_{\bar{Y} | Z, Z^k}(y | z, u) \, dz}{\int p_{Z^k}(u) \, p_{Z | Z^k}(z | u) \, p_{\bar{Y} | Z, Z^k}(y | z, u) \, dz} = \frac{\mathbb{E}_{Z | Z^k = u}\big[ Z \, p_{Z^k}(u) \, p_{\bar{Y} | Z, Z^k}(y | Z, u) \big]}{\mathbb{E}_{Z | Z^k = u}\big[ p_{Z^k}(u) \, p_{\bar{Y} | Z, Z^k}(y | Z, u) \big]}, \tag{115}$$
where, given $Z^k = u$ and $\bar{Y} = y$, the probability density functions inside the expectations can be easily evaluated since $Z^k \sim N(0, \Sigma^k_{(22)})$, and for $\sigma > 0$ and $z = (z_1, z_2)$,
$$p_{\bar{Y} | Z, Z^k}(y | z, u) = p_{\bar{Y} | Z}(y | z) = \begin{cases} \frac{1}{\sigma} \phi_N\big( \frac{y - z_1 - b_1}{\sigma} \big) & \text{if } z_1 + b_1 > z_2 + b_2, \\ \frac{1}{\sigma} \phi_N\big( \frac{y - z_2 - b_2}{\sigma} \big) & \text{otherwise}, \end{cases} \tag{116}$$
where $\phi_N$ is the standard Gaussian density. With the above density functions, we can approximate the numerator and denominator of (115) by sampling z's from
$$(Z \mid Z^k = u) \sim N\big( \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} u, \;\; \Sigma^k_{(11)} - \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} \Sigma^k_{(21)} \big), \tag{117}$$
and then taking the averages of the functions inside the expectations, as illustrated in the sketch below.
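Note that the factor $p_{Z^k}(u)$ in (115) is constant in z and cancels in the ratio. The following is an illustrative sketch of this Monte Carlo computation (names and sample size are our own choices; the paper's implementation may differ in such details):

```python
import numpy as np
from scipy.stats import norm

def posterior_mean_Z_mar(u, y, Sigma_k, b, sigma, n_mc=10_000, rng=None):
    """Monte Carlo approximation of E[Z | Z^k = u, Ybar = y] for MAR, eqs. (115)-(117)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    S11, S12 = Sigma_k[:2, :2], Sigma_k[:2, 2:]
    S21, S22 = Sigma_k[2:, :2], Sigma_k[2:, 2:]
    mean = S12 @ np.linalg.solve(S22, u)
    cov = S11 - S12 @ np.linalg.solve(S22, S21)
    z = rng.multivariate_normal(mean, cov, size=n_mc)            # samples from (117)
    b = np.asarray(b)
    active = np.where(z[:, 0] + b[0] > z[:, 1] + b[1], 0, 1)     # branch achieving the max
    mu = z[np.arange(n_mc), active] + b[active]
    w = norm.pdf(y, loc=mu, scale=sigma)                         # likelihood (116)
    return (w[:, None] * z).sum(axis=0) / w.sum()                # ratio in (115)
```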
Additionally, the EM part of the EM-AMP algorithm requires $\mathbb{E}[Z \mid \bar{Y}]$. This is computed using Monte Carlo in a similar fashion to $\mathbb{E}[Z \mid Z^k, \bar{Y}]$:
$$\mathbb{E}[Z \mid \bar{Y} = y] = \frac{\int z \, p_Z(z) \, p_{\bar{Y} | Z}(y | z) \, dz}{\int p_Z(z) \, p_{\bar{Y} | Z}(y | z) \, dz} = \frac{\mathbb{E}_Z\big[ Z \, p_{\bar{Y} | Z}(y | Z) \big]}{\mathbb{E}_Z\big[ p_{\bar{Y} | Z}(y | Z) \big]}, \tag{118}$$
where $p_{\bar{Y} | Z}(y | z)$ is given in (116). We can approximate the numerator and denominator of (118) by sampling z's from $Z \sim N(0, \Sigma^k_{(11)})$ and then taking the averages of the functions inside the expectations. In the mth iteration of EM-AMP, the expectations in (115) and (118) are computed using the current intercept estimates $(b^m_1, b^m_2)$ in the formula for $p_{\bar{Y} | Z}$ in (116).

Appendix C. Implementation Details for MOE

The changes required here are similar to those for MAR, stated in Appendix B. In the seven-step implementation described in Appendix A.1, the only change is in the computation of $g_k$ in steps 2 and 3. Recall that
$$g_k(Z^k, \bar{Y}) = \mathrm{Cov}[Z \mid Z^k]^{-1} \big( \mathbb{E}[Z \mid Z^k, \bar{Y}] - \mathbb{E}[Z \mid Z^k] \big). \tag{119}$$
Note that $\mathrm{Cov}[Z \mid Z^k]$ and $\mathbb{E}[Z \mid Z^k]$ are the same as for mixed linear regression (with the formulas given in (86)), so we only need to evaluate $\mathbb{E}[Z \mid Z^k, \bar{Y}] \in \mathbb{R}^4$. This is approximated using the same Monte Carlo approach as for MAR, by writing
$$\mathbb{E}[Z \mid Z^k = u, \bar{Y} = y] = \frac{\mathbb{E}_{Z | Z^k = u}\big[ Z \, p_{Z^k}(u) \, p_{\bar{Y} | Z, Z^k}(y | Z, u) \big]}{\mathbb{E}_{Z | Z^k = u}\big[ p_{Z^k}(u) \, p_{\bar{Y} | Z, Z^k}(y | Z, u) \big]}, \tag{120}$$
where, given $Z^k = u$ and $\bar{Y} = y$, the probability density functions inside the expectations can be easily evaluated since $Z^k \sim N(0, \Sigma^k_{(22)})$, and for $\sigma > 0$ and $z = (z_1, z_2, z_3, z_4)$,
$$p_{\bar{Y} | Z, Z^k}(y | z, u) = p_{\bar{Y} | Z}(y | z) = \int_0^1 p_{\bar{Y}, \bar\psi | Z}(y, v | z) \, dv \overset{(a)}{=} \int_0^1 p_{\bar\psi}(v) \, p_{\bar{Y} | \bar\psi, Z}(y | v, z) \, dv \overset{(b)}{=} \int_0^{\frac{\exp(z_3)}{\exp(z_3) + \exp(z_4)}} p_{\bar\psi}(v) \, \frac{1}{\sigma} \phi_N\Big( \frac{y - z_1}{\sigma} \Big) dv + \int_{\frac{\exp(z_3)}{\exp(z_3) + \exp(z_4)}}^1 p_{\bar\psi}(v) \, \frac{1}{\sigma} \phi_N\Big( \frac{y - z_2}{\sigma} \Big) dv \overset{(c)}{=} \frac{\exp(z_3)}{\exp(z_3) + \exp(z_4)} \, \frac{1}{\sigma} \phi_N\Big( \frac{y - z_1}{\sigma} \Big) + \bigg( 1 - \frac{\exp(z_3)}{\exp(z_3) + \exp(z_4)} \bigg) \frac{1}{\sigma} \phi_N\Big( \frac{y - z_2}{\sigma} \Big),$$
where $\phi_N$ is the standard Gaussian density. Here (a) uses the independence between $\bar\psi$ and Z (see the assumption below (13)), (b) uses the fact that $\bar{Y} = Z_1 + \bar\epsilon$ when $\bar\psi \le \frac{\exp(Z_3)}{\exp(Z_3) + \exp(Z_4)}$ and $\bar{Y} = Z_2 + \bar\epsilon$ when $\bar\psi > \frac{\exp(Z_3)}{\exp(Z_3) + \exp(Z_4)}$, and (c) uses $p_{\bar\psi}(v) = 1$ for all $v \in [0, 1]$ since $\bar\psi \sim$ Uniform[0, 1]. With the above density functions, we can approximate the numerator and denominator of (120) by sampling z's from
$$(Z \mid Z^k = u) \sim N\big( \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} u, \;\; \Sigma^k_{(11)} - \Sigma^k_{(12)} (\Sigma^k_{(22)})^{-1} \Sigma^k_{(21)} \big), \tag{121}$$
and then taking the averages of the functions inside the expectations.
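The MOE computation parallels the MAR sketch in Appendix B, with the likelihood replaced by the two-component mixture derived above. An illustrative sketch under the same assumptions (names ours):

```python
import numpy as np
from scipy.stats import norm

def posterior_mean_Z_moe(u, y, Sigma_k, sigma, n_mc=10_000, rng=None):
    """Monte Carlo approximation of E[Z | Z^k = u, Ybar = y] for MOE, eqs. (120)-(121)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    S11, S12 = Sigma_k[:4, :4], Sigma_k[:4, 4:]
    S21, S22 = Sigma_k[4:, :4], Sigma_k[4:, 4:]
    mean = S12 @ np.linalg.solve(S22, u)
    cov = S11 - S12 @ np.linalg.solve(S22, S21)
    z = rng.multivariate_normal(mean, cov, size=n_mc)            # samples from (121)
    gate = np.exp(z[:, 2]) / (np.exp(z[:, 2]) + np.exp(z[:, 3]))
    w = gate * norm.pdf(y, z[:, 0], sigma) + (1 - gate) * norm.pdf(y, z[:, 1], sigma)
    return (w[:, None] * z).sum(axis=0) / w.sum()                # ratio in (120)
```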
References

Gabriel Arpino and Ramji Venkataramanan. Statistical-computational tradeoffs in mixed sparse linear regression, 2023. https://arxiv.org/abs/2303.02118.

Benjamin Aubin, Antoine Maillard, Florent Krzakala, Nicolas Macris, Lenka Zdeborová, et al. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Advances in Neural Information Processing Systems, 2018.

Sivaraman Balakrishnan, Martin J. Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.

Gábor Balázs. Convex Regression: Theory, Practice, and Applications. PhD thesis, University of Alberta, 2016.

Gábor Balázs, András György, and Csaba Szepesvári. Near-optimal max-affine estimators for convex regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 56–64, 2015.

Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.

Adarsh Barik and Jean Honorio. Sparse mixed linear regression with guarantees: Taming an intractable problem with invex relaxation. International Conference on Machine Learning, 162:1627–1646, 2022.

Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57:764–785, 2011a.

Mohsen Bayati and Andrea Montanari. The LASSO risk for Gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017, 2011b.

Emmanuel J. Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.

Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. International Conference on Machine Learning, pages 1040–1048, 2013.

Kabir Aladin Chandrasekher, Ashwin Pananjady, and Christos Thrampoulidis. Sharp global convergence guarantees for iterative nonconvex optimization with random data. The Annals of Statistics, 51(1):179–210, 2023.

Sitan Chen, Jerry Li, and Zhao Song. Learning mixtures of linear regressions in subexponential time via Fourier moments. In ACM Symposium on Theory of Computing (STOC), 2020.

Yudong Chen, Xinyang Yi, and Constantine Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. Conference on Learning Theory, pages 560–604, 2014.

Giovanni Compiani and Yuichi Kitamura. Using mixtures in econometric models: a brief review and some new results. The Econometrics Journal, 19(3):C95–C127, 2016.

Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse PCA. In IEEE International Symposium on Information Theory (ISIT), pages 2197–2201, 2014.

Emilie Devijver, Yannig Goude, and Jean-Michel Poggi. Clustering electricity consumers using high-dimensional regression mixture models. Applied Stochastic Models in Business and Industry, pages 159–177, 2020.

David L. Donoho, Arian Maleki, and Andrea Montanari. Message passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106:18914–18919, 2009.

David L. Donoho, Adel Javanmard, and Andrea Montanari. Information-theoretically optimal compressed sensing via spatial coupling and approximate message passing. IEEE Transactions on Information Theory, 59(11):7434–7464, 2013.

Jianqing Fan, Han Liu, Zhaoran Wang, and Zhuoran Yang. Curse of heterogeneity: Computational barriers in sparse mixture models and phase retrieval, 2018. https://arxiv.org/abs/1808.06996.

Zhou Fan. Approximate message passing algorithms for rotationally invariant matrices. Annals of Statistics, 50(1):197–224, 2022.

Susana Faria and Gilda Soromenho. Fitting mixtures of linear regressions. Journal of Statistical Computation and Simulation, 80(2):201–225, 2010.

Oliver Y. Feng, Ramji Venkataramanan, Cynthia Rush, and Richard J. Samworth. A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 2022.

Alyson K. Fletcher and Sundeep Rangan. Iterative reconstruction of rank-one matrices in noise. Information and Inference: A Journal of the IMA, 7(3):531–562, 2018.

Fajwel Fogel, Irène Waldspurger, and Alexandre d'Aspremont. Phase retrieval for imaging problems. Mathematical Programming Computation, 8:311–335, 2016.

Cédric Gerbelot and Raphaël Berthier. Graph-based approximate message passing iterations, 2021. https://arxiv.org/abs/2109.11905.

Avishek Ghosh and Ramchandran Kannan. Alternating minimization converges super-linearly for mixed linear regression. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1093–1103, 2020.

Avishek Ghosh, Ashwin Pananjady, Adityanand Guntuboyina, and Kannan Ramchandran. Max-affine regression: Parameter estimation for Gaussian designs. IEEE Transactions on Information Theory, 68(3):1851–1885, 2022.

Jens Gregor and Fernando R. Rannou. Three-dimensional support function estimation and application for projection magnetic resonance imaging. International Journal of Imaging Systems and Technology, 12(1):43–50, 2002.

Sam Gross, Marc'Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6865–6873, 2017.

Bettina Grün and Friedrich Leisch. Applications of finite mixtures of regression models, 2007. https://tinyurl.com/3sfyrwbs.

Adityanand Guntuboyina and Bodhisattva Sen. Covering numbers for convex functions. IEEE Transactions on Information Theory, 59(4):1957–1965, 2013.

Lauren A. Hannah and David B. Dunson. Multivariate convex regression with adaptive partitioning. Journal of Machine Learning Research, 14(3):3261–3294, 2013.

Nhat Ho, Chiao-Yu Yang, and Michael I. Jordan. Convergence rates for Gaussian mixtures of experts. The Journal of Machine Learning Research, 23(1):14523–14603, 2022.

Pao-Lu Hsu and Herbert Robbins. Complete convergence and the law of large numbers. Proceedings of the National Academy of Sciences, 33(2):25–31, 1947.

Mian Huang and Weixin Yao. Mixture of regression models with varying mixing proportions: a semiparametric approach. Journal of the American Statistical Association, 107(498):711–724, 2012.

Mian Huang, Runze Li, and Shaoli Wang. Nonparametric mixture of regression models. Journal of the American Statistical Association, 108(503):929–941, 2013.

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference, 2(2):115–144, 2013.

Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

Yoshiyuki Kabashima. A CDMA multiuser detection algorithm on the basis of belief propagation. Journal of Physics A: Mathematical and General, 36(43):11111–11121, 2003.

Yoshiyuki Kabashima, Florent Krzakala, Marc Mézard, Ayaka Sakata, and Lenka Zdeborová. Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Transactions on Information Theory, 62(7):4228–4265, 2016.

Abbas Khalili and Jiahua Chen. Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102(479):1025–1038, 2007.

Jason M. Klusowski, Dana Yang, and W. D. Brinda. Estimating the coefficients of a mixture of two linear regressions by expectation maximization. IEEE Transactions on Information Theory, 65:3515–3524, 2019.

Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, and Sewoong Oh. Meta-learning for mixed linear regression. International Conference on Machine Learning, 119:5394–5404, 2020.

Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Sample complexity of learning mixture of sparse linear regressions. In Advances in Neural Information Processing Systems, volume 32, 2019.

Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, and Lenka Zdeborová. Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment, 2012(8), 2012.

Pascal Lavergne. A Cauchy-Schwarz inequality for expectation of matrices. Discussion Papers, Department of Economics, Simon Fraser University, 2008.

Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications. Journal of Statistical Mechanics: Theory and Experiment, 2017(7):073403, 2017.

Gen Li and Yuting Wei. A non-asymptotic framework for approximate message passing in spiked models, 2023. https://arxiv.org/abs/2208.03313.

Qianyun Li, Runmin Shi, and Faming Liang. Drug sensitivity prediction with high-dimensional mixture regression. PLoS ONE, pages 1–18, 2019.

Yuanzhi Li and Yingyu Liang. Learning mixtures of linear regressions with nearly optimal complexity. Conference on Learning Theory, pages 1125–1144, 2018.

Eunji Lim and Peter W. Glynn. Consistency of multidimensional convex regression. Operations Research, 60(1):196–208, 2012.

Junjie Ma, Ji Xu, and Arian Maleki. Approximate message passing for amplitude based optimization. In International Conference on Machine Learning (ICML), pages 3371–3380, 2018.

Junjie Ma, Ji Xu, and Arian Maleki. Optimization-based AMP for phase retrieval: The impact of initialization and ℓ2 regularization. IEEE Transactions on Information Theory, 65(6):3600–3629, 2019.

Alessandro Magnani and Stephen P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.

Antoine Maillard, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Phase retrieval in high dimensions: Statistical and computational phase transitions. In Neural Information Processing Systems (NeurIPS), 2020.

Ashok Makkuva, Pramod Viswanath, Sreeram Kannan, and Sewoong Oh. Breaking the gridlock in mixture-of-experts: Consistent and efficient algorithms. International Conference on Machine Learning (ICML), pages 4304–4313, 2019.

Ashok Makkuva, Sewoong Oh, Sreeram Kannan, and Pramod Viswanath. Learning in gated neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3338–3348, 2020.

Rahul Mazumder, Arkopal Choudhury, Garud Iyengar, and Bodhisattva Sen. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, 114(525):318–331, 2019.

Geoffrey J. McLachlan and David Peel. Finite mixture models. John Wiley & Sons, 2004.

Marco Mondelli and Ramji Venkataramanan. Approximate message passing with spectral initialization for generalized linear models. International Conference on Artificial Intelligence and Statistics, pages 397–405, 2021.

Andrea Montanari. Graphical models concepts in compressed sensing. In Compressed Sensing: Theory and Applications, pages 394–438. Oxford University Press, 2012.

Andrea Montanari and Ramji Venkataramanan. Estimation of low-rank matrices via approximate message passing. Annals of Statistics, 45(1):321–345, 2021.

Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, pages 355–368, 1998.

Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems, pages 2796–2804, 2013.

Soumyabrata Pal, Arya Mazumdar, and Venkata Gandikota. Support recovery of sparse signals from a mixture of linear measurements. In Advances in Neural Information Processing Systems, 2021.

Soumyabrata Pal, Arya Mazumdar, Rajat Sen, and Avishek Ghosh. On learning mixture of linear regressions in the non-realizable setting. In International Conference on Machine Learning, pages 17202–17220, 2022.

Parthe Pandit, Mojtaba Sahraee-Ardakan, Sundeep Rangan, Philip Schniter, and Alyson K. Fletcher. Matrix inference and estimation in multi-layer models. In Advances in Neural Information Processing Systems, volume 33, 2020.

Jerry Ladd Prince and Alan S. Willsky. Reconstructing convex sets from support line measurements. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(4):377–389, 1990.

Sundeep Rangan. Generalized approximate message passing for estimation with random linear mixing. IEEE International Symposium on Information Theory, 2011.

Philip Schniter and Sundeep Rangan. Compressive phase retrieval via generalized approximate message passing. IEEE Transactions on Signal Processing, 63(4):1043–1055, 2014.

Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning mixtures of generalized linear models. International Conference on Artificial Intelligence and Statistics, 51:1223–1231, 2016.

Emilio Seijo and Bodhisattva Sen. Nonparametric least squares estimation of a multivariate convex regression function. Annals of Statistics, 39(3):1633–1657, 2011.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR), 2017.

Yanyao Shen and Sujay Sanghavi. Iterative least trimmed squares for mixed linear regression. Advances in Neural Information Processing Systems, 32, 2019.

Yong Sheng Soh and Venkat Chandrasekaran. Fitting tractable convex sets to support function evaluations. Discrete & Computational Geometry, 66(2):510–551, 2021.

Nicolas Städler, Peter Bühlmann, and Sara van de Geer. ℓ1-penalization for mixture regression models. TEST, 19(2):209–256, 2010.

Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019.

Nelvin Tan and Ramji Venkataramanan. Mixed linear regression via approximate message passing. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.

Kert Viele and Barbara Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12:315–330, 2002.

Jeremy P. Vila and Philip Schniter. Expectation-maximization Gaussian-mixture approximate message passing. IEEE Transactions on Signal Processing, 61(19):4658–4672, 2013.

Tianhao Wang, Xinyi Zhong, and Zhou Fan. Universality of approximate message passing algorithms and tensor networks, 2022. https://arxiv.org/abs/2206.13037.

Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional EM algorithm: Statistical optimization and asymptotic normality. Advances in Neural Information Processing Systems, pages 2521–2529, 2015.

C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103, 1983.

Xinyang Yi and Constantine Caramanis. Regularized EM algorithms: A unified framework and statistical guarantees. Advances in Neural Information Processing Systems, pages 1567–1575, 2015.

Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. International Conference on Machine Learning, pages 613–621, 2014.

Seniha Esen Yuksel, Joseph N. Wilson, and Paul D. Gader. Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1177–1193, 2012.

Linjun Zhang, Rong Ma, T. Tony Cai, and Hongzhe Li. Estimation, confidence intervals, and large-scale hypotheses testing for high-dimensional mixed linear regression, 2020. https://arxiv.org/abs/2011.03598.

Yihan Zhang, Marco Mondelli, and Ramji Venkataramanan. Precise asymptotics for spectral methods in mixed generalized linear models, 2022. https://arxiv.org/abs/2211.11368.

Kai Zhong, Prateek Jain, and Inderjit S. Dhillon. Mixed linear regression with multiple components. Advances in Neural Information Processing Systems, pages 2190–2198, 2016.

Xinyi Zhong, Chang Su, and Zhou Fan. Approximate message passing for orthogonally invariant ensembles: Multivariate non-linearities and spectral initialization, 2021. https://arxiv.org/abs/2110.02318.

Rongda Zhu, Lingxiao Wang, Chengxiang Zhai, and Quanquan Gu. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In International Conference on Machine Learning (ICML), pages 4180–4188, 2017.

Pini Zilber and Boaz Nadler. Imbalanced mixed linear regression, 2023. https://arxiv.org/abs/2301.12559.

Álvaro González. Measurement of areas on a sphere using Fibonacci and latitude-longitude lattices. Mathematical Geosciences, 42(1):49–64, 2010.