# Heterogeneous Sufficient Dimension Reduction and Subspace Clustering

Lei Yan, Xin Zhang, Qing Mai

Department of Statistics, Florida State University, Tallahassee, Florida, United States. Correspondence to: Xin Zhang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Scientific and engineering applications are often heterogeneous, making it beneficial to account for latent clusters or sub-populations when learning low-dimensional subspaces in supervised learning, and vice versa. In this paper, we combine the concept of subspace clustering with model-based sufficient dimension reduction and thus generalize the sufficient dimension reduction framework from the homogeneous regression setting to heterogeneous data applications. In particular, we propose the mixture of principal fitted components (mixPFC) model, a novel framework that simultaneously achieves clustering, subspace estimation, and variable selection, providing a unified solution for high-dimensional heterogeneous data analysis. We develop a group Lasso penalized expectation-maximization (EM) algorithm and obtain its non-asymptotic convergence rate. Through extensive simulation studies, mixPFC demonstrates superior performance compared to existing methods across various settings. Applications to real-world datasets further highlight its effectiveness and practical advantages.

1. Introduction

Reducing high-dimensional data to a low-dimensional representation is one of the most important steps in multivariate statistics and various applied sciences. Typically, this is achieved by projecting data onto a single low-dimensional subspace. Among unsupervised dimension reduction methods, principal component analysis (PCA) is probably the most popular one. However, when applied in the context of regressing a univariate response $Y$ on a $p$-dimensional predictor $X$, PCA faces three critical limitations: data heterogeneity, loss of information on regression, and the curse of dimensionality. First, real-world data often lie in a union of multiple subspaces with unknown membership, reflecting underlying latent sub-populations. Second, PCA, as an unsupervised method, equates variation with information and thus disregards the specific relationship between $Y$ and $X$. Third, in the high-dimensional setting where $p$ is much larger than the sample size $n$, the subspace estimated by PCA can be highly unreliable, or even orthogonal to the true subspace (Baik & Silverstein, 2006; Paul, 2007).

Addressing the challenges of heterogeneity and high dimensionality is crucial for statistical and machine learning methods. Subspace clustering is a family of powerful methods for clustering the data into multiple subspaces (Vidal, 2011; Soltanolkotabi & Candès, 2012). Most subspace clustering methods generalize PCA and factor analysis from a single subspace to a union of multiple subspaces (Agarwal & Mustafa, 2004; Vidal et al., 2005; Yan & Pollefeys, 2006; Tron & Vidal, 2007; Favaro et al., 2011; Elhamifar & Vidal, 2013). However, like PCA, these methods ignore the response, leading to inevitable loss of regression-relevant information. Furthermore, subspace clustering methods often assume clean observations with small random noise (Kanatani, 2001; Vidal et al., 2005), meaning data points are expected to lie nearly exactly within a subspace.
When the observations are subject to significant random errors, these methods frequently fail to accurately identify clusters.

Conversely, sufficient dimension reduction (SDR) provides a supervised framework by projecting $X$ onto a low-dimensional subspace while preserving all relevant regression information. Formally, we have
$$Y \perp\!\!\!\perp X \mid P_{\mathcal{S}} X, \quad (1)$$
where $P_{\mathcal{S}}$ is the projection matrix onto $\mathcal{S}$. The intersection of all the subspaces satisfying (1) is called the central subspace. Recent advances in deep learning-based SDR methods (Banijamali et al., 2018; Liang et al., 2022; Kapla et al., 2022; Huang et al., 2024; Chen et al., 2024) have demonstrated the potential to capture complex nonlinear structures in high-dimensional data. While SDR guarantees regression sufficiency (Li, 1991; Cook & Forzani, 2008), it overlooks the heterogeneous nature of scientific and engineering applications. Incorporating heterogeneity into SDR has the potential to enhance dimension reduction by uncovering latent sub-populations. Motivated by this, we integrate the subspace clustering concept with model-based SDR.

Contributions. In this paper, we propose a novel mixture of principal fitted components (mixPFC) model, designed for simultaneous clustering, variable selection, and dimension reduction. It makes the following major contributions.

Supervised Subspace Clustering: By extending subspace clustering into a supervised framework, mixPFC identifies subspaces that preserve regression information, addressing both heterogeneity and predictive accuracy. This contrasts sharply with unsupervised approaches (Elhamifar & Vidal, 2013; Ji et al., 2017; Cai et al., 2022). Leveraging response information, mixPFC allows for exact overlap between subspaces, overcoming limitations of classical methods that require separation conditions (e.g., the minimal angle condition in Soltanolkotabi & Candès (2012)). Building upon SDR, mixPFC targets central subspaces rather than subspaces where $X$ resides. This ensures the low-dimensional representation of $X$ preserves all the information relevant to regression. Moreover, unlike most subspace clustering methods that separate clustering from subspace estimation, mixPFC performs both tasks jointly in a unified framework.

High-Dimensional Estimation: We develop a group penalized expectation-maximization (EM) algorithm for the mixPFC model. Extending SDR to high dimensions is a challenging and nascent research area (Lin et al., 2018; 2019; Tan et al., 2018; 2020; Zeng et al., 2024). Existing approaches often require inverting $p \times p$ matrices or estimating parameters in a $p^2$-dimensional space, which poses scalability challenges. Our model, designed to estimate multiple heterogeneous subspaces in different unknown sub-populations, is much more complicated. To address this challenge, we formulate the subspace estimation as a convex optimization over an approximately $p$-dimensional parameter space. We further incorporate a group Lasso penalty (Yuan & Lin, 2006) for coordinate-independent variable selection (Chen et al., 2010).

Theoretical Guarantees: We establish theoretical results for the proposed group penalized EM algorithm. While classical EM theories only guaranteed asymptotic convergence to a fixed point, we derive a non-asymptotic result that mixPFC converges geometrically to a fixed point that is within statistical precision of the unknown true parameter.
This stronger type of guarantee has emerged only recently (Balakrishnan et al., 2017). Unlike many existing proofs in high-dimensional EM algorithms, our analysis does not require sample splitting (Kwon et al., 2019) and allows a relatively general model. Specifically, we derive a non-asymptotic convergence rate for a two-mixture principal fitted components model with unknown mixing proportions and without restrictions on the minimum angle between subspaces.

2. Mixture of Principal Fitted Components

We extend the framework of sufficient dimension reduction by introducing a latent variable to model heterogeneity in data. In particular, we consider a univariate (continuous or discrete) response $Y \in \mathbb{R}$, a multivariate predictor $X \in \mathbb{R}^p$, and a latent categorical variable $W \in \{1, 2, \ldots, K\}$. We aim to estimate $K$ subspaces $\mathcal{S}_w$, $w = 1, \ldots, K$, such that
$$Y \perp\!\!\!\perp X \mid (P_{\mathcal{S}_w} X, W = w), \qquad \Pr(W \mid Y, X) = \Pr(W \mid Y, P_{\mathcal{S}} X), \quad (2)$$
where $\mathcal{S} = \sum_{w=1}^{K} \mathcal{S}_w \subseteq \mathbb{R}^p$ offers the usual SDR as seen in the literature and each $P_{\mathcal{S}_w}$ is the projection matrix onto $\mathcal{S}_w$ to capture the relationship between $Y$ and $X$ within cluster $w$. Let $\beta \in \mathbb{R}^{p \times d}$ denote a basis matrix of $\mathcal{S}$. Then, the projected data $\beta^T X$ contains all relevant information in $X$ to be combined with response information $Y$ for clustering data into $K$ clusters. When $W$ is observable, the smallest space $\mathcal{S}_w$ is known as the conditional central subspace, and is a building block for studying the partial central subspace (Chiaromonte et al., 2002). However, our problem is much more challenging because $W$ is latent and has to be inferred from data. Moreover, unlike existing partial/conditional central subspace methods, we further incorporate variable selection for high-dimensional studies.

We propose the mixPFC model, as a generative mixture of principal fitted components (PFC),
$$X \mid (Y, W = w) \sim N(\mu_w + \Gamma_w f(Y), \Delta), \qquad \Pr(W = w) = \pi_w, \quad w = 1, \ldots, K, \quad (3)$$
where $\mu_w \in \mathbb{R}^p$ is the center of each cluster, $\Gamma_w \in \mathbb{R}^{p \times q}$ is the coefficient matrix that represents the relationship between $Y$ and $X$ in each cluster, $f(\cdot) = (f_1(\cdot), \ldots, f_q(\cdot))^T : \mathbb{R} \mapsto \mathbb{R}^q$ is a set of pre-specified fitting functions that introduces non-linear relationships, $\Delta \in \mathbb{R}^{p \times p}$ is a symmetric positive definite matrix, and the $\pi_w > 0$ are the mixture probabilities with $\sum_{w=1}^{K} \pi_w = 1$.

The mixPFC model unifies and generalizes many model-based clustering and model-based SDR approaches. When $K = 1$, mixPFC reduces to the PFC model (Cook & Forzani, 2008), which further reduces to the probabilistic principal component analysis (PCA) model (Tipping & Bishop, 1999) by restricting $\Delta = \sigma^2 I_p$ and replacing $f(Y)$ with latent variables $\nu \sim N(0, I_q)$. Therefore, our mixPFC, by allowing general covariance structures and incorporating response information, is a supervised generalization of probabilistic PCA. On the other hand, if we completely remove the effect of the response $\Gamma_w f(Y)$, mixPFC becomes the Gaussian mixture model (GMM) (McLachlan et al., 2019; Cai et al., 2019). If we further have $\Delta = \sigma^2 I_p$, mixPFC is a model-based interpretation of K-means clustering (Forgy, 1965; MacQueen et al., 1967). When $Y$ is categorical, the mixPFC model reduces to the mixture discriminant analysis model (Hastie & Tibshirani, 1996; Fraley & Raftery, 2002).

Figure 1. The mixPFC model enhances subspace clustering by considering how the response (solid line) changes as the predictor varies within each cluster: linear response effects in the first mixture and non-linear in the second. (Panels: Subspace Clustering; Mixture of PFC.)
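To make the generative model (3) concrete, the following minimal sketch (Python/NumPy; not the authors' code, and all dimensions and parameter values are illustrative assumptions) simulates one dataset from a two-cluster mixPFC model with the fitting function $f(Y) = (Y, |Y|)^T$ used later in the simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for this sketch, not the paper's settings).
n, p, q, K = 400, 20, 2, 2

def f(y):
    # Pre-specified fitting functions: f(Y) = (Y, |Y|)^T, so q = 2.
    return np.column_stack([y, np.abs(y)])

pi = np.array([0.5, 0.5])               # mixture probabilities pi_w
mu = np.zeros((K, p))                   # cluster centers mu_w (zero here)
Gamma = rng.normal(size=(K, p, q))      # coefficient matrices Gamma_w
Delta = np.eye(p)                       # error covariance (isotropic for simplicity)

Y = rng.uniform(-1, 1, size=n)          # response
W = rng.choice(K, size=n, p=pi)         # latent cluster labels
F = f(Y)                                # fitted features, shape (n, q)

# X | (Y, W = w) ~ N(mu_w + Gamma_w f(Y), Delta)
mean = mu[W] + np.einsum('ipq,iq->ip', Gamma[W], F)
X = mean + rng.multivariate_normal(np.zeros(p), Delta, size=n)
```

In a sparse, high-dimensional version of this setup, only the first few rows of each $\Gamma_w$ would be nonzero, which is exactly the structure the group Lasso penalty of Section 3 targets.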
The mixPFC model can also be viewed as an extension of subspace clustering. Figure 1 demonstrates the advantage of mixPFC over subspace clustering. In subspace clustering, we have a fully unsupervised problem on $X$, but in mixPFC we further have the response $Y$ to guide our clustering, which can be very informative. Consequently, subspace clustering requires a minimal angle condition for identifiability (Soltanolkotabi & Candès, 2012), but in mixPFC we allow non-overlap, partial overlap, and complete overlap between $\mathcal{S}_j$ and $\mathcal{S}_k$ for any $(j, k)$. With the response $Y$, even when $\mathcal{S}_j = \mathcal{S}_k$, we can still have $\Gamma_j \neq \Gamma_k$ to ensure cluster identifiability, although $\mathrm{span}(\Gamma_j) = \mathrm{span}(\Gamma_k)$.

A more concrete demonstration is given in Figure 2, where we have conducted two toy example simulations based on the proposed model (3). The first toy example in Figure 2 (a) has $\mathcal{S}_1 = \mathcal{S}_2$, but the response $Y$ has different relationships with $X$ in the two clusters. Subspace clustering would completely fail to identify the subspace or to cluster the data, while mixPFC works well and produces near-perfect subspace estimation and clustering results. In Figure 2 (b), we have another simulation where $\mathcal{S}_1 \perp \mathcal{S}_2$, which is an ideal setup for subspace clustering methods. We applied our proposed method and two popular subspace clustering methods: random sample consensus (RANSAC, Tron & Vidal (2007)) and sparse subspace clustering (SSC, Elhamifar & Vidal (2013)). The clustering errors of the subspace clustering methods, 8.4% (SSC) and 13.0% (RANSAC), are reduced by mixPFC to 3% thanks to the additional response supervision.

Finally, the following proposition identifies the key parameter for fitting mixPFC.

Proposition 2.1. Under model (3), the smallest subspaces satisfying (2) are $\mathcal{S}_w = \mathrm{span}(\Delta^{-1} \Gamma_w)$, $w = 1, \ldots, K$. Consequently, $d_w = \dim(\mathcal{S}_w) = \mathrm{rank}(\Gamma_w)$ and $d = \dim(\sum_{w=1}^{K} \mathcal{S}_w) \le \sum_{w=1}^{K} d_w$.

The rank of $\Gamma_w \in \mathbb{R}^{p \times q}$, $d_w$, could be smaller than $q$, the number of functions in $f$. Our model-based SDR approach for handling heterogeneity in data is now rigorously connected to the central subspace notion in (2). Based on the maximum likelihood estimation (MLE) for PFC model parameters (Cook & Forzani, 2008), we derive the MLE for $\mathcal{S}_w$ and, more importantly, a penalized EM algorithm for high-dimensional data.

3. Group-Penalized EM Algorithm

Let $\{(X_i, Y_i)\}_{i=1}^{n}$ be $n$ independent data points from mixPFC (3), and let $\theta = (\Delta, \pi_w, \mu_w, \mathcal{S}_w, w = 1, \ldots, K)$ be the set of unknown model parameters. In low dimensions, all the parameters can be estimated by the EM algorithm. The EM algorithm aims to maximize the log-likelihood of $X \mid Y$ over $\theta$ by iteratively alternating between an Expectation step (E-step) and a Maximization step (M-step). The conditional log-likelihood of $X \mid Y$ is
$$\sum_{i=1}^{n} \log \Big\{ \sum_{w=1}^{K} \pi_w \, N(X_i \mid \mu_w + \Gamma_w f_i, \Delta) \Big\},$$
where $f_i = f(Y_i)$ and $N(\cdot \mid \mu, \Delta)$ is the probability density function of a multivariate normal distribution with mean $\mu$ and covariance $\Delta$. In the E-step, we compute the expectation of the log-likelihood function of $\theta$ with respect to the conditional distribution of $W$ given $\{(X_i, Y_i)\}_{i=1}^{n}$:
$$Q(\theta \mid \widehat{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{w=1}^{K} \gamma_{iw}(\widehat{\theta}^{(t)}) \big[ \log(\pi_w) + \log\big( N(X_i \mid \mu_w + \Gamma_w f_i, \Delta) \big) \big], \quad (4)$$
where $\gamma_{iw}(\widehat{\theta}^{(t)}) = \Pr(W_i = w \mid \widehat{\theta}^{(t)}, X_i, Y_i)$. Assuming the cluster means $\mu_w$ are equal, the estimated probability $\gamma_{iw}(\widehat{\theta}^{(t)})$ is given by
$$\gamma_{iw}(\widehat{\theta}^{(t)})^{-1} = \sum_{j \neq w} \frac{\widehat{\pi}_j^{(t)}}{\widehat{\pi}_w^{(t)}} \exp\Big\{ \big( X_i - \tfrac{1}{2} \big[ (\widehat{\Gamma}_j^{(t)} + \widehat{\Gamma}_w^{(t)}) f_i \big] \big)^T (\widehat{\Delta}^{(t)})^{-1} (\widehat{\Gamma}_j^{(t)} - \widehat{\Gamma}_w^{(t)}) f_i \Big\} + 1.$$
Then, in the M-step, we update $\widehat{\theta}^{(t+1)} = \mathrm{argmax}_{\theta} \, Q(\theta \mid \widehat{\theta}^{(t)})$ by maximizing (4).
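The E-step above is just Bayes' rule applied to the Gaussian densities in (3); the ratio form follows from dividing through by the $w$-th component's density. Below is a minimal sketch of this computation (Python/NumPy, assuming equal cluster means $\mu_w = 0$ as in the display above; the log-space stabilization is an implementation choice of this sketch, not from the paper).

```python
import numpy as np

def e_step(X, F, pi, Gamma, Delta):
    """Posterior responsibilities gamma_{iw} = Pr(W_i = w | X_i, Y_i, theta)
    under model (3) with mu_w = 0, computed in log-space for stability.
    Shapes: X (n, p), F (n, q), pi (K,), Gamma (K, p, q), Delta (p, p)."""
    n, p = X.shape
    K = len(pi)
    _, logdet = np.linalg.slogdet(Delta)
    log_w = np.empty((n, K))
    for w in range(K):
        R = X - F @ Gamma[w].T                       # residuals X_i - Gamma_w f_i
        maha = np.einsum('ij,ij->i', R, np.linalg.solve(Delta, R.T).T)
        log_w[:, w] = np.log(pi[w]) - 0.5 * (maha + logdet + p * np.log(2 * np.pi))
    log_w -= log_w.max(axis=1, keepdims=True)        # log-sum-exp trick
    gamma = np.exp(log_w)
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Taking the reciprocal of `gamma[:, w]` recovers the displayed ratio formula, so the two forms agree numerically.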
Figure 2. Two toy simulation examples, with $p = 6$ and two mixtures of size $n_1 = n_2 = 200$. The proposed mixPFC method works well in both extreme setups: (a) Toy Example 1, where the subspaces coincide, i.e., $\mathcal{S}_1 = \mathcal{S}_2$, but the mixtures are well-separated by response variability; in this case, subspace clustering completely fails. (b) Toy Example 2, where the subspaces are orthogonal to each other, i.e., $\mathcal{S}_1 \perp \mathcal{S}_2$. For both examples, we plot the response versus the estimated linear reductions of the predictors based on the mixPFC method (panels: predictor projected onto $\mathcal{S}_1$ ($= \mathcal{S}_2$) for (a); predictor projected onto $\mathcal{S}_1$ ($\perp \mathcal{S}_2$) and onto $\mathcal{S}_2$ ($\perp \mathcal{S}_1$) for (b)).

The above standard EM algorithm is infeasible for high-dimensional problems. The inverse of the $p \times p$ covariance matrix is the cornerstone of the EM algorithm, and in high dimensions it is impractical to use $\Delta^{-1}$ repeatedly in the EM updates. However, the probabilities $\gamma_{iw}(\widehat{\theta}^{(t)})$ depend on $X_i$ only through the linear functions $X_i^T (\widehat{\Delta}^{(t)})^{-1} (\widehat{\Gamma}_j^{(t)} - \widehat{\Gamma}_w^{(t)})$. By Proposition 2.1, $\mathrm{span}(\Delta^{-1}(\Gamma_j - \Gamma_w))$ is contained in the central subspace. Hence, there is no loss of information in first projecting $X_i$ onto the central subspace $\mathcal{S}$ and then calculating the probability based on reduced predictors, avoiding the $p \times p$ matrix inversion $\Delta^{-1}$. Specifically, given a basis matrix $\beta \in \mathbb{R}^{p \times d}$ of $\mathcal{S}$, we focus on the reduced predictors $\beta^T X = \beta^T \Gamma_w f(Y) + \beta^T \epsilon \in \mathbb{R}^d$. This is a mixture linear regression problem $Z = A_w f(Y) + \xi$, where $Z = \beta^T X$, $A_w = \beta^T \Gamma_w$, and $\xi \sim N(0, \Delta_\beta)$ with $\Delta_\beta = \beta^T \Delta \beta \in \mathbb{R}^{d \times d}$. Then the updating equation for the probabilities simplifies to
$$\gamma_{iw}(\widehat{\theta}^{(t)})^{-1} = \sum_{j \neq w} \frac{\widehat{\pi}_j^{(t)}}{\widehat{\pi}_w^{(t)}} \exp\Big\{ \big( Z_i^{(t)} - \tfrac{1}{2} (\widehat{A}_j^{(t)} + \widehat{A}_w^{(t)}) f_i \big)^T (\widehat{\Delta}_\beta^{(t)})^{-1} (\widehat{A}_j^{(t)} - \widehat{A}_w^{(t)}) f_i \Big\} + 1.$$

Given $\beta$, the closed-form updates for $\Delta_\beta$, $A_w$, and $\pi_w$ are straightforward to derive. The most challenging remaining part is how to obtain an accurate estimator of the central subspace. For high-dimensional predictors, we consider the following groupwise penalized estimation of the basis matrix $\beta_w$ of each subspace. First, we recognize that $\mathrm{span}(\beta_w) = \mathrm{span}(\Delta^{-1} \Gamma_w) = \mathrm{span}(\Sigma_w^{-1} U_w)$, where $\Sigma_w = \mathrm{cov}(X \mid W = w) \in \mathbb{R}^{p \times p}$ and $U_w = \mathrm{cov}(X, f(Y) \mid W = w) \in \mathbb{R}^{p \times q}$ are the covariance matrices. The iterative sample estimates in the EM updates are computed as
$$\widehat{\Sigma}_w^{(t)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iw}(\widehat{\theta}^{(t)}) (X_i - \widehat{\mu}_w^{(t)})(X_i - \widehat{\mu}_w^{(t)})^T, \qquad \widehat{U}_w^{(t)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iw}(\widehat{\theta}^{(t)}) (X_i - \widehat{\mu}_w^{(t)})(f_i - \bar{f})^T,$$
where $\widehat{\mu}_w^{(t)} = \big( \sum_i \gamma_{iw}(\widehat{\theta}^{(t)}) \big)^{-1} \sum_i \gamma_{iw}(\widehat{\theta}^{(t)}) X_i$ and $\bar{f} = \frac{1}{n} \sum_i f_i$. Then we solve the convex optimization problem
$$\widehat{B}_w^{(t)} = \mathop{\mathrm{argmin}}_{B_w \in \mathbb{R}^{p \times q}} \; \frac{1}{2} \mathrm{tr}(B_w^T \widehat{\Sigma}_w^{(t)} B_w) - \mathrm{tr}\{ (\widehat{U}_w^{(t)})^T B_w \} + \lambda \|B_w\|_{2,1}, \quad (6)$$
where $\lambda > 0$ is a tuning parameter and the $L_{2,1}$ penalty $\|B_w\|_{2,1} = \sum_{i=1}^{p} \big( \sum_{j=1}^{q} (B_w)_{ij}^2 \big)^{1/2}$ is coordinate-independent (Chen et al., 2010). The problem in (6) is convex. We develop a groupwise coordinate descent algorithm to solve it efficiently. Note that the matrices $B_w \in \mathbb{R}^{p \times q}$ are naturally rank deficient, with $q \ge d_w$. Therefore, at the convergence of the penalized EM Algorithm 1, we use the span of the top-$d_w$ left singular vectors of $\widehat{B}_w$ as the subspace estimate $\widehat{\mathcal{S}}_w$.

As shown in the original PFC paper (Cook & Forzani, 2008), subspace estimation remains consistent under misspecification of $f(Y)$, provided $f(Y)$ is sufficiently correlated with the true function. Our mixPFC model inherits this property, ensuring validity across a broad class of functions. In practice, polynomials or splines are standard choices.
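A minimal sketch of the groupwise (row-wise) coordinate descent for (6) follows, in Python/NumPy; this is one standard way to solve such a group Lasso problem, not necessarily the authors' implementation. Each row update is a closed-form group soft-thresholding step.

```python
import numpy as np

def group_lasso_cd(Sigma, U, lam, max_iter=200, tol=1e-6):
    """Solve min_B 0.5*tr(B^T Sigma B) - tr(U^T B) + lam*||B||_{2,1},
    where ||B||_{2,1} sums the Euclidean norms of the rows of B.
    Sigma: (p, p) positive semi-definite with positive diagonal; U: (p, q)."""
    p, q = U.shape
    B = np.zeros((p, q))
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(p):
            # Partial residual for row i: remove its own term from (Sigma @ B)[i].
            r = U[i] - Sigma[i] @ B + Sigma[i, i] * B[i]
            nr = np.linalg.norm(r)
            # Group soft-thresholding: the whole row is zeroed unless ||r|| > lam.
            new_row = np.zeros(q) if nr <= lam else (1.0 - lam / nr) * r / Sigma[i, i]
            max_change = max(max_change, np.linalg.norm(new_row - B[i]))
            B[i] = new_row
        if max_change < tol:
            break
    return B

# At convergence, the subspace estimate is the span of the top-d_w left
# singular vectors of B:  beta_hat = np.linalg.svd(B_hat)[0][:, :d_w]
```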
The initialization method, the selection of $K$, and mixPFC-ISO, an alternative algorithm tailored for isotropic covariance matrices, are detailed in Section B of the appendix. The code is available on GitHub at https://github.com/leiyan-ly/mixPFC.

Algorithm 1 Penalized EM algorithm for mixture PFC
Input: Data $\{(X_i, Y_i)\}_{i=1}^{n}$, fitting function $f(\cdot)$
Initialize $\widehat{\gamma}_{iw}(\theta^0)$ and center $f_i$
repeat
  E-step: $\gamma_{iw}(\widehat{\theta}^{(t)}) = \widehat{\pi}_w^{(t)} \Big/ \Big( \widehat{\pi}_w^{(t)} + \sum_{j \neq w} \widehat{\pi}_j^{(t)} \exp\big\{ \big( Z_i^{(t)} - \tfrac{1}{2} (\widehat{A}_j^{(t)} + \widehat{A}_w^{(t)}) f_i \big)^T (\widehat{\Delta}_\beta^{(t)})^{-1} (\widehat{A}_j^{(t)} - \widehat{A}_w^{(t)}) f_i \big\} \Big)$
  M-step: $\widehat{B}_w^{(t+1)} = \mathrm{argmin}_{B_w} \frac{1}{2} \mathrm{tr}(B_w^T \widehat{\Sigma}_w^{(t)} B_w) - \mathrm{tr}\{(\widehat{U}_w^{(t)})^T B_w\} + \lambda \|B_w\|_{2,1}$;
  $\widehat{\pi}_w^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iw}(\widehat{\theta}^{(t)})$; $\widehat{\mu}_w^{(t+1)} = \big( \sum_{i=1}^{n} \gamma_{iw}(\widehat{\theta}^{(t)}) \big)^{-1} \sum_{i=1}^{n} \gamma_{iw}(\widehat{\theta}^{(t)}) X_i$
until convergence
Output: $\widehat{\pi}_w$, $\widehat{\mathcal{S}}_w$

4. Theoretical Results

4.1. Preliminary

We begin this section with some notation. For numbers $a$ and $b$, $a \vee b$ and $a \wedge b$ mean $\max\{a, b\}$ and $\min\{a, b\}$. For an integer $n$, $[n]$ denotes the set $\{1, \ldots, n\}$. For a vector $x = (x_1, \ldots, x_p)^T$, $\|x\|_1 = \sum_{i=1}^{p} |x_i|$ and $\|x\|_2 = \sqrt{\sum_{i=1}^{p} x_i^2}$. For a matrix $A = (a_{ij})$, $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ represent the smallest and largest singular values of $A$, and $\mathcal{S}_A$ denotes the column space of $A$. The Frobenius norm and spectral norm of $A$ are defined as $\|A\|_F = \sqrt{\mathrm{tr}(A^T A)}$ and $\|A\|_2 = \sigma_{\max}(A)$. The Frobenius inner product between two matrices $A$ and $B$ is $\langle A, B \rangle_F = \mathrm{tr}(A^T B)$. For matrices $A$ and $B$ with the same column rank $d$, the distance between subspaces is $D(\mathcal{S}_A, \mathcal{S}_B) = \|P_A - P_B\|_F / \sqrt{2d}$. For a set $\mathcal{A} \subseteq \{1, \ldots, p\}$, $\mathcal{A}^c$ and $|\mathcal{A}|$ denote its complement and cardinality. For two sequences of positive numbers $\{a_n\}$ and $\{b_n\}$, $a_n \lesssim b_n$ or $a_n = O(b_n)$ means $a_n / b_n \le C < \infty$, and $a_n = o(b_n)$ means that $a_n / b_n \to 0$ as $n \to \infty$. Let $\mathbb{S}^{pq-1}$ be the unit sphere in $\mathbb{R}^{pq}$. For a positive integer $s < p/(3q)$, let $\mathcal{L}(s) = \{ u \in \mathbb{R}^{pq} : \|u_{\widetilde{S}_1^c}\|_1 \le \sqrt{3s} \|u_{\widetilde{S}_1}\|_2 + \sqrt{sq} \|u\|_2, \text{ for some } \widetilde{S}_1 \subseteq [pq] \text{ with } |\widetilde{S}_1| = 3sq \}$ and $\mathcal{L}_p(s) = \mathcal{L}(s)_{1:p} = \{ u_{1:p} : u \in \mathcal{L}(s) \}$. For a matrix $A \in \mathbb{R}^{p \times q}$, define $\|A\|_{F,s} = \sup_{u \in \mathbb{R}^{p \times q}, \, \mathrm{vec}(u) \in \mathcal{L}(s) \cap \mathbb{S}^{pq-1}} \langle A, u \rangle_F$.

We conduct the theoretical analysis under the assumption that $Y_i$ is fixed, $f(\cdot)$ is known, and $\mu_w = 0$. We focus on the case where $K = 2$, a common assumption in high-dimensional EM algorithm analysis (Cai et al., 2019; Wang et al., 2024). We further assume that $\Delta = \sigma^2 I_p$ with $\sigma$ known. Treating the covariance matrix as a known parameter is also standard in theoretical studies of simpler models such as mixture linear regression (Klusowski et al., 2019; Wang et al., 2024) and the Gaussian mixture model (Xu et al., 2016; Cai et al., 2019). Without loss of generality, we set $\sigma^2 = 1$. Under these assumptions, we re-define the parameter as $\theta = (\pi_1, \Gamma_1, \Gamma_2)$, since $\pi_2 = 1 - \pi_1$ and $\mathcal{S}_w = \mathrm{span}(\Gamma_w)$, $w = 1, 2$. With this setup, we analyze the theoretical properties of a simplified version of Algorithm 1, which is detailed in Algorithm 4 in the appendix. Let $\theta^*$ denote the true value of $\theta$, and $\widehat{\theta}^{(t)}$ represent the estimate of $\theta$ at the $t$-th iteration. The true parameter space is defined as
$$\Theta^* = \{ \theta : \pi_1^* \in (c_\pi, 1 - c_\pi), \; \|\mathrm{vec}(\Gamma_w^*)\|_0 \le sq, \; \|B_w^*\|_F \le M_a, \; \|\Gamma_w^*\|_F \le M_b, \; w = 1, 2 \},$$
where each condition has a natural interpretation. The condition $\pi_1^* \in (c_\pi, 1 - c_\pi)$ ensures each latent cluster has a sufficiently large sample size. The sparsity condition $\|\mathrm{vec}(\Gamma_w^*)\|_0 \le sq$ reflects the group sparsity structure, and norm bounds such as $\|\Gamma_w^*\|_F \le M_b$ are used in the literature on mixture linear regression (Yi & Caramanis, 2015; Wang et al., 2024). The parameter $B_w^*$ is the true solution to the optimization problem (6) and is defined later in the text.
Since $\sigma^2 = 1$ and $K = 2$, the conditional probability $\Pr(W_i = 1 \mid \theta, X_i, Y_i)$ simplifies to
$$\gamma_{i1}(\theta)^{-1} = (1/\pi_1 - 1) \exp\big\{ [X_i - \tfrac{1}{2}(\Gamma_2 + \Gamma_1) f_i]^T (\Gamma_2 - \Gamma_1) f_i \big\} + 1,$$
and let $\gamma_{i2}(\theta) = \Pr(W_i = 2 \mid \theta, X_i, Y_i) = 1 - \gamma_{i1}(\theta)$. The following quantities are used repeatedly in the theoretical analysis:
$$\widehat{\pi}_w(\theta) = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iw}(\theta), \quad \pi_w(\theta) = E[\widehat{\pi}_w(\theta)], \qquad \widehat{U}_w(\theta) = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iw}(\theta) X_i f_i^T, \quad U_w(\theta) = E[\widehat{U}_w(\theta)],$$
$$\widehat{\Sigma}_w(\theta) = \frac{1}{n} \sum_{i=1}^{n} \gamma_{iw}(\theta) X_i X_i^T, \quad \Sigma_w(\theta) = E[\widehat{\Sigma}_w(\theta)],$$
where the expectation is with respect to $X_i$, $i = 1, 2, \ldots, n$. We define $B_w^* = (\Sigma_w^*)^{-1} U_w^*$, where $\Sigma_w^* = \Sigma_w(\theta^*)$ and $U_w^* = U_w(\theta^*)$. Let $\widehat{\beta}_w^{(t)}$ and $\beta_w^*$ represent the top-$d_w$ left singular vectors of $\widehat{B}_w^{(t)}$ and $B_w^*$. Then we have $\mathcal{S}_w^* = \mathcal{S}_{\beta_w^*}$. Let $M_n(\theta) = \{\widehat{\pi}_w(\theta), \widehat{U}_w(\theta), \widehat{\Sigma}_w(\theta), w = 1, 2\}$ represent the sample-based estimates and $M(\theta) = \{\pi_w(\theta), U_w(\theta), \Sigma_w(\theta), w = 1, 2\}$ be the population counterpart. To quantify the differences between two sets of parameters, we introduce a distance $d_F(M(\theta_1), M(\theta_2))$, defined as
$$\max_{w=1,2} \big\{ |\pi_w(\theta_1) - \pi_w(\theta_2)| \vee \|U_w(\theta_1) - U_w(\theta_2)\|_F \vee \|(\Sigma_w(\theta_1) - \Sigma_w(\theta_2)) B_w^*\|_F \big\}.$$
Let $\Omega^2 = \mathrm{tr}[(\Gamma_2^* - \Gamma_1^*) \widehat{\Sigma}_f (\Gamma_2^* - \Gamma_1^*)^T]$ denote the signal strength of the mixture PFC model, where $\widehat{\Sigma}_f = \frac{1}{n} \sum_{i=1}^{n} f_i f_i^T$. We define the contraction basin $\mathcal{B}_{con}(\theta^*)$ as the set
$$\mathcal{B}_{con}(\theta^*) = \big\{ \theta : \pi_w \in (c_0, 1 - c_0), \; \|\Gamma_w - \Gamma_w^*\|_F \le C_b \Omega, \; (1 - C_d) \Omega^2 \le |\mathrm{tr}(\delta_w(\Gamma) \widehat{\Sigma}_f (\Gamma_2 - \Gamma_1)^T)| \le (1 + C_d) \Omega^2, \; \mathrm{vec}(\Gamma_w - \Gamma_w^*) \in \mathcal{L}(s), \; w = 1, 2 \big\},$$
where $c_0 \le c_\pi$ and $\delta_w(\Gamma) = \Gamma_w^* - (\Gamma_2 + \Gamma_1)/2$. The contraction basin requires that $\theta$ is not far away from the true parameter $\theta^*$. Under the conditions shown later, an initialization $\widehat{\theta}^{(0)}$ within the contraction basin guarantees that all subsequent estimators $\widehat{\theta}^{(t)}$ remain in the contraction basin throughout the iterative process.

4.2. Main Results

We need some technical conditions before stating the theoretical results.

(C1) The singular values of $\widehat{\Sigma}_f = \frac{1}{n} \sum_{i=1}^{n} f_i f_i^T$ satisfy $M_1 \le \sigma_{\min}(\widehat{\Sigma}_f) \le \sigma_{\max}(\widehat{\Sigma}_f) \le M_2$, and $M_3 \le \min_{1 \le i \le n} \|f_i\|_2 \le \max_{1 \le i \le n} \|f_i\|_2 \le M_4$.

(C2) The initialization $\theta^{(0)}$ satisfies $d_F(\theta^{(0)}, \theta^*) \vee \|B_1^{(0)} - B_1^*\|_F \vee \|B_2^{(0)} - B_2^*\|_F < r\Omega$ and $\mathrm{vec}(\Gamma_w^{(0)} - \Gamma_w^*) \in \mathcal{L}(s)$, with $r < |c_0 - c_\pi|/\Omega \wedge C_d \wedge 1/(4\sqrt{M_1}) \wedge (\sqrt{b^2 + 4a^2} - b)/(2a^2)$, where $a^2 = 2 M_2^{3/2} / \sqrt{M_1}$ and $b = 2\sqrt{M_2} + [M_2 + \sqrt{M_2}/2]/\sqrt{M_1}$.

(C3) There exists a sufficiently large constant $M_5 > 0$, which does not depend on $n, p, s$, such that $\sigma_{d_w}(B_w^*) \ge M_5 \sqrt{s q^3 (\log n)^2 \log p / n}$.

(C4) $\Omega \ge C_1(c_0, C_b, M_b, M_i; i = 1, \ldots, 4)$ for a constant that depends only on $c_0$, $M_b$, $C_b$, and $M_i$, $i = 1, \ldots, 4$, and $C_b < C_2(M_2)$ for a constant that depends only on $M_2$.

(C5) $n > C_3 s q^3 \log(p)$ for a sufficiently large constant $C_3$.

Condition (C1) is mild since $f_i$ is a $q$-dimensional vector, where $q$ is a small fixed number that does not grow with $n$ and $p$. Condition (C2) ensures the initialization lies within the contraction basin, which guarantees that the estimates produced at each step of the EM algorithm stay in the contraction basin. It is a common condition in mixture models (Cai et al., 2019; Wang et al., 2024). Condition (C3) requires that the nonzero singular values of $B_w^*$ are sufficiently separated from zero. This is a standard assumption in the theoretical analysis of high-dimensional SDR problems (Zeng et al., 2024). Condition (C4) has two requirements. The first is that the signal strength is larger than a constant that does not depend on $n$ and $p$, so that the two mixtures are distinguishable. This assumption is widely used in mixture linear models (Zhang et al., 2020; Wang et al., 2024). The second is that, for the parameters $\Gamma_w$ within the contraction basin, the distance $\|\Gamma_w - \Gamma_w^*\|_F$ is bounded by the signal strength multiplied by a universal constant independent of $n$ and $p$.
Condition (C5) is a common assumption in high dimensions on the relationship among $n, p, s$ to guarantee consistent estimation (Meinshausen & Yu, 2009; Cai et al., 2019). Specifically, it implies that the restricted eigenvalue condition $\inf_{u \in \mathcal{L}_p(s) \cap \mathbb{S}^{p-1}} |u^T (\frac{1}{n} \sum_{i=1}^{n} X_i X_i^T) u| > \tau_1$ holds with high probability for a positive constant $\tau_1$.

Next, we state the main result for the subspace estimation error of mixPFC in Theorem 4.1, with its proof provided in Section C in the appendix.

Theorem 4.1. Under conditions (C1)-(C5), there exists a constant $0 < \kappa < 1/2$ such that $\widehat{B}_w^{(t)}$ satisfies, with probability at least $1 - o(1)$,
$$\|\widehat{B}_w^{(t)} - B_w^*\|_F \lesssim \underbrace{\kappa^t \big( d_F(\widehat{\theta}^{(0)}, \theta^*) \vee \|\widehat{B}_1^{(0)} - B_1^*\|_F \vee \|\widehat{B}_2^{(0)} - B_2^*\|_F \big)}_{\text{computational error}} + \underbrace{\sqrt{\frac{s q^3 (\log n)^2 \log p}{n}}}_{\text{statistical error}}.$$
Consequently, for $t \ge (-\log \kappa)^{-1} \log\big\{ n \big( d_F(\widehat{\theta}^{(0)}, \theta^*) \vee \|\widehat{B}_1^{(0)} - B_1^*\|_F \vee \|\widehat{B}_2^{(0)} - B_2^*\|_F \big) \big\}$,
$$\|\widehat{B}_w^{(t)} - B_w^*\|_F, \; D(\mathcal{S}_{\widehat{\beta}_w^{(t)}}, \mathcal{S}_{\beta_w^*}) \lesssim \sqrt{\frac{s q^3 (\log n)^2 \log p}{n}}.$$

Theorem 4.1 is the first theoretical result in high-dimensional heterogeneous SDR. Compared to the high-dimensional PFC result in Zeng et al. (2024), our convergence rate is slower by a factor of $\log n$, reflecting the added complexity of unknown cluster labels. Importantly, Theorem 4.1 holds under unequal proportions and arbitrary subspace angles, making it highly non-trivial. Even in low-dimensional settings, existing EM theory often requires additional assumptions such as equal proportions (Gaussian mixtures; Xu et al. (2016)) or symmetric coefficients (mixtures of linear regressions; Zhu et al. (2017)). Additionally, our analysis does not rely on sample splitting, a common technique in the literature (Yi et al., 2014; Yi & Caramanis, 2015; Zhang et al., 2020) that divides the data into many batches and uses a new batch of samples in each iteration to make the random samples and the current parameter estimates independent. Sample splitting, while theoretically convenient, is suboptimal in practice, as it decreases estimation efficiency and is rarely used in real-world applications. Recent work by Wang et al. (2024) derived a rate of $\sqrt{s (\log n)^2 \log p / n}$ without data splitting for mixtures of linear regression. However, the mixture of PFC is inherently more complex. Our rate is slower by a factor of $q^{3/2}$ due to the dependence on the $q$-dimensional vector $f(Y)$. Similarly, compared to the Gaussian mixture model (Cai et al., 2019), the convergence rate is slower by a factor of $q^{3/2} \log n$ due to the involvement of $\gamma_{iw}$ in $\Sigma_w(\theta)$ and the function $f(Y)$.

Starting with an initial value within the contraction basin, Theorem 4.1 shows that the proposed algorithm converges to the true parameters at a rate containing both computational and statistical errors. The computational error, expressed as $\kappa^t ( d_F(\widehat{\theta}^{(0)}, \theta^*) \vee \|\widehat{B}_1^{(0)} - B_1^*\|_F \vee \|\widehat{B}_2^{(0)} - B_2^*\|_F )$, diminishes exponentially in $t$ since $0 < \kappa < 1/2$. The statistical error, $\sqrt{s q^3 (\log n)^2 \log p / n}$, represents the irreducible estimation error and persists regardless of the number of EM iterations. For sufficiently large $t \ge (-\log \kappa)^{-1} \log\{ n ( d_F(\widehat{\theta}^{(0)}, \theta^*) \vee \|\widehat{B}_1^{(0)} - B_1^*\|_F \vee \|\widehat{B}_2^{(0)} - B_2^*\|_F ) \}$, the computational error becomes negligible relative to the statistical error. Beyond this step, additional iterations do not improve the estimators. Notably, since this threshold grows only logarithmically with $n$, Algorithm 1 achieves accurate estimation in practice within a limited number of iterations.
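Theorem 4.1 bounds the subspace estimation error in the metric $D(\mathcal{S}_A, \mathcal{S}_B) = \|P_A - P_B\|_F / \sqrt{2d}$ from Section 4.1, which also serves as the evaluation criterion in the experiments below. A minimal sketch of this metric (Python/NumPy; an illustrative helper, not from the paper's code):

```python
import numpy as np

def subspace_distance(A, B):
    """D(S_A, S_B) = ||P_A - P_B||_F / sqrt(2d) for p x d basis matrices A, B
    of equal rank d; equals 0 for identical subspaces and 1 for orthogonal ones."""
    QA, _ = np.linalg.qr(A)            # orthonormal basis of span(A)
    QB, _ = np.linalg.qr(B)
    PA, PB = QA @ QA.T, QB @ QB.T      # projection matrices P_A, P_B
    return np.linalg.norm(PA - PB) / np.sqrt(2 * A.shape[1])
```

For orthogonal subspaces, $\|P_A - P_B\|_F^2 = \mathrm{tr}(P_A) + \mathrm{tr}(P_B) = 2d$, so the normalization keeps the distance in $[0, 1]$.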
The computational cost per EM iteration is $O(nKpq + nK^3 q^3 + KTnpq + nKp^2)$, where $T$ denotes the number of iterations used to solve the penalized optimization problem (6). Given that $q$ is a small number that does not grow with $n, p, K$, the overall complexity of Algorithm 1 is $O(\log(n)(nK^3 + KTnp + Knp^2))$, with the dominant term $O(\log(n) K n p^2)$ coming from covariance estimation. This remains tractable even for large $K$ or $p$.

5. Numerical Results

5.1. Simulations

We compare mixPFC and mixPFC-ISO against existing methods in clustering accuracy, subspace estimation, and variable selection. Since no existing method simultaneously clusters the data and estimates subspaces, we evaluate our methods against subspace clustering approaches for clustering error rates and against high-dimensional SDR methods for subspace estimation and variable selection. The subspace clustering methods considered include LSA (Yan & Pollefeys, 2006), SSC (Elhamifar & Vidal, 2013), LRSC (Favaro et al., 2011), GPCA (Vidal et al., 2005), and RANSAC (Tron & Vidal, 2007). GPCA is applied only to the important variables due to computational constraints. Additionally, K-means and hierarchical clustering are included and applied to both $X$ and $X_Y$, where the $i$-th row of $X_Y$ is defined as $Y_i X_i$. For variable selection and subspace estimation accuracy, we include Lasso-SIR (Lin et al., 2019), SEAS-SIR, and SEAS-PFC (Zeng et al., 2024). Clustering results are presented in this section, with subspace estimation and variable selection results provided in Section A of the appendix.

We consider four settings for the central subspaces, denoted as models M1-M4, to examine different configurations of mixtures. Models M1-M3 have $K = 2$ mixtures with different degrees of overlap between the two subspaces. Specifically, the subspaces are identical in M1, orthogonal in M2, and oblique in M3. Model M4 randomly generates multiple subspaces ($K > 2$), which tend to be nearly orthogonal to each other since $s = 10$. The dimension $d_w$ of each subspace is 1 for M1 and M2, and 2 for M3 and M4. The active set is defined as $\mathcal{A}_w = \{1, \ldots, 6\}$ for M1-M3, and $\mathcal{A}_w = \{1, \ldots, 10\}$ for M4. Across all models, we set $\mu_w = 0$, $f(Y) = (Y, |Y|)^T$, and $\pi_w = 1/K$. After the basis matrices $\beta_w$ of the central subspaces are generated according to M1-M4 (parameters provided in Section A of the appendix), we set $\Gamma_w = \beta_w \eta_w$, where $\eta_w \in \mathbb{R}^{d_w \times q}$ links the central subspace and the function $f(Y)$. Imbalanced clusters and non-linear functions are also examined in Section A of the appendix. The sample size is fixed at $n = 200K$ with $p = 1000$, and for each simulation setting, 100 independent datasets are generated. To explore the influence of different covariance structures, we examine four configurations: $\Delta = 0.1 I_p, I_p, \mathrm{AR}(0.3), \mathrm{AR}(0.5)$, where $\mathrm{AR}(r)$ represents the auto-regressive covariance structure $(\Delta)_{ij} = r^{|i-j|}$ for $i, j = 1, \ldots, p$.

Table 1 summarizes clustering error rates. As expected, mixPFC achieves substantially lower error rates than all subspace clustering methods across most model settings. When $K = 2$, mixPFC has error rates of around 10% across all settings, dropping below 5% in certain cases. Notably, mixPFC-ISO demonstrates superior performance with even lower error rates for model M2. For M2 with $\Delta = 0.1 I_p$, subspace clustering methods exhibit comparable or slightly lower error rates. This is likely due to favorable conditions for subspace clustering methods, where the subspaces are orthogonal and the random errors are minimal.
When $K > 2$, error rates remain impressively low, under 3% for $K = 3$, probably due to the enhanced signal strength from setting $s = 10$. When $K = 5$, error rates rise to around 15%, likely due to the increased difficulty in generating high-quality initial values for larger $K$.

5.2. Real Data Analysis

The Australian Institute of Sport (AIS) dataset, available in the R package dr, contains lean body mass data for 102 male and 100 female athletes. The objective is to investigate the relationship between lean body mass and 8 predictors, including height, weight, and red cell count. Given that body composition varies between males and females (Bredella, 2017), the AIS data likely includes two distinct subpopulations. Figure 3 (a) shows summary plots for males and females when sex is observed, highlighting distinct fitted lines for each group. Figure 3 (b) demonstrates that mixPFC effectively identifies the two subpopulations, achieving an error rate of 0.074.

Figure 3. Summary plots for the AIS dataset, plotting lean body mass against the reduced predictors. (a) PFC with true sex: fitted lines for males and females when the sex variable is observed, illustrating distinct subpopulation trends. (b) mixPFC with estimated sex: mixPFC accurately identifies the two subpopulations with an error rate of 0.074.

The Cancer Cell Line Encyclopedia (CCLE) dataset contains 8-point dose-response curves for 24 chemical compounds across over 400 cell lines, with 18,926 gene expression features for each cell line, accessible at https://sites.broadinstitute.org/ccle. Due to inconsistencies in cell lines across compounds, we focus on two popular cancer treatments: Nutlin-3 (n = 480) and AZD6244 (n = 479). Following Wang et al. (2024) and Li et al. (2019), we use the logarithm of the area under the dose-response curve as the response, representing drug sensitivity. The top $p = 500$ genes with the highest absolute correlations with the responses are selected for analysis. Given the inherent complexity of cancer, the CCLE data is expected to be heterogeneous. The dataset is randomly partitioned into 80% training and 20% testing samples, with 100 repetitions. Table 2 reports the prediction mean squared errors (PMSE) and the number of selected variables $\widehat{s}$ for each method, with the number of clusters set to 3 and 5 for Nutlin-3 and AZD6244, respectively, when using mixPFC. For both compounds, mixPFC significantly reduces prediction error compared to homogeneous methods, suggesting heterogeneity in the data. Notably, mixPFC does not select more variables than Lasso and the three homogeneous SDR methods. Figure 4 shows summary plots of the response against reduced predictors projected onto each subspace for Nutlin-3. Within each cluster, the response exhibits approximately linear relationships with the projected predictors. The lack of clear patterns when points are projected onto subspaces outside their cluster further highlights the heterogeneity of the data. The plot for AZD6244 and additional real data analysis are provided in Section A in the appendix.

Figure 4. The scatter plot of the response $Y$ versus $\widehat{\beta}_w^T X$ for the drug Nutlin-3. The solid line is fitted using samples in the given cluster.

6. Discussion

In this work, we proposed the mixture of PFC model, which combines subspace clustering with SDR methods to handle heterogeneous and high-dimensional data.
An efficient group Lasso penalized EM algorithm has been developed to simultaneously perform clustering, subspace estimation, and variable selection. Theoretical analysis revealed an encouraging non-asymptotic convergence rate, offering insight into the empirical success of the algorithm.

A key aspect of our theoretical framework is its development for the $K = 2$ cluster scenario. Generalizing the theory to multi-cluster EM algorithms remains an important yet challenging direction. Recent advances focus on low-dimensional settings (Yan et al., 2017; Tian et al., 2024), and to the best of our knowledge, no general theory exists for high-dimensional multi-cluster EM. Addressing this open question likely requires fundamentally new tools to handle complex parameter spaces and interactions between $K$ subspaces. Further discussion and potential extensions are provided in Section G in the appendix.

Table 1. Averages and standard errors of clustering error rates (%) with $n = 200K$, $p = 1000$. Within each model, the four rows correspond to $\Delta = 0.1 I_p, I_p, \mathrm{AR}(0.3), \mathrm{AR}(0.5)$; mixPFC-ISO is reported only for the isotropic settings.

| Model / $\Delta$ | mixPFC-ISO | mixPFC | RANSAC | LSA | LRSC | SSC | GPCA | K-means $X$ | K-means $X_Y$ | hclust $X$ | hclust $X_Y$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 ($K=2$), $0.1I_p$ | 5.0 (0.8) | 3.4 (0.7) | 48 (0.1) | 44.6 (0.2) | 47.8 (0.2) | 47.9 (0.1) | 47.9 (0.1) | 44.2 (0.2) | 20.2 (0.3) | 45.8 (0.2) | 14 (0.2) |
| M1, $I_p$ | 12.1 (1.0) | 10.3 (1.1) | 48.1 (0.2) | 47.9 (0.2) | 48 (0.2) | 48 (0.1) | 48.1 (0.1) | 45.2 (0.2) | 29.1 (0.4) | 48.1 (0.2) | 38.2 (0.5) |
| M1, AR(0.3) | - | 6.0 (0.6) | 48.0 (0.2) | 47.6 (0.2) | 48.0 (0.2) | 47.9 (0.1) | 48.2 (0.1) | 44.7 (0.1) | 23.1 (0.3) | 47 (0.2) | 23.9 (0.3) |
| M1, AR(0.5) | - | 4.7 (0.4) | 47.9 (0.2) | 47.0 (0.2) | 47.8 (0.2) | 48.2 (0.1) | 48.0 (0.1) | 44.5 (0.2) | 21.6 (0.3) | 46.6 (0.2) | 18.3 (0.3) |
| M2 ($K=2$), $0.1I_p$ | 9.9 (1.5) | 16.4 (2.0) | 28.3 (1.2) | 4.9 (0.1) | 48.1 (0.1) | 11.7 (0.2) | 48 (0.1) | 47.6 (0.2) | 27.5 (0.3) | 46.9 (0.2) | 19.1 (0.3) |
| M2, $I_p$ | 13.9 (0.8) | 16.3 (1.4) | 47.6 (0.2) | 43.5 (0.3) | 47.7 (0.2) | 47.2 (0.2) | 48.2 (0.1) | 47.5 (0.2) | 36.5 (0.3) | 47.4 (0.2) | 46.2 (0.3) |
| M2, AR(0.3) | - | 11.5 (0.9) | 46.2 (0.3) | 31.2 (0.4) | 48.3 (0.1) | 44.9 (0.3) | 48.2 (0.1) | 37 (0.4) | 32.2 (0.3) | 44.2 (0.3) | 42.6 (0.5) |
| M2, AR(0.5) | - | 11.4 (0.9) | 43.4 (0.6) | 21.5 (0.3) | 48.0 (0.1) | 43.5 (0.4) | 48.0 (0.2) | 33.9 (0.2) | 31.1 (0.3) | 40.2 (0.3) | 34.9 (0.5) |
| M3 ($K=2$), $0.1I_p$ | 5.9 (1.1) | 3.6 (1.2) | 39.7 (0.7) | 40.0 (0.8) | 47.5 (0.2) | 13.2 (0.2) | 47.8 (0.2) | 44.6 (0.2) | 22.3 (0.3) | 44.5 (0.4) | 15.1 (0.2) |
| M3, $I_p$ | 11.1 (1.5) | 8.4 (1.4) | 47.8 (0.2) | 47.7 (0.2) | 48.0 (0.1) | 48.0 (0.1) | 47.9 (0.2) | 45.9 (0.2) | 30.6 (0.4) | 47.9 (0.1) | 41.2 (0.4) |
| M3, AR(0.3) | - | 7.2 (1.4) | 47.8 (0.2) | 47.3 (0.2) | 47.8 (0.2) | 47.5 (0.2) | 47.9 (0.2) | 47.2 (0.1) | 24.8 (0.3) | 47.9 (0.1) | 26.3 (0.3) |
| M3, AR(0.5) | - | 4 (0.9) | 47.0 (0.2) | 47.2 (0.1) | 48.1 (0.1) | 47.7 (0.2) | 47.8 (0.2) | 46.7 (0.1) | 24.1 (0.3) | 47.7 (0.2) | 21.1 (0.3) |
| M4 ($K=3$), $0.1I_p$ | 0 (0) | 0 (0) | 44.1 (1) | 25.4 (0.5) | 18.2 (0.6) | 6.5 (0.1) | 63.3 (0.1) | 58.3 (0.2) | 52.1 (0.2) | 59.4 (0.4) | 29.4 (0.3) |
| M4 ($K=3$), $I_p$ | 3.7 (0.1) | 2.9 (0.1) | 62.3 (0.2) | 57.2 (0.2) | 64 (0.1) | 60.5 (0.3) | 63.4 (0.1) | 62.8 (0.2) | 53.3 (0.2) | 68.8 (0.3) | 53.7 (0.4) |
| M4 ($K=3$), AR(0.3) | - | 2.5 (0.1) | 59.8 (0.3) | 56.1 (0.3) | 64 (0.1) | 57.5 (0.3) | 63.4 (0.1) | 61.8 (0.2) | 50.1 (0.3) | 66 (0.3) | 46.6 (0.4) |
| M4 ($K=3$), AR(0.5) | - | 2.2 (0.1) | 56.9 (0.3) | 54.1 (0.2) | 64 (0.1) | 59.6 (0.3) | 63.7 (0.1) | 62.8 (0.2) | 47.8 (0.3) | 63.3 (0.3) | 42.3 (0.5) |
| M4 ($K=5$), $0.1I_p$ | 0.4 (0.3) | 0.2 (0.2) | 50.1 (0.7) | 38.8 (0.6) | 32.6 (0.8) | 13.8 (0.2) | 74.1 (0.2) | 63.4 (0.2) | 46.5 (0.4) | 64.8 (0.2) | 25.5 (0.2) |
| M4 ($K=5$), $I_p$ | 14.0 (1.2) | 10.2 (1.6) | 75.2 (0.1) | 66.7 (0.2) | 76.9 (0.1) | 75.3 (0.1) | 76 (0.1) | 65.7 (0.1) | 54.3 (0.4) | 75.6 (0.1) | 71.5 (0.3) |
| M4 ($K=5$), AR(0.3) | - | 15.7 (2.0) | 73.6 (0.2) | 65.1 (0.2) | 76.8 (0.1) | 74.6 (0.1) | 76 (0.1) | 62.7 (0.1) | 53.7 (0.3) | 72.8 (0.1) | 66.1 (0.3) |
| M4 ($K=5$), AR(0.5) | - | 24.4 (2.2) | 72.4 (0.2) | 65 (0.2) | 77 (0.1) | 75.8 (0.1) | 76.1 (0.1) | 63.1 (0.2) | 55.4 (0.2) | 71.1 (0.1) | 62 (0.3) |

Table 2. Averages of the prediction errors (PMSE $\times$ 100), the sparsity level $\widehat{s}$, and the corresponding standard errors based on 100 replicates.

| Compound | Metric | mixPFC | SEAS-SIR | SEAS-PFC | Lasso-SIR | Lasso |
|---|---|---|---|---|---|---|
| Nutlin-3 | PMSE $\times$ 100 | 8.9 (0.4) | 18.7 (0.4) | 18.8 (0.4) | 18.2 (0.3) | 17.6 (0.3) |
| Nutlin-3 | $\widehat{s}$ | 31.5 (2.5) | 48.3 (1.3) | 33.3 (1.8) | 32 (0.9) | 39.1 (1.1) |
| AZD6244 | PMSE $\times$ 100 | 45.6 (1.7) | 108.8 (2.2) | 107.1 (2.2) | 83.2 (1.6) | 77.6 (1.5) |
| AZD6244 | $\widehat{s}$ | 77.8 (3.2) | 78.3 (1.6) | 58.8 (1.5) | 66.8 (0.7) | 78.6 (0.7) |

Acknowledgements

The authors thank the reviewers for constructive comments. The research was partly supported by grant DMS-2053697 from the U.S. National Science Foundation.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Adamczak, R. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13:1000-1034, 2008.

Agarwal, P. K. and Mustafa, N. H. k-means projective clustering. In Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '04, pp. 155-165, New York, NY, USA, 2004. Association for Computing Machinery.

Baik, J. and Silverstein, J. W. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382-1408, 2006.

Balakrishnan, S., Wainwright, M. J., and Yu, B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77-120, 2017.

Banijamali, E., Karimi, A.-H., and Ghodsi, A. Deep variational sufficient dimensionality reduction. arXiv preprint arXiv:1812.07641, 2018.

Biernacki, C., Celeux, G., and Govaert, G. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4):561-575, 2003.

Bredella, M. A. Sex Differences in Body Composition, pp. 9-27. Springer International Publishing, 2017.

Cai, J., Fan, J., Guo, W., Wang, S., Zhang, Y., and Zhang, Z. Efficient deep embedded subspace clustering. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21-30, 2022.

Cai, T. T., Ma, J., and Zhang, L. CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. The Annals of Statistics, 47(3):1234-1267, 2019.

Chen, X., Zou, C., and Cook, R. D. Coordinate-independent sparse sufficient dimension reduction and variable selection. The Annals of Statistics, 38(6):3696-3723, 2010.

Chen, Y., Jiao, Y., Qiu, R., and Yu, Z. Deep nonlinear sufficient dimension reduction. The Annals of Statistics, 52(3):1201-1226, 2024.

Chiaromonte, F., Cook, R., and Li, B. Sufficient dimension reduction in regressions with categorical predictors. The Annals of Statistics, 30(2):475-497, 2002.

Cook, R. D. and Forzani, L. Principal fitted components for dimension reduction in regression. Statistical Science, 23(4):485-501, 2008.

d'Aspremont, A., Ghaoui, L. E., Jordan, M. I., and Lanckriet, G. R. G. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.

Dirksen, S. Tail bounds via generic chaining. Electronic Journal of Probability, 20(53):1-29, 2015.

Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765-2781, 2013.

Erichson, N. B., Zheng, P., Manohar, K., Brunton, S. L., Kutz, J. N., and Aravkin, A. Y. Sparse principal component analysis via variable projection. SIAM Journal on Applied Mathematics, 80(2):977-1002, 2020.

Favaro, P., Vidal, R., and Ravichandran, A. A closed form solution to robust subspace estimation and clustering. In CVPR 2011, pp. 1801-1807, 2011.

Forgy, E. W. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768-769, 1965.

Fraley, C. and Raftery, A. E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611-631, 2002.

Hastie, T. and Tibshirani, R. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):155-176, 1996.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hsing, T. and Ren, H. An RKHS formulation of the inverse regression dimension-reduction problem. The Annals of Statistics, 37(2):726-755, 2009.

Huang, J., Jiao, Y., Liao, X., Liu, J., and Yu, Z. Deep dimension reduction for supervised representation learning. IEEE Transactions on Information Theory, 70(5):3583-3598, 2024.

Ji, P., Zhang, T., Li, H., Salzmann, M., and Reid, I. Deep subspace clustering networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pp. 23-32, Red Hook, NY, USA, 2017. Curran Associates Inc.

Journée, M., Nesterov, Y., Richtárik, P., and Sepulchre, R. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11:517-553, 2010.

Kanatani, K. Motion segmentation by subspace separation and model selection. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, volume 2, pp. 586-591, 2001.

Kapla, D., Fertl, L., and Bura, E. Fusing sufficient dimension reduction with neural networks. Computational Statistics & Data Analysis, 168:107390, 2022.

Klusowski, J. M., Yang, D., and Brinda, W. Estimating the coefficients of a mixture of two linear regressions by expectation maximization. IEEE Transactions on Information Theory, 65(6):3515-3524, 2019.

Kwon, J., Qian, W., Caramanis, C., Chen, Y., and Davis, D. Global convergence of the EM algorithm for mixtures of two component linear regression. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 2055-2110. PMLR, 25-28 Jun 2019.

Kwon, O.-R., Mukherjee, G., and Bien, J. Semi-supervised learning of noisy mixture of experts models. arXiv preprint arXiv:2410.09039, 2024.

Le, L., Patterson, A., and White, M. Supervised autoencoders: Improving generalization performance with unsupervised regularizers. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.

Li, B., Artemiou, A., and Li, L. Principal support vector machines for linear and nonlinear sufficient dimension reduction. The Annals of Statistics, 39(6):3182-3210, 2011.
Li, K.-C. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.

Li, L., Ke, C., Yin, X., and Yu, Z. Generalized martingale difference divergence: Detecting conditional mean independence with applications in variable screening. Computational Statistics & Data Analysis, 180:107618, 2023.

Li, Q., Shi, R., and Liang, F. Drug sensitivity prediction with high-dimensional mixture regression. PLOS ONE, 14(2):1-18, 2019.

Li, R., Zhong, W., and Zhu, L. Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129-1139, 2012.

Liang, S., Sun, Y., and Liang, F. Nonlinear sufficient dimension reduction with a stochastic neural network. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc.

Lin, Q., Zhao, Z., and Liu, J. S. On consistency and sparsity for sliced inverse regression in high dimensions. The Annals of Statistics, 46(2):580-610, 2018.

Lin, Q., Zhao, Z., and Liu, J. S. Sparse sliced inverse regression via lasso. Journal of the American Statistical Association, 114(528):1726-1739, 2019.

MacQueen, J. et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pp. 281-297. University of California Press, 1967.

Mai, Q., Yang, Y., and Zou, H. Multiclass sparse discriminant analysis. Statistica Sinica, 29(1):97-111, 2019.

McLachlan, G. J., Lee, S. X., and Rathnayake, S. I. Finite mixture models. Annual Review of Statistics and Its Application, 6:355-378, 2019.

Meinshausen, N. and Yu, B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246-270, 2009.

Paul, D. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17(4):1617-1642, 2007.

Rudelson, M. and Zhou, S. Reconstruction from anisotropic random measurements. In Mannor, S., Srebro, N., and Williamson, R. C. (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, pp. 10.1-10.24, Edinburgh, Scotland, 25-27 Jun 2012. PMLR.

Shao, X. and Zhang, J. Martingale difference correlation and its use in high-dimensional variable screening. Journal of the American Statistical Association, 109(507):1302-1318, 2014.

Sheng, W. and Yin, X. Sufficient dimension reduction via distance covariance. Journal of Computational and Graphical Statistics, 25(1):91-104, 2016.

Soltanolkotabi, M. and Candès, E. J. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 40(4):2195-2238, 2012.

Székely, G. J., Rizzo, M. L., and Bakirov, N. K. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794, 2007.

Talagrand, M. The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Science & Business Media, 2005.

Tan, K., Shi, L., and Yu, Z. Sparse SIR: Optimal rates and adaptive estimation. The Annals of Statistics, 48(1):64-85, 2020.

Tan, K. M., Wang, Z., Zhang, T., Liu, H., and Cook, R. D. A convex formulation for high-dimensional sparse sliced inverse regression. Biometrika, 105(4):769-782, 2018.

Tian, Y., Weng, H., and Feng, Y. Towards the theory of unsupervised federated learning: Non-asymptotic analysis of federated EM algorithms.
In Proceedings of the 41st International Conference on Machine Learning, ICML '24. JMLR.org, 2024.

Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.

Tipping, M. E. and Bishop, C. M. Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61(3):611-622, 1999.

Tron, R. and Vidal, R. A benchmark for the comparison of 3-D motion segmentation algorithms. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Vidal, R. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52-68, 2011.

Vidal, R., Ma, Y., and Sastry, S. Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945-1959, 2005.

Wang, N., Zhang, X., and Mai, Q. Statistical analysis for a penalized EM algorithm in high-dimensional mixture linear regression model. Journal of Machine Learning Research, 25(222):1-85, 2024.

Wang, Z., Gu, Q., Ning, Y., and Liu, H. High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2, NIPS '15, pp. 2521-2529, Cambridge, MA, USA, 2015. MIT Press.

Witten, D. M., Tibshirani, R., and Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.

Wu, H.-M. Kernel sliced inverse regression with applications to classification. Journal of Computational and Graphical Statistics, 17(3):590-610, 2008.

Xu, J., Hsu, D., and Maleki, A. Global analysis of expectation maximization for mixtures of two Gaussians. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS '16, pp. 2684-2692, Red Hook, NY, USA, 2016. Curran Associates Inc.

Yan, B., Yin, M., and Sarkar, P. Convergence of gradient EM on multi-component mixture of Gaussians. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Yan, J. and Pollefeys, M. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Leonardis, A., Bischof, H., and Pinz, A. (eds.), Computer Vision - ECCV 2006, pp. 94-106, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

Yi, X. and Caramanis, C. Regularized EM algorithms: A unified framework and statistical guarantees. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 1, NIPS '15, pp. 1567-1575, Cambridge, MA, USA, 2015. MIT Press.

Yi, X., Caramanis, C., and Sanghavi, S. Alternating minimization for mixed linear regression. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 613-621, Bejing, China, 22-24 Jun 2014. PMLR.
Yu, Y., Wang, T., and Samworth, R. J. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315-323, 2015.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49-67, 2006.

Zeng, J., Mai, Q., and Zhang, X. Subspace estimation with automatic dimension and variable selection in sufficient dimension reduction. Journal of the American Statistical Association, 119(545):343-355, 2024.

Zhang, L., Ma, R., Cai, T. T., and Li, H. Estimation, confidence intervals, and large-scale hypotheses testing for high-dimensional mixed linear regression. arXiv preprint arXiv:2011.03598, 2020.

Zhu, R., Wang, L., Zhai, C., and Gu, Q. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithm. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 4180-4188. PMLR, 06-11 Aug 2017.

Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., ki Cho, D., and Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.

Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.

Appendix

This appendix provides additional numerical results, implementation details, and technical proofs supporting the theoretical analysis. Section A presents detailed results from the numerical studies. Section B outlines implementation specifics, including the initialization method, the selection of the number of clusters, and the mixPFC-ISO algorithm. The theoretical analysis of Theorem 4.1 relies on two key lemmas, with ancillary results provided in Section F. The proofs of the two key lemmas are detailed in Sections D and E, while Section C contains the full proof of the theorem. We discuss potential extensions in Section G.

A. Additional Numerical Results

A.1. Simulations

The parameters for M1-M4 in the simulation section are generated as follows.

(M1) The non-zero coefficients of $\beta_1$ are $\beta_{1i} = 2.3$, $i = 1, \ldots, 6$, and then we set $\beta_2 = \beta_1$.

(M2) The non-zero coefficients of $\beta_1$ and $\beta_2$ are $\beta_{1i} = 2.3$, $1 \le i \le 6$, and $\beta_{2i} = 2.3$ for $i = 1, 3, 5$, $\beta_{2i} = -2.3$ for $i = 2, 4, 6$.

(M3) The basis matrices $\beta_1$ and $\beta_2$ have two columns. The non-zero rows $(\beta_1)_{1:6}$ and $(\beta_2)_{1:6}$ are $6 \times 2$ matrices whose entries all equal $\pm 2.3$, with different sign patterns for the two clusters.

(M4) The basis matrices $\beta_w$, $w = 1, \ldots, K$, have two columns. For each cluster $w$, the non-zero elements are generated as $(\beta_w)_{ij} \sim \mathrm{Unif}([-2.5, -2.1] \cup [2.1, 2.5])$, $i \le 10$, $j = 1, 2$, $w = 1, 2, \ldots, K$.

We consider four scenarios of simulations as described in Table A.3. The clustering error rates for S1 and S4 are presented in the main paper. The error rates for scenarios S2 and S3 are summarized in Tables A.4 and A.5. To assess the subspace estimation and variable selection accuracy, we define the following criteria: the distance between estimated and actual subspaces is defined as $D_w = D(\mathcal{S}_{\beta_w}, \mathcal{S}_{\widehat{\beta}_w}) = \|P_{\beta_w} - P_{\widehat{\beta}_w}\|_F / \sqrt{2 d_w}$, $w = 1, \ldots, K$; the error rate (ER) is the fraction of incorrectly classified samples; the true positive rate (TPR) and false positive rate (FPR) are defined as $\mathrm{TPR}_w = |\widehat{\mathcal{A}}_w \cap \mathcal{A}_w| / |\mathcal{A}_w|$ and $\mathrm{FPR}_w = |\widehat{\mathcal{A}}_w \cap \mathcal{A}_w^c| / |\mathcal{A}_w^c|$.
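A minimal sketch of these evaluation criteria (Python; illustrative helpers, not the paper's code; the subspace distance $D_w$ can reuse the `subspace_distance` helper sketched earlier, after Theorem 4.1):

```python
import numpy as np
from itertools import permutations

def selection_metrics(A_hat, A_true, p):
    """TPR_w = |A_hat ∩ A_true| / |A_true| and FPR_w = |A_hat ∩ A_true^c| / |A_true^c|
    for one cluster; A_hat and A_true are sets of variable indices in {0, ..., p-1}."""
    tpr = len(A_hat & A_true) / len(A_true)
    fpr = len(A_hat - A_true) / (p - len(A_true))
    return tpr, fpr

def clustering_error(labels_hat, labels_true, K):
    """Error rate: fraction of misclassified samples, minimized over relabelings,
    since cluster labels are identifiable only up to permutation."""
    labels_hat = np.asarray(labels_hat)
    labels_true = np.asarray(labels_true)
    return min(np.mean(np.array(perm)[labels_hat] != labels_true)
               for perm in permutations(range(K)))
```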
When $K > 2$, instead of reporting each $D_w$, $\mathrm{TPR}_w$, and $\mathrm{FPR}_w$, we calculate the average subspace distance $\bar{D} = \sum_{w=1}^{K} D_w / K$, the average TPR and FPR, $\overline{\mathrm{TPR}} = \sum_{w=1}^{K} \mathrm{TPR}_w / K$ and $\overline{\mathrm{FPR}} = \sum_{w=1}^{K} \mathrm{FPR}_w / K$, and the TPR and FPR of the union of selected variables across all clusters, $\widetilde{\mathrm{TPR}} = |(\bigcup_w \widehat{\mathcal{A}}_w) \cap (\bigcup_w \mathcal{A}_w)| / |\bigcup_w \mathcal{A}_w|$ and $\widetilde{\mathrm{FPR}} = |(\bigcup_w \widehat{\mathcal{A}}_w) \cap (\bigcup_w \mathcal{A}_w^c)| / |\bigcup_w \mathcal{A}_w^c|$.

Tables A.6 and A.7 summarize variable selection and subspace estimation results for scenarios S1 and S4 with covariance $\Delta = I$ and $\mathrm{AR}(0.3)$. The corresponding results for $\Delta = 0.1I$ and $\mathrm{AR}(0.5)$ are provided in Tables A.8 and A.9. Results for S2 and S3 can be found in Tables A.10-A.11 and Tables A.12-A.13, respectively.

Under scenario S1, mixPFC demonstrates strong performance, effectively identifying the important variables and accurately estimating the central subspaces. It achieves true positive rates $\mathrm{TPR}_w$ greater than 95%, false positive rates $\mathrm{FPR}_w$ below 1%, and subspace estimation errors $D_w$ around 0.3 when $\Delta = I$ and $\mathrm{AR}(0.3)$. However, as variable correlation increases ($\Delta = \mathrm{AR}(0.5)$), mixPFC shows reduced accuracy in subspace estimation, with $D_w$ increasing to approximately 0.5, though it remains effective in variable selection. When the covariance matrix contains small elements ($\Delta = 0.1I$), mixPFC's performance declines in both variable selection and subspace estimation. This reduction occurs because estimating $\mathcal{S}_w = \Delta^{-1} \mathcal{S}_{\Gamma_w}$ becomes challenging when all elements of $\Delta$ have small magnitude. In such cases, mixPFC-ISO, which assumes isotropic errors, outperforms mixPFC when $\Delta = 0.1I$ and $I$. In the unbalanced clusters scenario S2, subspace estimation errors are smaller for clusters with more samples, while variable selection and error rates remain consistent with S1. In scenario S3, featuring a nonlinear fitting function, both mixPFC and mixPFC-ISO perform poorly in variable selection and subspace estimation for model M3. Even SEAS-PFC using the true clusters offers little improvement. However, the error rates remain well controlled. The performance under M1 and M2 is similar to S1. In the multi-cluster setting S4, mixPFC shows a 10%-20% reduction in $\mathrm{TPR}_w$ for each cluster compared to S1. But the union of selected variables typically includes all the important variables. When compared with Lasso-SIR, SEAS-SIR, and SEAS-PFC fitted on the estimated clusters, mixPFC has similar $\mathrm{TPR}_w$ and $\mathrm{FPR}_w$ but larger subspace estimation errors. This is likely because the algorithm incorporates some misleading information from other clusters through $\gamma_{iw}$. Among the high-dimensional SDR methods, SEAS-SIR and SEAS-PFC consistently outperformed Lasso-SIR. Therefore, it is recommended to refit with SEAS-PFC after classifying the data using mixPFC. Notably, fitting with true clusters offers only marginal improvements, typically just a few percentage points, over using clusters estimated by mixPFC.

Table A.3. Parameter settings of the four sets of simulations. The linking matrices are $\eta_w = (1, 0.3)$ for M1 and M2, and $\eta_w = I$ for M3 and M4.

| Scenario | Model | $\pi$ | $Y$ | $f(Y)$ |
|---|---|---|---|---|
| S1 | M1-M3 | (0.5, 0.5) | Unif(-1, 1) | $(Y, \lvert Y\rvert)^T$ |
| S2 | M1-M3 | (0.7, 0.3) | Unif(-1, 1) | $(Y, \lvert Y\rvert)^T$ |
| S3 | M1-M3 | (0.5, 0.5) | Unif(0, 2) | $(Y^2/3, Y^3/6)^T$ |
| S4 | M4, $K = 3, 5$ | $(1/K, \ldots, 1/K)$ | Unif(-1, 1) | $(Y, \lvert Y\rvert)^T$ |

Table A.4. Averages and standard errors of clustering error rates for scenario S2 with $n = 400$, $p = 1000$, $K = 2$.
Method mix PFC-ISO mix PFC RANSAC LSA LRSC SSC GPCA K-means X K-means X Y hclust X hclust X Y M1: n = 400, p = 1000, = 0.1I, I, AR(0.3), AR(0.5) ER(%) 7.7 (1.3) 3.1 (0.7) 40.8 (0.5) 48.1 (0.1) 47.7 (0.2) 47.4 (0.2) 47.3 (0.2) 48.7 (0.1) 33.1 (0.6) 46.1 (0.3) 10.3 (0.5) 13.1 (1.3) 7.2 (0.6) 48 (0.2) 41.2 (0.4) 42.5 (0.4) 47.9 (0.1) 47.4 (0.2) 48.6 (0.1) 42.7 (0.7) 43 (0.5) 27.8 (0.9) 6.9 (0.9) 46.4 (0.3) 44.5 (0.3) 47.8 (0.1) 47.7 (0.2) 45 (0.4) 48.5 (0.1) 35.9 (0.7) 46.5 (0.3) 18.8 (0.8) 4.4 (0.5) 43.8 (0.4) 45.5 (0.3) 48 (0.1) 48 (0.1) 44.9 (0.4) 48.7 (0.1) 35.2 (0.6) 45.9 (0.3) 13.5 (0.5) M2: n = 400, p = 1000, = 0.1I, I, AR(0.3), AR(0.5) ER(%) 12.9 (1.7) 14.4 (1.9) 29.9 (1.1) 5.4 (0.1) 47.6 (0.2) 11.3 (0.2) 47 (0.2) 48.3 (0.1) 41.7 (0.3) 45.3 (0.3) 13.9 (0.6) 13.2 (0.5) 14.4 (1.3) 47.4 (0.2) 40.6 (0.5) 39.5 (0.5) 46.9 (0.2) 47.9 (0.2) 48.3 (0.1) 45.9 (0.6) 40.8 (0.6) 30.7 (0.4) 10 (0.7) 44.8 (0.4) 33.6 (0.3) 48.1 (0.2) 46.1 (0.3) 44.2 (0.4) 46.5 (0.2) 45.6 (0.2) 47.2 (0.2) 29.9 (0.7) 9.4 (0.8) 40.8 (0.7) 24.4 (0.2) 47.9 (0.2) 46.2 (0.3) 44 (0.4) 44.2 (0.2) 44.4 (0.3) 48 (0.2) 31.8 (1.2) M3: n = 400, p = 1000, = 0.1I, I, AR(0.3), AR(0.5) ER(%) 8.0 (1.2) 2.1 (0.9) 37.8 (0.8) 46.1 (0.3) 46.8 (0.2) 14.4 (0.5) 47.1 (0.2) 45.1 (0.2) 35.7 (0.5) 43 (0.5) 11.7 (0.5) 10.9 (1.2) 3.3 (0.7) 47.9 (0.2) 45.8 (0.3) 39.4 (0.3) 47.6 (0.2) 46.9 (0.2) 46.8 (0.2) 44.4 (0.7) 41.8 (0.6) 29.7 (0.8) 3.9 (1) 45.6 (0.3) 46.9 (0.2) 48.1 (0.1) 47.7 (0.2) 44 (0.4) 48 (0.1) 39.3 (0.5) 46.7 (0.2) 20 (0.8) 4.8 (1) 42.4 (0.4) 46.6 (0.2) 48.2 (0.2) 48.1 (0.2) 44.4 (0.4) 48.2 (0.1) 38.2 (0.4) 46.6 (0.2) 14.7 (0.5) Table A.5. Averages and standard errors of clustering error rates for scenario S3 with n = 400, p = 1000, K = 2. Method mix PFC-ISO mix PFC RANSAC LSA LRSC SSC GPCA K-means X K-means X Y hclust X hclust X Y M1: n = 400, p = 1000, = 0.1Ip, Ip, AR(0.3), AR(0.5) ER(%) 6.1 (0.8) 3.8 (0.1) 47.7 (0.2) 42.7 (0.2) 48.3 (0.1) 47.9 (0.2) 48 (0.2) 39.9 (0.2) 39.6 (0.2) 42.5 (0.3) 35.8 (0.2) 14.2 (0.5) 14.3 (0.8) 48 (0.2) 47.8 (0.2) 47.9 (0.1) 47.9 (0.2) 47.9 (0.2) 44 (0.2) 43.6 (0.3) 47.9 (0.2) 45.4 (0.3) 9.1 (0.5) 47.9 (0.1) 47 (0.2) 48.2 (0.1) 48.1 (0.2) 48.1 (0.2) 41 (0.2) 40.9 (0.2) 47.6 (0.2) 40.6 (0.3) 7.8 (0.6) 48 (0.1) 44.3 (0.3) 48.1 (0.1) 48 (0.1) 48.2 (0.1) 40.6 (0.1) 39.6 (0.2) 46.3 (0.2) 37.7 (0.2) M2: n = 400, p = 1000, = 0.1Ip, Ip, AR(0.3), AR(0.5) ER(%) 6.9 (0.8) 11.1 (1.6) 40 (0.8) 18.7 (1.2) 48.1 (0.2) 16.9 (0.2) 48.2 (0.1) 46.7 (0.2) 40.4 (0.2) 45.3 (0.4) 38.1 (0.2) 21.7 (0.7) 19.9 (0.9) 48 (0.2) 47.9 (0.2) 48 (0.1) 47.9 (0.2) 47.8 (0.2) 47.7 (0.2) 45.2 (0.3) 48 (0.2) 47.2 (0.2) 17.6 (0.8) 47.5 (0.2) 42 (0.2) 47.9 (0.2) 47.6 (0.2) 48.1 (0.1) 46.3 (0.3) 40.8 (0.2) 47.5 (0.2) 47 (0.2) 16.7 (0.7) 46.3 (0.2) 38.4 (0.2) 47.8 (0.2) 46.8 (0.2) 47.9 (0.2) 41.4 (0.4) 39.6 (0.2) 44.3 (0.3) 44.9 (0.3) M3: n = 400, p = 1000, = 0.1Ip, Ip, AR(0.3), AR(0.5) ER(%) 17.9 (1.8) 22.6 (2.2) 33.1 (1.1) 11.1 (0.2) 47.6 (0.2) 12.6 (0.2) 48.1 (0.2) 39.9 (0.1) 39.8 (0.2) 39.7 (0.3) 35.2 (0.2) 20 (1.2) 13.8 (1.1) 47.6 (0.2) 46.5 (0.2) 47.9 (0.2) 47.3 (0.2) 47.8 (0.2) 42.9 (0.2) 42.1 (0.3) 47.7 (0.2) 43.9 (0.3) 13.1 (1.2) 47.7 (0.2) 45.5 (0.2) 48.1 (0.1) 47.5 (0.2) 48.1 (0.2) 44 (0.2) 40.2 (0.3) 47.5 (0.2) 42.6 (0.3) 11.5 (1.1) 47.2 (0.3) 44.2 (0.2) 48 (0.2) 46.9 (0.2) 48.1 (0.2) 44.8 (0.2) 39 (0.2) 47.4 (0.2) 39.4 (0.3) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Table A.6. Simulations results for scenario S1 with n = 400, p = 1000, K = 2. 
The table reports averages and standard errors of TPR, FPR, and subspace distance for each cluster. Method = I = AR(0.3) TPRw(%) FPRw(%) Dw 100 TPRw(%) FPRw(%) Dw 100 M1 mix PFC 95, 96.7 (1.7, 1.4) 2, 1.6 (0.6, 0.5) 30, 29.4 (2.2, 2.1) 98.7, 98.3 (0.7, 1) 0.2, 0.4 (0.2, 0.3) 35, 35 (1.4, 1.4) mix PFC-ISO 100, 100 (0, 0) 0.4, 0.4 (0, 0) 6.3, 6.4 (0.2, 0.2) Lasso SIRt 100, 100 (0, 0) 1.5, 1.7 (0.1, 0.1) 30.6, 31.7 (1, 1) 99.5, 99.7 (0.3, 0.2) 1.3, 1.1 (0.1, 0.1) 41.2, 41.1 (1.2, 1.1) Lasso SIRm 94.7, 96 (1.8, 1.5) 2.3, 2 (0.4, 0.3) 38.2, 37 (2.1, 2.1) 98.5, 97.7 (0.8, 1.1) 1.2, 1.2 (0.2, 0.2) 41.2, 41 (1.5, 1.4) SEAS-SIRt 100, 100 (0, 0) 0.5, 0.6 (0.1, 0.1) 7.5, 7.5 (0.4, 0.3) 100, 100 (0, 0) 0.1, 0.1 (0, 0) 13.9, 14.5 (0.3, 0.4) SEAS-SIRm 99.1, 99.5 (0.5, 0.4) 0.7, 0.8 (0.1, 0.2) 10.7, 10.6 (1.3, 1.3) 99.8, 99.3 (0.2, 0.5) 0.2, 0.2 (0.1, 0.1) 15, 15.8 (0.6, 1) SEAS-PFCt 100, 100 (0, 0) 1.2, 1.4 (0.2, 0.2) 8, 8 (0.4, 0.4) 100, 100 (0, 0) 0.2, 0.2 (0, 0) 13.5, 14.2 (0.3, 0.4) SEAS-PFCm 96.7, 96.8 (1.5, 1.4) 1.2, 1.6 (0.4, 0.3) 14.9, 13.9 (2.4, 2) 99.8, 99.7 (0.2, 0.3) 0.3, 0.5 (0.1, 0.2) 15.1, 15.7 (1, 1.2) M2 mix PFC 95.3, 94.5 (1.4, 1.5) 0.7, 0.8 (0.1, 0.1) 31.7, 31.8 (1.9, 1.9) 98.8, 100 (0.5, 0) 0, 1.3 (0, 0.1) 37.3, 24.1 (1.4, 0.9) mix PFC-ISO 98.7, 98.8 (0.8, 0.6) 0.5, 0.8 (0.1, 0.2) 9.2, 10 (1.3, 1.6) Lasso SIRt 100, 100 (0, 0) 1.5, 1.3 (0.1, 0.1) 30.6, 29.8 (1, 1) 99.5, 100 (0.3, 0) 1.3, 2.4 (0.1, 0.2) 41.2, 29.9 (1.2, 1) Lasso SIRm 92.5, 91.7 (1.7, 1.9) 1.7, 1.8 (0.2, 0.2) 40.9, 39.2 (2.1, 2.1) 97, 99 (1.1, 0.5) 1.1, 2.4 (0.1, 0.2) 43.5, 32.1 (1.5, 1.3) SEAS-SIRt 100, 100 (0, 0) 0.5, 0.6 (0.1, 0.1) 7.5, 7.8 (0.4, 0.3) 100, 100 (0, 0) 0.1, 1.8 (0, 0.1) 13.9, 30.9 (0.3, 0.5) SEAS-SIRm 93.8, 92.7 (1.6, 1.7) 0.4, 0.5 (0.1, 0.1) 19.4, 19.2 (2.4, 2.4) 98.2, 97.3 (0.8, 1.1) 0.1, 1.8 (0, 0.2) 18, 33.8 (1.3, 1.1) SEAS-PFCt 100, 100 (0, 0) 1.2, 1.3 (0.2, 0.2) 8, 8.1 (0.4, 0.3) 100, 100 (0, 0) 0.2, 0.9 (0, 0.1) 13.5, 28.4 (0.3, 0.5) SEAS-PFCm 92.8, 92.3 (1.7, 1.7) 1.1, 0.9 (0.2, 0.2) 19.6, 19.6 (2.4, 2.4) 97.5, 97.7 (1.1, 1) 0.2, 1 (0, 0.1) 17.7, 30.9 (1.4, 1.1) M3 mix PFC 94.2, 94.3 (1.7, 1.6) 0.2, 0.1 (0.1, 0.1) 29.5, 29.6 (2.2, 2.1) 96.8, 98.5 (0.9, 0.6) 0.3, 0.4 (0.2, 0.1) 30, 35.7 (1.7, 1.4) mix PFC-ISO 98.8, 97.2 (1, 1.6) 0.7, 0.7 (0.1, 0.2) 21.4, 22.1 (1.7, 2) Lasso SIRt 100, 100 (0, 0) 6.6, 6.6 (0.2, 0.2) 51.1, 50.5 (0.8, 0.9) 99.8, 100 (0.2, 0) 9.8, 8.4 (0.2, 0.2) 69.3, 68.5 (0.5, 0.5) Lasso SIRm 94.8, 96.3 (1.5, 1.2) 7.6, 7.2 (0.3, 0.3) 59, 57.4 (1.4, 1.5) 98.3, 98.7 (0.6, 0.6) 9.3, 8.1 (0.2, 0.2) 71, 70.3 (0.7, 0.7) SEAS-SIRt 100, 100 (0, 0) 2.2, 2 (0.3, 0.3) 9.8, 9.6 (0.3, 0.3) 100, 100 (0, 0) 0.2, 0.3 (0, 0) 18, 31.2 (0.4, 0.3) SEAS-SIRm 94.1, 92.6 (1.5, 1.7) 1.4, 2.2 (0.2, 0.3) 21.8, 23.2 (2.3, 2.4) 95.3, 95 (1.6, 1.6) 0.1, 0.2 (0, 0) 26.7, 37.7 (2, 1.6) SEAS-PFCt 100, 100 (0, 0) 1.5, 1.9 (0.2, 0.3) 9.1, 9.7 (0.3, 0.3) 100, 100 (0, 0) 0.2, 0.3 (0, 0) 17.7, 30.8 (0.4, 0.3) SEAS-PFCm 91.5, 89.2 (1.9, 2.2) 1.2, 1.3 (0.2, 0.2) 22.4, 23.6 (2.5, 2.6) 94.3, 93 (1.7, 1.9) 0.2, 0.2 (0.1, 0) 26.7, 37.6 (2, 1.6) Table A.7. Simulations results for scenario S4 with n = 400K, p = 1000, K = 3, 5. The table reports averages and standard errors of TPR, FPR, and subspace distance, calculated as the mean values across K clusters. 
Method = I = AR(0.3) TPR(%) FPR(%) ] TPR(%) ] FPR(%) D 100 TPR(%) FPR(%) ] TPR(%) ] FPR(%) D 100 M4 K = 3 mix PFC 99.6 (0.1) 0 (0) 100 (0) 0 (0) 27.2 (0.4) 86.7 (0.8) 0.3 (0.1) 100 (0) 1 (0.2) 45.5 (0.6) mix PFC-ISO 100 (0) 0 (0) 100 (0) 0.1 (0) 14.7 (0.2) Lasso SIRt 100 (0) 6.1 (0.1) 100 (0) 17.3 (0.3) 53.1 (0.4) 99.4 (0.1) 6 (0.1) 100 (0) 17 (0.2) 61.2 (0.3) Lasso SIRm 100 (0) 6.1 (0.1) 100 (0) 17.2 (0.3) 52.9 (0.4) 99.4 (0.2) 6.1 (0.1) 100 (0) 17.4 (0.3) 61.7 (0.3) SEAS-SIRt 100 (0) 2.5 (0.1) 100 (0) 7.5 (0.4) 11.3 (0.2) 92 (0.6) 0.3 (0.1) 100 (0) 0.9 (0.2) 43.8 (0.7) SEAS-SIRm 100 (0) 2.3 (0.1) 100 (0) 6.7 (0.4) 12.1 (0.2) 90.4 (0.6) 0.3 (0.1) 100 (0) 0.8 (0.2) 45.4 (0.7) SEAS-PFCt 100 (0) 1 (0.1) 100 (0) 3 (0.3) 9.7 (0.2) 90.3 (0.8) 0.1 (0) 100 (0) 0.3 (0) 44 (0.8) SEAS-PFCm 100 (0) 0.8 (0.1) 100 (0) 2.4 (0.3) 10.5 (0.2) 89.4 (0.8) 0.1 (0) 100 (0) 0.3 (0) 45.6 (0.8) M4 K = 5 mix PFC 98 (0.4) 0.1 (0) 100 (0) 0.4 (0.1) 32.3 (1) 88 (0.7) 0.8 (0.1) 100 (0) 4 (0.4) 50 (1.2) mix PFC-ISO 97.2 (0.6) 0.2 (0) 100 (0) 0.9 (0.2) 25.6 (1.4) Lasso SIRt 100 (0) 6.1 (0.1) 100 (0) 27 (0.3) 53.6 (0.3) 99.8 (0.1) 6.2 (0.1) 100 (0) 27.4 (0.4) 62.5 (0.3) Lasso SIRm 98.3 (0.5) 6.3 (0.1) 100 (0) 27.6 (0.5) 57.5 (1.1) 93.7 (1) 6.8 (0.2) 99.8 (0.1) 29.6 (0.6) 69.9 (1.2) SEAS-SIRt 100 (0) 2.5 (0.1) 100 (0) 11.8 (0.6) 11.4 (0.2) 94.6 (0.4) 0.4 (0) 100 (0) 1.8 (0.2) 42.9 (0.5) SEAS-SIRm 99.3 (0.2) 2.6 (0.1) 100 (0) 12.2 (0.5) 17.3 (1.3) 88.6 (0.8) 0.8 (0.1) 100 (0) 3.8 (0.5) 50.3 (1.1) SEAS-PFCt 100 (0) 1.2 (0.1) 100 (0) 5.9 (0.4) 9.6 (0.2) 91.8 (0.5) 0.3 (0) 100 (0) 1.2 (0.1) 44.3 (0.5) SEAS-PFCm 98.7 (0.3) 1.1 (0.1) 100 (0) 5.4 (0.4) 15.5 (1.3) 86.3 (0.9) 0.5 (0.1) 100 (0) 2.5 (0.5) 50.9 (1) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Table A.8. Simulations results for scenario S1 with n = 400, p = 1000, K = 2. The table reports averages and standard errors of TPR, FPR, and subspace distance for each cluster. 
Method = 0.1I = AR(0.5) TPRw(%) FPRw(%) Dw 100 TPRw(%) FPRw(%) Dw 100 M1 mix PFC 97.5, 98 (0.9, 0.7) 0, 0 (0, 0) 39, 39.2 (1.2, 1.3) 96.3, 96.5 (1.1, 0.8) 0.9, 1.1 (0.1, 0.3) 50.3, 48.3 (1.3, 1.2) mix PFC-ISO 100, 100 (0, 0) 0.1, 0.1 (0, 0) 1.7, 1.6 (0.1, 0.1) Lasso SIRt 98.8, 99.5 (0.4, 0.3) 0.8, 0.9 (0.1, 0.1) 43.9, 44.8 (1.2, 1.1) 96.2, 95.3 (0.7, 0.8) 1.1, 1 (0.1, 0.1) 52.8, 51.7 (1.2, 1.2) Lasso SIRm 98, 98.5 (0.7, 0.7) 0.8, 0.9 (0.1, 0.1) 45.2, 45.5 (1.3, 1.3) 96, 94.5 (0.8, 1.2) 1, 1.2 (0.1, 0.2) 51.7, 52.8 (1.2, 1.4) SEAS-SIRt 100, 100 (0, 0) 0.1, 0.1 (0, 0) 2.7, 2.5 (0.2, 0.2) 100, 100 (0, 0) 1.1, 1 (0.1, 0.1) 19.7, 19.7 (0.4, 0.4) SEAS-SIRm 100, 100 (0, 0) 0.1, 0.2 (0, 0.1) 2.9, 3 (0.2, 0.2) 99.8, 99.8 (0.2, 0.2) 0.8, 1 (0.1, 0.2) 20.5, 20.6 (0.6, 0.8) SEAS-PFCt 100, 100 (0, 0) 0, 0 (0, 0) 2.8, 2.4 (0.2, 0.2) 100, 100 (0, 0) 0.1, 0.1 (0, 0) 19.8, 19.5 (0.5, 0.4) SEAS-PFCm 100, 100 (0, 0) 0.2, 0 (0.1, 0) 3, 3.1 (0.3, 0.3) 100, 99.2 (0, 0.6) 0.1, 0.3 (0, 0.2) 20.5, 20.6 (0.6, 1) M2 mix PFC 87.8, 87.5 (1.8, 1.9) 0.2, 0.2 (0.1, 0.1) 50.6, 49.4 (1.5, 1.7) 95.5, 100 (1, 0) 0.8, 2.6 (0.1, 0.2) 54.2, 34.1 (1.6, 1) mix PFC-ISO 100, 99.8 (0, 0.2) 0.2, 0.2 (0, 0) 9.8, 10.3 (2, 2.1) Lasso SIRt 98.8, 99.3 (0.4, 0.3) 0.8, 1.1 (0.1, 0.1) 44.3, 45.1 (1.2, 1.2) 96.3, 97.7 (0.7, 0.6) 1.1, 3.8 (0.1, 0.2) 52.9, 40.4 (1.2, 1.1) Lasso SIRm 84.7, 84.7 (2.3, 2.3) 1.2, 1.1 (0.1, 0.1) 57.2, 55 (1.9, 1.7) 91.7, 96.3 (1.4, 0.9) 1, 4 (0.1, 0.2) 55.3, 43.7 (1.4, 1.4) SEAS-SIRt 100, 100 (0, 0) 0.1, 0.1 (0, 0) 2.9, 2.6 (0.2, 0.1) 100, 89.5 (0, 1.3) 1.1, 2.5 (0.1, 0.2) 19.7, 63.8 (0.4, 0.5) SEAS-SIRm 86.3, 86.2 (2.2, 2.2) 0.1, 0.1 (0, 0) 23.1, 23.1 (3.1, 3.1) 99, 89.8 (0.5, 1.5) 0.8, 2.4 (0.1, 0.2) 23.6, 65.1 (1.2, 0.6) SEAS-PFCt 100, 100 (0, 0) 0, 0 (0, 0) 2.8, 2.6 (0.2, 0.2) 100, 95.7 (0, 0.9) 0.1, 1.9 (0, 0.2) 19.8, 59.3 (0.5, 0.5) SEAS-PFCm 85.5, 85.3 (2.3, 2.3) 0, 0 (0, 0) 23.3, 23.1 (3.1, 3.1) 98.2, 92 (0.7, 1.4) 0.1, 1.7 (0, 0.2) 23.8, 60.6 (1.3, 0.6) M3 mix PFC 92.3, 92.5 (2, 1.9) 0, 0.3 (0, 0.1) 39.2, 37.9 (1.8, 1.9) 96.7, 97.5 (1.1, 0.8) 0.5, 0.9 (0.1, 0.3) 40.5, 50.1 (1.2, 1.2) mix PFC-ISO 100, 100 (0, 0) 0.4, 0.4 (0, 0) 10.8, 10.7 (1.5, 1.5) Lasso SIRt 100, 100 (0, 0) 4.5, 4.8 (0.2, 0.2) 57.3, 58.5 (0.8, 0.8) 98.7, 98.5 (0.5, 0.5) 8.8, 6.4 (0.2, 0.3) 77.9, 75.3 (0.4, 0.4) Lasso SIRm 95.5, 95.3 (1.7, 1.7) 4.6, 4.7 (0.2, 0.2) 59.8, 60.1 (1.3, 1.2) 96.5, 98.2 (0.9, 0.7) 9.1, 6.8 (0.2, 0.3) 78.5, 76 (0.5, 0.4) SEAS-SIRt 100, 100 (0, 0) 0, 0 (0, 0) 3.8, 3.6 (0.2, 0.2) 100, 87.5 (0, 1.3) 0.8, 0.6 (0.1, 0.2) 34.3, 53.4 (0.6, 0.9) SEAS-SIRm 95.2, 96 (1.7, 1.5) 0, 0 (0, 0) 9.9, 9.7 (2.1, 2) 99.5, 89 (0.3, 1.3) 0.7, 0.8 (0.1, 0.2) 36.6, 53.3 (1, 0.9) SEAS-PFCt 100, 100 (0, 0) 0, 0 (0, 0) 3, 2.9 (0.1, 0.1) 100, 83 (0, 1.5) 0.1, 0.1 (0, 0) 33.1, 54.1 (0.5, 0.9) SEAS-PFCm 94.8, 94.7 (1.8, 1.8) 0, 0 (0, 0) 9.2, 9.1 (2.1, 2.2) 98.3, 86.3 (1, 1.6) 0.1, 0.2 (0, 0.1) 35.7, 53.2 (1.1, 1) Table A.9. Simulations results for scenario S4 with n = 400K , p = 1000, K = 3, 5. The table reports averages and standard errors of TPR, FPR, and subspace distance, calculated as the mean values across K clusters. 
Method = 0.1I = AR(0.5) TPR(%) FPR(%) ] TPR(%) ] FPR(%) D 100 TPR(%) FPR(%) ] TPR(%) ] FPR(%) D 100 M4 K = 3 mix PFC 65.9 (0.9) 0 (0) 93.5 (0.8) 0 (0) 62.9 (0.7) 77.9 (0.7) 0.7 (0.1) 99.7 (0.2) 2 (0.2) 57 (0.5) mix PFC-ISO 100 (0) 0.1 (0) 100 (0) 0.3 (0) 3.3 (0.1) Lasso SIRt 98.7 (0.2) 4.4 (0.1) 100 (0) 12.7 (0.3) 68 (0.3) 96.5 (0.3) 6.1 (0.1) 100 (0) 17.3 (0.3) 70.2 (0.3) Lasso SIRm 98.8 (0.2) 4.5 (0.1) 100 (0) 12.9 (0.3) 68.2 (0.4) 96.4 (0.3) 6.1 (0.1) 100 (0) 17.2 (0.3) 70.2 (0.3) SEAS-SIRt 99.9 (0) 0 (0) 100 (0) 0 (0) 12.5 (0.5) 76.4 (0.5) 0.6 (0) 100 (0) 1.7 (0.1) 61.8 (0.2) SEAS-SIRm 100 (0) 0 (0) 100 (0) 0 (0) 12 (0.4) 75.3 (0.5) 0.5 (0) 100 (0) 1.5 (0.1) 62.2 (0.2) SEAS-PFCt 100 (0) 0 (0) 100 (0) 0 (0) 9.1 (0.3) 72 (0.4) 0 (0) 100 (0) 0.1 (0) 61.9 (0.1) SEAS-PFCm 100 (0) 0 (0) 100 (0) 0 (0) 8.9 (0.3) 72.4 (0.5) 0 (0) 100 (0) 0.1 (0) 61.9 (0.1) M4 K = 5 mix PFC 65.7 (0.8) 0 (0) 97.8 (0.4) 0 (0) 64.6 (0.7) 76.8 (0.7) 1 (0.1) 100 (0) 5.1 (0.4) 65.6 (0.9) mix PFC-ISO 100 (0) 0.1 (0) 100 (0) 0.6 (0) 3.8 (0.4) Lasso SIRt 98.9 (0.1) 4.4 (0.1) 100 (0) 19.9 (0.3) 67.7 (0.3) 97.5 (0.2) 6.1 (0.1) 100 (0) 27.1 (0.4) 70.7 (0.2) Lasso SIRm 98.8 (0.2) 4.4 (0.1) 100 (0) 20.3 (0.3) 68 (0.3) 84.8 (1.4) 6.5 (0.1) 99.9 (0.1) 28.5 (0.5) 80.1 (0.9) SEAS-SIRt 99.9 (0) 0 (0) 100 (0) 0 (0) 10 (0.3) 80.3 (0.4) 0.4 (0) 100 (0) 1.7 (0.1) 62.8 (0.1) SEAS-SIRm 99.8 (0.1) 0 (0) 100 (0) 0 (0) 10.4 (0.4) 74.6 (0.8) 0.8 (0.1) 100 (0) 4 (0.5) 68.1 (0.7) SEAS-PFCt 100 (0) 0 (0) 100 (0) 0 (0) 7.4 (0.3) 77.3 (0.3) 0 (0) 100 (0) 0.2 (0) 61.5 (0.1) SEAS-PFCm 99.9 (0.1) 0 (0) 100 (0) 0 (0) 7.7 (0.3) 74.5 (0.7) 0.8 (0.1) 100 (0) 3.7 (0.7) 67 (0.6) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Table A.10. Simulations results for scenario S2 with n = 400, p = 1000, K = 2. The table reports averages and standard errors of TPR, FPR, and subspace distance for each cluster. 
Method = 0.1I = I TPR(%) FPR(%) D 100 TPR(%) FPR(%) D 100 M1 mix PFC 97.8, 97.3 (0.7, 0.9) 0, 0 (0, 0) 35.6, 35.9 (1.3, 1.4) 98.2, 98.8 (1, 0.6) 0.5, 1 (0.1, 0.3) 23.7, 23.9 (1.5, 1.6) mix PFC-ISO 100, 100 (0, 0) 0.2, 0.3 (0, 0) 1.9, 1.9 (0.1, 0.1) 100, 99.8 (0, 0.2) 0.8, 0.8 (0.2, 0.2) 8, 8.5 (0.8, 0.9) Lasso SIRt 100, 96.3 (0, 0.7) 0.8, 0.8 (0.1, 0.1) 37.3, 52 (1.2, 1.3) 100, 100 (0, 0) 1.5, 1.4 (0.1, 0.1) 24.9, 41.1 (0.9, 1.2) Lasso SIRm 97, 98.2 (0.9, 0.7) 0.9, 0.8 (0.1, 0.1) 45.1, 47.4 (1.7, 1.7) 98.8, 98.8 (0.6, 0.7) 1.7, 1.6 (0.2, 0.2) 36.1, 37 (1.7, 1.9) SEAS-SIRt 100, 100 (0, 0) 0, 0.1 (0, 0) 2.1, 3.3 (0.1, 0.2) 100, 100 (0, 0) 1.5, 0.4 (0.2, 0.1) 6.1, 9.8 (0.2, 0.4) SEAS-SIRm 100, 100 (0, 0) 0.1, 0.1 (0.1, 0) 2.8, 3.1 (0.2, 0.2) 99.8, 99.3 (0.2, 0.7) 1, 1.2 (0.2, 0.2) 9.4, 10.2 (0.7, 1.1) SEAS-PFCt 100, 100 (0, 0) 0, 0 (0, 0) 2, 3 (0.1, 0.2) 100, 100 (0, 0) 1.4, 0.8 (0.3, 0.1) 6.4, 9.1 (0.3, 0.3) SEAS-PFCm 100, 100 (0, 0) 0.1, 0.1 (0.1, 0) 2.7, 3.4 (0.2, 0.5) 98.7, 98.2 (1.1, 1.3) 0.9, 1.1 (0.2, 0.2) 9.9, 10.8 (1.3, 1.5) M2 mix PFC 89.8, 89 (1.7, 1.6) 0.2, 0.3 (0.1, 0.1) 43.4, 50.8 (1.9, 1.6) 97.2, 96 (0.9, 1.3) 0.9, 0.9 (0.1, 0.1) 26.6, 37 (1.9, 1.9) mix PFC-ISO 100, 97.5 (0, 1.1) 0.4, 0.3 (0.1, 0) 9.6, 17.1 (1.7, 2.9) 100, 98.7 (0, 0.8) 0.2, 1.8 (0, 0.3) 6.4, 13.3 (0.7, 1.5) Lasso SIRt 100, 96.5 (0, 0.7) 0.8, 0.8 (0.1, 0.1) 37.2, 53.4 (1.2, 1.3) 100, 100 (0, 0) 1.5, 1.4 (0.1, 0.1) 24.9, 41.3 (0.9, 1.3) Lasso SIRm 89.2, 83.7 (1.9, 2.2) 0.9, 0.9 (0.1, 0.1) 47.5, 61 (2, 1.6) 95.7, 92.2 (1.3, 1.7) 1.7, 1.7 (0.2, 0.1) 34, 53.5 (2, 2) SEAS-SIRt 100, 100 (0, 0) 0, 0.1 (0, 0) 2.1, 3.5 (0.1, 0.2) 100, 100 (0, 0) 1.5, 0.6 (0.2, 0.1) 6.1, 10.4 (0.2, 0.4) SEAS-SIRm 90.3, 87.7 (1.9, 2.2) 0, 0 (0, 0) 20.2, 21.5 (3, 3) 97.3, 93.9 (0.7, 1.5) 1.4, 0.4 (0.3, 0.1) 16.7, 22.2 (2.2, 2.3) SEAS-PFCt 100, 100 (0, 0) 0, 0 (0, 0) 2, 3.2 (0.1, 0.2) 100, 100 (0, 0) 1.4, 0.8 (0.3, 0.1) 6.4, 9.8 (0.3, 0.4) SEAS-PFCm 89.8, 87 (1.9, 2.2) 0, 0 (0, 0) 19.9, 21.5 (2.9, 3) 95.7, 92 (1.2, 1.9) 1, 0.7 (0.2, 0.1) 17.8, 23.5 (2.3, 2.5) M3 mix PFC 96.3, 93.2 (1.2, 1.2) 0.1, 0.1 (0.1, 0) 35.9, 40.1 (1.4, 1.8) 98.8, 96.8 (0.4, 1.2) 0.1, 0.2 (0.1, 0.1) 19.4, 27 (1.4, 1.4) mix PFC-ISO 100, 100 (0, 0) 0.5, 0.6 (0.1, 0) 10.9, 14.3 (1.3, 1.6) 100, 92.7 (0, 2.5) 0.4, 1.7 (0.1, 0.1) 17.1, 31.7 (1.3, 2.3) Lasso SIRt 100, 99.3 (0, 0.3) 4.2, 4 (0.2, 0.1) 48.8, 69.1 (0.8, 0.7) 100, 100 (0, 0) 7.4, 5.2 (0.2, 0.1) 42.5, 63.1 (0.8, 0.8) Lasso SIRm 98.5, 98.3 (0.8, 0.6) 4.6, 4.2 (0.2, 0.2) 51.8, 70.5 (1.3, 1) 98.8, 99.5 (0.6, 0.3) 7.5, 5.6 (0.2, 0.2) 45.9, 65 (1.2, 0.9) SEAS-SIRt 100, 100 (0, 0) 0, 0.4 (0, 0.1) 3, 4.7 (0.2, 0.3) 100, 100 (0, 0) 2, 1.1 (0.2, 0.1) 7.8, 12.9 (0.2, 0.3) SEAS-SIRm 99.3, 97.3 (0.5, 1.2) 0, 0.3 (0, 0.1) 6.7, 8.8 (1.3, 1.6) 99.2, 98 (0.5, 0.8) 1.6, 1.1 (0.2, 0.1) 12.3, 18.3 (1.5, 1.6) SEAS-PFCt 100, 100 (0, 0) 0, 0 (0, 0) 2.3, 3.7 (0.1, 0.2) 100, 100 (0, 0) 1.5, 0.8 (0.3, 0.1) 7.5, 11.3 (0.3, 0.3) SEAS-PFCm 98.2, 97.2 (1.1, 1.2) 0.1, 0 (0.1, 0) 7, 8.3 (1.7, 1.7) 97.5, 97 (1.2, 1.1) 1.1, 0.7 (0.2, 0.1) 12.7, 17.1 (1.7, 1.6) Table A.11. Simulations results for scenario S2 with n = 400, p = 1000, K = 2. The table reports averages and standard errors of TPR, FPR, and subspace distance for each cluster. 
Method = AR(0.3) = AR(0.5) TPR(%) FPR(%) D 100 TPR(%) FPR(%) D 100 M1 mix PFC 96, 98 (1.4, 0.9) 0.1, 0 (0, 0) 35.7, 34 (1.6, 1.3) 96.2, 95 (0.8, 1.1) 0.8, 0.7 (0.1, 0.1) 46.2, 44.4 (1.4, 1.4) Lasso SIRt 100, 97 (0, 0.6) 1.2, 1.2 (0.1, 0.1) 33.2, 50.3 (1.1, 1.4) 99.7, 91.3 (0.2, 1) 0.9, 0.9 (0.1, 0.1) 40.7, 60.9 (1.2, 1.1) Lasso SIRm 97.3, 98.8 (0.9, 0.5) 1.2, 1.5 (0.1, 0.2) 48.3, 44.9 (2, 1.8) 95.5, 94.8 (0.8, 0.8) 0.9, 0.7 (0.1, 0.1) 52.8, 50.2 (1.7, 1.7) SEAS-SIRt 100, 100 (0, 0) 1.2, 0.5 (0.2, 0.1) 13.5, 15.5 (0.3, 0.4) 100, 100 (0, 0) 0.3, 0.1 (0, 0) 18.8, 19.8 (0.2, 0.4) SEAS-SIRm 100, 100 (0, 0) 0.9, 1.2 (0.2, 0.2) 15.9, 15.1 (0.6, 0.4) 100, 100 (0, 0) 0.2, 0.2 (0, 0) 19.5, 20.1 (0.3, 0.5) SEAS-PFCt 100, 100 (0, 0) 0.1, 0.8 (0, 0.1) 13.4, 14.9 (0.3, 0.4) 100, 100 (0, 0) 0.1, 0.1 (0, 0) 18.7, 19.7 (0.2, 0.4) SEAS-PFCm 99.3, 100 (0.7, 0) 0.4, 0.3 (0.1, 0.1) 16.3, 15.8 (1, 1.1) 100, 100 (0, 0) 0.1, 0.1 (0, 0) 19.4, 20.1 (0.3, 0.4) M2 mix PFC 98.5, 99.8 (0.8, 0.2) 0, 1.4 (0, 0.2) 32.4, 32 (1.3, 1) 96.3, 97.3 (1.1, 0.9) 0.5, 3 (0, 0.2) 41, 47.5 (1.3, 1.2) Lasso SIRt 100, 100 (0, 0) 1.2, 2.3 (0.1, 0.1) 33.2, 42.2 (1.1, 1.1) 99.7, 88.8 (0.2, 1.1) 0.9, 3.4 (0.1, 0.1) 40.8, 61.4 (1.2, 1.2) Lasso SIRm 97.7, 97.8 (1, 1) 1, 2 (0.1, 0.1) 35.2, 48.7 (1.3, 1.4) 95.8, 85.7 (1.3, 1.5) 1.1, 3 (0.1, 0.1) 45.9, 64.5 (1.6, 1.2) SEAS-SIRt 100, 100 (0, 0) 1.2, 1.4 (0.2, 0.1) 13.5, 38.1 (0.3, 0.5) 100, 74.7 (0, 1.4) 0.3, 1.2 (0, 0.1) 19, 69.9 (0.2, 0.5) SEAS-SIRm 99.3, 97.8 (0.3, 0.8) 1.7, 1.1 (0.3, 0.1) 16.2, 41.3 (0.9, 1) 99.7, 74.2 (0.2, 1.5) 0.2, 1.2 (0, 0.1) 20.6, 71.3 (0.6, 0.6) SEAS-PFCt 100, 100 (0, 0) 0.1, 0.8 (0, 0.1) 13.4, 33.4 (0.3, 0.5) 100, 85.2 (0, 1.4) 0.1, 1.2 (0, 0.1) 18.9, 64.9 (0.2, 0.7) SEAS-PFCm 99.3, 97.8 (0.3, 1) 0.1, 0.8 (0, 0.1) 16, 36 (1, 1) 99.3, 83.3 (0.3, 1.7) 0.1, 1.2 (0, 0.1) 20.8, 66 (0.6, 0.7) M3 mix PFC 98.8, 98.5 (0.5, 0.8) 0.3, 0.1 (0.2, 0.1) 25.1, 36.6 (1.5, 1.1) 96.2, 95.7 (1.4, 1.1) 0.1, 0.8 (0, 0.2) 37, 54.3 (1.5, 1) Lasso SIRt 100, 99.3 (0, 0.3) 12.1, 5.6 (0.2, 0.1) 62.8, 75.9 (0.5, 0.4) 99.8, 95.2 (0.2, 0.8) 10.9, 4.9 (0.3, 0.1) 74.1, 80.2 (0.3, 0.4) Lasso SIRm 98.2, 98.3 (0.9, 0.6) 12, 5.8 (0.2, 0.2) 64.8, 76.5 (0.8, 0.5) 97.2, 93.8 (1.1, 1) 11, 5.1 (0.3, 0.2) 75.3, 80.6 (0.6, 0.5) SEAS-SIRt 100, 99.8 (0, 0.2) 0.6, 0.8 (0.2, 0.1) 18.3, 33.1 (0.4, 0.4) 100, 86 (0, 1.3) 0.4, 0.1 (0, 0) 32.7, 54.5 (0.5, 0.9) SEAS-SIRm 97.3, 96.7 (1.2, 1.3) 1.1, 1 (0.3, 0.1) 22.6, 36.1 (1.4, 1.2) 97.5, 84.3 (1, 1.5) 0.3, 0.1 (0, 0) 36.5, 56.1 (1.2, 1) SEAS-PFCt 100, 99.2 (0, 0.5) 0.1, 0.7 (0, 0.1) 17.6, 32.5 (0.3, 0.6) 100, 83.8 (0, 1.4) 0.1, 0.1 (0, 0) 32, 54.6 (0.5, 0.9) SEAS-PFCm 96.8, 95.3 (1.4, 1.6) 0.1, 0.8 (0, 0.1) 21.7, 36.3 (1.5, 1.3) 96.5, 81.8 (1.4, 1.6) 0.1, 0.1 (0, 0) 35.8, 56.3 (1.3, 1) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Table A.12. Simulations results for scenario S3 with n = 400, p = 1000, K = 2. The table reports averages and standard errors of TPR, FPR, and subspace distance for each cluster. 
Method = 0.1I = I TPR(%) FPR(%) D 100 TPR(%) FPR(%) D 100 M1 mix PFC 96.8, 95.8 (1.2, 1.4) 0, 0 (0, 0) 38, 37.7 (1.3, 1.4) 95.7, 95.7 (1.6, 1.7) 1.4, 1.3 (0.3, 0.3) 37.8, 36 (1.9, 2) mix PFC-ISO 100, 100 (0, 0) 0.6, 0.6 (0, 0) 2.6, 2.7 (0.1, 0.1) 99.7, 99.7 (0.3, 0.3) 1.4, 1.5 (0.1, 0.2) 13.2, 13.5 (1, 1.1) Lasso SIRt 100, 99.5 (0, 0.3) 1, 1.2 (0.1, 0.1) 38.1, 43.8 (1.1, 1.3) 100, 100 (0, 0) 2.5, 2.6 (0.2, 0.2) 36.5, 40.8 (1.2, 1.2) Lasso SIRm 99.7, 99.7 (0.2, 0.2) 1.2, 0.9 (0.1, 0.1) 43.1, 39.5 (1.2, 1.2) 96.7, 96.2 (1.6, 1.7) 3.1, 3.3 (0.3, 0.3) 45.8, 45.3 (1.8, 1.9) SEAS-SIRt 100, 100 (0, 0) 0, 0 (0, 0) 3.6, 3.7 (0.2, 0.2) 100, 100 (0, 0) 0.9, 1.1 (0.1, 0.2) 11.1, 13.2 (0.4, 0.4) SEAS-SIRm 100, 100 (0, 0) 0, 0 (0, 0) 3.8, 3.9 (0.2, 0.2) 98.1, 98.3 (1.3, 1.2) 1.6, 1.6 (0.2, 0.2) 17.1, 16 (1.3, 1.4) SEAS-PFCt 100, 100 (0, 0) 0.2, 0.3 (0, 0.1) 3.4, 3.5 (0.2, 0.2) 100, 100 (0, 0) 0.9, 1.1 (0.2, 0.2) 10.8, 12.9 (0.4, 0.4) SEAS-PFCm 100, 100 (0, 0) 0.3, 0.3 (0.1, 0) 3.7, 3.5 (0.2, 0.2) 97, 96.8 (1.7, 1.6) 1.6, 1.7 (0.3, 0.3) 19.5, 18.1 (1.8, 1.9) M2 mix PFC 91.7, 90.5 (1.6, 1.7) 0.8, 0.6 (0.3, 0.4) 42.7, 45.9 (1.8, 1.9) 95.3, 95 (1.6, 1.4) 1.1, 1.7 (0.2, 0.2) 36, 39.8 (1.7, 1.7) mix PFC-ISO 100, 99.5 (0, 0.5) 4.4, 0.8 (1.4, 0) 8.5, 7 (2.1, 1.5) 98.3, 97.7 (0.8, 0.9) 1.7, 2.5 (0.2, 0.4) 16.9, 19.3 (1.5, 2) Lasso SIRt 100, 99.7 (0, 0.2) 1, 1.2 (0.1, 0.1) 37.2, 43 (1.1, 1.3) 100, 100 (0, 0) 2.5, 2.6 (0.2, 0.2) 36.5, 40.1 (1.2, 1.4) Lasso SIRm 92.8, 92.3 (1.7, 1.6) 1.4, 1.7 (0.2, 0.2) 46.1, 49.8 (2, 1.9) 95.8, 95.7 (1.3, 1.4) 2.8, 2.8 (0.2, 0.2) 47.1, 48.9 (1.5, 1.5) SEAS-SIRt 100, 100 (0, 0) 0, 0 (0, 0) 3.6, 3.8 (0.2, 0.2) 100, 100 (0, 0) 0.9, 1.1 (0.1, 0.2) 11.1, 12.9 (0.4, 0.4) SEAS-SIRm 94.3, 95.8 (1.5, 1.3) 0.1, 0 (0, 0) 14.5, 14 (2.5, 2.4) 97, 96.8 (1.2, 1.2) 1, 1.1 (0.1, 0.2) 20.7, 22.2 (1.7, 1.6) SEAS-PFCt 100, 100 (0, 0) 0.2, 0.3 (0, 0.1) 3.3, 3.6 (0.2, 0.2) 100, 100 (0, 0) 0.9, 0.9 (0.2, 0.2) 10.8, 12.3 (0.4, 0.5) SEAS-PFCm 92.3, 94.7 (1.9, 1.5) 0.2, 0.3 (0.1, 0.1) 14.8, 14 (2.6, 2.4) 96.3, 95.7 (1.3, 1.5) 0.9, 1.2 (0.2, 0.2) 21.1, 22.7 (1.7, 1.8) M3 mix PFC 40.7, 42.5 (1.4, 1.1) 2.6, 4.5 (0.7, 0.9) 80.3, 81 (0.8, 0.9) 49.7, 49.2 (1.2, 1.1) 2, 2.2 (0.2, 0.2) 73.7, 74 (0.4, 0.5) mix PFC-ISO 82.8, 84.3 (2.1, 2) 1.1, 0.9 (0.3, 0.1) 48.8, 48.7 (2.4, 2.2) 54, 53.6 (1.4, 1.5) 4.6, 4.1 (0.2, 0.1) 73.2, 73 (0.6, 0.6) Lasso SIRt 98.7, 99.8 (0.6, 0.2) 8.3, 8.3 (0.2, 0.3) 67.6, 67.3 (0.9, 1) 69, 67 (1.5, 1.5) 10.1, 10.5 (0.2, 0.1) 73.5, 73.9 (0.3, 0.3) Lasso SIRm 69.5, 70 (3.2, 3.3) 9.9, 9.3 (0.3, 0.2) 81.6, 80.1 (1.2, 1.3) 59, 60.8 (1.3, 1.8) 10.5, 10.9 (0.2, 0.2) 76.3, 76.4 (0.7, 0.7) SEAS-SIRt 94.7, 95.8 (1.4, 1.2) 0, 0 (0, 0) 42.9, 40.2 (1.5, 1.5) 56.2, 56.2 (1.2, 1.1) 1.7, 1.8 (0.3, 0.3) 70.7, 70.6 (0.1, 0.1) SEAS-SIRm 68.3, 67.7 (2.9, 3) 0.5, 0.3 (0.2, 0.1) 65.4, 65.2 (1.7, 1.7) 53.1, 53.3 (1.2, 1.2) 2.5, 1.5 (0.3, 0.2) 71.9, 71.6 (0.5, 0.4) SEAS-PFCt 50, 50 (0, 0) 0, 0 (0, 0) 70.7, 70.7 (0, 0) 51, 52 (0.5, 0.5) 1.4, 2.5 (0.2, 0.3) 70.8, 70.8 (0, 0) SEAS-PFCm 43.7, 44.7 (0.9, 0.9) 0.2, 0.5 (0.1, 0.2) 75.3, 75.7 (0.6, 0.7) 48.7, 49.2 (0.8, 0.9) 1.8, 1.7 (0.3, 0.3) 72.3, 72.4 (0.5, 0.5) Table A.13. Simulations results for scenario S3 with n = 400, p = 1000, K = 2. The table reports averages and standard errors of TPR, FPR, and subspace distance for each cluster. 
Method = AR(0.3) = AR(0.5) TPR(%) FPR(%) D 100 TPR(%) FPR(%) D 100 M1 mix PFC 96.7, 96.5 (1.1, 1.1) 0.5, 0.6 (0.2, 0.2) 35.1, 36.9 (1.6, 1.7) 92.2, 91.3 (1.5, 1.9) 0.4, 0.3 (0.3, 0.2) 49.4, 49.1 (1.5, 1.5) Lasso SIRt 99.7, 99.7 (0.2, 0.2) 1.7, 2 (0.1, 0.2) 41.6, 45.5 (1.2, 1.4) 99, 97.3 (0.4, 0.7) 1.2, 1.4 (0.1, 0.1) 47.1, 53.7 (1.3, 1.2) Lasso SIRm 99, 98.5 (0.6, 0.7) 1.9, 1.9 (0.2, 0.2) 43.9, 45.5 (1.5, 1.5) 94.8, 94.7 (1.2, 1.4) 1.4, 1.4 (0.2, 0.2) 53.3, 52.3 (1.4, 1.4) SEAS-SIRt 100, 100 (0, 0) 1, 1 (0.2, 0.2) 15, 15 (0.4, 0.4) 100, 100 (0, 0) 0.2, 0.3 (0, 0) 20, 20.9 (0.4, 0.6) SEAS-SIRm 99.8, 100 (0.2, 0) 1.2, 1.2 (0.2, 0.2) 16.3, 15.7 (0.6, 0.5) 99, 99.2 (0.6, 0.8) 0.3, 0.4 (0.1, 0.1) 22.3, 23 (1.1, 1.1) SEAS-PFCt 100, 100 (0, 0) 1.5, 1.9 (0.2, 0.3) 14.7, 15.3 (0.4, 0.4) 100, 99.8 (0, 0.2) 0.2, 0.2 (0, 0) 20, 20.9 (0.4, 0.6) SEAS-PFCm 100, 100 (0, 0) 1.9, 2 (0.3, 0.3) 16.2, 16.8 (0.7, 0.8) 98.7, 98.7 (0.7, 1) 0.4, 0.5 (0.2, 0.2) 22.7, 23.7 (1.2, 1.3) M2 mix PFC 94.2, 99 (1.5, 0.5) 0.1, 2.3 (0.1, 0.3) 42.3, 36.6 (1.5, 1.3) 72.7, 94 (1.6, 1.5) 0.1, 4.5 (0.1, 0.6) 64.2, 52 (1.1, 1.5) Lasso SIRt 99.7, 100 (0.2, 0) 1.7, 4.9 (0.1, 0.2) 41.6, 47.9 (1.2, 1.3) 99.2, 86 (0.4, 1.2) 1.3, 6.6 (0.1, 0.2) 47.5, 70.6 (1.3, 1.2) Lasso SIRm 94.8, 98 (1.2, 0.7) 1.6, 4.9 (0.2, 0.2) 51.3, 51.9 (1.4, 1.2) 88.5, 85.7 (1.2, 1.4) 1.2, 6.7 (0.1, 0.2) 60.4, 71.5 (1.3, 1.3) SEAS-SIRt 100, 99.5 (0, 0.3) 1, 2.3 (0.2, 0.2) 15, 41.1 (0.4, 0.8) 100, 66.8 (0, 1.5) 0.2, 1.4 (0, 0.2) 20, 73.4 (0.4, 0.5) SEAS-SIRm 98.7, 96.7 (0.7, 1.1) 1.5, 2.3 (0.2, 0.2) 22.5, 46.1 (1.4, 1) 97.3, 71.3 (0.8, 1.5) 0.1, 2 (0, 0.2) 31.1, 72.3 (1.4, 0.5) SEAS-PFCt 100, 99.7 (0, 0.3) 1.5, 0.9 (0.2, 0.1) 14.7, 36 (0.4, 0.7) 100, 73.8 (0, 1.7) 0.2, 1 (0, 0.1) 20, 69.7 (0.5, 0.5) SEAS-PFCm 98.8, 97 (0.6, 1.1) 0.6, 1.4 (0.2, 0.2) 22.1, 42.1 (1.4, 1.1) 96.7, 71.8 (1.1, 1.5) 0.1, 1.2 (0, 0.2) 31.3, 70.6 (1.5, 0.5) M3 mix PFC 50.5, 50.7 (1.2, 1.1) 0.5, 1.3 (0.1, 0.2) 72.6, 74 (0.4, 0.6) 53.5, 54.3 (1.3, 1.3) 0.3, 1 (0.2, 0.2) 72.3, 73.6 (0.5, 0.5) Lasso SIRt 78.2, 73.5 (1.6, 1.5) 9.6, 9.5 (0.2, 0.2) 74.2, 73.9 (0.3, 0.3) 78.5, 77 (1.4, 1.4) 8.7, 9 (0.2, 0.2) 75.4, 74.1 (0.3, 0.2) Lasso SIRm 71.8, 70 (1.9, 1.7) 10, 10.2 (0.2, 0.2) 76.5, 76.7 (0.7, 0.7) 75.8, 75 (1.9, 1.4) 9.2, 9.4 (0.2, 0.2) 77.7, 76.2 (0.7, 0.6) SEAS-SIRt 91.5, 80.2 (1.3, 1.2) 0.6, 1.3 (0.1, 0.2) 59.5, 65.7 (1.2, 0.6) 93, 84.2 (1, 0.8) 0.2, 0.5 (0.1, 0.1) 62.9, 65.7 (1.1, 0.7) SEAS-SIRm 82.6, 74.3 (1.9, 1.6) 0.5, 1 (0.1, 0.2) 67.9, 69.2 (0.9, 0.6) 87.4, 81.3 (1.6, 1.4) 0.5, 0.4 (0.1, 0.1) 69.5, 68.4 (0.8, 0.7) SEAS-PFCt 89.7, 76.3 (1.5, 1.2) 0.4, 0.7 (0.1, 0.2) 53.9, 61.7 (1.4, 0.8) 93.3, 83.2 (1, 0.6) 0.1, 0.1 (0, 0) 55.1, 63.2 (1.2, 0.7) SEAS-PFCm 80.3, 70.5 (2.1, 1.8) 0.3, 0.8 (0.1, 0.2) 64.4, 68.1 (1.2, 0.9) 85.7, 80 (2, 1.3) 0.1, 0.4 (0, 0.1) 64.9, 67.5 (1.2, 0.9) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering A.2. Real Data Analysis For the CCLE data, we evaluate the performance of mix PFC across K = 2, 3, . . . , 10. Table A.14 summarizes PMSE and the number of selected variables bs (total number of unique variables selected across all clusters) for different K under mix PFC. For mix PFC, the prediction error decreases as the number of clusters K increases. However, the reduction becomes less pronounced beyond K = 3 for Nutlin-3 and K = 5 and AZD6244. For both responses, mix PFC achieves a significant reduction in prediction error compared to homogeneous methods (Table 2), suggesting heterogeneity in the data. 
The number of selected variables $\hat{s}$ initially increases and then stabilizes as the number of clusters K grows, indicating that different clusters select different variables. Notably, mix PFC does not select more variables than Lasso and the three homogeneous SDR methods, even for large K. These findings suggest that over-specification of K does not significantly impact the prediction performance of mix PFC. However, over-specification of K increases the computational cost and introduces variability in parameter estimation. Using the tuning method described in Section B of the appendix, the average selected K is 3 for both Nutlin-3 and AZD6244.

Summary plots of the response versus the reduced predictors projected onto each subspace are shown in Figure A.5 for AZD6244. These plots reveal approximately linear relationships between the response and the projected predictors within each cluster for both drugs. Notably, the points in cluster 1 form a vertical band when projected onto the subspaces $\hat{\mathcal{S}}_w$, $w = 2, 3, 4, 5$. This pattern coincides with the example in Figure 2(b) and suggests that the first subspace is orthogonal to the remaining subspaces.

Figure A.5. Scatter plots of the response Y versus $\hat\beta_w^T X$ for the drug AZD6244. The solid line is fitted using the samples in the given cluster.

Table A.14. The averages of the prediction errors, the sparsity level $\hat{s}$, and the corresponding standard errors based on 100 replicates.

                         K = 2       K = 3       K = 4       K = 5       K = 6       K = 7       K = 8       K = 9       K = 10
Nutlin-3  PMSE x 100     11.9 (0.4)  8.9 (0.4)   8.3 (0.3)   8.4 (0.3)   7.7 (0.3)   7.84 (0.4)  7.36 (0.3)  6.9 (0.3)   6.9 (0.3)
          s-hat          16.5 (1.2)  31.5 (2.5)  45.2 (2.7)  52.3 (2.7)  53.9 (2.7)  53.7 (2.6)  52.6 (2.2)  58.2 (2.4)  59.5 (2.1)
AZD6244   PMSE x 100     66.4 (1.5)  58.7 (2.0)  62.2 (2.3)  45.6 (1.7)  41.6 (1.2)  36.7 (1)    33.5 (1.0)  32.0 (1.0)  34.7 (1.2)
          s-hat          21.6 (0.6)  53.4 (2.5)  62.2 (2.3)  77.8 (3.2)  77.4 (3.1)  74.1 (2.8)  65.1 (2.7)  67.5 (2.4)  70.6 (2.3)

B. Implementation Details

B.1. Initialization and Tuning K

Initialization. To implement the proposed mix PFC algorithm, it is critical to obtain reliable initial values for $\gamma_{iw}$. Classical distance-based clustering algorithms, like K-means and hierarchical clustering, often produce low-quality initial values. Similarly, subspace clustering methods also fail for the high-dimensional mixture of PFC. In the Gaussian mixture model, it is recommended to initialize EM with short runs of EM, where the algorithm is stopped early rather than run to convergence (Biernacki et al., 2003). Given that existing methods are not designed for model (3), we propose a similar initialization procedure that runs the mix PFC algorithm on transformed data with an early stopping criterion to generate initial estimates of $\gamma_{iw}$. In high-dimensional settings, we transform the original data into a lower-dimensional space by applying distance correlation (dcor) screening (Li et al., 2012) followed by principal component analysis (PCA). Specifically, we first select $u\lfloor n/\log n\rfloor$ variables using dcor and then project the data onto the first $v$ principal components of the selected variables. Based on the simulation results in Table A.15, we recommend setting u = 2 and v = 10.
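As an illustration of this initialization pipeline, the sketch below screens variables by a plug-in sample distance correlation and then projects onto the leading principal components. It is a sketch under our reading of the description above (in particular, the screening size $u\lfloor n/\log n\rfloor$ is an assumption), and all names are illustrative.

```python
import numpy as np

def sample_dcor(x, y):
    """Plug-in sample distance correlation between two 1-D samples."""
    def centered_dist(a):
        D = np.abs(a[:, None] - a[None, :])
        return D - D.mean(0) - D.mean(1)[:, None] + D.mean()
    A, B = centered_dist(x), centered_dist(y)
    dcov2 = max((A * B).mean(), 0.0)          # guard against tiny negatives
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

def screen_and_project(X, Y, u=2, v=10):
    """dcor screening followed by PCA, as in the initialization above."""
    n, p = X.shape
    keep = min(p, u * int(n / np.log(n)))     # assumed u*floor(n/log n) screening size
    scores = np.array([sample_dcor(X[:, j], Y) for j in range(p)])
    top = np.argsort(scores)[::-1][:keep]
    Xs = X[:, top] - X[:, top].mean(0)
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ Vt[:v].T                      # n x v scores for short EM runs
```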
Table A.15. Error rates averaged over 20 datasets generated under model scenario S1 using M1-M3. First, we use dcor to select $u\lfloor N/\log(N)\rfloor$ variables and then apply PCA to project the data onto the $v$-dimensional space spanned by the principal components.

                          M1                      M2                      M3
                  v = 5  v = 10  v = 15   v = 5  v = 10  v = 15   v = 5  v = 10  v = 15
Delta = 0.1I, mix PFC
  u = 1            5.7    2.2     6.3     25.2   27.1    30.0      8.4    3.0     7.3
  u = 2           13.6    4.0     4.5     29.2   31.6    35.0     11.0   11.9    13.5
  u = 3           15.8   13.2     4.5     30.1   38.6    38.1      7.3    7.5    10.5
Delta = 0.1I, mix PFC-ISO
  u = 1           23.8   25.0    24.9      6.7    9.0    11.3     20.6   24.3    19.2
  u = 2           24.7   22.1    34.5     11.1   11.3    17.3     18.7   19.8    24.5
  u = 3           30.5   28.2    18.7      9.1   21.9     9.2     27.3   22.3    17.2
Delta = I, mix PFC
  u = 1           14.3   10.7    10.8     27.3   37.4    43.5      9.9    7.3     9.0
  u = 2           20.1   14.4    11.9     31.6   38.5    41.3     13.9    9.9     8.7
  u = 3           16.1   16.5    18.7     31.8   42.7    45.9     11.9   13.8    20.7
Delta = I, mix PFC-ISO
  u = 1           17.3   13.0    11.4      9.4   12.0    15.3     13.9   19.7    17.6
  u = 2           18.8   15.0    11.5     15.3   10.2    12.6     24.8   22.1    13.3
  u = 3           15.0   15.4    17.6     12.2   14.4    13.2     14.7   19.5     8.3
Delta = AR(0.3), mix PFC
  u = 1           21.0   12.3    14.4     27.0   31.2    38.8     12.9   12.3    11.4
  u = 2           13.9   15.5    12.5     23.2   34.3    41.4     20.5   19.1    14.9
  u = 3           22.5   18.9    14.3     29.4   40.9    36.5     26.3   28.4    23.5

Selection of the number of clusters K. The gap statistic (Tibshirani et al., 2001) is a popular technique for selecting the optimal number of clusters for many clustering algorithms. Here, we adapt this approach to enhance its suitability for the mixture model (3). Let $Q_w$ be a basis of the orthogonal complement of $\beta_w$, let $G_w$ denote the indices of the observations in cluster $w$, and let $n_w = |G_w|$. Then define
$$D_w = \sum_{i, i' \in G_w} \|Q_w^T X_i - Q_w^T X_{i'}\|^2, \qquad V_K = \sum_{w=1}^K \frac{D_w}{2 n_w}.$$
The gap statistic is
$$\mathrm{Gap}_n(K) = E_n^*[\log(V_K)] - \log(V_K),$$
where $E_n^*$ denotes the expectation under a sample of size n from a reference distribution. We estimate $E_n^*[\log(V_K)]$ by an average of B copies of $\log(V_K^*)$, each computed from a sample drawn from a uniform reference distribution. However, in high dimensions, the gap statistic becomes computationally expensive, as it requires calculating $V_K$ for B data samples. In practice, we find that bypassing the expectation calculation and directly using $V_K$ yields effective results.

Let $K^*$ denote the true number of clusters. To illustrate, we use models M1 ($K^* = 2$) and M4 ($K^* = 3, 5$), with an identity covariance matrix as defined in the simulation section, to generate high-dimensional data. Notably, the two subspaces are identical in model M1. Figure A.6 presents a representative plot of the within-cluster dispersion $V_K$, calculated using the mix PFC Algorithm 1, as a function of the number of clusters. The error measure $V_K$ decreases monotonically as the number of clusters K increases, but begins to rise once K exceeds the true number of clusters $K^*$. Based on this finding, we propose to select the smallest K such that
$$\max_{K'}(V_{K'}) - V_K \ge \rho\,\big(\max_{K'}(V_{K'}) - \min_{K'}(V_{K'})\big),$$
where $\rho \in (0, 1)$ is specified to avoid overestimating K.

Figure A.6. Within-cluster dispersion $V_K$ for K = 1, 2, ..., 10; the three panels correspond to $K^* = 2$, $K^* = 3$, and $K^* = 5$.

Table A.16 presents the selected K for models M1 and M4 across four different covariance matrices. The proposed selection method works well in most cases, with the exception of $K^* = 5$ under AR(0.3) and AR(0.5). The reason is that inaccuracies in subspace estimation affect $V_K$, which depends on the orthogonal complements of the central subspaces.

Table A.16. The average selected number of clusters over 100 repetitions, using the statistic $V_K$. The standard errors are in parentheses and $K_{\max} = 10$.

          Delta = 0.1I  Delta = I  Delta = AR(0.3)  Delta = AR(0.5)
K* = 2    2 (0)         2 (0)      2 (0)            2.1 (0)
K* = 3    3.1 (0)       3 (0)      2.7 (0)          2.9 (0.1)
K* = 5    5 (0)         4.8 (0)    3.7 (0.1)        3.1 (0.2)
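A minimal sketch of this selection rule follows, assuming each Q_w is an orthonormal basis of the orthogonal complement and using the standard pairwise-distance identity to avoid the quadratic double sum; the default rho is illustrative.

```python
import numpy as np

def within_dispersion(X, labels, Q):
    """V_K = sum_w D_w / (2 n_w), with D_w the sum over pairs i, i' in
    cluster w of ||Q_w^T X_i - Q_w^T X_i'||^2; Q[w] is assumed to be an
    orthonormal basis of the orthogonal complement of beta_w."""
    V = 0.0
    for w, Qw in enumerate(Q):
        Z = X[labels == w] @ Qw               # n_w x (p - d_w) projected points
        nw = Z.shape[0]
        # sum_{i,i'} ||z_i - z_i'||^2 = 2 n_w sum_i ||z_i||^2 - 2 ||sum_i z_i||^2
        Dw = 2 * nw * (Z ** 2).sum() - 2 * (Z.sum(axis=0) ** 2).sum()
        V += Dw / (2 * nw)
    return V

def select_K(VK, rho=0.5):
    """Smallest K with max(V) - V_K >= rho * (max(V) - min(V));
    VK maps each candidate K to its dispersion."""
    Ks = np.array(sorted(VK))
    Vs = np.array([VK[k] for k in Ks])
    ok = Ks[Vs.max() - Vs >= rho * (Vs.max() - Vs.min())]
    return int(ok.min()) if ok.size else int(Ks[np.argmin(Vs)])
```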
B.2. Implementation of mix PFC

For the mix PFC algorithm (Algorithm 1), the tuning parameter $\lambda^{(t)}$ can either be fixed or vary across iterations. In the theoretical analysis, to establish the statistical convergence results, we set $\lambda^{(t+1)} = \kappa\lambda^{(t)} + C_\lambda\sqrt{q^3(\log n)^2\log p/n}$, where $0 < \kappa < 1/2$ and $C_\lambda$ is a positive constant. Note that $\lambda^{(t)}$ is of the order $\sqrt{q^3(\log n)^2\log p/n}$ when t is large. Therefore, in practice, we fix $\lambda^{(t)} = \lambda$ and tune $\lambda$ with cross-validated distance correlation (Székely et al., 2007). For fixed $\lambda^{(t)} = \lambda$, the penalized EM algorithm maximizes
$$\ell(\theta) - \lambda\sum_{w=1}^K \|B_w\|_{2,1}, \qquad (7)$$
where $\ell(\theta)$ is the log-likelihood of $X \mid Y$. The following lemma shows the convergence of Algorithm 1.

Lemma B.1. If we set $\lambda^{(t)} = \lambda$ for all t, the objective function in (7) evaluated at $\hat\theta^{(t+1)}$ is guaranteed to be no less than the objective function in (7) evaluated at $\hat\theta^{(t)}$. That is, the sequence of iterates $\{\hat\theta^{(t)}\}_{t=1}^{\infty}$ generated by Algorithm 1 monotonically increases the value of the objective function in (7).

Proof. Recall that the conditional log-likelihood is
$$\ell(\theta; X, Y) = \sum_{i=1}^n \log\Big[\sum_{w=1}^K \pi_w N(X_i \mid \mu_w + \Gamma_w\eta_w f(Y_i), \Delta)\Big].$$
The penalized EM algorithm can be viewed as two alternating maximization steps. Consider the following function:
$$F(q, \theta) := E_q[\log(L(\theta; X, Y, W))] + H(q) - \lambda\sum_{w=1}^K \|B_w\|_{2,1},$$
where $q = (q_1, \dots, q_K)$ is an arbitrary probability density over the unobserved variable W and $H(q) = -\sum_i q_i\log(q_i)$ is the entropy of the distribution q. It is easy to show that
$$F(q, \theta) = \ell(\theta; X, Y) - D_{KL}\big(q \,\|\, p_{W|X,Y}(\cdot \mid X, Y, \theta)\big) - \lambda\sum_{w=1}^K \|B_w\|_{2,1},$$
where $D_{KL}$ is the Kullback-Leibler (KL) divergence. The KL divergence between two distributions is non-negative and is zero when the two distributions are identical. Therefore, the E-step chooses q to maximize $F(q, \theta)$:
$$\hat{q}^{(t+1)} = \arg\max_q F(q, \hat\theta^{(t)}) = p_{W|X,Y}(\cdot \mid X, Y, \hat\theta^{(t)}),$$
which is given by the updating function of $\gamma_{iw}$. In the M-step, we maximize F over $\theta$:
$$\hat\theta^{(t+1)} = \arg\max_\theta F(\hat{q}^{(t+1)}, \theta) = \arg\max_\theta E_{\hat{q}^{(t+1)}}[\log(L(\theta; X, Y, W))] - \lambda\sum_{w=1}^K \|B_w\|_{2,1} = \arg\max_\theta Q(\theta \mid \hat\theta^{(t)}) - \lambda\sum_{w=1}^K \|B_w\|_{2,1},$$
which is the same as the penalized M-step. Through this coordinate ascent strategy, the updates $\hat\theta^{(t+1)}$ make the value of the penalized log-likelihood function increase monotonically. Combining this result with the convergence (Lemma H.1, Zeng et al. (2024)) of the group-wise coordinate descent Algorithm 2, we know that Algorithm 1 converges when $\lambda^{(t)} = \lambda$.

Recall that in Algorithm 1 in the main paper, we update $\hat{B}_w^{(t+1)}$ by solving
$$\arg\min_{B_w \in \mathbb{R}^{p\times q}} \frac{1}{2}\mathrm{tr}(B_w^T\hat\Sigma_w^{(t)}B_w) - \mathrm{tr}\{(\hat{U}_w^{(t)})^T B_w\} + \lambda\|B_w\|_{2,1}.$$
This optimization is solved by the group-wise coordinate descent algorithm proposed in Mai et al. (2019). For ease of presentation, we remove the subscript w in $B_w$, $\hat\Sigma_w^{(t)}$, and $\hat{U}_w^{(t)}$. The algorithm is presented in Algorithm 2.

Algorithm 2 Group-wise coordinate descent algorithm
Input: $\hat\Sigma^{(t)} = (\hat\sigma_{ij}) \in \mathbb{R}^{p\times p}$, $\hat{U}^{(t)} = (\hat{u}_{jl}) \in \mathbb{R}^{p\times q}$
Initialize each row of $B^{(0)}$ as $B_j^{(0)} = (b_{j1}^{(0)}, \dots, b_{jq}^{(0)})^T = 0 \in \mathbb{R}^q$, $j = 1, \dots, p$, and calculate the auxiliary vector $h_1^{(0)} = (h_{11}^{(0)}, \dots, h_{1q}^{(0)})^T \in \mathbb{R}^q$ as $h_{1l}^{(0)} = (\hat{u}_{1l} - \sum_{j' \neq 1}\hat\sigma_{j'1}b_{j'l})/\hat\sigma_{11}$, $l = 1, \dots, q$
repeat
  for j = 1 to p do
    Based on $h_j^{(r-1)}$ from the previous step, update $B_j^{(r)}$ as $B_j^{(r)} = \frac{h_j^{(r-1)}}{\hat\sigma_{jj}}\Big(1 - \frac{\lambda}{\|h_j^{(r-1)}\|_2}\Big)_+$
    Based on $B_1^{(r)}, \dots, B_j^{(r)}, B_{j+1}^{(r-1)}, \dots, B_p^{(r-1)}$, update the auxiliary vector $h_{j+1}^{(r-1)}$ as $h_{j+1,l}^{(r-1)} = (\hat{u}_{j+1,l} - \sum_{j' \neq j+1}\hat\sigma_{j',j+1}b_{j'l})/\hat\sigma_{j+1,j+1}$, $l = 1, \dots, q$
  end for
until converge
Output: $B^{(r)}$
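To make Algorithm 2 concrete, here is a minimal Python sketch of the row-wise update; the group soft-thresholding factor $(1 - \lambda/\|h_j\|_2)_+$ is the standard group-lasso form that we assume here, and all names are illustrative.

```python
import numpy as np

def group_coordinate_descent(Sigma, U, lam, max_iter=200, tol=1e-8):
    """Sketch of Algorithm 2: minimize 0.5 tr(B' Sigma B) - tr(U' B)
    + lam * sum_j ||B_j||_2 over B in R^{p x q} by cycling through rows."""
    p, q = U.shape
    B = np.zeros((p, q))
    for _ in range(max_iter):
        B_old = B.copy()
        for j in range(p):
            # h_j = u_j - sum_{j' != j} sigma_{j'j} B_{j'}
            h = U[j] - Sigma[j] @ B + Sigma[j, j] * B[j]
            nh = np.linalg.norm(h)
            # group soft-thresholding of row j
            B[j] = max(0.0, 1.0 - lam / nh) * h / Sigma[j, j] if nh > 0 else 0.0
        if np.linalg.norm(B - B_old) < tol:
            break
    return B
```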
Tuning $\lambda$. For two random vectors $T \in \mathbb{R}^p$ and $Z \in \mathbb{R}^q$, the distance correlation $\mathrm{dcor}(T, Z) \in [0, 1]$ measures the dependence between the two random vectors (Székely et al., 2007). In particular, $\mathrm{dcor}(T, Z) = 0$ if and only if T and Z are independent. Given observed samples $\tilde{T} \in \mathbb{R}^{n\times p}$ and $\tilde{Z} \in \mathbb{R}^{n\times q}$, the sample distance correlation is denoted as $\widehat{\mathrm{dcor}}(\tilde{T}, \tilde{Z})$. Under the usual SDR model, Sheng & Yin (2016) showed that the distance covariance between Y and $\beta^T X$ is maximized at the central subspace over $\beta \in \mathbb{R}^{p\times d}$ such that $\beta^T\Sigma\beta = I_d$. Therefore, Zeng et al. (2024) recommended selecting $\lambda$ by maximizing $\widehat{\mathrm{dcor}}(\tilde{Y}, \tilde{X}\hat\beta)$, where $\tilde{Y} \in \mathbb{R}^n$ and $\tilde{X} \in \mathbb{R}^{n\times p}$ represent the observed response vector and predictor matrix. For the mixture SDR model, we adjust for the sample memberships $\gamma_{iw}$ and select $\lambda$ by maximizing $\widehat{\mathrm{dcor}}(D_w\tilde{Y}, D_w\tilde{X}\hat\beta_w)$, where $D_w$ is a diagonal matrix with diagonal elements $\gamma_{iw}(\hat\theta)$.

B.3. Implementation of mix PFC-ISO

When $\Delta = \sigma^2 I$, there is no need to compute the inverse of the $p\times p$ covariance matrix. Instead, we only need to estimate $\sigma^2$. Under this simplification, the parameters of the mixture of PFC model (3) reduce to $\theta = (\sigma^2, \pi_w, \mu_w, \mathcal{S}_w, w = 1, \dots, K)$. In the E-step, the estimated probability is calculated by
$$\hat\gamma_{iw}(\hat\theta^{(t)}) = \frac{\hat\pi_w^{(t)}}{\hat\pi_w^{(t)} + \sum_{j\neq w}\hat\pi_j^{(t)}\exp\big\{\frac{1}{(\hat\sigma^2)^{(t)}}\big(X_i - \frac{1}{2}[(\hat\Gamma_j^{(t)} + \hat\Gamma_w^{(t)})f_i]\big)^T(\hat\Gamma_j^{(t)} - \hat\Gamma_w^{(t)})f_i\big\}}.$$
In the M-step, the most important part is the updating formula for $\Gamma_w$. Let $C_w^{(t)} = \sum_i \gamma_{iw}(\hat\theta^{(t)})$, let $D_w$ be a diagonal matrix with diagonal elements $\gamma_{iw}(\hat\theta^{(t)})$, let $X_w \in \mathbb{R}^{n\times p}$ be the centered data matrix with rows $(X_i - \hat\mu_w^{(t)})^T$, and let $F \in \mathbb{R}^{n\times q}$ have rows $(f_i - \bar{f})^T$. Through a straightforward calculation, the MLE of $\mathcal{S}_w$ is the span of the eigenvectors of $\hat\Sigma_{\mathrm{fit},w}^{(t)}$ corresponding to the largest $d_w$ eigenvalues, where
$$\hat\Sigma_{\mathrm{fit},w}^{(t)} = X_w^T D_w P_F D_w X_w / C_w, \qquad P_F = F(F^T D_w F)^{-1}F^T,$$
and $d_w$ is the dimension of the subspace $\mathcal{S}_w$. Let $\hat\Phi_w^{(t)} = (\hat\phi_1^{(t)}, \dots, \hat\phi_{d_w}^{(t)})$ be the eigenvectors corresponding to the largest $d_w$ eigenvalues of $\hat\Sigma_{\mathrm{fit},w}^{(t)}$. Then, we have
$$\hat\Gamma_w^{(t)} = \hat\Phi_w^{(t)}(\hat\Phi_w^{(t)})^T X_w^T D_w F(F^T D_w F)^{-1}.$$
To update $\sigma^2$, we maximize the Q function over $\sigma^2$ while keeping the other parameters fixed:
$$(\hat\sigma^2)^{(t)} = \frac{1}{np}\sum_{w=1}^K C_w^{(t)}\big[\mathrm{tr}(\hat\Sigma_w) - \mathrm{tr}(P_{\hat\Phi_w^{(t)}}\hat\Sigma_{\mathrm{fit},w})\big],$$
where $\hat\Sigma_w^{(t)} = X_w^T D_w X_w / n$.

In high dimensions, it is challenging to accurately estimate the eigenvectors of $\hat\Sigma_{\mathrm{fit},w}^{(t)}$. To address this issue, we promote sparsity by finding sparse eigenvectors of $\hat\Sigma_{\mathrm{fit},w}^{(t)}$. Several algorithms have been proposed to compute sparse eigenvectors (d'Aspremont et al., 2007; Zou et al., 2006; Witten et al., 2009; Journée et al., 2010). To obtain sparse estimates $\hat\Phi_w^{(t)}$, we adopt the variable projection method proposed by Erichson et al. (2020) for its computational efficiency and robustness in high dimensions. The resulting algorithm, mix PFC-ISO, and the corresponding convergence analysis are presented below. Specifically, we solve the following problem:
$$(A^*, \hat\Phi_w^{(t)}) = \arg\min_{A, \Phi_w} \frac{1}{2}\big\|\hat\Sigma_{\mathrm{fit},w}^{(t)} - \hat\Sigma_{\mathrm{fit},w}^{(t)}\Phi_w A^T\big\|_F^2 + \alpha^{(t)}\xi(\Phi_w), \quad \text{subject to } A^T A = I,$$
where $\alpha^{(t)}$ is a tuning parameter and $\xi$ is a sparsity-inducing penalty function such as the Lasso or elastic net. For fixed $\alpha^{(t)} = \alpha$, the penalized EM algorithm (Algorithm 3) for mix PFC with isotonic errors maximizes
$$\ell(\theta) - \alpha\sum_{w=1}^K \xi(\Phi_w). \qquad (8)$$
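The E-step ratio above is algebraically equivalent to a softmax over cluster log-densities, which is how one would compute it stably in practice. A minimal sketch, with illustrative names and the cluster means included explicitly:

```python
import numpy as np

def e_step_iso(X, F, pi, mu, Gamma, sigma2):
    """Sketch of the mix PFC-ISO E-step: responsibilities gamma_iw under
    X_i | (W_i = w) ~ N(mu_w + Gamma_w f_i, sigma2 * I), via a numerically
    stable softmax over log-densities."""
    n, K = X.shape[0], len(pi)
    logd = np.empty((n, K))
    for w in range(K):
        R = X - mu[w] - F @ Gamma[w].T          # n x p residuals for cluster w
        logd[:, w] = np.log(pi[w]) - (R ** 2).sum(axis=1) / (2.0 * sigma2)
    logd -= logd.max(axis=1, keepdims=True)     # stabilize the exponentials
    G = np.exp(logd)
    return G / G.sum(axis=1, keepdims=True)     # gamma_iw, rows sum to 1
```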
The following result guarantees the convergence of Algorithm 3.

Lemma B.2. If we set $\alpha^{(t)} = \alpha$ for all t, the objective function in (8) evaluated at $\hat\theta^{(t+1)}$ is guaranteed to be no less than the objective function in (8) evaluated at $\hat\theta^{(t)}$. That is, the sequence of iterates $\{\hat\theta^{(t)}\}_{t=1}^{\infty}$ generated by Algorithm 3 monotonically increases the value of the objective function in (8).

Proof. The conditional log-likelihood is
$$\ell(\theta; X, Y) = \sum_{i=1}^n \log\Big[\sum_{w=1}^K \pi_w N(X_i \mid \mu_w + \Gamma_w\eta_w f(Y_i), \sigma^2 I)\Big].$$
The penalized EM algorithm can be viewed as two alternating maximization steps. Consider the following function:
$$F(q, \theta) := E_q[\log(L(\theta; X, Y, W))] + H(q) - \alpha\sum_{w=1}^K \xi(\Phi_w),$$
where $q = (q_1, \dots, q_K)$ is an arbitrary probability density over the unobserved variable W. It is easy to show that
$$F(q, \theta) = \ell(\theta; X, Y) - D_{KL}\big(q \,\|\, p_{W|X,Y}(\cdot \mid X, Y, \theta)\big) - \alpha\sum_{w=1}^K \xi(\Phi_w),$$
where $D_{KL}$ is the Kullback-Leibler (KL) divergence. The KL divergence between two distributions is non-negative and is zero when the two distributions are identical. Therefore, the E-step chooses q to maximize $F(q, \theta)$:
$$\hat{q}^{(t+1)} = \arg\max_q F(q, \hat\theta^{(t)}) = p_{W|X,Y}(\cdot \mid X, Y, \hat\theta^{(t)}),$$
which is given by the updating function of $\gamma_{iw}$. In the M-step, we maximize F over $\theta$:
$$\hat\theta^{(t+1)} = \arg\max_\theta F(\hat{q}^{(t+1)}, \theta) = \arg\max_\theta E_{\hat{q}^{(t+1)}}[\log(L(\theta; X, Y, W))] - \alpha\sum_{w=1}^K\xi(\Phi_w) = \arg\max_\theta Q(\theta \mid \hat\theta^{(t)}) - \alpha\sum_{w=1}^K\xi(\Phi_w).$$
Through this coordinate ascent strategy, the updates $\hat\theta^{(t+1)}$ make the value of the penalized log-likelihood function increase monotonically. Combining this result with the convergence of the variable projection algorithm solving the penalized Q-function (Erichson et al., 2020), we know that Algorithm 3 converges when $\alpha^{(t)} = \alpha$.

Algorithm 3 Penalized EM algorithm for mixture PFC with isotonic errors
Input: Data $\{(X_i, Y_i)\}_{i=1}^n$, fitting function f
Initialize $\hat\gamma_{iw}(\theta^{(0)})$
repeat
  E-Step: $\hat\gamma_{iw}^{(t)} = \hat\pi_w^{(t)} / \big(\hat\pi_w^{(t)} + \sum_{j\neq w}\hat\pi_j^{(t)}\exp\{\frac{1}{(\hat\sigma^2)^{(t)}}(X_i - \frac{1}{2}[(\hat\Gamma_j^{(t)} + \hat\Gamma_w^{(t)})f_i])^T(\hat\Gamma_j^{(t)} - \hat\Gamma_w^{(t)})f_i\}\big)$
  M-Step:
    $\hat\Sigma_{\mathrm{fit},w}^{(t)} = X_w^T D_w P_F D_w X_w / n$, $P_F = F(F^T D_w F)^{-1}F^T$
    $(A^*, \hat\Phi_w^{(t)}) = \arg\min_{A,\Phi_w}\frac{1}{2}\|\hat\Sigma_{\mathrm{fit},w}^{(t)} - \hat\Sigma_{\mathrm{fit},w}^{(t)}\Phi_w A^T\|_F^2 + \alpha^{(t)}\xi(\Phi_w)$, subject to $A^T A = I$
    $\hat\Gamma_w^{(t+1)} = \hat\Phi_w^{(t)}(\hat\Phi_w^{(t)})^T X_w^T D_w F(F^T D_w F)^{-1}$
    $(\hat\sigma^2)^{(t+1)} = \frac{1}{np}\sum_{w=1}^K C_w^{(t)}\big[\mathrm{tr}(\hat\Sigma_w) - \mathrm{tr}(P_{\hat\Phi_w^{(t)}}\hat\Sigma_{\mathrm{fit},w})\big]$
    $\hat\pi_w^{(t+1)} = \frac{1}{n}\sum_{i=1}^n\hat\gamma_{iw}^{(t)}$
    $\hat\mu_w^{(t+1)} = \sum_i\gamma_{iw}(\hat\theta^{(t)})X_i \big/ \sum_i\gamma_{iw}(\hat\theta^{(t)})$
until converge
Output: $\hat\pi_w$, $\hat\Gamma_w$

C. Proof of Theorem 1

From this section onward, we focus on the theoretical properties of a simplified version (Algorithm 4) of the mix PFC algorithm presented in the main paper. Recall the five conditions:

(C1) The singular values of $\hat\Sigma_f = \frac{1}{n}\sum_{i=1}^n f_i f_i^T$ satisfy $M_1 \le \sigma_{\min}(\hat\Sigma_f) \le \sigma_{\max}(\hat\Sigma_f) \le M_2$, and $M_3 \le \min_{1\le i\le n}\|f_i\|_2 \le \max_{1\le i\le n}\|f_i\|_2 \le M_4$.

(C2) The initialization $\theta^{(0)}$ satisfies $d_F(\theta^{(0)}, \theta^*) \vee \|B_1^{(0)} - B_1^*\|_F \vee \|B_2^{(0)} - B_2^*\|_F < r\Omega$ and $\mathrm{vec}(\Gamma_w^{(0)} - \Gamma_w^*) \in L(s)$, with $r < |c_0 - c_\pi|/\Omega \wedge C_b \wedge (1 - C_d) \wedge 1/(4\sqrt{M_1}) \wedge (\sqrt{b^2 + 4a^2} - b)/(2a^2)$, where $a^2 = 2M_2^{3/2}/\sqrt{M_1}$ and $b = 2\sqrt{M_2} + 2M_2 + \sqrt{M_2}$.

(C3) There exists a sufficiently large constant $M_5 > 0$, which does not depend on n, p, s, such that $\sigma_d(B_w^*) \ge M_5\sqrt{sq^3(\log n)^2\log p/n}$.

(C4) $\mathrm{tr}[(\Gamma_2^* - \Gamma_1^*)\hat\Sigma_f(\Gamma_2^* - \Gamma_1^*)^T] \ge C(c_0, C_b, M_b, M_i;\ i = 1, \dots, 4)$, which is a constant that depends only on $c_0$, $M_b$, $C_b$, and $M_i$, $i = 1, \dots, 4$.

(C5) $n > C_3 sq^3\log(p)$ for a sufficiently large constant $C_3$.

Before the proof, we review some definitions and present properties of $\Sigma_w^*$ and $U_w^*$. The parameter of interest is $\theta = \{\pi_1, \Gamma_1, \Gamma_2\}$. Let $\theta^*$ be the true value of $\theta$, and let $\hat\theta^{(t)}$ be the estimate of $\theta$ at the t-th iteration.
The true parameter space we consider is
$$\Theta^* = \{\theta^* : \pi_1^* \in (c_\pi, 1 - c_\pi),\ \|\mathrm{vec}(\Gamma_w^*)\|_0 \le sq,\ \|B_w^*\|_F \le M_a,\ \|\Gamma_w^*\|_F \le M_b,\ w = 1, 2\},$$
and the contraction basin $B_{\mathrm{con}}(\theta^*)$ is
$$B_{\mathrm{con}}(\theta^*) = \{\theta : \pi_w \in (c_0, 1 - c_0),\ \|\Gamma_w - \Gamma_w^*\|_F \le C_b\Omega,\ (1 - C_d)\Omega^2 \le |\mathrm{tr}(\delta_w(\Gamma)\hat\Sigma_f(\Gamma_2 - \Gamma_1)^T)| \le (1 + C_d)\Omega^2,\ \mathrm{vec}(\Gamma_w - \Gamma_w^*) \in L(s),\ w = 1, 2\},$$
where $\Omega$ is the signal strength, $c_0 \le c_\pi$, and $\delta_w(\Gamma) = \Gamma_w^* - (\Gamma_2 + \Gamma_1)/2$. We define
$$L(s) = \{u \in \mathbb{R}^{pq} : \|u_{\tilde{S}_1^c}\|_1 \le (\sqrt{sq} + 2q\sqrt{3s})\|u_{\tilde{S}_1}\|_2 + \sqrt{sq}\|u\|_2, \text{ for some } \tilde{S}_1 \subseteq [pq] \text{ with } |\tilde{S}_1| = 3sq\}.$$

Algorithm 4 Penalized EM algorithm for mixture PFC (simplified version for theoretical analysis)
Input: Data $(X_i, Y_i)$, fitting function f, $d_1$, $d_2$
Initial values $\hat\pi_w^{(0)}$, $\hat\Gamma_w^{(0)}$, $\Sigma_w^{(0)}$, $U_w^{(0)}$, and $\hat{B}_w^{(0)} = \arg\min_{B_w}\frac{1}{2}\mathrm{tr}(B_w^T\hat\Sigma_w^{(0)}B_w) - \mathrm{tr}\{(\hat{U}_w^{(0)})^T B_w\} + \lambda^{(0)}\|B_w\|_{2,1}$, where the initial tuning parameter is $\lambda^{(0)} = C_1\frac{d_F(\hat\theta^{(0)}, \theta^*)\,\vee\,\|\hat{B}_1^{(0)} - B_1^*\|_F\,\vee\,\|\hat{B}_2^{(0)} - B_2^*\|_F}{\sqrt{s}} + C_\lambda\sqrt{\frac{q^3(\log n)^2\log p}{n}}$
repeat
  $\hat\gamma_{iw}(\hat\theta^{(t)}) = \hat\pi_w^{(t)} / \big(\hat\pi_w^{(t)} + \sum_{w'\neq w}\hat\pi_{w'}^{(t)}\exp\{[(\hat\sigma^2)^{(t)}]^{-1}(X_i - \frac{1}{2}[(\hat\Gamma_w^{(t)} + \hat\Gamma_{w'}^{(t)})f(Y_i)])^T(\hat\Gamma_{w'}^{(t)} - \hat\Gamma_w^{(t)})f(Y_i)\}\big)$
  $\hat\pi_w^{(t+1)} = \frac{1}{n}\sum_{i=1}^n\hat\gamma_{iw}(\hat\theta^{(t)})$, $\hat\Sigma_w^{(t+1)} = \frac{1}{n}\sum_{i=1}^n\gamma_{iw}(\hat\theta^{(t)})X_i X_i^T$, $\hat{U}_w^{(t+1)} = \frac{1}{n}\sum_{i=1}^n\gamma_{iw}(\hat\theta^{(t)})X_i f(Y_i)^T$
  $\hat{B}_w^{(t+1)} = \arg\min_{B_w}\frac{1}{2}\mathrm{tr}(B_w^T\hat\Sigma_w^{(t+1)}B_w) - \mathrm{tr}\{(\hat{U}_w^{(t+1)})^T B_w\} + \lambda^{(t+1)}\|B_w\|_{2,1}$, with $\lambda^{(t+1)} = \kappa\lambda^{(t)} + C_\lambda\sqrt{\frac{q^3(\log n)^2\log p}{n}}$
  Compute $\hat\beta_w^{(t+1)}$, the top-d left singular vectors of $\hat{B}_w^{(t+1)}$, and then update $\hat\Gamma_w^{(t+1)} = P_{\hat\beta_w^{(t+1)}}\hat{U}_w^{(t+1)}[\hat\pi_w^{(t+1)}\hat\Sigma_f]^{-1}$
  $(\hat\sigma^2)^{(t+1)} = \frac{1}{np}\sum_w\sum_{i=1}^n\hat\gamma_{iw}(\hat\theta^{(t)})[\mathrm{tr}(\hat\Sigma_w^{(t+1)}) - \mathrm{tr}(P_{\hat\Gamma_w^{(t+1)}}\hat\Sigma_{\mathrm{fit},w})]$
until converge
Output: $\hat\pi_w$, $\hat\Gamma_w$, $\hat\eta_w$

Let $M(\theta) = \{\pi_w(\theta), U_w(\theta), \Sigma_w(\theta), w = 1, 2\}$. Note that $\|\mathrm{vec}(\Gamma^*)\|_0 \le sq$ implies $\|\mathrm{vec}(B^*)\|_0 \le sq$. Define
$$\hat\pi_w(\theta) = \frac{1}{n}\sum_{i=1}^n\gamma_{iw}(\theta), \quad \pi_w(\theta) = E[\hat\pi_w(\theta)],$$
$$\hat{U}_w(\theta) = \frac{1}{n}\sum_{i=1}^n\gamma_{iw}(\theta)X_i f_i^T, \quad U_w(\theta) = E[\hat{U}_w(\theta)],$$
$$\hat\Sigma_w(\theta) = \frac{1}{n}\sum_{i=1}^n\gamma_{iw}(\theta)X_i X_i^T, \quad \Sigma_w(\theta) = E[\hat\Sigma_w(\theta)].$$
By definition,
$$\hat\pi_w^{(t+1)} = \hat\pi_w(\hat\theta^{(t)}), \quad \pi_w^{(t+1)} = E[\hat\pi_w^{(t+1)}] = E[\hat\pi_w(\hat\theta^{(t)})] = \pi_w(\hat\theta^{(t)}),$$
$$\hat{U}_w^{(t+1)} = \hat{U}_w(\hat\theta^{(t)}), \quad U_w^{(t+1)} = E[\hat{U}_w^{(t+1)}] = E[\hat{U}_w(\hat\theta^{(t)})] = U_w(\hat\theta^{(t)}),$$
$$\hat\Sigma_w^{(t+1)} = \hat\Sigma_w(\hat\theta^{(t)}), \quad \Sigma_w^{(t+1)} = E[\hat\Sigma_w^{(t+1)}] = E[\hat\Sigma_w(\hat\theta^{(t)})] = \Sigma_w(\hat\theta^{(t)}).$$
Moreover,
$$\pi_w(\theta^*) = E\Big[\frac{1}{n}\sum_{i=1}^n\gamma_{iw}(\theta^*)\Big] = E\Big[\frac{1}{n}\sum_{i=1}^n P(W_i = w \mid X_i, Y_i)\Big] = P(W_i = w) = \pi_w^*,$$
and
$$U_w^* = U_w(\theta^*) = \frac{1}{n}\sum_{i=1}^n E[\gamma_{iw}(\theta^*)X_i]f_i^T = \frac{1}{n}\sum_{i=1}^n E\big[E_{W|X}[I(W_i = w)X_i]\big]f_i^T = \frac{1}{n}\sum_{i=1}^n E_{W,X}[I(W_i = w)X_i]f_i^T = \frac{1}{n}\sum_{i=1}^n E_W\big[I(W_i = w)E_{X|W=w}[X_i]\big]f_i^T = \frac{1}{n}\sum_{i=1}^n\pi_w^*\Gamma_w^* f_i f_i^T = \pi_w^*\Gamma_w^*\hat\Sigma_f,$$
where $\hat\Sigma_f = \frac{1}{n}\sum_{i=1}^n f_i f_i^T$. Similarly,
$$\Sigma_w^* = \Sigma_w(\theta^*) = \frac{1}{n}\sum_{i=1}^n E[\gamma_{iw}(\theta^*)X_i X_i^T] = \frac{1}{n}\sum_{i=1}^n E_W\big[I(W_i = w)E_{X|W=w}[X_i X_i^T]\big] = \frac{1}{n}\sum_{i=1}^n\pi_w^*\big(I_p + \Gamma_w^* f_i f_i^T(\Gamma_w^*)^T\big) = \pi_w^* I_p + \pi_w^*\Gamma_w^*\hat\Sigma_f(\Gamma_w^*)^T.$$
Let $B_w^* = (\Sigma_w^*)^{-1}U_w^*$, and let $\beta_w^*$ be its top-d left singular vectors. Then $\mathrm{span}(\beta_w^*) = \mathrm{span}(\Gamma_w^*)$. We further have
$$\Gamma_w^* = P_{\Gamma_w^*}U_w^*(\pi_w^*\hat\Sigma_f)^{-1} = P_{\beta_w^*}U_w^*(\pi_w^*\hat\Sigma_f)^{-1}.$$
Since $\sigma_{\min}(\Gamma_w^*) \ge 0$ and $\sigma_{\min}(I_p + \Gamma_w^*\hat\Sigma_f(\Gamma_w^*)^T) \ge 1$, we can bound the 2-norm as follows:
$$\|B_w^*\|_2 = \big\|[\pi_w^* I_p + \pi_w^*\Gamma_w^*\hat\Sigma_f(\Gamma_w^*)^T]^{-1}\pi_w^*\Gamma_w^*\hat\Sigma_f\big\|_2 = \big\|[I_p + \Gamma_w^*\hat\Sigma_f(\Gamma_w^*)^T]^{-1}\Gamma_w^*\hat\Sigma_f\big\|_2 \le \big\|[I_p + \Gamma_w^*\hat\Sigma_f(\Gamma_w^*)^T]^{-1}\big\|_2\|\Gamma_w^*\hat\Sigma_f\|_2 \le \|\Gamma_w^*\|_F\|\hat\Sigma_f\|_2 \le M_b M_2.$$
Define $d_{F,s}(M(\theta_1), M(\theta_2))$ and $d_2(M(\theta_1), M(\theta_2))$ as
$$\max_{w=1,2}\big\{|\pi_w(\theta_1) - \pi_w(\theta_2)| \vee \|U_w(\theta_1) - U_w(\theta_2)\|_{F,s} \vee \|(\Sigma_w(\theta_1) - \Sigma_w(\theta_2))B_w^*\|_{F,s}\big\},$$
$$\max_{w=1,2}\big\{|\pi_w(\theta_1) - \pi_w(\theta_2)| \vee \|U_w(\theta_1) - U_w(\theta_2)\|_F \vee \|(\Sigma_w(\theta_1) - \Sigma_w(\theta_2))B_w^*\|_F\big\},$$
respectively, where $\|A\|_{F,s} = \sup_{u\in\mathbb{R}^{p\times q},\ \|u\|_F = 1,\ \mathrm{vec}(u)\in L(s)}\langle A, u\rangle_F$. The proof of Theorem 4.1 is based on the following two lemmas.

Lemma C.1. Under conditions (C1) and (C4), if $\theta \in B_{\mathrm{con}}(\theta^*)$, then for some $0 < \kappa_0 < \frac{1}{2[(256C_0q/\tau_0)\vee 8C_0]}$,
$$d_2(M(\theta), M(\theta^*)) \le \kappa_0\big(d_F(\theta, \theta^*) \vee \|B_1 - B_1^*\|_F \vee \|B_2 - B_2^*\|_F\big),$$
where $d_F(\theta, \theta^*) = |\pi_1 - \pi_1^*| \vee \|\Gamma_1 - \Gamma_1^*\|_F \vee \|\Gamma_2 - \Gamma_2^*\|_F$.
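The population identities above admit a quick numerical sanity check. The following illustrative snippet (our own, with arbitrary small dimensions) confirms that $B_w^* = (\Sigma_w^*)^{-1}U_w^*$ spans the same d-dimensional subspace as $\Gamma_w^*$.

```python
import numpy as np
rng = np.random.default_rng(0)

# Check: U*_w = pi*_w Gamma*_w Sigma_f_hat, Sigma*_w = pi*_w I + pi*_w
# Gamma*_w Sigma_f_hat Gamma*_w^T, and span(beta*_w) = span(Gamma*_w).
p, q, d, pi_w = 6, 3, 2, 0.4
Gamma = rng.normal(size=(p, d)) @ rng.normal(size=(d, q))  # rank-d Gamma*_w
F = rng.normal(size=(100, q))
Sf = F.T @ F / 100                                         # Sigma_f_hat
U = pi_w * Gamma @ Sf
Sigma_w = pi_w * np.eye(p) + pi_w * Gamma @ Sf @ Gamma.T
B = np.linalg.solve(Sigma_w, U)                            # B*_w

def top_d_projection(M, d):
    u = np.linalg.svd(M)[0][:, :d]   # top-d left singular vectors
    return u @ u.T

print(np.allclose(top_d_projection(B, d), top_d_projection(Gamma, d)))  # True
```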
Lemma C.2. Suppose $\theta^* \in \Theta^*$. Under conditions (C1) and (C5), there exists a constant $C_{\mathrm{con}} > 0$ such that, with probability at least $1 - o(1)$,
$$\sup_{\theta\in B_{\mathrm{con}}(\theta^*)} d_{F,s}(M_n(\theta), M(\theta)) \le C_{\mathrm{con}}\sqrt{\frac{sq^3(\log n)^2\log p}{n}}.$$
The proofs of these two lemmas are quite involved and are presented in Sections D and E. We first establish a concentration result for the estimator $\hat{B}_w$ in Section C.1 before proving Theorem 4.1 in Section C.2.

C.1. Concentration of the estimator $\hat{B}_w$

Lemma C.3. Suppose that $\theta^* \in \Theta^*$ and $\hat\theta^{(0)} \in B_{\mathrm{con}}(\theta^*)$. Let
$$\lambda^{(t+1)} \ge 4C_{\mathrm{con}}\sqrt{\frac{q^3(\log n)^2\log p}{n}} + 4\kappa_0\frac{d_F(\hat\theta^{(t)}, \theta^*) \vee \|\hat{B}_1^{(t)} - B_1^*\|_F \vee \|\hat{B}_2^{(t)} - B_2^*\|_F}{\sqrt{s}},$$
for $\kappa_0$ defined before and some constant $C_{\mathrm{con}}$, and let $\hat{B}_w^{(t+1)}$ be solved by
$$\hat{B}_w^{(t+1)} = \arg\min_{B_w}\frac{1}{2}\mathrm{tr}(B_w^T\hat\Sigma_w^{(t+1)}B_w) - \mathrm{tr}\{(\hat{U}_w^{(t+1)})^T B_w\} + \lambda^{(t+1)}\|B_w\|_{2,1}.$$
Then, with probability at least $1 - o(1)$, we have $\mathrm{vec}(\hat{B}_w^{(t+1)} - B_w^*) \in L(s)$ and
$$\|\hat{B}_w^{(t+1)} - B_w^*\|_F \le \frac{4}{\tau_0}d_{F,s}(M_n(\hat\theta^{(t)}), M(\theta^*)) + \frac{2}{\tau_0}\lambda^{(t+1)}(\sqrt{3sq} + 2\sqrt{sq} + 2\sqrt{3sq^2}).$$

Proof. Recall that
$$L(s) = \{u\in\mathbb{R}^{pq} : \|u_{\tilde{S}_1^c}\|_1 \le (\sqrt{sq} + 2q\sqrt{3s})\|u_{\tilde{S}_1}\|_2 + \sqrt{sq}\|u\|_2, \text{ for some }\tilde{S}_1\subseteq[pq]\text{ with }|\tilde{S}_1| = 3sq\}.$$
Consider
$$\hat{B}_w^{(t+1)} = \arg\min_{B_w}\frac{1}{2}\mathrm{tr}(B_w^T\hat\Sigma_w^{(t+1)}B_w) - \mathrm{tr}\{(\hat{U}_w^{(t+1)})^T B_w\} + \lambda^{(t+1)}\|B_w\|_{2,1}.$$
For simplicity, we let w = 1 in the following. To show $\mathrm{vec}(\hat{B}_1^{(t+1)} - B_1^*) \in L(s)$, we note that
$$\lambda^{(t+1)}\big(\|\hat{B}_1^{(t+1)}\|_{2,1} - \|B_1^*\|_{2,1}\big) \le \frac{1}{2}\mathrm{tr}((B_1^*)^T\hat\Sigma_1^{(t+1)}B_1^*) - \frac{1}{2}\mathrm{tr}((\hat{B}_1^{(t+1)})^T\hat\Sigma_1^{(t+1)}\hat{B}_1^{(t+1)}) - \mathrm{tr}((\hat{U}_1^{(t+1)})^T B_1^*) + \mathrm{tr}((\hat{U}_1^{(t+1)})^T\hat{B}_1^{(t+1)})$$
$$= \mathrm{tr}((B_1^* - \hat{B}_1^{(t+1)})^T\hat\Sigma_1^{(t+1)}B_1^*) - \frac{1}{2}\mathrm{tr}((B_1^* - \hat{B}_1^{(t+1)})^T\hat\Sigma_1^{(t+1)}(B_1^* - \hat{B}_1^{(t+1)})) - \mathrm{tr}((\hat{U}_1^{(t+1)})^T(B_1^* - \hat{B}_1^{(t+1)})),$$
where we use
$$\frac{1}{2}\mathrm{tr}((B_1^* - \hat{B}_1^{(t+1)})^T\hat\Sigma_1^{(t+1)}(B_1^* - \hat{B}_1^{(t+1)})) = \frac{1}{2}\mathrm{tr}((B_1^*)^T\hat\Sigma_1^{(t+1)}B_1^*) + \frac{1}{2}\mathrm{tr}((\hat{B}_1^{(t+1)})^T\hat\Sigma_1^{(t+1)}\hat{B}_1^{(t+1)}) - \mathrm{tr}((B_1^*)^T\hat\Sigma_1^{(t+1)}\hat{B}_1^{(t+1)}).$$
Since $\hat\Sigma_1^{(t+1)}$ is symmetric positive semi-definite, we have
$$\mathrm{tr}((B_1^* - \hat{B}_1^{(t+1)})^T\hat\Sigma_1^{(t+1)}(B_1^* - \hat{B}_1^{(t+1)})) = \mathrm{vec}(B_1^* - \hat{B}_1^{(t+1)})^T\mathrm{vec}(\hat\Sigma_1^{(t+1)}(B_1^* - \hat{B}_1^{(t+1)})) = \mathrm{vec}(B_1^* - \hat{B}_1^{(t+1)})^T(I_q\otimes\hat\Sigma_1^{(t+1)})\mathrm{vec}(B_1^* - \hat{B}_1^{(t+1)}) \ge 0.$$
Then u(t+1) Sc 2,1 = X j=1 (u(t+1) ij )2 1 q j=1 |u(t+1) ij | = 1 q vec(u(t+1))Sc 1 1. u(t+1) S 2,1 = X j=1 (u(t+1) ij )2 X j=1 |u(t+1) ij | = vec(u(t+1))S1 1. Thus, we have λ(t+1)( 1 q vec(u(t+1))Sc 1 1 vec(u(t+1))S1 1) λ(t+1)( u(t+1) Sc 2,1 u(t+1) S 2,1) (i) + (ii) + (iii). Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Recall that (i) = B 1 b B(t+1) 1 , (bΣ(t+1) 1 Σ(t+1) 1 )B 1 ( b U(t+1) 1 U(t+1) 1 ) F = vec(B 1 b B(t+1) 1 )T vec((bΣ(t+1) 1 Σ(t+1) 1 )B 1) vec(B 1 b B(t+1) 1 )T vec( b U(t+1) 1 U(t+1) 1 ). We want to bound ℓ2 norm of vec((bΣ(t+1) 1 Σ(t+1) 1 )B 1) and vec( b U(t+1) 1 U(t+1) 1 ). Let e S1 be a set of size 3sq, which contains S1 and the largest sq coefficients of b U(t+1) 1 U(t+1) 1 and the largest sq coefficients of (bΣ(t+1) 1 Σ(t+1) 1 )B 1. We have | vec(B 1 b B(t+1) 1 )T vec( b U(t+1) 1 U(t+1) 1 )| | vec(( b U(t+1) 1 U(t+1) 1 ) e S1)T vec((B 1 b B(t+1) 1 ) e S1)|+ | vec(( b U(t+1) 1 U(t+1) 1 ) e Sc 1)T vec((B 1 b B(t+1) 1 ) e Sc 1)| vec(( b U(t+1) 1 U(t+1) 1 ) e S1) 2 vec((B 1 b B(t+1) 1 ) e S1) 2+ vec(( b U(t+1) 1 U(t+1) 1 ) e Sc 1) vec((B 1 b B(t+1) 1 ) e Sc 1) 1. By the definition, we have A F,s = sup µ Rp q vec(µ) L(s) Spq 1 vec(A), vec(µ) . Let v Rpq such that v e S1 = vec(A) e S1 and v e Sc 1 = 0. To define L(s), we want vectors u that u e Sc 1 1 is bounded. Clearly v L(s) since v e Sc 1 1 = 0. Then A F,s 1 v 2 vec(A), v = 1 v 2 v 2 2 = v 2 = vec(A) e S1 2. Thus, vec(( b U(t+1) 1 U(t+1) 1 ) e S1) 2 b U(t+1) 1 U(t+1) 1 ) e S1 F,s. By the definition of e S1, vec(( b U(t+1) 1 U(t+1) 1 ) e Sc 1) vec(( b U(t+1) 1 U(t+1) 1 ) e S1) 2/ sq b U(t+1) 1 U(t+1) 1 ) e S1 F,s/ sq. Therefore, | vec(B 1 b B(t+1) 1 )T vec( b U(t+1) 1 U(t+1) 1 )| b U(t+1) 1 U(t+1) 1 F,s vec(u(t+1)) e S1 2 + 1 sq b U(t+1) 1 U(t+1) 1 F,s vec(u(t+1)) e Sc 1 1. Similarly, we have vec(B 1 b B(t+1) 1 )T vec((bΣ(t+1) 1 Σ(t+1) 1 )B 1) vec((bΣ(t+1) 1 Σ(t+1) 1 )B 1) e S1 2 vec((B 1 b B(t+1) 1 ) e S1) 2+ vec((bΣ(t+1) 1 Σ(t+1) 1 )B 1) e Sc 1) vec((B 1 b B(t+1) 1 ) e Sc 1) 1. (bΣ(t+1) 1 Σ(t+1) 1 )B 1 F,s vec(u(t+1)) e S1 2 + 1 sq (bΣ(t+1) 1 Σ(t+1) 1 )B 1 F,s vec(u(t+1)) e Sc 1 1. |(i)| 2Ccon sq3(log n)2 log p n vec(u(t+1)) e S1 2 + 2Ccon sq3(log n)2 log p n 1 sq vec(u(t+1)) e Sc 1 1. Using Lemma C.1, we have |(ii)| ( (Σ(t+1) 1 Σ 1)B 1 F + U(t+1) 1 U 1 F ) B 1 b B(t+1) 1 F = ( (Σ1(bθ(t)) Σ 1)B 1 F + U1(bθ(t)) U 1 F ) B 1 b B(t+1) 1 F 2κ0 d F (bθ(t), θ ) b B(t) 1 B 1 F sq sq B 1 b B(t+1) 1 F = 2κ0 d F (bθ(t), θ ) b B(t) 1 B 1 F sq sq vec(u(t+1)) 2. Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Combine the bound for terms (i) and (ii), we have λ(t+1)( 1 q vec(u(t+1))Sc 1 1 vec(u(t+1))S1 1) sq3(log n)2 log p n 1 sq sq vec(u(t+1)) e S1 2+ sq3(log n)2 log p n 1 sq vec(u(t+1)) e Sc 1 1+ 2κ0 d F (bθ(t), θ ) b B(t) 1 B 1 F sq sq vec(u(t+1)) 2. q2(log n)2 log p n + 2κ0 d F (bθ(t), θ ) b B(t) 1 B 1 F sq Then 1 q vec(u(t+1))Sc 1 1 vec(u(t+1))S1 1 sq 2 q vec(u(t+1)) e S1 2 + 1 2 q vec(u(t+1)) e Sc 1 1 + sq 2 q vec(u(t+1)) 2 2 vec(u(t+1)) e S1 2 + 1 2 q vec(u(t+1)) e Sc 1 1 + s 2 vec(u(t+1)) 2, which implies 1 2 q vec(u(t+1)) e Sc 1 1 s 2 vec(u(t+1)) e S1 2 + s 2 vec(u(t+1)) 2 + vec(u(t+1)) e S1 1, where we use S1 e S1. Using vec(u(t+1)) e S1 1 3sq vec(u(t+1)) e S1 2, we have vec(u(t+1)) e Sc 1 1 ( sq + 2q 3s) vec(u(t+1)) e S1 2 + sq vec(u(t+1)) 2. Now, we focus on the second result. Let w = 1. 
Recall that λ(t+1)( b B(t+1) 1 2,1 B 1 2,1) tr((B 1 b B(t+1) 1 )T bΣ(t+1) 1 B 1) 1 2 tr((B 1 b B(t+1) 1 )T bΣ(t+1) 1 (B 1 b B(t+1) 1 )) tr(( b U(t+1) 1 )T (B 1 b B(t+1) 1 ) = B 1 b B(t+1) 1 , bΣ(t+1) 1 B 1 F 1 2 B 1 b B(t+1) 1 , bΣ(t+1) 1 (B 1 b B(t+1) 1 ) F b U(t+1) 1 , B 1 b B(t+1) 1 F . It follows that | B 1 b B(t+1) 1 , bΣ(t+1) 1 (B 1 b B(t+1) 1 ) F | 2 | B 1 b B(t+1) 1 , bΣ(t+1) 1 B 1 b U(t+1) 1 F | | {z } (I) 2λ(t+1)( b B(t+1) 1 2,1 B 1 2,1 | {z } (II) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Recall that Σ 1B 1 U 1 = 0. For term (I), since vec(B 1 b B(t+1) 1 )/ vec(B 1 b B(t+1) 1 ) 2 L(s) Spq 1, we have | B 1 b B(t+1) 1 , bΣ(t+1) 1 B 1 b U(t+1) 1 F | = | B 1 b B(t+1) 1 , bΣ(t+1) 1 B 1 Σ 1B 1 + Σ 1B 1 U 1 + U 1 b U(t+1) 1 F | = | B 1 b B(t+1) 1 , bΣ(t+1) 1 B 1 Σ 1B 1 F + B 1 b B(t+1) 1 , U 1 b U(t+1) 1 F + B 1 b B(t+1) 1 , Σ 1B 1 U 1 F | B 1 b B(t+1) 1 F ( bΣ(t+1) 1 B 1 Σ 1B 1 F,s + b U(t+1) 1 U 1 F,s) 2 B 1 b B(t+1) 1 F d F,s(Mn(bθ(t)), M(θ )), where we use the definitions of F,s norm and d F,s. In the last inequality, we use the fact Mn(bθ(t)) = {bπw(bθ(t)), b Uw(bθ(t)), bΣw(bθ(t)), w = 1, 2} = {bπ(t+1) w , b U(t+1) w , bΣ(t+1) w , w = 1, 2}. For term (II), using reverse triangle inequality, b B(t+1) 1 2,1 B 1 2,1 = k=1 (b B(t+1) 1,jk )2 k=1 (B 1,jk)2 k=1 (b B(t+1) 1,jk B 1,jk)2 = k=1 |b B(t+1) 1,jk B 1,jk| vec(b B(t+1) 1 B 1) 1 = vec(b B(t+1) 1 B 1) e S1 1 + vec(b B(t+1) 1 B 1) e Sc 1 1 3sq vec(b B(t+1) 1 B 1) e S1 2 + ( sq + 2 p 3sq2) vec(b B(t+1) 1 B 1) e S1 2+ sq vec(b B(t+1) 1 B 1) 2 3sq + 2 sq + 2 p 3sq2) b B(t+1) 1 B 1 F . For the right hand side of (9), we have | B 1 b B(t+1) 1 , bΣ(t+1) 1 (B 1 b B(t+1) 1 ) F | = | vec(B 1 b B(t+1) 1 )T vec(bΣ(t+1) 1 (B 1 b B(t+1) 1 ))| = | vec(B 1 b B(t+1) 1 )T (Iq bΣ(t+1) 1 ) vec(B 1 b B(t+1) 1 )| k=1 (B 1,k b B(t+1) 1,k )T bΣ(t+1) 1 (B 1,k b B(t+1) 1,k ) , where B 1,k and b B(t+1) 1,k represent the k-th column of B 1 and b B(t+1) 1 . Recall that bΣ(t+1) 1 = 1 i=1 γ1,bθ(t)(Xi, Yi)Xi XT i . By Lemma C.2, we know that i=1 γ1,bθ(t)(Xi, Yi) E[bπ(t+1) 1 ]| = O( sq3(log n)2 log p with probability at least 1 o(1). Thus, 1 n Pn i=1 γ1,bθ(t)(Xi, Yi) > τ1 for some positive constant τ1 with probability at Heterogeneous Sufficient Dimension Reduction and Subspace Clustering least 1 o(1). Define the set N = {i : γ1,bθ(t)(Xi, Yi) > τ1/2}. Then by Lemma F.2, (B 1,k b B(t+1) 1,k )T bΣ(t+1) 1 (B 1,k b B(t+1) 1,k ) B 1,k b B(t+1) 1,k 2 2 inf u Lp(s) Sp 1 u T 1 i=1 γ1,bθ(t)(Xi, Yi)Xi XT i u B 1,k b B(t+1) 1,k 2 2 inf u Lp(s) Sp 1 u T 1 i N γ1,bθ(t)(Xi, Yi)Xi XT i u B 1,k b B(t+1) 1,k 2 2 τ1/2 τ. | B 1 b B(t+1) 1 , bΣ(t+1) 1 (B 1 b B(t+1) 1 ) F | k=1 vec(B 1,k b B(t+1) 1,k ) 2 2 τ1/2 τ = vec(b B(t+1) 1 B 1) 2 2τ0, where τ0 = τ1/2 τ. Combing the above results, we have τ0 b B(t+1) 1 B 1 2 F 4 b B(t+1) 1 B 1 F d F,s(Mn(θ(t)), M(θ ))+ 3sq + 2 sq + 2 p 3sq2) b B(t+1) 1 B 1 F . b B(t+1) 1 B 1 F 4 τ0 d F,s(Mn(bθ(t)), M(θ )) + 2 τ0 λ(t+1)( p 3sq + 2 sq + 2 p C.2. Proof of Theorem Theorem C.4. Under conditions (C1)-(C5), there exists a constant 0 < κ < 1/2, such that b B(t) w satisfies, with probability at least 1 o(1), b B(t) w B w F = O κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + sq3(log n)2 log p Consequently, for t ( log κ) 1 log{n(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )}, b B(t) w B w F , D(S b β(t) w , Sβ w) F = O sq3(log n)2 log p Proof. We update λ(t) by λ(t) = κλ(t 1) + Cλ q3(log n)2 log p λ(0) = C1 d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F s + Cλ q3(log n)2 log p where C1 = τ0/(32C0q). 
Thus, we have λ(t) = κt C1 d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F s + 1 κt+1 q3(log n)2 log p Let κ = (1 128C0q τ0 4C0)κ0, since 0 < κ0 < 1 2 (256C0q/τ0) 8C0 , we have 0 < κ < 1/2. Then define C = 2κ2 4κ + 2 2κ2 5κ + 2 (C0 + 4C0q τ0 + 64C0q τ0(1 κ)) 1 κ Cλ = 4Ccon + 4κ0 Heterogeneous Sufficient Dimension Reduction and Subspace Clustering (i) κ κ0, C1κ 4κ0, and ( 4C0q τ0 + C0)κ0 + 16C0q (ii) κ0 1 κC + Ccon C , ( 4C0q τ0 + C0)Ccon + 16C0q τ0(1 κ)Cλ C . The first two inequalities in (I) can be seen from the definition of κ and C1. For the third, we have τ0 + C0)κ0 + 16C0q τ0 κC1 = (4C0q τ0 + C0)κ0 + 1 For the first inequality in (ii), it is equivalent to Ccon 1 κ κ0 1 κ C . Since κ0 κ, 1 κ C Ccon, where in the last inequality we use the definition of C . For the second inequality in (ii), we have τ0 + C0)Ccon + 16C0q τ0(1 κ)Cλ = (4C0q τ0 + C0)Ccon + 16C0q τ0(1 κ)[4Ccon + 4κ0 τ0 + C0 + 64C0q τ0(1 κ) Ccon + 64C0qκ0 τ0(1 κ)2 C . Use the second inequality in (i), 64C0qκ0 τ0 + C0)Ccon + 16C0q τ0(1 κ)Cλ 4C0q τ0 + C0 + 64C0q τ0(1 κ) Ccon + κ 2(1 κ)2 C 2κ2 4κ + 2C + κ 2(1 κ)2 C = C . Next, we use induction to show the following results λ(t+1) 4Ccon q3(log n)2 log p n + 4κ0(d F (bθ(t), θ ) b B(t) 1 B 1 F b B(t) 1 B 2 F s ), d F (bθ(t+1), θ ) b B(t+1) 1 B 1 F b B(t+1) 1 B 2 F κt+1(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 1 B 2 F ) + 1 κt+2 sq3(log n)2 log p d F (bθ(t+1), θ ) b B(t+1) 1 B 1 F b B(t+1) 1 B 2 F rΩ, vec(bΓ(t+1) w Γ w) L(s). It is easy to verify that d F,s satisfies the triangle inequality. Then using Lemma C.1 and C.2 d F,s(Mn(bθ(0)), M(θ )) d F,s(M(bθ(0)), M(θ )) + d F,s(Mn(bθ(0)), M(bθ(0))) κ0(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 1 B 2 F )+ sq3(log n)2 log p n κ(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 1 B 2 F )+ sq3(log n)2 log p Heterogeneous Sufficient Dimension Reduction and Subspace Clustering where we use κ0 κ and Ccon κ0 1 κC + Ccon C (1 + κ)C . For λ(1), we have λ(1) = κλ(0) + Cλ q3(log n)2 log p = κC1 d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F s + (1 + κ)Cλ q3(log n)2 log p q3(log n)2 log p n + 4κ0 d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F s , since (1 + κ)Cλ 4Ccon and C1κ 4κ0. Note that 3sq + 2 sq + 2 p sq2. By Lemma C.3, we have b B(1) 1 B 1 F τ0 d F,s(Mn(bθ(0)), M(θ )) + 2 3sq + 2 sq + 2 p κ0(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 1 B 2 F ) + Ccon sq3(log n)2 log p κC1 d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F s + (1 + κ)Cλ q3(log n)2 log p τ0 qκC1](d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )+ τ0 Ccon + 16 τ0 q(1 + κ)Cλ] sq3(log n)2 log p κ(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κ2 sq3(log n)2 log p By results in (i), 4q τ0 κ0 + 16q τ0 κC1 κ. To show 4q τ0 Ccon + 16q τ0 (1+κ)Cλ (1+κ)C , it is equivalent to show 4q τ0 Ccon/(1+ κ) + 16q τ0 Cλ C . Since 1/(1 + κ) < 1 < 1/(1 κ), the result holds by applying second inequality in (ii). Let bβ(t) 1 be the top-d left singular vectors of b B(t) 1 . For matrix A Rp q, Pβ 1 A F,s = sup u Rp q vec(u) L(s) Spq 1 Pβ 1 A, u F = sup u Rp q vec(u) L(s) Spq 1 Pβ 1 u, A F Pβ 1 u F A F,s where we use the fact that vec(Pβ 1 A) L(s). If vec(A) L(s), A F,s = sup u Rp q vec(u) L(s) Spq 1 A, u F A, A/ A F F = A F . Then, we have bΓ(1) 1 Γ 1 F = P b β(1) 1 b U1(bθ(0))[bπ1(bθ(0))bΣf] 1 Pβ 1 U 1[π 1 bΣf] 1 F P b β(1) 1 b U1(bθ(0))[bπ1(bθ(0))bΣf] 1 Pβ 1 b U1(bθ(0))[bπ1(bθ(0))bΣf] 1 F,s+ Pβ 1 b U1(bθ(0))[bπ1(bθ(0))bΣf] 1 Pβ 1 U 1[π 1 bΣf] 1 F,s P b β(1) 1 Pβ 1 F b U1(bθ(0)) 1 bπ1(bθ(0)) bΣ 1 f F,s+ d b U1(bθ(0))[bπ1(bθ(0))bΣf] 1 U 1[π 1 bΣf] 1 F,s. 
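For orientation in the induction that follows, the three claims being propagated can be written in display form, with ∨ denoting the maximum; we assume, consistent with the definition of λ^{(0)}, that the factor 1/√s applies to the whole maximum:
$$\lambda^{(t+1)} \le 4C_{\mathrm{con}}\sqrt{\frac{q^3(\log n)^2\log p}{n}} + \frac{4\kappa_0}{\sqrt{s}}\Big(d_F(\widehat{\theta}^{(t)},\theta^*) \vee \|\widehat{B}_1^{(t)}-B_1^*\|_F \vee \|\widehat{B}_2^{(t)}-B_2^*\|_F\Big),$$
$$d_F(\widehat{\theta}^{(t+1)},\theta^*) \vee \|\widehat{B}_1^{(t+1)}-B_1^*\|_F \vee \|\widehat{B}_2^{(t+1)}-B_2^*\|_F \le \kappa^{t+1}\Big(d_F(\widehat{\theta}^{(0)},\theta^*) \vee \|\widehat{B}_1^{(0)}-B_1^*\|_F \vee \|\widehat{B}_2^{(0)}-B_2^*\|_F\Big) + \frac{1-\kappa^{t+2}}{1-\kappa}\,C^*\sqrt{\frac{sq^3(\log n)^2\log p}{n}},$$
$$d_F(\widehat{\theta}^{(t+1)},\theta^*) \vee \|\widehat{B}_1^{(t+1)}-B_1^*\|_F \vee \|\widehat{B}_2^{(t+1)}-B_2^*\|_F \le r_\Omega,\qquad \mathrm{vec}(\widehat{\Gamma}_w^{(t+1)}-\Gamma_w^*) \in L(s),\ w=1,2.$$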
Heterogeneous Sufficient Dimension Reduction and Subspace Clustering By condition (C3), we have b B(1) 1 B 1 2 b B(1) 1 B 1 F κ(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 (1 κ)M5 C σd(B 1) := C . For the first term in the last equality, by Lemma F.4, we have P b β(1) 1 Pβ 1 F Cβ b B(1) 1 B 1 F , 2d(4σ1(B ) + 2C )/σ2 d(B 1). Note that b U1(bθ(0)) = 1 n Pn i=1 γ1,bθ(0)(Xi, Yi)Xif T i and 0 γ1,bθ(0)(Xi, Yi) 1. According to Lemma F.3, b U1(bθ(0))bΣ 1 f F,s 1 M1 b U1(bθ(0)) F,s M/M1, with probability at least 1 o(1). For the second inequality, b U1(bθ(0))[bπ1(bθ(0))bΣf] 1 U 1[π 1 bΣf] 1 F,s 1 M1 b U1(bθ(0))[bπ1(bθ(0))] 1 U 1(π 1) 1 F,s 1 M1 b U1(bθ(0)) U 1 F,s[bπ1(bθ(0))] 1 + 1 M1 U 1 F |[bπ1(bθ(0))] 1 (π 1) 1| 1 M1 b U1(bθ(0)) U 1 F,s[bπ1(bθ(0))] 1 + Mb M2 M1 |[bπ1(bθ(0))] 1 (π 1) 1|. Then, there exists a positive constant C0, such that bΓ(1) 1 Γ 1 F C0[ b B(1) 1 B 1 F + d F,s(Mn(bθ(0)), M(θ ))]. Without loss of generality, we assume C0 1. Therefore, we have d F (bθ(1), θ ) b B(1) 1 B 1 F b B(1) 2 B 2 F (C0 4 τ0 + C0)d F,s(Mn(bθ(0)), M(θ )) + C0 2 τ0 λ(1)( p 3sq + 2 sq + 2 p κ0(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 1 B 2 F ) + Ccon sq3(log n)2 log p κC1 d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F s + (1 + κ)Cλ q3(log n)2 log p τ0 + 1)κ0 + 16C0 τ0 qκC1](d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )+ τ0 + 1)Ccon + 16C0 τ0 q(1 + κ)Cλ] q3(log n)2 log p κ(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κ2 q3(log n)2 log p By results in (i), ( 4C0q τ0 + C0)κ0 + 16C0q τ0 κC1 κ. To show ( 4C0q τ0 + C0)Ccon + 16C0q τ0 (1 + κ)Cλ (1 + κ)C , it is equivalent to show ( 4C0q τ0 + C0)Ccon/(1 + κ) + 16C0q τ0 Cλ C . Since 1/(1 + κ) < 1 < 1/(1 κ), the result holds by applying second inequality in (ii). In addition, since d F (θ(0), θ ) B(0) 1 B 1 F B(0) 2 B 2 F < rΩ, d F (bθ(1), θ ) b B(1) 1 B 1 F b B(1) 2 B 2 F κ(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κ2 q3(log n)2 log p κrΩ+ (1 + κ)C r q3(log n)2 log p Heterogeneous Sufficient Dimension Reduction and Subspace Clustering since rΩ> 1+κ q3(log n)2 log p n when n is sufficiently large. Then by Lemma F.5, bθ(1) Bcon(θ ). Next, we assume the following holds for t-th step, q3(log n)2 log p n + 4κ0(d F (bθ(t 1), θ ) b B(t 1) 1 B 1 F b B(t 1) 1 B 2 F s ), d F (bθ(t), θ ) b B(t) 1 B 1 F b B(t) 2 B 2 F κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )+ sq3(log n)2 log p d F (bθ(t), θ ) b B(t) 1 B 1 F b B(t) 2 B 2 F rΩ, vec(bΓ(t) w Γ w) L(s). By Lemma F.5, bθ(t) Bcon(θ ). Then q3(log n)2 log p n + 4κ0(d F (bθ(t), θ ) b B(t) 1 B 1 F b B(t) 1 B 2 F s ) q3(log n)2 log p κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )+ sq3(log n)2 log p 4κ0κt (d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) s + (4Ccon + 4κ0 1 κt+1 q3(log n)2 log p κt+1C1 (d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) s + 1 κt+2 q3(log n)2 log p Use (i), 4κ0 C1κ. By the definition of Cλ, 1 κ Cλ = 1 κt+2 1 κ | {z } >1 (4Ccon + 4κ0 1 κC ) 4Ccon + 4κ0 4Ccon + 4κ0 1 κt+1 Then note that d F,s(Mn(bθ(t)), M(θ )) d F,s(M(bθ(t)), M(θ )) + d F,s(Mn(bθ(t)), M(bθ(t))) κ0(d F (bθ(t), θ ) b B(t) 1 B 1 F b B(t) 1 B 2 F ) + Ccon sq3(log n)2 log p κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κt+1 sq3(log n)2 log p sq3(log n)2 log p κt+1(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κt+2 sq3(log n)2 log p Heterogeneous Sufficient Dimension Reduction and Subspace Clustering since κ0 κ and κ0 1 κt+1 1 κ C + Ccon 1 κt+2 1 κ C . 
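To verify the last inequality explicitly (a one-line check using κ₀ ≤ κ and C_con ≤ C*, the latter implied by the first inequality in (ii)):
$$\kappa_0\,\frac{1-\kappa^{t+1}}{1-\kappa}\,C^* + C_{\mathrm{con}} \le \kappa\,\frac{1-\kappa^{t+1}}{1-\kappa}\,C^* + C^* = \frac{(\kappa-\kappa^{t+2}) + (1-\kappa)}{1-\kappa}\,C^* = \frac{1-\kappa^{t+2}}{1-\kappa}\,C^*.$$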
Then by Lemma C.3, b B(t+1) 1 B 1 F τ0 d F,s(Mn(bθ(t)), M(θ )) + 16q κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κt+1 sq3(log n)2 log p sq3(log n)2 log p κt+1C1 (d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) s + q3(log n)2 log p τ0 qκC1]κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )+ τ0 Ccon + 4 τ0 κ0 1 κt+1 1 κ C + 16q sq3(log n)2 log p κt+1(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κt+2 sq3(log n)2 log p In the last inequality, we use ( 4 τ0 C0q + C0)κ0 + 16 τ0 C0qκC1 κ, and τ0 + C0)Ccon + (4C0q τ0 + C0)κ0 1 κt+1 1 κ C + 16C0q 1 κ Cλ 1 κt+2 τ0 + C0)Ccon + ( 1 1 κ C + 16C0q 1 κ Cλ 1 κt+2 τ0 + C0)Ccon + 16C0q 1 κ Cλ 1 κt+2 where the right-hand side is greater than C and thus the inequality holds due to (ii). Using the same argument, bΓ(t+1) 1 Γ 1 F C0[ b B(t+1) 1 B 1 F + d F,s(Mn(bθ(t)), M(θ ))]. d F (bθ(t+1), θ ) b B(t+1) 1 B 1 F b B(t+1) 2 B 2 F τ0 + C0)d F,s(Mn(bθ(t)), M(θ )) + 16C0q τ0 + C0) n κ0 h κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )+ sq3(log n)2 log p sq3(log n)2 log p κt+1C1 (d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) s + 1 κt+2 q3(log n)2 log p τ0 + C0)κ0 + 16C0q τ0 κC1]κt(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )+ τ0 + C0)Ccon + (4C0 τ0 + C0)κ0 1 κt+1 1 κ C + 16C0q sq3(log n)2 log p κt+1(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κt+2 sq3(log n)2 log p Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Further, since 1 κt+2 1+κ 1 κt+1 and rΩ> 1+κ q3(log n)2 log p n when n is sufficiently large, d F (bθ(t+1), θ ) b B(t+1) 1 B 1 F b B(t+1) 2 B 2 F κt+1(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) + 1 κt+2 sq3(log n)2 log p κt+1rΩ+ 1 κt+2 sq3(log n)2 log p n κt+1rΩ+ 1 κt+2 1 + κ rΩ rΩ. When t ( log κ) 1 log{n(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )}, κt κ logκ{n(d F (bθ(0),θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F )} n(d F (bθ(0), θ ) b B(0) 1 B 1 F b B(0) 2 B 2 F ) . which implies b B(t) w B w F = O sq3(log n)2 log p With Lemma F.4, D(S b β(t) w , Sβ w) = P b β(t) w Pβ w F 2d b B(t) w B w F = O sq3(log n)2 log p D. Proof of Lemma C.1 D.1. Contraction of weights In this section, we show |π1(θ) π1(θ )| κ0(d F (θ, θ ) B1 B 1 F B2 B 2 F ). By definition, π1(θ) π1(θ ) = E[ 1 i=1 (γ1,θ(Xi, Yi) γ1,θ (Xi, Yi))]. Heterogeneous Sufficient Dimension Reduction and Subspace Clustering When Yi is fixed, we can not further simplify the above. Thus, for given i, we bound | E[γ1,θ(Xi, Yi) γ1,θ (Xi, Yi)]|. Let ξT = (π1, vec(Γ2 Γ1)T , vec(Γ2 + Γ1)T ), ξ = ξ ξ , ξu = ξ + u ξ. Then ξ0 = ξ , ξ1 = ξ. Then E[γ1,θ(Xi, Yi) γ1,θ (Xi, Yi)] 0 γ1,ξ(Xi, Yi) 0 γ1,ξ(Xi, Yi) ξ=ξu, π1 du + E Z 1 0 γ1,ξ(Xi, Yi) ξ=ξu, Γ2 Γ1 du + 0 γ1,ξ(Xi, Yi) vec(Γ2 + Γ1) ξ=ξu , Γ2+Γ1 du 0 E[ γ1,ξ(Xi, Yi) π1 ] ξ=ξu, (π1 π 1) du + Z 1 0 E[ γ1,ξ(Xi, Yi) vec(Γ2 Γ1)] ξ=ξu, Γ2 Γ1 du+ 0 E[ γ1,ξ(Xi, Yi) vec(Γ2 + Γ1)] ξ=ξu, Γ2+Γ1 du sup ξ Bcon(θ ) | E[ γ1,ξ(Xi, Yi) π1 ]| |π1 π 1| sup ξ Bcon(θ ) E[ γ1,ξ(Xi, Yi) vec(Γ2 Γ1)] 2 Γ2 Γ1 Γ 2 + Γ 1 F | {z } (II) sup ξ Bcon(θ ) E[ γ1,ξ(Xi, Yi) vec(Γ2 + Γ1)] 2 Γ2 + Γ1 Γ 2 Γ 1 F | {z } (III) Thus, we bound the three terms in the last inequality. Recall that γ1,ξ(Xi, Yi) = π1 π1 + (1 π1) exp{[Xi 1 2(Γ2 + Γ1)fi]T (Γ2 Γ1)fi}. We can decompose Xi as the sum of two independent random variables, Xi d Z + ψfi, where Z N(0, Ip) and is independent of fi and Wi, P(ψ = Γ 1) = π 1 and P(ψ = Γ 2) = 1 π 1. Let δ(Γ) = ψ (Γ2 + Γ1)/2. Then 2(Γ2 + Γ1)fi d Z + ψfi 1 2(Γ2 + Γ1)fi d Z + δ(Γ)fi. Therefore, we can write γ1,ξ(Xi, Yi) = π1 π1 + (1 π1) exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi}. 
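The responsibility γ_{1,ξ} is the central object of the E-step, and the closed form above is straightforward to compute. A minimal sketch (illustrative only; it assumes numpy arrays of shapes p, q, and p×q, and clips the exponent for numerical stability):

```python
import numpy as np

def gamma1(x, f, pi1, Gamma1, Gamma2):
    # E-step responsibility of cluster 1 in the two-cluster mix-PFC model:
    #   gamma_1 = pi1 / (pi1 + (1 - pi1) * exp{(x - (G2 + G1) f / 2)^T (G2 - G1) f}).
    t = (x - 0.5 * (Gamma2 + Gamma1) @ f) @ ((Gamma2 - Gamma1) @ f)
    t = np.clip(t, -700.0, 700.0)  # keep exp(t) finite in double precision
    return pi1 / (pi1 + (1.0 - pi1) * np.exp(t))
```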
By calculation, we have γ1,ξ(Xi, Yi) π1 = exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi} (π1 + (1 π1) exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi})2 γ1,ξ(Xi, Yi) vec(Γ2 Γ1) = π1(1 π1)exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi} fi (Z + δ(Γ)fi) (π1 + (1 π1) exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi})2 γ1,ξ(Xi, Yi) vec(Γ2 + Γ1) = π1(1 π1)exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi} 1 2fi (Γ2 Γ1)fi (π1 + (1 π1) exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi})2 . Let T1 = (Γ2 Γ1)fi and T2 = δ(Γ)fi. Let Hi be an orthonormal matrix whose first row is TT 1 / T1 2. Then Hi T1 = T1 2e1, where e1 is the basis vector in the Euclidean space whose first entry is 1 and zero otherwise. Then f T i (Γ2 Γ1)T Z = TT 1 Z = TT 1 HT i Hi Z = TT 1 HT i V = V1 T1 2, (10) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering where V1 is the first coordinate of V = Hi Z N(0, Ip) and is a standard normal distribution. Then E[ γ1,ξ(Xi, Yi) π1 ] = E[ exp(TT 1 (Z + T2) (π1 + (1 π1) exp{TT 1 (Z + T2)})2 ] = E[ exp( T1 2Z1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 ], where Z1 is a standard normal distribution. Note that |TT 1 T2| = |f T i (Γ2 Γ1)T δ(Γ)fi| c| tr(δ(Γ)bΣf(Γ2 Γ1)T )| c1Ω2, where c1 = c(1 Cd). Similarly, we can show |TT 1 T2| c2Ω2. Recall that Ω= q tr[(Γ 2 Γ 1)bΣf(Γ 2 Γ 1)T ] = bΣ1/2 f (Γ 2 Γ 1)T F . We have Ω2 = vec((Γ 2 Γ 1)T )T vec(bΣf(Γ 2 Γ 1)T ) = vec((Γ 2 Γ 1)T )T (Ip bΣf) vec((Γ 2 Γ 1)T ), which implies Ω2/M2 Γ 2 Γ 1 2 F Ω2/M1. Then, when 2 M2Cb < 1, q TT 1 T1 = q f T i (Γ2 Γ1)T (Γ2 Γ1)fi = q tr[(Γ2 Γ1)fif T i (Γ2 Γ1)T ] tr[(Γ2 Γ1)bΣf(Γ2 Γ1)T ] = c bΣ1/2 f (Γ2 Γ1)T F c bΣ1/2 f (Γ 2 Γ 1)T F bΣ1/2 f (Γ2 Γ 2 Γ1 + Γ 1) F M2CbΩ) c3Ω, for some constant c3 that depends on M2 and Cb. Similarly, we can show p TT 1 T1 c4Ω. Define events Ei = {| T1 2Z1| c1 On the event Ei, | T1 2Z1 + TT 1 T2| |TT 1 T2| |T1 2Z1| c1Ω2/2. Using the tail probability of normal distribution, we obtain P(Ec i ) 2 exp( c2 1Ω4 8 T1 2 2 ) 2 exp( c2 1Ω2 E[ γ1,ξ(Xi, Yi) = E[ exp( T1 2Z1 + TT 1 T2) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 ] = E[ exp( T1 2Z1 + TT 1 T2) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 E exp( T1 2Z1 + TT 1 T2) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 Ec i ]P(Ec i ) 1 min{π2 1, (1 π1)2} exp( c1Ω2 2 ) + 1 2 min{π2 1, (1 π1)2} exp( c2 1Ω2 c2 0 exp( c1Ω2 2c2 0 exp( c2 1Ω2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2). (11) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering We proceed to bound (II). Note that 1 π1(1 π1) γ1,ξ(Xi, Yi) vec(Γ2 Γ1) = exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi} (π1 + (1 π1) exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi})2 fi (Z + δ(Γ)fi) = γ1,ξ(Xi, Yi) π1 fi Z | {z } (II.i) + γ1,ξ(Xi, Yi) π1 fi δ(Γ)fi | {z } (II.ii) By definition of Hi, fi Z = fi I1 HT i Hi Z = (fi HT i )(Hi Z) = (fi HT i )V. For the first term, we have E γ1,ξ(Xi, Yi) π1 fi Z = E (fi HT i ) exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 V = (fi HT i ) E exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 V1e1 where the last inequality uses the fact that V1 and Vj are independent for any 1 < j p and E[Vj] = 0. 
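Spelled out, the independence argument reads as follows: writing g(V₁) for the scalar factor exp(‖T₁‖₂V₁ + T₁ᵀT₂)/(π₁ + (1−π₁)exp{‖T₁‖₂V₁ + T₁ᵀT₂})², every coordinate j > 1 of E[g(V₁)V] factors and vanishes, so only the first survives:
$$\mathbb{E}[g(V_1)V] = \big(\mathbb{E}[g(V_1)V_1],\ \mathbb{E}[g(V_1)]\,\mathbb{E}[V_2],\ \dots,\ \mathbb{E}[g(V_1)]\,\mathbb{E}[V_p]\big)^T = \mathbb{E}[g(V_1)V_1]\,e_1,$$
since V ~ N(0, I_p) has independent, mean-zero coordinates.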
Then E exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 V1 = E exp( T1 2Z1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 ( T1 2Z1 + TT 1 T2 TT 1 T2) 1 T1 2 E exp( T1 2Z1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 ( T1 2Z1 + TT 1 T2) Ei P(Ei) 1 T1 2 E exp( T1 2Z1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 ( T1 2Z1 + TT 1 T2) Ec i P(Ec i ) 1 T1 2 E exp( T1 2Z1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 TT 1 T2 For the first term in the last equality, using the fact that on the event Ei, | T1 2Z1 + TT 1 T2| c1Ω2/2, it is bounded by 2 min{π2 1, (1 π1)2} exp( 3c1Ω2 The second term is bounded by 1 min{π2 1, (1 π1)2} exp( c2 1Ω2 For the third term, we have E exp( T1 2Z1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 TT 1 T2 w=1 π w E exp( T1 2Z1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 Wi = w |TT 1 T2| T1 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2)c2Ω/c3. Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Since (fi HT i )e1 = vec(HT i e1f T i ) = vec(T1/ T1 2f T i ), it follows that, E[(II.i)] 2 (fi HT i )e1 2 2 min{π2 1, (1 π1)2} exp( 3c1Ω2 1 min{π2 1, (1 π1)2} exp( c2 1Ω2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2)c2Ω/c3 vec(T1/ T1 2f T i ) 2( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 )Ω2) fi 2( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 )Ω2) M4( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 )Ω2). (12) We proceed to bound (II.ii). Note that Γ 1 Γ2 + Γ1 2 F = Γ 1 2 Γ1 2 + Γ 1 2 Γ2 2 Γ 1 Γ1 F + 1 2 Γ 2 Γ2 + Γ 1 Γ 2 F 2CbΩ+ 1 2 M1 Ω= (Cb + 1 2 M1 )Ω, and similarly Γ 2 Γ2+Γ1 2 F (Cb + 1 2 M1 )Ω. Therefore δ(Γ)fi 2 2 Wi M 2 4 (Cb + 1 2 M1 )2Ω2. By the definition of Kronecker product, fi δ(Γ)fi 2 2 Wi = j=1 f 2 ij δ(Γ)fi 2 2 Wi M 4 4 (Cb + 1 2 M1 )2Ω2, where fij is the j-th element of fi. Then E[(II.ii)] 2 = E[ γ1,ξ(Xi, Yi) π1 fi δ(Γ)fi] 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2)M 2 4 (Cb + 1 2 M1 )Ω. E[ γ1,ξ(Xi, Yi) vec(Γ2 Γ1)] 2 π1(1 π1)( E[(II.i)] 2 + E[(II.ii)] 2) 4 ( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 )Ω2)+ 1 2c2 0 M 2 4 (Cb + 1 2 M1 )Ωexp( (c1 2 c2 1 8c2 4 )Ω2) c5 exp( (3c1 8 c2 1 8c2 4 )Ω2), (13) where c5 = max{ M4 4 ( 2 c2 0c3Ω 2c2Ω c2 0c3 ), 1 2c2 0 M 2 4 (Cb + 1 2 M1 )Ω} . For the term (III), note that fi T1 2 2 = j=1 fij T1 2 2 = fi 2 2 T1 2 2 M 2 4 c2 4Ω2. Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Then, we have E[ γ1,ξ(Xi, Yi) vec(Γ2 + Γ1)] 2 = E[π1(1 π1)exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi} 1 2fi (Γ2 Γ1)fi (π1 + (1 π1) exp{(Z + δ(Γ)fi)T (Γ2 Γ1)fi})2 ] 2 2 E[ γ1,ξ(Xi, Yi) π1 ] fi T1 2 8 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2)M4c4Ω 4c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2). (14) Combing results in (11), (13) and (14), we have E[γ1,θ(Xi, Yi) γ1,θ (Xi, Yi)] κπ(d F (θ, θ ) B1 B 1 F B2 B 2 F ), (15) where κπ = ( 2 c2 0 + c5 + M4c4Ω 4c2 0 ) exp( ( 3c1 8 c2 1 8c2 4 )Ω2). D.2. Contraction of matrices Uw We aim to show U1(θ) U1(θ ) F κU(d F (θ, θ ) B1 B 1 F B2 B 2 F ). By definition, Uw(θ) Uw(θ) = E( 1 i=1 {[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xif T i }). When Yi is fixed, Xi are not identically distributed. Thus, we bound the expectation of given i. Let ξ = (π1, vec(Γ2 Γ1), vec(Γ2+Γ1)), ξ = ξ ξ , ξu = ξ +u ξ. Then ξ0 = ξ , ξ1 = ξ. Define the Jacobian matrix J = f/ x Rm n as Jij = fi/ xj, where x = (x1, . . . , xn)T and f = (f1, . . . , fm)T . Then vec(E{[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xif T i }) vec(γ1,ξ(Xi, Yi)Xif T i ) vec(ξ) ξ=ξu vec(ξu) vec(γ1,ξ(Xi, Yi)Xif T i ) π1 ξ=ξu π1du + vec(γ1,ξ(Xi, Yi)Xif T i ) vec(Γ2 Γ1) ξ=ξu Γ2 Γ1du + vec(γ1,ξ(Xi, Yi)Xif T i ) vec(Γ2 + Γ1) ξ=ξu Γ2+Γ1du . 
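The display above applies to vec(γ_{1,ξ}(X_i, Y_i)X_if_iᵀ) the same path-integration device already used for the weights in D.1: for a differentiable map g(ξ) and ξ_u = ξ* + u(ξ − ξ*),
$$g(\xi) - g(\xi^*) = \int_0^1 \frac{d}{du}\, g(\xi_u)\, du = \int_0^1 \Big\langle \frac{\partial g}{\partial \xi}\Big|_{\xi=\xi_u},\ \xi - \xi^* \Big\rangle\, du,$$
so the difference decomposes into one integral term per parameter block π₁, vec(Γ₂ − Γ₁), and vec(Γ₂ + Γ₁), which become the terms (I), (II), and (III) bounded next.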
Heterogeneous Sufficient Dimension Reduction and Subspace Clustering E{[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xif T i } F q E{[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xif T i } 2 q sup ξ Bcon(θ ) E vec(γ1,ξ(Xi, Yi)Xif T i ) π1 q sup ξ Bcon(θ ) E vec(γ1,ξ(Xi, Yi)Xif T i ) vec(Γ2 Γ1) q sup ξ Bcon(θ ) E vec(γ1,ξ(Xi, Yi)Xif T i ) vec(Γ2 + Γ1) q sup ξ Bcon(θ ) E γ1,ξ(Xi, Yi) π1 vec T (Xif T i ) 2|π1 π 1| q sup ξ Bcon(θ ) E γ1,ξ(Xi, Yi) vec(Γ2 Γ1) vec T (Xif T i ) 2 Γ1 Γ 1 Γ2 + Γ 2 F | {z } (II) q sup ξ Bcon(θ ) E γ1,ξ(Xi, Yi) vec(Γ2 + Γ1) vec T (Xif T i ) 2 Γ1 Γ 1 + Γ2 Γ 2 F | {z } (III) Note that vec(Xif T i ) = vec[(Z + ψfi)f T i ] = (fi Ip)(Z + ψfi). For the second term (II), we have 1 π1(1 π1) γ1,ξ(Xi, Yi) vec(Γ2 Γ1) vec T (Xif T i ) = exp((Z + δ(Γ)fi)T ((Γ2 Γ1)fi) {π1 + (1 π1) exp((Z + δ(Γ)fi)T ((Γ2 Γ1)fi)}2 (fi Ip)(Z + δ(Γ)fi) vec T (Xif T i ) = γ1,ξ(Xi, Yi) π1 (fi Ip)(Z + δ(Γ)fi) (Z + ψfi)T (f T i Ip) = γ1,ξ(Xi, Yi) π1 (fi Ip)ZZT (f T i Ip) | {z } (II.i) + γ1,ξ(Xi, Yi) π1 (fi Ip)Zf T i ψT (f T i Ip) | {z } (II.ii) γ1,ξ(Xi, Yi) π1 (fi Ip)δ(Γ)fi ZT (f T i Ip) | {z } (II.iii) + γ1,ξ(Xi, Yi) π1 (fi Ip)δ(Γ)fif T i ψT (f T i Ip) | {z } (II.iv) Recall that T1 = (Γ2 Γ1)fi, T2 = δ(Γ)fi, Hi is an orthonormal matrix whose first row is TT 1 / T1 2, V = Hi Z Heterogeneous Sufficient Dimension Reduction and Subspace Clustering N(0, Ip). For the first term, we have = (fi Ip) E[ γ1,ξ(Xi, Yi) π1 ZZT ](f T i Ip) = (fi Ip)HT i E[ γ1,ξ(Xi, Yi) π1 Hi ZZT HT i ](f T i Ip)Hi = (fi Ip)HT i E[ γ1,ξ(Xi, Yi) π1 VVT ](f T i Ip)Hi = (fi Ip)HT i E[ exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 VVT ](f T i Ip)Hi (by(10)) = (fi Ip)HT i E[ exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 (V 2 1 1)] | {z } (II.i.a) e1e T 1 (f T i Ip)Hi+ (fi Ip)HT i E[ exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 ] | {z } (II.i.b) Ip(f T i Ip)Hi. Recall that the event Ei = {| T1 2Z1| c1 2 Ω2} and on the event Ei, | T1 2Z1 + TT 1 T2| |TT 1 T2| |T1 2Z1| c1Ω2/2. Let t = T1 2Z1 + TT 1 T2, g(t) = exp(t) (π1+(1 π1) exp{t})2 . Then | E[(II.i.a)]| = E[ exp( T1 2Z1 + TT 1 T2) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 (Z2 1 1)] E[g(t)(Z2 1 1)|Ei]P(Ei) + E[g(Z1)(Z2 1 1)|Ec i ]P(Ec i ) = E[g(t)(( T1 2Z1 + TT 1 T2)2 2TT 1 T2( T1 2Z1 + TT 1 T2) + (TT 1 T2)2 T1 2 2 T1 2 2 )|Ei]P(Ei) + E[g(t)(( T1 2Z1 + TT 1 T2)2 2TT 1 T2( T1 2Z1 + TT 1 T2) + (TT 1 T2)2 T1 2 2 T1 2 2 )|Ec i ]P(Ec i ) n E[g(t)t2|Ei] + E[g(t)( T1 2 2 (TT 1 T2)2)|Ei] + E[g(t) 2TT 1 T2t|Ei] o + n E[g(t)t2|Ec i ] + E[g(t)( T1 2 2 (TT 1 T2)2)|Ei] + E[g(t) 2TT 1 T2t|Ei] o P(Ec i )+ 4 min{π2 1, (1 π1)2} exp( c1 4 Ω2) + (c2 4 + c2 2)Ω2 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2) + 2c2Ω2 2 min{π2 1, (1 π1)2} exp( 3c1 8 Ω2) + 1 c2 3Ω2 4 min{π2 1, (1 π1)2} + (c2 4 + c2 2)Ω2 min{π2 1, (1 π1)2} 2 exp( c2 1Ω2 4 + 2(c2 2 + c2 4)Ω2 + 4c2Ω2 c2 3c2 0Ω2 exp( (c1 4 c2 1 8c2 4 )Ω2) + 8 + (c2 2 + c2 4)Ω2 + 2c2Ω2 c2 3c2 0Ω2 exp( c2 1Ω2 Let c5 = 4+2(c2 2+c2 4)Ω2+4c2Ω2 c2 3c2 0Ω2 + 8+(c2 2+c2 4)Ω2+2c2Ω2 c2 3c2 0Ω2 . We have | E[(II.i.a)]| c5 exp( (c1 4 c2 1 8c2 4 )Ω2). (16) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering The term (II.i.b) can be bounded same as (11), | E[(II.i.b)]| 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2). (17) Then we bound the spectral norm of E[(II.i)], E[(II.i)] 2 (fi Ip)HT i e1e T 1 (f T i Ip)Hi 2 | E[(II.i.a)]|+ (fi Ip)HT i Ip(f T i Ip)Hi 2 | E[(II.i.b)]| M 2 4 | E[(II.i.a)]| + M 2 4 | E[(II.i.a)]| M 2 4 c5 exp( (c1 4 c2 1 8c2 4 )Ω2) + M 2 4 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2) M 2 4 (c5 + 2 c2 0 ) exp( (c1 4 c2 1 8c2 4 )Ω2). 
(18) The other three terms are easier to bound. Note that E[(II.ii)] = (fi Ip) E[ γ1,ξ(Xi, Yi) π1 Zf T i ψT ](f T i Ip) = (fi Ip)HT i E[ γ1,ξ(Xi, Yi) π1 V1e1f T i ψT ](f T i Ip). Using the similar technique in (12), we have E[ γ1,ξ(Xi, Yi) π1 V1e1f T i ψT ] 2 |E[ γ1,ξ(Xi, Yi) M4Mb( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 ). E[(II.ii)] 2 (fi Ip)HT i 2 E[ γ1,ξ(Xi, Yi) π1 V1e1f T i ψT ] 2 (f T i Ip) 2 M 3 4 Mb( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 ). (19) We proceed to bound (II.iii), E[(II.iii)] = (fi Ip) E[ γ1,ξ(Xi, Yi) π1 δ(Γ)fi ZT ](f T i Ip) = (fi Ip) E[ γ1,ξ(Xi, Yi) π1 δ(Γ)fie T 1 V1]Hi(f T i Ip). Using the similar technique in (12) and δ(Γ)fi 2 2 M 2 4 (Cb + 1 2 M1 )2Ω2, we have E[ γ1,ξ(Xi, Yi) π1 δ(Γ)fie T 1 V1] 2 M4(Cb + 1 2 M1 )( 2 c2 0c3 2c2Ω2 c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 ). E[(II.iii)] 2 (fi Ip) 2 E[ γ1,ξ(Xi, Yi) π1 δ(Γ)fie T 1 V1] 2 Hi(f T i Ip) 2 M 3 4 (Cb + 1 2 M1 )( 2 c2 0c3 2c2Ω2 c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 ). (20) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Finally, we bound (II.iv), E[(II.iv)] 2 = (fi Ip) E[ γ1,ξ(Xi, Yi) π1 δ(Γ)fif T i ψT ](f T i Ip) 2 M 4 4 Mb(Cb + 1 2 M1 )Ω2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2). (21) Combing results (18), (19), (20), and (21), we have E γ1,ξ(Xi, Yi) vec(Γ2 Γ1) vec T (Xif T i ) 2 π1(1 π1)( E[(II.i)] 2 + E[(II.ii)] 2 + E[(II.iii)] 2 + E[(II.iv)] 2) M 2 4 (c5 + 2 c2 0 ) exp( (c1 4 c2 1 8c2 4 )Ω2) + M 3 4 Mb( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 ) + M 3 4 (Cb + 1 2 M1 )( 2 c2 0c3 2c2Ω2 c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 )+ M 4 4 Mb(Cb + 1 2 M1 )Ω2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2) c6 exp( (c1 4 c2 1 8c2 4 )Ω2), (22) where c6 = 1 4 n M 2 4 (c5 + 2 c2 0 ) + M 3 4 Mb( 2 c2 0c3Ω 2c2Ω c2 0c3 ) + M 3 4 (Cb + 1 2 M1 )( 2 c2 0c3 2c2Ω2 M 4 4 Mb(Cb + 1 2 M1 )Ω2 Next, we bound the first term (I), E γ1,ξ(Xi, Yi) π1 vec T (Xif T i ) 2 = E γ1,ξ(Xi, Yi) π1 (fi Ip)(Z + ψfi) 2 (fi Ip) E γ1,ξ(Xi, Yi) π1 Z 2 | {z } (I.i) + (fi Ip) E γ1,ξ(Xi, Yi) 2 | {z } (I.ii) Similarly to (12), we have (I.i) = (fi Ip)HT i E γ1,ξ(Xi, Yi) = (fi Ip)HT i e1 E exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 V1 M4( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 )Ω2). (23) According to (11), (I.ii) M 2 4 Mb 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2). (24) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering E γ1,ξ(Xi, Yi) π1 vec T (Xif T i ) 2 M4( 2 c2 0c3Ω 2c2Ω c2 0c3 ) exp( (3c1 8 c2 1 8c2 4 )Ω2) + M 2 4 Mb 2 c2 0 exp( (c1 2 c2 1 8c2 4 )Ω2) c7 exp( (3c1 8 c2 1 8c2 4 )Ω2), (25) where c7 = M4( 2 c2 0c3Ω 2c2Ω c2 0c3 ) + M 2 4 Mb 2 For the third term, using (25) and the fact that fi (Γ2 Γ1)fi 2 M4c4Ω, E γ1,ξ(Xi, Yi) vec(Γ2 + Γ1) vec T (Xif T i ) 2 2 fi (Γ2 Γ1)fi E[ γ1,ξ(Xi, Yi) π1 vec T (Xif T i )] 2 8 fi (Γ2 Γ1)fi 2 E[ γ1,ξ(Xi, Yi) π1 vec T (Xif T i )] 2 8 c7 exp( (3c1 8 c2 1 8c2 4 )Ω2). (26) Combining results (22), (25) and (26), we have E{[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xif T i } F qd F (θ, θ ) c6 exp( (c1 4 c2 1 8c2 4 )Ω2) + c7 exp( (3c1 8 c2 1 8c2 4 )Ω2)+ 8 c7 exp( (3c1 8 c2 1 8c2 4 )Ω2) qd F (θ, θ )[c6 + c7 + M4c4Ω 8 c7] exp( (3c1 8 c2 1 8c2 4 )Ω2) κU(d F (θ, θ ) B1 B 1 F B2 B 2 F ). (27) where κU = q[c6 + c7 + M4c4Ω 8 c7] exp( ( 3c1 8 c2 1 8c2 4 )Ω2). D.3. Contraction of covariance matrices We aim to show [Σ1(θ) Σ1(θ )]B 1 F κΣ(d F (θ, θ ) B1 B 1 F B2 B 2 F ). By definition, [Σ1(θ) Σ1(θ )]B 1 = E( 1 i=1 {[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xi XT i B 1}). 
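Before developing the covariance case in detail, it is worth recording the shape of the conclusions (15) and (27): both contraction factors decay exponentially in Ω², so a larger separation Ω between the two latent clusters yields a faster-contracting population EM operator. Writing C(Ω) for a factor collecting the constants c₆ and c₇ above,
$$\|U_1(\theta) - U_1(\theta^*)\|_F \le \kappa_U\big(d_F(\theta,\theta^*) \vee \|B_1-B_1^*\|_F \vee \|B_2-B_2^*\|_F\big),\qquad \kappa_U = \sqrt{q}\,C(\Omega)\exp\Big\{-\Big(\frac{3c_1}{8} - \frac{c_1^2}{8c_4^2}\Big)\Omega^2\Big\}.$$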
Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Using the same argument as before, we obtain E{[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xi XT i B 1} F q E{[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xi XT i B 1} 2 q sup ξ Bcon(θ ) E vec(γ1,ξ(Xi, Yi)Xi XT i B 1) π1 q sup ξ Bcon(θ ) E vec(γ1,ξ(Xi, Yi)Xi XT i B 1) vec(Γ2 Γ1) q sup ξ Bcon(θ ) E vec(γ1,ξ(Xi, Yi)Xi XT i B 1) vec(Γ2 + Γ1) q sup ξ Bcon(θ ) E γ1,ξ(Xi, Yi) π1 vec T (Xi XT i B 1) 2|π1 π 1| q sup ξ Bcon(θ ) E γ1,ξ(Xi, Yi) vec(Γ2 Γ1) vec T (Xi XT i B 1) 2 Γ1 Γ 1 Γ2 + Γ 2 F | {z } (II) q sup ξ Bcon(θ ) E γ1,ξ(Xi, Yi) vec(Γ2 + Γ1) vec T (Xi XT i B 1) 2 Γ1 Γ 1 + Γ2 Γ 2 F | {z } (III) We focus on the second term (II). Note that vec(Xi XT i B 1) = [(XT i B 1)T Ip]Xi. Then, 1 π1(1 π1) γ1,ξ(Xi, Yi) vec(Γ2 Γ1) vec T (Xi XT i B 1) = exp((Z + δ(Γ)fi)T ((Γ2 Γ1)fi) {π1 + (1 π1) exp((Z + δ(Γ)fi)T ((Γ2 Γ1)fi)}2 (fi Ip)(Z + δ(Γ)fi) vec T (Xi XT i B 1) = γ1,ξ(Xi, Yi) π1 (fi Ip)(Z + δ(Γ)fi) (Z + ψfi)T ((Z + ψfi)T B 1 Ip). (28) We can decompose the last line into a sum of 8 terms. The term involves three Z is the most complicated one. We consider γ1,ξ(Xi, Yi) π1 (fi Ip)ZZT (ZT B 1 Ip). Recall that Hi is an orthonormal matrix whose first row is TT 1 / T1 2. We further require that B 1,1: span(Hi,1:, Hi,2:), B 1,2: span(Hi,1:, Hi,2:, Hi,3:), . . . , B 1,q: span(Hi,1:, . . . , Hi,(q+1):), where Aj: is the j-th row of matrix A. Thus there exists a matrix Λ Rp q such that B 1 = HT i Λ. Then ZT B 1 = (HT i Hi Z)T B 1 = ZT HT i Hi HT i Λ = VT Λ j=1 λj2Vj, . . . , j=1 λjq Vj) := MT , Heterogeneous Sufficient Dimension Reduction and Subspace Clustering where λjk is the (j, k)-th element of Λ and Vj is the j-th element of V. When j > k + 1, λjk = 0 and Λ F = B 1 F d Mb M2. Therefore, γ1,ξ(Xi, Yi) π1 (fi Ip)ZZT (ZT B 1 Ip) = γ1,ξ(Xi, Yi) π1 (fi Ip)HT i VVT Hi(MT Ip) = γ1,ξ(Xi, Yi) π1 (fi Ip)HT i VVT (MT Hi Ip) = γ1,ξ(Xi, Yi) π1 (fi Ip)HT i VVT (M1Hi, M2Hi, . . . , Mq Hi), where Mj is the j-th element of M. Therefore, we have E[ γ1,ξ(Xi, Yi) π1 (fi Ip)ZZT (ZT B 1 Ip)] = (fi Ip)HT i E[ γ1,ξ(Xi, Yi) π1 VVT (M1Hi, M2Hi, . . . , Mq Hi)] = (fi Ip)HT i E[ exp( T1 2V1 + TT 1 T2)) (π1 + (1 π1) exp{ T1 2V1 + TT 1 T2)})2 | {z } g(V1) VVT (M1Hi, M2Hi, . . . , Mq Hi)]. Note that E[g(V1)VVT Mk] λ1k V 3 1 λ1k V1V 2 2 ... λ1k V1V 2 p λ2k V1V 2 2 λ3k V1V 2 3 λ(k+1)k V1V 2 k+1 0 λ2k V1V 2 2 λ1k V 3 2 λ3k V1V 2 3 λ1k V 3 3 ... ... λ(k+1)k V1V 2 k+1 λ(k+1)k V 3 k+1 0 0 Therefore, for each k, the matrix E[g(V1)VVT Mk] can be written as the sum of a diagonal matrix and a matrix with 2k non-zero elements. Then, the p pq matrix E[g(V1)VVT (M1Hi, M2Hi, . . . , Mq Hi)] can be written as the sum of two block matrices (E[J1] + E[J2]), where each block has size p p. In the first matrix E[J1], the k-th block is Dk Hi, where Dk is a diagonal matrix. And in the second matrix E[J2], the k-th block is Ak Hi, where Ak is a matrix that only has 2k non-zero elements. Since E[J2] 2 E[J2] F = E[(A1Hi, . . . , Aq Hi)] F = E[(A1, . . . , Aq)] F , each element of E[Ak] only involve E[g(V1)V1], which can be bounded with the same argument used in (12). The total number of non-zeros elements in E[(A1, . . . , Aq)] is q(1 + q). Then we have E[J2] 2 c8q(1 + q) exp( (3c1 8 c2 1 8c2 4 )Ω2), (29) where c8 is some positive constant. Next, we bound E[J1] 2. Since each block of J1 is a diagonal matrix D1 times the orthonormal matrix Hi, rows of J1 are orthogonal. Then we can construct its SVD E[J1] = FDKT in the following way. 
Let F Rp p be the identity matrix, D is a diagonal matrix that elements are ℓ2 norm of rows of J1, and KT is normalized E[J1] where each row is divided by its ℓ2 norm. Therefore, E[J1] 2 equals to the largest ℓ2 norm of rows E[J1]. It is easy to see rows of E[J1] and rows of E[(D1, . . . , D2)] have same ℓ2 norm. Thus, we only need to bound the largest ℓ2 norm of E[(D1, . . . , D2)]. Each row of E[(D1, . . . , D2)] has q non-zero elements, which contain either E[g(V1)V1] or E[g(V1)V 3 1 ]. Since E[g(V1)V1] can be Heterogeneous Sufficient Dimension Reduction and Subspace Clustering bounded similar to (12). We focus on bounding E[g(V1)V 3 1 ]. Recall that the event Ei = {| T1 2Z1| c1 2 Ω2} and on the event Ei, | T1 2Z1 + TT 1 T2| |TT 1 T2| |T1 2Z1| c1Ω2/2. Let t = T1 2Z1 + TT 1 T2, h(t) = exp(t) (π1+(1 π1) exp{t})2 . Then | E[g(V1)V 3 1 ]| = E[ exp( T1 2Z1 + TT 1 T2) (π1 + (1 π1) exp{ T1 2Z1 + TT 1 T2)})2 Z3 1] = E[h(t)( T1 2Z1 + TT 1 T2)3 3(TT 1 T2)2 T1 2Z1 3 T1 2 2TT 1 T2Z2 1 (TT 1 T2)3 E[h(t)( T1 2Z1 + TT 1 T2)3] + 3 T1 2 2 E[h(t)(TT 1 T2)2Z1] E[h(t)TT 1 T2Z2 1 + 1 T1 3 2 E[h(t)(TT 1 T2)3]. The second, third, and fourth terms can be bounded similarly as (12), (16) and (11). For some positive constant c9, we have E[h(t)(TT 1 T2)2Z1] + 3 T1 2 E[h(t)TT 1 T2Z2 1 + 1 T1 3 2 E[h(t)(TT 1 T2)3] c9 exp( (c1 4 c2 1 8c2 4 )Ω2). Using Lemma F.1, E[h(t)( T1 2Z1 + TT 1 T2)3] E[ exp(t) (π1 + (1 π1) exp{t})2 t3|Ei]P(Ei) + E[ exp(t) (π1 + (1 π1) exp{t})2 t3|Ec i ]P(Ec i ) 8 min{π2 1, (1 π1)2} exp( c1Ω2 8 ) + 4 min{π2 1, (1 π1)2} exp( c2 1Ω2 8 c2 0c3 3Ω3 exp( c1Ω2 8 ) + 4 c2 0c3 3Ω3 exp( c2 1Ω2 8c2 4 ) 8 c2 0c3 3Ω3 exp( (c1 8 c2 1 8c2 4 )Ω2). Combining the above results, we have | E[g(V1)V 3 1 ]| (c9 + 8 c2 0c3 3Ω3 ) exp( (c1 8 c2 1 8c2 4 )Ω2). Therefore, the 2-norm of E[J1] (the largest ℓ2 norm of rows of J1) is bounded by E[J1] 2 c10q exp( (c1 8 c2 1 8c2 4 )Ω2). (30) Combing results (29) and (30), E[ γ1,ξ(Xi, Yi) π1 (fi Ip)ZZT (ZT B 1 Ip)] 2 M4[c8q(1 + q) + c10q] exp( (c1 8 c2 1 8c2 4 )Ω2). The other 7 terms in (28) involve Z at most twice and therefore can be bounded with the same technique in (11), (12), and (16). Therefore, we have E γ1,ξ(Xi, Yi) vec(Γ2 Γ1) vec T (Xi XT i B 1) 2 c11q2 exp( (c1 8 c2 1 8c2 4 )Ω2), (31) for some constant c11. Heterogeneous Sufficient Dimension Reduction and Subspace Clustering For the first term (I), γ1,ξ(Xi, Yi) π1 vec T (Xi XT i B 1) = γ1,ξ(Xi, Yi) π1 XT i [(XT i B 1) Ip] = γ1,ξ(Xi, Yi) π1 (Z + ψfi)T ((Z + ψfi)T B 1 Ip), and the third term (III) γ1,ξ(Xi, Yi) vec(Γ2 + Γ1) vec T (Xi XT i B 1) 2 γ1,ξ(Xi, Yi) π1 (fi (Γ2 Γ1)fi) vec T (Xi XT i B 1) 2 γ1,ξ(Xi, Yi) π1 (fi (Γ2 Γ1)fi) (Z + ψfi)T ((Z + ψfi)T B 1 Ip). We see that both terms only have ZZT and thus can be bounded similarly to (11), (12), and (16). Finally, for some positive constant c12 we obtain E{[γ1,ξ(Xi, Yi) γ1,ξ (Xi, Yi)]Xi XT i B 1} F qq2c12 exp( (c1 8 c2 1 8c2 4 )Ω2)d F (θ, θ ) κΣ(d F (θ, θ ) B1 B 1 F B2 B 2 F ), (32) where κΣ = qq2c12 exp( ( c1 8 c2 1 8c2 4 )Ω2). E. Proof of Lemma C.2 E.1. Covering number of L(s) We state two lemmas that are used later. Lemma E.1 (Rudelson & Zhou (2012), Lemma 11). Let u1, . . . , u M Rpq. Let y conv(u1, . . . , u M). There exists a set L [M] such that |L| m = 4 maxj [M] uj 2 2 ε2 and a vector y conv(uj, j L) such that y y 2 ε. Lemma E.2 (Rudelson & Zhou (2012), Lemma 21). Let u, θ, x Rpq be vectors such that θ 2 = 1, x, θ = 0, and u is not parallel to x. 
Define ϕ : R R by: ϕ(λ) = x + λu, θ Assume ϕ(λ) has a local maximum at 0, then Next, we show the following lemma Lemma E.3. Let 0 < s < 1/3p, and d = 26883sq3, then L(s) Spq 1 2 conv( [ |J| d EJ(pq) Spq 1), (33) where conv denotes the convex hull and EJ(pq) = span(ej : j J). Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Proof. The proof is analogous to that for Lemma 13 of Rudelson & Zhou (2012) with some modifications. We assume d < pq, otherwise the lemma is trivially true. For a vector u Rpq, let T denote the indices of the 3sq largest absolute coefficients of u. Then u T c 1 u e Sc 1 1. We decompose a vector u L(s) Spq 1 as u = u T + u T c u T + [( sq + 2q 3s) u e S1 2 + sq u 2] absconv(ej : j T c), where absconv denotes the absolutely convex hull. Since u T c 2 2 u T c 1 u T c u e Sc 1 1 u T 1 3s) u e S1 2 + sq u 2] u T 2 3sq 3s) u T 2 + sq u 2] u T 2 3sq 3/3 + 2 q) u T 2 + 3/3 u 2] u T 2, we have 1 = u 2 2 = u T 2 2 + u T c 2 2 (1 + 3/3 + 2 q) | {z } a2 u T 2 2 + 3/3 | {z } b Let u T 2 = x. We are interested in finding a range of x that satisfies a2x2 + bx 1, which is equivalent to (ax + b/(2a))2 b2/(4a2) + 1. Then we have u T 2 Define V = {u T + [( sq + 2q 3s) u e S1 2 + sq u 2] absconv(ej : j T c) : u L(s) Spq 1}. We have L(s) Spq 1 V L(s) and V is compact. Therefore, V contains a base of L(s), that is, for any y L(s)/{0}, there exists λ > 0 such that λy V. For any nonzero vector v Rpq, we define F(v) = v v 2 . Then function F is continuous on L(s)/{0} and V. Thus, L(s) Spq 1 = F(L(s)/{0}) = F(V). By duality, inclusion (33) can be derived by showing the supremum of any linear functional over the left side of (33) does not exceed the supremum over the right side of it. Since L(s) Spq 1 = F(V), it is enough to show that for any θ Spq 1, there exists z Rpq/{0} such that supp(z ) d and F(z ) satisfies that max v V F(v), θ 2 F(z ), θ . (34) For a given θ, we construct a d-sparse vector z that satisfies (34). Let z = argmaxv V F(v), θ . By the definition of V, there exists a set I [pq] such that |I| = 3sq and for some ηj {1, 1}, z = z I + [( sq + 2q 3s) z e S1 2 + sq z 2] X j Ic αjηjej, where αj [0, 1], P j Ic αj 1 and z I 2 2a2 . If αi = 1 for some i Ic, then αj = 0 for j Ic/{i} and z is a sparse vector with supp(z) 3sq + 1. Let z = z. Clearly, (33) holds with d = 3sq + 1. In the following, we assume αi [0, 1), i Ic. To use lemma E.1, denote epq+1 = 0, ηpq+1 = 1 and set αpq+1 = 1 X Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Then y := z Ic conv(u1, . . . , u|M|), where uj = [( sq + 2q 3s) z e S1 2 + sq z 2] ηjej, j M = {j Ic {pq + 1} : αj > 0}. We define αpq+1 because the sum of coefficients must equal 1 in convex combinations. According to lemma E.1, there exists a set J M such that |J | m := 4 maxj M uj 2 2 ε2 = 4 maxj M[( sq + 2q 3s) z e S1 2 + sq z 2]2 ej 2 2 ε2 4sq + 12sq2 + 4sq 3q + sq + 2sq + 4sq 3q = 44sq + 8sq 3q + 12sq2 ε2 = 16sq + 32sq 3q + 48sq2 and a vector y conv(uj, j J ) y = [( sq + 2q 3s) z e S1 2 + sq z 2] X such that P j J βj = 1 and y y 2 ε. Set z = z I + y . Then z EJ , where J = (I J ) [pq] and |J | |I| + |J | 3sq + m. We have z z 2 = z z I y 2 = z Ic y 2 = y y 2 ε. For {βj : j J } as above, we extend it to {βj : j Ic {pq + 1}} by setting βj = 0 if j Ic {pq + 1}/{J } and write z = z I + [( sq + 2q 3s) z e S1 2 + sq z 2] X j Ic {pq+1} βjηjej, where βj [0, 1] and P j Ic {pq+1} βj = 1. If z = z , then (34) holds naturally max v V F(v), θ = F(z), θ 2 F(z ), θ = 2 F(z), θ , and d = 3sq + m. 
Otherwise, for some λ to be specified, consider the vector z + λ(z z) = z I + [( sq + 2q 3s) z e S1 2 + sq z 2] X j Ic {pq+1} [(1 λ)αj + λβj]ηjej. j Ic {pq+1}[(1 λ)αj +λβj] = 1. There exists δ0 > 0 such that j Ic {pq +1}, (1 λ)αj +λβj [0, 1] if |λ| < δ0 since This condition holds by continuity for all j such that αj (0, 1). If αj = 0 for some j, then βj = 0 by definition of M. Therefore, we have P j Ic[(1 λ)αj+λβj] 1, which implies z+λ(z z) V. Now consider function ϕ : ( δ0, δ0) R, ϕ(λ) = F(z + λ(z z)), θ = z + λ(z z), θ z + λ(z z) 2 . Since z = argmaxv V F(v), θ , ϕ(λ) attains a local maximum at 0. According to lemma E.2, z, θ = z + (z z), θ z, θ 1 z z 2 z 2 = z 2 z z 2 Heterogeneous Sufficient Dimension Reduction and Subspace Clustering It follows that F(z ), θ F(z), θ = z / z 2, θ z/ z 2, θ = z 2 z 2 z 2 z z 2 z 2 z 2 + z z 2 z 2 z z 2 z 2 = z 2 z z 2 z 2 + z z 2 z 2 + ε = 1 2ε z 2 + ε. We know z 2 z I 2 2a2 , where a2 = 1 + 3/3 + 2 q and b = 3/3. Let ε = 6a2 , we have Therefore, we construct a sparse vector z such that (34) holds. To derive d, we have ε2 = 4a2 + 2b2 2b 3/3 + 2 q) + 2/3 2 3/3 + 2 q) + 1/3 3/3 + 2 q)2 = x + 2/3 2 x + 1/3 9x2 , where x = 4(1 + 3/3 + 2 q). Since q 1, x 4(3 + 3/3). By the derivative d dx(x + 2/3 2 x + 1/3) = 1 1 3x + 1 > 0, when x > 4(3 + Substitute x = 14 into the numerator x2 = 1 16(1 + 3/3 + 2 q)2 > 1 16 14q = 1 224q . Then, we have ε2 < 26880sq3. Let Mnet be the cardinality of a 1/2-net of L(s) Spq 1. We want to bound Mnet, which will be used in the later proof. With lemma E.3 and using the same argument in Rudelson & Zhou (2012)[Section H.1], we have Mnet (1 + 2/(1 d )d = exp(d log(5epq Let Cd = 26883, log(Mnet) d log(5epq d ) = Cdsq3 log( 5epq Cdsq3 ) = Cdsq3[log( p sq2 ) + log( 5e Cd )] Cnetsq3 log( p for some Cnet > 0, when p > csq2 for sufficiently large c. If we want to eliminate log 5e/Cd, log( p sq2 ) log 5e/Cd. That is the reason we require p > csq2. Heterogeneous Sufficient Dimension Reduction and Subspace Clustering E.2. Concentration of the matrices Uw Recall that b U1(θ) = 1 n Pn i=1 γ1,θ(Xi, Yi)Xif T i , U1(θ) = 1 n Pn i=1 E[γ1,θ(Xi, Yi)Xi]f T i . We want to bound W U = sup θ Bcon(θ ) b U1(θ) U1(θ) F,s. By definition, we have W U = sup vec(u) L(s) Spq 1 sup θ Bcon(θ ) 1 i=1 γ1,θ(Xi, Yi)Xif T i 1 i=1 E[γ1,θ(Xi, Yi)Xi]f T i , u F = sup vec(u) L(s) Spq 1 sup θ Bcon(θ ) 1 i=1 (γ1,θ(Xi, Yi)Xi E[γ1,θ(Xi, Yi)Xi]) f T i , u F . W U u = sup θ Bcon(θ ) 1 i=1 (γ1,θ(Xi, Yi)Xi E[γ1,θ(Xi, Yi)Xi]) f T i , u F . Then W U = supvec(u) L(s) Spq 1 W U u . We use an ε-net argument. The first step is approximation. Let vec(u1), . . . , vec(u Mnet) be a 1/2-net of L(s) Spq 1. This means that for any v L(s) Spq 1, there is some index j [Mnet] such that v uj F 1/2. We have W U v W U uj + |W U uj W U v | max j [Mnet] W U uj + W U uj v F max j [Mnet] W U uj + 1 Then W U = supv W U v maxj [Mnet] W U uj +1/2W U, which implies W U 2 maxj [Mnet] W U uj. Then in the second step, we bound the tail of W U uj for fixed j. And the third step is union bound, where use the covering number of L(s) Spq 1. Let {ϵi}n i=1 be i.i.d. Rademacher variables. Recall that γ1,θ(Xi, Yi) = π1 π1 + (1 π1) exp{(Xi 1/2(Γ2 + Γ1)fi)T (Γ2 Γ1)fi | {z } Cθ,Y (Xi) = π1 π1 + (1 π1) exp(Cθ,Y (Xi)). Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Then according to Lemma S.5 in Wang et al. 
(2024) and H older s inequality, we have E[exp(λW U uj)] = E[exp(λ sup θ Bcon(θ ) 1 i=1 (γ1,θ(Xi, Yi)Xi E[γ1,θ(Xi, Yi)Xi]) f T i , uj F )] n sup θ Bcon(θ ) i=1 (γ1,θ(Xi, Yi)Xi E[γ1,θ(Xi, Yi)Xi]) f T i , uj F )] sup θ Bcon(θ ) i=1 ϵi γ1,θ(Xi, Yi)Xif T i , uj F )] sup θ Bcon(θ ) ϵi( π1 π1 + (1 π1) exp(Cθ,Y (Xi)) π1) Xif T i , uj F + ϵiπ1 Xif T i , uj F sup θ Bcon(θ ) i=1 ϵi( π1 π1 + (1 π1) exp(Cθ,Y (Xi)) π1) Xif T i , uj F )]1/2 sup θ Bcon(θ ) i=1 ϵiπ1 Xif T i , uj F )]1/2 | {z } (II) To bound (I), we use lemma C.1 in Cai et al. (2019). The function ψ(x) = π1 π1+(1 π1) exp(x) π1 is Lipschitz with constant 1 π1 c0 and ψ(0) = 0. Since Yi, fi are fixed, Cθ,Y (Xi) is a function defined for random variable Xi. Let Γ = π 1Γ 1 + (1 π 1)Γ 2, and Zi = Xi Γ fi be the centered random variable. Note that Zi π 1N(Γ 1fi Γ fi, Ip) + (1 π 1)N(Γ 2fi Γ fi, Ip). Heterogeneous Sufficient Dimension Reduction and Subspace Clustering sup θ Bcon(θ ) i=1 ϵi Cθ,Y (Xi) Xif T i , uj F )] sup θ Bcon(θ ) i=1 ϵi[(Xi 1/2(Γ2 + Γ1)fi)T (Γ2 Γ1)fi] Xif T i , uj F )] sup θ Bcon(θ ) i=1 ϵi Xi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Xif T i , uj F )] sup θ Bcon(θ ) i=1 ϵi Zi + Γ fi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi (Zi + Γ fi)f T i , uj F )] sup θ Bcon(θ ) i=1 ϵi Zi, (Γ2 Γ1)fi Zif T i , uj F )]1/4 | {z } (iv) sup θ Bcon(θ ) i=1 ϵi Γ fi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Zif T i , uj F )]1/4 sup θ Bcon(θ ) i=1 ϵi Zi, (Γ2 Γ1)fi Γ fif T i , uj F )]1/4 | {z } (ii) sup θ Bcon(θ ) i=1 ϵi Γ fi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Γ fif T i , uj F )]1/4 | {z } (iii) Note that Γ2 Γ1 F = Γ2 Γ 2 (Γ1 Γ 1) + Γ 2 Γ 1 F Γ2 Γ 2 F + Γ1 Γ 1 F + Γ 2 F + Γ 1 F 2CbΩ+ 2Mb. (35) We can bound Γ2 +Γ1 F with same quantity. Therefore, (Γ 1)T (Γ1 Γ2) 2 Mb(2CbΩ+2Mb) = 2Cb MbΩ+2M 2 b . We like to bound (Γ 1 Γ2 + Γ1 2 )T (Γ2 Γ1) 2 Γ 1 Γ2 + Γ1 2 F Γ2 Γ1 2 ( Γ 1 F + Γ2 + Γ1 2 F )(2CbΩ+ 2Mb) 2(CbΩ+ 2Mb)(CbΩ+ Mb) = 2C2 b Ω2 + 6Cb MbΩ+ 4M 2 b . We know that A 2 = maxx,y x T Ay/( x 2 y 2) For (i), we have sup θ Bcon(θ ) Γ fi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi sup θ Bcon(θ ) π 1 (Γ 1 1/2(Γ2 + Γ1))fi, (Γ2 Γ1)fi + sup θ Bcon(θ ) (1 π 1) (Γ 2 1/2(Γ2 + Γ1))fi, (Γ2 Γ1)fi sup θ Bcon(θ ) fi 2 2(2C2 b Ω2 + 6Cb MbΩ+ 4M 2 b ) . Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Let e C11 = (2C2 b Ω2 + 6Cb MbΩ+ 4M 2 b ) maxi fi 2 2. Therefore, = E[exp(32λ sup θ Bcon(θ ) i=1 ϵi Γ fi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Zif T i , uj F )] sup θ Bcon(θ ) Γ fi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi sup θ Bcon(θ ) i=1 ϵi Zif T i , uj F )] c0 e C11 sup θ Bcon(θ ) i=1 ϵi Zif T i , uj F )]. Let that e Zi|(Wi = w) = Zif T i , uj F |(Wi = w) = vec(uj)T (fi Ip)Zi|(Wi = w). Since Zi|(Wi = w) N(Γ wfi Γ fi, Ip), var(e Zi|(Wi = w)) = vec(uj)T (fi Ip) 2 2 M 2 4 . Therefore ϵie Zi|(Wi = w) ψ2 = e Zi|(Wi = w) ψ2 CM4. Since ϵi is independent of Zi, E[ϵie Zi|(Wi = w)] = 0. Then, by equation (5.12) in Vershynin (2010), the moment generating function of ϵi e Zi|(Wi = w) is E[exp(tϵie Zi|(Wi = w))] exp(CM 2 4 t2). Then, we have i=1 ϵi Zif T i , uj F )] i=1 ϵi Zif T i , uj F , exp(32λ i=1 ϵi Zif T i , uj F i=1 ϵie Zi), exp( 32λ i=1 ϵie Zi) i=1 ϵie Zi)] + E[exp( 32λ i=1 ϵie Zi)] n 322(1 c0)2 c2 0 e C2 11CM 2 4 ) = 2 exp(λ2 where C11 = 322/4CM 2 4 (1 c0)2 c2 0 e C2 11. Then (i) E[exp(32λ c0 e C11 sup θ Bcon(θ ) i=1 ϵi Zif T i , uj F )]1/4 21/4 exp(λ2 n C11). (36) When θ Bcon(θ ), vec(Γw Γ w) L(s), which implies vec(Γw) = vec(Γ w) + vec(uw), vec(uw) L(s). Next, we bound the second term (ii). Note that | Γ fif T i , uj F | = | vec(uj)T vec(Γ fifi)| vec(uj) 2 vec(Γ fif T i ) 2 = (fif T i Ip) vec(Γ ) 2 fif T i 2 vec(Γ ) 2 Mb fi 2 2 Mb max i fi 2 2. 
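For intuition, the decomposition X_i =_d Z + ψf_i that drives all of these concentration arguments can be sampled directly. A minimal sketch (hypothetical helper; identity noise covariance, as in the proofs):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mix_pfc(F, pi1, Gamma1, Gamma2):
    """Draw X_i ~ pi1 * N(Gamma1 f_i, I_p) + (1 - pi1) * N(Gamma2 f_i, I_p),
    i.e. X_i =_d Z + psi f_i with Z ~ N(0, I_p) and P(psi = Gamma1) = pi1."""
    n = F.shape[0]                                    # F holds the fitted features f_i, shape (n, q)
    p = Gamma1.shape[0]
    W = rng.random(n) < pi1                           # latent cluster labels
    Psi = np.where(W[:, None, None], Gamma1, Gamma2)  # (n, p, q) mixture of loadings
    X = np.einsum('ipq,iq->ip', Psi, F) + rng.standard_normal((n, p))
    return X, W
```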
Heterogeneous Sufficient Dimension Reduction and Subspace Clustering (ii)4 = E[exp(32λ sup θ Bcon(θ ) i=1 ϵi Zi, (Γ2 Γ1)fi Γ fif T i , uj F )] c0 Mb max i fi 2 2 sup θ Bcon(θ ) i=1 ϵi Zi, (Γ2 Γ1)fi )] c0 Mb max i fi 2 2 sup θ Bcon(θ ) i=1 ϵi Zi, (Γ 2 + u2)fi )]1/2 c0 Mb max i fi 2 2 sup θ Bcon(θ ) i=1 ϵi Zi, (Γ 1 + u1)fi )]1/2 c0 Mb max i fi 2 2 i=1 ϵi Zi, Γ 2fi )]1/4 | {z } (ii.1) c0 Mb max i fi 2 2 sup vec(u2) L(s) i=1 ϵi Zi, u2fi )]1/4 | {z } (ii.2) c0 Mb max i fi 2 2 i=1 ϵi Zi, Γ 1fi )]1/4 | {z } (ii.3) c0 Mb max i fi 2 2 sup vec(u1) L(s) i=1 ϵi Zi, u1fi )]1/4 | {z } (ii.4) We first focus on the term (ii.4) c0 Mb max i fi 2 2 sup vec(u1) L(s) i=1 ϵi Zi, u1fi )]. Again, we use the ϵ-net argument. Let vec(eu1), . . . , vec(eu Mnet) be a 1/2-net of L(s) Spq 1. sup vec(u1) L(s) i=1 ϵi Zi, u1fi = sup vec(u1) L(s) u1 F i=1 ϵi Zi, u1/ u1 2fi sup vec(u1) L(s) u1 F sup vec(eu) L(s) Spq 1 i=1 ϵi Zi, eufi 2 sup vec(u1) L(s) u1 F max j [Mnet] i=1 ϵi Zi, eujfi 2CbΩmax j [Mnet] i=1 ϵi Zi, eujfi , Heterogeneous Sufficient Dimension Reduction and Subspace Clustering where in the last equality we use vec(u1) = vec(Γ1) vec(Γ 1) and Γ1 Γ 1 F CbΩ. Then c0 Mb max i fi 2 2 sup vec(u1) L(s) i=1 ϵi Zi, u1fi )] c0 Mb max i fi 2 2CbΩ max j [Mnet] i=1 ϵi Zi, eujfi )] j [Mnet] E[exp(256λ c0 Mb max i fi 2 2CbΩ i=1 ϵi Zi, eujfi )] j [Mnet] E[exp(λ c0 Mb max i fi 2 2CbΩ | {z } e C12 i=1 ϵi e Zi )], where e Zi = Zi, eujfi . Since E[ϵi e Zi|(Wi = w)] = 0 and var[ e Zi|(Wi = w)] fi 2 2, E[exp(tϵie Zi|(Wi = w))] exp(CM 2 4 t2). Therefore, we have j [Mnet] E[exp(λ i=1 ϵi e Zi )] j [Mnet] E[max i=1 e Zi), exp(λ i=1 ϵi e Zi) j [Mnet] E[exp(λ i=1 ϵi e Zi)] + X j [Mnet] E[exp(λ i=1 ϵi e Zi)] n2 e C2 12CM 2 4 ) = 2Mnet exp( e C2 12CM 2 4 | {z } 4 C12 2 exp(4 C12 λ2 n + Cnetsq3 log( p Let e Zi = Zi, Γ 1fi . Then E[ϵi e Zi|(Wi = w)] = 0, var[ e Zi|(Wi = w)] M 2 b fi 2 2. With same the argument, (ii.3)4 = E[exp(128λ c0 Mb max i fi 2 2 i=1 ϵi Zi, Γ 1fi )] c0 Mb max i fi 2 2 i=1 ϵi e Zi)] + E[exp(128λ c0 Mb max i fi 2 2 i=1 ϵi e Zi)] n 1282(1 c0)2 c2 0 CM 4 b M 4 4 | {z } Then we have (ii.3) (ii.4) 2 exp(( C12 + b C12)λ2 n + Cnet/4sq3 log( p With the same argument, we can derive a similar bound for (ii.1) (ii.2). Then, for some positive constant C12, (ii) 21/4 exp(C12 λ2 n + Cnet/8sq3 log( p sq2 )). (37) Recall that for a bounded random variable X [a, b] almost surely with E[X] = 0, E[exp(t X)] exp(t2(b a)2/8) for any t R. From earlier derivation, we have | Γ fif T i , uj F | Mb maxi fi 2 2. Thus, E[ϵi Γ fif T i , uj F ] = 0 and Heterogeneous Sufficient Dimension Reduction and Subspace Clustering ϵi Γ fif T i , uj F [ Mb maxi fi 2 2, Mb maxi fi 2 2], which implies E[exp(tϵi Γ fif T i , uj F )] exp(t2M 2 b (maxi fi 2 2)2 For (iii), we have = E[exp(32λ sup θ Bcon(θ ) i=1 ϵi Γ fi 1/2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Γ fif T i , uj F )] c0 e C11 sup θ Bcon(θ ) i=1 ϵi Γ fif T i , uj F )] i=1 ϵi Γ fif T i , uj F )] + E[exp(32λ i=1 ϵi Γ fif T i , uj F )] n 512(1 c0)2 c2 0 e C2 11M 2 b M 4 4 | {z } 4C13 (iii) 21/4 exp(λ2 n C13). (38) Recall that vec(Γw Γ w) L(s). There exist uw Rp q such that vec(Γw) = vec(Γ w)+vec(uw) and vec(uw) L(s). 
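The repeated splitting of (I), (II), and (i)–(iv) above all rests on the same Cauchy–Schwarz/Hölder step for moment generating functions: for any random variables A and B,
$$\mathbb{E}\,e^{\lambda(A+B)} \le \big(\mathbb{E}\,e^{2\lambda A}\big)^{1/2}\big(\mathbb{E}\,e^{2\lambda B}\big)^{1/2},$$
applied recursively, which is why the exponents accumulate powers of 2 (e.g., 32λ, 128λ, 256λ) as the decomposition deepens.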
Then we proceed to bound (iv)4 = E[exp(32λ sup θ Bcon(θ ) i=1 ϵi Zi, (Γ2 Γ1)fi Zif T i , uj F )] sup θ Bcon(θ ) i=1 ϵi Zi, Γ2fi Zif T i , uj F )]1/2 sup θ Bcon(θ ) i=1 ϵi Zi, Γ1fi Zif T i , uj F )]1/2 sup θ Bcon(θ ) i=1 ϵi Zi, Γ 2fi Zif T i , uj F )]1/4 | {z } (iv.1) sup θ Bcon(θ ) i=1 ϵi Zi, u2fi Zif T i , uj F )]1/4 | {z } (iv.2) sup θ Bcon(θ ) i=1 ϵi Zi, Γ 1fi Zif T i , uj F )]1/4 | {z } (iv.3) sup θ Bcon(θ ) i=1 ϵi Zi, u1fi Zif T i , uj F )]1/4 | {z } (iv.4) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering For (iv.4), define f Wu := sup vec(eu) L(s) Spq 1 i=1 ϵi Zi, eufi Zif T i , uj F = sup vec(eu) L(s) Spq 1 vec(eu), ( i=1 ϵi vec(Zif T i ) vec(Zif T i )T ), vec(uj) , where we use Zi, eufi = f T i eu T Zi = tr(f T i eu T Zi) = eu, Zif T i F = vec(eu), vec(Zif T i ) . f Weu,u = vec(eu), ( i=1 ϵi vec(Zif T i ) vec(Zif T i )T ), vec(uj) , and {eu1, . . . , eu Mnet} be a 1/2-net of of L(s) Spq 1. For any v L(s) Spq 1, let euk be one of the closest point in the 1/2-cover. Then by definition, v euk v euk L(s) Spq 1, v euk 2 1 f Wv,u f Weuk,u + |f Weuk,u f Wv,u| max k [Mnet] f Weuk,u + | vec(euk v)/ euk v 2, ( i=1 ϵi vec(Zif T i ) vec(Zif T i )T ), vec(uj) euk v 2 max k [Mnet] f Weuk,u + f Wu 1 2, which implies f Wu 2 maxk [Mnet] f Weuk,u. Then sup θ Bcon(θ ) i=1 ϵi Zi, u1fi Zif T i , uj F = sup vec(u1) L(s) u1 F i=1 ϵi Zi, u1/ u1 F fi Zif T i , uj F CbΩ f Wu 2CbΩ max k [Mnet] f Weuk,u, where we use supθ Bcon(θ ) u1 F = vec(Γ1 Γ 1) 2 CbΩ. Then (iv.4)4 E[exp(128λ 2CbΩ max k [Mnet] f Weuk,u )] k [Mnet] E[exp(128λ i=1 ϵi Zi, eukfi Zif T i , uj F )]+ k [Mnet] E[exp(128λ i=1 ϵi Zi, eukfi Zif T i , uj F )]. To bound ϵi Zi, eukfi Zif T i , uj F for fixed euk and uj, we use lemma D.2 in Wang et al. (2015) that states the product of two sub-Gaussian random variables is a sub-exponential random variable. Let ψ1 and ψ2 denote the sub-exponential and sub-Gaussian norm. Note that Zi, eukfi |(Wi = w) and Zif T i , uj F |(Wi = w) are normal distributions with zero mean and variance less than or equal to fi 2 2. For X N(0, σ2), X ψ2 Cψ2σ. For some Cψ > 0, we have ϵi Zi, eukfi Zif T i , uj F |(Wi = w) ψ1 = Zi, eukfi Zif T i , uj F |(Wi = w) ψ1 Cψ1 max{ Zi, eukfi |(Wi = w) 2 ψ2, Zif T i , uj F |(Wi = w) 2 ψ2} Cψ1(C2 ψ2 max i fi 2 2 + C 2 ψ2 max i fi 2 2) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Note that E[ϵi Zi, eukfi Zif T i , uj F |(Wi = w)] = 0 since ϵi is Rademacher random variable independent of Zi. According to Lemma 5.15 in Vershynin (2010), we obtain c0 2CbΩ ϵi Zi, eukfi Zif T i , uj F )] exp(C 1282λ2 c2 0 4C2 b Ω2M 4 4 ), Cψ maxi fi 2 2 . (iv.4)4 2 X k [Mnet] exp(λ2 n 4C 1282(1 c0)2 c2 0 C2 b Ω2M 4 4 | {z } n 4 e C14)Mnet 2 exp(λ2 n 4 e C14 + Cnetsq3 log( p For term (iv.3), using Lemma D.2 (Wang et al., 2015) and Lemma 5.15 (Vershynin, 2010) again, for some positive constant C, we have c0 ϵi Zi, Γ 1fi Zif T i , uj F )] exp(λ2 c0 | c C (1 + Mb) maxi fi 2 2 . (iv.3)4 2 exp(λ2 We can bound (iv.3) (iv.4) 2 exp(( e C14 + C14)λ2 n + Cnet/4sq3 p The analysis of (iv.1) (iv.2) is similar to (iv.3) (iv.4) and we have (iv) 21/4 exp(C14 λ2 n + Cnet/8sq3 p sq2 ). (39) Combing (36), (37), (38), and (39), the bound for (I) is (I) [(i) (ii) (iii) (iv)]1/2 2 exp(CI λ2 n + Cnet/8sq3 p sq2 ), (40) where CI = (C11 + C12 + C13 + C14)/2. It remains to bound (II). 
Using the same argument, for some positive constant CII, we have (II)2 = E[exp(2λ sup θ Bcon(θ ) i=1 ϵiπ1 Xif T i , uj F )] sup θ Bcon(θ ) i=1 ϵiπ1 Zif T i , uj F )]1/2 sup θ Bcon(θ ) i=1 ϵiπ1 Γ fif T i , uj F )]1/2 which implies n CII). (41) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Combining the results (40) and (41), we obtain E[exp(λW U uj)] (I) (II) 2 exp(λ2 n CU + Cnet/8sq3 log( p where CU = CI + CII. Thus, E[exp(λW U) E[exp(2λ max j [Mnet] W U uj)]] X j [Mnet] E[exp(2λW U uj)] 2Mnet exp(4λ2 n CU + Cnet/8sq3 log( p sq2 )) 2 exp(4λ2 n CU + 2Cnetsq3 log( p n CU + 2Cnetsq3 log(p)). Using the Chernoff bounds, we have P( sup θ Bcon(θ ) b U1(θ) U1(θ) F,s > t) = P(W U > t) exp( λt) E[exp(λW U)] 2 exp(4λ2 n CU + 2Cnetsq3 log(p) λt). nsq3 log(p)/CU, t = (2Cnet + 5) p CUsq3 log(p)/n. Then P(W U > t) 2 exp( sq3 log(p)) 2 exp( log(p)) = o(1). Recall that we require Therefore as long as n C sq3 log(p) for a sufficiently large C , we have with probability at least 1 o(1), sup θ Bcon(θ ) b U1(θ) U1(θ) F,s Note that we can get a sharper bound when sq2 = o(p), sup θ Bcon(θ ) b U1(θ) U1(θ) F,s E.3. Concentration of the weights πw We proceed to bound supθ Bcon(θ ) |bπ1(θ) π1(θ)|. Recall that i=1 γ1,θ(Xi, Yi), π1(θ) = 1 n E[γ1,θ(Xi, Yi)], γ1,θ(Xi, Yi) = π1 π1 + (1 π1) exp{(Xi 1/2(Γ2 + Γ1)fi)T (Γ2 Γ1)fi | {z } Cθ,Y (Xi) Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Let W π = supθ Bcon(θ ) |bπ1(θ) π1(θ)|. We have E[exp(λW π)] n sup θ Bcon(θ ) π1 π1 + (1 π1) exp(Cθ,Y (Xi)) E[γ1,θ(Xi, Yi)] )] n sup θ Bcon(θ ) π1 π1 + (1 π1) exp(Cθ,Y (Xi)) E[γ1,θ(Xi, Yi)] )] n sup θ Bcon(θ ) i=1 π1 π1 + (1 π1) exp(Cθ,Y (Xi)) E[γ1,θ(Xi, Yi)] )] | {z } (II) Apply Lemma S.5 in Wang et al. (2024) to (I), (I) E[exp(2λ sup θ Bcon(θ ) i=1 ϵi π1 π1 + (1 π1) exp(Cθ,Y (Xi)) sup θ Bcon(θ ) π1 π1 + (1 π1) exp(Cθ,Y (Xi)) π1 sup θ Bcon(θ ) i=1 ϵiπ1 )]1/2 | {z } (ii) Note that ψ(x) = π1 π1+(1 π1)ex π1 is Lipschitz with constant 1 w c0 and ψ(0) = 0. By Lemma C.1 (Cai et al., 2019) with g( ) = 1, we have (i)2 E[exp(8λ sup θ Bcon(θ ) i=1 ϵi(Xi 1/2(Γ2 + Γ1)fi)T (Γ2 Γ1)fi )] sup θ Bcon(θ ) i=1 ϵi Zi, (Γ2 Γ1)fi )]1/2 | {z } (i.1) sup θ Bcon(θ ) i=1 ϵi Γ fi 1 2(Γ2 + Γ1)fi, (Γ2 Γ1)fi )]1/2 | {z } (i.2) Terms (i.1) and (i.2) can be bounded similarly to (37) and (38). We have 2 exp(2C11 λ2 n + Cnet/2sq3 log( p 2 exp(2C12 λ2 Term (ii) can be bounded easily using properties of sub-Gaussian random variables, 2 exp(C13 λ2 (I) (i) (ii) 2 exp(CI λ2 n + Cnet/4sq3 log( p Heterogeneous Sufficient Dimension Reduction and Subspace Clustering where CI = C11 + C12 + C13. Term (II) can be bounded similarly. Combining the results, we have E[exp(λW π)] 4 exp(Cπ λ2 n + Cnetsq3 log( p Using the same Chernoff approach as before, we have P( sup θ Bcon(θ ) |bπ1(θ) π1(θ)| > t) = P(W π > t) 4 exp(Cπ λ2 n + Cnetsq3 log(p) λt). nsq3 log(p)/Cπ, t = (Cnet + 2) p Cπsq3 log(p)/n. Then P(W U > t) 4 exp( sq3 log(p)) 4 exp( log(p)) = o(1). Therefore, as long as n > C sq3 log(p) for a sufficiently large C , we have with probability at least 1 o(1), sup θ Bcon(θ ) |bπ1(θ) π1(θ)| Still when sq2 = o(p), we have the following sharper bound sup θ Bcon(θ ) |bπ1(θ) π1(θ)| E.4. Concentration of covariance matrices Last, we study the concentration of Σw. Recall that i=1 γ1,θ(Xi, Yi)Xi XT i , Σw(θ) = 1 i=1 E[γ1,θ(Xi, Yi)Xi XT i ]. Directly applying Lemma C.1 (Cai et al., 2019) converts the problem into bounding the product of three sub-Gaussian random variables. 
While it is well known the product of two sub-Gaussian variables is sub-exponential, the product of three or more sub-Gaussian random variables is not necessarily sub-exponential. Therefore, we must use another method than the one used in the concentration of Uw. We use tail bound for unbounded random processes given in Theorem 4 Adamczak (2008). Let W Σ = supθ Bcon(θ ) (bΣ1(θ) Σ1(θ))B 1 F,s. By definition, we have W Σ = sup vec(u) L(s) Spq 1 sup θ Bcon(θ ) 1 i=1 {γ1,θ(Xi, Yi)Xi XT i E[γ1,θ(Xi, Yi)Xi XT i ]}B 1, u F . W Σ u = sup θ Bcon(θ ) 1 i=1 {γ1,θ(Xi, Yi)Xi XT i E[γ1,θ(Xi, Yi)Xi XT i ]}B 1, u F . Then W Σ = supvec(u) L(s) Spq 1 W Σ u . Let vec(u1), . . . , vec(u Mnet) denotes the 1/2-net of L(s) Spq 1. For any v L(s) Spq 1, let uj be one of the closest point in the 1/2-cover. Then by definition, v uj v uj L(s) Spq 1, v uj 2 1 W Σ v W Σ uj + |W Σ uj W Σ v | max j [Mnet] W Σ uj + 1 Heterogeneous Sufficient Dimension Reduction and Subspace Clustering which implies W Σ 2 maxj [Mnet] W Σ uj. We proceed to bound W Σ uj for fixed uj L(s) Spq 1. To use Theorem 4 (Adamczak, 2008), define f(Xi, Yi) = γ1,θ(Xi, Yi) Xi XT i B 1, uj F E[γ1,θ(Xi, Yi) Xi XT i B 1, uj F ]. n sup θ Bcon(θ ) | i=1 f(Xi, Yi)|. It is clear that for every θ Bcon(θ ) and every i, E[f(Xi, Yi)] = 0. Next we show Xi XT i B 1, uj F is sub-exponential. Note that Xi XT i B 1, uj F = vec(uj)T vec(Xi XT i B 1) = vec(uj)T (Iq Xi XT i ) vec(B 1) k=1 u T j,k Xi XT i B 1,k = k=1 Xi, uj,k Xi, B 1,k , where uj,k and B 1,k are the k-th column of uj and B 1. Clearly, Xi, uj,k and Xi, B 1,k are normal. Then Xi, uj,k ψ2 Zi, uj,k ψ2 + ψfi, uj,k ψ2 C uj,k 2 + C , and similarly Xi, B 1,k ψ2 C B 1,k 2 + C . Therefore, by Lemma D.2 (Wang et al., 2015) Xi XT i B 1, uj F ψ1 k=1 Xi, uj,k Xi, B 1,k ψ1 k=1 C max{ Xi, uj,k 2 ψ2, Xi, B 1,k 2 ψ2} C < . The above results hold for any θ Bcon(θ ). Then sup θ Bcon(θ ) |f(Xi, Yi)| ψ1 = sup θ Bcon(θ ) γ1,θ(Xi, Yi)Xi XT i B 1, uj F + sup θ Bcon(θ ) E[ γ1,θ(Xi, Yi)Xi XT i , uj F ] ψ1 sup θ Bcon(θ ) Xi XT i B 1, uj F ψ1 + sup θ Bcon(θ ) E[ Xi XT i , uj F ] ψ1 where we use the fact 0 < γ1,θ(Xi, Yi) < 1. We verified the two conditions for Theorem 4 (Adamczak, 2008). Define truncated function and the remaining parts of f(Xi, Yi) as f1(Xi, Yi) = f(Xi, Yi)I( sup θ Bcon(θ ) |f(Xi, Yi)| ρ), f2(Xi, Yi) = f(Xi, Yi)I( sup θ Bcon(θ ) |f(Xi, Yi)| > ρ), where ρ = 8 E[maxi supθ Bcon(θ ) |f(Xi, Yi)|]. Let Q = maxi | Xi XT i B 1, uj F |. Since Xi XT i B 1, uj F is subexponential, P(| Xi XT i B 1, uj F | > x log n) 2 exp( cx log n). Then P(Q > x log n) i=1 P(| Xi XT i B 1, uj F | > x log n) 2 exp( cx log n + log n). Heterogeneous Sufficient Dimension Reduction and Subspace Clustering Therefore, when log n > 1 we have 0 P(Q > t)dt = Z 2 log n 0 P(Q > t) | {z } 1 c P(Q > t)dt 2 c P(Q > t log n)d(t log n) c + 2 log n Z 2 c exp( (ct 1) log n)dt c + 2 log n Z 2 c exp( ct + 1)dt c + 2 log nexp( 1) Since the above holds for any θ Bcon(θ ), we have ρ C log n. sup θ Bcon(θ ) | i=1 f(Xi, Yi)| sup θ Bcon(θ ) | i=1 f1(Xi, Yi) E[f1(Xi, Yi)]|+ sup θ Bcon(θ ) | i=1 f2(Xi, Yi) E[f2(Xi, Yi)]|, where we use the fact that E[f1] + E[f2] = 0. It follows that E[ sup θ Bcon(θ ) | i=1 f(Xi, Yi)|] E[ sup θ Bcon(θ ) | i=1 f1(Xi, Yi) E[f1(Xi, Yi)]|]+ 2 E[ sup θ Bcon(θ ) | i=1 f2(Xi, Yi)|]. 
By Markov inequality and definition of f2(Xi, Yi), we have P(max k n sup θ Bcon(θ ) | i=1 f2(Xi, Yi)| > 0) P(max i sup θ Bcon(θ ) |f(Xi, Yi)| > ρ) E[maxi supθ Bcon(θ ) |f(Xi, Yi)|] which means t0 = inf{t > 0; P(max k n sup θ Bcon(θ ) | i=1 f2(Xi, Yi)| > t) 1 Then, by Proposition 6.8 in Ledoux & Talagrand (1991), E[max k N sup θ Bcon(θ ) | i=1 f2(Xi, Yi)|] 1 8 E[max i sup θ Bcon(θ ) |f2(Xi, Yi)|] ρ C log n. E[ sup θ Bcon(θ ) | 1 i=1 f2(Xi, Yi)|] E[max k N sup θ Bcon(θ ) | 1 i=1 f2(Xi, Yi)|] C2 log n Heterogeneous Sufficient Dimension Reduction and Subspace Clustering When supθ Bcon(θ ) |f(Xi, Yi)| ρ, Xi XT i B 1, uj F is bounded. We proceed to bound f1, by Lemma S.5 (Wang et al., 2024) and Lemma C.1 (Cai et al., 2019), E[ sup θ Bcon(θ ) i=1 f1(Xi, Yi) E[f1(Xi, Yi)] ] C E[ sup θ Bcon(θ ) i=1 ϵi Xi 1 2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Xi XT i B 1, uj F ] C E[ sup θ Bcon(θ ) i=1 ϵiπ1 Xi XT i B 1, uj F ] | {z } (II) where Xi XT i B 1, uj F is bounded for all i. Under the condition Xi XT i B 1, uj F is bounded, ϵi Xi 1 2(Γ2+Γ1)fi, (Γ2 Γ1)fi Xi XT i B 1, uj F is sub-exponential for any Γ1 and Γ2. Define set T = {t : t T = (vec(Γ1)T , vec(Γ2)T ), Γ1, Γ2 Bcon(θ )}. Using the argument to derive L(s), we have T C conv(S |J| d EJ(2pq) S2pq 1), where d = Cdsq3. By the definition of constriction basin, we have vec(Γ1) vec(Γ 1) 2 vec(Γ1) vec(Γ 1) 2 + vec(Γ 1) vec(Γ 1) 2 2CbΩ. Therefore, ([vec(Γ1) vec(Γ 1)]T , [vec(Γ2) vec(Γ 2)]T ) 2 vec(Γ1) vec(Γ 1) 2 + vec(Γ2) vec(Γ 2) 2 vec(Γ1) vec(Γ 1) 2 + vec(Γ2) vec(Γ 2) 2 4CbΩ. The diameter of T D = diam(T ) = sup s,t T d(s, t) 4CbΩ, where d is the ℓ2 distance. Note that i=1 E[ ϵi Xi 1 2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Xi XT i B 1, uj F q ] i=1 Cqq = Cqq Cq!eq := q! where we use the bound of moments of sub-exponential random variables and Stirling s approximation. By Corollary 5.2 in Dirksen (2015), 1 nγ2(T , d2) + 1 nγ1(T , d1) + C4 where γ1(T , d1) and γ2(T , d2) are Talagrand functional (see Dirksen (2015) for details). Given t T , let Kti = ϵi Xi 1 2(Γ2 + Γ1)fi, (Γ2 Γ1)fi Xi XT i B 1, uj F . Dirksen (2015) define d1(s, t) = max i Kti Ksi ψ1, d2(s, t) = i=1 Kti Ksi 2 ψ2 Consider the two metric spaces (T , ℓ2) and (T , d1). We have d1(s, t) Cρ s t 2. Then by Theorem 1.3.6 (Talagrand, 2005), γ1(T , d1) Cργ1(T , ℓ2). Similar result hold for γ2(T , d2). Therefore, (I) C3ρ 1 nγ2(T , ℓ2) + 1 nγ1(T , ℓ2) + C4 Heterogeneous Sufficient Dimension Reduction and Subspace Clustering By equation (4) (Dirksen, 2015), γα(T , ℓ2) Cα 0 (log N(T , ℓ2, ϵ))1/αdϵ = Cα 0 (log N(T , ℓ2, ϵ))1/αdϵ, where we use log(1) = 0. We know that N(T , ℓ2, ϵ) 1 + 2 Then we have γ1(T , ℓ2) C Z D 0 log N(T , ℓ2, ϵ)dϵ 0 sq3 log 1 + 2 + sq3 log 2ep Csq3 log p, when p > C sq2 for some constant. Similarly, γ2(T , ℓ2) C Z D 0 (log N(T , ℓ2, ϵ))1/2dϵ sq3 log 1 + 2 + sq3 log 2ep Then according to (45), (I) C1 log n( n + sq3 log p sq3(log n)2 log p when n > sq3 log(p). Combining with (44), we have E[ sup θ Bcon(θ ) | 1 i=1 f(Xi, Yi)|] C1 sq3(log n)2 log p n + C2 log n sq3(log n)2 log p where we use log(n)/n < log(n)/ n p sq3 log p log(n)/ n. To use Theorem 4 (Adamczak, 2008), we need to bound σ2 := sup θ Bcon(θ ) nf(Xi, Yi)}2] = 1 n2 sup θ Bcon(θ ) i=1 E[{f(Xi, Yi)}2]. Note that E[{f(Xi, Yi)}2] = E[{γ1,θ(Xi, Yi) Xi XT i B 1, uj F E[γ1,θ(Xi, Yi) Xi XT i B 1, uj F ]}2] = E[(γ1,θ(Xi, Yi) Xi XT i B 1, uj F )2] E[γ1,θ(Xi, Yi) Xi XT i B 1, uj F ]2 E[( Xi XT i B 1, uj F )2] + E[ Xi XT i B 1, uj F ]2 C, since Xi XT i B 1, uj F is sub-exponential. Then σ2 C/n. 
The last term to bound before applying the theorem is $\|\max_i \sup_{\theta\in B_{\mathrm{con}}(\theta^*)} |f(X_i, Y_i)|\|_{\psi_1}$. Using earlier results,
\[
\Big\|\max_i \sup_{\theta\in B_{\mathrm{con}}(\theta^*)} \Big|\frac{1}{n}f(X_i, Y_i)\Big|\Big\|_{\psi_1} \le \frac{C\log n}{n}\Big\|\sup_{\theta\in B_{\mathrm{con}}(\theta^*)} |f(X_i, Y_i)|\Big\|_{\psi_1} \le \frac{C\log n}{n}\Big\|\sup_{\theta\in B_{\mathrm{con}}(\theta^*)} |\langle X_iX_i^T B_1^*, u_j\rangle_F|\Big\|_{\psi_1} \le \frac{C\log n}{n}.
\]
Let $W_{u_j}^f = \sup_{\theta\in B_{\mathrm{con}}(\theta^*)} |\frac{1}{n}\sum_{i=1}^n f(X_i, Y_i)|$. Then, for $0 < \eta < 1$ and $\delta > 0$, Theorem 4 of Adamczak (2008) yields
\[
\mathrm{P}\big(W_{u_j}^f \ge (1+\eta)\mathrm{E}[W_{u_j}^f] + t\big) \le \exp(-C_5nt^2) + 3\exp\Big(-\frac{C_6nt}{\log n}\Big).
\]
Let $t = C_7\sqrt{sq^3(\log n)^2\log p/n}$, where $C_7$ is sufficiently large. Using the union bound, we have
\[
\mathrm{P}\Big(\max_{j\le M_{\mathrm{net}}} W_{u_j}^f \ge (1+\eta)\mathrm{E}[W_{u_j}^f] + t\Big) \le M_{\mathrm{net}}\exp(-C_5nt^2) + 3M_{\mathrm{net}}\exp\Big(-\frac{C_6nt}{\log n}\Big) \le \exp(C_{\mathrm{net}}sq^3\log p - C_5nt^2) + 3\exp\Big(C_{\mathrm{net}}sq^3\log p - \frac{C_6nt}{\log n}\Big) \le \exp(-\log p) + 3\exp(-\log p) = \frac{4}{p},
\]
when $n > C_8sq^3\log p$ for sufficiently large $C_8$. This means that, with probability at least $1 - o(1)$,
\[
\max_{j\le M_{\mathrm{net}}} W_{u_j}^f \le (1+\eta)\mathrm{E}[W_{u_j}^f] + C_7\sqrt{\frac{sq^3(\log n)^2\log p}{n}} \le C\sqrt{\frac{sq^3(\log n)^2\log p}{n}}.
\]
Recall that $W_\Sigma^{u_j} \le W_{u_j}^f$. Then, with probability at least $1 - o(1)$,
\[
\sup_{\theta\in B_{\mathrm{con}}(\theta^*)} \|(\widehat\Sigma_1(\theta) - \Sigma_1(\theta))B_1^*\|_{F,s} = W_\Sigma \le C\sqrt{\frac{sq^3(\log n)^2\log p}{n}}.
\]

F. Ancillary Lemmas

We first present some technical lemmas that are used in the proofs.

Lemma F.1. Let $f_1(t) = \frac{e^t}{\{w + (1-w)e^t\}^2}$, $f_2(t) = \frac{te^t}{\{w + (1-w)e^t\}^2}$, $f_3(t) = \frac{(t^2 - b^2)e^t}{\{w + (1-w)e^t\}^2}$, and $f_4(t) = \frac{t^3e^t}{\{w + (1-w)e^t\}^2}$. Then
\[
f_1(t) \le \frac{1}{4\min\{w^2, (1-w)^2\}},\ t\in\mathbb{R}, \qquad \sup_{t\ge a} f_1(t) \le \frac{1}{\min\{w, 1-w\}^2}\exp(-a),\ a\ge 0,
\]
\[
|f_2(t)| \le \frac{1}{2\min\{w^2, (1-w)^2\}},\ t\in\mathbb{R}, \qquad \sup_{|t|\ge a} |f_2(t)| \le \frac{2}{\min\{w, 1-w\}^2}\exp(-3a/4),\ a\ge 0,
\]
\[
|f_3(t)| \le \frac{4 + b^2}{\min\{w^2, (1-w)^2\}},\ t\in\mathbb{R}, \qquad \sup_{|t|\ge a} |f_3(t)| \le \frac{4 + b^2}{\min\{w, 1-w\}^2}\exp(-a/2),\ a\ge 0,
\]
\[
|f_4(t)| \le \frac{2}{\min\{w^2, (1-w)^2\}},\ t\in\mathbb{R}, \qquad \sup_{|t|\ge a} |f_4(t)| \le \frac{8}{\min\{w, 1-w\}^2}\exp(-a/4),\ a\ge 0.
\]

Proof. We use results (C.1)-(C.6) in the supplement of Cai et al. (2019). Since
\[
f_1(t) = \frac{1}{\{we^{-t/2} + (1-w)e^{t/2}\}^2} := \tilde f_1(t/2),
\]
by (C.1) and (C.2), $f_1(t) \le \frac{1}{4\min\{w^2, (1-w)^2\}}$ and
\[
\sup_{t\ge a} f_1(t) = \sup_{t/2\ge a/2} \tilde f_1(t/2) \le \frac{1}{\min\{w, 1-w\}^2}\exp(-2\cdot a/2) = \frac{1}{\min\{w, 1-w\}^2}\exp(-a).
\]
Since
\[
f_2(t) = \frac{2(t/2)}{\{we^{-t/2} + (1-w)e^{t/2}\}^2} := 2\tilde f_2(t/2),
\]
using (C.3) and (C.4), we have
\[
|f_2(t)| \le 2|\tilde f_2(t/2)| \le \frac{2}{4\min\{w^2, (1-w)^2\}} = \frac{1}{2\min\{w^2, (1-w)^2\}},
\]
\[
\sup_{t\ge a} |f_2(t)| = 2\sup_{t/2\ge a/2} |\tilde f_2(t/2)| \le \frac{2}{\min\{w, 1-w\}^2}\exp\Big(-\frac{3}{2}\cdot\frac{a}{2}\Big) = \frac{2}{\min\{w, 1-w\}^2}\exp(-3a/4),
\]
and the same bound holds for $\sup_{t\le -a}|f_2(t)|$ by the same argument, which together give the bound on $\sup_{|t|\ge a}|f_2(t)|$. Note that
\[
f_3(t) = \frac{4\{(t/2)^2 - (b/2)^2\}}{\{we^{-t/2} + (1-w)e^{t/2}\}^2} := 4\tilde f_3(t/2).
\]
Then, by (C.5) and (C.6),
\[
|f_3(t)| = 4|\tilde f_3(t/2)| \le \frac{4\{1 + (b/2)^2\}}{\min\{w^2, (1-w)^2\}} = \frac{4 + b^2}{\min\{w^2, (1-w)^2\}}, \qquad \sup_{|t|\ge a} |f_3(t)| = 4\sup_{|t/2|\ge a/2} |\tilde f_3(t/2)| \le \frac{4 + b^2}{\min\{w, 1-w\}^2}\exp(-a/2).
\]
Define $\tilde f_4(t) = \frac{t^3}{\{we^{-t} + (1-w)e^t\}^2}$. Then $f_4(t) = 8\tilde f_4(t/2)$. Note that
\[
|\tilde f_4(t)| = \frac{|t|^3}{\{we^{-t} + (1-w)e^t\}^2} \le \frac{|t|^3}{\min\{w, 1-w\}^2(e^{-t} + e^t)^2} \le \frac{1}{4\min\{w, 1-w\}^2},
\]
which implies $|f_4(t)| \le \frac{2}{\min\{w, 1-w\}^2}$. Then note that $\frac{|t|^3}{\{we^{-t} + (1-w)e^t\}^2} \le \frac{1}{\min\{w, 1-w\}^2}\exp(-|t|/2)$ when $|t| \ge a/2 \ge 0$. Therefore,
\[
\sup_{|t|\ge a} |f_4(t)| = 8\sup_{|t/2|\ge a/2} |\tilde f_4(t/2)| \le \frac{8}{\min\{w, 1-w\}^2}\exp(-a/4).
\]
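The bounds in Lemma F.1 are elementary but easy to mis-transcribe; the following short numerical check (an illustration with an arbitrary grid, not part of the proof) verifies the two bounds on $f_1$ for several mixing weights $w$:

```python
import numpy as np

# Numerical sanity check of the two bounds on f1 in Lemma F.1.
t = np.linspace(-50, 50, 200001)
for w in (0.1, 0.3, 0.5, 0.7):
    f1 = np.exp(t) / (w + (1 - w) * np.exp(t)) ** 2
    m = min(w, 1 - w)
    # Global bound: f1(t) <= 1 / (4 min{w^2, (1-w)^2}) for all t.
    assert f1.max() <= 1 / (4 * m**2) + 1e-12
    # Tail bound: sup_{t >= a} f1(t) <= exp(-a) / min{w, 1-w}^2 for a >= 0.
    for a in (0.0, 1.0, 5.0, 10.0):
        assert f1[t >= a].max() <= np.exp(-a) / m**2 + 1e-12
print("Lemma F.1 bounds on f1 hold on the tested grid")
```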
Lemma F.2. Let $L_p(s) = L(s)_{1:p}$. The restricted eigenvalue condition
\[
\inf_{u\in L_p(s)\cap\mathbb{S}^{p-1}} u^T\Big(\frac{1}{n}\sum_{i=1}^n X_iX_i^T\Big)u > \tau_0
\]
holds with probability at least $1 - o(1)$ when $n > Cs\log p$ for a sufficiently large positive constant $C$.

Proof. Let $\mathcal{C}(s) = 2\,\mathrm{conv}(\bigcup_{|J|\le d} E_J(p) \cap \mathbb{S}^{p-1})$, $d = C_ds$. According to Lemma E.3, we have $L_p(s)\cap\mathbb{S}^{p-1} \subseteq \mathcal{C}(s)$. We show the stronger conclusion that
\[
\inf_{u\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}} \Big\|\Big(\frac{1}{n}\sum_{i=1}^n X_iX_i^T\Big)^{1/2}u\Big\|_2 > \tau_1,
\]
which implies the lemma with $\tau_0 = \tau_1^2$. Since $X_i \sim \pi_1^*N(\Gamma_1^*f_i, \Delta^*) + \pi_2^*N(\Gamma_2^*f_i, \Delta^*)$, we have
\[
\mathrm{E}[X_i] = \bar\Gamma^*f_i, \quad \bar\Gamma^* = \pi_1^*\Gamma_1^* + \pi_2^*\Gamma_2^*, \qquad \mathrm{E}[X_iX_i^T] = \Delta^* + \pi_1^*\Gamma_1^*f_if_i^T(\Gamma_1^*)^T + \pi_2^*\Gamma_2^*f_if_i^T(\Gamma_2^*)^T,
\]
\[
\mathrm{cov}(X_i) = \Delta^* + \pi_1^*\Gamma_1^*f_if_i^T(\Gamma_1^*)^T + \pi_2^*\Gamma_2^*f_if_i^T(\Gamma_2^*)^T - \bar\Gamma^*f_if_i^T(\bar\Gamma^*)^T. \tag{47}
\]
For any $v\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}$, $Z_i = v^TX_i$ is a mixture of Gaussians and thus sub-Gaussian. Then $Z_i^2 = v^TX_iX_i^Tv$ is sub-exponential. Thus
\[
\|v^TX_iX_i^Tv - \mathrm{E}[v^TX_iX_i^Tv]\|_{\psi_1} \le C_i\|v^TX_iX_i^Tv\|_{\psi_1} := L_i.
\]
Let $L = \max_i L_i$. By Bernstein's inequality (Theorem 2.8.1 in Vershynin (2018)), for $t > 0$,
\[
\mathrm{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n v^TX_iX_i^Tv - \frac{1}{n}\sum_{i=1}^n v^T\mathrm{E}[X_iX_i^T]v\Big| \ge t\Big) \le 2\exp\Big(-c\min\Big\{\frac{t^2}{\sum_{i=1}^n L_i^2/n^2}, \frac{t}{\max_i L_i/n}\Big\}\Big) \le 2\exp\Big(-c\min\Big\{\frac{nt^2}{L^2}, \frac{nt}{L}\Big\}\Big) \le 2\exp(-c_1nt^2),
\]
where the last inequality holds when $t \le L$ and $c_1 = c/L^2$. Let $\epsilon < 1$ and let $v_1, \dots, v_{|\mathcal{J}|}$ be an $\epsilon$-net of $\mathcal{C}(s)\cap\mathbb{S}^{p-1}$. Then according to Lemma E.3, $\log|\mathcal{J}| \le C_Js\log p$, where $C_J$ is a positive constant. By the union bound,
\[
\mathrm{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n v_j^TX_iX_i^Tv_j - \frac{1}{n}\sum_{i=1}^n v_j^T\mathrm{E}[X_iX_i^T]v_j\Big| \ge t \text{ for some } j\in\mathcal{J}\Big) \le 2|\mathcal{J}|\exp(-c_1nt^2) \le 2\exp(C_Js\log p - c_1nt^2).
\]
Define $\Psi = \frac{1}{n}\sum_{i=1}^n X_iX_i^T$. Then we have, for all $j\in\mathcal{J}$,
\[
v_j^T\mathrm{E}[\Psi]v_j - t \le \|\Psi^{1/2}v_j\|_2^2 \le v_j^T\mathrm{E}[\Psi]v_j + t,
\]
with probability at least $1 - 2\exp(C_Js\log p - c_1nt^2)$. Assume $C_1 \le \sigma_{\min}(\Delta^*) \le \sigma_{\max}(\Delta^*) \le C_2$. For any unit vector $v$, we have
\[
v^T\Gamma_1^*\widehat\Sigma_f(\Gamma_1^*)^Tv \le \|(\Gamma_1^*)^Tv\|_2^2M_2 \le \sigma_{\max}(\Gamma_1^*)^2M_2 \le M_b^2M_2.
\]
Since $\mathrm{E}[\Psi] = \Delta^* + \pi_1^*\Gamma_1^*\widehat\Sigma_f(\Gamma_1^*)^T + \pi_2^*\Gamma_2^*\widehat\Sigma_f(\Gamma_2^*)^T$,
\[
C_1 \le v^T\mathrm{E}[\Psi]v \le C_2 + M_2M_b^2.
\]
Therefore,
\[
\sqrt{C_1 - t} \le \|\Psi^{1/2}v_j\|_2 \le \sqrt{C_2 + M_2M_b^2 + t}.
\]
Then, for any $v\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}$, there exists a $v_j$ such that $\|v - v_j\|_2 \le \epsilon$, $(v - v_j)/\|v - v_j\|_2 \in \mathcal{C}(s)\cap\mathbb{S}^{p-1}$, and
\[
\|\Psi^{1/2}v_j\|_2 - \|\Psi^{1/2}(v - v_j)\|_2 \le \|\Psi^{1/2}v\|_2 \le \|\Psi^{1/2}v_j\|_2 + \|\Psi^{1/2}(v - v_j)\|_2.
\]
The right-hand side is upper bounded by
\[
\sqrt{C_2 + M_2M_b^2 + t} + \epsilon\sup_{v\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}} \|\Psi^{1/2}v\|_2,
\]
which implies
\[
\sup_{v\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}} \|\Psi^{1/2}v\|_2 \le \frac{\sqrt{C_2 + M_2M_b^2 + t}}{1 - \epsilon}.
\]
Meanwhile, the left-hand side is lower bounded by
\[
\|\Psi^{1/2}v_j\|_2 - \|\Psi^{1/2}(v - v_j)\|_2 \ge \sqrt{C_1 - t} - \epsilon\sup_{v\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}} \|\Psi^{1/2}v\|_2 \ge \sqrt{C_1 - t} - \frac{\epsilon\sqrt{C_2 + M_2M_b^2 + t}}{1 - \epsilon},
\]
which means
\[
\inf_{v\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}} \|\Psi^{1/2}v\|_2 \ge \sqrt{C_1 - t} - \frac{\epsilon\sqrt{C_2 + M_2M_b^2 + t}}{1 - \epsilon}.
\]
Let $\bar C = C_2 + M_2M_b^2$ and $t = C_1/2$. For any $0 < \tau_1 < \sqrt{C_1/2}$, taking $\epsilon\in(0,1)$ small enough that $\epsilon\sqrt{\bar C + C_1/2}/(1-\epsilon) \le \sqrt{C_1/2} - \tau_1$, we have
\[
\inf_{v\in\mathcal{C}(s)\cap\mathbb{S}^{p-1}} \|\Psi^{1/2}v\|_2 \ge \tau_1,
\]
with probability at least $1 - 2\exp(C_Js\log p - c_1nt^2)$. When $n > Cs\log p$ for sufficiently large $C$, $2\exp(C_Js\log p - c_1nt^2) \le 2\exp(-\log p) = 2/p = o(1)$.
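As an empirical illustration of Lemma F.2 (a simulation sketch with arbitrary illustrative parameter choices, not part of the proof), one can draw $X$ from a two-component Gaussian mixture and probe $u^T\Psi u$ over random $s$-sparse unit directions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 500, 2000, 5  # illustrative sizes with n > C s log p

# X from a two-component Gaussian mixture, mimicking the model in Lemma F.2
# (mixing weight and mean separation are arbitrary illustrative choices).
labels = (rng.random(n) < 0.4).astype(int)
means = np.zeros((2, p))
means[1, :10] = 2.0
X = means[labels] + rng.standard_normal((n, p))

# Probe inf over s-sparse unit vectors of u^T (X^T X / n) u by sampling many
# random sparse directions; a minimum bounded away from zero illustrates the
# restricted eigenvalue condition (random sampling only upper-bounds the inf).
vals = []
for _ in range(2000):
    u = np.zeros(p)
    idx = rng.choice(p, size=s, replace=False)
    u[idx] = rng.standard_normal(s)
    u /= np.linalg.norm(u)
    Xu = X @ u
    vals.append(Xu @ Xu / n)
print("min of u^T Psi u over sampled sparse directions:", min(vals))
```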
Lemma F.3. Let $U = \frac{1}{n}\sum_{i=1}^n X_if_i^T$. The bound
\[
\|U\|_{F,s} = \sup_{v\in\mathbb{R}^{p\times q},\ \mathrm{vec}(v)\in L(s)\cap\mathbb{S}^{pq-1}} \langle U, v\rangle_F \le M
\]
holds with probability at least $1 - o(1)$ when $n > Csq^3\log p$ for a sufficiently large positive constant $C$.

Proof. By definition,
\[
\|U\|_{F,s} = \sup_{\substack{v\in\mathbb{R}^{p\times q}\\ \mathrm{vec}(v)\in L(s)\cap\mathbb{S}^{pq-1}}} \Big\langle \frac{1}{n}\sum_{i=1}^n X_if_i^T,\ v\Big\rangle_F = \sup_{\substack{v\in\mathbb{R}^{p\times q}\\ \mathrm{vec}(v)\in L(s)\cap\mathbb{S}^{pq-1}}} \frac{1}{n}\sum_{i=1}^n \mathrm{vec}(v)^T(f_i\otimes I_p)X_i.
\]
Note that, for any $\mathrm{vec}(v)\in L(s)\cap\mathbb{S}^{pq-1}$,
\[
\mathrm{E}[\mathrm{vec}(v)^T(f_i\otimes I_p)X_i] = \mathrm{vec}(v)^T(f_i\otimes I_p)\bar\Gamma^*f_i, \quad \text{where } \bar\Gamma^* = \pi_1^*\Gamma_1^* + \pi_2^*\Gamma_2^*.
\]
Since $\mathrm{vec}(v)^T(f_i\otimes I_p)X_i$ is sub-Gaussian, we have
\[
\|\mathrm{vec}(v)^T(f_i\otimes I_p)X_i - \mathrm{vec}(v)^T(f_i\otimes I_p)\bar\Gamma^*f_i\|_{\psi_1} \le C_1.
\]
By Bernstein's inequality (Theorem 2.8.1 in Vershynin (2018)), for $t > 0$,
\[
\mathrm{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \{\mathrm{vec}(v)^T(f_i\otimes I_p)X_i - \mathrm{vec}(v)^T(f_i\otimes I_p)\bar\Gamma^*f_i\}\Big| \ge t\Big) \le 2\exp\Big(-c\min\Big\{\frac{t^2}{\sum_{i=1}^n C_1^2/n^2}, \frac{t}{\max_i C_1/n}\Big\}\Big) \le 2\exp\Big(-c\min\Big\{\frac{nt^2}{C_1^2}, \frac{nt}{C_1}\Big\}\Big) \le 2\exp(-c_1nt^2),
\]
where the last inequality holds when $t \le C_1$ and $c_1 = c/C_1^2$. Let $\mathrm{vec}(v_1), \dots, \mathrm{vec}(v_{M_{\mathrm{net}}})$ be a $1/2$-net of $L(s)\cap\mathbb{S}^{pq-1}$. Then according to Lemma E.3, $\log M_{\mathrm{net}} \le C_{\mathrm{net}}sq^3\log p$, where $C_{\mathrm{net}}$ is a positive constant. By the union bound,
\[
\mathrm{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \{\mathrm{vec}(v_j)^T(f_i\otimes I_p)X_i - \mathrm{vec}(v_j)^T(f_i\otimes I_p)\bar\Gamma^*f_i\}\Big| \ge t \text{ for some } j \le M_{\mathrm{net}}\Big) \le 2M_{\mathrm{net}}\exp(-c_1nt^2) \le 2\exp(C_{\mathrm{net}}sq^3\log p - c_1nt^2).
\]
Therefore, with probability at least $1 - 2\exp(C_{\mathrm{net}}sq^3\log p - c_1nt^2)$, for all $\mathrm{vec}(v_j)\in L(s)\cap\mathbb{S}^{pq-1}$,
\[
\Big\langle \frac{1}{n}\sum_{i=1}^n X_if_i^T,\ v_j\Big\rangle_F \le \frac{1}{n}\sum_{i=1}^n \mathrm{vec}(v_j)^T(f_i\otimes I_p)\bar\Gamma^*f_i + t \le \frac{1}{n}\sum_{i=1}^n \|f_i\|_2^2\|\bar\Gamma^*\|_2 + t \le M_4^2M_b + t.
\]
Then, for any $\mathrm{vec}(v)\in L(s)\cap\mathbb{S}^{pq-1}$, there exists a $v_j$ such that $\|\mathrm{vec}(v - v_j)\|_2 \le 1/2$, $\mathrm{vec}(v - v_j)/\|v - v_j\|_F \in L(s)\cap\mathbb{S}^{pq-1}$, and
\[
\Big\langle \frac{1}{n}\sum_{i=1}^n X_if_i^T,\ v\Big\rangle_F \le \Big\langle \frac{1}{n}\sum_{i=1}^n X_if_i^T,\ v_j\Big\rangle_F + \Big\langle \frac{1}{n}\sum_{i=1}^n X_if_i^T,\ v - v_j\Big\rangle_F.
\]
Taking the supremum on both sides, we have
\[
\sup_{\mathrm{vec}(v)\in L(s)\cap\mathbb{S}^{pq-1}} \Big\langle \frac{1}{n}\sum_{i=1}^n X_if_i^T,\ v\Big\rangle_F \le 2(M_4^2M_b + t),
\]
with probability at least $1 - 2\exp(C_{\mathrm{net}}sq^3\log p - c_1nt^2)$. Let $t = (2C_{\mathrm{net}}sq^3\log p/(nc_1))^{1/2}$. When $n > Csq^3\log p$, with probability at least $1 - o(1)$,
\[
\sup_{\mathrm{vec}(v)\in L(s)\cap\mathbb{S}^{pq-1}} \Big\langle \frac{1}{n}\sum_{i=1}^n X_if_i^T,\ v\Big\rangle_F \le 2\Big(M_4^2M_b + \sqrt{\frac{2C_{\mathrm{net}}}{c_1C}}\Big) \le M.
\]

Lemma F.4. Suppose that $\widehat B, B^*\in\mathbb{R}^{p\times q}$ with $\mathrm{rank}(\widehat B) = \mathrm{rank}(B^*) = d$. Let $\widehat\beta, \beta^*$ be the top-$d$ left singular vectors of $\widehat B, B^*$, and let $\sigma_1 \ge \cdots \ge \sigma_d$ be the singular values of $B^*$. Assume $\|B^* - \widehat B\|_2 \le C_B$. Then
\[
\|P_{\beta^*} - P_{\widehat\beta}\|_F \le \frac{\sqrt{2d}(4\sigma_1 + 2C_B)}{\sigma_d^2}\|B^* - \widehat B\|_F.
\]

Proof. Let $\widehat\beta^T\beta^* = MDN^T$ denote the singular value decomposition of $\widehat\beta^T\beta^*$, where $M, N\in\mathbb{R}^{d\times d}$ and $D = \mathrm{diag}(\omega_1, \dots, \omega_d)$. Define the principal angles between the subspaces spanned by $\widehat\beta$ and $\beta^*$ as $(\phi_1, \dots, \phi_d) = (\cos^{-1}\omega_1, \dots, \cos^{-1}\omega_d)$, where $\omega_1 \ge \cdots \ge \omega_d$ are the singular values of $\widehat\beta^T\beta^*$, and define $\sin\Phi(\widehat\beta, \beta^*) = \mathrm{diag}(\sin\phi_1, \dots, \sin\phi_d)$. Then, according to Theorem 3 of Yu et al. (2015),
\[
\|\sin\Phi(\widehat\beta, \beta^*)\|_F \le \frac{2(2\sigma_1 + \|\widehat B - B^*\|_2)\min\{\sqrt{d}\|\widehat B - B^*\|_2, \|\widehat B - B^*\|_F\}}{\sigma_d^2} \le \frac{2(2\sigma_1 + C_B)\sqrt{d}}{\sigma_d^2}\|\widehat B - B^*\|_F.
\]
Hence
\[
\|P_{\beta^*} - P_{\widehat\beta}\|_F = \|\beta^*(\beta^*)^T - \widehat\beta\widehat\beta^T\|_F = \sqrt{\mathrm{tr}[(\beta^*(\beta^*)^T - \widehat\beta\widehat\beta^T)^T(\beta^*(\beta^*)^T - \widehat\beta\widehat\beta^T)]} = \sqrt{\mathrm{tr}(\beta^*(\beta^*)^T + \widehat\beta\widehat\beta^T) - 2\mathrm{tr}(\widehat\beta^T\beta^*(\beta^*)^T\widehat\beta)} = \sqrt{2d - 2\mathrm{tr}(MD^2M^T)} = \sqrt{2\sum_{i=1}^d \sin^2\phi_i} = \sqrt{2}\,\|\sin\Phi(\widehat\beta, \beta^*)\|_F \le \frac{\sqrt{2d}(4\sigma_1 + 2C_B)}{\sigma_d^2}\|\widehat B - B^*\|_F,
\]
where we use $\sin(\cos^{-1}\omega_i) = \sqrt{1 - \omega_i^2}$.

Lemma F.5. Let $a_2 = \frac{2M_2^{3/2}}{\sqrt{M_1}}$ and $b = 2\sqrt{M_2} + \frac{2M_2 + M_2^2}{\sqrt{M_1}}$. If $d_F(\theta, \theta^*) = \max\{\|B_1 - B_1^*\|_F, \|B_2 - B_2^*\|_F\} < r\Omega$, $\mathrm{vec}(\Gamma_w - \Gamma_w^*)\in L(s)$, and
\[
r < \min\Big\{\frac{|c_0 - c_\pi|}{\Omega},\ C_b,\ \sqrt{\frac{C_d - 1/(4\sqrt{M_1})}{a_2} + \frac{b^2}{4a_2^2}} - \frac{b}{2a_2}\Big\},
\]
then $\theta\in B_{\mathrm{con}}(\theta^*)$.

Proof. Recall that
\[
B_{\mathrm{con}}(\theta^*) = \Big\{\theta : \pi_w\in(c_0, 1-c_0),\ \|\Gamma_w - \Gamma_w^*\|_F \le C_b\Omega,\ (1-C_d)\Omega^2 \le |\mathrm{tr}(\delta_w(\Gamma)\widehat\Sigma_f(\Gamma_2 - \Gamma_1)^T)| \le (1+C_d)\Omega^2,\ \mathrm{vec}(\Gamma_w - \Gamma_w^*)\in L(s),\ w = 1, 2\Big\},
\]
where $\delta_w(\Gamma) = \Gamma_w^* - (\Gamma_2 + \Gamma_1)/2$ and $\Omega = \sqrt{\mathrm{tr}[(\Gamma_2^* - \Gamma_1^*)\widehat\Sigma_f(\Gamma_2^* - \Gamma_1^*)^T]}$. Since $\pi_1^*\in(c_\pi, 1-c_\pi)$, when $|\pi_1 - \pi_1^*| < r\Omega < |c_0 - c_\pi|$, we have $c_\pi - |c_0 - c_\pi| < \pi_1 < 1 - c_\pi + |c_0 - c_\pi|$. Using $c_0 < c_\pi$, $\pi_1\in(c_0, 1-c_0)$. By the definition of $r$, $\|\Gamma_w - \Gamma_w^*\|_F \le r\Omega \le C_b\Omega$. Note that
\[
|\Omega^2 - \mathrm{tr}(\delta_w(\Gamma)\widehat\Sigma_f(\Gamma_2 - \Gamma_1)^T)| = |\mathrm{tr}[(\Gamma_2^* - \Gamma_1^*)\widehat\Sigma_f(\Gamma_2^* - \Gamma_1^*)^T] - \mathrm{tr}([\Gamma_w^* - (\Gamma_2 + \Gamma_1)/2]\widehat\Sigma_f(\Gamma_2 - \Gamma_1)^T)| \le \underbrace{|\mathrm{tr}[\widehat\Sigma_f(\Gamma_2^* - \Gamma_1^* - \Gamma_2 + \Gamma_1)^T(\Gamma_2^* - \Gamma_1^*)]|}_{(\mathrm{I})} + \underbrace{|\mathrm{tr}[\widehat\Sigma_f(\Gamma_2 - \Gamma_1)^T\{\Gamma_2^* - \Gamma_1^* - \Gamma_w^* + (\Gamma_2 + \Gamma_1)/2\}]|}_{(\mathrm{II})}.
\]
We have
\[
(\mathrm{I}) = |\mathrm{vec}(\Gamma_2^* - \Gamma_1^* - \Gamma_2 + \Gamma_1)^T\mathrm{vec}[(\Gamma_2^* - \Gamma_1^*)\widehat\Sigma_f]| \le \|\Gamma_2^* - \Gamma_1^* - \Gamma_2 + \Gamma_1\|_F\,\|(\Gamma_2^* - \Gamma_1^*)\widehat\Sigma_f\|_F \le 2r\Omega\cdot\sqrt{M_2}\,\Omega = 2\sqrt{M_2}\,r\Omega^2.
\]
Since
\[
\|(\Gamma_2 - \Gamma_1)\widehat\Sigma_f^{1/2}\|_F \le \|(\Gamma_2^* - \Gamma_1^*)\widehat\Sigma_f^{1/2}\|_F + \|(\Gamma_2 - \Gamma_2^* - \Gamma_1 + \Gamma_1^*)\widehat\Sigma_f^{1/2}\|_F \le \Omega + 2\sqrt{M_2}\,r\Omega,
\]
we have
\[
\|\Gamma_2 - \Gamma_1\|_F \le \frac{1}{\sqrt{M_1}}\|(\Gamma_2 - \Gamma_1)\widehat\Sigma_f^{1/2}\|_F \le \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}\Omega.
\]
Also, $\|\Gamma_2 + \Gamma_1\|_F = \|\Gamma_2 - \Gamma_2^* + \Gamma_1 - \Gamma_1^* + \Gamma_2^* + \Gamma_1^*\|_F \le 2r\Omega + 2M_b$. When $\Omega > 16M_2M_b$, for $w = 1$,
\[
(\mathrm{II}) = |\mathrm{vec}(\Gamma_2 - \Gamma_1)^T\mathrm{vec}([\Gamma_2^* - 2\Gamma_1^* + (\Gamma_2 + \Gamma_1)/2]\widehat\Sigma_f)| \le \|\Gamma_2 - \Gamma_1\|_F\,\|[\Gamma_2^* - 2\Gamma_1^* + (\Gamma_2 + \Gamma_1)/2]\widehat\Sigma_f\|_F \le \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}\Omega\cdot M_2(3M_b + r\Omega + M_b) \le \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}M_2r\Omega^2 + \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}4M_2M_b\Omega \le \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}\Big(M_2r + \frac{1}{4}\Big)\Omega^2.
\]
If $w = 2$,
\[
(\mathrm{II}) = |\mathrm{vec}(\Gamma_2 - \Gamma_1)^T\mathrm{vec}([-\Gamma_1^* + (\Gamma_2 + \Gamma_1)/2]\widehat\Sigma_f)| \le \|\Gamma_2 - \Gamma_1\|_F\,\|[\Gamma_1^* - (\Gamma_2 + \Gamma_1)/2]\widehat\Sigma_f\|_F \le \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}\Omega\cdot M_2(M_b + r\Omega + M_b) \le \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}\Big(M_2r + \frac{1}{4}\Big)\Omega^2.
\]
Let $a_2 = \frac{2M_2^{3/2}}{\sqrt{M_1}}$, $b = 2\sqrt{M_2} + \frac{2M_2 + M_2^2}{\sqrt{M_1}}$, and assume $1/(4\sqrt{M_1}) < C_d$. Then $a_2r^2 + br < C_d - 1/(4\sqrt{M_1})$ when
\[
r < \sqrt{\frac{C_d - 1/(4\sqrt{M_1})}{a_2} + \frac{b^2}{4a_2^2}} - \frac{b}{2a_2}.
\]
Therefore, we have
\[
|\Omega^2 - \mathrm{tr}(\delta_w(\Gamma)\widehat\Sigma_f(\Gamma_2 - \Gamma_1)^T)| \le 2\sqrt{M_2}\,r\Omega^2 + \frac{1 + 2\sqrt{M_2}\,r}{\sqrt{M_1}}\Big(M_2r + \frac{1}{4}\Big)\Omega^2 \le \Big[\frac{2M_2^{3/2}}{\sqrt{M_1}}r^2 + \Big(2\sqrt{M_2} + \frac{2M_2 + M_2^2}{\sqrt{M_1}}\Big)r + \frac{1}{4\sqrt{M_1}}\Big]\Omega^2 < C_d\Omega^2,
\]
which gives $(1 - C_d)\Omega^2 \le |\mathrm{tr}(\delta_w(\Gamma)\widehat\Sigma_f(\Gamma_2 - \Gamma_1)^T)| \le (1 + C_d)\Omega^2$ and hence $\theta\in B_{\mathrm{con}}(\theta^*)$.
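As a numerical sanity check of the projection-distance bound in Lemma F.4 (an illustration with synthetic matrices, not part of the proof), one can compare the subspace distance against the stated bound:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, d = 30, 5, 3

# Rank-d B* with prescribed singular values, and a rank-d perturbation Bhat.
U = np.linalg.qr(rng.standard_normal((p, d)))[0]
V = np.linalg.qr(rng.standard_normal((q, d)))[0]
B_star = U @ np.diag([3.0, 2.0, 1.0]) @ V.T
W = B_star + 0.05 * rng.standard_normal((p, q))
Uw, sw, Vwt = np.linalg.svd(W, full_matrices=False)
B_hat = Uw[:, :d] @ np.diag(sw[:d]) @ Vwt[:d]  # truncate so rank(B_hat) = d

beta_star = np.linalg.svd(B_star, full_matrices=False)[0][:, :d]
beta_hat = Uw[:, :d]
lhs = np.linalg.norm(beta_star @ beta_star.T - beta_hat @ beta_hat.T, "fro")

sigma = np.linalg.svd(B_star, compute_uv=False)
C_B = np.linalg.norm(B_star - B_hat, 2)  # spectral norm of the perturbation
rhs = (np.sqrt(2 * d) * (4 * sigma[0] + 2 * C_B) / sigma[d - 1] ** 2
       * np.linalg.norm(B_star - B_hat, "fro"))
print(f"||P_beta* - P_betahat||_F = {lhs:.4f} <= bound = {rhs:.4f}")
```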
G. Additional Discussion

The mixture PFC method is model-based and belongs to the class of linear heterogeneous SDR methods. In contrast, nonlinear SDR aims to find a vector-valued function $g$ such that $Y \perp\!\!\!\perp X \mid g(X)$. Traditional nonlinear SDR methods often combine kernel tricks with linear SDR techniques (Wu, 2008; Hsing & Ren, 2009; Li et al., 2011). However, these approaches share common computational challenges: they require computing eigenvectors or inverses of $n\times n$ or $p\times p$ matrices, making them infeasible for large-scale high-dimensional data. Deep learning, with its proven success in various domains, offers promising alternatives for nonlinear SDR. The auto-encoder (Hinton & Salakhutdinov, 2006; Zong et al., 2018) is the most representative example of deep learning for unsupervised dimension reduction. Recently, several deep SDR methods have emerged, leveraging the power of deep neural networks to address the above challenges (Banijamali et al., 2018; Liang et al., 2022; Kapla et al., 2022; Huang et al., 2024; Chen et al., 2024). We suggest two strategies to extend linear heterogeneous SDR to nonlinear settings through deep learning.

The first strategy is inspired by Kwon et al. (2024) and addresses semi-supervised scenarios with both labeled data $\{(X_i, Y_i)\}_{i=1}^n$ and unlabeled data $\{X_i\}_{i=n+1}^N$. The model assumes the following structure:
\[
Y \perp\!\!\!\perp X \mid (g_w(X), W = w), \qquad \Pr(W \mid Y, X) = \Pr(W \mid Y, g_w(X)),
\]
\[
X \mid (\widetilde W = \widetilde w) \sim N(\mu_{\widetilde w}, \Sigma_{\widetilde w}), \quad w, \widetilde w = 1, \dots, K, \qquad \Pr(W = w \mid \widetilde W = \widetilde w) = \pi_{w\mid\widetilde w}.
\]
The key idea is to use a Gaussian mixture model (GMM) on the unlabeled data to infer the joint distribution of $(X, \widetilde W)$ and then apply any proposed deep SDR method to learn $g_w$. The procedure is as follows; a minimal code sketch is given at the end of this section. Step 1: Learn the joint distribution of $(X, \widetilde W)$ by fitting a GMM to the unlabeled data. Step 2: Assign the labeled data $\{(X_i, Y_i)\}_{i=1}^n$ to the $K$ clusters defined by $\widetilde W$, using the estimated distribution of $\widetilde W \mid X$. Step 3: Estimate the nonlinear SDR function within each cluster, using any deep SDR method. Step 4: Estimate the transition matrix $\Pi_{W\mid\widetilde W} = (\pi_{w\mid\widetilde w})$.

The second strategy combines a compression network and an estimation network, similar to the deep auto-encoding Gaussian mixture model (DAGMM) (Zong et al., 2018). The compression network is a supervised auto-encoder (Le et al., 2018) designed to perform dimension reduction while preserving the nonlinear SDR structure. The innermost layer incorporates a supervised loss to ensure that the reduced representation $g(X)$ satisfies the conditional independence condition in nonlinear SDR. Various dependence measures can be used to construct the loss function, such as distance covariance (Székely et al., 2007), martingale difference divergence (Shao & Zhang, 2014), and generalized martingale difference divergence (Li et al., 2023). The estimation network then uses the learned low-dimensional representation $g(X)$ and the response to predict clusters. Unlike DAGMM (Zong et al., 2018), this step employs a supervised clustering model rather than a Gaussian mixture model. To evaluate the clustering quality of the estimation network, the log-likelihood of the cluster assignments can be computed. Both strategies highlight the potential of deep learning to effectively extend SDR to nonlinear, heterogeneous, and high-dimensional settings.
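To make the first strategy concrete, the following is a minimal end-to-end sketch, assuming scikit-learn's GaussianMixture for Steps 1-2 and substituting PCA as a stand-in for the per-cluster deep SDR learner of Step 3; all data and parameter choices are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
K, p, n, N = 2, 10, 200, 2000

# Synthetic stand-ins for the semi-supervised setting: an unlabeled pool and
# a labeled sample drawn from two shifted Gaussian clusters.
X_unlab = np.vstack([rng.normal(m, 1.0, size=(N // K, p)) for m in (0.0, 3.0)])
X_lab = np.vstack([rng.normal(m, 1.0, size=(n // K, p)) for m in (0.0, 3.0)])
Y_lab = X_lab[:, 0] + 0.1 * rng.standard_normal(n)

# Step 1: fit a GMM to the unlabeled data to learn the distribution of (X, W~).
gmm = GaussianMixture(n_components=K, random_state=0).fit(X_unlab)

# Step 2: assign labeled points to the K clusters via the estimated W~ | X.
w_lab = gmm.predict(X_lab)

# Step 3: per-cluster dimension reduction; PCA here is only a placeholder for
# a (supervised) deep SDR learner g_w trained on (X, Y) within each cluster.
g = {w: PCA(n_components=2).fit(X_lab[w_lab == w]) for w in range(K)}

# Step 4 would estimate the transition matrix Pi_{W | W~}, which requires a
# model linking the latent W to the GMM clusters; omitted in this sketch.
print("cluster sizes:", {w: int((w_lab == w).sum()) for w in range(K)})
```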