# Provable Subspace Identification Under Post-Nonlinear Mixtures

Qi Lyu
School of EECS, Oregon State University, Corvallis, OR 97331
lyuqi@oregonstate.edu

Xiao Fu
School of EECS, Oregon State University, Corvallis, OR 97331
xiao.fu@oregonstate.edu

Abstract

Unsupervised mixture learning (UML) aims at identifying linearly or nonlinearly mixed latent components in a blind manner. UML is known to be challenging: even learning linear mixtures requires highly nontrivial analytical tools, e.g., independent component analysis or nonnegative matrix factorization. In this work, the post-nonlinear (PNL) mixture model, where unknown element-wise nonlinear functions are imposed onto a linear mixture, is revisited. The PNL model is widely employed in fields ranging from brain signal classification, speech separation, and remote sensing to causal discovery. To identify and remove the unknown nonlinear functions, existing works often assume specific properties of the latent components (e.g., statistical independence or probability-simplex structures). This work shows that, under a carefully designed UML criterion, the existence of a nontrivial null space associated with the underlying mixing system suffices to guarantee identification/removal of the unknown nonlinearity. Compared to prior works, our finding largely relaxes the conditions for attaining PNL identifiability, and thus may benefit applications where no strong structural information about the latent components is known. A finite-sample analysis is offered to characterize the performance of the proposed approach under realistic settings. To implement the proposed learning criterion, a block coordinate descent algorithm is proposed. A series of numerical experiments corroborates our theoretical claims.

1 Introduction

Unsupervised mixture learning (UML) has been broadly used in machine learning tasks, e.g., for recovering mixed latent components or extracting informative data representations. Central to understanding UML is the concept of identifiability. The identifiability of UML is concerned with whether the latent components can be identified (up to certain inconsequential ambiguities) with guarantees, without using any label or training data. This is a challenging problem due to its unsupervised nature.

To address the identifiability challenge, linear mixture learning problems were tackled by exploiting properties of the latent components, e.g., statistical independence [1], nonnegativity [2-4], quasi-stationarity [5,6], and boundedness [7]. This led to a cohort of blind source separation (BSS) methods. However, linear mixture models (LMMs) oftentimes fail to perform well on real-world data, due to the unknown nonlinear distortions arising in data generation and acquisition. In the past two decades, nonlinear mixture models (NMMs) have attracted unprecedented attention [8-14]. Nonetheless, the identifiability of the latent components is much harder to establish under NMMs. It was shown in [15] that NMMs are in general not identifiable, even with strong assumptions, e.g., statistical independence among the latent components. To establish NMM identifiability, a line of work exploited observable auxiliary variables or specific priors (i.e., extra information about the data or the latent components on top of statistical independence); see, e.g., [13,14,16,17].
Another line of methods takes an alternative route by utilizing the structure of the nonlinear distortions. In particular, the post-nonlinear (PNL) mixture model [8,9,18-20] is a reasonable extension of the classical LMM, obtained by assuming that unknown nonlinear transformations are individually imposed onto the outputs of an underlying linear mixture. This kind of structured NMM is used in a wide range of applications, e.g., hyperspectral imaging [19,21], brain data classification [20,22-24], and causal discovery [12,25,26]. To identify the generative model/latent components under the PNL model, a key step is to remove the nonlinear distortions in an unsupervised manner. To this end, existing methods often explicitly rely on properties of the latent components, e.g., statistical independence [8,9,18], availability of multiview data [20], or the sum-to-one and nonnegativity (i.e., probability-simplex) structure of the latent components [19,27]. These structures stem from specific applications, and thus are meaningful yet not always available. A natural question is whether the nonlinear distortions are still removable if the widely used structural assumptions (e.g., those in [8,9,18-20,27]) do not hold.

In this work, we offer a positive answer to the above inquiry and build a new learning algorithm based on our analyses. Our detailed contributions are as follows.

Subspace-Based PNL Model Identification. We propose a new nonlinearity identification and removal criterion under the PNL model. We show that, as long as the underlying unknown linear mixing system of the PNL model admits a nontrivial orthogonal complement (or, equivalently, its transpose has a nontrivial null space), the unknown nonlinear transformations are identifiable and removable under mild conditions. Notably, our approach does not use any specific structural assumptions on the latent components. Our finding presents a fairly different result compared to existing nonlinear mixture learning techniques, which often heavily rely on the properties of the latent components. Hence, our result may offer new insights and a useful alternative to this long-standing challenge. An immediate consequence is that our method can work with various types of latent components, including those not covered by the existing PNL model identification approaches in [8,9,18-20,27].

Performance Analysis Under Realistic Settings. The model identification analysis is conducted in the population case, where unlimited data is available and a universal function learner can be used to cancel any nonlinear distortion. To advance the understanding under more realistic settings, we also consider the scenario where one only has access to finite samples and non-universal function learners. The analysis is a combination of statistical generalization theory and numerical differentiation, which is reminiscent of the recently developed framework in [20,28,29], with nontrivial additional efforts to accommodate our learning criterion.

Optimization Algorithm Design. Based on the theoretical analysis, we formulate the identification problem as a constrained optimization criterion. The optimization problem is challenging, as it involves learning unknown nonlinear functions [represented by neural networks (NNs)] under nonconvex constraints. We propose a block coordinate descent (BCD) algorithm that alternates between two subproblems, namely, null space basis learning and nonlinear function learning.
This way, the latter is disentangled from the constraints, and thus off-the-shelf NN training techniques (e.g., Adam [30]) can be readily employed to handle the subproblem. A series of experiments is conducted to validate our theoretical claims.

Notation. $x$, $\boldsymbol{x}$, and $\boldsymbol{X}$ represent a scalar, a vector, and a matrix, respectively; $\|\boldsymbol{x}\|_0$, $\|\boldsymbol{x}\|_1$, and $\|\boldsymbol{x}\|_2$ denote the $\ell_0$ function, the $\ell_1$ norm, and the Euclidean norm, respectively; $\odot$ and $\circledast$ denote the Khatri-Rao product and the Hadamard product, respectively; $f$ is used to denote a function $f(\cdot):\mathbb{R}\to\mathbb{R}$; $f'$, $f''$, and $f^{(n)}$ denote the first-order, the second-order, and the $n$th-order derivatives of the function $f$, respectively; $f\circ g$ denotes function composition; $\mathcal{R}(\boldsymbol{X})$ and $\mathcal{N}(\boldsymbol{X})$ denote the range space and the null space of $\boldsymbol{X}$, respectively.

2 Background

2.1 UML and Model Identification

In UML, the classical LMM is defined as follows [31]:
$$x_\ell = A s_\ell,\quad \ell = 1, 2, \dots, N, \tag{1}$$
where $x_\ell \in \mathbb{R}^M$ is the $\ell$th observed sample, $s_\ell \in \mathbb{R}^K$ contains the $K$ latent components, and $A \in \mathbb{R}^{M\times K}$ is the mixing system. The latent components in (1) are generally not identifiable if $A$ is not known [31]. As a remedy, information about $A$ and/or $s_\ell$ is often used to establish identifiability. For instance, statistical independence among the $[s_\ell]_i$'s is used in independent component analysis (ICA) [1], and the nonnegativity of $A \in \mathbb{R}^{M\times K}_+$ and $S = [s_1, \dots, s_N] \in \mathbb{R}^{K\times N}_+$ is used in nonnegative matrix factorization (NMF) [3,4]; also see [5-7,32-34] for more conditions that can assist LMM identification.

The LMM in (1) is often over-simplified for modeling the data generation/acquisition processes in real-world applications where nonlinear effects are widely observed. In many cases, NMMs are of more interest. A typical NMM is as follows [35]:
$$x_\ell = g(s_\ell),\quad \ell = 1, 2, \dots, N, \tag{2}$$
where $g: \mathbb{R}^K \to \mathbb{R}^M$ is a nonlinear mixing system. Compared to the LMM, the NMM in (2) can represent much more complex data generation/acquisition processes. However, it was shown in [35] that the model in (2) is not identifiable, even if the latent components satisfy some fairly strong assumptions, e.g., statistical independence. A workaround is to use more information about, or additional conditions on, the latent components, e.g., the availability of auxiliary variables [13,14,16]. Another approach is to exploit structural assumptions on the nonlinear mixing process. This leads to the suite of PNL model learning works (see, e.g., [8,9,18-20,27,36]), which will be our focus.

2.2 Post-Nonlinear Mixture Model

The PNL model is a natural extension of the LMM, which is expressed as follows [8,9,18,20,27,36]:
$$x_\ell = g(A s_\ell),\quad \ell = 1, 2, \dots, N, \tag{3}$$
where $g = [g_1(\cdot), \dots, g_M(\cdot)]^\top$ with each $g_m(\cdot): \mathbb{R}\to\mathbb{R}$ being an invertible nonlinear function. We assume $s_\ell \sim p(s)$ and that the support of $p(s)$ is a continuous and open domain $\mathcal{S}$.

The PNL model has a myriad of applications across different fields, e.g., causal discovery [12,25,26], hyperspectral imaging [21], and brain data classification [22-24]. It is considered an appropriate model when the unknown nonlinear distortions happen at the sensor end during data acquisition. It also offers a valuable generalization of the linear mixture generative models in data analytics, e.g., those used in NMF and data clustering [37,38]. An important remark is that the challenge of UML under the PNL model boils down to identifying and removing $g(\cdot)$, as this leads to a well-understood LMM as in (1). Then, any LMM identification approach, e.g., those in [3,6,7,31,32,34,39], can be applied to identify the latent components $s_\ell$ for $\ell = 1, \dots, N$.
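As a concrete illustration of the model in (3), the following Python sketch (our own illustrative example, not the paper's experimental code; the choices of $M$, $K$, $N$, and the $g_m$'s are assumptions) generates PNL data with a tall mixing matrix and checks that a vector from $\mathcal{N}(A^\top)$ annihilates the underlying linear mixtures but not the distorted observations:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 5, 3, 10000                      # tall mixing system: M > K

A = rng.standard_normal((M, K))            # unknown linear mixing system
S = rng.uniform(-1, 1, size=(K, N))        # latent components s_ell (columns)

# Element-wise invertible nonlinear distortions g_m (illustrative choices).
g = [np.tanh,
     np.exp,
     lambda t: 1.0 / (1.0 + np.exp(-t)),   # sigmoid
     lambda t: t + 0.3 * t**3,             # strictly increasing polynomial
     np.arctan]

Z = A @ S                                  # linear mixtures A s_ell
X = np.stack([g[m](Z[m]) for m in range(M)])   # observations x_ell = g(A s_ell)

# Since M > K, N(A^T) is nontrivial: any v in it annihilates the *linear* part.
_, _, Vt = np.linalg.svd(A.T)
v = Vt[-1]                                 # a basis vector of N(A^T)
print(np.abs(v @ Z).max())                 # ~0 for the linear mixtures A s_ell
print(np.abs(v @ X).max())                 # generally far from 0 after g(.)
```

Once $g(\cdot)$ is cancelled by a learned $f(\cdot)$, the null-space relation is restored; this is exactly the handle exploited by the criterion developed next.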
Existing works tackle the $g$-identification and removal problem from various angles. In [8,18], the statistical independence among the latent components (i.e., $[s_\ell]_k$ for $k = 1, \dots, K$) is used. Recently, [19,27] proposed to identify the PNL model with latent components satisfying the probability simplex constraint (i.e., $1^\top s_\ell = 1$, $s_\ell \geq 0$) that often arises in soft clustering problems. The work in [20] shows that $g(\cdot)$ can be provably removed if multiple views of the data exist. These methods offer insightful and useful approaches to remove $g(\cdot)$, but the use of specific structural assumptions on the latent components makes their applicability relatively narrow.

We propose to identify and remove the nonlinear distortions by exploiting an underlying subspace that often naturally exists under PNL models, instead of resorting to structural assumptions on $s_\ell$ as in prior works. Our method starts with the following observation: if the mixing system $A$ is a tall matrix, $\mathcal{N}(A^\top)$ is a nontrivial subspace. This implies that if only the linear mixture part is considered (i.e., if $g(\cdot)$ can be removed completely), there exist vectors $v$ such that
$$v^\top A s_\ell = 0,\quad \forall \ell, \tag{4}$$
where the nonzero $v$ is any vector from $\mathcal{N}(A^\top)$. In the next section, we will show how the simple equation in (4) allows for constructing a provable $g$-identification and removal criterion.

3 Proposed Approach

In this section, we propose a learning criterion to identify/remove the nonlinear transformations $g(\cdot)$. We first consider the population case where infinite data points are available; that is, all $x \sim p(x)$ over the continuous support $\mathcal{X}$ are available. Then, we move forward to consider the finite-sample case in Sec. 3.2. Under the infinite-data assumption, our $g$-identification formulation is as follows:
$$\text{find}\quad Q,\ f(\cdot) = [f_1(\cdot), \dots, f_M(\cdot)]^\top \tag{5a}$$
$$\text{s.t.}\quad Q^\top f(x_\ell) = 0,\quad \forall x_\ell \in \mathcal{X}, \tag{5b}$$
$$\qquad\ f_m(\cdot): \mathbb{R}\to\mathbb{R} \text{ is invertible } \forall m, \tag{5c}$$
$$\qquad\ \|Q\|_0 = MD,\quad Q^\top Q = I, \tag{5d}$$
where $Q \in \mathbb{R}^{M\times D}$ and $D$ is the dimension of the null space $\mathcal{N}(A^\top)$, with $D \geq 1$ and $D < M$. The $\ell_0$ constraint in (5d) means that $Q$ is constrained to be a dense matrix, which will prove important for establishing identifiability; this will become clearer in Sec. 3.1. In the formulation, $f_m(\cdot)$ is the nonlinear invertible function that we look for. Ideally, we hope to learn $f_m(\cdot) = g_m^{-1}(\cdot)$, which cancels the nonlinear transformation. Then, $Q$ forms a basis of $\mathcal{N}(A^\top)$. Note that under the orthogonality constraint in (5d), the trivial solution $Q = 0$ is avoided. Besides, the invertibility constraint in (5c) is used to prevent other degenerate solutions, e.g., $f(x_\ell) = c$ for all $\ell$ with a constant $c$, from happening.

3.1 Provable Nonlinearity Removal

We will show that solving (5) guarantees that the nonlinear transformation $g(\cdot)$ is removed from the PNL model. To this end, we first introduce the following condition:

Definition 1 (Locally Free Components) Consider a real-valued random vector $v = [v_1, \dots, v_d]^\top \in \mathbb{R}^d$, where $v_i$ resides in a continuous and open set $\mathcal{V}_i \subseteq \mathbb{R}$. Assume that the following holds for all $j$: $p(v_j \mid v_{-j}) > 0$ with continuous support $\mathcal{V}_{j\mid v_{-j}}$ given any $v_{-j}$, where $v_{-j}$ denotes the vector containing all entries of $v$ but $v_j$. Then, the components in $v$ are all locally free components.

Note that the definition above means that each $v_j$ has the freedom to change (at least locally) no matter what value the $v_{-j}$ part takes. As a special case, statistically independent random variables clearly satisfy the definition. However, statistical independence is a much stronger condition; dependent variables could also be locally free components.
For instance, vectors from the probability simplex, i.e.,
$$v \in \Delta_K,\quad \Delta_K = \{v \in \mathbb{R}^K \mid 1^\top v = 1,\ v > 0\}, \tag{6}$$
also satisfy Definition 1; that is, any $K-1$ of the $K$ components in $v$ are locally free components. Generally, if $v_1, \dots, v_d$ are not completely dependent, the locally free condition is not hard to meet. Simply speaking, with $v_i$ and $v_j$ being free variables, we could observe any possible value of $v_j$ given a fixed $v_i$.

Using this notion, consider the functional equation
$$q^\top h(Av) = 0 \tag{7}$$
that holds for all $v$. Then, we can always observe $A[\bar v_1, \dots, v_j, \dots, \bar v_d]^\top$ for any fixed $\bar v_{-j}$, where $v_j$ ranges over the continuous support of $p(v_j \mid \bar v_{-j}) > 0$. Taking the derivative of (7) w.r.t. $v_j$ leads to the following:
$$q^\top\Big(h'\big(A[\bar v_1, \dots, v_j, \dots, \bar v_d]^\top\big) \circledast a_j\Big) = 0, \tag{8}$$
where $h'(\cdot)$ is applied element-wise and $a_j$ is the $j$th column of $A$, as all the terms $\partial \bar v_i / \partial v_j = 0$ for any $i \neq j$. This property will help us show the following main result:

Theorem 1 (Nonlinearity Identification and Removal) Under the model in (3), assume that the criterion (5) is solved. In addition, assume that $A \in \mathbb{R}^{M\times K}$ is drawn from any joint absolutely continuous distribution, that $\tilde K$ of the $K$ components of $s$ are locally free components (cf. Def. 1), and that the learned $\hat h_m = \hat f_m \circ g_m$ is twice differentiable for all $m \in [M]$. Suppose that
$$\frac{\tilde K(\tilde K + 1)}{2} \geq M. \tag{9}$$
Then, almost surely, any $\hat f(\cdot)$ that is a solution of (5) satisfies
$$\hat h_m(x) = \hat f_m \circ g_m(x) = c_m x + d_m,\quad \forall m \in [M],$$
where $c_m \neq 0$ and $d_m$ is a constant, at the limit of infinite data.

Theorem 1 indicates that if any solution $\hat f(\cdot)$ of (5) is found, then the composition $\hat h_m = \hat f_m \circ g_m$ is an affine function for all $m$. Notably, the theorem does not require all the $K$ latent components to be locally free, as long as the condition $\tilde K(\tilde K+1)/2 \geq M$ is satisfied, which means that some components are even allowed to be completely dependent. Intuitively, if $\tilde K$ is too small, then the remaining $K - \tilde K$ components do not provide useful information, while a larger $\tilde K$ implies that there is more diversity among the latent components to assist nonlinearity removal. Note that the premise here is that the null space $\mathcal{N}(A^\top)$ must exist, i.e., $M > K$ should hold, as $\mathrm{rank}(A) = \min\{M, K\}$ with probability one for any $A$ drawn from a joint continuous distribution. Otherwise, one may fail to learn a nonzero $Q$. On the other hand, if $M$ is overly large, i.e., $M > \tilde K(\tilde K+1)/2$, the model is not identifiable, as there is a lack of information (i.e., the number of locally free components is not large enough) to pin down all the $g_i(\cdot)$'s with $i = 1, \dots, M$.

We show the proof as follows:

Proof: By solving (5), we have the following equation for all $s_\ell \in \mathcal{S}$:
$$Q^\top h(A s_\ell) = 0, \tag{10}$$
where $h = f \circ g$ is the function composition. For simplicity, we drop the subscript $\ell$, as the equation holds for any $s \in \mathcal{S}$. First, we show the existence of a solution to (5). It is obvious that one can let $f = g^{-1}$ (i.e., $f_m = g_m^{-1}$ for all $m$) and make the columns of $Q$ an orthogonal basis of $\mathcal{N}(A^\top)$. In addition, such a $Q$ can always be made dense. This is because $A$ follows a joint continuous distribution, which implies that the probability of $\mathcal{N}(A^\top)$ being orthogonal to any axis is zero.

Next, we show that such a solution is unique up to only affine ambiguities. By the assumptions, there are $\tilde K$ out of the $K$ components of $s$ that are locally free. Thus, by taking the second-order (cross) derivatives of Eq. (10) w.r.t. each of the $\tilde K$ variables, as shown in (8), we have the following:
$$\big(Q^\top \odot B\big)\, h''(As) = 0,\quad \forall s \in \mathcal{S},\qquad B = \begin{bmatrix} (a_1 \circledast a_1)^\top \\ (a_1 \circledast a_2)^\top \\ \vdots \\ (a_{\tilde K-1} \circledast a_{\tilde K})^\top \\ (a_{\tilde K} \circledast a_{\tilde K})^\top \end{bmatrix}, \tag{11}$$
where $a_k$ is the $k$th column of $A$ and $h''(As) = [h''_1([As]_1), \dots, h''_M([As]_M)]^\top$.
Without loss of generality, we have assumed that the first $\tilde K$ variables of $s$ are locally free components. Note that if one can show $h''_m([As]_m) = 0$ for any $m$ over any $s$, then $h_m(\cdot)$ is an affine function for all $m$ over $\mathcal{S}$. To this end, we show that $Q^\top \odot B$ has full column rank, where the sizes of the matrices involved are $Q \in \mathbb{R}^{M\times D}$ and $B \in \mathbb{R}^{\tilde K(\tilde K+1)/2 \times M}$.

By Lemma 1 of [40], for matrices $U \in \mathbb{R}^{a\times m}$ and $V \in \mathbb{R}^{b\times m}$, $U \odot V$ has full column rank if $\mathrm{krank}(U) + \mathrm{krank}(V) \geq m + 1$, where $\mathrm{krank}(U)$ and $\mathrm{krank}(V)$ are the Kruskal ranks of $U$ and $V$, respectively. The Kruskal rank of a matrix $U$ is defined as the largest integer $r$ such that every set of $r$ columns of $U$ is linearly independent. For $Q^\top \odot B$, we do not have control over $Q$ since it is an optimization variable. However, the Kruskal rank of $Q^\top$ is at least 1, since we have the constraint $\|Q\|_0 = MD$ (i.e., $Q$ is dense). In terms of $B$, we can show that the matrix $B$ has full Kruskal rank $\min\{\tilde K(\tilde K+1)/2,\, M\}$ almost surely. The detailed explanation can be found in Appendix B.

To derive the conditions required for model identification (i.e., $\mathrm{rank}(Q^\top \odot B) = M$), we consider the following two cases:

a) For the case of $\mathrm{krank}(B) = \tilde K(\tilde K+1)/2$, we have $\mathrm{rank}(Q^\top \odot B) = M$ with probability one when the conditions $\tilde K(\tilde K+1)/2 \leq M$ and $1 + \tilde K(\tilde K+1)/2 \geq M + 1$ are met, which holds if $M = \tilde K(\tilde K+1)/2$.

b) For the case of $\mathrm{krank}(B) = M$, we have $\mathrm{rank}(Q^\top \odot B) = M$ with probability one when the conditions $\tilde K(\tilde K+1)/2 \geq M$ and $1 + M \geq M + 1$ are met, where the latter always holds; hence this case holds if $\tilde K(\tilde K+1)/2 \geq M$.

The two cases imply that if $\tilde K(\tilde K+1)/2 \geq M$ is satisfied, then the nonlinear functions $g_m(\cdot)$'s can be removed almost surely.

3.2 Finite-Sample Identifiability Analysis

The identifiability analysis in Sec. 3.1 is based on the assumption that (uncountably) infinite samples are available. However, one always has to work with finite samples in practice. In addition, the proof of Theorem 1 assumed that the $f_m(\cdot)$'s are universal function approximators. Nonetheless, in practice, the function learners, e.g., neural networks, may have nontrivial approximation errors when modeling nonlinear functions. In this section, we provide analysis under more realistic conditions. To begin with, note that when only $N$ i.i.d. samples are available, one can write the finite-sample version of (5) as follows:
$$\min_{Q,\, f}\ \frac{1}{N}\sum_{\ell=1}^{N} \big\|Q^\top f(x_\ell)\big\|_2^2 \tag{12a}$$
$$\text{s.t.}\quad \|Q\|_0 = MD,\quad Q^\top Q = I,\quad f_m(\cdot) \in \mathcal{F}, \tag{12b}$$
where $\mathcal{F}$ is an invertible function class from $\mathbb{R}$ to $\mathbb{R}$ used for modeling the inverse of the function $g_m(\cdot)$. In the population case (i.e., $N = \infty$), (12) and (5) have the same solutions. To characterize the finite-sample performance of (12) under an $f(\cdot)$ that has limited expressive power, we will use the following definition and assumptions:

Definition 2 (Function Class and Inverse) The function $g \in \mathcal{G}$ is an invertible continuous nonlinear function. We define its inverse class $\mathcal{G}^{-1}$ as a function class that contains all the $u$'s satisfying $u \circ g(y) = \alpha y + \beta$ with $\alpha \neq 0$, $\forall g \in \mathcal{G}$.

Assumption 1 (Realization Gap) For any $u \in \mathcal{G}^{-1}$, there exists $f \in \mathcal{F}$ such that $\sup_{x\in\mathcal{X}} |f(x_m) - u(x_m)| \leq \varepsilon$, $\forall m \in [M]$.

Assumption 2 (Boundedness) The 4th-order derivatives of $f_m \in \mathcal{F}$ and $g_m \in \mathcal{G}$ exist. In addition, $|f^{(n)}_m(\cdot)|$ and $|g^{(n)}_m(\cdot)|$ are bounded for all $n \in [4]$ and $m \in [M]$. Any 4th-order derivative of $q_k^\top h(As_\ell)$ with $k \in [D]$ is bounded by $C_d$. In addition, $A$ has bounded elements.

Assumption 3 (Neural Network Structure) The function $f_m(\cdot)$ is parameterized by a one-hidden-layer neural network with $R$ neurons and a 1-Lipschitz nonlinear activation function $\sigma(\cdot): \mathbb{R}\to\mathbb{R}$ with $\sigma(0) = 0$:¹
$$\mathcal{F} = \{f_m \mid f_m(x) = w_2^\top \sigma(w_1 x),\ \|w_i\|_2 \leq B,\ i = 1, 2\}, \tag{13}$$
where $\sigma(y) = [\sigma(y_1), \dots, \sigma(y_R)]^\top$ and $w_i \in \mathbb{R}^R$ for $i = 1, 2$.
Next, we show the finite-sample identifiability result for the case where the above neural network class is used. Our analysis can be readily generalized to cover more complex neural networks such as deep convolutional neural networks (CNNs) and ResNets. However, we use this simple neural network as a showcase, since our goal is to reveal insights rather than to cover complex neural structures. Using the neural networks from Assumption 3, we have the following lemma:

¹Note that commonly used activation functions, e.g., tanh, the rectified linear unit (ReLU), and the exponential linear unit (ELU), all satisfy this condition; see [41].

Lemma 1 Assume that $\mathcal{F}$ is the function class defined in Assumption 3 and that the input $x$ is bounded, i.e., $|x_m| \leq C_x$ for all $m$. Then, the Rademacher complexity of the loss function class induced by (12a) is bounded by a quantity that grows proportionally with the network width $R$ and decays with the sample size $N$; the exact expression and the proof can be found in Appendix D.

Note that Lemma 1 indicates that the complexity of the function class is proportional to the width $R$ of the neural network and inversely proportional to the sample size $N$. With Lemma 1, we have the following theorem:

Theorem 2 (Finite-Sample Identifiability) Under the generative model (3), assume that Assumptions 1, 2, and 3 hold. Suppose that $x_\ell$ for $\ell \in [N]$ are i.i.d. samples from $\mathcal{X}$ according to a certain distribution $\mathcal{D}$. Denote any solution of (12) as $\hat f = [\hat f_1, \dots, \hat f_M]^\top$ and $\hat h_m = \hat f_m \circ g_m$ for $m \in [M]$. Then, with probability of at least $1 - \delta$, the following holds:
$$\big\|\hat h''(As)\big\|_2^2 \leq O\!\left(\frac{\varepsilon}{\sigma_{\min}(Q^\top \odot B)} + \frac{1}{\sigma_{\min}(Q^\top \odot B)\, N^{1/4}}\right), \tag{15}$$
where $\sigma_{\min}(Q^\top \odot B)$ denotes the smallest singular value of the matrix $Q^\top \odot B$, which is defined in the linear system (11).

The detailed proof is relegated to Appendix D. The proof consists of three steps: 1) treat the loss function as a supervised regression problem and estimate the generalization error using the empirical error and the Rademacher complexity; 2) approximate the linear system in (11) using the generalization error and numerical differentiation; and 3) bound and characterize the second-order derivatives based on the approximated linear system.

Theorem 2 implies that the second-order derivatives of the $\hat h_m(\cdot)$'s approach zero with a sufficiently large sample size $N$. However, the term $\varepsilon$ is not determined by $N$; it only decreases to zero when $\mathcal{F}$ is a universal function approximator (which could be attained if $R$ is large enough). Therefore, there is always a trade-off between the complexity (encoded by key parameters like depth and width) of the neural network and the effectiveness of model identification given a fixed sample size $N$. According to Theorem 2, one hopes to employ an expressive enough neural network for canceling $g_m(\cdot)$, but excessively increasing the expressiveness would hurt the performance under a given $N$; this is consistent with our experience in many neural network learning problems.

4 Optimization Design

In this section, we design an optimization scheme to implement the criterion in (12). Specifically, we propose the following formulation:
$$\min_{Q,\, f}\ \frac{1}{N}\sum_{\ell=1}^{N}\big\|Q^\top f(x_\ell)\big\|_2^2 \tag{16a}$$
$$\text{s.t.}\quad Q^\top Q = I,\quad f_m(\cdot) \in \mathcal{F},\quad f_m(\cdot) \text{ is invertible.} \tag{16b}$$
Note that the formulation in (16) does not ensure a dense $Q$. Nonetheless, if $Q$ is initialized randomly using any continuous distribution, iterative optimizers would never return a $Q$ that contains zeros, provided that the iterative algorithm can be approximated by a continuous transformation. We use the neural network structure defined in Assumption 3 to approximate $f_m(\cdot)$. Note that with a sufficiently large $R$, the function $f_m(\cdot)$ is a universal approximator [42,43].
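For concreteness, a minimal PyTorch sketch of the function learner in Assumption 3 is given below: each channel $m$ has its own one-hidden-layer network $f_m(x) = w_2^\top \sigma(w_1 x)$ with $R$ neurons. The module name and the initialization scale are our own illustrative choices, and the norm bound $B$ in (13) is not explicitly enforced here.

```python
import torch
import torch.nn as nn

class ChannelwiseMLP(nn.Module):
    """M scalar-to-scalar maps f_m(x) = w2_m^T sigma(w1_m * x), as in (13)."""
    def __init__(self, M: int, R: int = 256):
        super().__init__()
        # w1[m], w2[m] in R^R play the roles of w_1 and w_2 for channel m.
        self.w1 = nn.Parameter(0.1 * torch.randn(M, R))
        self.w2 = nn.Parameter(0.1 * torch.randn(M, R))
        self.act = nn.ReLU()                      # 1-Lipschitz and sigma(0) = 0

    def forward(self, x):                         # x: (batch, M)
        # Broadcast (batch, M, 1) * (M, R) -> (batch, M, R), then reduce over R.
        hidden = self.act(x.unsqueeze(-1) * self.w1)
        return (hidden * self.w2).sum(dim=-1)     # entry m is f_m(x_m)

# The empirical criterion (12a)/(16a) then penalizes ||Q^T f(x_ell)||_2^2, e.g.,
# loss1 = (f(x) @ Q).pow(2).sum(dim=1).mean() for a column-orthonormal Q.
```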
In addition, to promote invertibility, we introduce another neural network $r(\cdot)$, with the same structure as that of $f(\cdot)$, to make sure that $f(x_\ell)$ can be converted back to $x_\ell$. This idea is borrowed from the autoencoder [44]. The problem is then recast as follows:
$$\min_{Q^\top Q = I,\, f,\, r}\ \underbrace{\frac{1}{N}\sum_{\ell=1}^N \big\|Q^\top f(x_\ell)\big\|_2^2}_{\mathcal{L}_1} + \lambda\, \underbrace{\frac{1}{N}\sum_{\ell=1}^N \big\|x_\ell - r(f(x_\ell))\big\|_2^2}_{\mathcal{L}_2}, \tag{17}$$
where the hyperparameter $\lambda \geq 0$ is used for balancing the energy between the loss and the regularization term. In addition, we define $\mathcal{L} = \mathcal{L}_1 + \lambda\mathcal{L}_2$. The problem can be handled by a block coordinate descent (BCD) algorithm, as shown in Algorithm 1. Note that any stochastic optimizer designed for neural network training (e.g., [30]) can be used here for the $f(\cdot)$ and $r(\cdot)$ updates. We use $\theta_f$ and $\theta_r$ to denote the parameters of the corresponding neural networks. Besides, we denote $\widehat{\mathcal{L}}$ and $\widehat{\mathcal{L}}_i$ for $i = 1, 2$ as estimates of $\mathcal{L}$ and $\mathcal{L}_i$ over a random batch $\mathcal{B}$, respectively. For example, $\widehat{\mathcal{L}}_1 = \frac{1}{|\mathcal{B}|}\sum_{\ell\in\mathcal{B}} \|Q^\top f(x_\ell)\|_2^2$, and $\widehat{\mathcal{L}}_2$ is defined in an identical way. The term $\nabla_{\theta_r}\widehat{\mathcal{L}}_2$ denotes the derivative of $\widehat{\mathcal{L}}_2$ taken w.r.t. $\theta_r$. The other derivative terms are defined in the same manner. We refer to the algorithm as post-nonlinear subspace identification. This is because Theorem 1 indicates that the method identifies $\mathrm{range}(S^\top)$, where $S = [s_1, \dots, s_N]$ for $N \geq K$, after removing the nonlinear distortions (technically, this needs $d_m = 0$ for all $m$ (cf. Theorem 1), which automatically holds if our norm minimization term in the objective becomes zero).

Algorithm 1: Post-Nonlinear Subspace Identification.
Data: $x_\ell$ for $\ell = 1, \dots, N$
Result: $f$
1 while stopping criterion is not reached do
2   $Q \leftarrow U[:, M-D:M]$, where $U\Sigma V^\top = \mathrm{SVD}(F)$ with $F = [f(x_1), \dots, f(x_N)]$;
3   while stopping criterion is not reached do
4     Draw a random batch $\mathcal{B}$;
5     $\nabla_{\theta_f}\widehat{\mathcal{L}} \leftarrow \nabla_{\theta_f}\widehat{\mathcal{L}}_1 + \lambda \nabla_{\theta_f}\widehat{\mathcal{L}}_2$;
6     $\nabla_{\theta_r}\widehat{\mathcal{L}} \leftarrow \nabla_{\theta_r}\widehat{\mathcal{L}}_2$;
7     Update $\theta_f$ and $\theta_r$ using $\nabla_{\theta_f}\widehat{\mathcal{L}}$ and $\nabla_{\theta_r}\widehat{\mathcal{L}}$ with any stochastic optimizer, respectively;
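To make Algorithm 1 concrete, the following simplified PyTorch sketch alternates the SVD-based $Q$-update with stochastic Adam updates of $f$ and $r$. The function signature, batch size, and iteration counts are illustrative assumptions rather than the exact released implementation; $f$ and $r$ can be instantiated, e.g., with the channel-wise network sketched after Assumption 3.

```python
import torch

def pnl_subspace_identification(X, D, f, r, lam=1e-4,
                                outer_iters=50, inner_iters=200, lr=2e-4):
    """Simplified BCD loop for (17). X: (N, M) data tensor; f, r: M-channel networks."""
    opt = torch.optim.Adam(list(f.parameters()) + list(r.parameters()), lr=lr)
    for _ in range(outer_iters):
        # Block 1: Q <- last D left singular vectors of F = [f(x_1), ..., f(x_N)],
        # i.e., the orthonormal Q minimizing ||Q^T F||_F (line 2 of Algorithm 1).
        with torch.no_grad():
            F = f(X).T                                     # (M, N)
            U, _, _ = torch.linalg.svd(F, full_matrices=False)
            Q = U[:, -D:]                                  # (M, D), Q^T Q = I
        # Block 2: update theta_f and theta_r with Q fixed (lines 3-7).
        for _ in range(inner_iters):
            idx = torch.randint(0, X.shape[0], (256,))     # random batch B
            xb = X[idx]
            fx = f(xb)
            loss1 = (fx @ Q).pow(2).sum(dim=1).mean()      # ||Q^T f(x)||_2^2
            loss2 = (r(fx) - xb).pow(2).sum(dim=1).mean()  # ||x - r(f(x))||_2^2
            opt.zero_grad()
            (loss1 + lam * loss2).backward()
            opt.step()
    return Q, f
```

Here, $Q$ is recomputed from the full dataset at every outer iteration, mirroring line 2 of Algorithm 1; in practice, stopping criteria on the objective would replace the fixed iteration counts.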
5 Numerical Experiments

5.1 Synthetic Data

We generate data following the model in (3). For the neural networks representing $f_m(\cdot)$ and $r_m(\cdot)$, we use $R = 256$ with ReLU activations. We use the Adam optimizer [30] with an initial learning rate of $2\times 10^{-4}$ for the network optimization part. For the hyperparameter, we set $\lambda = 10^{-4}$. The nonlinear functions $g_m(\cdot)$'s are selected to be variants of $e^x$, $\mathrm{sigmoid}(x)$, and $\tanh(x)$ to guarantee invertibility. The source code can be found online.²

Independent Latent Components. In this simulation, we set $K = 3$ and $M = 5$. Each element of $s_\ell$ is drawn independently from the uniform distribution $\mathcal{U}[-1, 1]$ with $N = 10000$, and the mixing matrix $A$ is drawn from the normal distribution. Under this setting, it is obvious that $Q \in \mathbb{R}^{5\times 2}$. Fig. 1 shows the compositions $\hat h_m(\cdot) = \hat f_m \circ g_m(\cdot)$. One can see that all five compositions are visually affine, which means that the algorithm has successfully identified and removed $g(\cdot)$. Prior works on post-nonlinear ICA oftentimes involve computation- and sample-size-demanding implementations. For example, in [8], a nonparametric density estimation step of the learned components needs to be carried out in every iteration in order to measure and promote statistical independence. In comparison, the proposed approach is much more straightforward and easier to implement, as it does not rely on statistical independence.

Dependent Latent Components. In this case, we consider latent components drawn from the probability simplex as in (6). The mixing matrix $A$ is the same as before. Note that the latent components are dependent. Consequently, the dimension of the null space is $M - K + 1$, because there are only $\tilde K = K - 1$ free variables in $s_\ell$. For this simulation, we set $N = 10{,}000$, $K = 3$, and $M = 5$. Accordingly, the dimension of $\mathcal{N}(A^\top)$ is $M - K + 1 = 3$, which makes $Q$ a $5\times 3$ matrix.

²https://github.com/llvqi

Figure 1: Learned function compositions with independent latent components.

Figure 2: Learned function compositions with dependent latent components.

Fig. 2 shows the results. It is obvious that the nonlinear distortions are successfully removed. Readers might have noticed that in this case the condition in (9) is actually violated, since $\tilde K(\tilde K+1)/2 = 3 < 5$. Interestingly, the method still works. This is likely because in this numerical example $Q^\top$ tends to have full Kruskal rank, while in our proof we only used that $\mathrm{krank}(Q^\top) = 1$. It turns out that the identifiability condition could be much improved if $Q^\top$ has full Kruskal rank. We refer interested readers to further discussions in Appendix A.

Validating Theorem 2. In this part, we demonstrate the impact of $R$ (i.e., the width of the employed neural network) on the performance of nonlinearity removal, as stated in Theorem 2. To be specific, we measure the subspace distance (see the definition in [20]) between the estimated $\mathrm{range}(\widehat S^\top)$ and the ground truth $\mathrm{range}(S^\top)$, where $S = [s_1, \dots, s_N]$. This value is within $[0, 1]$, with 0 being the best. The setting is the same as that of the independent case. Table 1 shows the results, which are averaged over 5 random trials. One can see that when $f_m(\cdot)$ is not expressive enough, i.e., when $R = 8$, the performance is far from satisfactory. This is because the term $\varepsilon$ dominates in the bound (15). As the number of neurons increases, the subspace distance keeps decreasing, which indicates that the latent subspace is well identified. However, if the function is overly complex ($R = 1024$), the performance starts to deteriorate, since in this case $R$ becomes dominant on the right-hand side of (15). In addition, one can see that a larger $N$ leads to better performance in general, which also validates our claim in (15).

5.2 Real Data

Dataset. We use the human face electroencephalogram (EEG) dataset.³ The EEG signals were collected while a subject was shown pictures of real human faces or scrambled faces. The scrambled face images were generated from the real faces by randomly permuting the phases of their Fourier transforms [45]. At every sample, the EEG signal is a 130-dimensional vector $x_\ell$, measured through 130 sensors (channels) placed over the subject's scalp. The task is to use $x_\ell$ for $\ell = 1, \dots, N$ as the training data to learn a classifier. The classifier aims to determine whether the subject sees a real face image or a scrambled one when a new $x_\ell$ is collected.

Metric. We use a number of unsupervised representation learning (URL) methods to extract low-dimensional embeddings of the $x_\ell$'s. Then, these embeddings are used to train classifiers based on support vector machines (SVM) and logistic regression (LR). We have $N = 27{,}692$ samples of $x_\ell$, which are split into training, validation, and test sets with 24794, 1449, and 1449 samples, respectively. We use the classification error as an indirect measure of performance. We follow the hypothesis in [45]: there, the model is $x_\ell = g(As_\ell)$ with a PNL structure, where $s_\ell$ are the clean electrical signals generated by the brain, and $A$ and $g(\cdot)$ represent the propagation, mixing, and distortions at the sensor end, respectively [23].
Hence, we use our method to remove $g(\cdot)$ and then reduce the dimension of $As_\ell$ to $K_0$ via PCA. Our belief is that if the PNL learning method can remove $g(\cdot)$, then the extracted undistorted signals will benefit the training of linear classifiers such as SVM and LR. We include PCA (applied to the raw data), an autoencoder, and nonlinear ICA (nICA)-based approaches [9,16] as baselines.

³https://www.fil.ion.ucl.ac.uk/spm/data/mmfaces/

Table 1: Subspace distance with independent components.

| | R = 8 | R = 16 | R = 32 | R = 64 | R = 128 | R = 256 | R = 1024 |
|---|---|---|---|---|---|---|---|
| N = 10000 | 0.47 ± 0.08 | 0.22 ± 0.05 | 0.08 ± 0.02 | 0.05 ± 0.02 | 0.04 ± 0.01 | 0.03 ± 0.01 | 0.07 ± 0.01 |
| N = 20000 | 0.44 ± 0.06 | 0.20 ± 0.08 | 0.07 ± 0.02 | 0.05 ± 0.01 | 0.03 ± 0.01 | 0.02 ± 0.01 | 0.06 ± 0.01 |

Table 2: Average classification error ± std (best error rate attained in the 5 trials) on the EEG data. The number in parentheses in the header is the feature dimension $K_0$ used for training the classifier.

| | Raw | PCA (30) | Autoencoder (50) | Proposed (30) | nICA [16] (30) | PNL-ICA [9] (30) |
|---|---|---|---|---|---|---|
| SVM | 0.46 ± 0.05 (0.43) | 0.43 ± 0.02 (0.41) | 0.43 ± 0.03 (0.40) | 0.40 ± 0.03 (0.35) | 0.44 ± 0.02 (0.43) | 0.46 ± 0.05 (0.38) |
| LR | 0.47 ± 0.04 (0.42) | 0.45 ± 0.04 (0.41) | 0.46 ± 0.03 (0.41) | 0.39 ± 0.03 (0.35) | 0.43 ± 0.03 (0.39) | 0.44 ± 0.03 (0.39) |

Results. Table 2 shows the classification errors. The results are averaged over 5 random trials. For all methods, the reduced dimension $K_0$ (for our method, $D = M - K_0$) is selected over $\{10, 30, 50, 70\}$ using the validation set. We use fully connected networks that have two hidden layers with 256 neurons in each layer. The same network structure is used for the autoencoder baseline. The activation functions are all ReLU. We use the Adam [30] optimizer for the neural network updates, with an initial learning rate of $10^{-4}$. One can see that our method attains the lowest classification error compared to the other baselines. Note that the brain signal classification problem is quite challenging, and thus using the raw data only yields an accuracy that is slightly better than random guessing. Using our method, the average classification errors are reduced by 6% and 7% from those obtained using the raw data for SVM and LR, respectively, which is substantial. This also shows the usefulness of the PNL model for EEG data. PCA is a linear method that does not capture the nonlinearity of the data, and thus the result under PCA is less promising. The autoencoder can be understood as a nonlinearity removal method, but it does not ensure provable $g(\cdot)$ removal. The nICA variants also exhibit higher error rates relative to the proposed method. This may be because these ICA frameworks hinge on structural assumptions on the latent components, which may not always hold in practice.

6 Conclusion

In this work, we revisited the PNL model from a model identification perspective. We proposed a new framework to identify and remove the unknown nonlinear distortions in the data generation/acquisition process. Compared to existing works, the proposed approach does not require the latent components to satisfy specific structural assumptions, e.g., statistical independence. Consequently, our new result offers a PNL model identification tool to a much wider spectrum of applications. In addition, we provided analysis for the cases where only finite samples and non-universal function learners are available. We designed a constrained optimization scheme to implement the proposed learning criterion. Our theoretical claims were supported by experimental results. There are a number of limitations.
First, the PNL model incurs a high computational cost when the data feature dimension $M$ is large, since each dimension induces an individual neural network. Second, there are meaningful applications where the $A$ matrix may not have a nontrivial orthogonal complement (e.g., in PNL-based causal direction estimation [12]), and the current framework does not guarantee nonlinearity removal in those cases. Third, the learning criterion of the PNL problem is a nonconvex optimization problem. The proposed algorithm is not guaranteed to solve the formulated learning criterion, which creates a gap between the identifiability results (which assume optimization optimality) and the attainable results in practice.

Acknowledgement: This work was supported by the National Science Foundation (NSF) CAREER Award under Project ECCS-2144889.

References

[1] P. Comon, "Independent component analysis, a new concept?" Signal Processing, vol. 36, no. 3, pp. 287-314, 1994.
[2] X. Fu, K. Huang, and N. D. Sidiropoulos, "On identifiability of nonnegative matrix factorization," IEEE Signal Process. Lett., vol. 25, no. 3, pp. 328-332, 2018.
[3] X. Fu, K. Huang, N. D. Sidiropoulos, and W.-K. Ma, "Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications," IEEE Signal Process. Mag., vol. 36, no. 2, pp. 59-80, March 2019.
[4] N. Gillis, Nonnegative Matrix Factorization. SIAM, 2020.
[5] L. D. Lathauwer and J. Castaing, "Blind identification of underdetermined mixtures by simultaneous matrix diagonalization," IEEE Trans. Signal Process., vol. 56, no. 3, pp. 1096-1105, Mar. 2008.
[6] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, "Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain," IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2306-2320, May 2015.
[7] S. Cruces, "Bounded component analysis of linear mixtures: A criterion of minimum convex perimeter," IEEE Trans. Signal Process., vol. 58, no. 4, pp. 2141-2154, April 2010.
[8] A. Taleb and C. Jutten, "Source separation in post-nonlinear mixtures," IEEE Trans. Signal Process., vol. 47, no. 10, pp. 2807-2820, 1999.
[9] A. Ziehe, M. Kawanabe, S. Harmeling, and K.-R. Müller, "Blind separation of post-nonlinear mixtures using linearizing transformations and temporal decorrelation," Journal of Machine Learning Research, vol. 4, no. Dec, pp. 1319-1338, 2003.
[10] A. Hyvarinen and H. Morioka, "Nonlinear ICA of temporally dependent stationary sources," in International Conference on Artificial Intelligence and Statistics, vol. 54, 2017, pp. 460-469.
[11] R. S. Zimmermann, Y. Sharma, S. Schneider, M. Bethge, and W. Brendel, "Contrastive learning inverts the data generating process," in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 139. PMLR, 2021, pp. 12979-12990.
[12] K. Zhang and A. Hyvärinen, "On the identifiability of the post-nonlinear causal model," in 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009). AUAI Press, 2009, pp. 647-655.
[13] A. Hyvarinen and H. Morioka, "Unsupervised feature extraction by time-contrastive learning and nonlinear ICA," in Advances in Neural Information Processing Systems, vol. 29, 2016.
[14] A. Hyvarinen, H. Sasaki, and R. Turner, "Nonlinear ICA using auxiliary variables and generalized contrastive learning," in International Conference on Artificial Intelligence and Statistics, 2019, pp. 859-868.
[15] A. Hyvarinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 626-634, 1999.
[16] I. Khemakhem, D. Kingma, R. Monti, and A. Hyvarinen, "Variational autoencoders and nonlinear ICA: A unifying framework," in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2207-2217.
[17] L. Gresele, P. K. Rubenstein, A. Mehrjou, F. Locatello, and B. Schölkopf, "The incomplete Rosetta stone problem: Identifiability results for multi-view nonlinear ICA," in Proceedings of UAI 2020, 2020, pp. 217-227.
[18] S. Achard and C. Jutten, "Identifiability of post-nonlinear mixtures," IEEE Signal Process. Lett., vol. 12, no. 5, pp. 423-426, 2005.
[19] Q. Lyu and X. Fu, "Identifiability-guaranteed simplex-structured post-nonlinear mixture learning via autoencoder," IEEE Trans. Signal Process., vol. 69, pp. 4921-4936, 2021.
[20] Q. Lyu and X. Fu, "Nonlinear multiview analysis: Identifiability and neural network-assisted implementation," IEEE Trans. Signal Process., vol. 68, pp. 2697-2712, 2020.
[21] Y. Altmann, A. Halimi, N. Dobigeon, and J.-Y. Tourneret, "A post nonlinear mixing model for hyperspectral images unmixing," in 2011 IEEE International Geoscience and Remote Sensing Symposium, 2011, pp. 1882-1885.
[22] F. Oveisi, "EEG signal classification using nonlinear independent component analysis," in Proc. IEEE ICASSP 2009, April 2009, pp. 361-364.
[23] A. Ziehe, K.-R. Müller, G. Nolte, B.-M. Mackert, and G. Curio, "Artifact reduction in magnetoneurography based on time-delayed second-order correlations," IEEE Trans. Biomedical Eng., vol. 47, no. 1, pp. 75-87, Jan 2000.
[24] F. Oveisi, S. Oveisi, A. Efranian, and I. Patras, "Nonlinear independent component analysis for EEG-based brain-computer interface systems," in Independent Component Analysis for Audio and Biosignal Applications, G. R. Naik, Ed., p. 165, 2012.
[25] K. Zhang and A. Hyvärinen, "Distinguishing causes from effects using nonlinear acyclic causal models," in Causality: Objectives and Assessment. PMLR, 2010, pp. 157-164.
[26] K. Uemura, T. Takagi, K. Takayuki, H. Yoshida, and S. Shimizu, "A multivariate causal discovery based on post-nonlinear model," in First Conference on Causal Learning and Reasoning, 2022. [Online]. Available: https://openreview.net/forum?id=avZrtg416yr
[27] B. Yang, X. Fu, N. D. Sidiropoulos, and K. Huang, "Learning nonlinear mixtures: Identifiability and algorithm," IEEE Trans. Signal Process., vol. 68, pp. 2857-2869, 2020.
[28] Q. Lyu, X. Fu, W. Wang, and S. Lu, "Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective," in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=5FUq05QRc5b
[29] Q. Lyu and X. Fu, "Finite-sample analysis of deep CCA-based unsupervised post-nonlinear multimodal learning," IEEE Transactions on Neural Networks and Learning Systems, pp. 1-7, 2022.
[30] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
[31] P. Comon and C. Jutten, Handbook of Blind Source Separation. Elsevier, 2010.
[32] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Computation, vol. 13, no. 4, pp. 863-882, 2001.
[33] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. D. Sidiropoulos, "Robust volume minimization-based matrix factorization for remote sensing and document clustering," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6254-6268, Dec 2016.
[34] N. Gillis and S. Vavasis, "Fast and robust recursive algorithms for separable nonnegative matrix factorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 698-714, April 2014.
[35] A. Hyvärinen and P. Pajunen, "Nonlinear independent component analysis: Existence and uniqueness results," Neural Networks, vol. 12, no. 3, pp. 429-439, 1999.
[36] E. Oja, "The nonlinear PCA learning rule in independent component analysis," Neurocomputing, vol. 17, no. 1, pp. 25-45, 1997.
[37] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999.
[38] B. Yang, X. Fu, and N. D. Sidiropoulos, "Learning from hidden traits: Joint factor analysis and latent clustering," IEEE Trans. Signal Process., vol. 65, no. 1, pp. 256-269, 2017.
[39] X. Fu, W.-K. Ma, T.-H. Chan, and J. M. Bioucas-Dias, "Self-dictionary sparse regression for hyperspectral unmixing: Greedy pursuit and pure pixel search are related," IEEE J. Sel. Topics Signal Process., vol. 9, no. 6, pp. 1128-1141, 2015.
[40] N. D. Sidiropoulos, R. Bro, and G. B. Giannakis, "Parallel factor analysis in sensor array processing," IEEE Transactions on Signal Processing, vol. 48, no. 8, pp. 2377-2388, 2000.
[41] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, "Regularization of neural networks using DropConnect," in International Conference on Machine Learning. PMLR, 2013, pp. 1058-1066.
[42] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303-314, 1989.
[43] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251-257, 1991.
[44] G. E. Hinton and R. S. Zemel, "Autoencoders, minimum description length and Helmholtz free energy," in Advances in Neural Information Processing Systems, 1994, pp. 3-10.
[45] J. Ashburner, G. Barnes, C. Chen, J. Daunizeau, G. Flandin, K. Friston, D. Gitelman, S. Kiebel, J. Kilner, V. Litvak et al., "SPM8 manual," Functional Imaging Laboratory, Institute of Neurology, 2012.
[46] G. Bacciagaluppi, M. J. Donald, and P. E. Vermaas, "Continuity and discontinuity of definite properties in the modal interpretation," Helvetica Physica Acta, vol. 68, no. 7, p. 679, 1995.
[47] T. W. Anderson, An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1962.
[48] T. Jiang, N. D. Sidiropoulos, and J. M. Ten Berge, "Almost-sure identifiability of multidimensional harmonic retrieval," IEEE Trans. Signal Process., vol. 49, no. 9, pp. 1849-1859, 2001.
[49] P. Liang, "CS229T/STAT231: Statistical learning theory," 2016.
[50] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463-482, 2002.
[51] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[52] K. Mørken, "Numerical algorithms and digital representation," 2013, Department of Mathematics, Centre of Mathematics for Applications, University of Oslo, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]