# Nonparametric Evaluation of Noisy ICA Solutions

Syamantak Kumar¹, Purnamrita Sarkar², Peter Bickel³, Derek Bean⁴
¹Department of Computer Science, UT Austin; ²Department of Statistics and Data Sciences, UT Austin; ³Department of Statistics, University of California, Berkeley; ⁴Department of Statistics, University of Wisconsin, Madison
syamantak@utexas.edu, purna.sarkar@austin.utexas.edu, bickel@stat.berkeley.edu, derekb@stat.wisc.edu

Independent Component Analysis (ICA) was introduced in the 1980s as a model for Blind Source Separation (BSS), which refers to the process of recovering the sources underlying a mixture of signals, with little knowledge about the source signals or the mixing process. While there are many sophisticated algorithms for estimation, different methods have different shortcomings. In this paper, we develop a nonparametric score to adaptively pick the right algorithm for ICA with arbitrary Gaussian noise. The novelty of this score stems from the fact that it just assumes a finite second moment of the data and uses the characteristic function to evaluate the quality of the estimated mixing matrix without any knowledge of the parameters of the noise distribution. In addition, we propose some new contrast functions and algorithms that enjoy the same fast computability as existing algorithms like FastICA and JADE but work in domains where the former may fail. While these also may have weaknesses, our proposed diagnostic, as shown by our simulations, can remedy them. Finally, we propose a theoretical framework to analyze the local and global convergence properties of our algorithms.

1 Introduction

Independent Component Analysis (ICA) was introduced in the 1980s as a model for Blind Source Separation (BSS) [13, 12], which refers to the process of recovering the sources underlying a mixture of signals, with little knowledge about the source signals or the mixing process. It has become a powerful alternative to the Gaussian model, leading to PCA in factor analysis. A good account of the history, many results, and the role of dimensionality in noiseless ICA can be found in [28, 5]. In one of its first forms, ICA is based on observing $n$ independent samples from a $k$-dimensional population:
$$x = Bz + g \quad (1)$$
where $B \in \mathbb{R}^{d\times k}$ ($d \ge k$) is an unknown mixing matrix, $z \in \mathbb{R}^k$ is a vector of statistically independent, mean-zero sources or factors, and $g \sim N(0, \Sigma)$ is a $d$-dimensional, mean-zero Gaussian noise vector with covariance matrix $\Sigma$, independent of $z$. We denote $\mathrm{diag}(a) := \mathrm{cov}(z)$ and $S := \mathrm{cov}(x) = B\,\mathrm{diag}(a)\,B^T + \Sigma$. The goal of the problem is to estimate $B$.

ICA was initially developed in the context of $g = 0$, now called noiseless ICA. In that form, it spawned an enormous literature [17, 39, 40, 1, 8, 27, 35, 37, 23, 11, 6, 42] and at least two well-established algorithms, FastICA [29] and JADE [9], which are widely used. The general model with nonzero $g$ and unknown noise covariance matrix $\Sigma$, known as noisy ICA, has developed more slowly, with its own literature [4, 50, 49, 30, 31, 41]. In the following sections, we will show that various ICA and noisy ICA methods have distinct shortcomings. Some struggle with heavy-tailed distributions and outliers, while others require approximations of entropy-based objectives, which have their own challenges (see e.g. [32]).
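To make the generative model in Eq 1 concrete, the following is a minimal Python sketch that draws samples from it. The uniform sources, the Wishart-style noise covariance, and all function and variable names here are illustrative assumptions for this sketch, not the paper's experimental setup (which is described in Section 5).

```python
import numpy as np

def sample_noisy_ica(n, k, rho=0.2, seed=0):
    """Draw n samples x = B z + g from the noisy ICA model of Eq 1 (d = k case).

    Illustrative, assumed choices: independent uniform sources with unit variance
    and a Wishart-type noise covariance Sigma = (rho / k) R R^T.
    """
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((k, k))                    # unknown mixing matrix
    Z = rng.uniform(-np.sqrt(3), np.sqrt(3), (n, k))   # independent, mean-zero, unit-variance sources
    R = rng.standard_normal((k, k))
    Sigma = (rho / k) * R @ R.T                        # unknown Gaussian noise covariance
    G = rng.multivariate_normal(np.zeros(k), Sigma, size=n)
    X = Z @ B.T + G                                    # each row is one observation of x
    return X, B, Sigma

X, B, Sigma = sample_noisy_ica(n=10_000, k=4)
S_hat = X.T @ X / len(X)  # sample covariance; estimates S = B diag(a) B^T + Sigma
```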
Although methods for noiseless cases may sometimes work in noisy settings [49], they are not always reliable (e.g., see Figure 1 in [49] and Theorem 2 in [11]). Despite the plethora of algorithms for noisy and noiseless ICA, the literature has largely been missing a diagnostic to decide which algorithm to pick for a given dataset. This is our primary goal.

The paper is organized as follows. Section 2 contains background and notation. Section 3.1 contains the independence score, its properties, and a Meta algorithm (Algorithm 1) that utilizes this score. Section 3.3 introduces new contrast functions based on the natural logarithm of the characteristic function and the moment generating function of $x^T u$. Section 4 presents the global and local convergence results for noisy ICA. Section 5 demonstrates the empirical performance of our new contrast functions as well as the Meta algorithm applied to a variety of methods for inferring $B$.

2 Background and overview

This section contains a general overview of the main concepts of ICA.

Notation: We denote vectors using bold font, as $\mathbf{v} \in \mathbb{R}^d$. The dataset is represented as $\{x^{(i)}\}_{i\in[n]}$, $x^{(i)} \in \mathbb{R}^d$. Scalar random variables and matrices are represented using upper case letters. $I_k$ denotes the $k$-dimensional identity matrix. Entries of a vector $v$ (matrix $M$) are represented as $v_i$ ($M_{ij}$). $M(:, i)$ ($M(i, :)$) represents the $i$th column (row) of matrix $M$. $M^{\dagger}$ represents the Moore-Penrose inverse of $M$. $\|\cdot\|$ represents the Euclidean $\ell_2$ norm for vectors and the operator norm for matrices; $\|\cdot\|_F$ similarly represents the Frobenius norm for matrices. $E[\cdot]$ ($\hat{E}[\cdot]$) denotes the expectation (empirical average); $x \perp y$ denotes statistical independence. $\|\cdot\|_{\psi_2}$ represents the subgaussian Orlicz norm (see Def. 2.5.6 in [46]). $X_n = O_P(a_n)$ means that $X_n/a_n$ is stochastically bounded, i.e., for every $\epsilon > 0$ there exists a finite $M$ such that $\sup_n P(|X_n/a_n| > M) \le \epsilon$ (see [45]).

Identifiability: One key issue in Eq 1 is the identifiability of $A := B^{-1}$,¹ which holds up to permutation and scaling of the coordinates of $z$ if at most one component of $z$ is Gaussian (see [14, 43]). If $d = k$ and $\Sigma$ is unknown, it is possible, as we shall see later, to identify the canonical versions of $B$ and $\Sigma$. For $d > k$ and unknown $\Sigma$, it is possible (see [15]) to reduce to the $d = k$ case. In this work, we assume $d = k$. For classical factor models, when $z$ is Gaussian, $B$ is identifiable only up to rotation and scale, leading to non-interpretability. This suggests fitting strategies focusing on recovering $B^{-1}$ through non-Gaussianity as well as independence. In the noisy model (Eq 1), an additional difficulty is that recovery of the source signals becomes impossible even if $B$ is known exactly. This is because the ICA models $x = B(z + r) + g$ and $x = Bz + (Br + g)$ are indistinguishable (see [48], pages 2-3). This is resolved in [48] via maximizing the Signal-to-Interference-plus-Noise Ratio (SINR). In classical work on ICA, a large class of algorithms (see [37]) chooses to optimize a measure of non-Gaussianity, referred to as a contrast function.

Contrast functions: Methods for estimating $B$ typically optimize an empirical contrast function. Formally, we define a contrast function $f(u|P)$ by $f : \mathcal{B}^k \times \mathcal{P} \to \mathbb{R}$, where $\mathcal{B}^k$ is the unit ball in $\mathbb{R}^k$ and $\mathcal{P}$ is essentially the class of mean-zero distributions on $\mathbb{R}^d$. More concretely, for suitable choices of $g$, $f(u|P) \equiv f(u|x) = E_{x\sim P}[g(u^T x)]$ for any random variable $x \sim P$. The most popular example of a contrast function is the kurtosis, which is defined as
$$f(u|P) = \frac{E_{x\sim P}\left[(u^T x)^4\right]}{\left(E\left[(u^T x)^2\right]\right)^2} - 3.$$
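As a concrete example, the empirical version of the kurtosis contrast above can be computed in a few lines; the function name and the use of NumPy are our own illustrative choices for this sketch.

```python
import numpy as np

def kurtosis_contrast(u, X):
    """Empirical kurtosis contrast f_hat(u|P) = E[(u^T x)^4] / (E[(u^T x)^2])^2 - 3.

    X holds one mean-zero observation per row; u is a direction in R^k.
    In population, this quantity is zero when u^T x is Gaussian.
    """
    y = X @ u
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3.0
```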
The aim is to maximize such an empirical estimate, $\hat{f}(u|P)$. Notable contrast functions are the negentropy, mutual information (used by INFOMAX [7]), the tanh function, etc. (see [37] for further details).

Fitting strategies: There are two broad categories of existing fitting strategies. (I) Find $A$ such that the components of $Ax$ are both as non-Gaussian and as independent as possible. See, for example, the multivariate cumulant-based method JADE [9] and characteristic function-based methods [21, 11]. The latter we shall call the PFICA method. (II) Find successively the $j$th row of $A$, denoted by $A(j,:) \in \mathbb{R}^d$, $j = 1, \dots, k$, such that the $A(j,:)^T x$ are independent and each is as non-Gaussian as possible; that is, estimate $A(1,:)$, project, and then apply the method again on the residuals successively. The chief representative of these methods is FastICA [27], based on univariate kurtosis.

From noiseless to noisy ICA: In the noiseless case one can first prewhiten the dataset using the empirical variance-covariance matrix $\hat{\Sigma}$ of the $x^{(i)}$, i.e., using $\hat{\Sigma}^{-1/2}$. Therefore, WLOG, $B$ can be assumed to be orthogonal. Searching over orthogonal matrices not only simplifies algorithm design for fitting strategy (I) but also makes strategy (II) meaningful. It also greatly simplifies the analysis of kurtosis-based contrast functions (see [18]). For noisy ICA, prewhitening requires knowledge of the noise covariance, and therefore many elegant methods [4, 48, 49] avoid this step.

¹If $B$ is not invertible, this can be replaced by the Moore-Penrose inverse.

Individual shortcomings of different contrast functions and fitting strategies: To our knowledge, no single ICA algorithm or contrast function works universally across all source distributions. Adaptively determining which algorithm is useful in a given setting would be an important tool in a practitioner's toolbox. Key challenges include:

Cumulants: Even though contrast functions based on even cumulants have no spurious local optima (Theorem 3, [49]), they can vanish for some non-Gaussian distributions. Our experiments (Section 5) show that kurtosis-based contrast functions can perform poorly even when there are a few independent components with zero kurtosis. They also suffer under heavy-tailed source distributions [3, 11, 26].

PFICA: PFICA can outperform cumulant-based methods in some situations (Section 5). However, it is computationally much slower, and it performs poorly for Bernoulli($p$) sources with small $p$.

FastICA: [32] show that FastICA's use of negentropy approximations for computational efficiency may result in a contrast function whose optimal directions do not correspond to directions of low entropy. tanh is another popular smooth contrast function used by FastICA. However, the authors of [52] show that it tends to return spurious solutions for some distributions.

Contrast functions are usually nonconvex and may have spurious local optima. However, they may still, under suitable conditions, converge to maximal directions. In Section 3.3 we introduce such a contrast function, which vanishes iff $x$ is Gaussian and does not require higher moments.

Our contributions: 1) Using Theorem 1, we obtain a computationally efficient scoring method for evaluating the $B$ estimated by different algorithms. Our score can also be extended to evaluate each demixing direction in sequential settings, which we do not pursue here. 2) We propose new contrast functions that work under scenarios where cumulant-based methods fail.
We propose a fast sequential algorithm, as in [49], to optimize these and further provide theoretical guarantees for convergence.

3 Main contributions

In this section, we present the Independence score and state the key result that justifies it. We conclude with two new contrast functions.

3.1 Independence score

While there are many tests for independence [22, 34, 42, 6], we choose the characteristic-function-based one because it is easy to compute and has been widely used in normality testing [19, 20, 25] and ICA [11, 21, 44]. Let us start with noiseless ICA to build intuition. Say we have estimated the inverse of the mixing matrix, i.e. $B^{-1}$, using a matrix $F$. Ideally, if $F = B^{-1}$, then $Fx = z$, where $z$ is a vector of independent random variables. Kac's theorem [33] tells us that $E[\exp(it^T z)] = \prod_{j=1}^{k} E[\exp(it_j z_j)]$ for all $t$ if and only if the $z_i$ are independent. In [21], a novel characteristic-function-based objective (CHFICA) was introduced; it is further analyzed and studied by [11] (PFICA). Here one minimizes $\left|E[\exp(it^T Fx)] - \prod_{j=1}^{k} E[\exp(it_j (Fx)_j)]\right|$ over $F$. We refer to this objective as the uncorrected independence score.

We propose to adapt the CHFICA objective to the noisy ICA setting using estimable parameters, $S$. We will minimize:
$$\Delta(t, F|P) := \underbrace{E\left[\exp(it^T Fx)\right]\exp\!\left(-\tfrac{1}{2}\, t^T \mathrm{diag}(FSF^T)\, t\right)}_{\text{JOINT}} \;-\; \underbrace{\prod_{j=1}^{k} E\left[\exp(it_j (Fx)_j)\right]\exp\!\left(-\tfrac{1}{2}\, t^T FSF^T t\right)}_{\text{PRODUCT}} \quad (2)$$
where $S = E[xx^T]$ can be estimated using the sample covariance matrix. Hence, we do not require knowledge of any model parameters. We refer to this score as the (corrected) independence score. The second factors in JOINT and PRODUCT (Eq 2) correct the original score by canceling out the additional terms resulting from the Gaussian noise, using the covariance matrix $S$ of the data (estimated via the sample covariance). See [21] for a related score requiring knowledge of the unknown Gaussian covariance matrix $\Sigma$. We are now ready to present our first theoretical result, on consistency of $\Delta(t, F|P)$.

Theorem 1. If $F \in \mathbb{R}^{k\times k}$ is invertible and the joint and marginal characteristic functions of all independent components $\{z_i\}_{i\in[k]}$ are twice-differentiable, then $\Delta(t, F|P) = 0$ for all $t \in \mathbb{R}^k$ iff $F = DB^{-1}$, where $D$ is a permutation of an arbitrary diagonal matrix.

Unfortunately, $F$ is not uniquely defined if $\Sigma$ is unknown, as we have noted in Section 1. The proof of Theorem 1 is provided in the Appendix, Section A.1. When using this score in practice, $S$ is replaced with the sample covariance matrix of the data, and the expectations are replaced by the empirical estimates of the characteristic function. The convergence of this empirical score, $\Delta(t, F|\hat{P})$, to the population score, $\Delta(t, F|P)$, is given by the following theorem.

Theorem 2. Let $\mathcal{F} := \{F \in \mathbb{R}^{k\times k} : \|F\| \le 1\}$, let $x$ be subgaussian($\sigma$), and let $C_k := \max(1, k \log(n) \mathrm{Tr}(S))$. Then, we have
$$\sup_{F\in\mathcal{F}} \left|E_{t\sim N(0,I_k)}\, \Delta(t, F|P) - E_{t\sim N(0,I_k)}\, \Delta(t, F|\hat{P})\right| = O_P\!\left(\sqrt{\frac{k^2\, \|S\|\, \max(k,\ \sigma^4 \|S\|)\, \log^2(n C_k)}{n}}\right).$$

This bound shows that, uniformly over $\mathcal{F}$, the empirical average $E_{t\sim N(0,I_k)}\Delta(t, F|\hat{P})$ is close to the population score. This guarantees that as long as the difference between the population scores of two candidate algorithms is not too small, the meta-algorithm can pick up the better score.

Remark 1 (Subgaussianity assumption). The subgaussianity assumption in Theorem 2 simply ensures the concentration of the sample covariance matrix, since it is used in the score (see Theorem A.3, line 825). It can be relaxed if concentration in the operator norm is ensured. Appendix Section A.2.1 contains additional experiments with super-Gaussian source signals.
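For concreteness, the following is a minimal sketch of the empirical corrected independence score of Eq 2, averaged over Gaussian directions $t$ as recommended later in Remark 2. The Monte Carlo size, the use of the absolute value, and all names are our own assumptions for this sketch rather than the authors' implementation.

```python
import numpy as np

def independence_score(F, X, n_dirs=50, seed=0):
    """Monte Carlo estimate of E_{t ~ N(0, I_k)} |Delta(t, F | P_hat)| from Eq 2.

    F plays the role of an estimated B^{-1}; X holds one mean-zero observation
    of x per row. S is replaced by the sample covariance, as in the text.
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    S_hat = X.T @ X / n                  # estimate of S = E[x x^T]
    Y = X @ F.T                          # rows are F x^(i)
    M = F @ S_hat @ F.T                  # F S F^T
    d = np.diag(M)                       # diagonal of F S F^T
    scores = []
    for _ in range(n_dirs):
        t = rng.standard_normal(k)
        joint = np.mean(np.exp(1j * (Y @ t))) * np.exp(-0.5 * t @ (d * t))
        prod = np.prod(np.mean(np.exp(1j * t * Y), axis=0)) * np.exp(-0.5 * t @ M @ t)
        scores.append(abs(joint - prod))
    return float(np.mean(scores))
```

Given candidate demixing matrices $F_1, \dots, F_m$ produced by different algorithms, one keeps the one with the smallest score; this is exactly how the Meta algorithm (Algorithm 1, below) uses it.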
The proof of Theorem 2 is deferred to Appendix Section A.1.1. It is important to note that $\Delta(t, F|\hat{P})$ is not easy to optimize. As we show in Section 3.2, objective functions that are computationally more amenable to optimization for ICA, e.g., cumulants, satisfy some properties (see Assumption 1). The independence score does not satisfy the first property (Assumption 1 (a)). In the noiseless case, whitening reduces the problem to $B$ being orthogonal. This facilitates efficient gradient descent on the Stiefel manifold of orthogonal matrices [11]. However, in the noisy setting, the prewhitening matrix contains the unknown noise covariance $\Sigma$, making optimization hard. Furthermore, as noted in Remark 2, in practice it is better to use $E_{t\sim N(0,I_k)}\Delta(t, F|\hat{P})$ [11, 21, 25] instead of using a fixed vector $t$ to evaluate $\Delta(t, F|\hat{P})$. Any iterative optimization method requires repeatedly estimating this expectation at each gradient step. Therefore, instead of using this score directly as the optimization objective, we use it to choose between different contrast functions after extracting the demixing matrix with each. This process is referred to as the Meta algorithm (Algorithm 1).

Algorithm 1 Meta-algorithm for choosing the best candidate algorithm.
Input: Algorithm list $L$, data $X \in \mathbb{R}^{n\times k}$
for $j$ in range$[1, \mathrm{size}(L)]$ do
  $B_j \leftarrow L_j(X)$  {Extract mixing matrix $B_j$ using the $j$th candidate algorithm}
  $\delta_j \leftarrow \hat{E}_{t\sim N(0,I_k)}\left[\Delta\left(t, B_j^{-1} \mid \hat{P}\right)\right]$
end for
$i^* \leftarrow \arg\min_{j\in[\mathrm{size}(L)]} \delta_j$
return $B_{i^*}$

Figure 1: Correlation of the independence score (with std. dev.) with the Amari error between $B' = \epsilon B + (1-\epsilon) I$ and $B$, averaged over 10 random runs.

Remark 2 (Averaging over $t$). Algorithm 1 averages the independence score over $t \sim N(0, I_k)$. While Eq 2 defines the independence score for one direction $t$, in practice there may be some directions in which a non-Gaussian signal has a small score. Hence, it is desirable to average over $t$, following the convention in [11, 21, 25].

To gain an intuition for the independence score, we conduct an experiment where we use the dataset mentioned in Section 5 (Figure 2b) and compute the independence score for $B' = \epsilon B + (1-\epsilon) I$, $\epsilon \in (0.5, 1)$. As we increase $\epsilon$, $B'$ approaches $B$, and hence the Amari error $d_{B', B}$ (see Eq 8) decreases. Figure 1 shows the independence score versus the Amari error, indicating that the independence score accurately predicts solution quality even without knowing the true $B$.

3.2 Desired properties of contrast functions

The following properties are often useful for designing provable optimization algorithms for ICA.

Assumption 1 (Properties of contrast functions). Let $f$ be a contrast function defined for a mean-zero random variable $X$. Then,
(a) $f(u|X + Y) = f(u|X) + f(u|Y)$ for $X \perp Y$;
(b) $f(0|X) = 0$, $f'(0|X) = 0$, $f''(0|X) = 0$;
(c) $f(u|G) = 0$ for all $u$, for $G \sim N(0, 1)$;
(d) WLOG, $f(u|X) = f(-u|X)$ (symmetry).

Properties (a) and (c) ensure that $f(u|G + X) = f(u|X)$ for non-Gaussian $X$ and independent Gaussian $G$, which means that additive independent Gaussian noise does not change $f$. Property (c) ensures that $|f(u|G)|$ is minimal for a Gaussian. Property (d) holds without loss of generality because one can always symmetrize the distribution.²

3.3 New contrast functions

In Section 2, we discussed the individual shortcomings of existing contrast functions. Before we introduce new contrast functions in this section, we revisit the algorithmic issues posed by the added Gaussian noise with unknown $\Sigma$ in Eq 1.
Prewhitening the data is challenging for noisy ICA because $E[xx^T]$ includes the unknown Gaussian covariance matrix $\Sigma$. [4] show that it is enough to quasi-orthogonalize the data, i.e., multiply it by a matrix that makes the data have a diagonal covariance matrix (not necessarily the identity). Subsequent work [50, 49] uses cumulants and the Hessian of $f(u)$ to construct a matrix $C := BDB^T$, where $D$ is a diagonal matrix, and then uses this matrix to achieve quasi-orthogonalization. In general, $D$ may not be positive semidefinite (e.g. when components of $z$ have kurtosis of different signs). To remedy this, in [49] the authors propose computing the $C$ matrix using the Hessian of $f(u)$ at some fixed unit vector and then performing an elegant power method in the pseudo-Euclidean space:
$$u_t \leftarrow \frac{\nabla f(C^{-1} u_{t-1}|P)}{\left\|\nabla f(C^{-1} u_{t-1}|P)\right\|} \quad (3)$$
A pseudo-Euclidean space is a generalization of the Euclidean space, used by [48]. Here the inner product between vectors $u, v$ is given as $u^T A v$, where $A$ is a symmetric, possibly indefinite matrix.

Algorithm 2 Pseudo-Euclidean Power Iteration for ICA (Algorithm 2 in [48]). $B$ is the recovered matrix for the ICA model in Eq 1. $A$ is a running estimate of $B^{\dagger}$.
Input: $C \in \mathbb{R}^{k\times k}$, contrast $f$
$A \leftarrow 0$, $B \leftarrow 0$
for $j$ in range$[1, k]$ do
  Draw $u$ uniformly at random from $S^{k-1}$
  while not converged (up to sign) do
    $u \leftarrow u - BAu$
    $u \leftarrow \nabla f(C^{-1}u)\, /\, \|\nabla f(C^{-1}u)\|_2$
  end while
  $B_j \leftarrow u$, $\quad A_j \leftarrow (C^{-1}B_j)^T / (B_j^T C^{-1} B_j)$
end for
return $B$, $A$

²Note that $y_i := x_i - x_{n/2+i}$, $i = 1, \dots, n/2$, has a symmetric distribution and also remains a noisy version of a linear combination of source signals with the same mixing matrix.

In this section, we present two new contrast functions, based on the logarithm of the characteristic function (CHF) and the cumulant generating function (CGF). These do not depend on a particular cumulant, and the first only requires a finite second moment, which makes it suitable for heavy-tailed source signals. We will use $f(u|P)$ or $f(u|x)$, where $x \sim P$, interchangeably to represent the population contrast function. In constructing both of the following contrast functions, we use the fact that, like the cumulants, the CGF- and CHF-based contrast functions satisfy Assumption 1 (a). To satisfy Assumption 1 (c), we subtract out the part resulting from the Gaussian noise, which leads to the additional terms involving $u^T S u$.

CHF-based contrast function: Recall that $S$ denotes the covariance matrix of $x$. We maximize the absolute value of the following:
$$f(u|P) = \log E[\exp(iu^T x)] + \log E[\exp(-iu^T x)] + u^T S u = \log\!\left(\left(E[\cos(u^T x)]\right)^2 + \left(E[\sin(u^T x)]\right)^2\right) + u^T S u \quad (4)$$
The intuition is that this is exactly zero for mean-zero Gaussian data, and maximizing it leads to extracting the non-Gaussian signal from the data. It also satisfies Assumption 1 (a), (b) and (c).

CGF-based contrast function: The cumulant generating function also has similar properties (see [53] for the noiseless case). In this case, we maximize the absolute value of the following:
$$f(u|P) = \log E[\exp(u^T x)] - \tfrac{1}{2} u^T S u \quad (5)$$
Like the CHF, this vanishes iff $x$ is mean-zero Gaussian, and it satisfies Assumption 1 (a), (b) and (c). However, it cannot be expected to behave well in the heavy-tailed case. In the Appendix (Section A.1, Theorem A.1), we show that the Hessian of both functions obtained from Eq 4 and Eq 5, evaluated at some $u$, is in fact of the form $BDB^T$, where $D$ is some diagonal matrix.

4 Global and local convergence: loss landscape of noisy ICA

Here we present sufficient conditions for global and local convergence of a broad class of contrast functions. This covers the cumulant-based contrast functions and our CHF- and CGF-based proposals.
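Before turning to the analysis, here is a minimal sketch of how the CHF and CGF contrasts of Eqs 4 and 5 can be evaluated empirically, together with one pseudo-Euclidean update of Eq 3. The finite-difference gradient and all names are our own simplifying assumptions (an analytic gradient, as in Theorem A.1 and Appendix A.1.2, would be used in practice), and the quasi-orthogonalization matrix $C$ is assumed to be given.

```python
import numpy as np

def chf_contrast(u, X, S_hat):
    """Empirical CHF contrast (Eq 4): log((E cos u^T x)^2 + (E sin u^T x)^2) + u^T S u."""
    y = X @ u
    return np.log(np.mean(np.cos(y)) ** 2 + np.mean(np.sin(y)) ** 2) + u @ S_hat @ u

def cgf_contrast(u, X, S_hat):
    """Empirical CGF contrast (Eq 5): log E exp(u^T x) - u^T S u / 2."""
    y = X @ u
    return np.log(np.mean(np.exp(y))) - 0.5 * u @ S_hat @ u

def pseudo_euclidean_step(u, C, X, S_hat, contrast, eps=1e-5):
    """One update of Eq 3: u <- grad f(C^{-1} u) / ||grad f(C^{-1} u)||.

    The gradient of the contrast at C^{-1} u is approximated by central differences;
    the paper maximizes |f|, and sign handling is left to the surrounding iteration.
    """
    v = np.linalg.solve(C, u)            # C^{-1} u
    grad = np.array([
        (contrast(v + eps * e, X, S_hat) - contrast(v - eps * e, X, S_hat)) / (2 * eps)
        for e in np.eye(len(u))
    ])
    return grad / np.linalg.norm(grad)
```

Iterating this step from a random unit vector, with the deflation of Algorithm 2, recovers one demixing direction at a time.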
4.1 Global convergence

The classical argument for global optima of kurtosis for noiseless ICA in [18] assumes WLOG that $B$ is orthogonal. This reduces the optimization over $u$ into one over $v = B^T u$. Since $B$ is orthogonal, $\|u\| = 1$ translates into $\|v\| = 1$. It may seem that this idea should extend to the noisy case, due to Assumption 1 (a) and (c), by optimizing over $v = B^T u$ for potentially non-orthogonal mixing matrices $B$. However, for non-orthogonal $B$, the norm constraint on $v$ is no longer $v^T v = 1$, rendering the original argument invalid. In what follows, we extend the original argument to the noisy ICA setting by introducing pseudo-Euclidean constraints. Our framework includes a large family of contrast functions, including cumulants. To our knowledge, this is the first work to characterize the loss functions in the pseudo-Euclidean space. In contrast, [50] provides global convergence of the cumulant-based methods by a convergence argument for the power method itself.

Consider the contrast function $f(u|P) := E_{x\sim P}[g(x^T u)]$ and recall the definition of the quasi-orthogonalization matrix $C = BDB^T$, where $D$ is a diagonal matrix. For simplicity, WLOG let us assume that $D$ is invertible. We now aim to find the optimization objective that leads to the pseudo-Euclidean update in Eq 3. For $f(C^{-1}u|P) = E_{x\sim P}[g(u^T C^{-1}x)]$, consider the following:
$$f(C^{-1}u|P) = \sum_{i\in[k]} E\left[g\left((B^{-1}u)_i\, z_i / D_{ii}\right)\right] = \sum_{i\in[k]} E\left[g(\alpha_i \tilde{z}_i)\right] = \sum_{i\in[k]} f(\alpha_i/D_{ii} \mid z_i) \quad (6)$$
where we define $\alpha := B^{-1}u$ and $\tilde{z}_i = z_i/D_{ii}$. We now examine $f(C^{-1}u)$ subject to the pseudo-norm constraint $u^T C^{-1}u = 1$. The key point is that, for suitably defined $f$, one can construct a matrix $C = BDB^T$ from the data even when $B$ is unknown. So our new objective is to optimize
$$f(C^{-1}u|P) \quad \text{s.t.} \quad u^T C^{-1}u = 1.$$
Using the Lagrange multiplier method, optimizing the above simply gives the power method update (Eq 3), $u \propto \nabla f(C^{-1}u)$. Furthermore, optimizing Eq 6 leads to the following transformed objective:
$$\max_{\alpha}\ \sum_i f(\alpha_i/D_{ii} \mid z_i) \quad \text{s.t.} \quad \sum_i \alpha_i^2/D_{ii} = 1 \quad (7)$$
Now we provide our theorem about maximizing $f(C^{-1}u|P)$. Analogous results hold for minimization. We will denote the corresponding derivatives evaluated at $t_0$ as $f'(t_0|z_i)$ and $f''(t_0|z_i)$.

Theorem 3. Let $C$ be a matrix of the form $BDB^T$, where $d_i := D_{ii} = f''(u_i|z_i)$ for some random $u_i$. Let $S_+ = \{i : d_i > 0\}$. Consider a contrast function $f : \mathbb{R} \to \mathbb{R}$. Assume that, for every non-Gaussian independent component, Assumption 1 holds and the third derivative $f'''(u|X)$ does not change sign on the half-lines $[0, \infty)$ and $(-\infty, 0]$. Then $f(C^{-1}u \mid x)$ with the constraint $\langle u, u\rangle_{C^{-1}} = 1$ has local maxima at $B^{-1}u = e_i$, $i \in S_+$. All solutions satisfying $e_i^T B^{-1}u \ne 0$ for all $i \in [1, k]$ are minima, and all solutions with more than one but fewer than $k$ indices $i$ such that $e_i^T B^{-1}u \ne 0$ are saddle points. These are the only fixed points.

Note that the above result immediately implies that for $C' = -C$, the same optimization in Theorem 3 will have maxima at all $u$ such that $B^{-1}u = e_i$, $i \in S_-$, where $S_- := \{i : d_i < 0\}$.

Corollary 3.1. $2k$-th cumulant-based contrast functions, for $k > 1$, have maxima and minima at the directions of the columns of $B$, provided the $\{z_i\}_{i\in[k]}$ have a corresponding non-zero cumulant.

Proof. This proof (see Appendix Section A.1) is immediate since cumulants satisfy (a) and (b). For (c) and (d), note that $f(u|Z)$ is simply $u^{2j}\kappa_{2j}(Z)$, where $\kappa_{2j}(Z)$ is the $2j$-th cumulant of $Z$.

Theorem 3 can easily be extended to the case where $C$ is not invertible. The condition on the third derivative $f'''(u|X)$ may seem strong, and it is not hard to construct examples of random variables $Z$ where it fails for suitable contrast functions (see [36]).
However, for such settings, it is hard to separate $Z$ from a Gaussian. See Appendix A.4.1 for more details.

4.2 Local convergence

In this section, we establish a local convergence result for the power-iteration-based method described in Eq 3. Define, for $t \in \mathbb{R}$ and $i \in [d]$, the functions $q_i(t) := f'\!\left(t/D_{ii} \mid z_i\right)$. Then, without loss of generality, we have the following result for the first corner:

Theorem 4. Let $\alpha_t = B^{-1}u_t$ and $\alpha^* := e_1$. Let the contrast function $f(\cdot|X)$ satisfy Assumption 1 (a), (b), (c), and (d). Let $\mathcal{B} := [-\|B^{-1}\|_2, \|B^{-1}\|_2]$ and define $c_1, c_2, c_3 > 0$ such that, for all $i \in [d]$, $\sup_{x\in\mathcal{B}} |q_i(x)| \le c_1$, $\sup_{x\in\mathcal{B}} |q_i'(x)| \le c_2$ and $\sup_{x\in\mathcal{B}} |q_i''(x)| \le c_3$. Define
$$\epsilon := \|B\|_F\, \|Be_1\|_2\, \max\left\{\frac{c_3 + c_2\|Be_1\|}{|q_1(\alpha^*_1)|},\ \frac{c_2^2}{|q_1(\alpha^*_1)|^2},\ \frac{c_3\|Be_1\|}{|q_1'(\alpha^*_1)|}\right\}.$$
Let $\|\alpha_0 - \alpha^*\|_2 \le R$, with $R \le \max\left\{\frac{c_2}{c_3},\ \frac{1}{5\epsilon}\right\}$. Then, for all $t \ge 1$, the iterates $\alpha_t$ converge geometrically to $\alpha^*$.

Therefore, we can establish linear convergence without the third-derivative constraint required for global convergence (see Theorem 3), by assuming some mild regularity conditions on the contrast functional and a sufficiently close initialization. The proof is included in Appendix Section A.1.2.

Remark 3. Let $\alpha = B^{-1}u$ denote the demixed direction $u$ in source coordinates. Theorem 4 shows that if the initial $u_0$ is such that $\alpha_0$ is close to the ground truth $\alpha^*$ (which is WLOG taken to be $e_1$), then there is geometric convergence to $\alpha^*$. $|q_1(\alpha^*_1)|$ and $|q_1'(\alpha^*_1)|$ essentially quantify how non-Gaussian the corresponding independent component is, since they are zero for a mean-zero Gaussian. When these are large quantities, $\epsilon$ is small, and the initialization radius $\|\alpha_0 - \alpha^*\|_2$ can be relatively larger.

5 Experiments

In this section, we provide experiments to compare the fixed-point algorithms based on the characteristic function (CHF) and the cumulant generating function (CGF) (Eqs 4 and 5) with the kurtosis-based algorithm (PEGI-$\kappa_4$ [49]). We also compare against noiseless ICA³ algorithms: FastICA, JADE, and PFICA. These six algorithms are included as candidates for the Meta algorithm (see Algorithm 1). We also present the Uncorrected Meta algorithm (Unc. Meta) to denote the Meta algorithm with the uncorrected independence score proposed in CHFICA [21]. Table 1 and Figure 2 provide experiments on simulated data, and Figure 3 provides an application to image demixing [16]. We provide additional experiments on denoising MNIST images in Appendix Section A.2.

³MATLAB implementations (under the GNU General Public License) are available for FastICA and JADE. The code for PFICA was provided on request by the authors.

Experimental setup: Similar to [49], the mixing matrix $B$ is constructed as $B = U\Lambda V^T$, where $U, V$ are $k$-dimensional random orthonormal matrices and $\Lambda$ is a diagonal matrix with $\Lambda_{ii} \in [1, 3]$. The covariance matrix $\Sigma$ of the noise $g$ follows the Wishart distribution and is chosen to be $\frac{\rho}{k} RR^T$, where $k$ is the number of sources and $R$ is a random Gaussian matrix. Higher values of the noise power $\rho$ can make the noisy ICA problem harder. Keeping $B$ fixed, we report the median of 100 random runs of data generated from a given distribution (different for different experiments). The quasi-orthogonalization matrix for CHF and CGF is initialized as $\hat{B}\hat{B}^T$ using the mixing matrix $\hat{B}$ estimated via PFICA. The performance of CHF and CGF is based on a single random initialization of the vector in the power method (see Algorithm 2). Our experiments were performed on a Macbook Pro M2 2022 CPU with 8 GB RAM.

Error Metrics: Due to the inherent ambiguity in signal recovery discussed in Section 2, we measure error using the accuracy of estimating $B$.
We report the widely used Amari error [2, 24, 11, 10] for our results. For an estimated mixing matrix $\hat{B}$, define $W = \hat{B}^{-1}B$, after normalizing the rows of $\hat{B}^{-1}$ and $B^{-1}$. Then we measure $d_{\hat{B}, B}$ as
$$d_{\hat{B}, B} := \frac{1}{2k}\sum_{i=1}^{k}\left(\frac{\sum_{j=1}^{k}|W_{ij}|}{\max_j |W_{ij}|} - 1\right) + \frac{1}{2k}\sum_{j=1}^{k}\left(\frac{\sum_{i=1}^{k}|W_{ij}|}{\max_i |W_{ij}|} - 1\right) \quad (8)$$

Variance reduction using Meta: We first show that the independence score can also be used to pick the best solution from many random initializations. In Figure 2c, the top panel has a histogram of 40 runs of CHF, each with one random initialization. The bottom panel shows the histogram of 40 experiments where, for each experiment, the best out of 30 random initializations is picked using the independence score. The top and bottom panels have (mean, standard deviation) of (0.51, 0.51) and (0.39, 0.34), respectively. This shows a reduction in variance and overall better performance.

Effect of varying kurtosis: For our second experiment (see Table 1), we use $k = 5$ independent components drawn from a Bernoulli($p$) distribution, sample size $n = 10^5$ and noise power $\rho = 0.2$, where we vary $p$ from 0.001 to $0.5 - 1/\sqrt{12}$. The last value makes the kurtosis zero.

| Algorithm / Scaled $\kappa_4$ | 994 | 194 | 95 | 15 | 5 | 2 | 0.8 | 0.13 | 0 |
|---|---|---|---|---|---|---|---|---|---|
| Meta | 0.007 | 0.010 | 0.011 | 0.010 | 0.011 | 0.011 | 0.0128 | 0.01981 | 0.023 |
| CHF | 1.524 | 0.336 | 0.011 | 0.010 | 0.011 | 0.011 | 0.0129 | 0.0213 | 0.029 |
| CGF | 0.007 | 0.011 | 0.011 | 0.016 | 0.029 | 0.044 | 0.05779 | 0.06521 | 0.071 |
| PEGI | 0.007 | 0.010 | 0.011 | 0.010 | 0.012 | 0.017 | 0.02795 | 0.13097 | 1.802 |
| PFICA | 1.525 | 0.885 | 0.540 | 0.024 | 0.023 | 0.023 | 0.0212 | 0.0224 | 0.024 |
| JADE | 0.021 | 0.022 | 0.021 | 0.022 | 0.022 | 0.023 | 0.029 | 0.089 | 1.909 |
| FastICA | 0.024 | 0.027 | 0.026 | 0.026 | 0.026 | 0.027 | 0.02874 | 0.0703 | - |
| Unc. Meta | 1.52 | 0.049 | 0.041 | 0.0408 | 0.0413 | 0.0414 | 0.0413 | 0.0416 | 0.0419 |

Table 1: Median Amari error with varying $p$ in Bernoulli($p$) data. The scaled kurtosis is given as $\kappa_4 := (1 - 6p(1-p))/(p(1-p))$. Observe that the Meta algorithm (first row) performs at par with or better than the best candidate algorithm in each column. FastICA did not converge for zero-kurtosis data.

Different algorithms perform differently for different $p$. In particular, characteristic-function-based methods like PFICA and CHF perform poorly for small values of $p$. We attribute this to the fact that the characteristic function is close to one for small $p$. Kurtosis-based algorithms like PEGI, JADE, and FastICA perform poorly for kurtosis close to zero. Furthermore, the Uncorrected Meta algorithm performs worse than the Meta algorithm since it shadows PFICA.

Effect of varying noise power: For our next two experiments, we use $k = 9$ components, out of which 3 are from a $t$ distribution with 3 degrees of freedom ($t_3$), 3 are from Exponential(5), and 3 are from Bernoulli$\left(\tfrac{1}{2} - \sqrt{\tfrac{1}{12}}\right)$ and hence have zero kurtosis. In these experiments, we vary the noise power $\rho$ (Figure 2a), fixing $n = 10^5$, and the sample size $n$ (Figure 2b), fixing $\rho = 0.2$. Note that with most mixture distributions it is easily possible to have low or zero kurtosis. We include such signals in our data to highlight some limitations of PEGI-$\kappa_4$ and show that Algorithm 1 can choose adaptively to obtain better results.

Figure 2: Amari error on a log scale (y axis) with varying noise power (for $n = 10^5$) and varying sample size (for $\rho = 0.2$) on the x axis in panels 2a and 2b, respectively. (a) Variation of performance with noise power; (b) variation of performance with sample size; (c) histograms of Amari error with $n = 10^4$ and noise power $\rho = 0.2$: the top panel contains a histogram of 40 runs with one random initialization, and the bottom panel contains a histogram of 40 runs, each of which is the best independence score out of 30 random initializations.

Figure 2a shows that large noise power leads to worse performance for all methods. PEGI, JADE, and FastICA perform poorly consistently because of the zero-kurtosis components. CGF suffers because of the heavy-tailed components. However, CHF and PFICA perform well consistently. The Meta algorithm mostly picks the best algorithm except for a few points, where the difference between the two leading algorithms is small. The Uncorrected Meta algorithm (which uses the independence score without the additional correction term introduced for noise) performs identically to PFICA.

Effect of varying sample size: Figure 2b shows that the noiseless ICA methods have good performance at smaller sample sizes. However, they suffer a bias compared to the methods designed for noisy ICA as the sample size increases. The Meta algorithm can consistently pick the best option amongst the candidates, irrespective of the distribution, leading to significant improvements in performance. What is interesting is that up to a sample size of $10^4$, PFICA dominates other algorithms and Meta performs like PFICA. However, after that CHF dominates, and one can see that Meta starts to have a similar trend as CHF. We also see that the Uncorrected Meta algorithm (which uses the independence score without the additional correction term introduced for noise) has a near-identical error to PFICA and has a bias for large $n$.

Figure 3: (a) Source images; (b) mixed images; (c) demixed images using the CHF-based contrast function ($\Delta(t, F|P) = 5.26 \times 10^{-3}$); (d) demixed images using the kurtosis-based contrast function ($\Delta(t, F|P) = 2.48 \times 10^{-2}$). We demix images using ICA by flattening and linearly mixing them with a $4 \times 4$ matrix $B$ (i.i.d. entries $N(0,1)$) and Wishart noise ($\rho = 0.001$). The CHF-based method (c) recovers the original sources well, up to sign. The kurtosis-based method (d) fails to recover the second source. This is consistent with its higher independence score. The Meta algorithm selects CHF from the candidates CHF, CGF, Kurtosis, FastICA, and JADE. Appendix Section A.2 provides results for other contrast functions and their independence scores.

6 Conclusion

ICA is a classical problem that aims to extract independent non-Gaussian components. The vast literature on both noiseless and noisy ICA introduces many different inference methods based on a variety of contrast functions for separating the non-Gaussian signal from the Gaussian noise. Each has its own set of shortcomings. We aim to identify the best method for a given dataset in a data-driven fashion. In this paper, we propose a nonparametric score, which is used to evaluate the quality of the solution of any inference method for the noisy ICA model. Using this score, we design a Meta algorithm, which chooses among a set of candidate solutions of the demixing matrix. We also provide new contrast functions and a computationally efficient optimization framework. While they also have shortcomings, we show that our diagnostic can remedy them. We provide uniform convergence properties of our score and theoretical results for local and global convergence of our methods. Simulated and real-world experiments show that our Meta-algorithm matches the accuracy of the best candidate across various distributional settings.
Acknowledgments and Disclosure of Funding The authors thank Aiyou Chen for many important discussions that helped shape this paper and for sharing his code of PFICA. The authors also thank Joe Neeman for sharing his valuable insights and the anonymous reviewers for their helpful suggestions in improving the exposition of the paper. SK and PS were partially supported by NSF grants 2217069, 2019844, and DMS 2109155. [1] Shun-Ichi Amari and J.-F. Cardoso. Blind source separation-semiparametric statistical approach. IEEE Transactions on Signal Processing, 45(11):2692 2700, 1997. [2] Shun-ichi Amari, Andrzej Cichocki, and Howard Yang. A new learning algorithm for blind signal separation. Advances in neural information processing systems, 8, 1995. [3] Joseph Anderson, Navin Goyal, Anupama Nandi, and Luis Rademacher. Heavy-tailed analogues of the covariance matrix for ica. In Satinder P. Singh and Shaul Markovitch, editors, AAAI, pages 1712 1718. AAAI Press, 2017. [4] Sanjeev Arora, Rong Ge, Ankur Moitra, and Sushant Sachdeva. "provable ica with unknown gaussian noise, with implications for gaussian mixtures and autoencoders". In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 2384 2392, 2012. [5] Arnab Auddy and Ming Yuan. Large dimensional independent component analysis: Statistical optimality and computational tractability, 2023. [6] F.R. Bach and M.I. Jordan. Kernel independent component analysis. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP 03)., volume 4, pages IV 876, 2003. [7] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129 1159, 1995. [8] Jean-François Cardoso. High-order contrasts for independent component analysis. Neural computation, 11(1):157 192, 1999. [9] J.F. Cardoso and A. Souloumiac. Blind beamforming for non-gaussian signals. IEE Proceedings F (Radar and Signal Processing), 140:362 370(8), December 1993. [10] Aiyou Chen and Peter J. Bickel. Efficient independent component analysis. The Annals of Statistics, 34(6):2825 2855, 2006. [11] Aiyou Chen and P.J. Bickel. Consistent independent component analysis and prewhitening. IEEE Transactions on Signal Processing, 53(10):3625 3632, 2005. [12] Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287 314, 1994. [13] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press, 2010. [14] George Darmois. Analyse générale des liaisons stochastiques: etude particulière de l analyse factorielle linéaire. Revue de l Institut international de statistique, pages 2 8, 1953. [15] M. Davies. Identifiability issues in noisy ica. IEEE Signal Processing Letters, 11(5):470 473, 2004. [16] Laurent de Vito. Ica for demixing images. https://github.com/ldv1/ICA_for_ demixing_images, 2024. Accessed: 2024-05-20. [17] N. Delfosse and P. Loubaton. Adaptive separation of independent sources: a deflation approach. In Proceedings of ICASSP 94. IEEE International Conference on Acoustics, Speech and Signal Processing, volume iv, pages IV/41 IV/44 vol.4, 1994. [18] Nathalie Delfosse and Philippe Loubaton. Adaptive blind separation of convolutive mixtures. In ICASSP, pages 2940 2943. IEEE Computer Society, 1996. [19] Bruno Ebner and Norbert Henze. Bahadur efficiencies of the epps pulley test for normality. 
ar Xiv preprint ar Xiv:2106.13962, 2021. [20] Bruno Ebner and Norbert Henze. On the eigenvalues associated with the limit null distribution of the epps-pulley test of normality. Statistical Papers, 64(3):739 752, 2023. [21] Jan Eriksson and Visa Koivunen. Characteristic-function-based independent component analysis. Signal Processing, 83(10):2195 2208, 2003. [22] Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J. Smola. A kernel statistical test of independence. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS 07, page 585 592, Red Hook, NY, USA, 2007. Curran Associates Inc. [23] Trevor Hastie and Rob Tibshirani. Independent components analysis through product density estimation. Advances in neural information processing systems, 15, 2002. [24] Trevor Hastie and Rob Tibshirani. Independent components analysis through product density estimation. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS 02, page 665 672, Cambridge, MA, USA, 2002. MIT Press. [25] Norbert Henze and Thorsten Wagner. A new approach to the bhep tests for multivariate normality. Journal of Multivariate Analysis, 62(1):1 23, 1997. [26] A. Hyvarinen. One-unit contrast functions for independent component analysis: a statistical analysis. In Neural Networks for Signal Processing VII. Proceedings of the 1997 IEEE Signal Processing Society Workshop, pages 388 397, 1997. [27] Aapo Hyvarinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE transactions on Neural Networks, 10(3):626 634, 1999. [28] Aapo Hyvärinen, J. Karhunen, and Erkki Oja. Independent component analysis. John Wiley & Sons, 2001. [29] Aapo Hyvärinen and Erkki Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483 1492, 1997. [30] Pauliina Ilmonen and Davy Paindaveine. Semiparametrically efficient inference based on signed ranks in symmetric independent component models. Annals of Statistics, 39:2448 2476, 2011. [31] Pauliina Ilmonen and Davy Paindaveine. Semiparametrically efficient inference based on signed ranks in symmetric independent component models. 2011. [32] Elena Issoglio, Paul Smith, and Jochen Voss. On the estimation of entropy in the fastica algorithm. Journal of Multivariate Analysis, 181:104689, 2021. [33] M. Kac. Sur les fonctions indépendantes (i) (propriétés générales). Studia Mathematica, 6:46 58, 1936. [34] Sergey Kirshner and Barnabás Póczos. Ica and isa using schweizer-wolff measure of dependence. In Proceedings of the 25th International Conference on Machine Learning, ICML 08, page 464 471, New York, NY, USA, 2008. Association for Computing Machinery. [35] Te-Won Lee, Mark Girolami, and Terrence J Sejnowski. Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural computation, 11(2):417 441, 1999. [36] Peter Mc Cullagh. Does the moment-generating function characterize a distribution? The American Statistician, 48(3):208 208, 1994. [37] Erkki Oja and A Hyvarinen. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411 430, 2000. [38] Erkki Oja, Aapo Hyvärinen, and Patrik Hoyer. Image feature extraction and denoising by sparse coding. Pattern Analysis & Applications, 2:104 110, 1999. [39] Dinh Tuan Pham. Blind separation of instantaneous mixture of sources via an independent component analysis. 
IEEE Transactions on Signal Processing, 44(11):2768 2779, 1996. [40] Dinh Tuan Pham and P. Garat. Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing, 45(7):1712 1725, 1997. [41] Karl Rohe and Muzhe Zeng. Vintage factor analysis with varimax performs statistical inference. ar Xiv preprint ar Xiv:2004.05387, 2020. [42] Hao Shen, Stefanie Jegelka, and Arthur Gretton. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57(9):3498 3511, 2009. [43] Viktor P Skitovitch. On a property of the normal distribution. DAN SSSR, 89:217 219, 1953. [44] Fabian J. Theis. Multidimensional independent component analysis using characteristic functions. In 2005 13th European Signal Processing Conference, pages 1 4, 2005. [45] Aad W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000. [46] Roman Vershynin. High-Dimensional Probability. Cambridge University Press, Cambridge, UK, 2018. [47] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. [48] James R. Voss, Mikhail Belkin, and Luis Rademacher. Optimal recovery in noisy ICA. Co RR, abs/1502.04148, 2015. [49] James R Voss, Mikhail Belkin, and Luis Rademacher. A pseudo-euclidean iteration for optimal recovery in noisy ica. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. [50] James R Voss, Luis Rademacher, and Mikhail Belkin. Fast algorithms for gaussian noise invariant independent component analysis. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. [51] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. [52] Tianwen Wei. A study of the fixed points and spurious solutions of the deflation-based fastica algorithm. Neural Comput. Appl., 28(1):13 24, jan 2017. [53] Arie Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897 902, 2000. The Appendix is organized as follows - Section A.1 proves Theorems 1, 2, A.1, 3 and 4 Section A.2 provides additional experiments for noisy ICA Section A.3 provides the algorithm to compute independence scores via a sequential procedure Section A.4 explores the third derivative constraint in Theorem 3 and provides examples where it holds Section A.5 contains surface plots of the loss landscape for noisy ICA using the CHF-based contrast functions Proof of Theorem 1. We will give an intuitive argument for the easy direction for k = 2. We will show that if F = DB 1 for some permutation of a diagonal matrix D, then after projecting using that matrix, the data is of the form z + g for some Gaussian vector g . Let k = 2. Let the entries of this vector be z1 + g 1 and z2 + g 2, where zi, i {1, 2} are mean zero independent non-Gaussian random variables and g i, i {1, 2} are mean zero possibly dependent Gaussian variables. The Gaussian variables are independent of the non-Gaussian random variables. Let ti, i {1, 2} be arbitrary but fixed real numbers. Let t = (t1, t2) and let Σg denote the covariance matrix of the Gaussian g . Assume for simplicity var(zi) = 1 for i {1, 2}. Denote by Λ := cov(Fx) = FSF T = I + Σg . 
The JOINT part of the score is given by: JOINT = E [it1z1] E [it2z2] exp t T (Σg + diag(Λ))t The PRODUCT part of the score is given by PRODUCT = E [it1z1] E [it2z2] exp t T (diag(Σg ) + Λ)t It is not hard to see that JOINT equals PRODUCT. The same argument generalizes to k > 2. We next provide proof for the harder direction here. Suppose that F (t|P) = 0, i.e, E exp(it T Fx) exp t T diag FSF T t 2 j=1 E [exp(itj(Fx)j)] exp t T FSF Tt We then prove that F must be of the form DB 1 for D being a permutation of a diagonal matrix. Then, taking the logarithm, we have ln E exp(it T Fx) j=1 ln (E [exp(itj(Fx)j)]) = 1 2t T diag FSF T FSF T t 2t T diag FΣF T FΣF T t Under the ICA model we have, x = Bz + g. Therefore, from Eq A.9, using the definition of the characteristic function of a Gaussian random variable and noting that cov (g) = Σ, we have ln E exp(it T Fx) = ln E exp(it T FBz) + ln E exp(it T Fg) = ln E exp(it T FBz) 1 2t T FΣF Tt and similarly, j=1 ln (E [exp(itj(Fx)j)]) = j=1 ln (E [exp(itj(FBz)j)]) 1 2t T diag FΣF T t Therefore, from Eq A.9, g F (t) := ln E exp(it T FBz) j=1 ln (E [exp(itj(FBz)j)]) must be a quadratic function of t. It is important to note that the second term is an additive function w.r.t t1, t2, tk. For simplicity, consider the case of k = 2 and assume that the joint and marginal characteristic functions of all signals are twice differentiable. Consider 2(g F(t)) t1 t2 , then 2 (g F (t)) t1 t2 const. (A.10) To simplify the notation, let ψj(t) := E eitzj , fj log ψj, M := FB. The above is then equivalent to k=1 M1k M2kf k (M1kt1 + M2kt2) const. (A.11) Note that M is invertible since F, B are invertible. Therefore, let s M T t, then k=1 M1k M2kf k (sk) const. which implies that either M1k M2k = 0 or f k (sk) const, for k = 1, 2. For any k {1, 2}, since zk is non-Gaussian, its logarithmic characteristic function, i.e. fk cannot be a quadratic function, so M1k M2k = 0 which implies that each column of M has a zero. Since M is invertible, thus M is either a diagonal matrix or a permutation of a diagonal matrix. Now to consider k > 2, we just need to apply the above k = 2 argument to pairwise entries of {t1, , tk} with other entries fixed. Hence proved. Proof of Theorem 3. We start with considering the following constrained optimization problem - sup u T C 1u=1 f C 1u|P By an application of lagrange multipliers, for the optima uopt, we have νC 1uopt = C 1 f (C 1)Tuopt|P , u T opt C 1uopt = 1 with multiplier ν R. This leads to a fixed-point iteration very similar to the one in [49], which is the motivation for considering such an optimization framework. We now introduce some notations to simplify the problem. Let u = Bα, z i := zi di and a i := var(z i) = ai d2 i , where ai := var(zi). Then, sup u T C 1u=1 f C 1u|P = sup αT D 1α=1 f D 1α|z We will use f(u|P) or f(u|x) where x P interchangeably. By Assumption 1-(a), we see that f D 1α|z = Pk i=1 f αi di |zi . For simplicity of notation, we will use hi(αi/di) = f (αi/di|zi), since the functional form can be different for different random variables zi. Therefore, we seek to characterize the fixed points of the following optimization problem - i=1 hi(αi/di) (A.12) α2 i di = 1, di = 0 i [k] where α, d Rk. We have essentially moved from optimizing in the ., . C 1 space to the ., . D 1 space. We find stationary points of the Lagrangian The components of the gradient of L(α, λ) w.r.t α are given as αj L(α, λ) = 1 At the fixed point (α, λ) we have, By Assumption 1-(b), for αi = 0, the above is automatically zero. 
However, for {j : αj = 0}, we have, So, for all {j : αj = 0}, we must have the same value and sign of 1 αj h j αj . By assumption in the theorem statement, h i (x) does not change sign on the half-lines x > 0 and x < 0. Along with Assumption 1-(b), this implies that x [0, ), sgn(hi(x)) = sgn(h i(x)) = sgn(h i (x)) = sgn(h i (x)) = κ1 (A.15) x ( , 0], sgn(hi(x)) = sgn(h i(x)) = sgn(h i (x)) = sgn(h i (x)) = κ2 (A.16) where κ1, κ2 { 1, 1} are constants. Furthermore, we note that Assumption 1-(d) ensures that hi(x) is a symmetric function. Therefore, x R, sgn(hi(x)) = sgn(h i (x)) = κ, κ { 1, 1}. Then, since di = h i (ui), sgn (di) = sgn (hi(x)) , x R (A.17) Now, using A.14, sgn(λ) = sgn h j sgn(αj), using Eq A.15 and A.16 = sgn (dj) sgn hj = 1, using Eq A.17 (A.18) Keeping this in mind, we now compute the Hessian, H Rk k, of the lagrangian, L(α, λ) at the fixed point, (α, λ). Recall that, we have h i(0) = 0 and h i (0) = 0. Thus, for {i : αi = 0}, Hii = λ/di (A.19) This implies that sgn (di Hii) = sgn (λ) = 1 for {i : αi = 0}, using Eq A.18. For {i : αi = 0}, we have αi αj L(α, λ) α = 1(i = j) d2 i h i αi , for αi = 0, using A.14 = 1(i = j) 1 , for αi = 0 We consider the pseudo inner product space ., . D 1 for optimizing α. Furthermore, since we are in this pseudo-inner product space, we have v, Hv D 1 = v D 1Hv So we will consider the positive definite-ness of the matrix H := D 1H to characterize the fixed points. Recall that for a differentiable convex function f, we have x, y dom(f) Rn f(y) f(x) + f(x)T (y x) Therefore, for {i : αi = 0}, we have sgn (di Hii) = sgn (di) sgn (diαi) sgn αi = sgn (αi) sgn h i , using convexity/concavity of h (.) = sgn (αi) sgn h i , using A.15 and A.16 = sgn (αi) sgn (λ) sgn (αi) , using A.14 = sgn (λ) = 1, using Eq A.17 (A.20) Let S := {i : αi = 0}. We then have the following cases - 1. Assume that only one αi is nonzero, then v orthogonal to this direction, v, Hv D 1 < 0 by Eq A.19. Thus this gives a local maxima. 2. Assume more than one αi are nonzero, but |S|< k. Then we have for i S, Hjj < 0 using A.19. For i S, Hii > 0 from Eq A.20. Hence these are saddle points. 3. Assume we have S = [k], i.e. i, αi = 0. In this case, H 0. So, we have a local minima. This completes our proof. Theorem A.1. Consider the data generated from the noisy ICA model (Eq 1). If f(u|P) is defined as in Eq 4 or Eq 5, we have 2f(u|P) = BDu BT , for some diagonal matrix Du, which can be different for the differerent contrast functions. Remark 4. The above theorem is useful because it shows that the Hessian of the contrast functions based on the CHF (eq 4 or CGF (eq 5) is of the form BDBT , where D is some diagonal matrix. This matrix can be precomputed at some vector u and used as the matrix C in the power iteration update (see eq 3) in Algorithm 2 of [49]. Proof of Theorem A.1. First, we show that the CHF-based contrast function (see Eq 4), the Hessian, is of the correct form. CHF-based contrast function f(u|P) = log E exp(iu T Bz) + log E exp( iu T Bz) + u T B diag(a)BTu, where diag(a) is the covariance matrix of z. For simplicity of notation, define ϕ(z;u) := E exp(iu T Bz) . f(u|P) = i B E exp(iu T Bz)z ϕ(z;u) | {z } µ(u;z) i E exp( iu Tz)z + 2B diag(a)BTu H(u;z) := E exp(iu T Bz)zz T ϕ(z;u) µ(u;z)µ(u;z)T . Taking a second derivative, we have: 2f(u|P) = B ( H(u;z) H(z; u) + 2 diag(a)) BT All we have to show at this point is that H(u;z) is diagonal. We will evaluate the k, ℓentry. Let Bi denote the ith column of B. Let Y\k,ℓ:= u T B P j =k,ℓzj. 
Using independence of the components of z, we have, for k = ℓ, E exp(iu T Bkzk + iu T Bℓzℓ+ Y\kℓ)zkzℓ ϕ(z;u) = E exp(iu T Bkzk)zk E exp(iu T Bℓzℓ)zℓ E [exp(iu T Bkzk)] E [exp(iu T Bℓzℓ)] (A.21) Now we evaluate: e T k µ(u;z) := E exp(iu T Bz)zk ϕ(z;u) = E exp(iu T Bkzk)zk E [exp(iu T Bkzk)] (A.22) Using Eqs A.21 and A.22, we see that indeed H(u;z) is diagonal. Now, we prove the statement about the CGF-based contrast function. CGF-based contrast function f (u|P) = E exp u T Bz Bz E [exp (u T Bz)] B diag (a) BTu, 2f (u|P) = E exp u T Bz Bzz T BT E [exp (u T Bz)] E exp u T Bz Bz E exp u T Bz z T BT E [exp (u T Bz)]2 B diag (a) BT E exp u T Bz zz T E [exp (u T Bz)] E exp u T Bz z E exp u T Bz z T E [exp (u T Bz)]2 | {z } Covariance of z P(z). exp u T Bz E [exp (u T Bz)] The new probability density P (z) . exp(u T Bz) E[exp(u T Bz)] is an exponential tilt of the original pdf, and since {zi}d i=1 are independent, and the new tilted density also factorizes over the zi s, therefore, the covariance under this tilted density is also diagonal. A.1.1 Uniform convergence In this section, we provide the proof for Theorem 2. First, we state some preliminary results about the uniform convergence of smooth function classes which will be useful for our proofs. A.1.1.1 Preliminaries Let D := supθ, θ T ρX θ, θ denote the diameter of set T, and let NX (δ; T) denote the δ-covering number of T in the ρX metric. Then, we have the following standard result: Proposition A.2. (Proposition 5.17 from [51]) Let {Xθ, θ T} be a zero-mean subgaussian process with respect to the metric ρX. Then for any δ [0, D] such that NX (δ; T) 10, we have sup γ,γ T;ρX(γ,γ ) δ D2 log (NX (δ; T)) Remark 5. For zero-mean subgaussian processes, Proposition A.2 implies, E [supθ T Xθ] E h supθ, θ T Xθ X θ i Proposition A.3. (Theorem 4.10 from [51]) For any b-uniformly bounded class of functions C, any positive integer n 1, any scalar δ 0, and a set of i.i.d datapoints X(i) i [n] we have i=1 f X(i) E [f (X)] i=1 ϵif X(i) with probability at least 1 exp nδ2 2b2 . Here {ϵi}i [n] are i.i.d rademacher random variables. A.1.1.2 Proof of theorem 2 For dataset x(i) i [n] ,x(i) Rk, consider the following definitions: ϕ(t, F|P) := Ex exp(it T Fx) , ϕ(t, F| ˆP) := 1 j=1 exp(it T Fx(j)) (A.23) ψ(t, F|P) := j=1 Ex [exp(itj(Fx)j)] , ψ(t, F| ˆP) := 1 r=1 exp(itj(Fx(r))j) (A.24) Let (t, F| ˆP) be the empirical version where the expectation is replaced by sample averages. Now define: (t, F|P) = ϕ(t, F|P) exp t T diag(FSF T )t ψ(t, F|P) exp t T FSF Tt (t, F| ˆP) = ϕ(t, F| ˆP) exp t T diag(F b SF T )t ψ(t, F| ˆP) exp t T F b SF Tt Theorem A.4. Let F := {F Rk k : F 1}. Assume that x subgaussian(σ). We have: sup F F |Et N(0,Ik) (t, F|P) Et N(0,Ik) (t, F| ˆP)|= OP k2 S max(k, σ4 S ) log2(n Ck) where Ck := max(1, k log (n) Tr(S)). Proof. Define E = {t : t 2k log n}. We will use the fact that P(Ec) 2/n (Example 2.12, [51]). Also note that by construction, (t, F|P) 1. Observe that: sup F F |Et N(0,Ik) (t, F|P) Et N(0,Ik) (t, F| ˆP)| sup F F Et N(0,Ik)| (t, F|P) (t, F| ˆP)| Et N(0,Ik) sup F F | (t, F|P) (t, F| ˆP)| (t, F|P) (t, F| ˆP) E + Et N(0,Ik) sup F F | (t, F|P) (t, F| ˆP) |Ec P(Ec) (t, F|P) (t, F| ˆP) k2 S max(k, σ4 S ) log2(n Ck) + 2/n, using Theorem A.5 The second inequality follows from Jensen s inequality, the fourth follows from E[sup(.)] sup(E[.]). Theorem A.5. Let F := {F Rk k : F 1}. Assume that x subgaussian(σ). We have: sup F F | (t, F|P) (t, F| ˆP)|= OP k S max(k, σ4 S ) log(n Ct) where Ct := max(1, t 2Tr(S)). 
(t, F|P) (t, F| ˆP) ϕ(t, F|P) ϕ(t, F| ˆP) exp t T diag(FSF T )t + ϕ(t, F| ˆP) exp t T diag(F b SF T )t exp t T diag(FSF T )t + ψ(t, F|P) ψ(t, F| ˆP) exp t T FSF Tt + ψ(t, F| ˆP) exp t T F b SF Tt exp t T FSF Tt Finally, for some S = λS + (1 λ) ˆS, |exp( t T FSF Tt) exp( t T F ˆSF Tt)| = D S exp( t T FSF Tt) S , ˆS S E exp( t T FS F Tt) t T F(S ˆS)F Tt t 2 S ˆS = t 2OP where the last result follows from Theorem 4.7.1 in [47]. Now, for the second term, observe that: ϕ(t, F| ˆP) exp t T diag(F b SF T )t exp t T diag(FSF T )t exp t T diag(FSF T )t exp t T diag(F(b S S)F T )t 1 (A.25) We next have that, with probability at least 1 1/n, diag F(S b S)F T 2 F(S b S)F T 2 C Thus with probability at least 1 1/n, using the inequality |1 ex| 2|x|, x [ 1, 1], Eq A.25 leads to: ϕ(t, F| ˆP) exp t T diag(F b SF T )t exp t T diag(FSF T )t 2C t 2 Note that the matrices diag F(S b S)F T and FSF T are positive semi-definite and t is unit-norm. Observe that, using Lemmas A.6 and A.7, we have: sup F F | (t, F|P) (t, F| ˆP)| k Tr(S) log (n Ct) k S max(k, σ4 S ) log(n Ct) Lemma A.6. Define F := {F Rk k : F 1}. Let x(i) i [n] be i.i.d samples from a subgaussian(σ) distribution. We have: j=1 exp(it T Fx(j)) Ex exp(it T Fx) = OP k Tr(S) log (n Ct) where Ct := max(1, t 2Tr(S)). Proof. Let ϕ(t, F;x(i)) = exp(it T Fx(i)). Next, we note that F 2 1 imply F Tt t . Therefore, j=1 exp(it T Fx(j)) Ex exp(it T Fx) sup u Rk, u 1 j=1 exp(iu Tx(j)) Ex exp(iu Tx) Let f X(u) = P i ϵiξ(u;x(i)). We will argue for unit vectors u, and later scale them by t . Define Bk := u Rk, u 1 , ξ(u;x) := exp(iu Tx) and let d(u,u )2 := X i (ξ(u;x(i)) ξ(u ;x(i)))2 Then we have, u exp(iu Tx) 2 = x | sin(u Tx) + i cos(u Tx)| x Then, ξ(u;x(i)) ξ(t;x(i)) min u ξ(u;x(i)) 2 u u , 2 min x(i) 2 u u , 2 Let b S := P i x(i) x(i) T /n and τn := n Tr b S . We next have, D2 := max u,u d(u,u )2 min 4n, max u,u X i x(i) 2 2 u u 2 ) 4 min {n, τn} Next, we bound the covering number N(δ; Bk, du). Note that N(δ; Bk, du) N(δ/ τn; Bk, . 2) since, d (u,u )2 = X i (ξ(u;x(i)) ξ(u ;x(i)))2 i x(i) 2 2 u u 2 i Tr x(i) x(i) T u u 2 Using Cauchy-Schwarz, we have: |f X(u) f X(u )| i=1 ϵi|ξ(u;x(i)) ξ(u;x(i))| p Therefore, we have N(δ; Bk, du) 3 τn δ k . Therefore, using Proposition A.2, sup F F |f X(u) f X(u )| 2Ex sup d(u,u ) δ |f X(u) f X(u )| + 4Ex h D p log N(δ; Bk, du) i min {n, τn} 2k log 3 τn (k min(n, τn) log (max(τn, n)) , setting δ := p min(1, τn/n) min(1, Tr(S)) log (n max(1, Tr(S))) , Cauchy-Schwarz inequality Now we recall that F Tt t , since all vectors t appear as t T F Tx(i), this simply leads to scaling τ and S by t 2. Dividing both sides by n, we have sup F F |f X(t) f X(t )| min(1, t 2Tr(S)) log (n max(1, t 2Tr(S))) Thus, using Proposition A.3 and using the definition of Ct, we have, sup F F |ϕ(t, F| ˆP) ϕ(t, F|P)|= t 2Tr(S) log (n Ct) = t Tr(S) log (n Ct) Lemma A.7. Let F = {F Rk k : F 1}. We have: ψ(t, F| ˆP) ψ(t, F|P) OP Proof. Let Bk := u Rk, u 1 Now define ψj(t, F;x) = E[exp(itj F T j x)]. Also define f (j) X (F) = 1 i ϵiψj(t, F;x(i)). Thus, using the same argument as in Lemma A.6 for t t ej and x(i) tjx(i), sup F F |ψj(t, F| ˆP) ψj(t, F|P)|= OP Finally, we see that: ψ(t, F| ˆP) ψ(t, F|P) X The above is true because, for |ai|, |bi| 1, i = 1, . . . , k, i j bi(aj+1 bj+1) j=0 |aj+1 bj+1| A.1.2 Local convergence In this section, we provide the proof of Theorem 4. Recall the ICA model from Eq 1, x = Bz + g, diag (a) := cov (z) , Σ := cov (g) = B diag (a) BT + Σ We provide the proof for the CGF-based contrast function. 
The proof for the CHF-based contrast function follows similarly. From the Proof of Theorem A.1 we have, f (u|P) = E exp u T Bz Bz E [exp (u T Bz)] B diag (a) BTu, 2f (u|P) = E exp u T Bz Bzz T BT E [exp (u T Bz)] E exp u T Bz Bz E exp u T Bz z T BT E [exp (u T Bz)]2 B diag (a) BT E exp u T Bz zz T E [exp (u T Bz)] E exp u T Bz z E exp u T Bz z T E [exp (u T Bz)]2 | {z } Covariance of z P (z) exp u T Bz E [exp (u T Bz)] The new probability density P (z) exp u T Bz E [exp (u T Bz)] is an exponential tilt of the original pdf, and since {zi}d i=1 are independent, and the new tilted density also factorizes over the zi s, therefore, the covariance under this tilted density is also diagonal. Let C := BD0BT . We denote the ith column of B as B.i. Define functions ri (u) := ln E exp u T B.izi Var (zi) u T B.i BT .iu 2 gi (u) := ri (u) Then f (u|P) = Pd i=1 ri (u), f (u|P) = Pd i=1 gi (u). For fixed point iteration, uk+1 = f(C 1uk|P) f(C 1uk|P ) . Consider the function f C 1u|P . Let ei denote the ith axis-aligned unit basis vector of Rn. Since B is full-rank, we can denote α = B 1u. We have, i=1 gi C 1u E h exp u T C 1 T B.izi B.izi i E h exp u T (C 1)T B.izi i Var (zi) B.i BT .i C 1u E h exp u T BT 1 D 1 0 eizi B.izi i E h exp u T (BT ) 1 D 1 0 eizi i Var (zi) B.ie T i D 1 0 B 1u E h exp αi (D0)ii zi zi i E h exp αi (D0)ii zi i Var (zi) αi (D0)ii For t R, define the function qi (t) : R R as - qi (t) := E h exp t (D0)ii zi zi i E h exp t (D0)ii zi i Var (zi) t (D0)ii Note that qi (0) = E [zi] = 0. For t = (t1, t2, td) Rd, let q (t) := [q1 (t1) , q2 (t2) qd (td)]T Rd. Then, f C 1u|P = Bq (α) Therefore, if ut+1 = Bαt+1, then Bαt+1 = Bq (αt) Bq (αt) = αt+1 = q (αt) Bq (αt) = i [d], (αt+1)i = qi ((αt)i) At the fixed point, α = e1 Be1 and Bq (α ) Bq (α ) 2 = Be1. Therefore, i [2, d], (αt+1)i α i = (αt+1)i = qi ((αt)i) Bq (αt) (A.26) for i = 1, (αt+1)1 α 1 = q1 ((αt)i) Bq (αt) 1 Be1 (A.27) Note the smoothness assumptions on qi(.) mentioned in Theorem 4, i [d], 1. supt [ B 1 2, B 1 2]|qi (t) | c1 2. supt [ B 1 2, B 1 2]|q i (t) | c2 3. supt [ B 1 2, B 1 2]|q i (t) | c3 Since k, uk = 1, therefore, k, αt B 1 2. We seek to prove that the sequence {αt}n k=1 converges to α . A.1.2.1 Taylor expansions Consider the function gi (y) := qi (yi) Bq (y) for y Rd. We start by computing the gradient, Lemma A.8. Let gi (y) := qi (yi) Bq (y) , then we have [ ygi (y)]j = 1 Bq (y) qi (yi) yi qi (yi) Bq (y) 2 Bq (y) qi (yi) qi (yi) yi , for j = i q j (yj) Bq (y) qi (yi)e T j BT Bq (y) Bq (y) 2 , for j = i Proof. The derivative w.r.t yi is given as - yi = 1 Bq (y) qi (yi) yi qi (yi) Bq (y) 2 Bq (y) = 1 Bq (y) qi (yi) yi qi (yi) Bq (y) 2 Bq (y) qi (yi) qi (yi) yi Note that y Ay 2= 1 Ay AT Ay yi = 1 Bq (y) qi (yi) yi qi (yi) Bq (y) 2 1 Bq (y) e T i BT Bq (y) qi (yi) = q i (yi) Bq (y) 1 qi (yi)e T i BT Bq (y) Bq (y) 2 For j = i, the derivative w.r.t yj is given as - yj = qi (yi) Bq (y) 2 Bq (y) qj (yj) qj (yj) = q j (yj) Bq (y) qi (yi)e T j BT Bq (y) Bq (y) 2 (A.29) Next, we bound qi (t) and q i (t). Lemma A.9. Under the smoothness assumptions on qi(.) mentioned in Theorem 4, we have for t [ B 1 2, B 1 2], 1. |q1 (t) q1 (α 1)| c2 |t α 1| , |q 1 (t) q 1 (α 1)| c3 |t α 1| 2. i, |qi (t) | c3t2 2 , |q i (t) | c3|t| Proof. First consider qi (t) and q i (t) for i = 1. Using Taylor expansion around t = 0, we have for some c (0, t) and i = 1, qi (t) = qi (0) + tq i (0) + t2 2 q i (c) and q i (t) = q i (0) + tq i (0) Now, we know that qi (0) = q i (0) = 0. 
Then using Assumption 1, we have, for t [ B 1 2, B 1 2], |qi (t) | c3t2 2 and (A.30) |q i (t) | c3|t| (A.31) Similarly, using Taylor expansion around α 1 = 1 Be1 2 for q1 (.) we have, for some c (0, t) q1 (t) = q1 (α 1) + (t α 1) q 1 (c ) and q 1 (t) = q 1 (α 1) + (t α 1) q 1 (c ) Therefore, using Assumption 4, we have, for t [ B 1 2, B 1 2] |q1 (t) q1 (α 1)| c2 |t α 1| and (A.32) |q 1 (t) q 1 (α 1)| c3 |t α 1| (A.33) A.1.2.2 Using convergence radius In this section, we use the Taylor expansion results for qi (.) and q i (.) in the previous section to analyze the following functions for y Rd, 1. w (y) := By 2. v (y; i) := e T i BT Bq (y) Bq (y) 2 Under the constraints specified in the theorem statement we have, y α 2 R, R max {c2/c3, 1} , Be1 max c2 |q1 (α 1)|, c3 |q 1 (α 1)| We start with w (y). Lemma A.10. y Rd, satisfying (A.34), let δ (y) := q (y) q1 (α 1)e1. Then, we have 1. (1 ϵ y α ) |q1 (α 1)| Be1 Bq (y) (1 + ϵ y α ) |q1 (α 1)| Be1 2. δ (y) c2 y α Proof. Consider the vector δ (y) := q (y) q1 (α 1)e1. Then, |(δ (y))ℓ| c2 |y1 α 1| , for ℓ= 1, using Eq A.32 c3 2 (y)2 ℓ, for ℓ = 1, using Eq A.30 (A.35) δ (y) c2 y α (A.36) Now consider w (y). Using the mean-value theorem on the Euclidean norm we have, Bq (y) = |q1 (α 1)| Be1 + 1 Bγ (y) BT Bγ (y) T δ (y) , (A.37) where γ (y) = µq (y) + (1 µ) q1 (α 1)e1, µ (0, 1). Then, 1 Bγ (y) BT Bγ (y) T δ (y) 1 Bγ (y) BT Bγ (y) δ (y) 1 Bγ (y) B Bγ (y) δ (y) = B δ (y) c2 B y α , using Eq A.36 c2 B F y α , ϵ |q1 (α 1)| Be1 y α , using Eq A.34 (A.38) Therefore using A.37, (1 ϵ y α ) |q1 (α 1)| Be1 Bq (y) (1 + ϵ y α ) |q1 (α 1)| Be1 (A.39) Finally, we consider the function v (y; i) := e T i BT Bq (y) Bq (y) 2 . Lemma A.11. y Rd satisfying (A.34), we have 1. For i = 1, 1 5ϵ y α q1 (α 1) v (y|1) 1 + 5ϵ y α 2. For i = 1, |q1 (α 1) v (y; i)| (1 + 5ϵ y α ) Bei Proof. Define θ := ϵ y α for convenience of notation. We have, e T i BT Bq (y) Bq (y) 2 = q1 (α 1) e T i BT Be1 Bq (y) 2 + e T i BT Bδ (y) Bq (y) 2 , using definition of δ (y) (A.40) Therefore, if i = 1, then using Lemma A.10, (1 + θ)2 + q1 (α 1)e T 1 BT Bδ (y) Bq (y) 2 q1 (α 1)e T 1 BT Bq (y) Bq (y) 2 1 (1 θ)2 + q1 (α 1)e T 1 BT Bδ (y) Bq (y) 2 Using the following inequalities - (1 + θ)2 1 2θ, θ 0, and, 1 (1 θ)2 1 + 5θ, θ 1 and observing that using Eq A.34, and Lemma A.10, q1 (α 1)e T 1 BT Bδ (y) Bq (y) 2 |q1 (α 1)| Be1 (1 θ)2 |q1 (α 1)|2 Be1 2 B δ (y) (1 θ)2 c2 B |q1 (α 1)| Be1 y α Therefore we have from Eq A.41, 1 5θ q1 (α 1) v (y; i) 1 + 5θ (A.42) where we used θ := ϵ y α . For the case of i = 1, using Lemma A.10 we have |q1 (α 1)| |v (y; i)| = |q1 (α 1)| e T i BT Bq (y) Bq (y) 2 |q1 (α 1)| Bei Bq (y) (1 θ)2 Bei Be1 (1 + 5θ) Bei We now operate under the assumption that Eq A.34 holds for y = αt and inductively show that it holds for y = αt+1 as well. Recall the function gi (y) := qi (yi) Bq (y) . By applying the mean-value theorem for gi (.) for the points αt and α , we have from Eq A.26 and A.27 - |(αt+1)i α i | = |gi (αt) gi (α )| = gi (βi)T (αt α ) for βi := (1 λi)αt + λiα , λi (0, 1) gi (βi) αt α (A.44) Note the induction hypothesis assumes that Eq A.34 is true for x = αt. Since i, λi (0, 1), therefore Eq A.34 holds for all βi as well. Squaring and adding Eq A.44 for i [d] and taking a square-root, we have αt+1 α αt α i=1 gi (βi) 2 Let us consider the expression Gij := gi (y) βi , i, j [k]. For the purpose of the subsequent analysis, we define θ := ϵ αt α . 
We divide the analysis into the following cases - Case 1 : i = 1, j = 1 Using Lemma A.8, |G11| = q 1 ((β1)1) Bw (β1) 1 q1 ((β1)1)e T 1 BT Bq (β1) Bq (β1) 2 From Lemma A.9, |q 1 ((β1)1)| |q 1 (α 1)| + c3 |(β1)1 α 1| = |q 1 (α 1)| 1 + c3 |(β1)1 α 1| |q 1 (α 1)| |q 1 (α 1)| 1 + c3 αt α |q 1 (α 1)| |q 1 (α 1)| (1 + θ) q1 ((β1)1) |q1 (α 1) |+c2 |(β1)1 α 1| = |q1 (α 1) | 1 + c2 |(β1)1 α 1| |q1 (α 1) |q1 (α 1) | 1 + c2 αt α |q1 (α 1) | |q1 (α 1) |(1 + θ) From Lemma A.10, Bq (β1) (1 θ) |q1 (α 1)| Be1 From Lemma A.11 and θ 1 10, 6θ 1 q1 ((β1)1)e T 1 BT Bw (β1) Bw (β1) 2 6θ |G11| 6 1 + θ |q 1 (α 1)| θ Be1 |q1 (α 1)| 7.5ϵ |q 1 (α 1)| Be1 |q1 (α 1)| αt α , since θ 1 Case 2 : i = 1, j = 1 Using Lemma A.8, Bq (β1) q1 ((β1)1)e T j BT Bq (β1) Bq (β1) 2 From Lemma A.9, q j (β1)j c3 (β1)j α j c3 (αt)j α j , since α j = 0 From Lemma A.10, Bq (β1) (1 θ) |q1 (α 1)| Be1 From Lemma A.9, |q1 ((β1)1)| |q1 (α 1)| + c2 |(β1)1 α 1| = |q1 (α 1)| 1 + c2 |(β1)1 α 1| |q1 (α 1)| |q1 (α 1)| (1 + θ) From Lemma A.11, e T j BT Bq (β1) Bq (β1) 2 |q1 (α 1) | Bej Be1 (1 + 5θ) Bej |q1 (α 1) | Be1 2 (αt)j α j 2c3 Bej |q1 (α 1) | Be1 2 Case 3 : i = 1, j = 1 Using Lemma A.8, |Gi1| = q 1 ((βi)1) Bq (βi) qi ((βi)i)e T 1 BT Bq (βi) Bq (βi) 2 From Lemma A.9, |q 1 ((βi)1)| |q 1 (α 1)| + c3 |(βi)1 α 1| = |q 1 (α 1)| 1 + c3 |(βi)1 α 1| |q 1 (α 1)| |q 1 (α 1)| (1 + θ) From Lemma A.10, Bq (βi) (1 θ) |q1 (α 1)| Be1 From Lemma A.9, |qi ((βi)i)| c3 2 ((βi)i α i )2 c3 2 ((αt)i α i )2 From Lemma A.11, e T 1 BT Bq (βi) Bq (βi) 2 (1 + θ) (1 + 5θ) |q 1 (α 1)| |q1 (α 1)|2 Be1 ((αt)i α i )2 c3 |q 1 (α 1)| |q1 (α 1)|2 Be1 ((αt)i α i )2 c3R |q 1 (α 1)| |q1 (α 1)|2 Be1 |(αt)i α i | c2 |q 1 (α 1)| |q1 (α 1)|2 Be1 |(αt)i α i | Case 4 : i = 1, j = 1, i = j Using Lemma A.8, |Gii| = q i ((βi)i) Bq (βi) 1 qi ((βi)i)e T i BT Bq (βi) Bq (βi) 2 From Lemma A.9, |q i ((βi)i)| c3 (βi)i α j c3 |(αt)i α i | From Lemma A.10, Bq (βi) (1 θ) |q1 (α 1)| Be1 From Lemma A.9, |qi ((βi)i)| c3 2 ((βi)i α i )2 c3 2 ((αt)i α i )2 c3R αt α c2 αt α From Lemma A.11, e T i BT Bq (βi) Bq (βi) 2 (1 + 5θ) Bei |q1 (α 1) | Be1 qi ((βi)i)e T i BT Bq (βi) Bq (βi) 2 (1 + 5θ) αt α Bei Be1 c2 |q1 (α 1) | (1 + 5θ) αt α B F Be1 c2 |q1 (α 1) | (1 + 5θ) ϵ αt α 2θ |Gii| c3 |(αt)i α i | (1 θ) |q1 (α 1)| Be1 2c3 |q1 (α 1)| Be1 |(αt)i α i | Case 5 : i = 1, j = 1, i = j Using Lemma A.8, Bq ((βi)) qi ((βi)i)e T j BT Bq ((βi)) Bq ((βi)) 2 From Lemma A.9, q j (βi)j c3 (βi)j α j c3 (αt)j α j From Lemma A.10, Bq (βi) (1 θ) |q1 (α 1)| Be1 From Lemma A.9, |qi ((βi)i)| c3 2 ((βi)i α i )2 c3 2 ((αt)i α i )2 c3R 2 |(αt)i α i | From Lemma A.11, e T j BT Bq (βi) Bq (βi) 2 |q1 (α 1) | Bej Be1 |Gij| c2 3R |q1 (α 1)|2 Be1 2 (αt)j α j |(αt)i α i | c2 3R Bej |q1 (α 1)|2 Be1 2 (αt)j α j |(αt)i α i | A.1.2.3 Putting everything together Putting all the cases together in Eq A.45, we have αt+1 α αt α C αt α , where v u u u u u u u u t 7.5ϵ |q 1 (α 1)| Be1 |q1 (α 1)| 2 + 2c3 B F |q1 (α 1) | Be1 2 c2 |q 1 (α 1)| |q1 (α 1)|2 Be1 + 2c3 |q1 (α 1)| Be1 c2 3R2 B F |q1 (α 1)|2 Be1 2 Then we have, C 5 B F Be1 2 max c3 |q1(α 1)|, c2 2 |q1(α 1)| 2 . For linear convergence, we require CR < 1, which is ensured by the condition mentioned in Theorem 4. A.2 Additional experiments for noisy ICA A.2.1 Experiments with super-gaussian source signals Table 2 shows the Amari error in the presence of super-Gaussian signals. 
The signals are 1) a Uniform distribution U(−√3, √3), 2) a centered Bernoulli(1/2), 3) Laplace(0, 0.05) (mean 0 and standard deviation 0.05), 4) Exponential(5), and 5) and 6) a Student's t-distribution with 3 and 5 degrees of freedom, respectively. The Meta algorithm closely follows the best candidate algorithm even in the presence of many super-Gaussian signals.

Table 2: Variation of the Amari error with sample size for heavy-tailed distributions, averaged over 100 random runs. Noise power ρ = 0.001, number of sources k = 6.

Algorithm   n = 200   n = 500   n = 1000   n = 5000   n = 10000
Meta        0.44376   0.25215   0.20222    0.11435    0.0838
CHF         1.10103   0.70823   0.52529    0.21344    0.09194
CGF         1.84266   1.51216   1.3702     0.84351    0.58753
PEGI        2.27873   1.86561   1.71474    1.33709    1.23322
PFICA       0.39237   0.25468   0.21222    0.12878    0.12347
JADE        0.70174   0.39246   0.28199    0.12652    0.09686
FastICA     0.66441   0.3869    0.28419    0.13215    0.09946

A.2.2 Image demixing experiments
In this section, we provide additional experiments on image demixing using ICA. We flatten the images and mix them linearly with a 4 × 4 matrix B (i.i.d. entries N(0, 1)) and Wishart noise (ρ = 0.001). Demixing is performed using the SINR-optimal demixing matrix (see Section 2), and the results are shown in Figure A.1 along with the corresponding independence scores. The CHF-based method recovers the original sources well, up to sign. The kurtosis- and CGF-based methods fail to recover the second source, which is consistent with their higher independence scores. The Meta algorithm selects CHF from the candidates CHF, CGF, Kurtosis, FastICA, and JADE.

Figure A.1: Image demixing using ICA. (a) Source images; (b) mixed images; (c) demixed images using the CHF-based contrast function (independence score 5.26 × 10^{-3}); (d) the kurtosis-based contrast function (2.48 × 10^{-2}); (e) the CGF-based contrast function (4 × 10^{-2}); (f) FastICA (7.1 × 10^{-3}); (g) JADE (6.8 × 10^{-3}).

A.2.3 Image denoising experiments
In this experiment, we use the ICA-based denoising technique proposed in [38] to compare candidate noisy ICA algorithms and show that the Meta algorithm can pick the best-denoised image based on the independence score proposed in our work. We use the noisy MNIST dataset and further add entrywise Gaussian noise with variance proportional to sin^2(c_1 π(i + j)) (c_1 = 50) to the (i, j)th pixel. Training images are flattened into a 784-dimensional vector, and PCA is performed to reduce the dimensionality to 25, on which ICA is subsequently performed. The original and denoised images, along with their independence scores, are shown in Figure A.2. We note that CHF-based denoising provides qualitatively better results, while CGF-based denoising provides the worst results. This is consistent with their corresponding independence scores.

Figure A.2: Image denoising using ICA. (a) Original images; (b) CHF-based denoising (independence score 4.4 × 10^{-6}); (c) kurtosis-based denoising (1.7 × 10^{-4}); (d) CGF-based denoising (9.2 × 10^{-6}); (e) FastICA-based denoising (5.9 × 10^{-6}); (f) JADE-based denoising (4.6 × 10^{-6}).
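For reference, the Amari error reported in Table 2 above (and in the other simulation tables) can be computed as in the short sketch below. This is a standard implementation of the metric, not the authors' exact script; in particular, the normalization constant may differ from the one used to produce the tables, and the function name amari_error is ours.

```python
import numpy as np

def amari_error(W, B):
    """Standard Amari error between an estimated unmixing matrix W and the
    true mixing matrix B.  It is invariant to permutations and rescalings of
    the recovered sources and equals 0 iff W @ B is a scaled permutation."""
    P = np.abs(W @ B)
    k = P.shape[0]
    row_term = np.sum(P.sum(axis=1) / P.max(axis=1) - 1.0)
    col_term = np.sum(P.sum(axis=0) / P.max(axis=0) - 1.0)
    return (row_term + col_term) / (2.0 * k * (k - 1))
```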
A.3 Algorithm for sequential calculation of independence scores
In this section, we describe how to compute the independence score proposed in the manuscript in a sequential manner; the detailed procedure is given as Algorithm 3. We assume access to a matrix of the form C = B D B^T. The power method (see Eq 3) essentially extracts one column of B (up to scaling) at every step. [49] provides an elegant way to use the pseudo-Euclidean space to successively project the data onto directions orthogonal to the columns already extracted, which enables one to extract each column of the mixing matrix. For completeness, we present the steps of this projection. After extracting the ith column u(i), the algorithm maintains two matrices. The first, denoted by U, estimates the mixing matrix B one column at a time (up to scaling). The second, denoted by V, estimates B^{-1} one row at a time.

It is possible to extend the independence score in Eq 2 to the sequential setting as follows. Let u(j) be the jth vector extracted by the power method (Eq 3) run until convergence. After extracting ℓ columns, we project the data using min(ℓ + 1, k) projection matrices; the resulting projections would be mutually independent if we had indeed extracted distinct columns of B. Let C = B D B^T for some diagonal matrix D and, for convenience of notation, write B_j := B(:, j). Set

v(j) = C† u(j) / (u(j)^T C† u(j)).   (A.46)

When u(j) = B_j / ‖B_j‖, this gives v(j) = (B^T)^{-1} D^{-1} e_j ‖B_j‖ / (e_j^T D^{-1} e_j) = (B^T)^{-1} e_j ‖B_j‖. Let x denote an arbitrary datapoint. Then

U(:, j) V(j, :) x = B_j z_j + U(:, j) V(j, :) g.   (A.47)

Thus the projection onto the remaining columns j > ℓ is given by

x − Σ_{j=1}^{ℓ} U(:, j) V(j, :) x = Σ_{j=ℓ+1}^{k} B_j z_j + g̃,

where g̃ = F g for some k × k matrix F. So we have min(ℓ, k) vectors of the form z_i B_i + g_i and, when ℓ < k, an additional vector that contains all the remaining independent random variables z_j, j > ℓ, along with a mean-zero Gaussian vector. We can then project each vector to a scalar using random unit vectors and check for independence using the independence score defined in Eq 2. When ℓ = k, all we need are the j = 1, . . . , k projections onto the k directions identified via ICA in Eq A.47.

As an example, we conduct an experiment (Figure A.3) in which we fix a mixing matrix B using the same generating mechanism as in Section 5. We then create vectors q that interpolate between B(:, 1) and a fixed arbitrary vector orthogonal to B(:, 1) and, as the interpolation varies, plot the independence score for the direction q. The dataset is the 9-dimensional dataset with independent components from many different distributions (see Section 5). The plot shows a clear negative correlation between the score and the dot product of the vector with the column B(:, 1): when q has a small angle with B(:, 1), the independence score is small and the error bars are very small; as the dot product decreases, the score grows and the error bars become larger.

Figure A.3: Mean independence score with error bars from 50 random runs.

Algorithm 3 Independence score after extracting ℓ columns. U and V are running estimates of B and B^{-1} up to ℓ columns and rows, respectively.
Input: X ∈ R^{n×k}, U ∈ R^{k×ℓ}, V ∈ R^{ℓ×k}, number of random projections M
k0 ← min(ℓ + 1, k)
for j in range[1, ℓ] do
    Y_j ← X (U(:, j) V(j, :))^T
end for
if ℓ < k then
    Y_{ℓ+1} ← X (I − V^T U^T)
end if
for j in range[1, M] do
    t ← random unit vector in R^k
    for a in range[1, k0] do
        W(:, a) ← Y_a t
    end for
    (each row α ∈ R^{1×k0} of W collects the k0 scalar projections of one datapoint)
    S ← cov(W)
    γ ← Σ_{i,j} S_{ij},   β ← Σ_i S_{ii}
    s(j) ← | Ê[exp(i Σ_a α_a)] exp(−β) − Π_{a=1}^{k0} Ê[exp(i α_a)] exp(−γ) |
end for
Return mean(s), stdev(s)

We refer to this procedure by its arguments (X, U, V, ℓ): it takes as input the data X, the number of extracted columns ℓ, and the matrices U and V, which are running estimates of B and B^{-1} up to ℓ columns and rows, respectively.
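A minimal NumPy sketch of Algorithm 3, under the same conventions, is given below. The helper name sequential_independence_score and the default M are ours, and the exponents exp(−β), exp(−γ) follow the pseudocode as printed; depending on the convention used for the score in Eq 2, a factor 1/2 in these exponents may be intended.

```python
import numpy as np

def sequential_independence_score(X, U, V, ell, M=100, rng=None):
    """Sketch of Algorithm 3: independence score after extracting `ell` columns.

    X : (n, k) data matrix, U : (k, ell) running estimate of the first columns
    of B, V : (ell, k) running estimate of the first rows of B^{-1}.
    """
    rng = np.random.default_rng(rng)
    n, k = X.shape
    k0 = min(ell + 1, k)

    # Rank-one projections of the data onto each extracted direction.
    Y = [X @ np.outer(U[:, j], V[j, :]).T for j in range(ell)]
    if ell < k:
        # Residual projection containing the not-yet-extracted components.
        Y.append(X @ (np.eye(k) - V.T @ U.T))

    s = np.empty(M)
    for m in range(M):
        t = rng.standard_normal(k)
        t /= np.linalg.norm(t)                               # random unit vector
        W = np.column_stack([Y[a] @ t for a in range(k0)])   # n x k0 scalar projections
        S = np.atleast_2d(np.cov(W, rowvar=False))
        gamma, beta = S.sum(), np.trace(S)
        joint = np.mean(np.exp(1j * W.sum(axis=1)))          # joint empirical CF
        marg = np.prod(np.mean(np.exp(1j * W), axis=0))      # product of marginal CFs
        # A factor 1/2 in the exponents may be intended, depending on Eq 2.
        s[m] = np.abs(joint * np.exp(-beta) - marg * np.exp(-gamma))
    return s.mean(), s.std()
```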
A.4 More details on the third derivative condition in Theorem 3
Theorem 3 requires the condition "the third derivative of h_X(u) does not change sign on the half-line [0, ∞) for the non-Gaussian random variable considered in the ICA problem." In this section, we provide sufficient conditions, in terms of contrast functions and datasets, under which this holds. We also provide an interesting example suggesting that our Assumption 1(d) might not be far from necessary.

CGF-based contrast function. Consider the cumulant generating function (CGF) of a random variable X, i.e. the logarithm of its moment generating function, and the contrast function g_X(t) = CGF(t) − Var(X) t^2/2. We first note that it satisfies Assumptions 1(a)–(d). Next, we observe that it is enough for a distribution to have all cumulants of the same sign to satisfy Assumption 1(d) for the CGF; for example, the Poisson distribution has all positive cumulants. Figure A.4 depicts the third derivative of this contrast function for Bernoulli(1/2), Uniform, Poisson, and Exponential sources.

Figure A.4: Plots of the third derivative of the CGF-based contrast function for different datasets: Bernoulli(p = 0.5) (A.4a), Uniform(−√3, √3) (A.4b), Poisson(λ = 1) (A.4c), and Exponential(λ = 1) (A.4d). Note that the sign stays the same in each half-line.

Logarithm of the symmetrized characteristic function. Figure A.5 depicts the third derivative of this contrast function for the logistic distribution, where Assumption 1(d) holds. Although the assumption does not hold for a large class of distributions here, we believe that the requirement of a constant sign can be relaxed to hold up to some bounded parameter value instead of on the entire half-line, so that the global convergence results of Theorem 3 still hold. This belief is further reinforced by Figure A.6, which shows that the loss landscape of functions for which Assumption 1(d) does not hold is still smooth.

Figure A.5: Plot of the third derivative of the CHF-based contrast function for the Logistic(0, 1) distribution.
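Returning to the CGF-based contrast: as a quick sanity check of the claim that distributions whose cumulants all share a sign satisfy Assumption 1(d), the snippet below works out the third derivative of the contrast for a Poisson(λ) source in closed form. The symbolic computation and the helper names are ours; the Poisson CGF λ(e^t − 1) and variance λ are standard facts.

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)

# CGF-based contrast for a Poisson(lambda) source:
# g(t) = log E[exp(tX)] - Var(X) * t^2 / 2, with CGF(t) = lambda*(exp(t) - 1)
# and Var(X) = lambda.
g = lam * (sp.exp(t) - 1) - lam * t**2 / 2

g3 = sp.simplify(sp.diff(g, t, 3))
print(g3)  # lambda*exp(t): strictly positive, so the sign is constant on [0, oo)
```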
A.4.1 Towards a necessary condition
Consider the probability density function (see [36])

f(x) = k e^{-x^2/2} (a + b cos(cπx))

for constants a, c > 0 and b with a ≥ |b| > 0. For f(x) to be a valid pdf, we require f(x) ≥ 0 and ∫ f(x) dx = 1, i.e.,

k ∫ e^{-x^2/2} (a + b cos(cπx)) dx = 1.   (A.48)

Using the standard integral ∫ e^{-x^2/2} cos(cπx) dx = √(2π) e^{-c^2π^2/2}, we get k = 1 / (√(2π)(a + b e^{-c^2π^2/2})). Some algebraic manipulations yield

MGF_f(t) = e^{t^2/2} (a + b e^{-c^2π^2/2} cos(cπt)) / (a + b e^{-c^2π^2/2}),

and therefore

ln(MGF_f(t)) = t^2/2 − ln(a + b e^{-c^2π^2/2}) + ln(a + b e^{-c^2π^2/2} cos(cπt)).

We now compute the mean µ and the variance σ^2. The mean is µ = ∫ f(x) x dx = 0, since f(x) is an even function. The variance is

σ^2 = k ( a ∫ e^{-x^2/2} x^2 dx + b ∫ e^{-x^2/2} x^2 cos(cπx) dx ).

Noting that ∫ e^{-x^2/2} x^2 dx = √(2π) and ∫ e^{-x^2/2} x^2 cos(cπx) dx = √(2π) e^{-c^2π^2/2} (1 − c^2π^2), we obtain

σ^2 = (a + b e^{-c^2π^2/2} (1 − c^2π^2)) / (a + b e^{-c^2π^2/2}) = 1 − b c^2π^2 e^{-c^2π^2/2} / (a + b e^{-c^2π^2/2}).

The symmetrized CGF can therefore be written as

symCGF(t) := ln(MGF_f(t)) + ln(MGF_f(−t)) = t^2 − 2 ln(a + b e^{-c^2π^2/2}) + 2 ln(a + b e^{-c^2π^2/2} cos(cπt)).

Therefore, g(t) := symCGF(t) − σ^2 t^2 can be written as

g(t) = −2 ln(a + b e^{-c^2π^2/2}) + 2 ln(a + b e^{-c^2π^2/2} cos(cπt)) + b c^2π^2 e^{-c^2π^2/2} t^2 / (a + b e^{-c^2π^2/2}),

with derivatives

g′(t) = −2 b cπ e^{-c^2π^2/2} sin(cπt) / (a + b e^{-c^2π^2/2} cos(cπt)) + 2 b c^2π^2 e^{-c^2π^2/2} t / (a + b e^{-c^2π^2/2}),

g″(t) = −2 b c^2π^2 e^{-c^2π^2/2} (a cos(cπt) + b e^{-c^2π^2/2}) / (a + b e^{-c^2π^2/2} cos(cπt))^2 + 2 b c^2π^2 e^{-c^2π^2/2} / (a + b e^{-c^2π^2/2}),

g‴(t) = 2 b c^3π^3 e^{-c^2π^2/2} sin(cπt) (a^2 − 2 b^2 e^{-c^2π^2} − a b e^{-c^2π^2/2} cos(cπt)) / (a + b e^{-c^2π^2/2} cos(cπt))^3.

Let us set a = 2, b = −1, c = 4/π. Then the pdf is f(x) = k e^{-x^2/2} (2 − cos(4x)) = k e^{-x^2/2} (1 + 2 sin^2(2x)), and the corresponding functions are

E[exp(tX)] = e^{t^2/2} (2 − e^{-8} cos(4t)) / (2 − e^{-8}),

g(t) := symCGF(t) − Var(X) t^2 = −2 ln(2 − e^{-8}) + 2 ln(2 − e^{-8} cos(4t)) − 16 e^{-8} t^2 / (2 − e^{-8}),

g′(t) = 8 e^{-8} sin(4t) / (2 − e^{-8} cos(4t)) − 32 e^{-8} t / (2 − e^{-8}),

g″(t) = 32 e^{-8} (2 cos(4t) − e^{-8}) / (2 − e^{-8} cos(4t))^2 − 32 e^{-8} / (2 − e^{-8}),

g‴(t) = −128 e^{-8} sin(4t) (4 − 2 e^{-16} + 2 e^{-8} cos(4t)) / (2 − e^{-8} cos(4t))^3.

Closeness to a Gaussian MGF: It can be seen that g‴(t) changes sign on the half-line. The key point, however, is that the MGF of this distribution is very close to that of a Gaussian, and the deviation can be made arbitrarily small by varying the parameters a and b; this renders the CGF-based contrast function ineffectual. This leads us to believe that Assumption 1(d) is not far from being necessary to ensure global optimality of the corresponding objective function.
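The sign change of g‴ for this construction is easy to check numerically. The snippet below evaluates the closed form derived above on a grid, for a = 2, b = −1, c = 4/π; the function name g3, the parameterization, and the grid are ours.

```python
import numpy as np

def g3(t, a=2.0, b=-1.0, cpi=4.0):
    """Third derivative of g(t) for f(x) = k exp(-x^2/2) (a + b cos(c*pi*x)),
    parameterized by cpi = c*pi (so cpi = 4 corresponds to c = 4/pi)."""
    e = np.exp(-cpi**2 / 2.0)                                 # e^{-c^2 pi^2 / 2} = e^{-8}
    num = a**2 - 2.0 * b**2 * e**2 - a * b * e * np.cos(cpi * t)
    den = (a + b * e * np.cos(cpi * t)) ** 3
    return 2.0 * b * cpi**3 * e * np.sin(cpi * t) * num / den

ts = np.linspace(0.0, 3.0, 2001)
vals = g3(ts)
# The factor sin(4t) makes g''' oscillate, so its sign changes on [0, oo).
print("sign changes of g''' on [0, 3]:", int(np.count_nonzero(np.diff(np.sign(vals)))))
```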
A.5 Surface plots of the CHF-based contrast function
Figure A.6 depicts the loss landscape of the characteristic function (CHF) based contrast function described in Section 3.3 of the manuscript. We plot the value of the contrast function evaluated at B^{-1}u, with ‖u‖ = 1, for x, y ∈ [−1, 1]. The data has been rotated so that the columns of B align with the X and Y axes; as shown in the figure, the global maxima occur at u aligned with the columns of B.

Figure A.6: Surface plots for zero-kurtosis (A.6c and A.6d) and Uniform (A.6a and A.6b) data with n = 10000 points, noise power ρ = 0.1, and number of source signals k = 2.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our main contributions are the Meta algorithm for nonparametric selection of the best noisy ICA algorithm for a given distribution, along with two new contrast functions of independent interest.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: In the experiments (Section 5), there are some noise/sample-size regimes where other algorithms may perform slightly better than our proposed contrast functions, but we show that even then our diagnostic can pick out the best algorithm.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Every theorem statement mentions the assumptions and the location of the proof in the Appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Section 5 provides extensive details of the experimental setup.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We make the code for our experiments available.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Section 5 provides extensive details of the experimental setup.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Figures 1 and 2(a) show error bars and the error distribution, respectively. Figures 2(b) and 2(c) show the mean over 100 runs, similar to [49].

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Our experiments were performed on a single MacBook Pro (M2, 2022) CPU with 8 GB of RAM.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: We abide by the NeurIPS Code of Ethics in our work.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [No]
Justification: This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We do not release any models with a risk for misuse.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: MATLAB implementations (under the GNU General Public License) can be found at FastICA and JADE. The code for PFICA was provided on request by the authors.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets that would require documentation or licensing.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: We only use publicly available datasets in Section 5.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: We do not have experiments with crowdsourcing or human subjects.