Deterministic Independent Component Analysis

Ruitong Huang (RUITONG@UALBERTA.CA), András György (GYORGY@UALBERTA.CA), Csaba Szepesvári (SZEPESVA@UALBERTA.CA)
Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada

Abstract

We study independent component analysis with noisy observations. We present, for the first time in the literature, consistent, polynomial-time algorithms to recover non-Gaussian source signals and the mixing matrix with a reconstruction error that vanishes at a $1/\sqrt{T}$ rate using $T$ observations and scales only polynomially with the natural parameters of the problem. Our algorithms and analysis also extend to deterministic source signals whose empirical distributions are approximately independent.

1. Introduction

Independent Component Analysis (ICA) has received much attention in the past decades. In the standard ICA model one can observe a $d$-dimensional vector $X$ that is a linear mixture of $d$ independent variables $(S_1, \ldots, S_d)$ with Gaussian noise:
$$X = AS + \epsilon, \qquad (1)$$
where $\epsilon \sim \mathcal{N}(0, \Sigma)$ is a $d$-dimensional Gaussian noise with zero mean and covariance matrix $\Sigma$, and $A$ is a nonsingular $d \times d$ mixing matrix. The goal of the observer is to recover (separate) the source signals and the mixing matrix given several independent and identically distributed (i.i.d.) observations from the above model. The ICA literature is vast in both practical algorithms and theoretical analyses; we refer to the book of Comon and Jutten (2010) for a comprehensive survey. In this paper we investigate one of the most important problems in ICA: finding consistent, computationally efficient algorithms with finite-sample performance guarantees. In particular, we aim to develop algorithms whose computational and sample complexity are polynomial in the natural parameters of the problem.

A popular approach to the ICA problem is to find a linear transformation $W$ for $X$ by optimizing a so-called contrast function that measures the dependence or non-Gaussianity of the resulting coordinates of $WX$. The optimal $W$ can then serve as an estimate of $A^{-1}$, thereby recovering the mixing matrix $A$. One of the most popular ICA algorithms, FastICA (Hyvärinen, 1999), follows this approach for a specific contrast function. FastICA has been analyzed theoretically from many aspects (Tichavský et al., 2006; Oja and Yuan, 2006; Ollila, 2010; Dermoune and Wei, 2013; Wei, 2014). In particular, recently Miettinen et al. (2014) showed that in the noise-free case (i.e., when $X = AS$), the error of FastICA (when using a particular fourth-moments-based contrast function) vanishes at a rate of $1/\sqrt{T}$, where $T$ is the sample size. In addition, several other methods have been shown to achieve similar error rates in the noise-free setting (e.g., Eriksson and Koivunen, 2003; Samarov et al., 2004; Chen and Bickel, 2005; Chen et al., 2006). However, to our knowledge, no similar finite-sample results are available in the noisy case. On the other hand, several promising algorithms are available in the noisy case that make significant advances towards provably efficient and effective ICA algorithms, albeit they fall short of providing a complete solution.
Using a quasi-whitening procedure, Arora et al. (2012) reduce the problem to finding all the local optima of a specific function defined using the fourth-order cumulant, and propose a polynomial-time algorithm to find them with appealing theoretical guarantees. However, the results depend on an unspecified parameter ($\beta$ in the original paper) whose proper tuning is essential; note that even an exhaustive search over $\beta$ is problematic, since its valid range is not well understood. The exploitation of the special algebraic structure of the fourth moments induced by the independence leads to several other works related to ICA (Hsu and Kakade, 2013; Anandkumar et al., 2012a;b). A similar idea was also discussed earlier as an intuitive argument to construct a contrast function (Cardoso, 1999). The first rigorous proofs for this idea were developed using matrix perturbation tools in a general tensor perspective (Anandkumar et al., 2012a;b; Goyal et al., 2014). A common problem faced by these methods is a minimal gap of the eigenvalues, which may result in an exponential dependence on the number of source signals $d$. More precisely, these methods all require an eigendecomposition of some flattened tensor where the minimal gap between the eigenvalues plays an essential role. Although the exact size of this gap is not yet understood, a naive analysis introduces an exponential dependence on the dimension $d$. Such dependence is also observed in the literature (Cardoso, 1999; Goyal et al., 2014). One way to circumvent such dependence is to directly decompose a high-order tensor using the power method, which requires no flattening procedure (Anandkumar et al., 2014). However, when applied to the ICA problem, this introduces a bias term, and so the error does not approach 0 as the sample size approaches infinity. Another issue is the well-known fact that the power method is unstable in practice for high-order tensors. Goyal et al. (2014) proposed another method by exploring the characteristic function rather than the fourth moments. However, their algorithm requires picking a parameter ($\sigma$ in the original paper) that is smaller than some unknown quantity, making their algorithm impossible to tune. Recently, Vempala and Xiao (2014) proposed an ICA algorithm based on an elegant, recursive version of the method of Goyal et al. (2014) that avoids dealing with the aforementioned minimal gap; however, they still need an oracle to set the unspecified parameter of Goyal et al. (2014).

In this paper we propose a provably polynomial-time algorithm for the noisy ICA model. Our algorithm is a refined version of the ICA method proposed by Hsu and Kakade (2013) (HKICA). However, we propose two simpler ways, one inspired by Frieze et al. (1996) and Arora et al. (2012), and another based on Vempala and Xiao (2014), to deal with the spacing problem of the eigenvalues under conditions similar to those of Goyal et al. (2014). Unlike the method proposed by Goyal et al. (2014), our first method can force the eigenvalues to be well separated with a gap that is independent of the mixing matrix $A$, while our second method, based on the recursive decomposition idea of Vempala and Xiao (2014), avoids dealing with the minimum gap (at the price of introducing other complications). We prove that our methods achieve an $O(1/\sqrt{T})$ error in estimating $A$ and the source signals, with high probability, such that both the convergence rate and the computational complexity scale polynomially with the natural parameters of the problem.
Our method needs no parameter tuning, which makes it even more appealing.

Another contribution of the present paper is that our analysis is conducted in a deterministic manner. In practice, ICA is also known to work well for unmixing mixtures of various deterministic signals. One of the classical demonstrations of ICA is showing that two periodic signals can be well recovered from their mixtures (Hyvärinen and Oja, 2000). Such an example is shown in Figure 1. It can be seen that our algorithm, DICA, in this particular example, can solve the problem better than the other algorithms, FastICA (Hyvärinen, 1999) and HKICA (Hsu and Kakade, 2013). This phenomenon suggests that the usual probabilistic formulation is unsatisfactory if one wishes to have a deeper understanding of ICA. Our deterministic analysis helps investigate this curious phenomenon without losing any generality compared to the traditional stochastic setting. Formally, instead of observing $T$ i.i.d. samples from (1), the source signals are given by a $d$-dimensional deterministic function $s : \mathbb{N} \to \mathbb{R}^d$, and the observations are $x(t) = As(t) + \epsilon_t$, where $(\epsilon_t)_{t=1}^{\infty}$ is an i.i.d. sequence of $d$-dimensional $\mathcal{N}(0, \Sigma)$ random variables.

Figure 1. Example of ICA for deterministic sources: the first two rows show the source signals $s_1(t) = 0.5 - (t - 2\lfloor t/2 \rfloor)$ and $s_2(t) = \cos(t)$; the next two rows present the observations obtained with mixing matrix $A = \begin{pmatrix} 1 & 2 \\ 2.6 & 5.1 \end{pmatrix}$. The reconstructed (and rescaled) signals are shown for FastICA, HKICA, and DICA after sampling $As(t)$ at 10000 uniformly spaced points in the interval $[0, 15]$.

The rest of this paper is organized as follows: the ICA problem is introduced in detail in Section 2 and our main results are highlighted in Section 3. The polynomial-time algorithms underlying these results are developed through the next two sections: Section 4.1 is devoted to the analysis of the HKICA algorithm, also showing its disadvantages, while our new algorithms are presented in Section 5. Experimental results are reported in Section 6. Proofs are presented in the full version of the paper (Huang et al., 2015).

1.1. Notation

We denote the set of real and natural numbers by $\mathbb{R}$ and $\mathbb{N}$, respectively. A vector $v \in \mathbb{K}^d$ for a field $\mathbb{K}$ is assumed to be a column vector. Let $\|v\|_2$ denote its $L_2$-norm, and for any matrix $Z$ let $\|Z\|_2 = \max_{v : \|v\|_2 = 1} \|Zv\|_2$ denote the corresponding induced norm. Denote the maximal and minimal singular values of $Z$ by $\sigma_{\max}(Z)$ and $\sigma_{\min}(Z)$, respectively. Also, let $Z_i$ and $Z_{i:}$ denote the $i$th column and, respectively, row of $Z$, and let $\|Z\|_{(2,\min)} = \min_i \|Z_i\|_2$, $\|Z\|_{(2,\max)} = \max_i \|Z_i\|_2$, and $\|Z\|_{\max} = \max_{i,j} |Z_{i,j}|$. Clearly, $\sigma_{\max}(Z) = \|Z\|_2 \ge \|Z\|_{(2,\max)} \ge \|Z\|_{\max}$, and $\sigma_{\min}(Z) \le \|Z\|_{(2,\min)}$. For a tensor (including vectors and matrices) $T$, its Frobenius norm (or $L_2$ norm) $\|T\|_F$ is defined as the square root of the sum of the squares of all its entries. For a vector $v = (v_1, \ldots, v_d) \in \mathbb{K}^d$, $|v|$ is defined coordinatewise, that is, $|v| = (|v_1|, \ldots, |v_d|)$. The transpose of a vector/matrix $Z$ is denoted by $Z^\top$, while the inverse of the transpose is denoted by $Z^{-\top}$. The outer product of two vectors $v, u \in \mathbb{K}^d$ is denoted by $u \otimes v = uv^\top$; $v^{\otimes k}$ denotes the $k$-fold outer product of $v$ with itself, that is, $v \otimes v \otimes \cdots \otimes v$, which is a $k$th-order tensor. Given a fourth-order tensor $T$, we denote by $T(\eta, \eta, \cdot, \cdot)$ the matrix $Z$ obtained by marginalizing the first two coordinates of $T$ along the direction $\eta$, that is, $Z_{i,j} = \sum_{k_1, k_2 = 1}^{d} \eta_{k_1} \eta_{k_2} T_{k_1, k_2, i, j}$. (Similar definitions apply when marginalizing other coordinates of the tensor.)
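To make the marginalization notation concrete, the following minimal numpy sketch (the function name is ours, not from the paper) computes $Z = T(\eta, \eta, \cdot, \cdot)$ for a fourth-order tensor stored as a 4-way array.

```python
import numpy as np

def marginalize_first_two(T, eta):
    """Z[i, j] = sum_{k1, k2} eta[k1] * eta[k2] * T[k1, k2, i, j]."""
    return np.einsum('k,l,klij->ij', eta, eta, T)

# tiny usage example on a random fourth-order tensor
d = 3
T = np.random.randn(d, d, d, d)
eta = np.random.randn(d)
Z = marginalize_first_two(T, eta)   # a d x d matrix
```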
For a real vector $v$ and some real number $C$, $v \le C$ means that all the entries of $v$ are at most $C$. The bold symbol $\mathbf{1}$ denotes a vector with all entries equal to 1 (the dimension of this vector will always be clear from the context). Finally, $\mathrm{Poly}(\cdot, \ldots, \cdot)$ denotes a polynomial function of its arguments.

2. The ICA Problem

In this paper we consider the following non-stochastic version of the ICA problem. Assume that we can observe the $d$-dimensional mixed signal $x(t) \in \mathbb{R}^d$, $t \in [T] := \{1, 2, \ldots, T\}$, generated by
$$x(t) = As(t) + \epsilon(t), \qquad (2)$$
where $A$ is a $d \times d$ nonsingular mixing matrix, $s : [T] \to [-C, C]^d$ is a bounded, $d$-dimensional source function for some constant $C \ge 1$, and $\epsilon : [T] \to \mathbb{R}^d$ is the noise function. We will denote the $i$th component of $s$ by $s_i$. Furthermore, we will use the notation $\sigma_{\min} = \sigma_{\min}(A)$ and $\sigma_{\max} = \sigma_{\max}(A)$.

For any $t, k \ge 1$ and signal $u : [t] \to \mathbb{R}^k$, we introduce the empirical distribution $\nu_t^{(u)}$ defined by $\nu_t^{(u)}(B) = \frac{1}{t} \big| \{\tau \in [t] : u(\tau) \in B\} \big|$ for all Borel sets $B \subset \mathbb{R}^k$. Next we will impose assumptions on the empirical measure that guarantee that on average we do not deviate too much from the stochastic model. The next assumption implies that the empirical distributions of the source signals are approximately zero mean, and that the noise is approximately zero-mean Gaussian.

Assumption 2.1. Assume there exist a constant $L$ and a function $g : \mathbb{N} \to \mathbb{R}$ with $g(t) \to 0$ as $t \to \infty$ such that
(i) $\big\| E_{S_i \sim \nu_t^{(s_i)}}[S_i] \big\|_F, \big\| E_{Y \sim \nu_t^{(\epsilon)}}[Y] \big\|_F \le g(t)$ for all $1 \le i \le d$;
(ii) $\big\| E_{Y \sim \nu_t^{(\epsilon)}}[Y^{\otimes 2}] \big\|_F, \big\| E_{Y \sim \nu_t^{(\epsilon)}}[Y^{\otimes 3}] \big\|_F \le L$;
(iii) $\big\| E_{Y \sim \nu_t^{(\epsilon)}}[Y^{\otimes 4}](\eta, \eta, \cdot, \cdot) - \big(E_{Y \sim \nu_t^{(\epsilon)}}[Y^{\otimes 2}]\big)^{\otimes 2}(\eta, \eta, \cdot, \cdot) - 2\big(E_{Y \sim \nu_t^{(\epsilon)}}[Y^{\otimes 2}]\big)^{\otimes 2}(\eta, \cdot, \eta, \cdot) \big\|_F \le g(t)\, \|\eta\|_2^2$ for all $\eta$.
Here $L$ and the function $g$ may depend on $\{A, \Sigma, C, d\}$.

Remark 2.2. The first assumption forces the averages of $s$ and $\epsilon$ to decay to 0 at a rate of $g(t)$. The next one requires that both the second and third moments of the noise be bounded. The last assumption basically says that the measure induced by the noise function $\epsilon$ has zero kurtosis in the limit.

We will also need to guarantee that the source signals and the noise are approximately independent:

Assumption 2.3. Assume the source signal function and the noise function are independent up to the 4th moment, in the sense that for any $i_1, i_2, j_1, j_2 \ge 0$ such that $i_1 + i_2 + j_1 + j_2 \le 4$,
$$\big\| E_{S \sim \nu_t^{(s)}}\big[(AS)^{\otimes i_1} \otimes E_{Y \sim \nu_t^{(\epsilon)}}[Y^{\otimes j_1}] \otimes (AS)^{\otimes i_2}\big] - E_{(S,Y) \sim \nu_t^{(s,\epsilon)}}\big[(AS)^{\otimes i_1} \otimes Y^{\otimes j_1} \otimes (AS)^{\otimes i_2}\big] \big\|_F \le g(t),$$
$$\big\| E_{Y \sim \nu_t^{(\epsilon)}}\big[Y^{\otimes j_1} \otimes E_{S \sim \nu_t^{(s)}}[(AS)^{\otimes i_1}] \otimes Y^{\otimes j_2}\big] - E_{(S,Y) \sim \nu_t^{(s,\epsilon)}}\big[Y^{\otimes j_1} \otimes (AS)^{\otimes i_1} \otimes Y^{\otimes j_2}\big] \big\|_F \le g(t),$$
for the same function $g$ as in Assumption 2.1, where $(s, \epsilon)$ is the function obtained by concatenating $s$ and $\epsilon$.

The sufficiency of such weaker assumptions is also discussed in the paper of Frieze et al. (1996). The next proposition shows that these assumptions are all satisfied, with high probability, in the traditional stochastic setting of the ICA model with Gaussian noise independent of the source signals.

Proposition 2.4. In the traditional stochastic setting of ICA, that is, when $(s(t))_{t \in [T]}$ is an i.i.d. sequence, independent of the i.i.d. Gaussian noise sequence $(\epsilon(t))_{t \in [T]}$, there exist $L = \mathrm{Poly}\big(\|A\|_{\max}, \|\Sigma\|_2, C, d, \tfrac{1}{\delta}\big)$ and $g(t) = L/\sqrt{t}$ such that Assumptions 2.1 and 2.3 hold with probability at least $1 - \delta$.

On the other hand, our setting can also cover some other examples excluded by the traditional setting, such as the example of Figure 1 in Section 1.

Example 2.5. Assume that the unknown sources $s_i$ ($1 \le i \le d$) are deterministic and periodic.
Our observation $x = As + \epsilon$ is a linear mixture of $s$ contaminated by i.i.d. Gaussian noise at each time step, where $A$ is a nonsingular matrix and $\epsilon \sim \mathcal{N}(0, \Sigma)$ is Gaussian. Even though $\epsilon$ is i.i.d. over time steps, the observations cannot satisfy the traditional i.i.d. assumption, since the source $s$ is deterministic. However, it can be proved that if the ratio of the periods of each pair $(s_i, s_j)$ is irrational, this example satisfies all the assumptions above for $T$ large enough.

Our setting also extends the traditional one to a practically important case, Markov sources.

Example 2.6. Assume that each $s_i$ ($1 \le i \le d$) is a stationary and ergodic Markov source, and the sources are independent of each other. The observations are generated as in Example 2.5. Because of the Markov property, the observations do not satisfy the i.i.d. assumptions. On the other hand, it can be verified that this example satisfies the above assumptions.

3. Main Results

The ICA approach requires that the components $s_i$ of the source signal $s$ be statistically independent. In our setup, we require that the empirical distribution $\nu_T^{(s)}$ be close to a product distribution. Fix some product distribution $\mu = \mu_1 \otimes \cdots \otimes \mu_d$ over $\mathbb{R}^d$ such that $E_{S_i \sim \mu_i}[S_i] = 0$ and $\kappa_i := \big| E_{S_i \sim \mu_i}[S_i^4] - 3\, E_{S_i \sim \mu_i}[S_i^2]^2 \big| \neq 0$. Let $K$ denote the diagonal matrix $\mathrm{diag}(\kappa_1, \ldots, \kappa_d)$, and define $\kappa_{\max} = \max_i \kappa_i$ and $\kappa_{\min} = \min_i \kappa_i$. To measure the distance of $\nu_T^{(s)}$ from $\mu$, define the following family of distances between two distributions: given two distributions $\nu_1$ and $\nu_2$ over $\mathbb{R}^d$, let $D_k(\nu_1, \nu_2) = \sup_{f \in F} \big| \int f(s)\, d\nu_1(s) - \int f(s)\, d\nu_2(s) \big|$, where $F = \{f : \mathbb{R}^d \to \mathbb{R} : f(s) = \prod_{j=1}^{k} s_{i_j},\ 1 \le i_1, \ldots, i_k \le d\}$ is the set of all monomials up to degree $k$. Finally, let
$$\xi = 6C^2 D_2\big(\mu, \nu_T^{(s)}\big) + D_4\big(\mu, \nu_T^{(s)}\big). \qquad (3)$$
In general, we will need a condition that $\xi$ is small enough, so that the components of $s$ are sufficiently independent. To this end, one should choose $\mu$ to minimize $\xi$; however, such a minimizer does not always exist. Generally, $\mu$ can be selected as the product of the limit distributions, if applicable, of the individual sources. On the other hand, in the traditional stochastic setting where the observations are i.i.d. samples, the empirical distribution converges to the population distribution, which, based on the independence assumption, is a product probability measure. Therefore, in this case, $\xi$ will be small for large enough sample sizes.

Example 3.1. Let $\mu_1$ be a Bernoulli distribution with $\mu_1(\{0.5\}) = 1/2$ and $\mu_1(\{-0.5\}) = 1/2$, and let $\mu_2$ be the distribution with density function $p(x) = \frac{1}{\pi\sqrt{1-x^2}}$ for $-1 \le x \le 1$. For the demonstration example in Figure 1, pick $\mu = \mu_1 \otimes \mu_2$. It is easy to see that $\mu_1$ (respectively, $\mu_2$) is the limit distribution of source 1 (respectively, source 2). Write $T = 2u + b$ as division with remainder, where $u$ is an integer and $0 \le b < 2$; moreover, assume $b \le 1$ (a similar analysis goes through for $b > 1$). The induced distribution $\nu_T^{(s_1)}$ of source 1 is $\nu_T^{(s_1)}(\{0.5\}) = \frac{u+b}{T}$ and $\nu_T^{(s_1)}(\{-0.5\}) = \frac{u}{T}$. Thus the total variation distance between $\mu_1$ and $\nu_T^{(s_1)}$ is at most $1/(2T)$. Similarly, it can be verified that the total variation distance between $\nu_T$ and $\mu$ also decays as $1/T$. Thus, $D_4$ is $O(1/T)$, since the monomials $f(s)$ in the definition of $D_4$ are bounded from above by 1. Lastly, note that $D_2$ is upper bounded by $D_4$ by definition, so $\xi$ decays at a $1/T$ rate.
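For intuition, the following brute-force numpy sketch estimates $D_k$ between two distributions that are both represented by samples (rows of an array), by comparing all mixed moments with $k$ factors, and then combines $D_2$ and $D_4$ into $\xi$ as in (3). The function names are ours, and the enumeration is exponential in $k$, so this sketch is only meant for small $d$ and $k$.

```python
import itertools
import numpy as np

def D_k(samples1, samples2, k):
    """Largest absolute difference of the mixed moments
    E[s_{i1} * ... * s_{ik}] between two empirical distributions;
    samples1 and samples2 are (n, d) arrays of observations."""
    d = samples1.shape[1]
    diff = 0.0
    for idx in itertools.product(range(d), repeat=k):
        m1 = np.prod(samples1[:, list(idx)], axis=1).mean()
        m2 = np.prod(samples2[:, list(idx)], axis=1).mean()
        diff = max(diff, abs(m1 - m2))
    return diff

def xi(samples_mu, samples_s, C):
    """xi = 6*C^2*D_2 + D_4 from (3), with both mu and nu_T^(s)
    represented by samples."""
    return 6 * C**2 * D_k(samples_mu, samples_s, 2) + D_k(samples_mu, samples_s, 4)
```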
Now we are ready to state our main result, which shows the existence of polynomial-time algorithms for ICA that reconstruct the mixing matrix $A$ with an error that vanishes at an $O(1/\sqrt{T})$ rate for $T$ samples and is also polynomial in the natural parameters of the problem:

Theorem 3.2. Consider the ICA problem (2). There exists an algorithm that estimates the mixing matrix $A$ from $T$ samples of $x$ such that (i) the computational complexity of the algorithm is $O(d^3 T)$; and (ii) if Assumptions 2.1 and 2.3 are satisfied, $T \ge \mathrm{Poly}\big(d, \tfrac{1}{\kappa_{\min}}, \tfrac{1}{\delta}, L, C, \sigma_{\max}, \tfrac{1}{\sigma_{\min}}\big)$, and there exists a product distribution $\mu$ such that $D_4(\mu, \nu_T^{(s)}) \le \mathrm{Poly}\big(\tfrac{1}{C}, \sigma_{\min}, \tfrac{1}{\sigma_{\max}}, \ldots\big)$, then, with probability at least $1 - \delta$, there exist a permutation $\pi$ and constants $\{c_1, \ldots, c_d\}$ such that for all $1 \le k \le d$,
$$\big\| c_k \hat A_{\pi(k)} - A_k \big\|_2 \le C' \big( D_4(\mu, \nu_T^{(s)}) + g^2(T) + g(T) \big),$$
where $C' = \mathrm{Poly}\big(\sigma_{\max}, 1/\sigma_{\min}, 1/\kappa_{\min}, 1/\delta, d, C, L\big)$, and $\hat A$ is the output of the algorithm. In particular, in the traditional stochastic setting, if $S$ has distribution $\mu$ and $T \ge \mathrm{Poly}\big(d, \tfrac{1}{\kappa_{\min}}, \tfrac{1}{\delta}, C, \sigma_{\max}, \tfrac{1}{\sigma_{\min}}, \|\Sigma\|_2\big)$, then, with probability at least $1 - \delta$, there exist a permutation $\pi$ and constants $\{c_1, \ldots, c_d\}$ such that for all $1 \le k \le d$,
$$\big\| c_k \hat A_{\pi(k)} - A_k \big\|_2 \le \mathrm{Poly}\big(C, \sigma_{\max}, \tfrac{1}{\sigma_{\min}}, \tfrac{1}{\kappa_{\min}}, \tfrac{1}{\delta}, \ldots\big) \cdot \tfrac{1}{\sqrt{T}}.$$

Remark 3.3. Note that the result is polynomial in $1/\delta$, which is weaker than being polynomial in $\log(1/\delta)$.

In the next sections we will present two algorithms, DICA (Algorithm 2) and HKICA.R (Algorithm 3) in Section 5, that satisfy the theorem.

4. Estimating Moments: the HKICA Algorithm

In this section we introduce the ICA method of Hsu and Kakade (2013), which is based on the well-known excess-kurtosis-like quantity defined as follows: for any $p \ge 1$, $\eta \in \mathbb{R}^d$, and distribution $\nu$ over $\mathbb{R}^d$, let
$$m_p^{(\nu)}(\eta) = E_{X \sim \nu}\big[(\eta^\top X)^p\big], \qquad (4)$$
$$f_\nu(\eta) = \tfrac{1}{12}\big( m_4^{(\nu)}(\eta) - 3\, m_2^{(\nu)}(\eta)^2 \big). \qquad (5)$$
Hsu and Kakade (2013) showed that $\nabla^2 f_{\nu_T^{(x)}}(\eta)$, the second derivative (Hessian) of the function $f_{\nu_T^{(x)}}$, is extremely useful for the ICA problem: they showed that if $\mu^{(X)}$ is the distribution of the observations $X$ in the stochastic setting where $S$ comes from the product distribution $\mu$, then $f_{\mu^{(X)}}(\eta) = f_{A\mu}(\eta)$ for all $\eta$ (where $A\mu$ denotes the distribution of $AS$) and, consequently, the eigenvectors¹ of the matrix $M = \nabla^2 f_{\mu^{(X)}}(\phi)\big(\nabla^2 f_{\mu^{(X)}}(\psi)\big)^{-1}$ are the rescaled columns of $A$ if the ratios $\phi^\top A_i / \psi^\top A_i$ are distinct for all $i$. Thus, to obtain an algorithm, one needs to estimate $\nabla^2 f_{\mu^{(X)}}$ in such a way that the noise $\epsilon$ can still be neglected. An estimate $\nabla^2 \hat f$ of $\nabla^2 f_{\mu^{(X)}}$ is not hard to obtain, since for any $\nu$, $\nabla^2 f_\nu(\eta)$ can be computed as
$$\nabla^2 f_\nu(\eta) = G_\nu(\eta) := G_1^{(\nu)}(\eta) - G_2^{(\nu)}(\eta) - 2\, G_3^{(\nu)}(\eta), \qquad (6)$$
where $G_1^{(\nu)}(\eta) = E_{X \sim \nu}\big[\langle \eta, X \rangle^2 X X^\top\big]$, $G_2^{(\nu)}(\eta) = E_{X \sim \nu}\big[\langle \eta, X \rangle^2\big]\, E_{X \sim \nu}\big[X X^\top\big]$, and $G_3^{(\nu)}(\eta) = E_{X \sim \nu}\big[\langle \eta, X \rangle X\big]\, E_{X \sim \nu}\big[\langle \eta, X \rangle X^\top\big]$, and these quantities can be estimated using the observed samples. In what follows, we will use the estimate $\nabla^2 \hat f := \nabla^2 f_{\nu_T^{(x)}}$ and, in general, we will add a hat to quantities that are derived from the empirical distribution $\nu_T^{(x)}$. It is important to note that, under our assumptions, the noise $\epsilon$ has limited effect on the estimation procedure, as shown in the full version of the paper (Huang et al., 2015). In particular, the difference in the estimation of the Hessian matrix caused by the noise is at most $\mathrm{Poly}(L_\eta, L, d, \sigma_{\max}, C)\,(g(T)+1)\,g(T)$, where $L_\eta$ is a bound on $\|\eta\|_2$; denote this quantity by $P(L_\eta)$. Note that this error caused by the noise decays at a rate of $g(T)$. Putting everything together, we obtain the algorithm HKICA, named after Hsu and Kakade (2013), which is shown in Algorithm 1.
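As a concrete illustration, here is a minimal numpy sketch of the empirical estimate $\nabla^2 \hat f(\eta) = \hat G_1(\eta) - \hat G_2(\eta) - 2\hat G_3(\eta)$ from (6); the function name is ours, and `X` is assumed to hold the observations $x(1), \ldots, x(T)$ as rows. The trailing comments indicate how Algorithm 1 below uses this primitive.

```python
import numpy as np

def empirical_hessian(X, eta):
    """Estimate of grad^2 f(eta) = G1 - G2 - 2*G3 from (6),
    computed from the empirical distribution of the rows of X (shape (T, d))."""
    T = X.shape[0]
    p = X @ eta                                    # <eta, x(t)> for all t
    G1 = (X * (p ** 2)[:, None]).T @ X / T         # E[<eta,X>^2 X X^T]
    G2 = np.mean(p ** 2) * (X.T @ X) / T           # E[<eta,X>^2] E[X X^T]
    g = (X * p[:, None]).mean(axis=0)              # E[<eta,X> X]
    G3 = np.outer(g, g)                            # E[<eta,X> X] E[<eta,X> X^T]
    return G1 - G2 - 2.0 * G3

# Usage mirroring Algorithm 1 below (HKICA):
#   phi, psi = np.random.randn(d), np.random.randn(d)
#   M = empirical_hessian(X, phi) @ np.linalg.inv(empirical_hessian(X, psi))
#   _, vecs = np.linalg.eig(M)   # columns estimate the rescaled columns of A
```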
Algorithm 1 The HKICA algorithm
input: $x(t)$ for $1 \le t \le T$.
output: An estimate of the mixing matrix $A$.
1: Sample $\phi$ and $\psi$ independently from a standard Gaussian distribution of dimension $d$;
2: Evaluate $\nabla^2 \hat f(\phi)$ and $\nabla^2 \hat f(\psi)$;
3: Compute $\hat M = (\nabla^2 \hat f(\phi))(\nabla^2 \hat f(\psi))^{-1}$;
4: Compute all the eigenvectors of $\hat M$, $\{\mu_1, \ldots, \mu_d\}$;
5: Return $\hat A = (\mu_1, \ldots, \mu_d)$.

¹ Throughout the paper, eigenvectors always mean right eigenvectors, unless specified otherwise.

4.1. Analysis of HKICA

Hsu and Kakade (2013) claimed that HKICA is easy to analyze using matrix perturbation techniques. In this section we provide a rigorous analysis of the algorithm, which reveals some unexpected complications.

Definition 4.1. Let $E_\psi$ denote the following event: for some fixed $C_1 = \sqrt{2d}\,\ell$ with $0 \le \ell \le 1$ and $L_u \ge \sqrt{2d}$, $\min_i |\psi^\top A_i| \ge C_1$ and $\|\psi\|_2 \le L_u$ hold simultaneously.

The performance of the HKICA algorithm will essentially depend on the minimum eigenvalue spacing
$$\gamma_A = \min_{i, j : i \neq j} \left| \Big(\frac{\phi^\top A_i}{\psi^\top A_i}\Big)^2 - \Big(\frac{\phi^\top A_j}{\psi^\top A_j}\Big)^2 \right|, \qquad (7)$$
as shown in the following theorem.

Theorem 4.2. Suppose Assumptions 2.1 and 2.3 hold. Furthermore, assume that $T \ge \mathrm{Poly}\big(d, L_u, C, \sigma_{\max}, \kappa_{\max}, L, \tfrac{1}{\ell}, \tfrac{1}{\kappa_{\min}}, \tfrac{1}{\sigma_{\min}}, \ldots\big)$ and that there exists a product measure $\mu$ such that $\xi \le \mathrm{Poly}\big(\gamma_A, \tfrac{1}{L_u}, \tfrac{1}{\sigma_{\max}}, \tfrac{1}{\kappa_{\max}}, \kappa_{\min}, \sigma_{\min}, \ell\big)$. Then, on the event $E_\psi$, there exist a permutation $\pi$ and constants $\{c_1, \ldots, c_d\}$ such that
$$\max_{1 \le k \le d} \big\| c_k \hat A_{\pi(k)} - A_k \big\|_2 \le \frac{1}{\gamma_A}\,\big(\xi + P(L_u)\big)\, Q, \qquad (8)$$
where $\hat A$ is the output of the HKICA algorithm and $Q = \mathrm{Poly}\big(d, L_u, \sigma_{\max}, \kappa_{\max}, \tfrac{1}{\kappa_{\min}}, \tfrac{1}{\sigma_{\min}}, \ldots\big)$.

Remark 4.3. (i) Note that the bound in (8) goes to zero at an $O(1/\sqrt{T})$ rate whenever $D_4(\mu, \nu_T^{(s)}) = O(1/\sqrt{T})$ and $g(T) = O(1/\sqrt{T})$, as, e.g., in the stochastic setting. (ii) The parameter $1/\gamma_A$ is essential in the above theorem, in the sense that not only is the reconstruction error bound linear in $1/\gamma_A$, but the condition also requires $1/\gamma_A$ to be small for the above error bound to be valid. Also, since $\gamma_A$ is the minimum spacing of the eigenvalues of $M = \nabla^2 f_{A\mu}(\phi)\big(\nabla^2 f_{A\mu}(\psi)\big)^{-1}$, the eigenvalue perturbations caused by the noise cannot be too large compared to $\gamma_A$ without potentially ruining the eigenvectors of $M$; thus, the dependence on $\gamma_A$ seems to be necessary.

Despite the important role that $\gamma_A$ plays in the efficiency of the HKICA algorithm, it is not clear how it depends on different properties of $A$. To the best of our knowledge, even a polynomial (in the dimension $d$) lower bound on $\gamma_A$ is not yet available in the literature. Similar problems have been discussed by Hüsler (1987) and Goyal et al. (2014), but their solutions are not applicable to our case.

5. A Refined HKICA Algorithm

The problems with $\gamma_A$ motivate us to refine the HKICA algorithm. The idea is inspired by Arora et al. (2012) and Frieze et al. (1996), using a quasi-whitening procedure: one can show that $\nabla^2 f_\mu(\psi) = A K D_\psi A^\top$ where $D_\psi = \mathrm{diag}\big((\psi^\top A_1)^2, \ldots, (\psi^\top A_d)^2\big)$, and so a matrix $B$ satisfying $BB^\top = \nabla^2 f_\mu(\psi)$ can be written as $B = A K^{1/2} D_\psi^{1/2} R$ for some orthonormal matrix $R$. Defining $T_i = \nabla^2 f_\mu(B^{-\top} \phi_i)$, one can calculate that $T_i = A K^{1/2} D_\psi^{-1/2} \Lambda_i A^\top$, where $\Lambda_i = \mathrm{diag}\big((\phi_i^\top R_1)^2, \ldots, (\phi_i^\top R_d)^2\big)$ and $R_i$ denotes the $i$th column of $R$. Then
$$M = T_1 T_2^{-1} = A \Lambda A^{-1} \quad \text{with} \quad \Lambda = \Lambda_1 \Lambda_2^{-1} = \mathrm{diag}\left( \Big(\frac{\phi_1^\top R_1}{\phi_2^\top R_1}\Big)^2, \ldots, \Big(\frac{\phi_1^\top R_d}{\phi_2^\top R_d}\Big)^2 \right).$$
Thus, the $A_i$ are again the eigenvectors of $M$, but now the eigenvalues of $M$ are defined in terms of the orthogonal matrix $R$ instead of $A$, and so the resulting minimum spacing
$$\gamma_R = \min_{i, j : i \neq j} \left| \Big(\frac{\phi_1^\top R_i}{\phi_2^\top R_i}\Big)^2 - \Big(\frac{\phi_1^\top R_j}{\phi_2^\top R_j}\Big)^2 \right|$$
is much easier to handle. The resulting algorithm, called Deterministic ICA (DICA), is shown in Algorithm 2.

Algorithm 2 Deterministic ICA (DICA)
input: $x(t)$ for $1 \le t \le T$.
output: An estimate of the mixing matrix $A$.
1: Sample $\psi$ from a $d$-dimensional standard Gaussian distribution;
2: Evaluate $\nabla^2 \hat f(\psi)$;
3: Compute $\hat B$ such that $\nabla^2 \hat f(\psi) = \hat B \hat B^\top$;
4: Sample $\phi_1$ and $\phi_2$ independently from the standard Gaussian distribution;
5: Compute $\hat T_1 = \nabla^2 \hat f(\hat B^{-\top} \phi_1)$ and $\hat T_2 = \nabla^2 \hat f(\hat B^{-\top} \phi_2)$;
6: Compute all the eigenvectors of $\hat M = \hat T_1 \hat T_2^{-1}$, $\{\mu_1, \ldots, \mu_d\}$;
7: Return $\hat A = (\mu_1, \ldots, \mu_d)$.

Note that on the event $E_\phi$, $\|\phi_j^\top R\|_2 \le L_u$ for $j \in \{1, 2\}$. We will show later that this event $E_\phi$, as well as other events defined later, hold simultaneously with high probability.

Definition 5.1. Let $E_\phi$ denote the following event: for some fixed constant $L_u \ge \sqrt{2d}$ and $\ell_l$ such that $\ell_l = \sqrt{\tfrac{\pi}{2d}}\,\ell$ for some $0 \le \ell \le 1$, $\|\phi_1\|_2 \le L_u$, $\|\phi_2\|_2 \le L_u$, and $\min_i |\phi_2^\top R_i| \ge \ell_l$ hold simultaneously.

Similarly to Theorem 4.2, one can show that under some technical assumptions, which hold with probability 1 if $\xi$, $P(L_u)$, and $P\big(2\sigma_{\min}\kappa_{\min}^{1/2} C_1\big)$ are small enough, on the event $E_\psi \cap E_\phi$ there exist a permutation $\pi$ and constants $\{c_1, \ldots, c_d\}$ such that for $1 \le k \le d$,
$$\big\| c_k \hat A_{\pi(k)} - A_k \big\|_2 \le \frac{4 \sigma_{\max}^2}{\gamma_R\, \sigma_{\min}}\, Q,$$
where $\hat A$ is the output of the DICA algorithm and $Q$ is polynomial in the usual problem parameters and decays roughly as $(\xi + P(L_u))$. Details are given in the full version of the paper (Huang et al., 2015). The result is very similar to that of Theorem 4.2, with $\gamma_R$ in place of $\gamma_A$, as required.

To analyze $\gamma_R$ analytically, note that $\phi_1$ and $\phi_2$ are independently sampled from the standard Gaussian distribution. Thus, $\{\phi_1^\top R_1, \ldots, \phi_1^\top R_d, \phi_2^\top R_1, \ldots, \phi_2^\top R_d\}$ are $2d$ independent standard Gaussian random variables. Let $Z_i = \frac{\phi_1^\top R_i}{\phi_2^\top R_i}$. Therefore, the $Z_i$, $1 \le i \le d$, are $d$ independent Cauchy(0, 1) random variables. Using this observation, we show in the full version (Huang et al., 2015) that, among others, $\gamma_R \ge \frac{\delta}{2d^2}$ with probability at least $1 - \delta$. Based on the above, one can show that Theorem 3.2 holds for DICA (Huang et al., 2015). Furthermore, a heuristic modification of DICA can also be derived that performs better in the experiments, but proving performance guarantees for that algorithm has defied our efforts so far (details are given in the full version of the paper, Huang et al., 2015).

5.1. Recursive Versions

Recently, Vempala and Xiao (2014) proposed a recursion idea to improve the sample complexity of the Fourier PCA algorithm of Goyal et al. (2014). Instead of recovering all the columns of $A$ in a single eigen-decomposition, the recursive algorithm only decomposes the whole space into two subspaces according to the maximal spacing of the eigenvalues, and then recursively decomposes each subspace until they are all 1-dimensional. The insight behind this recursive procedure is the following: when the maximal spacing of the eigenvalues is much larger than the minimal one, the algorithm may win over a single decomposition even with the errors accumulating through the recursion. However, this algorithm is based on the assumption that the mixing matrix is orthonormal, so that the projection onto its subspaces can always eliminate some components of the source signal.

We adapt the above idea to our algorithms. Due to space limitations, we will only consider the simplest recursive algorithm, the recursive version of HKICA, as an example. To force an orthonormal mixing matrix, we first compute the square-root matrix $B$ of $\nabla^2 f(\psi) = A D_\psi K A^\top$; thus $B = A D_\psi^{1/2} K^{1/2} R$ for some orthonormal matrix $R$. Transforming our observations by $B^{-1}$, we obtain the new observations $y(t) = B^{-1} x(t) = B^{-1} A s(t) + B^{-1} \epsilon(t) = R^\top D_\psi^{-1/2} K^{-1/2} s(t) + B^{-1} \epsilon(t)$. Note that the transformed noise vector $B^{-1} \epsilon(t)$ is still Gaussian. Also, $D_\psi^{-1/2} K^{-1/2}$ is diagonal, thus $R^\top D_\psi^{-1/2} K^{-1/2} s(t)$ is an orthonormal mixture of independent sources. We then apply the recursive algorithm to this orthonormal mixture to recover its mixing matrix, denoted $\hat R$ in Algorithm 3. Finally, $B \hat R$ gives an estimate of $A$ up to scaling.
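The following numpy sketch illustrates this quasi-orthogonalization step, reusing the `empirical_hessian` estimator sketched in Section 4; the function name is ours, and we assume for simplicity that the estimated Hessian is positive definite (as when all excess kurtoses are positive); otherwise a square root of a suitably sign-corrected matrix would be needed.

```python
import numpy as np

def quasi_orthogonalize(X, empirical_hessian):
    """Transform the observations (rows of X) so that the new mixing
    matrix is approximately orthonormal: B B^T = grad^2 f(psi) and
    y(t) = B^{-1} x(t).  Returns the transformed data and B."""
    T, d = X.shape
    psi = np.random.randn(d)
    H = empirical_hessian(X, psi)               # ~ A D_psi K A^T
    H = (H + H.T) / 2                           # symmetrize against estimation error
    w, V = np.linalg.eigh(H)
    B = V @ np.diag(np.sqrt(np.abs(w))) @ V.T   # symmetric square root of H
    Y = X @ np.linalg.inv(B).T                  # rows of Y are y(t) = B^{-1} x(t)
    return Y, B
```

HKICA.R (Algorithm 3 below) then runs the recursive Decompose procedure on the transformed data and multiplies $B$ by the recovered orthonormal matrix.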
To recover this orthonormal mixing matrix with a recursive algorithm, we follow the idea of HKICA (and DICA) and compute two Hessian matrices $T_1 = R D_\psi^{-1} \Lambda_1 R^\top$ and $T_2 = R D_\psi^{-1} \Lambda_2 R^\top$. Then, instead of computing the eigen-decomposition of $T_0 = T_1 T_2^{-1}$ (as in HKICA), we only decompose its eigenspace into two subspaces, according to the maximal spacing of the eigenvalues of $T_0$. The Decompose helper function takes a projection matrix $P$ of a subspace spanned by some columns of $R$ (without loss of generality, we assume it is the first $k$ columns of $R$). Then we compute the projection of $T_0$ as $M = P^\top T_0 P$. Thus the eigenspace of $P M P^\top$ is in the span of $P$. Lastly, by separating the eigenvectors of $M$ according to its eigenvalues into $P P_1$ and $P P_2$, the Decompose function repeatedly decomposes the subspaces into two smaller subspaces.

Algorithm 3 Recursive version of HKICA (HKICA.R)
input: $x(t)$ for $1 \le t \le T$.
output: An estimate of the mixing matrix $A$.
1: Sample $\psi$ from a $d$-dimensional standard Gaussian distribution;
2: Evaluate $\nabla^2 \hat f(\psi) = \hat G(\psi)$;
3: Compute $\hat B$ such that $\nabla^2 \hat f(\psi) = \hat B \hat B^\top$;
4: Compute $\hat y(t) = \hat B^{-1} x(t)$ for $1 \le t \le T$;
5: Let $P = I_d$;
6: Compute $\hat R = \mathrm{Decompose}(\hat y, P)$;
7: Return $\hat B \hat R$.

Algorithm 4 The Decompose helper function
input: $x(t)$ for $1 \le t \le T$, a projection matrix $P \in \mathbb{R}^{d \times k}$ ($d \ge k$).
output: An estimate of the mixing matrix, in $\mathbb{R}^{d \times k}$.
1: If $k = 1$, return $P$;
2: Sample $\phi_1$ and $\phi_2$ independently from a standard Gaussian distribution of dimension $d$;
3: Evaluate $\nabla^2 \hat f(\phi_1)$ and $\nabla^2 \hat f(\phi_2)$;
4: Compute $\hat T = (\nabla^2 \hat f(\phi_1))(\nabla^2 \hat f(\phi_2))^{-1}$;
5: Compute $\hat M = P^\top \hat T P$;
6: Compute the eigen-decomposition of $\hat M$, with eigenvalues $\{\sigma_1, \ldots, \sigma_k\}$, where $\sigma_1 \ge \cdots \ge \sigma_k$, and corresponding eigenvectors $\{\mu_1, \ldots, \mu_k\}$;
7: Find the index $m = \arg\max_m (\sigma_m - \sigma_{m+1})$;
8: Let $P_1 = (\mu_1, \ldots, \mu_m)$ and $P_2 = (\mu_{m+1}, \ldots, \mu_k)$;
9: Compute $W_1 = \mathrm{Decompose}(x, P P_1)$ and $W_2 = \mathrm{Decompose}(x, P P_2)$;
10: Return $[W_1, W_2]$.

Remark 5.2. Other algorithms can be modified into a recursive version in a similar way.

Theorem 5.3. Under the conditions of Theorem 3.2, with probability at least $1 - \delta$, the recursive version of HKICA returns a mixing matrix $\hat A$ with an error $\|\hat A - ADP\|_2$ bounded by $\mathrm{Poly}\big(d, \tfrac{1}{\kappa_{\min}}, \tfrac{1}{\sigma_{\min}}, \tfrac{1}{\ell}, L_u, L, C, \sigma_{\max}\big)\,(\xi + Q_2)$ for some diagonal matrix $D$ and permutation matrix $P$.

Remark 5.4. Note that when $T$ is large enough, the term $Q_2$ will be dominated by $\xi$, which is the error carried over from the quasi-whitening. The recursion idea improves the sample complexity of the eigen-decomposition (used to recover the orthonormal mixing matrix $R$).
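For illustration, here is a compact numpy sketch of the Decompose recursion of Algorithm 4, written against the `empirical_hessian` estimator sketched in Section 4 and applied to the whitened observations; the function and variable names are ours, and we simply keep the real parts of the eigen-decomposition, which is exact only in the idealized noiseless case.

```python
import numpy as np

def decompose(Y, P, empirical_hessian):
    """Sketch of Algorithm 4: split the subspace spanned by the columns of P
    at the largest eigenvalue gap and recurse until all subspaces are 1-dim.
    Y is the (T, d) array of whitened observations; P is d x k."""
    d, k = P.shape
    if k == 1:
        return P
    phi1, phi2 = np.random.randn(d), np.random.randn(d)
    T0 = empirical_hessian(Y, phi1) @ np.linalg.inv(empirical_hessian(Y, phi2))
    M = P.T @ T0 @ P                              # k x k projected matrix
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)                # sort eigenvalues decreasingly
    vals, vecs = vals.real[order], vecs.real[:, order]
    m = np.argmax(vals[:-1] - vals[1:]) + 1       # split at the largest gap
    W1 = decompose(Y, P @ vecs[:, :m], empirical_hessian)
    W2 = decompose(Y, P @ vecs[:, m:], empirical_hessian)
    return np.hstack([W1, W2])

# HKICA.R would call decompose(Y, np.eye(d), empirical_hessian) on the
# quasi-orthogonalized data Y and return B @ R_hat.
```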
6. Experimental Results

In this section we compare the performance of different ICA algorithms on synthetic examples, with mixing matrices of different coherences. We test 9 algorithms: HKICA (HKICA) and its recursive version (HKICA.R); DICA (DICA) and its recursive version (DICA.R); the modified version of DICA (MDICA) and its recursive version (MDICA.R); the default FastICA algorithm from the ITE toolbox (Szabó et al., 2012) (FICA); the recursive Fourier PCA algorithm of Xiao (2014) (FPCA); and random guessing (Random). FPCA is modified so that it can be applied to the case of a non-orthogonal mixing matrix.

In the simulation, a common mixing matrix $A$ of dimension 6 is generated in the following ways. We construct four kinds of matrices: $A_1 = P$; $A_2 = v_b \mathbf{1}^\top + 0.3\, P$; $A_3 = v_b \mathbf{1}^\top + 0.05\, P$; and $A_4 = v_b \mathbf{1}^\top + 0.005\, P$. Here the vector $v_b$ and the matrix $P$ are both generated from the standard normal distribution (with the appropriate dimensions). Then all the mixing matrices are rescaled to the same magnitude. We also generate an orthonormal mixing matrix $R$, obtained by computing an orthonormal basis of the column space of a nonsingular random matrix (with standard normal entries). Then we generate a 6-dimensional BPSK signal $s$ as follows: we fix a vector $p$ of frequencies, generate $\{+1, -1\}$-valued sequences $q_i(t)$ uniformly at random for $1 \le t \le T$, and set $s_i(t) = q_i(t) \sin(p_i t)$. Note that in order to have the components of $s$ close to independent, the ratios of their frequencies need to be irrational. Lastly, the observed signal is generated as $x = As + c\epsilon$, where $\epsilon$ is noise generated from a $d$-dimensional normal distribution with a randomly generated covariance. We take $T = 20000$ instances of the observed signal at time steps $t = 1, \ldots, 20000$, and test the noise ratio $c$ from 0 (noise-free) to 1 (very noisy). All the algorithms are evaluated over 150 repetitions; for each repetition, we try 3 times and report the best.

Figure 2. Reconstruction error as a function of the noise ratio $c$ for the mixing matrices $R$, $A_1$, $A_2$, $A_3$, and $A_4$, comparing FICA, HKICA, MDICA, DICA, FPCA, HKICA.R, DICA.R, and MDICA.R.

We measure the performance of the algorithms by the actual reconstruction error. In particular, we evaluate the following quantity between the true mixing matrix $A$ and the estimate $\hat A$ returned by the algorithms: $\min_{\Pi, S} \|\hat A \Pi S - A\|_F$, where $\Pi$ is a permutation matrix and $S$ is a diagonal column-scaling matrix. The calculation of this measure requires an exhaustive search for the optimal permutation.
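A sketch of this error measure (with our own function name), performing the exhaustive search over permutations mentioned above and using, for each candidate matching, the per-column scaling that minimizes the Frobenius error:

```python
import itertools
import numpy as np

def reconstruction_error(A_hat, A):
    """min over column permutations Pi and diagonal scalings S of
    ||A_hat @ Pi @ S - A||_F, by exhaustive search over permutations
    (feasible here since d = 6)."""
    d = A.shape[1]
    best = np.inf
    for perm in itertools.permutations(range(d)):
        err = 0.0
        for k, j in enumerate(perm):              # estimated column j matched to true column k
            a_hat, a = A_hat[:, j], A[:, k]
            c = (a_hat @ a) / (a_hat @ a_hat)     # optimal scaling for this pair
            err += np.sum((c * a_hat - a) ** 2)
        best = min(best, err)
    return np.sqrt(best)
```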
6.1. Results

We report the reconstruction errors for the different kinds of mixing matrices and noise ratios. The experimental results suggest that the moment methods are more robust to high-coherence mixing matrices and Gaussian noise than FastICA. FastICA achieves the best performance in the case of low coherence; as the coherence of the mixing matrix $A$ increases, its performance degrades quickly and it becomes sensitive to noise. We expected that DICA would achieve a smaller error for an extremely coherent $A$, since $1/\gamma_A$ would be much larger than $1/\gamma_R$. However, the experimental results indicate the opposite. Note that high coherence implies a small minimal singular value. In this case, the estimation error of $M$ in DICA can be much larger than that in HKICA, because it involves the fourth power of $A^{-1}$. This error overwhelms the improvement brought by the larger eigenvalue spacings if the sample size is not large enough. The investigation of this phenomenon is left for future work.

On the other hand, MDICA tries to achieve a small estimation error while, we expect, keeping the eigenvalue spacing large (intuitively, the spacing is approximately that of the squares of $d$ Gaussian random variables), leading to good performance. This is confirmed by the experimental results, for both the non-recursive and recursive versions.

The recursive idea is not always helpful for the moment methods. For a highly coherent $A$, the recursive versions outperform their non-recursive counterparts. Note that in this case $A$ is close to singular (small minimal singular value), and thus more samples are required. On the other hand, when $A$ has relatively low coherence, the estimation error of the fourth moments contributes more to the reconstruction error, and the recursive algorithms suffer from making several such estimates.

In summary, the results suggest that these moment methods are comparable to each other in practice, while FastICA is better for mixing matrices with low coherence, or mild coherence with low noise. If the mixing matrix is orthonormal, then FPCA performs better than the other algorithms. If the observations have large noise and the mixing matrix is not extremely coherent, then HKICA may be the best choice. In the case of an extremely coherent mixing matrix, MDICA performs best. Also, the recursive idea is very helpful for small sample sizes.

7. Conclusions

We considered the problem of independent component analysis with noisy observations. For the first time in the literature, we presented ICA algorithms that can recover non-Gaussian source signals with polynomial computational complexity and provable performance guarantees on the reconstruction error: for $T$ samples the reconstruction error vanishes at a $1/\sqrt{T}$ rate and depends only polynomially on the natural parameters of the problem. The algorithms do not depend on unknown problem parameters, and they also extend to deterministic sources with approximately independent empirical distributions.

Acknowledgements

This work was supported by Alberta Innovates Technology Futures and NSERC.

References

A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012a.
A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. arXiv preprint arXiv:1203.0683, 2012b.
A. Anandkumar, R. Ge, and M. Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014.
S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In Advances in Neural Information Processing Systems, pages 2375–2383, 2012.
J. Cardoso. High-order contrasts for independent component analysis. Neural Computation, 11(1):157–192, 1999.
A. Chen and P. J. Bickel. Consistent independent component analysis and prewhitening. Signal Processing, IEEE Transactions on, 53(10):3625–3632, 2005.
A. Chen, P. J. Bickel, et al. Efficient independent component analysis. The Annals of Statistics, 34(6):2825–2855, 2006.
P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.
A. DasGupta. Finite sample theory of order statistics and extremes. In Probability for Statistics and Machine Learning, pages 221–248. Springer, 2011.
A. Dermoune and T. Wei.
FastICA algorithm: Five criteria for the optimal choice of the nonlinearity function. Signal Processing, IEEE Transactions on, 61(5-8):2078–2087, 2013.
J. Eriksson and V. Koivunen. Characteristic-function-based independent component analysis. Signal Processing, 83(10):2195–2208, 2003.
A. Frieze, M. Jerrum, and R. Kannan. Learning linear transformations. In 37th IEEE Annual Symposium on Foundations of Computer Science, pages 359–368. IEEE Computer Society, 1996.
N. Goyal, S. Vempala, and Y. Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 584–593. ACM, 2014.
D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, pages 11–20. ACM, 2013.
R. Huang, A. György, and Cs. Szepesvári. Deterministic independent component analysis. In preparation, 2015.
J. Hüsler. Minimal spacings of non-uniform densities. Stochastic Processes and their Applications, 25:73–81, 1987.
A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. Neural Networks, IEEE Transactions on, 10(3):626–634, 1999.
A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4–5):411–430, 2000.
B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
J. Miettinen, S. Taskinen, K. Nordhausen, and H. Oja. Fourth moments and independent component analysis. arXiv preprint arXiv:1406.4765, 2014.
E. Oja and Z. Yuan. The FastICA algorithm revisited: Convergence analysis. Neural Networks, IEEE Transactions on, 17(6):1370–1381, 2006.
E. Ollila. The deflation-based FastICA estimator: statistical analysis revisited. Signal Processing, IEEE Transactions on, 58(3):1527–1541, 2010.
A. Samarov, A. Tsybakov, et al. Nonparametric independent component analysis. Bernoulli, 10(4):565–582, 2004.
G. W. Stewart and J.-g. Sun. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press, 1990. ISBN 9780126702309.
Z. Szabó, B. Póczos, and A. Lőrincz. Separation theorem for independent subspace analysis and its consequences. Pattern Recognition, 45:1782–1791, 2012.
P. Tichavský, Z. Koldovský, and E. Oja. Performance analysis of the FastICA algorithm and Cramér–Rao bounds for linear independent component analysis. Signal Processing, IEEE Transactions on, 54(4):1189–1203, 2006.
S. Vempala and Y. Xiao. Max vs min: Independent component analysis with nearly linear sample complexity. CoRR, abs/1412.2954, 2014. URL http://arxiv.org/abs/1412.2954.
T. Wei. The convergence and asymptotic analysis of the generalized symmetric FastICA algorithm. arXiv preprint arXiv:1408.0145, 2014.
Y. Xiao. Fourier PCA package. GitHub, 2014. URL https://github.com/yingusxiaous/libFPCA.