# Robustness of Nonlinear Representation Learning

Simon Buchholz 1 2, Bernhard Schölkopf 1 2 3

1 Max Planck Institute for Intelligent Systems, Tübingen, Germany. 2 Tübingen AI Center, Tübingen, Germany. 3 ELLIS Institute, Tübingen, Germany. Correspondence to: Simon Buchholz. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

We study the problem of unsupervised representation learning in slightly misspecified settings, and thus formalize the study of robustness of nonlinear representation learning. We focus on the case where the mixing is close to a local isometry in a suitable distance and show, based on existing rigidity results, that the mixing can be identified up to linear transformations and small errors. In a second step, we investigate Independent Component Analysis (ICA) with observations generated according to $x = f(s) = As + h(s)$, where $A$ is an invertible mixing matrix and $h$ a small perturbation. We show that we can approximately recover the matrix $A$ and the independent components. Together, these two results show approximate identifiability of nonlinear ICA with almost isometric mixing functions. Those results are a step towards identifiability results for unsupervised representation learning for real-world data that do not follow restrictive model classes.

1. Introduction

One of the fundamental problems of data analysis is the unsupervised learning of representations. Modern machine learning algorithms excel at learning accurate representations of very complex data distributions. However, those representations are, in general, not related to the true underlying latent factors of variation that generated the data. Nevertheless, it is desirable to not only match the training distribution, but also to identify the underlying causal structure, because such representations are expected to improve downstream performance, explainability, and robustness (Schölkopf et al., 2021). Generally, we can only hope to learn the ground truth model in identifiable settings, i.e., when, up to certain symmetries, there is a unique model in the considered class generating the observations. The simplest example of such a setting is linear ICA, where we observe linear mixtures of independent variables. Identifiability of linear ICA was shown under mild assumptions in Comon (1994). On the other hand, it is well known that for general nonlinear functions ICA is not identifiable (Hyvärinen & Pajunen, 1999). However, recently a flurry of results was proved showing identifiability under additional assumptions. Those results rely, among others, on restrictions of the function class (Taleb & Jutten, 1999; Horan et al., 2021; Gresele et al., 2021; Buchholz et al., 2022; Kivva et al., 2022), multi-environment data (Khemakhem et al., 2020; Hyvärinen et al., 2019), or interventional data (Ahuja et al., 2022; Squires et al., 2023; von Kügelgen et al., 2023; Buchholz et al., 2023); for a broader overview we refer to the recent survey by Hyvärinen et al. (2023). While those results brought significant progress to the field of Causal Representation Learning (CRL), they also generally require strong assumptions which will typically only hold approximately for real data. Thus, it is of great importance to understand the various robustness properties of such identifiability results.
A first step in this direction is the investigation of whether the latent variables can still be approximately identified in settings where a small amount of misspecification is allowed, i.e., when the model assumptions are mildly violated. Our analysis here focuses on theoretical questions regarding identifiability in misspecified settings, but this nevertheless has profound implications for the empirical side because it clarifies which assumptions do, and which do not, generate useful learning signals that can be exploited by suitable algorithms. This is particularly important since there is still a lack of algorithms that uncover the true latent structure for complex data beyond toy settings. In this work, we study the problem of approximate identifiability in a setting where the mixing function is close to a local isometry and the latent variables are independent (ICA). In particular, our results can be seen as a generalization of the identifiability result of Horan et al. (2021). The main contributions of this work can be summarized as follows.

1. We initiate the study of robustness for representation learning and clarify that some existing results are not robust to arbitrarily small amounts of misspecification.

2. We show, based on existing rigidity results from materials science, that we can identify the latent variables up to a linear transformation and a small perturbation when the mixing function f is close to a local isometry.

3. Then we carefully investigate slightly non-linear ICA, a setting of independent interest. We show that the linear part of the mixing can be identified up to a small error, which depends on the strength of the perturbation.

4. Finally, we combine the previous two results to show robustness for representation learning problems with independent latent variables and a mixing function close to a local isometry.

The rest of this paper is structured as follows. In Section 2 we introduce the general setting of representation learning, ICA, and local isometries, and we then consider the robustness of identifiability in representation learning in Section 3. Our results on approximate linear identifiability for approximate local isometries can be found in Section 4, followed by our analysis of perturbed linear ICA and a combination of the two results in Section 5. We conclude in Section 6. In Appendix A we collect some notation. In this work, $C$ denotes a generic constant that is allowed to change from line to line. We denote by $\stackrel{D}{=}$ equality of the distributions of two random variables.

In this section, we introduce and motivate our main setting. We assume that we have $d$-dimensional latent variables $S$ such that their distribution satisfies $P \in \mathcal{P}$ for some class of probability distributions $\mathcal{P}$. For our analysis of local isometries we make the following mild assumption.

Assumption 2.1. We assume that the class of probability distributions $\mathcal{P}$ has bounded support $\Omega \subseteq \mathbb{R}^d$ with density lower and upper bounded.

In the second part of this work we focus on independent component analysis, i.e., we make the following assumption.

Assumption 2.2. All $P \in \mathcal{P}$ have independent components, i.e., $P = \bigotimes_{i=1}^d P_i$ for some measures $P_i$ on $\mathbb{R}$, and we furthermore assume that the $P_i$ are non-Gaussian and have connected support.

The latent variables $S$ are hidden, and we assume that we observe $X = f(S)$ for some mixing function $f \in F(\Omega)$. Note that the distribution of $X$ is then given by the pushforward $f_\# P$.
Here $F$ is some function class consisting of functions $f : \Omega \subseteq \mathbb{R}^d \to \mathbb{R}^D$, and we will always assume without further notice that $f$ is a diffeomorphism onto its image (i.e., injective, differentiable, and with differentiable inverse on its domain, the image of $f$, which is a submanifold of $\mathbb{R}^D$). Our main interest concerns the function class of local isometries, i.e.,

$$F_{\mathrm{iso}} = \{\, f : \Omega \subseteq \mathbb{R}^d \to \mathbb{R}^D \;:\; Df(s)^\top Df(s) = \mathrm{Id} \text{ for all } s \in \Omega \,\} \quad (1)$$

for some connected domain $\Omega \subseteq \mathbb{R}^d$. The class of local isometries has attracted substantial attention in representation learning because it locally preserves the structure of the data, which is a desirable feature in many settings (Tenenbaum et al., 2000; Donoho & Grimes, 2003; Belkin & Niyogi, 2003). Closely connected notions like the restricted isometry property play a crucial role in signal processing (Candes & Tao, 2005), and several works show that making (parts of) neural networks isometric improves performance (Qi et al., 2020; Liu et al., 2021; Miyato et al., 2018). In Gresele et al. (2021) it is argued, based on the independent mechanism principle, that the closely related function class of orthogonal coordinate transformations is a natural function class for representation learning. Essentially, the argument relies on the well-known fact that two independent isotropic random vectors in high dimensions are very close to orthogonal except with exponentially small probability. In Appendix J we show a construction of a family of random functions that becomes increasingly isometric as $D \to \infty$. Let us finally remark that the investigation of the inductive bias of VAEs in Rolínek et al. (2019); Zietlow et al. (2021); Reizinger et al. (2022) revealed that their loss function and architecture promote that the encoder implements an orthogonal coordinate transformation. The empirical success of VAEs on disentanglement tasks is therefore a further motivation to study theoretical properties of orthogonal coordinate transformations and local isometries.

For ICA with locally isometric mixing function the following identifiability result was shown.

Theorem 2.3 (Theorem 1, Horan et al. (2021)). We assume that $\mathcal{P}$ satisfies Assumption 2.2. Suppose $X = f(S)$ where $S \sim P \in \mathcal{P}$ and $f \in F_{\mathrm{iso}}$. If $X \stackrel{D}{=} \tilde f(\tilde S)$ for some $\tilde f \in F_{\mathrm{iso}}$ and $\tilde S \sim \tilde P \in \mathcal{P}$, then $f = \tilde f \circ P$ for some linear map $P$ which is a product of a permutation and reflections.

Note that Horan et al. (2021) only claimed disentanglement, but the stronger version stated here follows as sketched below and is also a special case of Theorem 2 of Buchholz et al. (2022). The proof of this result proceeds in two steps. First, we can identify $f$ up to an orthogonal linear transformation $A$ by general results. Indeed, $\tilde f^{-1} \circ f : \Omega \to \mathbb{R}^d$ is a local isometry, and local isometries from $\Omega \subseteq \mathbb{R}^d$ to $\mathbb{R}^d$ are affine. This is a result first shown by Liouville in the 19th century; a simple proof can be found in the recent review of Hyvärinen et al. (2023). A slightly different viewpoint showing the same result was given by Tenenbaum et al. (2000). After identifying $S$ up to linear transformations, we can, as a second step, apply the standard identifiability result for linear ICA to identify $A$ up to permutations and reflections (there is no scale ambiguity here because we consider local isometries).
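To make the defining condition of $F_{\mathrm{iso}}$ concrete, the following minimal NumPy sketch checks $Df(s)^\top Df(s) = \mathrm{Id}$ numerically via finite differences. It is purely illustrative: the helper names and the cylinder embedding used as an example are our own choices and do not appear in the paper.

```python
import numpy as np

def jacobian_fd(f, s, eps=1e-6):
    """Finite-difference Jacobian Df(s) of f: R^d -> R^D, returned as a D x d matrix."""
    d = s.shape[0]
    cols = [(f(s + eps * np.eye(d)[:, i]) - f(s - eps * np.eye(d)[:, i])) / (2 * eps)
            for i in range(d)]
    return np.stack(cols, axis=1)

def isometry_defect(f, s):
    """Frobenius norm of Df(s)^T Df(s) - Id; zero exactly when f is a local isometry at s."""
    J = jacobian_fd(f, s)
    return np.linalg.norm(J.T @ J - np.eye(s.shape[0]))

def f_cylinder(s):
    """An isometric embedding of (part of) R^2 into R^3: roll the plane onto a cylinder."""
    return np.array([np.cos(s[0]), np.sin(s[0]), s[1]])

print(isometry_defect(f_cylinder, np.array([0.3, -1.2])))  # ~0 up to finite-difference error
```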
We emphasize that this separation into two steps, where first identifiability up to linear transformations and then full identifiability are shown, is very common; e.g., a similar proof strategy has been applied for polynomial mixing functions (Ahuja et al., 2022), piecewise linear functions (Kivva et al., 2022), or for general mixing functions and interventional data (Buchholz et al., 2023).

3. Robustness and approximate Identifiability

Theorem 2.3 proves identifiability, i.e., all data representations agree up to permutation and reflections. However, this uses crucially that the assumptions are exactly satisfied, i.e., f is a true local isometry, the coordinates of S are perfectly independent, and we know the distribution of X exactly. In real-world settings, typically none of these assumptions will hold exactly but at best approximately ("all models are wrong" (Box, 1976)). Thus, identifiability results can only be relevant for applications when they are robust, i.e., they continue to hold in some approximate sense when the assumptions are mildly violated. For, e.g., supervised learning or distribution learning, there is a large body of research (essentially the field of learning theory) addressing in particular the dependence on the sample size, but misspecification has also attracted considerable attention, e.g., White (1980; 1982). For representation learning there are much fewer results. The sample complexity of linear ICA algorithms has been studied, e.g., by Yau & Bresler (1992); Tichavsky et al. (2006); Wei (2015). Moreover, Zhang & Chan (2008) investigated ICA for a perturbed linear model, although mostly empirically and without stringent quantitative results. To the best of our knowledge, no robustness results are known for non-linear mixing functions f. The main focus of this work is to show a first robustness result for a misspecification of the function class, while we do not consider misspecification of independence of the ground truth samples and finite sample effects here. Let us emphasize that this is an important question because not all identifiability results have a meaningful extension to slight misspecification. As a simple motivating example, consider analytic functions (i.e., functions that can be expressed as a power series) and smooth functions (i.e., functions that have arbitrarily many derivatives). While those function classes are superficially similar, they exhibit profound differences: Analytic functions f : R → R have the property that there is a unique continuation of f|U from any arbitrarily small open set U to a maximal domain of definition (this can be seen as arbitrarily far o.o.d. generalization), while for smooth functions virtually no information about f outside of U can be deduced from f|U. This indicates that minor misspecification (analytic vs. smooth) can render identifiability results void, while the representation capacity remains almost the same and the difference can hardly be detected from data. To connect this more closely to identifiability in representation learning, consider, e.g., an identifiability result such that the following properties hold (a similar discussion can be found in the Appendix of Buchholz et al.
(2023)):

- When assuming that the mixing f is in some function class F, then f is identifiable (up to certain symmetries, e.g., permutations or linear maps).

- When the mixing is just assumed to be an injective and continuous function, f is not identifiable, i.e., for f ∈ F there is an f̃ giving rise to the same observational distribution (for suitable latent distributions), but f and f̃ are entirely different data representations that are not related by the allowed symmetries.

- The function space F is dense in the space of injective continuous functions, i.e., every injective and continuous function can be approximated to arbitrary precision by functions in F.

Note that the last property implies that for f̃ ∉ F no meaningful distance to F can be defined, as f̃ can be approximated arbitrarily well by functions in F. This also makes the identifiability result brittle: While f ∈ F can be identified, there is a sequence of functions F ∋ f_i → f̃ converging to the spurious solution f̃. Thus, it can be arbitrarily difficult to decide in practice whether f generated the data or an approximation f_i ∈ F of the spurious solution f̃. Concrete examples of this setting include polynomial mixing functions and piecewise linear mixing functions, which can be used to identify, e.g., Gaussian or rectangular base distributions (Ahuja et al., 2022; Kivva et al., 2022). We emphasize that these results nevertheless are important theoretical contributions; however, these identifiability results might be difficult to exploit in practice. Our observations here are similar in spirit to a no free lunch theorem (Wolpert & Macready, 1997) because they clarify the fundamental tradeoff between representation capacity of the function class and identifiability. Note that the lack of identifiability can also harm downstream performance, as shown by Saengkyongam et al. (2023).

Let us try to make the point above concrete in a toy setting. We consider two-dimensional Gaussian latent variables Z.

Figure 1. (Left) Color map of the Gaussian latent variable Z. (Center) Color map of the transformed data X = f(Z), where f is a piecewise linear approximation of a radius-dependent rotation (i.e., f(Z) is approximately distributed as Z). (Right) Representation f̃(Z) learned by a VAE with ReLU activation functions, initialized with f and f⁻¹ for decoder and encoder and with small variance.

Let f be a piecewise linear mixing function that approximates a radius-dependent rotation m (Hyvärinen & Pajunen, 1999), which is a map such that m(Z) has the same distribution as Z. Consider observations X = f(Z), whose distribution is approximately that of Z. Note that f and the identity map generate similar observations, so it is hard to distinguish between the mixing function f and the identity id, even though f is identifiable for Gaussian latent variables up to linear transformations (see Kivva et al. (2022)). However, if we train a VAE with ReLU activations that is initialized with f and f⁻¹ as decoder and encoder, i.e., we initialize it with the true mixing, it nevertheless fails to preserve this representation. Instead, the VAE learns a simple orthogonal transformation of the data, i.e., even though the map f is in theory identifiable, the learning signal is not sufficient to overcome the inductive bias of the learning method (here the VAE) towards simple functions that, e.g., preserve simple geometric properties. An illustration of the learned mixing can be found in Figure 1. Additional details for the setting can be found in Appendix H.
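For intuition, here is a minimal sketch of the measure-preserving construction behind the toy example: a rotation whose angle depends only on the radius leaves a rotationally invariant distribution (such as a standard Gaussian) unchanged while scrambling the coordinates. The specific angle function and sample size are arbitrary illustrative choices; the paper's experiment additionally uses a piecewise linear approximation of such a map (see Appendix H).

```python
import numpy as np

def radius_dependent_rotation(z, angle_fn=lambda r: 2.0 * r):
    """Rotate each point z in R^2 by an angle depending only on |z|.
    Any rotationally invariant distribution is left unchanged by this map."""
    r = np.linalg.norm(z, axis=-1, keepdims=True)
    theta = angle_fn(r)
    c, s = np.cos(theta)[..., 0], np.sin(theta)[..., 0]
    x = c * z[..., 0] - s * z[..., 1]
    y = s * z[..., 0] + c * z[..., 1]
    return np.stack([x, y], axis=-1)

rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 2))
X = radius_dependent_rotation(Z)
# The distribution is (empirically) unchanged even though X != Z pointwise.
print(np.cov(Z.T).round(2), np.cov(X.T).round(2))
```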
The main purpose of this work is to show that the function class $F_{\mathrm{iso}}$ behaves differently, i.e., robust representation learning results can be proved. To formalize this, we first need to introduce a way to measure the distance $\Theta(f)$ of a function $f$ to the space of local isometries $F_{\mathrm{iso}}$. This quantity should vanish ($\Theta(f) = 0$) when $f \in F_{\mathrm{iso}}$, and our goal is to show that when $\Theta(f)$ is small we can still approximately identify $f$. For functions $f : \Omega \subseteq \mathbb{R}^d \to \mathbb{R}^d$ we use the following quantity

$$\Theta_p^p(f, \Omega) = \int_\Omega \operatorname{dist}\big(Df(s), SO(d)\big)^p + \operatorname{dist}\big((Df)^{-1}(s), SO(d)\big)^p \, \mathrm{d}s. \quad (2)$$

Here $\operatorname{dist}^2(A, SO(d)) = \min_{Q \in SO(d)} |Q - A|^2$ refers to the Euclidean distance to the space $SO(d) = \{A \in \mathbb{R}^{d\times d} : A^\top A = \mathrm{Id},\ \det A = 1\}$ of all rotations. We discuss the definition and properties of the distance in more detail in Appendix C. The distance $\Theta$ (which is not a distance in the mathematical sense) measures how close a function $f$ is to a local isometry pointwise and integrates this. If $f \in F_{\mathrm{iso}}(\Omega)$ is a local isometry and if $\det Df > 0$ everywhere (which is no loss of generality by reflecting one coordinate), then indeed $\Theta_p(f, \Omega) = 0$ because $Df \in SO(d)$ pointwise. For our results we need the second term in (2), i.e., we measure the local deviation from being an isometry of the map and of its inverse. One could also define the metric by integrating with respect to the latent distribution $P$ instead of the Lebesgue measure, but we think that the latter is slightly more natural, because a common assumption based on the independent mechanism assumption (Janzing & Schölkopf, 2010; Schölkopf et al., 2012; Janzing et al., 2016) is that $P$ and $f$ are sampled independently. This also simplifies our analysis slightly.

Our main interest concerns high dimensional embeddings, i.e., mixing functions $f : \Omega \subseteq \mathbb{R}^d \to M \subseteq \mathbb{R}^D$ for $d \le D$, where $M$ is an embedded submanifold. In this case, the definition (2) needs to be slightly adapted, by essentially restricting the target of $Df$ to the tangent space of $M$. For $D \ge d$ we define

$$\Theta_p^p(f, \Omega) = \int_\Omega \operatorname{dist}^p\big(Df(z), SO(d, T_{f(z)}M)\big) + \operatorname{dist}^p\big((Df)^{-1}(z), SO(T_{f(z)}M, d)\big)\, \mathrm{d}z \quad (3)$$

where $T_{f(z)}M$ denotes the tangent space of $M$ at $f(z)$ and $SO(d, T_{f(z)}M)$ denotes the set of orthogonal matrices $Q \in \mathbb{R}^{D\times d}$ (i.e., $Q^\top Q = \mathrm{Id}_d$) with range in $T_{f(z)}M$ that are orientation preserving (here we fix any orientation on $M$). To interpret the second summand, we need to associate certain matrix representations to the maps $(Df)^{-1}$ and $SO(T_{f(z)}M, d)$, and we refer to Appendix D for a careful definition.

Let us now discuss and motivate the choice of distance in a bit more detail. First, we emphasize that (2) does not (directly) quantify how well $f$ is globally approximated by a local isometry, i.e., whether there exists a $g \in F_{\mathrm{iso}}$ such that, e.g., $\|f - g\|_1$ is small. We note that from a modelling standpoint it seems more reasonable to assume that (2) is small for the ground truth mixing function than to assume that $f$ can be globally well approximated by a local isometry, because the former corresponds to the common assumption that $f$ preserves the structure of the data locally (in particular distances), while it is not clear why $f$ should be globally constrained. In Appendix J we show that for $d \ll D$ and certain classes of random functions $\Theta$ will indeed be small. This quantifies the observation that those random functions become isometric as $D \to \infty$ (similar results can be found in Reizinger et al. (2023)).
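As a small computational companion to (2), the distance to $SO(d)$ can be read off the singular values (cf. Lemma C.1 in the appendix), and $\Theta_p$ can be approximated by averaging the integrand over points of $\Omega$. The sketch below is illustrative only: the function names are ours, and the Monte-Carlo normalization (absorbing the volume of $\Omega$ into a constant) is a simplification we make for readability.

```python
import numpy as np

def dist_to_SO(A):
    """dist(A, SO(d)) for a square matrix with det(A) > 0: by Lemma C.1 this equals
    sqrt(sum_i (sigma_i - 1)^2) over the singular values sigma_i of A."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum((sigma - 1.0) ** 2))

def theta_p(jacobian, samples, p=2):
    """Monte-Carlo proxy for Theta_p(f, Omega) in (2) with D = d: average the integrand
    dist(Df, SO(d))^p + dist((Df)^{-1}, SO(d))^p over points sampled from Omega and take
    the p-th root (the volume of Omega is absorbed into a constant)."""
    vals = []
    for s in samples:
        J = jacobian(s)
        vals.append(dist_to_SO(J) ** p + dist_to_SO(np.linalg.inv(J)) ** p)
    return float(np.mean(vals)) ** (1.0 / p)
```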
One difficulty in the setting is that as soon as we allow misspecification of $f$, the mixing function $f$ will typically no longer be identifiable (otherwise the existing identifiability result could be strengthened to a larger function class). We focus mostly on the setting of nonlinear ICA, and in this case it is well known that $f$ is not identifiable. Thus, we can only hope to prove a version of approximate identifiability that gracefully recovers the classical identifiability result as the misspecification disappears. We here define approximate identifiability of the latent variables based on the mean correlation coefficient (MCC), which has been used before to evaluate the empirical performance of representation learning algorithms. For a pair of $d$-dimensional random variables $(S, \tilde S)$ the MCC is defined by

$$\mathrm{MCC}(S, \tilde S) = \max_{\pi \in S_d} \frac{1}{d} \sum_{i=1}^d |\rho(S_i, \tilde S_{\pi(i)})| \quad (4)$$

where $\rho(X, Y) = \operatorname{Cov}(X, Y)/(\operatorname{Var}(X)\operatorname{Var}(Y))^{1/2}$ denotes the correlation coefficient. Note that $\mathrm{MCC}(S, \tilde S) = 1$ implies that $S_i = \lambda_i \tilde S_{\pi(i)}$ for a vector $\lambda \in \mathbb{R}^d$, i.e., we recover $S$ up to permutation and scale. For MCC values close to 1 this is approximately true; in particular $\tilde S = PS + \varepsilon$ holds for a scaled permutation $P$ and some error $\varepsilon$ that vanishes as the MCC goes to 1. More generally, one could also allow coordinatewise reparametrizations of $S$ when those are not identifiable (Gresele et al., 2021), but this is not necessary when considering (perturbed) local isometries.

Let us elaborate on the difference between proper identifiability (Xi & Bloem-Reddy, 2023) and approximate identifiability. Recall that a proper identifiability result is of the form: $f_\# P = \tilde f_\# \tilde P$ for $f, \tilde f \in F$ and $P, \tilde P \in \mathcal{P}$ implies that $\tilde f^{-1} \circ f$ is a tolerable ambiguity, e.g., a combination of permutation and reflection in Theorem 2.3. In this case, an equivalent statement is that $f_\# P = \tilde f_\# \tilde P$ implies $\mathrm{MCC}(\tilde f^{-1} \circ f(S), S) = 1$ for $S \sim P$. In contrast, for approximate identifiability results we cannot simply take any $\tilde f$ such that $\tilde f_\# \tilde P = f_\# P$, because there will typically exist arbitrarily chaotic spurious solutions with no guarantee on $\mathrm{MCC}(\tilde f^{-1} \circ f(S), S)$ for $S \sim P$. Instead, we need to make a specific choice for the function $\tilde f$ such that $\tilde f_\# \tilde P = f_\# P$ which ensures that $\mathrm{MCC}(\tilde f^{-1} \circ f(S), S) \approx 1$ for a suitable class of $f$. In our case those will be functions such that $\Theta_p(f, \Omega)$ is small. Of course, our choice of $\tilde f$ is only allowed to depend on the observations $X = f(S)$ but not on $f$ itself. Here the basic idea is to choose $\tilde f$ as close as possible to being a local isometry, i.e., we roughly choose $\tilde f$ such that $\Theta(\tilde f, \tilde\Omega)$ is minimal under the constraint $\tilde f_\# \tilde P = f_\# P$.

Let us now provide a very informal version of our final approximate identifiability result.

Theorem 3.1 (Informal sketch). Suppose $S \sim P$ where $P$ has independent components, and we observe $X = f(S)$ for some mixing function $f$. Then we can find $\tilde f$ such that $\hat S = \tilde f^{-1}(X)$ satisfies, for some $C > 0$ and $p > 1$,

$$\mathrm{MCC}(\hat S, S) \ge 1 - C\,\Theta_p^2(f) \quad (5)$$

where $C$ depends on everything except $f$.

Note that if $f$ is far away from a local isometry, i.e., $\Theta_p(f)$ is large, then the right-hand side is negative and the obtained recovery guarantee is void, which is not surprising as nonlinear ICA is not identifiable. The actual statement can be found in Theorem 5.11. We emphasize again that robustness is an aspect of representation learning that has previously attracted little attention, and we study one specific example, namely robustness of learning approximate local isometries.
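The maximization over permutations in (4) can be carried out exactly with a linear assignment solver on the matrix of absolute correlations. The sketch below is our own illustration of this computation (using SciPy, an implementation choice on our part), not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(S_hat, S):
    """Mean correlation coefficient (4): best permutation matching of the absolute
    Pearson correlations between recovered and true sources.
    S_hat, S: arrays of shape (n_samples, d)."""
    d = S.shape[1]
    # absolute correlation between every recovered and every true component
    C = np.abs(np.corrcoef(S_hat.T, S.T)[:d, d:])
    row, col = linear_sum_assignment(-C)  # maximize the total matched correlation
    return C[row, col].mean()
```

A value of 1 means the recovered sources agree with the true ones up to permutation and scale, matching the discussion above.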
There are, however, many alternative robustness questions and possible misspecifications, e.g., we could consider latent sources that are only approximately independent or settings where MCC might not be the right measure of identifiability.

4. Approximate linear Identifiability for approximate local Isometries

We now extend linear identifiability for locally isometric embeddings to approximately isometric embeddings. Restricting the ambiguity to a linear transformation is already an important first step, which can in principle be combined with any result on causal representation learning for linear mixing functions; examples of recent results for linear mixing functions include Ahuja et al. (2023); Squires et al. (2023); Varici et al. (2023). The key ingredient for our results is the rigidity statement Theorem E.1 in Appendix E, which played an important role in the mathematical analysis of elastic materials (see Ciarlet (1997) for an overview). The main consequence of this theorem that is relevant for our work here is the bound

$$\min_{L} \|u - L\|_{L^q(\Omega)} \le C \left( \int_\Omega \operatorname{dist}\big(Du(s), SO(d)\big)^p \, \mathrm{d}s \right)^{1/p} \quad (6)$$

for all $u$ and $q = pd/(d-p)$, where the minimum is over affine functions of the form $L(s) = As + b$ with $A \in SO(d)$. This result shows that when a map $u$ has gradient pointwise close to any, potentially varying, rotation, it is globally close to an affine map. We emphasize that this is highly non-trivial; naively one would expect that the rotation could change with $s$. Note that when the right-hand side is zero, i.e., $f$ is a local isometry, we conclude that $u$ is affine, recovering the fact that local isometries from $\mathbb{R}^d$ to $\mathbb{R}^d$ are already affine, which is the key observation underlying Theorem 2.3. This rigidity property of almost isometries renders them almost identifiable, which then generalizes Theorem 2.3. For simplicity, we mostly present our results for the case where $D = d$ in the main paper. The additional technicalities for $d < D$ (which is the main interest of our results) will be discussed in Appendix D.

For a measure $P \in \mathcal{P}$ with support $\Omega \subseteq \mathbb{R}^d$ we introduce the set of models

$$\mathcal{M}(f_\# P) = \{(g, Q, \Omega') : g \in F(\Omega'),\ g_\# Q = f_\# P,\ \operatorname{supp}(Q) = \Omega'\} \quad (7)$$

that generate the observed distribution of $X = f(S)$. We now consider for a fixed $p$ the triple

$$(g, Q, \Omega') \in \operatorname*{argmin}_{(\tilde g, \tilde Q, \tilde\Omega) \in \mathcal{M}(f_\# P)} \int_{\tilde\Omega} \operatorname{dist}\big((D\tilde g)^{-1}(\tilde g(s)), SO(d)\big)^p \, \tilde Q(\mathrm{d}s), \quad (8)$$

i.e., we pick $(g, Q, \Omega')$ such that $X \sim g_\# Q$, so that it generates the observational distribution, and in addition we make its inverse as isometric as possible. For $D > d$ we replace the distance to $SO(d)$ by the distance to $SO(T_{g(s)}M, d)$ (see (73) in Appendix E for details). We emphasize that $\mathcal{M}(f_\# P)$ and therefore $g$ only depend on the observational distribution $f_\# P$ but not on $f$ directly. Note that we could also minimize a variant of $\Theta(g, \Omega')$ where we integrate with respect to $Q$. We remark in passing that we do not prove the existence of a minimizer of (8) (although this should be possible using the direct method of the calculus of variations). Since the functional is lower bounded, a minimizing sequence $g_n$ exists, and we can instead approximate the infimum up to arbitrarily small $\varepsilon$, adding an arbitrarily small additional error term to Theorem 4.1 below (see Appendix E for details).
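To get a concrete feel for the left-hand side of the rigidity estimate (6) above, one can fit an affine map $L(s) = As + b$ with $A \in SO(d)$ to sampled pairs $(s, u(s))$ and inspect the residual. The sketch below uses the classical Kabsch / orthogonal Procrustes solution; it is only a least-squares ($L^2$) stand-in for the $L^q$ minimization in (6), and all function names are ours.

```python
import numpy as np

def best_rotation_affine_fit(S, U):
    """Least-squares fit of L(s) = A s + b with A in SO(d) to row pairs (s_i, u_i),
    via the Kabsch / orthogonal Procrustes solution."""
    s_mean, u_mean = S.mean(axis=0), U.mean(axis=0)
    M = (U - u_mean).T @ (S - s_mean)
    W, _, Vt = np.linalg.svd(M)
    D = np.eye(S.shape[1])
    D[-1, -1] = np.sign(np.linalg.det(W @ Vt))  # enforce det(A) = +1
    A = W @ D @ Vt
    b = u_mean - A @ s_mean
    return A, b

def affine_residual(S, U):
    """Root-mean-square residual ||u_i - (A s_i + b)||; the rigidity theorem says this
    is small whenever Du stays pointwise close to SO(d) on a connected Lipschitz domain."""
    A, b = best_rotation_affine_fit(S, U)
    return np.sqrt(np.mean(np.sum((U - (S @ A.T + b)) ** 2, axis=1)))
```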
Theorem 4.1. Suppose we have a latent distribution $P \in \mathcal{P}$ satisfying Assumption 2.1 with support $\Omega \subseteq \mathbb{R}^d$ where $\Omega$ is a bounded connected Lipschitz domain. The observational distribution is given by $X = f(S) \in M \subseteq \mathbb{R}^D$ where $S \sim P$ and $f \in F(\Omega)$. Fix $1 < p < \infty$. Define $(g, Q, \Omega')$ as in (8) (as in (73) in Appendix D in the undercomplete case). Then there is $A \in SO(d)$ such that $g^{-1} \circ f(s) = As + h(s) + b$ and $h$ satisfies for $p < d$ and $q = pd/(d-p)$ the bound

$$\|h\|_{P,q} \le C_1 \Theta_p(f, \Omega). \quad (9)$$

Here $C_1$ is a constant depending on $d$, $p$, $\Omega$, and the lower and upper bound on the density of $P$.

The proof of this result, including the extension to the undercomplete case, can be found in Appendix E. Let us continue our discussion on the meaning of approximate identifiability in the context of this theorem. First, we note that if $\Theta_p(f, \Omega)$ is not small we obtain no useful statement, except that our transformed data is some function of the original data. This is not surprising: We cannot hope to recover any mixing function because the problem of learning $f$ from $f_\# P$ is not identifiable (even when assuming $P$ known). Moreover, our statement only applies to one specific unmixing $g$, which is again unavoidable for the same reason. What we show is that if $f$ and $g$ are both close to being locally isometric, then the concatenation $g^{-1} \circ f$ will be close to a linear (even orthogonal) map. Note that $g$ does not appear on the right-hand side of (9) because we choose $g$ to be the maximally isometric representation of our observations, but we know that this representation is more isometric than any alternative representation, in particular more isometric than $f$. This allows us to bound the non-linearity $h$ in terms of $f$ only. While this result does not have the simplicity of a standard identifiability result, it provides a more general viewpoint. Indeed, if $\Theta_p(f, \Omega) = 0$, i.e., when $f$ is a local isometry, we have $h = 0$ and we recover $f$ up to a linear transformation, which is the standard identifiability result for local isometries (see Theorem 2 of Horan et al. (2021)). Our result extends this gracefully to functions that are approximate local isometries in the sense that $\Theta_p(f, \Omega)$ is small. Let us add some remarks about this result.

Remark 4.2.

- The optimization problem for $g$ in (8) is non-convex and difficult to optimize in practice.

- For $D = d$ the introduction of $g$ is not necessary; instead we can just work with the original data $X = f(S)$ and directly apply Theorem E.1 to $f$.

- The assumptions that $\Omega$ is connected and that the density of $P$ is lower bounded are necessary. In particular, the result does not apply to distributions $P$ with disconnected support.

- There are alternative assumptions that allow us to remove (or bound) the second term in (2); e.g., assuming $Df^\top Df \ge c_1 > 0$, i.e., that the smallest singular value of $Df$ is bounded below, is sufficient (see Lemma C.2).

5. Perturbed linear ICA

In this section, we consider the problem of independent component analysis where the mixing is a slight perturbation of a linear function. This is a problem of general interest beyond the main setting considered in this paper, because in typical real-world applications the mixing will only be approximately linear, so understanding the effect of the nonlinear part is important. Concretely, we assume that data is generated by a perturbed linear model

$$x = f(s) = As + \eta h(s) \quad (10)$$

where $h : \mathbb{R}^d \to \mathbb{R}^d$ is a non-linear function and $\eta \in \mathbb{R}$ is a small constant. We can assume that $h$ is centered, i.e., $\mathbb{E}(h(S)) = 0$, and that $A$ is the regression matrix when regressing $X$ on $S$. This is equivalent to the relation $\mathbb{E}_S(S h(S)^\top) = 0_{d\times d}$.
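As a concrete illustration of the model (10) and of the normalizations $\mathbb{E}(h(S)) = 0$ and $\mathbb{E}_S(S h(S)^\top) = 0$, the following sketch generates synthetic data. Everything specific here (Laplace sources, the particular nonlinearity, sample-based orthogonalization) is an illustrative assumption of ours, not a choice made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 3, 50_000, 0.05

S = rng.laplace(size=(n, d))          # independent, non-Gaussian sources
S /= S.std(axis=0)                    # unit variance (mean approximately zero)
A = rng.standard_normal((d, d))       # invertible linear mixing (almost surely)

def h_raw(S):
    return np.tanh(S) - S / 3.0       # an arbitrary smooth nonlinearity

# Center h and regress out its linear dependence on S so that, empirically,
# E[h(S)] = 0 and E[S h(S)^T] = 0, matching the normalization stated after (10).
H = h_raw(S)
H -= H.mean(axis=0)
H -= S @ np.linalg.lstsq(S, H, rcond=None)[0]

X = S @ A.T + eta * H                 # observations x = A s + eta * h(s)
```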
Remark 5.1. We decided to express the nonlinear part of the model as $\eta h$, which allows us to consider $\eta \to 0$ to shrink the size of the perturbation. As we check carefully in the proofs, all bounds only depend on $h$ through its norm $\|h\|_{P,q}$ for some $q$ specified below.

It is clear that $f$ cannot be identified from the distribution of $X$ as the mixing is nonlinear. Instead, we investigate how well the linear part given by the matrix $A$ can be recovered and to what degree we can recover the ground truth sources $S$. Our results show that as $\eta \to 0$ we recover the identifiability results for linear ICA. One issue that creates substantial theoretical problems, in this work and in general, is that there is a gap between the statistical and computational aspects of ICA, in the sense that linear ICA is identifiable for non-Gaussian latent variables, whereas ICA algorithms typically require a stronger non-degeneracy condition on the latent distribution than non-Gaussianity of the latent variables. In the misspecified setting we need to define a (not necessarily computationally feasible) algorithm to pick a mixing function for the observed data, which then results in the same limitations as the conventional algorithms face.

Let us now quickly review how the independent components can be identified. Most ICA algorithms consider a function $H : S^{d-1} \to \mathbb{R}$ defined by

$$H(w) = \mathbb{E}\, G\big(w^\top \Sigma_X^{-1/2} X\big) \quad (11)$$

where $\Sigma_X$ denotes the covariance matrix of $X$, so that $\Sigma_X^{-1/2} X$ is whitened, and $G$ is the so-called contrast function. Then, under a suitable non-degeneracy assumption, the independent components correspond to an orthogonal set of local extrema of $H$. Indeed, as a motivation for the nonlinear case, we discuss in Appendix F the well-known calculation that for linear mixing functions (i.e., $\eta = 0$) $H$ has a local extremum at $w$ if $w^\top \Sigma_X^{-1/2} X = S_i$ for some $i$, i.e., when $A^\top \Sigma_X^{-1/2} w = e_i$. Note that for linear mixing $A$ the relation $\Sigma_X = AA^\top$ holds, and the vectors

$$\tilde w_i = (AA^\top)^{-1/2} A e_i \quad (12)$$

thus satisfy $\tilde w_i^\top \Sigma_X^{-1/2} X = S_i$ and $|\tilde w_i| = 1$. We also consider the matrix $\tilde W = A^{-1}(AA^\top)^{1/2}$, which has rows $\tilde w_i = \tilde W^\top e_i$ and satisfies $\tilde W \Sigma_X^{-1/2} X = S$ for $\eta = 0$. The main goal of this section is to show that this general picture remains approximately true in the perturbed setting, i.e., under minor regularity assumptions on $G$ there is a matrix $\hat W$ close to $\tilde W$ such that its rows are local extrema of $H$. Let us collect the necessary assumptions for our results.

Assumption 5.2. The function $G$ is even, three times differentiable, and there are constants $C_g$ and $d_g$ such that for $k \le 3$

$$|G^{(k)}(x)| \le C_g (1 + |x|)^{\max(d_g - k,\, 0)} \quad (13)$$

where $G^{(k)}$ denotes the $k$-th derivative of $G$. We will write $g = G'$ from now on, following the notation in the field.

Note that commonly used contrast functions like $G = \ln \cosh$ or $G = |\cdot|^4$ satisfy this assumption. We need some regularity assumption on the source variables.

Assumption 5.3. Write $q = \max(d_g, 3)$. Assume that the latent sources $S$ satisfy for some constant $M < \infty$

$$\mathbb{E}(|S|^q) = M. \quad (14)$$

Finally, we need a condition that ensures that the contrast function can single out the latent variable $S_i$, which is also necessary in the linear case.

Assumption 5.4. The latent variables $S$ satisfy, for some $\alpha_i \neq 0$,

$$\mathbb{E}\big(S_i g(S_i) - g'(S_i)\big) = \alpha_i. \quad (15)$$

Then we have the following result.

Theorem 5.5. Suppose that $X$ and $S$ are random variables satisfying the relation (10), the distribution $P$ of $S$ has independent components (see Assumption 2.2), and $\mathbb{E}(S) = 0$ and $\mathbb{E}(S_i^2) = 1$. Assume $G$ is an even contrast function that satisfies the pointwise bounds in Assumption 5.2 and that $S$ and $G$ satisfy Assumptions 5.3 and 5.4 for some $1 \le i \le d$. Assume that $\|h\|_{P,q} \le 1$ where $q = \max(d_g, 3)$ is defined in Assumption 5.3. Let $\tilde w_i$ be defined as in (12).
Then, there is an $\eta_0 > 0$ (depending on $d$, $\alpha_i$, $A$, $M$, $C_g$, and $q$, but not on $h$) such that for $\eta < \eta_0$ there is a local extremum $\hat w_i$ of $H$ close to $\tilde w_i$ satisfying $|\hat w_i - \tilde w_i| = O(\eta)$. In addition, $H$ is strictly convex or concave (depending on the sign of $\alpha_i$) in a neighborhood $B_\kappa(\hat w_i)$ with $\kappa$ independent of $\eta < \eta_0$.

Remark 5.6. In fact, we can characterize $\hat w_d$ a bit more precisely by

$$\hat w_d = \Sigma_X^{-1/2} A\big(e_d - \eta\, \alpha_d^{-1} v\big) + O(\eta^2) \quad (16)$$

where $v_i = \mathbb{E}\big((A^{-1}h)_d(S)\, g'(S_d)\, S_i\big) + \mathbb{E}\big((A^{-1}h)_i(S)\, g(S_d)\big)$ for $i < d$ and $\alpha_d = \mathbb{E}(S_d g(S_d) - g'(S_d))$ as in (15), and similar expressions hold for $i < d$.

The proof can be found in Appendix F. We can apply the result to all coordinates simultaneously and obtain the following slight generalization.

Theorem 5.7. Under the same assumptions as in Theorem 5.5, where we assume that Assumption 5.4 holds for all $S_i$ with $1 \le i \le d$, there is a matrix $\hat W$ whose rows $\hat w_i = \hat W^\top e_i$ satisfy $|\hat w_i| = 1$, $\hat w_i$ is a local extremum of $H$, $A^\top \Sigma_X^{-1/2} \hat w_i = e_i + O(\eta)$, and $|\hat W - \tilde W| = O(\eta)$.

The main message of these results is that all standard ICA algorithms based on a contrast function $G$ can extract $\hat w_i$ approximately in this setting, and the resulting matrix $\hat W$ is close to $\tilde W$, the ground truth unmixing of the linear part. Let us clarify this through the example of gradient-based algorithms.

Corollary 5.8. Under the same assumptions as in Theorem 5.5 and for $\eta < \eta_0$, gradient ascent (if $\alpha_i > 0$) or descent (if $\alpha_i < 0$) with sufficiently small step size will converge to $\hat w_i$ locally.

This result is a direct consequence of Theorem 5.5 and convergence of gradient descent for strictly convex functions. We provide some more details in Appendix F. The results above established that we can approximately find $A$ as $\eta \to 0$. Next we show that we do not only approximately recover the linear mixing, but we can also approximately recover the sources $S$.

Theorem 5.9. Consider the same assumptions as in Theorem 5.7 and let $\hat W$ be (for $\eta < \eta_0$) the matrix from the statement of Theorem 5.7. Define $\hat S = \hat W \Sigma_X^{-1/2} X$. Then there is a constant $C_2$ depending on $d$, $\alpha_i$, $A$, $M$, $C_g$, and $q$ such that

$$\mathrm{MCC}(\hat S, S) \ge 1 - C_2 \eta^2. \quad (17)$$

The proof of this result can be found in Appendix F. It is based on a general bound for the MCC stated in Lemma B.2 in Appendix B. We provide some experimental evidence that the derived bounds have the optimal scaling in $\eta$ in Appendix I.

Remark 5.10. The results might appear a bit unsatisfactory because we only show the existence of local extrema of $H$ that allow us to recover the sources approximately, and local convergence of gradient descent or ascent to these extrema. However, we do not claim that these are the only extrema or that there is an efficient algorithm to find all extrema in polynomial time. This problem is, however, not a feature of the perturbed setting; the same issue arises already for linear ICA (Hyvärinen, 1999). Note that for the special case of the kurtosis-based contrast function $G(s) = s^4 - 3$ with linear mixing it can be shown (under additional assumptions) that the vectors $\tilde w_i$ are the only local maxima (or minima) of $H$ and they can be found in polynomial time, as shown by Arora et al. (2012). We expect that this result generalizes to our perturbed setting.
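Corollary 5.8 suggests a simple local procedure: whiten the data and run projected gradient ascent or descent of $H$ on the unit sphere from a point near $\tilde w_i$. The sketch below (with $G = \log\cosh$, so $g = \tanh$; step size, iteration count, and helper names are arbitrary choices of ours) is only an illustration of this local search, not the algorithm analyzed in the paper.

```python
import numpy as np

def whiten(X):
    """Return the centered data multiplied by Sigma_X^{-1/2}, via an eigendecomposition."""
    Xc = X - X.mean(axis=0)
    lam, U = np.linalg.eigh(np.cov(Xc.T))
    return Xc @ U @ np.diag(lam ** -0.5) @ U.T

def grad_H(w, Xw):
    """Gradient of H(w) = E[G(w^T x)] on whitened samples Xw, for G = log cosh (g = tanh)."""
    return Xw.T @ np.tanh(Xw @ w) / Xw.shape[0]

def local_extremum(Xw, w0, sign=+1, lr=0.1, n_steps=500):
    """Projected gradient ascent (sign=+1) or descent (sign=-1) of H over the unit
    sphere, started at w0; cf. Corollary 5.8 (local convergence for small eta)."""
    w = np.asarray(w0, dtype=float)
    w /= np.linalg.norm(w)
    for _ in range(n_steps):
        w = w + sign * lr * grad_H(w, Xw)
        w /= np.linalg.norm(w)  # project back onto the sphere S^{d-1}
    return w
```

In practice one would run this from each row of an initial estimate of the unmixing matrix (or from several random initializations); fixed-point schemes such as FastICA are the usual faster alternative.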
Finally, we combine the two parts of the analysis to show approximate identifiability of ICA with approximately locally isometric mixing. To simplify the analysis we assume that the support of $P$ is compact. Then we can combine our previous results to obtain the following theorem.

Theorem 5.11. Assume that $S \sim P$ with bounded and connected support $\Omega$ satisfies Assumptions 2.1 and 2.2. Suppose observations are given by $X = f(S) \in M \subseteq \mathbb{R}^D$ for some $f \in F(\Omega)$. Assume that $P$ satisfies Assumptions 5.2, 5.3, and 5.4 for some contrast function $G$. Let $q = \max(d_g, 3)$ and $p = dq/(d + q)$. Define $(g, Q, \Omega')$ as in (8) (for $d = D$, and as in (73) in Appendix D for $d < D$) and shift $g$ such that $\tilde X = g^{-1}(X)$ is centered. Define $\hat S = \hat W \Sigma_{\tilde X}^{-1/2} \tilde X$ as in Theorem 5.7 for observations $\tilde X$. Then the following bound holds for $\Theta_p(f, \Omega)$ sufficiently small and some $C_3 > 0$:

$$\mathrm{MCC}(\hat S, S) \ge 1 - C_3 \Theta_p^2(f, \Omega). \quad (18)$$

Here $C_3$ depends on $C_1$ and $C_2$ from Theorems 4.1 and 5.9 and the bounds on the density of $P$.

The informal version of this theorem is that by learning a maximally locally isometric transformation of our data, followed by running a linear ICA algorithm, we can approximately recover the true sources if the ground truth mixing is close to a local isometry. A proof of this theorem can be found in Appendix G. Let us add a few remarks about this result. For local isometries ($\Theta_p(f, \Omega) = 0$) we essentially recover Theorem 2 in Horan et al. (2021). The assumption that $P$ has bounded support combined with the independence assumption implies that $P$ is supported on a cuboid. It was shown in (Ahuja et al., 2023; Roth et al., 2023) that then linear transformations of $P$ are identifiable (up to scaling and permutation), and we expect that this can be generalized to perturbed linear maps, but it is non-trivial to prove this and to obtain a quantitative statement as we obtain above. In Appendix G we also show how to relax the condition that $P$ has compact support, which comes at the price of additional technical difficulties.

6. Conclusion

In this paper, we initiated the theoretical investigation of robustness results for representation learning with misspecified models. While we showed robustness results for linear ICA and for almost isometric embeddings, there are several closely related questions, of which we name a few. First, it would be of interest to show for the perturbed linear ICA model finite sample guarantees and local convergence guarantees for algorithms not relying on gradient descent, e.g., for the FastICA algorithm, and to analyze the effects of additive noise. Secondly, one limitation of this work is that we assume that the observational distribution $f_\# P$ and the manifold it is supported on are exactly known in the undercomplete case. Note that by the Nash embedding theorem (Nash, 1954) we can approximate uniformly any submanifold by a sequence of local isometries. Thus, here additional regularization by, e.g., penalizing higher order derivatives is necessary to extend robustness results to, e.g., finite sample size or noisy settings. The broader question that this work motivates is which classes of identifiability results are brittle and which are robust to misspecification (or finite sample bias). A deeper understanding of this landscape (and its intersection with real-world datasets) will help to understand which identifiability results give rise to useful learning signals, and thus provide some guidance for the development of novel algorithms for (causal) representation learning.

Acknowledgements

This work was supported by the Tübingen AI Center.

Impact statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Adams, R. and Fournier, J.
Sobolev Spaces. ISSN. Elsevier Science, 2003. ISBN 9780080541297. URL https://books.google.de/books?id= R5A65Koh-Eo C. Ahuja, K., Mahajan, D., Syrgkanis, V., and Mitliagkas, I. To- wards efficient representation identification in supervised learning. ar Xiv preprint ar Xiv:2204.04606, 2022. Ahuja, K., Mahajan, D., Wang, Y., and Bengio, Y. Interventional causal representation learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 372 407. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr. press/v202/ahuja23a.html. Arora, S., Ge, R., Moitra, A., and Sachdeva, S. Provable ica with unknown gaussian noise, with implications for gaussian mixtures and autoencoders. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips. cc/paper_files/paper/2012/file/ 09c6c3783b4a70054da74f2538ed47c6-Paper. pdf. Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373 1396, 2003. doi: 10.1162/ 089976603321780317. Boumal, N. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023. doi: 10. 1017/9781009166164. Box, G. E. P. Science and statistics. Journal of the American Statistical Association, 71(356):791 799, 1976. ISSN 01621459. URL http://www.jstor.org/ stable/2286841. Buchholz, S., Besserve, M., and Sch olkopf, B. Function classes for identifiable nonlinear independent component analysis. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview. net/forum?id=Dp Ka P-PY8b K. Buchholz, S., Rajendran, G., Rosenfeld, E., Aragam, B., Sch olkopf, B., and Ravikumar, P. Learning linear causal representations from interventions under general nonlinear mixing, 2023. Candes, E. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203 4215, 2005. doi: 10.1109/TIT.2005.858979. Ciarlet, P. G. Mathematical Elasticity: Volume II: Theory of Plates. ISSN. Elsevier Science, 1997. ISBN 9780080535913. Robustness of Nonlinear Representation Learning Comon, P. Independent component analysis, a new concept? Signal Processing, 36 (3):287 314, 1994. ISSN 0165-1684. doi: https://doi.org/10.1016/0165-1684(94)90029-9. URL https://www.sciencedirect.com/ science/article/pii/0165168494900299. Higher Order Statistics. Conti, S. and Schweizer, B. Rigidity and gamma convergence for solid-solid phase transitions with so(2) invariance. Communications on Pure and Applied Mathematics, 59(6):830 868, 2006. doi: https://doi.org/10.1002/cpa. 20115. URL https://onlinelibrary.wiley. com/doi/abs/10.1002/cpa.20115. Donoho, D. L. and Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591 5596, 2003. doi: 10.1073/pnas. 1031596100. URL https://www.pnas.org/doi/ abs/10.1073/pnas.1031596100. Friesecke, G., James, R. D., and M uller, S. A theorem on geometric rigidity and the derivation of nonlinear plate theory from three-dimensional elasticity. Communications on Pure and Applied Mathematics, 55 (11):1461 1506, 2002. doi: https://doi.org/10.1002/cpa. 10048. URL https://onlinelibrary.wiley. com/doi/abs/10.1002/cpa.10048. 
Gresele, L., Von K ugelgen, J., Stimper, V., Sch olkopf, B., and Besserve, M. Independent mechanism analysis, a new concept? Advances in Neural Information Processing Systems, 34, 2021. Horan, D., Richardson, E., and Weiss, Y. When is unsupervised disentanglement possible? In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 5150 5161. Curran Associates, Inc., 2021. URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 29586cb449c90e249f1f09a0a4ee245a-Paper. pdf. Hyvarinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626 634, 1999. doi: 10.1109/72. 761722. Hyv arinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural networks, 13(4-5): 411 430, 2000. Hyv arinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429 439, 1999. Hyvarinen, A., Sasaki, H., and Turner, R. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859 868. PMLR, 2019. Hyv arinen, A., Khemakhem, I., and Monti, R. Identifiability of latent-variable and structural-equation models: from linear to nonlinear, 2023. Janzing, D. and Sch olkopf, B. Causal inference using the algorithmic markov condition. IEEE Transactions on Information Theory, 56(10):5168 5194, 2010. doi: 10. 1109/TIT.2010.2060095. Janzing, D., Chaves, R., and Sch olkopf, B. Algorithmic independence of initial condition and dynamical law in thermodynamics and causal inference. New Journal of Physics, 18(9):093052, sep 2016. doi: 10.1088/ 1367-2630/18/9/093052. URL https://dx.doi. org/10.1088/1367-2630/18/9/093052. Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207 2217. PMLR, 2020. Kivva, B., Rajendran, G., Ravikumar, P., and Aragam, B. Identifiability of deep generative models without auxiliary information. Advances in Neural Information Processing Systems, 35:15687 15701, 2022. Kohn, R. V. New integral estimates for deformations in terms of their nonlinear strains. Archive for Rational Mechanics and Analysis, 78(2):131 172, 1982. Kollo, T. and von Rosen, D. Advanced Multivariate Statistics with Matrices. Mathematics and Its Applications. Springer Netherlands, 2006. ISBN 9781402034190. URL https://books.google.de/books?id= 27m A99VQTi4C. Lee, J. Introduction to Smooth Manifolds. Graduate Texts in Mathematics. Springer New York, 2013. ISBN 9780387217529. URL https://books.google. de/books?id=w4bh Bw AAQBAJ. Liu, W., Lin, R., Liu, Z., Rehg, J. M., Paull, L., Xiong, L., Song, L., and Weller, A. Orthogonal over-parameterized training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7251 7260, June 2021. Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum? id=B1QRgzi T-. Robustness of Nonlinear Representation Learning Nash, J. C1 isometric imbeddings. Annals of Mathematics, 60(3):383 396, 1954. ISSN 0003486X. URL http: //www.jstor.org/stable/1969840. Qi, H., You, C., Wang, X., Ma, Y., and Malik, J. 
Deep isometric learning for visual recognition. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 7824 7835. PMLR, 13 18 Jul 2020. URL https:// proceedings.mlr.press/v119/qi20a.html. Reizinger, P., Gresele, L., Brady, J., von K ugelgen, J., Zietlow, D., Sch olkopf, B., Martius, G., Brendel, W., and Besserve, M. Embrace the gap: Vaes perform independent mechanism analysis, 2022. URL https: //arxiv.org/abs/2206.02416. Reizinger, P., Li, H.-B., Ravuri, A., Husz ar, F., and Lawrence, N. D. Independent mechanism analysis in GPLVMs. In Fifth Symposium on Advances in Approximate Bayesian Inference, 2023. URL https: //openreview.net/forum?id=Wg K0RYP-6H. Rol ınek, M., Zietlow, D., and Martius, G. Variational autoencoders pursue PCA directions (by accident). In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 12406 12415. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.01269. URL http://openaccess.thecvf.com/content_ CVPR_2019/html/Rolinek_Variational_ Autoencoders_Pursue_PCA_Directions_ by_Accident_CVPR_2019_paper.html. Roth, K., Ibrahim, M., Akata, Z., Vincent, P., and Bouchacourt, D. Disentanglement of correlated factors via hausdorff factorized support. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), May 2023. URL https://openreview. net/pdf?id=OKc Jhp Qi Gi X. Saengkyongam, S., Rosenfeld, E., Ravikumar, P., Pfister, N., and Peters, J. Identifying representations for intervention extrapolation, 2023. Sch olkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, pp. 1255 1262, New York, NY, USA, 2012. Omnipress. Sch olkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Toward causal representation learning. Proceedings of the IEEE, 109(5): 612 634, 2021. ar Xiv:2102.11107. Squires, C., Seigal, A., Bhate, S. S., and Uhler, C. Linear causal disentanglement via interventions. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 32540 32560. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/ v202/squires23a.html. Taleb, A. and Jutten, C. Source separation in post-nonlinear mixtures. IEEE Trans. Signal Process., 47(10):2807 2820, 1999. doi: 10.1109/78.790661. URL https: //doi.org/10.1109/78.790661. Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319 2323, December 2000. Tichavsky, P., Koldovsky, Z., and Oja, E. Performance analysis of the fastica algorithm and crame´r-rao bounds for linear independent component analysis. Trans. Sig. Proc., 54(4):1189 1203, oct 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.870561. URL https://doi. org/10.1109/TSP.2006.870561. Varici, B., Acarturk, E., Shanmugam, K., Kumar, A., and Tajer, A. Score-based causal representation learning with interventions. ar Xiv preprint ar Xiv:2301.08230, 2023. von K ugelgen, J., Besserve, M., Liang, W., Gresele, L., Keki c, A., Bareinboim, E., Blei, D. M., and Sch olkopf, B. Nonparametric identifiability of causal representations from unknown interventions, 2023. Wei, T. 
A convergence and asymptotic analysis of the generalized symmetric fastica algorithm. IEEE Transactions on Signal Processing, 63(24):6445 6458, 2015. doi: 10.1109/TSP.2015.2468686. White, H. Using least squares to approximate unknown regression functions. International Economic Review, 21 (1):149 170, 1980. ISSN 00206598, 14682354. URL http://www.jstor.org/stable/2526245. White, H. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1 25, 1982. ISSN 00129682, 14680262. URL http://www.jstor. org/stable/1912526. Wolpert, D. and Macready, W. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67 82, 1997. doi: 10.1109/4235.585893. Xi, Q. and Bloem-Reddy, B. Indeterminacy in generative models: Characterization and strong identifiability. In Robustness of Nonlinear Representation Learning Ruiz, F., Dy, J., and van de Meent, J.-W. (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 6912 6939. PMLR, 25 27 Apr 2023. URL https://proceedings.mlr. press/v206/xi23a.html. Yau, S. and Bresler, Y. A compact cramer-rao bound expression for parametric estimation of superimposed signals. IEEE Transactions on Signal Processing, 40(5):1226 1230, 1992. doi: 10.1109/78.134484. Zhang, K. and Chan, L. Minimal nonlinear distortion principle for nonlinear independent component analysis. Journal of Machine Learning Research, 9(81):2455 2487, 2008. URL http://jmlr.org/papers/v9/ zhang08b.html. Zietlow, D., Rol ınek, M., and Martius, G. Demystifying inductive biases for (beta-)vae based architectures. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 12945 12954. PMLR, 2021. URL http://proceedings. mlr.press/v139/zietlow21a.html. Robustness of Nonlinear Representation Learning Supplementary Material Here we collect additional results, details, and proofs for the main paper. This appendix is structured as follows. We first introduce some notation and definitions used throughout the paper in Appendix A then we state and prove a lemma that can be used to bound the MCC in our setting in Appendix B. We gather some elementary results about the distance of matrices to the orthogonal group in Appendix C, and we extend these results to mappings to submanifolds M RD and define our measure of non-isometry also in this setting in Appendix D. In Appendix E we prove our results on approximate linear identifiability for approximate local isometries, followed by our careful analysis of perturbed linear ICA in Appendix F. The proof of our combination of the two results can be found in Appendix G. In Appendix H we provide the missing details for the experimental illustration of non-robustness and in Appendix I we confirm empirically the theoretical convergence rates which we derived for perturbed linear ICA. Finally, in Appendix J we show that a certain class of random functions becomes increasingly isometric as the ambient dimension grows. A. Notation and Mathematical Definitions In the paper we use notation from different fields. Let us collect a few definitions and references to the literature. Linear Algebra We denote by |A| = |A|F the Frobenius norm on matrices, i.e., |A|2 = Tr AA = P i,j |Aij|2. We will frequently use unitary invariance |X|F = |QX|F for Q SO(d) of the Frobenius norm. 
The symbol A = (A 1) denotes the inverse of the transpose. Probability Theory We denote by f P the pushforward of a measure P along f. It is defined by f P(A) = P(f 1A) and the distribution of f(Z) is f P if Z P. We denote by f P,p = (EP|f||p)1/p the Lp norm. Sobolev spaces For differentiable functions u : Ω R we define the Sobolev norm u p W 1,p(Ω) = Z Ω |u|p(s) + |Du|p(s) ds. (19) The Sobolev embedding W 1,p(Ω) , Lq(Ω) is continuous for q = pd/(d p). For details and a proper definition of Sobolev spaces we refer to Adams & Fournier (2003). We denote by f p the usual Lp norm with respect to the Lebesgue measure. Differential Geometry We consider mixing functions f : Rd M RD where M is an embedded submanifold. In this case, we need some notions from differential geometry such as the tangent space Tp M whose definitions can be found in, e.g., Lee (2013). B. Bounding MCC for almost Permutations Here, we provide a Lemma that allows us to control the MCC between two data representations when the transformation is perturbed linear and the linear part is close to a permutation matrix. Lemma B.1. Let P be a centered probability measure. Assume that for Z P the bound max i,j Zi P,2 Zj P,2 c1 (20) holds for some constant c1. Let T(z) = Az + h(z) where A is a matrix such that there is a permutation ρ and a constant c2 with max i 1 |Ai,ρ(i)| j =ρ(i) |Ai,j| c2 (21) and h has the property that for some constant c3 hi P,2 |Ai,ρ(i)| Zρ(i) P,2 c3. (22) Robustness of Nonlinear Representation Learning Then the bound MCC(T(Z), Z) 1 2c3 2c1c2 (23) Proof. We lower bound the expression (4) for the permutation π = ρ 1 in the statement of the lemma. Let Z P and Z = T(Z). Consider the correlation coefficients ρ(Zρ(i), ˆZi) = Cov(Zρ(i), ˆZi) σZρ(i)σ ˆ Zi . (24) We find, using that Zρ(i) is centered Cov(Zρ(i), ˆZi) Ai,ρ(i)E(Z2 ρ(i)) = E(Zρ(i) ˆZi) Ai,ρ(i)E(Z2 ρ(i)) j Ai,j E(Zρ(i)Zj) + E(Zρ(i)hi(Z)) Ai,ρ(i)E(Z2 ρ(i)) j =ρ(i) Ai,j E(Zρ(i)Zj) + E(Zρ(i)hi(Z)) hi(Z) P,2 + X j =ρ(i) |Ai,j| Zi P,2 |Ai,ρ(i)| Zρ(i) 2 P,2(c3 + c2c1). Note that the first terms vanish for independent latent variables. Next we control the variance of ˆZi. We find using the triangle inequality σ2 ˆ Zi E( ˆZ2 i ) hi(Z) P,2 + X j |Aij| Zj P,2 Zρ(i) 2 P,2 c3|Ai,ρ(i)| + |Ai,ρ(i)| + c1 X Zρ(i) 2 P,2|Ai,ρ(i)|2 (1 + c3 + c2c1)2 . We now combine (25) and (26) with (1 + x) 1 1 x and find |ρ(Zρ(i), ˆZi)| |Ai,ρ(i)| Zρ(i) 2 P,2(1 c3 c1c2) Zρ(i) P,2 Zρ(i) P,2|Ai,ρ(i)| (1 + c3 + c2c1) (1 c3 c1c2)2 1 2c3 2c1c2. (27) The claim follows by bounding MCC(T(Z), Z) d 1 X i |ρ(Zρ(i), ˆZi)| 1 2c3 2c1c2. (28) The previous lemma has minimal assumptions on P and h. We now show that the lemma can be improved when assuming independence of the coordinates of P. Otherwise, the statement and proof are similar. Robustness of Nonlinear Representation Learning Lemma B.2. Let P be a centered probability measure on Rd satisfying Assumption 2.2, i.e., with independent components. Assume that for Z P the bound max i,j Zi P,2 Zj P,2 c1 (29) holds for some constant c1. Suppose that T(z) = Az + h(z) and EP(Zihj(Z)) = 0 for all 1 i, j d. Assume that A is a matrix such that there is a permutation ρ and a constant c2 with max i 1 |Ai,ρ(i)|2 X j =ρ(i) |Ai,j|2 c2 2 (30) and h has the property that for some constant c3 hi P,2 |Ai,ρ(i)| Zρ(i) P,2 c3. (31) Then the bound MCC(T(Z), Z) 1 c2 3 2 (c1c2)2 Proof. Again, we lower bound the expression (4) for the permutation π = ρ 1 in the statement of the lemma. Let Z P and Z = T(Z). We first control ρ(Zρ(i), ˆZi) = Cov(Zρ(i), ˆZi) σZρ(i)σ ˆ Zi . 
(33) Using that Zρ(i) is centered followed by E(Zi Zj) = E(Zihj(Z)) = 0 (by independence and the assumption for h) we get Cov(Zρ(i), ˆZi) = E(Zρ(i) ˆZi) j Ai,j E(Zρ(i)Zj) + E(Zρ(i)hi(Z)) = Ai,ρ(i)E(Z2 ρ(i)). Next we control the variance of ˆZi. We expand the product and find (exploiting again independence and that h and Z are uncorrelated) σ2 ˆ Zi = E( ˆZ2 i ) = E = hi 2 P,2 + X j =ρ(i) |Ai,j|2 Zj 2 P,2 + |Ai,ρ(i)|2 Zρ(i) 2 P,2 |Ai,ρ(i)|2 Zρ(i) 2 P,2 1 + c2 3 + c2 1 1 |Ai,ρ(i)|2 X Zρ(i) 2 P,2|Ai,ρ(i)|2 1 + c2 3 + (c2c1)2 . Combining again (34) and (35) with 1 + x 1 + x/2 (for x > 0) and (1 + x) 1 1 x, we find |ρ(Zρ(i), ˆZi)| |Ai,ρ(i)| Zρ(i) 2 P,2 Zρ(i) P,2 |Ai,ρ(i)| Zρ(i) P,2 p 1 + c2 3 + (c2c1)2 1 + c2 3 2 + (c2c1)2 1 1 c2 3 2 (c2c1)2 Robustness of Nonlinear Representation Learning We finish by noting MCC(T(Z), Z) d 1 X i |ρ(Zρ(i), ˆZi)| 1 c2 3 2 (c2c1)2 C. Auxiliary result on distances to the orthogonal group We need some elementary and well-known facts on the distance of matrices to the orthogonal group. Let A Rd d be a matrix with det A > 0. Then we consider a Singular Value Decomposition (SVD) A = UΣV where U, V SO(d) and Σ = Diag(σ1, . . . , σd) Diag(d) contains the singular values σi > 0. The following Lemma holds. Lemma C.1. For A Rd d with det A > 0 and any SVD A = UΣV as above, the relation dist(A, SO(d))2 = |(A A) 1 2 Id|2 F = |Σ Id|2 F = i=1 (σi 1)2 (38) holds, moreover the minimizer of |A Q|2 F is given by Q = UV . Proof. First, we note that the last three expressions agree. For the last two this is obvious, and we find Σ2V = V ΣV . (39) Then we conclude by unitary invariance |RX|F = |X|F for R SO(d) of the Frobenius norm that A A Id|F = |V ΣV Id|F = |Σ Id|F (40) It remains to show that also the first expression agrees with the other three expressions. We find for Q SO(d) using again unitary invariance that |A Q|2 F = |UΣV Q|2 F = |Σ U QV |2 F = |Σ|2 F + |U QV |2 F 2 Tr Σ(U QV ) = |Σ|2 F + d 2 i=1 σi(U QV )ii. (41) Here we used |R|2 F = Tr RR = d in the last step. Since σi > 0 and since Rii 1 for R SO(d) we conclude |A Q|2 F |Σ|2 F + d i=1 σi (42) with equality if (U QV )ii = 1 for 1 i d and thus U QV = Id and therefore Q = UV . From (41) we find |A UV |2 F = |Σ|2 F + d 2 i=1 (σ2 i 2σi + 1). (43) This has the following consequence. Lemma C.2. Let A Rd d with det A > 0 and denote the smallest singular value of A by σmin. Then dist(A 1, SO(d))2 1 σ2 min dist(A, SO(d))2 (44) Robustness of Nonlinear Representation Learning Proof. Note that the singular values of A 1 are σ 1 i where σi denote the singular values of A. Using Lemma C.1 we then find dist(A 1, SO(d))2 = i=1 (σ 1 i 1)2 = i=1 σ 2 i (σi 1)2 σ 2 min dist(A, SO(d))2. (45) We need one further lemma that concerns the distance to SO(d) of a product of matrices. Lemma C.3. Let A, B Rd d with det A > 0, det B > 0. Then dist(AB, SO(d)) 3 2 dist(A, SO(d)) + 3 2 dist(B, SO(d)). (46) This implies for p > 1 distp(AB, SO(d)) 3p (distp(A, SO(d)) + distp(B, SO(d))) . (47) Proof. Denote the projections of A and B on SO(d) by Q and R respectively. Then dist(AB, SO(d)) |AB QR|F = |(A Q)R + Q(B R) + (A Q)(B R)|F |(A Q)R|F + |Q(B R)|F + |(A Q)(B R)|F |A Q|F + |B R|F + p |A Q|F |B R|F 2|A Q|F + 3 2 dist(A, SO(d)) + 3 2 dist(B, SO(d)). Here we used again unitary equivalence of the Frobenius norm and the Cauchy-Schwarz inequality. This proves (46) and observe that by the generalized mean inequality (a + b)p 2p 1(ap + bp) 2p(ap + bp) we obtain (47). D. 
Extension to the undercomplete case The main interest of representation learning concerns the case where we actually learn a representation of a low dimensional submanifold embedded in a high dimensional space. This creates some additional technical difficulties that we address here. We always assume that the mixing f : Ω M RD is a diffeomorphism on an embedded submanifold. We can identify the tangent space Tf(s)M with an affine subspace of RD and we center this subspace at 0 instead of f(s) to obtain a linear subspace. We consider the standard Riemannian metric on Rd and RD inducing also a metric on Tf(s)M. The differential Df(s) defines a bijective linear map Rd = Ts Rd Tf(s)M. To simplify the notation and restrict attention to the essential requirements we consider a linear subspace H RD and we assume that A RD d is a full rank matrix with range(A) = H, in particular dim(H) = d. We now characterize the matrix representations of maps Rd H RD. For this we consider the unique polar decomposition A = UP where U : Rd RD with U SO(d, D) and P : Rd Rd is symmetric and positive definite. Note that P is given by the expression P = A A. In the following U will always denote the orthogonal matrix from such a polar decomposition. Lemma D.1. Let A = UP be the polar decomposition of A. Then every linear map T RD d with Range(T) H = Range(A) can be written uniquely as T = UB for some B Rd d. Proof. We first claim that UU T = T. Let x Rd then Tx = Uy for some y Rd because T maps to H and U is surjective on H. We thus find UU Tx = UU Uy = Uy = V x (49) where we used U U = Idd (recall U SO(d, D)). Then we find T = UB with B = U T Rd d. Uniqueness follows by injectivity of U. Robustness of Nonlinear Representation Learning We now consider the space of isometries SO(d, H) consisting of all linear maps Q : Rd RD such that Range(Q) H and |Qv| = |v| for all v Rd. Note that SO in addition requires preservation of orientation, but we can either just choose the orientation induced by A or ignore this issue and think of orthogonal matrices O(d, H). We nevertheless write SO to be consistent with the literature. Note that SO(d, H) SO(d, D). We can now define the distance dist(A, SO(d, H)) = min Q SO(d,H) |A Q|F . (50) This distance can be characterized as follows. Lemma D.2. Let T RD d be a linear map with Range(T) H. Then dist(T, SO(d, H)) = dist(B, SO(d)). (51) where B is the unique matrix such that T = UB (see Lemma D.1). Proof. The crucial observation is that the relation SO(d, H) = U SO(d) (52) holds where U SO(d) = {UV : V SO(d)}. We first consider the inclusion . Consider V SO(d, H). We have seen in the proof of Lemma D.1 that V = U(U V ) so we only need to show that U V SO(d). We find (U V ) U V = V UU V = V V = Idd (53) where we used UU V = V and V SO(d, D). Therefore U V SO(d) and thus V U SO(d). Now we consider the reverse inclusion . Let Q SO(d). Then we find Range(UQ) = Range(U) = Range(UP) = Range(A) = H. (54) Moreover |UQv|2 = v Q U UQv = v Q Qv = |v|2 and therefore UQ SO(d, H). Thus we find dist(T, SO(d, H)) = dist(UB, U SO(d)) = dist(B, SO(d)) (55) where the last step used the unitary invariance of the Frobenius norm, i.e., |UB| = |B| for U SO(d, D) and B Rd d. Recall that in the polar decomposition A = UP the matrix P is given by P = (A A) 1 2 . Then the lemma above implies that for A such that H = range(A) dist2(A, SO(d, H)) = dist2((A A) 1 2 , SO(d)) = X (σi 1)2 (56) where σi are the singular values of A and we used Lemma C.1 in the last step. 
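To make these characterizations concrete, here is a small NumPy sketch (our illustration, not code from the paper): it computes dist(A, SO(d)) and the minimizer Q = UV^T from Lemma C.1 for a square matrix with positive determinant, and dist(A, SO(d, H)) via (56), i.e., through the singular values of a tall full-rank matrix.

```python
import numpy as np

def dist_to_SO(A):
    """dist(A, SO(d))^2 = sum_i (sigma_i - 1)^2 for det(A) > 0 (Lemma C.1)."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum((sigma - 1.0) ** 2))

def projection_onto_SO(A):
    """The minimizer Q = U V^T of |A - Q|_F over SO(d), for det(A) > 0."""
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

def dist_to_SO_dH(A):
    """For full-rank A in R^{D x d}: dist(A, SO(d, H))^2 = sum_i (sigma_i - 1)^2, cf. (56)."""
    sigma = np.linalg.svd(A, compute_uv=False)   # the d nonzero singular values
    return np.sqrt(np.sum((sigma - 1.0) ** 2))

rng = np.random.default_rng(0)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))         # square, close to the identity
Q = projection_onto_SO(A)
assert np.isclose(np.linalg.norm(A - Q), dist_to_SO(A))   # Lemma C.1

B = np.linalg.qr(rng.standard_normal((10, 3)))[0]         # isometric columns
print(dist_to_SO_dH(B))                                   # ~ 0
print(dist_to_SO_dH(B @ np.diag([1.2, 1.0, 0.9])))        # sqrt(0.2^2 + 0.1^2)
```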
We also need to consider isometries mapping H Rd and the distance to these isometries. The space of SO(H, d) consists of all distance preserving linear maps H Rd. Clearly, there is no unique matrix representation Rd D of such maps because the map is not defined on H , i.e., there are many extensions to RD. We essentially consider the extension by zero to the orthogonal complement, i.e, for a map T : H Rd we consider the extension T : RD Rd (not reflected in the notation) such that Tv = 0 for any v H, i.e., v w = 0 for all w H. Using this extension, we can identify such maps T with a unique matrix Rd D. Then we define dist(T, SO(H, d)) = min Q SO(H,d) |T Q|F (57) where we identify T and Q with matrix representations as explained before. We have the following simple lemma. Lemma D.3. Let T : H Rd be a linear map, which we identify with a matrix representation T RD d as explained above. Let U : Rd H RD be an isometry. Then there is a matrix B Rd d such that T = BU and dist(T, SO(H, d)) = dist(B, SO(d)). (58) Robustness of Nonlinear Representation Learning Proof. Since U is an isometry to H we find U SO(d, D) and U maps H isometrically to Rd (because U (Uv) = Iddv = v). Therefore, there is a matrix B Rd d such that Tv = BU v for all v H. Let w H . Then for all v Rd we have Uv H and thus 0 = w, Uv = U w, v which implies U w = 0. We conclude that T = BU . Next we claim that similar to (52) the relation SO(H, d) = SO(d) U (59) holds. The inclusion follows, since the concatenation of isometries is an isometry (and U H = {0}. The reverse inclusion can be shown by using that Q SO(H, d) can be expressed as Q = BU for some B and noting that since U SO(H, d) we find that B must be an isometry on Rd and therefore B SO(d). Plugging everything together, we find dist2(T, SO(H, d)) = min Q SO(H,d) |Q T|2 F = min Q SO(d) |BU Q U |2 F = min Q SO(d) Tr U(B Q ) (B Q )U = min Q SO(d) Tr(B Q ) (B Q )U U = min Q SO(d) Tr(B Q ) (B Q )Idd = min Q SO(d) |B Q |2 F = dist2(B, SO(d)). This ends the proof. Note that an invertible matrix B induces a bijective map from Rd to H through the matrix representation UB. Its inverse has the matrix representation B 1U , indeed, then we find B 1U UB = Idd. Now we can define the extension of the distance Θ to the undercomplete case. We define for f : Ω Rd M RD Θp p(f, Ω) = Z Ω distp(Df(z), SO(d, Tf(z)M)) + distp (Df) 1(z), SO(Tf(z)M, d) dz (61) where (Df) 1(z) denotes the inverse of the map Df(z) viewed as a linear map with target Tf(z)M. Note that for d = D this definition agrees with the definition in (2). By our results so far, we can equivalently write Θp p(f, Ω) = Z Ω distp(P(Df(s)), SO(d)) + distp P 1(Df(s)), SO(d) ds Ω distp((Df(s) Df(s)) 1 2 , SO(d)) + distp (Df(s) Df(s)) 1 2 , SO(d) ds (62) where P(Df(s)) = p Df(s) Df(s) denotes the unique matrix P in the polar decomposition Df(s) = UP. We need the following extension of Lemma C.3. Lemma D.4. Suppose H = Range(A) for a full rank matrix A RD d. Let T RD d with range(T) H and S : H Rd a linear map which we identify with its matrix representation S Rd D as explained above. Then ST is the matrix representation of the concatenation of the linear maps, and we assume that det(ST) > 0. Then the bound dist(ST, SO(d)) 3 2 dist(S, SO(H, d)) + 3 2 dist(T, SO(d, H)) (63) holds. For p 1 this implies distp(ST, SO(d)) 3p distp(S, SO(H, d)) + 3p distp(T, SO(d, H)). (64) Proof. Denote as before the polar decomposition of A by A = UP. We have observed that there are matrices B, B such that S = B U and T = UB. Then we have ST = B U UB = B B. 
Moreover, we have seen that dist(T, SO(d, H)) = dist(B, SO(d)), dist(S, SO(H, d)) = dist(B , SO(d)). (65) Robustness of Nonlinear Representation Learning Thus we find using Lemma C.3 (for A = B , B = B) dist(ST, SO(d)) = dist(B B, SO(d)) 3 2 dist(B , SO(d)) + 3 2 dist(B, SO(d)) 2 dist(T, SO(d, H)) + 3 2 dist(S, SO(H, d)). (66) E. Proof and extension of approximate linear identifiability for approximate isometries In this section we provide the proofs for the results in Section 4. First, we state the key rigidity statement that we use in the proof of Theorem 4.1. Theorem E.1 (Theorem 3.1 in Friesecke et al. (2002)). Let Ω Rd be a bounded Lipschitz domain1. Then there is for p > 1 a constant C(Ω, p) such that for each u W 1,p(Ω, Rd) there is a linear function L(s) = As + b with b Rd, A SO(d) such that u L p W 1,p(Ω) C(Ω, p) Z Ω dist(Du(s), SO(d))p ds. (67) Remark E.2. The reference above only stated the case p = 2 the simple extension to general p can be found in Section 2.4 in Conti & Schweizer (2006). The main interest for the study of elasticity is the bound on the gradient because this is related to the energy. Here, we are only interested in the deviation L u (where actually simpler proofs of similar results are possible (Kohn, 1982)). Using for p < d the Sobolev embedding W 1,p(Ω) , Lq(Ω) for q = pd/(d p) the Theorem above implies u L Lq(Ω) C(Ω, p) u L W 1,p(Ω) Ω dist(Du(s), SO(d))p ds 1 Let us now discuss one lemma that we split off from the proof of Theorem 4.1 because we will use it in the proof of Theorem G.2 below again. Lemma E.3. Suppose that d D and f : Ω Rd M RD and g : Ω M are diffeomorphisms, P and Q are measures on Ωand Ω such that f P = g Q. Suppose the density of P is lower bounded on Ω. Let T = g 1 f, then Z Ω distp(DT(z), SO(d)) dz Ω distp(Df(s), SO(d, Tf(s)M)) ds + C Z Ω distp(Dg 1(g(s)), SO(Tg(s)M, d)) Q(ds) (69) where C > 0 depends on the lower bound of the density and p. Remark E.4. For d = D the formula (69) simplifies to Z Ω dist(DT(z), SO(d))p dz 3p Z Ω distp(Df(s), SO(d)) ds + C Z Ω distp(Dg 1(g(s)), SO(d)) Q(ds). (70) Proof. First we note that g 1 : M Ωis by assumption differentiable and by the chain rule we have Dg 1(f(s))Df(s) = DT(s). We now find using Lemma D.4 that Z Ω distp(DT(s), SO(d)) ds = Z Ω distp(Dg 1(f(s))Df(s), SO(d)) ds Ω distp(Dg 1(f(s)), SO(Tf(s)M, d)) + distp(Df(s), SO(d, Tf(s)M)) ds (71) 1A Lipschitz domain is slightly informally a set whose boundary Ωcan be expressed as the union of the graphs of Lipschitz functions (see (Adams & Fournier, 2003) for a complete definition). Robustness of Nonlinear Representation Learning Now we use the transformation formula for push-forward measures, and the assumption g Q = f P and the lower bound on the density of P Z Ω distp(Dg 1(f(s)), SO(Tf(s)M, d)) ds C Z Ω distp(Dg 1(f(s)), SO(Tf(s)M, d)) P(ds) f(Ω) distp(Dg 1(x), SO(Tx M, d)) (f P)(dx) g(Ω ) distp(Dg 1(x), SO(Tx M, d)) (g Q)(dx) Ω distp(Dg 1(g(s)), SO(Tg(s)M, d)) Q(ds). The last two displays together imply the claim. We now provide the extensions of Theorem 4.1 to d < D. In this case, we can still define the set M(f P) as in (7). We define g by (g, Q, Ω ) argmin ( g, Q, Ω) M(f P) Ω dist(D g 1(x), SO(Tx M, d))p g Q(dx). (73) Here we need to integrate the deviation from an isometry over the observational distribution g Q = f P. Again, this agrees with the definition of (8) for d = D. 
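All of these bounds are driven by the non-isometry measure Θ_p(f, Ω) from (61)–(62). As an illustration only, the following sketch estimates Θ_p for a map f : R^d → R^D by Monte Carlo; the finite-difference Jacobian, the uniform sampling of Ω, and the omission of the volume factor of Ω are our simplifications and not part of the paper.

```python
import numpy as np

def jacobian_fd(f, s, eps=1e-6):
    """Central finite-difference Jacobian Df(s), shape (D, d)."""
    cols = [(f(s + eps * e) - f(s - eps * e)) / (2 * eps) for e in np.eye(len(s))]
    return np.stack(cols, axis=1)

def theta_p(f, sample_omega, p=2, n=2000, seed=0):
    """Monte Carlo estimate of Theta_p(f, Omega), cf. (61)-(62), up to the volume factor
    of Omega (we average over uniform samples instead of integrating).
    Both distances reduce to the singular values sigma_i of Df(s):
      dist(Df, SO(d, T M))^2        = sum_i (sigma_i - 1)^2
      dist((Df)^{-1}, SO(T M, d))^2 = sum_i (1/sigma_i - 1)^2."""
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(n):
        sigma = np.linalg.svd(jacobian_fd(f, sample_omega(rng)), compute_uv=False)
        acc += np.sum((sigma - 1) ** 2) ** (p / 2) + np.sum((1 / sigma - 1) ** 2) ** (p / 2)
    return (acc / n) ** (1 / p)

# Example: a gently curved embedding of the square [-1, 1]^2 into R^3.
eps = 0.05
f = lambda s: np.array([s[0], s[1], eps * (s[0] ** 2 + s[1] ** 2)])
sample = lambda rng: rng.uniform(-1, 1, size=2)
print(theta_p(f, sample, p=2))   # small, and it shrinks further as eps -> 0
```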
To avoid the assumption that the minimum exists, we also let (gε, Q, Ω ) M(f P) be any function such that Z Ω dist(Dg 1 ε (x), SO(Tx M, d))p (gε) Q(dx) inf ( g, Q, Ω) M(f P) Ω dist(D g 1(x), SO(Tx M, d))p g Q(dx) + ε. (74) Then we can state and prove the following complete version of Theorem 4.1. Theorem E.5. Suppose we have a latent distribution P P satisfying Assumptions 2.1 with support Ω Rd where Ωis a bounded connected Lipschitz domain. The observational distribution is given by X = f(Z) M RD where Z P and f F(Ω). Fix a 1 < p < . Let (gε, Q, Ω ) M(f P) be any function satisfying (74). Then there is A SO(d) and b Rd such that g 1 ε f(z) = Az + b + h(z) and h satisfies the bound h P,q CΘp(f, Ω) + Cε 1 p (75) for q = pd/(d p). Here C is a constant depending on d, p, Ω, and the lower and upper bound on the density of P. It is clear that this theorem is more general than Theorem 4.1. Proof. As before, we call the transition function T = g 1f : Ω Ω . First, we observe that by definition of gε Z Ω dist(Dg 1 ε (x), SO(Tx M, d))p (gε) Q(dx) inf ( g, Q, Ω) M(f P) Ω dist(D g 1(x), SO(Tx M, d))p g Q(dx) + ε Ω distp(Df 1(f(s)), SO(Tf(s)M, d)) P(ds) + ε. (76) Here we used in the second step that (f, P, Ω) M(f P), i.e., f is a valid representation of our data so that this provides an upper bound on the infimum. We now find using Lemma E.3 and (76) Z Ω distp(DT(s), SO(d))p ds Ω distp(Df(s), SO(d, Tf(s)M)) ds + C Z Ω distp(Dg 1(g(s)), SO(Tf(s)M, d)) Q(ds) Ω distp(Df(s), SO(d, Tf(s)M)) ds + C Z Ω distp(Df 1(f(s)), SO(Tf(s)M, d)) P(ds) + Cε Ω distp(Df(s), SO(d, Tf(s)M)) ds + C Z Ω distp(Df 1(f(s)), SO(Tf(s)M, d)) ds + Cε = CΘp p(f, Ω) + Cε. Robustness of Nonlinear Representation Learning Note that in the second to last step we used the upper bound on the density of P. Now we apply Theorem E.1 (or rather its consequence (68) which states that there is a matrix A SO(d) and b Rd such that Z Ω |T(s) As b|q ds 1 Ω distp(DT(s), SO(d)) ds 1 Using the upper bound on the density of P and the last two displays we find Z Ω |T(s) As b|q P(ds) 1 Ω |T(s) As b|q ds 1 Ω distp(DT(s), SO(d)) ds 1 CΘp(f, Ω) + Cε 1 p . Here we used (a + b) 1 p (2a) 1 p + (2b) 1 p This completes the proof. F. Proofs for the results on perturbed linear ICA In this section we prove Theorem 5.5 and the extension in Theorem 5.9. Two technical difficulties when proving this result is that we consider a function defined on the sphere and that we need to whiten the data. Therefore, we first prove a linearized result in Lemma F.1 from which the Theorem can be deduced after some technical algebraic manipulations. To motivate the calculations, let us first briefly sketch the well-known unperturbed case of linear ICA (Hyv arinen & Oja, 2000). Proof sketch of the Linear Result We assume that X is whitened and X = AS. Let w0 with |w0| = 1 be such that w0X = w0AS = ed S = Sd. Note that for w in a neighborhood of w0 we can write for some εi (small) 1 (ε2 1 + . . . + ε2 d 1)Sd + ε1S1 + . . . + εd 1Sd 1. (80) Indeed, since Si are independent and with unit variance, we find that the prefactor of Sd has to be q 1 (ε2 1 + . . . + ε2 d 1) to ensure that w X has unit variance. Note that 1 (ε2 1 + . . . + ε2 d 1) = 1 i=1 ε2 i + O(ε4) (81) where ε2 = Pd 1 i=1 ε2 i is the l2 norm. Then we can Taylor expand (denoting G = g) G(Xw) = G(Sd) + i=1 εi Si 1 + O(ε3). 
(82) Taking the expectation over this expression, we obtain E(G(Xw)) = E(G(Sd)) + E i=1 εi Si 1 = E(G(Sd)) 1 2E (Sdg(Sd)) i=1 ε2 i + 1 i=1 ε2 i + O(ε3) where we used that E(Si) = 0, E(S2 i ) = 1, and the fact that Si and Sj are independent for i = j so that E(Sig(Sd)) = E(Si)E(g(Sd)) = 0. In particular, we obtain E(G(Xw)) = E(G(Sd)) + 1 (E(g (Sd) E (Sdg(Sd)) + O(ε3). (84) Robustness of Nonlinear Representation Learning We conclude that under the condition E(g (Sd)) E (Sdg(Sd)) = 0 (85) the function w E(G(w X)) has a local extremum at w = w0 and it is strictly convex or concave around w0. The key lemma. We now generalize the reasoning above to the perturbed setting. To simplify this further, we first assume A = Id so that X = S + ηh(S). (86) We also remove the linear whitening operation involved in H and instead consider Hη,h(w) = EG w X where we used the shorthand σw = p E((w X)2). Clearly the function H is invariant under rescaling of the argument, i.e., homogeneous of degree 0. Instead of restricting it to the sphere, we define the function H : Rd 1 R given by H(ε) = Hη,h((ε, 1) ) (88) around ε = 0. Then our goal is to show that H has an extremum close to ε = 0 which allows us to approximately recover the independent component Sd. We now prove the following Lemma. Lemma F.1. Assume the contrast function G and the distribution of the sources S satisfy the Assumptions 5.2, 5.3, and 5.4 and assume that h P,q 1 for q = max(3, dg). Define v Rd 1 by vi = E(hd(S)g (Sd)Si + g(Sd)hi(S)). (89) Then, there is η0 > 0 (depending on all problem constants, e.g., d, α, q, M but not on h) such that for η η0 the function H has a local extremum ε0 in the vicinity of ε = 0 such that |ε0| = O(η). This extremum satisfies α + O(η2). (90) Moreover there is η 0 > 0 (depending on the same quantities as η0) such that for η η 0 there is a radius κ > 0 such that H is strictly convex or concave on Bκ(0) and satisfies D2 H(ε) |αd| 2 Id for αd < 0 2 Id for αd > 0 . (91) Proof. Since we are interested in small ε we will always assume that |ε| 1. We will denote w = (ε, 1) . Then we have i=1 εi Si + ηw h(S). (92) We define Ω= E(h(S) h(S)), i.e., the covariance of the nonlinear part. We find E((w X)2) = E i=1 εi Si + ηw h(S) i=1 ε2 i + η2w Ωw (93) where we used that Si are unit variance and uncorrelated and E(S h(S)) = 0. In particular, we conclude that i=1 ε2 i + η2w Ωw. (94) Robustness of Nonlinear Representation Learning Note that |Ω| h 2 P,2 h 2 P,q 1 by assumption. Let us introduce some notation. We consider the shorthand i=1 εi Si. (95) We also use hw(S) = w h(S), hε(S) = i=1 εihi(S). (96) Since w = (ε1, . . . , εd 1, 1) we have hw(S) = hε(S) + hd(S). (97) Now we perform a second order Taylor expansion of G around Sd with remainder term. We obtain that for some ξ [Sd, w X/σw] = G (Sd) + g (Sd) Sε + ηhw(S) + Sd(1 σw) 2g (Sd) Sε + ηhw(S) + Sd(1 σw) 6g (ξ) Sε + ηhw(S) + Sd(1 σw) Our goal is to extract the quadratic terms in ε and η of this expression. We now start to bound the error term and show that it is of order 3. Since ξ [Sd, w X/σw] and σw > 1 we conclude that |ξ| max(|w X|, |Sd|). Then we can control using Assumption 5.2 |g (ξ)| Cg(1 + |ξ|max(dg 3,0)) Cg(1 + max(|w X|, |Sd|)max(dg 3,0)) C(1 + |S|max(dg 3,0) + |ηh(S)|max(dg 3,0)). (99) We have the simple bound |σw 1| |ε|2 + |η|2|w|2 |Ω| |ε|2 + |η|2|w|2. 
(100) Then we can bound (using η2|ε| + |ε|2η |ε|3 + η3) Sε + ηhw(S) + Sd(1 σw) 3 C(|ε|3 + η3)|S|3 + η3|h(S)|3 (101) This implies (recall that q = max(3, dg)) g (ξ) Sε + ηhw(S) + Sd(1 σw) 3 C(d, Cg, q)(|ε|3 + η3) (1 + |S|q + |h(S)|q) . (102) Next, we consider the second term 1 2g (Sd) Sε + ηhw(S) + Sd(1 σw) which we approximate to 2nd order in ε and η and put all terms of order 3 and higher in the error term. Note that Sε = O(ε). We expand ηhw(S) = ηhd(S) + ηhε(S) where ηhd(S) = O(η) and ηhε(S) = O(ηε). Finally (1 σw) = O(|ε|2 + η2). We conclude that Sε + ηhw(S) + Sd(1 σw) 2 C(|S|2 + |h(S)|2) (104) Robustness of Nonlinear Representation Learning where we again bounded mixed terms by, e.g., |ε|2η |ε|3 + η3. Using in addition that 1 σ 1 w C(|ε|2 + η2) to replace σw in the denominator by 1 up to third order error terms, we find g (Sd) Sε + ηhw(S) + Sd(1 σw) 2 g (Sd) (Sε + ηhd(S))2 C(|ε|3 + η3)(1 + |S|max(dg 2,0))(|S|2 + |h(S)|2). Finally we consider the term proportional to g(Sd). Here we need the sharper bound σw 1 |ε|2 2 w Ωw 2|ε|4 + 2η4(|w|2|Ω|)2 2|ε|4 + 2η4|w|4 (106) which implies using w = ed + O(ε) that w Ωw = Ωdd + O(ε) and moreover (recall |Ω| 1) σw 1 |ε|2 2|ε|4 + 2η4|w|4 + 3|ε|η2. (107) Then we get, again keeping terms up to order 2 in ε or η g(Sd)Sε + ηhw(S) + Sd(1 σw) σw g(Sd) Sε + ηhw(S) 1 2(|ε|2 + η2Ωdd)Sd C(|ε|3 + η3)|S|max(dg 1,0)(|S| + |h(S)|). (108) Using the bounds (102), (105), and (108) in (98) we obtain G(w X) G(Sd) g(Sd) Sε + ηhw(S) 1 2(|ε|2 + η2Ωdd)Sd 2g (Sd) (Sε + ηhd(S))2 C(|ε|3 + |η|3)(1 + |S|q + |h(S)|q). (109) Let us call the term between the absolute value on the left-hand side T. Then we can bound using Assumption 5.3 |ET| E|T| C(|ε|3 + η3)E(|S|q + |h(S)|q) C(|ε|3 + |η|3)(M + h q P,q) = Ξ(|ε|3 + |η|3) (110) where Ξ = C(M + 2) was introduced for future reference and depends on Cg, d, q, and the moment bound M but is independent of ε and η and h. We observe next, using E(Sd Sε) = 0, E(g(Sd Sε) = 0 that ET = EG w X E(G(Sd)) ηE(hw(S)g(Sd)) + 1 2(|ε|2 + η2Ωdd)E(Sdg(Sd)) 2|ε|2E(g (Sd)) ηE(hd(S)Sεg (Sd)) 1 2η2E(g (Sd)hd(S)2) E(G(Sd)) ηE(hd(S)g(Sd)) + 1 2(η2Ωdd)E(Sdg(Sd)) 1 2η2E(g (Sd)hd(S)2) ηE(hd(S)Sεg (Sd)) ηE(hε(S)g(Sd)) + 1 2|ε|2 (Eg(Sd)S1 Eg (Sd)) . We thus obtained an expansion of H(w) = EG (w X/σw) up to second order in ε. Note that by plugging in εi = 0 for all i, i.e., w = ed, in (110) EG ed X E(G(Sd)) ηE(hd(S)g(Sd)) + 1 2(η2Ωdd)E(Sdg(Sd)) 1 2η2E(g (Sd)hd(S)2) Ξη3 (112) so we conclude using the triangle inequality from (110), (111), and (112) that EG w X ηE(hd(S)Sεg (Sd)) ηE(hε(S)g(Sd)) + 1 2|ε|2 (Eg(Sd)Sd Eg (Sd)) Ξ(|ε|3 + 2η3). Robustness of Nonlinear Representation Learning As a next step, we consider the derivatives of H. Since the reasoning is similar to the steps above, we provide slightly fewer details, i.e., we hide integrable terms in the O notation. First we observe that iσw = εi + η2(Ωw)i From here we obtain = Si + ηhi(S) σw (εi + η2(Ωw)i)(w S + ηhw(S)) We apply a first order Taylor expansion to g and obtain the bound g w X = g(Sd) + O(|ε| + η). Thus, we find i H(ε) = i EG w X σw E(Sig(Sd)) + O(|ε| + η) = O(|ε| + η) (116) where we used that Sj and Sd are independent. Next we show using the bound (13) from Assumption 5.2 that the function ε w = (ε, 1) E(G(w X/σw)) is strictly convex or concave around ε = 0 for η sufficiently small. For this, we need to find an expression for the second derivatives of G. To keep the length of the formulas manageable, we hide the η2(Ωw)i as a O(η2) term. 
We find for 1 i, j d 1 using (115) and (114) Sj + ηhj(S) σW (εj + η2(Ωw)j)(w S + ηhw(S)) = Sj + ηhj(S) σw + εj(w S + ηhw(S)) Si + ηhi(S) σw εi(w S + ηhw(S)) εj(Si + ηhi(S)) + δij(w S + ηhw(S)) + εi(Sj + ηhj(S)) σ3w 3εiεj(w S + ηhw(S)) = Sj Sig w X δij S1g w X + O(|ε| + η). Here, we used again that σw = 1 + O(|ε| + η). Now we apply a Taylor expansion to g and g and obtain g(w X/σw) = g(Sd) + O(|ε| + η) and similarly for g (here similar power counting as in the first part implies that the highest moment that needs to be bounded is q = max(3, dg), we do not show this in full detail here). Then we obtain i j H(ε) = i j EG w X = E Si Sjg w X δij E (Sdg (Sd)) + O(|ε| + η) = δij E(g (Sd) Sdg(Sd)) + O(|ε| + η) = αdδij + O(|ε| + η). (118) In particular, we find |D2 H(ε) + αd Id| = O(|ε| + η). (119) We conclude that for η < η0 with η0 sufficiently small, there is for some κ > 0 a neighborhood Bκ(0) such that the function Bκ(0) ε H(ε) is strictly convex or concave (depending on the sign of αd = E(Sdg(Sd) g (Sd))) with D2 H |αd|/2 Id in the convex case. We emphasize that κ is independent of η as soon as η η0 is sufficiently small. It remains to prove the existence of a maximum or minimum and the expansion (90). To achieve this, we compare H with its expansion to second order, i.e., we define f(ε) = ηE(hd(S)Sεg (Sd)) + ηE(hε(S)g(Sd)) 1 2|ε|2 (Eg(Sd)Sd Eg (Sd)) = εηv 1 2α|ε|2 (120) where we recall that we defined v Rd 1 by vi = E(hd(S)Sig (Sd)) + E(hi(S)g(Sd)) for i = 1, . . . , d 1 and αd = Eg(Sd)Sd Eg (Sd). Then we can rewrite (113) as H(ε) H(0) f(ε) < Ξ(|ε|3 + 2η3). (121) Robustness of Nonlinear Representation Learning In other words H agrees with f up to a constant term and error terms of order 3 in ε and η. If we just use this bound on | H f| we could prove (90) but only with an error term of order η 3 2 . To obtain the better rate η2 we need to consider a second order expansion. We assume now that αd > 0 such that H and f are concave in a neighborhood Bκ(0) (the proof for αd < 0 is very similar) for η < η0. For αd < 0 a similar reasoning applies and only some inequalities are reversed. We first expand the function f. The relation αd > 0 implies that f is maximized at αd , and f(ε) f(εmax) = η2|v|2 2αd . (122) and we have for ε = εmax + ε the expansion f(ε) = f(εmax + ε) = f(εmax) 1 2αd| ε|2. (123) Now we consider similar expansions for H. For concreteness, we introduce the constant Ξ2 such that |D2 H(ε) + αd Id| Ξ2(|ε| + η). (124) Such a constant exists by (119). Recall that we assumed that αd > 0 such that H is concave in a neighborhood Bκ(0) for η < η0. Suppose ε0 is the unique global maximum of H on Bρ(0) for ρ = 2η(|v| + 1) where we assume that η is sufficiently small such that H is uniformly convex on Bρ. Note that then either D H(ε0) = 0 or ε0 Bρ(0) and D H(ε0)(ε ε0) < 0 for all ε Bρ(0). Now, the general heuristic is that H roughly behaves like a parabola with vertex ε0 and f is a parabola with vertex εmax and both parabolas have approximately the same second derivative. Then, the distance between the two parabolas will increase for large arguments, leading to a contradiction to (121). We now define the point ε1 as the intersection of the ray from εmax to ε0 with the set B2ρ(0). We define δ = |εmax ε0| (126) (the quantity we want to bound) and µ = |ε0 ε1| ρ (127) (since ε0 Bρ(0) and |ε1| = 2ρ). Note that ε1 is on the ray from εmax to ε0 so we have |ε1 εmax| = |ε1 ε0| + |ε0 εmax| = δ + µ. 
We then find using (123) f(ε0) f(ε1) == (f(εmax) 1 2αdδ2) (f(εmax) 1 2αd(δ + µ)2) = 1 2αd(δ + µ)2 = 1 2αd(µ2 + 2µδ). We now derive a similar bound for H for which we need the second order Taylor expansion with integral remainder which reads for g : Rd R with g C2 as follows g(x) = g(x0) + Dg(x0)(x x0) + Z 1 0 (1 t)(x x0) D2g(x0 + t(x x0))(x x0) dt. (129) We apply this with ε0 and ε1. We observe that since ε1 ε0 and εmax ε0 point in opposite directions, the relation D H(ε0)(εmax ε0) 0 (since εmax Bρ(0)) implies D H(ε0)(ε1 ε0) 0. (130) Robustness of Nonlinear Representation Learning Then we find from the second order expansion (129) using the last display and (119) H(ε1) H(ε0) = D H(ε0)(ε1 ε0) + Z 1 0 (1 t)(ε1 ε0) D2 H(ε0 + t(ε1 ε0))(ε1 ε0) dt 0 (αd + Ξ2(2ρ + η)) (ε1 ε0)2 Z 1 2 (αd + Ξ2(2ρ + η)) µ2. Using (121) followed by (128) and (131) we now find 4Ξ(4ρ3 + η3) | H(ε0) H(0) f(ε0)| + | H(ε1) H(0) f(ε1)| H(ε0) H(0) f(ε0) + H(ε1) H(0) f(ε1) = ( H(ε1) H(ε0)) + (f(ε0) f(ε1)) 2 (αd + Ξ2(2ρ + η)) µ2 + 1 2αd(µ2 + 2µδ) 2Ξ2(2ρ + η)µ2. We conclude using ρ µ 2ρ and the definition (125) that δ 4Ξ(4ρ3 + η3) αdµ + 1 2αd Ξ2(2ρ + η)µ αd Ξ2(2ρ2 + ηρ) 16η2(|v| + 1)2 α2 d + αdη2 8η2(|v| + 1)2 α2 d + 2η2(|v| + 1) For η sufficiently small this implies that δ < ρ/2 and thus ε0 Bρ(0) (i.e., in the interior and not on the boundary) so that ε0 really is a local maximum of H. The proof for αd < 0 follows similarly. We now extend Lemma F.1 by considering A = Id, including the whitening, and considering the function on the sphere, thus proving Theorem 5.5. Proof of Theorem 5.5. Recall that we defined H : Sd 1 R by Hη(w) = EG(w Σ 1 2 X X) (134) where ΣX = E(XX ) = AA + η2Ωwith Ω= E(h(S)h(S) ). Note that ΣX depends implicitly on η so we will indicate this dependence in the following for some quantities. We now relate this to the setting in Lemma F.1. The function H defines a map on the manifold Sd 1. We analyze its properties by considering a suitable chart. We define the map T : Rd 1 Sd 1 by This map defines a chart locally around ε = 0. Recall that we defined in (88) (indicating the parameter dependence for clarity) Hη,h(ε) = Hη,h (w) = EG w (S + ηh(S)) Robustness of Nonlinear Representation Learning where w = (ε, 1) and σw = E(w X)2 with X = S + ηh(S). Then the relation Hη(Tη(ε)) = Hη,A 1h(ε) (137) holds. Indeed, writing w = (ε, 1) we obtain Hη(Tη(ε)) = EG 1 2 XA w Σ 1 w A 1(AS + ηh(S)) Σ = EG w (S + ηA 1h(S)) = Hη,A 1h(ε) where we used that (recall E(Sh(S) ) = 0) σ2 w = E((w (S + ηA 1h(S)))2) = w E (S + ηA 1h(S))(S + ηh(S) A ) w = w (Id + η2A 1ΩA w = (A w) (AA + η2Ω)A w = (A w) ΣXA w = Σ 1 2 XA w 2 . Note that this is not surprising as both terms were chosen such that the argument of G has unit variance. Note that |A 1h(S)| |A 1| |h(S)| so we find A 1h q P,q |A 1| h P,q. Now we apply Lemma F.1 to the function Hη,A 1h(ε) = H|A 1|η,|A 1| 1A 1h(ε) where |A 1| 1A 1h P,q 1. Lemma F.1 implies that H has a local extremum at some ε0 = ηα 1 d v + O(η 3 2 ) for η < η0 and some η0 where αd = Eg(Sd)Sd Eg (Sd) and vi = E((A 1h)d(S)g (Sd)Si + g(Sd)(A 1h)i(S)). (140) Thus Hη has a local extremum at wd = Tη(ε) Σ 1 2 XA ηα 1 d v 1 + O(η2). (141) Since ΣX = AA + O(η2) we also find (recall wd = (AA ) 1 2 A ed by (12)) that |wd wd| = O(η). The estimated independent component is given by 2 X X = ηα 1 d v 1 S + η ηα 1 d v 1 A 1h(S) = Sd + O(η) (142) 1 2 Xw0 = A ηα 1 d v 1 + O(η2) = (A 1)d,: + O(η), (143) i.e., we recover the d-th row of the unmixing matrix up to errors of order η. 
Finally, we prove the convexity of H around Σ 1 2 XA ed for α < 0. Intuitively, this is not surprising as we proved the convexity of H around ε = 0 and the relation (137) should lift this to the map H. The formal proof requires tools from differential geometry, which we use freely. The proof relies on the following relation from Riemannian geometry Hess(H) = X 2 H εi εj dεi dεj X i,j,k Γk ij H εk dεi dεj. (144) Robustness of Nonlinear Representation Learning Here Γk ij denotes the Christoffel symbols expressed in the chart Tη where we use the induced metric on Sd 1 as a submanifold of Rd. We have shown in Lemma F.1 that the matrix with entries D2 H satisfies for α < 0 D2 H |α| Id + O(η + |ε|) (145) D H = O(|ε| + η). (146) It is straightforward to show using the calculations below that the Christoffel symbols are bounded (in fact small, but we do not need this). This and the last display imply that the last term in (144) is bounded O(η + |ε|). We now consider the tangent vectors εi which can be identified with the vector εi Tη(ε) Rd that is tangential to Sd 1. We note that (Id + η2A 1ΩA ) ε 1 = 1 + |ε|2 + η2P(ε) (147) where P denotes a quadratic polynomial in ε. This implies εi Tη(ε) = Σ 1 2 XA ei Σ + O(|ε| + η). (148) This implies g Sd 1( εi, εj) = e i A 1ΣXA ej Σ 2 + O(|ε| + η) = δij + O(η2) 1 + O(|ε|2 + η2 + O(|ε| + η) = δij + O(|ε| + η). (149) This means that, up to higher-order terms, the metric g agrees with the standard metric. We conclude that for any tangent vector Y = Pd 1 i=1 yi εi we obtain Hess(H)(Y, Y ) g Sd 1(Y, Y ) |α| |y|2 |y|2 + O(η + |ε|). (150) We conclude that for η η0 there is a neighborhood of Tη(0) Σ 1 2 XA ed where Hess(H) |α|/2 Id, i.e. H is strictly convex. The proof for α > 0 is similar. The proof of Theorem 5.7 is now trivial. Proof of Theorem 5.7. By the assumptions of the Theorem and Theorem 5.5 we find for each 1 i d a vector wi with |wi| = 1 such that H has a local extremum at wi and A Σ 1 2 X wi = ei + O(η) (when being pedantic, we apply the result to permuted data changing coordinates i and d). Then the matrix W with rows wi = W ei is the desired matrix. The proof of Corollary 5.8 follows from standard results in convex optimization. Proof of Corollary 5.8. It is well known that gradient descent locally converges for a convex function and sufficiently small step size, a proof for the general case where we optimize over a Riemannian manifold (in our case the sphere) can be found, e.g., in (Boumal, 2023). Local convergence of projected gradient descent can also be shown. Finally, we prove that WX essentially recovers the true sources as stated in Theorem 5.9. Robustness of Nonlinear Representation Learning Proof of Theorem 5.9. We verify the assumption of Lemma B.2 for S and 2 X X = WΣ 1 2 X (AS + ηh(S)), (151) i.e., we define T(s) = (WΣ 1 2 X A)s + ηWΣ 1 2 X h(s)). Since we assume that Si have unit variance, the bound (29) holds with c1 = 1. Let us set P = WΣ 1 2 X A and verify condition (30) for P. By construction of W we have e i P = e i WΣ 1 2 X A = w i Σ 1 2 X A = ei + O(η). (152) This implies that when setting ρ(i) = i then we find 1 |Pi,i|2 X j =i |Pi,j|2 O(η2) 1 O(η) = O(η2) (153) for sufficiently small η. Finally, we use the bound 2 X h(S) P,2 η|W| |Σ 1 2 X| h P,q = O(η). (154) Here we used that |W| = d since its rows are normalized. This implies 2 X h)i(S) P,2 |Pi,i| Z P,2 O(η) (155) and therefore (31) holds with c3 = O(η). Now we can apply Lemma B.2 and conclude that MCC(S, ˆS) 1 Cη2 (156) for some constant C > 0. G. 
Proof for approximate identifiability for ICA with almost locally isometric mixing In this section we prove Theorem 5.11 and then provide some extensions that allow us to consider P with unbounded support. However, we first prove one simple auxiliary lemma which allows us to transform a data representation X = AS + h(S) b to X b = A S + h (S) such that h is centered and E(Sh (S)) = 0. Lemma G.1. Suppose S has distribution P such that E(S) = 0 and ESS α 1Id in the sense of symmetric matrices for some α > 0. We assume that X = AS + h(S) + b for an orthogonal matrix A SO(d), some function h : Rd Rd and b Rd. Then there is a linear map A , b Rd and h : Rd Rd such that X b = A S + h (S) and ES(h (S)) = 0, ES(S h (S)) = 0. (157) Morevoew, for q 2, h P,q < 2 1 + α S 2 P,q h(S) P,q. (158) σmin(A ) 1 2α S P,2 h(S) P,q. (159) Proof. First, we find X EX = AS + (h(S) + b EX) = AS + hc(S) (160) where we defined hc(S) = h(S) + b EX. Note that Ehc(S) = EX EX AES = 0 and thus hc(S) = h(S) Eh(S). By the triangle inequality and H older s inequality (which implies f P,p f P,q for p < q) we get hc P,q = h EPh P,q h P,q + EPh P,q h P,q + |EPh| 2 h P,q. (161) Robustness of Nonlinear Representation Learning Thus we find that hc is centered, and its q norm is at most larger by a factor of 2 than the corresponding norm of h. Then we define h (S) = hc(S) E(hc(S)S )E(SS ) 1S, A = A + E(hc(S)S )E(SS ) 1 (162) i.e., we regress hc(S) linearly on S so that E(h (S)S ) = 0. We bound using Cauchy-Schwarz and q 2 E(hc(S)S )E(SS ) 1S P,q α|E(hc(S)S )| S P,q αE(|hc(S)S |) S P,q αE(|hc(S)| |S|) S P,q α hc(S) P,2 S P,2 S P,q α hc(S) 2 P,q S P,q. (163) This implies h (S) P,q 1 + α S 2 P,q hc(S) P,q 2 1 + α S 2 P,q h(S) P,q. (164) Finally we conclude the lower bound on A . We observe that |A A| = |E(hc(S)S )E(SS ) 1| α hc(S) P,2 S P,2 2α S P,2 h(S)P,q. (165) Using A SO(d) we conclude that σmin(A ) 1 2α S P,2 h(S)P,q. (166) We can now prove Theorem 5.11. Proof of Theorem 5.11. The proof essentially follows by combining Theorem 4.1 and Theorem 5.9. We note that by assumption the support of P is connected and by independence of the Si it is actually a cuboid and thus a Lipschitz domain. Using Theorem 4.1 we find that if we define g as in (8) (or (73) for d < D), then X = g 1 f(S) = g 1(X) can be expressed as X = g 1 f(S) = AS + h(S) + b (167) for some A SO(d) and h satisfies the bound h P,q C1Θp(f, Ω) (168) where p and q are as in the statement of the Theorem, i.e, q = max(3, dg) and p = dq/(d + q) which is equivalent to q = pd/(d p), as required in Theorem 4.1. By shifting X we can assume that X is centered. Now we apply Lemma G.1 from above and find that X = A S + h (S) (169) σmin(A ) 1 2α S P,2 h(S) P,q 1 2α S P,2C1Θp(f, Ω) > 1 if Θp(f, Ω) sufficiently small. Moreover, there is a constant C 1 > 0 depending on C1 and the distribution P such that h P,q < C 1Θp(f, Ω). (171) We can write h(S) = (C 1Θp(f, Ω)) (C 1Θp(f, Ω)) 1h(S) = η h(S) C 1Θp(f, Ω) (172) where η = C 1Θp(f, Ω). Now we apply Theorem 5.7. If η = C 1Θp(f, Ω) < η0 we find that there is a matrix W Rd d with rows wi such that H (defined using the distribution of X) has local extrema at wi. Moreover, by Theorem 5.9 the reconstructed sources ˆS = W X = W g(X) satisfy MCC( ˆS, S) 1 C2η2 1 C3Θ2 p(f, Ω) (173) for C3 = C2C 2 1 . Robustness of Nonlinear Representation Learning We can also extend the results to ICA with approximately locally isometric mixing function and unbounded support of the sources. 
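Before turning to the extension to unbounded support, we record how the MCC appearing in (173) can be evaluated empirically. The following is a minimal sketch of ours (one common way to compute the quantity in (4)): take absolute Pearson correlations between all pairs of recovered and true components and maximize the mean over permutations with a linear assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(s_hat, s):
    """Mean correlation coefficient between recovered sources s_hat and true sources s.
    Both arrays have shape (n_samples, d). We maximize the mean absolute Pearson
    correlation over permutations, cf. (4) and Lemmas B.1/B.2."""
    d = s.shape[1]
    corr = np.corrcoef(s_hat.T, s.T)[:d, d:]          # corr[i, j] = rho(s_hat_i, s_j)
    row, col = linear_sum_assignment(-np.abs(corr))   # best matching permutation
    return np.abs(corr[row, col]).mean()

# Example: a small perturbation of a permuted sign flip gives an MCC close to 1.
rng = np.random.default_rng(0)
s = rng.laplace(size=(10_000, 3))
s_hat = -s[:, [2, 0, 1]] + 0.05 * rng.standard_normal(s.shape)
print(mcc(s_hat, s))   # ~ 0.999
```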
To simplify the analysis, we restrict our attention to functions that are perfect local isometries away from a compact set, i.e., we consider Fc iso(d,D, Ω) = {f : Rd M RD : f|Ω is a local isometry and f(ω) is not isometric to (Rd, gstand) for any ω Ω}. Here gstand denotes the standard metric on Rd. The definition essentially ensures that Ωis the maximal set such that outside of Ωthe function f is a local isometry. Note that the second part of the condition is satisfied, e.g., when the curvature of f(Ω) is everywhere non-vanishing. Then we have the following result. Theorem G.2. Suppose that the mixing f Fc iso(d, D, Ω) for some bounded set Ωsuch that Ω is connected. Assume that X = f(S) where S P. Assume that P of S satisfies Assumption 5.2, 5.3, and 5.4 for some contrast function G and the density of P is bounded above and everywhere positive on Rd and lower bounded on Ω. Let q = max(dg, 3) and p = dq/(d + q). Assume that Θp(f, Ω) is sufficiently small. Then we can find a transformation g : M Rd such that the transformed data X = g(X) is centered and has the property that there is a matrix W Rd d whose rows are d normalized vectors wi which are local extrema of the function w E(G(w Σ 1 2 X X)) such that ˆS = W X satisfies MCC( ˆS, S) 1 CΘ2 p(f, Ω) (175) where C depends on Ω, d, and the parameters in the Assumptions 5.2, 5.3, and 5.4. Remark G.3. Of course, we expect to recover the vectors wi typically by applying an algorithm for linear ICA to the transformed data X, however, due to the lack of theoretical guarantees for ICA algorithms we cannot show a stronger statement (see Remark 5.10). We expect that the assumption that f is a local isometry outside Ωcan be relaxed, e.g., by using bounded contrast functions like G(s) = exp( s2/2). We now prove Theorem G.2. In its proof we need a slightly different rigidity result in this case (actually a simpler result because we in addition fix the boundary) which we now state. We denote by σmax the largest singular value of a matrix. Theorem G.4 (Theorem 2.1 in Kohn (1982)). Let 1 p d, suppose there is a rigid motion L(z) = Az + b with A SO(d) and u W 1,p loc (Rd, Rd). We assume that there is a bounded set Ωsuch that u(z) = L(z) for z / Ω. Then u L q C(d, p) (σmax(Du) 1)+ p (176) where q = np/(n p). Remark G.5. 1. Note that by Lemma C.1 we can bound (σmax(Du) 1)+ s X i (σi 1)2 = dist(A, SO(d)). (177) 2. The exponent q agrees with the exponent for the Sobolev embedding result. 3. There is an extension to p > d where u L can be bounded. 4. The result Theorem E.1 is more general than this result because it does not assume affine boundary values. We cannot directly apply Theorem E.1 in our context because we need that the linear map in (67) can be chosen as the affine boundary values L. While this is the case, it requires to inspect the proof carefully, and it is simpler to rely on this known result from the literature. We can now continue with the Proof of Theorem G.2. To clarify the proof structure, we split the proof into two parts. First, we establish that the assumptions of the Theorem imply that we can reduce the problem to perturbed linear ICA. We summarize this result in the following proposition. Robustness of Nonlinear Representation Learning Proposition G.6. Suppose that the mixing f Fc iso(d, D, Ω) for some bounded set Ωand Ω connected. Suppose that X = f(S) with S P. Assume that the density of S is globally positive and upper bounded and, moreover, lower bounded on Ω. 
Then we can find a map g : M Rd such that the recovered sources X = g(X) satisfies X = AS + h(S) (178) for some rotation A SO(d) and function h such that h q CΘp(f, Ω) (179) where q = pd/(d p) and C depends on d, p, and the bounds on the density of P on Ω. Proof. For a diffeomorphism g : Rd M we let I(g) M be the maximal open set such that g 1|I(g) is a local isometry. We consider the set G of diffeomorphisms from Rd to M such that I(g) is maximal. By definition of the function class Fc iso(d, D, Ω) there are functions g such that I(g) = f( Ω ) (e.g., g = f), moreover I(g) f(Ω) = for any g by definition of Fc iso(d, D, Ω). Therefore, G is well-defined. Now we consider g G such that g argmin g G Rd distp(D g 1(x), SO(Tx M, d)) f P(dx). (180) By shifting g we can assume that g 1 f(S) is centered. Then we consider the transition function T = g 1 f. By assumption f|Ω is a local isometry, and we have shown that g 1|f(Ω ) is a local isometry and thus T|Ω is a local isometry and therefore T(s) = As + b = L(s) for some A SO(d) by Theorem 2.3. We now consider the restriction T|Ωand apply Lemma E.3 to find Z Ω distp(DT(s), SO(d)) ds Ω distp(Df(s), SO(d, Tf(s)M)) ds + C Z Ω distp(Dg 1(g(s)), SO(Tf(s)M, d)) Q(ds). (181) Now since f G we can conclude (using the upper bound on the density of P) as before that Z Ω distp(Dg 1(g(s)), SO(Tf(s)M, d)) Q(ds) Z Ω distp(Df 1(f(s)), SO(Tf(s)M, d)) P(ds) Ω distp(Df 1(f(s)), SO(Tf(s)M, d)) ds. (182) Combining the last two displays we find Z Ω distp(DT(s), SO(d)) ds CΘp p(f, Ω). (183) Now we apply Theorem G.4 (and the remark below this result) to conclude that T L q CΘp(f, Ω). (184) Defining h = T L and using the upper bound on the density of P we conclude that X = g 1(X) = T(S) is given by X = L(S) + h(S) (185) h P,q CΘp(f, Ω). (186) Now we can complete the proof of Theorem G.2. Robustness of Nonlinear Representation Learning Proof of Theorem G.2. The proof is now the same as the proof of Theorem 5.11 except that we first apply Proposition G.6 instead of Theorem 4.1. Indeed, by applying Proposition G.6 and the definition of g used in the proof we find that X = g(X) = g(f(S)) = AS + h(S) + b (187) where A SO(d) and h P,q CΘp(f, Ω) = CΘp(f, Rd). The rest of the proof is the same as the proof of Theorem 5.11, i.e., we apply Lemma G.1 to rewrite X = A S + h (S) where h (S) is centered and Eh (S)S ) = 0 and then apply Theorem 5.9. Note that those two results do not assume bounded support of S. H. Experimental Illustration of non-robustness of piecewise linear functions In this appendix we collect some additional details on the experimental illustration of the non-robustness of learning piecewise linear mixing functions. Recall that we consider Z following a two-dimensional standard normal distribution. We let m be a radius dependent rotation (Hyv arinen & Pajunen, 1999) which is given by m(z1, z2) = cos(5φ(z2))z1 + sin(5φ(z2))z2 sin(5φ(z2))z1 + cos(5φ(z2))z2 where φ : R R is a smooth bump function supported in (0, 1) and given by ( exp 1 1 (2t 1)2 for 0 < t < 1 0 else. (189) It can be checked explicitly that m(Z) D= Z. Now we consider a VAE model with encoder and decoder each having 3 hidden residual layers with 64 neurons. We train encoder and decoder in a supervised way on pairs (m(Z), Z) and (Z, m(Z)) (where Z follows a standard normal distribution) using the mean squared error until the mean error is smaller than 0.001. We define the map implemented by the decoder as f. 
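For concreteness, here is a minimal sketch of the measure-preserving map m used above, under our reading of the construction: we take φ to be the bump function given above and the rotation angle to be 5φ(‖z‖₂), so that the angle depends only on the radius; this reconstruction is an assumption of ours and not the authors' code. For rotation-invariant Z a radius-dependent rotation leaves the distribution unchanged, which the snippet checks on a few moments.

```python
import numpy as np

def phi(t):
    """Smooth bump supported in (0, 1): exp(-1 / (1 - (2t - 1)^2)) there, 0 elsewhere."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    denom = 1.0 - (2.0 * t - 1.0) ** 2      # positive exactly on (0, 1)
    inside = denom > 0
    out[inside] = np.exp(-1.0 / denom[inside])
    return out

def m(z):
    """Radius-dependent rotation of z in R^2 by the angle 5 * phi(|z|_2) (our reading)."""
    z = np.atleast_2d(z)
    angle = 5.0 * phi(np.linalg.norm(z, axis=1))
    c, s = np.cos(angle), np.sin(angle)
    return np.stack([c * z[:, 0] + s * z[:, 1],
                     -s * z[:, 0] + c * z[:, 1]], axis=1)

rng = np.random.default_rng(0)
Z = rng.standard_normal((200_000, 2))
MZ = m(Z)
# The angle only depends on |z|, so m(Z) has the same distribution as Z for
# rotation-invariant Z; we compare covariances and fourth moments as a sanity check.
print(np.cov(MZ.T) - np.cov(Z.T))
print((MZ ** 4).mean(axis=0), (Z ** 4).mean(axis=0))
```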
We generate data f(Z) where Z is Gaussian and then train a VAE on samples f(Z); we initialize the VAE with the learned f and $f^{-1}$ and give a large negative bias of $-5$ to the log-variance layer to make the encoder close to deterministic.

I. Experimental Illustration of the Convergence Rate for Perturbed Linear ICA

To illustrate the results in Section 5, we provide a small-scale experimental illustration. We consider data generated by $X = AS + \eta h(S)$ where the $S_i$ follow independent Laplace distributions, all entries of $A$ follow a centered normal distribution with variance $d^{-1}$, and $h_i(S) \propto S_i^3 - \beta_i S_i - \alpha_i$, where $\alpha_i, \beta_i$ are chosen such that $\mathbb{E}(h_i(S)S_j) = 0$ and $\mathbb{E}(h_i(S)) = 0$, and the proportionality constant is chosen such that $\mathbb{E}(h_i(S)^2) = 1$. Then we run the FastICA algorithm (Hyvärinen, 1999) on $\hat\Sigma_X^{-1/2} X$ with $n = 10^6$ samples for $\eta \in [10^{-3}, 1]$. We initialize $w_{\mathrm{init}} = w_i$ (recall that the $w_i$ denote the rows of $A^{-1}(AA^\top)^{1/2}$, i.e., the true unmixing of the whitened linear part $A$) to avoid problems with spurious minima and with the matching of minima, which are already present for linear ICA. While $w_i$ cannot be inferred from $X$, this allows us to focus here on the essential differences of our setting to linear ICA. Recall that $\tilde w_i$ was defined in (16) and denotes the predicted next-order expression of the recovered unmixing direction. In Figure 2 we plot the distance of the output $\hat w_i$ of the FastICA algorithm to $w_i$ and to $\tilde w_i$, and we evaluate the MCC of the recovered sources $\hat S$ and the true sources $S$ as a function of $\eta$, the strength of the perturbation. We find that the dependence on $\eta$ roughly matches our theoretical results. Note that for $\eta \gtrsim 0.1$ the error is already of order 1 and thus independent of $\eta$, while for $\eta \lesssim 0.01$ the distance $|\hat w_i - \tilde w_i|$ no longer decreases due to finite-sample effects (note that roughly $|\hat w_i - \tilde w_i| = O(n^{-1/2})$ for such $\eta$). We also refer to Horan et al. (2021), where small-scale real-world image data was investigated and the independent components were recovered. Their mixing function is only approximately locally isometric, and therefore our results give a theoretical justification for their observations.

[Figure 2; the fitted regression slopes shown in the legends are $\beta = 1.08$ and $\beta = 2.47$.]
Figure 2. (Left) Distance of the recovered linear unmixing $\hat w$ from $w$ (see (12)) and from $\tilde w$ (see (16)) as a function of $\eta$. Plotted is the median over 5 runs with $d = 5$ (i.e., 25 recovered components). The shaded area shows the range from the 32% to the 68% quantile. (Right) Difference of the mean MCC over 5 runs of $\hat S = W\Sigma_X^{-1/2}X$ from perfect recovery ($\mathrm{MCC} = 1$). Regression lines are obtained by linear regression in log-log space for $0.01 \leq \eta \leq 0.1$ and the $\beta$ values indicate their slope.

J. A Class of Approximately Isometric Random Functions

In this section we illustrate how approximate local isometries naturally appear when considering random high-dimensional functions. Note that a slightly more general class of random functions was considered in Reizinger et al. (2023); however, their work does not quantify approximate local isometries. We consider random functions $f : \mathbb{R}^d \to \mathbb{R}^D$ where we are mostly interested in the case $D \gg d$. Assume that each coordinate $f_i$ is given by a random draw from a smooth Gaussian process, e.g., a mean-zero process whose covariance is given by a Matérn kernel. Assume that the kernel is isotropic, $K(x, y) = K(|x - y|)$, with $K'(0) = 0$ and $-K''(0) = 1$ (which can be obtained by rescaling).
Then the following holds for k = l E( kfi(x) lfi(x)) = lim h 0 h 2E((fi(s + hek) fi(s))(fi(s + hel) fi(s))) = lim h 0 h 2E((fi(s + hek) fi(s))(fi(s + hel) fi(s))) = lim h 0 h 2(K( 2h) 2K(h) + K(0)). Here, we used in the last step that |(s + hek) (s + hel)| = h|ek el| = 2h. We then find using Taylor expansion of K (recall K (0) = 0) that E( kfi(s) lfi(s)) = lim h 0 h 2 (K(0) + h2K (0) + o(h2)) 2 K(0) h2 2 K (0) + o(h2)) + K(0) = 0. (191) We also note that E( kfi(s) kfi(s)) = lim h 0 h 2E((fi(s + hek) fi(s))(fi(s + hek) fi(s))) = lim h 0 h 2E(2K(0) 2K(h)) = K (0) = 1. (192) We conclude that the matrix Df(s) has independent Gaussian entries with variance 1. To measure how isometric f is, we need to investigate the singular values of Df. This has been investigated extensively in the field of random matrix theory, e.g., the random matrices appearing here are a special case of the Wishart distribution. Indeed, there are explicit expressions for the joined eigenvalue density of D 1X X where Xij follow independent standard normal distributions and fine information on the spectrum is available even when d, D jointly with d/D fixed. Here we are only interested in the asymptotic scaling as D , which is much simpler to obtain, and we now sketch this to provide some intuition. We denote A = 1 DDf(s) Df(s), and we introduce f = D 1f. Using the results above, we find EAkl = E(D f (s)D f(s))kl = 1 m E(Df(s))mk(Df(s))ml = 1 m E kfm(s) lfm(s) = δkl. (193) Robustness of Nonlinear Representation Learning In other words EA = Idd. By the law of large numbers we infer that lim D D f (s)D f(s) = Idd. (194) In other words, we conclude that as D the sequence f becomes increasingly isometric. This can also be quantified. Note that for k = l E( kfi(s) lfi(s))2 = E( kfi(s))2E( lfi(s))2 = 1. (195) Here we used that uncorrelated Gaussian variables are independent. Similarly, we obtain for k = l E( kfi(s) kfi(s))2 (E( kfi(s) kfi(s)))2 = E( kfi(s))4 1 = 3 1 = 2. (196) Linearity of the variance implies that D2 (1 + δkl) = 1 D(1 + δkl). (197) We conclude that E|A E(A)|2 F = d2 + d Let σi 0 be the singular values of D f(s). Then we bound |σi 1| |σ2 i 1| for σi > 0 and find X i |σi 1|2 X i |σ2 i 1|2 = |A E(A)|2 F (199) Using (56) we conclude that E dist2(D f(s), SO(d, Tf(s)M)) = E X i (σi 1)2 E|A E(A)|2 = d2 + d The argument for the second term in (61) is similar but a bit more involved because we need to bound the inverse. Suitable bounds for A 1 can be found, e.g., in Theorem 2.4.14 in (Kollo & von Rosen, 2006). They then imply that E dist2(D f 1(x), SO(Tx M, d)) C We then conclude that EΘ2( f, Ω) q EΘ2 2( f, Ω) C and therefore the functions f approach local isometries as D in a quantitative way with the expected rate. This result can be generalized to p > 2 using stronger concentration bounds for sub-Gaussian variables.
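As a numerical companion to this computation (our sketch, not from the paper), the snippet below samples the Jacobian model of this section, namely $D\tilde f(s) = J/\sqrt{D}$ with i.i.d. standard normal entries of $J$, and averages $\mathrm{dist}^2(D\tilde f(s), SO(d, T_{f(s)}\mathcal{M})) = \sum_i (\sigma_i - 1)^2$; the estimate decays like $1/D$ and stays below the bound $(d^2 + d)/D$ derived above.

```python
import numpy as np

def mean_sq_dist_to_isometry(d, D, n_trials=200, seed=0):
    """Average of dist^2(D ftilde(s), SO(d, T M)) = sum_i (sigma_i - 1)^2 for the
    Jacobian model of this section: J / sqrt(D) with J having i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        J = rng.standard_normal((D, d)) / np.sqrt(D)
        sigma = np.linalg.svd(J, compute_uv=False)
        vals.append(np.sum((sigma - 1.0) ** 2))
    return np.mean(vals)

d = 5
for D in [50, 200, 800, 3200]:
    est = mean_sq_dist_to_isometry(d, D)
    print(f"D = {D:5d}   E dist^2 ~ {est:.4f}   bound (d^2 + d)/D = {(d * d + d) / D:.4f}")
```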