# On Linear Identifiability of Learned Representations

Geoffrey Roeder (Princeton University), Luke Metz (Google Brain), Diederik P. Kingma (Google Brain)

Abstract. Identifiability is a desirable property of a statistical model: it implies that the true model parameters may be estimated to any desired precision, given sufficient computational resources and data. We study identifiability in the context of representation learning: discovering nonlinear data representations that are optimal with respect to some downstream task. When parameterized as deep neural networks, such representation functions lack identifiability in parameter space, because they are overparameterized by design. In this paper, building on recent advances in nonlinear Independent Components Analysis, we aim to rehabilitate identifiability by showing that a large family of discriminative models is in fact identifiable in function space, up to a linear indeterminacy. Many models for representation learning across a wide variety of domains, including text, images, and audio, several of them state-of-the-art at time of publication, are identifiable in this sense. We derive sufficient conditions for linear identifiability and provide empirical support for the result on both simulated and real-world data.

1. Introduction

An increasingly common methodology in machine learning is to improve performance on a primary downstream task by first learning a high-dimensional representation of the data on a related, proxy task. In this paradigm, training a model reduces to fine-tuning the learned representations for optimal performance on a particular sub-task (Erhan et al., 2010). Deep neural networks (DNNs), as flexible function approximators, have been surprisingly successful in discovering effective high-dimensional representations for use in downstream tasks such as image classification (Sharif Razavian et al., 2014), text generation (Radford et al., 2018; Devlin et al., 2018), and sequential decision making (Oord et al., 2018).

When learning representations for downstream tasks, it would be useful if the representations were reproducible, in the sense that every time a network relearns the representation function on the same data distribution, the result is approximately the same, regardless of small deviations in the initialization of the parameters or the optimization procedure. In some applications, such as learning real-world causal relationships from data, such reproducible learned representations are crucial for accurate and robust inference (Johansson et al., 2016; Louizos et al., 2017).

A rigorous way to achieve reproducibility is to choose a model whose representation function is identifiable in function space. Informally speaking, identifiability in function space is achieved when, in the limit of infinite data, there exists a single, global optimum in function space. Interestingly, Figure 1 exhibits learned representation functions that appear to be the same up to a linear transformation, even on finite data and optimized without convergence guarantees (see Appendix A.1 for training details). In this paper, we account for Figure 1 by making precise the relationship it exemplifies. We prove that a large class of discriminative and autoregressive models are identifiable in function space, up to a linear transformation.
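The linear map $A$ in Figure 1 (right) can be estimated directly from the two sets of embeddings. The following is a minimal sketch, not the exact procedure used to produce the figure: it assumes the map is fit by ordinary least squares on two (here simulated) embedding matrices evaluated on the same held-out batch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2

# Simulated stand-ins for f_theta1(B) and f_theta2(B): the second is a noisy
# linear image of the first, as linear identifiability predicts.
F1 = rng.normal(size=(n, d))
M_true = rng.normal(size=(d, d))
F2 = F1 @ M_true.T + 0.01 * rng.normal(size=(n, d))

# Fit A so that A f_theta1(x) ~ f_theta2(x), i.e. solve F1 @ A.T ~ F2 by least squares.
A_T, *_ = np.linalg.lstsq(F1, F2, rcond=None)
rel_err = np.linalg.norm(F1 @ A_T - F2) / np.linalg.norm(F2)
print(f"relative alignment error: {rel_err:.4f}")  # small if a linear map exists
```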
Our results extend recent advances in the theory of nonlinear Independent Components Analysis (ICA), which have provided strong identifiability results for generative models of data (Hyvärinen et al., 2018; Khemakhem et al., 2019; 2020; Sorrenson et al., 2020). Our key contribution is to bridge the gap between these results and discriminative models commonly used for representation learning (e.g., Hénaff et al., 2019; Brown et al., 2020).

The rest of the paper is organized as follows. In Section 2, we describe a general discriminative model family, defined by its canonical mathematical form, which generalizes many supervised, self-supervised, and contrastive learning frameworks. In Section 3, we prove that learned representations in this family have an asymptotic property desirable for representation learning: equality up to a linear transformation. In Section 4, we show that this family includes a number of highly performant models, state-of-the-art at publication for their problem domains, including CPC (Oord et al., 2018), BERT (Devlin et al., 2018), and GPT-2 and GPT-3 (Radford et al., 2018; 2019; Brown et al., 2020). Section 5.2 investigates the practically realizable regime of finite data and partial optimization, showing that representations learned by members of the identifiable model family approach equality up to a linear transformation as a function of dataset size, neural network capacity, and optimization progress.

Figure 1. Left and Middle: Two learned DNN representation functions $f_{\theta_1}(B)$, $f_{\theta_2}(B)$ visualized on held-out data $B$. The DNNs are word embedding models (Mnih & Teh, 2012) trained on the Billion Word Dataset (Chelba et al., 2013) (see Appendix A.1 for code release and training details). Right: $A f_{\theta_1}(B)$ and $f_{\theta_2}(B)$, where $A$ is a linear transformation learned after training. The overlap exhibits linear identifiability (see Section 3): different representation functions, learned on the same data distribution, live within linear transformations of each other in function space.

2. Model Family and Data Distribution

The learned embeddings of a DNN are a function not only of the parameters, but also of the network architecture and the size of the dataset (viewed as a sample from the underlying data distribution). This renders any analysis in full generality challenging. To make such an analysis tractable, we begin in this section by specifying a set of assumptions about the underlying data distribution and model family that must hold for the learned representations to be similar up to a linear transformation. These assumptions are, in fact, satisfied by a number of already published, highly performant models. We establish assumptions and definitions in this section, and exhibit models that satisfy them in depth in Section 4.

Data Distribution. We assume the existence of a generalized dataset in the form of an empirical distribution $p_{\mathcal{D}}(x, y, S)$ over random variables $x$, $y$ and $S$ with the following properties. The random variable $x$ is an input variable, typically high-dimensional, such as text or an image. The random variable $y$ is the target variable whose value the model predicts; in the case of object classification, $y$ is a semantically meaningful class label, but in our model family $y$ may also be a high-dimensional context variable, such as a text, image, or sentence fragment. $S$ is a set containing the possible values of $y$ given $x$, so that $p_{\mathcal{D}}(y \mid x, S) > 0$ for all $y \in S$. Note that the set of labels $S$ is not fixed, but is itself a random variable.
This allows supervised, contrastive, and self-supervised learning frameworks to be analyzed together: the meaning of $S$ encodes the task. For supervised classification, $S$ is deterministic and contains the class labels. For self-supervised pretraining, $S$ contains randomly sampled high-dimensional variables such as image embeddings. For deep metric learning (Hoffer & Ailon, 2015; Sohn, 2016), the set $S$ contains one positive sample from the class to which $x$ belongs and $k$ negative samples.

Canonical Discriminative Form. Given a data distribution as above, a generalized discriminative model family may be defined by its parameterization of the probability of a target variable $y$ conditioned on an observed variable $x$ and a set $S$ that contains not only the true target label $y$, but also a collection of distractors $y'$:

$$p_\theta(y \mid x, S) = \frac{\exp\left(f_\theta(x)^\top g_\theta(y)\right)}{\sum_{y' \in S} \exp\left(f_\theta(x)^\top g_\theta(y')\right)}. \qquad (1)$$

The codomain of the functions $f_\theta(x)$ and $g_\theta(y)$ is $\mathbb{R}^M$, and the domains vary according to the modelling task. For notational convenience both are parameterized by $\theta \in \Theta$, but $f$ and $g$ may use disjoint parts of $\theta$, meaning that they do not necessarily share parameters. With $\mathcal{F}$ and $\mathcal{G}$ we denote the function spaces of $f_\theta$ and $g_\theta$ respectively.

Our primary domain of interest is when $f_\theta$ and $g_\theta$ are highly flexible function approximators, such as DNNs. This brings certain analytical challenges. In neural networks, different choices of parameters $\theta$ can result in the same functions $f_\theta$ and $g_\theta$, hence the map $\Theta \to \mathcal{F} \times \mathcal{G}$ is many-to-one. In the context of representation learning, the function $f_\theta$ is typically viewed as a nonlinear feature extractor, i.e., the learned representation of the input data. While other choices meet the membership conditions for the family defined by the canonical form of Equation (1), in the remainder we will focus on DNNs. We next present a definition of identifiability suitable for DNNs, and prove that members of the above family satisfy it, under additional mild assumptions.

3. Model Identifiability

In this section, we derive identifiability conditions for models in the family defined in Section 2.

3.1. Identifiability in Parameter Space

Identifiability analysis answers the question of whether it is theoretically possible to learn the parameters of a statistical model exactly. Specifically, given some estimator $\hat{\theta}$ for model parameters $\theta^*$, identifiability is the property that, for any $\theta, \theta' \in \Theta$,

$$p_{\theta} = p_{\theta'} \implies \theta = \theta'. \qquad (2)$$

Models that do not have this property are said to be non-identifiable. This happens when different values $\theta, \theta' \in \Theta$ can give rise to the same model distribution $p_{\theta}(y \mid x, S) = p_{\theta'}(y \mid x, S)$. In such a case, observing an empirical distribution $p_{\theta}(y \mid x, S)$ and fitting a model $p_{\theta'}(y \mid x, S)$ to it perfectly does not guarantee that $\theta = \theta'$.

Neural networks exhibit various symmetries in parameter space, such that there is almost always a many-to-one correspondence between choices of $\theta$ and the resulting probability function $p_\theta$. A simple example is that one can swap the (incoming and outgoing) connections of two neurons in a hidden layer: this changes the value of the parameters, but does not change the network's function. Thus, when the representation functions $f_\theta$ or $g_\theta$ are parameterized as DNNs, Equation (2) is not satisfiable.

3.2. Identifiability in Function Space

For reliable and efficient representation learning, we want learned representations $f_\theta$ from two identifiable models to be sufficiently similar for interchangeable use in downstream tasks.
The most general property we wish to preserve among learned representations is their ability to discriminate among statistical patterns corresponding to categorical groupings. In the model family defined in Section 2, the data and context functions $f_\theta$ and $g_\theta$ parameterize $p_\theta(y \mid x, S)$, the probability of label assignment, through a normalized inner product. This induces a hyperplane boundary for discrimination in a joint space of learned representations for data $x$ and context $y$. Therefore, in the following, we derive identifiability conditions up to a linear transformation, using a notion of similarity in parameter space inspired by Hyvärinen et al. (2018).

Definition 1. Let $\sim_L$ be a pairwise relation on $\Theta$ defined as:

$$\theta' \sim_L \theta'' \iff f_{\theta'}(x) = A f_{\theta''}(x) \ \text{ and } \ g_{\theta'}(y) = B g_{\theta''}(y), \qquad (3)$$

where $A$ and $B$ are invertible $M \times M$ matrices. See Appendix B for a proof that $\sim_L$ is an equivalence relation. In the remainder, we refer to models that are identifiable up to the equivalence relation $\sim_L$ as $\sim_L$-identifiable, or linearly identifiable.

3.3. Derivation of Identifiability Conditions

We next present a simple proof of the $\sim_L$-identifiability of members of the generalized discriminative family defined in Section 2. This result reveals sufficient conditions under which a discriminative probabilistic model $p_\theta(y \mid x, S)$ has a useful property: the learned representations of the input $x$ and target random variables $y$ for any pair of parameters $(\theta', \theta'')$ are related as $\theta' \sim_L \theta''$, that is, $f_{\theta'}(x) = A f_{\theta''}(x)$ and $g_{\theta'}(y) = B g_{\theta''}(y)$.

First, we review the notation for the proof, which is introduced in detail in Section 2. We then highlight an important requirement on the diversity of the data distribution, which must be satisfied for the proof statement to hold. We prove the result immediately after.

Notation. The target random variables $y$, associated with input random variables $x$, may be class labels (as in supervised classification), or they could be stochastically generated from datapoints $x$ as, e.g., perturbed image patches (as in self-supervised learning). We account for this additional stochasticity through a set-valued random variable $S$, containing all possible values of $y$ conditioned on some $x$. For brevity, we use shorthands that drop the parameters $\theta$: $p' := p_{\theta'}$, $p'' := p_{\theta''}$, $f' := f_{\theta'}$, $f'' := f_{\theta''}$, $g' := g_{\theta'}$, $g'' := g_{\theta''}$.

Diversity condition. We assume that for any $(\theta', \theta'')$ for which it holds that $p' = p''$, and for any given $x$, by repeatedly sampling $S \sim p_{\mathcal{D}}(S \mid x)$ and picking two points $y_A, y_B \in S$, we can construct a set of $M$ distinct tuples $\{(y_A^{(i)}, y_B^{(i)})\}_{i=1}^{M}$ such that the matrices $\mathbf{L}'$ and $\mathbf{L}''$ are invertible, where $\mathbf{L}'$ consists of columns $(g'(y_A^{(i)}) - g'(y_B^{(i)}))$ and $\mathbf{L}''$ consists of columns $(g''(y_A^{(i)}) - g''(y_B^{(i)}))$, $i \in \{1, \dots, M\}$. See Section 3.4 for a detailed discussion.

Theorem 1. Under the diversity condition, models in the family defined by Equation (1) are linearly identifiable. That is, for any $\theta', \theta'' \in \Theta$, and $f', f'', g', g'', p', p''$ defined as in Section 2,

$$p' = p'' \implies \theta' \sim_L \theta''.$$

Proof. We proceed by directly constructing an invertible linear transformation that satisfies Definition 1. Consider $y_A, y_B \in S$. Since $p' = p''$, the likelihood ratios for these points are equal:

$$\frac{p'(y_A \mid x, S)}{p'(y_B \mid x, S)} = \frac{p''(y_A \mid x, S)}{p''(y_B \mid x, S)}. \qquad (4)$$

Substituting the model definition from Equation (1), we find:

$$\frac{\exp\left(f'(x)^\top g'(y_A)\right)}{\exp\left(f'(x)^\top g'(y_B)\right)} = \frac{\exp\left(f''(x)^\top g''(y_A)\right)}{\exp\left(f''(x)^\top g''(y_B)\right)}, \qquad (5)$$

where the normalizing constants have cancelled out on both the left- and right-hand sides.
Evaluating the logarithm of both sides and simplifying yields

$$\left(g'(y_A) - g'(y_B)\right)^\top f'(x) = \left(g''(y_A) - g''(y_B)\right)^\top f''(x). \qquad (6)$$

Note that this equation holds for any triple $(x, y_A, y_B)$ for which $p_{\mathcal{D}}(x, y_A, y_B) > 0$. We next collect $M$ distinct tuples $(y_A^{(i)}, y_B^{(i)})$. Using identity (6), we now construct a system of $M$ linear equations that relates $f'$ and $f''$. Let $\mathbf{L}'$ be the $(M \times M)$-dimensional matrix whose $i$-th column is the difference vector $(g'(y_A^{(i)}) - g'(y_B^{(i)}))$. Similarly, let $\mathbf{L}''$ be the $(M \times M)$-dimensional matrix whose $i$-th column is $(g''(y_A^{(i)}) - g''(y_B^{(i)}))$. Then the system of $M$ linear equations is $\mathbf{L}'^\top f'(x) = \mathbf{L}''^\top f''(x)$. By the diversity condition, $\mathbf{L}'$ is invertible. Left-multiplying by $(\mathbf{L}'^\top)^{-1}$ yields

$$f'(x) = \left(\mathbf{L}'' \mathbf{L}'^{-1}\right)^\top f''(x). \qquad (7)$$

Hence, $f'(x) = A f''(x)$ for $A = (\mathbf{L}'' \mathbf{L}'^{-1})^\top$. Because $\mathbf{L}''$ is also invertible, so is $A$. This completes the proof that $p' = p'' \implies f_{\theta'}(x) = A f_{\theta''}(x)$ for invertible $A$. See Appendix C for the remainder, which proves the corresponding result for $g$ and completes the proof of Theorem 1.

3.4. When Does the Diversity Condition Hold?

The diversity condition guarantees the existence of the matrix $A$ in Equation (7) by ensuring that the matrices $\mathbf{L}'$ and $\mathbf{L}''$ are non-singular. Informally, this requires that the set of possible values of $y$ given $x$ is large enough (the size of the set $S$ exceeds a certain number), and that the function $g_\theta$ has enough unique points in its range to ensure that there exist $M$ difference vectors that span it.¹

For example, consider a supervised learning model with $K$ classes. The random variable $S$ is clamped to the possible labels for an image $x$, and is of size $K$. In order for the diversity condition to hold, the number of classes must satisfy $K \geq M + 1$, so that there can exist $M$ difference vectors $g_\theta(y^{(1)}) - g_\theta(y^{(j)})$, $j = 2, \dots, M + 1$. In the case of self-supervised or deep metric learning, where $S$ and $y$ may be randomly generated from $x$, this requirement is easy to satisfy. The same is true for language models with large vocabularies. However, for supervised classification with a small number of classes, this requirement on the size of $S$ may be restrictive, as we discuss further in Section 4. We stress here that our goal is to study representation learning rather than supervised classification, so the fact that our result applies to supervised learning at all is an interesting curiosity.

Along with requiring the number of classes $|S| = K \geq M + 1$, we implicitly assumed that the context representation function $g_\theta$ has the following property: there exist $M$ difference vectors in the range of $g_\theta$ (of the form in Equation (6)) that span it. This is a mild assumption in the context of DNNs: under random initialization and iterative weight updates, this property follows from the stochasticity of the distribution used to initialize the network. Briefly, a set of $M + 1$ unique points $y^{(j)}$ such that the $M$ vectors $g_\theta(y^{(1)}) - g_\theta(y^{(j)})$, $j = 2, \dots, M + 1$, are not linearly independent has measure zero. For other choices of $g_\theta$, care must be taken to ensure this condition is satisfied.

What can be said when $\mathbf{L}'$ and $\mathbf{L}''$ are ill-conditioned, that is, when the ratio between the maximum and minimum singular values, $\sigma_{\max}(\mathbf{L}) / \sigma_{\min}(\mathbf{L})$ (dropping primes when a statement applies to both), is large? In the context of a data representation matrix such as $\mathbf{L}$, this implies that there exists at least one column $\ell_j$ of $\mathbf{L}$ and constants $\lambda_k$, $k \neq j$, such that $\left\| \ell_j - \sum_{k \neq j} \lambda_k \ell_k \right\|_2 < \varepsilon$ for small $\varepsilon$. In other words, some column is nearly a linear combination of the others.
This implies, in turn, that there exists some tuple $(y_A^{(k)}, y_B^{(i)})$ such that the resulting difference vector $\ell_j = g_\theta(y_A^{(k)}) - g_\theta(y_B^{(i)})$ can nearly (in the sense above) be written as a linear combination of the other columns. Such near singularity is in this case caused by the choice of samples $y$ that yield the difference vectors. The issue could be handled by resampling different data points until the condition number of the matrices is satisfactory, which amounts to strengthening the diversity condition. We leave a more detailed analysis to future work, as the result will depend on the choice of $f$ and $g$.

¹ We note here that a second, weaker diversity condition is also required on the data distribution and model with respect to $x$ and $f$. This is discussed in Appendix C.

4. Examples of Linearly Identifiable Models

The form of Equation (1) is already used as a general approach for a variety of machine learning problems. We present a non-exhaustive sample of such publications, chosen to exhibit the range of applications. Many of these approaches were state-of-the-art at the time of their release: Contrastive Predictive Coding (Hénaff et al., 2019), BERT (Devlin et al., 2018), GPT-2 and GPT-3 (Radford et al., 2018; 2019; Brown et al., 2020), XLNet (Yang et al., 2019), and the triplet loss for deep metric learning (Sohn, 2016). In this section, we discuss how to interpret the functional components of these frameworks with respect to the generalized data distribution of Section 2 and the canonical parameterization of Equation (1). See Appendix D for reductions to the canonical form of Equation (1).

Supervised Classification. Although the scope of this paper is identifiable representation learning, under certain conditions standard supervised classifiers can learn identifiable representations as well. In this case, the number of classes must be strictly greater than the feature dimension, as noted in Section 3.4. We simulate such a model in Section 5.1 to show evidence of its linear identifiability. We stress that representation learning as pretraining for classification is a way to ensure that the conditions on label diversity are met, rather than relying on the supervised classifier itself to generate identifiable representations. This paradigm is discussed in the next subsection.

Representations learned during supervised classification can be linearly identifiable under the following model specification. The input random variables $x$ represent some data domain to be classified, such as images or word embeddings. The target variables $y$ represent label assignments for $x$, typically semantically meaningful. These are often encoded as the standard basis vectors $e_y$, a "one-hot" encoding. The set $S$ contains all $K$ possible values of $y$. In this case, notice that $S$ is not stochastic: the empirical distribution $p_{\mathcal{D}}(S \mid x)$ is modelled as a Dirac measure with all probability mass on the set $S = \{0, \dots, K-1\}$ (using integers here to represent distinct labels). The representation function $f_\theta(x)$ of a classifier is often implemented as a DNN that maps from the input layer to the layer just prior to the model logits. The context map $g_\theta(y)$ is given by the weights in the final, linear projection layer, which outputs unnormalized logits. Concretely, $g_\theta(y) = W e_y$, where $W \in \mathbb{R}^{M \times K}$ is a learnable weight matrix.
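The following is a minimal sketch of this parameterization, not a reference implementation: it assumes a toy MLP for $f_\theta$, a weight matrix $W$ for the context map $g_\theta(y) = W e_y$, and $S$ equal to the full label set, with $K \geq M + 1$ as the diversity condition requires.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, M, K = 2, 64, 4, 18   # input dim, hidden width, feature dim M, classes K >= M + 1

# Parameters of f_theta (a small MLP) and of the context map g_theta(y) = W[:, y].
W1, b1 = 0.1 * rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(M, H)), np.zeros(M)
W = 0.1 * rng.normal(size=(M, K))

def f(x):
    """Data representation f_theta(x) in R^M (the layer just before the logits)."""
    return W2 @ np.tanh(W1 @ x + b1) + b2

def log_p(y, x):
    """log p_theta(y | x, S) for S = {0, ..., K-1}, as in Equation (1)."""
    logits = f(x) @ W                # inner products f(x)^T g(y') for all y' in S
    logits = logits - logits.max()   # numerical stability
    return logits[y] - np.log(np.exp(logits).sum())

x = rng.normal(size=D)
probs = np.exp([log_p(y, x) for y in range(K)])
print(probs.sum())                   # 1.0: a valid conditional distribution over S
```

Training such a classifier with the usual cross-entropy loss, $-\log p_\theta(y \mid x, S)$, then optimizes exactly the canonical form above.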
In order to satisfy the diversity condition, the number of classes $K$ must be strictly greater than the dimension $M$ of the learned representation, that is, $|S| \geq M + 1$. Finally, the output of the final, linear projection layer is normalized through a softmax function, yielding the parameterization of Equation (1).

Self-Supervised Pretraining for Image Classification. Self-supervised learning is a framework that first pretrains a DNN before deploying it on some other, related task. The pretraining task often takes the form of Equation (1) and meets the sufficient conditions to be linearly identifiable. A paradigmatic example is Contrastive Predictive Coding (CPC) (Oord et al., 2018). CPC is a general pretraining framework, but for the sake of clarity we focus here on its use in image models. CPC as applied to images involves: (1) preprocessing an image into augmented patches, (2) assigning labels according to which image the patch came from, and (3) predicting the representations of patches lying below, to the right of, to the left of, or above a given patch (Oord et al., 2018). The context function of CPC, $g_\theta(y)$, encodes a particular position in the sequence of patches, and the representation function, $f_\theta(x)$, is an autoregressive function of the previous $k$ patches, according to some predefined patch ordering. Given some $x$, the collection of all patches from the sequence, from a given minibatch of images, is the set $S \sim p_{\mathcal{D}}(S \mid x)$, where the randomness enters via the patch preprocessing algorithm. Since the preprocessing phase is part of the algorithm design, it is straightforward to make it sufficiently diverse (enough transformations of enough patches) to meet the requirements for the model to be linearly identifiable. A minimal sketch of such a contrastive objective is given at the end of this section.

Multi-task Pretraining for Natural Language Generation. Autoregressive language models, such as those of Mikolov et al. (2010) and Dai & Le (2015) and, more recently, GPT-2 and GPT-3 (Radford et al., 2018; 2019; Brown et al., 2020), are typically also instances of the model family of Equation (1). Data points $x$ are the past tokens, $f_\theta(x)$ is a nonlinear representation of the past estimated by either an LSTM (Hochreiter & Schmidhuber, 1997) or an autoregressive Transformer model (Vaswani et al., 2017), $y$ is the next token, and $w_i = g_\theta(y = i)$ is a learned representation of the next token, often implemented as a simple look-up table, as in supervised classification.

BERT (Devlin et al., 2018) is also a member of the linearly identifiable family. This model pretrains word embeddings through a denoising-autoencoder-like (Vincent et al., 2008) architecture. For a given sequence of tokenized text, some fixed percentage of the symbols are extracted and set aside, and their original values are set to a special null symbol, "corrupting" the original sequence. The pretraining task in BERT is to learn a continuous representation of the extracted symbols conditioned on the remainder of the text. A Transformer (Vaswani et al., 2017) function approximator is used to map from the corrupted sequence into a continuous space. The Transformer network is the $f_\theta(x)$ function of Equation (1). The context map $g_\theta(y)$ is a lookup map into the learned basis vector for each token.
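As referenced in the CPC discussion above, here is a minimal sketch of a CPC-style contrastive (InfoNCE) objective written in the canonical form of Equation (1). It is illustrative rather than a reference implementation: it assumes the representations $f_\theta(x_i)$ and $g_\theta(y_i)$ have already been computed for a minibatch, with the other in-batch targets playing the role of the distractor set $S$.

```python
import numpy as np

def info_nce_loss(F, G):
    """Mean of -log p_theta(y_i | x_i, S) over the batch, with S the in-batch targets.

    F: [B, M] array of context representations f_theta(x_i).
    G: [B, M] array of target representations g_theta(y_j); row i is the positive for row i of F.
    """
    logits = F @ G.T                                     # [B, B]: f(x_i)^T g(y_j)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # cross-entropy on the positives

rng = np.random.default_rng(0)
B, M = 128, 64
F = rng.normal(size=(B, M))   # stand-in for encoded top patches
G = rng.normal(size=(B, M))   # stand-in for encoded bottom patches
print(info_nce_loss(F, G))
```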
5. Experiments

The derivation in Section 3 shows that, for models in the general discriminative family defined in Section 2, the functions $f_\theta$ and $g_\theta$ are identifiable up to a linear transformation given unbounded data and assuming model convergence. The question remains as to how closely a model trained on finite data and without convergence guarantees will approach this limit. One subtle issue is that poor architecture choices (such as too few hidden units, or inadequate inductive priors) or insufficient data samples during training can interfere with model estimation, and thereby with linear identifiability of the learned representations, due to underfitting. In this section, we study this issue over a range of models, from low-dimensional language embedding and supervised classification (Figures 1 and 2, respectively) to GPT-2 (Radford et al., 2019), an approximately $1.5 \times 10^9$-parameter generative model of natural language (Figure 4). See Appendix A and the code release for the details needed to reproduce these results.

Through these experiments, we show that (1) in the low-dimensional, large-data regime, linearly identifiable models yield learned representations that lie approximately within a linear transformation of each other (Figures 1 and 2), as predicted by Theorem 1; and (2) in the high-dimensional, large-data regime, linearly identifiable models yield learned representations that exhibit a strong trend towards linear identifiability. The learned representations approach a linear transformation of each other monotonically, as a function of dataset sample size, neural network capacity (number of hidden units), and optimization progress. In the case of GPT-2, which has benefited from substantial tuning by engineers to improve model estimation, we find strong evidence of linear identifiability.

Note on methodology: measuring linear similarity between learned representations. How can we measure whether pairs of learned representations live within a linear transformation of each other in function space? We adapt Canonical Correlation Analysis (CCA) (Hotelling, 1936) for this purpose, which finds the optimal linear transformations to maximize correlation between two random vectors. On a randomly selected held-out subset $B \subset \mathcal{D}$ of the training data, we compute $f_{\theta_1}(B)$ and $f_{\theta_2}(B)$ for two models with parameters $\theta_1$ and $\theta_2$ respectively. Assume without loss of generality that $f_{\theta_1}(B)$ and $f_{\theta_2}(B)$ are centered. CCA finds the optimal linear transformations $C$ and $D$ such that the pairwise correlations $\rho_i$ between the $i$-th columns of $C f_{\theta_1}(B)$ and $D f_{\theta_2}(B)$ are maximized. We collect the correlations together in $\rho$. If, after linear transformation, the two matrices are aligned, the mean of $\rho$ will be 1; if they are instead uncorrelated, the mean of $\rho$ will be 0. We use the mean of $\rho$ as a proxy for the existence of a linear transformation between $f_{\theta_1}(B)$ and $f_{\theta_2}(B)$.

For DNNs, it is a well-known phenomenon that most of the variability in a learned representation tends to concentrate in a low-dimensional subspace, leaving many noisy, random dimensions (Morcos et al., 2018). Such random noise can result in spuriously high correlations in CCA. A solution to this problem is to apply Principal Components Analysis (PCA) (Pearson, 1901) to each of the two matrices $f_{\theta_1}(B)$ and $f_{\theta_2}(B)$, projecting onto their top-$k$ principal components, before applying CCA. This technique is known as SVCCA (Raghu et al., 2017).
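A minimal sketch of this measurement follows, assuming two representation matrices from independently trained models evaluated on the same held-out batch; the released code for this paper may differ, so treat this only as an illustration of the PCA-then-CCA recipe.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

def mean_svcca(F1, F2, k=20):
    """Mean CCA correlation between the top-k principal subspaces of F1 and F2 ([n, d] each)."""
    Z1 = PCA(n_components=k).fit_transform(F1)   # discard low-variance, noisy directions
    Z2 = PCA(n_components=k).fit_transform(F2)
    cca = CCA(n_components=k, max_iter=2000).fit(Z1, Z2)
    U, V = cca.transform(Z1, Z2)
    rhos = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(k)]
    return float(np.mean(rhos))   # ~1 if the representations are linearly related, ~0 if not

rng = np.random.default_rng(0)
F1 = rng.normal(size=(2000, 64))
F2 = F1 @ rng.normal(size=(64, 64)) + 0.05 * rng.normal(size=(2000, 64))
print(mean_svcca(F1, F2))   # close to 1.0: F2 is a noisy linear image of F1
```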
5.1. Simulation Study: Classification by DNNs

We report first on a simulation study of linearly identifiable $K$-way classification, where all assumptions and sufficient conditions of Theorem 1 are guaranteed to be met. We generated a synthetic data distribution with the properties required by Section 2, and chose DNNs with sufficient capacity to learn a specified nonlinear relationship between inputs $x$ and targets $y$. In short, the data distribution $p_{\mathcal{D}}(x, y, S)$ consists of inputs $x$ sampled from a 2-D Gaussian with $\sigma = 3$. The targets $y$ were assigned among $K = 18$ classes according to their angular position (the angle swept out by a ray anchored at the origin). The number of classes $K$ was chosen to ensure $K \geq \dim[f_\theta(x)] + 1$, the diversity condition. See Appendix D.1 for more details.

To evaluate linear similarity, we trained two randomly initialized models of $p_{\mathcal{D}}(y \mid x, S)$. Plots show $f_\theta(x)$, the data representation function, on random $x$. Figure 2b shows that the mean CCA correlation increases to its maximum value over training, demonstrating that the feature spaces converge to the same solution up to a linear transformation, modulo model estimation noise. Similarly, Figure 2c shows that the learned representations exhibit a strongly linear relationship.

Figure 2. Supervised Classification. (a) Data distribution for a linearly identifiable $K$-way classification problem. (b) Mean (centered) CCA between the learned representations over the course of training. After approximately 4000 iterations, CCA finds a linear transformation that rotates the learned representations into alignment, up to optimization error. (c) Learned representations after transformation via the optimal linear transformation. The first dimension of the first model's feature space is plotted against the first dimension of the second's. The learned representations have a nearly linear relationship, modulo estimation noise.

5.2. Self-Supervised Learning for Image Classification

We next investigate high-dimensional, self-supervised representation learning on CIFAR-10 (Krizhevsky et al., 2009) using CPC (Oord et al., 2018; Hénaff et al., 2019). For a given input image, this model predicts the identity of a bottom image patch representation given a top patch representation (Figure 3a). Here, $S$ comprises the true patch together with a set of distractor patches drawn from across the current minibatch. For each model we define both $f_\theta$ and $g_\theta$ as a 3-layer MLP with 256 units per layer (except where noted otherwise) and fix the output dimensionality to 64.

Figure 3. Self-Supervised Representation Learning. Error bars are computed over 5 pairs of models. (a) Input data. Two patches are taken at random from an image (one from the top half, and one from the bottom half). Using a contrastive loss, we predict the identity of the bottom patch encoding from the top. (b) Linear similarity of learned representations at checkpoints (see legend). As models converge, linear similarity increases. (c) Linear similarity as we increase the amount of data for $f_\theta$ and $g_\theta$. (d) As we increase model size, linear similarity after convergence increases for both $f_\theta$ and $g_\theta$.

In Figure 3b, CCA coefficients are plotted over the course of training. As training progresses, alignment between the learned representations increases. In Figure 3c, we artificially limited the size of the dataset and plot the mean correlation after training and convergence. This shows that increasing availability of data correlates with closer alignment.
In Figure 3d, we fix the dataset size and artificially limit the model capacity (number of hidden units) to investigate the effect of model size on the learned representations, varying the number of hidden units from 64 to 8192. This shows that increasing model capacity correlates with increased alignment of the learned representations.

5.3. Text Embeddings from GPT-2

Finally, we report on a study of GPT-2 (Radford et al., 2019), a massive-scale language model. The identifiable representation is the set of features just before the last linear layer of the model. We use pretrained models from Hugging Face (Wolf et al., 2019). Hugging Face provides four different versions of GPT-2: gpt2, gpt2-medium, gpt2-large and gpt2-xl, which differ mainly in the hyper-parameters that determine the width and depth of the neural network layers. For approximately 2000 input sentences, for each model, we extracted per-timestep representations at the last layer (which is identifiable), in addition to the per-timestep representations given by three earlier layers in the model. Then, we performed SVCCA on each possible pair of models, on each of the four representations. SVCCA was performed with 16, 64, 256 and 768 principal components, computed by applying SVD separately to each representation of each model. We chose 768 as the largest number of principal components, since that is the representation size of the smallest model in the repository (gpt2). We then averaged the CCA correlation coefficients across the pairs of models. Figure 4 shows the results. The results align well with our theory, namely that the representations at the last layer are more linearly related than the representations at other layers of the model.

Figure 4. Text Embeddings from GPT-2. Representations of the last hidden layer (which is identifiable), in addition to three earlier layers (not necessarily identifiable), for four GPT-2 models. For each representation layer, SVCCA is computed over all pairs of models, and correlation coefficients are averaged. SVCCA was applied with 16, 64, 256 and 768 principal components. The learned representations in the last, identifiable layer are more correlated than representations learned in preceding layers.

5.4. Interpretation and Summary

Theorem 1 establishes linear identifiability as an asymptotic property of a model that holds in the limit of infinite data and exact estimation. The experiments of this section have shown that for linearly identifiable models, when the dimensionality is small relative to the dataset size (Figures 1 and 2), the learned embeddings are closely linearly related, up to noise. Problems of model estimation and sufficient dataset size are more pronounced in high dimensions. Nevertheless, in GPT-2, representations among different trained models do in fact approach a mean correlation coefficient of 1.0 after training (Figure 4, blue line), providing strong evidence of linear identifiability.

6. Related Works

Prior to Hyvärinen & Morioka (2016), identifiability analysis was uncommon in deep learning. We build on advances in the theory of nonlinear ICA (Hyvärinen & Morioka, 2016; Hyvärinen et al., 2018; Khemakhem et al., 2019). In this section, we carefully distinguish our results from prior and concurrent works. Our diversity assumption is similar to the diversity assumptions in these earlier works, while differing on certain conditions.
The main difference is that their results apply to related but distinct families of models compared to the general discriminative family outlined in this paper. Arguably most related is Theorem 3 of Hyvärinen et al. (2018) and its proof, which shows that a class of contrastive discriminative models will estimate, up to an affine transformation, the true latent variables of a nonlinear ICA model. The main difference with our result is that they additionally assume: (1) that the mapping between observed variables and latent representations is invertible; and (2) that the discriminative model is binary logistic regression exhibiting universal approximation (Hornik et al., 1989), estimated with a contrastive objective. In addition, Hyvärinen et al. (2018) do not present conditions for affine identifiability for their version of the context representation function $g$. It should be noted that Theorem 1 in Hyvärinen et al. (2018) provides a potential avenue for further generalization of our Theorem 1 to discriminative models with non-linear interaction between $f$ and $g$.

Concurrent work (Khemakhem et al., 2020) has expanded the theory of identifiable nonlinear ICA to a class of conditional energy-based models (EBMs) with universal density approximation capability, thereby imposing milder assumptions than previous nonlinear ICA results. Their version of affine identifiability is similar to our result of linear identifiability in Section 3.2. The main difference is that Khemakhem et al. (2020) focus in both theory and experiment on EBMs. This allows for alternative versions of the diversity condition, assuming that the Jacobians of their versions of $f$ or $g$ are full rank. This is only possible if $x$ or $y$ are assumed continuous-valued; note that we do not make such an assumption. Khemakhem et al. (2020) also present an architecture for which the conditions provably hold, in addition to sufficient conditions for identifiability up to element-wise scaling, which we did not explore in this work. While we build on these earlier results, we are, to the best of our knowledge, the first to apply identifiability analysis to state-of-the-art discriminative and autoregressive generative models.

Recent work on the asymptotics of fully-connected, infinitely wide neural networks (Lee et al., 2017) has shown that they converge to a Gaussian Process with a particular approximable kernel, extending earlier work on single-layer networks (Neal, 1995). Jacot et al. (2018) prove that the evolution of a neural network during training can also be described by a kernel, termed the Neural Tangent Kernel (NTK). Both are fine-grained analyses that place restrictions on the forms of the neural networks under analysis in order to produce strong analytic results. Like NTK, we take a function-space perspective, but our analysis considers learned representation functions and their optimal solution sets.

7. Conclusion

We have shown that representations learned by a large family of discriminative models are identifiable up to a linear transformation, providing a novel perspective on representation learning using DNNs. Since identifiability is a property of a model class, and identification is realized only in the asymptotic limit of data and compute, we perform experiments in the more realistic setting of finite datasets and finite compute.
Our empirical results show that as the representational capacity of the model and the dataset size increase, learned representations indeed tend towards solutions that are equal up to only a linear transformation.

8. Acknowledgements

We thank Ryan Adams for in-depth conversations on methodology, Kevin Murphy for helpful feedback on an early version, Aapo Hyvärinen, Ilyes Khemakhem, and Jascha Sohl-Dickstein for helpful conversations and feedback, George Tucker and Alexi Alemi for feedback on an early version of the theorem, and members of the Google Brain Team and the Princeton Laboratory for Intelligent Probabilistic Systems for valuable remarks and comments. Geoffrey Roeder is supported in part by the National Science and Engineering Research Council of Canada (PGSD3-5187162018).

References

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: Composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., and others. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165, 2020.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P., and Robinson, T. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. arXiv preprint arXiv:1312.3005, 2013.

Dai, A. M. and Le, Q. V. Semi-Supervised Sequence Learning. In Advances in Neural Information Processing Systems, pp. 3079-3087, 2015.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research, 11(Feb):625-660, 2010.

Hénaff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-Efficient Image Recognition with Contrastive Predictive Coding. arXiv preprint arXiv:1905.09272, 2019.

Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.

Hoffer, E. and Ailon, N. Deep Metric Learning Using Triplet Network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84-92. Springer, 2015.

Hornik, K., Stinchcombe, M., and White, H. Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2(5):359-366, 1989.

Hotelling, H. Relations Between Two Sets of Variates. Biometrika, 28(3/4):321-377, 1936.

Hyvärinen, A. and Morioka, H. Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA. In Advances in Neural Information Processing Systems, pp. 3765-3773, 2016.

Hyvärinen, A., Sasaki, H., and Turner, R. E. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. arXiv preprint arXiv:1805.08651, 2018.

Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems, pp. 8571-8580, 2018.

Johansson, F., Shalit, U., and Sontag, D. Learning Representations for Counterfactual Inference. In International Conference on Machine Learning, pp. 3020-3029, 2016.

Khemakhem, I., Kingma, D. P., and Hyvärinen, A. Variational Autoencoders and Nonlinear ICA: A Unifying Framework. arXiv preprint arXiv:1907.04809, 2019.

Khemakhem, I., Monti, R. P., Kingma, D. P., and Hyvärinen, A. ICE-BeeM: Identifiable Conditional Energy-Based Deep Models. arXiv preprint arXiv:2002.11537, 2020.
Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A., Hinton, G., and others. Learning Multiple Layers of Features from Tiny Images. 2009.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep Neural Networks as Gaussian Processes. arXiv preprint arXiv:1711.00165, 2017.

Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, Ł., and Shazeer, N. Generating Wikipedia by Summarizing Long Sequences. arXiv preprint arXiv:1801.10198, 2018.

Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. Causal Effect Inference with Deep Latent-Variable Models. In Advances in Neural Information Processing Systems, pp. 6446-6456, 2017.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent Neural Network Based Language Model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

Mnih, A. and Hinton, G. E. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems, pp. 1081-1088, 2009.

Mnih, A. and Teh, Y. W. A Fast and Simple Algorithm for Training Neural Probabilistic Language Models. arXiv preprint arXiv:1206.6426, 2012.

Morcos, A. S., Raghu, M., and Bengio, S. Insights on Representational Similarity in Neural Networks with Canonical Correlation, 2018.

Neal, R. M. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.

Pearson, K. LIII. On Lines and Planes of Closest Fit to Systems of Points in Space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559-572, 1901.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving Language Understanding by Generative Pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8), 2019.

Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In Advances in Neural Information Processing Systems, pp. 6076-6085, 2017.

Sharif Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806-813, 2014.

Sohn, K. Improved Deep Metric Learning with Multi-class N-Pair Loss Objective. In Advances in Neural Information Processing Systems, pp. 1857-1865, 2016.

Sorrenson, P., Rother, C., and Köthe, U. Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN). arXiv:2001.04872 [cs, stat], January 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and Composing Robust Features with Denoising Autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096-1103, 2008.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771, 2019.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237, 2019.