
On Affine Homotopy between Language Encoders

Robin S. M. Chan1 Reda Boumasmoud1 Anej Svete1 Yuxin Ren2

Qipeng Guo3 Zhijing Jin1,4 Shauli Ravfogel1 Mrinmaya Sachan1

Bernhard Schölkopf1,4 Mennatallah El-Assady1 Ryan Cotterell1

1ETH Zürich 2Tsinghua University 3Fudan University 4Max Planck Institute for Intelligent Systems

Pre-trained language encoders, i.e., functions that represent text as vectors, are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be intrinsic, that is, task-independent, yet still be informative of extrinsic similarity, i.e., the performance on downstream tasks. It is common to consider two encoders similar if they are homotopic, i.e., if they can be aligned through some transformation.¹ In this spirit, we study the properties of affine alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.

https://github.com/chanr0/affine-homotopy

1 Introduction

A common paradigm in modern natural language processing (NLP) is to pre-train a language encoder on a large swathe of natural language text. Then, a task-specific model is fit (fine-tuned) using the language encoder as the representation function of the text. More formally, a language encoder is a function $h\colon \Sigma^* \to \mathbb{R}^D$, i.e., a function that maps a string over an alphabet $\Sigma$ to a finite-dimensional vector. Now, consider sentiment analysis as an informative example of a task. Suppose our goal is to classify a string $y \in \Sigma^*$ as one of three polarities, collected in a set $\Pi$. Then, the probability of $y$ exhibiting a specific polarity $\pi \in \Pi$ is often given by a log-linear model, e.g.,

$$p(\pi \mid y) = \mathrm{softmax}(E\,h(y) + b)_{\pi}, \qquad (1)$$

where $E \in \mathbb{R}^{3 \times D}$, $b \in \mathbb{R}^{3}$, and $\mathrm{softmax}\colon \mathbb{R}^N \to \Delta^{N-1}$. Empirically, using a pre-trained encoder $h$ leads to significantly better classifier performance than training a log-linear model from scratch.
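As a minimal sketch of the log-linear model above, the following Python snippet computes softmax(E h(y) + b) over three polarities. The encoder `toy_encoder`, the weights `E`, and the bias `b` are hypothetical placeholders chosen only for illustration; a real setup would use a pre-trained encoder and learned parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_encoder(text):
    """Hypothetical stand-in for an encoder h: maps a string to a vector in R^2."""
    return [float(len(text)), float(text.count("!"))]

def polarity_probs(text, E, b, encoder=toy_encoder):
    """Log-linear model softmax(E h(y) + b) over three polarities."""
    rep = encoder(text)
    logits = [sum(w * x for w, x in zip(row, rep)) + bias
              for row, bias in zip(E, b)]
    return softmax(logits)

# Illustrative parameters only: E is 3 x D with D = 2, b has one entry per polarity.
E = [[0.1, 1.0], [0.0, 0.0], [-0.1, -1.0]]
b = [0.0, 0.5, 0.0]
probs = polarity_probs("great movie!", E, b)
```

Only `E` and `b` are task-specific here; the encoder is reused unchanged, which is the transfer-learning setting the paragraph describes.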

In the context of the widespread deployment of language encoders, this paper tackles a natural question: Given two language encoders $h$ and $g$, how can we judge to what extent they are similar? This question is of practical importance: recent studies have shown that even small variations in the random seed used for training can result in significant performance differences on downstream tasks between models with the same architecture [13, 35]. In this case, we say that two such language encoders exhibit an extrinsic difference, i.e., the difference between two encoders manifests itself when considering their performance on a downstream task. However, we also seek an intrinsic notion of similarity between two language encoders, i.e., a notion of similarity that is independent of any particular downstream task. Moreover, we may hope that a good notion of intrinsic similarity would allow us to construct a notion of extrinsic similarity that holds for all downstream tasks.

¹ Homotopy, from the Greek ὁμός (homo; same) and τόπος (topos; place), refers to a continuous transformation between functions or shapes, showing they can be deformed into one another without breaking or tearing.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Existing work studies language encoder similarity by evaluating whether two encoders produce similar representations for a finite dataset of strings [3, 20, 22, 42, inter alia], often by analyzing whether the representation sets can be approximately linearly aligned [22, 27]. More formally, two encoders are considered similar if there exists a matrix $A$ such that $h(y) = A\,g(y)$ holds for strings $y$ in some finite set $\mathcal{D} \subset \Sigma^*$.² This assumes that examining finitely many outputs provides sufficient insight into encoder behavior. In contrast, we set out to study the relationships between language encoders, i.e., functions, themselves. This decision, rather than being just a technicality, allows us to derive a richer understanding of encoder relationships, revealing properties and insights that remain obscured under conventional finite-set analysis. Concretely, we ask what notions of similarity between encoders one could consider and what they imply for their relationships.

The main contributions of the paper are of a theoretical nature. We first define an (extended) metric space on language encoders. We then extend this notion to account for transformations in a broad framework of S-homotopy for a set of transformations $S$, where $g$ is S-homotopic to $h$ if $g$ can be transformed into $h$ through some transformation in $S$. As a concrete application of the framework, we study affine homotopy, i.e., the similarity of $h$ and $\psi \circ g$ for affine transformations $\psi$. The notion of intrinsic similarity induced by such one-sided alignment is not symmetric and can be seen as the cost of transforming $g$ into $h$. Nevertheless, we show it is informative of extrinsic similarity: If one encoder can be affinely mapped to another, we can guarantee that it also performs similarly on downstream tasks. We confirm this empirically by studying the intrinsic and extrinsic similarities of various pre-trained encoders, where we observe a positive correlation between intrinsic and extrinsic similarity. Beyond measuring similarity, homotopy also allows us to define a form of hierarchy on the space of encoders, elucidating a structure in which some encoders are more informative than others. Such an order is also suggested by our experiments, where we find that certain encoders are easier to map to than others, which shows in the rank of the learned representations and affects their transfer learning ability.

2 Language Encoders

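Let $\Sigma$ be an alphabet, i.e., a finite, non-empty set of symbols $y$, and $\mathrm{EOS} \notin \Sigma$ a distinguished end-of-string symbol. With $\Sigma^* \overset{\text{def}}{=} \bigcup_{n=0}^{\infty} \Sigma^n$ we denote the Kleene closure of $\Sigma$, the set of all strings $y$. A language encoder is a function $h\colon \Sigma^* \to V \overset{\text{def}}{=} \mathbb{R}^D$ that maps strings to real vectors.³ We write $\mathcal{E}_V \overset{\text{def}}{=} V^{\Sigma^*}$ for the $\mathbb{R}$-vector space of language encoders, and $\mathcal{E}_b \overset{\text{def}}{=} \{h \in \mathcal{E}_V \mid h(\Sigma^*) \text{ is bounded}\} \subset \mathcal{E}_V$ for its sub-vector space of bounded encoders.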

There are two common ways that language encoders are created [7]. The first is through autoregressive language modeling. A language model (LM) is a probability distribution over $\Sigma^*$.⁴ Autoregressive LMs are defined through the multiplication of conditional probability distributions $p_h(y_t \mid y_{<t})$ as

$$p^{\mathrm{LM}}_h(y) \overset{\text{def}}{=} p_h(\mathrm{EOS} \mid y) \prod_{t=1}^{|y|} p_h(y_t \mid y_{<t}), \qquad (2)$$

where each $p_h(\cdot \mid y_{<t})$ is a distribution over $\Sigma \cup \{\mathrm{EOS}\}$ parametrized by a language encoder $h$:

$$p_h(y_t \mid y_{<t}) \overset{\text{def}}{=} \mathrm{softmax}(E\,h(y_{<t}))_{y_t}, \qquad (3)$$

where $E \in \mathbb{R}^{(|\Sigma|+1) \times D}$. An autoregressive LM provides a simple manner to learn a language encoder from a dataset of strings $\mathcal{D} = \{y^{(n)}\}_{n=1}^{N}$ by minimizing $\mathcal{D}$'s negative log-likelihood. We may also learn a language encoder through masked language modeling (MLM), which defines the conditional probabilities based on both sides of the masked symbol's context:

$$p_h(y_t \mid y_{<t}, y_{>t}) \overset{\text{def}}{=} \mathrm{softmax}(E\,h(y_{<t} \circ [\mathrm{MASK}] \circ y_{>t}))_{y_t}. \qquad (4)$$
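The autoregressive factorization in Eqs. (2) and (3) can be sketched in a few lines of Python. Everything concrete below is an assumption made for illustration: the two-symbol alphabet, the count-based `toy_encoder`, and the weight matrix `E` are hypothetical and not taken from the paper.

```python
import math

EOS = "<eos>"
ALPHABET = ["a", "b"]          # toy alphabet Sigma (illustrative)
SYMBOLS = ALPHABET + [EOS]     # Sigma together with EOS

def softmax(scores):
    """Numerically stable softmax over a list of real scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_encoder(prefix):
    """Hypothetical encoder h: maps a prefix to R^2 (symbol counts)."""
    return [float(prefix.count("a")), float(prefix.count("b"))]

# Illustrative output matrix E with |Sigma| + 1 rows and D = 2 columns.
E = [[0.5, -0.5], [-0.5, 0.5], [0.2, 0.2]]

def next_symbol_dist(prefix):
    """Eq. (3): distribution over Sigma and EOS given the prefix."""
    rep = toy_encoder(prefix)
    logits = [sum(w * x for w, x in zip(row, rep)) for row in E]
    return dict(zip(SYMBOLS, softmax(logits)))

def string_prob(y):
    """Eq. (2): product of the conditionals, times the EOS probability."""
    prob = 1.0
    for t, symbol in enumerate(y):
        prob *= next_symbol_dist(y[:t])[symbol]
    return prob * next_symbol_dist(y)[EOS]
```

The encoder enters only through `next_symbol_dist`, so swapping in a different encoder changes the language model without changing the factorization.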

² We discuss related work in more detail in App. C.
³ In principle, one could replace $\mathbb{R}^D$ with any finite-dimensional vector space.
⁴ In the following, we assume language model tightness, so that LMs produce valid probability distributions over $\Sigma^*$ [15].

Maximizing the log-likelihood of a corpus under a language model derived from a language encoder $h$ with a gradient-based algorithm only requires $h$ to be a differentiable function of its parameters. Once a language encoder has been trained on a (large) corpus, its representations can be used on more fine-grained NLP tasks such as classification. The rationale for such transfer learning is that representations $h(y)$ stemming from a performant language model also contain information useful for other downstream tasks on natural language. An NLP practitioner might then implement a task-specific transformation of $h(y)$. To tackle the problem that the tasks of interest are often less resource-abundant and to keep the training costs low, task-specific transformations are usually simple, often in the form of linear transformations of $h(y)$, as in Eq. (1).

3 Measuring the Alignment of Language Encoders

We begin by introducing measures of affine alignment and hemi-metrics on $\mathcal{E}_V$.

3.1 Preliminaries on Hemi-Metric Spaces

Language encoders compute representations for the infinitely many strings in $\Sigma^*$. In general, these representations might diverge towards $\infty$, making it necessary to talk about unbounded encoders, where it is convenient to allow distances and norms to take extended real numbers as values.⁵

Definition 3.1. An extended metric on a set $X$ is a map $d\colon X \times X \to \overline{\mathbb{R}}_+$ such that

a. $\forall x, y \in X$, $d(x, y) = 0$ iff $x = y$; (Identity)

b. $\forall x, y, z \in X$, $d(x, y) \le d(x, z) + d(z, y)$; (Triangle Inequality)

c. $\forall x, y \in X$, $d(x, y) = d(y, x)$. (Symmetry)

Similarly, an extended norm is a map $\|\cdot\|\colon X \to \overline{\mathbb{R}}_+$ that satisfies the norm axioms. Moreover, we will consider maps $d$ that do not satisfy the symmetry axiom. Lawvere [25] notes that symmetry is artificial and unnecessary for many of the main theorems involving metric spaces. In such situations, the quantity $d(x, y)$ can be interpreted as the cost of going from $x$ to $y$. Occasionally, we want $d$ to capture that it costs more to go from $x$ to $y$ than to return, making asymmetry desirable.

Definition 3.2. A hemi-metric⁶ or Lawvere-metric on a set $X$ is a map $d\colon X \times X \to \overline{\mathbb{R}}_+$ such that

a. $d(x, x) = 0$ for all $x \in X$,

b. $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z \in X$.

One of our main contributions is a formalization of measuring how far a language encoder $h$ is from the set of all possible transformations of another encoder $g$, for example, from all affine transformations of $g$. For this, we lift a hemi-metric over elements $x \in X$ to subsets of $X$, a construction that is crucial for the rest of the paper.

Definition 3.3. Let $(X, d)$ be a hemi-metric space. For non-empty $E, E' \subset X$, we define

$$d^{\mathcal{H}}(E, E') \overset{\text{def}}{=} \sup_{x \in E} \inf_{y \in E'} d(x, y). \qquad (5)$$

The map $d^{\mathcal{H}}$ is called the Hausdorff-Hoare map and is a hemi-metric on $\mathcal{P}(X) \setminus \{\emptyset\}$, the set of non-empty subsets of $X$. When $E$ is a singleton set $\{x\}$, we will, with a slight abuse of notation, write $d^{\mathcal{H}}(x, E')$ to mean $d^{\mathcal{H}}(\{x\}, E')$, defined as $\inf_{y \in E'} d(x, y)$.⁷
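For finite sets, the Hausdorff-Hoare map of Eq. (5) reduces to a max over a min. The sketch below uses the basic hemi-metric max(x - y, 0) on the reals mentioned in the footnotes; the two example sets are made up for illustration.

```python
def d_R(x, y):
    """Hemi-metric on the reals: the cost max(x - y, 0) of going from x to y."""
    return max(x - y, 0.0)

def hausdorff_hoare(E, E_prime, d=d_R):
    """sup over x in E of inf over y in E' of d(x, y), for finite non-empty sets."""
    return max(min(d(x, y) for y in E_prime) for x in E)

A = [0.0, 1.0]
B = [0.0, 1.0, 3.0]

forward = hausdorff_hoare(A, B)   # every point of A also lies in B
backward = hausdorff_hoare(B, A)  # the point 3.0 of B has no cheap match in A
```

Here `forward` is zero while `backward` is positive, which is the kind of asymmetry the definition is meant to allow.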

We next introduce the hemi-metric recipe. It tells us how one can define a hemi-metric on a set $X$ by embedding $X$ into the power set of another space $Y$ where a hemi-metric already exists. After $X$ is embedded, one can use the Hausdorff-Hoare map based on the hemi-metric from $Y$ to define a hemi-metric on $X$ through the images of $x \in X$.

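⁵ $\overline{\mathbb{R}}_+$ is the set of non-negative real numbers along with the value $\infty$, assumed to be above all reals. We adopt the following conventions: $\infty \cdot 0 = 0 \cdot \infty = 0$; $\infty + r = r + \infty = \infty$ and $\infty \cdot r = r \cdot \infty = \infty$ for every $r \in \mathbb{R}_{>0}$.
⁶ A basic example of a hemi-metric space is the pair $(\mathbb{R}, d_{\mathbb{R}})$, where $d_{\mathbb{R}}(x, y) = \max(x - y, 0)$.
⁷ Additional properties of $d^{\mathcal{H}}$ are discussed in Lem. D.1.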

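Remark 3.1 (Hemi-Metric Recipe). Let $X$ be a set, $(Y, d)$ a hemi-metric space, and $S\colon X \to \mathcal{P}(Y) \setminus \{\emptyset\}$, $x \mapsto E_x$ a function that assigns to each $x \in X$ a subset $E_x \in \mathcal{P}(Y) \setminus \{\emptyset\}$. Using Lem. D.1, we can construct a hemi-metric on $X$ with

$$d^{\mathcal{H}}_S(x, y) \overset{\text{def}}{=} d^{\mathcal{H}}(E_x, E_y) = d^{\mathcal{H}}(S(x), S(y)),$$

and an extended pseudo-metric (a symmetric hemi-metric) with

$$d^{\mathcal{H},\mathrm{sym}}_S(x, y) = \max\big(d^{\mathcal{H}}_S(x, y),\, d^{\mathcal{H}}_S(y, x)\big).$$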

Remark 3.1 introduces a general recipe for defining hemi-metric spaces on function spaces, i.e., topological spaces whose elements are functions from a set to subsets of an extended-metric space. This naturally applies to the study of encoders and their transformations, which we call S-homotopy, i.e., two encoders are S-homotopic if one can be S-deformed into the other. In this case, the set $E_h$ for $h \in \mathcal{E}_V$ corresponds to the set of encoders that $h$ can be transformed into with mappings in $S$. We could, for example, take $S$ as the set of all continuous maps, smooth maps, or multi-layer perceptrons. Our following discussion of affine maps, i.e., where $S = \mathrm{Aff}(V)$, is extrinsically motivated but can be understood as a specific instance of the more general framework of S-homotopy.

3.2 A Norm and a Distance on EV

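The hemi-metric recipe first requires us to define a (hemi-)metric on the individual elements. Given that all norms on the $\mathbb{R}$-vector space $V$ are equivalent [24, Proposition 2.2, XII], we fix in this paper a norm $|\cdot|_V$ on $V$. We introduce the maps $\|\cdot\|_\infty\colon \mathcal{E}_V \to \overline{\mathbb{R}}_+$ and $d_\infty(\cdot, \cdot)\colon \mathcal{E}_V \times \mathcal{E}_V \to \overline{\mathbb{R}}_+$:

$$\|h\|_\infty \overset{\text{def}}{=} \sup_{y \in \Sigma^*} |h(y)|_V \quad (6) \qquad \text{and} \qquad d_\infty(h, g) = \|h - g\|_\infty, \quad (7)$$

where $\|\cdot\|_\infty$ is an extended norm on $\mathcal{E}_V$ and $(\mathcal{E}_V, d_\infty)$ is a complete⁸ extended metric space.⁹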

Let $\mathrm{GL}(V)$ be the set of invertible $D \times D$ matrices. We write $\|\cdot\|_V\colon \mathrm{GL}(V) \to \mathbb{R}_{+}$ for the subordinate matrix norm, i.e., $\|A\|_V = \sup_{v \in V \setminus \{0\}} \frac{|Av|_V}{|v|_V}$. By abuse of language, we can view $V$ as an affine space¹⁰ and write $\mathrm{Aff}(V)$ for the group of affine transformations of $V$. An affine transformation $\psi$ on $V$ is a map $v \mapsto Av + b$, for some invertible $A \in \mathrm{GL}(V)$ and $b \in V$. We call $\psi_{\mathrm{lin}} \overset{\text{def}}{=} A$ the linear part of $\psi$ and $t_\psi\colon v \mapsto v + b$ its translation part. We denote with $T \subset \mathrm{Aff}(V)$ the subgroup of translations. Note that there is a natural left action of $\mathrm{Aff}(V)$ on $\mathcal{E}_V$, i.e., $\mathrm{Aff}(V) \times \mathcal{E}_V \to \mathcal{E}_V$, $(\psi, h) \mapsto \psi \circ h$.¹¹

3.3 Affine Alignment of Language Encoders

We now use the general recipe from Remark 3.1 for affine alignment of language encoders, i.e., affinely mapping from one encoder to another. For a subset $S \subset \mathrm{Aff}(V)$ we can define

$$d_S(h, g) \overset{\text{def}}{=} d^{\mathcal{H}}_\infty(h, S(g)) = \inf_{\psi \in S} \|h - \psi \circ g\|_\infty \qquad (8a)$$

$$\|h\|_S \overset{\text{def}}{=} d_S(0_{\mathcal{E}_V}, h), \qquad (8b)$$

where $S(h) \overset{\text{def}}{=} \{s \circ h \mid s \in S\}$. In the notation of the hemi-metric recipe from Remark 3.1, we set $X = Y = \mathcal{E}_V$ (we align an encoder with another encoder) and $d = d_\infty$, the uniform convergence distance (cf. §3.2). Further, we take $S \subseteq \mathrm{Aff}(V) \subset V^V$ and define $S\colon \mathcal{E}_V \to \mathcal{P}(\mathcal{E}_V)$, $h \mapsto S(h) \overset{\text{def}}{=} \{s \circ h \mid s \in S\}$. In words, $d_S(h, g)$ captures the notion of how well the encoder $g$ can be S-transformed into $h$. This is commonly called the alignment of $g$ with $h$. $d_S(h, g)$ does not, however, necessarily tell us anything about how well $h$ can be S-transformed into $g$, resulting in asymmetry.
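To make the asymmetry concrete, consider the following worked example, which is an illustration and not part of the original text. It assumes $V = \mathbb{R}$ with the absolute value as norm, the zero encoder $0_{\mathcal{E}_V}$, and an arbitrary bounded, non-constant encoder $g$.

```latex
% Aligning g with the zero encoder: shrink g with the invertible map v -> a v.
d_{\mathrm{Aff}(V)}(0_{\mathcal{E}_V}, g)
  = \inf_{a \neq 0,\; b \in \mathbb{R}} \; \sup_{y \in \Sigma^*} |a\,g(y) + b|
  \;\le\; \inf_{a \neq 0} |a| \, \|g\|_\infty = 0 .

% Aligning the zero encoder with g: \psi \circ 0 is the constant encoder b.
d_{\mathrm{Aff}(V)}(g, 0_{\mathcal{E}_V})
  = \inf_{b \in \mathbb{R}} \; \sup_{y \in \Sigma^*} |g(y) - b|
  = \tfrac{1}{2} \Bigl( \sup_{y} g(y) - \inf_{y} g(y) \Bigr) > 0 .
```

Under these assumptions, $g$ can be transformed into the zero encoder at arbitrarily small cost, whereas the zero encoder cannot be transformed into $g$. In terms of Eq. (8b), this says that $\|g\|_{\mathrm{Aff}(V)} = 0$ for every bounded $g$.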

Remark 3.2. $d_{\mathrm{Aff}(V)}$ defined in Eq. (8a) is not a metric on $\mathcal{E}_V$.¹² Further, when $S = \mathrm{Aff}(V)$, the map $\inf_{\psi, \psi' \in \mathrm{Aff}(V)} \|\psi \circ h - \psi' \circ g\|_\infty$ is trivially zero by Cor. D.1.

⁸ A metric space is complete if every Cauchy sequence (a sequence where the distance between terms eventually becomes arbitrarily small) converges to a point within the space.
⁹ This follows immediately from the fact that $|\cdot|_V$ is a norm and from the completeness of $V$.
¹⁰ This amounts to forgetting the special role played by the zero vector.
¹¹ A left action of a group $G$ on a set $X$ is a map $\cdot\colon G \times X \to X$ such that $e \cdot x = x$ for all $x \in X$, where $e$ is the identity element of $G$, and $g_1 \cdot (g_2 \cdot x) = (g_1 g_2) \cdot x$ for all $g_1, g_2 \in G$ and $x \in X$.
¹² See App. D.2 for a derivation.

In the case of affine isometries $\mathrm{Iso}(V) \overset{\text{def}}{=} \{\psi \in \mathrm{Aff}(V) \mid \psi_{\mathrm{lin}} \in O(V)\}$, we obtain an extended pseudo-metric.

Proposition 3.1. The pair $(\mathcal{E}_V, d_{\mathrm{Iso}(V)})$ is an extended pseudo-metric space.

4 Intrinsic Affine Homotopy

The notion of affine alignment allows us to introduce homotopic relations on $\mathcal{E}_V$. We first derive the affine intrinsic preorder $\succsim_{\mathrm{Aff}}$ on the space of encoders.¹³

Lemma 4.1. Let $(X, d)$ be a hemi-metric space. The relation ($x \succsim_d y$ iff $d(x, y) = 0$) is a preorder¹⁴ and it will be called the specialization ordering of $d$.

Proof. Goubault-Larrecq [17, Proposition 6.1.8].

Definition 4.1 (Intrinsic Affine Preorder). For two encoders $h, g \in \mathcal{E}_V$, we define the relation

$$h \succsim_{\mathrm{Aff}} g \quad \text{iff} \quad d_{\mathrm{Aff}(V)}(h, g) = 0. \qquad (9)$$

Lemma 4.2. The relation $\succsim_{\mathrm{Aff}}$ is a preorder on $\mathcal{E}_V$.

Proof. Follows from $d_{\mathrm{Aff}(V)}(\psi \circ h, g) \le \|\psi_{\mathrm{lin}}\|_V \, d_{\mathrm{Aff}(V)}(h, g)$, see App. D.3.

Intuitively, $\succsim_{\mathrm{Aff}}$ captures the order of encoders such that higher-positioned encoders in the order can be S-transformed to the lower-positioned ones. To derive the implications of $\succsim_{\mathrm{Aff}}$, we introduce the notion of an encoder rank.

Definition 4.2 (Encoder Rank). For any $h \in \mathcal{E}_V$, let the encoder rank be $\mathrm{rank}(h) \overset{\text{def}}{=} \dim_{\mathbb{R}}(V_h)$, where $V_h$ is the sub-vector space generated by the image of $h$. When $\mathrm{rank}(h) = \dim_{\mathbb{R}}(V)$, $h$ is a full rank encoder, else it is rank deficient.

Theorem 4.1. For $h, g \in \mathcal{E}_V$, we have

$$h \succsim_{\mathrm{Aff}} g \iff h = \psi(\pi_h \circ g) \text{ for some } \psi \in \mathrm{Aff}(V), \qquad (10)$$

where $\pi_h$ is the orthogonal projection of $V$ onto $V_h$. In particular, if $d_{\mathrm{Aff}(V)}(h, g) = 0$, then $\mathrm{rank}(h) \le \mathrm{rank}(g)$. If, in addition, we know $\mathrm{rank}(g) = \mathrm{rank}(h)$, then $h$ must be an affine transformation of $g$, i.e., $h = \psi \circ g$ for some $\psi \in \mathrm{Aff}(V)$.

This allows us to state our first notion of language encoder similarity: intrinsic affine homotopy.

Definition 4.3 (Exact Intrinsic Affine Homotopy). We say that two encoders $h, g \in \mathcal{E}_V$ are exactly intrinsically affinely homotopic and write $h \cong_{\mathrm{Aff}} g$ if

$$d_{\mathrm{Aff}(V)}(h, g) = 0 \quad \text{and} \quad \mathrm{rank}(h) = \mathrm{rank}(g). \qquad (11)$$

For any $h, g \in \mathcal{E}_V$, one can easily show that

$$h \cong_{\mathrm{Aff}} g \iff (g \succsim_{\mathrm{Aff}} h \text{ and } h \succsim_{\mathrm{Aff}} g) \iff d^{\mathcal{H},\mathrm{sym}}_{\mathrm{Aff}(V)}(h, g) = 0, \qquad (12)$$

which implies that $\cong_{\mathrm{Aff}}$ is an equivalence relation on the set of language encoders $\mathcal{E}_V$. Intuitively, if two encoders $h$ and $g$ are exactly intrinsically affinely homotopic, then $g$ can be affinely mapped to $h$, and the other way around.

5 Extrinsic Homotopy

In §4, we explored methods for assessing how similar two language encoders are without reference to any downstream tasks. Here, we extend our discussion to the extrinsic homotopy of language encoders. Since language encoders are primarily used to generate representations for downstream tasks (such as in transfer learning, illustrated by the sentiment analysis example in §1), we argue that the key criterion in the similarity of two encoders lies in how closely we can align predictions stemming from their representations.¹⁵

¹³ All left-out proofs can be found in App. D.2.
¹⁴ A reflexive and transitive relation on $X$.
¹⁵ The proofs of all claims in this section can be found in App. D.3.

Principle 5.1 (Extrinsic Homotopy). Two language encoders h and g are extrinsically homotopic if we can guarantee a similar performance on any downstream task h and g might be used for.

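The rest of the section formalizes this intuitive notion and describes its relationship with intrinsic affine homotopy. Let $W$ be the vector space $\mathbb{R}^N$ and write $\mathrm{Aff}(V, W)$ for the set of affine maps from $V$ to $W$.¹⁶ We define $\mathcal{E}_{\Delta^{N-1}} \overset{\text{def}}{=} \mathrm{Map}(\Sigma^*, \Delta^{N-1})$ and $\mathcal{E}_W = \mathrm{Map}(\Sigma^*, W)$. Lastly, we formalize the notion of a transfer learning task as constructing a classifier that uses a language encoder's string representations. In particular, we set $\mathcal{V}_N$ to be the family of log-linear models as follows:

$$\mathcal{V}_N\colon \mathcal{E}_V \to \mathcal{P}(\mathcal{E}_{\Delta^{N-1}}) \setminus \{\emptyset\}, \quad h \mapsto \mathrm{softmax}_\lambda(\mathrm{Aff}_{V,W}(h)), \qquad (13)$$

where $\mathrm{Aff}_{V,W}$ is the map

$$\mathrm{Aff}_{V,W}\colon \mathcal{E}_V \to \mathcal{P}(\mathcal{E}_W) \setminus \{\emptyset\}, \quad h \mapsto \{\psi \circ h \mid \psi \in \mathrm{Aff}(V, W)\} \qquad (14)$$

and $\mathrm{softmax}_\lambda\colon \mathbb{R}^N \to \Delta^{N-1}$ is defined for $\lambda \in \mathbb{R}_{+}$, $x \in \mathbb{R}^N$, and $n \in [N]$ as

$$\mathrm{softmax}_\lambda(x)_n = \frac{\exp(\lambda x_n)}{\sum_{n'=1}^{N} \exp(\lambda x_{n'})}. \qquad (15)$$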
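The role of the parameter λ in Eq. (15) is easy to see numerically. The following sketch implements the formula in plain Python; the logits are arbitrary illustrative values, and the max-subtraction is a standard numerical-stability device that does not change the result.

```python
import math

def softmax_lambda(x, lam=1.0):
    """softmax_lambda(x)_n = exp(lam * x_n) / sum_n' exp(lam * x_n')."""
    scaled = [lam * v for v in x]
    shift = max(scaled)                      # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                     # illustrative values only
uniform = softmax_lambda(logits, lam=0.0)    # lam = 0 ignores the logits
soft = softmax_lambda(logits, lam=1.0)
sharp = softmax_lambda(logits, lam=10.0)     # large lam concentrates the mass
```

Larger values of λ make the output more sensitive to differences in the logits, which is one way to read the dependence of the constant c(λ) on λ in the bounds that follow.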

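Remark 5.1. Each $p_\psi(\cdot \mid y) = \mathrm{softmax}_\lambda(\psi(h(y)))$ can be seen as a probability distribution over $[N]$, so that

$$\mathcal{V}_N(h) = \{\, p_\psi(\cdot \mid y)\colon [N] \to [0, 1],\ n \mapsto \mathrm{softmax}_\lambda(\psi(h(y)))_n \mid \psi \in \mathrm{Aff}(V, W) \,\}. \qquad (16)$$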

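Through our standard recipe from Remark 3.1, we can define the following hemi-metrics on $\mathcal{E}_V$.

Definition 5.1. For any two encoders $h, g \in \mathcal{E}_V$, we define¹⁷

$$d^{\mathcal{H}}_{\mathrm{Aff}(V,W)}(h, g) \overset{\text{def}}{=} d^{\mathcal{H}}_{\infty, W}\big(\mathrm{Aff}_{V,W}(h), \mathrm{Aff}_{V,W}(g)\big) \qquad (17a)$$

$$d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) \overset{\text{def}}{=} d^{\mathcal{H}}_{\infty, \Delta^{N-1}}\big(\mathcal{V}_N(h), \mathcal{V}_N(g)\big) \qquad (17b)$$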

Notice that we use $d^{\mathcal{H}}$ rather than $d$ in Def. 5.1 since we are interested in how closely we can bring $h$ and $g$ when we affinely transform both of them; this corresponds to independently affinely transforming the encoders for the same transfer learning task. In particular, Eq. (17b) measures how different two encoders are on any transfer learning task, formalizing the notion of extrinsic homotopy (cf. Principle 5.1), captured by the following definition.

Definition 5.2 (Exact Extrinsic Homotopy). An encoder $h \in \mathcal{E}_V$ is exactly extrinsically homotopic to¹⁸ $g \in \mathcal{E}_V$ if $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) = 0$.

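Analogously to Def. 4.1, we use $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g)$ to define a preorder.

Definition 5.3 (Extrinsic Affine Preorder). For two encoders $h, g \in \mathcal{E}_V$, we define the relation

$$h \succsim_{\mathrm{Ext}} g \quad \text{iff} \quad d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) = 0. \qquad (18)$$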

Lemma 5.1. The relation $\succsim_{\mathrm{Ext}}$ is a preorder on $\mathcal{E}_V$.

We now relate $d^{\mathcal{H}}_{\mathrm{Aff}(V,W)}(h, g)$ and $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g)$ from Def. 5.1, and $d_{\mathrm{Aff}(V)}(h, g)$ from §4.

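Lemma 5.2. Let $h, g \in \mathcal{E}_V$. We have

1. There exists a constant $c(\lambda) > 0$ such that for any $\psi \in \mathrm{Aff}(V, W)$

$$d^{\mathcal{H}}_{\infty, \Delta^{N-1}}\big(\mathrm{softmax}_\lambda(\psi \circ h), \mathcal{V}_N(g)\big) \le c(\lambda)\, \|\psi_{\mathrm{lin}}\|\, d_{\mathrm{Aff}(V)}(h, g).$$

2. $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) \le c(\lambda)\, d^{\mathcal{H}}_{\mathrm{Aff}(V,W)}(h, g)$.

3. $d_{\mathrm{Aff}(V)}(h, g) = 0 \Rightarrow d^{\mathcal{H}}_{\mathrm{Aff}(V,W)}(h, g) = 0 \Rightarrow d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) = 0$.

¹⁶ Given an affine map $f\colon V \to W$, there is a unique linear map $A = f_{\mathrm{lin}} \in L(V, W)$ and $b \in W$ such that for every $v \in V$ we have $f(v) = Av + b$.
¹⁷ The subscript $\infty$ in $d_{\infty, \Delta^{N-1}}$ and $d_{\infty, W}$ is used to insist that we are considering the supremum distance $d_\infty$ in $\Delta^{N-1}$ and $W$, respectively.
¹⁸ Exact extrinsic homotopy is asymmetric.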

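Lem. 5.2 shows that $\succsim_{\mathrm{Ext}}$ is finer than $\succsim_{\mathrm{Aff}}$. This means that the affine intrinsic preorder is contained in the extrinsic preorder, i.e., $h \succsim_{\mathrm{Aff}} g \Rightarrow h \succsim_{\mathrm{Ext}} g$. Lastly, we can show that $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g)$ is upper bounded by the intrinsic hemi-metric $d^{\mathcal{H}}_{\mathrm{Aff}(V)}(h, g)$.

Theorem 5.1 ($\epsilon$-Intrinsic $\Rightarrow$ $O(\epsilon)$-Extrinsic). Let $h, g \in \mathcal{E}_V$ be two encoders. Then,

$$d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) \le c(\lambda)\, d^{\mathcal{H}}_{\mathrm{Aff}(V)}(h, g).$$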

6 Linear Alignment Methods for Finite Representation Sets

§4 and §5 introduce ways of comparing language encoders as functions, which holistically characterize relationships between them. We now address a more practical concern: Given two language encoders $h$ and $g$, how can we approximate their similarity in practice? Rather than comparing $h\colon \Sigma^* \to \mathbb{R}^D$ with $g\colon \Sigma^* \to \mathbb{R}^D$ over the entire $\Sigma^*$,¹⁹ we compare them over a finite set of strings $\mathcal{Y} = \{y^{(n)}\}_{n=1}^{N}$. We combine $\mathcal{Y}$'s representations given by $h$ and $g$ into matrices $H, G \in \mathbb{R}^{N \times D}$, where we denote $H_{y,\cdot} = h(y)$ and $G_{y,\cdot} = g(y)$. We can approximate the notions of similarity from §3 by optimizing over the affine maps $\mathrm{Aff}(V)$ (for example, using gradient descent). In particular, we approximate intrinsic similarity as

$$\hat{d}_{\mathrm{Aff}(V)}(H, G) \overset{\text{def}}{=} \inf_{\psi \in \mathrm{Aff}(V)} \max_{y \in \mathcal{Y}} \|H_{y,\cdot} - \psi(G_{y,\cdot})\|_V, \qquad (19)$$

and extrinsic similarity for some task-specific fixed $\psi'$ as

$$\hat{d}_{\psi'}(H, G) \overset{\text{def}}{=} \inf_{\psi \in \mathrm{Aff}(V,W)} \max_{y \in \mathcal{Y}} \|\mathrm{softmax}(\psi'(H_{y,\cdot})) - \mathrm{softmax}(\psi(G_{y,\cdot}))\|_W. \qquad (20)$$

Unfortunately, the max over $\mathcal{Y}$ makes the optimization in Eqs. (19) and (20) difficult. For simplicity, we turn to commonly used linear alignment methods, which we review for completeness.
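One simple way to get a feel for Eq. (19) is to swap the hard min-max problem for an ordinary least-squares affine fit and then report the maximum row-wise L2 error. This is a sketch under stated assumptions, not the optimization used in the paper: least squares minimizes the squared error rather than the max, and it does not enforce invertibility of the linear part, so the value is only a rough surrogate. The synthetic matrices below are made up to show the asymmetry between a full-rank and a rank-deficient representation.

```python
import numpy as np

def affine_fit_max_error(H, G):
    """Fit an affine map sending rows of G to rows of H by least squares,
    then return the maximum row-wise L2 error of the fit."""
    N = G.shape[0]
    G_aug = np.hstack([G, np.ones((N, 1))])        # extra column for the translation b
    W, *_ = np.linalg.lstsq(G_aug, H, rcond=None)  # W stacks the linear part and b
    residual = H - G_aug @ W
    return float(np.max(np.linalg.norm(residual, axis=1)))

rng = np.random.default_rng(0)
G = rng.normal(size=(200, 4))   # synthetic full-rank representations
H = G.copy()
H[:, 2:] = 0.0                  # rank-deficient copy: only two coordinates survive

to_H = affine_fit_max_error(H, G)  # mapping the full-rank G onto H
to_G = affine_fit_max_error(G, H)  # mapping the rank-deficient H onto G
```

In this toy setting `to_H` is numerically zero while `to_G` is not, which mirrors the rank condition of Theorem 4.1: the lower-rank representation is easy to map to but hard to map from.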

Orthogonal Procrustes Problem. Rather than optimizing the infinity norm over $\mathcal{Y}$ as in Eqs. (19) and (20), the orthogonal Procrustes problem finds the orthogonal transformation minimizing the Frobenius norm [34] by solving $\operatorname{argmin}_{A \in O(V)} \|H - AG\|_F$. Given the singular-value decomposition $H^\top G = U \Sigma V^\top$, the optimum is achieved by $U V^\top$.²⁰ Since the argmin is over $O(V)$, this defines an extended pseudo-metric space by Prop. 3.1.
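The closed-form Procrustes solution can be checked with NumPy. The sketch assumes the convention that rows of `H` and `G` are representations, so that applying the orthogonal map to every row of `G` reads `G @ A.T`; the random data and the hidden rotation `Q` are synthetic.

```python
import numpy as np

def orthogonal_procrustes(H, G):
    """Return the orthogonal matrix A = U V^T from the SVD of H^T G."""
    U, _, Vt = np.linalg.svd(H.T @ G)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.normal(size=(100, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
H = G @ Q.T                                   # each row satisfies h(y) = Q g(y)

A = orthogonal_procrustes(H, G)
error = np.linalg.norm(H - G @ A.T)           # Frobenius norm of the residual
```

Because `H` was generated by an exact orthogonal map, the recovered `A` coincides with `Q` up to floating-point error and the residual vanishes.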

Canonical Correlation Analysis (CCA). CCA [20] is a linear alignment method that finds the matrices $A, B$ that project $H$ and $G$ into subspaces maximizing their canonical correlation. Let $A_{\cdot,j}$ and $B_{\cdot,j}$ be the $j$th column vectors of $A$ and $B$, respectively. The formulation is as follows:

$$\rho_j = \sup_{A_{\cdot,j},\, B_{\cdot,j}} \mathrm{corr}(H A_{\cdot,j}, G B_{\cdot,j}) \quad \text{s.t.} \quad \forall i < j\colon\ H A_{\cdot,j} \perp H A_{\cdot,i},\ \ G B_{\cdot,j} \perp G B_{\cdot,i}. \qquad (21)$$

The representation similarity is measured in terms of the goodness of CCA fit, e.g., the mean squared CCA correlation $R^2_{\mathrm{CCA}} = \sum_{i=1}^{D} \rho_i^2 / D$. We can reformulate the CCA objective in Eq. (21) as

$$\inf_{A, B} \tfrac{1}{2} \|A^\top H - B^\top G\|_F^2 \quad \text{s.t.} \quad (A^\top H)(A^\top H)^\top = (B^\top G)(B^\top G)^\top = I. \qquad (22)$$

Given the singular-value decomposition $H^\top G = U \Sigma V^\top$, the solution of Eq. (22) is $(\hat{A}, \hat{B}) = \big((H H^\top)^{-\frac{1}{2}} U,\ (G G^\top)^{-\frac{1}{2}} V\big)$, where $(H H^\top)^{-\frac{1}{2}}$ and $(G G^\top)^{-\frac{1}{2}}$ are whitening transforms of $U$ and $V$. Assuming the data is whitened during pre-processing, CCA corresponds to linear alignment under an orthogonality constraint, equivalent to the orthogonal Procrustes problem; see also App. E.
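The mean squared CCA correlation can be sketched with a QR-based computation, which is a standard numerical route to canonical correlations and not necessarily the procedure used in the paper's experiments. It assumes that rows are samples, that there are more samples than dimensions, and that the centered matrices have full column rank; the data are synthetic.

```python
import numpy as np

def cca_correlations(H, G):
    """Canonical correlations between H and G (rows are samples)."""
    Hc = H - H.mean(axis=0)
    Gc = G - G.mean(axis=0)
    Qh, _ = np.linalg.qr(Hc)   # orthonormal (whitened) basis for the columns of Hc
    Qg, _ = np.linalg.qr(Gc)
    rho = np.linalg.svd(Qh.T @ Qg, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

def mean_squared_cca(H, G):
    """R^2_CCA: the mean of the squared canonical correlations."""
    rho = cca_correlations(H, G)
    return float(np.mean(rho ** 2))

rng = np.random.default_rng(0)
H = rng.normal(size=(200, 3))
B = rng.normal(size=(3, 3))
G_affine = H @ B + rng.normal(size=3)      # an affine image of H
G_indep = rng.normal(size=(200, 3))        # unrelated representations

r2_affine = mean_squared_cca(H, G_affine)
r2_indep = mean_squared_cca(H, G_indep)
```

The QR factors play the role of the whitening step: once both matrices are whitened, the remaining alignment is orthogonal, which is the connection to the Procrustes problem noted above.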

CCA Extensions. Projection-weighted CCA (PWCCA) [30] also finds alignment matrices with CCA but applies weighting to correlation values $\rho_i$ to report the goodness of fit. Given the canonical vectors $\hat{A}$, PWCCA reports $\rho_{\mathrm{PW}} = \sum_{i=1}^{D} \alpha_i \rho_i / \sum_i \alpha_i$, where $\alpha_i = \sum_j |\langle \hat{A}_{\cdot,i}, H_{\cdot,j} \rangle|$.²¹

¹⁹ For simplicity, we assume that $h$ and $g$ both map to $\mathbb{R}^D$.
²⁰ See App. E for the derivation.
²¹ CCA extensions beyond PWCCA are discussed in App. C.

[Figure 1 (plots omitted): grid of pairwise heatmaps over the encoders E, R, M5, M10, M15, M20, M25, with panels for Layer 4, Layer 8, and Layer 12; color scale: Max Error L2 Norm.]

Figure 1: Asymmetry between ELECTRA (E), RoBERTa (R), and MULTIBERT encoders (M1-M25) across layers. For each pair of the encoders $M^{(i)}$ and $M^{(j)}$, we generate training set embeddings $H^{(i)}, H^{(j)} \in \mathbb{R}^{N \times D}$ for SST-2, COLA, and MNLI. We then fit $H^{(i)}$ to $H^{(j)}$ with an affine map and report the goodness of fit through the max error L2 norm, i.e., an approximation of $d(H^{(j)}, H^{(i)})$, on row $i$ and column $j$ of the grid. Full results across GLUE tasks are shown in Figure 4.

Non-Alignment Methods. While not explicitly (linearly) aligning representations, CKA [22] evaluates the kernel similarity between representations. CKA computes the normalized Hilbert-Schmidt independence [18] between centered kernel matrices $K^H$ and $K^G$, where $K^H_{ij} = k(H_{i,\cdot}, H_{j,\cdot})$ and $K^G_{ij} = k(G_{i,\cdot}, G_{j,\cdot})$ for a kernel function $k$, i.e., $\mathrm{tr}(K^H K^G) / \sqrt{\mathrm{tr}(K^{H\top} K^H)\, \mathrm{tr}(K^{G\top} K^G)}$. Linear CKA, where $k(H_{i,\cdot}, H_{j,\cdot}) = H_{i,\cdot}^\top H_{j,\cdot}$, is commonly used.
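For the linear kernel, the CKA expression above can be computed from feature-space cross-products instead of the N x N kernel matrices. The sketch below uses this well-known equivalent form with column-centered features; the data and the matrices `Q` and `B` are synthetic and chosen only to illustrate which transformations leave the score unchanged.

```python
import numpy as np

def linear_cka(H, G):
    """Linear CKA between representation matrices whose rows are samples."""
    Hc = H - H.mean(axis=0)     # centering the features centers the linear kernel
    Gc = G - G.mean(axis=0)
    cross = np.linalg.norm(Gc.T @ Hc, "fro") ** 2   # equals tr(K^H K^G)
    norm_h = np.linalg.norm(Hc.T @ Hc, "fro")       # equals sqrt(tr(K^H K^H))
    norm_g = np.linalg.norm(Gc.T @ Gc, "fro")
    return float(cross / (norm_h * norm_g))

rng = np.random.default_rng(0)
H = rng.normal(size=(100, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
B = np.diag([1.0, 10.0, 0.1, 5.0])             # invertible but far from orthogonal

cka_rotated = linear_cka(H, 3.0 * H @ Q)       # rotation plus isotropic scaling
cka_stretched = linear_cka(H, H @ B)           # general invertible linear map
```

Linear CKA stays at one under rotations and isotropic scaling but drops under a general invertible linear map, so two encoders that are exactly affinely related can still receive a score below one.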

7 Experiments

We now explore the practical implications of our theoretical results. We conduct experiments on ELECTRA [6], RoBERTa [28], and the 25 MULTIBERT [35] encoders, which are architecturally identical to BERT-BASE [11] models pre-trained with different seeds. We report results on the training sets of two GLUE benchmark classification tasks: SST-2 [38] and MRPC [14]. When reporting $\hat{d}$ and $\hat{d}_{\psi'}$ from Eq. (19) and Eq. (20), we use the L2 norm for simplicity and approximate $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ by

$$\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(H, G) = \sup_{\psi' \in \mathrm{Aff}(V,W)} \inf_{\psi \in \mathrm{Aff}(V,W)} \max_{y \in \mathcal{Y}} \|\mathrm{softmax}(\psi'(H_{y,\cdot})) - \mathrm{softmax}(\psi(G_{y,\cdot}))\|_2. \qquad (23)$$

The experimental setup and compute resources are further described in App. F.

The Intrinsic Preorder of Encoders. We first investigate whether the asymmetry of $d_{\mathrm{Aff}(V)}$ is measurable in the finite alphabet encoder representations. Figure 1 shows distinct vertical lines for both tasks, indicating that there are encoders that are consistently easier to affinely map to ($\to M$). This seems to be rather independent of which encoder we map from ($M \to$). We further see that this trend is task-independent for early layers but diverges for later layers.

[Figure 2 (plots omitted): scatter panels with $\hat{d}_{\mathrm{Aff}(V)}$ on the horizontal axis and points labeled by encoder (E, R, M1-M25); two panels report ρ: 0.309 and ρ: 0.283.]

Figure 2: For ELECTRA (E), RoBERTa (R), and MULTIBERTs (M1-M25), we plot extrinsic ($\hat{d}_{\psi'}$) against intrinsic similarity ($\hat{d}_{\mathrm{Aff}(V)}$) across GLUE tasks. We group the points by how well we can map to each encoder ($\to M$), and display the median, as well as the first and third quartiles as vertical and horizontal lines. We additionally show the linear regression from $\hat{d}_{\mathrm{Aff}(V)}$ to $\hat{d}_{\psi'}$.

The Influence of Encoder Rank Deficiency. As discussed in §4, the encoder rank plays a pivotal role in affine mappability; exact affine homotopy is only achievable between equal-rank encoders.²² With this in mind, we return to our findings from Figure 1 to evaluate whether the observed differences between encoders can be attributed to a difference in measurable rank. Due to the inaccuracies of computing the rank numerically, we approximate the encoder rank using the rank to precision $\epsilon$, defined as the number of representation matrix singular values larger than some $\epsilon \in \mathbb{R}$.²³ We find statistically significant (p-value < 0.05) rank correlation with the median intrinsic distance $\hat{d}_{\mathrm{Aff}(V)}$ when mapping to the corresponding encoder for RTE (ρ = 0.312), MRPC (ρ = 0.609), and QQP (ρ = 0.389). We find no statistically significant correlations with the median distance when mapping from the corresponding encoder. This difference in encoder ranks could, therefore, partially explain the previously observed differences in affine mappability, as some encoders seem to learn lower-rank representations.

²² We provide additional experiments on the role of the encoder rank in App. G.
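The rank to precision ε can be sketched as a count of singular values above a threshold. The threshold below reads the paper's footnote on the choice of ε as the product of n = 5, the largest singular value, the machine epsilon, and max(N, D); that multiplicative reading is an assumption, although it matches the form of the default tolerance in NumPy's `matrix_rank`. The test matrices are synthetic.

```python
import numpy as np

def rank_to_precision(M, n=5):
    """Count singular values of M above eps = n * sigma_1 * machine_eps * max(N, D).

    The product form of the threshold is an assumed reading of the paper's footnote.
    """
    sigma = np.linalg.svd(M, compute_uv=False)
    eps = n * sigma[0] * np.finfo(M.dtype).eps * max(M.shape)
    return int(np.sum(sigma > eps))

rng = np.random.default_rng(0)
full = rng.normal(size=(50, 8))                          # generic full-rank matrix
low = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 8)) # rank 3 by construction

rank_full = rank_to_precision(full)
rank_low = rank_to_precision(low)
```

As the paper notes, the recovered rank depends on the chosen ε, so the multiplier `n` is exposed as a parameter and is worth varying on real representation matrices.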

A Linear Bound on Extrinsic Similarity. Lem. 5.2 derives a relationship between affine intrinsic and extrinsic similarity. To evaluate its strength in practice, we measure Spearman's Rank Correlation (ρ) and Pearson Correlation (PCC) between intrinsic measures introduced in §6 and the extrinsic measures $\hat{d}_{\psi'}$ and $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$. PCC measures the strength and direction of a linear relationship between two random variables, whereas Spearman's ρ evaluates the variables' monotonic association, which need not be linear. $\hat{d}_{\psi'}$ is computed by training a linear classifier $\psi' \in \mathrm{Aff}(V, W)$ on the final MULTIBERT layer for each task. Further, we report $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ as the maximum L2 loss for a large number of randomly generated²⁴ classifiers $\psi'$ on the final layer of each MULTIBERT encoder. We generate 100 such classifiers for a range of GLUE datasets.²⁵ Table 1 shows significant, large linear correlation prevalent in all linear alignment methods, whereas CKA, a linear non-alignment method, does not capture extrinsic behavior as faithfully. Further, Figure 2 visualizes the linear relationship explicitly for all considered GLUE datasets.
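The difference between the two correlation coefficients can be illustrated with a small pure-Python sketch. The `intrinsic` and `extrinsic` values below are made-up numbers with an exactly monotone but nonlinear relationship; they are not measurements from the paper, and the rank function assumes there are no ties.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n of the values; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

intrinsic = [0.1, 0.2, 0.3, 0.4, 0.5]                 # hypothetical distances
extrinsic = [math.exp(10 * v) for v in intrinsic]     # monotone, strongly nonlinear

rho = spearman(intrinsic, extrinsic)
pcc = pearson(intrinsic, extrinsic)
```

On these values Spearman's ρ is one because the ordering is preserved, while the PCC is noticeably below one because the relationship is not linear. A linear bound such as the one in Lem. 5.2 is therefore more directly reflected in the PCC.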

8 Discussion

We set out to explore homotopic relationships between language encoders, augmenting existing work on the similarity of finite representation sets by holistically studying encoder functions. In particular, the general framework of S-homotopy allows us to study any functional relationship between encoders, enabling the exploration of many types of encoder relationships. As a first step in this direction and a concrete example, 4 explores affine homotopy, discussing what it means to be able to align two models with affine transformations. Here, Hausdorff Hoare maps prove useful, as they allow us to measure a notion of (asymmetric) distance between a point an encoder and the set of all affine transformations of another encoder. Lem. 5.2 in 5 then connects the intrinsic,

23Following Press et al. [31], we choose ϵ as n σ1 ϵp maxp N, Dq for n 5, where σ1 is the largest singular value of the N ˆ D representation matrix and ϵp the float machine epsilon. We note that the rank to precision ϵ and the recovered correlation may depend on the chosen ϵ. 24The generation process is described in App. F. 25The computational expense of computing ˆd H

Vp V, q restricts this analysis to a limited set of classifiers, depending on the alphabet size. See App. B for a discussion.

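| Dataset | Extrinsic measure | Corr. | $\hat{d}_{\mathrm{Aff}(V)}$ | Orth. Procrustes | $R^2_{\mathrm{CCA}}$ | PWCCA | Linear CKA |
|---|---|---|---|---|---|---|---|
| Lin. alignment-based | | | Yes | Yes | Yes | Yes | No |
| SST-2 | $\hat{d}_{\mathrm{Aff}(V)}$ | ρ | 0.080 | 0.095 | 0.172* | 0.016 | 0.088 |
| | | PCC | 0.545* | 0.937* | 0.932* | 0.970* | 0.231* |
| | $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ | ρ | 0.621* | 0.157* | 0.071 | 0.231* | 0.295* |
| | | PCC | 0.723* | 0.539* | 0.457* | 0.566* | 0.320* |
| MRPC | $\hat{d}_{\psi'}$ | ρ | 0.309* | 0.250* | -0.001 | 0.220* | 0.214* |
| | | PCC | 0.707* | 0.697* | 0.733* | 0.743* | 0.241* |
| | $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ | ρ | 0.231* | 0.025* | 0.178* | 0.059 | 0.030 |
| | | PCC | 0.790* | 0.755* | 0.879* | 0.875* | 0.174* |
| RTE | $\hat{d}_{\psi'}$ | ρ | 0.534* | 0.053 | 0.037 | 0.308* | 0.185* |
| | | PCC | 0.570* | 0.401* | 0.429* | 0.250* | 0.078 |
| | $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ | ρ | 0.234* | 0.317* | -0.147* | 0.338* | 0.240* |
| | | PCC | 0.718* | 0.870 | 0.778 | 0.780 | 0.205 |
| CoLA | $\hat{d}_{\psi'}$ | ρ | 0.196* | 0.006 | 0.040 | 0.185* | 0.165* |
| | | PCC | 0.204* | 0.529* | 0.553* | 0.550* | 0.215* |
| | $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ | ρ | 0.348* | 0.078 | 0.133* | 0.340* | 0.380* |
| | | PCC | 0.429* | 0.664* | 0.318* | 0.786* | 0.513* |

Table 1: Spearman's Rank Correlation Coefficient (ρ) and Pearson's Correlation Coefficient (PCC) between the intrinsic measures introduced in §6 and the extrinsic similarities $\hat{d}_{\psi'}$ and $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ across various GLUE datasets. * indicates a p-value < 0.01 (assuming independence).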
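To make the two statistics in Tab. 1 concrete, the following sketch computes ρ and PCC with plain NumPy. The arrays `intrinsic` and `extrinsic` are toy values chosen for illustration, not entries of Tab. 1, and no p-values are computed here.

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation coefficient (PCC)."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(x):
    """Average ranks; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        mask = x == v
        r[mask] = r[mask].mean()
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

intrinsic = np.array([0.1, 0.4, 0.2, 0.9, 0.5])   # toy intrinsic distances
extrinsic = intrinsic ** 3                        # monotone but non-linear
rho, pcc = spearman(intrinsic, extrinsic), pearson(intrinsic, extrinsic)
```

A monotone but non-linear relationship yields ρ = 1 while PCC stays below 1, which is one reason the two columns of Tab. 1 can disagree.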

task-independent similarity to extrinsic similarity, i.e., the similarity of performance on downstream tasks. Concretely, it derives a linear relationship between the intrinsic and extrinsic dissimilarity for any fixed affine transformation $\psi'$ (i.e., a fixed downstream task). Thm. 5.1 discusses a stronger bound, namely on the worst-case extrinsic dissimilarity among all downstream linear classifiers, i.e., among all possible tasks. Further, by accounting for the asymmetries of encoder relationships, we augment the work on similarity in proper metric spaces [3, 36, 42].

Although encoders may not be affinely related in practice, empirical evidence in §7 suggests that notions of affine order still surface (cf. Tab. 1, Fig. 2), particularly as differently initialized BERTs exhibit variations in downstream task performance [29]. While other similarity measures, such as those used in seed specificity tests [12], are designed to remain invariant to initialization changes, our results indicate that intrinsic affine homotopy is appropriately sensitive to them. This sensitivity raises new questions about the landscape of pre-trained encoders; as seen in Fig. 1, asymmetry in intrinsic affine similarity among similarly pre-trained encoders impacts downstream performance, as corroborated by Lem. 5.2 and the empirical results in Tab. 1. Differences in representation ranks may partly explain this asymmetry; mapping between artificially generated rank-deficient encoders yields mostly symmetric affine distances (cf. Fig. 3). Another explanation might be that easy-to-learn encoders are approximately linear combinations of others, making them easy to map to but not necessarily from. Overall, our findings highlight the need to account for directionality in encoder similarity measures to address the asymmetry inherent in this problem.

9 Conclusion

We discuss the structure of the space of language encoders in the framework of S-homotopy, the notion of aligning encoders with a chosen set of functions. We formalize affine alignment between encoders and show that it provides upper bounds on the differences in performance on downstream tasks. Experiments show our notion of intrinsic affine homotopy to be consistently predictive of downstream task behavior while revealing an asymmetric order in the space of encoders.

Broader Impact

This paper presents foundational research about the similarity of language encoders. To the best of our knowledge, there are no ethical or negative societal implications to this work.

Acknowledgements

Ryan Cotterell acknowledges support from the Swiss National Science Foundation (SNSF) as part of the "The Forgotten Role of Inductive Bias in Interpretability" project. Anej Svete is supported by the ETH AI Center Doctoral Fellowship. Robin Chan acknowledges support from FYAYC. We thank Raphaël Baur and Furui Cheng for helpful discussions and reviews of the current manuscript.

References

[1] Yamini Bansal, Preetum Nakkiran, and Boaz Barak. 2021. Revisiting model stitching to compare neural representations. In Advances in Neural Information Processing Systems, volume 34, pages 225–236. Curran Associates, Inc.

[2] Saaid Baraty, Dan A. Simovici, and Catalin Zara. 2011. The impact of triangular inequality violations on medoid-based clustering. In Foundations of Intelligent Systems, pages 280–289, Berlin, Heidelberg. Springer Berlin Heidelberg.

[3] Enric Boix-Adsera, Hannah Lawrence, George Stepaniants, and Philippe Rigollet. 2022. GULP: A prediction-based metric between representations. In Advances in Neural Information Processing Systems, volume 35, pages 7115–7127. Curran Associates, Inc.

[4] Dmitri Burago, Yuri Burago, and Sergei Ivanov. 2001. A Course in Metric Geometry. American Mathematical Society, Providence, RI.

[5] C. Chang, W. Liao, Y. Chen, and L. Liou. 2016. A mathematical theory for clustering in metric spaces. IEEE Transactions on Network Science and Engineering, 3(01):2–16.

[6] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.

[7] Ryan Cotterell, Anej Svete, Clara Meister, Tianyu Liu, and Li Du. 2023. Formal aspects of language modeling. arXiv preprint arXiv:2311.04329.

[8] Adrián Csiszárik, Péter Kőrösi-Szabó, Ákos Matszangosz, Gergely Papp, and Dániel Varga. 2021. Similarity and matching of neural network representations. In Advances in Neural Information Processing Systems, volume 34, pages 5656–5668. Curran Associates, Inc.

[9] Alexander D'Amour, Katherine A. Heller, Dan I. Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yi-An Ma, Cory Y. McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin G. Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, and D. Sculley. 2020. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 23:226:1–226:61.

[10] Sanjoy Dasgupta and Philip M. Long. 2005. Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4):555–569. Special Issue on COLT 2002.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

[12] Frances Ding, Jean-Stanislas Denain, and Jacob Steinhardt. 2021. Grounding representation similarity through statistical testing. In Advances in Neural Information Processing Systems, volume 34, pages 1556–1568. Curran Associates, Inc.

[13] Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.

[14] William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

[15] Li Du, Lucas Torroba Hennigen, Tiago Pimentel, Clara Meister, Jason Eisner, and Ryan Cotterell. 2023. A measure-theoretic characterization of tight language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9744–9770, Toronto, Canada. Association for Computational Linguistics.

[16] Bolin Gao and Lacra Pavel. 2017. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805.

[17] Jean Goubault-Larrecq. 2013. Non-Hausdorff Topology and Domain Theory: Selected Topics in Point-Set Topology. New Mathematical Monographs. Cambridge University Press.

[18] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77, Berlin, Heidelberg. Springer Berlin Heidelberg.

[19] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501, Berlin, Germany. Association for Computational Linguistics.

[20] David Roi Hardoon, Sándor Szedmák, and John Shawe-Taylor. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664.

[21] Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. 2023. Similarity of neural network models: A survey of functional and representational measures. arXiv preprint arXiv:2305.06329.

[22] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519–3529. PMLR.

[23] Nikolaus Kriegeskorte, Marieke Mur, and Peter Bandettini. 2008. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2.

[24] Serge Lang. 2002. Algebra. Springer New York.

[25] F. W. Lawvere. 2002. Metric spaces, generalized logic, and closed categories. Theory and Applications of Categories, No. 1 (2002), pages 1–37.

[26] Karel Lenc and Andrea Vedaldi. 2019. Understanding image representations by measuring their equivariance and equivalence. International Journal of Computer Vision, 127(5):456–476.

[27] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. 2015. Convergent learning: Do different neural networks learn the same representations? In Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, volume 44 of Proceedings of Machine Learning Research, pages 196–212, Montreal, Canada. PMLR.

[28] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

[29] R. Thomas McCoy, Junghyun Min, and Tal Linzen. 2020. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 217–227, Online. Association for Computational Linguistics.

[30] Ari Morcos, Maithra Raghu, and Samy Bengio. 2018. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

[31] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. 2007. Numerical Recipes: The Art of Scientific Computing, 3rd edition. Cambridge University Press.

[32] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

[33] Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, and Ryan Cotterell. 2023. All roads lead to Rome? Exploring the invariance of transformers' representations. arXiv preprint arXiv:2305.14555.

[34] Peter H. Schönemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10.

[35] Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan Das, et al. 2022. The MultiBERTs: BERT reproductions for robustness analysis. In International Conference on Learning Representations. OpenReview.net.

[36] Mahdiyar Shahbazi, Ali Shirali, Hamid Aghajan, and Hamed Nili. 2021. Using distance on the Riemannian manifold to compare representations in brain and in models. NeuroImage, 239:118271.

[37] C.G. Small. 2012. The Statistical Theory of Shape. Springer Series in Statistics. Springer New York.

[38] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

[39] Hrishikesh D. Vinod. 1976. Canonical ridge and econometrics of joint production. Journal of Econometrics, 4:147–166.

[40] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

[41] Fei Wang and Jimeng Sun. 2015. Survey on distance metric learning and dimensionality reduction in data mining. Data Mining and Knowledge Discovery, 29(2):534–564.

[42] Alex H. Williams, Erin Kunz, Simon Kornblith, and Scott Linderman. 2021. Generalized shape metrics on neural representations. In Advances in Neural Information Processing Systems, volume 34, pages 4738–4750. Curran Associates, Inc.

[43] Peter N. Yianilos. 1993. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '93, pages 311–321, USA. Society for Industrial and Applied Mathematics.

[44] Ruiqi Zhong, Dhruba Ghosh, Dan Klein, and Jacob Steinhardt. 2021. Are larger pretrained language models uniformly better? Comparing performance at the instance level. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3813–3827, Online. Association for Computational Linguistics.

| Symbol | Meaning | Introduced |
|---|---|---|
| $D$ | Size of string representations. | §1 |
| $V$ | $D$-dimensional $\mathbb{R}$-vector space. | §2 |
| $\overline{\mathbb{R}}_+$ | $\mathbb{R}_+ \cup \{\infty\}$ | Def. 3.1 |
| $[N] \subset \mathbb{N}$ | The set $\{1, \ldots, N\}$ for $N \in \mathbb{N}$. | §5 |
| $\Delta^{N-1}$ | The $(N-1)$-dimensional probability simplex. | §1 |
| $\mathcal{E}_V$ | The vector space of language encoders. | §2 |
| $\mathcal{E}_b$ | The subspace of bounded language encoders. | §2 |
| $\mathrm{Aff}(V)$ | The set of (invertible) affine transformations on $V$. | §3.2 |
| $\mathrm{GL}(V)$ | The group of all invertible linear maps on $V$ to itself. | §3.2 |
| $\mathrm{O}(V)$ | The orthogonal group of $V$; the group of norm-preserving linear maps on $V$. | §3.3 |
| $d$ | A general (hemi-)metric. | Def. 3.2 |
| $d_\infty$ | The $\infty$ (uniform convergence) norm on $\mathcal{E}_V$. | §3.2 |
| $d^{\mathcal{H}}$ | The Hausdorff–Hoare map of $d$. | Def. 3.3 |
| $d^{\mathcal{H},\mathrm{sym}}$ | The symmetrized Hausdorff–Hoare map. | Def. 3.3 |
| $d^{\mathcal{H}}_S$ | The Hausdorff–Hoare map where the sets in the arguments are computed by applying all transformations $s \in S$ to the two input elements. | |
| $d_S$ | One-sided affine alignment measure over $S$. | Eq. (8a) |
| $\|\cdot\|_S$ | The $S$-homotopy norm of an encoder. | Eq. (8a) |
| $\succsim$ | A preorder. | §4 & §5 |
| | An equivalence relation. | §4 |

Table 2: A summary of notation used in the paper.

B Limitations

In this section, we address some of our work's limitations.

Non-Linear Encoder Relationships. This work focuses on affine similarity between encoders. As we find and discuss in §4 with the example of MULTIBERTs, language encoder representations may generally not be exactly affinely related. Nevertheless, understanding the affine-homotopy relationships on $\mathcal{E}_V$ still helps us draw conclusions about practical findings, as in §7.

Linear Classifiers. Our work provides precise theoretical guarantees on the performance of linear classifiers applied to affinely-related encoders. In practice, task fine-tuning can take the form of more complex models, such as re-training entire pre-trained models. This work does not cover such more complex fine-tuning techniques.

Numerical Approximations. To bridge our theoretical findings on affine homotopy relationships in $\mathcal{E}_V$ with their practical implementations in §7, we concede several approximations. For instance, while $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}$ is valuable in analysis, optimizing Eq. (23) directly is computationally challenging and requires costly approximations. Similarly, in computing intrinsic distances across all representation layers in Fig. 1, we optimize for mean squared error (MSE) and evaluate the maximum loss instead of optimizing for it directly, which serves as an approximation of $d$ that results in more stable optimization given computational constraints. Finally, we address numerical inaccuracies encountered during singular value decomposition (SVD) computations in §7, which we mitigate by tuning the rank according to precision $\epsilon$.

C Additional Related Work

In this section, we complement our discussion in §6 and §8 with additional related work.

Representational and Functional Similarity. Our work is related to the ongoing efforts to quantify the similarity between neural networks. Much related work discusses similarity measures in terms of the invariance properties of neural networks [12, 22, inter alia]; see Klabunde et al. [21] for a recent comprehensive survey. Notably, Klabunde et al. [21] compile various representational [19, 23, 27, 32, 39, inter alia] and functional ways to measure similarity, which are related to our notions of intrinsic and extrinsic homotopy, respectively. Whereas our notion of intrinsic affine homotopy fits into the class of linear alignment-based measures [12, 27, 42, inter alia] as described in §6, the notion of extrinsic similarity fits into the broader category of performance-based functional measures [1, 8, 26]. Most relevantly, Boix-Adsera et al. [3] propose the GULP metric that provides a bound on the expected prediction dissimilarity for norm-one-bounded ridge regression.

Similarity Measures as Metrics. A line of work draws from statistical shape analysis [37] to motivate the development of similarity measures that conform to the axioms of valid metrics [3, 36, 42]. Learning within proper metric spaces provides certain theoretical guarantees [2, 5, 10, 41, 43]. For example, Williams et al. [42] derive two families of generalized shape metrics, modifying existing dissimilarity measures to ensure they meet metric criteria. Notably, one of these generalized shape metrics is based on linear regression over the group of linear isometries, similar to the approach derived for encoder maps in Prop. 3.1.

Understanding Similarity of Language Encoders. Finally, several previous works characterize the landscape of language encoders and their sensitivity to slight changes to the pre-training or fine-tuning procedure [9, 13, 44]. This prompted multi-seed releases of encoders such as BERT [35, 44] that are frequently used for robustness or sensitivity analysis [12, 33], similar to the one presented in this work.

D Addenda on Affine Homotopy

In this section, we provide additional derivations and proofs complementing the discussion in §3–§5.

D.1 Preliminaries on Hemi-Metric Spaces

Definition D.1. Let $(X, d)$ be a hemi-metric space. The open ball $B(x, \epsilon)$ of center $x$ and radius $\epsilon > 0$ is the set $\{y \in X \mid d(x, y) < \epsilon\}$. The open balls form a base for the open ball topology.$^{26}$

Lemma 4.1. Let $(X, d)$ be a hemi-metric space. The relation $\succsim_d$, defined by $x \succsim_d y$ iff $d(x, y) = 0$, is a preorder$^{27}$ and is called the specialization ordering of $d$.

Example D.1. An example of a specialization ordering is the prefix ordering of strings $\leq_{\mathrm{prefix}}$.$^{28}$ More precisely, for any $y, y' \in \Sigma^*$, we define $d_{\Sigma^*}(y, y')$ to be zero if $y$ is a prefix of $y'$ and $2^{-n}$ otherwise, where $n$ is the length of the longest prefix of $y$ that is also a prefix of $y'$. Then $(\Sigma^*, d_{\Sigma^*})$ is a hemi-metric space whose specialization ordering is $\leq_{\mathrm{prefix}}$.
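Example D.1 can be checked mechanically on a small alphabet. The sketch below is illustrative only; the function name `d_prefix` and the two-letter alphabet are assumptions made here. It implements $d_{\Sigma^*}$ and exhaustively verifies the triangle inequality on all strings of length at most three.

```python
from itertools import product

def d_prefix(y, y2):
    """0 if y is a prefix of y2, else 2**-n with n the longest common prefix length."""
    if y2.startswith(y):
        return 0.0
    n = 0
    for a, b in zip(y, y2):
        if a != b:
            break
        n += 1
    return 2.0 ** -n

# All strings over {a, b} of length 0..3 (15 strings in total).
strings = ["".join(p) for k in range(4) for p in product("ab", repeat=k)]
triangle_ok = all(
    d_prefix(x, z) <= d_prefix(x, y) + d_prefix(y, z)
    for x in strings for y in strings for z in strings
)
```

The map is not symmetric: `d_prefix("ab", "abb")` vanishes because "ab" is a prefix of "abb", whereas the reverse direction is strictly positive.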

Lemma D.1. Let $(X, d)$ be a hemi-metric space.

1. The set $\{x \in X \mid d^{\mathcal{H}}(x, E) = 0\}$ is exactly the closure of $E$ in the open ball topology.

2. For any $x, x' \in X$, we have the inequality $d^{\mathcal{H}}(x, E) \leq d(x, x') + d^{\mathcal{H}}(x', E)$. If $d$ is a metric, then $d^{\mathcal{H}}(\cdot, E)$ is 1-Lipschitz from $(X, d)$ to $\overline{\mathbb{R}}_+$.

3. Let $Z \subset \mathcal{P}(X)$ be any space of non-empty subsets of $X$. The Hausdorff–Hoare map $d^{\mathcal{H}}$ is a hemi-metric on $Z$. Its specialization ordering $\succsim_{d^{\mathcal{H}}}$ is given by $E \succsim_{d^{\mathcal{H}}} E'$ iff$^{29}$ $E \subset \mathrm{cl}(E')$, iff $\mathrm{cl}(E) \subset \mathrm{cl}(E')$.

Proof. See Goubault-Larrecq [17, Lemma 6.1.11, Proposition 6.2.16 & Lemma 7.5.1].

$^{26}$ This, by definition, is the topology generated by all open balls.

$^{27}$ A reflexive and transitive relation on $X$.

$^{28}$ Defined by $y \leq_{\mathrm{prefix}} y'$ if $y$ is a prefix of $y'$.

$^{29}$ Here, the closure is with respect to the topology defined by $d$.

D.2 Additional Derivations: Affine Alignment Measures

Remark 3.2. $d_{\mathrm{Aff}(V)}$ defined in Eq. (8a) is not a metric on $\mathcal{E}_V$.$^{30}$ Further, when $S = \mathrm{Aff}(V)$, the map $\inf_{\psi, \psi' \in \mathrm{Aff}(V)} \|\psi \circ h - \psi' \circ g\|_\infty$ is trivially zero by Cor. D.1.

Proof. To see that $d_{\mathrm{Aff}(V)}$ is not a metric, consider the following two encoders: $g(y) = |y| \cdot e$, where $e \in V$ is any fixed vector, and $h$, any map from $\Sigma^*$ to the ball $B(0_V, 1)$ of radius one. In this case, we have $d_{\mathrm{Aff}(V)}(h, g) = \infty$. Even on the space of bounded encoders,$^{31}$ $d_{\mathrm{Aff}(V)}$ is not a metric. We provide the following counter-example: Let $h$ be any rank-$R$ encoder, e.g., $h$ can be any map that sends the first $R$ strings to the basis of $V$. Let $A$ be a non-invertible linear map of $V$ and set $g = A(h)$. Then clearly $d_{\mathrm{Aff}(V)}(h, g) = 0$, but $d_{\mathrm{Aff}(V)}(g, h)$ cannot be zero for dimensionality reasons (see Thm. 4.1).

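Lemma D.2 (Hausdorff Distance). Let $E, E' \subset \mathcal{E}_V$. The map

$$d^{\mathcal{H},\mathrm{sym}}_\infty(E, E') \overset{\mathrm{def}}{=} \max\left(d^{\mathcal{H}}_\infty(E, E'),\; d^{\mathcal{H}}_\infty(E', E)\right) = \sup_{h \in \mathcal{E}_V} \left|d^{\mathcal{H}}_\infty(h, E) - d^{\mathcal{H}}_\infty(h, E')\right| \tag{24}$$

is an extended pseudo-metric on $\mathcal{P}(\mathcal{E}_V) \setminus \{\emptyset\}$.

Proof. It follows readily from Lem. D.1. See also Burago et al. [4, §7.3.1].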
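For intuition, the one-sided and symmetrized maps can be evaluated on finite point sets. The sketch below assumes that Def. 3.3 takes the usual sup-inf form, $d^{\mathcal{H}}(E, E') = \sup_{x \in E} \inf_{y \in E'} d(x, y)$, and replaces encoder sets under $d_\infty$ with finite subsets of Euclidean space; the function names are hypothetical.

```python
import numpy as np

def hausdorff_hoare(E, E2):
    """One-sided map: sup over x in E of inf over y in E2 of ||x - y||_2."""
    dists = np.linalg.norm(E[:, None, :] - E2[None, :, :], axis=-1)
    return float(dists.min(axis=1).max())

def hausdorff_sym(E, E2):
    """Symmetrized map: maximum of the two one-sided values."""
    return max(hausdorff_hoare(E, E2), hausdorff_hoare(E2, E))

E = np.array([[0.0, 0.0], [1.0, 0.0]])
E_big = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])   # superset of E
```

Because `E` is contained in `E_big`, the one-sided value from `E` to `E_big` is zero while the reverse is not, mirroring the specialization ordering of Lem. D.1.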

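For any affine subgroup $S \subset \mathrm{Aff}(V)$, let $S(h) \overset{\mathrm{def}}{=} \{\psi(h) \mid \psi \in S\}$. It then follows immediately from Lem. D.2 that the map $d^{\mathcal{H},\mathrm{sym}}_S(h, g) \overset{\mathrm{def}}{=} d^{\mathcal{H},\mathrm{sym}}_\infty(S(h), S(g))$ is an extended pseudo-metric on $\mathcal{E}_V$.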

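Lemma D.3. For any $h, g \in \mathcal{E}_V$, any $\psi \in \mathrm{Iso}(V)$ and any non-empty $S \subset \mathcal{E}_V$, we have

$$d_S(\psi \circ h, g) = d_{\psi^{-1} S}(h, g). \tag{25}$$

In particular, $d_{\mathrm{Iso}(V)}(\psi \circ h, g) = d_{\mathrm{Iso}(V)}(h, g)$.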

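Proof. Lem. D.3 follows from the identity $d_\infty(\psi \circ h, \psi' \circ g) = d_\infty(h, \psi^{-1} \circ \psi' \circ g)$, which holds for any $\psi' \in S$ because $\psi$ is an isometry.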

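Proposition 3.1. The pair $(\mathcal{E}_V, d_{\mathrm{Iso}(V)})$ is an extended pseudo-metric space.

Proof. Using Lem. D.3, one can show that $d_{\mathrm{Iso}(V)}(h, g) = d^{\mathcal{H},\mathrm{sym}}_\infty(\mathrm{Iso}(h), \mathrm{Iso}(g))$, where $\mathrm{Iso}(h) \overset{\mathrm{def}}{=} \{\psi \circ h : \psi \in \mathrm{Iso}(V)\}$. The proposition then follows from Lem. D.2.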

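Lemma D.4. For any $\psi \in \mathrm{Aff}(V)$ and any $h \in \mathcal{E}_V$, we have

$$\psi \circ h \in \mathcal{E}_b \iff h \in \mathcal{E}_b \iff \|h\|_\infty < \infty.$$

1. If $h \in \mathcal{E}_b$, then

$$\|h\|_{\mathrm{Iso}(V)} = \|h\|_T = r_h,$$

where $r_h$ denotes the radius of $h$, which we define as the radius of the minimum enclosing ball of the set $h(\Sigma^*)$, and the $\|\cdot\|_{\mathrm{Iso}(V)}$ norm is defined as in Eq. (8).

2. Let $\psi \in \mathrm{Aff}(V)$ and let $S \subset \mathrm{Aff}(V)$ be a subset normalized$^{32}$ by $\psi$ and containing $T$. Then

$$\|\psi \circ h\|_S \leq \|\psi_{\mathrm{lin}}\|_V \, \|h\|_S,$$

where the $\|\cdot\|_S$ norm is defined as in Eq. (8).

$^{30}$ See App. D.2 for a derivation.

$^{31}$ Recall $\mathcal{E}_b \overset{\mathrm{def}}{=} \{h \in \mathcal{E}_V \mid h(\Sigma^*) \text{ is bounded}\}$.

$^{32}$ The set $S$ is normalized by $\psi$ if $\psi^{-1} \circ \phi \circ \psi \in S$ for all $\phi \in S$.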

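Proof. 1. Let $t \in T$ be the translation moving the center of the ball enclosing $h(\Sigma^*)$ to the center $0_V$. Hence,

$$\|h\|_{\mathrm{Iso}(V)} \leq \|h\|_T \leq \|t \circ h\|_\infty = r_h.$$

Now observe that for any other isometry $\psi \neq t$, we have $r_{\psi \circ h} = r_h$. The ball $B(0_V, \|\psi \circ h\|_\infty)$ clearly contains all points in $\psi \circ h(\Sigma^*)$; hence, by definition of the radius $r_{\psi \circ h}$ as that of the minimum enclosing ball, we must have $\|\psi \circ h\|_\infty \geq r_h$, which finishes the proof of 1.

2. Write $\psi = \psi_{\mathrm{lin}} \circ t$, with $t \in T$. We then have

$$\|\psi \circ h\|_S = \inf_{\phi \in S} \|\phi(\psi \circ h)\|_\infty = \inf_{\phi \in S} \left\|\psi_{\mathrm{lin}}\left((\psi_{\mathrm{lin}}^{-1} \circ \phi \circ \psi_{\mathrm{lin}} \circ t) \circ h\right)\right\|_\infty.$$

Note that $\phi \mapsto \psi_{\mathrm{lin}}^{-1} \circ \phi \circ \psi_{\mathrm{lin}} \circ t$ is by definition a bijection of $S$, hence

$$\begin{aligned}
\|\psi \circ h\|_S &= \inf_{\phi \in S} \|\psi_{\mathrm{lin}}(\phi \circ h)\|_\infty \\
&= \inf_{\phi \in S} \sup_{y \in \Sigma^*} |\psi_{\mathrm{lin}}((\phi \circ h)(y))|_V \\
&\leq \|\psi_{\mathrm{lin}}\|_V \inf_{\phi \in S} \sup_{y \in \Sigma^*} |(\phi \circ h)(y)|_V \\
&= \|\psi_{\mathrm{lin}}\|_V \, \|h\|_S.
\end{aligned}$$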

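Corollary D.1. Let $S \supset T$ be such that $\inf_{\psi \in S} \|\psi_{\mathrm{lin}}\|_V = 0$.$^{33}$ Then, $d_S(h, g) \overset{\mathrm{def}}{=} \inf_{\psi, \psi' \in S} \|\psi \circ h - \psi' \circ g\|_\infty = 0$ for all $h, g \in \mathcal{E}_b$.

Proof. Note that $d_S(\psi \circ h, g) \leq \|\psi_{\mathrm{lin}}\|_V \, d_S(h, g)$, which follows from Lem. D.4. Hence

$$d_S(h, g) = \inf_{\psi \in S} d_S(\psi \circ h, g) \leq \underbrace{\inf_{\psi \in S} \|\psi_{\mathrm{lin}}\|_V}_{= 0} \, d_S(h, g) = 0.$$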

D.3 Proofs: Intrinsic Affine Homotopy

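Lemma 4.2. The relation $\succsim_{\mathrm{Aff}}$ is a preorder on $\mathcal{E}_V$.

Proof. Since $d_{\mathrm{Aff}(V)}(\psi \circ h, g) \leq \|\psi_{\mathrm{lin}}\|_V \, d_{\mathrm{Aff}(V)}(h, g)$ (see Lem. D.4),

$$d_{\mathrm{Aff}(V)}(h, g) = 0 \iff d^{\mathcal{H}}_{\mathrm{Aff}(V)}(h, g) = 0.$$

Therefore, the relation $\succsim_{\mathrm{Aff}}$ is the specialization ordering of the hemi-metric $d^{\mathcal{H}}_{\mathrm{Aff}(V)}$.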

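Theorem 4.1. For $h, g \in \mathcal{E}_V$, we have

$$h \succsim_{\mathrm{Aff}} g \iff h = \psi(\pi_h \circ g) \text{ for some } \psi \in \mathrm{Aff}(V), \tag{10}$$

where $\pi_h$ is the orthogonal projection of $V$ onto $V_h$. In particular, if $d_{\mathrm{Aff}(V)}(h, g) = 0$, then $\mathrm{rank}(h) \leq \mathrm{rank}(g)$. If, in addition, we know that $\mathrm{rank}(g) = \mathrm{rank}(h)$, then $h$ must be an affine transformation of $g$, i.e., $h = \psi \circ g$ for some $\psi \in \mathrm{Aff}(V)$.

Proof. 1. Recall from §3.2 that $\mathcal{E}_V$ is complete with respect to the metric $d_\infty$. The condition $d_{\mathrm{Aff}(V)}(h, g) = 0$ simply means that there exist $\phi_n \in \mathrm{Aff}(V)$ such that $\lim_{n \to \infty} \phi_n \circ g = h$ in $\mathcal{E}_V$; in other words, $h \in \overline{\mathrm{Aff}(g)}$, i.e., $h$ lies in the closure of $\mathrm{Aff}(g)$ in $\mathcal{E}_V$.

$^{33}$ This is, for example, the case if $S$ is a group and there exists $\phi \in S$ such that $\|\phi_{\mathrm{lin}}\|_V < 1$, e.g., $S = \mathrm{Aff}(F)$.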

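Let $B_h \subset \Sigma^*$ be such that $h(B_h)$ is a basis for $V_h$. Therefore, there exists $\epsilon > 0$ such that any family$^{34}$ in

$$\prod_{y \in B_h} B(h(y), \epsilon)$$

has rank $\dim_{\mathbb{R}}(V_h)$. This shows that there exists $N \geq 1$ such that for any $n \geq N$ one has $\|h - \phi_n \circ g\|_\infty < \epsilon$, and

$$\mathrm{rank}(\{\phi_n \circ g(y) : y \in B_h\}) = \mathrm{rank}(h),$$

which implies in particular

$$\dim_{\mathbb{R}}(V_h) \leq \dim_{\mathbb{R}}(V_g), \quad \text{i.e.,} \quad \mathrm{rank}(h) \leq \mathrm{rank}(g). \tag{26}$$

2. If $\mathrm{rank}(g) = \mathrm{rank}(h) = D$, then $\lim_{n \to \infty} \phi_n = \phi$, where $\phi$ is the affine map given by $g(y) \mapsto h(y)$ for $y \in B_h$. Indeed, for any $v = \sum_{b \in B_h} \lambda_b b \in V$, we have

$$\|(\phi - \phi_n)(v)\| \leq \|h - \phi_n \circ g\|_\infty \sum_{b \in h(B_h)} |\lambda_b| \leq c \, \|h - \phi_n \circ g\|_\infty \, \|v\|_V$$

for some constant $c > 0$, since all norms on $V$ are equivalent. Hence, $\lim_{n \to \infty} \|\phi - \phi_n\|_V = 0$, which shows the claim. Accordingly, we must have $\phi \circ g = h$.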

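Now we can prove Eq. (10):

3 ⇒. Given that $\|h - \pi_h \circ \phi_n \circ g\|_\infty \leq \|h - \phi_n \circ g\|_\infty$, we also have $\lim_{n \to \infty} \pi_h \circ \phi_n \circ g = h$.

Write $\pi_h^{\perp}$ for the orthogonal projection on $V_h^{\perp}$ and set $\pi_{h,n} = \pi_h + \frac{1}{n \|\phi_n\|} \pi_h^{\perp}$. Note that $\lim_{n \to \infty} \pi_{h,n} = \pi_h$. Accordingly,

$$\lim_{n \to \infty} \psi_n(\pi_h \circ g) = h,$$

where $\psi_n = \pi_{h,n} \phi_n \pi_{h,n}^{-1}$. From this, we deduce that

$$d_{\mathrm{Aff}(V)}(h, \pi_h \circ g) = 0.$$

Now applying 2. yields $h = \phi_h(\pi_h \circ g)$ for some $\phi_h \in \mathrm{Aff}(V_h)$, or $h = \phi(\pi_h \circ g)$ where $\phi = \phi_h \oplus \pi_h^{\perp} \in \mathrm{Aff}(V)$.

3 ⇐. Assume now that $h = \phi(\pi_h \circ g)$ for some $\phi \in \mathrm{Aff}(V)$. Then $h = \lim_{n \to \infty} \phi \circ \pi_{h,n}(g)$, where $\pi_{h,n} = \pi_h + \frac{1}{n} \pi_h^{\perp}$, which shows the desired implication.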
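The rank condition of Thm. 4.1 can be illustrated on finite representation matrices. The sketch below is an assumption-laden toy, not the experimental pipeline: it fits an affine map by ordinary least squares (over all affine maps rather than only invertible ones, which does not change the infimum since invertible maps are dense) and then reports the maximum row-wise error, echoing the MSE-then-max approximation described in App. B. All names are hypothetical.

```python
import numpy as np

def affine_fit_max_error(H, G):
    """Least-squares affine map G -> H; returns the max row-wise L2 error."""
    G1 = np.hstack([G, np.ones((G.shape[0], 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(G1, H, rcond=None)
    return float(np.linalg.norm(H - G1 @ W, axis=1).max())

rng = np.random.default_rng(0)
G = rng.standard_normal((50, 4))                    # full-rank "encoder" g
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 4))  # rank-2 linear map
H = G @ A                                           # h = A(g), of rank 2
err_h_from_g = affine_fit_max_error(H, G)           # h is reachable from g
err_g_from_h = affine_fit_max_error(G, H)           # the lost rank cannot be restored
```

The lower-rank encoder is an exact affine image of the higher-rank one, but not conversely, which is the asymmetry the preorder $\succsim_{\mathrm{Aff}}$ captures.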

D.4 Proofs: Extrinsic Homotopy

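Lemma 5.2. Let $h, g \in \mathcal{E}_V$. We have

1. There exists a constant $c(\lambda) > 0$ such that for any $\psi \in \mathrm{Aff}(V, W)$,

$$d_{\infty, \Delta^{N-1}}(\mathrm{softmax}_\lambda(\psi \circ h), \mathcal{V}_N(g)) \leq c(\lambda) \, \|\psi_{\mathrm{lin}}\| \, d_{\mathrm{Aff}(V)}(h, g).$$

2. $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) \leq c(\lambda) \, d^{\mathcal{H}}_{\mathrm{Aff}(V,W)}(h, g)$.

3. $d_{\mathrm{Aff}(V)}(h, g) = 0 \Rightarrow d^{\mathcal{H}}_{\mathrm{Aff}(V,W)}(h, g) = 0 \Rightarrow d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) = 0$.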

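Proof. 1. Clearly,

$$\begin{aligned}
d_{\infty, \Delta^{N-1}}(\mathrm{softmax}_\lambda(\psi \circ h), \mathcal{V}_N(g)) &\leq c(\lambda) \, d_{\mathcal{V}(V,W)}(\psi \circ h, \mathrm{Aff}_{V,W}(g)) \\
&\leq c(\lambda) \inf_{\psi' \in \psi \mathrm{Aff}(V)} \|\psi \circ h - \psi' \circ g\|_{\infty, W} \\
&= c(\lambda) \inf_{\psi' \in \mathrm{Aff}(V)} \|\psi(h - \psi' \circ g)\|_{\infty, W} \\
&\leq c(\lambda) \, \|\psi_{\mathrm{lin}}\| \, d_{\mathrm{Aff}(V)}(h, g),
\end{aligned}$$

where the first inequality follows from the fact that $\mathrm{softmax}_\lambda$ is $c(\lambda)$-Lipschitz for some constant that depends on $\lambda$ [16, Proposition 4].

2. & 3. are immediate consequences of 1.

$^{34}$ Close enough to $h(B_h)$.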
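The Lipschitz property used in the first inequality can be probed numerically. The sketch below assumes that $\mathrm{softmax}_\lambda(x) = \mathrm{softmax}(\lambda x)$ with $\lambda$ an inverse temperature, and takes $c(\lambda) = \lambda$ as the candidate constant with respect to the Euclidean norm; both are assumptions made for illustration rather than statements taken from the lemma.

```python
import numpy as np

def softmax(x, lam=1.0):
    """softmax with inverse temperature lam, applied row-wise."""
    z = lam * x
    z = z - z.max(axis=-1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
lam = 2.0
x = rng.standard_normal((1000, 5))
y = x + 0.1 * rng.standard_normal((1000, 5))            # nearby inputs
num = np.linalg.norm(softmax(x, lam) - softmax(y, lam), axis=1)
den = np.linalg.norm(x - y, axis=1)
max_ratio = float((num / den).max())                    # empirical Lipschitz ratio
```

The observed ratio never exceeds `lam` on these random pairs, consistent with a Lipschitz bound that scales with the inverse temperature.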

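Theorem 5.1 ($\epsilon$-Intrinsic ⇒ $O(\epsilon)$-Extrinsic). Let $h, g \in \mathcal{E}_V$ be two encoders. Then,

$$d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) \leq c(\lambda) \, d^{\mathcal{H}}_{\mathrm{Aff}(V)}(h, g).$$

Proof. Let $\psi \in \mathrm{Aff}(V, W)$. There exist a linear map $A : V \to W$ and a $\phi_V \in \mathrm{GL}(V)$ such that $\psi = A \circ \phi_V$ and $\|A\| = 1$. Accordingly, Lem. 5.2 yields

$$\begin{aligned}
d_{\infty, \Delta^{N-1}}(\mathrm{softmax}_\lambda(\psi \circ h), \mathcal{V}_N(g)) &\leq c(\lambda) \, d_{\infty, W}(\psi \circ h, \mathrm{Aff}_{V,W}(g)) && \text{(28a)} \\
&\leq c(\lambda) \, d_{\mathrm{Aff}(V)}(\phi_V \circ h, g) && \text{(28b)} \\
&\leq c(\lambda) \sup_{\psi_V \in \mathrm{Aff}(V)} \left(d_{\mathrm{Aff}(V)}(\psi_V \circ h, g)\right) && \text{(28c)} \\
&= c(\lambda) \, d^{\mathcal{H}}_{\mathrm{Aff}(V)}(h, g). && \text{(28d)}
\end{aligned}$$

Therefore, $d^{\mathcal{H}}_{\mathcal{V}(V,\Delta)}(h, g) \leq c(\lambda) \, d^{\mathcal{H}}_{\mathrm{Aff}(V)}(h, g)$.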

E Addenda on Linear Alignment Methods for Finite Representation Sets

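Linear Regression. A common way to evaluate the similarity of two representation matrices $H \in \mathbb{R}^{N \times D}$ and $G \in \mathbb{R}^{N \times D}$ is through linear regression. Linear regression finds the matrix $\hat{A} \in \mathbb{R}^{D \times D}$ that minimizes the least-squares error:

$$\hat{A} = \operatorname*{argmin}_{A \in \mathbb{R}^{D \times D}} \|G - HA\|_F^2 = (H^\top H)^{-1} H^\top G. \tag{29}$$

Let $H = Q_H R_H$ and $G = Q_G R_G$ be the QR-decompositions of $H$ and $G$, respectively. The goodness of fit is commonly evaluated through the R-squared value $R^2_{\mathrm{LR}}$, i.e., as the proportion of variance in $G$ explained by the fit:

$$R^2_{\mathrm{LR}} = 1 - \frac{\|G - H\hat{A}\|_F^2}{\|G\|_F^2} = \frac{\|Q_H^\top G\|_F^2}{\|G\|_F^2}. \tag{30}$$

To derive Eq. (30), consider the fitted value $\hat{G}$:

$$\begin{aligned}
\hat{G} = H\hat{A} &= H (H^\top H)^{-1} H^\top G && \text{(31a)} \\
&= Q_H R_H (R_H^\top Q_H^\top Q_H R_H)^{-1} R_H^\top Q_H^\top G && \text{(31b)} \\
&= Q_H Q_H^\top G. && \text{(31c)}
\end{aligned}$$

The squared norm of the residuals is therefore

$$\begin{aligned}
\|G - \hat{G}\|_F^2 &= \mathrm{tr}((G - \hat{G})^\top (G - \hat{G})) && \text{(32a)} \\
&= \mathrm{tr}((G - \hat{G})^\top G) && \text{(32b, residuals orthogonal to fitted values)} \\
&= \mathrm{tr}(G^\top G) - \mathrm{tr}(G^\top Q_H Q_H^\top G) && \text{(32c)} \\
&= \|G\|_F^2 - \|Q_H^\top G\|_F^2. && \text{(32d)}
\end{aligned}$$

With this, we can compute the coefficient of determination as

$$R^2_{\mathrm{LR}} = 1 - \frac{\|G - \hat{G}\|_F^2}{\|G\|_F^2} = 1 - \frac{\|G\|_F^2 - \|Q_H^\top G\|_F^2}{\|G\|_F^2} = \frac{\|Q_H^\top G\|_F^2}{\|G\|_F^2}. \tag{33}$$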
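The identity in Eq. (33) is easy to confirm numerically. The following sketch uses synthetic matrices (an assumption; no paper data are involved) and compares the direct least-squares route with the QR route. As in the derivation, no intercept is fitted and the matrices are not centered.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 6
H = rng.standard_normal((N, D))
G = H @ rng.standard_normal((D, D)) + 0.5 * rng.standard_normal((N, D))

# Direct route: fit A by least squares and compute 1 - residual / total.
A_hat, *_ = np.linalg.lstsq(H, G, rcond=None)
r2_direct = 1.0 - np.linalg.norm(G - H @ A_hat) ** 2 / np.linalg.norm(G) ** 2

# QR route of Eq. (33): ||Q_H^T G||_F^2 / ||G||_F^2.
Q_H, _ = np.linalg.qr(H)                  # reduced QR, Q_H is N x D
r2_qr = np.linalg.norm(Q_H.T @ G) ** 2 / np.linalg.norm(G) ** 2
```

The QR route avoids forming $(H^\top H)^{-1}$ explicitly, which is typically the numerically safer choice when $H$ is ill-conditioned.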

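Orthogonal Procrustes Problem. Let $G \in \mathbb{R}^{N \times D}$ and $H \in \mathbb{R}^{N \times D}$ be representation matrices. In the orthogonal Procrustes problem, we seek the orthogonal matrix $A$ that best maps $H$ to $G$:

$$\operatorname*{argmin}_{A \in \mathrm{O}(V)} \|G - HA\|_F. \tag{34}$$

Since

$$\begin{aligned}
\|G - HA\|_F^2 &= \mathrm{tr}((G - HA)^\top (G - HA)) \\
&= \mathrm{tr}(G^\top G) - \mathrm{tr}(G^\top HA) - \mathrm{tr}(A^\top H^\top G) + \mathrm{tr}(A^\top H^\top HA) \\
&= \|G\|_F^2 + \|H\|_F^2 - 2\,\mathrm{tr}(A^\top H^\top G),
\end{aligned}$$

an equivalent objective to Eq. (34) is

$$\hat{A} = \operatorname*{argmax}_{A \in \mathrm{O}(V)} \langle HA, G \rangle_F.$$

Let $U \Sigma V^\top$ be the singular-value decomposition of $H^\top G$; then

$$\begin{aligned}
\hat{A} &= \operatorname*{argmax}_{A \in \mathrm{O}(V)} \langle HA, G \rangle_F && \text{(35a)} \\
&= \operatorname*{argmax}_{A \in \mathrm{O}(V)} \langle A, H^\top G \rangle_F && \text{(35b)} \\
&= \operatorname*{argmax}_{A \in \mathrm{O}(V)} \langle A, U \Sigma V^\top \rangle_F && \text{(35c)} \\
&= \operatorname*{argmax}_{A \in \mathrm{O}(V)} \langle U^\top A V, \Sigma \rangle_F, && \text{(35d)}
\end{aligned}$$

where $U^\top A V$ is a product of orthogonal matrices and, therefore, orthogonal. Since $\Sigma$ is diagonal, Eq. (35d) is maximized by $U^\top \hat{A} V = I$, which means that $\hat{A} = U V^\top$.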
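The closed-form solution $\hat{A} = UV^\top$ can be exercised on a planted example. This is a minimal sketch under the row convention $G \approx HA$ with synthetic data; the function name `procrustes` is introduced here for illustration.

```python
import numpy as np

def procrustes(H, G):
    """Orthogonal A minimising ||G - H A||_F via the SVD of H^T G."""
    U, _, Vt = np.linalg.svd(H.T @ G)
    return U @ Vt

rng = np.random.default_rng(0)
H = rng.standard_normal((100, 5))
R_true, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # planted orthogonal map
G = H @ R_true
A_hat = procrustes(H, G)
```

When $G$ is an exact orthogonal image of a full-column-rank $H$, the minimizer is unique and the planted map is recovered up to floating-point error.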

Canonical Correlation Analysis. We can rewrite the CCA objective from Eq. (21) as

$$\max_{A, B} \mathrm{tr}(A^\top H G^\top B) \quad \text{s.t.} \quad (A^\top H)(A^\top H)^\top = (B^\top G)(B^\top G)^\top = I, \tag{36}$$

which, by definition of the Frobenius norm, is equivalent to Eq. (22). Let $M_{HG} = H G^\top$, $M_{HH} = H H^\top$, $M_{GG} = G G^\top$, and let $U \Sigma V^\top = M_{HG}$ be the singular-value decomposition of $M_{HG}$. One can show that the optimum of Eq. (22) is found at $(\hat{A}, \hat{B}) = (M_{HH}^{-1/2} U, M_{GG}^{-1/2} V)$. Because $A^\top H$, $B^\top G$, $U$, and $V$ are by definition orthogonal, we see that CCA first whitens the representations $(H, G)$ through $(M_{HH}^{-1/2}, M_{GG}^{-1/2})$ and then orthogonally transforms them. This provides the intuition behind a close relationship between CCA and the Orthogonal Procrustes problem: For pre-whitened representation matrices, CCA (Eq. (22)) is equivalent to solving the Orthogonal Procrustes problem (Eq. (34)). To see this, let $W_H$ and $W_G$ be whitening transforms for $H$ and $G$, respectively. Then, Eq. (22) is equivalent to

$$\min_{A, B \in \mathrm{O}(V)} \|A^\top W_H H - B^\top W_G G\|_F^2, \tag{37}$$

where

$$(A W_H H)(A W_H H)^\top = A A^\top = I, \tag{38a}$$

$$(B W_G G)(B W_G G)^\top = B B^\top = I. \tag{38b}$$

Therefore, we can derive

$$\begin{aligned}
\min_{A, B \in \mathrm{O}(V)} \|A^\top W_H H - B^\top W_G G\|_F^2 &= \min_{A B^\top \in \mathrm{O}(V)} \|A^\top\| \, \|W_H H - A B^\top W_G G\|_F^2 && \text{(39a, } A \in \mathrm{O}(V)\text{)} \\
&= \min_{C \in \mathrm{O}(V)} \|W_H H - C W_G G\|_F^2, && \text{(39b, } C \overset{\mathrm{def}}{=} A B^\top \in \mathrm{O}(V)\text{)}
\end{aligned}$$

which is equivalent to solving the Orthogonal Procrustes problem (Eq. (34)) on the whitened matrices $W_H H$ and $W_G G$.
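The whitening step can be made concrete as follows. The sketch assumes the $D \times N$ layout implied by $M_{HH} = H H^\top$, uses uncentered synthetic data, and takes $W_H = M_{HH}^{-1/2}$ as the whitening transform; the helper names are hypothetical. The singular values of the whitened cross-product play the role of canonical correlations.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def canonical_correlations(H, G):
    """Singular values of the whitened cross-product (H and G are D x N)."""
    W_H, W_G = inv_sqrt(H @ H.T), inv_sqrt(G @ G.T)   # whitening transforms
    return np.linalg.svd(W_H @ H @ G.T @ W_G, compute_uv=False)

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 300))
R = np.eye(4) + 0.3 * rng.standard_normal((4, 4))     # assumed invertible map
G = R @ H                                             # linear image of H
W_H = inv_sqrt(H @ H.T)
rho = canonical_correlations(H, G)
```

After whitening, $W_H H$ has orthonormal rows, so only an orthogonal alignment remains to be found, which is the Procrustes step of Eq. (39b).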

F Experimental Setup

In this section, we provide additional details about the setup and compute resources of the experiments in §7. To generate embeddings, we used the open-sourced code by Ren et al. [33]. Further, for Orthogonal Procrustes, CCA, PWCCA, and Linear CKA, we use the open-source implementation by Ding et al. [12]. Our complete code is added as supplementary material.

Models and Datasets. We first extract the $D = 768$ dimensional training set representations for SST-2, MRPC, RTE, CoLA, MNLI, and QQP across all 12 layers of ELECTRA [6], ROBERTA [28], and the 25 MULTIBERT [35] models from Hugging Face.$^{35}$ The models and the MRPC dataset are licensed under Apache License 2.0. The SST-2 dataset is licensed under the Creative Commons CC0: Public Domain license. The RTE dataset is licensed under the CC BY 3.0 license. The CoLA dataset is licensed under the CC BY-SA 4.0 license. The MNLI dataset is licensed under the General Public License (GPL). The QQP dataset is licensed under a custom non-commercial license.$^{36}$ The dataset statistics are shown in Tab. 3. We note that for all experiments, MNLI and QQP were shortened to the first 10K training samples due to computational limitations.

| Dataset | Task | Train Dataset Size | Domain |
|---|---|---|---|
| SST-2 | Sentiment Analysis | 67K | Movie reviews |
| MRPC | Paraphrase Detection | 3.7K | News |
| RTE | Textual Entailment | 2.5K | Mixed |
| CoLA | Linguistic Acceptability | 8.5K | Miscellaneous |
| MNLI | Natural Language Inference | 393K | Multi-Genre |
| QQP | Paraphrase Detection | 364K | Social QA |

Table 3: Statistics for the used GLUE benchmark [40] datasets.

Hyperparameters. Each experiment was run using Riemann SGD$^{37}$ as the optimizer, as it initially produced the best convergence when computing our affine similarity measures. Further, to account for convergence artifacts, we ran the intrinsic similarity optimizations in each experiment for learning rates [1E-4, 1E-3, 1E-2, 1E-1] and the extrinsic computations for [1E-3, 1E-2, 2E-2], and report the best result. When training the task-specific linear probing classifier $\psi'$ for $\hat{d}_{\psi'}$, we use the cross-entropy loss and Riemann SGD, and optimize over the learning rates [1E-2, 1E-1, 2E-1, 4E-1]. For the computation of the Hausdorff–Hoare map $d^{\mathcal{H}}$, we fixed a learning rate of 1E-3 to save compute resources, as this learning rate generally led to the best convergence in previous experiments. We used a batch size of 64 and let optimization run for 20 epochs, keeping other parameters at their defaults. For reproducibility, we set the initial seed to 42 during training.

Generating Random Affine Maps. For the last experiment, we generate random affine maps. To approximate $d^{\mathcal{H}}$, we sample the matrix entries of the affine map from $\mathcal{N}(0, 1)$. We then additionally normalize the transformed representation matrix, as this leads to better convergence. To approximate $\hat{d}^{\mathcal{H}}_{\mathcal{V}(V,\ )}$, we fit a linear probe on $H$ to 100 sets of randomly generated class labels for the embeddings of each task. The predictions of that probe then become the targets to which $G$ is affinely mapped. In both cases, the seeds are set ascendingly from 0.
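A minimal sketch of the random affine map sampling could look as follows. The function name `random_affine_images` is hypothetical, and two details are assumptions rather than statements from the paper: the bias vector is sampled from $\mathcal{N}(0, 1)$ alongside the matrix entries, and "normalizing" is interpreted as rescaling so that the largest row norm equals 1.

```python
import numpy as np

def random_affine_images(H, num_maps, normalize=True):
    """Yield (seed, H @ A + b) for random affine maps with N(0, 1) entries."""
    n, d = H.shape
    for seed in range(num_maps):  # seeds are set ascendingly from 0
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((d, d))  # matrix entries ~ N(0, 1)
        b = rng.standard_normal(d)       # assumption: bias is sampled the same way
        T = H @ A + b
        if normalize:
            # Assumption: rescale so that the largest row norm is 1.
            T = T / np.linalg.norm(T, axis=1).max()
        yield seed, T

if __name__ == "__main__":
    H = np.random.default_rng(123).standard_normal((100, 16))
    for seed, T in random_affine_images(H, num_maps=3):
        print(seed, T.shape, np.linalg.norm(T, axis=1).max())
```

Seeding each map separately makes every sampled map reproducible on its own, independent of how many maps are drawn in total.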

Compute Resources. We computed the embeddings on a single A100-40GB GPU, which took around two hours. All other experiments were run on 8-32 CPU cores, each with 8 GB of memory. Computing extrinsic distances between 600 model combinations across both datasets usually takes 2-3 hours on 8 cores, whereas the intrinsic computation is more costly and can run up to 4 hours. Note that our approximation of Hausdorff-Hoare maps (cf. Eq. (23)) across all models is significantly more costly due to our sampling approach and can take up to 72 hours to compute on 32 cores for large datasets such as SST-2, and up to 12 hours for MRPC. The resources needed for initially failed experiments do not significantly exceed the reported compute.

35 https://huggingface.co/google
36 https://www.quora.com/about/tos
37 https://github.com/geoopt/geoopt

[Figure 3: heatmap; color scale: Max Error L2 Norm; axis ticks: 768 (full), 614 (80%), 461 (60%), 307 (40%), 154 (20%).]

Figure 3: The effect of artificial rank deficiency averaged across MULTIBERTs. For each pair of embeddings $H^{(i)}$ and $H^{(j)}$ from MULTIBERTs $M^{(i)}$ and $M^{(j)}$ we generate additional rank-deficient encoders $H^{(i)}_{X\%}$ and $H^{(j)}_{Y\%}$ with $X, Y \in \{20\%, \ldots, 90\%\}$ of the full rank through SVD truncation. We compute $d(H^{(i)}_{Y\%}, H^{(j)}_{X\%})$ for each pair of possible rank deficiencies and finally report the median across all MULTIBERTs on row X and column Y of the grid. We additionally show row and column medians.

G Additional Experimental Results

The Influence of Encoder Rank Deficiency. In Thm. 4.1 we discuss how the relative rank of encoders influences their affine alignment and derive the equivalence relation Aff conditioned on equal rank between encoders. To test the effect of unequal rank on affine alignment in an isolated setup, we artificially construct reduced-rank encoders through singular value decomposition (SVD) truncation. As expected, Figure 3 shows a trend in how the encoder rank influences affine mappability. We additionally highlight that the computed distances are rather symmetric, with no clear differences when mapping to (→ M) rather than from (M →) an encoder. Finally, we note the trend along the diagonal, which indicates that mapping between encoders of the same rank becomes easier for lower-rank encoders.
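The construction of reduced-rank encoders through SVD truncation can be sketched as below. The function name `truncate_rank` and the rounding of the target rank are assumptions for illustration; rounding reproduces the tick values 614, 461, 307, and 154 for 80%, 60%, 40%, and 20% of rank 768.

```python
import numpy as np

def truncate_rank(H, fraction):
    """Return the best rank-k approximation of H and k = round(fraction * rank(H))."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    # Numerical rank: count singular values above a standard tolerance.
    tol = s.max() * max(H.shape) * np.finfo(s.dtype).eps
    full_rank = int(np.sum(s > tol))
    k = max(1, int(round(fraction * full_rank)))
    # Keep only the k largest singular directions.
    return (U[:, :k] * s[:k]) @ Vt[:k], k

if __name__ == "__main__":
    H = np.random.default_rng(0).standard_normal((50, 10))
    H_40, k = truncate_rank(H, 0.4)
    print(k, np.linalg.matrix_rank(H_40))
```

The truncated matrix keeps the original shape, so it can be passed to the same alignment routine as the full-rank representations.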

[Figure 4: heatmap grids; panel titles: Layer 4, Layer 8, Layer 12; axis ticks: E, R, M5, M10, M15, M20, M25; color scale: Max Error L2 Norm.]

Figure 4: Asymmetry between ELECTRA (E), RoBERTa (R), and MULTIBERT encoders (M1-M25) across layers. For each pair of the encoders $M^{(i)}$ and $M^{(j)}$, we generate training set embeddings $H^{(i)}, H^{(j)} \in \mathbb{R}^{N \times D}$ for the GLUE tasks SST-2, CoLA, MNLI, QQP, RTE, and MRPC. We then fit $H^{(i)}$ to $H^{(j)}$ with an affine map and report the goodness of fit through the max error L2 norm, i.e., an approximation of $d(H^{(j)}, H^{(i)})$, on row i and column j of the grid.
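To make the reported quantity concrete, the following sketch fits an affine map between two representation matrices and returns the largest row-wise L2 residual. It is an illustration under stated assumptions, not the procedure used in the paper: the map is obtained in closed form by ordinary least squares, which minimizes the squared error rather than the maximum error, so the returned value is only an upper bound on the best achievable max error. The function name `max_error_l2` is hypothetical.

```python
import numpy as np

def max_error_l2(H_src, H_tgt):
    """Fit H_tgt ~ H_src @ W + b by least squares; return the largest row-wise L2 residual."""
    X = np.hstack([H_src, np.ones((H_src.shape[0], 1))])  # bias column
    coef, *_ = np.linalg.lstsq(X, H_tgt, rcond=None)
    return np.linalg.norm(X @ coef - H_tgt, axis=1).max()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.standard_normal((200, 6))   # full-rank representations
    P = np.diag([1.0, 1.0, 0, 0, 0, 0])  # projection onto the first two coordinates
    G = H @ P                            # rank-deficient representations
    print(max_error_l2(H, G))  # near zero: G is an affine image of H
    print(max_error_l2(G, H))  # clearly positive: H is not an affine image of G
```

The usage example shows the asymmetry discussed in the paper: a full-rank encoder can be mapped exactly onto a lower-rank one, but not the other way around.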

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The abstract and introduction clearly state the claims and contributions. The claims match the theoretical and experimental results.

Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We highlight limitations in the main text and add more general points in App. B.

Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: If not in the main text, all proofs along with their assumptions are in App. D and App. E. We additionally provide short proof sketches in the main text.

Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We describe all used models, procedures as well as parameters, seeds, and required compute resources either in the main text or in App. F.

Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:

(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: Code pulls data from GLUE, is runnable, and is submitted as supplementary material.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Again, if not listed in the main text, it is listed in App. F.

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We provide significance indications in the form of p-values to all correlation computations and fitted regressions.

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: Listed in App. F.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: The authors do not foresee any ethical implications to this work.

Guidelines:

The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: Discussed in the final, "Broader Impact", section in the main body.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We cite the original paper that produced each re-used code snippet and dataset. We provide licenses for all models and datasets used in App. F.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: We do not release new assets.

Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: We do not do crowdsourcing nor research with human subjects.

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: We do not do crowdsourcing nor research with human subjects.

Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.