# Independence Tests for Language Models

Sally Zhu *1, Ahmed Ahmed *1, Rohith Kuditipudi *1, Percy Liang 1

Motivated by liability and intellectual property concerns over open-weight models, we consider the following problem: given the weights of two models, can we test whether they were trained independently, i.e., from independent random initializations? We consider two settings: constrained and unconstrained. In the constrained setting, we make assumptions about model architecture and training and propose statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. We compute the p-values by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures between the original two models versus these copies. We report p-values on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. In the unconstrained setting we make none of the prior assumptions and allow for adversarial evasion attacks that do not change model output. We thus propose a new test that matches hidden activations between two models; it is robust to these transformations and to changes in model architecture, and it can also identify specific non-independent components of models. Though we no longer obtain exact p-values from this test, empirically we find that its output behaves like a p-value and reliably distinguishes non-independent models. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.

*Equal contribution. 1 Department of Computer Science, Stanford University, Stanford, US. Correspondence to: Sally Zhu, Ahmed Ahmed.
Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Consider the ways in which two models could be related: one model may be a fine-tune of the other; one could be spliced and pruned from certain parts of the other; both models could be separately fine-tuned from a common ancestor; finally, they could be trained independently of each other. We consider the problem of determining from their weights whether two models were trained independently, which we formalize as a hypothesis testing problem in which the null hypothesis is that the weights of the two models are independent. Concretely, we treat only the weight initialization as random and thus consider two models with different random initial seeds as independent, even if both models were trained on the same data, or one model was distilled from the outputs of the other.

A solution to this independence testing problem would help auditors track the provenance of open-weight models. This is pertinent because while open-weight models enable broader access and customization, they also pose potential risks for misuse as they cannot be easily monitored or moderated (Kapoor et al., 2025). Model developers would also gain an enhanced ability to protect their intellectual property (IP) (Mensch, 2024; Peng et al., 2023) and enforce custom model licenses (Dubey et al., 2024; DeepSeek-AI et al., 2024).

We consider two settings of the independence testing problem. In the constrained setting, we make assumptions on training and initialization (essentially, that the training algorithm is equivariant to permuting the hidden units of the random initialization) that enable us to obtain provably valid p-values.
The main idea is that under these assumptions we can cheaply simulate many exchangeable copies of each model's weights and compare the value of some test statistic (e.g., cosine similarity of model weights) on each of these copies with the original model pair. The assumptions generally hold in practice but preclude robustness to adversarial evasion attacks and architectural changes.

For the constrained setting, we evaluate various test statistics on 21 models of the Llama 2 architecture (Touvron et al., 2023), including 12 fine-tunes of Llama 2 and nine independently trained models, obtaining extremely small p-values for all 69 non-independent model pairs. Notably, our tests retain low p-values over different fine-tuning methods (e.g., different optimizers) and on models fine-tuned for many tokens from the base model such as Llemma (Azerbayev et al., 2024), which was fine-tuned on an additional 750B tokens from Llama 2 (i.e., 37.5% of the Llama 2 training budget). We also confirm that the leaked Miqu-70B model from Mistral is derived from Llama 2-70B. We share code at https://github.com/ahmeda14960/model-tracing.

Figure 1. Given the weights of two models, what relationships can we derive?

For the unconstrained setting we develop a test robust to simple modifications to model weights and architecture, such as permuting hidden units, that can violate the assumptions of the constrained setting if an adversary applies them after fine-tuning. Though we are not able to obtain provably exact p-values in the unconstrained setting, we derive a test whose output empirically behaves like a p-value and reliably distinguishes non-independent models from independent models. In particular, we first align the hidden units of two models (which may each have different activation types and hidden dimensions) and then compute some measure of similarity between the aligned models.
Because of the alignment step, the test is robust to changes in model architecture and various adversarial evasion attacks (including those that break prior work). Moreover, it can localize the dependence: we can identify specific components or weights that are not independent between two models, even when they have different architectures. We evaluate our unconstrained setting test on 141 independent model pairs and find that its output empirically behaves like a p-value in the sense that it is close to uniformly distributed in [0, 1] over these pairs. In contrast, it is almost zero for all dependent pairs we test (including those for which we simulate a somewhat strong adversary by retraining entire layers from scratch). We also employ our test to identify pruned model pairs, which occur when one reduces the layer dimensions by retaining only select activations and weights from a pre-trained model; for example, we identified the precise layers of Llama 3.1 8B from which each Llama 3.2 3B and Llama 3.2 1B layer was derived.

The work most closely related to ours is due to Zeng et al. (2024), who considered our constrained setting; they develop various tests to determine whether a model as a whole is independent of another by computing the cosine similarity of the products of certain weight matrices in both models. They show that their tests are robust to simple adversarial transformations of model weights that preserve model output; however, we detail in Appendix G.1 other transformations of dependent models that evade detection by their tests. Additionally, unlike Zeng et al. (2024), in the constrained setting we obtain exact p-values from our tests. Jin et al. (2024) propose crafting specific queries that are likely to produce different responses among independently trained models; their method does not require access to weights but also does not produce exact p-values.

2.1.
Problem formulation

Let $f : \Theta \times \mathcal{X} \to \mathcal{Y}$ denote a model mapping parameters $\theta \in \Theta$ and an input $X \in \mathcal{X}$ to an output $f(X; \theta) \in \mathcal{Y}$. We represent a model training or fine-tuning process as a learning algorithm $A : \Theta \to \Theta$ that takes as input a set of initial parameters corresponding to either a random initialization or, in the case of fine-tuning, base model parameters. Specifically, $A$ includes the choice of training data, the ordering of minibatches, all other design decisions, and even the randomness used during training: everything other than the initial model weights.

Given two models $\theta_1, \theta_2 \sim P$ for some joint distribution $P \in \mathcal{P}(\Theta_1 \times \Theta_2)$, our goal is to test the null hypothesis

$$H_0 : \theta_1 \perp \theta_2, \tag{1}$$

where $\perp$ denotes independence of two random variables. One example of a case where $\theta_1$ and $\theta_2$ might not be independent is if $\theta_2$ is fine-tuned from $\theta_1$, i.e., $\Theta_1 = \Theta_2$ (meaning the two models share the same architecture) and $\theta_2 = A(\theta_1)$ for some learning algorithm $A$. We treat learning algorithms as deterministic functions. Thus, for $\theta_1 = A_1(\theta_1^0)$ and $\theta_2 = A_2(\theta_2^0)$, $\theta_1^0 \perp \theta_2^0$ (i.e., two models with independent random initializations) implies our null hypothesis.

Deep learning models are often nested in nature. For example, Transformer models include self-attention layers and MLP layers as submodels. We formalize the notion of a submodel via the following definition.

Definition 1. A model $f : \Theta \times \mathcal{X} \to \mathcal{Y}$ contains a submodel $g : \Theta' \times \mathcal{X}' \to \mathcal{Y}'$ if there exists a projection operator $\mathrm{proj} : \Theta \to \Theta'$ such that for all $\theta \in \Theta$ we have $f(x; \theta) = f_{\mathrm{out}}(g(f_{\mathrm{in}}(x); \mathrm{proj}(\theta)))$ for some functions $f_{\mathrm{in}} : \mathcal{X} \to \mathcal{X}'$ and $f_{\mathrm{out}} : \mathcal{Y}' \to \mathcal{Y}$ (which may depend on $\theta$).

Many of our experiments will involve Transformer models specifically containing MLP layers with Gated Linear Unit (GLU) activations, which are widely used among language models. It thus will be useful to define this type of MLP presently through the following example.

Example 1 (GLU MLP): Let $G, U \in \mathbb{R}^{h \times d}$ and $D \in \mathbb{R}^{d \times h}$. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an element-wise activation function.
For $x \in \mathbb{R}^d$ and $\theta = (G, U, D) \in \Theta^h_{\mathrm{mlp}}$, let $f_{\mathrm{mlp}}(x; \theta) := D(\sigma(Gx) \odot (Ux))$. Likewise, for $X \in \mathbb{R}^{s \times d}$ let $f_{\mathrm{mlp}}(X; \theta) \in \mathbb{R}^{s \times d}$ denote the result of broadcasting $f_{\mathrm{mlp}}$ over the rows of $X$.

In addition to the basic independence testing problem above, we also consider the problem of localized testing: testing whether various pairs of submodels among two overall models are independent or not. A prototypical example of a localized testing problem is identifying which layers of a larger model (e.g., Llama 3.1-8B) were used to initialize a smaller model (e.g., Llama 3.2-3B); in this case, we treat the layers as different submodels.

2.2. Constrained Setting

2.2.1. TESTING FRAMEWORK

Algorithm 1 (PERMTEST) encapsulates our framework for computing p-values against the null hypothesis in the constrained setting, wherein we simulate $T$ exchangeable copies of the first model $\theta_1$ by applying transformations to its weights. The exchangeability of these copies holds under some assumptions on the learning algorithm and random initialization that produced the original model. We capture these assumptions in the following definitions; together, they define the constrained setting.

Definition 2 ($\Pi$-invariance). Let $\Pi \subset \Theta \to \Theta$. A distribution $P \in \mathcal{P}(\Theta)$ is $\Pi$-invariant if for $\theta \sim P$ and any $\pi \in \Pi$, the parameters $\theta$ and $\pi(\theta)$ are identically distributed.

Algorithm 1: Test for computing p-values (PERMTEST)
Input: Model weights $\theta_1, \theta_2$
Parameters: test statistic $\phi$; discrete transformation class $\Pi$; permutation count $T$
Output: p-value $\hat{p} \in (0, 1]$
1. $n_{\mathrm{ties}} \leftarrow 0$
2. for $t \in 1, \ldots, T$ do
3.   $\pi_t \sim \mathrm{Unif}(\Pi)$
4.   $\phi_t \leftarrow \phi(\pi_t(\theta_1), \theta_2)$
5.   $n_{\mathrm{ties}} \leftarrow n_{\mathrm{ties}} + 1\{\phi_t = \phi(\theta_1, \theta_2)\}$
6. $\xi \sim \mathrm{Unif}(\{0, \ldots, n_{\mathrm{ties}}\})$ // break ties
7. $\hat{p} \leftarrow \frac{1}{T+1}\left(1 + \xi + \sum_{t=1}^{T} 1\{\phi_t < \phi(\theta_1, \theta_2)\}\right)$
8. return $\hat{p}$

Definition 3 ($\Pi$-equivariance). Let $\Pi \subset \Theta \to \Theta$, $\pi \in \Pi$, and $\theta^0 \in \Theta$. A learning algorithm $A$ is $\Pi$-equivariant if and only if $\pi(A(\theta^0)) = A(\pi(\theta^0))$.
The main idea underlying PERMTEST is that so long as $\theta_1 = A(\theta_1^0)$ and $\theta_1^0 \sim P$ for some $\Pi$-equivariant learning algorithm $A$ and $\Pi$-invariant distribution $P$, we can simulate $T$ exchangeable (but not independent) copies $\{\pi_t(\theta_1)\}_{t=1}^{T}$ of $\theta_1$ by sampling $\pi_t \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Unif}(\Pi)$. This allows us to efficiently compute an exact p-value without actually repeating the training process of $\theta_1$. In effect, Definitions 2 and 3 imply that $\pi$ commutes with $A$, i.e., $\pi(A(\theta_1^0)) = A(\pi(\theta_1^0))$. Under exchangeability, the p-value output by PERMTEST will be uniformly distributed over $\{(i+1)/(T+1)\}_{i=0}^{T}$.

Standard initialization schemes for feedforward networks are symmetric over the hidden units of the network, and so one example of a class of transformations with respect to which any such initialization is invariant is the set of permutations over the hidden units of the network. Moreover, the gradient of the model's output with respect to the hidden units is permutation equivariant; thus, any learning algorithm whose update rule is itself a permutation equivariant function of gradients (e.g., SGD, Adam, etc.) satisfies Definition 3 with respect to these transformations. A (contrived) example of a learning algorithm that is not permutation equivariant is one that uses different learning rates for each hidden unit depending on the index of the hidden unit.

Example 2 (Permuting hidden units): Let $\theta = (G, U, D) \in \Theta^h_{\mathrm{mlp}}$ parameterize a GLU MLP, where recall $f_{\mathrm{mlp}}(x; \theta) := D(\sigma(Gx) \odot (Ux))$ for some element-wise activation function $\sigma : \mathbb{R} \to \mathbb{R}$. Abusing notation, let $\Pi$ be the set of $h \times h$ permutation matrices such that for $\pi \in \Pi$ we define $\pi(\theta) = (\pi G, \pi U, D\pi^T)$. Observe $f_{\mathrm{mlp}}(x; \theta) = f_{\mathrm{mlp}}(x; \pi(\theta))$ and $\pi(\nabla_\theta f_{\mathrm{mlp}}(x; \theta)) = \nabla_{\pi(\theta)} f_{\mathrm{mlp}}(x; \pi(\theta))$ for all inputs $x$.
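As an illustration of Example 2 and of the PERMTEST idea, the following sketch builds a toy GLU MLP in NumPy, checks that permuting hidden units leaves the output unchanged, and runs a rank-based permutation test with an ℓ2 statistic. The SiLU activation, the toy dimensions, the synthetic "fine-tune" (base weights plus small Gaussian noise), and all helper names are assumptions made for illustration. The p-value convention used here (small when the observed distance is unusually small) is one reading consistent with Theorem 1, not necessarily the authors' implementation.

```python
import numpy as np

def glu_mlp(x, G, U, D):
    """f_mlp(x; theta) = D (sigma(G x) * (U x)); sigma = SiLU is an assumption."""
    gate = G @ x
    return D @ ((gate / (1.0 + np.exp(-gate))) * (U @ x))

def permute_hidden(theta, perm):
    """pi(theta) = (pi G, pi U, D pi^T) for a permutation of the h hidden units."""
    G, U, D = theta
    return G[perm], U[perm], D[:, perm]

def l2_stat(theta1, theta2):
    """Sum of l2 distances between corresponding weight matrices."""
    return sum(np.linalg.norm(a - b) for a, b in zip(theta1, theta2))

def permtest(theta1, theta2, stat, T, rng):
    """Rank of the observed statistic among T permuted copies, with random tie-breaking."""
    h = theta1[0].shape[0]
    observed = stat(theta1, theta2)
    sims = np.array([stat(permute_hidden(theta1, rng.permutation(h)), theta2)
                     for _ in range(T)])
    n_ties = int(np.sum(sims == observed))
    xi = int(rng.integers(0, n_ties + 1))  # uniform on {0, ..., n_ties}
    return (1 + xi + int(np.sum(sims < observed))) / (T + 1)

rng = np.random.default_rng(0)
h, d = 64, 16

def random_theta():
    return (rng.normal(size=(h, d)), rng.normal(size=(h, d)), rng.normal(size=(d, h)))

theta_base = random_theta()
# Hypothetical stand-in for fine-tuning: a small perturbation of the base weights.
theta_ft = tuple(w + 0.1 * rng.normal(size=w.shape) for w in theta_base)
theta_indep = random_theta()

# Permuting hidden units does not change the model output.
x = rng.normal(size=d)
out_same = np.allclose(glu_mlp(x, *theta_base),
                       glu_mlp(x, *permute_hidden(theta_base, rng.permutation(h))))

p_dep = permtest(theta_base, theta_ft, l2_stat, T=99, rng=rng)
p_indep = permtest(theta_base, theta_indep, l2_stat, T=99, rng=rng)
print(out_same, p_dep, p_indep)
```

With T = 99 the smallest attainable value is 1/(T + 1) = 0.01, which is what the dependent toy pair reaches here because every permuted copy is farther from the perturbed weights than the original is.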
The assumptions we make in the constrained setting suffice for PERMTEST to produce a valid p-value, as we show in the following theorem, whose proof uses symmetry of the initialization and training process (full proof in Appendix A).¹ Importantly, the result of the theorem holds (under the null hypothesis) without any assumptions on $\theta_2$; so, a model developer of $\theta_1$ testing other models with our methods can have confidence in the validity of our test without trusting the provider of $\theta_2$. Of course, if $\theta_2$ does not satisfy the equivariance assumption on training (as in the unconstrained setting), then PERMTEST is unlikely to produce a low p-value even in cases where $\theta_1$ and $\theta_2$ are not independent (e.g., if an adversary fine-tunes $\theta_2$ from $\theta_1$ but then afterwards randomly permutes its hidden units).

Theorem 1. Let $\phi : \Theta \times \Theta \to \mathbb{R}$ be a test statistic and $\Pi \subset \Theta \to \Theta$ be finite. Let $A : \Theta \to \Theta$ be $\Pi$-equivariant and let $P \in \mathcal{P}(\Theta)$ be $\Pi$-invariant. For $\theta_1^0 \sim P$, let $\theta_1 = A(\theta_1^0)$. Let $\theta_2 \in \Theta$ be independent of $\theta_1$. Then $\hat{p} = \mathrm{PERMTEST}(\theta_1, \theta_2)$ is uniformly distributed on $\{\frac{i+1}{T+1}\}_{i=0}^{T}$.

We also generalize Theorem 1 to apply to randomized learning algorithms that satisfy a notion of equivariance in distribution (including dropout) in Appendix B. However, throughout the main text we will continue to treat learning algorithms as deterministic for the sake of simplicity.

2.2.2. TEST STATISTICS

We have shown PERMTEST produces a valid p-value regardless of the test statistic $\phi$ we use. The sole objective then in designing a test statistic is to achieve high statistical power: we would like $\hat{p} = \mathrm{PERMTEST}(\theta_1, \theta_2)$ to be small when $\theta_1$ and $\theta_2$ are not independent. The statistics in this section apply to any model pair sharing the same architecture.

Prior work (Xu et al., 2024) proposed testing whether two models are independent or not based on the $\ell_2$ distance between their weights, summed over layers. Specifically, for a model with $L$ layers parameterized by $\Theta = \Theta_1 \times \cdots \times$
$\Theta_L$, with $\theta_1 = (\theta_1^{(\ell)})_{\ell=1}^{L}$ and $\theta_2 = (\theta_2^{(\ell)})_{\ell=1}^{L}$, let $\phi_{\ell_2}(\theta_1, \theta_2) := \sum_{\ell=1}^{L} \ell_2(\theta_1^{(\ell)}, \theta_2^{(\ell)})$. We can obtain p-values from $\phi_{\ell_2}$ by using it within PERMTEST. However, a major limitation is that in order to obtain a p-value as small as $1/(T+1)$ we must recompute $\phi_{\ell_2}$ at least $T$ times; the statistical power of our test using $\phi_{\ell_2}$ is therefore bottlenecked by computation.

To address this limitation, we propose a family of test statistics whose distribution under the null is identical for any model pair. The test statistics all share the following general form based on Algorithm 2 (MATCH): for $m, n \in \mathbb{N}$ and $M : \Theta \to \mathbb{R}^{n \times m}$, let

$$\phi_M(\theta_1, \theta_2) := \mathrm{SPEARMAN}(\mathrm{MATCH}(M(\theta_1), M(\theta_2)), [1, \ldots, n]), \tag{2}$$

where SPEARMAN is the Spearman rank correlation (Algorithm 3).

¹ As a result of the test yielding exact p-values, we can directly control for the false positive rate via the significance threshold.

Equation (2) is applicable to any model architecture $\Theta$ for which we can define a suitable matrix-valued function $M$ of model parameters. For example, $M$ could extract a weight matrix or activation matrix (based on some set of inputs) from a layer of the model, where each row corresponds to a hidden unit of the model. We use MATCH to align the rows of the two extracted matrices and compute the Spearman correlation of this alignment with the identity map between rows. We describe matching in Algorithm 2, wherein cossim denotes the cosine similarity function and LAP denotes the algorithm of Ramshaw & Tarjan (2012) we use to solve the matching problem. The idea is that for two dependent models, each row of $M(\theta_1)$ should be similar to its counterpart in $M(\theta_2)$; thus, the alignment found by MATCH will be close to the identity map.
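The matching step can be sketched as follows. This is a minimal illustration that uses SciPy's `linear_sum_assignment` as the assignment solver; that choice is an assumption made for convenience and is not the Ramshaw & Tarjan (2012) algorithm the text cites. The toy matrices and the function name `match` are likewise hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(W1, W2):
    """Return perm where perm[i] is the row of W2 assigned to row i of W1.

    The assignment maximizes the total cosine similarity between matched rows.
    """
    A = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    B = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    C = A @ B.T  # C[i, j] = cossim(row i of W1, row j of W2)
    _, cols = linear_sum_assignment(C, maximize=True)
    return cols

rng = np.random.default_rng(0)
h, m = 50, 20
W1 = rng.normal(size=(h, m))
W2 = W1 + 0.05 * rng.normal(size=(h, m))  # toy stand-in for a dependent model

perm_same = match(W1, W2)          # rows line up, so this is the identity map
q = rng.permutation(h)
perm_shuffled = match(W1, W2[q])   # rows of W2 shuffled: matching recovers the inverse of q
print(perm_same[:5], perm_shuffled[:5])
```

For the dependent toy pair the matching is the identity; once the rows of the second matrix are shuffled, the matching instead recovers the shuffle, which is why a statistic that compares the matching against the identity is sensitive to hidden-unit permutations.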
Meanwhile, so long as $M$ is a $\Pi$-equivariant map (Definition 4), $\phi_M(\theta_1, \theta_2)$ under the null yields valid p-values (see Theorem 2 and proof in Appendix A); so we can use the more computationally efficient Algorithm 3 to convert statistics to p-values instead of running PERMTEST.

Definition 4 (equivariant map). A matrix-valued function $M : \Theta \to \mathbb{R}^{n \times m}$ is $\Pi$-equivariant with respect to a class of transformations $\Pi : \Theta \to \Theta$ if there exists a bijection between $\Pi$ and the set of $n \times n$ permutation matrices such that $M(\pi(\theta)) = \pi M(\theta)$ for all $\theta \in \Theta$ and $\pi \in \Pi$.

Theorem 2. Let $M : \Theta \to \mathbb{R}^{n \times m}$ be a $\Pi$-equivariant map and let $P \in \mathcal{P}(\Theta)$ be $\Pi$-invariant. Let $\theta_1, \theta_2 \in \Theta$ be independent random variables, with $\theta_1 = A(\theta_1^0)$ for $\theta_1^0 \sim P$. Then $\phi_M(\theta_1, \theta_2)$ is uniformly distributed on $[0, 1)$.

Algorithm 2: Cosine similarity matching (MATCH)
Input: Matrices $W_1, W_2$ with $h$ rows
Output: Permutation $\pi : [h] \to [h]$
1. for $i \in 1, \ldots, h$ do
2.   for $j \in 1, \ldots, h$ do
3.     $C_{i,j} \leftarrow \mathrm{cossim}((W_1)_i, (W_2)_j)$
4. $\pi \leftarrow \mathrm{LAP}(C)$

Taking various such functions $M$ yields different test statistics. We focus our experiments on Transformer models consisting of a series of $L$ Transformer blocks that each contain a GLU MLP submodel, and we take $M(\theta)$ to be either the up projection weights or the hidden-layer activations of one of these MLP submodels. In particular, let $U^{(\ell)}(\theta) \in \mathbb{R}^{h \times d}$ denote the first layer up projection weights of the MLP in the $\ell$-th block, where $h$ is the hidden dimension and $d$ is the input dimension, and let $H^{(\ell)}(\theta) \in \mathbb{R}^{h \times (N \cdot s)}$ denote the (flattened) hidden activations obtained from passing $N$ length-$s$ input sequences $X \in \mathbb{R}^{N \times s \times d}$ to the same MLP module (the test is valid for any $X$; we will specify later how we choose $X$ in our experiments). The two main test statistics we employ in our experiments are $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$. Both $U^{(\ell)}$ and $H^{(\ell)}$ are equivariant with respect to permuting the hidden units of the corresponding MLP, so we can directly interpret the outputs of $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$ as p-values.
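The conversion from a matching to a p-value (Algorithm 3, SPEARMAN) can be sketched as below. The sketch assumes the textbook Spearman formula for permutations without ties and the usual Student t approximation with n - 2 degrees of freedom. The floor at the smallest positive normal 64-bit float mirrors the ε = 2.2e-308 reported in the experiments, but applying it this way is an assumption, as are the function and variable names.

```python
import numpy as np
from scipy import stats

TINY = np.finfo(float).tiny  # about 2.2e-308

def spearman_pvalue(pi1, pi2):
    """One-sided p-value for positive rank correlation between two permutations."""
    pi1 = np.asarray(pi1, dtype=float)
    pi2 = np.asarray(pi2, dtype=float)
    n = len(pi1)
    r = 1.0 - 6.0 * np.sum((pi1 - pi2) ** 2) / (n * (n**2 - 1))
    if r >= 1.0:   # perfectly aligned: the t statistic is +infinity
        return TINY
    if r <= -1.0:  # perfectly reversed: the t statistic is -infinity
        return 1.0
    t = r * np.sqrt((n - 2) / (1.0 - r**2))
    return max(float(stats.t.sf(t, df=n - 2)), TINY)

rng = np.random.default_rng(0)
identity = np.arange(1000)
p_aligned = spearman_pvalue(identity, identity)               # matching equals the identity
p_reversed = spearman_pvalue(identity[::-1], identity)        # matching is the reverse order
p_random = spearman_pvalue(rng.permutation(1000), identity)   # unrelated matching
print(p_aligned, p_reversed, p_random)
```

A matching that coincides with the identity yields the floor value, while an unrelated matching yields a value spread over (0, 1], in line with the uniformity claim of Theorem 2.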
Moreover, we can separately permute the hidden units of the MLP in the $\ell$-th block without changing the inputs or outputs of the other blocks. Thus, as we show in Theorem 3 (proof in Appendix A), we can aggregate the p-values from $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$ across blocks using Fisher's method (Mosteller & Fisher, 1948) to obtain a more powerful test in Algorithm 4 (FISHER).

Algorithm 3: Deriving p-values from Spearman correlation (SPEARMAN)
Input: Permutations $\pi_1, \pi_2 : [h] \to [h]$
Output: p-value $\hat{p} \in (0, 1]$
1. $r \leftarrow 1 - \frac{6 \sum_{i} (\pi_1[i] - \pi_2[i])^2}{n(n^2 - 1)}$
2. $t \leftarrow r \sqrt{(n-2)/(1-r^2)}$
3. $\hat{p} \leftarrow P(T_{n-2} > t)$
4. return $\hat{p}$

Here $n = h$ is the number of matched rows and $T_{n-2}$ denotes a Student t random variable with $n - 2$ degrees of freedom.

Algorithm 4: Aggregating p-values (FISHER)
Input: p-values $\{\hat{p}^{(i)}\}_{i=1}^{L}$
Output: p-value $\hat{p} \in (0, 1]$
1. $\xi \leftarrow -\sum_{i=1}^{L} \log \hat{p}^{(i)}$
2. $\hat{p} \leftarrow 1 - P(\chi^2_{2L} < 2\xi)$
3. return $\hat{p}$

Theorem 3. Consider block indices $i, j \in [L]$ with $i \neq j$ for models with $L$ blocks. Suppose for $\ell \in \{i, j\}$ that
1. $M^{(\ell)} : \Theta \to \mathbb{R}^{h \times N}$ is equivariant with respect to $\Pi^{(\ell)}$, i.e., for any $\theta \in \Theta$ and $\pi^{(\ell)} \in \Pi^{(\ell)}$ we have $M^{(\ell)}(\pi^{(\ell)}(\theta)) = \pi^{(\ell)} M^{(\ell)}(\theta)$.
2. $A$ is a $\Pi^{(\ell)}$-equivariant learning algorithm and $P \in \mathcal{P}(\Theta)$ is a $\Pi^{(\ell)}$-invariant distribution.
Let $\theta_1, \theta_2 \in \Theta$. If $\theta_1 \perp \theta_2$ for $\theta_1 = A(\theta_1^0)$ with $\theta_1^0 \sim P$, then $\mathrm{MATCH}(M^{(i)}(\theta_1), M^{(i)}(\theta_2)) \perp \mathrm{MATCH}(M^{(j)}(\theta_1), M^{(j)}(\theta_2))$.

Recall $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$ are functions of $\mathrm{MATCH}(M^{(\ell)}(\theta_1), M^{(\ell)}(\theta_2))$ respectively for $M^{(\ell)} = U^{(\ell)}$ and $M^{(\ell)} = H^{(\ell)}$, both of which satisfy the assumptions of the theorem. Thus, the result of the theorem applies to both these test statistics, and the independence of the p-values from these test statistics across blocks follows directly from the independence of the statistics themselves.

2.3. Unconstrained Setting

For the unconstrained setting, our goal is to design a robust test that applies to models of different architectures and is robust to output-preserving transformations of model weights.
Recall our tests for the constrained setting satisfy neither of these desiderata: these tests assume both models have the same number of hidden units, and it is easy to fool them without changing the output of a model by permuting the order of the hidden units in the model. Our robust test builds on the design of $\phi_M$ in equation (2). The goal is to identify two matrix-valued functions of model parameters $M, M' : \Theta \to \mathbb{R}^{n \times m}$ that jointly satisfy the following condition: any output-preserving transformation of model parameters must transform both $M$ and $M'$ in the same way. Then, whereas previously we would correlate $\mathrm{MATCH}(M(\theta_1), M(\theta_2))$ with the identity permutation, we instead define

$$\phi_{M,M'} := \mathrm{SPEARMAN}(\mathrm{MATCH}(M(\theta_1), M(\theta_2)), \mathrm{MATCH}(M'(\theta_1), M'(\theta_2))). \tag{3}$$

The above goal is aspirational in the sense that for any nontrivial deep learning model we are not able to fully enumerate the set of transformations of model parameters to which model output is invariant; nonetheless, it will serve as a useful guiding principle for designing our robust test under the framework of equation (3). We organize the description of our full robust test, which is generally applicable to a variety of model architectures, into two parts: first, in Section 2.3.1 we instantiate equation (3) to obtain a test for GLU MLP models. Then, in Section 2.3.2 we use our GLU MLP test as a primitive for designing a test that applies to general deep learning models (including those which do not contain any GLU MLP submodels).

2.3.1. TESTING GLU MODELS

Recalling our definition of a GLU MLP model in Example 1, for $k \in \{1, 2\}$ let $\theta_k = (G_k, U_k, D_k) \in \Theta^{h_k}_{\mathrm{mlp}}$, and with inputs $X \in \mathbb{R}^{d \times N}$ let $H_{\mathrm{up}}(\theta_k) = U_k X \in \mathbb{R}^{\max\{h_1, h_2\} \times N}$ be the output of the up projection operation and let $H_{\mathrm{gate}}(\theta_k) = G_k X \in \mathbb{R}^{\max\{h_1, h_2\} \times N}$ be the output of the gate projection operation (with appropriate zero-padding when $h_1 \neq h_2$).
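Equation (3) can be sketched for a single GLU MLP as follows, taking M to be the gate activations and M' the up-projection activations. For brevity the sketch assumes both models have the same hidden width (no zero-padding), uses SciPy's `linear_sum_assignment` in place of the cited LAP solver, and simulates a fine-tune as base weights plus small noise; these choices and all names are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
from scipy.optimize import linear_sum_assignment

TINY = np.finfo(float).tiny

def match(M1, M2):
    """Row matching that maximizes total cosine similarity."""
    A = M1 / np.linalg.norm(M1, axis=1, keepdims=True)
    B = M2 / np.linalg.norm(M2, axis=1, keepdims=True)
    return linear_sum_assignment(A @ B.T, maximize=True)[1]

def spearman_pvalue(pi1, pi2):
    """One-sided Spearman p-value for two permutations (no ties)."""
    pi1, pi2 = np.asarray(pi1, dtype=float), np.asarray(pi2, dtype=float)
    n = len(pi1)
    r = 1.0 - 6.0 * np.sum((pi1 - pi2) ** 2) / (n * (n**2 - 1))
    if r >= 1.0:
        return TINY
    if r <= -1.0:
        return 1.0
    t = r * np.sqrt((n - 2) / (1.0 - r**2))
    return max(float(stats.t.sf(t, df=n - 2)), TINY)

def phi_match(theta1, theta2, X):
    """Correlate the gate-activation matching with the up-activation matching."""
    (G1, U1, _), (G2, U2, _) = theta1, theta2
    gate_perm = match(G1 @ X, G2 @ X)
    up_perm = match(U1 @ X, U2 @ X)
    return spearman_pvalue(gate_perm, up_perm)

rng = np.random.default_rng(0)
h, d, N = 64, 16, 128

def random_theta():
    return (rng.normal(size=(h, d)), rng.normal(size=(h, d)), rng.normal(size=(d, h)))

theta_base = random_theta()
theta_ft = tuple(w + 0.05 * rng.normal(size=w.shape) for w in theta_base)
q = rng.permutation(h)  # adversary permutes hidden units after fine-tuning
theta_adv = (theta_ft[0][q], theta_ft[1][q], theta_ft[2][:, q])
theta_indep = random_theta()
X = rng.normal(size=(d, N))

p_adv = phi_match(theta_base, theta_adv, X)      # stays tiny despite the permutation
p_indep = phi_match(theta_base, theta_indep, X)
# For contrast: comparing the gate matching to the identity no longer flags the pair.
p_naive = spearman_pvalue(match(theta_base[0] @ X, theta_adv[0] @ X), np.arange(h))
print(p_adv, p_indep, p_naive)
```

Because the adversary's permutation moves the gate rows and the up rows together, the two matchings agree with each other even though neither agrees with the identity.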
Due to the element-wise product operation, we conjecture that in general it is not possible to permute the rows of $G_k$ while preserving the output of $\theta_k$ without permuting the rows of $U_k$ in the same way, and so we use $\phi_{M,M'}$ with $M = H_{\mathrm{gate}}$ and $M' = H_{\mathrm{up}}$ for our GLU MLP test. Henceforth, we will shorthand this test as $\phi_{\mathrm{MATCH}}$.

As with the constrained setting, we focus much of our experiments on Transformer models, which recall consist of a series of $L$ Transformer blocks that each contain a GLU MLP submodel. Adopting the notational conventions of Section 2.2.2, we can apply our GLU MLP test to the $\ell$-th block by taking $M = H^{(\ell)}_{\mathrm{gate}}$ and $M' = H^{(\ell)}_{\mathrm{up}}$, where like before (in the case of $\phi_{H^{(\ell)}}$) we obtain the activation inputs for each block by computing a forward pass through the full model over a set of length-$s$ sequences of input tokens. We can aggregate the results of these tests over blocks using FISHER, like we do for $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$ in the constrained setting. Alternatively, we can apply the test to all possible $O(L^2)$ pairs of blocks between two Transformer models if we suspect that certain blocks from one model served as the initializations for different blocks in the other model. Specifically, we can test the $i$-th block of $\theta_1$ and the $j$-th block of $\theta_2$ using

$$\phi^{(i,j)}_{\mathrm{MATCH}} := \mathrm{SPEARMAN}(\mathrm{MATCH}(H^{(i)}_{\mathrm{gate}}(\theta_1), H^{(j)}_{\mathrm{gate}}(\theta_2)), \mathrm{MATCH}(H^{(i)}_{\mathrm{up}}(\theta_1), H^{(j)}_{\mathrm{up}}(\theta_2))).$$

This test is relevant for pruned models, where only select blocks (layers) of $\theta_2$ may be used to initialize the smaller $\theta_1$, or if an adversary takes only certain layers, or even only certain activations, of a pre-trained model and injects other layers.

2.3.2. BEYOND GLU MODELS

Thus far we have focused on models $f : \mathcal{X} \times \Theta \to \mathcal{Y}$ containing a GLU MLP submodel. In particular, recalling Definition 1, we have assumed for some $\mathrm{proj}_{\mathrm{mlp}} : \Theta \to \Theta^h_{\mathrm{mlp}}$ that

$$f(x; \theta) = f_{\mathrm{out}}(f_{\mathrm{mlp}}(f_{\mathrm{in}}(x); \mathrm{proj}_{\mathrm{mlp}}(\theta))). \tag{4}$$

Now, our goal is to test more general types of models.
In particular, we generalize to an arbitrary alternative submodel $f_{\mathrm{alt}} : \mathbb{R}^d \times \Theta_{\mathrm{alt}} \to \mathbb{R}^d$ with $\mathrm{proj}_{\mathrm{alt}} : \Theta \to \Theta_{\mathrm{alt}}$ such that

$$f(x; \theta) = f_{\mathrm{out}}(f_{\mathrm{alt}}(f_{\mathrm{in}}(x); \mathrm{proj}_{\mathrm{alt}}(\theta))). \tag{5}$$

In order to test whether two models $\theta_1, \theta_2 \in \Theta$ of the more general form in equation (5) are independent, we will first construct proxy models of the form in equation (4) and then apply our previous test $\phi_{\mathrm{MATCH}}$ to these proxy models. We construct these proxy models by leveraging the fact that $f_{\mathrm{alt}}$ shares the same input and output space with $f_{\mathrm{mlp}}$. Specifically, for $k \in \{1, 2\}$ we first learn parameters $\hat{\theta}_k \in \Theta^h_{\mathrm{mlp}}$ so that $f_{\mathrm{mlp}}(\cdot; \hat{\theta}_k)$ approximates $f_{\mathrm{alt}}(\cdot; \mathrm{proj}_{\mathrm{alt}}(\theta_k))$. We then return $\phi_{\mathrm{MATCH}}(\hat{\theta}_1, \hat{\theta}_2)$. We capture this two-stage process in Algorithm 5. Perhaps surprisingly, we show that Algorithm 5 is effective in practice at distinguishing independent versus non-independent models. The hidden dimension $h$ and input distribution $P$ with which we learn the GLU MLP are hyperparameters of the test. See Section 3.2 for details.

Algorithm 5: Generalized robust test
Input: Model parameters $\theta_1, \theta_2 \in \Theta$
Parameters: distribution $P$ over $\mathbb{R}^d$
Output: $\hat{p} \in [0, 1]$
1. for $k \in \{1, 2\}$ do
2.   $\hat{\theta}_k \leftarrow \arg\min_{\hat{\theta}} \mathbb{E}_{x \sim P} \left\| f_{\mathrm{alt}}(x; \mathrm{proj}_{\mathrm{alt}}(\theta_k)) - f_{\mathrm{mlp}}(x; \hat{\theta}) \right\|$
3. return $\hat{p} \leftarrow \phi_{\mathrm{MATCH}}(\hat{\theta}_1, \hat{\theta}_2)$

3. Experimental Results

3.1. Constrained setting

We first validate the effectiveness of our tests in the constrained setting on 21 open-weight language models trained with the Llama-7B architecture, for which public documentation provides ground truth on model independence. These models all contain $L = 32$ GLU MLPs, each part of its own Transformer block.

We run experiments with three different tests. Each test comprises two elements: a test statistic along with a method for computing p-values from the statistic. For the first test, we use $\phi_{\ell_2}$ and compute p-values via PERMTEST with $T = 100$. For the other two tests, we compute p-values by directly aggregating the outputs of (respectively) $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$ over $\ell \in [L]$ using FISHER.
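The aggregation step can be sketched with SciPy's chi-squared survival function. This is a minimal illustration of Fisher's method as it is usually stated (minus twice the sum of log p-values compared against a chi-squared distribution with 2L degrees of freedom); the function name and the example inputs are hypothetical.

```python
import numpy as np
from scipy import stats

def fisher(pvalues):
    """Combine L independent p-values into one p-value with Fisher's method."""
    p = np.asarray(pvalues, dtype=float)
    xi = -np.sum(np.log(p))
    # 1 - P(chi2_{2L} < 2 xi), computed as a survival function for accuracy.
    return float(stats.chi2.sf(2.0 * xi, df=2 * p.size))

p_single = fisher([0.5])                # one p-value is returned unchanged
p_combined = fisher([0.01, 0.02, 0.03]) # several small values combine into a smaller one
p_all_ones = fisher([1.0, 1.0])         # no evidence at all stays at 1
print(p_single, p_combined, p_all_ones)
```

Using the survival function instead of subtracting a CDF from 1 avoids rounding the result to zero when the per-block p-values are very small, which matters when aggregating over all 32 blocks.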
We obtain the inputs to the GLU MLP in the $\ell$-th block required to compute $\phi_{H^{(\ell)}}$ by sampling sequences of tokens uniformly at random from the model's vocabulary and computing a forward pass through the full model while storing the MLP hidden layer activations. The equivariant transformation class $\Pi$ is the set of permutations over both the hidden units of each MLP (see Example 2) and the embedding dimension of the model (i.e., the inputs passed to both the MLP and self-attention layers in each block); we defer the precise definition of $\Pi$ in this case to Appendix D.

3.1.1. BASELINE STATISTICS

We employ two test statistics from prior work as baselines: Jensen-Shannon divergence between next token output distributions ($\phi_{\mathrm{JSD}}$; Lin, 2006), and $\phi_{\ell_2}$ (Xu et al., 2024) with PERMTEST (details in Section 2.2.2). We computed $\phi_{\mathrm{JSD}}$ using input sequences sampled from WikiText-103 (Merity et al., 2017; Xu et al., 2024), consistent with prior work. Since the Jensen-Shannon divergence is (by definition) invariant to any transformation of weights that does not affect model output, we cannot compute meaningful p-values using PERMTEST; instead, in our experiments we report the raw value of the test statistic itself.

3.1.2. LLAMA FAMILY EXPERIMENTAL RESULTS

The 21 models we evaluated include 6 base models (trained from scratch), so we have six disjoint sets of models, with the models based on Llama-2-7b-hf stemming from a diverse mix of industry labs and non-profits (Azerbayev et al., 2024; Sudalairaj et al., 2024; Liu et al., 2024; Li et al., 2023). We consider any pair of models in the same tree as dependent and all other pairs as independent. We include examples of further fine-tunes (e.g., llemma-7b) of fine-tunes (e.g., CodeLlama-7b-hf) among the models we test. We will mostly refer to models by their Hugging Face identifiers, without the organization names, for clarity.
We evaluated four test statistics: $\phi_{U^{(\ell)}}$ (cosine similarity of weights), $\phi_{H^{(\ell)}}$ (cosine similarity of hidden activations), $\phi_{\ell_2}$ ($\ell_2$ distance), and $\phi_{\mathrm{JSD}}$ (Jensen-Shannon divergence). As we describe in Section 2.2.2, for $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$ we report aggregated p-values over all blocks using FISHER. We report results for a subset of these pairs involving base model Llama-2-7b-hf in Table 1 while deferring the rest and the full experimental setup details to Appendix E.

$\theta_1$ = Llama-2-7b-hf; the $\phi_{\ell_2}$, $\phi_{U^{(\ell)}}$, and $\phi_{H^{(\ell)}}$ columns are p-values.

| $\theta_2$ | Indep.? | $\phi_{\mathrm{JSD}}$ (log) | $\phi_{\ell_2}$ | $\phi_{U^{(\ell)}}$ | $\phi_{H^{(\ell)}}$ |
|---|---|---|---|---|---|
| llama-7b-hf | Yes | -11.10 | 0.98 | 0.60 | 0.25 |
| vicuna-7b-v1.1 | Yes | -10.40 | 0.63 | 0.16 | 0.64 |
| Amber | Yes | -10.69 | 0.75 | 0.36 | 0.88 |
| open-llama-7b | Yes | -8.38 | 0.26 | 0.36 | 0.71 |
| vicuna-7b-v1.5 | No | -10.87 | 0.01 | ε | ε |
| CodeLlama-7b-hf | No | -10.62 | 0.01 | ε | ε |
| llemma-7b | No | -10.24 | 0.01 | ε | ε |
| Orca-2-7b | No | -10.34 | 0.01 | ε | ε |

Table 1. We report various constrained setting test statistics with $\theta_1$ as Llama-2-7b-hf and $\theta_2$ ranging over the listed models. The independent column is the ground truth. Here, ε = 2.2e-308 (numerical underflow for a 64-bit float).

We find our proposed tests $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$ distinguish independent versus non-independent model pairs with high statistical power. Consistent with prior work (Xu et al., 2024), we find that $\phi_{\mathrm{JSD}}$ does not reliably distinguish independent versus dependent model pairs. For example, CodeLlama-7b-hf exhibits a larger divergence with Llama-2-7b-hf than the independently trained models llama-7b-hf and Amber.
All other test statistics reliably distinguish independent versus dependent pairs; in particular, the p-values we obtain using the other test statistics are negligible for all dependent pairs (for $\phi_{\ell_2}$, because we run PERMTEST with $T = 99$ for computational reasons, we cannot obtain a p-value less than 0.01). Notably, in contrast to our findings, prior work (Xu et al., 2024) argued that the $\ell_2$ distance between model parameters is not a reliable indicator of independence, in the sense that the $\ell_2$ distance between dependent pairs is sometimes larger than that of independent pairs (similar to the case of $\phi_{\mathrm{JSD}}$); the key difference is that Xu et al. (2024) report the raw $\ell_2$ distance whereas we obtain p-values from the raw distances using PERMTEST. We hypothesize that PERMTEST effectively standardizes the raw distances.

We further evaluated the efficacy of our tests through ablations by training two models with the same OLMo-7B architecture on the same dataset that differ only in the choice of random initialization, and report results in Appendix E.1. We also verify that Miqu-70B is not independent from Llama 2-70B (Mensch, 2024) and report further details in Appendix E.2.

3.2. Unconstrained setting

For the unconstrained setting, we first assess the previous 21 models of the Llama-7B architecture. We compute $\phi_{\mathrm{MATCH}}$ with the gate and up-projection matrices $M = H^{(\ell)}_{\mathrm{gate}}$ and $M' = H^{(\ell)}_{\mathrm{up}}$ of each MLP in block $\ell \in [L]$, and aggregate them with FISHER. We obtain the activations in the MLPs by using input sequences sampled from WikiText-103 and computing a forward pass through the full model, with results on all model pairs in Appendix F. We find that the distribution of $\phi_{\mathrm{MATCH}}$ on independent model pairs is close to uniform (Figure 2), whereas across all non-independent model pairs the statistic is at most ε.
Unlike the constrained setting, where the p-values are valid by construction, the output of the robust test does not enjoy such theoretical guarantees; however, Figure 2 suggests that even in the unconstrained setting our statistic $\phi_{\mathrm{MATCH}}$ behaves like a p-value.

(a) Plot of $x \in [0, 1)$ vs. the fraction of $\phi^{(i)}_{\mathrm{MATCH}}$ (across all MLP blocks) of independent model pairs less than $x$.
(b) Plot of $x \in [0, 1)$ vs. the fraction of $\phi_{\mathrm{MATCH}}$ ($\phi^{(i)}_{\mathrm{MATCH}}$ aggregated with FISHER) of independent model pairs less than $x$.

Figure 2. We plot the fraction of $\phi_{\mathrm{MATCH}}$ less than $x \in [0, 1)$, aggregated with FISHER for independent model pairs. Both plots roughly follow the line $y = x$, i.e., a uniform distribution in $[0, 1)$ under the null, meaning $\phi_{\mathrm{MATCH}}$ empirically acts as a p-value.

We also validated our tests on the Mistral architecture: we compared the weights of the hybrid StripedHyena-Nous-7B (Poli et al., 2023) with Mistral-7B-v0.1 and find non-independent parameters via $\phi_{U^{(\ell)}}$. We compute $\phi_{U^{(\ell)}}$ on all parameters, which allows us to identify non-independence between specific parameters of the models, such as the self-attention matrices, rather than between the models as a whole, and report values of $\phi_{U^{(\ell)}}$ among certain parameters in Table 6 in Appendix F.1. From the small p-values, we infer that the embedding layer and some self-attention matrices were likely shared between the two models.

3.2.1. SIMULATING STRONG-ISH ADVERSARIES

A significant difficulty in evaluating the robustness of our test $\phi_{\mathrm{MATCH}}$ to adversarial transformations is that we cannot exhaustively enumerate all such transformations. Recalling that $\phi_{\mathrm{MATCH}}$ specifically considers the MLP layers contained within two models, we attempt to fool it by randomly reinitializing and retraining these MLP layers individually, thus simulating a somewhat strong adversary.
We reinitialize the first GLU MLP module of a model $\theta_1$ with an MLP of double the width and, using Algorithm 5 (generalized robust test), train $\hat{\theta}_1$ with random Gaussians as the training distribution $P$. We retrain each of the 32 MLPs (keeping other layers fixed) of vicuna-7b-v1.5 (a finetune of Llama-2-7b-hf) for 10k gradient steps (until the loss curve plateaus). (Additional hyperparameters and a learning curve are in Appendix F.2.) For all 32 runs, we compute $\phi_{\text{MATCH}}$ for the retrained model with the original Llama-2-7b-hf and find that $\phi_{\text{MATCH}}$ remains very small between the non-independent models even after an MLP has been retrained. For example, after retraining the first MLP module, $\phi^{(1)}_{\text{MATCH}}$ on the first MLP was less than $\varepsilon$ = 2.2e-308, indicating that the two models are not independent. We find the same is true for the other MLP layers as well (i.e., $\phi^{(\ell)}_{\text{MATCH}}$ when evaluated on the retrained layer $\ell$), with full results in Table 7 of Appendix F.2.

3.2.2. GENERALIZING TO DIFFERENT ARCHITECTURES

As we describe in Section 2.3.2, we can also apply our test to model architectures that do not contain GLU MLP submodels. For example, the GPT-2 architecture uses a standard 2-layer MLP rather than a GLU MLP. We apply our test (Algorithm 5) to GPT2 PMC and gpt2, where the former is a finetune of the latter (Radford et al., 2019). We use 30k training steps with an isotropic Gaussian input distribution to learn the GLU MLP parameters with which we replace the original MLP submodels in each model. The test yields a value of 3.034e-61, thus identifying the two models as dependent. We show additional results on independent and non-independent models (of Llama and GPT architectures) in Appendix F.4.

3.3. Fine-grained forensics and Localized testing

Finally, we use $\phi_{\text{MATCH}}$ on model pairs with different dimensions, specifically on pruned model pairs, where model dimensions are reduced by preserving only select weights.
In particular, we were able to identify the specific Transformer blocks of Llama-3.1-8B whose weights were likely used in initializing Llama-3.2-3B and Llama-3.2-1B, as Meta reported that the latter two models were pruned from the former (Meta AI, 2024). We match block $i$ of $\theta_1$ with block $j$ of $\theta_2$ whenever $\phi^{(i,j)}_{\text{MATCH}}$ is less than 1e-4. We report the matched layers between the Llama-3.1 and Llama-3.2 models in Figure 3 and in Appendix F.3.

Figure 3. We evaluate $\phi^{(i,j)}_{\text{MATCH}}$ between all pairs of GLU MLPs of Llama 3.1-8B and Llama 3.2-3B. Arrows indicate pairs with $\phi^{(i,j)}_{\text{MATCH}} <$ 1e-4 and suggest which Transformer blocks of Llama 3.1-8B were kept in the pruning process to initialize Llama 3.2-3B.

We also identify which hidden units were most likely shared between the blocks when the MLP dimension is reduced (from 14336 to 8192) during pruning, using the permutation $\pi$ returned by the up-projection matching, MATCH$(H^{(\ell)}_{\theta_1,\text{up}}, H^{(\ell)}_{\theta_2,\text{up}})$. We plot the activation matching for Llama-3.1-8B and Llama-3.2-3B in Appendix F.3.

4. Related & Future Work

A related line of work known as model fingerprinting (Xu et al., 2024; Zhang et al., 2025; Jin et al., 2024; Yang & Wu, 2024) plants a secret signal in the weights of a model so that anyone who knows the key can detect the fingerprint from query access to the model (or fine-tunes of the model). For example, Xu et al. (2024) propose fingerprinting a model by fine-tuning on a secret random string; fingerprint detection then reduces to prompting a putative fingerprinted model with a prefix of the string. Unlike Xu et al. (2024), we do not intervene on the training process of the models we test; however, we do require access to model weights. Finally, a separate line of work on text watermarking aims to attribute model-generated text by planting a watermark when sampling text from the model (Christ et al., 2024; Kirchenbauer et al., 2023; Kuditipudi et al., 2024; Aaronson & Kirchner, 2023).
Because it intervenes on sampling, text watermarking is inapplicable to open-weight models, which are the focus of both model fingerprinting and our setting. Recent work demonstrates that models can directly learn to generate watermarked text but also finds that the learned watermark is not robust to further fine-tuning (Gu et al., 2024). Future work can consider whether it is possible to differentiate between fine-tunes of the same base model in order to reconstruct a complete family tree of model lineage (e.g., to infer that Llemma is a direct fine-tune of Code Llama) (Yax et al., 2025); whether robustness against adversarial attacks is achievable with exact guarantees also warrants further exploration.

Acknowledgments

We gratefully acknowledge the support of this work by an NSF Frontier Award (NSF Grant no. 1805310) and Omidyar. Sally Zhu was supported by a Stanford CURIS Fellowship. Ahmed Ahmed is grateful to be supported by an NSF Graduate Research Fellowship and a Knight-Hennessy Fellowship.

Impact statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, some of which we discuss in the introduction and none of which we feel must be specifically highlighted here.

References

Aaronson, S. and Kirchner, H. Watermarking GPT Outputs, 2023.

Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S. M., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An Open Language Model for Mathematics. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=4WnqRR915j.

Christ, M., Gunn, S., and Zamir, O. Undetectable Watermarks for Language Models. In Agrawal, S. and Roth, A. (eds.), Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pp. 1125-1139. PMLR, 30 Jun-03 Jul 2024. URL https://proceedings.mlr.press/v247/christ24a.html.
Deep Seek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Bao, H., Xu, H., Wang, H., Zhang, H., Ding, H., Xin, H., Gao, H., Li, H., Qu, H., Cai, J. L., Liang, J., Guo, J., Ni, J., Li, J., Wang, J., Chen, J., Chen, J., Yuan, J., Qiu, J., Li, J., Song, J., Dong, K., Hu, K., Gao, K., Guan, K., Huang, K., Yu, K., Wang, L., Zhang, L., Xu, L., Xia, L., Zhao, L., Wang, L., Zhang, L., Li, M., Wang, M., Zhang, M., Zhang, M., Tang, M., Li, M., Tian, N., Huang, P., Wang, P., Zhang, P., Wang, Q., Zhu, Q., Chen, Q., Du, Q., Chen, R. J., Jin, R. L., Ge, R., Zhang, R., Pan, R., Wang, R., Xu, R., Zhang, R., Chen, R., Li, S. S., Lu, S., Zhou, S., Chen, S., Wu, S., Ye, S., Ye, S., Ma, S., Wang, S., Zhou, S., Yu, S., Zhou, S., Pan, S., Wang, T., Yun, T., Pei, T., Sun, T., Xiao, W. L., Zeng, W., Zhao, W., An, W., Liu, W., Liang, W., Gao, W., Yu, W., Zhang, W., Li, X. Q., Jin, X., Wang, X., Bi, X., Liu, X., Wang, X., Shen, X., Chen, X., Zhang, X., Chen, X., Nie, X., Sun, X., Wang, X., Cheng, X., Liu, X., Xie, X., Liu, X., Yu, X., Song, X., Shan, X., Zhou, X., Yang, X., Li, X., Su, X., Lin, X., Li, Y. K., Wang, Y. Q., Wei, Y. X., Zhu, Y. X., Zhang, Y., Xu, Y., Xu, Y., Huang, Y., Li, Y., Zhao, Y., Sun, Y., Li, Y., Wang, Y., Yu, Y., Zheng, Y., Zhang, Y., Shi, Y., Xiong, Y., He, Y., Tang, Y., Piao, Y., Wang, Y., Tan, Y., Ma, Y., Liu, Y., Guo, Y., Wu, Y., Ou, Y., Zhu, Y., Wang, Y., Gong, Y., Zou, Y., He, Y., Zha, Y., Xiong, Y., Ma, Y., Yan, Y., Luo, Y., You, Y., Liu, Y., Zhou, Y., Wu, Z. F., Ren, Z. Z., Ren, Z., Sha, Z., Fu, Z., Xu, Z., Huang, Z., Zhang, Z., Xie, Z., Zhang, Z., Hao, Z., Gou, Z., Ma, Z., Yan, Z., Shao, Z., Xu, Z., Wu, Z., Zhang, Z., Li, Z., Gu, Z., Zhu, Z., Liu, Z., Li, Z., Xie, Z., Song, Z., Gao, Z., and Pan, Z. Deep Seek-V3 Technical Report, 2024. URL https://arxiv.org/abs/2412.19437. 
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., Goyal, A., Hartshorn, A., Yang, A., Mitra, A., Sravankumar, A., Korenev, A., Hinsvark, A., Rao, A., Zhang, A., Rodriguez, A., Gregerson, A., Spataru, A., Roziere, B., Biron, B., Tang, B., Chern, B., Caucheteux, C., Nayak, C., Bi, C., Marra, C., Mc Connell, C., Keller, C., Touret, C., Wu, C., Wong, C., Ferrer, C. C., Nikolaidis, C., Allonsius, D., Song, D., Pintz, D., Livshits, D., Esiobu, D., Choudhary, D., Mahajan, D., Garcia-Olano, D., Perino, D., Hupkes, D., Lakomkin, E., Al Badawy, E., Lobanova, E., Dinan, E., Smith, E. M., Radenovic, F., Zhang, F., Synnaeve, G., Lee, G., Anderson, G. L., Nail, G., Mialon, G., Pang, G., Cucurell, G., Nguyen, H., Korevaar, H., Xu, H., Touvron, H., Zarov, I., Ibarra, I. A., Kloumann, I., Misra, I., Evtimov, I., Copet, J., Lee, J., Geffert, J., Vranes, J., Park, J., Mahadeokar, J., Shah, J., van der Linde, J., Billock, J., Hong, J., Lee, J., Fu, J., Chi, J., Huang, J., Liu, J., Wang, J., Yu, J., Bitton, J., Spisak, J., Park, J., Rocca, J., Johnstun, J., Saxe, J., Jia, J., Alwala, K. V., Upasani, K., Plawiak, K., Li, K., Heafield, K., Stone, K., El-Arini, K., Iyer, K., Malik, K., Chiu, K., Bhalla, K., Rantala-Yeary, L., van der Maaten, L., Chen, L., Tan, L., Jenkins, L., Martin, L., Madaan, L., Malo, L., Blecher, L., Landzaat, L., de Oliveira, L., Muzzi, M., Pasupuleti, M., Singh, M., Paluri, M., Kardas, M., Oldham, M., Rita, M., Pavlova, M., Kambadur, M., Lewis, M., Si, M., Singh, M. K., Hassan, M., Goyal, N., Torabi, N., Bashlykov, N., Bogoychev, N., Chatterji, N., Duchenne, O., C elebi, O., Alrassy, P., Zhang, P., Li, P., Vasic, P., Weng, P., Bhargava, P., Dubal, P., Krishnan, P., Koura, P. S., Xu, P., He, Q., Dong, Q., Srinivasan, R., Ganapathy, R., Calderer, R., Cabral, R. 
S., Stojnic, R., Raileanu, R., Girdhar, R., Patel, R., Sauvestre, R., Polidoro, R., Sumbaly, R., Taylor, R., Silva, R., Hou, R., Wang, R., Hosseini, S., Chennabasappa, S., Singh, S., Bell, S., Kim, S. S., Edunov, S., Nie, S., Narang, S., Raparthy, S., Shen, S., Wan, S., Bhosale, S., Zhang, S., Vandenhende, S., Batra, S., Whitman, S., Sootla, S., Collot, S., Gururangan, S., Borodinsky, S., Herman, T., Fowler, T., Sheasha, T., Georgiou, T., Scialom, T., Speckbacher, T., Mihaylov, T., Xiao, T., Karn, U., Goswami, V., Gupta, V., Ramanathan, V., Kerkez, V., Gonguet, V., Do, V., Vogeti, V., Petrovic, V., Chu, W., Xiong, W., Fu, W., Meers, W., Martinet, X., Wang, X., Tan, X. E., Xie, X., Jia, X., Wang, X., Goldschlag, Y., Gaur, Y., Babaei, Y., Wen, Y., Song, Y., Zhang, Y., Li, Y., Mao, Y., Coudert, Z. D., Yan, Z., Chen, Z., Papakipos, Z., Singh, A., Grattafiori, A., Jain, A., Kelsey, A., Shajnfeld, A., Gangidi, A., Victoria, A., Goldstand, A., Menon, A., Sharma, A., Boesenberg, A., Vaughan, A., Baevski, A., Feinstein, A., Kallet, A., Sangani, A., Yunus, A., Lupu, A., Alvarado, A., Caples, A., Gu, A., Ho, A., Poulton, A., Ryan, A., Ramchandani, A., Franco, A., Saraf, A., Chowdhury, A., Gabriel, A., Bharambe, A., Eisenman, A., Yazdan, A., James, B., Maurer, B., Leonhardi, B., Huang, B., Loyd, B., Paola, B.
D., Paranjape, B., Liu, B., Wu, B., Ni, B., Hancock, B., Wasti, B., Spence, B., Stojkovic, B., Gamido, B., Montalvo, B., Parker, C., Burton, C., Mejia, C., Wang, C., Kim, C., Zhou, C., Hu, C., Chu, C.-H., Cai, C., Tindal, C., Feichtenhofer, C., Civin, D., Beaty, D., Kreymer, D., Li, D., Wyatt, D., Adkins, D., Xu, D., Testuggine, D., David, D., Parikh, D., Liskovich, D., Foss, D., Wang, D., Le, D., Holland, D., Dowling, E., Jamil, E., Montgomery, E., Presani, E., Hahn, E., Wood, E., Brinkman, E., Arcaute, E., Dunbar, E., Smothers, E., Sun, F., Kreuk, F., Tian, F., Ozgenel, F., Caggioni, F., Guzm an, F., Kanayet, F., Seide, F., Florez, G. M., Schwarz, G., Badeer, G., Swee, G., Halpern, G., Thattai, G., Herman, G., Sizov, G., Guangyi, Zhang, Lakshminarayanan, G., Shojanazeri, H., Zou, H., Wang, H., Zha, H., Habeeb, H., Rudolph, H., Suk, H., Aspegren, H., Goldman, H., Molybog, I., Tufanov, I., Veliche, I.-E., Gat, I., Weissman, J., Geboski, J., Kohli, J., Asher, J., Gaya, J.-B., Marcus, J., Tang, J., Chan, J., Zhen, J., Reizenstein, J., Teboul, J., Zhong, J., Jin, J., Yang, J., Cummings, J., Carvill, J., Shepard, J., Mc Phie, J., Torres, J., Ginsburg, J., Wang, J., Wu, K., U, K. H., Saxena, K., Prasad, K., Khandelwal, K., Zand, K., Matosich, K., Veeraraghavan, K., Michelena, K., Li, K., Huang, K., Chawla, K., Lakhotia, K., Huang, K., Chen, L., Garg, L., A, L., Silva, L., Bell, L., Zhang, L., Guo, L., Yu, L., Moshkovich, L., Wehrstedt, L., Khabsa, M., Avalani, M., Bhatt, M., Tsimpoukelli, M., Mankus, M., Hasson, M., Lennie, M., Reso, M., Groshev, M., Naumov, M., Lathi, M., Keneally, M., Seltzer, M. L., Valko, M., Restrepo, M., Patel, M., Vyatskov, M., Samvelyan, M., Clark, M., Macey, M., Wang, M., Hermoso, M. J., Metanat, M., Rastegari, M., Bansal, M., Santhanam, N., Parks, N., White, N., Bawa, N., Singhal, N., Egebo, N., Usunier, N., Laptev, N. 
P., Dong, N., Zhang, N., Cheng, N., Chernoguz, O., Hart, O., Salpekar, O., Kalinli, O., Kent, P., Parekh, P., Saab, P., Balaji, P., Rittner, P., Bontrager, P., Roux, P., Dollar, P., Zvyagina, P., Ratanchandani, P., Yuvraj, P., Liang, Q., Alao, R., Rodriguez, R., Ayub, R., Murthy, R., Nayani, R., Mitra, R., Li, R., Hogan, R., Battey, R., Wang, R., Maheswari, R., Howes, R., Rinott, R., Bondu, S. J., Datta, S., Chugh, S., Hunt, S., Dhillon, S., Sidorov, S., Pan, S., Verma, S., Yamamoto, S., Ramaswamy, S., Lindsay, S., Lindsay, S., Feng, S., Lin, S., Zha, S. C., Shankar, S., Zhang, S., Zhang, S., Wang, S., Agarwal, S., Sajuyigbe, S., Chintala, S., Max, S., Chen, S., Kehoe, S., Satterfield, S., Govindaprasad, S., Gupta, S., Cho, S., Virk, S., Subramanian, S., Choudhury, S., Goldman, S., Remez, T., Glaser, T., Best, T., Kohler, T., Robinson, T., Li, T., Zhang, T., Matthews, T., Chou, T., Shaked, T., Vontimitta, V., Ajayi, V., Montanez, V., Mohan, V., Kumar, V. S., Mangla, V., Ionescu, V., Poenaru, V., Mihailescu, V. T., Ivanov, V., Li, W., Wang, W., Jiang, W., Bouaziz, W., Constable, W., Tang, X., Wang, X., Wu, X., Wang, X., Xia, X., Wu, X., Gao, X., Chen, Y., Hu, Y., Jia, Y., Qi, Y., Li, Y., Zhang, Y., Zhang, Y., Adi, Y., Nam, Y., Yu, Wang, Hao, Y., Qian, Y., He, Y., Rait, Z., De Vito, Z., Rosnbrick, Z., Wen, Z., Yang, Z., and Zhao, Z. The Llama 3 Herd of Models, 2024. URL https://arxiv.org/abs/2407.21783. Groeneveld, D., Beltagy, I., Walsh, E., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N., and Hajishirzi, H. 
OLMo: Accelerating the science of language models. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15789-15809, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.841. URL https://aclanthology.org/2024.acl-long.841/.

Gu, C., Li, X. L., Liang, P., and Hashimoto, T. On the Learnability of Watermarks for Language Models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=9k0krNzvlV.

Jin, H., Zhang, C., Shi, S., Lou, W., and Hou, Y. T. ProFLingo: A Fingerprinting-based Intellectual Property Protection Scheme for Large Language Models. In 2024 IEEE Conference on Communications and Network Security (CNS), pp. 1-9, 2024. doi: 10.1109/CNS62487.2024.10735575.

Kapoor, S., Bommasani, R., Klyman, K., Longpre, S., Ramaswami, A., Cihon, P., Hopkins, A., Bankston, K., Biderman, S., Bogen, M., Chowdhury, R., Engler, A., Henderson, P., Jernite, Y., Lazar, S., Maffulli, S., Nelson, A., Pineau, J., Skowron, A., Song, D., Storchan, V., Zhang, D., Ho, D. E., Liang, P., and Narayanan, A. Position: On the Societal Impact of Open Foundation Models. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org, 2025.

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. A Watermark for Large Language Models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 17061-17084. PMLR, 23-29 Jul 2023. URL https://proceedings.mlr.press/v202/kirchenbauer23a.html.

Kuditipudi, R., Thickstun, J., Hashimoto, T., and Liang, P. Robust Distortion-free Watermarks for Language Models.
Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=FpaCL1MO2C.

Li, G., Hammoud, H., Itani, H., Khizbullin, D., and Ghanem, B. CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 51991-52008. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/a3621ee907def47c1b952ade25c67698-Paper-Conference.pdf.

Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theor., 37(1):145-151, September 2006. ISSN 0018-9448. doi: 10.1109/18.61115. URL https://doi.org/10.1109/18.61115.

Liu, Z., Qiao, A., Neiswanger, W., Wang, H., Tan, B., Tao, T., Li, J., Wang, Y., Sun, S., Pangarkar, O., Fan, R., Gu, Y., Miller, V., Zhuang, Y., He, G., Li, H., Koto, F., Tang, L., Ranjan, N., Shen, Z., Iriondo, R., Mu, C., Hu, Z., Schulze, M., Nakov, P., Baldwin, T., and Xing, E. P. LLM360: Towards Fully Transparent Open-Source LLMs. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=QdWhj0QZFw.

Mensch, A. Mistral CEO confirms Miqu model leak, August 2024. URL https://x.com/arthurmensch/status/1752737462663684344. Accessed: 2024-08-15.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer Sentinel Mixture Models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe.

Meta AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models, 2024. URL https://ai.meta.com/blog.

Mosteller, F. and Fisher, R. A. Questions and Answers. The American Statistician, 2(5):30-31, 1948. ISSN 00031305. URL http://www.jstor.org/stable/2681650.

Peng, S., Chen, Y., Xu, J., et al. Intellectual Property Protection of DNN Models. World Wide Web, 26:1877-1911, July 2023. doi: 10.1007/s11280-022-01113-3.
URL https://doi.org/10.1007/s11280-022-01113-3.

Poli, M., Wang, J., Massaroli, S., Quesnelle, J., Carlow, R., Nguyen, E., and Thomas, A. StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models, 12 2023. URL https://github.com/togethercomputer/stripedhyena.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.

Ramshaw, L. and Tarjan, R. E. On Minimum Cost Assignments in Unbalanced Bipartite Graphs. 2012. URL https://api.semanticscholar.org/CorpusID:6964149.

Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., Bogin, B., Chandu, K., Dumas, J., Elazar, Y., Hofmann, V., Jha, A., Kumar, S., Lucy, L., Lyu, X., Lambert, N., Magnusson, I., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Ravichander, A., Richardson, K., Shen, Z., Strubell, E., Subramani, N., Tafjord, O., Walsh, E., Zettlemoyer, L., Smith, N., Hajishirzi, H., Beltagy, I., Groeneveld, D., Dodge, J., and Lo, K. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15725-15788, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.840. URL https://aclanthology.org/2024.acl-long.840/.

Sudalairaj, S., Bhandwaldar, A., Pareja, A., Xu, K., Cox, D. D., and Srivastava, A. LAB: Large-Scale Alignment for ChatBots, 2024. URL https://arxiv.org/abs/2403.01081.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.
C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. URL https://arxiv.org/abs/2307.09288.

Xu, J., Wang, F., Ma, M., Koh, P. W., Xiao, C., and Chen, M. Instructional Fingerprinting of Large Language Models. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3277-3306, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.180. URL https://aclanthology.org/2024.naacl-long.180/.

Yang, Z. and Wu, H. A Fingerprint for Large Language Models, 2024. URL https://arxiv.org/abs/2407.01235.

Yax, N., Oudeyer, P.-Y., and Palminteri, S. PhyloLM: Inferring the Phylogeny of Large Language Models and Predicting their Performances in Benchmarks. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=rTQNGQxm4K.

Zeng, B., Wang, L., Hu, Y., Xu, Y., Zhou, C., Wang, X., Yu, Y., and Lin, Z. HuRef: HUman-REadable Fingerprint for Large Language Models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=RlZgnEZsOH.
Zhang, J., Liu, D., Qian, C., Zhang, L., Liu, Y., Qiao, Y., and Shao, J. REEF: Representation Encoding Fingerprints for Large Language Models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=SnDmPkOJ0T.

A. Proofs of Main Theorems

Proof of Theorem 1. From our assumptions on $A$ and $P$ and the fact that $\{\pi_t\}_{t=1}^{T}$ are independently drawn, it follows that the collection $\{\pi_t(\theta_1)\}_{t=1}^{T}$ comprises $T$ exchangeable copies of $\theta_1$. The independence of $\theta_1$ and $\theta_2$ thus implies that $\{(\pi_t(\theta_1), \theta_2)\}_{t=1}^{T}$ comprises $T$ exchangeable copies of $(\theta_1, \theta_2)$, and so the claim follows by symmetry: $\phi(\theta_1, \theta_2)$ is identically distributed to each of $\{\phi(\pi_t(\theta_1), \theta_2)\}_{t=1}^{T}$, so $\phi(\theta_1, \theta_2)$ has uniform rank among the other values. Ties ($\phi_t = \phi(\theta_1, \theta_2)$) contribute randomly to $\hat{p}$, so symmetry still holds and the p-values are uniformly distributed under the null.

Proof of Theorem 2. As $M$ is a $\Pi$-equivariant map, if $\theta_1 \perp \theta_2$ then letting $\pi = \text{LAP}(C)$ in MATCH is equivalent in distribution to sampling $\pi \sim \text{Unif}(\Pi)$. Then the output of MATCH is identical in distribution for any pair of independent models, and can be converted to a p-value using SPEARMAN and the distribution of the Spearman correlation coefficient (a t-distribution with $h - 2$ degrees of freedom).

Proof of Theorem 3. Let $\theta_1' = A(\pi_1^{(i)} \circ \pi_2^{(j)}(\theta_1^0))$ for $\pi_1, \pi_2 \overset{\text{i.i.d.}}{\sim} \text{Unif}(\Pi)$. Then $\theta_1'$ is an independent copy of $\theta_1$, since taking the composition $\pi_1^{(i)} \circ \pi_2^{(j)}(\theta_1)$ yields an independent copy of $\theta_1$ for any $\pi_1, \pi_2 \in \Pi$. From $\theta_1 \perp \theta_2$, it follows for $\ell \in \{i, j\}$ that MATCH$(M^{(\ell)}(\theta_1'), M^{(\ell)}(\theta_2))$ is identically distributed to MATCH$(M^{(\ell)}(\theta_1), M^{(\ell)}(\theta_2))$. The result then follows from the fact that MATCH is equivariant with respect to permuting the rows of its arguments: in particular, for any $\pi \in \Pi$ we have MATCH$(\pi W_1, W_2) = \pi\,$MATCH$(W_1, W_2)$.

B. Randomized Learning Algorithms

One notable (non-contrived) category of deep learning algorithms that are not permutation equivariant consists of those that apply random dropout masks to hidden units during training. In particular, once we fix a specific setting of mask values to specify a deterministic learning algorithm, this algorithm will not be permutation equivariant unless the individual dropout masks are all permutation invariant (which is highly unlikely). We provide a generalized statement of Theorem 1 for randomized algorithms.

Definition 5. Let $\Pi \subset \Theta \to \Theta$. Let $\pi \in \Pi$ and $\theta^0 \in \Theta$, with $\tilde{\theta} \sim A(\theta^0)$, $\theta = \pi(\tilde{\theta})$ and $\theta' \sim A(\pi(\theta^0))$. A randomized learning algorithm $A : \Theta \to \mathcal{P}(\Theta)$ is $\Pi$-equivariant if and only if $\theta \overset{d}{=} \theta'$.

Theorem 4. Let $\phi : \Theta \times \Theta \to \mathbb{R}$ be a test statistic and $\Pi \subset \Theta \to \Theta$ be finite. Let $A : \Theta \to \mathcal{P}(\Theta)$ be $\Pi$-equivariant and let $P \in \mathcal{P}(\Theta)$ be $\Pi$-invariant. Let $\theta_1, \theta_2 \in \Theta$ be independent random variables, with $\theta_1 \sim A(\theta_1^0)$ for $\theta_1^0 \sim P$. Then $\hat{p} = \text{PERMTEST}(\theta_1, \theta_2)$ is uniformly distributed on $\{\tfrac{i}{T+1}\}_{i=1}^{T}$.

Proof. The proof is identical to that of Theorem 1.

C. Transformer Architecture and Notation

We consider models with the Llama Transformer architecture and define the notation henceforth, although this can easily be extended to other Transformer architectures. Following the definition of $f_{\text{mlp}}$ in Example 1, we can define an abstraction of the full Llama language model architecture consisting of $L$ Transformer blocks sandwiched between an input and an output layer. In the sequel, we abuse notation by applying $f_{\text{mlp}}$ to multi-dimensional tensors, broadcasting along the last axis. We use $d, n \in \mathbb{N}$ to respectively denote the model dimension and sequence length, where $\Theta_{\text{LM}} = \Theta_{\text{in}} \times \Theta_{\text{block}}^{L} \times \Theta_{\text{out}}$, with $\Theta_{\text{block}}$ denoting the parameter space of each Transformer block and $\Theta_{\text{in}}, \Theta_{\text{out}}$ denoting the parameter spaces of the input and output layers. We decompose $\Theta_{\text{block}} = \Theta_{\text{attn}} \times \Theta_{\text{mlp}}$ and use $f_{\text{rest}} : \Theta_{\text{attn}} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ to denote all remaining parts of the Transformer besides the MLP.
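The inputs to $f_{\text{rest}}$ are the input and output of the MLP, and the output of $f_{\text{rest}}$ is fed directly to the MLP of the next layer. In particular, $f_{\text{rest}}$ takes the input and output of the MLP of layer $i$, first performs the residual connection following the MLP of layer $i$, then applies the self-attention and normalization components of layer $i + 1$, and returns the input to the MLP of layer $i + 1$. We use $f_{\text{in}} : \Theta_{\text{in}} \times \mathcal{X} \to \mathbb{R}^{n \times d}$ and $f_{\text{out}} : \Theta^{(L)}_{\text{block}} \times \mathbb{R}^{n \times d} \to \mathcal{Y}$ to respectively denote the input and output layers, i.e., the elements before the first MLP and after the last MLP. Putting everything together gives the following definition of the model; we introduce the notation $X^{(i)}_{\theta}$ in the definition as a matter of convenience to track intermediate activations.

Definition 6. (GLU Transformer model) Let $\theta = (\theta_{\text{in}}, \{\theta^{(i)}_{\text{block}}\}_{i=1}^{L}, \theta_{\text{out}}) \in \Theta_{\text{LM}}$ and $X \in \mathcal{X}$, with $\theta^{(i)}_{\text{block}} = (\theta^{(i)}_{\text{attn}}, \theta^{(i)}_{\text{mlp}})$. Then $f_{\text{LM}}(X; \theta) = f_{\text{out}}(X^{(L)}_{\theta}; \theta_{\text{out}})$ for $X^{(0)}_{\theta} = f_{\text{in}}(X; \theta_{\text{in}})$ and

$$X^{(i)}_{\theta} = f_{\text{rest}}\big(X^{(i-1)}_{\theta}, f_{\text{mlp}}(X^{(i-1)}_{\theta})\big). \tag{6}$$

For a Llama model, Table 2 describes the shapes of the model weight matrices for $i = 1, \ldots, L$, in terms of $V$ (vocab size), $d_{\text{emb}}$ (the hidden dimension), and $d_{\text{mlp}}$ (MLP hidden dimension). Following Definition 6, we have $\theta_{\text{in}} = (E)$, $\theta^{(i)}_{\text{block}} = (\theta^{(i)}_{\text{attn}}, \theta^{(i)}_{\text{mlp}})$ where $\theta^{(i)}_{\text{attn}} = (\gamma_{\text{input},i}, W_{Q,i}, W_{K,i}, W_{V,i}, W_{O,i}, \gamma_{\text{post-attn},i})$, $\theta^{(i)}_{\text{mlp}} = (G_i, U_i, D_i)$, and $\theta_{\text{out}} = (\gamma_{\text{final}}, O)$.

| Parameter name | Notation |
| --- | --- |
| embedding | $E \in \mathbb{R}^{V \times d_{\text{emb}}}$ |
| input layernorm | $\gamma_{\text{input},i} \in \mathbb{R}^{1 \times d_{\text{emb}}}$ |
| attention query matrix | $W_{Q,i} \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{emb}}}$ |
| attention key matrix | $W_{K,i} \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{emb}}}$ |
| attention value matrix | $W_{V,i} \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{emb}}}$ |
| attention output matrix | $W_{O,i} \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{emb}}}$ |
| post-attention layernorm | $\gamma_{\text{post-attn},i} \in \mathbb{R}^{1 \times d_{\text{emb}}}$ |
| MLP gate projection | $G_i \in \mathbb{R}^{d_{\text{mlp}} \times d_{\text{emb}}}$ |
| MLP up projection | $U_i \in \mathbb{R}^{d_{\text{mlp}} \times d_{\text{emb}}}$ |
| MLP down projection | $D_i \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{mlp}}}$ |
| final layernorm | $\gamma_{\text{final}} \in \mathbb{R}^{1 \times d_{\text{emb}}}$ |
| linear output | $O \in \mathbb{R}^{d_{\text{emb}} \times V}$ |

Table 2. Llama model architecture and dimensions.

We now describe a forward pass of the model. We define the softmax function on a vector $v = (v_1, \ldots, v_n)$ as

$$\text{softmax}(v)_i = \frac{e^{v_i}}{\sum_{k=1}^{n} e^{v_k}}.$$

On batched input $X \in \mathbb{R}^{N \times n \times m}$, where each $X^{(b)} = [w_1 | \ldots | w_m] \in \mathbb{R}^{n \times m}$ has column vectors $w_i$, we define the softmax as

$$\text{softmax}(X^{(b)}) = [\text{softmax}(w_1) | \ldots | \text{softmax}(w_m)], \qquad \text{softmax}(X) = [\text{softmax}(X^{(1)}) | \ldots | \text{softmax}(X^{(N)})].$$

For a forward pass of the model $f_{\text{LM}}(X; \theta)$, consider an input sequence of tokens $X \in \{0, 1\}^{N \times V}$ given as one-hot vectors, where $N$ is the sequence length (denoted $n$ above). We then feed the input through:

1. ($f_{\text{in}}$) Embedding layer: $X^{(0)}_{\theta} = f_{\text{in}}(X; \theta_{\text{in}}) = XE \in \mathbb{R}^{N \times d_{\text{emb}}}$.

2. ($f_{\text{attn}}, f_{\text{mlp}}, f_{\text{post}}$) For each Transformer block $i = 0, 1, \ldots, L$, through $f_{\text{attn}}$, $f_{\text{mlp}}$, and $f_{\text{post}}$:

(a) Input layernorm: $X^{(i)}_{\text{LN1}} = \dfrac{X^{(i)}_{\theta}}{\sqrt{\text{Var}(X^{(i)}_{\theta}) + \varepsilon}} \odot \gamma_{\text{input},i}$ (with the variance taken over the last axis) for some offset $\varepsilon$ (typically 1e-6).

(b) Causal multi-head self-attention: Split $X^{(i)}_{\text{LN1}}$ on the first axis into $n_{\text{heads}}$ heads $X^{(i)}_{\text{LN1},1}, \ldots, X^{(i)}_{\text{LN1},n_{\text{heads}}}$. On each head $X^{(i)}_{\text{LN1},j}$, compute

$$X^{(i)}_{\text{SA},j} = \text{self-attn}(X^{(i)}_{\text{LN1},j}) = \text{softmax}\!\left(\frac{X^{(i)}_{\text{LN1},j} W_{Q,i}^{T} \big(X^{(i)}_{\text{LN1},j} W_{K,i}^{T}\big)^{T}}{d_{\text{emb}}}\right) X^{(i)}_{\text{LN1},j} W_{V,i}^{T} W_{O,i}^{T}$$

and concatenate the $X^{(i)}_{\text{SA},j}$ along the first axis again as $X^{(i)}_{\text{SA}}$.

(c) Dropout and residual connection: $X^{(i)}_{\text{DR1}} = X^{(i)}_{\text{LN1}} + \text{Dropout}(X^{(i)}_{\text{SA}})$.

(d) Post-attention layernorm: $X^{(i)}_{\text{LN2}} = \dfrac{X^{(i)}_{\text{DR1}}}{\sqrt{\text{Var}(X^{(i)}_{\text{DR1}}) + \varepsilon}} \odot \gamma_{\text{post-attn},i}$ (with the variance taken over the last axis) for some offset $\varepsilon$. Then we have $f_{\text{attn}}(X^{(i-1)}_{\theta}; \theta^{(i)}_{\text{attn}}) = X^{(i)}_{\text{LN2}}$.

(e) Next, we feed through $f_{\text{mlp}}$, the multi-layer perceptron: $f_{\text{mlp}}(X^{(i)}_{\text{LN2}}; \theta^{(i)}_{\text{mlp}}) = X^{(i)}_{\text{MLP}} = \big[\sigma(X^{(i)}_{\text{LN2}} G_i^{T}) \odot (X^{(i)}_{\text{LN2}} U_i^{T})\big] D_i^{T}$ for some activation $\sigma$ (e.g., SiLU).

(f) Finally, we feed through $f_{\text{post}}$, dropout and the residual connection: $f_{\text{post}}(\theta^{(i)}_{\text{mlp}}) = X^{(i+1)}_{\theta} = X^{(i)}_{\text{DR1}} + \text{Dropout}(X^{(i)}_{\text{MLP}})$.

3. ($f_{\text{out}}$) Final layernorm on the output $X^{(L)}_{\theta}$ of the final Transformer block: $X^{(L)}_{\text{LN}} = \dfrac{X^{(L)}_{\theta}}{\sqrt{\text{Var}(X^{(L)}_{\theta}) + \varepsilon}} \odot \gamma_{\text{final}}$ (with the variance taken over the last axis) for some offset $\varepsilon$. Then, a linear output embedding and softmax map to output probabilities: $f_{\text{out}}(X^{(L)}_{\theta}) = \text{softmax}(X^{(L)}_{\text{LN}} O^{T})$, which defines the entire forward pass $f_{\text{LM}}(X; \theta)$.

D. Model Transformation Class

We describe two sets of equivariant transformations $\Pi$ on a Transformer model as described in Appendix C. Abusing notation, the first set, $\Pi_{\text{emb}}$, consists of elements $\pi_{\text{emb}}$ where $\pi_{\text{emb}} \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{emb}}}$ is a permutation matrix. The second set, $\Pi_{\text{mlp}}$, consists of elements $\pi_{\text{mlp}}$ where $\pi_{\text{mlp}} \in \mathbb{R}^{d_{\text{mlp}} \times d_{\text{mlp}}}$ is a permutation matrix.

1. $\pi_{\text{emb}}(\theta)$: Applying an embedding permutation $\pi_{\text{emb}} \in \mathbb{R}^{d_{\text{emb}} \times d_{\text{emb}}}$ by left or right multiplying all relevant matrices by $\pi_{\text{emb}}$ (permuting rows or columns).

2. $\pi_{\text{mlp}}(\theta)$: Applying MLP permutations $\pi_{\text{mlp},i} \in \mathbb{R}^{d_{\text{mlp}} \times d_{\text{mlp}}}$ to MLP layers.

These permutations are applied such that the outputs of the original model $\theta$ and the permuted model $\pi(\theta)$ remain aligned. We describe the details in Table 3.

| Parameter name | $\theta$ | $\pi_{\text{emb}}(\theta)$ | $\pi_{\text{mlp}}(\theta)$ |
| --- | --- | --- | --- |
| embedding | $E$ | $E \pi_{\text{emb}}$ | $E$ |
| input layernorm | $\gamma_{\text{input},i}$ | $\gamma_{\text{input},i} \pi_{\text{emb}}$ | $\gamma_{\text{input},i}$ |
| attention query matrix | $W_{Q,i}$ | $W_{Q,i} \pi_{\text{emb}}$ | $W_{Q,i}$ |
| attention key matrix | $W_{K,i}$ | $W_{K,i} \pi_{\text{emb}}$ | $W_{K,i}$ |
| attention value matrix | $W_{V,i}$ | $W_{V,i} \pi_{\text{emb}}$ | $W_{V,i}$ |
| attention output matrix | $W_{O,i}$ | $\pi_{\text{emb}}^{T} W_{O,i}$ | $W_{O,i}$ |
| post-attention layernorm | $\gamma_{\text{post-attn},i}$ | $\gamma_{\text{post-attn},i} \pi_{\text{emb}}$ | $\gamma_{\text{post-attn},i}$ |
| MLP gate projection | $G_i$ | $G_i \pi_{\text{emb}}$ | $\pi_{\text{mlp},i} G_i$ |
| MLP up projection | $U_i$ | $U_i \pi_{\text{emb}}$ | $\pi_{\text{mlp},i} U_i$ |
| MLP down projection | $D_i$ | $\pi_{\text{emb}}^{T} D_i$ | $D_i \pi_{\text{mlp},i}^{T}$ |
| final layernorm | $\gamma_{\text{final}}$ | $\gamma_{\text{final}} \pi_{\text{emb}}$ | $\gamma_{\text{final}}$ |
| linear output | $O$ | $\pi_{\text{emb}}^{T} O$ | $O$ |

Table 3. Transformations $\pi_{\text{emb}}$ and $\pi_{\text{mlp}}$ applied to a Llama-architecture model.

E. Additional Constrained Setting Experimental Results

We report p-values from the statistics $\phi_{\ell_2}$, $\phi_{U^{(\ell)}}$, and $\phi_{H^{(\ell)}}$ on all 210 model pairs (from 21 Llama 2-architecture models) in Figures 4, 5, and 6, where the model names are colored by base model (ground truth). For all statistics, the p-values on independent model pairs are uniformly distributed, while they are all significant at 0.01 (smaller for $\phi_{U^{(\ell)}}$ and $\phi_{H^{(\ell)}}$) for fine-tuned model pairs.

Figure 4. Results of p-values from $\phi_{\ell_2}$ on all model pairs.

Figure 5.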
Results of p-values from ϕU(ℓ) on all model pairs, where ε = 2.2e-308.

Figure 6. Results of p-values from ϕH(ℓ) on all model pairs, where ε = 2.2e-308.

| # train tokens | ϕU(ℓ) | ϕH(ℓ) | ϕℓ2 | ϕMATCH | ϕJSD (log) |
|---|---|---|---|---|---|
| 100M | 0.641 | 0.119 | 0.07 | 0.809 | -11.81 |
| 1B | 0.789 | 0.483 | 0.06 | 0.443 | -11.05 |
| 10B | 0.707 | 0.277 | 0.93 | 0.343 | -11.28 |
| 18B | 0.819 | 0.141 | 0.64 | 0.027 | -11.03 |

Table 4. Results for ϕU(ℓ), ϕH(ℓ), and ϕMATCH evaluated on training checkpoints between two independently-trained OLMo models.

| θ2 (with θ1 = Llama-2-70b-hf) | ϕU(ℓ) |
|---|---|
| miqu-1-70b-pytorch | ε |
| Llama-3.1-70B | 0.571 |
| Palmyra-Fin-70B-32K | 0.539 |

Table 5. Results of ϕU(ℓ) (aggregated with FISHER) with θ1 as Llama-2-70b-hf and θ2 ranging over the listed models.

E.1. Identically distributed, Independent models

We further evaluated our tests through an ablation in which we train two models with the same architecture on the same dataset, differing only in their random initialization. Specifically, we check that our tests do not incorrectly flag two similar (trained using the same learning algorithm) but independent (separately randomly initialized) models as non-independent. To verify this, we randomly initialized a model with the OLMo (7B) architecture (Groeneveld et al., 2024) and trained it on the Dolma v1.7 dataset (Soldaini et al., 2024). We trained a second model with an independently chosen initialization and data ordering. We keep checkpoints for both seeds after 100M, 1B, 10B, and 18B training tokens and evaluate the statistics ϕU(ℓ), ϕH(ℓ), and ϕMATCH on the two models at each checkpoint, as reported in Table 4. The p-values are broadly distributed, which confirms that our tests do not reject independence for two similarly trained but independent models.

E.2. Tests for Larger Models

Next, we evaluated our tests on larger models.
We ran ϕU(ℓ) on four 70B-parameter models with the Llama 2-70B architecture, shown in Table 5; in particular, we verify that Miqu-70B is not independent from Llama 2-70B.

F. Additional Unconstrained Setting Experimental Results

We report values of ϕMATCH on all model pairs in Figure 7. The statistic is low ($< \varepsilon = 10^{-308}$) for all non-independent model pairs and uniformly distributed for independent model pairs, so it empirically acts as a p-value.

F.1. StripedHyena Experiments

We report ϕU(ℓ) on specific parameters from StripedHyena-Nous-7B and Mistral-7B-v0.1 in Table 6. Here we evaluate ϕU(ℓ) on parameters other than the MLP up projection matrices, so that we can investigate similarity in those parameters as well. These p-values no longer satisfy the independence requirement of Theorem 2, so we do not aggregate them with FISHER.

| Parameter name | Notation | ϕU(ℓ) |
|---|---|---|
| embedding | $E$ | 1.61e-16 |
| attention query matrix | $W_Q^{(1)}$ | 6.17e-190 |
| attention key matrix | $W_K^{(1)}$ | 1.47e-7 |
| attention value matrix | $W_V^{(1)}$ | 1.56e-114 |
| attention output matrix | $W_O^{(1)}$ | 0.010 |
| MLP gate projection | $G^{(1)}$ | 0.517 |
| MLP up projection | $U^{(1)}$ | 0.716 |
| MLP down projection | $D^{(1)}$ | 6.03e-80 |

Table 6. ϕU(ℓ) on parameters from StripedHyena-Nous-7B and Mistral-7B-v0.1, some with low p-values.

Figure 7. Results of values of ϕMATCH on all model pairs, where ε = 2.2e-308.

| MLP | Loss | $\log_{10}(\phi^{(i)}_{\mathrm{MATCH}})$ | MLP | Loss | $\log_{10}(\phi^{(i)}_{\mathrm{MATCH}})$ | MLP | Loss | $\log_{10}(\phi^{(i)}_{\mathrm{MATCH}})$ |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.0048 | -479 | 12 | 0.0060 | -342 | 23 | 0.0043 | -593 |
| 2 | 0.012 | -485 | 13 | 0.0058 | -330 | 24 | 0.0047 | -542 |
| 3 | 0.0026 | -614 | 14 | 0.0066 | -323 | 25 | 0.0050 | -497 |
| 4 | 0.0034 | -580 | 15 | 0.0063 | -414 | 26 | 0.0051 | -534 |
| 5 | 0.0030 | -523 | 16 | 0.0061 | -394 | 27 | 0.0052 | -482 |
| 6 | 0.0035 | -513 | 17 | 0.0063 | -445 | 28 | 0.0061 | -477 |
| 7 | 0.0041 | -533 | 18 | 0.0055 | -515 | 29 | 0.0065 | -433 |
| 8 | 0.0042 | -464 | 19 | 0.0045 | -571 | 30 | 0.0098 | -361 |
| 9 | 0.0050 | -439 | 20 | 0.0045 | -512 | 31 | 2.313 | -26.4 |
| 10 | 0.0050 | -377 | 21 | 0.0047 | -595 | 32 | 0.0114 | -174 |
| 11 | 0.0060 | -365 | 22 | 0.0043 | -555 | | | |

Table 7.
ϕMATCH on individual blocks between Llama-2-7b-hf and vicuna-7b-v1.5 after retraining MLP layers.

F.2. MLP Retraining Experiments

We retrain each of the 32 MLP layers (gate, up, and down projection matrices) on random inputs fed through the original MLP. We train for 10,000 gradient steps using MSE loss and the Adam optimizer with a learning rate of 0.001 and a batch size of 5000. A sample learning curve is shown in Figure 8.

Figure 8. Learning curve for MLP retraining.

The MLP retraining results for all 32 MLP layers of vicuna-7b-v1.5, compared with Llama-2-7b-hf, are in Table 7, showing that the statistic is robust to retraining of all layers.

F.3. Localized Testing

As described in Section 4.4.2, we can run ϕMATCH on all pairs of Transformer blocks between two models (of different architecture), as long as they share the GLU structure. In addition to the Llama 3 results, we report results of matched blocks on the Sheared-LLaMA and Nvidia-Minitron models, which are both pruned from Llama models.

In particular, we were able to identify the specific Transformer blocks of θ8B = Llama-3.1-8B whose weights were likely used in initializing θ3B = Llama-3.2-3B and θ1B = Llama-3.2-1B, as Meta reported that the Llama-3.2-3B and Llama-3.2-1B models were pruned from Llama-3.1-8B (Meta AI, 2024). We use ϕMATCH on all pairs of MLP blocks, where $(d_{\theta_{8B}}, h_{\theta_{8B}}, N_{\theta_{8B}}) = (4096, 14336, 32)$, $(d_{\theta_{3B}}, h_{\theta_{3B}}, N_{\theta_{3B}}) = (3072, 8192, 28)$, and $(d_{\theta_{1B}}, h_{\theta_{1B}}, N_{\theta_{1B}}) = (2048, 8192, 16)$. We match blocks when the statistic $\phi^{(i,j)}_{\mathrm{MATCH}}$ from block $i$ of model 1 and block $j$ of model 2 is less than 1e-4, as reported in Tables 8 and 9 (and likewise for the other matchings in this section).

i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_{8B}, \theta_{3B}) <$ 1e-4: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

i: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_{8B}, \theta_{3B}) <$ 1e-4: 16 17 18 19 20 21 22 23 24 25 26 27 28

Table 8.
θ8B = Llama-3.1-8B blocks matched with θ3B = Llama-3.2-3B blocks using ϕMATCH.

i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_{8B}, \theta_{1B}) <$ 1e-4: 1 2 3 4 5 6 7 8 9

i: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_{8B}, \theta_{1B}) <$ 1e-4: 10 11 15 16

Table 9. θ8B = Llama-3.1-8B blocks matched with θ1B = Llama-3.2-1B blocks using ϕMATCH.

Next, we have Sheared-LLaMA 2.7B, with 32 Transformer blocks, hidden dimension 2560, and MLP dimension 6912. All 32 blocks align with the 32 blocks of Llama 2 7B, although both the hidden and MLP dimensions have been reduced through pruning.

i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_1, \theta_2) <$ 1e-90: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

i: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_1, \theta_2) <$ 1e-90: 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Table 10. θ1 = Sheared-LLaMA 2.7B blocks matched with θ2 = Llama-2-7B blocks using ϕMATCH.

Next, we have Sheared-LLaMA 1.3B, with 24 Transformer blocks, hidden dimension 2048, and MLP dimension 5504.

i: 1 2 3 4 5 6 7 8 9 10 11 12
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_1, \theta_2) <$ 1e-5: 1 2 3 4 5 6 7 8 10 12 16

i: 13 14 15 16 17 18 19 20 21 22 23 24
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_1, \theta_2) <$ 1e-5: 17 18 19 20 21 22 25 27 28 29 31 32

Table 11. θ1 = Sheared-LLaMA 1.3B blocks matched with θ2 = Llama-2-7B blocks using ϕMATCH.

Finally, we compare Llama 3.1 8B with nvidia/Llama-3.1-Minitron-4B-Depth-Base, a model pruned from 32 to 16 Transformer blocks, and are able to identify the likely shared blocks.

i: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
j with $\phi^{(i,j)}_{\mathrm{MATCH}}(\theta_1, \theta_2) <$ 1e-90: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32

Table 12. θ1 = nvidia/Llama-3.1-Minitron-4B-Depth-Base blocks matched with θ2 = Llama-3.1-8B blocks using ϕMATCH.

F.4. MLP Distillation Experiments

As mentioned in Section 3.2.2, we present further results on experiments where we distill a GLU MLP into models that lack one and then test the efficacy of our approach.
These models do not use GLU MLPs (they use a different feed-forward network), so a GLU MLP is distilled in place of the first FFN using Algorithm 5. Pairs of non-independent models still have very small values of $\phi^{(1)}_{\mathrm{MATCH}}$.

| θ1 | θ2 | Independent? | $\phi^{(1)}_{\mathrm{MATCH}}$ |
|---|---|---|---|
| gpt2 | GPT2 PMC | | 3.034e-61 |
| gpt2 | artgpt2tox | | 1.049e-75 |
| gpt2 | distilgpt2 | | 1.079e-63 |
| Llama-3.2-1B | Llama-3.2-3B | | 2.011e-70 |
| openai-gpt | gpt2 | | 0.359 |
| openai-gpt | distilgpt2 | | 0.770 |
| gpt | Llama-3.2-1B | | 0.481 |

Table 13. $\phi^{(1)}_{\mathrm{MATCH}}$ on models distilled with a GLU MLP.

| Parameter name | $\theta$ | $\mathrm{Rot}(\theta) = \theta'$ |
|---|---|---|
| embedding | $E$ | $E R_{\mathrm{emb}}$ |
| input layernorm | $\gamma_{\mathrm{input},i}$ | $\gamma'_{\mathrm{input},i}$ |
| attention query matrix | $W_{Q,i}$ | $R_i W_{Q,i}\,\mathrm{diag}(\gamma_{\mathrm{input},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\mathrm{input},i})$ |
| attention key matrix | $W_{K,i}$ | $R_i W_{K,i}\,\mathrm{diag}(\gamma_{\mathrm{input},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\mathrm{input},i})$ |
| attention value matrix | $W_{V,i}$ | $W_{V,i}\,\mathrm{diag}(\gamma_{\mathrm{input},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\mathrm{input},i})$ |
| attention output matrix | $W_{O,i}$ | $R_{\mathrm{emb}}^T W_{O,i}$ |
| post-attention layernorm | $\gamma_{\text{post-attn},i}$ | $\gamma'_{\text{post-attn},i}$ |
| MLP gate projection | $G_i$ | $G_i\,\mathrm{diag}(\gamma_{\text{post-attn},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\text{post-attn},i})$ |
| MLP up projection | $U_i$ | $c_i U_i\,\mathrm{diag}(\gamma_{\text{post-attn},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\text{post-attn},i})$ |
| MLP down projection | $D_i$ | $\frac{1}{c_i} R_{\mathrm{emb}}^T D_i$ |
| final layernorm | $\gamma_{\mathrm{final}}$ | $\gamma'_{\mathrm{final}}$ |
| linear output | $O$ | $O\,\mathrm{diag}(\gamma_{\mathrm{final}})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\mathrm{final}})$ |

Table 14. Output-preserving rotation applied to a Llama-architecture model.

G. Output-Preserving Transformations

An adversary could apply a particular rotation scheme by multiplying weight matrices by an orthogonal rotation matrix U that will also preserve outputs. We describe such a transformation, which breaks the invariants proposed by Zeng et al. (2024) by manipulating layernorms. While this list may not be exhaustive, the following six transformations (the first two of which were described previously) camouflage the language model while preserving outputs:

T1. Permuting the rows of the embedding matrix (and subsequent matrices due to residual connections) by a permutation $\xi_{\mathrm{emb}} \in \mathbb{R}^{d_{\mathrm{emb}} \times d_{\mathrm{emb}}}$

T2.
Permuting the MLP matrices (a different permutation for each of the N Transformer blocks) by permutations $\xi_1, \dots, \xi_N \in \mathbb{R}^{d_{\mathrm{mlp}} \times d_{\mathrm{mlp}}}$

T3. Rotating the embedding matrix (and subsequent matrices due to residual connections) by an orthogonal rotation matrix $R_{\mathrm{emb}} \in \mathbb{R}^{d_{\mathrm{emb}} \times d_{\mathrm{emb}}}$

T4. Rotating the query and key attention matrices (a different rotation for each of the N Transformer blocks) by orthogonal rotation matrices $R_1, \dots, R_N \in \mathbb{R}^{d_{\mathrm{emb}} \times d_{\mathrm{emb}}}$

T5. Replacing all layernorms (input, post-attention, final) with vectors in $\mathbb{R}^{1 \times d_{\mathrm{emb}}}$ with non-zero elements

T6. Scaling the MLP matrices by a constant non-zero factor

Consider a model θ of Llama architecture (Appendix C). Consider orthogonal matrices $R_{\mathrm{emb}}, R_1, \dots, R_{32}$ as described, as well as new layernorms $\gamma'_{\mathrm{input},1}, \dots, \gamma'_{\mathrm{input},32}, \gamma'_{\text{post-attn},1}, \dots, \gamma'_{\text{post-attn},32}$ in $\mathbb{R}^{1 \times d_{\mathrm{emb}}}$ with non-zero elements. Finally, consider non-zero constants $c_1, \dots, c_{32}$, which we use to scale the MLP up and down projection matrices. We apply the rotation with these parameters to θ to get a new rotated model, Rot(θ).

We generalize the set of transformations above as applying Rot(θ) to a model θ. We transform all the original matrices of θ as in Table 14 (for $i = 1, \dots, 32$). Note that the transformations T1 and T2 are elements of Πemb and Πmlp, and the remaining transformations T3 to T6 are described in Table 14.

Importantly, T5 is the transformation to which the invariants of Zeng et al. (2024) are not robust; our unconstrained setting test ϕMATCH is robust to all six transformations, which we show in Table 15.

G.1. Breaking HuREF Invariants

Only transformations T3 and T5 are required to break the invariants from Zeng et al. (2024).
Their first invariant is $M_a = E\,W_{Q,i}^T W_{K,i}\,E^T$ at layer $i$. After applying an embedding matrix rotation $R_{\mathrm{emb}}$ and replacing the layernorms $\gamma_{\mathrm{input},i}$ with $\gamma'_{\mathrm{input},i}$, the invariant becomes $M'_a = E'\,(W'_{Q,i})^T W'_{K,i}\,E'^T$, that is,

$$
\begin{aligned}
M'_a &= (E R_{\mathrm{emb}})\,\mathrm{diag}\!\left(\tfrac{1}{\gamma'_{\mathrm{input},i}}\right) R_{\mathrm{emb}}^T\,\mathrm{diag}(\gamma_{\mathrm{input},i})\, W_{Q,i}^T R_i^T R_i W_{K,i}\,\mathrm{diag}(\gamma_{\mathrm{input},i})\, R_{\mathrm{emb}}\,\mathrm{diag}\!\left(\tfrac{1}{\gamma'_{\mathrm{input},i}}\right) (E R_{\mathrm{emb}})^T \\
&= E R_{\mathrm{emb}}\,\mathrm{diag}\!\left(\tfrac{1}{\gamma'_{\mathrm{input},i}}\right) R_{\mathrm{emb}}^T\,\mathrm{diag}(\gamma_{\mathrm{input},i})\, W_{Q,i}^T W_{K,i}\,\mathrm{diag}(\gamma_{\mathrm{input},i})\, R_{\mathrm{emb}}\,\mathrm{diag}\!\left(\tfrac{1}{\gamma'_{\mathrm{input},i}}\right) R_{\mathrm{emb}}^T E^T,
\end{aligned}
$$

and in general $M_a \neq M'_a$ unless the layernorm weights are equal constants. The other two invariants also do not hold because the layernorms have changed. (Note that our notation for Transformers differs from theirs.) Assuming that, in their invariant $M_f$, $W_1$ and $W_2$ are the gate and down projection matrices of an MLP (this is not stated explicitly in the paper but can be inferred from the experiments), the remaining invariants do not hold either.

Empirically, we compute the invariants between Llama-2-7b and independently trained models, and between Llama-2-7b and rotated fine-tuned models (including Llama-2-7b itself), in Table 15. There is little distinction between the independent and non-independent model pairs.

| θ2 (with θ1 = Llama-2-7b-hf) | Independent? | $M_a$ | $M_b$ | $M_c$ | ϕMATCH | ϕU(ℓ) | ϕH(ℓ) | ϕJSD |
|---|---|---|---|---|---|---|---|---|
| vicuna-7b-v1.5 | | 1.0 | 0.9883 | 0.9922 | < ε | < ε | < ε | -10.874 |
| Nous-Hermes-llama-2-7b | | 1.0 | 1.0 | 1.0 | < ε | < ε | < ε | -12.101 |
| llama-7b-hf | | 0.0884 | 0.0250 | 0.0400 | 0.049 | 0.595 | 0.253 | -11.102 |
| AmberChat | | 0.1289 | -0.0093 | 0.0198 | 0.941 | 0.460 | 0.279 | -10.281 |
| Openllama-v1 | | 0.1084 | 0.0076 | 0.0057 | 0.286 | 0.357 | 0.703 | -8.381 |
| Rotated Llama-2-7b-hf | | 0.0767 | 0.0908 | 0.1011 | < ε | 0.517 | 0.323 | |
| Rotated vicuna-7b-v1.5 | | 0.1553 | 0.0933 | 0.0977 | < ε | 0.688 | 0.857 | -10.874 |
| Rotated Nous-Hermes-llama-2-7b | | 0.0332 | 0.0718 | 0.1060 | < ε | 0.772 | 0.240 | -12.101 |

Table 15. Results for the three invariants $M_a$, $M_b$, $M_c$ from Zeng et al. (2024) between Llama-2-7b-hf and independent and non-independent models.

G.2.
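Invariance of Outputs under Rotation

These transformations are particularly important because they preserve outputs, as we show in Theorem 5, and hence can generally go undetected, though ϕMATCH is robust to them.

Theorem 5. For any input sequence $X \in \{0, 1\}^{n \times V}$, the outputs of models $\theta$ and $\mathrm{Rot}(\theta) = \theta'$ are aligned, i.e., $f_{\mathrm{LM}}(X; \theta) = f_{\mathrm{LM}}(X; \theta')$.

Proof. First, note that an element-wise product of two one-dimensional vectors is equivalent to multiplying by the diagonal matrix of the second vector, i.e., for $v, \gamma \in \mathbb{R}^{1 \times m}$, $v \odot \gamma = v\,\mathrm{diag}(\gamma)$. We use this in our layernorm calculations.

Let the output from the unrotated embedding layer be $y = f_{\mathrm{in}}(X, E) = XE$ (for $X \in \{0, 1\}^{n \times V}$). Then the output from the rotated embedding layer is $y' = f_{\mathrm{in}}(X, E') = X(E R_{\mathrm{emb}}) = y R_{\mathrm{emb}}$.

Now consider Transformer block $i$ with input $y$ and the rotated Transformer block with input $y' = y R_{\mathrm{emb}}$. The input $y$ is passed into the input layernorm, which returns

$$z = \mathrm{LN}_i(y) = \frac{y}{\sqrt{\mathrm{Var}(y) + \varepsilon}} \odot \gamma_{\mathrm{input},i} = \frac{y}{\sqrt{\mathrm{Var}(y) + \varepsilon}}\,\mathrm{diag}(\gamma_{\mathrm{input},i}).$$

The rotated input layernorm on $y'$ returns

$$
\begin{aligned}
z' = \mathrm{LN}'_i(y') &= \frac{y'}{\sqrt{\mathrm{Var}(y') + \varepsilon}} \odot \gamma'_{\mathrm{input},i} = \frac{y R_{\mathrm{emb}}}{\sqrt{\mathrm{Var}(y R_{\mathrm{emb}}) + \varepsilon}} \odot \gamma'_{\mathrm{input},i} \\
&= \frac{y}{\sqrt{\mathrm{Var}(y) + \varepsilon}}\, R_{\mathrm{emb}}\,\mathrm{diag}(\gamma'_{\mathrm{input},i}) = z\,\mathrm{diag}\!\left(\tfrac{1}{\gamma_{\mathrm{input},i}}\right) R_{\mathrm{emb}}\,\mathrm{diag}(\gamma'_{\mathrm{input},i}),
\end{aligned}
$$

which follows from $R_{\mathrm{emb}}$ being orthogonal. The output from the unrotated self-attention is then

$$w = \mathrm{softmax}\!\left(\frac{z W_{Q,i}^T (z W_{K,i}^T)^T}{\sqrt{d_{\mathrm{emb}}}}\right) z W_{V,i}^T W_{O,i}^T,$$

and the output from the rotated self-attention with input $z'$ is as follows, writing $A_i = \mathrm{diag}(\gamma_{\mathrm{input},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\mathrm{input},i})$ so that $A_i^T = \mathrm{diag}(1/\gamma'_{\mathrm{input},i})\, R_{\mathrm{emb}}^T\,\mathrm{diag}(\gamma_{\mathrm{input},i})$ and $z' A_i^T = z$:

$$
\begin{aligned}
w' &= \mathrm{softmax}\!\left(\frac{z' (R_i W_{Q,i} A_i)^T \left(z' (R_i W_{K,i} A_i)^T\right)^T}{\sqrt{d_{\mathrm{emb}}}}\right) z' (W_{V,i} A_i)^T (R_{\mathrm{emb}}^T W_{O,i})^T \\
&= \mathrm{softmax}\!\left(\frac{z' A_i^T W_{Q,i}^T R_i^T \left(z' A_i^T W_{K,i}^T R_i^T\right)^T}{\sqrt{d_{\mathrm{emb}}}}\right) z' A_i^T W_{V,i}^T W_{O,i}^T R_{\mathrm{emb}} \\
&= \mathrm{softmax}\!\left(\frac{z' A_i^T W_{Q,i}^T W_{K,i} A_i (z')^T}{\sqrt{d_{\mathrm{emb}}}}\right) z W_{V,i}^T W_{O,i}^T R_{\mathrm{emb}} \\
&= \mathrm{softmax}\!\left(\frac{z W_{Q,i}^T W_{K,i} z^T}{\sqrt{d_{\mathrm{emb}}}}\right) z W_{V,i}^T W_{O,i}^T R_{\mathrm{emb}} = w R_{\mathrm{emb}}.
\end{aligned}
$$

Hence $w' = w R_{\mathrm{emb}}$.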
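Then $y$ and $y'$, respectively, from before the layernorm are added as residual connections, giving $v = y + w$ and $v' = y' + w' = v R_{\mathrm{emb}}$. The vector $v$ is passed into the post-attention layernorm, which returns

$$u = \mathrm{LN}_i(v) = \frac{v}{\sqrt{\mathrm{Var}(v) + \varepsilon}} \odot \gamma_{\text{post-attn},i} = \frac{v}{\sqrt{\mathrm{Var}(v) + \varepsilon}}\,\mathrm{diag}(\gamma_{\text{post-attn},i}).$$

Similar to the input layernorm, the rotated post-attention layernorm on $v'$ returns

$$
\begin{aligned}
u' = \mathrm{LN}'_i(v') &= \frac{v'}{\sqrt{\mathrm{Var}(v') + \varepsilon}} \odot \gamma'_{\text{post-attn},i} = \frac{v R_{\mathrm{emb}}}{\sqrt{\mathrm{Var}(v R_{\mathrm{emb}}) + \varepsilon}} \odot \gamma'_{\text{post-attn},i} \\
&= \frac{v}{\sqrt{\mathrm{Var}(v) + \varepsilon}}\, R_{\mathrm{emb}}\,\mathrm{diag}(\gamma'_{\text{post-attn},i}) = u\,\mathrm{diag}\!\left(\tfrac{1}{\gamma_{\text{post-attn},i}}\right) R_{\mathrm{emb}}\,\mathrm{diag}(\gamma'_{\text{post-attn},i}).
\end{aligned}
$$

The output from the unrotated MLP layer on $u$ is $t = [\sigma(u G_i^T) \odot (u U_i^T)] D_i^T$. Writing $B_i = \mathrm{diag}(\gamma_{\text{post-attn},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(1/\gamma'_{\text{post-attn},i})$, so that $u' B_i^T = u\,\mathrm{diag}(1/\gamma_{\text{post-attn},i})\, R_{\mathrm{emb}}\,\mathrm{diag}(\gamma'_{\text{post-attn},i})\,\mathrm{diag}(1/\gamma'_{\text{post-attn},i})\, R_{\mathrm{emb}}^T\,\mathrm{diag}(\gamma_{\text{post-attn},i}) = u$, the output from the rotated MLP on $u'$ is

$$
\begin{aligned}
t' &= \left[\sigma\!\left(u' (G_i B_i)^T\right) \odot \left(u' (c_i U_i B_i)^T\right)\right] \left(\tfrac{1}{c_i} R_{\mathrm{emb}}^T D_i\right)^T \\
&= \left[\sigma\!\left(u' B_i^T G_i^T\right) \odot \left(c_i\, u' B_i^T U_i^T\right)\right] \tfrac{1}{c_i} D_i^T R_{\mathrm{emb}} \\
&= \left[c_i\, \sigma(u G_i^T) \odot (u U_i^T)\right] \tfrac{1}{c_i} D_i^T R_{\mathrm{emb}} = t R_{\mathrm{emb}}.
\end{aligned}
$$

Then the output from the self-attention is added as a residual connection: the final output from the unrotated Transformer block is $s = t + v$, and the output from the rotated Transformer block is $s' = t' + v' = s R_{\mathrm{emb}}$.

Suppose $a$ is the output after all Transformer layers in $\theta$ and $a'$ is the output after all Transformer layers in $\theta'$. Then the outputs after the final layernorms are

$$b = \frac{a}{\sqrt{\mathrm{Var}(a) + \varepsilon}}\,\mathrm{diag}(\gamma_{\mathrm{final}}), \qquad b' = b\,\mathrm{diag}\!\left(\tfrac{1}{\gamma_{\mathrm{final}}}\right) R_{\mathrm{emb}}\,\mathrm{diag}(\gamma'_{\mathrm{final}}),$$

and the logits from the linear output layer are

$$b\,O^T = b\,\mathrm{diag}\!\left(\tfrac{1}{\gamma_{\mathrm{final}}}\right) R_{\mathrm{emb}}\,\mathrm{diag}(\gamma'_{\mathrm{final}})\,\mathrm{diag}\!\left(\tfrac{1}{\gamma'_{\mathrm{final}}}\right) R_{\mathrm{emb}}^T\,\mathrm{diag}(\gamma_{\mathrm{final}})\, O^T = b'\,(O')^T,$$

which are the same for both models.

We attempted to undo such a transformation that an adversary may apply by solving a least squares problem: we solve for a rotation $A$ that minimizes $\lVert AX - Y \rVert$, where $X$ is a weight matrix of the first model and $Y$ is the corresponding weight matrix of the second model.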
Although this yields a candidate rotation for undoing the transformation, we find that the same procedure also produces a matrix A that aligns pairs of independent models. Undoing the rotation this way is therefore unreliable. The same holds when X and Y are activations over multiple inputs.
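The least squares alignment described above can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' implementation: random Gaussian matrices stand in for weights or activations, the helper names `fit_rotation`, `fit_unconstrained`, and `relative_residual` are hypothetical, and the norm is taken to be the Frobenius norm. The first part recovers a planted rotation with the standard SVD solution to the orthogonal Procrustes problem. The second part shows one simple regime, an unconstrained A with no more columns in X than rows, in which an independent target is fit exactly, so a small residual by itself carries no evidence of a shared origin. Whether this is the mechanism behind the failure reported above is not established here.

```python
import numpy as np


def fit_rotation(X, Y):
    """Orthogonal A minimizing ||A X - Y||_F (orthogonal Procrustes via SVD)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt


def fit_unconstrained(X, Y):
    """Unconstrained A minimizing ||A X - Y||_F (ordinary least squares)."""
    # Solve X^T A^T = Y^T in the least squares sense, then transpose back.
    At, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return At.T


def relative_residual(A, X, Y):
    """Frobenius-norm residual of the fit, relative to the norm of Y."""
    return np.linalg.norm(A @ X - Y) / np.linalg.norm(Y)


rng = np.random.default_rng(0)
d = 16

# Part 1: a planted rotation is recovered when Y = R X and X has full row rank.
X_wide = rng.standard_normal((d, 64))
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
A_rot = fit_rotation(X_wide, R @ X_wide)
res_planted = relative_residual(A_rot, X_wide, R @ X_wide)

# Part 2: with only 8 columns (fewer than d), an unconstrained A fits an
# independent Gaussian target exactly, so the fit alone is uninformative.
X_narrow = rng.standard_normal((d, 8))
Y_independent = rng.standard_normal((d, 8))
A_free = fit_unconstrained(X_narrow, Y_independent)
res_independent = relative_residual(A_free, X_narrow, Y_independent)

print(f"planted rotation residual:     {res_planted:.2e}")
print(f"independent target residual:   {res_independent:.2e}")
```

Both residuals come out at the level of floating-point error, which is why a low alignment error needs to be compared against what independent pairs achieve under the same fitting procedure before it is read as evidence.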