Persistent Topological Features in Large Language Models

Yuri Gardinazzi* 1,2, Karthik Viswanathan* 1,3, Giada Panerai 1,2, Alessio Ansuini 1, Alberto Cazzaniga 1, Matteo Biagetti 1

*Equal contribution. 1 Area Science Park, 2 University of Trieste, 3 University of Amsterdam. Correspondence to: Matteo Biagetti.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Understanding the decision-making processes of large language models is critical given their widespread applications. To this end, we aim to connect a formal mathematical framework, zigzag persistence from topological data analysis, with practical and easily applicable algorithms. Zigzag persistence is particularly effective for characterizing data as it dynamically transforms across model layers. Within this framework, we introduce topological descriptors that measure how topological features, p-dimensional holes, persist and evolve throughout the layers. Unlike methods that assess each layer individually and then aggregate the results, our approach directly tracks the full evolutionary path of these features. This offers a statistical perspective on how prompts are rearranged and how their relative positions change in the representation space, providing insights into the system's operation as an integrated whole. To demonstrate the expressivity and applicability of our framework, we highlight how sensitive these descriptors are to different models and a variety of datasets. As a showcase application to a downstream task, we use zigzag persistence to establish a criterion for layer pruning, achieving results comparable to state-of-the-art methods while preserving the system-level perspective.

1. Introduction

Large Language Models (LLMs) have revolutionized natural language processing by achieving unprecedented performance levels across a wide range of tasks (see Raiaan et al. (2024) for a review). Despite their success, the black-box nature of these models has raised significant concerns about interpretability and transparency (Liao & Vaughan, 2023). Moreover, their large scale demands a considerable amount of computational resources (Samsi et al., 2023; Bai et al., 2024), making it essential to reduce their size without compromising performance (Ma et al., 2023; Gromov et al., 2024; Men et al., 2024).

One strategy for addressing these issues has been to study the models' internal representations. Early work (Zeiler & Fergus, 2014) demonstrated that visualization techniques can effectively uncover hierarchical representations within convolutional neural networks, highlighting how lower layers focus on edge detection while higher layers correspond to object parts and semantic concepts. Additionally, Olah et al. (2018) illustrated that analyzing weight matrices and neuron activations can reveal interpretable features and organizational structures within deep networks, providing insights into how complex patterns are encoded and processed. More recently, geometric studies made progress by introducing concepts like intrinsic dimension to characterize the manifold of internal representations and its evolution across layers (Ansuini et al., 2019; Doimo et al., 2020; Pope et al., 2021). These methods have been successfully applied to transformer models in various works (Valeriani et al., 2023; Tulchinskii et al., 2024; Cheng et al., 2023; 2024; Viswanathan et al., 2025).
One notable achievement of this approach has been to show that semantic knowledge and abstraction phases emerge in the middle layers of models, rather than at the final layers, as might be intuitively expected. However, these approaches provide only a static view of internal representations and suffer from limitations in tracking their changes across layers.

A natural framework for addressing these limitations, and for offering a more comprehensive characterization of the geometry of internal representations of neural networks, is Topological Data Analysis (TDA). TDA is a set of unsupervised techniques that offers robust methods to describe the shape and structure of complex datasets. It has seen exponential growth, with applications in computational biology (Mandal et al., 2020), cosmology (Biagetti et al., 2021; Yip et al., 2024), personalized medicine (Skaf & Laubenbacher, 2022), time-dependent data analysis (El-Yaagoubi et al., 2023), and machine learning (Hensel et al., 2021), just to name a few. One prominent tool within TDA is persistent homology, which tracks the birth and death of topological features across different scales, thereby capturing the multiscale behavior of a point cloud. Several studies have proposed persistent homology to investigate neural networks and their internal representations (e.g., Rieck et al., 2023; Naitzat et al., 2020; Lacombe et al., 2021; Magai & Ayzenberg, 2022). However, in the context of TDA applications, it has not yet been recognized that the internal representations of neural networks can essentially be viewed as point clouds dynamically evolving in time (layers). In the particular case of LLMs, as pre-trained models process inputs, they transform these point clouds within the representation space layer by layer, capturing essential features and relationships throughout the model's depth. Thus, it is natural to interpret these transformations as an evolving discrete dynamical system.

To address this problem, we exploit a TDA tool developed to characterize time-varying point clouds and temporal networks, known as zigzag persistence. Our approach achieves the following results:

- Zigzag Framework for LLMs: We build a fast and scalable pipeline to characterize the birth and death of topological features across transformer models' layers. As new contributions in the context of zigzag applications, we introduce the k-Nearest Neighbors-based filtration, and we interpret layers as time snapshots, tracking the trajectory of features across layers.

- Identification of Phases of Prompt Processing: Using interpretable topological descriptors, we characterize the model's dynamical processing of prompts in representation space across layers. We use this characterization to identify four distinct phases as prompts move in representation space across layers: an initial phase with rapid rearrangement of positions, a middle phase characterized by stable, long-lived relations among prompts, a transition phase where the model refines these relations, and a final phase of new rearrangements preparing data for output.

- Model Pruning: As a showcase downstream task, we use our descriptors to define a criterion to prune layers without significantly degrading performance, finding results comparable to state-of-the-art methods.

Our topological descriptors yield results that are quantitatively different yet qualitatively similar across models, datasets, and choices of zigzag hyperparameters.
This demonstrates their expressivity and, at the same time, reveals a degree of universality in the topological structure of LLM representations.

2. Related Work

Topology of Internal Representations. TDA has been extensively used in machine learning (see Hensel et al. (2021) for a recent review). In the context of studying internal representations, studies on Convolutional Neural Networks (CNNs) used topological descriptors to explore the shape of activation functions (Rathore et al.) or their relation to performance (Naitzat et al., 2020). Magai & Ayzenberg (2022) introduced the persistent homology dimension as an estimator of the intrinsic dimension of internal representations in CNNs, while Barannikov et al. (2022) proposed a measure of similarity based on topological descriptors to compare representations. Betti numbers have been observed to remain stable across different datasets for the same architectures and to decrease as depth increases (Suresh et al., 2023).

Zigzag Persistence. Zigzag persistence was introduced in (Carlsson & de Silva, 2010; Carlsson et al., 2009; Tausz & Carlsson, 2011) as an extension of persistent homology to study the persistence of topological features across sequences of spaces. This approach is particularly useful when data undergo dynamic changes or transformations over time. Since its introduction, zigzag persistence has been applied in various fields, including Hopf bifurcations in dynamical systems (Tymochko et al., 2020), commuting patterns in Great Britain's transportation network (Myers et al., 2023), coral reef ecosystems (McDonald et al., 2023), cell location time series (Yang et al., 2023; Zhang et al., 2023), and honeybee aggregations (Gharooni-Fard et al., 2024). It has also inspired methodological extensions such as multidimensional persistence (Kim & Mémoli, 2021) and the development of formigrams and crocker stacks (Xian et al., 2022).

Distinguishing Transformer Stages. Recent works have studied the geometry of internal representations to identify distinct stages [1] in the way pre-trained transformer models process input across layers. For instance, Valeriani et al. (2023) demonstrated that transformer models exhibit a phase transition in the middle layers, characterized by a spike in intrinsic dimensionality, which correlates with the emergence of semantic and syntactic abstractions. Similarly, Cheng et al. (2023) highlighted that middle layers play a crucial role in compressing input representations into lower-dimensional manifolds, enabling the model to generalize and handle complex linguistic tasks. Subsequent work has confirmed and expanded these findings in different ways (Lad et al., 2024; Artzy & Schwartz, 2024; Skean et al., 2024).

[1] While previous work has frequently used the word "stage" in this context, we prefer "phase" to emphasize the continuous, evolving nature of the process.

Layer Pruning in Large Language Models. Among existing methods to reduce the size of neural networks, layer pruning has gained particular relevance in the context of LLMs. The first applications to BERT models (Fan et al., 2020; Zhang & He, 2020; Fan et al., 2021; Jha et al., 2024) inspired a long series of experiments employing similar techniques (Sajjad et al., 2023; Siddiqui et al., 2024; He et al., 2024; Zhang et al., 2024a; Kim et al., 2024; Zhang et al., 2024b).
Many of these efforts base their methodology on similarity measures of internal representations, which have conveniently been summarized in a recent review (Klabunde et al., 2023). In this work, we consider (Gromov et al., 2024), which uses angular similarity, and (Men et al., 2024), which uses Block-Influence similarity, as reference points for comparison.

3. Methodology

In this section, we introduce the zigzag persistence framework, which we use to analyze the internal representations of LLMs pre-trained with an autoregressive loss. These models typically receive an input sequence of n tokens (often representing a sentence) embedded in a d-dimensional space. The input is transformed across the network layers without altering the embedding dimension. Due to the autoregressive nature of these models, the representation of the last token in a sequence captures information about the entire sequence and is used to predict the next one. As a result, we choose to focus on the last-token representation of each sequence at each layer. Thus, our point cloud is given by the last tokens' embeddings, i.e., vectors of the form $\{x_i(\ell_j)\} \subset \mathbb{R}^d$, for $i = 1, \dots, N_{\text{sentences}}$ and $j = 1, \dots, N_{\text{layers}}$. These last tokens are extracted from a variety of datasets and serve as an observational probe of how the model processes input.

3.1. Topological Data Analysis

Topological data analysis (Edelsbrunner et al., 2002; Zomorodian & Carlsson, 2004) provides tools for geometrically characterizing highly complex datasets. Within this framework, persistent homology (Carlsson, 2009) is the key methodology for characterizing a point cloud on multiple scales at once. Its goal is to identify the range of scales over which a particular class of topological features (connected components, loops, voids, higher-dimensional holes) remains relevant, or "persistent", as opposed to "topological noise", i.e., features disappearing at roughly the same scale at which they formed. The basic ingredients for this technique are i) a criterion to connect points, forming a simplicial complex, and ii) a scale parameter ν (often a coarsening scale) such that, given ν1 ≤ ν2, the two corresponding simplicial complexes are related by $K_{\nu_1} \subseteq K_{\nu_2}$. The ordered sequence of simplicial complexes for varying scale parameter is called a filtration. An intuitive example is the Vietoris-Rips filtration, built from complexes parametrized by the radius of the ball drawn around each point of the dataset.

Figure 1. A schematic representation of the zigzag algorithm. It shows how zigzag persistence can track the evolution of topological features over time; the descriptors of this work are built upon these tracks.

Filtrations can be generalized to a more flexible structure called a zigzag filtration. Unlike a standard filtration, a zigzag filtration allows the sequence of complexes to move both forward and backward, meaning that inclusions between complexes can reverse at certain steps. We take this approach in our study to track the evolution of the internal representations across layers, rather than at a fixed snapshot, as done in traditional persistent homology implementations. In this sense, our parameter is not a distance/coarsening scale, but a discrete time scale represented by the layer number. We track topological features as they are formed and destroyed along the model layers and statistically characterize these changes to describe a complex series of transformations in high-dimensional space.
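As a concrete illustration of how the point cloud defined above can be assembled, the sketch below extracts the last-token representation at every layer with the Hugging Face transformers API. This is a minimal sketch under our own assumptions: the checkpoint name and prompts are placeholders, and batching, device placement, and precision handling are omitted for brevity.

```python
# Minimal sketch: build {x_i(l_j)} by collecting the last-token hidden state
# of each prompt at every layer. Checkpoint and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
model.eval()

prompts = ["The movie was surprisingly good.", "I did not enjoy the plot."]

point_cloud = []  # point_cloud[i] has shape (n_layers + 1, d)
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        # out.hidden_states is a tuple of (1, n_tokens, d) tensors, one per
        # layer (the first entry is the embedding layer); keep the last token.
        x = torch.stack([h[0, -1, :] for h in out.hidden_states])
        point_cloud.append(x.float().cpu().numpy())
```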
Differently from standard persistent homology, short- and long-lived features here represent how the model dynamically evolves. Short-lived features indicate a high rate of rearrangement of the points between adjacent layers, while long-lived features suggest a phase of retention of (relative) positions across several layers. This is a crucial point in our analysis, as it provides a novel tool to interpret how the model processes different inputs by moving them and changing their relative positions in the representation space. Recent computational advances have made these methods feasible for analyzing time-varying point clouds within the broader machine learning community. For instance, efficient algorithms such as the fast zigzag persistence method introduced by Dey & Hou (2022) have enabled scalable analysis of evolving topological features, making persistence-based approaches much more accessible in large-scale applications. We now outline the main steps of the zigzag algorithm, leaving a rigorous mathematical formulation to Appendix A.

3.2. Zigzag Persistence for Layer Analysis

We aim to study internal representations by tracking statistical changes in the formation of p-dimensional holes generated by connecting nearby data points within each layer ℓi. As introduced above, the first ingredient for a TDA formulation is a criterion for connecting points. In this regard, we construct a k-Nearest Neighbors graph $G_{\ell_i} = (V_{\ell_i}, E_{\ell_i})$ at every layer ℓi, where the number $k_{\text{NN}}$ of neighbors is a fixed hyperparameter (see (Le & Taylor, 2024) for a previous use of a kNN-based filtration). To explore higher-order relations among points, we extend the dimension of the graph by filling higher-dimensional simplices. More precisely, we fill a simplex when its boundary, composed of lower-dimensional simplices (such as vertices and edges), is complete. In particular, we consider a triangle as filled when it has three vertices with pairwise connections. Similarly, a tetrahedron is filled when four vertices are all interconnected by edges, totaling six edges. This concept extends to higher dimensions, up to a specified maximum dimension m. Thus, in each layer, we construct the simplicial complex $K_{\ell_i}$ defined by:

$$K_{\ell_i} = \big\{\, S \subseteq V_{\ell_i} \;\big|\; \forall\, x_s, x_l \in S,\ (x_s, x_l) \in E_{\ell_i} \ \text{and} \ |S| \le m + 1 \,\big\}. \quad (1)$$

To track changes across the network, we compute intersection layers by identifying the simplices present simultaneously in two adjacent layers. This allows us to construct a sequence of inclusions between these complexes,

$$\Phi : K_{\ell_1} \supseteq K_{\ell_1} \cap K_{\ell_2} \subseteq K_{\ell_2} \supseteq K_{\ell_2} \cap K_{\ell_3} \subseteq \cdots \subseteq K_{\ell_L}, \quad (2)$$

where we define $L \equiv N_{\text{layers}}$ for conciseness. This sequence represents our zigzag filtration, denoted by Φ. This filtration is the second ingredient needed to define persistent homology. We thus obtain a notion of birth and death of p-dimensional holes, with $p = 0, \dots, m - 1$, m being the maximum dimension to which we expand the graph. Throughout this work, we choose m = 4, which implies that the p-dimensional holes are well defined up to dimension p = 3. We can track the persistence of these objects: they appear in a given layer when a group of points exhibits a particular proximity and distribution in the complex, and they disappear at a subsequent layer when some points have moved apart, causing them to vanish. We illustrate the idea in Figure 1.
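The construction in Eq. (1) and the intersection complexes can be sketched in a few lines. The snippet below is our own rendering, not the paper's exact pipeline: it symmetrizes the kNN relation (an implementation choice the text leaves open) and enumerates cliques with networkx; the helper names are ours.

```python
# Build the per-layer complex K_li of Eq. (1): a kNN graph whose cliques with
# at most m+1 vertices are filled in as simplices.
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def knn_clique_complex(X, k_nn=4, m=4):
    """X: (n_points, d) array of last-token vectors at one layer."""
    nbrs = NearestNeighbors(n_neighbors=k_nn + 1).fit(X)
    _, idx = nbrs.kneighbors(X)          # column 0 is the point itself
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i, row in enumerate(idx):
        G.add_edges_from((i, int(j)) for j in row[1:])  # symmetrized kNN
    simplices = set()
    for clique in nx.enumerate_all_cliques(G):  # yields cliques by size
        if len(clique) > m + 1:                 # stop beyond dimension m
            break
        simplices.add(tuple(sorted(clique)))
    return simplices

def intersection_complex(K_a, K_b):
    """Simplices present simultaneously in two adjacent layers, cf. Eq. (2)."""
    return K_a & K_b
```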
Comparison to similar frameworks. A distinguishing feature of our methodology is the choice of a kNN filtration, whose stability was discussed in (Le & Taylor, 2024) in the context of persistent homology, though it was never applied to zigzag persistence. A notable effort in describing spatiotemporal networks similarly to this work is (Kim & Mémoli, 2021), where the main summary statistic (the rank invariant) involves calculating a 6-dimensional data vector (4 across layers and 2 across scale) and thus combines the variation of both a time and a scale parameter, using the Rips filtration. Varying the scale parameter as well is worth investigating in this context, and the techniques in (Kim & Mémoli, 2021) would be a starting point for implementing it. The work (Kim & Mémoli, 2023) by the same authors is also related to ours, since the maximal group diagram and the persistence clustergram (cf. Figure 2) are barcodes annotated with the representative topological features. In that work, the scale is fixed, similarly to our case.

Zigzag Persistence Diagram. The output of the zigzag algorithm is a multiset of birth-death pairs [b, d], [2] known as the persistence diagram

$$\mathrm{Pers}_p(\Phi) = \big\{\, [b, d] \;\big|\; b, d \in \{0, 1, \dots, 2(N_{\text{layers}} - 1)\} \,\big\}. \quad (3)$$

We thus work with a zigzag filtration naturally indexed by $\{0, 1, 2, \dots, 2(N_{\text{layers}} - 1)\}$. Specifically, as shown in Figure 1, even numbers starting from 0 are assigned to p-dimensional holes that emerge and disappear within the model layers. In contrast, odd numbers are designated for features at the intersection layers. It is important to note that homology classes are defined as equivalence classes, meaning that a connected component (in the case of 0-dimensional homology) need not maintain the same form at the level of simplices throughout its lifetime. The orange connected component in the figure exemplifies this: in Layer 1, it corresponds to the three points {x5, x6, x7} connected by edges, forming a triangle. In the intersection layer, it is reduced to the edge {x6, x7}. In Layer 2, this edge merges with another connected component (depicted in red), marking the death of the orange component. This feature ensures the robustness of our construction to small changes in the kNN graph. A mathematical explanation of this is provided in Appendix A. The algorithm that generates $\mathrm{Pers}_p(\Phi)$ is schematically described in Appendix B, and in Appendix C we show a toy example using a calendar month task to visualize how we track zigzag barcodes.

[2] The repetition of a pair [b, d] indicates that multiple holes in dimension p have been created and destroyed in correspondence with the same layers.
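Given the sequence of complexes, the diagram $\mathrm{Pers}_p(\Phi)$ can be computed with off-the-shelf software. The sketch below uses the zigzag routine of dionysus (Morozov, cited in the references), whereas our pipeline relies on the faster algorithm of Dey & Hou (2022); the data layout and function name are ours, and the exact dionysus signature should be checked against its documentation.

```python
# Sketch: feed the alternating sequence of model/intersection complexes,
# indexed 0, 1, ..., 2*(n_layers - 1), to dionysus' zigzag solver.
import dionysus as d

def zigzag_diagrams(snapshots):
    """snapshots: list of sets of simplices (vertex tuples), alternating
    model and intersection complexes along the zigzag filtration."""
    all_simplices = sorted({s for K in snapshots for s in K},
                           key=lambda s: (len(s), s))  # faces before cofaces
    f = d.Filtration([list(s) for s in all_simplices])
    times = []
    for s in all_simplices:
        # times[i] alternates entry/exit indices of simplex i in the zigzag
        events, inside = [], False
        for t, K in enumerate(snapshots):
            if (s in K) != inside:
                events.append(float(t))
                inside = not inside
        times.append(events)
    zz, dgms, cells = d.zigzag_homology_persistence(f, times)
    return dgms  # dgms[p] contains the (birth, death) pairs in dimension p
```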
Effective Persistence Image. The pairs generated within $\mathrm{Pers}_p(\Phi)$ can be visualized through a persistence image, a well-known descriptor in the TDA toolbox. The persistence image in our case is a grid of size $(2N_{\text{layers}} - 1) \times (2N_{\text{layers}} - 1)$ for each homology dimension p. Each pixel in the grid is associated with an integer value corresponding to the number of holes appearing with that birth-death pair. Defined this way, the persistence image does not discriminate between model and intersection layers. The behavior of the two is generally quite different and alternates between model and intersection layers; hence, persistence images are not smooth as a function of the layer. To achieve a smoother representation, we introduce effective persistence images, obtained by excluding the intersection layers from the construction. This is achieved by defining a map, similar to the approach in (Kim & Mémoli, 2017), that translates the collection of intervals from the zigzag persistence diagram of the filtration in (2) into intervals where birth and death occur only across model layers. Formally, for b, d > 0, we obtain:

$$\widehat{PI}_p(b/2, d/2) = PI_p(b, d) + PI_p(b - 1, d) + PI_p(b, d - 1) + PI_p(b - 1, d - 1), \quad (4)$$

where $\widehat{PI}_p$ is the effective persistence image for the p-dimensional holes and b, d are model layers indexed by even numbers. [3]

[3] Note that this operation does not modify the information about the model layers contained in the original $\mathrm{Pers}_p(\Phi)$, as it consistently redefines all the births and deaths.

3.3. Zigzag Descriptors

The collection of $\widehat{PI}_p$, taken over all p, contains all the information output by our zigzag algorithm and gives a useful overview of the model as a whole. On the other hand, these images are not easily tractable statistically and are hard to interpret. We therefore extract two descriptors from the effective persistence image, defined below.

Births Relative Frequency. A useful way to summarize a persistence diagram is to count features within a specific region of interest. In our context, it is informative to measure the rate at which new p-dimensional holes are created, as this reflects the model's propensity to move prompts toward each other in specific regions of space. We thus define the births relative frequency as

$$B_p(\ell) = \frac{\sum_{\ell_i} \omega(\ell, \ell_i)\, \widehat{PI}_p(\ell, \ell_i)}{\sum_{\ell_i} \omega(\ell, \ell_i) \sum_{\ell_i} \widehat{PI}_p(\ell, \ell_i)}, \quad (5)$$

where

$$\omega(\ell, \ell_i) = |\ell - \ell_i|^{\alpha} \quad (6)$$

is a weight with varying exponent α. [4] For negative values of α, the average gives weight to counts with low death values, effectively tracing the fraction of short-lived features. On the other hand, positive values of α give more weight to long-persistent features.

[4] This type of weighting has been used previously for topological descriptors; see, e.g., (Chazal et al., 2014).

Inter-Layer Persistence. To better track the persistence of features across layers, we can calculate the fraction of p-dimensional holes in one layer, ℓ1, that exist in another layer, ℓ2, as well, and have existed throughout the layers in between. [5] Mathematically, it can be expressed as

$$Z_p(\ell_1, \ell_2) = \frac{1}{\beta_p(\ell_1)} \sum_{\ell' \le M_1,\ \ell'' > M_2} \widehat{PI}_p(\ell', \ell''), \quad (7)$$

where $M_1 = \min(\ell_1, \ell_2)$, $M_2 = \max(\ell_1, \ell_2)$, and $\beta_p(\ell)$ is the Betti number, i.e., the number of alive p-dimensional holes at layer ℓ. [6] We can then further summarize this quantity by power-weighted averaging,

$$\bar{Z}_p(\ell) = \frac{\sum_{\ell_i = 1}^{N_{\text{layers}}} \omega(\ell, \ell_i)\, Z_p(\ell, \ell_i)}{\sum_{\ell_i = 1}^{N_{\text{layers}}} \omega(\ell, \ell_i)}, \quad (8)$$

where we fix one of the two layers and average over all the others; the weight is the same as in Eq. (6). Given that the birth or death of a p-dimensional hole implies a rearrangement of points in space, $Z_p$ tracks the dynamical movement of prompts' relative positions in representation space as a function of the model's depth.

[5] We note that this is related to the generalized rank invariants in the context of multiparameter and zigzag persistence (Clause et al., 2024; Dey et al., 2024), which measure the rank of the homology maps between consecutive layers in a zigzag filtration. However, it differs in that it is normalized by the Betti number and explicitly enforces the continuity of holes throughout the intermediate layers.

[6] Note that (7) is well defined only when $\beta_p(\ell) > 0$. If there are no p-dimensional holes at either ℓ1 or ℓ2, $Z_p(\ell_1, \ell_2)$ should be 0 by definition. We omitted this limit case from (7) for readability.
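To make the two descriptors concrete, the following numpy sketch implements Eqs. (4)-(8) as reconstructed above. All names are ours; excluding the ℓi = ℓ term from the weights (which is singular for α < 0) is our implementation choice.

```python
# Sketch of Eqs. (4)-(8): effective persistence image, births relative
# frequency, and power-weighted inter-layer persistence.
import numpy as np

def effective_pi(pairs, n_layers):
    """pairs: (birth, death) indices in {0, ..., 2*(n_layers-1)}; Eq. (4)."""
    size = 2 * n_layers - 1
    PI = np.zeros((size, size))
    for b, dth in pairs:
        PI[b, dth] += 1
    eff = np.zeros((n_layers, n_layers))
    for b in range(2, size, 2):          # even indices are model layers
        for dth in range(2, size, 2):
            eff[b // 2, dth // 2] = (PI[b, dth] + PI[b - 1, dth]
                                     + PI[b, dth - 1] + PI[b - 1, dth - 1])
    return eff

def weights(n_layers, alpha):
    """Eq. (6); the diagonal |l - l_i| = 0 is excluded by construction."""
    L = np.arange(n_layers)
    diff = np.abs(L[:, None] - L[None, :]).astype(float)
    w = np.zeros_like(diff)
    mask = diff > 0
    w[mask] = diff[mask] ** alpha
    return w

def births_relative_frequency(eff, alpha):
    """Eq. (5): weighted birth counts per layer, normalized."""
    w = weights(len(eff), alpha)
    num = (w * eff).sum(axis=1)
    den = w.sum(axis=1) * eff.sum(axis=1)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def z_p(eff, l1, l2):
    """Eq. (7): fraction of holes alive at l1 persisting through l2."""
    m1, m2 = min(l1, l2), max(l1, l2)
    betti = eff[: l1 + 1, l1 + 1 :].sum()      # holes alive at layer l1
    return eff[: m1 + 1, m2 + 1 :].sum() / betti if betti > 0 else 0.0

def z_bar(eff, ell, alpha):
    """Eq. (8): power-weighted average of Z_p over the other layers."""
    w = weights(len(eff), alpha)[ell]
    z = np.array([z_p(eff, ell, li) for li in range(len(eff))])
    return (w * z).sum() / w.sum()
```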
4. Experiments

4.1. Models, Datasets and Benchmarks

We work with 4 models: Llama 2 (Touvron et al., 2023), Llama 3 (AI@Meta, 2024), Mistral (Jiang et al., 2023) and Pythia 6.9B (Biderman et al., 2023). These models are open-source decoder-only transformers, and they achieve high performance on the benchmarks we consider in this work. We analyze Llama 2 7B, Llama 3 8B, Mistral 7B, and Pythia 6.9B because they have 32 hidden layers and comparable parameter counts. In Appendix E.2 we show results for larger models as a consistency check.

The input dataset from which we take internal representations must provide a fair test of how the model processes and understands language. We consider the following datasets: 1) the Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013); 2) the Pile dataset (Gao et al., 2020), from which we take a subset of 10K prompts, accessible on Hugging Face [7]; 3) a dataset of mathematical problems (Hendrycks et al., 2021b); 4) a dataset of code retrieved from GitHub [8]. Each prompt is processed to extract the last token at each normalization layer, and the final normalization is applied to the output layer. To ensure fair comparisons and eliminate potential biases in our descriptors caused by varying point cloud sizes, all datasets are reduced to the first 8,000 prompts. Additionally, we divide the datasets into incremental subsets of {100, 200, ..., 1000} prompts and compute the mean and standard deviation across subsets to systematically evaluate the scalability of our descriptors and to quantify their sensitivity to changes in point cloud size. For our experiments, we consider the 500-prompt subsets, amounting to 16 subsets. In Appendix E.4 we perform a more detailed analysis at varying subset sizes. We use the SST dataset as the reference dataset for all the plots and show results for the other datasets in Appendix E.3.

[7] https://huggingface.co/datasets/NeelNanda/pile-10k
[8] https://huggingface.co/datasets/codeparrot/github-code

We use 3 benchmarks for layer pruning performance evaluation: MMLU (Hendrycks et al., 2021a), HellaSwag (Zellers et al., 2019), and Winogrande (Sakaguchi et al., 2019), which have been widely used for similar purposes in previous analyses. The benchmarks are evaluated with the library lm-eval-harness (Gao et al., 2024) in a 5-shot setup.
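For completeness, a 5-shot evaluation of this kind can be launched from the harness's Python entry point. This is a sketch: the checkpoint string is illustrative, and argument names and result keys can differ across lm-eval-harness versions.

```python
# Sketch of a 5-shot benchmark run with lm-eval-harness (Gao et al., 2024).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B,dtype=float16",
    tasks=["mmlu", "hellaswag", "winogrande"],
    num_fewshot=5,
)
for task, scores in results["results"].items():
    print(task, scores.get("acc,none"))
```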
4.2. Zigzag persistence applied to LLMs

We generate zigzag diagrams for each model and dataset and for each homology dimension up to p = 3, for a range of values $k_{\text{NN}} \in [1, 15]$. We find that 0-, 2-, and 3-dimensional holes are relatively low in number, while 1-dimensional holes reach tens of thousands of elements per layer. This behavior might be expected for a kNN-graph-based construction, since connections are dense even for low values of $k_{\text{NN}}$, especially if points are concentrated in low-dimensional regions of the representation space. We examine this behavior in detail to make sure that our construction is stable for different choices of the kNN graph; see Appendix D for details. The hyperparameter $k_{\text{NN}}$ is chosen so as to maximize the total number of holes. Therefore, in what follows, we show results for our topological descriptors for 1-dimensional holes and $k_{\text{NN}} = 4$ only.

Effective Persistence Image. We show an example of an effective persistence image of 1-dimensional holes in Figure 2, where we use the Llama 3 8B model and the SST dataset. The x-axis represents the layer at which a 1-dimensional hole is born, and the y-axis represents persistence, i.e., death layer minus birth layer. The color bar measures the number of 1-dimensional holes at a given grid point. We see that features born after the first half of the model's depth have a higher tendency to be long-lived than features born earlier on. This aspect will also be evident when computing the topological descriptors below. In Appendix E.1, we show a wider range of images, comparing models by taking the element-wise difference of effective persistence images.

Figure 2. Effective persistence image of 1-dimensional holes for the Llama 3 8B model using the SST dataset, where we fix $k_{\text{NN}} = 4$. The density plot shows the number of holes (color bar) for a given birth-persistence pair (x- and y-axis), where values refer to the model layer. This plot shows that a large number of 1-dimensional holes are short-lived and that long-lived features appear after the first half of the model.

Births Relative Frequency. In the left panel of Figure 3 we show B1 (Eq. 5) for Llama 3 8B on the SST dataset, for varying α = −1, 0, 0.5, 1, 2. We can clearly distinguish two behaviors for short- and long-lived 1-dimensional holes: the former peak at early layers and progressively decrease, while the latter peak at middle layers. Additionally, a strong increase in the number of births is seen in the last few layers. We highlight a horizontal line corresponding to the uniform distribution for comparison. We complement these results in Appendix E.3 with a comparison across models and datasets, finding qualitatively similar results.

Inter-Layer Persistence. We show the power-weighted inter-layer persistence (Eq. (8)) in the middle and right panels of Figure 3. In the middle panel, we use Llama 3 with the SST dataset at varying α = −1, 0, 0.5, 1, 2. As for B1, we can distinguish two behaviors for short-lived and long-lived features, though now Z1 traces the probability that features alive at a given layer are still alive at earlier or later layers. We see that for short-lived features this probability grows steadily until the second half of the model's depth, where it reaches a plateau, and then suddenly drops in the last few layers. On the other hand, for long-lived features there is a peak in probability at the middle layers. We can qualitatively see the same behavior for the other models in the right panel of Figure 3, though with quantitative differences across models. We cross-check against the other datasets in Appendix E.3, again finding qualitatively similar, but quantitatively different, results across datasets.

Figure 3. Left panel: births relative frequency, B1, as a function of the model's layers for Llama 3 8B on the SST dataset, for varying α, which traces short- (long-)lived features for negative (positive) values. Short-lived features peak early in the model and progressively decrease, while long-lived features peak in middle layers. All features experience a sharp growth in the last two layers. The dashed line represents a uniform distribution of births across the layers. Middle panel: inter-layer persistence as a function of the model's layers for Llama 3 8B on the SST dataset, for varying α. The persistence of short-lived features consistently grows and plateaus towards the end of the model, while long-lived features are primarily present across middle layers. Right panel: inter-layer persistence for Llama 3, Llama 2, Mistral and Pythia on the SST dataset at α = 2. The models considered exhibit qualitatively similar but quantitatively different behavior, with Pythia exhibiting the highest peak and Mistral the lowest; this appears inversely related to the model's performance when layers in this range are pruned. In all panels, curves and shaded regions represent the mean and standard deviation over 16 subsets of 500 prompts, respectively.
4.3. Interpretation and implications for the model's performance

In interpreting our results, it is essential to recognize that: 1) models process each token of the prompt, while we use only the last token as a proxy for the entire prompt; and 2) each prompt is processed separately from the others, so that each point moves a priori independently of the others in the representation space. Within this framework, the zigzag algorithm effectively tracks how the models dynamically organize prompts across both spatial and temporal dimensions (layers). Our findings, as illustrated in Figure 3, reveal four distinct phases:

- Early to Middle Layers: In the first layers, a large number of short-lived 1-dimensional holes are formed, indicating that most prompts are rapidly rearranged within a few layers. This finding relates to previous work identifying local contextualization (Lad et al., 2024) and increased dimensionality (Valeriani et al., 2023).

- Middle Layers: In this phase, 1-dimensional holes born in the middle layers have a higher probability of being long-lived than holes born in other phases. This implies that relative positions before, but especially after, these layers are kept relatively stable. This would seem related to the decreasing dimensionality found in (Cheng et al., 2024), since the relevant degrees of freedom estimated by the intrinsic dimension become progressively better distinguishable and more stable to noise.

- Middle to Late Layers: Short-lived 1-dimensional holes born after the first layers decrease in number. At the same time, the inter-layer persistence of short-lived 1-dimensional holes increases until after the middle layers, when it reaches a plateau. Concurrently, the probability (and the number) of long-lived ones drops. This indicates a phase of relatively few short-lived adjustments in the relative positions of prompts, since many of the features that formed in the middle layers are still present (because they are long-lived). We expect these short-lived adjustments to relate to a phase of specialization (Lad et al., 2024) and to a phase of relatively constant dimensionality (Valeriani et al., 2023).

- Last Layers: In the last two to three layers, the births relative frequency grows rapidly while the inter-layer persistence drops. These two behaviors are compatible, given that the fraction of newborn 1-dimensional holes is large and that there are no layers left in which to persist. These results suggest another strong rearrangement of points, which can be linked to the model producing the required output (Valeriani et al., 2023; Lad et al., 2024) and thus abruptly changing the position of prompts in representation space.

Relation to the model's performance. As a test of the interpretation of the 4 phases, we perform the following experiment: we prune blocks of layers with a sliding window from early to late layers, as a way to compare the relative importance of layers in the various phases. We show the results of the experiment in Figure 4, which reports the performance of the 4 models considered on the Winogrande benchmark, as a function of the sliding window of pruned layers. We see that removing layers in the first phase significantly affects performance. After the second phase, pruning only weakly affects performance, this being a phase of relative adjustments. Removing the last few layers causes another drop in performance. Interestingly, the overall performance of the models is inversely related to the peak height of the inter-layer persistence of long-lived features. This relation is also seen in the opposite limit, α < 0. We also notice that the two best-performing models, Mistral and Llama 3, exhibit a drop in performance when removing layers at the end of the third phase, right before the fourth. In Appendix G.1 we zoom in on these layers for the MMLU benchmark, where the drop is particularly evident, confirming that the drop is caused by removing the last 2-3 layers of the third phase. Importantly, all these results are qualitatively the same across the three benchmarks considered in this work.

Figure 4. Winogrande performance for the 4 models, obtained with a sliding window of 5 blocks of adjacent layers, moved through the models in steps of 2 layers. The experiment reflects the 4 phases: 1) removing early layers brings performance close to random choice; 2) performance grows to almost maximum after the middle layers and 3) plateaus; 4) removing late layers causes another drop in performance right before the end of the model.
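The sliding-window experiment can be sketched as follows for a Hugging Face Llama-style checkpoint. We assume the decoder blocks live in model.model.layers (true for the Llama family, an assumption for other architectures); this is our rendering, not the paper's exact code, and copying the full model per window is for clarity rather than efficiency.

```python
# Sketch: remove a window of `width` adjacent decoder blocks, sliding the
# window in steps of 2 layers, and re-evaluate the benchmark each time.
import copy
import torch.nn as nn

def prune_window(model, start, width=5):
    pruned = copy.deepcopy(model)          # clarity over memory efficiency
    blocks = list(pruned.model.layers)
    del blocks[start : start + width]
    pruned.model.layers = nn.ModuleList(blocks)
    pruned.config.num_hidden_layers = len(blocks)
    return pruned

# n_layers = len(model.model.layers)
# for start in range(0, n_layers - 5 + 1, 2):
#     evaluate(prune_window(model, start))   # e.g., via lm-eval-harness
```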
4.4. Layer Pruning

Recently, measures of layer similarity have been used to identify layers that contribute minimally to the performance of LLMs. These layers can be pruned, and the performance re-evaluated to validate this assumption. Given our results in Figure 4, we can argue that layers belonging to the third phase might be pruned without significantly affecting the model's performance. Consequently, we establish a pruning criterion based on the plateau observed in the inter-layer persistence of short-lived features. Specifically, we prune the layers that lie within 10% of the maximum value of Z1. This is computed for each model, using the Pile dataset as a proxy. [9] We show a schematic summary of the algorithm in Appendix G.2.

[9] We use Pile since it is characterized by a broad range of topics, representing a wider range of prompts. Note that the shape of the curve around the peak of Z1 is approximately similar across datasets (see Figure 9 in the Appendix).
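A minimal sketch of this criterion, assuming the curve of Eq. (8) for 1-dimensional holes (computed with α < 0 on the Pile subsets) is available as an array; the names and the tie-breaking convention at the threshold are ours.

```python
# Sketch of the pruning criterion: select every layer whose inter-layer
# persistence Z1 lies within 10% of the curve's maximum value.
import numpy as np

def layers_to_prune(z1):
    """z1: (n_layers,) array with the short-lived curve of Eq. (8)."""
    threshold = 0.9 * np.max(z1)
    return [int(i) for i in np.flatnonzero(z1 >= threshold)]

# The selected indices can then be removed in one block, e.g. with a routine
# like prune_window above, before re-running the benchmarks.
```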
We compare our layer pruning method to recent works (Gromov et al., 2024) and (Men et al., 2024) that perform pruning using similarity measures. Both approaches are designed to take as input the desired number of layers to prune, Nprune, and measure performance as Nprune grows. For a fair comparison, we feed the number of layers cut by our method as input to the other two methods, and verify which layers they select to cut given this input, along with the corresponding performance. We show which layers are cut by each method in Table 2 in Appendix G.2. Interestingly, both methods from (Gromov et al., 2024) and (Men et al., 2024) give the same result at fixed Nprune; thus we refer to them simply as "other works". We show performance results in Table 1, [10] where we indicate in bold the layer pruning method that has better or equal performance with respect to the other. Despite often selecting different layers, our zigzag-based pruning strategy achieves results comparable to the methods of (Gromov et al., 2024) and (Men et al., 2024).

              MMLU                              HellaSwag                         WinoGrande
Model         Full    This work  Other works    Full    This work  Other works    Full    This work  Other works
Llama 2       45.74   37.38      43.95          58.54   44.71      42.78          74.43   68.67      67.72
Llama 3       65.07   53.44      53.44          61.37   41.60      41.60          77.10   70.00      70.00
Mistral 7B    62.40   53.17      38.20          62.83   36.67      34.45          77.35   66.50      63.76
Pythia        -       -          -              49.70   31.43      34.96          63.30   55.71      58.09

Table 1. Benchmark table. For each benchmark, we show three columns: (i) Full, the accuracy of the model without any layer pruned; (ii) This work, the accuracy of the model with layers pruned following our algorithm (Appendix G.2); (iii) Other works, the accuracy obtained by taking the same number of pruned layers as estimated with our method and then computing the layers to be pruned with two different similarity measures: the angular distance from (Gromov et al., 2024) and the Block-Influence score from (Men et al., 2024). The chosen layers turn out to be the same for the two methods, so the results are condensed into one column.

[10] Results for Pythia on MMLU tasks are not shown because the model is not designed to follow the format of the tasks, as shown in (Biderman et al., 2023).

5. Conclusions

Recent work has argued in different ways that large language models process inputs across layers through distinct phases, and that understanding these phases is important for the models' interpretation. We exploit topological data analysis tools to build descriptors that allow us to statistically characterize the dynamics of prompts within the internal representations of large language models. Based on this characterization, we distinguish four phases and connect them to the model's behavior through experiments based on layer pruning and performance benchmarking. Our method consistently provides qualitatively similar results across different models, datasets, and parameter selections. Simultaneously, our topological descriptors allow for quantitative differentiation across models and datasets, creating opportunities for experiments designed to address more specific and practical questions regarding particular models or datasets.

There are several limitations in our study that future research could address. First, while our method shows robustness across hyperparameters within the framework, these choices need not be optimal. Defining an appropriate criterion for connecting points in the representation space, and consequently a filtration, is a delicate task in TDA that could require further investigation to detail the impact of the various choices on the construction of the filtration. Secondly, our study primarily focuses on static, pre-trained models. Extending this framework to track the evolution of internal representations during training might provide important insights into model efficiency and behavior.

6. Reproducibility

All the results contained in this work are reproducible by means of a GitHub repository that can be found at https://github.com/RitAreaSciencePark/ZigZagLLMs.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgments

We thank Mathieu Carrière and Magnus Botnan for helpful discussions on the TDA implementation and suggestions on the zigzag algorithm.
M.B., Y.G., and G.P. are partially supported by the Programma Nazionale della Ricerca (PNR) grant J95F21002830001 "FAIR-by-design". K.V. was partially supported by the Programma Nazionale della Ricerca (PNR) grant J95F21002830001 "FAIR-by-design" during his visit to Area Science Park while this project was in its development phase. A.A. and A.C. were supported by the project "Supporto alla diagnosi di malattie rare tramite l'intelligenza artificiale" (CUP: F53C22001770002). A.A. and A.C. were supported by the European Union Next Generation EU within the project PNRR PRP@CERIC IR0000028 - Mission 4 Component 2 Investment 3.1 Action 3.1.1. We thank the Area Science Park supercomputing platform ORFEO, made available for conducting the research reported in this paper, and the technical support of the Laboratory of Data Engineering staff.

References

AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Artzy, A. B. and Schwartz, R. Attend first, consolidate later: On the importance of attention in different LLM layers. In Belinkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., and Chen, H. (eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 177-184, Miami, Florida, US, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.blackboxnlp-1.10. URL https://aclanthology.org/2024.blackboxnlp-1.10/.

Bai, G., Chai, Z., Ling, C., Wang, S., Lu, J., Zhang, N., Shi, T., Yu, Z., Zhu, M., Zhang, Y., Yang, C., Cheng, Y., and Zhao, L. Beyond efficiency: A systematic survey of resource-efficient large language models, 2024.

Barannikov, S., Trofimov, I., Balabin, N., and Burnaev, E. Representation topology divergence: A method for comparing neural network representations. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 1607-1626. PMLR, 17-23 Jul 2022. URL https://proceedings.mlr.press/v162/barannikov22a.html.

Biagetti, M., Cole, A., and Shiu, G. The persistence of large scale structures. Part I. Primordial non-Gaussianity. Journal of Cosmology and Astroparticle Physics, 2021(04):061, April 2021. ISSN 1475-7516. doi: 10.1088/1475-7516/2021/04/061. URL http://dx.doi.org/10.1088/1475-7516/2021/04/061.

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M., Purohit, S., Prashanth, U., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling, 2023.

Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023.

Carlsson, G. Topology and data. Bulletin of the American Mathematical Society, 46(2):255-308, 2009.

Carlsson, G. and de Silva, V. Zigzag persistence. Found. Comput. Math., 10(4):367-405, August 2010.

Carlsson, G. E., de Silva, V., and Morozov, D. Zigzag persistent homology and real-valued functions. In SCG '09, 2009. URL https://api.semanticscholar.org/CorpusID:5801261.

Chazal, F., Fasy, B. T., Lecci, F., Rinaldo, A., and Wasserman, L. Stochastic convergence of persistence landscapes and silhouettes. In Proceedings of the Thirtieth Annual Symposium on Computational Geometry, SoCG '14, pp. 474-483, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450325943. doi: 10.1145/2582112.2582128. URL https://doi.org/10.1145/2582112.2582128.
Cheng, E., Kervadec, C., and Baroni, M. Bridging information-theoretic and geometric compression in language models. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12397-12420, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.762. URL https://aclanthology.org/2023.emnlp-main.762.

Cheng, E., Doimo, D., Kervadec, C., Macocco, I., Yu, J., Laio, A., and Baroni, M. Emergence of a high-dimensional abstraction phase in language transformers. arXiv preprint arXiv:2405.15471, 2024.

Clause, N., Kim, W., and Mémoli, F. The generalized rank invariant: Möbius invertibility, discriminating power, and connection to other invariants, 2024. URL https://arxiv.org/abs/2207.11591.

Dey, T. K. and Hou, T. Fast computation of zigzag persistence. In Chechik, S., Navarro, G., Rotenberg, E., and Herman, G. (eds.), 30th Annual European Symposium on Algorithms (ESA 2022), volume 244 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 43:1-43:15, Dagstuhl, Germany, 2022. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. ISBN 978-3-95977-247-1. doi: 10.4230/LIPIcs.ESA.2022.43. URL https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.43.

Dey, T. K., Kim, W., and Mémoli, F. Computing generalized rank invariant for 2-parameter persistence modules via zigzag persistence and its applications. Discrete & Computational Geometry, 71(1):67-94, January 2024. ISSN 1432-0444. doi: 10.1007/s00454-023-00584-z. URL https://doi.org/10.1007/s00454-023-00584-z.

Doimo, D., Glielmo, A., Ansuini, A., and Laio, A. Hierarchical nucleation in deep neural networks. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 7526-7536. Curran Associates, Inc., 2020.

Edelsbrunner, Letscher, and Zomorodian. Topological persistence and simplification. Discrete & Computational Geometry, 28:511-533, 2002.

El-Yaagoubi, A. B., Chung, M. K., and Ombao, H. Topological data analysis for multivariate time series data. Entropy, 25(11), 2023. ISSN 1099-4300. doi: 10.3390/e25111509. URL https://www.mdpi.com/1099-4300/25/11/1509.

Engels, J., Michaud, E. J., Liao, I., Gurnee, W., and Tegmark, M. Not all language model features are one-dimensionally linear. In The Thirteenth International Conference on Learning Representations. URL https://openreview.net/forum?id=d63a4AM4hb.

Fan, A., Grave, E., and Joulin, A. Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylO2yStDr.

Fan, C., Li, J., Zhang, T., Ao, X., Wu, F., Meng, Y., and Sun, X. Layer-wise model pruning based on mutual information. In Moens, M.-F., Huang, X., Specia, L., and Yih, S. W.-t. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3079-3090, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.246. URL https://aclanthology.org/2021.emnlp-main.246.

Gabriel, P. Unzerlegbare Darstellungen I. Manuscripta Mathematica, 6(1):71-103, March 1972. ISSN 1432-1785. doi: 10.1007/bf01298413. URL http://dx.doi.org/10.1007/BF01298413.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac'h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.

Gharooni-Fard, G., Byers, M., Deshmukh, V., Bradley, E., Mayo, C., Topaz, C. M., and Peleg, O. A computational topology-based spatiotemporal analysis technique for honeybee aggregation. npj Complexity, 1(1):3, 2024.

Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., and Roberts, D. A. The unreasonable ineffectiveness of the deeper layers, 2024. URL https://arxiv.org/abs/2403.17887.

He, S., Sun, G., Shen, Z., and Li, A. What matters in transformers? Not all attention is needed. CoRR, abs/2406.15786, 2024. URL https://doi.org/10.48550/arXiv.2406.15786.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021a.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021b. URL https://openreview.net/forum?id=7Bywt2mQsCe.

Hensel, F., Moor, M., and Rieck, B. A survey of topological machine learning methods. Frontiers in Artificial Intelligence, 4, 2021. ISSN 2624-8212. doi: 10.3389/frai.2021.681108. URL https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2021.681108.

Jha, A. H., Sherborne, T., Walsh, E. P., Groeneveld, D., Strubell, E., and Beltagy, I. Just CHOP: Embarrassingly simple LLM compression, 2024. URL https://arxiv.org/abs/2305.14864.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B, 2023. URL https://arxiv.org/abs/2310.06825.

Kim, B.-K., Kim, G., Kim, T.-H., Castells, T., Choi, S., Shin, J., and Song, H.-K. Shortened LLaMA: Depth pruning for large language models with comparison of retraining methods, 2024. URL https://arxiv.org/abs/2402.02834.

Kim, W. and Mémoli, F. Stable signatures for dynamic graphs and dynamic metric spaces via zigzag persistence. arXiv: Algebraic Topology, 2017. URL https://api.semanticscholar.org/CorpusID:44017453.

Kim, W. and Mémoli, F. Spatiotemporal persistent homology for dynamic metric spaces. Discrete Comput. Geom., 66(3):831-875, October 2021. ISSN 0179-5376. doi: 10.1007/s00454-019-00168-w. URL https://doi.org/10.1007/s00454-019-00168-w.

Kim, W. and Mémoli, F. Extracting persistent clusters in dynamic data via Möbius inversion. Discrete Comput. Geom., 71(4):1276-1342, October 2023. ISSN 0179-5376. doi: 10.1007/s00454-023-00590-1. URL https://doi.org/10.1007/s00454-023-00590-1.
Klabunde, M., Schumacher, T., Strohmaier, M., and Lemmerich, F. Similarity of neural network models: A survey of functional and representational measures. arXiv preprint arXiv:2305.06329, 2023.

Lacombe, T., Ike, Y., Carriere, M., Chazal, F., Glisse, M., and Umeda, Y. Topological uncertainty: Monitoring trained neural networks through persistence of activation graphs. arXiv [stat.ML], May 2021.

Lad, V., Gurnee, W., and Tegmark, M. The remarkable robustness of LLMs: Stages of inference? In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=R5unwb9KPc.

Le, M. Q. and Taylor, D. Persistent homology with k-nearest-neighbor filtrations reveals topological convergence of PageRank. Foundations of Data Science, 2024. doi: 10.3934/fods.2024038. URL https://www.aimsciences.org/article/id/66c30c8be7a25d6c964d771b.

Liao, Q. V. and Vaughan, J. W. AI transparency in the age of LLMs: A human-centered research roadmap, 2023. URL https://arxiv.org/abs/2306.01941.

Ma, X., Fang, G., and Wang, X. LLM-Pruner: On the structural pruning of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=J8Ajf9WfXP.

Magai, G. and Ayzenberg, A. Topology and geometry of data manifold in deep learning. arXiv, abs/2204.08624, April 2022.

Mandal, S., Guzmán-Sáenz, A., Haiminen, N., Basu, S., and Parida, L. A topological data analysis approach on predicting phenotypes from gene expression data. In Martín-Vide, C., Vega-Rodríguez, M. A., and Wheeler, T. (eds.), Algorithms for Computational Biology, pp. 178-187, Cham, 2020. Springer International Publishing. ISBN 978-3-030-42266-0.

McDonald, R., Neuhausler, R., Robinson, M., Larsen, L., Harrington, H., and Bruna, M. Zigzag persistence for coral reef resilience using a stochastic spatial model. Journal of the Royal Society Interface, 20:20230280, 08 2023. doi: 10.1098/rsif.2023.0280.

Men, X., Xu, M., Zhang, Q., Wang, B., Lin, H., Lu, Y., Han, X., and Chen, W. ShortGPT: Layers in large language models are more redundant than you expect, 2024. URL https://arxiv.org/abs/2403.03853.

Morozov, D. Dionysus2. URL https://www.mrzv.org/software/dionysus2/.

Myers, A., Muñoz, D., Khasawneh, F. A., and Munch, E. Temporal network analysis using zigzag persistence. EPJ Data Science, 12(1):6, 2023.

Naitzat, G., Zhitnikov, A., and Lim, L.-H. Topology of deep neural networks. Journal of Machine Learning Research, 21(184):1-40, 2020. URL http://jmlr.org/papers/v21/20-345.html.

Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K., and Mordvintsev, A. The building blocks of interpretability. Distill, 2018. doi: 10.23915/distill.00010.

Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., and Goldstein, T. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=XJk19XzGq2J.

Raiaan, M. A. K., Mukta, M. S. H., Fatema, K., Fahad, N. M., Sakib, S., Mim, M. M. J., Ahmad, J., Ali, M. E., and Azam, S. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access, 12:26839-26874, 2024. doi: 10.1109/ACCESS.2024.3365742.

Rathore, A., Chalapathi, N., Palande, S., and Wang, B. TopoAct: Visually exploring the shape of activations in deep learning. Computer Graphics Forum, 40(1):382-397. doi: https://doi.org/10.1111/cgf.14195. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.14195.
Rieck, B., Togninalli, M., Bock, C., Moor, M., Horn, M., Gumbsch, T., and Borgwardt, K. Neural persistence: A complexity measure for deep neural networks using algebraic topology. 2023. doi: 10.3929/ETHZ-B-000327207. URL http://hdl.handle.net/20.500.11850/327207.

Sajjad, H., Dalvi, F., Durrani, N., and Nakov, P. On the effect of dropping layers of pre-trained transformer models. Comput. Speech Lang., 77(C), January 2023. ISSN 0885-2308. doi: 10.1016/j.csl.2022.101429. URL https://doi.org/10.1016/j.csl.2022.101429.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/1907.10641.

Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., Bergeron, W., Kepner, J., Tiwari, D., and Gadepally, V. From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-9, 2023. doi: 10.1109/HPEC58863.2023.10363447.

Siddiqui, S. A., Dong, X., Heinrich, G., Breuel, T. M., Kautz, J., Krueger, D., and Molchanov, P. A deeper look at depth pruning of LLMs. CoRR, abs/2407.16286, 2024. URL https://doi.org/10.48550/arXiv.2407.16286.

Skaf, Y. and Laubenbacher, R. Topological data analysis in biomedicine: A review. Journal of Biomedical Informatics, 130:104082, 2022. ISSN 1532-0464. doi: https://doi.org/10.1016/j.jbi.2022.104082. URL https://www.sciencedirect.com/science/article/pii/S1532046422000983.

Skean, O., Arefin, M. R., and Shwartz-Ziv, R. Does representation matter? Exploring intermediate layers in large language models. In Workshop on Machine Learning and Compression, NeurIPS 2024, 2024. URL https://openreview.net/forum?id=FN0tZ9pVLz.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D13-1170.

Suresh, S., Das, B., Abrol, V., and Roy, S. D. On characterizing the evolution of embedding space of neural networks using algebraic topology. arXiv [cs.LG], November 2023.

Tausz, A. and Carlsson, G. E. Applications of zigzag persistence to topological data analysis. CoRR, abs/1108.3545, 2011. URL http://arxiv.org/abs/1108.3545.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.
Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.

Tulchinskii, E., Kuznetsov, K., Kushnareva, L., Cherniavskii, D., Nikolenko, S., Burnaev, E., Barannikov, S., and Piontkovskaya, I. Intrinsic dimension estimation for robust detection of AI-generated texts. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2024. Curran Associates Inc.

Tymochko, S., Munch, E., and Khasawneh, F. A. Using zigzag persistent homology to detect Hopf bifurcations in dynamical systems. Algorithms, 13(11):278, October 2020. ISSN 1999-4893. doi: 10.3390/a13110278. URL http://dx.doi.org/10.3390/a13110278.

Valeriani, L., Doimo, D., Cuturello, F., Laio, A., Ansuini, A., and Cazzaniga, A. The geometry of hidden representations of large transformer models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=cCYvakU5Ek.

Viswanathan, K., Gardinazzi, Y., Panerai, G., Cazzaniga, A., and Biagetti, M. The geometry of tokens in internal representations of large language models, 2025. URL https://arxiv.org/abs/2501.10573.

Xian, L., Adams, H., Topaz, C. M., and Ziegelmeier, L. Capturing dynamics of time-varying data via topology, 2022. URL https://www.aimsciences.org/article/id/2acaee54-6688-46a4-b35d-447f84c4c691.

Yang, J., Fang, H., Dhesi, J., Yoon, I. H., Bull, J. A., Byrne, H. M., Harrington, H. A., and Grindstaff, G. Topological classification of tumour-immune interactions and dynamics. arXiv preprint arXiv:2308.05294, 2023.

Yip, J. H. T., Biagetti, M., Cole, A., Viswanathan, K., and Shiu, G. Cosmology with persistent homology: a Fisher forecast. JCAP, 09:034, 2024. doi: 10.1088/1475-7516/2024/09/034.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision – ECCV 2014, pp. 818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472.

Zhang, M. and He, Y. Accelerating training of transformer-based language models with progressive layer dropping. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 14011–14023. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/a1140a3d0df1c81e24ae954d935e8926-Paper.pdf.

Zhang, M., Chowdhury, S., and Saggar, M. Temporal mapper: Transition networks in simulated and real neural dynamics. Network Neuroscience, 7(2):431–460, 2023.

Zhang, Y., Dong, Y., and Kawaguchi, K. Investigating layer importance in large language models. In Belinkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., and Chen, H. (eds.), Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp. 469–479, Miami, Florida, US, November 2024a. Association for Computational Linguistics. URL https://aclanthology.org/2024.blackboxnlp-1.29.

Zhang, Y., Li, Y., Wang, X., Shen, Q., Plank, B., Bischl, B., Rezaei, M., and Kawaguchi, K.
FinerCut: Finer-grained interpretable layer pruning for large language models, 2024b. URL https://arxiv.org/abs/2405.18218.

Zomorodian, A. and Carlsson, G. Computing persistent homology. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 347–356, 2004.

A. Mathematical Formulation of Zigzag Persistence

Zigzag persistence is a computational topology method that extends classical persistent homology to handle more complex data structures and filtration processes. Unlike standard persistence, which analyzes a single sequence of spaces filtered by inclusion, zigzag persistence allows for the exploration of data where sequences of spaces and maps can move both forward and backward. A zigzag filtration of topological spaces is a sequence

$$\chi : X_1 \leftrightarrow X_2 \leftrightarrow \cdots \leftrightarrow X_n, \tag{9}$$

where each $X_i$ is a topological space and each arrow $\leftrightarrow$ represents a continuous function pointing either forwards, $X_i \to X_{i+1}$, or backwards, $X_i \leftarrow X_{i+1}$. If we apply a homology functor $H_p$ with coefficients in a field $k$ to such a filtration, we get a zigzag filtration of $k$-vector spaces, called a zigzag module:

$$H_p(\chi) : H_p(X_1) \leftrightarrow H_p(X_2) \leftrightarrow \cdots \leftrightarrow H_p(X_n). \tag{10}$$

It is proven in (Carlsson & de Silva, 2010) that the algebraic classification of zigzag modules resembles Gabriel's classification of the persistence module described in (Gabriel, 1972). In particular, every finite-dimensional zigzag module, i.e. one in which all the $k$-vector spaces in the sequence are finite-dimensional, can be decomposed as a direct sum of interval modules, where a (finitely indexed) interval module is a module of the form

$$I_{[b,d]} : I_1 \leftrightarrow I_2 \leftrightarrow \cdots \leftrightarrow I_n, \tag{11}$$

where $I_i = k$ for $b \le i \le d$, $I_i = 0$ otherwise, and every arrow of the form $k \to k$ or $k \leftarrow k$ is the identity map. Moreover, the list of summands is unique up to reordering. The zigzag persistence diagram of a filtration $\chi$ in dimension $p$ is the multiset of intervals $[b, d]$ corresponding to the list of interval summands $I_{[b,d]}$ of $H_p(\chi)$. In other words,

$$\mathrm{Pers}_p(\chi) = \{[b_j, d_j] : j \in J\} \iff H_p(\chi) = \bigoplus_{j \in J} I_{[b_j, d_j]}. \tag{12}$$

Each interval $[b, d]$ is called a persistence interval and is thought of as a persistent homological feature of $\chi$ that appears at time $b$ (referred to as the "birth") and disappears at time $d$ (referred to as the "death"¹¹).

¹¹ When we say a p-dimensional hole "dies" in our setting, we mean that the corresponding homology class no longer persists in subsequent layers. In the zigzag filtration, this happens when the hole is no longer represented by an independent equivalence class in the homology group.
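To make the decomposition in (12) concrete, the following small worked example is our own illustration (not part of the original text), written out in LaTeX:

```latex
% A minimal zigzag of three point sets under inclusions:
%   X1 = {a},   X2 = {a, b},   X3 = {b},   with   X1 -> X2 <- X3.
% Applying H_0 with coefficients in k gives the zigzag module
\[
  H_0(\chi) :\; k \longrightarrow k^2 \longleftarrow k .
\]
% The class of a is present at times 1 and 2 (a is not in X3), and the
% class of b is present at times 2 and 3 (b is not in X1), so the module
% splits into two interval summands:
\[
  H_0(\chi) \;\cong\; I_{[1,2]} \oplus I_{[2,3]},
  \qquad
  \mathrm{Pers}_0(\chi) = \{[1,2],\,[2,3]\}.
\]
```

Each summand records one connected component together with the window of zigzag times during which it exists, which is exactly the birth/death bookkeeping used throughout this work.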
In our approach, the use of intersection layers is essential for computing zigzag persistence, as it allows the construction of injective maps between the kNN complexes of model layers (see (2))¹². Since our primary goal is to analyze the topological changes between model layers, we eliminate the construction of intersection layers while preserving the topological features by shifting each persistence interval so that the birth and death times occur strictly within the layers. For an interval $[b, d]$ in the zigzag persistence diagram of dimension $p$ of the filtration (2), the mapping that enables a bijective transformation to a new interval $[\hat b, \hat d]$¹³ defined only across model layers is the following:

$$\hat b = \begin{cases} b + 1 & \text{if } b \text{ is an intersection layer} \\ b & \text{otherwise,} \end{cases} \qquad \hat d = \begin{cases} d + 1 & \text{if } d \text{ is an intersection layer} \\ d & \text{otherwise.} \end{cases} \tag{13}$$

The relationship between the persistence image and the effective persistence image for $p$-dimensional holes, denoted respectively by $\mathrm{PI}_p$ and $\widehat{\mathrm{PI}}_p$, where $b, d$ are the model layers indexed by even numbers, is described by the following system of equations:

$$\widehat{\mathrm{PI}}_p(0, 0) = \mathrm{PI}_p(0, 0),$$
$$\widehat{\mathrm{PI}}_p(b/2, d/2) = \mathrm{PI}_p(b, d) + \mathrm{PI}_p(b-1, d) + \mathrm{PI}_p(b, d-1) + \mathrm{PI}_p(b-1, d-1),$$
$$\widehat{\mathrm{PI}}_p(b/2, \infty) = \mathrm{PI}_p(b, \infty) + \mathrm{PI}_p(b-1, \infty).$$

¹² An alternative method for constructing these maps and obtaining the zigzag persistence diagram is to use a filtration where, instead of intersections, the union of the complexes from two consecutive layers is considered. However, the Diamond Lemma, as discussed in (Carlsson et al., 2009), guarantees that both the intersection- and union-based filtrations encode the same homological information.

¹³ By construction, all resulting intervals contain even numbers, as the model layers are indexed with these numbers.

B. Zigzag Algorithm

The zigzag algorithm is schematically described below. It exploits two existing public codes that were developed for zigzag computations: DIONYSUS2 (Morozov) and FASTZIGZAG (Dey & Hou, 2022). DIONYSUS2 is a C++ library for computing persistent homology, with a specific library for zigzag persistence. In our case, it has the role of extracting the filtration f and computing the times array, i.e. the list of layer indices to be associated with the birth and death of features. FASTZIGZAG efficiently calculates the persistence diagram $\mathrm{Pers}_p(\Phi)$ by converting the input zigzag filtration to a non-zigzag filtration of an equivalent complex with the same length, and then converts the obtained persistence intervals back to zigzag.

Algorithm 1 Zigzag algorithm
Require: model, dataset, kNN, m
  reps ← extractRepresentations(model, dataset)
  K ← []
  for i ← 1 to model.getNumLayers() do
    graph ← kNearestNeighborsGraph(reps[i], kNN)
    K.append(graphExpansion(graph, m))
  end for
  Kint ← computeIntersectionLayers(K)
  f, times ← computeFiltrationTimes(K, Kint)
  Φ ← FastZigZag(f, times)

The computational cost of our algorithm is $O(n^2 N_{\mathrm{layers}}) + O(m^\omega)$, where the first term is the cost of building the kNN graph for the input dataset at each layer, and the second term is the theoretical cost of FASTZIGZAG, with $\omega < 2.37286$. The algorithm performs well even for the relatively large datasets we employ in this analysis: with 10K points embedded in a space of dimension d = 4096, a number of neighbors for the kNN graph of kNN = 10, and a maximum homology dimension of m = 10, it takes approximately 2 hours on an AMD EPYC 7H12.
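To make Algorithm 1 concrete, here is a minimal Python sketch of the pipeline (our illustration, not the released code). The kNN-complex construction and the intersection bookkeeping are simplified stand-ins (expansion is truncated at triangles, and helper names are ours), while `dionysus.zigzag_homology_persistence` is the actual DIONYSUS2 entry point; the FASTZIGZAG acceleration step is omitted here, so this runs on the slower DIONYSUS2 backend.

```python
# Sketch of the zigzag pipeline of Algorithm 1 (our illustration).
import itertools
import dionysus as d
from sklearn.neighbors import kneighbors_graph

def knn_complex(points, k, max_dim=2):
    """kNN graph of one layer's point cloud, expanded to a clique complex."""
    g = kneighbors_graph(points, k, mode="connectivity")
    g = g.maximum(g.T)  # symmetrize: keep an edge if either endpoint selects it
    n = g.shape[0]
    simplices = {frozenset([i]) for i in range(n)}
    edges = {frozenset((i, j)) for i, j in zip(*g.nonzero()) if i < j}
    simplices |= edges
    if max_dim >= 2:  # naive O(n^3) triangle enumeration, fine for a sketch
        for i, j, l in itertools.combinations(range(n), 3):
            if {frozenset((i, j)), frozenset((i, l)), frozenset((j, l))} <= edges:
                simplices.add(frozenset((i, j, l)))
    return simplices

def zigzag_diagrams(layer_reps, k):
    """Zigzag persistence with layers at even times, intersections at odd times."""
    complexes = [knn_complex(X, k) for X in layer_reps]
    timeline = []
    for t, K in enumerate(complexes):
        timeline.append(K)
        if t + 1 < len(complexes):
            timeline.append(K & complexes[t + 1])  # intersection layer
    simplices = sorted(set().union(*timeline), key=lambda s: (len(s), sorted(s)))
    f = d.Filtration([d.Simplex(sorted(s)) for s in simplices])
    times = []  # for each simplex: alternating entry/exit times along the zigzag
    for s in simplices:
        ts, inside = [], False
        for t, K in enumerate(timeline):
            if (s in K) != inside:
                ts.append(float(t))
                inside = not inside
        if inside:  # still present in the last layer
            ts.append(float(len(timeline)))
        times.append(ts)
    zz, dgms, cells = d.zigzag_homology_persistence(f, times)
    return dgms  # dgms[p] = persistence intervals of the p-dimensional holes
```

The returned birth/death times live on the layer-plus-intersection axis, so the shift of Equation (13) can then be applied to express them in model layers only.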
C. A Toy Example to Visualize Zigzag Barcodes

To demonstrate the zigzag persistence visualization framework, we analyze a simple calendar arithmetic task on the Mistral model. We use prompts of the form "Let's do some calendar math. Four months from [MONTH] is", where [MONTH] cycles through all twelve months. This example was used in (Engels et al.) to analyze circular representations in language models. We extract hidden representations from all 33 transformer layers¹⁴ and apply zigzag persistent homology with k = 2 neighbors and maximum simplex dimension 3. We analyze two token positions: month tokens (the semantic input) and answer tokens (given by the "is" token), so that for each token type the point cloud at each layer comprises 12 points, one per month prompt.

¹⁴ Layer 0 is the embedding layer.

Figure 5 shows the analysis for both token types. The persistence barcode for month tokens (top panel) exhibits a prominent feature from layers 5-14, while answer tokens (bottom panel) show persistent structure emerging later (layers 21-32).

Figure 5. Zigzag persistence analysis of calendar arithmetic tokens. Top panel: Month token representations exhibit persistent topological structure emerging at layer 5 and persisting until layer 14. The points are plotted with the first and second PCA components on the X and Y axes, respectively.
Bottom panel: Answer token representations show late-emerging persistent structure from layers 21-32, with the points plotted on the first and second PCA components. For each panel, the upper half displays k-nearest-neighbor graphs (k = 2) for the 12 prompts across selected transformer layers, while the lower half shows the corresponding 1-dimensional persistence barcodes tracking topological feature evolution across all 33 layers. The red dotted lines in the barcode plots indicate the specific layers visualized in the upper half.

The visualization demonstrates two key capabilities of zigzag barcodes:
- Layer-wise tracking: barcodes reveal when topological features emerge and disappear across network depth.
- Differential patterns: different token types exhibit distinct persistence signatures.

This toy example illustrates the framework's ability to visualize the evolution of geometric structure in transformer representations.

D. Combining the kNN Graph with the Vietoris-Rips Complex

The k-Nearest Neighbors (kNN) complex is built by expanding the corresponding kNN graph to a fixed dimension m. A key limitation of the kNN complex is that it ranks points by proximity without considering their actual distances. As a result, once k is fixed on each layer, each point is connected to its k nearest neighbors, regardless of the absolute distances involved. In our setting, the number of connected components (the Betti number¹⁵ β₀) of the kNN complexes as a function of the layers tends to be unity, i.e. the whole complex is connected, even for relatively small values of kNN ∼ 6. This implies that connected components carry no useful topological information about the internal representations. To address this issue, we follow the approach in (Naitzat et al., 2020), which combines the kNN complex with the Vietoris-Rips complex. Starting from the kNN graph, the idea is to introduce a threshold radius R on each layer, use it to filter out edges of the graph whose lengths are less than or equal to R, and then expand; we denote this new complex kNN-VR. This filtering step allows us to focus on longer-range connections, uncovering significant topological features that may be hidden by shorter, more local connections. To ensure consistency across layers, we select the radius R in each layer such that the number of connected components, β₀, falls in a pre-determined range. We then compute the observables presented in this work and verify the results. For clarity, we refer to the construction used in the main body as the kNN complex, and to the one presented in this section as the kNN-VR complex.

¹⁵ Betti numbers have been used in previous works (Naitzat et al., 2020; Suresh et al., 2023) for interpreting internal representations of neural networks. However, they describe each layer independently from the others, which is not the purpose of this work.
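The following is a minimal sketch of this construction (ours, not the paper's code): it removes the short edges from a per-layer kNN graph and bisects on the radius R until β₀, obtained from SciPy's `connected_components`, lands in the target range.

```python
# Sketch of the kNN-VR construction of Appendix D (our illustration).
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def knn_vr_graph(points, k, R):
    """kNN graph with edges of length <= R filtered out (long edges kept)."""
    g = kneighbors_graph(points, k, mode="distance")
    g = g.maximum(g.T)           # symmetrize the directed kNN relation
    g.data[g.data <= R] = 0.0    # drop short-range connections
    g.eliminate_zeros()
    return g

def beta0(graph):
    """Number of connected components of the filtered graph."""
    n_comp, _ = connected_components(graph, directed=False)
    return n_comp

def tune_radius(points, k, target=500, tol=100, iters=40):
    """Bisect on R so that beta_0 falls in [target - tol, target + tol]."""
    g = knn_vr_graph(points, k, 0.0)   # R = 0 removes nothing
    lo, hi = 0.0, g.data.max()         # beta_0 grows as R grows (more edges cut)
    for _ in range(iters):
        R = 0.5 * (lo + hi)
        b0 = beta0(knn_vr_graph(points, k, R))
        if abs(b0 - target) <= tol:
            return R
        if b0 < target:                # still too connected: cut more edges
            lo = R
        else:
            hi = R
    return R
```

The per-layer radius returned by `tune_radius` is then used to build the kNN-VR complex before the usual expansion step.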
For the sake of conciseness, we present only results for the inter-layer persistence Z. In Figure 6 we show the inter-layer persistence of 1-dimensional holes of the kNN and the kNN-VR complexes, and of the 0-dimensional holes of the kNN-VR complexes, computed by imposing β₀ = 500 ± 100.¹⁶ We observe that all three curves are qualitatively similar. This indicates the stability of the results, even when removing a considerable amount of short edges. Moreover, we observe the same behavior also for 0-dimensional holes, now that we have modified the complex such that their statistics are large enough to reliably compute persistence. We argue this indicates a universal (in homology) tendency to retain relational connections among points in the middle-late layers of the model.

¹⁶ We checked that the results are stable as long as β₀ is much lower than the total number of points.

Figure 6. Inter-layer persistence with weight α = 0 as a function of model layers, computed for Llama 3 8B on the SST dataset for both kNN and kNN-VR complexes. We impose the number of connected components β₀ = 500 ± 100 to build the kNN-VR complexes.

E. Consistency of Results

E.1. Effective Persistence Images across Models

Given a fixed dataset, effective persistence images can be calculated across models and subtracted element-wise to highlight differences in how the models process the same information. We show all possible comparisons for the 4 models considered in this work in Figure 7, which we calculate for the SST dataset. We can observe clear patterns in the differences across models, reflecting what is observed in Figure 3.

Figure 7. Element-wise difference of the effective persistence images calculated for Llama 2, Llama 3, Mistral and Pythia on the SST dataset. The color bar indicates a normalized difference between −1 and 1, on a logarithmic scale.

E.2. Larger Models

We verify that our topological descriptors exhibit the same qualitative behavior for larger models, namely Llama 2 13B, Llama 2 70B and Llama 3 70B, using the SST dataset. We show the births relative frequency and the inter-layer persistence in the left and right panels of Figure 8, respectively. As a representative value, we choose a weight of α = 0 for both descriptors, which gives equal weight to short- and long-lived features.

Figure 8. Births relative frequency (left) and inter-layer persistence (right) as a function of model depth for larger models, namely Llama 2 13B, Llama 2 70B, and Llama 3 70B, compared to Llama 3 8B, computed for the SST dataset with weight α = 0.

E.3. Varying Datasets

We test our topological descriptors on 4 different datasets, as presented in Section 4. As a reference, we consider the Llama 3 8B model and choose α = 0 as the weight for both the births relative frequency and the inter-layer persistence. We show results averaged over subsets of size 500 points in Figure 9. We note qualitative similarity for both descriptors across all datasets, though quantitative differences can clearly be seen, especially for the inter-layer persistence.

Figure 9. Births relative frequency and inter-layer persistence for weight α = 0 as a function of model layers for Llama 3 8B for a range of datasets, averaged over 16 subsets of size 500.
Interestingly, we observe that the Code dataset shows a slightly divergent behaviour in the middle layers. To investigate this further, we filter the Code dataset for 5 programming languages with different levels of verbosity (C, Java, HTML, Markdown, Python) and for each one we extract 10K prompts. We then calculate the inter-layer persistence for α = 1 and α = 2 for these datasets, so as to highlight short- and long-lived features separately. In this case, we average over 8 subsets of size 1000 for computational convenience. We show results in Figure 10. From these calculations, we gather that most programming languages exhibit a drop in the amount of short-lived features in the middle layers, the effect being more visible for verbose languages (Java, C). This feature is not seen for other types of datasets. These results deserve further investigation, which we leave to future work.

Figure 10. Inter-layer persistence for weights α = 1 (left panel) and α = 2 (right panel) as a function of model layers for Llama 3 8B for a range of programming languages, averaged over 8 subsets of size 1000.

E.4. Results as a Function of Varying Subsets

Here we show that our topological descriptors give consistent results across various subset sizes. While the variance increases for smaller subsets, descriptors computed over different subset sizes are within a standard deviation of each other. We show these results for both the births relative frequency and the inter-layer persistence calculated for the Llama 3 8B model on the SST dataset in Figure 11.

Figure 11. Births relative frequency (left) and inter-layer persistence (right) with weight α = 0 for the Llama 3 8B model computed on the SST dataset for different subset sizes, as a function of model layers.

Variance of Z1 as a function of data points. Given all the subsets with {100, 200, ..., 1000} points, we can calculate how the variance over these subsets scales with the size of the subset. As a test case, we take the Llama 3 8B model with the SST dataset and compute the inter-layer persistence at weight α = 0 over all the subsets. We then plot the variance as a function of the subset size for different layers. We choose 4 layers so that they are roughly representative of the dynamical phases identified in the main text. We show results in Figure 12, where we overlay a fitted curve in black. Apart from the first layers, where the variance grows with the number of points N, in later layers the relation is approximately σ² ∝ N^(−3/2). A minimal sketch of this power-law fit is given at the end of this appendix.

Figure 12. Variance of the inter-layer persistence with weight α = 0 for the Llama 3 8B model computed on the SST dataset as a function of subset size. We show four different layers: 3, 15, 23 and 32. The black dashed lines represent a fitted function.

E.5. Additional Results across Models

In this section we present supplementary evidence of the consistency across models of our descriptors, the effective persistence images, the births relative frequency and the inter-layer persistence, on Llama 2, Mistral, and Pythia, in Figure 13.

Figure 13. Supplementary plots for Llama 2, Mistral, and Pythia on the SST dataset. The first row displays the inter-layer persistence, the second row shows the births relative frequency, and the third row presents the effective persistence images.
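As referenced in E.4 above, the fit in Figure 12 amounts to a log-log least-squares regression. Here is a minimal sketch of ours, with synthetic numbers standing in for the actual measured variances:

```python
# Sketch of the power-law fit of Appendix E.4 (our illustration): fit
# sigma^2 ~ C * N^gamma to the variance of the inter-layer persistence
# computed over subsets of increasing size N.
import numpy as np

def fit_power_law(sizes, variances):
    """Least-squares fit of log(var) = log(C) + gamma * log(N)."""
    logN, logV = np.log(sizes), np.log(variances)
    gamma, logC = np.polyfit(logN, logV, deg=1)
    return np.exp(logC), gamma

# Illustrative use with synthetic numbers (not the paper's data):
sizes = np.arange(100, 1001, 100)
variances = 2.0 * sizes ** -1.5        # a sigma^2 ∝ N^(-3/2) trend
C, gamma = fit_power_law(sizes, variances)
print(f"fitted exponent gamma = {gamma:.2f}")   # ≈ -1.5
```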
F. A Shuffling Test

As a test of our topological descriptors and of the phases identified in Section 4, we shuffle the tokens within the prompts of the SST dataset, as a way of destroying the structure and semantic coherence of the prompts without modifying their unigram frequency distribution (see e.g. (Cheng et al., 2024) for an application of shuffling to internal representations of transformers). In Figures 14 and 15, we show the births relative frequency and the inter-layer persistence for shuffled and structured prompts. For the former, we can see a clear difference in behavior in the births of long-lived features between the shuffled and structured cases, the peak at middle layers being higher for shuffled prompts. A reversed trend is seen for the inter-layer persistence. Overall, the frequency of births of short-lived features is not significantly affected by the shuffling, while the inter-layer persistence drops in the second half of the model's depth. These findings deserve further investigation, which we postpone to future work.

Figure 14. Births relative frequency at weight α = 1 (left) and α = 2 (right) as a function of model layers for Llama 3 8B, for shuffled and unshuffled prompts from the SST dataset.

Figure 15. Inter-layer persistence at weight α = 1 (left) and α = 2 (right) as a function of model layers for Llama 3 8B, for shuffled and unshuffled prompts from the SST dataset.

G. More on Pruning

G.1. Sliding Window on Other Benchmarks

We test the sliding window experiments on the other two benchmarks, MMLU and HellaSwag, shown in Figures 17 and 16, respectively. For the case of MMLU, we zoom in on the drop in performance seen at the end of the third phase: we show the performance on the MMLU benchmark against block sizes of 5, 3, and 2 adjacent layers, with sliding windows of 2, 1 and 1 for the left, middle and right panels, respectively.

Figure 16. HellaSwag 5-shot benchmark run on Llama 3 8B, Llama 2 7B, Mistral and Pythia. A sliding window of size 5 is applied to cut blocks every 2 layers.

We see that performance is at the level of random choice during the increasing phase and is maximized close to the maximum inter-layer persistence during the plateau phase. Consistently with the Winogrande benchmark, we see a drop in performance right in correspondence with the decreasing phase. For both Llama and Mistral, the relevant layers are a few layers before the last. This finding deserves a closer investigation, which we leave for future work.

G.2. Layer pruning algorithm

Here we schematically describe the algorithm for layer pruning used to produce the results presented in Table 1.

Algorithm 2 Pruning algorithm
Require: Z1, model, threshold
  max ← max(Z1)
  layersToRemove ← []
  for l ← 1 to model.getNumLayers() do
    if Z1[l] > max − threshold then
      layersToRemove.append(l)
    end if
  end for
  model.removeLayers(layersToRemove)

Models    Nprune   Our method   Other works
Llama 2   8        [20-27]      [21-28]
Llama 3   9        [20-28]      [20-28]
Mistral   10       [20-29]      [19-28]
Pythia    11       [19-29]      [18-28]

Table 2. List of models and the layers that we cut for the pruning experiment. For a given Nprune, we show the layers cut with our method and with the Angular Distance and the BI Score (Other works).
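To make the selection criterion explicit, here is a minimal Python sketch (ours; it assumes the comparison in Algorithm 2 is against max(Z1) − threshold, and the actual layer surgery is left to the model library):

```python
# Sketch of the selection step of Algorithm 2 (our illustration): mark for
# removal the layers whose inter-layer persistence Z1 is within `threshold`
# of its maximum, i.e. the layers where prompt relations are most stable.
import numpy as np

def layers_to_prune(Z1, threshold):
    """Return the indices l with Z1[l] > max(Z1) - threshold."""
    Z1 = np.asarray(Z1, dtype=float)
    return [l for l in range(len(Z1)) if Z1[l] > Z1.max() - threshold]

# Illustrative use with a made-up persistence profile (not real data):
Z1 = np.concatenate([np.linspace(0.1, 0.9, 16),   # increasing phase
                     np.full(12, 0.9),            # plateau phase
                     np.linspace(0.9, 0.2, 4)])   # decreasing phase
print(layers_to_prune(Z1, threshold=0.05))        # a contiguous plateau block
```

On such a profile the selected layers form a contiguous block around the plateau, matching the contiguous ranges reported in Table 2.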
Figure 17. MMLU 5-shot benchmark run on Llama 3 8B, Llama 2 7B and Mistral. The different benchmarks are obtained by cutting blocks of layers of a fixed size and changing the starting point with a sliding window. Left panel: block size of 5 and sliding window of 2; middle panel: block size of 3 and sliding window of 1; right panel: block size of 2 and sliding window of 1.

          CMMLU                      Commonsense-QA              WSC
Model     full   this work  other    full   this work  other    full   this work  other
                            works                      works                      works
Llama 2   32.17  28.51      30.00    55.94  38.40      52.83    88.63  84.60      75.80
Llama 3   50.96  34.02      34.02    73.45  65.93      65.93    85.70  80.94      80.94
Mistral   44.47  38.84      29.68    69.78  62.33      30.62    87.18  69.60      72.15
Pythia    -      -          -        -      -          -        81.67  60.06      71.43

Table 3. Supplementary 5-shot benchmarks performed with the same methodology as in Table 1.

G.3. Additional benchmarks

In Table 3, we present additional benchmarks beyond those presented in the main text, using the pruning method outlined in Section 4.4. Results are overall consistent with what we find with the benchmarks in the main text.