Published as a conference paper at ICLR 2025

LANGUAGE MODELS ARE IMPLICITLY CONTINUOUS

Samuele Marro1, Davide Evangelista2, X. Angelo Huang3, Emanuele La Malfa4, Michele Lombardi2, Michael Wooldridge4

1 Department of Engineering Science, University of Oxford, Oxford, UK
2 Department of Computer Science, University of Bologna, Bologna, Italy
3 Department of Computer Science, ETH Zurich, Zurich, Switzerland
4 Department of Computer Science, University of Oxford, Oxford, UK

ABSTRACT

Language is typically modelled with discrete sequences. However, the most successful approaches to language modelling, namely neural networks, are continuous and smooth function approximators. In this work, we show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that fundamentally differ from humans. Our work formally extends Transformers to capture the nuances of time and space continuity in both input and output space. Our results challenge the traditional interpretation of how LLMs understand language, with several linguistic and engineering implications.

1 INTRODUCTION

In linguistics and computer science, language is typically modelled as a discrete sequence of symbols: a sentence is a sequence of words, phonemes, characters, or tokens drawn from a finite vocabulary. This characterisation underpins both linguistics (Hockett & Hockett, 1960; Chomsky, 1995; Studdert-Kennedy, 2005; Akmajian et al., 2017) and classic and recent algorithmic approaches to language modelling (Manning, 1999; Bengio et al., 2000; Mnih & Hinton, 2008).1 In Machine Learning, a successful paradigm for modelling language is that of Large Language Models (LLMs; Devlin, 2018; Brown et al., 2020).
In LLMs, language is modelled via an optimisation problem whose objective is to predict a word given its surrounding context (Peters et al., 2018; Radford et al., 2019), though recent advancements fine-tune the models with procedures inspired by reinforcement learning (Schulman et al., 2017; Rafailov et al., 2024). At their core, the architectures that model language, including feed-forward neural networks (Mikolov et al., 2013a), Long Short-Term Memory networks (LSTMs) (Hochreiter, 1997; Sundermeyer et al., 2012), and Transformers (Vaswani et al., 2017), approximate a discrete sequence of tokens with continuous, smooth functions. However, training inherently continuous models on discrete sequences does not imply that the models themselves treat language as discrete.

This paper explores how the tension between discrete data and continuous function approximators is synthesised in Transformer-based Large Language Models (Vaswani et al., 2017). To do so, we seamlessly generalise the Transformer architecture to support continuous inputs. This extension, which does not modify a model's weights or alter the architecture, allows the study of existing pretrained LLMs, including Llama (Dubey et al., 2024), Mistral (Jiang et al., 2023), and Gemma (Gemma Team et al., 2024b), with continuous input sequences.

These authors contributed equally. Corresponding author. Email: samuele.marro@eng.ox.ac.uk.
1 For completeness, a few notable works in linguistics and computer science model language as continuous: among others, Alkhouli et al. (2014) and Bowman et al. (2015) model sentences as continuous entities in latent space, while recent approaches in quantum NLP represent meaning as a superstate of different words (Guarasci et al., 2022).
By running experiments on state-of-the-art LLMs, we find that the language LLMs learn is implicitly continuous, as the models are able to handle, with minor modifications, inputs that are both time-continuous and space-continuous. In particular, we formally show that the results obtained by extending pretrained LLMs to handle time-continuous inputs strongly depend on a quantity, named duration, associated with each sentence. We also show in Section 4 that the semantics of this continuum significantly deviate from human intuition. Our results suggest that our intuition about human language can be misleading when applied to LLMs, as LLMs hold implicit representations of continuous sequences in unintuitive ways. Furthermore, these observations have practical consequences from an engineering perspective, as they suggest that it is possible to leverage the continuous representations of LLMs to pretrain them more efficiently. Our code is available at https://github.com/samuelemarro/continuous-llm-experiments.

2 RELATED WORK

Modern state-of-the-art pretrained language models operate on a discrete set of tokens and do not handle continuous inputs directly. However, in other domains, extensions of classical Transformers (Vaswani et al., 2017) to time-continuous inputs have recently been explored to tackle different problems. In modelling dynamical systems, Fonseca et al. (2023) have proposed adding new regularisations to a classical Transformer to create continuous behaviour, in an attempt to improve upon existing time-continuous models such as Neural ODEs (Chen et al., 2019; Kidger, 2022). In time series modelling, Chen et al. (2024) and Moreno-Pino et al. (2024) further developed the ideas advanced by Neural ODEs by integrating time-continuous Transformers and, consequently, superseding other existing approaches such as Neural CDEs (Kidger et al., 2020) and Neural RDEs (Morrill et al., 2021).
Another line of work considers time-continuous extensions of language modelling by processing language through networks that combine the flexibility of classical Transformers with the mathematical interpretability of Neural ODEs, such as ODE Transformer (Li et al., 2022a), CSAODE (Zhang et al., 2021), TransEvolve (Dutta et al., 2021), and N-ODE Transformer (Baier-Reinio & De Sterck, 2020). Several authors have also explored space-continuous extensions of LLMs (Tang et al., 2021; Schwenk, 2007; Ostling & Tiedemann, 2017), where the embedding space is expanded to include vectors not directly mapped to specific tokens, thereby enhancing the representational power of the models. A broad class of Diffusion Language Models (Li et al., 2022b; Gong et al., 2022; Lovelace et al., 2024a; Gulrajani & Hashimoto, 2024; Zhang et al., 2024; Lovelace et al., 2024b) employs a similar concept, where the model generates elements not necessarily tied to individual tokens in the embedding space, thereby effectively incorporating space-continuous extensions into language modelling.

Additionally, continuous representations in LLMs have been studied either in the context of concepts for which intuitive spectra exist (Gurnee & Tegmark, 2023; Arditi et al., 2024) or from a neuron-driven perspective (Anthropic, 2024). In particular, our work can be seen as complementary to the latter: while the authors of Anthropic (2024) show that certain neurons map to specific concepts (which can then be steered), we show that such concepts exist even at the embedding level, and we offer a theoretical framework to formally study such phenomena in an architecture-independent manner. Continuing the overview of related work, Ravfogel et al. (2022) remove concepts such as biases by erasing the pertinent area from which the concept emerges.
Our work shows, however, that LLMs can interpolate between overlapping concepts: erasing an entire area might thus also affect other concepts, which represents an interesting research direction. Moreover, Todd et al. (2023) show that some LLMs feature function vectors, which represent specific operations (e.g. sums). It is possible that some of the continuous behaviours in an LLM arise as function vectors.

Figure 1: Left: A graphical representation of the time continuity of language. Each observed token is obtained by sampling, at integer timesteps, a stepwise constant function defined on the real interval [0, T]. The length of each constant interval is the duration of the associated token. Right: A space-continuous extension of a sentence, where x(t) can represent any value.

Finally, our work can also be seen as a response to Vilnis & McCallum (2014), which trains inherently continuous embeddings: we show that this process is not necessary, as even discretely trained LLMs show similar behaviours.

3 A CONTINUOUS GENERALISATION OF LANGUAGE MODELS

In this section, we propose the hypothesis that language can be seen as the discretisation of a spatio-temporally continuous function whose value corresponds to a valid token at any integer timestep. As we will show, this assumption allows us to define a continuous extension of the classical causal Transformer module, namely the Continuous Causal Transformer (CCT). The CCT accepts spatio-temporally continuous functions as input while including models pretrained on regular time- and space-discrete data as a special case. Moreover, we will formally discuss the implications of this construction, showing the basic results required to describe the experiments presented later.
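The stepwise-constant view of Figure 1 (left) can be sketched in a few lines. The helper below is illustrative and not part of the released code; durations default to 1, so that sampling at integer timesteps recovers the original token sequence.

```python
# Sketch: a sentence as a piecewise-constant function over (0, T].
# Token i occupies the interval (p_{i-1}, p_i], where p_i is the
# cumulative sum of durations (its length is the token's duration).
from bisect import bisect_left

def sentence_function(tokens, durations=None):
    durations = durations or [1.0] * len(tokens)
    boundaries = []
    total = 0.0
    for d in durations:
        total += d
        boundaries.append(total)

    def x(t):
        # Value of the function at time t in (0, T].
        assert 0 < t <= boundaries[-1]
        return tokens[bisect_left(boundaries, t)]

    return x

x = sentence_function(["the", "cat", "sat"])
# Sampling at integer timesteps recovers the discrete sentence:
assert [x(t) for t in (1, 2, 3)] == ["the", "cat", "sat"]
```

With non-unit durations, the same function stretches or shrinks each token's interval while keeping the sequence of values intact, which is precisely the manipulation studied in the experiments.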
3.1 TIME CONTINUITY

Following classical approaches (Cotterell et al., 2023), we define a natural sentence as a sequence $\{w_1, w_2, \ldots, w_T\} \subseteq W$ of tokens, sampled from an underlying distribution $p(w_1, w_2, \ldots, w_T)$, where each token $w_t$ only depends on previous timesteps, i.e.

$$p(w_1, w_2, \ldots, w_T) = p_1(w_1) \prod_{t=2}^{T} p_t(w_t \mid w_1, \ldots, w_{t-1}).$$

$$[M]_{t,s} = \begin{cases} 0 & \text{if } t \ge s, \\ -\infty & \text{if } t < s. \end{cases} \tag{12}$$

The multi-head version of this causal attention mechanism typically used in modern architectures is obtained by computing $H$ independent versions $\{Y_1, \ldots, Y_H\}$ of Equation 10 and defining a learnable matrix $W^{(o)} \in \mathbb{R}^{dH \times d}$, which combines them to obtain a single transformed observation matrix:

$$Y = \mathrm{cat}(Y_1, \ldots, Y_H)\, W^{(o)T}. \tag{13}$$

A continuous version of a multi-head attention network can be simply obtained by considering Equation 10 on a single timestep $t \in [0, T]$. Indeed, it can be rewritten as:

$$y_t = \sum_{s=1}^{T} \mathrm{softmax}\!\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s} v_s, \tag{14}$$

where

$$\mathrm{softmax}\!\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s} = \frac{\exp\!\left(\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s}\right)}{\sum_{s'=1}^{T} \exp\!\left(\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s'}\right)}. \tag{15}$$

Consequently,

$$\mathrm{softmax}\!\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s} = \begin{cases} \frac{1}{Z_t} \exp\!\left(\frac{q_t^T k_s}{\sqrt{d}}\right) & \text{if } s \le t, \\ 0 & \text{otherwise}, \end{cases} \tag{16}$$

where $q_t$ and $k_s$ are the $t$-th and $s$-th rows of $Q$ and $K$, respectively. Altogether, the above equations imply that:

$$y_t = \frac{1}{Z_t} \sum_{s=1}^{t} \exp\!\left(\frac{q_t^T k_s}{\sqrt{d}}\right) v_s, \tag{17}$$

where $Z_t = \sum_{s'=1}^{t} \exp\!\left(\frac{q_t^T k_{s'}}{\sqrt{d}}\right)$ is a normalisation constant. The above formula is then used in Section 3 to define the CCT.

A.2 DERIVATION OF EQUATION 8

We recall that:

$$q(t) = W^{(q)} x(t) = \sum_{k=1}^{T} W^{(q)} x_k \, 1_{[a_k, b_k]}(t), \tag{18}$$

$$k(t) = W^{(k)} x(t) = \sum_{k=1}^{T} W^{(k)} x_k \, 1_{[a_k, b_k]}(t), \tag{19}$$

$$v(t) = W^{(v)} x(t) = \sum_{k=1}^{T} W^{(v)} x_k \, 1_{[a_k, b_k]}(t). \tag{20}$$

Consequently,

$$\begin{aligned}
y(t) &= \int_0^t \frac{1}{Z_t} \exp\!\left(\frac{q(t)^T k(s)}{\sqrt{d}}\right) v(s)\, ds \\
&\overset{(20)}{=} \sum_{k=1}^{T} \int_{a_k}^{b_k} \frac{1}{Z_t} \exp\!\left(\frac{q(t)^T k(s)}{\sqrt{d}}\right) W^{(v)} x_k \, ds \\
&\overset{(19)}{=} \sum_{k=1}^{T} \frac{b_k - a_k}{Z_t}\, W^{(v)} x_k \exp\!\left(\frac{q(t)^T W^{(k)} x_k}{\sqrt{d}}\right) \\
&\overset{(18)}{=} \sum_{k=1}^{T} \frac{b_k - a_k}{Z_t} \exp\!\left(\frac{q_t^T k_k}{\sqrt{d}}\right) v_k,
\end{aligned}$$

where $k_k = W^{(k)} x_k$, $v_k = W^{(v)} x_k$, and $q_t = W^{(q)} x_t$ for $t \in [a_t, b_t]$, which proves Equation 8.

B EXPERIMENTAL SETUP

We now describe the shared aspects of our experiments.
B.1 IMPLEMENTING A CCT

CCTs can be implemented with little effort by starting from the implementation of a regular Transformer and applying three modifications:

1. Modifying it so that it accepts arbitrary embeddings, rather than only tokens;
2. Modifying it so that positional indices can be floating points, instead of only integers;
3. Adding support for custom floating-point attention masks.

In our experiments, we used Hugging Face, which natively supports modifications 1 and 3 and can be easily adapted to support modification 2. Note that the last modification is necessary in order to support non-standard durations. In fact, the Euler discretisation of the integral in Equation (4) is equivalent to regular attention with carefully chosen attention coefficients.

Proof. Assume that we have $n$ samples at positions $p_1, \ldots, p_n$, which represent the values of a piecewise constant function defined over the intervals $(0, p_1], (p_1, p_2], \ldots, (p_{n-1}, p_n]$. The discretisation of Equation (4) is then:

$$y(t) = \sum_{i=1}^{n} \frac{1}{Z_t} (p_i - p_{i-1}) \exp\!\left(\frac{q(t)^T k(p_i)}{\sqrt{d}}\right) v(p_i), \tag{22}$$

with $p_0 = 0$. In other words, if we are using multiplicative attention coefficients, the discretisation is equivalent to applying attention coefficients of the form $p_i - p_{i-1}$. Intuitively, this means that the further apart two samples are, the higher the weight of the latter sample. Note that for additive coefficients we can simply bring $p_i - p_{i-1}$ inside the exponential:

$$y(t) = \sum_{i=1}^{n} \frac{1}{Z_t} \exp\!\left(\log(p_i - p_{i-1}) + \frac{q(t)^T k(p_i)}{\sqrt{d}}\right) v(p_i), \tag{23}$$

which is equivalent to an additive coefficient of $\log(p_i - p_{i-1})$.

In Practice. At the implementation level, a sequence of $n$ elements with durations $d_1, \ldots, d_n$ and embeddings $e_1, \ldots, e_n$ is fed to the extended Transformer as follows:

- The embeddings $e_1, \ldots, e_n$ are fed directly, rather than feeding the sequence as tokens and mapping them to embeddings;
- The positional encodings are defined such that no holes are left in the piecewise constant function.
In other words, the position $p_i$ is defined as $p_i = \sum_{j=1}^{i} d_j$;
- The durations are encoded using the formulae described in Equations (22) and (23).

B.2 EXPERIMENT HYPERPARAMETERS

Since we study the logits, we do not use any of the typical generation-related hyperparameters (e.g. temperature and top-k). Aside from the modifications described in Appendix B.1, we do not perform any others. Experiment-specific parameters are reported in the respective subsections of Appendix C.4.

C CONTINUITY - FULL RESULTS

C.1 SINGLE-TOKEN CONTINUITY

C.1.1 QUALITATIVE RESULTS

For single-token continuity, we shrink the subset of considered tokens with a coefficient in the range [0.1, 1]. Since the LLMs do not necessarily return a numeric value, all of the queries were wrapped in a prompt to coax the model into doing so. The template for our prompts is thus:

    Question: In the sentence "[REPEATED WORDS]", how many times is [CATEGORY] mentioned? Reply with a single-digit number
    Answer:

For Gemma 1, Llama 2, and Mistral we used a slight variation, since they did not return a numeric output:

    Question: How many [CATEGORY] are listed in the sentence "[REPEATED WORDS]"? Reply with a single-digit number
    Answer:

We used variations of these two prompts throughout most of this paper. See the source code for further information. Alongside "apples", we also tested the same prompt with the words "cat" (category: "animal") and "rose" (category: "flower"). See Figures 6 to 8 for full results.

C.1.2 QUANTITATIVE RESULTS

We reduce the duration of all the steps (following the procedure described in Appendix B) and measure the time sensitivity, i.e. the number k of unique applicable token peaks divided by the expected number of peaks n. For instance, if there are 4 steps, we expect to see peaks for 1, 2, 3, and 4.
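The duration manipulation used in these experiments builds on the encoding of Appendix B.1: durations enter causal attention as an additive $\log(p_i - p_{i-1})$ bias, as in Equation (23). Below is a minimal NumPy sketch with illustrative function names, not the paper's released implementation; with unit durations the bias vanishes and the function reduces to standard causal softmax attention.

```python
import numpy as np

def duration_causal_attention(Q, K, V, durations):
    """Causal attention where key/value i carries an additive bias
    log(p_i - p_{i-1}) = log(d_i), following Equations (22)-(23).
    With all durations equal to 1 the bias is 0 (standard attention)."""
    n, d = Q.shape
    widths = np.asarray(durations, dtype=float)  # p_i - p_{i-1} = d_i
    scores = Q @ K.T / np.sqrt(d) + np.log(widths)[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))

# Shrinking a token's duration towards zero removes its contribution:
Y_small = duration_causal_attention(Q, K, V, [1.0, 1.0, 1e-12, 1.0])
keep = [0, 1, 3]
Y_drop = duration_causal_attention(Q[keep], K[keep], V[keep], [1.0] * 3)
assert np.allclose(Y_small[-1], Y_drop[-1], atol=1e-6)
```

The final check illustrates the limiting behaviour exploited by the duration-shrinking experiments: as a token's duration factor approaches zero, the model's attention output approaches that of the sequence without the token.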
Our dataset is composed of 200 sentences with the same template as Appendix C.1.1. Note that, for some sentence-tokenizer combinations, a single word might be split into multiple tokens. As our analysis focuses on single-token words, we ignore such cases. Refer to Table 1 for the percentages of considered results for each model.

We define a unique relative peak as a situation where, for at least one duration factor, a certain class is the top prediction. For example, if by varying the duration factor the top class becomes "1", then "2", then "1" again, then "3", the unique relative peaks will be {1, 2, 3}. In our case, we only consider the probabilities of numerical tokens. We normalise the number of relative peaks by the number of expected peaks (e.g., if a word is repeated four times and we observe three unique relative peaks, the normalised frequency is 3/4 = 0.75). Note that, if the discrete interpretation of LLMs held true, we would only observe one unique relative peak (since the prediction would be constant as the duration factor varies). We call the hypothetical frequency under this hypothesis the counterfactual normalised frequency.

We report in Table 2 the counterfactual and observed normalised frequencies for the various models, as well as the average per-sample ratio between the two. Overall, the observed peak frequency is significantly higher than the counterfactual frequency; these results are thus incompatible with a discrete interpretation of LLMs. For the sake of completeness, we report in Table 3 the results where we only count the expected peaks as valid (i.e. if an element is repeated four times, only the unique relative peaks "1", "2", "3", and "4" are considered valid).
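The peak-counting metric above can be sketched as follows, assuming a hypothetical input format: the top predicted class at each duration factor, in sweep order.

```python
def unique_relative_peaks(top_classes):
    """Unique relative peaks: the distinct classes that are the top
    prediction for at least one duration factor."""
    return set(top_classes)

def normalised_peak_frequency(top_classes, n_expected):
    """Number of unique relative peaks divided by the expected number."""
    return len(unique_relative_peaks(top_classes)) / n_expected

# Top class over the duration sweep: "1", then "2", then "1" again, then "3".
sweep = ["1", "2", "1", "3"]
assert unique_relative_peaks(sweep) == {"1", "2", "3"}
# Word repeated four times, three unique peaks observed: 3/4 = 0.75.
assert normalised_peak_frequency(sweep, 4) == 0.75
# Under a purely discrete interpretation the top class never changes:
assert normalised_peak_frequency(["4"] * 10, 4) == 0.25
```

The last assertion corresponds to the counterfactual normalised frequency: one unique peak divided by the expected number of peaks.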
Figure 6: Predicted next token for the sentence "In the sentence 'apple apple apple apple', how many fruits are mentioned?" with duration shrinking, for all studied models: (a) Llama 3, (b) Llama 2, (c) Gemma 1, (d) Gemma 2, (e) Phi 3, (f) Mistral. Each panel plots the probability of "1", "2", other digits, and other tokens against the duration factor.

Figure 7: Predicted next token for the sentence "In the sentence 'cat cat cat cat', how many animals are mentioned?" with duration shrinking, for all studied models (same panels as Figure 6).
Figure 8: Predicted next token for the sentence "In the sentence 'rose rose rose rose', how many flowers are mentioned?" with duration shrinking, for all studied models (same panels as Figure 6).

Model Name      Valid Ratio
Llama 3 8b      93.5%
Llama 2 13b     54.5%
Gemma 1 7b      100.0%
Gemma 2 9b      100.0%
Phi 3 Medium    54.5%
Mistral 7b      76.0%
Global          79.8%

Table 1: Percentage of valid setups (i.e. setups where a word is treated as a single token) for the single-token counting experiment. The Global row refers to all valid records divided by all records.

Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2594                 0.8953           3.57
Llama 2 13b     0.2599                 0.9174           3.60
Gemma 1 7b      0.2610                 0.5127           1.95
Gemma 2 9b      0.2610                 0.8513           3.37
Phi 3 Medium    0.2599                 0.8836           3.43
Mistral 7b      0.2584                 0.4737           1.85
Global          0.2600                 0.7404           2.90

Table 2: Counterfactual and observed normalised peak frequencies for the single-token counting experiment, with all peaks (including unexpected ones). Note that the Global row is an average weighted by the number of valid records.
Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2594                 0.8914           3.56
Llama 2 13b     0.2599                 0.8263           3.27
Gemma 1 7b      0.2610                 0.3503           1.34
Gemma 2 9b      0.2610                 0.8513           3.37
Phi 3 Medium    0.2599                 0.792            3.09
Mistral 7b      0.2584                 0.4737           1.85
Global          0.2600                 0.6849           2.70

Table 3: Counterfactual and observed normalised peak frequencies for the single-token counting experiment, with only expected peaks. Note that the Global row is an average weighted by the number of valid records.

C.2 COUNTING EVENTS

C.2.1 QUALITATIVE RESULTS

Similarly to Appendix C.1, we used a prompt to coax the models into giving numeric outputs, as well as coefficients in the range [0.1, 1]. Alongside the shop example, we tested two other passages:

    The class went to the zoo. They saw a lion. They saw an elephant. They saw a giraffe. They saw a penguin. How many animals did the class see?

    Emily went to the beach. She found a seashell. She found a starfish. She found a smooth stone. She found a piece of seaweed. How many things did Emily find?

See Figures 9 to 11 for full results.

C.2.2 QUANTITATIVE RESULTS

In addition to our qualitative results, we report further quantitative experiments for time duration. We consider the sequential dataset from Lin et al. (2024), which contains 200 curated how-to tutorials split by step. Our template is as follows:

    Tutorial: [Tutorial Title]
    [Steps]
    Question: How many steps are necessary to complete the tutorial? Reply with a single-digit number
    Answer: It takes

We then compute, in the same fashion as Appendix C.1.2, the normalised peak frequency, both when considering all peaks (Table 4) and only the expected peaks (Table 5).
Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2186                 0.7564           3.70
Llama 2 13b     0.2186                 0.7274           3.42
Gemma 7b        0.2186                 0.6967           3.46
Gemma 2 9b      0.2186                 0.6188           3.21
Phi 3           0.2186                 0.7123           3.385
Mistral         0.2186                 0.5119           2.52
Global          0.2186                 0.6706           3.28

Table 4: Counterfactual and observed normalised peak frequencies for the event counting experiment, with all peaks (including unexpected ones).

Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2186                 0.6397           3.26
Llama 2 13b     0.2186                 0.4854           2.54
Gemma 7b        0.2186                 0.5984           3.04
Gemma 2 9b      0.2186                 0.5740           3.01
Phi 3           0.2186                 0.5436           2.75
Mistral         0.2186                 0.4517           2.29
Global          0.2186                 0.6097           3.05

Table 5: Counterfactual and observed normalised peak frequencies for the event counting experiment, with only expected peaks.

Figure 9: Predicted next token for the shop passage for all studied models, with duration shrinking (same panels as Figure 6).
Figure 10: Predicted next token for the zoo passage for all studied models, with duration shrinking (same panels as Figure 6).

Figure 11: Predicted next token for the beach passage for all studied models, with duration shrinking (same panels as Figure 6).

Model Name      Valid Rate
Llama 3 8b      60.0%
Llama 2 13b     19.5%
Gemma 1 7b      94.5%
Gemma 2 9b      95.0%
Phi 3 Medium    48.5%
Mistral 7b      93.0%
Global          68.4%

Table 6: Percentage of valid records (i.e. records where the unmodified output of the sum is correct) in the sums experiment.

C.3 NUMBER SUMS

C.3.1 QUALITATIVE RESULTS

Our experimental setup is identical to that of Appendix C.1. In addition to 24 + 13, we repeat our experiments with the sums 13 + 74 and 32 + 56. Refer to Figures 12 to 14 for full results.
C.3.2 QUANTITATIVE RESULTS

We consider a dataset of 100 questions involving sums, such as the following:

    Question: Mia delivered 82 packages in the morning and 38 packages in the afternoon. How many packages did Mia deliver in total?
    Answer: Mia delivered

The specific items and questions vary, but in all cases the answer is the sum of two two-digit numbers. For each sentence, we independently shrink each of the two numbers, for a total of 200 records. In each record, we observe how the predicted probabilities vary as the duration factor varies. We only consider records where the model correctly computes the first digit of the sum on the unaltered sentence (see Table 6 for a per-model breakdown).

Let y_o be the original label (i.e. the first digit of the sum of the two numbers) and Y_s the shrunk labels (i.e. the set of predicted digits in case the shrunk number is treated as a single-digit number). For instance, in the sum 24 + 37, the original label is "6" (24 + 37 = 61), but the shrunk labels are "2" (24 + 3 = 27) and "3" (24 + 7 = 31). Note that we ignore non-numerical labels. We then check three properties:

- P1: For a certain duration factor, the collective probability of the labels in Y_s is higher than that of y_o;
- P2: For a certain duration factor, the collective probability of the labels in Y_s is higher than that of y_o and any other numerical label;
- P3: As in P2, and additionally, at no point is another numerical label (i.e. neither y_o nor an element of Y_s) the top label.

Note that P3 implies P2 and that P2 implies P1. We report how frequently each property is true in Table 7.
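The property checks can be sketched as follows, under one reading of the definitions above: the collective probability of Y_s is compared against y_o and against each other numerical label individually, and the probability tables are assumed to be pre-filtered to numerical tokens. Names and input format are illustrative, not the released code.

```python
def check_properties(curves, y_o, Y_s):
    """curves: one dict per duration factor, mapping numerical labels
    to probabilities. Returns (P1, P2, P3) as in Appendix C.3.2."""
    p1 = p2 = False
    other_ever_top = False
    for probs in curves:
        shrunk = sum(probs.get(y, 0.0) for y in Y_s)
        orig = probs.get(y_o, 0.0)
        others = [p for lab, p in probs.items()
                  if lab != y_o and lab not in Y_s]
        if shrunk > orig:
            p1 = True
            if all(shrunk > p for p in others):
                p2 = True
        top = max(probs, key=probs.get)
        if top != y_o and top not in Y_s:
            other_ever_top = True
    return p1, p2, p2 and not other_ever_top

# 24 + 37: original label "6"; shrunk labels {"2", "3"}.
curves = [{"6": 0.7, "2": 0.1, "3": 0.1, "5": 0.1},
          {"6": 0.2, "2": 0.4, "3": 0.3, "5": 0.1}]
assert check_properties(curves, "6", {"2", "3"}) == (True, True, True)
```

The nesting mirrors the implications stated above: P2 can only hold at a factor where P1 holds, and P3 is P2 plus a global condition over the whole sweep.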
Figure 12: Predicted next token for the sentence "The sum of 24 and 13 is" in all studied models except Llama 3, with shrinking of 13: (a) Llama 2, (b) Gemma 1, (c) Gemma 2, (d) Phi 3, (e) Mistral. Each panel plots the probability of "3", "2", and other tokens against the duration factor.

Figure 13: Predicted next token for the sentence "The sum of 13 and 74 is" in all studied models except Llama 3, with shrinking of 74 (panels as in Figure 12; tracked tokens: "8", "2", "1", and other tokens).

Figure 14: Predicted next token for the sentence "The sum of 32 and 56 is" in all studied models except Llama 3, with shrinking of 32 (panels as in Figure 12; tracked tokens: "8", "3", and other tokens).
Model Name      P1 Frequency   P2 Frequency   P3 Frequency
Llama 3 8b      29.17%         25.83%         25.83%
Llama 2 13b     92.31%         87.18%         76.92%
Gemma 1 7b      97.88%         97.35%         90.48%
Gemma 2 9b      95.79%         95.79%         89.47%
Phi 3 Medium    100.00%        100.00%        87.63%
Mistral 7b      100.00%        100.00%        87.10%
Global          87.82%         86.97%         79.05%

Table 7: Frequency of each property for valid records in the sums experiment. Note that the Global row is an average weighted by the number of valid records.

C.4 SPACE CONTINUITY

C.4.1 QUALITATIVE RESULTS

We report the full results concerning interpolations of embeddings in the main paper in Figures 15 to 17 and 19. We also check that the intermediate interpolation does not correspond to any existing token by asking the LLM to repeat the embedding: "Repeat the word 8.", where "8" denotes the interpolated embedding. As shown in Figure 19, the repetition of "8" does not correspond to any existing token (as shown by the lack of peaks for tokens other than those related to apples and bananas). Additionally, we adapt some experiments to another pair of tokens, namely cats and dogs, where we find that the interpolation of cats and dogs is an animal, but whether cats-dogs meow depends on the position along the interpolation axis (see Figures 20 to 23). Similarly, refer to Figures 24 to 27 for our results on the water-juice interpolation.

C.4.2 BOOLEAN INTERPOLATION

We then test how our results compare with studies on interpolation of Boolean formulae. To do so, we perform linear interpolations of Boolean binary operators and study how intermediate operators behave. In particular, we study interpolations of:

- AND and OR;
- AND and XOR;
- AND and NAND;
- OR and NOR.

We report our results for all models whose tokenizers treat the operators as having the same number of tokens. While the models often struggle to compute the correct Boolean results for discrete inputs, we nonetheless observe the emergence of fuzzy operators, whose truth values can be best represented as floating points.
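The interpolations used throughout Appendix C.4 are linear in embedding space. A minimal sketch with toy two-dimensional embeddings follows; the real experiments use the model's own token embeddings.

```python
import numpy as np

def interpolate_embeddings(e_a, e_b, steps):
    """Linear interpolation between two embeddings; intermediate
    points generally do not correspond to any token in the vocabulary."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * e_a + a * e_b for a in alphas]

e_apples = np.array([1.0, 0.0])   # toy stand-in for the "apples" embedding
e_bananas = np.array([0.0, 1.0])  # toy stand-in for the "bananas" embedding
path = interpolate_embeddings(e_apples, e_bananas, 5)
assert np.allclose(path[0], e_apples) and np.allclose(path[-1], e_bananas)
assert np.allclose(path[2], [0.5, 0.5])  # midpoint: the "apples-bananas" vector
```

Each intermediate vector is substituted for a single token position in the prompt (the "8" placeholder above), and the model's next-token probabilities are recorded at every interpolation factor.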
C.4.3 QUANTITATIVE RESULTS

We consider 50 pairs of objects having some properties in common (e.g. apples and bananas are both fruits). For each pair, we consider one common property, one property that holds for only one element of the pair, one that holds for only the other element, and one property that is shared by neither of them, for a total of 200 records. We then interpolate (with 40 steps) between the sentence containing one object and the sentence containing the other. For instance, for a property shared by both apples and bananas, we interpolate between the sentences:

    Question: Can apples be eaten raw? (yes/no)\n Answer:
    Question: Can bananas be eaten raw? (yes/no)\n Answer:

and compute the predicted scores of "yes" and "no". Note that some questions may have borderline answers (i.e. the answer could be argued to be true or false); we still consider such questions valid, as we are interested in the variation of the output throughout the interpolation rather than the specific answer. We however ignore object-tokenizer pairs where the two resulting sentences have different tokenized lengths (as that prevents interpolation). We report the percentage of valid records in Table 8.

To measure the smoothness of the variation in output, we compute the maximum of the absolute derivative across the interpolation interval, i.e. $\max_{x \in [a,b]} |f'(x)|$ (which can be seen as an estimate of the Lipschitz constant). We then normalise by dividing by the amplitude (i.e. $\max_{x \in [a,b]} f(x) - \min_{x \in [a,b]} f(x)$). Although imperfect, this metric provides insight into the sharpest variation in output for a model. We report the average metric in Table 9. In general, we observe that Llama 2 and Gemma 2 have higher normalised maximum absolute derivatives, while the other models tend to have similar normalised values.
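The smoothness metric can be sketched with finite differences, assuming f is sampled at equally spaced interpolation factors; the helper name is illustrative, not the released implementation.

```python
import numpy as np

def normalised_max_abs_derivative(f_samples, a=0.0, b=1.0):
    """Finite-difference estimate of max |f'| over [a, b],
    normalised by the amplitude max f - min f."""
    f = np.asarray(f_samples, dtype=float)
    h = (b - a) / (len(f) - 1)              # spacing between samples
    max_abs_deriv = np.abs(np.diff(f)).max() / h
    amplitude = f.max() - f.min()
    return max_abs_deriv / amplitude

# A linear ramp over 41 points: |f'| = 1 everywhere, amplitude 1.
ramp = np.linspace(0.0, 1.0, 41)
assert np.isclose(normalised_max_abs_derivative(ramp), 1.0)
```

A gradual transition between the two answers yields a value near 1, while an abrupt flip concentrated in a single interpolation step yields a value close to the number of steps, which is what makes the metric a useful (if imperfect) proxy for sharpness.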
Figure 15: Predicted next token for the sentence "Are 8 red?", where 8 is an interpolation of "apples" and "bananas". [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; each plots the probabilities of "Yes", "No", and other tokens against the interpolation factor.]

Figure 16: Predicted next token for the sentence "Are 8 a fruit?", where 8 is an interpolation of "apples" and "bananas". [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]
Figure 17: Predicted next token for the sentence "The most common colour of 8 is", where 8 is an interpolation of "apples" and "bananas". We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; the reported tokens include "red", "green", and "yellow".]

Figure 18: Predicted next token for the sentence "Alice bought some 8 at the", where 8 is an interpolation of "apples" and "bananas". We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; the reported tokens include "supermarket", "market", "store", and "grocery".]
Figure 19: Predicted next token for the sentence "Repeat the word 8", where 8 is an interpolation of "apples" and "bananas". We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 20: Predicted next token for the sentence "Do 8 meow?", where 8 is an interpolation of "cats" and "dogs". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral; probabilities of "Yes", "No", and other tokens against the interpolation factor.]

Figure 21: Predicted next token for the sentence "Are 8 animals?", where 8 is an interpolation of "cats" and "dogs". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]
Figure 22: Predicted next token for the sentence "We bought two 8 at the", where 8 is an interpolation of "cats" and "dogs". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 23: Predicted next token for the sentence "Repeat the word 8.", where 8 is an interpolation of "cats" and "dogs". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 24: Predicted next token for the sentence "Does 8 contain sugar?", where 8 is an interpolation of "water" and "juice". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral; probabilities of "Yes", "No", and other tokens against the interpolation factor.]

Figure 25: Predicted next token for the sentence "Is 8 a drink?", where 8 is an interpolation of "water" and "juice". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 26: Predicted next token for the sentence "We drank some 8 in the", where 8 is an interpolation of "water" and "juice". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 27: Predicted next token for the sentence "Repeat the word 8.", where 8 is an interpolation of "water" and "juice". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 28: Predicted next token for interpolations of AND and OR in Llama 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1); each plots the probabilities of "True", "False", and other tokens against the interpolation factor.]

Figure 29: Predicted next token for interpolations of AND and OR in Llama 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 30: Predicted next token for interpolations of AND and OR in Gemma. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]
Figure 31: Predicted next token for interpolations of AND and OR in Gemma 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1); each plots the probabilities of "True", "False", and other tokens against the interpolation factor.]

Figure 32: Predicted next token for interpolations of AND and OR in Phi 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 33: Predicted next token for interpolations of AND and OR in Mistral. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 34: Predicted next token for interpolations of AND and XOR in Llama 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]
Figure 35: Predicted next token for interpolations of AND and XOR in Gemma. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1); each plots the probabilities of "True", "False", and other tokens against the interpolation factor.]

Figure 36: Predicted next token for interpolations of AND and XOR in Gemma 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 37: Predicted next token for interpolations of AND and NAND in Llama 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 38: Predicted next token for interpolations of AND and NAND in Gemma. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]
Figure 39: Predicted next token for interpolations of AND and NAND in Gemma 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Model Name     Valid Rate
Llama 3 8b     74.0%
Llama 2 13b    34.0%
Gemma 1 7b     84.0%
Gemma 2 9b     84.0%
Phi 3 Medium   34.0%
Mistral 7b     46.0%
Global         59.3%

Table 8: Percentage of valid records (i.e. where both sentences have the same tokenized length) in the embedding interpolation experiment.

Model Name     Normalised Maximum Absolute Derivative
Llama 3 8b     4.959
Llama 2 13b    9.005
Gemma 1 7b     5.090
Gemma 2 9b     5.402
Phi 3 Medium   9.279
Mistral 7b     7.445
Global         6.214

Table 9: Average normalised maximum absolute derivative for each model. The Global row is an average weighted by the number of valid records.

We also study whether the intermediate embeddings have outputs that differ significantly from those of the embedding extremes. In theory, we would expect the probability of an output computed on an intermediate embedding to be an interpolation between the values for the two objects. Let f(a) be the probability score for one extreme of the interpolation and f(b) the probability score for the other extreme. We consider our expected range to be [min(f(a), f(b)), max(f(a), f(b))]. We then verify whether, for some intermediate value, the predicted probability of "yes" or "no" falls outside its respective range.6 In particular, we compute the maximum difference between the allowed range and the actual value, i.e.:

$$m_{\text{diff}} = \max_{x \in [a,b]} \begin{cases} \min(f(a), f(b)) - f(x) & \text{if } f(x) < \min(f(a), f(b)) \\ f(x) - \max(f(a), f(b)) & \text{if } f(x) > \max(f(a), f(b)) \\ 0 & \text{otherwise} \end{cases} \quad (24)$$

We define mmax as the maximum of the mdiff values for "yes" and "no".
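The mdiff/mmax computation can be sketched directly on the sampled scores. A minimal sketch, assuming the predicted probabilities are NumPy arrays; the function name and sample values are illustrative, not the paper's code:

```python
import numpy as np

def m_diff(scores, f_a, f_b):
    """Maximum excursion of the sampled scores outside the expected
    range [min(f(a), f(b)), max(f(a), f(b))]; 0 if always inside."""
    lo, hi = min(f_a, f_b), max(f_a, f_b)
    below = np.clip(lo - scores, 0.0, None)  # excursion under the range
    above = np.clip(scores - hi, 0.0, None)  # excursion over the range
    return float(np.maximum(below, above).max())

# Hypothetical "yes"/"no" scores along the interpolation; the extremes
# f(a), f(b) are the first and last samples.
yes = np.array([0.9, 0.95, 0.2, 0.1])
no = np.array([0.1, 0.05, 0.8, 0.9])
m_max = max(m_diff(yes, yes[0], yes[-1]), m_diff(no, no[0], no[-1]))
print(m_max)  # the small overshoot at the second sample
```

A record whose intermediate outputs stay strictly between the two extremes yields m_max = 0; any overshoot or undershoot contributes positively.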
Note that if the function only takes values in [min(f(a), f(b)), max(f(a), f(b))], mmax will be 0. We report the average mmax and the percentage of records where mmax > 0.05 in Table 10. We also report histograms describing the distribution of mmax in Figure 40. In general, we observe that on average roughly 20% of records have an mmax higher than 0.05, which cannot be explained by a simple interpolation-based model of LLM behaviour.

D ADDITIONAL RESULTS

We complement our previous observations on time continuity with experiments on two common sequence transformations, namely shifting and scaling.

6 Note that increasing the probability of "yes" does not necessarily decrease the probability of "no", as the model can also produce spurious outputs.

Model Name     Average mmax   % (mmax > 0.05)
Llama 3 8b     0.0434         32.43%
Llama 2 13b    0.0639         23.53%
Gemma 1 7b     0.0332         22.02%
Gemma 2 9b     0.0178         9.52%
Phi 3 Medium   0.0212         14.71%
Mistral 7b     0.0362         22.83%
Global         0.0339         20.79%

Table 10: Average mmax for each model and percentage of records where mmax > 0.05. The Global row is an average weighted by the number of valid records.

Figure 40: Distribution of mmax for all the studied models. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; histograms of the variation.]

D.1 TRANSLATIONAL INVARIANCE

For shifting, we increment the positional embeddings of each token (as well as the lower bound of integration) by a fixed amount (up to 10) without actually changing the sentence's duration.
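A minimal sketch of the shift itself, applied to position indices (illustrative only; in an actual experiment the shifted indices would be fed to the model, e.g. through a position_ids-style argument):

```python
import numpy as np

def shifted_position_ids(seq_len, shift):
    """Position indices translated by `shift`; pairwise distances
    between tokens (what relative encodings such as RoPE see) are
    unchanged."""
    return np.arange(seq_len) + shift

base = shifted_position_ids(6, 0)    # positions 0..5
moved = shifted_position_ids(6, 10)  # positions 10..15
# All pairwise position differences are identical under the shift,
# which is why translation leaves RoPE-based outputs unaffected.
assert np.array_equal(np.subtract.outer(moved, moved),
                      np.subtract.outer(base, base))
print(moved)
```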
In particular, we feed the following input to Llama3-8B:

[Diagram: the tokens of "The capital of France is", each spanning its original time interval, with every position shifted by the same constant offset.]

In addition to the sentence "The capital of France is", we study the following sentences: "The Great Gatsby is my favourite"; "O Romeo, Romeo, wherefore art thou". We report our full results for translation in Figures 41 to 43.

D.2 (LACK OF) SCALE INVARIANCE

On the other hand, for scaling we increase or decrease the duration of the entire sentence:

[Diagram: the same sentence with each token's duration ds_i stretched by a factor greater than 1, so the whole sentence is dilated.]

See Figures 44 to 46 for our results.

D.3 ANALYSIS

As shown for instance in Figure 47, we find that while the impact of shifting on an LLM's output is negligible, scaling significantly changes how the LLM interprets the input (and thus what the LLM outputs). This phenomenon occurs regardless of the model and sentence, empirically confirming the theoretical observations on translation invariance made in Section 3. We believe that shifting does not affect an LLM's output because positional and rotary embeddings (Vaswani et al., 2017; Gemma Team et al., 2024b) are robust to translations: as long as the relative positions between tokens are preserved, a model's output remains consistent. On the other hand, scaling leads to significant variations in an LLM's output. Our results suggest that, beyond interpreting time continuously, our generalised Transformers treat duration as an intrinsic property of the input; this may explain why LLMs are robust to inputs with low frequency in the training data (e.g., their embeddings may interpolate with embeddings of similar semantics).

D.4 SHIFTING INVARIANCE WITH LEARNED POSITIONAL EMBEDDINGS

While the properties of common positional encodings (in particular, sinusoidal positional encoding and Rotary Positional Encoding) inherently incentivise translation invariance, we study whether the same phenomenon takes place in models with learned positional encodings.
To do so, we repeat our translation experiments on GPT2, which uses learned positional encodings. We report our results in Figure 48. While the magnitude of the effect is certainly weaker than with RoPE encodings, we observe that the top class remains consistent under translation for moderate shifts, which is consistent with our RoPE results.

Figure 41: Predicted next token for the sentence "The capital of France is" in all studied models with shifting. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; probability against the translation factor.]

Figure 42: Predicted next token for the sentence "The Great Gatsby is my favourite" in all studied models with shifting. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]
Figure 43: Predicted next token for the sentence "O Romeo, Romeo, wherefore art thou" in all studied models with shifting. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 44: Predicted next token for the sentence "The capital of France is" in all studied models with scaling. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; probability against the scaling factor.]

Figure 45: Predicted next token for the sentence "The Great Gatsby is my favourite" in all studied models with scaling. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 46: Predicted next token for the sentence "O Romeo, Romeo, wherefore art thou" in all studied models with scaling. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 47: Effect of shifting and scaling on Llama 3 for the sentence "The capital of France is". [(a) Shifting does not impact an LLM's output; (b) scaling visibly impacts an LLM's output.]

Figure 48: Predicted next token for the France, Gatsby and Romeo sentences in GPT2 with translation. [Panels (a)-(c): France, Gatsby, Romeo; probability against the translation factor.]