Published as a conference paper at ICLR 2025

LANGUAGE MODELS ARE IMPLICITLY CONTINUOUS

Samuele Marro1, Davide Evangelista2, X. Angelo Huang3, Emanuele La Malfa4, Michele Lombardi2, Michael Wooldridge4

1 Department of Engineering Science, University of Oxford, Oxford, UK
2 Department of Computer Science, University of Bologna, Bologna, Italy
3 Department of Computer Science, ETH Zurich, Zurich, Switzerland
4 Department of Computer Science, University of Oxford, Oxford, UK

ABSTRACT

Language is typically modelled with discrete sequences. However, the most successful approaches to language modelling, namely neural networks, are continuous and smooth function approximators. In this work, we show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. This phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that fundamentally differ from humans. Our work formally extends Transformers to capture the nuances of time and space continuity in both input and output space. Our results challenge the traditional interpretation of how LLMs understand language, with several linguistic and engineering implications.

1 INTRODUCTION

In linguistics and computer science, language is typically modelled as a discrete sequence of symbols: a sentence is a sequence of words, phonemes, characters, or tokens drawn from a finite vocabulary. This characterisation underpins both linguistics (Hockett & Hockett, 1960; Chomsky, 1995; Studdert-Kennedy, 2005; Akmajian et al., 2017) and classic and recent algorithmic approaches to language modelling (Manning, 1999; Bengio et al., 2000; Mnih & Hinton, 2008).1 In Machine Learning, a successful paradigm for modelling language is that of Large Language Models (LLMs; Devlin, 2018; Brown et al., 2020).
In LLMs, language is modelled via an optimisation problem whose objective is to predict a word given its surrounding context (Peters et al., 2018; Radford et al., 2019), though recent advancements fine-tune the models with procedures inspired by reinforcement learning (Schulman et al., 2017; Rafailov et al., 2024). At their core, the architectures that model language, including feed-forward neural networks (Mikolov et al., 2013a), Long Short-Term Memory networks (LSTMs) (Hochreiter, 1997; Sundermeyer et al., 2012), and Transformers (Vaswani et al., 2017), approximate a discrete sequence of tokens with continuous, smooth functions. However, training inherently continuous models on discrete sequences does not imply that the models themselves treat language as discrete.

This paper explores how the tension between discrete data and continuous function approximators is synthesised in Transformer-based Large Language Models (Vaswani et al., 2017). To do so, we seamlessly generalise the Transformer architecture to support continuous inputs. This extension, which does not modify a model's weights or alter the architecture, allows the study of existing pretrained LLMs, including Llama (Dubey et al., 2024), Mistral (Jiang et al., 2023), and Gemma (Gemma Team et al., 2024b), with continuous input sequences.

These authors contributed equally. Corresponding author. Email: samuele.marro@eng.ox.ac.uk.
1 For completeness, a few notable works in linguistics and computer science model language as continuous: among others, Alkhouli et al. (2014) and Bowman et al. (2015) model sentences as continuous entities in latent space, while recent approaches in quantum NLP represent meaning as a superstate of different words (Guarasci et al., 2022).
By running experiments on state-of-the-art LLMs, we find that the language LLMs learn is implicitly continuous, as the models are able to handle, with minor modifications, inputs that are both time-continuous and space-continuous. In particular, we formally show that the results obtained by extending pretrained LLMs to handle time-continuous inputs strongly depend on a quantity, named duration, associated with each sentence. We also show in Section 4 that the semantics of this continuum significantly deviate from human intuition. Our results suggest that our intuition about human language can be misleading when applied to LLMs, as LLMs hold implicit representations of continuous sequences in unintuitive ways. Furthermore, these observations have practical consequences from an engineering perspective, as they suggest that it is possible to leverage the continuous representations of LLMs to pretrain them more efficiently. Our code is available at https://github.com/samuelemarro/continuous-llm-experiments.

2 RELATED WORK

Modern state-of-the-art pretrained language models operate on a discrete set of tokens and do not handle continuous inputs directly. However, in other domains, extensions of classical Transformers (Vaswani et al., 2017) to time-continuous inputs have recently been explored to tackle different problems. In modelling dynamical systems, Fonseca et al. (2023) have proposed adding new regularisations to a classical Transformer to create continuous behaviour, in an attempt to improve upon existing time-continuous models such as Neural ODEs (Chen et al., 2019; Kidger, 2022). In time series modelling, Chen et al. (2024) and Moreno-Pino et al. (2024) further developed the ideas advanced by Neural ODEs by integrating time-continuous Transformers and, consequently, superseding other existing approaches such as Neural CDEs (Kidger et al., 2020) and Neural RDEs (Morrill et al., 2021).
Another line of work considers time-continuous extensions of language modelling by processing language through networks that combine the flexibility of classical Transformers with the mathematical interpretability of Neural ODEs, such as ODE Transformer (Li et al., 2022a), CSAODE (Zhang et al., 2021), TransEvolve (Dutta et al., 2021), and N-ODE Transformer (Baier-Reinio & De Sterck, 2020). Several authors have also explored space-continuous extensions of LLMs (Tang et al., 2021; Schwenk, 2007; Ostling & Tiedemann, 2017), where the embedding space is expanded to include vectors not directly mapped to specific tokens, thereby enhancing the representational power of the models. A broad class of Diffusion Language Models (Li et al., 2022b; Gong et al., 2022; Lovelace et al., 2024a; Gulrajani & Hashimoto, 2024; Zhang et al., 2024; Lovelace et al., 2024b) employs a similar concept, where the model generates elements not necessarily tied to individual tokens in the embedding space, thereby effectively incorporating space-continuous extensions into language modelling.

Additionally, continuous representations in LLMs have been studied either in the context of concepts for which intuitive spectra exist (Gurnee & Tegmark, 2023; Arditi et al., 2024) or from a neuron-driven perspective (Anthropic, 2024). In particular, our work can be seen as complementary to the latter: while the authors of Anthropic (2024) show that certain neurons map to specific concepts (which can then be steered), we show that such concepts exist even at the embedding level, and we offer a theoretical framework to formally study such phenomena in an architecture-independent manner. Continuing the overview of related work, Ravfogel et al. (2022) remove concepts such as biases by erasing the pertinent area from which the concept emerges.
Our work shows, however, that LLMs can interpolate between overlapping concepts: erasing an entire area might thus also affect other concepts, which represents an interesting research direction. Moreover, Todd et al. (2023) show that some LLMs feature function vectors, which represent specific operations (e.g. sums). It is possible that some of the continuous behaviours in an LLM arise as function vectors.

Figure 1: Left: A graphical representation of the time continuity of language. Each observed token is obtained by sampling, at integer timesteps, a stepwise constant function defined on the real interval [0, T]. The length of each constant interval is the duration of the associated token. Right: A space-continuous extension of a sentence, where x(t) can represent any value.

Finally, our work can also be seen as a response to Vilnis & McCallum (2014), which trains inherently continuous embeddings: we show that this process is not necessary, as even discretely trained LLMs show similar behaviours.

3 A CONTINUOUS GENERALISATION OF LANGUAGE MODELS

In this section, we propose the hypothesis that language can be seen as the discretisation of a spatio-temporally continuous function whose value corresponds to a valid token at any integer timestep. As we will show, this assumption allows us to define a continuous extension of the classical causal Transformer module, namely the Continuous Causal Transformer (CCT). The CCT accepts spatio-temporally continuous functions as input while including models pretrained on regular time- and space-discrete data as a special case. Moreover, we will formally discuss the implications of this construction, showing the basic results required to describe the experiments presented later.
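The stepwise-constant view of Figure 1 (left) can be sketched in a few lines. The helper below is illustrative and not part of the released code; durations default to 1, so that sampling at integer timesteps recovers the original token sequence.

```python
# Sketch: a sentence as a piecewise-constant function over (0, T].
# Token i occupies the interval (p_{i-1}, p_i], where p_i is the
# cumulative sum of durations (its length is the token's duration).
from bisect import bisect_left

def sentence_function(tokens, durations=None):
    durations = durations or [1.0] * len(tokens)
    boundaries = []
    total = 0.0
    for d in durations:
        total += d
        boundaries.append(total)

    def x(t):
        # Value of the function at time t in (0, T].
        assert 0 < t <= boundaries[-1]
        return tokens[bisect_left(boundaries, t)]

    return x

x = sentence_function(["the", "cat", "sat"])
# Sampling at integer timesteps recovers the discrete sentence:
assert [x(t) for t in (1, 2, 3)] == ["the", "cat", "sat"]
```

With non-unit durations, the same function stretches or shrinks each token's interval while keeping the sequence of values intact, which is precisely the manipulation studied in the experiments.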
3.1 TIME CONTINUITY

Following classical approaches (Cotterell et al., 2023), we define a natural sentence as a sequence $\{w_1, w_2, \ldots, w_T\} \subseteq W$ of tokens, sampled from an underlying distribution $p(w_1, w_2, \ldots, w_T)$, where each token $w_t$ only depends on previous timesteps, i.e.

$$p(w_1, w_2, \ldots, w_T) = p_1(w_1) \prod_{t=2}^{T} p_t(w_t \mid w_1, \ldots, w_{t-1}).$$

$$[M]_{t,s} = \begin{cases} 0 & \text{if } t \ge s, \\ -\infty & \text{if } t < s. \end{cases} \tag{12}$$

The multi-head version of this causal attention mechanism typically used in modern architectures is obtained by computing $H$ independent versions $\{Y_1, \ldots, Y_H\}$ of Equation 10 and defining a learnable matrix $W^{(o)} \in \mathbb{R}^{dH \times d}$, which combines them to obtain a single transformed observation matrix:

$$Y = \mathrm{cat}(Y_1, \ldots, Y_H)\, W^{(o)T}. \tag{13}$$

A continuous version of a multi-head attention network can be simply obtained by considering Equation 10 on a single timestep $t \in [0, T]$. Indeed, it can be rewritten as:

$$y_t = \sum_{s=1}^{T} \mathrm{softmax}\!\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s} v_s, \tag{14}$$

where

$$\mathrm{softmax}\!\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s} = \frac{\exp\!\left(\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s}\right)}{\sum_{s'=1}^{T} \exp\!\left(\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s'}\right)}. \tag{15}$$

Consequently,

$$\mathrm{softmax}\!\left[\frac{QK^T}{\sqrt{d}} + M\right]_{t,s} = \begin{cases} \frac{1}{Z_t} \exp\!\left(\frac{q_t^T k_s}{\sqrt{d}}\right) & \text{if } s \le t, \\ 0 & \text{otherwise}, \end{cases} \tag{16}$$

where $q_t$ and $k_s$ are the $t$-th and $s$-th rows of $Q$ and $K$, respectively. Altogether, the above equations imply that:

$$y_t = \frac{1}{Z_t} \sum_{s=1}^{t} \exp\!\left(\frac{q_t^T k_s}{\sqrt{d}}\right) v_s, \tag{17}$$

where $Z_t = \sum_{s'=1}^{t} \exp\!\left(\frac{q_t^T k_{s'}}{\sqrt{d}}\right)$ is a normalisation constant. The above formula is then used in Section 3 to define the CCT.

A.2 DERIVATION OF EQUATION 8

We recall that:

$$q(t) = W^{(q)} x(t) = \sum_{k=1}^{T} W^{(q)} x_k \, 1_{[a_k, b_k]}(t), \tag{18}$$

$$k(t) = W^{(k)} x(t) = \sum_{k=1}^{T} W^{(k)} x_k \, 1_{[a_k, b_k]}(t), \tag{19}$$

$$v(t) = W^{(v)} x(t) = \sum_{k=1}^{T} W^{(v)} x_k \, 1_{[a_k, b_k]}(t). \tag{20}$$

Consequently,

$$\begin{aligned}
y(t) &= \int_0^t \frac{1}{Z_t} \exp\!\left(\frac{q(t)^T k(s)}{\sqrt{d}}\right) v(s)\, ds \\
&\overset{(20)}{=} \sum_{k=1}^{T} \int_{a_k}^{b_k} \frac{1}{Z_t} \exp\!\left(\frac{q(t)^T k(s)}{\sqrt{d}}\right) W^{(v)} x_k \, ds \\
&\overset{(19)}{=} \sum_{k=1}^{T} \frac{b_k - a_k}{Z_t}\, W^{(v)} x_k \exp\!\left(\frac{q(t)^T W^{(k)} x_k}{\sqrt{d}}\right) \\
&\overset{(18)}{=} \sum_{k=1}^{T} \frac{b_k - a_k}{Z_t} \exp\!\left(\frac{q_t^T k_k}{\sqrt{d}}\right) v_k,
\end{aligned}$$

where $k_k = W^{(k)} x_k$, $v_k = W^{(v)} x_k$, and $q_t = W^{(q)} x_t$ for $t \in [a_t, b_t]$, which proves Equation 8.

B EXPERIMENTAL SETUP

We now describe the shared aspects of our experiments.
B.1 IMPLEMENTING A CCT

CCTs can be implemented with little effort by starting from the implementation of a regular Transformer and applying three modifications:

1. Modifying it so that it accepts arbitrary embeddings, rather than only tokens;
2. Modifying it so that positional indices can be floating points, instead of only integers;
3. Adding support for custom floating-point attention masks.

In our experiments, we used Hugging Face, which natively supports modifications 1 and 3 and can be easily adapted to support modification 2. Note that the last modification is necessary in order to support non-standard durations. In fact, the Euler discretisation of the integral in Equation (4) is equivalent to regular attention with carefully chosen attention coefficients.

Proof. Assume that we have $n$ samples at positions $p_1, \ldots, p_n$, which represent the values of a piecewise constant function defined over the intervals $(0, p_1], (p_1, p_2], \ldots, (p_{n-1}, p_n]$. The discretisation of Equation (4) is then:

$$y(t) = \sum_{i=1}^{n} \frac{1}{Z_t} (p_i - p_{i-1}) \exp\!\left(\frac{q(t)^T k(p_i)}{\sqrt{d}}\right) v(p_i), \tag{22}$$

with $p_0 = 0$. In other words, if we are using multiplicative attention coefficients, the discretisation is equivalent to applying attention coefficients of the form $p_i - p_{i-1}$. Intuitively, this means that the further apart two samples are, the higher the weight of the latter sample. Note that for additive coefficients we can simply bring $p_i - p_{i-1}$ inside the exponential:

$$y(t) = \sum_{i=1}^{n} \frac{1}{Z_t} \exp\!\left(\log(p_i - p_{i-1}) + \frac{q(t)^T k(p_i)}{\sqrt{d}}\right) v(p_i), \tag{23}$$

which is equivalent to an additive coefficient of $\log(p_i - p_{i-1})$.

In Practice. At the implementation level, a sequence of $n$ elements with durations $d_1, \ldots, d_n$ and embeddings $e_1, \ldots, e_n$ is fed to the extended Transformer as follows:

- The embeddings $e_1, \ldots, e_n$ are fed directly, rather than feeding the sequence as tokens and mapping them to embeddings;
- The positional encodings are defined such that no holes are left in the piecewise constant function.
In other words, the position $p_i$ is defined as $p_i = \sum_{j=1}^{i} d_j$;
- The durations are encoded using the formulae described in Equations (22) and (23).

B.2 EXPERIMENT HYPERPARAMETERS

Since we study the logits, we do not use any of the typical generation-related hyperparameters (e.g. temperature and top-k). Aside from the modifications described in Appendix B.1, we do not perform any others. Experiment-specific parameters are reported in the respective subsections of Appendix C.4.

C CONTINUITY - FULL RESULTS

C.1 SINGLE-TOKEN CONTINUITY

C.1.1 QUALITATIVE RESULTS

For single-token continuity, we shrink the subset of considered tokens with a coefficient in the range [0.1, 1]. Since the LLMs do not necessarily return a numeric value, all of the queries were wrapped in a prompt to coax the model into doing so. The template for our prompts is thus:

    Question: In the sentence "[REPEATED WORDS]", how many times is [CATEGORY] mentioned? Reply with a single-digit number
    Answer:

For Gemma 1, Llama 2, and Mistral we used a slight variation, since they did not return a numeric output:

    Question: How many [CATEGORY] are listed in the sentence "[REPEATED WORDS]"? Reply with a single-digit number
    Answer:

We used variations of these two prompts throughout most of this paper. See the source code for further information. Alongside "apples", we also tested the same prompt with the words "cat" (category: "animal") and "rose" (category: "flower"). See Figures 6 to 8 for full results.

C.1.2 QUANTITATIVE RESULTS

We reduce the duration of all the steps (following the procedure described in Appendix B) and measure the time sensitivity, i.e. the number k of unique applicable token peaks divided by the expected number of peaks n. For instance, if there are 4 steps, we expect to see peaks for 1, 2, 3, and 4.
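The duration manipulation used in these experiments builds on the encoding of Appendix B.1: durations enter causal attention as an additive $\log(p_i - p_{i-1})$ bias, as in Equation (23). Below is a minimal NumPy sketch with illustrative function names, not the paper's released implementation; with unit durations the bias vanishes and the function reduces to standard causal softmax attention.

```python
import numpy as np

def duration_causal_attention(Q, K, V, durations):
    """Causal attention where key/value i carries an additive bias
    log(p_i - p_{i-1}) = log(d_i), following Equations (22)-(23).
    With all durations equal to 1 the bias is 0 (standard attention)."""
    n, d = Q.shape
    widths = np.asarray(durations, dtype=float)  # p_i - p_{i-1} = d_i
    scores = Q @ K.T / np.sqrt(d) + np.log(widths)[None, :]
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))

# Shrinking a token's duration towards zero removes its contribution:
Y_small = duration_causal_attention(Q, K, V, [1.0, 1.0, 1e-12, 1.0])
keep = [0, 1, 3]
Y_drop = duration_causal_attention(Q[keep], K[keep], V[keep], [1.0] * 3)
assert np.allclose(Y_small[-1], Y_drop[-1], atol=1e-6)
```

The final check illustrates the limiting behaviour exploited by the duration-shrinking experiments: as a token's duration factor approaches zero, the model's attention output approaches that of the sequence without the token.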
Our dataset is composed of 200 sentences with the same template as Appendix C.1.1. Note that, for some sentence-tokenizer combinations, a single word might be split into multiple tokens. As our analysis focuses on single-token words, we ignore such cases. Refer to Table 1 for the percentages of considered results for each model.

We define a unique relative peak as a situation where, for at least one duration factor, a certain class is the top prediction. For example, if by varying the duration factor the top class becomes "1", then "2", then "1" again, then "3", the unique relative peaks will be {1, 2, 3}. In our case, we only consider the probabilities of numerical tokens. We normalise the number of relative peaks by the number of expected peaks (e.g., if a word is repeated four times and we observe three unique relative peaks, the normalised frequency is 3/4 = 0.75). Note that, if the discrete interpretation of LLMs held true, we would only observe one unique relative peak (since the prediction would be constant as the duration factor varies). We call the hypothetical frequency under this hypothesis the counterfactual normalised frequency.

We report in Table 2 the counterfactual and observed normalised frequencies for the various models, as well as the average per-sample ratio between the two. Overall, the observed peak frequency is significantly higher than the counterfactual frequency; these results are thus incompatible with a discrete interpretation of LLMs. For the sake of completeness, we report in Table 3 the results where we only count the expected peaks as valid (i.e. if an element is repeated four times, only the unique relative peaks "1", "2", "3", and "4" are considered valid).
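The peak-counting metric above can be sketched as follows, assuming a hypothetical input format: the top predicted class at each duration factor, in sweep order.

```python
def unique_relative_peaks(top_classes):
    """Unique relative peaks: the distinct classes that are the top
    prediction for at least one duration factor."""
    return set(top_classes)

def normalised_peak_frequency(top_classes, n_expected):
    """Number of unique relative peaks divided by the expected number."""
    return len(unique_relative_peaks(top_classes)) / n_expected

# Top class over the duration sweep: "1", then "2", then "1" again, then "3".
sweep = ["1", "2", "1", "3"]
assert unique_relative_peaks(sweep) == {"1", "2", "3"}
# Word repeated four times, three unique peaks observed: 3/4 = 0.75.
assert normalised_peak_frequency(sweep, 4) == 0.75
# Under a purely discrete interpretation the top class never changes:
assert normalised_peak_frequency(["4"] * 10, 4) == 0.25
```

The last assertion corresponds to the counterfactual normalised frequency: one unique peak divided by the expected number of peaks.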
Figure 6: Predicted next token for the sentence "In the sentence 'apple apple apple apple', how many fruits are mentioned?" with duration shrinking, for all studied models: (a) Llama 3, (b) Llama 2, (c) Gemma 1, (d) Gemma 2, (e) Phi 3, (f) Mistral. Each panel plots the probability of "1", "2", other digits, and other tokens against the duration factor.

Figure 7: Predicted next token for the sentence "In the sentence 'cat cat cat cat', how many animals are mentioned?" with duration shrinking, for all studied models (same panels as Figure 6).
Figure 8: Predicted next token for the sentence "In the sentence 'rose rose rose rose', how many flowers are mentioned?" with duration shrinking, for all studied models (same panels as Figure 6).

Model Name      Valid Ratio
Llama 3 8b      93.5%
Llama 2 13b     54.5%
Gemma 1 7b      100.0%
Gemma 2 9b      100.0%
Phi 3 Medium    54.5%
Mistral 7b      76.0%
Global          79.8%

Table 1: Percentage of valid setups (i.e. setups where a word is treated as a single token) for the single-token counting experiment. The Global row refers to all valid records divided by all records.

Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2594                 0.8953           3.57
Llama 2 13b     0.2599                 0.9174           3.60
Gemma 1 7b      0.2610                 0.5127           1.95
Gemma 2 9b      0.2610                 0.8513           3.37
Phi 3 Medium    0.2599                 0.8836           3.43
Mistral 7b      0.2584                 0.4737           1.85
Global          0.2600                 0.7404           2.90

Table 2: Counterfactual and observed normalised peak frequencies for the single-token counting experiment, with all peaks (including unexpected ones). Note that the Global row is an average weighted by the number of valid records.
Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2594                 0.8914           3.56
Llama 2 13b     0.2599                 0.8263           3.27
Gemma 1 7b      0.2610                 0.3503           1.34
Gemma 2 9b      0.2610                 0.8513           3.37
Phi 3 Medium    0.2599                 0.792            3.09
Mistral 7b      0.2584                 0.4737           1.85
Global          0.2600                 0.6849           2.70

Table 3: Counterfactual and observed normalised peak frequencies for the single-token counting experiment, with only expected peaks. Note that the Global row is an average weighted by the number of valid records.

C.2 COUNTING EVENTS

C.2.1 QUALITATIVE RESULTS

Similarly to Appendix C.1, we used a prompt to coax the models into giving numeric outputs, as well as coefficients in the range [0.1, 1]. Alongside the shop example, we tested two other passages:

    The class went to the zoo. They saw a lion. They saw an elephant. They saw a giraffe. They saw a penguin. How many animals did the class see?

    Emily went to the beach. She found a seashell. She found a starfish. She found a smooth stone. She found a piece of seaweed. How many things did Emily find?

See Figures 9 to 11 for full results.

C.2.2 QUANTITATIVE RESULTS

In addition to our qualitative results, we report further quantitative experiments for time duration. We consider the sequential dataset from Lin et al. (2024), which contains 200 curated how-to tutorials split by step. Our template is as follows:

    Tutorial: [Tutorial Title]
    [Steps]
    Question: How many steps are necessary to complete the tutorial? Reply with a single-digit number
    Answer: It takes

We then compute, in the same fashion as Appendix C.1.2, the normalised peak frequency, both when considering all peaks (Table 4) and only the expected peaks (Table 5).
Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2186                 0.7564           3.70
Llama 2 13b     0.2186                 0.7274           3.42
Gemma 7b        0.2186                 0.6967           3.46
Gemma 2 9b      0.2186                 0.6188           3.21
Phi 3           0.2186                 0.7123           3.385
Mistral         0.2186                 0.5119           2.52
Global          0.2186                 0.6706           3.28

Table 4: Counterfactual and observed normalised peak frequencies for the event counting experiment, with all peaks (including unexpected ones).

Model Name      Counterfactual Freq.   Observed Freq.   Average Ratio
Llama 3 8b      0.2186                 0.6397           3.26
Llama 2 13b     0.2186                 0.4854           2.54
Gemma 7b        0.2186                 0.5984           3.04
Gemma 2 9b      0.2186                 0.5740           3.01
Phi 3           0.2186                 0.5436           2.75
Mistral         0.2186                 0.4517           2.29
Global          0.2186                 0.6097           3.05

Table 5: Counterfactual and observed normalised peak frequencies for the event counting experiment, with only expected peaks.

Figure 9: Predicted next token for the shop passage for all studied models, with duration shrinking (same panels as Figure 6).
Figure 10: Predicted next token for the zoo passage for all studied models, with duration shrinking (same panels as Figure 6).

Figure 11: Predicted next token for the beach passage for all studied models, with duration shrinking (same panels as Figure 6).

Model Name      Valid Rate
Llama 3 8b      60.0%
Llama 2 13b     19.5%
Gemma 1 7b      94.5%
Gemma 2 9b      95.0%
Phi 3 Medium    48.5%
Mistral 7b      93.0%
Global          68.4%

Table 6: Percentage of valid records (i.e. records where the unmodified output of the sum is correct) in the sums experiment.

C.3 NUMBER SUMS

C.3.1 QUALITATIVE RESULTS

Our experimental setup is identical to that of Appendix C.1. In addition to 24 + 13, we repeat our experiments with the sums 13 + 74 and 32 + 56. Refer to Figures 12 to 14 for full results.
C.3.2 QUANTITATIVE RESULTS

We consider a dataset of 100 questions involving sums, such as the following:

    Question: Mia delivered 82 packages in the morning and 38 packages in the afternoon. How many packages did Mia deliver in total?
    Answer: Mia delivered

The specific items and questions vary, but in all cases the answer is the sum of two two-digit numbers. For each sentence, we independently shrink each of the two numbers, for a total of 200 records. In each record, we observe how the predicted probabilities vary as the duration factor varies. We only consider records where the model correctly computes the first digit of the sum on the unaltered sentence (see Table 6 for a per-model breakdown).

Let y_o be the original label (i.e. the first digit of the sum of the two numbers) and Y_s the shrunk labels (i.e. the set of predicted digits in case the shrunk number is treated as a single-digit number). For instance, in the sum 24 + 37, the original label is "6" (24 + 37 = 61), but the shrunk labels are "2" (24 + 3 = 27) and "3" (24 + 7 = 31). Note that we ignore non-numerical labels. We then check three properties:

- P1: For a certain duration factor, the collective probability of the labels in Y_s is higher than that of y_o;
- P2: For a certain duration factor, the collective probability of the labels in Y_s is higher than that of y_o and any other numerical label;
- P3: As in P2, and additionally, at no point is another numerical label (i.e. neither y_o nor an element of Y_s) the top label.

Note that P3 implies P2 and that P2 implies P1. We report how frequently each property is true in Table 7.
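The property checks can be sketched as follows, under one reading of the definitions above: the collective probability of Y_s is compared against y_o and against each other numerical label individually, and the probability tables are assumed to be pre-filtered to numerical tokens. Names and input format are illustrative, not the released code.

```python
def check_properties(curves, y_o, Y_s):
    """curves: one dict per duration factor, mapping numerical labels
    to probabilities. Returns (P1, P2, P3) as in Appendix C.3.2."""
    p1 = p2 = False
    other_ever_top = False
    for probs in curves:
        shrunk = sum(probs.get(y, 0.0) for y in Y_s)
        orig = probs.get(y_o, 0.0)
        others = [p for lab, p in probs.items()
                  if lab != y_o and lab not in Y_s]
        if shrunk > orig:
            p1 = True
            if all(shrunk > p for p in others):
                p2 = True
        top = max(probs, key=probs.get)
        if top != y_o and top not in Y_s:
            other_ever_top = True
    return p1, p2, p2 and not other_ever_top

# 24 + 37: original label "6"; shrunk labels {"2", "3"}.
curves = [{"6": 0.7, "2": 0.1, "3": 0.1, "5": 0.1},
          {"6": 0.2, "2": 0.4, "3": 0.3, "5": 0.1}]
assert check_properties(curves, "6", {"2", "3"}) == (True, True, True)
```

The nesting mirrors the implications stated above: P2 can only hold at a factor where P1 holds, and P3 is P2 plus a global condition over the whole sweep.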
Figure 12: Predicted next token for the sentence "The sum of 24 and 13 is" in all studied models except Llama 3, with shrinking of 13: (a) Llama 2, (b) Gemma 1, (c) Gemma 2, (d) Phi 3, (e) Mistral. Each panel plots the probability of "3", "2", and other tokens against the duration factor.

Figure 13: Predicted next token for the sentence "The sum of 13 and 74 is" in all studied models except Llama 3, with shrinking of 74 (panels as in Figure 12; tracked tokens: "8", "2", "1", and other tokens).

Figure 14: Predicted next token for the sentence "The sum of 32 and 56 is" in all studied models except Llama 3, with shrinking of 32 (panels as in Figure 12; tracked tokens: "8", "3", and other tokens).
Model Name      P1 Frequency   P2 Frequency   P3 Frequency
Llama 3 8b      29.17%         25.83%         25.83%
Llama 2 13b     92.31%         87.18%         76.92%
Gemma 1 7b      97.88%         97.35%         90.48%
Gemma 2 9b      95.79%         95.79%         89.47%
Phi 3 Medium    100.00%        100.00%        87.63%
Mistral 7b      100.00%        100.00%        87.10%
Global          87.82%         86.97%         79.05%

Table 7: Frequency of each property for valid records in the sums experiment. Note that the Global row is an average weighted by the number of valid records.

C.4 SPACE CONTINUITY

C.4.1 QUALITATIVE RESULTS

We report the full results concerning interpolations of embeddings in the main paper in Figures 15 to 17 and 19. We also check that the intermediate interpolation does not correspond to any existing token by asking the LLM to repeat the embedding: "Repeat the word 8.", where "8" denotes the interpolated embedding. As shown in Figure 19, the repetition of "8" does not correspond to any existing token (as shown by the lack of peaks for tokens other than those related to apples and bananas). Additionally, we adapt some experiments to another pair of tokens, namely cats and dogs, where we find that the interpolation of cats and dogs is an animal, but whether cats-dogs meow depends on the position along the interpolation axis (see Figures 20 to 23). Similarly, refer to Figures 24 to 27 for our results on the water-juice interpolation.

C.4.2 BOOLEAN INTERPOLATION

We then test how our results compare with studies on interpolation of Boolean formulae. To do so, we perform linear interpolations of Boolean binary operators and study how intermediate operators behave. In particular, we study interpolations of:

- AND and OR;
- AND and XOR;
- AND and NAND;
- OR and NOR.

We report our results for all models whose tokenizers treat the operators as having the same number of tokens. While the models often struggle to compute the correct Boolean results for discrete inputs, we nonetheless observe the emergence of fuzzy operators, whose truth values can be best represented as floating points.
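The interpolations used throughout Appendix C.4 are linear in embedding space. A minimal sketch with toy two-dimensional embeddings follows; the real experiments use the model's own token embeddings.

```python
import numpy as np

def interpolate_embeddings(e_a, e_b, steps):
    """Linear interpolation between two embeddings; intermediate
    points generally do not correspond to any token in the vocabulary."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * e_a + a * e_b for a in alphas]

e_apples = np.array([1.0, 0.0])   # toy stand-in for the "apples" embedding
e_bananas = np.array([0.0, 1.0])  # toy stand-in for the "bananas" embedding
path = interpolate_embeddings(e_apples, e_bananas, 5)
assert np.allclose(path[0], e_apples) and np.allclose(path[-1], e_bananas)
assert np.allclose(path[2], [0.5, 0.5])  # midpoint: the "apples-bananas" vector
```

Each intermediate vector is substituted for a single token position in the prompt (the "8" placeholder above), and the model's next-token probabilities are recorded at every interpolation factor.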
C.4.3 QUANTITATIVE RESULTS

We consider 50 pairs of objects having some properties in common (e.g. apples and bananas are both fruits). For each pair, we consider one common property, one property that holds for only one element of the pair, one that holds for only the other element, and one property that is shared by neither of them, for a total of 200 records. We then interpolate (with 40 steps) between the sentence containing one object and the sentence containing the other. For instance, for a property shared by both apples and bananas, we interpolate between the sentences:

    Question: Can apples be eaten raw? (yes/no)\n Answer:
    Question: Can bananas be eaten raw? (yes/no)\n Answer:

and compute the predicted scores of "yes" and "no". Note that some questions may have borderline answers (i.e. the answer could be argued to be true or false); we still consider such questions valid, as we are interested in the variation of the output throughout the interpolation rather than the specific answer. We however ignore object-tokenizer pairs where the two resulting sentences have different tokenized lengths (as that prevents interpolation). We report the percentage of valid records in Table 8.

To measure the smoothness of the variation in output, we compute the maximum of the absolute derivative across the interpolation interval, i.e. $\max_{x \in [a,b]} |f'(x)|$ (which can be seen as an estimate of the Lipschitz constant). We then normalise by dividing by the amplitude (i.e. $\max_{x \in [a,b]} f(x) - \min_{x \in [a,b]} f(x)$). Although imperfect, this metric provides insight into the sharpest variation in output for a model. We report the average metric in Table 9. In general, we observe that Llama 2 and Gemma 2 have higher normalised maximum absolute derivatives, while the other models tend to have similar normalised values.
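The smoothness metric can be sketched with finite differences, assuming f is sampled at equally spaced interpolation factors; the helper name is illustrative, not the released implementation.

```python
import numpy as np

def normalised_max_abs_derivative(f_samples, a=0.0, b=1.0):
    """Finite-difference estimate of max |f'| over [a, b],
    normalised by the amplitude max f - min f."""
    f = np.asarray(f_samples, dtype=float)
    h = (b - a) / (len(f) - 1)              # spacing between samples
    max_abs_deriv = np.abs(np.diff(f)).max() / h
    amplitude = f.max() - f.min()
    return max_abs_deriv / amplitude

# A linear ramp over 41 points: |f'| = 1 everywhere, amplitude 1.
ramp = np.linspace(0.0, 1.0, 41)
assert np.isclose(normalised_max_abs_derivative(ramp), 1.0)
```

A gradual transition between the two answers yields a value near 1, while an abrupt flip concentrated in a single interpolation step yields a value close to the number of steps, which is what makes the metric a useful (if imperfect) proxy for sharpness.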
Figure 15: Predicted next token for the sentence "Are 8 red?", where 8 is an interpolation of "apples" and "bananas". [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; each plots the probabilities of "Yes", "No", and other tokens against the interpolation factor.]

Figure 16: Predicted next token for the sentence "Are 8 a fruit?", where 8 is an interpolation of "apples" and "bananas". [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]
Figure 17: Predicted next token for the sentence "The most common colour of 8 is", where 8 is an interpolation of "apples" and "bananas". We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; the reported tokens include "red", "green", and "yellow".]

Figure 18: Predicted next token for the sentence "Alice bought some 8 at the", where 8 is an interpolation of "apples" and "bananas". We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; the reported tokens include "supermarket", "market", "store", and "grocery".]
Figure 19: Predicted next token for the sentence "Repeat the word 8", where 8 is an interpolation of "apples" and "bananas". We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 20: Predicted next token for the sentence "Do 8 meow?", where 8 is an interpolation of "cats" and "dogs". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral; probabilities of "Yes", "No", and other tokens against the interpolation factor.]

Figure 21: Predicted next token for the sentence "Are 8 animals?", where 8 is an interpolation of "cats" and "dogs". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]
Figure 22: Predicted next token for the sentence "We bought two 8 at the", where 8 is an interpolation of "cats" and "dogs". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 23: Predicted next token for the sentence "Repeat the word 8.", where 8 is an interpolation of "cats" and "dogs". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 24: Predicted next token for the sentence "Does 8 contain sugar?", where 8 is an interpolation of "water" and "juice". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral; probabilities of "Yes", "No", and other tokens against the interpolation factor.]

Figure 25: Predicted next token for the sentence "Is 8 a drink?", where 8 is an interpolation of "water" and "juice". Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 26: Predicted next token for the sentence "We drank some 8 in the", where 8 is an interpolation of "water" and "juice". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 27: Predicted next token for the sentence "Repeat the word 8.", where 8 is an interpolation of "water" and "juice". We report all tokens with a probability of at least 5% at any point of the interpolation. Results for Llama 2 and Phi 3 are not reported due to the two sentences having a different number of tokens. [Panels (a)-(d): Llama 3, Gemma 1, Gemma 2, Mistral.]

Figure 28: Predicted next token for interpolations of AND and OR in Llama 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1); each plots the probabilities of "True", "False", and other tokens against the interpolation factor.]

Figure 29: Predicted next token for interpolations of AND and OR in Llama 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 30: Predicted next token for interpolations of AND and OR in Gemma. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]
Figure 31: Predicted next token for interpolations of AND and OR in Gemma 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1); each plots the probabilities of "True", "False", and other tokens against the interpolation factor.]

Figure 32: Predicted next token for interpolations of AND and OR in Phi 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 33: Predicted next token for interpolations of AND and OR in Mistral. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 34: Predicted next token for interpolations of AND and XOR in Llama 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]
Figure 35: Predicted next token for interpolations of AND and XOR in Gemma. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1); each plots the probabilities of "True", "False", and other tokens against the interpolation factor.]

Figure 36: Predicted next token for interpolations of AND and XOR in Gemma 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 37: Predicted next token for interpolations of AND and NAND in Llama 3. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Figure 38: Predicted next token for interpolations of AND and NAND in Gemma. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]
Figure 39: Predicted next token for interpolations of AND and NAND in Gemma 2. [Panels (a)-(d): input pairs (0, 0), (0, 1), (1, 0), (1, 1).]

Model Name     Valid Rate
Llama 3 8b     74.0%
Llama 2 13b    34.0%
Gemma 1 7b     84.0%
Gemma 2 9b     84.0%
Phi 3 Medium   34.0%
Mistral 7b     46.0%
Global         59.3%

Table 8: Percentage of valid records (i.e. where both sentences have the same tokenized length) in the embedding interpolation experiment.

Model Name     Normalised Maximum Absolute Derivative
Llama 3 8b     4.959
Llama 2 13b    9.005
Gemma 1 7b     5.090
Gemma 2 9b     5.402
Phi 3 Medium   9.279
Mistral 7b     7.445
Global         6.214

Table 9: Average normalised maximum absolute derivative for each model. The Global row is an average weighted by the number of valid records.

We also study whether the intermediate embeddings have outputs that differ significantly from those of the embedding extremes. In theory, we would expect the probability of an output computed on an intermediate embedding to be an interpolation between the values for the two objects. Let f(a) be the probability score for one extreme of the interpolation and f(b) the probability score for the other extreme. We consider our expected range to be [min(f(a), f(b)), max(f(a), f(b))]. We then verify whether, for some intermediate value, the predicted probability of "yes" or "no" falls outside its respective range.6 In particular, we compute the maximum difference between the allowed range and the actual value, i.e.:

$$m_{\text{diff}} = \max_{x \in [a,b]} \begin{cases} \min(f(a), f(b)) - f(x) & \text{if } f(x) < \min(f(a), f(b)) \\ f(x) - \max(f(a), f(b)) & \text{if } f(x) > \max(f(a), f(b)) \\ 0 & \text{otherwise} \end{cases} \quad (24)$$

We define mmax as the maximum of the mdiff values for "yes" and "no".
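The mdiff/mmax computation can be sketched directly on the sampled scores. A minimal sketch, assuming the predicted probabilities are NumPy arrays; the function name and sample values are illustrative, not the paper's code:

```python
import numpy as np

def m_diff(scores, f_a, f_b):
    """Maximum excursion of the sampled scores outside the expected
    range [min(f(a), f(b)), max(f(a), f(b))]; 0 if always inside."""
    lo, hi = min(f_a, f_b), max(f_a, f_b)
    below = np.clip(lo - scores, 0.0, None)  # excursion under the range
    above = np.clip(scores - hi, 0.0, None)  # excursion over the range
    return float(np.maximum(below, above).max())

# Hypothetical "yes"/"no" scores along the interpolation; the extremes
# f(a), f(b) are the first and last samples.
yes = np.array([0.9, 0.95, 0.2, 0.1])
no = np.array([0.1, 0.05, 0.8, 0.9])
m_max = max(m_diff(yes, yes[0], yes[-1]), m_diff(no, no[0], no[-1]))
print(m_max)  # the small overshoot at the second sample
```

A record whose intermediate outputs stay strictly between the two extremes yields m_max = 0; any overshoot or undershoot contributes positively.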
Note that if the function only takes values in [min(f(a), f(b)), max(f(a), f(b))], mmax will be 0. We report the average mmax and the percentage of records where mmax > 0.05 in Table 10. We also report histograms describing the distribution of mmax in Figure 40. In general, we observe that on average roughly 20% of records have an mmax higher than 0.05, which cannot be explained by a simple interpolation-based model of LLM behaviour.

D ADDITIONAL RESULTS

We complement our previous observations on time continuity with experiments on two common sequence transformations, namely shifting and scaling.

6 Note that increasing the probability of "yes" does not necessarily decrease the probability of "no", as the model can also produce spurious outputs.

Model Name     Average mmax   % (mmax > 0.05)
Llama 3 8b     0.0434         32.43%
Llama 2 13b    0.0639         23.53%
Gemma 1 7b     0.0332         22.02%
Gemma 2 9b     0.0178         9.52%
Phi 3 Medium   0.0212         14.71%
Mistral 7b     0.0362         22.83%
Global         0.0339         20.79%

Table 10: Average mmax for each model and percentage of records where mmax > 0.05. The Global row is an average weighted by the number of valid records.

Figure 40: Distribution of mmax for all the studied models. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; histograms of the variation.]

D.1 TRANSLATIONAL INVARIANCE

For shifting, we increment the positional embeddings of each token (as well as the lower bound of integration) by a fixed amount (up to 10) without actually changing the sentence's duration.
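A minimal sketch of the shift itself, applied to position indices (illustrative only; in an actual experiment the shifted indices would be fed to the model, e.g. through a position_ids-style argument):

```python
import numpy as np

def shifted_position_ids(seq_len, shift):
    """Position indices translated by `shift`; pairwise distances
    between tokens (what relative encodings such as RoPE see) are
    unchanged."""
    return np.arange(seq_len) + shift

base = shifted_position_ids(6, 0)    # positions 0..5
moved = shifted_position_ids(6, 10)  # positions 10..15
# All pairwise position differences are identical under the shift,
# which is why translation leaves RoPE-based outputs unaffected.
assert np.array_equal(np.subtract.outer(moved, moved),
                      np.subtract.outer(base, base))
print(moved)
```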
In particular, we feed the following input to Llama3-8B:

[Diagram: the tokens of "The capital of France is", each spanning its original time interval, with every position shifted by the same constant offset.]

In addition to the sentence "The capital of France is", we study the following sentences: "The Great Gatsby is my favourite"; "O Romeo, Romeo, wherefore art thou". We report our full results for translation in Figures 41 to 43.

D.2 (LACK OF) SCALE INVARIANCE

On the other hand, for scaling we increase or decrease the duration of the entire sentence:

[Diagram: the same sentence with each token's duration ds_i stretched by a factor greater than 1, so the whole sentence is dilated.]

See Figures 44 to 46 for our results.

D.3 ANALYSIS

As shown for instance in Figure 47, we find that while the impact of shifting on an LLM's output is negligible, scaling significantly changes how the LLM interprets the input (and thus what the LLM outputs). This phenomenon occurs regardless of the model and sentence, empirically confirming the theoretical observations on translation invariance made in Section 3. We believe that shifting does not affect an LLM's output because positional and rotary embeddings (Vaswani et al., 2017; Gemma Team et al., 2024b) are robust to translations: as long as the relative positions between tokens are preserved, a model's output remains consistent. On the other hand, scaling leads to significant variations in an LLM's output. Our results suggest that, beyond interpreting time continuously, our generalised Transformers treat duration as an intrinsic property of the input; this may explain why LLMs are robust to inputs with low frequency in the training data (e.g., their embeddings may interpolate with embeddings of similar semantics).

D.4 SHIFTING INVARIANCE WITH LEARNED POSITIONAL EMBEDDINGS

While the properties of common positional encodings (in particular, sinusoidal positional encoding and Rotary Positional Encoding) inherently incentivise translation invariance, we study whether the same phenomenon takes place in models with learned positional encodings.
To do so, we repeat our translation experiments on GPT2, which uses learned positional encodings. We report our results in Figure 48. While the magnitude of the effect is certainly weaker than with RoPE encodings, we observe that the top class remains consistent under translation for moderate shifts, which is consistent with our RoPE results.

Figure 41: Predicted next token for the sentence "The capital of France is" in all studied models with shifting. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; probability against the translation factor.]

Figure 42: Predicted next token for the sentence "The Great Gatsby is my favourite" in all studied models with shifting. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]
Figure 43: Predicted next token for the sentence "O Romeo, Romeo, wherefore art thou" in all studied models with shifting. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 44: Predicted next token for the sentence "The capital of France is" in all studied models with scaling. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral; probability against the scaling factor.]

Figure 45: Predicted next token for the sentence "The Great Gatsby is my favourite" in all studied models with scaling. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 46: Predicted next token for the sentence "O Romeo, Romeo, wherefore art thou" in all studied models with scaling. We report all tokens with a probability of at least 5% at any point of the interpolation. [Panels (a)-(f): Llama 3, Llama 2, Gemma 1, Gemma 2, Phi 3, Mistral.]

Figure 47: Effect of shifting and scaling on Llama 3 for the sentence "The capital of France is". [(a) Shifting does not impact an LLM's output; (b) scaling visibly impacts an LLM's output.]

Figure 48: Predicted next token for the France, Gatsby and Romeo sentences in GPT2 with translation. [Panels (a)-(c): France, Gatsby, Romeo; probability against the translation factor.]