# grokking_at_the_edge_of_linear_separability__f99e96a5.pdf

Grokking at the Edge of Linear Separability

Alon Beck 1 Noam Levi 2 Yohai Bar Sinai 1

We investigate the phenomenon of grokking delayed generalization accompanied by nonmonotonic test loss behavior in a simple binary logistic classification task, for which "memorizing" and "generalizing" solutions can be strictly defined. Surprisingly, we find that grokking arises naturally even in this minimal model when the parameters of the problem are close to a critical point, and provide both empirical and analytical insights into its mechanism. Concretely, by appealing to the implicit bias of gradient descent, we show that logistic regression can exhibit grokking when the training dataset is nearly linearly separable from the origin and there is strong noise in the perpendicular directions. The underlying reason is that near the critical point, "flat" directions in the loss landscape with nearly zero gradient cause training dynamics to linger for arbitrarily long times near quasi-stable solutions before eventually reaching the global minimum. Finally, we highlight similarities between our findings and the recent literature, strengthening the conjecture that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.

1. Introduction

Understanding the relationship between the intrinsic properties of data, the training dynamics of neural networks (NNs), and their ability to generalize is crucial to explaining the success of modern machine learning (ML) algorithms. In particular, highly over-parameterized models based on the transformer architecture (Vaswani et al., 2023), such as

1Raymond and Beverly Sackler School of Physics and Astronomy, Tel Aviv University, Tel Aviv 69978, Israel 2École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland. Correspondence to: Alon Beck <alonbk2@gmail.com>, Noam Levi <noam.levi@epfl.ch>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Large Language Models (LLMs) (Open AI, 2024; Google, 2023; Zeng et al., 2022; Brown et al., 2020; Chowdhery et al., 2022; Anil et al., 2023), as well as state of the art models for computer vision (Srivastava & Sharma, 2023), defy expectations and are able to generalize with a number of parameters far exceeding the so called interpolation threshold (Kaplan et al., 2020; Schaeffer et al., 2023). Interestingly, these models have been shown to exhibit unpredictable behaviors when changing the number of network parameters, not only with respect to generalization, but also in their learning dynamics.

One such phenomenon, known as Grokking, was first observed by Power et al. (2022) during the training of a transformer model on modular arithmetic tasks. Grokking occurs when a model initially achieves perfect training accuracy but no generalization (i.e. no better than a random predictor), and upon further training, transitions to almost perfect generalization. This phenomenon has garnered substantial attention in recent years (Gromov, 2023; Liu et al., 2023; Xu et al., 2023) due to its striking contrast with naive expectations, whereby over-fitting is generally seen as an undesirable property of models that should not generalize with further training, originally dealt with using early stopping (Prechelt, 1996).

In this work, we present a straightforward model where grokking naturally emerges, allowing us to identify and analyze its fundamental cause as proximity of the system to a critical point. In our simple yet illuminating setting, the asymptotic optimal solution can always be identified, allowing a sharp definition of notions that are typically ambiguous, such as "memorization" and "learning".

We study a typical logistic binary classification problem, with the goal of finding a linear separator between two Gaussians with distinct labels. We assume that the Gaussians are well separated along the separation axis and contains noise in all perpendicular directions. Extensions of this setup are considered in Sec. 5. We mainly focus on the limit of large N, d while keeping the ratio λ = d/N fixed, although the results can be generalized.

Our main contributions are:

1. We prove, and demonstrate numerically, that grokking may occur in this setting, and is promoted when λ is close to 1/2 and the separation along the separation

Grokking at the Edge of Linear Separability

between the Gaussians is small relative to the noise in other directions.

2. We show that this happens because λ = 1/2 is a critical point. That is, for λ < 1/2 the model will almost surely asymptotically approach perfect generalization accuracy and vanishing loss, while for λ > 1/2 the model will almost surely achieve imperfect generalization accuracy and the population loss will diverge at t .

3. More fundamentally, we show that generalization depends on whether the training set is linearly separable from the origin (that is, whether the origin is contained in the convex hull of the training set). In the limit N, d , the data almost surely separable from the origin if λ > 1/2 and almost surely false otherwise.

4. Moreover, we show that near the threshold value λ = 1/2 (or, more generally, when the data is on the verge of being separable), the dynamics may generically track the overfitting solution for arbitrarily long times before transitioning to the optimal generalizing solution. This behavior manifests as a non-monotonic test loss and delayed generalization, and can lead to divergence of the "grokking time".

5. We construct a simple, one-dimensional model which captures the salient aspects of the problem, and explicitly solve the time evolution of the model parameters for several interesting cases.

The main takeaway from this setup is that grokking happens near a critical point, similar to critical slowing down in the physics literature. While further study is needed, we conjecture that this applies to other grokking examples, as was demonstrated explicitly (though not necessarily stated in these terms) in Levi et al. (2023); Liu et al. (2023); Gromov (2023); Rubin et al. (2023; 2024).

The rest of the paper is organized as follows: Sec. 3 presents our main analysis of grokking as a critical phenomenon, beginning with empirical results in Sec. 3.1, studying the possible solutions in Sec. 3.2, and relating it to linear separability in Sec. 3.3. Finally, we bring together the pieces in Sec. 3.4 to explain why grokking occurs near the critical point. Sec. 4 provides a tractable effective model which fully captures the grokking dynamics. In Sec. 5, we discuss generalizations of this setup. We conclude in Sec. 6 and discuss future directions. In the appendices we discuss several generalizations of our results and provide some further proofs and derivations.

2. Related Work

Following the discovery of grokking by Power et al. (2022), numerous studies have attempted to elucidate its underlying mechanisms. Liu et al. (2022) showed that when sufficient data determines the structured representation, perfect generalization can be achieved on a non-modular addition task. Other works have identified factors contributing to grokking, including pattern learning (Davies et al., 2023), delayed robustness (Humayun et al., 2024), and transitions from memorization to circuit formation (Nanda et al., 2023), and the role of activation sparsity, weight entropy, and circuit complexity in real-world tasks (Golechha, 2024). Others analyzed the trigonometric algorithms learned by networks after grokking (Nanda et al., 2023; Chughtai et al., 2023; Merrill et al., 2023; Gromov, 2023), and demonstrated similar dynamics in sparse parity tasks (Merrill et al., 2023). Additional works proposed "slingshots" (Thilak et al., 2022) or "oscillations" (Notsawo et al., 2023) as explanations for grokking, whereas others focused on the role of regularization (Power et al., 2022; Liu et al., 2023) and of numerical precision (Prieto et al., 2025), which may significantly impact grokking in certain scenarios. We stress that our work requires none of these in order to exhibit or explain grokking.

Recently, a body of works on solvable models which grok in various settings has emerged. Liu et al. (2023); Kumar et al. (2023) and Lyu et al. (2024) have linked grokking to memorization and transitions from lazy to rich dynamics. Žunkoviˇc & Ilievski (2022); Gromov (2023); Doshi et al. (2024) analyzed solvable models exhibiting grokking and related their findings to the formation of latentspace structure, Xu et al. (2023) related grokking to benign over-fitting for Re LU networks on XOR data, Rubin et al. (2024) described grokking as a first-order phase transition, and Levi et al. (2023) provide the full dynamical solution of grokking in linear regression. Our work attempts to sidestep external probes, and fill the gap between solvable models and representation learning, whereby we can always identify the optimal solutions, while still solving the dynamics of the model. To accomplish this, our work relies on the results of Soudry et al. (2018); Nacson et al. (2019); Ji & Telgarsky (2019), which analyze the late-time dynamics properties of logistic regression for separable and inseparable data under gradient descent (GD).

3. Grokking in Binary Classification

3.1. Model setup and empirical results

Consider a dataset of N training samples { xi}N i=1 Rd+1. We investigate the case where the data consists of two Gaussian distributions with opposite labels, well separated along the axis of separation but with added noise in all

Grokking at the Edge of Linear Separability

Separable Inseparable

Figure 1. Left panels: Gradient descent dynamics for three different values of λ = d/N. Loss and accuracy over the train and test datasets, and the time evolution of b(t) and S(t) . Grokking is significant only when λ approaches to 1/2 from below. We can see that for λ > 1/2, S increases indefinitely and generalization is not possible (see Eq. (4)). The parameters are N = 4 104,σ = 5, η = 0.01. The direction of S(t = 0) was drawn isotropically with S0 = 0.1 and b(t = 0) = 0. The number of test samples is Ntest = 104. Right panels: Top: The norm of the limiting value S in the separable case λ > 1/2, as a function of λ. Note that S diverges for λ 1

2. Middle: the accuracy at the end of the training (in blue), and the predicted limiting accuracy (orange), calculated only using the margin of the dataset, see Prop. 3.2. Bottom: The Grokking time, defined here as the delay between the times when Atrain/gen surpass a threshold of 0.9. Grokking time and S diverge near λ = 1/2. Additional details regarding the experiments can be found in App. K.

other directions. WLOG, we take the separation between the Gaussians to be along the first axis. Explicitly, the distribution of the data is xi N( µ, Σ) where the covariance Σ is diagnoal with Σ11 = σA and Σii = σB for i > 1, and µ = (µ, 0, 0, ...) is a vector pointing in the direction of the first axis with magnitude µ. We consider a binary logistic regression task with a linear model without bias.

Throughout this work, we focus on the regime σA µ, which allows full generalization below the critical point. We explain the reasoning behind this assumption, and discuss cases where it does not hold in Sec. 5. We also note that neither Gaussianity nor the assumption of unit covariance is essential for our analysis, and are only made for simplicity (see App. G).

Reduction to a d-dimensional problem with bias.

We consider now the limiting case σA = 0, in which the first coordinate of each data point is simply µ. In this case, the gradient flow dynamics can be exactly mapped to a d-dimensional problem with bias, in which all points are assigned the same label. The key idea is that, up to a scaling factor, the data points take the form xi = (1, xi), where xi N(0, σ2 B µ2 Id) is a d-dimensional vector (Id is the d d

identity matrix). In this formulation, it is standard to treat the first coordinate of the weight vector as the bias term. A more rigorous justification for this equivalence is provided in App. H, where it is also shown that under this mapping the d dimensional problem the effective learning rate is faster by a factor of µ2. The benefit of this equivalence is that the analysis of the equivalent model is simpler.

To sum up, for simplicity, throughout this paper we will analyze the following equivalent d-dimensional problem: Consider N training samples {xi}N i=1 Rd where xi N(0, σ2Id) and σ > 0 is the feature standard deviation (mapped to σB/µ in the model above). Consider logistic classification, where all input points are assigned the same label, which for concreteness we take as {yi}N i=1 = 1.

Linear model: loss and accuracy. The model parameters are a weight vector S Rd and a bias term b R. The output is a scalar fi = f(xi) = S xi + b. We optimize the empirical cross-entropy loss L(S, b) and measure the

Grokking at the Edge of Linear Separability

empirical accuracy A(S, b), given by1

L(S, b) = 1

i=1 ℓ ST xi + b , (1)

i=1 Θ ST xi b ,

where ℓ(fi) = log 1 + e yi fi = log 1 + efi is the single sample loss and Θ(z) is the Heaviside function, defined as Θ(z) = 1 if z 0 and Θ(z) = 0 if z < 0.

Optimizer. Throughout the main text we use gradient descent (GD) dynamics. The effects of other optimizers are discussed in App. F. The GD equations at training step t with learning rate η are

St+1 St = η SL, bt+1 bt = η b L. (2)

In this paper we will focus on the gradient flow (GF) limit (η 0) of these equations.

Numerical results. In Fig. 1, we show numerical results depicting the gradient-descent dynamics of the model across three values of λ d/N. Notably, we observe a significant grokking effect, both in the non-monotonicity of the test loss, and a delayed rise in test accuracy, only when λ λc = 1/2 (there may be some differences between the grokking observed here and other examples in the literature, but they seem to be superficial - see App. B). In the following section, we explain how λc can be interpreted as the interpolation threshold in this setting.

3.2. The generalizing and over-fitting solutions

To understand grokking in this setup, we begin by examining the optimal generalizing solution. Since the support of the input distribution is unbounded and all labels are equal, the model must position all points in Rd on the same side of the separating hyperplane, effectively pushing the decision boundary to infinity.

To see this rigorously, we derive expressions for the generalization accuracy and loss. Since the data follows a Gaussian distribution, xi N(0, σ2Id), the generalization (population) loss is, by definition:

Lgen = Ex N(0,σ2Id) h log 1 + exp ST x + b i (3)

= Ey N(0,1) h log 1 + eσ S y+b i ,

where we used the fact that ST x N(0, S 2σ2). Note that Lgen depends only on b and S . Similarly, the

1The labels do not appear explicitly in L since they are identical for all samples.

generalization accuracy is given by:

Agen(S, b) = Ey N(0,1) h Θ ( σ S y b) i (4)

where erf is the error function.

Proposition 3.1. Perfect generalization, i.e., Lgen 0 and Agen 1, is achieved only if both b and b/ S . That is, b must tend to negative infinity while also being infinitely large compared to S .

Proof. It is easily seen from Eq. (4) that the condition Agen 1 requires b/ S . If b is bounded, then this can only happen for S 0, but this cannot be since then Eq. (3) implies that Lgen is bounded away from zero. Therefore, perfect generalization implies both b and b/ S .

The bottom panels of Fig. 1 show that b at late times in all parameter regimes. However, while S saturates at a constant value for λ < 1/2, it diverges when t for λ > 1/2, and does so at a rate comparable to b, leading to sub-optimal generalization limt A(S(t), b(t)) < 1.

Relation to prior results regarding separability. These results are closely related to the framework developed by Soudry et al. (2018), who studied the convergence of binary classification for linearly separable data, and later expanded by Ji & Telgarsky (2019) for inseparable data. In our case, since the model contains a bias term and all labels are the same, the data is always separable by a hyperplane at infinity . To use their framework, we need to work in an extended space of dimension d + 1, where we define the extended weight vector w = (S, b) Rd+1. The network solution at the late stages of training can be obtained as a direct corollary of Theorem 3 from Soudry et al. (2018).

Theorem 1 (Rephrased from Theorem 3 of Soudry et al. (2018) ). In the setting described above, for any smooth monotonically decreasing loss function with an exponential tail, and for small learning rate, GD iterates will converge at the late stages of training to:

w(t) = w SVM log(t) + ρ(t) , (5)

w SVM = argmin (S,b)

n S 2 + b2 s.t. ST xi + b 1 o .

Here, w SVM is the solution2 to the hard margin SVM problem in the extended d + 1 space and ρ is a residual vector which is bounded for all t.

Connecting this result to the previous discussion, we see

2Note that the SVM solution in the extended d + 1 is not the same as the typical formulation of the Support Vector Machine (SVM) with bias in d dimensions, because of the different penalty used for the bias term.

Grokking at the Edge of Linear Separability

Inseparable

Inseparable

Figure 2. Evolution of the model parameters. (left) the distribution of ST xi/ S , where S is the final spatial weight vector that was found using GD dynamics for λ = 0.05, 0.48, 0.52, 0.8. The parameters are identical to those of Fig. 1. We can see that for λ = 0.05, 0.48 the model does not separate the data (because the data is inseparable) while for 0.52, 0.8 it does. The margin is plotted for λ = 0.8 Middle panel: S(t) , optimized with GD using the exponential loss given in Eq. (6), with and without a bias term. With a bias term, the result is shown as a function of the conformal time (Eq. (8). The two curves follow the same path different rates. The inset shows d S d log(t) and db d log(t), (right) Optimization paths for different λ values, shown in the b, σ S plane. For inseparable data b diverges while S is bounded, while slightly above the limit of separability both b and S diverge.

indeed that either |b| and/or S must diverge at infinite training times, and the question is now reduced to the directionality of w SVM.

The generalizing solution, which classifies correctly all points in Rd is when w SVM = (0, 1), (0 being the ddimensional zero vector) i.e. when it points in the direction of the bias and the separating plane is at infinity. This is exactly the aforementioned condition b and |b|/ S . In contrast, over-fitting occurs when the hyperplane is far enough from the data to correctly classify all the training samples, but does not go to infinity. In the extended space, this means that w SVM also contains a component in the direction of the data, and the model did not correctly learn the data distribution. In what follows, we will show that the factor determining whether we observe grokking in this setup is not the regular separability of data points from one another, but rather separability from the origin (or, separability with no bias), defined as follows: Definition 2. A data-set {xi}N i=1, xi Rd 1 is linearly separable from the origin iff there exists a vector S Rd 1

such that ST xi > 0 for any i.

In the rest of the paper, we will use separable as a shorthand for separable from the origin . We are now ready to present our main claims regarding the grokking phenomenology presented in Fig. 1. We argue that:

The generalization and overfitting at t depend only on whether the training samples (in Rd) are separable (from the origin) (Prop. 3.2).

For large N, d, the training set is separable if λ > 1

2 and inseparable for λ < 1

2 (Prop. 3.3).

For separable training sets (λ > 1 2), the model will always overfit, and the limiting generalization accuracy is directly related to the optimal separating

margin (Prop. 3.2.2). For inseparable training sets (λ < 1

2) the model will always generalize perfectly: limt b(t) = and S saturates on a finite value limt S(t) = S (Prop. 3.2.1).

However, for λ 1

2 , the training set is on the verge of separability, and S diverges (Prop. 3.4).

Consequently, our main result follows: dynamics may take arbitrarily long times to reach the generalizing solution. This is the underlying mechanism of grokking in this setting.

3.3. Separability determines whether the model will generalize perfectly or not

Proposition 3.2. The model will reach perfect generalization if and only if the data is not linearly separable from the origin. In particular:

1. If the data is not linearly separable from the origin, then limt b(t) = while S saturates on a finite value limt S(t) = S . 2. If the data is linearly separable from the origin, then limt Agen = 1 2 h 1 + erf 1 σM

i , where M is the margin.

To prove Prop. 3.2, we first note that due to the exponential tail of the cross-entropy loss, at late times the loss is dominated by samples with large model outputs f = ST x + b, for which the cross-entropy loss ℓ(f) of Eq. (2), approaches the exponential loss ℓe(f) = ef (Soudry et al., 2018; Nacson et al., 2019; Ji & Telgarsky, 2019). Specifically, the exponential loss must converge to the same late time dynamics as the cross entropy loss. Therefore, we will consider the exponential loss for which the calculations

Grokking at the Edge of Linear Separability

are tractable,

Le(S, b) = 1

i=1 e ST xi+b. (6)

In the gradient-flow limit, the induced dynamics are

i e ST xixi , b

i e ST xi . (7)

Note that both rates are proportional to a common timedependent scalar eb(t). We can thus define a so-called conformal time3: τ(t) = t 0 eb(t )dt , which is a strictly increasing function of t. In terms of τ, the time evolution takes the form S

i e ST xixi , b τ = η

i e ST xi . (8)

The importance of this change of variables is that the dynamics of S(τ) in terms of the conformal time are identical to those of S(t) in the absence of bias. That is, S(t) follows the same path that would be obtained by minimizing L = 1

N PN i=1 e ST xi, but does so at a different rate which depends exponentially on the current value of b(t). This is demonstrated in the middle panel of Fig. 2. Since τ(t) diverges for t (see App. C for details), S(t) must follow the same path at long times, as it would have followed without bias. We can now complete the proof:

Proof of 3.2.1 (inseparable case) Consider the dynamics without bias. In case the data is inseparable, Le of Eq. (6) is unbounded in all directions of S. That is, for any unit vector e Rd we have limα Le(αe, 0) = . Since Le is convex, the gradient flow dynamics will lead to a global minimum at a finite point limt S(t) = S . Recalling the discussion of conformal time above, this is also the limit of the dynamics with bias. From Eq. (5), we know that either S(t) or |b(t)| must diverge, and since S(t) approaches a finite value, we conclude that |b|/ S for t . That is, w SVM = (0, 1) and the model flows towards the generalizing solution.

Proof of 3.2.2 (separable case) When the training set is separable from the origin, it is easier to examine the optimization problem in Eq. (5) directly. We wish to minimize w 2 = S 2 + b2 under the separability constraints. The generalizing solution wg = (0, 1) satisfies all constraints trivially and has wg = 1. However, since the data is separable from the origin, there exists another solution to the constraints, namely w = (S , 0), where S is the separating vector in d dimensions

3This is a common measure in cosmology and gravitational physics to describe co-moving objects in an expanding or shrinking spacetime background (Guth, 1981).

without bias, i.e. the solution to

S = argmin S

n S 2 s.t. ST xi 1 o . (9)

The norm of S is the inverse of the separation margin M = 1/ S .

Due to convexity, any convex combination of wg and w will also satisfy the constraints, and since they are orthogonal it also has a smaller norm. The combination with the smallest norm is the global optimum, which is easily shown to be proportional to w SVM (M 2S , 1). That is, both S and b diverge when t and limt b(t) S(t) = 1

M . Plugging the result into Eq. (4) completes the proof. Next, we establish the relation between separability and λ. Proposition 3.3. For N, d , and λ = d/N, the dataset is separable from the origin with probability 1, if λ > 1/2, and is inseparable if λ < 1/2.

In other words, Prop. 3.3 states that (almost) any large set of N points in d dimensions are separable (i.e., lie on the same "half" of some hypersphere passing through origin) as long as d > N/2. This is a direct corollary of Wendel s theorem (Wendel, 1962) which we prove in App. A.

3.4. Collecting the pieces: why does grokking happen near λ = 1

We have established that for λ < 1

2, the model will almost surely generalize perfectly. For infinitely long times, S(t) converges to a finite vector S , and b(t) diverges. For λ > 1

2, the model will almost surely overfit. Intuitively, one should expect that in the vicinity of the critical point λc = 1

2, where the two solutions exchange stability, dynamics may become slow. This is because for λ > 1/2 the overfitting solution is stable and for λ smaller than but close to 1/2, it is unstable but only slightly so. Therefore, the dynamics may spend arbitrarily long times in the vicinity of the overfitting solution before flowing to the generalizing solution. This is delayed generalization. Rigorously, this happens through of the following properties:

Proposition 3.4. For λ 1

That is, when the training set is non-separable, but on the verge of separability, S obtains arbitrarily large values. This statement is formally proven in App. D and empirically demonstrated in Fig. 1. It can also be obtained as a corollary of Ji & Telgarsky (2019). An intuitive geometric interpretation is that for a nearly separable set, S(t) approaches a finite limit S , but a small translation of the data would make the set separable, and correspondingly would make |S(t)| . Smoothness thus implies |S | must be large if the set is almost separable. Proposition 3.5. For λ < 1

2 and σ large enough, S(t) will approach its asymptotic value S arbitrarily fast.

Grokking at the Edge of Linear Separability

-2 -1 0 1 2 -2

-2 -1 0 1 2 -2

-2 -1 0 1 2 -2

10 6 10 5 10 4 10 3 10 2

t | Acc(t)=1-

Figure 3. Simplified model. Three left panels: Illustration of the hard margin SVM problem in 1+1 dimensions for the simplified model. Note that w SV M is the point closest to the origin in the intersection of the two shaded regions. When w SV M points along the bias axis (the vertical axis) if and only if x2 0. (right) Grokking time, defined as the time it takes for the generalization accuracy to reach 0.95, plotted against σ and 1 2λ. We see it diverges when both λ 1/2 and σ , while neither condition suffices alone.

λ < 1/2 (x2 > 0) λ = 1/2 (x2 = 0) λ > 1/2 (x2 < 0)

b(t 1) log(t) log(t) 1 1+M 2 log(t)

S (t 1) 1 2(1 λ) log 1 1 2λ log(log(t)) M 1+M 2 log(t)

Table 1. Summary of the different regimes of the simplified model, where x1 = 1, x2 = 1 2λ, and the margin is M = |x2| = 2λ 1.

Proof. To see this, it is useful to define the rescaled variables xi = xi/σ, S = σS. Clearly, xi N(0, Id). The gradient flow equations in terms of the rescaled variables are (we study the exponential loss for simplicity)

i e S T xi xi, (10)

i e S T xi.

Note that these are identical to the gradient flow equations of the original variables given by Eq. (7), but the dynamics of S are faster by factor of σ2. Thus, by taking a large σ, S will approach its asymptotic value S arbitrarily fast, while the dynamics of b will not change. Recalling that σ was mapped to σB/µ in the original two-Gaussians model introduced at the start, we note that a large σ is equivalent to introducing large noise in directions perpendicular to the separation axis.

We can now understand mechanistically how grokking occurs. For λ values close enough to 1

2 from below, the limiting norm S is arbitrarily large (Prop. 3.4). For large enough σ, S(t) will grow arbitrarily fast towards S (Prop. 3.5). Under these conditions, the growth rate of b(t) remains boudned, and the generalization can be delayed for arbitrarily long times. Note that this necessitates both λ 1

2 and σ , as is also demonstrated in the right panel of Fig. 3. Interestingly, using adaptive momentum based optimizers like ADAM (Kingma & Ba, 2017), one can see significant grokking even for σ = 1, see App. B and App. F for more details.

4. Insights From a Simplified Model

Our main claim is that the asymptotic dynamics depend only on the separability of the training set, and that grokking occurs at the edge of linear separability. This intuition can be worked out explicitly in a much simpler setting in one dimension. In this case, separability boils down to asking whether the origin is contained between the extremal points min{xi} and max{xi}. Therefore, all the phenomenology of the full model described in the previous sections can be captured by a training set consisting of only 2 points x1, x2, representing the points with maximal and minimal projections along S . For consistency with the problem of Gaussian data, we parameterize this set as

x1 = σ , x2 = σ (1 2λ) , (11)

L(S, b) = 1

2 e Sx1+b + e Sx2+b ,

so that the scale of xi is σ, and they are separable (inseparable) for λ < 1/2 (λ > 1/2). We note that the margin of the dataset from the origin has the same dependence as in the Gaussian model with N points in d dimensions.

The asymptotic dynamics of this model qualitatively, and sometimes quantitatively, capture the phenomenology of the full problem. The model is fully tractable analytically and the detailed analysis is presented in App. E. We summarize here the main results:

The left panels of Fig. 3 show the geometry of the problem in 1 + 1 dimensions. It is easily seen that the optimal SVM solution is s = 0, b = 1 if and only if the data is not separable, i.e. when the segment x2 0

Grokking at the Edge of Linear Separability

contains the origin. The limiting value S can be easily found to be S = 1 2(λ 1) log (1 2λ) , for the separable case λ < 1/2. Indeed, it diverges logarithmically at λ = 1/2, in agreement with the numerical results of the full model presented in the upper-right panel of Fig. 1. The long time dynamics of S(t) and b(t) are summarized in Table 1. The behavior of the loss and accuracy of the simplified model as a function of λ is remarkably similar to that of the full model, see Fig. 8 in App. E.

Criticality. We note that this result bears a striking resemblance to that of Levi et al. (2023), which employed the MSE loss in a linear regression problem, again for N points sampled iid from an isotropic Gaussian distribution. In their setting, the interpolation threshold is at λ = 1, in the sense that for λ < 1 the model always generalizes asymptotically, and never generalizes for λ > 1. They also found logarithmic divergence of the "grokking time" (the time difference between the times it takes for the generalization and training accuracy to reach a certain threshold). It diverges as a function of the distance from criticality as log (1

λ), which was explained in terms of a critical slowing down effect, arising from a vanishing eigenvalue of the data covariance near criticality. While the two problems are quite different, they both display a critical behavior near an effective interpolation threshold of the corresponding problem. We believe this is not a coincidence but rather a manifestation of a deeper relation between the behaviors of NNs in the vicinity of critical points.

5. Extensions of the Setup

In our original setup, we assumed σA µ, which allowed us to reduce the model to a d-dimensional problem with a bias term under constant labels. In this section, we explore what happens when this assumption does not hold. We show that while the condition σA µ (or equivalently, constant labels) is essential for full generalization, grokking in the sense of delayed generalization as well as non-monotonic test loss behavior, can still be observed provided the system remains near the critical point of linear separability.

Typically, binary classification models fully generalize only when the number of samples is sufficiently large, i.e., N d. However, grokking is observed outside this regime, with d/N 1/2. Here, the assumption σA µ becomes particularly useful, as it ensures perfect test accuracy for any λ below the critical point. When this assumption is relaxed, full generalization is no longer guaranteed. Nevertheless, the underlying mechanism discussed in earlier sections remains: near the critical point λ = 1/2, the model can initially reduce the loss by adjusting the direction and norm of the perpendicular weights S2, ..., Sd+1, while

only modifying S1 at later times, leading to delayed generalization. In Fig. 4 we present numerical simulations supporting this behavior: While σA = 0.01 is small enough to closely follow the behavior of σA = 0, for larger values (σA = 0.05, 0.1), we observe that the limiting accuracy is below 1 and the loss begins to diverge at long times.

Finally, we note that while the precise dynamics depend on the data distribution and labeling scheme, similar results are expected for a broad class of linear binary classification problems, as long as λ is near the critical point. For instance, in App. I, we analyze a generalization of the setup from another perspective, where the constant labels are explicitly modified to be discriminative. The results closely resemble those presented here, see Fig. 16.

Figure 4. Top: The GD dynamics of the loss and accuracy for the 2-Gaussians model, with different values of σA. The rest of the parameters are λ = 0.4, µ = 0.2, σB = 1, and η = 0.1. Bottom: illustration of the separation for each σA.

Grokking at the Edge of Linear Separability

6. Discussion, conclusions and limitations

We studied the dynamics of gradient descent in a simple setting of logistic classification with strong noise. We have shown that in this setting, grokking occurs near a critical point in the asymptotic dynamics. Specifically, at the critical point, which occurs at λ = 1/2 in our case, the overfitting and generalizing solutions exchange stability. We showed that this non-analytic change in the asymptotic dynamics is the cause for grokking, much like in Rubin et al. (2024); Levi et al. (2023); Doshi et al. (2024), and to some extent also Humayun et al. (2024), who showed that grokking occurs near a phase transition.

Intuitively, in the vicinity of the critical point there are flat directions in the loss landscape. These directions may cause training to stay in the vicinity of almoststable solutions for arbitrarily long times periods before eventually converging to the global minimum. In the physics literature, this behavior is known as critical slowing down (e.g. (Sethna, 2021)). In the current context, this is the mechanism of delayed generalization, which also explains the non-monotonic evolution of the generalization loss.

While we cannot show it rigorously, we conjecture that grokking is intimately related to such critical points also in different settings. In a few examples, this has been directly demonstrated, (Levi et al., 2023; Rubin et al., 2023; 2024; Humayun et al., 2024). We note that other intriguing phenomena, such as the non-monotonic dependence of asymptotic performance on model complexity, a.k.a double descent , have also been proposed to be related to criticality, e.g. (Schaeffer et al., 2023).

If this is indeed the case, then in analogy to the theory of critical phenomena in physics, there might exist universality classes that have similar critical behavior, but possibly very different underlying mechanisms (Sethna, 2021). We will address this connection in future work.

Limitations: We considered a specific problem of linear binary classification in high dimensions. It is natural to ask how our results extend to more complex data, for instance including non-trivial correlations, hierarchical structure, or a finite sample space, as in the original observation of grokking in Power et al. (2022); Gromov (2023). While we believe the same analysis can be repeated in these instances, in the sense of (non)linear separability, we leave this to future work. In any case, we do not claim that the underlying mechanism of criticality is necessarily related to separability.

The analytic were all done in the GF limit, and while our results were verified by experiments with a finite learning rate, it may be interesting to study how large learning rates affect this setup, possibly relating to catapults (Lewkowycz et al., 2020) or the edge of stability mechanism (Cohen et al.,

Lastly, we did not study the prospect of nonlinear logistic regression, which is closer to deep learning models in the wild. We believe some of our results may be generalized, provided we accept a feature map description of the model up to the last layer, and consider the SVM solution on the learned features.

Acknowledgments

We thank Hillel Aharoni, Francois Charton, Amit Moscovich, Daniel Soudry and Matthieu Wyart for fruitful discussions. YBS was supported by research grant ISF 1907/22 and Google Gift grant. NL is supported by the EPFL AI4science program.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, in particular due to the large model sizes considered in this work, but we do not feel there are specific aspects of this work with broader impacts beyond the considerations relevant to all large machine learning models.

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. Palm 2 technical report. ar Xiv preprint ar Xiv:2305.10403, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877 1901, 2020.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. ar Xiv preprint ar Xiv:2204.02311, 2022.

Chughtai, B., Chan, L., and Nanda, N. A toy model of universality: Reverse engineering how networks learn group operations, 2023.

Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. Gradient descent on neural networks typically occurs at the edge of stability, 2022.

Cover, T. M. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers,

Grokking at the Edge of Linear Separability

EC-14(3):326 334, 1965. doi: 10.1109/PGEC.1965. 264137.

Davies, X., Langosco, L., and Krueger, D. Unifying grokking and double descent. ar Xiv preprint, ar Xiv:2303.06173, 2023.

Doshi, D., Das, A., He, T., and Gromov, A. To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets, 2024.

Golechha, S. Progress measures for grokking on real-world tasks, 2024.

Google. Bard - chat based ai tool from google (october 2023 version) [large language model]. 2023. URL https: //bard.google.com/.

Gromov, A. Grokking modular arithmetic. ar Xiv prerint, ar Xiv:2303.02679, 2023.

Guth, A. H. Inflationary universe: A possible solution to the horizon and flatness problems. Phys. Rev. D, 23:347 356, Jan 1981. doi: 10.1103/Phys Rev D. 23.347. URL https://link.aps.org/doi/10. 1103/Phys Rev D.23.347.

Humayun, A. I., Balestriero, R., and Baraniuk, R. Deep networks always grok and here is why, 2024.

Ji, Z. and Telgarsky, M. The implicit bias of gradient descent on nonseparable data. In Conference on learning theory, pp. 1772 1798. PMLR, 2019.

Kaplan, J., Mc Candlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017.

Kumar, T., Bordelon, B., Gershman, S. J., and Pehlevan, C. Grokking as the transition from lazy to rich training dynamics, 2023.

Levi, N., Beck, A., and Bar-Sinai, Y. Grokking in linear estimators a solvable model that groks without understanding, 2023.

Lewkowycz, A., Bahri, Y., Dyer, E., Sohl-Dickstein, J., and Gur-Ari, G. The large learning rate phase of deep learning: the catapult mechanism, 2020.

Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651 34663, 2022.

Liu, Z., Michaud, E. J., and Tegmark, M. Omnigrok: Grokking beyond algorithmic data, 2023.

Lyu, K., Jin, J., Li, Z., Du, S. S., Lee, J. D., and Hu, W. Dichotomy of early and late phase implicit biases can provably induce grokking, 2024.

Merrill, W., Tsilivis, N., and Shukla, A. A tale of two circuits: Grokking as competition of sparse and dense subnetworks, 2023.

Nacson, M. S., Lee, J. D., Gunasekar, S., Savarese, P. H. P., Srebro, N., and Soudry, D. Convergence of gradient descent on separable data, 2019.

Nanda, N., Chan, L., Liberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. ar Xiv preprint ar Xiv:2301.05217, 2023.

Notsawo, P. J. T., Zhou, H., Pezeshki, M., Rish, I., and Dumas, G. Predicting grokking long before it happens: A look into the loss landscape of models which grok, 2023.

Open AI. Gpt-4 technical report, 2024. URL https:// arxiv.org/abs/2303.08774.

Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. ar Xiv preprint ar Xiv:2201.02177, 2022.

Prechelt, L. Early stopping-but when? In Neural Networks, 1996. URL https://api.semanticscholar. org/Corpus ID:14049040.

Prieto, L., Barsbey, M., Mediano, P. A. M., and Birdal, T. Grokking at the edge of numerical stability, 2025. URL https://arxiv.org/abs/2501.04697.

Rubin, N., Seroussi, I., and Ringel, Z. Droplets of good representations: Grokking as a first order phase transition in two layer networks, 2023.

Rubin, N., Seroussi, I., and Ringel, Z. Grokking as a first order phase transition in two layer networks, 2024.

Schaeffer, R., Khona, M., Robertson, Z., Boopathy, A., Pistunova, K., Rocks, J. W., Fiete, I. R., and Koyejo, O. Double descent demystified: Identifying, interpreting & ablating the sources of a deep learning puzzle, 2023.

Sethna, J. P. Statistical mechanics: entropy, order parameters, and complexity, volume 14. Oxford University Press, USA, 2021.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1 57, 2018. URL http://jmlr. org/papers/v19/18-188.html.

Grokking at the Edge of Linear Separability

Srivastava, S. and Sharma, G. Omnivec: Learning robust representations with cross modal sharing, 2023.

Thilak, V., Littwin, E., Zhai, S., Saremi, O., Paiss, R., and Susskind, J. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. ar Xiv preprint ar Xiv:2206.04817, 2022.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2023.

Wendel, J. A problem in geometric probability. Mathematica Scandinavica, 11:109 112, 1962. URL http://eudml.org/doc/165817.

Xu, Z., Wang, Y., Frei, S., Vardi, G., and Hu, W. Benign overfitting and grokking in relu networks for xor cluster data, 2023.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. Glm-130b: An open bilingual pre-trained model. ar Xiv preprint ar Xiv:2210.02414, 2022.

Žunkoviˇc, B. and Ilievski, E. Grokking phase transitions in learning local rules with gradient descent, 2022.

Grokking at the Edge of Linear Separability Appendix

A. Separability and Wendel s Theorem

Wendel s theorem (Wendel, 1962) states that the probability that N random vectors drawn from a distribution in d dimensions are linearly separable, is

In relation to our work, the only assumptions required from the distribution is that

It is symmetric around the origin, i.e. P(x) = P( x), and

The dataset is almost surely in general position.

We note that Eq. (12) is a the cumulative probability function of the Binomial distribution, i.e. the probability that the the number of successes is greater than d out of N 1 attempts with success probability 1

2. The central limit theorem states that in the limit of large N, d the binomial distribution approaches a Gaussian, and thus the cumulative distribution function approaches the error function. Straightforward manipulations show that for large N, d,

It is seen that for d the transition becomes infinitely sharp as a function of lambda and we have

lim d p(λ) =

2 1 2 λ = 1

See also Cover (1965) for further discussion.

B. Relation to canonical examples

In this section, we will discuss the similarities and differences of our work with previous examples of Grokking in the literature, focusing on the seminal work of Power et. al. (Power et al., 2022). We first note that Grokking at Power et. al. is significant when the fraction of the data used for training α = Ntraining/N is near a critical value αc, in the sense that the system achieves perfect generalization if and only if α > αc, as can be seen in Fig. 1 (center) of their paper. We expect that this non-analytic behavior in the long time limit of training will be the crucial property that underlies grokking. That is, we expect that near such points the dynamics will be slow. We note that α in Power et. al. is analogous to our λ parameter, defining an effective "interpolation threshold" for the modular arithmetic problem.

Secondly, we note that a noticeable difference between our work and that of Power et. al. is that in our case, the accuracy shows a rise from the start rather than staying at chance level for a long time before generalizing. We argue that this is only a superficial discrepancy that depends on the choice of optimizer and fine-tuning of hyperparameters, and that the fundamental mechanism (that grokking occurs near critical points in which solutions exchange stability and dynamics are generically slow) is the same.

Indeed, in Fig. 5 we show that our setup is capable of grokking with accuracy staying at chance level (50%) at the start, similar to Power et al. We achieved this by using λ values closer to half ( almost separable ) and the Adam optimizer instead of vanilla gradient descent (GD). The fact that this optimizer converges faster on the training data is no coincidence: the adaptive learning rate leads to quicker convergence to large values of |S| (the memorizing solution"), maintaining accuracy at chance level until later stages, before going to large b (the "generalizing solution"). Notably, Power et al. also used Adam (or Adam W). In conclusion, although Adam can lead to a slightly "cleaner" grokking result, we explored GD because it is easier to derive analytical insights from it while, we believe, not changing the underlying mechanism of

Grokking at the Edge of Linear Separability

100 101 102 103 104 0

100 101 102 103 104

100 101 102 103 104

Figure 5. Grokking in a similar setup to the results in the main text but with ADAM optimizer (with β1 = 0.8, β2 = 0.9), instead of GD. The parameters are λ = d/N = 0.495, N = 4000 and σ = 1.

grokking. Finally, We will also note that the non-monotonicity of the test loss is also a typical sign of Grokking that can be seen in our setup (for example, compare Fig. 4 of Power et al. with the test loss in Fig. 5).

C. Divergence of the conformal time

In the main text we have defined the conformal time τ = t 0 eb(t )dt and saw that as the result the gradient-decent trajectory of S(t) is the same as one that minimizes the exponential loss without bias: L = 1

N PN i=1 e ST xi. However, if τ is bounded it might reach a different fixed point. We will show now that indeed τ must diverge. First, we notice that the loss must be bounded from above: If the points are not separable (that is, there is some ε > 0 such that for any ST

S xi > ε for any i), then it must be true since S is bounded otherwise the loss would be infinite. If the points are separable, then S might diverge (and will, as discussed in the main text) but at some point all of the arguments of the exponent would be negative, so the loss would be trivially bounded by 1. Now, using the fact that β

t we have β

i e S xi. Denoting L(t) < C , we see that

t < ηC. (15)

We note that on the left-hand side we have a positive function (since β

t < 0). In other words,

t h 1 β(t) i < ηC, so we get that

1 β(t) = 1 β(0) + ˆ t

< 1 β(0) + ˆ t

0 ηC = 1 β(0) + ηCt (16)

so that 1 β(t) < 1 + ηCt or, β(t) > 1 1+ηCt. This means that t 0 β(t) > t 0 1 1+ηεt, which diverges.

Grokking at the Edge of Linear Separability

0.00 0.05 0.10 0.15 0.20 1 2

Fraction of ST

finalxi > 0

10 2 10 1 1 2

Separable fraction, < 0.5

Figure 6. Left panel: The fraction of positive ST S xi, which goes to a constant for λ = 1/2. Right panel: The fraction of separable datasets for λ < 1/2 that were not included in the calculation.

0.3 0.2 0.1 0.0 0.1 0.2 1 2

finalxi > 0)

Figure 7. Numerical investigation of properties of the limiting distribution of ST xi, as a function of 1 2λ (averaged over different

random configurations). In blue, we plot the average value of positive ST S xi, for λ < 1/2 (by minimizing L = 1 N PN i=1 e ST xi ), and the margin for λ > 1/2 (using SVM).

D. Proof that S diverges for almost separable data

We look at the function

f(S, {xi}) =

i=1 e S xi x, S Rd (17)

We will assume n > d and that the data is in general position, and that it is not separable from the origin. Since for every S we must have S xi > 0 for some i, it is easy to see that f diverges when S grows large in any direction. Since f > 0, there exists a global minimum at finite S.

A minimum (which is also unique under our assumptions but that s not crucial) obeys

i=1 xie S xi = 0 (18)

If we divide this expression by f, we get

i=1 pixi = 0 pi = e S xi

f 0 pi 1, X

i pi = 1 (19)

Eq. (19) means that the origin is a convex combination of the sample points with weights pi. We found that a necessary condition for the existence of a critical point at a finite S is that the origin is contained in the convex hull of the sample

Grokking at the Edge of Linear Separability

points. This is of course equivalent to the condition that the origin is not linearly separable from the sample data.

We want to show that if the data is almost separable, that is, if it is not separable but the origin is close to the boundary of the convex hull, then S must be large. The intuition for this comes from Eq. (19): if the origin is very close to the boundary of the convex hull then some of the pi s must be very large compared to the others, which can only happen if S is large.

In fact, the origin is exactly on the boundary of the convex hull (that is, the data is exactly on the edge of separability) if and only if for every representation of the origin as a convex combination of the sample points,

i=1 qixi = 0 , (20)

the weights qi are non zero only for k sample points, say xi, . . . , xk, with k d, and x1, . . . xk are the vertices of a facet of the convex hull. This naturally leads to the definition:

Definition We say that the origin is ϵ-close to the boundary if there exist k points x1, . . . , xk such that for every representation of the type of Eq. (20), the total weight assigned to x1, . . . xk is at least 1 ϵ,

Theorem If the origin is ϵ-close to the boundary of the convex hull of the sample points, then the norm of S = argmin f is bounded from below by

where D = maxij |xi xj| is the diameter of the data.

Proof. We divide the points to two groups: A = {x1, . . . xk}, and B = {xk+1, . . . x N}. Since the origin is ϵ-close, the ratio of the weights of the two groups is bounded by P

Consider now the convex combination Eq. (19). Using Jensen s inequality, we can bound the relative weights of the second group by X

i B e S xi (N k)e S x , with x = 1 N k

i B xi (22)

where x is the average of the points in the second group. Therefore, the ratio is bounded by P

i A e S xi P

i B e S xi P

(N K)e S x k N k e|S|D k N k e|S|D (23)

Combining Eq. (21) and Eq. (23) we get

ϵ k N k e|S|D |S| 1

Since k d, we also have N k n d.

Note that the same convexity argument would work also for logisitic loss f = P

i ℓ(S xi), ℓ(z) = log (1 + ez), or any other monotonic and convex ℓ. In this case the only difference is that the log function should be replaced the inverse of ℓ.

Grokking at the Edge of Linear Separability

Figure 8. Simplified model (left, center) Loss and accuracy for different λ and σ = 5. dotted/solid lines represent the training/generalization respectively. (right) Grokking time (the time difference between the time it takes for the training and generalization loss to reach a certain threshold, ε = 10 50 in this case) for σ = 20. The data is in very good agreement with the prediction. (inset) how grokking time is calculated for ε = 10 50 and λ = 10 4.

E. Details of the Simplified Model

E.1. Justification and Relation to the Full Model

We will provide here supplemental results regarding the justification of the simplified model (by numerical comparison to the full model). To obtain the results we average over different random realizations: Assuming the so-called self averaging property, we know that the average over a large number of finite systems should give us the same result as the infinite system (where N, d and the ratio is constant).

In the non-separable case, S can be found by any optimizer that minimizes the loss L = P e ST xi. We note that when getting close to the transition point, for any finite-sized system we have some probability of getting a separable set (even though λ < 1/2), see Eq. (12). In this case, we just ignore the the result: In the right panel of Fig. 6 we present the fraction of realizations that are separable. This will probably introduce some bias into the results which is likely the cause of the fact that the average of positive samples (and similarly, the margin) does not go exactly to zero for λ 1/2 (see Fig. 7). In the left panel of Fig. 6 we present the fraction of positive ST S xi . Interestingly, it does not go to zero but to some positive

constant, implying that there is a singularity in the density of ST S xi at λ = 1/2.

E.2. Analytical Predictions

Here, we provide the full analysis of the model presented in Sec. 4, for a single point fixed at x1 = 1, and a second point x2 = x = 1 2λ, where λ = d/N.

The gradient flow equations in conformal time are given by

2 xe Sx e S = η

(1 2λ)e S(1 2λ) e S , (25)

2 e Sx + e S = η

e S(1 2λ) + e S .

While there exist analytical solutions for Eq. (25), they do not necessarily provide any intuition, and so we find it better to begin by investigating three special representative cases:

1. x2 = 1 (non-separable).

2. x2 = 0 (marginally non-separable).

3. x2 = 1 (separable).

Grokking at the Edge of Linear Separability

For x = 1 (A), the data is entirely non-separable in one dimension and the conformal time solutions are

S(τ) = log tanh ητ

2 + tanh 1 e S0 , (26)

b(τ) = b0 + log

tanh 2 tanh 1 e S0 cosh 2 tanh 1 e S0

tanh ητ + 2 tanh 1 (e S0) cosh ητ + 2 tanh 1 (e S0)

in which case the generalization accuracy reaches 1 for τ , as b(τ) grows faster than S(τ) with conformal time.

For x2 = 0 (B), the equations in conformal time become:

2e S, b τ = η

2(e S + 1) (27)

By solving for S and plugging into b

τ , we immediately get

S = log e S0 + η

2τ , b = log e S0 + η

2τ + S0 + b0. (28)

Using the fact that eb = τ

t , we get that τ

2 τ e S0+b0, and taking another integral, we get that e η 2 τ e S0 1 + η

2t + (e S0 1). Taking the inverse of this, we finally get

h W0 e S0+b0 η

2t + (e S0 1) ee S0 1 e S0 + 1 i , (29)

where W0 is the Lambert W function. We note that for large t we have τ log(t), and therefore b log (t), S log(log(t)), so it is interesting to note that in the critical point we still have limt S(t)/b(t) = 0 (i.e., accuracy goes to 1), even though S diverges.

Finally, for x = 1 (C), the data is fully separable in one dimension and the solution in conformal time is given by

S(τ) = S0 + log 1 + ητe S0 , b(τ) = b0 log 1 + ητe S0 , (30)

showing that the accuracy is bounded at A gen = 1

2 1 + erf 1

agreeing with Item 1 for M = 1.

For completeness, we report here the full solution, as a function of the conformal time τ = t 0 eb(t)dt of Eq. (25). We define

f(y) = xey(x+2) 2F1 1, 1 + 1 x+1; 2 + 1 x+1; e(x+1)yx

x + 2 ey, (31)

then the solution for S(τ) is given by the inverse function f 1(u) evaluated at

u = xe S0(x+2) 2F1 1, 1 + 1 x+1; 2 + 1 x+1; e S0(x+1)x

2 e S0, (32)

xe S0(x+2) 2F1 1, 1 + 1 x+1; 2 + 1 x+1; e S0(x+1)x

Grokking at the Edge of Linear Separability

The solution for b(τ) is obtained simply by integrating Eq. (25), resulting in

1 e (1+x)f 1

e S0 e S0(2+x)x 2F1(1,1+ 1 x+1 ;2+ 1 x+1 ;e S0(1+x)x) 2+x

1 e (1+x)f 1

2 e S0(2+x)x 2F1(1,1+ 1 x+1 ;2+ 1 x+1 ;e S0(1+x)x) 2+x

e S0 e S0(2+x)x 2F1 1, 1 + 1 x+1; 2 + 1 x+1; e S0(1+x)x

2 e S0(2+x)x 2F1 1, 1 + 1 x+1; 2 + 1 x+1; e S0(1+x)x

While these solutions may not necessarily be instructive in this form, appropriate limits can be taken in order to obtain the results in the main text.

E.3. Grokking time in the simplified model

We can define t tr, t gen as the times it would take for the training and generalization loss to reach some threshold ε. We can find t tr by solving Ltr = 1

2eb e S + e S(1 2λ) = ε. We will assume that σ is large enough such that S = S from the start (as discussed in the main text, σ increase the rate that S goes to its final value). Therefore, we plug S = log(1 2λ), and find that for λ which is close enough to 1/2, the loss is approximately given by Ltr = 1

2eb. Comparing to ε and plugging the long-time limit b = log( η

2t), we find that

Similarly, using the generalization loss Lgen = ebe S2/2 we can find that

ηεe 1 2 log2(1 2λ) (36)

It is already clear that for any finite ε, t gen t tr diverges. We can also obtain an ε-independent property by noting that q

log(t gen/t tr) = 1

2 log(1 2λ), (37)

which is verified numerically in Fig. 9.

It is interesting to note that in the conformal time, we have b τ, and therefore can repeat this calculation and obtain

τ tr = log(ηε), τ gen = 1

2 log2(1 2λ) log η

In this case, the result is a bit more natural since now the time difference (instead of ratio) becomes ε-independent:

τ gen τ tr 1

2 log2(1 2λ). (39)

We note that this result still depends on ε implicitly, in the sense that our assumption that S = S is true only for long-times, or ε which is small enough.

E.4. Calculation of the subleading term in the separable case

We now consider x2 < 0 but close to zero (that is, we are in a separable case where M = x2 is the margin). We know that w diverges at long times as w M 1+M 2 log η

2(1 + M 2)t + 1 . We will now denote

u w M 1 + M 2 log hη

2(1 + M 2)t + 1 i

Grokking at the Edge of Linear Separability

7.0 6.5 6.0 5.5 5.0 4.5

Figure 9. Numerical evidence for the Grokking time in the simplified model. In the left panel, we demonstrate for 1 2λ = 0.001 how the Grokking time is calculated: t tr, t gen are calculated by finding the intersection of the loss with some ε. In the right panel we plot p

log(t gen/t tr) versus log(1 2λ) numerically, and show that the result is linear with slope 1

2, in agreement with the prediction of Eq. (37)

as the difference from the diverging term. The equation for u is therefore

2eb e x1 u+ M 1+M2 log[ η

2 (1+M 2)t] Me M u+ M 1+M2 log[ η

2 (1+M 2)t] M 1 + M 2 1

Plugging the (long-time) solution for the bias, b 1 M 2+1 log h η 2(1 + M 2)t + e (1+M 2)b0 i (where b0 = 0 in our case), we get

t = x1 η 2ex1u η

2(1 + M 2)t + 1 x1M 1

1+M2 + e Mu 1 M (1 + M 2)t + 2

For M 0, we note that the second term is O(M 2), and by neglecting it we get

x2 1 x1M + M 2

2(1 + M 2)t + 1 x1M+M2

1+M2 x2 1 x1M + M 2 + 1

For t , and neglecting the other O(M 2) terms, we finally get

Remarkbly, this is identical to the result of the in the x2 > 0 case.

F. Impact of different parameters

Here we present supplemental results for Sections Sec. 4.

F.1. The Variance Scale σ

Here we provide additional information regarding the effect of σ different than 1. In particular, we will show that increasing σ can make grokking more apparent (but only up to a certain point). We will first assume that σ = 1 at the start, and investigate how taking xi = σxi changes the dynamics in comparison to that case. We will begin with the non-separable

Grokking at the Edge of Linear Separability

case (λ < 1/2). Recalling that the equations for gradient flow in our model are given by Eq. (7), this results in

i e ST σxixi, b t = η

i e ST σxi. (40)

We can now absorb σ into S by denoting S σS and investigate how it affects the dynamics of S, and the generalization loss and accuracy as a function of S. First, the GD equations become

i e S T xixi, b t = η

i e S T xi. (41)

We note that the generalization loss and accuracy in Eqs. (3) and (4) are the same except they are now a function of S

instead of S (being a function of σ S ). Since the equation for S

t is just multiplied by a factor σ2, the limiting value of S would be the same as for the σ = 1 case, but it will reach it at a faster rate. To sum up, obtaining the dynamics of the loss and accuracy when σ is larger than one can be done by using the same Eqs. (3) and (4), but also (A) Increasing the starting condition of S0 by a factor of σ, and (B) Multiply only the learning rate of the spatial part by a factor of σ2. If σ is large enough, we can go to the fixed point of S as fast as we want, enabling the appearance of Grokking (if also the limiting value of S is large, which happens when we are on the edge of being separable).

Finally, we will also investigate the effect of σ in the separable regime (λ > 1/2). Now we can use ?? and Item 2, where we only need to consider how σ changes the margin M. Since it is obtained from the equation ST

S xm = M, we can see

that the new margin will be larger by σ than the old one, i.e., M = σM. Plugging this in the expression for the accuracy in Prop. 3.2, we get that the accuracy is now

lim t Agen 1

1 + erf 1 σ2M

where we note that the argument inside the erf is smaller in a factor of σ2, drastically reducing the limiting accuracy.

F.2. Optimizer

The effect of changing the optimizer to Adam is demonstrated in Fig. 11. We note that the fact that adaptive-type optimizers change each learning rate individually based on past gradients, leads the dynamics faster in the direction of S , relatively to b. The fact that it makes S change faster (and not slower) than b, is probably related to the fact that S is a vector in high dimension: Moving each component of such vector will result in a change of the norm in a rate that is proportional to

d, but this may need further investigation. We will also note that using a different optimizer for the non-separable region where, will lead to a different solution than the hard margin SVM, as is also discussed by Soudry et al. (Soudry et al., 2018). This means that the results we developed in the main text will not hold, but we can still expect to obtain accuracy smaller than one since S diverges, as indeed can be seen in Fig. 11.

F.3. Initial conditions

As discussed in the main text, changing the initial conditions can change the monotonicity of the generalization loss and accuracy: See Fig. 12, Fig. 13 below.

F.4. Different loss

In the main text, we showed that we can use the exponential instead of the CE loss, since it will converge to it at late times. Here we provide numerical evidence that indeed Grokking could be seen, even when taking from the beginning just the exponential loss L = 1

i e ST xi+b: See Fig. 14.

Grokking at the Edge of Linear Separability

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105 100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

Figure 10. Gradient descent dynamics for three different values of σ, all for λ = 0.48. The top panels show the loss and accuracy for the train and test datasets, while the bottom panels present b and the norm of S. Except for σ, the parameters are the same as in Fig. 1. We can see that increasing σ makes the grokking more apparent at start, but then saturates (that is, increasing σ will not increase the grokking time anymore).

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104 100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

Figure 11. Dynamics, using ADAM optimizer with Py Torch s default parameters. The setup is the same as Fig. 1, except for the fact that σ = 1 now instead of 5. Significant Grokking can be seen even though the value of σ is not large.

Grokking at the Edge of Linear Separability

100 101 102 103 104

||S0|| = 0.1

100 101 102 103 104

100 101 102 103 104

101 ||S0|| = 2

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

100 101 102 103 104

Figure 12. Gradient descent dynamics for λ = 0.48 and for three different values of starting norm, S0 . Except for this, the setup is the same as Fig. 1. We can see that the non-monotonicity of the loss can be affected by the starting condition.

101 103 105

101 103 105

101 103 105

101 103 105

101 103 105 0.0

101 103 105

101 103 105

101 103 105

101 103 105

Figure 13. Gradient descent dynamics for λ = 0.48 and for three different values of b. Except for this, the setup is the same as Fig. 1.

Grokking at the Edge of Linear Separability

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105 100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

100 101 102 103 104 105

Figure 14. Gradient descent dynamics for a setup which is the same as Fig. 1, but with the exponent loss L = 1 N P

i e ST xi+b. That is, the loss is strictly the exponent loss at any time (and not just converge to the exponent loss at long times, as the CE loss). Clearly, we can see that the behavior of the Grokking is similar.

Grokking at the Edge of Linear Separability

G. Different input data distributions

As discussed in the main text, our results hold for any data distribution that is symmetric around the origin. Since the underlying mechanism only requires that the data is on the verge of separability (in which case |S | diverges). As we discuss in App. A, in the limit d, N , λ = 1/2 is the critical value below which the dataset is almost surely inseparable. Therefore, the analysis and resulting behavior, including the occurrence of Grokking and the critical point of λ should hold for any symmetric distribution.

To demonstrate this, we compare three input distributions at λ = 0.45 in Fig. 15: (1) The isotropic Gaussian input (as discussed in the main text), (2) Non-isotropic Gaussian inputs, generated using a covariance matrix with eigenvalues that follows the scaling law λn = λ0

nα , with α = 1.5. (3) Mixture of Gaussians N(µ = 1, σ = 0.25). We notice that σ in this context (and its effect on Grokking described in Sec. 4) could also be easily generalized for any distribution, by simply multiplying all of the inputs by a factor of σ.

Mixture of Gaussians

101 103 105

101 103 105

101 103 105

Figure 15. Grokking for three different input data distributions: Isotropic Gaussian (left), Gaussian with covariance whose eigenvalues follow a power-law scaling (middle), and uniform distribution (right). The parameters are: d = 180, N = 400 (λ = 0.45), and the optimizer is gradient-decent. Left panel: Isotropic Gaussian with σ = 5, as appears in the main text. Middle panel: Gaussian with eigenvalues that follow λn = λ0/nα, where α = 1.5. The normalization factor λ0 is chosen such that P

n λn = σ d, where here σ = 10. Right panel: each element in the input vector is chosen from a mixture of Gaussian distribution, with µ1,2 = 1 and σ1,2 = 0.25. After sampling, the input was multiplied by 5, as a generalization of the original σ.

H. Reduction to a d-dimensional model with bias.

In this section we will prove the equivalence between the d+1-dimensional two-Gaussians model and the d-dimensional model with bias. Consider the d + 1-dimensional classification problem with no bias:

xi = (yiµ, xi,2, xi,3, ..., xi,d+1), (43)

where yi = 1 are the labels, µ is some constant and xi,j N(0, σB). The loss is L = 1

i ℓi where ℓi = ℓ(yi S xi) is some loss. Taking xi yixi, we get

xi = (µ, yixi,2, yixi,3, ...), (44)

Grokking at the Edge of Linear Separability

but the distribution of yixi,j stays the same since it is symmetrical, so we can assume WLOG that yi = 1. The dynamics is

(S xi) xi. (45)

We now denote xi = 1

µxi, S = µS (so that S xi = S xi), and get

(S xi) xi, (46)

which is the same except for a µ2 learning rate factor. Since xi = (1, 1

µxi,3, ...), we can define also the scalar b and the d dimensional vectors S, x, by

S (b, S1, S2, ..., Sd), xi (1, xi,1, ..., xi,d) (47)

We can see that

b t = ηµ2 1

ℓ(S xi + b)

(S xi + b) , (48)

ℓ(S xi + b)

(S xi + b) xi (49)

Which is exactly the dynamics of the d model with bias, where all of the labels are the same. We also note that σ = σB/µ, where σ is factor of the std of xi.

I. Extension of the model from another perspective

In this section, we begin with the constant label model and extend it explicitly to a model where only a fraction r of the samples are the same (so that for r = 1 we get the same model studied throughout the paper). We show that grokking can still be observed. Specifically, we classify a point xi as label -1 if its first coordinate xi,1 exceeds a threshold µ, and as label 1 otherwise. Here, µ = Q(r), where Q is the Gaussian quantile function (inverse CDF) and r [0, 1] is a fraction. We note that for r < 1, only a fraction r of points are labeled -1, while the remaining 1 r are labeled 1.

The same reasoning as the previous sections can be applied here: near the critical point λ = 1/2, the model could reduce the loss first by changing the direction of S, and only then modifying b, leading to delayed generalization. However, unlike the constant-labeling case, full generalization is not possible even for λ > 1/2, since the optimal solution is not obtained by taking the separating hyperplane to infinity. We can see this by studying the expression for the generalization accuracy (see App. I.1) given by

dx1e 1 2σ2 x2 1sign(x1 µ)

2 x1 + b/S1

σ q Pd i=2(Si/S1)2

For perfect generalization, we must have b/S1 = µ and S1 Si, i > 1, which can only be achieved up to a certain error, determined by the max-margin solution.

In Fig. 16, we demonstrate numerically that while the limiting generalization accuracy is smaller than one for r < 1, grokking in the sense of delayed generalization and a non-monotonic test loss is still present.

I.1. Derivation of Eq. (50)

Starting from the explicit expression for the accuracy

i=1 (yiθ(S xi + b) + (1 yi)θ( S xi b)) , (51)

Grokking at the Edge of Linear Separability

100 101 102 103 104 105 0

100 101 102 103 104 105

100 101 102 103 104 105

Figure 16. Grokking at λ = 0.48, for three different label-fractions: r = 1 (constant label), r = 0.99, and r = 0.95 (represented by three different opacities). For example, for r = 0.95, approximately 95% of the input data would be assigned with the label 1 and 0.05 with the label 1. The rest of the parameters are the same as in Fig. 1.

and recalling that xi N(0, σId d) , we get that the generalization accuracy is given by

2 x T Σ 1x [y(x)θ(S x + b) + (1 y(x))θ( S x b)] . (52)

Plugging the label to be a function of the first coordinate

( 0, x1 < µ 1, else , (53)

where µ = Q(r) is the threshold (Q being the Gaussian quantile function as discussed in the main text), we get that

j=2 dxje 1 2σ2 Pd i=2 x2 i 1

dx1e x2 1 2σ2 θ( S1x1

i=2 Sixi b) (54)

µ dx1e x2 1 2σ θ(S1x1 +

i=2 Sixi + b)

Setting y = Pd i=2 Sixi, we note that y N(0, σy), where σy = σ q Pd i=2 S2 i = σ S S2 1 . Therefore,

dx1e 1 2σ2 x2 1θ( S1x1 y b) + ˆ

2σ x2 1θ(S1x1 + y + b) .

Grokking at the Edge of Linear Separability

Taking explicitly the integral over y of the two terms and simplifying the result, we will finally get

dx1e 1 2σ2 x2 1sign(x1 µ)erf

||S||2 S2 1

J. Direct calculation of the late time behavior of b, S in the separable regime

Here we present a direct calculation for S(t) S(t) ˆS(t), b(t) at late times. First, we will work in the conformal time τ = t 0 β(t )dt . As discussed in the main text, when the data is separable we know that in the late time limit ˆS goes to a certain direction and S . Therefore, only the points with maximum ST xm will contribute to the sum in the large

i e ST xi D

N e ST xm, (58)

where ST xm is the maximum value that is obtained, D is their degeneracy. We note that ST xi are all negative, so the maximum are just the points which are closest to zero, so these are exactly the support vectors. To continue, for τ we denote

S Ef(τ)ˆS, (59)

where E is some constant, and f(τ) is some function of t. Without loss of generality, and for compatibility with the results of the main text, we will define E by the equation E 1 ˆS T xm . We can now find f(τ) explicitly using the following arguments: We know that

i e ST xi ST

Using the approximation and comparing with S

τ = Ef (τ), we get that

N 1 E e f(τ) = Ef (τ). (61)

Solving for f(τ), we get

f(τ) = log η D

N 1 E2 τ + C1

where C1 is some constant. Therefore, we get

N 1 E2 τ + C1 . (63)

and taking the integral over dτ we get

N 1 E2 t + C1

Recalling that τ

t = eb, we can integrate to find:

τ(t) = 1 η D

1 E2+1 [C2t + C3]

1 E2+1 1 η D

N 1 E2 C1. (65)

Using b = log( τ

t ), we get for long times that

E2 + 1 log(t). (66)

Plugging also S Ef(τ) E log(τ), we can see that

S(t) E E2 + 1 log(t). (67)

Grokking at the Edge of Linear Separability

Recalling that E = 1 M , this verifies the result of the main text.

K. Supplemental Details of Experiments

In this section, we will provide information regarding the experiments. The results of the left panels of Fig. 1 (and all of the results in Section App. F) can be easily obtained on a personal laptop. As for the heavier experiments which are presented in the right-most column of Fig. 1 (and in Fig. 7):

(1) Calculation of S and ST xi properties of the distribution. The setup is: N = 2400, σ = 1, and d =930, 990, 1050, 1086, 1110, 1134, 1152, 1158, 1164, 1170, 1173, 1176, 1179, 1182, 1185, 1188, 1191, averaged over 15000 different random realizations, and using the ADAM optimizer (any optimizer will work in the inseparable regime).

(2) Calculation of the margin. The setup is: N = 2400, σ = 1, and d =1230, 1260, 1290, 1320, 1350, 1380, 1410, 1440, 1470, 1500, 1530, 1560, averaged over 1000 realizations.

(3) The results of the right-middel and right-bottom panels of Fig. 1. The setup is: N = 400, and d =20, 40, 80, 100, 120, 140, 160, 168, 180, 188, 192, 196, 200, 208, 220, 228, 240, 260, 280, 300, 320, 360, averaged over 100 realizations.