# dynamics_of_the_accelerated_tsne__a08000e4.pdf

Published in Transactions on Machine Learning Research (06/2025)

Dynamics of the accelerated t-SNE

Kyoichi Iwasaki iwasaki.kyoichi@ism.ac.jp The Graduate University for Advanced Studies, SOKENDAI

Hideitsu Hino hino@ism.ac.jp The Institute of Statistical Mathematics

Reviewed on Open Review: https: // openreview. net/ forum? id= 4205

This paper investigates the dynamics of t-Stochastic Neighbor Embedding (t-SNE), a popular tool for visualizing complex datasets in exploratory data analysis, optimized by the Nesterov s accelerated gradient method. Building on the foundational work that connects t-SNE with spectral clustering and dynamical systems, we extend the analysis to include accelerated dynamics which is not addressed in the previous work, revealing the emergence of Bessel and modified Bessel functions as a novel aspect of the algorithm s behavior characterizing the temporal evolution of the accelerated t-SNE. Because the ordinary differential equation corresponding to the optimization process under consideration has a closed-form solution, by performing eigenvalue decomposition of the data s adjacency matrix as a preprocessing step, we can obtain low-dimensional embeddings at any point in time without performing sequential optimization. This advancement not only enhances the practical utility of t-SNE but also contributes to a deeper understanding of its underlying dynamics.

1 Introduction

Data visualization plays a pivotal role in exploratory data analysis, enabling the comprehension of complex datasets by reducing them to more cognitively manageable two or three-dimensional representations. Among various dimensionality reduction techniques, t-distributed stochastic neighbor embedding(t-SNE) (van der Maaten & Hinton, 2008), a descendant of the stochastic neighbor embedding (Hinton & Roweis, 2002), has emerged as powerful tools for visualizing the overall structure of data and understanding how clusters are formed across diverse fields. t-SNE continues to undergo technical and theoretical development, and its applications are also expanding. In (Zhu & Ting, 2022), the use of an isolation kernel instead of a Gaussian kernel accelerates computation, while (Böhm et al., 2023) integrates contrastive learning within computer vision research, and (Skrodzki et al., 2023) explores the concept of coarse embedding to expedite inference. Furthermore, the concept has been extended into hyperbolic space using polar quadtree techniques, as demonstrated in related research by (Zhou & Sharpee, 2021; Guo et al., 2022). Also, theoretical advancements have been made, such as the convergence study presented in (Jeong & Wu, 2024). For general low-dimensional representation, the concept of dynamical dimension reduction is introduced in (Yoon & Osting, 2022).

In this paper, we focus on another aspect of t-SNE, the acceleration of the optimization procedure from the viewpoint of dynamical system. In many software implementation of t-SNE (Szymański & Kajdanowicz, 2017; Krijthe, 2015), the optimization procedure includes a momentum term. We consider applying momentum-based methods such as the Momentum Method (MM) and the Nesterov s accelerated gradient (NAG) method (Nesterov, 1983) to t-SNE aiming to enhance the convergence behavior of the optimization dynamics in the early exaggeration (EE) stage. We analyze the impact of these modifications from a continuous-time dynamical systems perspective.

We note that the t-SNE is composed of two stages: the EE stage and the embedding stage (van der Maaten & Hinton, 2008). Some previous theoretical studies (Linderman & Steinerberger, 2019; Cai & Ma, 2022) have

Published in Transactions on Machine Learning Research (06/2025)

conducted analyses focusing on the EE stage. While these methods are often introduced to accelerate practical optimization, our primary focus here is on understanding the dynamical behavior of the system rather than achieving computational speed-up. Since this stage itself possesses a sufficiently rich mathematical structure, we will also focus on the EE stage, and the embedding stage is not modified in this study.

1.1 Contributions

We draw upon the foundational work of (Cai & Ma, 2022), who established a connection between t-SNE clustering dynamics and spectral clustering under certain conditions for gradient descent (GD) method, framing the data clustering phenomenon within a dynamical systems perspective. While the use of MM or NAG can potentially accelerate the optimization in the EE stage, it is not obvious whether the resulting embeddings still preserve cluster structures in the same way as standard GD. Our contribution extends this analysis by demonstrating that the incorporation of MM and NAG can accelerate the EE dynamics; in particular, we show that NAG introduces mathematical entities such as the Bessel and modified Bessel functions into the system. Through rigorous theoretical analysis and empirical validation, we show that our approach offers a further insights into the dynamical properties of t-SNE.

Our main contributions are summarized as follows:

We explore an approach of dynamical system analysis of t-SNE including a momentum term, and further extended to the NAG. We find that in the former case (MM), even with the same linear ODE, convergence is theoretically faster [Proposition 5.4], and in the case of the NAG, it leads to the derivation of related ODEs as the first kind Bessel functions or modified Bessel functions. [Theorem 6.4] As these claims suggest, our approach explicitly involves eigenvalues and eigenvectors. When the data size is n, solving the eigenvalue problem requires computational complexity O(n3), which is greater than the original O(n2) complexity. Therefore, in this paper, we limit our numerical experiments to datasets of relatively small volume.

We derive these ODEs and their relations, organizing them into several theorems, propositions and lemmas, where all proofs are provided in the Appendix. As a matter of fact, deriving parallel results already given without momentum in (Cai & Ma, 2022) to those with momentum term are technically not straight forward.

Since the closed-form solution of the ODE is available, embeddings at any point in time can be easily obtained. [Proposition 5.4, Theorem 6.4]

We find that insights related to EE stage in (Cai & Ma, 2022) are replicated in our method with acceleration, leading to the conclusion that the optimization algorithm converges in less time than when using GD. Since the EE stage does not have an explicitly defined objective function with the acceleration parameter α > 1, it is challenging to monitor the function value and determine an appropriate stopping point during optimization. As a secondary benefit of our continuous formulation via ODEs and their closed-form solutions, we can define a principled stopping criterion based on a threshold applied to the Accumulated Residual Ratio (ARR) of major and minor eigenvalue contributions. [(27), (28)] This approach enables the identification of a suitable termination time within the continuous framework. Such a mechanism is particularly advantageous when t-SNE is employed as part of a larger system or pipeline, where an automatic and interpretable stopping condition is beneficial.

We theoretically demonstrate that, under certain conditions, iterative computations for both MM and NAG can be replaced by their continuous counterparts by adopting this continuous relaxation approach using ODEs. [Proposition5.2, 6.2] It provides theoretical justification for results that had previously been obtained heuristically through iterative methods. We can confidently rely on the original discrete optimization algorithms (MM and NAG), as they retain the essential dynamics while being more computationally efficient. However, since solving the ODEs typically requires eigenvalue decompositions of large matrices, the practical advantage of using the continuous formulation is limited.

Published in Transactions on Machine Learning Research (06/2025)

By interpreting the spectral decomposition of the dynamics in terms of Lyapunov exponents, we offer a novel theoretical perspective on the formation of clusters in t-SNE. [Proposition H.1, H.2] Components associated with positive Lyapunov exponents dominate the long-term behavior and effectively determine the clustering structure, with the corresponding eigenvectors encoding cluster assignments. [(29)] In contrast, components with negative Lyapunov exponents decay over time and gradually vanish from the embedding. Notably, in the case of NAG, these stable components do not simply decay exponentially but exhibit non-monotonic, oscillatory behavior governed by Bessel functions.

1.2 Key take aways for t-SNE users

We investigate both the theoretical and practical aspects of MM and NAG, based on the framework of GD. Our study builds upon the theoretical foundation established by (Cai & Ma, 2022), which demonstrated that the EE stage of the t-SNE algorithm can be approximated by a first-order ODE. The solution of this ODE can be expressed explicitly in terms of the adjacency matrix s eigenvalues and eigenvectors, and providing a theoretical justification for why t-SNE reveals the data s cluster structure.

In our paper, we extend this line of reasoning by investigating whether a similar theoretical framework can be applied to other optimization schemes such as MM and NAG. Specifically, we ask: Can the dynamics of MM and NAG during the EE stage also be described using ODEs, and if so, do the same assurances hold?

The answer to this question is yes : we theoretically demonstrate that optimizing t-SNE with either MM or NAG likewise yields the emergence of clear cluster structures. This result implies that, although the computational cost of optimization was previously a concern for t-SNE, one can now employ accelerated gradient methods with confidence. Furthermore, our work proposes a novel stopping criterion for the ODEbased formulation, a point which is not explicitly addressed in (Cai & Ma, 2022) s original work. This contributes both to the theoretical understanding and practical usability of accelerated variants of t-SNE.

Next, we present the practitioner-oriented takeaway. We derive the corresponding ODEs and their closedform solutions by continuously relaxing the step-by-step iterative process. This relaxation enables, particularly in the EE stage, the approximation of intermediate states at arbitrary time points by evaluating the ODE solution. In addition, we propose a stopping criterion, referred to as the ARR metric defined in (28), which provides a principled method for determining an appropriate termination time. To apply our approach in practice, two major assumptions must be taken into account:

When the data size n is sufficiently small, the eigenvalue problem required by the method typically of computational complexity O(n3) can be solved within a practical amount of time.

The assumptions stated in the theoretical takeaway namely, (I1), (T1.M), (T2.M) for the MM method, and (I1), (T1.N), (T2.N) for NAG are satisfied.

Provided these conditions hold, the following approach becomes applicable:

Preparation: Compute the adjacency matrix P and its eigenvalues.

EE stage: Compute the ARR-based criterion and determine the stopping time.

Embedding stage: Obtain the low-dimensional embedding by evaluating the formula (3) with m(k+1) = 0, where acceleration is unnecessary in order to obtain stable results in the embedding stage.

2 Related work

Dealing with t-SNE, a method of dimensionality reduction, as a dynamical system requires a background in both the methodology of dimensionality reduction and analysis of algorithms as dynamical systems. Moreover, the unique properties of t-SNE and the development of related research are also pertinent. We concisely summarize the related studies for each aspect.

Published in Transactions on Machine Learning Research (06/2025)

2.1 Dimensional reduction methodology

In dimensionality reduction, numerous methods have been established to transform high-dimensional data into a more manageable, lower-dimensional representation, aiding in elucidation of the data structure. Principal component analysis (PCA) (Jolliffe, 1986) serves as a foundational linear method, identifying principal axes of maximum variance. Isometric mapping (Isomap) (Tenenbaum et al., 2000) focuses on preserving the geometric distances on the manifold, effectively maintaining the spatial relationships among data points. Locally linear embedding (LLE) (Roweis & Saul, 2000) isolates local linear relationships by reconstructing points from their immediate neighbors, reflecting the intrinsic geometry of the manifold. The Laplacian eigenmap technique (Belkin & Niyogi, 2001) constructs a graph Laplacian and leverages its eigenfunctions to achieve a lower-dimensional projection that reveals the manifold s local characteristics.

By further developing graph-based methods such as Isomap and LLE, SNE (Hinton & Roweis, 2002) and t-SNE (van der Maaten & Hinton, 2008) have been proposed as methods that learn embeddings directly and have become very popular. UMAP (Mc Innes & Healy, 2018) and PHATE (Moon et al., 2019) have been developed as powerful alternatives to the t-SNE algorithm, and recently, the relationships between t-SNE and UMAP have also been discussed (Damrich et al., 2023).

2.2 Gradient methods to differential equation

The approach of deriving and analyzing ODEs as continuous limits of iterative optimization algorithms is widely adopted. This includes attempts to extend acceleration techniques beyond NAG (Wibisono et al., 2016), employing Runge-Kutta integrators (Zhang et al., 2018) and symplectic integrators (Goto & Hino, 2025), and formulating continuous-time models using Lyapunov analysis and tools from stochastic calculus (Orvieto & Lucchi, 2019). These studies exemplify the breadth of research exploring different aspects of continuous approaches to optimization. There are more geometrically sophisticated methods, such as (Defazio, 2019), which addresses NAG on a Riemannian manifold, and (Wilson et al., 2021), which performs analysis using Lyapunov functions. In deriving the ODEs related to the NAG method, we refer to several results from (Su et al., 2016), where the emergence of Bessel functions for a related ODE is established.

2.3 SNE algorithm and recent developments

In addition to the studies introduced in the introduction, variants of t-SNE and relationships with other methods are also being actively researched (Kobak et al., 2019; Chen et al., 2015; Pezzotti et al., 2017; Fujiwara et al., 2020). A generic formulation of embedding algorithms that includes SNE and other existing algorithms are studied, elucidating their relation with spectral methods and graph Laplacians (Carreira Perpiñán, 2010; Vladymyrov & Carreira-Perpiñán, 2012).

There are several theoretical attempts, such as (Linderman & Steinerberger, 2019), which compares spectral clustering and t-SNE and (Arora et al., 2018), which states that t-SNE works well under specific conditions called γ-spherical and γ-separated in EE stage. The paper (Linderman & Steinerberger, 2022) presents theoretical and experimental results, focusing on sufficient conditions for the parameters α for acceleration, and h for step size under the assumption of GD. It also discusses considerations regarding disjoint clusters and the independence of the results from initial values. Note that such results include both EE and subsequent embedding stages.

Additionally, recent studies have addressed aspects related to implicit regularization and scaling effects in clustering algorithms. For instance, (Auffinger & Fletcher, 2023) discusses implicit regularization mechanisms, as highlighted in Proposition 6.5, which connects closely to early stopping in clustering optimization processes. Additionally, (Murray & Pickarski, 2024) explores the scaling effects arising from the gradual decrease of pij, which is the element of adjacency matrix commonly used in t-SNE is the symmetric probability matrix P = (pij), as n . This phenomenon appears to relate to Lemma 5.5 or 6.3, particularly the condition concerning the sign of αλR(L(P)) 1 n 1, where λR(L(P)) stands for Rth-eigenvalue of the unnormalized Laplacian matrix of P. These findings provide critical insights into how regularization and scaling behaviors influence clustering performance and stability under large-scale settings.

Published in Transactions on Machine Learning Research (06/2025)

2.4 Acceleration methods

It is widely known that a drawback of t-SNE is its high computational cost, and many attempts have been made to improve its computational efficiency. The original paper (van der Maaten & Hinton, 2008) uses both early exaggeration effect by adopting acceleration parameter α > 1 and momentum methods. The Barnes Hut tree is a well-known approach for accelerating the t-SNE optimization procedure (van der Maaten, 2013; 2014). The n-body force calculations is utilized in (Yang et al., 2013; Vladymyrov & Carreira-Perpiñán, 2014). (Skrodzki et al., 2024) applies Barnes-Hut approximation idea into hyperbolic disc to accelerate lowdimensional embedding. We note that it is experimentally shown in (Lambert et al., 2023) that NAG can be an alternative to EE.

Although the term acceleration often suggests computational speed-up, we clarify that our current framework focuses on theoretical acceleration in the continuous-time dynamics, rather than practical runtime efficiency. In fact, due to the reliance on full eigendecomposition of the graph Laplacian, the overall computational complexity remains O(n3) in our current implementation. Nonetheless, we believe that this spectral formulation offers valuable analytical insight, and that future implementations using approximate eigensolvers (e.g., Lanczos algorithm) may help bridge the gap between theoretical structure and practical scalability.

For example, the oscillatory behavior revealed through Bessel-type dynamics is directly captured in the ARR metric we propose, which provides an interpretable and data-driven criterion for early stopping. This analytical formulation could inspire new techniques for adaptive exaggeration scheduling or spectrally informed initialization, potentially improving both the interpretability and efficiency of t-SNE variants in large-scale applications.

3 Original t-SNE algorithm

We explain the original t-SNE algorithm (van der Maaten & Hinton, 2008). For a vector a = (ai)1 i n Rn, define its lp norm by a p = (Pn i=1 |ai|p)1/p. For a matrix A = (aij) Rn n and its spectral norm is A|| = sup x 2 1 Ax 2. We also introduce sets of orthogonal matrices O(n, k) = {V Rn k; V V = Ik} and O(n) = O(n, n). We use the notation [n] = {1, 2, . . . , n} for n Z>0. The diameter of a set S( Rn) is diam(S) = supx,y S x y 2. For a symmetric matrix A = (aij)1 i,j n Rn n, we define degree operator D : Rn n Rn n as D(A) = diag(Pn i=1 ai1, . . . , Pn i=1 ain), and Laplacian operator L : Rn n Rn n by L(A) = D(A) A. Also, we set 1n = (1, . . . , 1) Rn and Hn = 1 n(n 1)(1n1 n In) Rn n.

Let {Xi}i [n] Rdh be the dh-dimensional data points. For the pairs of data points {(Xi, Xj)}1 i =j n, we define a symmetric matrix P = (pij)1 i,j n Rn n with pii = 0 for i [n] and pij = (pi|j + pj|i)/(2n) for i = j, where

pj|i = exp( ||Xi Xj||2 2/2τ 2 i ) P

l {1,2,...,n}\{i} exp( ||Xi Xl||2 2/2τ 2 i ). (1)

Here, τi > 0 is the bandwidth parameter. The aim of t-SNE is to achieve a low-dimensional representation {yi}i [n] Rdl, dl < dh of {Xi}i [n]. Typically, dl is two or three because t-SNE is predominantly used for visualization, and in this paper, we consider dl = 2 without loss of generality. Consider a symmetric matrix Q = (qij)1 i,j n where qii = 0 for i [n] and

qij = (1 + ||yi yj||2 2) 1 P

l,s {1,2,...,n},l =s(1 + ||yl ys||2 2) 1 (2)

for i = j. We get the low dimensional representation {yi}i [n] by minimizing the Kullback Leibler (KL) divergence (Kullback & Leibler, 1951) between matrices P and Q as (y1, . . . , yn) = arg miny1,...,yn P

i,j [n], i =j pij log pij

As described in (Cai & Ma, 2022), the t-SNE algorithm is typically implemented in two distinct phases: an EE stage followed by an embedding stage. While previous studies have explored accelerating the EE stage by introducing a parameter α > 1 into the GD framework, this paper extends the concept further

Published in Transactions on Machine Learning Research (06/2025)

by investigating whether analogous acceleration strategies can be systematically applied to MM and NAG. Although the explicit form of the objective function for the case with α > 1 is not known, we proceed by assuming the following formulation in line with the original t-SNE work (van der Maaten & Hinton, 2008) and (Cai & Ma, 2022). The following update rule aligns to (Cai & Ma, 2022), which states

y(k+1) i = y(k) i + h X

1 i,j n, i =j (y(k) j y(k) i )S(k) ij (α) + m(k+1)(y(k) i y(k 1) i ), (3)

where h R+ represents the step size parameter, originally denoted as 4h in (van der Maaten & Hinton, 2008), and m(k+1) R is a momentum parameter. The value α > 0, known as the exaggeration parameter, plays a crucial role in the gradient coefficient S(k) ij (α) derived in Appendix A, and is given by

S(k) ij (α) = αpij q(k) ij 1 + y(k) i y(k) j 2 2 R, (4)

1 i,j n,i =j(y(k) j y(k) i )S(k) ij (α) R2, and we define adjacency matrix S(k) α = (S(k) ij (α))i,j [n]. The

algorithm starts with an initialization y(0) i = y( 1) i , i [n]. Although the momentum term m(k+1)(y(k) i y(k 1) i ) to accelerate the updating algorithm was not discussed for simplicity in (Cai & Ma, 2022), our aim is to discuss how we may accelerate the convergence of the algorithm and derive a dynamical system corresponding to accelerated updating methods.

4 Asymptotic behavior of the update equation

The points {y(k) i }i [n] R2 are represented in two dimensions, but they are reorganized into n vectors

according to the first and second coordinates as y(k) l Rn, l [2]. This identification and the following Theorem 4.1 enable us to make sure that the adjacency matrices S(k) α is considered as a fixed matrix αP Hn. For l [2], Eq. (3) is reformulated as

y(k+1) l = [In h L(S(k) α )]y(k) l + m(k+1)(y(k) l y(k 1) l ), (5)

where In Rn n is the identity matrix and y(k) l Rn is the l-th coordinates of {y(k) i }i [n].The following

theorem shows that the matrix S(k) α = (S(k) ij (α))i,j [n] can be approximately simplified. This approximation, combined with a continuous relaxation with Hn, enables the derivation of ODEs from the recurrence relation (3), making it one of the core statements of this paper. Theorem 4.1. (Asymptotic Graphical interpretation, Theorem 2 in (Cai & Ma, 2022)) Let P = (pij)1 i,j n be a symmetric matrix defined in Eq. (1) and denote η(k) := (diam({y(k) i }1 i n))2. Then, for any i, j [n] with i = j, and each k 1 such that η(k) < 1, we have S(k) ij (α) αpij + 1 n(n 1)

αpijη(k) + 2η(k)

n(n 1)(1 η(k)). (6)

Then for each k 1, as long as η(k) and α satisfy η(k) ||P|| n||P|| , α 1 n||P||, we have

lim n ||S(k) α (αP Hn)|| ||αP Hn|| = 0. (7)

With this statement, Eq. (5) admits an approximation

y(k+1) l [In h L(αP Hn)]y(k) l + m(k+1)(y(k) l y(k 1) l ). (8)

We consider the global behavior on {y(k) l }, l [2] under the following initialization and condition of parameters 1.

1The condition (I1) comes from Initialization , and (T1) describes Trajectory as n .

Published in Transactions on Machine Learning Research (06/2025)

(I1) {y(0) i }i [n], {y(1) i }i [n] satisfy minl [2]{ y(0) l 2, y(1) l 2} > 0 and

maxl [2]{ y(0) l , y(1) l } = O(1) as n .

(T1) The parameters (α, h, k) satisfy k(nhα P + h/n) = O(1) as n .

Condition (I1) states that the initialization {y(0) i } should not be simply all zeros or unbounded, and (T1) says that the cumulative differences of h L(S(k) α ) and h L(αP Hn) are limited.

The update formula (8) is a three-term recurrence relation and two initial vectors y(0) l and y(1) l need to be set. With the momentum term, the vectors y(k+1) i stays within limited area:

Proposition 4.2. (Localization) Under the condition (I1) and (T1), we have

diam({y(k+1) i }i [n]) C max l [2]{ y(0) l , y(1) l } (9)

for some universal constant C > 0.

The proposition 4.2 ensures that the data points {y(k) i } are globally bounded and the condition (I1) leads to η(k) < 1, and both the theorem 4.1 and the approximate Eq. (8) holds. This result had been derived in (Cai & Ma, 2022) without the momentum term; however, deriving this result when considering the momentum term is not trivial and requires an evaluation of the difference in solutions at two consecutive points t 1 and t originating from the momentum term.

5 Gradient flow with constant momentum coefficient

We assume that m(k+1) = m (0, 1), which is commonly assumed in existing works (Kovachki & Stuart, 2021; He et al., 2023; Hao et al., 2021; Sutskever et al., 2013; Rashidi et al., 2020). Let { y(k) l }k 0 be computed by the update equation

y(k+1) l = [In h L(αP Hn)] y(k) l + m( y(k) l y(k 1) l ). (10)

The following lemma derives the ODE that holds for MM in the continuous-time limit as h 0.

Lemma 5.1. With the assumption y(k) l Yl(kh), t = kh and the continuous limit h 0, the update rule (10) is reduced to the ODE with initial value y(0) l = Yl(0)

Yl(t) = 1 1 m L(αP Hn)Yl(t). (11)

Furthermore, the following proposition provides a theoretical guarantee that y(k) l and Yl(kh) become sufficiently close as h 0.

Proposition 5.2. (Gradient flow with constant momentum) Under the condition (I1), for l [2], k Z 0, and some positive constants C1, C2, we have

y(k) l Yl(kh) 2 Yl(0) 2 2khm C1

1 m L(αP Hn) + k

2 L(αP Hn) 2C2. (12)

This proposition justifies that the approximation y(k) l Yl(kh) works under the conditions (I1), (T1). To better understand the solution of the derived ODE (11), we consider its eigendecomposition. Remember that the matrix P is symmetric, and its Laplacian L(P) is likewise. With the eigendecomposition of L(P) expressed as L(P) = V ΛV, we have

Published in Transactions on Machine Learning Research (06/2025)

Lemma 5.3. The matrix L(αP Hn) is symmetric, and we have eigendecomposition:

L(αP Hn) = V ΣV, (13)

where Σ = diag(σi)i [n], Λ = diag(λi)i [n], 0 = λ1 λ2 . . . λn are eigenvalues2 and V O(n), which are the same eigenvectors of L(P). Here, the relation between Σ and Λ is

σ1 = λ1 = 0, σi = αλi 1 n 1 (2 i n). (14)

With the Lemma 5.1, we achieve the explicit expression

Yl(t) = exp t 1 m L(αP Hn) y(0) l . (15)

Combining the eigendecomposition and Eq. (15) yields the following proposition:

Proposition 5.4. (Solution path on constant momentum) For l [2], the first order linear ODE (11) with initial value Yl(0) = y(0) l and momentum term m (0, 1) has the unique solution:

Yl(t) = (u 1 y(0) l )u1 +

i=2 exp t 1 m

(u i y(0) l )ui. (16)

Propositions 5.2 and 5.4 extend (21) or Proposition 8 in (Cai & Ma, 2022) with a constant momentum term, as substituting m = 0 in Eq. (16) recovers the original result. The effect of momentum coefficients is discussed in Appendix I.7.

5.1 Well-conditioned matrix

The behavior of exp t 1 m αλi 1 n 1 for t 1 depends on the sign of αλi 1 n 1. If negative, the corresponding components are amplified; if positive, they are suppressed. It implies that

lim t Yl(t) span({u1, . . . , u R}). (17)

If the data {Xi}i [n] are well clustered and bandwidths τi are properly chosen, (Balakrishnan et al., 2011) demonstrates that P is well approximated. We assume the weighted graph of a well-conditioned matrix P , as detailed in Appendix E, has R 2 connected components.

5.2 Spectral convergence with constant momentum coefficient

Condition(T1) ensures Theorem 4.1 holds. We assume a stronger, analogous condition (T1.M), along with (T2.M), which governs eigenvalues and clustering behavior in section 5.1:

(T1.M) The parameters (α, h, t) satisfy α [nλR+1(L(P))] 1 and t 1 m = o(n) as n .

(T2.M) There exists a symmetric and well-conditioned matrix P Rn such that λR+1(L(P )) max{( tα 1 m) 1, L(P P) } and tα 1 m L(P P) = o(1) as n .

In the context of visualization and clustering, the following lemma is fundamental in identifying which components remain dominant in the eigendecomposition representation.

Lemma 5.5. Under the conditions (T1.M), (T2.M) and n 1, we have

αλR(L(P)) 1 n 1 0, αλR+1(L(P)) 1 n 1 > 0. (18)

Especially, the eigenvalues {σi} R consist of R-negative numbers or zeros (0 = σ1, σ2, . . . , σR), and n R positive numbers (σR+1, . . . , σn).

2We denote i-th eigenvalue as λi(A), if we emphasize the original matrix A.

Published in Transactions on Machine Learning Research (06/2025)

Note that this proposition is not explicitly stated in (Cai & Ma, 2022), but it implies that (18) holds under the conditions (T1.M) and (T2.M). This, in turn, means that the same R eigenvalues and eigenvectors as in the GD case with m = 0 remain in the visualization, thereby providing a theoretical guarantee that MM serves as an acceleration of GD.

6 Gradient flow with NAG

Another accelerated gradient method is NAG (Nesterov, 1983), which takes the following form: starting with y(0) l and w(0) l = y(0) l under our situation:

y(k+1) l = w(k) l h L(αP Hn)w(k) l , (19)

w(k) l = y(k) l + k 1

k + 2( y(k) l y(k 1) l ). (20)

Similar to Lemma 5.1, we have another linear ODE with NAG.

Lemma 6.1. With the assumption y(k) l Yl(k

h) for a smooth curve Yl(t) for t 0 and l [2], and putting k = t/

h, we have the ODE corresponding to the update rule of NAG in Eqs. (19), (20) as

t Yl(t) + L(αP Hn)Yl(t) = 0. (21)

The following proposition provides a theoretical guarantee that y(k) l and Yl(k

h) holds for NAG, just as Proposition 5.2.

Proposition 6.2. As the step size h 0, the update rule (19), (20) is reduced to the ODE (21) in the sense that for all fixed T > 0, lim h 0 max 0 k T

h y(k) l Yl(k

h) 2 = 0. (22)

We aim to demonstrate that this situation would be reasonable under the following conditions:

(T1.N) The parameters (α, h, t) satisfy α [nλR+1(L(P))] 1 and t = o(n 1 2 ) as n .

(T2.N) There is a symmetric and well-conditioned matrix P Rn such that λR+1(L(P )) max{(t2α) 1, L(P P )}, and t2 L(P P) = o(1) as n .

Similar to Lemma 5.5, we have the following lemma:

Lemma 6.3. Under the conditions (T1.N), (T2.N) and n 1, we have

αλR(L(P)) 1 n 1 0, αλR+1(L(P)) 1 n 1 > 0. (23)

Especially, the entire eigenvalues (σ1, . . . , σn) consist of R-negative numbers or zeros (0 = σ1, σ2, . . . , σR), and n R positive numbers (σR+1, . . . , σn).

In the case of NAG, this proposition guarantees that, under the conditions (T1.N) and (T2.N), the same R eigenvalues and eigenvectors are emphasized in visualization and clustering, and that the number R matches that in both GD and MM as Lemma 5.5.

As in Lemma 5.3, we have another closed-form expression by solving linear ODE (21).

Theorem 6.4. Under the conditions (T1.N), (T2.N), partition the eigenvectors as V = (U, U ), where U O(n, R), U O(n, n R) 3. Then, we have

Yl(t) = UΓ1(t)y(0) l,1:R + U Γ2(t)y(0) l,R+1:n, (24)

3U is orthogonal complement of U.

Published in Transactions on Machine Learning Research (06/2025)

Γ1(t) = diag

1, 2 t σ2 I1(t σ2), . . . , 2 t σR I1(t σR)

, Γ2(t) = diag 2 t σR+1 J1(t σR+1), . . . , 2 t σn J1(t σn)

and the symbol I1( ) denotes the modified Bessel function of the first kind, while J1( ) represents the Bessel function of the first kind. Also, y(0) l,1:R = (y(0) l,1 , . . . , y(0) l,R) RR, y(0) l,R+1:n = (y(0) l,R+1, . . . , y(0) l,n) Rn R are the initial values.

The explicit expression (24) is the solution path under the NAG for the t-SNE algorithm. In the Appendix I.1, we consider the qualitative differences in the solutions of ODEs corresponding to GD, MM, and NAG by plotting the solutions.

Lemmas 5.5 and 6.3 demonstrate that eigenvalues and eigenvectors for which αλi 1 n 1 0 are emphasized, highlighting key features for spectral clustering. This phenomenon, where large eigenvalues λi decrease with increasing t, is referred to as implicit regularization. The subsequent proposition asserts that Yl(t) converges towards the subspaces spanned by the first R eigenvectors, which form the Laplacian null space commonly utilized in spectral clustering, thus embodying implicit regularization with the inclusion of momentum terms. Furthermore, the assumptions regarding t such as t = o(n), t = o(n1/2), tα 1 m L(P P) = o(1) or t2α L(P P) = o(1) imply necessity for early stopping to avoid overshooting.

Proposition 6.5. (Implicit regularization, clustering and early stopping) Under the conditions (I1), (T1.M) and (T2.M), let Yl(t) be the function defined as (16), U0 O(n, R) be such that the columns span the null space of P . Then we have

lim n Yl(t) U0U 0 Yl(t)||2 ||Yl(0)||2 = 0, l [2]. (25)

Also, for a permutation matrix O Rn n, we have

lim n Yl(t) Ozl 2

Yl(0) 2 = 0, l [2], (26)

where zl = (zl1, . . . , zl1, zl2, . . . , zl2, . . . , zl R, . . . , zl R) Rn and zlr = θ r y(0) l / nr for r [R], the number of zlr is nr, i.e. the number of nodes in the r-th connected component. Similarly, under conditions (I1), (T1.N) and (T2.N), let Yl(t) be the function defined in (24) U0 O(n, R) be that the columns span the null space of P . Then we have the same formulas (25), (26).

This proposition shows that in both cases, Yl(t) serves as eigenvectors defining the Laplacian null space, producing results equivalent to spectral clustering. Despite utilizing the same data points and the same adjacency matrix P with corresponding eigenvalues and eigenvectors, the increase in t presupposes distinct formulations: the MM adheres to t = o(n) and tα L(P P) = o(1) as n , whereas the NAG relies on t = o(n1/2) and t2α L(P P) = o(1) as n . It is consistent with the general observation that the NAG converges more rapidly than the MM. Also, Lyapunov exponents are discussed in Appendix H.

7 Stop timing

While commonly used metrics such as Trustworthiness (Venna & Kaski, 2001) and Continuity (Kaski et al., 2003) are effective in evaluating the quality of low-dimensional embeddings after dimensionality reduction, they are not suitable for determining the evaluation timing in the context of ODEs. To address this limitation, we propose a novel metric, Average Residual Ratio(ARR). As will be confirmed in Section 8.2 and I.6, the comparison between the proposed ARR and the Trustworthiness values of the embedding indicates a wellaligned correspondence. The term stop time is used informally to refer to a general criterion for deciding when to stop the algorithm, and should not be confused with the technical notion of stopping time in stochastic analysis. With Eqs. (16), (24), we express Yl(t) = Pn i=1 ci,l(t)ui with the time-variant coefficient ci,l(t) R. To evaluate the positive contribution of ci,l(t), define ai(t) = P l [2] |ci,l(t)|. With Eq. (14) and

in the case of MM, ai(t) = P

l [2] exp tσi

1 m |u i y(0) l |, where we set m = 0 for GD. By contrast, in the

Published in Transactions on Machine Learning Research (06/2025)

case of NAG,

2y(0) l,i t σi I1(t σi) , (i = 1, . . . , R)

2y(0) l,i t σi J1(t σi) , (i = R + 1, . . . , n).

Figure 1: t-SNE visualization of n = 500 KDD Cup 1999 dataset using three ODEs: GD, MM and NAG.

To determine the stopping time, we propose the Average Residual Ratio(ARR), which divides the equation into main terms (i = 1, .., R) and residuals (i = R + 1, .., n), based on the average contribution to clustering:

1 n R Pn i=R+1 ai(t)

1 R PR i=1 ai(t) + 1 n R Pn i=R+1 ai(t) . (28)

Published in Transactions on Machine Learning Research (06/2025)

Based on the Lemmas 5.5, 6.3 and the observation of Figures 4a, 4b in Appendix, it is expected that ai(t) increase monotonically for i = 1, . . . , R, and decreasing or gradually decreasing for i = R + 1, . . . , n.

We propose a stopping criterion based on a threshold (e.g., 0.01) and demonstrate its effectiveness in subsection 8.2. While t-SNE is unsupervised, this heuristic is similar to those used for convergence in algorithms like GD/MM/NAG. Our analysis shows that components with negative eigenvalues form clusters, while those with positive eigenvalues don t contribute to the final embedding. Thus, using the ratio of these components as a convergence criterion for the ODE is reasonable.

7.1 Lyapunov exponents

The ARR used to determine the stop timing was based on the time parameter t, but considering the solution of the ODE as a dynamical system reveals an intriguing relationship.

The Lyapunov exponents for MM and NAG are given by H.122, H.123. Lemmas 5.5, 6.3 show that the first R eigenvalues are non-positive. Let the Lyapunov exponent for MM be λLyap,MM,i and that for NAG be λLyap,NAG,i. When momentum parameter m does not get close to 1, it is observed that

λLyap,MM,i = 1 1 m

αλi + 1 n 1 = λLyap,NAG,i, for i [R]. (29)

This is guaranteed by the fact that the function of x satisfies 0 < x < 1 and x < x, and a larger Lyapunov exponent implies faster cluster convergence. This is consistent with the commonly stated notion that NAG achieves faster convergence as an optimization method.

8 Numerical experiments

We numerically evaluate our results by comparing the t-SNE algorithms with and without acceleration and their ODE counterparts with real-world datasets: KDDcup1999 and MNIST4. Results on other data are shown in Appendix I. In this paper, we set perplexity = 30, as in Cai & Ma (2022) 5. In all experiments in this section, all initial embedding vectors were generated randomly. The discussion on the methods of initialization in t-SNE and the impact of different initializations, along with experimental results, are presented and discussed in Appendices I.4 and I.5.

8.1 Experiment 1: Dynamic behavior of clustering over time

The KDDCup1999 dataset (Stolfo & Chan, 1999; Stolfo et al., 1998; Tavallaee et al., 2009) includes a wide variety of intrusions simulated in a military network environment, and comprises 42 columns with their labels. We extracted 100 samples of data for each of the 5 labels ( smurf , neptune , normal , back , satan ) from the dataset, totaling 500 samples.

For the solution of ODEs corresponding to GD, MM and NAG, we substitute concrete value of the time t to obtain two dimensional map of the original points in 42 dimensional space in Fig. 1, where the top, middle and bottom rows correspond to results of GD, MM and NAG, respectively. The column is aligned with the same iteration count k, and as we move to the right, k increases, with the corresponding time t increasing accordingly. The first row corresponds to GD, the second to MM, and the third to NAG. Comparing GD and MM, as described in Eq. (16), the effect of the momentum term m causes the results to appear as though time t has been fast-forwarded, even for the same time point. As for NAG, it can be observed that cluster formation occurs earlier in time t. Specifically, while the second column of MM corresponds to t = 220, the fourth column of NAG at t = 221.36 already shows that cluster formation is nearly complete. We state the assumption in Lemma 5.1 about GD and MM, that is we identify the variables as t = kh, where the variables k is iteration number of the original iterative algorithms, h as step size and t is time variable. For NAG, the identification can be t = k

h as in Lemma 6.1.

4The experiments were conducted on a laptop PC with a 12th Gen Intel(R) Core(TM) i7-1255U processor, 500GB storage, 16GB memory, using Python. 5Perplexity is typically set between 5 and 50, as mentioned in van der Maaten & Hinton (2008).

Published in Transactions on Machine Learning Research (06/2025)

Let s pick up the first MM case (t = 50) and the second NAG case (t = 69.57). The variables t are formally different, but for MM, due to the presence of the momentum term m = 0.5, it effectively becomes t = 69.57(= 50/(1 0.5))(See the formula in Proposition 5.4, where we adopt that MM can be understood as fast-forwarding GD with respect to the variable t.). Although they have practically similar variable t, the results are drastically different. The three plots in the rightmost column are all elongated.

Furthermore, in NAG, the final plot on the rightmost panel corresponds to t = 221.36, while in the MM, the second plot from the left corresponds to t = 314.29. The application conditions for the NAG are either t = o(n1/2) in (T1.N) and t2 P P = o(1) condition in (T2.N). By contrast, the application conditions for the MM are either t 1 m = o(n) in (T1.M) and the condition t 1 m P P = o(1) in (T2.M). This generally suggests that the NAG can only be applied for relatively smaller values compared to the MM.

8.2 Experiment 2: Stopping criterion

Although we have conducted experiments to demonstrate that the accelerated method can indeed achieve embeddings in fewer time steps in the previous section, in our analysis, the embeddings are explicitly obtained as solutions to differential equations. Therefore, while this research originated from accelerating t-SNE, it no longer makes sense to compare it with other methods from the perspective of computational efficiency. We demonstrate that by setting a threshold to ARR (28), we obtain a reasonable embedding without using an iterative optimization algorithm with a popular MNIST dataset, which contains grayscale images of handwritten digits, and we focus on 400 images with 4 labels( 2 , 4 , 6 , 8 ) totaling 1600 samples, where each image contains 28 28 = 784 pixels. At the same time, we provide a comparison with standard Trustworthiness metric (Venna & Kaski, 2001), which is commonly used in dimensionality reduction methods such as t-SNE.

Figure 2 shows the transition of ARR (28) and Trustworthiness with 10 nearest neighbors over the time t (left) and the embedding results for each method: GD, MM, and NAG after early exaggeration stage (right-top) and embedding stage (right-bottom).

As t increases, ARR decreases, with NAG being the fastest, followed by MM, and GD being the slowest. The ARR crosses the 0.01 threshold at t = 2700 for GD, t = 1400 for MM, and t = 280.0 for NAG. The right panels show that the embeddings at these times provide reasonable clustering. As noted in section 7.1, cluster formation occurs fastest with NAG, followed by MM and GD. This can be interpreted by analyzing the Lyapunov exponents of the underlying ODE. When comparing ARR and Trustworthiness, as ARR decreases, Trustworthiness tends to increase, showing a general inverse correlation. The correlation coefficient was approximately 0.77. More details are reported in Appendix I.6.

As shown in the original paper (van der Maaten & Hinton, 2008) and the subsequent study in (Cai & Ma, 2022), the t-SNE algorithm typically visualizes clustering results by performing an embedding stage following the early exaggeration stage, which intend to facilitate cluster formation. The visualization in the lower right of Figure 2 follows this convention. It is important to note that the embedding stage is solely for visualization purposes. Therefore, no acceleration methods such as MM or NAG were applied. Instead, GD was iteratively used after the early exaggeration stage.

9 Conclusion

In this paper, we derive a linear ODE and Bessel differential equation corresponding to the MM and NAG in optimizing t-SNE with explicit solutions. In addition, akin to results obtained through GD, performing an eigenvalue decomposition confirms that implicit regularization, specifically clustering around small eigenvalues near zero, which is crucial in spectral clustering, similarly holds for them.

In addition to its significance as a theoretical analysis of the t-SNE algorithm a popular method for dimensionality reduction and visualization there is practical utility in that the ODE we derive has a closed-form solution. This means that by performing eigenvalue decomposition of the adjacency matrix in advance, we have the advantage of obtaining embedding results at any desired point in time.

Published in Transactions on Machine Learning Research (06/2025)

0 500 1000 1500 2000 2500 3000 t

ARR / Trustworthiness

ARR / Trustworthiness for MNIST dataset

Method & Metric

0.01 0.00 0.01 0.02 0.03

GD (k=27, t=2700)

0.01 0.00 0.01 0.02 0.03

MM (k=14, t=1400)

0.02 0.01 0.00 0.01 0.02 0.03

NAG (k=28, t=280.0)

1.5 1.0 0.5 0.0 0.5 1.0 1.5

2.0 GD (k=200)

1.5 1.0 0.5 0.0 0.5 1.0 1.5

1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0

NAG (k=200)

Figure 2: (left): transition of ARR and Trustworthiness (TW). (right): clustering result for MNIST dataset with ARR=0.01 using three ODEs GD, MM, and NAG (right-top), and result after the following embedding stage (right-bottom)

While this framework currently relies on a full eigendecomposition of the graph Laplacian, which incurs O(n3) complexity, we emphasize that this step is used primarily for theoretical analysis and computation of diagnostic metrics such as ARR. For large-scale applications, we anticipate that approximate methods such as the Lanczos algorithm or randomized eigendecomposition could reduce the computational burden to around O(n2), especially for sparse graphs. Exploring these approximations would be a promising direction for extending our framework to high-dimensional datasets at scale.

We discuss the approximated formula for t-SNE only concerning the EE stage but do not address the embedding stage. Theoretical analysis of the embedding stage will complement our work and further deepen the understanding of t-SNE. Additionally, we incorporate bandwidth τi into the perplexity but only examined specific case. A broader analysis of different scenarios and their implications on perplexity would provide a more comprehensive understanding.

UMAP (Ghojogh et al., 2021) has been frequently used alongside t-SNE. There are studies discussing the relationship between t-SNE and UMAP (Damrich et al., 2023); hence, analyzing UMAP through a dynamical systems-based approach is an intriguing direction for future research.

Acknowledgement

Part of this work is supported by JSPS KAKENHI (23K24909 and 25H01494). Finally, we express our special thanks to the editor and anonymous reviewers whose valuable comments helped to improve the manuscript.

Sanjeev Arora, Wei Hu, and Pravesh K. Kothari. An analysis of the t-SNE algorithm for data visualization. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 1455 1462. PMLR, 06 09 Jul 2018. URL https://proceedings.mlr.press/v75/arora18a.html.

Published in Transactions on Machine Learning Research (06/2025)

Antonio Auffinger and Daniel Fletcher. Equilibrium distributions for t-distributed stochastic neighbour embedding. ar Xiv preprint ar Xiv:2304.03727, 2023.

Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, and Aarti Singh. Noise thresholds for spectral clustering. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pp. 954 962, 2011. URL https://proceedings.neurips.cc/paper/2011/hash/ dbe272bab69f8e13f14b405e038deb64-Abstract.html.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani (eds.), Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pp. 585 591. MIT Press, 2001. URL https://proceedings.neurips.cc/paper/2001/hash/ f106b7f99d2cb30c3db1c3cc0fde9ccb-Abstract.html.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation.

Neural Comput., 15(6):1373 1396, 2003. doi: 10.1162/089976603321780317. URL https://doi.org/10. 1162/089976603321780317.

Jan Niklas Böhm, Philipp Berens, and Dmitry Kobak. Unsupervised visualization of image datasets using contrastive learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. URL https://openreview.net/pdf?id= n I2Hm VA0hvt.

T. Tony Cai and Rong Ma. Theoretical foundations of t-SNE for visualizing high-dimensional clustered data.

J. Mach. Learn. Res., 23:301:1 301:54, 2022. URL http://jmlr.org/papers/v23/21-0524.html.

Miguel Á. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In Johannes Fürnkranz and Thorsten Joachims (eds.), Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 167 174. Omnipress, 2010. URL https://icml. cc/Conferences/2010/papers/123.pdf.

Pin-Yu Chen, Baichuan Zhang, Mohammad Al Hasan, and Alfred O. Hero III. Incremental method for spectral clustering of increasing orders. Co RR, abs/1512.07349, 2015. URL http://arxiv.org/abs/ 1512.07349.

Sebastian Damrich, Jan Niklas Böhm, Fred A. Hamprecht, and Dmitry Kobak. From t-SNE to UMAP with contrastive learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open Review.net, 2023. URL https://openreview.net/pdf?id= B8a1Fc Y0vi.

Aaron Defazio. On the curved geometry of accelerated optimization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 1764 1773, 2019. URL https:// proceedings.neurips.cc/paper/2019/hash/e515df0d202ae52fcebb14295743063b-Abstract.html.

Takanori Fujiwara, Jia-Kai Chou, Shilpika, Panpan Xu, Liu Ren, and Kwan-Liu Ma. An incremental dimensionality reduction method for visualizing streaming multidimensional data. IEEE Trans. Vis. Comput. Graph., 26(1):418 428, 2020. doi: 10.1109/TVCG.2019.2934433. URL https://doi.org/10.1109/TVCG. 2019.2934433.

Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, and Mark Crowley. Uniform manifold approximation and projection (UMAP) and its variants: Tutorial and survey. Co RR, abs/2109.02508, 2021. URL https: //arxiv.org/abs/2109.02508.

Published in Transactions on Machine Learning Research (06/2025)

Shin-itiro Goto and Hideitsu Hino. Fast symplectic integrator for nesterov-type acceleration method. Japan

Journal of Industrial and Applied Mathematics, 42(1):331 372, 2025. URL https://doi.org/10.1007/ s13160-024-00680-4.

Yunhui Guo, Haoran Guo, and Stella X. Yu. CO-SNE: dimensionality reduction and visualization for hyperbolic data. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 11 20. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00011. URL https://doi.org/10.1109/CVPR52688.2022.00011.

Zhiyong Hao, Yixuan Jiang, Huihua Yu, and Hsiao-Dong Chiang. Adaptive learning rate and momentum for training deep neural networks. Co RR, abs/2106.11548, 2021. URL https://arxiv.org/abs/2106.11548.

Lulu He, Jimin Ye, and Jianwei E. Nonconvex optimization with inertial proximal stochastic variance reduction gradient. Inf. Sci., 648:119546, 2023. doi: 10.1016/J.INS.2023.119546. URL https://doi.org/ 10.1016/j.ins.2023.119546.

Geoffrey E Hinton and Sam Roweis. Stochastic neighbor embedding. In S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems, volume 15. MIT Press, 2002. URL https://proceedings.neurips.cc/paper_files/paper/2002/file/ 6150ccc6069bea6b5716254057a194ef-Paper.pdf.

Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193 218, 1985.

Seonghyeon Jeong and Hau-Tieng Wu. Convergence analysis of t-sne as a gradient flow for point cloud on a manifold, 2024.

I.T. Jolliffe. Principal Component Analysis. Springer Verlag, 1986.

Samuel Kaski, Janne Nikkilä, Merja Oja, Jarkko Venna, Petri Törönen, and Eero Castrén. Trustworthiness and metrics in visualizing similarity of gene expression. BMC Bioinform., 4:48, 2003. doi: 10.1186/ 1471-2105-4-48. URL https://doi.org/10.1186/1471-2105-4-48.

Dmitry Kobak and George C Linderman. Initialization is critical for preserving global data structure in both t-sne and umap. Nature biotechnology, 39(2):156 157, 2021.

Dmitry Kobak, George C. Linderman, Stefan Steinerberger, Yuval Kluger, and Philipp Berens. Heavytailed kernels reveal a finer cluster structure in t-SNE visualisations. In Ulf Brefeld, Élisa Fromont, Andreas Hotho, Arno J. Knobbe, Marloes H. Maathuis, and Céline Robardet (eds.), Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019, Würzburg, Germany, September 16-20, 2019, Proceedings, Part I, volume 11906 of Lecture Notes in Computer Science, pp. 124 139. Springer, 2019. doi: 10.1007/978-3-030-46150-8\_8. URL https://doi.org/10.1007/ 978-3-030-46150-8_8.

Nikola B. Kovachki and Andrew M. Stuart. Continuous time analysis of momentum methods. J. Mach.

Learn. Res., 22:17:1 17:40, 2021. URL http://jmlr.org/papers/v22/19-466.html.

Jesse H. Krijthe. Rtsne: t-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation, 2015. URL https://github.com/jkrijthe/Rtsne. R package version 0.17.

Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical

statistics, 22(1):79 86, 1951.

Pierre Lambert, John A Lee, Edouard Couplet, and Cyril De Bodt. Nesterov momentum and gradient normalization to improve t-SNE convergence and neighborhood preservation, without early exaggeration. ESANN proceedings, 1:1, 2023.

George C. Linderman and Stefan Steinerberger. Clustering with t-sne, provably. SIAM J. Math. Data Sci., 1(2):313 332, 2019. doi: 10.1137/18M1216134. URL https://doi.org/10.1137/18M1216134.

Published in Transactions on Machine Learning Research (06/2025)

George C. Linderman and Stefan Steinerberger. Dimensionality reduction via dynamical systems: The case of t-sne. SIAM Rev., 64(1):153 178, 2022. doi: 10.1137/21M1446769. URL https://doi.org/10.1137/ 21m1446769.

Leland Mc Innes and John Healy. UMAP: uniform manifold approximation and projection for dimension reduction. Co RR, abs/1802.03426, 2018. URL http://arxiv.org/abs/1802.03426.

Kevin R Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel B Burkhardt, William S Chen, Kristina Yim, Antonia van den Elzen, Matthew J Hirn, Ronald R Coifman, Natalia B Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing structure and transitions in high-dimensional biological data. Nature biotechnology, 37(12):1482 1492, December 2019. ISSN 1087-0156,1546-1696. doi: 10.1038/s41587-019-0336-3.

Ryan Murray and Adam Pickarski. Large data limits and scaling laws for tsne. Co RR, abs/2410.13063, 2024. doi: 10.48550/ARXIV.2410.13063. URL https://doi.org/10.48550/ar Xiv.2410.13063.

Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence o(1/k2). Soviet Mathematics Doklady, 27:372 376, 1983.

Frank W. J. Olver and Leonard C. Maximon. Bessel functions. In Frank W. J. Olver, Daniel W. Lozier, Ronald F. Boisvert, and Charles W. Clark (eds.), NIST Handbook of Mathematical Functions, pp. 215 286. Cambridge University Press, 2010.

Antonio Orvieto and Aurelien Lucchi. Continuous-time models for stochastic optimization algorithms. In

Advances in Neural Information Processing Systems 32, pp. 12610 12622. Curran Associates, Inc., 2019.

Nicola Pezzotti, Boudewijn P. F. Lelieveldt, Laurens van der Maaten, Thomas Höllt, Elmar Eisemann, and Anna Vilanova. Approximated and user steerable t-SNE for progressive visual analytics. IEEE Trans. Vis. Comput. Graph., 23(7):1739 1752, 2017. doi: 10.1109/TVCG.2016.2570755. URL https: //doi.org/10.1109/TVCG.2016.2570755.

Zana Rashidi, Kasra Ahmadi K. A., Aijun An, and Xiaogang Wang. Adaptive momentum coefficient for neural network optimization. In Frank Hutter, Kristian Kersting, Jefrey Lijffijt, and Isabel Valera (eds.), Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2020, Ghent, Belgium, September 14-18, 2020, Proceedings, Part II, volume 12458 of Lecture Notes in Computer Science, pp. 35 51. Springer, 2020. doi: 10.1007/978-3-030-67661-2\_3. URL https://doi.org/10. 1007/978-3-030-67661-2_3.

Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding.

Science, 290(5500):2323 2326, 2000. doi: 10.1126/science.290.5500.2323. URL https://www.science. org/doi/abs/10.1126/science.290.5500.2323.

Martin Skrodzki, Nicolas F. Chaves-de-Plaza, Klaus Hildebrandt, Thomas Höllt, and Elmar Eisemann. Tuning the perplexity for and computing sampling-based t-SNE embeddings. Co RR, abs/2308.15513, 2023. doi: 10.48550/ARXIV.2308.15513. URL https://doi.org/10.48550/ar Xiv.2308.15513.

Martin Skrodzki, Hunter van Geffen, Nicolas F. Chaves-de-Plaza, Thomas Höllt, Elmar Eisemann, and Klaus Hildebrandt. Accelerating hyperbolic t-sne. Co RR, abs/2401.13708, 2024. doi: 10.48550/ARXIV.2401. 13708. URL https://doi.org/10.48550/ar Xiv.2401.13708.

Fan Wei Lee Wenke Prodromidis Andreas Stolfo, Salvatore and Philip Chan. KDD Cup 1999 Data. UCI Machine Learning Repository, 1999. DOI: https://doi.org/10.24432/C51C7N.

Salvatore J. Stolfo, Wei Fan, Wenke Lee, Andreas L. Prodromidis, and Philip Chan. KDD cup 1999 data. https://doi.org/10.24432/C51C7N, December 1998. URL https://doi.org/10.24432/C51C7N. Accessed on 2024-12-05.

Weijie Su, Stephen P. Boyd, and Emmanuel J. Candès. A differential equation for modeling nesterov s accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17:153:1 153:43, 2016. URL http://jmlr.org/papers/v17/15-084.html.

Published in Transactions on Machine Learning Research (06/2025)

Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pp. 1139 1147. JMLR.org, 2013. URL http://proceedings.mlr.press/v28/sutskever13. html.

P. Szymański and T. Kajdanowicz. A scikit-based Python environment for performing multi-label classification. Ar Xiv e-prints, February 2017.

Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani. A detailed analysis of the KDD CUP 99 data set. In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009, Ottawa, Canada, July 8-10, 2009, pp. 1 6. IEEE, 2009. doi: 10.1109/CISDA.2009.5356528. URL https://doi.org/10.1109/CISDA.2009.5356528.

J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319 2323, December 2000.

Warren S Torgerson. Multidimensional scaling: I. theory and method. Psychometrika, 17(4):401 419, 1952.

Laurens van der Maaten. Barnes-hut-sne. Co RR, abs/1301.3342, 2013. URL https://api. semanticscholar.org/Corpus ID:208915826.

Laurens van der Maaten. Accelerating t-sne using tree-based algorithms. J. Mach. Learn. Res., 15:3221 3245, 2014. URL https://api.semanticscholar.org/Corpus ID:14618636.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning

research, 9(11), 2008.

Jarkko Venna and Samuel Kaski. Neighborhood preservation in nonlinear projection methods: An experimental study. In Georg Dorffner, Horst Bischof, and Kurt Hornik (eds.), Artificial Neural Networks - ICANN 2001, International Conference Vienna, Austria, August 21-25, 2001 Proceedings, volume 2130 of Lecture Notes in Computer Science, pp. 485 491. Springer, 2001. doi: 10.1007/3-540-44668-0\_68. URL https://doi.org/10.1007/3-540-44668-0_68.

Max Vladymyrov and Miguel Á. Carreira-Perpiñán. Fast training of nonlinear embedding algorithms. Ar Xiv, abs/1206.4646, 2012. URL https://api.semanticscholar.org/Corpus ID:52800064.

Max Vladymyrov and Miguel Á. Carreira-Perpiñán. Linear-time training of nonlinear low-dimensional embeddings. In International Conference on Artificial Intelligence and Statistics, 2014. URL https: //api.semanticscholar.org/Corpus ID:1083215.

Ulrike von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395 416, 2007. doi: 10.1007/ S11222-007-9033-Z. URL https://doi.org/10.1007/s11222-007-9033-z.

George Neville Watson. A treatise on the theory of Bessel functions, volume 2. The University Press, 1922.

Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351 E7358, 2016. ISSN 0027-8424. doi: 10.1073/pnas.1614734113.

Ashia C. Wilson, Benjamin Recht, and Michael I. Jordan. A lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res., 22:113:1 113:34, 2021. URL https://api.semanticscholar.org/ Corpus ID:235341488.

Zhirong Yang, Jaakko Peltonen, and Samuel Kaski. Scalable optimization of neighbor embedding for visualization. In International Conference on Machine Learning, 2013. URL https://api.semanticscholar. org/Corpus ID:14514845.

Published in Transactions on Machine Learning Research (06/2025)

Ryeongkyung Yoon and Braxton Osting. A dynamical systems based framework for dimension reduction.

Co RR, abs/2204.08155, 2022. doi: 10.48550/ARXIV.2204.08155. URL https://doi.org/10.48550/ ar Xiv.2204.08155.

Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the davis kahan theorem for statisticians.

Biometrika, 102(2):315 323, 2015.

Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct runge-kutta discretization achieves acceleration. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 3900 3909. Curran Associates, Inc., 2018.

Yuansheng Zhou and Tatyana O Sharpee. Hyperbolic geometry of gene expression. Iscience, 24(3), 2021.

Ye Zhu and Kai Ming Ting. Improving the effectiveness and efficiency of stochastic neighbour embedding with isolation kernel (extended abstract). In Lud De Raedt (ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 5792 5796. International Joint Conferences on Artificial Intelligence Organization, 7 2022. doi: 10.24963/ijcai.2022/812. URL https://doi.org/10. 24963/ijcai.2022/812. Journal Track.

Published in Transactions on Machine Learning Research (06/2025)

A Derivation of the t-SNE gradient coefficient

For the sake of brevity, we omit the iteration index k, we use yi, pij, qij, instead of y(k) i , p(k) ij and q(k) ij . The contents basically follows to Appendix A in van der Maaten & Hinton (2008).

t-SNE minimizes the KL divergence between the joint probabilities pij in the high-dimensional space Rdh and the joint probabilities qij in the low-dimensional space Rdl. They are defined in Eq. (1), pij = (pi|j +pj|i)/2n and Eq. (2), respectively. Cost function C is defined with KL divergence Kullback & Leibler (1951):

C = KL(P||Q) = X

ij pij log pij

ij (pij log pij pij log qij) . (A.1)

For the purpose of the derivation briefly, we set two variables dij and Z:

dij = yi yj (A.2)

k =l (1 + d2 kl) 1. (A.3)

Then, with the chain rule, the gradient of the cost function C with respect to yi is given by

(yi yj)/dij (A.4)

C dij (yi yj)/dij. (A.5)

pij are defined with high-dimensional data points, and they re irrelevant to yi or dij that is pkl

dij = 0 for k, l [n]. Then, we have

k =l pkl log qkl

k =l pkl (log qkl Z log Z)

k =l pkl 1 qkl Z ((1 + d2 kl) 1) dij 1

The gradient term ((1+d2 ij) 1) dij is only nonzero when k = i and l = j. Then,

C dij = 2 pij

qij Z (1 + d2 ij) 2dij 2 X

k =l pkl (1 + d2 ij) 2dij Z . (A.9)

Note that P

k =l pkl = 1 and we have

C dij = 2pij(1 + d2 ij) 1dij 2qij(1 + d2 ij) 1dij = 2(pij qij)(1 + d2 ij) 1dij. (A.10)

Finally, we have C yi = 4 X

pij qij 1 + yi yj 2 (yi yj). (A.11)

In this formula, we aim to emphasize the effect of the original data distribution using pij. We formally substitute pij with αpij and define the term αpij qij 1+ yi yj 2 as Sij(α). This leads to the expression in Eq. (4).

Published in Transactions on Machine Learning Research (06/2025)

B The effect of accelerator parameter

In this section, we examine the effect of the acceleration parameter α separately for GD and MM/NAG.

B.1 The case of GD

First, consider the update equation (B.12) for GD. The low-dimensional representations y(k) i and y(k) j (i = j)

show that y(k+1) i is calculated using y(k) i and a gradient term. The sign of αpij q(k) ij determines its directional

influence. We define αpij as the attractive force and q(k) ij as the repulsive force. When αpij q(k) ij > 0,

meaning the attractive force is stronger, y(k+1) i moves closer to y(k) j compared to y(k) i . Conversely, when

αpij q(k) ij < 0, the repulsive force dominates, causing y(k+1) i to move further away from y(k) j . Figure 3 illustrates this situation.

y(k+1) i = y(k) i + h X

1 i,j n, i =j

αpij q(k) ij 1 + y(k) i y(k) j 2 2 (y(k) j y(k) i ) (B.12)

in case that α𝑝𝑖𝑗(attractive force) > 𝑞𝑖𝑗

(𝑘)(repulsive force)

In case that α𝑝𝑖𝑗(attractive force)

(𝑘)(repulsive force)

Figure 3: Diagram of the update equation for y(k) i

Therefore, the larger the value of α > 1, the stronger the attractive force among nodes becomes. As a result, this generally facilitates the formation of clusters.

B.2 The case of MM and NAG

As the update equations (10) for MM and (19) or (20) for NAG show, y(k+1) i is calculated using both the gradient term and the momentum term. Acceleration is achieved by incorporating not only the current gradient but also past updates. Even when using MM or NAG, the amplification of the attractive force with α tends to occur.

Published in Transactions on Machine Learning Research (06/2025)

In summary, when α > 1, the attractive force generally becomes stronger, making it easier to form clusters across all optimization methods GD, MM, and NAG. This suggests that a larger α can accelerate the clustering process.

C Asymptotic behavior of the update equation

C.1 Proof of Theorem 4.1

This theorem in Cai & Ma (2022) is based without momentum term in (3) and it can be extended even with momentum term, because the proof doesn t refer to the update equation (5), but just the definition of S(k) ij (α) in Eq. (4).

C.2 Proof of Proposition 4.2

This proof proceeds in the same manner as the proof of Proposition 3 in Cai & Ma (2022). Note that y(k+1) li [I h S(k) α ]i. 1 y(k) l + m(k+1) y(k) l y(k 1) l for any k 1, where

[I h S(k) α ]i. 1 y(k) l = 1 h

j=1 S(k) ij (α) + h X

j =i |S(k) ij | 1 + 2h

j=1 |S(k) ij (α)|.

For the second term, we have

j=1 |S(k) ij (α)| hn S(k) α nh(α P + Q ) nhα P + h(1 + η(k))

where the last inequality follows from Eq. (40) in Cai & Ma (2022). Intermediate variable Z is defined as P

i =j(1 + y(k) i y(k) j 2 2) 1 and we can deduce it with Q(k) 1/Z, Z n(n 1)/(1 + η(k)) and Q(k) (1 + η(k))/n(n 1). Then we have

y(k+1) li 1 + 2nhα P + h(1 + η(k))

y(k) l + m(k+1) y(k) l + m(k+1) y(k 1) l

1 + 2nhα P + h(1 + η(k))

n 1 + m(k+1) y(k) l + m(k+1) y(k 1) l

1 + 2nhα P + h(1 + η(k))

n 1 + m(k+1) ( y(k) l + y(k 1) l )

2 + 4nhα P + 2h(1 + η(k))

n 1 + 2m(k+1) max l [2]{ y(k) l , y(k 1) l },

y(k+1) l 2 + 4nhα P + 2h(1 + η(k))

n 1 + 2m(k+1) max l [2]{ y(k) l , y(k 1) l }.

Introducing rn = nhα P + h

y(k+1) l (2 + 2m(k+1) + Crn) max l [2]{ y(k) l , y(k 1) l } (C.13)

and for any k 2, y(k) l (2 + 2m(k+1) + Crn)k 1 max l [2]{ y(1) l , y(0) l }. (C.14)

Lemma C.1. For k 2, η(k) and maxl [2]{ y(k) l , y(k 1) l } are bounded by a constant.

Published in Transactions on Machine Learning Research (06/2025)

Proof. In case of k = 2, Eq. (C.14) shows that

y(2) l (2 + 2m(k+1) + Crn) max l [2]{ y(1) l , y(0) l }. (C.15)

The condition (T1) states that maxl [2]{ y(2) l , y(1) l } is bounded. According to (C.14),

η(2) 4(2 + 2m(2) + Crn)2 max l [2]{ y(1) l 2 , y(0) l 2 } (C.16)

is also bounded. Then, let s say that the statement holds for k, i.e. both y(k) l and y(k+1) l are bounded. With inequality (C.13), y(k+1) l is also bounded. Assuming rn = O(1) by condition (T1)

η(k+1) 4 max i [n],l [2] |y(k+1) li |2 4 max l [2]{ y(k+1) 1 2 , y(k+1) 2 2 } (C.17)

4(2 + 2m(k+1) + Crn)2 max l [2]{ y(k) l 2 , y(k 1) l 2 } = O(1), (C.18)

we have η(k+1) is bounded. By induction, the statement holds for k 2.

As long as k = k(n) with krn = O(1) by condition (T1), we have

y(k) l / max l [2]{ y(1) l , y(0) l } = O(1), (C.19)

or, diam({y(k) i }i [n])

maxl [2]{ y(1) l , y(0) l } maxi [n],l [2] |y(k) li |

maxl [2]{ y(1) l , y(0) l } = O(1). (C.20)

It proves the statement.

D Derivation of ODEs

We proceed to concretely examine the following two cases: MM and NAG.

D.1 Proof of Lemma 5.1

The assumption y(k) l Yl(kh) is justfied with the following satatement, which is shown similar to Proposition 8 in Cai & Ma (2022). We consider the momentum terms to be a constant m(0 < m < 1), i.e. m(k+1) = m for k. So, we can rewrite the formula (8) as follows:

y(k+1) l [In h L(αP Hn)]y(k) l + m(y(k) l y(k 1) l ), l [2]. (D.21)

For l [2], let { y(k) l }k 0 be the sequence defined the iteration with the update formula:

y(k+1) l = [In h L(αP Hn)] y(k) l + m( y(k) l y(k 1) l ), k 1 (D.22)

with initial values y(0) l = y(0) l , y1 l = y(1) l . Introduce the assumption Yl(t) y(k) l for some smooth curve Yl(t) defined for t 0. Set t = kh. Then, the update formula (D.22) reduces to

Yl(t + h) = Yl(t) h L(αP Hn)Yl(t) + m(Yl(t) Yl(t h)). (D.23)

By dividing by h > 0, we have

Yl(t + h) Yl(t)

h = L(αP Hn)Yl(t) + m Yl(t) Yl(t h)

From the definition of the derivative d dt Yl(t) = limh 0 Yl(t+h) Yl(t)

h = limh 0 Yl(t) Yl(t h)

h , we have (15).

Published in Transactions on Machine Learning Research (06/2025)

D.2 Proof of Proposition 5.2

For any k Z 0, we set ek = y(k) l Yl(kh), L = L(αPn Hn) for brevity.

Lemma D.1. Let Yl(t) the formula in Proposition 5.1, and a symmectic matrix L has an eigendecomposition, L = QΛQ 1, (D.25)

where Λ = diag(λ1, , λn), λi R, λ1 λ2 λn and Q O(n). For t > 0

Yl(t) 2 exp tλ1 1 m

Yl(0) 2. (D.26)

Specifically, if 0 < m < 1 and the condition (I1) holds, Yl(t) 2 is bounded.

Proof. With the eigendecomposition L = QΛQ 1, we have

Yl(t) 2 exp t 1 m L Yl(0) 2, (D.27)

Q exp t 1 mΛ Q 1 Yl(0) 2, (D.28)

Q exp t 1 mΛ Q 1 Yl(0) 2, (D.29)

exp tλ1 1 m

Yl(0) 2. (D.30)

Using Taylor expansion for some ξ (kh, (k + 1)h), and Lemma 5.1, we have

Yl((k + 1)h) = Yl(kh) + hd Yl(kh)

= Yl(kh) h 1 m LYl(kh) + h2

2 L2Yl(ξ). (D.32)

We set these formulae into ek+1 = Yl((k + 1)h) y(k+1) l , (D.33)

ek+1 = Yl(kh) h 1 m LYl(kh) + 1

2 L2Yl(ξ) (D.34)

(1 + m)y(k) l h L y(k) l m y(k 1) l . (D.35)

With Yl(kh) = y(k) l + ek, we have

ek+1 = ek hm 1 m LYl(kh) + m( y(k 1) l y(k) l ) + 1

2 L2Yl(ξ). (D.36)

Due to triangular inequality, we have

ek+1 2 ek 2 + hm 1 m L y(k) l 2 + m y(k 1) l y(k) l 2 + 1

2 L2Yl(ξ) 2. (D.37)

Summing up on the index, we obtain

ek+1 2 e0 2 +

1 m L y(j) l 2 + m y(j 1) l y(j) l 2 + 1

2 L 2 Yl(ξj) 2 . (D.38)

Published in Transactions on Machine Learning Research (06/2025)

Thanks to our initial condition as in Lemma 5.1, e0 = y(0) l Yl(0) = (0, , 0) . Proposition 4.2 says that we have constant C1 > 0 such that maxj y(j) l 2 Yl(0) 2 C1. (D.39)

Additionally, with respect to the term y(j 1) l y(j) l 2, consider the update equation (10),

y(j) l y(j 1) l = m( y(j 1) l y(j 2) l ) h L y(j 1) l . (D.40)

We have y(j) l y(j 1) l 2 m y(j 1) l y(j 2) l 2 + h C1 L Yl(0) 2. (D.41)

By recursively applying these inequalities, we obtain

y(j 1) l y(j) l 2 (mj + mj 1 + . . . + m + 1)h C1 L Yl(0) 2 (D.42)

1 m h C1 L Yl(0) 2 h C1 1 m L Yl(0) 2. (D.43)

With Lemma D.1 and (I1),

Yl(0) 2 exp ξjλ1

We denote the right-hand side as C2 > 0. Under these conditions, we have

ek+1 2 Yl(0) 2 2hm(k + 1)C1

1 m L + k + 1

2 L 2C2. (D.45)

This proves the statement.

D.3 Proof of Lemma 5.3

Since L(P αHn) = L(αP Hn) 1 n 1In + 1 n(n 1)1n1 n , (D.46)

the eigenvectors L(αP Hn) and L(αP) share the eigenvectors. Also, if we decompose eigenvector as L(P αHn) = Pn i=1 σiuiu i , L(αP Hn)ui = σiui. (D.47)

We have σ1 = αλ1, σi = αλi 1 n 1 for 2 i n, which corresponds to (14).

D.4 Proof of Lemma 5.5

Remember that both λi(L(P )) λi+1(L(P )) for i [n 1] and λR+1(L(P )) is the first positive eigenvalue, i.e. λR(L(P )) = 0, λR+1(L(P )) > 0. (D.48)

With Weyl s inequality |λi(L(P)) λi(L(P ))| L(E) , (D.49)

where E = P P . When i = R in inequality. (D.49), we have

|αλR(L(P))| α L(E) . (D.50)

The condition (T2.M) says that t 1 mα L(E) = o(1), and it s equivalent that

ϵ > 0, N N s.t. n N t 1 mα L(E) < ϵ. (D.51)

Published in Transactions on Machine Learning Research (06/2025)

If we set ϵ = t (n 1)(1 m), we have

αλR(L(P)) 1 n 1 α L(E) 1 n 1 (D.52)

t 1 n 1 (D.53)

t (1 m) t (n 1)(1 m) 1 n 1 = 0. (D.54)

By contrast, when i = R + 1 in Ineq. (D.49)

|λR+1(L(P)) λR+1(L(P ))| L(E) . (D.55)

So, we have

λR+1(L(P)) λR+1(L(P )) L(E) . (D.56)

If we set ϵ = (1 t n 1) 1 1 m and the condition (T2.M) such as λR+1(L(P ) (tα) 1

αλR+1(L(P)) 1 n 1 αλR+1(L(P ) α L(E) 1 n 1 (D.57)

> αλR+1(L(P ) (1 m)ϵ

t 1 n 1 (D.58)

1 1 m 1 n 1 = 0. (D.59)

Note that the condition (T1.M) t = o(n) assures that ϵ = (1 t n 1) 1 1 m > 0 for n 1 and 0 < m < 1.

E Well-conditioned matrix

As discussed in section 5.1, we re interested in the behavior of Yl(t) for t 1. The sign of the terms αλi 1 n 1 decides it; if negative, the corresponding components are amplified; if positive, they are suppressed as noted in (17).

This scenario, where αλi 1 n 1 0 for 2 i R, indicates that the i-th eigenvalue λi is close to 0 under α > 0. Although the matrix L(P), by definition, has one connected component, suppose there exists another matrix P sufficiently close to P such that its Laplacian L(P ) has R connected components, implying the dimension of its null space is R. As discussed in (Cai & Ma, 2022), it is natural to pick up Laplacian null space. We call the adjacency matrix P as well-conditioned , if its associated weighted graph has R 2 connected components. Proposition E.1. (Proposition 6 in (Cai & Ma, 2022)) Let A Rn n be symmetric and well-conditioned matrix. Then, the smallest eigenvalue of the Laplacian L(A) is 0 and has multiplicity R, and its associated eigenspace is spanned by {θ1, . . . , θR}, where for each r [R],

( 1/ ni, if node j belongs component i, 0, otherwise, (E.60)

and nr is the number of nodes related to r-th connected component. Also, the null space of L(A) is spanned with these basis

, . . . , 1 n R

where 1nr is a length nr column vector with ones, and 0 is a zero vector with appropriate length.

When the data {Xi}i [n] are well clustered and bandwidths τi are appropriately selected, in (Balakrishnan et al., 2011), it is shown that we have a good approximated matrix of P. Such well-conditioned matrix P is obtained concretely in section 8. Such matrix P is theoretically used in implicit regularization theorem 6.5.

Published in Transactions on Machine Learning Research (06/2025)

F Derivation of ODE for NAG

F.1 Proof of Lemma 6.1

With two equations (19) and (20), we derive another ODE. Applying a rescaling, we have

y(k+1) l y(k) l

k + 2 y(k) l y(k 1) l

h L(αP Hn)w(k) l . (F.62)

Introduce the [assumption] y(k) l Y(k

h) for some smooth curve Yl(t) for t 0. Set k = t/

as the step size h goes to zero, Yl(t) y(t/

h) l = y(k) l and Yl(t +

h) l = y(k+1) l Taylor expansion gives

(y(k+1) l y(k) l )/

h = Yl(t) + 1

h), (y(k) l y(k 1) l )/

h = Yl(t) 1

h L(αP Hn)w(k) l =

h L(αP Hn)Yl + o(

h). So, (F.62) is written as

h L(αP Hn)Yl(t) + o(

By comparing the coefficients of

h, we achieve

t Yl(t) + L(αP Hn)Yl(t) = 0, l [2]. (F.63)

Note that the first initial condition is Yl(0) = y(0) l . Also, taking k = 1 in (F.62), we have

(y(1) l y(0) l )/

h L(αP Hn)wl(t) = o(1).

Then, the second initial condition is just Yl(0) = (0, . . . , 0) Rn.

F.2 Proof of Proposition 6.2

If we define a function f : Rn Rn as f(y) = 1

2y L(αP Hn)y for y Rn, the function f is differentiable and f(y) = L(αP Hn)y. So, we have

f(x) f(y) 2 σn x y 2, (F.64)

where σn is the maximum eigenvalue of L(αP Hn), especially f is σn-Lipschitz function. So, we can apply Proposition 2 in Su et al. (2016), which shows our statement.

F.3 Proof of Lemma 6.3

Similar to Lemma 5.5, we can deduce the results since we have t2 L(P P ) = o(1), that is

ϵ > 0, N N s.t. n N t2α L(E) < ϵ, (F.65)

λR+1(L(P )) (t2α) 1 which are both in (T2.N) and t = o(n1/2) as in (T1.N).

F.4 Proof of Proposition 6.4

As in the Lemma 5.3, we have L(αP Hn) = VΣV

dt + VΣU Yl(t) = 0. (F.66)

Published in Transactions on Machine Learning Research (06/2025)

By multiplying V from left-hand side, we have

dt2 (V Yl(t)) + 3

t d dt(V Yl(t)) + Σ(V Yl(t)) = 0, (F.67)

where we used V V = In. By replacing V Yl(t) with Yl(t),

dt + ΣYl(t) = 0, (F.68)

Yl(t) = 0. (F.69)

This means that for any i [n],

t d Yl,i(t)

dt + σi Yl,i(t) = 0 (F.70)

with Yl,i(0) = y(0) l,i , Yl,i(0) = 0.

Lemma F.1. For modified Bessel function of the first kind I1( ), we have

lim s +0 2I1(s)

s = 1. (F.71)

Proof. With Taylor expansion around 0, we have

1 k!Γ(k + 2)

2k+1 , (F.72)

where Γ(k + 2) is Gamma function at k + 2. For s > 0,

1 k!(k + 1)!

2k . (F.73)

lim s +0 I1(s)

2 1 0!1! = 1

Proposition F.2. ODE (F.70) has the following explicit expressions:

2y(0) l,i t σi J1(t σi), if σi > 0,

2y(0) l,i t σi I1(t σi), if σi 0,

where J1( ) is Bessel function of the first kind, I1( ) is the modified Bessel function of the first kind.

Proof. Firstly, consider the case of σ > 0. For simplicity, we omit the indices l and Zi(u) = u Yi(u/ σi) which satisfies u2 Zi + u Zi + (u2 1)Zi = 0.

Published in Transactions on Machine Learning Research (06/2025)

When we set v = u/ σi, we have

u Yi(v) = Yi(v) + v Yi(v),

+ u σi Yi u σi

= 2 σi Yi u σi

So, we proceed the calculation

u2 Zi + u Zi + (u2 1)Zi

= u2n 2 σi Yi u σi

o + u n Yi u σi

+ u σi Yi u σi

o + (u2 1)u Yi u σi

+ u Yi u σi

+ u3Yi u σi

+ σi Yi u σi

The solution function Zi(u) can be expressed with J1(u)

(2m)!!(2m + 2)!!u2m+1,

where we get around J1(u) = (1 + o(1))u

near zero for the variable u. Remember that Y(0) l,i = y(0) l,i , we have

Yl,i(t) = 2y(0) l,i t σi J1(t σi), for σi > 0. (F.77)

Secondly, we address the case of σi < 0. Zi(u) = u Yl,i(u/ σi), and we obtain the following modified Bessel differential equation:

u2 Zi + u Zi (u2 + 1)Zi = 0. (F.78)

This also comes from the following calculation:

du = 1 σi Y u σi

+ u σi Yi u σi

+ u σi Yi u σi

= 2 σi Yi u σi

Then, we continue

u2 Zi + u Zi (u2 + 1)Zi (F.82)

= u2n 2 σi Yi u σi

o + u n Yi u σi

+ u σi Yi u σi

(u2 + 1)u Yi u σi

+ 3u2 σi Yi u σi

+ σi Yi u σi

o = 0. ( ( σi)2 = σi). (F.86)

Published in Transactions on Machine Learning Research (06/2025)

Similar to the case of σi > 0, we have the explicit expression of Zi(u) with I1(u)

1 m! (m + 1)!

Yl,i(t) = 2y(0) l,i t σi I1(t σi), for σi < 0. (F.87)

This formula can be extended under σ = 0 with the formula (F.71) i.e. limu 0 I1(u)/u = y(0) l,i .

As in Lemma 6.3, σi are R-nonpositive eigenvalues, and there are (n R)-positive eigenvalues. So, Yl(t) was calculated as V Yl(t), and multiplying V(= (U, U )), we finally obtain

Yl(t) = V y(0) l,1 , 2y(0) l,2 t σ2 I1(t σ2), . . . , 2y(0) l,R t σR I1(t σR),

2y(0) l,R+1 t σR+1 J1(t σR+1), . . . , 2y(0) l,n t σn J1(t σn) (F.88)

= (u1, . . . , u R)diag 1, 2 t σ2 I1(t σ2), . . . , 2 t σR I1(t σR) (y(0) l,1 , . . . , y(0) l,R) (F.89)

+ (u R+1, . . . , un)diag 2 t σR+1 J1(t σR+1), . . . , 2 t σn J1(t σn) (y(0) l,R+1, . . . , y(0) l,n) (F.90)

= UΓ1(t)y(0) l,1:R + U Γ2(t)y(0) l,R+1:n, (F.91)

where U = (u1, . . . , u R) O(n, R) and U = (u R+1, . . . , un) O(n, n R) is orthogonal complement of U. Also

Γ1(t) = diag 1, 2 t σ2 I1(t σ2), . . . , 2 t σR I1(t σR) , (F.92)

Γ2(t) = diag 2 t σR+1 J1(t σR+1), . . . , 2 t σn J1(t σn) . (F.93)

It proves the proposition.

G Implicit regularization

G.1 Proof of Proposition 6.5 concerning MM

The proof for the MM can be directly adapted from proof of Theorem 10 in Cai & Ma (2022), requiring only the substitution of e t(αλi 1 n 1 ) for e t 1 m (αλi 1 n 1 ).

G.2 Proof of Proposition 6.5 concerning NAG

Before proceeding with the proof, let us first establish the following lemma:

Lemma G.1. 1. For orthonormal basis U O(n, R), we can define U O(n, n R) and obtain

U = U = 1. (G.94)

2. We have the following asymptotic approximation with J1(s) for s 1,

2 πs3 cos s 3π

Published in Transactions on Machine Learning Research (06/2025)

3. Similarly, for modified Bessel function I1(s), we have the asymptotic approximation for s 1,

2πses. (G.96)

Proof. 1. For (G.94), we obtain the results with basic linear algebra.

2. The asymptotic expansion of J1(s) is found in 7.21 of Watson (1922) or 10.7.8 in Olver & Maximon (2010) such as

2 πs cos s 3π

Dividing by s > 0, we have the formula.

3. The statement follows from 7.23 of Watson (1922).

The basic strategy on the proof for NAG still aligns to Theorem 10 in Cai & Ma (2022). Let U0 O(n, R) be the matrix whose columns span the null space of L(P ), and the first column of U0 is n 1/21. Also, let U O(n, R) be the collection of eigenvectors of L(P) corresponding the smallest R eigenvalues. With the Davis-Kahan theorem discussed in von Luxburg (2007); Yu et al. (2015), we have

U 0 U = U 0 U L(E) λR+1(L(P )), (G.98)

where E = P P and U0 is orthogonal complement of U0.

||U0U 0 Yl(t) Yl(t)||2

||Yl(0)||2 (G.99)

= U0U 0 UΓ1(t)y(0) l,1:R UΓ1(t)y(0) l,1:R 2 ||Yl(0)||2 + U0U 0 U Γ2(t)y(0) l,1:R U Γ2(t)y(0) l,R+1:n 2 ||Yl(0)||2 (G.100)

= ||(U0U 0 I)UΓ1(t)y(0) l,1:R||2 ||Yl(0)||2 + ||(U0U 0 I)U Γ2(t)y(0) l,1:R||2 ||Yl(0)||2 (G.101)

= ||(U0 U 0 )UΓ1(t)y(0) l,1:R||2 ||Yl(0)||2 + ||(U0 U 0 )U Γ2(t)y(0) l,1:R||2 ||Yl(0)||2 (G.102)

||U0 (U 0 U)Γ1(t)|| + ||(U0 U 0 )UΓ2(t)|| (G.103)

||U 0 U|| ||Γ1(t)|| + Γ2(t) (G.104)

||L(E)|| λR+1(L(P )) 2 t σR I1(t σR) + 2 t σR+1 J1(t σR+1) (G.105)

= ||L(E)|| λR+1(L(P )) 2

αλR(L(P)) + 1 n 1 I1 t

αλR(L(P)) + 1 n 1

αλR+1(L(P)) 1 n 1 J1 t

αλR+1(L(P)) 1 n 1

L(E) λR+1(L(P )), t2 αλR(L(P)) 1 n 1

Published in Transactions on Machine Learning Research (06/2025)

with αλR+1(L(P)) 1

n, t2αλR+1(L(P)) , we have

lim (t,n) ||U0U 0 Yl(t) Yl(t)||2

||Yl(0)||2 = 0. (G.109)

U0U 0 Yl(t) U0U 0 Yl(0) 2 Yl(0) 2 (G.110)

= U0U 0 UΓ1(t)y(0) l,1:R + U0U 0 U Γ2(t)y(0) l,R+1,n U0U 0 UΓ1(0)y(0) l,1:R U0U 0 U Γ1(0)y(0) l,R+1:n 2 Yl(0) 2 (G.111)

= U0U 0 U(Γ1(t) Γ1(0))y(0) l,1:R 2 Yl(0) 2 + U0U 0 U (Γ2(t) Γ2(0))y(0) l,R+1:n 2 Yl(0) 2 (G.112)

Γ1(t) Γ1(0) 2 + U 0 U (Γ2(t) Γ2(0)) 2 (G.113)

2 t σR I1(t σR) 1 + ||L(E)|| λR+1(L(P )) 2 t σR+1 J1(t σR+1) 1 (G.114)

αλR(L(P)) + 1 n 1 I1 t

αλR(L(P)) + 1 n 1

+ ||L(E)|| λR+1(L(P ))

αλR+1(L(P)) 1 n 1 J1 t

αλR+1(L(P)) 1 n 1

Whenever the condition (G.108) holds, we also have

lim (t,n) U0U 0 Yl(t) U0U 0 Yl(0) 2 Yl(0) 2 = 0. (G.117)

Eq. (G.108) is ensured by the conditions of the theorem and the asymptotic behavior in (F.71), (G.95) as follows:

The condition L(E) λR+1(L(P )) comes from λR+1(L(P )) max{(t2α) 1, L(P P) }.

t2 αλR(L(P)) 1 n 1 0 comes from t2 L(P P) = o(1) as n in (T2.N) and t = o(n1/2) in (T1.N).

Also, with the Weyl s inequality |λi(L(P)) λi(L(P ))| L(E) , αλR+1(L(P)) αL(P ) α L(E) 1

n due to α [nλR+1(L(P ))] in (T1.N). Finally, t2αλR+1(L(P)) can be deduced with Weyl s inequality and λR+1(L(P )) (t2α) 1 in (T2.N).

H Lyapunov exponents for each eigensolution

The expressions (16) or (24) can be interpreted as the temporal evolution of a dynamical system. By examining the Lyapunov exponents, one can gain insight into the asymptotic behaviors for t 1.

H.1 Lyapunov exponent for GD and MM

We calculate the Lyapunov exponent for MM6. We pick up the ith-term from the equation (16), and set Lyapunov exponent as λLyap,i for i [n]:

λLyap,i = lim t + lim δYl,i(0) 0 1

t log δYl,i(t) 2

δYl,i(0) 2 , (H.118)

6For GD, consistently interpret m = 0 throughout this subsection.

Published in Transactions on Machine Learning Research (06/2025)

where δYl,i(t) the perturbation at time t caused by the initial perturbation δYl,i(0). Now, we have

Yl,i(t) = exp t 1 m

(u i y(0) l )ul, l [2]. (H.119)

Therefore, by canceling out the same terms in the numerator and the denominator,

λLyap,i = lim t + lim δYl,i(0) 0 1

t log exp t 1 m

Note that the result holds regardless of l [2]. With the Lemma 5.5, we have the following proposition Proposition H.1. (Lyapunov exponent for GD and MM) Under the conditions (T1.M), (T2.M), n 1 and i [n],

λLyap,i = 1 1 m

Especially, λLyap,i 0 (i = 1, . . . , R), and λLyap,i < 0 (i = R + 1, . . . , n).

This statement suggests that the elements of Yl(t) (i = 1, . . . , R) highlight a specific structure within the cluster, enhancing its distinctive characteristics.

H.2 Lyapunov exponent for NAG

Now, for the case of NAG. Using the formula (24), the asymptotic approximation (G.95) and (G.96), we arrive at the following conclusion: Proposition H.2. (Lyapunov exponent for NAG) Under the conditions (T1.N), (T2.N) and n 1, we have

αλi + 1 n 1, (i = 1, . . . , R).

0, (i = R + 1, . . . , n). (H.123)

Especially, λLyap,i 0(i = 1, . . . , R).

It also suggests that the elements of Yl(t) (i = 1, . . . , R) emphasize a cluster structure.

The case for i = R+1, . . . , n differs slightly. In GD or MM, λLyap,i is negative, causing to exponential decay. However, in NAG, oscillations occurs, so the Lyapunov exponent is considered to be 0. This aligns with the typical differences between GD, MM and NAG.

I Supplemental numerical experiments

I.1 Differences between solution paths of ODE corresponding to GD, MM and NAG

We derived ODEs corresponding to the iterative optimization procedures for t-SNE with and without acceleration. We also derived their closed-form solution. We compare the difference between solution paths by simply plotting them in Fig. 4. As illustrated in Fig.4a, for i [R], i.e., components corresponding to the negative eigenvalues of the Laplacian L(αP Hn) (let s denote σ < 0), GD follows an exponential function exp( tσ), and the MM follows exp( σt 1 m), both of which are exponential functions (see Eq. (16)). On the

other hand, NAG follows a function 2I1(t σ)

t σ involving the modified Bessel function of the first kind I1( ) (see Eq. (24)). Based on their series expansions, these functions are greater than 1 for positive values of t and converge to 1 as t approaches zero (see Lemma F.1 for NAG case).

By contrast, in Fig. 4b, for t 1, GD and MM decay exponentially and NAG decays under the formula (G.95). From these conditions, the eigenvectors ui for i R + 1, i.e., components corresponding to positive eigenvalues, asymptotically decay to 0 as time t increases.

Published in Transactions on Machine Learning Research (06/2025)

0 2 4 6 8 10 x

Function value

2I1(x ) x e x

(a) GD, MM and NAG (negative eigenvalues)

0 10 20 30 40 50 x

Function value

(b) GD, MM and NAG (positive eigenvalues)

Figure 4: Comparison of (Modified) Bessel function and exponential functions.

I.2 Experiments with synthetic dataset

We conduct a comparison between iterative optimization methods for t-SNE and ODEs corresponding to them, specifically for GD, MM, and NAG, to explore their convergence behaviors using a synthetic dataset, generated from the following Gaussian Mixture Model (GMM):

Xi|zi = r N(µr, Σ), zi Multinomial(π1, π2, π3), for i [200], (I.124)

where µ1 = (0, 0, 0) , µ2 = (150, 110, 170) , µ3 = ( 130, 150, 150) are the centers of Gaussian distri-

butions, Σ =

30 20 25 20 50 10 25 10 30

is the common covariance matrix and (π1, π2, π3) = (1/3, 1/3, 1/3). With the

above {zi}i [200], we define the element of P as p ij = pij if zi = zj; otherwise, it is set to 0. The Laplacian L(αP Hn) has an eigenvalue with almost 0 and other two negative eigenvalues, i.e. R = 3. According to Cai & Ma (2022), it can be proven that the conditions (T1.M), (T2.M), (T1.N) and (T2.N) are satisfied by placing additional assumptions on these.

Figure 5 shows that the comparison between iterative optimization methods with and without acceleration (GD, MM, and NAG), and ODEs correspond to them for the dataset generated from the above explained GMM. The optimization is done with step size parameter h = 5, momentum parameter m = 0.5, Perplexity is 30, and exaggeration parameter α = 10. In the plot, solid lines indicate iterative approaches and dashed lines denote methods using ODEs. From this plot, we see that the KL-divergence (the objective function of the t-SNE optimization procedure) aligns well between iterative algorithms and continuous limit ODEs. It can also be observed that as t increases, the discrepancy in the KL-divergence values between those obtained from the iterative algorithm and those obtained as the solution path of the ODE becomes larger. This is consistent with the theoretical results obtained in this paper.

As in section 8, we state the assumption in Lemma 5.1 about GD and MM, we identify the variables as t = kh, where the variable t as time parameter, k as iteration number and h as step size. For NAG, the identification can be t = k

h as in Lemma 6.1. Figure 6 shows the representation where the top row corresponds to GD, the middle row to MM, and the bottom to NAG.

We have two kinds of observations.

Case 1: Focus on that the middle case in GD (t = 35) and the rightmost case in NAG (t = 33.54). In the case of NAG, even though the value of t is smaller at 33.54 compared to 35 in the case of GD, the data is well-separated, whereas in GD it is still undifferentiated.

Case 2: The rightmost case in GD (t = 75) and the middle case in MM (t = 35). Formally, the time parameter t is different, and the low-dimensional representation looks similar. Considering the effect of the momentum term m = 0.5, it can be interpreted as accelerating time t, effectively making it t = 70(= 35/(1 0.5)). See the formula (16), where we adopt that MM can be understood as fast-forwarding GD with respect to the variable t thanks to the moment term with m.

Published in Transactions on Machine Learning Research (06/2025)

0 20 40 60 80 100 0.0

KL-divergence

GD_ODE GD_iterative MM_ODE MM_iterative NAG_ODE NAG_iterative

Figure 5: A comparison between iterative optimization methods and those approximated using ODEs (GD, MM, NAG).

I.3 Profiles of datasets

In addition of GMM datasets explained in section I.2, the following datasets are used. Here are the links and their licenses:

KDDCup1999: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, CC BY 4.0 license referenced by https://archive.ics.uci.edu/dataset/130/kdd+cup+1999+data

MNIST: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_ openml.html, BSD license provided by sklearn

Olivetti Faces: https://scikit-learn.org/stable/modules/generated/sklearn.datasets. fetch_olivetti_faces.html, BSD license provided by sklearn

For the KDDCup1999 dataset, download the kddcup.data_10_percent.gz file locally and use it. For the MNIST and Olivetti Faces datasets, we download and use them directly in the code by utilizing the functions provided by sklearn.

I.4 Experiments through variation of random Initialization

It is well-known that the clustering results using t-SNE may vary with variations of initialization such as Kobak & Linderman (2021). Cai & Ma (2022) states a sufficient condition in their Theorem 14 as random initialization, which leads to intercluster repulsion (Th.13) after both early exaggeration and embedding stage. It is reported that false clustering may appear due to an incidental combination of overlapped cluster from the early exaggeration stage as in Remark 16 there. By contrast, we focus only on early exaggeration stage, and numerical experiments on several datasets, performed multiple runs with different initializations for the same dataset, and present the evaluation results.

We use KDDCup1999, MNIST and Olivetti datasets. The Olivetti face data set consists of images of 40 individuals with small variations in viewpoint. The data set consists of 400 images (10 per individual) of size 64 64 = 4096 and is labeled according to identity.

In order to evaluate the variation of initialization, we used Adjusted Rand Index (ARI) (Hubert & Arabie, 1985), which is commonly employed to evaluate clustering performance with known labels (ground truth clusters). This index assesses the degree of correspondence between the predicted clusters and the true class labels, serving as a measure of clustering accuracy in cases where the true labels are available, akin to supervised learning.

Published in Transactions on Machine Learning Research (06/2025)

0.050 0.025 0.000 0.025

GD (k=4,h=5,t=20)

0.02 0.00 0.02

GD (k=7,h=5,t=35)

0.01 0.00 0.01

GD (k=15,h=5,t=75)

0.02 0.00 0.02 0.02

MM (k=4,h=5,t=20)

0.01 0.00 0.01

MM (k=7,h=5,t=35)

0.01 0.00 0.01

MM (k=15,h=5,t=75)

0.10 0.05 0.00 0.05

NAG (k=4,h=5,t=8.94)

0.04 0.02 0.00 0.02

NAG (k=7,h=5,t=15.65)

0.02 0.00 0.02

NAG (k=15,h=5,t=33.54)

Figure 6: Low-dimensional expression with GMM dataset. Top: GD, Middle: MM, Bottom: NAG.

Figure 7 demonstrates the results. The band visible in the ARI transition represents the results of varying ARR after randomly changing the initial values 30 times. The width of the band corresponds to the standard deviation of the ARR.

In all three datasets, ARI increases most rapidly for NAG(green), followed by MM(purple), with GD(pink) trailing behind. NAG achieves the best results in the shortest time. It is important to note the relationship between the variables t and k for GD and MM, t = kh, while for NAG, t = k

h. As a result, even with the same value of k, the effective value of t is smaller for NAG, which makes the NAG curve appear truncated.

Published in Transactions on Machine Learning Research (06/2025)

(a) KDDCup1999

0 200 400 600 800 t

Adjusted Rand Index

0 200 400 600 800 t

Adjusted Rand Index

(c) Olivetti

0 100 200 300 400 500 600 700 t

Adjusted Rand Index

Figure 7: Adjusted Rand Index(ARI) for GD, MM, NAG across different datasets

I.5 Experiments through various initialization methods beyond random initialization

Generally, there are several ways to initializing methods for t-SNE. Kobak & Linderman (2021) reports the result with initialization using PCA, Linderman & Steinerberger (2019) uses spectral clustering in initializing. In this section, we experiment several initialization methods and compare the results under the equation (16) or (24) with the indicator ARI.

We basically follow the way of section I.4. The MNIST dataset is used for the experiments, and the Adjusted Rand Index (ARI) is employed as the evaluation metric. To assess the stability of each initialization method, we conduct 10 trials for each approach. In addition to random initialization as described in section I.4, we attempt representative initializations using PCA, Spectral Embedding (SE) (Belkin & Niyogi, 2003), and Multi Dimensional Scaling (MDS) (Torgerson, 1952). For PCA, SE, and MDS, the dimension is reduced to two by setting n_components to 2. In the case of SE, the number of neighbors (n_neighbors) used to construct the Laplacian matrix is set to 300.

Figure 8 demonstrate the results. For the optimization methods GD(orange) and MM(purple), we observe that changing the initialization from random to other methods results in starting with a relatively higher ARI, and the metric generally improved as the time t increase.

On the other hand, for NAG (green), it is observed that the ARI starts near zero in all initialization methods. This is because, in the case of GD and MM, all components of the initial values are included in the eigendecomposition of Eq. (16) like u i y(0) l , whereas for NAG, the decomposition in Eq. (24) only includes a subset of the initial values in each term like UΓ1(t)y(0) l,1:R or U Γ2(t)y(0) l,R+1:n. 7 In the case of NAG, as time progresses, the influence of the eigenvectors becomes reflected, and the ARI gradually improves.

Furthermore, across the different initialization methods, it is confirmed that the SE leads to the greatest improvement in terms of ARI in our experiment.

As a minor point, for MDS, using random placement during initialization results in a similar variation in ARI outcomes as observed with random initialization. In contrast, PCA and SE, being deterministic methods, do not exhibit such variation.

I.6 Experiments through ARR for other data sets

We also explore the stopping time to obtain the clustering results with ARR(Average Residual Rate) defined in (28) at appropriate time. In addition, we observe whether another indicator Trustworthiness exhibits a similar trend to ARR during the optimization process.

In section 8.2, we evaluate ARR for MNIST data set. Here, we evaluate it with KDDCup1999, Olivetti Face datasets as well. Figure 9 shows the results for KDDCup1999, and Fig. 10 for Olivetti.

7Remember that both Γ1(t) and Γ2(t) are diagonal matrices as in Theorem 6.4.

Published in Transactions on Machine Learning Research (06/2025)

0 200 400 600 800 t

Adjusted Rand Index (ARI)

optimization method

0 200 400 600 800 t

optimization method

0 200 400 600 800 t

Adjusted Rand Index (ARI)

optimization method

0 200 400 600 800 t

optimization method

Figure 8: A comparison among various initialization methods with MNIST dataset for GD/MM/NAG

The left figure shows the transition of ARR and Trustworthiness. We begin by reviewing ARR. NAG(green) exhibits the fastest drop in ARR, followed by MM(orange) and then GD(blue). A red dashed line marks the threshold at ARR = 0.01. As mentioned in section I.4, it is important to note that due to the different relationships between the variables t and k among GD/MM/NAG8, the NAG curve appears truncated.

Next, we compare ARR with Trustworthiness (Venna & Kaski, 2001). Trustworthiness is a standard metric used in methods such as t-SNE to evaluate how well the structure of the original high-dimensional data is preserved in the low-dimensional embedding. Specifically, it measures whether the neighbors in the embedded space are also neighbors in the original high-dimensional space. The value ranges from 0 to 1, with values closer to 1 indicating higher agreement.

The same figure also shows the case where the number of neighbors is set to 10 for Trustworthiness. The dashed blue/orange/green lines represent the Trustworthiness values for each optimization method: GD/MM/NAG. As ARR decreases, Trustworthiness tends to increase, indicating an approximate inverse correlation. The correlation coefficient is around 0.85 for KDDCup1999 and 0.90 for Olivetti. Furthermore, because the Trustworthiness value is sensitive to the choice of neighborhood size, the strong inverse correlation observed in our results does not necessarily indicate that trustworthiness can be used as a general alternative to ARR for determining the stopping time.

Although Continuity is also a well-known metric (Kaski et al., 2003) similar to Trustworthiness, numerical experiments in this case showed that its values were nearly equivalent though not identical to those of Trustworthiness. Therefore, we only present Trustworthiness in the figure.

The right panels of figures 9 and 10 show the clustering results of GD, MM, and NAG when ARR = 0.01.

I.7 Experiments with various momentum coefficients

We explore the effect of the momentum coefficient m in the MM. In the Eq. (16), the transition of ARR for MNIST dataset when changing the coefficient m from 0.1 to 0.9 can be seen in Fig. 11. Other configurations of MNIST are the same as in section 8.2. The coefficient m was originally assumed to range from 0 to 1, and it can be confirmed that as m increases, the influence of the residual term on ARR diminishes over a shorter period of time. Formally, GD can be regarded as having m of 0, which is also consistent with the results shown with both datasets -KDDCup1999 and Olivettiin Figs. 9, 10.

8t = kh for GD and MM, whereas t = k

Published in Transactions on Machine Learning Research (06/2025)

0 100 200 300 400 500 600 t

ARR / Trustworthiness

ARR / Trustworthiness for KDDCup1999 dataset

Method & Metric

0.04 0.03 0.02 0.01 0.00 0.01 0.02

0.04 0.03 0.02 0.01 0.00 0.01 0.02

0.06 0.04 0.02 0.00 0.02 0.04

NAG (t=104.36)

Figure 9: (left): transition of ARR and Trustworthiness (TW), (right): Clustering result for KDDCup1999 dataset with ARR=0.01 using three ODEs: GD, MM, and NAG.

0 200 400 600 800 1000 1200 t

ARR / Trustworthiness

ARR / Trustworthiness for Olivetti dataset

Method & Metric

0.050 0.025 0.000 0.025 0.050 0.075 0.100

GD (t=1030)

0.050 0.025 0.000 0.025 0.050 0.075 0.100 0.125

0.3 0.2 0.1 0.0 0.1 0.2

NAG (t=192.9)

Figure 10: (left): transition of ARR and Trustworthiness (TW), (right): Clustering result for Olivetti Face dataset with ARR=0.01 using three ODEs: GD, MM, and NAG.

I.8 Further analysis of KDDCup1999 dataset

We obtain the adjacency matrix P as depicted in Fig. 12. The heatmap illustrates that the elements of P exhibit relatively high values (orange or yellow) between clusters, while other regions are closer to black.

Solving the eigenproblem with the Laplacian matrix L(αP Hn) using α = 10, we obtain eigenvalues close to zero and four negative eigenvalues, as shown in Fig. 13. They are emphasized during clustering regardless of the method used (GD/MM/NAG), resulting in the formation of clusters in the low-dimensional representation.

Furthermore, Fig.14 displays the distribution of the dominant eigenvectors u1, . . . , u5. These eigenvectors are emphasized in the low-dimensional representation as the time parameter t progresses under Eqs.(16) and (24). As shown in this figure, the components belonging to the same cluster simultaneously take on the same values, and the relative magnitudes of these values become crucial when forming clusters in the lowdimensional space. This mechanism leads to the clustering results depicted in Fig. 1 as the low-dimensional representation.

Published in Transactions on Machine Learning Research (06/2025)

0 500 1000 1500 2000 2500 3000 t

Average Residual Ratio(ARR)

ARR with MM for MNIST dataset

momentum coefficient

0.1 0.3 0.5 0.7 0.9

Figure 11: A comparison of MM with various momentum coefficients

0 100 200 300 400

Figure 12: A heatmap of similarity matrix P for the n = 500 samples from KDD Cup 1999 dataset.

Published in Transactions on Machine Learning Research (06/2025)

0.002 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0

Distribution of Eigenvalues

Figure 13: Distribution of eigenvalues {σi}i [n] of the Laplacian matrix L(αP Hn): σ1 0, σi < 0(2 i 5) and σi > 0(i 6).

values of u1 values of u2 values of u3

values of u4

values of u5

Figure 14: The eigenvectors, whose eigenvalues are the smallest five: u1, . . . , u5.