Published as a conference paper at ICLR 2023

FROM t-SNE TO UMAP WITH CONTRASTIVE LEARNING

Sebastian Damrich, IWR at Heidelberg University, sebastian.damrich@uni-tuebingen.de
Jan Niklas Böhm, University of Tübingen, jan-niklas.boehm@uni-tuebingen.de
Fred A. Hamprecht, IWR at Heidelberg University, fred.hamprecht@iwr.uni-heidelberg.de
Dmitry Kobak, University of Tübingen, dmitry.kobak@uni-tuebingen.de

ABSTRACT

Neighbor embedding methods t-SNE and UMAP are the de facto standard for visualizing high-dimensional datasets. Motivated from entirely different viewpoints, their loss functions appear to be unrelated. In practice, they yield strongly differing embeddings and can suggest conflicting interpretations of the same data. The fundamental reasons for this and, more generally, the exact relationship between t-SNE and UMAP have remained unclear. In this work, we uncover their conceptual connection via a new insight into contrastive learning methods. Noise-contrastive estimation can be used to optimize t-SNE, while UMAP relies on negative sampling, another contrastive method. We find the precise relationship between these two contrastive methods and provide a mathematical characterization of the distortion introduced by negative sampling. Visually, this distortion results in UMAP generating more compact embeddings with tighter clusters compared to t-SNE. We exploit this new conceptual connection to propose and implement a generalization of negative sampling, allowing us to interpolate between (and even extrapolate beyond) t-SNE and UMAP and their respective embeddings. Moving along this spectrum of embeddings leads to a trade-off between discrete / local and continuous / global structures, mitigating the risk of over-interpreting ostensible features of any single embedding. We provide a PyTorch implementation.

1 INTRODUCTION

Low-dimensional visualization of high-dimensional data is a ubiquitous step in exploratory data analysis, and the toolbox of visualization methods has been growing rapidly in recent years (McInnes et al., 2018; Amid & Warmuth, 2019; Szubert et al., 2019; Wang et al., 2021). Since all of these methods necessarily distort the true data layout (Chari et al., 2021), it is beneficial to have various tools at one's disposal. But only when equipped with a theoretical understanding of the aims of and relationships between different methods can practitioners make informed decisions about which visualization to use for which purpose and how to interpret the results.

The state of the art for non-parametric, non-linear dimensionality reduction relies on the neighbor embedding framework (Hinton & Roweis, 2002). Its two most popular examples are t-SNE (van der Maaten & Hinton, 2008; van der Maaten, 2014) and UMAP (McInnes et al., 2018). Both can produce insightful, but qualitatively distinct embeddings. However, why their embeddings are different, and what exactly the conceptual relation between their loss functions is, has remained elusive. Here, we answer this question and thus explain the mathematical underpinnings of the relationship between t-SNE and UMAP. Our conceptual insight naturally suggests a spectrum of embedding methods complementary to that of Böhm et al. (2022), along which the focus of the visualization shifts from local to global structure (Fig. 1). On this spectrum, UMAP and t-SNE are simply two instances, and inspecting various embeddings helps to guard against over-interpretation of apparent structure.
As a practical corollary, our analysis identifies and remedies an instability in UMAP.

Figure 1: (a-e) Neg-t-SNE embedding spectrum of the MNIST dataset for various values of the fixed normalization constant $\bar{Z}$: (a) $\bar{Z} < Z_{\text{t-SNE}}$, (b) $\bar{Z} = Z_{\text{t-SNE}}$, (c) $\bar{Z} = Z_{\text{NCVis}}$, (d) $\bar{Z} = |\mathcal{X}|/m$ (default NEG), (e) $\bar{Z} > |\mathcal{X}|/m$; see Sec. 5. As $\bar{Z}$ increases, the scale of the embedding decreases, and clusters become more compact and separated before eventually starting to merge. The Neg-t-SNE spectrum produces embeddings very similar to those of (f) t-SNE, (g) NCVis, and (h) UMAP when $\bar{Z}$ equals the partition function of t-SNE, the learned normalization parameter $Z$ of NCVis, or the value $|\mathcal{X}|/m = \binom{n}{2}/m$ used by UMAP, as predicted in Sec. 4-6. (i) The partition function $\sum_{ij}(1+d_{ij}^2)^{-1}$ tries to match $\bar{Z}$ and grows with it. Here, we initialized all Neg-t-SNE runs using $\bar{Z} = |\mathcal{X}|/m$; without this "early exaggeration", low values of $\bar{Z}$ yield fragmented clusters (Fig. S11).

We provide the new connection between t-SNE and UMAP via a deeper understanding of contrastive learning methods. Noise-contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010; 2012) can be used to optimize t-SNE (Artemenkov & Panov, 2020), while UMAP relies on another contrastive method, negative sampling (NEG) (Mikolov et al., 2013). We investigate the discrepancy between NCE and NEG, show that NEG introduces a distortion, and show that this distortion explains how UMAP and t-SNE embeddings differ. Finally, we discuss the relationship between neighbor embeddings and self-supervised learning (Wu et al., 2018; He et al., 2020; Chen et al., 2020; Le-Khac et al., 2020). In summary, our contributions are

1. a new connection between the contrastive methods NCE and NEG (Sec. 4),
2. the exact relation of t-SNE and UMAP and a remedy for an instability in UMAP (Sec. 6),
3. a spectrum of contrastive neighbor embeddings encompassing UMAP and t-SNE (Sec. 5),
4. a connection between neighbor embeddings and self-supervised learning (Sec. 7),
5. a unified PyTorch framework for contrastive (non-)parametric neighbor embedding methods.

Our code is available at https://github.com/berenslab/contrastive-ne and https://github.com/hci-unihd/cl-tsne-umap.

2 RELATED WORK

One of the most popular methods for data visualization is t-SNE (van der Maaten & Hinton, 2008; van der Maaten, 2014). The recently developed NCVis (Artemenkov & Panov, 2020) employs noise-contrastive estimation (Gutmann & Hyvärinen, 2010; 2012) to approximate t-SNE in a sampling-based way. Therefore, we will often refer to the NCVis algorithm as "NC-t-SNE". UMAP (McInnes et al., 2018) has matched t-SNE's popularity, at least in computational biology (Becht et al., 2019), and uses another sampling-based optimization method, negative sampling (Mikolov et al., 2013), also employed by LargeVis (Tang et al., 2016). Other recent sampling-based visualization methods include TriMap (Amid & Warmuth, 2019) and PaCMAP (Wang et al., 2021). Given their success, t-SNE and UMAP have been scrutinized to find out which aspects are essential to their performance. Initialization was found to strongly influence the global structure in both methods (Kobak & Linderman, 2021). The exact choice of the low-dimensional similarity kernel (Böhm et al., 2022) or the weights of the k-nearest-neighbor graph (Damrich & Hamprecht, 2021) were shown to be largely inconsequential.
Both algorithms have similar relevant hyperparameters, such as the heavy-tailedness of the similarity kernel (Yang et al., 2009; Kobak et al., 2019). While not obvious from the original presentations, the central difference between t-SNE and UMAP can therefore only be in their loss functions, which have been studied by Damrich & Hamprecht (2021); Böhm et al. (2022); Wang et al. (2021), but never conceptually connected. We achieve this by deepening the link between negative sampling (NEG) and noise-contrastive estimation (NCE). NEG was introduced as an ad hoc replacement for NCE in the context of learning word embeddings (Mikolov et al., 2013). The relationship between NEG and NCE has been discussed before (Dyer, 2014; Levy & Goldberg, 2014; Ruder, 2016; Ma & Collins, 2018; Le-Khac et al., 2020), but here we go further and provide the precise meaning of NEG: We show that, unlike NCE, NEG learns a model proportional but not equal to the true data distribution.

Both t-SNE and UMAP have parametric versions (van der Maaten, 2009; Sainburg et al., 2021) with very different logic and implementations. Here we present a unified PyTorch framework for non-parametric and parametric contrastive neighbor embeddings. As special cases, it includes UMAP and t-SNE approximations with NCE (like NCVis) and the InfoNCE loss (Jozefowicz et al., 2016; van den Oord et al., 2018), which has not yet been applied to neighbor embeddings. Our resulting InfoNC-t-SNE elucidates the relationship between SimCLR (Chen et al., 2020) and neighbor embeddings. The parallel work of Hu et al. (2023) makes a qualitatively similar argument, but does not discuss negative samples. Balestriero & LeCun (2022) show how SimCLR recovers Isomap (Tenenbaum et al., 2000), while we connect SimCLR to more modern neighbor embeddings.

3 BACKGROUND

3.1 NOISE-CONTRASTIVE ESTIMATION (NCE)

The goal of parametric density estimation is to fit a parametric model $q_\theta$ to i.i.d. samples $s_1, \ldots, s_N$ from an unknown data distribution $p$ over a space $\mathcal{X}$. For maximum likelihood estimation (MLE), the parameters $\theta$ are chosen to maximize the log-likelihood of the observed samples,

$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log q_\theta(s_i). \quad (1)$$

This approach crucially requires $q_\theta$ to be a normalized model; otherwise it is trivial to increase the likelihood arbitrarily by scaling $q_\theta$. To circumvent the expensive computation of the partition function $Z(\theta) = \sum_{x \in \mathcal{X}} q_\theta(x)$, Gutmann & Hyvärinen (2010; 2012) introduced NCE. It turns the unsupervised problem of density estimation into a supervised problem, in which the data samples need to be identified from a set containing the $N$ data samples and $m$ times as many noise samples $t_1, \ldots, t_{mN}$ drawn from a noise distribution $\xi$, e.g., the uniform distribution over $\mathcal{X}$. Briefly, NCE fits $\theta$ by minimizing the binary cross-entropy between the true class assignment and the posterior probabilities $P(\text{data} \mid x) = q_\theta(x) / \big(q_\theta(x) + m\xi(x)\big)$ and $P(\text{noise} \mid x) = 1 - P(\text{data} \mid x)$ (Supp. A.1):

$$\theta^* = \arg\min_\theta \left[ -\sum_{i=1}^{N} \log \frac{q_\theta(s_i)}{q_\theta(s_i) + m\xi(s_i)} - \sum_{i=1}^{mN} \log\left(1 - \frac{q_\theta(t_i)}{q_\theta(t_i) + m\xi(t_i)}\right) \right]. \quad (2)$$

The key advantage of NCE is that the model does not need to be explicitly normalized by the partition function, but nevertheless learns to equal the data distribution $p$ and hence to be normalized:

Theorem 1 (Gutmann & Hyvärinen (2010; 2012)). Let $\xi$ have full support and suppose there exists some $\theta^*$ such that $q_{\theta^*} = p$. Then $\theta^*$ is a minimum of

$$\mathcal{L}^{\mathrm{NCE}}(\theta) = -\mathbb{E}_{s \sim p} \log \frac{q_\theta(s)}{q_\theta(s) + m\xi(s)} - m\, \mathbb{E}_{t \sim \xi} \log\left(1 - \frac{q_\theta(t)}{q_\theta(t) + m\xi(t)}\right) \quad (3)$$

and the only other extrema of $\mathcal{L}^{\mathrm{NCE}}$ are minima $\tilde\theta$ which also satisfy $q_{\tilde\theta} = p$.
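To make Eq. (3) concrete, here is a minimal PyTorch sketch of the NCE objective for an unnormalized model; this is our illustration, not part of the paper's codebase, and all function and variable names are ours.

```python
import math
import torch

def nce_loss(log_q_data, log_q_noise, log_xi_data, log_xi_noise, m):
    """Expected NCE loss of Eq. (3), estimated from samples.

    log_q_*:  log of the (unnormalized) model q_theta at data / noise samples
    log_xi_*: log of the noise density xi at the same samples
    m:        number of noise samples per data sample
    """
    log_m = math.log(m)
    # log P(data | s) = log q_theta(s) - log(q_theta(s) + m * xi(s))
    log_post_data = log_q_data - torch.logaddexp(log_q_data, log_m + log_xi_data)
    # log P(noise | t) = log(m * xi(t)) - log(q_theta(t) + m * xi(t))
    log_post_noise = (log_m + log_xi_noise) - torch.logaddexp(log_q_noise, log_m + log_xi_noise)
    return -(log_post_data.mean() + m * log_post_noise.mean())
```

Working in log space with logaddexp keeps the classifier probabilities numerically stable even when the unnormalized model takes very large or very small values.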
In NCE, the model typically includes an optimizable normalization parameter $Z$, which we emphasize by writing $q_{\theta,Z} = q_\theta / Z$. But importantly, Thm. 1 applies to any model $q_\theta$ that is able to match the data distribution $p$, even if it does not contain a learnable normalization parameter.

In the setting of learning language models, Jozefowicz et al. (2016) proposed a different version of NCE, called InfoNCE. Instead of classifying samples as data or noise, the aim here is to predict the position of a data sample in an $(m+1)$-tuple containing $m$ noise samples and one data sample (Supp. A.2). For the uniform noise distribution $\xi$, this yields the expected loss function

$$\mathcal{L}^{\mathrm{InfoNCE}}(\theta) = -\mathbb{E}_{x \sim p,\; x_1,\ldots,x_m \sim \xi} \log \frac{q_\theta(x)}{q_\theta(x) + \sum_{i=1}^{m} q_\theta(x_i)}. \quad (4)$$

Ma & Collins (2018) showed that an analogue of Thm. 1 applies to InfoNCE.

3.2 NEIGHBOR EMBEDDINGS

Neighbor embeddings (NE) (Hinton & Roweis, 2002) are a group of dimensionality reduction methods, including UMAP, NCVis, and t-SNE, that aim to find a low-dimensional embedding $e_1, \ldots, e_n \in \mathbb{R}^d$ of high-dimensional input points $x_1, \ldots, x_n \in \mathbb{R}^D$, with $D \gg d$ and usually $d = 2$ for visualization. NE methods define a notion of similarity over pairs of input points which encodes the neighborhood structure and informs the low-dimensional embedding. The exact high-dimensional similarity distribution differs between the NE algorithms, but recent work (Böhm et al., 2022; Damrich & Hamprecht, 2021) showed that t-SNE and UMAP results stay practically the same when using the binary symmetric k-nearest-neighbor graph (skNN) instead of t-SNE's Gaussian or UMAP's Laplacian similarities. An edge $ij$ is in skNN if $x_i$ is among the $k$ nearest neighbors of $x_j$ or vice versa. The high-dimensional similarity function is then given by $p(ij) = \mathbb{1}(ij \in \mathrm{skNN}) / |\mathrm{skNN}|$, where $|\mathrm{skNN}|$ denotes the number of edges in the skNN graph and $\mathbb{1}$ is the indicator function. NCVis uses the same similarities. We will always use $k = 15$.

There are further differences in the choice of low-dimensional similarity between t-SNE and UMAP, but Böhm et al. (2022) showed that they are negligible. Therefore, here we use the Cauchy kernel $\phi(d_{ij}) = 1/(d_{ij}^2 + 1)$ for all NE methods to transform distances $d_{ij} = \|e_i - e_j\|$ in the embedding space into low-dimensional similarities. We abuse notation slightly by also writing $\phi(ij) = \phi(d_{ij})$.

All NE methods in this work can be cast in the framework of parametric density estimation. Here, $p$ is the data distribution to be approximated with a model $q_\theta$, meaning that the space $\mathcal{X}$ on which both $p$ and $q_\theta$ live is the set of all pairs $ij$ with $1 \le i < j \le n$. The embedding positions $e_1, \ldots, e_n$ become the learnable parameters $\theta$ of the model $q_\theta$. For t-SNE, NC-t-SNE (NCVis), and UMAP, $q_\theta$ is proportional to $\phi(\|e_i - e_j\|)$, but the proportionality factor and the loss function are different.

t-SNE uses MLE and therefore requires a normalized model $q_\theta(ij) = \phi(ij)/Z(\theta)$, where $Z(\theta) = \sum_{k \neq l} \phi(kl)$ is the partition function. The loss function

$$\mathcal{L}^{\text{t-SNE}}(\theta) = -\mathbb{E}_{ij \sim p} \log q_\theta(ij) = -\sum_{ij} p(ij) \log \phi(ij) + \log \sum_{k \neq l} \phi(kl) \quad (5)$$

is the expected negative log-likelihood of the embedding positions $\theta$, making t-SNE an instance of MLE. Usually, t-SNE's loss function is introduced as the Kullback-Leibler divergence between $p$ and $q_\theta$, which is equivalent since the entropy of $p$ does not depend on $\theta$.
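For concreteness, the following sketch builds the symmetric kNN graph underlying $p(ij)$ and evaluates the Cauchy similarities $\phi(ij)$ on an embedding. It is our illustration (not the authors' reference implementation) and assumes scikit-learn's NearestNeighbors for the kNN search.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sknn_edges(X, k=15):
    """Symmetric kNN graph: edge ij if x_i is among the k nearest neighbors of x_j or vice versa."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                       # idx[:, 0] is the query point itself
    heads = np.repeat(np.arange(X.shape[0]), k)
    tails = idx[:, 1:].ravel()
    # symmetrize and keep each undirected edge once
    edges = {tuple(sorted((int(i), int(j)))) for i, j in zip(heads, tails)}
    return np.array(sorted(edges))                  # shape (|skNN|, 2)

def cauchy_similarity(E, edges):
    """phi(ij) = 1 / (1 + ||e_i - e_j||^2) for the given edges of embedding E (n, d)."""
    d2 = ((E[edges[:, 0]] - E[edges[:, 1]]) ** 2).sum(axis=1)
    return 1.0 / (1.0 + d2)

# p(ij) is uniform over the skNN edges: p(ij) = 1 / |skNN|
```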
NC-t-SNE uses NCE and optimizes the expected loss function

$$\mathcal{L}^{\text{NC-t-SNE}}(\theta, Z) = -\mathbb{E}_{ij \sim p} \log \frac{q_{\theta,Z}(ij)}{q_{\theta,Z}(ij) + m\xi(ij)} - m\, \mathbb{E}_{ij \sim \xi} \log\left(1 - \frac{q_{\theta,Z}(ij)}{q_{\theta,Z}(ij) + m\xi(ij)}\right), \quad (6)$$

where $q_{\theta,Z}(ij) = \phi(ij)/Z$ with learnable $Z$, and $\xi$ is approximately uniform (see Supp. D). According to Thm. 1, NC-t-SNE has the same optimum as t-SNE and can hence be seen as a sampling-based approximation of t-SNE. Indeed, we found that $Z$ in NC-t-SNE and the partition function $Z(\theta)$ in t-SNE converge approximately to the same value (Fig. S10).

UMAP's expected loss function is derived in Damrich & Hamprecht (2021):

$$\mathcal{L}^{\mathrm{UMAP}}(\theta) = -\mathbb{E}_{ij \sim p} \log q_\theta(ij) - m\, \mathbb{E}_{ij \sim \xi} \log\big(1 - q_\theta(ij)\big), \quad (7)$$

with $q_\theta(ij) = \phi(ij)$ and $\xi$ approximately uniform, see Supp. D. This is the effective loss function actually implemented in the UMAP algorithm, but note that it has only about $m/n$ of the repulsion compared to the loss stated in the original UMAP paper (McInnes et al., 2018), as shown in Böhm et al. (2022) and Damrich & Hamprecht (2021) (Supp. C).

In practice, the expectations in UMAP's and NC-t-SNE's loss functions are evaluated via sampling, as in Eq. (2). This leads to a fast, $O(n)$, stochastic gradient descent optimization scheme. Both loss functions are composed of an attractive term pulling similar points (edges of the skNN graph) closer together and a repulsive term pushing random pairs of points further apart. Similarly, t-SNE's loss yields attraction along the graph edges, while repulsion arises through the normalization term.

4 FROM NOISE-CONTRASTIVE ESTIMATION TO NEGATIVE SAMPLING

In this section we work out the precise relationship between NCE and NEG, going beyond prior work (Dyer, 2014; Goldberg & Levy, 2014; Levy & Goldberg, 2014; Ruder, 2016). NEG differs from NCE by its loss function and by the lack of the learnable normalization parameter $Z$. In our setting, NEG's loss function amounts to¹

$$\mathcal{L}^{\mathrm{NEG}}(\theta) = -\mathbb{E}_{x \sim p} \log \frac{q_\theta(x)}{q_\theta(x) + 1} - m\, \mathbb{E}_{x \sim \xi} \log\left(1 - \frac{q_\theta(x)}{q_\theta(x) + 1}\right). \quad (8)$$

In order to relate it to NCE's loss function, our key insight is to generalize the latter by allowing it to learn a model that is not equal but proportional to the true data distribution.

Corollary 2. Let $\bar{Z}, m \in \mathbb{R}_+$. Let $\xi$ have full support and suppose there exists some $\theta^*$ such that $q_{\theta^*} = \bar{Z} p$. Then $\theta^*$ is a minimum of the generalized NCE loss function

$$\mathcal{L}^{\mathrm{NCE}}_{\bar{Z}}(\theta) = -\mathbb{E}_{x \sim p} \log \frac{q_\theta(x)}{q_\theta(x) + \bar{Z} m \xi(x)} - m\, \mathbb{E}_{x \sim \xi} \log\left(1 - \frac{q_\theta(x)}{q_\theta(x) + \bar{Z} m \xi(x)}\right) \quad (9)$$

and the only other extrema of $\mathcal{L}^{\mathrm{NCE}}_{\bar{Z}}$ are minima $\tilde\theta$ which also satisfy $q_{\tilde\theta} = \bar{Z} p$.

Proof. The result follows from Thm. 1 applied to the model distribution $\tilde{q}_\theta := q_\theta / \bar{Z}$.

Dyer (2014) and Ruder (2016) pointed out that for a uniform noise distribution $\xi(x) = 1/|\mathcal{X}|$ and as many noise samples as the size of $\mathcal{X}$ ($m = |\mathcal{X}|$), the loss functions of NCE and NEG coincide, since then $m\xi(x) = 1$. However, the main point of NCE and NEG is to use far fewer noise samples in order to attain a speed-up over MLE. Our Cor. 2 for the first time explains NEG's behavior in this more realistic setting ($m \ll |\mathcal{X}|$). If the noise distribution is uniform, the generalized NCE loss function with $\bar{Z} = |\mathcal{X}|/m$ equals the NEG loss function, since $(|\mathcal{X}|/m)\, m\, \xi(x) = 1$. By Cor. 2, any minimum $\theta^*$ of the NEG loss function yields $q_{\theta^*} = (|\mathcal{X}|/m)\, p$, assuming that there are parameters that make this equation hold. In other words, NEG aims to find a model $q_\theta$ that is proportional to the data distribution, with the proportionality factor $|\mathcal{X}|/m$, which is typically huge. This is different from NCE, which aims to learn a model equal to the data distribution.
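A minimal sketch (ours) of the generalized NCE objective of Eq. (9) for a uniform noise distribution. Choosing $\bar{Z} = |\mathcal{X}|/m$ makes the constant $c = \bar{Z} m \xi(x)$ equal to one and recovers the NEG loss of Eq. (8).

```python
import torch

def generalized_nce_loss(q_data, q_noise, Z_bar, m, n_pairs):
    """Eq. (9) with a uniform noise distribution xi = 1/|X| over n_pairs pairs.

    q_data:  q_theta at positive (skNN) pairs, shape (B,)
    q_noise: q_theta at noise pairs, shape (B, m)
    Z_bar:   fixed normalization constant; Z_bar = n_pairs / m gives c = 1,
             i.e. the original NEG loss of Eq. (8)
    """
    c = Z_bar * m / n_pairs                        # = Z_bar * m * xi(x)
    attraction = -torch.log(q_data / (q_data + c)).mean()
    repulsion = -m * torch.log(c / (q_noise + c)).mean()   # 1 - q/(q+c) = c/(q+c)
    return attraction + repulsion
```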
Choosing $m \ll |\mathcal{X}|$ does not only offer a computational speed-up but is necessary when optimizing NEG for a neural network with SGD, as we do in Sec. 7. Only a single mini-batch is passed through the neural network during each iteration and is thus available for computing the loss. Hence, all noise samples must come from the current mini-batch, and their number $m$ is upper-bounded by the mini-batch size $b$. Mini-batches are typically much smaller than $|\mathcal{X}|$. Thus, this common training procedure necessitates $m \ll |\mathcal{X}|$, highlighting the relevance of Cor. 2.

While NCE uses a model $q_\theta/Z$ with learnable $Z$, we can interpret NEG as using a model $q_\theta/\bar{Z}$ with fixed and very large normalization constant $\bar{Z} = |\mathcal{X}|/m$. As a result, $q_\theta$ in NEG needs to attain much larger values to match the large $\bar{Z}$. This can be illustrated in the setting of neighbor embeddings. Applying NEG to the neighbor embedding framework yields an algorithm that we call "Neg-t-SNE". Recall that in this setting, $|\mathcal{X}| = \binom{n}{2}$ is the number of pairs of points. Böhm et al. (2022) found empirically that t-SNE's partition function $Z(\theta)$ is typically between $50n$ and $100n$, while in Neg-t-SNE, $\bar{Z} = O(n^2)$ is much larger for modern big datasets. To attain the larger values of $\phi(ij)$ required by NEG, points that are connected in the skNN graph have to move much closer together in the embedding than in t-SNE. Indeed, using our PyTorch implementation of Neg-t-SNE on the MNIST dataset, we confirmed that Neg-t-SNE (Fig. 1d) produced more compact clusters than t-SNE (Fig. 1f). See Supp. K and Alg. S1 for implementation details.

We emphasize that our analysis only holds because NEG has no learnable normalization parameter $Z$. If it did, then such a $Z$ could absorb the term $\bar{Z}$ in Eq. (9), leaving the parameters $\theta$ unchanged.

¹We focus on the loss function, ignoring Mikolov et al.'s (2013) choices specific to word embeddings.

Figure 2: Embeddings of the MNIST dataset with (a) UMAP without annealing, (b) UMAP with annealing, (c) Neg-t-SNE without annealing, and (d) Neg-t-SNE with learning rate annealing, all in our implementation. UMAP does not work well without annealing because it implicitly uses the diverging $1/d_{ij}^2$ kernel in NEG, while Neg-t-SNE uses the more numerically stable Cauchy kernel (Sec. 6). UMAP's reference implementation also requires annealing, see Figs. S1a, d.

5 NEGATIVE SAMPLING SPECTRUM

Varying the fixed normalization constant $\bar{Z}$ in Eq. (9) has important practical effects that lead to a whole spectrum of embeddings in the NE setting. The original NEG loss function, Eq. (8), corresponds to Eq. (9) with $\bar{Z} = |\mathcal{X}|/m$. We still refer to the more general case of using an arbitrary $\bar{Z}$ in Eq. (9) as "negative sampling", and "Neg-t-SNE" in the context of neighbor embeddings.

Figs. 1a-e show a spectrum of Neg-t-SNE visualizations of the MNIST dataset for varying $\bar{Z}$. Per Cor. 2, higher values of $\bar{Z}$ induce higher values for $q_\theta$, meaning that points move closer together. Indeed, the scale of the embedding decreases for higher $\bar{Z}$, as evident from the scale bars in the bottom-right corner of each plot. Moreover, clusters become increasingly compact and then even start to merge. For lower values of $\bar{Z}$, the embedding scale is larger and clusters are more spread out. Eventually, clusters lose almost any separation and start to overlap for very small $\bar{Z}$. Cor. 2 implies that the partition function $\sum_x q_\theta(x)$ should grow with $\bar{Z}$, and indeed this is what we observed (Fig. 1i).
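The sketch below (ours, heavily simplified compared to Supp. K / Alg. S1) shows one plain-SGD step of Neg-t-SNE on a batch of skNN edges, using the Cauchy kernel, $m$ uniform negative samples per edge, and a user-chosen $\bar{Z}$; the learning rate and sampling details are illustrative.

```python
import torch

def neg_tsne_step(E, heads, tails, m, Z_bar, n, lr=1.0):
    """One SGD step on the Neg-t-SNE loss (Eq. 9) for embedding E of shape (n, 2)."""
    E = E.requires_grad_(True)
    c = Z_bar * m / (n * (n - 1) / 2)                 # = Z_bar * m * xi for uniform xi
    # attraction along skNN edges (heads, tails are index tensors of the batch edges)
    d2_pos = ((E[heads] - E[tails]) ** 2).sum(1)
    q_pos = 1.0 / (1.0 + d2_pos)                      # Cauchy kernel
    # m uniform negative samples per edge
    neg = torch.randint(0, n, (heads.numel(), m))
    d2_neg = ((E[heads][:, None, :] - E[neg]) ** 2).sum(-1)
    q_neg = 1.0 / (1.0 + d2_neg)
    loss = -torch.log(q_pos / (q_pos + c)).mean() - m * torch.log(c / (q_neg + c)).mean()
    loss.backward()
    with torch.no_grad():
        E -= lr * E.grad
        E.grad = None
    return E.detach(), loss.item()
```

Setting Z_bar to the default value $\binom{n}{2}/m$ yields UMAP-like compact clusters, whereas values around $100n$ move the result towards t-SNE, in line with the spectrum in Fig. 1.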
The match between the sum of Cauchy kernels $\sum_{ij} \phi(ij)$ and $\bar{Z}$ was not perfect, but that was expected: The Cauchy kernel is bounded by 1 from above, so values $\bar{Z} > \binom{n}{2}$ are not matchable. Similarly, very small values of $\bar{Z}$ are difficult to match because of the heavy tail of the Cauchy kernel. See Supp. I for a toy example where the match is perfect.

By adjusting the $\bar{Z}$ value, one can obtain Neg-t-SNE embeddings very similar to NC-t-SNE and t-SNE. If the NC-t-SNE loss function, Eq. (6), has its minimum at some $\theta^*$ and $Z_{\text{NC-t-SNE}}$, then the Neg-t-SNE loss function, Eq. (9), with $\bar{Z} = Z_{\text{NC-t-SNE}}$ is minimal at the same $\theta^*$. We confirmed this experimentally: Setting $\bar{Z} = Z_{\text{NC-t-SNE}}$ yields a Neg-t-SNE embedding (Fig. 1c) closely resembling that of NC-t-SNE (NCVis) (Fig. 1g). Similarly, setting $\bar{Z}$ to the partition function $Z(\theta_{\text{t-SNE}})$, obtained by running t-SNE, yields a Neg-t-SNE embedding closely resembling that of t-SNE (Figs. 1b, f). We used kNN recall and the correlation between pairwise distances for quantitative confirmation (Supp. H).

The Neg-t-SNE spectrum is strongly related to the attraction-repulsion spectrum of Böhm et al. (2022). They introduced a prefactor ("exaggeration") for the attractive term in t-SNE's loss, which increases the attractive forces, and obtained embeddings similar to our spectrum when varying this parameter. We can explain this as follows. The repulsive term in the NCE loss, Eq. (3), has a prefactor $m$, and our spectrum arises from the loss in Eq. (9) by varying $\bar{Z}$ in the term $\bar{Z}m$. Equivalently, our spectrum can be obtained by varying the $m$ value (number of noise samples per skNN edge) while holding $\bar{Z}m$ fixed (Fig. S8). In other words, our spectrum arises from varying the repulsion strength in the contrastive setting, while Böhm et al. (2022) obtained the analogous spectrum by varying the repulsion strength in the t-SNE setting (see also Supp. B.2).

6 UMAP'S CONCEPTUAL RELATION TO t-SNE

Our comparison of NEG and NCE in Sec. 4 allows us for the first time to conceptually relate UMAP and t-SNE. Originally, McInnes et al. (2018) motivated the UMAP algorithm by a specific choice of weights on the skNN graph and the binary cross-entropy loss. Following LargeVis (Tang et al., 2016), they implemented a sampling-based scheme which they referred to as "negative sampling", although UMAP's loss function, Eq. (7), does not look like the NEG loss, Eq. (8). Thus, it has been unclear whether UMAP actually uses Mikolov et al.'s (2013) NEG. Here, we settle this question:

Lemma 3. UMAP's loss function (Eq. 7) is NEG (Eq. 8) with the parametric model $q_\theta(ij) = 1/d_{ij}^2$.

Proof. Indeed, $\phi(ij) = 1/(1 + d_{ij}^2) = \dfrac{1/d_{ij}^2}{1/d_{ij}^2 + 1} = \dfrac{q_\theta(ij)}{q_\theta(ij) + 1}$ for $q_\theta(ij) = 1/d_{ij}^2$.

Lem. 3 tells us that UMAP uses NEG, but not with a parametric model given by the Cauchy kernel. Instead it uses a model $q_\theta$ which equals the inverse squared distance between embedding points. For large embedding distances $d_{ij}$ both models behave alike, but for nearby points they strongly differ: The inverse-square kernel $1/d_{ij}^2$ diverges as $d_{ij} \to 0$, unlike the Cauchy kernel $1/(1 + d_{ij}^2)$. Despite this qualitative difference, we found empirically that UMAP embeddings look very similar to Neg-t-SNE embeddings at $\bar{Z} = |\mathcal{X}|/m$, see Figs. 1d and h for the MNIST example.
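Lemma 3 can also be checked numerically. The sketch below (ours) evaluates UMAP's effective attraction and repulsion terms from Eq. (7) with the Cauchy kernel and the NEG terms from Eq. (8) with the inverse-square model, and confirms that they coincide.

```python
import numpy as np

d2 = np.array([0.01, 0.5, 2.0, 50.0])      # squared embedding distances

# UMAP's effective loss terms (Eq. 7) with the Cauchy kernel phi = 1 / (1 + d^2)
phi = 1.0 / (1.0 + d2)
umap_attr, umap_rep = -np.log(phi), -np.log(1.0 - phi)

# NEG terms (Eq. 8) with the inverse-square model q = 1 / d^2 from Lemma 3
q = 1.0 / d2
neg_attr = -np.log(q / (q + 1.0))
neg_rep = -np.log(1.0 - q / (q + 1.0))

assert np.allclose(umap_attr, neg_attr) and np.allclose(umap_rep, neg_rep)
```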
To explain this observation, it is instructive to compare the loss terms of Neg-t-SNE and UMAP: The attractive term amounts to $-\log\big(1/(d_{ij}^2 + 1)\big) = \log(1 + d_{ij}^2)$ for UMAP and $-\log\Big(\frac{1/(d_{ij}^2+1)}{1/(d_{ij}^2+1) + 1}\Big) = \log(2 + d_{ij}^2)$ for Neg-t-SNE, while the repulsive term equals $\log\big((1 + d_{ij}^2)/d_{ij}^2\big)$ and $\log\big((2 + d_{ij}^2)/(1 + d_{ij}^2)\big)$, respectively. While the attractive terms are similar, the repulsive term for UMAP diverges at zero, but that of Neg-t-SNE does not (Fig. S7, Supp. B.2). This divergence introduces numerical instability into the optimization process of UMAP. We found that UMAP strongly depends on annealing its learning rate down to zero (Figs. 2a, b). Without it, clusters appear fuzzy, as noise pairs can experience very strong repulsion and get catapulted out of their cluster (Fig. 2a). While Neg-t-SNE also benefits from this annealing scheme (Fig. 2d), it produces a very similar embedding even without any annealing (Fig. 2c). Thus, UMAP's effective choice of the $1/d_{ij}^2$ kernel makes it less numerically stable and more dependent on optimization tricks compared to Neg-t-SNE.² See Supp. E for more details.

Our conclusion is that, at its heart, UMAP is NEG applied to the t-SNE framework. UMAP's sampling-based optimization is much more than a mere optimization trick; it enables us to connect it theoretically to t-SNE. When UMAP's loss function is seen as an instance of NEG, UMAP does not use the Cauchy kernel but rather the inverse-square kernel. However, this does not make a strong difference due to the learning rate decay. As discussed in Sec. 4, the fixed normalization constant $\bar{Z}$ in Neg-t-SNE / UMAP is much larger than the learned $Z$ in NC-t-SNE or t-SNE's partition function. This explains why UMAP pulls points closer together than both NC-t-SNE and t-SNE and is the reason for the typically more compact clusters in UMAP plots (Böhm et al., 2022).

7 CONTRASTIVE NE AND CONTRASTIVE SELF-SUPERVISED LEARNING

Contrastive self-supervised representation learning (van den Oord et al., 2018; Tian et al., 2020; He et al., 2020; Chen et al., 2020; Caron et al., 2020) and contrastive neighbor embeddings (a term we suggest for NC-t-SNE, Neg-t-SNE, UMAP, etc.) are conceptually very similar. The key difference is that the latter use a fixed kNN graph to find pairs of similar objects, while the former rely on data augmentations or context to generate pairs of similar objects on the fly. Other differences include the representation dimension (~128 vs. 2), the use of a neural network for parametric mapping, the flavor of contrastive loss (InfoNCE vs. NCE / NEG), and the number of noise samples $m$. However, these other differences are not crucial, as we establish here with our unified PyTorch framework.

As an example, we demonstrate that t-SNE can also be optimized using the InfoNCE loss, resulting in "InfoNC-t-SNE" (Supp. K and Alg. S1). Its result on MNIST was similar to that of NC-t-SNE (Figs. 3b, c). For the default number of noise samples, $m = 5$, the embeddings of both algorithms were visibly different from t-SNE proper (Fig. 1f).

²Recently, a pull request to UMAP's GitHub repository effectively changed the kernel of parametric UMAP to the Cauchy kernel, in order to overcome numerical instabilities via an ad hoc fix, see Supp. E.

Figure 3: NE embeddings of MNIST with (a) Neg-t-SNE, (b) NC-t-SNE, (c) InfoNC-t-SNE, and their parametric counterparts (d) Param. Neg-t-SNE, (e) Param. NC-t-SNE, (f) Param. InfoNC-t-SNE are qualitatively similar in the non-parametric (top) and parametric (bottom) settings. We used our PyTorch framework with $m = 5$ and batch size $b = 1024$.
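A minimal sketch (ours) of the InfoNCE loss of Eq. (4) applied to Cauchy similarities, i.e., the core of what we call InfoNC-t-SNE; the batching and sampling details of Alg. S1 are omitted, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def infonce_loss(q_pos, q_neg):
    """InfoNCE (Eq. 4) on similarities: identify the true neighbor among m + 1 candidates.

    q_pos: phi(i, j)  for positive skNN pairs, shape (B,)
    q_neg: phi(i, j') for m noise pairs per positive, shape (B, m)
    """
    logits = torch.log(torch.cat([q_pos[:, None], q_neg], dim=1))   # (B, m + 1)
    target = torch.zeros(q_pos.shape[0], dtype=torch.long)          # true pair is at index 0
    return F.cross_entropy(logits, target)
```

Because the softmax of log-similarities equals $q_\theta(x) / \big(q_\theta(x) + \sum_i q_\theta(x_i)\big)$, this cross-entropy is exactly the objective of Eq. (4) for a uniform noise distribution.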
Recent work in self-supervised learning (Chen et al., 2020) reports improved performance for more noise samples, in agreement with the theory of Gutmann & Hyvärinen (2012) and Ma & Collins (2018). We observed that both NC-t-SNE and InfoNC-t-SNE with $m = 500$ approximated t-SNE much better (Figs. S18k and S19k). Like t-SNE, but unlike the $m = 5$ setting, $m = 500$ required early exaggeration to prevent cluster fragmentation (Figs. S18 and S19). We discuss InfoNC-t-SNE's relation to TriMap in Supp. F.

Next, we used our PyTorch framework to obtain parametric versions of all contrastive NE algorithms discussed here (NC-t-SNE, InfoNC-t-SNE, Neg-t-SNE, UMAP). We used a fully connected neural network with three hidden layers as a parametric $\mathbb{R}^D \to \mathbb{R}^d$ mapping and optimized its parameters using Adam (Kingma & Ba, 2015) (Supp. K). We used batch size $b = 1024$ and sampled all $m$ negative samples from within the batch; the dataset was shuffled each epoch before batching. Using all four loss functions, we were able to obtain parametric embeddings of MNIST that were qualitatively similar to their non-parametric versions (Fig. 3). The parametric versions of NCE and InfoNCE produced much larger embeddings than their non-parametric counterparts; however, the final loss values were very similar (Figs. S9a, b). For NCE, the larger scale of the parametric embedding was compensated by a smaller learned normalization parameter $Z$, so that both parametric and non-parametric versions were approximately normalized (Fig. S9c). Our parametric UMAP implementation is very similar to that of Sainburg et al. (2021). But our parametric, approximate t-SNE implementations differ strongly from the parametric t-SNE of van der Maaten (2009), which constructed separate kNN graphs within each batch and optimized the vanilla t-SNE loss function.

Neighbor embedding methods achieve impressive results with only few noise samples, while common practice in the self-supervised learning literature (Bachman et al., 2019; Chen et al., 2020; He et al., 2020) recommends very large $m$. As a proof of concept, we demonstrate that using only $m = 16$ non-curated (Wu et al., 2017; Roth et al., 2020) noise samples can suffice for image-based self-supervised learning. We used a SimCLR setup (Chen et al., 2020) to train representations of the CIFAR-10 dataset (Krizhevsky, 2009) using a ResNet18 (He et al., 2016) backbone, a fixed batch size of $b = 1024$, and a varying number of noise samples $m$. Other implementation details follow Chen et al. (2020)'s CIFAR-10 setup (Supp. K.4). The classification accuracy, measured on the ResNet output, saturates already at $m = 16$ noise samples (Tab. 1). Note that, different from the original SimCLR setup, we decoupled the batch size from the number of noise samples. Recent work found conflicting evidence when varying $m$ for a fixed batch size (Mitrovic et al., 2020; Nozawa & Sato, 2021; Ash et al., 2022). Future research is needed to perform more systematic benchmarks on the importance of $m$. Here, we present these results only as a proof of principle that a similarly low $m$ as in neighbor embeddings (default $m = 5$) can work in a SimCLR setting.

Table 1: Test set classification accuracies on CIFAR-10 representations obtained with the InfoNCE loss and batch size $b = 1024$ saturate already at a low number of noise samples $m$, as is common for neighbor embeddings. We used the ResNet18 output $H \in \mathbb{R}^{512}$ and report mean ± std across three seeds.

                       m = 2        m = 16       m = 128      m = 512      m = 2b - 2
  kNN classifier       86.9 ± 0.2   91.7 ± 0.1   92.0 ± 0.3   92.2 ± 0.2   91.9 ± 0.1
  Linear classifier    90.7 ± 0.2   93.1 ± 0.2   93.3 ± 0.3   93.2 ± 0.3   93.3 ± 0.1
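Returning to the parametric neighbor embeddings described above, the following condensed sketch (ours; layer widths and other details are illustrative, see Supp. K for the actual settings) shows the parametric setting: a three-hidden-layer MLP maps inputs to 2D, and any of the contrastive losses above is applied to batches of skNN edges with in-batch negatives.

```python
import torch
import torch.nn as nn

class EmbeddingMLP(nn.Module):
    """Fully connected encoder R^D -> R^d with three hidden layers (widths are our choice)."""
    def __init__(self, in_dim, out_dim=2, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def batch_loss(model, x_i, x_j, m, loss_fn):
    """Embed one batch of skNN edges (x_i, x_j) and draw m in-batch negatives per edge."""
    e_i, e_j = model(x_i), model(x_j)
    q_pos = 1.0 / (1.0 + ((e_i - e_j) ** 2).sum(1))             # Cauchy similarity
    neg = torch.randint(0, e_j.shape[0], (e_i.shape[0], m))     # indices into the batch
    q_neg = 1.0 / (1.0 + ((e_i[:, None, :] - e_j[neg]) ** 2).sum(-1))
    return loss_fn(q_pos, q_neg)      # e.g. the infonce_loss or generalized_nce_loss sketches above
```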
8 DISCUSSION AND CONCLUSION

In this work, we studied the relationship between two popular unsupervised learning methods, noise-contrastive estimation (NCE) and negative sampling (NEG). We focused on their application to neighbor embeddings (NE) because this is an active and important application area, but also because NEs allow us to directly visualize the NCE / NEG outcome and to form an intuitive understanding of how different algorithmic choices affect the result. Our study makes three conceptual advances.

First, we showed that NEG replaces NCE's learnable normalization parameter $Z$ by a large constant $\bar{Z}$, forcing NEG to learn a scaled data distribution. While not explored here, this implies that NEG can be used to learn probabilistic models and is not only applicable to embedding learning, as previously believed (Dyer, 2014; Ruder, 2016; Le-Khac et al., 2020). In the NE setting, NEG led to the method Neg-t-SNE, which differs from NCVis / NC-t-SNE (Artemenkov & Panov, 2020) by a simple switch from the learnable to a fixed normalization constant. We argued that this constant can be a useful hyperparameter because it moves the embedding along an attraction-repulsion spectrum similar to that of Böhm et al. (2022) and hence can either emphasize more discrete structure with higher repulsion (Fig. S16) or more continuous structure with higher attraction (see Figs. S13-S14 for developmental single-cell datasets). Our quantitative evaluation in Supp. H corroborates this interpretation. We believe that inspection of several embeddings along the spectrum can guard against over-interpreting any single embedding. Exploration of the spectrum does not require specialized knowledge. For UMAP, we always have $\bar{Z}_{\mathrm{UMAP}} = \binom{n}{2}/m$; for t-SNE, Böhm et al. (2022) found that the partition function $Z_{\text{t-SNE}}$ typically lies in $[50n, 100n]$. Our PyTorch package allows the user to move along the spectrum with a slider parameter $s$ such that $s=0$ and $s=1$ correspond to $\bar{Z} = Z_{\text{t-SNE}}$ and $\bar{Z} = \bar{Z}_{\mathrm{UMAP}}$, respectively, without the need to specify $\bar{Z}$ directly.

A caveat is that Thm. 1 and Cor. 2 both assume the model to be rich enough to perfectly fit the data distribution. This is a strong and unrealistic assumption. In the context of NEs, the data distribution is zero for most pairs of points, which is impossible to match using the Cauchy kernel. Nevertheless, the consistency of MLE, and thus of t-SNE, also depends on this assumption. Even without it, the gradients of MLE, NCE, and NEG are very similar (Supp. B). Finally, we validated our Cor. 2 in a toy setting in which its assumptions do hold (Supp. I).

Second, we demonstrated that UMAP, which uses NEG, differs from Neg-t-SNE only in UMAP's implicit use of a less numerically stable similarity function. This isolates the key aspect of UMAP's success: It is not UMAP's appraised (Oskolkov, 2022; Coenen & Pearce, 2022) high-dimensional similarities, the refined Cauchy kernel, or its stated cross-entropy loss function, but rather the application of NEG that lets UMAP perform well and makes clusters in UMAP plots more compact and connections between them more continuous than in t-SNE, in agreement with Böhm et al. (2022). To the best of our knowledge, this is the first time UMAP's and t-SNE's loss functions, motivated very differently, have been conceptually connected.
Third, we argued that contrastive NEs are closely related to contrastive self-supervised learning (SSL) methods such as SimCLR (Chen et al., 2020), which can be seen as parametric InfoNC-t-SNE for learning representations in $S^{128}$ based on the unobservable similarity graph implicitly constructed via data augmentations. We feel that this connection has been underappreciated, with the literature on NEs and on self-supervised contrastive learning staying mostly disconnected. Exceptions are the concurrent works of Hu et al. (2023) and Böhm et al. (2023), which propose to use t-SNE's Cauchy kernel for SSL. Instead, we bridge the gap between NE and SSL with our InfoNC-t-SNE, as well as with parametric versions of all considered NE methods, useful for adding out-of-sample data to an existing visualization. Moreover, we demonstrated that the feasibility of few noise samples in NE translates to the SimCLR setup. We developed a concise PyTorch framework optimizing all of the above, which we hope will facilitate future dialogue between the two research communities.

ACKNOWLEDGMENTS

We thank Philipp Berens for comments and support, as well as James Melville for discussions. This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via SFB 1129 (240245660; S.D., F.A.H.) as well as via the Excellence Clusters 2181/1 "STRUCTURES" (390900948; S.D., F.A.H.) and 2064 "Machine Learning: New Perspectives for Science" (390727645; J.N.B., D.K.), by the German Ministry of Education and Research (Tübingen AI Center, 01IS18039A; J.N.B., D.K.), and by the Cyber Valley Research Fund (D.30.28739; J.N.B.). The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting J.N.B.

ETHICS STATEMENT

Our work analyses connections between popular contrastive learning methods and, in particular, visualization methods. Visualization methods are fundamental in exploratory data analysis, e.g., in computational biology. Given the basic nature of visualization, it can of course potentially be used for malicious goals, but we expect mostly positive societal effects, such as more reliable exploration of biological data. Similarly, contrastive self-supervised learning has the potential to overcome the annotation bottleneck that machine learning faces. This might free countless hours of labelling for more productive use, but could also enable institutions with sufficient resources to learn from larger and larger datasets, potentially concentrating power.

REPRODUCIBILITY

All of our results are reproducible. We included proofs for the theoretical statement Cor. 2 and those in Supp. B and D. The datasets used in the experiments, including preprocessing steps, are described in Supp. J. We discuss the most important implementation details in the main text, e.g., in Sec. 7. An extensive discussion of our implementation is included in Supp. K and in Alg. S1. Our code and instructions for how to reproduce the experiments are publicly available. We separated the implementation of contrastive neighbor embeddings from the scripts and notebooks needed to reproduce our results, in order to provide dedicated repositories for people interested in using our method and for people wanting to reproduce our results. The contrastive neighbor embedding implementation can be found at https://github.com/berenslab/contrastive-ne and the scripts and notebooks for reproducing our results at https://github.com/hci-unihd/cl-tsne-umap.
The relevant commits in both repositories are tagged iclr2023. Published as a conference paper at ICLR 2023 Ehsan Amid and Manfred K Warmuth. Tri Map: Large-scale Dimensionality Reduction Using Triplets. ar Xiv preprint arxiv: 1910.00204, 2019. Aleksandr Artemenkov and Maxim Panov. NCVis: Noise Contrastive Approach for Scalable Visualization. In Proceedings of The Web Conference 2020, pp. 2941 2947, 2020. Jordan Ash, Surbhi Goel, Akshay Krishnamurthy, and Dipendra Misra. Investigating the Role of Negatives in Contrastive Representation Learning. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, pp. 7187 7209, 2022. Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning Representations by Maximizing Mutual Information across Views. In Advances in Neural Information Processing Systems, volume 32, pp. 15535 15545, 2019. Randall Balestriero and Yann Le Cun. Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. In Advances in Neural Information Processing Systems, 2022. Etienne Becht, Leland Mc Innes, John Healy, Charles-Antoine Dutertre, Immanuel WH Kwok, Lai Guan Ng, Florent Ginhoux, and Evan W Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38 44, 2019. Jan Niklas B ohm, Philipp Berens, and Dmitry Kobak. Attraction-Repulsion Spectrum in Neighbor Embeddings. Journal of Machine Learning Research, 23(95):1 32, 2022. Jan Niklas B ohm, Philipp Berens, and Dmitry Kobak. Unsupervised visualization of image datasets using contrastive learning. In International Conference on Learning Representations, 2023. Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In Advances in Neural Information Processing Systems, volume 33, pp. 9912 9924, 2020. David M Chan, Roshan Rao, Forrest Huang, and John F Canny. t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data. In 2018 30th International Symposium on Computer Architecture and High Performance Computing, pp. 330 338. IEEE, 2018. Tara Chari, Joeyta Banerjee, and Lior Pachter. The Specious Art of Single-Cell Genomics. bio Rxiv, 2021. Benjamin Charlier, Jean Feydy, Joan Alexis Glaun es, Franc ois-David Collin, and Ghislain Durif. Kernel Operations on the GPU, with Autodiff, without Memory Overflows. Journal of Machine Learning Research, 22(74):1 6, 2021. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 1597 1607, 2020. Andy Coenen and Adam Pearce. A deeper dive into UMAP theory. https://pair-code. github.io/understanding-umap/supplement.html, 2022. Accessed: 2022-05-11. Sebastian Damrich and Fred A Hamprecht. On UMAP s True Loss Function. In Advances in Neural Information Processing Systems, volume 34, pp. 5798 5809, 2021. Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, volume 27, 2014. Andrew Draganov, Tyrus Berry, Jakob Rødsgaard Jørgensen, Katrine Scheel Nellemann, Ira Assent, and Davide Mottin. 
Gi DR-DUN; Gradient Dimensionality Reduction Differences and Unification. ar Xiv preprint ar Xiv:2206.09689, 2022. Published as a conference paper at ICLR 2023 Chris Dyer. Notes on Noise Contrastive Estimation and Negative Sampling. ar Xiv preprint arxiv: 1410.8251, 2014. Peter Eisenmann. Fast Visualization of High-Dimensional Data via Parallelized UMAP on GPUs. Master s thesis, Karlsruhe Institue of Technology, 2019. Yoav Goldberg and Omer Levy. word2vec Explained: Deriving Mikolov et al. s Negative-Sampling Word-Embedding Method. ar Xiv preprint arxiv: 1402.3722, 2014. Michael U Gutmann and Aapo Hyv arinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297 304, 2010. Michael U Gutmann and Aapo Hyv arinen. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. Journal of Machine Learning Research, 13(2), 2012. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770 778, 2016. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729 9738, 2020. Geoffrey E Hinton and Sam Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, volume 15, pp. 857 864, 2002. Tianyang Hu, Zhili Liu, Fengwei Zhou, Wenjia Wang, and Weiran Huang. Your Contrastive Learning Is Secretly Doing Stochastic Neighbor Embedding. In International Conference on Learning Representations, 2023. Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the Limits of Language Modeling. ar Xiv preprint arxiv: 1602.02410, 2016. Sabina Kanton, Michael James Boyle, Zhisong He, Malgorzata Santel, Anne Weigert, F atima Sanch ıs-Calleja, Patricia Guijarro, Leila Sidow, Jonas Simon Fleck, Dingding Han, et al. Organoid single-cell genomic atlas uncovers human-specific features of brain development. Nature, 574(7778):418 422, 2019. Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, pp. 1 15, 2015. Dmitry Kobak and Philipp Berens. The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1):1 14, 2019. Dmitry Kobak and George C Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, pp. 1 2, 2021. Dmitry Kobak, George Linderman, Stefan Steinerberger, Yuval Kluger, and Philipp Berens. Heavy Tailed Kernels Reveal a Finer Cluster Structure in t-SNE Visualisations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 124 139. Springer, 2019. Alex Krizhevsky. Learning multiple layers of features from tiny images. Master s thesis, University of Toronto, 2009. Phuc H Le-Khac, Graham Healy, and Alan F Smeaton. Contrastive Representation Learning: A Framework and Review. IEEE Access, 8:193907 193934, 2020. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Omer Levy and Yoav Goldberg. 
Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems, volume 27, pp. 2177 2185, 2014. Published as a conference paper at ICLR 2023 George C Linderman, Manas Rachh, Jeremy G Hoskins, Stefan Steinerberger, and Yuval Kluger. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods, 16(3):243 245, 2019. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic Gradient Descent with Warm Restarts. Proceedings of the International Conference on Learning Representations, pp. 1 16, 2017. Zhuang Ma and Michael Collins. Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3698 3707, 2018. Leland Mc Innes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ar Xiv preprint ar Xiv:1802.03426, 2018. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, volume 26, pp. 3111 3119, 2013. Jovana Mitrovic, Brian Mc Williams, and Melanie Rey. Less can be more in contrastive learning. In Proceedings on I Can t Believe It s Not Better! at Neur IPS Workshops, pp. 70 75, 2020. Ashwin Narayan, f Bonnie Berger, and Hyunghoon Cho. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology, pp. 1 10, 2021. Corey J Nolet, Victor Lafargue, Edward Raff, Thejaswi Nanditale, Tim Oates, John Zedlewski, and Joshua Patterson. Bringing UMAP Closer to the Speed of Light with GPU Acceleration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 418 426, 2021. Kento Nozawa and Issei Sato. Understanding Negative Samples in Instance Discriminative Selfsupervised Representation Learning. In Advances in Neural Information Processing Systems, volume 34, pp. 5784 5797, 2021. Nikolay Oskolkov. How Exactly UMAP Works. https://towardsdatascience.com/ how-exactly-umap-works-13e3040e1668, 2022. Accessed: 2022-05-11. Jonathan S Packer, Qin Zhu, Chau Huynh, Priya Sivaramakrishnan, Elicia Preston, Hannah Dueck, Derek Stefanik, Kai Tan, Cole Trapnell, Junhyong Kim, Robert H. Waterston, and John I. Murray. A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. Science, 365(6459):1265 1274, 2019. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py Torch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024 8035, 2019. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Pavlin G. Poliˇcar, Martin Straˇzar, and Blaˇz Zupan. open TSNE: a modular Python library for t-SNE dimensionality reduction and embedding. bio Rxiv, 2019. Karsten Roth, Timo Milbich, and Bjorn Ommer. 
PADS: Policy-Adapted Sampling for Visual Similarity Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6568 6577, 2020. Sebastian Ruder. On word embeddings - Part 2: Approximating the Softmax. http://ruder. io/word-embeddings-softmax, 2016. Accessed: 2022-05-17. Tim Sainburg, Leland Mc Innes, and Timothy Q Gentner. Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Computation, 33(11):2881 2907, 2021. Published as a conference paper at ICLR 2023 Benjamin Szubert, Jennifer E Cole, Claudia Monaco, and Ignat Drozdov. Structure-preserving visualisation of high dimensional single-cell datasets. Scientific Reports, 9(1):1 10, 2019. Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing large-scale and highdimensional data. In Proceedings of the 25th International Conference on World Wide Web, pp. 287 297, 2016. Clanuwat Tarin, B Mikel, K Asanobu, L Alex, Y Kazuaki, and H David. Deep Learning for Classical Japanese Literature. In Proceedings of 2018 Workshop on Machine Learning for Creativity and Design (Thirty-second Conference on Neural Information Processing Systems), volume 3, 2018. Joshua B Tenenbaum, Vin De Silva, and John C Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319 2323, 2000. Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive Multiview Coding. In Proceedings of the European Conference on Computer Vision, pp. 776 794. Springer, 2020. A aron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. ar Xiv e-prints, pp. ar Xiv 1807, 2018. Laurens van der Maaten. Learning a Parametric Embedding by Preserving Local Structure. In Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, pp. 384 391, 2009. Laurens van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15(1):3221 3245, 2014. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11):2579 2605, 2008. Daniel E. Wagner, Caleb Weinreb, Zach M. Collins, James A. Briggs, Sean G. Megason, and Allon M. Klein. Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science, 360(6392):981 987, 2018. Yingfan Wang, Haiyang Huang, Cynthia Rudin, and Yaron Shaposhnik. Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, Tri MAP, and Pa CMAP for Data Visualization. Journal of Machine Learning Research, 22:1 73, 2021. Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling Matters in Deep Embedding Learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840 2848, 2017. Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised Feature Learning via Non Parametric Instance Discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3733 3742, 2018. Zhirong Yang, Irwin King, Zenglin Xu, and Erkki Oja. Heavy-Tailed Symmetric Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, pp. 2169 2177, 2009. Zhirong Yang, Jaakko Peltonen, and Samuel Kaski. Scalable optimization of neighbor embedding for visualization. In International Conference on Machine Learning, pp. 127 135, 2013. Zhirong Yang, Yuwei Chen, Denis Sedov, Samuel Kaski, and Jukka Corander. Stochastic cluster embedding. 
Statistics and Computing, 33(1):12, 2023.

SUPPLEMENTARY TEXT

A PROBABILISTIC FRAMEWORKS OF NCE AND INFONCE

In this section we recap the probabilistic frameworks of NCE (Gutmann & Hyvärinen, 2010; 2012) and InfoNCE (Jozefowicz et al., 2016; van den Oord et al., 2018).

A.1 NCE

In NCE the unsupervised problem of parametric density estimation is turned into a supervised problem, in which the data samples need to be identified in a set $S$ containing $N$ data samples $s_1, \ldots, s_N$ and $m$ times as many noise samples $t_1, \ldots, t_{mN}$ drawn from a noise distribution $\xi$, which can be (but does not have to be) the uniform distribution. In other words, we are interested in the posterior probability $P(y \mid x)$ of an element $x \in S$ coming from the data ($y = \text{data}$) rather than from the noise distribution ($y = \text{noise}$). The probability of sampling $x$ from noise, $P(x \mid \text{noise})$, is just the noise distribution $\xi$, and similarly $P(x \mid \text{data})$ is the data distribution $p$. Since the latter is unknown, it is replaced by the model $q_\theta$. Since $S$ contains $m$ times as many noise samples as data samples, the prior class probabilities are $P(\text{data}) = 1/(m+1)$ and $P(\text{noise}) = m/(m+1)$. Thus, the unconditional probability of an element of $S$ is $P(x) = \big(q_\theta(x) + m\xi(x)\big)/(m+1)$. The posterior probability for classifying some given element $x$ of $S$ as data rather than noise is thus

$$P(\text{data} \mid x) = \frac{P(x \mid \text{data})\, P(\text{data})}{P(x)} = \frac{q_\theta(x)}{q_\theta(x) + m\xi(x)}. \quad (10)$$

NCE optimizes the parameters $\theta$ by maximizing the log-likelihood of the posterior class distributions, or, equivalently, by minimizing the negative log-likelihoods. This is the same as a sum of binary cross-entropy losses (Eq. (2) in the main text):

$$\theta^* = \arg\min_\theta \left[ -\sum_{i=1}^{N} \log \frac{q_\theta(s_i)}{q_\theta(s_i) + m\xi(s_i)} - \sum_{i=1}^{mN} \log\left(1 - \frac{q_\theta(t_i)}{q_\theta(t_i) + m\xi(t_i)}\right) \right].$$

In expectation, we have the loss function (Eq. (3) in the main text)

$$\mathcal{L}^{\mathrm{NCE}}(\theta) = -\mathbb{E}_{s \sim p} \log \frac{q_\theta(s)}{q_\theta(s) + m\xi(s)} - m\, \mathbb{E}_{t \sim \xi} \log\left(1 - \frac{q_\theta(t)}{q_\theta(t) + m\xi(t)}\right).$$

Since $q_\theta(x)/\big(q_\theta(x) + m\xi(x)\big) = 1 \big/ \big[1 + \big(q_\theta(x)/(m\xi(x))\big)^{-1}\big]$, NCE's loss function can also be seen as a binary logistic regression loss function with $\log\frac{q_\theta(x)}{m\xi(x)}$ as the input to the logistic function:

$$\mathcal{L}^{\mathrm{NCE}}(\theta) = -\mathbb{E}_{s \sim p} \log \sigma\!\left(\log \frac{q_\theta(s)}{m\xi(s)}\right) - m\, \mathbb{E}_{t \sim \xi} \log\left(1 - \sigma\!\left(\log \frac{q_\theta(t)}{m\xi(t)}\right)\right),$$

where $\sigma(x) = 1/\big(1 + \exp(-x)\big)$ is the logistic function.

A.2 INFONCE

Again, we cast the unsupervised problem of density estimation as a supervised classification problem. We consider a tuple of $m + 1$ samples $T = (x_0, \ldots, x_m)$, one of which comes from the data and the rest from the noise distribution. Instead of classifying each sample independently as noise or data (as in Sec. A.1), here we are interested in identifying the position of the single data sample. This allows us to see the problem as a multi-class classification problem with $m + 1$ classes. Let $Y$ be the random variable that holds the index of the data sample. A priori, we have $P(Y = k) = 1/(m+1)$ for all $k = 0, \ldots, m$. Moreover, conditioned on sample $k$ coming from the data distribution, all other samples must come from the noise distribution, i.e., we have $P(x_i \mid Y = k) = \xi(x_i)$ for $i \neq k$. As the data distribution is unknown, we model it with $P(x_k \mid Y = k) = q_\theta(x_k)$ as in Sec. A.1. This yields the likelihood of tuple $T$ given the data index $Y = k$:

$$P(T \mid Y = k) = q_\theta(x_k) \prod_{i \neq k} \xi(x_i) = \frac{q_\theta(x_k)}{\xi(x_k)} \prod_{i=0}^{m} \xi(x_i). \quad (14)$$

Marginalizing over $Y$, we obtain

$$P(T) = \frac{1}{m+1} \sum_{k=0}^{m} \frac{q_\theta(x_k)}{\xi(x_k)} \prod_{i=0}^{m} \xi(x_i).$$
(15) Finally, we can compute the posterior via Bayes rule as P(Y = k|T) = P(T|Y = k)P(Y = k) ξ(xk) Pm i=0 qθ(xi) ξ(xi) = qθ(xk) Pm i=0 qθ(xi), (16) where the last equality only holds for the uniform noise distribution. The Info NCE loss is the crossentropy loss with respect to the true position of the data sample, i.e., in expectation and for uniform ξ it reads: LInfo NCE(θ) = E x p x1,...,xm ξ log qθ(x) qθ(x) + Pm i=1 qθ(xi) Similar to how the NCE loss can be seen as binary logistic regression loss function (Sec. A.1), the Info NCE loss can be viewed as multinomial logistic regression loss function with the terms log qθ(xi) ξ(xi) entering the softmax function. B GRADIENTS B.1 GRADIENTS OF MLE, NCE, AND NEG Here, we compute the gradients of MLE, NCE, and NEG, elaborating the discussion in Artemenkov & Panov (2020). Following the discussion in Secs. 3.2 and 4, we consider a normalized model qθ(x) = ϕθ(x)/Z(θ) with partition function Z(θ) = P x ϕθ(x ) for MLE, a model qθ,Z(x) = ϕθ(x)/Z with learnable normalization parameter Z for NCE, and a model qθ, Z(x) = ϕθ(x)/ Z with fixed normalization constant Z for NEG. We show the results both for general Z and noise distribution ξ and for the default value Z = |X|/m and a uniform noise distribution ξ(x) = 1/|X|. Thus, the losses are LMLE(θ) = Ex p log qθ(x) , (18) LNCE(θ, Z) = Ex p log qθ,Z(x) qθ,Z(x) + mξ(x) m Ex ξ log 1 qθ,Z(x) qθ,Z(x) + mξ(x) LNEG(θ, Z) = Ex p log qθ, Z(x) qθ, Z(x) + mξ(x) m Ex ξ log 1 qθ, Z(x) qθ, Z(x) + mξ(x) LNEG(θ) = Ex p log ϕθ(x) ϕθ(x) + 1 m Ex ξ log 1 ϕθ(x) ϕθ(x) + 1 For MLE, we find d LMLE(θ) dθ Ex p log qθ(x) (22) p(x) log ϕθ(x) + log X ϕθ(x) dϕθ(x) qθ(x) 1 1 Z(θ) dϕθ(x) Published as a conference paper at ICLR 2023 For the NCE loss we will use the templates d log f(z) f(z)+c dz = d log 1 1+c/f(z) = d log(1 + c/f(z)) = 1 1 + c/f(z) c f(z)2 df(z) = c f(z) + c 1 f(z) df(z) d log c f(z)+c dz = d log (f(z)/c + 1) = 1 f(z)/c + 1 1 = 1 f(x) + c df(z) with some differentiable function f(z) and c independent of z. For the NCE loss we get d LNCE(θ, Z) Ex p log qθ,Z(x) qθ,Z(x) + mξ(x) m Ex ξ log 1 qθ,Z(x) qθ,Z(x) + mξ(x) Ex p log qθ,Z(x) qθ,Z(x) + mξ(x) m Ex ξ log mξ(x) qθ,Z(x) + mξ(x) p(x) mξ(x) qθ,Z(x) + mξ(x) 1 qθ,Z(x) dqθ,Z(x) +mξ(x) 1 qθ,Z(x) + mξ(x) dqθ,Z(x) qθ,Z(x) 1 mξ(x) qθ,Z(x) + mξ(x) 1 Z dϕθ(x) We used our templates at the third equality sign with f(z) = qθ,Z(x) and c = mξ(x). The derivation of the gradients for the NEG loss is entirely analogous to that for NCE with qθ,Z(x) replaced by qθ, Z(x). Together, the gradients of the three losses are qθ(x) 1 1 Z(θ) dϕθ(x) d LNCE(θ, Z) qθ,Z(x) 1 mξ(x) qθ,Z(x) + mξ(x) 1 Z dϕθ(x) d LNEG(θ, Z) qθ, Z(x) 1 mξ(x) qθ, Z(x) + mξ(x) 1 1 ϕθ(x) + 1 dϕθ(x) The fractions involving mξ(x) in the gradients of LNCE and LNEG tend to one as m highlighting that a higher number of noise samples improves NCE s approximation of MLE (Gutmann Published as a conference paper at ICLR 2023 & Hyv arinen, 2010; 2012; Artemenkov & Panov, 2020). All four gradients are the same up to this factor and the different choice of normalization. In all of them, the models qθ, qθ,Z, or qθ, Z try to match the data distribution p according to the first factor in each gradient. This similarity of the gradients holds independent of the assumptions of Thm. 1 and Cor. 2. We optimize all losses with stochastic gradient descent and can therefore expect our analysis to hold even if the assumptions are violated. B.2 GRADIENTS FOR NEIGHBOR EMBEDDINGS Here, we rewrite the above gradients for the case of neighbor embeddings. 
This means that MLE, NCE, and NEG become t-SNE, NC-t-SNE, and Neg-t-SNE, respectively. We include two variants of Neg-t-SNE: for a general value of Z and for the default value Z = n 2 /m used by UMAP. In addition, we also show the gradient of UMAP. Recall that for neighbor embedding, we have X = {ij|1 i < j n}, θ = {e1, . . . , en}, and ϕ(ij) = 1/(1 + ei ej 2). The noise distributions are close to uniform, see Supp. D, so that ξ(ij) n 2 1. p(ij) ϕ(ij) P ϕ(ij)(ej ei) (42) d LNC-t-SNE Z + mξ(ij) | {z } 1 for m p(ij) ϕ(ij) ϕ(ij)(ej ei) (43) d LNeg-t-SNE( Z) z }| { mξ(ij) p(ij) ϕ(ij) ϕ(ij)(ej ei) (44) d LNeg-t-SNE 1 ϕ(ij) + 1 | {z } [0.5,1] p(ij) ϕ(ij) n 2 /m ϕ(ij)(ej ei) (45) p(ij) ϕ(ij) n 2 /m 1 1 ϕ(ij) | {z } numerically unstable for ei ej ϕ(ij)(ej ei) (46) As above, we see that NC-t-SNE and Neg-t-SNE have terms that go to one for m . For Neg-t-SNE with default Z, we used the approximation ξ(ij) n 2 1. Here, the corresponding term is restricted to the interval [0.5, 1]. More generally, we see that the repulsive force in Neg-t-SNE scales as 1/ Z. Its default value of n(n 1)/(2m) is much larger than the typical values of the partition function P k =l ϕ(kl) in t-SNE or the learned Z of NC-t-SNE, which are typically in the range [50n, 100n] (B ohm et al., 2022), leading to much smaller repulsion in default Neg-t-SNE and in UMAP than in NC-t-SNE or t-SNE. Finally, the repulsive part of the UMAP gradient contains a term that diverges for ei ej, leading to ϕ(ij) 1, reflecting the discussion of UMAP s numerical instability in Sec. 6. C UMAP S LOSS FUNCTION In the original UMAP paper, Mc Innes et al. (2018) define weights µ(ij) [0, 1] on the sk NN graph and state that these shall be reproduced by the low-dimensional similarities ϕ(ij) by means of a sum of binary cross-entropy loss functions, one for each edge ij h µ(ij) log ϕ(ij) + 1 µ(ij) log 1 ϕ(ij) i . (47) Indeed, this loss has its minimum at µ(ij) = ϕ(ij) for all ij. However, it is of course not possible to achieve zero loss for any real-world data using the Cauchy kernel in two dimensions. Experiments show that using this loss function in practice leads to an excess of repulsion and consequently to very Published as a conference paper at ICLR 2023 poor embeddings (B ohm et al., 2022). The actual UMAP implementation has much less repulsion due to the sampling of repulsive edges, see below. As the weights µ(ij) are only supported on the sparse sk NN graph, most of the 1 µ(ij) terms are equal to one. To simplify the loss function, the UMAP paper replaces all 1 µ(ij) terms by 1, leading to the loss function h µ(ij) log ϕ(ij) + log 1 ϕ(ij) i . (48) In the implementation, UMAP samples the repulsive edges, which drastically changes the loss (B ohm et al., 2022) to the effective loss (Damrich & Hamprecht, 2021) h µ(ij) log ϕ(ij) + m(di + dj) 2n log 1 ϕ(ij) i , (49) where di = P j µ(ij) denotes the degree of node i and the number of negative samples m is a hyperparameter. By default, m = 5. Since di log(k), the effective loss only has about m log(k)/n of the repulsion in the originally stated loss function. As a result, the µ(ij) are not reproduced in the embedding space by this loss function. We rewrite this effective loss function further to fit into our framework. The attractive prefactors µ(ij) sum to P ij µ(ij), while the repulsive prefactors add up to m times this factor. Dividing the entire loss function by this term does not change its properties. 
But then, we can write the prefactors as probability distributions p(ij) = µ(ij)/ P ij µ(ij) and ξ(ij) = p(i) + p(j) /2n using p(i) = P j p(ij). With this, we can write the effective loss function as ij p(ij) log ϕ(ij) m X ij ξ(ij) log 1 ϕ(ij) , (50) or in the expectation form as Eij p log ϕ(ij) m Eij ξ log 1 ϕ(ij) , (51) like we do in Eq. (7). D NOISE DISTRIBUTIONS Here we discuss the various noise distributions used by UMAP, NCVis, and our framework. The main claim is that all these noise distributions are sufficiently close to uniform, even though their exact shape depends on the implementation details. Since our actual implementation, as well as the reference implementations of UMAP and NCVis, considers edges ij and ji separately, we will do so from now on. Hence, there is now a total of E := 2|sk NN| edges. We always assume that p(ij) = p(ji) and adding up the probabilities for both directions yields one: Pn i,j=1 p(ij) = 1. For a given data distribution over pairs of points ij, we define p(i) = Pn j=1 p(ij) so that P i p(i) = 1. As discussed in Supp. B of Damrich & Hamprecht (2021), the p(i) values are approximately constant when p(ij) is uniform on the sk NN graph or proportional to the UMAP similarities. D.1 NOISE DISTRIBUTIONS OF UMAP AND NCVIS UMAP s noise distribution is derived in Damrich & Hamprecht (2021) and reads in our notation (Supp. C) ξ(ij) = p(i) + p(j) Note that UMAP uses a weighted version of the sk NN graph. Still, ξ is close to uniform, see Supp. B of Damrich & Hamprecht (2021). Published as a conference paper at ICLR 2023 The noise distribution of NCVis is also close to being uniform and equals (Artemenkov & Panov, 2020): ξ(ij) = p(i) This is a slightly different noise distribution than in UMAP and in particular it is asymmetric. However, we argue that in practice it is equivalent to the same noise distribution as in UMAP. The noise distribution is used in two ways in NCVis: for sampling noise samples and in the posterior class probabilities P(data|ij) = qθ,Z(ij) qθ,Z(ij) + mξ(ij) . (54) Both in the reference NCVis implementation and in ours, for the second role, the noise distribution is explicitly approximated by the uniform one and we use the posterior probabilities P(data|ij) = qθ,Z(ij) qθ,Z(ij) + m 1 2|sk NN| . (55) Together with the symmetry of the Cauchy kernel this implies that the repulsion on the embedding vectors ei and ej from noise sample ij and ji is the same. As a result, the expectation 1 qθ,Z(ij) qθ,Z(ij) + m 1 2|sk NN| is the same for ξ(ij) = p(i)/n and for UMAP s noise distribution ξ(ij) = p(i) + p(j) D.2 EFFECT OF BATCHED TRAINING ON THE NOISE DISTRIBUTION In our framework, the noise distribution is influenced by the batched training procedure (Alg. S1) because the negative samples can come only from the current training batch. In every epoch, we first shuffle the set of directed edges of the sk NN graph and then chunk it into batches. To emphasize, the batches consist of directed edges and not of the original data points. For each edge in a batch, we take its head and sample m indices from the heads and tails of all edges in the batch (excluding the already selected head) and use them as tails to form negative sample pairs. To obtain a negative sample pair ij, the batch must contain some directed edge ik, providing the head of the negative sample, and some pair lj or jl, providing the tail. We want to derive the expected number of times that a directed edge ij is considered as a negative sample in a batch. 
For simplicity, let us assume that the number of batches divides the number of directed edges E. As the set of sk NN edges is shuffled every epoch, the expected number of pairs ij as negative samples is the same for all batches. Let us consider a batch B of size b. We denote by Yrs the random variable that holds the number of times edge rs appears in B. We also introduce random variables Y rs = P t =r Yts and Yr s = P t =s Yrt. Let p(r s) := p( sr) := p(r) p(rs). For each occurrence of an i as head of an edge in B, we sample m tails to create negative samples uniformly from all head and tails in B with replacement, but we prevent sampling the identical head i as negative sample tail. If, however, the same node i is part of the other edges in the batch, then it may be sampled and would create a futile negative sample ii. There are m chances for creating a negative sample edge ij for every head i and any occurrence of j in the batch. The number of heads i in the batch is Yij + Yi j and the number of occurrences of j is Yij + Yji + Y ij + Yj i. Since we sample the tail of a negative sample pair uniformly with replacement, any of the occurrences of j has probability 1/(2b 1) to be selected. Hence, the expected number Nij of times that the ordered pair ij with i = j is considered as a negative sample in batch B is Nij = m(Yij + Yi j)Yij + Yji + Y ij + Yj i 2b 1 . (58) Published as a conference paper at ICLR 2023 Since a head i may not choose itself to form a negative sample, the expected number of times that ii appears as negative sample in the batch is j Yij Yij 1 + Yi j + Yji + Y ji 2b 1 . (59) Since the batches are sampled without replacement, the random variables Yrs are distributed according to a multivariate hypergeometric distribution, so that E(Yrs) = bp(rs), (60) E(Y rs) = bp( rs), (61) Var(Yrs) = b E b E 1p(rs) 1 p(rs) , (62) Cov(Yrs, Y uv) = b E b E 1p(rs)p( uv). (63) We use these expressions and analogous ones together with the symmetries p(rs) = p(sr) to compute (leaving out intermediate algebra steps) the expectation of Nij over the shuffles: E(Nij) = mb 2b 1 E 1p(ij) + 2 b E b p(i)p(j) . (64) Since we sample m negative samples for each positive sample and since each batch contains b positive samples, we need to divide E(Nij) by mb to obtain ξ(ij): ξ(ij) = 1 2b 1 E 1p(ij) + 2 b E b p(i)p(j) . (65) E(Nii) = mb 2b 1 E 1p(i) + 2 b E b and hence the noise distribution value for the pair ii is ξ(ii) = 1 2b 1 E 1p(i) + 2 b E b p(i)2 . (67) We see that the noise distribution depends on the batch size b. This is not surprising: For example, if the batch size is equal to one, the ordered pair ij can only be sampled as a negative sample in the single batch that consists of that pair. Indeed, for b = 1 our formula yields ξ(ij) = p(ij), (68) meaning that the data and the noise distributions coincide. Conversely, if b = E and there is only one batch, we obtain ξ(ij) = 2E 2E 1p(i)p(j) (69) and the noise distribution is close to uniform. For batch sizes between 1 and E the noise distribution is in between these two extremes. For MNIST, E 1.5 106, and in our experiments we used b = 1024. This means that the prefactor of the share of the data distribution is about 0.0005 while that of the near-uniform distribution p(i)p(j) is about 0.9995, so the resulting noise distribution is close to uniform. Note that Thm. 1 and Cor. 2 only require the noise distribution to have full support which is the case for any batch size greater than one. 
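To make the batched negative sampling scheme above concrete, the following minimal NumPy simulation draws negative samples exactly as described (shuffle the directed edges, chunk them into batches, and for each edge sample m tails from the heads and tails in the batch, excluding only that edge's own head slot) and estimates the resulting noise distribution empirically. The toy graph size, batch size, and number of epochs are illustrative choices, not the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy directed kNN-style graph (sizes are illustrative, not our experimental settings).
n, k, m, b, n_epochs = 30, 5, 5, 16, 500
edges = np.array([(i, j) for i in range(n)
                  for j in rng.choice([x for x in range(n) if x != i], size=k, replace=False)])
E = len(edges)                      # number of directed edges
counts = np.zeros((n, n))           # counts[i, j]: how often pair ij was drawn as a negative sample

for _ in range(n_epochs):
    perm = rng.permutation(E)
    for start in range(0, E, b):
        batch = edges[perm[start:start + b]]
        candidates = np.concatenate([batch[:, 0], batch[:, 1]])   # heads and tails of the batch
        for beta, (head, _) in enumerate(batch):
            # Sample m tails from the batch, excluding only this edge's own head slot,
            # so futile pairs ii can still occur when i appears elsewhere in the batch.
            pool = np.delete(candidates, beta)
            for j in rng.choice(pool, size=m, replace=True):
                counts[head, j] += 1

xi = counts / counts.sum()          # empirical noise distribution over directed pairs
is_edge = np.zeros((n, n), dtype=bool)
is_edge[edges[:, 0], edges[:, 1]] = True
off_diag = ~np.eye(n, dtype=bool)
print("mean xi on graph edges:  ", xi[is_edge].mean())
print("mean xi on other pairs:  ", xi[off_diag & ~is_edge].mean())
print("uniform value 1/(n(n-1)):", 1 / (n * (n - 1)))
```

Consistent with the derivation above, for batch sizes well above one the empirical distribution is close to uniform, with graph edges only slightly over-represented, whereas for b = 1 it collapses onto the data distribution.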
Published as a conference paper at ICLR 2023 (a) ζ = 10 3, no annealing (b) ζ = 10 10, no annealing (c) ζ = 0, no annealing (d) ζ = 10 3, with annealing (e) ζ = 10 10, with annealing (f) ζ = 0, with annealing Figure S1: UMAP embeddings of the MNIST dataset, ablating numerical optimization tricks of UMAP s reference implementation. The learning rate annealing is crucial (bottom row) but safeguarding against divisions by zero in UMAP s repulsive term Eq. (71) by adding ζ to the denominator has little effect. These experiments were run using the reference implementation, modified to change the ζ value and to optionally switch off the learning rate annealing. E OPTIMIZATION TRICKS IN UMAP S REFERENCE IMPLEMENTATION UMAP s repulsive term log(1 ϕ(ij)) = log 1 + d2 ij d2 ij can lead to numerical problems if the two points of the negative sample pair are very close. In addition to the learning rate decay, discussed in Sec. 6, UMAP s implementation uses further tricks to prevent unstable or even crashing training. In non-parametric UMAP s reference implementation, the gradient on embedding position ei exerted by a single sampled repulsive pair ij is actually 2 1 d2 ij + ζ 1 1 + d2 ij (ej ei) (71) with ζ = 0.001 instead of ζ = 0. This corresponds to the full loss function Eij p log(ϕ(ij)) m 1 + ζ 1 ζ Eij ξ log 1 + ζ 1 ζ ϕ(ij) . (72) However, we found that ζ does not have much influence on the appearance of a UMAP embedding. Fig. S1 shows MNIST embeddings obtained using the original UMAP implementation modified to use different values of ζ. Neither a much smaller positive value such as ζ = 10 10 nor setting ζ = 0 substantially changed the appearance of the embedding (even though some runs with ζ = 0 did crash). The learning rate annealing played a much bigger role in how the embedding looked like (Fig. S1, bottom row). The reference implementation of parametric UMAP uses automatic differentiation instead of implementing the gradients manually. To avoid terms such as log(0) in the repulsive loss, it clips the argument of the logarithm from below at the value ε = 10 4, effectively using the loss function max n ε, 1 1 + d2 ij m Eij ξ log max n ε, 1 1 1 + d2 ij Published as a conference paper at ICLR 2023 (a) UMAP, ε = 10 4, no ann. (b) UMAP, ε = 10 10, no ann. (c) UMAP, ε = 10 4, with ann. (d) UMAP, ε = 10 10, with ann. (e) Neg-t-SNE, ε = 10 4, no ann. (f) Neg-t-SNE, ε = 10 10, no ann. (g) Neg-t-SNE, ε = 0, no ann. (h) Neg-t-SNE, ε = 10 4, with ann. (i) Neg-t-SNE, ε = 10 10, with ann. (j) Neg-t-SNE, ε = 0, with ann. Figure S2: UMAP and Neg-t-SNE embeddings of the MNIST dataset using different values ε at which we clip arguments to logarithm functions. These experiments were done using our implementation. Varying ε did not strongly influence the appearance of the embedding. But setting ε = 0 led to crashing UMAP runs. Annealing the learning rate is important for UMAP, yet not for Neg-t-SNE. We employ a similar clipping in our code whenever we apply the logarithm function. Again, we found that the exact value of ε is not important for our UMAP reimplementation, while using the learning rate annealing is (Fig. S2, top two rows). In the extreme case of setting ε = 0, our UMAP runs crashed. We believe that the reason is that we allow negative sample pairs to be of the form ii, which would not send any gradient, but would lead to a zero argument to the logarithm. The reference implementation of UMAP excludes such negative sample pairs ii. 
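For concreteness, here is a minimal PyTorch sketch of this clipped binary cross-entropy loss (Eq. (73)); the tensor shapes are placeholders, ε = 10⁻⁴ mirrors the parametric UMAP default, and our framework uses 10⁻¹⁰.

```python
import torch

def cauchy(d2):
    # Low-dimensional Cauchy similarity ϕ(ij) = 1 / (1 + d²_ij).
    return 1.0 / (1.0 + d2)

def clipped_umap_loss(d2_pos, d2_neg, eps=1e-4):
    # d2_pos: (batch,) squared distances of the attractive pairs,
    # d2_neg: (batch, m) squared distances of the sampled negative pairs.
    # Both logarithm arguments are clipped from below at eps, cf. Eq. (73).
    attract = -torch.log(torch.clamp(cauchy(d2_pos), min=eps))
    repel = -torch.log(torch.clamp(1.0 - cauchy(d2_neg), min=eps))
    return (attract + repel.sum(dim=1)).mean()

# Without clipping, the repulsive term blows up for nearly coinciding negative pairs:
d2 = torch.tensor([1e-8])
print(-torch.log(1.0 - cauchy(d2)))                          # ~18.4, infinite at d = 0
print(-torch.log(torch.clamp(1.0 - cauchy(d2), min=1e-4)))   # capped at -log(1e-4) ~ 9.2
```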
Our Neg-t-SNE approach does not have any of these problems, as the repulsive term is upper bounded 1 1+d2 ij 1 1+d2 ij + 1 2 + d2 ij 1 + d2 ij log(2) (74) Published as a conference paper at ICLR 2023 and does not diverge for dij 0. For this reason, Neg-t-SNE is not very sensitive to the value at which we clip arguments to the logarithm and works even with ε = 0, both with and without learning rate annealing (Fig. S2, bottom two rows). The attractive terms in the loss functions do not pose numerical problems in practice due to the heavy tail of the Cauchy kernel. That said, in order to keep different experiments and losses comparable, we used clipping in both the attractive and the repulsive loss terms with ε = 10 10 for all neighbor embedding plots computed with our framework unless otherwise stated. A recent pull request (#856) to the parametric part of UMAP s reference implementation proposed another way to ameliorate the numerical instabilities. The clipping of the arguments to the logarithm was replaced with a sigmoid of the logarithm of the Cauchy kernel, so that the attractive and repulsive terms become log max ε, ϕ(ij) log σ log ϕ(ij) , (75) log max ε, 1 ϕ(ij) log 1 σ log ϕ(ij) , (76) where σ(x) = 1/ 1 + exp( x) is the sigmoid function. This change can seem drastic as, e.g., for ϕ(ij) = 1 we have max ε, ϕ(ij) = 1, but σ log ϕ(ij) = 1/2. But unravelling the definitions shows that this turns the loss function precisely into our Neg-t-SNE loss function since σ log ϕ(ij) = 1 1 + exp log ϕ(ij) = 1 1 + ϕ(ij) 1 = ϕ(ij) ϕ(ij) + 1. (77) So, in order to overcome the numerical problems incurred by UMAP s implicit choice of 1/d2 ij as similarity kernel, the pull request suggested a fix that turns out to be equivalent to negative sampling using the Cauchy kernel. We encourage this change to UMAP as it makes its loss function equivalent to our Neg-t-SNE and thus also conceptually more related to t-SNE. We also suggest to implement it in the non-parametric case. F TRIMAP AND INFONC-t-SNE The neighbor embedding method Tri Map (Amid & Warmuth, 2019) also employs a contrastive technique to compute the layout. Tri Map considers triplets (i, j, k), where i and j are data points that are more similar than i and k. It moves the embeddings ei and ej closer together and pushes ei and ek further apart. While a full investigation of Tri Map is beyond the scope of our work, here we argue that it bears some similarity to Info NC-t-SNE with m = 1. First, we describe Tri Map in more detail. Most of the triplets (i, j, k) consist of nearest neighbors i and j and a random point k. By default, 12 nearest neighbors j are chosen for each i, using a modified Euclidean distance. For each such pair (i, j) additional distinct points k (by default 4) are chosen randomly from the set of all points excluding i and all the chosen j. This leads to 48 triplets (i, j, k) for each point i. Additionally, several (by default 3) triplets are formed for each i with random j and k only ensuring that the similarity between i and j is larger than between i and k. Thus, for each point i there are in total 51 triplets (i, j, k) leading to a grand total of 51n triplets. These triplets are formed ahead of the layout optimization phase and kept fixed throughout. Tri Map also assigns a weight wijk to each triplet. This weight is larger for triplets in which i and j are much more similar to each other than i and k. 
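The triplet construction just described can be sketched as follows. This is a simplified illustration that uses plain Euclidean k-nearest neighbours and omits the weights wijk, so it is not the reference TriMap implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_triplets(X, n_nbrs=12, n_rand_per_pair=4, n_rand_triplets=3, seed=0):
    # Simplified TriMap-style triplet construction: returns (i, j, k) with i, j more similar than i, k.
    rng = np.random.default_rng(seed)
    n = len(X)
    _, idx = NearestNeighbors(n_neighbors=n_nbrs + 1).fit(X).kneighbors(X)
    triplets = []
    for i in range(n):
        neighbours = idx[i, 1:]                     # idx[i, 0] is the point itself
        forbidden = set(int(j) for j in neighbours) | {i}
        for j in neighbours:
            # Random far points k, excluding i and all chosen neighbours of i.
            pool = [p for p in range(n) if p not in forbidden]
            for k in rng.choice(pool, size=n_rand_per_pair, replace=False):
                triplets.append((i, int(j), int(k)))
        # A few fully random triplets, ordered so that j is closer to i than k.
        for _ in range(n_rand_triplets):
            j, k = rng.choice([p for p in range(n) if p != i], size=2, replace=False)
            dj, dk = np.linalg.norm(X[i] - X[j]), np.linalg.norm(X[i] - X[k])
            triplets.append((i, int(j), int(k)) if dj < dk else (i, int(k), int(j)))
    return np.array(triplets)
```

With the default counts this yields the 12 · 4 + 3 = 51 triplets per point mentioned above.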
In our notation, Tri Map s loss function can be written as X ijk triplets wijk ϕ(ik) ϕ(ij) + ϕ(ik) = const X ijk triplets wijk ϕ(ij) ϕ(ij) + ϕ(ik). (78) This loss is very similar to the loss of Info NC-t-SNE with m = 1: E ij p k U log ϕ(ij) ϕ(ij) + ϕ(ik) Published as a conference paper at ICLR 2023 (a) Tri Map default (b) Tri Map Eucl. k NN (c) Tri Map no random (d) Tri Map no weights (e) Info NC-t-SNE m = 1 Figure S3: Ablating design choices of Tri Map on the MNIST dataset. Simplifying Tri Map leads to an embedding similar to that of Info NC-t-SNE with m = 1. The biggest difference results from the omission of random triplets in panel c. Table S1: Quality of Tri Map and Info NC-t-SNE. Means and standard deviations over three runs. k NN recall Spearman correlation Tri Map (default) 0.098 0 0.245 0.005 Tri Map (k NN) 0.099 0 0.232 0.003 Tri Map (no random) 0.101 0 0.355 0.004 Tri Map (no weights) 0.099 0 0.381 0.004 Info NC-t-SNE (m = 1) 0.085 0.001 0.381 0.006 Info NC-t-SNE (m = 5) 0.104 0 0.365 0.005 The differences are: the absence of log; the presence of weights; the presence of random triplets; nearest neighbors being based not exactly on Euclidean metric; and fixed triplets during optimization. In the following we modify the reference Tri Map implementation and ablate these differences in order to see which ones are more or less important (Figure S3). We measure the quality of all embeddings using two metrics (Tab. S1): as a global metric we use the Spearman correlation between highand low-dimensional pairwise distances between 5000 randomly chosen points (Kobak & Berens, 2019) and as a local metric we use k NN recall with k = 15 (Kobak & Berens, 2019). First, we modified Tri Map to use the 15 nearest neighbors in Euclidean metric. This did not change the visual appearance of the MNIST embedding (Figs. S3a, b). Next, we additionally omitted the random triplets. The resulting embedding became visually closer to the Info NC-t-SNE embedding (Fig. S3c). Finally, additionally setting all the weights wijk to one had only a small visual effect (Fig. S3d). Together, these simplifications noticeably improved the global metric compared to the default Tri Map, while leaving the local metric almost unchanged (Tab. S1). Similar to the experiments of Wang et al. (2021), our ablations made Tri Map look very similar to Info NC-t-SNE with m = 1 (Fig. S3e), suggesting that the remaining differences between them do not play a major role. This refers to the details of the optimization (e.g. fixed triplets vs. randomly sampled triplets in each epoch, full-batch vs. stochastic gradient descent, momentum and adaptive learning rates, etc.), as well as to the presence/absence of the logarithm in the loss function. Since d log ϕ(ij) ϕ(ij)+ϕ(ik) dθ = ϕ(ij) + ϕ(ik) d ϕ(ij) ϕ(ij)+ϕ(ik) the gradients of Tri Map and Info NC-t-SNE point in the same direction, but are differently scaled. Info NC-t-SNE prioritizes triplets with high ϕ(ij) + ϕ(ik) /ϕ(ij), i.e. triplets in which the embedding similarity does not respect the triplet well. Note that this is conceptually different from Tri Map s weights wijk which are based on the high-dimensional similarities. However, at least on MNIST, this difference in gradients did not have a major effect (Figs. S3d, e). Note that our default for Info NC-t-SNE is m = 5 instead of m = 1. On MNIST, this setting yields a better local score and a similar global score compared to m = 1, and much better scores than the default Tri Map (Tab. S1). 
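To verify the gradient relationship in Eq. (80) numerically, the following minimal PyTorch sketch compares the per-triplet TriMap loss and the InfoNC-t-SNE loss on a single toy triplet; the distance values are arbitrary illustrative choices.

```python
import torch

def cauchy(d2):
    return 1.0 / (1.0 + d2)

# Toy triplet: squared embedding distances from i to the neighbour j and to the far point k.
d2_ij = torch.tensor(0.5, requires_grad=True)
d2_ik = torch.tensor(4.0, requires_grad=True)

# TriMap-style per-triplet loss with unit weight, cf. Eq. (78).
loss_trimap = cauchy(d2_ik) / (cauchy(d2_ij) + cauchy(d2_ik))
grad_trimap = torch.autograd.grad(loss_trimap, (d2_ij, d2_ik))

# InfoNC-t-SNE loss for the same triplet (m = 1), cf. Eq. (79).
loss_infonce = -torch.log(cauchy(d2_ij) / (cauchy(d2_ij) + cauchy(d2_ik)))
grad_infonce = torch.autograd.grad(loss_infonce, (d2_ij, d2_ik))

# The gradients differ only by the factor (ϕ(ij) + ϕ(ik)) / ϕ(ij), cf. Eq. (80).
scale = ((cauchy(d2_ij) + cauchy(d2_ik)) / cauchy(d2_ij)).detach()
print(torch.allclose(grad_infonce[0], grad_trimap[0] * scale),
      torch.allclose(grad_infonce[1], grad_trimap[1] * scale))  # True True
```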
The work of Ma & Collins (2018) theoretically underpins our observation in Fig. S19 that the higher m gets, the better InfoNC-t-SNE approximates t-SNE.

G RELATION TO OTHER VISUALIZATION METHODS

G.1 t-SNE APPROXIMATIONS A vanilla implementation of t-SNE has to compute the partition function Z(θ) and hence scales quadratically with the sample size n. Therefore, nearly all modern t-SNE packages implement approximations. Common are Barnes-Hut t-SNE (BH-t-SNE) (Yang et al., 2013; van der Maaten, 2014) and FFT-accelerated interpolation-based t-SNE (FIt-SNE) (Linderman et al., 2019). Both methods approximate the repulsive interaction between all pairs of points and the computation of the partition function, achieving complexities O(kn log(n)2^d) and O(kn2^d), respectively. In fact, all our t-SNE plots are done with the FIt-SNE implementation of openTSNE (Poličar et al., 2019). The focus of our work is not primarily on accelerating t-SNE, but on understanding its relation to contrastive neighbor embeddings. Different from the t-SNE approximations above, these do not approximately compute the partition function, but either learn it (NC-t-SNE), use a fixed normalization parameter (Neg-t-SNE and UMAP), or do not require it at all (InfoNC-t-SNE). They only sample nm repulsive interactions per epoch. This cuts their complexity down to O(kmnd). In particular, they scale linearly with the embedding dimension d and can therefore be used for more general representation learning with d > 2, unlike BH-t-SNE and FIt-SNE.

G.2 OTHER CONTRASTIVE t-SNE APPROXIMATIONS In addition to NC-t-SNE and InfoNC-t-SNE, there are two other recent sampling-based strategies to approximate t-SNE, GDR-t-SNE (Draganov et al., 2022) and SCE (Yang et al., 2023). Both sample m repulsive interactions per attractive interaction. In order to scale these repulsive interactions properly, an estimate of the partition function is computed. In GDR-t-SNE, the Cauchy kernels of the mkn repulsive interactions sampled in each epoch are added up to obtain an estimate of Z_t-SNE(θ) · mkn/(n(n−1)). Since both the number of repulsive interactions and the estimate of the partition function are decreased by a factor of mk/(n−1) compared to vanilla t-SNE, the overall repulsion strength is correct. In contrast, SCE (Yang et al., 2023) initializes its estimate of the partition function with the maximum value n(n−1) and then updates it using a moving average after each sampled repulsive pair. Both Draganov et al. (2022) and Yang et al. (2023) use m = 1. Their approaches are similar to NCE in that they approximately estimate Z using the sampled noise points, while NCE treats Z as a learnable parameter and performs gradient descent on it. Note, however, that the second factor of the NCE gradient in Eq. (39) is not present in the approaches of Draganov et al. (2022) and Yang et al. (2023).

G.3 LARGEVIS LargeVis (Tang et al., 2016) was, to the best of our knowledge, the first contrastive neighbor embedding method. It uses NEG. Since it was quickly overshadowed by the very similar UMAP, we focused the exposition in the main paper on UMAP. Damrich & Hamprecht (2021) computed LargeVis' effective loss function in closed form. In our notation, it reads

L_LargeVis(θ) = −E_{ij∼p} log(ϕ(ij)) − γ m E_{ij∼ξ} log(1 − ϕ(ij)). (81)

There are some implementation differences between LargeVis and UMAP, but in terms of the loss function, the main difference is the γ factor in front of the repulsive term, which defaults to 7.
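In code, the effective loss in Eq. (81) differs from the NEG/UMAP-style loss only by this scalar prefactor; a minimal PyTorch sketch, where the tensor shapes and the clipping value are our illustrative choices:

```python
import torch

def cauchy(d2):
    return 1.0 / (1.0 + d2)

def largevis_effective_loss(d2_pos, d2_neg, gamma=7.0, eps=1e-10):
    # d2_pos: (batch,) attractive pairs; d2_neg: (batch, m) sampled repulsive pairs.
    attract = -torch.log(torch.clamp(cauchy(d2_pos), min=eps))
    repel = -gamma * torch.log(torch.clamp(1.0 - cauchy(d2_neg), min=eps))
    # With gamma = 1 this is the effective UMAP loss of Eq. (51), up to the noise distribution.
    return (attract + repel.sum(dim=1)).mean()
```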
Therefore, our analysis of UMAP carries over to Large Vis: Its loss is essentially an instance of NEG with inverse square kernel. The additional factor γ = 7 moves it along the attraction-repulsion spectrum towards more repulsion, that is, towards t-SNE. Pairwise Controlled Manifold Approximation Projection (Pa CMAP) (Wang et al., 2021) is a recent visualization method also based on sampling m repulsive forces per one attractive force (with m = 2 by default). It employs a large number of additional design choices, such as weak attraction between mid-near points in addition to attraction between nearest neighbors (which are based on a modified Euclidean distance, as in Tri Map), several optimization regimes with dynamically changing loss weights, etc. Nevertheless, the core parts of its loss can be related to (generalized) Neg-t-SNE. Published as a conference paper at ICLR 2023 The attractive loss in Pa CMAP is d2 + 1 d2 + 1 + c = 1 c 1 d2 + 1 + c (82) with c = 10 for nearest neighbors (and c = 10 000 for mid-near points but that part of the loss is switched off during the final stage of the optimization). The attractive loss for Neg-t-SNE with uniform noise distribution and Z = c|X|/m is log ϕ(d) ϕ(d) + c = log 1 (d/ c)2 + 1 + c Thus, the differences are the logarithm and some rescaling of the distances. Note that Neg-t-SNE with c = 10 typically produces very similar result compared to the default c = 1 (cf. Fig 1). The repulsive loss in Pa CMAP is 1 2 + d2 = 1 1 + d2 2 + d2 . (84) Again it is similar to the repulsive loss of Neg-t-SNE, but now for Z = |X|/m, the default negative sampling value also used by UMAP: log 1 ϕ(d) ϕ(d) + 1 = log 1 ϕ(d) + 1 = log 1 + d2 Here, the difference is just in the logarithm, similar to the connection between Tri Map and Info NC-t-SNE in Supp. F. Empirically, Pa CMAP embeddings of high-dimensional data like MNIST look similar to UMAP embeddings, and hence to Neg-t-SNE embeddings, in agreement with our analysis. H QUANTITATIVE EVALUATION This section provides a quantitative evaluation of the Neg-t-SNE spectrum. For each embedding on the spectrum, we compute two metrics. The first metric, k NN recall with k = 15 (Kobak & Berens, 2019), measures the fraction of nearest neighbors in the embedding that are also nearest neighbors in the reference configuration. It is therefore a measure of local quality. The second metric is the Spearman correlation between the pairwise distances in the embedding and in the reference configuration (Kobak & Berens, 2019). To speed up the computation, we consider all pairwise distances between a sample of 5000 points. Since most random pairs of points are not close to each other, this is a measure of global quality. As reference configuration, we use three different layouts for each dataset. We compare against the original high-dimensional data to measure how faithful Neg-t-SNE embeddings are. Additionally, we compare to the corresponding t-SNE and UMAP embeddings to investigate for which Z value Neg-t-SNE matches t-SNE and UMAP the best. We found that compared to the high-dimensional data, the k NN recall followed an inverse U-shape across the spectrum (Fig. S4a). k NN recall was low for very large and for very small values of Z. It peaked when Z was close to t-SNE s partition function Zt-SNE and NC-t-SNE s learned normalization parameter ZNC-t-SNE. At UMAP s normalization constant ZUMAP the k NN recall was usually lower. 
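For reference, the two metrics used throughout this section can be computed as in the following sketch with sklearn and scipy; `reference` and `embedding` are placeholder names for arrays with matching row order.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.neighbors import NearestNeighbors

def knn_recall(reference, embedding, k=15):
    # Fraction of each point's k nearest neighbours in the reference configuration
    # that are recovered among its k nearest neighbours in the embedding.
    nn_ref = NearestNeighbors(n_neighbors=k + 1).fit(reference).kneighbors(reference)[1][:, 1:]
    nn_emb = NearestNeighbors(n_neighbors=k + 1).fit(embedding).kneighbors(embedding)[1][:, 1:]
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_ref, nn_emb)])

def distance_spearman(reference, embedding, n_sample=5000, seed=0):
    # Spearman correlation between pairwise distances of a random subset of points.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(reference), size=min(n_sample, len(reference)), replace=False)
    return spearmanr(pdist(reference[idx]), pdist(embedding[idx]))[0]
```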
This confirms our observation that the local quality of the embedding improves when decreasing Z from ZUMAP to Zt-SNE in agreement with B ohm et al. (2022). The k NN recall for the Kuzushiji-49 dataset at Z = Zt-SNE was very low, compared to other datasets. We suspect that this was due to incomplete convergence (Fig. S17, Tab. S4). Conversely, the Spearman correlation mostly increased with Z (Fig. S4b), which aligns with our finding that higher attraction improves the global layout of the embedding. We also computed the k NN recall and the Spearman correlation for the proper t-SNE and UMAP embeddings (not depicted in Figs. S4a, b). The k NN recall was higher for t-SNE embeddings than for Neg-t-SNE at Z = Zt-SNE (see, e.g., Fig. S18l), likely because proper t-SNE considers the repulsive interaction between all points. The metrics for the UMAP embeddings and the Spearman Published as a conference paper at ICLR 2023 correlation for t-SNE were close to the corresponding values for Neg-t-SNE at the respective Z values. When comparing Neg-t-SNE embeddings to the proper t-SNE embedding, the best fit in terms of k NN recall (Fig. S4c) and in terms of Spearman correlation (Fig. S4d) was usually achieved when Z was close to Zt-SNE, in accordance with our theory. Again, the Kuzushiji-49 dataset was an outlier, likely due to convergence issues. Finally, when comparing Neg-t-SNE embeddings to the UMAP embedding, the highest Spearman correlation was achieved by Z ZUMAP (Fig. S4f), in agreement with our theoretical predictions. The highest k NN recall was typically achieved at a slightly lower Z (Fig. S4e). Overall, these experiments provide empirical support for our interpretation of the Neg-t-SNE spectrum as implementing a trade-off between global and local structure preservation, and they confirm the predicted locations of t-SNEand UMAP-like embeddings on the spectrum. We computed the k NN recall and the Spearman correlation also for the embeddings in Figs. S18 and S19, where we varied the number of noise samples m and the initialization method for NC-t-SNE and Info NC-t-SNE. As expected, we found that higher m improves the k NN recall, but t-SNE proper achieved higher k NN recall still. The Spearman correlation decreased with m when we initialized randomly and did not change much for other initialization methods. Random initialization hurt the Spearman correlation significantly. I TOY DATASET Cor. 2 predicts that the partition function of a Neg-t-SNE embedding should equal the Z value. In our real-world examples (panels i in Figs. 1, S11 S16) we observed a monotone relationship between them, but not a perfect agreement. As discussed in Sec. 5, this is due to the shape of the Cauchy kernel and the fact that the data distribution is zero for many pairs of points. Here, we consider a toy example for which we observe a perfect match between Neg-t-SNE s partition function and the Z value, confirming our Cor. 2. As we operate with the binary sk NN graph, the only possible high-dimensional similarities are zero and one. Due to the heavy tail of the Cauchy kernel, we would like our toy example to have no pairs of points for which the data distribution is zero. Thus, all pairs of points need to have equal high-dimensional similarity. In two dimensions, only up to three points can be placed equidistantly from each other. Therefore, we consider the simple case of placing three points according to the Neg-t-SNE loss function with p 1 6 (there are six directed edges) for various values of Z. 
We keep the Cauchy kernel as similarity function, which has maximum 1/(1 + 02) = 1. Thus, it is not possible to match values above Z = 6, and the three points end up having the same position in the embedding. But for smaller values of Z, the resulting partition function is indeed exactly equal to Z (Fig. S5). For this experiment we decreased the batch size to 6 and the learning rate to 0.01, but kept all other hyperparameters at their default values. We used the well-known MNIST (Le Cun et al., 1998) dataset for most of our experiments. We downloaded it via the torchvision API from http://yann.lecun.com/exdb/mnist/. This website does not give a license. But https://keras.io/api/datasets/mnist/ and http://www.pymvpa.org/datadb/mnist.html name Yann Le Cun and Corinna Cortes as copyright holders and claim MNIST to be licensed under CC BY-SA 3.0, which permits use and adaptation. The MNIST dataset consists of 70 000 grayscale images, 28 28 pixels each, that show handwritten digits. The Kuzushiji-49 dataset (Tarin et al., 2018) was downloaded from https://github.com/ rois-codh/kmnist where it is licensed under CC-BY-4.0. It contains 270 912 grayscale images, 28 28 pixels each, that show 49 different cursive Japanese characters. Published as a conference paper at ICLR 2023 (a) Relative to the high-dimensional data (b) Relative to the high-dimensional data (c) Relative to t-SNE embeddings (d) Relative to t-SNE embeddings (e) Relative to UMAP embeddings (f) Relative to UMAP embeddings Figure S4: Quantitative evaluation of embeddings on the Neg-t-SNE spectrum with respect to different reference configurations. Means and standard deviations over three random seeds are plotted. Left column: k NN recall, a measure of local agreement. Right column: Spearman correlation, a measure of global agreement. First row: Comparison of Neg-t-SNE embeddings to the highdimensional input data. Second row: Comparison of Neg-t-SNE embeddings to the corresponding t-SNE embedding. Third row: Comparison of Neg-t-SNE embeddings to the corresponding UMAP embedding. The Sim CLR experiments were performed on the CIFAR-10 (Krizhevsky, 2009) dataset, another standard machine learning resource. We downloaded it via the sklearn.datasets.fetch openml API from https://www.openml.org/search? type=data&sort=runs&id=40927&status=active. Unfortunately, we were not able to find a license for this dataset. CIFAR-10 consists of 60 000 images, 32 32 RGB pixels each, depicting objects from five animal and five vehicle classes. The transcriptomic dataset of Kanton et al. (2019) was downloaded from https:// www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-7552/, which permits free use. We only used the 20 272 cells in the human brain organoid cell line 409b2 . The transcriptomic dataset of Wagner et al. (2018) was downloaded in its scanpy version from https://kleintools.hms.harvard.edu/paper_websites/wagner_ zebrafish_timecourse2018/mainpage.html. The full dataset at https://www. ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE112294 is free to download and reproduce. The dataset contains gene expressions for 63 530 cells from a developing zebrafish embryo. The downloaded UMIs of both datasets were preprocessed as in (B ohm et al., 2022; Kobak Published as a conference paper at ICLR 2023 Figure S5: Partition function of Neg-t-SNE embeddings as a function of Z for a toy dataset with three equidistant points. 
There is a perfect fit between Z and the partition function until Z = 6, which is the largest partition function a set of three points with six directed edges can have under the Cauchy kernel. Beyond this value, all three points overlap in the embedding. Mean with standard deviation over three random seeds is plotted. & Berens, 2019). After selecting the 1000 most variable genes, we normalized the library sizes to the median library size in the dataset, log-transformed the normalized values with log2(x + 1), and finally reduced the dimensionality to 50 via PCA. The transcriptomic dataset of the C. elegans flatworm (Packer et al., 2019; Narayan et al., 2021) was obtained from http://cb.csail.mit.edu/cb/densvis/datasets/ with consent of the authors who license it under CC BY-NC 2.0. It is already preprocessed to 100 principal components. K IMPLEMENTATION K.1 PACKAGES All contrastive embeddings were computed with our Py Torch (Paszke et al., 2019) implementation of Neg-t-SNE, NC-t-SNE, UMAP, and Info NC-t-SNE. Exceptions are Fig. 1 and analogous Figs. S11 S16. There, for panel g we used the reference implementation of NCVis (Artemenkov & Panov, 2020) (with a fixed number of noise samples m, and not the default schedule), and for panel h we used UMAP 0.5. The t-SNE plots were created with the open TSNE (Poliˇcar et al., 2019) (version 0.6.1) package. Similarly, we used the reference UMAP implementation in Fig. S1 and open TSNE in Figs. S18 and S19. The Tri Map plots in Supp. F were computed with version 1.1.4 of the Tri Map package by Amid & Warmuth (2019). We extended these implementations of NCVis, UMAP, and t-SNE to make them accept custom embedding initializations and unweighted sk NN graphs and to log various quantities of interest. We always used the standard Cauchy kernel for better comparability. K.2 INITIALIZATION, EXAGGERATION AND Sk NN GRAPH All PCAs were computed with sklearn (Pedregosa et al., 2011). We used Py Ke Ops (Charlier et al., 2021) to compute the exact sk NN graph and to handle the quadratic complexity of computing the partition functions on a GPU. The same PCA initialization and sk NN graph with k = 15 were used for all embeddings. The sk NN graph for MNIST and Kuzushiji-49 was computed on a 50dimensional PCA of the dataset. We initialized all embeddings with a scaled version of PCA (save for Figs. S18 and S19). For t-SNE embeddings we rescaled the initialization so that the first dimension has a standard deviation of 0.0001 (as is default in open TSNE), for all other embeddings to a standard deviation of 1. For Tri Map, we stuck to its default of scaling the PCA down by a factor of 0.01. Published as a conference paper at ICLR 2023 We employed some version of early exaggeration (van der Maaten & Hinton, 2008) for the first 250 epochs in most non-parametric plots. For t-SNE it is the default early exaggeration of open TSNE. When varying Z in non-parametric Neg-t-SNE, early exaggeration meant using Z = |X|/m for the first 250 epochs (save for Fig. S11). When varying the number of noise samples in Figs. S8, S18 and S19, we still used m = 5 for the first 250 epochs. In Figs. 2, 3, S1, and S2 as well as in all reference NCVis or UMAP plots, we did not use early exaggeration as neither small Z nor high m made it necessary. When we used some form of early exaggeration and learning rate annealing, the annealing to zero took place over the first 250 epochs, was then reset, and annealed again to zero for the remaining, typically 500, epochs. 
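As a concrete illustration of the initialization scaling described above, a minimal sklearn sketch; the function name is ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_init(X, target_std=1.0):
    # 2D PCA initialization, rescaled so that the first embedding dimension has a given
    # standard deviation: 1 for the contrastive methods, 0.0001 for t-SNE (openTSNE default).
    init = PCA(n_components=2).fit_transform(np.asarray(X, dtype=float))
    return init * (target_std / init[:, 0].std())
```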
K.3 OTHER DETAILS FOR NEIGHBOR EMBEDDING EXPERIMENTS When computing logarithms during the optimization of neighbor embeddings, we clip the arguments to the range [10 10, 1], save for Fig. S2, where we varied this lower bound. The lower bound is smaller than in the reference implementation of parametric UMAP, where it is set to 10 4. Our defaults were a batch size of 1024, linear learning rate annealing from 1 (non-parametric) or 0.001 (parametric) to 0 (save for Figs. 2, S1, and S2), 750 epochs (save for Figs. S10 and S17) and m = 5 noise samples (save for Figs. S8, S10, S18, and S19). Non-parametric runs were optimized with SGD without momentum and parametric runs with the Adam optimizer (Kingma & Ba, 2015). Parametric runs used the same feed-forward neural net architecture as the reference parametric UMAP implementation. That is, four layers with dimensions input dimension 100 100 100 2 with Re LU activations in all but the last one. We used the vectorized, 786-dimensional version of MNIST as input to the parametric neighbor embedding methods (and not the 50-dimensional PCA; but the sk NN graph was computed in the PCA space for consistency with non-parametric embeddings). Like the reference NCVis implementation, we used the fractions qθ,Z(x)/(qθ,Z(x) + m) instead of qθ,Z(x)/ qθ,Z(x) + mξ(x) . This is a mild approximation as the noise distribution is close to uniform. But it means that the model learns a scaled data distribution (cf. Cor. 2), so that we need to multiply the learned normalization parameter Z by n(n 1) when comparing to t-SNE or checking normalization of the NC-t-SNE model. Similarly, we also approximate the true noise distribution by the uniform distribution for the fractions qθ(x)/(qθ(x) + Zm/|X|) instead of qθ(x)/ qθ(x) + Zmξ(x) in our Neg-t-SNE implementation. We mentioned in Sec. 5 and showed in Fig. S8 that one can move along the attraction-repulsion spectrum also by changing the number of noise samples m, instead of the fixed normalization constant Z. In UMAP s reference implementation, there is a scalar prefactor γ for the repulsive forces. Theoretically, adjusting γ should also move along the attraction-repulsion spectrum, but setting it higher than 1 led to convergence problems in (B ohm et al., 2022), Fig. A11. When varying our Z, we did not have such issues. For panels i in Figs. 1, S11 S16 the Neg-t-SNE spectra were computed for Z equal to Z(θt-SNE), ZNC-t-SNE, and n(n 1) m x, where x {5 10 5, 1 10 4, 2 10 4, 5 10 4, . . . , 1 102, 2 102, 5 102}. K.4 SIMCLR EXPERIMENTS For the Sim CLR experiments, we trained the model for 1000 epochs, of which we used 5 epochs for warmup. The learning rate during warmup was linearly interpolated from 0 to the initial learning rate. After the warmup epochs, we annealed the learning rate with a cosine schedule (without restarts) to 0 (Loshchilov & Hutter, 2017). We optimized the model parameters with SGD and momentum 0.9. We used the same data augmentations as in Chen et al. (2020) and their recommended batch size of 1024. We used a Res Net18 (He et al., 2016) as the backbone and a projection head consisting of two linear layers (512 1024 128) with a Re LU activation in-between. The loss was applied to the L2 normalized output of the projection head, but like Chen et al. (2020) we used the output of the Res Net as the representation for the linear evaluation. As similarity function we used the exponential of the normalized scalar product (cosine similarity) and always kept the temperature at 0.5, as suggested in Chen et al. 
(2020).

Table S2: Run time overview for the most compute-heavy experiments.
Experiment                                      Number of runs   Time per run [min] (mean ± SD)
Neg-t-SNE for Figs. 1, S11, S13, S14, S15       360              39 ± 4
Neg-t-SNE for Fig. S16                          24               121 ± 44
Neg-t-SNE for Fig. S17                          3                3592 ± 44
NC-t-SNE (our implementation) for Fig. S10b     3                786 ± 2
SimCLR runs for Tab. 1                          12               721 ± 16

Table S3: Learned normalization parameter for NC-t-SNE and partition function of t-SNE in our experiments. Mean and standard deviation are computed over three random seeds. In our setup, t-SNE is deterministic.
Dataset                              Z_NC-t-SNE [10^6]   Z(θ_t-SNE) [10^6]
MNIST (Fig. 1)                       34.3 ± 0.1          8.13
MNIST without EE (Fig. S11)          34.3 ± 0.1          6.25
MNIST imbalanced (Fig. S12)          6.15 ± 0.06         3.12
Human brain organoid (Fig. S13)      3.57 ± 0.03         1.30
Zebrafish (Fig. S14)                 30.8 ± 0.1          7.98
C. elegans (Fig. S15)                36.9 ± 0.7          11.7
Kuzushiji-49 (Fig. S16)              395 ± 3             89.6

The ResNet was trained on the combined CIFAR-10 train and test sets. When evaluating the accuracy, we froze the backbone, trained the classifier on the train set, and evaluated its accuracy on the test set. We used sklearn's KNeighborsClassifier with cosine metric and k = 15 neighbors and sklearn's LogisticRegression classifier with the SAGA solver (Defazio et al., 2014), no regularization penalty, a tolerance of 0.0001, and max_iter=1000. Other parameters were left at their default values. When sampling negative samples for a given head, we excluded that head from the candidates for negative samples. We sampled negative samples with replacement. If the desired number of negative samples m equals twice the batch size minus 2 (m = 2b − 2, "full-batch repulsion"), we took the entire batch without the current head and its tail as negative samples for that head.

K.5 CODE AVAILABILITY Our code is publicly available. The PyTorch implementation of contrastive neighbor embeddings can be found at https://github.com/berenslab/contrastive-ne. Details for reproducing the experiments and figures, alongside scripts and notebooks, are at https://github.com/hci-unihd/cl-tsne-umap.

K.6 STABILITY Whenever we report a metric or show a graph, we ran the experiments for 3 different random seeds and report the mean ± the standard deviation. When the standard deviation was very small, we omitted it from the main text and reported it in Tab. S3. Save for the usually approximate skNN graph computation, t-SNE does not depend on a random seed. As we computed the skNN graph exactly with PyKeOps (Charlier et al., 2021), t-SNE is deterministic in our framework. Panels i in Figs. 1, S11–S16 show the standard deviation as a shaded area. Again, the standard deviations are very small and barely visible. The ratio of standard deviation to mean was never larger than 0.006 in panels i of Figs. 1, S11–S16. Similarly, the standard deviation in Figs. S9 and S10, shown as a shaded area, is mostly smaller than the line width.

Figure S6: Run times of the embedding optimization phase for the MNIST dataset using different batch sizes, training modes, and hardware. Running on GPU is much faster than on CPU; non-parametric runs are faster than parametric runs. As the batch size increases, the run time decreases strongly in all settings.

K.7 COMPUTE We ran our neighbor embedding experiments on a machine with 56 Intel(R) Xeon(R) Gold 6132 CPUs @ 2.60GHz, 502 GB RAM, and 10 NVIDIA TITAN Xp GPUs.
The Sim CLR experiments were conducted on a SLURM cluster node with 8 cores of an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and a Nvidia V100 GPU with a RAM limit of 54 GB. Each experiment used one GPU at most. Our total compute is dominated by the neighbor embedding runs for Figs. 1, S11, S13 S16 and S10b and by the Sim CLR experiments. Tab. S2 lists the number of runs and the average run time. We thus estimate the total compute time to be about 646 hours. Our implementation relies on pure Py Torch and our experiments were conducted on GPUs. We originally kept the batch size for all our experiments at the default of 1024, motivated by Chen et al. (2020) and Sainburg et al. (2021), but eventually noticed that the run time of the visualization experiments depends crucially on the batch size (Fig. S6). Increasing the batch size to 219 decreased the run time of the embedding optimization on MNIST data (without the sk NN graph computation) from about 40 min to just above 20 s. For a batch size of at least 216, our implementation was faster than the reference implementations of UMAP and t-SNE (via open TSNE). The latter run only on CPU, while our experiments ran on GPU. Nevertheless, we observed a substantial improvement in run time for higher batch sizes also when running our Py Torch implementation on CPU and also when training a parametric embedding. Note that the parametric setting with full batch gradient descent (batch size 1,500,006 for MNIST s sk NN graph) exceeded our GPU s memory. On CPU, our implementation is just shy of the performance of UMAP when using 219 attractive pairs per batch. Dedicated CUDA implementations (Chan et al., 2018; Eisenmann, 2019; Nolet et al., 2021) outperform our implementation on GPU. Compared to the Numba-accelerated CPU implementation of UMAP and the CUDA-accelerated GPU implementations, our pure Py Torch implementation arguably strikes a good speed / complexity trade-off, is easier to study and adapt by the machine learning community, and seamlessly integrates non-parametric and parametric settings as well as all four contrastive loss functions. Note that the change from NC-t-SNE to Neg-t-SNE is as simple as fixing the learnable normalization parameter to a constant, so the original NCVis code written in C++ can easily be adapted to compute Neg-t-SNE. We have not used this for any of the experiments in our paper. Published as a conference paper at ICLR 2023 Algorithm S1: Batched contrastive neighbor embedding algorithm input : list of directed sk NN graph edges E = [i1j1, . . . , i|E|j|E|] parameters θ // embeddings (non-param.)/ network weights (param.) number of epochs T learning rate η number of noise samples m Cauchy kernel q // of embeddings (non-param.)/ network output (param.) batch size b loss mode mode normalization constant Z // default n 2 /m, required for mode = Neg-t-SNE output: final embeddings e1, . . . , en 1 if mode = NC-t-SNE then 3 for t = 0 to T do // Learning rate annealing 4 ηt = η (1 t 6 while α < |E| do 8 for β = 1, . . . , b do // Treat attractive edge iα+βjα+β and noise edges iα+βj µ // Sample noise edge tails but omit head of considered edge 9 j 1 , . . . , j m Uniform({iα+1, jα+1, . . . 
, jα+b}\{iα+β}) // Aggregate loss based on mode 10 if mode = Neg-t-SNE then 11 L = L log qθ(iα+βjα+β) qθ(iα+βjα+β)+ Zm/( n 2) Pm µ=1 log 1 qθ(iα+βj µ ) qθ(iα+βj µ )+ Zm/( n 2) 12 else if mode = NC-t-SNE then 13 L = L log qθ(iα+βjα+β)/Z qθ(iα+βjα+β)/Z+m Pm µ=1 log 1 qθ(iα+βj µ )/Z qθ(iα+βj µ )/Z+m 14 else if mode = Info NC-t-SNE then 15 L = L log qθ(iα+βjα+β) qθ(iα+βjα+β)+Pm µ=1 qθ(iα+βj µ ) 16 else if mode = UMAP then 17 L = L log qθ(iα+βjα+β) Pm µ=1 log 1 qθ(iα+βj µ ) // Update parameters with SGD (non-param.) or Adam (param.) 18 θ = θ ηt θL 19 if mode = NC-t-SNE then 20 Z = Z ηt ZL 21 α = α + b 22 Shuffle E 23 return θ Published as a conference paper at ICLR 2023 L ADDITIONAL FIGURES Figure S7: Attractive and repulsive loss terms of UMAP and Neg-t-SNE. The main difference is that UMAP s repulsive loss diverges at zero challenging its numerical optimization. The attractive terms are log(1 + d2 ij) and log(2 + d2 ij) for UMAP and Neg-t-SNE, respectively, and the repulsive ones are log (1 + d2 ij)/d2 ij and log (2 + d2 ij)/(1 + d2 ij) , respectively. (c) m = 500 (d) m = 2b 2 = 2046 Figure S8: Neg-t-SNE embeddings of the MNIST dataset for varying number of noise samples m and using batch size b = 1024. While for NC-t-SNE and Info NC-t-SNE more noise samples improve the approximation to t-SNE, see Figs. S18 and S19, changing m in Neg-t-SNE moves the result along the attraction-repulsion spectrum (Fig. 1) with more repulsion for larger m. However, the computational complexity of Neg-t-SNE scales with m, so that moving along the spectrum via changing Z is much more efficient. For the first 250 epochs, m was set to 5, to achieve an effect similar to early exaggeration (Supp. K). Published as a conference paper at ICLR 2023 (a) Info NC-t-SNE loss (b) NC-t-SNE loss (c) NC-t-SNE normalization Figure S9: (a, b) Loss curves for the parametric and non-parametric Info NC-t-SNE and NC-t-SNE optimizations leading to Figs. 3b, c, e, and f. While the embedding scale differs drastically between the non-parametric and the parametric run, the loss values are close. (c) Normalization of the model P ij ϕ(ij)/Z for the parametric and non-parametric NC-t-SNE optimizations. The difference in the embedding scale is compensated by a three orders of magnitude change in Z, so that both versions learn approximately normalized models. These experiments used our NC-t-SNE reimplementation. Figure S10: NC-t-SNE learns to have the same partition function (PF) as t-SNE on the MNIST dataset. The higher the number m of noise samples (a) or the longer the optimization (b), the better the match. Both methods used early exaggeration, which for NC-t-SNE meant to start with m = 5 noise samples for the first 250 epochs. The learned normalization parameter Z converged to but did not exactly equal NC-t-SNE s partition function P ij qθ(ij). Nevertheless, it was of the same order of magnitude. Again, the match was better for more noise samples. Since we reinitialized the learnable Z for NC-t-SNE after the early exaggeration phase, there were brief jumps in the partition function and in Z at the beginning of the non-exaggerated phase. Published as a conference paper at ICLR 2023 (a) Z < Zt-SNE (b) Z = Zt-SNE (c) Z = ZNCVis (d) Z = |X|/m (e) Z > |X|/m (i) Partition function Figure S11: (a e) Neg-t-SNE embeddings of the MNIST dataset for various values of the fixed normalization constant Z. As Z increases, the scale of the embedding decreases, clusters become more compact and separated before eventually starting to merge. 
The Neg-t-SNE spectrum produces embeddings very similar to those of (f) t-SNE, (g) NCVis, and (h) UMAP, when Z equals the partition function of t-SNE, the learned normalization parameter Z of NCVis, or |X|/m = n 2 /m used by UMAP, as predicted in Sec. 4 6. (i) The partition function P ij(1 + d2 ij) 1 tries to match Z and grows with it. In contrast to Fig. 1, we did not use early exaggeration here, but initialized the Neg-t-SNE and t-SNE with PCA rescaled so that the first dimension has standard deviation 1 and 0.0001, respectively. This makes the embeddings with small Z values show cluster fragmentation, similar to the t-SNE embedding in (f) without early exaggeration. For very low Z, the Neg-t-SNE embedding in (a) shows very little structure. Published as a conference paper at ICLR 2023 (a) Z < Zt-SNE (b) Z = Zt-SNE (c) Z = ZNCVis (d) Z = |X|/m (e) Z > |X|/m (i) Partition function Figure S12: (a e) Neg-t-SNE embeddings of an imbalanced version of the MNIST dataset for various values of the fixed normalization constant Z. As Z increases, the scale of the embedding decreases, clusters become more compact and separated before eventually starting to merge. The Neg-t-SNE spectrum produces embeddings very similar to those of (f) t-SNE, (g) NCVis, and (h) UMAP, when Z equals the partition function of t-SNE, the learned normalization parameter Z of NCVis, or |X|/m = n 2 /m used by UMAP, as predicted in Sec. 4 6. (i) The partition function P ij(1 + d2 ij) 1 tries to match Z and grows with it. Similar to early exaggeration in t-SNE we started all Neg-t-SNE runs using Z = |X|/m and only switched to the desired Z for the last two thirds of the optimization. The dataset was created by randomly removing 10 c% of the class of digit c, so that the class sizes linearly decrease from digit 0 to digit 9. Published as a conference paper at ICLR 2023 (a) Z < Zt-SNE (b) Z = Zt-SNE (c) Z = ZNCVis (d) Z = |X|/m (e) Z > |X|/m (i) Partition function Figure S13: (a e) Neg-t-SNE spectrum on the developmental single-cell RNA sequencing dataset from Kanton et al. (2019) for various parameters Z. As Z increases, the scale of the embedding decreases and the continuous structure (corresponding to the developmental stage) becomes more apparent, making higher Z more suitable for visualizing continuous datasets (B ohm et al., 2022). The spectrum produces embeddings very similar to those of (f) t-SNE and (g) NCVis when Z equals the partition function of t-SNE or the learned normalization parameter of NCVis. The UMAP embedding in (h) closely resembles the Neg-t-SNE embedding at Z = |X|/m = n 2 /m. (i) The partition function P ij(1 + d2 ij) 1 of the Neg-t-SNE embeddings increased with Z. Similar to early exaggeration in t-SNE we started all Neg-t-SNE runs using Z = |X|/m and only switched to the desired Z for the last two thirds of the optimization. The dataset contains 20 272 cells and is colored by the duration of the development. There are ten times fewer cells collected after 10 days than after one month. Published as a conference paper at ICLR 2023 (a) Z < Zt-SNE (b) Z = Zt-SNE (c) Z = ZNCVis (d) Z = |X|/m (e) Z > |X|/m (i) Partition function Figure S14: (a e) Neg-t-SNE spectrum on the single-cell RNA sequencing dataset of a developing zebrafish embryo (Wagner et al., 2018) for various parameters Z. 
As Z increases, the scale of the embedding decreases and the continuous structure (corresponding to the developmental stage) becomes more apparent, making higher Z more suitable for visualizing continuous datasets (B ohm et al., 2022). The spectrum produces embeddings very similar to those of (f) t-SNE and (g) NCVis when Z equals the partition function of t-SNE or the learned normalization parameter of NCVis. The UMAP embedding in (h) closely resembles the Neg-t-SNE embedding at Z = |X|/m = n 2 /m. (i) The partition function P ij(1 + d2 ij) 1 of the Neg-t-SNE embeddings increased with Z. Similar to early exaggeration in t-SNE we started all Neg-t-SNE runs using Z = |X|/m and only switched to the desired Z for the last two thirds of the optimization. The dataset contains 63 530 cells and is colored by the hours post fertilization (hpf). There are ten times fewer cells collected after 8 hours than after 24. Published as a conference paper at ICLR 2023 (a) Z < Zt-SNE (b) Z = Zt-SNE (c) Z = ZNCVis (d) Z = |X|/m (e) Z > |X|/m (i) Partition function Figure S15: (a e) Neg-t-SNE spectrum of the single-cell RNA sequencing dataset of the C. elegans flatworm (Packer et al., 2019; Narayan et al., 2021) for various values of the fixed normalization constant Z. As Z increases, the scale of the embedding decreases, the scale of the embedding decreases, and more continuous structure becomes apparent. The Neg-t-SNE spectrum produces embeddings very similar to those of (f) t-SNE, (g) NCVis, and (h) UMAP, when Z equals the partition function of t-SNE, the learned normalization parameter Z of NCVis, or |X|/m = n 2 /m used by UMAP, as predicted in Sec. 4 6. (i) The partition function P ij(1 + d2 ij) 1 tries to match Z and grows with it. Similar to early exaggeration in t-SNE we started all Neg-t-SNE runs using Z = |X|/m and only switched to the desired Z for the last two thirds of the optimization. The dataset contains information on 86 024 cells of 37 types indicated by the colors. It is imbalanced with only 25 cells of the least abundant type but 31 375 cells of unknown type (grey). Published as a conference paper at ICLR 2023 (a) Z < Zt-SNE (b) Z = Zt-SNE (c) Z = ZNCVis (d) Z = |X|/m (e) Z > |X|/m (i) Partition function Figure S16: (a e) Neg-t-SNE embeddings of the Kuzushiji-49 dataset (Tarin et al., 2018) for various values of the fixed normalization constant Z. As Z increases, the scale of the embedding decreases, clusters become more compact and separated before eventually starting to merge. The Neg-t-SNE spectrum produces embeddings similar to those of (f) t-SNE, (g) NCVis, and (h) UMAP, when Z equals the partition function of t-SNE, the learned normalization parameter Z of NCVis, or |X|/m = n 2 /m used by UMAP, as predicted in Sec. 4 6. (i) The partition function P ij(1+d2 ij) 1 tries to match Z and grows with it. Similar to early exaggeration in t-SNE we started all Neg-t-SNE runs using Z = |X|/m and only switched to the desired Z for the last two thirds of the optimization. The dataset contains 270 912 images of 49 different Japanese characters. The classes are imbalanced with 456 to 7 000 samples per class. We see that a higher level of repulsion than UMAP s Z = |X|/m helps to visualize the discrete structure of the dataset. The sampling based Neg-t-SNE embedding at Z = Zt-SNE has less structure than the t-SNE embedding. Fig. S17 and Tab. S4 show that the Neg-t-SNE result improves for longer optimization. 
(a) 500 epochs (b) 1000 epochs (c) 5000 epochs (d) 10000 epochs Figure S17: Neg-t-SNE embeddings of the Kuzushiji-49 dataset (Tarin et al., 2018) for Z = Zt-SNE show more structure when optimized longer. Similar to early exaggeration in t-SNE we started all Neg-t-SNE runs using Z = |X|/m for 250 epochs and only switched to the desired Z for the remaining number of epochs indicated in the subcaptions. Published as a conference paper at ICLR 2023 Table S4: Longer run times improve the Neg-t-SNE optimization on the Kuzushiji-49 dataset (Tarin et al., 2018) for Z = Zt-SNE. The KL divergence is computed with respect to the normalized model qθ/(P ij qθ(ij)). Epochs 500 1000 5000 10000 Partition function [106] 18.96 0.02 13.27 0.01 6.93 0.01 5.75 0.01 Neg-t-SNE loss 1021 2 832 2 630 1 599 1 KL divergence 5.52 0.01 5.18 0.01 4.76 0.01 4.65 0.00 (a) m = 5 random init (b) m = 50 random init (c) m = 500 random init (d) t-SNE random init (e) m = 5 PCA init (f) m = 50 PCA init (g) m = 500 PCA init (h) t-SNE PCA init (i) m = 5 EE (j) m = 50 EE (k) m = 500 EE (l) t-SNE EE Figure S18: NC-t-SNE (our implementation) on the MNIST dataset for varying number of noise samples m and different starting conditions. Higher number of noise samples m improves the approximation quality to t-SNE (last column). The first row is initialized with isotropic Gaussian noise and the second and the third rows with PCA (both normalized to have standard deviation of one or 0.0001 in the first dimension for NC-t-SNE or t-SNE, respectively). In the third row, the first 250 epochs used m = 5 and the latter used the given m value for NC-t-SNE. This is similar to t-SNE s early exaggeration that we used in panel l. NC-t-SNE seems to be less dependent on early exaggeration than t-SNE, especially for low m values. Insets show the k NN recall and the Spearman correlation between distances of pairs of points. Higher m increases the k NN recall, while mostly leaving the Spearman correlation unchanged. Random initialization hurts the Spearman correlation, but not the k NN recall. Published as a conference paper at ICLR 2023 (a) m = 5 random init (b) m = 50 random init (c) m = 500 random init (d) t-SNE random init (e) m = 5 PCA init (f) m = 50 PCA init (g) m = 500 PCA init (h) t-SNE PCA init (i) m = 5 EE (j) m = 50 EE (k) m = 500 EE (l) t-SNE EE Figure S19: Info NC-t-SNE on the MNIST dataset for varying number of noise samples m and different starting conditions. Higher number of noise samples m improves the approximation quality to t-SNE (last column). The first row is initialized with isotropic Gaussian noise and the second and the third rows with PCA (both normalized to have standard deviation of one or 0.0001 in the first dimension for Info NC-t-SNE or t-SNE, respectively). In the third row, the first 250 epochs used m = 5 and the latter used the given m value for Info NC-t-SNE. This is similar to t-SNE s early exaggeration that we used in panel l. Info NC-t-SNE seems to be less dependent on early exaggeration than t-SNE, especially for low m values. Insets show the k NN recall and the Spearman correlation between distances of pairs of points. Higher m increases the k NN recall, while mostly leaving the Spearman correlation unchanged. Random initialization hurts the Spearman correlation, but not the k NN recall.