# On UMAP's True Loss Function

Sebastian Damrich, Fred A. Hamprecht
HCI/IWR at Heidelberg University, 69120 Heidelberg, Germany
{sebastian.damrich, fred.hamprecht}@iwr.uni-heidelberg.de

## Abstract

UMAP has supplanted t-SNE as state of the art for visualizing high-dimensional datasets in many disciplines, but the reason for its success is not well understood. In this work, we investigate UMAP's sampling-based optimization scheme in detail. We derive UMAP's true loss function in closed form and find that it differs from the published one in a dataset-size-dependent way. As a consequence, we show that UMAP does not aim to reproduce its theoretically motivated high-dimensional UMAP similarities. Instead, it tries to reproduce similarities that only encode the k nearest neighbor graph, thereby challenging the previous understanding of UMAP's effectiveness. Alternatively, we consider the implicit balancing of attraction and repulsion due to the negative sampling to be key to UMAP's success. We corroborate our theoretical findings on toy and single-cell RNA sequencing data.

## 1 Introduction

Today's most prominent methods for non-parametric, non-linear dimension reduction are t-Distributed Stochastic Neighbor Embedding (t-SNE) [20, 19] and Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [13]. The heart of UMAP is claimed to be its sophisticated method for extracting the high-dimensional similarities. However, the reason for UMAP's excellent visualizations is not immediately obvious from this approach. In particular, UMAP's eponymous uniformity assumption, see Section 3, is arguably difficult to defend for the wide range of datasets on which UMAP performs well. Therefore, it is not well understood which aspect of UMAP is responsible for its great visualizations.

Both t-SNE and UMAP have to overcome the computational obstacle of considering the quadratic number of interactions between all pairs of points. The breakthrough for t-SNE came with a Barnes-Hut approximation [19]. UMAP instead employs a sampling-based approach to avoid a quadratic number of repulsive interactions. Other than [4], little attention has been paid to this sampling-based optimization scheme. In this work, we fill this gap and analyze UMAP's optimization method in detail. In particular, we derive the effective, closed-form loss function which is truly minimized by UMAP's optimization scheme. While UMAP's use of negative sampling was intended to avoid quadratic complexity, we find, surprisingly, that the resulting effective loss function differs significantly from UMAP's purported loss function. The weight of the loss function's repulsive term is drastically reduced. As a consequence, UMAP is not actually geared towards reproducing the clever high-dimensional similarities. In fact, we show that most information beyond the shared kNN graph connectivity is essentially ignored, as UMAP actually approximates a binarized version of the high-dimensional similarities. These theoretical findings underpin some empirical observations in [4] and demonstrate that the gist of UMAP is not its high-dimensional similarities. This resolves the disconnect between UMAP's uniformity assumption and its success on datasets of varying density.

From a user's perspective it is important to gain an intuition for deciding which features of a visualization can be attributed to the data and which ones are more likely artifacts of the visualization method.
With our analysis, we can explain UMAP's tendency to produce crisp, over-contracted substructures, which increases with the dataset size, as a side effect of its optimization.

Without the motivation of reproducing sophisticated high-dimensional similarities in embedding space, it seems unclear why UMAP performs well. We propose an alternative explanation for UMAP's success: The sampling-based optimization scheme balances the attractive and repulsive loss terms despite the sparse high-dimensional attraction. Consequently, UMAP can effectively leverage the connectivity information of the shared kNN graph via gradient descent.

## 2 Related work

For most of the past decade t-SNE [20, 19] was considered state-of-the-art for non-linear dimension reduction. In the last years UMAP [13] at least ties with t-SNE. In both methods, points are embedded so as to reproduce high-dimensional similarities. Different from t-SNE, the high-dimensional similarities are sparse for UMAP and do not need to be normalized over the entire dataset. Additionally, t-SNE adapts the local scale of high-dimensional similarities by achieving a predefined perplexity, while UMAP uses its uniformity assumption. The low-dimensional similarity functions also differ. Recently, Böhm et al. [4] placed both UMAP and t-SNE on a spectrum of dimension reduction methods that mainly differ in the amount of repulsion employed. They argue that UMAP uses less repulsion than t-SNE. A parametric version of UMAP was proposed in [17].

UMAP's success, in particular in the biological community [2, 16], sparked interest in understanding UMAP more deeply. The original paper [13] motivates the choice of the high-dimensional similarities using concepts from algebraic topology and thus justifies UMAP's transition from local similarities $\mu_{i|j}$ to global similarities $\mu_{ij}$. The authors find that while the algorithm focuses on reproducing the local similarity pattern, similar to t-SNE, it achieves better global results. In contrast, Kobak and Linderman [10] attribute the better global properties of UMAP visualizations to the more informative initialization and show that t-SNE manages to capture more global structure if initialized in a similar way. Wang et al. [21] also analyze the local and global properties of t-SNE, UMAP and TriMAP [1]. Narayan et al. [14] observe that UMAP's uniformity assumption leads to visualizations in which denser regions are more spread out while sparser regions get overly contracted. They propose an additional loss term that aims to reproduce the local density around each point and thus spaces sparser regions out. We provide an additional explanation for overly contracted visualizations: UMAP does not reproduce the high-dimensional similarities but exaggerates the attractive forces over the repulsive ones, which can result in overly crisp visualizations, see Figures 1b, 1c and 2a.

Our work aligns with Böhm et al. [4]. The authors conjecture that the sampling-based optimization procedure of UMAP prevents the minimization of the supposed loss function, thus not reproducing the high-dimensional similarities in embedding space. They substantiate this hypothesis by qualitatively estimating the relative size of attractive and repulsive forces. In addition, they implement a Barnes-Hut approximation to the loss function (6) and find that it yields a diverged embedding.
We analyze UMAP's sampling procedure in depth, compute UMAP's true loss function in closed form and contrast it against the supposed loss in Section 5. Based on this analytic effective loss function, we can further explain Böhm et al. [4]'s empirical finding that the specific high-dimensional similarities provide little gain over the binary weights of a shared kNN graph (the shared k nearest neighbor graph contains an edge ij if i is among j's k nearest neighbors or vice versa), see Section 6. Finally, our theoretical framework leads us to a new tentative explanation for UMAP's success in Section 7.

## 3 Background on UMAP

UMAP assumes that the data lies on a low-dimensional Riemannian manifold $(\mathcal{R}, g)$ in ambient high-dimensional space. Moreover, the data is assumed to be uniformly distributed with respect to the metric tensor $g$, or put simply, with respect to the distance along the manifold. The metric tensor $g$ in turn is assumed to be locally constant. The key idea of UMAP [13] is to compute pairwise similarities in high-dimensional space which inform the optimization of the low-dimensional embedding.

Let $x_1, \dots, x_n \in \mathbb{R}^D$ be high-dimensional, mutually distinct data points for which low-dimensional embeddings $e_1, \dots, e_n \in \mathbb{R}^d$ shall be found, where $d \ll D$, often $d = 2$ or $3$. First, UMAP extracts high-dimensional similarities between the data points. To do so, the k nearest neighbor graph is computed. Let $i_1, \dots, i_k$ denote the indices of $x_i$'s k nearest neighbors in increasing order of distance to $x_i$. Then, using its uniformity assumption, UMAP fits a local notion of similarity for each data point $i$ by selecting a scale $\sigma_i$ such that the total similarity of each point to its k nearest neighbors is normalized, i.e., find $\sigma_i$ such that

$$\sum_{\kappa=1}^{k} \exp\Big(-\big(d(x_i, x_{i_\kappa}) - d(x_i, x_{i_1})\big)/\sigma_i\Big) = \log_2(k). \tag{1}$$

This defines the directed high-dimensional similarities

$$\mu_{i|j} = \begin{cases} \exp\Big(-\big(d(x_i, x_j) - d(x_i, x_{i_1})\big)/\sigma_i\Big) & \text{for } j \in \{i_1, \dots, i_k\}, \\ 0 & \text{else.} \end{cases} \tag{2}$$

Finally, these are symmetrized to obtain undirected high-dimensional similarities or input similarities between items $i$ and $j$,

$$\mu_{ij} = \mu_{i|j} + \mu_{j|i} - \mu_{i|j}\,\mu_{j|i} \in [0, 1]. \tag{3}$$

While each node has exactly $k$ non-zero directed similarities $\mu_{i|j}$ to other nodes, which sum to $\log_2(k)$, this does not hold exactly after symmetrization. Nevertheless, typically the $\mu_{ij}$ are highly sparse, each node has positive similarity to about $k$ other nodes and the degree of each node $d_i = \sum_{j=1}^n \mu_{ij}$ is approximately constant and close to $\log_2(k)$, see Supp. Figs. 5 and 6 in Appendix B. For convenience of notation, we set $\mu_{ii} = 0$ and define the total similarity as $\mu_{\text{tot}} = \frac{1}{2}\sum_{i=1}^n d_i$.

Distance in embedding space is transformed to low-dimensional similarity by a smooth approximation to the high-dimensional similarity function, $\phi(d; a, b) = (1 + a d^{2b})^{-1}$, for all pairs of points. The shape parameters $a, b$ are essentially hyperparameters of UMAP. We will overload notation and write

$$\nu_{ij} = \phi(\lVert e_i - e_j \rVert) = \phi(e_i, e_j) \tag{4}$$

for the low-dimensional or embedding similarities and usually suppress their dependence on $a$ and $b$. Supposedly, UMAP approximately optimizes the following objective function with respect to the embeddings $e_1, \dots, e_n$:

$$L(\{e_i\} \mid \{\mu_{ij}\}) = -2 \sum_{i<j} \Big(\mu_{ij}\log(\nu_{ij}) + (1-\mu_{ij})\log(1-\nu_{ij})\Big) \tag{5}$$

$$= 2 \sum_{i<j} \Big(\mu_{ij}\,\underbrace{\big(-\log(\phi(e_i, e_j))\big)}_{L^a_{ij}} + (1-\mu_{ij})\,\underbrace{\big(-\log(1-\phi(e_i, e_j))\big)}_{L^r_{ij}}\Big). \tag{6}$$

While the high-dimensional similarities $\mu_{ij}$ are symmetric, UMAP's implementation does consider their direction during optimization. For this reason, our losses in equations (5) and (6) differ by a factor of 2 from the one given in [13].
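To make equations (1)-(4) concrete, here is a minimal dense-matrix sketch that computes the directed similarities, their symmetrization and the embedding similarity function from precomputed, distance-sorted kNN indices and distances. The function names, the simple bisection for $\sigma_i$ and the particular values of $a$ and $b$ are our own illustrative choices; UMAP's actual implementation works on sparse matrices and fits $a, b$ from its min_dist and spread parameters.

```python
import numpy as np

def fit_sigma(knn_dists, target, n_iter=64):
    """Bisection for the scale sigma_i of equation (1): the similarities of
    point i to its k nearest neighbors should sum to log2(k) = `target`."""
    rho = knn_dists[0]                       # distance to the nearest neighbor
    lo, hi = 1e-12, 1e4
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        total = np.exp(-(knn_dists - rho) / sigma).sum()
        if total > target:
            hi = sigma                       # total too large -> shrink sigma
        else:
            lo = sigma
    return 0.5 * (lo + hi)

def input_similarities(knn_idx, knn_dists):
    """Directed similarities mu_{i|j} (eq. 2) and their symmetrization (eq. 3),
    returned as a dense n x n matrix for clarity."""
    n, k = knn_idx.shape
    mu_dir = np.zeros((n, n))
    for i in range(n):
        sigma_i = fit_sigma(knn_dists[i], target=np.log2(k))
        mu_dir[i, knn_idx[i]] = np.exp(-(knn_dists[i] - knn_dists[i, 0]) / sigma_i)
    mu = mu_dir + mu_dir.T - mu_dir * mu_dir.T   # probabilistic symmetrization
    np.fill_diagonal(mu, 0.0)                    # mu_ii = 0 by convention
    return mu

def phi(dist, a=1.577, b=0.895):
    """Low-dimensional similarity (eq. 4); a, b set to illustrative values
    close to what UMAP fits for its default min_dist."""
    return 1.0 / (1.0 + a * dist ** (2 * b))
```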
Viewed through the lens of a force-directed model, the derivative of the first term in each summand of the loss function, $\partial L^a_{ij}/\partial e_i$, captures the attraction of $e_i$ to $e_j$ due to the high-dimensional similarity $\mu_{ij}$, and the derivative of the second term, $\partial L^r_{ij}/\partial e_i$, represents the repulsion that $e_j$ exerts on $e_i$ due to a lack of similarity in high dimension, $1 - \mu_{ij}$. Alternatively, the loss can be seen as a sum of binary cross entropy losses, one for each pairwise similarity. Thus, it is minimized if the low-dimensional similarities $\nu_{ij}$ exactly match their high-dimensional counterparts $\mu_{ij}$, that is, if UMAP manages to reproduce the input similarities in embedding space.

UMAP uses a sampling-based stochastic gradient descent to optimize the low-dimensional embedding, typically starting from a Laplacian Eigenmap layout [3, 10]. Our main contribution is to show that the sampling-based optimization in fact leads to a different objective function, so that UMAP does not reproduce the high-dimensional similarities in low-dimensional space, see Sections 4 to 6.

## 4 UMAP does not reproduce high-dimensional similarities

UMAP produces scientifically useful visualizations in many domains and is fairly robust to its hyperparameters. Since any visualization of intrinsically high-dimensional data must be somehow unfaithful, it is not straightforward to check the visualization quality other than by its downstream use. Some quantitative metrics such as the correlation between high- and low-dimensional distances [10, 2] exist. We follow a different route to show unexpected properties of UMAP. Consider the toy example of applying UMAP to data that is already low-dimensional, so that no reduction in dimension is required: $D = d = 2$. Ideally, the data would be preserved in this situation. One might also expect that UMAP achieves its aim of perfectly reproducing the input similarities. Surprisingly, neither of these expectations is met.

Figure 1: UMAP does not preserve the data even when no dimension reduction is required. (a) Original data consisting of 1000 points sampled uniformly from a ring in 2D; all optimizations ran for 10000 epochs and were initialized at the input data. (b) Result of UMAP; the circular shape is visible but the ring width is nearly completely contracted. (c) Result of UMAP on a dataset consisting of twenty rings as in (a), spaced so that they do not interact in the input similarity computation; only the embedding of one ring is shown. There should not be a significant difference from (b), yet the ring width is completely contracted. (d) Result of UMAP for dense input similarities computed from the original data in (a) with the similarity function $\phi$ from Section 3; no change from the initialization would be optimal in this setting, yet the output has spurious curves and larger width. More figures with the default number of epochs and initialization can be found in Supp. Fig. 20.

In Figure 1, we depict 2D UMAP visualizations of a 2D uniform ring dataset. To ease UMAP's task, we initialized the embedding with the original data, hoping that UMAP's optimization would just deem this layout to be optimal. We also used a longer run time to ensure convergence of UMAP's optimization procedure. Otherwise, we used the default hyperparameters. The results with default initialization and run time are qualitatively similar and shown in Supp. Fig. 20.
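For reference, the ring experiment can be set up in a few lines with the umap-learn package; the ring radii and random seed below are illustrative guesses rather than the exact values behind Figure 1, so the result will only qualitatively reproduce Figure 1b.

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
n = 1000
theta = rng.uniform(0.0, 2.0 * np.pi, n)
radius = np.sqrt(rng.uniform(0.8 ** 2, 1.0 ** 2, n))   # area-uniform annulus, radii are a guess
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

embedding = umap.UMAP(
    n_components=2,
    init=X,            # initialize at the original 2D data, as in Figure 1
    n_epochs=10000,    # long optimization to ensure convergence
).fit_transform(X)     # all other hyperparameters at their defaults
```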
UMAP manages to capture the ring shape of the data, but changes its appearance significantly. It contracts the width of the ring nearly to a line, see Figure 1b, a phenomenon also observed on real-world datasets, see Figure 2a. Whether this exaggeration of the ring shape is useful depends on the use case. This finding goes beyond Narayan et al. [14]'s observation that over-contraction happens in regions of low density, since our toy dataset is sampled uniformly from a circular ring. Moreover, we embedded the same ring with UMAP as part of a larger dataset of twenty such rings, spaced so that the input similarities for each ring are not influenced by the other rings. According to Section 3, we would expect a qualitatively similar embedding as for the individual ring. However, we see in Figure 1c that UMAP produces an even crisper embedding. In another experiment, we varied the number of samples from a single ring and observed again that the tendency for over-contraction increases with the dataset size. For more information on UMAP's dependence on the dataset size confer Appendix H. The dataset size dependence is particularly noteworthy, as computing UMAP embeddings of subsets or subsamples of a dataset is common practice.

As described in Section 3, UMAP employs different methods in input and embedding space to transform distances to similarities. In particular, in input space similarities are zero for all but the closest neighbors, while in embedding space they are computed with the heavy-tailed function $\phi$. To test whether this prevents the reproduction of the input data, we also computed the dense similarities on the original data with $\phi$ and used these as input similarities for the embedding in Figure 1d. Since we also initialize with the original dataset, a global optimum of the objective function (6) for this choice of input similarity, one would expect no change by UMAP's optimization scheme. However, we observe that in this setting UMAP produces spurious curves and increases the width of the ring.

Böhm et al. [4] implemented a Barnes-Hut approximation of UMAP's objective function (6), which produced a diverged embedding. Inspired by this finding, we compute the loss values according to equation (5) for various input and embedding similarities $\mu_{ij}$ and $\nu_{ij}$ in our toy example, see Table 1. Consider the row with the usual input similarities ($\mu_{ij} = \mu(\{x_1, \dots, x_n\})$). In a completely diverged embedding, all self-similarities are one and all others zero ($\nu_{ij} = \mathbb{1}(i = j)$). We find that the loss for such an embedding is lower than for the optimized UMAP embedding. This is in accordance with Böhm et al. [4]'s Barnes-Hut experiment and shows that UMAP does not optimize its supposed objective function (6), as a diverged embedding is approximately feasible in two dimensions.

Table 1: UMAP loss value for various combinations of input and embedding similarities, $\mu_{ij}$, $\nu_{ij}$, of the toy example in Figure 1. The loss for the UMAP embedding in Figure 1b (middle column) is always higher than for another two-dimensional layout (bold). Hence, UMAP does not minimize its purported loss. Results are averaged over 7 runs and one standard deviation is given, see also Appendix F.1.

| Input similarities $\mu_{ij}$ | $\nu_{ij} = \mathbb{1}(i = j)$ (diverged layout) | $\nu_{ij} = \phi(\{e_1, \dots, e_n\})$ (UMAP result) | $\nu_{ij} = \phi(\{x_1, \dots, x_n\})$ (input layout) |
|---|---|---|---|
| $\mu(\{x_1, \dots, x_n\})$ | **62959 ± 82** | 70235 ± 1301 | 136329 ± 721 |
| $\phi(\{x_1, \dots, x_n\})$ | 902757 ± 2788 | 331666 ± 1308 | **224584 ± 8104** |
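To make the comparison in Table 1 concrete, the purported loss (5) can be evaluated directly on a pair of similarity matrices. The helper below is a hedged sketch with our own naming; the small eps guards the logarithms, so the exact numbers depend on this clipping and will not match Table 1 to the digit.

```python
import numpy as np

def purported_umap_loss(mu, nu, eps=1e-12):
    """Loss (5): sum of binary cross-entropies between input similarities mu_ij
    and candidate embedding similarities nu_ij, over all pairs i < j."""
    iu = np.triu_indices(mu.shape[0], k=1)
    ce = mu[iu] * np.log(nu[iu] + eps) + (1.0 - mu[iu]) * np.log(1.0 - nu[iu] + eps)
    return -2.0 * ce.sum()

# The columns of Table 1 correspond to three choices of nu:
#   np.eye(n)                        -> the completely diverged layout, 1(i == j)
#   phi of pairwise UMAP distances   -> the optimized UMAP embedding
#   phi of pairwise input distances  -> the unchanged input layout
```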
This discrepancy is not just due to the fact that input and embedding similarities are computed differently: The second row of Table 1 contains loss values for the setting in which we use the dense similarities as input similarities, as in Figure 1d. We initialize the embedding at the optimal loss value ($\nu_{ij} = \phi(\{x_1, \dots, x_n\}) = \mu_{ij}$), but UMAP's optimization moves away from this layout and towards an embedding with higher loss ($\nu_{ij} = \phi(\{e_1, \dots, e_n\})$), although we always compute similarity in the same manner. Clearly, UMAP's optimization yields unexpected results.

## 5 UMAP's sampling strategy and effective loss function

UMAP uses a sampling-based approach to optimize its loss function, to reduce complexity. A simplified version of the sampling procedure can be found in Algorithm 1. Briefly put, an edge ij is sampled according to its high-dimensional similarity and the embeddings $e_i$ and $e_j$ of the incident nodes are pulled towards each other. For each such sampled edge ij, the algorithm next samples $m$ negative samples $s$ uniformly from all nodes and the embedding of $i$ is repelled from that of each negative sample. Note that the embeddings of the negative samples are not repelled from that of $i$, see Alg. 1, line 10. So there are three types of gradient applied to an embedding $e_i$ during an epoch:

1. $e_i$ is pulled towards $e_j$ when edge ij is sampled (Alg. 1, line 5),
2. $e_i$ is pulled towards $e_j$ when edge ji is sampled (Alg. 1, line 6),
3. $e_i$ is pushed away from negative sample embeddings when some edge ij is sampled (Alg. 1, line 9).

The full update of embedding $e_i$ during epoch $t$ is, according to UMAP's implementation, given by

$$\sum_{j=1}^{n} X^t_{ij}\,\frac{\partial L^a_{ij}}{\partial e_i} + \sum_{j=1}^{n} X^t_{ji}\,\frac{\partial L^a_{ji}}{\partial e_i} + \sum_{j=1}^{n} X^t_{ij} \sum_{s=1}^{n} Y^t_{ij,s}\,\frac{\partial L^r_{is}}{\partial e_i}, \tag{7}$$

where $X^t_{ab}$ is the binary random variable indicating whether edge $ab$ was sampled in epoch $t$ and $Y^t_{ab,s}$ is the random variable for the number of times $s$ was sampled as negative sample for edge $ab$ in epoch $t$ if $ab$ was sampled in epoch $t$ and zero otherwise. By construction, $\mathbb{E}(X^t_{ab}) = \mu_{ab}$ and $\mathbb{E}(Y^t_{ab,s} \mid X^t_{ab} = 1) = m/n$. Taking the expectation over the random events in an epoch, we obtain the expected update of UMAP's optimization procedure, see Appendix C for more details:

$$\sum_{j=1}^{n} \left(\mu_{ij}\,\frac{\partial L^a_{ij}}{\partial e_i} + \frac{d_i m}{2n}\,\frac{\partial L^r_{ij}}{\partial e_i}\right). \tag{8}$$

Comparing this closed formula for the expected UMAP update of the low-dimensional embeddings to the gradient of UMAP's loss function (6),

$$\sum_{j=1}^{n} \left(\mu_{ij}\,\frac{\partial L^a_{ij}}{\partial e_i} + (1-\mu_{ij})\,\frac{\partial L^r_{ij}}{\partial e_i}\right), \tag{9}$$

we find that the sampling procedure yields the correct weight for the attractive term in expectation, as designed. However, as noticed by Böhm et al. [4], the negative sampling changes the weight for the repulsive term significantly. Our closed formula helps to make their qualitative arguments precise: Instead of $1 - \mu_{ij}$, we have a term $\frac{d_i m}{2n}$, which depends on the number of negative samples $m$ per sampled edge.
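The expectations entering (8) can be checked numerically with a small Monte-Carlo simulation of the sampling scheme just described: Bernoulli sampling of each directed edge with probability $\mu_{ij}$ and $m$ uniform negative samples per sampled edge. This is a simplified sketch with our own function name, not UMAP's actual epochs_per_sample bookkeeping.

```python
import numpy as np

def empirical_sampling_weights(mu, m=5, n_epochs=5000, seed=0):
    """Estimate how often e_i is pulled towards e_j and pushed away from e_s
    per epoch, and compare with the weights appearing in the expected update (8)."""
    rng = np.random.default_rng(seed)
    n = mu.shape[0]
    attr = np.zeros((n, n))                     # pulls of e_i towards e_j
    rep = np.zeros((n, n))                      # pushes of e_i away from e_s
    for _ in range(n_epochs):
        sampled = (rng.random((n, n)) < mu).astype(float)   # E[X_ij] = mu_ij
        attr += sampled + sampled.T             # pulls from edges ij and ji
        heads = np.nonzero(sampled)[0]          # head node of every sampled edge
        for h in heads:
            for s in rng.integers(0, n, size=m):
                rep[h, s] += 1.0                # m uniform negative samples
    attr /= n_epochs
    rep /= n_epochs
    d = mu.sum(axis=1)
    # Per epoch, e_i is pulled towards e_j about 2*mu_ij times and pushed away
    # from e_s about d_i*m/n times, i.e. twice the per-summand weights mu_ij
    # and d_i*m/(2n) of the expected update (8).
    return attr, rep, 2.0 * mu, np.tile(d[:, None] * m / n, (1, n))
```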
Contrary to the intention of McInnes et al. [13], the repulsive weights are not uniform but vary with the degree $d_i$ of each point, which, however, is typically close to $\log_2(k)$, see Appendix B. More practically, since the non-zero high-dimensional similarities are sparse, $1 - \mu_{ij}$ is equal to 1 for most ij. In contrast, the expected repulsive weight is small for large datasets, as the numerator $d_i m$ is of the order of $\log_2(k)\,m$, independent of the dataset size $n$, while the denominator scales with $n$.

Another effect of the negative sampling is that in general the expected update (8) does not correspond to any loss function, see Appendix D. We remedy this by additionally pushing the embedding of a negative sample $i$ away from the embedding $e_j$ whenever $i$ was sampled as negative sample to some edge jk, see Alg. 1, line 10. This yields the following gradient at epoch $t$,

$$\sum_{j=1}^{n} X^t_{ij}\,\frac{\partial L^a_{ij}}{\partial e_i} + \sum_{j=1}^{n} X^t_{ji}\,\frac{\partial L^a_{ji}}{\partial e_i} + \sum_{j=1}^{n} X^t_{ij} \sum_{s=1}^{n} Y^t_{ij,s}\,\frac{\partial L^r_{is}}{\partial e_i} + \sum_{j=1}^{n}\sum_{k=1}^{n} X^t_{jk}\, Y^t_{jk,i}\,\frac{\partial L^r_{ji}}{\partial e_i}, \tag{10}$$

corresponding to a loss in epoch $t$ of

$$\tilde{L}^t = \sum_{i \ne j} X^t_{ij}\, L^a_{ij} + \sum_{i \ne j} \sum_{s=1}^{n} X^t_{ij}\, Y^t_{ij,s}\, L^r_{is}. \tag{11}$$

Using the symmetry of $\mu_{ij}$, $L^a_{ij}$ and $L^r_{ij}$ in $i$ and $j$, we compute the effective loss

$$\tilde{L} = \mathbb{E}(\tilde{L}^t) = 2 \sum_{i<j} \left(\mu_{ij}\, L^a_{ij} + \frac{(d_i + d_j)m}{2n}\, L^r_{ij}\right). \tag{12}$$

In fact, the above remedy of also pushing the negative samples does not affect the behavior of UMAP qualitatively, see for instance Supp. Figs. 21 and 15 (in fact, Parametric UMAP [17] does include the update of the negative samples). In this light, we can treat $\tilde{L}$ as the effective objective function that is optimized via SGD by UMAP's optimization procedure. It differs from UMAP's loss function (6) by having a drastically reduced repulsive weight of $\frac{(d_i+d_j)m}{2n}$ instead of $1 - \mu_{ij}$.

We illustrate our analysis on gene expression measurements of 86024 cells of C. elegans [16, 14]. We start out with a 100-dimensional PCA of the data (obtained from http://cb.csail.mit.edu/cb/densvis/datasets/; we informed the authors of our use of the dataset, which they license under CC BY-NC 2.0), use the cosine metric in high-dimensional space, consider k = 30 neighbors and optimize for 750 epochs, similar to [14]. The resulting visualization is depicted in Figure 2a. On this dataset the average value of $1 - \mu_{ij}$ is 0.9999, but the maximal effective repulsive weight $\max_{ij} \frac{(d_i+d_j)m}{2n}$ is 0.0043, showing the dramatic reduction of repulsion due to negative sampling.

Figure 2: UMAP on C. elegans data from [16, 14]. (a) UMAP visualization; several parts of the embedding appear locally one-dimensional, for instance the seam cells (near the tip of the seam cells, the points seem to lie on a one-dimensional manifold; closer to the middle of the seam cells the spread appears wider, but there are actually two dense parallel lines of points). (b) Same as (a) but with inverted positive high-dimensional similarities; the result is qualitatively similar, if not better. (c) Two-dimensional PCA of the dataset; the highlighted seam cells clearly have two-dimensional variance in the PCA plot, but are overly contracted to nearly a line in the UMAP plots (a) and (b). The full legend with all cell types and further information can be found in Supp. Fig. 14. We report quantitative metrics in Appendix G. Figure best viewed in color.

After each optimization epoch, we log our effective loss $\tilde{L}$ (12), the actual loss $\tilde{L}^t$ (11) of each epoch computed based on the sampled (negative) pairs, as well as the purported UMAP loss $L$ (6). We always consider the embeddings at the end of each epoch. Note that UMAP's implementation updates each embedding $e_i$ not just once at the end of the epoch but as soon as $i$ is incident to a sampled edge or sampled as a negative sample. This difference does not change the actual loss much, see Appendix F. The loss curves are plotted in Figure 3. We can see that our predicted loss matches its actual counterpart nearly perfectly. While both $\tilde{L}$ and $\tilde{L}^t$ agree with the attractive part of the supposed UMAP loss, its repulsive part and thus the total loss are two orders of magnitude larger. Furthermore, driven by the repulsive part, the total intended UMAP loss increases during much of the optimization process, while the actual and effective losses decrease, exemplifying that UMAP really optimizes our effective loss $\tilde{L}$ (12) instead of its purported loss $L$ (6).
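The 0.9999-versus-0.0043 comparison above can be recomputed for any input similarity graph with a few lines. The sketch below assumes a symmetric sparse matrix of input similarities, for instance the graph_ attribute of a fitted umap-learn model, and is not the exact script behind the reported numbers.

```python
import numpy as np
from scipy import sparse

def repulsive_weight_comparison(mu, m=5):
    """Average nominal repulsive weight 1 - mu_ij of loss (6) on the kNN-graph
    edges versus the largest effective repulsive weight (d_i + d_j) m / (2n)
    of the effective loss (12)."""
    mu = sparse.csr_matrix(mu)
    n = mu.shape[0]
    d = np.asarray(mu.sum(axis=1)).ravel()               # degrees d_i, typically ~ log2(k)
    avg_nominal = 1.0 - mu.data.mean()                    # mean of 1 - mu_ij over mu_ij > 0
    max_effective = np.sort(d)[-2:].sum() * m / (2 * n)   # max over pairs i != j
    return avg_nominal, max_effective
```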
Additional loss curves for UMAP on other scRNA-seq datasets and CIFAR-10 [11] can be found in Appendix I.

Recently, a parametric version of UMAP was proposed [17]. Here, a neural network is trained to perform the mapping from high to low dimension. Its effective loss function differs slightly from the one for Non-Parametric UMAP, as negative samples only come from the training batch:

Theorem 5.1. The effective loss function of Parametric UMAP with neural network $f_\theta$, batch size $b$ and total similarity $\mu_{\text{tot}} = \frac{1}{2}\sum_{i=1}^n d_i$ is, up to the normalization $\frac{1}{(m+1)\mu_{\text{tot}}}$, again a sum over pairs $1 \le i < j \le n$ of the attractive terms $\mu_{ij}\, L^a_{ij}$ and of repulsive terms $L^r_{ij}$ whose reduced weights now also depend on the batch size $b$.

## 6 UMAP's target similarities

The effective loss $\tilde{L}$ (12) is minimized when each low-dimensional similarity $\nu_{ij}$ matches not the input similarity $\mu_{ij}$ but the target similarity

$$\nu^*_{ij} = \frac{\mu_{ij}}{\mu_{ij} + \frac{(d_i + d_j)m}{2n}} \approx \mathbb{1}(\mu_{ij} > 0). \tag{15}$$

The approximation holds in the typical case in which $\frac{(d_i+d_j)m}{2n} \approx 0$, discussed above. In other words, compared to the massively reduced repulsion weight, even the smallest high-dimensional similarity appears maximal. Thus, the negative sampling essentially binarizes the input similarities.

UMAP's high-dimensional similarities are non-zero exactly on the shared k nearest neighbor graph edges of the high-dimensional data. Therefore, the binarization explains why Böhm et al. [4] find that using the binary weights of the shared k nearest neighbor graph does not deteriorate UMAP's performance much (Böhm et al. [4] used a scaled version of the kNN graph, but the scaling factor cancels for the target weights). The binarization even helps UMAP to overcome disrupted high-dimensional similarities, as long as only the edges of the shared kNN graph have non-zero weight. In Figure 2b, we invert the original positive high-dimensional weights on the C. elegans dataset. That means that the k-th nearest neighbor will have higher weight than the nearest neighbor. The resulting visualization even improves on the original by keeping the layout more compact. This underpins Böhm et al. [4]'s claim that the elaborate theory used to compute the high-dimensional similarities is not the reason for UMAP's practical success. In fact, we show that UMAP's optimization scheme even actively ignores most information beyond the shared kNN graph. In Figure 4, we show histograms of the various notions of similarity for the C. elegans dataset. We see in panel 4b how the binarization equalizes the positive target similarities for the original and the inverted high-dimensional similarities.

UMAP's tendency for over-contraction is already strong for small datasets. Consider UMAP's default setting of k = 15 and m = 5. Then for datasets with n = 500 points, each input similarity $\mu_{ij} > 0.2$ is mapped to a target similarity $\nu^*_{ij} > 0.83$, see Appendix H.4.

The binary cross entropy terms in the effective loss $\tilde{L}$ (12) are not normalized. This leads to a different weighing of the binary cross entropy terms for each pair ij,

$$\mu_{ij}\, L^a_{ij} + \frac{(d_i + d_j)m}{2n}\, L^r_{ij} = -\left(\mu_{ij} + \frac{(d_i + d_j)m}{2n}\right)\Big(\nu^*_{ij}\log(\nu_{ij}) + (1-\nu^*_{ij})\log(1-\nu_{ij})\Big). \tag{17}$$

As $\frac{(d_i+d_j)m}{2n}$ is very small for large datasets, the term $\mu_{ij} + \frac{(d_i+d_j)m}{2n}$ is dominated by $\mu_{ij}$. Hence, the reduced repulsion not only binarizes the high-dimensional similarities, it also puts higher weight on the positive than on the zero target similarities. Therefore, we can expect the positive target similarities to be better approximated by the embedding similarities than the zero ones. Indeed, panel 4a shows that the low-dimensional similarities match the positive target similarities very well, as expected from the weighted BCE reading of the effective loss function (17).

Figure 4: Histograms of high-dimensional ($\mu_{ij}$), target ($\nu^*_{ij}$) and low-dimensional ($\nu_{ij}$) similarities on the C. elegans dataset [16, 14] for pairs with positive high-dimensional similarity. (a) Similarities for $\mu_{ij} > 0$: the similarities of UMAP's low-dimensional embedding reproduce the target similarities instead of the high-dimensional ones; the target similarities are heavily skewed towards one. (b) Original and inverted similarities for $\mu_{ij} > 0$: while the histograms of the input similarities differ noticeably, their target similarities do not, and neither do the embedding similarities (not shown). The binarization essentially ignores all information beyond the kNN graph.
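The quoted numbers follow directly from equation (15). The snippet below, with our own helper name, reproduces the claim that for n = 500, k = 15 and m = 5 an input similarity of 0.2 is already mapped to a target similarity above 0.83.

```python
import numpy as np

def target_similarity(mu_ij, d_i, d_j, m, n):
    """Optimal low-dimensional similarity under the effective loss (12), eq. (15)."""
    return mu_ij / (mu_ij + (d_i + d_j) * m / (2 * n))

d = np.log2(15)                                   # typical degree for k = 15
print(target_similarity(0.2, d, d, m=5, n=500))   # ~0.84 > 0.83
```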
### 6.1 Explaining artifacts in UMAP visualizations

We conclude this section by explaining the observed artifacts of UMAP's visualizations in Figures 1 and 2 in the light of the above analysis.

The normal UMAP optimization contracts the ring in Figure 1b even when initialized at the original layout because the reduced repulsion yields nearly binary target similarities. All pairs that are part of the kNN graph not only want to be sufficiently close that their high-dimensional similarity is reproduced, but so close that their similarity is nearly one. The fact that the effective loss weighs the terms with target similarity near one much more than those with target similarity near zero reinforces this trend. As a result, the ring gets contracted. The same argument applies to the over-contracted parts of the UMAP visualization of the C. elegans dataset in Figure 2.

When we increase the dataset size by adding more rings as in Figure 1c, negative sample pairs can come from different rings, reducing the frequency with which a pair of points from the same ring is sampled. This reduces the repulsion further. Thus, there is stronger binarization and the embedding looks crisper. If we sample a single ring more densely, over-contraction is also increased, but the visual appearance is more nuanced, as over-contraction can appear in the form of densely clustered and line-like substructures within the ring width, see Appendix H.3.

Our framework can also explain the opposite behavior of UMAP when the dense similarities are used as input similarities in Figure 1d. In this setting, the average degree of a node is about 100. With a negative sample rate m = 5 and a dataset size n = 1000, this yields repulsive weights $\frac{(d_i+d_j)m}{2n} \approx 0.5$. Thus, we increase the repulsion on pairs with high input similarity, but decrease it on pairs with low input similarity. Now, the target similarities are not a binarization of the input similarities, but instead skewed towards 0.5. Hence, we can expect embedding points to increase their distance to nearest neighbors, but distant points to move closer towards each other. This is what we observe in Figure 1d, where the width of the ring has increased and the ring curves to bring distant points closer together.
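The dense-similarity case of Figure 1d can be checked with the same formula; the three values of $\mu_{ij}$ below are purely illustrative and only serve to show how the targets are pulled towards 0.5 once the repulsive weight reaches about 0.5.

```python
w = (100 + 100) * 5 / (2 * 1000)         # (d_i + d_j) m / (2n) with degree ~100, m = 5, n = 1000
for mu in (0.1, 0.5, 0.9):
    print(mu, round(mu / (mu + w), 2))   # 0.1 -> 0.17, 0.5 -> 0.5, 0.9 -> 0.64
```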
## 7 Discussion

### 7.1 Negative sampling in LargeVis

The visualization method LargeVis [18] uses negative sampling to optimize its objective function, too. Thus, our analysis of the loss function and the target similarities applies to LargeVis and helps to explain why UMAP and LargeVis visualizations are often similar. For more details confer Appendix E.

### 7.2 Balancing attraction and repulsion

By deriving UMAP's true loss function and target similarities, we are able to explain several peculiar properties of UMAP visualizations. According to our analysis, UMAP does not aim to reproduce the high-dimensional UMAP similarities in low dimension but rather the binary weights of the shared kNN graph of the input data. This raises the question just what part of UMAP's optimization leads to its excellent visualization results. Apparently, the exact formula for the repulsive weights is not crucial, as it differs between Non-Parametric UMAP and Parametric UMAP while both produce similarly high-quality embeddings.

A first tentative step towards an explanation might be the different weighing of the BCE terms in the effective loss function (17). Focusing more on the similar rather than the dissimilar pairs might help to overcome the imbalance between an essentially linear number of attractive and a quadratic number of repulsive pairs. Inflated attraction was found beneficial for t-SNE as well, in the form of early exaggeration [12]. Put another way, the decreased repulsive weights result in comparable total attractive and repulsive weights, which might facilitate the SGD-based optimization. Indeed, the total attractive weight in UMAP's effective loss function is $2\mu_{\text{tot}} = \sum_{i,j=1}^n \mu_{ij}$ and the total repulsive weight amounts to $2m\mu_{\text{tot}} = \sum_{i,j=1}^n \frac{(d_i+d_j)m}{2n}$ for Non-Parametric UMAP. For Parametric UMAP the total attractive and repulsive weights are $\frac{1}{m+1}$ and $\frac{m}{m+1}\cdot\frac{b-1}{b}$, respectively. For the default value of m = 5, the total attractive and repulsive weights are of roughly the same order of magnitude. Moreover, we observe in Figure 3 that the resulting attractive and repulsive losses are also of comparable size. Using UMAP's purported loss function, however, would yield dominating repulsion.

### 7.3 Improved UMAP

The reason for the binarization of the input similarities is the reduced repulsion on the pairs of points that are linked in the shared kNN graph. As discussed above, it might not be desirable to increase the repulsion on all pairs of points to $1 - \mu_{ij}$. But it might be useful to change the optimization procedure of UMAP so as to decrease the repulsion per edge only on the non-kNN graph edges and not on the kNN graph edges. In addition to the positive and negative sampling of UMAP, we suggest to explicitly sample the shared kNN graph edges for repulsion with probability $1 - \mu_{ij} - \frac{(d_i + d_j)m}{2n}$. This would correct the repulsion weight for edges in the shared kNN graph. The loss function would read

$$-2 \sum_{ij \in \text{kNN graph}} \Big(\mu_{ij}\log(\nu_{ij}) + (1-\mu_{ij})\log(1-\nu_{ij})\Big) \;-\; 2 \sum_{ij \notin \text{kNN graph}} \frac{(d_i+d_j)m}{2n}\log(1-\nu_{ij}). \tag{18}$$

Now the target similarities are precisely the high-dimensional similarities $\mu_{ij}$ for all pairs i, j. Hence, in an optimal embedding, all $\mu_{ij}$'s are exactly reproduced and not just a binary version of them. We hope that this will improve the embedding quality and decrease the observed over-contraction. Crucially, the amount of repulsion per data point is kept constant and does not increase with $n$. For the quadratic number of non-kNN graph edges, the individual repulsion pre-factors still decrease with $1/n$. The total loss function is a sum of normalized cross-entropy terms for the edges in the kNN graph and of non-normalized and drastically down-weighted cross-entropy terms for the edges not in the kNN graph. Since not all $O(n^2)$ negative pairs are considered explicitly and the additional sampling of the kNN graph edges for repulsion only scales with $O(n)$, the complexity of computing the embedding would still scale linearly as $O(kmn \cdot \text{numepochs})$ and not quadratically in $n$.
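As a minimal sketch of the proposed correction, assuming a dense symmetric matrix of input similarities, the following computes the per-edge probabilities with which shared kNN graph edges would additionally be sampled for repulsion; wiring these samples into UMAP's optimization loop is left open, as in the text.

```python
import numpy as np

def extra_repulsion_probabilities(mu, m=5):
    """Probability 1 - mu_ij - (d_i + d_j) m / (2n), clipped at 0, with which each
    shared-kNN-graph edge ij would be sampled for an additional repulsive update,
    so that its total repulsive weight becomes 1 - mu_ij as in loss (18)."""
    n = mu.shape[0]
    d = mu.sum(axis=1)
    already = (d[:, None] + d[None, :]) * m / (2 * n)   # repulsion supplied by negative sampling
    probs = np.clip(1.0 - mu - already, 0.0, 1.0)
    probs[mu == 0] = 0.0                                # only kNN-graph edges are resampled
    return probs
```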
A more in-depth investigation as to why exactly balanced attraction and repulsion is beneficial for a useful embedding and the implementation of the improved sampling scheme are left for future work.

## 8 Conclusion

In this work, we investigated UMAP's optimization procedure in depth. In particular, we computed UMAP's effective loss function analytically and found that it differs slightly between the non-parametric and parametric versions of UMAP and significantly from UMAP's alleged loss function. The optimal solution of the effective loss function is typically a binarized version of the high-dimensional similarities. This shows why the sophisticated form of the high-dimensional UMAP similarities does not add much benefit over the shared kNN graph. Instead, we conjecture that the resulting balance between attraction and repulsion is the main reason for UMAP's great visualization capability. Our analysis can explain some artifacts of UMAP visualizations, in particular its tendency to produce over-contracted embeddings, which gets stronger for larger datasets.

## Acknowledgments and Disclosure of Funding

We would like to thank Dmitry Kobak, Quentin Garrido and the anonymous reviewers for useful discussions and helpful feedback. Moreover, we thank Ashwin Narayan for making the scRNA-seq datasets available. This work is supported in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Projektnummer 240245660 - SFB 1129, the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC 2181/1 - 390900948 (the Heidelberg STRUCTURES Excellence Cluster) and Informatics for Life funded by the Klaus Tschira Foundation.

## References

[1] E. Amid and M. K. Warmuth. TriMap: Large-scale Dimensionality Reduction Using Triplets. arXiv preprint arXiv:1910.00204, 2019.

[2] E. Becht, L. McInnes, J. Healy, C.-A. Dutertre, I. W. Kwok, L. G. Ng, F. Ginhoux, and E. W. Newell. Dimensionality reduction for visualizing single-cell data using UMAP. Nature Biotechnology, 37(1):38-44, 2019.

[3] M. Belkin and P. Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002. URL https://proceedings.neurips.cc/paper/2001/file/f106b7f99d2cb30c3db1c3cc0fde9ccb-Paper.pdf.

[4] J. N. Böhm, P. Berens, and D. Kobak. A Unifying Perspective on Neighbor Embeddings along the Attraction-Repulsion Spectrum. arXiv preprint arXiv:2007.08902, 2020.

[5] B. Charlier, J. Feydy, J. A. Glaunès, F.-D. Collin, and G. Durif. Kernel Operations on the GPU, with Autodiff, without Memory Overflows. Journal of Machine Learning Research, 22(74):1-6, 2021. URL http://jmlr.org/papers/v22/20-275.html.

[6] E. Davis, S. Sethuraman, et al. Consistency of modularity clustering on random geometric graphs. Annals of Applied Probability, 28(4):2003-2062, 2018.

[7] W. Dong, C. Moses, and K. Li. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. In Proceedings of the 20th International Conference on World Wide Web, pages 577-586, 2011.

[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[9] D. Kobak and P. Berens. The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1):1-14, 2019.
[10] D. Kobak and G. C. Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, pages 1-2, 2021.

[11] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[12] G. C. Linderman and S. Steinerberger. Clustering with t-SNE, Provably. SIAM Journal on Mathematics of Data Science, 1(2):313-332, 2019.

[13] L. McInnes, J. Healy, and J. Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2018.

[14] A. Narayan, B. Berger, and H. Cho. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology, pages 1-10, 2021.

[15] M. E. Newman. Analysis of weighted networks. Physical Review E, 70(5):056131, 2004.

[16] J. S. Packer, Q. Zhu, C. Huynh, P. Sivaramakrishnan, E. Preston, H. Dueck, D. Stefanik, K. Tan, C. Trapnell, J. Kim, et al. A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution. Science, 365(6459), 2019.

[17] T. Sainburg, L. McInnes, and T. Q. Gentner. Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning. arXiv preprint arXiv:2009.12981, 2020.

[18] J. Tang, J. Liu, M. Zhang, and Q. Mei. Visualizing large-scale and high-dimensional data. In Proceedings of the 25th International Conference on World Wide Web, pages 287-297, 2016.

[19] L. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research, 15(1):3221-3245, 2014.

[20] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

[21] Y. Wang, H. Huang, C. Rudin, and Y. Shaposhnik. Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMAP, and PaCMAP for Data Visualization. Journal of Machine Learning Research, 22:1-73, 2021.

[22] G. X. Zheng, J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, T. D. Wheeler, G. P. McDermott, J. Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications, 8(1):1-12, 2017.

[23] R. Zilionis, C. Engblom, C. Pfirschke, V. Savova, D. Zemmour, H. D. Saatcioglu, I. Krishnan, G. Maroni, C. V. Meyerovitz, C. M. Kerwin, et al. Single-cell transcriptomics of human and mouse lung cancers reveals conserved myeloid populations across individuals and species. Immunity, 50(5):1317-1334, 2019.