Published as a conference paper at ICLR 2024

MAP IT TO VISUALIZE REPRESENTATIONS

Robert Jenssen
UiT The Arctic University of Norway & University of Copenhagen & Norwegian Computing Center

ABSTRACT

MAP IT visualizes representations by taking a fundamentally different approach to dimensionality reduction. MAP IT aligns distributions over discrete marginal probabilities in the input space versus the target space, thus capturing information in wider local regions, as opposed to current methods which align based on pairwise probabilities between states only. The MAP IT theory reveals that alignment based on a projective divergence avoids normalization of weights (to obtain true probabilities) entirely, and further reveals a dual viewpoint via continuous densities and kernel smoothing. MAP IT is shown to produce visualizations which capture class structure better than the current state of the art.

1 INTRODUCTION

Representation learning is key to any machine learning system, for instance for learning to visualize input representations, or for visualizing learned representations obtained e.g. via deep neural networks trained in a supervised or unsupervised/self-supervised manner in order to gain insight¹. Methods such as t-SNE (Böhm et al., 2023; Tucker et al., 2023; Huang et al., 2022; van der Maaten & Hinton, 2008; van der Maaten, 2014), the closely related LargeVis (Tang et al., 2016), as well as the recently proposed UMAP (McInnes et al., 2020), TriMAP (Amid & Warmuth, 2019) and PaCMAP (Wang et al., 2021), are tremendously important contributions, spurring numerous theoretical works, e.g. (Draganov et al., 2023; Damrich & Hamprecht, 2021; Kobak & Linderman, 2021). However, these dominant dimensionality reduction (DR) methods (further discussed in Appendix A) take vastly different theoretical approaches, and despite this diversity, none of the theories provide a framework which explains when normalization of weights is needed and when it is not. Understanding the role of normalization in DR has been an important quest (Draganov et al., 2023). Moreover, all existing methods are based on pairwise comparisons of weights in the input space and the target space, without considering wider information from local regions.

MAP IT represents a major shift. The MAP IT framework is based on information theory (IT) and statistical divergences, showing that normalization is not needed when the divergence used is projective. Based on a Cauchy-Schwarz projective divergence, MAP IT aligns distributions of marginal probabilities which capture information about wider local regions in the data, in essence aggregating information from pairwise weights in a new manner. A dual viewpoint to MAP IT is derived via continuous densities enabled by kernel smoothing, revealing the role of neighborhoods for the gradient-based MAP IT. See Appendix B for all proofs with additional comments.

MAP IT creates overall markedly different embeddings. As an example, Figure 1 shows MAP IT embeddings of a subset of MNIST (d) compared to t-SNE (a), UMAP (b) and PaCMAP (c). MAP IT separates four of the classes of digits much more clearly than any of the other methods. For the challenging digits 4, 9 and 7 (rectangles, zoomed), MAP IT creates less overlap between 4s and 9s, and much better separation of 7s (f) compared to e.g. PaCMAP (e).
The main contributions of this paper are i) providing a new theory for visualization by dimensionality reduction; ii) showing that normalization of weights to obtain true probabilities is not needed in MAP IT by the use of a projective divergence; iii) revealing a dual viewpoint by aligning distributions over discrete marginal probabilities or, equivalently, via continuous densities enabled by kernel smoothing; iv) shedding light on the role of local neighborhoods for the MAP IT update rule; v) generating markedly different visualizations compared to the current state of the art.

¹ Preparing the representations for a downstream task can also be of interest but is not considered here.

Figure 1: Visualizing a subset of MNIST. (a) t-SNE. (b) UMAP. (c) PaCMAP. (d) MAP IT. (e) Zoomed PaCMAP. (f) Zoomed MAP IT. MAP IT separates digits much more distinctly compared to the alternatives and creates less overlap on difficult classes as shown when zooming in (rectangles).

2 T-SNE FRAMEWORK

t-SNE minimizes the Kullback-Leibler (KL) divergence between a joint probability distribution $P$ over states given by $x_1, \dots, x_n \in \mathbb{R}^D$ in the high-dimensional input space and a joint probability distribution $Q$ over states (mapped data points) $z_1, \dots, z_n \in \mathbb{R}^d$ in the low-dimensional target space

$\arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} \mathrm{KL}(P\|Q) = \arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}} = \arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} -\sum_{i,j} p_{ij} \log q_{ij},$   (1)

where $q_{ij} = \big(1 + \|z_i - z_j\|^2\big)^{-1}/Z_q$ are joint probabilities in the target space defined by the t-distribution (Cauchy distribution) and where $Z_q = \sum_{n,m} \big(1 + \|z_n - z_m\|^2\big)^{-1}$ is an explicit normalization factor. Note that for this particular choice, one may write $q_{ij} = \tilde q_{ij}/Z_q$ where $\tilde q_{ij} = \big(1 + \|z_i - z_j\|^2\big)^{-1}$ is an unnormalized quantity. To better prepare for the theory outlined in Section 3, the model $p_{ij} = \exp\!\big(-\kappa_i \|x_i - x_j\|^2\big)/Z_p$ will be assumed, with the normalization factor $Z_p = \sum_{n,m} \exp\!\big(-\kappa_i \|x_n - x_m\|^2\big)$ (further discussed in Appendix C).

Proposition 1. [Minimizing KL(P||Q) and the role of normalization]. Let $q_{ij} = \tilde q_{ij}/Z_q$ for some choice of $\tilde q_{ij}$, where $Z_q = \sum_{n,m} \tilde q_{nm}$. Then

$\arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} \mathrm{KL}(P\|Q) = \arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} -\sum_{i,j} p_{ij} \log \tilde q_{ij} + \log \sum_{n,m} \tilde q_{nm}.$   (2)

Comment to Proposition 1. By this result, the t-SNE cost function may be expressed both over probabilities $p_{ij}$ and over quantities $\tilde q_{ij}$ in the low-dimensional space which are not probabilities, as they are not normalized². This also carries over to the expression for the gradient of KL(P||Q):

² Some interesting properties of the t-SNE cost function related to Laplacian eigenmaps may be highlighted when inserting the explicit form for $\tilde q_{ij}$. This is explored in Appendix A.

Proposition 2. [Gradient of KL(P||Q)]

$\nabla_{z_i} \mathrm{KL}(P\|Q) = -4 \sum_j (p_{ij} - q_{ij})\, \tilde q_{ij}\, (z_j - z_i).$   (3)

The role of normalization. In t-SNE, gradient descent can only be performed after computing $q_{ij}$ over the entire data set for each iteration or epoch, and the gradient can be interpreted as an n-body system of attractive and repulsive forces. Speeding up these computations is the reason for the development of the Barnes-Hut tree algorithm (van der Maaten, 2014) and the more recent interpolation-based t-SNE (Linderman et al., 2019). The requirement for normalization also hampers the utility of parametric t-SNE (van der Maaten, 2009), as discussed in (Sainburg et al., 2021).
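To make the cost of this normalization concrete, the following is a minimal NumPy sketch (an illustration, not the implementation of any of the cited methods) of the gradient in Proposition 2 for the t-kernel; the normalized $q_{ij}$ must be recomputed over all pairs at every iteration, which is exactly the burden that Barnes-Hut and interpolation-based t-SNE address.

```python
import numpy as np

def kl_gradient(Z, P):
    """Gradient of KL(P||Q) w.r.t. the embedding Z (Proposition 2), t-kernel case.

    P : (n, n) joint input-space probabilities p_ij (zero diagonal, summing to one).
    """
    D2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    Q_tilde = 1.0 / (1.0 + D2)                   # unnormalized q~_ij
    np.fill_diagonal(Q_tilde, 0.0)
    Q = Q_tilde / Q_tilde.sum()                  # normalization over the entire data set
    W = (P - Q) * Q_tilde
    # grad_i = -4 * sum_j (p_ij - q_ij) q~_ij (z_j - z_i)
    return -4.0 * (W @ Z - W.sum(axis=1, keepdims=True) * Z)
```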
Recently, new algorithms and insight into normalization have been revealed (Artemenkov & Panov, 2020; Damrich et al., 2023) from the viewpoint of noise contrastive learning (Gutmann & Hyvärinen, 2012), also in the context of the interplay between attractive and repulsive forces (Böhm et al., 2022).

Brief perspectives on t-SNE, UMAP, PaCMAP and variants. A number of new dimensionality reduction methods have been proposed, most being more computationally efficient than t-SNE, usually ending up as a system of attractive and repulsive forces. UMAP (McInnes et al., 2020) exploits cross-entropy between simplicial sets and invokes a sampling strategy on repulsive forces which resembles the approach proposed in LargeVis (Tang et al., 2016) (a method inspired by t-SNE). TriMAP (Amid & Warmuth, 2019) and PaCMAP (Wang et al., 2021) are heuristic methods, where the former is based on a triplet loss and the latter empirically sets up a cost function based on pairwise distances between near pairs, mid-near pairs, and non-neighbor pairs. These methods are considered to be leading approaches and are elaborated in more detail in Appendix A³.

3 MAP IT FRAMEWORK

3.1 MAP IT BY THE CAUCHY-SCHWARZ DIVERGENCE

SNE is elegantly formulated in terms of the KL divergence. However, the KL divergence requires a normalization factor since the KL divergence is not projective.

Definition 1 (Projective Divergence). Let $p_{ij} = \tilde p_{ij}/Z_p$ and $q_{ij} = \tilde q_{ij}/Z_q$, where $Z_p = \sum_{n,m} \tilde p_{nm}$ and $Z_q = \sum_{n,m} \tilde q_{nm}$ are normalizing constants. This means that $P = \tilde P/Z_p$ and $Q = \tilde Q/Z_q$. A projective divergence is such that

$D(P\|Q) = D(\tilde P\|\tilde Q),$   (4)

where $D$ is the divergence measure.

For optimization purposes, the use of a projective divergence⁴ may be appealing since the direct dependence on normalization constants is removed. A divergence measure which has received widespread attention in recent years is the Cauchy-Schwarz divergence (Yu et al., 2023; 2022; Principe, 2010; Jenssen et al., 2006). Without loss of generality, the CS divergence is here defined in terms of marginal probabilities $P(x_j) = p_j$ in the input space and $P(z_j) = q_j$ in the target space, respectively.

Definition 2 (Cauchy-Schwarz (CS) Divergence). Denote $P_m = \{p_j\}_{j=1}^N$ and $Q_m = \{q_j\}_{j=1}^N$. Then

$\mathrm{CS}(P_m\|Q_m) = -\log \frac{\sum_{j} p_j q_j}{\big(\sum_{j} p_j^2\big)^{1/2} \big(\sum_{j} q_j^2\big)^{1/2}}.$   (5)

³ A plethora of dimensionality reduction variants exist, see e.g. the overview (Wang et al., 2021) or visit the excellent repository https://jlmelville.github.io/smallvis/. t-SNE and the other methods mentioned here all use gradient descent but build on many aspects from spectral theory, e.g. (Roweis & Saul, 2000), (Tenenbaum et al., 2000), (Belkin & Niyogi, 2003), (Donoho & Grimes, 2003), (Jenssen, 2010).
⁴ Could equivalently be defined with marginal probabilities $q_i = \tilde q_i/Z_q$, $Z_q = \sum_i \tilde q_i$, or continuous probability functions $q(z) = \tilde q(z)/Z_q$, $Z_q = \int \tilde q(z)\,dz$ (likewise for $p(x)$ and $p_i$).

Proposition 3. [CS divergence is projective]. Let $p_j = \tilde p_j/Z_p$, $Z_p = \sum_j \tilde p_j$, and $q_j = \tilde q_j/Z_q$, $Z_q = \sum_j \tilde q_j$. Then

$\mathrm{CS}(P_m\|Q_m) = \mathrm{CS}(\tilde P_m\|\tilde Q_m).$   (6)

The CS divergence will provide the backbone for the MAP IT framework. MAP IT will be derived both from the viewpoint of marginal probabilities and from the viewpoint of a continuous risk function. The latter provides a coupling to kernel methods which yields a dual framework. MAP IT's cost function lends itself nicely to optimization with gradient descent, as explained in the next sections.
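The projective property is easy to check numerically. The following is a small self-contained sketch (illustrative only, not part of the MAP IT code) of Definition 2 and Proposition 3: the divergence evaluated on unnormalized weights coincides with the one evaluated on normalized probabilities.

```python
import numpy as np

def cs_divergence(p, q):
    """Cauchy-Schwarz divergence (Eq. (5)) between two nonnegative weight vectors."""
    cross = np.sum(p * q)
    return -np.log(cross / (np.sqrt(np.sum(p ** 2)) * np.sqrt(np.sum(q ** 2))))

rng = np.random.default_rng(0)
p_tilde, q_tilde = rng.random(100), rng.random(100)        # unnormalized weights
p, q = p_tilde / p_tilde.sum(), q_tilde / q_tilde.sum()    # normalized probabilities

# Proposition 3: the normalization constants cancel, so the two values agree.
assert np.isclose(cs_divergence(p, q), cs_divergence(p_tilde, q_tilde))
```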
3.2 MAP IT VIA MARGINAL PROBABILITIES

By the law of total probability, $p_j = \sum_k p_{jk}$ and $q_j = \sum_k q_{jk}$. Since $p_{jk}$ and $q_{jk}$ are assumed normalized (Eq. (1)), $\sum_j p_j = 1$ and $\sum_j q_j = 1$. Note that by this argument one may express $q_j = \tilde q_j/Z_q$, where $\tilde q_j = \sum_k \tilde q_{jk}$, $Z_q = \sum_{n,m} \tilde q_{nm}$, and likewise for $p_j$.

Definition 3 (MAP IT cost function). Let $p_j = \sum_k p_{jk}$ and $q_j = \sum_k q_{jk}$. The proposed MAP IT cost function and optimization problem are

$\arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} \mathrm{CS}(P_m\|Q_m) = \arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} \mathrm{CS}(\tilde P_m\|\tilde Q_m).$   (7)

Comment to Definition 3. A major deviation from the SNE framework is that MAP IT aims to align distributions over marginal probabilities for states corresponding to data points in the two spaces, i.e. $p_j$ and $q_j$. These probabilities can be interpreted as degrees associated with the nodes of the underlying similarity graphs in the two spaces, reflecting properties of regions of the data, as opposed to considering only pairwise local connections. This is associated with the underlying mapping which may be expressed as $z_j = m(x_j) + \varepsilon_j$, where $m(\cdot)$ is the mapping function and $\varepsilon_j$ is some noise random variable.

Minimizing CS(P_m||Q_m) yields:

Proposition 4. [Minimizing CS(P_m||Q_m) with respect to $z_1,\dots,z_n \in \mathbb{R}^d$]

$\arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} \mathrm{CS}(\tilde P_m\|\tilde Q_m) = \arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} -\log \sum_j \tilde p_j \tilde q_j + \frac{1}{2}\log \sum_j \tilde q_j^2.$   (8)

As a main result of this paper, the gradient of MAP IT, needed in order to perform minimization of CS(P_m||Q_m) over $z_1,\dots,z_n \in \mathbb{R}^d$, is derived and stated in the following proposition:

Proposition 5. [Gradient of CS(P_m||Q_m)]

$\nabla_{z_i} \mathrm{CS}(P_m\|Q_m) = -4 \sum_j \Big[ \frac{\tilde p_j}{\sum_{j'} \tilde p_{j'} \tilde q_{j'}} - \frac{\tilde q_j}{\sum_{j'} \tilde q_{j'}^2} \Big] \tilde q_{ij}^2\, (z_j - z_i).$   (9)

Comment to Proposition 5. The MAP IT update rule $z_i = z_i - \eta \nabla_{z_i} \mathrm{CS}(P_m\|Q_m)$ yields

$z_i = z_i + \eta \sum_j \Big[ \frac{\tilde p_j}{\sum_{j'} \tilde p_{j'} \tilde q_{j'}} - \frac{\tilde q_j}{\sum_{j'} \tilde q_{j'}^2} \Big] \tilde q_{ij}^2\, (z_j - z_i),$   (10)

where the factor four has been fused into $\eta$. MAP IT thus has easily interpretable gradients, forming an n-body problem where each $z_i$ experiences an attractive force $\tilde p_j$ and a repulsive force $\tilde q_j$ from each other data point $z_j$. The forces are adaptively weighted by entropy in the sense that $-\log \sum_j \tilde q_j^2$ is a Rényi entropy of $\tilde Q_m$ and $-\log \sum_j \tilde p_j \tilde q_j$ is a cross-entropy. All quantities are unnormalized.
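A minimal NumPy sketch of the update rule in Eq. (10), written directly from Propositions 4 and 5 under the Cauchy-similarity choice; it omits the neighborhood restriction of Section 3.4 and all optimization details, and the function and argument names are illustrative rather than part of the released code.

```python
import numpy as np

def mapit_step(Z, p_marg, eta=50.0):
    """One step of the MAP IT update in Eq. (10), without neighborhood restriction.

    Z      : (n, d) current embedding.
    p_marg : (n,) unnormalized input-space marginals, p~_j = sum_k p~_jk.
    """
    D2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    Q_tilde = 1.0 / (1.0 + D2)                 # q~_jk, Cauchy (t-distribution) similarities
    np.fill_diagonal(Q_tilde, 0.0)
    q_marg = Q_tilde.sum(axis=1)               # q~_j = sum_k q~_jk (node degrees)
    cross = np.dot(p_marg, q_marg)             # sum_j p~_j q~_j   (cross-entropy weight)
    renyi = np.dot(q_marg, q_marg)             # sum_j q~_j^2      (Renyi entropy weight)
    w = p_marg / cross - q_marg / renyi        # attractive minus repulsive weight per node j
    W = w[None, :] * Q_tilde ** 2              # coefficient of (z_j - z_i) for each pair (i, j)
    return Z + eta * (W @ Z - W.sum(axis=1, keepdims=True) * Z)
```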
3.3 MAP IT VIA CONTINUOUS DENSITY FUNCTIONS

This section provides an alternative derivation of MAP IT via kernel smoothing. This reveals the role of local neighborhoods via differentiation, affecting the computation of gradients. The following definition is provided, referred to here as the map CS divergence:

Definition 4 (Map CS divergence). Let $p(x)$ be the probability density function over a domain $\mathcal{X}$ and define $q(z)$ to be the probability density function for the stochastic map $z = m(x) + \varepsilon$ over $\mathcal{Z}$. Let $f(x,z)$ be the joint distribution over $\mathcal{X} \times \mathcal{Z}$. The map CS divergence is defined as

$\mathrm{CS}(p(x)\|q(z)) = -\log \frac{\int\!\!\int p(x)\, q(z)\, f(x,z)\,dx\,dz}{\big(\int\!\!\int p^2(x) f(x,z)\,dx\,dz\big)^{1/2} \big(\int\!\!\int q^2(z) f(x,z)\,dx\,dz\big)^{1/2}}.$   (11)

Comment to Definition 4. Consider two functions $g(x)$ and $h(z)$ where $z$ is some mapping of $x$. Within risk minimization, $R = E_{XZ} L[g(x), h(z)] = \int\!\!\int L[g(x), h(z)]\, f(x,z)\,dx\,dz$ for some loss $L(\cdot)$. Empirical risk minimization aims to minimize $\hat R = \sum_i L[g(x_i), h(z_i)]$ with respect to some variables or parameters. From this, the map CS divergence may be understood as a normalized risk function over $g(x) = p(x)$ and $h(z) = q(z)$ with a product loss.

Proposition 6. [The map CS divergence is projective]. Let $z = m(x) + \varepsilon$ and assume $p(x) = \tilde p(x)/Z_p$ and $q(z) = \tilde q(z)/Z_q$, where $\tilde p(x)$ and $\tilde q(z)$ are unnormalized with $Z_p = \int \tilde p(x)\,dx$ and $Z_q = \int \tilde q(z)\,dz$ as the respective normalization constants. Then

$\mathrm{CS}(p(x)\|q(z)) = \mathrm{CS}(\tilde p(x)\|\tilde q(z)).$   (12)

Assume in the following unnormalized functions $\tilde p(x)$ and $\tilde q(z)$.

Proposition 7. [Empirical map CS divergence]. Let a sample $x_1,\dots,x_n \in \mathbb{R}^D$ be given and assume the mapping $z = m(x) + \varepsilon$. Then, an empirical map CS divergence is given by

$\widehat{\mathrm{CS}}(p(x)\|q(z)) = -\log \frac{\sum_j \tilde p(x_j)\,\tilde q(z_j)}{\big(\sum_j \tilde p^2(x_j)\big)^{1/2} \big(\sum_j \tilde q^2(z_j)\big)^{1/2}}.$   (13)

Proposition 8. [Minimizing $\widehat{\mathrm{CS}}(p(x)\|q(z))$ with respect to $z_1,\dots,z_n \in \mathbb{R}^d$]

$\arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} \widehat{\mathrm{CS}}(p(x)\|q(z)) = \arg\min_{z_1,\dots,z_n \in \mathbb{R}^d} -\log \sum_j \tilde p(x_j)\,\tilde q(z_j) + \frac{1}{2}\log \sum_j \tilde q^2(z_j).$   (14)

Proposition 9. [Gradient of $\widehat{\mathrm{CS}}(p(x)\|q(z))$]

$\nabla_{z_i} \widehat{\mathrm{CS}}(p(x)\|q(z)) = -\sum_j \Big[ \frac{\tilde p(x_j)}{\sum_{j'} \tilde p(x_{j'})\,\tilde q(z_{j'})} - \frac{\tilde q(z_j)}{\sum_{j'} \tilde q^2(z_{j'})} \Big] \nabla_{z_i} \tilde q(z_j).$   (15)

Comment to Proposition 9. Eq. (15) is a more general result compared to Eq. (9). For a particular choice of estimator of $\tilde p(x_j)$ and $\tilde q(z_j)$, the basic MAP IT update rule $z_i = z_i - \eta \nabla_{z_i} \mathrm{CS}(P_m\|Q_m)$ will equal $z_i = z_i - \eta \nabla_{z_i} \widehat{\mathrm{CS}}(p(x)\|q(z))$, as is shown next. But this is not true in general, and the latter result may actually be more flexible.

Consider in the following the choice $\hat{\tilde q}(z_j) = \sum_k \kappa_z(z_j - z_k)$ and similarly for $\hat{\tilde p}(x_j)$. In this viewpoint, $\kappa(\cdot)$ is assumed to be a shift-invariant kernel function and $\sum_k \kappa_z(z_j - z_k)$ is the well-known kernel smoothing procedure, which is the backbone of reproducing kernel Hilbert space methods (e.g. (Schrab et al., 2023)) and of utmost importance in machine learning. Note that both the t-distribution (Cauchy distribution) and the Gaussian are valid kernel functions.

Proposition 10. [Gradient of $\widehat{\mathrm{CS}}(p(x)\|q(z))$ with kernel smoothing]. Let $\hat{\tilde q}(z_j) = \sum_k \kappa_z(z_j - z_k)$ and $\hat{\tilde p}(x_j) = \sum_k \kappa_p(x_j - x_k)$ for shift-invariant kernel functions $\kappa_z(\cdot)$ and $\kappa_p(\cdot)$. Then

$\nabla_{z_i} \widehat{\mathrm{CS}}(p(x)\|q(z)) = -4 \sum_j \Big[ \frac{\tilde p_j}{\sum_{j'} \tilde p_{j'} \tilde q_{j'}} - \frac{\tilde q_j}{\sum_{j'} \tilde q_{j'}^2} \Big] \kappa_z^2(z_j - z_i)\,(z_j - z_i).$   (16)

Comment to Proposition 10. With the notation $\tilde q_{ji} \stackrel{\mathrm{def}}{=} \kappa_z(z_j - z_i)$ associated with $\hat{\tilde q}(z_j) = \sum_k \kappa_z(z_j - z_k)$, this result shows that in this particular situation the MAP IT rule $z_i = z_i - \eta \nabla_{z_i} \mathrm{CS}(P_m\|Q_m)$ equals $z_i = z_i - \eta \nabla_{z_i} \widehat{\mathrm{CS}}(p(x)\|q(z))$. MAP IT via marginal probabilities and MAP IT via continuous densities are thus dual viewpoints enabled by the map CS divergence and the kernel smoothing approach.

Having derived MAP IT, it is important to make clear that t-SNE based on the Cauchy-Schwarz divergence, instead of the Kullback-Leibler divergence, appears as a special case of the theory.

Proposition 11. [Cauchy-Schwarz (CS) t-SNE is a special case of MAP IT]. Let $p_{jk'}$ be the probability of the joint event $x_j \sim x_{k'}$. Let $q_{jk}$ be the probability of the joint event $z_j \sim z_k$. If, for $k \neq k'$, the event $(x_j \sim x_{k'})$ cannot co-occur with the event $(z_j \sim z_k)$, then

$\mathrm{CS}(P_m\|Q_m) = \mathrm{CS}(P\|Q).$   (17)
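The duality in Proposition 10 can be illustrated in a few lines (a sketch under the stated kernel-smoothing assumption, not the paper's implementation): with a Cauchy kernel, the smoothed estimate $\hat{\tilde q}(z_j)$ is numerically identical to the node degree $\tilde q_j = \sum_k \tilde q_{jk}$ of Section 3.2, so the two gradient expressions coincide.

```python
import numpy as np

def cauchy_kernel(a, b):
    # Shift-invariant t/Cauchy kernel: kappa(a - b) = (1 + ||a - b||^2)^-1
    return 1.0 / (1.0 + np.sum((a - b) ** 2, axis=-1))

rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))

# Node-degree view (Section 3.2): q~_j = sum_{k != j} q~_jk
Q_tilde = cauchy_kernel(Z[:, None, :], Z[None, :, :])
np.fill_diagonal(Q_tilde, 0.0)
degrees = Q_tilde.sum(axis=1)

# Kernel-smoothing view (Section 3.3): q~(z_j) = sum_{k != j} kappa_z(z_j - z_k)
smoothed = np.array([cauchy_kernel(Z[j], np.delete(Z, j, axis=0)).sum() for j in range(len(Z))])

assert np.allclose(degrees, smoothed)   # the two dual viewpoints agree
```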
3.4 MAP IT VIA LOCAL NEIGHBORHOODS

One subtle but very important aspect which affects the MAP IT theory from the previous sections concerns the role of local neighborhoods. This is best revealed from the starting point of MAP IT via continuous densities, as follows. The aim of MAP IT is to discover $z_1,\dots,z_n \in \mathbb{R}^d$ from $x_1,\dots,x_n \in \mathbb{R}^D$. This entails derivatives $\nabla_{z_i}\, p(\cdot)q(\cdot)$ and $\nabla_{z_i}\, q^2(\cdot)$, which are quantities defined over neighborhoods. Consider $\nabla_{z_i}\, p(\cdot)q(\cdot) \approx p(z_i + \Delta z_i)\,q(z_i + \Delta z_i) - p(z_i)\,q(z_i)$, where $\Delta z_i$ defines a neighborhood in the target space around $z_i$. A sketch illustrating this situation is shown in Fig. 2 (a). However, an assumption is that an unknown transformation exists, $z = m(x) + \varepsilon$. If the transformation was known, there would be no need for MAP IT, and neighborhoods $\Delta z_i$ around $z_i$ would correspond to neighborhoods $\Delta x_i$ around $x_i = m^{-1}(z_i)$. However, even if the transformation is not known, points in the two spaces come in pairs $z_i = m(x_i) + \varepsilon$, and over the course of the optimization neighborhood topologies in the two spaces should start to correspond.

Figure 2: Illustrating the role of local neighborhoods in MAP IT.

During the optimization, since the effect of the underlying mapping function will only gradually be learned, neighborhood topologies in the two spaces will not initially correspond. For differentiation with respect to $z_i$ one must therefore rely on $\Delta z_i = m(\Delta x_i)$. This is illustrated in Fig. 2 (b).

The practical consequence of the above analysis is that all marginal probabilities that are part of the MAP IT learning rule will be computed with respect to a neighborhood $m(\Delta x_i)$ and will be denoted $\tilde p_{j|N_{x_i}}$ and $\tilde q_{j|N_{x_i}}$, respectively, where $N_{x_i}$ refers to the neighborhood around $x_i$, yielding

$z_i = z_i + \eta \sum_j \Big[ \frac{\tilde p_{j|N_{x_i}}}{\sum_{j'} \tilde p_{j'} \tilde q_{j'}} - \frac{\tilde q_{j|N_{x_i}}}{\sum_{j'} \tilde q_{j'}^2} \Big] \tilde q_{ij}^2\, (z_j - z_i).$   (19)

One may envision several possibilities for defining $N_{x_i}$. Here, it is argued that if $x_j$ is one of the nearest neighbors of $x_i$, then the nearest neighbors of $x_j$ are also in the neighborhood of $x_i$. The $\tilde p_{j|N_{x_i}}$ and $\tilde q_{j|N_{x_i}}$ are thus computed over the nearest neighbors of $x_j$. Of course, this line of thought could have been continued to some degree (neighbors of neighbors), but this is not explored further in this paper. Fig. 2 (c) shows as a filled black square a point $x_i$ ($z_i$) and how the nearest neighbors of $x_i$ as well as their neighbors constitute $N_{x_i}$. This resembles to some degree the use of near pairs and mid-near pairs in PaCMAP. For points that are not neighbors of $x_i$ one could compute $\tilde p_{j|N_{x_i}}$ and $\tilde q_{j|N_{x_i}}$ in a similar manner. However, since all $p_{ij}$ for $x_j$ not a neighbor of $x_i$ will be relatively small and constant with respect to points in the vicinity of $x_i$, only $p_{ij}$ ($q_{ij}$) is used in that case, as illustrated in Fig. 2 (c) (solid black circles). This resembles to some degree the use of non-near pairs in PaCMAP. Please see Appendix C for additional perspectives. Finding neighbors can be done in an exact manner, by vantage point trees (Yianilos, 1993), or by approximate search (Dong et al., 2011) as in LargeVis and UMAP.

Importantly, the MAP IT theory deviates fundamentally from the theory for t-SNE, UMAP, and all the alternative DR approaches in the literature by not aligning based on a pairwise comparison of points, i.e. individual $p_{ij}$ and $q_{ij}$. MAP IT exploits combined information represented by the marginal probabilities $\tilde p_j$ and $\tilde q_j$, which are aligned, and which naturally encompass information about multiple pairs of points simultaneously, using unnormalized quantities. This represents a major shift.

4 MAP IT'S VISUALIZATIONS

General comments to the experimental part. The MAP IT theory has been developed from the viewpoint of divergence, and for that reason the t/Gaussian distributions (and perplexity computation) as in t-SNE are chosen (see Appendix C for further considerations on these choices). The UMAP approach to metrics and weights could have been used, only amounting to design choices. In all MAP IT experiments, a random initialization is used, a perplexity value of 15 is used, and no gains or tricks are used, such as multiplying $p_{jk}$ by some constant in the first (say, 100) iterations (widely done in the t-SNE literature). The common delta-bar-delta rule is used, setting momentum to 0.5 for the first 100 iterations and then to 0.8. The learning rate is always set to 50, over 1000 iterations. Results are found to be relatively stable with these choices (see Appendix C for more comments on implementation).
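The optimization loop just described can be sketched as follows (an illustration of the stated settings in the style of the classic delta-bar-delta/momentum scheme from the t-SNE literature; `mapit_gradient` is a hypothetical placeholder for the gradient behind Eq. (19), and the initialization scale is an assumption).

```python
import numpy as np

def optimize(X, mapit_gradient, n_iter=1000, eta=50.0, d=2, seed=0):
    """Gradient descent with momentum and delta-bar-delta gains, as described above."""
    rng = np.random.default_rng(seed)
    Z = 1e-4 * rng.normal(size=(X.shape[0], d))   # random initialization (scale is an assumption)
    step = np.zeros_like(Z)
    gains = np.ones_like(Z)
    for it in range(n_iter):
        grad = mapit_gradient(X, Z)               # placeholder for the MAP IT gradient
        momentum = 0.5 if it < 100 else 0.8       # momentum schedule from the text
        # Delta-bar-delta: grow the gain while the descent direction is consistent
        # (gradient and previous step have opposite signs), shrink it otherwise.
        consistent = np.sign(grad) != np.sign(step)
        gains = np.where(consistent, gains + 0.2, gains * 0.8)
        gains = np.maximum(gains, 0.01)
        step = momentum * step - eta * gains * grad
        Z = Z + step
    return Z
```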
For t-SNE, the highly optimized Barnes-Hut algorithm is used with its default parameters. For UMAP, the state-of-the-art implementation from the Herzenberg Lab, Stanford, is used with default parameters, thus always initialized by Laplacian eigenmaps. The PaCMAP authors' Python implementation is used⁵ with default parameters, thus always initialized by PCA. All methods return a cost function value which is used to select the best result out of several runs. MAP IT code is available at https://github.com/SFI-Visual-Intelligence/.

The focus of this paper is to introduce MAP IT as a fundamentally new way to approach dimensionality reduction, not to extensively scale up MAP IT. For that reason, a range of relatively modest-sized data sets is used to convey the basic properties of MAP IT as a new approach. In Appendix C, some further considerations on potential upscaling of the method based on sampling of forces and computation of (entropy) weights (Eq. (19)) are provided. Visualizations below are best viewed in color. Plots are enlarged and repeated in Appendix C, for the benefit of the reader, where more details about the data sets are also given.

⁵ t-SNE: Barnes-Hut. UMAP: C. Meehan, J. Ebrahimian, W. Moore, and S. Meehan (2022). Uniform Manifold Approximation and Projection (UMAP) (https://www.mathworks.com/matlabcentral/fileexchange/71902). PaCMAP: https://github.com/YingfanWang/PaCMAP/tree/master.
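For readers who wish to set up a comparable comparison in Python, a sketch along the following lines can be used; it relies on the publicly available scikit-learn, umap-learn and pacmap packages rather than the specific implementations listed in footnote 5, and the parameter choices are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.manifold import TSNE
import umap     # pip install umap-learn
import pacmap   # pip install pacmap

# Random subset of MNIST with 2000 points, as in the experiments.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
idx = np.random.default_rng(0).choice(len(X), size=2000, replace=False)
X, y = X[idx] / 255.0, y[idx].astype(int)

embeddings = {
    "t-SNE": TSNE(n_components=2, perplexity=15).fit_transform(X),
    "UMAP": umap.UMAP(n_components=2).fit_transform(X),
    "PaCMAP": pacmap.PaCMAP(n_components=2).fit_transform(X),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, Z) in zip(axes, embeddings.items()):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=3, cmap="tab10")
    ax.set_title(name)
plt.show()
```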
Figure 3: MAP IT for a subset of MNIST for different initial values (runs) for k = 10.

MNIST. A random subset of MNIST consisting of 2000 points is used. Figure 1 has already illustrated that MAP IT, despite not being heavily optimized and despite random initialization, produces a markedly different representation compared to the alternatives, with a clear class structure and with better separation between classes. Figure 10 in Appendix C shows visualizations using different values k of nearest neighbors. The value of this hyperparameter will influence results. Based on MNIST and other data sets, it appears as if a value between 5 and 15 produces reasonable results over a range of data sets. Figure 3 illustrates MAP IT with k = 10 for MNIST over different runs. The main structure of the mapped data is always the same, showing robustness with respect to initialization. Appendix C provides further quantitative analysis.

Coil 20. It can be seen in Figure 4 that MAP IT separates the 20 classes very well when compared to the alternatives. It may be that MAP IT experiences symmetry breaking to a somewhat larger degree than e.g. PaCMAP (flattening out ring structures).

Figure 4: Visualizing the Coil-20 dataset. (c) PaCMAP. (d) MAP IT.

Visual Concepts. Images corresponding to three different visual concepts are visualized in Fig. 5 by their SIFT (Lowe, 1999) descriptors (1000-dimensional). The visual concepts used are strawberry (blue), lemon (cyan) and australian terrier (yellow). It is evident from all the methods that the concepts are overlapping. t-SNE splits the australian terrier. PaCMAP splits the hugely overlapping groups strawberry and lemon. UMAP does not split groups but indicates little group structure. MAP IT indicates more of a group structure, without splitting any group.

Figure 5: Visualizing visual concepts. (c) PaCMAP. (d) MAP IT.

Newsgroups. A bag of 100 words is created. Randomly, 10% of the documents are selected, obtaining a total of 1625 documents distributed over the categories comp, rec, sci and talk, and a document-word matrix is formed and used to visualize the word distribution, Fig. 6 (see Appendix C for word clouds). t-SNE and UMAP uniformly spread the words in the plane. PaCMAP seems to put words in groups roughly corresponding to topics, which may be natural but which is too strict (the word "computer", for instance, would not be expected to be exclusive to e.g. the group sci). MAP IT also puts words in groups to some degree, but not exclusively, as PaCMAP does.

Figure 6: Visualizing word representations from Newsgroups. (c) PaCMAP. (d) MAP IT.

Frey faces. The 1965 (28 × 20) Frey faces are visualized and shown in Fig. 7. Each image is a frame in a video of the same face varying smoothly over time. t-SNE, UMAP, and PaCMAP all seem to predominantly pick up on a smile versus no-smile structure in the video. MAP IT creates smaller local structures, but these structures seem more spread out in the space in the sense of, for instance, not placing smiling faces very far from other faces.

Figure 7: Visualizing the Frey faces. (c) PaCMAP. (d) MAP IT.

Sampling over non-neighbors. As mentioned, scaling up MAP IT is not a main focus in this paper. It is, however, of interest for future work. As an initial experiment, the computation of attractive/repulsive forces is split into two groups, over the k neighbors and the n − k non-neighbors of each point. The latter group is sampled, using only some multiple of k, here 3k. For k = 10 and n = 2000, this yields over a 98.5 percent reduction in computations. Fig. 8 shows for MNIST that the visualization is basically unchanged (elaborated more in Appendix C).

Figure 8: MAP IT with sampling of non-neighbor attractive/repulsive forces.

5 CONCLUSIONS

MAP IT takes a fundamentally different approach to visualizing representations by aligning distributions over discrete marginal probabilities (node degrees) in the input space versus the target space, thus capturing information in wider local regions. This is contrary to current methods, which align based on individual probabilities between pairs of data points (states) only. The MAP IT theory reveals that alignment based on a projective divergence avoids normalization of weights (to obtain true probabilities) entirely, and further reveals a dual viewpoint via continuous densities and kernel smoothing. MAP IT is shown to produce visualizations which capture class structure better than the current state of the art.

ACKNOWLEDGMENTS

This work was partially funded by the Research Council of Norway (RCN) grant no. 309439 Centre of Research-based Innovation Visual Intelligence, http://visual-intelligence.no, as well as RCN grant no. 303514. The author thanks the anonymous reviewers for helpful comments.

Reproducibility statement. To ensure reproducibility, proofs are included in Appendix B and additional results and analysis are included in Appendix C. Code is available at https://github.com/SFI-Visual-Intelligence/.

REFERENCES

Ehsan Amid and Manfred K. Warmuth. TriMap: Large-scale dimensionality reduction using triplets. arXiv preprint arXiv:1910.00204, 2019.

Aleksandr Artemenkov and Maxim Panov. NCVis: Noise contrastive approach for scalable visualization. Proceedings of The Web Conference 2020, 2020. URL https://api.semanticscholar.org/CorpusID:210966425.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.

Anne C. Belkina, Christopher O. Ciccollela, Rina Anno, Richard Halpert, Josef Spidlen, and Jennifer E. Snyder-Cappione.
Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature Communications, 10 (5415), 2019. Jan Niklas B ohm, Philipp Berens, and Dmitry Kobak. Attraction-repulsion spectrum in neighbor embeddings. Journal of Machine Learning Research, 23:1 32, 2022. Niklas B ohm, Philipp Berens, and Dmitry Kobak. Unsupervised visualization of image datasets using contrastive learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=n I2Hm VA0hvt. Kerstin Bunte, Sven Haase, and Thomas Villmann. Stochastic neighbor embedding SNE for dimension reduction and visualization using arbitrary divergences. Neurocomputing, 90:23 45, 2012. T. Tony Cai and Rong Ma. Theoretical foundations of t-SNE for visualizing high-dimensional clustered data. Journal of Machine Learning Research, 23(1):1 54, 2022. Miguel Carreira-Perpinan. The elastic embedding algorithm for dimensionality reduction. In International Conference on Machine Learning, volume 10, pp. 167 174. 2010. Sebastian Damrich and Fred A. Hamprecht. On UMAP s true loss function. In Advances in Neural Information Processing Systems, volume 34. Curran Associates, 2021. Sebastian Damrich, Niklas B ohm, Fred A. Hamprecht, and Dmitry Kobak. From t-SNE to UMAP with contrastive learning. In International Conference on Learning Representations, volume 11, 2023. URL https://openreview.net/forum?id=B8a1Fc Y0vi. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009. Wei Dong, Charikar Moses, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In International Conference on World Wide Web, volume 20. ACM, 2011. David Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. In Proceedings of the National Academy of Arts and Sciences, volume 100 (10), pp. 5591 5596. 2003. Andrew Draganov, Jacob Rødsand Jørgensen, Katrine Scheel Nelleman, Davide Mottin, Ira Assent, Tyrus Berry, and Aslay Cigdem. Act Up: Analyzing and consolidating t-SNE and UMAP. ar Xiv preprint ar Xiv:2305.07320, 2023. Michael Gutmann and Hyv arinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 12(2), 2012. Raia Hadsell, Sumit Chopra, and Yann Le Cun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE International Conference on Computer Vision, volume 2, pp. 1735 1742, 2006. Geoffrey E. Hinton and Sam Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2002. Published as a conference paper at ICLR 2024 Yanyong Huang, Kejun Guo, Xiuwen Yi, Jing Yu, Zongxin Shen, and Tianrui Li. T-copula and wasserstein distance-based stochastic neighbor embedding. Knowledge-based Systems, 243: 108431, 2022. J. J. Hull. A Database for Handwritten Text Recognition Research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550 554, 1994. Robert Jenssen. Kernel entropy component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):847 860, 2010. Robert Jenssen, Jose C. Principe, Deniz Erdogmus, and Torbjørn Eltoft. 
The Cauchy Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. Journal of the Franklin Institute, 343(6):624 629, 2006. Dmitry Kobak and Phillip Berens. The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10:5416, 2019. Dmitry Kobak and George C. Linderman. Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 39(2):156 157, 2021. Dmitry Kobak, George Linderman, Stefan Steinerberger, Yuval Kluger, and Phillip Berens. Heavytailed kernels reveal a finer cluster structure in t-SNE visualisations. In Machine Learning and Knowledge Discovery in Databases, pp. 124 139. Springer International Publishing, 2020. George C. Linderman and Stefan Steinerberger. Clustering with t-SNE, provably. Siam J. Math. Data Sci., 1:313 332, 2019. George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinberger, and Yuval Kluger. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods, 16:243 245, 2019. David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision. IEEE, 1999. Leland Mc Innes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. ar Xiv preprint ar Xiv:1802.03426, 2020. Karthik Naryan, Ali Punjani, and Pieter Abbeel. Alpha-beta divergences discover micro and macro structures in data. In International Conference on Machine Learning, volume 37. 2015. Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. Columbia object image library (coil-20). In Technical Report. 1996. Jose Principe. Information Theoretic Learning - Renyi s Entropy and Kernel Perspectives. Springer New York, NY, 2010. Sam Roweis and Lawrence K. Saul. Nonlinear dimension reduction by locally linear embedding. Science, 290(5500):2323 2326, 2000. Tim Sainburg, Leland Mc Innes, and Timothy Q. Gentner. Parametric UMAP embeddings for representation and semisupervised learning. Neural Computation, 33:2881 2907, 2021. Antonin Schrab, Ilmun Kim, Melisande Albert, Beatrice Laurent, Benjamin Guedj, and Arthur Gretton. MMD aggregated two-sample test. Journal of Machine Learning Research, 24, 2023. Jian Tang, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. Visualizing large-scale and highdimensional data. In International Conference on World Wide Web, volume 25. ACM, 2016. Joshua Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319 2323, 2000. Chris Thornton. Separability is a learner s best friend. In Neural Computation and Psychology Workshop, volume 4, pp. 40 46. 1998. Published as a conference paper at ICLR 2024 Daniel J. Trosten, Rwiddhi Chakraborty, Sigurd Løkse, Robert Jenssen, and Michael Kampffmeyer. Hubs and hyperspheres: Reducing hubness and improving transductive few-shot learning with hyperspherical embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7527 7536. IEEE, 2023. J. Derek Tucker, Matthew Martinez, and Jose Laborde. Dimensionality reduction using elastic measures. Stat, 12, 2023. Laurens van der Maaten. Learning a parametric embedding by preserving local structure. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, pp. 384 391. 2009. Laurens van der Maaten. Accelerating t-SNE using tree-based algorithms. 
Journal of Machine Learning Research, 15(1):3321 3245, 2014. Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008. Laurens van der Maaten and Killian Weinberger. Stochastic triplet embedding. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, 2012. Yingfan Wang, Haiyang Huang, Cynthia Rudin, and Yaron Shaposhnik. Understanding how dimension reduction tools work: an empirical approach to deciphering t-SNE, UMAP, Tri MAP, and Pac MAP for data visualization. Journal of Machine Learning Research, 22(1):9129 9201, 2021. Michael Wilber, Iljung Kwak, David Kriegman, and Serge Belongie. Learning concept embeddings with combined human-machine expertise. In Proceedings of the IEEE International Conference on Computer Vision, volume 2, 2015. Peter Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, volume 93, pp. 311 321. Association for Computing Machinery (ACM), 1993. Shujian Yu, Francesco Alesiani, Wenzhe Yin, Robert Jenssen, and Jose Principe. Principle of relevant information for graph sparsification. In 38th Conference on Uncertainty in Artificial Intelligence, volume 180, pp. 2331 2341. PMLR, 2022. Shujian Yu, Hongmin Li, Sigurd Løkse, Robert Jenssen, and Jose Principe. The conditional cauchyschwarz divergence with applications to time-series data and sequential decision making. ar Xiv preprint ar Xiv:2301.08970, 2023. APPENDIX A: FURTHER PERSPECTIVES ON T-SNE, UMAP, PACMAP, AND VARIANTS As mentioned in Section 1 of this paper, a large plethora of dimensionality reduction methods exists and an excellent repository for more information is e.g. https://jlmelville.github.io/ smallvis/. In this paper, following recent literature, the main algorithms are considered to be t-SNE (van der Maaten, 2014), and UMAP (Mc Innes et al., 2020), and we include the empirically motivated Pac Map (Wang et al., 2021). For context, Tri Map (Amid & Warmuth, 2019) and Large Vis (Tang et al., 2016) are discussed below. The t-SNE theory has been outlined in Section 2. In the paper introducing UMAP, Mc Innes et al. (2020) argues that t-SNE should be considered the current state-of-the-art at that time, and mentions computational scalability as a main benefit of UMAP versus t-SNE. The main aspect with respect to computational scalability for UMAP versus t-SNE is that the simplical set theory shows that for UMAP normalization over pairwise similarities (probabilities in t-SNE) is not needed, as opposed to t-SNE. This illustrates the importance of the sound theoretical foundation of UMAP. As further described in (Mc Innes et al., 2020), UMAP s simplical set cross-entropy cost function resembles in several ways the Large Vis (Tang et al., 2016) cost function. Large Vis also avoids normalization in the embedding space, albeit from a more heuristic point of view, but not in the input space where a procedure similar to the one used in Barnes-Hut t-SNE (van der Maaten, 2014) is used. Avoiding normalization in the embedding space Published as a conference paper at ICLR 2024 is key to the negative sampling strategy employed in Large Vis and which is key to its computational scalability, also an integral component in the UMAP optimization. Large Vis is not included in the experimental part of this paper to avoid clutter, but is an influential algorithm in the t-SNE family. Mc Innes et al. (2020) and Wang et al. 
(2021), for instance, both have extensive comparative experiment sections that also involve LargeVis. A motivation for Wang et al. (2021) is to discuss preservation of local structure versus global structure. They propose a heuristic method, PaCMAP, which is intended to strike a balance between TriMap (Amid & Warmuth, 2019) (better at preserving global structure) and t-SNE/UMAP (local structure). TriMap is a triplet loss-based method. Wang et al. (2021) argue that TriMap is the first successful triplet constraint method (as opposed to (Hadsell et al., 2006; van der Maaten & Weinberger, 2012; Wilber et al., 2015)) but claim that without PCA initialization TriMap's global structure is "ruined". PaCMAP is based on a study of the principles behind attractive and repulsive forces, finds that forces should be exerted on further points, and sets up a heuristically designed procedure for treating near pairs, mid-near pairs, and non-neighbors.

Understanding t-SNE versus UMAP, in particular from a theoretical perspective, has gained interest in recent years. Damrich & Hamprecht (2021) study the interplay between attractive and repulsive forces in UMAP in detail and come to the conclusion that UMAP is actually not exactly optimizing the cost function put forth in (McInnes et al., 2020). Böhm et al. (2022) study the whole attraction-repulsion spectrum and find cases where UMAP may diverge. Damrich et al. (2023) provide an entirely new viewpoint relating a fixed normalization constant in UMAP to a smaller learned normalization constant in t-SNE when considering negative sampling and noise contrastive learning. Draganov et al. (2023) argue that the normalization aspect is basically the key difference between t-SNE and UMAP, and suggest a way to toggle between the two approaches. Kobak & Linderman (2021) show that UMAP's initialization (Laplacian eigenmaps (Belkin & Niyogi, 2003)) is very important for UMAP's results and claim that t-SNE can be improved by a similar initialization. Wang et al. (2021) also study initialization, and claim that both UMAP and TriMap are very dependent on initialization. Several papers have revealed close connections between t-SNE and Laplacian eigenmaps (Carreira-Perpinan, 2010; Linderman & Steinerberger, 2019; Böhm et al., 2022). Some further comments on the relationship between t-SNE and Laplacian eigenmaps are provided in Appendix B. Other highly influential papers provide important insight (Kobak & Berens, 2019; Kobak et al., 2020).

In this paper, sampling is also demonstrated (Figure 8, and Appendix C), sharing some similarity to the sampling strategies invoked in e.g. LargeVis and UMAP. It should also be mentioned that dimensionality reduction methods inspired by the t-SNE approach but based on alternative divergences to the Kullback-Leibler over joint pairwise probabilities have been studied to some degree (Bunte et al., 2012; Naryan et al., 2015; Huang et al., 2022). However, these works have not discussed projective properties of divergence measures and have not contributed to understanding the aspect of normalization.
APPENDIX B: PROPOSITIONS WITH PROOFS AND COMMENTS For the benefit of the reader, the well-known Kullback-Leibler-based t-SNE relation from the main paper, which van der Maaten & Hinton (2008) builds on, is proved: arg min z1,...,zn Rd KL(P||Q) = arg min z1,...,zn Rd i,j pij log pij qij = arg min z1,...,zn Rd i,j pij log qij. (20) Published as a conference paper at ICLR 2024 arg min z1,...,zn Rd KL(P||Q) = arg min z1,...,zn Rd i,j pij log pij = arg min z1,...,zn Rd i,j pij log pij | {z } constant i,j pij log qij. (22) = arg min z1,...,zn Rd X i,j pij log qij. (23) Proposition 1. [Minimizing KL(P||Q) and the role of normalization]. Let qij = qij/Zq for some choice of qij where Zq = P n,m qnm. Then arg min z1,...,zn Rd KL(P||Q) = arg min z1,...,zn Rd X i,j pij log qij + log X n,m qnm. (24) arg min z1,...,zn Rd KL(P||Q) = arg min z1,...,zn Rd X i,j pij log qij (25) = arg min z1,...,zn Rd X i,j pij log qij P n,m qnm (26) = arg min z1,...,zn Rd i,j pij log qij X = arg min z1,...,zn Rd X i,j pij log qij + log X n,m qnm. (28) Comment to Proposition 1. Deriving the t-SNE cost function without first expressing qij6 in a particular form (t-distribution (Cauchy distribution) or Gaussian) shows a general property of the cost function. The t-SNE cost function is by the above result expressed by two terms where the first term relates the normalized pij to unnormalized qij. The second term only involves the unnormalized qij. From this expression, an interesting aspect of the t-SNE cost function is an intrinsic connection to Laplacian eigenmaps (Belkin & Niyogi, 2003). Discovery of connections between t-SNE and Laplacian eigenmaps can be traced back to Carreira-Perpinan (2010). More recently, B ohm et al. (2022) performed a closer examination of this relationship in terms of eigenvectors of normalized and unnormalized Laplacian matrices. Linderman & Steinerberger (2019) studied convergence of clustering with t-SNE from the viewpoint of early exaggeration (an often used optimization trick ) and identified a regime where t-SNE behaves like Laplacian-based spectral clustering. In Laplacian eigenmaps, the aim is to find a low-dimensional embedding z1, . . . , zn Rd from x1, . . . , xn RD. It is assumed that some similarity measure can be defined in the input space, 6In the original SNE paper (Hinton & Roweis, 2002) the Gaussian distribution was used qij = exp( κ||zi zj||2)/ P n,m exp( κ||zn zm||2). The argument in (van der Maaten & Hinton, 2008) was that the t-distribution helps mitigate the so-called crowding problem. It is possible to formulate the joint probabilities as functions of other distance functions than the Euclidean or in terms of similarity measures. Published as a conference paper at ICLR 2024 pairwise over xi and xj, which can be denoted wij. The Laplacian cost function is essentially P i,j wij||zi zj||2 to be minimized over z1, . . . , zn Rd, however given orthogonality constraints on z1, . . . , zn Rd to avoid trivial minima. Since for t-SNE, qij is a function of ||zi zj||2 for both the t-distribution (Cauchy distribution) and the Gaussian, there is a close link between t-SNE and Laplacian eigenmaps. For instance, for qij = [1 + ||zi zj||2] 1, the first term becomes P i,j wij log ||zi zj||2 and for the Gaussian the corresponding term becomes P i,j wij||zi zj||2 up to a proportionality constant (this was also pointed out in (Trosten et al., 2023)). By letting wij = pij the link becomes obvious. In t-SNE, there are no orthogonality constraints on z1, . . . , zn Rd. 
Instead, trivial solutions are avoided by the constraint posed by P Recently, Cai & Ma (2022) provided an elaborate spectral analysis concluding that t-SNE is connected to Laplacian eigenmaps. By the above analysis, and from e.g. (B ohm et al., 2022), this is evident from the form of the t-SNE cost function itself. Proposition 2. [Gradient of KL(P||Q)] zi KL(P||Q) = 4 X j (pij qij) qij(zj zi). (29) zi KL(P||Q) = zi X i,j pij log qij + log X The following derivation resembles (van der Maaten & Hinton, 2008). Note that if zi changes, the only pairwise distances that change are dij and dji where dij = ||zi zj||. Hence, the gradient of the cost function C with respect to zi is given by (zi zj) = 2 X C dij (zi zj). (31) Furthermore k,l pkl dij log qkl + dij log X n,m qnm (32) k,l pkl 1 qkl dij qkl + 1 P dij qnm. (33) The gradient is only non-zero for k = i, l = j and for n = i, m = j, yielding C dij = pij 1 qij dij qij + 1 P dij qij. (34) Note that for qij = h 1 1+||zi zj||2 i , which is the t-distribution, or alternatively for qij = exp( κ||zi zj||2), the Gaussian distribution, we have dij qij = 2 q2 ij. (35) pij qij + 1 P n,m qnm q2 ij = 2 ( pij + qij) qij, (36) Published as a conference paper at ICLR 2024 since qij = qij P n,m qnm . Finally, this yields zi C = zi KL(P||Q) = 4 X j (pij qij) qij(zj zi). (37) Proposition 3. [CS divergence is projective]. Let pj = pj/Zp, Zp = P j pj , and qj = qj/Zq, Zq = P j qj . Then CS(Pm||Qm) = CS( Pm|| Qm). (38) CS(Pm||Qm) = log pj Zp qj Zq P = CS( Pm|| Qm). (42) Proposition 4. [Minimizing CS(Pm||Qm) with respect to z1, . . . , zn Rd] arg min z1,...,zn Rd CS( Pm|| Qm) = arg min z1,...,zn Rd log X j pj qj + 1 j q2 j . (43) arg min z1,...,zn Rd CS(Pm||Qm) (44) = arg min z1,...,zn Rd log | {z } independent of z = arg min z1,...,zn Rd log X j pj qj + 1 j q2 j . (46) Published as a conference paper at ICLR 2024 Proposition 5. [Gradient of CS(Pm||Qm)] zi CS(Pm||Qm) = 4 X j pj qj qj P q2 ij(zj zi). (48) arg min z1,...,zn Rd CS(Pm||Qm) = arg min z1,...,zn Rd log X j pj qj + 1 j q2 j (49) There are several ways to proceed. Here, it is chosen to start by expressing CS(Pm||Qm) explicitly into cross-product terms pjk qjk. For convenience, the derivation is split into two parts. j pj qj = 1 P j pj qj = 1 P k ,k pjk qjk. (51) Look first at the case j = i. Then zi P k ,k pjk qjk will have non-zero terms for k = i, hence k ,k pjk qjk = X zi qji = pj zi qji. (52) k ,k pik qik = X ! zi qik = X k pk zi qik. (53) k ,k pjk qjk = X j,j =i 2 pj zi qij. (54) Note that for qij = h 1 1+||zi zj||2 i , which is the t-distribution, or alternatively for qij = exp( κ||zi zj||2), the Gaussian distribution, we have zi qij = 2 q2 ij(zj zi) (55) j pj qj = 4 1 P j,j =i pj q2 ij(zj zi). (56) Alternatively, j pj qj = 1 P i pi qi = 1 P j pj zi qj (57) and then work with qj = P k qjk . For the second part, consider k ,k qjk qjk. (58) Published as a conference paper at ICLR 2024 and work in a similar fashion as above from there, or express j 2 qj zi qj (59) and insert qj = P k qjk . This gives j q2 j = 4 1 P j,j =i qj q2 ij(zj zi). (60) Hence, when taken together: zi CS(Pm||Qm) = 4 X j pj qj qj P q2 ij(zj zi). (61) Proposition 6. [The map CS divergence is projective]. Let z = m(x) + ε and assume p(x) = p(x)/Zp and q(z) = q(z)/Zq where p(x) and q(z) are unnormalized with Zp = R p(x)dx and Zq = R q(z)dz as the respective normalization constants. Then CS(p(x)||q(z)) = CS( p(x)|| q(z)). 
(62) CS(p(x)||q(z)) (63) Zq f(x, z)dxdz R R p(x) 2 f(x, z)dxdz 1 2 f(x, z)dxdz 1 1 Zp Zq R R p(x) q(z)f(x, z)dxdz 1 Z2p R R p2(x)f(x, z)dxdz 1 2 1 Z2q R R q2(z)f(x, z)dxdz 1 R R p(x) q(z)f(x, z)dxdz R R p2(x)f(x, z)dxdz 1 2 R R q2(z)f(x, z)dxdz 1 = CS( p(x)|| q(z)). (67) Proposition 7. [Empirical map CS divergence]. Let a sample x1, . . . , xn RD be given and assume the mapping z = m(x) + ε. Then, an empirical map CS divergence is given by d CS(p(x)||q(z)) = log j p(xj) q(zj) Proof. We have CS(p(x)||q(z)) = CS( p(x)|| q(z)). Thus CS( p(x)|| q(z)) = log R R p(x) q(z)f(x, z)dxdz R R p2(x)f(x, z)dxdz 1 2 R R q2(z)f(x, z)dxdz 1 Published as a conference paper at ICLR 2024 It suffices to look at the numerator. Since R R p(x) q(z)f(x, z)dxdz = EX Z [ p(x) q(z)] and we have \ EX Z [ p(x) q(z)] = X j p(xj) q(zj), (70) d CS(p(x)||q(z)) = log j p(xj) q(zj) Proposition 8. [Minimizing d CS(p(x)||q(z)) with respect to z1, . . . , zn Rd] arg min z1,...,zn Rd d CS(p(x)||q(z)) = arg min z1,...,zn Rd log X j p(xj) q(zj) + 1 j q2(zj). (72) arg min z1,...,zn Rd CS(p(x)||q(z)) (73) = arg min z1,...,zn Rd log j p(xj) q(zj) | {z } independent of z = arg min z1,...,zn Rd log X j p(x) q(zj) + 1 j q2(zj). (75) Proposition 9. [Gradient of d CS(p(x)||q(z))] zi d CS(p(x)||q(z)) = X j p(xj ) q(zj ) q(zj) P zi q(zj). (77) arg min z1,...,zn Rd CS(p(x)||q(z)) = arg min z1,...,zn Rd log X j p(xj) q(zj) + 1 j q2(zj). (78) The derivation is split into two parts. First, j p(xj) q(zj) = 1 P j p(xj ) q(zj ) zi q(zj). (79) j q2(zj) = 1 zi q2(zj) = 1 zi q(zj). (80) Published as a conference paper at ICLR 2024 Taken together, thus zi d CS(p(x)||q(z)) = X j p(xj ) q(zj ) q(zj) P zi q(zj). (81) Proposition 10. [Gradient of d CS(p(x)||q(z)) with kernel smoothing]. Let ˆ q(zj) = P k κz(zj zk) and ˆ p(xj) = P k κp(xj xk) for shift-invariant kernel functions κz( ) and κp( ). Then zi d CS(p(x)||q(z)) = 4 X j pj qj qj P κ2 z(zj zi)(zj zi). (82) Proof. A shift-invariant kernel function satisfies κ(zj zi) = κ(dij) where dij = ||zj zi||2. Hence, similar to the derivation for Proposition 2, we will have zi d CS(p(x)||q(z)) = X j p(xj ) q(zj ) q(zj) P zi q(zj) (83) j p(xj ) q(zj ) q(zj) P dij + q(zj) (zj zi) (84) j p(xj ) q(zj ) q(zj) P dij (zj zi). (85) Note that q(zj) dij = κz(zj zi) dij = 2κ2(zj zi)) for κz(zj zi)) = h 1 1+||zi zj||2 i , which is the t-distribution, or alternatively for κz(zj zi)) = exp( κ||zi zj||2), the Gaussian distribution. Taken together, zi d CS(p(x)||q(z)) = 4 X j pj qj qj P κ2 z(zj zi)(zj zi). (86) Proposition 11. [Cauchy-Schwarz (CS) t-SNE is a special case of MAP IT]. Let pjk be the probability for the joint event xj xk . Let qjk be the probability for the joint event zj zk. If (xj xk ) (zj zk) , then CS(Pm||Qm) = CS(P||Q). (87) Proof. We have CS(Pm||Qm) = log k ,k pjk qjk k ,k pjk pjk k ,k qjk qjk Published as a conference paper at ICLR 2024 Figure 9: Illustration of CS t-SNE as a special case of MAP IT. It suffices to look at the numerator since the terms in the denominator are related normalization quantities. If (xj xk ) (zj zk) then Prob((xj xk ) (zj zk)) = pjk qjk = 0 (assuming independence) for k = k. Hence X k ,k pjk qjk = X j,i pjiqji (90) for k = k = i. Comment to Proposition 11. The illustration in Fig. 9 brings further perspective to this result. The black filled circle denotes node j. Nodes within the stapled circle are assumed to be relatively near and nodes n outside this circle are assumed to be distant in the sense that pjn is neglible for each such node n. 
For MAP IT, $\tilde p_{jk'}$ is also multiplied by the probabilities $\tilde q_{jk}$ and $\tilde q_{jl}$, in addition to $\tilde q_{jk'}$, for nodes $k$ and $l$ close to $k'$. This is not the case for CS t-SNE. This shows that MAP IT is able to capture information in wider local neighborhoods compared to CS t-SNE, which only captures local information via pairwise affinities, a property it shares with its KL counterpart t-SNE and with methods such as Laplacian eigenmaps (Belkin & Niyogi, 2003). In practice, this means that MAP IT models that the event $(x_j \sim x_{k'})$ could induce the event $(z_j \sim z_k)$ if $k$ and $k'$ are close.

APPENDIX C: ADDITIONAL RESULTS AND ANALYSIS

Defining probabilities in the input space. The projective property of the CS divergence has been shown to avoid the need for direct normalization of probabilities if the probabilities come in the form $q_j = \tilde q_j/Z_q$, where $Z_q$ is as explained in Section 3.2 and where one has access to $\tilde q_j$, and similarly for $p_j$. However, for optimization purposes, the normalization $Z_p = \sum_j \tilde p_j$ is not as critical since this is a procedure done only once. In any case, it has recently been shown that dimensionality reduction methods are quite robust to the choice of input space pairwise similarity function, and even binary affinities seem to give good results for many algorithms (Böhm et al., 2022). Note that it is easy to show that even if the input space pairwise similarities do not come in the form assumed in Eq. (1), that is, $p_{ij} = \tilde p_{ij}/Z_p$, $Z_p = \sum_{n,m}\tilde p_{nm}$, but the target space similarities do, then $\mathrm{CS}(P_m\|Q_m) = \mathrm{CS}(P_m\|\tilde Q_m)$ (Proposition 3).

In order for MAP IT to use exactly the same input similarities as t-SNE commonly does, so-called symmetrized similarities are used here in the input space. These are given by $p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}$ with $p_{j|i} = \frac{\exp(-\kappa_i \|x_i - x_j\|^2)}{\sum_{n \neq i} \exp(-\kappa_i \|x_i - x_n\|^2)}$ and likewise for $p_{i|j}$. Note that the parameter $\kappa_i$ is chosen to yield a pre-specified value of the so-called perplexity of the probability distribution (see e.g. (Böhm et al., 2022)), a function of its entropy.
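The construction just described can be sketched as follows (a simplified illustration of the symmetrized, perplexity-calibrated affinities; the bisection tolerance, iteration cap and natural-log entropy are assumptions rather than choices stated in the paper).

```python
import numpy as np

def input_affinities(X, perplexity=15.0, tol=1e-5, max_iter=50):
    """Symmetrized affinities p_ij = (p_{i|j} + p_{j|i}) / (2n), with per-point kappa_i
    chosen by bisection so that each conditional distribution matches the target perplexity."""
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    target_entropy = np.log(perplexity)           # perplexity = exp(entropy), natural log here
    P_cond = np.zeros((n, n))
    for i in range(n):
        lo, hi, kappa = 0.0, np.inf, 1.0
        di = np.delete(D2[i], i)
        for _ in range(max_iter):
            w = np.exp(-kappa * di)
            p = w / w.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:          # too flat: increase kappa
                lo, kappa = kappa, kappa * 2 if hi == np.inf else (kappa + hi) / 2
            else:                                 # too peaked: decrease kappa
                hi, kappa = kappa, (lo + kappa) / 2
        P_cond[i, np.arange(n) != i] = p
    return (P_cond + P_cond.T) / (2 * n)          # symmetrized joint affinities p_ij
```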
MNIST. MNIST⁷ is a data set of 28 × 28 pixel grayscale images of handwritten digits. There are 10 digit classes (0 through 9) and a total of 70000 images. Here, 2000 images are randomly sampled. Each image is represented by a 784-dimensional vector. Figure 1 and Figure 3 in the main paper show that MAP IT produces an MNIST visualization with much better separation between classes compared to the alternatives and that the embedding is robust with respect to initial conditions.

MAP IT's free parameter is the number of nearest neighbors k that enter the computation of $\tilde p_{j|N_{x_i}}$ and $\tilde q_{j|N_{x_i}}$. Figure 10 shows representative embedding results for the subset of MNIST for different values of k. As in all dimensionality reduction methods, the visualization results depend on k. For MNIST, the class structure appears for k = 7 and k = 10, and it is relatively stable also for k = 12. For most data sets, a value of k between 5 and 15 seems to yield reasonable results. However, the impact of this hyperparameter should be further studied in future work.

⁷ MNIST, Newsgroups, and the Frey faces are obtained from http://cs.nyu.edu/~roweis/data.html.

Figure 10: MAP IT for a subset of MNIST for different values of k. (c) k = 10. (d) k = 12.

Quantitative analysis. For dimensionality reduction methods, various ways of quantifying neighborhood structure may provide insight. For instance, a relatively low knn recall may indicate better capturing of cluster structure at the potential expense of capturing manifold structure, as studied by (Böhm et al., 2022). For the embeddings shown in Figure 1, the fraction of the k nearest neighbors in the input space that remain among the k nearest neighbors in the target space ("knn recall") is computed and shown in Table 1 for several values of k. For low values of k, PaCMAP and MAP IT have lower recall values, indicating that these methods better capture cluster structure and may sacrifice some ability to capture manifold structure. For k approximately equal to 60 and above, the knn recall settles at similar values (slowly diminishing for larger k, not shown here).

Table 1: Illustration of knn recall.

          knn   t-SNE   UMAP   PaCMAP   MAP IT
  MNIST   15    0.48    0.42   0.37     0.35
          30    0.49    0.47   0.43     0.41
          45    0.50    0.49   0.47     0.45
          60    0.51    0.50   0.49     0.48
  Frey    15    0.09    0.08   0.08     0.09
          30    0.11    0.11   0.10     0.10
          45    0.13    0.13   0.12     0.13
          60    0.15    0.16   0.15     0.15

Figure 11: Class-wise knn recall.

Figure 11 shows an alternative way to quantify this effect. Here, the class-wise knn recall in the target space is shown. For each data point, the fraction of nearest neighbors in the low-dimensional space sharing the class label of the point in question (Thornton, 1998) is plotted as a curve for various values of k. Here, the known labels for MNIST are used. It can be observed that as k increases, t-SNE and UMAP have lower class-wise knn recall compared to PaCMAP and MAP IT. This makes intuitive sense when compared to the visualizations provided in Figure 1.
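Both recall measures can be written compactly as follows (a sketch; it uses scikit-learn's NearestNeighbors for the neighbor queries, which is an assumption and not necessarily how Table 1 and Figure 11 were produced).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_indices(A, k):
    # k nearest neighbors of every point, excluding the point itself
    return NearestNeighbors(n_neighbors=k + 1).fit(A).kneighbors(A, return_distance=False)[:, 1:]

def knn_recall(X, Z, k):
    """Fraction of each point's k input-space neighbors preserved in the target space (Table 1)."""
    nx, nz = knn_indices(X, k), knn_indices(Z, k)
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(nx, nz)])

def classwise_knn_recall(Z, labels, k):
    """Fraction of each point's k target-space neighbors sharing its class label (Figure 11)."""
    nz = knn_indices(Z, k)
    return np.mean(labels[nz] == labels[:, None])
```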
The effect that downsampling of the attractive/repulsive forces for non-neighbors seems to have, requiring an inversely proportional multiplication factor as shown above, may be related to the observations made in (Böhm et al., 2022) on the effects of negative sampling in UMAP. There, it was shown that negative sampling increased attraction in UMAP, and it was argued that without the negative sampling, UMAP may provide embeddings with less cluster structure. Obviously, for MAP IT, there are many aspects that warrant future analysis along these lines. The example provided here is meant to show that in this controlled setting, sampling of MAP IT's non-neighbor attractive/repulsive forces (Figure 8) could reproduce the original embedding (Figure 1 (d)). Please see the end of this Appendix for some further comments on implementation.

Figure 12: Each subfigure (a)-(c) shows the visualization/embedding for a different subsampling scenario (50, 25, and 12.5 percent downsampling, respectively) and for a factor of 1, 2, 4, and 8 on the non-neighbor attractive/repulsive forces in the MAP IT calculation.

Figure 13: Illustration of MAP IT learning rates and number of iterations for a subset of USPS.

Learning rates (USPS). In all experiments the MAP IT learning rate has been set to 50 over 1000 iterations. Of course, changing these choices will to some degree change the embedding. These choices have, however, been observed to result in quite stable MAP IT results over a range of diverse data sets. Figure 13 shows the MAP IT cost as a function of iterations for different learning rates over a subset of the USPS data set (Hull, 1994), where a random subset of the digits 3, 6, and 9 constitutes the classes. For each learning rate η of 25, 50, and 100, three runs of MAP IT are performed, and in each case the curve of cost versus iterations is shown. For this particular data set, low cost function values are obtained more quickly for η = 100 (leftmost group of curves) compared to η = 50 (middle group of curves) and η = 25 (rightmost group of curves). When approaching 1000 iterations, all curves have settled at low cost function values. Further studies of the interplay between learning rate, iterations, and various design choices for the MAP IT optimization are left for future work.

Coil-20. This data set (Nene et al., 1996) consists of 1440 greyscale images of 20 objects under 72 different rotations spanning 360 degrees. Each image is 128 × 128 pixels and is treated as a single 16384-dimensional vector for the purpose of computing distances between images. Visualizations of Coil-20 were shown in the main paper in Figure 4. Enlarged visualizations of Coil-20 are shown in Figures 15, 16, 17 and 18.

Visual Concepts. Images corresponding to three different visual concepts are visualized. SIFT (Lowe, 1999) descriptors, represented by a 1000-dimensional codebook for each visual concept, are downloaded from the ImageNet database (image-net.org) (Deng et al., 2009). The visual concepts used are strawberry, lemon, and australian terrier. The concepts are represented by 1478, 1292 and 1079 images, respectively. The images within each category differ very much, as can be seen e.g.
for australian terrier at image-net.org/synset?wnid=n02096294. A crude approach is taken here: each image is represented by the overall frequency of the codewords present among the SIFT descriptors contained in the image. Hence, each image is represented as a 1000-dimensional vector. The local modeling strength of the SIFT descriptors is lost this way, and one cannot expect the resulting data set to contain very discriminative features between the concepts. Visualizations of the visual concepts were shown in the main paper in Figure 5. Enlarged visualizations of the visual concepts are shown in Figures 19, 20, 21 and 22.

Newsgroups. Visualizations of words from Newsgroups were shown in the main paper in Figure 6. Enlarged visualizations of Newsgroups as word clouds are shown in Figures 23, 24, 25 and 26.

Frey faces. Visualizations of the Frey faces were shown in the main paper in Figure 7. Enlarged visualizations of the Frey faces are shown in Figures 27, 28, 29 and 30.

Further comments on implementation and potential scaling. Previously, sampling of the attractive/repulsive forces associated with non-neighbors was illustrated on the recurring MNIST example, and it was shown that including only a few non-neighbors in the computations can reproduce the original embedding, given an appropriate multiplication factor. In the current implementation, gradients are computed directly by a matrix operation $(D - M)X$, where $M$ encodes attractive and repulsive forces according to Eq. (19), $D$ is the diagonal degree matrix of $M$, and $X$ stores all data points as rows (a schematic sketch of such a step is given below). Neighborhood structure is encoded into $M$. For the sampling procedure to truly scale up MAP IT, more efficient knn graph implementations are needed to avoid the memory issues associated with the current very basic approach. This can be done by building on efficient implementations for t-SNE and UMAP, e.g. (Böhm et al., 2022). In the t-SNE and related literature, a problematic issue is the normalization factor $Z_q = \sum_{n,m} \tilde{q}_{nm}$, which needs to be recomputed over the course of the optimization. For the CS divergence underlying MAP IT, the need for direct normalization is avoided. However, seemingly similar terms ($\sum_j p_j q_j$ and $\sum_j q_j^2$) appear in the weighting of the attractive and repulsive forces, Eq. (19). The impact of these terms needs to be studied further. As a preliminary experiment, these terms are computed using only points in the $\mathcal{N}_{x_j}$-neighborhoods, both in the input space and the target space, for MNIST, and the embedding is shown in Figure 14. It can be seen that the main structure of the embedding is preserved.

Figure 14: Preliminary experiment using the neighborhood structure in the input space when computing entropy weights for attractive and repulsive forces for MAP IT.

Further comments on parameter choices. Note that (Belkina et al., 2019) provided advice for tailoring the early exaggeration and the overall number of gradient descent iterations for t-SNE in a dataset-specific manner, and argued that this may be especially important for large-scale data sets such as cytometry and transcriptomics data sets. The data sets used in this study are more traditional (MNIST etc.), and the available software used in this paper for t-SNE, UMAP, and Pac Map is already well tested for such data sets. However, further parameter tuning as discussed in (Belkina et al., 2019) could potentially improve results for t-SNE. Note that MAP IT uses no early exaggeration, and there is much room for investigating the effect of, for instance, step size and the number of iterations, which may affect results also for MAP IT.
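Returning to the implementation comment above on computing gradients as $(D - M)X$, the following is a minimal, hedged sketch of one such descent step. It is not the reference implementation: the construction of $M$ follows Eq. (19) in the main paper and is assumed to be precomputed here, the function name is hypothetical, and the matrix is assumed to act on the coordinates being optimized, stored as rows.

```python
# Hedged sketch of one gradient-descent step using the matrix form (D - M) X.
import numpy as np

def mapit_matrix_step(X, M, learning_rate=50.0):
    """X: (n, d) coordinates being optimized, one point per row (assumption).
    M: (n, n) matrix of attractive/repulsive force weights per Eq. (19)."""
    D = np.diag(M.sum(axis=1))          # diagonal degree matrix of M
    gradient = (D - M) @ X              # aggregated attractive/repulsive forces
    return X - learning_rate * gradient
```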
Perspectives on MAP IT via local neighborhoods. Section 3.4 discussed the very important role that local neighborhoods play for MAP IT, revealed via differentiation of the MAP IT cost function. The following main points were observed. Differentiation must be done with respect to the input space neighborhoods for all quantities; for instance, neighborhood structure for a target space point $z_i$ is given via $z_i = m(x_i)$, where $m$ is the (unknown) mapping function. The practical consequence of this is stated in Eq. (19), which is expressed as a sum over all points $j$. Here, the attractive force for the gradient vector $z_j - z_i$ depends on $p_{j|\mathcal{N}_{x_i}}$ and the repulsive force depends on $q_{j|\mathcal{N}_{x_i}}$, where $\mathcal{N}_{x_i}$ denotes a neighborhood around $x_i$. It was further argued that points $x_j$ that are neighbors of the point $x_i$ would already be in the neighborhood of $x_i$, such that for those nearest neighbors of $x_i$ the MAP IT update rule, Eq. (19), uses $p_{j|\mathcal{N}_{x_j}}$ for the attractive force and $q_{j|\mathcal{N}_{x_j}}$ for the repulsive force. For points $x_j$ that are not neighbors of $x_i$, it was argued that $p_{j|\mathcal{N}_{x_i}}$ simplifies to $p_{ij}$ for the attractive force and likewise $q_{j|\mathcal{N}_{x_i}}$ simplifies to $q_{ij}$ for the repulsive force. It is important to note that $p_{j|\mathcal{N}_{x_j}} = \sum_{k=1}^{k_{nn}} p_{jk}$; it is here emphasized that $k_{nn}$ denotes the $k$'th nearest neighbor of $x_j$ (see Section 3.2). If the $p_{jk}$'s are interpreted as weights on the edges from node $j$ to the nodes $k = 1, \ldots, k_{nn}$, then it becomes clear that $p_{j|\mathcal{N}_{x_j}}$ captures local properties in a region around $x_j$.

All of these considerations stem from the MAP IT cost function turned into an update rule for $z_i$ via differentiation. However, this analysis reveals a new perspective on MAP IT. If, instead of aligning the distributions over the marginal probabilities $p_j$ and $q_j$ for all $j$ by the CS divergence, we focus on the contribution to the marginal probabilities from the local region around the points $x_j$ and $z_j$, respectively, i.e. $p_{j|\mathcal{N}_{x_j}}$ and $q_{j|\mathcal{N}_{x_j}}$, via

$$\arg\min_{z_1,\ldots,z_n \in \mathbb{R}^d} \; -\log \sum_j p_{j|\mathcal{N}_{x_j}}\, q_{j|\mathcal{N}_{x_j}} + \frac{1}{2}\log \sum_j q_{j|\mathcal{N}_{x_j}}^2, \qquad (91)$$

then Eq. (19) would be obtained by the same arguments as in Section 3.4 and above. This shows that MAP IT is effectively aligning the degree of a node in the input space, locally in a neighborhood, with the degree of the corresponding node locally in the target space, where the neighborhoods are defined with respect to the input space in both cases. To some degree, this perspective resembles the local neighborhood perspective in Locally Linear Embedding (LLE) (Roweis & Saul, 2000). There too, local neighborhoods are reconstructed in the target space versus the input space, via an eigenvector operation. LLE and many other spectral methods (as mentioned in the Introduction of this paper) have inspired much of the research in visualization and neighbor embedding methods, but as stated in (Wang et al., 2021) in their section "Local structure preservation methods before t-SNE" (page 6), the field mainly moved away from these types of methods. In the future, it would be interesting to investigate closer links and differences between the fundamental LLE idea and MAP IT.
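To make the local quantities concrete, the following is a small illustrative sketch (assumed, not the paper's code; function names are hypothetical) that computes the local degrees $p_{j|\mathcal{N}_{x_j}}$ and $q_{j|\mathcal{N}_{x_j}}$ from precomputed pairwise affinity matrices and evaluates the cost in Eq. (91) as reconstructed above; the use of scikit-learn for the input-space knn graph is an assumption.

```python
# Schematic sketch of the local node degrees and the Eq. (91) cost.
# P, Q: precomputed pairwise affinity matrices in input and target space.
# Neighborhoods are the k nearest neighbors in the input space, as in the text.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_degrees(W, nbr_idx):
    """Sum the affinities from each node j to its input-space neighbors."""
    return np.array([W[j, nbr_idx[j]].sum() for j in range(W.shape[0])])

def eq91_cost(P, Q, X, k=10):
    # Input-space k-nearest-neighbor indices (excluding the point itself).
    nbr_idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(
        X, return_distance=False)[:, 1:]
    p_loc = local_degrees(P, nbr_idx)   # p_{j|N_xj}: differs from node to node
    q_loc = local_degrees(Q, nbr_idx)   # q_{j|N_xj}: neighborhoods from input space
    # Cauchy-Schwarz-style alignment of local degrees, cf. Eq. (91).
    return -np.log(np.sum(p_loc * q_loc)) + 0.5 * np.log(np.sum(q_loc ** 2))
```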
This is an important aspect to point out for the following reason. When the input space similarities are based on symmetrized $p_{ij}$, which is done in the experimental part of this paper in order to use similarities comparable to what t-SNE usually processes, an apparent paradox appears8. In that case, the form of the similarities is such that the $p_j$'s become uniform, and it would appear as if MAP IT is data independent, always trying to make the $q_j$'s uniform no matter what the input is. However, the above analysis highlights that via differentiation, MAP IT effectively focuses on the contribution to the marginal probabilities from local regions, which will in general differ for the different nodes.

8 As kindly pointed out by one of the reviewers of this paper.

Map KL Divergence? An interesting question for future research is whether the introduction of a map KL divergence, similar to the map CS divergence (Definition 4), would provide a KL-based MAP IT framework via kernel smoothing. Since the KL divergence is not projective, direct normalization of probabilities would not be avoided, but a similar-in-spirit alignment of marginal probabilities seems possible.

Concluding remarks. Together with the experiments and analysis in the main paper, these additional MAP IT results and analyses illustrate the potential of this new method to provide visualizations which in many cases are markedly different from the current state-of-the-art alternatives, with better class discrimination and reasonable embeddings overall. The underlying theoretical approach is fundamentally different and highlights both a viewpoint based on the alignment of marginal probabilities and a dual viewpoint via continuous densities enabled by kernel smoothing. The role of normalization for divergences used in dimensionality reduction follows directly from the MAP IT theory.

Figure 15: t-SNE embedding of Coil-20.
Figure 16: UMAP embedding of Coil-20.
Figure 17: Pac Map embedding of Coil-20.
Figure 18: MAP IT embedding of Coil-20.
Figure 19: t-SNE embedding of visual concepts.
Figure 20: UMAP embedding of visual concepts.
Figure 21: Pac Map embedding of visual concepts.
Figure 22: MAP IT embedding of visual concepts.
Figure 23: t-SNE embedding of words from Newsgroups.
Figure 24: UMAP embedding of words from Newsgroups.
Figure 25: Pac Map embedding of words from Newsgroups.
Figure 26: MAP IT embedding of words from Newsgroups.
Figure 27: t-SNE embedding of Frey faces.
Figure 28: UMAP embedding of Frey faces.
Figure 29: Pac Map embedding of Frey faces.
Figure 30: MAP IT embedding of Frey faces.