# Multidimensional Scaling: Approximation and Complexity

Erik Demaine¹ Adam Hesterberg² Frederic Koehler³ Jayson Lynch⁴ John Urschel³

Abstract. Metric multidimensional scaling (MDS) is a classical method for generating meaningful (nonlinear) low-dimensional embeddings of high-dimensional data. MDS has a long history in the statistics, machine learning, and graph drawing communities. In particular, the Kamada-Kawai force-directed graph drawing method is equivalent to MDS and is one of the most popular ways in practice to embed graphs into low dimensions. Despite its ubiquity, our theoretical understanding of MDS remains limited, as its objective function is highly non-convex. In this paper, we prove that minimizing the Kamada-Kawai objective is NP-hard and give a provable approximation algorithm for optimizing it, which in particular is a PTAS on low-diameter graphs. We supplement this result with experiments suggesting possible connections between our greedy approximation algorithm and gradient-based methods.

## 1. Introduction

Given the distances between data points living in a high-dimensional space, how can we meaningfully visualize their relationships? This is a fundamental task in exploratory data analysis for which a variety of different approaches have been proposed. Many of these approaches seek to visualize high-dimensional data by embedding it into a lower-dimensional, e.g. two- or three-dimensional, space. Metric multidimensional scaling (MDS or mMDS) (Kruskal, 1964a; 1978) is a classical approach to this problem which attempts to find a low-dimensional embedding that accurately represents the distances between points. Originally motivated by applications in psychometrics, MDS has now

¹Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA. ²John A.
Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. ³Department of Mathematics, MIT, Cambridge, MA, USA. ⁴Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada. Correspondence to: Frederic Koehler.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

been recognized as a fundamental tool for data analysis across a broad range of disciplines. See the texts (Kruskal, 1978; Borg & Groenen, 2005) for more details, including a discussion of applications to data from scientific, economic, political, and other domains. Compared to other classical visualization tools like PCA¹, metric multidimensional scaling has the advantage that it 1) is not restricted to linear projections of the data, i.e. it is nonlinear, and 2) is applicable to data from an arbitrary metric space, rather than just Euclidean space. Because of this versatility, MDS has also become one of the most popular algorithms in the field of graph drawing, where the goal is to visualize relationships between nodes (e.g. people in a social network). In this context, MDS was independently proposed by Kamada and Kawai (Kamada et al., 1989) as a force-directed graph drawing method.

In this paper, we consider the algorithmic problem of computing the optimal embedding under the MDS/Kamada-Kawai objective. The Kamada-Kawai objective is to minimize the following energy/stress functional $E : \mathbb{R}^{rn} \to \mathbb{R}$:

$$E(x_1, \ldots, x_n) = \sum_{i < j} \left( \frac{\|x_i - x_j\|}{d(i,j)} - 1 \right)^2, \qquad (1)$$

which corresponds to the physical situation where $x_1, \ldots, x_n \in \mathbb{R}^r$ are particles and, for each $i \ne j$, particles $x_i$ and $x_j$ are connected by an idealized spring with equilibrium length $d(i, j)$ following Hooke's law with spring constant $k_{ij} = 1/d(i,j)^2$. In applications to visualization, the choice of dimension is often small, i.e. $r = 1, 2, 3$.
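The stress functional (1) is straightforward to evaluate given the matrix of pairwise distances. A minimal sketch (the function name `kk_stress` and the use of NumPy are ours, not from the paper):

```python
import numpy as np

def kk_stress(X, D):
    """Kamada-Kawai stress (1): sum over pairs i < j of
    (||x_i - x_j|| / d(i, j) - 1)^2."""
    n = len(D)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (np.linalg.norm(X[i] - X[j]) / D[i][j] - 1.0) ** 2
    return total

# A perfect 1-D embedding of a 3-vertex path has zero stress.
D = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], dtype=float)
X = np.array([[0.0], [1.0], [2.0]])
print(kk_stress(X, D))  # -> 0.0
```

Here the rows of `X` are the positions $x_i \in \mathbb{R}^r$ and `D` is the distance matrix $d(i, j)$.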
We also note that in (1) the terms in the sum are sometimes re-weighted with vertex or edge weights, which we discuss in more detail later. In practice, the MDS/Kamada-Kawai objective (1) is optimized via a heuristic procedure like gradient descent (Kruskal, 1964b; Zheng et al., 2018) or stress majorization (De Leeuw et al., 1977; Gansner et al., 2004). Because the objective is non-convex, these algorithms may not reach the global minimum, but instead may terminate at approximate critical points of the objective function. Heuristics such as restarting an algorithm from different initializations and using modified step size schedules have been proposed to improve the quality of results. In practice, these heuristic methods do seem to work well for the Kamada-Kawai objective and are implemented in popular packages like GRAPHVIZ (Ellson et al., 2001) and the SMACOF package in R.

¹In the literature, PCA is sometimes referred to as classical multidimensional scaling, in contrast to metric multidimensional scaling, which we study in this work.

## 1.1. Our Results

In this work, we revisit this problem from an approximation algorithms perspective. First, we resolve the computational complexity of minimizing (1) by proving that finding the global minimum is NP-hard, even for graph metrics (where the metric is the shortest-path distance on a graph). Consider the decision version of stress minimization over graph metrics, which we formally define below:

STRESS MINIMIZATION
Input: Graph $G = ([n], E)$, $r \in \mathbb{N}$, $L \ge 0$.
Output: TRUE if there exists $x = (x_1, \ldots, x_n) \in \mathbb{R}^{nr}$ such that $E(x) \le L$; FALSE otherwise.

Theorem 1. There exists a polynomial $p(n)$ such that the following gap version of STRESS MINIMIZATION in dimension $r = 1$ is NP-hard: given an input graph $G$ with $n$ vertices and $L > 0$, return TRUE if there exists $x$ such that $E(x) \le L$ and return FALSE if for every $x$, $E(x) \ge L + 1/p(n)$.
Furthermore, the problem is hard even when restricted to input graphs with diameter bounded by an absolute constant. As a gap problem, the output is allowed to be arbitrary if neither case holds; the hardness of the gap formulation shows that, assuming P ≠ NP, there cannot exist a Fully-Polynomial Randomized Approximation Scheme (FPRAS) for this problem, i.e. the runtime cannot be polynomial in the desired approximation guarantee. Our reduction shows this problem is hard even when the input graph has low diameter (even bounded by an absolute constant): this is a natural setting to consider since many real-world graphs (for example, social networks (Dodds et al., 2003)) and random graph models (Watts & Strogatz, 1998) indeed have low diameter due to the small-world phenomenon. Other key aspects of this hardness proof are: 1) we show the problem is hard even when the input $d$ is a graph metric, and 2) we show it is hard even in its canonical unweighted formulation (1).

Given that computing the minimizer is NP-hard, a natural question is whether there exist polynomial-time approximation algorithms for minimizing (1). We show that if the input graph has bounded diameter $D = O(1)$, then there indeed exists a Polynomial-Time Approximation Scheme (PTAS) to minimize (1), i.e. for fixed $\epsilon > 0$ and fixed $D$ there exists an algorithm to approximate the global minimum of an $n$-vertex diameter-$D$ graph up to multiplicative error $(1 + \epsilon)$ in time $f(\epsilon, D) \cdot \mathrm{poly}(n)$. More generally, we show:

Theorem 2 (Informal version of Theorem 4). Let $R > \epsilon > 0$ be arbitrary. Algorithm KKSCHEME runs in time $n^2 (R/\epsilon)^{O(rR^4/\epsilon^2)}$ and outputs $x_1, \ldots, x_n \in \mathbb{R}^r$ with $\|x_i\| \le R$ such that

$$\mathbb{E}[E(x_1, \ldots, x_n)] \le E(x'_1, \ldots, x'_n) + \epsilon n^2$$

for any $x'_1, \ldots, x'_n$ with $\|x'_i\| \le R$ for all $i$, where $\mathbb{E}$ is the expectation over the randomness of the algorithm.

Here KKSCHEME is a simple greedy algorithm described in Section 4 below.
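To convey the flavor of this greedy scheme before the formal description in Section 4, here is a heavily simplified one-dimensional sketch: vertices are visited in random order, and each is snapped to the point of an $\epsilon$-net minimizing its stress against the vertices placed so far. All names are ours, and the brute-force seeding of the first $t_0$ placements required by the formal guarantee is omitted:

```python
import numpy as np

def greedy_embed_1d(D, R=4.0, eps=0.25, seed=0):
    """Toy sketch of the greedy scheme for r = 1: visit vertices in a
    random order and snap each to the point of an eps-net of [-R, R]
    minimizing its stress against the vertices placed so far.
    (The algorithm with provable guarantees also brute-forces the
    first t0 placements; that step is omitted here.)"""
    rng = np.random.default_rng(seed)
    n = len(D)
    net = np.arange(-R, R + eps / 2, eps)  # eps-net of the 1-D ball [-R, R]
    x = np.zeros(n)
    placed = []
    for i in rng.permutation(n):
        if placed:
            cost = [sum((abs(p - x[j]) / D[i][j] - 1.0) ** 2 for j in placed)
                    for p in net]
            x[i] = net[int(np.argmin(cost))]
        placed.append(i)
    return x
```

Each placement only looks at previously placed vertices, which is exactly what makes the analysis of such greedy schemes for dense CSPs tractable.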
The fact that this result yields a PTAS for bounded-diameter graphs follows from combining it with two structural results regarding optimal Kamada-Kawai embeddings, which are of independent interest. The first (Lemma 4) shows that the optimal objective value for low-diameter graphs must be of order $\Omega(n^2)$, and the second (Lemma 5) shows that the optimal KK embedding is contractive, in the sense that the diameter of the output is never much larger than the diameter of the input.

Lemma 1 (Informal version of Lemma 4). For any target dimension $r \ge 1$, all graphs of diameter $D = O(n^{1/r})$ satisfy $E(x) = \Omega(n^2/D^r)$ for all $x$.

Lemma 2 (Informal version of Lemma 5). For any graph of diameter $D$ and any target dimension $r \ge 1$, any global minimizer of $E(x)$ satisfies $\max_{i,j} \|x_i - x_j\| = O(D \log\log D)$, i.e. the diameter of the embedding is $O(D \log\log D)$.

## 1.2. Related Work

Other Approaches to Nonlinear Dimensionality Reduction and Visualization. Recently, there has been renewed interest in force-directed graph layouts due to new applications in machine learning and data science. MDS itself is a popular technique for dimension reduction. Newer techniques, such as t-SNE (Maaten & Hinton, 2008) and UMAP (McInnes et al., 2018), can be viewed as a similar type of force-directed weighted graph drawing with more complex objectives than Kamada-Kawai (see the discussion in (McInnes et al., 2018)); in comparison, some other dimensionality reduction methods, e.g. Laplacian eigenmaps (Belkin & Niyogi, 2003), are based on spectral embeddings of graphs. In practice, methods like t-SNE and UMAP appear to work quite well, even though they are based on optimizing non-convex objectives with gradient descent, which in general comes with no guarantee of success. Towards explaining this phenomenon, t-SNE has been mathematically analyzed in a fairly specific setting where the data is split into well-separated clusters (e.g.
generated by well-separated Gaussian mixtures); in this case, the works (Arora et al., 2018; Linderman & Steinerberger, 2019) prove that the visualization recovers the corresponding cluster structure. A difficulty in proving more general guarantees is that the t-SNE and UMAP objectives are fairly complex, and hence not so easy to analyze mathematically. Partially for this reason, in this work we focus on the simpler metric MDS/Kamada-Kawai objective. Experimentally, it has been observed that, using this objective, it is easy to find high-quality minima in many different situations (see e.g. (Zheng et al., 2018)), but to our knowledge there has not previously been a mathematical explanation of this phenomenon.

Other related work. In the multidimensional scaling literature, there has been some study of the local convergence of algorithms like stress majorization; see for example (De Leeuw, 1988), which shows that stress majorization converges quickly if initialized in a sufficiently small neighborhood of a local minimum. The present work appears to provide the first provable guarantees for global optimization. The closest previous hardness result is the work of (Cayton & Dasgupta, 2006), where they showed that a similar problem is hard. In their problem: 1) the terms in (1) are weighted by $d(i, j)$, and the absolute value loss replaces the squared loss; and 2) the input is an arbitrary pseudometric, where nodes in the input are allowed to be at distance zero from each other. The second assumption makes the diameter (the ratio of the maximum to the minimum distance in the input) infinite, and this is a major obstruction to modifying their approach to show Theorem 1. See Remark 1 for further discussion. A much earlier hardness result is the work of (Saxe, 1979), in the easier (for proving hardness) case where distortion is only measured with respect to edges of the graph.
In the approximation algorithms literature, there has been a great deal of interest in optimizing the worst-case distortion of metric embeddings into various spaces; see e.g. (Badoiu et al., 2005) for approximation algorithms for embeddings into one dimension, and (Deza & Laurent, 2009; Naor, 2012) for more general surveys of low-distortion metric embeddings. Though conceptually related, the techniques used in this literature are generally not targeted at minimizing a measure of average pairwise distortion like (1). In the graph drawing literature, there are a number of competing methods for drawing a graph, with the best approach depending on the application (Battista et al., 1998). Tutte's spring embedding theorem is often considered the seminal work in the force-directed layout community, and provides a method for producing a planar drawing of a three-connected planar graph (Tutte, 1963). Though the problem under consideration in this work does indeed belong to the class of force-directed layouts, we stress that the layouts under consideration do not minimize edge crossings in any sense.

Notation. In the remainder of the paper, we will generally assume the input is given as an unweighted graph to simplify notation; however, for the upper bounds (e.g. Theorem 2) we do handle the general case of arbitrary metrics with distances in $[1, D]$; note that the lower bound of 1 is without loss of generality after re-scaling. In the lower bound (i.e. Theorem 1), we prove the (stronger) result that the problem is hard when restricted to graph metrics, instead of just for arbitrary metrics. We use standard asymptotic notation: $f(n) = O(g(n))$ means that $\limsup_{n \to \infty} f(n)/g(n) < \infty$, $f(n) = \Omega(g(n))$ means that $\liminf_{n \to \infty} f(n)/g(n) > 0$, and $f(n) = \Theta(g(n))$ means that $f(n) = \Omega(g(n))$ and $f(n) = O(g(n))$. The notation $[n]$ denotes $\{1, \ldots, n\}$. Unless otherwise noted, $\|\cdot\|$ denotes the Euclidean norm.
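For graph metrics, the distances $d(i, j)$ used in (1) are shortest-path (hop-count) distances, which can be computed by breadth-first search; a small self-contained sketch (names ours):

```python
from collections import deque

def graph_metric(n, edges):
    """All-pairs shortest-path distances of an unweighted graph
    (the graph metric), computed via BFS from every source."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[None] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if dist[s][w] is None:
                    dist[s][w] = dist[s][u] + 1
                    q.append(w)
    return dist

# 4-cycle: opposite vertices are at distance 2, so the diameter is 2.
d = graph_metric(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
print(d[0][2])  # -> 2
```

The resulting matrix can be fed directly into the stress objective (1).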
We also recall that a metric $d : V \times V \to \mathbb{R}_{\ge 0}$ on a set $V$ is formally defined to be any function satisfying: 1) $d(v, w) = 0$ iff $v = w$; 2) $d(v, w) = d(w, v)$ for all $v, w \in V$; and 3) $d(v, w) \le d(v, u) + d(u, w)$ for all $u, v, w \in V$. A pseudometric relaxes 1) to the requirement that $d(v, v) = 0$ for all $v$.

## 2. Structural Results for Optimal Embeddings

In this section, we present two results regarding optimal layouts of a given graph. In particular, we provide a lower bound for the energy of a graph layout and an upper bound for the diameter of an optimal layout. The techniques used primarily involve estimating different components of the objective function $E(x_1, \ldots, x_n)$ given by (1) (written as $E(x)$ in this section for convenience). For this reason, we introduce the notation

$$E_{i,j}(x) := \left( \frac{\|x_i - x_j\|}{d(i,j)} - 1 \right)^2 \quad \text{for } i, j \in [n],$$
$$E_{i,S}(x) := \sum_{j \in S} E_{i,j}(x) \quad \text{for } i \in [n],\ S \subseteq [n],$$
$$E_{S,T}(x) := \sum_{i \in S} \sum_{j \in T} E_{i,j}(x) \quad \text{for } S, T \subseteq [n],\ S \cap T = \emptyset.$$

We also make use of this notation in Appendices A and B. First, we recall the following standard $\epsilon$-net estimate.

Lemma 3 (Corollary 4.2.13 of (Vershynin, 2018)). Let $B_R = \{x : \|x\| \le R\} \subset \mathbb{R}^r$ be the origin-centered radius-$R$ ball in $r$ dimensions. For any $\epsilon \in (0, R)$ there exists a subset $S_\epsilon \subset B_R$ with $|S_\epsilon| \le (3R/\epsilon)^r$ such that

$$\max_{\|x\| \le R} \min_{y \in S_\epsilon} \|x - y\| \le \epsilon,$$

i.e. $S_\epsilon$ is an $\epsilon$-net of $B_R$.

Using this result, we prove the following lower bound for the objective value of any layout of a diameter-$D$ graph in $\mathbb{R}^r$.

Lemma 4. Let $G = ([n], E)$ have diameter $D \le (n/2)^{1/r}/10$. Then any layout $x \in \mathbb{R}^{rn}$ has energy $E(x) = \Omega(n^2/D^r)$.

Proof. Let $G = ([n], E)$ have diameter $D \le (n/2)^{1/r}/10$, and suppose that there exists a layout $x \in \mathbb{R}^{rn}$ of $G$ in dimension $r$ with energy $E(x) = cn^2$ for some $c \le 1/810$. If no such layout exists, then we are done. We aim to lower bound the possible values of $c$. For each vertex $i \in [n]$, consider the quantity $E_{i, V \setminus i}(x)$. The sum $\sum_{i \in [n]} E_{i, V \setminus i}(x) = 2cn^2$, and so there exists some $i^* \in [n]$ such that $E_{i^*, V \setminus i^*}(x) \le 2cn$.
By Markov's inequality, $|\{j \in [n] \mid E_{i^*, j}(x) > 10c\}| < n/5$, and so at least $4n/5$ vertices (including $i^*$) in $[n]$ satisfy $E_{i^*, j}(x) \le 10c$, i.e.

$$\|x_{i^*} - x_j\| \le d(i^*, j)\left(1 + \sqrt{10c}\right) \le \frac{10D}{9},$$

using $c \le 1/810$. The remainder of the proof consists of taking the $r$-dimensional ball with center $x_{i^*}$ and radius $10D/9$ (which contains at least $4n/5$ of the points), partitioning it into smaller sub-regions, and then lower bounding the energy resulting from the interactions between vertices within each sub-region. By applying Lemma 3 with $R := 10D/9$ and $\epsilon := 1/3$, we may partition the $r$-dimensional ball with center $x_{i^*}$ and radius $10D/9$ into at most $(10D)^r$ disjoint regions, each of diameter at most $2/3$. For each of these regions, we denote by $S_j \subseteq [n]$, $j \in [(10D)^r]$, the subset of vertices whose corresponding point lies in the corresponding region. As each region is of diameter at most $2/3$ and the graph distance between any two distinct vertices is at least one, either

$$E_{S_j}(x) \ge \binom{|S_j|}{2} (2/3 - 1)^2 = \frac{|S_j|(|S_j| - 1)}{18}$$

or $|S_j| = 0$. Empty regions contribute nothing and can be safely ignored. The optimization problem

$$\min \sum_{k=1}^{\ell} m_k(m_k - 1) \quad \text{s.t.} \quad \sum_{k=1}^{\ell} m_k = m, \quad m_k \ge 1 \text{ for } k \in [\ell],$$

has a non-empty feasible region for $m \ge \ell$, and the minimum value is $m(m/\ell - 1)$ (achieved when $m_k = m/\ell$ for all $k$). In our situation, $m := 4n/5$ and $\ell := (10D)^r$, and, by assumption, $m \ge \ell$. This leads to the lower bound

$$\sum_j E_{S_j}(x) \ge \frac{1}{18} \cdot \frac{4n}{5} \left( \frac{4n}{5(10D)^r} - 1 \right),$$

which implies that $c \ge \frac{16}{450 (10D)^r}$ for $D \le (n/2)^{1/r}/10$. This completes the proof. □

The above estimate has the correct dependence for $r = 1$. For instance, consider the lexicographic product of a path $P_D$ and a clique $K_{n/D}$: i.e. a graph with $D$ cliques in a line, and complete bipartite graphs between neighboring cliques. This graph has diameter $D$, and the layout in which the vertices of each clique (each corresponding to a vertex of $P_D$) lie exactly at the corresponding integer value in $[D]$ has objective value $\frac{n}{2}\left(\frac{n}{D} - 1\right)$.
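The objective value quoted for this layout can be verified by direct counting: cross-clique pairs embed at exactly their graph distance and contribute nothing, while each of the $D \binom{n/D}{2}$ within-clique pairs pays $(0/1 - 1)^2 = 1$. A small sketch assuming $D$ divides $n$ (function name ours):

```python
def clique_path_stress(n, D):
    """Stress of the integer-position layout of the lexicographic
    product of a path P_D with a clique K_{n/D}: cross-clique pairs
    embed exactly (zero stress), and each within-clique pair pays
    (0/1 - 1)^2 = 1."""
    m = n // D                      # clique size
    within = D * m * (m - 1) // 2   # pairs at embedded distance 0, graph distance 1
    return within

n, D = 100, 5
print(clique_path_stress(n, D))  # -> 950
print(n * (n // D - 1) // 2)     # (n/2)(n/D - 1) = 950 as well
```

This matches the closed form $\frac{n}{2}(\frac{n}{D} - 1)$ and is indeed of order $n^2/D$, the $r = 1$ case of Lemma 4.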
This estimate is almost certainly not tight for dimensions $r > 1$, as there is no higher-dimensional analogue of the path (i.e., a graph with $O(D^r)$ vertices and diameter $D$ that embeds isometrically in $\mathbb{R}^r$). Next, we provide an upper bound for the diameter of any optimal layout of a diameter-$D$ graph. For the sake of space, the proof of this result is reserved for Appendix A.

Lemma 5 (Proved in Appendix A). Let $G = ([n], E)$ have diameter $D$. Then, for any optimal layout $x \in \mathbb{R}^{rn}$, i.e., $x$ such that $E(x) \le E(y)$ for all $y \in \mathbb{R}^{rn}$, we have $\|x_i - x_j\| \le 2D \log\log D$ for all $i, j \in [n]$.

While the above estimate is sufficient for our purposes, we conjecture that it is not tight, and that the diameter of an optimal layout of a diameter-$D$ graph is always at most $2D$.

## 3. Algorithmic Lower Bounds

In this section, we discuss algorithmic lower bounds for multidimensional scaling. In particular, we provide a sketch of the reduction used in the proof of Theorem 1. The formal proof itself is quite involved, and is therefore reserved for Appendix B.

To show that minimizing (1) is NP-hard in dimension $r = 1$, we use a reduction from a version of Max All-Equal 3SAT. The Max All-Equal 3SAT decision problem asks whether, given variables $t_1, \ldots, t_\ell$, clauses $C_1, \ldots, C_m \subseteq \{t_1, \ldots, t_\ell, \bar{t}_1, \ldots, \bar{t}_\ell\}$ each consisting of at most three literals (variables or their negations), and some value $L$, there exists an assignment of variables such that at least $L$ clauses have all literals equal. The Max All-Equal 3SAT decision problem is known to be APX-hard, as it does not satisfy the conditions of the Max CSP classification theorem for a polynomial-time optimizable Max CSP (Khanna et al., 2001).
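The all-equal predicate underlying this problem is easy to make concrete; in the sketch below (names and the sign-based literal encoding are ours), a positive integer $v$ denotes the literal $t_v$ and $-v$ denotes its negation:

```python
def all_equal_satisfied(clause, assignment):
    """A clause (list of literals; negative = negated variable) is
    satisfied in Max All-Equal 3SAT iff all its literals evaluate
    to the same truth value."""
    vals = [assignment[abs(l)] if l > 0 else not assignment[abs(l)]
            for l in clause]
    return all(vals) or not any(vals)

def num_all_equal(clauses, assignment):
    """Number of all-equal clauses; Max All-Equal 3SAT asks whether
    this can be made at least L."""
    return sum(all_equal_satisfied(c, assignment) for c in clauses)

# t1 = t2 = True, t3 = False: clause (t1, t2, ~t3) is all-true.
assignment = {1: True, 2: True, 3: False}
print(all_equal_satisfied([1, 2, -3], assignment))  # -> True
print(all_equal_satisfied([1, -2, 3], assignment))  # -> False
```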
More precisely, this is because of the following properties: 1) setting all variables true or all variables false does not satisfy all clauses, and 2) all clauses cannot be written in disjunctive normal form as two terms, one with all unnegated variables and one with all negated variables. We require a much more restrictive version of this problem. In particular, we require a version in which all clauses have exactly three literals, no literal appears in a clause more than once, the number of copies of a clause is equal to the number of copies of its complement (defined as the negation of all of its elements), and each literal appears in exactly $k$ clauses. This more restricted version is shown to still be APX-hard in Appendix B.

Suppose we have an instance of the aforementioned version of Max All-Equal 3SAT with variables $t_1, \ldots, t_\ell$ and clauses $C_1, \ldots, C_{2m}$. Let $L = \{t_1, \ldots, t_\ell, \bar{t}_1, \ldots, \bar{t}_\ell\}$ be the set of literals and $\mathcal{C} = \{C_1, \ldots, C_{2m}\}$ be the multiset of clauses. Consider the graph $G = (V, E)$ with $V = V_0 \sqcup V_1 \sqcup V_2$, where

$$V_0 = \{v_i : i \in [N_v]\}, \quad V_1 = \{t_i : t \in L,\ i \in [N_t]\}, \quad V_2 = \{C_i : C \in \mathcal{C},\ i \in [N_c]\},$$

and $E = V^{(2)} \setminus (E_1 \cup E_2)$, where

$$E_1 = \{(t_i, \bar{t}_j) : t \in L,\ i, j \in [N_t]\}, \quad E_2 = \{(t_i, C_j) : t \in C,\ C \in \mathcal{C},\ i \in [N_t],\ j \in [N_c]\},$$

$\sqcup$ denotes disjoint union, the parameters satisfy $N_v \gg N_t \gg N_c \gg m$, and $V^{(2)} := \{U \subseteq V : |U| = 2\}$ is the set of all unordered pairs of vertices. For simplicity, in the following description we assume that cliques (other than $V_0$) in the original graph generally embed together as one collection of nearby points, so we can treat them as single objects in the embedding. In Appendix B, this intuition is rigorously justified. The clique on vertices $V_0$ serves as an anchor that forces all other vertices to be almost exactly at the correct distance from its center. Without loss of generality, assume this anchor clique is centered at 0. In this graph, the cliques corresponding to literals and clauses, given by $\{t_i\}_{i \in [N_t]}$ and $\{C_i\}_{i \in [N_c]}$ respectively, are all at distance one from the anchor clique.
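The construction can be sketched programmatically; the sketch below (names, the $\pm v$ literal encoding, and the toy parameter values are ours) builds the complement-style edge set described above:

```python
from itertools import combinations

def reduction_graph(num_vars, clauses, Nv=8, Nt=4, Nc=2):
    """Sketch of the hardness construction: vertices are Nv anchor
    copies, Nt copies of each literal (encoded +v / -v), and Nc copies
    of each clause; the edge set is ALL pairs except
    E1 (copies of a literal and copies of its negation) and
    E2 (copies of a literal and copies of a clause containing it)."""
    V = [("anchor", 0, i) for i in range(Nv)]
    V += [("lit", t, i) for v in range(1, num_vars + 1)
          for t in (v, -v) for i in range(Nt)]
    V += [("cls", k, i) for k in range(len(clauses)) for i in range(Nc)]

    def excluded(a, b):
        if a[0] == "lit" and b[0] == "lit":
            return a[1] == -b[1]                 # E1: literal vs. its negation
        if {a[0], b[0]} == {"lit", "cls"}:
            lit, cls = (a, b) if a[0] == "lit" else (b, a)
            return lit[1] in clauses[cls[1]]     # E2: literal vs. containing clause
        return False

    E = [(a, b) for a, b in combinations(V, 2) if not excluded(a, b)]
    return V, E
```

Since non-edges correspond to graph distance two, copies of a literal sit at distance two from copies of its negation and from copies of every clause containing it, matching the distances used in the argument.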
Literal cliques are at distance one from each other, except negations of each other, which are at distance two. Clause cliques are at distance two from the literal cliques corresponding to literals in the clause and at distance one from literal cliques corresponding to literals not in the clause. Clause cliques are all at distance one from each other. The main idea of the reduction is that the location of the center of the anchor clique at 0 forces each literal to roughly be at either $-1$ or $+1$, and the distance between negations forces negations to be on opposite sides, i.e., $x_{\bar{t}_i} \approx -x_{t_i}$. Clause cliques are also roughly at either $-1$ or $+1$, and the distance to literals forces each clause to be opposite the side with the majority of its literals; i.e., the clause $C = \{t^1, t^2, t^3\}$ lies at $x_{C_i} \approx 1 - 2\chi\{x_{t^1} + x_{t^2} + x_{t^3} \ge 0\}$, where $\chi$ is the indicator function. The optimal embedding of $G$, i.e. the location of variable cliques at either $+1$ or $-1$, corresponds to an optimal assignment for the Max All-Equal 3SAT instance.

Remark 1 (Comparison to (Cayton & Dasgupta, 2006)). As mentioned in the Introduction, the reduction here is significantly more involved than the hardness proof for a related problem in (Cayton & Dasgupta, 2006). At a high level, the key difference is that in (Cayton & Dasgupta, 2006) they were able to use a large number of distance-zero vertices to create a simple structure around the origin. This is no longer possible in our setting (in particular, with bounded-diameter graph metrics), which results in graph layouts with much less structure. For this reason, we require a graph that exhibits as much structure as possible. To this end, a reduction from Max All-Equal 3SAT using both literals and clauses in the graph is a much more suitable technique than a reduction from NAE 3SAT using only literals. In fact, it is not at all obvious that the same approach as in (Cayton & Dasgupta, 2006), applied to unweighted graphs, would lead to a computationally hard instance.
## 4. Approximation Algorithm

In this section, we formally describe an approximation algorithm using tools from the dense CSP literature, and prove theoretical guarantees for the algorithm.

### 4.1. Preliminaries: Greedy Algorithms for Max-CSP

A long line of work studies the feasibility of solving the Max-CSP problem under various related pseudorandomness and density assumptions. In our case, an algorithm with mild dependence on the alphabet size is extremely important. A very simple greedy approach, proposed and analyzed by Mathieu and Schudy (Mathieu & Schudy, 2008; Schudy, 2012) (see also (Yaroslavtsev, 2014)), satisfies this requirement.

Algorithm 1: Greedy Algorithm for Dense CSPs (Mathieu & Schudy, 2008; Schudy, 2012)
1: function GREEDYCSP($\Sigma$, $n$, $t_0$, $\{f_{ij}\}$)
2:   Shuffle the order of variables $x_1, \ldots, x_n$ by a random permutation.
3:   for all assignments $(x_1, \ldots, x_{t_0}) \in \Sigma^{t_0}$ do
4:     for $t_0 + 1 \le i \le n$ do
5:       Choose $x_i \in \Sigma$ to maximize $\sum_{j < i} f_{ji}(x_j, x_i)$.
6:     end for
7:   end for
8:   Return the best assignment found.
9: end function

Theorem 3 ((Mathieu & Schudy, 2008; Schudy, 2012)). Suppose that $\Sigma$ is a finite alphabet, $n \ge 1$ is a positive integer, and for every $i \ne j$ we have a function $f_{ij} : \Sigma \times \Sigma \to [-M, M]$. Then for any $\epsilon > 0$, Algorithm GREEDYCSP with $t_0 = O(1/\epsilon^2)$ runs in time $n^2 |\Sigma|^{O(1/\epsilon^2)}$ and returns $x_1, \ldots, x_n \in \Sigma$ such that

$$\mathbb{E}\left[\sum_{i \ne j} f_{ij}(x_i, x_j)\right] \ge \sum_{i \ne j} f_{ij}(x'_i, x'_j) - \epsilon M n^2$$

for any $x'_1, \ldots, x'_n \in \Sigma$, where $\mathbb{E}$ denotes the expectation over the randomness of the algorithm.

In comparison, we note that computing the maximizer by brute force would run in time $|\Sigma|^n$, i.e. exponentially slower in terms of $n$. This guarantee is stated in expectation but, if desired, can be converted to a high-probability guarantee by using Markov's inequality and repeating the algorithm multiple times (as in Remark 2). We use GREEDYCSP to solve a minimization problem instead of a maximization problem, which corresponds to negating all of the functions $f_{ij}$.

### 4.2. Discretization Argument

Lemma 6. For $c, R > 0$, the function $x \mapsto (x/c - 1)^2$ is $\frac{2}{c}\max(1, R/c)$-Lipschitz on the interval $[0, R]$.

Proof.
Because the derivative of the function is $\frac{2}{c}(x/c - 1)$ and $\left|\frac{2}{c}(x/c - 1)\right| \le \frac{2}{c}\max(1, R/c)$ on $[0, R]$, the result follows from the mean value theorem. □

Lemma 7. For $c, R > 0$ and $y \in \mathbb{R}^r$ with $\|y\| \le R$, the function $x \mapsto (\|x - y\|/c - 1)^2$ is $\frac{2}{c}\max(1, 2R/c)$-Lipschitz on $B_R = \{x : \|x\| \le R\}$.

Proof. Because the function $x \mapsto \|x - y\|$ is 1-Lipschitz and $\|x - y\| \le \|x\| + \|y\| \le 2R$ by the triangle inequality, the result follows from Lemma 6 and the fact that a composition of Lipschitz functions is Lipschitz. □

Lemma 8. Let $x_1, \ldots, x_n \in \mathbb{R}^r$ be arbitrary vectors such that $\|x_i\| \le R$ for all $i$, and let $\epsilon > 0$ be arbitrary. Define $S_\epsilon$ to be an $\epsilon$-net of $B_R$ as in Lemma 3, so $|S_\epsilon| \le (3R/\epsilon)^r$. For any input metric over $[n]$ with $\min_{i,j \in [n]} d(i, j) = 1$, there exist $x'_1, \ldots, x'_n \in S_\epsilon$ such that

$$E(x'_1, \ldots, x'_n) \le E(x_1, \ldots, x_n) + 4\epsilon R n^2,$$

where $E$ is (1) defined with respect to an arbitrary graph with $n$ vertices.

Proof. By Lemma 7, the energy $E(x_1, \ldots, x_n)$ is the sum of $\binom{n}{2} \le n^2/2$ terms which, for each $i$ and $j$, are individually $4R$-Lipschitz in $x_i$ and $x_j$. Therefore, defining $x'_i$ to be the closest point to $x_i$ in $S_\epsilon$ for all $i$ gives the desired result. □

### 4.3. Approximation Algorithm

Algorithm 2: Approximation Algorithm KKSCHEME
1: function KKSCHEME($\epsilon_1$, $\epsilon_2$, $R$)
2:   Build an $\epsilon_1$-net $S_{\epsilon_1}$ of $B_R = \{x : \|x\| \le R\} \subset \mathbb{R}^r$ as in Lemma 3.
3:   Apply the GREEDYCSP algorithm of Theorem 3 with $\epsilon = \epsilon_2$ to approximately minimize $E(x_1, \ldots, x_n)$ over $(x_1, \ldots, x_n) \in S_{\epsilon_1}^n$.
4:   Return $x_1, \ldots, x_n$.
5: end function

Theorem 4 (Formal Statement of Theorem 2). Let $R > \epsilon > 0$ be arbitrary. For any input metric over $[n]$ with $\min_{i,j \in [n]} d(i, j) = 1$, Algorithm KKSCHEME with $\epsilon_1 = O(\epsilon/R)$ and $\epsilon_2 = O(\epsilon/R^2)$ runs in time $n^2 (R/\epsilon)^{O(rR^4/\epsilon^2)}$ and outputs $x_1, \ldots, x_n \in \mathbb{R}^r$ with $\|x_i\| \le R$ such that

$$\mathbb{E}[E(x_1, \ldots, x_n)] \le E(x'_1, \ldots, x'_n) + \epsilon n^2$$

for any $x'_1, \ldots, x'_n$ with $\|x'_i\| \le R$ for all $i$, where $\mathbb{E}$ is the expectation over the randomness of the algorithm.

Proof. By combining Lemma 8 with Theorem 3 (used as a minimization instead of maximization algorithm), the output
$x_1, \ldots, x_n$ of KKSCHEME satisfies

$$E(x_1, \ldots, x_n) \le E(x'_1, \ldots, x'_n) + 4\epsilon_1 R n^2 + \epsilon_2 R^2 n^2$$

and runs in time $n^2 2^{O(1/\epsilon_2^2)\, r \log(3R/\epsilon_1)}$. Taking $\epsilon_2 = O(\epsilon/R^2)$ and $\epsilon_1 = O(\epsilon/R)$ gives the desired result. □

Remark 2. The runtime can be improved to $n^2 + (R/\epsilon)^{O(rR^4/\epsilon^2)}$ using a slightly more complex greedy CSP algorithm (Mathieu & Schudy, 2008). Also, by the usual argument, a high-probability guarantee can be derived by repeating the algorithm $O(\log(2/\delta))$ times, where $\delta > 0$ is the desired failure probability.

### 4.4. Extension to Vertex-Weighted Setting

In this section, we generalize the approximation algorithm to handle vertex weights. This generalization is useful if vertices have associated importance weights, e.g. each vertex represents a different number of people, and larger/more important vertices should be embedded more accurately. Given a probability measure $\mu$ over $[n]$, the weighted Kamada-Kawai objective is

$$E_\mu(x_1, \ldots, x_n) = n^2 \sum_{i < j} \mu(i)\mu(j) \left( \frac{\|x_i - x_j\|}{d(i,j)} - 1 \right)^2.$$

Theorem 5. Let $R > \epsilon > 0$ be arbitrary. Algorithm KKSCHEME with $\epsilon_1 = O(\epsilon/R)$ and $\epsilon_2 = O(\epsilon/R^2)$ runs in time $n^{O(rR^4 \log(R/\epsilon)/\epsilon^2)}$ and outputs $x_1, \ldots, x_n \in \mathbb{R}^r$ with $\|x_i\| \le R$ such that

$$\mathbb{E}[E_\mu(x_1, \ldots, x_n)] \le E_\mu(x'_1, \ldots, x'_n) + \epsilon n^2$$

for any $x'_1, \ldots, x'_n$ with $\|x'_i\| \le R$ for all $i$, where $\mathbb{E}$ is the expectation over the randomness of the algorithm.

Proof. The proof is the same as that of Theorem 4, except that we require a different dense CSP algorithm. More precisely, we can directly verify that the discretization lemma, Lemma 8, holds with the same guarantee for the weighted Kamada-Kawai objective. This reduces the problem to approximating a dense CSP with vertex weights, for which we use Theorem 6. □

The following theorem formally describes the guarantee we use for approximately optimizing dense CSPs with vertex/variable weights. This result can be proved by slightly modifying the algorithm and analysis in (Yoshida & Zhou, 2014). For completeness, we provide a proof in Appendix C.

Theorem 6 (Proved in Appendix C).
Suppose that $\Sigma$ is a finite alphabet, $n \ge 1$ is a positive integer, and for every $\{i, j\} \in \binom{[n]}{2}$ we have a function $f_{ij} : \Sigma \times \Sigma \to [-M, M]$. Then for any $\epsilon > 0$, there exists an algorithm which runs in time $n^{O(\log|\Sigma|/\epsilon^2)}$ and returns $x_1, \ldots, x_n \in \Sigma$ such that

$$\mathbb{E}\left[\mathbb{E}_{i,j \sim \mu} f_{ij}(x_i, x_j)\right] \ge \mathbb{E}_{i,j \sim \mu} f_{ij}(x'_i, x'_j) - \epsilon M$$

for any $x'_1, \ldots, x'_n \in \Sigma$, where the outer $\mathbb{E}$ denotes the expectation over the randomness of the algorithm.

## 5. Experiments

We implemented the GREEDYCSP-based algorithm described above as well as a standard gradient descent approach to minimizing the Kamada-Kawai objective. In this section we compare the behavior of these algorithms on a few interesting instances. In addition to gradient descent, a couple of other local search heuristics are popular for minimizing the Kamada-Kawai objective: 1) the original algorithm proposed by Kamada and Kawai (Kamada et al., 1989), which updates a single point at a time using a Newton-Raphson scheme, and 2) a variational approach known as majorization, which optimizes a sequence of upper bounds on the KK objective (De Leeuw et al., 1977; Gansner et al., 2004), where each step reduces to solving a Laplacian system. The recent work of (Zheng et al., 2018) compared these local search heuristics and argued that (stochastic) gradient descent, proposed in the early work of (Kruskal, 1964a), is one of the best-performing methods in practice. For this reason, we focus on comparing with gradient descent.

Some Graph Drawing Examples. In Figure 1 we show the result of embedding a random Watts-Strogatz small-world graph (Watts & Strogatz, 1998), a model of random graph intended to reflect some properties of real-world networks. In Figure 2 we show an embedding of the 3elt graph from (Diekmann & Preis); in this case, it is interesting that all of the methods optimizing (1) seem to find the same solution, except that Greedy suffers a small loss due to discretization. This suggests that this solution may be the global optimum.
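The gradient descent baseline used in these comparisons is standard; the sketch below (names ours) differentiates the stress (1) directly, and the default learning rate and step count match the implementation details reported later in this section:

```python
import numpy as np

def kk_stress_grad(X, D):
    """Gradient of the Kamada-Kawai stress (1) with respect to the
    positions X (n x r). The derivative of (||xi - xj||/dij - 1)^2
    in xi is 2(||xi - xj||/dij - 1) * (xi - xj) / (dij * ||xi - xj||)."""
    G = np.zeros_like(X)
    n = len(X)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = X[i] - X[j]
            dist = np.linalg.norm(diff) + 1e-12  # avoid division by zero
            G[i] += 2.0 * (dist / D[i][j] - 1.0) * diff / (D[i][j] * dist)
    return G

def grad_descent(D, r=2, lr=0.005, steps=4000, seed=0):
    """Plain gradient descent from a random start."""
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((len(D), r))
    for _ in range(steps):
        X -= lr * kk_stress_grad(X, D)
    return X
```

At an exact embedding of the metric the gradient vanishes, so such layouts are critical points; the non-convexity discussed above means other, worse critical points exist as well.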
Note that in all figures, the MDS/Kamada-Kawai objective value achieved (normalized by $1/n^2$, where $n$ is the number of vertices) is included in the subtitle of each plot. For comparison, in the bottom right of each figure we display the standard spectral embedding given by embedding each vertex according to the entries of the bottom two nontrivial eigenvectors of the graph Laplacian.

Experiment with restarts. The algorithm we propose in Theorem 4 is randomized, which leaves open the possibility that better results are obtained by running the algorithm multiple times and taking the best result. In Figure 3, we show the result of embedding a well-known social network graph, the Davis Southern Women Network (Davis et al., 2009), by running all methods 10 times and taking the result with the best objective value. This graph has a total of 32 nodes and records the attendance of 18 Southern women at 14 social events during the 1930s. To compare with the minimum, the average objective value achieved over the runs is 0.0588, 0.0498, and 0.0515 for Greedy, Greedy and Grad, and Grad, respectively, so all methods improved slightly by running multiple times. Finally, we note that running gradient descent with 30 restarts (as opposed to 10) improved its best score to 0.0478, essentially the same as the Greedy and Grad result.

Figure 1. Embeddings of a Watts-Strogatz graph on 50 nodes with graph parameters $K = 4$ and $\beta = 0.3$, and $t_0 = 3$ for GREEDYCSP.

Figure 2. 3elt graph from the AG Monien collection (Diekmann & Preis); GREEDYCSP run with parameter $t_0 = 2$.

Figure 3. Embedding of the Davis Southern Women Network graph. The top left figure was generated using GREEDYCSP with $t_0 = 3$.

Table 1. Runtimes for methods with the parameters used in the figures.

| Instance | Greedy | Grad | Laplacian |
|---|---|---|---|
| Davis | 5.6 s | 4 s | 4 ms |
| Watts-Strogatz | 453 s | 4 s | 20 ms |

Community Detection Experiment.
Much of the recent interest in force-directed graph drawing algorithms stems from their ability to discover interesting latent structure in data, with a view towards applications like nonlinear dimensionality reduction. As a test of this concept on synthetic data, we tested the algorithms on a celebrated model of latent community structure in graphs, the stochastic block model. The results are shown in Figure 4, along with the results of a standard spectral embedding using the bottom two nontrivial eigenvectors of the Laplacian. We did not draw the edges in this case, as they make the figure difficult to read; more importantly, the locations of points in the embedding show that nontrivial community structure was recovered; for example, the green and blue communities are roughly linearly separable in all of the embeddings. Note that the spectral embedding approach admits strong provable guarantees for community recovery (see the survey (Abbe, 2017)), so the interesting thing to observe here is that the force-directed drawing methods also recover nontrivial information about the latent structure.

Figure 4. Embeddings of a 3-community Stochastic Block Model (SBM) with connection probability matrix

    0.09  0.03  0.02
    0.03  0.15  0.04
    0.02  0.04  0.10

and community sizes 35, 35, 50. Colors correspond to latent community assignments. The top left is constructed using the GREEDYCSP algorithm with t0 = 3. For this experiment only, we used the degree-normalized Laplacian, since it is generally preferred in the context of the SBM.

Implementation details. All experiments were performed on a standard Kaggle GPU Kernel with a V80 GPU. Gradient descent was run with learning rate 0.005 for 4000 steps on all instances. We seeded the RNG with zero before each simulation for reproducibility. For the greedy method,
we eliminated the rotation and translation degrees of freedom when implementing the initial brute force step; the parameter R was set to 2.5 for the Davis experiment and to 4 for all others. Informally, the tuning rule for this parameter is to increase its value until the plot does not hit the boundary of the region. We compare runtimes in Table 1; the runtime for Greedy on Watts-Strogatz is much larger due to the larger values of n and R used; the latter roughly corresponds to the larger diameter of the underlying graph (cf. Lemma 4).

6. Conclusions

Our theory and experimental results suggest the following natural question: does gradient descent, with enough random restarts, have a provable guarantee similar to Theorem 3? As noted in our experiments and in the experiments of (Zheng et al., 2018), gradient-based optimization often seems to find high quality (albeit not global) minima of the Kamada-Kawai objective, even though the loss is highly non-convex. In fact, combining our analysis with a different theorem from (Schudy, 2012) proves that running a variant of GREEDYCSP without the initial brute force step (i.e., with t0 = 0) achieves an additive $O(\epsilon n^2)$ approximation if we repeat the algorithm $2^{2^{1/\epsilon^2}}$ many times. A similar guarantee for gradient descent, a different sort of greedy procedure, sounds plausible.

Acknowledgements

Frederic Koehler was supported in part by NSF CAREER Award CCF-1453261, NSF Large CCF-1565235, Ankur Moitra's ONR Young Investigator Award, and E. Mossel's Vannevar Bush Fellowship ONR N00014-20-1-2826. The work of J. Urschel was supported in part by ONR Research Contract N00014-17-1-2177. The authors would like to thank Michel Goemans for valuable conversations on the subject. The authors are grateful to Louisa Thomas for greatly improving the style of presentation.

References

Abbe, E. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531, 2017.
Arora, S., Hu, W., and Kothari, P. K. An analysis of the t-SNE algorithm for data visualization. arXiv preprint arXiv:1803.01768, 2018.

Badoiu, M., Dhamdhere, K., Gupta, A., Rabinovich, Y., Räcke, H., Ravi, R., and Sidiropoulos, A. Approximation algorithms for low-distortion embeddings into low-dimensional spaces. In SODA, volume 5, pp. 119–128. Citeseer, 2005.

Barak, B., Raghavendra, P., and Steurer, D. Rounding semidefinite programming hierarchies via global correlation. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, pp. 472–481. IEEE, 2011.

Battista, G. D., Eades, P., Tamassia, R., and Tollis, I. G. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall PTR, 1998.

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Borg, I. and Groenen, P. J. Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media, 2005.

Cayton, L. and Dasgupta, S. Robust Euclidean embedding. In Proceedings of the 23rd International Conference on Machine Learning, pp. 169–176, 2006.

Cover, T. M. Elements of Information Theory. John Wiley & Sons, 1999.

Davis, A., Gardner, B. B., and Gardner, M. R. Deep South: A Social Anthropological Study of Caste and Class. Univ of South Carolina Press, 2009.

De Leeuw, J. Convergence of the majorization method for multidimensional scaling. Journal of Classification, 5(2):163–180, 1988.

De Leeuw, J., Barra, I. J., Brodeau, F., Romier, G., Van Cutsem, B., et al. Applications of convex analysis to multidimensional scaling. In Recent Developments in Statistics. Citeseer, 1977.

Deza, M. M. and Laurent, M. Geometry of Cuts and Metrics, volume 15. Springer, 2009.

Diekmann, R. and Preis, R. AG-Monien graph collection. https://www.cise.ufl.edu/research/sparse/mat/AG-Monien/README.txt. Accessed: 2020-02-01.

Dodds, P.
S., Muhamad, R., and Watts, D. J. An experimental study of search in global social networks. Science, 301(5634):827–829, 2003.

Ellson, J., Gansner, E., Koutsofios, L., North, S. C., and Woodhull, G. Graphviz: open source graph drawing tools. In International Symposium on Graph Drawing, pp. 483–484. Springer, 2001.

Feige, U. A threshold of ln n for approximating set cover. J. ACM, 45(4):634–652, 1998. doi: 10.1145/285055.285059. URL https://doi.org/10.1145/285055.285059.

Filho, I. T. F. A. Characterizing Boolean Satisfiability Variants. PhD thesis, Massachusetts Institute of Technology, 2019.

Gansner, E. R., Koren, Y., and North, S. Graph drawing by stress majorization. In International Symposium on Graph Drawing, pp. 239–250. Springer, 2004.

Kamada, T., Kawai, S., et al. An algorithm for drawing general undirected graphs. Information Processing Letters, 31(1):7–15, 1989.

Khanna, S., Sudan, M., Trevisan, L., and Williamson, D. P. The approximability of constraint satisfaction problems. SIAM Journal on Computing, 30(6):1863–1920, 2001.

Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27, 1964a.

Kruskal, J. B. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129, 1964b.

Kruskal, J. B. Multidimensional Scaling. Number 11. Sage, 1978.

Linderman, G. C. and Steinerberger, S. Clustering with t-SNE, provably. SIAM Journal on Mathematics of Data Science, 1(2):313–332, 2019.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

Mathieu, C. and Schudy, W. Yet another algorithm for dense max cut: go greedy. In SODA, pp. 176–182, 2008.

McInnes, L., Healy, J., and Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

Montanari, A. Estimating random variables from random sparse observations.
European Transactions on Telecommunications, 19(4):385–403, 2008.

Naor, A. An introduction to the Ribe program. Japanese Journal of Mathematics, 7(2):167–233, 2012.

Papadimitriou, C. H. and Yannakakis, M. Optimization, approximation, and complexity classes. J. Comput. Syst. Sci., 43(3):425–440, 1991. doi: 10.1016/0022-0000(91)90023-X. URL https://doi.org/10.1016/0022-0000(91)90023-X.

Raghavendra, P. and Tan, N. Approximating CSPs with global cardinality constraints using SDP hierarchies. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 373–387. SIAM, 2012.

Saxe, J. B. Embeddability of weighted graphs in k-space is strongly NP-hard. In Proc. of 17th Allerton Conference in Communications, Control and Computing, Monticello, IL, pp. 480–489, 1979.

Schudy, W. Approximation Schemes for Inferring Rankings and Clusterings from Pairwise Data. PhD thesis, Brown University, 2012.

Tutte, W. T. How to draw a graph. Proceedings of the London Mathematical Society, 3(1):743–767, 1963.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Watts, D. J. and Strogatz, S. H. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.

Yaroslavtsev, G. Going for speed: sublinear algorithms for dense r-CSPs. arXiv preprint arXiv:1407.7887, 2014.

Yoshida, Y. and Zhou, Y. Approximation schemes via Sherali-Adams hierarchy for dense constraint satisfaction problems and assignment problems. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, pp. 423–438, 2014.

Zheng, J. X., Pawar, S., and Goodman, D. F. Graph drawing by stochastic gradient descent. IEEE Transactions on Visualization and Computer Graphics, 25(9):2738–2748, 2018.