# Ultrahyperbolic Representation Learning

Marc T. Law, Jos Stam

In machine learning, data is usually represented in a (flat) Euclidean space where distances between points are measured along straight lines. Researchers have recently considered more exotic (non-Euclidean) Riemannian manifolds such as hyperbolic space, which is well suited for tree-like data. In this paper, we propose a representation living on a pseudo-Riemannian manifold of constant nonzero curvature. It is a generalization of hyperbolic and spherical geometries in which the nondegenerate metric tensor need not be positive definite. We provide the necessary learning tools in this geometry and extend gradient-based optimization techniques. More specifically, we provide closed-form expressions for distances via geodesics and define a descent direction to minimize an objective function. Our novel framework is applied to graph representations.

## 1 Introduction

In most machine learning applications, data representations lie on a smooth manifold [16] and the training procedure is optimized with an iterative algorithm such as line search or trust region methods [20]. In most cases, the smooth manifold is Riemannian, which means that it is equipped with a positive definite metric. Due to the positive definiteness of the metric, the negative of the (Riemannian) gradient is a descent direction that can be exploited to iteratively minimize an objective function [1]. The choice of metric on the Riemannian manifold determines how relations between points are quantified.

The most common Riemannian manifold is the flat Euclidean space, which has constant zero curvature and in which distances between points are measured along straight lines. An intuitive example of a non-Euclidean Riemannian manifold is the spherical model (i.e. representations lie on a sphere), which has constant positive curvature and is used for instance in face recognition [25, 26]. On the sphere, geodesic distances are a function of angles. Similarly, Riemannian spaces of constant negative curvature are called hyperbolic [23]. Such spaces were shown by Gromov to be well suited to represent tree-like structures [10]. The machine learning community has adopted these spaces to learn tree-like graphs [5] and hierarchical data structures [11, 18, 19], and also to compute means in tree-like shapes [6, 7].

In this paper, we consider a class of pseudo-Riemannian manifolds of constant nonzero curvature [28] not previously considered in machine learning. These manifolds not only generalize the hyperbolic and spherical geometries mentioned above, but also contain hyperbolic and spherical submanifolds and can therefore describe relationships specific to those geometries. The difference is that we consider the larger class of pseudo-Riemannian manifolds where the considered nondegenerate metric tensor need not be positive definite. Optimizing a cost function on our non-flat ultrahyperbolic space requires a descent direction method that follows a path along the curved manifold. We achieve this by employing tools from differential geometry such as geodesics and exponential maps. The theoretical contributions of this paper are two-fold: (1) explicit methods to calculate dissimilarities and (2) general optimization tools on pseudo-Riemannian manifolds of constant nonzero curvature.
Figure 1: Two iso-surface depictions of an orthographic projection of the same pseudo-hyperboloid $\mathcal{Q}^{2,1}_{-1}$ into $\mathbb{R}^3$ along one time dimension. It contains the entire family of hyperboloids as submanifolds.

## 2 Pseudo-Hyperboloids

Notation: We denote points on a smooth manifold $\mathcal{M}$ [16] by boldface Roman characters $\mathbf{x} \in \mathcal{M}$. $T_{\mathbf{x}}\mathcal{M}$ is the tangent space of $\mathcal{M}$ at $\mathbf{x}$, and we write tangent vectors $\xi \in T_{\mathbf{x}}\mathcal{M}$ in boldface Greek fonts. $\mathbb{R}^d$ is the (flat) $d$-dimensional Euclidean space; it is equipped with the (positive definite) dot product denoted by $\langle \cdot, \cdot \rangle$ and defined as $\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y}$. The $\ell_2$-norm of $\mathbf{x}$ is $\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$. $\mathbb{R}^d_* = \mathbb{R}^d \setminus \{\mathbf{0}\}$ is the Euclidean space with the origin removed.

Pseudo-Riemannian manifolds: A smooth manifold $\mathcal{M}$ is pseudo-Riemannian (also called semi-Riemannian [21]) if it is equipped with a pseudo-Riemannian metric tensor (named metric for short in differential geometry). The pseudo-Riemannian metric $g_{\mathbf{x}} : T_{\mathbf{x}}\mathcal{M} \times T_{\mathbf{x}}\mathcal{M} \to \mathbb{R}$ at some point $\mathbf{x} \in \mathcal{M}$ is a nondegenerate symmetric bilinear form. Nondegeneracy means that if for a given $\xi \in T_{\mathbf{x}}\mathcal{M}$ we have $g_{\mathbf{x}}(\xi, \zeta) = 0$ for all $\zeta \in T_{\mathbf{x}}\mathcal{M}$, then $\xi = \mathbf{0}$. If the metric is also positive definite (i.e. $\forall \xi \in T_{\mathbf{x}}\mathcal{M}$, $g_{\mathbf{x}}(\xi, \xi) > 0$ iff $\xi \neq \mathbf{0}$), then it is Riemannian. Riemannian geometry is thus a special case of pseudo-Riemannian geometry where the metric is positive definite. In general this is not the case, and non-Riemannian manifolds distinguish themselves by having some non-vanishing tangent vectors $\xi \neq \mathbf{0}$ that satisfy $g_{\mathbf{x}}(\xi, \xi) \leq 0$. We refer the reader to [2, 21, 28] for details.

Pseudo-hyperboloids generalize spherical and hyperbolic manifolds to the class of pseudo-Riemannian manifolds. Let us note $d = p + q + 1 \in \mathbb{N}$ the dimensionality of some pseudo-Euclidean space where each vector is written $\mathbf{x} = (x_0, x_1, \dots, x_{q+p})^\top$. That space is denoted by $\mathbb{R}^{p,q+1}$ when it is equipped with the following scalar product (i.e. nondegenerate symmetric bilinear form [21]):

$$\forall \mathbf{a} = (a_0, \dots, a_{q+p})^\top, \ \mathbf{b} = (b_0, \dots, b_{q+p})^\top, \quad \langle \mathbf{a}, \mathbf{b} \rangle_q = -\sum_{i=0}^{q} a_i b_i + \sum_{j=q+1}^{q+p} a_j b_j = \mathbf{a}^\top G \mathbf{b}, \qquad (1)$$

where $G = G^{-1} = I_{q+1,p}$ is the $d \times d$ diagonal matrix with the first $q+1$ diagonal elements equal to $-1$ and the remaining $p$ equal to $1$. Since $\mathbb{R}^{p,q+1}$ is a vector space, we can identify the tangent space with the space itself by means of the natural isomorphism $T_{\mathbf{x}}\mathbb{R}^{p,q+1} \simeq \mathbb{R}^{p,q+1}$. Using the terminology of special relativity, $\mathbb{R}^{p,q+1}$ has $q+1$ time dimensions and $p$ space dimensions. A pseudo-hyperboloid is the following submanifold of codimension one (i.e. hypersurface) of $\mathbb{R}^{p,q+1}$:

$$\mathcal{Q}^{p,q}_{\beta} = \left\{ \mathbf{x} = (x_0, x_1, \dots, x_{p+q})^\top \in \mathbb{R}^{p,q+1} : \|\mathbf{x}\|_q^2 = \beta \right\}, \qquad (2)$$

where $\beta \in \mathbb{R}_*$ is a nonzero real number and the function $\|\cdot\|_q^2$ given by $\|\mathbf{x}\|_q^2 = \langle \mathbf{x}, \mathbf{x} \rangle_q$ is the associated quadratic form of the scalar product. It is equivalent to work with either $\mathcal{Q}^{p,q}_{\beta}$ or $\mathcal{Q}^{q+1,p-1}_{-\beta}$ as they are interchangeable via an anti-isometry (see supp. material). For instance, the unit $q$-sphere $\mathbb{S}^q = \{\mathbf{x} \in \mathbb{R}^{q+1} : \|\mathbf{x}\| = 1\}$ is anti-isometric to $\mathcal{Q}^{0,q}_{-1}$, which is then spherical. In the literature, the set $\mathcal{Q}^{p,q}_{\beta}$ is called a pseudo-sphere when $\beta > 0$ and a pseudo-hyperboloid when $\beta < 0$. In the rest of the paper, we only consider the pseudo-hyperbolic case (i.e. $\beta < 0$). Moreover, for any $\beta < 0$, $\mathcal{Q}^{p,q}_{\beta}$ is homothetic to $\mathcal{Q}^{p,q}_{-1}$, so the value of $\beta$ can be considered to be $-1$. We can obtain the spherical and hyperbolic geometries by constraining all the elements of the space dimensions of a pseudo-hyperboloid to be zero, or by constraining all the elements of the time dimensions except one to be zero, respectively. Pseudo-hyperboloids therefore generalize spheres and hyperboloids.
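To make the notation concrete, the following is a minimal PyTorch sketch of the scalar product of Eq. (1) and the membership constraint of Eq. (2). The function names are ours (not taken from any released code), and vectors are assumed to be unbatched 1-D tensors ordered with the $q+1$ time dimensions first.

```python
import torch

def scalar_product_q(a, b, q):
    """Scalar product <a, b>_q of Eq. (1): the first q+1 coordinates (time
    dimensions) contribute with a negative sign, the remaining p with a positive sign."""
    return -(a[:q + 1] * b[:q + 1]).sum() + (a[q + 1:] * b[q + 1:]).sum()

def sq_norm_q(x, q):
    """Associated quadratic form ||x||_q^2 = <x, x>_q."""
    return scalar_product_q(x, x, q)

# Example: x lies on Q^{2,1}_{-1} (p = 2, q = 1, d = p + q + 1 = 4) since ||x||_q^2 = beta = -1.
x = torch.tensor([1.0, 1.0, 1.0, 0.0])
print(sq_norm_q(x, q=1))   # tensor(-1.)
```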
The pseudo-hyperboloids that we consider in this paper are hard to visualize as they live in ambient spaces of dimension higher than 3. In Fig. 1, we show iso-surfaces of a projection of the 3-dimensional pseudo-hyperboloid $\mathcal{Q}^{2,1}_{-1}$ (embedded in $\mathbb{R}^{2,2}$) into $\mathbb{R}^3$ along its first time dimension.

Metric tensor and tangent space: The metric tensor at $\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$ is $g_{\mathbf{x}}(\cdot,\cdot) = \langle\cdot,\cdot\rangle_q$ where $g_{\mathbf{x}} : T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta} \times T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta} \to \mathbb{R}$. By using the isomorphism $T_{\mathbf{x}}\mathbb{R}^{p,q+1} \simeq \mathbb{R}^{p,q+1}$ mentioned above, the tangent space of $\mathcal{Q}^{p,q}_{\beta}$ at $\mathbf{x}$ can be defined as $T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta} = \{\xi \in \mathbb{R}^{p,q+1} : \langle\mathbf{x},\xi\rangle_q = 0\}$ for all $\beta \neq 0$. Finally, the orthogonal projection of an arbitrary $d$-dimensional vector $\mathbf{z}$ onto $T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$ is:

$$\Pi_{\mathbf{x}}(\mathbf{z}) = \mathbf{z} - \frac{\langle\mathbf{z},\mathbf{x}\rangle_q}{\langle\mathbf{x},\mathbf{x}\rangle_q}\,\mathbf{x}. \qquad (3)$$

## 3 Measuring Dissimilarity on Pseudo-Hyperboloids

This section introduces the differential geometry tools necessary to quantify dissimilarities/distances between points on $\mathcal{Q}^{p,q}_{\beta}$. Measuring dissimilarity is an important task in machine learning and has many applications (e.g. in metric learning [29]).

Intrinsic geometry: The intrinsic geometry of the hypersurface $\mathcal{Q}^{p,q}_{\beta}$ embedded in $\mathbb{R}^{p,q+1}$ (i.e. the geometry perceived by the inhabitants of $\mathcal{Q}^{p,q}_{\beta}$ [21]) derives solely from its metric tensor applied to tangent vectors of $\mathcal{Q}^{p,q}_{\beta}$. For instance, it can be used to measure the arc length of a tangent vector joining two points along a geodesic and hence to define their geodesic distance. Before considering geodesic distances, we consider extrinsic distances (i.e. distances in the ambient space $\mathbb{R}^{p,q+1}$). Since $\mathbb{R}^{p,q+1}$ is isomorphic to its tangent space, tangent vectors of $\mathbb{R}^{p,q+1}$ are naturally identified with points. Using the quadratic form of Eq. (1), the extrinsic distance between two points $\mathbf{a}, \mathbf{b} \in \mathcal{Q}^{p,q}_{\beta}$ is:

$$d_q(\mathbf{a},\mathbf{b}) = \sqrt{\left|\|\mathbf{a}-\mathbf{b}\|_q^2\right|} = \sqrt{\left|\|\mathbf{a}\|_q^2 + \|\mathbf{b}\|_q^2 - 2\langle\mathbf{a},\mathbf{b}\rangle_q\right|} = \sqrt{\left|2\beta - 2\langle\mathbf{a},\mathbf{b}\rangle_q\right|}. \qquad (4)$$

This distance is a good proxy for the geodesic distance $d_\gamma(\cdot,\cdot)$, introduced below, if it preserves distance relations: $d_\gamma(\mathbf{a},\mathbf{b}) < d_\gamma(\mathbf{c},\mathbf{d})$ iff $d_q(\mathbf{a},\mathbf{b}) < d_q(\mathbf{c},\mathbf{d})$. This relation is satisfied for two special cases of pseudo-hyperboloids for which the geodesic distance is well known:

Spherical manifold ($\mathcal{Q}^{0,q}_{\beta}$): If $p = 0$, the geodesic distance $d_\gamma(\mathbf{a},\mathbf{b}) = \sqrt{|\beta|}\cos^{-1}\!\left(\frac{\langle\mathbf{a},\mathbf{b}\rangle_q}{\beta}\right)$ is called the spherical distance. In practice, the cosine similarity $\frac{\langle\cdot,\cdot\rangle_q}{\beta}$ is often considered instead of $d_\gamma(\cdot,\cdot)$ since it satisfies $d_\gamma(\mathbf{a},\mathbf{b}) < d_\gamma(\mathbf{c},\mathbf{d})$ iff $\langle\mathbf{a},\mathbf{b}\rangle_q < \langle\mathbf{c},\mathbf{d}\rangle_q$ iff $d_q(\mathbf{a},\mathbf{b}) < d_q(\mathbf{c},\mathbf{d})$.

Hyperbolic manifold (upper sheet of the two-sheet hyperboloid $\mathcal{Q}^{p,0}_{\beta}$): If $q = 0$, the geodesic distance $d_\gamma(\mathbf{a},\mathbf{b}) = \sqrt{|\beta|}\cosh^{-1}\!\left(\frac{\langle\mathbf{a},\mathbf{b}\rangle_q}{\beta}\right)$ with $a_0 > 0$ and $b_0 > 0$ is called the Poincaré distance [19]. The (extrinsic) Lorentzian distance was shown to be a good proxy in hyperbolic geometry [11].

For the ultrahyperbolic case (i.e. $q \geq 1$ and $p \geq 2$), the distance relations are not preserved: $d_\gamma(\mathbf{a},\mathbf{b}) < d_\gamma(\mathbf{c},\mathbf{d})$ does not imply $d_q(\mathbf{a},\mathbf{b}) < d_q(\mathbf{c},\mathbf{d})$. We therefore need to consider geodesic distances only. This section introduces closed-form expressions for geodesic distances on ultrahyperbolic manifolds.

Geodesics: Informally, a geodesic is a curve joining points on a manifold $\mathcal{M}$ that minimizes some effort depending on the metric. More precisely, let $I \subseteq \mathbb{R}$ be a (maximal) interval containing 0. A geodesic $\gamma : I \to \mathcal{M}$ maps a real value $t \in I$ to a point on the manifold $\mathcal{M}$. It is a curve on $\mathcal{M}$ defined by its initial point $\gamma(0) = \mathbf{x} \in \mathcal{M}$ and initial tangent vector $\gamma'(0) = \xi \in T_{\mathbf{x}}\mathcal{M}$, where $\gamma'(t)$ is the derivative of $\gamma$ at $t$. By analogy with physics, $t$ is considered as a time value. Intuitively, one can think of the curve as the trajectory over time of a ball being pushed from a point $\mathbf{x}$ at $t = 0$ with initial velocity $\xi$ and constrained to roll on the manifold.
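Before making geodesics concrete, the tangent projection of Eq. (3) and the extrinsic distance of Eq. (4) can be sketched as follows; this is a minimal, unbatched version reusing `scalar_product_q` and `sq_norm_q` from the previous sketch, with names of our choosing.

```python
import torch

def project_tangent(x, z, q):
    """Orthogonal projection Pi_x(z) of Eq. (3): remove the component of z along x
    (orthogonality is measured with <.,.>_q, not the Euclidean dot product)."""
    return z - (scalar_product_q(z, x, q) / scalar_product_q(x, x, q)) * x

def extrinsic_distance(a, b, q):
    """Extrinsic distance d_q(a, b) of Eq. (4), computed in the ambient space R^{p,q+1}."""
    return torch.sqrt(torch.abs(sq_norm_q(a - b, q)))
```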
We denote the geodesic curve above explicitly by $\gamma_{\mathbf{x}\to\xi}(t)$ unless the dependence is obvious from the context. For this curve to be a geodesic, its acceleration has to be zero: $\forall t \in I, \ \gamma''(t) = 0$. This condition is a second-order ordinary differential equation that has a unique solution for a given set of initial conditions [17]. The interval $I$ is said to be maximal if it cannot be extended to a larger interval. In the case of $\mathcal{Q}^{p,q}_{\beta}$, we have $I = \mathbb{R}$ and $I$ is then maximal.

Figure 2: (left) Illustration of the three types of geodesics of a pseudo-hyperboloid defined in Eq. (5). (right) Exponential map and logarithm map.

Geodesics of $\mathcal{Q}^{p,q}_{\beta}$: As we show in the supp. material, the geodesics of $\mathcal{Q}^{p,q}_{\beta}$ are a combination of the hyperbolic, flat and spherical cases. The nature of the geodesic $\gamma_{\mathbf{x}\to\xi}$ depends on the sign of $\langle\xi,\xi\rangle_q$. For all $t \in \mathbb{R}$, the geodesic $\gamma_{\mathbf{x}\to\xi}$ of $\mathcal{Q}^{p,q}_{\beta}$ with $\beta < 0$ is written:

$$\gamma_{\mathbf{x}\to\xi}(t) = \begin{cases} \cosh\!\left(\frac{\|\xi\|_q t}{\sqrt{|\beta|}}\right)\mathbf{x} + \frac{\sqrt{|\beta|}}{\|\xi\|_q}\sinh\!\left(\frac{\|\xi\|_q t}{\sqrt{|\beta|}}\right)\xi & \text{if } \langle\xi,\xi\rangle_q > 0 \\[4pt] \mathbf{x} + t\xi & \text{if } \langle\xi,\xi\rangle_q = 0 \\[4pt] \cos\!\left(\frac{\|\xi\|_q t}{\sqrt{|\beta|}}\right)\mathbf{x} + \frac{\sqrt{|\beta|}}{\|\xi\|_q}\sin\!\left(\frac{\|\xi\|_q t}{\sqrt{|\beta|}}\right)\xi & \text{if } \langle\xi,\xi\rangle_q < 0 \end{cases} \qquad (5)$$

where $\|\xi\|_q = \sqrt{|\langle\xi,\xi\rangle_q|}$. We recall that $\langle\xi,\xi\rangle_q = 0$ does not imply $\xi = \mathbf{0}$. The geodesics are an essential ingredient to define a mapping known as the exponential map. See Fig. 2 (left) for a depiction of these three types of geodesics, and Fig. 2 (right) for a depiction of the other quantities introduced in this section.

Exponential map: Exponential maps are a way of collecting all of the geodesics of a pseudo-Riemannian manifold $\mathcal{M}$ into a unique differentiable mapping. Let $\mathcal{D}_{\mathbf{x}} \subseteq T_{\mathbf{x}}\mathcal{M}$ be the set of tangent vectors $\xi$ such that $\gamma_{\mathbf{x}\to\xi}$ is defined at least on the interval $[0, 1]$. This allows us to uniquely define the exponential map $\exp_{\mathbf{x}} : \mathcal{D}_{\mathbf{x}} \to \mathcal{M}$ such that $\exp_{\mathbf{x}}(\xi) = \gamma_{\mathbf{x}\to\xi}(1)$. The manifold $\mathcal{Q}^{p,q}_{\beta}$ is geodesically complete, so the domain of its exponential map is $\mathcal{D}_{\mathbf{x}} = T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$. Using Eq. (5) with $t = 1$, we obtain an exponential map of the entire tangent space to the manifold:

$$\forall \xi \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}, \quad \exp_{\mathbf{x}}(\xi) = \gamma_{\mathbf{x}\to\xi}(1). \qquad (6)$$

We make the important observation that the image of the exponential map does not necessarily cover the entire manifold: not all pairs of points on a manifold are connected by a geodesic. This is the case for our pseudo-hyperboloids. Namely, for a given point $\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$ there exist points $\mathbf{y}$ that are not in the image of the exponential map (i.e. there does not exist a tangent vector $\xi$ such that $\mathbf{y} = \exp_{\mathbf{x}}(\xi)$).

Logarithm map: We provide a closed-form expression of the logarithm map for pseudo-hyperboloids. Let $\mathcal{U}_{\mathbf{x}} \subseteq \mathcal{Q}^{p,q}_{\beta}$ be some neighborhood of $\mathbf{x}$. The logarithm map $\log_{\mathbf{x}} : \mathcal{U}_{\mathbf{x}} \to T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$ is defined as the inverse of the exponential map on $\mathcal{U}_{\mathbf{x}}$ (i.e. $\log_{\mathbf{x}} = \exp_{\mathbf{x}}^{-1}$). We propose:

$$\forall \mathbf{y} \in \mathcal{U}_{\mathbf{x}}, \quad \log_{\mathbf{x}}(\mathbf{y}) = \begin{cases} \frac{\cosh^{-1}(\nu)}{\sqrt{\nu^2 - 1}}\,(\mathbf{y} - \nu\mathbf{x}) & \text{if } \langle\mathbf{x},\mathbf{y}\rangle_q < \beta \\[4pt] \mathbf{y} - \mathbf{x} & \text{if } \langle\mathbf{x},\mathbf{y}\rangle_q = \beta \\[4pt] \frac{\cos^{-1}(\nu)}{\sqrt{1 - \nu^2}}\,(\mathbf{y} - \nu\mathbf{x}) & \text{if } \frac{\langle\mathbf{x},\mathbf{y}\rangle_q}{|\beta|} \in (-1, 1) \end{cases} \quad \text{where } \nu = \frac{\langle\mathbf{x},\mathbf{y}\rangle_q}{\beta}. \qquad (7)$$

By substituting $\xi = \log_{\mathbf{x}}(\mathbf{y})$ into Eq. (6), one can verify that our formulas are the inverse of the exponential map. The set $\mathcal{U}_{\mathbf{x}} = \left\{ \mathbf{y} \in \mathcal{Q}^{p,q}_{\beta} : \langle\mathbf{x},\mathbf{y}\rangle_q < |\beta| \right\}$ is called a normal neighborhood of $\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$ since for all $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$ there exists a geodesic from $\mathbf{x}$ to $\mathbf{y}$ such that $\log_{\mathbf{x}}(\mathbf{y}) = \gamma'_{\mathbf{x}\to\log_{\mathbf{x}}(\mathbf{y})}(0)$. We show in the supp. material that the logarithm map is not defined if $\langle\mathbf{x},\mathbf{y}\rangle_q \geq |\beta|$.

Proposed dissimilarity: We define our dissimilarity function based on the general notions of arc length and radius function on pseudo-Riemannian manifolds, which we recall in the next paragraph (see details in Chapter 5 of [21]). This corresponds to the geodesic distance in the Riemannian case. Let $\mathcal{U}_{\mathbf{x}}$ be a normal neighborhood of $\mathbf{x} \in \mathcal{M}$ with $\mathcal{M}$ pseudo-Riemannian. The radius function $r_{\mathbf{x}} : \mathcal{U}_{\mathbf{x}} \to \mathbb{R}$ is defined as $r_{\mathbf{x}}(\mathbf{y}) = \sqrt{|g_{\mathbf{x}}(\log_{\mathbf{x}}(\mathbf{y}), \log_{\mathbf{x}}(\mathbf{y}))|}$ where $g_{\mathbf{x}}$ is the metric at $\mathbf{x}$. If $\sigma_{\mathbf{x}\to\xi}$ is the radial geodesic from $\mathbf{x}$ to $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$ (i.e. $\xi = \log_{\mathbf{x}}(\mathbf{y})$), then the arc length of $\sigma_{\mathbf{x}\to\xi}$ equals $r_{\mathbf{x}}(\mathbf{y})$.
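Before using these quantities to define the distance, here is a minimal, unbatched sketch of the exponential map of Eq. (6) (via the geodesics of Eq. (5)) and the logarithm map of Eq. (7) as reconstructed above. It reuses `scalar_product_q` from the earlier sketches and assumes $\beta < 0$; the function names are ours.

```python
import torch

def exp_map(x, xi, q, beta=-1.0):
    """Exponential map exp_x(xi) of Eq. (6), i.e. the geodesic of Eq. (5) at t = 1."""
    s = scalar_product_q(xi, xi, q)            # <xi, xi>_q can be positive, zero or negative
    r = abs(beta) ** 0.5                       # sqrt(|beta|)
    if s == 0:                                 # null tangent vector: straight line
        return x + xi
    norm = torch.sqrt(torch.abs(s))            # ||xi||_q
    if s > 0:                                  # "hyperbolic-like" geodesic
        return torch.cosh(norm / r) * x + (r / norm) * torch.sinh(norm / r) * xi
    return torch.cos(norm / r) * x + (r / norm) * torch.sin(norm / r) * xi  # "spherical-like"

def log_map(x, y, q, beta=-1.0):
    """Logarithm map of Eq. (7); only defined on the normal neighborhood <x, y>_q < |beta|."""
    nu = scalar_product_q(x, y, q) / beta      # nu = <x, y>_q / beta
    if nu > 1:                                 # <x, y>_q < beta
        return torch.acosh(nu) / torch.sqrt(nu ** 2 - 1) * (y - nu * x)
    if nu == 1:                                # <x, y>_q = beta
        return y - x
    return torch.acos(nu) / torch.sqrt(1 - nu ** 2) * (y - nu * x)  # nu in (-1, 1); nu <= -1 is undefined
```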
With the radius function in hand, we define the geodesic distance between $\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$ and $\mathbf{y} \in \mathcal{U}_{\mathbf{x}}$ as the arc length of the radial geodesic $\sigma_{\mathbf{x}\to\log_{\mathbf{x}}(\mathbf{y})}$:

$$d_\gamma(\mathbf{x},\mathbf{y}) = \sqrt{\left|\|\log_{\mathbf{x}}(\mathbf{y})\|_q^2\right|} = \begin{cases} \sqrt{|\beta|}\,\cosh^{-1}\!\left(\frac{\langle\mathbf{x},\mathbf{y}\rangle_q}{\beta}\right) & \text{if } \langle\mathbf{x},\mathbf{y}\rangle_q \leq \beta \\[4pt] \sqrt{|\beta|}\,\cos^{-1}\!\left(\frac{\langle\mathbf{x},\mathbf{y}\rangle_q}{\beta}\right) & \text{if } \frac{\langle\mathbf{x},\mathbf{y}\rangle_q}{|\beta|} \in (-1, 1) \end{cases} \qquad (8)$$

It is important to note that our distance is not a distance metric. However, it satisfies the axioms of a symmetric premetric: (i) $d_\gamma(\mathbf{x},\mathbf{y}) = d_\gamma(\mathbf{y},\mathbf{x}) \geq 0$ and (ii) $d_\gamma(\mathbf{x},\mathbf{x}) = 0$. These conditions are sufficient to quantify the notion of nearness via a $\rho$-ball centered at $\mathbf{x}$: $\mathcal{B}^{\rho}_{\mathbf{x}} = \{\mathbf{y} : d_\gamma(\mathbf{x},\mathbf{y}) < \rho\}$. In general, topological spaces provide a qualitative (not necessarily quantitative) way to detect nearness through the concept of a neighborhood of a point [15]. Something is true near $\mathbf{x}$ if it is true in a neighborhood of $\mathbf{x}$ (e.g. in $\mathcal{B}^{\rho}_{\mathbf{x}}$). Our premetric is similar to metric learning methods [13, 14, 29] that learn a Mahalanobis-like distance pseudo-metric parameterized by a positive semidefinite matrix: pairs of distinct points can have zero distance if the matrix is not positive definite. However, unlike classic metric learning, we can have triplets $(\mathbf{x}, \mathbf{y}, \mathbf{z})$ that satisfy $d_\gamma(\mathbf{x},\mathbf{y}) = d_\gamma(\mathbf{x},\mathbf{z}) = 0$ but $d_\gamma(\mathbf{y},\mathbf{z}) > 0$ (e.g. $\mathbf{x} = (1, 0, 0, 0)^\top$, $\mathbf{y} = (1, 1, 1, 0)^\top$, $\mathbf{z} = (1, 1, 0, 1)^\top$ in $\mathcal{Q}^{2,1}_{-1}$).

Since the logarithm map is not defined if $\langle\mathbf{x},\mathbf{y}\rangle_q \geq |\beta|$, we propose to use the following continuous approximation defined on the whole manifold instead:

$$\forall \mathbf{x}, \mathbf{y} \in \mathcal{Q}^{p,q}_{\beta}, \quad \mathcal{D}_\gamma(\mathbf{x},\mathbf{y}) = \begin{cases} d_\gamma(\mathbf{x},\mathbf{y}) & \text{if } \langle\mathbf{x},\mathbf{y}\rangle_q \leq 0 \\[4pt] \sqrt{|\beta|}\left(\frac{\pi}{2} + \frac{\langle\mathbf{x},\mathbf{y}\rangle_q}{|\beta|}\right) & \text{otherwise} \end{cases} \qquad (9)$$

To the best of our knowledge, the explicit formulation of the logarithm map for $\mathcal{Q}^{p,q}_{\beta}$ in Eq. (7) and its corresponding radius function in Eq. (8) to define a dissimilarity function are novel. We have also proposed a linear approximation to evaluate the dissimilarity when the logarithm map is not defined, but other choices are possible. For instance, when a geodesic does not exist, a standard way in differential geometry to calculate curves is to consider broken geodesics. One might consider instead the dissimilarity $d_\gamma(\mathbf{x},-\mathbf{x}) + d_\gamma(-\mathbf{x},\mathbf{y}) = \pi\sqrt{|\beta|} + d_\gamma(-\mathbf{x},\mathbf{y})$ when $\log_{\mathbf{x}}(\mathbf{y})$ is not defined, since $-\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$ and $\log_{-\mathbf{x}}(\mathbf{y})$ is defined. This interesting problem is left for future research.

## 4 Ultrahyperbolic Optimization

In this section, we present optimization frameworks to optimize any differentiable function defined on $\mathcal{Q}^{p,q}_{\beta}$. Our goal is to compute descent directions on the ultrahyperbolic manifold. We consider two approaches. In the first approach, we map our representation from Euclidean space to ultrahyperbolic space; this is similar to the approach taken by [11] in hyperbolic space. In the second approach, we optimize using gradients defined directly in the pseudo-Riemannian tangent space, and we propose a novel descent direction that guarantees the minimization of the cost function.

### 4.1 Euclidean optimization via a differentiable mapping onto $\mathcal{Q}^{p,q}_{\beta}$

Our first method maps Euclidean representations that lie in $\mathbb{R}^d$ to the pseudo-hyperboloid $\mathcal{Q}^{p,q}_{\beta}$, and the chain rule is exploited to perform standard gradient descent. To this end, we construct a differentiable mapping $\varphi : \mathbb{R}^{q+1}_* \times \mathbb{R}^p \to \mathcal{Q}^{p,q}_{\beta}$. The image of a point already on $\mathcal{Q}^{p,q}_{\beta}$ under the mapping $\varphi$ is itself: $\forall \mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}, \ \varphi(\mathbf{x}) = \mathbf{x}$. Let $\mathbb{S}^q = \{\mathbf{x} \in \mathbb{R}^{q+1} : \|\mathbf{x}\| = 1\}$ denote the unit $q$-sphere. We first introduce the following diffeomorphisms:

Theorem 4.1 (Diffeomorphisms). For any $\beta < 0$, there is a diffeomorphism $\psi : \mathcal{Q}^{p,q}_{\beta} \to \mathbb{S}^q \times \mathbb{R}^p$. Let us note $\mathbf{x} = (\mathbf{t}^\top, \mathbf{s}^\top)^\top \in \mathcal{Q}^{p,q}_{\beta}$ with $\mathbf{t} \in \mathbb{R}^{q+1}$ and $\mathbf{s} \in \mathbb{R}^p$, and let us note $\mathbf{z} = (\mathbf{u}^\top, \mathbf{v}^\top)^\top$ where $\mathbf{u} \in \mathbb{S}^q$ and $\mathbf{v} \in \mathbb{R}^p$. The mapping $\psi$ and its inverse $\psi^{-1}$ are formulated as follows (see proofs in supp. material):

$$\psi(\mathbf{x}) = \begin{bmatrix} \mathbf{t}/\|\mathbf{t}\| \\ \mathbf{s}/\sqrt{|\beta|} \end{bmatrix} \quad \text{and} \quad \psi^{-1}(\mathbf{z}) = \sqrt{|\beta|}\begin{bmatrix} \sqrt{1 + \|\mathbf{v}\|^2}\,\mathbf{u} \\ \mathbf{v} \end{bmatrix}. \qquad (10)$$

With these mappings, any vector $\mathbf{x} \in \mathbb{R}^{q+1}_* \times \mathbb{R}^p$ can be mapped to $\mathcal{Q}^{p,q}_{\beta}$ via $\varphi = \psi^{-1} \circ \psi$ (the formula for $\psi$ extends naturally to any $\mathbf{x}$ with $\mathbf{t} \neq \mathbf{0}$).
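As an illustration of Section 4.1, the composition $\varphi = \psi^{-1} \circ \psi$ can be written as follows using the formulas of Eq. (10) as reconstructed above. The split of $\mathbf{x}$ into time part $\mathbf{t}$ and space part $\mathbf{s}$ follows the convention of Eq. (1); the function name is ours, and `sq_norm_q` is reused from the first sketch.

```python
import torch

def phi(x, q, beta=-1.0):
    """Map any x in R^{q+1}_* x R^p onto Q^{p,q}_beta via phi = psi^{-1} o psi (Eq. (10)).
    If x already lies on the pseudo-hyperboloid, then phi(x) = x."""
    t, s = x[:q + 1], x[q + 1:]                             # time part t, space part s
    u = t / t.norm()                                        # psi: project the time part onto S^q
    v = s / abs(beta) ** 0.5                                # psi: rescale the space part
    new_t = torch.sqrt(abs(beta) * (1.0 + v.dot(v))) * u    # psi^{-1}: time norm set by the constraint
    new_s = abs(beta) ** 0.5 * v                            # psi^{-1}: undo the rescaling
    return torch.cat([new_t, new_s])

# Sanity check (beta = -1): the result satisfies ||phi(x)||_q^2 = beta up to numerical error.
x = torch.tensor([0.3, -1.2, 2.0, 0.5])
print(sq_norm_q(phi(x, q=1), q=1))   # approximately -1
```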
$\varphi$ is differentiable everywhere except when $x_0 = \dots = x_q = 0$, which should never occur in practice. It can therefore be optimized using standard gradient methods.

### 4.2 Pseudo-Riemannian optimization

We now introduce a novel method to optimize any differentiable function $f : \mathcal{Q}^{p,q}_{\beta} \to \mathbb{R}$ defined on the pseudo-hyperboloid. As we show below, the (negative of the) pseudo-Riemannian gradient is not necessarily a descent direction. We propose a simple and efficient way to calculate a descent direction.

Pseudo-Riemannian gradient: Since $\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$ also lies in the Euclidean ambient space $\mathbb{R}^d$, the function $f$ has a well-defined Euclidean gradient $\nabla f(\mathbf{x}) = \left(\partial f(\mathbf{x})/\partial x_0, \dots, \partial f(\mathbf{x})/\partial x_{p+q}\right)^\top \in \mathbb{R}^d$. The gradient of $f$ in the pseudo-Euclidean ambient space $\mathbb{R}^{p,q+1}$ is $G^{-1}\nabla f(\mathbf{x}) = G\nabla f(\mathbf{x}) \in \mathbb{R}^{p,q+1}$. Since $\mathcal{Q}^{p,q}_{\beta}$ is a submanifold of $\mathbb{R}^{p,q+1}$, the pseudo-Riemannian gradient $Df(\mathbf{x}) \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$ of $f$ on $\mathcal{Q}^{p,q}_{\beta}$ is the orthogonal projection of $G\nabla f(\mathbf{x})$ onto $T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$ (see Chapter 4 of [21]):

$$Df(\mathbf{x}) = \Pi_{\mathbf{x}}\!\left(G\nabla f(\mathbf{x})\right) = G\nabla f(\mathbf{x}) - \frac{\langle G\nabla f(\mathbf{x}), \mathbf{x}\rangle_q}{\langle\mathbf{x},\mathbf{x}\rangle_q}\,\mathbf{x} = G\nabla f(\mathbf{x}) - \frac{\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle}{\langle\mathbf{x},\mathbf{x}\rangle_q}\,\mathbf{x}. \qquad (11)$$

This gradient forms the foundation of our descent-method optimizer, as will be shown in Eq. (13).

Iterative optimization: Our goal is to iteratively decrease the value of the function $f$ by following some descent direction. Since $\mathcal{Q}^{p,q}_{\beta}$ is not a vector space, we cannot simply add the descent direction multiplied by a step size, as this would result in a new point that does not necessarily lie on $\mathcal{Q}^{p,q}_{\beta}$. Instead, to remain on the manifold, we use our exponential map defined in Eq. (6). This is a standard way to optimize on Riemannian manifolds [1]. Given a step size $t > 0$, one step of descent along a tangent vector $\zeta \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$ is given by:

$$\mathbf{y} = \exp_{\mathbf{x}}(t\zeta) \in \mathcal{Q}^{p,q}_{\beta}. \qquad (12)$$

Descent direction: We now explain why the negative of the pseudo-Riemannian gradient is not necessarily a descent direction. Our explanation extends Chapter 3 of [20], which gives the criteria for a tangent vector $\zeta$ to be a descent direction when the domain of the optimized function is a Euclidean space. By using the properties described in Section 3, we know that for all $t \in \mathbb{R}$ and all $\xi \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$ we have the equalities $\exp_{\mathbf{x}}(t\xi) = \gamma_{\mathbf{x}\to t\xi}(1) = \gamma_{\mathbf{x}\to\xi}(t)$, so we can equivalently fix $t$ to 1 and choose the scale of $\xi$ appropriately. By exploiting Taylor's first-order approximation, there exists some small enough tangent vector $\zeta \neq \mathbf{0}$ (i.e. with $\exp_{\mathbf{x}}(\zeta)$ belonging to a convex neighborhood of $\mathbf{x}$ [4, 8]) that satisfies $\gamma_{\mathbf{x}\to\zeta}(0) = \mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$, $\gamma'_{\mathbf{x}\to\zeta}(0) = \zeta \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$, $\gamma_{\mathbf{x}\to\zeta}(1) = \mathbf{y} \in \mathcal{Q}^{p,q}_{\beta}$, and the function $f \circ \gamma_{\mathbf{x}\to\zeta} : \mathbb{R} \to \mathbb{R}$ can be approximated at $t = 1$ by:

$$f(\mathbf{y}) = f \circ \gamma_{\mathbf{x}\to\zeta}(1) \approx f \circ \gamma_{\mathbf{x}\to\zeta}(0) + (f \circ \gamma_{\mathbf{x}\to\zeta})'(0) = f(\mathbf{x}) + \langle Df(\mathbf{x}), \zeta\rangle_q, \qquad (13)$$

where we use the following properties: $\forall t, \ (f \circ \gamma)'(t) = df(\gamma'(t)) = g_{\gamma(t)}\!\left(Df(\gamma(t)), \gamma'(t)\right)$ (see details on pages 11, 15 and 85 of [21]), $df$ is the differential of $f$, and $\gamma$ is a geodesic. To be a descent direction at $\mathbf{x}$ (i.e. so that $f(\mathbf{y}) < f(\mathbf{x})$), the search direction $\zeta$ has to satisfy $\langle Df(\mathbf{x}), \zeta\rangle_q < 0$. However, choosing $\zeta = -\eta Df(\mathbf{x})$, where $\eta > 0$ is a step size, might increase the function value if the scalar product $\langle\cdot,\cdot\rangle_q$ is not positive definite. If $p + q \geq 1$, then $\langle\cdot,\cdot\rangle_q$ is positive definite only if $q = 0$ (see details in supp. material), and it is negative definite iff $p = 0$ since $\langle\cdot,\cdot\rangle_q = -\langle\cdot,\cdot\rangle$ in this case. A simple solution would be to choose $\zeta = \mp\eta Df(\mathbf{x})$ depending on the sign of $\langle Df(\mathbf{x}), Df(\mathbf{x})\rangle_q$, but $\langle Df(\mathbf{x}), Df(\mathbf{x})\rangle_q$ might be equal to 0 even if $Df(\mathbf{x}) \neq \mathbf{0}$ when $\langle\cdot,\cdot\rangle_q$ is indefinite. The optimization algorithm might then be stuck on a level set of $f$, which is problematic.
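As a reference point before the fix proposed in Algorithm 1 below, the pseudo-Riemannian gradient of Eq. (11) can be sketched as follows. This is a minimal version reusing `project_tangent` from the earlier sketches; multiplying by $G$ simply flips the sign of the first $q+1$ coordinates.

```python
import torch

def apply_G(v, q):
    """Multiply by G = diag(-1, ..., -1, 1, ..., 1): negate the q+1 time coordinates."""
    return torch.cat([-v[:q + 1], v[q + 1:]])

def pseudo_riemannian_grad(x, euclidean_grad, q):
    """Df(x) of Eq. (11): raise the index with G, then project onto T_x Q^{p,q}_beta."""
    return project_tangent(x, apply_G(euclidean_grad, q), q)
```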
Algorithm 1: Pseudo-Riemannian optimization on $\mathcal{Q}^{p,q}_{\beta}$
input: a differentiable function $f : \mathcal{Q}^{p,q}_{\beta} \to \mathbb{R}$ to be minimized, and some initial value of $\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}$
1: while not converged do
2: compute $\nabla f(\mathbf{x})$, i.e. the Euclidean gradient of $f$ at $\mathbf{x}$ in the Euclidean ambient space
3: $\chi \leftarrow \Pi_{\mathbf{x}}\!\left(G\,\Pi_{\mathbf{x}}(G\nabla f(\mathbf{x}))\right)$ (see Eq. (14))
4: $\mathbf{x} \leftarrow \exp_{\mathbf{x}}(-\eta\chi)$, where $\eta > 0$ is a step size (e.g. determined with line search)
5: end while

Proposed solution: To ensure that $\zeta \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$ is a descent direction, we propose a simple expression that satisfies $\langle Df(\mathbf{x}), \zeta\rangle_q < 0$ if $Df(\mathbf{x}) \neq \mathbf{0}$, and $\langle Df(\mathbf{x}), \zeta\rangle_q = 0$ otherwise. We propose to formulate $\zeta = -\eta\,\Pi_{\mathbf{x}}(G\,Df(\mathbf{x})) \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$, and we define the following tangent vector $\chi = -\frac{1}{\eta}\zeta$:

$$\chi = \Pi_{\mathbf{x}}(G\,Df(\mathbf{x})) = \nabla f(\mathbf{x}) - \frac{\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle}{\langle\mathbf{x},\mathbf{x}\rangle_q}\,G\mathbf{x} - \frac{\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle_q}{\langle\mathbf{x},\mathbf{x}\rangle_q}\,\mathbf{x} + \|\mathbf{x}\|^2\,\frac{\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle}{\langle\mathbf{x},\mathbf{x}\rangle_q^2}\,\mathbf{x}. \qquad (14)$$

The tangent vector $\zeta$ is a descent direction because $\langle Df(\mathbf{x}), \zeta\rangle_q = -\eta\,\langle Df(\mathbf{x}), \chi\rangle_q$ is nonpositive:

$$\langle Df(\mathbf{x}), \chi\rangle_q = \|\nabla f(\mathbf{x})\|^2 - 2\,\frac{\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle\,\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle_q}{\langle\mathbf{x},\mathbf{x}\rangle_q} + \frac{\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle^2\,\|\mathbf{x}\|^2}{\langle\mathbf{x},\mathbf{x}\rangle_q^2} \qquad (15)$$

$$= \left\| G\nabla f(\mathbf{x}) - \frac{\langle\nabla f(\mathbf{x}), \mathbf{x}\rangle}{\langle\mathbf{x},\mathbf{x}\rangle_q}\,\mathbf{x} \right\|^2 = \|Df(\mathbf{x})\|^2 \geq 0. \qquad (16)$$

We also have $\langle Df(\mathbf{x}), \chi\rangle_q = \|Df(\mathbf{x})\|^2 = 0$ iff $Df(\mathbf{x}) = \mathbf{0}$ (i.e. $\mathbf{x}$ is a stationary point). It is worth noting that $Df(\mathbf{x}) = \mathbf{0}$ implies $\chi = \Pi_{\mathbf{x}}(G\mathbf{0}) = \mathbf{0}$. Moreover, $\chi = \mathbf{0}$ implies that $\|Df(\mathbf{x})\|^2 = \langle Df(\mathbf{x}), \mathbf{0}\rangle_q = 0$. We then have $\chi = \mathbf{0}$ iff $Df(\mathbf{x}) = \mathbf{0}$. Our proposed algorithm for the minimization problem $\min_{\mathbf{x} \in \mathcal{Q}^{p,q}_{\beta}} f(\mathbf{x})$ is illustrated in Algorithm 1. Following generic Riemannian optimization algorithms [1], at each iteration it first computes the descent direction $\chi \in T_{\mathbf{x}}\mathcal{Q}^{p,q}_{\beta}$, then decreases the function by applying the exponential map defined in Eq. (6). It is worth noting that our proposed descent method can be applied to any differentiable function $f : \mathcal{Q}^{p,q}_{\beta} \to \mathbb{R}$, not only to those that exploit the distance introduced in Section 3.

Interestingly, our method can also be seen as a preconditioning technique [20] where the descent direction is obtained by preconditioning the pseudo-Riemannian gradient $Df(\mathbf{x})$ with the matrix $P_{\mathbf{x}} = \left[ G - \frac{1}{\langle\mathbf{x},\mathbf{x}\rangle_q}\,\mathbf{x}\mathbf{x}^\top \right] \in \mathbb{R}^{d \times d}$. In other words, we have $\chi = P_{\mathbf{x}} Df(\mathbf{x}) = \Pi_{\mathbf{x}}(G\,Df(\mathbf{x}))$. In the more general setting of pseudo-Riemannian manifolds, another preconditioning technique was proposed in [8]. The method in [8] requires performing a Gram-Schmidt process at each iteration to obtain an (ordered [28]) orthonormal basis of the tangent space at $\mathbf{x}$ w.r.t. the induced quadratic form of the manifold. However, the Gram-Schmidt process is unstable and has algorithmic complexity that is cubic in the dimensionality of the tangent space. In contrast, our method is more stable and its algorithmic complexity is linear in the dimensionality of the tangent space. (A code sketch of one full iteration of Algorithm 1 is given just before the experimental details in Section 5.)

## 5 Experiments

We now experimentally validate our proposed optimization methods and the effectiveness of our dissimilarity function. Our main experimental results can be summarized as follows:

- Both optimizers introduced in Section 4 decrease the given objective function $f : \mathcal{Q}^{p,q}_{\beta} \to \mathbb{R}$. While both optimizers manage to learn high-dimensional representations that satisfy the problem-dependent training constraints, only the pseudo-Riemannian optimizer satisfies all the constraints in lower-dimensional spaces. This is because it exploits the underlying metric of the manifold.
- Hyperbolic representations are popular in machine learning as they are well suited to represent hierarchical trees [10, 18, 19]. On the other hand, hierarchical datasets whose graph contains cycles cannot be represented using trees. We therefore propose to represent such graphs using our ultrahyperbolic representations. An important example is community graphs that contain leaders, such as Zachary's karate club [30].
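Before describing the experimental setup, the following is a minimal sketch of one iteration of Algorithm 1, combining Eqs. (11), (12) and (14). It reuses the helper functions from the earlier sketches and uses a fixed step size in place of the line search mentioned in Algorithm 1; names and defaults are illustrative choices of ours.

```python
import torch

def descent_step(x, f, q, eta=1e-2, beta=-1.0):
    """One iteration of Algorithm 1: x <- exp_x(-eta * chi) with chi = Pi_x(G Df(x))."""
    x = x.detach().requires_grad_(True)
    f(x).backward()                                     # Euclidean gradient of f at x
    with torch.no_grad():
        Df = pseudo_riemannian_grad(x, x.grad, q)       # Eq. (11)
        chi = project_tangent(x, apply_G(Df, q), q)     # Eq. (14): descent direction (up to -eta)
        return exp_map(x, -eta * chi, q, beta=beta)     # Eq. (12): move along the geodesic

# Example usage with a toy differentiable function on Q^{2,1}_{-1}:
# x = phi(torch.randn(4), q=1)                  # random initial point mapped onto the manifold
# x = descent_step(x, lambda x: x[3] ** 2, q=1)
```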
Because our ultrahyperbolic representations are more flexible than hyperbolic representations, we believe that they are better suited for such non-tree-like hierarchical structures.

Figure 3: (left) Graphical representation of Zachary's karate club (figure courtesy of [30]). (right) Loss values of Eq. (17) for different numbers of time dimensions and optimizers.

Graph: Our ultrahyperbolic representations describe graph-structured datasets. Each dataset is an undirected weighted graph $G = (V, E)$ with node set $V = \{v_i\}_{i=1}^{n}$ and edge set $E = \{e_k\}_{k=1}^{m}$. Each edge $e_k$ is weighted by an arbitrary capacity $c_k \in \mathbb{R}_+$ that models the strength of the relationship between nodes: the higher the capacity $c_k$, the stronger the relationship between the nodes connected by $e_k$.

Learned representations: Our problem formulation is inspired by hyperbolic representation learning approaches [18, 19] in which the nodes of a tree (i.e. a graph without cycles) are represented in hyperbolic space. The hierarchical structure of the tree is then reflected by the order of the distances between its nodes. More precisely, a node representation is learned so that each node is closer to its descendants and ancestors in the tree (w.r.t. the hyperbolic distance) than to any other node. For example, in a hierarchy of words, ancestors and descendants are hypernyms and hyponyms, respectively.

Our goal is to learn a set of $n$ points (embeddings) $\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{Q}^{p,q}_{\beta}$ from a given graph $G$. The presence of cycles in the graph makes it difficult to determine ancestors and descendants. For this reason, we introduce, for each pair of nodes $(v_i, v_j) = e_k \in E$, the set of weaker pairs that have lower capacity: $\mathcal{W}(e_k) = \{e_l : c_k > c_l\} \cup \{(v_a, v_b) : (v_a, v_b) \notin E\}$. Our goal is to learn representations such that pairs $(v_i, v_j)$ with higher capacity have their representations $(\mathbf{x}_i, \mathbf{x}_j)$ closer to each other than weaker pairs. Following [18], we formulate our problem as:

$$\min_{\mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{Q}^{p,q}_{\beta}} \ \sum_{(v_i, v_j) = e_k \in E} -\log \frac{\exp\!\left(-d(\mathbf{x}_i, \mathbf{x}_j)/\tau\right)}{\sum_{(v_a, v_b) \in \mathcal{W}(e_k) \cup \{e_k\}} \exp\!\left(-d(\mathbf{x}_a, \mathbf{x}_b)/\tau\right)} \qquad (17)$$

where $d$ is the chosen dissimilarity function (e.g. $\mathcal{D}_\gamma(\cdot,\cdot)$ defined in Eq. (9)) and $\tau > 0$ is a fixed temperature parameter. The formulation of Eq. (17) is classic in the metric learning literature [3, 12, 27] and corresponds to optimizing some order on the learned distances via a softmax function.

Implementation details: We coded our approach in PyTorch [22], which automatically calculates the Euclidean gradient $\nabla f(\mathbf{x}_i)$. Initially, a random set of vectors $\{\mathbf{z}_i\}_{i=1}^{n}$ is generated close to the positive pole $(\sqrt{|\beta|}, 0, \dots, 0)^\top \in \mathcal{Q}^{p,q}_{\beta}$, with every coordinate perturbed uniformly by a random value in the interval $[-\varepsilon, \varepsilon]$, where $\varepsilon > 0$ is chosen small enough that $\|\mathbf{z}_i\|_q^2 < 0$. We set $\beta = -1$, $\varepsilon = 0.1$ and $\tau = 10^{-2}$. Initial embeddings are generated as follows: $\forall i, \ \mathbf{x}_i = \mathbf{z}_i / \sqrt{|\|\mathbf{z}_i\|_q^2|} \in \mathcal{Q}^{p,q}_{\beta}$.

Zachary's karate club dataset [30] is a social network graph of a karate club comprising $n = 34$ nodes, each representing a member of the karate club. The club was split due to a conflict between the instructor "Mr. Hi" (node $v_1$) and the administrator "John A" (node $v_n$). The remaining members now have to decide whether or not to join the new club created by $v_1$. In [30], Zachary defines a matrix $C \in \{0, 1, \dots, 7\}^{n \times n}$ of relative strengths of the friendships in the karate club, which depends on various criteria. We note that the matrix is not symmetric and has 7 different pairs $(v_i, v_j)$ for which $C_{ij} \neq C_{ji}$.
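Before turning to the evaluation, here is a minimal sketch of the dissimilarity $\mathcal{D}_\gamma$ of Eqs. (8)-(9) and of the per-edge term of the loss in Eq. (17). It reuses `scalar_product_q` from the earlier sketches; the function names and the explicit list of weaker pairs are illustrative choices of ours, not the paper's released code.

```python
import math
import torch

def ultrahyperbolic_dissimilarity(x, y, q, beta=-1.0):
    """D_gamma(x, y) of Eq. (9): the geodesic distance of Eq. (8) when the logarithm
    map exists, and a linear approximation otherwise."""
    inner = scalar_product_q(x, y, q)
    r = abs(beta) ** 0.5
    if inner <= beta:                     # cosh branch of Eq. (8)
        return r * torch.acosh(torch.clamp(inner / beta, min=1.0))
    if inner <= 0:                        # cos branch of Eq. (8)
        return r * torch.acos(torch.clamp(inner / beta, -1.0, 1.0))
    return r * (math.pi / 2 + inner / abs(beta))   # log map undefined: linear branch of Eq. (9)

def edge_loss(xs, i, j, weaker_pairs, q, tau=1e-2):
    """-log softmax term of Eq. (17) for one edge e_k = (v_i, v_j); weaker_pairs lists W(e_k)."""
    pos = -ultrahyperbolic_dissimilarity(xs[i], xs[j], q) / tau
    negs = [-ultrahyperbolic_dissimilarity(xs[a], xs[b], q) / tau for a, b in weaker_pairs]
    logits = torch.stack([pos] + negs)
    return -torch.log_softmax(logits, dim=0)[0]    # index 0 corresponds to e_k itself
```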
Since our dissimilarity function is symmetric, we consider the symmetric matrix $S = C + C^\top$ instead. The value of $S_{ij}$ is the capacity/weight assigned to the edge joining $v_i$ and $v_j$, and there is no edge between $v_i$ and $v_j$ if $S_{ij} = 0$. Fig. 3 (left) illustrates the 34 nodes of the dataset; an edge joining the nodes $v_i$ and $v_j$ is drawn iff $S_{ij} \neq 0$. The level of a node in the hierarchy corresponds approximately to its height in the figure.

Table 1: Evaluation scores for the different learned representations (mean ± standard deviation).

| Evaluation metric | $\mathbb{R}^4$ (flat) | $\mathcal{Q}^{4,0}_{-1}$ (hyperbolic) | $\mathcal{Q}^{3,1}_{-1}$ (ours) | $\mathcal{Q}^{2,2}_{-1}$ (ours) | $\mathcal{Q}^{1,3}_{-1}$ (ours) | $\mathcal{Q}^{0,4}_{-1}$ (spherical) |
|---|---|---|---|---|---|---|
| Rank of the first leader | 5.4 ± 1.1 | 2.8 ± 0.4 | 1.2 ± 0.4 | 1.2 ± 0.4 | 1.8 ± 0.8 | 2.0 ± 0.7 |
| Rank of the second leader | 6.6 ± 0.9 | 4.2 ± 0.7 | 2.4 ± 0.9 | 2.6 ± 0.5 | 4.0 ± 1.2 | 4.0 ± 1.4 |
| Top 5 Spearman's ρ | 0.44 ± 0.19 | 0.20 ± 0.48 | 0.76 ± 0.21 | 0.66 ± 0.30 | 0.36 ± 0.40 | 0.18 ± 0.37 |
| Top 10 Spearman's ρ | 0.00 ± 0.14 | 0.38 ± 0.06 | 0.74 ± 0.11 | 0.79 ± 0.12 | 0.71 ± 0.08 | 0.55 ± 0.20 |

Optimizers: We first validate that the optimizers introduced in Section 4 decrease the cost function. We consider the simple unweighted case where every edge weight is 1. For each edge $e_k \in E$, $\mathcal{W}(e_k)$ is then the set of pairs of nodes that are not connected. In other words, Eq. (17) learns node representations with the property that every connected pair of nodes has a smaller distance than any non-connected pair. We use this condition as the stopping criterion of our algorithm. Fig. 3 (right) illustrates the loss values of Eq. (17) as a function of the number of iterations with the Euclidean gradient descent of Section 4.1 and our pseudo-Riemannian optimizer introduced in Section 4.2. In each test, we vary the number of time dimensions $q + 1$ while the ambient space is of fixed dimensionality $d = p + q + 1 = 10$. We omit the case $q = 0$ since it corresponds to the (hyperbolic) Riemannian case already considered in [11, 19]. Both optimizers decrease the function and manage to satisfy all the expected distance relations. We note that when we use $Df(\mathbf{x})$ instead of $\chi$ as the search direction, the algorithm does not converge. Moreover, our pseudo-Riemannian optimizer manages to learn representations that satisfy all the constraints on low-dimensional manifolds such as $\mathcal{Q}^{4,1}_{-1}$ and $\mathcal{Q}^{4,2}_{-1}$, while the optimizer introduced in Section 4.1 does not. Consequently, we only use the pseudo-Riemannian optimizer in the following results.

Hierarchy extraction: To quantitatively evaluate our approach, we apply it to the problem of predicting the high-level nodes in the hierarchy from the weighted matrix $S$ given as supervision. We consider the challenging low-dimensional setting where all the learned representations lie on a 4-dimensional manifold (i.e. $p + q + 1 = 5$). Hyperbolic distances are known to grow exponentially as we get further from the origin. Therefore, the sum of distances $\delta_i = \sum_{j=1}^{n} d(\mathbf{x}_i, \mathbf{x}_j)$ of a node $v_i$ to all other nodes is a good indication of importance: intuitively, high-level nodes are closer to most nodes than low-level nodes are. We then sort the scores $\delta_1, \dots, \delta_n$ in ascending order and report the ranks of the two leaders $v_1$ and $v_n$ (in no particular order) in the first two rows of Table 1, averaged over 5 different initializations/runs. Leaders tend to have a smaller $\delta_i$ score with ultrahyperbolic distances than with Euclidean, hyperbolic or spherical distances. Instead of using $\delta_i$ for hyperbolic representations, the importance of a node $v_i$ can also be evaluated by using the Euclidean norm of its embedding $\mathbf{x}_i$ as a proxy [11, 18, 19].
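To make the evaluation protocol concrete, a minimal sketch of the importance score $\delta_i$, the leader-rank computation and the top-$k$ Spearman correlation is given below. It assumes the embeddings are stacked in an $n \times d$ tensor `xs` and that the node strengths come from the symmetric capacity matrix; `spearmanr` is SciPy's implementation, and all other names are ours.

```python
import torch
from scipy.stats import spearmanr

def node_scores(xs, q):
    """delta_i = sum_j d(x_i, x_j): smaller values indicate higher-level nodes."""
    n = xs.shape[0]
    return torch.stack([sum(ultrahyperbolic_dissimilarity(xs[i], xs[j], q)
                            for j in range(n) if j != i) for i in range(n)])

def leader_rank(delta, leader_index):
    """1-based rank of a node when the delta scores are sorted in ascending order."""
    return int((delta < delta[leader_index]).sum()) + 1

def topk_spearman(strength, delta, k):
    """Spearman's rho between the k strongest nodes' strengths s_i and their delta_i,
    e.g. strength = S.sum(dim=1) if S is the symmetric capacity matrix as a tensor."""
    idx = torch.topk(strength, k).indices
    rho, _ = spearmanr(strength[idx].numpy(), delta[idx].numpy())
    return rho
```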
This norm-based proxy works for trees because high-level nodes of a tree in hyperbolic space are usually closer to the origin than low-level nodes. Not surprisingly, this proxy leads to worse performance here (8.6 ± 2.3 and 18.6 ± 4.9), as the relationships are not those of a tree. Since hierarchy levels are hard to compare for low-level nodes, we select the 10 (or 5) most influential members based on the score $s_i = \sum_{j=1}^{n} S_{ij}$. The corresponding nodes are 34, 1, 33, 3, 2, 32, 24, 4, 9, 14 (in that order). Spearman's rank correlation coefficient [24] between the selected scores $s_i$ and the corresponding $\delta_i$ is reported in Table 1 and shows the relevance of our representations. Due to lack of space, we report in the supp. material similar experiments on a larger hierarchical dataset [9] that describes co-authorship of papers published at NIPS from 1988 to 2003.

## 6 Conclusion

We have introduced ultrahyperbolic representations. Our representations lie on a pseudo-Riemannian manifold of constant nonzero curvature that generalizes hyperbolic and spherical geometries and includes them as submanifolds. Any relationship described in those geometries can therefore be described with our representations, which are more flexible. We have introduced new optimization tools and experimentally shown that our representations can extract hierarchies in graphs that contain cycles.

## Broader Impact

We introduce a novel way of representing relationships between data points by considering the geometry of non-Riemannian manifolds of constant nonzero curvature. The relationships between data points are described by a dissimilarity function that we introduce and that exploits the structure of the manifold. It is more flexible than the distance metrics used in the hyperbolic and spherical geometries often adopted in machine learning and computer vision. Since problems involving our representations are not straightforward to optimize, we also propose novel optimization algorithms that can potentially benefit the machine learning, computer vision and natural language processing communities. Indeed, our method is application agnostic and could extend existing frameworks.

Our contribution is mainly theoretical, but we have included one practical application. Similarly to hyperbolic representations, which are popular for representing tree-like data, we have shown that our representations are well adapted to the more general case of hierarchical graphs with cycles. These graphs appear in many different fields of research such as medicine, molecular biology and the social sciences. For example, an ultrahyperbolic representation of proteins might assist in understanding their complicated folding mechanisms. Moreover, these representations could assist in analyzing features of social media, such as discovering new trends and leading "connectors". The impact of community detection for commercial or political advertising is already known in social networking services. We foresee that our method will have many more graph-based practical applications.

We know of very few applications outside of general relativity that use pseudo-Riemannian geometry. We hope that our research will stimulate other applications in machine learning and related fields. Finally, although we have introduced a novel descent direction for our optimization algorithm, future research could study and improve its rate of convergence.

## Acknowledgments and Disclosure of Funding

We thank Jonah Philion, Guojun Zhang and the anonymous reviewers for helpful feedback on early versions of this manuscript.
This article was entirely funded by NVIDIA Corporation. Marc Law and Jos Stam completed this work from home during the COVID-19 pandemic.

## References

[1] Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
[2] Henri Anciaux. Minimal Submanifolds in Pseudo-Riemannian Geometry. World Scientific, 2011.
[3] Tianshi Cao, Marc T. Law, and Sanja Fidler. A theoretical analysis of the number of shots in few-shot learning. In International Conference on Learning Representations, 2020.
[4] Manfredo Perdigão do Carmo. Riemannian Geometry. Birkhäuser, 1992.
[5] Wei Chen, Wenjie Fang, Guangda Hu, and Michael W. Mahoney. On the hyperbolicity of small-world and treelike random graphs. Internet Mathematics, 9(4):434–491, 2013.
[6] Aasa Feragen, Søren Hauberg, Mads Nielsen, and François Lauze. Means in spaces of tree-like shapes. In 2011 International Conference on Computer Vision, pages 736–746. IEEE, 2011.
[7] Aasa Feragen, Pechin Lo, Marleen de Bruijne, Mads Nielsen, and François Lauze. Toward a theory of statistical tree-shape analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):2008–2021, 2012.
[8] Tingran Gao, Lek-Heng Lim, and Ke Ye. Semi-Riemannian manifold optimization. arXiv preprint arXiv:1812.07643, 2018.
[9] Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean embedding of co-occurrence data. The Journal of Machine Learning Research, 8:2265–2295, 2007.
[10] Mikhael Gromov. Hyperbolic groups. In Essays in Group Theory, pages 75–263. Springer, 1987.
[11] Marc T. Law, Renjie Liao, Jake Snell, and Richard S. Zemel. Lorentzian distance learning for hyperbolic representations. In Proceedings of the 36th International Conference on Machine Learning, pages 3672–3681, 2019.
[12] Marc T. Law, Jake Snell, Amir-massoud Farahmand, Raquel Urtasun, and Richard S. Zemel. Dimensionality reduction for representing the knowledge of probabilistic models. In International Conference on Learning Representations, 2019.
[13] Marc T. Law, Yaoliang Yu, Matthieu Cord, and Eric P. Xing. Closed-form training of Mahalanobis distance for supervised clustering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3909–3917, 2016.
[14] Marc T. Law, Yaoliang Yu, Raquel Urtasun, Richard S. Zemel, and Eric P. Xing. Efficient multiple instance metric learning using weakly supervised data. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5948–5956, 2017.
[15] John M. Lee. Introduction to Topological Manifolds, volume 202. Springer Science & Business Media, 2010.
[16] John M. Lee. Introduction to Smooth Manifolds, volume 218. Springer Science & Business Media, 2013.
[17] Ernest Lindelöf. Sur l'application de la méthode des approximations successives aux équations différentielles ordinaires du premier ordre. Comptes rendus hebdomadaires des séances de l'Académie des sciences, 116(3):454–457, 1894.
[18] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.
[19] Maximillian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, pages 3779–3788, 2018.
[20] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[21] Barrett O'Neill. Semi-Riemannian Geometry with Applications to Relativity, volume 103. Academic Press, 1983.
[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[23] Peter Petersen. Riemannian Geometry, volume 171. Springer, 2006.
[24] Charles Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 1904.
[25] Makarand Tapaswi, Marc T. Law, and Sanja Fidler. Video face clustering with unknown number of clusters. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[26] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049, 2017.
[27] Jixuan Wang, Kuan-Chieh Wang, Marc T. Law, Frank Rudzicz, and Michael Brudno. Centroid-based deep metric learning for speaker recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3652–3656, 2019.
[28] Joseph Albert Wolf. Spaces of Constant Curvature, volume 372. American Mathematical Society, 1972.
[29] Eric P. Xing, Michael I. Jordan, Stuart J. Russell, and Andrew Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, pages 521–528, 2003.
[30] Wayne W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.