# spacetime_representation_learning__7416e557.pdf Published as a conference paper at ICLR 2023 SPACETIME REPRESENTATION LEARNING Marc T. Law & James Lucas NVIDIA Much of the data we encounter in the real world can be represented as directed graphs. In this work, we introduce a general family of representations for directed graphs through connected time-oriented Lorentz manifolds, called spacetimes in general relativity. Spacetimes intrinsically contain a causal structure that indicates whether or not there exists a causal or even chronological order between points of the manifold, called events. This chronological order allows us to naturally represent directed edges via imposing the correct ordering when the nodes are embedded as events in the spacetime. Previous work in machine learning only considers embeddings lying on the simplest Lorentz manifold or does not exploit the connection between Lorentzian pre-length spaces and directed graphs. We introduce a well-defined approach to map data onto a general family of spacetimes. We empirically evaluate our framework in the tasks of hierarchy extraction of undirected graphs, directed link prediction and representation of directed graphs. 1 INTRODUCTION Most of the machine learning literature has focused on learning representations that lie on a Riemannian manifold such as the Euclidean space, the d-sphere (e.g., ℓ2-normalized representations) (Wang et al., 2017; Tapaswi et al., 2019), hyperbolic geometry to represent graphs without cycles (Nickel & Kiela, 2017), or a statistical manifold in information geometry (Amari, 1998). Concepts of Euclidean geometry, such as distances, are naturally generalized to Riemannian geometry which remains easy to interpret. In contrast, recent approaches have considered learning representations that lie on a pseudo-Riemannian manifold to extract hierarchies in graphs with cycles (Law & Stam, 2020; Law, 2021) or represent directed graphs (Clough & Evans, 2017; Sim et al., 2021). Pseudo-Riemannian manifolds are generalizations of Riemannian manifolds where the constraint of positive definiteness of the nondegenerate metric tensor is relaxed. The machine learning literature on pseudo-Riemannian manifolds can be divided into two categories. The first category focuses on how to optimize a given function whose domain is a pseudo-Riemannian manifold and does not take into account whether the manifold is time-oriented or not (Law & Stam, 2020; Law, 2021). The second category exploits the interpretation of a specific family of pseudo-Riemannian manifolds called spacetimes in general relativity (Clough & Evans, 2017; Sim et al., 2021). More specifically, spacetimes are connected time-oriented Lorentz manifolds. They intrinsically contain a causal structure that indicates whether or not there exists a causal order between points of the manifold, called events. This causal structure has been utilized to represent directed graphs where each node is an event and the existence of an arc (i.e., directed edge) between two nodes depends on the causal character of the curves joining them (Bombelli et al., 1987). In particular, Clough & Evans (2017) consider learning representations via the Minkowski spacetime which is the simplest such manifold. On the other hand, Sim et al. (2021) use three types of spacetimes and propose an ad hoc method based on the sign of some time coordinate difference function to determine the orientation of edges. The sign of such a function is not always meaningful as it for instance alternates periodically when the manifold is non-chronological and does not generalize to all spacetimes. Moreover, the distance function that they optimize is constant when two points cannot be joined by a geodesic. Contributions. We propose a framework inspired by Lorentzian causality theory (Kronheimer & Penrose, 1967; Minguzzi, 2019), and in particular Lorentzian pre-length spaces (Kunzinger & Sämann, 2018), to learn directed graph representations lying on a large family of spacetimes. To this end, we present tools to account for time-orientation and exploit distances specific to Lorentz geometry. In particular, we propose to restrict the existence of edges to pairs of nodes whose representations lie in Published as a conference paper at ICLR 2023 Figure 1: Geodesics of the de Sitter space Sd 1(r) (left) and of the anti-de Sitter space Hd 1(r) (right). an open globally hyperbolic convex normal neighborhood. Such a neighborhood can be defined for any spacetime (see Theorem 2.7 of Minguzzi (2019)) and admits simple distance and time separation functions whose sign determines the direction of edges. We experimentally show that spacetimes can extract hierarchies in social networks better than standard approaches. Our framework also outperforms existing methods in link prediction on graphs with directed cycles. 2 SPACETIME DIFFERENTIAL GEOMETRY We introduce some differential geometry background about spacetimes. Spacetimes have been widely studied, and we refer the reader to Hawking & Ellis (1973); Uhlenbeck (1975); O Neill (1995); Beem et al. (1996); Wolf (2011); Gourgoulhon (2016), Chapter 5-8 & 14 of O Neill (1983). Pseudo-Riemannian Manifold. A d-dimensional pseudo-Riemannian manifold (M, g) is a smooth manifold such that every point x M has a d-dimensional tangent space Tx M whose metric tensor gx : Tx M Tx M R is a nondegenerate symmetric bilinear form (called a scalar product). Nondegeneracy means that v Tx M, gx(u, v) = 0 = u = 0. When the context is clear and to simplify the notation, we write , := gx( , ) to define the metric tensor at x. We also write M instead of (M, g). We write points x M of the manifold in bold serif font, and tangent vectors u Tx M in bold sans-serif font when we want to distinguish them from points. Lorentz manifold. Every tangent space Tx M of a d-dimensional pseudo-Riemannian manifold M admits an orthonormal basis {e1, . . . , ed} that satisfies i, ei, ei = 1 and i = j, ei, ej = 0. The index ν d of M is the number of vectors ei that satisfy ei, ei = 1. If ν = 0, M is Riemannian and its metric tensor is positive definite (i.e., x M, u Tx M, u, u 0 and u, u = 0 u = 0). If ν = 1, M is a Lorentz manifold and Tx M is a Lorentz vector space. Future timecone. A nonzero tangent vector u is called timelike (or chronological), null, spacelike or non-spacelike (or causal) if u, u is negative, zero, positive or nonpositive, respectively. The type into which u falls is called its causal character. If u = 0, then u is spacelike. Every Lorentz tangent space contains two timecones. Some timelike tangent vector t Tx M can arbitrarily be used to define the future timecone as the following set: C+ x (t) := {v Tx M : v, v < 0, t, v < 0} whereas t defines the past timecone C x (t) := C+ x ( t). Two timelike tangent vectors u and v are in the same timecone iff u, v < 0. They belong to different timecones if v = u. Time-orientability and time-orientation. A continuous vector field X is a function that assigns to each point x M a tangent vector of M at x denoted by X(x) Tx M. X and X are timelike if x M, X(x), X(x) < 0. A Lorentz manifold is time-orientable iff there exists a timelike vector field. If M is assigned such a timelike vector field X, it is time-oriented by X. In this case, non-spacelike tangent vectors u at each point x can be divided into two separate classes: future-directed if X(x), u < 0, and past-directed if X(x), u > 0. A curve γx u : I M where I R is defined such that its initial point is γx u(0) = x and its initial velocity is γ x u(0) = u Tx M. We denote it by γ when its initial conditions are clear from Published as a conference paper at ICLR 2023 the context. If its acceleration is zero, then it is called a geodesic. The curve γ is called timelike, null or spacelike if its velocity on the whole domain I is timelike, null or spacelike, respectively. It is called future-directed (or future-pointing) if its velocity is future-directed (and causal) on I. Completeness. The manifolds that we consider in this paper are geodesically complete (i.e., I = R), and the exponential map expx : Tx M M of M at x is defined as expx(u) := γx u(1) where γx u is a geodesic. The maximal normal neighborhood of x is the maximal subset Ux M where the logarithmic map logx := exp 1 x : Ux Tx M is a diffeomorphism, it satisfies Ux := {y M : expx(exp 1 x (y)) = y}. To simplify the notation, we also write xy := logx(y) where y Ux. Ux is convex if y Ux, there exists a unique geodesic totally contained in Ux from x to y. We now present some pseudo-Riemannian manifolds that will be relevant. Their differential geometry tools (e.g., exponential/logarithmic map and parallel transport) are provided in Appendix B. Pseudo-Euclidean space Rd ν. The flat d-dimensional pseudo-Riemannian manifold of index ν is denoted by Rd ν. In particular, Rd 0 is the Euclidean space and Rd 1 is the Minkowski space (or Minkowski spacetime). Since it is a vector space, we can identify its tangent space to the space itself by means of the natural isomorphism Rd ν Tx Rd ν. Rd ν is equipped with the following scalar product: x = (x1 ν, . . . , xd ν) , y = (y1 ν, . . . , yd ν) , x, y ν := i=1 ν xiyi + j=1 xjyj. (1) The maximal normal neighborhood for all x Rd ν is Ux = Rd ν. In special relativity, the first ν elements of x Rd ν are called time coordinates and the other ones are called space coordinates. The pseudo-sphere Sd ν(r) of radius r > 0 is called the de Sitter space when ν = 1 and defined as: Sd ν(r) := {x Rd+1 ν : x, x ν = r2}. (2) It is not time-orientable if d ν is odd, and Ux = {y Sd ν(r) : x, y ν > r2}. The pseudo-hyperbolic space Hd ν(r) is called the anti-de Sitter space when ν = 1 and defined as: Hd ν(r) := {x Rd+1 ν+1 : x, x ν+1 = r2}. (3) The anti-de Sitter space is time-orientable for all d, and Ux = {y Hd ν(r) : x, y ν+1 < r2}. The cylindrical Minkowski space Ld 1(C) := Rd 1/ is a quotient set defined such that x Rd 1 and y Rd 1 are equivalent (i.e., x y) iff i > 0, yi = xi and k Z, y0 = x0 + k C where C > 0 is a circumference hyperparameter. See page 148 of O Neill (1983) for other types of Lorentz cylinders. We have Ux = {y = (y0, . . . , yd 1) Rd 1 : y0 (x0 C/2, x0 + C/2)}. 3 SPACETIME GRAPH REPRESENTATION A spacetime is a connected time-oriented Lorentz manifold M whose points are called events. Informally, time-oriented is often weakened to time-orientable, and Hawking & Ellis (1973) ignore the time-orientability criterion to define spacetimes although it is required to study their causal structure. Our main contribution is to exploit this causal structure via Lorentzian pre-length spaces to represent graphs. In the following, we always assume that M is a spacetime unless stated otherwise. A finite directed graph can be given the structure of a Lorentzian pre-length space (Kunzinger & Sämann, 2018). To define our graphs, we then choose a special type of Lorentzian pre-length space that is easy to optimize. We now provide and follow the definitions of Kunzinger & Sämann (2018). Causal space. Let X be a set with a reflexive and transitive relation , and a transitive relation contained in (i.e., , so x y = x y). Then (X, , ) is called a causal space. This definition is more general than the one given in the seminal work of Kronheimer & Penrose (1967). Following general relativity, the event x M causally (resp. chronologically) precedes the event y M, and we write x < y (resp. x y) iff there exists a future-directed causal (resp. timelike) curve from x to y. This condition might be difficult to verify in general. However, if y is in a convex normal neighborhood of x denoted by Vx Ux (and we always assume in the following that x Vx), we have x < y (resp. x y) iff there exists a nonconstant future-directed causal (resp. timelike) geodesic from x to y (see Proposition 4.5.1 of (Hawking & Ellis, 1973)). We note Published as a conference paper at ICLR 2023 x y x = y or x < y. Any open subset W M of a spacetime is a spacetime in its own right, and the intrinsic causal relations of W imply the corresponding ones in M. In the case where W M is a convex open set, the intrinsic causality of W is as simple as that of Minkowski space (see page 403 of O Neill (1983)). Moreover, both (M, , ) and (W, , ) are causal spaces. Lorentzian pre-length space. Let (X, , ) be a causal space and d a metric on X. Let τ : X X [0, ] be a lower semicontinuous map (w.r.t. the metric topology induced by d) that satisfies the reverse triangle inequality: x, y, z X with x y z, τ(x, z) τ(x, y) + τ(y, z). Suppose that τ(x, y) = 0 if x y and τ(x, y) > 0 iff x y. Then (X, d, , , τ) is a Lorentzian pre-length space and τ is called the time separation function. See details about τ in Section 3.3. 3.1 GRAPH CONSTRUCTION VIA LORENTZIAN PRE-LENGTH SPACES We now explain our methodological contribution, which is the construction of directed graphs from the properties mentioned above. Let us define a directed graph G = (V, E) where V = {vi}n i=1 is the node set and E is the set of arcs. Our goal is to represent each node vi by a point xi M so that there exists an arc from vi to vj (i.e., (vi, vj) E) only if xi xj. We propose to consider subgraphs Gi = (V, Ei) defined such that E = Sn i=1 Ei. Each subgraph Gi is given by the structure of the Lorentzian pre-length space (Vxi, d, , , τ) where Vxi Uxi is an open subset of the maximal normal neighborhood Uxi. More precisely, we consider the chronological future I+(xi, Vxi) := {y Vxi : xi y} of the point xi relative to some given set Vxi Uxi (see page 402 of O Neill (1983)), and we draw an arc from vi to vj iff xj I+(xi, Vxi). Assuming Vxi is a convex normal neighborhood and xj Vxi, we have xj I+(xi, Vxi) only if xixj is defined and future-directed timelike (see Proposition 4.12 of Beem et al. (1996)). Our framework produces a subgraph G = Sn i=1 Gi of the graph G described by the chronological relations of M, and the causal relations between events depend on the choice of M. It is worth noting that for any spacetime, chronological order is transitive (i.e., xi xj and xj xk = xi xk (i.e., xk I+(xi, M)), see Chapter 3.2 of Beem et al. (1996)). vi is then connected by an arc to all the successors of vj in G . We avoid this degenerate case by drawing an arc from vi to vk in G iff xk I+(xi, Vxi). The choice of M and Vxi M for all i is then crucial to construct G G . Existence of directed cycles. Our framework can represent graphs with directed cycles only if M is non-chronological (i.e., there exists at least one closed timelike curve: x M, x x, see Fig. 1), which is the case if M is the anti-de Sitter space Hd 1(r) or the Cylindrical Minkowski space Ld 1(C). Spacetimes that do not contain closed timelike curves are called chronological (i.e., x M, x x) and can represent only Directed Acyclic Graphs (DAGs) in our framework. Some chronological spacetimes such as Sd 1(r) (when d 3 is odd) or Rd 1 are called globally hyperbolic and satisfy: x y = x and y can be joined by a (longest) causal geodesic that is not necessarily unique (see page 66 of Beem et al. (1996)). Nonetheless, if M is Sd 1(r) or Rd 1, we also have y I+(x, M) iff y Ux and xy is future-directed timelike (see page 411 of (O Neill, 1983) for details on conditions). Although DAGs do not contain directed cycles, they can contain undirected cycles. For simplicity, we consider that Vx Ux is the convex normal neighborhood of x that contains points y such that the arc length dγ of the geodesic γ from x = γ(0) to y = γ(1) is smaller than some arbitrary threshold ε (0, ] and can be formulated as dγ(x, y) := p | xy, xy |. We have: I+(x, Vx) = {y Ux : ε2 < xy, xy < 0, xy C+ x (t)}, (4) where C+ x (t) is the future timecone parametrized by some arbitrary timelike tangent vector t Tx M. The motivation is to choose or learn ε small enough that vi is not connected to undesired successors of vj. However, in most of our experiments we simply consider that Vx = Ux (see Section C.4). 3.2 EXAMPLES OF SPACETIMES We first illustrate how to represent directed graphs with the simplest spacetime, which is the (flat) Minkowski space Rd 1. It is used in special relativity which is a special case of the general relativity of a spacetime isometric to R4 1. It is also the geometry induced on each fixed tangent space of an arbitrary Lorentz manifold. It was used in Clough & Evans (2017) to represent DAGs due to its global hyperbolicity and the fact that any pair of points of Rd 1 can be joined by a geodesic. It is worth noting that the Hopf-Rinow theorem does not hold for non-Riemannian manifolds such as spacetimes. Published as a conference paper at ICLR 2023 Therefore for many spacetimes, there exist pairs of points that cannot be joined by a geodesic even if the spacetime is geodesically complete. Working with convex normal neighborhoods Vxi allows us to constrain chronological order between points only via timelike geodesics as explained in Section 3.1. Minkowski spacetime Rd 1. We recall that ν, Rd ν Tx Rd ν. Its geodesic γx y : R Rd ν is γx y(t) := x + ty. The exponential map at x is expx(y) := x + y and its inverse is xy := y x. Rd 1 is time-oriented by the vector field / x0 (e.g., x, y, τ(x, y) := y0 x0, and we should in theory define τ(x, y) := 0 if x y to properly follow the definition of Lorentzian pre-length spaces but we ignore this criterion during training for optimization purpose). Let us define t := (1, 0, . . . , 0) , α = (y0 x0) + q Pd 1 i=1 (yi xi)2 and β := xy, xy = (y0 x0)2 + Pd 1 i=1 (yi xi)2. According to equation 4, we have y I+(x, Vx) iff xy is future-directed timelike (i.e., β < 0 and xy, t < 0, or equivalently α < 0) and dγ(x, y) is smaller than ε (i.e., ε2 < β). There might exist a path but no arc between x y (i.e., y I+(x, M)\I+(x, Vx)) iff xy, t < 0 and β ε2. There exists no path between x and y (i.e., x y and y x) iff β 0. de Sitter spacetime Sd 1(r). The original de Sitter spacetime S4 1(r) is not time-orientable (see Corollary 11.2.6 of Wolf (2011)). However, when d 3, ν > 0 and d ν is even, Sd ν(r) is orientable and time-orientable, the projective space Sd ν(r)/ 1 used in Law (2021) is orientable but not timeorientable (see page 247 of O Neill (1983)). We assume in the following that conditions hold so that Sd ν(r) is a spacetime and we refer the reader to the appendix for details. For any pair of points x Sd ν(r) and y Sd ν(r), xy is defined iff y Ux (i.e., x, y 1 > r2), and xy is timelike iff x, y 1 > r2. Let p := (0, . . . , 0, r) Sd 1(r) denote the positive pole, and Γx p : Tp Sd 1(r) Tx Sd 1(r) denote the parallel transport from Tp Sd 1(r) to Tx Sd 1(r). Since the parallel transport preserves the causal character of any tangent vector v, we can define the future timecone C+ x ( xy) with respect to the timelike tangent vector t := (1, 0, . . . , 0) T p Sd 1(r) as follows: Lemma 3.1. Assuming xy is timelike, xy is future-directed iff Γx p(t) C+ x ( xy) if Γx p is defined (i.e., p Ux), and Γx p(t) C+ x ( xy) otherwise (i.e., p Ux). See proof in Appendix C.1.2. The anti-de Sitter spacetime Hd 1(r) and its projective version Pd 1(r) := Hd 1(r)/ 1 are nonchronological and satisfy x y = y x, which is convenient to represent graphs with directed cycles. Sim et al. (2021) use Hd 1(r) to represent directed graphs but also promote arcs (i.e., causal relation) between pairs of nodes that are not connected by any geodesic, which makes their problem hard to optimize. In this work, we only consider the existence of arcs if there exists a timelike geodesic in the convex normal neighborhood Vx joining two events. See Appendix C for details. A warped product is a manifold (M1 M2, g1 fg2) denoted by M1 f M2 where (M1, g1), (M2, g2) and f : M1 (0, ) is a smooth function called the warping function (e.g., f = 1). Let M = B f F be a Lorentz warped product where B is Lorentz and F is a complete Riemannian manifold. M is time-orientable iff B is (see page 417 of (O Neill, 1983)). M satisfies the chronology, causality or strong causality iff B does. Our framework can then be extended to warped products. 3.3 LORENTZIAN DISTANCE AND LORENTZIAN LENGTH SPACES The Lorentzian distance indicates chronological order (hence causality) between events when it is positive, satisfies the reverse triangle inequality and its squared function is of class C2 on a normal neighborhood. These properties make it ideal for optimization, and it can be used as time separation function. We refer the reader to Chapter 4 of Beem et al. (1996) and Section 5 of Minguzzi (2019). Proposition 4.5.3 of Hawking & Ellis (1973): Let x and y lie in a convex normal neighborhood Vx Ux. If x and y can be joined by a causal curve in Vx, the longest such curve is the unique causal geodesic in Vx from x to y. This is in contrast with Riemannian geometry where the (spacelike) geodesic corresponds to the shortest curve joining points. Since Vx is a convex normal neighborhood, we can define the arc length of such a curve by χV(x, y) := p xy, xy 0. If x y, χV(x, y) is called the Lorentzian distance from x to y on Vx, it corresponds to the elapsed proper time between the events x and y (i.e., as measured by a clock along the geodesic γx xy) (Gourgoulhon, 2016). If we define τ := χV, then (Vx, d, , , τ) is called a Lorentzian length space. In this paper, we consider the squared Lorentzian distance which is defined as y I+(x, Vx) = χ2 V(x, y) = xy, xy . Published as a conference paper at ICLR 2023 In the literature (Hawking & Ellis, 1973; Beem et al., 1996), the (squared) Lorentzian distance is defined such that χ2 V(x, y) = 0 if x and y are not joined by a causal curve due to lack of causality. When evaluating learned representations (i.e., at test time), we therefore consider that χ2 V(x, y) = 0 if this is the case. However, during training and for optimization purpose, we propose to consider: If x y and y x, χ2 V(x, y) = xy, xy if M = Rd ν 2( x, y ν r2) if M = Sd ν(r) 2(| x, y ν+1| r2) if M = Pd ν(r). (5) This makes the function χ2 V differentiable everywhere (except when x, y ν+1 = 0 if M = Pd ν(r)), equal to xy, xy > 0 if xy is timelike, and non-positive otherwise. χ2 V is defined for any pair of points (x, y) by using extrinsic geometry, whether xy is defined or not (see details in Section C.3). 4 RELATED WORK Exploiting spacetimes to represent directed graphs was proposed by Clough & Evans (2017) to extend Multi Dimensional Scaling (MDS) (Kruskal, 1964) to the simplest spacetime Rd 1. As in standard MDS, a target distance matrix between nodes is given as input and the goal of the method is to find the vector representations of nodes that return the best approximation of the target distance matrix when using the squared Lorentzian distance as the dissimilarity function. However, as discussed in Section 3.3, it is difficult to define target distances between pairs of nodes that are not connected. Moreover, the formulation in (Clough & Evans, 2017) considers only Rd 1 and does not account for future or past-direction between pairs of nodes during training. We solve this problem in Section 3.2 by enforcing future timelike geodesics to have their length smaller than some threshold ε (0, ]. Sim et al. (2021) extended Clough & Evans (2017) to the anti-de Sitter space and Lorentz cylinder. Although our motivation is similar, our contributions are methodological, rely on the intuitions of Lorentzian pre-length spaces, and provide an easier interpretation of the learned representations as we explain below. First, Sim et al. (2021) do not address clearly the case when there is no geodesic between pairs of points, and their optimization framework leads to a distance loss term with a zero gradient in this case, which is difficult to optimize. Moreover, their prediction of an arc between a pair of nodes is determined via a Triple Fermi-Dirac (TFD) probability function that accounts for the distance between the nodes, the time coordinate difference t and its opposite value t (i.e., TFD accounts for both the chronological future and past of a given node). In contrast, we restrict the representation of nodes connected by an arc to belong to I+(x, Vx) where Vx is an open convex normal neighborhood, which is simple to interpret and optimize. In general, the Lorentzian distance function from x to y on M is defined to be infinite if M is non-chronological and there exists a closed timelike curve joining x and y. Moreover, the Hopf-Rinow theorem does not hold for spacetimes. Working with convex normal neighborhoods allows us to restrict the existence of arcs between nodes to the existence of geodesics joining events. The definition of time separation functions also becomes straightforward and we use their sign to determine the direction of an edge. Our framework shares similarities with Sim et al. (2021) when M = Rd 1 = Vx because Rd 1 is globally hyperbolic. Otherwise, we do not formulate our time separation function in the same way and we create arcs differently. Further comparisons with Sim et al. (2021) are provided in Appendix E. The negative of the pseudo-Riemannian gradient is generally not a descent direction. We then use the optimization tools introduced in Gao et al. (2018); Law & Stam (2020); Law (2021) as described in Appendix F to learn our representations. Our approach could be extended to time-oriented pseudo Riemannian manifolds of index ν > 1 (see pages 240-242 of (O Neill, 1983)). We also have x y if there exists a future-directed timelike curve joining them. Lemma 3.1 that exploits parallel transport can for instance be generalized to timelike tangent vectors t1, . . . , tν to define different types of arcs. Some Riemannian approaches (Bordes et al., 2013; Ganea et al., 2018; Vilnis et al., 2018) have been used to represent DAGs. Ganea et al. (2018) consider entailment relations where x y means that y is a subconcept of x by constructing cones for each node representation in hyperbolic geometry. The DAGs they consider are directed trees and their underlying undirected graphs are trees (i.e., graphs without cycles). We propose to consider the chronological future and past that are intrinsic to our manifolds and have been studied for decades to define the causal structure. For instance, it is known that causal relations induce partially ordered sets (called causal sets) on causal spacetimes and can then represent DAGs (Bombelli et al., 1987). Our framework can represent DAGs that are not directed trees (see example in Appendix G.1) and is not limited to DAGs (see examples in Appendix G.2). Published as a conference paper at ICLR 2023 Table 1: Link prediction for directed graphs. Median average precision (AP) percentages across 20 random initializations on a held-out test set. Dataset DREAM 5: S. cerevisiae DREAM5 : in silico Manifold dimensionality d 3 5 10 50 100 3 5 10 50 100 Euclidean + FD 33.0 34.2 40.2 44.5 49.0 29.4 32.9 39.7 39.8 34.8 Hyperboloid + FD 29.2 37.9 46.5 48.8 47.9 28.8 46.8 50.8 50.9 52.5 Minkowski + TFD 34.7 38.6 46.4 52.7 54.0 36.3 43.1 51.2 57.7 58.0 Anti de-Sitter + TFD 37.2 41.3 44.9 47.5 49.4 38.1 45.2 51.9 55.6 56.0 Cylindrical Minkowski + TFD 37.4 42.7 46.8 53.4 54.6 41.0 48.4 56.3 58.9 61.0 Cylindrical Minkowski + equation 6 50.0 52.5 55.2 56.2 55.7 52.5 56.5 59.8 60.4 60.8 de Sitter + equation 6 44.8 51.6 55.6 55.3 55.4 48.5 57.4 62.0 60.6 61.1 5 EXPERIMENTS We show how our framework can represent graphs with directed cycles and predict links effectively in Section 5.1. We also show how the causal interpretation of our model can be used to represent hierarchical graphs with cycles in Section 5.2. We provide experiments on DAGs in Appendix D.1. 5.1 LINK PREDICTION ON GRAPHS WITH DIRECTED CYCLES We now consider the link prediction task on the Saccharomyces cerevisiae, in silico and Escherichia coli DREAM5 datasets (Marbach et al., 2012) (see values of hyperparameters, results on Escherichia coli with standard deviations and more discussion in Appendix D.2). We follow the experimental protocol of Sim et al. (2021) by considering only the positive-regulatory nodes from the networks mentioned above while omitting the gene-expression data itself. Each network is randomly split into train and test sets, following 85/15 splits, and a part of the training set is used for validation. Sim et al. (2021) use the Minkowski space Rd 1, the anti de-Sitter space Hd 1(r), and the Cylindrical Minkowski space Ld 1(C). Both Hd 1(r) and Ld 1(C) are non-chronological and can then represent graphs with directed cycles. Sim et al. (2021) design a probability function that they call Triple Fermi-Dirac (TFD) specifically for their Cylindrical Minkowski space. Instead, we propose to define the chronological future I+(x, Vx) of x Ld 1(C) such that if xy is timelike, we have y I+(x, Vx) if k Z, y0 +k C (x0, x0 +C/2). Similarly, the chronological past I (x, Vx) is defined such that if xy is timelike, we have y I (x, Vx) if k Z, y0 + k C (x0 C/2, x0). In detail, we define the time separation function for Ld 1(C) as follows: τ(x, y) := (y0 x0 + C 2 ) mod C C 2 ) where we use the modulo operation for real values which can be written as follows: a mod b := a b a b , and is the floor function. Following our spacetime interpretation, we propose to learn our representations by optimizing: min {xk M}n k=1 X (vi,vj) E log(F(xi, xj)) X (va,vb)/ E log(1 F(xa, xb)) (6) where F(xi, xj) := σm θ1(χ2 V(xi, xj)) σm θ2(τ(xi, xj)), θ1, θ2 > 0 are temperature hyperparameters, m > 0 is an exponent, and σθ(x) := 1/(1 + e x/θ) is the sigmoid function. We refer to Section C.2 and C.3 for the formulations of the time separation function τ and the squared Lorentzian distance χ2 V, respectively. This promotes future-directed timelike geodesics only between nodes connected by an arc. We report in Table 1 the Average Precision scores of our method (i.e., using equation 6) that outperforms baselines reported in Sim et al. (2021). The non-chronological cylindrical Minkowski space obtains a higher performance gap in low-dimensional space. This suggests it accounts for directed cycles better than chronological spacetimes. On the other hand, the gap is reduced in higher dimension except on Escherichia coli where the gap increases (see Table 4). 5.2 HIERARCHY EXTRACTION ON A SOCIAL NETWORK DATASET Pseudo-Riemannian manifolds were introduced in the machine learning literature to extract hierarchies in graphs with cycles (Law & Stam, 2020). Spacetimes are a special family of pseudo-Riemannian manifolds whose time-orientation was not taken into account in Law & Stam (2020). In this paper, we found that spacetimes are able to consistently outperform other pseudo-Riemannian manifold approaches by exploiting the causality interpretation implicit in the squared Lorentzian distance. Published as a conference paper at ICLR 2023 Table 2: Evaluation scores for the different learned representations (mean standard deviation). the lower the metric, the better. the larger the metric (in absolute value), the better. Manifold Problem d(x, y) Rank of 1st leader ( ) Rank of 2nd leader ( ) Top 5 ρ ( ) Top 10 ρ ( ) R2 (Euclidean) equation 7 dγ(x, y) 11.4 4.3 14.0 2.4 0.17 0.70 0.19 0.40 P2 0(r) (hyperbolic) equation 7 dγ(x, y) 7.0 1.4 8.8 1.1 0.47 0.31 0.26 0.13 P2 2(r) (elliptic) equation 7 dγ(x, y) 7.5 0.5 9.1 1.2 0.59 0.03 0.38 0.08 P2 1(r) equation 7 dγ(x, y) 2.2 1.3 6.4 2.2 0.28 0.26 0.55 0.20 P2 1(r) equation 7 χ2 U(x, y) 1.1 0.3 2.8 1.3 0.60 0.23 0.82 0.16 R3 (Euclidean) equation 7 dγ(x, y) 9.4 1.3 11.0 0.5 0.50 0.15 0.38 0.10 P3 0(r) (hyperbolic) equation 7 dγ(x, y) 4.3 1.1 6.1 1.1 0.05 0.39 0.02 0.15 P3 3(r) (elliptic) equation 7 dγ(x, y) 4.3 1.1 5.9 1.4 0.09 0.38 0.11 0.22 P3 1(r) equation 7 dγ(x, y) 1.2 0.4 3.8 1.6 0.59 0.33 0.63 0.13 P3 2(r) equation 7 dγ(x, y) 1.0 0.0 4.3 2.1 0.46 0.20 0.61 0.14 S3 1(r) (de Sitter) equation 7 χ2 U(x, y) 1.5 1.3 4.1 2.0 0.41 0.59 0.58 0.22 P3 1(r) equation 7 χ2 U(x, y) 1.1 0.3 2.7 0.8 0.56 0.29 0.85 0.07 R4 (Euclidean) equation 7 dγ(x, y) 4.6 1.0 6.9 0.7 0.06 0.45 0.04 0.19 P4 0(r) (hyperbolic) equation 7 dγ(x, y) 2.5 0.7 3.8 1.0 0.36 0.22 0.38 0.18 P4 4(r) (elliptic) equation 7 dγ(x, y) 2.5 0.8 3.6 0.7 0.46 0.29 0.38 0.26 P4 1(r) equation 7 dγ(x, y) 1.2 0.4 2.7 0.7 0.62 0.23 0.73 0.12 P4 2(r) equation 7 dγ(x, y) 1.3 0.7 3.1 1.0 0.61 0.28 0.72 0.07 P4 3(r) equation 7 dγ(x, y) 1.2 0.4 4.4 3.0 0.63 0.35 0.63 0.16 P4 1(r) equation 7 χ2 U(x, y) 1.2 0.4 3.0 1.1 0.56 0.34 0.77 0.13 R2 1 (Minkowski) equation 8 χ2 U(x, y) 1.1 0.3 2.6 0.8 0.71 0.19 0.82 0.08 P2 1(r) equation 8 χ2 U(x, y) 1.3 0.5 2.5 0.5 0.75 0.20 0.90 0.07 R3 1 (Minkowski) equation 8 χ2 U(x, y) 1.0 0.0 2.3 0.5 0.70 0.17 0.80 0.11 P3 1(r) equation 8 χ2 U(x, y) 1.3 0.5 2.4 0.5 0.69 0.29 0.90 0.07 S3 1(r) (de Sitter) equation 8 χ2 U(x, y) 1.1 0.3 2.1 0.3 0.86 0.15 0.88 0.05 R4 1 (Minkowski) equation 8 χ2 U(x, y) 1.0 0.0 2.3 0.9 0.70 0.22 0.81 0.13 P4 1(r) equation 8 χ2 U(x, y) 1.1 0.3 2.4 0.5 0.78 0.13 0.91 0.07 Graph dataset. We consider an undirected graph G = (V, E) where V = {vi}n i=1 is the node set and E is the edge set such that (vi, vj) E indicates that vi and vj are connected by an edge. Zachary s karate club dataset (Zachary, 1977) is a social network of n = 34 members of the karate club, each represented by a node vi. The club was split in two factions due to a conflict between the instructor (node v1) and the administrator (node v34) that are the two most important members (i.e., leaders) of the dataset. The other members had to decide between joining the new club created by v1 or stay with v34. Two nodes are joined by an edge (vi, vj) E if the members are friends. In the original ultrahyperbolic approaches (Law & Stam, 2020; Law, 2021), each node vi is represented by a point xi M on some pseudo-Riemannian manifold M. The goal is to learn embeddings {xi}n i=1 so that pairs of nodes joined by an edge (vi, vj) E are closer to each other than pairs of nodes not joined by an edge. To this end, the embeddings are learned in Law & Stam (2020); Law (2021) by minimizing the following problem: min {xk M}n k=1 X (vi,vj) E log e d(xi,xj)/θ e d(xi,xj)/θ + X (va,vb)/ E e d(xa,xb)/θ (7) where the dissimilarity function d(xi, xj) is the arc length dγ(xi, xj) := p | xixj, xixj | of the geodesic joining xi and xj. If M is Riemannian, dγ is the Riemannian distance. In this paper, we propose instead to consider in equation 7 that d is the squared Lorentzian distance χ2 U, as defined in Section 3.3, where Ux is the (convex) maximal normal neighborhood of x M. We also propose to simply enforce the embeddings of vi and vj to be joined by a timelike geodesic iff (vi, vj) E. Since G is an undirected graph, the futureor past-direction is not provided so we do not constrain the future-direction of the geodesics during training. Our problem formulation is then: min {xk M}n k=1 (va,vb)/ E σθ (d(xa, xb)) + λ X (vi,vj) E σθ ( d(xi, xj)) (8) where λ > 0 is a regularization parameter and we set d(xi, xj) = χ2 U(xi, xj). The main goal of equation 8 is to enforce pairs of nodes (vi, vj) E to be joined by a timelike geodesic and pairs (va, vb) / E to be joined by a spacelike geodesic since there is no causality between them. Published as a conference paper at ICLR 2023 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 x1(space) Faction 1 Faction 34 Figure 2: (left) Coordinates of 2-dimensional embeddings x = (x0, x1) learned with equation 8 when M = R2 1. (right) Coordinates of the first three coordinates of embeddings x = (x0, x1, x2, x3) learned with equation 8 when M = S3 1(r). In Lorentz geometry, a timelike geodesic joining two points is the longest timelike curve in a given convex normal neighborhood. This translates in the high-level nodes v1 and v34 being the furthest from the rest of the nodes. The ground truth edges are plotted in yellow and the node color corresponds to the joined faction. A small number of spacelike edges are visible (those edges more than 45 degrees from vertical). Evaluation metrics. We report in Table 2 the results obtained when optimizing equation 7 or equation 8 using different dissimilarity functions. At test time, a score δi = Pn j=1 d(xi, xj) summing pairwise distances is assigned to each node vi and used as an indicator of importance in the hierarchical graph. Following Section 3.3, when the dissimilarity function is the arc length dγ (resp. squared Lorentzian distance χ2 U), we sort the scores δ1, . . . , δn is ascending (resp. descending) order and report the ranks of the two leaders v1 and vn (in no particular order) in the first two columns of Table 2 averaged over 10 different initializations. We also report in Table 2 the Spearmans rank correlation coefficient ρ (Spearman, 1904) between the ordered δi scores and the order of the 5 and 10 most important nodes which are 34, 1, 33, 3, 2, 32, 24, 4, 9, 14 (in that order, see Law & Stam (2020)). Results. In general, the squared Lorentzian distance χ2 U returns better performance than the arc length dγ. Moreover, the optimization problem in equation 8 enforcing timelike geodesics between connected nodes (i.e., causality) and spacelike geodesics between unconnected nodes (i.e., noncausality) extracts the most important nodes in low-dimensional spacetimes more effectively. The predicted ρ also shows higher rank correlation with χ2 U. It is worth noting that hyperbolic geometry was shown relevant to represent graphs without cycles (Gromov, 1987) but our hierarchical graph contains cycles, which explains why geometries other than hyperbolic may be more relevant. The performance gap with non-Riemannian baselines (Law & Stam, 2020; Law, 2021) can be explained by the fact that those baselines compare the arc lengths of different types of geodesics (i.e., timelike or spacelike) that are not necessarily comparable. On the other hand, our framework sets non-causal distances to zero at test time, and compares only the lengths of causal geodesics. Figure 2 illustrates spacetime diagrams of the learned representations for R2 1 and S3 1(r). The most important nodes tend to be further away from the other nodes with respect to the Lorentzian distance. Most ground truth edges are timelike geodesics and belong to the chronological past or future of an edge representation. We report in Appendix D.3 the same kind of hierarchy extraction experiment on a dataset that describes co-authorship information from papers published at NIPS from 1988 to 2003 (Globerson et al., 2007). The number of nodes/authors is |V | = 2715 and the number of edges is |E| = 4733. 6 CONCLUSION We have proposed a general framework to represent data, and in particular nodes of a directed graph, in a spacetime. Compared to previous work, our framework is properly endowed with the structure of a Lorentzian pre-length space, which makes it easy to optimize and applicable to a large family of spacetimes. In the case of hierarchical undirected graphs, we show that we can enforce causality between nodes connected by an edge to extract the most important nodes in the hierarchy. Published as a conference paper at ICLR 2023 ACKNOWLEDGMENTS We thank Yuval Atzmon, Gal Chechik and Haggai Maron for helpful discussions during the initial phase of this project. We also thank Aaron Sim for sharing his source code, Rafid Mahmood, Jos Stam and the anonymous reviewers for valuable feedback on early versions of the manuscript. Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251 276, 1998. doi: 10.1162/089976698300017746. John K Beem, Paul E Ehrlich, and Kevin L Easley. Global lorentzian geometry. Routledge, 1996. Luca Bombelli, Joohan Lee, David Meyer, and Rafael D Sorkin. Space-time as a causal set. Physical review letters, 59(5):521, 1987. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26, 2013. James R Clough and Tim S Evans. Embedding graphs in lorentzian spacetime. Plo S one, 12(11): e0187301, 2017. Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning (ICML), pp. 1646 1655. PMLR, 2018. Tingran Gao, Lek-Heng Lim, and Ke Ye. Semi-riemannian manifold optimization. ar Xiv preprint ar Xiv:1812.07643, 2018. Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg. Overview of the 2003 kdd cup. Acm Sigkdd Explorations Newsletter, 5(2):149 151, 2003. Amir Globerson, Gal Chechik, Fernando Pereira, and Naftali Tishby. Euclidean Embedding of Co-occurrence Data. The Journal of Machine Learning Research, 8:2265 2295, 2007. Eric Gourgoulhon. Geometry and physics of black holes. Institut d Astrophysique de Paris, France, 2016. URL https://luth.obspm.fr/~luthier/gourgoulhon/bh16/. Mikhael Gromov. Hyperbolic groups. In Essays in group theory, pp. 75 263. Springer, 1987. Stephen W Hawking and George Francis Rayner Ellis. The large scale structure of space-time, volume 1. Cambridge university press, 1973. Iaroslav Ispolatov, Pavel L Krapivsky, and Anton Yuryev. Duplication-divergence model of protein interaction network. Physical review E, 71(6):061911, 2005. Erwin H Kronheimer and Roger Penrose. On the structure of causal spaces. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 63, pp. 481 501. Cambridge University Press, 1967. Joseph B Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1 27, 1964. Michael Kunzinger and Clemens Sämann. Lorentzian length spaces. Annals of Global Analysis and Geometry, 54(3):399 447, 2018. Marc T Law. Ultrahyperbolic neural networks. Advances in Neural Information Processing Systems, 2021. Marc T Law and Jos Stam. Ultrahyperbolic representation learning. Advances in Neural Information Processing Systems, 2020. Published as a conference paper at ICLR 2023 Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, and Gustavo Stolovitzky. Wisdom of crowds for robust gene network inference. Nature methods, 9(8):796 804, 2012. Ettore Minguzzi. Lorentzian causality theory. Living reviews in relativity, 22(1):1 202, 2019. Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, pp. 6338 6347, 2017. Barrett O Neill. Semi-Riemannian geometry with applications to relativity, volume 103. Academic press, 1983. Barrett O Neill. The geometry of Kerr black holes. Courier Corporation, 1995. Aaron Sim, Maciej Wiatrak, Angus Brayne, Páidí Creed, and Saee Paliwal. Directed graph embeddings in pseudo-riemannian manifolds. International Conference on Machine Learning (ICML), 2021. Charles Spearman. The proof and measurement of association between two things. The American journal of psychology, 1904. Makarand Tapaswi, Marc T. Law, and Sanja Fidler. Video face clustering with unknown number of clusters. In The IEEE International Conference on Computer Vision (ICCV), October 2019. K. Uhlenbeck. A morse theory for geodesics on a lorentz manifold. Topology, 14(1):69 90, 1975. Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew Mc Callum. Probabilistic embedding of knowledge graphs with box lattice measures. In Annual Meeting of the Association for Computational Linguistics (ACL), 2018. Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1041 1049, 2017. Joseph Albert Wolf. Spaces of constant curvature, volume 372. AMS Chelsea Publishing: An Imprint of the American Mathematical Society, Sixth Edition, 2011. Wayne W. Zachary. An information flow model for conflict and fission in small groups. Journal of anthropological research, 33(4):452 473, 1977. Published as a conference paper at ICLR 2023 A SUMMARY OF SUPPLEMENTARY MATERIAL The supplementary material is structured as follows: Section B gives the differential geometry tools that are general to pseudo-Riemannian manifolds of constant curvature (called space forms). Section C gives the differential geometry tools that are specific to our spacetimes such as the time separation function and the squared Lorentzian distance. Section D presents additional results and further experimental details. Section E is an extended related work section. Section F gives the details of the optimizers we use. Section G gives examples of directed graphs described with our model. B DIFFERENTIAL GEOMETRY OF PSEUDO-RIEMANNIAN SPACE FORMS We provide the necessary differential geometry tools to work on the pseudo-Euclidean space Rd ν, the pseudo-sphere Sd ν(r) and the pseudo-hyperboloid Hd ν(r). Most of them are explained in (Law, 2021; Law & Stam, 2020). We refer the reader to (Hawking & Ellis, 1973; O Neill, 1983; Wolf, 2011). B.1 PSEUDO-EUCLIDEAN SPACE We recall that ν, Rd ν Tx Rd ν. Its geodesic γx y : R Rd ν is γx y(t) := x+ty. The exponential map at x is defined as expx(y) := γx y(1) = x + y and its inverse is xy := exp 1 x (y) = y x. Moreover, x Rd ν, y Rd ν, the parallel transport Γy x : Tx Rd ν Ty Rd ν can be defined as the identity function due to the isomorphism Rd ν Tx Rd ν Ty Rd ν. See page 67 of O Neill (1983). B.2 PSEUDO-SPHERE Sd ν(r) We give here the differential geometry tools specific to the pseudo-sphere which is defined as the following hypersurface: Sd ν(r) := x Rd+1 ν : x, x ν = r2 . The tangent space Tx Sd ν(r) of Sd ν(r) at x can be defined as: Tx Sd ν(r) := {u Rd+1 ν : u, x ν = 0}. In the case of the pseudo-sphere, we have u Tx Sd ν(r), v Tx Sd ν(r), gx(u, v) = u, v = u, v ν. Geodesic. The geodesic γx u : R Sd ν(r) satisfying γx u(0) = x and γ x u(0) = u Tx Sd ν(r) is formulated for all t R: | u,u ν| sin t p r u if u, u ν > 0 x + tu if u, u ν = 0 | u,u ν| sinh t p r u if u, u ν < 0 (9) Exponential map. The exponential map expx : Tx Sd ν(r) Sd ν(r) is defined such that u Tx Sd ν(r), expx(u) = γx u(1). We then have: | u,u ν| sin p r u if u, u ν > 0 x + u if u, u ν = 0 | u,u ν| sinh p r u if u, u ν < 0 (10) Logarithmic map. The logarithmic map logx is the inverse of the exponential map expx on a normal neighborhood of x Sd ν(r) denoted by Ux = {y Sd ν(r) : x,y ν r2 > 1}. It is formulated: y Ux, xy := logx(y) := arccos ( x,y ν r2 x if x,y ν y x if x,y ν arccosh ( x,y ν r2 x if x,y ν Published as a conference paper at ICLR 2023 where arccos is the inverse of the cosine function, and arccosh is the inverse of the hyperbolic cosine function. One can verify from equation 10 and equation 11 that xy is timelike (i.e., xy, xy < 0) iff x, y ν > r2. Geodesic distance . As explained in (Law & Stam, 2020) and Chapter 5 of (O Neill, 1983), when the logarithmic map exists for some pseudo-Riemannian manifold M, the arc length of the geodesic γx xy from x M to y M corresponds to the radius function: p |gx( xy, xy)| where gx : Tx M Tx M R is the metric tensor at x and xy = logx(y) is the logarithmic map. The geodesic distance dγ : Sd ν(r) Sd ν(r) R is then: dγ(x, y) := q | xy, xy ν| = ( r arccosh ( x,y ν r2 ) if x,y ν r2 1 r arccos ( x,y ν r2 ) if x,y ν r2 ( 1, 1) (12) dγ is not a distance metric but a symmetric premetric: it satisfies (i) dγ(x, y) = dγ(y, x) 0 and (ii) dγ(x, x) = 0. In (O Neill, 1983), the minimizing geodesic is defined by its arc length and then also corresponds to the geodesic distance mentioned above. Given the minimizing geodesic γ connecting x to y, the parallel transport Γy x : Tx Sd ν(r) Ty Sd ν(r) is a linear isometry such that u, v, u, v ν = Γy x(u), Γy x(v) ν. The parallel transport along γ from x = γ(0) to y = γ(1) (where x and y satisfy x, y ν > r2) is: Γy x(u) := u y, u ν x, y ν + r2 (y + x) (13) Theorem B.1 (Diffeomorphism (Wolf, 2011)). There exists a diffeomorphism ψ : Sd ν(r) Rν Sd ν 0 (r). Let us note x = t s Sd ν(r) with t Rν and s Rd ν+1 . Let us note z = t v Rν Sd ν 0 (r) where v Sd ν 0 (r). The mapping ψ and its inverse ψ 1 can be formulated: ψ(x) = t r s s and ψ 1(z) = t p where . is the standard Euclidean norm (i.e., s := B.3 PSEUDO-HYPERBOLOID Hd ν(r) We recall the differential geometry tools (from (Law & Stam, 2020)) specific to the pseudohyperboloid which is defined as the set: Hd ν(r) := {x Rd+1 ν+1 : x, x ν+1 = r2}. The tangent space Tx Hd ν(r) of Hd ν(r) at x can be defined as: Tx Hd ν(r) := {u Rd+1 ν+1 : u, x ν+1 = 0}. In the case of Hd ν(r), we have u Tx Hd ν(r), v Tx Hd ν(r), gx(u, v) = u, v = u, v ν+1. Geodesic. The geodesic γx u : R Hd ν(r) satisfying γx u(0) = x and γ x u(0) = u Tx Hd ν(r) is formulated for all t R: | u,u ν+1| sin t p r u if u, u ν+1 < 0 x + tu if u, u ν+1 = 0 | u,u ν+1| sinh t p r u if u, u ν+1 > 0 (15) Exponential map. The exponential map expx : Tx Hd ν(r) Hd ν(r) is defined such that u Tx Hd ν(r), expx(u) = γx u(1). We then have: | u,u ν+1| sin p r u if u, u ν+1 < 0 x + u if u, u ν+1 = 0 | u,u ν+1| sinh p r u if u, u ν+1 > 0 (16) Published as a conference paper at ICLR 2023 Logarithmic map. The logarithmic map logx is defined as the inverse of the exponential map expx on a normal neighborhood of x Hd ν(r) denoted by Ux = {y Hd ν(r) : x,y ν+1 r2 < 1}. It is formulated: y Ux, xy = logx(y) = arccosh ( x,y ν+1 y + x,y ν+1 r2 x if x,y ν+1 y x if x,y ν+1 arccos ( x,y ν+1 1 ( x,y ν+1 y + x,y ν+1 r2 x if x,y ν+1 (17) One can verify that a nonconstant geodesic from x to y is timelike iff x, y ν ( r2, r2) or y = x. Geodesic distance . The geodesic distance dγ : Hd ν(r) Hd ν(r) R is then: dγ(x, y) = q | xy, xy ν+1| = ( r arccosh ( x,y ν+1 r2 ) if x,y ν+1 r2 1 r arccos ( x,y ν+1 r2 ) if x,y ν+1 r2 ( 1, 1) (18) Parallel transport on Hd ν(r). The parallel transport connecting x Hd ν(r) to y Hd ν(r) is: Γy x(u) := u y, u ν+1 x, y ν+1 r2 (y + x) where x, y ν+1 < r2 (19) Theorem B.2 (Diffeomorphism (Wolf, 2011)). There exists a diffeomorphism ψ : Hd ν(r) Sν 0(r) Rd ν. Let us note x = t s Hd ν(r) with t Rν+1 and s Rd ν. Let us note z = u v Sν 0(r) Rd ν where u Sν 0(r) and v Rd ν. The mapping ψ and its inverse ψ 1 can be formulated: and ψ 1(z) = p The anti-de Sitter spacetime Hd 1(r) is non-chronological and satisfies x y = y x, which is convenient to represent graphs with directed cycles. It was used in Sim et al. (2021) to represent directed graphs. Nonetheless, it is worth noting that the problem formulation of Sim et al. (2021) for Hd 1(r) also promotes arcs (i.e., causal relation) between pairs of nodes that are not connected by any geodesic, which makes their problem hard to optimize. Following general relativity, we only consider the existence of arcs if there exists a timelike geodesic in Vx joining two events x and y. See Appendix C for details. B.4 PROJECTIVE SPACE Pd 1(r) The manifold Pd 1(r) := Hd 1(r)/ 1 used in Law (2021) is time-orientable for all d 2 (see page 214 of O Neill (1983)). We refer the reader to (Law, 2021) for details about Pd 1(r) := Hd 1(r)/ 1. C FUTURE DIRECTION, TIME SEPARATION FUNCTION AND LORENTZIAN DISTANCE C.1 FUTURE DIRECTION We discuss in this section how to constrain future direction for our spacetimes. C.1.1 MINKOWSKI SPACE The explanation can be found in Section 3.2 when M = Rd 1. We recall that we defined: t := (1, 0, 0, 0) , xy = y x, xy, xy = xy, xy 1 and α := (y0 x0) + q Pd 1 i=1 (yi xi)2. If α is negative, then xy is timelike (i.e., xy, xy < 0) and xy C+ x (t). By definition, xy is then future-directed timelike. Published as a conference paper at ICLR 2023 C.1.2 DE SITTER SPACE We provide the proof of Lemma 3.1 which is written: when M = Sd 1(r) and xy is timelike, xy is future-directed iff Γx p(t) C+ x ( xy) if Γx p is defined (i.e., p Ux), and Γx p(t) C+ x ( xy) otherwise (i.e., p Ux). We first recall a property from page 146 of (O Neill, 1983): If M is a Lorentz manifold, if a piecewise smooth curve α is timelike, this means not only that α (t) is timelike, but at each break ti of α: α (t i ), α (t+ i ) < 0. (21) Here the first vector derives from α|[ti 1, ti] and the second from α|[ti, ti+1]. Thus α does not switch timecones at the break. Let us note the poles p = (0, . . . , 0, r) Sd 1(r), p = (0, . . . , 0, r) Sd 1(r) and t = (1, 0 . . . , 0) where t Tp Sd 1(r) and t T p Sd 1(r). We also note x := (x0, x1, . . . , xd) Sd 1(r). If xd ( r, r), we show below that equation 21 can be satisfied by defining α such that α(t0) = p, α(t1) = x, α(t2) = p, α (t0) = t, α (t 1 ) = Γx p(t), α (t+ 1 ) = Γx p(t), α (t2) = t. For completeness, we also have t [t0, t 1 ], α (t) = Γα(t) p (t) and t [t+ 1 , t2], α (t) = Γα(t) p (t). Let us arbitrarily consider that t Tp Sd 1(r) defines the future direction of the manifold. For all x Sd 1(r) that satisfies x, p 1 > r2 (i.e., p Ux), we note Γx p(t) the parallel translate of t Tp Sd 1(r) to Tx Sd 1(r). As explained in Appendix B.1.2 of (Law, 2021), it is formulated: Γx p(t) := t x, t 1 x, p 1 + r2 (p + x) = t + x0 rxd + r2 (p + x) (22) We know from page 66 of (O Neill, 1983) that parallel transport (also called parallel translation) is a linear isometry that satisfies: u Tx Sd 1(r), v Tx Sd 1(r), u, v = Γp x(u), Γp x(v) = Γp x(u), Γp x(v) 1 (23) and u = Γx p (Γp x(u)). This implies Γx p(t) C+ x (u) Γp x(u) C+ p (t) (24) Similarly, if x, p 1 > r2 (i.e., p Ux), we have Γx p(t) C+ x (u) Γ p x (u) C+ p(t). Let us assume that x satisfies both p Ux (i.e., x, p 1 > r2) and p Ux (i.e., x, p 1 < r2), this is equivalent to x satisfying xd ( r, r). If xd ( r, r), then Γx p(t) C+ x (u), Γx p(t) C+ x (u) Γx p(t), Γx p(t) 1 < 0. By definition, we have: Γx p(t) := t x, t 1 x, p 1 + r2 ( p + x) = t + x0 rxd + r2 ( p + x) (25) Γx p(t), Γx p(t) 1 = 1 x2 0 rxd + r2 x2 0 rxd + r2 = 1 2 x2 0 (xd + r)( xd + r) < 0 (26) We have shown that Γx p(t) and Γx p(t) have the same future direction by showing that Γx p(t), Γx p(t) 1 < 0. A (piecewise) smooth curve that preserves causality can then be found by using Γx p(t) or Γx p(t). C.1.3 ANTI-DE SITTER SPACE When M = Hd 1(r), xy is future-directed iff Γx p(t) C+ x ( xy) if Γx p is defined (i.e., p Ux), and Γx p( t) C+ x ( xy) otherwise (i.e., p Ux). The proof is similar to the one in Section C.1.2. Let us note the poles p = (r, 0, . . . , 0) Hd 1(r), p = ( r, 0, . . . , 0) Hd 1(r) and t = (0, 1, 0 . . . , 0) where t Tp Hd 1(r) and t T p Hd 1(r). Let x = (x 1, x0, . . . , xd 1) Hd 1(r). We recall that p Ux iff x, p 2 < r2 (i.e., x 1 > r) and p Ux iff x, p 2 < r2 (i.e., x 1 < r). We define x such that x 1 ( r, r) (i.e., p Ux and p Ux). Published as a conference paper at ICLR 2023 Let us note Γx p(t) the parallel translate of t Tp Hd 1(r) to Tx Hd 1(r). As explained in Appendix B.3 of (Law, 2021), it is formulated: Γx p(t) := t x, t 2 x, p 2 r2 (p + x) = t + x0 rx 1 r2 (p + x) (27) We show below that Γx p(t) and Γx p( t) have the same future direction by showing that Γx p(t), Γx p( t) 2 < 0. Γx p( t) := t x, t 2 x, p 2 r2 ( p + x) = t x0 rx 1 r2 ( p + x) (28) Γx p(t), Γx p( t) 2 = 1 + x2 0 rx 1 r2 + x2 0 rx 1 r2 = 1 2 x2 0 r2 x2 1 (29) = r2 x2 1 2x2 0 r2 x2 1 < 0. (30) equation 30 is negative because we have by definition of Hd 1(r): x2 1 x2 0+Pd 1 i=1 x2 i = r2, which implies x2 0 r2 x2 1 > 0, and the denominator is positive because we defined x 1 ( r, r). C.2 TIME SEPARATION FUNCTION We give the formulations we use for the time separation function τ(x, y) that is positive if y is in the chronological future of x. There is no unique way to define it. We assume here that ν = 1. Globally hyperbolic spacetimes. If the spacetime M is globally hyperbolic, there exists a Cauchy time function c : M R (i.e., for any t R, the set Σt = {y M : c(y) = t} is a Cauchy hypersurface) that can be used to define τ(x, y) := c(y) c(x) since it satisfies the reverse triangle inequality τ(x, z) τ(x, y) + τ(y, z) when x y z. With this formulation, it actually satisfies τ(x, z) = τ(x, y) + τ(y, z) when x y z. In theory, to have a Lorentzian pre-length space, one could define τ such that it is 0 if x y, but this is not easy to optimize in practice so we ignore this constraint for optimization purpose. C.2.1 MINKOWSKI SPACE Rd 1 If M = Rd 1 then we arbitrarily define: τ(x, y) := xy, t = xy, t 1 = y0 x0 (31) where t := (1, 0, . . . , 0) defines the future timecone C+ x (t). It is worth noting that the function c : Rd 1 R defined such that x = (x0, . . . , xd 1) Rd 1, c(x) := x0 is a Cauchy time function. C.2.2 DE SITTER SPACE Sd 1(r) Following our discussion in Section C.1.2 and using the formulation of the parallel transport in equation 22, we can formulate the time function as: τ(x, y) := Γp x( xy), t ν if x, p ν 0 Γ p x ( xy), t ν otherwise. (32) where ν = 1. One limitation of equation 32 is that it assumes that xy is defined, which might not be the case (if y / Ux). From the discussion in Chapter 5.2 and Figure 16 (i) of (Hawking & Ellis, 1973) that uses Cartesian coordinates to formulate space coordinates on the pseudo-sphere, the function c : Sd 1(r) R defined such that x = (x0, . . . , xd) , c(x) := x0 is a Cauchy time function. One can then also simply define: τ(x, y) := y0 x0. (33) In practice, we found that using equation 33 returns better performance than equation 32 in the experiments of Section D.1. We then also use it for the experiment of Section 5.1. See Section D.2 for details. Published as a conference paper at ICLR 2023 C.2.3 ANTI-DE SITTER SPACE Hd 1(r) When ν > 0, Hd ν(r) is non-chronological and satisfies x y = y x, which is convenient to represent graphs with directed cycles. There exists a future-directed timelike geodesic from x to y only if xy is timelike or y = x. Following our discussion in Section C.1.3 and using the formulation of the parallel transport in equation 27, we can formulate the time function as: τ(x, y) := Γp x( xy), t ν+1 if x, p ν+1 0 Γ p x ( xy), t ν+1 otherwise. (34) C.2.4 PROJECTIVE VERSION OF THE ANTI-DE SITTER SPACE Pd 1(r) We recall that each point of Pd 1(r) is an unordered pair { x, x} Pd 1(r) where x Hd 1(r). By using the formulation of the parallel transport in equation 27 and assuming ν = 1, we can then define: τ({ x, x}, { y, y}) := Γp x( xy), t ν+1 if x, p ν+1 0 and x, y ν+1 0 Γ p x ( xy), t ν+1 if x, p ν+1 > 0 and x, y ν+1 0 Γp x( x( y)), t ν+1 if x, p ν+1 0 and x, y ν+1 > 0 Γ p x ( x( y)), t ν+1 if x, p ν+1 > 0 and x, y ν+1 > 0 (35) Let us assume that p = (r, 0, . . . , 0) . We can choose x such that x 1 0 so that x, p 2 0 is satisfied, and we can also choose y so that x, y 2 0 is satisfied. We can define t := (0, 1, 0, . . . , 0) Tp Pd 1(r). If we define u = (u 1, u0, . . . , ud 1) = Γp x( xy), then equation 35 can be rewritten: τ({ x, x}, { y, y}) = u0 (36) Another possibility is to use the following time separation function: τ({ x, x}, { y, y}) = u0 i=1 u2 i (37) which is positive only if xy is future-directed timelike. C.2.5 CYLINDRICAL MINKOWSKI SPACE Ld 1(C) As we explain in the main paper, we propose to define the chronological future I+(x, Vx) of x Ld 1(C) such that if xy is timelike, we have y I+(x, Vx) if k Z, y0 +k C (x0, x0 +C/2). Similarly, the chronological past I (x, Vx) is defined such that if xy is timelike, we have y I (x, Vx) if k Z, y0 + k C (x0 C/2, x0). We define the time separation function for Ld 1(C) as follows: τ(x, y) := (y0 x0 + C 2 ) mod C C where we use the modulo operation for real values which can be written as follows: a mod b := a b a b , and is the floor function. C.3 SQUARED LORENTZIAN DISTANCE We give the formulation of the squared Lorentzian distance for the different spacetimes that we use in the main paper depending on the nature of M: If M = Rd 1, χ2 U(x, y) := xy, xy = xy, xy 1 := (y0 x0)2 j=1 (yj xj)2. (39) Published as a conference paper at ICLR 2023 If M = Sd 1(r), χ2 U(x, y) := xy, xy = r2 arccosh2 ( x,y 1 r2 ) if x,y 1 r2 1 2( x, y 1 r2) otherwise. (40) If M = Pd 1(r), χ2 U(x, y) := xy, xy = r2 arccos2 ( | x,y 2| r2 ) if x,y 2 r2 [ 1, 1] 2(| x, y 2| r2) otherwise. (41) If M = Ld 1(C), χ2 U(x, y) := (y0 x0 + C 2 ) mod C C j=1 (yj xj)2. (42) The above equations are equal to 0 when x is equivalent to y, which is the case if y = x in general, or if y = x when M = Pd 1(r), or if y x when M = Ld 1(C). Otherwise, the equations are positive iff there exists at least one timelike geodesic from x to y. If the timelike geodesic from x to y = x is uniquely defined in Ux, we call it xy, and these equations are positive whether xy is future-directed timelike or past-directed timelike. Squared (Lorentzian) distance in Sim et al. (2021). Sim et al. (2021) acknowledge that there exists no geodesic from x to y if M = Hd 1(r) and x,y 2 r2 > 1. Therefore, they consider that their squared Lorentzian distance is equal to π2 in this case to keep their function smooth. This promotes causality between x and y. However, constraining future-direction between x and y when x,y 2 r2 > 1 becomes problematic without the explicit formulation of a timelike curve. This is why they formulate their time function based on a different criterion than ours that uses parallel transport (see Section C.2). Moreover, the fact that their distance function becomes a constant in this case makes it difficult to optimize as the gradient is zero. Instead, we propose to use the manifold M = Pd 1(r) which can represent the same types of relationships between points as Hd 1(r) since Pd 1(r) contains elliptic and hyperbolic parts (Law, 2021), and any pair of points of Pd 1(r) can be joined by a geodesic. C.4 ABOUT THE CHOICE OF ε IN EQUATION 4 We explain here why our formulation of equation 4 is general and can be extended to other spacetimes. As explained in Theorem 2.7 of Minguzzi (2019), an open globally hyperbolic convex normal neighborhood Vx can be defined for any point x M where M is a spacetime. As we mention in Section 3, any open subset of a spacetime is a spacetime. Therefore, Vx is a spacetime that can be chosen to be globally hyperbolic. We recall that a globally hyperbolic spacetime is strongly causal, and that the chronological future I+(x, Vx) is the set of points y Vx that satisfy x y (i.e., there exists a timelike curve from x to y). Since we define Vx to be a convex normal neighborhood, we know from Proposition 4.5.3 of Hawking & Ellis (1973) that any pair of its points x, y that satisfy x y can be joined by a unique futuredirected timelike geodesic γx xy, which is also the unique longest curve joining x to y. Since γx xy is future-directed timelike, it satisfies xy, xy < 0 and xy, t < 0 (i.e., xy C+ x (t)) where the timelike tangent vector t Tx M defines the future direction of M. Moreover, the arc length of the geodesic γx xy from x to y is p xy, xy , and is called its Lorentzian distance. From Theorem 5.6 and Lemma 5.7 of Minguzzi (2019), there exists a strongly causal open set containing x such that its Lorentzian distance with any other point y in that set has its Lorentzian distance p xy, xy upper bounded by some constant ε > 0. We can then define that open set to be the convex normal neighborhood Vx and we can choose ε > 0 such that all the points y Vx that satisfy x y also satisfy ε2 < xy, xy < 0. Since Vx is a subset of the maximal normal neighborhood Ux, we then obtain exactly the definition of equation 4, which is: I+(x, Vx) = {y Ux : ε2 < xy, xy < 0, xy C+ x (t)}. In our experiments, we consider that Vx = Ux. When M is the Minkowski space Rd 1 or the de Sitter space Sd 1(r), we then have ε = + but we describe some case where ε might be finite in Section 3.2. Similarly, we have ε = rπ/2 when M = Pd 1(r), and ε = C/2 when M = Ld 1(C). Published as a conference paper at ICLR 2023 Table 3: Preservation of chronological order between pairs of articles with one citing the other. |V | R2 1 R5 1 R9 1 P2 1(r) P5 1(r) P9 1(r) τ equation 31 equation 31 equation 31 equation 35 equation 35 equation 35 200 93.7% 94.0% 94.2% 73.3% 73.7% 74.9% 1000 91.5% 91.8% 92.0% 72.2% 72.4% 72.3% |V | S3 1(r) S5 1(r) S9 1(r) S3 1(r) S5 1(r) S9 1(r) τ equation 32 equation 32 equation 32 equation 33 equation 33 equation 33 200 93.8% 93.8% 93.7% 94.4% 94.9% 94.9% 1000 89.1% 89.2% 89.5% 92.8% 93.4% 93.6% D ADDITIONAL EXPERIMENTS AND EXPERIMENTAL DETAILS We now report additional experiments and provide experimental details. Setup. We ran all our experiments on a single desktop with 64 GB of RAM, a 6-core Intel i7-7800X CPU and a NVIDIA Ge Force RTX 3090 GPU. D.1 CHRONOLOGICAL ORDER IN DIRECTED GRAPHS Our goal in this subsection is to represent a directed graph with spacetimes. As in Clough & Evans (2017), we select the 200 and 1000 most cited papers in the Arxiv High-energy physics theory (HEP-TH) citation network (Gehrke et al., 2003). HEP-TH is originally a dataset of 27, 770 papers (each represented by a node) with 352, 807 edges. The graph contains an arc from vi to vj if paper i cites paper j. When selecting the 200 or 1000 most cited papers, the graph is not a Directed Acyclic Graph (DAG) as there exist pairs of papers that cite each other. We ignore these pairs of arcs. To simplify the notation, we write vi vj either if there exists a path from vi to vj, or the exists an arc from vi to vj but not from vj to vi (i.e., there can exist a longer path from vj to vi). We also write va vb if there exists no path from va to vb or from vb to va (i.e., va vb vb va). We optimize the problem: min {xk M}n k=1 va vb σθ1 χ2 U(xa, xb) + λ X σθ1 χ2 U(xi, xj) + σθ2 ( τ (xi, xj)) (43) where σθ(x) := 1/(1 + e x/θ) is the sigmoid function, θ1, θ2 > 0 are temperature parameters, λ is a regularization parameter and τ(x, y) is a time separation function that is positive (resp. negative) if y is in the chronological future (resp. past) of x. For instance, if M = Rd 1 then we arbitrarily define τ(x, y) := xy, t = y0 x0, where t := (1, 0, . . . , 0) defines the future timecone C+ x (t). In this experiment, our temperature hyperparameter values are θ1 = θ2 = 1, and we fix the radius to r = 1. We run our experiments for 108 iterations with a step size of 10 6 by using the optimization tools of (Law, 2021; Law & Stam, 2020). The regularization parameter λ is set to λ = |Ec| |E| where |E| is the number of pairs that satisfy vi vj and |Ec| is the number of pairs that satisfy va vb. We report in Table 3 how well the learned representations manage to preserve chronological order when we select the |V | = 200 or 1000 most cited papers. For instance if M = Rd 1, we report the percentage of pairs of nodes vi vj represented by xi and xj that satisfy xixj, t < 0. The chronological manifolds Rd 1 and Sd 1(r) manage to predict chronological order better than the nonchronological manifold Pd 1(r). This suggests that chronological spacetimes are more appropriate to represent graphs that are almost DAGs. The time separation function in equation 33 returns better performance than equation 32, we then use it in the rest of our experiments. Figure 3 illustrates the embeddings learned when M = R2 1. Article representations tend to satisfy the chronological order along the time coordinate. Published as a conference paper at ICLR 2023 20 10 0 10 20 x1(space) 199309145 199409089 199602022 199602052 199701025 199703030 Figure 3: 2-dimensional representations of the Arxiv High-energy physics theory (HEP-TH) citation network (Gehrke et al., 2003) when |V | = 200 and M = R2 1. The first four characters of the labels correspond to the year of submission on Arxiv, and the following two characters are the month of publication. For instance, the label 199208027 corresponds to the article submitted in August 1992 available at: https://arxiv.org/abs/hep-th/9208027. Similarly, 199409089 is the article available at https://arxiv.org/abs/hep-th/9409089 and submitted in September 1994. The embeddings more or less follow the chronological order (along the time axis) in which the articles were submitted to Arxiv. Published as a conference paper at ICLR 2023 D.2 DIRECTED LINK PREDICTION In this section, we provide details about the experiments in Section 5.1. We report in Table 4 the link prediction scores on the Saccharomyces cerevisiae, in silico and Escherichia coli DREAM5 datasets (Marbach et al., 2012). Following the evaluation protocol of (Sim et al., 2021), we report the median + standard deviation across 20 random initialization. We use the same training and test splits as (Sim et al., 2021). We use the following hyperparameters on the DREAM5 datasets to define equation 6: if M = Sd 1(r), r = 1, θ1 = 0.15, exponent m = 1, θ2 = 0.03, learning rate = 10 5, number of epochs = 2000. if M = Ld 1(C), C = 8, θ1 = 0.15, θ2 = 0.03, exponent m = 1, learning rate = 10 3, number of epochs = 2000. if M = Rd 1, θ1 = 0.15, θ2 = 0.03, exponent m = 1, learning rate = 10 3, number of epochs = 2000. When d = 10, 50 or 100, Sd 1(r) is not time-oriented (see explanation in Section 3.2). We report the scores obtained in these dimensionalities to be fair with baselines. We also reran the experiments of Minkowski + TFD on the Escherichia coli DREAM 5 dataset. It is worth noting that the DREAM5 datasets contain a relatively small number of cycles: The Saccharomyces cerevisiae DREAM 5 dataset contains 9 nodes that are part of at least one directed cycle (0.5% of 1,994 nodes) and 19 edges that are part of at least one directed cycle (0.5% of 3,940). The Escherichia coli DREAM 5 dataset contains 18 nodes that are part of at least one directed cycle (1.7% of 1,081 nodes) and 23 edges that are part of at least one directed cycle (1.1% of 2,066). The in silico DREAM 5 dataset contains 28 nodes that are part of at least one directed cycle (1.8% of 1,565 nodes) and 39 edges that are part of at least one directed cycle (1.0% of 4,012). The choice of a specific manifold acts as some inductive bias. When the manifold is chosen to be chronological, the created graph does not necessarily contain directed cycles but allows their existence. On the other hand, chronological manifolds are more appropriate for Directed Acyclic Graphs as they ensure that the created graph does not contain directed cycles. From our results, it seems that the (nonchronological) Cylindrical Minkowski spacetime obtains much better performance in the low-dimensional case as its lack of causality allows some flexibility that is less beneficial in the high-dimensional case. Sim et al. (2021) also use the synthetic Duplication-Divergence (Dupdiv) dataset (Ispolatov et al., 2005), but their dataset contains only 100 edges and 1,026 edges. We generated a bigger version of Dupdiv that contains 1,000 edges and 26,649 edges (22,651 for training/validation and 3,998 for test) following the same setup. More precisely, 748 nodes (74.8%) are part of at least one directed cycle and 22,409 edges are part of at least one directed cycle (84.1% of 26,649). Sim et al. (2021) obtain their best performance on their Dupdiv dataset with the Cylindrical Minkowski + Triple Fermi-Dirac (TFD). We compare it on our larger dataset with Cylindrical Minkowski + equation 6. The results are reported in Table 4 and show a consistent gain of 2% mean Average Precision by using a proper time separation function. We use the following hyperparameters for Dupdiv: M = Rd 1, θ1 = 0.4, θ2 = 0.07, exponent m = 0.5, learning rate = 0.02, number of epochs = 500. Published as a conference paper at ICLR 2023 Table 4: Link prediction for directed graphs. Median average precision (AP) percentages across 20 random initializations on a held-out test set. Dataset DREAM 5: Saccharomyces cerevisiae Manifold dimensionality d 3 5 10 50 100 Euclidean + FD 33.0 2.7 34.2 2.8 40.2 3.3 44.5 3.5 49.0 2.0 Hyperboloid + FD 29.2 2.5 37.9 1.3 46.5 1.6 48.8 1.4 47.9 1.2 Minkowski + TFD 34.7 2.2 38.6 1.9 46.4 3.1 52.7 3.0 54.0 2.5 Anti de-Sitter + TFD 37.2 3.2 41.3 1.5 44.9 2.5 47.5 3.1 49.4 3.3 Cylindrical Minkowski + TFD 37.4 3.2 42.7 2.3 46.8 3.5 53.4 2.2 54.6 2.1 Minkowski + equation 6 47.6 1.1 51.3 1.5 54.4 1.1 54.7 2.0 54.8 1.3 Cylindrical Minkowski + equation 6 50.0 1.7 52.5 1.4 55.2 1.5 56.2 1.4 55.7 1.7 de Sitter + equation 6 44.8 2.1 51.6 1.6 55.6 1.3 55.3 1.4 55.4 1.4 Dataset DREAM5 : in silico Manifold dimensionality d 3 5 10 50 100 Euclidean + FD 29.4 2.1 32.9 2.5 39.7 1.8 39.8 1.6 34.8 1.1 Hyperboloid + FD 28.8 5.5 46.8 4.6 50.8 7.4 50.9 1.5 52.5 1.5 Minkowski + TFD 36.3 2.3 43.1 3.1 51.2 3.0 57.7 2.8 58.0 2.7 Anti de-Sitter + TFD 38.1 4.8 45.2 2.3 51.9 5.2 55.6 4.2 56.0 3.4 Cylindrical Minkowski + TFD 41.0 3.6 48.4 7.3 56.3 8.4 58.9 2.9 61.0 1.9 Minkowski + equation 6 48.4 1.2 49.4 1.1 51.6 1.2 58.1 2.1 58.8 1.1 Cylindrical Minkowski + equation 6 52.5 1.9 56.5 1.6 59.8 1.5 60.4 1.5 60.8 1.3 de Sitter + equation 6 48.5 1.9 57.4 1.5 62.0 1.4 60.6 1.6 61.1 1.4 Dataset DREAM5 : Escherichia coli Manifold dimensionality d 3 5 10 50 100 Euclidean + FD 33.0 3.9 34.2 3.4 40.2 4.3 44.5 2.6 49.0 3.2 Hyperboloid + FD 43.4 4.1 47.2 3.3 52.7 1.9 53.6 1.4 50.6 0.7 Minkowski + TFD 43.8 2.0 50.9 2.3 57.7 2.1 58.4 2.3 58.3 2.1 Anti de-Sitter + TFD 42.7 3.7 56.5 2.6 61.8 6.8 63.3 4.8 63.0 7.5 Cylindrical Minkowski + TFD 50.3 3.3 56.8 3.4 62.3 3.3 65.8 3.4 63.2 2.4 Minkowski + equation 6 55.9 2.1 57.2 1.8 58.1 1.9 58.8 1.1 59.1 1.2 Cylindrical Minkowski + equation 6 60.9 1.8 64.0 2.4 67.5 2.3 70.1 1.4 70.4 2.1 de Sitter + equation 6 58.1 2.8 62.4 2.3 62.7 1.5 63.4 1.3 62.1 1.6 Dataset Duplication-Divergence (1000 edges) Manifold dimensionality d 3 5 10 50 100 Cylindrical Minkowski + TFD 55.5 0.6 64.7 1.3 69.8 1.4 70.2 1.0 70.7 0.8 Cylindrical Minkowski + equation 6 58.7 1.3 66.9 1.1 72.2 1.1 72.4 1.2 72.1 1.0 D.3 HIERARCHY EXTRACTION For both equation 7 and equation 8, we set θ = 10 2, r = 1 and train the model for 105 iterations/epochs. In equation 8, the regularization parameter λ is set to λ = |Ec| |E| where |E| is the number of pairs that satisfy (vi, vj) E and |Ec| is the number of pairs that satisfy (va, vb) / E. We report scores for the NIPS dataset in Table 5. We report the Spearmans rank correlation coefficient ρ (Spearman, 1904) for all the authors (left), the authors with at least 10 coauthors (middle) and authors with at least 20 coauthors (right). Spacetimes return better performance for the subset of authors with at least 10 coauthors. Published as a conference paper at ICLR 2023 Table 5: Evaluation scores for the different learned representations (mean standard deviation). the lower the metric, the better. the larger the metric, the better. Manifold Problem d(x, y) Whole dataset ρ ( ) top si 10 ρ ( ) Top si 20 ρ ( ) R4 equation 7 dγ(x, y) 0.469 0.512 0.217 P4 0(r) equation 7 dγ(x, y) 0.460 0.490 0.292 P4 4(r) equation 7 dγ(x, y) 0.629 0.552 0.316 P4 1(r) equation 7 dγ(x, y) 0.667 0.493 0.307 P4 2(r) equation 7 dγ(x, y) 0.625 0.441 0.227 P4 3(r) equation 7 dγ(x, y) 0.437 0.493 0.387 S3 1(r) equation 8 χ2 U(x, y) 0.369 0.536 0.663 R4 1 equation 8 χ2 U(x, y) 0.524 0.668 0.484 P4 1(r) equation 8 χ2 U(x, y) 0.538 0.326 0.143 S5 1(r) equation 8 χ2 U(x, y) 0.373 0.498 0.618 R6 1 equation 8 χ2 U(x, y) 0.478 0.678 0.543 P6 1(r) equation 8 χ2 U(x, y) 0.576 0.455 0.219 Figure 4: Coordinates of embeddings { x, x} P2 1(1) R3 2 learned with equation 8. Published as a conference paper at ICLR 2023 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 x1(space) Faction 1 Faction 34 Figure 5: (top) Coordinates of 2-dimensional embeddings x = (x0, x1) learned with equation 8 when M = R2 1. (bottom) Coordinates of the first three coordinates of embeddings x = (x0, x1, x2, x3) learned with equation 8 when M = S3 1(r). In Lorentz geometry, a timelike geodesic joining two points is the longest timelike curve in a given convex normal neighborhood. This translates in the high-level nodes v1 and v34 being the furthest from the rest of the nodes. The ground truth edges are plotted in yellow and the node color corresponds to the joined faction. A small number of spacelike edges are visible (those edges more than 45 degrees from vertical). Published as a conference paper at ICLR 2023 E EXTENDED RELATED WORK To help explain the contributions of our work relative to prior art, we use this section to provide a more detailed comparison of our contributions to that of Sim et al. (2021). Sim et al. (2021) extended Clough & Evans (2017) to the anti-de Sitter space and Lorentz cylinder. Although our motivation is similar, our contributions are methodological, rely on a simpler use of the intuitions of general relativity and Lorentzian pre-length spaces, and provide an easier interpretation of the learned representations as we explain below. First, Sim et al. (2021) do not address clearly the case when there is no geodesic between pairs of points, and their optimization framework leads to a distance loss term with a zero gradient in this case, which make it difficult to optimize. Moreover, in Sim et al. (2021), the prediction of an arc between a pair of nodes is determined via a Triple Fermi-Dirac (TFD) probability function that accounts for the squared (Lorentzian) distance between the nodes, the time coordinate difference t and its opposite value t. In other words, TFD accounts for both the chronological future and past (with different weights) of a given node. The major methodological difference with Sim et al. (2021) is that we restrict the representation of nodes connected by an arc to belong to I+(x, Vx) where Vx is a convex normal neighborhood. Although subtle, this difference makes the optimization and interpretation of results easier. In general, the Lorentzian distance function from x to y on M is defined to be infinite (i.e., χM(x, y) = + ) if M is non-chronological and there exists a closed timelike curve joining x and y. Moreover, the Hopf-Rinow theorem does not hold for spacetimes. We propose to work with convex normal neighborhoods, which allows us to restrict the existence of arcs between nodes to the existence of geodesics joining events. Our distance function χV is called a local distance function (see Definition 4.25 of Beem et al. (1996)) when its domain is restricted to a convex normal neighborhood. Moreover, χV is continuous and differentiable on Vx I+(x, Vx) (see Lemma 4.26 of Beem et al. (1996)). In some cases, it might be easier to optimize its squared function χ2 V which is of class C2 on Vx Vx (see Theorem 2.6 of Minguzzi (2019)). It is worth noting that Vx can be defined to be globally hyperbolic for any spacetime (see Theorem 2.7 of Minguzzi (2019)). This means that Vx admits a Cauchy time function that can be used to define a time separation function τ (see explanation in Section C.2) whose sign defines the direction of edges. Our framework shares similarities with Sim et al. (2021) when M = Rd 1 = Vx because Rd 1 is globally hyperbolic and any pair of points of Rd 1 can be joined by a geodesic. However, the way we define the time separation τ (instead of using the same t) is different when M = Ld 1(C) because we restrict it to be calculated on the maximal convex normal neighborhood. We construct τ so that it is positive if xj I+(x, Vx) and negative if xj I (x, Vx). We then enforce τ to be positive if we want an arc from vi to vj. The fact that we work only with an open convex set instead of the whole manifold M is crucial because xj can belong to both the chronological future I+(xi, M) := {y M : xi y} and past I (xi, M) := {y M : y xi} if the M is non-chronological. Using the entire manifold requires that the sign of t is not as informative, as in Sim et al. (2021). Our approach allows us to determine the direction of the arc joining vi and vj only via the sign of τ. We also define a general way of optimizing τ via the parallel transport (see Appendix C.2) instead of working only with Cartesian coordinates, which is not meaningful for some spacetimes such as Pd 1(r). Moreover, since we restrict our Lorentzian distances to be calculated in the convex normal neighborhood Vx, we also have the nice interpretation that the Lorentzian distance corresponds to the length of the longest causal curve joining points. One other contribution is the connection of our work with the theory of Lorentzian pre-length spaces (Kunzinger & Sämann, 2018) which does not require notions of differential geometry to be understood and can be applied to discrete topological spaces (see Example 2.16 of Kunzinger & Sämann (2018)). The framework proposed by Sim et al. (2021) is not a Lorentzian pre-length space due to their formulation of their time coordinate difference function that does not satisfy the properties of a time separation function (especially when M is non-chronological). Published as a conference paper at ICLR 2023 Algorithm 1 Pseudo-Riemannian optimization input: differentiable function f : M R to be minimized, some initial value of x M 1: while not converge do 2: Calculate f(x) i.e., f(x) is the Euclidean gradient of f at x in the Euclidean ambient space 3: χ Πx (GΠx(G f(x))) 4: x expx( ηχ) where η > 0 is a step size 5: end while F OPTIMIZATION The optimizers that we use in the main paper for the different spacetimes can all be optimized as described in Algorithm 1. The goal is to minimize some function f : M R by using differential geometry tools described in (Gao et al., 2018; Law, 2021; Law & Stam, 2020). f(x) is the Euclidean gradient of f at x, Πx(z) is the orthogonal projection of an arbitrary vector z onto Tx M, and G is an involutory matrix (i.e., G 1 = G). We consider in the following that the step size η > 0 is fixed and given. We give details for the different spacetimes that we consider in the main paper. F.1 MINKOWSKI SPACETIME Rd 1 This optimizer was explained in (Gao et al., 2018). When M = Rd 1, Πx is the identity function. G is the diagonal matrix with the first diagonal element equal to 1 and the remaining ones equal to 1. expx(y) := x + y. Algorithm 1 corresponds to the standard Euclidean gradient descent because χ = f(x) in this case. F.2 CYLINDRICAL MINKOWSKI SPACETIME Ld 1(C) We recall that Ld 1(C) = Rd 1/ , a quotient set defined such that x Rd 1 and y Rd 1 are equivalent (i.e., x y) iff i > 0, yi = xi and k Z, y0 = x0 + k C where C > 0 is a circumference hyperparameter. When M = Ld 1(C), we use the same optimizer as in Section F.1. Although it is optional, we also reproject the time coordinate of x = (x0, . . . , xd 1) at the end of each iteration as follows: x0 (x0 + C 2 ) mod C C 2 ) where we use the modulo operation for real values which can be written as follows: a mod b := a b a b , and is the floor function. If the initial value of x0 is not in [ C 2 ), this projects x to an equivalent point. F.3 DE SITTER SPACE Sd 1(r) This optimizer was introduced in (Law & Stam, 2020). We recall that Sd 1(r) := {x Rd+1 1 : x, x 1 = r2}. G is the diagonal matrix with the first diagonal element equal to 1 and the remaining ones equal to 1. We have: Πx(z) := z z,x 1 x,x 1 x = z z,x 1 r2 x. The exponential map is defined in equation 10. F.4 ANTI-DE SITTER SPACE Hd 1(r) We recall that Hd 1(r) := {x Rd+1 2 : x, x 2 = r2}. G is the diagonal matrix with the first two diagonal elements equal to 1 and the remaining ones equal to 1. We have: Πx(z) := z z,x 2 r2 x. The exponential map is defined in equation 15. F.5 PROJECTIVE VERSION OF THE ANTI-DE SITTER SPACE Pd 1(r) The neural network optimizer is given in (Law, 2021). Otherwise, the embedding optimizer is the same as in Section F.4. The main difference is how the points are compared to calculate the distance and time separation function. Published as a conference paper at ICLR 2023 G EXAMPLES OF GRAPHS Since our general framework is fairly abstract, we give some explicit examples of directed graphs with or without cycles that can be described by our framework. G.1 DIRECTED ACYCLIC GRAPHS (DAGS) Figure 6: DAG containing an undirected cycle. We first consider a directed acyclic graph that consists of an undirected cycle (Figure 6). The DAG G = (V, E) is defined as V = {vi}6 i=1 and the set of arcs is E = {(v1, v4), (v2, v4), (v2, v5), (v3, v5), (v1, v6), (v3, v6)}. One can see that G is not a directed tree since the undirected path v1, v4, v2, v5, v3, v6, v1 is cyclic. We recall that if M is globally hyperbolic, M is also chronological and can then only describe DAGs with our framework. Let us give an example where M = Rd+1 1 or M = Sd 1(r) Rd+1 1 with d = 3 and r > 0 (e.g., r = 1). Both Rd+1 1 and Sd 1(r) are globally hyperbolic. Let us consider any value ε > 0, a = r + ε and b = 2a2 r2. We take the embeddings: x1 = (0, r, 0, 0) , x2 = (0, 0, r, 0) , x3 = (0, 0, 0, r) , x4 = (b, a, a, 0) , x5 = (b, 0, a, a) , x6 = (b, a, 0, a) . When M = Sd 1(r), we know from Section B.2 that xixj is timelike (i.e., xixj, xixj < 0) iff xi, xj 1 > r2. By using the time separation function in equation 33 and assuming xixj is timelike, we only need to compare the first coordinate of xi and xj to determine the direction of the edge between vi and vj. The case M = Rd+1 1 is similar except that xixj := xj xi is timelike iff xixj, xixj 1 < 0. One can verify that i, j, (vi, vj) E xj Uxi and the tangent vector xixj is future-directed timelike. G.2 GRAPHS WITH DIRECTED CYCLES G.2.1 MINKOWSKI CYLINDER Ld 1(C) Figure 7: A simple directed cycle. We now illustrate a simple example of graph with directed cycle that can be drawn with Ld 1(C) (Figure 7). The graph G = (V, E) is defined as V = {vi}3 i=1 and E = {(v1, v2), (v2, v3), (v3, v1)}, which is a graph with directed cycle. Let us consider that C = 3 and d = 2. The maximal normal neighborhood of every point x Rd 1 is Ux = {y = (y0, . . . , yd 1) Rd 1 : y0 (x0 1.5, x0 + 1.5)}. We also consider that Vx = Ux. Since Ld 1(C) is a quotient set, its points are equivalence classes. We consider three equivalence classes [xi] := {(i + 3k, 0) : k Z} where i {1, 2, 3} and we define xi := (i, 0) Rd 1. In other words, the five points x0 = (0, 0) , x1 = (1, 0) , x2 = (2, 0) , x3 = (3, 0) , x4 = (4, 0) in Rd 1 actually correspond to three points in Ld 1(C) due to the equivalence relation (i.e., x3 x0 and x4 x1). We can then compare those three equivalence classes by comparing xi with xi 1 and xi+1 only. One can verify that i {1, 2, 3}, we have xi+1 Uxi, xi+1 I+(xi, Uxi), and xi 1 Uxi. However, we also have xi 1 / I+(xi, Uxi). We then obtain the graph illustrated in Figure 7. Published as a conference paper at ICLR 2023 G.2.2 PROJECTIVE VERSION OF THE ANTI-DE SITTER SPACE Pd 1(r) Figure 8: A graph with directed cycles. We now consider the graph G = (V, E) defined as V = {vi}4 i=1 and E = {(v1, v4), (v2, v1), (v3, v1), (v3, v2), (v4, v2), (v4, v3)}. This is a graph with directed cycles (e.g., v1 v4 v2 v1, see Figure 8). We recall that every point of Pd 1(r) can be written as the unordered pair { x, x} where x Hd 1(r) Rd+1 2 . Taking d = 2, the maximal normal neighborhood of every point { x, x} can be written U{ x,x} = {{ y, y} : y Hd 1(r), x, y 2 < 0}. Let us consider r = 1, ε1 = 0.1, ε2 = ε3 = 0.5, a = 1, b = p (r + ε1)2 r2 + a2, c = 2, e = p (r + ε2)2 r2 + c2, g = 1, h = p (r + ε3)2 r2 + g2. We construct the four following points: x1 = p = (r, 0, 0) , x2 = (r + ε1, a, b) , x3 = (r + ε2, c, e) , x4 = (r + ε3, g, h) . To define the future direction, we consider the timelike tangent vector t = (0, 1, 0) Tp Hd 1(r). In our framework, there exists an edge between xi and xj iff | xi, xj 2| (0, r2). The direction of the edge is determined by using Section C.2.4.