Representing Hyperbolic Space Accurately using Multi-Component Floats

Tao Yu
Department of Computer Science, Cornell University
tyu@cs.cornell.edu

Christopher De Sa
Department of Computer Science, Cornell University
cdesa@cs.cornell.edu

Abstract

Hyperbolic space is particularly useful for embedding data with hierarchical structure; however, representing hyperbolic space with ordinary floating-point numbers greatly degrades performance due to unavoidable numerical errors. Simply increasing the precision of floats fails to solve the problem and incurs a high computational cost for simulating greater-than-double-precision floats on hardware such as GPUs, which do not support them. In this paper, we propose a simple, feasible-on-GPUs, and easy-to-understand solution for numerically accurate learning in hyperbolic space. We do this with a new approach that represents hyperbolic space using multi-component floating-point (MCF) numbers in the Poincaré upper-half space model. Theoretically and experimentally, we show that our model has small numerical error, and on embedding tasks across various datasets, models represented with multi-component floating-point numbers gain more capacity and run significantly faster on GPUs than prior work.

1 Introduction

Representation learning is of particular interest in machine learning today. Real-world data is often discrete, such as natural language sentences in the form of syntax trees and graphs [7, 22], which makes it critical to represent and store these objects as numbers that we can compute with and reason about. Perhaps the most common way of doing this is with embedding, in which discrete objects are assigned points in some continuous (typically high-dimensional Euclidean) space. Standard embeddings in NLP include Word2Vec [24] and GloVe [30]. However, many kinds of real-world data are hierarchical or graph-like [25], in which case embedding into non-Euclidean space can outperform the standard Euclidean embedding. Hyperbolic space is such an embedding target of particular interest, showing impressive performance on many tasks such as embedding taxonomies, link prediction, and inferring concept hierarchies [27, 28, 15].

Hyperbolic space [9] is helpful for many types of embeddings because of its constant negative curvature, as distinguished from Euclidean space with zero curvature. This negatively curved geometry endows hyperbolic space with many useful properties not present in Euclidean space. For example, the volume of a ball in even two-dimensional hyperbolic space increases exponentially with respect to the radius, as shown in [9], in contrast to the polynomial rate at which balls grow in Euclidean space. This makes it possible to embed more instances in a hyperbolic ball than in a Euclidean ball of the same radius, making it ideal for structures like trees, where the number of objects at distance d from any given object can grow exponentially with d. Following this intuition, hyperbolic space looks like a finite-degree tree and can be regarded as a continuous version of trees [6]. The combination of low dimension (which facilitates learning) with high volume (which increases representational capacity) makes hyperbolic space inherently well-suited for embedding hierarchically structured data like trees and graphs.

However, the advantages of hyperbolic space do not come for free.
The large volumes that are so beneficial from an embedding-capacity perspective create unavoidable numerical problems. Informally called the "NaN problem" [37, 34], representing hyperbolic space with ordinary floating-point numbers can cause significant errors and lead to severe numerical instability that compounds throughout training, causing NaNs to appear everywhere in the computations. Common models of hyperbolic space used to learn hyperbolic embeddings with floating-point arithmetic, such as the Poincaré ball model [27, 15] and the Lorentz hyperboloid model [28], all suffer from significant numerical errors. These errors originate in the ordinary floating-point representation and are further amplified in subsequent computations by the very ill-conditioned Riemannian metrics involved in their construction. Yu and De Sa [37] show that in the Lorentz hyperboloid model, the error of representing an exact point with ordinary floating-point numbers becomes unbounded as the point moves away from the origin, which is what happens in most embedding tasks: points unavoidably move far away from the origin as more and more data instances crowd in.

Many efforts and compromises have been made to solve or alleviate the NaN problem. One commonly adopted technical solution, first exploited in [35], is to use an appropriate scaling factor: scale all distances down by the scaling factor before embedding, then scale the embedded distances back up to recover the original distances. However, this scaling increases the distortion of the embedding, and the distortion gets worse as the scale factor increases. Sala et al. [34] show that this sort of tradeoff is unavoidable when embedding in hyperbolic space: one can either choose to increase the number of bits used for the floating-point numbers or increase the dimension of the space, and this holds independently of the underlying model. However, increasing the dimension is not a panacea, as the NaN problem still exists in high-dimensional space (albeit to a lesser degree).

A straightforward solution to the NaN problem would be to compute with high-precision floating-point numbers, such as BigFloats (specially designed with a large number of bits), as adopted in the combinatorial construction of hyperbolic embeddings in [34], where floating-point numbers with thousands of bits are used with the support of Julia [4]. However, there are two evident problems with representing hyperbolic space using BigFloats. Firstly, BigFloats are not supported on ML accelerator hardware like GPUs: computations using BigFloats on CPUs will be slow and hence not realistic for many meaningful tasks. Secondly, BigFloats are currently not supported in most deep learning frameworks even on CPUs, so one would need to develop the algorithms from scratch to learn over hyperbolic space, which is tedious and fraught with the danger of numerical mistakes.

A hybrid solution to the NaN problem proposed by Yu and De Sa [37] is to avoid floating-point arithmetic as much as possible by using integer arithmetic instead, since integer arithmetic can be done exactly. Their tiling-based model works by decomposing hyperbolic space into a tiling: the product of a compact fundamental region (which can be represented in floating-point with bounded error) and a discrete group (which supports integer computation).
We can also interpret their approach as decomposing a large floating-point number into a product of a small floating-point number and a large integer using group actions [3, 33]. This allows for a guaranteed bound on the numerical error of learning in hyperbolic space. However, their approach is limited somewhat in that its guarantees apply only to loss functions that depend only on the hyperbolic distance between points, and their best-performing model (their L-tiling model) is limited to 2-dimensional space. A bigger issue with this type of approach is computational: the big-integer matrix multiplication involved in the model is not supported on GPUs and is not yet implemented in many deep learning frameworks such as PyTorch [29] and TensorFlow [1]. This effectively prevents the method from being used to train complex models that depend on GPU acceleration, such as hyperbolic neural networks [14, 18, 10, 23].

In this paper, we propose a more direct way to represent and learn in hyperbolic space without sacrificing either accuracy guarantees or GPU acceleration. To achieve high-precision computations and controllable numerical errors, we choose to represent hyperbolic space with multiple-component floating-point numbers (MCF) [31, 32, 19, 36], an alternative to BigFloats in which a number is expressed as an unevaluated sum of multiple ordinary floating-point numbers. We make the following contributions:

- We propose representing hyperbolic space with the Poincaré upper-half space model for optimization in section 3, and show how to represent it with MCF in section 4 to eliminate the NaN problem.
- We provide algorithms for computing with these MCF that are adapted to hyperbolic space in section 5 and section 6, where we prove that numerical errors can be reduced to any desired degree by simply increasing the number of components of the MCF.
- We experimentally evaluate our model on embedding tasks across several datasets, and show in section 7 that one can gain more learning capacity (compared to models that use ordinary floating-point numbers) with only a mild computational slowdown on GPUs, while running significantly faster than prior high-precision hyperbolic models.

2 Related Work

The non-Euclidean geometry of hyperbolic space offers many distinctive properties useful in machine learning. It is well-suited to modeling hierarchical structures such as trees and more complex graphs: for instance, Sala et al. [34] show that a tree can be embedded into 2-dimensional hyperbolic space with arbitrarily low distortion, and Gromov [16] shows that any finite tree can be embedded into a finite hyperbolic space such that distances are approximately preserved. Nickel and Kiela [27] propose to embed symbolic data into the Poincaré ball model of hyperbolic space to learn hierarchical representations, where the Poincaré ball embeddings significantly outperform Euclidean embeddings on hierarchical data in terms of both representation capacity and generalization ability. Subsequently, Ganea et al. [15] further improve the empirical results of Poincaré ball embeddings on directed acyclic graphs by theoretically grounding entailment using the hierarchical relations. Later work uses other models of hyperbolic space [28, 17] besides the Poincaré ball model for various purposes. For example, the Lorentz hyperboloid model, used in Nickel and Kiela [28] to learn continuous hierarchies from large-scale unstructured data, was observed to be substantially more efficient and numerically stable.
Along a different direction, hyperbolic embeddings can be used in many downstream tasks by simply treating the hyperbolic embedding as a layer in a neural network, such as hyperbolic neural networks [14] for textual entailment and noisy-prefix recognition tasks, hyperbolic attention networks [18] for neural machine translation and visual question answering tasks, and hyperbolic graph networks [10, 23] for link prediction and node classification tasks. Although widely adopted in many applications, the NaN problem in hyperbolic representations can be observed easily in practice and generally remains unsolved. Sala et al. [34] suggest that hyperbolic embeddings can have high quality on hierarchies but require large dimensions or high precision for graphs with long chains: they suggest alleviating the NaN problem by using scaling factors together with BigFloats. Yu and De Sa [37] introduced tiling-based models, the first theoretical solution to the NaN problem with bounded-everywhere numerical error. The tiling-based models enable deriving high-precision embeddings without using high-precision floating-point numbers.

In the literature on arbitrary-precision algorithms, there are two different ways to achieve high-precision computations, and BigFloat is not the only option. The first one is straightforward: the multiple-digit format, which consists of a sequence of digits coupled with a single exponent. Examples in this category include the computation package Mathematica [20], the BigFloat format in Julia [4], and the Fortran multiple-precision arithmetic packages [8, 2]. The other option is the multiple-component format, in which an exact number is written as an unevaluated sum of ordinary floating-point numbers, each with its own significand and exponent [11, 31, 32, 13].

3 The NaN Problem

Hyperbolic space can be modelled with several well-known isomorphic models. Previous works have proposed to optimize over hyperbolic space with the Poincaré ball model [27] and the Lorentz hyperboloid model [28]. In this paper, we propose to optimize within the Poincaré upper-half space model instead, because its properties allow for localization of potential sources of numerical error. The Poincaré upper-half space model is the manifold $(U^n, g^u)$, where $U^n = \{x \in \mathbb{R}^n : x_n > 0\}$ is the upper half space of n-dimensional Euclidean space. The metric and corresponding distance function are
$$g^u_x = \frac{g^e}{x_n^2}, \qquad d_u(x, y) = \operatorname{arcosh}\left(1 + \frac{\|x - y\|^2}{2\, x_n y_n}\right),$$
where $g^e$ is the Euclidean metric.

Figure 1: Averaged representation error d(x, fl(x)) as x moves away from the origin in halfspace models (curves: Halfspace, 2MCs-Halfspace, 3MCs-Halfspace).

Figure 2: Numerical error d(y, y') of the RSGD algorithm as x moves away from the origin in halfspace models (curves: Halfspace, 2MCs-Halfspace).

Notably, the tangent space of the upper-half space model is the same as the ambient vector space, which makes it easy and numerically stable to derive Riemannian gradients without projecting onto the tangent space. In the two-dimensional case, the Poincaré upper-half space model can be interpreted as a mapping on the complex plane (the half-plane model).

Numerical error analysis. Yu and De Sa [37] show that in the Lorentz hyperboloid model of hyperbolic space, the error of representing an exact point x with ordinary floating-point fl(x) is unbounded and proportional to the distance to the origin d(x, O).
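For concreteness, the following minimal NumPy sketch (our own illustration, not the authors' released code) evaluates the distance function d_u above with ordinary float64; the subtraction x − y it performs is exactly the operation identified below as the source of catastrophic cancellation when points drift far from the origin.

```python
import numpy as np

def halfspace_distance(x, y):
    """d_u(x, y) = arcosh(1 + ||x - y||^2 / (2 x_n y_n)) in the upper-half
    space model U^n; x, y are arrays with positive last coordinate."""
    diff = x - y                    # cancellation-prone when x, y are far from the origin
    sq = np.dot(diff, diff)         # ||x - y||^2
    return np.arccosh(1.0 + sq / (2.0 * x[-1] * y[-1]))

# Two nearby points whose horizontal coordinates are large while x_n, y_n are
# tiny, i.e. points far from the reference point (0, ..., 0, 1) of U^n.
x = np.array([1.0e8, 1.0e-8])
y = np.array([1.0e8 + 1.0e-7, 1.0e-8])
print(halfspace_distance(x, y))
```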
Here, as a baseline to compare our MCF methods to, we show that the same sort of representation error d(x, fl(x)) occurs when representing points in the Poincaré upper-half space model with ordinary floating-point numbers.

Theorem 1 The representation error of storing a particular point $x \in U^n$ using floating point $\mathrm{fl}$ is $\delta_{\mathrm{fl}}(x) = d_u(x, \mathrm{fl}(x))$, and the worst-case representation error, defined as a function of the distance-to-origin d in the Poincaré upper-half space model, is
$$\delta_d := \max_{x \in U^n,\; d_u(x, O) \le d} \delta_{\mathrm{fl}}(x) = \operatorname{arcosh}\left(1 + \epsilon_{\mathrm{machine}}^2 \cosh^2(d)\right),$$
where $\epsilon_{\mathrm{machine}}$ is the machine epsilon of the underlying floating-point arithmetic. This becomes $\delta_d = 2\epsilon_{\mathrm{machine}} + o(\epsilon_{\mathrm{machine}})$ if $d < \log(1/\epsilon_{\mathrm{machine}})$ and $\delta_d = 2d + 2\log(\epsilon_{\mathrm{machine}}) + o(\epsilon_{\mathrm{machine}}^{-1}\exp(-2d))$ if $d \ge \log(1/\epsilon_{\mathrm{machine}})$.

In Figure 1, we illustrate the NaN problem in practice by plotting the averaged representation error, i.e., the distance error d(x, fl(x)) when x is represented as fl(x) using ordinary floating point. The error explodes at a distance of around 55 from the origin, which will then potentially cause NaNs in subsequent computations.

Identifying potential numerical errors. Optimizing x in the Poincaré upper-half space model will typically push x away from the origin, with $x_i$, $i \ne n$, being large floating-point numbers and $x_n$ being a floating-point number close to 0. Hence, the subtraction of two big floating-point numbers in $\|x - y\|^2$ of the distance calculation leads to catastrophic cancellation and accounts for a large part of the numerical errors. We can localize this part of the numerical errors easily in simple subtractions in the Poincaré upper-half space model. (For comparison, in the Lorentz model [28], the multiplication in the Lorentzian scalar product accounts for the major numerical errors in distance computations.)

RSGD in the Poincaré upper-half space model. We are interested in gradient-based optimization within the Poincaré upper-half space model, hence we provide a Riemannian stochastic gradient descent (RSGD) algorithm for learning in hyperbolic space. For each $z \in U^n$ in the Poincaré upper-half space model, let $f : U^n \to \mathbb{R}$ be a smooth real-valued loss function over the manifold, and denote by $\nabla_z f$ the Euclidean gradient of f w.r.t. z. To do RSGD in the Poincaré upper-half space model, we first derive the Riemannian gradient in the tangent space from the Euclidean gradient using the pull-back metric, computed as $\operatorname{grad}_z f = z_n^2 \nabla_z f$; then, taking the learning rate η into account, we denote $v = -\eta \operatorname{grad}_z f$. Next, we need the exponential map Exp in the Poincaré upper-half space model, which takes a tangent vector v in the tangent space at z to a point on the manifold $U^n$ via the following arithmetic:
$$z'_i = z_i + \frac{z_n \sinh(s)}{\sqrt{v^\top v}\,\cosh(s) - v_n \sinh(s)}\, v_i, \quad i < n, \qquad (1)$$
$$z'_n = \frac{z_n \sqrt{v^\top v}}{\sqrt{v^\top v}\,\cosh(s) - v_n \sinh(s)},$$
where $s = \sqrt{v^\top v}/z_n$ is the norm of v in the tangent space at z. We also provide a more numerically stable form of this exponential map in the Appendix. As an example, assume an exact point x is given; we do RSGD following the above arithmetic and get the accurate y as $y = \operatorname{Exp}_x(-\eta \operatorname{grad}_x f)$. We can derive a $y'$ using ordinary floating-point arithmetic and compute the distance error between the exact y and $y'$. In Figure 2, we show how this error $d_u(y, y')$ varies as the starting point x gradually moves away from the origin. At a distance of around 60, the halfspace RSGD algorithm becomes numerically unstable, with numerical errors large enough to cause NaNs.
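To make the procedure above concrete, here is a minimal NumPy sketch (ours, not the authors' implementation) of one RSGD step, applying the Riemannian gradient rescaling and the exponential map of Equation (1) in the algebraic form given above; it uses ordinary float64 and omits the numerically stabilized variant mentioned for the Appendix.

```python
import numpy as np

def halfspace_exp(z, v):
    """Exponential map Exp_z(v) in the upper-half space model, Equation (1).
    z: point in U^n with z[-1] > 0; v: tangent vector in ambient coordinates."""
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        return z.copy()
    s = norm_v / z[-1]                            # tangent-space norm of v at z
    denom = norm_v * np.cosh(s) - v[-1] * np.sinh(s)
    out = z + (z[-1] * np.sinh(s) / denom) * v    # correct for coordinates i < n
    out[-1] = z[-1] * norm_v / denom              # last coordinate handled separately
    return out

def rsgd_step(z, egrad, lr):
    """One RSGD step: the Riemannian gradient is z_n^2 times the Euclidean gradient."""
    v = -lr * (z[-1] ** 2) * egrad
    return halfspace_exp(z, v)
```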
In the Poincaré upper-half space model, the major part of the numerical error in the exponential map arises in Equation (1), when a big floating-point number is added to a small floating-point gradient term. This lets us localize the numerical errors within a simple addition operation in the Poincaré upper-half space model; in comparison, for the exponential map in the Lorentz model [28], multiple operations cause numerical errors. Hence, we propose to use multiple-component floating-point numbers in the Poincaré upper-half space model to avoid the numerical errors in addition and subtraction, rather than basing our methodology on other models like the Lorentz model.

4 Multiple-Component Floating-point (MCF)

Currently, most high-precision algorithms are developed with one of two approaches: (1) the standard multiple-digit BigFloat approach, involving a sequence of digits coupled with a single exponent, and (2) the multiple-component format (MCF), which uses unevaluated sums of multiple ordinary floating-point numbers. The multiple-digit approach can compactly represent a much larger range of numbers, whereas the multiple-component approach has an advantage in speed, as it better takes advantage of existing floating-point accelerators. Since speed and scalability are critical for ML applications, MCF is to be preferred here. In this section, we detail how MCF works.

For a machine which supports p-bit floating-point arithmetic, exact arithmetic often produces values that require more than p bits to express and store. In the multi-component floating-point format, each arbitrary-precision value x is expressed as an expansion [36]:
$$x = x_n + \cdots + x_2 + x_1,$$
where each $x_i$ is called a component of x and is represented by a floating-point number with a p-bit significand. As a structural requirement of this expansion, all components are ordered by magnitude (from largest $x_n$ to smallest $x_1$). Moreover, every two different components $x_i, x_j$ are required to be non-overlapping: the least significant nonzero bit of $x_i$ is more significant than the most significant nonzero bit of $x_j$, or vice versa.

The multiple-component floating-point format allows roundoff error to occur when performing exact arithmetic, then accounts for it afterwards by adding more components. To measure roundoff quickly and correctly, the algorithms presented herein rely on the assumption that ordinary floating-point operations are performed with standard round-to-nearest rounding. MCF allows the boundaries between components to wander freely: boundaries are still enforced but can fall at any bit position. For example, the sum $2^{300} + 2^{-300}$ will be represented by $(2^{300}, 2^{-300})$ in the multiple-component floating-point format. In this sense, the multiple-component approach may store such numbers with fewer bits, as two words of memory, whereas the multiple-digit approach would use at least 601 bits to store the sum [36].
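To make this example concrete, here is a tiny float64 demonstration (our own illustration, not part of the paper's algorithms): a single double must round the small term away, while the unevaluated two-component pair preserves the exact sum.

```python
# A two-component expansion: the unevaluated pair (2^300, 2^-300) represents
# the value 2^300 + 2^-300 exactly in just two machine words.
hi, lo = 2.0 ** 300, 2.0 ** -300

# A single float64 cannot hold the sum: the small term is rounded away entirely.
assert hi + lo == hi

# The expansion keeps both pieces; its components do not overlap in bits.
expansion = (hi, lo)
print(expansion)    # (2.037...e+90, 4.909...e-91)
```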
From Figure 1, we see that the NaN problem occurs at a distance of around 50 from the origin. Thus, for the representation of hyperbolic space, the range of distances is not the main concern; rather, the issue is that when points are far from the origin, a small error in magnitude causes a high numerical error in subsequent computations. Multiple-component floats, which can have very small numerical errors, will thus be better suited to represent hyperbolic space than the multiple-digit format. Multiple-component floating-point arithmetic can also be faster than multiple-digit arithmetic, because the latter requires expensive normalization of results to fixed digit positions. MCF can moreover be implemented to run efficiently on GPUs, because its unevaluated-sum structure is amenable to parallelism. This allows learning applications that use MCF to enjoy not only accurate results but also the hardware acceleration of GPUs.

5 Computing with MCF

We provide some basic numerical operations for computing with multiple-component floating-point numbers. To start with, we consider the sum of two p-bit floating-point numbers to form a non-overlapping expansion.

Algorithm 1: Two-Sum
Input: p-bit floats a, b, where p ≥ 3
  x ← fl(a + b)
  b_virtual ← fl(x − a)
  a_virtual ← fl(x − b_virtual)
  b_roundoff ← fl(b − b_virtual)
  a_roundoff ← fl(a − a_virtual)
  y ← fl(a_roundoff + b_roundoff)
Return: (x, y)

Theorem 2 [21, 36]. For floating-point numbers a, b, Alg. 1 produces a non-overlapping expansion (x, y) such that a + b = x + y, where x is an approximation to a + b and y is the roundoff error. In particular, when exact rounding is adopted, the roundoff error y is bounded as
$$|y| \le \tfrac{1}{2}\operatorname{ulp}(\mathrm{fl}(a + b)), \qquad |y| \le \min(|a|, |b|),$$
where ulp(x) is the unit of least precision of x.

This theoretical bound on the roundoff error y plays an important role in computations involving Alg. 1 (most MCF algorithms). Now we can derive an expansion by adding two p-bit floating-point values with this algorithm; next, we need to add a single floating-point number to an expansion. Alg. 2 shows how to add a single p-bit value to any arbitrary-precision value expressed as an expansion.

Algorithm 2: Grow-Expansion
Input: m-component p-bit expansion e, p-bit float b
  initialize Q_0 ← b
  for i = 1 to m do
    (Q_i, h_i) ← Two-Sum(Q_{i−1}, e_i)
  end for
  h_{m+1} ← Q_m
Return: h

Theorem 3 [36] Let $e = \sum_{i=1}^m e_i$ be a non-overlapping expansion of m p-bit components (p ≥ 3), sorted in increasing order of magnitude, except that any of the $e_i$ may be zero. Let b be a p-bit (p ≥ 3) value. Then Alg. 2 produces a non-overlapping (m+1)-component expansion $h = \sum_{i=1}^{m+1} h_i = e + b$, where the components are also in increasing order, except that any of them may be zero.

Notably, $Q_i$ in Alg. 2 can serve as an approximate sum of b and the first i components of e. As a result of Alg. 2, adding a value to an m-component expansion grows the expansion, outputting an expansion with m + 1 components. However, if there are no constraints on the number of components, then different values will have highly irregular and varying lengths, which will slow down the computations and make the algorithms much more complicated. Hence, in our implementation, rather than letting the number of components grow without constraint, we fix the number of components for all expansions we use. With Alg. 3, we can renormalize an (m + 1)-component expansion back to an m-component expansion. Priest [32] proves that if the input expansion of Alg. 3 does not overlap by more than 51 bits, then the algorithm works correctly.
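For convenience, here is a direct Python transcription (ours) of Algorithms 1 and 2. It assumes IEEE-754 double precision with round-to-nearest, which is what the guarantees above require; any names beyond those in the pseudocode are our own.

```python
from typing import List, Tuple

def two_sum(a: float, b: float) -> Tuple[float, float]:
    """Alg. 1 (Two-Sum): returns (x, y) with x = fl(a + b) and x + y = a + b exactly."""
    x = a + b
    b_virtual = x - a
    a_virtual = x - b_virtual
    b_roundoff = b - b_virtual
    a_roundoff = a - a_virtual
    y = a_roundoff + b_roundoff
    return x, y

def grow_expansion(e: List[float], b: float) -> List[float]:
    """Alg. 2 (Grow-Expansion): add a single float b to an expansion e whose
    components are in increasing order of magnitude; returns an (m+1)-component
    expansion, again in increasing order, whose exact sum is e + b."""
    q = b
    h = []
    for e_i in e:
        q, h_i = two_sum(q, e_i)
        h.append(h_i)
    h.append(q)  # the last component is the approximate total sum
    return h

# The roundoff that an ordinary addition discards is kept in the second output:
print(two_sum(2.0 ** 60, 1.0))                 # (1.152921504606847e+18, 1.0)
print(grow_expansion([1.0, 2.0 ** 60], 3.0))   # [0.0, 4.0, 1.152921504606847e+18]
```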
This condition holds when ordinary floating-point numbers are used. Table 1 summarizes the MCF algorithms used in this paper, some of which are detailed in the Appendix. Note that, firstly, these MCF algorithms can easily be vectorized to apply to high-dimensional vectors; secondly, they are linear in the number of components (quadratic for Scale-Expansion) and in the dimension, and hence enjoy a great speedup.

Table 1: MCF algorithms used in the paper
Operation       | Function                             | Example Usage
Two-Sum         | sum of two floats                    | most MCF algorithms
Grow-Expansion  | add a float to an expansion          | RSGD, update parameters
Add-Expansion   | sum of two expansions                | distance calculation
Scale-Expansion | multiply a float with an expansion   | gradient calculation, RSGD
Renormalize     | reduce # components of an expansion  | RSGD, reduce # components

6 Learning using MCF

Algorithm 3: Renormalize [32]
Input: (m+1)-component expansion (a_0, a_1, ..., a_m) in decreasing order
  initialize s ← a_m, k ← 0
  for i = m down to 1 do
    (s, t_i) ← Two-Sum(a_{i−1}, s)
  end for
  for i = 1 to m do
    (s, e) ← Two-Sum(s, t_i)
    if e ≠ 0 then
      b_k ← s
      s ← e
      k ← k + 1
    end if
  end for
Return: (b_0, b_1, ..., b_{m−1})

We solve the NaN problem in the Poincaré upper-half space model by using multiple-component floating-point arithmetic in the addition and subtraction computations. As analyzed in Sec. 3, we only need to use m-multiple-component floating-point numbers for the first n − 1 coordinates of the model, denoted as m-xMC-Halfspace. We mostly compute with MCF arithmetic (Table 1) for the additions and subtractions involved in the computations, and leave the rest of the computations to ordinary floating-point arithmetic. For example, we compute the distance between $x, y \in U^n$ as:
$$d_u(x, y) = \operatorname{arcosh}\left(1 + \frac{\big\|\text{Add-Expansion}(x, -y)_1\big\|^2}{2\, x_n y_n}\right),$$
where we add the two expansions x and −y to get the summed expansion and then take the first, largest component (denoted by the subscript 1) as an approximation of x − y. Furthermore, in the Appendix we show how to do RSGD using multi-component floating-point numbers in the m-xMC-Halfspace model. When we compute with the introduced multi-component floating-point numbers in the model, we show the worst-case numerical representation error theoretically in Theorem 4.

Theorem 4 The worst-case representation error of storing an exact point $x \in U^n$ with an m-multi-component floating-point expansion $(x^{(m)}, \ldots, x^{(2)}, x^{(1)})$, defined as a function of the distance-to-origin d, is
$$\delta_d = \operatorname{arcosh}\left(1 + \epsilon_{\mathrm{machine}}^2 + \frac{\epsilon_{\mathrm{machine}}^{2m}(1 + \epsilon_{\mathrm{machine}})}{2^{2m}} \cosh^2(d)\right),$$
where $\epsilon_{\mathrm{machine}}$ is the machine epsilon. This becomes $\delta_d = 2\epsilon_{\mathrm{machine}} + o(\epsilon_{\mathrm{machine}})$ if $d < m \log(1/\epsilon_{\mathrm{machine}})$ and $\delta_d = 2d + 2m \log(\epsilon_{\mathrm{machine}}/2) + o(\epsilon_{\mathrm{machine}}^{-2m}\exp(-2d))$ if $d \ge m \log(1/\epsilon_{\mathrm{machine}})$.
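The following sketch (ours) illustrates the computation pattern described above for the m-xMC-Halfspace distance. It reuses the two_sum and grow_expansion helpers from the sketch in Section 5 and substitutes repeated Grow-Expansion calls for the paper's Add-Expansion routine (which is detailed in their Appendix), so it should be read as an assumption-laden illustration rather than the authors' implementation.

```python
import numpy as np

def expansion_sum(e, f):
    # Stand-in for Add-Expansion: fold the components of f into e one at a time
    # with grow_expansion (defined in the Section 5 sketch); the sum stays exact.
    h = list(e)
    for f_i in f:
        h = grow_expansion(h, f_i)
    return h

def mcf_halfspace_distance(x_mc, x_n, y_mc, y_n):
    """Distance between two m-xMC-Halfspace points.
    x_mc, y_mc: for each coordinate i < n, an expansion (list of components in
    increasing order of magnitude); x_n, y_n: ordinary floats (last coordinate)."""
    sq = 0.0
    for ex, ey in zip(x_mc, y_mc):
        diff = expansion_sum(ex, [-c for c in ey])   # expansion representing x_i - y_i
        approx = diff[-1]                            # largest component approximates x_i - y_i
        sq += approx * approx
    return np.arccosh(1.0 + sq / (2.0 * x_n * y_n))
```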
Table 2: Datasets Information
Datasets     | Nodes | Edges
WordNet [12] | 74374 | 75834
Nouns        | 82115 | 769130
Verbs        | 13542 | 35079
Mammals      | 1181  | 6541
Gr-QC [22]   | 4158  | 13422

Table 3: Embedding performances on Mammals (mean ± std; * marks the best value in each column)
Model           | MAP             | MR
Halfspace       | 92.07% ± 1.01%  | 1.600 ± 0.79
2-xMC-Halfspace | 93.85%* ± 0.57% | 1.430* ± 0.16
Lorentz         | 85.75% ± 2.30%  | 1.857 ± 4.78
L-Tiling        | 91.49% ± 1.42%  | 1.645 ± 1.28
H-Tiling        | 91.60% ± 1.60%  | 1.559 ± 0.44

Table 4: Learning experiments on different datasets. Results are averaged over 5 runs and reported as mean ± std. Values marked with * show the best performance among all models.
Dimension | Models          | WordNet Nouns MAP | WordNet Nouns MR | WordNet Verbs MAP | WordNet Verbs MR | Gr-QC MAP      | Gr-QC MR
2         | Lorentz         | 15.61 ± 0.37%     | 55.59 ± 0.53     | 56.76 ± 0.52%     | 4.18 ± 0.07      | 55.93 ± 0.32%  | 71.81 ± 1.24
2         | L-Tiling        | 23.39 ± 0.31%     | 34.22 ± 1.03     | 63.08 ± 0.31%     | 3.28 ± 0.03      | 56.07 ± 0.21%  | 72.53 ± 2.08
2         | H-Tiling        | 23.34 ± 0.19%     | 34.30 ± 0.49     | 64.26 ± 0.40%     | 3.23 ± 0.05      | 55.95 ± 0.30%  | 71.78* ± 1.02
2         | Halfspace       | 23.31 ± 0.17%     | 34.81 ± 0.26     | 63.65 ± 0.32%     | 3.36 ± 0.07      | 55.60 ± 0.35%  | 74.29 ± 1.32
2         | 2-xMC-Halfspace | 24.42* ± 0.11%    | 34.01* ± 0.17    | 64.88* ± 0.25%    | 3.18* ± 0.05     | 56.35* ± 0.38% | 71.81 ± 1.12
5         | Lorentz         | 74.82 ± 0.38%     | 7.528 ± 0.23     | 89.72 ± 0.04%     | 1.52 ± 0.00      | 71.16 ± 0.33%  | 33.66 ± 1.23
5         | H-Tiling        | 74.90 ± 0.08%     | 7.529 ± 0.12     | 90.13 ± 0.08%     | 1.46 ± 0.00      | 70.89 ± 0.15%  | 32.99 ± 0.52
5         | Halfspace       | 74.89 ± 0.14%     | 7.388 ± 0.08     | 89.99 ± 0.21%     | 1.51 ± 0.01      | 71.10 ± 0.19%  | 32.29 ± 0.75
5         | 2-xMC-Halfspace | 75.05* ± 0.12%    | 7.377* ± 0.12    | 90.25* ± 0.09%    | 1.41* ± 0.01     | 71.21* ± 0.24% | 31.89* ± 0.71
10        | Lorentz         | 78.43 ± 0.42%     | 6.148 ± 0.15     | 90.71* ± 0.13%    | 1.41 ± 0.00      | 72.84* ± 0.13% | 28.53 ± 0.27
10        | H-Tiling        | 78.40 ± 0.22%     | 6.229 ± 0.13     | 90.62 ± 0.02%     | 1.40 ± 0.00      | 72.57 ± 0.09%  | 28.50* ± 0.35
10        | Halfspace       | 78.39 ± 0.12%     | 6.227 ± 0.21     | 90.71* ± 0.11%    | 1.39* ± 0.02     | 72.66 ± 0.11%  | 28.56 ± 0.24
10        | 2-xMC-Halfspace | 78.46* ± 0.08%    | 6.136* ± 0.14    | 90.71* ± 0.04%    | 1.39* ± 0.00     | 72.69 ± 0.09%  | 28.57 ± 0.32

Hence, there is a region around the origin, whose radius is proportional to the number of components m, where we can be confident there are no numerical issues. This is the scaling we would expect, and it is in line with the results from [34]. In Figure 1, we plot the averaged representation error d(x, MCF(x)) when x is represented using 2-component floating-point numbers. The error does not explode until a distance of 125 from the origin, simply by using 2 components, compared to the distance of 55 for the vanilla halfspace model. In particular, we also show the numerical errors of the adapted RSGD algorithm in the 2-xMC-Halfspace model in Figure 2, with far smaller errors compared to the growing numerical errors of the Poincaré upper-half space model using ordinary floating-point numbers. Practitioners can choose the number of components according to the desired precision requirements, yet many applications obtain the full benefit from using merely a small number of components (such as two or three). In fact, MCF can be adopted more generally in many operations, in all hyperbolic models, and with generic loss functions for better performance. Hence, we also implement the model with all coordinates represented with MCF, denoted as m-MC-Halfspace. We defer the discussion of this model to the Appendix, since it is more sophisticated, making use of the Scale-Expansion algorithm.

7 Experiments

First proposed by Nickel and Kiela [27], hyperbolic embeddings were used to embed data with hierarchical structure such as taxonomies, producing results that outperform Euclidean embeddings.
Motivated by this, we conduct embedding experiments on datasets including Gr-QC and the commonly used WordNet Nouns, Verbs, and Mammals datasets, whose statistics are shown in Table 2. We are interested in two aspects of the proposed MCF-based models in practice: firstly, the capacity of the models, and secondly, the running speed of model training and optimization.

Reconstruction. We focus on tasks which can show the capacity of the models. Given an observed transitive closure D = {(u, v)} of all objects, we embed all objects into hyperbolic space and then reconstruct it from the embedding. Hence, the error from the reconstructed embedding to the ground truth is a reasonable measure of the capacity of the model and reflects how the underlying floating-point arithmetic may affect the performance. Specifically, we learn embeddings of the transitive closure D such that related objects are close in the embedding space. To do this, we aim to minimize the loss introduced in [27]:
$$\sum_{(u,v) \in D} -\log \frac{e^{-d(u,v)}}{\sum_{v' \in N(u)} e^{-d(u,v')}},$$
where $N(u) = \{v' \mid (u, v') \notin D\} \cup \{u\}$ is the set of negative examples for u (including u). We do negative sampling randomly per positive example during training for the different datasets.

We implemented the upper-half space model and xMC-Halfspace models trained with RSGD in our experiments; as baselines, we consider the current state-of-the-art L-tiling and H-tiling models [37] trained with RSGD, and the Lorentz model [28]. Models were trained using ordinary float64 with the same hyper-parameters, starting from the same initializations of parameters; see the Appendix for more implementation details.

Table 5: Training time (seconds per epoch) for different models on different datasets, at batch sizes 32, 64, and 128 and dimensions 2, 5, and 10. Results are averaged over 10 epochs. Performances of the low-precision models (Lorentz & Halfspace) are separated from those of the high-precision models in the table.
Dataset       | Models          | BS=32: 2D / 5D / 10D  | BS=64: 2D / 5D / 10D  | BS=128: 2D / 5D / 10D
WordNet Nouns | Lorentz         | 57.37 / 60.00 / 53.42 | 27.38 / 27.28 / 26.74 | 9.37 / 9.08 / 9.14
WordNet Nouns | Halfspace       | 118.7 / 119.9 / 119.7 | 40.2 / 40.3 / 39.3    | 17.4 / 17.4 / 20.4
WordNet Nouns | L-Tiling (CPU)  | 205 / 203 / 203       | 185 / 185 / 182       | 161 / 158 / 158
WordNet Nouns | H-Tiling (CPU)  | 146 / 188 / 238       | 107 / 156 / 212       | 93 / 129 / 177
WordNet Nouns | 2-xMC-Halfspace | 272 / 284 / 296       | 118 / 136 / 121       | 54 / 58 / 68
WordNet Verbs | Lorentz         | 3.21 / 3.19 / 3.25    | 0.93 / 0.93 / 0.96    | 0.42 / 0.41 / 0.41
WordNet Verbs | Halfspace       | 5.65 / 5.68 / 5.71    | 1.69 / 1.66 / 1.66    | 0.78 / 0.75 / 0.77
WordNet Verbs | H-Tiling (CPU)  | 7.46 / 8.87 / 10.76   | 5.56 / 7.85 / 8.54    | 5.06 / 6.63 / 8.59
WordNet Verbs | 2-xMC-Halfspace | 12.38 / 12.14 / 12.83 | 6.19 / 5.86 / 6.42    | 1.80 / 1.84 / 2.09
Gr-QC         | Halfspace       | 4.29 / 4.36 / 4.31    | 2.13 / 1.91 / 1.88    | 0.59 / 0.63 / 0.67
Gr-QC         | Lorentz         | 2.45 / 2.61 / 2.44    | 0.67 / 0.76 / 0.67    | 0.36 / 0.36 / 0.36
Gr-QC         | H-Tiling (CPU)  | 5.99 / 7.22 / 10.12   | 4.39 / 6.45 / 8.12    | 4.16 / 5.41 / 7.32
Gr-QC         | 2-xMC-Halfspace | 9.52 / 9.96 / 10.41   | 5.42 / 5.57 / 5.94    | 1.46 / 1.63 / 1.79

To evaluate the quality of the embeddings, we make use of the standard graph embedding metrics in [5, 26]. For an observed relationship (u, v), we rank the distance d(u, v) among the set $\{d(u, v') \mid (u, v') \notin D\}$; then we evaluate the ranking on all objects in the dataset and record the mean rank (MR, smaller is better) as well as the mean average precision (MAP, larger is better) of the ranking. We first evaluate different 2-dimensional embeddings on the smaller-scale Mammals dataset. As shown in Table 3, we choose 2 components for the MCF-based models. The proposed upper-half space model already outperforms all previously proposed baseline models.
In particular, the xMC-Halfspace model further improves the embedding performance and achieves a 2.25% MAP improvement on Mammals compared to the best baseline model. We present the large-scale experiments on the other three datasets in Table 4. These results show that MCF-based models generally perform better than baseline models in various dimensions with respect to MAP and MR, particularly in 2-dimensional hyperbolic space, since higher dimensions can alleviate the NaN problem to some degree. On 2-dimensional embeddings of the largest dataset, WordNet Nouns, MCF-based models gain as much as a 1.03% MAP improvement compared to the L-tiling model and a 1.11% MAP improvement compared to the upper-half space model. A limitation of this work is that for embedding problems whose points stay close to the origin, MCF will give no improvement over the plain halfspace model, because the latter is already numerically accurate enough there; also, our method will eventually break down at very large distances due to overflow or underflow of the floating-point exponent.

Training time cost. Running a model on GPUs confers an important training advantage for modern machine learning tasks. In order to provide a comprehensive comparison, we train most models in our experiments on GPUs, including the MCF-based halfspace models, the upper-half space model, and the Lorentz model. As for the L-tiling and H-tiling models, since big-integer matrix multiplications are not supported on GPUs, we train both models on CPUs. For all timing results within this section, we use GPU: GeForce RTX 2080 Ti and CPU: Intel(R) Xeon(R) Gold 6240 @ 2.60GHz. Note that the negative sampling size and batch size can make a great difference to the timing performance on CPUs and GPUs. Herein, we fix the negative sampling size to be 50 and only vary the batch size over 32, 64, and 128. Readers may question the performance of the models as the batch size changes; we note that with a simple scaling of the learning rate, one can achieve comparable results for different batch sizes, as detailed in the Appendix. We report the training time (seconds per epoch) for different models on three different datasets in Table 5. We can conclude that:

- On large-scale datasets or with large-batch-size (e.g., 128) training, MCF-based models (high precision) can be trained efficiently, with timing close to that of models using ordinary floating point.
- MCF-based models (high precision) can run on GPUs, enjoying a great training speedup compared to other high-precision models such as the L-tiling and H-tiling models, which can only run on CPUs.

In fact, on WordNet Nouns, it takes MCF-based models around 2/3 less time, saving 30.27 hours over 1000 epochs of training (batch size = 128), compared to the H-Tiling model with the best MAP performance. More experiment details are in the Appendix. We will release our code in PyTorch publicly for reproducibility, and in the hope that practitioners can use MCF to reliably compute in hyperbolic space.

8 Conclusion

In this work, we proposed the upper-half space model of hyperbolic space for optimization and presented new multi-component floating-point representations. Our approach avoids the numerical errors inherent to popular models of hyperbolic space, such as the Lorentz model, while supporting efficient computations on GPUs. This allows our MCF approach to effectively trade off computational cost and numerical accuracy for learning over hyperbolic space.

Acknowledgments and Disclosure of Funding

This work is supported by NSF IIS-2008102.
This work was supported in part by a gift from SambaNova Systems, Inc. The authors would like to thank Aaron Lou and the anonymous reviewers from NeurIPS 2021 for providing valuable feedback on earlier versions of this paper.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

[2] David H. Bailey. A Fortran 90-based multiprecision system. ACM Transactions on Mathematical Software (TOMS), 21(4):379–387, 1995.

[3] Alan F. Beardon. The Geometry of Discrete Groups, volume 91. Springer Science & Business Media, 2012.

[4] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. URL https://doi.org/10.1137/141000671.

[5] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.

[6] Brian H. Bowditch. A Course on Geometric Group Theory. Mathematical Society of Japan, volume 16 of MSJ Memoirs, 2006.

[7] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.

[8] Richard P. Brent. A Fortran multiple-precision arithmetic package. ACM Transactions on Mathematical Software (TOMS), 4(1):57–70, 1978.

[9] James W. Cannon, William J. Floyd, Richard Kenyon, and Walter R. Parry. Hyperbolic geometry. In Flavors of Geometry, pages 59–115. University Press, 1997.

[10] Ines Chami, Rex Ying, Christopher Ré, and Jure Leskovec. Hyperbolic graph convolutional neural networks. Advances in Neural Information Processing Systems, 32:4869, 2019.

[11] Theodorus Jozef Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224–242, 1971.

[12] Christiane Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books, 1998.

[13] Steven Fortune. Progress in computational geometry. Directions in Geometric Computing, 3:81–128, 1993.

[14] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5345–5355, 2018.

[15] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[16] Mikhael Gromov. Hyperbolic groups. Essays in Group Theory, pages 75–263, 1987.

[17] Albert Gu, Frederic Sala, Beliz Gunel, and Christopher Ré. Learning mixed-curvature representations in product spaces. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HJxeWnCcF7.

[18] Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, et al. Hyperbolic attention networks. In International Conference on Learning Representations, 2018.

[19] Yozo Hida, Xiaoye S. Li, and David H. Bailey. Library for double-double and quad-double arithmetic. 2007.

[20] Wolfram Research, Inc. Mathematica, Version 12.2. URL https://www.wolfram.com/mathematica. Champaign, IL, 2020.

[21] Donald Ervin Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, 1971.

[22] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2, 2007.

[23] Qi Liu, Maximilian Nickel, and Douwe Kiela. Hyperbolic graph neural networks. arXiv preprint arXiv:1910.12892, 2019.

[24] Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.

[25] Shigeru Miyagawa, Robert C. Berwick, and Kazuo Okanoya. The emergence of hierarchical structure in human language. Frontiers in Psychology, 4:71, 2013.

[26] Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings of knowledge graphs. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[27] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6338–6347. Curran Associates, Inc., 2017.

[28] Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 3779–3788. PMLR, 2018.

[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[30] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[31] Douglas M. Priest. Algorithms for arbitrary precision floating point arithmetic. University of California, Berkeley, 1991.

[32] Douglas M. Priest. On properties of floating point arithmetics: numerical stability and the cost of accurate computations. PhD thesis, 1992.

[33] R. Godement. Fuchsian groups. In Analysis IV, Universitext, 2015.

[34] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4460–4469, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.

[35] Rik Sarkar. Low distortion Delaunay embedding of trees in hyperbolic plane. In International Symposium on Graph Drawing, pages 355–366. Springer, 2011.

[36] Jonathan Richard Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry, 18(3):305–363, 1997.

[37] Tao Yu and Christopher M. De Sa. Numerically accurate hyperbolic embeddings using tiling-based models. In Advances in Neural Information Processing Systems, volume 32, pages 2023–2033. Curran Associates, Inc., 2019.