Hyperbolic Optimizer as a Dynamical System

Nico Alvarado 1 2 3   Hans Lobel 1 2 3 4

1Department of Computer Science, Pontificia Universidad Católica de Chile 2National Center of Artificial Intelligence, Chile 3Millennium Institute Foundational Research on Data, Chile 4Department of Transport and Logistics Engineering, Pontificia Universidad Católica de Chile. Correspondence to: Nico Alvarado , Hans Lobel .

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

During the last few years, the field of dynamical systems has been developing innovative tools to study the asymptotic behavior of different optimizers in the context of neural networks. In this work, we redefine an extensively studied optimizer, employing classical techniques from hyperbolic geometry. This new definition is linked to a non-linear differential equation as a continuous limit. Additionally, by utilizing Lyapunov stability concepts, we analyze the asymptotic behavior of its critical points.

1. Introduction

Classical optimization algorithms like gradient descent are commonly used, and their connection to dynamical systems is evident when viewing the weight updates as the evolution of a system state over iterations (Narendra & Parthasarathy, 1991). If $\theta$ represents the parameters and $L(\theta)$ the loss function, the weight update in each iteration, $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$, resembles the dynamics of a discrete-time dynamical system, where $\eta$ is the learning rate and $\nabla L(\theta_t)$ is the gradient of the loss function. The optimization process aims to locate minima within the loss landscape, analogous to identifying equilibrium points in the energy landscape of a dynamical system. This can be expressed as $\theta^* = \arg\min_\theta L(\theta)$, where $\theta^*$ denotes the optimal parameters.

Recall that hyperbolic spaces are homogeneous spaces of constant curvature equal to $-1$. The most relevant and crucial theoretical property of hyperbolic spaces, and of spaces of negative curvature (Bridson & Haefliger, 2013) in general, is that they can embed graphs such as trees with arbitrarily low distortion of the natural metrics. Gromov first observed this (Gromov, 1987a) and introduced a much larger class of spaces, called $\delta$-hyperbolic spaces, which are shown to be almost isometric to trees (Gromov, 1987b), including cases of graphs with control on the diameter of cycles (Sarkar, 2012). In contrast, Euclidean and positively curved spaces do not allow trees to be embedded with bounded distortion of the metric (Bourgain, 1985; Indyk et al., 2017; Chami, 2021; Aggarwal et al., 2001; Rodríguez-Flores & Papadopoulos, 2020; Borassi et al., 2015). Hyperbolic embeddings have also shown promise for routing (Cvetkovski & Crovella, 2009), clustering (Chami et al., 2020; Lamping et al., 1995), biological networks (Albert et al., 2014), phylogenetic trees (Billera et al., 2001; Matsumoto et al., 2021), neuroscience (Allard & Serrano, 2020), text embedding (Dhingra et al., 2018; Balazevic et al., 2019), and knowledge graphs (Sun et al., 2020).

Hyperbolic Neural Networks (HNNs) leverage hyperbolic geometry to represent data in a more natural way, especially for capturing hierarchical relationships (Chami et al., 2019; Ganea et al., 2018; Yang et al., 2022). However, the non-Euclidean nature of hyperbolic spaces poses challenges for classical optimizers.
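As a concrete illustration of the discrete-time view described at the start of the introduction, the following minimal sketch iterates the update $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$ as a map on the parameter state; the quadratic loss, dimensions and learning rate are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal sketch: gradient descent viewed as a discrete-time dynamical system
# theta_{t+1} = theta_t - eta * grad L(theta_t).  The quadratic loss below is an
# arbitrary illustrative choice, not taken from the paper.
def grad_L(theta, A, b):
    # Gradient of L(theta) = 0.5 * ||A theta - b||^2
    return A.T @ (A @ theta - b)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
eta = 0.05

theta = np.zeros(3)
for t in range(500):
    theta = theta - eta * grad_L(theta, A, b)   # one step of the discrete dynamics

# The fixed point of the map coincides with the least-squares minimizer.
theta_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(theta, theta_star, atol=1e-6))
```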
In the hyperbolic setting, if $g(\theta_t)$ represents the Riemannian gradient, accounting for the curvature of the hyperbolic space, then the update rule becomes $\theta_{t+1} = \mathrm{Exp}_{\theta_t}(-\eta\, g(\theta_t))$, where $\mathrm{Exp}_{\theta_t}$ is the exponential map in the hyperbolic space. Optimizers tailored for hyperbolic geometry, such as Riemannian optimization methods, play a crucial role. They efficiently navigate the unique curvature of the hyperbolic space, ensuring stable convergence. Proper optimization allows hyperbolic neural networks to fully exploit their intrinsic geometry, leading to enhanced performance in capturing hierarchical relationships and complex data structures.

In this work, we present an optimizer based on ADMM (Boyd et al., 2011), but tailored to work in hyperbolic geometry, particularly within the Poincaré ball model. Establishing a connection between this optimizer and a non-linear ordinary differential equation (ODE) enriches our comprehension of the dynamics. The novel contribution lies in delving into stability through ODE linearization, offering valuable insights for practically implementing the hyperbolic optimizer in real-world applications.

1.1. Related work

The Alternating Direction Method of Multipliers (ADMM) can be seen as a dynamic process, whether we consider it as a continuous-time or discrete-time evolution. In the continuous case, ADMM updates are likened to the gradual transformation of states in a dynamical system. The algorithm employs an augmented Lagrangian function, and the iterative updates of Lagrange multipliers resemble the dynamics of continuous-time systems. Analyzing the continuous limit involves studying the algorithm's behavior through differential equations, shedding light on stability and convergence properties.

In practice, ADMM is often implemented as a discrete-time algorithm. Each iteration corresponds to a step in the evolution of a discrete-time dynamical system. This discrete nature allows us to understand ADMM through the framework of difference equations, capturing the recursive relationship between consecutive updates. The convergence of ADMM, akin to stability in dynamical systems, is often analyzed to provide assurances about the algorithm's reliability (Boyd et al., 2011). The fixed points or equilibria of ADMM correspond to solutions of the optimization problem, and understanding these points is analogous to analyzing stable states in a dynamical system. The method often converges well for a wide range of convex optimization problems (França et al., 2018).

The scaled form of ADMM is given by (Boyd et al., 2011)

$x_{k+1} = \arg\min_{x \in \mathbb{R}^n} \; f(x) + \tfrac{\rho}{2}\|Ax - z_k + u_k\|^2$

$z_{k+1} = \arg\min_{z \in \mathbb{R}^m} \; g(z) + \tfrac{\rho}{2}\|Ax_{k+1} - z + u_k\|^2$

$u_{k+1} = u_k + Ax_{k+1} - z_{k+1},$

where $\rho > 0$ is a penalty parameter and $u_k \in \mathbb{R}^m$ is the $k$th Lagrange multiplier estimate for the constraint $z = Ax$. Let $f\colon \mathbb{R}^n \to \mathbb{R}$ and $g\colon \mathbb{R}^m \to \mathbb{R}$ be continuously differentiable convex functions and $A \in \mathbb{R}^{m \times n}$ an invertible matrix.

Theorem 1.1. (França et al., 2018) Consider the optimization problem given by

$\min_{x,z} \{ V(x, z) = f(x) + g(z) \;\text{ subject to }\; z = Ax \}$

and the associated function $V(x) = f(x) + g(Ax)$. Then, the continuous limit associated with the ADMM updates, with time scale $t = k/\rho$, corresponds to the initial value problem

$\dot{X} + (A^T A)^{-1}\nabla V(X) = 0$

with $X(0) = x_0$.
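To make the discrete-time view concrete, the following sketch runs the scaled ADMM iteration above on a toy problem with quadratic $f$ and $g$, so that both subproblems have closed forms; the objectives, matrix, penalty and dimensions are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

# Sketch of the scaled ADMM iteration of Section 1.1 on a toy problem that is
# NOT from the paper: f(x) = 0.5*||x - a||^2, g(z) = 0.5*||z - b||^2, z = A x.
# Both subproblems then have closed-form solutions.
rng = np.random.default_rng(1)
n, m, rho = 3, 3, 1.0
A = rng.standard_normal((m, n))
a, b = rng.standard_normal(n), rng.standard_normal(m)

x, z, u = np.zeros(n), np.zeros(m), np.zeros(m)
for k in range(200):
    # x-update: argmin_x f(x) + rho/2 ||A x - z + u||^2
    x = np.linalg.solve(np.eye(n) + rho * A.T @ A, a + rho * A.T @ (z - u))
    # z-update: argmin_z g(z) + rho/2 ||A x - z + u||^2
    z = (b + rho * (A @ x + u)) / (1.0 + rho)
    # scaled dual update
    u = u + A @ x - z

print(np.linalg.norm(A @ x - z))   # primal residual -> ~0 at convergence
```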
1.2. Paper contributions

Empirical evidence widely supports the effectiveness of low-dimensional hyperbolic spaces in learning hierarchical datasets. Despite a longstanding historical connection to embedding theory, as far as our knowledge extends, no theoretical investigations have been conducted on dynamical systems and hyperbolic optimizers. This article addresses and fills this gap in the literature, presenting the following key contributions:

We proved the existence of a non-linear differential equation linked to the continuous limit of the Hyperbolic ADMM flow. This enables us to explore the asymptotic behavior of critical points and provides insights for conducting numerical analyses in future studies.

We also proved that if a specific critical point $X^*$ remains at a low value under small perturbations, this signifies stability over time. This is advantageous as it indicates the optimization process is effective, steadily converging toward the best solution. The result offers a form of assurance that our optimization will ultimately settle at this optimal point and not deviate elsewhere.

2. Preliminaries

Riemannian manifolds basics (see also (Petersen, 2016)). We recall that a $d$-dimensional manifold $X$ is roughly a topological space that is locally parameterized by open sets of $\mathbb{R}^d$. A differentiable manifold has parametrizations such that the change of parametrization is a differentiable map. This allows us to define infinitesimal directions at each point $p \in X$, forming the tangent space $T_pX$ of $X$ at $p$. A differentiable manifold $X$ with an inner product $g_p(\cdot,\cdot)$ on each tangent space $T_pX \simeq \mathbb{R}^d$ (called a Riemannian metric) is a Riemannian manifold. By integrating the $g_{\gamma(t)}$-norm of the tangent vectors along a curve $\gamma(t)$, we can define the Riemannian length of a curve, and the minimum length required to connect two points gives a Riemannian distance on $X$. A geodesic is a curve on a Riemannian manifold that locally minimizes the length between its endpoints. The sectional curvature $k$ of a Riemannian manifold at a point, in the direction of two linearly independent tangent vectors $x, y \in T_xM$, is given by

$k(x, y) = \frac{\langle R(x, y)y, x\rangle}{\|x\|^2\|y\|^2 - \langle x, y\rangle^2},$

where $R(x, y)y$ is the Riemann curvature tensor.

Hyperbolic spaces. Unlike Euclidean geometry, hyperbolic geometry rejects the parallel postulate, which leads to intriguing geometric properties. To fully understand this geometry, one must become familiar with the hyperbolic parallel postulate and the concept of curvature. Furthermore, understanding how hyperbolic space is represented and visualized using models, such as the Poincaré disk or the hyperboloid model, is crucial.

Figure 1. Examples of different curvatures ($\kappa = 0$, $\kappa > 0$, $\kappa < 0$). On the right-hand side, we have negative curvature, thus a hyperbolic manifold.

The only Riemannian manifold of constant negative curvature $-1$ and dimension $d$ is the hyperbolic space $\mathbb{B}^d$, which can be identified (in the so-called Poincaré model) with the unit ball of $\mathbb{R}^d$ equipped with the non-Euclidean distance

$\gamma(x, y) = \operatorname{arccosh}\left(1 + \frac{2\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right).$

Note that this distance is the Riemannian distance on the unit ball, associated with the Riemannian metric for which the norm of an infinitesimal vector $v$ at a point $x$ is given by $\|v\|_x = \frac{2\|v\|}{1 - \|x\|^2}$.

Figure 2. Poincaré model with $d = 2$. The curves are geodesics (i.e., they coincide with the minimum-$\gamma$-length paths between any two points on the curve) and have infinite $\gamma$-length.
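A minimal sketch of the Poincaré-ball distance just defined, useful for checking how distances blow up near the boundary of the ball; the test points are arbitrary illustrative choices.

```python
import numpy as np

# Sketch of the Poincaré-ball distance from Section 2:
# gamma(x, y) = arccosh(1 + 2||x-y||^2 / ((1-||x||^2)(1-||y||^2))).
def poincare_distance(x, y):
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

x = np.array([0.1, 0.2])
y = np.array([0.7, -0.5])
print(poincare_distance(x, y))
# Distances blow up as points approach the boundary of the unit ball:
print(poincare_distance(x, np.array([0.999, 0.0])))
```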
Hyperbolic convexity. In hyperbolic geometry, the concept of hyperbolic convexity plays a crucial role in understanding the properties of sets and functions. A set is hyperbolically convex if, for any two points within the set, the geodesic connecting them lies entirely within the set. Geodesics in hyperbolic space are represented by hyperbolic lines, the analogs of straight lines in Euclidean geometry. Hyperbolically convex sets have unique properties, such as stability under hyperbolic isometries. This means that if a set is hyperbolically convex and undergoes a hyperbolic isometry, the transformed set remains hyperbolically convex. This property is essential in various applications, including understanding hyperbolic reflections and symmetry.

Moreover, the notion of hyperbolic convexity extends to functions. A function is considered hyperbolically convex if the region above its graph (its epigraph) is a hyperbolically convex set. Hyperbolically convex functions have significant implications in optimization problems and variational principles in hyperbolic geometry. The study of hyperbolically convex sets and functions provides valuable insights into the geometry and structure of hyperbolic space. It allows us to explore the relationships between geometric objects and transformations, offering a deeper understanding of the fundamental principles underlying hyperbolic geometry and its applications in diverse fields.

Convex sets in hyperbolic spaces $\mathbb{H}^n$ are closely related to convex cones belonging to the interior of the Lorentz cone

$\mathcal{L} := \left\{ x \in \mathbb{R}^{n+1} : x_{n+1} \geq \sqrt{x_1^2 + \cdots + x_n^2} \right\}.$

Definition 2.1. We say that the set $C \subset \mathbb{H}^n$ is hyperbolically convex if for any $p, q \in C$ the geodesic segment joining $p$ to $q$ is contained in $C$.

The hyperbolically convex sets are intersections of the hyperboloid with convex cones that belong to the interior of $\mathcal{L}$.

Proposition 2.2. Let $C$ be an open hyperbolically convex set and $f\colon C \to \mathbb{R}$ be a differentiable function. The function $f$ is hyperbolically convex if and only if

$f(q) \geq f(p) + \langle \nabla f(p), \log_p q \rangle$, for all $p, q \in C$ with $p \neq q$.

Proposition 2.3. Let $C \subset \mathbb{H}^n$ be an open hyperbolically convex set and $f\colon C \to \mathbb{R}$ a differentiable function. The function $f$ is hyperbolically convex if and only if $f$ satisfies

$\langle \nabla f(p), \log_p q \rangle + \langle \nabla f(q), \log_q p \rangle \leq 0$, for all $p, q \in C$, $p \neq q$.

Gyrovector spaces.

Definition 2.4. Let $V$ be a real inner product space and $V_s$ the ball centered at $0$ of radius $s$. We define the Möbius addition as

$u \oplus_s v = \frac{\big(1 + \tfrac{2}{s^2}\langle u, v\rangle + \tfrac{1}{s^2}\|v\|^2\big)u + \big(1 - \tfrac{1}{s^2}\|u\|^2\big)v}{1 + \tfrac{2}{s^2}\langle u, v\rangle + \tfrac{1}{s^4}\|u\|^2\|v\|^2}.$

Also, we can define the Möbius subtraction as $u \ominus v := u \oplus (-v)$.

Remark 2.5. Note that if $s \to \infty$, then we recover the Euclidean vector space sum. In this work, we fix $s = 1$.

Definition 2.6. We define the Möbius scalar multiplication on $\mathbb{B}^n \setminus \{0\}$ by

$r \otimes x = \tanh\!\big(r \operatorname{arctanh}(\|x\|)\big)\,\frac{x}{\|x\|}.$

We say that a set of gyrovectors $\{\alpha_i\}_{i=1}^l$ is linearly independent in a gyrovector space if $r_1 \otimes \alpha_1 \oplus \cdots \oplus r_l \otimes \alpha_l = 0$ only when $r_1 = r_2 = \cdots = r_l = 0$. Given a matrix $M$ and a vector $x$, we define

$M \otimes x = \tanh\!\left(\frac{\|Mx\|}{\|x\|}\operatorname{arctanh}(\|x\|)\right)\frac{Mx}{\|Mx\|}.$

See Appendix A for more details.
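A short sketch of the Möbius operations of Definitions 2.4 and 2.6 with $s = 1$ (the Poincaré ball); the sample vectors are arbitrary illustrative choices.

```python
import numpy as np

# Sketch of the Möbius operations of Definitions 2.4 and 2.6 with s = 1.
def mobius_add(u, v):
    uv = np.dot(u, v)
    nu2, nv2 = np.dot(u, u), np.dot(v, v)
    num = (1 + 2 * uv + nv2) * u + (1 - nu2) * v
    return num / (1 + 2 * uv + nu2 * nv2)

def mobius_scalar(r, x):
    nx = np.linalg.norm(x)
    return np.tanh(r * np.arctanh(nx)) * x / nx

def mobius_matvec(M, x):
    nx = np.linalg.norm(x)
    Mx = M @ x
    nMx = np.linalg.norm(Mx)
    return np.tanh(nMx / nx * np.arctanh(nx)) * Mx / nMx

u, v = np.array([0.1, 0.2]), np.array([0.3, -0.4])
print(mobius_add(u, v))
print(mobius_scalar(2.0, u))                  # result stays inside the unit ball
print(mobius_add(u, mobius_scalar(-1.0, u)))  # u "minus" u gives the zero vector
```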
3. Hyperbolic Optimization

3.1. Hyperbolic convex functions

We can extend the notion of hyperbolically convex functions to any hyperbolic model. Specifically, in the Poincaré ball model, we define

$\log_x\colon \mathbb{B}^n \to T_x\mathbb{B}^n, \quad y \mapsto (1 - \|x\|^2)\operatorname{arctanh}(\|{-x} \oplus y\|)\,\frac{-x \oplus y}{\|{-x} \oplus y\|},$

$\exp_x\colon T_x\mathbb{B}^n \to \mathbb{B}^n, \quad y \mapsto x \oplus \left(\tanh\!\left(\frac{\|y\|}{1 - \|x\|^2}\right)\frac{y}{\|y\|}\right).$

If we take the tangent space at $x = 0$, we have

$\log_0(x) = \operatorname{arctanh}(\|x\|)\,\frac{x}{\|x\|}, \qquad \exp_0(x) = \tanh(\|x\|)\,\frac{x}{\|x\|} = \frac{e^{2\|x\|} - 1}{e^{2\|x\|} + 1}\,\frac{x}{\|x\|}.$

Definition 3.1. A subset $S$ of a Riemannian manifold $M$ is geodesically convex if, for every $x, y \in S$, there exists a geodesic segment $c\colon [0, 1] \to M$ such that $c(0) = x$, $c(1) = y$ and $c(t)$ is in $S$ for all $t \in [0, 1]$.

In a geodesically convex set $S$, any two points are connected in $S$ by at least one geodesic segment $c$. Composing a function $f\colon S \to \mathbb{R}$ with $c$ yields a real function on $[0, 1]$. If all of these compositions are convex in the usual sense, we say $f$ is convex in a geometric sense.

Definition 3.2. A function $f\colon S \to \mathbb{R}$ is geodesically (strictly) convex if $S$ is geodesically convex and $f \circ c\colon [0, 1] \to \mathbb{R}$ is (strictly) convex for each geodesic segment $c\colon [0, 1] \to M$ whose image is in $S$ (with $c(0) \neq c(1)$).

In other words, for $S$ a geodesically convex set, we say $f\colon S \to \mathbb{R}$ is geodesically convex if for all $x, y \in S$ and all geodesics $c$ connecting $x$ to $y$ in $S$, the function $f \circ c\colon [0, 1] \to \mathbb{R}$ is convex, that is,

$f(c(t)) \leq (1 - t)f(x) + t f(y), \quad \forall t \in [0, 1].$

It can be shown that, for $g_x \in T_xM$ the Riemannian gradient of $f$ at $x$, we have an equivalent definition of a geodesically convex function:

$f(y) \geq f(x) + \langle g_x, \exp_x^{-1}(y)\rangle_x, \quad \forall x, y \in M.$

If $M$ is a hyperbolic manifold, we have

$f(y) \geq f(x) + \langle \nabla f(x), \log_x y\rangle, \quad \forall x, y \in M.$

If $f$ satisfies the previous condition and $M$ is a hyperbolic manifold, we say that $f$ is a hyperbolic convex function.

Definition 3.3. A function $f\colon M \to \mathbb{R}$ is geodesically (hyperbolically) $\mu$-strongly convex if for any $x, y \in M$,

$f(y) \geq f(x) + \langle g_x, \exp_x^{-1} y\rangle_x + \frac{\mu}{2} d^2(x, y),$

where $d(\cdot,\cdot)$ is the Riemannian distance. If $M = \mathbb{B}^n$ we have

$f(y) \geq f(x) + \langle \nabla f(x), \log_x y\rangle + \frac{\mu}{2}\operatorname{arccosh}^2\!\left(1 + \frac{2\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right)$

for any $x, y \in M$. It is clear that if $f$ is a hyperbolically $\mu$-strongly convex function, then it is hyperbolically strictly convex.

Proposition 3.4. Consider the problem

$\min f(x) \quad \text{s.t.} \quad x \in \mathbb{B}^n,$

where $f$ is hyperbolically convex. If $\eta \in \mathbb{B}^n$ satisfies $\nabla f(\eta) = 0$, then $\eta$ is a global minimum.

Proof. From the definition of being hyperbolically convex, we have

$f(y) \geq f(x) + \langle \nabla f(x), \log_x y\rangle, \quad \forall x, y \in \mathbb{B}^n.$

In particular, if we choose $x = \eta$, then

$f(y) \geq f(\eta) + \langle \nabla f(\eta), \log_\eta y\rangle = f(\eta), \quad \forall y \in \mathbb{B}^n.$

A necessary and sufficient condition for $\eta$ to be a global minimum is therefore that $\nabla f(\eta) = 0$.

Proposition 3.5. Let

$\min f(x) \quad \text{s.t.} \quad x \in \Omega$

be an optimization problem where $f\colon \Omega \subset \mathbb{B}^n \to \mathbb{R}$ is hyperbolically strictly convex on $\Omega$, a convex set. Then the optimal solution is unique.

Proof. Suppose that $x, y \in \mathbb{B}^n$ are different and both are solutions to the optimization problem. Then $f(x) = f(y) \leq f(z)$ for any $z \in \Omega$. But since $f(y) > f(x) + \langle \nabla f(x), \log_x y\rangle$ and $\nabla f(x) = 0$, we have $f(y) > f(x)$, a contradiction. So, the solution must be unique.

3.2. Hyperbolic ADMM

ADMM is an algorithm intended to blend the decomposability of dual ascent with the convergence properties of the method of multipliers. The algorithm solves problems of the form

$\min f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c.$

Let $f\colon \mathbb{B}^n \to \mathbb{R}$ and $g\colon \mathbb{B}^m \to \mathbb{R}$ be two continuously differentiable and hyperbolically convex functions. Now, choose a matrix $A \in \mathbb{B}^{m \times n}$ with full column rank, i.e., the column vectors form a linearly independent set. Let $\oplus$ and $\otimes$ be the Möbius addition and multiplication defined in the Poincaré ball model. Consider a function $V$ defined by

$x \mapsto f(x) + g(A \otimes x).$

The following equations are the scaled form of ADMM in the gyrovector space version:

$x_{k+1} = \arg\min_{x \in \mathbb{B}^n} \; f(x) + \tfrac{\rho}{2}\|(A \otimes x \ominus z_k) \oplus u_k\|^2 \quad (1)$

$z_{k+1} = \arg\min_{z \in \mathbb{B}^m} \; g(z) + \tfrac{\rho}{2}\|(A \otimes x_{k+1} \ominus z) \oplus u_k\|^2 \quad (2)$

$u_{k+1} = u_k \oplus (A \otimes x_{k+1} \ominus z_{k+1}). \quad (3)$

Unfortunately, the operations within gyrovector spaces present a challenging task due to their intricate nature. The inherent complexities make handling these spaces demanding and require a thoughtful approach. Due to this complexity, a better way to study ADMM in a hyperbolic space is to use a classical technique that identifies the Euclidean structure with the hyperbolic one.
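The identification just mentioned goes through the origin-based maps $\exp_0$ and $\log_0$ of Section 3.1; the sketch below implements them and the composite map $x \mapsto \exp_0(A(\log_0 x))$ used by the hyperbolic ADMM of the next subsection. The matrix and point are arbitrary illustrative choices.

```python
import numpy as np

# Sketch of the origin-based maps of Section 3.1 on the Poincaré ball:
# log_0(x) = arctanh(||x||) x/||x||   and   exp_0(v) = tanh(||v||) v/||v||.
def log0(x):
    nx = np.linalg.norm(x)
    return np.arctanh(nx) * x / nx

def exp0(v):
    nv = np.linalg.norm(v)
    return np.tanh(nv) * v / nv

x = np.array([0.3, -0.2])
A = np.array([[0.5, 1.0], [-0.7, 0.2]])

print(np.allclose(exp0(log0(x)), x))     # exp_0 and log_0 are mutually inverse
z = exp0(A @ log0(x))                    # z = exp_0(A(log_0 x)), the constraint map
print(z, np.linalg.norm(z) < 1.0)        # the image stays inside the unit ball
```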
Let $f\colon \mathbb{B}^n \to \mathbb{R}$ and $g\colon \mathbb{B}^m \to \mathbb{R}$ be two continuously differentiable and hyperbolically convex functions. Now, choose a matrix $A \in \mathbb{R}^{(m-1)\times(n-1)}$ with full column rank, i.e., the column vectors form a linearly independent set, and consider the function

$x \mapsto f(x) + g(\exp_y(A(\log_y x))).$

We define the scaled form of ADMM in the hyperbolic version:

$x_{k+1} = \arg\min_{x \in \mathbb{B}^n} \; f(x) + \tfrac{\rho}{2}\|(A \otimes_y x \ominus z_k) \oplus u_k\|^2 \quad (4)$

$z_{k+1} = \arg\min_{z \in \mathbb{B}^m} \; g(z) + \tfrac{\rho}{2}\|(A \otimes_y x_{k+1} \ominus z) \oplus u_k\|^2 \quad (5)$

$u_{k+1} = u_k \oplus (A \otimes_y x_{k+1} \ominus z_{k+1}), \quad (6)$

where $A \otimes_y x = \exp_y(A(\log_y x))$. In Riemannian geometry, the choice of the tangent space at a particular point, often taken to be the zero point, is a convention that simplifies many calculations and allows for a more intuitive geometric interpretation. So,

$x_{k+1} = \arg\min_{x \in \mathbb{B}^n} \; f(x) + \tfrac{\rho}{2}\|(A \otimes_0 x \ominus z_k) \oplus u_k\|^2 \quad (7)$

$z_{k+1} = \arg\min_{z \in \mathbb{B}^m} \; g(z) + \tfrac{\rho}{2}\|(A \otimes_0 x_{k+1} \ominus z) \oplus u_k\|^2 \quad (8)$

$u_{k+1} = u_k \oplus (A \otimes_0 x_{k+1} \ominus z_{k+1}). \quad (9)$

We have the main result of the paper:

Theorem 3.6. Consider the hyperbolic optimization problem given by

$\min_{x,z} \{ V(x, z) = f(x) + g(z) \}, \quad \text{subject to } z = \exp_0(A(\log_0 x)),$

and the function $V(x) = f(x) + g(\exp_0(A(\log_0 x)))$. The continuous limit associated with the Hyperbolic ADMM updates, with $t = k/\rho$, corresponds to the initial value problem

$(A^T A)^{-1}\nabla V(X) + (A^T A)^{-1}\Omega_1 + (A^T A)^{-1}\Omega_2 X + \big((A^T A)^{-1}A\,\Omega_3 + \Omega_4 X\big)\dot{X} = 0 \quad (10)$

with $X(0) = x_0$, where $\Omega_i$ depends implicitly on $X$ for all $i = 1, 2, 3, 4$.

Sketch of the proof. Define $L_\rho$ as

$L_\rho\colon \mathbb{B}^n \times \mathbb{B}^m \times \mathbb{B}^m \to \mathbb{R},$

$(x, z, u) \mapsto f(x) + g(z) + u^T\big(\exp_0(A(\log_0 x)) - z\big) + \frac{\rho}{2}\big\|\exp_0(A(\log_0 x)) - z\big\|^2.$

Now, from Proposition 3.5, we have that if $(x_{k+1}, z_{k+1}, u_{k+1})$ satisfies equations (7), (8) and (9), then the solution is unique, and then

$0 = \ell_{z_{k+1}}\nabla f(x_{k+1}) - \ell_{x_{k+1}}\nabla g(z_{k+1}) + \rho\,\ell_{x_{k+1}}\ell_{z_{k+1}}\nu_{k+1}(z_{k+1} - z_k),$

for certain terms $\ell_{z_{k+1}}$, $\ell_{x_{k+1}}$ and $\nu_{k+1}$. Following the idea of (França et al., 2018), we choose $t = \delta k$, $x_k = X(t)$, $z_k = Z(t)$, $u_k = U(t)$ and $\nu_k = N(t)$. Using the Mean Value Theorem on the $i$th component of $z_{k+1}$, we have that

$z_{k+1,i} = Z_i(t + \delta) = Z_i(t) + \delta Z_i'(t + \zeta_i\delta)$

for some $\zeta_i \in [0, 1]$. Thus

$\lim_{\delta \to 0} \frac{z_{k+1,i} - z_{k,i}}{\delta} = Z_i'(t).$

Since this holds for every $i$, we can choose $\rho = 1/\delta$ and get

$0 = \ell_{Z(t)}\nabla f(X(t)) - \ell_{X(t)}\nabla g(Z(t)) + \ell_{X(t)}\ell_{Z(t)}N(t)Z'(t).$

Under the same idea, we can get $(\exp_0 A \log_0 X)_i(t) = Z_i(t)$, and since this holds for every component $i$ we have

$Z(t) = \exp_0(A(\log_0 X(t))),$

$Z'(t) = \eta(t) + \iota(t)X(t) + \kappa(t)AX'(t) + \omega(t)A^T A X(t)X'(t),$

for certain terms $\eta(t)$, $\iota(t)$, $\kappa(t)$ and $\omega(t)$. Finally, since neither $\ell_{X(t)}$ nor $\ell_{Z(t)}$ vanishes at any point, we have the result. For more details of the proof, see Appendix B.

This theorem provides a way to understand the continuous version of the optimization process. In other words, it helps us predict how our variables will evolve smoothly over time as we try to find the best values to minimize the function $V(x, z)$. The initial value $X(0) = x_0$ gives us the starting point for this process.
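The following is a rough numerical sketch, not the paper's implementation, of how the updates (7)-(9) could be iterated in practice: the two subproblems have no closed form here, so a generic numerical optimizer is used, and the objectives $f$, $g$, the matrix $A$, the penalty $\rho$ and the starting points are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Rough sketch of the hyperbolic ADMM updates (7)-(9) under the assumptions above.
def log0(x):
    n = np.linalg.norm(x)
    if n == 0:
        return x
    return np.arctanh(min(n, 1 - 1e-9)) * x / n   # clip to stay in the open ball

def exp0(v):
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 0 else v

def mobius_add(u, v):
    uv, nu2, nv2 = np.dot(u, v), np.dot(u, u), np.dot(v, v)
    return ((1 + 2 * uv + nv2) * u + (1 - nu2) * v) / (1 + 2 * uv + nu2 * nv2)

def mobius_sub(u, v):
    return mobius_add(u, -v)

def A_otimes0(A, x):               # the constraint map exp_0(A log_0 x)
    return exp0(A @ log0(x))

f = lambda x: np.sum((x - np.array([0.2, 0.1])) ** 2)    # toy objectives on the ball
g = lambda z: np.sum((z - np.array([-0.1, 0.3])) ** 2)
A, rho = np.array([[0.6, 0.1], [-0.2, 0.5]]), 1.0

x, z, u = np.full(2, 0.01), np.full(2, 0.01), np.zeros(2)
for k in range(30):
    # x-update (7): penalize the Möbius residual (A x  "minus"  z) "plus" u
    x = minimize(lambda v: f(v) + rho / 2 * np.sum(
        mobius_add(mobius_sub(A_otimes0(A, v), z), u) ** 2), x).x
    # z-update (8)
    z = minimize(lambda w: g(w) + rho / 2 * np.sum(
        mobius_add(mobius_sub(A_otimes0(A, x), w), u) ** 2), z).x
    # dual update (9)
    u = mobius_add(u, mobius_sub(A_otimes0(A, x), z))

print(np.linalg.norm(A_otimes0(A, x) - z))   # residual of the constraint z = exp_0(A log_0 x)
```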
4. Stability

Recall that $\gamma$ is the metric given on $\mathbb{B}^n$. Note that $(\mathbb{B}^n, \gamma)$ is a complete metric space. Fix $y \in \mathbb{B}^n$; then

$\lim_{\|x\| \to 1} \gamma(x, y) = \lim_{\tau \to \infty} \ln\!\big(\tau + \sqrt{\tau^2 - 1}\big) = \infty, \quad \text{where } \tau = 1 + \frac{2\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}.$

This implies that any Cauchy sequence is included in a set $K_r := \{y : \|y\| \leq r < 1\}$. Clearly, $K_r$ is compact in the Euclidean topology. Also, the two metrics are equivalent on $K_r$, so we have convergence in the Euclidean sense if, and only if, we have convergence in the hyperbolic sense. Thus, $(\mathbb{B}^n, \gamma)$ is a complete metric space.

Now consider

$\dot{X} = F(X, t), \quad X(t_0) = X_0, \quad (11)$

a first-order dynamical system with $F\colon \mathbb{B}^n \times \mathbb{R} \to \mathbb{B}^n$, $X = X(t) \in \mathbb{B}^n$, $X_0 \in \mathbb{B}^n$ and $t \in \mathbb{R}$. Let $F$ be an $L$-Lipschitz continuous function in $X$, i.e.,

$\gamma(F(X_1, t), F(X_2, t)) \leq L\,\gamma(X_1, X_2)$

for a fixed $t$, some $L > 0$ and all $X_1, X_2 \in \mathbb{B}^n$. Let $\Omega \subset \mathbb{B}^n \times \mathbb{R}$, $(X_0, t_0) \in \Omega$, and suppose that $F$ is continuously differentiable on $\Omega$. Since $(\mathbb{B}^n, \gamma)$ is a complete metric space, (11) has a unique solution $X(t)$ on $t \in (t_0 - \varepsilon, t_0 + \varepsilon)$ for some $\varepsilon > 0$ with $X(t_0) = X_0$. We can extend the solution throughout $\Omega$; furthermore, due to (Hirsch et al., 2012), the solution is a continuous function of $(X_0, t_0)$, and if $F$ depends continuously on some set of parameters, then it is also a continuous function of those parameters.

Definition 4.1. Let $X^*$ be a point such that $F(X^*, t) = 0$ for all $t \geq t_0$. Then $X^*$ is a critical point of the system (11). Also:

1. The point $X^*$ is stable if for every neighborhood $\mathcal{O} \subset \mathbb{B}^n$ of $X^*$ there exists a neighborhood $\mathcal{O}' \subset \mathcal{O}$ of $X^*$ such that every solution $X(t)$ with initial condition $X(t_0) = X_0 \in \mathcal{O}'$ remains in $\mathcal{O}$ for all $t > t_0$;

2. The point $X^*$ is asymptotically stable if it is stable and $\lim_{t \to \infty} X(t) = X^*$ for all $X_0 \in \mathcal{O}'$;

3. The point $X^*$ is unstable if it is not stable.

The given definition indicates that a critical point $X^*$ is considered stable if, within a small neighborhood $\mathcal{O}$, there exists an even smaller neighborhood $\mathcal{O}'$ where all solutions starting from points within $\mathcal{O}'$ remain within the original neighborhood $\mathcal{O}$ for all future times. If this stability condition also involves the system approaching $X^*$ as time goes to infinity, then it is termed asymptotically stable. Conversely, if $X^*$ is not stable, it is classified as unstable. This categorization provides insights into the long-term behavior and stability characteristics of the dynamical system centered around the critical point $X^*$.

The following result characterizes the critical points of the Hyperbolic ADMM flow (10).

Proposition 4.2. Let $X^*$ be a strict local minimizer of $V(X)$. If $X'(t) \to 1$ when $t \to \infty$, then $X^*$ is a critical point of the Hyperbolic ADMM flow (10).

Proof. Since $X^*$ is a strict local minimizer for $V(X)$, there exists a set $\mathcal{O}$ such that $X^* \in \mathcal{O}$ and $V(X) > V(X^*)$ for all $X \in \mathcal{O} \setminus \{X^*\}$. Due to the first-order optimality conditions, we have that $\nabla V(X^*) = 0$. Using this fact, and the fact that $X'(t) \to 1$ when $t \to \infty$, we can conclude that $X^*$ is a critical point of the dynamical system (10).

We can use Lyapunov stability to check the stability of a system. In fact, we can determine whether a dynamical system will stay in a particular state over time. Furthermore, we can extend a classical result (Hirsch et al., 2012) to the Poincaré ball model.

Theorem 4.3. (Hirsch et al., 2012) Let $X^*$ be a critical point of the dynamical system (11). Also, let $\mathcal{O} \subset \mathbb{B}^n$ be an open set containing $X^*$ and $L\colon \mathcal{O} \to \mathbb{R}$ be a continuously differentiable function. We have the following:

1. If $L(\cdot)$ satisfies $L(X^*) = 0$, $L(X) > 0$ for all $X \in \mathcal{O} \setminus \{X^*\}$, and $\dot{L}(X) \leq 0$ for all $X \in \mathcal{O} \setminus \{X^*\}$, then $X^*$ is stable and $L$ is called a Lyapunov function;

2. If we have a strict inequality in the last condition, then $X^*$ is asymptotically stable, and $L$ is called a strict Lyapunov function.

The statement means that as a system evolves over time according to certain equations, a strict Lyapunov function can be used to show that the system's solutions decrease or get closer to a specific condition (see Figure 3). The level sets here refer to sets of points where the Lyapunov function takes constant values. The term strict implies that the Lyapunov function consistently decreases, emphasizing a clear trend towards stability in the system.

Figure 3. The solution decreases through the level sets of a strict Lyapunov function.
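As a numerical illustration of Theorem 4.3, the sketch below integrates a toy linear flow (not the hyperbolic ADMM flow) and checks that the quadratic function $L(X) = \|X\|^2$ decreases monotonically along the trajectory, so the origin is asymptotically stable; the matrix, step size and initial point are arbitrary choices.

```python
import numpy as np

# Toy illustration of Theorem 4.3 on dX/dt = -K X with K symmetric positive
# definite: L(X) = ||X||^2 is a strict Lyapunov function, so trajectories decay
# toward the critical point 0.  (Arbitrary example, not the hyperbolic ADMM flow.)
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def L(X):
    return float(X @ X)

dt, X = 1e-3, np.array([0.8, -0.6])
values = []
for _ in range(5000):
    X = X + dt * (-K @ X)         # explicit Euler step of the flow
    values.append(L(X))

# L decreases monotonically along the trajectory and X approaches the origin.
print(all(a >= b for a, b in zip(values, values[1:])), np.linalg.norm(X))
```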
Theorem 4.4. Let $X^*$ be a critical point (and a strict local minimizer of $V(X)$) of the linearized Hyperbolic ADMM flow. If $A$ is a positive definite matrix, $V(X) > 0$ near $X^*$ and $X(t) > 0$ for all $t > t_0$, where $t_0 \in [0, \infty)$, then $X^*$ is asymptotically stable.

Proof. For this result, we need to assume that $n = m$, i.e., that $A$ is a square matrix. Recall that the flow of the Hyperbolic ADMM is given by

$0 = (A^T A)^{-1}\nabla V(X) + (A^T A)^{-1}\Omega_1 + (A^T A)^{-1}\Omega_2 X + \big((A^T A)^{-1}A\,\Omega_3 + \Omega_4 X\big)\dot{X}.$

This is a non-linear differential equation. Then, if we take a small perturbation $X = X^* + \delta X$, implying $\dot{X} = \delta\dot{X}$, and replace it in the flow of the Hyperbolic ADMM, we have

$0 = (A^T A)^{-1}\nabla V(X^* + \delta X) + (A^T A)^{-1}\Omega_1 + (A^T A)^{-1}\Omega_2(X^* + \delta X) + \big((A^T A)^{-1}A\,\Omega_3 + \Omega_4(X^* + \delta X)\big)\delta\dot{X}.$

Now, if we linearize the non-linear terms by neglecting the high-order terms involving $\delta X$ and $\delta\dot{X}$, we have

$0 = (A^T A)^{-1}\nabla^2 V(X^*)\,\delta X + (A^T A)^{-1}\delta X + \big((A^T A)^{-1}A\,\Omega_3 + \Omega_4 X^*\big)\delta\dot{X}$

$0 = \underbrace{(A^T A)^{-1}\big(\nabla^2 V(X^*) + I_{n\times n}\big)}_{=\,\mathcal{A}}\,\delta X + \underbrace{\big((A^T A)^{-1}A\,\Omega_3 + \Omega_4 X^*\big)}_{=\,\mathcal{B}}\,\delta\dot{X}.$

Thus, we have a dynamical system of the form

$\mathcal{A}X + \mathcal{B}\dot{X} = 0. \quad (12)$

Note that in (12) we are omitting $\delta$. Finally, defining $L(X) = X^T P X$, where $P$ is a positive definite matrix, we can show, using Theorem 4.3 and the fact that $\mathcal{A}$ is a positive definite matrix, that the critical point $X^*$ is asymptotically stable.

This result states that if we have a specific critical point $X^*$ and this point stays low when we make small changes around it, then it is stable over time. In the context of the Hyperbolic ADMM flow, this means that if we start at $X^*$ and the conditions mentioned in the result are satisfied, then as time goes on, we will stay close to the critical point. This is good because it indicates that the optimization process works effectively, converging to the best solution. The result provides a kind of guarantee that our optimization will eventually settle at this optimal point and not wander away.
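A small numerical sketch of the linearized-stability check behind Theorem 4.4: for a system of the form $\mathcal{A}\,\delta X + \mathcal{B}\,\delta\dot{X} = 0$ one has $\delta\dot{X} = -\mathcal{B}^{-1}\mathcal{A}\,\delta X$, so perturbations decay when the eigenvalues of $M = -\mathcal{B}^{-1}\mathcal{A}$ all have negative real part, and a quadratic Lyapunov function $L(X) = X^T P X$ can be obtained from the Lyapunov equation $M^T P + P M = -I$. The matrices below are arbitrary positive definite stand-ins, not the actual matrices of the hyperbolic ADMM flow.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Linearized-stability check for  calA*dX + calB*dX' = 0  (illustrative matrices).
calA = np.array([[3.0, 0.4],
                 [0.4, 2.0]])
calB = np.array([[1.5, 0.2],
                 [0.2, 1.0]])

M = -np.linalg.solve(calB, calA)
eigs = np.linalg.eigvals(M)
print(eigs, np.all(eigs.real < 0))               # all negative -> asymptotic stability

P = solve_continuous_lyapunov(M.T, -np.eye(2))   # solves M^T P + P M = -I
print(np.all(np.linalg.eigvalsh(P) > 0))         # P positive definite -> valid L(X) = X^T P X
```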
5. Conclusions and future work

In this study, we introduce a novel optimizer using hyperbolic geometry. Specifically, we connect the Poincaré ball model to a non-linear differential equation. The complexity arises when the equation is not linearized, necessitating numerical analysis for stability studies. Linearizing the ODE reveals crucial insights into system behavior. Looking forward, we propose a more general exploration of the optimizer, incorporating exponential and logarithmic operations at arbitrary points. This broadens our understanding of its behavior. We also pose questions about extending the optimizer to other hyperbolic models and the impact of isometries on stability and convergence. Comparing our hyperbolic ADMM with the original, we find increased complexity in the hyperbolic version, advancing our understanding of why Hyperbolic Neural Networks perform well. We anticipate a numerical comparison with ADMM in gyrovector space, suggesting the potential superiority of our hyperbolic version. While Möbius operations are computationally expensive, their optimization may be task-dependent, offering a balance between cost and efficiency.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

The authors thank Sebastian Burgos for helpful comments and conversations. The authors also acknowledge funding support from the Millennium Institute Foundational Research on Data, Chile (IMFD ANID - Millennium Science Initiative Program - Code ICN17 002) and the National Center of Artificial Intelligence, Chile (CENIA FB210017, Basal ANID).

References

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. On the surprising behavior of distance metrics in high dimensional space. In Database Theory - ICDT 2001: 8th International Conference, London, UK, January 4-6, 2001, Proceedings 8, pp. 420-434. Springer, 2001.

Albert, R., DasGupta, B., and Mobasheri, N. Topological implications of negative curvature for biological and social networks. Physical Review E, 89(3):032811, 2014.

Allard, A. and Serrano, M. A. Navigable maps of structural brain networks across species. PLoS Computational Biology, 16(2):e1007584, 2020.

Balazevic, I., Allen, C., and Hospedales, T. Multi-relational Poincaré graph embeddings. In Proceedings of NeurIPS, pp. 4463-4473, 2019.

Billera, L. J., Holmes, S. P., and Vogtmann, K. Geometry of the space of phylogenetic trees. Advances in Applied Mathematics, 27(4):733-767, 2001.

Borassi, M., Chessa, A., and Caldarelli, G. Hyperbolicity measures democracy in real-world networks. Physical Review E, 92(3):032812, 2015.

Bourgain, J. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52:46-52, 1985.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 1(1):1-122, 2011. doi: 10.1561/2200000016.

Bridson, M. R. and Haefliger, A. Metric Spaces of Non-Positive Curvature, volume 319. Springer Science & Business Media, 2013.

Chami, I. Representation Learning and Algorithms in Hyperbolic Spaces. Stanford University, 2021.

Chami, I., Ying, Z., Ré, C., and Leskovec, J. Hyperbolic graph convolutional neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Chami, I., Gu, A., Chatziafratis, V., and Ré, C. From trees to continuous embeddings and back: Hyperbolic hierarchical clustering. Advances in Neural Information Processing Systems, 33:15065-15076, 2020.

Cvetkovski, A. and Crovella, M. Hyperbolic embedding and routing for dynamic graphs. In IEEE INFOCOM 2009, pp. 1647-1655. IEEE, 2009.

Dhingra, B., Shallue, C. J., Norouzi, M., Dai, A. M., and Dahl, G. E. Embedding text in hyperbolic spaces. arXiv preprint arXiv:1806.04313, 2018.

França, G., Robinson, D., and Vidal, R. ADMM and accelerated ADMM as continuous dynamical systems. May 2018.

Ganea, O., Bécigneul, G., and Hofmann, T. Hyperbolic neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Gromov, M. Hyperbolic groups. Springer, 1987a.

Gromov, M. Hyperbolic groups. In Gersten, S. M. (ed.), Essays in Group Theory, Mathematical Sciences Research Institute Publications, pp. 75-263. Springer, New York, NY, 1987b. ISBN 978-1-4613-9586-7. doi: 10.1007/978-1-4613-9586-7_3. URL https://doi.org/10.1007/978-1-4613-9586-7_3.

Hirsch, M., Smale, S., and Devaney, R. Differential Equations, Dynamical Systems, and an Introduction to Chaos. Academic Press, 2012.

Indyk, P., Matoušek, J., and Sidiropoulos, A. Low-distortion embeddings of finite metric spaces. 2017.

Lamping, J., Rao, R., and Pirolli, P. A focus+context technique based on hyperbolic geometry for visualizing large hierarchies. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems - CHI '95, pp. 401-408, Denver, Colorado, United States, 1995. ACM Press. ISBN 978-0-201-84705-5. doi: 10.1145/223904.223956. URL http://portal.acm.org/citation.cfm?doid=223904.223956.

Matsumoto, H., Mimori, T., and Fukunaga, T.
Novel metric for hyperbolic phylogenetic tree embeddings. Biology Methods and Protocols, 6(1):bpab006, 2021.

Narendra, K. and Parthasarathy, K. Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Transactions on Neural Networks, 2(2):252-262, 1991. doi: 10.1109/72.80336.

Petersen, P. Riemannian Geometry, volume 171 of Graduate Texts in Mathematics. Springer International Publishing, Cham, 2016. ISBN 978-3-319-26652-7, 978-3-319-26654-1. doi: 10.1007/978-3-319-26654-1. URL http://link.springer.com/10.1007/978-3-319-26654-1.

Rodríguez-Flores, M. A. and Papadopoulos, F. Hyperbolic mapping of human proximity networks. Scientific Reports, 10(1):20244, 2020.

Sarkar, R. Low distortion Delaunay embedding of trees in hyperbolic plane. In Van Kreveld, M. and Speckmann, B. (eds.), Graph Drawing, volume 7034, pp. 355-366. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-25877-0, 978-3-642-25878-7. doi: 10.1007/978-3-642-25878-7_34. URL http://link.springer.com/10.1007/978-3-642-25878-7_34. Series Title: Lecture Notes in Computer Science.

Sun, Z., Chen, M., Hu, W., Wang, C., Dai, J., and Zhang, W. Knowledge association with hyperbolic knowledge graph embeddings. arXiv preprint arXiv:2010.02162, 2020.

Yang, M., Zhou, M., Li, Z., Liu, J., Pan, L., Xiong, H., and King, I. Hyperbolic graph neural networks: A review of methods and applications. arXiv preprint arXiv:2202.13852, 2022.

A. Use of Möbius operations

Definition A.1. Two vectors $x, y \in \mathbb{B}^n$ are linearly dependent if for some $a \in \mathbb{R}$ we can write $x = a \otimes y$.

Consider the set $\{(1/2, 0), (1/4, 0)\}$. We look for $a$ such that $(1/2, 0) = a \otimes (1/4, 0)$, i.e., $\tanh(a \operatorname{arctanh}(1/4)) = 1/2$. Since $\operatorname{arctanh}(1/4) = \tfrac{1}{2}\ln(5/3)$, we have

$\frac{1}{2} = \frac{(5/3)^a - 1}{(5/3)^a + 1},$

hence $(5/3)^a + 1 = 2(5/3)^a - 2$ and $a = \frac{\ln 3}{\ln(5/3)}$. Thus, the vectors $(1/2, 0), (1/4, 0)$ are linearly dependent. More generally, if $(x, 0) = a \otimes (y, 0)$,

$x = \frac{\left(\frac{1+|y|}{1-|y|}\right)^a - 1}{\left(\frac{1+|y|}{1-|y|}\right)^a + 1}, \qquad a = \frac{\ln\frac{y/|y| + x}{y/|y| - x}}{\ln\frac{1+|y|}{1-|y|}}.$

Proposition A.2. A set $A$ is linearly dependent in $\mathbb{R}^n$ if and only if it is linearly dependent in $\mathbb{B}^n$.

In $\mathbb{B}^n$, choose $\{(1/2, 0), (0, 1/2)\}$ and suppose that $a \otimes (1/2, 0) \oplus b \otimes (0, 1/2) = 0$ with $a, b \neq 0$. Writing the resulting system componentwise (in terms of powers of $3$ in $a$ and $b$) shows that it has no solutions. Thus, the set $\{(1/2, 0), (0, 1/2)\}$ is linearly independent. Now, if we choose $\{(1/2, 0), (1/2, 0)\}$ and again $a, b \neq 0$, the analogous system reduces to $(x + y)^2 = 2x^2 - 2$, which has no solutions since $x^2 - 1 < 0$; in fact, it is easy to note that $(1/2, 0) \ominus (1/2, 0) = 0$.

Proposition A.3. Let $\mathbb{B}^n = (\mathbb{B}^n, \oplus, \otimes)$. If two vectors $x, y$ are orthogonal in $\mathbb{R}^n$, then $x, y$ are linearly independent in $\mathbb{B}^n$.

Proof. Consider $a, b \in \mathbb{R} \setminus \{0\}$ and $\{x, y\} \subset \mathbb{B}^n$ such that $\langle x, y\rangle = 0$. Then $a \otimes x \oplus b \otimes y = 0$ leads, after expanding the Möbius operations, to an expression of the form $r_1 x + r_2 y = 0$, where (up to positive factors)

$r_1 = \frac{\left(\frac{1+\|x\|}{1-\|x\|}\right)^a - 1}{\left(\frac{1+\|x\|}{1-\|x\|}\right)^a + 1}, \qquad r_2 = \frac{\left(\frac{1+\|y\|}{1-\|y\|}\right)^b - 1}{\left(\frac{1+\|y\|}{1-\|y\|}\right)^b + 1}.$

It is easy to see that $r_1$ and $r_2$ cannot be zero. So, the orthogonal set $\{x, y\}$ is linearly independent.

For matrix multiplication, we will use the following example. Let

$M = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}, \qquad N = \begin{pmatrix} 1/3 & 0 \\ 0 & 1/3 \end{pmatrix}.$

If $M, N \in \mathbb{B}^{n \times m}$, where $M = [M_1 | \ldots | M_m]$ and $N = [N_1 | \ldots | N_m]$, we define $M \oplus N = [M_1 \oplus N_1 | \ldots | M_m \oplus N_m]$. As an example, with $M$ and $N$ as above,

$M \oplus N = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix} \oplus \begin{pmatrix} 1/3 & 0 \\ 0 & 1/3 \end{pmatrix} = \begin{pmatrix} 2/3 & 0 \\ 0 & 2/3 \end{pmatrix}.$

Let $x \in \mathbb{B}^n$. We are searching for a matrix $M$ such that $M \otimes x = x$. If we choose $M$ as the $n$-dimensional identity matrix (i.e., diagonal entries equal to 1 and the rest 0), we have

$\tanh\!\left(\ln\frac{1+\|x\|}{1-\|x\|}\right)\frac{x}{\|x\|} = x,$

that is,

$\frac{2x_1}{1 + x_1^2 + x_2^2} = x_1, \qquad \frac{2x_2}{1 + x_1^2 + x_2^2} = x_2.$

This system has no solutions (assuming that $x_1, x_2 \neq 0$) because $\|x\| < 1$. Thus, $M$ cannot be the $n$-identity matrix. To solve $M \otimes x = x$ for $M$, we have to compute when

$\frac{\|Mx\|}{\|x\|}\operatorname{arctanh}(\|x\|) = \ln\frac{1 + \|Mx\|}{1 - \|Mx\|}.$

This equation has solutions only by numerical approximation.
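Before moving to the proof, a quick numerical confirmation of the first Appendix A example, using the Möbius scalar multiplication of Definition 2.6; the only inputs are the vectors from the example itself.

```python
import numpy as np

# Check of the Appendix A example: (1/2, 0) = a ⊗ (1/4, 0) with a = ln 3 / ln(5/3).
def mobius_scalar(r, x):
    nx = np.linalg.norm(x)
    return np.tanh(r * np.arctanh(nx)) * x / nx

a = np.log(3.0) / np.log(5.0 / 3.0)
print(a, mobius_scalar(a, np.array([0.25, 0.0])))   # second output ~ [0.5, 0.0]
```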
B. Proof of Theorem 3.6

Theorem B.1. Consider the hyperbolic optimization problem given by

$\min_{x,z} \{ V(x, z) = f(x) + g(z) \}, \quad \text{subject to } z = \exp_0(A(\log_0 x)),$

and the function $V(x) = f(x) + g(\exp_0(A(\log_0 x)))$. The continuous limit associated with the Hyperbolic ADMM updates, with $t = k/\rho$, corresponds to the initial value problem

$(A^T A)^{-1}\nabla V(X) + (A^T A)^{-1}\Omega_1 + (A^T A)^{-1}\Omega_2 X + \big((A^T A)^{-1}A\,\Omega_3 + \Omega_4 X\big)\dot{X} = 0 \quad (13)$

with $X(0) = x_0$, where $\Omega_i$ depends implicitly on $X$ for all $i = 1, 2, 3, 4$.

Proof. Define $L_\rho$ as

$L_\rho\colon \mathbb{B}^n \times \mathbb{B}^m \times \mathbb{B}^m \to \mathbb{R},$

$(x, z, u) \mapsto f(x) + g(z) + u^T\big(\exp_0(A(\log_0 x)) - z\big) + \frac{\rho}{2}\big\|\exp_0(A(\log_0 x)) - z\big\|^2.$

Differentiating $L_\rho$ with respect to $u$ leads to the dual update (9),

$u_{k+1} = u_k \oplus \big(\exp_0(A(\log_0 x_{k+1})) \ominus z_{k+1}\big).$

Write $\alpha_{k+1} := \exp_0(A(\log_0 x_{k+1}))$; expanding $\exp_0$ and $\log_0$ explicitly, $\alpha_{k+1}$ depends on $x_{k+1}$ through the quantities $\frac{1+\|x_{k+1}\|}{1-\|x_{k+1}\|}$, $\frac{\|Ax_{k+1}\|}{\|x_{k+1}\|}$ and $\frac{Ax_{k+1}}{\|Ax_{k+1}\|}$ (see Appendix C). Then

$u_{k+1} = u_k \oplus (\alpha_{k+1} \ominus z_{k+1}) = u_k \oplus \frac{(1 - 2\langle\alpha_{k+1}, z_{k+1}\rangle + \|z_{k+1}\|^2)\alpha_{k+1} - (1 - \|\alpha_{k+1}\|^2)z_{k+1}}{1 - 2\langle\alpha_{k+1}, z_{k+1}\rangle + \|\alpha_{k+1}\|^2\|z_{k+1}\|^2}.$

On the right-hand side of the equality, we have two constants multiplying two vectors; call them $\mu_{k+1}$ and $\nu_{k+1}$ respectively, so that $\alpha_{k+1} \ominus z_{k+1} = \mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}$. Thus,

$u_{k+1} = \frac{\big(1 + 2\langle u_k, \mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\rangle + \|\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\|^2\big)u_k + \big(1 - \|u_k\|^2\big)\big(\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\big)}{1 + 2\langle u_k, \mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\rangle + \|u_k\|^2\|\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\|^2}.$

Using the previous notation, we can rewrite equations (7) and (8) as

$x_{k+1} = \arg\min_{x \in \mathbb{B}^n} \; f(x) + \tfrac{\rho}{2}\|(\mu\alpha - \nu z_k) \oplus u_k\|^2 \quad (14)$

$z_{k+1} = \arg\min_{z \in \mathbb{B}^m} \; g(z) + \tfrac{\rho}{2}\|(\mu_{k+1}\alpha_{k+1} - \nu z) \oplus u_k\|^2 \quad (15)$

(where $\mu$, $\alpha$ and $\nu$ are evaluated at the corresponding optimization variable). Now, from Proposition 3.5, we have that if $(x_{k+1}, z_{k+1}, u_{k+1})$ satisfies equations (14), (15) and (9), then the solution is unique. Thus, we have

$0 = \nabla f(x_{k+1}) + \rho\big(\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_k + u_{k+1}\big)\,\ell_{x_{k+1}}$

$0 = \nabla g(z_{k+1}) + \rho\big(\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1} + u_{k+1}\big)\,\ell_{z_{k+1}},$

where $\ell_{x_{k+1}}$ and $\ell_{z_{k+1}}$ collect the derivatives of $\mu_{k+1}$, $\alpha_{k+1}$ and $\nu_{k+1}$ with respect to $x_{k+1}$ and $z_{k+1}$, respectively. Multiplying the first equality by $\ell_{z_{k+1}}$, the second one by $\ell_{x_{k+1}}$ and subtracting both, we have

$0 = \ell_{z_{k+1}}\nabla f(x_{k+1}) - \ell_{x_{k+1}}\nabla g(z_{k+1}) + \rho\,\ell_{x_{k+1}}\ell_{z_{k+1}}\nu_{k+1}(z_{k+1} - z_k).$

Following the idea of (Boyd et al., 2011), we choose $t = \delta k$, $x_k = X(t)$, $z_k = Z(t)$, $u_k = U(t)$ and $\nu_k = N(t)$. Using the Mean Value Theorem on the $i$th component of $z_{k+1}$, we have that

$z_{k+1,i} = Z_i(t + \delta) = Z_i(t) + \delta Z_i'(t + \zeta_i\delta)$

for some $\zeta_i \in [0, 1]$, and hence

$\lim_{\delta \to 0} \frac{z_{k+1,i} - z_{k,i}}{\delta} = Z_i'(t).$

Since this holds for every $i$, we can choose $\rho = 1/\delta$ and get

$\ell_{z_{k+1}}\nabla f(x_{k+1}) - \ell_{x_{k+1}}\nabla g(z_{k+1}) + \rho\,\ell_{x_{k+1}}\ell_{z_{k+1}}\nu_{k+1}(z_{k+1} - z_k) = 0$

$\Longrightarrow\; \ell_{Z(t)}\nabla f(X(t)) - \ell_{X(t)}\nabla g(Z(t)) + \ell_{X(t)}\ell_{Z(t)}N(t)Z'(t) = 0.$

Now, on the $i$th component of (9) we have

$U_i(t + \delta) = U_i(t) + \big(\exp_0(A(\log_0 X))\big)_i(t + \delta) - Z_i(t + \delta).$

Again, by the Mean Value Theorem there exists $\zeta_i \in [0, 1]$ such that

$\delta U_i'(t + \zeta_i\delta) = \big(\exp_0 A \log_0 X\big)_i(t + \delta) - Z_i(t + \delta),$

so, letting $\delta \to 0$, $\big(\exp_0 A \log_0 X\big)_i(t) = Z_i(t)$.
Since this holds for every component $i$, we have

$Z(t) = \exp_0(A(\log_0 X(t))).$

Differentiating this identity with respect to $t$ and collecting terms (the explicit computation involves the quantities $\frac{1+\|X(t)\|}{1-\|X(t)\|}$, $\frac{\|AX(t)\|}{\|X(t)\|}$, $\frac{AX(t)}{\|AX(t)\|}$ and $\lambda_{\max}(A^T A)$), we obtain

$Z'(t) = \eta(t) + \iota(t)X(t) + \kappa(t)AX'(t) + \omega(t)A^T A X(t)X'(t)$

for certain scalar terms $\eta(t)$, $\iota(t)$, $\kappa(t)$ and $\omega(t)$ depending on $X(t)$. Since neither $\ell_{X(t)}$ nor $\ell_{Z(t)}$ vanishes at any point, we have

$\nabla f(X(t)) - \frac{\ell_{X(t)}}{\ell_{Z(t)}}\nabla g(Z(t)) + \ell_{X(t)}N(t)Z'(t) = 0$

$\nabla V(X(t)) + \underbrace{\ell_{X(t)}N(t)\eta(t)}_{=\,\Omega_1(t)} + \underbrace{\ell_{X(t)}N(t)\iota(t)}_{=\,\Omega_2(t)}X(t) + \underbrace{\ell_{X(t)}N(t)\kappa(t)}_{=\,\Omega_3(t)}AX'(t) + \underbrace{\ell_{X(t)}N(t)\omega(t)}_{=\,\Omega_4(t)}A^T A X(t)X'(t) = 0$

$\nabla V(X(t)) + \Omega_1(t) + \Omega_2(t)X(t) + \Omega_3(t)AX'(t) + \Omega_4(t)A^T A X(t)X'(t) = 0$

$(A^T A)^{-1}\nabla V(X(t)) + (A^T A)^{-1}\Omega_1(t) + (A^T A)^{-1}\Omega_2(t)X(t) + (A^T A)^{-1}A\,\Omega_3(t)X'(t) + \Omega_4(t)X(t)X'(t) = 0$

$(A^T A)^{-1}\nabla V(X(t)) + (A^T A)^{-1}\Omega_1(t) + (A^T A)^{-1}\Omega_2(t)X(t) + \big((A^T A)^{-1}A\,\Omega_3(t) + \Omega_4(t)X(t)\big)X'(t) = 0,$

which is equivalent to (10) since $A$ is invertible. Finally, since (10) is a non-homogeneous first-order equation, the dynamics is specified by $X(0) = x_0$, where $x_0$ is the estimate of the initial solution of the problem in Theorem B.1.

C. Closed forms

Due to Theorem 3.6, we have explicit forms for the derivatives of $\alpha_{k+1}$, $\mu_{k+1}$ and $\nu_{k+1}$ with respect to $x_{k+1}$ and $z_{k+1}$. This will be useful for running experiments in the future. The dual update can be written implicitly as

$u_{k+1} - \frac{\big(1 + 2\langle u_k, \mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\rangle + \|\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\|^2\big)u_k + \big(1 - \|u_k\|^2\big)\big(\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\big)}{1 + 2\langle u_k, \mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\rangle + \|u_k\|^2\|\mu_{k+1}\alpha_{k+1} - \nu_{k+1}z_{k+1}\|^2} = 0.$

For $\alpha_{k+1}$, define

$\beta_{k+1} = \frac{1 + \|x_{k+1}\|}{1 - \|x_{k+1}\|}, \qquad \gamma_{k+1} = \frac{2\|Ax_{k+1}\|}{\|x_{k+1}\|}, \qquad \delta_{k+1} = \frac{Ax_{k+1}}{\|Ax_{k+1}\|},$

so that $\alpha_{k+1} = \frac{\beta_{k+1}^{\gamma_{k+1}} - 1}{\beta_{k+1}^{\gamma_{k+1}} + 1}\,\delta_{k+1}$. This implies that

$\frac{\partial \beta_{k+1}}{\partial x_{k+1}} = \frac{2x_{k+1}}{(1 - \|x_{k+1}\|)^2},$

while $\frac{\partial \gamma_{k+1}}{\partial x_{k+1}}$ and $\frac{\partial \delta_{k+1}}{\partial x_{k+1}}$ follow from differentiating $\frac{2\|Ax_{k+1}\|}{\|x_{k+1}\|}$ and $\frac{Ax_{k+1}}{\|Ax_{k+1}\|}$ (they involve $A$, $A^T A x_{k+1}$ and the norms $\|x_{k+1}\|$, $\|Ax_{k+1}\|$). By the quotient and chain rules,

$\frac{\partial \alpha_{k+1}}{\partial x_{k+1}} = \frac{2\beta_{k+1}^{\gamma_{k+1}}\left(\log(\beta_{k+1})\frac{\partial \gamma_{k+1}}{\partial x_{k+1}} + \frac{\gamma_{k+1}}{\beta_{k+1}}\frac{\partial \beta_{k+1}}{\partial x_{k+1}}\right)}{\big(\beta_{k+1}^{\gamma_{k+1}} + 1\big)^2}\,\delta_{k+1} + \frac{\beta_{k+1}^{\gamma_{k+1}} - 1}{\beta_{k+1}^{\gamma_{k+1}} + 1}\,\frac{\partial \delta_{k+1}}{\partial x_{k+1}},$

and substituting the definitions of $\beta_{k+1}$, $\gamma_{k+1}$ and $\delta_{k+1}$ gives the fully explicit expression. Recall that

$\mu_{k+1} = \frac{1 - 2\langle \alpha_{k+1}, z_{k+1}\rangle + \|z_{k+1}\|^2}{1 - 2\langle \alpha_{k+1}, z_{k+1}\rangle + \|\alpha_{k+1}\|^2\|z_{k+1}\|^2} =: \frac{N_1}{D}.$

The quotient rule gives

$\frac{\partial \mu_{k+1}}{\partial x_{k+1}} = \frac{\frac{\partial N_1}{\partial x_{k+1}}D - N_1\frac{\partial D}{\partial x_{k+1}}}{D^2}, \qquad \frac{\partial N_1}{\partial x_{k+1}} = -2\left(\frac{\partial \alpha_{k+1}}{\partial x_{k+1}}\right)^{\!T} z_{k+1}, \qquad \frac{\partial D}{\partial x_{k+1}} = -2\left(\frac{\partial \alpha_{k+1}}{\partial x_{k+1}}\right)^{\!T} z_{k+1} + 2\|z_{k+1}\|^2\left(\frac{\partial \alpha_{k+1}}{\partial x_{k+1}}\right)^{\!T}\alpha_{k+1}.$

So now we have a closed form for $\partial\mu_{k+1}/\partial x_{k+1}$. Recall that

$\nu_{k+1} = \frac{1 - \|\alpha_{k+1}\|^2}{1 - 2\langle \alpha_{k+1}, z_{k+1}\rangle + \|\alpha_{k+1}\|^2\|z_{k+1}\|^2},$

so that

$\frac{\partial \nu_{k+1}}{\partial x_{k+1}} = \frac{-2\left(\frac{\partial \alpha_{k+1}}{\partial x_{k+1}}\right)^{\!T}\alpha_{k+1}\, D - \big(1 - \|\alpha_{k+1}\|^2\big)\frac{\partial D}{\partial x_{k+1}}}{D^2}.$

Now we need the derivatives with respect to
$z_{k+1}$. Recall that

$\mu_{k+1} = \frac{1 - 2\langle \alpha_{k+1}, z_{k+1}\rangle + \|z_{k+1}\|^2}{1 - 2\langle \alpha_{k+1}, z_{k+1}\rangle + \|\alpha_{k+1}\|^2\|z_{k+1}\|^2} = \frac{N_1}{D}.$

Since $\alpha_{k+1}$ does not depend on $z_{k+1}$, the quotient rule gives

$\frac{\partial \mu_{k+1}}{\partial z_{k+1}} = \frac{\frac{\partial N_1}{\partial z_{k+1}}D - N_1\frac{\partial D}{\partial z_{k+1}}}{D^2} = \frac{2(z_{k+1} - \alpha_{k+1})D + 2N_1\big(\alpha_{k+1} - \|\alpha_{k+1}\|^2 z_{k+1}\big)}{D^2},$

using $\frac{\partial N_1}{\partial z_{k+1}} = -2\alpha_{k+1} + 2z_{k+1}$ and $\frac{\partial D}{\partial z_{k+1}} = -2\alpha_{k+1} + 2\|\alpha_{k+1}\|^2 z_{k+1}$. Similarly,

$\frac{\partial \nu_{k+1}}{\partial z_{k+1}} = \frac{\partial}{\partial z_{k+1}}\left(\frac{1 - \|\alpha_{k+1}\|^2}{D}\right) = \frac{-\big(1 - \|\alpha_{k+1}\|^2\big)\frac{\partial D}{\partial z_{k+1}}}{D^2} = \frac{\big(-2\alpha_{k+1} + 2\|\alpha_{k+1}\|^2 z_{k+1}\big)\big(\|\alpha_{k+1}\|^2 - 1\big)}{D^2}.$
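A short numerical check of the decomposition used throughout Appendices B and C, namely $\alpha \ominus z = \mu\alpha - \nu z$ with $\mu$ and $\nu$ as defined above; the test vectors are arbitrary points of the unit ball.

```python
import numpy as np

# Check: alpha ⊖ z = mu*alpha - nu*z with
# mu = (1 - 2<alpha,z> + ||z||^2) / (1 - 2<alpha,z> + ||alpha||^2 ||z||^2),
# nu = (1 - ||alpha||^2)          / (1 - 2<alpha,z> + ||alpha||^2 ||z||^2).
def mobius_add(u, v):
    uv, nu2, nv2 = np.dot(u, v), np.dot(u, u), np.dot(v, v)
    return ((1 + 2 * uv + nv2) * u + (1 - nu2) * v) / (1 + 2 * uv + nu2 * nv2)

alpha = np.array([0.2, -0.3])
z = np.array([0.4, 0.1])

D = 1 - 2 * np.dot(alpha, z) + np.dot(alpha, alpha) * np.dot(z, z)
mu = (1 - 2 * np.dot(alpha, z) + np.dot(z, z)) / D
nu = (1 - np.dot(alpha, alpha)) / D

print(np.allclose(mobius_add(alpha, -z), mu * alpha - nu * z))   # True
```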