# Robust large-margin learning in hyperbolic space

Melanie Weber (Princeton University), mw25@math.princeton.edu
Manzil Zaheer (Google Research), manzilzaheer@google.com
Ankit Singh Rawat (Google Research), ankitsrawat@google.com
Aditya Menon (Google Research), adityakmenon@google.com
Sanjiv Kumar (Google Research), sanjivk@google.com

(Work performed while an intern at Google Research, New York. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.)

Recently, there has been a surge of interest in representation learning in hyperbolic spaces, driven by their ability to represent hierarchical data with significantly fewer dimensions than standard Euclidean spaces. However, the viability and benefits of hyperbolic spaces for downstream machine learning tasks have received less attention. In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space. Specifically, we consider the problem of learning a large-margin classifier for data possessing a hierarchical structure. Our first contribution is a hyperbolic perceptron algorithm, which provably converges to a separating hyperplane. We then provide an algorithm to efficiently learn a large-margin hyperplane, relying on the careful injection of adversarial examples. Finally, we prove that for hierarchical data that embeds well into hyperbolic space, the low embedding dimension ensures superior guarantees when learning the classifier directly in hyperbolic space.

1 Introduction

Hyperbolic spaces have received sustained interest in recent years, owing to their ability to compactly represent data possessing hierarchical structure (e.g., trees and graphs). In terms of representation learning, hyperbolic spaces offer a provable advantage over Euclidean spaces for such data: objects requiring an exponential number of dimensions in Euclidean space can be represented in a polynomial number of dimensions in hyperbolic space [28]. This has motivated research into efficiently learning a suitable hyperbolic embedding for large-scale datasets [22, 4, 31].

Despite this impressive representation power, little is known about the benefits of hyperbolic spaces for downstream tasks. For example, suppose we wish to perform classification on data that is intrinsically hierarchical. One may naïvely ignore this structure, and use a standard Euclidean embedding and corresponding classifier (e.g., SVM). However, can we design classification algorithms that exploit the structure of hyperbolic space, and offer provable benefits in terms of performance? This fundamental question has received surprisingly limited attention. While some prior work has proposed specific algorithms for learning classifiers in hyperbolic space [6, 21], these have been primarily empirical in nature, and do not come equipped with theoretical guarantees on convergence and generalization.

In this paper, we take a first step towards addressing this question for the problem of learning a large-margin classifier. We provide a series of algorithms that provably learn such classifiers in hyperbolic space, and establish their superiority over classifiers naïvely learned in Euclidean space. This shows that by using a hyperbolic space that better reflects the intrinsic geometry of the data, one can see gains in both representation size and performance. Specifically, our contributions are:
(i) We provide a hyperbolic version of the classic perceptron algorithm and establish its convergence for data that is separable with a margin (Theorem 3.1).

(ii) We establish how suitable injection of adversarial examples into gradient-based loss minimization yields an algorithm which can efficiently learn a large-margin classifier (Theorem 4.3). We further establish that simply performing gradient descent or using adversarial examples alone does not suffice to efficiently yield such a classifier.

(iii) We compare the Euclidean and hyperbolic approaches for hierarchical data and analyze the trade-off between low embedding dimensions and low distortion (dimension-distortion trade-off) when learning robust classifiers on embedded data. For hierarchical data that embeds well into hyperbolic space, it suffices to use a smaller embedding dimension while ensuring superior guarantees when we learn a classifier in hyperbolic space.

Contribution (i) establishes that it is possible to design classification algorithms that exploit the structure of hyperbolic space, while provably converging to some admissible (not necessarily large-margin) separator. Contribution (ii) establishes that it is further possible to design classification algorithms that provably converge to a large-margin separator, by suitably injecting adversarial examples. Contribution (iii) shows that adapting algorithms to the intrinsic geometry of the data can enable efficient utilization of the embedding space without affecting performance.

Related work. Our results can be seen as hyperbolic analogues of classic results for Euclidean spaces. The large-margin learning problem is well studied in the Euclidean setting. Classic algorithms for learning classifiers include the perceptron [25, 24, 12] and support vector machines [8]. Robust margin learning has been widely studied, notably by Lanckriet et al. [16], El Ghaoui et al. [10], Kim et al. [15] and, recently, via adversarial approaches by Charles et al. [5], Ji and Telgarsky [14], Li et al. [18] and Soudry et al. [30]. Adversarial learning has recently gained interest through efforts to train more robust deep learning systems (see, e.g., [20, 11]). Recently, the representation of (hierarchical) data in hyperbolic space has seen a surge of interest. The literature focuses mostly on learning representations in the Poincaré [22, 4, 31] and Lorentz [23] models of hyperbolic space, as well as on analyzing representation trade-offs in hyperbolic embeddings [27, 32]. The body of work on performing downstream ML tasks in hyperbolic space is much smaller and mostly without theoretical guarantees. Monath et al. [21] study hierarchical clustering in hyperbolic space. Cho et al. [6] introduce a hyperbolic version of support vector machines for binary classification in hyperbolic space, albeit without theoretical guarantees. Ganea et al. [13] introduce a hyperbolic version of neural networks that shows empirical promise on downstream tasks, but likewise without theoretical guarantees. To the best of our knowledge, neither robust large-margin learning nor adversarial learning in hyperbolic spaces has been studied in the literature, including [6]. Furthermore, we are not aware of any theoretical analysis of dimension-distortion trade-offs in the related literature.

2 Background and notation

Figure 1: Lorentz model (geodesics in red).

We begin by reviewing some background material on hyperbolic spaces, as well as embedding into and learning in these spaces.
2.1 Hyperbolic space

Hyperbolic spaces are smooth Riemannian manifolds $M = \mathbb{H}^d$ with constant negative curvature $\kappa$ and are as such locally Euclidean spaces. There are several equivalent models of hyperbolic space, each highlighting a different geometric aspect. In this work, we mostly consider the Lorentz model (aka hyperboloid model), which we briefly introduce below, with more details provided in Appendix A (see [3] for a comprehensive overview). For $x, x' \in \mathbb{R}^{d+1}$, let $x * x' = x_0 x_0' - \sum_{i=1}^{d} x_i x_i'$ denote their Minkowski product. The Lorentz model is defined as $\mathbb{L}^d = \{x \in \mathbb{R}^{d+1} : x * x = 1\}$, with distance measure $d_L(x, x') = \operatorname{acosh}(x * x')$. Note that the distance $d_L(x, x')$ corresponds to the length of the shortest line (geodesic) along the manifold connecting $x$ and $x'$ (cf. Fig. 1). We also point out that $(\mathbb{L}^d, d_L)$ forms a metric space.

2.2 Embeddability of hierarchical data

A map $\phi : X_1 \to X_2$ between metric spaces $(X_1, d_1)$ and $(X_2, d_2)$ is called an embedding. The multiplicative distortion of $\phi$ is defined to be the smallest constant $c_M \geq 1$ such that, for all $x, x' \in X_1$, $d_2(\phi(x), \phi(x')) \leq d_1(x, x') \leq c_M \, d_2(\phi(x), \phi(x'))$. When $c_M = 1$, $\phi$ is termed an isometric embedding. Since hierarchical data is tree-like, we can use classic embeddability results for trees as a reference point. Bourgain [2] and Linial et al. [19] showed that an $N$-point metric $X$ (i.e., $|X| = N$) embeds into Euclidean space $\mathbb{R}^{O(\log^2 N)}$ with $c_M = O(\log N)$. This bound is tight for trees in the sense that embedding them in a Euclidean space (of any dimension) must incur $c_M = \Omega(\log N)$ [19]. In contrast, Sarkar [28] showed that trees embed quasi-isometrically with $c_M = O(1 + \epsilon)$ into hyperbolic space $\mathbb{H}^d$, even in the low-dimensional regime with the dimension as small as $d = 2$.

2.3 Classification in hyperbolic space

We consider classification problems of the following form: $X \subseteq \mathbb{L}^d$ denotes the feature space, $Y = \{\pm 1\}$ the binary label space, and $W \subseteq \mathbb{R}^{d+1}$ the model space. In the following, we denote the training set as $S \subseteq X \times Y$. We begin by defining geodesic decision boundaries. Consider the Lorentz space $\mathbb{L}^d$ with ambient space $\mathbb{R}^{d+1}$. Then every geodesic decision boundary arises as the intersection of $\mathbb{L}^d$ with a hyperplane in the ambient space $\mathbb{R}^{d+1}$. Further, consider the set of linear separators or decision functions of the form $H = \{h_w : w \in \mathbb{R}^{d+1}, w * w < 0\}$, where

$$h_w(x) = \begin{cases} 1, & w * x > 0, \\ -1, & \text{otherwise.} \end{cases} \qquad (2.1)$$

Note that the requirement $w * w < 0$ in (2.1) ensures that the intersection of $\mathbb{L}^d$ and the decision hyperplane of $h_w$ is not empty. The geodesic decision boundary corresponding to the decision function $h_w$ is then given by $H_w = \{z \in \mathbb{L}^d : w * z = 0\}$. The distance of a point $x \in \mathbb{L}^d$ from the decision boundary $H_w$ can be computed as $d(x, H_w) = \operatorname{asinh}\left(|w * x| / \sqrt{-w * w}\right)$ [6].

2.4 Large-margin classification in hyperbolic space

In this paper, we are interested in learning a large-margin classifier in a hyperbolic space. Analogous to the Euclidean setting, the natural notion of margin is the minimal distance to the decision boundary over all training samples:

$$\operatorname{margin}_S(w) = \inf_{(x,y) \in S} y\, h_w(x)\, d(x, H_w) = \inf_{(x,y) \in S} \operatorname{asinh}\left( \frac{y\,(w * x)}{\sqrt{-w * w}} \right). \qquad (2.2)$$

For large-margin classifier learning, we aim to find $h_{w^*}$ defined by $w^* = \operatorname{argmax}_{w \in C} \operatorname{margin}_S(w)$, where $C = \{w \in \mathbb{R}^{d+1} : w * w < 0\}$ imposes a nonconvex constraint. This makes the problem computationally intractable using classical methods, unlike its Euclidean counterpart.
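To make these definitions concrete, here is a minimal NumPy sketch of the quantities above: the Minkowski product, the Lorentz distance, the decision function (2.1), the distance to a decision boundary, and the margin (2.2). It is only an illustration of the formulas; the function names and the toy example are ours, not from the paper.

```python
import numpy as np

def minkowski(x, xp):
    """Minkowski product x * x' = x_0 x'_0 - sum_{i>=1} x_i x'_i (Sec. 2.1)."""
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def lorentz_dist(x, xp):
    """Geodesic distance on L^d: d_L(x, x') = acosh(x * x')."""
    return np.arccosh(np.clip(minkowski(x, xp), 1.0, None))  # clip guards rounding error

def decide(w, x):
    """Decision function h_w(x) from (2.1); assumes w * w < 0."""
    return 1.0 if minkowski(w, x) > 0 else -1.0

def dist_to_boundary(w, x):
    """Distance of x in L^d to H_w = {z : w * z = 0} (Sec. 2.3)."""
    return np.arcsinh(abs(minkowski(w, x)) / np.sqrt(-minkowski(w, w)))

def margin(w, S):
    """margin_S(w) = inf_{(x,y) in S} asinh(y (w * x) / sqrt(-w * w))  (2.2)."""
    c = np.sqrt(-minkowski(w, w))
    return min(np.arcsinh(y * minkowski(w, x) / c) for x, y in S)

# Toy usage on L^1 (a subset of R^2): lift scalars onto the hyperboloid via t -> (cosh t, sinh t).
lift = lambda t: np.array([np.cosh(t), np.sinh(t)])
S = [(lift(0.5), 1.0), (lift(1.2), 1.0), (lift(-0.7), -1.0)]
w = np.array([0.0, -1.0])                    # w * w = -1 < 0, so H_w intersects L^1
print(decide(w, lift(0.5)), margin(w, S))    # -> 1.0 and roughly 0.5
```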
3 Hyperbolic linear separator learning

The first step towards the goal of learning a large-margin classifier is to establish that we can provably learn some separator. To this end, we present a hyperbolic version of the classic perceptron algorithm and establish that it will converge on data that is separable with a margin.

3.1 The hyperbolic perceptron algorithm

The hyperbolic perceptron (cf. Alg. 1) learns a binary classifier $w$ with respect to the Minkowski product. This is implemented in the update rule $v_t \leftarrow w_t + y x$, similar to the Euclidean case. In contrast to the Euclidean case, the hyperbolic perceptron requires an additional normalization step $w_{t+1} \leftarrow v_t / \sqrt{-v_t * v_t}$. This ensures that $w_{t+1} * w_{t+1} < 0$; as a result $w_{t+1}$ defines a meaningful classifier, i.e., $\mathbb{L}^d \cap H_{w_{t+1}} \neq \emptyset$. For more details, see Appendix B. While intuitive, it remains to establish that this algorithm converges, i.e., finds a solution which correctly classifies all the training samples. To this end, consider the following notion of hyperbolic linear separability with a margin: for $X, X' \subseteq \mathbb{L}^d$, we say that $X$ and $X'$ are linearly separable with (hyperbolic) margin $\gamma_H$ if there exists a $w \in \mathbb{R}^{d+1}$ with $w * w = -1$ such that $w * x > \sinh(\gamma_H)$ for all $x \in X$ and $w * x' < -\sinh(\gamma_H)$ for all $x' \in X'$. Assuming our training set is separable with a margin, the hyperbolic perceptron has the following convergence guarantee:

Theorem 3.1. Assume that there is some $\bar{w} \in \mathbb{R}^{d+1}$ with $\bar{w} * \bar{w} = -1$ and $w_0 * \bar{w} \geq 0$, and some $\gamma_H > 0$, such that $y_j (\bar{w} * x_j) \geq \sinh(\gamma_H)$ for $j = 1, \ldots, |S|$. Then, Alg. 1 converges in $O(1/\sinh(\gamma_H))$ steps and returns a solution with margin $\gamma_H$.

Algorithm 1 Hyperbolic perceptron
1: Initialize $w_0 \in \mathbb{R}^{d+1}$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   for $j = 1, \ldots, n$ do
4:     if $\operatorname{sgn}(x_j * w_t) \neq y_j$ then
5:       $v_t \leftarrow w_t + y_j x_j$
6:       $w_{t+1} \leftarrow v_t / \min\{1, \sqrt{-v_t * v_t}\}$
7:       break
8:     end if
9:   end for
10: end for
11: Output: $w_T$

Algorithm 2 Adversarial Training
1: Initialize $w_0 = 0$, $S' \leftarrow \emptyset$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   $S_t \subseteq S$ sampled i.i.d. with $|S_t| = m$; $S'_t \leftarrow \emptyset$.
4:   for $i = 1, \ldots, m$ do
5:     $\tilde{x}_i \leftarrow \operatorname{argmax}_{d_L(x_i, z) \leq \alpha} \ell(z, y_i; w_t)$
6:   end for
7:   $S'_t \leftarrow \{(\tilde{x}_i, y_i)\}_{i=1}^{m}$
8:   $S' \leftarrow S' \cup S'_t$
9:   $w_{t+1} \leftarrow A(w_t, S, S')$
10: end for
11: Output: $w_T$

The proof of Thm. 3.1 (provided in Appendix B) follows the standard proof of the Euclidean perceptron and utilizes the Cauchy-Schwarz inequality for the Minkowski product. To verify that Algorithm 1 always converges to a valid hyperplane, i.e., such that $\mathbb{L}^d \cap H_{v_t} \neq \emptyset$, which happens iff $v_t * v_t < 0$, consider the following argument:

$$v_t * v_t = (w_t + y_j x_j) * (w_t + y_j x_j) = \underbrace{w_t * w_t}_{(i)} + \underbrace{2\, y_j (x_j * w_t)}_{(ii)} + \underbrace{y_j^2}_{=1}\,\underbrace{(x_j * x_j)}_{(iii)}.$$

Here, (i) is a consequence of the normalization step in Algorithm 1, and (iii) follows as $x_j * x_j = 1$, since $x_j \in \mathbb{L}^d$. As for (ii), note that we perform the update $v_t \leftarrow w_t + y_j x_j$ only when $y_j \neq \operatorname{sgn}(x_j * w_t)$ (cf. Algorithm 1).

Remark 3.2. Note that the perceptron algorithm in Euclidean space exhibits a $O(1/\gamma_E^2)$ convergence rate [24], where $\gamma_E$ denotes the Euclidean margin. When $\gamma_E \to 0$, the $1/\sinh(\gamma_H)$ convergence rate for hyperbolic spaces can be significantly faster than $1/\gamma_E^2$, indicating that exploiting the structure of hyperbolic space can be beneficial.
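The following NumPy sketch mirrors Algorithm 1 as reconstructed above. It is illustrative only: the initialization, the mistake-driven stopping rule, and the function names are our choices, not the authors' implementation.

```python
import numpy as np

def minkowski(x, xp):                       # Minkowski product (Sec. 2.1)
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def hyperbolic_perceptron(S, d, T=1000):
    """Sketch of Alg. 1: perceptron in the Lorentz model L^d.

    S is a list of (x, y) with x in L^d (vectors in R^{d+1}) and y in {-1, +1}.
    """
    w = np.zeros(d + 1)
    w[1] = 1.0                              # arbitrary start with w * w = -1 < 0
    for _ in range(T):
        updated = False
        for x, y in S:
            if np.sign(minkowski(x, w)) != y:        # misclassified point
                v = w + y * x                        # v_t <- w_t + y_j x_j
                # normalization keeps the iterate a valid separator (w * w < 0);
                # the derivation above shows -v * v > 0 for margin-separated data
                w = v / min(1.0, np.sqrt(-minkowski(v, v)))
                updated = True
                break
        if not updated:                              # no mistakes: S is separated
            return w
    return w
```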
3.2 The challenge of large-margin learning

Thm. 3.1 establishes that the hyperbolic perceptron converges to some linear separator. However, for the purposes of generalization, one would ideally like to converge to a large-margin separator. As with the classic Euclidean perceptron, no such guarantee is possible for the hyperbolic perceptron; this motivates us to ask whether a suitable modification can rectify this. Drawing inspiration from the Euclidean setting, a natural way to proceed is to consider the use of margin losses, such as the logistic or hinge loss. Formally, let $\ell : X \times \{\pm 1\} \to \mathbb{R}_+$ be a loss function:

$$\ell(x, y; w) = f(y\,(w * x)), \qquad (3.1)$$

where $f : \mathbb{R} \to \mathbb{R}_+$ is some convex, non-increasing function, e.g., the hinge loss. The empirical risk of the classifier parameterized by $w$ on the training set $S \subseteq X \times \{\pm 1\}$ is $L(w; S) = \frac{1}{|S|} \sum_{(x,y) \in S} \ell(x, y; w)$. Commonly, we learn a classifier by minimizing $L(w; S)$ via gradient descent with iterates

$$w_{t+1} \leftarrow w_t - \frac{\eta}{|S|} \sum_{(x,y) \in S} \nabla \ell(x, y; w_t), \qquad (3.2)$$

where $\eta > 0$ denotes the learning rate. Unfortunately, while this will yield a large-margin solution, the following result demonstrates that the number of iterations required may be prohibitively large.

Theorem 3.3. Let $e_i \in \mathbb{R}^{d+1}$ be the $i$-th standard basis vector. Consider the training set $S = \{(e_1, 1), (-e_1, -1)\}$ and the initialization $w_0 = e_2$. Suppose $\{w_t\}_{t \geq 0}$ is a sequence of iterates in (3.2). Then, the number of iterations needed to achieve margin $\gamma_H$ is $\Omega(\exp(\gamma_H))$.

While this result is disheartening, fortunately, we now present a simple resolution: by suitably adding adversarial examples, gradient descent converges to a large-margin solution in polynomial time.

4 Hyperbolic large-margin separator learning via adversarial examples

Thm. 3.3 reveals that gradient descent on a margin loss is insufficient to efficiently obtain a large-margin classifier. Adapting the approach proposed in [5] for the Euclidean setting, we show how to alleviate this problem by enriching the training set with adversarial examples before updating the classifier (cf. Alg. 2).

| Space | Perceptron | Adversarial margin | Adversarial ERM | Adversarial GD |
| --- | --- | --- | --- | --- |
| Euclidean (prior work) | $O\!\left(\frac{1}{\gamma_E^2}\right)$ | $\gamma_E - \alpha$ | $\Omega(\exp(d))$ | $\Omega\!\left(\operatorname{poly}\!\left(\frac{1}{\gamma_E - \alpha}\right)\right)$ |
| Hyperbolic (this paper) | $O\!\left(\frac{1}{\sinh(\gamma_H)}\right)$ | $\frac{\gamma_H}{\cosh(\alpha)}$ | $\Omega(\exp(d))$ | $\Omega\!\left(\operatorname{poly}\!\left(\frac{\cosh(\alpha)}{\sinh(\gamma_H)}\right)\right)$ |

Table 1: Comparison between Euclidean and hyperbolic spaces for the Perceptron (cf. Alg. 1) and adversarial training (cf. Alg. 2). Recall that $\gamma_{E/H}$, $\alpha$, and $d$ denote the (Euclidean/hyperbolic) margin of the training data, the adversarial perturbation budget, and the underlying dimension, respectively.

In particular, we minimize a robust loss

$$\min_{w \in \mathbb{R}^{d+1}} L_{\mathrm{rob}}(w; S) := \frac{1}{|S|} \sum_{(x,y) \in S} \ell_{\mathrm{rob}}(x, y; w), \qquad (4.1)$$

$$\ell_{\mathrm{rob}}(x, y; w) := \max_{z \in \mathbb{L}^d :\, d_L(x, z) \leq \alpha} \ell(z, y; w). \qquad (4.2)$$

The problem has a minimax structure, where the outer optimization minimizes the training error over $S$. The inner optimization generates an adversarial example by perturbing a given input feature $x$ on the hyperbolic manifold. Note that the magnitude of the perturbation added to the original example is bounded by $\alpha$, which we refer to as the adversarial budget. In particular, we want to construct a perturbation that maximizes the loss $\ell$, i.e., $\tilde{x} \in \operatorname{argmax}_{d_L(x, z) \leq \alpha} \ell(z, y; w)$. In this paper, we restrict ourselves to only those adversarial examples that lead to misclassification with respect to the current classifier, i.e., $h_w(x) \neq h_w(\tilde{x})$ (cf. Remark 4.2). The adversarial example $\tilde{x}$ can be generated efficiently by reducing the problem to a (Euclidean) linear program with a spherical constraint:

$$\text{(CERT)} \qquad \max_{z \in \mathbb{R}^{d+1}} \; -w_0 z_0 + \sum_{i=1}^{d} w_i z_i \quad \text{s.t.} \quad -\sum_{i=1}^{d} x_i z_i \leq \cosh(\alpha) - x_0 z_0, \quad \|z_{\setminus 0}\|^2 = z_0^2 - 1. \qquad (4.3)$$

Importantly, as detailed in Appendix C.2 and summarized next, (CERT) can be solved in closed form.

Theorem 4.1. Given the input example $(x, y)$, let $x_{\setminus 0} = (x_1, \ldots, x_d)$. We can efficiently compute a solution to (CERT) or decide that no solution exists. If a solution exists, then, based on a guess of $z_0$, the solution has the form $\tilde{x} = \left(z_0, \; \sqrt{z_0^2 - 1}\left(b_\alpha \check{x} + \sqrt{1 - b_\alpha^2}\, \check{x}^{\perp}\right)\right)$. Here, $b_\alpha$ depends on the adversarial budget $\alpha$, and $\check{x}^{\perp}$ is a unit vector orthogonal to $\check{x} = x_{\setminus 0}/\|x_{\setminus 0}\|$ along $w$.

Remark 4.2. Note that, according to Thm. 4.1, it is possible that, for a particular guess of $z_0$, we may not be able to find an adversarial example $\tilde{x}$ that leads to a prediction that is inconsistent with $x$, i.e., $h_w(x) \neq h_w(\tilde{x})$. Thus, for some $t$, we may have $|S'_t| < m$ in Alg. 2.
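The paper solves the inner maximization in (4.2) exactly via the closed form of Thm. 4.1 (detailed in Appendix C.2). As a rough, hedged stand-in that only illustrates the structure of the inner problem and the budget constraint $d_L(x, z) \leq \alpha$, the following NumPy sketch searches over random geodesic perturbations of length $\alpha$; the random-search heuristic and all names are ours, not the paper's method.

```python
import numpy as np

def minkowski(x, xp):                       # Minkowski product (Sec. 2.1)
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def random_geodesic_step(x, alpha, rng):
    """Return z on L^d with d_L(x, z) = alpha, in a random tangent direction.

    Uses the exponential map of the Lorentz model: z = cosh(a) x + sinh(a) v
    for a unit tangent vector v at x (i.e., x * v = 0 and v * v = -1).
    """
    u = rng.standard_normal(x.shape)
    v = u - minkowski(x, u) * x             # Minkowski projection onto the tangent space at x
    v = v / np.sqrt(-minkowski(v, v))       # unit length w.r.t. the Minkowski norm
    return np.cosh(alpha) * x + np.sinh(alpha) * v

def adversarial_example(x, y, w, alpha, n_candidates=256, seed=0):
    """Approximate argmax_{d_L(x,z) <= alpha} l(z, y; w) by random search.

    Since l(z, y; w) = f(y (w * z)) with f non-increasing (3.1), maximizing the
    loss amounts to minimizing y (w * z). Mirroring Remark 4.2, only candidates
    that flip the prediction are kept; None is returned otherwise.
    """
    rng = np.random.default_rng(seed)
    candidates = [random_geodesic_step(x, alpha, rng) for _ in range(n_candidates)]
    flipped = [z for z in candidates
               if np.sign(minkowski(w, z)) != np.sign(minkowski(w, x))]
    if not flipped:
        return None                         # this is why |S'_t| < m can occur
    return min(flipped, key=lambda z: y * minkowski(w, z))
```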
The minimization with respect to $w$ in (4.1) can be performed by an iterative optimization procedure, which generates a sequence of classifiers $\{w_t\}$. We update the classifier $w_t$ according to an update rule $A$, which accepts as input the current estimate of the weight vector, the original training set, and an adversarial perturbation of the training set. The update rule produces as output a weight vector which approximately minimizes the robust loss $L_{\mathrm{rob}}$ in (4.1). We now establish that, for a gradient-based update rule, the above adversarial training procedure will efficiently converge to a large-margin solution. Table 1 summarizes the results of this section and compares them with the corresponding results in the Euclidean setting [5].

4.1 Fast convergence via gradient-based update

Consider Alg. 2 with $A(w_t, S, S'_t)$ being a gradient-based update with learning rate $\eta_t > 0$:

$$w_{t+1} \leftarrow w_t - \frac{\eta_t}{|S'_t|} \sum_{(\tilde{x}, y) \in S'_t} \nabla_{w_t} \ell(\tilde{x}, y; w_t); \qquad w_{t+1} \leftarrow w_{t+1} / \sqrt{-w_{t+1} * w_{t+1}}, \qquad (4.4)$$

where the normalization is performed to ensure that the update remains valid, i.e., $\mathbb{L}^d \cap H_{w_{t+1}} \neq \emptyset$. To compute the update, we need to compute gradients of the outer minimization problem, i.e., $\nabla_w \ell_{\mathrm{rob}}$ over $S'_t$ (cf. (4.1)). However, this function is itself a maximization problem. We therefore compute the gradient at the maximizer of this inner maximization problem. Danskin's Theorem [9, 1] ensures that this gives a valid descent direction. Given the closed-form expression for the adversarial example $\tilde{x}$ as per Thm. 4.1, the gradient of the loss is

$$\nabla_w \ell(\tilde{x}, y; w) = f'(y(w * \tilde{x}))\, \nabla_w\, y(w * \tilde{x}) = f'(y(w * \tilde{x}))\, y\, \hat{\tilde{x}}^{\mathsf{T}}, \quad \text{where } \hat{\tilde{x}} = (\tilde{x}_0, -\tilde{x}_1, \ldots, -\tilde{x}_d).$$

By Danskin's theorem, $\nabla_w \ell(\tilde{x}, y; w)$ is a valid (sub)gradient of $\ell_{\mathrm{rob}}(x, y; w)$, so we can compute the descent direction and perform the step in (4.4). We defer details to Appendix C.4.

4.1.1 Convergence analysis

We now establish that the above gradient-based update converges to a large-margin solution in polynomial time. For this analysis, we need the following assumptions:

Assumption 1.
1. The training set $S$ is linearly separable with margin at least $\gamma_H$, i.e., there exists a $\bar{w} \in \mathbb{R}^{d+1}$ such that $y(\bar{w} * x) \geq \sinh(\gamma_H)$ for all $(x, y) \in S$.
2. There exist constants $R_x, R_w \geq 0$ such that (i) $\|x\| \leq R_x$; (ii) all possible adversarial perturbations remain within this constraint, i.e., $\|\tilde{x}\| \leq R_x$; and (iii) $\|w\| \leq R_w$. Let $R_\alpha := R_x R_w$.
3. The function $f(s)$ underlying the loss (cf. (3.1)) has the following properties: (i) $f(s) > 0$ for all $s$; (ii) $f'(s) < 0$ for all $s$; (iii) $f$ is differentiable; and (iv) $f$ is $\beta$-smooth.

In the rest of the section, we work with the following hyperbolic equivalent of the logistic regression loss that fulfills Assumption 1:

$$\ell(x, y; w) = \ln\left(1 + \exp\left(-\operatorname{asinh}\left(\frac{y(w * x)}{2 R_\alpha}\right)\right)\right), \qquad (4.5)$$

where $R_\alpha$ is as defined in Assumption 1. Other loss functions as well as the derivation of the hyperbolic logistic regression loss are discussed in Appendix C.1. We first show that Alg. 2 with a gradient update is guaranteed to converge to a large-margin classifier.

Theorem 4.3. With constant step size and $A$ being the GD update with an initialization $w_0$ satisfying $w_0 * w_0 < 0$, $\lim_{t \to \infty} L(w_t; S \cup S'_t) = 0$.

The proof can be found in Appendix C.4.
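Putting (4.4) and (4.5) together, a hedged sketch of the full training loop is shown below. It reuses the `minkowski` helper and the random-search `adversarial_example` stand-in from the earlier sketch (the paper instead uses the closed-form CERT solution); the initialization, step size, and remaining details are our choices for illustration only.

```python
import numpy as np

def minkowski(x, xp):                       # Minkowski product (Sec. 2.1)
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def grad_hyperbolic_logistic(x, y, w, R_alpha):
    """Gradient in w of l(x, y; w) = ln(1 + exp(-asinh(y (w * x) / (2 R_alpha))))."""
    u = y * minkowski(w, x) / (2.0 * R_alpha)
    # chain rule: d/du ln(1 + exp(-asinh(u))) = -sigmoid(-asinh(u)) / sqrt(1 + u^2)
    dldu = -1.0 / (1.0 + np.exp(np.arcsinh(u))) / np.sqrt(1.0 + u * u)
    grad_wx = np.concatenate(([x[0]], -x[1:]))       # gradient of w * x with respect to w
    return dldu * y * grad_wx / (2.0 * R_alpha)

def adversarial_gd(S, d, alpha, R_alpha, eta=0.01, T=200):
    """Sketch of Alg. 2 with the gradient update (4.4).

    For simplicity, the full training set is perturbed in every round (no
    subsample S_t), and the adversary is the random-search stand-in above.
    """
    w = np.zeros(d + 1)
    w[1] = 1.0                              # w * w = -1 < 0
    for _ in range(T):
        S_adv = []
        for x, y in S:
            z = adversarial_example(x, y, w, alpha)  # inner maximization (4.2)
            if z is not None:
                S_adv.append((z, y))
        if not S_adv:
            S_adv = list(S)                 # fall back to clean data if nothing flips
        g = sum(grad_hyperbolic_logistic(x, y, w, R_alpha) for x, y in S_adv) / len(S_adv)
        w = w - eta * g                     # gradient step of (4.4)
        w = w / np.sqrt(-minkowski(w, w))   # renormalize (assumes w * w stays < 0)
    return w
```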
While Thm. 4.3 guarantees convergence, it does not guarantee efficiency (e.g., by showing a polynomial convergence rate). The following result shows that Alg. 2 with a gradient-based update obtains a max-margin classifier in polynomial time.

Theorem 4.4. For a fixed constant $c \in (0, 1)$, let $\eta_t = \eta := \frac{2c \sinh^2(\gamma_H)}{\beta\, \sigma_{\max}^2 \cosh^2(\alpha)\, R_\alpha^2}$, with $\sigma_{\max}$ denoting an upper bound on the maximum singular value of the data matrix $\sum_{x \in S} x x^{\mathsf{T}}$, and $A$ the GD update as defined in (4.4). Then, Alg. 2 achieves the margin $\gamma_H / \cosh(\alpha)$ in $\Omega\left(\operatorname{poly}\left(\cosh(\alpha)/\sinh(\gamma_H)\right)\right)$ steps.

Below, we briefly sketch some of the proof ideas, but defer a detailed exposition to Appendix C.4 (cf. Thm. C.13 and C.14). To prove the gradient-based convergence result, we first analyze the convergence of an adversarial perceptron that resembles the adversarial GD in that it performs updates of the form $w_{t+1} \leftarrow w_t + y \tilde{x}$. We then extend the analysis to the adversarial GD, where the convergence analysis builds on classical ideas from convex optimization. The following auxiliary lemma relates the adversarial margin to the max-margin classifier.

Lemma 4.5. Let $w^*$ be the max-margin classifier of $S$ with margin $\gamma_H$. At each iteration of Algorithm 2, $w^*$ linearly separates $S \cup S'$ with margin at least $\gamma_H / \cosh(\alpha)$.

The result follows from geometric arguments, as discussed in Section C.3. With the help of this lemma, we can show the following bound on the sample complexity of the adversarial perceptron:

Theorem 4.6. Assume that there is some $\bar{w} \in \mathbb{R}^{d+1}$ with $\bar{w} * \bar{w} = -1$ and $w_0 * \bar{w} \geq 0$, and some $\gamma_H > 0$, such that $y_j(\bar{w} * x_j) \geq \sinh(\gamma_H)$ for $j = 1, \ldots, |S|$. Then, the adversarial perceptron (with adversarial budget $\alpha$) converges after $O\left(\cosh(\alpha)/\sinh(\gamma_H)\right)$ steps, at which point it has a margin of at least $\gamma_H / \cosh(\alpha)$.

The technical proof can be found in Section C.3.

4.2 On the necessity of combining gradient descent and adversarial training

We remark here that the enrichment of the training set with adversarial examples is critical for the polynomial-time convergence. Recall first that, by Thm. 3.3, without adversarial training we can construct a simple max-margin problem that cannot be solved in polynomial time. Interestingly, merely using adversarial examples by themselves does not suffice for fast convergence either. Consider Alg. 2 with an ERM as the update rule $A(w_t, S, S')$. In this case, the iterate $w_{t+1}$ corresponds to an ERM solution for $S \cup S'$, i.e.,

$$w_{t+1} \in \operatorname{argmin}_{w} \sum_{(x,y) \in S \cup S'} \ell(x, y; w). \qquad (4.6)$$

Let $S_t = S$, i.e., we utilize the full power of the adversarial training in each step. The following result reveals that, even under this optimistic setting, Alg. 2 may not converge to a solution with a non-trivial margin in polynomial time:

Theorem 4.7. Suppose Alg. 2 (with an ERM update) outputs a linear separator of $S \cup S'$. In the worst case, the number of iterations required to achieve a margin of at least $\epsilon$ is $\Omega(\exp(d))$.

We note that a similar result in the Euclidean setting appears in [5]. We establish Thm. 4.7 by extending the proof strategy of [5, Thm. 4] to hyperbolic spaces. In particular, given a spherical code in $\mathbb{R}^d$ with $T$ codewords and minimum separation $\theta \geq \sinh(\epsilon)\cosh(\alpha)$, we construct a training set $S$ and subsequently the adversarial examples $\{S'_t\}$ such that there exists a sequence of ERM solutions $\{w_t\}_{t \leq T}$ on $S \cup S'$ (cf. (4.6)) that has margin less than $\epsilon$. Now the result in Thm. 4.7 follows by utilizing a lower bound [7] on the size of the spherical code with $T = \Omega(\exp(d))$ codewords and minimum separation $\theta \geq \sinh(\epsilon)\cosh(\alpha)$. The proof of Thm. 4.7 is in Appendix C.5.
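As a purely illustrative aside, the snippet below plugs some arbitrary example values into the scalings of Table 1, Remark 3.2, Lemma 4.5, and Theorem 4.6; the constants hidden in the $O(\cdot)$ and $\Omega(\cdot)$ notation are ignored, so the numbers only indicate relative scaling.

```python
import numpy as np

# Arbitrary example values; constants hidden in O(.) / Omega(.) are ignored.
gamma_E, gamma_H, alpha = 0.05, 0.05, 0.5

print("Euclidean perceptron,   O(1/gamma_E^2)          :", 1.0 / gamma_E**2)
print("Hyperbolic perceptron,  O(1/sinh(gamma_H))      :", 1.0 / np.sinh(gamma_H))
print("Adversarial perceptron, O(cosh(a)/sinh(gamma_H)):", np.cosh(alpha) / np.sinh(gamma_H))
print("Adversarial margin,     gamma_H/cosh(a)         :", gamma_H / np.cosh(alpha))
```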
5 Dimension-distortion trade-off

So far, we have focused on classifying data that is given in either Euclidean space $\mathbb{R}^d$ or Lorentz space $\mathbb{L}^d$. Now, consider data $(X, d_X)$ with similarity metric $d_X$ that was embedded into the respective spaces. We assume access to maps $\phi_E : X \to \mathbb{R}^d$ and $\phi_H : X \to \mathbb{L}^d_+$ that embed $X$ into the Euclidean space $\mathbb{R}^d$ and the upper sheet of the Lorentz space $\mathbb{L}^d_+$, respectively (cf. Remark A.1). Let $c_E$ and $c_H$ denote the multiplicative distortion induced by $\phi_E$ and $\phi_H$, respectively (cf. Section 2.2). Upper bounds on $c_E$ and $c_H$ can be estimated based on the structure of $X$ and the embedding dimensions.

Figure 2: Margin as distance between support vectors. Left: Euclidean. Right: Hyperbolic.

In this section, we address the natural question: How does the distortion $c_E$, $c_H$ impact our guarantees on the margin? In the previous sections, we noticed that some of the guarantees scale with the dimension of the embedding space. Therefore, we want to analyze the trade-off between the higher distortion resulting from working with smaller embedding dimensions and the higher cost of training robust models due to working with larger embedding dimensions. We often encounter data sets in ML applications that are intrinsically hierarchical. Theoretical results on the embeddability of trees (cf. Section 2.2) suggest that hyperbolic spaces are especially suitable to represent hierarchical data. We therefore restrict our analysis to such data. Further, we make the following assumptions on the underlying data $X$ and the embedding maps, respectively.

Assumption 2. (1) Both $\phi_H(X)$ and $\phi_E(X)$ are linearly separable in the respective spaces, and (2) $X$ is hierarchical, i.e., has a partial order relation.

Assumption 3. The maps $\phi_H, \phi_E$ preserve the partial order relation in $X$, and the root is mapped onto the origin of the embedding space.

Towards understanding the impact of the distortion of the embedding maps $\phi_H$ and $\phi_E$ on the margin, we relate the distance between the support vectors to the size of the margin. The distortion of these distances via embedding then gives us the desired bounds on the margin.

Figure 3: Performance of Adversarial GD. Left: Loss $L(w)$ on the original data. Middle: $\alpha$-robust loss $L_\alpha(w)$. Right: Hyperbolic margin $\gamma_H$. We vary the adversarial budget $\alpha$ over $\{0, 0.25, 0.5, 0.75, 1.0\}$. Note that $\alpha = 0$ corresponds to the state of the art [6].

5.1 Euclidean case

In the Euclidean case, we relate the distance of the support vectors to the size of the margin via triangle relations. Let $x, y \in \mathbb{R}^d$ denote support vectors such that $\langle x, w \rangle > 0$, $\langle y, w \rangle < 0$, and $\operatorname{margin}(w) = \epsilon$. Note that we can rotate the decision boundary such that the support vectors are not unique. So, without loss of generality, assume that $x_1, x_2$ are equidistant from the decision boundary and $\|w\| = 1$ (cf. Fig. 2 (left)). In this setting, we show the following relation between the margin with and without the influence of distortion:

Theorem 5.1. Let $\epsilon'$ and $\epsilon$ denote the margin with and without distortion, respectively. If $X$ is a tree embedded into $\mathbb{R}^{O(\log^2 |X|)}$, then $\epsilon' = O\left(\epsilon / \log^3 |X|\right)$.

The proof of Thm. 5.1 follows from simple side length-altitude relations in the Euclidean triangle between support vectors (cf. Fig. 2 (left)) and a simple application of Bourgain's result on embedding trees into $\mathbb{R}^d$. For more details, see Appendix D.1.

5.2 Hyperbolic case

As in the Euclidean case, we want to relate the margin to the pairwise distances of the support vectors.
Such a relation can be constructed both in the original and in the embedding space, which allows us to study the influence of distortion on the margin in terms of $c_H$. In the following, we will work with the half-space model $\mathbb{P}^d$ (cf. Appendix A.1). However, since the theoretical guarantees in the rest of the paper consider the Lorentz model $\mathbb{L}^d_+$, we have to map between the two spaces. We show in Appendix D.2 that such a mapping exists and preserves the Minkowski product, following [6]. The hyperbolic embedding $\phi_H$ has two sources of distortion: (1) the multiplicative distortion of pairwise distances, measured by the factor $1/c_H$; and (2) the distortion of order relations, in most embedding models captured by the alignment of ranks with the Euclidean norm. Under Assumption 3, order relationships are preserved and the root is mapped to the origin. Therefore, for $x \in X$, the distortion of the Euclidean norms is given as $\|\phi_H(x)\| = d_E(\phi_H(x), \phi_H(0)) = d_X(x, 0)/c_H$, i.e., the distortion of both pairwise distances and norms is given by a factor $1/c_H$. In $\mathbb{P}^d$, the decision hyperplane corresponds to a hypercircle $K_w$. We express its radius $r_w$ in terms of the hyperbolic distance between a point on the decision boundary and one of the hypercircle's ideal points [6]. The support vectors $x, y$ lie on hypercircles $K_x$ and $K_y$, which correspond to the set of points at hyperbolic distance $\epsilon$ (i.e., the margin) from the decision boundary. We again assume, without loss of generality, that at least one support vector is not unique and let $x_1, x_2 \in K_x$ and $y \in K_y$ (cf. Fig. 2 (right)). We now show that the distortion introduced by $\phi_H$ has a negligible effect on the margin.

Theorem 5.2. Let $\epsilon'$ and $\epsilon$ denote the margin with and without distortion, respectively. If $X$ is a tree embedded into $\mathbb{L}^2_+$, then $\epsilon' \approx \epsilon$.

The technical proof relies on a construction that reduces the problem to Euclidean geometry via circle inversion on the decision hypercircle. We defer all details to Appendix D.2.
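To make the dimension-distortion comparison concrete, here is a back-of-the-envelope sketch that plugs a tree of $N$ nodes into the scalings of Section 2.2 and Theorems 5.1 and 5.2; all constants in the $O(\cdot)$ bounds (and the base of the logarithm) are dropped, so the output is only indicative, and the example values are ours.

```python
import numpy as np

N = 10_000        # number of nodes in the hierarchical data set (arbitrary example)
eps = 1.0         # margin of the data before embedding (arbitrary example)

# Euclidean embedding of a tree (Bourgain / Linial et al., Sec. 2.2; Thm. 5.1)
dim_E = np.log(N) ** 2            # ~ O(log^2 N) embedding dimensions
distortion_E = np.log(N)          # c_M = O(log N)
margin_E = eps / np.log(N) ** 3   # eps' = O(eps / log^3 N)

# Hyperbolic embedding of a tree (Sarkar, Sec. 2.2; Thm. 5.2)
dim_H = 2                         # d = 2 already suffices
distortion_H = 1.0                # c_M = O(1 + eps') for arbitrarily small eps' (quasi-isometric)
margin_H = eps                    # the margin is essentially preserved

print(f"Euclidean : dim ~ {dim_E:.0f}, distortion ~ {distortion_E:.1f}, margin ~ {margin_E:.4f}")
print(f"Hyperbolic: dim ~ {dim_H}, distortion ~ {distortion_H:.1f}, margin ~ {margin_H:.1f}")
```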
6 Experiments

We now present empirical studies for hyperbolic linear separator learning to corroborate our theory. In particular, we evaluate our proposed Adversarial GD algorithm (Section 4) on data that is linearly separable in hyperbolic space and compare with the state of the art [6]. Furthermore, we analyze dimension-distortion trade-offs (Section 5). As in our theory, we train hyperbolic linear classifiers $w$ whose prediction on $x$ is $\hat{y} = \operatorname{sgn}(w * x)$. Additional experimental results can be found in Appendix F. We emphasize that our focus in this work is on theoretically understanding the benefits of hyperbolic spaces for classification. The above experiments serve to illustrate our theoretical results, which as a starting point were derived for linear models and separable data. While extensions to non-separable data and non-linear models are of practical interest, a detailed study is left for future work.

Data. We use the ImageNet ILSVRC 2012 dataset [26] along with its label hierarchy from WordNet. Hyperbolic embeddings in Lorentz space are obtained for the internal label nodes and leaf image nodes using Sarkar's construction [28]. For the first experiment, we verify the effectiveness of adversarial learning by picking two classes (n09246464 and n07831146), which allows for a data set that is linearly separable in hyperbolic space. In this set, there were 1,648 positive and 1,287 negative examples. For the second experiment, to showcase the better representational power of hyperbolic spaces for hierarchical data, we pick two disjoint subtrees (n00021939 and n00015388) from the hierarchy.

Adversarial GD. In the following, we utilize the hyperbolic hinge loss (C.2) (see Appendix F for other loss functions). To verify the effectiveness of adversarial training, we compute three quantities: (i) the loss on the original data, $L(w)$; (ii) the $\alpha$-robust loss, $L_\alpha(w)$; and (iii) the hyperbolic margin $\gamma_H$. We vary the budget $\alpha$ over $\{0, 0.25, 0.5, 0.75, 1.0\}$, where $\alpha = 0$ corresponds to the setup in [6]. For a given budget, we obtain adversarial examples by solving the CERT problem (4.3), which is feasible for $z_0 \in \left(x_0 \cosh(\alpha) - \Delta,\; x_0 \cosh(\alpha) + \Delta\right)$, where $\Delta = \sqrt{(x_0^2 - 1)(\cosh^2(\alpha) - 1)}$. We do a binary search in this range for $z_0$, solve CERT, and check whether we can obtain an adversarial example. We utilize the adversarial examples we find, and ignore other data points. In all experiments, we use a constant step size $\eta_t = 0.01$ for all $t$. The results are shown in Fig. 3. As $\alpha$ increases, the problem becomes harder to solve (higher robust training loss), but we achieve a better margin. Notably, we observe strong performance gains over the training procedure without adversarial examples [6].

Dimensional efficiency. In this experiment, we illustrate the benefit of using hyperbolic space when the underlying data is truly hierarchical. To be more favourable to the Euclidean setting, we subsample images from each subtree, such that in total we have 1000 vectors. We obtain Euclidean embeddings following the setup and code from Nickel and Kiela [22]. The Euclidean embeddings in 16 dimensions reach a mean rank (MR) of approximately 2, which indicates reasonable quality in preserving distances to few-hop neighbors. We observe superior classification performance at much lower dimensions by leveraging hyperbolic space (see Table 2 in Appendix F.3). In particular, our hyperbolic classifier achieves zero test error on 8-dimensional embeddings, whereas Euclidean logistic regression struggles even with 16-dimensional embeddings. This is consistent with our theoretical results (Section 5): due to high distortion, lower-dimensional Euclidean embeddings struggle to capture the global structure among the data points that makes the data points easily separable.
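The feasibility range for $z_0$ quoted above is easy to compute; the small helper below does only that. Only the interval formula is taken from the text; the function name, the example values, and the grid of candidates are ours (the experiments use a binary search over this interval instead).

```python
import numpy as np

def z0_feasible_interval(x0, alpha):
    """Open interval of feasible z_0 for CERT (4.3), as stated in Section 6:
    (x0 cosh(a) - D, x0 cosh(a) + D) with D = sqrt((x0^2 - 1)(cosh^2(a) - 1))."""
    delta = np.sqrt((x0**2 - 1.0) * (np.cosh(alpha)**2 - 1.0))
    return x0 * np.cosh(alpha) - delta, x0 * np.cosh(alpha) + delta

# Example with arbitrary values: x_0 = 1.5 (a point on the upper sheet), budget alpha = 0.5.
lo, hi = z0_feasible_interval(1.5, 0.5)
candidates = np.linspace(lo, hi, num=11)[1:-1]   # stay strictly inside the open interval
print(lo, hi)
```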
7 Conclusion and future work

We studied the problem of learning robust classifiers with large margins in hyperbolic space. We introduced and analyzed a hyperbolic perceptron algorithm. Moreover, we explored multiple adversarial approaches to robust large-margin learning. The second part of the paper analyzed the role of geometry in learning robust classifiers. We compared Euclidean and hyperbolic approaches with respect to the intrinsic geometry of the data. For hierarchical data that embeds well into hyperbolic space, the lower embedding dimension ensures superior guarantees when learning the classifier in hyperbolic space. This result suggests that it can be highly beneficial to perform downstream machine learning and optimization tasks in a space that naturally reflects the intrinsic geometry of the data. Promising avenues for future research include (i) exploring the practicality of these results in broader machine learning and data science applications; and (ii) studying other related methods in non-Euclidean spaces, together with an evaluation of dimension-distortion trade-offs.

Acknowledgements

We thank Pranjal Awasthi for helpful discussions on computing adversarial examples and Eli Chien for helpful comments on an earlier version of the paper.

Broader Impact

The paper proposes novel algorithms for large-margin learning in hyperbolic space. The paper's scope is theoretical and does not discuss specific applications with societal consequences. Therefore, a discussion of the broader impact of this work is not applicable.

References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific Optimization and Computation Series. Athena Scientific, 2016.
[2] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52(1):46-52, Mar 1985.
[3] Martin R. Bridson and André Haefliger. Metric Spaces of Non-Positive Curvature, volume 319 of Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, Berlin, Heidelberg, 1999. ISBN 978-3-642-08399-0, 978-3-662-12494-9. URL http://link.springer.com/10.1007/978-3-662-12494-9.
[4] Benjamin Chamberlain, James Clough, and Marc Deisenroth. Neural embeddings of graphs in hyperbolic space. arXiv:1705.10359 [stat.ML], 2017.
[5] Zachary Charles, Shashank Rajput, Stephen Wright, and Dimitris Papailiopoulos. Convergence and margin of adversarial training on separable data. arXiv:1905.09209, 2019.
[6] Hyunghoon Cho, Benjamin DeMeo, Jian Peng, and Bonnie Berger. Large-margin classification in hyperbolic space. In Proceedings of Machine Learning Research, volume 89, pages 1832-1840. PMLR, Apr 2019.
[7] Henry Cohn and Yufei Zhao. Sphere packing bounds via spherical codes. Duke Mathematical Journal, 163(10):1965-2002, 2014.
[8] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995.
[9] John M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641-664, 1966.
[10] Laurent El Ghaoui, Gert R. G. Lanckriet, and Georges Natsoulis. Robust classification with interval data. Technical Report UCB/CSD-03-1279, EECS Department, University of California, Berkeley, Oct 2003.
[11] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 1186-1195, 2018.
[12] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, Dec 1999.
[13] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5345-5355, 2018.
[14] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv:1803.07300, 2018.
[15] Seung-Jean Kim, Alessandro Magnani, and Stephen Boyd. Robust Fisher discriminant analysis. In Advances in Neural Information Processing Systems, pages 659-666, 2006.
[16] Gert R. G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555-582, March 2003.
[17] Guy Lebanon and John Lafferty. Hyperplane margin classifiers on the multinomial manifold. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML'04, 2004.
[18] Yan Li, Ethan X. Fang, Huan Xu, and Tuo Zhao. Implicit bias of gradient descent based adversarial training on separable data.
In International Conference on Learning Representations (ICLR), 2020.
[19] Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215-245, 1995.
[20] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.
[21] Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, and Amr Ahmed. Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD'19, pages 714-722, 2019.
[22] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6338-6347, 2017.
[23] Maximillian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, 2018.
[24] A. B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615-622, 1962.
[25] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, pages 65-386, 1958.
[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
[27] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4460-4469, 2018.
[28] Rik Sarkar. Low distortion Delaunay embedding of trees in hyperbolic plane. In Graph Drawing, pages 355-366, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-25878-7.
[29] Claude E. Shannon. Probability of error for optimal codes in a Gaussian channel. Bell System Technical Journal, 38(3):611-656, 1959.
[30] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19:1-57, 2018.
[31] Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations (ICLR), 2019.
[32] Melanie Weber. Neighborhood growth determines geometric priors for relational representation learning. In International Conference on Artificial Intelligence and Statistics, volume 108, pages 266-276, 2020.