# Robust large-margin learning in hyperbolic space

Melanie Weber (Princeton University), mw25@math.princeton.edu
Manzil Zaheer (Google Research), manzilzaheer@google.com
Ankit Singh Rawat (Google Research), ankitsrawat@google.com
Aditya Menon (Google Research), adityakmenon@google.com
Sanjiv Kumar (Google Research), sanjivk@google.com

(Work performed while an intern at Google Research, New York. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.)

Recently, there has been a surge of interest in representation learning in hyperbolic spaces, driven by their ability to represent hierarchical data with significantly fewer dimensions than standard Euclidean spaces. However, the viability and benefits of hyperbolic spaces for downstream machine learning tasks have received less attention. In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space. Specifically, we consider the problem of learning a large-margin classifier for data possessing a hierarchical structure. Our first contribution is a hyperbolic perceptron algorithm, which provably converges to a separating hyperplane. We then provide an algorithm to efficiently learn a large-margin hyperplane, relying on the careful injection of adversarial examples. Finally, we prove that for hierarchical data that embeds well into hyperbolic space, the low embedding dimension ensures superior guarantees when learning the classifier directly in hyperbolic space.

1 Introduction

Hyperbolic spaces have received sustained interest in recent years, owing to their ability to compactly represent data possessing hierarchical structure (e.g., trees and graphs). In terms of representation learning, hyperbolic spaces offer a provable advantage over Euclidean spaces for such data: objects requiring an exponential number of dimensions in Euclidean space can be represented in a polynomial number of dimensions in hyperbolic space [28]. This has motivated research into efficiently learning a suitable hyperbolic embedding for large-scale datasets [22, 4, 31].

Despite this impressive representation power, little is known about the benefits of hyperbolic spaces for downstream tasks. For example, suppose we wish to perform classification on data that is intrinsically hierarchical. One may naïvely ignore this structure, and use a standard Euclidean embedding and corresponding classifier (e.g., SVM). However, can we design classification algorithms that exploit the structure of hyperbolic space, and offer provable benefits in terms of performance? This fundamental question has received surprisingly limited attention. While some prior work has proposed specific algorithms for learning classifiers in hyperbolic space [6, 21], these have been primarily empirical in nature, and do not come equipped with theoretical guarantees on convergence and generalization.

In this paper, we take a first step towards addressing this question for the problem of learning a large-margin classifier. We provide a series of algorithms that provably learn such classifiers in hyperbolic space, and establish their superiority over classifiers naïvely learned in Euclidean space. This shows that by using a hyperbolic space that better reflects the intrinsic geometry of the data, one can see gains in both representation size and performance. Specifically, our contributions are:
(i) We provide a hyperbolic version of the classic perceptron algorithm and establish its convergence for data that is separable with a margin (Theorem 3.1).

(ii) We establish how suitable injection of adversarial examples into gradient-based loss minimization yields an algorithm which can efficiently learn a large-margin classifier (Theorem 4.3). We further establish that simply performing gradient descent or using adversarial examples alone does not suffice to efficiently yield such a classifier.

(iii) We compare the Euclidean and hyperbolic approaches for hierarchical data and analyze the trade-off between low embedding dimensions and low distortion (dimension-distortion trade-off) when learning robust classifiers on embedded data. For hierarchical data that embeds well into hyperbolic space, it suffices to use a smaller embedding dimension while ensuring superior guarantees when we learn a classifier in hyperbolic space.

Contribution (i) establishes that it is possible to design classification algorithms that exploit the structure of hyperbolic space, while provably converging to some admissible (not necessarily large-margin) separator. Contribution (ii) establishes that it is further possible to design classification algorithms that provably converge to a large-margin separator, by suitably injecting adversarial examples. Contribution (iii) shows that adapting algorithms to the intrinsic geometry of the data can enable efficient utilization of the embedding space without affecting performance.

Related work. Our results can be seen as hyperbolic analogues of classic results for Euclidean spaces. The large-margin learning problem is well studied in the Euclidean setting. Classic algorithms for learning classifiers include the perceptron [25, 24, 12] and support vector machines [8]. Robust margin learning has been widely studied, notably by Lanckriet et al. [16], El Ghaoui et al. [10], Kim et al. [15] and, recently, via adversarial approaches by Charles et al. [5], Ji and Telgarsky [14], Li et al. [18] and Soudry et al. [30]. Adversarial learning has recently gained interest through efforts to train more robust deep learning systems (see, e.g., [20, 11]). Recently, the representation of (hierarchical) data in hyperbolic space has seen a surge of interest. The literature focuses mostly on learning representations in the Poincaré [22, 4, 31] and Lorentz [23] models of hyperbolic space, as well as on analyzing representation trade-offs in hyperbolic embeddings [27, 32]. The body of work on performing downstream ML tasks in hyperbolic space is much smaller and mostly without theoretical guarantees. Monath et al. [21] study hierarchical clustering in hyperbolic space. Cho et al. [6] introduce a hyperbolic version of support vector machines for binary classification in hyperbolic space, albeit without theoretical guarantees. Ganea et al. [13] introduce a hyperbolic version of neural networks that shows empirical promise on downstream tasks, but likewise without theoretical guarantees. To the best of our knowledge, neither robust large-margin learning nor adversarial learning in hyperbolic spaces has been studied in the literature, including [6]. Furthermore, we are not aware of any theoretical analysis of dimension-distortion trade-offs in the related literature.

2 Background and notation

Figure 1: Lorentz model (geodesics in red).

We begin by reviewing some background material on hyperbolic spaces, as well as embedding into and learning in these spaces.
2.1 Hyperbolic space

Hyperbolic spaces are smooth Riemannian manifolds $M = \mathbb{H}^d$ with constant negative curvature $\kappa$ and are as such locally Euclidean spaces. There are several equivalent models of hyperbolic space, each highlighting a different geometric aspect. In this work, we mostly consider the Lorentz model (aka hyperboloid model), which we briefly introduce below, with more details provided in Appendix A (see [3] for a comprehensive overview). For $x, x' \in \mathbb{R}^{d+1}$, let $x * x' = x_0 x_0' - \sum_{i=1}^{d} x_i x_i'$ denote their Minkowski product. The Lorentz model is defined as $\mathbb{L}^d = \{x \in \mathbb{R}^{d+1} : x * x = 1\}$, with distance measure $d_L(x, x') = \operatorname{acosh}(x * x')$. Note that the distance $d_L(x, x')$ corresponds to the length of the shortest line (geodesic) along the manifold connecting $x$ and $x'$ (cf. Fig. 1). We also point out that $(\mathbb{L}^d, d_L)$ forms a metric space.

2.2 Embeddability of hierarchical data

A map $\phi : X_1 \to X_2$ between metric spaces $(X_1, d_1)$ and $(X_2, d_2)$ is called an embedding. The multiplicative distortion of $\phi$ is defined to be the smallest constant $c_M \geq 1$ such that, for all $x, x' \in X_1$, $d_2(\phi(x), \phi(x')) \leq d_1(x, x') \leq c_M \, d_2(\phi(x), \phi(x'))$. When $c_M = 1$, $\phi$ is termed an isometric embedding. Since hierarchical data is tree-like, we can use classic embeddability results for trees as a reference point. Bourgain [2] and Linial et al. [19] showed that an $N$-point metric $X$ (i.e., $|X| = N$) embeds into Euclidean space $\mathbb{R}^{O(\log^2 N)}$ with $c_M = O(\log N)$. This bound is tight for trees in the sense that embedding them in a Euclidean space (of any dimension) must incur $c_M = \Omega(\log N)$ [19]. In contrast, Sarkar [28] showed that trees embed quasi-isometrically with $c_M = O(1 + \epsilon)$ into hyperbolic space $\mathbb{H}^d$, even in the low-dimensional regime with the dimension as small as $d = 2$.

2.3 Classification in hyperbolic space

We consider classification problems of the following form: $X \subseteq \mathbb{L}^d$ denotes the feature space, $Y = \{\pm 1\}$ the binary label space, and $W \subseteq \mathbb{R}^{d+1}$ the model space. In the following, we denote the training set as $S \subseteq X \times Y$. We begin by defining geodesic decision boundaries. Consider the Lorentz space $\mathbb{L}^d$ with ambient space $\mathbb{R}^{d+1}$. Then every geodesic decision boundary arises as the intersection of $\mathbb{L}^d$ with a hyperplane in the ambient space $\mathbb{R}^{d+1}$. Further, consider the set of linear separators or decision functions of the form $H = \{h_w : w \in \mathbb{R}^{d+1}, w * w < 0\}$, where

$$h_w(x) = \begin{cases} 1, & w * x > 0, \\ -1, & \text{otherwise.} \end{cases} \qquad (2.1)$$

Note that the requirement $w * w < 0$ in (2.1) ensures that the intersection of $\mathbb{L}^d$ and the decision hyperplane of $h_w$ is not empty. The geodesic decision boundary corresponding to the decision function $h_w$ is then given by $H_w = \{z \in \mathbb{L}^d : w * z = 0\}$. The distance of a point $x \in \mathbb{L}^d$ from the decision boundary $H_w$ can be computed as $d(x, H_w) = \operatorname{asinh}\left(|w * x| / \sqrt{-w * w}\right)$ [6].

2.4 Large-margin classification in hyperbolic space

In this paper, we are interested in learning a large-margin classifier in a hyperbolic space. Analogous to the Euclidean setting, the natural notion of margin is the minimal distance to the decision boundary over all training samples:

$$\operatorname{margin}_S(w) = \inf_{(x,y) \in S} y\, h_w(x)\, d(x, H_w) = \inf_{(x,y) \in S} \operatorname{asinh}\left( \frac{y\,(w * x)}{\sqrt{-w * w}} \right). \qquad (2.2)$$

For large-margin classifier learning, we aim to find $h_{w^*}$ defined by $w^* = \operatorname{argmax}_{w \in C} \operatorname{margin}_S(w)$, where $C = \{w \in \mathbb{R}^{d+1} : w * w < 0\}$ imposes a nonconvex constraint. This makes the problem computationally intractable using classical methods, unlike its Euclidean counterpart.
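To make these definitions concrete, here is a minimal NumPy sketch of the quantities above: the Minkowski product, the Lorentz distance, the decision function (2.1), the distance to a decision boundary, and the margin (2.2). It is only an illustration of the formulas; the function names and the toy example are ours, not from the paper.

```python
import numpy as np

def minkowski(x, xp):
    """Minkowski product x * x' = x_0 x'_0 - sum_{i>=1} x_i x'_i (Sec. 2.1)."""
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def lorentz_dist(x, xp):
    """Geodesic distance on L^d: d_L(x, x') = acosh(x * x')."""
    return np.arccosh(np.clip(minkowski(x, xp), 1.0, None))  # clip guards rounding error

def decide(w, x):
    """Decision function h_w(x) from (2.1); assumes w * w < 0."""
    return 1.0 if minkowski(w, x) > 0 else -1.0

def dist_to_boundary(w, x):
    """Distance of x in L^d to H_w = {z : w * z = 0} (Sec. 2.3)."""
    return np.arcsinh(abs(minkowski(w, x)) / np.sqrt(-minkowski(w, w)))

def margin(w, S):
    """margin_S(w) = inf_{(x,y) in S} asinh(y (w * x) / sqrt(-w * w))  (2.2)."""
    c = np.sqrt(-minkowski(w, w))
    return min(np.arcsinh(y * minkowski(w, x) / c) for x, y in S)

# Toy usage on L^1 (a subset of R^2): lift scalars onto the hyperboloid via t -> (cosh t, sinh t).
lift = lambda t: np.array([np.cosh(t), np.sinh(t)])
S = [(lift(0.5), 1.0), (lift(1.2), 1.0), (lift(-0.7), -1.0)]
w = np.array([0.0, -1.0])                    # w * w = -1 < 0, so H_w intersects L^1
print(decide(w, lift(0.5)), margin(w, S))    # -> 1.0 and roughly 0.5
```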
3 Hyperbolic linear separator learning

The first step towards the goal of learning a large-margin classifier is to establish that we can provably learn some separator. To this end, we present a hyperbolic version of the classic perceptron algorithm and establish that it will converge on data that is separable with a margin.

3.1 The hyperbolic perceptron algorithm

The hyperbolic perceptron (cf. Alg. 1) learns a binary classifier $w$ with respect to the Minkowski product. This is implemented in the update rule $v_t \leftarrow w_t + y x$, similar to the Euclidean case. In contrast to the Euclidean case, the hyperbolic perceptron requires an additional normalization step $w_{t+1} \leftarrow v_t / \sqrt{-v_t * v_t}$. This ensures that $w_{t+1} * w_{t+1} < 0$; as a result $w_{t+1}$ defines a meaningful classifier, i.e., $\mathbb{L}^d \cap H_{w_{t+1}} \neq \emptyset$. For more details, see Appendix B. While intuitive, it remains to establish that this algorithm converges, i.e., finds a solution which correctly classifies all the training samples. To this end, consider the following notion of hyperbolic linear separability with a margin: for $X, X' \subseteq \mathbb{L}^d$, we say that $X$ and $X'$ are linearly separable with (hyperbolic) margin $\gamma_H$ if there exists a $w \in \mathbb{R}^{d+1}$ with $w * w = -1$ such that $w * x > \sinh(\gamma_H)$ for all $x \in X$ and $w * x' < -\sinh(\gamma_H)$ for all $x' \in X'$. Assuming our training set is separable with a margin, the hyperbolic perceptron has the following convergence guarantee:

Theorem 3.1. Assume that there is some $\bar{w} \in \mathbb{R}^{d+1}$ with $\bar{w} * \bar{w} = -1$ and $w_0 * \bar{w} \geq 0$, and some $\gamma_H > 0$, such that $y_j (\bar{w} * x_j) \geq \sinh(\gamma_H)$ for $j = 1, \ldots, |S|$. Then, Alg. 1 converges in $O(1/\sinh(\gamma_H))$ steps and returns a solution with margin $\gamma_H$.

Algorithm 1 Hyperbolic perceptron
1: Initialize $w_0 \in \mathbb{R}^{d+1}$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   for $j = 1, \ldots, n$ do
4:     if $\operatorname{sgn}(x_j * w_t) \neq y_j$ then
5:       $v_t \leftarrow w_t + y_j x_j$
6:       $w_{t+1} \leftarrow v_t / \min\{1, \sqrt{-v_t * v_t}\}$
7:       break
8:     end if
9:   end for
10: end for
11: Output: $w_T$

Algorithm 2 Adversarial Training
1: Initialize $w_0 = 0$, $S' \leftarrow \emptyset$.
2: for $t = 0, 1, \ldots, T-1$ do
3:   $S_t \subseteq S$ sampled i.i.d. with $|S_t| = m$; $S'_t \leftarrow \emptyset$.
4:   for $i = 1, \ldots, m$ do
5:     $\tilde{x}_i \leftarrow \operatorname{argmax}_{d_L(x_i, z) \leq \alpha} \ell(z, y_i; w_t)$
6:   end for
7:   $S'_t \leftarrow \{(\tilde{x}_i, y_i)\}_{i=1}^{m}$
8:   $S' \leftarrow S' \cup S'_t$
9:   $w_{t+1} \leftarrow A(w_t, S, S')$
10: end for
11: Output: $w_T$

The proof of Thm. 3.1 (provided in Appendix B) follows the standard proof of the Euclidean perceptron and utilizes the Cauchy-Schwarz inequality for the Minkowski product. To verify that Algorithm 1 always converges to a valid hyperplane, i.e., such that $\mathbb{L}^d \cap H_{v_t} \neq \emptyset$, which happens iff $v_t * v_t < 0$, consider the following argument:

$$v_t * v_t = (w_t + y_j x_j) * (w_t + y_j x_j) = \underbrace{w_t * w_t}_{(i)} + \underbrace{2\, y_j (x_j * w_t)}_{(ii)} + \underbrace{y_j^2}_{=1}\,\underbrace{(x_j * x_j)}_{(iii)}.$$

Here, (i) is a consequence of the normalization step in Algorithm 1, and (iii) follows as $x_j * x_j = 1$, since $x_j \in \mathbb{L}^d$. As for (ii), note that we perform the update $v_t \leftarrow w_t + y_j x_j$ only when $y_j \neq \operatorname{sgn}(x_j * w_t)$ (cf. Algorithm 1).

Remark 3.2. Note that the perceptron algorithm in Euclidean space exhibits a $O(1/\gamma_E^2)$ convergence rate [24], where $\gamma_E$ denotes the Euclidean margin. When $\gamma_E \to 0$, the $1/\sinh(\gamma_H)$ convergence rate for hyperbolic spaces can be significantly faster than $1/\gamma_E^2$, indicating that exploiting the structure of hyperbolic space can be beneficial.
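The following NumPy sketch mirrors Algorithm 1 as reconstructed above. It is illustrative only: the initialization, the mistake-driven stopping rule, and the function names are our choices, not the authors' implementation.

```python
import numpy as np

def minkowski(x, xp):                       # Minkowski product (Sec. 2.1)
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def hyperbolic_perceptron(S, d, T=1000):
    """Sketch of Alg. 1: perceptron in the Lorentz model L^d.

    S is a list of (x, y) with x in L^d (vectors in R^{d+1}) and y in {-1, +1}.
    """
    w = np.zeros(d + 1)
    w[1] = 1.0                              # arbitrary start with w * w = -1 < 0
    for _ in range(T):
        updated = False
        for x, y in S:
            if np.sign(minkowski(x, w)) != y:        # misclassified point
                v = w + y * x                        # v_t <- w_t + y_j x_j
                # normalization keeps the iterate a valid separator (w * w < 0);
                # the derivation above shows -v * v > 0 for margin-separated data
                w = v / min(1.0, np.sqrt(-minkowski(v, v)))
                updated = True
                break
        if not updated:                              # no mistakes: S is separated
            return w
    return w
```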
3.2 The challenge of large-margin learning

Thm. 3.1 establishes that the hyperbolic perceptron converges to some linear separator. However, for the purposes of generalization, one would ideally like to converge to a large-margin separator. As with the classic Euclidean perceptron, no such guarantee is possible for the hyperbolic perceptron; this motivates us to ask whether a suitable modification can rectify this. Drawing inspiration from the Euclidean setting, a natural way to proceed is to consider the use of margin losses, such as the logistic or hinge loss. Formally, let $\ell : X \times \{\pm 1\} \to \mathbb{R}_+$ be a loss function:

$$\ell(x, y; w) = f(y\,(w * x)), \qquad (3.1)$$

where $f : \mathbb{R} \to \mathbb{R}_+$ is some convex, non-increasing function, e.g., the hinge loss. The empirical risk of the classifier parameterized by $w$ on the training set $S \subseteq X \times \{\pm 1\}$ is $L(w; S) = \frac{1}{|S|} \sum_{(x,y) \in S} \ell(x, y; w)$. Commonly, we learn a classifier by minimizing $L(w; S)$ via gradient descent with iterates

$$w_{t+1} \leftarrow w_t - \frac{\eta}{|S|} \sum_{(x,y) \in S} \nabla \ell(x, y; w_t), \qquad (3.2)$$

where $\eta > 0$ denotes the learning rate. Unfortunately, while this will yield a large-margin solution, the following result demonstrates that the number of iterations required may be prohibitively large.

Theorem 3.3. Let $e_i \in \mathbb{R}^{d+1}$ be the $i$-th standard basis vector. Consider the training set $S = \{(e_1, 1), (-e_1, -1)\}$ and the initialization $w_0 = e_2$. Suppose $\{w_t\}_{t \geq 0}$ is a sequence of iterates in (3.2). Then, the number of iterations needed to achieve margin $\gamma_H$ is $\Omega(\exp(\gamma_H))$.

While this result is disheartening, fortunately, we now present a simple resolution: by suitably adding adversarial examples, gradient descent converges to a large-margin solution in polynomial time.

4 Hyperbolic large-margin separator learning via adversarial examples

Thm. 3.3 reveals that gradient descent on a margin loss is insufficient to efficiently obtain a large-margin classifier. Adapting the approach proposed in [5] for the Euclidean setting, we show how to alleviate this problem by enriching the training set with adversarial examples before updating the classifier (cf. Alg. 2).

| Space | Perceptron | Adversarial margin | Adversarial ERM | Adversarial GD |
| --- | --- | --- | --- | --- |
| Euclidean (prior work) | $O\!\left(\frac{1}{\gamma_E^2}\right)$ | $\gamma_E - \alpha$ | $\Omega(\exp(d))$ | $\Omega\!\left(\operatorname{poly}\!\left(\frac{1}{\gamma_E - \alpha}\right)\right)$ |
| Hyperbolic (this paper) | $O\!\left(\frac{1}{\sinh(\gamma_H)}\right)$ | $\frac{\gamma_H}{\cosh(\alpha)}$ | $\Omega(\exp(d))$ | $\Omega\!\left(\operatorname{poly}\!\left(\frac{\cosh(\alpha)}{\sinh(\gamma_H)}\right)\right)$ |

Table 1: Comparison between Euclidean and hyperbolic spaces for the Perceptron (cf. Alg. 1) and adversarial training (cf. Alg. 2). Recall that $\gamma_{E/H}$, $\alpha$, and $d$ denote the (Euclidean/hyperbolic) margin of the training data, the adversarial perturbation budget, and the underlying dimension, respectively.

In particular, we minimize a robust loss

$$\min_{w \in \mathbb{R}^{d+1}} L_{\mathrm{rob}}(w; S) := \frac{1}{|S|} \sum_{(x,y) \in S} \ell_{\mathrm{rob}}(x, y; w), \qquad (4.1)$$

$$\ell_{\mathrm{rob}}(x, y; w) := \max_{z \in \mathbb{L}^d :\, d_L(x, z) \leq \alpha} \ell(z, y; w). \qquad (4.2)$$

The problem has a minimax structure, where the outer optimization minimizes the training error over $S$. The inner optimization generates an adversarial example by perturbing a given input feature $x$ on the hyperbolic manifold. Note that the magnitude of the perturbation added to the original example is bounded by $\alpha$, which we refer to as the adversarial budget. In particular, we want to construct a perturbation that maximizes the loss $\ell$, i.e., $\tilde{x} \in \operatorname{argmax}_{d_L(x, z) \leq \alpha} \ell(z, y; w)$. In this paper, we restrict ourselves to only those adversarial examples that lead to misclassification with respect to the current classifier, i.e., $h_w(x) \neq h_w(\tilde{x})$ (cf. Remark 4.2). The adversarial example $\tilde{x}$ can be generated efficiently by reducing the problem to a (Euclidean) linear program with a spherical constraint:

$$\text{(CERT)} \qquad \max_{z \in \mathbb{R}^{d+1}} \; -w_0 z_0 + \sum_{i=1}^{d} w_i z_i \quad \text{s.t.} \quad -\sum_{i=1}^{d} x_i z_i \leq \cosh(\alpha) - x_0 z_0, \quad \|z_{\setminus 0}\|^2 = z_0^2 - 1. \qquad (4.3)$$

Importantly, as detailed in Appendix C.2 and summarized next, (CERT) can be solved in closed form.

Theorem 4.1. Given the input example $(x, y)$, let $x_{\setminus 0} = (x_1, \ldots, x_d)$. We can efficiently compute a solution to (CERT) or decide that no solution exists. If a solution exists, then, based on a guess of $z_0$, the solution has the form $\tilde{x} = \left(z_0, \; \sqrt{z_0^2 - 1}\left(b_\alpha \check{x} + \sqrt{1 - b_\alpha^2}\, \check{x}^{\perp}\right)\right)$. Here, $b_\alpha$ depends on the adversarial budget $\alpha$, and $\check{x}^{\perp}$ is a unit vector orthogonal to $\check{x} = x_{\setminus 0}/\|x_{\setminus 0}\|$ along $w$.

Remark 4.2. Note that, according to Thm. 4.1, it is possible that, for a particular guess of $z_0$, we may not be able to find an adversarial example $\tilde{x}$ that leads to a prediction that is inconsistent with $x$, i.e., $h_w(x) \neq h_w(\tilde{x})$. Thus, for some $t$, we may have $|S'_t| < m$ in Alg. 2.
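The paper solves the inner maximization in (4.2) exactly via the closed form of Thm. 4.1 (detailed in Appendix C.2). As a rough, hedged stand-in that only illustrates the structure of the inner problem and the budget constraint $d_L(x, z) \leq \alpha$, the following NumPy sketch searches over random geodesic perturbations of length $\alpha$; the random-search heuristic and all names are ours, not the paper's method.

```python
import numpy as np

def minkowski(x, xp):                       # Minkowski product (Sec. 2.1)
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def random_geodesic_step(x, alpha, rng):
    """Return z on L^d with d_L(x, z) = alpha, in a random tangent direction.

    Uses the exponential map of the Lorentz model: z = cosh(a) x + sinh(a) v
    for a unit tangent vector v at x (i.e., x * v = 0 and v * v = -1).
    """
    u = rng.standard_normal(x.shape)
    v = u - minkowski(x, u) * x             # Minkowski projection onto the tangent space at x
    v = v / np.sqrt(-minkowski(v, v))       # unit length w.r.t. the Minkowski norm
    return np.cosh(alpha) * x + np.sinh(alpha) * v

def adversarial_example(x, y, w, alpha, n_candidates=256, seed=0):
    """Approximate argmax_{d_L(x,z) <= alpha} l(z, y; w) by random search.

    Since l(z, y; w) = f(y (w * z)) with f non-increasing (3.1), maximizing the
    loss amounts to minimizing y (w * z). Mirroring Remark 4.2, only candidates
    that flip the prediction are kept; None is returned otherwise.
    """
    rng = np.random.default_rng(seed)
    candidates = [random_geodesic_step(x, alpha, rng) for _ in range(n_candidates)]
    flipped = [z for z in candidates
               if np.sign(minkowski(w, z)) != np.sign(minkowski(w, x))]
    if not flipped:
        return None                         # this is why |S'_t| < m can occur
    return min(flipped, key=lambda z: y * minkowski(w, z))
```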
The minimization with respect to $w$ in (4.1) can be performed by an iterative optimization procedure, which generates a sequence of classifiers $\{w_t\}$. We update the classifier $w_t$ according to an update rule $A$, which accepts as input the current estimate of the weight vector, the original training set, and an adversarial perturbation of the training set. The update rule produces as output a weight vector which approximately minimizes the robust loss $L_{\mathrm{rob}}$ in (4.1). We now establish that, for a gradient-based update rule, the above adversarial training procedure will efficiently converge to a large-margin solution. Table 1 summarizes the results of this section and compares them with the corresponding results in the Euclidean setting [5].

4.1 Fast convergence via gradient-based update

Consider Alg. 2 with $A(w_t, S, S'_t)$ being a gradient-based update with learning rate $\eta_t > 0$:

$$w_{t+1} \leftarrow w_t - \frac{\eta_t}{|S'_t|} \sum_{(\tilde{x}, y) \in S'_t} \nabla_{w_t} \ell(\tilde{x}, y; w_t); \qquad w_{t+1} \leftarrow w_{t+1} / \sqrt{-w_{t+1} * w_{t+1}}, \qquad (4.4)$$

where the normalization is performed to ensure that the update remains valid, i.e., $\mathbb{L}^d \cap H_{w_{t+1}} \neq \emptyset$. To compute the update, we need to compute gradients of the outer minimization problem, i.e., $\nabla_w \ell_{\mathrm{rob}}$ over $S'_t$ (cf. (4.1)). However, this function is itself a maximization problem. We therefore compute the gradient at the maximizer of this inner maximization problem. Danskin's Theorem [9, 1] ensures that this gives a valid descent direction. Given the closed-form expression for the adversarial example $\tilde{x}$ as per Thm. 4.1, the gradient of the loss is

$$\nabla_w \ell(\tilde{x}, y; w) = f'(y(w * \tilde{x}))\, \nabla_w\, y(w * \tilde{x}) = f'(y(w * \tilde{x}))\, y\, \hat{\tilde{x}}^{\mathsf{T}}, \quad \text{where } \hat{\tilde{x}} = (\tilde{x}_0, -\tilde{x}_1, \ldots, -\tilde{x}_d).$$

By Danskin's theorem, $\nabla_w \ell(\tilde{x}, y; w)$ is a valid (sub)gradient of $\ell_{\mathrm{rob}}(x, y; w)$, so we can compute the descent direction and perform the step in (4.4). We defer details to Appendix C.4.

4.1.1 Convergence analysis

We now establish that the above gradient-based update converges to a large-margin solution in polynomial time. For this analysis, we need the following assumptions:

Assumption 1.
1. The training set $S$ is linearly separable with margin at least $\gamma_H$, i.e., there exists a $\bar{w} \in \mathbb{R}^{d+1}$ such that $y(\bar{w} * x) \geq \sinh(\gamma_H)$ for all $(x, y) \in S$.
2. There exist constants $R_x, R_w \geq 0$ such that (i) $\|x\| \leq R_x$; (ii) all possible adversarial perturbations remain within this constraint, i.e., $\|\tilde{x}\| \leq R_x$; and (iii) $\|w\| \leq R_w$. Let $R_\alpha := R_x R_w$.
3. The function $f(s)$ underlying the loss (cf. (3.1)) has the following properties: (i) $f(s) > 0$ for all $s$; (ii) $f'(s) < 0$ for all $s$; (iii) $f$ is differentiable; and (iv) $f$ is $\beta$-smooth.

In the rest of the section, we work with the following hyperbolic equivalent of the logistic regression loss that fulfills Assumption 1:

$$\ell(x, y; w) = \ln\left(1 + \exp\left(-\operatorname{asinh}\left(\frac{y(w * x)}{2 R_\alpha}\right)\right)\right), \qquad (4.5)$$

where $R_\alpha$ is as defined in Assumption 1. Other loss functions as well as the derivation of the hyperbolic logistic regression loss are discussed in Appendix C.1. We first show that Alg. 2 with a gradient update is guaranteed to converge to a large-margin classifier.

Theorem 4.3. With constant step size and $A$ being the GD update with an initialization $w_0$ satisfying $w_0 * w_0 < 0$, $\lim_{t \to \infty} L(w_t; S \cup S'_t) = 0$.

The proof can be found in Appendix C.4.
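Putting (4.4) and (4.5) together, a hedged sketch of the full training loop is shown below. It reuses the `minkowski` helper and the random-search `adversarial_example` stand-in from the earlier sketch (the paper instead uses the closed-form CERT solution); the initialization, step size, and remaining details are our choices for illustration only.

```python
import numpy as np

def minkowski(x, xp):                       # Minkowski product (Sec. 2.1)
    return x[0] * xp[0] - np.dot(x[1:], xp[1:])

def grad_hyperbolic_logistic(x, y, w, R_alpha):
    """Gradient in w of l(x, y; w) = ln(1 + exp(-asinh(y (w * x) / (2 R_alpha))))."""
    u = y * minkowski(w, x) / (2.0 * R_alpha)
    # chain rule: d/du ln(1 + exp(-asinh(u))) = -sigmoid(-asinh(u)) / sqrt(1 + u^2)
    dldu = -1.0 / (1.0 + np.exp(np.arcsinh(u))) / np.sqrt(1.0 + u * u)
    grad_wx = np.concatenate(([x[0]], -x[1:]))       # gradient of w * x with respect to w
    return dldu * y * grad_wx / (2.0 * R_alpha)

def adversarial_gd(S, d, alpha, R_alpha, eta=0.01, T=200):
    """Sketch of Alg. 2 with the gradient update (4.4).

    For simplicity, the full training set is perturbed in every round (no
    subsample S_t), and the adversary is the random-search stand-in above.
    """
    w = np.zeros(d + 1)
    w[1] = 1.0                              # w * w = -1 < 0
    for _ in range(T):
        S_adv = []
        for x, y in S:
            z = adversarial_example(x, y, w, alpha)  # inner maximization (4.2)
            if z is not None:
                S_adv.append((z, y))
        if not S_adv:
            S_adv = list(S)                 # fall back to clean data if nothing flips
        g = sum(grad_hyperbolic_logistic(x, y, w, R_alpha) for x, y in S_adv) / len(S_adv)
        w = w - eta * g                     # gradient step of (4.4)
        w = w / np.sqrt(-minkowski(w, w))   # renormalize (assumes w * w stays < 0)
    return w
```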
While Thm. 4.3 guarantees convergence, it does not guarantee efficiency (e.g., by showing a polynomial convergence rate). The following result shows that Alg. 2 with a gradient-based update obtains a max-margin classifier in polynomial time.

Theorem 4.4. For a fixed constant $c \in (0, 1)$, let $\eta_t = \eta := \frac{2c \sinh^2(\gamma_H)}{\beta\, \sigma_{\max}^2 \cosh^2(\alpha)\, R_\alpha^2}$, with $\sigma_{\max}$ denoting an upper bound on the maximum singular value of the data matrix $\sum_{x \in S} x x^{\mathsf{T}}$, and $A$ the GD update as defined in (4.4). Then, Alg. 2 achieves the margin $\gamma_H / \cosh(\alpha)$ in $\Omega\left(\operatorname{poly}\left(\cosh(\alpha)/\sinh(\gamma_H)\right)\right)$ steps.

Below, we briefly sketch some of the proof ideas, but defer a detailed exposition to Appendix C.4 (cf. Thm. C.13 and C.14). To prove the gradient-based convergence result, we first analyze the convergence of an adversarial perceptron that resembles the adversarial GD in that it performs updates of the form $w_{t+1} \leftarrow w_t + y \tilde{x}$. We then extend the analysis to the adversarial GD, where the convergence analysis builds on classical ideas from convex optimization. The following auxiliary lemma relates the adversarial margin to the max-margin classifier.

Lemma 4.5. Let $w^*$ be the max-margin classifier of $S$ with margin $\gamma_H$. At each iteration of Algorithm 2, $w^*$ linearly separates $S \cup S'$ with margin at least $\gamma_H / \cosh(\alpha)$.

The result follows from geometric arguments, as discussed in Section C.3. With the help of this lemma, we can show the following bound on the sample complexity of the adversarial perceptron:

Theorem 4.6. Assume that there is some $\bar{w} \in \mathbb{R}^{d+1}$ with $\bar{w} * \bar{w} = -1$ and $w_0 * \bar{w} \geq 0$, and some $\gamma_H > 0$, such that $y_j(\bar{w} * x_j) \geq \sinh(\gamma_H)$ for $j = 1, \ldots, |S|$. Then, the adversarial perceptron (with adversarial budget $\alpha$) converges after $O\left(\cosh(\alpha)/\sinh(\gamma_H)\right)$ steps, at which point it has a margin of at least $\gamma_H / \cosh(\alpha)$.

The technical proof can be found in Section C.3.

4.2 On the necessity of combining gradient descent and adversarial training

We remark here that the enrichment of the training set with adversarial examples is critical for the polynomial-time convergence. Recall first that, by Thm. 3.3, without adversarial training we can construct a simple max-margin problem that cannot be solved in polynomial time. Interestingly, merely using adversarial examples by themselves does not suffice for fast convergence either. Consider Alg. 2 with an ERM as the update rule $A(w_t, S, S')$. In this case, the iterate $w_{t+1}$ corresponds to an ERM solution for $S \cup S'$, i.e.,

$$w_{t+1} \in \operatorname{argmin}_{w} \sum_{(x,y) \in S \cup S'} \ell(x, y; w). \qquad (4.6)$$

Let $S_t = S$, i.e., we utilize the full power of the adversarial training in each step. The following result reveals that, even under this optimistic setting, Alg. 2 may not converge to a solution with a non-trivial margin in polynomial time:

Theorem 4.7. Suppose Alg. 2 (with an ERM update) outputs a linear separator of $S \cup S'$. In the worst case, the number of iterations required to achieve a margin of at least $\epsilon$ is $\Omega(\exp(d))$.

We note that a similar result in the Euclidean setting appears in [5]. We establish Thm. 4.7 by extending the proof strategy of [5, Thm. 4] to hyperbolic spaces. In particular, given a spherical code in $\mathbb{R}^d$ with $T$ codewords and minimum separation $\theta \geq \sinh(\epsilon)\cosh(\alpha)$, we construct a training set $S$ and subsequently the adversarial examples $\{S'_t\}$ such that there exists a sequence of ERM solutions $\{w_t\}_{t \leq T}$ on $S \cup S'$ (cf. (4.6)) that has margin less than $\epsilon$. Now the result in Thm. 4.7 follows by utilizing a lower bound [7] on the size of the spherical code with $T = \Omega(\exp(d))$ codewords and minimum separation $\theta \geq \sinh(\epsilon)\cosh(\alpha)$. The proof of Thm. 4.7 is in Appendix C.5.
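As a purely illustrative aside, the snippet below plugs some arbitrary example values into the scalings of Table 1, Remark 3.2, Lemma 4.5, and Theorem 4.6; the constants hidden in the $O(\cdot)$ and $\Omega(\cdot)$ notation are ignored, so the numbers only indicate relative scaling.

```python
import numpy as np

# Arbitrary example values; constants hidden in O(.) / Omega(.) are ignored.
gamma_E, gamma_H, alpha = 0.05, 0.05, 0.5

print("Euclidean perceptron,   O(1/gamma_E^2)          :", 1.0 / gamma_E**2)
print("Hyperbolic perceptron,  O(1/sinh(gamma_H))      :", 1.0 / np.sinh(gamma_H))
print("Adversarial perceptron, O(cosh(a)/sinh(gamma_H)):", np.cosh(alpha) / np.sinh(gamma_H))
print("Adversarial margin,     gamma_H/cosh(a)         :", gamma_H / np.cosh(alpha))
```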
5 Dimension-distortion trade-off

So far, we have focused on classifying data that is given in either Euclidean space $\mathbb{R}^d$ or Lorentz space $\mathbb{L}^d$. Now, consider data $(X, d_X)$ with similarity metric $d_X$ that was embedded into the respective spaces. We assume access to maps $\phi_E : X \to \mathbb{R}^d$ and $\phi_H : X \to \mathbb{L}^d_+$ that embed $X$ into the Euclidean space $\mathbb{R}^d$ and the upper sheet of the Lorentz space $\mathbb{L}^d_+$, respectively (cf. Remark A.1). Let $c_E$ and $c_H$ denote the multiplicative distortion induced by $\phi_E$ and $\phi_H$, respectively (cf. Section 2.2). Upper bounds on $c_E$ and $c_H$ can be estimated based on the structure of $X$ and the embedding dimensions.

Figure 2: Margin as distance between support vectors. Left: Euclidean. Right: Hyperbolic.

In this section, we address the natural question: How does the distortion $c_E$, $c_H$ impact our guarantees on the margin? In the previous sections, we noticed that some of the guarantees scale with the dimension of the embedding space. Therefore, we want to analyze the trade-off between the higher distortion resulting from working with smaller embedding dimensions and the higher cost of training robust models due to working with larger embedding dimensions. We often encounter data sets in ML applications that are intrinsically hierarchical. Theoretical results on the embeddability of trees (cf. Section 2.2) suggest that hyperbolic spaces are especially suitable to represent hierarchical data. We therefore restrict our analysis to such data. Further, we make the following assumptions on the underlying data $X$ and the embedding maps, respectively.

Assumption 2. (1) Both $\phi_H(X)$ and $\phi_E(X)$ are linearly separable in the respective spaces, and (2) $X$ is hierarchical, i.e., has a partial order relation.

Assumption 3. The maps $\phi_H, \phi_E$ preserve the partial order relation in $X$, and the root is mapped onto the origin of the embedding space.

Towards understanding the impact of the distortion of the embedding maps $\phi_H$ and $\phi_E$ on the margin, we relate the distance between the support vectors to the size of the margin. The distortion of these distances via embedding then gives us the desired bounds on the margin.

Figure 3: Performance of Adversarial GD. Left: Loss $L(w)$ on the original data. Middle: $\alpha$-robust loss $L_\alpha(w)$. Right: Hyperbolic margin $\gamma_H$. We vary the adversarial budget $\alpha$ over $\{0, 0.25, 0.5, 0.75, 1.0\}$. Note that $\alpha = 0$ corresponds to the state of the art [6].

5.1 Euclidean case

In the Euclidean case, we relate the distance of the support vectors to the size of the margin via triangle relations. Let $x, y \in \mathbb{R}^d$ denote support vectors such that $\langle x, w \rangle > 0$, $\langle y, w \rangle < 0$, and $\operatorname{margin}(w) = \epsilon$. Note that we can rotate the decision boundary such that the support vectors are not unique. So, without loss of generality, assume that $x_1, x_2$ are equidistant from the decision boundary and $\|w\| = 1$ (cf. Fig. 2 (left)). In this setting, we show the following relation between the margin with and without the influence of distortion:

Theorem 5.1. Let $\epsilon'$ and $\epsilon$ denote the margin with and without distortion, respectively. If $X$ is a tree embedded into $\mathbb{R}^{O(\log^2 |X|)}$, then $\epsilon' = O\left(\epsilon / \log^3 |X|\right)$.

The proof of Thm. 5.1 follows from simple side length-altitude relations in the Euclidean triangle between support vectors (cf. Fig. 2 (left)) and a simple application of Bourgain's result on embedding trees into $\mathbb{R}^d$. For more details, see Appendix D.1.

5.2 Hyperbolic case

As in the Euclidean case, we want to relate the margin to the pairwise distances of the support vectors.
Such a relation can be constructed both in the original and in the embedding space, which allows us to study the influence of distortion on the margin in terms of $c_H$. In the following, we will work with the half-space model $\mathbb{P}^d$ (cf. Appendix A.1). However, since the theoretical guarantees in the rest of the paper consider the Lorentz model $\mathbb{L}^d_+$, we have to map between the two spaces. We show in Appendix D.2 that such a mapping exists and preserves the Minkowski product, following [6]. The hyperbolic embedding $\phi_H$ has two sources of distortion: (1) the multiplicative distortion of pairwise distances, measured by the factor $1/c_H$; and (2) the distortion of order relations, in most embedding models captured by the alignment of ranks with the Euclidean norm. Under Assumption 3, order relationships are preserved and the root is mapped to the origin. Therefore, for $x \in X$, the distortion of the Euclidean norms is given as $\|\phi_H(x)\| = d_E(\phi_H(x), \phi_H(0)) = d_X(x, 0)/c_H$, i.e., the distortion of both pairwise distances and norms is given by a factor $1/c_H$. In $\mathbb{P}^d$, the decision hyperplane corresponds to a hypercircle $K_w$. We express its radius $r_w$ in terms of the hyperbolic distance between a point on the decision boundary and one of the hypercircle's ideal points [6]. The support vectors $x, y$ lie on hypercircles $K_x$ and $K_y$, which correspond to the set of points at hyperbolic distance $\epsilon$ (i.e., the margin) from the decision boundary. We again assume, without loss of generality, that at least one support vector is not unique and let $x_1, x_2 \in K_x$ and $y \in K_y$ (cf. Fig. 2 (right)). We now show that the distortion introduced by $\phi_H$ has a negligible effect on the margin.

Theorem 5.2. Let $\epsilon'$ and $\epsilon$ denote the margin with and without distortion, respectively. If $X$ is a tree embedded into $\mathbb{L}^2_+$, then $\epsilon' \approx \epsilon$.

The technical proof relies on a construction that reduces the problem to Euclidean geometry via circle inversion on the decision hypercircle. We defer all details to Appendix D.2.
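To make the dimension-distortion comparison concrete, here is a back-of-the-envelope sketch that plugs a tree of $N$ nodes into the scalings of Section 2.2 and Theorems 5.1 and 5.2; all constants in the $O(\cdot)$ bounds (and the base of the logarithm) are dropped, so the output is only indicative, and the example values are ours.

```python
import numpy as np

N = 10_000        # number of nodes in the hierarchical data set (arbitrary example)
eps = 1.0         # margin of the data before embedding (arbitrary example)

# Euclidean embedding of a tree (Bourgain / Linial et al., Sec. 2.2; Thm. 5.1)
dim_E = np.log(N) ** 2            # ~ O(log^2 N) embedding dimensions
distortion_E = np.log(N)          # c_M = O(log N)
margin_E = eps / np.log(N) ** 3   # eps' = O(eps / log^3 N)

# Hyperbolic embedding of a tree (Sarkar, Sec. 2.2; Thm. 5.2)
dim_H = 2                         # d = 2 already suffices
distortion_H = 1.0                # c_M = O(1 + eps') for arbitrarily small eps' (quasi-isometric)
margin_H = eps                    # the margin is essentially preserved

print(f"Euclidean : dim ~ {dim_E:.0f}, distortion ~ {distortion_E:.1f}, margin ~ {margin_E:.4f}")
print(f"Hyperbolic: dim ~ {dim_H}, distortion ~ {distortion_H:.1f}, margin ~ {margin_H:.1f}")
```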
6 Experiments

We now present empirical studies for hyperbolic linear separator learning to corroborate our theory. In particular, we evaluate our proposed Adversarial GD algorithm (Section 4) on data that is linearly separable in hyperbolic space and compare with the state of the art [6]. Furthermore, we analyze dimension-distortion trade-offs (Section 5). As in our theory, we train hyperbolic linear classifiers $w$ whose prediction on $x$ is $\hat{y} = \operatorname{sgn}(w * x)$. Additional experimental results can be found in Appendix F. We emphasize that our focus in this work is on theoretically understanding the benefits of hyperbolic spaces for classification. The above experiments serve to illustrate our theoretical results, which as a starting point were derived for linear models and separable data. While extensions to non-separable data and non-linear models are of practical interest, a detailed study is left for future work.

Data. We use the ImageNet ILSVRC 2012 dataset [26] along with its label hierarchy from WordNet. Hyperbolic embeddings in Lorentz space are obtained for the internal label nodes and leaf image nodes using Sarkar's construction [28]. For the first experiment, we verify the effectiveness of adversarial learning by picking two classes (n09246464 and n07831146), which allows for a data set that is linearly separable in hyperbolic space. In this set, there were 1,648 positive and 1,287 negative examples. For the second experiment, to showcase the better representational power of hyperbolic spaces for hierarchical data, we pick two disjoint subtrees (n00021939 and n00015388) from the hierarchy.

Adversarial GD. In the following, we utilize the hyperbolic hinge loss (C.2) (see Appendix F for other loss functions). To verify the effectiveness of adversarial training, we compute three quantities: (i) the loss on the original data, $L(w)$; (ii) the $\alpha$-robust loss, $L_\alpha(w)$; and (iii) the hyperbolic margin $\gamma_H$. We vary the budget $\alpha$ over $\{0, 0.25, 0.5, 0.75, 1.0\}$, where $\alpha = 0$ corresponds to the setup in [6]. For a given budget, we obtain adversarial examples by solving the CERT problem (4.3), which is feasible for $z_0 \in \left(x_0 \cosh(\alpha) - \Delta,\; x_0 \cosh(\alpha) + \Delta\right)$, where $\Delta = \sqrt{(x_0^2 - 1)(\cosh^2(\alpha) - 1)}$. We do a binary search in this range for $z_0$, solve CERT, and check whether we can obtain an adversarial example. We utilize the adversarial examples we find, and ignore other data points. In all experiments, we use a constant step size $\eta_t = 0.01$ for all $t$. The results are shown in Fig. 3. As $\alpha$ increases, the problem becomes harder to solve (higher robust training loss), but we achieve a better margin. Notably, we observe strong performance gains over the training procedure without adversarial examples [6].

Dimensional efficiency. In this experiment, we illustrate the benefit of using hyperbolic space when the underlying data is truly hierarchical. To be more favourable to the Euclidean setting, we subsample images from each subtree, such that in total we have 1000 vectors. We obtain Euclidean embeddings following the setup and code from Nickel and Kiela [22]. The Euclidean embeddings in 16 dimensions reach a mean rank (MR) of approximately 2, which indicates reasonable quality in preserving distances to few-hop neighbors. We observe superior classification performance at much lower dimensions by leveraging hyperbolic space (see Table 2 in Appendix F.3). In particular, our hyperbolic classifier achieves zero test error on 8-dimensional embeddings, whereas Euclidean logistic regression struggles even with 16-dimensional embeddings. This is consistent with our theoretical results (Section 5): due to high distortion, lower-dimensional Euclidean embeddings struggle to capture the global structure among the data points that makes the data points easily separable.
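The feasibility range for $z_0$ quoted above is easy to compute; the small helper below does only that. Only the interval formula is taken from the text; the function name, the example values, and the grid of candidates are ours (the experiments use a binary search over this interval instead).

```python
import numpy as np

def z0_feasible_interval(x0, alpha):
    """Open interval of feasible z_0 for CERT (4.3), as stated in Section 6:
    (x0 cosh(a) - D, x0 cosh(a) + D) with D = sqrt((x0^2 - 1)(cosh^2(a) - 1))."""
    delta = np.sqrt((x0**2 - 1.0) * (np.cosh(alpha)**2 - 1.0))
    return x0 * np.cosh(alpha) - delta, x0 * np.cosh(alpha) + delta

# Example with arbitrary values: x_0 = 1.5 (a point on the upper sheet), budget alpha = 0.5.
lo, hi = z0_feasible_interval(1.5, 0.5)
candidates = np.linspace(lo, hi, num=11)[1:-1]   # stay strictly inside the open interval
print(lo, hi)
```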
7 Conclusion and future work

We studied the problem of learning robust classifiers with large margins in hyperbolic space. We introduced and analyzed a hyperbolic perceptron algorithm. Moreover, we explored multiple adversarial approaches to robust large-margin learning. The second part of the paper analyzed the role of geometry in learning robust classifiers. We compared Euclidean and hyperbolic approaches with respect to the intrinsic geometry of the data. For hierarchical data that embeds well into hyperbolic space, the lower embedding dimension ensures superior guarantees when learning the classifier in hyperbolic space. This result suggests that it can be highly beneficial to perform downstream machine learning and optimization tasks in a space that naturally reflects the intrinsic geometry of the data. Promising avenues for future research include (i) exploring the practicality of these results in broader machine learning and data science applications; and (ii) studying other related methods in non-Euclidean spaces, together with an evaluation of dimension-distortion trade-offs.

Acknowledgements

We thank Pranjal Awasthi for helpful discussions on computing adversarial examples and Eli Chien for helpful comments on an earlier version of the paper.

Broader Impact

The paper proposes novel algorithms for large-margin learning in hyperbolic space. The paper's scope is theoretical and does not discuss specific applications with societal consequences. Therefore, a discussion of the broader impact of this work is not applicable.

References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific Optimization and Computation Series. Athena Scientific, 2016.
[2] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52(1):46-52, Mar 1985.
[3] Martin R. Bridson and André Haefliger. Metric Spaces of Non-Positive Curvature, volume 319 of Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, Berlin, Heidelberg, 1999. ISBN 978-3-642-08399-0, 978-3-662-12494-9. URL http://link.springer.com/10.1007/978-3-662-12494-9.
[4] Benjamin Chamberlain, James Clough, and Marc Deisenroth. Neural embeddings of graphs in hyperbolic space. arXiv:1705.10359 [stat.ML], 2017.
[5] Zachary Charles, Shashank Rajput, Stephen Wright, and Dimitris Papailiopoulos. Convergence and margin of adversarial training on separable data. arXiv:1905.09209, 2019.
[6] Hyunghoon Cho, Benjamin DeMeo, Jian Peng, and Bonnie Berger. Large-margin classification in hyperbolic space. In Proceedings of Machine Learning Research, volume 89, pages 1832-1840. PMLR, Apr 2019.
[7] Henry Cohn and Yufei Zhao. Sphere packing bounds via spherical codes. Duke Mathematical Journal, 163(10):1965-2002, 2014.
[8] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, September 1995.
[9] John M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641-664, 1966.
[10] Laurent El Ghaoui, Gert R. G. Lanckriet, and Georges Natsoulis. Robust classification with interval data. Technical Report UCB/CSD-03-1279, EECS Department, University of California, Berkeley, Oct 2003.
[11] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 1186-1195, 2018.
[12] Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, Dec 1999.
[13] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, pages 5345-5355, 2018.
[14] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv:1803.07300, 2018.
[15] Seung-Jean Kim, Alessandro Magnani, and Stephen Boyd. Robust Fisher discriminant analysis. In Advances in Neural Information Processing Systems, pages 659-666, 2006.
[16] Gert R. G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555-582, March 2003.
[17] Guy Lebanon and John Lafferty. Hyperplane margin classifiers on the multinomial manifold. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML'04, 2004.
[18] Yan Li, Ethan X. Fang, Huan Xu, and Tuo Zhao. Implicit bias of gradient descent based adversarial training on separable data.
In International Conference on Learning Representations (ICLR), 2020.
[19] Nathan Linial, Eran London, and Yuri Rabinovich. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215-245, 1995.
[20] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018.
[21] Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, and Amr Ahmed. Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD'19, pages 714-722, 2019.
[22] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pages 6338-6347, 2017.
[23] Maximillian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, 2018.
[24] A. B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615-622, 1962.
[25] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, pages 65-386, 1958.
[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
[27] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Re. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 4460-4469, 2018.
[28] Rik Sarkar. Low distortion Delaunay embedding of trees in hyperbolic plane. In Graph Drawing, pages 355-366, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-25878-7.
[29] Claude E. Shannon. Probability of error for optimal codes in a Gaussian channel. Bell System Technical Journal, 38(3):611-656, 1959.
[30] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19:1-57, 2018.
[31] Alexandru Tifrea, Gary Bécigneul, and Octavian-Eugen Ganea. Poincaré GloVe: Hyperbolic word embeddings. In International Conference on Learning Representations (ICLR), 2019.
[32] Melanie Weber. Neighborhood growth determines geometric priors for relational representation learning. In International Conference on Artificial Intelligence and Statistics, volume 108, pages 266-276, 2020.