# RIEMANNIAN ADAPTIVE OPTIMIZATION METHODS

Gary Bécigneul, Octavian-Eugen Ganea
Department of Computer Science, ETH Zürich, Switzerland
{gary.becigneul,octavian.ganea}@inf.ethz.ch

Published as a conference paper at ICLR 2019.

ABSTRACT

Several first-order stochastic optimization methods commonly used in the Euclidean domain, such as stochastic gradient descent (SGD), accelerated gradient descent or variance-reduced methods, have already been adapted to certain Riemannian settings. However, some of the most popular of these optimization tools, namely ADAM, ADAGRAD and the more recent AMSGRAD, remain to be generalized to Riemannian manifolds. We discuss the difficulty of generalizing such adaptive schemes to the most agnostic Riemannian setting, and then provide algorithms and convergence proofs for geodesically convex objectives in the particular case of a product of Riemannian manifolds, in which adaptivity is implemented across manifolds in the Cartesian product. Our generalization is tight in the sense that choosing the Euclidean space as the Riemannian manifold yields the same algorithms and regret bounds as those that were already known for the standard algorithms. Experimentally, we show faster convergence and a lower train loss value for Riemannian adaptive methods over their corresponding baselines on the realistic task of embedding the WordNet taxonomy in the Poincaré ball.

1 INTRODUCTION

Developing powerful stochastic gradient-based optimization algorithms is of major importance for a variety of application domains. In particular, for computational efficiency, it is common to opt for a first-order method when the number of parameters to be optimized is large enough. Such cases have recently become ubiquitous in engineering and computational sciences, from the optimization of deep neural networks to learning embeddings over large vocabularies. This new need resulted in the development of empirically very successful first-order methods such as ADAGRAD (Duchi et al., 2011), ADADELTA (Zeiler, 2012), ADAM (Kingma & Ba, 2015) or its recent update AMSGRAD (Reddi et al., 2018).

Note that these algorithms are designed to optimize parameters living in a Euclidean space R^n, which has often been considered the default geometry for continuous variables. However, a recent line of work has been concerned with the optimization of parameters lying on a Riemannian manifold, a more general setting allowing non-Euclidean geometries. This family of algorithms has already found numerous applications, including for instance solving Lyapunov equations (Vandereycken & Vandewalle, 2010), matrix factorization (Tan et al., 2014), geometric programming (Sra & Hosseini, 2015), dictionary learning (Cherian & Sra, 2017) or hyperbolic taxonomy embedding (Nickel & Kiela, 2017; Ganea et al., 2018a; De Sa et al., 2018; Nickel & Kiela, 2018).

A few first-order stochastic methods have already been generalized to this setting (see Section 6), the seminal one being Riemannian stochastic gradient descent (RSGD) (Bonnabel, 2013), along with new methods for their convergence analysis in the geodesically convex case (Zhang & Sra, 2016). However, the above-mentioned, empirically successful adaptive methods, together with their convergence analysis, have yet to find their respective Riemannian counterparts. Indeed, the adaptivity of these algorithms can be thought of as assigning one learning rate per coordinate of the parameter vector.
However, on a Riemannian manifold, one is generally not given an intrinsic coordinate system, rendering the notions of sparsity or coordinate-wise updates meaningless.

Our contributions. In this work we (i) explain why generalizing these adaptive schemes to the most agnostic Riemannian setting in an intrinsic manner is compromised, and (ii) propose generalizations of the algorithms together with their convergence analysis in the particular case of a product of manifolds, where each manifold represents one "coordinate" of the adaptive scheme. Finally, we (iii) empirically support our claims on the realistic task of hyperbolic taxonomy embedding.

Our initial motivation. The particular application that motivated us in developing Riemannian versions of ADAGRAD and ADAM was the learning of symbolic embeddings in non-Euclidean spaces. As an example, the GloVe algorithm (Pennington et al., 2014), an unsupervised method for learning Euclidean word embeddings capturing semantic/syntactic relationships, benefits significantly from optimizing with ADAGRAD compared to using SGD, presumably because different words are sampled at different frequencies. Hence the absence of Riemannian adaptive algorithms could constitute a significant obstacle to the development of competitive optimization-based Riemannian embedding methods. In particular, we believe that the recent rise of embedding methods in hyperbolic spaces could benefit from such developments (Nickel & Kiela, 2017; 2018; Ganea et al., 2018a;b; De Sa et al., 2018; Vinh et al., 2018).

2 PRELIMINARIES AND NOTATIONS

2.1 DIFFERENTIAL GEOMETRY

We recall here some elementary notions of differential geometry. For more in-depth expositions, we refer the interested reader to Spivak (1979) and Robbin & Salamon (2011).

Manifold, tangent space, Riemannian metric. A manifold M of dimension n is a space that can locally be approximated by a Euclidean space R^n, and which can be understood as a generalization to higher dimensions of the notion of surface. For instance, the sphere S := {x ∈ R^n : ‖x‖₂ = 1} embedded in R^n is an (n−1)-dimensional manifold. In particular, R^n is a very simple n-dimensional manifold, with zero curvature. At each point x ∈ M, one can define the tangent space T_xM, which is an n-dimensional vector space and can be seen as a first-order local approximation of M around x. A Riemannian metric ρ is a collection ρ := (ρ_x)_{x∈M} of inner products ρ_x(·,·) : T_xM × T_xM → R on T_xM, varying smoothly with x. It defines the geometry locally on M. For x ∈ M and u ∈ T_xM, we also write ‖u‖_x := √(ρ_x(u, u)). A Riemannian manifold is a pair (M, ρ).

Induced distance function, geodesics. Notice how a choice of a Riemannian metric ρ induces a natural global distance function on M. Indeed, for x, y ∈ M, we can set d(x, y) to be equal to the infimum of the lengths of smooth paths between x and y in M, where the length ℓ(c) of a path c is given by integrating the norm of its speed vector ċ(t) ∈ T_{c(t)}M in the corresponding tangent space: ℓ(c) := ∫₀¹ ‖ċ(t)‖_{c(t)} dt. A geodesic γ in (M, ρ) is a smooth curve γ : (a, b) → M which locally has minimal length. In particular, a shortest path between two points in M is a geodesic.

Exponential and logarithmic maps. Under some assumptions, one can define at point x ∈ M the exponential map exp_x : T_xM → M. Intuitively, this map folds the tangent space onto the manifold. Locally, if v ∈ T_xM, then for small t, exp_x(tv) tells us how to move in M so as to take a shortest path from x with initial direction v. In R^n, exp_x(v) = x + v. In some cases, one can also define the logarithmic map log_x : M → T_xM as the inverse of exp_x.
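To make these maps concrete, here is a minimal NumPy sketch (ours, not from the paper) of the exponential and logarithmic maps on the unit sphere, one of the simplest curved manifolds; the closed forms used below are the classical ones for the sphere with the metric induced from R^n.

```python
import numpy as np

def sphere_exp(x, v, eps=1e-12):
    """Exponential map on the unit sphere: move from x along the tangent vector v."""
    norm_v = np.linalg.norm(v)
    if norm_v < eps:
        return x
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def sphere_log(x, y, eps=1e-12):
    """Logarithmic map on the unit sphere: tangent vector at x pointing towards y."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_theta)      # geodesic distance d(x, y)
    u = y - cos_theta * x             # component of y orthogonal to x
    norm_u = np.linalg.norm(u)
    if norm_u < eps:
        return np.zeros_like(x)
    return theta * u / norm_u

# exp_x(log_x(y)) recovers y, and ||log_x(y)|| equals the geodesic distance.
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
v = sphere_log(x, y)
print(np.allclose(sphere_exp(x, v), y), np.linalg.norm(v))  # True, pi/2
```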
Parallel transport. In the Euclidean space, if one wants to transport a vector v from x to y, one simply translates v along the straight line from x to y. In a Riemannian manifold, the resulting transported vector will depend on which path was taken from x to y. The parallel transport P_x(v; w) of a vector v from a point x, in the direction w and in unit time, gives a canonical way to transport v with zero acceleration along a geodesic starting from x with initial velocity w.

2.2 RIEMANNIAN OPTIMIZATION

Consider performing an SGD update of the form

$$x_{t+1} \leftarrow x_t - \alpha\, g_t, \tag{1}$$

where g_t denotes the gradient of the objective f_t¹ and α > 0 is the step-size. In a Riemannian manifold (M, ρ), for smooth f : M → R, Bonnabel (2013) defines Riemannian SGD by the following update:

$$x_{t+1} \leftarrow \exp_{x_t}(-\alpha\, g_t), \tag{2}$$

where g_t ∈ T_{x_t}M denotes the Riemannian gradient of f_t at x_t. Note that when (M, ρ) is the Euclidean space (R^n, I_n), these two updates match, since we then have exp_x(v) = x + v. Intuitively, applying the exponential map enables one to perform an update along the shortest path in the relevant direction in unit time, while remaining in the manifold. In practice, when exp_x(v) is not known in closed form, it is common to replace it by a retraction map R_x(v), most often chosen as R_x(v) = x + v, which is a first-order approximation of exp_x(v).

2.3 AMSGRAD, ADAM, ADAGRAD

Let us recall here the main algorithms that we are taking interest in.

ADAGRAD. Introduced by Duchi et al. (2011), the standard form of its update step is defined as²

$$x^i_{t+1} \leftarrow x^i_t - \alpha\, g^i_t \Big/ \sqrt{\sum_{k=1}^{t} (g^i_k)^2}. \tag{3}$$

Such updates, rescaled coordinate-wise depending on the size of past gradients, can yield huge improvements when gradients are sparse, or in deep networks where the size of a good update may depend on the layer. However, the accumulation of all past gradients can also slow down learning.

ADAM. Proposed by Kingma & Ba (2015), the ADAM update rule is given by

$$x^i_{t+1} \leftarrow x^i_t - \alpha\, m^i_t \big/ \sqrt{v^i_t}, \tag{4}$$

where m_t = β₁m_{t−1} + (1 − β₁)g_t can be seen as a momentum term and v^i_t = β₂v^i_{t−1} + (1 − β₂)(g^i_t)² is an adaptivity term. When β₁ = 0, one essentially recovers the unpublished method RMSPROP (Tieleman & Hinton, 2012), the only difference to ADAGRAD being that the sum is replaced by an exponential moving average, hence past gradients are forgotten over time in the adaptivity term v_t. This circumvents the issue of ADAGRAD that learning could stop too early when the sum of accumulated squared gradients becomes too large. Let us also mention that the momentum term introduced by ADAM for β₁ ≠ 0 has been observed to often yield huge empirical improvements.

AMSGRAD. More recently, Reddi et al. (2018) identified a mistake in the convergence proof of ADAM. To fix it, they proposed to either modify the ADAM algorithm with³

$$x^i_{t+1} \leftarrow x^i_t - \alpha\, m^i_t \big/ \sqrt{\hat v^i_t}, \qquad \text{where } \hat v^i_t = \max\{\hat v^i_{t-1}, v^i_t\}, \tag{5}$$

which they coin AMSGRAD, or to choose an increasing schedule for β₂, making it time dependent, which they call ADAMNC (for non-constant).

¹ To be interpreted as the objective with the same parameters, evaluated at the minibatch taken at time t.
² A small ε = 10⁻⁸ is often added inside the square root for numerical stability; it is omitted here for simplicity.
³ With m_t and v_t defined by the same equations as in ADAM (see the above paragraph).
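For concreteness, the following NumPy sketch (ours, not the authors' code) implements the coordinate-wise ADAGRAD and AMSGRAD updates of Eqs. (3) and (5); the small `eps` constant plays the role of the ε mentioned in footnote 2, and bias correction is omitted, as in the updates above.

```python
import numpy as np

def adagrad_step(x, g, state, alpha=0.01, eps=1e-8):
    """One ADAGRAD update (Eq. 3): accumulate squared gradients coordinate-wise."""
    state["sum_sq"] = state.get("sum_sq", np.zeros_like(x)) + g ** 2
    return x - alpha * g / (np.sqrt(state["sum_sq"]) + eps)

def amsgrad_step(x, g, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGRAD update (Eq. 5): ADAM with a running max on the adaptivity term."""
    m = state.get("m", np.zeros_like(x))
    v = state.get("v", np.zeros_like(x))
    v_hat = state.get("v_hat", np.zeros_like(x))
    m = beta1 * m + (1 - beta1) * g        # momentum: exponential moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # adaptivity term: moving average of squared gradients
    v_hat = np.maximum(v_hat, v)           # the AMSGRAD fix: v_hat never decreases
    state.update(m=m, v=v, v_hat=v_hat)
    return x - alpha * m / (np.sqrt(v_hat) + eps)
```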
3 ADAPTIVE SCHEMES IN RIEMANNIAN MANIFOLDS

3.1 THE DIFFICULTY OF DESIGNING ADAPTIVE SCHEMES IN THE GENERAL SETTING

Intrinsic updates. It is easily understood that writing any coordinate-wise update requires the choice of a coordinate system. However, on a Riemannian manifold (M, ρ), one is generally not provided with a canonical coordinate system. The formalism only allows one to work with certain local coordinate systems, also called charts, and several different charts can be defined around each point x ∈ M. One usually says that a quantity defined using a chart is intrinsic to M if its definition does not depend on which chart was used. For instance, it is known that the Riemannian gradient grad f of a smooth function f : M → R can be defined intrinsically to (M, ρ), but its Hessian is only intrinsically defined at critical points⁴. It is easily seen that the RSGD update of Eq. (2) is intrinsic, since it only involves exp and grad, which are objects intrinsic to (M, ρ). However, it is unclear whether it is possible at all to express any of Eqs. (3)-(5) in a coordinate-free, i.e. intrinsic, manner.

A tempting solution. Note that since an update is defined in a tangent space, one could be tempted to fix a canonical coordinate system e := (e⁽¹⁾, ..., e⁽ᵈ⁾) in the tangent space T_{x₀}M ≅ R^d at the initialization x₀ ∈ M, and to parallel-transport e along the optimization trajectory, adapting Eq. (3) to

$$x_{t+1} \leftarrow \exp_{x_t}(\Delta_t), \qquad e_{t+1} \leftarrow P_{x_t}(e_t; \Delta_t), \qquad \text{with } \Delta_t := -\alpha\, g_t \oslash \sqrt{\sum_{k=1}^{t} (g_k)^2}, \tag{6}$$

where ⊘ and (·)² denote coordinate-wise division and square respectively, these operations being taken relative to the coordinate system e_t. In the Euclidean space, parallel transport between two points x and y does not depend on the path along which it is taken, because the space has no curvature. However, in a general Riemannian manifold, not only does it depend on the chosen path, but curvature will also give parallel transport a rotational component⁵, which will almost surely break the sparsity of the gradients and hence the benefit of adaptivity. Besides, the interpretation of adaptivity as optimizing different features (i.e. gradient coordinates) at different speeds is also completely lost here, since the coordinate system used to represent gradients depends on the optimization path. Finally, note that the techniques we used to prove our theorems would not apply to updates defined in the vein of Eq. (6).

⁴ Because the Poisson bracket cancels at critical points (Milnor, 1963, part 1.2).
⁵ The rotational component of parallel transport inherited from curvature is called the holonomy.

3.2 ADAPTIVITY IS POSSIBLE ACROSS MANIFOLDS IN A PRODUCT

From now on, we assume additional structure on (M, ρ), namely that it is the Cartesian product of n Riemannian manifolds (M_i, ρ^i), where ρ is the induced product metric:

$$M := M_1 \times \cdots \times M_n, \qquad \rho_x(u, v) := \sum_{i=1}^{n} \rho^i_{x^i}(u^i, v^i). \tag{7}$$

Product notations. The induced distance function d on M is known to be given by d(x, y)² = Σ_{i=1}^n d_i(x^i, y^i)², where d_i is the distance in M_i. The tangent space at x = (x¹, ..., xⁿ) is given by T_xM = T_{x¹}M_1 × ··· × T_{xⁿ}M_n, and the Riemannian gradient g of a smooth function f : M → R at point x ∈ M is simply the concatenation g = ((g¹)ᵀ ··· (gⁿ)ᵀ)ᵀ of the Riemannian gradients gⁱ ∈ T_{xⁱ}M_i of each partial map fⁱ : y ∈ M_i ↦ f(x¹, ..., x^{i−1}, y, x^{i+1}, ..., xⁿ). Similarly, the exponential map, the log map and the parallel transport in M are the concatenations of those in each M_i.

Riemannian ADAGRAD. We just saw in the above discussion that designing meaningful adaptive schemes, intuitively corresponding to one learning rate per coordinate, in a general Riemannian manifold is difficult because of the absence of intrinsic coordinates. Here, we propose to see each component xⁱ ∈ M_i of x as a "coordinate", yielding a simple adaptation of Eq. (3) as

$$x^i_{t+1} \leftarrow \exp^i_{x^i_t}\left( -\alpha\, g^i_t \Big/ \sqrt{\sum_{k=1}^{t} \|g^i_k\|^2_{x^i_k}} \right). \tag{8}$$
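As an illustration of Eq. (8), the hedged sketch below (ours) performs one Riemannian ADAGRAD step over a product of factor manifolds, each assumed to expose an `exp` and a Riemannian `norm` method; the trivial Euclidean factor is included only so that the snippet runs as-is.

```python
import numpy as np

class Euclidean:
    """A factor manifold M_i; here the trivial Euclidean case, for illustration."""
    def exp(self, x, v):    # exp_x(v) = x + v in R^n
        return x + v
    def norm(self, x, v):   # Riemannian norm ||v||_x; the Euclidean norm here
        return np.linalg.norm(v)

def radagrad_step(manifolds, xs, grads, state, alpha=0.1, eps=1e-8):
    """One Riemannian ADAGRAD step (Eq. 8): one adaptive scalar per factor manifold."""
    acc = state.setdefault("acc", np.zeros(len(manifolds)))  # sums of squared Riemannian norms
    new_xs = []
    for i, (M, x, g) in enumerate(zip(manifolds, xs, grads)):
        acc[i] += M.norm(x, g) ** 2
        step = -alpha * g / (np.sqrt(acc[i]) + eps)
        new_xs.append(M.exp(x, step))
    return new_xs
```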
On the adaptivity term. Note that we take (squared) Riemannian norms ‖g^i_t‖²_{x^i_t} = ρ^i_{x^i_t}(g^i_t, g^i_t) in the adaptivity term rescaling the gradient. In the Euclidean setting, this quantity is simply the scalar (g^i_t)², which is related to the size of an SGD update of the i-th coordinate, rescaled by the learning rate (see Eq. (1)): |g^i_t| = |x^i_{t+1} − x^i_t|/α. By analogy, note that the size of an RSGD update in M_i (see Eq. (2)) is given by d_i(x^i_{t+1}, x^i_t) = d_i(exp^i_{x^i_t}(−α g^i_t), x^i_t) = α‖g^i_t‖_{x^i_t}, hence we also recover ‖g^i_t‖_{x^i_t} = d_i(x^i_{t+1}, x^i_t)/α, which indeed suggests replacing the scalar (g^i_t)² by ‖g^i_t‖²_{x^i_t} when transforming a coordinate-wise adaptive scheme into a manifold-wise adaptive one.

4 RAMSGRAD, RADAMNC: CONVERGENCE GUARANTEES

In Section 2, we briefly presented ADAGRAD, ADAM and AMSGRAD. Intuitively, ADAM can be described as a combination of ADAGRAD with a momentum (of parameter β₁), with the slight modification that the sum of the past squared gradients is replaced by an exponential moving average, with exponent β₂. Let us also recall that AMSGRAD implements a slight modification of ADAM, allowing to correct its convergence proof. Finally, ADAMNC is simply ADAM, but with a particular non-constant schedule for β₁ and β₂. On the other hand, what is interesting to note is that the schedule initially proposed by Reddi et al. (2018) for β₂ in ADAMNC, namely β₂t := 1 − 1/t, lets v_t recover the sum of squared gradients of ADAGRAD. Hence, ADAMNC without momentum (i.e. β₁t = 0) yields ADAGRAD.

Assumptions and notations. For 1 ≤ i ≤ n, we assume (M_i, ρ^i) is a geodesically complete Riemannian manifold with sectional curvature lower bounded by κ_i ≤ 0. As written in Eq. (7), let (M, ρ) be the product manifold of the (M_i, ρ^i)'s. For each i, let X_i ⊆ M_i be a compact, geodesically convex set and define X := X_1 × ··· × X_n, the set of feasible parameters. Define Π_{X_i} : M_i → X_i to be the projection operator, i.e. Π_{X_i}(x) is the unique y ∈ X_i minimizing d_i(y, x). Denote by P^i, exp^i and log^i the parallel transport, exponential and logarithmic maps in (M_i, ρ^i), respectively. For f : M → R and x ∈ M with g = grad f(x), denote by x^i ∈ M_i and g^i ∈ T_{x^i}M_i the corresponding components of x and g. In the sequel, let (f_t) be a family of differentiable, geodesically convex functions from M to R. Assume that each X_i ⊆ M_i has a diameter bounded by D_∞ and that for all 1 ≤ i ≤ n, t ∈ [T] and x ∈ X, ‖(grad f_t(x))^i‖_{x^i} ≤ G_∞. Our convergence guarantees will bound the regret, defined at the end of T rounds as R_T = Σ_{t=1}^T f_t(x_t) − min_{x∈X} Σ_{j=1}^T f_j(x), the goal being to obtain R_T = o(T). Finally, φ^i_{x^i→y^i} denotes any isometry from T_{x^i}M_i to T_{y^i}M_i, for x^i, y^i ∈ M_i.

Following the discussion in Section 3.2 and especially Eq. (8), we present Riemannian AMSGRAD (RAMSGRAD) in Figure 1a. For comparison, we show next to it the standard AMSGRAD algorithm in Figure 1b.

(a) RAMSGRAD in M_1 × ··· × M_n:
    Require: x_1 ∈ X, {α_t}_{t=1}^T, {β_{1t}}_{t=1}^T, β_2
    Set m_0 = 0, τ_0 = 0, v_0 = 0 and v̂_0 = 0
    for t = 1 to T do (for all 1 ≤ i ≤ n)
        g_t = grad f_t(x_t)
        m^i_t = β_{1t} τ^i_{t−1} + (1 − β_{1t}) g^i_t
        v^i_t = β_2 v^i_{t−1} + (1 − β_2) ‖g^i_t‖²_{x^i_t}
        v̂^i_t = max{v̂^i_{t−1}, v^i_t}
        x^i_{t+1} = Π_{X_i}(exp^i_{x^i_t}(−α_t m^i_t / √(v̂^i_t)))
        τ^i_t = φ^i_{x^i_t → x^i_{t+1}}(m^i_t)
    end for

(b) AMSGRAD in R^n:
    Require: x_1 ∈ X, {α_t}_{t=1}^T, {β_{1t}}_{t=1}^T, β_2
    Set m_0 = 0, v_0 = 0 and v̂_0 = 0
    for t = 1 to T do (for all 1 ≤ i ≤ n)
        g_t = grad f_t(x_t)
        m^i_t = β_{1t} m^i_{t−1} + (1 − β_{1t}) g^i_t
        v^i_t = β_2 v^i_{t−1} + (1 − β_2)(g^i_t)²
        v̂^i_t = max{v̂^i_{t−1}, v^i_t}
        x^i_{t+1} = Π_{X_i}(x^i_t − α_t m^i_t / √(v̂^i_t))
    end for

Figure 1: Comparison of the Riemannian and Euclidean versions of AMSGRAD.
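To make Figure 1a concrete, here is a short Python sketch (ours, a simplified illustration rather than the authors' implementation) of the RAMSGRAD loop. Each factor manifold is assumed to provide `exp`, a Riemannian `norm`, an isometry `transp` playing the role of φ^i, and a projection `proj` onto X_i; with the trivial Euclidean factor below, the loop reduces to AMSGRAD (Figure 1b).

```python
import numpy as np

class Euclidean:
    """Trivial factor manifold, so the sketch is runnable; swap in e.g. a Poincare ball."""
    def exp(self, x, v):        return x + v
    def norm(self, x, v):       return np.linalg.norm(v)
    def transp(self, x, y, v):  return v       # phi^i: isometry from T_x M_i to T_y M_i
    def proj(self, x):          return x       # projection onto the feasible set X_i

def ramsgrad(manifolds, x, grad_fn, T, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    n = len(manifolds)
    m = [np.zeros_like(xi) for xi in x]   # momentum, transported at each step (tau in Fig. 1a)
    v = np.zeros(n)                       # adaptivity term: one scalar per factor manifold
    v_hat = np.zeros(n)
    for t in range(1, T + 1):
        g = grad_fn(x)                    # list of Riemannian gradients g^i_t
        lr = alpha / np.sqrt(t)           # alpha_t = alpha / sqrt(t), as in Theorem 1
        new_x = []
        for i, M in enumerate(manifolds):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * M.norm(x[i], g[i]) ** 2
            v_hat[i] = max(v_hat[i], v[i])
            step = -lr * m[i] / (np.sqrt(v_hat[i]) + eps)
            xi_next = M.proj(M.exp(x[i], step))
            m[i] = M.transp(x[i], xi_next, m[i])  # tau^i_t = phi^i(m^i_t)
            new_x.append(xi_next)
        x = new_x
    return x
```

Replacing the running max `v_hat` by `v` itself yields RADAM, in the same way that ADAM is obtained from AMSGRAD.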
Write h^i_t := −α_t m^i_t / √(v̂^i_t). As a natural choice for φ^i, one could first parallel-transport⁶ m^i_t from x^i_t to x̄^i_{t+1} := exp_{x^i_t}(h^i_t) using P^i(·; h^i_t), and then from x̄^i_{t+1} to x^i_{t+1} along a minimizing geodesic. As can be seen, if (M_i, ρ^i) = R for all i, RAMSGRAD and AMSGRAD coincide: we then have κ_i = 0, d_i(x^i, y^i) = |x^i − y^i|, φ^i = Id, exp^i_{x^i}(v^i) = x^i + v^i, M_1 × ··· × M_n = R^n, and ‖g^i_t‖²_{x^i_t} = (g^i_t)² ∈ R.

⁶ The idea of parallel-transporting m_t from T_{x_t}M to T_{x_{t+1}}M previously appeared in Cho & Lee (2017).

From these algorithms, RADAM and ADAM are obtained simply by removing the max operations, i.e. replacing v̂^i_t = max{v̂^i_{t−1}, v^i_t} with v̂^i_t = v^i_t. The convergence guarantee that we obtain for RAMSGRAD is presented in Theorem 1, where the quantity ζ is defined by Zhang & Sra (2016) as

$$\zeta(\kappa, c) := \frac{c\sqrt{|\kappa|}}{\tanh\big(c\sqrt{|\kappa|}\big)} = 1 + \frac{c^2|\kappa|}{3} + O_{\kappa\to 0}(\kappa^2). \tag{9}$$

For comparison, we also show the convergence guarantee of the original AMSGRAD in Appendix C. Note that when (M_i, ρ^i) = R for all i, the convergence guarantees of RAMSGRAD and AMSGRAD coincide as well. Indeed, the curvature-dependent quantity (ζ(κ_i, D_∞) + 1)/2 in the Riemannian case then becomes equal to 1, recovering the convergence theorem of AMSGRAD. It is also interesting to understand at which speed the regret bound worsens when the curvature is small but non-zero: by a multiplicative factor of approximately 1 + D_∞²|κ|/6 (see Eq. (9)). Similar remarks hold for RADAMNC, whose convergence guarantee is shown in Theorem 2. Finally, notice that β₁ := 0 in Theorem 2 yields a convergence proof for RADAGRAD, whose update rule we defined in Eq. (8).

Theorem 1 (Convergence of RAMSGRAD). Let (x_t) and (v̂_t) be the sequences obtained from Algorithm 1a, α_t = α/√t, β₁ = β₁₁, β₁t ≤ β₁ for all t ∈ [T] and γ = β₁/√β₂ < 1. We then have:

$$R_T \le \frac{\sqrt{T}\, D_\infty^2}{2\alpha(1-\beta_1)} \sum_{i=1}^n \sqrt{\hat v^i_T} + \frac{D_\infty^2}{2(1-\beta_1)} \sum_{i=1}^n \sum_{t=1}^T \frac{\beta_{1t}\sqrt{\hat v^i_t}}{\alpha_t} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}} \sum_{i=1}^n \frac{\zeta(\kappa_i, D_\infty)+1}{2} \sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}}. \tag{10}$$

Proof. See Appendix A.

Theorem 2 (Convergence of RADAMNC). Let (x_t) and (v_t) be the sequences obtained from RADAMNC, α_t = α/√t, β₁ = β₁₁, β₁t = β₁λ^{t−1}, λ < 1, β₂t = 1 − 1/t. We then have:

$$R_T \le \sum_{i=1}^n \left( \frac{D_\infty^2}{2\alpha(1-\beta_1)} + \frac{\alpha\,(\zeta(\kappa_i, D_\infty)+1)}{(1-\beta_1)^3} \right) \sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}} + \frac{\beta_1 D_\infty^2 G_\infty n}{2\alpha(1-\beta_1)(1-\lambda)^2}. \tag{11}$$

Proof. See Appendix B.

The role of convexity. Note how the notion of convexity in Theorem 5 is replaced by the notion of geodesic convexity in Theorem 1. Let us compare the two definitions: the differentiable functions f : R^n → R and g : M → R are respectively convex and geodesically convex if for all x, y ∈ R^n and u, v ∈ M:

$$f(x) - f(y) \le \langle \mathrm{grad}\, f(x),\, x - y\rangle, \qquad g(u) - g(v) \le -\rho_u\big(\mathrm{grad}\, g(u),\, \log_u(v)\big). \tag{12}$$

But how does this come into play in the proofs? Regret bounds for convex objectives are usually obtained by bounding Σ_{t=1}^T f_t(x_t) − f_t(x*) using Eq. (12) for any x* ∈ X, which boils down to bounding each ⟨g_t, x_t − x*⟩. In the Riemannian case, this term becomes −ρ_{x_t}(g_t, log_{x_t}(x*)).

The role of the cosine law. How does one obtain a bound on ⟨g_t, x_t − x*⟩? For simplicity, let us look at the particular case of an SGD update, from Eq. (1). Using a cosine law, this yields

$$\langle g_t, x_t - x^*\rangle = \frac{1}{2\alpha}\Big( \|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2 \Big) + \frac{\alpha}{2}\|g_t\|^2. \tag{13}$$

One now has two terms to bound: (i) when summing over t, the first one simplifies as a telescopic summation; (ii) the second term Σ_{t=1}^T α_t‖g_t‖² will require a well-chosen decreasing schedule for α. In Riemannian manifolds, this step is generalized using the analogous Lemma 6 introduced by Zhang & Sra (2016), valid in all Alexandrov spaces, which include our setting of geodesically convex subsets of Riemannian manifolds with lower-bounded sectional curvature. The curvature-dependent quantity ζ of Eq. (10) appears from this lemma, letting us bound −ρ^i_{x^i_t}(g^i_t, log^i_{x^i_t}(x^i_*)).
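As a quick numerical illustration (ours), the curvature factor of Eq. (9) can be evaluated directly: it equals 1 in the flat case and grows mildly with |κ| and with the diameter, which quantifies the price paid in the regret bounds above.

```python
import numpy as np

def zeta(kappa, c):
    """Curvature factor of Eq. (9); equals 1 when kappa = 0 (Euclidean case)."""
    x = c * np.sqrt(abs(kappa))
    return x / np.tanh(x) if x > 0 else 1.0

for kappa in [0.0, -0.1, -1.0, -4.0]:
    print(kappa, round(zeta(kappa, c=1.0), 4))
# kappa = 0 gives 1.0; more negative curvature inflates the regret bound by roughly
# (zeta + 1) / 2, i.e. about 1 + c**2 * |kappa| / 6 for small |kappa|.
```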
The benefit of adaptivity. Let us also mention that the above bounds significantly improve for sparse (per-manifold) gradients. In practice, this could happen for instance for algorithms embedding each word i (or node of a graph) in a manifold M_i, when just a few words are updated at a time.

On the choice of φ^i. The fact that our convergence theorems (see Lemma 3) do not require specifying φ^i suggests that the regret bounds could be improved by exploiting momentum/acceleration in the proofs for a particular φ^i. Note that this remark also applies to AMSGRAD (Reddi et al., 2018).

5 EXPERIMENTS

We empirically assess the quality of the proposed algorithms RADAM, RAMSGRAD and RADAGRAD compared to the non-adaptive RSGD method (Eq. (2)). For this, we follow Nickel & Kiela (2017) and embed the transitive closure of the WordNet noun hierarchy (Miller et al., 1990) in the n-dimensional Poincaré model D^n of hyperbolic geometry, which is well known to be better suited than the Euclidean space to embed tree-like graphs (Gromov, 1987; De Sa et al., 2018). In this case, each word is embedded in the same space of constant curvature −1, thus M_i = D^n for all i. Note that it would also be interesting to explore the benefit of our optimization tools for the algorithms proposed in (Nickel & Kiela, 2018; De Sa et al., 2018; Ganea et al., 2018a). The choice of the Poincaré model is justified by the access to closed-form expressions for all the quantities used in Alg. 1a:

- Metric tensor: ρ_x = λ_x² I_n, ∀x ∈ D^n, where λ_x = 2/(1 − ‖x‖²) is the conformal factor.
- Riemannian gradients are rescaled Euclidean gradients: grad f(x) = (1/λ_x²) ∇_E f(x).
- Distance function and geodesics (Nickel & Kiela, 2017; Ungar, 2008; Ganea et al., 2018b).
- Exponential and logarithmic maps: exp_x(v) = x ⊕ (tanh(λ_x‖v‖/2) · v/‖v‖), where ⊕ is the generalized Möbius addition (Ungar, 2008; Ganea et al., 2018b).
- Parallel transport along the unique geodesic from x to y: P_{x→y}(v) = (λ_x/λ_y) gyr[y, −x]v. This formula was derived from (Ungar, 2008; Ganea et al., 2018b), gyr being given in closed form in (Ungar, 2008, Eq. (1.27)).

Dataset & Model. The transitive closure of the WordNet taxonomy graph consists of 82,115 nouns and 743,241 hypernymy Is-A relations (directed edges E). These words are embedded in D^n such that the distance between words connected by an edge is minimized, while being maximized otherwise. We minimize the same loss function as Nickel & Kiela (2017), which is similar to a log-likelihood, but approximating the partition function using sampling of negative word pairs (non-edges), fixed to 10 in our case. Note that this loss does not use the direction of the edges in the graph⁷:

$$\mathcal{L} = -\sum_{(u,v)\in E} \log \frac{e^{-d_{\mathbb{D}}(u,v)}}{\sum_{u'\in \mathcal{N}(v)} e^{-d_{\mathbb{D}}(u',v)}}. \tag{14}$$

⁷ In a pair (u, v), u denotes the parent, i.e. u entails v; the model parameters θ are the coordinates of all the u, v.
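The closed forms listed above translate directly into code. The following NumPy sketch (ours, not the authors' implementation) of the Poincaré-ball primitives covers what Alg. 1a needs: Möbius addition, the exponential map, the distance, and the rescaling of Euclidean gradients; the numerical clipping near the ball boundary is our own addition.

```python
import numpy as np

EPS = 1e-10

def mobius_add(x, y):
    """Moebius addition on the Poincare ball of curvature -1."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2 + EPS)

def lambda_x(x):
    """Conformal factor lambda_x = 2 / (1 - ||x||^2)."""
    return 2.0 / (1.0 - np.dot(x, x) + EPS)

def exp_map(x, v):
    """exp_x(v) = x (+) tanh(lambda_x * ||v|| / 2) * v / ||v||."""
    nv = np.linalg.norm(v)
    if nv < EPS:
        return x
    return mobius_add(x, np.tanh(lambda_x(x) * nv / 2.0) * v / nv)

def dist(x, y):
    """Geodesic distance d(x, y) = 2 * artanh(|| (-x) (+) y ||)."""
    return 2.0 * np.arctanh(np.clip(np.linalg.norm(mobius_add(-x, y)), 0.0, 1.0 - EPS))

def riemannian_grad(x, euclid_grad):
    """grad f(x) = (1 / lambda_x^2) * Euclidean gradient of f at x."""
    return euclid_grad / lambda_x(x) ** 2
```

These primitives plug into the RAMSGRAD sketch of Section 4 by using `exp_map` as the factor's `exp` and the Riemannian norm ‖v‖_x = λ_x‖v‖ as its `norm`.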
Metrics. We report both the loss value and the mean average precision (MAP) (Nickel & Kiela, 2017): for each directed edge (u, v), we rank its distance d(u, v) among the full set of ground-truth negative examples {d(u', v) | (u', v) ∉ E}. We use the same two settings as Nickel & Kiela (2017), namely: reconstruction (measuring representation capacity) and link prediction (measuring generalization). For link prediction we sample a validation set of 2% of the edges from the set of transitive-closure edges that contain no leaf node or root. We only focus on 5-dimensional hyperbolic spaces.

Training details. For all methods we use the same burn-in phase described in Nickel & Kiela (2017) for 20 epochs, with a fixed learning rate of 0.03 and using RSGD with retraction as explained in Sec. 2.2. Solely during this phase, we sample negative words based on their graph degree raised to the power 0.75. This strategy improves all metrics. After that, when the different optimization methods start, we sample negatives uniformly. We use n = 5, following Nickel & Kiela (2017).

Optimization methods. Experimentally we obtained slightly better results for RADAM over RAMSGRAD, so we mostly report the former. Moreover, we unexpectedly observed convergence to lower loss values when replacing the true exponential map with its first-order approximation, i.e. the retraction R_x(v) = x + v, in both RSGD and in our adaptive methods from Alg. 1a. One possible explanation is that retraction methods need fewer steps and smaller gradients to escape points sub-optimally collapsed on the border of the ball D^n, compared to fully Riemannian methods. As a consequence, we report retraction-based methods in a separate setting, as they are not directly comparable to their fully Riemannian analogues.

Figure 2: Results for methods doing updates with the exponential map. From left to right we report: training loss, MAP on the train set, MAP on the validation set.

Figure 3: Results for methods doing updates with the retraction. From left to right we report: training loss, MAP on the train set, MAP on the validation set.

Results. We show in Figures 2 and 3 the results for exponential-map-based and retraction-based methods, respectively. We ran all our methods with learning rates from the set {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0}. For the RSGD baseline we show in orange the best learning-rate setting, but we also show the next lower (slower convergence, in blue) and the next higher (faster overfitting, in green) learning rates. For RADAM and RAMSGRAD we only show the best settings. We always use β₁ = 0.9 and β₂ = 0.999 for these methods, as these achieved the lowest training loss. RADAGRAD was consistently worse, so we do not report it. As can be seen, RADAM always achieves the lowest training loss. On the MAP metric, for both the reconstruction and link-prediction settings, the same method also outperforms all the other methods in the fully Riemannian setting (i.e. Fig. 2). Interestingly, in the retraction setting, RADAM reaches the lowest training loss value and is on par with RSGD on the MAP evaluation for both the reconstruction and link-prediction settings. However, RAMSGRAD converges faster in terms of MAP for the link-prediction task, suggesting that this method has a better generalization capability.
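Tying the experimental pieces together, the small sketch below (ours) evaluates the negative-sampling loss of Eq. (14) for a single positive edge, given any distance function such as the Poincaré `dist` above; following Nickel & Kiela (2017), the positive pair is included in the normalization, and sampling 10 negatives per edge matches the setup described above.

```python
import numpy as np

def edge_loss(u, v, negatives, dist):
    """Per-edge term of Eq. (14) for one positive edge (u, v).

    `dist` is any distance function (e.g. the Poincare `dist` sketched earlier);
    `negatives` is a list of sampled non-neighbours u' of v (10 in the paper's setup).
    """
    scores = np.array([-dist(u, v)] + [-dist(un, v) for un in negatives])
    # negative log-softmax of the positive pair against the sampled negatives
    return -(scores[0] - np.log(np.sum(np.exp(scores))))
```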
6 RELATED WORK

After Riemannian SGD was introduced by Bonnabel (2013), a plethora of other first-order Riemannian methods arose, such as Riemannian SVRG (Zhang et al., 2016), Riemannian Stein variational gradient descent (Liu & Zhu, 2017), Riemannian accelerated gradient descent (Liu et al., 2017; Zhang & Sra, 2018) or averaged RSGD (Tripuraneni et al., 2018), along with new methods for their convergence analysis in the geodesically convex case (Zhang & Sra, 2016). Stochastic gradient Langevin dynamics was generalized as well, to improve optimization on the probability simplex (Patterson & Teh, 2013).

Let us also mention that Roy et al. (2018) proposed Riemannian counterparts of SGD with momentum and RMSprop, suggesting to transport the momentum term using parallel translation, which is an idea that we preserved. However, (i) no convergence guarantee is provided and (ii) their algorithm performs the coordinate-wise adaptive operations (squaring and division) w.r.t. a coordinate system in the tangent space, which, as we discussed in Section 3.1, compromises the possibility of obtaining convergence guarantees. Finally, another version of Riemannian ADAM for the Grassmann manifold G(1, n) was previously introduced by Cho & Lee (2017), also transporting the momentum term using parallel translation. However, their algorithm completely removes the adaptive component, since the adaptivity term v_t becomes a scalar. No adaptivity across manifolds is discussed, which is the main point of our discussion. Moreover, no convergence analysis is provided either.

7 CONCLUSION

Driven by recent work in learning non-Euclidean embeddings for symbolic data, we propose to generalize popular adaptive optimization tools (e.g. ADAM, AMSGRAD, ADAGRAD) to Cartesian products of Riemannian manifolds in a principled and intrinsic manner. We derive convergence rates that are similar to those of the corresponding Euclidean algorithms. Experimentally we show that our methods outperform popular non-adaptive methods such as RSGD on the realistic task of hyperbolic word taxonomy embedding.

ACKNOWLEDGMENTS

Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems. Octavian Ganea is funded by the Swiss National Science Foundation (SNSF) under grant agreement number 167176.

REFERENCES

Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.

Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217-2229, 2013.

Anoop Cherian and Suvrit Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Transactions on Neural Networks and Learning Systems, 28(12):2859-2871, 2017.

Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems, pp. 5225-5235, 2017.

Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation tradeoffs for hyperbolic embeddings. 2018. URL https://www.cs.cornell.edu/~cdesa/papers/arxiv2018_hyperbolic.pdf.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning, 2018a.
Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, 2018b.

Mikhael Gromov. Hyperbolic groups. In Essays in Group Theory, pp. 75-263. Springer, 1987.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. arXiv preprint arXiv:1711.11216, 2017.

Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems 30, pp. 4868-4877, 2017.

George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235-244, 1990.

John Milnor. Morse Theory. (AM-51), volume 51. Princeton University Press, 1963.

Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In International Conference on Machine Learning, 2018.

Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6341-6350, 2017.

Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pp. 3102-3110, 2013.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pp. 1532-1543, 2014.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In ICLR, 2018.

Joel W Robbin and Dietmar A Salamon. Introduction to differential geometry. ETH, Lecture Notes, preliminary version, January 2011.

S Kumar Roy, Zakaria Mhammedi, and Mehrtash Harandi. Geometry aware constrained optimization techniques for deep learning. CVPR, 2018.

Michael Spivak. A Comprehensive Introduction to Differential Geometry, Volume Four. 1979.

Suvrit Sra and Reshad Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713-739, 2015.

Mingkui Tan, Ivor W Tsang, Li Wang, Bart Vandereycken, and Sinno Jialin Pan. Riemannian pursuit for big matrix recovery. In International Conference on Machine Learning, pp. 1539-1547, 2014.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31, 2012.

Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, and Michael I Jordan. Averaging stochastic gradient descent on Riemannian manifolds. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, 2018.

Abraham Albert Ungar. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1-194, 2008.

Bart Vandereycken and Stefan Vandewalle. A Riemannian optimization approach for computing low-rank solutions of Lyapunov equations. SIAM Journal on Matrix Analysis and Applications, 31(5):2553-2579, 2010.

Tran Dang Quang Vinh, Yi Tay, Shuai Zhang, Gao Cong, and Xiao-Li Li. Hyperbolic recommender systems. arXiv preprint arXiv:1809.01703, 2018.

Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pp. 1617-1638, 2016.

Hongyi Zhang and Suvrit Sra. Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.

Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4592-4600, 2016.

A PROOF OF THEOREM 1

Proof. Denote by x̄^i_{t+1} := exp^i_{x^i_t}(−α_t m^i_t/√(v̂^i_t)) and consider the geodesic triangle defined by x̄^i_{t+1}, x^i_t and x^i_*. Now let a = d_i(x̄^i_{t+1}, x^i_*), b = d_i(x̄^i_{t+1}, x^i_t), c = d_i(x^i_t, x^i_*) and A = ∠x̄^i_{t+1} x^i_t x^i_*. Combining the following formula⁸

$$d_i(x^i_t, \bar x^i_{t+1})\, d_i(x^i_t, x^i_*)\cos\big(\angle \bar x^i_{t+1} x^i_t x^i_*\big) = \Big\langle -\alpha_t m^i_t\big/\sqrt{\hat v^i_t},\; \log^i_{x^i_t}(x^i_*) \Big\rangle_{x^i_t} \tag{15}$$

with the following inequality (given by Lemma 6):

$$a^2 \le \zeta(\kappa, c)\, b^2 + c^2 - 2bc\cos(A), \qquad \text{with } \zeta(\kappa, c) := \frac{\sqrt{|\kappa|}\,c}{\tanh\big(\sqrt{|\kappa|}\,c\big)}, \tag{16}$$

yields

$$\big\langle m^i_t,\, -\log^i_{x^i_t}(x^i_*) \big\rangle_{x^i_t} \le \frac{\sqrt{\hat v^i_t}}{2\alpha_t}\Big( d_i(x^i_t, x^i_*)^2 - d_i(\bar x^i_{t+1}, x^i_*)^2 \Big) + \zeta\big(\kappa_i, d_i(x^i_t, x^i_*)\big)\,\frac{\alpha_t}{2\sqrt{\hat v^i_t}}\, \|m^i_t\|^2_{x^i_t}, \tag{17}$$

where we use the notation ⟨·,·⟩_{x^i} for ρ^i_{x^i}(·,·) when it is clear which metric is used. By definition of Π_{X_i}, we can safely replace x̄^i_{t+1} by x^i_{t+1} in the above inequality. Plugging m^i_t = β_{1t}φ^i_{x^i_{t−1}→x^i_t}(m^i_{t−1}) + (1 − β_{1t})g^i_t into Eq. (17) gives us

$$\big\langle g^i_t,\, -\log_{x^i_t}(x^i_*) \big\rangle_{x^i_t} \le \frac{\sqrt{\hat v^i_t}}{2\alpha_t(1-\beta_{1t})}\Big( d_i(x^i_t, x^i_*)^2 - d_i(x^i_{t+1}, x^i_*)^2 \Big) + \zeta\big(\kappa_i, d_i(x^i_t, x^i_*)\big)\,\frac{\alpha_t}{2(1-\beta_{1t})\sqrt{\hat v^i_t}}\, \|m^i_t\|^2_{x^i_t} + \frac{\beta_{1t}}{1-\beta_{1t}}\big\langle \varphi^i_{x^i_{t-1}\to x^i_t}(m^i_{t-1}),\, \log_{x^i_t}(x^i_*)\big\rangle_{x^i_t}. \tag{18}$$

Now applying the Cauchy-Schwarz and Young inequalities to the last term yields

$$\big\langle g^i_t,\, -\log_{x^i_t}(x^i_*) \big\rangle_{x^i_t} \le \frac{\sqrt{\hat v^i_t}}{2\alpha_t(1-\beta_{1t})}\Big( d_i(x^i_t, x^i_*)^2 - d_i(x^i_{t+1}, x^i_*)^2 \Big) + \zeta\big(\kappa_i, d_i(x^i_t, x^i_*)\big)\,\frac{\alpha_t}{2(1-\beta_{1t})\sqrt{\hat v^i_t}}\, \|m^i_t\|^2_{x^i_t} + \frac{\beta_{1t}}{2(1-\beta_{1t})}\frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_{t-1}\|^2_{x^i_{t-1}} + \frac{\beta_{1t}}{2(1-\beta_{1t})}\frac{\sqrt{\hat v^i_t}}{\alpha_t}\big\|\log_{x^i_t}(x^i_*)\big\|^2_{x^i_t}. \tag{19}$$

From the geodesic convexity of f_t for 1 ≤ t ≤ T, we have

$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \sum_{t=1}^T \big\langle g_t,\, -\log_{x_t}(x^*)\big\rangle_{x_t} = \sum_{t=1}^T \sum_{i=1}^n \big\langle g^i_t,\, -\log^i_{x^i_t}(x^i_*)\big\rangle_{x^i_t}. \tag{20}$$

⁸ Note that since each X_i is geodesically convex, logarithms are well-defined.

Let us look at the first term. Using β_{1t} ≤ β₁ and a change of indices, we have

$$\sum_{t=1}^T \sum_{i=1}^n \frac{\sqrt{\hat v^i_t}}{2\alpha_t(1-\beta_{1t})}\Big( d_i(x^i_t, x^i_*)^2 - d_i(x^i_{t+1}, x^i_*)^2 \Big) \tag{21}$$
$$\le \frac{1}{2(1-\beta_1)} \sum_{i=1}^n \left[ \sum_{t=2}^T d_i(x^i_t, x^i_*)^2 \left( \frac{\sqrt{\hat v^i_t}}{\alpha_t} - \frac{\sqrt{\hat v^i_{t-1}}}{\alpha_{t-1}} \right) + \frac{\sqrt{\hat v^i_1}}{\alpha_1}\, d_i(x^i_1, x^i_*)^2 \right] \tag{22}$$
$$\le \frac{D_\infty^2}{2(1-\beta_1)} \sum_{i=1}^n \left[ \sum_{t=2}^T \left( \frac{\sqrt{\hat v^i_t}}{\alpha_t} - \frac{\sqrt{\hat v^i_{t-1}}}{\alpha_{t-1}} \right) + \frac{\sqrt{\hat v^i_1}}{\alpha_1} \right] \tag{23}$$
$$= \frac{D_\infty^2}{2\alpha_T(1-\beta_1)} \sum_{i=1}^n \sqrt{\hat v^i_T}, \tag{24}$$

where the last equality comes from a standard telescopic summation. We now need the following lemma.

Lemma 3. With the notations and assumptions of Theorem 1, for any 1 ≤ i ≤ n,

$$\sum_{t=1}^T \frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} \le \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)(1-\gamma)\sqrt{1-\beta_2}} \sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}}. \tag{25}$$

Proof. Let us start by separating the last term, and removing the hat on v:

$$\sum_{t=1}^T \frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} \le \sum_{t=1}^{T-1} \frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} + \frac{\alpha_T}{\sqrt{\hat v^i_T}}\|m^i_T\|^2_{x^i_T} \tag{26}$$
$$\le \sum_{t=1}^{T-1} \frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} + \frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T}. \tag{27}$$

Let us now have a closer look at the last term. We can reformulate m^i_T as

$$m^i_T = \sum_{j=1}^T (1-\beta_{1j}) \left( \prod_{k=1}^{T-j} \beta_{1,(T-k+1)} \right) \varphi^i_{x^i_{T-1}\to x^i_T} \circ \cdots \circ \varphi^i_{x^i_j \to x^i_{j+1}}(g^i_j). \tag{28}$$

Applying Lemma 7, we get

$$\|m^i_T\|^2_{x^i_T} \le \left( \sum_{j=1}^T (1-\beta_{1j}) \prod_{k=1}^{T-j} \beta_{1,(T-k+1)} \right) \left( \sum_{j=1}^T (1-\beta_{1j}) \prod_{k=1}^{T-j} \beta_{1,(T-k+1)} \Big\| \varphi^i_{x^i_{T-1}\to x^i_T} \circ \cdots \circ \varphi^i_{x^i_j \to x^i_{j+1}}(g^i_j) \Big\|^2_{x^i_T} \right). \tag{29}$$

Since φ^i is an isometry, we always have ‖φ^i_{x→y}(u)‖_y = ‖u‖_x, i.e.

$$\Big\| \varphi^i_{x^i_{T-1}\to x^i_T} \circ \cdots \circ \varphi^i_{x^i_j \to x^i_{j+1}}(g^i_j) \Big\|^2_{x^i_T} = \|g^i_j\|^2_{x^i_j}. \tag{30}$$

Using that β_{1k} ≤ β₁ for all k ∈ [T],

$$\|m^i_T\|^2_{x^i_T} \le \left( \sum_{j=1}^T (1-\beta_{1j})\beta_1^{T-j} \right) \left( \sum_{j=1}^T (1-\beta_{1j})\beta_1^{T-j} \|g^i_j\|^2_{x^i_j} \right). \tag{31}$$

Finally, (1 − β_{1j}) ≤ 1 and Σ_{j=1}^T β₁^{T−j} ≤ 1/(1−β₁) yield

$$\|m^i_T\|^2_{x^i_T} \le \frac{1}{1-\beta_1} \sum_{j=1}^T \beta_1^{T-j}\|g^i_j\|^2_{x^i_j}. \tag{32}$$

Let us now look at v^i_T. It is given by

$$v^i_T = (1-\beta_2) \sum_{j=1}^T \beta_2^{T-j} \|g^i_j\|^2_{x^i_j}. \tag{33}$$

Combining Eq. (32) and Eq. (33) allows us to bound the last term of Eq. (26):
$$\frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T} \le \frac{\alpha_T}{1-\beta_1}\, \frac{\sum_{j=1}^T \beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{(1-\beta_2)\sum_{j=1}^T \beta_2^{T-j}\|g^i_j\|^2_{x^i_j}}} \tag{34}$$
$$\le \frac{\alpha_T}{1-\beta_1} \sum_{j=1}^T \frac{\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{(1-\beta_2)\,\beta_2^{T-j}\|g^i_j\|^2_{x^i_j}}} \tag{35}$$
$$= \frac{\alpha_T}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{j=1}^T \gamma^{T-j}\|g^i_j\|_{x^i_j}. \tag{36}$$

With this inequality, we can now bound every term of Eq. (26):

$$\sum_{t=1}^T \frac{\alpha_t}{\sqrt{\hat v^i_t}}\|m^i_t\|^2_{x^i_t} \le \sum_{t=1}^T \frac{\alpha_t}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{j=1}^t \gamma^{t-j}\|g^i_j\|_{x^i_j} \tag{37}$$
$$= \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{t=1}^T \frac{1}{\sqrt{t}} \sum_{j=1}^t \gamma^{t-j}\|g^i_j\|_{x^i_j} \tag{38}$$
$$= \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{t=1}^T \|g^i_t\|_{x^i_t} \sum_{j=t}^T \gamma^{j-t}\big/\sqrt{j} \tag{39}$$
$$\le \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{t=1}^T \frac{\|g^i_t\|_{x^i_t}}{\sqrt{t}} \sum_{j=t}^T \gamma^{j-t} \tag{40}$$
$$\le \frac{\alpha}{(1-\beta_1)\sqrt{1-\beta_2}} \sum_{t=1}^T \frac{\|g^i_t\|_{x^i_t}}{\sqrt{t}} \cdot \frac{1}{1-\gamma} \tag{41}$$
$$\le \frac{\alpha}{(1-\beta_1)(1-\gamma)\sqrt{1-\beta_2}} \sqrt{\sum_{t=1}^T \frac{1}{t}} \sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}} \tag{42}$$
$$\le \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)(1-\gamma)\sqrt{1-\beta_2}} \sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}}. \tag{43}$$

Putting together Eqs. (19), (20), (24) and Lemma 3 lets us bound the regret:

$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \sum_{t=1}^T \sum_{i=1}^n \big\langle g^i_t,\, -\log_{x^i_t}(x^i_*)\big\rangle_{x^i_t} \tag{44}$$
$$\le \frac{\sqrt{T}\,D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^n \sqrt{\hat v^i_T} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^n \sum_{t=1}^T \frac{\beta_{1t}\sqrt{\hat v^i_t}}{\alpha_t} \tag{45}$$
$$\quad + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}} \sum_{i=1}^n \frac{\zeta(\kappa_i, D_\infty)+1}{2} \sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}}, \tag{46}$$

where we used the facts that d ↦ ζ(κ, d) is an increasing function, and that α_t/√(v̂^i_t) ≤ α_{t−1}/√(v̂^i_{t−1}), which enables us to bound both the second and third terms of the right-hand side of Eq. (19) using Lemma 3.

Remark. Let us notice that, similarly to AMSGRAD, RAMSGRAD also has a regret bounded by O(G_∞√T). This is easy to see from the proof of Lemma 4. Hence the actual upper bound on the regret is the minimum between the one in O(G_∞√T) and the one of Theorem 1.

B PROOF OF THEOREM 2

Proof. Similarly as in the proof of Theorem 1 (and with the same notations), we obtain the inequality

$$\big\langle g^i_t,\, -\log_{x^i_t}(x^i_*)\big\rangle_{x^i_t} \le \frac{\sqrt{v^i_t}}{2\alpha_t(1-\beta_{1t})}\Big( d_i(x^i_t, x^i_*)^2 - d_i(x^i_{t+1}, x^i_*)^2 \Big) + \zeta\big(\kappa_i, d_i(x^i_t, x^i_*)\big)\frac{\alpha_t}{2(1-\beta_{1t})\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} + \frac{\beta_{1t}}{2(1-\beta_{1t})}\frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_{t-1}\|^2_{x^i_{t-1}} + \frac{\beta_{1t}}{2(1-\beta_{1t})}\frac{\sqrt{v^i_t}}{\alpha_t}\big\|\log_{x^i_t}(x^i_*)\big\|^2_{x^i_t}. \tag{47}$$

From the geodesic convexity of f_t for 1 ≤ t ≤ T, we have

$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \sum_{t=1}^T \big\langle g_t,\, -\log_{x_t}(x^*)\big\rangle_{x_t} = \sum_{t=1}^T\sum_{i=1}^n \big\langle g^i_t,\, -\log^i_{x^i_t}(x^i_*)\big\rangle_{x^i_t}. \tag{48}$$

With the same techniques as before, we obtain the same bound on the first term:

$$\sum_{t=1}^T\sum_{i=1}^n \frac{\sqrt{v^i_t}}{2\alpha_t(1-\beta_{1t})}\Big( d_i(x^i_t, x^i_*)^2 - d_i(x^i_{t+1}, x^i_*)^2 \Big) \le \frac{D_\infty^2}{2\alpha_T(1-\beta_1)}\sum_{i=1}^n \sqrt{v^i_T}. \tag{49}$$

However, for the other terms, we now need a new lemma.

Lemma 4. With the notations and assumptions of Theorem 2, for any 1 ≤ i ≤ n,

$$\sum_{t=1}^T \frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} \le \frac{2\alpha}{(1-\beta_1)^2}\sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}}. \tag{50}$$

Proof. Let us start by separating the last term:

$$\sum_{t=1}^T \frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} = \sum_{t=1}^{T-1} \frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} + \frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T}. \tag{51}$$

Similarly as before, we have

$$\|m^i_T\|^2_{x^i_T} \le \frac{1}{1-\beta_1}\sum_{j=1}^T \beta_1^{T-j}\|g^i_j\|^2_{x^i_j}. \tag{52}$$

Let us now look at v^i_T. Since β_{2t} = 1 − 1/t, it is simply given by

$$v^i_T = \frac{1}{T}\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}. \tag{53}$$

Combining these yields

$$\frac{\alpha_T}{\sqrt{v^i_T}}\|m^i_T\|^2_{x^i_T} \le \frac{\alpha}{1-\beta_1}\,\frac{\sum_{j=1}^T \beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}}} \tag{54}$$
$$\le \frac{\alpha}{1-\beta_1}\sum_{j=1}^T \frac{\beta_1^{T-j}\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{k=1}^j \|g^i_k\|^2_{x^i_k}}}. \tag{55}$$

Using this inequality at all time-steps gives

$$\sum_{t=1}^T \frac{\alpha_t}{\sqrt{v^i_t}}\|m^i_t\|^2_{x^i_t} \le \frac{\alpha}{1-\beta_1}\sum_{j=1}^T \left( \sum_{l=0}^{T-j}\beta_1^l \right) \frac{\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{k=1}^j \|g^i_k\|^2_{x^i_k}}} \tag{56}$$
$$\le \frac{\alpha}{(1-\beta_1)^2}\sum_{j=1}^T \frac{\|g^i_j\|^2_{x^i_j}}{\sqrt{\sum_{k=1}^j \|g^i_k\|^2_{x^i_k}}} \le \frac{2\alpha}{(1-\beta_1)^2}\sqrt{\sum_{j=1}^T \|g^i_j\|^2_{x^i_j}}, \tag{57}$$

where the last inequality comes from Lemma 8.

Putting everything together, we finally obtain

$$\sum_{t=1}^T f_t(x_t) - f_t(x^*) \le \sum_{t=1}^T\sum_{i=1}^n \big\langle g^i_t,\, -\log_{x^i_t}(x^i_*)\big\rangle_{x^i_t} \tag{58}$$
$$\le \frac{\sqrt{T}\,D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^n \sqrt{v^i_T} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^n\sum_{t=1}^T \frac{\beta_{1t}\sqrt{v^i_t}}{\alpha_t} \tag{59}$$
$$\quad + \frac{\alpha}{(1-\beta_1)^3}\sum_{i=1}^n \big(\zeta(\kappa_i, D_\infty)+1\big)\sqrt{\sum_{t=1}^T \|g^i_t\|^2_{x^i_t}}, \tag{60}$$

where we used that for this choice of α_t and β_{2t}, we have α_t/√(v^i_t) ≤ α_{t−1}/√(v^i_{t−1}). Finally,

$$\frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^n\sum_{t=1}^T \frac{\beta_{1t}\sqrt{v^i_t}}{\alpha_t} \le \frac{D_\infty^2 G_\infty n}{2\alpha(1-\beta_1)}\sum_{t=1}^T t\,\beta_{1t} \le \frac{\beta_1 D_\infty^2 G_\infty n}{2\alpha(1-\beta_1)(1-\lambda)^2}. \tag{61}$$

This, combined with Eq. (53), yields the final result.

Remark. Notice the appearance of a factor n/α in the last term of the last equation. This term is missing in Corollaries 1 and 2 of Reddi et al. (2018), which is a mistake. However, this dependence on n is not too harmful here, since this term does not depend on T.

C CONVERGENCE GUARANTEE OF AMSGRAD

Theorem 5 (Convergence of AMSGRAD). Let (f_t) be a family of differentiable, convex functions from R^n to R. Let (x_t) and (v_t) be the sequences obtained from Algorithm 1b, α_t = α/√t, β₁ = β₁₁, β₁t ≤ β₁ for all t ∈ [T] and γ = β₁/√β₂ < 1.
Assume that each X_i ⊆ R has a diameter bounded by D_∞ and that for all 1 ≤ i ≤ n, t ∈ [T] and x ∈ X, |(grad f_t(x))^i| ≤ G_∞. For (x_t) generated using AMSGRAD (Algorithm 1b), we have the following bound on the regret:

$$R_T \le \frac{\sqrt{T}\,D_\infty^2}{2\alpha(1-\beta_1)}\sum_{i=1}^n \sqrt{\hat v^i_T} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^n\sum_{t=1}^T \frac{\beta_{1t}\sqrt{\hat v^i_t}}{\alpha_t} + \frac{\alpha\sqrt{1+\log T}}{(1-\beta_1)^2(1-\gamma)\sqrt{1-\beta_2}}\sum_{i=1}^n \sqrt{\sum_{t=1}^T (g^i_t)^2}. \tag{62}$$

Proof. See Theorem 4 of Reddi et al. (2018).

D USEFUL LEMMAS

The following lemma is a user-friendly inequality developed by Zhang & Sra (2016) in order to prove the convergence of gradient-based optimization algorithms, for geodesically convex functions, in Alexandrov spaces.

Lemma 6 (Cosine inequality in Alexandrov spaces). If a, b, c are the sides (i.e., side lengths) of a geodesic triangle in an Alexandrov space with curvature lower bounded by κ, and A is the angle between sides b and c, then

$$a^2 \le \frac{\sqrt{|\kappa|}\,c}{\tanh\big(\sqrt{|\kappa|}\,c\big)}\, b^2 + c^2 - 2bc\cos(A). \tag{63}$$

Proof. See Section 3.1, Lemma 6 of Zhang & Sra (2016).

Lemma 7 (An analogue of Cauchy-Schwarz). For all p, k ∈ N*, u_1, ..., u_k ∈ R^p and a_1, ..., a_k ∈ R₊, we have

$$\Big\| \sum_i a_i u_i \Big\|_2^2 \le \Big( \sum_i a_i \Big) \sum_i a_i \|u_i\|_2^2. \tag{64}$$

Proof. The proof consists in applying the Cauchy-Schwarz inequality twice:

$$\Big\| \sum_i a_i u_i \Big\|_2^2 = \sum_{i,j} a_i a_j\, u_i^T u_j \tag{65}$$
$$= \sum_{i,j} \sqrt{a_i a_j}\, (\sqrt{a_i}\,u_i)^T(\sqrt{a_j}\,u_j) \tag{66}$$
$$\le \sum_{i,j} \sqrt{a_i a_j}\, \|\sqrt{a_i}\,u_i\|_2\, \|\sqrt{a_j}\,u_j\|_2 \tag{67}$$
$$= \Big( \sum_i a_i \|u_i\|_2 \Big)^2 \tag{68}$$
$$\le \Big( \sum_i a_i \Big) \sum_i a_i \|u_i\|_2^2, \tag{69}$$

where the last step applies Cauchy-Schwarz to Σᵢ √aᵢ · (√aᵢ‖uᵢ‖₂).

Finally, this last lemma is used by Reddi et al. (2018) in their convergence proof of ADAMNC. We need it too, in an analogous lemma.

Lemma 8 (Auer et al. (2002)). For any non-negative real numbers y_1, ..., y_t, the following holds:

$$\sum_{i=1}^t \frac{y_i}{\sqrt{\sum_{j=1}^i y_j}} \le 2\sqrt{\sum_{i=1}^t y_i}. \tag{70}$$
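As a quick sanity check (ours, not part of the paper), the inequalities of Lemmas 7 and 8 are easy to verify numerically on random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Lemma 7: ||sum_i a_i u_i||^2 <= (sum_i a_i) * sum_i a_i ||u_i||^2
a = rng.random(10)
u = rng.normal(size=(10, 5))
lhs = np.linalg.norm((a[:, None] * u).sum(axis=0)) ** 2
rhs = a.sum() * (a * np.linalg.norm(u, axis=1) ** 2).sum()
print(lhs <= rhs)  # True

# Lemma 8: sum_i y_i / sqrt(sum_{j<=i} y_j) <= 2 * sqrt(sum_i y_i)
y = rng.random(1000)
csum = np.cumsum(y)
print((y / np.sqrt(csum)).sum() <= 2 * np.sqrt(csum[-1]))  # True
```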