# A scaled Bregman theorem with applications

Richard Nock, Aditya Krishna Menon, Cheng Soon Ong
Data61, the Australian National University and the University of Sydney
{richard.nock, aditya.menon, chengsoon.ong}@data61.csiro.au

Bregman divergences play a central role in the design and analysis of a range of machine learning algorithms through a handful of popular theorems. We present a new theorem which shows that Bregman distortions (employing a potentially non-convex generator) may be exactly re-written as a scaled Bregman divergence computed over transformed data. This property can be viewed from the standpoints of geometry (a scaled isometry with adaptive metrics) or convex optimization (relating generalized perspective transforms). Admissible distortions include geodesic distances on curved manifolds and projections or gauge-normalisation. Our theorem allows one to leverage the wealth and convenience of Bregman divergences when analysing algorithms relying on the aforementioned Bregman distortions. We illustrate this with three novel applications of our theorem: a reduction from multi-class density ratio to class-probability estimation, a new adaptive projection-free yet norm-enforcing dual norm mirror descent algorithm, and a reduction from clustering on curved manifolds to clustering on flat manifolds. Experiments on each of these domains validate the analyses and suggest that the scaled Bregman theorem might be a worthy addition to the popular handful of Bregman divergence properties that have been pervasive in machine learning.

## 1 Introduction: Bregman divergences as a reduction tool

Bregman divergences play a central role in the design and analysis of a range of machine learning (ML) algorithms. In recent years, Bregman divergences have arisen in procedures for convex optimisation [4], online learning [9, Chapter 11], clustering [3], matrix approximation [13], class-probability estimation [7, 26, 29, 28], density ratio estimation [35], boosting [10], variational inference [18], and computational geometry [5]. Despite these being very different applications, many of these algorithms and their analyses basically rely on three beautiful analytic properties of Bregman divergences, properties that we summarize for differentiable scalar convex functions $\phi$ with derivative $\phi'$, conjugate $\phi^\star$, and divergence $D_\phi$:

- the triangle equality: $D_\phi(x \,\|\, y) + D_\phi(y \,\|\, z) - D_\phi(x \,\|\, z) = (\phi'(z) - \phi'(y))(x - y)$;
- the dual symmetry property: $D_\phi(x \,\|\, y) = D_{\phi^\star}(\phi'(y) \,\|\, \phi'(x))$;
- the right-centroid (population minimizer) is the average: $\arg\min_\mu \mathbb{E}[D_\phi(X \,\|\, \mu)] = \mathbb{E}[X]$.

Casting a problem as a Bregman minimisation allows one to employ these properties to simplify analysis; for example, by interpreting mirror descent as applying a particular Bregman regulariser, Beck and Teboulle [4] relied on the triangle equality above to simplify their proof of convergence. Another intriguing possibility is that one may derive reductions amongst learning problems by connecting their underlying Bregman minimisations. Menon and Ong [24] recently established how (binary) density ratio estimation (DRE) can be exactly reduced to class-probability estimation (CPE). This was facilitated by interpreting CPE as a Bregman minimisation [7, Section 19], and a new property of Bregman divergences: Menon and Ong [24, Lemma 2] showed that for any twice
differentiable scalar convex $\phi$, for $g(x) = 1 + x$ and $\check\phi(x) := g(x)\,\phi(x/g(x))$,

$$g(x)\, D_\phi\big( x/g(x) \,\big\|\, y/g(y) \big) = D_{\check\phi}(x \,\|\, y)\,, \quad \forall x, y. \qquad (1)$$

Since the binary class-probability function $\eta(x) = \Pr(Y = 1 \mid X = x)$ is related to the class-conditional density ratio $r(x) = \Pr(X = x \mid Y = 1)/\Pr(X = x \mid Y = -1)$ via Bayes rule as $\eta(x) = r(x)/g(r(x))$ ([24] assume $\Pr(Y = 1) = 1/2$), any $\hat\eta$ with small $D_\phi(\eta \,\|\, \hat\eta)$ implicitly produces an $\hat r$ with low $D_{\check\phi}(r \,\|\, \hat r)$, i.e. a good estimate of the density ratio. The Bregman property of eq. (1) thus establishes a reduction from DRE to CPE. Two questions arise from this analysis: can we generalise eq. (1) to other $g(\cdot)$, and if so, can we similarly relate other problems to each other?

This paper presents a new Bregman identity (Theorem 1), the scaled Bregman theorem, a significant generalisation of Menon and Ong [24, Lemma 2]. It shows that general distortions $D_{\check\phi}$ which are not necessarily convex, positive, bounded or symmetric may be re-expressed as a Bregman divergence $D_\phi$ computed over transformed data, and thus inherit their good properties despite appearing prima facie to be a very different object. This transformation can be as simple as a projection or normalisation by a gauge, or more involved like the exponential map on lifted coordinates for a curved manifold. Our theorem can be summarized in two ways. The first is geometric, as it specializes to a scaled isometry involving adaptive metrics. The second appeals to a fundamental object of convex analysis, generalized perspective transforms [11, 22, 23]. Indeed, our theorem states when "the perspective of a Bregman divergence equals the distortion of a perspective", for a perspective ($\check\phi$ in eq. 1) which is analytically a generalized perspective transform but does not rely on the same convexity and sign requirements as in Maréchal [22, 23]. We note that the perspective of a Bregman divergence (the left-hand side of eq. 1) is a special case of conformal divergence [27], yet to our knowledge it has never been formally defined.

As with the aforementioned key properties of Bregman divergences, Theorem 1 has potentially wide implications for ML. We give three such novel applications to vastly different problems (see Table 1): a reduction of multiple density ratio estimation to multiclass-probability estimation (§3), generalising the results of [24] for the binary label case; a projection-free yet norm-enforcing mirror gradient algorithm (enforced norms are those of the mirrored vectors and of the offset) with guarantees for adaptive filtering (§4); and a seeding approach for clustering on positively or negatively (constant) curved manifolds, based on a popular seeding for flat manifolds and with the same approximation guarantees (§5). Experiments on each of these domains (§6) validate our analysis. The Supplementary Material (SM) details the proofs of all results, provides the experimental results in extenso and some additional (nascent) applications to exponential families and computational information geometry.

| Problem A | Problem B that Theorem 1 reduces A to | Reference |
|---|---|---|
| Multiclass density-ratio estimation | Multiclass class-probability estimation | §3, Lemma 2 |
| Online optimisation on $L_q$ ball | Convex unconstrained online learning | §4, Lemma 4 |
| Clustering on curved manifolds | Clustering on flat manifolds | §5, Lemma 5 |

Table 1: Applications of our scaled Bregman Theorem (Theorem 1). "Reduction" encompasses shortcuts on algorithms and on analyses (algorithm/proof A uses algorithm/proof B as a subroutine).
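As a quick numerical sanity check of eq. (1), here is a minimal sketch in numpy. It assumes the convex generator $\phi(t) = t\log t$, chosen purely for illustration (any convex differentiable $\phi$ works here since $g(x) = 1 + x$ is affine); the helper names are illustrative choices, not from the text.

```python
import numpy as np

def bregman(phi, dphi, x, y):
    """Scalar Bregman distortion D_phi(x || y) = phi(x) - phi(y) - (x - y) phi'(y)."""
    return phi(x) - phi(y) - (x - y) * dphi(y)

# A convex generator on (0, inf): phi(t) = t log t, with derivative log t + 1.
phi  = lambda t: t * np.log(t)
dphi = lambda t: np.log(t) + 1.0

# g(x) = 1 + x (affine), and the perspective-like generator of eq. (1):
# phi_check(x) = g(x) phi(x / g(x)) = x log(x / (1 + x)).
g          = lambda x: 1.0 + x
phi_check  = lambda x: x * np.log(x / (1.0 + x))
dphi_check = lambda x: np.log(x / (1.0 + x)) + 1.0 / (1.0 + x)

rng = np.random.default_rng(0)
for x, y in rng.uniform(0.1, 10.0, size=(5, 2)):
    lhs = g(x) * bregman(phi, dphi, x / g(x), y / g(y))   # scaled Bregman divergence
    rhs = bregman(phi_check, dphi_check, x, y)            # Bregman distortion of phi_check
    assert np.isclose(lhs, rhs), (lhs, rhs)
print("eq. (1) verified numerically")
```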
## 2 Main result: the scaled Bregman theorem

In the remainder, $[k]_* := \{0, 1, \dots, k\}$ and $[k] := \{1, 2, \dots, k\}$ for $k \in \mathbb{N}$. For any differentiable (but not necessarily convex) $\phi : \mathcal{X} \to \mathbb{R}$, we define the Bregman distortion $D_\phi$ as

$$D_\phi(x \,\|\, y) := \phi(x) - \phi(y) - (x - y)^\top \nabla\phi(y)\,. \qquad (2)$$

If $\phi$ is convex, $D_\phi$ is the familiar Bregman divergence with generator $\phi$. Without further ado, we present our main result.

Theorem 1. Let $\phi : \mathcal{X} \to \mathbb{R}$ be convex differentiable, and $g : \mathcal{X} \to \mathbb{R}$ be differentiable. Then,

$$g(x)\, D_\phi\big( (1/g(x))\cdot x \,\big\|\, (1/g(y))\cdot y \big) = D_{\check\phi}(x \,\|\, y)\,, \quad \forall x, y \in \mathcal{X}\,, \qquad (3)$$

where

$$\check\phi(x) := g(x)\, \phi\big( (1/g(x))\cdot x \big)\,, \qquad (4)$$

if and only if (i) $g$ is affine on $\mathcal{X}$, or (ii) for every $z \in \mathcal{X}_g := \{ (1/g(x))\cdot x : x \in \mathcal{X} \}$,

$$\phi(z) = z^\top \nabla\phi(z)\,. \qquad (5)$$

Table 2 presents some examples of (sometimes involved) triplets $(D_\phi, D_{\check\phi}, g)$ for which eq. (3) holds; related proofs are in Appendix III.

| $\mathcal{X}$ | $D_\phi(x \Vert y)$ | $D_{\check\phi}(x \Vert y)$ | $g(x)$ |
|---|---|---|---|
| $\mathbb{R}^d$ | $\frac{1}{2}\lVert x - y\rVert_2^2$ | $\lVert x\rVert_2\,(1 - \cos\angle(x, y))$ | $\lVert x\rVert_2$ |
| $\mathbb{R}^d$ | $\frac{1}{2}(\lVert x\rVert_q^2 - \lVert y\rVert_q^2) - \sum_i (x_i - y_i)\,\mathrm{sign}(y_i)\,\lvert y_i\rvert^{q-1}\,\lVert y\rVert_q^{2-q}$ | $W\lVert x\rVert_q - W\sum_i x_i\,\mathrm{sign}(y_i)\,\lvert y_i\rvert^{q-1}\,\lVert y\rVert_q^{1-q}$ | $\lVert x\rVert_q/W$ |
| $\mathbb{R}^d \times \mathbb{R}$ | $\frac{1}{2}\lVert x_S - y_S\rVert_2^2$ | $\frac{\lVert x\rVert_2}{\sin\lVert x\rVert_2}\,(1 - \cos D_G(x, y))$ | $\lVert x\rVert_2/\sin\lVert x\rVert_2$ |
| $\mathbb{R}^d \times \mathbb{C}$ | $\frac{1}{2}\lVert x_H - y_H\rVert_2^2$ | $\frac{\lVert x\rVert_2}{\sinh\lVert x\rVert_2}\,(\cosh D_G(x, y) - 1)$ | $\lVert x\rVert_2/\sinh\lVert x\rVert_2$ |
| $\mathbb{R}^d_+$ | $\sum_i x_i \log\frac{x_i}{y_i} - \mathbf{1}^\top(x - y)$ | $\sum_i x_i \log\frac{x_i}{y_i} - d\,\mathbb{E}[X]\,\log\frac{\mathbb{E}[X]}{\mathbb{E}[Y]}$ | $\mathbb{E}[X]$ |
| $\mathbb{S}(d)$ | $\mathrm{tr}(X\log X - X\log Y) - \mathrm{tr}(X) + \mathrm{tr}(Y)$ | $\mathrm{tr}(X\log X - X\log Y) - \mathrm{tr}(X)\,\log\frac{\mathrm{tr}(X)}{\mathrm{tr}(Y)}$ | $\mathrm{tr}(X)$ |
| $\mathbb{S}(d)$ | $\mathrm{tr}(XY^{-1}) - \log\det(XY^{-1}) - d$ | $\det(Y^{1/d})\,\mathrm{tr}(XY^{-1}) - d\,\det(X^{1/d})$ | $\det(X^{1/d})$ |

Table 2: Examples of $(D_\phi, D_{\check\phi}, g)$ for which eq. (3) holds. Functions $x_S := f(x) : \mathbb{R}^d \to \mathbb{R}^{d+1}$ and $x_H := f(x) : \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{C}$ are the Sphere and Hyperbolic lifting maps defined in SM, eqs. 51, 62. $W > 0$ is a constant. $D_G$ denotes the geodesic distance on the sphere (for $x_S$) or the hyperboloid (for $x_H$). $\mathbb{S}(d)$ is the set of symmetric real matrices. $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ denote the averages of the coordinates of $x$ and $y$. Related proofs are in SM, Section III.

Depending on $\phi$ and $g$, there are at least two ways to summarize Theorem 1. One is geometric: Theorem 1 sometimes states a scaled isometry between $\mathcal{X}$ and $\mathcal{X}_g$. The other one comes from convex optimisation: Theorem 1 defines generalized perspective transforms on Bregman divergences, and roughly states the identity between the perspective transform of a Bregman divergence and the Bregman distortion of the perspective transform. Appendix VIII gives more details for both properties. We refer to Theorem 1 as the scaled Bregman theorem.

Remark. If $\mathcal{X}_g$ is a vector space, $\phi$ satisfies eq. (5) if and only if it is positive homogeneous of degree 1 on $\mathcal{X}_g$ (i.e. $\phi(\alpha z) = \alpha\,\phi(z)$ for any $\alpha > 0$), from Euler's homogeneous function theorem. When $\mathcal{X}_g$ is not a vector space, this only holds for $\alpha$ such that $\alpha z \in \mathcal{X}_g$ as well. We thus call the gradient condition of eq. (5) restricted positive homogeneity for simplicity.

Remark. Appendix IV gives a deep composition extension of Theorem 1.

For the special case where $\mathcal{X} = \mathbb{R}$ and $g(x) = 1 + x$, Theorem 1 is exactly [24, Lemma 2] (c.f. eq. 1). We wish to highlight a few points with regard to our more general result. First, the distortion generator $\check\phi$ may be¹ non-convex, as the following illustrates.

Example. Suppose $\phi(x) = (1/2)\|x\|_2^2$, the generator for squared Euclidean distance. Then, for $g(x) = 1 + \mathbf{1}^\top x$, we have $\check\phi(x) = (1/2)\|x\|_2^2/(1 + \mathbf{1}^\top x)$, which is non-convex on $\mathcal{X} = \mathbb{R}^d$.

When $\check\phi$ is non-convex, the right hand side in eq. (3) is an object that ostensibly bears only a superficial similarity to a Bregman divergence; it is somewhat remarkable that Theorem 1 shows this general distortion between a pair $(x, y)$ to be entirely equivalent to a (scaling of a) Bregman divergence between some transformation of the points. Second, when $g$ is linear, eq. (3) holds for any convex $\phi$ (this was the case considered in [24]).
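As a numerical sanity check of Theorem 1 in a case where $g$ is not affine, the following minimal sketch verifies the last row of Table 2 (the LogDet pair), assuming the affine representative $\phi(X) = \mathrm{tr}(X) - \log\det(X) - d$, which satisfies condition (5) on $\mathcal{X}_g = \{Z : \det(Z) = 1\}$; the helper names are illustrative choices.

```python
import numpy as np

d = 3

def phi(X):
    """phi(X) = tr(X) - log det(X) - d: an affine representative of the LogDet
    generator satisfying the restricted homogeneity condition (5) on
    X_g = {Z : det(Z) = 1}, since there phi(Z) = tr(Z) - d = tr(Z grad_phi(Z))."""
    return np.trace(X) - np.linalg.slogdet(X)[1] - d

def grad_phi(X):
    return np.eye(d) - np.linalg.inv(X)

def bregman(X, Y):
    """Matrix Bregman distortion D_phi(X || Y) of eq. (2)."""
    return phi(X) - phi(Y) - np.trace((X - Y) @ grad_phi(Y))

rng = np.random.default_rng(1)
A, B = rng.normal(size=(2, d, d))
X, Y = A @ A.T + np.eye(d), B @ B.T + np.eye(d)             # two random SPD matrices

gX, gY = np.linalg.det(X) ** (1 / d), np.linalg.det(Y) ** (1 / d)
lhs = gX * bregman(X / gX, Y / gY)                          # left-hand side of eq. (3)
rhs = gY * np.trace(X @ np.linalg.inv(Y)) - d * gX          # last row of Table 2
print(np.isclose(lhs, rhs))                                 # True
```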
When $g$ is non-linear, however, $\phi$ must be chosen carefully so that $(\phi, g)$ satisfies the restricted homogeneity condition² of eq. (5). In general, given a convex $\phi$, one can reverse engineer a suitable $g$, as illustrated by the following example.

Example. Suppose³ $\phi(x) = (1 + \|x\|_2^2)/2$. Then, eq. (5) requires that $\|x\|_2^2 = 1$ for every $x \in \mathcal{X}_g$, i.e. $\mathcal{X}_g$ is (a subset of) the unit sphere. This is afforded by the choice $g(x) = \|x\|_2$.

Third, Theorem 1 is not merely a mathematical curiosity: we now show that it facilitates novel results in three very different domains, namely estimating multiclass density ratios, constrained online optimisation, and clustering data on a manifold with non-zero curvature. We discuss nascent applications to exponential families and computational geometry in Appendices V and VI.

¹ Evidently, $\check\phi$ is convex iff $g$ is non-negative, by eq. (3) and the fact that a function is convex iff its Bregman distortion is non-negative [6, Section 3.1.3].
² We stress that this condition only needs to hold on $\mathcal{X}_g \subseteq \mathcal{X}$; it would not be really interesting in general for $\phi$ to be homogeneous everywhere in its domain, since we would basically have $\check\phi = \phi$.
³ The constant $1/2$ added in $\phi$ does not change $D_\phi$, since a Bregman divergence is invariant to affine terms; removing it would however make the divergences $D_\phi$ and $D_{\check\phi}$ differ by a constant.

## 3 Multiclass density-ratio estimation via class-probability estimation

Given samples from a number of densities, density ratio estimation concerns estimating the ratio between each density and some reference density. This has applications in the covariate shift problem, wherein the train and test distributions over instances differ [33]. Our first application of Theorem 1 is to show how density ratio estimation can be reduced to class-probability estimation [7, 29].

To proceed, we fix notation. For some integer $C \geq 1$, consider a distribution $\mathbb{P}(X, Y)$ over an (instance, label) space $\mathcal{X} \times [C]$. Let $(\{P_c\}_{c=1}^C, \pi)$ be densities giving $\mathbb{P}(X \mid Y = c)$ and $\mathbb{P}(Y = c)$ respectively, and $M$ giving $\mathbb{P}(X)$ accordingly. Fix $c \in [C]$ a reference class, and suppose for simplicity that $c = C$. Let $\tilde\pi \in \Delta_{C-1}$ be such that $\tilde\pi_c := \pi_c/(1 - \pi_C)$. Density ratio estimation [35] concerns inferring the vector $r(x) \in \mathbb{R}^{C-1}$ of density ratios relative to $C$, with

$$r_c(x) := \mathbb{P}(X = x \mid Y = c)/\mathbb{P}(X = x \mid Y = C)\,,$$

while class-probability estimation [7] concerns inferring the vector $\eta(x) \in \mathbb{R}^{C-1}$ of class-probabilities, with

$$\eta_c(x) := \mathbb{P}(Y = c \mid X = x)/\tilde\pi_c\,.$$

In both cases, we estimate the respective quantities given an iid sample $\mathsf{S} \sim \mathbb{P}(X, Y)^m$ ($m$ is the training sample size). The genesis of the reduction from density ratio to class-probability estimation is the fact that $r(x) = (\pi_C/(1 - \pi_C))\cdot \eta(x)/\eta_C(x)$. In practice one will only have an estimate $\hat\eta$, typically derived by minimising a suitable loss on the given $\mathsf{S}$ [37], with a canonical example being multiclass logistic regression. Given $\hat\eta$, it is natural to estimate the density ratio via:

$$\hat r(x) = \hat\eta(x)/\hat\eta_C(x)\,. \qquad (6)$$

While this estimate is intuitive, to establish a formal reduction we must relate the quality of $\hat r$ to that of $\hat\eta$. Since the minimisation of a suitable loss for class-probability estimation is equivalent to a Bregman minimisation [7, Section 19], [37, Proposition 7], this is however immediate by Theorem 1:

Lemma 2. Given a class-probability estimator $\hat\eta : \mathcal{X} \to [0, 1]^{C-1}$, let the density ratio estimator $\hat r$ be as per Equation 6. Then for any convex differentiable $\phi : [0, 1]^{C-1} \to \mathbb{R}$,

$$\mathbb{E}_{X \sim M}\big[ D_\phi(\eta(X) \,\|\, \hat\eta(X)) \big] = (1 - \pi_C)\cdot \mathbb{E}_{X \sim P_C}\big[ D_{\check\phi}(r(X) \,\|\, \hat r(X)) \big]\,, \qquad (7)$$

where $\check\phi$ is as per Equation 4 with $g(x) := \pi_C/(1 - \pi_C) + \tilde\pi^\top x$.
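As an illustration of how the reduction can be instantiated, here is a minimal sketch that uses multinomial logistic regression as the class-probability estimator and plain Bayes' rule with empirical priors (rather than the $\tilde\pi$-rescaled $\eta$ above); the function name and the toy data are illustrative choices, not from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_via_cpe(X, y, reference_class):
    """Sketch of the reduction: fit a class-probability estimator (here multinomial
    logistic regression), then read off density-ratio estimates against the
    reference class via Bayes' rule, in the spirit of eq. (6)."""
    cpe = LogisticRegression(max_iter=1000).fit(X, y)
    priors = np.bincount(y, minlength=len(cpe.classes_)) / len(y)   # empirical pi_c
    ref = list(cpe.classes_).index(reference_class)

    def r_hat(X_query):
        eta = cpe.predict_proba(X_query)                            # estimated P(Y = c | X = x)
        # r_c(x) = P(x | Y=c) / P(x | Y=ref) = (pi_ref / pi_c) * P(c | x) / P(ref | x)
        ratios = (priors[ref] / priors) * eta / eta[:, [ref]]
        return np.delete(ratios, ref, axis=1)                       # drop the trivial ratio r_ref = 1
    return r_hat

# toy usage: three 2d Gaussian classes, ratios taken w.r.t. class 2
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in (0.0, 1.5, 3.0)])
y = np.repeat([0, 1, 2], 200)
print(density_ratio_via_cpe(X, y, reference_class=2)(X[:3]))
```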
Lemma 2 generalises [24, Proposition 3], which focussed on the binary case with $\pi = 1/2$ (see Appendix VII for a review of that result). Unpacking the Lemma, the LHS in Equation 7 represents the object minimised by some suitable loss for class-probability estimation. Since $g$ is affine, we can use any convex, differentiable $\phi$, and so can use any suitable class-probability loss to estimate $\hat\eta$. Lemma 2 thus implies that producing $\hat\eta$ by minimising any class-probability loss equivalently produces an $\hat r$ as per Equation 6 that minimises a Bregman divergence to the true $r$. Thus, Theorem 1 provides a reduction from density ratio to multiclass probability estimation. We now detail two applications where $g(\cdot)$ is no longer affine, and $\phi$ must be chosen more carefully.

## 4 Dual norm mirror descent: projection-free online learning on $L_p$ balls

A substantial amount of work at the intersection of ML and convex optimisation has focused on constrained optimisation within a ball [32, 14]. This optimisation is typically carried out via projection operators that can be expensive to compute [17, 19]. We now show that gauge functions can be used as an inexpensive alternative, and that Theorem 1 easily yields guarantees for this procedure in online learning.

We consider the adaptive filtering problem, closely related to the online least squares problem with linear predictors [9, Chapter 11]. Here, over a sequence of $T$ rounds, we observe some $x_t \in \mathcal{X}$. We must then predict a target value $\hat y_t = w_{t-1}^\top x_t$ using our current weight vector $w_{t-1}$. The true target $y_t = u^\top x_t + \epsilon_t$ is then revealed, where $\epsilon_t$ is some unknown noise, and we may update our weight to $w_t$. Our goal is to minimise the regret of the sequence $\{w_t\}_{t=0}^T$,

$$R(w_{1:T} \,|\, u) := \sum_{t=1}^T \big( u^\top x_t - w_{t-1}^\top x_t \big)^2 - \sum_{t=1}^T \big( u^\top x_t - y_t \big)^2\,. \qquad (8)$$

Let $q \in (1, 2]$ and $p$ be such that $1/p + 1/q = 1$. For $\phi := (1/2)\|x\|_q^2$ and loss $\ell_t(w) = (1/2)(y_t - w^\top x_t)^2$, the p-LMS algorithm [20] employs the stochastic mirror gradient updates:

$$w_t := \mathop{\mathrm{argmin}}_{w}\ \eta_t\,\ell_t(w) + D_\phi(w \,\|\, w_{t-1}) = (\nabla\phi)^{-1}\big( \nabla\phi(w_{t-1}) - \eta_t\,\nabla\ell_t \big)\,, \qquad (9)$$

where $\eta_t$ is a learning rate to be specified by the user. [20, Theorem 2] shows that for appropriate $\eta_t$, one has $R(w_{1:T}|u) \leq (p-1)\,\max_{x \in \mathcal{X}} \|x\|_p^2\,\|u\|_q^2$.

The p-LMS updates do not provide any explicit control on $\|w_t\|$, i.e. there is no regularisation. Experiments (Section 6) suggest that leaving $\|w_t\|$ uncontrolled may not be a good idea, as the increase of the norm sometimes prevents (significant) updates (eq. (9)). Also, the wide success of regularisation in ML calls for regularised variants that retain the regret guarantees and computational efficiency of p-LMS. (Adding a projection step to eq. (9) would not achieve both.) We now do just this.

For fixed $W > 0$, let $\phi := (1/2)(W^2 + \|x\|_q^2)$, a translation of the generator used in p-LMS. Invoking Theorem 1 with the admissible $g_q(x) = \|x\|_q/W$ yields $\check\phi := \check\phi_q = W\|x\|_q$ (see Table 2). Using the fact that the $L_p$ and $L_q$ norms are dual of each other, we replace eq. (9) by:

$$w_t := \nabla\check\phi_p\big( \nabla\check\phi_q(w_{t-1}) - \eta_t\,\nabla\ell_t \big)\,. \qquad (10)$$

See Lemma A of the Appendix for the simple forms of $\nabla\check\phi_{\{p,q\}}$. We call update (10) the dual norm p-LMS (DN-p-LMS) algorithm, noting that the "dual" refers to the polar transform of the norm, and $g$ stems from a gauge normalization for $B_q(W)$, the closed $L_q$ ball with radius $W > 0$. Namely, we have $\gamma_{\mathrm{GAU}}(x) = W/\|x\|_q = g(x)^{-1}$ for the gauge $\gamma_{\mathrm{GAU}}(x) := \sup\{ z \geq 0 : z\,x \in B_q(W) \}$, so that $\check\phi_q$ implicitly performs gauge normalisation of the data. This update is no more computationally expensive than eq.
(9) (we simply need to compute the $p$- and $q$-norms of appropriate terms) but, crucially, automatically constrains the norms of $w_t$ and of its image by $\nabla\check\phi_q$.

Lemma 3. For the update in eq. (10), $\|w_t\|_q = \|\nabla\check\phi_q(w_t)\|_p = W$, $\forall t > 0$.

Lemma 3 is remarkable, since nowhere in eq. (10) do we project onto the $L_q$ ball. Nonetheless, for the DN-p-LMS updates to be principled, we need a similar regret guarantee to the original p-LMS. Fortunately, this may be done using Theorem 1 to exploit the original proof of [20]. For any $u \in \mathbb{R}^d$, define the $q$-normalised regret of $\{w_t\}_{t=0}^T$ by

$$R_q(w_{1:T} \,|\, u) := \sum_{t=1}^T \big( (1/g_q(u))\, u^\top x_t - w_{t-1}^\top x_t \big)^2 - \sum_{t=1}^T \big( (1/g_q(u))\, u^\top x_t - y_t \big)^2\,. \qquad (11)$$

We have the following bound on $R_q$ for the DN-p-LMS updates (we cannot expect a bound on the unnormalised $R(\cdot)$ of eq. (8), since by Lemma 3 we can only compete against vectors of norm $W$).

Lemma 4. Pick any $u \in \mathbb{R}^d$, $p, q$ satisfying $1/p + 1/q = 1$ and $p > 2$, and $W > 0$. Suppose $\|x_t\|_p \leq X_p$ and $|y_t| \leq Y$, $\forall t \leq T$. Let $\{w_t\}$ be as per eq. (10), using learning rate

$$\eta_t := \frac{\gamma_t\, W}{4(p-1)\max\{W, X_p\}\, X_p\, W + |y_t - w_{t-1}^\top x_t|\, X_p}\,, \qquad (12)$$

for any desired $\gamma_t \in [1/2, 1]$. Then,

$$R_q(w_{1:T} \,|\, u) \leq 4(p-1)X_p^2 W^2 + (16p - 8)\max\{W, X_p\}\, X_p^2\, W + 8\, Y\, X_p^2\,. \qquad (13)$$

Several remarks can be made. First, the bound depends on the maximal signal value $Y$, but this is the maximal signal in the observed sequence, so it may not be very large in practice; if it is comparable to $W$, then our bound is looser than [20] by just a constant factor. Second, the learning rate is adaptive in the sense that its choice depends on the last mistake made. There is a nice way to represent the offset vector $\eta_t\nabla\ell_t$ in eq. (10), since we have, for $Q := 4(p-1)\max\{W, X_p\}\, X_p\, W$,

$$\eta_t\,\nabla\ell_t = -\gamma_t\,\frac{W\, |y_t - w_{t-1}^\top x_t|\, X_p}{Q + |y_t - w_{t-1}^\top x_t|\, X_p}\cdot \mathrm{sign}(y_t - w_{t-1}^\top x_t)\cdot \frac{x_t}{X_p}\,, \qquad (14)$$

so the $L_p$ norm of the offset is actually equal to $W\tilde Q$, where $\tilde Q \in [0, 1]$ is all the smaller as the vector $w_\cdot$ gets better. Hence, the update in eq. (10) in fact controls all norms (that of $w_\cdot$, of its image by $\nabla\check\phi_q$, and of the offset). Third, because of the normalisation of $u$, the bound does not actually depend on $u$, but on the radius $W$ chosen for the $L_q$ ball.

Figure 1: (L) Lifting map into $\mathbb{R}^d \times \mathbb{R}$ for clustering on the sphere with k-means++. (M) $D_{\mathrm{rec}}$ in Eq. (15), shown as a vertical thick red line. (R) Lifting map into $\mathbb{R}^d \times \mathbb{C}$ for the hyperboloid.

## 5 Clustering on a curved manifold via clustering on a flat manifold

Our final application can be related to two problems that have received steadily growing interest over the past decade in unsupervised ML: clustering on a non-linear manifold [12], and subspace clustering [36]. We consider two fundamental manifolds investigated by [16] to compute centers of mass from relativistic theory: the sphere $\mathbb{S}^d$ and the hyperboloid $\mathbb{H}^d$, the former being of positive curvature, and the latter of negative curvature. Applications involving these specific manifolds are numerous in text processing, computer vision, geometric modelling and computer graphics, to name a few [8, 12, 15, 21, 30, 34]. We emphasize the fact that the clustering problem has significant practical impact for $d$ as small as 2 in computer vision [34]. The problem is non-trivial for two separate reasons. First, the ambient space, i.e. the space of registration of the input data, is often implicitly Euclidean and therefore not the manifold [12]: if the mapping to the manifold is not carefully done, then geodesic distances measured on the manifold may be inconsistent with respect to the ambient space.
Second, the fact that the manifold has non-zero curvature essentially prevents the direct use of Euclidean optimization algorithms [38]: put simply, the average of two points that belong to a manifold does not necessarily belong to the manifold, so we have to be careful about how to compute centroids for hard clustering [16, 27, 30, 31]. What we show now is that Riemannian manifolds with constant sectional curvature may be clustered with the k-means++ seeding for flat manifolds [2], without even touching a line of the algorithm.

To formalise the problem, we need three key components of Riemannian geometry: tangent planes, the exponential map and geodesics [1]. We assume that the ambient space is a tangent plane to the manifold $M$, which conveniently makes it look Euclidean (see Figure 1). The point of tangency is called $q$, and the tangent plane $T_q M$. The exponential map, $\exp_q : T_q M \to M$, performs a distance preserving mapping: the geodesic length between $q$ and $\exp_q(x)$ in $M$ is the same as the Euclidean length between $q$ and $x$ in $T_q M$. Our clustering objective is to find $\mathcal{C} := \{c_1, c_2, \dots, c_k\} \subseteq M$ such that

$$D_{\mathrm{rec}}(\mathsf{S} : \mathcal{C}) = \inf_{\mathcal{C}' \subseteq M,\, |\mathcal{C}'| = k} D_{\mathrm{rec}}(\mathsf{S}, \mathcal{C}')\,, \quad \text{with} \quad D_{\mathrm{rec}}(\mathsf{S}, \mathcal{C}) := \sum_{i \in [m]} \min_{j \in [k]} D_{\mathrm{rec}}(\exp_q(x_i), c_j)\,, \qquad (15)$$

where $D_{\mathrm{rec}}$ is a reconstruction loss, a function of the geodesic distance between $\exp_q(x_i)$ and $c_j$. We use two loss functions $D_{\mathrm{rec}} : M \times M \to \mathbb{R}_+$, defined from [16] and used in ML for more than a decade [12]:

$$D_{\mathrm{rec}}(y, c) := \begin{cases} 1 - \cos D_G(y, c) & \text{for } M = \mathbb{S}^d\,, \\ \cosh D_G(y, c) - 1 & \text{for } M = \mathbb{H}^d\,. \end{cases} \qquad (16)$$

Here, $D_G(y, c)$ is the corresponding geodesic distance of $M$ between $y$ and $c$. Figure 1 shows that $D_{\mathrm{rec}}(y, c)$ is the orthogonal distance between $T_c M$ and $y$ when $M = \mathbb{S}^d$. The solution to the clustering problem in eq. (15) is therefore the one that minimizes the error between tangent planes defined at the centroids, and points on the manifold.

It turns out that both distances in eq. (16) can be engineered as Bregman divergences via Theorem 1, as seen in Table 2. Furthermore, they imply the same $\phi$, which is just the generator of the Mahalanobis distortion, but a different $g$. The construction involves a third party, a lifting map ($\mathrm{lift}(\cdot)$) that increases the dimension by one. The Sphere lifting map $\mathbb{R}^d \ni x \mapsto x_S \in \mathbb{R}^{d+1}$ is indicated in Table 3 (left). The new coordinate depends on the norm of $x$. The Hyperbolic lifting map, $\mathbb{R}^d \ni x \mapsto x_H \in \mathbb{R}^d \times \mathbb{C}$, involves a pure imaginary additional coordinate, and is indicated in Table 3 (right, with a slight abuse of notation) and Figure 1. Both $x_S$ and $x_H$ live on a $d$-dimensional manifold, depicted in Figure 1.

| (Sphere) Sk-means++($\mathsf{S}$, $k$) | (Hyperboloid) Hk-means++($\mathsf{S}$, $k$) |
|---|---|
| Input: dataset $\mathsf{S} \subset T_q\mathbb{S}^d$, $k \in \mathbb{N}_*$ | Input: dataset $\mathsf{S} \subset T_q\mathbb{H}^d$, $k \in \mathbb{N}_*$ |
| Step 1: $\mathsf{S}^+ \leftarrow \{ g_S^{-1}(x_S)\cdot x_S : x_S \in \mathrm{lift}(\mathsf{S}) \}$ | Step 1: $\mathsf{S}^+ \leftarrow \{ g_H^{-1}(x_H)\cdot x_H : x_H \in \mathrm{lift}(\mathsf{S}) \}$ |
| Step 2: $\mathcal{C}^+ \leftarrow \text{k-means++\_seeding}(\mathsf{S}^+, k)$ | Step 2: $\mathcal{C}^+ \leftarrow \text{k-means++\_seeding}(\mathsf{S}^+, k)$ |
| Step 3: $\mathcal{C} \leftarrow \exp_q^{-1}(\mathcal{C}^+)$ | Step 3: $\mathcal{C} \leftarrow \exp_q^{-1}(\mathcal{C}^+)$ |
| Output: cluster centers $\mathcal{C} \subset T_q\mathbb{S}^d$ | Output: cluster centers $\mathcal{C} \subset T_q\mathbb{H}^d$ |
| $x_S := [x_1\; x_2\; \cdots\; x_d\;\; \lVert x\rVert_2 \cot \lVert x\rVert_2]$ | $x_H := [x_1\; x_2\; \cdots\; x_d\;\; i\,\lVert x\rVert_2 \coth \lVert x\rVert_2]$ |
| $g_S(x_S) := \lVert x\rVert_2/\sin \lVert x\rVert_2$ | $g_H(x_H) := \lVert x\rVert_2/\sinh \lVert x\rVert_2$ |

Table 3: How to use k-means++ to cluster points on the sphere (left) or the hyperboloid (right).
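A minimal sketch of Steps 1 and 2 of Sk-means++ (Table 3, left), omitting the pull-back $\exp_q^{-1}$ of Step 3; it assumes tangent-plane points with norm in $(0, \pi)$ and uses a plain $D^2$ seeding rather than a library routine, with function names and toy data chosen for illustration.

```python
import numpy as np

def lift_and_scale_sphere(X):
    """Step 1 of Sk-means++ (Table 3, left): lift x to x_S = [x, ||x|| cot ||x||] and
    scale by 1 / g_S = sin ||x|| / ||x||; the result lies on the unit sphere S^d
    (it is the exponential map of x at the point of tangency). Assumes 0 < ||x|| < pi."""
    n = np.linalg.norm(X, axis=1, keepdims=True)
    x_S = np.hstack([X, n / np.tan(n)])                     # lifted coordinates
    return x_S * np.sin(n) / n                              # scaled: rows have unit norm

def kmeans_pp_seeding(Z, k, rng):
    """Step 2: plain k-means++ (D^2) seeding of [2] on the scaled coordinates."""
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[rng.choice(len(Z), p=d2 / d2.sum())])
    return np.array(centers)

# toy usage: 500 points in the tangent plane T_q S^2, k = 3 seeds
rng = np.random.default_rng(3)
S = rng.uniform(-1.0, 1.0, size=(500, 2))
C_plus = kmeans_pp_seeding(lift_and_scale_sphere(S), k=3, rng=rng)
print(np.linalg.norm(C_plus, axis=1))                       # all 1: the seeds live on S^2
```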
When $x_S$ and $x_H$ are scaled by the corresponding $g_\cdot(\cdot)$, they happen to be mapped to $\mathbb{S}^d$ or $\mathbb{H}^d$, respectively, by what happens to be the manifold's exponential map for the original $x$ (see Appendix III). Theorem 1 is interesting in this case because $\phi$ corresponds to a Mahalanobis distortion: this shows that k-means++ seeding [2, 25] can be used directly on the scaled coordinates $g_{\{S,H\}}^{-1}(x_{\{S,H\}})\cdot x_{\{S,H\}}$ to pick centroids that yield an approximation of the global optimum for the clustering problem on the manifold which is just as good as the original Euclidean approximation bound [2].

Lemma 5. The expected potential of Sk-means++ seeding over the random choices of $\mathcal{C}^+$ satisfies:

$$\mathbb{E}[D_{\mathrm{rec}}(\mathsf{S} : \mathcal{C})] \leq 8(2 + \log k)\, \inf_{\mathcal{C}' \subset \mathbb{S}^d} D_{\mathrm{rec}}(\mathsf{S} : \mathcal{C}')\,. \qquad (17)$$

The same approximation bound holds for Hk-means++ seeding on the hyperboloid ($\mathcal{C}', \mathcal{C}^+ \subset \mathbb{H}^d$).

Lemma 5 is notable since it was only recently shown that such a bound is possible for the sphere [15], and to our knowledge, no such approximation quality is known for clustering on the hyperboloid [30, 31]. Notice that Lloyd iterations on non-linear manifolds would require repetitive renormalizations to keep centers on the manifold [12], an additional disadvantage compared to clustering on flat manifolds that the Sk-means++ and Hk-means++ seedings do not bear.

## 6 Experimental validation

We present some experiments validating our theoretical analysis for the applications above.

Multiple density ratio estimation. See Appendix IX for experiments in this domain.

Dual norm p-LMS (DN-p-LMS). We ran p-LMS and the DN-p-LMS of §4 on the experimental setting of [20]. We refer to that paper for an exhaustive description of the experimental setting, which we briefly summarize: it is a noisy signal processing setting, involving a dense or a sparse target. We compute, over the signal received, the error of our predictor on the signal. We keep all parameters as they are in [20], except for one: we make sure that data are scaled to fit in an $L_p$ ball of prescribed radius, to test the observation in [20] that fixing the learning rate $\eta_t$ is not straightforward in p-LMS. Knowing the true value of $X_p$, we then scale it by a misestimation factor $\rho$, typically in $[0.1, 1.7]$. We use the same misestimation in DN-p-LMS. Thus, both algorithms suffer the same source of uncertainty. Also, we periodically change the signal (every 1000 iterations), to assess the performances of the algorithms in tracking changes in the signal.

Experiments, given in extenso in Appendix X, are summarized in Table 4.

Table 4: Summary of the experiments, displaying (on the y axis) the error of p-LMS minus the error of DN-p-LMS (when > 0, DN-p-LMS beats p-LMS) as a function of $t$, in the setting of [20], for $(p, q) \in \{(1.17, 6.9), (2.0, 2.0), (6.9, 1.17)\}$ (columns) and various misestimation factors $\rho$. Left panel: (D)ense target; right panel: (S)parse target.

The following trends emerge: in the mid to long run, DN-p-LMS is never beaten by p-LMS by more than a fraction of a percent. On the other hand, DN-p-LMS can beat p-LMS by very significant differences (exceeding 40%), in particular when $p < 2$, i.e. when we are outside the regime of the proof of [20]. This indicates that
significantly stronger and more general results than that of Lemma 4 may be expected. Also, it seems that the problem of p-LMS lies in an exploding norm problem: in various cases, we observe that $\|w_t\|$ (in any norm) blows up with $t$, and this correlates with a very significant degradation of its performances. Clearly, DN-p-LMS does not have this problem since all relevant norms are under tight control. Finally, even when the norm does not explode, DN-p-LMS can still beat p-LMS, by less important differences though. Of course, the output of p-LMS can repeatedly be normalised, but the normalisation would escape the theory of [20] and it is not clear which normalisation would be best.

Clustering on the sphere. For $k \in [50]$, we simulate on $T_0\mathbb{S}^2$ a mixture of spherical Gaussian and uniform densities in random rectangles, with $2k$ components. We run three algorithms: (i) SKM [12] on the data embedded on $\mathbb{S}^2$ with random (Forgy) initialization, (ii) Sk-means++, and (iii) SKM with Sk-means++ initialisation. Results are averaged over the algorithms' runs. Table 5 (left) displays that using Sk-means++ as initialization for SKM brings a very significant gain over SKM alone, since we almost divide the k-means potential by a factor of 2 on some runs. The right plot of Table 5 shows that Sk-means++ consistently reduces the k-means potential by at least a factor of 2 over Forgy. The left plot in Table 6 displays that even when it has converged, SKM does not necessarily beat Sk-means++. Finally, the center and right plots in Table 6 display that even when SKM does beat Sk-means++ after it has converged, the iteration number after which it does so increases with $k$, and in the worst case may exceed the average number of iterations needed for SKM to converge (we stopped SKM if the relative improvement is not above 1‰).

Table 5: (L) Relative improvement (decrease) in k-means potential of SKM with Sk-means++ initialisation compared to SKM alone, as a function of $k$. (R) Relative improvement of Sk-means++ over Forgy initialization on the sphere.

Table 6: (L) Percentage of runs of SKM whose output (when it has converged) is better than Sk-means++. (C) Maximal number of iterations for SKM after which it beats Sk-means++ (ignoring runs of SKM that do not beat Sk-means++). (R) Average number of iterations for SKM to converge. All as a function of $k$.

## 7 Conclusion

We presented a new scaled Bregman identity, and used it to derive novel results in several fields of machine learning: multiple density ratio estimation, adaptive filtering, and clustering on curved manifolds. We believe that, like other known key properties of Bregman divergences, there is potential for other applications of the result; Appendices V and VI present preliminary thoughts in this direction.

## 8 Acknowledgments

The authors wish to thank Bob Williamson and the reviewers for insightful comments.

## References

[1] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.

[2] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In 19th SODA, 2007.

[3] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705–1749, 2005.

[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.
Operations Research Letters, 31(3):167–175, 2003.

[5] J.-D. Boissonnat, F. Nielsen, and R. Nock. Bregman Voronoi diagrams. DCG, 44(2):281–307, 2010.

[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[7] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classification: Structure and applications, 2005. Unpublished manuscript.

[8] S.-R. Buss and J.-P. Fillmore. Spherical averages and applications to spherical splines and interpolation. ACM Transactions on Graphics, 20:95–126, 2001.

[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.

[10] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. MLJ, 2002.

[11] B. Dacorogna and P. Maréchal. The role of perspective functions in convexity, polyconvexity, rank-one convexity and separate convexity. J. Convex Analysis, 15:271–284, 2008.

[12] I. Dhillon and D.-S. Modha. Concept decompositions for large sparse text data using clustering. MLJ, 42:143–175, 2001.

[13] I.-S. Dhillon and J.-A. Tropp. Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120–1146, 2008.

[14] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In ICML'08, pages 272–279, New York, NY, USA, 2008. ACM.

[15] Y. Endo and S. Miyamoto. Spherical k-means++ clustering. In Proc. of the 12th MDAI, pages 103–114, 2015.

[16] G.-A. Galperin. A concept of the mass center of a system of material points in the constant curvature spaces. Communications in Mathematical Physics, 154:63–84, 1993.

[17] E. Hazan and S. Kale. Projection-free online learning. In John Langford and Joelle Pineau, editors, ICML'12, pages 521–528, New York, NY, USA, 2012. ACM.

[18] M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R.-E. Turner. Black-box alpha-divergence minimization. In 33rd ICML, 2016.

[19] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In 30th ICML, 2013.

[20] J. Kivinen, M. Warmuth, and B. Hassibi. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Trans. SP, 54:1782–1793, 2006.

[21] D. Kuang, S. Yun, and H. Park. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. J. Global Optimization, 62:545–574, 2014.

[22] P. Maréchal. On a functional operation generating convex functions, part 1: duality. J. of Optimization Theory and Applications, 126:175–189, 2005.

[23] P. Maréchal. On a functional operation generating convex functions, part 2: algebraic properties. J. of Optimization Theory and Applications, 126:375–366, 2005.

[24] A.-K. Menon and C.-S. Ong. Linking losses for class-probability and density ratio estimation. In ICML, 2016.

[25] R. Nock, P. Luosto, and J. Kivinen. Mixed Bregman clustering with approximation guarantees. In ECML, 2008.

[26] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE PAMI, 31:2048–2059, 2009.

[27] R. Nock, F. Nielsen, and S.-I. Amari. On conformal divergences and their population minimizers. IEEE Trans. IT, 62:527–538, 2016.

[28] M. Reid and R. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731–817, 2011.

[29] M.-D. Reid and R.-C. Williamson. Composite binary losses. JMLR, 11:2387–2422, 2010.

[30] G. Rong, M. Jin, and X. Guo. Hyperbolic centroidal Voronoi tessellation. In 14th ACM SPM, 2010.

[31] O.
Schwander and F. Nielsen. Matrix Information Geometry, chapter Learning Mixtures by Simplifying Kernel Density Estimators, pages 403–426. Springer Berlin Heidelberg, 2013.

[32] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML'08, pages 807–814. ACM, 2007.

[33] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[34] J. Straub, G. Rosman, O. Freifeld, J.-J. Leonard, and J.-W. Fisher III. A mixture of Manhattan frames: Beyond the Manhattan world. In Proc. of the 27th IEEE CVPR, pages 3770–3777, 2014.

[35] M. Sugiyama, T. Suzuki, and T. Kanamori. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. AISM, 64(5):1009–1044, 2012.

[36] R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28:52–68, 2011.

[37] R.-C. Williamson, E. Vernet, and M.-D. Reid. Composite multiclass losses, 2014. Unpublished manuscript.

[38] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. CoRR, abs/1602.06053, 2016.