# Harmonic Exponential Families on Manifolds

Taco S. Cohen (T.S.COHEN@UVA.NL), University of Amsterdam
Max Welling (M.WELLING@UVA.NL), University of Amsterdam; University of California Irvine; Canadian Institute for Advanced Research

## Abstract

In a range of fields including the geosciences, molecular biology, robotics and computer vision, one encounters problems that involve random variables on manifolds. Currently, there is a lack of flexible probabilistic models on manifolds that are fast and easy to train. We define an extremely flexible class of exponential family distributions on manifolds such as the torus, sphere, and rotation groups, and show that for these distributions the gradient of the log-likelihood can be computed efficiently using a non-commutative generalization of the Fast Fourier Transform (FFT). We discuss applications to Bayesian camera motion estimation (where harmonic exponential families serve as conjugate priors), and modelling of the spatial distribution of earthquakes on the surface of the earth. Our experimental results show that harmonic densities yield a significantly higher likelihood than the best competing method, while being orders of magnitude faster to train.

## 1. Introduction

Many problems in science and engineering involve random variables on manifolds. In the geosciences, for example, one deals with measurements such as the locations of earthquakes and weather events on the spherical surface of the earth. In robotics and computer vision, unobserved Lie group transformations such as rotations and rigid-body motions play an important role in motion understanding, localization and alignment problems. Classical probabilistic models cannot be applied to data on manifolds because these models do not respect the manifold topology (having discontinuities at the identified values $0 \equiv 2\pi$ of a circular variable, for instance), and because they are not equivariant (a manifold-preserving transformation of the data would take the distribution outside the model family). Among manifold distributions there are currently none that are both flexible and efficiently trainable.

In this paper we study a very flexible class of densities on compact Lie groups (such as the group of rotations in two or three dimensions) and homogeneous spaces (such as the circle, torus, and sphere). We refer to these densities as harmonic exponential families, because they are based on a generalized form of Fourier analysis known as non-commutative harmonic analysis (Chirikjian & Kyatkin, 2001). Specifically, the sufficient statistics of these families are given by functions that are analogous to the sinusoids on the circle.

We show that the moment map (Wainwright & Jordan, 2007) that takes the natural parameters of an exponential family and produces the moments of the distribution can be computed very efficiently for harmonic exponential families using generalized Fast Fourier Transform (FFT) algorithms. This leads directly to a very efficient maximum-likelihood estimation procedure that is applicable to any manifold for which an FFT algorithm has been developed, and that enjoys the convexity and convergence properties of exponential families.

We apply the harmonic exponential family on the sphere to the problem of modeling the spatial distribution of earthquakes on the surface of the earth.
Our model significantly outperforms the best competing method (a mixture model), while being orders of magnitude faster to train.

Harmonic exponential families also arise naturally as conjugate priors in a Bayesian framework for estimating transformations from correspondence pairs. In this case the points on the manifold (the transformations) are not observed directly. Instead we see a pair of vectors $x$ and $y$, such as images, point clouds, or other data, that provide information about the transformation that produced one from the other. If we have a prior $p(g)$ over transformations and a likelihood $p(y \mid x, g)$ that measures how likely $y$ is to be the $g$-transformed version of $x$, we can consider the posterior over transformations $p(g \mid x, y)$. Typically, the posterior turns into a complicated and intractable distribution, but we show that if the prior is a harmonic density and the likelihood is Gaussian, the posterior distribution is again a harmonic density whose parameters are easily obtained using the generalized FFT algorithm. Furthermore, the global mode of this posterior (the optimal transformation) can be computed efficiently by performing yet another FFT.

In this paper we provide both an abstract treatment of the theory of harmonic densities that covers a fairly broad class of manifolds in a uniform and easily understandable manner, and a concrete instantiation of this theory in the case of the rotation groups SO(2) and SO(3), and their homogeneous spaces, the circle $S^1$ and sphere $S^2$.

The rest of this paper is organized as follows. We begin by discussing related work in section 2, followed by a brief summary of non-commutative harmonic analysis for a machine learning audience in section 3. In section 4, we define harmonic exponential families and present an FFT-based maximum-likelihood estimation algorithm. In section 5, we show that harmonic exponential families are the conjugate priors in the Bayesian transformation estimation problem, and present an efficient MAP inference algorithm. Our earthquake modelling experiments are presented in section 6, followed by a discussion and conclusion.

## 2. Related Work

The harmonic exponential families were first defined abstractly (but not named) by Diaconis (1988). Despite their long history and a significant body of literature devoted to them (see Mardia & Jupp (1999)), this class of densities has remained intractable until now for all but the simplest cases.

In fact, various commonly used distributions on the circle and the sphere are harmonic densities of low degree. Among these are the 2-parameter von Mises distribution on the circle and the 5-parameter Kent distribution on the sphere. These are both exponential families, about which Mardia & Jupp (1999) write (p. 175):

> Although exponential models have many pleasant inferential properties, the need to evaluate the normalizing constant (or at least the first derivative of its logarithm) can be a practical difficulty.

What is a practical difficulty for the unimodal distributions with few parameters mentioned above becomes a show-stopper for more flexible exponential families with many parameters. The harmonic exponential family for the circle is known as the generalized von Mises distribution, and can be defined for any band-limit / order / degree $L$. However, no scalable maximum-likelihood estimation algorithm is known.
Gatto & Jammalamadaka (2007) (who work only with the 4-parameter $L = 2$ distribution) compute the normalizing constant by a truncated infinite sum, where each term involves an expensive Bessel function evaluation. Beran (1979) studies an exponential family that is equivalent to the harmonic exponential family on the sphere, but does not provide a scalable learning algorithm. At each iteration, the proposed algorithm computes all $O(L^2)$ moments of the distribution by numerical integration. This requires $O(L^2)$ samples per integration, making the per-iteration cost $O(L^4)$. Beran further suggests using a second order optimization method, which would further increase the per-iteration cost to $O(L^6)$. This is clearly not feasible when $L$ is measured in the hundreds and parameter counts in the tens of thousands, as is needed in the experiments reported in section 6.

The algorithm described in this paper is simple, generic across manifolds, and fast (per-iteration complexity $O(L^2 \log^2 L)$ in the spherical case, for instance). It can be applied to any manifold for which an FFT has been developed.

## 3. Preliminaries

The manifolds we consider in this paper are either Lie groups or closely related manifolds called homogeneous spaces, and the sufficient statistics that we use come from harmonic analysis on these manifolds. Since these concepts are not widely known in machine learning, we will review them in this section. For more details, we refer the reader to Chirikjian & Kyatkin (2001); Sugiura (1990); Kondor (2008); Goodman & Wallach (2009).

### 3.1. Lie Groups

A transformation group $G$ is a set of invertible transformations that is closed under composition and taking inverses: for any $g, h \in G$, the composition $gh$ is again a member of $G$, and so are the inverses $g^{-1}$ and $h^{-1}$. A Lie group is a group that is also a differentiable manifold. For example, the group SO(3) of 3D rotations is a Lie group. It can be represented as a set of matrices,

$$\mathrm{SO}(3) = \{R \in \mathbb{R}^{3 \times 3} \mid RR^T = I,\ \det(R) = 1\}, \tag{1}$$

so one way to think about SO(3) is as a 3-dimensional manifold embedded in $\mathbb{R}^{3 \times 3}$. For technical reasons, we will further restrict our attention to compact Lie groups, i.e. Lie groups that are closed and bounded.

### 3.2. Harmonic analysis on compact Lie groups

The basic idea of the generalized Fourier transform on compact Lie groups is to expand a function $f : G \to \mathbb{C}$ as a linear combination of carefully chosen basis functions. These basis functions have very special properties, because they are the matrix elements of irreducible unitary representations (IURs) of $G$. These terms will now be explained.

A representation of a group $G$ on a vector space $V$ is a map $R$ from the group to the set of invertible linear transformations of $V$ that preserves the group structure in the following sense:

$$R(gh) = R(g)R(h). \tag{2}$$

Note that $gh$ denotes composition of group elements while $R(g)R(h)$ denotes matrix multiplication (once we choose a basis for $V$, that is). In computer vision, we encounter group representations in the following way. An image is represented as a vector $x \in \mathbb{R}^n$ of pixel intensities, and to be concrete, we take $G$ to be the group SO(2) consisting of rotations of the plane. Then $R(\theta)$ is the matrix such that $R(\theta)x$ is the image $x$ rotated by angle $\theta$.

As this example shows, many representations do not change the norm of the vectors on which they act: $\forall g \in G, x \in V : \|U(g)x\| = \|x\|$. Such representations are called unitary. Unitary representations tend to be easier to work with both analytically and numerically.
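As a small numerical illustration (ours, not from the paper), the sketch below checks the homomorphism property of eq. 2 and unitarity for the defining representation of SO(2) acting on $\mathbb{R}^2$:

```python
import numpy as np

def rotation_rep(theta):
    """Defining representation of SO(2) on R^2: a 2x2 rotation matrix."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

g, h = 0.7, 2.1
x = np.array([1.0, -2.0])

# Homomorphism property R(gh) = R(g)R(h) (eq. 2); for SO(2), composition is angle addition.
assert np.allclose(rotation_rep(g) @ rotation_rep(h), rotation_rep(g + h))

# Unitarity: the representation preserves norms, ||R(g)x|| = ||x||.
assert np.isclose(np.linalg.norm(rotation_rep(g) @ x), np.linalg.norm(x))
```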
Given a unitary representation $U$ and a unitary matrix $F$, one can define an equivalent representation $T(g) = F^{-1}U(g)F$. In computer vision one can think of $F$ as a matrix containing image features in the rows. One can now try to find $F$ such that for every $g$, the matrix $F^{-1}U(g)F$ is block diagonal with the same block structure. If we continue to block-diagonalize until no further diagonalization is possible, we end up with blocks called irreducible unitary representations.¹

¹ Technically, we have defined the slightly easier to understand notion of indecomposability, which in this context implies irreducibility.

The IURs of a compact group can be indexed by a discrete index $\lambda$, and we denote the IURs as $U^\lambda(g)$. As an example, consider the 2D rotation group SO(2). Since SO(2) is commutative, its representation matrices can be jointly diagonalized, and so the IURs of SO(2) are $1 \times 1$ matrices:

$$U^\lambda_{00}(g) = e^{i\lambda g}. \tag{3}$$

They satisfy the composition rule $e^{i\lambda(g+h)} = e^{i\lambda g}e^{i\lambda h}$, which is the manifestation of eq. 2 for this representation. The standard Fourier series of a function on the circle is an expansion in terms of these matrix elements, which shows that standard Fourier analysis is a special case of the more general transform to be defined shortly.

The matrix elements of IURs of SO(3) are known as Wigner D-functions. They are defined for $\lambda = 0, 1, 2, \ldots$ and $-\lambda \le m, n \le \lambda$, so the irreducible representations are $(2\lambda + 1)$-dimensional. Wigner D-functions can be expressed as sums over products of sinusoids or complex exponentials (Pinchon & Hoggan, 2007), but the formulae are somewhat unwieldy, so it is easier to think only about their general properties. The most important general property of the matrix elements of IURs is that they are orthogonal:

$$\langle U^\lambda_{mn}, U^{\lambda'}_{m'n'} \rangle \equiv \int_G U^\lambda_{mn}(g)\, \overline{U^{\lambda'}_{m'n'}(g)}\, d\mu(g) = \frac{\delta_{\lambda\lambda'}\,\delta_{mm'}\,\delta_{nn'}}{\dim \lambda}. \tag{4}$$

Here $\mu$ is the normalized Haar measure, which is the natural way to measure volumes in $G$ (Sugiura, 1990), and $\dim\lambda$ is the dimension of the representation. One can verify that the complex exponentials $e^{i\lambda g}$ are indeed orthonormal with $d\mu(g) = \frac{dg}{2\pi}$ and $\dim\lambda = 1$.

Intuitively, the matrix elements are like a complete orthogonal basis for the space $L^2(G)$ of square integrable functions on $G$. That is, it can be proven that any function $f \in L^2(G)$ can be written as

$$f(g) = \sum_\lambda \sum_{mn} \eta^\lambda_{mn}\, T^\lambda_{mn}(g) \equiv [\mathcal{F}^{-1}\eta](g), \tag{5}$$

where $T^\lambda_{mn}(g) = \sqrt{\dim\lambda}\, U^\lambda_{mn}(g)$ are the $L^2$-normalized matrix elements and $\eta$ are the Fourier coefficients of $f$. Integrating eq. 5 against a matrix element and using orthonormality, we find:

$$\eta^\lambda_{mn} = \int_G f(g)\, \overline{T^\lambda_{mn}(g)}\, d\mu(g). \tag{6}$$

This is the (generalized) Fourier transform for compact groups, denoted $\eta = \mathcal{F}f$.

Fast and exact algorithms for the computation of Fourier coefficients from samples of bandlimited functions on the rotation groups SO(2) and SO(3) have been developed, and the theory required to construct such algorithms for general compact Lie groups is understood (Maslen & Rockmore, 1997). The group SO(2) is isomorphic to the circle, so for $G = \mathrm{SO}(2)$ equation 5 reduces to a standard Fourier series on the circle, for which the well-known $O(L \log L)$ FFT algorithm can be used. The SO(3) FFT has complexity $O(L^3 \log^2 L)$ for bandlimit $L$ (Maslen & Rockmore, 1997; Kostelec & Rockmore, 2008; Potts et al., 2009). This is a tremendous speedup compared to the naive $O(L^6)$ algorithm, and the algorithms presented in this paper would certainly not be feasible without the FFT.
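To make the SO(2) case concrete, the following minimal sketch (ours, not from the paper) verifies that for a band-limited function on the circle, the coefficients of eq. 6 are recovered exactly from grid samples by the ordinary FFT:

```python
import numpy as np

# For G = SO(2), the IUR matrix elements are e^{i lam g} (eq. 3), so the generalized
# Fourier transform (eq. 6) is an ordinary Fourier series, computable with the FFT.
L = 4                                    # band-limit of the test function
rng = np.random.default_rng(0)
eta = {lam: rng.normal() + 1j * rng.normal() for lam in range(-L, L + 1)}

N = 2 * L + 1                            # enough equispaced samples for exact recovery
g = 2 * np.pi * np.arange(N) / N         # grid on the circle
f = sum(eta[lam] * np.exp(1j * lam * g) for lam in range(-L, L + 1))

# With the normalized measure d mu(g) = dg / (2 pi), eq. 6 becomes
# eta_lam = (1/N) sum_k f(g_k) e^{-i lam g_k}, i.e. a plain DFT divided by N.
eta_hat = np.fft.fft(f) / N              # FFT bin k holds the frequency lam = k (mod N)

for lam in range(-L, L + 1):
    assert np.isclose(eta_hat[lam % N], eta[lam])    # exact up to rounding error
```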
In section 4 we discuss how these generalized FFT algorithms can be used to efficiently compute moments, but first we discuss the generalization of the Fourier transform on compact Lie groups to the Fourier transform on certain manifolds that are not groups.

### 3.3. Harmonic analysis on homogeneous spaces

A homogeneous space for a Lie group $G$ is a manifold $H$ such that for any two points $x, y \in H$ we can find a transformation $g \in G$ with $gx = y$. For example, the plane is a homogeneous space for the translation group, and the sphere is a homogeneous space for the 3D rotation group. The plane is not a homogeneous space for the 2D rotation group, because points at different radii cannot be rotated into each other.

If we pick an origin $o \in H$, such as the north pole of the sphere, we can identify any other point $h \in H$ by specifying how to transform the origin to get there: $h = go$. This identification will not be unique, though, if there is a nontrivial subgroup $K$ of $G$ containing transformations that leave the origin invariant: $K = \{k \in G \mid ko = o\}$. This is because if $h = go$, then also $h = gko$, so both $g$ and $gk$ identify $h$. On the sphere, for example, we can transform the north pole into a point $h$ by first doing an arbitrary rotation around the north-pole axis (which leaves the north pole unchanged) and then rotating the result to $h$.

Hence, one can think of the points $h = go$ in a homogeneous space $H$ as sets $gK = \{gk \mid k \in K\}$ (called cosets) of group elements that are equivalent with respect to their effect on an arbitrarily chosen origin $o$ of $H$. It follows that one can think of functions on a homogeneous space as functions on the group, with the special property that they are (right) invariant to transformations from $K$:

$$f(gk) = f(g) \quad \forall g \in G,\ k \in K, \tag{7}$$

because right-multiplication by $k$ will only shuffle the elements within each coset.

Finally, one can show (see supplementary material) that in a suitable basis, a subset of the matrix elements of IURs form a basis for the linear space of square-integrable functions on $G$ with this invariance property, which allows us to define the Fourier transform also for functions on $H$. The exact same equations (6 and 5) that define the Fourier and inverse Fourier transform for a compact Lie group define these transforms for a compact homogeneous space, but only a subset of the coefficients $\eta^\lambda_{mn}$ will be non-zero. For the sphere, the matrix elements $T^\lambda_{m0}$ are equal² to the spherical harmonics $Y^\lambda_m$, which form a basis for $L^2(S^2)$. Fast spherical Fourier transform algorithms were developed by Driscoll & Healy (1994).

² Various normalization and phase conventions are in use for the spherical harmonics, but it is enough to know that $Y^\lambda_m \propto T^\lambda_{m0}$.

### 3.4. Exponential families

An exponential family is a class of densities of the form:

$$p(g \mid \eta) = \frac{1}{Z_\eta} \exp\big(\eta \cdot T(g)\big). \tag{8}$$

It is determined by a choice of sufficient statistics $T$ that take the random variable $g$ and produce a vector of real statistics $T(g)$. To learn the parameters $\eta$, one can perform gradient-based optimization of the log-likelihood of a set of iid samples $g_1, \ldots, g_N$. The gradient is the moment discrepancy:

$$\nabla_\eta \frac{1}{N}\sum_{i=1}^N \ln p(g_i \mid \eta) = \bar{T} - \mathbb{E}_{p(g \mid \eta)}[T(g)], \tag{9}$$

where $\bar{T} = \frac{1}{N}\sum_{i=1}^N T(g_i)$ are the empirical moments. There is generally no closed form for the analytical moments (the expectation in eq. 9), so a numerical approximation is needed.

## 4. Harmonic Exponential Families

We define a harmonic exponential family on a group or homogeneous space as an exponential family where the sufficient statistics are given by a finite number of matrix elements of IURs.
This makes sense only if the function $\eta \cdot T(g)$ is real-valued, so that it can be interpreted as an unnormalized log-probability. The easiest way to guarantee this is to take $\eta$ to be real, and to use real functions obtained as a sparse linear combination of complex matrix elements as sufficient statistics $T(g)$. For example, $\frac{1}{2}(e^{i\lambda\theta} + e^{-i\lambda\theta}) = \cos(\lambda\theta)$. From here on, we take $T$ to be real, $L^2$-normalized functions and $\mathcal{F}$ the expansion of a real function in terms of the real basis functions $T$.

The key observation that leads to an efficient algorithm for computing the moments of a harmonic density is that the moments of such a density are its Fourier coefficients:

$$\mathbb{E}_{p(g \mid \eta)}[T(g)] = \int_G p(g \mid \eta)\, T(g)\, d\mu(g) = \mathcal{F}p. \tag{10}$$

Hence, one can obtain all $J$ moments at once by sampling $p$ on a finite grid and then computing its Fourier transform using a fast algorithm. As explained in section 4.1, the discretization error can be made extremely small using only $O(J)$ spatial samples.

However, even evaluating $p$ at a single position takes $O(J)$ computations when using $J$ sufficient statistics, so that the overall complexity is still $O(J^2)$. Furthermore, in order to evaluate $p$ we need to know the normalizing constant $Z_\eta$. The following derivation shows that we can work with the unnormalized density $\phi(g \mid \eta) = \exp(\eta \cdot T(g))$ instead:

$$[\mathcal{F}p]^\lambda_{mn} = \int_G p(g \mid \eta)\, T^\lambda_{mn}(g)\, d\mu(g) = \frac{1}{Z_\eta}\int_G \phi(g \mid \eta)\, T^\lambda_{mn}(g)\, d\mu(g) = \frac{[\mathcal{F}\phi]^\lambda_{mn}}{[\mathcal{F}\phi]^0_{00}}. \tag{11}$$

The last step uses the fact that $T^0_{00}(g) = 1$, so that $[\mathcal{F}\phi]^0_{00}$ is equal to the normalizing constant:

$$[\mathcal{F}\phi]^0_{00} = \int_G \phi(g \mid \eta)\, T^0_{00}(g)\, d\mu(g) = Z_\eta. \tag{12}$$

Next, observe that we can evaluate $\ln\phi$ efficiently at $O(J)$ spatial points using the inverse FFT:

$$\ln\phi(g \mid \eta) = \eta \cdot T(g) = [\mathcal{F}^{-1}\eta](g). \tag{13}$$

This computation is exact, because the log-density is bandlimited (i.e. there are only finitely many parameters). Element-wise exponentiation then gives us $\phi$ evaluated on a grid. So we have an efficient algorithm for computing moments:

1. Compute $\phi = \exp(\mathcal{F}^{-1}\eta)$.
2. Compute $M = \mathcal{F}\phi$.
3. Compute $\mathbb{E}_{p(g \mid \eta)}[T(g)] = M / M^0_{00}$.

To make this computation numerically stable for highly peaked densities, one should apply the log-Fourier-exp trick described in the supplementary material.

### 4.1. Approximation quality

Even though the Fourier coefficients are defined as definite integrals (eq. 6), the discrete FFT algorithms compute exact Fourier coefficients, provided the function from which the discrete samples were gathered is bandlimited. A function is bandlimited if the coefficients $\eta^\lambda_{mn}$ are zero for $\lambda$ greater than the band-limit $L$. Although the function $\ln\phi(g) = \eta \cdot T(g)$ is bandlimited, the function $\phi(g) = \exp(\eta \cdot T(g))$ is not, so the computed coefficients are not exactly equal to the Fourier coefficients of $\phi$. However, the function $\phi(g)$ is smooth (infinitely differentiable), and a standard result in Fourier analysis shows that the spectrum of a smooth function decays to zero asymptotically faster than $O(1/\lambda^n)$ for any $n$. So our function will be effectively bandlimited, in the sense that coefficients for $\lambda$ greater than some pseudo-bandlimit will have negligible values. If $L$ is the maximum degree of the sufficient statistics (the bandlimit of $\eta \cdot T(g)$), one can obtain near-exact moments by computing the FFT up to the pseudo-bandlimit $\alpha L$ for some oversampling factor $\alpha$. In practice, we use values for $\alpha$ ranging from 2 to 5.
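As an illustration (ours, not the paper's code), the sketch below implements the three steps for the circle $G = \mathrm{SO}(2)$; it uses the equivalent complex-exponential parameterization (with Hermitian-symmetric coefficients) because it maps directly onto numpy's FFT, and the function name and oversampling default are assumptions:

```python
import numpy as np

def circle_moments(eta, L, oversample=4):
    """Moments of the harmonic density p(theta) on the circle via the three FFT steps
    of section 4. `eta` is a complex array of length 2L+1 holding the natural
    parameters eta_lam for lam = -L..L (stored at index lam + L); the parameters must
    satisfy eta_{-lam} = conj(eta_lam) so that the log-density is real."""
    N = 2 * oversample * L + 1               # oversampled grid / pseudo-bandlimit (sec. 4.1)
    c = np.zeros(N, dtype=complex)
    for lam in range(-L, L + 1):
        c[lam % N] = eta[lam + L]            # place eta_lam in FFT frequency bin lam

    # Step 1: synthesize the log-density on the grid with an inverse FFT, then exponentiate.
    log_phi = N * np.fft.ifft(c)             # = sum_lam eta_lam e^{i lam theta_n}
    phi = np.exp(log_phi.real)               # imaginary part is ~0 by Hermitian symmetry

    # Step 2: unnormalized moments [F phi]_lam = (1/N) sum_n phi(theta_n) e^{-i lam theta_n}.
    M = np.fft.fft(phi) / N

    # Step 3: divide by the constant coefficient, which equals Z_eta (eq. 12); entry
    # lam + L then approximates E[e^{-i lam theta}], the lam-th Fourier coefficient of p.
    Z = M[0].real
    return np.array([M[lam % N] for lam in range(-L, L + 1)]) / Z
```

Plugging these moments into eq. 9, together with the matching empirical averages over the data, gives the gradient used for maximum-likelihood fitting.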
## 5. Harmonic Densities as Conjugate Priors

In this section we discuss the Bayesian transformation inference problem, where the goal is to infer a posterior over a Lie group of transformations given only a set of correspondence pairs (such as images before and after a camera motion). It turns out that the harmonic exponential families are the conjugate priors for this problem, and again, the generalized FFT is key to performing efficient inference.

The observed data in the Bayesian transformation inference problem are pairs of vectors $(x, y)$ in $\mathbb{R}^D$ that could represent images, space-time blocks of video, point clouds, optical-flow fields, fitted geometric primitives, parameters of a function, or other objects. In order to infer anything about a latent transformation $g$, we must know the group representation $R(g)$ that acts on the observed data. If our data is an image $x : \mathbb{R}^2 \to \mathbb{R}$, we get a representation on the Hilbert space in which $x$ lives: $[R(g)x](p) = x(g^{-1}p)$, where $p$ is a point in the plane. In the finite-dimensional analogue, where $x$ is a vector of pixel intensities, $R(g)$ will be close to a permutation matrix that takes each pixel to its proper new position. For compact groups³ this representation is unitary, and this is what we will assume for $R(g)$ from now on. If the representation is not known in advance, it can also be learned from data (Cohen & Welling, 2014).

³ This is more generally true for unimodular groups.

If we assume that observation $x$ is the $g$-transformed version of $y$ with some independent Gaussian noise with variance $\sigma^2$ added, the likelihood function is given by

$$p(x \mid y, g) = \mathcal{N}(x \mid R(g)y, \sigma^2). \tag{14}$$

As discussed in section 3.2, we can bring $R$ in block-diagonal form by a unitary change of basis: $U(g) = F^{-1}R(g)F$. The matrix $U$ is block diagonal, and the blocks $U^\lambda$ are equal to the $L^2$-normalized sufficient statistics $T^\lambda$ up to a scale factor: $T^\lambda(g) = \sqrt{\dim\lambda}\, U^\lambda(g)$. To simplify the computations, we shall work with data in this new basis: $\hat{x} = Fx$, so that $p(\hat{x} \mid \hat{y}, g) = \mathcal{N}(\hat{x} \mid U(g)\hat{y}, \sigma^2)$.

If we now choose as prior $p(g)$ a member of the harmonic exponential family on $G$, the posterior $p(g \mid \hat{x}, \hat{y})$ is of the same form as the prior:

$$
\begin{aligned}
p(g \mid \hat{x}, \hat{y}) &\propto p(\hat{x} \mid \hat{y}, g)\, p(g)\\
&\propto \exp\Big({-\tfrac{1}{2\sigma^2}}\|\hat{x} - U(g)\hat{y}\|^2 + \eta \cdot T(g)\Big)\\
&\propto \exp\Big(\tfrac{1}{\sigma^2}\hat{x}^\top U(g)\hat{y} + \eta \cdot T(g)\Big)\\
&= \exp\Big(\textstyle\sum_\lambda \big(\eta^\lambda + \tfrac{1}{\sigma^2\sqrt{\dim\lambda}}\,\hat{x}^\lambda (\hat{y}^\lambda)^\top\big) \cdot T^\lambda(g)\Big),
\end{aligned} \tag{15}
$$

i.e. we have a conjugate prior. The derivation relies on the unitarity of the representation: in expanding $\|\hat{x} - U(g)\hat{y}\|^2$, we find a term $\|U(g)\hat{y}\|^2$ which is equal to $\|\hat{y}\|^2$, making the dependence on $U(g)$ linear (as it is in the prior).

### 5.1. Example: Bayesian analysis of camera rotation

To make matters concrete, we show how to compute a posterior over the rotation group SO(3) given two images taken before and after a camera rotation. An image is modeled as a function $x : S^2 \to \mathbb{R}$ on the sphere, so that a camera rotation $g \in \mathrm{SO}(3)$ acts by rotating this function over the sphere: $[R(g)x](p) = x(g^{-1}p)$. We parameterize points $p \in S^2$ as $p = (\phi, \theta)$ for $\phi \in [0, 2\pi]$ and $\theta \in [0, \pi]$. Recall from section 3.3 that we can represent $p \in S^2$ by a coset representative $g_p \in \mathrm{SO}(3)$, which we parameterize using Euler angles as $g_p = (\phi, \theta, 0)$. The transformation $g_p$ takes the origin of the sphere to $p$. It is well known⁴ that in this context the matrix $F$ defined in the previous section as the matrix that block-diagonalizes the representation $R$ is given by the spherical Fourier transform $\mathcal{F}$, which can be computed by an FFT.
This means that if we represent $x$ by its Fourier coefficients $\hat{x} = \mathcal{F}x$, the coefficients of the rotated function $R(g)x(p) = x(g^{-1}p)$ are given by $U(g)\hat{x}$, where $U(g)$ is a block-diagonal matrix with irreducible representations $U^\lambda$ as blocks. As derived in the previous section, the parameters of the posterior can easily be obtained in this basis by a block-wise outer product.

⁴ The derivation can be found in the supplementary material.

Figure 1 shows the posterior $p(g \mid x, y)$ for two synthetic spherical images $x_1$ and $x_2$, and their rotations $y_1 = x_1$ (no rotation) and $y_2 = U(0, \pi/3, \pi/2)\,x_2$. The posterior is plotted as a 3D isocontour in the ZYZ-Euler angle parameter space $(\alpha, \beta, \gamma) \in [0, 2\pi] \times [0, \pi] \times [0, 2\pi]$. Although this plot looks like a box, the actual manifold has the topology of a projective 3-sphere. By construction, the density is spread through the parameter space in a way that is consistent with this topology: no discontinuities arise where wraparound in the parameter space occurs. This is desirable, because the wraparound is not an intrinsic property of the manifold.

Figure 1. Posterior distributions (top) for two correspondence pairs (bottom).

Due to the symmetry of these figures, certain rotations cannot be distinguished from the true rotation, so that the modes of the posterior distributions are supported on an entire subgroup of SO(3). Real images will not have such a high degree of symmetry, but it will nevertheless often be the case that a unique optimal transformation does not exist (Ma et al., 1999). Indeed, current keypoint-based transformation estimation methods can easily get confused by repeating structures in an image, such as several identical windows on a building. Although more experimental work is needed, our method has the theoretical advantage that besides keypoints (which form a group representation), it can make use of parts of the image that do not allow for keypoints to be reliably placed (such as edges), while always providing a truthful impression of the degree to which a unique transformation or subgroup can be identified from the data.

### 5.2. MAP inference

As Bayesians we are done here, but for some applications one may wish to find a single most likely transformation. To find the optimal transformation, first perform posterior inference (steps 1 and 2) and then maximize (step 3):

1. Compute $\hat{x} = \mathcal{F}x$ and $\hat{y} = \mathcal{F}y$.
2. Compute $\tilde{\eta}^\lambda = \eta^\lambda + \frac{1}{\sigma^2\sqrt{\dim\lambda}}\,\hat{x}^\lambda (\hat{y}^\lambda)^\top$.
3. Compute $g^* = \arg\max_i\, [\mathcal{F}^{-1}\tilde{\eta}](g_i)$.

The arg max ranges over all the points in a finite grid on $G$ used by the FFT synthesis $\mathcal{F}^{-1}$. Optionally, one can refine the optimum $g^*$ by performing a few steps of gradient-based optimization on $\tilde{\eta} \cdot T(g)$ to get sub-pixel accuracy.
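For intuition, here is a minimal sketch (ours, not the paper's code) of these three steps for the simplest case, $G = \mathrm{SO}(2)$ acting on periodic signals by cyclic shifts, where the block-wise outer product reduces to element-wise products of Fourier coefficients and the MAP step becomes phase correlation. The function name, the $e^{ikg}$ coefficient convention, and the flat default prior are our assumptions:

```python
import numpy as np

def map_rotation_so2(x, y, eta_prior=None, sigma2=0.1):
    """MAP estimate of the shift g in x(theta) ~= y(theta - g), following the three
    steps of section 5.2 for G = SO(2). `x` and `y` are real arrays of length N
    sampled on an equispaced grid; `eta_prior` holds prior natural parameters in
    FFT bin order (eta_prior[k] multiplies e^{i k g}); None means a flat prior."""
    N = len(x)
    if eta_prior is None:
        eta_prior = np.zeros(N, dtype=complex)

    # Step 1: Fourier coefficients of both signals (the generalized FT for SO(2)).
    x_hat = np.fft.fft(x) / N
    y_hat = np.fft.fft(y) / N

    # Step 2: conjugate-prior update; the 1x1 "blocks" make the outer product scalar.
    eta_post = eta_prior + x_hat * np.conj(y_hat) / sigma2

    # Step 3: synthesize the unnormalized log-posterior on the grid (inverse FFT)
    # and take the arg max; grid index i corresponds to g_i = 2*pi*i / N.
    log_post = np.real(N * np.fft.ifft(eta_post))
    return 2 * np.pi * int(np.argmax(log_post)) / N
```

As a sanity check, with a flat prior `map_rotation_so2(np.roll(y, s), y)` returns $g \approx 2\pi s / N$, the classical phase-correlation estimate of a cyclic shift; an informative harmonic prior simply adds its coefficients before the synthesis step.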
## 6. Experiments

### 6.1. Modelling the spatial distribution of earthquakes

We compare our model and MLE algorithm to a Kent Mixture Model (KMM) on the problem of modelling the spatial distribution of significant earthquakes on the surface of the earth. We obtained the Significant Earthquake Dataset (NGDC, 2015) from the National Geophysical Data Center of the National Oceanic and Atmospheric Administration. In total, the dataset contains 5780 earthquakes with complete information on the position of their epicenter, and 53 earthquakes whose coordinates are missing (these were discarded in our experiments). We did not model the severity of the earthquakes, but only the occurrence of significant earthquakes (as defined by NGDC (2015)).

#### 6.1.1. Mixture of Kent distributions

The 5-parameter Kent distribution (Mardia & Jupp, 1999) is the spherical analogue of the normal distribution with unconstrained covariance. Being unimodal, the Kent distribution is not flexible enough to describe complicated distributions such as the spatial distribution of earthquakes. The most flexible distribution on the sphere that we have found in the literature is the Kent Mixture Model, first described by Peel et al. (2001). The KMM is trained using the EM algorithm; we use the open source Python implementation of the EM algorithm for KMMs by Höfer (2014).

Unlike the harmonic densities, the log-likelihood of this model is not convex and contains many singularities where a mixture component concentrates on a single data point and decreases its variance indefinitely. For this reason, we perform randomly initialized restarts until the algorithm has found 10 non-degenerate solutions, of which we retain the one with the best cross-validation log-likelihood. No regularization was used, because for the models that could be trained within a reasonable amount of time, no overfitting was observed.

#### 6.1.2. The S² harmonic density

The harmonic density on the 2-sphere uses spherical harmonics as sufficient statistics. The empirical moments are easily computed using standard spherical harmonic routines, but we found that for high orders the SciPy routines are slow and numerically unstable. The supplementary material describes a simple, fast, and stable method for the evaluation of spherical harmonics. The computation of spherical harmonics up to band-limit $L = 200$ (for a total of $(L + 1)^2 = 40401$ spherical harmonics) for 5780 points on the sphere took half a minute using this method, and is performed only once for a given dataset.

For regularization we use a diagonal Gaussian prior on $\eta$, where the precision $\beta^\lambda_m$ corresponding to the coefficient of $Y^\lambda_m$ is given by $\beta^\lambda_m = \alpha \dim\lambda = \alpha(2\lambda + 1)$ (for some regularization parameter $\alpha$). This scheme is inspired by the fact that $\dim\lambda$ is the discrete Plancherel measure (Sugiura, 1990), and the empirical observation that the fitted coefficients become approximately uniform when weighted as $\eta^\lambda_m \sqrt{\dim\lambda}$. Note that adding regularization does not change the convexity of the objective function.

To find maximum a posteriori parameters $\hat{\eta}$ for the spherical harmonic density, we perform iterative gradient-based optimization on the log-posterior. The gradients (moment discrepancies) are computed using the FFT-based method described in section 4. We use the spherical FFT algorithm implemented in the NFFT library (Keiner et al., 2009; Kunis & Potts, 2003). The gradients are fed to a standard implementation of the L-BFGS algorithm.
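To tie the pieces together, the following is a schematic sketch (not the paper's code) of the MAP fitting procedure just described. The transform pair `synth`/`analyze`, the argument names, and the grid handling are assumptions standing in for a spherical-harmonic transform such as the one in the NFFT library; only the objective and gradient follow sections 4 and 6.1.2.

```python
import numpy as np
from scipy.optimize import minimize

def fit_spherical_harmonic_density(T_data, synth, analyze, n_coeffs, alpha, dims):
    """MAP fit of the S^2 harmonic density by L-BFGS, as sketched in section 6.1.2.

    T_data  : (N, n_coeffs) real spherical harmonics evaluated at the N data points.
    synth   : hypothetical callable implementing eq. 5 on an (oversampled) grid.
    analyze : hypothetical callable implementing eq. 6 for functions sampled on that grid,
              returning n_coeffs coefficients with the constant term first.
    dims    : (n_coeffs,) array with dim(lambda) = 2*lambda + 1 for each coefficient.
    alpha   : regularization strength; the Gaussian prior precision is alpha * dims.
    """
    T_bar = T_data.mean(axis=0)                  # empirical moments (eq. 9)
    precision = alpha * dims                     # beta_lm = alpha * (2*lambda + 1)

    def neg_log_posterior(eta):
        phi = np.exp(synth(eta))                 # unnormalized density on the grid
        M = analyze(phi)                         # all moments at once (section 4)
        Z = M[0]                                 # [F phi]^0_00 = Z_eta  (eq. 12)
        moments = M / Z
        # Penalized negative average log-likelihood and its gradient (moment discrepancy).
        nll = -(eta @ T_bar - np.log(Z)) + 0.5 * np.sum(precision * eta ** 2)
        grad = -(T_bar - moments) + precision * eta
        return nll, grad

    res = minimize(neg_log_posterior, np.zeros(n_coeffs), jac=True, method="L-BFGS-B")
    return res.x
```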
#### 6.1.3. Results

Figure 2 shows the average train and test log-likelihood over 5 cross-validation folds for the spherical harmonic density and the mixture of Kent distributions. The plotted values correspond to the regularization settings that yielded the best test log-likelihood. The KMM reached an average test log-likelihood of $-0.37$ (with standard deviation 0.03 over 5 cross-validation folds) using 70 mixture components ($5 \cdot 70 + 69 = 419$ parameters). The harmonic density reached an average test log-likelihood of $+0.3$ (with standard deviation 0.036 over 5 cross-validation folds), using bandlimit $L = 140$ (19880 parameters). The HD still outperforms the KMM when given a similar number of parameters: for $L = 20$ (440 vs. 419 parameters), the log-likelihood is $-0.28 > -0.37$, with standard deviation 0.028, and for $L = 19$ (399 parameters), the log-likelihood is $-0.3$ with standard deviation 0.03. The dataset and the learned densities are plotted in figure 3, clearly showing the superiority of the harmonic density.

## 7. Discussion and Future Work

What could explain the difference in log-likelihood between our model and the KMM? We believe two interrelated factors are driving this difference: the expressiveness of the model and the ease of optimization. Leaving technicalities aside, it is clear that both the spherical harmonic exponential family and the KMM can approximate any well-behaved density, given enough parameters and an optimization oracle. However, as is clear from figure 2, training time becomes prohibitive for the KMM for more than 70 mixture components / 419 parameters, while the harmonic density can efficiently be fit for tens of thousands of parameters.

Figure 2. Top: cross-validation train (dashed line) and test (solid line) log-likelihood for the Harmonic Density (HD) and Kent Mixture Model (KMM). Bottom: number of parameters versus training time for both models.

Furthermore, the KMM training algorithm (EM) can easily get stuck in local optima, or converge on a degenerate solution. This is the main reason for the poor runtime performance; while the KMM code could be further optimized, it is the fact that so many restarts are required to find a good fit that makes the algorithm slow. The log-likelihood function of the harmonic density, on the other hand, is convex, and the L-BFGS optimizer will typically converge to the global optimum in some 20 to 100 iterations.

As can be seen in figure 3, the harmonic density produces slight ringing artifacts that can be seen only in a log-plot such as this. These are the result of the limited bandwidth of the log-density, and will become progressively less pronounced as the number of parameters is increased. While they are clearly visible in log-space, the actual difference between peaks and valleys is on the order of $10^{-3}$ for bandlimit $L = 100$. The artifacts are not visible on a non-logarithmic plot (and in such a plot the KMM density is hardly visible at all when plotted on the same intensity scale as the harmonic density, because the peaks are much lower). The harmonic density also tends to prefer heavier tails, which is probably accurate for many problems.

Figure 3. (a) Significant Earthquake Dataset, (b) Harmonic Density, (c) Kent Mixture Model. Log-probability for the harmonic density and Kent mixture model, plotted using a perceptually accurate linear-lightness colormap on the same intensity scale.

An interesting direction for future work is the extension to non-compact groups. While the mathematical theory becomes much more technical for such groups, Kyatkin & Chirikjian (2000) have already succeeded in developing FFT algorithms for the Euclidean motion group, which is non-compact. From there it should be relatively straightforward to develop harmonic densities on the Euclidean group. Harmonic densities on the Euclidean, affine or even projective group should find many applications in robotics and computer vision (see e.g. Kyatkin & Chirikjian (1999)).

## 8. Conclusion

We have studied a class of exponential families on compact Lie groups and homogeneous manifolds, which we call harmonic exponential families.
We have shown that for these families, maximum likelihood inference, posterior inference and mode seeking can be implemented using very efficient generalized Fast Fourier Transform algorithms. In the Bayesian setting, we have shown that harmonic exponential families appear naturally as conjugate priors in the generic transformation inference problem. Our experiments show that training harmonic densities is fast even for very large numbers of parameters, and that far superior likelihood can be achieved using these models.

## Acknowledgements

This research was supported by NWO (grant number NAI.14.108), Facebook and Google.

## References

Beran, Rudolf. Exponential Models for Directional Data. The Annals of Statistics, 1979.

Chirikjian, G.S. and Kyatkin, A.B. Engineering Applications of Noncommutative Harmonic Analysis. CRC Press, 1st edition, May 2001. ISBN 9781420041767.

Cohen, T. and Welling, M. Learning the Irreducible Representations of Commutative Lie Groups. In International Conference on Machine Learning (ICML), volume 32, 2014.

Diaconis, P. Group representations in probability and statistics. Hayward, CA: Institute of Mathematical Statistics, 1988. ISBN 0940600145.

Driscoll, J.R. and Healy, D.M. Computing Fourier transforms and convolutions on the 2-sphere. Advances in Applied Mathematics, 1994.

Gatto, Riccardo and Jammalamadaka, Sreenivasa Rao. The generalized von Mises distribution. Statistical Methodology, 4(3):341–353, 2007. ISSN 1572-3127.

Goodman, Roe and Wallach, Nolan R. Symmetry, Representations, and Invariants. Springer, 2009. ISBN 978-0-387-79852-3.

Höfer, Sebastian. Kent mixture model for scikit-learn, 2014. URL https://github.com/shoefer/scikit-learn.

Keiner, J., Kunis, S., and Potts, D. Using NFFT 3 - a software library for various nonequispaced fast Fourier transforms. ACM Transactions on Mathematical Software, 36:1–30, 2009.

Kondor, Risi. Group theoretical methods in machine learning. PhD thesis, Columbia University, 2008.

Kostelec, Peter J. and Rockmore, Daniel N. FFTs on the rotation group. Journal of Fourier Analysis and Applications, 14(2):145–179, 2008.

Kunis, Stefan and Potts, Daniel. Fast spherical Fourier algorithms. Journal of Computational and Applied Mathematics, 161:75–98, 2003.

Kyatkin, A.B. and Chirikjian, G.S. Computation of Robot Configuration and Workspaces via the Fourier Transform on the Discrete-Motion Group. The International Journal of Robotics Research, 1999.

Kyatkin, Alexander B. and Chirikjian, Gregory S. Algorithms for Fast Convolutions on Motion Groups. Applied and Computational Harmonic Analysis, 9(2):220–241, September 2000.

Ma, Y., Soatto, S., Košecká, J., and Sastry, S. Euclidean reconstruction and reprojection up to subgroups. In Proceedings of the Seventh IEEE International Conference on Computer Vision (ICCV), volume 2, pp. 773–780. IEEE, 1999.

Mardia, K. V. and Jupp, P. E. Directional Statistics. John Wiley & Sons, 1st edition, 1999. ISBN 0471953334.

Maslen, D.K. and Rockmore, D.N. Generalized FFTs: a survey of some recent results. Groups and Computation II, 1997.

NGDC. National Geophysical Data Center / World Data Service: Significant Earthquake Database. [accessed 01-18-2015], 2015. URL http://www.ngdc.noaa.gov/nndc/struts/form?t=101650&s=1&d=1.

Peel, David, Whiten, William J., and McLachlan, Geoffrey J. Fitting Mixtures of Kent Distributions to Aid in Joint Set Identification. Journal of the American Statistical Association, 96(453):56–63, 2001. ISSN 0162-1459. doi: 10.1198/016214501750332974.
Pinchon, Didier and Hoggan, Philip E. Rotation matrices for real spherical harmonics: general rotations of atomic orbitals in space-fixed axes. Journal of Physics A: Mathematical and Theoretical, 40(7):1597–1610, February 2007. ISSN 1751-8113.

Potts, Daniel, Prestin, J., and Vollrath, A. A fast algorithm for nonequispaced Fourier transforms on the rotation group. Numerical Algorithms, pp. 1–28, 2009.

Sugiura, Mitsuo. Unitary Representations and Harmonic Analysis. John Wiley & Sons, New York, London, Sydney, Toronto, 2nd edition, 1990.

Wainwright, Martin J. and Jordan, Michael I. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2007. ISSN 1935-8237.