A Locally Adaptive Normal Distribution

Georgios Arvanitidis, Lars Kai Hansen and Søren Hauberg
Technical University of Denmark, Lyngby, Denmark
DTU Compute, Section for Cognitive Systems
{gear,lkai,sohau}@dtu.dk

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

The multivariate normal density is a monotonic function of the distance to the mean, and its ellipsoidal shape is due to the underlying Euclidean metric. We suggest replacing this metric with a locally adaptive, smoothly changing (Riemannian) metric that favors regions of high local density. The resulting locally adaptive normal distribution (LAND) is a generalization of the normal distribution to the manifold setting, where data is assumed to lie near a potentially low-dimensional manifold embedded in R^D. The LAND is parametric, depending only on a mean and a covariance, and is the maximum entropy distribution under the given metric. The underlying metric is, however, non-parametric. We develop a maximum likelihood algorithm to infer the distribution parameters that relies on a combination of gradient descent and Monte Carlo integration. We further extend the LAND to mixture models, and provide the corresponding EM algorithm. We demonstrate the efficiency of the LAND at fitting non-trivial probability distributions over both synthetic data and EEG measurements of human sleep.

1 Introduction

The multivariate normal distribution is a fundamental building block in many machine learning algorithms, and its well-known density can compactly be written as

\[ p(x \mid \mu, \Sigma) \propto \exp\left( -\tfrac{1}{2} \operatorname{dist}^2_{\Sigma}(\mu, x) \right), \tag{1} \]

where dist²_Σ(µ, x) denotes the squared Mahalanobis distance for covariance matrix Σ. This distance measure corresponds to the length of the straight line connecting µ and x, and consequently the normal distribution is often used to model linear phenomena. When data lies near a nonlinear manifold embedded in R^D, the normal distribution becomes inadequate due to its linear metric. We investigate whether a useful distribution can be constructed by replacing the linear distance function with a nonlinear counterpart. This is similar in spirit to Isomap [21], which famously replaces the linear distance with a geodesic distance measured over a neighborhood graph spanned by the data, thereby allowing for a nonlinear model. This is, however, a discrete distance measure that is only well-defined over the training data. For a generative model, we need a continuously defined metric over the entire R^D. Following Hauberg et al. [9] we learn a smoothly changing metric that favors regions of high density, i.e., geodesics tend to move near the data. Under this metric, the data space is interpreted as a D-dimensional Riemannian manifold. This manifold learning does not change dimensionality, but merely provides a local description of the data. The Riemannian viewpoint, however, gives a strong mathematical foundation upon which the proposed distribution can be developed. Our work thus bridges work on statistics on Riemannian manifolds [15, 23] with manifold learning [21].

Figure 1: Illustration of the LAND using MNIST images of the digit 1 projected onto the first 2 principal components. Left: comparison of the geodesic and the linear distance. Center: the proposed locally adaptive normal distribution. Right: the Euclidean normal distribution.
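As a point of reference for what follows, here is a minimal NumPy sketch (ours, not from the paper) of the Euclidean baseline in Eq. 1, whose density depends on x only through the squared Mahalanobis distance:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance dist^2_Sigma(mu, x)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

def normal_density(x, mu, Sigma):
    """Euclidean normal density of Eq. 1 with its closed-form normalizer."""
    D = mu.shape[0]
    Z = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / Z

# The density is a monotone function of the distance, hence its ellipsoidal contours.
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(normal_density(np.array([1.0, -0.5]), mu, Sigma))
```

The LAND keeps exactly this functional form but swaps the straight-line distance for a geodesic one; the price is that the normalizer no longer has a closed form (Section 3.2).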
We develop a locally adaptive normal distribution (LAND) as follows: First, we construct a metric that captures the nonlinear structure of the data and enables us to compute geodesics; from this, an unnormalized density is trivially defined. Second, we propose a scalable Monte Carlo integration scheme for normalizing the density with respect to the measure induced by the metric. Third, we develop a gradient-based algorithm for maximum likelihood estimation on the learned manifold. We further consider a mixture of LANDs and provide the corresponding EM algorithm. The usefulness of the model is verified on both synthetic data and EEG measurements of human sleep stages.

Notation: all points x ∈ R^D are considered as column vectors, and they are denoted with bold lowercase characters. S^D_{++} represents the set of symmetric D × D positive definite matrices. The learned Riemannian manifold is denoted ℳ, and its tangent space at x ∈ ℳ is denoted T_xℳ.

2 A Brief Summary of Riemannian Geometry

We start our exposition with a brief review of Riemannian manifolds [6]. These smooth manifolds are naturally equipped with a distance measure, and are commonly used to model physical phenomena such as dynamical or periodic systems, and many problems that have a smooth behavior.

Definition 1. A smooth manifold ℳ together with a Riemannian metric M : ℳ → S^D_{++} is called a Riemannian manifold. The Riemannian metric M encodes a smoothly changing inner product ⟨u, M(x) v⟩ between tangent vectors u, v ∈ T_xℳ at each point x ∈ ℳ.

Remark 1. The Riemannian metric M(x) acts on tangent vectors, and may thus be interpreted as a standard Mahalanobis metric restricted to an infinitesimal region around x.

The local inner product based on M is a suitable model for capturing the local behavior of data, i.e. manifold learning. From the inner product, we can define geodesics as length-minimizing curves connecting two points x, y ∈ ℳ, i.e.

\[ \hat{\gamma} = \operatorname*{argmin}_{\gamma} \int_0^1 \sqrt{\langle \gamma'(t),\, M(\gamma(t))\, \gamma'(t) \rangle}\, dt, \quad \text{s.t. } \gamma(0) = x,\ \gamma(1) = y. \tag{2} \]

Here M(γ(t)) is the metric tensor at γ(t), and the tangent vector γ′ denotes the derivative (velocity) of γ. The distance between x and y is defined as the length of the geodesic. A standard result from differential geometry is that the geodesic can be found as the solution to a system of 2nd order ordinary differential equations (ODEs) [6, 9]:

\[ \gamma''(t) = -\tfrac{1}{2} M(\gamma(t))^{-1} \left[ \frac{\partial \operatorname{vec}[M(\gamma(t))]}{\partial \gamma(t)} \right]^{\top} \big( \gamma'(t) \otimes \gamma'(t) \big), \quad \text{subject to } \gamma(0) = x,\ \gamma(1) = y. \tag{3} \]

Here vec[·] stacks the columns of a matrix into a vector and ⊗ is the Kronecker product.

Figure 2: An illustration of the exponential map y = Exp_x(v), the logarithm map v = Log_x(y), and the geodesic γ(t).

This differential equation allows us to define basic operations on the manifold. The exponential map at a point x takes a tangent vector v ∈ T_xℳ to y = Exp_x(v) ∈ ℳ such that the curve γ(t) = Exp_x(t·v) is a geodesic originating at x with initial velocity v and length ‖v‖. The inverse mapping, which takes y to T_xℳ, is known as the logarithm map and is denoted Log_x(y). By definition, ‖Log_x(y)‖ corresponds to the geodesic distance from x to y. These operations are illustrated in Fig. 2. The exponential and the logarithm map can be computed by solving Eq. 3 numerically, as an initial value problem (IVP) or a boundary value problem (BVP) respectively. In practice the IVPs are substantially faster to compute than the BVPs.
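The paper does not include code, but Eq. 3 is straightforward to integrate numerically. The following is a minimal sketch (ours, not the authors') of the exponential map computed as an IVP; `metric` and `dvecM` are assumed user-supplied callables returning M(x) and the Jacobian ∂vec[M(x)]/∂x.

```python
import numpy as np
from scipy.integrate import solve_ivp

def exp_map(x, v, metric, dvecM):
    """Exponential map Exp_x(v): integrate the geodesic ODE of Eq. 3 as an IVP.

    metric(x) -> (D, D) metric tensor M(x)
    dvecM(x)  -> (D*D, D) Jacobian of vec[M(x)] with respect to x
    """
    D = x.shape[0]

    def ode(t, state):
        gamma, dgamma = state[:D], state[D:]
        M = metric(gamma)
        # Right-hand side of Eq. 3: the geodesic's acceleration.
        ddgamma = -0.5 * np.linalg.solve(M, dvecM(gamma).T @ np.kron(dgamma, dgamma))
        return np.concatenate([dgamma, ddgamma])

    sol = solve_ivp(ode, (0.0, 1.0), np.concatenate([x, v]))
    return sol.y[:D, -1]  # gamma(1) = Exp_x(v)
```

The logarithm map would instead treat Eq. 3 as a BVP, e.g. via a shooting method over the initial velocity v, which is what makes it the more expensive of the two operations.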
The Mahalanobis distance is naturally extended to Riemannian manifolds as dist²_Σ(x, y) = ⟨Log_x(y), Σ⁻¹ Log_x(y)⟩. From this, Pennec [15] considered the Riemannian normal distribution

\[ p_{\mathcal{M}}(x \mid \mu, \Sigma) = \frac{1}{C(\mu, \Sigma)} \exp\left( -\tfrac{1}{2} \langle \operatorname{Log}_{\mu}(x),\, \Sigma^{-1} \operatorname{Log}_{\mu}(x) \rangle \right), \quad x \in \mathcal{M}, \tag{4} \]

where C(µ, Σ) is the normalization constant, and showed that it is the manifold-valued distribution with maximum entropy subject to a known mean and covariance. This distribution is an instance of Eq. 1 and is the distribution we consider in this paper. Next, we consider standard intrinsic least squares estimates of µ and Σ.

2.1 Intrinsic Least Squares Estimators

Let the data be generated from an unknown probability distribution q_ℳ(x) on a manifold. Then it is common [15] to define the intrinsic mean of the distribution as the point that minimizes the variance,

\[ \hat{\mu} = \operatorname*{argmin}_{\mu \in \mathcal{M}} \int_{\mathcal{M}} \operatorname{dist}^2(\mu, x)\, q_{\mathcal{M}}(x)\, d\mathcal{M}(x), \tag{5} \]

where dℳ(x) is the measure (or infinitesimal volume element) induced by the metric. Based on the mean, a covariance matrix can be defined as

\[ \hat{\Sigma} = \int_{\mathcal{D}(\hat{\mu})} \operatorname{Log}_{\hat{\mu}}(x)\, \operatorname{Log}_{\hat{\mu}}(x)^{\top}\, q_{\mathcal{M}}(x)\, d\mathcal{M}(x), \tag{6} \]

where D(µ̂) ⊆ T_µ̂ℳ is the domain over which the logarithm map is well-defined. For the manifolds we consider, this domain is R^D. Practical estimators of µ̂ rely on gradient-based optimization to find a local minimizer of Eq. 5, which is well-defined [12]. For finite data {x_n}_{n=1}^N, the descent direction is proportional to v̂ = ∑_{n=1}^N Log_µ(x_n) ∈ T_µℳ, and the updated mean is a point on the geodesic curve γ(t) = Exp_µ(t·v̂). After estimating the mean, the empirical covariance matrix is estimated as Σ̂ = 1/(N−1) ∑_{n=1}^N Log_µ̂(x_n) Log_µ̂(x_n)ᵀ. It is worth noting that even though these estimators are natural, they are not maximum likelihood estimates for the Riemannian normal distribution (4). In practice, the intrinsic mean often falls in regions of low data density [8]. For instance, consider data distributed uniformly on the equator of a sphere; then the optima of Eq. 5 are the two poles. Consequently, the empirical covariance is often overestimated.

3 A Locally Adaptive Normal Distribution

We now have the tools to define a locally adaptive normal distribution (LAND): we replace the linear Euclidean distance with a locally adaptive Riemannian distance and study the corresponding Riemannian normal distribution (4). By learning a Riemannian manifold and using its structure to estimate distributions of the data, we provide a new and useful link between Riemannian statistics and manifold learning.

3.1 Constructing a Metric

In the context of manifold learning, Hauberg et al. [9] suggest modeling the local behavior of the data manifold via a locally-defined Riemannian metric. Here we propose to use a local covariance matrix to represent the local structure of the data. We only consider diagonal covariances for computational efficiency and to prevent overfitting. The locality of the covariance is defined via an isotropic Gaussian kernel of size σ. Thus, the metric tensor at x ∈ ℳ is defined as the inverse of a local diagonal covariance matrix with entries

\[ M_{dd}(x) = \left( \sum_{n=1}^{N} w_n(x)\, (x_{nd} - x_d)^2 + \rho \right)^{-1}, \quad \text{with} \quad w_n(x) = \exp\left( -\frac{\lVert x_n - x \rVert_2^2}{2\sigma^2} \right). \tag{7} \]

Here x_{nd} is the dth dimension of the nth observation, and ρ is a regularization parameter to avoid singular covariances. This defines a smoothly changing (hence Riemannian) metric that captures the local structure of the data. It is easy to see that if x is outside of the support of the data, then the metric tensor is large. Thus, geodesics are pulled towards the data, where the metric is small. Note that the proposed metric is not invariant to linear transformations. While we restrict our attention to this particular choice, other learned metrics are equally applicable, cf. [22, 9].
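To make Eq. 7 concrete, here is a small NumPy sketch (ours) of the diagonal metric tensor; `sigma` and `rho` play the roles of σ and ρ above, with arbitrary default values.

```python
import numpy as np

def local_metric(x, X, sigma=0.1, rho=1e-4):
    """Diagonal metric tensor M(x) of Eq. 7, from data X with shape (N, D)."""
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))   # w_n(x)
    local_var = w @ (X - x) ** 2 + rho                               # sum_n w_n(x)(x_nd - x_d)^2 + rho
    return np.diag(1.0 / local_var)                                  # inverse local diagonal covariance
```

Far from the data all weights w_n(x) vanish, the local variances collapse to ρ, and the metric entries grow to 1/ρ; this is precisely the mechanism that pulls geodesics towards the support of the data.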
3.2 Estimating the Normalization Constant

The normalization constant of Eq. 4 is by definition

\[ C(\mu, \Sigma) = \int_{\mathcal{M}} \exp\left( -\tfrac{1}{2} \langle \operatorname{Log}_{\mu}(x),\, \Sigma^{-1} \operatorname{Log}_{\mu}(x) \rangle \right) d\mathcal{M}(x), \tag{8} \]

where dℳ(x) denotes the measure induced by the Riemannian metric. The constant C(µ, Σ) depends not only on the covariance matrix, but also on the mean of the distribution and the curvature of the manifold (captured by the logarithm map). For a general learned manifold, C(µ, Σ) is inaccessible in closed-form and we resort to numerical techniques. We start by rewriting Eq. 8 as

\[ C(\mu, \Sigma) = \int_{T_{\mu}\mathcal{M}} \sqrt{\det M(\operatorname{Exp}_{\mu}(v))}\, \exp\left( -\tfrac{1}{2} \langle v, \Sigma^{-1} v \rangle \right) dv. \tag{9} \]

In effect, we integrate the distribution over the tangent space T_µℳ instead of directly over the manifold. This transformation relies on the fact that the volume of an infinitesimally small region on the manifold can be computed in the tangent space if we take the deformation of the metric into account [15]. This deformation is captured by the measure, which, in the tangent space, is dℳ(x) = √det M(Exp_µ(v)) dv. For notational simplicity we define the function m(µ, v) = √det M(Exp_µ(v)), which intuitively captures the cost for a point to be outside the data support (m is large in low density areas and small where the density is high).

Figure 3: Comparison of the LAND mean and the intrinsic least squares mean.

We estimate the normalization constant (9) using Monte Carlo integration. We first multiply and divide the integral by the normalization constant of the Euclidean normal distribution, Z = √((2π)^D |Σ|). Then, the integral becomes an expectation, which can be estimated numerically as

\[ C(\mu, \Sigma) = Z\, \mathbb{E}_{\mathcal{N}(0, \Sigma)}[m(\mu, v)] \approx \frac{Z}{S} \sum_{s=1}^{S} m(\mu, v_s), \quad v_s \sim \mathcal{N}(0, \Sigma), \tag{10} \]

where S is the number of samples on T_µℳ. The computationally expensive element is to evaluate m, which in turn requires evaluating Exp_µ(v). This amounts to solving an IVP numerically, which is fairly fast. Had we performed the integration directly on the manifold (8), we would have had to evaluate the logarithm map, which is a much more expensive BVP. The tangent space integration, thus, scales better.

3.3 Inferring Parameters

Assuming an independent and identically distributed dataset {x_n}_{n=1}^N, we can write their joint distribution as p_ℳ(x_1, ..., x_N) = ∏_{n=1}^N p_ℳ(x_n | µ, Σ). We find parameters µ and Σ by maximum likelihood, which we implement by minimizing the mean negative log-likelihood

\[ \{\hat{\mu}, \hat{\Sigma}\} = \operatorname*{argmin}_{\mu \in \mathcal{M},\ \Sigma \in S^D_{++}} \varphi(\mu, \Sigma) = \operatorname*{argmin}_{\mu \in \mathcal{M},\ \Sigma \in S^D_{++}} \frac{1}{2N} \sum_{n=1}^{N} \langle \operatorname{Log}_{\mu}(x_n),\, \Sigma^{-1} \operatorname{Log}_{\mu}(x_n) \rangle + \log C(\mu, \Sigma). \tag{11} \]

The first term of the objective function φ : ℳ × S^D_{++} → R is a data-fitting term, while the second can be seen as a force that both pulls the mean closer to the high density areas and shrinks the covariance. Specifically, when the mean is in low density areas, as well as when the covariance gives significant probability to those areas, the value of m(µ, v) will by construction be large. Consequently, C(µ, Σ) will increase and these solutions will be penalized. In practice, we find that the maximum likelihood LAND mean generally avoids low density regions, which is in contrast to the standard intrinsic least squares mean (5), see Fig. 3.

Algorithm 1 LAND maximum likelihood
Input: the data {x_n}_{n=1}^N, stepsizes α_µ, α_A
Output: the estimated µ̂, Σ̂, Ĉ(µ̂, Σ̂)
1: initialize µ_0, Σ_0 and t ← 0
2: repeat
3:   estimate C(µ_t, Σ_t) using Eq. 10
4:   compute d_µφ(µ_t, Σ_t) using Eq. 12
5:   µ_{t+1} ← Exp_{µ_t}(α_µ d_µφ(µ_t, Σ_t))
6:   estimate C(µ_{t+1}, Σ_t) using Eq. 10
7:   compute ∇_A φ(µ_{t+1}, Σ_t) using Eq. 13
8:   A_{t+1} ← A_t − α_A ∇_A φ(µ_{t+1}, Σ_t)
9:   Σ_{t+1} ← [(A_{t+1})ᵀ A_{t+1}]⁻¹
10:  t ← t + 1
11: until ‖φ(µ_{t+1}, Σ_{t+1}) − φ(µ_t, Σ_t)‖₂² ≤ ε
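Steps 3 and 6 of Algorithm 1 are exactly Eq. 10. The following sketch (ours) illustrates them, assuming the hypothetical helpers from the earlier snippets, here wrapped so that `exp_map(mu, v)` already closes over the metric and its derivatives.

```python
import numpy as np

def land_normalizer(mu, Sigma, data, exp_map, local_metric, S=3000, rng=None):
    """Monte Carlo estimate of C(mu, Sigma) from Eq. 10."""
    rng = np.random.default_rng() if rng is None else rng
    D = mu.shape[0]
    Z = np.sqrt((2.0 * np.pi) ** D * np.linalg.det(Sigma))   # Euclidean normalizer
    v = rng.multivariate_normal(np.zeros(D), Sigma, size=S)  # v_s ~ N(0, Sigma)
    # m(mu, v_s) = sqrt(det M(Exp_mu(v_s))): the volume deformation at Exp_mu(v_s).
    m = np.array([np.sqrt(np.linalg.det(local_metric(exp_map(mu, vs), data)))
                  for vs in v])
    return Z * m.mean()
```

The default S = 3000 matches the number of Monte Carlo samples used in the experiments. With C(µ, Σ) in hand, the mean negative log-likelihood φ of Eq. 11 can be evaluated, which is all Algorithm 1 needs besides the two gradients given below.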
In practice we optimize φ using block coordinate descent: we optimize the mean keeping the covariance fixed, and vice versa. Unfortunately, both of the sub-problems are non-convex, and unlike the linear normal distribution, they lack a closed-form solution. Since the logarithm map is a differentiable function, we can use gradient-based techniques to infer µ and Σ. Below we give the descent direction for µ and Σ; the corresponding optimization scheme is given in Algorithm 1. Initialization is discussed in the supplements.

Optimizing µ: the objective function is differentiable with respect to µ [6], and using that ∇_µ ⟨Log_µ(x), Σ⁻¹ Log_µ(x)⟩ = −2 Σ⁻¹ Log_µ(x), we get the gradient

\[ \nabla_{\mu} \varphi(\mu, \Sigma) = -\Sigma^{-1} \left[ \frac{1}{N} \sum_{n=1}^{N} \operatorname{Log}_{\mu}(x_n) - \frac{Z}{C(\mu, \Sigma)\, S} \sum_{s=1}^{S} m(\mu, v_s)\, v_s \right]. \tag{12} \]

It is easy to see that this gradient is highly dependent on the condition number of Σ. We find that this, at times, makes the gradient unstable, and we choose to use the steepest descent direction instead of the gradient direction. This is equal to d_µφ(µ, Σ) = −Σ ∇_µφ(µ, Σ) (see supplements).

Optimizing Σ: since the covariance matrix is by definition constrained to be in the space S^D_{++}, a common trick is to decompose the matrix as Σ⁻¹ = Aᵀ A and optimize the objective with respect to A. The gradient with respect to this factor is (see supplements for the derivation)

\[ \nabla_{A} \varphi(\mu, \Sigma) = A \left[ \frac{1}{N} \sum_{n=1}^{N} \operatorname{Log}_{\mu}(x_n) \operatorname{Log}_{\mu}(x_n)^{\top} - \frac{Z}{C(\mu, \Sigma)\, S} \sum_{s=1}^{S} m(\mu, v_s)\, v_s v_s^{\top} \right]. \tag{13} \]

Here the first term fits the given data by increasing the size of the covariance matrix, while the second term regularizes the covariance towards a small matrix.

3.4 Mixture of LANDs

At this point we can find maximum likelihood estimates of the LAND model. We can easily extend this to mixtures of LANDs: following the derivation of the standard Gaussian mixture model [3], our objective function for inferring the parameters of the LAND mixture model is

\[ \varphi(\Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left[ \tfrac{1}{2} \langle \operatorname{Log}_{\mu_k}(x_n),\, \Sigma_k^{-1} \operatorname{Log}_{\mu_k}(x_n) \rangle + \log C(\mu_k, \Sigma_k) - \log \pi_k \right], \tag{14} \]

where Θ = {µ_k, Σ_k}_{k=1}^K, and r_nk = π_k p_ℳ(x_n | µ_k, Σ_k) / ∑_{l=1}^K π_l p_ℳ(x_n | µ_l, Σ_l) is the probability that x_n is generated by the kth component, with ∑_{k=1}^K π_k = 1 and π_k ≥ 0. The corresponding EM algorithm is in the supplements.
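Although the full EM algorithm is in the supplements, the E-step follows the familiar Gaussian mixture pattern. Below is a minimal sketch (ours) of the responsibilities r_nk of Eq. 14, assuming a hypothetical `land_logpdf(x, mu, Sigma, C)` that evaluates the log of Eq. 4 with a precomputed normalization constant.

```python
import numpy as np

def responsibilities(X, means, covs, norm_consts, weights, land_logpdf):
    """E-step of the LAND mixture: r_nk of Eq. 14, computed in log-space for stability."""
    N, K = X.shape[0], len(weights)
    log_r = np.empty((N, K))
    for k in range(K):
        for n in range(N):
            log_r[n, k] = np.log(weights[k]) + land_logpdf(X[n], means[k], covs[k], norm_consts[k])
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # normalize rows so that sum_k r_nk = 1
```

The M-step then presumably amounts to a responsibility-weighted run of Algorithm 1 per component; the exact scheme is given in the paper's supplements.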
4 Experiments

In this section we present both synthetic and real experiments to demonstrate the advantages of the LAND. We compare our model with both the Gaussian mixture model (GMM) and a mixture of LANDs using the least squares (LS) estimators (5, 6). Since the latter are not maximum likelihood estimates, we use a Riemannian K-means algorithm to find cluster centers. In all experiments we use S = 3000 samples in the Monte Carlo integration. This choice is investigated empirically in the supplements. Furthermore, we choose σ as small as possible, while ensuring that the manifold is smooth enough that geodesics can be computed numerically.

4.1 Synthetic Data Experiments

Figure 4: The mean negative log-likelihood experiment: the mean negative log-likelihood of the true generative distribution as a function of the number of mixture components, for the GMM, the LS estimators, and the LAND.

As a first experiment, we generate a nonlinear data-manifold by sampling from a mixture of 20 Gaussians positioned along a half-ellipsoidal curve (see left panel of Fig. 5). We generate 10 datasets with 300 points each, and for each dataset we fit the three models with K = 1, ..., 4 components. Then, we generate 10000 samples from each fitted model, and compute the mean negative log-likelihood of the true generative distribution using these samples. Fig. 4 shows that the LAND learns the underlying true distribution faster than the GMM. Moreover, the LAND performs better than the least squares estimators, which overestimate the covariance. In the supplements we show, using the standard AIC and BIC criteria, that the optimal LAND is achieved for K = 1, while for the least squares estimators and the GMM the optimum is achieved for K = 3 and K = 4, respectively. In addition, in Fig. 5 we show the contours of the LAND and the GMM for K = 2. There, we can observe that the LAND indeed adapts locally to the data and reveals the underlying nonlinear structure. This is particularly evident near the boundaries of the data-manifold.

Figure 5: Synthetic data and the fitted models. Left: the given data; the intensity of the geodesics represents the responsibility of the point to the corresponding cluster. Center: the contours of the LAND mixture model. Right: the contours of the Gaussian mixture model.

We extend this experiment to a clustering task (see left panel of Fig. 6 for the data). The center and right panels of Fig. 6 show the contours of the LAND and Gaussian mixtures, and it is evident that the LAND is substantially better at capturing non-ellipsoidal clusters. Due to space limitations, we move further illustrative experiments to the supplementary material and continue with real data.

Figure 6: The clustering problem for two synthetic datasets. Left: the given data; the intensity of the geodesics represents the responsibility of the point to the corresponding cluster. Center: the LAND mixture model. Right: the Gaussian mixture model.

4.2 Modeling Sleep Stages

We consider electroencephalography (EEG) measurements of human sleep from 10 subjects, part of the PhysioNet database [11, 7, 5]. For each subject we get EEG measurements during sleep from two electrodes on the front and the back of the head, respectively. Measurements are sampled at f_s = 100 Hz, and for each 30 second window a so-called sleep stage label is assigned from the set {1, 2, 3, 4, REM, awake}. Rapid eye movement (REM) sleep is particularly interesting, characterized by having EEG patterns similar to the awake state but with a complex physiological pattern, involving e.g. reduced muscle tone, rolling eye movements and erection [16]. Recent evidence points to the importance of REM sleep for memory consolidation [4]. Periods in which the sleeper is awake typically occur in or near REM intervals. Thus we here consider the characterization of sleep in terms of three categories: REM, awake, and non-REM, the latter a merger of sleep stages 1–4.

We extract features from the EEG measurements as follows: for each subject we subdivide the 30 second windows into 10 second windows, and apply a short-time Fourier transform to the EEG signal of the frontal electrode with 50% overlapping windows. From this we compute the log magnitude of the spectrum, log(1 + |f|), of each window. The resulting data matrix is decomposed using non-negative matrix factorization (10 random starts) into five factors, and we use the coefficients as 5D features. In Fig. 7 we illustrate the nonlinear manifold structure based on a three factor analysis.

Figure 7: The 3 leading factors for subject s151, colored by sleep category (stages 1–4, REM, awake).
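For concreteness, here is a sketch of the feature-extraction pipeline just described (our reading of the text; the window lengths follow the description, while the NMF settings are assumptions), using SciPy and scikit-learn. The paper uses 10 random NMF starts; the sketch uses a deterministic initialization for brevity.

```python
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import NMF

def eeg_features(signal, fs=100, win_sec=10, n_factors=5):
    """Log-magnitude STFT of a frontal-electrode EEG signal, reduced to NMF coefficients."""
    nperseg = win_sec * fs                                   # 10 s windows
    _, _, Z = stft(signal, fs=fs, nperseg=nperseg,
                   noverlap=nperseg // 2)                    # 50% overlap
    X = np.log1p(np.abs(Z)).T                                # rows: windows, cols: log(1 + |f|)
    nmf = NMF(n_components=n_factors, init="nndsvd", max_iter=500)
    return nmf.fit_transform(X)                              # (n_windows, n_factors) coefficients
```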
We perform clustering on the data and evaluate the alignment between cluster labels and sleep stages using the F-measure [14]. The LAND depends on the parameter σ to construct the metric tensor, and in this experiment it is less straightforward to select σ because of significant inter-subject variability. First, we fixed σ = 1 for all subjects. From the results in Table 1 we observe that for σ = 1 the LAND(1) generally outperforms the GMM and achieves much better alignment. To further illustrate the effect of σ, we fitted a LAND for σ ∈ {0.5, 0.6, ..., 1.5} and present the best result achieved by the LAND. Selecting σ this way indeed leads to higher degrees of alignment, further underlining that the conspicuous manifold structure and the rather compact sleep stage distributions in Fig. 7 are both captured better with the LAND representation than with a linear GMM.

Table 1: The F-measure result for 10 subjects (the closer to 1 the better).

|         | s001  | s011  | s042  | s062  | s081  | s141  | s151  | s161  | s162  | s191  |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| LAND(1) | 0.831 | 0.701 | 0.670 | 0.740 | 0.804 | 0.870 | 0.820 | 0.780 | 0.747 | 0.786 |
| GMM     | 0.812 | 0.690 | 0.675 | 0.651 | 0.798 | 0.870 | 0.794 | 0.775 | 0.747 | 0.776 |
| LAND    | 0.831 | 0.716 | 0.695 | 0.740 | 0.818 | 0.874 | 0.830 | 0.783 | 0.750 | 0.787 |

5 Related Work

We are not the first to consider Riemannian normal distributions; e.g., Pennec [15] gives a theoretical analysis of the distribution, and Zhang and Fletcher [23] consider the Riemannian counterpart of probabilistic PCA. Both consider the scenario where the manifold is known a priori. We adapt the distribution to the manifold learning setting by constructing a Riemannian metric that adapts to the data. This is our overarching contribution.

Traditionally, manifold learning is seen as an embedding problem where a low-dimensional representation of the data is sought. This is useful for visualization [21, 17, 18, 1], clustering [13], semi-supervised learning [2] and more. However, in embedding approaches, the relation between a new point and the embedded points is less well-defined, and consequently these approaches are less suited for building generative models. In contrast, the Riemannian approach gives the ability to measure continuous geodesics that follow the structure of the data. This makes the learned Riemannian manifold a suitable space for a generative model.

Simo-Serra et al. [19] consider mixtures of Riemannian normal distributions on manifolds that are known a priori. Structurally, their EM algorithm is similar to ours, but they do not account for the normalization constants of the different mixture components. Consequently, their approach is inconsistent with the probabilistic formulation. Straub et al. [20] consider data on spherical manifolds, and further consider a Dirichlet process prior for determining the number of components. Such a prior could also be incorporated in our model. The key difference to our work is that we consider learned manifolds, along with the complications that follow.

6 Discussion

In this paper we have introduced a parametric locally adaptive normal distribution. The idea is to replace the Euclidean distance in the ordinary normal distribution with a locally adaptive nonlinear distance measure. In principle, we learn a non-parametric metric space, by constructing a smoothly changing metric that induces a Riemannian manifold, on which we build our model. As such, we propose a parametric model over a non-parametric space. The non-parametric space is constructed using a local metric that is the inverse of a local covariance matrix.
Here locality is defined via a Gaussian kernel, such that the manifold learning can be seen as a form of kernel smoothing. This indicates that our scheme for learning a manifold might not scale to high-dimensional input spaces. In these cases it may be more practical to learn the manifold probabilistically [22] or as a mixture of metrics [9]. This is feasible as the LAND estimation procedure is agnostic to the details of the learned manifold, as long as exponential and logarithm maps can be evaluated.

Once a manifold is learned, the LAND is simply a Riemannian normal distribution. This is a natural model, but more intriguingly, it is a theoretically interesting model since it is the maximum entropy distribution for a fixed mean and covariance [15]. It is generally difficult to build locally adaptive distributions with maximum entropy properties, yet the LAND does this in a fairly straightforward manner. This is, however, only a partial truth, as the distribution depends on the non-parametric space. The natural question, to which we currently do not have an answer, is whether a suitable maximum entropy manifold exists.

Algorithmically, we have proposed a maximum likelihood estimation scheme for the LAND. This combines gradient-based optimization with a scalable Monte Carlo integration method. Once exponential and logarithm maps are available, this procedure is surprisingly simple to implement. We have demonstrated the algorithm on both real and synthetic data, and the results are encouraging. We almost always improve upon a standard Gaussian mixture model, as the LAND is better at capturing the local properties of the data.

We note that both the manifold learning aspect and the algorithmic aspect of our work can be improved. It would be of great value to learn the parameter σ used for smoothing the Riemannian metric, and in general, more adaptive learning schemes are of interest. Computationally, the bottleneck of our work is evaluating the logarithm maps. This may be improved by specialized solvers, e.g. probabilistic solvers [10], or manifold-specific heuristics.

The ordinary normal distribution is a key element in many machine learning algorithms. We expect that many fundamental generative models can be extended to the manifold setting simply by replacing the normal distribution with a LAND. Examples of this idea include Naïve Bayes, Linear Discriminant Analysis, Principal Component Analysis and more. Finally, we note that standard hypothesis tests also extend to Riemannian normal distributions [15], and hence also to the LAND.

Acknowledgements. LKH was funded in part by the Novo Nordisk Foundation Interdisciplinary Synergy Program 2014, "Biophysically adjusted state-informed cortex stimulation (BASICS)". SH was funded in part by the Danish Council for Independent Research, Natural Sciences.

References

[1] M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396, June 2003.
[2] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR, 7:2399–2434, Dec. 2006.
[3] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] R. Boyce, S. D. Glasgow, S. Williams, and A. Adamantidis. Causal evidence for the role of REM sleep theta rhythm in contextual memory consolidation. Science, 352(6287):812–816, 2016.
[5] A. Delorme and S. Makeig.
EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods, 134(1):9–21, 2004.
[6] M. do Carmo. Riemannian Geometry. Mathematics (Boston, Mass.). Birkhäuser, 1992.
[7] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23):e215–e220, June 2000.
[8] S. Hauberg. Principal Curves on Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
[9] S. Hauberg, O. Freifeld, and M. J. Black. A Geometric Take on Metric Learning. In Advances in Neural Information Processing Systems (NIPS) 25, pages 2033–2041, 2012.
[10] P. Hennig and S. Hauberg. Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 33, 2014.
[11] S. A. Imtiaz and E. Rodriguez-Villegas. An open-source toolbox for standardized use of PhysioNet Sleep EDF Expanded Database. In 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 6014–6017, Aug. 2015.
[12] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509–541, 1977.
[13] U. von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4):395–416, Dec. 2007.
[14] R. Marxer, H. Purwins, and A. Hazan. An f-measure for evaluation of unsupervised clustering with non-determined number of clusters. Report of the EmCAP project (European Commission FP6-IST), pages 1–3, 2008.
[15] X. Pennec. Intrinsic Statistics on Riemannian Manifolds: Basic Tools for Geometric Measurements. Journal of Mathematical Imaging and Vision, 25(1):127–154, July 2006.
[16] D. Purves, G. Augustine, D. Fitzpatrick, W. Hall, A. LaMantia, J. McNamara, and L. White. Neuroscience. De Boeck, Sinauer, Sunderland, Mass., 2008.
[17] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[18] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In Advances in Kernel Methods – Support Vector Learning, pages 327–352, 1999.
[19] E. Simo-Serra, C. Torras, and F. Moreno-Noguer. Geodesic Finite Mixture Models. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[20] J. Straub, J. Chang, O. Freifeld, and J. W. Fisher III. A Dirichlet Process Mixture Model for Spherical Data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[21] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319, 2000.
[22] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence. Metrics for Probabilistic Geometries. In The Conference on Uncertainty in Artificial Intelligence (UAI), July 2014.
[23] M. Zhang and P. Fletcher. Probabilistic Principal Geodesic Analysis. In Advances in Neural Information Processing Systems (NIPS) 26, pages 1178–1186, 2013.