# calculating_optimistic_likelihoods_using_geodesically_convex_optimization__826c6350.pdf

Calculating Optimistic Likelihoods Using (Geodesically) Convex Optimization

Viet Anh Nguyen Soroosh Shaﬁeezadeh-Abadeh École Polytechnique Fédérale de Lausanne, Switzerland {viet-anh.nguyen, soroosh.shafiee}@epfl.ch

Man-Chung Yue The Hong Kong Polytechnic University, Hong Kong manchung.yue@polyu.edu.hk

Daniel Kuhn École Polytechnique Fédérale de Lausanne, Switzerland daniel.kuhn@epfl.ch

Wolfram Wiesemann Imperial College Business School, United Kingdom ww@imperial.ac.uk

A fundamental problem arising in many areas of machine learning is the evaluation of the likelihood of a given observation under different nominal distributions. Frequently, these nominal distributions are themselves estimated from data, which makes them susceptible to estimation errors. We thus propose to replace each nominal distribution with an ambiguity set containing all distributions in its vicinity and to evaluate an optimistic likelihood, that is, the maximum of the likelihood over all distributions in the ambiguity set. When the proximity of distributions is quantiﬁed by the Fisher-Rao distance or the Kullback-Leibler divergence, the emerging optimistic likelihoods can be computed efﬁciently using either geodesic or standard convex optimization techniques. We showcase the advantages of working with optimistic likelihoods on a classiﬁcation problem using synthetic as well as empirical data.

1 Introduction

Assume that a set of i.i.d. data points x M 1 x1, . . . , x M Rn is generated from one of several Gaussian distributions Pc, c C with |C| < . To determine the distribution Pc , c C, under which x M 1 has the highest likelihood ℓ(x M 1 , Pc) across all Pc, c C, we can solve the problem

c arg max c C

ℓ(x M 1 , Pc) 1

m=1 (xm µc) Σ 1 c (xm µc) log det Σc

where µc and Σc denote the means and covariance matrices that unambiguously characterize the distributions Pc, c C, and the log-likelihood function ℓ(x M 1 , Pc) quantiﬁes the (logarithm of the) relative probability of observing x M 1 under the Gaussian distribution Pc. Problem (1) naturally arises in various machine learning applications. In quadratic discriminant analysis, for example,

33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada.

x M 1 denotes the input values of data samples whose categorical outputs y1, . . . , y M C are to be predicted based on the class-conditional distributions Pc, c C [25]. Likewise, in Bayesian inference with synthetic likelihoods, a Bayesian belief about the models Pc, c C, assumed to be Gaussian for computational tractability, is performed based on an observation x M 1 [31, 41]. Problem (1) also arises in likelihood-ratio tests where the null hypothesis x M 1 is generated by a distribution Pc, c C0 is compared with the alternative hypothesis x M 1 is generated by a distribution Pc, c C1 [17, 18].

In practice, the parameters (µc, Σc) of the candidate distributions Pc, c C, are typically not known and need to be estimated from data. In quadratic discriminant analysis, for example, it is common to replace the means µc and covariance matrices Σc with their empirical counterparts ˆµc and ˆΣc that are estimated from data. Similarly, the rival model distributions Pc, c C, in Bayesian inference with synthetic likelihoods are Gaussian estimates derived from (costly) sampling processes. Unfortunately, problem (1) is highly sensitive to misspeciﬁcation of the candidate distributions Pc. To combat this problem, we propose to replace the likelihood function in (1) with the optimistic likelihood

max P Pc ℓ(x M 1 , P) with Pc = n P M : ϕ(ˆPc, P) ρc o , (2)

where M is the set of all non-degenerate Gaussian distributions on Rn, ϕ is a dissimilarity measure satisfying ϕ(P, P) = 0 for all P M, and ρc R+ are the radii of the ambiguity sets Pc. Problem (2) assumes that the true candidate distributions Pc are unknown but close to the nominal distributions ˆPc that are estimated from the training data. In contrast to the log-likelihood ℓ(x M 1 , Pc) that is maximized in problem (1), the optimistic likelihood (2) is of interest in its own right. A common problem in constrained likelihood estimation, for example, is to determine a Gaussian distribution P (µ , Σ ) that is close to a nominal distribution P0 (µ0, Σ0) reﬂecting the available prior information such that x M 1 has high likelihood under P [34]. This task is an instance of the optimistic likelihood evaluation problem (2) with a suitably chosen dissimilarity measure ϕ.

Of crucial importance in the generalized likelihood problem (2) is the choice of the dissimilarity measure ϕ as it impacts both the statistical properties as well as the computational complexity of the estimation procedure. A natural choice appears to be the Wasserstein distance, which has recently been popularized in the ﬁeld of optimal transport [37, 40]. The Wasserstein distance on the space of Gaussian distributions is a Riemannian distance, that is, the distance corresponding the curvilinear geometry on the set of Gaussian distributions induced by the Wasserstein distance, as opposed to the usual distance obtained by treating it as a subset of the space of symmetric matrices. However, since the Wasserstein manifold has a non-negative sectional curvature [37], calculating the associated optimistic likelihood (2) appears to be computationally intractable. Instead, we study the optimistic likelihood under the Fisher-Rao (FR) distance, which is commonly used in signal and image processing [30, 2] as well as computer vision [21, 39]. The FR distance is also a Riemannian metric, and it enjoys many attractive statistical properties that we review in Section 2 of this paper. Most importantly, the FR distance has a non-positive sectional curvature, which implies that the optimistic likelihood (2) reduces to the solution of a geodesically convex optimization problem that is amenable to an efﬁcient solution [6, 35, 38, 42, 43, 44]. We also study problem (2) under the Kullback Leibler (KL) divergence (or relative entropy), which is intimately related to the FR metric. While the KL divergence lacks some of the desirable statistical features of the FR metric, we will show that it gives rise to optimistic likelihoods that can be evaluated in quasi-closed form by reduction to a one dimensional problem.

While this paper focuses on the parametric approximation of the likelihood where P belongs to the family of Gaussian distributions, we emphasize that the optimistic likelihood approach can also be utilized in a non-parametric setting [28].

The contributions of this paper may be summarized as follows.

1. We show that for Fisher-Rao ambiguity sets, the optimistic likelihood (2) reduces to a geodesically convex problem and is hence amenable to an efﬁcient solution via a Riemannian gradient descent algorithm. We analyze the optimality as well as the convergence of the resulting algorithm.

2. We show that for Kullback-Leibler ambiguity sets, the optimistic likelihood (2) can be evaluated in quasi-closed form by reduction to a one dimensional convex optimization problem.

3. We evaluate the numerical performance of our optimistic likelihoods on a classiﬁcation problem with artiﬁcially generated as well as standard benchmark instances.

Our optimistic likelihoods follow a broader optimization paradigm that exercises optimism in the face of ambiguity. This strategy has been shown to perform well, among others, in multi-armed bandit problems and Bayesian optimization, where the Upper Conﬁdence Bound algorithm takes decisions based on optimistic estimates of the reward [12, 13, 26, 36]. Optimistic optimization has also been successfully applied in support vector machines [8], and it closely relates to sparsity inducing non-convex regularization schemes [29].

The remainder of the paper proceeds as follows. We study the optimistic likelihood (2) under FR and KL ambiguity sets in Sections 2 and 3, respectively. We test our theoretical ﬁndings in the context of a classiﬁcation problem, and we report on numerical experiments in Section 4. Supplementary material and all proofs are provided in the online companion.

Notation. Throughout this paper, Sn, Sn + and Sn ++ denote the spaces of n-dimensional symmetric, symmetric positive semi-deﬁnite and symmetric positive deﬁnite matrices, respectively. For any A Rn n, the trace of A is deﬁned as Tr A = Pn i=1 Aii. For any A Sn, λmin(A) and λmax(A) denote the minimum and maximum eigenvalues of A, respectively. The base of log( ) is e.

2 Optimistic Likelihood Problems under the FR Distance

Consider a family of distributions with density functions pθ(x), where the parameter θ ranges over a ﬁnite-dimensional smooth manifold Θ. At each point θ Θ, the Fisher information matrix Iθ = Ex[ θ log(pθ(x)) θ log(pθ(x)) |θ] deﬁnes an inner product , θ on the tangent space TθΘ by ζ1, ζ2 θ = ζT 1 Iθζ2 for ζ1, ζ2 TθΘ. The family of inner products { , θ}θ Θ on the tangent spaces then deﬁnes a Riemannian metric, called the FR metric. The FR distance on Θ is the geodesic distance associated with the FR metric, i.e., the FR distance between the two points θ0, θ1 Θ is

d(θ0, θ1) = inf γ

γ (t), γ (t) γ(t)dt,

where the inﬁmum is taken over all smooth curves γ : [0, 1] Θ with γ(0) = θ0 and γ(1) = θ1. Any curve γ attaining the inﬁmum is said to be a geodesic from θ0 to θ1. The FR metric represents a natural distance measure for parametric families of probability distributions as it is invariant under transformations on the data space (the x space) by a class of statistically important mappings, and it is the unique (up to a scaling) Riemannian metric enjoying such a property, see [14, 15, 5].

Since the covariance matrix is more difﬁcult to estimate than the mean (see Appendix A), we focus here on the family of all Gaussian distributions with a ﬁxed mean vector ˆµ Rn. These distributions are parameterized by θ = Σ, that is, the covariance matrix. The parameter manifold is thus given by Θ = Sn ++. On this manifold, the FR distance is available in closed form.1

Proposition 2.1 (FR distance for Gaussian distributions [3]). If N(ˆµ, Σ0) and N(ˆµ, Σ1) are Gaussian distributions with identical mean ˆµ Rn and covariance matrices Σ0, Σ1 Sn ++, we have

d(Σ0, Σ1) = 1

2 1 ) F , (3)

where log( ) represents the matrix logarithm, and F stands for the Frobenius norm.

The distance d( , ) is invariant under inversions and congruent transformations of the input parameters [32, Proposition 1], i.e., for any ˆΣ, Σ Sn ++ and invertible matrix A Rn n, we have

d(ˆΣ 1, Σ 1) = d(ˆΣ, Σ) (4)

and d(AˆΣA , AΣA ) = d(ˆΣ, Σ). (5)

By the inversion invariance (4), the distance d( , ) is independent of whether we use the covariance matrix Σ or the precision matrix Σ 1 to parametrize normal distributions. Note that if x1 N(µ, Σ1) and x2 N(µ, Σ2), then Ax1 +b N(Aµ+b, AΣ1A ) and Ax2 +b N(Aµ+b, AΣ2A ). By the congruence invariance (5), the distance d( , ) thus remains unchanged under afﬁne transformations

1We can also handle the case where the covariance matrix is ﬁxed but the mean is subject to ambiguity, see Appendix B. However, as there is no closed-form expression for the FR distance between two generic Gaussian distributions, we cannot handle the case where both the mean and the covariance matrix are subject to ambiguity.

x Ax + b. Remarkably, the invariance property (5) uniquely characterizes the distance d( , ). More precisely, any Riemannian distance satisfying the invariance property (5) coincides (up to a scaling) with the distance d( , ), see, for example, [33, Section 3] and [9, Section 2].

We now study the optimistic likelihood problem (2), where the FR distance is used as the dissimilarity measure. Given a data batch x M 1 and a radius ρ > 0, the optimistic likelihood problem reduces to

min Σ BFR L(Σ), where L(Σ) S, Σ 1 + log det Σ, BFR {Σ Sn ++ : d(Σ, ˆΣ) ρ}, (6)

and S = M 1 PM m=1(xm ˆµ)(xm ˆµ) stands for the sample covariance matrix.

We next prove that problem (6) is solvable, which justiﬁes the use of the minimization operator. Lemma 2.2. The optimal value of problem (6) is ﬁnite and is attained by some Σ BFR.

Even though the objective function of (6) involves a concave log-det term, it can be shown to be convex over the region 0 Σ 2S [10, Exercise 7.4]. However, in practice S may be singular, in which case this region becomes empty. Maximum likelihood estimation problems akin to (6) are often reparameterized in terms of the precision matrix X = Σ 1. In this case, (6) becomes

min n S, X log det X : X Sn ++, log(X 1 2 ˆΣX 1 2 ) F

Even though this reparameterization convexiﬁes the objective, it renders the feasible set non-convex.

2.1 Geodesic Convexity of the Optimistic Likelihood Problem

As problem (6) cannot be addressed with standard methods from convex optimization, we re-interpret it as a constrained minimization problem on the Riemannian manifold Θ = Sn ++ endowed with the FR metric. The key advantage of this approach is that we can show problem (6) to be geodesically convex. Geodesic convexity generalizes the usual notion of convexity in Euclidean spaces to Riemannian manifolds. We can thus solve problem (6) via algorithms from geodesically convex optimization, which inherit many beneﬁts of the standard algorithms of convex optimization in Euclidean spaces.

The Riemannian manifold Θ = Sn ++ endowed with the FR metric is in fact a Hadamard manifold, that is, a complete simply connected Riemannian manifold with non-positive sectional curvature, see [22, Theorem XII 1.2]. Thus, any two points are connected by a unique geodesic [11]. By [7, Theorem 6.1.6], for Σ0, Σ1 Sn ++, the unique geodesic γ : [0, 1] Sn ++ from Σ0 to Σ1 is given by

1 2 0 . (7)

We are now ready to give precise deﬁnitions of geodesically convex sets and functions on Hadamard manifolds. We emphasize that these deﬁnitions would be more subtle for general Riemannian manifolds, which can have several geodesics between two points. Deﬁnition 2.3 (Geodesically convex set). A set U Sn ++ is said to be geodesically convex if for all Σ0, Σ1 U, the image of the unique geodesic from Σ0 to Σ1 is contained in U, i.e., γ([0, 1]) U. Deﬁnition 2.4 (Geodesically convex function). A function f : Sn ++ R is said to be geodesically convex if for all Σ0, Σ1 Sn ++, the unique geodesic γ from Σ0 to Σ1 satisﬁes f(γ(t)) (1 t)f(Σ0) + tf(Σ1) t [0, 1].

In order to prove that (6) is a geodesically convex optimization problem, we need to establish the geodesic convexity of the feasible region BFR and the loss function L( ). Note that, in stark contrast to Euclidean geometry, a geodesic ball on a general manifold may not be geodesically convex.2

Theorem 2.5 (Geodesic convexity of problem (6)). For any ˆΣ Sn ++, S Sn + and ρ R+, BFR is a geodesically convex set, and L( ) is a geodesically convex function over Sn ++.

Theorem 2.5 establishes that the optimistic likelihood problem (6), which is non-convex with respect to the usual Euclidean geometry on the embedding space Rn n, is actually convex with respect to the Riemannian geometry on Sn ++ induced by the FR metric.

2For example, consider the circle S1 {x R2 : x 2 = 1} which is a 1-dimensional manifold. Any major arc is a geodesic ball but not a geodesically convex subset of S1.

Algorithm 1 Projected Geodesic Gradient Descent Algorithm

Input: ˆΣ 0, ρ > 0, S 0, K N, {αk}K k=1 R++ Initialization: Set Σ1 ˆΣ, Σ1 ˆΣ for k = 1, 2, . . . , K 1 do

Compute the Riemannian gradient at Σk: Gk 2(Σk S) Perform a gradient descent step using the exponential map:

2 ExpΣk( αk Gk) = Σ

1 2 k exp αkΣ 1

Project Σk+ 1

2 onto BFR: Σk+1 Proj BFR(Σk+ 1

2 ) Compute the new iterate by interpolation: Σk+1 Exp Σk 1 k+1 Exp 1 Σk (Σk+1)

end for Output: Report the last iterate ΣK as an approximate solution

2.2 Projected Geodesic Gradient Descent Algorithm

Figure 1: Visualization of the FR ball BFR (yellow set) within the manifold Sn ++ (white set).

In the same way as the convexity of a standard constrained optimization problem can be exploited to ﬁnd a global minimizer via a projected gradient descent algorithm, the geodesic convexity of problem (6) can be exploited to ﬁnd a global minimizer by using a projected geodesic gradient descent algorithm of the type described in [43]. The mechanics of a generic iteration are visualized in Figure 1. As in any gradient descent method, given the current iterate Σ, we ﬁrst need to compute the direction along which the objective function L decreases fastest. In the context of optimization on manifolds, this direction corresponds to the negative Riemannian gradient G at point Σ, which belongs to the tangent space TΣSn ++ Sn. Unfortunately, the curve γ(α) = Σ αG fails to be a geodesic and will eventually leave the manifold for sufﬁciently large step sizes α. This prompts us to construct the (unique) geodesic that emanates from point Σ with initial velocity G. Formally, this geodesic can be represented as γ(α) = ExpΣ( αG), where ExpΣ( ) denotes the exponential map at Σ. As we will see below, this geodesic (visualized by the red curve) remains within the manifold for any α > 0 but may eventually leave the feasible region BFR. If this happens for the chosen step size α, we project ExpΣ( αG) back onto the feasible region, that is, we map it to its closest point in BFR with respect to the FR distance (visualized by the yellow cross). Denoting this FR projection by Proj BFR( ), the next iterate of the projected geodesic gradient descent algorithm can thus be expressed as Σ+ = Proj BFR(ExpΣ( αG)).

Starting from Σ1 = ˆΣ, the proposed algorithm constructs K iterates {Σk}K k=1 via the above recursion. As in [43], the algorithm also constructs a second sequence { Σk}K k=1 of feasible covariance matrices with Σ1 = ˆΣ and Σk+1 = γ(1/(k+1)) for k = 1, . . . , K 1, where γ(t) represents the geodesic (7) connecting Σk with Σk+1. Thus, Σk+1 is deﬁned as a geodesic convex combination of Σk and Σk+1. A precise description of the proposed algorithm in pseudocode is provided in Algorithm 1.

In the following we show that the Riemannian gradient, the exponential map ExpΣ( ) as well as the projection Proj BFR( ) can all be evaluated in closed form in O(n3).

By [3, Page 362], the FR metric on the tangent space TΣSn ++ at Σ Sn ++ can be re-expressed as

2 Tr Ω1Σ 1Ω2Σ 1 Ω1, Ω2 TΣSn ++. (8)

Using (8) and [1, Equation 3.32], the Riemannian gradient G = grad L of the objective function L( ) at Σ can be computed from the Euclidean gradient L(Σ) as

grad L(Σ) = 2Σ( L(Σ))Σ = 2Σ(Σ 1 Σ 1SΣ 1)Σ = 2(Σ S). (9)

Moreover, by [35, Equation (3.2)], the exponential map ExpΣ : TΣSn ++ Sn ++ at Σ is given by

ExpΣ(G) = Σ 1 2 exp(Σ 1

2 )Σ 1 2 , G TΣSn ++ Sn,

where exp( ) denotes the matrix exponential. The inverse map Exp 1 Σ : Sn ++ TΣSn ++ satisﬁes

Exp 1 Σ (A) = Σ 1 2 log Σ 1

2 Σ 1 2 , A Sn ++.

Finally, the projection Proj B( ) onto BFR with respect to the FR distance is deﬁned through

Proj BFR(Σ ) arg min Σ BFR d(Σ, Σ ), Σ Sn ++. (10)

The following lemma ensures that this projection is well-deﬁned and admits a closed-form expression.

Lemma 2.6 (Projection onto BFR). For any Σ Sn ++ with d(ˆΣ, Σ ) = ρ the following hold.

(i) There arg min-mapping in (10) is a singleton, and thus Proj BFR(Σ ) is well-deﬁned.

(ii) The projection of Σ onto BFR is given by

Proj BFR(Σ ) =

( ˆΣ 1 2 ˆΣ 1

ρ ˆΣ 1 2 if ρ > ρ, Σ otherwise. (11)

By comparison with (7), one easily veriﬁes that Proj BFR(Σ ) constitutes a geodesic convex combination between Σ and ˆΣ. Figure 1 visualizes the geodesic from ˆΣ to Σ by the blue dashed line. Therefore, the projection Proj BFR onto the FR ball BFR within Sn ++ endowed with the FR metric is constructed in a similar manner as the projection onto a Euclidean ball within a Euclidean space.

The following theorem asserts that Algorithm 1 enjoys a sublinear convergence rate. Theorem 2.7 (Sublinear convergence rate). With a constant stepsize

where Γ 2 1/2 n e2

2ρ λ 2 min(ˆΣ) max{|1 e

2ρλ 1 min(ˆΣ)λmax(S)|, 1}, Algorithm 1 satisﬁes

L( ΣK) L(Σ ) 2 7 4 ρ 3 2 Γ q

The proof of Theorem 2.7 closely follows that of [43, Theorem 9]. The difference is that [43, Theorem 9] requires the objective function to be Lipschitz continuous on Sn ++. Unfortunately, such an assumption is not satisﬁed by L( ). We circumvent this by proving that the Riemannian gradient of L( ) is bounded uniformly on BFR.

Endeavors are currently underway to devise algorithms for minimizing a geodesically strongly convex objective function over a geodesically convex feasible set that offer a linear convergence guarantee, see, e.g., [43, Theorem 15]. The next lemma shows that the objective function of problem (6) is indeed geodesically smooth and geodesically strongly convex3 whenever S 0. This suggests that the empirical performance of Algorithm 1 could be signiﬁcantly better than the theoretical guarantee of Theorem 2.7. Indeed, our numerical results in Section 4.1 conﬁrm that if S 0, then Algorithm 1 displays a linear convergence rate. Lemma 2.8 (Strong convexity and smoothness of L( )). The objective function L( ) of problem (6) is geodesically β-smooth on BFR with

β = 2λmax(S)

λmin(bΣ) exp(

If S 0, then L( ) is also geodesically σ-strongly convex on BFR with

σ = 2λmin(S)

λmax(bΣ) exp(

Remark 2.9. Problem (6) could also be addressed with the algorithmic framework developed in [24]. Due to space limitations, we leave this for future research.

3The strong convexity and smoothness properties are deﬁned in Deﬁnitions C.4 and C.5, respectively.

3 Generalized Likelihood Estimation under the KL Divergence

The KL divergence, which is widely used in information theory [16, 2], can be employed as an alternative dissimilarity measure in the optimistic likelihood problem (2). If both ˆP and P are Gaussian distributions, then the KL divergence from ˆP to P admits an analytical expression. Proposition 3.1 (KL divergence for Gaussian distributions). For any ˆµ Rn and Σ0, Σ1 Sn ++, the KL divergence from P0 = N(ˆµ, Σ0) to P1 = N(ˆµ, Σ1) amounts to

KL(P0 P1) = 1

2 Tr Σ 1 1 Σ0 + log det(Σ1Σ 1 0 ) n .

Unlike the FR distance, the KL divergence is not symmetric. Proposition 3.1 implies that if the KL divergence is used as the dissimilarity measure, then the optimistic likelihood problem (2) reduces to

n Tr SΣ 1 + log det Σ : Tr Σ 1 ˆΣ + log det(ΣˆΣ 1) n 2ρ o , (12)

where S = M 1 PM m=1(xm ˆµ)(xm ˆµ) denotes again the sample covariance matrix. Because of the concave log-det terms in the objective and the constraints, problem (12) is non-convex. By using the variable substitution X Σ 1, however, problem (12) can be reduced to a univariate convex optimization problem and thereby solved in quasi-closed form.

Theorem 3.2. For any ˆΣ 0 and ρ > 0, the optimal value of problem (12) amounts to

(1 + γ ) Tr S(S + γ ˆΣ) 1 + log det(S + γ ˆΣ) n log(1 + γ ),

where γ is the unique optimal solution of the univariate convex optimization problem

n γ(2ρ + log det ˆΣ) + n(1 + γ) log(1 + γ) (1 + γ) log det(S + γ ˆΣ) o . (13)

Problem (13) can be solved efﬁciently using state-of-the-art ﬁrstor second-order methods, see Appendix E. However, in each iteration we still need to evaluate the determinant of a positive deﬁnite n-by-n matrix, which requires O(n3) arithmetic operations. The following corollary shows that this computational burden can be alleviated when the sample covariance matrix S has low rank. Corollary 3.3 (Singular sample covariance matrices). If S = ΛΛ for some Λ Rn k and k N with k < n, then problem (13) simpliﬁes to

n 2γρ + n(1 + γ) log (1 + γ) (n k)(1 + γ) log γ (1 + γ) log det(γIk + Λ ˆΣ 1Λ) o .

We will see that for classiﬁcation problems the matrix S has rank 1, in which case the log-det term in the above univariate convex program reduces to the scalar logarithm. In Appendix E we provide explicit ﬁrstand second-order derivatives of the objective of problem (13) and its simpliﬁcation.

4 Numerical Results

We investigate the empirical behavior of our projected geodesic gradient descent algorithm (Section 4.1) and the predictive power of our ﬂexible discriminant rules (Section 4.2). Our algorithm and all tests are implemented in Python, and the source code is available from https: //github.com/sorooshafiee/Optimistic_Likelihoods.

4.1 Convergence Behavior of the Projected Geodesic Descent Algorithm

To study the empirical convergence behavior of Algorithm 1, for n {10, 20, . . . , 100} we generate 100 covariance matrices ˆΣ according to the following procedure. We (i) draw a standard normal random matrix B Rn n and compute A = B + B ; we (ii) conduct an eigenvalue decomposition A = RDRT ; we (iii) replace D with a random diagonal matrix ˆD whose diagonal elements are sampled uniformly from [1, 10]n; and we (iv) set ˆΣ = R ˆDR . For each of these covariance matrices, we set ˆµ = 0, M = 1, x M 1 x for a standard normal random vector x Rn and calculate

10 20 30 40 50 60 70 80 90 100 Dimension (n)

Number of iterations

(a) Scaling of iteration count for S 0

10 20 30 40 50 60 70 80 90 100 Dimension (n)

Execution time (s)

(b) Scaling of execution time for S 0

100 101 102 103 Number of iterations

|[L(Σk+1) L(Σk)]/L(Σk+1)|

(c) Convergence for n = 100 for S 0

2 4 6 8 10 Number of iterations

|[L(Σk+1) L(Σk)]/L(Σk+1)|

(d) Convergence for n = 100 for S 0 Figure 2: Convergence behavior of the projected geodesic gradient descent algorithm. Solid lines (shaded regions) represent averages (ranges) across 100 independent simulations.

the optimistic likelihood (6) for ρ = n/100. This choice of ρ ensures that the radius of the ambiguity set scales with n in the same way as the Frobenius norm. Figures 2(a) and 2(b) report the number of iterations as well as the overall execution time of Algorithm 1 when we terminate the algorithm as soon as the relative improvement |[L(Σk+1) L(Σk)]/L(Σk+1)| drops below 0.01%. Notice that the number of required iterations scales linearly with n while the overall runtime grows polynomially with n. Figure 2(c) shows the relative improvement as a function of the iteration count. Empirically, the number of iterations scales with O(1/k2), which is faster than the theoretical rate established in Theorem 2.7. We also study the empirical convergence behavior of Algorithm 1 when the input matrix S is positive deﬁnite. We repeat the ﬁrst experiment with M = 100, and we set S = δI + PM i=1 xix i /M for δ = 10 6 to ensure that S is positive deﬁnite. Figure 2(d) indicates that, in this case, the empirical convergence rate of Algorithm 1 is linear.

4.2 Application: Flexible Discriminant Rules

Consider a classiﬁcation problem where a categorical response Y C, C = {1, . . . , C}, should be predicted from continuous inputs X Rn. In this context, Bayes Theorem implies that P(Y = c|X = x) πc fc(x), c C, where πc = P(Y = c) denotes the prior probability of the response belonging to class c, and fc is the density function of X for an observation of class c. In practice, πc and fc are unknown and need to be estimated from a training data set (ˆx1, ˆy1), . . . , (ˆx N, ˆy N). Assuming that the densities fc, c C, correspond to Gaussian distributions with (unknown) classspeciﬁc means µc and covariance matrices Σc, the quadratic discriminant analysis (QDA) replaces πc with ˆπc = Nc/N, where Nc = |{i : ˆyi = c}|, and fc with the density of the Gaussian distribution ˆPc N(ˆµc, ˆΣc), whose mean and covariance matrix are estimated from the training data, to classify a new observation x using the discriminant rule

CQDA(x) arg max c C

2ℓ(x, ˆPc) + log(ˆπc) .

Table 1: Average correct classiﬁcation rates on the benchmark instances

FQDA KQDA QDA RQDA SQDA WQDA

Australian 80.68 83.68 80.03 79.76 80.73 79.94 Banknote authentication 99.07 99.47 98.56 98.54 98.53 98.54 Climate model 94.46 94.55 91.78 92.72 94.42 92.78 Cylinder 70.69 70.67 67.10 70.33 70.99 70.34 Diabetic 75.97 74.53 74.19 74.60 74.70 75.04 Fourclass 80.13 79.97 79.32 79.32 79.32 79.33 German credit 74.50 74.60 71.41 76.18 74.99 76.31 Haberman 74.87 75.41 74.92 74.96 75.04 74.97 Heart 84.23 83.31 81.42 82.62 84.17 82.42 Housing 88.89 92.90 88.54 87.01 81.69 88.31 Ilpd 57.42 57.83 55.18 54.97 55.45 55.15 Mammographic mass 80.66 80.85 80.37 80.88 81.05 80.65 Pima 75.97 74.53 74.19 74.60 74.70 75.04 Ringnorm 98.69 98.65 98.56 98.56 98.65 98.56

Here, the likelihood ℓ(x, ˆPc) is deﬁned as in (1) for M = 1. If ˆπ1 = . . . = ˆπC, this classiﬁcation rule reduces to the maximum likelihood discrimant rule [20, 14].

QDA can be sensitive to misspeciﬁcations of the empirical moments. To reduce this sensitivity, we replace the nominal Gaussian distributions ˆPc with the Gaussian distributions P c that would have generated the sample x with highest likelihood, among all Gaussian distributions in the vicinity of the nominal distributions ˆPc. This results in a ﬂexible discriminant rule of the form

Cﬂex(x) arg max c C max P Pc

2ℓ(x, P) + log(ˆπc) ,

which makes use of the optimistic likelihoods (2). Here, Pc is the FR or KL ball centered at the nominal distribution ˆPc. To ensure that ˆΣc 0 for all c C, we use the Ledoit-Wolf covariance estimator [23], which is parameter-free and returns a well-conditioned matrix by minimizing the mean squared error between the estimated and the real covariance matrix.

We compare the performance of our ﬂexible discriminant rules with standard QDA implementations from the literature on datasets from the UCI repository [4]. Speciﬁcally, we compare the following methods.

FQDA and KQDA: our ﬂexible discriminant rules based on FR (FQDA) and KL (KQDA) ambiguity sets with radii ρc; QDA: regular QDA with empirical means and covariance matrices estimated from data;

RQDA: regularized QDA based on the linear shrinkage covariance estimator ˆΣc + ρc In; SQDA: sparse QDA based on the graphical lasso covariance estimator [19] with parameter ρc; WQDA: Wasserstein QDA based on the nonlinear shrinkage approach [27] with parameter ρc.

All results are averaged across 100 independent trials for ρc {a n 10b : a {1, . . . , 9}, b { 3, 2, 1}}. In each trial, we randomly select 75% of the data for training and the remaining 25% for testing. The size of the ambiguity set and the regularization parameter are selected using stratiﬁed 5-fold cross validation. The performance of the classiﬁers is measured by the correct classiﬁcation rate (CCR). The average CCR scores over the 100 trials are reported in Table 1.

Acknowledgments We gratefully acknowledge ﬁnancial support from the Swiss National Science Foundation under grant BSCGI0_157733 as well as the EPSRC grants EP/M028240/1, EP/M027856/1 and EP/N020030/1.

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

[2] M. Arnaudon, F. Barbaresco, and L. Yang. Riemannian medians and means with applications to radar signal processing. IEEE Journal of Selected Topics in Signal Processing, 7(4):595 604, 2013.

[3] C. Atkinson and A. F. Mitchell. Rao s distance measure. Sankhy a: The Indian Journal of Statistics, Series A, 43(3):345 365, 1981.

[4] K. Bache and M. Lichman. UCI machine learning repository, 2013. Available from http: //archive.ics.uci.edu/ml.

[5] M. Bauer, M. Bruveris, and P. W. Michor. Uniqueness of the Fisher Rao metric on the space of smooth densities. Bulletin of the London Mathematical Society, 48(3):499 506, 2016.

[6] G. C. Bento, O. P. Ferreira, and J. G. Melo. Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548 562, 2017.

[7] R. Bhatia. Positive Deﬁnite Matrices. Princeton University Press, 2009.

[8] J. Bi and T. Zhang. Support vector classiﬁcation with input data uncertainty. In Advances in Neural Information Processing Systems, pages 161 168, 2005.

[9] S. Bonnabel and R. Sepulchre. Riemannian metric and geometric mean for positive semideﬁnite matrices of ﬁxed rank. SIAM Journal on Matrix Analysis and Applications, 31(3):1055 1070, 2009.

[10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[11] M. R. Bridson and A. Haeﬂiger. Metric Spaces of Non-Positive Curvature. Springer, 2013.

[12] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report UBC TR-2009-023 and ar Xiv:1012.2599, University of British Columbia, Department of Computer Science, 2009.

[13] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends R in Machine Learning, 5(1):1 122, 2012.

[14] L. L. Campbell. An extended ˇCencov characterization of the information metric. Proceedings of the American Mathematical Society, 98(1):135 141, 1986.

[15] N. N. ˇCencov. Statistical Decision Rules and Optimal Inference. American Mathematical Society, 2000.

[16] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.

[17] D. R. Cox. Tests of separate families of hypotheses. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 105 123, 1961.

[18] D. R. Cox. A return to an old paper: tests of separate families of hypotheses . Journal of the Royal Statistical Society: Series B, 75(2):207 215, 2013.

[19] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432 441, 2008.

[20] W. K. Härdle and L. Simar. Applied Multivariate Statistical Analysis. Springer, 2015.

[21] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the Riemannian manifold of symmetric positive deﬁnite matrices. In IEEE Conference on Computer Vision and Pattern Recognition, pages 73 80, 2013.

[22] S. Lang. Fundamentals of Differential Geometry. Springer, 2012.

[23] O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365 411, 2004.

[24] C. Liu and N. Boumal. Simple Algorithms for Optimization on Riemannian Manifolds with Constraints. Applied Mathematics and Optimization, 2019.

[25] G. Mc Lachlan. Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, 2004.

[26] R. Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends R in Machine Learning, 7(1):1 129, 2014.

[27] V. A. Nguyen, D. Kuhn, and P. M. Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. ar Xiv preprint ar Xiv:1805.07194, 2018.

[28] V. A. Nguyen, S. Shaﬁeezadeh-Abadeh, M.-C. Yue, D. Kuhn, and W. Wiesemann. Optimistic distributionally robust optimization for nonparametric likelihood approximation. In Advances in Neural Information Processing Systems, 2019.

[29] M. Norton, A. Takeda, and A. Mafusalov. Optimistic robust optimization with applications to machine learning. ar Xiv preprint ar Xiv:1711.07511, 2017.

[30] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41 66, 2006.

[31] L. F. Price, C. C. Drovandi, A. Lee, and D. J. Nott. Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics, 27(1):1 11, 2018.

[32] S. Said, L. Bombrun, Y. Berthoumieu, and J. H. Manton. Riemannian Gaussian distributions on the space of symmetric positive deﬁnite matrices. IEEE Transactions on Information Theory, 63(4):2153 2170, 2017.

[33] R. P. Savage. The space of positive deﬁnite matrices and Gromov s invariant. Transactions of the American Mathematical Society, 274(1):239 263, 1982.

[34] R. Schoenberg. Constrained maximum likelihood. Computational Economics, 10(3):251 266, 1997.

[35] S. Sra and R. Hosseini. Conic geometric optimization on the manifold of positive deﬁnite matrices. SIAM Journal on Optimization, 25(1):713 739, 2015.

[36] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pages 1015 1022, 2010.

[37] A. Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005 1026, 2011.

[38] N. Tripuraneni, N. Flammarion, F. Bach, and M. Jordan. Averaging stochastic gradient descent on Riemannian manifolds. In Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 650 687, 2018.

[39] O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classiﬁcation on Riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713 1727, 2008.

[40] C. Villani. Optimal Transport: Old and New. Springer, 2008.

[41] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102 1104, 2010.

[42] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592 4600, 2016.

[43] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1617 1638, 2016.

[44] H. Zhang and S. Sra. An estimate sequence for geodesically convex optimization. In Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1703 1723, 2018.