Journal of Machine Learning Research 24 (2023) 1-76. Submitted 3/21; Revised 6/22; Published 1/23.

Monotonic Alpha-divergence Minimisation for Variational Inference

Kamélia Daudel (1,2) kamelia.daudel@gmail.com
1: LTCI, Télécom Paris, Institut Polytechnique de Paris, France
2: Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom
Randal Douc randal.douc@telecom-sudparis.eu
SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France
François Roueff francois.roueff@telecom-paris.fr
LTCI, Télécom Paris, Institut Polytechnique de Paris, France

Editor: Shakir Mohamed

Abstract

In this paper, we introduce a novel family of iterative algorithms which carry out α-divergence minimisation in a Variational Inference context. They do so by ensuring a systematic decrease at each step in the α-divergence between the variational and the posterior distributions. In its most general form, the variational distribution is a mixture model and our framework allows us to simultaneously optimise the weights and components parameters of this mixture model. Our approach permits us to build on various methods previously proposed for α-divergence minimisation, such as Gradient or Power Descent schemes, and we also shed new light on an integrated Expectation Maximization algorithm. Lastly, we provide empirical evidence that our methodology yields improved results on several multimodal target distributions and on a real data example.

Keywords: Variational Inference, Kullback-Leibler, Alpha-Divergence, Mixture Models, Bayesian Inference

1. Introduction

Bayesian inference tasks often induce intractable and hard-to-compute posterior densities which need to be approximated.
Among the class of approximating methods, Variational Inference methods (for example Variational Bayes, Jordan et al., 1999; Beal, 2003) have attracted a lot of attention as they have empirically been shown to be widely applicable to many high-dimensional machine-learning problems (Hoffman et al., 2013; Ranganath et al., 2014; Kingma and Welling, 2014). These optimisation-based methods introduce a simpler variational family Q and find the best approximation to the unknown posterior density among this family in terms of a certain measure of dissimilarity, the most common choice of measure of dissimilarity being the exclusive Kullback-Leibler divergence (Blei et al., 2017; Zhang et al., 2019).

© 2023 Kamélia Daudel, Randal Douc and François Roueff. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-0249.html.

However, the exclusive Kullback-Leibler divergence is known to have some drawbacks. Indeed, its zero-forcing behavior is responsible for returning variational approximations with light tails that severely underestimate the posterior covariance and that are unable to capture multimodality within the posterior density (Minka, 2005; Jerfel et al., 2021; Prangle, 2021). This is especially inconvenient when the variational family Q does not exactly match the posterior density (Yao et al., 2018; Campbell and Li, 2019) and even more so if one wants to appeal to Importance Sampling methods to approximate integrals of interest in a Bayesian Inference setting (Jerfel et al., 2021; Prangle, 2021). To avoid this hurdle, advances in Variational Inference turned to more general families of divergences such as the α-divergence (Zhu and Rohwer, 1995b,a) and Rényi's α-divergence (Rényi, 1961; van Erven and Harremoës, 2014).
These families of divergences both recover the exclusive Kullback-Leibler divergence when α → 1 and, thanks to the hyperparameter α, they provide a more flexible framework that can bypass the difficulties associated with the exclusive Kullback-Leibler divergence by choosing α < 1 (Minka, 2005). For this reason, they have notably been used in Minka (2004); Hernández-Lobato et al. (2016); Li and Turner (2016); Dieng et al. (2017); Wang et al. (2018); Daudel et al. (2021); Daudel and Douc (2021).

In the spirit of Variational Inference methods based on the α-divergence, we propose in this paper to build a framework for α-divergence minimisation in a Variational Inference context. The particularity of our work is that our algorithms will ensure a monotonic decrease at each step in the α-divergence between the variational and the posterior distributions. In addition, our work will apply to variational families as wide as the class of mixture models. The paper is then organised as follows:

In Section 2, we introduce some notation and state the optimisation problem we aim at solving in terms of the posterior density, the variational density q ∈ Q and the α-divergence.

In Section 3, we consider the typical Variational Inference case where q belongs to a parametric family. In this particular case, we state in Theorem 1 conditions which ensure a systematic decrease in the α-divergence at each step for all α ∈ [0, 1). We then show in Corollary 1 that these conditions are satisfied for a well-chosen iterative scheme and we call the resulting approach the maximisation approach. This approach is particularly convenient, a fact that we illustrate with Example 1 and Lemma 1. Furthermore, we derive in Corollary 2 additional iterative schemes satisfying the conditions of Theorem 1 under the name gradient-based approach, which we then use to underline the links between our approach and Gradient Descent schemes for α-divergence and Rényi's α-divergence minimisation.
In Section 4, we extend the results from Section 3 to the more general case of mixture models. We derive in Theorems 2 and 3 conditions to simultaneously optimise the weights and the components parameters of a given mixture model, all the while maintaining the systematic decrease in the α-divergence initially enjoyed in Section 3. These conditions are then met in Corollaries 4 and 5, which respectively generalise the maximisation approach of Corollary 1 and the gradient-based approach of Corollary 2 to the case of mixture models.

Section 5 is devoted to related work. We explain in more detail the links between our framework and existing Variational Inference methods for α-divergence minimisation. We also connect our approach to the Power Descent algorithm from Daudel et al. (2021) and provide in Proposition 2 additional monotonicity results which go beyond the case α ∈ [0, 1). In addition, we obtain that an integrated Expectation Maximization algorithm introduced in Cappé et al. (2008) can be recovered as a special case of our framework.

In Section 6, we state generic results that solve our maximisation and gradient-based approaches when the variational family is based on the exponential family. More specifically, we introduce in Section 6.1 additional notation for the exponential family and we recall several of its useful properties. Section 6.2 and Section 6.3 then focus on the maximisation approach and the gradient-based approach respectively and notably provide the theoretical justification behind some special cases mentioned earlier in the paper, such as Gaussian densities in Example 1. Lastly, Section 6.4 extends once again the maximisation approach in order to incorporate variational families that were not previously included in our framework, yet draw on the exponential family (such as multivariate Student's t densities in Example 5).
In Section 7, we focus on the important case of Gaussian Mixture Models (GMMs) and we discuss practical implementations of our algorithms for GMMs optimisation. Finally, we show in Section 8 that our enhanced framework has empirical benefits that permit us to improve on the existing algorithms presented in Section 5 when we consider a variety of multimodal targets as well as a real data example.

2. Notation and Optimisation Problem

Let (Y, 𝒴, ν) be a measured space, where ν is a σ-finite measure on (Y, 𝒴). Assume that we have access to some observed variables D generated from a probabilistic model p(·|y) parameterised by a hidden random variable y ∈ Y that is drawn from a certain prior p0. The posterior density of the latent variable y given the data D is then given by

    p(y|D) = p(y, D)/p(D) = p0(y)p(D|y)/p(D) ,

where p(D) = ∫_Y p0(y)p(D|y)ν(dy) is called the marginal likelihood or model evidence. For many interesting choices of models, sampling directly from the posterior density is impossible and the marginal likelihood is also unknown or too costly to compute. To address this difficulty, Variational Inference methods approximate the posterior density by a simpler probability density belonging to a given variational family Q (from which it is easy to sample). They select the best probability density in Q by solving an optimisation problem that involves a certain measure of dissimilarity, which is chosen to be the α-divergence in this paper due to its advantages compared to the more traditional exclusive Kullback-Leibler divergence (see for example Minka, 2005).

More precisely, let us denote by P the probability measure on (Y, 𝒴) with corresponding density p(·|D) with respect to ν. As for the variational family, let us denote by Q the probability measure on (Y, 𝒴) with associated density q ∈ Q with respect to ν. We let fα be the convex function on (0, +∞) defined by f0(u) = −log(u), f1(u) = u log(u) and fα(u) = (u^α − 1)/(α(α − 1)) for all α ∈ ℝ \ {0, 1}.
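As a quick illustrative aside (our own numerical check, not part of the paper), the definition of fα and its continuous extension at α = 0 can be spot-checked pointwise, together with the convexity of fα on (0, +∞):

```python
import math

# Sketch of f_alpha as defined above; the names below are ours, not the paper's.
def f_alpha(u, alpha):
    if alpha == 0.0:
        return -math.log(u)      # continuous extension at alpha = 0
    if alpha == 1.0:
        return u * math.log(u)   # continuous extension at alpha = 1
    return (u**alpha - 1.0) / (alpha * (alpha - 1.0))

u = 2.5
# (u^alpha - 1)/(alpha(alpha - 1)) -> -log(u) as alpha -> 0
assert abs(f_alpha(u, 1e-9) - f_alpha(u, 0.0)) < 1e-6
# convexity of f_alpha, spot-checked at the midpoint of (1, 4)
assert f_alpha(2.5, 0.5) <= 0.5 * (f_alpha(1.0, 0.5) + f_alpha(4.0, 0.5))
```

Note that the extension at α = 1 holds at the divergence level rather than pointwise (fα(u) does not converge to u log(u) pointwise as α → 1), which is why only the α → 0 limit is checked here.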
Then, assuming that both Q and P are absolutely continuous w.r.t. ν, the α-divergence between Q and P (extended by continuity to the cases α = 0 and α = 1 as for example done in Cichocki and Amari, 2010) is given by

    Dα(Q||P) = ∫_Y fα( q(y)/p(y|D) ) p(y|D) ν(dy) ,    (1)

and the Variational Inference optimisation problem we aim at solving is inf_{q∈Q} Dα(Q||P). Notably, it can easily be proven that this optimisation problem is equivalent to solving

    inf_{q∈Q} Ψα(q; p)  with  p(y) = p(y, D) for all y ∈ Y ,    (2)

where, for all measurable non-negative functions p on (Y, 𝒴) and for all q ∈ Q, we have set

    Ψα(q; p) = ∫_Y fα( q(y)/p(y) ) p(y) ν(dy) .

As the unknown constant p(D) does not appear anymore in the optimisation problem (2), this formulation is often preferred. Therefore, we consider the general optimisation problem

    inf_{q∈Q} Ψα(q; p) ,    (3)

where p is any measurable positive function on (Y, 𝒴). Note that we may drop the dependency on p in Ψα for notational ease and when no ambiguity occurs.

At this stage, we are left with the choice of the variational family Q appearing in the optimisation problem (3). The natural idea in Variational Inference, and the starting point of our approach, is then to work within a parametric family: letting (T, 𝒯) be a measurable space and K : (θ, A) ↦ ∫_A k(θ, y)ν(dy) be a Markov transition kernel on T × 𝒴 with kernel density k defined on T × Y, we consider a parametric family of the form

    Q = {q : y ↦ k(θ, y) : θ ∈ T} .

3. An Iterative Algorithm for Optimising Ψα(k(θ, ·); p)

In this section, our goal is to define iterative procedures which optimise Ψα(k(θ, ·); p) with respect to θ and which ensure a systematic decrease in Ψα at each step. For this purpose, we start by introducing some mild conditions on k, p and ν that will be used throughout the paper.

(A1) The kernel density k on T × Y, the function p on Y and the σ-finite measure ν on (Y, 𝒴) satisfy, for all (θ, y) ∈ T × Y, k(θ, y) > 0, p(y) ≥ 0 and 0 < ∫_Y p(y)ν(dy) < ∞.
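To make the equivalence between (2) and the original problem inf_{q∈Q} Dα(Q||P) concrete, the following sketch (our own quadrature-based check with an assumed 1-d Gaussian setup, not code from the paper) verifies numerically that, for α ∈ (0, 1), Ψα(q; Z p̄) = Z^{1−α} Dα(Q||P) + (Z^{1−α} − Z)/(α(α − 1)) whenever p = Z p̄, with p̄ the normalised posterior density and Z the marginal likelihood; this is an increasing affine function of Dα(Q||P), so both problems share the same minimiser:

```python
import numpy as np

alpha, Z = 0.5, 7.3
grid = np.linspace(-25.0, 25.0, 100001)
dy = grid[1] - grid[0]

def gaussian(y, m, s):
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def trapz(v):                       # trapezoidal quadrature on the fixed grid
    return (v[1:] + v[:-1]).sum() * dy / 2.0

def psi(q, target):                 # Psi_alpha(q; target), target may be unnormalised
    return trapz((q**alpha * target**(1.0 - alpha) - target) / (alpha * (alpha - 1.0)))

p_bar = gaussian(grid, 1.0, 2.0)    # normalised target density, playing p(.|D)
p = Z * p_bar                       # unnormalised p(., D) with evidence Z = p(D)

for m in (-1.0, 0.0, 2.0):
    q = gaussian(grid, m, 1.0)
    d_alpha = psi(q, p_bar)         # D_alpha(Q||P), since p_bar is normalised
    affine = Z**(1 - alpha) * d_alpha + (Z**(1 - alpha) - Z) / (alpha * (alpha - 1))
    assert abs(psi(q, p) - affine) < 1e-5
```

The affine identity above is our own elementary computation from the definition of Ψα; the paper itself only uses the (weaker) fact that the two problems are equivalent.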
From this point onwards, and as announced in the previous section, we will drop the dependency on p in Ψα for notational convenience. Let us now construct a sequence (θn)n≥1 valued in T such that the sequence (Ψα(k(θn, ·)))n≥1 is decreasing. The core idea of our approach relies on the following proposition.

Proposition 1 Assume (A1). For all α ∈ [0, 1) and all θ, θ′ ∈ T such that Ψα(k(θ′, ·)) < ∞, it holds that

    Ψα(k(θ, ·)) ≤ ∫_Y [k(θ′, y)^α p(y)^{1−α} / (α − 1)] log( k(θ, y)/k(θ′, y) ) ν(dy) + Ψα(k(θ′, ·)) .    (4)

Moreover, if α = 0, this inequality is an equality and, if α ∈ (0, 1), equality in (4) is equivalent to having k(θ, y) = k(θ′, y) for ν-a.e. y ∈ {p > 0}.

Proof We treat the two cases α = 0 and α ∈ (0, 1) separately.

(a) Case α = 0, with f0(u) = −log(u) for all u > 0. This case is immediate since

    Ψ0(k(θ, ·)) = −∫_Y p(y) log( k(θ, y)/k(θ′, y) ) ν(dy) + Ψ0(k(θ′, ·)) .

(b) Case α ∈ (0, 1), with fα(u) = (u^α − 1)/(α(α − 1)) for all u > 0. We have that

    Ψα(k(θ, ·)) = ∫_Y [ (k(θ, y)/k(θ′, y))^α − 1 ] k(θ′, y)^α p(y)^{1−α} / (α(α − 1)) ν(dy) + Ψα(k(θ′, ·)) .

Furthermore, using that log(u^α) ≤ u^α − 1 for all u > 0 with equality only if u = 1, and since α(α − 1) < 0 for α ∈ (0, 1), we obtain

    (u^α − 1)/(α(α − 1)) ≤ log(u^α)/(α(α − 1)) = log(u)/(α − 1) .

Applying this with u = k(θ, y)/k(θ′, y) in the previous expression, we get that

    Ψα(k(θ, ·)) ≤ ∫_Y [k(θ′, y)^α p(y)^{1−α} / (α − 1)] log( k(θ, y)/k(θ′, y) ) ν(dy) + Ψα(k(θ′, ·)) ,

which is exactly (4). Moreover, the equality holds if k(θ, y)/k(θ′, y) = 1 for ν-a.e. y such that k(θ′, y)^α p(y)^{1−α} > 0, which, by (A1), only happens if p(y) > 0.

This result then allows us to deduce Theorem 1 below.

Theorem 1 Assume (A1). Let α ∈ [0, 1) and, starting from an initial θ1 ∈ T, let (θn)n≥1 be defined iteratively such that for all n ≥ 1,

    ∫_Y [k(θn, y)^α p(y)^{1−α} / (α − 1)] log( k(θn+1, y)/k(θn, y) ) ν(dy) ≤ 0 .    (5)

Further assume that Ψα(k(θ1, ·)) < ∞. Then, Ψα(k(θn+1, ·)) ≤ Ψα(k(θn, ·)) for all n ≥ 1, with an equality occurring only if (5) is also an equality.

Proof Applying Proposition 1 iteratively for n = 1, 2, . . .
with θ = θn+1 and θ′ = θn in (4), combined with (5), we get that Ψα(k(θn, ·)) < ∞ and

    Ψα(k(θn+1, ·)) ≤ ∫_Y [k(θn, y)^α p(y)^{1−α} / (α − 1)] log( k(θn+1, y)/k(θn, y) ) ν(dy) + Ψα(k(θn, ·)) ≤ Ψα(k(θn, ·)) .

Suppose now that Ψα(k(θn+1, ·)) = Ψα(k(θn, ·)) for some n ≥ 1. Then we have equalities in the previous display and thus an equality in (5) and in (4) with θ = θn+1 and θ′ = θn.

At this point, we seek to find iterative schemes satisfying (5). To do so, for all α ∈ [0, 1), all n ≥ 1 and all y ∈ Y, we introduce the notation:

    ϕ_n^{(α)}(y) = k(θn, y)^α p(y)^{1−α}  and  ˇϕ_n^{(α)} = ϕ_n^{(α)} / ∫_Y ϕ_n^{(α)} dν .    (6)

Note that under (A1), the normalisation in ˇϕ_n^{(α)} satisfies 0 < ∫_Y ϕ_n^{(α)} dν < ∞, where the right inequality in particular follows from Jensen's inequality applied to the strictly concave function u ↦ u^{1−α}. We then state our first corollary.

Corollary 1 (Maximisation approach) Assume (A1). Let α ∈ [0, 1) and let (bn)n≥1 be a non-negative sequence. Starting from an initial θ1 ∈ T, let (θn)n≥1 be defined iteratively as follows:

    θn+1 = argmax_{θ∈T} ∫_Y [ϕ_n^{(α)}(y) + bn k(θn, y)] log( k(θ, y)/k(θn, y) ) ν(dy) ,  n ≥ 1 ,    (7)

where we assume that this argmax is uniquely defined at each step. Then (5) holds at all times n ≥ 1 and we can apply Theorem 1.

Proof Under (A1) and using that α ∈ [0, 1), we can rewrite (5) as

    ∫_Y ϕ_n^{(α)}(y) log( k(θn+1, y)/k(θn, y) ) ν(dy) ≥ 0 .

Since ∫_Y k(θn, y) log( k(θn+1, y)/k(θn, y) ) ν(dy) = −D1(K(θn, ·)||K(θn+1, ·)) ≤ 0, we also obtain that

    ∫_Y ϕ_n^{(α)}(y) log( k(θn+1, y)/k(θn, y) ) ν(dy) ≥ ∫_Y [ϕ_n^{(α)}(y) + bn k(θn, y)] log( k(θn+1, y)/k(θn, y) ) ν(dy) .

Thus, (5) holds at all times n if the following condition is satisfied:

    ∫_Y [ϕ_n^{(α)}(y) + bn k(θn, y)] log( k(θn+1, y)/k(θn, y) ) ν(dy) ≥ 0 ,  n ≥ 1 ,

which is implied by the definition of θn+1 in (7), since the objective function in (7) equals zero at θ = θn and is thus non-negative at its argmax θn+1.

We now make three comments regarding Corollary 1.

The r.h.s. of (7) can be simplified, yielding the equivalent iterative scheme

    θn+1 = argmax_{θ∈T} ∫_Y [ϕ_n^{(α)}(y) + bn k(θn, y)] log k(θ, y) ν(dy) ,  n ≥ 1 ,

provided that ∫_Y [ϕ_n^{(α)}(y) + bn k(θn, y)] |log k(θn, y)| ν(dy) < ∞ for all n ≥ 1.
While it suffices to find any θn+1 such that (5) holds to obtain a systematic decrease in Ψα, defining θn+1 as in (7) enables us to solve this argmax problem for notable choices of kernel density k. A remarkable aspect is indeed that (7) is written as a maximisation problem involving the logarithm of the ratio k(θ, y)/k(θn, y), whereas Ψα is not directly expressed in terms of the logarithm of this ratio for α ∉ {0, 1}. As a result, we can use the maximisation approach of Corollary 1 to derive simple update rules for (θn)n≥1. An example and a lemma are provided thereafter to illustrate this fact and more general results regarding exponential families will follow later in Section 6.

The sequence (bn)n≥1 appearing in (7) is arbitrary and can as a result be chosen by the practitioner. It is responsible for introducing a regularisation term in the argmax problem (7), so that the larger bn is, the closer θn+1 is to θn.

We next provide a motivating example where the maximisation approach is applicable.

Example 1 (Maximisation approach for a Gaussian density) We consider a d-dimensional Gaussian density, in which case Y = ℝ^d, ν is the d-dimensional Lebesgue measure and k(θ, y) = N(y; m, Σ), where θ = (m, Σ) ∈ T denotes the mean and covariance matrix of the Gaussian density. We let the parameter set T include all possible means m in ℝ^d and all possible positive-definite covariance matrices Σ. Finally, denoting by ‖·‖ the Euclidean norm, we assume that the non-negative function p defined on Y satisfies

    ∫_{ℝ^d} (1 + ‖y‖^{2/(1−α)}) p(y) dy < ∞ .    (8)

Then (A1) holds. Furthermore, starting from any θ1 = (m1, Σ1) ∈ T such that Ψα(k(θ1, ·)) < ∞ and denoting θn = (mn, Σn) for all n ≥ 1, it holds that: for all n ≥ 1 and all γn ∈ (0, 1], there exists bn ≥ 0 such that the argmax problem (7) has a unique solution defined by

    mn+1 = γn m̂n + (1 − γn) mn ,
    Σn+1 = γn Σ̂n + (1 − γn) Σn + γn(1 − γn) (m̂n − mn)(m̂n − mn)^T ,    (9)

where, using the definition of ˇϕ_n^{(α)} in (6), we set

    m̂n = ∫_Y y ˇϕ_n^{(α)}(y) ν(dy) ,  Σ̂n = ∫_Y y y^T ˇϕ_n^{(α)}(y) ν(dy) − m̂n m̂n^T .
Here, the detailed derivation of (9) is deferred to Section 6.2 as we focus on interpreting this result in light of Corollary 1 (in particular, we will see in Corollary 6 that the case γn = 1, corresponding to bn = 0, will require an additional non-degeneracy condition on p). By Corollary 1, the systematic decrease in (Ψα(k(θn, ·)))n≥1 holds for any choice of sequence (γn)n≥1 valued in (0, 1] in the updates (9). In addition, γn permits us to build a trade-off at time n ≥ 1 between selecting an update close to the current parameter (that is, taking γn close to zero) and choosing the Gaussian density with exactly the same mean and covariance matrix as ˇϕ_n^{(α)} (that is, setting γn = 1). This is a key idea that is linked to the regularisation term appearing in (7) and that we will revisit several times throughout the paper.

The maximisation approach of Corollary 1 is also applicable to the commonly-used mean-field variational family, which approximates the true density by a density with independent components parameterised separately. The following lemma indeed states that the global argmax problem can be separated into component-wise argmax problems in that case.

Lemma 1 (Maximisation approach for the mean-field family) A generic member of the mean-field variational family is given by k(θ, y) = ∏_{ℓ=1}^L k^{(ℓ)}(θ^{(ℓ)}, y^{(ℓ)}) with θ = (θ^{(1)}, . . . , θ^{(L)}) ∈ T and y = (y^{(1)}, . . . , y^{(L)}) ∈ Y. Then, starting from θ1 ∈ T and denoting θn = (θ^{(1)}_n, . . . , θ^{(L)}_n) for all n ≥ 1, solving (7) yields the following update formulas:

    θ^{(ℓ)}_{n+1} = argmax_{θ^{(ℓ)}} ∫_Y [ϕ_n^{(α)}(y) + bn k(θn, y)] log( k^{(ℓ)}(θ^{(ℓ)}, y^{(ℓ)}) / k^{(ℓ)}(θ^{(ℓ)}_n, y^{(ℓ)}) ) ν(dy) ,  ℓ = 1 . . . L ,  n ≥ 1 .

The maximisation approach is not the only way to satisfy (5). Indeed, this can also be achieved by taking a gradient-based approach and relying on additional smoothness conditions (see Appendix A.1 for the definition of β-smoothness), as written in Corollary 2.

Corollary 2 (Gradient-based approach) Assume (A1).
Let T ⊆ ℝ^d be a convex set, let α ∈ [0, 1) and let (γn)n≥1 be valued in (0, 1]. Starting from an initial θ1 ∈ T, let (θn)n≥1 be defined iteratively as follows:

    θn+1 = θn − (γn/βn) ∇gn(θn) ,  n ≥ 1 ,    (10)

where (gn)n≥1 is the sequence of functions defined by: for all n ≥ 1 and all θ ∈ T,

    gn(θ) = ∫_Y [ϕ_n^{(α)}(y) / (α − 1)] log( k(θ, y)/k(θn, y) ) ν(dy) ,    (11)

and gn is assumed to be βn-smooth. Then (5) holds for all n ≥ 1 and we can apply Theorem 1.

Proof Since γn ∈ (0, 1] and gn is a βn-smooth function by assumption, we can apply Lemma 2 and we obtain that for all n ≥ 1,

    gn(θn+1) ≤ gn(θn) − (γn/βn) ‖∇gn(θn)‖² + (γn²/(2βn)) ‖∇gn(θn)‖² ≤ gn(θn) .

Thus, by definition of θn+1 in (10), we have 0 = gn(θn) ≥ gn(θn+1), which in turn implies (5) and the proof is concluded.

Let us now reflect on the implications of Corollary 2. Under common differentiability assumptions, we can write: for all n ≥ 1 and all θ′ ∈ T,

    ∇gn(θ′) = ∫_Y [ϕ_n^{(α)}(y) / (α − 1)] ∇θ log k(θ, y)|_{θ=θ′} ν(dy) .

Then, (10) becomes

    θn+1 = θn − (γn/βn) ∫_Y [ϕ_n^{(α)}(y) / (α − 1)] ∇θ log k(θ, y)|_{θ=θn} ν(dy) ,  n ≥ 1 .    (12)

Under this form, the iterative scheme (12) bears similarities with Gradient Descent iterations for α-divergence and Rényi's α-divergence minimisation. Indeed, given a learning rate policy (rn)n≥1 and setting p = p(·, D), such Gradient Descent iterations are given respectively by

    θn+1 = θn − rn ∫_Y [ϕ_n^{(α)}(y) / (α − 1)] ∇θ log k(θ, y)|_{θ=θn} ν(dy) ,  n ≥ 1 ,
    θn+1 = θn − rn ∫_Y [ˇϕ_n^{(α)}(y) / (α − 1)] ∇θ log k(θ, y)|_{θ=θn} ν(dy) ,  n ≥ 1

(we refer to Appendix A.2 for details regarding how these updates are obtained). Building on this comment, let us now give an example where the conditions on (gn)n≥1 from Corollary 2 are satisfied and show how Gradient Descent steps (in that case for Rényi's α-divergence minimisation) can originate from our gradient-based approach.

Example 2 (Gradient-based approach for a Gaussian density) We consider the case of a d-dimensional Gaussian density with k(θ, y) = N(y; m, Σ), where this time θ = m.
Further assume that T ⊆ ℝ^d is a convex subset and that Σ = σ²Id, where σ² > 0 and Id denotes the d-dimensional identity matrix. Finally, denoting by ‖·‖ the Euclidean norm, we assume that the non-negative function p defined on Y satisfies

    ∫_{ℝ^d} (1 + ‖y‖^{1/(1−α)}) p(y) dy < ∞ .    (13)

Then, (A1) holds. Furthermore, denoting θn = mn and setting βn = σ^{−2}(1 − α)^{−1} ∫_Y ϕ_n^{(α)} dν for all n ≥ 1, the conditions on (gn)n≥1 from Corollary 2 are satisfied, so that the mean of the Gaussian density N(y; m, Σ) can be optimised as follows: for all n ≥ 1 and all γn ∈ (0, 1],

    mn+1 = γn m̂n + (1 − γn) mn ,    (14)

where m̂n = ∫_Y y ˇϕ_n^{(α)}(y) ν(dy) (we refer to Corollary 8 for the detailed derivation). Setting p = p(·, D), the update (14) can notably be seen as a Gradient Descent step for Rényi's α-divergence minimisation with a learning rate rn = σ²(1 − α)γn, where γn ∈ (0, 1].

We now make two important comments.

Corollary 2 sheds light on the links between our approach and the more traditional Gradient Descent methodology for optimising objective functions based on the α-divergence in Variational Inference (Li and Turner, 2016; Dieng et al., 2017).

Unlike the usual Gradient Descent methodology, Corollary 2 requires a smoothness condition on gn. Smoothness conditions are tools that are often used to obtain stronger convergence guarantees for Variational Inference algorithms (see for example Buchholz et al., 2018; Alquier and Ridgway, 2020). Yet, it can be difficult to satisfy these smoothness assumptions in practice, even if some results have been derived for specific variational families when using the exclusive Kullback-Leibler divergence as the objective function (Domke, 2020). To the best of our knowledge, no such results have been proven for α-divergence minimisation. In our case, we obtain that a smoothness condition on gn translates into a systematic decrease in Ψα.
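To make the mean update concrete, here is a small self-contained sketch (our own toy experiment, with an assumed 1-d setup and quadrature in place of exact integrals; not code from the paper) of the update (14) for a Gaussian kernel and an unnormalised Gaussian target, checking that Ψα decreases at every step:

```python
import numpy as np

# Assumed toy setup: alpha = 0.5, unit-variance Gaussian kernel k(m, .),
# unnormalised Gaussian target p, and gamma_n = 0.5 at every step.
alpha, sigma, gamma = 0.5, 1.0, 0.5
grid = np.linspace(-15.0, 15.0, 60001)
dy = grid[1] - grid[0]

def gaussian(y, m, s):
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def trapz(v):                        # trapezoidal quadrature on the fixed grid
    return (v[1:] + v[:-1]).sum() * dy / 2.0

p = 3.0 * gaussian(grid, 2.0, 1.5)   # unnormalised target

def psi(m):                          # Psi_alpha(k(m,.); p) by quadrature
    q = gaussian(grid, m, sigma)
    return trapz((q**alpha * p**(1 - alpha) - p) / (alpha * (alpha - 1)))

m, values = -4.0, []
for _ in range(100):
    values.append(psi(m))
    phi = gaussian(grid, m, sigma)**alpha * p**(1 - alpha)
    m_hat = trapz(grid * phi) / trapz(phi)   # mean of the normalised phi_n
    m = gamma * m_hat + (1 - gamma) * m      # update (14)

# systematic decrease of Psi_alpha, and convergence of the mean to the target's
assert all(b <= a + 1e-9 for a, b in zip(values, values[1:]))
assert abs(m - 2.0) < 1e-3
```

Because the Gaussian target here has mean 2, the fixed point of (14) is exactly m = 2, which the iterates approach geometrically while Ψα decreases monotonically, as Corollary 2 predicts.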
As we shall see in the forthcoming section, being able to establish monotonicity results on (θn)n≥1 will come in handy as we try to go beyond the framework considered in Section 3. To put things into perspective with the maximisation approach from earlier, observe that the updates on the means in Examples 1 and 2 coincide. The difference between the two examples is due to the fact that the former provides an update for the covariance matrix as well, without sacrificing the monotonicity of the overall algorithm. It is in fact hard to derive a covariance matrix update in the Gaussian case using the gradient-based approach (see Section 6.3 for details). In that sense, the maximisation approach provides an interesting alternative that can bypass this difficult smoothness assumption on gn. We will further delve into this aspect as we reach Section 6.

So far, the log function appearing in Theorem 1 considerably eased the derivation of iterative update formulas for well-chosen kernel densities k that can exploit this log structure (Examples 1-2 and Lemma 1). Yet, these choices of kernel densities can be too restrictive to fully capture complex and multimodal posterior densities. Since working with a larger variational family might lead to more accurate approximations of the posterior density, an idea is then to investigate whether the iterative update formulas from Section 3 can be generalised to the mixture model variational family (for example, whether we can extend the updates for a Gaussian kernel from Example 1 to Gaussian Mixture Models). As we shall see in the next section, further theoretical developments will be required in order to derive valid iterative schemes that optimise both the mixture weights and the mixture components parameters of a given mixture model.

4. Extension to Mixture Models

Let us first formally define the class of mixture models we are going to be working with. Given J ∈ ℕ*, we introduce the simplex of ℝ^J:

    SJ = {λ = (λ1, . . . , λJ) ∈ ℝ^J : ∀j ∈ {1, . . .
, J}, λj ≥ 0 and ∑_{j=1}^J λj = 1} ,

we define S+J = {λ ∈ SJ : ∀j ∈ {1, . . . , J}, λj > 0} and we denote Θ = (θ1, . . . , θJ) ∈ T^J. We consider the mixture model variational family given by

    Q = { q : y ↦ μλ,Θ k(y) := ∑_{j=1}^J λj k(θj, y) : λ ∈ SJ, Θ ∈ T^J } ,    (15)

that is, we are interested in solving the optimisation problem inf_{λ∈SJ, Θ∈T^J} Ψα(μλ,Θ k), with J > 1. Let us next denote λn = (λj,n)_{1≤j≤J} and Θn = (θj,n)_{1≤j≤J} for all n ≥ 1. For convenience, we also introduce the shorthand notation μn k = μλn,Θn k and

    ϕ_{j,n}^{(α)}(y) = k(θj,n, y) (μn k(y))^{α−1} p(y)^{1−α}  and  ˇϕ_{j,n}^{(α)} = ϕ_{j,n}^{(α)} / ∫_Y ϕ_{j,n}^{(α)} dν    (16)

for all α ∈ [0, 1), all j = 1 . . . J, all n ≥ 1 and all y ∈ Y. Our goal in this section is to derive iterative schemes for the mixture weights and the mixture components parameters (λn, Θn)n≥1 ensuring that the sequence (Ψα(μn k))n≥1 is decreasing. As Theorem 1 holds for any choice of parametric family, a first idea is to apply Theorem 1 to the variational family (15), which gives Corollary 3 below.

Corollary 3 Assume (A1). Let J ∈ ℕ*, let α ∈ [0, 1) and, starting from an initial parameter set (λ1, Θ1) ∈ S+J × T^J, let (λn, Θn)n≥1 be defined iteratively such that for all n ≥ 1,

    ∫_Y [(μn k(y))^α p(y)^{1−α} / (α − 1)] log( μn+1 k(y)/μn k(y) ) ν(dy) ≤ 0 .    (17)

Further assume that Ψα(μ1 k) < ∞. Then, Ψα(μn+1 k) ≤ Ψα(μn k) for all n ≥ 1.

Observe that we are in a less favourable situation in Corollary 3 with J > 1 compared to the cases we previously studied in Section 3. Indeed, we now have a ratio of sums inside the log function in (17), meaning that the approach from Theorem 1 to derive simple iterative schemes does not directly transfer to the variational family (15) for the choices of kernel densities k identified in Examples 1-2 and Lemma 1. However, by carefully exploiting the condition (17), we are able to overcome this difficulty in our second main theorem below.

Theorem 2 Assume (A1).
Let J ∈ ℕ*, α ∈ [0, 1) and, starting from an initial parameter set (λ1, Θ1) ∈ S+J × T^J, let (λn, Θn)n≥1 be defined iteratively such that for all n ≥ 1,

    ∑_{j=1}^J λj,n ∫_Y [ϕ_{j,n}^{(α)}(y) / (α − 1)] log( λj,n+1/λj,n ) ν(dy) ≤ 0 ,    (18)

    ∑_{j=1}^J λj,n ∫_Y [ϕ_{j,n}^{(α)}(y) / (α − 1)] log( k(θj,n+1, y)/k(θj,n, y) ) ν(dy) ≤ 0 .    (19)

Further assume that Ψα(μ1 k) < ∞. Then, Ψα(μn+1 k) ≤ Ψα(μn k) for all n ≥ 1.

Proof By Corollary 3, we can conclude if we show that (18) and (19) together imply (17). To show this, first observe that since the function u ↦ log(u)/(α − 1) is convex and α ∈ [0, 1), Jensen's inequality implies that: for all y ∈ Y and all n ≥ 1,

    (1/(α − 1)) log( μn+1 k(y)/μn k(y) )
      = (1/(α − 1)) log( ∑_{j=1}^J [λj,n k(θj,n, y) / μn k(y)] · [λj,n+1 k(θj,n+1, y) / (λj,n k(θj,n, y))] )
      ≤ ∑_{j=1}^J [λj,n k(θj,n, y) / μn k(y)] (1/(α − 1)) log( λj,n+1 k(θj,n+1, y) / (λj,n k(θj,n, y)) ) .

Multiplying by (μn k(y))^α p(y)^{1−α} on both sides, integrating with respect to ν(dy) and using the definition of ϕ_{j,n}^{(α)}(y) in (16), this in turn implies that: for all n ≥ 1,

    ∫_Y [(μn k(y))^α p(y)^{1−α} / (α − 1)] log( μn+1 k(y)/μn k(y) ) ν(dy)
      ≤ ∑_{j=1}^J λj,n ∫_Y [ϕ_{j,n}^{(α)}(y) / (α − 1)] log( λj,n+1 k(θj,n+1, y) / (λj,n k(θj,n, y)) ) ν(dy) .

As a consequence, the condition

    ∑_{j=1}^J λj,n ∫_Y [ϕ_{j,n}^{(α)}(y) / (α − 1)] log( λj,n+1 k(θj,n+1, y) / (λj,n k(θj,n, y)) ) ν(dy) ≤ 0    (20)

implies (17). The condition (20) in itself is then straightforwardly implied by (18) and (19) and the proof is concluded.

Strikingly, (18) does not depend on Θn+1, nor does (19) depend on λn+1 in Theorem 2. This means that we can treat these two conditions separately and thus that the weights and components parameters of the mixture can be optimised simultaneously. This result was far from immediate by looking at the initial condition (17) and, as we shall see later in Section 8, will lead to a reduced computational cost in practice. In addition, we have recovered in (19) the key property used in Section 3: compared to the condition (17), which involved the log of weighted sums of kernel densities, (19) considers a sum of logs of kernel densities.
This suggests that we can extend the updates derived in Section 3 for well-chosen kernel densities k to the more general case of mixture models. Observe finally that the dependency on λj,n+1 appearing in (18) is simpler than the dependency on θj,n+1 appearing in (19), which is expressed through the kernel density k. As a result, we will first study (18), in the hope of deriving iterative update formulas for the mixture weights that do not require a specific choice of kernel density k. Interestingly, while the natural idea is to perform direct optimisation of the left-hand side of (18), we will derive a more general expression for the mixture weights update, which shall induce numerical advantages later illustrated in Section 8.

4.1 Choice of (λn)n≥1

In the following theorem, we identify an update formula which satisfies (18), regardless of the choice of the kernel density k.

Theorem 3 Assume (A1). Let α ∈ [0, 1), let (ηn)n≥1 be valued in (0, 1], let (κn)n≥1 be such that (α − 1)κn ≥ 0 at all times n ≥ 1 and let (Θn)n≥1 be any sequence valued in T^J. Starting from an initial λ1 ∈ S+J, let (λn)n≥1 be defined iteratively such that for all n ≥ 1,

    λj,n+1 = λj,n [∫_Y ϕ_{j,n}^{(α)}(y)ν(dy) + (α − 1)κn]^{ηn} / ∑_{ℓ=1}^J λℓ,n [∫_Y ϕ_{ℓ,n}^{(α)}(y)ν(dy) + (α − 1)κn]^{ηn} ,  j = 1 . . . J .    (21)

Then (18) holds.

Proof We first check that the integrals appearing in (21) are finite:

    ∫_Y ϕ_{j,n}^{(α)}(y) ν(dy) < ∞ ,  j = 1, . . . , J ,  n ≥ 1 .    (22)

Using (16), that λj,n > 0, that λj,n k(θj,n, y) ≤ μn k(y) and Jensen's inequality applied to the concave function u ↦ u^{1−α}, we have, for any j = 1, . . . , J and n ≥ 1,

    ∫_Y ϕ_{j,n}^{(α)}(y) ν(dy) = ∫_Y k(θj,n, y) (μn k(y))^{α−1} p(y)^{1−α} ν(dy) ≤ λj,n^{−1} ∫_Y μn k(y) ( p(y)/μn k(y) )^{1−α} ν(dy) ≤ λj,n^{−1} ( ∫_Y p(y)ν(dy) )^{1−α} .

The bound (22) follows by (A1). Now, to prove (18), we treat the cases ηn = 1 and ηn ∈ (0, 1) separately.

(a) Case ηn = 1. Since (α − 1)κn ≥ 0 with α ∈ [0, 1), we have that

    κn ∑_{j=1}^J λj,n log( λj,n+1/λj,n ) ≥ 0 ,

where we have used that ∑_{j=1}^J λj,n log( λj/λj,n ) ≤ ∑_{j=1}^J λj,n ( λj/λj,n − 1 ) = 0 .
In other words, to obtain (18) in the particular case ηn = 1, it is enough to show that

    ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] log( λj,n+1/λj,n ) ≤ 0 .    (23)

Notice then that, by definition of (λj,n+1)_{1≤j≤J} when ηn = 1, we can write

    λn+1 = argmin_{λ∈S+J} ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] log( λj/λj,n ) .

[Indeed, setting βj = λj,n [∫_Y ϕ_{j,n}^{(α)}(y)ν(dy) + (α − 1)κn] and β̄j = βj / ∑_{ℓ=1}^J βℓ for all j = 1 . . . J, we have that ∑_{j=1}^J β̄j log( β̄j/λj ) ≥ 0 and that this quantity is minimal when λj = β̄j for j = 1 . . . J.] This implies (23) and settles the case ηn = 1.

(b) For the particular case ηn ∈ (0, 1), we will use that for all ϵ > 0 and all u > 0,

    log(u) = log(u^ϵ)/ϵ ≤ (u^ϵ − 1)/ϵ .

Indeed, since ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ≤ 0 for all j = 1 . . . J, we can then write that for all ϵ > 0,

    ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] log( λj,n+1/λj,n )
      ≤ −(1/ϵ) ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] [ ( λj,n/λj,n+1 )^ϵ − 1 ] .    (24)

Now notice that, by definition of (λj,n+1)_{1≤j≤J}, we can write

    λn+1 = argmin_{λ∈S+J} −(1/ϵ) ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] ( λj,n/λj )^ϵ

when ϵ satisfies ηn = 1/(1 + ϵ). [Indeed, setting βj = λj,n [∫_Y ϕ_{j,n}^{(α)}(y)ν(dy) + (α − 1)κn]^{1/(1+ϵ)} and β̄j = βj / ∑_{ℓ=1}^J βℓ for all j = 1 . . . J, we have by convexity of the function u ↦ u^{1+ϵ} that ∑_{j=1}^J (βj/λj)^{1+ϵ} λj ≥ ( ∑_{j=1}^J βj )^{1+ϵ} and that this quantity is minimal when λj = β̄j for j = 1 . . . J.] We then deduce that taking ϵ = ηn^{−1} − 1 (which is always possible since ηn ∈ (0, 1) by assumption) yields

    −(1/ϵ) ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] [ ( λj,n/λj,n+1 )^ϵ − 1 ] ≤ 0 ,

which in turn yields (18) [since, combined with (24), it implies (23), which itself implies (18) as seen in the case ηn = 1]. This settles the case ηn ∈ (0, 1).

Notice that, as a byproduct of the proof of Theorem 3, the mixture weights update given by (21) can be rewritten under the form: for all n ≥ 1,

    λn+1 = argmin_{λ∈S+J} hn(λ) ,

where, setting ϵ = ηn^{−1} − 1, we have defined for all λ ∈ S+J,

    hn(λ) = ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] log( λj/λj,n ) ,  if ηn = 1 ,
    hn(λ) = −(1/ϵ) ∑_{j=1}^J λj,n [ ∫_Y ϕ_{j,n}^{(α)}(y) / (α − 1) ν(dy) + κn ] [ ( λj,n/λj )^ϵ − 1 ] ,  if ηn ∈ (0, 1) .
More specifically, h_n(λ) acts as an upper bound of the left-hand side of (18) and we recover exactly the left-hand side of (18) in the particular case η_n = 1 and κ_n = 0. Now that we have established the mixture weights updates (21) in Theorem 3, we are interested in deriving update formulas for the sequence (Θ_n)_{n≥1} satisfying (19), which we will then pair up with (21) in order to apply Theorem 2. From this point onwards, all the proofs of the coming results will be deferred to the appendices to ease the reading.

4.2 Choice of (Θ_n)_{n≥1}

We investigate two different approaches for choosing (Θ_n)_{n≥1}.

4.2.1 A Maximisation Approach

As done in Corollary 1, an idea is to consider an update for (Θ_n)_{n≥1} of the form

\[
\Theta_{n+1} = \operatorname*{argmax}_{\Theta \in \mathsf{T}^J} g_n(\Theta), \qquad n \geq 1,
\]

where the function g_n is constructed as a lower bound on T^J of the function Θ ↦ ∫_Y Σ_{j=1}^J λ_{j,n} φ^{(α)}_{j,n}(y) log(k(θ_j, y)/k(θ_{j,n}, y)) ν(dy) that satisfies g_n(Θ_n) = 0. In doing so, the function Θ ↦ g_n(Θ)/(α − 1) evaluated at Θ_{n+1} is an upper bound of the left-hand side of (19) and g_n(Θ_{n+1}) ≥ 0 implies (19). This leads us to Corollary 4 below.

Corollary 4 (Generalised maximisation approach) Assume (A1). Let α ∈ [0, 1), let (η_n)_{n≥1} be valued in (0, 1] and let (κ_n)_{n≥1} be such that (α − 1)κ_n ≥ 0 at all times n ≥ 1. Furthermore, for all j = 1 … J, let (b_{j,n})_{n≥1} be a non-negative sequence. Starting from an initial parameter set (λ_1, Θ_1) ∈ S_J^+ × T^J, let (λ_n, Θ_n)_{n≥1} be defined iteratively for all n ≥ 1 in such a way that (21) holds and

\[
\theta_{j,n+1} = \operatorname*{argmax}_{\theta \in \mathsf{T}} \int_{\mathsf{Y}} \left[\varphi_{j,n}^{(\alpha)}(y) + b_{j,n}\, k(\theta_{j,n}, y)\right] \log \frac{k(\theta, y)}{k(\theta_{j,n}, y)}\,\nu(\mathrm{d}y), \qquad j = 1 \ldots J, \tag{25}
\]

where we assume that this argmax is uniquely defined at each step. Then, we can apply Theorem 2.

The proof of this result is deferred to Appendix B.1. Under the assumptions of Corollary 4, Algorithm 1 leads to a systematic decrease in Ψ_α at each step.
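In practice the J integrals appearing in the weight update (21) are first approximated (for instance by Monte Carlo), after which the update is a simple reweight-and-renormalise step. The following minimal sketch illustrates this; the integral estimates `phi_int` and all numerical values are hypothetical placeholders, not quantities from the paper.

```python
import numpy as np

def update_weights(lam, phi_int, alpha, eta, kappa=0.0):
    """One mixture-weights step as in (21): multiply each weight by
    [integral estimate + (alpha - 1) * kappa] ** eta, then renormalise."""
    w = lam * (phi_int + (alpha - 1.0) * kappa) ** eta
    return w / w.sum()

# toy run with J = 3 components and made-up integral estimates
lam = np.array([0.2, 0.3, 0.5])
phi_int = np.array([1.5, 0.8, 1.1])
lam_next = update_weights(lam, phi_int, alpha=0.5, eta=1.0)
```

Components whose integral estimate is large relative to the others see their weight increase, which matches the reweighting interpretation of (21).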
This result effectively generalises the monotonicity property from Corollary 1 to the case of mixture models and we can deduce simple iterative schemes satisfying (25) for well-chosen kernel densities k. To illustrate this, we provide below the extension of Example 1 to Gaussian Mixture Models (and Lemma 1 can be extended in the same way, see Appendix B.2).

Algorithm 1: Maximisation approach algorithm for mixture models
At iteration n, for all j = 1 … J, set

\[
\lambda_{j,n+1} = \frac{\lambda_{j,n}\left[\int_{\mathsf{Y}} \varphi_{j,n}^{(\alpha)}(y)\,\nu(\mathrm{d}y) + (\alpha-1)\kappa_n\right]^{\eta_n}}{\sum_{\ell=1}^{J}\lambda_{\ell,n}\left[\int_{\mathsf{Y}} \varphi_{\ell,n}^{(\alpha)}(y)\,\nu(\mathrm{d}y) + (\alpha-1)\kappa_n\right]^{\eta_n}}
\]
\[
\theta_{j,n+1} = \operatorname*{argmax}_{\theta \in \mathsf{T}} \int_{\mathsf{Y}} \left[\varphi_{j,n}^{(\alpha)}(y) + b_{j,n}\, k(\theta_{j,n}, y)\right] \log \frac{k(\theta, y)}{k(\theta_{j,n}, y)}\,\nu(\mathrm{d}y)
\]

Example 3 (Maximisation approach for Gaussian Mixture Models) We consider the case of d-dimensional Gaussian mixture densities, in which case Y = R^d, ν is the d-dimensional Lebesgue measure and k(θ_j, y) = N(y; m_j, Σ_j), where θ_j = (m_j, Σ_j) ∈ T denotes the mean and covariance matrix of the j-th Gaussian component density. We let the parameter set T include all possible means in R^d and all possible positive-definite covariance matrices. Finally, we assume that the non-negative function p defined on Y satisfies (8). Then (A1) holds. Moreover, starting from any (λ_1, Θ_1) ∈ S_J^+ × T^J such that Ψ_α(µ_1 k) < ∞ and denoting θ_{j,n} = (m_{j,n}, Σ_{j,n}) for all j = 1 … J and all n ≥ 1, it holds that: for all j = 1 … J, all n ≥ 1 and all γ_{j,n} ∈ (0, 1], there exists b_{j,n} ≥ 0 such that the argmax problem (25) has a unique solution defined by:

\[
m_{j,n+1} = \gamma_{j,n}\, \hat m_{j,n} + (1-\gamma_{j,n})\, m_{j,n}
\]
\[
\Sigma_{j,n+1} = \gamma_{j,n}\, \hat\Sigma_{j,n} + (1-\gamma_{j,n})\, \Sigma_{j,n} + \gamma_{j,n}(1-\gamma_{j,n})\,(\hat m_{j,n} - m_{j,n})(\hat m_{j,n} - m_{j,n})^T,
\]

where, using the definition of \check\varphi^{(\alpha)}_{j,n} in (16), we set

\[
\hat m_{j,n} = \int_{\mathsf{Y}} y\, \check\varphi_{j,n}^{(\alpha)}(y)\,\nu(\mathrm{d}y), \qquad \hat\Sigma_{j,n} = \int_{\mathsf{Y}} y y^T\, \check\varphi_{j,n}^{(\alpha)}(y)\,\nu(\mathrm{d}y) - \hat m_{j,n} \hat m_{j,n}^T.
\]

The detailed derivation of the mean and covariance updates above is deferred to Section 6.2.
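To make the form of the updates in Example 3 concrete, here is a small sketch in which the weighting density of (16) is represented by a set of weighted samples, a hypothetical stand-in for the intractable integrals defining the hatted quantities; the sample values and weights below are invented for illustration.

```python
import numpy as np

def gaussian_max_step(m, Sigma, ys, w, gamma):
    """One maximisation-approach update as in Example 3. The weighting
    density is represented by samples `ys` with normalised weights `w`;
    `gamma` in (0, 1] interpolates between the current parameters and the
    weighted mean/covariance."""
    m_hat = w @ ys                            # estimate of the weighted mean
    diff = ys - m_hat
    Sigma_hat = (w[:, None] * diff).T @ diff  # estimate of the weighted covariance
    d = m_hat - m
    m_new = gamma * m_hat + (1.0 - gamma) * m
    Sigma_new = (gamma * Sigma_hat + (1.0 - gamma) * Sigma
                 + gamma * (1.0 - gamma) * np.outer(d, d))
    return m_new, Sigma_new

# toy demo in R^2 with four equally weighted points
ys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.full(4, 0.25)
m_new, Sigma_new = gaussian_max_step(np.zeros(2), np.eye(2), ys, w, gamma=1.0)
```

With gamma = 1 the step moves fully to the weighted mean and covariance, while gamma = 0 leaves the current parameters unchanged, mirroring the trade-off role of the regularisation parameter.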
The interpretation of these updates is akin to the one already made in Example 1: b_{j,n} acts as a regularisation parameter which, through γ_{j,n}, permits a trade-off between an update close to the current parameter θ_{j,n} and choosing the Gaussian density with exactly the same mean and covariance matrix as \check\varphi^{(\alpha)}_{j,n}. As stated in Corollary 4, these updates are compatible with the mixture weights updates (21), resulting in a systematic decrease of (Ψ_α(µ_n k))_{n≥1}. We next present another possible update formula for (λ_n, Θ_n)_{n≥1}.

4.2.2 A Gradient-based Approach

In the spirit of Corollary 2, we now resort to Gradient Descent steps to satisfy (19).

Algorithm 2: Gradient-based approach algorithm for mixture models
At iteration n, for all j = 1 … J, set

\[
\lambda_{j,n+1} = \frac{\lambda_{j,n}\left[\int_{\mathsf{Y}} \varphi_{j,n}^{(\alpha)}(y)\,\nu(\mathrm{d}y) + (\alpha-1)\kappa_n\right]^{\eta_n}}{\sum_{\ell=1}^{J}\lambda_{\ell,n}\left[\int_{\mathsf{Y}} \varphi_{\ell,n}^{(\alpha)}(y)\,\nu(\mathrm{d}y) + (\alpha-1)\kappa_n\right]^{\eta_n}}
\]
\[
\theta_{j,n+1} = \theta_{j,n} - \frac{\gamma_{j,n}}{\beta_{j,n}}\, \nabla g_{j,n}(\theta_{j,n}).
\]

Corollary 5 (Generalised gradient-based approach) Assume (A1). Let T ⊂ R^d be a convex set, let α ∈ [0, 1), let (η_n)_{n≥1} be valued in (0, 1] and let (κ_n)_{n≥1} be such that (α − 1)κ_n ≥ 0 at all times n. Furthermore, for all j = 1 … J, let (γ_{j,n})_{n≥1} be valued in (0, 1]. Starting from an initial parameter set (λ_1, Θ_1) ∈ S_J^+ × T^J, let (λ_n, Θ_n)_{n≥1} be defined iteratively for all n ≥ 1 in such a way that (21) holds and

\[
\theta_{j,n+1} = \theta_{j,n} - \frac{\gamma_{j,n}}{\beta_{j,n}}\, \nabla g_{j,n}(\theta_{j,n}), \qquad j = 1 \ldots J, \tag{26}
\]

where for all j = 1 … J, (g_{j,n})_{n≥1} is defined by: for all n ≥ 1 and all θ ∈ T,

\[
g_{j,n}(\theta) = \int_{\mathsf{Y}} \frac{\varphi_{j,n}^{(\alpha)}(y)}{\alpha - 1} \log \frac{k(\theta, y)}{k(\theta_{j,n}, y)}\,\nu(\mathrm{d}y)
\]

and g_{j,n} is assumed to be β_{j,n}-smooth. Then, we can apply Theorem 2.

The proof of this result is deferred to Appendix B.3. Under the assumptions of Corollary 5, Algorithm 2 ensures a systematic decrease in Ψ_α at each step and Corollary 5 thus extends Corollary 2 to mixture models. Much like what we did for Corollary 2, we want to identify how our updates relate to Gradient Descent-based techniques for optimising Θ.
Under common differentiability assumptions, we have: for all n 1, all j = 1 . . . J and all θ T, gj,n(θ ) = Z ϕ(α) j,n(y) α 1 log k(θ, y) (θ,y)=(θ ,y) ν(dy) , so that (26) becomes θj,n+1 = θj,n γj,n ϕ(α) j,n(y) α 1 log k(θ, y) (θ,y)=(θj,n,y) ν(dy) , j = 1 . . . J . (27) The link with Gradient Descent-based Variational Inference shall become apparent by (i) writing the update formulas that ensue from Gradient Descent iterations for α-divergence and R enyi s α-divergence minimisation and (ii) understanding how Example 2 generalises to Gaussian Mixture Models. Given (λn, Θn) S+ J TJ, an index j in 1 . . . J and letting Daudel, Douc and Roueff p = p( , D), performing one Gradient Descent iteration w.r.t. θj,n for α-divergence and R enyi s α-divergence minimisation indeed respectively amounts to updating θj,n+1 as follows θj,n+1 = θj,n rj,nλj,n ϕ(α) j,n(y) α 1 log k(θ, y) (θ,y)=(θj,n,y) ν(dy) , θj,n+1 = θj,n rj,n λj,n PJ j=1 λj,n R Y ϕ(α) j,n(y)ν(dy) ϕ(α) j,n(y) α 1 log k(θ, y) (θ,y)=(θj,n,y) ν(dy) where rj,n > 0 is the learning rate (we refer to Appendix B.4 for details regarding these updates). There is thus a similarity between (27) and the Gradient Descent updates above. To fully comprehend the connection between our gradient-based approach and Gradient Descent steps, we present below the generalisation of Example 2 to Gaussian Mixture Models. The smoothness assumption on gj,n will be satisfied in that example and we will recover a Gradient Descent scheme for R enyi s α-divergence minimisation as a special case. Example 4 (Gradient-based approach for Gaussian Mixture Models) We consider the case of d-dimensional Gaussian mixture densities with k(θj, y) = N(y; mj, Σj), where this time θj = mj. Further assume that T Rd is a convex subset and that Σj = σ2 j Id, where σ2 j > 0 and Id denotes the d-dimensional identity matrix. Finally, we assume that the non-negative function p defined on Y satisfies (13). Then, (A1) holds. 
Setting β_{j,n} = σ_j^{−2}(1 − α)^{−1} ∫_Y φ^{(α)}_{j,n}(y) ν(dy) and denoting θ_{j,n} = m_{j,n}, the function g_{j,n} is β_{j,n}-smooth for all j = 1 … J and all n ≥ 1, so that the means of the Gaussian densities (N(y; m_j, Σ_j))_{1≤j≤J} can be optimised as follows: for all n ≥ 1,

\[
m_{j,n+1} = \gamma_{j,n}\, \hat m_{j,n} + (1 - \gamma_{j,n})\, m_{j,n}, \qquad j = 1 \ldots J,
\]

where \check\varphi^{(\alpha)}_{j,n} is defined in (16), γ_{j,n} ∈ (0, 1] and \hat m_{j,n} = ∫_Y y \check\varphi^{(\alpha)}_{j,n}(y) ν(dy) (we refer to Corollary 8 for details). In particular, letting (γ'_{j,n})_{n≥1} be valued in (0, 1] for all j = 1 … J and setting

\[
\gamma_{j,n} = \gamma'_{j,n}\, \frac{\lambda_{j,n} \int_{\mathsf{Y}} \varphi_{j,n}^{(\alpha)}(y)\,\nu(\mathrm{d}y)}{\int_{\mathsf{Y}} \mu_n k(y)^{\alpha}\, p(y)^{1-\alpha}\,\nu(\mathrm{d}y)} \in (0, 1], \qquad j = 1 \ldots J,\; n \geq 1,
\]

(since ∫_Y µ_n k(y)^α p(y)^{1−α} ν(dy) = Σ_{j=1}^J λ_{j,n} ∫_Y φ^{(α)}_{j,n}(y) ν(dy)), the resulting iterative algorithm is given by the following update at time n ≥ 1:

\[
m_{j,n+1} = m_{j,n} + \gamma'_{j,n}\, \frac{\int_{\mathsf{Y}} \lambda_{j,n}\, \varphi_{j,n}^{(\alpha)}(y)\,(y - m_{j,n})\,\nu(\mathrm{d}y)}{\int_{\mathsf{Y}} \mu_n k(y)^{\alpha}\, p(y)^{1-\alpha}\,\nu(\mathrm{d}y)}, \qquad j = 1 \ldots J. \tag{28}
\]

Interestingly, (28) can be seen as a Gradient Descent step w.r.t. θ_{j,n} for Rényi's α-divergence minimisation with a learning rate r_{j,n} = σ_j^2 (1 − α) γ'_{j,n}. Hence, if we were to rely solely on the Gradient Descent literature, the convergence of the iterative sequence (θ_{j,n})_{n≥1} defined by (28) would require the sequence (λ_n)_{n≥1} to be constant. This is in contrast with Corollary 5, which allows for a simultaneous optimisation of λ and Θ according to (21) and (28). We now add to the comments made in Section 3 for the gradient-based methodology (which we built in Corollary 2 and have since extended to mixture models in Corollary 5): A core insight from Corollary 5, which is exemplified in Example 4, is that under a smoothness assumption on g_{j,n} our mixture weights iterative updates are compatible with gradient-based updates, themselves linked to the Gradient Descent literature.
In other words, we have embedded Gradient Descent-based iterative updates, which only act on Θ, within a larger framework where simultaneous updates for the mixture components parameters and the mixture weights are well-supported theoretically. Putting things into perspective with the maximisation approach once again, notice that our previous conclusions from Section 3 still hold. Namely: (i) the updates on the means from Examples 3 and 4 coincide, meaning that contrary to the gradient-based approach, the maximisation approach enables covariance matrix updates on top of mean updates; (ii) the maximisation approach permits us to bypass the smoothness assumption made in Corollary 5. As a whole, these properties make the maximisation approach a compelling alternative to the gradient-based approach. We presented two approaches to construct iterative schemes (λ_n, Θ_n)_{n≥1} that ensure a monotonic decrease in Ψ_α at each step and lead to simple update formulas for Gaussian Mixture Models. We now describe how our framework can be linked to the existing literature.

5. Related Work

In this section, we detail how our work relates to and improves on previous algorithms proposed for α-/Rényi's α-divergence minimisation.

5.1 Rényi Divergence Variational Inference (Li and Turner, 2016)

Li and Turner (2016) seek to maximise the Variational Rényi (VR) Bound via (Stochastic) Gradient Ascent. This objective function is derived from Rényi's α-divergence and is given by: for all variational densities q ∈ Q and all α ∈ R \ {1},

\[
\mathcal{L}_\alpha(q, \mathcal{D}) := \frac{1}{1-\alpha} \log \int_{\mathsf{Y}} q(y)^{\alpha}\, p(y, \mathcal{D})^{1-\alpha}\,\nu(\mathrm{d}y). \tag{29}
\]

Contrary to us, the work of Li and Turner (2016) does not consider the case where q belongs to the mixture models variational family (15), that is q = µ_{λ,Θ} k.
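In practice the VR bound (29) is estimated by Monte Carlo using samples from q. A minimal sketch with a log-sum-exp trick for numerical stability follows; the densities used are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def vr_bound(log_q, log_p, ys, alpha):
    """Monte Carlo estimate of the VR bound (29) from samples ys ~ q:
    (1/(1-alpha)) * log mean_i (p(y_i, D) / q(y_i)) ** (1 - alpha),
    computed with a log-sum-exp trick for numerical stability."""
    lw = (1.0 - alpha) * (log_p(ys) - log_q(ys))
    m = lw.max()
    return (m + np.log(np.mean(np.exp(lw - m)))) / (1.0 - alpha)

# sanity check: if q equals the normalised target, the bound is 0 for any alpha != 1
log_std_normal = lambda y: -0.5 * y ** 2 - 0.5 * np.log(2.0 * np.pi)
ys = np.random.default_rng(0).standard_normal(1000)
val = vr_bound(log_std_normal, log_std_normal, ys, alpha=0.5)
```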
Yet, a parallel can be drawn in the GMM case between Li and Turner (2016) and our approach by observing that the gradient-based updates on the means in Example 4 each coincide with a Gradient Ascent step on the objective function (29) for a well-chosen learning rate (since the gradient of Lα is proportional to the gradient of R enyi s α-divergence with a factor α 1, this follows from the remarks made in Section 4.2.2 regarding Gradient Descent steps for R enyi s α-divergence minimisation in the GMM case). Our work hence provides a theoretical framework which enables simultaneous optimisation of the mixture weights λ and of the mixture components parameters Θ. In addition, beyond the gradient-based updates, we propose the novel maximisation updates and we allow for covariance matrices optimisation. We also emphasise that our maximisation approach will apply to well-chosen kernels k beyond the Gaussian case, as we will detail in Section 6. Daudel, Douc and Roueff 5.2 The Power Descent Algorithm (Daudel et al., 2021) In order to identify the connection between our work and the Power Descent algorithm introduced in Daudel et al. (2021), let us briefly present the latter. The Power Descent is a gradient-based algorithm that operates on probability measures and performs α-divergence minimisation for all α R \ {1}. More precisely, equipping T with a σ-field T and denoting by M1(T) the space of probability measures on (T, T ), the Power Descent optimises Ψα(µk) with respect to µ M1(T), where µk(y) = R T µ(dθ)k(θ, y) for all µ M1(T) and all y Y. Given an initial probability measure µ1 M1(T), it does so by performing several one-step transitions of the Power Descent algorithm: µn+1 = Iα(µn) , n 1 , (30) where, for all µ M1(T) and all θ T, bµ,α(θ) = Z Yk(θ, y) 1 α 1 Iα(µ)(dθ) = µ(dθ) [(α 1)(bµ,α(θ) + κ) + 1] η 1 α µ([(α 1)(bµ,α + κ) + 1] η 1 α ) . Daudel et al. 
(2021) motivated the Power Descent algorithm by establishing a monotonicity result for this algorithm, obtained as a particular case of (Daudel et al., 2021, Theorem 1). In their result, the monotonic decrease in Ψ_α of the scheme (30) holds for all µ ∈ M_1(T), all α ∈ R \ {1}, all η ∈ (0, 1] and all κ such that (α − 1)κ ≥ 0. We provide below a more general version of their result, where the monotonic decrease in Ψ_α holds for well-chosen values of η that are larger than 1 when α < 0 (we refer to Appendix B.5 for the proof of this result).

Proposition 2 Assume that p and k are as in (A1). Let (α, η) belong to any of the following cases: (i) α ≤ −1 and η ∈ (0, (α − 1)/α]; (ii) α ∈ (−1, 0) and η ∈ (0, 1 − α]; (iii) α ∈ [0, 1) or α > 1 and η ∈ (0, 1]. Moreover, let µ ∈ M_1(T) be such that Ψ_α(µk) < ∞ and let κ be such that (α − 1)κ ≥ 0. Then, the two following assertions hold. (i) We have Ψ_α(I_α(µ)k) ≤ Ψ_α(µk). (ii) We have Ψ_α(I_α(µ)k) = Ψ_α(µk) if and only if µ = I_α(µ).

Building on the monotonicity result provided by (Daudel et al., 2021, Theorem 1) for the Power Descent algorithm, which we generalised in Proposition 2, Daudel et al. (2021) then applied this algorithm to mixture weights optimisation by letting the initial probability measure µ_1 ∈ M_1(T) be a weighted sum of Dirac measures of the form µ_1 = Σ_{j=1}^J λ_{j,1} δ_{θ_j}, with Θ ∈ T^J and λ_1 ∈ S_J. For that choice of µ_1, µ_n in (30) can be written as µ_n = Σ_{j=1}^J λ_{j,n} δ_{θ_j} at time n and the Power Descent amounts to performing the update

\[
\lambda_{j,n+1} = \frac{\lambda_{j,n}\left[(\alpha-1)\left(b_{\mu_n,\alpha}(\theta_j) + \kappa\right) + 1\right]^{\frac{\eta}{1-\alpha}}}{\sum_{\ell=1}^{J}\lambda_{\ell,n}\left[(\alpha-1)\left(b_{\mu_n,\alpha}(\theta_\ell) + \kappa\right) + 1\right]^{\frac{\eta}{1-\alpha}}}, \qquad j = 1 \ldots J. \tag{31}
\]

Interestingly, the update (31) corresponds to the update on the mixture weights (21) we identified in Theorem 3 for Θ_n = Θ, η_n = η/(1 − α) and κ_n = κ. Stemming from this link between our approach and Daudel et al. (2021), we now make two important comments.

Benefits of our approach compared to Daudel et al. (2021). The monotonicity result from Daudel et al.
(2021) generalised in Proposition 2 requires the sequence (Θn)n 1 to be constant when applied to mixture weights optimisation in (31). This restricts the variational family to mixture models with fixed mixture components parameters. To remedy this problem, Daudel et al. (2021) proposed a fully-adaptive algorithm that alternates between an Exploitation step optimising the mixture weights according to (31) and an Exploration step acting on the mixture components parameters. However, they established no theoretical guarantees for their complete Exploitation-Exploration algorithm, as the choice of the Exploration step remained mostly unexplored. A strong improvement of our approach is then that we provide theoretically-sound updates for (λn, Θn)n 1, with the particularity that our mixture weights updates relate to the framework of Daudel et al. (2021). In that sense, our approach supplements the work done in Daudel et al. (2021) (albeit by an entirely different proof technique). Furthermore, we do not need to alternate between mixture weights and mixture components parameters updates in our algorithms, as both can be carried out simultaneously (as done in Algorithms 1 and 2). In practice, this will permit us to reduce the computational cost (as the samples will be shared throughout the mixture weights and the mixture components parameters approximated updates, see Section 7). A gradient-based mixture weights update. Daudel et al. (2021) establishes that the Power Descent belongs to a family of gradient-based algorithms which includes the Entropic Mirror Descent algorithm, a typical optimisation algorithm for optimisation under simplex constraints, as a special case. Viewed from this angle, the parameter η in (31) can be understood as a learning rate with bµn,α playing the role of the gradient of Ψα. 
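For illustration, one Power Descent exploitation step of the form (31) can be written as follows, given precomputed values b_j = b_{µ_n,α}(θ_j); the numbers below are hypothetical and the exponent follows the rewriting of (31) above.

```python
import numpy as np

def power_descent_step(lam, b, alpha, eta, kappa=0.0):
    """One Power Descent update on the mixture weights, as in (31):
    lam_j <- lam_j * [(alpha - 1)(b_j + kappa) + 1] ** (eta / (1 - alpha)),
    followed by a renormalisation over the J components."""
    w = lam * ((alpha - 1.0) * (b + kappa) + 1.0) ** (eta / (1.0 - alpha))
    return w / w.sum()

# made-up values of b_{mu_n, alpha}(theta_j) for J = 2 components
lam = np.array([0.5, 0.5])
b = np.array([-1.0, 0.0])
lam_next = power_descent_step(lam, b, alpha=0.5, eta=0.5)
```

Under the gradient-based reading recalled above, decreasing eta slows the reweighting, exactly as a smaller learning rate would in Mirror Descent.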
Connecting the Power Descent to our framework thus sheds light on the gradient-based nature of the mixture weights update (21) from Theorem 3 and provides a better understanding of the role of the parameter η_n appearing in this update. This aspect will be helpful to interpret our numerical experiments in Section 8.

5.3 The M-PMC Algorithm (Cappé et al., 2008)

For any measurable positive function p on (Y, 𝒴), the M-PMC algorithm (Cappé et al., 2008) aims at solving the optimisation problem

\[
\sup_{(\lambda, \Theta) \in \mathcal{S}_J \times \mathsf{T}^J} \int_{\mathsf{Y}} \log\left(\sum_{j=1}^{J} \lambda_j k(\theta_j, y)\right) p(y)\,\nu(\mathrm{d}y),
\]

or equivalently, at minimising the inclusive Kullback-Leibler divergence D_0(µ_{λ,Θ}k || P) w.r.t. (λ, Θ), where P(A) = ∫_A p dν / ∫_Y p dν for all A ∈ 𝒴. This is done in (Cappé et al., 2008, Section 2) by introducing the following iterative updates: for all j = 1 … J and all n ≥ 1,

\[
\lambda_{j,n+1} = \int_{\mathsf{Y}} \frac{\lambda_{j,n}\, k(\theta_{j,n}, y)}{\sum_{\ell=1}^{J} \lambda_{\ell,n}\, k(\theta_{\ell,n}, y)}\, \frac{p(y)}{\int_{\mathsf{Y}} p\,\mathrm{d}\nu}\,\nu(\mathrm{d}y)
\]
\[
\theta_{j,n+1} = \operatorname*{argmax}_{\theta_j \in \mathsf{T}} \int_{\mathsf{Y}} \frac{\lambda_{j,n}\, k(\theta_{j,n}, y)}{\sum_{\ell=1}^{J} \lambda_{\ell,n}\, k(\theta_{\ell,n}, y)}\, \log(k(\theta_j, y))\, p(y)\,\nu(\mathrm{d}y).
\]

Cappé et al. (2008) motivated the two updates above by noticing that they can be seen as integrated versions, under the target distribution, of the update formulas for the Expectation Maximisation (EM) algorithm applied to the mixture density parameter estimation problem

\[
\sup_{(\lambda, \Theta) \in \mathcal{S}_J \times \mathsf{T}^J} \sum_{m} \log\left(\sum_{j=1}^{J} \lambda_j\, k(\theta_j, Y_m)\right),
\]

meaning that these updates ensure a systematic increase in the integrated likelihood at each step. As these updates also correspond to the case α = 0, η_n = 1, κ_n = 0, a_{j,n} = 1 and b_{j,n} = 0 in Algorithm 1, the M-PMC algorithm is in fact included in our framework and we can interpret our theoretical results in light of this algorithm. More precisely, we have generalised an integrated EM algorithm by preserving its monotonicity property for a wide range of hyperparameters. A particularly striking fact is that the monotonicity property holds for all α ∈ [0, 1), hence updates akin to an EM procedure can be derived past the traditional case of likelihood optimisation.
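In practice, the integrals under the target in the M-PMC updates can be approximated by self-normalised importance sampling with draws from the current mixture. The following one-dimensional sketch illustrates this approximation for Gaussian components; the target, sample sizes and all numerical values are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def npdf(y, m, s2):
    return np.exp(-(y - m) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

def m_pmc_step(lam, means, s2s, ys, p_vals):
    """One sample-approximated M-PMC step for 1-D Gaussian components:
    `ys` are draws from the current mixture and `p_vals` the unnormalised
    target evaluated at them, so integrals under the target become
    self-normalised importance-sampling averages."""
    dens = np.stack([npdf(ys, m, s2) for m, s2 in zip(means, s2s)])  # (J, N)
    mix = lam @ dens                  # current mixture density at the samples
    iw = p_vals / mix                 # importance weights p / (mu_n k)
    iw = iw / iw.sum()                # self-normalise (normaliser of p unknown)
    resp = lam[:, None] * dens / mix  # responsibilities of each component
    lam_new = resp @ iw               # integrated E-step: new mixture weights
    means_new = (resp * iw) @ ys / lam_new
    return lam_new, means_new

rng = np.random.default_rng(1)
lam = np.array([0.5, 0.5])
means, s2s = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
comp = rng.choice(2, size=2000, p=lam)
ys = rng.standard_normal(2000) + means[comp]
p_vals = np.exp(-ys ** 2 / 2.0)       # unnormalised N(0, 1) target
lam_new, means_new = m_pmc_step(lam, means, s2s, ys, p_vals)
```

Since the responsibilities sum to one at each sample, the new weights automatically sum to one, and both component means are pulled toward the mode of the target.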
As we shall see, the additional layers of flexibility obtained in Algorithm 1 will also have important practical consequences due to the underlying gradient-based structure behind the mixture weights (and the mixture components parameters updates in the GMMs case) we have uncovered. Table 1 summarises the main improvements of our framework compared to the existing literature and in the coming section, we revisit the maximisation and the gradient-based approaches when the variational family is based on the exponential family. 6. Exponential Family Distributions: a Closer Look In this section, we state generic results for the maximisation and the gradient-based approaches in the important case where the variational family is based on the exponential family. Those results will in particular enable us to show how the update formulas in Examples 3 and 4 are derived, before allowing us to incorporate novel variational families in our framework such as multivariate Student s t densities. To do so, let us first review the exponential family and some of its well-known properties. 6.1 Notation and Useful Definitions We start with the definition of an exponential family distribution. Monotonic Alpha-divergence Minimisation for Variational Inference Improvements of our framework R enyi Gradient Descent Simultaneous optimisation w.r.t. (λ, Θ) Li and Turner (2016) (prev. mixture weights λ optimisation not considered) For GMMs : maximisation approach encompasses R enyi Gradient Descent and provides covariance matrices updates Power Descent Simultaneous optimisation w.r.t. (λ, Θ) Daudel et al. (2021) (prev. (Θn)n 1 constant) M-PMC algorithm Extension of an Integrated EM algorithm to: Capp e et al. (2008) α [0, 1), ηn (0, 1], (α 1)κn 0 and bj,n 0 (prev. α = 0, ηn = 1, κn = 0 and bj,n = 0) Table 1: Key improvements of our framework for mixture models optimisation compared to related works (prev. stands for previously in the literature). 
Definition 1 (Exponential family) Let ν be a σ-finite measure on (Y, Y), E be a Euclidean space endowed with an inner product , E and its corresponding norm E, h be a non-negative function defined on Y and S : Y E be a measurable function. In its canonical form, a member of the exponential family is defined by the parametric probability density k(o)(ζ, y) = h(y) exp ( ζ, S(y) E A(ζ)) , y Y , ζ E0 where E0 := ζ E : Z Y h(y) e ζ,S(y) E ν(dy) < and where the values taken by A on the subset E0 are obtained from the normalising constraint R k(o)(ζ, ) dν = 1. In this setting, ν is called the dominating measure and S the natural statistic. In what follows, Int E0 denotes the interior of E0. Recall now that S defines an exhaustive statistic for this model, that E0 is a convex subset of E and that A is infinitely differentiable in Int E0. Moreover, for all ζ Int E0, the expectation and covariance operator of S under the density k(o)(ζ, ) are the gradient A and the Hessian T A of A taken at ζ, respectively. For later reference, we also recall Y S(y) k(o)(ζ, y) ν(dy) (32) and we introduce the following assumption. (B1) The kernel density k(o) defined on E0 Y is a member of the exponential family as in Definition 1. Moreover, we assume that: (a) The function h on Y is positive. (b) There is no affine hyperplane of E to which S(y) belongs for ν-almost all y Y. Notice that the assumption (B1)(a) can be circumvented by changing the dominating measure ν and therefore the assumptions (B1)(a) and (B1)(b) should be considered together. These Daudel, Douc and Roueff assumptions notably imply that the covariance operator of S under the density k(o)(ζ, ) is positive definite and thus so is T A(ζ) for all ζ Int E0. In addition, one often relies on a non-canonical parameterisation (see for example Example 1), leading to the kernel density k(θ, y) = k(o)(υ(θ), y) = h(y) exp ( υ(θ), S(y) E A υ(θ)) , θ T , y Y (33) where υ maps the parameter space T to a subset of Int E0. 
In that case, the assumption (B1)(a) will ensure that the assumption made on k in (A1) holds. 6.2 The Maximisation Approach for the Exponential Family Recall that the maximisation approach proposed in Corollary 1 aims at solving the optimisation problem (7) given by θn+1 = argmax θ T h ϕ(α) n (y) + bnk(θn, y) i log k(θ, y) ν(dy) , n 1 . Furthermore, in the more general case of mixture models described in Corollary 4, the maximisation approach leads to the argmax problem (25). This second argmax problem is similar to (7) in the sense that it updates Θn = (θj,n)1 j J by solving J component-wise argmax problems of the same form as (7) (with θn+1, ϕ(α) n , bn and θn being replaced by θj,n+1, ϕ(α) j,n, bj,n and θj,n, for j = 1 . . . J): θj,n+1 = argmax θ T h ϕ(α) j,n(y) + bj,nk(θj,n, y) i log k(θ, y) k(θj,n, y) ν(dy) , j = 1 . . . J , n 1 . Our goal is now to solve the argmax problems above for a member of the exponential family. To do so, we first state a theorem in which the kernel k is in its canonical form, that is v is the identity mapping in (33) and k(θ, ) = k(o)(ζ, ). Other parameterisations will then be built using this theorem and as a result we will deduce corollaries that include non-canonical parameterisations and that are applicable to the general case of mixture models. All the proofs of the results stated in Section 6.2 are deferred to Appendix C.2. Theorem 4 Let k(o) satisfy (B1) and O be an open subset of E0. Let ˇϕ be a probability density function with respect to the measure ν such that Z S E ˇϕ dν < . (34) Let ζ0 O and b 0. Then the argmax problem h ˇϕ(y) + b k(o)(ζ0, y) i log k(o)(ζ, y) k(o)(ζ0, y) admits at least a solution if and only if s := 1 1 + b Z S ˇϕ dν + b 1 + b A(ζ0) (36) belongs to the image set A(O), in which case (35) has a unique solution ζ defined by A(ζ ) = s . 
(37) Monotonic Alpha-divergence Minimisation for Variational Inference Theorem 4 shows the equivalence between the argmax problem (35) and the equation (37) under the canonical parameterisation. From there, using the expression of A given in (32) and considering the non-canonical exponential family probability density (33), we can interpret the equation (37) as an equality between two means of the statistic S computed under two different distributions. Setting ζ = υ(θ ) and ζ0 = υ(θ0), we indeed obtain Z Y S(y)k(θ , y)ν(dy) = Z Y S(y)ψ(y)ν(dy) (38) where ψ is the mixture ψ(y) = 1 1 + b ˇϕ(y) + b 1 + bk(θ0, y) , y Y . As exemplified in the following corollary, an adequate parameterisation ζ = υ(θ) may then lead to a simple solution of (38), which in turn provides a solution to the argmax problem Y [ ˇϕ(y) + bk(θ0, y)] log k(θ, y) ν(dy) . (39) Corollary 6 (Gaussian density) Let d 1 and Y = Rd. Let M>0(d) denote the set of symmetric positive definite d d matrices. We consider the case of a d-dimensional Gaussian density with k(θ, y) = N(y; m, Σ), where θ = (m, Σ) T = Rd M>0(d). Let ˇϕ be a probability density function on Y such that (34) holds, that is in the Gaussian case Z Y y 2 ˇϕ(y) dy < . Let θ0 = (m0, Σ0) T and b 0. If b = 0, assume moreover that ˇϕ is non-degenerate in the sense that its covariance matrix is positive definite. Then (39) has a unique solution θ = (m , Σ ) T defined by m = E[Y ] and Σ = Cov(Y ), where Y is a random variable valued in Rd with density ψ(y) = 1 1 + b ˇϕ(y) + b 1 + b N(y; m0, Σ0) , y Y . It now remains to choose ˇϕ adequately to relate (39) to (7), or rather to its generalisation (25). That way, we will in particular get that the update formulas given in Examples 1 and 3 are straightforward consequences of Corollary 6. This is the purpose of Corollary 7 below, in which we also rewrite the assumption (34) made on ˇϕ under a more convenient form. Corollary 7 Let k(o) satisfy (B1). 
Further assume that the kernel k is of the form (33) with a one-to-one mapping υ defined on T such that its image υ(T) is an open subset of E0. Let α [0, 1) and let p : Y R+ be such that 1 + S(y) 1/(1 α) E p(y) ν(dy) < (40) holds. Let J N, n 1 and (λn, Θn) S+ J TJ. Then, for all j = 1 . . . J, (34) holds with ˇϕ = ˇϕ(α) j,n as defined in (16) and setting γj,n = R ϕ(α) j,ndν/( R ϕ(α) j,ndν + bj,n), we have that h ϕ(α) j,n(y) + bj,nk(θj,n, y) i log k(θ, y) k(θj,n, y) Daudel, Douc and Roueff admits at least a solution if and only if s j := γj,n Z S ˇϕ(α) j,n dν + (1 γj,n) A υ(θj,n) (42) belongs to the image set A υ(T), in which case (41) has a unique solution θ j defined by A υ(θ j) = s j and we can set θj,n+1 = θ j for all j = 1 . . . J in (25) of Corollary 4. Since the argmax problem (41) is immune to change of variables thanks to its argmax form, the maximisation approach can be solved using the canonical parameter ζ in Theorem 4 and transported to the parameter θ via the one-to-one mapping v as written in Corollary 7. The updates from Examples 1 and 3 then follow from Corollary 7, paired up with Corollary 6 and with the fact that (40) simplifies to (8) (since in the Gaussian case we have set S(y) = (y, yy T /2) for all y Y). More generally, we can always solve the argmax problem (41), that is the argmax problem of Corollary 4 with k belonging to the exponential family, for bj,n > 0 large enough.1 We now move on to the study of the gradient-based approach for the exponential family. 6.3 The Gradient-based Approach for an Exponential Family Distribution Remember that to construct the sequence (θn)n 1 according to the gradient-based approach from Corollary 2 we need to compute the gradient of the function gn defined in (11) by Y ϕ(α) n (y) α 1 log k(θ, y) ν(dy) , θ T , and verify that gn satisfies a βn-smoothness condition, which leads to updates of the form θn+1 = θn γn βn gn(θn) , n 1 with γn (0, 1]. 
Furthermore, the extension to mixture models in Corollary 5 involves the gradient of J functions gj,n that resemble gn (gn, ϕ(α) n and θn are replaced by gj,n, ϕ(α) j,n and θj,n in the definition of gn above) and where each function gj,n is assumed to be βj,n-smooth. Let us now solve the gradient-based approach for a member of the exponential family. Like in the previous section, we start with a result handling the case where the kernel k is in its canonical form (that is, v is the identity mapping in (33) and k(θ, ) = k(o)(ζ, )). All the proofs of the results stated in Section 6.3 are deferred to Appendix C.3. Proposition 3 Let k(o) satisfy (B1). Let ˇϕ be a probability density function with respect to ν satisfying (34). Let ζ0 Int E0 and set g(o)(ζ) = Z Y ˇϕ(y) log k(o)(ζ, y) k(o)(ζ0, y) ν(dy) , ζ Int E0 . (43) 1. Under the assumptions of Corollary 7, υ(T) is an open subset of E0 and A is a C -diffeomorphism on Int E0. Hence, A υ(T) is an open subset of E. Since s j in (42) tends to A υ(θj,n) as bj,n , s j ends up belonging to A υ(T) for bj,n large enough. The desired result follows by using that the regularisation parameter bj,n > 0 in Corollary 4 can be freely chosen at each step n 1. Monotonic Alpha-divergence Minimisation for Variational Inference Then, for any convex subset C0 Int E0, the function g(o) is β0-smooth over C0 with g(o)(ζ) = A(ζ) Z S ˇϕ dν and β0 sup ζ C0 T A(ζ) op , (44) where op denotes the operator norm. Letting C0 be the line segment {ζ0 t β0 ( A(ζ0) R S ˇϕ dν) : t [0, 1]} in Proposition 3, we have that C0 stays in Int E0 for β0 large enough. Since the Hessian of A is continuous on Int E0, the inequality in (44) can also be satisfied. Hence, the following gradient step Y S ˇϕ dν (45) can always be performed for all γ (0, 1] and we can always apply the gradient-based approach when the kernel k is in its canonical form. 
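As a toy illustration of the canonical-parameter gradient step (45), consider the Poisson exponential family, for which A(ζ) = e^ζ and hence ∇A(ζ) = e^ζ; iterating the step drives ∇A(ζ) toward the mean of S under the weighting density. The target mean and the smoothness constant below are hypothetical numbers chosen for the example.

```python
import math

def canonical_grad_step(zeta0, s_bar, grad_A, beta0, gamma=1.0):
    """Gradient step (45) on the canonical parameter:
    zeta <- zeta0 - (gamma / beta0) * (grad A(zeta0) - int S phi_check dnu)."""
    return zeta0 - (gamma / beta0) * (grad_A(zeta0) - s_bar)

# Poisson family illustration: A(zeta) = exp(zeta), grad A(zeta) = exp(zeta);
# beta0 must dominate the Hessian exp(zeta) along the iterates (5.0 suffices here)
zeta, s_bar = 0.0, 3.0
for _ in range(200):
    zeta = canonical_grad_step(zeta, s_bar, math.exp, beta0=5.0)
# at the fixed point, grad A(zeta) = s_bar, i.e. zeta = log(3)
```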
When considering non-canonical exponential family probability densities of the form (33) and under some reasonable assumptions on the mapping υ (for instance, if υ is a C² diffeomorphism), one is left with choosing between two alternative ways of exploiting Proposition 3:

(i) Apply a gradient step on the canonical parameter ζ using the objective g^{(o)} in (43) and translate it into an update of the parameter θ via a change of variable. In this case, we set ζ_0 = υ(θ_0), we apply the step (45) and we then perform the change of variable θ = υ^{−1}(ζ) to get back to a parameter in T. Overall, this leads to an update of the form

\[
\theta = \upsilon^{-1}\!\left(\upsilon(\theta_0) - \frac{\gamma}{\beta_0}\left(\nabla A(\upsilon(\theta_0)) - \int_{\mathsf{Y}} S\,\check\varphi\,\mathrm{d}\nu\right)\right).
\]

(ii) Apply a gradient step directly on the parameter θ using the objective

\[
g(\theta) = \int_{\mathsf{Y}} \check\varphi(y) \log \frac{k(\theta, y)}{k(\theta_0, y)}\,\nu(\mathrm{d}y).
\]

In that case, since g = g^{(o)} ∘ υ where g^{(o)} is again as in (43) with ζ_0 = υ(θ_0), this leads to an update of the form

\[
\theta = \theta_0 - \frac{\gamma}{\beta}\, \nabla\upsilon(\theta_0)\left(\nabla A(\upsilon(\theta_0)) - \int_{\mathsf{Y}} S\,\check\varphi\,\mathrm{d}\nu\right).
\]

In both approaches, γ ∈ (0, 1] and the smoothness indices β_0 and β have to be taken large enough in order to guarantee a decrease of the objective function. In practice, it is not clear which parameterisation leads to the simpler or more efficient algorithm. We investigate a particular setting in the following result.

Corollary 8 (Gaussian density with known covariance matrix) Let d ≥ 1 and Y = R^d. We consider the exponential family of a d-dimensional Gaussian density with k(θ, y) = N(y; θ, Σ), where θ ∈ T := R^d and Σ is a (known) covariance matrix in M_{>0}(d). Let \check\varphi be a probability density function w.r.t. ν such that (34) holds, that is, in the Gaussian case,

\[
\int_{\mathsf{Y}} \lVert y \rVert\, \check\varphi(y)\,\mathrm{d}y < \infty.
\]

Set υ(θ) = Σ^{−1}θ and define the canonical kernel k^{(o)} with h(y) = (2π)^{−d/2} |Σ|^{−1/2} e^{−y^T Σ^{−1} y / 2}, S(y) = y and A(ζ) = ζ^T Σ ζ / 2. Then (33) holds and the methods (i) and (ii) lead to the following gradient-based updates, respectively:

\[
\theta = \theta_0 - \frac{\gamma}{\beta_0}\, \Sigma \left(\theta_0 - \int_{\mathsf{Y}} y\, \check\varphi(y)\,\mathrm{d}y\right), \tag{46}
\]
\[
\theta = \theta_0 - \frac{\gamma}{\beta}\, \Sigma^{-1} \left(\theta_0 - \int_{\mathsf{Y}} y\, \check\varphi(y)\,\mathrm{d}y\right), \tag{47}
\]

where β_0 is the largest eigenvalue of Σ and β is the largest eigenvalue of Σ^{−1}.
They correspond to the smoothness indices of $g^{(o)}$ and $g$ over $\mathbb{R}^d$, respectively. Corollary 8 illustrates how, even in a simple framework, the gradient-based updates strongly depend on the chosen parameterisation. They in fact coincide only if $\Sigma$ is scalar in the above corollary, in which case $\Sigma/\beta_0$ and $\Sigma^{-1}/\beta$ are both equal to the identity matrix.

Following the reasoning of Section 6.2, we can then apply Corollary 8 with $\check\varphi = \check\varphi^{(\alpha)}_n$ (and more generally with $\check\varphi = \check\varphi^{(\alpha)}_{j,n}$). As (34) is implied by (13) (using Lemma 8 in Appendix C.2 and that here $S(y) = y$), this enables us to deduce the updates in Examples 2 and 4. Note that we did not introduce a convex subset $\mathsf{C}_0$ in Corollary 8 (this is due to the fact that the smoothness index is constant over the whole parameter space for both gradient-based updates in that case), which is why $\mathsf{C}_0$ does not appear in Examples 2 and 4.

Let us next put into perspective Corollary 6, in which we considered a Gaussian density with varying mean vector and covariance matrix, and Corollary 8, where the covariance matrix is assumed to be known. While Corollary 8 kept the computations straightforward and provided gradient-based updates that do not require introducing a convex subset $\mathsf{C}_0$, this is no longer the case if we take the exact same setting as in Corollary 6. Indeed, when $\theta = (m, \Sigma) \in \mathsf{T} = \mathbb{R}^d \times \mathcal{M}_{>0}(d)$ and under the condition (34), the gradient of $g$ is given by
$$ \nabla g(\theta) = \Big( \Sigma^{-1}\Big( m - \int y \, \check\varphi(y) \, \mathrm{d}y \Big) \,,\; -\frac{1}{2} \, \Sigma^{-1}\Big( \int y y^T \check\varphi(y) \, \mathrm{d}y - \big( \Sigma + m m^T \big) \Big) \Sigma^{-1} \Big) \;. $$
The smoothness index $\beta$ is now no longer constant, which makes it much more involved to choose a convenient convex subset $\mathsf{C}_0$ and to bound the smoothness index over it.

At this stage, we can solve the maximisation and gradient-based approaches when the kernel $k$ belongs to the exponential family, for variational families as large as (finite) mixture models.
The maximisation approach appears preferable to the gradient-based one, as it neither requires a smoothness assumption nor depends on the parameterisation. In the next section, we investigate how our maximisation approach may further generalise to extensions of the exponential family such as Student's t distributions, hence further motivating the maximisation approach over the gradient-based one.

6.4 Extension to Linear Mixture (LM) Models

The goal of this section is to examine what becomes of the maximisation approach for a specific extension of the exponential family distribution, which we will refer to as the exponential linear mixture (ELM) family. The ELM family will in particular encompass the Student's t distribution with mean $m$, covariance matrix $\Sigma$ and $a$ degrees of freedom, which can be defined as the continuous mixture
$$ t(y; m, \Sigma, a) = \int_0^{\infty} \mathcal{N}(y; m, z^{-1}\Sigma) \, \check\tau_a(\mathrm{d}z) \;, $$
where $\check\tau_a$ denotes the $\chi^2$ distribution with $a$ degrees of freedom rescaled by $1/a$ (see (73) below). As a result, building a maximisation approach within the ELM family will lead to new iterative schemes for variational families that go beyond the cases considered thus far. To this end, let us first give a precise definition of the ELM family we want to study and introduce the main assumptions of this section. All the proofs from Section 6.4 are deferred to Appendix C.4.

6.4.1 The (E)LM Family: Definition and Notation

Let $\mathsf{F}$ be a Euclidean space endowed with the inner product $\langle \cdot, \cdot \rangle_{\mathsf{F}}$ and norm $\|\cdot\|_{\mathsf{F}}$. We denote by $\mathcal{L}(\mathsf{F}, \mathsf{E})$ and $\mathcal{L}(\mathsf{F})$ the (finite-dimensional) linear spaces of linear operators from $\mathsf{F}$ to $\mathsf{E}$ and from $\mathsf{F}$ to itself, respectively. We now provide the definition of the (E)LM family.

Definition 2 ((E)LM family) Let $k^{(o)} : \mathsf{E}_0 \times \mathsf{Y} \to \mathbb{R}_+$ be a kernel density w.r.t. the dominating measure $\nu$, with $\mathsf{E}_0 \subseteq \mathsf{E}$, and let $\Upsilon : \mathsf{T} \to \mathsf{F}$. Set $\mathsf{Z} = \mathcal{L}(\mathsf{F}, \mathsf{E})$ and denote by $\mathcal{Z}$ its Borel $\sigma$-field.
Let $\mathcal{M}$ be a subset of probability measures on $(\mathsf{Z}, \mathcal{Z})$ satisfying

(i) for all $\tau, \tau'$ in $\mathcal{M}$, the distributions $\tau$ and $\tau'$ are equivalent;

(ii) for all $\tau \in \mathcal{M}$ and $\tau$-almost all $\ell \in \mathsf{Z}$, the image set $\ell \circ \Upsilon(\mathsf{T})$ is included in $\mathsf{E}_0$.

A member of the linear mixture (LM) family with canonical kernel $k^{(o)}$, mixing class $\mathcal{M}$ and parameter mapping $\Upsilon$ is defined, for a parameter $\theta = (\vartheta, \tau) \in \mathsf{T} \times \mathcal{M}$, by the density w.r.t. $\nu$
$$ k^{(1)}(\theta, y) = \int_{\mathsf{Z}} k^{(o)}(\ell \circ \Upsilon(\vartheta), y) \, \tau(\mathrm{d}\ell) \;, \quad y \in \mathsf{Y} \;. \quad (48) $$
If moreover $k^{(o)}$ satisfies (B1), we say that $k^{(1)}(\theta, \cdot)$ is a member of the exponential linear mixture (ELM) family.

Here, Assumption (i) implies that we only need to check Assumption (ii) for one $\tau$ in $\mathcal{M}$. As for Assumption (ii), it ensures that for all $\vartheta \in \mathsf{T}$ and all $\tau \in \mathcal{M}$, the function $(y, \ell) \mapsto k^{(o)}(\ell \circ \Upsilon(\vartheta), y)$ appearing in (48) is a probability density function w.r.t. $\nu \otimes \tau$, so that for all $\theta \in \mathsf{T} \times \mathcal{M}$, the function $y \mapsto k^{(1)}(\theta, y)$ is a probability density function w.r.t. $\nu$. Furthermore, the mapping $\Upsilon$ in Definition 2 allows us to introduce various parameterisations. It is also important to note that for $\theta = (\vartheta, \tau) \in \mathsf{T} \times \mathcal{M}$, the probability density function $k^{(1)}(\theta, \cdot)$ only depends on $\xi = \Upsilon(\vartheta) \in \Upsilon(\mathsf{T})$ and $\tau \in \mathcal{M}$. As we shall see later, Student's t distributions in particular fit this general setting. Before that, let us state conditions leading to a systematic decrease in $\Psi_\alpha$ when the kernel $k^{(1)}$ is as in Definition 2 and, from there, let us see how a maximisation approach for the ELM family can be derived.

6.4.2 Monotonic Decrease Conditions for the (E)LM Family

In the case of an ELM family, we are not able to directly solve the argmax problem (25) as we did for the exponential family in Section 6.2. Instead, given an (E)LM family $k^{(1)}$ as in Definition 2, we come back to a monotonic decrease condition of the form
$$ \int_{\mathsf{Y}} \check\varphi(y) \log \frac{k^{(1)}(\theta, y)}{k^{(1)}(\theta_0, y)} \, \nu(\mathrm{d}y) \geqslant 0 \;. $$
(49)

As detailed in the remark below, for selected choices of $\check\varphi$ and of $(\theta, \theta_0)$, (49) can indeed be linked to the conditions (5) and (19) appearing in Theorem 1 and Theorem 2 respectively.

Remark 1 Let $\alpha \in [0, 1)$. Taking $\check\varphi = \check\varphi^{(\alpha)}_n$ defined by (6) and $(\theta, \theta_0) = (\theta_{n+1}, \theta_n)$, (49) becomes the condition (5) of Theorem 1. Applying (49) for $j = 1, \ldots, J$ with $\check\varphi = \check\varphi^{(\alpha)}_{j,n}$ defined by (16) and $(\theta, \theta_0) = (\theta_{j,n+1}, \theta_{j,n})$ yields the condition (19) of Theorem 2.

Let $D_{\tau', \tau} = \frac{\mathrm{d}\tau'}{\mathrm{d}\tau}$ denote the Radon-Nikodym derivative of $\tau'$ w.r.t. $\tau$. We then have the following proposition (we defer the proofs of the results from Section 6.4.2 to Appendix C.4.1).

Proposition 4 (Conditions for a monotonic decrease within the (E)LM family) Let $k^{(1)}$ be an LM family as in Definition 2. Let $\check\varphi : \mathsf{Y} \to \mathbb{R}_+$ be a probability density function with respect to $\nu$. Let $\vartheta_0, \vartheta \in \mathsf{T}$ and $\tau_0, \tau \in \mathcal{M}$ be such that
$$ \iint \varphi(\ell, y) \log\big( D_{\tau, \tau_0}(\ell) \big) \, \tau_0(\mathrm{d}\ell) \, \nu(\mathrm{d}y) \geqslant 0 \;, \quad (50) $$
$$ \iint \varphi(\ell, y) \log \frac{k^{(o)}(\ell \circ \Upsilon(\vartheta), y)}{k^{(o)}(\ell \circ \Upsilon(\vartheta_0), y)} \, \tau_0(\mathrm{d}\ell) \, \nu(\mathrm{d}y) \geqslant 0 \;, \quad (51) $$
where we set
$$ \varphi(\ell, y) = \check\varphi(y) \, \frac{k^{(o)}(\ell \circ \Upsilon(\vartheta_0), y)}{k^{(1)}(\theta_0, y)} \;. \quad (52) $$
Then (49) holds with $\theta = (\vartheta, \tau)$ and $\theta_0 = (\vartheta_0, \tau_0)$.

One possible way to obtain (50) is to observe that, since the left-hand side of (50) is zero for $\tau = \tau_0$, (50) is fulfilled by setting
$$ \tau = \operatorname*{argmax}_{\tau' \in \mathcal{M}} \iint \varphi(\ell, y) \log D_{\tau', \tau_0}(\ell) \, \tau_0(\mathrm{d}\ell) \, \nu(\mathrm{d}y) \;, \quad (53) $$
assuming that this argmax is well-defined. This will notably be done later in order to obtain the update formula (70) of Example 5 (see the proof of Corollary 10 in Appendix C.4). As for the updating of $\vartheta$, we can again adopt a maximisation approach to derive a convenient $\vartheta$. This is the purpose of the following result, where we directly update $\xi = \Upsilon(\vartheta)$ since $\Upsilon$ is a known mapping in Definition 2.

Corollary 9 Consider an LM family as in Definition 2 and $\theta_0 = (\vartheta_0, \tau_0) \in \mathsf{T} \times \mathcal{M}$. Set $\xi_0 = \Upsilon(\vartheta_0)$, let $\check\varphi : \mathsf{Y} \to \mathbb{R}_+$ be a probability density function with respect to $\nu$ and define $\varphi : \mathsf{Z} \times \mathsf{Y} \to \mathbb{R}_+$ by (52).
Let $\vartheta \in \mathsf{T}$ be such that $\Upsilon(\vartheta)$ is a solution of the argmax problem
$$ \operatorname*{argmax}_{\xi \in \Upsilon(\mathsf{T})} \iint \big[ \varphi(\ell, y) + b_0(\ell) \, k^{(o)}(\ell(\xi_0), y) \big] \log \frac{k^{(o)}(\ell(\xi), y)}{k^{(o)}(\ell(\xi_0), y)} \, \tau_0(\mathrm{d}\ell) \, \nu(\mathrm{d}y) \;, \quad (54) $$
where $b_0$ is any non-negative function defined on $\mathsf{Z}$ such that $\int b_0 \, \mathrm{d}\tau_0 < \infty$. Then (51) holds.

As we will see, (53) and (54) can be achieved for continuous mixtures such as the one used to define Student's t distributions, but they require a thorough analysis. To this end, we next formulate and solve a maximisation approach for a kernel $k^{(1)}$ belonging to the ELM family.

6.4.3 The Maximisation Approach for the ELM Family

Let us show that we can solve an argmax of the form (54) for an adequate choice of $b_0$ by fully exploiting the fact that $k^{(o)}(\zeta, y)$ is a canonical kernel of the exponential family satisfying (B1), and by relying on additional assumptions to be introduced alongside some helpful notation. Our first assumption is the following.

(C1) The image set $\Upsilon(\mathsf{T})$ is an open subset of $\mathsf{F}$ and, for all $\tau \in \mathcal{M}$, $\tau$-a.e. $\ell$ is full rank.

(C1) guarantees that, for all $\vartheta \in \mathsf{T}$, $\tau \in \mathcal{M}$ and $\tau$-almost all $\ell$ in $\mathsf{Z}$, $\ell \circ \Upsilon(\vartheta)$ belongs to $\mathrm{Int}\,\mathsf{E}_0$, the interior of $\mathsf{E}_0$ (see Lemma 9 in Appendix C.4 for details). This will come in handy in the upcoming derivations, and we now introduce some helpful notation before presenting our next two assumptions. We let $\|\cdot\|_{\mathrm{op}}$ denote operator norms; for instance, for any $\ell \in \mathsf{Z}$,
$$ \|\ell\|_{\mathrm{op}} = \sup \{ \|\ell(x)\|_{\mathsf{E}} : x \in \mathsf{F} \,,\; \|x\|_{\mathsf{F}} \leqslant 1 \} \;. $$
We further denote the adjoint of the linear operator $\ell$ by $\ell^T$; for instance, in the case $\ell \in \mathsf{Z}$, $\ell^T \in \mathcal{L}(\mathsf{E}, \mathsf{F})$ is defined by
$$ \langle \ell(x), y \rangle_{\mathsf{E}} = \big\langle x, \ell^T(y) \big\rangle_{\mathsf{F}} \quad \text{for all } (x, y) \in \mathsf{F} \times \mathsf{E} \;, \quad (55) $$
and, for convenience, we also denote
$$ \ell^* := \ell^T \circ \nabla A \circ \ell \;. \quad (56) $$
Here, $A$ is the function appearing in the definition of an exponential family (Definition 1). Recall that $A$ is infinitely differentiable on $\mathrm{Int}\,\mathsf{E}_0$, so that by (C1), $\ell^*$ is well-defined on $\Upsilon(\mathsf{T})$ for $\tau$-a.e. $\ell$ with $\tau \in \mathcal{M}$.
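The continuous scale mixture recalled at the beginning of Section 6.4, $t(y; m, \Sigma, a) = \int_0^\infty \mathcal{N}(y; m, z^{-1}\Sigma) \, \check\tau_a(\mathrm{d}z)$, is the concrete instance this machinery will be applied to. A quick numerical verification of that representation in dimension $d = 1$ (not the paper's code; NumPy and a simple trapezoidal quadrature are assumed):

```python
import numpy as np
from math import gamma as Gamma, pi

m, sigma2, a = 1.0, 4.0, 3.0   # location, squared scale, degrees of freedom

def t_pdf(y):
    # Student's t density with location m, squared scale sigma2, a dof (d = 1)
    q = (y - m) ** 2 / sigma2
    return (Gamma((a + 1) / 2) / (Gamma(a / 2) * np.sqrt(a * pi * sigma2))
            * (1.0 + q / a) ** (-(a + 1) / 2))

def tau_pdf(z):
    # mixing density of (73): Gamma(a/2, rate a/2), i.e. chi^2_a rescaled by 1/a
    return (a / 2) ** (a / 2) / Gamma(a / 2) * z ** (a / 2 - 1) * np.exp(-z * a / 2)

def mixture_pdf(y):
    # trapezoidal quadrature of int N(y; m, sigma2 / z) tau_a(dz) over z > 0
    z = np.linspace(1e-6, 80.0, 400_000)
    normal = np.sqrt(z / (2 * pi * sigma2)) * np.exp(-z * (y - m) ** 2 / (2 * sigma2))
    f = normal * tau_pdf(z)
    dz = z[1] - z[0]
    return np.sum((f[1:] + f[:-1]) / 2) * dz

for y in (-2.0, 0.0, 1.0, 4.0):
    assert abs(mixture_pdf(y) - t_pdf(y)) < 1e-5
```

This is exactly the structure that Definition 2 abstracts: the Gaussian part plays the role of the canonical kernel $k^{(o)}$ and the rescaled $\chi^2$ law plays the role of the mixing measure.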
Furthermore, we use $\mathbb{P}_{\vartheta, \tau}$ and $\mathbb{E}_{\vartheta, \tau}$ to denote the probability and its corresponding expectation under which the pair $(Y, L)$ has density $(y, \ell) \mapsto k^{(o)}(\ell \circ \Upsilon(\vartheta), y)$ with respect to $\nu \otimes \tau$. Namely, for any non-negative measurable $g$ on $\mathsf{Y} \times \mathsf{Z}$,
$$ \mathbb{E}_{\vartheta, \tau}[g(Y, L)] = \iint g(y, \ell) \, k^{(o)}(\ell \circ \Upsilon(\vartheta), y) \, \nu(\mathrm{d}y) \, \tau(\mathrm{d}\ell) \;. \quad (57) $$
We now introduce, for any $(\vartheta_0, \tau_0) \in \mathsf{T} \times \mathcal{M}$, two auxiliary functions $m_{\vartheta_0, \tau_0}$ and $s_{\vartheta_0, \tau_0}$, both defined from $\mathsf{Y}$ to $\mathbb{R}_+$, such that the two following conditions hold.

(C2) For all $\vartheta_0 \in \mathsf{T}$ and $\tau_0 \in \mathcal{M}$, we have $\mathbb{E}_{\vartheta_0, \tau_0}\big[ \|L\|_{\mathrm{op}} \,\big|\, Y \big] \leqslant m_{\vartheta_0, \tau_0}(Y)$, $\mathbb{P}_{\vartheta_0, \tau_0}$-a.s.

(C3) For all $\vartheta_0 \in \mathsf{T}$, $\tau_0 \in \mathcal{M}$ and $\xi \in \Upsilon(\mathsf{T})$, there exist $\epsilon, C > 0$ such that
$$ \mathbb{E}_{\vartheta_0, \tau_0}\big[ e^{\epsilon \|L^*(\xi)\|_{\mathsf{F}}} \,\big|\, Y \big] \leqslant C \, e^{s_{\vartheta_0, \tau_0}(Y)} \quad \mathbb{P}_{\vartheta_0, \tau_0}\text{-a.s.} $$

(C2) and (C3) are made to ensure that the integrals used to solve the argmax problem from Corollary 9 are well-defined. We finally present our last assumption, which corresponds to a form of identifiability assumption (see Lemma 10 in Appendix C.4 for details). It is used to obtain uniqueness for the argmax problem we will solve.

(C4) For all $\xi \neq \xi' \in \Upsilon(\mathsf{T})$ and $\tau_0 \in \mathcal{M}$, we have $\tau_0(\{\ell \in \mathsf{Z} : \ell(\xi) \neq \ell(\xi')\}) > 0$.

We can now state the following result, whose proof can be found in Appendix C.4.2.

Theorem 5 Consider an ELM family as in Definition 2 and let $\theta_0 = (\vartheta_0, \tau_0) \in \mathsf{T} \times \mathcal{M}$. Assume that (C1)-(C4) hold. Let $\check\varphi$ be a probability density function w.r.t. $\nu$ such that
$$ \int_{\mathsf{Y}} \check\varphi(y) \big( \|S(y)\|_{\mathsf{E}} \, m_{\vartheta_0, \tau_0}(y) + e^{s_{\vartheta_0, \tau_0}(y)} \big) \, \nu(\mathrm{d}y) < \infty \;, \quad (58) $$
define $\varphi : \mathsf{Z} \times \mathsf{Y} \to \mathbb{R}_+$ by (52) and set, for all $(\ell, y, \xi) \in \mathsf{Z} \times \mathsf{Y} \times \Upsilon(\mathsf{T})$ such that $\ell \circ \Upsilon(\vartheta_0) \in \mathrm{Int}\,\mathsf{E}_0$,
$$ \bar\varphi(\ell) = \int_{\mathsf{Y}} \varphi(\ell, y) \, \nu(\mathrm{d}y) \;, \quad (59) $$
$$ \bar{w}_{\tau_0}(\xi) = \int_{\mathsf{Z}} \bar\varphi(\ell) \, \ell^*(\xi) \, \tau_0(\mathrm{d}\ell) \;. \quad (60) $$
Then $\bar\varphi < \infty$ $\tau_0$-a.s., and $\bar{w}_{\tau_0}$ is well-defined from $\Upsilon(\mathsf{T})$ to $\mathsf{F}$ and one-to-one on $\Upsilon(\mathsf{T})$, hence bijective from $\Upsilon(\mathsf{T})$ to its image $\bar{w}_{\tau_0} \circ \Upsilon(\mathsf{T})$, and satisfies, for all $\vartheta \in \mathsf{T}$,
$$ \bar{w}_{\tau_0} \circ \Upsilon(\vartheta) = \mathbb{E}_{\vartheta, \tau_0}\big[ L^T S(Y) \, \bar\varphi(L) \big] \;. $$
(61)

Moreover, setting $\xi_0 = \Upsilon(\vartheta_0)$, for any $b \in \mathbb{R}_+$, the argmax problem
$$ \operatorname*{argmax}_{\xi \in \Upsilon(\mathsf{T})} \iint \big[ \varphi(\ell, y) + b \, \bar\varphi(\ell) \, k^{(o)}(\ell(\xi_0), y) \big] \log \frac{k^{(o)}(\ell(\xi), y)}{k^{(o)}(\ell(\xi_0), y)} \, \nu(\mathrm{d}y) \, \tau_0(\mathrm{d}\ell) \quad (62) $$
has at least a solution if and only if
$$ w^* = \frac{1}{1 + b} \iint \varphi(\ell, y) \, \ell^T S(y) \, \nu(\mathrm{d}y) \, \tau_0(\mathrm{d}\ell) + \frac{b}{1 + b} \, \bar{w}_{\tau_0}(\xi_0) \quad (63) $$
belongs to $\bar{w}_{\tau_0} \circ \Upsilon(\mathsf{T})$, in which case (62) has a unique solution $\xi^*$ defined by
$$ \bar{w}_{\tau_0}(\xi^*) = w^* \;. \quad (64) $$

By setting $\xi^* = \Upsilon(\vartheta^*)$ and $\xi_0 = \Upsilon(\vartheta_0)$, as well as using (61), (64) can be interpreted as
$$ \mathbb{E}_{\vartheta^*, \tau_0}\big[ L^T S(Y) \, \bar\varphi(L) \big] = \mathbb{E}_{\vartheta_0, \bar\tau_0}\big[ L^T S(\bar{Y}) \big] \;, \quad (65) $$
where $\bar\tau_0$ denotes the probability having density $\bar\varphi$ with respect to $\tau_0$ and $\bar{Y}$ denotes a random variable valued in $\mathsf{Y}$ such that, under $\mathbb{P}_{\vartheta_0, \bar\tau_0}$, the pair $(\bar{Y}, L)$ is distributed according to the mixture density
$$ (y, \ell) \mapsto \psi(y, \ell) = \frac{1}{1 + b} \, \varphi(\ell, y) + \frac{b}{1 + b} \, k^{(o)}(\ell \circ \Upsilon(\vartheta_0), y) \, \bar\varphi(\ell) $$
with respect to $\nu \otimes \tau_0$. Interestingly, the second marginals of these two mixands are $\bar\tau_0$. The distribution of $(\bar{Y}, L)$ under $\mathbb{P}_{\vartheta_0, \bar\tau_0}$ can thus equivalently be defined by saying that $L \sim \bar\tau_0$ and that the conditional distribution of $\bar{Y}$ given $L = \ell$ is the mixture with density
$$ y \mapsto \psi[y \,|\, \ell] = \frac{1}{1 + b} \, \frac{\varphi(\ell, y)}{\bar\varphi(\ell)} + \frac{b}{1 + b} \, k^{(o)}(\ell \circ \Upsilon(\vartheta_0), y) \;. $$
Using the definition of $\bar\tau_0$ on the left-hand side of (65) as well, the latter reads
$$ \mathbb{E}_{\vartheta^*, \bar\tau_0}\big[ L^T S(Y) \big] = \mathbb{E}_{\vartheta_0, \bar\tau_0}\big[ L^T S(\bar{Y}) \big] \;. \quad (66) $$
Hence, Theorem 5 along with (66) can be used to solve the argmax problem (54) in the case where (B1) and (C1)-(C4) hold and for $b_0(\ell) = b \, \bar\varphi(\ell)$ with $b$ a non-negative constant. From there, it only remains to find ways to check that the condition (58) on $\check\varphi$ holds. This can notably be done when $\check\varphi$ takes one of the two forms listed in Remark 1 (see Lemma 8 in Appendix C.2 for details), and we now provide an example of particular interest.

Example 5 (Finite mixture of Student's t distributions) Let $\nu$ be the Lebesgue measure on $\mathsf{Y} = \mathbb{R}^d$, with $d \geqslant 1$. Let $\alpha \in [0, 1)$ and let $p : \mathbb{R}^d \to \mathbb{R}_+$ be such that
$$ \int_{\mathbb{R}^d} (1 + \|y\|^2)^{1/(1 - \alpha)} \, p(y) \, \mathrm{d}y < \infty \;. $$
(67)

Let $J \geqslant 1$ and consider a $J$-mixture family $\mathcal{Q}$ as in (15), with $k(\theta, \cdot)$ being the density of the Student's t distribution with parameter $\theta = (m, \Sigma, a) \in \mathsf{T} = \mathbb{R}^d \times \mathcal{M}_{>0}(d) \times (0, \infty)$, given by
$$ k(\theta, y) = \frac{\Gamma\big((a + d)/2\big)}{\Gamma(a/2) \, (a\pi)^{d/2} \, |\Sigma|^{1/2}} \Big( 1 + \frac{(y - m)^T \Sigma^{-1} (y - m)}{a} \Big)^{-(a + d)/2} \;. $$
Define a sequence $(\mu_n k)_{n \geqslant 1}$ valued in $\mathcal{Q}$, where $\mu_n = \mu_{\lambda_n, \Theta_n}$ with $\lambda_n \in \mathcal{S}^+_J$ and $\Theta_n = (m_{j,n}, \Sigma_{j,n}, a_{j,n})_{j=1,\ldots,J} \in \mathsf{T}^J$, such that, for all $n \geqslant 1$,

(i) given $(\eta_n)_{n \geqslant 1}$ valued in $(0, 1]$, $(\kappa_n)_{n \geqslant 1}$ valued in $(-\infty, 0]$ and $\lambda_1 \in \mathcal{S}^+_J$, $(\lambda_n)_{n \geqslant 2}$ is defined by the iterative formula (21);

(ii) given $(\gamma_{j,n})_{j=1,\ldots,J,\, n \geqslant 1}$ valued in $(0, 1]$ and $\Theta_1 \in \mathsf{T}^J$, $(\Theta_n)_{n \geqslant 2}$ is defined by
$$ m_{j,n+1} = \frac{\int_{\mathbb{R}^d} \int_{\mathbb{R}_+} y \, z \, \psi_{j,n}(z, y) \, \mathrm{d}y \, \check\tau_{a_{j,n}}(\mathrm{d}z)}{\int_{\mathbb{R}^d} \int_{\mathbb{R}_+} z \, \psi_{j,n}(z, y) \, \mathrm{d}y \, \check\tau_{a_{j,n}}(\mathrm{d}z)} \;, \quad (68) $$
$$ \Sigma_{j,n+1} = \frac{\int_{\mathbb{R}^d} \int_{\mathbb{R}_+} (y - m_{j,n+1})(y - m_{j,n+1})^T \, z \, \psi_{j,n}(z, y) \, \mathrm{d}y \, \check\tau_{a_{j,n}}(\mathrm{d}z)}{\int_{\mathbb{R}^d} \int_{\mathbb{R}_+} \psi_{j,n}(z, y) \, \mathrm{d}y \, \check\tau_{a_{j,n}}(\mathrm{d}z)} \;, \quad (69) $$
$$ a_{j,n+1} = 2 \, \kappa^{-1}\Big( \int_{\mathbb{R}^d} \int_{\mathbb{R}_+} \big( z - \ln(z) \big) \, \phi_{j,n}(z, y) \, \mathrm{d}y \, \check\tau_{a_{j,n}}(\mathrm{d}z) \Big) \;, \quad (70) $$
where, defining $\check\varphi^{(\alpha)}_{j,n}$ as in (16) and denoting by $\kappa^{-1} : \mathbb{R} \to (0, \infty)$ the inverse mapping of $\kappa(x) = \log(x) + \Gamma'(x)/\Gamma(x)$, we set
$$ \phi_{j,n}(z, y) = \check\varphi^{(\alpha)}_{j,n}(y) \, \frac{e^{-\frac{z}{2} (y - m_{j,n})^T \Sigma_{j,n}^{-1} (y - m_{j,n})}}{(2\pi/z)^{d/2} \, |\Sigma_{j,n}|^{1/2} \, k(\theta_{j,n}, y)} \;, \quad (71) $$
$$ \psi_{j,n}(z, y) = \gamma_{j,n} \, \phi_{j,n}(z, y) + (1 - \gamma_{j,n}) \, \frac{e^{-\frac{z}{2} (y - m_{j,n})^T \Sigma_{j,n}^{-1} (y - m_{j,n})}}{(2\pi/z)^{d/2} \, |\Sigma_{j,n}|^{1/2}} \int_{\mathbb{R}^d} \phi_{j,n}(z, y') \, \mathrm{d}y' \;, \quad (72) $$
$$ \check\tau_a(\mathrm{d}z) = \mathbb{1}_{\mathbb{R}_+}(z) \, \frac{(a/2)^{a/2}}{\Gamma(a/2)} \, z^{a/2 - 1} \, e^{-za/2} \, \mathrm{d}z \quad \text{for any } a > 0 \;. \quad (73) $$

Here, $\check\tau_a$ in (73) is the $\chi^2$ distribution with $a$ degrees of freedom rescaled by $1/a$, and $\kappa^{-1}$ in (70) is well-defined (this follows from Lemma 13 in Appendix C.4.3). We then have the following result, whose proof is postponed to Appendix C.4.3.

Corollary 10 In the setting of Example 5, if $\Psi_\alpha(\mu_1 k) < \infty$, then, at any time $n \geqslant 1$, we have $\Psi_\alpha(\mu_{n+1} k) \leqslant \Psi_\alpha(\mu_n k)$.

We have stated results that solve our maximisation and gradient-based approaches when the variational family is based on the exponential family. However, all the update formulas we have obtained so far involve intractable integrals. In the next section, we focus on the case of Gaussian Mixture Models optimisation seen in Examples 3 and 4 and we investigate how our framework can be used to derive tractable algorithms in that case.

7.
Gaussian Mixture Models (GMMs) Optimisation

We consider $d$-dimensional Gaussian mixture densities with $k(\theta_j, y) = \mathcal{N}(y; m_j, \Sigma_j)$, where $\theta_j = (m_j, \Sigma_j) \in \mathsf{T}$ denotes the mean and covariance matrix of the $j$-th Gaussian component density. By Theorem 2, a valid update formula at time $n \geqslant 1$ for $(\lambda_n)_{n \geqslant 1}$ is
$$ \lambda_{j,n+1} = \frac{\lambda_{j,n} \big[ \int_{\mathsf{Y}} \phi^{(\alpha)}_{j,n}(y) \, \nu(\mathrm{d}y) + (\alpha - 1)\kappa_n \big]^{\eta_n}}{\sum_{\ell=1}^J \lambda_{\ell,n} \big[ \int_{\mathsf{Y}} \phi^{(\alpha)}_{\ell,n}(y) \, \nu(\mathrm{d}y) + (\alpha - 1)\kappa_n \big]^{\eta_n}} \;, \quad j = 1, \ldots, J \;. \quad (74) $$
Furthermore, as underlined in Example 3, a valid update at time $n \geqslant 1$ for the means $(m_{j,n+1})_{1 \leqslant j \leqslant J}$ and the covariance matrices $(\Sigma_{j,n+1})_{1 \leqslant j \leqslant J}$ is given, for $j = 1, \ldots, J$, by
$$ m_{j,n+1} = (1 - \gamma_{j,n}) m_{j,n} + \gamma_{j,n} \hat{m}_{j,n} \quad (75) $$
$$ \Sigma_{j,n+1} = (1 - \gamma_{j,n}) \Sigma_{j,n} + \gamma_{j,n} \hat\Sigma_{j,n} + \gamma_{j,n} (1 - \gamma_{j,n}) (\hat{m}_{j,n} - m_{j,n})(\hat{m}_{j,n} - m_{j,n})^T $$
where $\hat{m}_{j,n} = \int_{\mathsf{Y}} y \, \check\varphi^{(\alpha)}_{j,n}(y) \, \nu(\mathrm{d}y)$, $\hat\Sigma_{j,n} = \int_{\mathsf{Y}} y y^T \check\varphi^{(\alpha)}_{j,n}(y) \, \nu(\mathrm{d}y) - \hat{m}_{j,n} \hat{m}_{j,n}^T$ with $\check\varphi^{(\alpha)}_{j,n} = \phi^{(\alpha)}_{j,n} / \int \phi^{(\alpha)}_{j,n} \, \mathrm{d}\nu$ and $\gamma_{j,n} \in (0, 1]$. In addition, another possible update for the means is
$$ m_{j,n+1} = m_{j,n} + \gamma_{j,n} \, \frac{\int_{\mathsf{Y}} \lambda_{j,n} \phi^{(\alpha)}_{j,n}(y) (y - m_{j,n}) \, \nu(\mathrm{d}y)}{\int_{\mathsf{Y}} \mu_n k(y)^\alpha \, p(y)^{1 - \alpha} \, \nu(\mathrm{d}y)} \;, \quad j = 1, \ldots, J \;. \quad (76) $$
By virtue of Example 4, this update can be used in lieu of the means update written in (75), with $\gamma_{j,n}$ still in $(0, 1]$. Since the update (76) is linked to Gradient Descent steps for Rényi's $\alpha$-divergence minimisation (see Example 4), we will refer to it as the Rényi Gradient Descent (RGD) update in the following. On the other hand, the means update written in (75) will be referred to as the Maximisation Gradient (MG) update.

Notice that the updates (74), (75) and (76) all involve intractable integrals. However, we can resort to approximate update rules and take advantage of the fact that the integrals appearing in these update formulas are of the form
$$ \int_{\mathsf{Y}} \phi^{(\alpha)}_{j,n}(y) \, g(y) \, \nu(\mathrm{d}y) \;, \quad (77) $$
where $g$ is a well-chosen function defined on $\mathsf{Y}$ and where $\phi^{(\alpha)}_{j,n}$ is defined in (16) by $\phi^{(\alpha)}_{j,n}(y) = k(\theta_{j,n}, y) \, (\mu_n k(y)/p(y))^{\alpha - 1}$ for all $y \in \mathsf{Y}$.
Many choices are indeed possible to approximate (77) and, for simplicity, we restrict ourselves to typical Importance Sampling estimation. Looking at the definition of $\phi^{(\alpha)}_{j,n}$, a first idea would be to sample $M$ i.i.d. random variables $(Y^{(j)}_{m,n})_{1 \leqslant m \leqslant M}$ according to $k(\theta_{j,n}, \cdot)$ for $j = 1, \ldots, J$ and to use the unbiased estimator of (77)
$$ \frac{1}{M} \sum_{m=1}^M \bigg( \frac{\mu_n k(Y^{(j)}_{m,n})}{p(Y^{(j)}_{m,n})} \bigg)^{\alpha - 1} g(Y^{(j)}_{m,n}) \;. $$
As this sampler depends on $j$, this option becomes very computationally heavy as $J$ and $M$ increase, due to the sampling cost ($J \times M$) and to the evaluation cost ($\mu_n k$ and $p$ need to be evaluated $M \times J$ times each, where $\mu_n k$ is a sum over $J$ terms). To reduce this computational bottleneck, it is then preferable to consider sequences of samplers $(q_n)_{n \geqslant 1}$ that do not depend on $j$ at time $n$. More specifically, we discuss two possibilities.

(i) Best sampler at time $n$ (IS-n). Approximated update rules can be obtained by sampling $M$ i.i.d. random variables $(Y_{m,n})_{1 \leqslant m \leqslant M}$ according to the best approximation of $p$ we have at time $n$, that is $q_n = \mu_n k$. Then, the samples are shared across the $J$ mixture weights and mixture components parameters updates. This eases the computational burden, with a sampling and an evaluation cost of $M$ each, as an unbiased estimator of (77) is
$$ \frac{1}{M} \sum_{m=1}^M \hat\phi^{(\alpha)}_{j,n}(Y_{m,n}) \, g(Y_{m,n}) \quad \text{where} \quad \hat\phi^{(\alpha)}_{j,n}(y) = \frac{k(\theta_{j,n}, y)}{q_n(y)} \bigg( \frac{\mu_n k(y)}{p(y)} \bigg)^{\alpha - 1} \;, \quad y \in \mathsf{Y} \;. \quad (78) $$

(ii) Uniform sampler (IS-unif). Approximated update rules can be obtained by sampling according to the uniform sampler $q_n = J^{-1} \sum_{j=1}^J k(\theta_{j,n}, \cdot)$ and using the unbiased estimator (78). That way, the sampling and evaluation costs are also of $M$ each. Contrary to IS-n, this sampler ensures a fair sampling among all components, as it entails sampling an index vector $[i_1, i_2, \ldots, i_M]$ uniformly from $\{1, \ldots, J\}$, whereas IS-n samples an index vector $[i_1, i_2, \ldots, i_M]$ according to the mixture weights $\lambda_n$.

Algorithm 3: GMMs optimisation with the IS-n/IS-unif sampler
At iteration $n$,
1. Draw independently $M$ samples $(Y_{m,n})_{1 \leqslant m \leqslant M}$ from the proposal $q_n$.
2.
For all $j = 1, \ldots, J$,

(a) Set
$$ \hat{m}_{j,n} = \frac{\sum_{m=1}^M \hat\phi^{(\alpha)}_{j,n}(Y_{m,n}) \, Y_{m,n}}{\sum_{m=1}^M \hat\phi^{(\alpha)}_{j,n}(Y_{m,n})} \quad \text{and} \quad \hat\Sigma_{j,n} = \frac{\sum_{m=1}^M \hat\phi^{(\alpha)}_{j,n}(Y_{m,n}) \, Y_{m,n} Y_{m,n}^T}{\sum_{m=1}^M \hat\phi^{(\alpha)}_{j,n}(Y_{m,n})} - \hat{m}_{j,n} \hat{m}_{j,n}^T \;. $$

(b) Choose between the (MG) and (RGD) approaches:
$$ \text{(MG)} \quad m_{j,n+1} = (1 - \gamma_n) m_{j,n} + \gamma_n \hat{m}_{j,n} $$
$$ \text{(RGD)} \quad m_{j,n+1} = m_{j,n} + \gamma_n \, \frac{\lambda_{j,n} \sum_{m=1}^M \hat\phi^{(\alpha)}_{j,n}(Y_{m,n}) (Y_{m,n} - m_{j,n})}{\sum_{\ell=1}^J \sum_{m=1}^M \lambda_{\ell,n} \hat\phi^{(\alpha)}_{\ell,n}(Y_{m,n})} $$
$$ \lambda_{j,n+1} = \frac{\lambda_{j,n} \big[ \sum_{m=1}^M \hat\phi^{(\alpha)}_{j,n}(Y_{m,n}) + (\alpha - 1)\kappa_n \big]^{\eta_n}}{\sum_{\ell=1}^J \lambda_{\ell,n} \big[ \sum_{m=1}^M \hat\phi^{(\alpha)}_{\ell,n}(Y_{m,n}) + (\alpha - 1)\kappa_n \big]^{\eta_n}} $$
$$ \Sigma_{j,n+1} = (1 - \gamma_n) \Sigma_{j,n} + \gamma_n \hat\Sigma_{j,n} + \gamma_n (1 - \gamma_n) (\hat{m}_{j,n} - m_{j,n})(\hat{m}_{j,n} - m_{j,n})^T \;. $$

Remark 2 Approximated update rules for GMMs can also be obtained by sampling according to $q_n = \mathcal{N}(\cdot\,; 0_d, I_d)$, where $0_d$ is the null vector of dimension $d$. These can be deduced by applying the reparameterisation trick: for all $m = 1, \ldots, M$, $Y^{(j)}_{m,n} \sim k(\theta_{j,n}, \cdot)$ if $Y^{(j)}_{m,n} = m_{j,n} + \Sigma^{1/2}_{j,n} \varepsilon_{m,n}$ with $\varepsilon_{m,n} \sim q_n$, and by observing that an unbiased estimator of (77) is $M^{-1} \sum_{m=1}^M \big( \mu_n k(Y^{(j)}_{m,n}) / p(Y^{(j)}_{m,n}) \big)^{\alpha - 1} g(Y^{(j)}_{m,n})$. This choice of $q_n$ relies on the existence of a reparameterisation, and while it incurs a gain in terms of sampling cost, as we only need $M$ samples, the evaluation cost remains $M \times J$.

We thus have access to tractable algorithms that iteratively update both the weights and components parameters of a GMM by optimising the $\alpha$-divergence between the mixture distribution and the targeted distribution. Our framework also allows the use of learning rates $\gamma_{j,n}$ that depend on $j$. This means that the learning rates can differ for each component parameters update, and it paves the way for adaptive learning rates per component. This aspect is left for future work, as we will focus on the case $\gamma_{j,n} := \gamma_n$, in accordance with Algorithm 3, and move on to presenting our numerical experiments.
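A compact runnable sketch of Algorithm 3 with the IS-n sampler and the (MG) means update, restricted (as in the toy experiments of Section 8.1) to fixed identity covariances; this is an illustration, not the authors' code, and the target below is the bimodal example (i):

```python
import numpy as np

# One run of (a sketch of) Algorithm 3: MG update, IS-n sampler q_n = mu_n k,
# 2-D Gaussian mixture variational family with fixed identity covariances,
# kappa_n = 0 as in the experiments.  The RGD and covariance updates are omitted.
rng = np.random.default_rng(0)
d, J, M, alpha, gamma, eta, n_iter = 2, 3, 500, 0.2, 0.5, 0.1, 100

def gauss_pdf(y, mean):
    # N(y; mean, I_d) evaluated at the rows of y
    diff = y - mean
    return np.exp(-0.5 * np.einsum('nd,nd->n', diff, diff)) / (2 * np.pi) ** (d / 2)

def target_p(y):
    # unnormalised bimodal target, as in Section 8.1 (i) with c = 2
    return 2.0 * (0.5 * gauss_pdf(y, -2 * np.ones(d)) + 0.5 * gauss_pdf(y, 2 * np.ones(d)))

lam = np.full(J, 1.0 / J)
means = rng.normal(scale=3.0, size=(J, d))

for n in range(n_iter):
    # 1. draw M i.i.d. samples from the current mixture q_n = mu_n k (IS-n)
    idx = rng.choice(J, size=M, p=lam)
    Y = means[idx] + rng.standard_normal((M, d))
    comp = np.stack([gauss_pdf(Y, means[j]) for j in range(J)])   # (J, M)
    mu_k = lam @ comp                                             # mixture density at the samples
    # hat phi_{j,n}^{(alpha)}(Y_m) = [k(theta_j, Y_m) / q_n(Y_m)] (mu_n k / p)^{alpha - 1}
    w = comp / mu_k * (mu_k / target_p(Y)) ** (alpha - 1.0)
    # 2. (MG) means update (75) and mixture-weights update (74) with kappa_n = 0
    tot = w.sum(axis=1)
    means = (1 - gamma) * means + gamma * (w @ Y) / tot[:, None]
    lam = lam * tot ** eta
    lam /= lam.sum()

assert abs(lam.sum() - 1.0) < 1e-12 and np.isfinite(means).all()
```

The constant $1/M$ of the estimator (78) cancels in the self-normalised mean update and in the normalisation of the weight update, which is why it does not appear explicitly above.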
Remark 3 Computing the estimator (78) can be expensive for two reasons: (i) $\mu_n k$ involves a sum over $J$ terms, so that evaluating this function requires heavier computations as the number of components $J$ grows, and (ii) Bayesian applications with $p = p(\cdot, \mathcal{D})$ often consider large-scale data sets $\mathcal{D}$. To alleviate this computational burden, one can (i) approximate the summation appearing in $\mu_n k$ using subsampling and (ii) use a mini-batch approach to approximate $p = p(\cdot, \mathcal{D})$, as detailed in (Li and Turner, 2016, Section 4.3).

8. Numerical Experiments

In our numerical experiments, we are interested in understanding the behaviour of Algorithm 3 in practice. We first investigate the challenging case of multimodal targets, as our framework is designed to handle multimodality thanks to the hyperparameter $\alpha \in [0, 1)$ and to the use of mixture models as a variational family.

8.1 Toy Multimodal Targets

Let $u_d$ be the $d$-dimensional vector whose coordinates are all equal to 1, $I_d$ be the identity matrix and $c$ be a positive constant (we set $c = 2$). We consider three multimodal examples:

(i) Equally-weighted Gaussian Mixture Model. The target $p$ is a mixture density of two equally-weighted $d$-dimensional Gaussian distributions, multiplied by $c$, such that
$$ p(y) = c \, \big[ 0.5 \, \mathcal{N}(y; -2 u_d, I_d) + 0.5 \, \mathcal{N}(y; 2 u_d, I_d) \big] \;. $$

(ii) Imbalanced Gaussian Mixture Model. The target $p$ is a mixture density of three $d$-dimensional Gaussian distributions with unequal weights, multiplied by $c$, such that
$$ p(y) = c \, \big[ 0.35 \, \mathcal{N}(y; -2 u_d, I_d) + 0.25 \, \mathcal{N}(y; 2 u_d, I_d) + 0.4 \, \mathcal{N}(y; u_d, I_d) \big] \;. $$

(iii) Equally-weighted Student's t Mixture Model. The target $p$ is a mixture density of two equally-weighted $d$-dimensional Student's t distributions with two degrees of freedom (i.e. $a = 2$), multiplied by $c$, such that
$$ p(y) = c \, \big[ 0.5 \, t(y; -2 u_d, I_d, a) + 0.5 \, t(y; 2 u_d, I_d, a) \big] \;. $$

8.1.1 Comparing the RGD and the MG Approaches with $\eta_n = 0$

Setting $\eta_n = 0$ at all times $n \geqslant 1$ keeps the mixture weights fixed in Algorithm 3.
In that case, our RGD approach relates to Gradient Ascent steps on the VR Bound (Li and Turner, 2016) w.r.t. the means of our Gaussian components (see Section 5.1). We then want to compare the performance of the RGD approach (which can be derived from Li and Turner, 2016) to the novel MG approach introduced in our work.

Implementation details. The covariance matrices of the mixture components are fixed and equal to $\sigma^2 I_d$ with $\sigma^2 = 1$, $\alpha = 0.2$, $J \in \{10, 50\}$, $d = 16$, $M = 200$, the total number of iterations $N$ is equal to 100, $\Theta_1$ is generated by sampling from a centered normal distribution with covariance matrix $10 I_d$, $\lambda_1 = [1/J, \ldots, 1/J]$ and, for all times $n = 1, \ldots, N$, $\kappa_n = 0$, $\eta_n = 0$ and $\gamma_n = \gamma$ with $\gamma \in \{0.1, 0.5, 1.0\}$. The experiments are replicated independently 30 times and the convergence of the RGD and of the MG approaches is monitored for the three multimodal examples (i), (ii) and (iii) by computing a Monte Carlo estimator of the VR Bound (since we have sampled $M$ samples from $q_n$ at time $n$, these samples can readily be used to obtain an estimate of the VR Bound at no additional computational cost).

Figure 1: Monte Carlo estimate of the VR Bound for the RGD and the MG approaches (fixed mixture weights, $J \in \{10, 50\}$) when considering each of the target distributions (i), (ii) and (iii).

Our results are plotted in Figure 1, where RGD-IS-n($\gamma$) and MG-IS-n($\gamma$) denote the algorithms originating from the RGD and MG approaches (the IS-n and IS-unif samplers are equivalent here), and error bounds plots can be found in Figures 5 and 6 of Appendix D. As minimising $\Psi_\alpha$ is equivalent to maximising the VR Bound when $\alpha \in (0, 1)$, our theoretical results predict a systematic increase in the VR Bound for the algorithms considered.
This is what we observe in Figure 1, and we also see that $\gamma$ plays the role of a learning rate for our algorithms: while increasing $\gamma$ mostly improves the speed of convergence, it may deteriorate the performance if chosen too large, such as when $J = 50$ with the target (iii). Furthermore, MG-IS-n($\gamma$) leads to better performance in all but one case compared to RGD-IS-n($\gamma$). This renders the MG-IS-n($\gamma$) algorithm more suitable overall for capturing the multimodality of the various multimodal targets $p$ we considered, a fact that is further supported by Table 2, in which we compare the log of the Mean-Squared Error (log MSE) returned by each algorithm.

Table 2: Log MSE averaged for the RGD and the MG approaches (fixed mixture weights) when considering each of the target distributions (i), (ii) and (iii).

                           |          J = 10          |          J = 50
                           | γ = 0.1  γ = 0.5  γ = 1.0 | γ = 0.1  γ = 0.5  γ = 1.0
      (i)   RGD-IS-n(γ)    |  -0.081   -0.076   -0.218 |  -1.640   -1.673   -1.560
            MG-IS-n(γ)     |  -3.702   -1.875   -2.711 |  -2.760   -2.771   -2.788
      (ii)  RGD-IS-n(γ)    |  -0.211   -0.072   -0.015 |  -1.401   -1.437   -1.515
            MG-IS-n(γ)     |  -2.581   -2.101   -1.742 |  -2.611   -2.328   -1.933
      (iii) RGD-IS-n(γ)    |  -0.108   -0.008   -0.111 |  -1.652   -1.654   -1.634
            MG-IS-n(γ)     |  -0.913   -1.489   -1.846 |  -2.036   -2.530   -0.717

(The MSE is computed as the average of $\|m_{\mathrm{approx}} - m_{\mathrm{true}}\|^2$ over 30 independent runs of each algorithm, where $\|\cdot\|$ stands for the Euclidean norm, $m_{\mathrm{true}} = \mathbb{E}_P[Y]$ with $P(A) = \int_A p(y) \nu(\mathrm{d}y) / \int_{\mathsf{Y}} p(y) \nu(\mathrm{d}y)$ for all $A \subseteq \mathsf{Y}$, and where $m_{\mathrm{approx}}$ is the mean of the optimised variational distribution, that is, in our setting, $m_{\mathrm{approx}} = \sum_{j=1}^J \lambda_{j,N} m_{j,N}$.)

Let us next pair up the means optimisation with mixture weights optimisation and investigate the impact of the sequence $(\eta_n)_{n \geqslant 1}$ on our algorithms.

8.1.2 Comparing the RGD and the MG Approaches with $\eta_n > 0$

So far, we have demonstrated the viability of our novel MG approach compared to the more usual Rényi's $\alpha$-divergence-based RGD approach.
Yet, we have not taken advantage of the fact that we can perform mixture weights optimisation on top of means optimisation, which is another novelty of our framework compared to traditional Variational Inference methods.

Implementation details. We work with the same implementation details as those described in Section 8.1.1, except for the fact that we now let $\eta_n = \eta$ at all times $n$ and vary the value of $\eta$. In addition to the RGD-IS-n($\gamma$) and the MG-IS-n($\gamma$) algorithms, we also include the RGD-IS-unif($\gamma$) and the MG-IS-unif($\gamma$) algorithms in our results (that is, the algorithms resulting from pairing up the RGD and MG approaches with the IS-unif sampler).

The plots for the averaged Monte Carlo estimate of the VR Bound with $\eta = 0.1$ and $\gamma = 0.5$ are available in Figure 2 (and similar plots for $\eta \in \{0.05, 0.5\}$ are provided in Figures 7 and 8 of Appendix D respectively). According to Figure 2, MG-IS-unif(0.5) outperforms the other methods most of the time (this trend for MG-IS-unif($\gamma$) is further confirmed by looking at the log MSE in Table 3 of Appendix D, which includes the cases $\gamma \in \{0.1, 1.0\}$ for completeness). We also plot in Figure 3 the final mixture weights $\lambda_N$ obtained after one typical run of MG-IS-unif(0.5) for $\eta \in \{0.05, 0.1, 0.5\}$ and $J \in \{10, 50\}$ (and additional log MSE results can be found in Table 4 of Appendix D). This is done in order to delve into how $\eta$ impacts the variational density returned at the end of Algorithm 3 (MG-IS-unif(0.5) was chosen among all four options since it enjoys good empirical performance according to Figure 2).

Figure 2: Monte Carlo estimate of the VR Bound for the RGD and the MG approaches ($\eta = 0.1$, $J \in \{10, 50\}$) when considering each of the target distributions (i), (ii) and (iii).
A key insight from Figure 3 is that optimising the mixture weights permits us to select the components appearing in our mixture model according to their overall contribution in constructing a good approximation of the targeted density. On the one hand, we enable more flexibility in our variational approximation, as we optimise over a set of component parameters $\Theta$ with $J > 1$ instead of just one component parameter $\theta$ ($J = 1$). On the other hand, we avoid unnecessarily complicating the variational distribution returned at the end of the optimisation procedure, since we perform mixture weights optimisation alongside components parameters optimisation. That way, we can bypass the limitation of the case $\eta = 0$, which may use more components than needed when approximating the targeted density (this is the case here, since the targeted densities considered have three modes at most while $J \in \{10, 50\}$).

However, there is a trade-off to find between the simplicity of the variational density that is returned and its accuracy at describing the targeted density, which is expressed via the choice of the learning rate $\eta$. Indeed, choosing $\eta$ too large might lead to missing some of the modes, while having $\eta$ too small may not discriminate quickly enough between the components. Observe in particular that the number of components $J$ also plays a role in how fast the components selection process occurs: for the same value of $\eta$, this selection process is slower as $J$ increases.

Figure 3: Final mixture weights $\lambda_N$ for one run of the MG-IS-unif(0.5) algorithm (varying mixture weights, $J \in \{10, 50\}$) when considering each of the target distributions (i), (ii) and (iii).

8.2 Bayesian Logistic Regression

We consider the Bayesian Logistic Regression setting from Gershman et al. (2012). Namely, the data $\mathcal{D} = \{c, x\}$ is made of $I$ binary class labels, $c_i \in \{-1, 1\}$, and of $L$ covariates for each datapoint, $x_i \in \mathbb{R}^L$.
The latent variable $y = \{w, \beta\}$ consists of $L$ regression coefficients $w_l \in \mathbb{R}$ and of a precision parameter $\beta \in \mathbb{R}_+$. Furthermore, the following model is assumed:
$$ p_0(\beta) = \mathrm{Gamma}(\beta; a, b) \;, \quad p_0(w | \beta) = \mathcal{N}(w; 0, \beta^{-1} I_L) \;, \quad p(c_i = 1 | x_i, w) = \frac{1}{1 + e^{-w^T x_i}} \;, $$
where $a = 1$ and $b = 0.01$, so that $p(y, \mathcal{D}) = p_0(\beta) \, p_0(w | \beta) \prod_{i=1}^I p(c_i | x_i, y)$. We now want to demonstrate the practicability of our framework in a real data scenario. To this end, we select the Covertype data set (581,012 data points and 54 features, available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and we compare the RGD and the MG approaches for this choice of target density.

Implementation details. The covariance matrices of the mixture components are fixed and equal to $\sigma^2 I_d$ with $\sigma^2 = 1$, $\alpha = 0.2$, $J \in \{50, 100\}$, $d = 56$, $M = 200$, the total number of iterations $N$ is equal to 200, $\Theta_1$ is generated by sampling from a centered normal distribution with covariance matrix $5 I_d$, $\lambda_1 = [1/J, \ldots, 1/J]$ and, for all times $n = 1, \ldots, N$, $\kappa_n = 0$, $\eta_n = \eta$, $\gamma_n = \gamma$, where $\eta \in \{0.1, 0.5\}$ and $\gamma \in \{0.05, 0.1, 0.2, 0.3\}$. Computing $p(y, \mathcal{D})$ constitutes the major computational bottleneck here, since $I = 581{,}012$. To address this problem, $p(y, \mathcal{D})$ is approximated according to Remark 3 with subsampled mini-batches of size 100. The experiments are replicated independently 30 times and the convergence of the RGD and of the MG approaches is monitored by computing a Monte Carlo estimator of the VR Bound.

Figure 4: Monte Carlo estimate of the VR Bound for the RGD and the MG approaches ($J \in \{50, 100\}$) when considering the Bayesian Logistic Regression on the Covertype data set.
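The mini-batch approximation of the target of Remark 3 can be sketched as follows; this is an illustration with synthetic data standing in for the Covertype set, and the log-parameterisation of $\beta$ is our own device to keep the precision positive (NumPy assumed):

```python
import numpy as np

# Sketch of the unnormalised Bayesian logistic-regression log-target
# log p(y, D) of Section 8.2, with the mini-batch approximation of
# Remark 3: the log-likelihood over I datapoints is estimated from a
# random batch and rescaled by I / batch_size.
rng = np.random.default_rng(0)
I, L = 10_000, 5                          # synthetic stand-in for Covertype
x = rng.standard_normal((I, L))
c = np.where(rng.random(I) < 1.0 / (1.0 + np.exp(-x @ np.ones(L))), 1, -1)
a_prior, b_prior = 1.0, 0.01              # Gamma(a, b) prior on beta

def log_joint(w, log_beta, batch_size=100):
    beta = np.exp(log_beta)               # beta > 0 via log-parameterisation
    # log Gamma(beta; a, b) and log N(w; 0, beta^{-1} I_L), up to constants
    lp = (a_prior - 1.0) * np.log(beta) - b_prior * beta
    lp += 0.5 * L * np.log(beta) - 0.5 * beta * w @ w
    # mini-batch estimate of the log-likelihood, rescaled by I / batch_size
    batch = rng.choice(I, size=batch_size, replace=False)
    logits = c[batch] * (x[batch] @ w)    # log p(c_i | x_i, w) = -log(1 + e^{-c_i w^T x_i})
    lp += (I / batch_size) * np.sum(-np.logaddexp(0.0, -logits))
    return lp

print(log_joint(np.zeros(L), 0.0))
```

At $w = 0$ the per-datapoint log-likelihood is $-\log 2$ regardless of the batch drawn, so the mini-batch estimate is exact there, which gives a convenient sanity check.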
Our results are plotted in Figure 4, in which we focus on RGD-IS-n($\gamma$) and MG-IS-unif($\gamma$) (these two versions of Algorithm 3 are the most interesting to compare in this particular setting, since MG-IS-n($\gamma$) enjoys similar performance to RGD-IS-n($\gamma$) and RGD-IS-unif($\gamma$) underperforms; see Figure 9 of Appendix D for details). We observe that both algorithms are able to learn in a real data scenario for a proper tuning of $(\eta, \gamma)$ and that selecting either $\eta$ or $\gamma$ too small/large can deteriorate the performance (as per the learning rate behaviour associated with those parameters). Furthermore, MG-IS-unif($\gamma$) can be tuned to outperform RGD-IS-n($\gamma$), which confirms our previous empirical findings regarding the relevance of the novel MG approach compared to the more traditional RGD one.

9. Conclusion

We introduced a novel methodology to build algorithms ensuring a monotonic decrease in the $\alpha$-divergence at each step. Our methodology enabled simultaneous updates for both the weights and components parameters of a given mixture model, making it suitable for capturing complex multimodal target densities. Our work also connected and improved on different approaches: Gradient Descent for $\alpha$-divergence minimisation, the Power Descent and an Integrated EM algorithm. By investigating variational families based on the exponential family, we applied our framework to important classes of models such as Gaussian Mixture Models and Student's t Mixture Models. Finally, we provided empirical evidence that our methodology can be used to enhance the aforementioned existing algorithms.

To conclude, we state several directions in which to extend our work. Now that we have established a systematic decrease for our iterative schemes, the next step could be to derive convergence results and to compare them with those obtained using typical Gradient Descent schemes.
Based on Proposition 2, another direction is to generalise the monotonicity property from Theorem 3 beyond the case α ∈ [0, 1). Lastly, much like it is already the case in traditional Black-Box Variational Inference, we expect that fine-tuning our hyperparameters and resorting to more advanced Monte Carlo methods in the estimation of the intractable integrals appearing in our ideal algorithms will lead to further improved numerical results.

Acknowledgments

We would like to thank the action editor and the reviewers for helpful comments and suggestions on the paper. Kamélia Daudel acknowledges support of the UK Defence Science and Technology Laboratory (Dstl) and the Engineering and Physical Sciences Research Council (EPSRC) under grant EP/R013616/1. This is part of the collaboration between US DOD, UK MOD and UK EPSRC under the Multidisciplinary University Research Initiative.

Appendix A. Deferred Proofs and Results for Section 3

A.1 Quantifying the Improvement in one Step of Gradient Descent

Let T ⊆ Rd be a convex set. Here, ⟨·, ·⟩ is the standard inner product on Rd and ∥·∥ is the Euclidean norm.

Definition 3 A continuously differentiable function g defined on T is said to be β-smooth if for all θ, θ′ ∈ T, ∥∇g(θ) − ∇g(θ′)∥ ≤ β ∥θ − θ′∥.

Lemma 2 Let γ ∈ (0, 1] and let g be a β-smooth function defined on T. Then, for all θ ∈ T such that θ − (γ/β) ∇g(θ) ∈ T, it holds that

g(θ) − g(θ − (γ/β) ∇g(θ)) ≥ (γ/(2β)) ∥∇g(θ)∥² .

Proof By assumption on g, we have that for all θ, θ′ ∈ T,

g(θ′) ≤ g(θ) + ⟨∇g(θ), θ′ − θ⟩ + (β/2) ∥θ′ − θ∥² .

In particular, setting θ′ = θ − (γ/β) ∇g(θ) yields

g(θ′) ≤ g(θ) − (γ/β) ∥∇g(θ)∥² + (γ²/(2β)) ∥∇g(θ)∥² .

Since γ ∈ (0, 1], we can deduce the desired result.

A.2 Gradient Descent for α- / Rényi's α-divergence Minimisation in Section 3

Let us first write the definition of the α-divergence (resp. of Rényi's α-divergence) between the two absolutely continuous probability measures K(θ, ·) and P:

Dα(K(θ, ·)||P) = 1/(α(α−1)) ∫_Y [ (k(θ, y)/p(y|D))^α − 1 ] p(y|D) ν(dy) ,

D^(AR)_α(K(θ, ·)||P) = 1/(α(α−1)) log ∫_Y k(θ, y)^α p(y|D)^{1−α} ν(dy) .
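Lemma 2 can be sanity-checked numerically: for a quadratic g, β is the largest eigenvalue of its Hessian, and one gradient step of size γ/β must decrease g by at least γ/(2β)‖∇g(θ)‖². A minimal sketch on a toy example of our own, not code from the paper:

```python
import numpy as np

# g(theta) = 0.5 * theta^T A theta is beta-smooth with beta = lambda_max(A)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
beta = np.linalg.eigvalsh(A).max()

def g(theta):
    return 0.5 * theta @ A @ theta

def grad_g(theta):
    return A @ theta

theta = np.array([1.0, -2.0])
for gamma in (0.25, 0.5, 1.0):          # any gamma in (0, 1]
    step = theta - (gamma / beta) * grad_g(theta)      # one descent step
    decrease = g(theta) - g(step)
    bound = (gamma / (2.0 * beta)) * grad_g(theta) @ grad_g(theta)
    assert decrease >= bound                           # guaranteed by Lemma 2
```

The guaranteed per-step decrease is exactly the quantity used later to establish the monotonicity of the gradient-based updates.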
Here, the R enyi divergence is defined following the convention from Cichocki and Amari (2010), alternative definitions may use a different scaling factor. A.2.1 Gradient Descent for α-divergence Minimisation As underlined in the introduction, minimising the α-divergence Dα(Q||P) w.r.t q is equivalent to minimising Ψα(q; p) with p = p( , D) w.r.t q, where we have gotten rid of p(D) in the optimisation problem as written in (2). Letting Q be of the form Q = {q : y 7 k(θ, y) : θ T}, the traditional Variational Inference way to optimise Ψα(k(θ, ); p) with p = p( , D) w.r.t θ corresponds to performing Gradient Descent steps on θ to construct a sequence (θn)n 1 that converges towards a local optimum of the function θ 7 Ψα(k(θ, ); p). This procedure involves a well-chosen learning rate policy (rn)n 1 and sets: θn+1 = θn rn Ψα(k(θn, ); p) , n 1 . (79) Monotonic Alpha-divergence Minimisation for Variational Inference Under common differentiability assumptions, Ψα(k(θn, ); p) = Z (θ,y)=(θn,y) p(y)ν(dy) (θ,y)=(θn,y) ν(dy) Y k(θn, y)αp(y)1 α α 1 log k(θ, y) (θ,y)=(θn,y) ν(dy) so that (79) becomes θn+1 = θn rn Y ϕ(α) n (y) α 1 log k(θ, y) (θ,y)=(θn,y) ν(dy) , n 1 . A.2.2 Gradient Descent for R enyi s α-divergence Minimisation Considering yet again Q = {q : y 7 k(θ, y) : θ T}, minimising R enyi s α-divergence D(AR) α (K(θ, )||P) = 1 α(α 1) log Z Y k(θ, y)αp(y|D)1 αν(dy) w.r.t θ can be done by performing Gradient Descent steps on θ to construct a sequence (θn)n 1 that converges towards a local optimum of the function θ 7 D(AR) α (K(θ, )||P). This procedure involves a well-chosen learning rate policy (rn)n 1 and sets: θn+1 = θn rn D(AR) α (K(θn, )||P) , n 1 . 
(80) Under common differentiability assumptions and setting p = p( , D), D(AR) α (K(θn, )||P) = 1 α(α 1) log Z Y k(θn, y)αp(y|D)1 αν(dy) R Y k(θn, y)αp(y, D)1 αν(dy) R Y k(θn, y)αp(y, D)1 αν(dy) Y ˇϕ(α) n (y) α 1 log k(θ, y) (θ,y)=(θn,y) ν(dy) so that (80) becomes θn+1 = θn rn Y ˇϕ(α) n (y) α 1 log k(θ, y) (θ,y)=(θn,y) ν(dy) , n 1 . Appendix B. Deferred Proofs and Results for Section 4 B.1 Proof of Corollary 4 Proof of Corollary 4 Following the proof of Corollary 1, we obtain that (19) holds by using the definition of Θn+1 combined with the fact that α [0, 1), D1(K(θj,n, )||K(θj,n+1, )) 0 Daudel, Douc and Roueff and λj,n > 0 for all j = 1 . . . J. (18) holds by Theorem 3 and we can thus apply Theorem 2. B.2 Extension of Lemma 1 to Mixture Models Lemma 3 (Generalised maximisation approach for the mean-field family) Let each component of the mixture be a member of the same mean-field variational family so that k(θj, y) = QL ℓ=1 k(ℓ)(θ(ℓ) j , y(ℓ)) with θj = (θ(1) j , . . . , θ(L) j ) T. Then, starting from Θ1 T and denoting Θn = (θ1,n, . . . θj,n) with θj,n = (θ(1) j,n, . . . , θ(L) j,n ) for all n 1 and all j = 1 . . . J, solving (25) yields the following update formulas: for all n 1 and all j = 1 . . . J, θ(ℓ) j,n+1 = argmax θ(ℓ) h ϕ(α) j,n(y) + bj,nk(θn, y) i log k(ℓ)(θ(ℓ), y(ℓ)) k(ℓ)(θ(ℓ) j,n, y(ℓ)) ν(dy), ℓ= 1 . . . L . B.3 Proof of Corollary 5 Proof of Corollary 5 Following the proof of Corollary 2, we will use that γj,n (0, 1] and that gj,n is a βj,n-smooth function. Indeed, we can thus apply Lemma 2 and we obtain by definition of θj,n+1 in (26) that 0 = gj,n(θj,n) gj,n(θj,n+1) for all n 1 and all j = 1 . . . J, which in turn implies (19). In addition, (18) holds by Theorem 3, hence we can apply Theorem 2. 
B.4 Gradient Descent for α- / Rényi's α-divergence Minimisation in Section 4

We obtain the desired results by adapting the reasoning from Appendix A.2 to the case

q : y ↦ µλ,Θk(y) = ∑_{j=1}^{J} λj k(θj, y) : Θ ∈ T^J with λ ∈ SJ ,

meaning that we consider the α-divergence (resp. Rényi's α-divergence) between the two absolutely continuous probability measures µλ,ΘK and P:

Dα(µλ,ΘK||P) = 1/(α(α−1)) ∫_Y [ (µλ,Θk(y)/p(y|D))^α − 1 ] p(y|D) ν(dy) ,

D^(AR)_α(µλ,ΘK||P) = 1/(α(α−1)) log ∫_Y (µλ,Θk(y))^α p(y|D)^{1−α} ν(dy) ,

(we use the convention from Cichocki and Amari (2010) for Rényi's α-divergence).

B.5 Monotonicity Property for the Power Descent

B.5.1 Preliminary Remarks

For convenience, we redefine in this section the function b_{µ,α} for all µ ∈ M1(T) by

b_{µ,α}(θ) = ∫_Y k(θ, y) f′α(µk(y)/p(y)) ν(dy) = 1/(α−1) ∫_Y k(θ, y) (µk(y)/p(y))^{α−1} ν(dy) , θ ∈ T .

Then, for all η > 0, the iteration µ ↦ Iα(µ) is well-defined if we have

0 < µ(|b_{µ,α} + κ|^{η/(1−α)}) < ∞ . (81)

Furthermore, Daudel et al. (2021) already established that one transition of the Power Descent algorithm ensures a monotonic decrease in the α-divergence at each step for all η ∈ (0, 1] and all κ such that (α − 1)κ ≥ 0 under the assumption of Proposition 2, which settles the case (iii). Finally, while we establish our results for (i) and (ii) in the general case where µ ∈ M1(T), the particular case of mixture models follows immediately by choosing µ as a weighted sum of Dirac measures.

B.5.2 Extending the Monotonicity

Let (ζ, µ) be a couple of probability measures where ζ is dominated by µ, which we denote by ζ ≪ µ. A first lower-bound for the difference Ψα(µk) − Ψα(ζk) was derived in Daudel et al. (2021) and was used to establish that the Power Descent algorithm diminishes Ψα for all η ∈ (0, 1]. We now prove a novel lower-bound for the difference Ψα(µk) − Ψα(ζk) which will allow us to extend the monotonicity results from Daudel et al. (2021) beyond the case η ∈ (0, 1] when α < 0.
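On a finite mixture µ = Σ_j λj δ_{θj}, one transition µ ↦ Iα(µ) of the Power Descent reweights the components proportionally to λj |b_{µ,α}(θj) + κ|^{η/(1−α)}. The sketch below estimates b_{µ,α} by Monte Carlo for a toy one-dimensional Gaussian kernel, assuming the convention fα(u) = (u^α − 1)/(α(α−1)) so that f′α(u) = u^{α−1}/(α−1); the function name, sampling scheme and settings are our own, not the paper's implementation.

```python
import numpy as np

def power_descent_step(lam, thetas, log_p, alpha=0.5, eta=0.5, kappa=0.0,
                       n_samples=5000, rng=None):
    """One Power Descent transition on the weights of a 1-d Gaussian mixture
    with unit-variance components centred at `thetas` (illustrative sketch).
    b_{mu,alpha}(theta_j) is estimated by importance sampling from mu k."""
    rng = np.random.default_rng() if rng is None else rng
    J = len(lam)
    comp = rng.choice(J, size=n_samples, p=lam)
    y = thetas[comp] + rng.standard_normal(n_samples)   # y ~ mu k
    k = np.exp(-0.5 * (y[None, :] - thetas[:, None]) ** 2) / np.sqrt(2 * np.pi)
    mu_k = lam @ k                                      # mixture density at samples
    fprime = (mu_k / np.exp(log_p(y))) ** (alpha - 1) / (alpha - 1)
    b = (k / mu_k * fprime).mean(axis=1)                # estimate of b_{mu,alpha}(theta_j)
    w = lam * np.abs(b + kappa) ** (eta / (1 - alpha))  # power update
    return w / w.sum()
```

With a target centred on one component, the update shifts mass towards that component, which is the behaviour the monotonicity results guarantee at the level of Ψα.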
This result relies on the existence of an exponent ϱ satisfying condition (A2) below, which will later on be used to specify a range of values for η ensuring that Ψα is decreasing after having applied one transition µ 7 Iα(µ) (A2) We have ϱ R \ [0, 1] and the function fα,ϱ : u 7 fα(u1/ϱ) is non-decreasing and concave on R>0. Proposition 5 Assume (A1). Let α R \ {1}, assume that ϱ satisfies (A2) and let κ be such that (α 1)κ 0. Then, for all µ, ζ M1(T) such that µ(|bµ,α|) < and ζ µ, |ϱ| 1 {µ(|bµ,α + κ|) µ (|bµ,α + κ|gϱ)} Ψα(µk) Ψα(ζk) , (82) where g is the density of ζ wrt µ, i.e. ζ(dθ) = µ(dθ)g(θ). Moreover, equality holds in (82) if and only if ζ = µ. Proof First note that for all α R \ {1}, we have by (A2) that f α,ϱ(u) 0 for all u > 0, and thus that sg(ϱ) = sg(α 1) where sg(v) = 1 if v 0 and 1 otherwise. Since sg(f α(u)) = sg(α 1) = sg(κ) for all u > 0, this implies that ϱ 1f α(u) = |ϱ| 1|f α(u)|, ϱ 1κ = |ϱ 1κ| and finally that ϱ 1(bµ,α(θ) + κ) = |ϱ 1||bµ,α(θ) + κ| for all θ T, which will be used later in the proof. Write by definition of fα,ϱ in (A2) and ζ, ϱ p(y)ν(dy) T µ(dθ)k(θ, y) ϱ p(y)ν(dy) T µ(dθ)k(θ, y) ϱ p(y)ν(dy) (83) Daudel, Douc and Roueff where the last inequality follows from Jensen s inequality applied to the convex function u 7 uϱ (since ϱ R \ [0, 1]) and the fact that fα,ϱ is non-decreasing. Now set T µ(dθ)k(θ, y) and note that uy vy = µk(y) T µ(dθ)k(θ, y) µk(y) gϱ(θ) 1 . (84) Since fα,ϱ is concave, fα,ϱ(uy) fα,ϱ(vy) + f α,ϱ(vy)(uy vy). Combining with (83), we get Y fα,ϱ(uy)p(y)ν(dy) (85) Y fα,ϱ(vy)p(y)ν(dy) + Z Y f α,ϱ(vy)(uy vy)p(y)ν(dy) Note that the first term of the rhs can be written as Z Y fα,ϱ(vy)p(y)ν(dy) = Z p(y)ν(dy) = Ψα(µk) . 
(86) Using now f α,ϱ(vy) = ϱ 1v1/ϱ 1 y f α(v1/ϱ y ) and (84), the second term of the rhs of (85) may be expressed as Z Y f α,ϱ(vy)(uy vy)p(y)ν(dy) T µ(dθ)k(θ, y) µk(y) gϱ(θ) 1 p(y)ν(dy) Y k(θ, y)f α ν(dy) gϱ(θ) = ϱ 1 {µ (bµ,α gϱ) µ(bµ,α)} = |ϱ| 1 {µ (|bµ,α + κ|gϱ) µ(|bµ,α + κ|)} + |ϱ 1κ|(1 µ(gϱ)) , where we have used that ϱ 1(bµ,α(θ) + κ) = |ϱ 1||bµ,α(θ) + κ| for all θ T and that ϱ 1κ = |ϱ 1κ|. In addition, Jensen s inequality applied to the convex function u 7 uϱ implies that µ(gϱ) 1 and thus Z Y f α,ϱ(vy)(uy vy)p(y)ν(dy) |ϱ| 1 {µ (|bµ,α + κ|gϱ) µ(|bµ,α + κ|)} . (87) Monotonic Alpha-divergence Minimisation for Variational Inference Combining this inequality with (85) and (86) finishes the proof of the inequality. Furthermore, if the equality holds in (82), then the equality in Jensen s inequality (87) shows that g is constant µ-a.e. so that ζ = µ, and the proof is completed. Remark 4 The proof of Proposition 5 relies on f α being of constant sign. Notice however that the definition of the α-divergence in (1) is invariant with respect to the transformation fα(u) = fα(u) + κ(u 1) for any arbitrary constant κ, that is fα can be equivalently replaced by fα in (1). This aspect is in fact expressed through the constant κ appearing in the update formula, that we however need to assume to satisfy (α 1)κ 0 in our proofs. We now plan on setting ζ = Iα(µ) in Proposition 5 and obtain that one iteration of the Power Descent yields Ψα(Iα(µ)k) Ψα(µk). For this purpose and based on the upper bound obtained in Proposition 5, we strengthen the condition (81) as follows to take into account the exponent ϱ 0 < µ(|bµ,α + κ| η 1 α ) < and µ(|bµ,α + κ|gϱ) µ(|bµ,α + κ|) with g = |bµ,α+κ| η 1 α µ(|bµ,α+κ| η 1 α ) . (88) This leads to the following result. Proposition 6 Assume (A1). Let α R \ {1}, assume that ϱ satisfies (A2) and let κ be such that (α 1)κ 0. Let µ M1(T) be such that µ(|bµ,α|) < and assume that η satisfies (88). Then, the two following assertions hold. 
(i) We have Ψα(Iα(µ)k) ≤ Ψα(µk).

(ii) We have Ψα(Iα(µ)k) = Ψα(µk) if and only if µ = Iα(µ).

Proof We treat the case κ = 0 in the proof below (the case κ ≠ 0 unfolds similarly). We apply Proposition 5 with ζ = Iα(µ) so that ζ(dθ) = µ(dθ)g(θ) with g = |b_{µ,α}|^{η/(1−α)} / µ(|b_{µ,α}|^{η/(1−α)}). Then,

Ψα(Iα(µ)k) ≤ Ψα(µk) + |ϱ|⁻¹ { µ(|b_{µ,α}| g^ϱ) − µ(|b_{µ,α}|) } ≤ Ψα(µk) , (89)

where the last inequality follows from condition (88). Let us now show (ii). The "if" part is obvious. As for the "only if" part, Ψα(Iα(µ)k) = Ψα(µk) combined with (89) yields

Ψα(Iα(µ)k) = Ψα(µk) + |ϱ|⁻¹ { µ(|b_{µ,α}| g^ϱ) − µ(|b_{µ,α}|) } ,

which is the case of equality in Proposition 5. Therefore, Iα(µ) = µ.

While Proposition 6 resembles (Daudel et al., 2021, Theorem 1) in its formulation and in the properties on the iteration µ ↦ Iα(µ) it establishes, it is important to note that the proof techniques used, and thus the conditions on η obtained, are different. This brings us to the proof of Proposition 2. The proof of this proposition requires intermediate results, which are derived in Appendix B.5.3 alongside the proof of Proposition 2.

B.5.3 Proof of Proposition 2

For the sake of readability, we only treat the case κ = 0 in the proofs below (the case κ ≠ 0 unfolds similarly). In Proposition 5, the difference Ψα(ζk) − Ψα(µk) is split into two terms

Ψα(ζk) − Ψα(µk) = A(µ, ζ) + |ϱ|⁻¹ { µ(|b_{µ,α}| g^ϱ) − µ(|b_{µ,α}|) } ,

where g = dζ/dµ. Moreover, Proposition 5 states that A(µ, ζ) is always non-positive. It turns out that the second term is minimal over all positive probability densities g when g is proportional to |b_{µ,α}|^{1/(1−ϱ)}, as we show in Lemma 4 below.

Lemma 4 Let ϱ ∈ R \ [0, 1]. Then, for any positive probability density g w.r.t. µ, we have

µ(|b_{µ,α}| g^ϱ) ≥ [ µ(|b_{µ,α}|^{1/(1−ϱ)}) ]^{1−ϱ} ,

with equality if and only if g ∝ |b_{µ,α}|^{1/(1−ϱ)}.

Proof The function x ↦ x^{1−ϱ} is strictly convex for ϱ ∈ R \ [0, 1]. Thus Jensen's inequality yields, for any positive probability density g w.r.t.
µ, µ (|bµ,α|gϱ) = Z |bµ,α(θ)|1/(1 ϱ) !1 ϱ g(θ) h µ |bµ,α|1/(1 ϱ) i1 ϱ (90) which finishes the proof of the inequality. The next statement follows from the case of equality in Jensen s inequality: g must be proportional to |bµ,α|1/(1 ϱ). The next lemma shows that this choice leads to a non-positive second term, thus implying that Ψα(ζk) Ψα(µk). Lemma 5 Assume (A1). Let α R \ {1} and assume that ϱ satisfies (A2). Then η = (1 α)/(1 ϱ) satisfies (88) for any µ M1(T) such that µ(|bµ,α|) < . Proof We apply (90) with g = 1 and get that h µ |bµ,α|1/(1 ϱ) i1 ϱ µ(|bµ,α|) < . (91) Then (88) can be readily checked with η = (1 α)/(1 ϱ). Set φ = η/(1 α). Using that µ(|bµ,α|) < when φ < 0 and (A1) for φ > 0, we obtain µ(|bµ,α|φ) > 0, which concludes the proof. While Lemma 5 seems to advocate for g = dζ/dµ to be proportional to |bµ,α|1/(1 ϱ), notice that this choice of g might not be optimal to minimize Ψα(ζk) Ψα(µk), as A(µ, ζ) also depends on g through ζ. In the next lemma, we thus propose another choice of the tuning parameter η, which also satisfies (88) for any µ M1(T) such that µ(|bµ,α|) < . Lemma 6 Assume (A1). Let α R \ {1} and assume that ϱ satisfies (A2). Let µ M1(T) be such that µ(|bµ,α|) < . Further assume that |ϱ| 1, then η = (α 1)/ϱ satisfies (88). Monotonic Alpha-divergence Minimisation for Variational Inference Proof Setting g |bµ,α| 1/ϱ, we get µ(|bµ,α|gϱ) = µ(|bµ,α|1 ϱ/ϱ)[µ(|bµ,α| 1/ϱ)] ϱ = [µ(|bµ,α| 1/ϱ)] ϱ µ(|bµ,α|) where the last inequality follows from Jensen s inequality applied to the convex function u 7 u ϱ (since |ϱ| 1). Since µ(|bµ,α|) < , the parameter η = (α 1)/ϱ satisfies (88). Set φ = η/(1 α). Using that µ(|bµ,α|) < when φ < 0 and (A1) for φ > 0, we obtain µ(|bµ,α|φ) > 0, which concludes the proof. Lemma 5 and Lemma 6 allow us to define a range of values for η that decreases Ψα after one transition of the Power Descent, under the assumption that ϱ satisfies (A2). 
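The Jensen bound of Lemma 4, and its equality case at g ∝ |b_{µ,α}|^{1/(1−ϱ)}, can be verified numerically on a discrete measure. A self-contained sketch with randomly drawn values standing in for µ and |b_{µ,α}| (names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
J = 6
mu = rng.dirichlet(np.ones(J))          # discrete probability measure on J atoms
b = np.abs(rng.normal(size=J)) + 0.1    # stand-in for |b_{mu,alpha}|, bounded away from 0
rho = -2.0                              # any rho in R \ [0, 1]

lower = (mu @ b ** (1 / (1 - rho))) ** (1 - rho)
for _ in range(100):
    g = rng.dirichlet(np.ones(J)) / mu  # positive probability density w.r.t. mu
    assert mu @ (b * g ** rho) >= lower - 1e-12   # Lemma 4 lower bound

# equality at the minimiser g proportional to b^{1/(1-rho)}
g_star = b ** (1 / (1 - rho))
g_star /= mu @ g_star
assert np.isclose(mu @ (b * g_star ** rho), lower)
```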
Now, in order to prove Proposition 2 and given α ∈ R \ {1}, we need to check which values of ϱ satisfy the conditions expressed in (A2).

Proof of Proposition 2 The proof consists in verifying that we can apply Proposition 6, that is, given α ∈ R \ {1}, we must find a range of constants ϱ which satisfy (A2). We then use Lemma 5 or Lemma 6 to deduce that, for the provided constants η, (88) holds.

(i) Assumption (A2) holds for all ϱ < 0, with fα,ϱ(u) = log(u)/ϱ. Moreover, by definition of b_{µ,α}, we get for all n ≥ 1,

µ(|b_{µ,α}|) = ∫_Y µk(y) (p(y)/µk(y)) ν(dy) = ∫_Y p(y) ν(dy) < ∞ .

Combining with Lemma 5 and Lemma 6, (88) holds for all µ ∈ M1(T) and for any η ∈ (0, 1].

(ii) Observing that for α ∉ {0, 1}, fα,ϱ(u) = (u^{α/ϱ} − 1)/(α(α−1)), we get that (A2) holds for ϱ ≤ α if α < 0. Lemmas 5 and 6 provide the corresponding ranges for η in Cases (i) and (ii).

To finish the proof, we now show that for all µ ∈ M1(T), µ(|b_{µ,α}|) is finite, so that Lemmas 5 and 6 can indeed be applied. Since u f′α(u) = α fα(u) + 1/(α−1), we have, for all n ≥ 1,

µ(|b_{µ,α}|) = ∫_Y (µk(y)/p(y)) |f′α(µk(y)/p(y))| p(y) ν(dy) ≤ |α| ∫_Y |fα(µk(y)/p(y))| p(y) ν(dy) + 1/|α−1| ∫_Y p(y) ν(dy) . (92)

Using that Ψα(µk) > −∞ (which is a consequence of (A1) and of Jensen's inequality applied to the convex function u ↦ u fα(1/u)), the r.h.s. is finite if and only if Ψα(µk) is finite, which is what we have assumed, and thus the proof is finished.

Appendix C. Deferred Proofs and Results of Section 6

We start with a useful lemma.

C.1 A Useful Lemma

The following lemma will be used for the proofs of Theorems 4 and 5 and Proposition 3.

Lemma 7 Let k^(o) satisfy (B1). We have, for all ζ, ζ′ ∈ Int E0,

log( k^(o)(ζ′, y) / k^(o)(ζ, y) ) ≥ ⟨S(y) − ∇A(ζ′), ζ′ − ζ⟩_E , with equality if and only if ζ′ = ζ , (93)

∫_Y k^(o)(ζ, y) log( k^(o)(ζ′, y) / k^(o)(ζ, y) ) ν(dy) ≥ ⟨∇A(ζ) − ∇A(ζ′), ζ′ − ζ⟩_E . (94)

Proof Let ζ, ζ′ ∈ Int E0. We have, since Int E0 is convex,

log( k^(o)(ζ′, y) / k^(o)(ζ, y) ) = ⟨S(y), ζ′ − ζ⟩_E + A(ζ) − A(ζ′)
= ⟨S(y) − ∇A(ζ′), ζ′ − ζ⟩_E + (ζ − ζ′)ᵀ [ ∫₀¹ (1 − t) ∇ᵀ∇A(t ζ + (1 − t) ζ′) dt ] (ζ − ζ′) .

The inequality and the equality case of (93) then follow from the fact that the Hessian ∇ᵀ∇A is positive definite in Int E0.
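For the Gaussian family with unit variance, where the natural parameter is ζ = m, S(y) = y and A(ζ) = ζ²/2, the inequality (93) of Lemma 7 can be checked in closed form: the remainder term is exactly (ζ′ − ζ)²/2. A toy numerical sketch of our own:

```python
import numpy as np

def log_k(zeta, y):
    # k^(o)(zeta, y) = N(y; zeta, 1): h(y) = exp(-y^2/2)/sqrt(2*pi),
    # S(y) = y, A(zeta) = zeta^2 / 2
    return -0.5 * y ** 2 - 0.5 * np.log(2 * np.pi) + y * zeta - 0.5 * zeta ** 2

rng = np.random.default_rng(2)
for _ in range(1000):
    zeta, zeta_p, y = rng.normal(scale=3.0, size=3)
    lhs = log_k(zeta_p, y) - log_k(zeta, y)
    rhs = (y - zeta_p) * (zeta_p - zeta)   # <S(y) - grad A(zeta'), zeta' - zeta>
    # the remainder of the Taylor expansion is (zeta' - zeta)^2 / 2 >= 0
    assert np.isclose(lhs - rhs, 0.5 * (zeta_p - zeta) ** 2)
```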
Using (93) and (32), we further obtain the lower bound displayed in (94) on the negated exclusive Kullback-Leibler divergence between the two (absolutely continuous w.r.t. ν) probability distributions with probability density functions k(o)(ζ, ) and k(o)(ζ , ) respectively. C.2 Proofs of Section 6.2 The following lemma will be useful. Lemma 8 Let let p and g be functions measurable from (Y, Y) to the Borel sets of R+ such that 0 < Z Y p(y) (1 + g(y))1/(1 α) ν(dy) < . (95) Let α [0, 1) and k : T Y R+ be a positive kernel density with respect to ν. Let θ0 T and µ be a probability on T such that inf y Y k(θ0, y) µk(y) > 0 . (96) Then, setting ϕ(y) = k(θ0, y)α p(y) Monotonic Alpha-divergence Minimisation for Variational Inference we have R ϕ dν > 0 and the probability density function ˇϕ = ϕ/ ( R ϕ dν) satisfies Z ˇϕ(y) g(y) ν(dy) < . (97) Proof We have µk > 0 by assumption on k and µ. Thus ϕ is well defined and R ϕ dν > 0 as a consequence of p 0 with the left-hand side of (95). We now check (97) in which ˇϕ can equivalently be replaced by ϕ. We have Z ϕ(y) g(y) ν(dy) = Z k(θ0, y) p(y) 1 α g(y) ν(dy) Z k(θ0, y) p(y) µk(y) g(y)1/(1 α) ν(dy) α , by Jensen s inequality and concavity of x 7 x1 α for α [0, 1). Hence we obtain (97) as a consequence of (95) and (96). Proof of Theorem 4 Let ζ0 O and b 0. In the following, we denote by C0(ζ) the value of the integral in the argmax in (35). Using that ˇϕ is a probability density function paired up with (32) and (34), we have Z h ˇϕ(y) + b k(o)(ζ0, y) i ν(dy) = 1 + b > 0 , Z h ˇϕ(y) + b k(o)(ζ0, y) i S(y)ν(dy) = Z Y ˇϕ S dν + b A(ζ0) E . By (93) in Lemma 7, for all ζ = ζ in O and all y Y, we have k(o)(ζ , y) > S(y) A(ζ ), ζ ζ Hence, combining with the three previous identities, we get that, for all ζ = ζ in O, C0(ζ ) C0(ζ) = Z h ˇϕ(y) + b k(o)(ζ0, y) i log k(o)(ζ , y) Y ˇϕ S dν + b A(ζ0) (1 + b) A(ζ ), ζ ζ E = (1 + b) s A(ζ ), ζ ζ where we used the definition of s in (36). 
Now, since A is a C -diffeomorphism on Int E0, if s belongs to A(O) then there exists a unique ζ O such that A(ζ ) = s . As a result, plugging this into (98) and setting ζ = ζ , we have that C0(ζ ) > C0(ζ) for all ζ = ζ . This shows that ζ is the (unique) solution to the argmax problem (35). We conclude this proof with the proof of the reciprocal implication. Suppose that the argmax problem (35) has a solution ζ O so that, for all ζ O, C0(ζ ) C0(ζ) 0. Using (98), this would imply s A(ζ ), ζ ζ < 0 for all ζ O \ {ζ}. Since O is an open set, we can take ζ = ζ + ϵ (s A(ζ)) with ϵ > 0 small enough, which gives s A(ζ ), s A(ζ) < 0 . Daudel, Douc and Roueff Now letting ϵ 0 and since A is continuous, we get that s A(ζ) = 0 and thus that s belongs to A(O). Proof of Corollary 6 First note that letting ν be the d-dimensional Lebesgue measure and setting S(y) = (y, yy T /2), the condition (34) written in Theorem 4 takes the form Z Y y 2 ˇϕ(y) dy < . (99) Furthermore, setting υ(θ) = (Σ 1m, Σ 1) with θ = (m, Σ), we obtain that υ(T) = Rd M>0(d), which is the whole interior set Int E0, seen as an open set of E = Rd Rd d. Hence Theorem 4 applies and we can use the interpretation (38) of the equation (37) with k(θ , y) = N(y; m , Σ ) and k(θ0, y) = N(y; m0, Σ0), that is Z Y S(y) N(y; m , Σ )ν(dy) = Z Y S(y)ψ(y)ν(dy) , (100) where ψ is the mixture ψ(y) = 1 1 + b ˇϕ(y) + b 1 + b N(y; m0, Σ0) , y Y . Using now that S(y) = (y, yy T /2), (100) is equivalent to saying that the distributions on both sides of this equation have the same mean and covariance matrix. Since the left-hand side distribution is a Gaussian distribution, we directly obtain that the solution of (100) is given by θ = (m , Σ ) with m and Σ being the mean and covariance matrix of the random variable Y with density ψ. Observe that these mean and covariance matrix are well-defined due to the condition (99). 
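The moment-matching solution above, where (m*, Σ*) are the mean and covariance of the mixture ψ = (1/(1+b)) φ̌ + (b/(1+b)) N(m0, Σ0), can be sketched with samples. This is an illustrative sample-based sketch with our own function name and inputs, not the paper's implementation:

```python
import numpy as np

def gaussian_update(samples_phi, m0, Sigma0, b, rng=None):
    """Moment-matching update of Corollary 6: the new Gaussian component
    (m*, Sigma*) takes the mean and covariance of the mixture
    psi = (1/(1+b)) phi_check + (b/(1+b)) N(m0, Sigma0),
    here estimated from samples of phi_check pooled with b * n extra
    Gaussian samples (so the two parts carry weights 1/(1+b) and b/(1+b))."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = samples_phi.shape
    n0 = int(round(b * n))                       # samples from the Gaussian part
    extra = rng.multivariate_normal(m0, Sigma0, size=n0)
    pooled = np.vstack([samples_phi, extra])
    m_star = pooled.mean(axis=0)
    Sigma_star = np.cov(pooled, rowvar=False)
    return m_star, Sigma_star
```

With b = 0 this reduces to matching the moments of φ̌ alone, which is the case covered by the non-degeneracy condition on φ̌.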
This means that we only have to check that the covariance matrix is positive definite in order to have a solution in T = Rd × M>0(d). If b > 0, then 1/(1 + b) < 1 and the fact that θ ∈ T guarantees that the covariance matrix of Y is positive definite. If b = 0, then Y has density ˇϕ and the non-degenerate condition on ˇϕ guarantees that the covariance is also positive definite in this case. This concludes the proof.

Proof of Corollary 7 Recall from (16) that ϕ^(α)_{j,n}(y) = k(θ_{j,n}, y) (p(y)/µn k(y))^{1−α} with k(θ_{j,n}, y) > 0. Then Lemma 8 gives that

∫_Y (1 + ∥S(y)∥_E) ϕ^(α)_{j,n}(y) ν(dy) < ∞ . (101)

We can thus apply Theorem 4 with ζ0 = υ(θ_{j,n}), O = υ(T), b = b_{j,n} / ∫_Y ϕ^(α)_{j,n} dν and ˇϕ = ϕ^(α)_{j,n} / ∫_Y ϕ^(α)_{j,n} dν, which satisfies (34).

C.3 Proofs of Section 6.3

Proof of Proposition 3 Following the proof of Theorem 4 but with b = 0, we have using (98) that: for all ζ, ζ′ ∈ Int E0,

⟨∇A(ζ) − ∫_Y S ˇϕ dν, ζ′ − ζ⟩_E ≤ g^(o)(ζ′) − g^(o)(ζ) ≤ ⟨∇A(ζ′) − ∫_Y S ˇϕ dν, ζ′ − ζ⟩_E .

Since ∇A is continuous in Int E0, we get that g^(o) is C¹ over Int E0 and, for all ζ ∈ Int E0,

∇g^(o)(ζ) = ∇A(ζ) − ∫_Y S ˇϕ dν .

The β0-smoothness condition follows from the mean value theorem applied to ∇g^(o), whose Jacobian is the Hessian ∇ᵀ∇A.

Proof of Corollary 8 The context is similar to the proof of Corollary 6 except that the covariance matrix is known, and thus the sufficient statistic reduces to S(y) = y and the canonical parameter to υ(θ) = Σ⁻¹θ. The given expressions of h and A follow. Then we have, for all ζ ∈ Rd, ∇A(ζ) = Σζ. Applying Proposition 3, we get that

∇g^(o)(ζ) = Σζ − ∫_Y y ˇϕ(y) dy .

This gives that g^(o) is β0-smooth over Rd with β0 equal to the maximal eigenvalue of Σ and leads to the gradient step

ζ_{n+1} = ζ_n − (γ/β0) ( Σζ_n − ∫_Y y ˇϕ(y) dy ) .

Multiplying on both sides by Σ leads to the gradient-based update (46). We now derive (47). Using that g = g^(o) ∘ υ, we get that ∇g(θ) = Σ⁻¹ ∇g^(o)(Σ⁻¹θ) and thus

∇g(θ) = Σ⁻¹ ( θ − ∫_Y y ˇϕ(y) dy ) .
We obtain that g is β-smooth over Rd with β equal to the maximal eigenvalue of Σ 1, which leads to the gradient update (47). C.4 Proofs of Section 6.4 C.4.1 Proofs of Preliminary Results It will be convenient to use the shorthand notation: for all (ℓ, y, ξ) Z Y F, kξ(ℓ, y) := k(o)(ℓ(ξ), y) , ˇkξ,τ(ℓ, y) := kξ(ℓ, y) where τkξ denotes the mapping y 7 R kξ(ℓ, y) τ(dℓ). Daudel, Douc and Roueff Proof of Proposition 4 We have, for all τ0, τ M, ξ0, ξ F and y Y, using the definitions above and that of Dτ,τ0, τ0kξ0(y) = log Z kξ(ℓ, y) kξ0(ℓ, y) ! ˇkξ0,τ0(ℓ, y) τ(dℓ) kξ(ℓ, y) kξ0(ℓ, y) Dτ,τ0(ℓ) ! ˇkξ0,τ0(ℓ, y) τ0(dℓ) Z ˇkξ0,τ0(ℓ, y) log kξ(ℓ, y) kξ0(ℓ, y) Dτ,τ0(ℓ) where we used the concavity of the log function and the fact that ˇkξ0,τ0( , y) is a probability density function with respect to τ0. Let ξ = Υ(ϑ) and ξ0 = Υ(ϑ0) and set θ = (ϑ, τ) and θ0 = (ϑ0, τ0), so that, by Definition 2, we have τkξ(y) = k(1)(θ, y) and τ0kξ0(y) = k(1)(θ0, y). Applying the previous bound and integrating w.r.t. ˇϕ(y)ν(dy) thus yields Y ˇϕ(y) log k(1)(θ, y) k(1)(θ0, y) Z ϕ(ℓ, y) log kξ(ℓ, y) kξ0(ℓ, y) Dτ,τ0(ℓ) ν(dy) , (102) where we used that ˇϕ(y)ˇkξ0,τ0(ℓ, y) = ϕ(ℓ, y) as a consequence of (52). We conclude by noting that (50), (51) and (102) imply (49). Proof of Corollary 9 Suppose that Υ(ϑ) belongs to the argmax (54), and let us show that (51) holds. Note that the objective function in the argmax is zero for ξ = ξ0 and thus must be non-negative for ξ = Υ(ϑ) in the argmax. Note also that the left-hand side of (51) is the sum of this objective function at ξ = Υ(ϑ) with the term Z b0(ℓ) kξ0(ℓ, y) log kΥ(ϑ)(ℓ, y) ν(dy) , (103) and it only remains to show that this term is also non-negative. To conclude, we observe that, for τ0-a.e. 
ℓ Z, Condition (ii) of Definition 1 implies that kξ0(ℓ, ) and kΥ(ϑ)(ℓ, ) are probability densities, so that Y kξ0(ℓ, y) log kΥ(ϑ)(ℓ, y) and thus, b0 being non-negative, Y b0(ℓ) kξ0(ℓ, y) log kΥ(ϑ)(ℓ, y) Using that log+(x) x for all x 0, the positive part inside this double integral is at most b0(ℓ) kΥ(ϑ)(ℓ, y), whose integral w.r.t. ν(dy)τ0(dℓ) is R b0 dτ0 < by assumption. Therefore, Monotonic Alpha-divergence Minimisation for Variational Inference the same inequality holds when reversing the order of integration in the previous display. We get that the expression in (103) is non-negative, which concludes the proof. The following lemma is used to illustrate (C1). Lemma 9 Consider an ELM family satisfying (C1). Then, for all ϑ T, τ M and τ-a.e. ℓ, ℓ Υ( T) is an open subset included in Int E0. Proof In (C1), the linear operator ℓ Z being full rank implies that the image of an open set is an open subset of E. Indeed, take F F a linear supplementary space of the kernel of ℓ. Then ℓ|F is a bijective linear mapping from F to E and thus, for any open set O included in F, ℓ(O) = ℓ(O F ) = ℓ|F (O F ) is an open subset of E since O F is an open subset of F and ℓ|F is bicontinuous. Hence (C1) implies that ℓ Υ( T) is an open subset of E for all τ M and τ-almost all ℓ Z, which, by Assumption (ii) from Definition 2, must be included in Int E0. We conclude with the following lemma, which relates Assumption (C4) to a kind of identifiability property. Lemma 10 Consider an ELM family and let Pϑ,τ denote the probability of (Y, L) defined on Y Z as above. Then (C4) is equivalent to have that for all (ϑ, τ), (ϑ , τ ) T M, Pϑ,τ = Pϑ ,τ if and only if Υ(ϑ) = Υ(ϑ ) and τ = τ . Proof Let (ϑ, τ), (ϑ , τ ) T M. Note that Υ(ϑ) = Υ(ϑ ) and τ = τ is always sufficient to have Pϑ,τ = Pϑ ,τ . 
Also, since τ is the marginal over Z of Pϑ,τ and k(o)(ℓ Υ(ϑ), ) is the conditional density of Y given L = ℓwith respect to ν, it is clear that Pϑ,τ = Pϑ ,τ is equivalent to have τ = τ and k(o)(ℓ Υ(ϑ), y) = k(o)(ℓ Υ(ϑ ), y) for ν-a.e. y and τ-a.e. ℓ. (104) By (B1), which is assumed for an ELM family in Definition 2, for all ζ, ζ E0, we have k(o)(ζ, y) = k(o)(ζ , y) for ν-a.e. y if and only if ζ = ζ . By (ii) in Definition 2, we have ℓ Υ(ϑ) E0 for all ϑ T, τ M and τ-a.e. ℓ. Thus, for all (ϑ, ϑ , τ) T2 M, Eq. (104) is equivalent to have ℓ Υ(ϑ) = ℓ Υ(ϑ ) for τ-a.e. ℓ. (105) Hence, to prove the result, we only need to show that (C4) is equivalent to have that, for all (ϑ, ϑ , τ) T2 M, Eq. (105) implies Υ(ϑ) = Υ(ϑ ). In fact (C4) exactly says that for all ϑ, ϑ T, Υ(ϑ) = Υ(ϑ ) implies the opposite of (105). C.4.2 Proof of Theorem 5 The proof of Theorem 5 relies on a general proposition (Proposition 7) and two technical lemmas (Lemmas 11 and 12). The following definition will be useful to state the proposition. Definition 4 (Directional continuity) Let g be defined on an open subset O of the Euclidean space F and valued in a topological space. Let ξ O and x F such that x F = 1. We say that g is continuous at ξ in direction x if lim t 0 g(ξ + tx) = g(ξ). Daudel, Douc and Roueff Proposition 7 Consider an ELM family as in Definition 2 and Υ satisfying (C1). Let ˇϕ be a probability density function w.r.t. ν. Let τ0 M and ϕ : Z Y R+ be measurable function such that Z Y Z ϕ(ℓ, y) ℓ op S(y) E ν(dy)τ0(dℓ) < (106) Z Z ϕ(ℓ) ℓ (ξ) F τ0(dℓ) < for all ξ Υ( T) , (107) where we set ϕ(ℓ, y) and ϕ(ℓ) as in (52) and (59). Define w τ0 : Υ( T) F as in (60). Let b R+ and ξ0 in Υ( T) and set w as in (63). Then the reciprocal set Ξ = n ξ Υ( T) : w τ0(ξ) = w o is included in the set of solutions of the argmax problem argmax ξ Υ( T) h ϕ(ℓ, y) + b ϕ(ℓ) k(o)(ℓ(ξ0), y) i log k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) ν(dy) . (108) Moreover the two following assertions hold. 
(i) If w τ0 is continuous on Υ( T) in every direction, the opposite inclusion is true, namely, the solution set of the argmax problem (108) is included in Ξ . (ii) If we have, for all ξ = ξ Υ( T), Z {ℓ Z : ℓ(ξ ξ ) =0} ϕ(ℓ) τ0(dℓ) > 0 , (109) then ξ 7 w τ0(ξ) is one-to-one on Υ( T) and thus Ξ is either a singleton or empty. Proof In this proof, we take b 0 and ξ0 Υ( T), and we denote the integral defining the objective function in the argmax problem (108) by C(ξ). A first remark is that, by (106), R ϕ(ℓ, y) ℓT S(y) ν(dy)τ0(dℓ) is well-defined in F and w τ0(ξ) in (60) is well-defined in F for all ξ Υ( T) as a consequence of (107). In particular we get that w in (63) is well-defined in F (but does not necessarily belong to w τ0 Υ( T)). Next we show that the objective function C(ξ) is well defined and finite for all ξ Υ( T) and that Fubini s theorem applies (so that the double integral in (108) can be computed in the opposite order or as a single integral over Y Z w.r.t. the product measure ν τ0). To this end, we show successively that Z Y Z ϕ(ℓ, y) k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) ! ν(dy)τ0(dℓ) < , (110) Y Z ϕ(ℓ) k(o)(ℓ(ξ0), y) log+ k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) ν(dy)τ0(dℓ) < , (111) Y k(o)(ℓ(ξ0), y) log k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) τ0(dℓ) 0 , (112) Y Z ϕ(ℓ) k(o)(ℓ(ξ0), y) k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) ! ν(dy)τ0(dℓ) < . (113) Monotonic Alpha-divergence Minimisation for Variational Inference Let us now show that (110) holds. Let ℓ Z such that ℓ Υ( T) Int E0, which by Condition (ii) and (C1), is true for τ0-a.e. ℓ Z. Applying (93) in Lemma 7 with (ζ, ζ ) = (ℓ(ξ0), ℓ(ξ)) as well as (ζ, ζ ) = (ℓ(ξ), ℓ(ξ0)) and using the definition of ℓT in (55) and that of ℓ in (56), we get that D ℓT S(y) ℓ (ξ), ξ ξ0 E k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) D ℓT S(y) ℓ (ξ0), ξ ξ0 E It follows that, for all ξ0, ξ Υ( T), log k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) ! ℓT S(y) F + max( ℓ (ξ0) F , ℓ (ξ) F ) ξ ξ0 F . (114) Using ℓT S(y) F ℓT op S(y) E = ℓ op S(y) E, (106) and (107), we obtain (110). 
To show (111), we use the bound log+(x) x for all x 0 so that the left-hand side of (111) is upper-bounded by R Y Z ϕ(ℓ) k(o)(ℓ(ξ), y) ν(dy)τ0(dℓ) = R Z ϕ(ℓ) τ0(dℓ) = 1. Next, to show (112), we observe that, as a consequence of (94), (55) and (56), we have Y k(o)(ℓ(ξ), y) log k(o)(ℓ(ξ), y) k(o)(ℓ(ξ0), y) ν(dy) D ℓ (ξ0) ℓ (ξ), ξ ξ0 E ℓ (ξ0) ℓ (ξ) F ξ ξ0 F . Hence, using (107), we get (112). Finally, Tonelli s theorem gives us that the same positive part as in (111) has a finite integral when integrating in the same order as in (112), so that (111) and (112) imply (113). We now prove the inclusion of Ξ in the set of argmax solutions. Since Fubini s theorem applies in the definition of the objective function, we can write, for all ξ, ξ Υ( T), C(ξ ) C(ξ) = Z h ϕ(ℓ, y) + b ϕ(ℓ) k(o)(ℓ(ξ0), y) i log k(o)(ℓ(ξ ), y) k(o)(ℓ(ξ), y) ν(dy)τ0(dℓ) . Using (93) in Lemma 7 this time with ζ = ℓ(ξ) and ζ = ℓ(ξ ), we obtain that C(ξ ) C(ξ) Z h ϕ(ℓ, y) + b ϕ(ℓ)k(o)(ℓ(ξ0), y) i D ℓT S(y) ℓ (ξ ), ξ ξ E E ν(dy)τ0(dℓ). Using (106), (107) to integrate in any convenient order, we have Z Y Z ϕ(ℓ, y) D ℓT S(y) ℓ (ξ ), ξ ξ E E ν(dy)τ0(dℓ) Y Z ϕ(ℓ, y) ℓT S(y) ν(dy)τ0(dℓ) w τ0(ξ ), ξ ξ where we have used the definition of w τ0 in (60), and Y Z ϕ(ℓ) k(o)(ℓ(ξ0), y) D ℓT S(y) ℓ (ξ ), ξ ξ E E ν(dy)τ0(dℓ) = D w τ0(ξ0) w τ0(ξ ), ξ ξ E Daudel, Douc and Roueff where we have used once again the definition of w τ0 in (60) as well as the expression of A given in (32). Plugging these two equalities in the previous inequality, we finally get that, for all ξ, ξ Υ( T), C(ξ ) C(ξ) Z Y Z ϕ(ℓ, y) ℓT S(y) ν(dy)τ0(dℓ) (1 + b) w τ0(ξ ) + b w τ0(ξ0), ξ ξ E = (1 + b) D w w τ0(ξ ), ξ ξ E Hence any ξ Ξ must be a a solution of the argmax (108). Let us now prove Assertion (i). We use again (115) this time with ξ in the set of the argmax solutions and ξ in the neighborhood of ξ (remember that Υ( T) is assumed to be open) such that ξ ξ = ϵ w w τ0(ξ) with ϵ > 0 small enough. 
We obtain that D w w τ0(ξ ), w w τ0(ξ) E The assumption that w τ0 is continuous at ξ in direction w w τ0(ξ) gives us that w = w τ0(ξ), that is, ξ Ξ and the opposite inclusion is shown. We conclude with the proof of Assertion (ii). By (60), for all ξ, ξ Υ( T), we can write w τ0(ξ) w τ0(ξ ) = Z Z ϕ(ℓ) ℓ (ξ) ℓ (ξ ) τ0(dℓ) . Since A has a positive definite Jacobian on the convex set Int E0, we have, for all ζ = ζ Int E0, A(ζ) A(ζ ), ζ ζ E > 0. Letting ξ = ξ Υ( T) and ℓ Z be such that ℓ(ξ), ℓ(ξ ) Int E0 and ℓ(ξ ξ ) = 0, we can apply the previous inequality with ζ = ℓ(ξ) and ζ = ℓ(ξ ) leading to D ℓ (ξ) ℓ (ξ ), ξ ξ E Combining this result with the condition (109) gives us that for all ξ = ξ Υ( T) D w τ0(ξ) w τ0(ξ ), ξ ξ E In particular, w τ0 is one-to-one and the proof is concluded. Let d F denote the dimension of F and let us now introduce and prove two key lemmas. Lemma 11 Let ξ Υ( T) and let x F be such that x F = 1. Let us denote, for any ϵ > 0, BF (ξ, ϵ) = ξ F : ξ ξ F < ϵ , Dx(ξ, ϵ) = {ξ + t ϵ x : 0 t < 1} . Let ℓ Z be such that ℓ Υ( T) Int E0. Then the two following assertions hold. (i) For any 0 < ϵ ϵ < ϵ such that Dx(ξ, ϵ ) Υ( T), we have sup ξ Dx(ξ,ϵ) A ℓ(ξ ) A ℓ(ξ) ϵ max ℓ (ξ) F , ℓ (ξ + ϵ x) F Monotonic Alpha-divergence Minimisation for Variational Inference (ii) For any ϵ > 0 such that BF ξ, 2 1 + d F ϵ Υ( T), we have, for all ϵ0 (0, ϵ], Y eϵ0 ℓT S(y) F sup ξ Dx(ξ,ϵ0) k(o)(ℓ(ξ ), y) ν(dy) 2 d F max ξ x,ϵ(ξ) eϵ 0 ℓ (ξ ) F , (117) where ϵ 0 = 1 + 2 d F ϵ0 and x,ϵ(ξ) is a set of 2(d F + 1) points in Bx(ξ, 2 1 + d F ϵ) only depending on ξ, x and ϵ. Proof We prove Assertions (i) and (ii) successively. Proof of Assertion (i) Let 0 < ϵ ϵ < ϵ be such that Dx(ξ, ϵ ) Υ( T). 
First note that $t \mapsto A \circ \ell(\xi + t x)$ has its first and second derivatives on $t \in [0, \epsilon']$ given respectively by
\[
h_1 : t \mapsto \left\langle \nabla A(\ell(\xi + t x)),\, \ell(x) \right\rangle_E = \left\langle \nabla(A \circ \ell)(\xi + t x),\, x \right\rangle_F ,
\]
\[
h_2 : t \mapsto \left\langle \left[ \nabla^2 A(\ell(\xi + t x)) \right] \ell(x),\, \ell(x) \right\rangle_E .
\]
Since $\nabla^2 A$ is positive definite on $\mathrm{Int}\, E_0$, $h_2$ is non-negative, thus $h_1$ is monotonous on $[0, \epsilon']$ and its maximal absolute value is attained at $t = 0$ or $t = \epsilon'$, meaning that $|h_1(t)| \le \max(|h_1(0)|, |h_1(\epsilon')|)$ for all $t \in [0, \epsilon']$. Therefore, the bound in (116) follows from the mean value theorem combined with the fact that, for all $t \in [0, \epsilon']$ and $\xi' \in D_x(\xi, \epsilon_0)$,
\[
|h_1(t)| \le \left\| \nabla(A \circ \ell)(\xi + t x) \right\|_F \|x\|_F = \left\| \nabla(A \circ \ell)(\xi + t x) \right\|_F
\quad \text{and} \quad
\|\xi' - \xi\|_F \le \epsilon_0 \|x\|_F = \epsilon_0 ,
\]
since $\|x\|_F = 1$ by assumption.

Proof of Assertion (ii) Take $0 < \epsilon_0 \le \epsilon$ such that $B_F(\xi, 2(1 + \sqrt{d_F})\epsilon) \subseteq \Upsilon(T)$, let $\epsilon'_0$ be as stated and set $\epsilon' = (1 + 2\sqrt{d_F})\,\epsilon$. Let $x_1, \dots, x_{d_F}$ be an orthonormal basis of $F$ so that, for any $w \in E$, we have
\[
\left\| \ell^T(w) \right\|_F \le \sqrt{d_F} \max_{k, s}\, s \left\langle \ell^T(w),\, x_k \right\rangle_F , \quad (118)
\]
where, here and in the following, $\max_{k, s}$ denotes the maximum over $s \in \{\pm 1\}$ and $k = 1, \dots, d_F$. By definition of $k^{(o)}$ in Definition 1, we also have, for all $\xi' \in D_x(\xi, \epsilon_0)$,
\[
e^{\epsilon_0 \|\ell^T S(y)\|_F}\, k^{(o)}(\ell(\xi'), y)
= h(y)\, e^{\epsilon_0 \|\ell^T S(y)\|_F + \langle S(y),\, \ell(\xi') \rangle_E - A \circ \ell(\xi')}
= h(y)\, e^{\epsilon_0 \|\ell^T S(y)\|_F + \langle S(y),\, \ell(\xi' - \xi) \rangle_E + \langle S(y),\, \ell(\xi) \rangle_E - A \circ \ell(\xi')} .
\]
Now using that
\[
\left| \left\langle S(y),\, \ell(\xi' - \xi) \right\rangle_E \right| = \left| \left\langle \ell^T S(y),\, \xi' - \xi \right\rangle_F \right| \le \epsilon_0 \left\| \ell^T S(y) \right\|_F
\]
(since $\|x\|_F = 1$ and $\xi' \in D_x(\xi, \epsilon_0)$), paired up with (118), we can deduce that
\[
e^{\epsilon_0 \|\ell^T S(y)\|_F}\, k^{(o)}(\ell(\xi'), y)
\le h(y)\, e^{2 \epsilon_0 \|\ell^T S(y)\|_F + \langle S(y),\, \ell(\xi) \rangle_E - A \circ \ell(\xi')}
\le \max_{k, s} h(y) \exp\left( \left\langle S(y),\, \ell(\xi + 2 s \sqrt{d_F}\, \epsilon_0\, x_k) \right\rangle_E - A \circ \ell(\xi') \right)
= \max_{k, s} k^{(o)}(\ell(\xi_{s,k}), y)\, e^{A \circ \ell(\xi_{s,k}) - A \circ \ell(\xi')} , \quad (119)
\]
where $\xi_{s,k} = \xi + 2 s \sqrt{d_F}\, \epsilon_0\, x_k$. Note that we took $\epsilon_0$ small enough so that, for all $s, k$,
\[
D_x(\xi, \epsilon' + \epsilon) \cup D_{s x_k}(\xi, \epsilon' + \epsilon) \subseteq B_F\left( \xi,\, 2(1 + \sqrt{d_F})\epsilon \right) \subseteq \Upsilon(T) ,
\]
and that $\xi'$ and $\xi_{s,k}$ are in $D_x(\xi, \epsilon_0)$ and $D_{s x_k}(\xi, 2\sqrt{d_F}\, \epsilon_0)$, respectively.
Using (116) of Assertion (i) twice, with $\epsilon_0 \le \epsilon' < \epsilon$ successively replaced by $2\sqrt{d_F}\,\epsilon_0 \le \epsilon' \le 2(1 + \sqrt{d_F})\epsilon$ and $\epsilon_0 \le \epsilon' \le 2(1 + \sqrt{d_F})\epsilon$, we get that
\[
\left| A \circ \ell(\xi_{s,k}) - A \circ \ell(\xi') \right|
\le \left| A \circ \ell(\xi_{s,k}) - A \circ \ell(\xi) \right| + \left| A \circ \ell(\xi) - A \circ \ell(\xi') \right|
\]
\[
\le 2\sqrt{d_F}\,\epsilon_0 \max\left( \left\| \nabla(A \circ \ell)(\xi) \right\|_F ,\, \left\| \nabla(A \circ \ell)(\xi + \epsilon' s x_k) \right\|_F \right)
+ \epsilon_0 \max\left( \left\| \nabla(A \circ \ell)(\xi) \right\|_F ,\, \left\| \nabla(A \circ \ell)(\xi + \epsilon' x) \right\|_F \right)
\le \epsilon'_0\, C_{\xi, x}(\ell) ,
\]
where the last line follows by setting
\[
C_{\xi, x}(\ell) = \max\left\{ \left\| \nabla(A \circ \ell)(\xi + \epsilon' x'') \right\|_F : x'' \in \{0, x, s x_k\},\ \text{with } s \in \{\pm 1\},\ k = 1, \dots, d_F \right\} .
\]
Combining this inequality with (119) then yields, for all $\xi' \in D_x(\xi, \epsilon_0)$,
\[
e^{\epsilon_0 \|\ell^T S(y)\|_F}\, k^{(o)}(\ell(\xi'), y)
\le e^{\epsilon'_0\, C_{\xi, x}(\ell)} \max_{k, s} k^{(o)}(\ell(\xi_{s,k}), y)
\le e^{\epsilon'_0\, C_{\xi, x}(\ell)} \sum_{k, s} k^{(o)}(\ell(\xi_{s,k}), y) .
\]
Observe that the right-hand side of the previous display does not depend on $\xi'$ and that the summands in the sum (which has $2 d_F$ terms) all have integrals with respect to $\nu$ equal to 1. The bound (117) follows and Assertion (ii) is proved.

Lemma 12 Consider an ELM family as in Definition 2. Assume (C1) and (C3). Let $\vartheta_0 \in T$, $\tau_0 \in M$ and $\check\varphi : Y \to \mathbb R_+$ be a density function such that
\[
\int_Y \check\varphi(y)\, e^{s_{\vartheta_0, \tau_0}(y)}\, \nu(\mathrm dy) < \infty . \quad (120)
\]
Define $\varphi(\ell, y)$, $\bar\varphi(\ell)$ and $w_{\tau_0}$ as in (52), (59) and (60). Then, for all $\xi \in \Upsilon(T)$ and $x \in F$ such that $\|x\|_F = 1$, there exists $\epsilon_0 > 0$ such that
\[
\int_Z \int_Y \left\| \ell^T S(y) \right\|_F \sup_{\xi' \in D_x(\xi, \epsilon_0)} k^{(o)}(\ell(\xi'), y)\, \bar\varphi(\ell)\, \nu(\mathrm dy)\tau_0(\mathrm d\ell) < \infty . \quad (121)
\]
Moreover, $w_{\tau_0}$ is continuous on $\Upsilon(T)$ in every direction and, for all $\vartheta \in T$, it holds that
\[
w_{\tau_0}(\Upsilon(\vartheta)) = \mathbb E_{\vartheta, \tau_0}\left[ L^T S(Y)\, \bar\varphi(L) \right] . \quad (122)
\]
Proof Let $\ell \in Z$ be such that $\ell(\Upsilon(T)) \subseteq \mathrm{Int}\, E_0$, $\xi \in \Upsilon(T)$ and $x \in F$ be such that $\|x\|_F = 1$. Then, for $\epsilon > 0$ small enough, we have $B_F(\xi, 2(1 + \sqrt{d_F})\epsilon) \subseteq \Upsilon(T)$ and, for all $\epsilon_0 \in (0, \epsilon]$,
\[
\int_Y e^{\epsilon_0 \|\ell^T S(y)\|_F} \sup_{\xi' \in D_x(\xi, \epsilon_0)} k^{(o)}(\ell(\xi'), y)\, \nu(\mathrm dy)
\le 2\, d_F \sum_{\xi'' \in \Delta} e^{\epsilon'_0 \|\nabla(A \circ \ell)(\xi'')\|_F} ,
\]
where we used Assertion (ii) of Lemma 11 with $\epsilon' = (1 + 2\sqrt{d_F})\epsilon$, $\epsilon'_0 = (1 + 2\sqrt{d_F})\epsilon_0$, and $\Delta$ is a set of $2(d_F + 1)$ points in $B_F(\xi, 2(1 + \sqrt{d_F})\epsilon)$ only depending on $\xi$, $x$ and $\epsilon$. Since $\|\ell^T S(y)\|_F \le \epsilon_0^{-1}\, e^{\epsilon_0 \|\ell^T S(y)\|_F}$, we get that
\[
G_{\xi, \epsilon_0}(\ell) := \int_Y \left\| \ell^T S(y) \right\|_F \sup_{\xi' \in D_x(\xi, \epsilon_0)} k^{(o)}(\ell(\xi'), y)\, \nu(\mathrm dy)
\le \frac{2\, d_F}{\epsilon_0} \sum_{\xi'' \in \Delta} e^{\epsilon'_0 \|\nabla(A \circ \ell)(\xi'')\|_F} .
\]
In particular, by (C1) with Lemma 9, this bound holds for $\tau_0$-almost all $\ell \in Z$.
By (C3), since $\Delta$ is a finite set only depending on $\xi$, $x$ and $\epsilon$ and satisfying $\Delta \subseteq B_F(\xi, 2(1 + \sqrt{d_F})\epsilon) \subseteq \Upsilon(T)$, there exist $\epsilon_\star, C > 0$, only depending on $\xi$, $x$ and $\epsilon$, such that, for all $\xi'' \in \Delta$,
\[
\mathbb E_{\vartheta_0, \tau_0}\left[ e^{\epsilon_\star \|\nabla(A \circ L)(\xi'')\|_F} \,\middle|\, Y \right] \le C\, e^{s_{\vartheta_0, \tau_0}(Y)} \quad \mathbb P_{\vartheta_0, \tau_0}\text{-a.s.}
\]
Since $\Delta$ has cardinal $2(d_F + 1)$, taking $\epsilon_0 \in (0, \epsilon]$ small enough to have $\epsilon'_0 \le \epsilon_\star$, this gives us that
\[
g_{\xi, \epsilon_0}(Y) := \mathbb E_{\vartheta_0, \tau_0}\left[ G_{\xi, \epsilon_0}(L) \,\middle|\, Y \right] \le \frac{4\, C\, d_F (d_F + 1)}{\epsilon_0}\, e^{s_{\vartheta_0, \tau_0}(Y)} \quad \mathbb P_{\vartheta_0, \tau_0}\text{-a.s.}
\]
Now observing that the ratio in (52) is the conditional density of $L$ applied to $\ell$ given $Y = y$ under $\mathbb P_{\vartheta_0, \tau_0}$ and that the marginal distribution of $\mathbb P_{\vartheta_0, \tau_0}$ on $Y$ is equivalent to $\nu$, we obtain that
\[
\int_Z G_{\xi, \epsilon_0}(\ell)\, \bar\varphi(\ell)\, \tau_0(\mathrm d\ell)
= \int_Y \int_Z G_{\xi, \epsilon_0}(\ell)\, \varphi(\ell, y)\, \tau_0(\mathrm d\ell)\nu(\mathrm dy)
= \int_Y g_{\xi, \epsilon_0}(y)\, \check\varphi(y)\, \nu(\mathrm dy)
\le \frac{4\, C\, d_F (d_F + 1)}{\epsilon_0} \int_Y e^{s_{\vartheta_0, \tau_0}(y)}\, \check\varphi(y)\, \nu(\mathrm dy) .
\]
Hence, using condition (120), we deduce that
\[
\int_Z G_{\xi, \epsilon_0}(\ell)\, \bar\varphi(\ell)\, \tau_0(\mathrm d\ell) < \infty ,
\]
which is exactly the claimed bound (121). This bound gives in particular that
\[
\xi' \mapsto \int_Z \int_Y \ell^T S(y)\, k^{(o)}(\ell(\xi'), y)\, \bar\varphi(\ell)\, \nu(\mathrm dy)\tau_0(\mathrm d\ell)
\]
is well defined in $F$ for $\xi' \in D_x(\xi, \epsilon_0)$, and continuous at $\xi$ in direction $x$. Moreover, the Fubini theorem applies so that, (i) integrating first with respect to $y$ and (ii) using the expression of $\nabla A$ given in (32), we obtain that this mapping is in fact $\xi' \mapsto w_{\tau_0}(\xi')$. This being true for all $\xi \in \Upsilon(T)$ and $x \in F$ such that $\|x\|_F = 1$, we get the last claim by observing that, by (57), the expectation in (122) is another way to express the same integral in the case where $\xi = \Upsilon(\vartheta)$.

We can now prove Theorem 5.

Proof of Theorem 5 We use Proposition 7 with $\varphi(\ell, y)$ given by (52). We first prove that Conditions (106) and (107) hold in this setting, under the assumptions of Theorem 5. We use again that the ratio in (52) is the conditional density of $L$ applied to $\ell$ given $Y = y$ under $\mathbb P_{\vartheta_0, \tau_0}$, and that the marginal of $\mathbb P_{\vartheta_0, \tau_0}$ on $Y$ is equivalent to the dominating measure $\nu$. Hence we can write the integral in (106) as
\[
\int_Y \check\varphi(y)\, \|S(y)\|_E\, g(y)\, \nu(\mathrm dy) \quad \text{with} \quad g(Y) = \mathbb E_{\vartheta_0, \tau_0}\left[ \|L\|_{\mathrm{op}} \,\middle|\, Y \right] .
\]
By (C2) and (58), we obtain Condition (106).
Similarly, the integral in (107) reads, for all $\xi \in \Upsilon(T)$,
\[
\int_Y \check\varphi(y)\, g(y)\, \nu(\mathrm dy) \quad \text{with} \quad g(Y) = \mathbb E_{\vartheta_0, \tau_0}\left[ \|\nabla(A \circ L)(\xi)\|_F \,\middle|\, Y \right] .
\]
Clearly, for this $g$, (C3) gives that, for some $\epsilon, C > 0$, $g(y) \le C \epsilon^{-1}\, e^{s_{\vartheta_0, \tau_0}(y)}$ for $\nu$-a.e. $y$, and (58) implies (107). We can thus apply Proposition 7 and, to conclude, we only need to check that the conditions of Assertions (i) and (ii) of Proposition 7 also apply, that is, we need to show that $w_{\tau_0}$ is continuous on $\Upsilon(T)$ in every direction and that (109) holds for all $\xi \ne \xi' \in \Upsilon(T)$. The continuity of $w_{\tau_0}$ follows from Lemma 12, since (120) holds by (58) (and we obtain in passing (61), which is exactly (122)). Finally, Condition (109) follows from (C4), since in the setting of Theorem 5, $\bar\varphi(\ell) > 0$ for all $\ell \in Z$.

C.4.3 Proof of Corollary 10

The proof of Corollary 10 relies on some preliminary results. We first show in Proposition 8 below that Theorem 5 applies to a general mixture of Gaussian distributions (which notably includes the Student's t distribution family considered in Example 5).

Proposition 8 Let $\check M$ be a class of probability distributions on $(0, \infty)$ and suppose that, for all $\check\tau_0 \in \check M$, there exists $\epsilon_0 > 0$ such that
\[
\int_0^\infty e^{\epsilon_0 z}\, \check\tau_0(\mathrm dz) < \infty . \quad (123)
\]
Let $d \ge 1$ and define, for any $\vartheta = (m, \Sigma) \in T = \mathbb R^d \times \mathrm M_{>0}(d)$,
\[
\check k(\vartheta, y) = \int_0^\infty \mathcal N\left( y;\, m,\, z^{-1}\Sigma \right) \check\tau_0(\mathrm dz) , \quad (124)
\]
seen as a kernel density with respect to $\nu$ being the Lebesgue measure on $\mathbb R^d$. Then the following assertions hold.

(i) Let $k^{(1)}$ be the ELM family given by Definition 2 with $F = E = \mathbb R^d \times \mathbb R^{d \times d}$, $E_0 = T = \mathbb R^d \times \mathrm M_{>0}(d)$,
\[
\Upsilon(m, \Sigma) = (\Sigma^{-1} m,\, \Sigma^{-1}) , \qquad k^{(o)}(\xi, y) = \mathcal N(y;\, m,\, \Sigma) \ \text{ for all } \xi = \Upsilon(\vartheta) \text{ with } \vartheta = (m, \Sigma) , \quad (125)
\]
and $M$ the class of all push-forward probabilities $\tau_0$ of $\check\tau_0 \in \check M$ through the mapping $z \mapsto z\, \mathrm{Id}$ defined from $(0, \infty)$ to $\mathrm L(E)$, where $\mathrm{Id}$ is the identity operator on $E$. Then $M$ is a class of distributions on positive scalar operators on $E$ and, for all $\vartheta = (m, \Sigma) \in T$, we have $\check k(\vartheta, y) = k^{(1)}((\vartheta, \tau_0), y)$.
(ii) Setting, for all $y \in \mathbb R^d$ and $\vartheta_0 = (m_0, \Sigma_0) \in T$,
\[
q_{\vartheta_0}(y) = \frac{1}{2} (y - m_0)^T \Sigma_0^{-1} (y - m_0) , \quad (126)
\]
\[
g_{\check\tau_0}(u, v) = \int_0^\infty z^u\, e^{-v z}\, \check\tau_0(\mathrm dz) \quad \text{for all } u > -1 \text{ and } v > -\epsilon_0 , \quad (127)
\]
this ELM family satisfies (C1)-(C4) with
\[
m_{\vartheta_0, \tau_0}(y) = \frac{g_{\check\tau_0}(d/2 + 1,\, q_{\vartheta_0}(y))}{g_{\check\tau_0}(d/2,\, q_{\vartheta_0}(y))} , \quad (128)
\]
\[
s_{\vartheta_0, \tau_0}(y) = \log \frac{g_{\check\tau_0}(d/2,\, q_{\vartheta_0}(y) - \epsilon_0)}{g_{\check\tau_0}(d/2,\, q_{\vartheta_0}(y))} . \quad (129)
\]
(iii) Theorem 5 applies for any $\theta_0 = (\vartheta_0, \tau_0) \in T \times M$ and any probability density function $\check\varphi : \mathbb R^d \to \mathbb R_+$ satisfying
\[
\int_Y \check\varphi(y) \left[ \left( \|y\| + \|y\|^2 \right) m_{\vartheta_0, \tau_0}(y) + e^{s_{\vartheta_0, \tau_0}(y)} \right] \nu(\mathrm dy) < \infty , \quad (130)
\]
in which case the argmax problem (62) has a unique solution $\xi^* = \Upsilon(\vartheta^*)$ with $\vartheta^* = (m^*, \Sigma^*)$ given by
\[
m^* = \frac{\tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z\, \tilde Y]}{\tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z]} , \quad (131)
\]
\[
\Sigma^* = \tilde{\mathbb E}_{\vartheta_0, \tau_0}\left[ \tilde Z\, \tilde Y \tilde Y^T \right] - \frac{\tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z\, \tilde Y]\, \tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z\, \tilde Y]^T}{\tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z]} , \quad (132)
\]
where, under $\tilde{\mathbb P}_{\vartheta_0, \tau_0}$, $\tilde Z$ has density with respect to $\check\tau_0$ given by
\[
\tilde\varphi(z) = \int_Y \check\varphi(y)\, \frac{z^{d/2}\, e^{-z\, q_{\vartheta_0}(y)}}{g_{\check\tau_0}(d/2,\, q_{\vartheta_0}(y))}\, \mathrm dy \quad (133)
\]
and the conditional distribution of $\tilde Y$ given $\tilde Z = z$ has density
\[
\psi[y \mid z] = \frac{1}{1 + b}\, \frac{z^{d/2}\, e^{-z\, q_{\vartheta_0}(y)}\, \check\varphi(y)}{g_{\check\tau_0}(d/2,\, q_{\vartheta_0}(y))\, \tilde\varphi(z)} + \frac{b}{1 + b}\, \mathcal N\left( y;\, m_0,\, z^{-1}\Sigma_0 \right) . \quad (134)
\]
Proof We prove the claimed assertions successively.

Proof of Assertion (i) The spaces $F = E = \mathbb R^d \times \mathbb R^{d \times d}$ are endowed with the usual inner product $\langle (x, M), (x', M') \rangle_E = x^T x' + \mathrm{Tr}(M^T M')$. We have $\mathrm{Int}\, E_0 = E_0 = T = \mathbb R^d \times \mathrm M_{>0}(d)$, and we set, for all $y \in \mathbb R^d$, $S(y) = (y,\, -y y^T / 2)$ and, for all $(x, M) \in \mathrm{Int}\, E_0$,
\[
A(x, M) = \frac{1}{2} \left( x^T M^{-1} x - \log |M| \right) .
\]
In this setting, $\ell^T = \ell$ and $\ell = \|\ell\|_{\mathrm{op}}\, \mathrm{Id}$ for $\tau_0$-almost all $\ell \in Z$, and $\|L\|_{\mathrm{op}}$ has distribution $\check\tau_0$ for $L \sim \tau_0$. Then, for all $\vartheta = (m, \Sigma) \in T$ and $\tau_0$-almost all $\ell \in Z$, we have $\ell \circ \Upsilon(m, \Sigma) = \Upsilon(m,\, \Sigma / \|\ell\|_{\mathrm{op}})$. Consequently, for all $\vartheta \in T$, $y \in \mathbb R^d$ and $\tau_0$-almost all $\ell \in Z$, we have
\[
k^{(o)}(\ell \circ \Upsilon(\vartheta), y)
= \frac{1}{\left| 2\pi\, \|\ell\|_{\mathrm{op}}^{-1} \Sigma \right|^{1/2}}\, e^{-(y - m)^T (\|\ell\|_{\mathrm{op}}^{-1} \Sigma)^{-1} (y - m)/2}
= \mathcal N\left( y;\, m,\, \|\ell\|_{\mathrm{op}}^{-1} \Sigma \right) . \quad (135)
\]
Thus the obtained ELM family $k^{(1)}$ of Definition 2 satisfies $\check k(\vartheta, y) = k^{(1)}((\vartheta, \tau_0), y)$ for all $\vartheta = (m, \Sigma) \in T$.

Proof of Assertion (ii) Assumption (C1) is obvious in the above setting. We then check (C4), (C2) and (C3), successively. For all $\xi \ne \xi' \in \Upsilon(T)$ and $\tau_0 \in M$, we easily see that
\[
\tau_0\left( \left\{ \ell \in Z : \ell(\xi - \xi') = 0 \right\} \right) = \check\tau_0(\{0\}) = 0 ,
\]
and (C4) follows.
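In dimension $d = 1$, the canonical Gaussian parametrisation in (125) can be sanity-checked numerically: with $S(y) = (y, -y^2/2)$, $\Upsilon(m, \sigma^2) = (m/\sigma^2, 1/\sigma^2)$ and $A(x, M) = (x^2/M - \log M)/2$, the canonical kernel $k^{(o)}$ recovers the usual Gaussian density. A minimal Python sketch (the helper names are ours, purely illustrative):

```python
import math

def gaussian_pdf(y, m, s2):
    # Ordinary N(y; m, sigma^2) density.
    return math.exp(-(y - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def k_o(xi, y):
    # Canonical kernel of (125) in dimension d = 1:
    # k^(o)(xi, y) = h(y) exp(<S(y), xi> - A(xi)), with h(y) = (2*pi)^(-1/2),
    # S(y) = (y, -y^2/2) and A(x, M) = (x^2 / M - log M) / 2.
    x, M = xi
    A = 0.5 * (x * x / M - math.log(M))
    inner = y * x - 0.5 * y * y * M  # <S(y), xi> with S(y) = (y, -y^2/2)
    return math.exp(inner - A) / math.sqrt(2 * math.pi)

m, s2, y = 0.7, 2.3, -1.1
xi = (m / s2, 1.0 / s2)  # Upsilon(m, sigma^2) = (m/sigma^2, 1/sigma^2)
assert abs(k_o(xi, y) - gaussian_pdf(y, m, s2)) < 1e-12
```

The assertion holds because the exponent $\langle S(y), \xi\rangle - A(\xi)$ collapses to $-(y-m)^2/(2\sigma^2) - \tfrac12\log\sigma^2$, exactly the Gaussian log-density up to the base measure $h$.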
To check (C2), we compute, for all $\vartheta_0 = (m_0, \Sigma_0) \in T$ and $\tau_0 \in M$,
\[
\mathbb E_{\vartheta_0, \tau_0}\left[ \|L\|_{\mathrm{op}} \,\middle|\, Y \right]
= \int_0^\infty z\, \frac{\mathcal N(Y;\, m_0,\, z^{-1}\Sigma_0)}{\check k(\vartheta_0, Y)}\, \check\tau_0(\mathrm dz)
= \frac{\int_0^\infty z^{d/2 + 1}\, e^{-z\, q_{\vartheta_0}(Y)}\, \check\tau_0(\mathrm dz)}{\int_0^\infty z^{d/2}\, e^{-z\, q_{\vartheta_0}(Y)}\, \check\tau_0(\mathrm dz)} ,
\]
where we successively used (57) with (135) and (126). Hence (C2) follows from (127) and (128). Let us now check (C3). Straightforward computations yield, for all $\xi = (x, M) \in \Upsilon(T)$ and $\tau_0$-almost all $\ell \in Z$,
\[
\nabla A(\ell(\xi)) = \left( M^{-1} x,\; -\tfrac{1}{2}\left( M^{-1} x x^T M^{-1} + \|\ell\|_{\mathrm{op}}^{-1} M^{-1} \right) \right) ,
\]
\[
\nabla(A \circ \ell)(\xi) = \left( \|\ell\|_{\mathrm{op}}\, M^{-1} x,\; -\tfrac{1}{2}\left( \|\ell\|_{\mathrm{op}}\, M^{-1} x x^T M^{-1} + M^{-1} \right) \right) ,
\]
\[
\left\| \nabla(A \circ \ell)(\xi) \right\|_F \le \left( 1 + \|\ell\|_{\mathrm{op}} \right) \left\| \left( M^{-1} x,\; M^{-1} x x^T M^{-1} + M^{-1} \right) \right\|_F .
\]
It follows that, for all $\vartheta_0 = (m_0, \Sigma_0) \in T$, $\tau_0 \in M$, $\xi = (x, M) \in \Upsilon(T)$ and $\epsilon > 0$,
\[
\mathbb E_{\vartheta_0, \tau_0}\left[ e^{\epsilon \|\nabla(A \circ L)(\xi)\|_F} \,\middle|\, Y \right]
\le C_1\, \mathbb E_{\vartheta_0, \tau_0}\left[ e^{C_2\, \epsilon\, \|L\|_{\mathrm{op}}} \,\middle|\, Y \right]
= C_1\, \frac{g_{\check\tau_0}(d/2,\, q_{\vartheta_0}(Y) - C_2 \epsilon)}{g_{\check\tau_0}(d/2,\, q_{\vartheta_0}(Y))} \quad \mathbb P_{\vartheta_0, \tau_0}\text{-a.s.} ,
\]
where $C_1, C_2 > 0$ only depend on $\xi$. Hence, since we assumed (123), the definition of $g_{\check\tau_0}$ in (127) holds for all $v > -\epsilon_0$, and (C3) holds with $s_{\vartheta_0, \tau_0}$ as in (129) by taking $\epsilon < \epsilon_0 / C_2$.

Proof of Assertion (iii) From the previous assertion, we can apply Theorem 5 and use (66) to solve the argmax problem (62) for a density $\check\varphi$ with respect to the Lebesgue measure on $\mathbb R^d$ satisfying (58), which, since $S(y) = (y,\, -y y^T / 2)$ and
\[
\|S(y)\|_E = \left( \|y\|^2 + \tfrac{1}{4} \mathrm{Tr}(y y^T y y^T) \right)^{1/2} = O\left( \|y\| + \|y\|^2 \right) ,
\]
is equivalent to (130). Let us compute both sides of the equation (66), which characterizes the solution $\xi^* = \Upsilon(\vartheta^*)$. The left-hand side of (66) reads
\[
\mathbb E_{\vartheta^*, \tilde\tau_0}\left[ L^T S(Y) \right] = \mathbb E_{\vartheta^*, \tilde\tau_0}\left[ Z\, S(Y) \right] ,
\]
where $Z = \|L\|_{\mathrm{op}}$ has density with respect to $\check\tau_0$ given by (133). The right-hand side of (66) reads
\[
\tilde{\mathbb E}_{\vartheta_0, \tau_0}\left[ L^T S(\tilde Y) \right] = \tilde{\mathbb E}_{\vartheta_0, \tau_0}\left[ \tilde Z\, S(\tilde Y) \right] ,
\]
where $\tilde Z$ is distributed as $Z$ above and the conditional distribution of $\tilde Y$ given $\tilde Z = z$ has density given by (134). Finally, the solution $\xi^* = \Upsilon(\vartheta^*)$ is given by the equation
\[
\mathbb E_{\vartheta^*, \tilde\tau_0}\left[ Z\, S(Y) \right] = \tilde{\mathbb E}_{\vartheta_0, \tau_0}\left[ \tilde Z\, S(\tilde Y) \right] \quad (136)
\]
and this equation leads to the solutions (131) and (132).
Indeed, note that
\[
\mathbb E_{\vartheta^*, \tilde\tau_0}\left[ Y \,\middle|\, Z \right] = m^*
\quad \text{and} \quad
\mathbb E_{\vartheta^*, \tilde\tau_0}\left[ Y Y^T \,\middle|\, Z \right] - m^* (m^*)^T = \Sigma^* / Z . \quad (137)
\]
Applying the first identity in (137), then the tower property and the identity (136) applied to the first component of $S(Y) = (Y,\, -Y Y^T / 2)$, we get
\[
m^* = \frac{\mathbb E_{\vartheta^*, \tilde\tau_0}[Z\, m^*]}{\mathbb E_{\vartheta^*, \tilde\tau_0}[Z]}
= \frac{\mathbb E_{\vartheta^*, \tilde\tau_0}[Z\, Y]}{\mathbb E_{\vartheta^*, \tilde\tau_0}[Z]}
= \frac{\tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z\, \tilde Y]}{\mathbb E_{\vartheta^*, \tilde\tau_0}[Z]}
= \frac{\tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z\, \tilde Y]}{\tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z]} ,
\]
where the last equality follows from the fact that, under $\mathbb P_{\vartheta^*, \tilde\tau_0}$ or $\tilde{\mathbb P}_{\vartheta_0, \tau_0}$, $Z$ (respectively $\tilde Z$) has the density $\tilde\varphi$ with respect to $\check\tau_0$. Similarly, using the second identity in (137), then the tower property and the identity (136) applied to the second component of $S(Y) = (Y,\, -Y Y^T / 2)$, we get
\[
\Sigma^* = \mathbb E_{\vartheta^*, \tilde\tau_0}\left[ Z\, (\Sigma^* / Z) \right]
= \mathbb E_{\vartheta^*, \tilde\tau_0}[Z\, Y Y^T] - \mathbb E_{\vartheta^*, \tilde\tau_0}[Z]\, m^* (m^*)^T
= \tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z\, \tilde Y \tilde Y^T] - \tilde{\mathbb E}_{\vartheta_0, \tau_0}[\tilde Z]\, m^* (m^*)^T .
\]
This proves (131) and (132).

Remark 5 The probability density function (124) is called an elliptical distribution. In particular, if $\check\tau_0$ is the $\chi^2$ distribution with $a_0$ degrees of freedom rescaled by $1/a_0$ (that is, the Gamma$(a_0/2, a_0/2)$ distribution), then (124) is the density of the Student's t distribution with parameter $(m, \Sigma, a_0)$. In this case we have
\[
g_{\check\tau_0}(u, v) = \frac{(a_0/2)^{a_0/2}\, \Gamma(a_0/2 + u)}{(a_0/2 + v)^{a_0/2 + u}\, \Gamma(a_0/2)} ,
\]
and $m_{\vartheta_0, \tau_0}$ and $s_{\vartheta_0, \tau_0}$ in (128) and (129) read, for any $\epsilon_0 \in (0, a_0/2)$,
\[
m_{\vartheta_0, \tau_0}(y) = \frac{a_0/2 + d/2}{a_0/2 + q_{\vartheta_0}(y)} = O(1) , \qquad
s_{\vartheta_0, \tau_0}(y) = (a_0/2 + d/2) \log \frac{a_0/2 + q_{\vartheta_0}(y)}{a_0/2 - \epsilon_0 + q_{\vartheta_0}(y)} = O(1) ,
\]
so that Condition (130) boils down to $\int_Y \check\varphi(y)\, \|y\|^2\, \nu(\mathrm dy) < \infty$.

The following lemma is used in Example 5.

Lemma 13 Define, for all $x \in (0, \infty)$, $\kappa(x) = \log(x) + \Gamma'(x)/\Gamma(x)$. Then $\kappa$ is increasing and bijective from $(0, \infty)$ to $\mathbb R$.

Proof This result follows from the fact that the digamma function $x \mapsto \Gamma'(x)/\Gamma(x)$ is increasing from $(0, \infty)$ to $\mathbb R$.

Proof of Corollary 10 We apply Theorem 2 in the setting of Example 5. Condition (18) holds by Theorem 3 and the update (i) in Example 5. The remainder of the proof consists in proving Condition (19) for given $n \ge 1$ and $j = 1, \dots, J$.
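The closed form of $g_{\check\tau_0}$ stated in Remark 5 can be verified numerically against a direct quadrature, assuming $\check\tau_0$ is the Gamma$(a_0/2, a_0/2)$ distribution as discussed in the remark. A minimal pure-Python sketch (the grid size and test point are arbitrary implementation choices):

```python
import math

def g_closed(a, u, v):
    # Closed form of g(u, v) = E[Z^u e^{-vZ}] for Z ~ Gamma(a/2, rate a/2),
    # as in Remark 5: (a/2)^{a/2} Gamma(a/2 + u) / ((a/2 + v)^{a/2 + u} Gamma(a/2)).
    h = a / 2.0
    return math.exp(h * math.log(h) + math.lgamma(h + u)
                    - (h + u) * math.log(h + v) - math.lgamma(h))

def g_numeric(a, u, v, zmax=60.0, n=600_000):
    # Riemann-sum quadrature of z^u e^{-vz} against the Gamma(a/2, a/2) density.
    # Valid for u + a/2 > 1 (the integrand then vanishes at z = 0) and v > -a/2.
    h = a / 2.0
    logc = h * math.log(h) - math.lgamma(h)  # log normalising constant
    dz = zmax / n
    total = 0.0
    for i in range(1, n + 1):
        z = i * dz
        total += math.exp(logc + (u + h - 1.0) * math.log(z) - (v + h) * z)
    return total * dz
```

With, say, $a_0 = 4$, $u = 3/2$ and $v = 1/2$, the quadrature and the closed form agree to within the discretisation error of the grid, which is the consistency check the remark's formula calls for.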
We apply Proposition 8, whose Assertion (i) with Remark 5 provides an ELM family $k^{(1)}$ such that, for any $\vartheta = (m, \Sigma) \in \mathbb R^d \times \mathrm M_{>0}(d)$ and $a > 0$, $\check k(\vartheta, y)$ in (124) with $\check\tau_0 = \check\tau_a$ given by (73) is $k^{(1)}((\vartheta, \tau_0), y)$, but also $k(\theta, y)$ in Example 5 for $\theta = (m, \Sigma, a)$. We will apply Proposition 4 to this ELM family with $\check\varphi = \check\varphi^{(\alpha)}_{j,n}$, $\theta = (\vartheta, \tau)$ defined with $\vartheta = (m_{j,n+1}, \Sigma_{j,n+1})$ and $\tau$ the measure on $\mathrm L(E)$ obtained by pushing forward $\check\tau_{a_{j,n+1}}$ through the mapping $z \mapsto z\, \mathrm{Id}$, and $\theta_0 = (\vartheta_0, \tau_0)$ defined similarly but using the parameters $(m_{j,n}, \Sigma_{j,n}, a_{j,n})$. Then conclusion (49) of Proposition 4 is exactly Condition (19) with $k$ as in Example 5. Hence, to conclude the proof, we only need to show that Proposition 4 applies in this setting. Namely, we need to prove (50) and (51), which respectively read
\[
\int_0^\infty \int_Y \varphi_{j,n}(z, y) \log \frac{\mathrm d\check\tau_{a_{j,n+1}}}{\mathrm d\check\tau_{a_{j,n}}}(z)\, \check\tau_{a_{j,n}}(\mathrm dz)\, \mathrm dy \ge 0 , \quad (138)
\]
\[
\int_Z \int_Y \varphi_{j,n}(\|\ell\|_{\mathrm{op}}, y) \log \frac{k^{(o)}(\ell \circ \Upsilon(\vartheta), y)}{k^{(o)}(\ell \circ \Upsilon(\vartheta_0), y)}\, \tau_0(\mathrm d\ell)\, \mathrm dy \ge 0 , \quad (139)
\]
where $\varphi_{j,n}$ is defined by (71) and $k^{(o)}$ is the Gaussian canonical kernel defined by (125). We thus conclude with the proof of these two conditions, in the opposite order.

Proof of (139) Assertion (ii) of Proposition 8 with Remark 5 gives that the ELM family $k^{(1)}$ satisfies (C1)-(C4) with $m_{\vartheta_0, \tau_0}$ and $s_{\vartheta_0, \tau_0}$ such that Condition (130) boils down to $\int \check\varphi(y)\, \|y\|^2\, \mathrm d\nu < \infty$. This latter condition is satisfied with $\check\varphi = \check\varphi^{(\alpha)}_{j,n}$ as a consequence of (67) by using Lemma 8. Therefore Condition (130) holds and Assertion (iii) of Proposition 8 provides the solution of the argmax problem (62). A careful comparison of (131) and (132) with (68) and (69) tells us that $\vartheta^* = (m_{j,n+1}, \Sigma_{j,n+1})$ is the unique argmax of (62). Applying Corollary 9 in this setting finally gives us (139).
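Lemma 13 guarantees that $\kappa(x) = \log(x) + \Gamma'(x)/\Gamma(x)$ has a well-defined inverse on all of $\mathbb R$, which is what a numerical implementation of the shape update in Step (ii) of Example 5 relies on. A minimal pure-Python sketch (the digamma approximation and the bisection bracket are our implementation choices, not from the paper):

```python
import math

def digamma(x):
    # Gamma'(x)/Gamma(x), computed via the recurrence psi(x) = psi(x + 1) - 1/x
    # followed by an asymptotic expansion; accurate to roughly 1e-9 for x > 0.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def kappa(x):
    # kappa(x) = log(x) + digamma(x): increasing bijection (0, inf) -> R (Lemma 13).
    return math.log(x) + digamma(x)

def kappa_inv(c, lo=1e-12, hi=1e12, tol=1e-12):
    # Invert kappa by bisection: kappa is increasing, so the bracket can be halved.
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if kappa(mid) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since $\kappa$ is strictly increasing, bisection converges unconditionally for any target value in the chosen bracket; a root-bracketing solver such as Brent's method would work equally well.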
Proof of (138) Since the left-hand side of (138) is zero for $a = a_{j,n}$, it suffices to show that $a_{j,n+1}$ in (70) satisfies
\[
a_{j,n+1} = \mathop{\mathrm{argmax}}_{a \in (0, \infty)} \int_0^\infty \int_Y \varphi_{j,n}(z, y) \log \frac{\mathrm d\check\tau_a}{\mathrm d\check\tau_{a_{j,n}}}(z)\, \check\tau_{a_{j,n}}(\mathrm dz)\, \mathrm dy . \quad (140)
\]
By (73), we have, for all $z > 0$,
\[
\log \frac{\mathrm d\check\tau_a}{\mathrm d\check\tau_{a_{j,n}}}(z) = \frac{a}{2} \log\left( \frac{a}{2} \right) - \log \Gamma\left( \frac{a}{2} \right) + \frac{a}{2} \left( \log z - z \right) + C_{j,n}(z) ,
\]
where $C_{j,n}(z)$ does not depend on $a$. The derivative of $x \mapsto x \log(x) - x + \log \Gamma(x)$ is $\kappa$ as defined in Step (ii) of Example 5 and in Lemma 13. It is easy to check that $\int_0^\infty \int_Y \varphi_{j,n}(z, y)\, \check\tau_{a_{j,n}}(\mathrm dz)\, \mathrm dy = \int_Y \check\varphi^{(\alpha)}_{j,n}(y)\, \mathrm dy = 1$. Thus (140) follows from (70) with Lemma 13 if we can show that
\[
\int_0^\infty \int_Y \varphi_{j,n}(z, y) \left( |\log z| + z \right) \check\tau_{a_{j,n}}(\mathrm dz)\, \mathrm dy < \infty .
\]
In fact, we will show that there exists $\epsilon > 0$ such that, for all $t \in (-\epsilon, 1]$,
\[
\int_0^\infty \int_Y \varphi_{j,n}(z, y)\, z^t\, \check\tau_{a_{j,n}}(\mathrm dz)\, \mathrm dy < \infty , \quad (141)
\]
which indeed implies the previous display. By definition of $\varphi_{j,n}$ in (71), we find that (141) is implied by
\[
I(t) := \int_0^\infty \int_Y \check\varphi^{(\alpha)}_{j,n}(y)\, z^{t + d/2}\, e^{-z\, q_{j,n}(y)} \left( 1 + \frac{2}{a_{j,n}}\, q_{j,n}(y) \right)^{(a_{j,n} + d)/2} \check\tau_{a_{j,n}}(\mathrm dz)\, \mathrm dy < \infty ,
\]
where we set $q_{j,n}(y) = \frac{1}{2} (y - m_{j,n})^T \Sigma_{j,n}^{-1} (y - m_{j,n})$. Since, as in Remark 5, for all $t > -d/2$,
\[
\int_0^\infty z^{t + d/2}\, e^{-z\, q_{j,n}(y)}\, \check\tau_{a_{j,n}}(\mathrm dz)
= \frac{(a_{j,n}/2)^{a_{j,n}/2}\, \Gamma(a_{j,n}/2 + t + d/2)}{(a_{j,n}/2 + q_{j,n}(y))^{a_{j,n}/2 + t + d/2}\, \Gamma(a_{j,n}/2)} ,
\]
we get that, for all $t > -d/2$, there exists $C'_{j,n}(t) > 0$ such that
\[
I(t) = C'_{j,n}(t) \int_Y \check\varphi^{(\alpha)}_{j,n}(y) \left( 1 + \frac{2}{a_{j,n}}\, q_{j,n}(y) \right)^{-t} \mathrm dy .
\]
Since $q_{j,n}(y)$ is a quadratic form and we already checked that $\int \check\varphi^{(\alpha)}_{j,n}(y)\, (1 + \|y\|^2)\, \mathrm d\nu < \infty$ in the proof of (139), we finally find that $I(t) < \infty$ if $t > -d/2$ and $t \ge -1$, which concludes the proof of (141).

Appendix D. Additional Numerical Experiments

In this section we provide further plots based on the numerical experiments from Section 8.

Figure 5: Error bounds for the Monte Carlo estimate of the VR Bound in the MG-IS-n approach (fixed mixture weights) when considering each of the target distributions (i), (ii) and (iii), with J = 10 and J = 50.
Figure 6: Error bounds for the Monte Carlo estimate of the VR Bound in the RGD-IS-n approach (fixed mixture weights) when considering each of the target distributions (i), (ii) and (iii), with J = 10 and J = 50.

                               J = 10                         J = 50
                      γ = 0.1  γ = 0.5  γ = 1.0     γ = 0.1  γ = 0.5  γ = 1.0
(i)   RGD-IS-n(γ)       0.372    0.510    0.384      -0.616   -0.713   -0.778
      MG-IS-n(γ)        1.104    1.074    0.387       1.135   -0.077   -0.060
      RGD-IS-unif(γ)    0.359    0.469    0.458      -0.688   -0.670   -0.583
      MG-IS-unif(γ)    -0.200   -0.229   -0.515      -1.500   -1.462   -1.246
(ii)  RGD-IS-n(γ)      -0.025   -0.056   -0.087      -1.027   -0.997   -0.969
      MG-IS-n(γ)       -0.270   -0.126    0.235      -0.269   -0.417   -0.487
      RGD-IS-unif(γ)   -0.121   -0.111    0.052      -1.097   -0.966   -0.883
      MG-IS-unif(γ)    -1.120   -0.938   -0.957      -1.764   -1.889   -1.192
(iii) RGD-IS-n(γ)      -0.329   -0.197   -0.238      -1.691   -1.612   -1.637
      MG-IS-n(γ)        1.101    0.758    0.524       0.181   -0.181    0.893
      RGD-IS-unif(γ)   -0.370   -0.224   -0.212      -1.708   -1.627   -1.649
      MG-IS-unif(γ)    -1.211   -1.313   -1.083      -2.013   -1.882   -0.491

Table 3: Log MSE for the RGD and the MG approaches (η = 0.1) when considering each of the target distributions (i), (ii) and (iii).

References

Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 48(3):1475-1497, 2020. doi: 10.1214/19-AOS1855. URL https://doi.org/10.1214/19-AOS1855.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Alexander Buchholz, Florian Wenzel, and Stephan Mandt. Quasi-Monte Carlo variational inference. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research,
pages 668-677. PMLR, 10-15 Jul 2018. URL https://proceedings.mlr.press/v80/buchholz18a.html.

                               J = 10                          J = 50
                      η = 0.05  η = 0.1  η = 0.5     η = 0.05  η = 0.1  η = 0.5
(i)   RGD-IS-n(γ)        0.045    0.510    1.299       -1.355   -0.713    0.924
      MG-IS-n(γ)         0.087    1.074    1.343       -1.205   -0.077    1.329
      RGD-IS-unif(γ)    -0.018    0.469    1.328       -1.385   -0.670    0.928
      MG-IS-unif(γ)     -1.244   -0.229    1.100       -2.524   -1.462    0.309
(ii)  RGD-IS-n(γ)       -0.096   -0.056    0.522       -1.509   -0.997    0.542
      MG-IS-n(γ)        -0.629   -0.126    0.100       -1.430   -0.417    0.348
      RGD-IS-unif(γ)    -0.195   -0.111    0.489       -1.542   -0.966    0.529
      MG-IS-unif(γ)     -1.814   -0.938   -0.149       -1.711   -1.889   -0.282
(iii) RGD-IS-n(γ)       -0.091   -0.197    0.339       -1.596   -1.612   -0.282
      MG-IS-n(γ)        -0.772    0.758    1.358       -0.878   -0.181    0.927
      RGD-IS-unif(γ)    -0.113   -0.224    0.322       -1.611   -1.627   -0.300
      MG-IS-unif(γ)     -1.608   -1.313   -0.253       -1.879   -1.882   -0.716

Table 4: Log MSE for the RGD and the MG approaches (γ = 0.5) when considering each of the target distributions (i), (ii) and (iii).

Figure 7: Monte Carlo estimate of the VR Bound for the RGD and the MG approaches (η = 0.05) when considering each of the target distributions (i), (ii) and (iii), with J = 10 and J = 50.

Trevor Campbell and Xinglong Li. Universal boosting variational inference. In Advances in Neural Information Processing Systems, volume 32, 2019.

Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447-459, 2008.

Andrzej Cichocki and Shun-ichi Amari. Families of alpha-, beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy, 12(6):1532-1568, 2010.

Kamélia Daudel and Randal Douc. Mixture weights optimisation for alpha-divergence variational inference. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S.
Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 4397-4408. Curran Associates, Inc., 2021.

Figure 8: Monte Carlo estimate of the VR Bound for the RGD and the MG approaches (η = 0.5) when considering each of the target distributions (i), (ii) and (iii), with J = 10 and J = 50.

Kamélia Daudel, Randal Douc, and François Portier. Infinite-dimensional gradient-based descent for alpha-divergence minimisation. The Annals of Statistics, 49(4):2250-2270, 2021.

Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems 30, pages 2732-2741, 2017.

Justin Domke. Provable smoothness guarantees for black-box variational inference. In International Conference on Machine Learning, pages 2587-2596. PMLR, 2020.

Figure 9: Monte Carlo estimate of the VR Bound for the RGD and the MG approaches when considering the Bayesian Logistic Regression on the Covertype data set, with J = 50 and J = 100.

Samuel Gershman, Matt Hoffman, and David Blei. Nonparametric variational inference. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.

José Hernández-Lobato, Yingzhen Li, Mark Rowland, Thang Bui, Daniel Hernández-Lobato, and Richard Turner. Black-box alpha divergence minimization. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1511-1520, New York, USA, 2016.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(4):1303-1347, 2013.

Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University of London, UK, 2003.
Ghassen Jerfel, Serena Wang, Clara Fannjiang, Katherine A. Heller, Yian Ma, and Michael I. Jordan. Variational refinement for importance sampling using the forward Kullback-Leibler divergence. In 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021), forthcoming, 2021.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.

Yingzhen Li and Richard E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems 29, pages 1073-1081, 2016.

Tom Minka. Power EP. Technical Report MSR-TR-2004-149, Microsoft Research Ltd, Cambridge, UK, 2004.

Tom Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research Ltd., Cambridge, UK, 2005.

Dennis Prangle. Distilling importance sampling. Available on arXiv:1910.03632v3, last accessed on 10 Sep 2021, 2021.

Rajesh Ranganath, Sean Gerrish, and David Blei. Black Box Variational Inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814-822, Reykjavik, Iceland, 2014. PMLR.

Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 547-561, Berkeley, Calif., 1961. University of California Press.

Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.

Dilin Wang, Hao Liu, and Qiang Liu. Variational inference with tail-adaptive f-divergence.
In Advances in Neural Information Processing Systems 31, pages 5737-5747, 2018.

Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Yes, but did it work?: Evaluating variational inference. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5581-5590, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008-2026, 2019.

Huaiyu Zhu and Richard Rohwer. Information geometric measurements of generalisation. Technical Report NCRG/4350, Neural Computing Research Group, Dept. of Computer Science and Applied Mathematics, Aston University, Aston Triangle, UK, 1995a.

Huaiyu Zhu and Richard Rohwer. Bayesian invariant measurements of generalization. Neural Processing Letters, 2:28-31, 1995b.